
High Availability

This guide explains how to deploy the AtlasAI Tenant Plane (TP) in a highly available configuration for production on-prem and BYOC environments. For SaaS and Dedicated TP customers, AtlasAI manages HA automatically — this guide is for self-hosted deployments.


How TP handles availability

The Tenant Plane is designed to be stateless. This means:

  • All persistent data lives in PostgreSQL; rate-limit counters can optionally live in an external Redis
  • Authentication uses JWT tokens that are verified locally on each request — no session store or sticky sessions needed
  • Any replica can serve any request — the load balancer does not need session affinity
  • You can run as many replicas as needed and scale them up or down without downtime

Because TP is stateless, high availability simply means: run at least 2 replicas behind a load balancer, connected to a highly available database.


Minimum HA requirements

Component     | Minimum for HA                             | Recommended
TP replicas   | 2                                          | 3+ (across multiple nodes)
Database      | PostgreSQL with read replica               | PostgreSQL with streaming replication + auto-failover (Patroni, Aurora, Cloud SQL)
Load balancer | Any TCP/HTTP LB (nginx, HAProxy, ALB)      | Application load balancer with health checks
Redis         | Optional (improves rate-limit consistency) | Redis Sentinel or cluster for HA rate limiting

Architecture overview

      Internet / Internal Network
      ┌────────────────────────┐
      │     Load Balancer      │
      │  (ALB / nginx / k8s)   │
      │  Health: /api/health   │
      └────────┬───────────────┘
           ┌───┴────┐
           │        │
           ▼        ▼
      ┌────────┐  ┌────────┐
      │ TP #1  │  │ TP #2  │   (add more replicas as needed)
      └───┬────┘  └───┬────┘
          │           │
          └─────┬─────┘
                │
     ┌──────────▼─────────┐
     │  PostgreSQL (HA)   │
     │  Primary + Replica │
     └────────────────────┘
     ┌────────────────────┐
     │  Redis (optional)  │
     │  Rate limit cache  │
     └────────────────────┘

Option 1: Docker Compose with multiple replicas

For on-prem deployments not using Kubernetes, use Docker Compose with a reverse proxy.

Step 1: Configure your .env

# Identity
TENANT_ID=acme-corp
JWT_SECRET=<generate: openssl rand -hex 32>
ENCRYPTION_KEY=<generate: openssl rand -hex 32>

# Database (shared between all replicas)
TENANT_PLANE_DATABASE_URL=postgresql://atlasusr:password@db:5432/atlas

# Optional: Redis for rate limiting consistency across replicas
REDIS_URL=redis://redis:6379

# License (on-prem)
ATLASAI_LICENSE_KEY=eyJhbGciOiJSUzI1NiJ9...
CP_LICENSE_PUBLIC_KEY=-----BEGIN PUBLIC KEY-----\n...\n-----END PUBLIC KEY-----
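The two secret values above can be generated with openssl, as the placeholders suggest. A minimal sketch, including a length sanity check (32 random bytes hex-encode to 64 characters):

```shell
# Generate the two 32-byte secrets referenced above (requires openssl).
JWT_SECRET=$(openssl rand -hex 32)
ENCRYPTION_KEY=$(openssl rand -hex 32)

# Each value should be 64 hex characters (32 random bytes).
echo "JWT_SECRET length: ${#JWT_SECRET}"
```

Generate these once and reuse the same values for every replica; regenerating them per replica would break authentication across the fleet.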

Step 2: Scale using Docker Compose

# docker-compose.yml
services:
  tenant-plane:
    image: atlasai/tenant-plane:1.3.0
    env_file: .env
    deploy:
      replicas: 3   # run 3 instances
    healthcheck:
      test: ["CMD", "wget", "-q", "--spider", "http://localhost:3000/api/health"]
      interval: 30s
      timeout: 5s
      retries: 3

  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - tenant-plane

Step 3: Configure nginx as load balancer

# nginx.conf
upstream tenant_plane {
    least_conn;
    server tenant-plane:3000;
    keepalive 32;
}

server {
    listen 80;
    server_name atlas.yourdomain.com;

    location /api/health {
        proxy_pass http://tenant_plane;
        access_log off;
    }

    location / {
        proxy_pass http://tenant_plane;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_http_version 1.1;
        proxy_read_timeout 90s;
    }
}

Start everything:

docker compose up -d --scale tenant-plane=3

Verify all replicas are healthy:

docker compose ps
# Should show 3 tenant-plane instances, all "Up (healthy)"

Option 2: Kubernetes with Helm

For Kubernetes deployments, install the Helm chart with multiple replicas and the HA settings shown below.

Step 1: Prepare your values file

Create a values-production.yaml file:

# Number of replicas — minimum 2 for HA, 3+ recommended
replicaCount: 3

image:
  repository: atlasai/tenant-plane
  tag: "1.3.0"
  pullPolicy: IfNotPresent

# Database — REQUIRED for HA (cannot use SQLite with multiple replicas)
db:
  url: "postgresql://atlasusr:password@rds.internal:5432/atlas"
  poolMin: 2
  poolMax: 10

# Authentication secrets — must be identical on all replicas
auth:
  jwtSecret: "your-32-char-secret-here"
  encryptionKey: "your-32-char-encryption-key"

# Tenant identity
tenant:
  id: "acme-corp"

# ─── High Availability settings ──────────────────────────────────────────────
ha:
  # Spread replicas across different nodes (preferred, not required)
  podAntiAffinity: true

  # Ensure at least 1 pod is always running during updates or node drains
  podDisruptionBudget:
    enabled: true
    minAvailable: 1

  # Spread replicas across availability zones
  topologySpreadConstraints: true

# Auto-scaling: scale based on CPU/memory load
autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 20
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 80

# Resource requests and limits per replica
resources:
  requests:
    cpu: 250m
    memory: 512Mi
  limits:
    cpu: "1"
    memory: 1Gi

# License (for on-prem/BYOC)
license:
  publicKey: "-----BEGIN PUBLIC KEY-----\n...\n-----END PUBLIC KEY-----"
  key: "eyJhbGciOiJSUzI1NiJ9..."

Step 2: Install

helm install atlasai-tp deploy/helm/atlasai-tp \
  -f values-production.yaml \
  --namespace atlasai \
  --create-namespace \
  --wait \
  --timeout 5m

Step 3: Verify

# Check all pods are running
kubectl get pods -n atlasai -l app=atlasai-tp

# Check the PodDisruptionBudget is in place
kubectl get pdb -n atlasai

# Check horizontal autoscaler
kubectl get hpa -n atlasai

# Hit the health endpoint through the service
kubectl run test --rm -it --image=curlimages/curl --restart=Never -n atlasai -- \
  curl -s http://atlasai-tp/api/health

Expected pod output:

NAME                         READY   STATUS    RESTARTS   AGE
atlasai-tp-7d8f9b4c5-2xqmn   1/1     Running   0          5m
atlasai-tp-7d8f9b4c5-6bktv   1/1     Running   0          5m
atlasai-tp-7d8f9b4c5-r9pvw   1/1     Running   0          5m

Database high availability

Why PostgreSQL is required for HA

SQLite is a single-file embedded database and cannot be safely shared across replicas. For multi-replica deployments you must use PostgreSQL.

Set the Postgres connection string:

TENANT_PLANE_DATABASE_URL=postgresql://username:password@hostname:5432/dbname
Platform    | Service                        | Notes
AWS         | Amazon RDS Aurora PostgreSQL   | Automatic failover, up to 15 read replicas
GCP         | Cloud SQL for PostgreSQL       | HA with automatic failover
Azure       | Azure Database for PostgreSQL  | Zone-redundant HA
Self-hosted | Patroni + etcd + HAProxy       | Open-source HA stack
Self-hosted | Postgres streaming replication | Manual failover, simpler setup

All of these work with AtlasAI. The Tenant Plane uses standard PostgreSQL wire protocol — no special extensions required beyond pgvector for AI search features.

Database connection pooling

AtlasAI automatically pools database connections. Configure the pool size per replica:

DB_POOL_MIN=2             # Minimum connections kept open (default: 2)
DB_POOL_MAX=10            # Maximum connections per replica (default: 10)
DB_CONNECT_TIMEOUT=5000   # Connection timeout in ms (default: 5000)
DB_IDLE_TIMEOUT=30000     # Idle connection timeout in ms (default: 30000)

Total connections formula: replicas × DB_POOL_MAX

Example: 3 replicas × 10 connections = 30 max connections to Postgres. Size your Postgres max_connections accordingly (default is usually 100).
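The arithmetic above can be scripted as a quick sizing check before changing max_connections. A minimal sketch; the replica count and the 80% headroom threshold are illustrative assumptions, not product defaults:

```shell
# Sketch: compare total pooled connections against Postgres max_connections.
REPLICAS=3
DB_POOL_MAX=10
TOTAL=$((REPLICAS * DB_POOL_MAX))
echo "Tenant Plane replicas may open up to $TOTAL connections"

# Common Postgres default; warn when the pool total eats most of it.
MAX_CONNECTIONS=100
if [ "$TOTAL" -gt $((MAX_CONNECTIONS * 8 / 10)) ]; then
  echo "WARNING: pool total exceeds 80% of max_connections"
fi
```

Remember that migrations, monitoring agents, and superuser sessions also consume connections, so leave headroom beyond the pool total.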


Health checks and monitoring

The /api/health endpoint is the canonical health check for all monitoring:

curl https://your-tenant-plane/api/health
{
  "status": "ok",
  "plane": "tenant",
  "version": "1.3.0",
  "uptime_seconds": 14283,
  "db": { "enabled": true, "reachable": true },
  "vector_db": "pgvector",
  "timestamp": "2026-03-26T10:00:00.000Z"
}
  • status: "ok" — replica is healthy and ready to serve traffic
  • status: "degraded" — replica is running but some non-critical service is unavailable
  • HTTP 503 — replica is not ready; load balancer should stop sending traffic to it
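A sketch of how an external probe might act on these statuses, parsing a captured response body. The body below mirrors the example response above; a real probe would curl the endpoint and also check the HTTP status code (503 means remove from rotation regardless of the body):

```shell
# Decide the load-balancer action from a captured /api/health body.
BODY='{"status":"ok","plane":"tenant","db":{"enabled":true,"reachable":true}}'
case "$BODY" in
  *'"status":"ok"'*)       ACTION="keep in rotation" ;;
  *'"status":"degraded"'*) ACTION="keep in rotation, raise alert" ;;
  *)                       ACTION="remove from rotation" ;;
esac
echo "$ACTION"
```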

Kubernetes probes

The Helm chart configures liveness and readiness probes automatically. Both point to /api/health:

  • Liveness probe: checked every 30 seconds; restarts pod if failing for more than 3 cycles
  • Readiness probe: checked every 10 seconds; removes pod from load balancer rotation when failing
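The probe intervals above imply some back-of-envelope detection times, which this plain shell arithmetic makes explicit (the numbers come straight from the bullet points; actual timing also depends on probe timeouts):

```shell
# Failure-detection times implied by the default probe settings.
LIVENESS_PERIOD=30    # seconds between liveness checks
LIVENESS_FAILURES=3   # failing cycles before the pod is restarted
READINESS_PERIOD=10   # seconds between readiness checks

RESTART_AFTER=$((LIVENESS_PERIOD * LIVENESS_FAILURES))
echo "Unresponsive pod is restarted after ~${RESTART_AFTER}s"
echo "Failing pod leaves rotation within ~${READINESS_PERIOD}s of a failed check"
```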

Session and authentication

TP uses stateless JWT authentication. Here is what this means for HA:

  • No sticky sessions needed — any replica validates any JWT independently
  • No shared session store — tokens are self-contained and verified using JWT_SECRET
  • All replicas must share the same JWT_SECRET — if they differ, users logged into one replica cannot be authenticated by another

Make sure JWT_SECRET is identical on all replicas. In Kubernetes, this is set via the Helm secret automatically.
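One way to verify that replicas agree without printing the raw secret is to compare a short fingerprint on each host. A sketch assuming GNU coreutils' sha256sum; the secret value below is a placeholder for illustration:

```shell
# Compare fingerprints instead of raw values when checking that all
# replicas share the same JWT_SECRET.
JWT_SECRET="example-secret"   # placeholder; in practice, read from the env
FP=$(printf '%s' "$JWT_SECRET" | sha256sum | cut -c1-12)
echo "JWT_SECRET fingerprint: $FP"
```

Run this on every replica (or exec into every pod); the fingerprints must be identical.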


Redis (optional)

Redis is not required for HA but improves two things:

  1. Rate limiting consistency — without Redis, each replica tracks rate limits independently, meaning the effective limit is per-replica limit × number of replicas. With Redis, the limit is global across all replicas.
  2. JWT revocation — when a user’s session is forcibly terminated, Redis allows all replicas to instantly know the token is invalid.
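The per-replica multiplication in point 1 is easy to quantify; a sketch with an assumed per-replica limit of 100 requests/minute (illustrative, not a product default):

```shell
# Worst-case effective rate limit without shared Redis state:
# each replica counts requests independently.
PER_REPLICA_LIMIT=100   # assumed configured limit, req/min
REPLICAS=3
EFFECTIVE=$((PER_REPLICA_LIMIT * REPLICAS))
echo "Effective global limit: up to ${EFFECTIVE} req/min"
```

With a shared Redis backend, the same configuration enforces a true global limit of 100 req/min regardless of replica count.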

Configure Redis:

REDIS_URL=redis://redis-hostname:6379

For Redis HA, use Redis Sentinel or Redis Cluster.


Reference HA topology

┌────────────────────────────────────────────┐
│                Your Network                │
│                                            │
│   ┌──────────────────────────────────┐     │
│   │     Load Balancer / Ingress      │     │
│   │    (nginx / ALB / k8s Ingress)   │     │
│   └────────┬────────────┬────────────┘     │
│            │            │                  │
│      ┌─────▼────┐ ┌─────▼────┐             │
│      │  TP #1   │ │  TP #2   │  (+ more)   │
│      │  1 CPU   │ │  1 CPU   │             │
│      │ 1 GB RAM │ │ 1 GB RAM │             │
│      └────┬─────┘ └────┬─────┘             │
│           └─────┬──────┘                   │
│                 │                          │
│   ┌─────────────▼────────────────────┐     │
│   │  PostgreSQL (Aurora / Patroni)   │     │
│   │  Primary + Standby               │     │
│   └──────────────────────────────────┘     │
│   ┌──────────────────────────────────┐     │
│   │ Redis (optional, for rate limits)│     │
│   └──────────────────────────────────┘     │
└────────────────────────────────────────────┘

This topology handles:

  • Single replica failure: remaining replicas serve 100% of traffic
  • Database primary failure: standby promotes automatically (Aurora < 30s, Patroni ~ 30-60s)
  • Node failure: Kubernetes reschedules pods to healthy nodes
  • Traffic spikes: HPA adds replicas based on CPU/memory