
Architecture Overview

AtlasAI is a multi-tenant AI-native IT operations platform built on a Control Plane / Tenant Plane separation model. Each layer has distinct responsibilities and security boundaries.

System Diagram


Control Plane (CP)

The Control Plane is the SaaS management layer. It handles:

  • Tenant provisioning — creates tenant records, assigns DB schemas, issues credentials
  • Subscription & billing — Razorpay webhooks → entitlement enforcement → TP notification
  • Feature entitlements — plan matrix enforced at TP via JWT claims and /api/entitlements/refresh
  • Governance — AI model approval, policy publishing, RBAC role definitions
  • Multi-region routing — routes tenant login to nearest TP instance
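
As a rough illustration of how a plan matrix could be enforced at the TP from JWT claims, the sketch below checks a feature against the plan named in the claims. The plan names, feature names, and claim fields (`plan`, `exp`) are hypothetical and chosen only for the example; the actual claim layout and the behavior of /api/entitlements/refresh are not specified here.

```typescript
// Hypothetical plan and feature names, used only for illustration.
type Plan = "starter" | "pro" | "enterprise";
type Feature = "incidents" | "copilot" | "autonomous_remediation";

// Assumed plan matrix: which features each plan unlocks.
const PLAN_MATRIX: Record<Plan, ReadonlySet<Feature>> = {
  starter: new Set<Feature>(["incidents"]),
  pro: new Set<Feature>(["incidents", "copilot"]),
  enterprise: new Set<Feature>(["incidents", "copilot", "autonomous_remediation"]),
};

// Assumed shape of the entitlement claims carried in a TP token.
interface EntitlementClaims {
  tenant_id: string;
  plan: Plan;
  exp: number; // expiry, in seconds since the Unix epoch
}

// Returns true only if the claims are still valid and the plan includes the feature.
function isEntitled(
  claims: EntitlementClaims,
  feature: Feature,
  nowSeconds: number,
): boolean {
  if (claims.exp <= nowSeconds) {
    return false; // stale claims: the caller would need to refresh entitlements
  }
  return PLAN_MATRIX[claims.plan].has(feature);
}
```

A caller that gets `false` for expired claims could then request fresh entitlements before retrying, which is one plausible role for the refresh endpoint.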

CP is a Next.js App Router application backed by PostgreSQL (production) or SQLite (development).


Tenant Plane (TP)

The Tenant Plane is the per-tenant operational layer. It handles all operational workloads:

  • Incident management — lifecycle from triggered → open → investigating → resolved → closed
  • ITSM — change requests (CAB workflow), service catalog, SLA management
  • Observability — logs, metrics, traces ingestion and querying
  • CMDB — CIs, relationships, topology, impact analysis
  • AI — RCA, runbook generation, copilot, analytics, remediation planning
  • Automation — runbooks, workflows, job execution, approvals

TP deployment models:

Model         Description
Shared TP     Multi-tenant, cost-effective, one DB per tenant
Dedicated TP  Single-tenant ECS cluster, isolated Aurora
BYOC          Customer-owned AWS account, Atlas-managed
On-Prem       Docker Compose / Kubernetes in customer data center

AI Control Layer

All AI inference flows through the AI Control Layer:

  1. AI Gateway — routes to Bedrock (Claude / Llama), validates credits, falls back on error
  2. RAG Engine — pgvector similarity search on runbooks, incident history, KB articles
  3. Runbook Generator — creates actionable runbooks from incident description + RCA
  4. Copilot — natural language Q&A with persona routing (SRE, SecOps, Executive, FinOps)
  5. Autonomous Remediation — evaluates incidents, generates remediation plans, verifies fixes
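
To make the AI Gateway behavior (validate credits, route, fall back on error) more concrete, here is a minimal sketch. It does not call Bedrock; each model is represented by a hypothetical `call` function supplied by the caller, and `routeWithFallback`, `creditsRemaining`, and the result shape are assumptions made for the example.

```typescript
// A model backend is abstracted as an async function from prompt to completion.
type ModelCall = (prompt: string) => Promise<string>;

interface ModelBackend {
  name: string;
  call: ModelCall;
}

interface GatewayResult {
  model: string; // name of the backend that produced the output
  output: string;
}

// Tries each backend in order and returns the first successful result.
async function routeWithFallback(
  prompt: string,
  creditsRemaining: number,
  backends: ModelBackend[],
): Promise<GatewayResult> {
  // Credit validation happens before any inference is attempted.
  if (creditsRemaining <= 0) {
    throw new Error("insufficient AI credits");
  }
  let lastError: unknown;
  for (const backend of backends) {
    try {
      return { model: backend.name, output: await backend.call(prompt) };
    } catch (err) {
      lastError = err; // remember the failure and try the next backend
    }
  }
  throw new Error(`all models failed: ${String(lastError)}`);
}
```

Keeping the backends as an ordered list keeps the fallback policy in one place; a real gateway would likely also distinguish retryable errors from permanent ones.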

Data Flow

Customer Environment
└── Edge Agent / OTLP Collector
    └── POST /api/ingest/events
        └── Tenant Plane API
            ├── Persist to DB (events, logs, metrics, traces)
            ├── Evaluate alert rules → trigger incidents
            ├── Correlation engine → chain incidents
            └── AI pipeline → RCA → runbook → remediation
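
The "evaluate alert rules → trigger incidents" step in this flow could look roughly like the sketch below. The event and rule shapes are hypothetical (the real ingest payload is not documented here), and the rule is a simple "value above threshold" check chosen for brevity.

```typescript
// Assumed minimal shape of an ingested event.
interface IngestEvent {
  tenantId: string;
  metric: string;
  value: number;
}

// Assumed rule: fire when a metric exceeds a threshold.
interface AlertRule {
  metric: string;
  threshold: number;
}

interface TriggeredIncident {
  tenantId: string;
  metric: string;
  value: number;
  status: "triggered"; // first state of the incident lifecycle
}

// Produces one incident per (event, rule) pair where the rule fires.
function evaluateAlertRules(
  events: IngestEvent[],
  rules: AlertRule[],
): TriggeredIncident[] {
  const incidents: TriggeredIncident[] = [];
  for (const event of events) {
    for (const rule of rules) {
      if (event.metric === rule.metric && event.value > rule.threshold) {
        incidents.push({
          tenantId: event.tenantId,
          metric: event.metric,
          value: event.value,
          status: "triggered",
        });
      }
    }
  }
  return incidents;
}
```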

Security Boundaries

  • All TP APIs require both an Authorization: Bearer <jwt> header and an x-tenant-id header
  • CP validates tenant_id ownership before issuing TP tokens
  • No cross-tenant data access — all queries scoped to tenant_id
  • RBAC enforced at route layer with requireTenantScopedPermission()
  • Secrets are stored in AWS Secrets Manager, never as plaintext environment variables
  • mTLS between TP tasks and Aurora (production)
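
The header and tenant-scoping rules above can be sketched as a single check that runs before a route handler. This is not the implementation of requireTenantScopedPermission(), whose signature is not shown here; `authorizeTenantRequest`, the claim fields, and the permission string format are hypothetical. The sketch also assumes the JWT signature has already been verified and its claims decoded (`null` meaning verification failed).

```typescript
// Assumed shape of claims from an already-verified JWT.
interface VerifiedClaims {
  tenant_id: string;
  permissions: string[];
}

type AuthDecision =
  | { ok: true }
  | { ok: false; status: 401 | 403; reason: string };

// Header names are expected in lowercase, as most HTTP servers normalize them.
function authorizeTenantRequest(
  headers: Record<string, string | undefined>,
  claims: VerifiedClaims | null,
  permission: string,
): AuthDecision {
  const auth = headers["authorization"];
  if (!auth || !auth.startsWith("Bearer ") || claims === null) {
    return { ok: false, status: 401, reason: "missing or invalid bearer token" };
  }
  // The x-tenant-id header must match the tenant the token was issued for.
  const tenantId = headers["x-tenant-id"];
  if (!tenantId || tenantId !== claims.tenant_id) {
    return { ok: false, status: 403, reason: "tenant mismatch" };
  }
  if (!claims.permissions.includes(permission)) {
    return { ok: false, status: 403, reason: "permission denied" };
  }
  return { ok: true };
}
```

Comparing the header against the token's own tenant_id means a valid token for one tenant cannot be replayed with another tenant's ID.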