Architecture Overview
AtlasAI is a multi-tenant AI-native IT operations platform built on a Control Plane / Tenant Plane separation model. Each layer has distinct responsibilities and security boundaries.
System Diagram
Control Plane (CP)
The Control Plane is the SaaS management layer. It handles:
- Tenant provisioning — creates tenant records, assigns DB schemas, issues credentials
- Subscription & billing — Razorpay webhooks → entitlement enforcement → TP notification
- Feature entitlements — plan matrix enforced at TP via JWT claims and `/api/entitlements/refresh`
- Governance — AI model approval, policy publishing, RBAC role definitions
- Multi-region routing — routes tenant login to nearest TP instance
CP is a Next.js App Router application backed by PostgreSQL (production) or SQLite (development).
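As a rough illustration of entitlement enforcement via JWT claims, the sketch below checks a feature against claims that have already been signature-verified. The claim names `plan` and `features`, the feature strings, and the function name `checkEntitlement` are assumptions for illustration; the actual AtlasAI token layout is not documented in this overview.

```typescript
// Hypothetical claim shape (assumption): only tenant_id appears in this document.
interface EntitlementClaims {
  tenant_id: string;
  plan: string;        // assumed claim name
  features: string[];  // assumed claim name: features granted by the plan matrix
  exp: number;         // standard JWT expiry, in seconds since the epoch
}

type EntitlementResult =
  | { allowed: true }
  | { allowed: false; reason: "expired" | "not_entitled" };

// Decide whether a feature may be used, given claims whose signature
// has already been verified upstream.
function checkEntitlement(
  claims: EntitlementClaims,
  feature: string,
  nowSeconds: number
): EntitlementResult {
  if (claims.exp <= nowSeconds) {
    // A stale token is the point where a refresh call would fit.
    return { allowed: false, reason: "expired" };
  }
  if (claims.features.indexOf(feature) === -1) {
    return { allowed: false, reason: "not_entitled" };
  }
  return { allowed: true };
}
```

An expired result is where a TP might call `/api/entitlements/refresh` before retrying, though the exact refresh flow is not specified here.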
Tenant Plane (TP)
The Tenant Plane is the per-tenant operational layer. It handles all operational workloads:
- Incident management — lifecycle from `triggered → open → investigating → resolved → closed`
- ITSM — change requests (CAB workflow), service catalog, SLA management
- Observability — logs, metrics, traces ingestion and querying
- CMDB — CIs, relationships, topology, impact analysis
- AI — RCA, runbook generation, copilot, analytics, remediation planning
- Automation — runbooks, workflows, job execution, approvals
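The incident lifecycle listed above can be modeled as a small state machine. This is a minimal sketch under one loud assumption: only the next state in the sequence is a legal transition. The source gives the order of states but does not say whether skipping or reopening is permitted. The names `canTransition` and `advance` are hypothetical.

```typescript
// States in the order given in the overview.
const LIFECYCLE = ["triggered", "open", "investigating", "resolved", "closed"] as const;
type IncidentStatus = (typeof LIFECYCLE)[number];

// Assumption: a transition is legal only if it moves exactly one step forward.
function canTransition(from: IncidentStatus, to: IncidentStatus): boolean {
  return LIFECYCLE.indexOf(to) === LIFECYCLE.indexOf(from) + 1;
}

// Return the new status, or throw if the transition is not allowed.
function advance(current: IncidentStatus, to: IncidentStatus): IncidentStatus {
  if (!canTransition(current, to)) {
    throw new Error(`Illegal incident transition: ${current} -> ${to}`);
  }
  return to;
}
```

If reopening a resolved incident is supported, the single-step rule would need to be replaced with an explicit transition table.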
TP deployment models:
| Model | Description |
|---|---|
| Shared TP | Multi-tenant, cost-effective, one DB per tenant |
| Dedicated TP | Single-tenant ECS cluster, isolated Aurora |
| BYOC | Customer-owned AWS account, Atlas-managed |
| On-Prem | Docker Compose / Kubernetes in customer data center |
AI Control Layer
All AI inference flows through the AI Control Layer:
- AI Gateway — routes to Bedrock (Claude / Llama), validates credits, falls back on error
- RAG Engine — pgvector similarity search on runbooks, incident history, KB articles
- Runbook Generator — creates actionable runbooks from incident description + RCA
- Copilot — natural language Q&A with persona routing (SRE, SecOps, Executive, FinOps)
- Autonomous Remediation — evaluates incidents, generates remediation plans, verifies fixes
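The AI Gateway behavior described above (validate credits, route to a model, fall back on error) might be sketched as follows. Everything here is illustrative: `ModelCall` stands in for a Bedrock invocation, and the option names, the credit accounting, and `invokeWithFallback` are assumptions rather than the platform's actual API.

```typescript
// Hypothetical model client (assumption): a real gateway would wrap Bedrock calls.
type ModelCall = (prompt: string) => Promise<string>;

interface GatewayOptions {
  primary: ModelCall;
  fallback: ModelCall;
  creditsRemaining: number; // assumed per-tenant credit balance
  costPerCall: number;      // assumed flat cost per request
}

interface GatewayResult {
  text: string;
  servedBy: "primary" | "fallback";
  creditsRemaining: number;
}

async function invokeWithFallback(
  prompt: string,
  opts: GatewayOptions
): Promise<GatewayResult> {
  // Validate credits before any model is called.
  if (opts.creditsRemaining < opts.costPerCall) {
    throw new Error("Insufficient AI credits");
  }
  // The request is charged once, whichever model ends up serving it.
  const creditsRemaining = opts.creditsRemaining - opts.costPerCall;
  try {
    const text = await opts.primary(prompt);
    return { text, servedBy: "primary", creditsRemaining };
  } catch (_err) {
    // Primary failed: try the fallback. Errors from the fallback propagate.
    const text = await opts.fallback(prompt);
    return { text, servedBy: "fallback", creditsRemaining };
  }
}
```

Charging once per request regardless of which model answers is a design choice made for this sketch; the real gateway may meter differently.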
Data Flow
Customer Environment
└── Edge Agent / OTLP Collector
    └── POST /api/ingest/events
        └── Tenant Plane API
            ├── Persist to DB (events, logs, metrics, traces)
            ├── Evaluate alert rules → trigger incidents
            ├── Correlation engine → chain incidents
            └── AI pipeline → RCA → runbook → remediation
Security Boundaries
- All TP APIs require `Authorization: Bearer <jwt>` + `x-tenant-id` header
- CP validates tenant_id ownership before issuing TP tokens
- No cross-tenant data access — all queries scoped to `tenant_id`
- RBAC enforced at route layer with `requireTenantScopedPermission()`
- Secrets stored in AWS Secrets Manager; never in plain-text environment variables
- mTLS between TP tasks and Aurora (production)
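To make the header and RBAC rules above concrete, here is a minimal sketch of a route-level guard. It is not the real `requireTenantScopedPermission()`, whose signature is not shown in this document; `checkTenantScopedPermission`, the `permissions` claim, the permission strings, and the request shape are all hypothetical. Token signature verification is assumed to happen before this check.

```typescript
// Hypothetical shapes (assumptions), used only for illustration.
interface VerifiedClaims {
  tenant_id: string;
  permissions: string[]; // assumed claim name
}

interface RequestContext {
  headers: Record<string, string | undefined>; // lower-cased header names
  claims: VerifiedClaims | null;               // null if the token failed verification
}

type GuardResult =
  | { ok: true; tenantId: string }
  | { ok: false; status: 401 | 403; error: string };

function checkTenantScopedPermission(
  ctx: RequestContext,
  permission: string
): GuardResult {
  const auth = ctx.headers["authorization"];
  const claims = ctx.claims;
  // Require a bearer token that verified successfully.
  if (!auth || auth.indexOf("Bearer ") !== 0 || claims === null) {
    return { ok: false, status: 401, error: "missing or invalid bearer token" };
  }
  // The x-tenant-id header must match the tenant the token was issued for.
  const tenantId = ctx.headers["x-tenant-id"];
  if (!tenantId || tenantId !== claims.tenant_id) {
    return { ok: false, status: 403, error: "tenant mismatch" };
  }
  // RBAC: the caller must hold the requested permission.
  if (claims.permissions.indexOf(permission) === -1) {
    return { ok: false, status: 403, error: "permission denied" };
  }
  return { ok: true, tenantId };
}
```

Returning the validated `tenantId` lets the caller scope every subsequent query to that tenant instead of re-reading the header.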