Architecture Overview

AtlasAI is a multi-tenant, AI-native IT operations platform built on a Control Plane / Tenant Plane separation model. Each layer has distinct responsibilities and security boundaries.

System Diagram

Control Plane (CP)

The Control Plane is the SaaS management layer. It handles:

Tenant provisioning — creates tenant records, assigns DB schemas, issues credentials
Subscription & billing — Razorpay webhooks → entitlement enforcement → TP notification
Feature entitlements — plan matrix enforced at TP via JWT claims and /api/entitlements/refresh
Governance — AI model approval, policy publishing, RBAC role definitions
Multi-region routing — routes tenant login to nearest TP instance

CP is a Next.js App Router application backed by PostgreSQL (production) or SQLite (development).
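
As a rough illustration of how a plan matrix might be enforced from JWT claims, here is a minimal TypeScript sketch. The claim shape, plan names, feature keys, and the `checkEntitlement` helper are all hypothetical. The sketch assumes the token signature has already been verified, and that stale claims are renewed through /api/entitlements/refresh.

```typescript
// Hypothetical claim shape; the real AtlasAI token layout is not documented here.
type Plan = "free" | "pro" | "enterprise";

interface EntitlementClaims {
  tenant_id: string;
  plan: Plan;
  exp: number; // expiry, in seconds since the Unix epoch
}

// Illustrative plan matrix: the features each plan unlocks (names are made up).
const PLAN_MATRIX: Record<Plan, readonly string[]> = {
  free: ["incidents"],
  pro: ["incidents", "itsm", "observability"],
  enterprise: ["incidents", "itsm", "observability", "ai_copilot"],
};

type EntitlementResult = "allowed" | "denied" | "refresh_required";

// Decide whether a feature is usable given already-verified claims.
// Expired claims are never trusted; the caller is expected to refresh them first.
function checkEntitlement(
  claims: EntitlementClaims,
  feature: string,
  nowSeconds: number,
): EntitlementResult {
  if (claims.exp <= nowSeconds) return "refresh_required";
  return PLAN_MATRIX[claims.plan].includes(feature) ? "allowed" : "denied";
}
```

Passing the current time in as a parameter keeps the check pure, which makes it straightforward to unit test.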

Tenant Plane (TP)

The Tenant Plane is the per-tenant operational layer. It handles all operational workloads:

Incident management — lifecycle from triggered → open → investigating → resolved → closed
ITSM — change requests (CAB workflow), service catalog, SLA management
Observability — logs, metrics, traces ingestion and querying
CMDB — CIs, relationships, topology, impact analysis
AI — RCA, runbook generation, copilot, analytics, remediation planning
Automation — runbooks, workflows, job execution, approvals
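
The incident lifecycle listed above can be modeled as a small state machine. This sketch assumes strictly forward, one-step transitions; whether AtlasAI permits reopening or skipping states is not stated, and the helper names are illustrative.

```typescript
// Lifecycle states in the order given in this document.
const LIFECYCLE = ["triggered", "open", "investigating", "resolved", "closed"] as const;
type IncidentStatus = (typeof LIFECYCLE)[number];

// Assumption: an incident may only move to the immediately following state.
function canTransition(from: IncidentStatus, to: IncidentStatus): boolean {
  return LIFECYCLE.indexOf(to) === LIFECYCLE.indexOf(from) + 1;
}

// Returns the new status, or throws if the transition is not allowed.
function advance(from: IncidentStatus, to: IncidentStatus): IncidentStatus {
  if (!canTransition(from, to)) {
    throw new Error(`Invalid incident transition: ${from} -> ${to}`);
  }
  return to;
}
```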

TP deployment models:

Model	Description
Shared TP	Multi-tenant, cost-effective, one DB per tenant
Dedicated TP	Single-tenant ECS cluster, isolated Aurora
BYOC	Customer-owned AWS account, Atlas-managed
On-Prem	Docker Compose / Kubernetes in customer data center

AI Control Layer

All AI inference flows through the AI Control Layer:

AI Gateway — routes to Bedrock (Claude / Llama), validates credits, falls back on error
RAG Engine — pgvector similarity search on runbooks, incident history, KB articles
Runbook Generator — creates actionable runbooks from incident description + RCA
Copilot — natural language Q&A with persona routing (SRE, SecOps, Executive, FinOps)
Autonomous Remediation — evaluates incidents, generates remediation plans, verifies fixes
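
The AI Gateway behaviour (validate credits, call a model, fall back on error) can be sketched as follows. Model calls are injected as plain functions so that no Bedrock SDK call is assumed. The fallback target, the credit accounting, and every name here are hypothetical.

```typescript
// A model call is abstracted as a function; real Bedrock invocations are out of scope.
type ModelCall = (prompt: string) => Promise<string>;

interface GatewayOptions {
  primary: ModelCall;       // e.g. a wrapper around one Bedrock model
  fallback: ModelCall;      // assumed: a second model tried when the primary fails
  creditsRemaining: number; // tenant's remaining AI credits (hypothetical unit)
  costPerCall: number;      // credits charged per inference (hypothetical)
}

interface GatewayResult {
  text: string;
  usedFallback: boolean;
}

async function routeInference(prompt: string, opts: GatewayOptions): Promise<GatewayResult> {
  // Reject before any model is called if the tenant cannot pay for the request.
  if (opts.creditsRemaining < opts.costPerCall) {
    throw new Error("Insufficient AI credits");
  }
  try {
    return { text: await opts.primary(prompt), usedFallback: false };
  } catch {
    // Any primary failure triggers a single fallback attempt; its errors propagate.
    return { text: await opts.fallback(prompt), usedFallback: true };
  }
}
```

Checking credits before the first call means a tenant with no credits never incurs model cost, even on the fallback path.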

Data Flow


Customer Environment
  └── Edge Agent / OTLP Collector
        └── POST /api/ingest/events
              └── Tenant Plane API
                    ├── Persist to DB (events, logs, metrics, traces)
                    ├── Evaluate alert rules → trigger incidents
                    ├── Correlation engine → chain incidents
                    └── AI pipeline → RCA → runbook → remediation
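
To make the "evaluate alert rules → trigger incidents" step concrete, here is a minimal sketch of threshold rule evaluation against one ingested metric event. The event and rule shapes are assumptions; the actual payload accepted by /api/ingest/events is not specified in this document.

```typescript
// Hypothetical shape of a metric event after ingestion.
interface MetricEvent {
  tenantId: string;
  metric: string;
  value: number;
}

// Hypothetical threshold rule; real alert rules may be considerably richer.
interface AlertRule {
  id: string;
  metric: string;
  operator: ">" | "<";
  threshold: number;
}

// Returns the ids of all rules that the event violates.
function evaluateRules(evt: MetricEvent, rules: AlertRule[]): string[] {
  return rules
    .filter((rule) => rule.metric === evt.metric)
    .filter((rule) =>
      rule.operator === ">" ? evt.value > rule.threshold : evt.value < rule.threshold,
    )
    .map((rule) => rule.id);
}
```

Each returned rule id would then be a candidate for triggering an incident and for hand-off to the correlation engine.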

Security Boundaries

All TP APIs require an Authorization: Bearer <jwt> header and an x-tenant-id header
CP validates tenant_id ownership before issuing TP tokens
No cross-tenant data access — all queries scoped to tenant_id
RBAC enforced at route layer with requireTenantScopedPermission()
Secrets stored in AWS Secrets Manager, never as plaintext environment variables
mTLS between TP tasks and Aurora (production)
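
The header and tenant-scoping rules above might combine into a route-level guard along the following lines. This is not the actual requireTenantScopedPermission() implementation, whose signature is not shown in this document. The function name, claim shape, and status codes are assumptions, and JWT signature verification is assumed to have happened upstream.

```typescript
// Hypothetical claims extracted from an already-verified JWT.
interface VerifiedClaims {
  tenant_id: string;
  permissions: string[];
}

type GuardResult =
  | { ok: true }
  | { ok: false; status: 400 | 403; reason: string };

// Checks that the x-tenant-id header matches the token's tenant
// and that the token carries the required permission.
function checkTenantScope(
  headerTenantId: string | undefined,
  claims: VerifiedClaims,
  permission: string,
): GuardResult {
  if (!headerTenantId) {
    return { ok: false, status: 400, reason: "missing x-tenant-id header" };
  }
  if (headerTenantId !== claims.tenant_id) {
    // The token was issued for a different tenant: block cross-tenant access.
    return { ok: false, status: 403, reason: "tenant mismatch" };
  }
  if (!claims.permissions.includes(permission)) {
    return { ok: false, status: 403, reason: "missing permission" };
  }
  return { ok: true };
}
```

Comparing the header against the token's own tenant_id, rather than trusting the header alone, prevents a client from widening its scope by changing x-tenant-id.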