AI & Reasoning Overview
AtlasAI’s AI system is purpose-built for IT operations. Unlike general-purpose AI tools, it is trained and tuned specifically for infrastructure reasoning — understanding metrics anomalies, correlating log patterns, traversing service topologies, and generating safe remediation actions.
AI Architecture
The AI system consists of several interconnected components:
┌─────────────────────────────────────────────┐
│                 AI Gateway                  │
│                                             │
│  ┌─────────┐  ┌──────────┐  ┌───────────┐   │
│  │   RCA   │  │ Runbook  │  │  Copilot  │   │
│  │ Engine  │  │ Generator│  │  Engine   │   │
│  └────┬────┘  └────┬─────┘  └─────┬─────┘   │
│       │            │              │         │
│  ┌────▼────────────▼──────────────▼─────┐   │
│  │          RAG Knowledge Base          │   │
│  │  (incidents, runbooks, docs, CMDB)   │   │
│  └──────────────────────────────────────┘   │
│                                             │
│  ┌──────────────────────────────────────┐   │
│  │     Model Router & Orchestrator      │   │
│  │   (selects model, manages context)   │   │
│  └──────────────────────────────────────┘   │
└─────────────────────────────────────────────┘

Core AI Capabilities
1. Root Cause Analysis (RCA)
The RCA engine analyzes multi-signal evidence to produce ranked root cause hypotheses. It combines:
- Signal analysis — Pattern recognition across metrics, logs, and traces
- Topology reasoning — Graph traversal of service dependencies to trace failure propagation
- Historical correlation — RAG retrieval of similar past incidents and their resolutions
- Change correlation — Linking incidents to recent deployments, config changes, or infrastructure modifications
Learn more about RCA Reasoning →
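The four evidence sources above can be sketched as a simple scoring model: each hypothesis carries a per-source score, and a weighted blend produces the ranking. This is an illustrative sketch only — the class names, weights, and scores below are assumptions, not AtlasAI's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    cause: str
    # Per-source evidence scores in [0, 1]; all illustrative.
    signal: float = 0.0      # metric/log/trace pattern match
    topology: float = 0.0    # failure-propagation path strength
    historical: float = 0.0  # similarity to resolved past incidents
    change: float = 0.0      # correlation with a recent change or deploy

    def confidence(self) -> float:
        # Assumed weights; a real system would tune these.
        return (0.35 * self.signal + 0.25 * self.topology
                + 0.20 * self.historical + 0.20 * self.change)

def rank(hypotheses: list[Hypothesis]) -> list[Hypothesis]:
    # Highest blended confidence first.
    return sorted(hypotheses, key=lambda h: h.confidence(), reverse=True)

candidates = [
    Hypothesis("db-connection-pool-exhaustion",
               signal=0.9, topology=0.8, historical=0.7, change=0.1),
    Hypothesis("bad-deploy-of-checkout-service",
               signal=0.6, topology=0.5, historical=0.4, change=0.95),
]
ranked = rank(candidates)
```

With these made-up scores, strong signal and topology evidence outweighs a strong change correlation, so the pool-exhaustion hypothesis ranks first.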
2. Runbook Generation
The AI generates structured remediation playbooks based on RCA output and historical resolutions. Generated runbooks include risk-classified steps, rollback instructions, and approval gates.
Learn more about Runbook Generation →
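A generated runbook with risk-classified steps, rollback instructions, and approval gates might take a shape like the following. The field names and example steps are hypothetical, not AtlasAI's schema.

```python
from dataclasses import dataclass
from enum import Enum

class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"

@dataclass
class Step:
    action: str
    risk: Risk
    rollback: str                  # how to undo this step
    requires_approval: bool = False

# Illustrative runbook for a latency incident.
runbook = [
    Step("Scale api-gateway replicas 3 -> 6", Risk.LOW,
         rollback="Scale replicas back to 3"),
    Step("Restart checkout-service pods", Risk.MEDIUM,
         rollback="None needed; pods restart cleanly"),
    Step("Roll back checkout-service to previous version", Risk.HIGH,
         rollback="Redeploy current version", requires_approval=True),
]

# High-risk steps carry an explicit approval gate.
gated = [s for s in runbook if s.requires_approval]
```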
3. RAG Knowledge Base
Every incident resolution, runbook execution, and operator interaction is indexed into a vector knowledge base. The AI retrieves relevant context from this knowledge base during RCA and Copilot interactions.
Learn more about RAG & Knowledge →
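The retrieval step can be illustrated in miniature: embed the incident text, then return the nearest indexed documents by cosine similarity. The toy fixed-vocabulary embedding below stands in for a real learned embedding model; incident IDs and vocabulary are invented for the example.

```python
import math

# Stand-in embedding: count occurrences of a fixed vocabulary.
# A real knowledge base would use a learned embedding model.
VOCAB = ["payment", "latency", "db", "failover", "disk", "full", "log", "shipper"]

def embed(text: str) -> list[float]:
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Tiny vector index of past incident resolutions (illustrative).
index = {
    "INC-1042: payment latency after db failover": embed("payment latency db failover"),
    "INC-0987: disk full on log shipper": embed("disk full log shipper"),
}

def retrieve(query: str, k: int = 1) -> list[str]:
    qv = embed(query)
    scored = sorted(index.items(), key=lambda kv: cosine(qv, kv[1]), reverse=True)
    return [doc for doc, _ in scored[:k]]
```

A query about payment latency retrieves the payment incident; one about a full disk retrieves the log-shipper incident.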
4. Trust & Autonomy
AtlasAI’s progressive autonomy system (L0–L5) controls how much authority the AI has. Organizations start conservatively and increase autonomy as trust builds.
Learn more about Trust & Autonomy →
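A minimal sketch of how a progressive-autonomy gate could work, assuming per-action level thresholds (the L0–L5 ladder is from the docs; the specific actions and thresholds here are invented for illustration):

```python
# Minimum autonomy level at which each action runs without approval.
# Action names and thresholds are illustrative assumptions.
AUTONOMOUS_FROM_LEVEL = {
    "read_diagnostics": 1,   # autonomous from L1
    "restart_service": 3,    # autonomous from L3
    "scale_deployment": 4,   # autonomous from L4
}

def needs_approval(action: str, autonomy_level: int, destructive: bool) -> bool:
    # Destructive actions always require approval, even at L5.
    if destructive:
        return True
    # Unknown actions are always gated (threshold above L5).
    threshold = AUTONOMOUS_FROM_LEVEL.get(action, 6)
    return autonomy_level < threshold
```

This mirrors the conservative-by-default posture: anything not explicitly granted at the current level goes to a human.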
5. Bring Your Own LLM
Enterprise customers can connect their own LLM infrastructure (Azure OpenAI, AWS Bedrock, self-hosted models) for data sovereignty and compliance.
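As a sketch of what a bring-your-own-LLM configuration might involve, the following filters providers against a data-residency policy. The provider names match the docs; the regions, field names, and selection logic are hypothetical placeholders.

```python
# Hypothetical provider registry; not AtlasAI's actual schema.
PROVIDERS = {
    "azure-openai": {"region": "eastus", "data_leaves_tenant": False},
    "aws-bedrock":  {"region": "us-east-1", "data_leaves_tenant": False},
    "self-hosted":  {"region": "on-prem", "data_leaves_tenant": False},
}

def compliant_providers(allowed_regions: set[str]) -> list[str]:
    """Return providers satisfying a residency/sovereignty policy."""
    return [name for name, cfg in PROVIDERS.items()
            if cfg["region"] in allowed_regions
            and not cfg["data_leaves_tenant"]]
```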
AI Safety Principles
AtlasAI’s AI operates under strict safety constraints:
- No unsupervised destructive actions — Actions that delete data or terminate production resources always require explicit approval (even at L5)
- Explainability — Every AI decision includes an evidence chain showing how it was reached
- Confidence thresholds — The AI only takes automated action when confidence exceeds configurable thresholds
- Human override — Operators can always override, pause, or roll back AI decisions
- Audit trail — Every AI action is logged with the model used, input context, and output
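The principles above compose naturally into a single decision gate: block destructive actions outright, act autonomously only above the confidence threshold, and log every decision. A minimal sketch, with an assumed threshold and invented audit fields:

```python
# Assumed configurable threshold; not AtlasAI's actual default.
CONFIDENCE_THRESHOLD = 0.85

def decide(action: str, confidence: float, destructive: bool,
           audit: list[dict]) -> str:
    if destructive:
        decision = "needs_approval"        # never unsupervised, even at L5
    elif confidence >= CONFIDENCE_THRESHOLD:
        decision = "auto_execute"
    else:
        decision = "needs_approval"        # below threshold: hand to a human
    # Every decision is logged; field names are illustrative.
    audit.append({
        "action": action,
        "confidence": confidence,
        "decision": decision,
    })
    return decision

audit_log: list[dict] = []
decide("restart checkout-service", 0.92, False, audit_log)
decide("delete stale volumes", 0.99, True, audit_log)
```

Note that the destructive delete is gated despite its high confidence, while the high-confidence restart proceeds automatically.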