AI & Reasoning Overview
AtlasAI’s AI system is purpose-built for IT operations. Unlike general-purpose AI tools, it is trained and tuned specifically for infrastructure reasoning — understanding metrics anomalies, correlating log patterns, traversing service topologies, and generating safe remediation actions.
AI Architecture
The AI system consists of several interconnected components:
┌─────────────────────────────────────────────┐
│                 AI Gateway                  │
│                                             │
│  ┌─────────┐  ┌──────────┐  ┌───────────┐   │
│  │   RCA   │  │ Runbook  │  │  Copilot  │   │
│  │ Engine  │  │ Generator│  │  Engine   │   │
│  └────┬────┘  └────┬─────┘  └─────┬─────┘   │
│       │            │              │         │
│  ┌────▼────────────▼──────────────▼─────┐   │
│  │          RAG Knowledge Base          │   │
│  │  (incidents, runbooks, docs, CMDB)   │   │
│  └──────────────────────────────────────┘   │
│                                             │
│  ┌──────────────────────────────────────┐   │
│  │     Model Router & Orchestrator      │   │
│  │   (selects model, manages context)   │   │
│  └──────────────────────────────────────┘   │
└─────────────────────────────────────────────┘

Core AI Capabilities
1. Root Cause Analysis (RCA)
The RCA engine analyzes multi-signal evidence to produce ranked root cause hypotheses. It combines:
- Signal analysis — Pattern recognition across metrics, logs, and traces
- Topology reasoning — Graph traversal of service dependencies to trace failure propagation
- Historical correlation — RAG retrieval of similar past incidents and their resolutions
- Change correlation — Linking incidents to recent deployments, config changes, or infrastructure modifications
Learn more about RCA Reasoning →
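The four evidence sources above can be sketched as a simple scoring model: each hypothesis carries a per-source score, and a weighted blend produces the ranking. This is an illustrative sketch only — the class names, weights, and scores below are assumptions, not AtlasAI's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    cause: str
    # Per-source evidence scores in [0, 1]; all illustrative.
    signal: float = 0.0      # metric/log/trace pattern match
    topology: float = 0.0    # failure-propagation path strength
    historical: float = 0.0  # similarity to resolved past incidents
    change: float = 0.0      # correlation with a recent change or deploy

    def confidence(self) -> float:
        # Assumed weights; a real system would tune these.
        return (0.35 * self.signal + 0.25 * self.topology
                + 0.20 * self.historical + 0.20 * self.change)

def rank(hypotheses: list[Hypothesis]) -> list[Hypothesis]:
    # Highest blended confidence first.
    return sorted(hypotheses, key=lambda h: h.confidence(), reverse=True)

candidates = [
    Hypothesis("db-connection-pool-exhaustion",
               signal=0.9, topology=0.8, historical=0.7, change=0.1),
    Hypothesis("bad-deploy-of-checkout-service",
               signal=0.6, topology=0.5, historical=0.4, change=0.95),
]
ranked = rank(candidates)
```

With these made-up scores, strong signal and topology evidence outweighs a strong change correlation, so the pool-exhaustion hypothesis ranks first.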
2. Runbook Generation
The AI generates structured remediation playbooks based on RCA output and historical resolutions. Generated runbooks include risk-classified steps, rollback instructions, and approval gates.
Learn more about Runbook Generation →
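A generated runbook with risk-classified steps, rollback instructions, and approval gates might take a shape like the following. The field names and example steps are hypothetical, not AtlasAI's schema.

```python
from dataclasses import dataclass
from enum import Enum

class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"

@dataclass
class Step:
    action: str
    risk: Risk
    rollback: str                  # how to undo this step
    requires_approval: bool = False

# Illustrative runbook for a latency incident.
runbook = [
    Step("Scale api-gateway replicas 3 -> 6", Risk.LOW,
         rollback="Scale replicas back to 3"),
    Step("Restart checkout-service pods", Risk.MEDIUM,
         rollback="None needed; pods restart cleanly"),
    Step("Roll back checkout-service to previous version", Risk.HIGH,
         rollback="Redeploy current version", requires_approval=True),
]

# High-risk steps carry an explicit approval gate.
gated = [s for s in runbook if s.requires_approval]
```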
3. RAG Knowledge Base
Every incident resolution, runbook execution, and operator interaction is indexed into a vector knowledge base. The AI retrieves relevant context from this knowledge base during RCA and Copilot interactions.
Learn more about RAG & Knowledge →
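The retrieval step can be illustrated in miniature: embed the incident text, then return the nearest indexed documents by cosine similarity. The toy fixed-vocabulary embedding below stands in for a real learned embedding model; incident IDs and vocabulary are invented for the example.

```python
import math

# Stand-in embedding: count occurrences of a fixed vocabulary.
# A real knowledge base would use a learned embedding model.
VOCAB = ["payment", "latency", "db", "failover", "disk", "full", "log", "shipper"]

def embed(text: str) -> list[float]:
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Tiny vector index of past incident resolutions (illustrative).
index = {
    "INC-1042: payment latency after db failover": embed("payment latency db failover"),
    "INC-0987: disk full on log shipper": embed("disk full log shipper"),
}

def retrieve(query: str, k: int = 1) -> list[str]:
    qv = embed(query)
    scored = sorted(index.items(), key=lambda kv: cosine(qv, kv[1]), reverse=True)
    return [doc for doc, _ in scored[:k]]
```

A query about payment latency retrieves the payment incident; one about a full disk retrieves the log-shipper incident.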
4. Trust & Autonomy
AtlasAI’s progressive autonomy system (L0–L5) controls how much authority the AI has. Organizations start conservatively and increase autonomy as trust builds.
Learn more about Trust & Autonomy →
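A minimal sketch of how a progressive-autonomy gate could work, assuming per-action level thresholds (the L0–L5 ladder is from the docs; the specific actions and thresholds here are invented for illustration):

```python
# Minimum autonomy level at which each action runs without approval.
# Action names and thresholds are illustrative assumptions.
AUTONOMOUS_FROM_LEVEL = {
    "read_diagnostics": 1,   # autonomous from L1
    "restart_service": 3,    # autonomous from L3
    "scale_deployment": 4,   # autonomous from L4
}

def needs_approval(action: str, autonomy_level: int, destructive: bool) -> bool:
    # Destructive actions always require approval, even at L5.
    if destructive:
        return True
    # Unknown actions are always gated (threshold above L5).
    threshold = AUTONOMOUS_FROM_LEVEL.get(action, 6)
    return autonomy_level < threshold
```

This mirrors the conservative-by-default posture: anything not explicitly granted at the current level goes to a human.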
5. Bring Your Own LLM
Enterprise customers can connect their own LLM infrastructure (Azure OpenAI, AWS Bedrock, self-hosted models) for data sovereignty and compliance.
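As a sketch of what a bring-your-own-LLM configuration might involve, the following filters providers against a data-residency policy. The provider names match the docs; the regions, field names, and selection logic are hypothetical placeholders.

```python
# Hypothetical provider registry; not AtlasAI's actual schema.
PROVIDERS = {
    "azure-openai": {"region": "eastus", "data_leaves_tenant": False},
    "aws-bedrock":  {"region": "us-east-1", "data_leaves_tenant": False},
    "self-hosted":  {"region": "on-prem", "data_leaves_tenant": False},
}

def compliant_providers(allowed_regions: set[str]) -> list[str]:
    """Return providers satisfying a residency/sovereignty policy."""
    return [name for name, cfg in PROVIDERS.items()
            if cfg["region"] in allowed_regions
            and not cfg["data_leaves_tenant"]]
```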
AI Safety Principles
AtlasAI’s AI operates under strict safety constraints:
- No unsupervised destructive actions — Actions that delete data or terminate production resources always require explicit approval (even at L5)
- Explainability — Every AI decision includes an evidence chain showing how it was reached
- Confidence thresholds — The AI only takes automated action when confidence exceeds configurable thresholds
- Human override — Operators can always override, pause, or roll back AI decisions
- Audit trail — Every AI action is logged with the model used, input context, and output
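The principles above compose naturally into a single decision gate: block destructive actions outright, act autonomously only above the confidence threshold, and log every decision. A minimal sketch, with an assumed threshold and invented audit fields:

```python
# Assumed configurable threshold; not AtlasAI's actual default.
CONFIDENCE_THRESHOLD = 0.85

def decide(action: str, confidence: float, destructive: bool,
           audit: list[dict]) -> str:
    if destructive:
        decision = "needs_approval"        # never unsupervised, even at L5
    elif confidence >= CONFIDENCE_THRESHOLD:
        decision = "auto_execute"
    else:
        decision = "needs_approval"        # below threshold: hand to a human
    # Every decision is logged; field names are illustrative.
    audit.append({
        "action": action,
        "confidence": confidence,
        "decision": decision,
    })
    return decision

audit_log: list[dict] = []
decide("restart checkout-service", 0.92, False, audit_log)
decide("delete stale volumes", 0.99, True, audit_log)
```

Note that the destructive delete is gated despite its high confidence, while the high-confidence restart proceeds automatically.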