RCA Reasoning Engine
The RCA (Root Cause Analysis) reasoning engine is AtlasAI’s core analytical capability. Given an incident with attached evidence, it produces a ranked list of probable root causes with confidence scores and explainable evidence chains.
How RCA Works
The reasoning process follows a structured pipeline:
1. Evidence Collection
The engine gathers all available evidence for the incident:
- Metrics — Time series data from the affected service and its dependencies (CPU, memory, latency, error rates)
- Logs — Error logs, warning logs, and anomalous log patterns from the incident timeframe
- Traces — Distributed traces showing request flow and latency spikes
- Alerts — Correlated alerts that fired around the same time
- Topology — Service dependency graph showing upstream and downstream relationships
- Changes — Recent deployments, configuration changes, or infrastructure modifications
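As a rough illustration, the six evidence types above could be grouped into a single structure before analysis. This is only a sketch: the `EvidenceBundle` class, its field names, and the `missing_types` helper are hypothetical and are not part of any documented AtlasAI API.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class EvidenceBundle:
    """Hypothetical container for the six evidence types collected for one incident."""
    incident_id: str
    window_start: datetime
    window_end: datetime
    metrics: dict = field(default_factory=dict)    # metric name -> list of (timestamp, value)
    logs: list = field(default_factory=list)       # raw log lines from the incident window
    traces: list = field(default_factory=list)     # distributed trace records
    alerts: list = field(default_factory=list)     # alerts that fired near the incident
    topology: dict = field(default_factory=dict)   # service -> services it depends on
    changes: list = field(default_factory=list)    # deployments and config changes

    def missing_types(self) -> list:
        """Names of evidence types with nothing attached, in the order listed above."""
        names = ["metrics", "logs", "traces", "alerts", "topology", "changes"]
        return [name for name in names if not getattr(self, name)]


bundle = EvidenceBundle(
    incident_id="INC-1",
    window_start=datetime(2024, 1, 1, 10, 0),
    window_end=datetime(2024, 1, 1, 11, 0),
    logs=["ERROR connection pool exhausted"],
)
print(bundle.missing_types())  # ['metrics', 'traces', 'alerts', 'topology', 'changes']
```

A check like `missing_types` makes gaps visible early, since later steps can only reason over the evidence that was actually attached.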
2. Signal Analysis
Metrics, logs, traces, and alerts are each analyzed independently:
- Metric anomaly detection — Statistical analysis identifies metrics that deviated significantly from baseline during the incident window
- Log pattern clustering — NLP-based grouping of log messages to identify new error patterns that appeared around the incident start time
- Trace critical path analysis — Identification of spans with abnormal latency or error rates
- Alert correlation — Temporal and topological correlation of fired alerts
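To make the metric anomaly detection step concrete, here is a minimal sketch using a z-score against a baseline window. The statistical method the engine actually uses is not specified here, so the z-score, the threshold of 3.0, and the function name `anomalous_points` are all assumptions for illustration.

```python
from statistics import mean, stdev


def anomalous_points(baseline, incident, threshold=3.0):
    """Return (index, value, z_score) for incident samples far from the baseline.

    Assumes a simple z-score test; `baseline` needs at least two samples.
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # Flat baseline: treat any change in value as a deviation.
        return [(i, v, float("inf")) for i, v in enumerate(incident) if v != mu]
    flagged = []
    for i, v in enumerate(incident):
        z = (v - mu) / sigma
        if abs(z) > threshold:
            flagged.append((i, v, z))
    return flagged


# Latency in ms: a calm baseline, then one spike in the incident window.
baseline_latency = [100, 102, 98, 101, 99]
incident_latency = [101, 250, 99]
for index, value, z in anomalous_points(baseline_latency, incident_latency):
    print(f"sample {index}: {value} ms (z = {z:.1f})")
```

Only the 250 ms sample is flagged here; the other two incident samples sit within one standard deviation of the baseline mean.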
3. Topology Traversal
The engine walks the service dependency graph to understand failure propagation:
- Start from the initially affected service
- Check health of all upstream dependencies (services this one depends on)
- Check health of all downstream dependents (services that depend on this one)
- Identify the service where the anomaly originated (not just where symptoms appeared)
- Score each candidate root cause based on temporal ordering (cause precedes effect)
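The traversal above can be sketched as a breadth-first walk over the dependency graph in both directions, followed by ordering the anomalous services by when their anomaly began. This is a simplified illustration, not the engine's actual scoring: the function `rank_origin_candidates`, the `depends_on` map, and the `anomaly_onset` map are hypothetical names, and ranking purely by earliest onset is an assumption.

```python
from collections import deque
from datetime import datetime


def rank_origin_candidates(start, depends_on, anomaly_onset):
    """Walk upstream and downstream from `start`; rank anomalous services by onset time.

    depends_on:    service -> list of services it depends on (upstream)
    anomaly_onset: service -> datetime its anomaly began (healthy services are absent)
    """
    # Build reverse edges so downstream dependents can be visited too.
    dependents = {}
    for svc, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(svc)

    seen = {start}
    queue = deque([start])
    while queue:
        svc = queue.popleft()
        for neighbor in depends_on.get(svc, []) + dependents.get(svc, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)

    # Cause precedes effect: the earliest onset is the most likely origin.
    candidates = [s for s in seen if s in anomaly_onset]
    return sorted(candidates, key=lambda s: (anomaly_onset[s], s))


depends_on = {"web": ["checkout"], "checkout": ["payments", "cart"], "payments": ["db"]}
anomaly_onset = {
    "db": datetime(2024, 1, 1, 10, 0),
    "payments": datetime(2024, 1, 1, 10, 2),
    "checkout": datetime(2024, 1, 1, 10, 3),
    "web": datetime(2024, 1, 1, 10, 4),
}
print(rank_origin_candidates("checkout", depends_on, anomaly_onset))
# ['db', 'payments', 'checkout', 'web']
```

In this example the incident was reported on `checkout`, but `db` ranks first because its anomaly started earliest; `cart` is visited and excluded because it stayed healthy.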
4. RAG Retrieval
The engine queries the knowledge base for historical context:
- Similar incidents — Past incidents with matching service, error patterns, or metric signatures
- Known resolutions — How similar incidents were resolved, and whether the fix was effective
- Runbook history — Which runbooks were executed and their outcomes
- Documentation — Internal docs, wiki pages, and architecture notes relevant to the affected services
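The idea of retrieving similar incidents can be shown with a deliberately simple stand-in. The sketch below scores past incident summaries by token overlap (Jaccard similarity); a production RAG system would more likely use embeddings and a vector index, and nothing here reflects how AtlasAI's knowledge base is actually queried. The function names and incident IDs are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def retrieve_similar(query, past_incidents, top_k=2):
    """Return up to top_k (incident_id, score) pairs with non-zero overlap, best first."""
    query_tokens = set(query.lower().split())
    scored = []
    for incident_id, summary in past_incidents.items():
        score = jaccard(query_tokens, set(summary.lower().split()))
        if score > 0:
            scored.append((incident_id, round(score, 2)))
    # Highest score first; break ties by incident ID for stable output.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]


past_incidents = {
    "INC-101": "payments db connection pool exhausted",
    "INC-102": "checkout latency spike after deploy",
    "INC-103": "dns resolution failure",
}
print(retrieve_similar("payments connection pool timeout", past_incidents))
# [('INC-101', 0.5)]
```

The query shares three of six distinct tokens with INC-101, giving a score of 0.5; the other two summaries share none and are dropped.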
5. Hypothesis Generation
Combining all analyses, the engine produces ranked hypotheses:
| Field | Description |
|---|---|
| Hypothesis | A natural language description of the probable root cause |
| Confidence | Percentage score (0–100) based on evidence strength |
| Evidence Chain | Ordered list of observations that support this hypothesis |
| Contradictions | Any evidence that weakens this hypothesis |
| Similar Past Incidents | Links to historical incidents with matching patterns |
| Suggested Remediation | Initial recommendation for resolving the incident |
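The fields in the table map naturally onto a record type. The following is a minimal sketch, assuming a flat structure and a ranking by confidence alone; the `Hypothesis` class and `rank` function are hypothetical and do not describe the engine's real output schema.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    """Hypothetical record mirroring the fields in the table above."""
    hypothesis: str
    confidence: float                                      # 0-100
    evidence_chain: list = field(default_factory=list)     # ordered supporting observations
    contradictions: list = field(default_factory=list)     # evidence that weakens it
    similar_past_incidents: list = field(default_factory=list)
    suggested_remediation: str = ""

    def __post_init__(self):
        if not 0 <= self.confidence <= 100:
            raise ValueError("confidence must be between 0 and 100")


def rank(hypotheses):
    """Order hypotheses from most to least confident."""
    return sorted(hypotheses, key=lambda h: h.confidence, reverse=True)


results = rank([
    Hypothesis("Bad deploy of checkout", 35.0, contradictions=["No deploy in window"]),
    Hypothesis(
        "DB connection pool exhausted",
        82.0,
        evidence_chain=["db latency rose first", "payments errors followed"],
        suggested_remediation="Increase pool size",
    ),
])
print([h.hypothesis for h in results])
# ['DB connection pool exhausted', 'Bad deploy of checkout']
```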
Tuning RCA Accuracy
You can improve RCA accuracy over time by:
- Providing feedback — Mark RCA results as correct, partially correct, or incorrect
- Adding evidence — The more evidence attached to an incident, the more accurate the analysis
- Maintaining the CMDB — Accurate service dependencies enable better topology traversal
- Resolving incidents fully — Complete resolution summaries feed the RAG knowledge base
- Calibrating confidence thresholds — Adjust minimum confidence for automated actions under Settings → AI
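The effect of the confidence threshold can be illustrated with a short sketch. It assumes the threshold acts as a simple cutoff on each hypothesis's confidence score; the function `eligible_for_automation` and the sample values are hypothetical, and the real setting is changed in the UI under Settings → AI rather than in code.

```python
def eligible_for_automation(hypotheses, min_confidence):
    """Return names of hypotheses whose confidence meets the minimum.

    hypotheses: list of (name, confidence) pairs with confidence in 0-100.
    """
    return [name for name, confidence in hypotheses if confidence >= min_confidence]


scored = [("db pool exhausted", 82.0), ("bad deploy", 35.0), ("dns failure", 60.0)]

print(eligible_for_automation(scored, min_confidence=50))  # ['db pool exhausted', 'dns failure']
print(eligible_for_automation(scored, min_confidence=80))  # ['db pool exhausted']
```

Raising the minimum reduces how often automated actions fire on weaker hypotheses, at the cost of leaving more incidents for manual review.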