RCA Reasoning Engine

The RCA (Root Cause Analysis) reasoning engine is AtlasAI’s core analytical capability. Given an incident with attached evidence, it produces a ranked list of probable root causes with confidence scores and explainable evidence chains.

How RCA Works

The reasoning process follows a structured pipeline:

1. Evidence Collection

The engine gathers all available evidence for the incident:

  • Metrics — Time series data from the affected service and its dependencies (CPU, memory, latency, error rates)
  • Logs — Error logs, warning logs, and anomalous log patterns from the incident timeframe
  • Traces — Distributed traces showing request flow and latency spikes
  • Alerts — Correlated alerts that fired around the same time
  • Topology — Service dependency graph showing upstream and downstream relationships
  • Changes — Recent deployments, configuration changes, or infrastructure modifications
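The evidence bundle above can be pictured as a single container passed through the pipeline. A minimal sketch, assuming a hypothetical `EvidenceBundle` type (the field names mirror the list above but are illustrative, not the engine's actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class EvidenceBundle:
    """Illustrative container for the evidence types the engine collects."""
    metrics: list = field(default_factory=list)   # time series from the service and its dependencies
    logs: list = field(default_factory=list)      # error/warning lines from the incident timeframe
    traces: list = field(default_factory=list)    # distributed traces with latency/error spans
    alerts: list = field(default_factory=list)    # alerts that fired around the same time
    topology: dict = field(default_factory=dict)  # service -> list of upstream dependencies
    changes: list = field(default_factory=list)   # recent deploys and config changes

bundle = EvidenceBundle(metrics=[{"name": "error_rate", "points": [0.01, 0.5]}])
```

Each downstream stage reads from this one bundle, so adding a new evidence type is a matter of adding a field and an analyzer.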

2. Signal Analysis

Each evidence type is analyzed independently:

  • Metric anomaly detection — Statistical analysis identifies metrics that deviated significantly from baseline during the incident window
  • Log pattern clustering — NLP-based grouping of log messages to identify new error patterns that appeared around the incident start time
  • Trace critical path analysis — Identification of spans with abnormal latency or error rates
  • Alert correlation — Temporal and topological correlation of fired alerts
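The first step above, metric anomaly detection, can be sketched with a simple z-score check against the pre-incident baseline. This is an illustrative statistical test, not the engine's actual detector; `window` marks where the incident period begins in each series:

```python
import statistics

def anomalous_metrics(series_by_name, window, threshold=3.0):
    """Flag metrics whose incident-window mean deviates more than
    `threshold` standard deviations from the pre-incident baseline."""
    flagged = []
    for name, points in series_by_name.items():
        baseline = points[:window]   # samples before the incident window
        incident = points[window:]   # samples during the incident
        mu = statistics.mean(baseline)
        sigma = statistics.pstdev(baseline) or 1e-9  # guard against zero variance
        z = abs(statistics.mean(incident) - mu) / sigma
        if z > threshold:
            flagged.append((name, round(z, 1)))
    return flagged

series = {
    "cpu":        [0.40, 0.42, 0.38, 0.41, 0.39, 0.41, 0.40],  # stays near baseline
    "error_rate": [0.01, 0.02, 0.01, 0.02, 0.01, 0.50, 0.60],  # spikes in the window
}
anomalous_metrics(series, window=5)  # only error_rate is flagged
```

In practice a production detector would use seasonal baselines rather than a flat mean, but the deviate-from-baseline idea is the same.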

3. Topology Traversal

The engine walks the service dependency graph to understand failure propagation:

  1. Start from the initially affected service
  2. Check health of all upstream dependencies (services this one depends on)
  3. Check health of all downstream dependents (services that depend on this one)
  4. Identify the service where the anomaly originated (not just where symptoms appeared)
  5. Score each candidate root cause based on temporal ordering (cause precedes effect)

4. RAG Retrieval

The engine queries the knowledge base for historical context:

  • Similar incidents — Past incidents with matching service, error patterns, or metric signatures
  • Known resolutions — How similar incidents were resolved, and whether the fix was effective
  • Runbook history — Which runbooks were executed and their outcomes
  • Documentation — Internal docs, wiki pages, and architecture notes relevant to the affected services
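Similar-incident lookup is typically a vector similarity search over incident embeddings. A minimal cosine-similarity sketch, assuming incidents have already been embedded (the embedding step and the `history` shape are illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def similar_incidents(query_vec, history, top_k=3):
    """history: list of (incident_id, embedding) pairs; best matches first."""
    scored = [(iid, cosine(query_vec, vec)) for iid, vec in history]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:top_k]

history = [("INC-101", [1.0, 0.1]), ("INC-102", [0.0, 1.0])]
similar_incidents([1.0, 0.0], history, top_k=1)  # INC-101 is the closest match
```

A real deployment would use an approximate nearest-neighbor index rather than a linear scan, but the ranking principle is the same.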

5. Hypothesis Generation

Combining all analyses, the engine produces ranked hypotheses:

Each hypothesis carries the following fields:

  • Hypothesis — A natural language description of the probable root cause
  • Confidence — Percentage score (0–100) based on evidence strength
  • Evidence Chain — Ordered list of observations that support this hypothesis
  • Contradictions — Any evidence that weakens this hypothesis
  • Similar Past Incidents — Links to historical incidents with matching patterns
  • Suggested Remediation — Initial recommendation for how to resolve the incident
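One way to picture how supporting and contradicting evidence combine into a confidence percentage is an odds-space update. This is a toy formula chosen for illustration, not the engine's actual scoring model:

```python
def confidence(supporting, contradicting, prior=0.5):
    """Toy confidence score: each supporting observation doubles the
    odds of the hypothesis, each contradiction halves them."""
    odds = prior / (1 - prior)
    odds *= 2 ** supporting
    odds /= 2 ** contradicting
    return round(100 * odds / (1 + odds))  # convert odds back to a percentage

confidence(4, 1)  # 4 supporting observations, 1 contradiction -> 89
```

The useful property this illustrates is that contradictions actively pull a hypothesis down rather than being ignored, which is why the Contradictions field is reported alongside the Evidence Chain.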

Tuning RCA Accuracy

You can improve RCA accuracy over time by:

  • Providing feedback — Mark RCA results as correct, partially correct, or incorrect
  • Adding evidence — The more evidence attached to an incident, the more accurate the analysis
  • Maintaining the CMDB — Accurate service dependencies enable better topology traversal
  • Resolving incidents fully — Complete resolution summaries feed the RAG knowledge base
  • Calibrating confidence thresholds — Adjust minimum confidence for automated actions under Settings → AI