RCA Reasoning Engine

The RCA (Root Cause Analysis) reasoning engine is AtlasAI’s core analytical capability. Given an incident with attached evidence, it produces a ranked list of probable root causes with confidence scores and explainable evidence chains.

How RCA Works

The reasoning process follows a structured pipeline:

1. Evidence Collection

The engine gathers all available evidence for the incident:

Metrics — Time series data from the affected service and its dependencies (CPU, memory, latency, error rates)
Logs — Error logs, warning logs, and anomalous log patterns from the incident timeframe
Traces — Distributed traces showing request flow and latency spikes
Alerts — Correlated alerts that fired around the same time
Topology — Service dependency graph showing upstream and downstream relationships
Changes — Recent deployments, configuration changes, or infrastructure modifications

2. Signal Analysis

Each evidence type is analyzed independently:

Metric anomaly detection — Statistical analysis identifies metrics that deviated significantly from baseline during the incident window
Log pattern clustering — NLP-based grouping of log messages to identify new error patterns that appeared around the incident start time
Trace critical path analysis — Identification of spans with abnormal latency or error rates
Alert correlation — Temporal and topological correlation of fired alerts
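
As a rough illustration of the metric anomaly detection step, the sketch below flags metrics whose incident-window mean lies several baseline standard deviations away from the baseline mean. The z-score test, the function name `anomalous_metrics`, the threshold of 3.0, and the sample values are all assumptions made for illustration; the engine's actual statistical method is not specified here.

```python
from statistics import mean, stdev

def anomalous_metrics(baseline, incident, threshold=3.0):
    """Return {metric: z_score} for metrics that deviate from baseline.

    baseline / incident map a metric name to a list of samples.
    The z-score approach and the threshold are illustrative assumptions.
    """
    flagged = {}
    for name, base_values in baseline.items():
        window = incident.get(name)
        if not window or len(base_values) < 2:
            continue  # not enough data to compare
        sigma = stdev(base_values)
        if sigma == 0:
            continue  # flat baseline: z-score is undefined, skipped in this sketch
        z = (mean(window) - mean(base_values)) / sigma
        if abs(z) >= threshold:
            flagged[name] = z
    return flagged

# Hypothetical samples: latency jumps during the incident, CPU does not.
baseline = {
    "latency_ms": [100, 102, 98, 101, 99],
    "cpu_pct": [40, 42, 41, 39, 40],
}
incident = {
    "latency_ms": [180, 190, 175],
    "cpu_pct": [41, 40, 42],
}
flagged = anomalous_metrics(baseline, incident)
print(sorted(flagged))
```

With these sample values only the latency metric crosses the threshold; CPU stays within normal variation and is ignored.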

3. Topology Traversal

The engine walks the service dependency graph to understand failure propagation:

Start from the initially affected service
Check health of all upstream dependencies (services this one depends on)
Check health of all downstream dependents (services that depend on this one)
Identify the service where the anomaly originated (not just where symptoms appeared)
Score each candidate root cause by temporal ordering: a cause must precede its effects, so services whose anomalies began earlier rank higher
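
The traversal steps above can be sketched as a breadth-first walk over the dependency graph in both directions, followed by a sort on anomaly start time. This is a minimal sketch under stated assumptions: the service names, the `depends_on` mapping, the `anomaly_start` timestamps, and both function names are hypothetical, and the real engine presumably combines temporal ordering with other signals.

```python
from collections import deque

def reachable(start, edges):
    """Return every node reachable from `start` by following `edges` (breadth-first)."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, ()):
            if nxt != start and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def rank_origin_candidates(start, depends_on, anomaly_start):
    """Order anomalous services related to `start` by when their anomaly began."""
    # Invert the dependency edges to find the services that depend on each service.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    upstream = reachable(start, depends_on)    # services `start` depends on
    downstream = reachable(start, dependents)  # services that depend on `start`
    related = {start} | upstream | downstream

    # Keep only services that showed an anomaly; the earliest one comes first.
    candidates = [s for s in related if s in anomaly_start]
    return sorted(candidates, key=lambda s: anomaly_start[s])

# Hypothetical topology: web -> checkout -> {payments -> db, cart}
depends_on = {
    "web": ["checkout"],
    "checkout": ["payments", "cart"],
    "payments": ["db"],
}
# Hypothetical anomaly start times in seconds; "cart" stayed healthy.
anomaly_start = {"db": 100, "payments": 130, "checkout": 150, "web": 160}

ranked = rank_origin_candidates("checkout", depends_on, anomaly_start)
print(ranked)
```

Although the incident was raised on `checkout`, the earliest anomaly in this example belongs to `db`, which makes it the strongest origin candidate under the cause-precedes-effect rule.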

4. RAG Retrieval

The engine queries the knowledge base for historical context:

Similar incidents — Past incidents with matching service, error patterns, or metric signatures
Known resolutions — How similar incidents were resolved, and whether the fix was effective
Runbook history — Which runbooks were executed and their outcomes
Documentation — Internal docs, wiki pages, and architecture notes relevant to the affected services
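
To make the idea of retrieving similar incidents concrete, the sketch below ranks past incidents by the overlap between their tags and the current incident's tags. Jaccard similarity over tag sets is a deliberately simple stand-in chosen for this example; a production RAG pipeline would more likely use embedding-based search, and the incident IDs, tags, and function names here are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similar_incidents(current_tags, history, top_k=2):
    """Return up to top_k (score, incident_id) pairs, most similar first."""
    scored = [(jaccard(current_tags, inc["tags"]), inc["id"]) for inc in history]
    scored = [pair for pair in scored if pair[0] > 0]  # drop unrelated incidents
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Hypothetical knowledge-base entries.
history = [
    {"id": "INC-101", "tags": {"payments", "db-timeout", "latency"}},
    {"id": "INC-102", "tags": {"search", "oom"}},
    {"id": "INC-103", "tags": {"payments", "latency", "deploy"}},
]
current = {"payments", "db-timeout", "latency", "5xx"}

matches = similar_incidents(current, history)
print(matches)
```

The retrieved incidents, together with their recorded resolutions, would then be passed as context into hypothesis generation.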

5. Hypothesis Generation

Combining all analyses, the engine produces ranked hypotheses:

Field	Description
Hypothesis	A natural language description of the probable root cause
Confidence	Percentage score (0–100) based on evidence strength
Evidence Chain	Ordered list of observations that support this hypothesis
Contradictions	Any evidence that weakens this hypothesis
Similar Past Incidents	Links to historical incidents with matching patterns
Suggested Remediation	Initial recommendation for how to resolve the incident
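
One way to picture a hypothesis record is as a small data class with one attribute per field in the table above, ranked by confidence. The class name, attribute names, types, and sample values below are assumptions for illustration and do not describe AtlasAI's actual schema or API.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Hypothesis:
    """Illustrative record mirroring the fields in the table (hypothetical schema)."""
    hypothesis: str
    confidence: float  # percentage score, 0-100
    evidence_chain: List[str] = field(default_factory=list)
    contradictions: List[str] = field(default_factory=list)
    similar_past_incidents: List[str] = field(default_factory=list)
    suggested_remediation: str = ""

    def __post_init__(self):
        # Reject scores outside the documented 0-100 range.
        if not 0 <= self.confidence <= 100:
            raise ValueError("confidence must be between 0 and 100")

def rank(hypotheses):
    """Return hypotheses ordered from most to least confident."""
    return sorted(hypotheses, key=lambda h: h.confidence, reverse=True)

# Hypothetical hypotheses for a single incident.
ranked = rank([
    Hypothesis("Recent deployment introduced a regression", 41.0),
    Hypothesis(
        "Database connection pool exhausted",
        82.0,
        evidence_chain=["db latency rose first", "payments timeouts followed"],
        contradictions=["no pool-size change recorded"],
        suggested_remediation="Increase pool size and restart payments",
    ),
])
print([h.hypothesis for h in ranked])
```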

Tuning RCA Accuracy

You can improve RCA accuracy over time by:

Providing feedback — Mark RCA results as correct, partially correct, or incorrect
Adding evidence — The more evidence attached to an incident, the more accurate the analysis
Maintaining the CMDB — Accurate service dependencies enable better topology traversal
Resolving incidents fully — Complete resolution summaries feed the RAG knowledge base
Calibrating confidence thresholds — Adjust minimum confidence for automated actions under Settings → AI
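
When choosing a minimum confidence for automated actions, it can help to check how often past results at or above a given threshold were marked correct. The sketch below computes that ratio from feedback records. The record format, the verdict labels, the function name, and the sample data are hypothetical, and the threshold itself is still set under Settings → AI.

```python
def precision_at_threshold(feedback, min_confidence):
    """Share of results at or above min_confidence that were marked correct.

    feedback is a list of (confidence, verdict) pairs, where verdict is one of
    "correct", "partially_correct", or "incorrect" (labels assumed for this sketch).
    Returns None when no result meets the threshold.
    """
    eligible = [verdict for confidence, verdict in feedback
                if confidence >= min_confidence]
    if not eligible:
        return None
    return sum(verdict == "correct" for verdict in eligible) / len(eligible)

# Hypothetical feedback collected from past RCA results.
feedback = [
    (92, "correct"),
    (85, "correct"),
    (81, "incorrect"),
    (70, "partially_correct"),
    (60, "incorrect"),
]

for threshold in (80, 85):
    print(threshold, precision_at_threshold(feedback, threshold))
```

In this sample, raising the threshold from 80 to 85 removes the one incorrect high-confidence result, at the cost of automating fewer incidents.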