
User Journey: Alert to Resolution

This journey walks you from an alert or event to a resolved incident using correlation, root cause analysis, and automation. It’s the core flow for on-call and SRE teams.

Overview

Alert/Event → Correlation (optional) → Incident → Evidence → RCA → Runbook → Execute → Resolve

You can enter this flow at several points: alerts can auto-create incidents via correlation rules, or you can create an incident manually and attach evidence. The steps below cover both paths.
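
To make the ordering concrete, the flow above can be modeled as an ordered set of stages in which correlation is optional. This is only an illustrative sketch: the `Stage` enum and `next_stage` helper are hypothetical names, not part of AtlasAI.

```python
from enum import Enum
from typing import Optional


class Stage(Enum):
    """Stages of the journey, in the order shown in the overview."""
    ALERT = "Alert/Event"
    CORRELATION = "Correlation"
    INCIDENT = "Incident"
    EVIDENCE = "Evidence"
    RCA = "RCA"
    RUNBOOK = "Runbook"
    EXECUTE = "Execute"
    RESOLVE = "Resolve"


ORDER = list(Stage)  # Enum members iterate in definition order


def next_stage(current: Stage, use_correlation: bool = True) -> Optional[Stage]:
    """Return the stage after `current`, skipping correlation if it is not used."""
    idx = ORDER.index(current)
    if idx + 1 >= len(ORDER):
        return None  # Resolve is the last stage
    nxt = ORDER[idx + 1]
    if nxt is Stage.CORRELATION and not use_correlation:
        return ORDER[idx + 2]
    return nxt
```

With `use_correlation=False`, the stage after `Stage.ALERT` is `Stage.INCIDENT`, which corresponds to creating the incident manually.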

When to use this journey

  • On-call / SRE: When an alert fires or you’re handed an incident and need to diagnose, remediate, and close it quickly.
  • After correlation: When correlation has already created an incident from multiple events and you want to run RCA and automation.
  • Manual incident: When you create an incident from the UI (e.g. from a war room or external report) and want to attach evidence, run RCA, and optionally automate.

Why each step matters

| Step | Why it matters |
| --- | --- |
| Alerts/events in | Without signals, you have nothing to correlate or attach to incidents. |
| Incident exists | A single incident record ties evidence, RCA, and automation together and tracks resolution. |
| Add evidence | More context (metrics, logs, topology) gives the AI better RCA and runbook suggestions. |
| Run RCA | You get a ranked root cause and suggested actions instead of guessing. |
| Runbook | Automation turns the fix into a repeatable, auditable playbook. |
| Execute | The runbook runs against your environment; you approve steps if required. |
| Resolve | Closing with a resolution summary feeds the learning loop for future incidents. |

Step 1: Alerts and events reach AtlasAI

How alerts get in:

  • Integrations: Prometheus, Datadog, PagerDuty, ServiceNow, Jira, etc. send alerts or webhooks to AtlasAI.
  • Edge Agent: Collectors on your hosts push metrics and logs; alert rules can evaluate and create events.
  • Data sources: Logs, metrics, and traces are ingested; anomaly detection or threshold rules generate events.
  • Manual: You create an incident from the UI without an incoming alert.

Where to configure:

  • CONFIGURE (sidebar) — Integrations and Discovery (or Data Sources if your deployment shows it) — add connectors and configure what gets ingested.
  • Integrations — Connect Prometheus, Datadog, PagerDuty, etc.; map alert payloads to events.
  • Monitoring policies — Under CONFIGURE → Network Monitoring (or Policies); define targets and alert rules for collected metrics.

Outcome: Events and alerts are in the system. Either a correlation rule creates an incident, or you create one manually.
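
As an illustration of what "map alert payloads to events" can look like, the sketch below flattens a Prometheus Alertmanager webhook payload into event records. The input keys (`alerts`, `labels`, `annotations`, `status`, `startsAt`) follow the Alertmanager webhook format; the output event shape and the `alerts_to_events` name are assumptions made for this example and are not AtlasAI's actual schema.

```python
from typing import Any, Dict, List


def alerts_to_events(payload: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten an Alertmanager webhook payload into simple event dicts.

    The returned fields are a hypothetical event shape for illustration.
    """
    events = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        events.append({
            "source": "prometheus",
            "name": labels.get("alertname", "unknown"),
            # Prefer an explicit service label, fall back to the scrape job.
            "service": labels.get("service") or labels.get("job", "unknown"),
            "severity": labels.get("severity", "unknown"),
            "status": alert.get("status", "firing"),
            "started_at": alert.get("startsAt"),
            "summary": annotations.get("summary", ""),
        })
    return events
```

In a real deployment this mapping is configured in the Integrations screen rather than written by hand; the code only shows the kind of field mapping involved.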


Step 2: Incident exists (auto or manual)

If correlation created it:

  • Correlation rules group related events and create (or attach to) an incident.
  • The incident appears in Command Center and Incidents with initial severity and linked evidence.
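
This page does not specify how correlation rules decide that events are related. As one plausible example, the sketch below groups events that hit the same service within a five-minute window of each other. Both the rule and the `correlate` function are assumptions for illustration, not AtlasAI's algorithm.

```python
from datetime import datetime, timedelta
from typing import Any, Dict, List

# Each event is a dict with a "service" (str) and a "time" (datetime) key.
Event = Dict[str, Any]


def correlate(events: List[Event],
              window: timedelta = timedelta(minutes=5)) -> List[List[Event]]:
    """Group events per service when each follows the previous one within `window`."""
    groups: List[List[Event]] = []
    latest: Dict[str, List[Event]] = {}  # service -> most recent group
    for event in sorted(events, key=lambda e: e["time"]):
        group = latest.get(event["service"])
        if group is not None and event["time"] - group[-1]["time"] <= window:
            group.append(event)  # attach to the existing group
        else:
            group = [event]  # start a new group (a new incident candidate)
            groups.append(group)
            latest[event["service"]] = group
    return groups
```

Each returned group corresponds to one candidate incident: a later event either attaches to an existing group or starts a new one, mirroring the "create (or attach to)" behavior described above.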

If you create it manually:

  1. Go to Command Center or Incidents (use the command palette to jump there quickly).
  2. Click New Incident.
  3. Enter a title, severity (P1–P5), category, and affected service. The form validates required fields.
  4. Click Create.

Outcome: You have an incident (e.g. INC-00042) in an Open or Investigating state.
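
The kind of validation the New Incident form performs can be sketched as follows. This example assumes all four fields are required and that severity must be one of P1 to P5; the field names and the `validate_incident` function are hypothetical.

```python
from typing import Dict, List

# Assumption: every field on the form is required.
REQUIRED = ("title", "severity", "category", "affected_service")
SEVERITIES = {"P1", "P2", "P3", "P4", "P5"}


def validate_incident(form: Dict[str, str]) -> List[str]:
    """Return a list of validation errors; an empty list means the form is valid."""
    errors = [
        f"{name} is required"
        for name in REQUIRED
        if not form.get(name, "").strip()
    ]
    severity = form.get("severity", "").strip()
    if severity and severity not in SEVERITIES:
        errors.append("severity must be one of P1-P5")
    return errors
```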


Step 3: Add evidence

Evidence improves RCA accuracy. Attach whatever you have:

  1. Open the incident (click it from Command Center or Incidents list).
  2. In the Evidence panel, click Add Evidence.
  3. Choose type:
    • Metrics — Time-range snapshot (e.g. CPU, memory for the affected service).
    • Logs — Log query or saved search result.
    • Alerts — Correlated alerts that fired around the same time.
    • Topology — Service dependency graph for the affected service.
    • Traces — Relevant trace(s) if APM is connected.
  4. Attach each piece; the more context, the better the RCA.

Outcome: The incident has evidence the AI can use for root cause analysis.


Step 4: Run root cause analysis (RCA)

  1. In the incident view, click Run RCA.
  2. The reasoning engine:
    • Analyzes attached evidence
    • Uses topology to consider upstream/downstream impact
    • Consults historical incidents via RAG
    • Produces ranked root cause hypotheses with confidence scores
  3. Wait 15–45 seconds; results appear in the RCA panel.

What you see:

  • Root cause hypothesis — Most likely cause and confidence.
  • Evidence chain — How the AI reached the conclusion.
  • Related incidents — Past incidents with similar patterns.
  • Impact analysis — Services affected.

Outcome: You have a ranked root cause hypothesis with a confidence score and, often, suggested actions (e.g. a runbook to run).
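
To show what "ranked root cause hypotheses with confidence scores" means in practice, here is a small sketch that sorts hypotheses by confidence and drops weak ones. The `Hypothesis` fields, the 0 to 1 confidence scale, and the `rank` function are assumptions for illustration; the reasoning engine's real output format is not described on this page.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    cause: str
    confidence: float        # assumed scale: 0.0 (no support) to 1.0 (certain)
    evidence: List[str]      # evidence items supporting this hypothesis


def rank(hypotheses: List[Hypothesis],
         min_confidence: float = 0.0) -> List[Hypothesis]:
    """Return hypotheses at or above `min_confidence`, most likely first."""
    kept = [h for h in hypotheses if h.confidence >= min_confidence]
    return sorted(kept, key=lambda h: h.confidence, reverse=True)
```

The first element of the returned list plays the role of the "Root cause hypothesis" in the RCA panel, and its `evidence` list plays the role of the evidence chain.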


Step 5: Generate or select a runbook

If the AI suggests a runbook:

  • Use the Suggested runbook link from the RCA result to open the runbook.
  • Review steps (action, target, risk, reversible).
  • Edit if needed, then go to Runbooks and Approve the runbook (if it’s in draft).

If you want to generate one:

  1. Click Generate Runbook from the incident or RCA panel.
  2. Review the proposed steps; add, remove, or reorder as needed.
  3. Save; the runbook is in your library and can be approved for execution.

Outcome: You have an approved runbook ready to run for this incident.
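
When reviewing steps, the four attributes listed above (action, target, risk, reversible) are what matter. The sketch below models a step with those attributes and flags the ones that deserve a closer look before approval. The risk levels ("low", "medium", "high") and the `needs_review` rule are assumptions, not AtlasAI behavior.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RunbookStep:
    action: str
    target: str
    risk: str          # assumed levels: "low", "medium", "high"
    reversible: bool


def needs_review(steps: List[RunbookStep]) -> List[RunbookStep]:
    """Return steps that are high risk or cannot be undone."""
    return [s for s in steps if s.risk == "high" or not s.reversible]
```

For example, a low-risk but irreversible step such as clearing a cache would still be flagged, because a mistake there cannot be rolled back.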


Step 6: Execute automation

  1. Go to Automation (or use the incident’s Execute action if available).
  2. Start a run with:
    • Runbook — The one you approved.
    • Incident — Link this incident so the run is tracked and the incident can auto-resolve.
    • Variables — Override any runbook variables (e.g. host, service).
  3. If approval is required (e.g. L1 Suggest), Approve each step or Approve all.
  4. Execution runs; you see logs and status per step.

Outcome: The runbook has run; the issue may already be fixed (e.g. service restarted, pod scaled).
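
The interaction between variable overrides and per-step approval can be sketched as a dry run: overrides replace runbook defaults, each step is rendered with the final variables, and the run stops at the first step that is not approved. The function name, the `{placeholder}` step syntax, and the stop-on-rejection behavior are all assumptions for this example; nothing is actually executed.

```python
from typing import Callable, Dict, List


def dry_run(steps: List[str],
            defaults: Dict[str, str],
            overrides: Dict[str, str],
            approve: Callable[[str], bool]) -> List[str]:
    """Render each step and record whether it was approved, without running it."""
    variables = {**defaults, **overrides}  # overrides win over runbook defaults
    log: List[str] = []
    for template in steps:
        command = template.format(**variables)
        if not approve(command):
            log.append(f"REJECTED {command}")
            break  # assumed behavior: stop the run at the first rejected step
        log.append(f"APPROVED {command}")
    return log
```

Passing `approve=lambda command: True` corresponds to Approve all, while a callback that inspects each command corresponds to approving step by step.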


Step 7: Resolve the incident

  1. Confirm the issue is resolved (check metrics/logs).
  2. In the incident, click Resolve.
  3. Add or edit the resolution summary (the AI may pre-fill from actions taken).
  4. Set root cause category (e.g. Resource Exhaustion).
  5. Click Close incident.

Outcome: The incident is closed. The resolution is stored in the knowledge base so future similar incidents get better RCA and runbook suggestions.


Summary: where to go in the UI (quick reference)

| Step | Where in AtlasAI | How |
| --- | --- | --- |
| Alerts/events in | Data Sources, Integrations, Monitoring policies | Configure connectors and rules |
| See/create incident | Command Center, Incidents | Use list (paginated if many); New Incident |
| Add evidence | Incident detail → Evidence panel | Add Evidence → choose type, attach |
| Run RCA | Incident detail → Run RCA | Wait 15–45 s; review hypothesis and evidence chain |
| Runbook | Runbooks (approve) → Automation (execute) | Generate or select; approve; execute with incident linked |
| Resolve | Incident detail → Resolve → Close | Confirm fix; add resolution summary; close |

For large lists (e.g. alerts, incidents), use pagination and filters to find items quickly. See What’s new — lists and pagination.
