User Journey: Alert to Resolution
This journey walks you from an alert or event to a resolved incident using correlation, root cause analysis, and automation. It’s the core flow for on-call and SRE teams.
Overview
Alert/Event → Correlation (optional) → Incident → Evidence → RCA → Runbook → Execute → Resolve
You can enter this flow at several points: alerts can auto-create incidents via correlation rules, or you can create an incident manually and attach evidence. The steps below cover both paths.
When to use this journey
- On-call / SRE: When an alert fires or you’re handed an incident and need to diagnose, remediate, and close it quickly.
- After correlation: When correlation has already created an incident from multiple events and you want to run RCA and automation.
- Manual incident: When you create an incident from the UI (e.g. from a war room or external report) and want to attach evidence, run RCA, and optionally automate.
Why each step matters
| Step | Why it matters |
|---|---|
| Alerts/events in | Without signals, you have nothing to correlate or attach to incidents. |
| Incident exists | A single incident record ties evidence, RCA, and automation together and tracks resolution. |
| Add evidence | More context (metrics, logs, topology) gives the AI better RCA and runbook suggestions. |
| Run RCA | You get a ranked root cause and suggested actions instead of guessing. |
| Runbook | Automation turns the fix into a repeatable, auditable playbook. |
| Execute | The runbook runs against your environment; you approve steps if required. |
| Resolve | Closing with a resolution summary feeds the learning loop for future incidents. |
Step 1: Alerts and events reach AtlasAI
How alerts get in:
- Integrations: Prometheus, Datadog, PagerDuty, ServiceNow, Jira, etc. send alerts or webhooks to AtlasAI.
- Edge Agent: Collectors on your hosts push metrics and logs; alert rules can evaluate and create events.
- Data sources: Logs, metrics, and traces are ingested; anomaly detection or threshold rules generate events.
- Manual: You create an incident from the UI without an incoming alert.
Where to configure:
- CONFIGURE (sidebar) → Integrations and Discovery (or Data Sources, if your deployment shows it): add connectors and configure what gets ingested.
- Integrations — Connect Prometheus, Datadog, PagerDuty, etc.; map alert payloads to events.
- Monitoring policies — Under CONFIGURE → Network Monitoring (or Policies); define targets and alert rules for collected metrics.
Outcome: Events and alerts are in the system. Either a correlation rule creates an incident, or you create one manually.
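To make "map alert payloads to events" concrete, here is a minimal sketch that normalizes a Prometheus Alertmanager webhook payload into a flat event. The `Event` class, its field names, and the use of a `service` label are assumptions for illustration, not AtlasAI's actual schema; only the Alertmanager payload keys (`alerts`, `status`, `labels`, `annotations`, `startsAt`) come from Alertmanager itself. The sample uses whole-second timestamps.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any


@dataclass(frozen=True)
class Event:
    # Hypothetical normalized event; not AtlasAI's actual schema.
    source: str
    name: str
    service: str
    severity: str
    started_at: datetime
    summary: str


def events_from_alertmanager(payload: dict[str, Any]) -> list[Event]:
    """Convert the firing alerts in one Alertmanager webhook payload to events."""
    events = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # this sketch ignores "resolved" notifications
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        # Swap the trailing "Z" for an explicit offset so that
        # datetime.fromisoformat also accepts it on older Python versions.
        started = datetime.fromisoformat(alert["startsAt"].replace("Z", "+00:00"))
        events.append(
            Event(
                source="prometheus",
                name=labels.get("alertname", "unknown"),
                service=labels.get("service", "unknown"),  # assumed label name
                severity=labels.get("severity", "unknown"),
                started_at=started,
                summary=annotations.get("summary", ""),
            )
        )
    return events


sample_payload = {
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "HighCPU", "service": "checkout", "severity": "critical"},
            "annotations": {"summary": "CPU above 90% for 5 minutes"},
            "startsAt": "2024-05-01T10:00:00Z",
        },
        {
            "status": "resolved",
            "labels": {"alertname": "DiskFull", "service": "search", "severity": "warning"},
            "annotations": {},
            "startsAt": "2024-05-01T09:00:00Z",
        },
    ],
}

events = events_from_alertmanager(sample_payload)
```

Here `events` holds a single `HighCPU` event for the `checkout` service, because the second alert is already resolved.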
Step 2: Incident exists (auto or manual)
If correlation created it:
- Correlation rules group related events and create (or attach to) an incident.
- The incident appears in Command Center and Incidents with initial severity and linked evidence.
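The grouping idea behind a correlation rule can be sketched as follows. This is only an illustration under stated assumptions: events are `(service, timestamp)` pairs, and a rule groups events for the same service when each arrives within a fixed window of the previous one. The function name `correlate` and the 5-minute window are hypothetical; actual AtlasAI correlation rules may use different keys and logic.

```python
from datetime import datetime, timedelta


def correlate(events, window=timedelta(minutes=5)):
    """Group (service, timestamp) events by service and time proximity.

    An event joins the latest group for its service when it arrives within
    `window` of that group's most recent event; otherwise it starts a new group.
    """
    groups = []
    latest_group = {}  # service -> index of that service's most recent group
    for service, ts in sorted(events, key=lambda e: e[1]):
        idx = latest_group.get(service)
        if idx is not None and ts - groups[idx][-1][1] <= window:
            groups[idx].append((service, ts))
        else:
            groups.append([(service, ts)])
            latest_group[service] = len(groups) - 1
    return groups


t0 = datetime(2024, 5, 1, 10, 0)
sample_events = [
    ("checkout", t0),
    ("payments", t0 + timedelta(minutes=1)),
    ("checkout", t0 + timedelta(minutes=3)),
    ("checkout", t0 + timedelta(minutes=30)),
]
groups = correlate(sample_events)
```

In this sample the first two `checkout` events fall into one group (a candidate incident), while the `payments` event and the late `checkout` event each start their own group.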
If you create it manually:
- Go to Command Center or Incidents (use the command palette to jump there quickly).
- Click New Incident.
- Enter title, severity (P1–P5), category, and affected service (forms validate required fields).
- Click Create.
Outcome: You have an incident (e.g. INC-00042) in an Open or Investigating state.
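The form checks in the manual path can be pictured with a small sketch. It assumes that all four fields are required and that severity must be one of P1 to P5; the `NewIncident` class and `validate` function are hypothetical names, and which fields the real form marks as required may differ.

```python
from dataclasses import dataclass

VALID_SEVERITIES = ("P1", "P2", "P3", "P4", "P5")


@dataclass
class NewIncident:
    # Hypothetical mirror of the New Incident form fields.
    title: str
    severity: str
    category: str
    affected_service: str


def validate(incident: NewIncident) -> list[str]:
    """Return a list of validation errors; an empty list means the form is valid."""
    errors = []
    for field_name in ("title", "category", "affected_service"):
        if not getattr(incident, field_name).strip():
            errors.append(f"{field_name} is required")
    if incident.severity not in VALID_SEVERITIES:
        errors.append("severity must be one of P1-P5")
    return errors


ok = validate(NewIncident("Checkout latency spike", "P2", "Performance", "checkout"))
bad = validate(NewIncident("", "P9", "Performance", "checkout"))
```

Returning a list of errors rather than raising on the first one lets a form show every problem at once.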
Step 3: Add evidence
Evidence improves RCA accuracy. Attach whatever you have:
- Open the incident (click it from Command Center or Incidents list).
- In the Evidence panel, click Add Evidence.
- Choose type:
  - Metrics: Time-range snapshot (e.g. CPU, memory for the affected service).
  - Logs: Log query or saved search result.
  - Alerts: Correlated alerts that fired around the same time.
  - Topology: Service dependency graph for the affected service.
  - Traces: Relevant trace(s) if APM is connected.
- Attach each piece; the more context, the better the RCA.
Outcome: The incident has evidence the AI can use for root cause analysis.
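As a rough pre-RCA checklist, the five evidence types above can be modeled as an enumeration and compared against what is already attached. The `EvidenceType` enum and `missing_evidence` helper are hypothetical and exist only to illustrate the idea; AtlasAI does not require every type to be present.

```python
from enum import Enum


class EvidenceType(Enum):
    METRICS = "metrics"
    LOGS = "logs"
    ALERTS = "alerts"
    TOPOLOGY = "topology"
    TRACES = "traces"


def missing_evidence(attached):
    """Return the evidence types not yet attached, in declaration order."""
    present = set(attached)
    return [t for t in EvidenceType if t not in present]


gaps = missing_evidence([EvidenceType.METRICS, EvidenceType.LOGS])
```

With only metrics and logs attached, `gaps` lists alerts, topology, and traces as context you could still add before running RCA.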
Step 4: Run root cause analysis (RCA)
- In the incident view, click Run RCA.
- The reasoning engine:
  - Analyzes attached evidence
  - Uses topology to consider upstream/downstream impact
  - Consults historical incidents via RAG
  - Produces ranked root cause hypotheses with confidence scores
- Wait 15–45 seconds; results appear in the RCA panel.
What you see:
- Root cause hypothesis — Most likely cause and confidence.
- Evidence chain — How the AI reached the conclusion.
- Related incidents — Past incidents with similar patterns.
- Impact analysis — Services affected.
Outcome: You have a ranked root cause hypothesis and, in many cases, suggested actions (e.g. a runbook to run).
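One way to read "ranked hypotheses with confidence scores" is sketched below. It assumes confidence is a number between 0 and 1 and that you only act on the top hypothesis when it clears a threshold you choose. The `Hypothesis` class, the 0.7 threshold, and the sample causes are illustrative, not values produced by AtlasAI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    cause: str
    confidence: float  # assumed scale: 0.0 (no support) to 1.0 (certain)


def rank(hypotheses):
    """Order hypotheses from most to least confident."""
    return sorted(hypotheses, key=lambda h: h.confidence, reverse=True)


def top_if_confident(hypotheses, threshold=0.7):
    """Return the leading hypothesis, or None when nothing clears the threshold."""
    ranked = rank(hypotheses)
    if ranked and ranked[0].confidence >= threshold:
        return ranked[0]
    return None


candidates = [
    Hypothesis("Connection pool exhaustion", 0.55),
    Hypothesis("Memory leak in checkout service", 0.82),
    Hypothesis("Upstream DNS failure", 0.12),
]
best = top_if_confident(candidates)
```

When no hypothesis clears the threshold, adding more evidence and rerunning RCA is usually a better next step than acting on a weak guess.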
Step 5: Generate or select a runbook
If the AI suggests a runbook:
- Use the Suggested runbook link from the RCA result to open the runbook.
- Review steps (action, target, risk, reversible).
- Edit if needed, then go to Runbooks and click Approve if the runbook is still in draft.
If you want to generate one:
- Click Generate Runbook from the incident or RCA panel.
- Review the proposed steps; add, remove, or reorder as needed.
- Save. The runbook is added to your library and can then be approved for execution.
Outcome: You have an approved runbook ready to run for this incident.
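When reviewing steps, the four attributes listed above (action, target, risk, reversible) suggest a simple triage: look hardest at steps that are high risk or cannot be undone. The sketch below assumes risk takes the values "low", "medium", or "high"; the `RunbookStep` class and the sample steps are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunbookStep:
    action: str
    target: str
    risk: str  # assumed values: "low", "medium", "high"
    reversible: bool


def steps_needing_review(steps):
    """Return steps that are high risk or cannot be rolled back."""
    return [s for s in steps if s.risk == "high" or not s.reversible]


runbook = [
    RunbookStep("restart service", "checkout", "low", True),
    RunbookStep("scale deployment", "checkout", "medium", True),
    RunbookStep("purge queue", "orders-queue", "high", False),
]
flagged = steps_needing_review(runbook)
```

Here only the `purge queue` step is flagged, so it is the one to read closely before approving the runbook.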
Step 6: Execute automation
- Go to Automation (or use the incident’s Execute action if available).
- Start a run with:
  - Runbook: The one you approved.
  - Incident: Link this incident so the run is tracked and the incident can auto-resolve.
  - Variables: Override any runbook variables (e.g. host, service).
- If approval is required (e.g. L1 Suggest), click Approve on each step or Approve all.
- Execution runs; you see logs and status per step.
Outcome: The runbook has run; the issue may already be fixed (e.g. service restarted, pod scaled).
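Variable overrides and per-step approval can be illustrated with a small dry-run sketch. It assumes steps are text templates with `$name` placeholders, that overrides take precedence over runbook defaults, and that a rejected step stops the run. The `run` function and the sample steps are hypothetical; the sketch records commands rather than executing anything.

```python
from string import Template


def run(steps, defaults, overrides, approve):
    """Render each step with merged variables and gate it on approval.

    Returns a log of (status, command) tuples. Nothing is actually executed.
    """
    variables = {**defaults, **overrides}  # overrides win over defaults
    log = []
    for step in steps:
        command = Template(step).substitute(variables)
        if not approve(command):
            log.append(("skipped", command))
            break  # in this sketch, a rejected step stops the run
        log.append(("executed", command))
    return log


steps = ["restart $service on $host", "scale $service to $replicas replicas"]
defaults = {"service": "checkout", "host": "web-1", "replicas": "3"}

# Approve everything, overriding only the host.
full_run = run(steps, defaults, {"host": "web-2"}, approve=lambda cmd: True)

# Reject any scaling step.
partial_run = run(steps, defaults, {}, approve=lambda cmd: not cmd.startswith("scale"))
```

In `full_run` both steps are rendered against `web-2`; in `partial_run` the restart goes through and the scaling step is logged as skipped.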
Step 7: Resolve the incident
- Confirm the issue is resolved (check metrics/logs).
- In the incident, click Resolve.
- Add or edit the resolution summary (the AI may pre-fill from actions taken).
- Set root cause category (e.g. Resource Exhaustion).
- Click Close incident.
Outcome: The incident is closed. The resolution is stored in the knowledge base so future similar incidents get better RCA and runbook suggestions.
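The incident lifecycle implied by this journey can be sketched as a small state machine. The state names (`open`, `investigating`, `resolved`, `closed`), the allowed transitions, and the rule that closing requires a resolution summary are assumptions inferred from the steps above, not a documented AtlasAI model.

```python
ALLOWED_TRANSITIONS = {
    "open": {"investigating", "resolved"},
    "investigating": {"resolved"},
    "resolved": {"closed"},
    "closed": set(),
}


def transition(state, new_state, resolution_summary=""):
    """Return the new state, or raise ValueError if the move is not allowed."""
    if new_state not in ALLOWED_TRANSITIONS[state]:
        raise ValueError(f"cannot move from {state} to {new_state}")
    if new_state == "closed" and not resolution_summary.strip():
        raise ValueError("a resolution summary is required to close")
    return new_state


state = "open"
state = transition(state, "investigating")
state = transition(state, "resolved")
state = transition(state, "closed", "Restarted checkout after memory exhaustion")
```

Requiring the summary at close time reflects the point made above: the resolution text is what feeds future RCA and runbook suggestions.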
Summary: where to go in the UI (quick reference)
| Step | Where in AtlasAI | How |
|---|---|---|
| Alerts/events in | Data Sources, Integrations, Monitoring policies | Configure connectors and rules |
| See/create incident | Command Center, Incidents | Open from the list (paginated when long), or click New Incident |
| Add evidence | Incident detail → Evidence panel | Add Evidence → choose type, attach |
| Run RCA | Incident detail → Run RCA | Wait 15–45 s; review hypothesis and evidence chain |
| Runbook | Runbooks (approve) → Automation (execute) | Generate or select; approve; execute with incident linked |
| Resolve | Incident detail → Resolve → Close | Confirm fix; add resolution summary; close |
For large lists (e.g. alerts, incidents), use pagination and filters to find items quickly. See What’s new — lists and pagination.
See also
- Your first incident — Tutorial version of this flow
- Incidents module — Incident lifecycle and features
- RCA Lab — Deeper RCA options
- Runbooks and Automation — Runbook authoring and execution
- War room & major incidents — For major incidents and coordination
- Using the interface — Command palette, sidebar, forms, and error handling