OSMC 2025: Agentic Incident Response by Birol Yildiz.pdf

It's 2:37 a.m. ilert rings.
01
Login to five tools
Context scattered across tools
02
Ten people join Slack
Frantic coordination begins
03
The fix? Rollback... again
Same pattern, different night
04
By 4 a.m., everyone's exhausted
Another night lost to manual toil
The future of Root Cause Analysis is agentic - with human oversight.

AI-First Incident Response
Used by:
DevOps & SRE
IT Ops
MSPs
Used to:
Reduce MTTR & MTTA
Increase Productivity
Reduce Costs

What does an SRE actually do?
System Design
Creative, architectural
Development
Increasingly AI-assisted
Troubleshooting
Still human-heavy
AI won't replace design. It will remove toil.

How RCA Works - and Why It Hurts
1
Triage
Gather logs, metrics, deployments across fragmented systems
2
Diagnose
Correlate signals manually, ask "what changed?" repeatedly
3
Mitigate
Rollback, patch, restart - hope it sticks
Reality:
Manual
Fragmented
Slow
Tribal knowledge
"Troubleshooting is the last part of DevOps that still wakes people up."

Evolution of LLM apps
How it started
Text generation, summarization, ...
[Diagram: Input -> LLM -> Output]
Workflows
LLMs orchestrated by code
[Diagram: Input, three LLM calls, Aggregator, Output]
Agents
LLMs making their own decisions
[Diagram: Human, LLM, Environment; Action and Feedback loop between LLM and Environment; Stop]

Agents are LLMs using tools in a reasoning loop
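
A minimal sketch of such a reasoning loop, not taken from the talk. The model is replaced by a scripted stand-in (`scripted_model`), and the tool names, service name, and return strings are all hypothetical. It only illustrates the control flow: the model decides, a tool runs, the observation is fed back, and the loop stops on a final answer or a step limit.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


# Hypothetical tools; real ones would query monitoring or deployment systems.
def get_error_rate(service: str) -> str:
    return f"{service}: error rate 12% (baseline 0.5%)"


def get_recent_deployments(service: str) -> str:
    return f"{service}: v42 deployed 10 minutes before the alert"


TOOLS: Dict[str, Callable[[str], str]] = {
    "get_error_rate": get_error_rate,
    "get_recent_deployments": get_recent_deployments,
}


@dataclass
class Decision:
    tool: Optional[str] = None  # tool to call next, or None when finished
    argument: str = ""
    answer: str = ""


def scripted_model(history: List[str]) -> Decision:
    """Stand-in for an LLM call: chooses the next step from what it has seen."""
    observations = [h for h in history if h.startswith("observation:")]
    if len(observations) == 0:
        return Decision(tool="get_error_rate", argument="checkout")
    if len(observations) == 1:
        return Decision(tool="get_recent_deployments", argument="checkout")
    return Decision(answer="Likely cause: deployment v42; suggest rollback")


def run_agent(
    task: str,
    model: Callable[[List[str]], Decision],
    max_steps: int = 5,
) -> Tuple[str, List[str]]:
    history = [f"task: {task}"]
    for _ in range(max_steps):
        decision = model(history)
        if decision.tool is None:  # the model chose to stop
            return decision.answer, history
        result = TOOLS[decision.tool](decision.argument)  # act on the environment
        history.append(f"observation: {result}")  # feed the result back
    return "Stopped: step limit reached", history


if __name__ == "__main__":
    answer, trace = run_agent("checkout is throwing errors", scripted_model)
    print(answer)
```

The step limit is a design choice in this sketch: it keeps a confused model from looping forever and gives a natural point to hand over to a human.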

Model Context Protocol (MCP) standardized integrating external APIs into your LLM
Universal Adapter
The "USB port" for AI
Augmented Context
Connects to all your tools and data sources
Secure Access
AI queries APIs within a governed framework
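
To make the "universal adapter" idea concrete: MCP messages are JSON-RPC 2.0, and a client invokes a tool with the `tools/call` method. The sketch below builds such a request and answers it with a tiny in-process handler. The tool `get_alerts`, its argument, and its return text are hypothetical, and no real transport or MCP SDK is used, so treat it as an illustration of the message shape rather than a working server.

```python
import json
from itertools import count
from typing import Callable, Dict

_ids = count(1)  # simple request id generator


def make_request(method: str, params: dict) -> str:
    return json.dumps(
        {"jsonrpc": "2.0", "id": next(_ids), "method": method, "params": params}
    )


def call_tool_request(name: str, arguments: dict) -> str:
    """Build a JSON-RPC request for the MCP `tools/call` method."""
    return make_request("tools/call", {"name": name, "arguments": arguments})


# Hypothetical tool exposed by our stand-in server.
def get_alerts(service: str) -> str:
    return f"{service}: 2 open alerts"


SERVER_TOOLS: Dict[str, Callable[..., str]] = {"get_alerts": get_alerts}


def handle(raw: str) -> str:
    """Stand-in for a server: dispatches tools/call and returns text content."""
    req = json.loads(raw)
    reply = {"jsonrpc": "2.0", "id": req["id"]}
    if req["method"] != "tools/call":
        reply["error"] = {"code": -32601, "message": "Method not found"}
        return json.dumps(reply)
    tool = SERVER_TOOLS.get(req["params"]["name"])
    if tool is None:
        reply["error"] = {"code": -32602, "message": "Unknown tool"}
        return json.dumps(reply)
    text = tool(**req["params"]["arguments"])
    reply["result"] = {"content": [{"type": "text", "text": text}], "isError": False}
    return json.dumps(reply)


if __name__ == "__main__":
    print(handle(call_tool_request("get_alerts", {"service": "checkout"})))
```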

Live Demo - Let's Watch an Agent Do RCA

Why Now?
Observability Maturity
Data exists - MELT (metrics, events, logs, traces) is everywhere
Reasoning Models
AI can now connect the dots
across complex systems
Cultural acceptance & readiness
Less toil, more strategic engineering work
"The bottleneck isn't data - it's attention. AI gives us scalable attention."

What We Learned Building AI SRE
Six key insights from building production AI agents

Build for Reasoning, Not Retrieval
Predefined workflows fail in enterprise incidents. Reasoning models
adapt and improve.

Keep Prompts Minimal, Schemas Strong
Compact instructions + enforced JSON schemas reduce hallucinations.
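
One way to read "enforced schema" is that model output is rejected unless it parses and matches an expected shape. The sketch below does this with the standard library only. The field names and the list of allowed actions are assumptions made for illustration, not the schema used in the talk.

```python
import json

# Hypothetical output schema for an RCA finding: field name -> accepted type(s).
FINDING_SCHEMA = {
    "root_cause": str,
    "confidence": (int, float),
    "evidence": list,
    "suggested_action": str,
}
ALLOWED_ACTIONS = {"rollback", "restart", "escalate", "none"}


def parse_finding(raw: str) -> dict:
    """Parse model output and raise ValueError unless it matches the schema."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = FINDING_SCHEMA.keys() - data.keys()
    extra = data.keys() - FINDING_SCHEMA.keys()
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} extra={sorted(extra)}")
    for key, expected in FINDING_SCHEMA.items():
        if not isinstance(data[key], expected):
            raise ValueError(f"field {key!r} has the wrong type")
    if not 0.0 <= data["confidence"] <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if data["suggested_action"] not in ALLOWED_ACTIONS:
        raise ValueError("suggested_action is not an allowed value")
    return data


if __name__ == "__main__":
    good = (
        '{"root_cause": "bad deploy", "confidence": 0.8, '
        '"evidence": ["error spike"], "suggested_action": "rollback"}'
    )
    print(parse_finding(good)["suggested_action"])
```

Restricting `suggested_action` to a closed set is what keeps a free-text suggestion from leaking into downstream automation.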

Right-Size Your Tools
Around 20 well-scoped tools make agents faster and more stable.
Too few tools limit capability, too many create confusion and slow
decision-making.

Autonomy ≠ Unsupervised
Human approval required for high-impact actions. AI executes safely.
Autonomous doesn't mean unsupervised. The most effective AI SRE
agents operate with guardrails: they can act independently for
routine, safe operations but require human approval for critical or
potentially risky decisions.
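
A guardrail of this kind can be sketched as an approval gate in front of action execution. The action names and the split into safe and high-impact sets are hypothetical, and `approve` stands in for whatever human approval channel is used (chat button, pager acknowledgement, and so on).

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical risk classification; a real system would define its own lists.
SAFE_ACTIONS = {"fetch_logs", "query_metrics", "add_incident_note"}
HIGH_IMPACT_ACTIONS = {"rollback_deployment", "restart_service", "scale_down"}


@dataclass(frozen=True)
class Action:
    name: str
    target: str


def execute(action: Action, approve: Callable[[Action], bool]) -> str:
    """Run safe actions directly; ask a human before high-impact ones."""
    if action.name in SAFE_ACTIONS:
        return f"executed {action.name} on {action.target}"
    if action.name in HIGH_IMPACT_ACTIONS:
        if approve(action):
            return f"executed {action.name} on {action.target} (approved)"
        return f"skipped {action.name} on {action.target} (approval denied)"
    # Anything unclassified is refused rather than guessed at.
    return f"refused {action.name}: not on any allowlist"


if __name__ == "__main__":
    deny_all = lambda action: False
    print(execute(Action("query_metrics", "checkout"), deny_all))
    print(execute(Action("rollback_deployment", "checkout"), deny_all))
```

Refusing unknown actions by default is a deliberate choice in this sketch: new capabilities have to be classified before the agent can use them.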

Own Your MCP Servers
Vendor MCPs might not be optimized for your use case. Build purpose-built, consistent integrations.

Use an LLM Orchestration Layer
The LLM orchestration layer allows seamless swapping of models (e.g.,
GPT-4o Mini, Claude Sonnet) based on performance, cost, or
compliance needs, ensuring future-proof reliability.
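
A rough sketch of what such a layer might look like, assuming each model is wrapped behind the same "prompt in, text out" callable. The model names, cost and quality numbers, and region sets are made up for illustration, and the backends are stubs rather than real provider clients.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List


@dataclass(frozen=True)
class ModelProfile:
    name: str
    cost_per_1k_tokens: float  # made-up number
    quality: int               # made-up relative score
    regions: FrozenSet[str]    # where the model may be used (compliance)


def pick_model(profiles: List[ModelProfile], min_quality: int, region: str) -> ModelProfile:
    """Cheapest model that meets the quality bar and the region constraint."""
    candidates = [
        p for p in profiles if p.quality >= min_quality and region in p.regions
    ]
    if not candidates:
        raise LookupError("no model satisfies the requested constraints")
    return min(candidates, key=lambda p: p.cost_per_1k_tokens)


class Orchestrator:
    def __init__(self, profiles: List[ModelProfile], backends: Dict[str, Callable[[str], str]]):
        self.profiles = profiles
        self.backends = backends  # model name -> callable taking a prompt

    def complete(self, prompt: str, min_quality: int = 1, region: str = "eu") -> str:
        profile = pick_model(self.profiles, min_quality, region)
        return self.backends[profile.name](prompt)


# Hypothetical models and stub backends.
PROFILES = [
    ModelProfile("model-small", 0.1, 2, frozenset({"eu", "us"})),
    ModelProfile("model-large", 1.0, 5, frozenset({"us"})),
]
BACKENDS = {
    "model-small": lambda prompt: f"[model-small] {prompt}",
    "model-large": lambda prompt: f"[model-large] {prompt}",
}

if __name__ == "__main__":
    orchestrator = Orchestrator(PROFILES, BACKENDS)
    print(orchestrator.complete("summarize the incident", min_quality=4, region="us"))
```

Because callers only state requirements (quality, region), swapping or adding a model means editing the profile list, not the agent code.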

Under the Hood
Reasoning Models
GPT-5, Claude Sonnet 4.5 - swap based on needs
LLM Orchestration Layer
Model-agnostic design for TPM, compliance flexibility
Custom MCP Servers
Consistent, compliant, deep integrations we control
Schema-driven Reasoning
Minimal prompt, structured output, predictable behavior
"Autonomy is built on structure - not creativity."

Outlook: From Smart Agents to Autonomous Reliability
Scaling Context Intelligence
Orchestrating multiple agents to work with very large
context. Future agents must:
Search and reason intelligently to pull the right 0.1%
Utilize multi-agent orchestration for telemetry, code,
and change data
Killing the 3 a.m. Wakeup
Routine incidents will resolve autonomously,
transforming SRE roles:
Agents verify fixes with regression checks and
metrics baselines
Humans are paged only when AI gets stuck, not for
every minor issue
"The real promise of AI SRE: fewer interruptions, more strategic engineering. You sleep through the night and wake up to
a report."
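
The verification step described above (regression checks plus a metrics baseline, with a human paged only when something is off) could look roughly like this. The three-sigma threshold, the check names, and the sample error rates are assumptions chosen for illustration.

```python
from statistics import mean, stdev
from typing import Callable, Dict, List


def within_baseline(baseline: List[float], current: List[float], sigmas: float = 3.0) -> bool:
    """True if the current mean stays below baseline mean + sigmas * stdev."""
    limit = mean(baseline) + sigmas * stdev(baseline)  # needs >= 2 baseline points
    return mean(current) <= limit


def verify_fix(
    baseline_error_rates: List[float],
    post_fix_error_rates: List[float],
    regression_checks: Dict[str, Callable[[], bool]],
) -> str:
    """Return 'resolved' or a reason to page a human."""
    failed = [name for name, check in regression_checks.items() if not check()]
    if failed:
        return "page_human: failed checks " + ", ".join(sorted(failed))
    if not within_baseline(baseline_error_rates, post_fix_error_rates):
        return "page_human: error rate above baseline"
    return "resolved"


if __name__ == "__main__":
    # Hypothetical checks; real ones would hit health endpoints or run smoke tests.
    checks = {"health_endpoint": lambda: True, "login_smoke_test": lambda: True}
    print(verify_fix([0.4, 0.5, 0.6, 0.5], [0.5, 0.6], checks))
```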
