It's 2:37 a.m. ilert rings.
01
Log in to five tools
Context scattered across tools
02
Ten people join Slack
Frantic coordination begins
03
The fix? Rollback... again
Same pattern, different night
04
By 4 a.m., everyone's exhausted
Another night lost to manual toil
The future of Root Cause Analysis is agentic - with human oversight.
AI-First Incident Response
Used by:
DevOps & SRE
IT Ops
MSPs
Used to:
Reduce MTTR & MTTA
Increase Productivity
Reduce Costs
What does an SRE actually do?
System Design
Creative, architectural
Development
Increasingly AI-assisted
Troubleshooting
Still human-heavy
AI won't replace design. It will remove toil.
How RCA Works - and Why It Hurts
Triage
Gather logs, metrics, deployments across fragmented systems
Diagnose
Correlate signals manually, ask "what changed?" repeatedly
Mitigate
Rollback, patch, restart - hope it sticks
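The "what changed?" question in the Diagnose step can be sketched as a time-window filter over a merged change log. This is a minimal illustration, not the tooling from the talk: the `recent_changes` helper, the record layout, and all sample data are hypothetical.

```python
from datetime import datetime, timedelta

def recent_changes(changes, incident_at, window=timedelta(hours=1)):
    """Return changes that landed within `window` before the incident, newest first."""
    start = incident_at - window
    hits = [c for c in changes if start <= c["at"] <= incident_at]
    return sorted(hits, key=lambda c: c["at"], reverse=True)

# Hypothetical change log merged from deploy, config, and feature-flag systems.
changes = [
    {"at": datetime(2025, 1, 10, 0, 5), "kind": "deploy", "what": "checkout v41"},
    {"at": datetime(2025, 1, 10, 2, 10), "kind": "config", "what": "db pool size 50 -> 20"},
    {"at": datetime(2025, 1, 10, 2, 30), "kind": "deploy", "what": "checkout v42"},
]
incident_at = datetime(2025, 1, 10, 2, 37)

suspects = recent_changes(changes, incident_at)
for c in suspects:
    print(c["at"].time(), c["kind"], c["what"])
```

The hard part in practice is the step this sketch takes for granted: getting deploys, config edits, and flag flips into one list in the first place.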
Reality:
Manual
Fragmented
Slow
Tribal knowledge
"Troubleshooting is the last part of DevOps that still wakes people up."
What is an agent, again?
Evolution of LLM apps
How it started
Text generation, summarization, ...
Input → LLM → Output
Workflows
LLMs orchestrated by code
Input → LLM / LLM / LLM → Aggregator → Output
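A workflow in this sense might look like the sketch below: the code fixes the steps, and the model only fills them in. `fake_llm` is a stand-in so the example runs without a model, and the parallel fan-out is one assumed reading of the diagram.

```python
from concurrent.futures import ThreadPoolExecutor

def fake_llm(prompt: str) -> str:
    """Stand-in for a model call; a real system would call an LLM API here."""
    return f"summary of [{prompt}]"

def run_workflow(incident: str, sources: list[str]) -> str:
    # The code, not the model, decides the steps: one fixed call per source.
    prompts = [f"{source} for {incident}" for source in sources]
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(fake_llm, prompts))  # map() keeps input order
    # Aggregator step: a final call combines the partial results.
    return fake_llm(" | ".join(partials))

result = run_workflow("INC-1", ["logs", "metrics"])
print(result)
```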
Agents
LLMs making their own decisions
Human → LLM ⇄ Environment (Action, Feedback) → Stop
Agents are LLMs using tools in a reasoning loop
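A reasoning loop of this kind can be sketched in a few lines. Everything here is illustrative: the two tools return canned data, and `scripted_policy` stands in for the LLM so the loop runs offline. In a real agent, the policy would be a model call that sees the history and picks the next tool.

```python
def get_error_rate(service: str) -> str:
    return f"{service}: error rate 12% since 02:30"   # canned, hypothetical data

def get_recent_deploys(service: str) -> str:
    return f"{service}: v42 deployed at 02:30"        # canned, hypothetical data

TOOLS = {"get_error_rate": get_error_rate, "get_recent_deploys": get_recent_deploys}

def scripted_policy(history):
    """Stand-in for the LLM: picks the next step from what has been observed so far."""
    used = [step["tool"] for step in history]
    for name in TOOLS:
        if name not in used:
            return {"tool": name, "args": {"service": "checkout"}}
    return {"stop": "Likely cause: deploy v42 at 02:30 (matches error spike)"}

def run_agent(policy, max_steps=5):
    history = []
    for _ in range(max_steps):                 # hard cap so the loop always terminates
        decision = policy(history)
        if "stop" in decision:
            return decision["stop"], history
        observation = TOOLS[decision["tool"]](**decision["args"])                # action
        history.append({"tool": decision["tool"], "observation": observation})  # feedback
    return "step limit reached", history

answer, trace = run_agent(scripted_policy)
print(answer)
```

The difference from a workflow is who chooses the next step: here the policy decides after every observation, and the code only enforces the step limit.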
Model Context Protocol (MCP) standardizes integrating external APIs into your LLM
Universal Adapter
The "USB port" for AI
Augmented Context
Connects to all your tools and data sources
Secure Access
AI queries APIs within a governed framework
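On the wire, MCP messages are JSON-RPC 2.0, and a tool invocation uses the `tools/call` method. The sketch below only builds such a request and does not talk to a server. The tool name `list_recent_deploys` and its arguments are hypothetical.

```python
import json

def tools_call_request(request_id: int, tool: str, arguments: dict) -> str:
    """Build an MCP `tools/call` request, which is a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })

# `list_recent_deploys` and its arguments are hypothetical.
msg = tools_call_request(1, "list_recent_deploys", {"service": "checkout", "limit": 5})
print(msg)
```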
Live Demo - Let's Watch an Agent Do RCA
Why Now?
Observability Maturity
Data exists - MELT (metrics, events, logs, traces) is everywhere
Reasoning Models
AI can now connect the dots
across complex systems
Cultural acceptance & readiness
Less toil, more strategic engineering work
"The bottleneck isn't data - it's attention. AI gives us scalable attention."
What We Learned Building AI SRE
Six key insights from building production AI agents
Build for Reasoning, Not Retrieval
Predefined workflows fail in enterprise incidents. Reasoning models
adapt and improve.
Keep Prompts Minimal, Schemas Strong
Compact instructions + enforced JSON schemas reduce hallucinations.
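One way to read "enforced" is that model output is parsed and rejected unless it matches a fixed shape. The sketch below uses a hand-rolled check with the standard library only, as a stand-in for a real JSON Schema validator or a provider's structured-output feature. The `FINDING_SCHEMA` fields are hypothetical.

```python
import json

# Hypothetical schema for an RCA finding: field name -> required Python type.
FINDING_SCHEMA = {"root_cause": str, "confidence": float, "evidence": list}

def parse_finding(raw: str) -> dict:
    """Parse model output and reject anything that does not match the schema."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if set(data) != set(FINDING_SCHEMA):
        mismatch = sorted(set(data) ^ set(FINDING_SCHEMA))
        raise ValueError(f"unexpected or missing fields: {mismatch}")
    for field, expected in FINDING_SCHEMA.items():
        if not isinstance(data[field], expected):
            raise ValueError(f"{field} must be {expected.__name__}")
    if not 0.0 <= data["confidence"] <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return data

good = parse_finding(
    '{"root_cause": "deploy v42", "confidence": 0.8, "evidence": ["error spike at 02:30"]}'
)

try:
    parse_finding('{"root_cause": "deploy v42", "notes": "probably"}')
    rejected = False
except ValueError:
    rejected = True  # free-form extras and missing fields never reach downstream code
```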
Right-Size Your Tools
Around 20 well-scoped tools make agents faster and more stable.
Too few tools limit capability, too many create confusion and slow
decision-making.
Autonomy ≠ Unsupervised
Human approval required for high-impact actions. AI executes safely.
Autonomous doesn't mean unsupervised. The most effective AI SRE
agents operate with guardrails: they act independently for
routine, safe operations but require human approval for critical or
potentially risky decisions.
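A guardrail like this can be sketched as an allowlist with default-deny: only actions known to be safe run unattended, and anything else waits for a named approver. The action names and the `execute` helper are hypothetical, and a real deployment would define its own risk policy.

```python
from dataclasses import dataclass

# Hypothetical risk tiers; a real deployment would define its own policy.
SAFE_ACTIONS = {"fetch_logs", "query_metrics", "add_incident_note"}

@dataclass
class Action:
    name: str
    target: str

def execute(action: Action, approved_by=None) -> str:
    """Run safe actions directly; hold everything else until a human approves."""
    if action.name in SAFE_ACTIONS:
        return f"executed {action.name} on {action.target}"
    if approved_by is None:
        return f"pending approval: {action.name} on {action.target}"
    return f"executed {action.name} on {action.target} (approved by {approved_by})"

print(execute(Action("query_metrics", "checkout")))
print(execute(Action("rollback_deploy", "checkout")))
print(execute(Action("rollback_deploy", "checkout"), approved_by="on-call"))
```

Listing the safe actions rather than the risky ones means a newly added tool requires approval until someone decides otherwise.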
Own Your MCP Servers
Vendor MCPs might not be optimized for your use case. Build purpose-built,
consistent integrations.
Use an LLM Orchestration Layer
The LLM orchestration layer allows seamless swapping of models (e.g.,
GPT-4o Mini, Claude Sonnet) based on performance, cost, or
compliance needs, ensuring future-proof reliability.
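A minimal sketch of such a layer, assuming every backend is hidden behind the same `complete` callable and selected by cost under a compliance constraint. The backend names, prices, and the `eu_hosted` flag are illustrative placeholders, not real model data.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ModelBackend:
    name: str
    cost_per_1k_tokens: float      # illustrative numbers, not real pricing
    eu_hosted: bool                # hypothetical compliance attribute
    complete: Callable[[str], str]

def pick_backend(backends, require_eu=False):
    """Choose the cheapest backend that satisfies the compliance constraint."""
    eligible = [b for b in backends if b.eu_hosted or not require_eu]
    if not eligible:
        raise LookupError("no backend satisfies the constraints")
    return min(eligible, key=lambda b: b.cost_per_1k_tokens)

backends = [
    ModelBackend("model-a", 0.5, False, lambda p: f"[model-a] {p}"),
    ModelBackend("model-b", 2.0, True, lambda p: f"[model-b] {p}"),
]

default = pick_backend(backends)
regulated = pick_backend(backends, require_eu=True)
print(default.complete("summarize incident"))
print(regulated.complete("summarize incident"))
```

Because callers only see `complete`, swapping a model is a change to the `backends` list rather than to agent code.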
Under the Hood
Reasoning Models
GPT-5, Claude Sonnet 4.5 - swap based on needs
LLM Orchestration Layer
Model-agnostic design for TPM, compliance flexibility
Custom MCP Servers
Consistent, compliant, deep integrations we control
Schema-driven Reasoning
Minimal prompt, structured output, predictable behavior
"Autonomy is built on structure - not creativity."
Outlook: From Smart Agents to Autonomous Reliability
Scaling Context Intelligence
Orchestrating multiple agents to work with very large
context. Future agents must:
Search and reason intelligently to pull the right 0.1%
Utilize multi-agent orchestration for telemetry, code,
and change data
Killing the 3 a.m. Wakeup
Routine incidents will resolve autonomously,
transforming SRE roles:
Agents verify fixes with regression checks and
metrics baselines
Humans are paged only when AI gets stuck, not for
every minor issue
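Verifying a fix against a metrics baseline, the first point above, could be as simple as the check below, with a failed check being the signal to page a human. This is a sketch under assumptions: the three-sigma threshold, the `fix_verified` helper, and the sample error rates are all hypothetical.

```python
from statistics import mean, stdev

def fix_verified(baseline, after_fix, max_sigma=3.0):
    """True if the post-fix average sits within `max_sigma` std devs of the baseline mean."""
    mu, sigma = mean(baseline), stdev(baseline)
    return abs(mean(after_fix) - mu) <= max_sigma * sigma

# Hypothetical error rates in percent, sampled once per minute.
baseline = [0.4, 0.5, 0.6, 0.5, 0.4, 0.6]
still_broken = [11.8, 12.1, 12.4]
recovered = [0.6, 0.5, 0.5]

print(fix_verified(baseline, still_broken))
print(fix_verified(baseline, recovered))
```

A perfectly flat baseline has zero standard deviation, so a production check would need a minimum tolerance as well.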
"The real promise of AI SRE: fewer interruptions, more strategic engineering. You sleep through the night and wake up to
a report."
Thank You!
Questions?

OSMC 2025: Agentic Incident Response by Birol Yildiz.pdf
