OSMC 2025: Generative AI for Observability: Automating Root Cause Analysis in Modern IT Systems by Kamal Bisht.pdf

©2025 Discover, a division of Capital One, N.A. Opinions are those of the individual
author. Unless noted otherwise in this post, Discover is not affiliated with, nor
endorsed by, any of the companies mentioned. All trademarks and other intellectual
property used or displayed are property of their respective owners

HI I’M KAMAL SINGH BISHT
Observability & AI/ML Technologist | IEEE Senior Member | Author & Speaker
▪ A seasoned technologist with extensive experience spanning observability,
AI/ML, and cybersecurity.
▪ Currently at Discover, with prior senior engineering roles at
JPMorgan Chase and Zillow.
▪ IEEE Senior Member, conference speaker, and author of multiple
AI-driven observability publications.
https://www.linkedin.com/in/kmluvce/

AGENDA
• Overview
• Challenges
• Architecture
• RCA Workflow
• Improvements
• Takeaways

THE STATE OF MODERN IT SYSTEMS
• Explosion in telemetry data (logs, metrics, traces)
• Complex microservices with distributed failure domains
• Manual Root Cause Analysis (RCA) still dominates incident workflows
• High Mean Time to Detect and Mean Time to Resolve

WHY TRADITIONAL AIOPS FALLS SHORT

ARCHITECTURE OVERVIEW
Dynamic & Adaptive Behavior –
1 2 3
4

DATA SOURCE
Data Source
Monitoring
Tools
GenAI- RCA GenAI- RCA
Type 1 Type 2
Alerts
Logs/
Metrics/
Traces

DATA SOURCE
Data Source
Monitoring
Tools
GenAI- RCA GenAI- RCA
Type 3: Hybrid
Alerts
Logs/
Metrics/
Traces

MULTIMODAL CONTEXT FUSION LAYER

LLM PROMPT FLOW
System Message
“You are an AI-driven RCA engine. Analyze logs,
metrics, traces, and alerts to identify the most
probable root cause, summarize impact, and
recommend fixes. Keep output concise and evidence-
based.”
User Message (Telemetry Input)
{
"incident_id": "INC-001",
"logs": [
{"svc": "checkout", "msg": "inventory lookup failed"},
{"svc": "inventory", "msg": "db timeout"}
]
}
LLM
Generate
RCA
Prompting
• Zero-shot
• Few-Shot
• Chain-Of-Thought(COT)

LLM COMPARISON
Model Type Key Strengths Limitations /
Considerations
GPT-4 Turbo (OpenAI) Proprietary
(Closed-
Source, API-
based)
Proven reasoning
accuracy, reliable RCA
summaries
Cost, black-box,
Data privacy &
compliance
Llama 3 (Meta) Open-source Strong generalization,
fine-tunable
GPU-intensive
Mistral
(mistral AI)
Open-source lightweight, fast, cost-
efficient, on-prem ready
Limited
reasoning
depth
Sources: OpenAI (GPT-4 Turbo Docs, 2025); Meta AI (Llama 3 Model Card, 2025); Mistral AI (Official Model Documentation, 2025); Author’s empirical evaluation, 2025.

TECHNICAL & ARCHITECTURAL LIMITATIONS
Limitation Mitigation
Model Hallucination RAG integration
Context Window Sliding memory buffers
Compute Demand Vector caching
Integration Modular API design

RESOURCES
GitHub:
https://github.com/rootiq-ai
Published Papers:
• Generative AI–Driven Observability for Automated Root Cause Analysis in Modern IT
Systems: Architecture and Vision
• Convergence of AI and Observability: Predictive Insights Automation in Modern IT Operations
Hugging Face
• https://huggingface.co/
LangChain
• https://www.langchain.com/
LangGraph
• https://www.langchain.com/langgraph

Dynamic & Adaptive Behavior –

OSMC 2025: Generative AI for Observability: Automating Root Cause Analysis in Modern IT Systems by Kamal Bisht.pdf

More Related Content

Similar to OSMC 2025: Generative AI for Observability: Automating Root Cause Analysis in Modern IT Systems by Kamal Bisht.pdf

Recently uploaded

OSMC 2025: Generative AI for Observability: Automating Root Cause Analysis in Modern IT Systems by Kamal Bisht.pdf