©2025 Discover, a division of Capital One, N.A. Opinions are those of the individual
author. Unless noted otherwise in this post, Discover is not affiliated with, nor
endorsed by, any of the companies mentioned. All trademarks and other intellectual
property used or displayed are property of their respective owners
HI I’M KAMAL SINGH BISHT
Observability & AI/ML Technologist | IEEE Senior Member | Author & Speaker
▪ A seasoned technologist with extensive experience spanning observability,
AI/ML, and cybersecurity.
▪ Currently at Discover, with prior senior engineering roles at
JPMorgan Chase and Zillow.
▪ IEEE Senior Member, conference speaker, and author of multiple
AI-driven observability publications.
https://www.linkedin.com/in/kmluvce/
AGENDA
• Overview
• Challenges
• Architecture
• RCA Workflow
• Improvements
• Takeaways
THE STATE OF MODERN IT SYSTEMS
• Explosion in telemetry data (logs, metrics, traces)
• Complex microservices with distributed failure domains
• Manual Root Cause Analysis (RCA) still dominates incident workflows
• High Mean Time to Detect and Mean Time to Resolve
WHY TRADITIONAL AIOPS FALLS SHORT
ARCHITECTURE OVERVIEW
Dynamic & Adaptive Behavior –
1 2 3
4
DATA SOURCE
Data Source
Monitoring
Tools
GenAI- RCA GenAI- RCA
Type 1 Type 2
Alerts
Logs/
Metrics/
Traces
DATA SOURCE
Data Source
Monitoring
Tools
GenAI- RCA GenAI- RCA
Type 3: Hybrid
Alerts
Logs/
Metrics/
Traces
INGESTION LAYER
CONTEXTUAL ENRICHMENT LAYER
MULTIMODAL CONTEXT FUSION LAYER
RCA ENGINE(RAG- QUERYING)
RCA ENGINE(RAG – INGESTION)
AGENTIC AI INVOCATION
RCA/FIX RESULT
LLM PROMPT FLOW
System Message
“You are an AI-driven RCA engine. Analyze logs,
metrics, traces, and alerts to identify the most
probable root cause, summarize impact, and
recommend fixes. Keep output concise and evidence-
based.”
User Message (Telemetry Input)
{
"incident_id": "INC-001",
"logs": [
{"svc": "checkout", "msg": "inventory lookup failed"},
{"svc": "inventory", "msg": "db timeout"}
]
}
LLM
Generate
RCA
Prompting
• Zero-shot
• Few-Shot
• Chain-Of-Thought(COT)
IMPROVEMENTS
LLM COMPARISON
Model Type Key Strengths Limitations /
Considerations
GPT-4 Turbo (OpenAI) Proprietary
(Closed-
Source, API-
based)
Proven reasoning
accuracy, reliable RCA
summaries
Cost, black-box,
Data privacy &
compliance
Llama 3 (Meta) Open-source Strong generalization,
fine-tunable
GPU-intensive
Mistral
(mistral AI)
Open-source lightweight, fast, cost-
efficient, on-prem ready
Limited
reasoning
depth
Sources: OpenAI (GPT-4 Turbo Docs, 2025); Meta AI (Llama 3 Model Card, 2025); Mistral AI (Official Model Documentation, 2025); Author’s empirical evaluation, 2025.
TECHNICAL & ARCHITECTURAL LIMITATIONS
Limitation Mitigation
Model Hallucination RAG integration
Context Window Sliding memory buffers
Compute Demand Vector caching
Integration Modular API design
RESOURCES
GitHub:
https://github.com/rootiq-ai
Published Papers:
• Generative AI–Driven Observability for Automated Root Cause Analysis in Modern IT
Systems: Architecture and Vision
• Convergence of AI and Observability: Predictive Insights Automation in Modern IT Operations
Hugging Face
• https://huggingface.co/
LangChain
• https://www.langchain.com/
LangGraph
• https://www.langchain.com/langgraph
Dynamic & Adaptive Behavior –

OSMC 2025: Generative AI for Observability: Automating Root Cause Analysis in Modern IT Systems by Kamal Bisht.pdf

  • 1.
    ©2025 Discover, adivision of Capital One, N.A. Opinions are those of the individual author. Unless noted otherwise in this post, Discover is not affiliated with, nor endorsed by, any of the companies mentioned. All trademarks and other intellectual property used or displayed are property of their respective owners
  • 2.
    HI I’M KAMALSINGH BISHT Observability & AI/ML Technologist | IEEE Senior Member | Author & Speaker ▪ A seasoned technologist with extensive experience spanning observability, AI/ML, and cybersecurity. ▪ Currently at Discover, with prior senior engineering roles at JPMorgan Chase and Zillow. ▪ IEEE Senior Member, conference speaker, and author of multiple AI-driven observability publications. https://www.linkedin.com/in/kmluvce/
  • 3.
    AGENDA • Overview • Challenges •Architecture • RCA Workflow • Improvements • Takeaways
  • 4.
    THE STATE OFMODERN IT SYSTEMS • Explosion in telemetry data (logs, metrics, traces) • Complex microservices with distributed failure domains • Manual Root Cause Analysis (RCA) still dominates incident workflows • High Mean Time to Detect and Mean Time to Resolve
  • 5.
  • 6.
    ARCHITECTURE OVERVIEW Dynamic &Adaptive Behavior – 1 2 3 4
  • 7.
    DATA SOURCE Data Source Monitoring Tools GenAI-RCA GenAI- RCA Type 1 Type 2 Alerts Logs/ Metrics/ Traces
  • 8.
    DATA SOURCE Data Source Monitoring Tools GenAI-RCA GenAI- RCA Type 3: Hybrid Alerts Logs/ Metrics/ Traces
  • 9.
  • 10.
  • 11.
  • 12.
  • 13.
  • 14.
  • 15.
  • 16.
    LLM PROMPT FLOW SystemMessage “You are an AI-driven RCA engine. Analyze logs, metrics, traces, and alerts to identify the most probable root cause, summarize impact, and recommend fixes. Keep output concise and evidence- based.” User Message (Telemetry Input) { "incident_id": "INC-001", "logs": [ {"svc": "checkout", "msg": "inventory lookup failed"}, {"svc": "inventory", "msg": "db timeout"} ] } LLM Generate RCA Prompting • Zero-shot • Few-Shot • Chain-Of-Thought(COT)
  • 17.
  • 18.
    LLM COMPARISON Model TypeKey Strengths Limitations / Considerations GPT-4 Turbo (OpenAI) Proprietary (Closed- Source, API- based) Proven reasoning accuracy, reliable RCA summaries Cost, black-box, Data privacy & compliance Llama 3 (Meta) Open-source Strong generalization, fine-tunable GPU-intensive Mistral (mistral AI) Open-source lightweight, fast, cost- efficient, on-prem ready Limited reasoning depth Sources: OpenAI (GPT-4 Turbo Docs, 2025); Meta AI (Llama 3 Model Card, 2025); Mistral AI (Official Model Documentation, 2025); Author’s empirical evaluation, 2025.
  • 19.
    TECHNICAL & ARCHITECTURALLIMITATIONS Limitation Mitigation Model Hallucination RAG integration Context Window Sliding memory buffers Compute Demand Vector caching Integration Modular API design
  • 20.
    RESOURCES GitHub: https://github.com/rootiq-ai Published Papers: • GenerativeAI–Driven Observability for Automated Root Cause Analysis in Modern IT Systems: Architecture and Vision • Convergence of AI and Observability: Predictive Insights Automation in Modern IT Operations Hugging Face • https://huggingface.co/ LangChain • https://www.langchain.com/ LangGraph • https://www.langchain.com/langgraph
  • 21.
    Dynamic & AdaptiveBehavior –