Inspiration
I wanted to build a robust, self-healing observability platform for AI agents, inspired by real-world outages where a single agent failure caused widespread impact. My goal was to combine Splunk’s analytics with automated remediation, circuit breaker logic, and clear dashboards for every agent.
What it does
Agentic AI Observability for Splunk provides real-time monitoring, anomaly detection, and automated remediation for three different AI agents. It ingests agent session data, detects anomalies, triggers alerts, and uses a circuit breaker to quarantine failing agents for 30 minutes, with fallback routing and auto-restore. Each agent has its own dedicated dashboard, and all alert activity is visualized in Splunk.
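The quarantine, fallback, and auto-restore flow described above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the class name `AgentCircuitBreaker`, the failure threshold of 3, and the agent names are assumptions made for the example. Only the 30-minute quarantine window comes from the description.

```python
from dataclasses import dataclass, field
from typing import Dict

QUARANTINE_SECONDS = 30 * 60  # 30-minute quarantine window from the description


@dataclass
class AgentCircuitBreaker:
    """Hypothetical per-agent circuit breaker; names and threshold are assumed."""

    failure_threshold: int = 3               # assumed: anomalies before quarantine
    quarantine_seconds: float = QUARANTINE_SECONDS
    fallback_agent: str = "fallback-agent"   # assumed fallback route
    _failures: Dict[str, int] = field(default_factory=dict, init=False)
    _quarantined_at: Dict[str, float] = field(default_factory=dict, init=False)

    def is_quarantined(self, agent: str, now: float) -> bool:
        """Return True while the agent is quarantined; auto-restore on expiry."""
        opened = self._quarantined_at.get(agent)
        if opened is None:
            return False
        if now - opened >= self.quarantine_seconds:
            # Quarantine window elapsed: restore the agent and reset its count.
            del self._quarantined_at[agent]
            self._failures[agent] = 0
            return False
        return True

    def record_anomaly(self, agent: str, now: float) -> None:
        """Count an anomaly and quarantine the agent once the threshold is hit."""
        if self.is_quarantined(agent, now):
            return
        count = self._failures.get(agent, 0) + 1
        self._failures[agent] = count
        if count >= self.failure_threshold:
            self._quarantined_at[agent] = now

    def route(self, agent: str, now: float) -> str:
        """Send traffic to the fallback while the agent is quarantined."""
        return self.fallback_agent if self.is_quarantined(agent, now) else agent


breaker = AgentCircuitBreaker()
for t in (0.0, 10.0, 20.0):           # three anomalies trip the breaker at t=20
    breaker.record_anomaly("agent-a", now=t)

during = breaker.route("agent-a", now=60.0)                       # inside the window
after = breaker.route("agent-a", now=20.0 + QUARANTINE_SECONDS)   # window elapsed
```

Passing `now` in explicitly, instead of reading the clock inside the class, keeps the quarantine and restore transitions easy to test without waiting 30 minutes.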
How I built it
I built the solution using Python for data generation, ingestion, and remediation logic, and Splunk for dashboards, alerting, and data storage. I used the Model Context Protocol (MCP) for AI integration and wrote modular scripts for each agent. In Splunk Dashboard Studio, I designed a dedicated dashboard for each of the three agents, plus a global overview. Alerts are configured for all critical anomalies.
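The write-up does not say how session data reaches Splunk. One common route is the HTTP Event Collector (HEC), which accepts JSON envelopes with `time`, `index`, `sourcetype`, and `event` keys. The sketch below only builds such an envelope. The function name `build_hec_event`, the index `agent_observability`, the sourcetype `agent:session`, and the session fields are all hypothetical.

```python
import json
import time
from typing import Any, Dict, Optional


def build_hec_event(
    session: Dict[str, Any],
    *,
    index: str = "agent_observability",   # hypothetical index name
    sourcetype: str = "agent:session",    # hypothetical sourcetype
    event_time: Optional[float] = None,
) -> str:
    """Wrap one agent session record in a Splunk HEC JSON envelope."""
    payload = {
        # HEC expects epoch seconds; default to the current time.
        "time": event_time if event_time is not None else time.time(),
        "index": index,
        "sourcetype": sourcetype,
        "event": session,
    }
    return json.dumps(payload)


# Hypothetical session record for one agent.
body = build_hec_event(
    {"agent_id": "agent-a", "latency_ms": 120, "status": "ok"},
    event_time=1700000000.0,
)
```

In a real deployment the resulting string would be sent as the body of a POST to the HEC endpoint (`/services/collector/event`) with an `Authorization: Splunk <token>` header. The sending step is left out here so the example runs without a Splunk instance.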
Challenges I ran into
I faced challenges with Splunk’s data model, alert scheduling, and integrating fallback logic for agent failures. Debugging the circuit breaker state and ensuring reliable quarantine/restore flows took several iterations. Making the dashboards both informative and visually appealing for all three agents also required careful SPL tuning.
Accomplishments that I'm proud of
I’m proud of implementing a robust circuit breaker pattern for AI agents, building end-to-end automated remediation, and delivering a solution that’s both technically deep and easy for operators to use. The dashboards provide instant insight into each agent’s health, and the system can recover from failures without manual intervention. All alerts are actionable and visible in Splunk.
What I learned
I learned how to leverage Splunk’s advanced features for real-time AI operations, how to design resilient agent workflows, and how to integrate Python automation with enterprise observability tools. I also gained experience in building for reliability and clarity under hackathon time constraints.
What's next for Agentic AI Observability for Splunk
Next, I plan to add multi-agent dependency graphs, integrate with PagerDuty and Twilio for real-world notifications, and open-source the project for the community. I’d also like to extend the circuit breaker logic to support dynamic thresholds and adaptive quarantine times, and add more customizable dashboards and alerting options.