Inspiration

Every organization deploying AI agents inside Splunk faces the same blind spot. Your MCP Server is running queries, your AI Assistant is generating SPL — but when it returns a hallucinated result, when quality drifts over days, when token costs are about to spike — nobody catches it. AgentWatch was built to solve this: who watches your AI?

What it does

AgentWatch is a real-time AI observability platform with four core capabilities:

Agent Trace Collection — captures every MCP tool call with latency, token usage, and status. Logs 459+ trace events to a dedicated Splunk index with unique trace IDs.

AI Quality Scoring — uses an LLM-as-judge to score every agent response for hallucination, relevance, and completeness (0.0–1.0). Fires drift alerts when quality drops below 0.7 threshold.

Token Cost Forecasting — uses Splunk MLTK predict command to forecast token usage 24 hours ahead with 95% confidence bounds. Detected 6 real anomaly spikes in testing.

Self-Healing Engine — when a tool call fails, AgentWatch pulls context from Splunk, generates a root cause and corrected SPL query using local AI, and writes a notable event back to Splunk automatically.

How we built it

  • Splunk MCP Server 1.2 — intercepts and traces every AI agent tool call
  • Splunk AI Toolkit 5.7.4 — MLTK predict for time-series forecasting
  • Splunk Python SDK — all data ingestion, search, and index management
  • Ollama llama3.2 — local LLM running as quality judge and fix generator
  • FastAPI — REST API exposing POST /trace and POST /score endpoints
  • Splunk Classic Dashboard — 8 panels covering latency, quality, forecast, and healing

Challenges we ran into

The self-healing engine was the hardest component to make reliable. Ollama sometimes returns inconsistent JSON mixed with explanatory text. We solved this with a multi-strategy JSON parser that tries direct parsing, bracket extraction, and regex matching in sequence.

Generating 7 days of realistic historical data with correct timestamps for the MLTK forecast required careful handling of Splunk's time indexing.

Accomplishments that we're proud of

  • Complete AI observability stack built in under 2 weeks as a solo developer
  • Quality scorer correctly flags hallucinated responses with zero false negatives in testing
  • MLTK forecaster detected 6 real token usage anomalies with zero manual configuration
  • Self-healing engine generates actionable SPL corrections automatically

What we learned

Building observability for AI systems requires thinking recursively — you need AI to watch AI. The LLM-as-judge pattern for quality scoring is surprisingly effective at catching hallucinations that rule-based systems would miss. The Splunk MLTK predict command is powerful for anomaly detection with minimal setup.

What's next for AgentWatch — AI Observability for Splunk AI Agents

  • Integration with Splunk Enterprise Security to surface AI quality issues as ES notable events
  • Support for Splunk Cloud with hosted Foundation-sec model as the quality judge
  • Real-time streaming dashboard using Splunk WebSocket API
  • Multi-agent workflow tracing across full agentic pipelines
  • Cost attribution per team or use case for AI budget management

Built With

Share this project:

Updates

posted an update

AgentWatch has been successfully submitted to the Splunk Agentic Ops Hackathon.

Built as a solo project, AgentWatch provides observability for AI agents through trace collection, AI quality scoring, token usage forecasting, and automated self-healing recommendations. Looking forward to refining the project further and sharing additional demos and screenshots.

Log in or sign up for Devpost to join the conversation.