Inspiration

As enterprise adoption of GenAI skyrockets, engineering and security teams are left in the dark. Implementing custom logging for LLM costs, latency, and token usage is tedious, and security teams have no real-time visibility into prompt injection attacks or regulatory compliance drift. We built NeuralWatch to bridge this gap: a zero-code-change telemetry wrapper that plugs into Python apps in two lines and streams structured telemetry to Splunk in near real time.

What it does

NeuralWatch delivers four capabilities:

  1. AI Fleet Observatory: Real-time dashboards monitoring total AI API costs, latency percentiles (p50/p95), error rates, and token consumption across models and services.
  2. Prompt Injection Sentinel: A built-in adversarial prompt classifier (based on the Foundation-Sec heuristics) that scores each input from 0.0 to 1.0, grades detections by severity (Critical, High, Medium, Low) to identify and block injections, and traces persistent multi-session attacks.
  3. NLP Query Agent (MCP): A Model Context Protocol (MCP) server that lets users query live Splunk logs in natural language (e.g. "How much did we spend on Claude-3-Sonnet today?") and compiles each question to SPL automatically.
  4. EU AI Act Compliance Scoring: Real-time scoring dashboards evaluating services against five core regulatory articles (Articles 9, 13, 14, 17, and 72).
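
To illustrate how a 0.0 to 1.0 injection score could map onto the four severity labels above, here is a minimal sketch. The threshold values and the names `SEVERITY_THRESHOLDS` and `severity_for` are assumptions made for this example; they are not taken from the NeuralWatch classifier.

```python
# Hypothetical cut-offs: only the 0.0-1.0 range and the four labels
# come from the project description; the numbers are illustrative.
SEVERITY_THRESHOLDS = [
    (0.9, "Critical"),
    (0.7, "High"),
    (0.4, "Medium"),
    (0.0, "Low"),
]


def severity_for(score: float) -> str:
    """Return the severity label for an injection score in [0.0, 1.0]."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be within [0.0, 1.0], got {score!r}")
    # Thresholds are sorted from highest to lowest, so the first match wins.
    for floor, label in SEVERITY_THRESHOLDS:
        if score >= floor:
            return label
    return "Low"
```

Keeping the thresholds in a single table makes it easy to tune the cut-offs without touching the classification logic.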

How we built it

  • Python SDK: Built a lightweight, PyPI-publishable SDK (neuralwatch-splunk) using monkey-patched hooks on OpenAI and Anthropic client methods. Features a queue-backed background thread for non-blocking telemetry forwarding via HEC.
  • Splunk App: Created custom XML dashboards (neuralwatch_main, neuralwatch_injection, neuralwatch_compliance), props configuration for clean JSON field extractions, and lookup tables for baseline compliance scores.
  • MCP Agent: Built with the Model Context Protocol, using an LLM-powered compiler to translate natural language into optimized SPL queries executed through the Splunk Management SDK.
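
The monkey-patching approach in the Python SDK can be sketched roughly as follows. This is an illustration under stated assumptions, not the shipped SDK code: `FakeClient` stands in for a vendor client class, and the `instrument` signature, the `telemetry_queue` name, and the event fields are hypothetical.

```python
import functools
import queue
import time

# In-process buffer that a background forwarder would drain (assumed name).
telemetry_queue: "queue.Queue[dict]" = queue.Queue()


class FakeClient:
    """Stand-in for a vendor client; not the real OpenAI or Anthropic class."""

    def create(self, model: str, prompt: str) -> dict:
        return {"model": model, "text": prompt.upper()}


def instrument(cls, method_name: str, sink: queue.Queue = telemetry_queue) -> None:
    """Replace cls.method_name with a wrapper that records one event per call."""
    original = getattr(cls, method_name)
    if getattr(original, "_neuralwatch_wrapped", False):
        return  # already patched; avoid double-wrapping

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        start = time.perf_counter()
        error = None
        try:
            return original(self, *args, **kwargs)
        except Exception as exc:
            error = type(exc).__name__
            raise  # the application still sees the original exception
        finally:
            # Only an enqueue happens on the caller's thread.
            sink.put_nowait({
                "method": f"{cls.__name__}.{method_name}",
                "model": kwargs.get("model"),
                "latency_ms": (time.perf_counter() - start) * 1000.0,
                "error": error,
            })

    wrapper._neuralwatch_wrapped = True
    setattr(cls, method_name, wrapper)
```

After `instrument(FakeClient, "create")`, every existing call site keeps working unchanged while one telemetry event per call lands on the queue.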

Challenges we ran into

  • Telemetry Latency: We could not afford to block the hot path of the application's LLM calls. We solved this by running prompt parsing and cost estimation synchronously, while routing all Splunk HEC payloads asynchronously through a thread-safe background queue, with an atexit hook that flushes pending events on shutdown.
  • Multi-Value Field Extractions: JSON fields arriving over HEC were being extracted twice because the INDEXED_EXTRACTIONS and KV_MODE settings overlapped, which wrapped telemetry values in arrays (multi-value fields). We restructured the props.conf configuration and hot-reloaded it so the fields are extracted only once.
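
The non-blocking forwarding pattern described under Telemetry Latency can be sketched as below. It is a simplified illustration, not our production code: the `TelemetryForwarder` class and its method names are hypothetical, and `send` is an injected callable (in practice it would POST the event to Splunk HEC) so the sketch runs without a network.

```python
import atexit
import queue
import threading
from typing import Callable


class TelemetryForwarder:
    """Queue-backed background sender with a flush-on-exit safeguard."""

    def __init__(self, send: Callable[[dict], None], max_size: int = 10_000):
        self._send = send
        self._queue: "queue.Queue[dict]" = queue.Queue(maxsize=max_size)
        self._pending = 0  # events accepted but not yet handled
        self._cond = threading.Condition()
        threading.Thread(target=self._run, daemon=True).start()
        atexit.register(self.flush)  # drain what is left at interpreter exit

    def submit(self, event: dict) -> bool:
        """Enqueue without blocking; return False if the buffer is full."""
        with self._cond:
            self._pending += 1
        try:
            self._queue.put_nowait(event)
            return True
        except queue.Full:
            self._mark_done()  # drop the event rather than stall the caller
            return False

    def _mark_done(self) -> None:
        with self._cond:
            self._pending -= 1
            self._cond.notify_all()

    def _run(self) -> None:
        while True:
            event = self._queue.get()
            try:
                self._send(event)
            except Exception:
                pass  # a failed delivery must not kill the worker thread
            finally:
                self._mark_done()

    def flush(self, timeout: float = 5.0) -> bool:
        """Wait until all accepted events are handled or the timeout expires."""
        with self._cond:
            return self._cond.wait_for(lambda: self._pending == 0, timeout=timeout)
```

The bounded queue trades completeness for safety: under sustained backpressure events are dropped instead of slowing the application, and the timeout on `flush` keeps a hung sender from blocking shutdown indefinitely.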

Accomplishments that we're proud of

  • Developing a zero-code SDK wrapper that requires only a single instrument() call to fully monitor an application namespace.
  • Achieving a fully functioning MCP server that translates natural-language questions into complex Splunk searches and returns plain-language answers for non-technical stakeholders.
  • Building a real-time compliance scorecard that directly maps operations to actual articles of the EU AI Act.

What we learned

We learned the intricacies of Splunk's indexing pipelines, how to structure robust, non-blocking telemetry collectors, and the critical importance of standardization when aligning AI observability with security regulations.

What's next for NeuralWatch

  • Multi-language SDKs: Porting the instrumentor to Node.js and Go.
  • Automated Mitigation: Implementing active guardrails that block prompt injections before they hit the upstream LLM API.
  • Vector Embeddings Tracking: Tracking and visualizing embedding drift over time in Splunk.

Built With

  • anthropic-claude-api
  • model-context-protocol-(mcp)
  • openai-api
  • pytest
  • python
  • ruff
  • splunk-enterprise