Inspiration

LLMs are powerful, but in production they behave like a black box. We noticed that teams couldn't answer basic questions such as:

  • "Why did our AI costs spike 10x yesterday?"
  • "Is our LLM giving wrong answers without us knowing?"
  • "How do we catch unsafe content automatically?"

We built LLM Black Box to give teams real visibility into their AI systems.

What it does

Think of it like a car dashboard, but for AI applications. It shows you:

  • Costs in real time (token usage drives your AI bill)
  • Safety alerts when content gets blocked
  • Performance issues before users complain
  • Automatic trouble tickets with all the details AI engineers need

When something goes wrong, it doesn't stop at "something's wrong": it shows exactly which prompt triggered the problem, how many tokens the request used, and what the model replied.

How we built it

  1. Built a simple AI app using FastAPI + Google's Gemini (Vertex AI)
  2. Added monitoring hooks to capture token counts, response times, and safety ratings
  3. Sent everything to Datadog for dashboards and alerts
  4. Created smart rules that spot unusual patterns (like sudden token spikes)
  5. Automated incident creation so engineers get tickets with all the context they need

Tech stack: FastAPI (Python), Google Vertex AI, Datadog (APM, Logs, Metrics, Monitors), OpenTelemetry, Docker.
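
A monitoring hook like the one in step 2 can be sketched as follows. This is an illustration only, not the project's actual code. It assumes the model response has already been converted to a plain dict shaped like the Vertex AI REST response (`usageMetadata`, `candidates[].safetyRatings`, `candidates[].finishReason`). The names `extract_llm_metrics` and `monitored_call` are hypothetical, and forwarding the result to Datadog is left out.

```python
import time
from typing import Any, Callable


def extract_llm_metrics(response: dict[str, Any]) -> dict[str, Any]:
    """Pull token counts and safety info out of a nested response dict.

    Field names are an assumption based on the Vertex AI REST response
    shape; missing fields fall back to safe defaults instead of raising.
    """
    usage = response.get("usageMetadata") or {}
    candidates = response.get("candidates") or []
    first = candidates[0] if candidates else {}
    ratings = first.get("safetyRatings") or []
    return {
        "prompt_tokens": usage.get("promptTokenCount", 0),
        "completion_tokens": usage.get("candidatesTokenCount", 0),
        "total_tokens": usage.get("totalTokenCount", 0),
        "finish_reason": first.get("finishReason", "UNKNOWN"),
        # Treat the call as blocked if it stopped for safety reasons
        # or any individual rating is flagged as blocked.
        "blocked": first.get("finishReason") == "SAFETY"
        or any(r.get("blocked", False) for r in ratings),
        "safety": {
            r.get("category", "UNKNOWN"): r.get("probability", "UNKNOWN")
            for r in ratings
        },
    }


def monitored_call(
    generate: Callable[[str], dict[str, Any]], prompt: str
) -> dict[str, Any]:
    """Time a model call and attach the prompt and latency to its metrics."""
    start = time.perf_counter()
    response = generate(prompt)
    latency_ms = (time.perf_counter() - start) * 1000.0
    metrics = extract_llm_metrics(response)
    metrics["latency_ms"] = latency_ms
    metrics["prompt"] = prompt  # kept so an incident can show what triggered it
    return metrics
```

Here `generate` stands in for whatever function calls the model. The returned dict can then be logged or emitted as metrics by the monitoring backend.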

Challenges we ran into

  • Getting token counts and safety ratings from Vertex AI responses (the data was deeply nested)
  • Making the Datadog Agent work smoothly with our local setup
  • Setting the right thresholds for alerts (what's "normal" token usage?)
  • Linking incidents back to specific user queries automatically
  • Making it easy for judges to test without complex setup
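
The threshold question above ("what's normal token usage?") can be approached with a rolling baseline instead of a fixed number. The sketch below is a standalone illustration, not the Datadog monitor configuration the project used. The class name `TokenSpikeDetector` and its defaults (window of 50, 3 standard deviations, 10 warm-up samples) are assumptions chosen for the example.

```python
from collections import deque
from statistics import mean, pstdev


class TokenSpikeDetector:
    """Flag requests whose token count is far above the recent baseline."""

    def __init__(self, window: int = 50, k: float = 3.0, min_samples: int = 10):
        self.history: deque[int] = deque(maxlen=window)
        self.k = k
        self.min_samples = min_samples

    def observe(self, total_tokens: int) -> bool:
        """Return True if this request looks like a spike."""
        is_spike = False
        if len(self.history) >= self.min_samples:
            baseline = mean(self.history)
            # Floor the spread at 1 token so perfectly uniform traffic
            # does not make every tiny deviation look like a spike.
            spread = max(pstdev(self.history), 1.0)
            is_spike = total_tokens > baseline + self.k * spread
        if not is_spike:
            # Only normal traffic updates the baseline, so one spike
            # does not raise the threshold for the next one.
            self.history.append(total_tokens)
        return is_spike
```

During the warm-up period the detector never fires, which avoids alerts before there is enough data to define "normal".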

Accomplishments that we're proud of

  • Created a working system that actually catches AI-specific problems
  • Made it simple enough that teams could use it tomorrow
  • Built realistic test scenarios (token explosions, safety blocks, latency issues)
  • Got all the pieces talking to each other: Google Cloud → FastAPI → Datadog
  • Documented everything so others can build on it

What we learned

  • LLMs need different monitoring than regular apps (tokens ≠ CPU usage)
  • Safety monitoring can't be "set and forget": it needs continuous checking
  • Engineers need different info for AI incidents (prompts/responses, not just logs)
  • Cost visibility changes how teams use AI (when you see tokens = dollars, you optimize)
  • OpenTelemetry is great for traces but needs extensions for AI data
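
The "tokens = dollars" point above comes down to simple arithmetic. In this minimal sketch the function name and the per-1,000-token prices are hypothetical placeholders. Real prices depend on the model and provider and should be taken from the current price list.

```python
def estimate_cost_usd(
    prompt_tokens: int,
    completion_tokens: int,
    input_price_per_1k: float,
    output_price_per_1k: float,
) -> float:
    """Convert token counts into an estimated dollar cost.

    Input and output tokens are priced separately because providers
    typically charge different rates for each.
    """
    input_cost = prompt_tokens / 1000 * input_price_per_1k
    output_cost = completion_tokens / 1000 * output_price_per_1k
    return input_cost + output_cost


# Hypothetical prices, for illustration only.
cost = estimate_cost_usd(
    prompt_tokens=2000,
    completion_tokens=500,
    input_price_per_1k=0.001,
    output_price_per_1k=0.002,
)
print(f"Estimated cost: ${cost:.4f}")
```

Emitting this value as a metric next to the raw token counts puts the cost on the same dashboard as latency and errors.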

What's next for LLM Black Box

  • Make it easier to deploy (one-click Google Cloud Run setup)
  • Add more AI providers (OpenAI, Anthropic's Claude, open-source models)
  • Better cost forecasting ("at this rate, you'll spend $X this month")
  • Community contributions – we want others to help improve it
  • Real customer testing – see what actual AI teams need
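
The simplest version of the cost forecast mentioned above ("at this rate, you'll spend $X this month") is a linear projection of month-to-date spend. This is a sketch of one possible approach, not a planned implementation. The function name `forecast_month_spend` is hypothetical, and the calculation assumes that spend is roughly even per day and that today counts as a full day.

```python
import calendar
from datetime import date


def forecast_month_spend(spend_to_date_usd: float, today: date) -> float:
    """Project month-end spend from the average daily spend so far."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    daily_rate = spend_to_date_usd / today.day  # today.day is always >= 1
    return daily_rate * days_in_month


projected = forecast_month_spend(150.0, date(2024, 4, 10))
print(f"At this rate, you'll spend ${projected:.2f} this month")
```

A linear projection ignores weekday and weekend patterns and traffic growth, so it works best as a rough early warning.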

Built With

  • datadog
  • datadogapm
  • dockerdesktop
  • fastapi
  • googlevertexai
  • html
  • opentelemetry