🤖 LLM Observability Application
Inspiration
As Large Language Models (LLMs) become central to modern applications, the "black box" nature of their operations poses significant challenges. While building GenAI applications, I noticed a critical gap: observability. It's easy to make an API call, but difficult to answer questions like:
- "Why did this specific prompt fail?"
- "How much is this specific feature costing us per day?"
- "Are users trying to inject malicious prompts?"
I wanted to move beyond simple "vibes-based" evaluation to engineering-grade observability. This inspired me to build a Level 3 Full Stack LLM App that treats model interactions with the same rigor as database queries or microservice calls—complete with metrics, traces, and security monitoring.
What it does
This application is a full-stack chat platform powered by Google Gemini that provides real-time, deep visibility into LLM operations.
- Interactive Chat Interface: A modern, responsive UI for users to interact with the LLM.
- Deep Telemetry: Automatically tracks input/output tokens, latency, and error rates for every single request.
- Security Guardrails: Detects and blocks PII (like emails or SSNs) and Prompt Injection attacks before they reach the model.
- Cost Tracking: Calculates the financial cost of every interaction in real-time using model-specific pricing (math included!).
- Datadog Integration: Streams all this data to a comprehensive Datadog dashboard for alerting and visualization.
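The guardrail checks described above could look something like this minimal sketch; the regexes and injection phrases here are illustrative stand-ins, not the app's actual rule set:

```python
import re

# Illustrative patterns only; the real guardrails are more extensive.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}
INJECTION_PHRASES = ("ignore previous instructions", "disregard your system prompt")

def check_prompt(prompt: str) -> list[str]:
    """Return a list of violations; an empty list means the prompt may proceed."""
    violations = [f"pii:{name}" for name, pat in PII_PATTERNS.items() if pat.search(prompt)]
    lowered = prompt.lower()
    violations += ["prompt_injection" for phrase in INJECTION_PHRASES if phrase in lowered]
    return violations
```

Running the check before the model call means a blocked prompt never consumes tokens.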
How we built it
I established a robust architecture using Python (FastAPI) for the backend and Google Gemini 2.5 Flash as the intelligence layer.
1. The Intelligence Layer
I integrated Google's gemini-2.5-flash model for its speed and efficiency. To ensure reliability, I built a custom GeminiClient wrapper that handles:
- Async Execution: Using `ThreadPoolExecutor` to prevent blocking the main event loop.
- Resiliency: Robust timeout handling and error propagation.
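A minimal sketch of that wrapper pattern, with a stub standing in for the blocking SDK call (the real client would call the Gemini SDK's generate method instead of `blocking_generate`):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the blocking SDK call; swap in the real Gemini
# generate call here.
def blocking_generate(prompt: str) -> str:
    return f"echo: {prompt}"

class GeminiClient:
    """Runs blocking model calls off the event loop with a hard timeout."""

    def __init__(self, timeout: float = 30.0, workers: int = 4):
        self._executor = ThreadPoolExecutor(max_workers=workers)
        self._timeout = timeout

    async def generate(self, prompt: str) -> str:
        loop = asyncio.get_running_loop()
        future = loop.run_in_executor(self._executor, blocking_generate, prompt)
        # asyncio.wait_for raises TimeoutError if the model
        # does not respond in time.
        return await asyncio.wait_for(future, timeout=self._timeout)
```

Because FastAPI handlers are coroutines, offloading the synchronous SDK call this way keeps other requests flowing while one waits on the model.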
2. Comprehensive Telemetry
This was the core engineering effort. I didn't just want logs; I wanted structure.
- Structured Logging: I implemented a `DatadogJsonFormatter` that outputs logs as JSON objects, automatically injecting trace IDs (`dd.trace_id`) and span IDs (`dd.span_id`) to correlate logs with APM traces.
- Custom Metrics: Using `statsd`, I track granular metrics like token usage and latency.
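A simplified sketch of such a formatter, using only the standard library; in production the trace and span IDs come from the active ddtrace context, so here they default to "0" when no trace is attached:

```python
import json
import logging

class DatadogJsonFormatter(logging.Formatter):
    """Emit each record as a JSON object with Datadog correlation fields."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "message": record.getMessage(),
            "level": record.levelname,
            "logger": record.name,
            # These attributes are set upstream (e.g. by a logging filter
            # reading the active trace context).
            "dd.trace_id": getattr(record, "dd.trace_id", "0"),
            "dd.span_id": getattr(record, "dd.span_id", "0"),
        })
```

JSON output lets Datadog parse every field without custom grok rules, and the `dd.*` keys are what its log-trace correlation looks for.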
3. Cost Tracking Math
I implemented a middleware that calculates costs based on each request's token counts:

$$ \text{Cost} = \left( \frac{\text{InputTokens}}{1000} \times 0.00025 \right) + \left( \frac{\text{OutputTokens}}{1000} \times 0.0005 \right) $$

This runs for every request, providing a live "dollars per query" metric.
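In code, the formula is a one-line pure function (rates taken from the formula above; they would need updating if the model's pricing changes):

```python
# Per-1K-token rates in USD, matching the formula above.
INPUT_RATE = 0.00025
OUTPUT_RATE = 0.0005

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request."""
    return (input_tokens / 1000) * INPUT_RATE + (output_tokens / 1000) * OUTPUT_RATE
```

For example, a request with 2,000 input tokens and 1,000 output tokens costs $0.0005 + $0.0005 = $0.001.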
Challenges we ran into
The "Silent Failure" of Costs
Initially, I had no visibility into token usage until the monthly bill arrived. It was hard to predict burn rates.
Solution: I implemented a real-time cost_tracker that aggregates the exact token counts from each response's metadata and pushes the running total to a Datadog Gauge (llm.cost.daily).
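A sketch of that aggregation, with the metrics client injected as a plain callable; in the real app `emit` would be something like the Datadog statsd client's gauge method:

```python
from typing import Callable

class CostTracker:
    """Aggregates per-request cost and reports a running daily total."""

    def __init__(self, emit: Callable[[str, float], None]):
        # `emit` stands in for the metrics client (e.g. a statsd gauge call).
        self._emit = emit
        self._daily_total = 0.0

    def record(self, cost: float) -> None:
        self._daily_total += cost
        self._emit("llm.cost.daily", self._daily_total)
```

Emitting a gauge on every request means the dashboard shows burn rate live, instead of waiting for the monthly bill.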
Correlating Logs and Traces
It was difficult to debug issues because logs were separate from APM traces.
Solution: I manually injected the active trace context into the Python logging.LogRecord attributes. This allows me to click a spike in the Request Latency graph and drill down instantly to the exact log lines for that request.
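The injection can be done with a standard `logging.Filter`; this sketch uses a stub for reading the trace context, where the real app would pull the IDs from the ddtrace tracer's current span:

```python
import logging

def current_trace_ids() -> tuple[str, str]:
    """Stand-in for reading the active trace context; the real app
    pulls these from the tracer's current span."""
    return ("1234567890", "987654321")

class TraceContextFilter(logging.Filter):
    """Copies the active trace/span IDs onto every LogRecord so the
    JSON formatter can emit them."""

    def filter(self, record: logging.LogRecord) -> bool:
        trace_id, span_id = current_trace_ids()
        setattr(record, "dd.trace_id", trace_id)
        setattr(record, "dd.span_id", span_id)
        return True  # never drop the record, only enrich it
```

Attaching this filter to the root logger enriches every log line, which is what makes the latency-spike-to-log-line drill-down possible.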
Accomplishments that we're proud of
- Near-Zero Latency Overhead: The observability instrumentation is non-blocking (async), so the user experience remains snappy.
- Bank-Grade Security: Successfully preventing PII leakage and basic prompt injection attempts gave me a lot of confidence in the system's robustness.
- Unified Dashboard: Seeing the "Single Pane of Glass" in Datadog—where I can see costs, errors, and traces in one view—was a huge win.
What we learned
- Observability is a Feature: It shouldn't be an afterthought. Building with observability in mind from day one made debugging the integration 10x faster.
- The Power of APM: Seeing the flame graph of a request helped me optimize the async handoffs between the API layer and the Gemini client.
- Security First: LLMs are vulnerable to unique vectors like Prompt Injection. Implementing detection logic upstream saves tokens and protects the system.
What's next for LLM Observability Application
- Evaluation Pipelines: Implementing automated "relevancy" scores for LLM responses using a smaller judge model.
- Caching Layer: Adding semantic caching (Redis) to reduce costs for identical queries.
- Multi-Modal Support: Extending the observability metrics to cover image and audio inputs supported by Gemini Pro Vision.