🤖 LLM Observability Application

Inspiration

As Large Language Models (LLMs) become central to modern applications, the "black box" nature of their operations poses significant challenges. While building GenAI applications, I noticed a critical gap: observability. It's easy to make an API call, but difficult to answer questions like:

  • "Why did this specific prompt fail?"
  • "How much is this specific feature costing us per day?"
  • "Are users trying to inject malicious prompts?"

I wanted to move beyond simple "vibes-based" evaluation to engineering-grade observability. This inspired me to build a Level 3 Full Stack LLM App that treats model interactions with the same rigor as database queries or microservice calls—complete with metrics, traces, and security monitoring.

What it does

This application is a full-stack chat platform powered by Google Gemini that provides real-time, deep visibility into LLM operations.

  • Interactive Chat Interface: A modern, responsive UI for users to interact with the LLM.
  • Deep Telemetry: Automatically tracks input/output tokens, latency, and error rates for every single request.
  • Security Guardrails: Detects and blocks PII (like emails or SSNs) and Prompt Injection attacks before they reach the model.
  • Cost Tracking: Calculates the financial cost of every interaction in real-time using model-specific pricing (math included!).
  • Datadog Integration: Streams all this data to a comprehensive Datadog dashboard for alerting and visualization.
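
The guardrail layer described above can be sketched as a simple pre-flight regex screen. This is a minimal illustration, not the app's actual rules — the patterns, phrases, and the `screen_prompt` name are all assumptions for the example:

```python
import re

# Illustrative patterns -- real guardrails would cover more PII types.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}
INJECTION_PHRASES = ("ignore previous instructions", "disregard your system prompt")

def screen_prompt(prompt: str) -> list[str]:
    """Return a list of violations; an empty list means the prompt may proceed."""
    violations = [f"pii:{name}" for name, pat in PII_PATTERNS.items() if pat.search(prompt)]
    lowered = prompt.lower()
    violations += [f"injection:{phrase}" for phrase in INJECTION_PHRASES if phrase in lowered]
    return violations
```

Running the screen before the model call means a blocked prompt costs zero tokens.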

How we built it

I established a robust architecture using Python (FastAPI) for the backend and Google Gemini 2.5 Flash as the intelligence layer.

1. The Intelligence Layer

I integrated Google's gemini-2.5-flash model for its speed and efficiency. To ensure reliability, I built a custom GeminiClient wrapper that handles:

  • Async Execution: Using ThreadPoolExecutor to prevent blocking the main event loop.
  • Resiliency: Robust timeout handling and error propagation.
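
The wrapper pattern can be sketched as follows. The blocking call here is a stand-in (the real app would invoke the Gemini SDK, e.g. `client.models.generate_content(...)`), and the class layout is illustrative, not the project's actual code:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

class GeminiClient:
    """Wraps a blocking SDK call so it never stalls the FastAPI event loop."""

    def __init__(self, timeout: float = 30.0, max_workers: int = 8):
        self._executor = ThreadPoolExecutor(max_workers=max_workers)
        self._timeout = timeout

    def _blocking_generate(self, prompt: str) -> str:
        # Stand-in for the real SDK call, e.g.
        #   client.models.generate_content(model="gemini-2.5-flash", contents=prompt)
        return f"echo: {prompt}"

    async def generate(self, prompt: str) -> str:
        loop = asyncio.get_running_loop()
        try:
            # Run the blocking call on the pool, bounded by a timeout.
            return await asyncio.wait_for(
                loop.run_in_executor(self._executor, self._blocking_generate, prompt),
                timeout=self._timeout,
            )
        except asyncio.TimeoutError:
            # Surface a typed error so middleware can log and count it.
            raise RuntimeError("Gemini call timed out") from None
```

Offloading to the executor keeps the event loop free to serve other requests while the model responds.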

2. Comprehensive Telemetry

This was the core engineering effort. I didn't just want logs; I wanted structure.

  • Structured Logging: I implemented a DateDogJsonFormatter that outputs logs as JSON objects, automatically injecting trace IDs (dd.trace_id) and span IDs (dd.span_id) to correlate logs with APM traces.
  • Custom Metrics: Using statsd, I track granular metrics like token usage and latency.

3. Cost Tracking Math

I implemented middleware that calculates the cost of each request from its token counts:

$$ \text{Cost} = \left( \frac{\text{InputTokens}}{1000} \times 0.00025 \right) + \left( \frac{\text{OutputTokens}}{1000} \times 0.0005 \right) $$

This runs for every request, providing a live "dollars per query" metric.
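
That formula translates directly into a small helper (rates per 1,000 tokens as given above; the function name is illustrative):

```python
INPUT_RATE_PER_1K = 0.00025   # USD per 1,000 input tokens
OUTPUT_RATE_PER_1K = 0.0005   # USD per 1,000 output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request, from the token counts in the response metadata."""
    return ((input_tokens / 1000) * INPUT_RATE_PER_1K
            + (output_tokens / 1000) * OUTPUT_RATE_PER_1K)
```

For example, a request with 2,000 input tokens and 1,000 output tokens costs $0.0005 + $0.0005 = $0.001.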

Challenges we ran into

The "Silent Failure" of Costs

Initially, I had no visibility into token usage until the monthly bill arrived, which made burn rates impossible to predict. Solution: I implemented a real-time cost_tracker that aggregates token counts from each response's usage metadata and pushes them to a Datadog gauge (llm.cost.daily).

Correlating Logs and Traces

It was difficult to debug issues because logs were separate from APM traces. Solution: I manually injected the active trace context into the Python logging.LogRecord attributes. This allows me to click a spike in the Request Latency graph and drill down instantly to the exact log lines for that request.
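
The mechanic can be sketched with a logging.Filter that stamps every LogRecord with the active trace context. The IDs here are hard-coded stand-ins; in the real app they would come from the tracer (roughly `span = ddtrace.tracer.current_span()` and its `trace_id`/`span_id`):

```python
import json
import logging

def current_trace_ids() -> tuple[str, str]:
    # Stand-in: the real app reads these from the active ddtrace span.
    return "1234567890", "9876543210"

class TraceContextFilter(logging.Filter):
    """Injects dd.trace_id / dd.span_id onto every LogRecord."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.dd_trace_id, record.dd_span_id = current_trace_ids()
        return True

class JsonFormatter(logging.Formatter):
    """Emits one JSON object per log line, with Datadog correlation keys."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "message": record.getMessage(),
            "level": record.levelname,
            "dd.trace_id": getattr(record, "dd_trace_id", ""),
            "dd.span_id": getattr(record, "dd_span_id", ""),
        })
```

With the filter and formatter attached to a handler, every log line carries the same IDs as its APM trace, which is what makes the latency-spike-to-log-line drill-down possible.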

Accomplishments that we're proud of

  • Near-Zero Latency Overhead: The observability instrumentation is non-blocking (async), so the user experience stays snappy.
  • Bank-Grade Security: Successfully preventing PII leakage and basic prompt injection attempts gave me a lot of confidence in the system's robustness.
  • Unified Dashboard: Seeing the "Single Pane of Glass" in Datadog—where I can see costs, errors, and traces in one view—was a huge win.

What we learned

  1. Observability is a Feature: It shouldn't be an afterthought. Building with observability in mind from day one made debugging the integration 10x faster.
  2. The Power of APM: Seeing the flame graph of a request helped me optimize the async handoffs between the API layer and the Gemini client.
  3. Security First: LLMs are vulnerable to unique vectors like Prompt Injection. Implementing detection logic upstream saves tokens and protects the system.

What's next for LLM Observability Application

  • Evaluation Pipelines: Implementing automated "relevancy" scores for LLM responses using a smaller judge model.
  • Caching Layer: Adding semantic caching (Redis) to reduce costs for identical queries.
  • Multi-Modal Support: Extending the observability metrics to cover image and audio inputs supported by Gemini Pro Vision.