LLM Code Review Service with RAG & Observability

Audience: Engineering Leadership / Tech Team


Inspiration

As AI transitions from standalone tools to embedded infrastructure, we realized that getting an LLM to work is the easy part—keeping it running reliably and cost-effectively at scale is the real challenge. Inference costs can rival training costs, yet most AI demos skip observability entirely. We wanted to build something production-ready from day one.


What it does

An AI-powered code review service that automatically reviews GitHub pull requests using Gemini 2.5 Pro with RAG (Retrieval-Augmented Generation) for full codebase context. It:

  • Analyzes PRs with awareness of the entire codebase, not just changed lines
  • Scans for security issues (secrets, PII, prompt injection) before LLM processing
  • Posts actionable review comments directly to GitHub PRs
  • Tracks every metric in Datadog: cost per review, token usage, RAG quality, and errors

How we built it

  1. Started with the webhook — FastAPI endpoint to receive GitHub PR events with HMAC signature verification

  2. Added the LLM — Integrated Vertex AI Gemini 2.5 Pro to analyze code diffs and generate reviews

  3. Built the RAG pipeline — FAISS vector search + Vertex AI embeddings to give the LLM full codebase context

  4. Layered in security — Pre-LLM scanning for secrets, PII, and prompt injection attempts

  5. Instrumented everything — Datadog metrics for cost, tokens, latency, and RAG quality

  6. Deployed serverless — Cloud Run for auto-scaling and cost efficiency
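
Step 1's signature check can be sketched with the standard library alone. GitHub signs each webhook body with HMAC-SHA256 and sends the result in the X-Hub-Signature-256 header as "sha256=<hex digest>". The function name and the way the secret is passed in are assumptions for illustration, not the project's actual code.

```python
import hashlib
import hmac


def verify_github_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Return True if the X-Hub-Signature-256 header matches the raw request body."""
    if not signature_header or not signature_header.startswith("sha256="):
        return False
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, so it does not leak how many
    # leading characters of the signature were correct.
    return hmac.compare_digest(expected.encode("utf-8"), signature_header.encode("utf-8"))
```

In a FastAPI handler this would run against the raw request bytes before any JSON parsing, because re-serializing the payload can change the bytes and invalidate the signature.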


Challenges we ran into

  • RAG tuning: Finding the right chunk size and similarity threshold to retrieve relevant context without overwhelming the LLM's context window
  • Cost visibility: LLM APIs don't return cost directly—we had to calculate it from token counts and model pricing
  • Webhook reliability: Handling GitHub's ping events, retry logic, and signature verification correctly
  • Cold starts: Cloud Run scale-to-zero means first requests are slow; balancing cost vs. latency
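
The cost-visibility point above comes down to simple arithmetic on token counts. The sketch below shows one way to do it. ModelPricing and estimate_cost_usd are hypothetical names, and the rates used in the usage example are placeholders, not actual Gemini prices; real rates would come from the current Vertex AI price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPricing:
    """Per-million-token rates in USD (values are supplied by the caller)."""
    input_usd_per_million: float
    output_usd_per_million: float


def estimate_cost_usd(prompt_tokens: int, completion_tokens: int, pricing: ModelPricing) -> float:
    """Estimate the cost of one LLM call from its token counts."""
    if prompt_tokens < 0 or completion_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = (
        prompt_tokens * pricing.input_usd_per_million
        + completion_tokens * pricing.output_usd_per_million
    )
    return total / 1_000_000


# Usage with placeholder rates (NOT real prices):
placeholder = ModelPricing(input_usd_per_million=2.0, output_usd_per_million=8.0)
cost = estimate_cost_usd(prompt_tokens=10_000, completion_tokens=1_000, pricing=placeholder)
```

Keeping the rates in a separate object makes it easy to update prices or compare models without touching the metric code.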

Accomplishments that we're proud of

  • Full observability from day one: Every review is tracked—cost, latency, token usage, RAG quality
  • Security-first design: Pre-LLM scanning catches secrets and PII before they hit the model
  • Production-ready architecture: Not a demo—this handles real PRs with real error handling
  • Unit economics visibility: We know exactly what each review costs (~$0.02) and can optimize

What we learned

  • Observability isn't optional for production AI—it's essential for cost control and debugging
  • RAG quality metrics (similarity scores, chunks retrieved) are as important as LLM metrics
  • Context window utilization is a key optimization lever (we use ~15% of 128K)
  • The gap between "working demo" and "production system" is mostly error handling and monitoring

What's next for LLM Code Review Observability Service

  • Multi-repo RAG: Index across related repositories for monorepo-style context
  • Review quality scoring: Track whether reviews get resolved or dismissed
  • Cost optimization: Experiment with smaller models for simpler files
  • Caching layer: Avoid re-embedding unchanged files
  • Custom review rules: Let teams define their own code standards
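
One possible shape for the caching layer is a lookup keyed by a hash of the file content, so unchanged files never reach the embedding API again. This is a sketch under assumptions: EmbeddingCache is a hypothetical class, the cache lives in memory only, and embed_fn stands in for whatever function calls the embedding model.

```python
import hashlib
from typing import Callable, Dict, List

Vector = List[float]


class EmbeddingCache:
    """Cache embeddings by SHA-256 of the text so identical content is embedded once."""

    def __init__(self, embed_fn: Callable[[str], Vector]) -> None:
        self._embed_fn = embed_fn
        self._by_hash: Dict[str, Vector] = {}
        self.hits = 0
        self.misses = 0

    def get(self, content: str) -> Vector:
        key = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if key in self._by_hash:
            self.hits += 1
        else:
            # Only unseen content triggers a (potentially billable) embedding call.
            self.misses += 1
            self._by_hash[key] = self._embed_fn(content)
        return self._by_hash[key]
```

The hit and miss counters could be reported alongside the other RAG indexing metrics to show how much embedding work the cache saves.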

Slide 1: Introduction - The Shift to Production AI

LLM Code Review Service: "Prototype vs. Production"

Inference costs can rival training costs at scale

Category         | Tool/Technology               | Purpose
AI/ML            | Vertex AI Gemini 2.5 Pro      | LLM for code review analysis
AI/ML            | Vertex AI text-embedding-005  | Text embeddings for RAG vector search
Vector Search    | FAISS                         | Local vector similarity search for RAG
Web Framework    | FastAPI                       | HTTP API & webhook handling
HTTP Server      | Uvicorn                       | ASGI server for FastAPI
Observability    | Datadog                       | Metrics, traces, and monitoring
Cloud Platform   | Google Cloud Run              | Serverless container hosting
Source Control   | GitHub                        | Code hosting & webhooks
Data Validation  | Pydantic                      | Request/response models
Language         | Python 3.11+                  | Runtime

Slide 2: The Solution - Context at Scale

Key features include:

  1. RAG-Powered Context: Uses in-memory vector search (FAISS) to understand the entire codebase, not just the changed lines. This lets it catch changes that would break code elsewhere in the repository.
  2. Multi-Language Support: Handles Python, Go, JS, and more out of the box.
  3. Proactive Security: Before the code even hits the LLM, scans run for secrets, PII, and prompt injection attempts.
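
A pre-LLM scan of the kind described in item 3 can be sketched as a set of regular expressions applied to the added lines of a diff. The patterns below are illustrative assumptions, not the project's rule set: one well-known AWS access key ID shape, a private-key header, a simple email match as a stand-in for PII, and one common prompt-injection phrase. A real scanner would need a much broader list.

```python
import re
from typing import Dict, List, Tuple

# Illustrative patterns only (assumed for this sketch).
PATTERNS: Dict[str, "re.Pattern[str]"] = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "email_address": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "prompt_injection": re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
}


def scan_diff(diff_text: str) -> List[Tuple[str, int]]:
    """Return (finding kind, 1-based diff line number) for each match on an added line."""
    findings: List[Tuple[str, int]] = []
    for line_no, line in enumerate(diff_text.splitlines(), start=1):
        # Only lines added by the PR are scanned; "+++" is the file header, not content.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        for kind, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((kind, line_no))
    return findings
```

The length of the returned list is the kind of value that would feed a security findings metric, and a non-empty result could block or redact the diff before it is sent to the model.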

Slide 3: Observability - The "Special Sauce"

In production, inference at scale is expensive. We can track:

  • Cost per Review: Know exactly how much each PR costs.
  • Token Efficiency: Monitor prompt vs. completion tokens to optimize context window.
  • Quality Metrics: Track errors and RAG retrieval relevance.

Example stats from my experiments:

  • Average review cost: ~$0.07
  • Average LLM duration: ~6 mins
  • Context utilization: 15% of 128K window

This gives confidence to scale with full visibility into performance and unit economics.
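
To make the tracking concrete, here is a minimal sketch of emitting a gauge in the DogStatsD text format (name:value|g|#tag:value) over UDP to a local Datadog Agent on its default port 8125. The project may well use an official Datadog client instead; the metric names and tags shown are hypothetical.

```python
import socket
from typing import Dict, Optional


def format_gauge(name: str, value: float, tags: Optional[Dict[str, str]] = None) -> str:
    """Build a DogStatsD gauge datagram, e.g. 'my.metric:1.5|g|#env:dev'."""
    datagram = f"{name}:{value}|g"
    if tags:
        # Tags are sorted only to make the output deterministic.
        datagram += "|#" + ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
    return datagram


def send_gauge(name: str, value: float, tags: Optional[Dict[str, str]] = None,
               host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one gauge to a DogStatsD listener; UDP is fire-and-forget."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(format_gauge(name, value, tags).encode("utf-8"), (host, port))


# Hypothetical metric name and tag, for illustration:
line = format_gauge("llm.review.cost_usd", 0.07, {"repo": "demo"})
```

Because UDP does not wait for a reply, emitting metrics this way adds very little latency to the review path.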

LLM Token Metrics (Orange)

Tokens directly drive cost and latency. Understanding the prompt-to-completion ratio reveals whether you're sending too much context or getting verbose responses.

  • Token Usage (Prompt vs. Completion): Prompt tokens are what you send (code diff + RAG context + system prompt). Completion tokens are the review response. A healthy ratio is typically 10:1 to 20:1 (prompts much larger than completions).
  • Context Utilization %: The percentage of Gemini's 128K context window you're using. If consistently low (<20%), you have room to add more context. If high (>80%), you risk truncation on large PRs.
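
The two token metrics can be derived from the counts returned with each LLM response. The sketch below assumes utilization is measured as prompt tokens over the 128K window quoted in the text, and uses the 20% and 80% thresholds given there. The function name and status labels are hypothetical.

```python
from typing import Dict, Optional, Union

CONTEXT_WINDOW_TOKENS = 128_000  # window size quoted in the text


def token_metrics(prompt_tokens: int, completion_tokens: int,
                  window: int = CONTEXT_WINDOW_TOKENS) -> Dict[str, Union[float, str, None]]:
    """Compute context utilization and the prompt-to-completion ratio for one review."""
    utilization_pct = 100.0 * prompt_tokens / window
    # Ratio is undefined when the model returned no completion tokens.
    ratio: Optional[float] = prompt_tokens / completion_tokens if completion_tokens else None
    if utilization_pct < 20:
        status = "headroom"          # room to add more context
    elif utilization_pct > 80:
        status = "truncation_risk"   # large PRs may not fit
    else:
        status = "ok"
    return {"utilization_pct": utilization_pct, "prompt_completion_ratio": ratio, "status": status}
```

For example, 19,200 prompt tokens and 1,200 completion tokens give 15% utilization and a 16:1 ratio, which falls inside the 10:1 to 20:1 range described above.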

RAG Retrieval (Purple)

RAG quality directly impacts review accuracy. Poor retrieval = irrelevant context = hallucinated issues or missed bugs.

  • RAG Retrievals Count: How many times the system searched for relevant code context. Should roughly match PR review count.
  • RAG Retrieval Quality (Max Similarity): The highest cosine similarity score from vector search (0–1). Scores >0.7 indicate highly relevant context was found. Low scores (<0.5) mean the retrieved code may not be useful.
  • Chunks Retrieved (per event): Number of code chunks injected into the prompt. Too few = missing context. Too many = noise and wasted tokens.
  • Context Tokens (per event): Token count of retrieved context. Balance between providing enough context and staying within budget/limits.
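
The retrieval metrics are easier to reason about with a small example. The sketch below is a brute-force, pure-Python stand-in for the FAISS search, not the FAISS API: it scores every chunk by cosine similarity, keeps the top k above a threshold, and reports the max similarity. The function names and the default of 0.5 for min_similarity are assumptions taken from the thresholds quoted above.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query: Sequence[float],
             chunks: Sequence[Tuple[str, Sequence[float]]],
             top_k: int = 3,
             min_similarity: float = 0.5) -> Tuple[List[Tuple[str, float]], float]:
    """Return (kept chunks as (id, score), max similarity over all chunks)."""
    scored = sorted(((cosine_similarity(query, vec), cid) for cid, vec in chunks), reverse=True)
    # Chunks below the threshold are dropped so they do not add noise or tokens.
    kept = [(cid, score) for score, cid in scored[:top_k] if score >= min_similarity]
    max_similarity = scored[0][0] if scored else 0.0
    return kept, max_similarity
```

The number of kept chunks corresponds to the "Chunks Retrieved" metric and the second return value to "Max Similarity". Raising min_similarity trades context coverage for fewer wasted tokens.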

RAG Indexing & Embeddings (Pink)

The RAG pipeline must index code before it can retrieve it. Indexing failures = blind spots in code review.

  • RAG Indexing Events: Count of codebase indexing operations. Spikes indicate repo updates being processed.
  • Files & Chunks (per event): How many files were indexed and how they were chunked. Large repos will show more chunks. Useful for capacity planning.
  • Embedding API Calls (Success/Failed): Vertex AI embedding API reliability. Failed calls mean code wasn't indexed. Any red bars here need investigation.

Security & Errors (Grey)

Pre-LLM security scanning protects against sending secrets to the model and catches issues before they reach production.

  • Security Scan Findings: Count of detected secrets, PII, or prompt injection attempts in PR diffs. High numbers may indicate developer education needed.
  • Review Errors: Failed review attempts (API errors, timeouts, parsing failures). Should be near zero. Spikes indicate service health issues.

Metric Pattern               | What It Means
High cost, low similarity    | Sending lots of context but it's not relevant; tune RAG
Low context utilization      | Room to add more files or history to improve reviews
Embedding failures           | Vertex AI quota/connectivity issues; check GCP console
Security findings spike      | New developer or repo with different practices
Cost variance between repos  | Some codebases need different review strategies
Slide 4: Architecture & Stack

Request Flow:

  1. GitHub Webhook → Triggers on PR open/update
  2. FastAPI → Receives and validates webhook
  3. Security Scan → Checks for secrets, PII, prompt injection
  4. RAG Context → Retrieves relevant code from FAISS index
  5. Gemini LLM → Generates code review
  6. GitHub API → Posts review comments to PR
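
Steps 1 and 2 imply a routing decision on each delivery: GitHub names the event in the X-GitHub-Event header and, for pull requests, puts the action in the payload. A minimal sketch of that decision follows. The set of actions treated as "open/update" and the return labels are assumptions, since the text does not say which actions the service handles.

```python
from typing import Any, Dict

# Assumed mapping of "PR open/update" to GitHub pull_request actions.
REVIEW_ACTIONS = {"opened", "synchronize"}


def route_event(event_name: str, payload: Dict[str, Any]) -> str:
    """Decide what to do with a webhook delivery: 'pong', 'review', or 'ignored'."""
    if event_name == "ping":
        # GitHub sends a ping when the webhook is first configured.
        return "pong"
    if event_name == "pull_request" and payload.get("action") in REVIEW_ACTIONS:
        return "review"
    # Everything else (closed PRs, labels, other events) is acknowledged but skipped.
    return "ignored"
```

Returning quickly for ignored events keeps the expensive steps (security scan, RAG retrieval, LLM call) reserved for deliveries that actually need a review.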

External Services:

  • Datadog — Metrics and observability
  • FAISS — Vector similarity search
  • Vertex AI — Gemini 2.5 Pro + Embeddings

Slide 5: Conclusion

The deployment is ready for testing

Resource               | URL
Running App (Health)   | https://llm-code-review-blz5bvu6kq-uc.a.run.app/health
Running App            | https://llm-code-review-blz5bvu6kq-uc.a.run.app/health
Observability Datadog  | https://p.us5.datadoghq.com/sb/6134f236-e075-11f0-bac8-b24f51c75d5d-8eb6a42c48d48e6171c5b72d9037c42c
Eval Test Cases PR     | https://github.com/random-droid/eval-test-cases/pull/1
Project Codebase       | https://github.com/random-droid/llm-observability-service
