Inspiration

AI systems fail silently. A chatbot goes live, gets confused by multi-topic questions, ignores follow-ups, or hallucinates features that don't exist — and the developer only finds out when users complain. Debugging it means manually reading hundreds of conversation logs, guessing at patterns, and writing test cases from scratch. That process takes hours. I wanted to build something that does it in 2 minutes, autonomously.

What it does

Sentinel is an autonomous AI quality engineer. You give it a project name, and it investigates a deployed chatbot end-to-end and delivers a verdict with no human intervention required.

The 5-step pipeline:

  1. OBSERVE — Connects to Arize Phoenix Cloud and pulls all conversation traces
  2. CLUSTER & HYPOTHESIZE — Sends sampled traces to Gemini 2.5 Flash, which identifies the failure pattern and forms a falsifiable hypothesis
  3. EXPERIMENT — Runs 10 targeted questions against the chatbot and measures the real failure rate
  4. VERDICT — Gemini evaluates the hypothesis against experimental results and returns a CRITICAL/HIGH/MEDIUM/LOW severity verdict with root cause + fix recommendation
  5. STREAM — Every step streams live to the frontend via SSE so you watch the investigation unfold in real time
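Step 5 can be illustrated with a minimal sketch of how a step result might be serialized into a Server-Sent Events frame. The names sse_frame and stream_steps and the payload fields are hypothetical, chosen for illustration; this uses only the standard library and is not Sentinel's actual code.

```python
import json
from typing import Iterable, Iterator, Tuple


def sse_frame(step: str, payload: dict) -> str:
    """Serialize one pipeline step as a single SSE frame (hypothetical helper)."""
    # json.dumps escapes newlines inside strings, so the body stays on one line.
    body = json.dumps({"step": step, **payload})
    # An SSE frame is a "data:" line terminated by a blank line.
    return f"data: {body}\n\n"


def stream_steps(results: Iterable[Tuple[str, dict]]) -> Iterator[str]:
    """Yield one frame per completed step, in order."""
    for step, payload in results:
        yield sse_frame(step, payload)
```

In a FastAPI backend, a generator like this would typically be wrapped in a StreamingResponse with the media type text/event-stream, so the browser receives each frame as soon as it is yielded.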

How we built it

  • LLM: Gemini 2.5 Flash (google-genai==1.16.0, thinking_budget=0 for speed)
  • Observability: Arize Phoenix Cloud — stores 301 conversation traces from the patient-chatbot; Sentinel pulls them via the Phoenix Client SDK
  • Backend: FastAPI with Server-Sent Events for live step streaming; asyncio.to_thread + asyncio.wait_for for non-blocking LLM calls with timeouts
  • Retry logic: Custom gemini_generate() wrapper — retries on 429/503, falls back to gemini-2.0-flash automatically
  • Frontend: React (Create React App) — live step cards, confidence badge, verdict card
  • Deploy: Hugging Face Spaces (Docker, port 7860) for backend; Vercel for frontend
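The retry-and-fallback idea in the list above can be sketched as follows. This is a simplified illustration, not the project's gemini_generate() wrapper: ApiError is a stand-in for whatever exception the real client raises, call is a hypothetical callable that takes a model name and a prompt, and the attempt count and delays are assumed values.

```python
import time

RETRYABLE_STATUS = {429, 503}  # rate limited / service unavailable


class ApiError(Exception):
    """Stand-in for an HTTP error raised by an LLM client (assumption)."""

    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


def generate_with_retry(call, prompt,
                        models=("gemini-2.5-flash", "gemini-2.0-flash"),
                        max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Try each model in order, retrying retryable errors with exponential backoff."""
    last_error = None
    for model in models:
        for attempt in range(max_attempts):
            try:
                return call(model, prompt)
            except ApiError as err:
                if err.status not in RETRYABLE_STATUS:
                    raise  # non-retryable errors surface immediately
                last_error = err
                if attempt < max_attempts - 1:
                    # Backoff doubles each time: base_delay, 2x, 4x, ...
                    sleep(base_delay * 2 ** attempt)
        # Attempts exhausted for this model: fall through to the next one.
    if last_error is None:
        raise ValueError("no models given")
    raise last_error
```

Passing sleep as a parameter keeps the function easy to test without real delays.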

The broken chatbot (patient_chatbot.py) has 3 intentional failure patterns: multi-topic confusion, ignored follow-ups, and hallucination — which Sentinel detects and diagnoses without being told what to look for.

Challenges we ran into

  • SSE + async timing: Getting FastAPI SSE to flush each step in real time while Gemini calls were running required careful use of asyncio.to_thread and manual timeout handling
  • Gemini rate limits: The hackathon free tier hits 429s under load — built a retry wrapper with exponential backoff and an automatic model fallback
  • Structured JSON from LLM: Getting Gemini to reliably return structured hypothesis and verdict JSON (not markdown-wrapped, not truncated) required prompt engineering and a clean stripping step before json.loads()
  • HF Spaces cold starts: Hugging Face Spaces Docker containers sleep — had to add a /health endpoint and configure the frontend to handle the initial latency
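The stripping step mentioned for structured JSON might look roughly like the sketch below. The function name parse_llm_json and the example fields are hypothetical. It only removes a surrounding markdown code fence; truncated output still fails in json.loads(), which is the point at which a caller could retry.

```python
import json
import re

# Matches a reply wrapped in a markdown code fence, optionally tagged "json".
# The fence is written as `{3} (three backticks) to keep this pattern readable.
_FENCED = re.compile(r"^`{3}(?:json)?\s*\n(.*?)\n?`{3}\s*$", re.DOTALL)


def parse_llm_json(raw: str) -> dict:
    """Strip an optional markdown fence, then parse the remainder as JSON."""
    text = raw.strip()
    match = _FENCED.match(text)
    if match:
        text = match.group(1)
    # Raises json.JSONDecodeError if the reply is truncated or not JSON.
    return json.loads(text)
```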

Accomplishments that we're proud of

  • Fully autonomous — zero human input after clicking Investigate
  • Experimental validation — Sentinel doesn't just theorize, it proves the hypothesis by running live tests and measuring failure rate (90–100% confirmed)
  • End-to-end deployed — live frontend + backend, not a notebook demo
  • Production-grade reliability — retry logic, fallback models, timeout guards, and a health endpoint that checks both API keys on every request

What we learned

  • Arize Phoenix is a genuinely powerful observability layer: having 301 real traces to analyze made the hypothesis step far more accurate than it would have been with synthetic data
  • Agentic loops need hard timeouts — without asyncio.wait_for, one slow Gemini call blocks the entire SSE stream
  • The "observe → hypothesize → experiment → verdict" loop mirrors how a senior engineer actually debugs production issues; encoding that structure into the agent made the outputs far more useful than a simple log summarizer
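The hard-timeout lesson can be sketched with asyncio.to_thread and asyncio.wait_for, both named earlier. The helper name call_with_timeout, the default timeout, and the fallback value are assumptions for illustration. Note that a timeout here only stops the waiting: the worker thread itself cannot be cancelled and runs to completion in the background.

```python
import asyncio


async def call_with_timeout(blocking_fn, *args, timeout_s: float = 30.0, fallback=None):
    """Run a blocking call in a worker thread and give up after timeout_s seconds."""
    try:
        # to_thread keeps the event loop free; wait_for bounds how long we wait.
        return await asyncio.wait_for(
            asyncio.to_thread(blocking_fn, *args), timeout=timeout_s
        )
    except asyncio.TimeoutError:
        # The stream can move on and report the step as timed out.
        return fallback
```

With a guard like this, a slow LLM call costs at most timeout_s seconds before the next SSE step can be sent.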

What's next for Sentinel — Autonomous AI Quality Engineer

  • Multi-project support — investigate multiple chatbots in parallel
  • Auto-patch mode — Sentinel generates and proposes a code fix, not just a recommendation
  • Scheduled sweeps — run nightly, alert on regression
  • Support for more observability backends — LangSmith, Weights & Biases, custom trace stores
  • Slack/email alerts — push the verdict to your team automatically
