Inspiration
AI systems fail silently. A chatbot goes live, gets confused by multi-topic questions, ignores follow-ups, or hallucinates features that don't exist — and the developer only finds out when users complain. Debugging it means manually reading hundreds of conversation logs, guessing at patterns, and writing test cases from scratch. That process takes hours. I wanted to build something that does it in 2 minutes, autonomously.
What it does
Sentinel is an autonomous AI quality engineer. Give it a project name, and it investigates a deployed chatbot end-to-end and delivers a verdict, with no human intervention required.
The 5-step pipeline:
- OBSERVE — Connects to Arize Phoenix Cloud and pulls all conversation traces
- CLUSTER & HYPOTHESIZE — Sends sampled traces to Gemini 2.5 Flash, which identifies the failure pattern and forms a falsifiable hypothesis
- EXPERIMENT — Runs 10 targeted questions against the chatbot and measures the real failure rate
- VERDICT — Gemini evaluates the hypothesis against experimental results and returns a CRITICAL/HIGH/MEDIUM/LOW severity verdict with root cause + fix recommendation
- STREAM — Every step streams live to the frontend via SSE so you watch the investigation unfold in real time
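The steps above could be wired together as an async generator that yields one SSE frame per step. This is only a minimal sketch: the function names, event fields, and stub values are illustrative assumptions, not Sentinel's actual code, and the real stages would call Phoenix and Gemini instead of using placeholders.

```python
import asyncio
import json
from typing import AsyncIterator, List


def sse_event(step: str, payload: dict) -> str:
    """Format one Server-Sent Events frame: a 'data:' line plus a blank line."""
    return f"data: {json.dumps({'step': step, **payload})}\n\n"


async def run_investigation(project: str) -> AsyncIterator[str]:
    """Yield one SSE frame per pipeline step (all stages are stubs here)."""
    traces = [{"id": i} for i in range(3)]  # OBSERVE: placeholder traces
    yield sse_event("observe", {"project": project, "traces": len(traces)})

    hypothesis = "placeholder hypothesis"  # CLUSTER & HYPOTHESIZE: stub
    yield sse_event("hypothesize", {"hypothesis": hypothesis})

    failures, total = 9, 10  # EXPERIMENT: made-up numbers for illustration
    yield sse_event("experiment", {"failure_rate": failures / total})

    yield sse_event("verdict", {"severity": "HIGH"})  # VERDICT: stub


async def collect_events(project: str) -> List[str]:
    """Drain the generator into a list, e.g. for testing without a server."""
    return [event async for event in run_investigation(project)]
```

In a FastAPI app, a generator like run_investigation would typically be passed to a StreamingResponse with media type text/event-stream, so each yielded frame reaches the browser as soon as it is produced.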
How we built it
- LLM: Gemini 2.5 Flash (google-genai==1.16.0, thinking_budget=0 for speed)
- Observability: Arize Phoenix Cloud, which stores 301 conversation traces from the patient-chatbot; Sentinel pulls them via the Phoenix Client SDK
- Backend: FastAPI with Server-Sent Events for live step streaming; asyncio.to_thread and asyncio.wait_for for non-blocking LLM calls with timeouts
- Retry logic: Custom gemini_generate() wrapper that retries on 429/503 and falls back to gemini-2.0-flash automatically
- Frontend: React (Create React App) with live step cards, confidence badge, verdict card
- Deploy: Hugging Face Spaces (Docker, port 7860) for backend; Vercel for frontend
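A retry wrapper of the kind listed above might look roughly like the following. This is a sketch under stated assumptions, not the project's gemini_generate(): the TransientAPIError class, the call_model callable, and the default attempt count and delays are all hypothetical, and the model call is injected so the logic can be exercised without network access.

```python
import time
from typing import Callable, Optional, Sequence


class TransientAPIError(Exception):
    """Hypothetical stand-in for an SDK error that carries an HTTP status."""

    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


def generate_with_retry(
    call_model: Callable[[str, str], str],
    prompt: str,
    models: Sequence[str] = ("gemini-2.5-flash", "gemini-2.0-flash"),
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try each model in order, retrying 429/503 with exponential backoff."""
    last_error: Optional[Exception] = None
    for model in models:
        for attempt in range(max_attempts):
            try:
                return call_model(model, prompt)
            except TransientAPIError as err:
                if err.status not in (429, 503):
                    raise  # not retryable: surface it immediately
                last_error = err
                if attempt < max_attempts - 1:
                    # Delay doubles each time: base, 2*base, 4*base, ...
                    sleep(base_delay * 2 ** attempt)
        # Every attempt on this model failed; fall back to the next one.
    raise RuntimeError("all models exhausted") from last_error
```

Injecting sleep as a parameter is a design choice for testability: a test can pass a list's append method to record the delays instead of actually waiting.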
The broken chatbot (patient_chatbot.py) has 3 intentional failure patterns:
multi-topic confusion, ignored follow-ups, and hallucination — which Sentinel
detects and diagnoses without being told what to look for.
Challenges we ran into
- SSE + async timing: Getting FastAPI SSE to flush each step in real time while Gemini calls were running required careful use of asyncio.to_thread and manual timeout handling
- Gemini rate limits: The hackathon free tier hits 429s under load, so we built a retry wrapper with exponential backoff and an automatic model fallback
- Structured JSON from LLM: Getting Gemini to reliably return structured hypothesis and verdict JSON (not markdown-wrapped, not truncated) required prompt engineering and a clean stripping step before json.loads()
- HF Spaces cold starts: Hugging Face Spaces Docker containers sleep, so we had to add a /health endpoint and configure the frontend to handle the initial latency
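The stripping step before json.loads() can be illustrated with a small helper. This is a sketch of one plausible approach, not the project's actual code; the function names are hypothetical, and it only handles the common case where the model wraps its whole reply in a single markdown code fence.

```python
import json

FENCE = "`" * 3  # built at runtime so this example contains no literal fence


def strip_code_fence(raw: str) -> str:
    """Remove one surrounding markdown code fence, if present."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line, which may carry a tag such as "json".
        first_newline = text.find("\n")
        text = text[first_newline + 1:] if first_newline != -1 else ""
        if text.rstrip().endswith(FENCE):
            text = text.rstrip()[: -len(FENCE)]
    return text.strip()


def parse_llm_json(raw: str) -> dict:
    """Parse a model reply as JSON; raises json.JSONDecodeError if truncated."""
    return json.loads(strip_code_fence(raw))
```

A truncated reply still fails at json.loads(), which is useful: the caller can catch json.JSONDecodeError and retry the request rather than proceed with partial data.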
Accomplishments that we're proud of
- Fully autonomous — zero human input after clicking Investigate
- Experimental validation — Sentinel doesn't just theorize, it proves the hypothesis by running live tests and measuring failure rate (90–100% confirmed)
- End-to-end deployed — live frontend + backend, not a notebook demo
- Production-grade reliability — retry logic, fallback models, timeout guards, and a health endpoint that checks both API keys on every request
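The experimental validation mentioned above comes down to asking a set of targeted questions and counting how many answers fail. A minimal sketch, assuming hypothetical ask and is_failure callables; how Sentinel actually judges an individual answer is not described here.

```python
from typing import Callable, Sequence


def measure_failure_rate(
    ask: Callable[[str], str],
    questions: Sequence[str],
    is_failure: Callable[[str, str], bool],
) -> float:
    """Return the fraction of questions whose answer is judged a failure."""
    if not questions:
        raise ValueError("need at least one question")
    # ask(q) queries the chatbot; is_failure(q, answer) judges the reply.
    failures = sum(1 for q in questions if is_failure(q, ask(q)))
    return failures / len(questions)
```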
What we learned
- Arize Phoenix is a genuinely powerful observability layer: having 301 real traces to analyze made the hypothesis step far more accurate than synthetic data would have
- Agentic loops need hard timeouts: without asyncio.wait_for, one slow Gemini call blocks the entire SSE stream
- The "observe → hypothesize → experiment → verdict" loop mirrors how a senior engineer actually debugs production issues; encoding that structure into the agent made the outputs far more useful than a simple log summarizer
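The hard-timeout pattern from the list above combines asyncio.to_thread (to keep a blocking call off the event loop) with asyncio.wait_for (to bound how long the caller waits). The sketch below uses a hypothetical slow_llm_call as a stand-in for a blocking SDK call; the names and delays are illustrative only.

```python
import asyncio
import time
from typing import Optional


def slow_llm_call(prompt: str, delay: float) -> str:
    """Stand-in for a blocking SDK call (hypothetical; just sleeps)."""
    time.sleep(delay)
    return f"answer to {prompt!r}"


async def call_with_timeout(
    prompt: str, delay: float, timeout: float
) -> Optional[str]:
    """Run the blocking call in a worker thread; give up after `timeout`."""
    try:
        return await asyncio.wait_for(
            asyncio.to_thread(slow_llm_call, prompt, delay),
            timeout=timeout,
        )
    except asyncio.TimeoutError:
        # The caller is unblocked here, but the worker thread is not killed;
        # it runs to completion in the background.
        return None
```

One caveat with this design: the timeout frees the event loop and the SSE stream, but it cannot interrupt the thread itself, so a timed-out call still consumes a worker until it returns.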
What's next for Sentinel — Autonomous AI Quality Engineer
- Multi-project support — investigate multiple chatbots in parallel
- Auto-patch mode — Sentinel generates and proposes a code fix, not just a recommendation
- Scheduled sweeps — run nightly, alert on regression
- Support for more observability backends — LangSmith, Weights & Biases, custom trace stores
- Slack/email alerts — push the verdict to your team automatically