Sentinel — Autonomous AI Quality Engineer

Inspiration

AI systems fail silently. A chatbot goes live, gets confused by multi-topic questions, ignores follow-ups, or hallucinates features that don't exist — and the developer only finds out when users complain. Debugging it means manually reading hundreds of conversation logs, guessing at patterns, and writing test cases from scratch. That process takes hours. I wanted to build something that does it in 2 minutes, autonomously.

What it does

Sentinel is an autonomous AI quality engineer. You give it a project name, and it investigates a deployed chatbot end-to-end and delivers a verdict, with no human intervention required.

The 5-step pipeline:

OBSERVE — Connects to Arize Phoenix Cloud and pulls all conversation traces
CLUSTER & HYPOTHESIZE — Sends sampled traces to Gemini 2.5 Flash, which identifies the failure pattern and forms a falsifiable hypothesis
EXPERIMENT — Runs 10 targeted questions against the chatbot and measures the real failure rate
VERDICT — Gemini evaluates the hypothesis against experimental results and returns a CRITICAL/HIGH/MEDIUM/LOW severity verdict with root cause + fix recommendation
STREAM — Every step streams live to the frontend via SSE so you watch the investigation unfold in real time
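
To make the streaming step concrete, here is a minimal sketch of how each pipeline step could be framed as a Server-Sent Events message. The `sse_event` and `run_pipeline` names, the `step` event name, and the payload fields are assumptions for illustration, not Sentinel's actual code, and the real work of each step is replaced by a placeholder.

```python
import asyncio
import json
from typing import AsyncIterator

# Step names taken from the pipeline list above (STREAM is the transport itself).
STEPS = ["OBSERVE", "CLUSTER & HYPOTHESIZE", "EXPERIMENT", "VERDICT"]


def sse_event(step: str, payload: dict) -> str:
    """Frame one pipeline step as a single SSE message (hypothetical schema)."""
    data = json.dumps({"step": step, **payload})
    # An SSE message is a set of "field: value" lines ended by a blank line.
    return f"event: step\ndata: {data}\n\n"


async def run_pipeline(project: str) -> AsyncIterator[str]:
    """Yield one SSE message as soon as each step finishes."""
    for name in STEPS:
        await asyncio.sleep(0)  # placeholder for the real work of the step
        yield sse_event(name, {"project": project, "status": "done"})


async def collect(project: str) -> list[str]:
    """Drain the stream into a list, which is handy for inspection."""
    return [event async for event in run_pipeline(project)]


if __name__ == "__main__":
    for event in asyncio.run(collect("patient-chatbot")):
        print(event, end="")
```

In a FastAPI app, a generator like `run_pipeline` could be handed to `StreamingResponse` with `media_type="text/event-stream"`, so the browser receives each step as it completes.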

How we built it

LLM: Gemini 2.5 Flash (google-genai==1.16.0, thinking_budget=0 for speed)
Observability: Arize Phoenix Cloud — stores 301 conversation traces from the patient-chatbot; Sentinel pulls them via the Phoenix Client SDK
Backend: FastAPI with Server-Sent Events for live step streaming; asyncio.to_thread + asyncio.wait_for for non-blocking LLM calls with timeouts
Retry logic: Custom gemini_generate() wrapper — retries on 429/503, falls back to gemini-2.0-flash automatically
Frontend: React (Create React App) — live step cards, confidence badge, verdict card
Deploy: Hugging Face Spaces (Docker, port 7860) for backend; Vercel for frontend
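
The retry-and-fallback behaviour described above could look roughly like the sketch below. It is not the project's `gemini_generate()` implementation: `ApiError`, `generate_with_fallback`, the injected `call` function, the attempt count, and the delays are all hypothetical, and the primary model id `gemini-2.5-flash` is an assumption. The model call is injected so the sketch runs without network access.

```python
import random
import time
from typing import Callable, Sequence

RETRYABLE = {429, 503}  # rate limited / service unavailable


class ApiError(Exception):
    """Hypothetical error type carrying an HTTP-style status code."""

    def __init__(self, status: int):
        super().__init__(f"status {status}")
        self.status = status


def generate_with_fallback(
    call: Callable[[str, str], str],
    prompt: str,
    models: Sequence[str] = ("gemini-2.5-flash", "gemini-2.0-flash"),
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try each model in order, retrying retryable errors with backoff."""
    last_error = None
    for model in models:
        for attempt in range(max_attempts):
            try:
                return call(model, prompt)
            except ApiError as err:
                if err.status not in RETRYABLE:
                    raise  # other errors are not worth retrying
                last_error = err
                if attempt < max_attempts - 1:
                    # Exponential backoff with a little jitter.
                    sleep(base_delay * 2**attempt + random.uniform(0, 0.1))
    raise RuntimeError("all models exhausted") from last_error
```

Passing `sleep` as a parameter keeps the backoff testable: a test can supply a no-op instead of actually waiting.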

The broken chatbot (patient_chatbot.py) has 3 intentional failure patterns: multi-topic confusion, ignored follow-ups, and hallucination — which Sentinel detects and diagnoses without being told what to look for.

Challenges we ran into

SSE + async timing: Getting FastAPI SSE to flush each step in real time while Gemini calls were running required careful use of asyncio.to_thread and manual timeout handling
Gemini rate limits: The hackathon free tier hits 429s under load — built a retry wrapper with exponential backoff and an automatic model fallback
Structured JSON from LLM: Getting Gemini to reliably return structured hypothesis and verdict JSON (not markdown-wrapped, not truncated) required prompt engineering and a clean stripping step before json.loads()
HF Spaces cold starts: Hugging Face Spaces Docker containers sleep — had to add a /health endpoint and configure the frontend to handle the initial latency
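
As an illustration of the stripping step mentioned above, this sketch removes a markdown code fence around a model reply before parsing it. The function name `parse_llm_json` and the example payload are hypothetical, and Sentinel's actual cleaning logic may differ. Truncated replies are not repaired; they simply raise `json.JSONDecodeError` so the caller can retry.

```python
import json

# Built programmatically so this sketch contains no literal fence itself.
FENCE = "`" * 3


def parse_llm_json(raw: str) -> dict:
    """Parse JSON from a model reply, tolerating a markdown code fence."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line, which may carry a tag such as "json".
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rstrip()
        # Drop the closing fence if the reply was not cut off before it.
        if text.endswith(FENCE):
            text = text[: -len(FENCE)]
    return json.loads(text)


if __name__ == "__main__":
    reply = FENCE + 'json\n{"severity": "HIGH"}\n' + FENCE
    print(parse_llm_json(reply))
```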

Accomplishments that we're proud of

Fully autonomous — zero human input after clicking Investigate
Experimental validation — Sentinel doesn't just theorize, it proves the hypothesis by running live tests and measuring failure rate (90–100% confirmed)
End-to-end deployed — live frontend + backend, not a notebook demo
Production-grade reliability — retry logic, fallback models, timeout guards, and a health endpoint that checks both API keys on every request

What we learned

Arize Phoenix is a genuinely powerful observability layer: having 301 real traces to analyze made the hypothesis step far more accurate than it would have been with synthetic data
Agentic loops need hard timeouts — without asyncio.wait_for, one slow Gemini call blocks the entire SSE stream
The "observe → hypothesize → experiment → verdict" loop mirrors how a senior engineer actually debugs production issues; encoding that structure into the agent made the outputs far more useful than a simple log summarizer
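
The point about hard timeouts can be sketched with the two asyncio primitives named earlier. `slow_llm_call` is a stand-in for a blocking SDK call, and the delays and timeout values are made up for the example.

```python
import asyncio
import time
from typing import Optional


def slow_llm_call(prompt: str, delay: float) -> str:
    """Stand-in for a blocking SDK call (hypothetical)."""
    time.sleep(delay)
    return f"answer to: {prompt}"


async def call_with_timeout(
    prompt: str, delay: float, timeout: float
) -> Optional[str]:
    """Run the blocking call in a worker thread and give up after `timeout`."""
    try:
        # to_thread keeps the event loop free, so other coroutines
        # (such as the SSE stream) continue to run while this call blocks.
        return await asyncio.wait_for(
            asyncio.to_thread(slow_llm_call, prompt, delay), timeout=timeout
        )
    except asyncio.TimeoutError:
        # The worker thread is not killed; it finishes in the background.
        return None


if __name__ == "__main__":
    print(asyncio.run(call_with_timeout("fast", 0.0, 1.0)))
    print(asyncio.run(call_with_timeout("slow", 0.3, 0.05)))
```

One caveat with this design: the timeout only stops the caller from waiting. The underlying thread still runs to completion, so a client-side request timeout on the SDK call is a useful second guard.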

What's next for Sentinel — Autonomous AI Quality Engineer

Multi-project support — investigate multiple chatbots in parallel
Auto-patch mode — Sentinel generates and proposes a code fix, not just a recommendation
Scheduled sweeps — run nightly, alert on regression
Support for more observability backends — LangSmith, Weights & Biases, custom trace stores
Slack/email alerts — push the verdict to your team automatically

Built With

arize
asyncio
docker
fastapi
gemini 2.5 flash
google-genai
hugging face
javascript
phoenix
python
react
vercel

Updates

Bonamukkala Charan started this project — Jun 11, 2026 12:03 PM EDT
