Inspiration
Long-running AI agents and copilots often fail when a user’s goal changes silently over time. Chat UIs don’t model “intent over time” or produce decisions agents can trust. I wanted to build something that treats Gemini 3 as a temporal reasoning engine rather than a chatbot, and that outputs a deterministic, evidence-backed drift decision that downstream systems can act on (pause, re-plan, escalate). The idea was: what if we could detect intent drift the way we detect anomalies in logs, with traceability and consensus? That led to Intent Drift Radar and the optional Ensemble Mode (low/medium/high thinking in parallel, majority vote, evidence agreement).
What it does
Intent Drift Radar takes a time-ordered signal stream (e.g. Day 1 … Day 5: notes, decisions, declarations) and answers: Did the user’s original intent drift? If so, why, when, and how confidently?
- Single-run analysis: One Gemini 3 call returns baseline intent, current intent, drift yes/no, confidence, evidence tied to specific days, reasoning cards, and a compact drift signature for agent orchestration.
- Ensemble Mode (optional): Three parallel Gemini calls (low / medium / high thinking) with majority voting and evidence bucketing (3/3, 2/3, 1/3 agreement). One consensus result, no extra “meta” model call.
- Judge Mode / Quick Demo: Load a demo dataset and see a cached result instantly so judges can evaluate without waiting for live API calls. A callout offers “Run Ensemble (Live)” for discoverability.
- Traceability: The UI links evidence and reasoning cards to timeline days: hover or click to see which days drove the decision. You can also copy a summary and submit feedback (confirm or reject the drift call).
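The Ensemble Mode consensus described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `RunResult` shape, the tie-breaking rule, and the choice to bucket evidence by cited day are all assumptions made for the example. Median confidence follows the description given later in this write-up.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import median


@dataclass
class RunResult:
    # Hypothetical per-run shape; the real schema carries more fields.
    mode: str                 # "low", "medium", or "high" thinking
    drift: bool
    confidence: float
    evidence_days: list[str]  # e.g. ["Day 2", "Day 4"]


def consensus(runs: list[RunResult]) -> dict:
    """Majority vote on drift, median confidence, evidence bucketed by agreement."""
    if not runs:
        raise ValueError("need at least one successful run")
    yes_votes = sum(r.drift for r in runs)
    # Strict majority; treating a tie as "no drift" is an assumption.
    drift = yes_votes * 2 > len(runs)
    # Count how many runs cited each day (a run counts a day at most once).
    cited = Counter(day for r in runs for day in set(r.evidence_days))
    buckets: dict[str, list[str]] = {}
    for day, n in sorted(cited.items()):
        buckets.setdefault(f"{n}/{len(runs)}", []).append(day)
    return {
        "drift": drift,
        "confidence": median(r.confidence for r in runs),
        "evidence_agreement": buckets,
    }
```

Because everything is computed in-process from the per-run results, no additional model call is needed to arbitrate.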
Built for builders of autonomous agents and copilots who need reliable intent-change detection, not another chat interface.
How I built it
- Backend: FastAPI (Python 3.11). Two endpoints: POST /api/analyze (one Gemini 3 call, configurable thinking level) and POST /api/analyze/ensemble (3 parallel calls, deterministic consensus). The prompt lives in docs/ai-studio/prompt.md; the response is validated with Pydantic, then postprocessed (guardrails, drift signature normalization). Retry-with-repair on invalid JSON; model fallback on 404; per-call timeouts (25s single, 50s per run in ensemble).
- Frontend: React 18 + TypeScript, Vite. Timeline panel (days + signals), analysis panel (drift banner, evidence, reasoning cards, mode label, “prove it” DevTools hint), evidence panel with day refs, feedback form. Ensemble: toggle in settings, optional callout above analysis when viewing cached demo, expandable “Ensemble breakdown” (per-mode table + evidence agreement chips).
- Infra: Terraform for GCP: Cloud Run service (120s request timeout for ensemble), Artifact Registry, Secret Manager for GEMINI_API_KEY. Single Dockerfile serves the built frontend + uvicorn backend.
- Docs: README (quick demo, judge checks, timeout notes), architecture doc, release notes. Judge check script for pre-submit validation.
All built and deployed as a solo developer: backend, frontend, ensemble logic, Terraform, and copy.
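The retry-with-repair step from the backend description might look something like this sketch. It uses a hand-written `validate` function as a stand-in for the Pydantic model so the example stays dependency-free. The `call_model` callable, the required keys, and the [0, 1] confidence range are assumptions for illustration.

```python
import json
from typing import Callable


def validate(payload: object) -> dict:
    """Stand-in for the Pydantic step: check required keys, then apply a guardrail."""
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object")
    if not isinstance(payload.get("drift"), bool):
        raise ValueError("'drift' must be a boolean")
    conf = payload.get("confidence")
    if isinstance(conf, bool) or not isinstance(conf, (int, float)):
        raise ValueError("'confidence' must be a number")
    # Clamp confidence; the [0, 1] range is an assumption.
    payload["confidence"] = min(1.0, max(0.0, float(conf)))
    return payload


def analyze_with_repair(call_model: Callable[[str], str], prompt: str) -> dict:
    """Call the model once; on invalid output, retry once with a repair instruction."""
    raw = call_model(prompt)
    try:
        return validate(json.loads(raw))
    except ValueError as err:  # json.JSONDecodeError subclasses ValueError
        repair_prompt = (
            f"{prompt}\n\nYour previous reply was invalid ({err}). "
            f"Fix this JSON and return only the corrected object:\n{raw}"
        )
        # A second failure propagates to the caller; there is only one retry.
        return validate(json.loads(call_model(repair_prompt)))
```

Passing the model call in as a plain callable keeps the retry logic testable without a live API key.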
Challenges I ran into
- Structured output reliability: Gemini sometimes returned valid-looking JSON that failed Pydantic validation (e.g. extra fields, wrong types). I added a strict schema to the prompt, retry-with-repair (one retry with a “fix this JSON” instruction), and postprocess guardrails (normalize drift signature, clamp confidence) so the API contract stays deterministic.
- Ensemble timeouts: Running 3 Gemini calls in parallel hit the Cloud Run request timeout and sometimes the per-call timeouts. Fixed by increasing the Cloud Run timeout to 120s in Terraform, raising the per-call timeout to 50s for ensemble only, and documenting 504 behavior and curl checks in the README. Partial success (2/3 runs) still returns 200 with consensus.
- Judge Mode without extra calls: Judges needed a fast path without triggering live Gemini. Implemented Quick Demo: load demo dataset, serve cached result from /api/demo, same UI and schema as live. Added X-IDR-Mode header and “prove it” instructions so evaluators can verify demo vs live in DevTools.
- Discoverability of Ensemble: Wanted judges to see “you can also run Ensemble” without changing defaults or auto-calling the API. Added a small callout above the analysis panel when a cached result is shown, with a single “Run Ensemble (Live)” button that calls the ensemble endpoint and replaces the result.
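The ensemble timeout handling with partial success can be sketched with asyncio as below. This assumes an async `call_model` coroutine function (hypothetical) that takes a thinking level. The 50s default comes from the write-up; everything else is illustrative.

```python
import asyncio
from typing import Awaitable, Callable

PER_CALL_TIMEOUT_S = 50.0  # per-run ensemble timeout stated in the write-up


async def run_ensemble(
    call_model: Callable[[str], Awaitable[dict]],
    levels: tuple[str, ...] = ("low", "medium", "high"),
    timeout_s: float = PER_CALL_TIMEOUT_S,
) -> dict[str, dict]:
    """Run one call per thinking level in parallel; keep whichever runs finish."""

    async def one(level: str) -> dict:
        # Each run has its own deadline, so a slow run cannot stall the others.
        return await asyncio.wait_for(call_model(level), timeout=timeout_s)

    outcomes = await asyncio.gather(
        *(one(level) for level in levels), return_exceptions=True
    )
    # Timed-out or failed runs come back as exception objects; drop them.
    return {
        level: out
        for level, out in zip(levels, outcomes)
        if not isinstance(out, BaseException)
    }
```

The caller can then decide how many surviving runs are enough, for example computing consensus when at least two of three remain. The outer Cloud Run request timeout still has to exceed the per-call timeout, or the request returns 504 before any consensus is produced.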
Accomplishments that I'm proud of
- Production-ready contract: Drift signature (IDR:v1|dir=…|span=…|e=…|conf=…), evidence day refs, and reasoning cards give agents and humans a clear, parseable decision layer, not free-form chat.
- Ensemble consensus without a fourth call: Consensus is computed in-process (majority vote, median confidence, evidence bucketed by agreement). No extra “arbitrator” model; the UI shows both consensus and per-run breakdown.
- Full traceability: Evidence and reasoning cards link to timeline days; pinned/hover state shows which days drove the decision. Copy summary and feedback (confirm/reject) close the loop for evaluation.
- Deployed and evaluable: Live app on Cloud Run, Quick Demo for one-click judge flow, health/version endpoints, Terraform for reproducible infra, and a judge check script so the project can be validated end-to-end.
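To show how a downstream agent might consume the drift signature, here is a small parser sketch. It relies only on the pipe-delimited key=value layout shown above and treats every value as an opaque string, since the write-up does not spell out the value formats. The function name `parse_signature` is hypothetical.

```python
def parse_signature(sig: str) -> dict[str, str]:
    """Split a pipe-delimited drift signature into its version and key=value fields."""
    head, *fields = sig.strip().split("|")
    prefix, sep, version = head.partition(":")
    if prefix != "IDR" or not sep or not version:
        raise ValueError(f"not a drift signature: {sig!r}")
    parsed = {"version": version}
    for field in fields:
        key, sep, value = field.partition("=")
        if not sep or not key:
            raise ValueError(f"malformed field: {field!r}")
        parsed[key] = value  # values stay opaque strings; no format is assumed
    return parsed
```

An orchestrator could branch on the parsed fields (for example, pausing or re-planning) without ever touching the free-form reasoning text.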
What I learned
- Treating the model as a reasoning engine changes the design. Once I stopped thinking “chat” and started thinking “temporal decision over a signal stream,” prompt structure, schema, and postprocessing became the main levers for reliability.
- Timeouts and parallelism need to be tuned together. Ensemble’s 3 parallel calls required raising both per-call timeout (so “high” thinking could finish) and the Cloud Run request timeout so the whole request didn’t 504 before consensus.
- Judge experience matters. Quick Demo + cached result + one “Run Ensemble (Live)” callout let evaluators see the product in seconds and still try the advanced path without changing default behavior.
What's next for Intent Drift Radar
- Feedback loop in the pipeline: Use confirm/reject + comment from the UI to refine prompts or fine-tune (e.g. store feedback and periodically retrain or adjust few-shot examples).
- More signal types and windows: Support different baseline/current window sizes and signal types (e.g. actions, errors) for richer temporal reasoning.
- Agent SDK: Small client library (e.g. “call Intent Drift Radar with this signal buffer, get drift decision + signature”) so other apps and agents can embed drift detection without building their own UI.
- Observability: Structured logging and optional metrics (e.g. drift rate, confidence distribution) for production operators.
Built With
- fastapi
- gar
- google-cloud-run
- google-gemini-3-pro-api
- react-18
- pydantic
- python-3.11
- secret-manager
- typescript
- vite
- terraform