Crucible

## Inspiration

Everyone can prompt an LLM to write SQL. Almost nobody can prove it got better. We wanted an agent that doesn't just generate text-to-SQL — it measures itself against human gold answers, reads its own failures, and fixes the right thing next. Not "looks plausible." Real SQL, executed against a real database, scored against gold. The crucible is the test: only mutations that survive measurement get to stay.

## What it does

Crucible builds and self-optimizes text-to-SQL agents. Give it a database (Spider/BIRD-style) and it runs a closed reflexive loop:

1. Draft a candidate text-to-SQL agent.
2. Score it by execution-match against human gold SQL on a held-out test split: real SQL run against a real SQLite database, no estimates or mocks.
3. Introspect its OWN failing traces by reading them back through the Arize Phoenix MCP server.
4. Hypothesize: form one atomic hypothesis about the dominant failure cluster.
5. Mutate the agent to address exactly that cluster.
6. Re-score and climb a leaderboard until it clears the quality bar.

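ML hygiene is enforced, not decorative: a mutation is accepted only if it improves the score on the TRAIN split, results are reported on a HELD-OUT test set that is never optimized against, the best-so-far agent is always kept, and the loop stops early after a set number of rounds without improvement (patience).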
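The acceptance rules above can be sketched as a small hill-climbing loop. This is a minimal illustration, not Crucible's actual code: `climb`, `ClimbResult`, and the `mutate` and `score` callables are hypothetical names, and checking the quality bar against the train score (so the test split never influences a decision) is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Callable

Agent = Any  # hypothetical: whatever object represents a candidate agent


@dataclass
class ClimbResult:
    best_agent: Agent
    train_score: float
    test_score: float
    accepted: int


def climb(
    draft: Agent,
    mutate: Callable[[Agent], Agent],
    score: Callable[[Agent, str], float],
    bar: float = 1.0,
    patience: int = 3,
    max_rounds: int = 20,
) -> ClimbResult:
    """Hill-climb on the train split; touch the test split only for the final report."""
    best = draft
    best_train = score(best, "train")
    accepted = 0
    stale = 0  # consecutive rounds without an accepted mutation
    for _ in range(max_rounds):
        if best_train >= bar or stale >= patience:
            break  # quality bar cleared, or patience exhausted
        candidate = mutate(best)
        cand_train = score(candidate, "train")
        if cand_train > best_train:
            # Accept only strict improvements on train; keep best-so-far otherwise.
            best, best_train = candidate, cand_train
            accepted += 1
            stale = 0
        else:
            stale += 1
    # The held-out score is computed once, after all decisions are made.
    return ClimbResult(best, best_train, score(best, "test"), accepted)
```

Because `score(..., "test")` is called only in the final line, no accept or stop decision can leak information from the held-out split.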

## How we built it

Reasoning: Gemini 3 drafts agents, reads failing traces, forms hypotheses, and writes mutations.

Observability + self-introspection: Arize Phoenix for tracing, scored experiments, and — the core trick — MCP self-introspection: a Gemini agent reads its own failing Phoenix experiment back through the Phoenix MCP server to drive the next fix.

App: FastAPI + React "Mission Control" UI streaming the climb live over SSE. Python, uv, 65 passing tests.
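To make the SSE stream concrete, here is a hedged sketch of how leaderboard updates could be framed as Server-Sent Events. The event names `leaderboard` and `done` and the helper names `sse_frame` and `stream_climb` are assumptions for illustration; only the `event:` / `data:` wire format comes from the SSE specification.

```python
import json
from typing import Iterable, Iterator


def sse_frame(event: str, payload: dict) -> str:
    """Format one SSE frame: an event name, one JSON data line, then a blank line."""
    data = json.dumps(payload, separators=(",", ":"))
    return f"event: {event}\ndata: {data}\n\n"


def stream_climb(rounds: Iterable[dict]) -> Iterator[str]:
    """Yield one frame per leaderboard row, then a final frame so the UI can close."""
    for row in rounds:
        yield sse_frame("leaderboard", row)
    yield sse_frame("done", {})
```

In a FastAPI app, a generator like this would typically be returned through `StreamingResponse` with `media_type="text/event-stream"`, and the React side would subscribe with the browser's `EventSource`.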

Five architecture layers:

1. Eval substrate: read-only SQLite sandbox plus an execution-match comparator (multiset rows, order-sensitive only on top-level ORDER BY, numeric-tolerant, NULL-aware).
2. Reflexive optimization loop: draft → score → introspect → hypothesize → mutate → re-score.
3. Mutation engine: turns one atomic hypothesis into one targeted change.
4. Phoenix tracing / experiments / MCP introspection: the agent reads its own failures.
5. Mission Control UI: live SSE stream of the leaderboard climb.
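One way to get a read-only SQLite sandbox like the eval substrate describes is SQLite's URI filename with `mode=ro`. The sketch below is an assumption about how such a layer might look, not the project's implementation; `run_readonly` and `max_rows` are hypothetical names.

```python
import sqlite3
from pathlib import Path
from typing import List, Tuple


def run_readonly(db_path: str, sql: str, max_rows: int = 10_000) -> List[Tuple]:
    """Run sql against db_path opened read-only and return at most max_rows rows."""
    # mode=ro makes SQLite reject any statement that would modify the database.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        # Cap the row count so a runaway generated query cannot exhaust memory.
        return conn.execute(sql).fetchmany(max_rows)
    finally:
        conn.close()
```

With this setup a generated `DELETE` or `DROP` raises `sqlite3.OperationalError` instead of silently changing the database that the gold answers were computed against.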

## Challenges we ran into

- A fair execution-match comparator. Result sets are multisets, ordering only matters with a top-level ORDER BY, floats need tolerance, and NULLs need explicit handling. That is the difference between a real score and a lie.
- Wiring agent-initiated MCP introspection. Making a Gemini agent reach back through the Phoenix MCP server to read its own failing experiment was the hardest plumbing in the project.
- LLM rate limits. Keeping the climb reproducible and the demo deterministic while respecting live-model throughput.
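The comparator rules from the first challenge can be sketched in a few lines. This is a simplified illustration under stated assumptions: `rows_match` is a hypothetical name, the caller decides whether the gold query has a top-level ORDER BY and passes that in as `ordered`, and rounding to `ndigits` decimal places stands in for a proper numeric tolerance.

```python
from collections import Counter
from typing import Any, Hashable, Sequence, Tuple

Row = Sequence[Any]


def _norm_cell(value: Any, ndigits: int) -> Tuple[int, Hashable]:
    """Map a cell to a (kind, value) pair so NULLs, numbers, and text never collide."""
    if value is None:
        return (0, "")  # NULL matches only NULL, never 0 or an empty string
    if isinstance(value, (int, float)):
        return (1, round(float(value), ndigits))  # 2 and 2.0000001 compare equal
    return (2, str(value))


def rows_match(pred: Sequence[Row], gold: Sequence[Row], ordered: bool, ndigits: int = 6) -> bool:
    """Compare two result sets as lists when ordered, otherwise as multisets."""
    p = [tuple(_norm_cell(c, ndigits) for c in row) for row in pred]
    g = [tuple(_norm_cell(c, ndigits) for c in row) for row in gold]
    if ordered:
        return p == g  # row order is part of the answer
    return Counter(p) == Counter(g)  # order ignored, duplicate counts still matter
```

Using `Counter` rather than `set` is what keeps the comparison a multiset: a prediction that returns a row twice does not match a gold answer that returns it once.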

## Accomplishments that we're proud of

- A genuine, measured 50% → 100% climb on a held-out test set over 3 accepted mutations. Every score is real SQL executed against a real database.
- Failure clusters fixed in the right order: JOIN → aggregation → ordering.
- An agent that reads its own Phoenix traces via MCP to decide what to fix next: true self-introspection.
- A fully reproducible, tested system: 65 passing tests, deterministic offline demo.
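One plausible reading of "the right order" is largest failure cluster first. The sketch below assumes each failing trace has already been tagged with a cluster label; the labels, and the names `dominant_cluster` and `fix_order`, are hypothetical and only illustrate the selection step.

```python
from collections import Counter
from typing import Iterable, List, Optional


def dominant_cluster(failure_labels: Iterable[str]) -> Optional[str]:
    """Return the most frequent failure label, or None when nothing failed."""
    counts = Counter(failure_labels)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def fix_order(failure_labels: Iterable[str]) -> List[str]:
    """List clusters from most to least frequent (ties keep first-seen order)."""
    return [label for label, _ in Counter(failure_labels).most_common()]
```

Targeting one cluster per round keeps each hypothesis atomic: if the score moves, the change that moved it is unambiguous.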

## What we learned
Measurement, not generation, is the hard part — a fair, executable comparator plus train/test discipline turns "the LLM wrote some SQL" into "the agent provably got better." Giving an agent read access to its own traces (via MCP) makes self-improvement concrete.

## What's next for Crucible

- Multi-DB / BIRD generalization: climb across many databases.
- Funded live Gemini for on-demand climbs from the UI.
- Promote winning prompts to the Phoenix prompt registry so proven agents become reusable artifacts.

## Built With

arize-phoenix
fastapi
gemini
google-adk
mcp
python
react
sqlite
sse
uv
vite

Updates

Ankit Kiran started this project — Jun 11, 2026 04:59 PM EDT
