## Inspiration
Everyone can prompt an LLM to write SQL. Almost nobody can prove it got better. We wanted an agent that doesn't just generate text-to-SQL — it measures itself against human gold answers, reads its own failures, and fixes the right thing next. Not "looks plausible." Real SQL, executed against a real database, scored against gold. The crucible is the test: only mutations that survive measurement get to stay.
## What it does
Crucible builds and self-optimizes text-to-SQL agents. Give it a database (Spider/BIRD-style) and it runs a closed reflexive loop:
- Draft a candidate text-to-SQL agent.
- Score it by execution-match against human gold SQL on a held-out test split — real SQL run against a real SQLite database, no estimates or mocks.
- Introspect its OWN failing traces by reading them back through the Arize Phoenix MCP server.
- Hypothesize — form one atomic hypothesis about the dominant failure cluster.
- Mutate the agent to address exactly that cluster.
- Re-score and climb a leaderboard until it clears the quality bar.
ML hygiene is enforced, not decorative: a mutation is accepted only if it improves the TRAIN split, results are reported on a HELD-OUT test set that is never optimized against, the best-so-far agent is always kept, and the loop stops early once its patience runs out.
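The acceptance rule above can be sketched in a few lines. This is a minimal illustration, not Crucible's actual code: `climb`, its parameters, and the choice to check the quality bar on the train split are all assumptions made for the example, and the scoring and mutation steps are passed in as plain callables.

```python
from typing import Callable, List, Tuple, TypeVar

Agent = TypeVar("Agent")  # placeholder for whatever represents a candidate agent


def climb(
    agent: Agent,
    score_train: Callable[[Agent], float],
    score_test: Callable[[Agent], float],
    propose_mutation: Callable[[Agent], Agent],
    quality_bar: float = 1.0,   # hypothetical default
    patience: int = 3,          # hypothetical default
    max_rounds: int = 20,       # hypothetical default
) -> Tuple[Agent, List[Tuple[float, float]]]:
    """Hill-climb on the train split; the test split is only reported."""
    best = agent
    best_train = score_train(best)
    # Each history entry is (train_score, test_score) for an accepted agent.
    history = [(best_train, score_test(best))]
    stale = 0  # consecutive rounds without an accepted mutation

    for _ in range(max_rounds):
        # Stop once the bar is cleared (checked on train, an assumption here)
        # or once patience is exhausted.
        if best_train >= quality_bar or stale >= patience:
            break
        candidate = propose_mutation(best)
        cand_train = score_train(candidate)
        if cand_train > best_train:
            # Accept: the decision uses the train score only.
            best, best_train = candidate, cand_train
            history.append((best_train, score_test(best)))
            stale = 0
        else:
            # Reject: keep best-so-far and count the wasted round.
            stale += 1
    return best, history
```

The test score never enters an accept or stop decision in this sketch; it is computed only so the leaderboard can display it.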
## How we built it
Reasoning: Gemini 3 drafts agents, reads failing traces, forms hypotheses, and writes mutations.
Observability + self-introspection: Arize Phoenix for tracing, scored experiments, and — the core trick — MCP self-introspection: a Gemini agent reads its own failing Phoenix experiment back through the Phoenix MCP server to drive the next fix.
App: FastAPI + React "Mission Control" UI streaming the climb live over SSE. Python, uv, 65 passing tests.
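For readers unfamiliar with SSE, each leaderboard update reaches the browser as a small text frame. The helper below is only a sketch of that wire format using the standard library. The function name `sse_event`, the event name `score`, and the payload fields are hypothetical and not taken from the project.

```python
import json


def sse_event(event: str, payload: dict) -> str:
    """Format one Server-Sent Events frame.

    json.dumps emits a single line by default, so the payload fits in one
    `data:` field; the trailing blank line terminates the frame.
    """
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"


# Hypothetical leaderboard update for one accepted mutation.
frame = sse_event("score", {"round": 1, "test_accuracy": 0.5})
```

A streaming endpoint would yield one such frame per accepted mutation with the `text/event-stream` content type.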
Five architecture layers:
- Eval substrate — read-only SQLite sandbox + an execution-match comparator (multiset rows, order-sensitive only on top-level `ORDER BY`, numeric-tolerant, NULL-aware).
- Reflexive optimization loop — draft → score → introspect → hypothesize → mutate → re-score.
- Mutation engine — turns one atomic hypothesis into one targeted change.
- Phoenix tracing / experiments / MCP introspection — the agent reads its own failures.
- Mission Control UI — live SSE stream of the leaderboard climb.
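One plausible way to get the read-only sandbox from the eval substrate layer is SQLite's URI mode, which Python's `sqlite3` module supports directly. The sketch below assumes that approach. The helper names `open_read_only` and `run_query` and the row cap are illustrative, not the project's API.

```python
import sqlite3
from pathlib import Path
from typing import Any, List, Tuple


def open_read_only(db_path: str) -> sqlite3.Connection:
    """Open a SQLite file so that any write attempt fails."""
    # Path.as_uri() yields a file:// URI; mode=ro asks SQLite for read-only access.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


def run_query(
    conn: sqlite3.Connection, sql: str, max_rows: int = 10_000
) -> List[Tuple[Any, ...]]:
    """Execute generated SQL and return at most max_rows rows (cap is an assumption)."""
    cursor = conn.execute(sql)
    return cursor.fetchmany(max_rows)
```

With a connection opened this way, a generated `INSERT`, `UPDATE`, or `DROP` raises `sqlite3.OperationalError` instead of altering the benchmark database.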
## Challenges we ran into
- A fair execution-match comparator. Result sets are multisets, ordering only matters with a top-level `ORDER BY`, floats need tolerance, NULLs need explicit handling — the difference between a real score and a lie.
- Wiring agent-initiated MCP introspection. Making a Gemini agent reach back through the Phoenix MCP server to read its own failing experiment was the hardest plumbing in the project.
- LLM rate limits. Keeping the climb reproducible and the demo deterministic while respecting live-model throughput.
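The comparator rules listed in the first challenge can be made concrete with a short sketch. It is an illustration under stated assumptions, not the project's implementation: all function names and the `1e-6` tolerance are hypothetical, the `ORDER BY` detection is a naive parenthesis-depth heuristic that ignores string literals and comments, and the unordered match is a simple greedy O(n²) pairing.

```python
import math
import re
from typing import Any, List, Sequence

Row = Sequence[Any]


def has_top_level_order_by(sql: str) -> bool:
    """Heuristic: look for ORDER BY outside every pair of parentheses."""
    depth = 0
    top_level = []
    for ch in sql:
        if ch == "(":
            if depth == 0:
                top_level.append(" ")  # keep tokens on both sides separated
            depth += 1
        elif ch == ")":
            depth -= 1
        elif depth == 0:
            top_level.append(ch)
    return re.search(r"\border\s+by\b", "".join(top_level), re.IGNORECASE) is not None


def cells_equal(a: Any, b: Any, tol: float) -> bool:
    # NULL-aware: NULL matches only NULL, never 0 or an empty string.
    if a is None or b is None:
        return a is None and b is None
    # Numeric-tolerant: compare ints and floats within a tolerance.
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return math.isclose(a, b, rel_tol=tol, abs_tol=tol)
    return a == b


def rows_equal(r1: Row, r2: Row, tol: float) -> bool:
    return len(r1) == len(r2) and all(cells_equal(x, y, tol) for x, y in zip(r1, r2))


def results_match(pred: List[Row], gold: List[Row], ordered: bool, tol: float = 1e-6) -> bool:
    """Compare two result sets as sequences (ordered) or as multisets."""
    if len(pred) != len(gold):
        return False
    if ordered:
        return all(rows_equal(p, g, tol) for p, g in zip(pred, gold))
    # Multiset comparison: each predicted row must consume one unmatched gold row.
    remaining = list(gold)
    for p in pred:
        for i, g in enumerate(remaining):
            if rows_equal(p, g, tol):
                del remaining[i]
                break
        else:
            return False
    return True
```

In use, `ordered` would come from `has_top_level_order_by(gold_sql)`, so row order is enforced only when the gold query itself asks for one.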
## Accomplishments that we're proud of
- A genuine, measured 50% → 100% climb on a held-out test set over 3 accepted mutations — every score is real SQL executed against a real database.
- Failure clusters fixed in the right order: JOIN → aggregation → ordering.
- An agent that reads its own Phoenix traces via MCP to decide what to fix next — true self-introspection.
- A fully reproducible, tested system: 65 passing tests, deterministic offline demo.
## What we learned
Measurement, not generation, is the hard part — a fair, executable comparator plus train/test discipline turns "the LLM wrote some SQL" into "the agent provably got better." Giving an agent read access to its own traces (via MCP) makes self-improvement concrete.
## What's next for Crucible
- Multi-DB / BIRD generalization — climb across many databases.
- Funded live Gemini for on-demand climbs from the UI.
- Promote winning prompts to the Phoenix prompt registry so proven agents become reusable artifacts.