## Inspiration

Everyone can prompt an LLM to write SQL. Almost nobody can prove it got better. We wanted an agent that doesn't just generate text-to-SQL — it measures itself against human gold answers, reads its own failures, and fixes the right thing next. Not "looks plausible." Real SQL, executed against a real database, scored against gold. The crucible is the test: only mutations that survive measurement get to stay.

## What it does

Crucible builds and self-optimizes text-to-SQL agents. Give it a database (Spider/BIRD-style) and it runs a closed reflexive loop:

  1. Draft a candidate text-to-SQL agent.
  2. Score it by execution-match against human gold SQL on a held-out test split — real SQL run against a real SQLite database, no estimates or mocks.
  3. Introspect its OWN failing traces by reading them back through the Arize Phoenix MCP server.
  4. Hypothesize — form one atomic hypothesis about the dominant failure cluster.
  5. Mutate the agent to address exactly that cluster.
  6. Re-score and climb a leaderboard until it clears the quality bar.

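ML hygiene is enforced, not decorative: a mutation is accepted only if it improves the score on the TRAIN split, results are reported on a HELD-OUT test set that is never optimized against, the best-so-far agent is always kept, and the loop stops early once its patience budget of non-improving rounds runs out.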
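The acceptance rule above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: `Candidate`, `climb`, and `propose` are hypothetical names, scores are assumed to be fractions in [0, 1], and checking the quality bar against the train score is an assumption made here to keep the test split out of every decision.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Candidate:
    """Hypothetical record for one scored agent variant."""
    name: str
    train_score: float  # drives accept/reject decisions
    test_score: float   # reported only, never used to decide anything


def climb(
    initial: Candidate,
    propose: Callable[[Candidate], Candidate],
    quality_bar: float = 1.0,
    patience: int = 3,
    max_rounds: int = 20,
) -> Tuple[Candidate, List[Candidate]]:
    """Keep best-so-far, accept only train improvements, stop on patience."""
    best = initial
    accepted = [initial]
    stale = 0  # consecutive rounds without a train improvement
    for _ in range(max_rounds):
        if best.train_score >= quality_bar or stale >= patience:
            break
        candidate = propose(best)  # draft one mutation of the current best
        if candidate.train_score > best.train_score:
            best = candidate
            accepted.append(candidate)
            stale = 0
        else:
            stale += 1  # rejected: best-so-far is kept unchanged
    return best, accepted
```

Note that a rejected candidate with a higher test score is still rejected; letting the test split influence acceptance would turn it into a second training set.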

## How we built it

Reasoning: Gemini 3 drafts agents, reads failing traces, forms hypotheses, and writes mutations.

Observability + self-introspection: Arize Phoenix for tracing, scored experiments, and — the core trick — MCP self-introspection: a Gemini agent reads its own failing Phoenix experiment back through the Phoenix MCP server to drive the next fix.

App: FastAPI + React "Mission Control" UI streaming the climb live over SSE. Python, uv, 65 passing tests.

Five architecture layers:

  1. Eval substrate — read-only SQLite sandbox + an execution-match comparator (multiset rows, order-sensitive only on top-level ORDER BY, numeric-tolerant, NULL-aware).
  2. Reflexive optimization loop — draft → score → introspect → hypothesize → mutate → re-score.
  3. Mutation engine — turns one atomic hypothesis into one targeted change.
  4. Phoenix tracing / experiments / MCP introspection — the agent reads its own failures.
  5. Mission Control UI — live SSE stream of the leaderboard climb.
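To make the eval substrate in layer 1 concrete, here is a rough sketch of a read-only SQLite runner and an execution-match comparator using only the standard library. The names `run_readonly`, `rows_match`, and `_norm` are hypothetical, and two simplifications are assumptions of this sketch: numeric tolerance is approximated by rounding to a fixed number of decimals, and the caller passes `ordered=True` when the gold query has a top-level ORDER BY (detecting that requires SQL parsing, which is not shown).

```python
import sqlite3
from collections import Counter


def run_readonly(db_path, sql):
    """Execute SQL against a SQLite file opened in read-only mode."""
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


def _norm(value, ndigits):
    """Normalize one cell so that 1 == 1.0 and NULL only equals NULL."""
    if value is None:
        return ("null",)
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return ("num", round(float(value), ndigits))
    return ("val", value)


def rows_match(pred, gold, ordered=False, ndigits=6):
    """Compare two result sets as multisets, or as sequences if ordered."""
    p = [tuple(_norm(v, ndigits) for v in row) for row in pred]
    g = [tuple(_norm(v, ndigits) for v in row) for row in gold]
    if ordered:
        return p == g  # row order matters
    return Counter(p) == Counter(g)  # duplicates count, order does not
```

Comparing with `Counter` rather than `set` is what keeps duplicate rows significant: a prediction that returns a row once does not match gold that returns it twice. Rounding can misjudge values that straddle a rounding boundary, so a production comparator would more likely compare cells pairwise with an explicit tolerance.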

## Challenges we ran into

  • A fair execution-match comparator. Result sets are multisets, ordering only matters with a top-level ORDER BY, floats need tolerance, NULLs need explicit handling — the difference between a real score and a lie.
  • Wiring agent-initiated MCP introspection. Making a Gemini agent reach back through the Phoenix MCP server to read its own failing experiment was the hardest plumbing in the project.
  • LLM rate limits. Keeping the climb reproducible and the demo deterministic while respecting live-model throughput.
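One common way to reconcile rate limits with a deterministic demo is to record model responses once and replay them afterwards. The sketch below illustrates that idea only; it is an assumption, not a description of how Crucible does it. `ReplayCache` and `call_model` are hypothetical names, and `call_model` stands in for whatever function actually calls the live model.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Optional


class ReplayCache:
    """Record model responses keyed by prompt hash; replay them offline."""

    def __init__(self, path: str, call_model: Optional[Callable[[str], str]] = None):
        self.path = Path(path)
        self.call_model = call_model  # None means strict offline replay
        if self.path.exists():
            self.store = json.loads(self.path.read_text(encoding="utf-8"))
        else:
            self.store = {}

    def complete(self, prompt: str) -> str:
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key not in self.store:
            if self.call_model is None:
                raise KeyError("no recorded response for this prompt")
            # First time this prompt is seen: spend one live call, then persist.
            self.store[key] = self.call_model(prompt)
            self.path.write_text(
                json.dumps(self.store, indent=2, sort_keys=True), encoding="utf-8"
            )
        return self.store[key]
```

With a recording on disk, constructing `ReplayCache(path)` without `call_model` replays the same responses on every run and makes no live calls, which also keeps a test suite independent of model throughput.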

## Accomplishments that we're proud of

  • A genuine, measured 50% → 100% climb on a held-out test set over 3 accepted mutations — every score is real SQL executed against a real database.
  • Failure clusters fixed in the right order: JOIN → aggregation → ordering.
  • An agent that reads its own Phoenix traces via MCP to decide what to fix next — true self-introspection.
  • A fully reproducible, tested system: 65 passing tests, deterministic offline demo.

## What we learned
Measurement, not generation, is the hard part — a fair, executable comparator plus train/test discipline turns "the LLM wrote some SQL" into "the agent provably got better." Giving an agent read access to its own traces (via MCP) makes self-improvement concrete.

## What's next for Crucible

  • Multi-DB / BIRD generalization — climb across many databases.
  • Funded live Gemini for on-demand climbs from the UI.
  • Promote winning prompts to the Phoenix prompt registry so proven agents become reusable artifacts.
