Inspiration
Modern LLMs are great at giving answers — and that's exactly the problem. Hand a student the answer and you've robbed them of the learning; worse, LLMs silently flub arithmetic while sounding confident. We'd built a proof chatbot before (proof-agent), but a chatbot isn't a tutor. We wanted an agent that does what great tutors do: carry the mathematical burden, check your work against ground truth, and make you do the thinking — while never being wrong about what's correct.
What it does
A Socratic math tutor that never gives the answer and never hallucinates correctness, because correctness doesn't come from the LLM — it comes from two ground-truth engines the agent routes to:
- Computation → SymPy. Submit a step like $3(x+2) = 3x+2$ and it pinpoints the exact error (the distribution), then asks a guiding question instead of fixing it for you.
- Proofs → Lean 4 + Mathlib. State a claim in plain English ("for every integer $x$, if $2x+4=10$ then $x=3$") and a multi-agent autoformalization pipeline takes over: one agent formalizes it into Lean, the Lean kernel machine-checks the proof (with an error-repair loop), and a 3-judge panel cross-examines whether the Lean statement actually means your claim — unanimity required. A proof that compiles can't be faked; the panel guards against proving the wrong thing.
Around that core: a clickable solution-space map when you first pose a problem (see how different approaches interconnect before choosing your path), a "view machine-checked Lean proof" button that reveals the proof with a plain-English translation of every step, generative UI (the model composes interactive buttons, step scaffolds, and verdict callouts per reply via OpenUI), live learning analytics streamed to ClickHouse, and autonomous progress reports emailed via Composio.
How we built it
Two-brain architecture: a Python FastAPI agent loop (hand-rolled tool-use loop on Claude Opus 4.8 — no framework) with five tools: check_step (SymPy), get_solution, autoformalize_and_verify (the Opus→Lean→judge-panel pipeline), verify_proof, and present_solution_map. Frontends: a classic KaTeX + Mermaid UI, and an OpenUI (Next.js) generative frontend that feeds the OpenUI Lang spec to the Python brain and renders its replies as live components. Deployed on Render as a two-service blueprint.
Challenges we ran into
- Autoformalization fidelity. The scary failure isn't a broken proof (the kernel catches that) — it's proving an easier, different statement. Hence the adversarial judge panel with required unanimity, and honest "outside Mathlib's reach" verdicts instead of fake passes.
- Pedagogy is a prompt-engineering problem. Early versions interrogated students ("state the inequality precisely?"). We rebuilt the policy around propose-and-confirm: the agent states the theorem and asks yes/no — the setup burden is always the agent's.
- Bridging ecosystems. OpenUI's stream adapter expects OpenAI-style NDJSON; our brain is Python. We read the adapter source and built an exact-contract bridge.
- Adversarial review pays. A headless-browser review of our graph renderer found five real bugs — including students breaking the UI by naming a node
end(a Mermaid reserved word).
What we learned
The harness matters more than the model. The same Claude that confidently flubs arithmetic becomes a trustworthy tutor when its judgment is grounded in engines that cannot lie — and when the scaffolding (judge panels, repair loops, Socratic policy) is engineered, not vibed.
What's next
Persistent learner models, class-level autonomous grading loops, and faster proof checking via the Lean REPL.
Built With
- anthropic)
- claude
- clickhouse
- composio
- fastapi
- katex
- lean4
- mathlib
- mermaid
- next.js
- openui
- python
- react
- render
- sympy
- typescript
Log in or sign up for Devpost to join the conversation.