Inspiration

Math progress is bottlenecked by proof search and verification. With open models like GPT‑OSS‑120B showing strong mathematical reasoning, we asked: can an open, reproducible agent systematically propose conjectures, attempt rigorous proofs, and iterate toward publishable results—accelerating not just olympiad problem solving, but science more broadly?

What it does

DeepResearch++ is an end‑to‑end research‑and‑proving agent for mathematics (a code sketch of the flow follows the list):

  • Literature review → conjecture generation: builds candidate conjectures from seeds and recent results.
  • Novelty filter: down‑selects to likely‑new statements to avoid re‑discoveries.
  • Proof engine: multi‑attempt prover generates Markdown proofs; a judge critiques flaws; the system iterates and selects the best proof.
  • Outputs: a reproducible report with statement, status (proved/failed), proof, and failure analysis.
  • Interfaces: minimal web UI; runs against OpenAI‑compatible endpoints (local GPT‑OSS via Ollama or hosted).
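
To make the flow concrete, here is a minimal sketch under assumed naming. Every function below is an illustrative stub, not the project's actual API; the real stages live in research.py and solver.py with different signatures.

```python
# Hypothetical sketch of the DeepResearch++ flow; all stubs are illustrative.
def literature_review(seeds: list[str]) -> list[str]:
    # Placeholder: summarize seeds and recent results into working context.
    return [f"context for {s}" for s in seeds]

def generate_conjectures(context: list[str]) -> list[str]:
    # Placeholder: propose candidate statements from the reviewed context.
    return [f"Conjecture suggested by {c}" for c in context]

def is_likely_novel(statement: str) -> bool:
    # Placeholder: the novelty filter drops probable re-discoveries.
    return True

def prove_with_judge(statement: str, max_rounds: int = 3) -> dict:
    # Placeholder: multi-attempt prover + judge critique loop (see solver.py).
    return {"status": "failed", "proof": None,
            "failure_analysis": "illustrative stub; no real prover called"}

def run(seeds: list[str]) -> list[dict]:
    # Each entry mirrors the report fields: statement, status, proof, analysis.
    report = []
    for conj in generate_conjectures(literature_review(seeds)):
        if is_likely_novel(conj):
            report.append({"statement": conj, **prove_with_judge(conj)})
    return report
```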

How we built it

  • Orchestration: solver.py coordinates parallel Provers and a Judge, with retries and final selection (see the sketch after this list).
  • Structured I/O: Pydantic schemas in output_schemas.py; model calls go through a provider‑agnostic wrapper (llm_provider.py) using the Responses API.
  • Research loop: research.py runs review → prediction → novelty filter → proof → report; optional guideline steering focuses the search on a chosen direction.
  • Safety/rigor: result_refiner.py cleans and standardizes proofs.
  • Stack: Python, FastAPI/CLI backend; Vite/React front‑end; runs with GPT‑OSS locally (Ollama) or cloud models interchangeably.
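
A hedged sketch of that orchestration, assuming simplified Pydantic schemas and stub model calls; the real schemas, prompts, and Responses‑API plumbing live in output_schemas.py, solver.py, and llm_provider.py, and the field names below are our invention:

```python
# Assumed shapes only: fields, scores, and stubs are illustrative.
from concurrent.futures import ThreadPoolExecutor
from pydantic import BaseModel

class ProofAttempt(BaseModel):
    proof_markdown: str

class Verdict(BaseModel):
    sound: bool
    score: float          # judge's rigor score in [0, 1] (hypothetical field)
    flaws: list[str]

def prove(statement: str) -> ProofAttempt:
    # Placeholder for a call through llm_provider.py, which targets any
    # OpenAI-compatible endpoint: local GPT-OSS via Ollama, or hosted.
    return ProofAttempt(proof_markdown=f"*Attempted proof of:* {statement}")

def judge(statement: str, attempt: ProofAttempt) -> Verdict:
    # Placeholder for the judge call that critiques each attempt.
    return Verdict(sound=False, score=0.4, flaws=["unjustified lemma"])

def solve(statement: str, n_provers: int = 4) -> tuple[ProofAttempt, Verdict]:
    # Run provers in parallel, judge every attempt, keep the highest-scored.
    with ThreadPoolExecutor(max_workers=n_provers) as pool:
        attempts = list(pool.map(prove, [statement] * n_provers))
    verdict, best = max(((judge(statement, a), a) for a in attempts),
                        key=lambda pair: pair[0].score)
    return best, verdict
```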

Challenges we ran into

  • Stabilizing long, multi‑step proofs (hallucination vs. rigor).
  • Novelty checking without a full formal mathematics knowledge base.
  • Geometry brittleness, mitigated with a strict complex‑plane policy (illustrated after this list).
  • Latency/cost/VRAM constraints for large open models.
  • Normalizing different provider APIs and structured outputs.
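
The strict complex‑plane policy amounts to a hard constraint in the prover's geometry instructions. The wording below is our illustration of such a guard, not the project's verbatim prompt:

```python
# Illustrative policy text; the project's actual prompt may differ.
COMPLEX_PLANE_POLICY = """\
For any Euclidean geometry statement:
1. Place the configuration in the complex plane (unit circumcircle when natural).
2. Express every point, line, and circle algebraically; no synthetic steps
   that rely on unstated diagram facts.
3. Prove collinearity, concyclicity, and concurrency by explicit computation,
   never by inspection of a figure.
"""

def geometry_prompt(problem: str) -> str:
    # Prepend the policy so every geometry attempt is coordinate-based.
    return f"{COMPLEX_PLANE_POLICY}\nProblem:\n{problem}"
```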

Accomplishments that we're proud of

  • A reproducible agent that produces checkable Markdown proofs and failure analyses.
  • Strong performance on olympiad‑style problems with open models (GPT‑OSS) in our internal runs.
  • A clean, scriptable CLI and lightweight UI that make research loops easy to reproduce.
  • Provider‑agnostic design: same pipeline runs locally via Ollama or in the cloud.

What we learned

  • Diverse attempts, judging, and iteration matter more than a single long pass.
  • Most users don't come close to extracting the full capability of today's models; the right scaffolding can get far more out of them and push science forward faster.
  • Superintelligence is here.

What's next for DeepResearch++

  • Make the LLM more creative at predicting new results (we already have an approach in mind).
  • Polish the generated report.
  • Build automated systems that read, understand, and apply the new mathematical results to physics, biology, chemistry, and machine learning; this will require extensive tool‑call wiring.

Built With

python, fastapi, react, vite, gpt-oss, ollama