Inspiration
Math progress is bottlenecked by proof search and verification. With open models like GPT‑OSS‑120B showing strong mathematical reasoning, we asked: can an open, reproducible agent systematically propose conjectures, attempt rigorous proofs, and iterate toward publishable results—accelerating not just olympiad problem solving, but science more broadly?
What it does
DeepResearch++ is an end‑to‑end research+proving agent for mathematics:
- Literature review → conjecture generation: builds candidate conjectures from seeds and recent results.
- Novelty filter: down‑selects to likely‑new statements to avoid re‑discoveries.
- Proof engine: multi‑attempt prover generates Markdown proofs; a judge critiques flaws; the system iterates and selects the best proof.
- Outputs: a reproducible report with statement, status (proved/failed), proof, and failure analysis.
- Interfaces: minimal web UI; runs against any OpenAI‑compatible endpoint (local GPT‑OSS via Ollama, or hosted).
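The loop above can be sketched in a few lines. This is an illustrative stub, not the project's actual API: every function name here is hypothetical, and the prover/judge calls that would hit an LLM are replaced with deterministic placeholders.

```python
# Hypothetical sketch of the DeepResearch++ loop; all names are illustrative.

def generate_conjectures(seeds):
    # Literature review -> candidate conjectures (stubbed for illustration).
    return [f"Conjecture about {s}" for s in seeds]

def is_novel(conjecture, known):
    # Novelty filter: down-select to likely-new statements.
    return conjecture not in known

def attempt_proof(conjecture, max_attempts=3):
    # Multi-attempt prover + judge critique; returns (status, proof_markdown).
    for attempt in range(1, max_attempts + 1):
        proof = f"Proof sketch of '{conjecture}' (attempt {attempt})"
        # A judge model would critique the draft here; we stub its verdict.
        verdict = "accept" if attempt == max_attempts else "revise"
        if verdict == "accept":
            return "proved", proof
    return "failed", ""

def run_pipeline(seeds, known=()):
    # Produce the reproducible report: statement, status, proof.
    report = []
    for c in generate_conjectures(seeds):
        if not is_novel(c, known):
            continue
        status, proof = attempt_proof(c)
        report.append({"statement": c, "status": status, "proof": proof})
    return report
```

In the real system each stub would be a model call; the shape of the loop (generate, filter, prove-with-retries, report) is the point.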
How we built it
- Orchestration: `solver.py` coordinates parallel Provers and a Judge, with retries and final selection.
- Structured I/O: Pydantic schemas in `output_schemas.py`; model calls go through a provider‑agnostic wrapper (`llm_provider.py`) using the Responses API.
- Research loop: `research.py` does review → prediction → novelty filter → proof → report, with optional guideline steering to focus on a direction.
- Safety/rigor: `result_refiner.py` cleans and standardizes proofs.
- Stack: Python, FastAPI/CLI backend; Vite/React front‑end; runs with GPT‑OSS locally (Ollama) or cloud models interchangeably.
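One way the `solver.py`-style fan-out could look: run several Provers in parallel, have a Judge score each draft, and keep the best. This is a minimal sketch under assumptions; the class and function names, and the toy scoring rule, are ours, not the repository's.

```python
# Hypothetical sketch of parallel Provers + Judge selection.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class ProofAttempt:
    prover_id: int
    proof_md: str
    score: float  # judge's rigor score in [0, 1]

def prove(prover_id: int, statement: str) -> str:
    # Stand-in for an LLM call that drafts a Markdown proof.
    return f"### Proof by prover {prover_id}\n\n{statement}: ..."

def judge(proof_md: str) -> float:
    # Stand-in for the Judge model critiquing a draft
    # (toy heuristic here: longer draft scores higher, capped at 1.0).
    return min(len(proof_md) / 100.0, 1.0)

def solve(statement: str, n_provers: int = 4) -> ProofAttempt:
    # Fan out parallel provers, score each draft, keep the best.
    with ThreadPoolExecutor(max_workers=n_provers) as pool:
        drafts = list(pool.map(lambda i: prove(i, statement), range(n_provers)))
    attempts = [ProofAttempt(i, d, judge(d)) for i, d in enumerate(drafts)]
    return max(attempts, key=lambda a: a.score)
```

Retries and structured (Pydantic-validated) outputs would wrap each `prove` call in the real pipeline; the selection step is just an argmax over judge scores.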
Challenges we ran into
- Stabilizing long, multi‑step proofs (hallucination vs. rigor).
- Novelty checking without a full formal math KB.
- Geometry brittleness—fixing with a strict complex‑plane policy.
- Latency/cost/VRAM constraints for large open models.
- Normalizing different provider APIs and structured outputs.
Accomplishments that we're proud of
- A reproducible agent that produces checkable Markdown proofs and failure analyses.
- Strong performance on olympiad‑style problems with open models (GPT‑OSS) in our internal runs.
- A clean, scriptable CLI and lightweight UI that make research loops easy to reproduce.
- Provider‑agnostic design: same pipeline runs locally via Ollama or in the cloud.
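The provider-agnostic switch can be as small as choosing a `base_url` for an OpenAI-compatible client. The Ollama URL below is its standard local OpenAI-compatible endpoint; the function name, env-var convention, and model tags are assumptions for illustration, not the project's actual code.

```python
# Hypothetical endpoint selector for an OpenAI-compatible client.
import os

def endpoint_config(provider: str) -> dict:
    """Return base_url/api_key/model kwargs for an OpenAI-compatible client."""
    if provider == "ollama":
        # Ollama exposes an OpenAI-compatible API on localhost:11434.
        return {
            "base_url": "http://localhost:11434/v1",
            "api_key": "ollama",      # placeholder; Ollama ignores the key
            "model": "gpt-oss:120b",  # assumed local model tag
        }
    # Hosted provider: credentials come from the environment.
    return {
        "base_url": os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1"),
        "api_key": os.getenv("OPENAI_API_KEY", ""),
        "model": os.getenv("MODEL_NAME", ""),  # hypothetical env var
    }
```

Because both paths speak the same API shape, the rest of the pipeline never branches on provider.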
What we learned
- Judging diversity and iteration matter more than a single long pass.
- Most users barely scratch the surface of what these models can do; careful scaffolding unlocks far more capability, fast.
- Superintelligence feels closer than ever.
What's next for DeepResearch++
- Make the LLM more creative at predicting new results (we already have an approach in mind).
- Polish the generated report.
- Create automated systems that read, understand, and then apply the newly produced math results to physics, biology, chemistry, and machine learning. This will require extensive tool‑call wiring.
Built With
- openai
- python