Automated Prototyping Engine for Research Papers (ArXiv-to-Code)

Rationale

AI research advances quickly, but reproducibility has not kept pace. Groundbreaking papers appear daily on arXiv with elegant mathematics and elaborate architecture diagrams, yet a reliable implementation often does not exist for months, if one ever appears at all. As students and practitioners following the latest research, we ran into the same problem again and again: understanding the mathematics was only half the battle; translating that understanding into working code was the harder part.

To close this gap, we built ArXiv-to-Code, an automated research prototyping engine that converts research papers into executable models.


Objective

ArXiv-to-Code is a multimodal AI system that converts dense deep learning research papers into working code.

Given a research paper in PDF format, the system:

  • Reads long-form academic text
  • Interprets mathematical equations and Greek-letter notation
  • Visually analyzes architecture diagrams and figures
  • Extracts the model definition and loss function
  • Generates executable PyTorch source code files (e.g., model.py, loss.py)

For example, when a paper presents its loss function in Figure 3, the system reconstructs each equation contained in that figure:

\[ \mathcal{L} = \mathbb{E}_{(x,y)\sim\mathcal{D}} \left[ \lambda_1 \|\hat{y} - y\|_2^2 + \lambda_2 \, \mathrm{KL}\big(q(z \mid x)\,\|\,p(z)\big) \right] \]

and directly implements them as numerically stable Python code ready for training.
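
To make the arithmetic of the example loss concrete, here is a minimal, framework-free sketch using only the Python standard library. It is not the code the system generates. It assumes that q(z|x) is a diagonal Gaussian parameterized by a mean and a log-variance and that p(z) is a standard normal, neither of which the equation itself specifies. The function names and the batch layout are hypothetical.

```python
import math
from typing import Sequence, Tuple

# One sample: (y_hat, y, mu, log_var). This layout is an illustrative assumption.
Sample = Tuple[Sequence[float], Sequence[float], Sequence[float], Sequence[float]]


def squared_l2(y_hat: Sequence[float], y: Sequence[float]) -> float:
    """Squared Euclidean distance ||y_hat - y||_2^2 for one sample."""
    if len(y_hat) != len(y):
        raise ValueError("y_hat and y must have the same length")
    return sum((a - b) ** 2 for a, b in zip(y_hat, y))


def kl_diag_gaussian(mu: Sequence[float], log_var: Sequence[float]) -> float:
    """Closed-form KL(N(mu, diag(exp(log_var))) || N(0, I)).

    ASSUMPTION: q(z|x) is a diagonal Gaussian and p(z) is a standard normal.
    Taking log-variance as input avoids computing the log of a tiny variance.
    """
    if len(mu) != len(log_var):
        raise ValueError("mu and log_var must have the same length")
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def total_loss(batch: Sequence[Sample], lambda_1: float = 1.0, lambda_2: float = 1.0) -> float:
    """Batch mean of the two weighted terms, used as an estimate of the expectation over D."""
    if not batch:
        raise ValueError("batch must contain at least one sample")
    total = 0.0
    for y_hat, y, mu, log_var in batch:
        total += lambda_1 * squared_l2(y_hat, y) + lambda_2 * kl_diag_gaussian(mu, log_var)
    return total / len(batch)
```

In a PyTorch version, the loops would become tensor operations and the batch mean a call to `mean()`, but the term-by-term structure would stay the same.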


How we built it

We built the project using Gemini’s multimodal capabilities in AI Studio, focusing on research-level fidelity rather than summarization.

Our pipeline consists of:

  1. Long-context document understanding to process full research PDFs
  2. Vision–language reasoning to interpret figures and architecture diagrams
  3. Mathematical parsing to preserve equations, symbols, and constraints
  4. Code synthesis to translate theory into PyTorch implementations

We designed a strict prompting framework that enforces:

  • No hallucinated equations
  • Exact variable matching with the paper
  • Diagram-first reasoning when figures define key logic
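
The rules above are enforced through prompting, and this write-up does not describe any post-generation check. As one possible complement, the sketch below shows how "exact variable matching" could be verified after the fact with Python's `ast` module. The function name, the `paper_symbols` set, and the `allowed` set are all hypothetical.

```python
import ast
from typing import List, Optional, Set


def names_not_in_paper(
    generated_code: str,
    paper_symbols: Set[str],
    allowed: Optional[Set[str]] = None,
) -> List[str]:
    """Return names introduced by the code that do not appear in the paper.

    Collects function parameters and assignment targets, then removes every
    name listed in paper_symbols or in the explicit allow-list.
    """
    allowed = allowed or set()
    introduced = set()
    for node in ast.walk(ast.parse(generated_code)):
        if isinstance(node, ast.arg):
            introduced.add(node.arg)  # function parameter
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            introduced.add(node.id)  # assignment target
    return sorted(introduced - paper_symbols - allowed)
```

For a generated function `def loss(y_hat, y, lambda_1)` that assigns an intermediate `err`, calling this with the paper symbols `{"y_hat", "y", "lambda_1", "lambda_2"}` returns `["err"]`, which a reviewer can then accept or rename.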

The result is an AI system that behaves more like a research engineer than a chatbot.


Challenges we ran into

One of the biggest challenges was preventing the model from guessing missing details. Research papers often omit implementation specifics, and it was crucial that the system explicitly flag missing information rather than silently invent it.
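
A minimal sketch of the "flag, do not invent" idea, assuming the extracted details are stored in a record whose fields default to `None`. The `LossSpec` class and its field names are hypothetical and chosen only to match the example loss. Anything the paper leaves unstated stays `None` and is reported rather than filled with a default.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class LossSpec:
    """Hypothetical record of what a paper states about its loss."""
    lambda_1: Optional[float] = None
    lambda_2: Optional[float] = None
    latent_dim: Optional[int] = None


def missing_fields(spec: LossSpec) -> List[str]:
    """List every field the paper did not specify, in declaration order."""
    return [f.name for f in fields(spec) if getattr(spec, f.name) is None]


def require_complete(spec: LossSpec) -> None:
    """Refuse to proceed while any detail is missing, instead of guessing it."""
    missing = missing_fields(spec)
    if missing:
        raise ValueError("paper does not specify: " + ", ".join(missing))
```

With `LossSpec(lambda_1=1.0)`, `missing_fields` returns `["lambda_2", "latent_dim"]`, so the gap is visible to the user before any code is generated.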

Another challenge was diagram interpretation. Figures often encode critical logic that is not fully described in text, so we had to ensure the system treated diagrams as first-class sources of truth.

Finally, translating complex mathematical expressions into stable, shape-correct code—especially for multi-term loss functions—required careful reasoning and validation.
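
The kind of validation this calls for can be illustrated with a small harness. It is a sketch, not the project's actual test suite, and `check_loss` is a hypothetical helper. It runs a scalar-valued function on named input cases and reports any case that raises or returns a non-finite or non-scalar value. Softplus is used here only as a familiar example of an expression whose naive form overflows.

```python
import math
from typing import Callable, Iterable, List, Tuple


def check_loss(loss_fn: Callable[..., float], cases: Iterable[Tuple[str, tuple]]) -> List[str]:
    """Run loss_fn on each named case and return a description of every failure."""
    problems = []
    for name, args in cases:
        try:
            value = loss_fn(*args)
        except Exception as exc:  # e.g. overflow or mismatched input lengths
            problems.append(f"{name}: raised {type(exc).__name__}")
            continue
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{name}: expected a scalar, got {type(value).__name__}")
        elif not math.isfinite(value):
            problems.append(f"{name}: non-finite value {value}")
    return problems


def softplus_naive(x: float) -> float:
    """log(1 + exp(x)) written literally; exp overflows for large x."""
    return math.log(1.0 + math.exp(x))


def softplus_stable(x: float) -> float:
    """Equivalent form that only exponentiates a non-positive number."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

Running both versions on an extreme input such as `1000.0` makes the difference visible: the naive form is reported as raising `OverflowError`, while the stable form passes.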


Accomplishments that we're proud of

  • Successfully extracting and implementing loss functions directly from paper figures
  • Generating clean, modular, runnable PyTorch code from brand-new research papers
  • Building a reproducibility-focused system that prioritizes correctness over verbosity
  • Demonstrating how multimodal AI can meaningfully accelerate research workflows

What we learned

This project taught us how powerful multimodal reasoning can be when applied to real research problems. We learned that diagrams are not just visual aids—they often are the specification. We also gained a deeper appreciation for the gap between mathematical elegance and implementation reality, and how carefully designed AI systems can help bridge that gap.


What's next for ArXiv-to-Code: Automated Research Prototyping Engine

Next, we plan to:

  • Add full training script generation and dataset hooks
  • Support multiple frameworks (TensorFlow, JAX)
  • Automatically generate experiment-ready GitHub repositories
  • Evaluate implementations against official or community benchmarks
  • Extend the system to other domains such as robotics and scientific computing

Our long-term vision is to make research reproducibility the default, not the exception.
