Inspiration

Evaluating AI-generated code safely across multiple environments is messy — every sandbox API has its own quirks, authentication, and runtime behavior. We wanted a single place where researchers and developers could run, score, and compare code execution across backends like Daytona, E2B, and Docker — instantly and reproducibly. That’s how PolySandbox was born: a unified, backend-agnostic execution layer for AI evaluation.

What it does

PolySandbox is a universal sandbox orchestrator that lets you:

  • Run Python code safely in isolated backends (Daytona, E2B, Docker)
  • Evaluate model-generated code from datasets like MBPP and HumanEval
  • Automatically score correctness and capture runtime metrics
  • Compare sandbox performance (speed, reliability, success rate) side-by-side
  • Interact through a simple FastAPI backend and a Streamlit UI

In short: one click → any sandbox → consistent results.
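The "one click → any sandbox" idea boils down to a single request whose payload names a backend and carries the code to run. A minimal sketch of what building such a request could look like (the helper `build_run_request` and the field names `backend`, `code`, and `timeout_s` are illustrative assumptions, not PolySandbox's actual schema):

```python
import json

def build_run_request(backend: str, code: str, timeout_s: int = 30) -> dict:
    """Build a payload for a hypothetical /run_single call.

    Field names here are assumptions for illustration; the real
    PolySandbox request schema may differ.
    """
    if backend not in {"daytona", "e2b", "docker"}:
        raise ValueError(f"unsupported backend: {backend}")
    return {"backend": backend, "code": code, "timeout_s": timeout_s}

# The same code can be pointed at any backend just by changing one field.
payload = build_run_request("docker", "print(2 + 2)")
print(json.dumps(payload))
```

Because only the `backend` field changes between providers, comparing results across sandboxes becomes a loop over backend names rather than three separate client integrations.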

How I built it

  • Backend: FastAPI server that exposes /run_single, /run_agent, and /run_batch endpoints for the different sandboxes
  • Adapters: Modular clients for Daytona, E2B, and Docker, each implementing a shared interface (SandboxClient)
  • Evaluator: Unified pipeline that executes code, measures runtime, and normalizes outputs into a common ExecutionResult schema
  • Datasets: Integrated loaders for MBPP and HumanEval via Hugging Face
  • Frontend: Streamlit app that lets users pick a backend, edit code, and visualize results in real time
  • Testing & Tooling: Built and managed using uv, pytest, and async test harnesses
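The adapter and evaluator pieces above hinge on one shared contract. A minimal sketch of that contract (the names `SandboxClient` and `ExecutionResult` come from the project; the specific fields and the in-process stub backend are my assumptions for illustration, since real adapters would call the remote Daytona/E2B/Docker APIs):

```python
import contextlib
import io
import time
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class ExecutionResult:
    """Normalized output every adapter returns (fields assumed)."""
    backend: str
    stdout: str
    stderr: str
    exit_code: int
    runtime_s: float

class SandboxClient(ABC):
    """Shared interface each backend adapter implements."""
    name: str

    @abstractmethod
    def run(self, code: str) -> ExecutionResult: ...

class LocalStubClient(SandboxClient):
    """Illustrative stand-in that executes code in-process.

    A real adapter would ship the code to an isolated sandbox instead.
    """
    name = "stub"

    def run(self, code: str) -> ExecutionResult:
        buf = io.StringIO()
        start = time.perf_counter()
        try:
            with contextlib.redirect_stdout(buf):
                exec(code, {})
            exit_code, stderr = 0, ""
        except Exception as exc:
            exit_code, stderr = 1, str(exc)
        runtime = time.perf_counter() - start
        return ExecutionResult(self.name, buf.getvalue(), stderr, exit_code, runtime)

result = LocalStubClient().run("print('hello')")
print(result.stdout.strip(), result.exit_code)
```

Because every adapter returns the same `ExecutionResult` shape, the evaluator and the Streamlit UI never need backend-specific branches.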

Challenges I ran into

  1. Orchestrating multiple sandbox APIs — Each platform (Daytona, E2B, Docker) had unique authentication, request formats, and runtime limits. Normalizing these into a single clean interface took careful abstraction.
  2. Error handling and resilience — Handling network timeouts, cold starts, and inconsistent API responses gracefully without crashing the pipeline.
  3. Runtime safety — Ensuring user code executes securely and deterministically inside sandbox containers.
  4. Cross-backend consistency — Guaranteeing that equivalent tests behave identically across environments.
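The timeout and cold-start problems in points 1 and 2 can be contained with a generic async retry wrapper along these lines (a sketch only; the attempt count, timeout, and backoff values are illustrative, not PolySandbox's actual settings):

```python
import asyncio

async def run_with_retries(call, attempts: int = 3,
                           timeout_s: float = 5.0, backoff_s: float = 0.1):
    """Retry a sandbox call on timeout or transient error, with linear backoff.

    `call` is any zero-argument coroutine function; all numeric
    parameters here are illustrative defaults.
    """
    last_exc = None
    for attempt in range(1, attempts + 1):
        try:
            return await asyncio.wait_for(call(), timeout=timeout_s)
        except (asyncio.TimeoutError, ConnectionError) as exc:
            last_exc = exc
            await asyncio.sleep(backoff_s * attempt)  # back off before retrying
    raise RuntimeError(f"sandbox call failed after {attempts} attempts") from last_exc

# Demo: a flaky backend that fails once (simulating a cold start), then succeeds.
calls = {"n": 0}

async def flaky():
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("cold start")
    return "ok"

print(asyncio.run(run_with_retries(flaky)))  # → ok
```

Wrapping every adapter call this way keeps a single slow or flaky provider from crashing the whole batch pipeline.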

Accomplishments I am proud of

  • Achieved end-to-end runs across three sandbox systems with a single /run call
  • Built a clean, extensible adapter interface (SandboxClient) that new backends can plug into easily
  • Integrated benchmark datasets like MBPP for reproducible AI code evaluation
  • Delivered a working UI → API → Sandbox → Scorer pipeline in under 48 hours
  • Demonstrated measurable performance improvements on Daytona’s cold starts
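The scorer at the end of that UI → API → Sandbox → Scorer pipeline can be sketched as a function that executes model-generated code and then runs MBPP-style assertion strings against it (the function name and its signature are illustrative assumptions; a real run would execute inside a sandbox backend, not in-process):

```python
def score_completion(code: str, tests: list[str]) -> float:
    """Execute generated code, then run MBPP-style assert statements against it.

    Returns the fraction of tests that pass. Shown in-process for
    illustration; the real pipeline runs this inside a sandbox.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)
    except Exception:
        return 0.0  # code that fails to even load scores zero
    passed = 0
    for test in tests:
        try:
            exec(test, namespace)
            passed += 1
        except Exception:
            pass  # a failed assert just counts as a failed test
    return passed / len(tests) if tests else 0.0

# Example with an MBPP-style task: "write a function that doubles a number".
score = score_completion(
    "def double(x):\n    return 2 * x",
    ["assert double(2) == 4", "assert double(0) == 0"],
)
print(score)  # → 1.0
```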

What I learned

  • The importance of interface design in building flexible infrastructure — good abstractions made it possible to add backends rapidly.
  • How subtle runtime differences between sandbox providers affect reproducibility.
  • How to balance developer experience and system reliability under tight time constraints.
  • A deeper understanding of async orchestration and error resilience in distributed systems.

What’s next for PolySandbox

  • Add cost metrics to compare sandbox providers economically.
  • Expand datasets (e.g., ARC-AGI, LiveBench) for broader evaluation coverage.
  • Integrate LLM-as-Judge agents to automatically score and explain failures.
  • Open-source the framework so researchers can plug in their own backends and datasets.
  • Deploy a public demo showcasing Daytona’s low-latency execution as a benchmark standard.
