Inspiration

Evaluating AI-generated code safely across multiple environments is messy — every sandbox API has its own quirks, authentication, and runtime behavior. We wanted a single place where researchers and developers could run, score, and compare code execution across backends like Daytona, E2B, and Docker — instantly and reproducibly. That’s how PolySandbox was born: a unified, backend-agnostic execution layer for AI evaluation.

What it does

PolySandbox is a universal sandbox orchestrator that lets you:

  • Run Python code safely in isolated backends (Daytona, E2B, Docker)
  • Evaluate model-generated code from datasets like MBPP and HumanEval
  • Automatically score correctness and capture runtime metrics
  • Compare sandbox performance (speed, reliability, success rate) side-by-side
  • Interact through a simple FastAPI backend and a Streamlit UI

In short: one click → any sandbox → consistent results.
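The "one click → any sandbox" idea boils down to a single request whose payload names a backend and carries the code to run. A minimal sketch of what building such a request could look like (the helper `build_run_request` and the field names `backend`, `code`, and `timeout_s` are illustrative assumptions, not PolySandbox's actual schema):

```python
import json

def build_run_request(backend: str, code: str, timeout_s: int = 30) -> dict:
    """Build a payload for a hypothetical /run_single call.

    Field names here are assumptions for illustration; the real
    PolySandbox request schema may differ.
    """
    if backend not in {"daytona", "e2b", "docker"}:
        raise ValueError(f"unsupported backend: {backend}")
    return {"backend": backend, "code": code, "timeout_s": timeout_s}

# The same code can be pointed at any backend just by changing one field.
payload = build_run_request("docker", "print(2 + 2)")
print(json.dumps(payload))
```

Because only the `backend` field changes between providers, comparing results across sandboxes becomes a loop over backend names rather than three separate client integrations.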

How I built it

  • Backend: FastAPI server that exposes /run_single, /run_agent, and /run_batch endpoints for the different sandboxes
  • Adapters: Modular clients for Daytona, E2B, and Docker, each implementing a shared interface (SandboxClient)
  • Evaluator: Unified pipeline that executes code, measures runtime, and normalizes outputs into a common ExecutionResult schema
  • Datasets: Integrated loaders for MBPP and HumanEval via Hugging Face
  • Frontend: Streamlit app that lets users pick a backend, edit code, and visualize results in real time
  • Testing & Tooling: Built and managed using uv, pytest, and async test harnesses
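The adapter and evaluator pieces above hinge on one shared contract. A minimal sketch of that contract (the names `SandboxClient` and `ExecutionResult` come from the project; the specific fields and the in-process stub backend are my assumptions for illustration, since real adapters would call the remote Daytona/E2B/Docker APIs):

```python
import contextlib
import io
import time
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class ExecutionResult:
    """Normalized output every adapter returns (fields assumed)."""
    backend: str
    stdout: str
    stderr: str
    exit_code: int
    runtime_s: float

class SandboxClient(ABC):
    """Shared interface each backend adapter implements."""
    name: str

    @abstractmethod
    def run(self, code: str) -> ExecutionResult: ...

class LocalStubClient(SandboxClient):
    """Illustrative stand-in that executes code in-process.

    A real adapter would ship the code to an isolated sandbox instead.
    """
    name = "stub"

    def run(self, code: str) -> ExecutionResult:
        buf = io.StringIO()
        start = time.perf_counter()
        try:
            with contextlib.redirect_stdout(buf):
                exec(code, {})
            exit_code, stderr = 0, ""
        except Exception as exc:
            exit_code, stderr = 1, str(exc)
        runtime = time.perf_counter() - start
        return ExecutionResult(self.name, buf.getvalue(), stderr, exit_code, runtime)

result = LocalStubClient().run("print('hello')")
print(result.stdout.strip(), result.exit_code)
```

Because every adapter returns the same `ExecutionResult` shape, the evaluator and the Streamlit UI never need backend-specific branches.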

Challenges I ran into

  1. Orchestrating multiple sandbox APIs — Each platform (Daytona, E2B, Docker) had unique authentication, request formats, and runtime limits. Normalizing these into a single clean interface took careful abstraction.
  2. Error handling and resilience — Handling network timeouts, cold starts, and inconsistent API responses gracefully without crashing the pipeline.
  3. Runtime safety — Ensuring user code executes securely and deterministically inside sandbox containers.
  4. Cross-backend consistency — Guaranteeing that equivalent tests behave identically across environments.
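The timeout and cold-start problems in points 1 and 2 can be contained with a generic async retry wrapper along these lines (a sketch only; the attempt count, timeout, and backoff values are illustrative, not PolySandbox's actual settings):

```python
import asyncio

async def run_with_retries(call, attempts: int = 3,
                           timeout_s: float = 5.0, backoff_s: float = 0.1):
    """Retry a sandbox call on timeout or transient error, with linear backoff.

    `call` is any zero-argument coroutine function; all numeric
    parameters here are illustrative defaults.
    """
    last_exc = None
    for attempt in range(1, attempts + 1):
        try:
            return await asyncio.wait_for(call(), timeout=timeout_s)
        except (asyncio.TimeoutError, ConnectionError) as exc:
            last_exc = exc
            await asyncio.sleep(backoff_s * attempt)  # back off before retrying
    raise RuntimeError(f"sandbox call failed after {attempts} attempts") from last_exc

# Demo: a flaky backend that fails once (simulating a cold start), then succeeds.
calls = {"n": 0}

async def flaky():
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("cold start")
    return "ok"

print(asyncio.run(run_with_retries(flaky)))  # → ok
```

Wrapping every adapter call this way keeps a single slow or flaky provider from crashing the whole batch pipeline.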

Accomplishments I am proud of

  • Achieved end-to-end runs across three sandbox systems with a single /run call
  • Built a clean, extensible adapter interface (SandboxClient) that new backends can plug into easily
  • Integrated benchmark datasets like MBPP for reproducible AI code evaluation
  • Delivered a working UI → API → Sandbox → Scorer pipeline in under 48 hours
  • Demonstrated measurable performance improvements on Daytona’s cold starts
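The scorer at the end of that UI → API → Sandbox → Scorer pipeline can be sketched as a function that executes model-generated code and then runs MBPP-style assertion strings against it (the function name and its signature are illustrative assumptions; a real run would execute inside a sandbox backend, not in-process):

```python
def score_completion(code: str, tests: list[str]) -> float:
    """Execute generated code, then run MBPP-style assert statements against it.

    Returns the fraction of tests that pass. Shown in-process for
    illustration; the real pipeline runs this inside a sandbox.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)
    except Exception:
        return 0.0  # code that fails to even load scores zero
    passed = 0
    for test in tests:
        try:
            exec(test, namespace)
            passed += 1
        except Exception:
            pass  # a failed assert just counts as a failed test
    return passed / len(tests) if tests else 0.0

# Example with an MBPP-style task: "write a function that doubles a number".
score = score_completion(
    "def double(x):\n    return 2 * x",
    ["assert double(2) == 4", "assert double(0) == 0"],
)
print(score)  # → 1.0
```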

What I learned

  • The importance of interface design in building flexible infrastructure — good abstractions made it possible to add backends rapidly.
  • How subtle runtime differences between sandbox providers affect reproducibility.
  • How to balance developer experience and system reliability under tight time constraints.
  • A deeper understanding of async orchestration and error resilience in distributed systems.

What’s next for PolySandbox

  • Add cost metrics to compare sandbox providers economically.
  • Expand datasets (e.g., ARC-AGI, LiveBench) for broader evaluation coverage.
  • Integrate LLM-as-Judge agents to automatically score and explain failures.
  • Open-source the framework so researchers can plug in their own backends and datasets.
  • Deploy a public demo showcasing Daytona’s low-latency execution as a benchmark standard.
