Inspiration
Evaluating AI-generated code safely across multiple environments is messy — every sandbox API has its own quirks, authentication, and runtime behavior. We wanted a single place where researchers and developers could run, score, and compare code execution across backends like Daytona, E2B, and Docker — instantly and reproducibly. That’s how PolySandbox was born: a unified, backend-agnostic execution layer for AI evaluation.
What it does
PolySandbox is a universal sandbox orchestrator that lets you:
- Run Python code safely in isolated backends (Daytona, E2B, Docker)
- Evaluate model-generated code from datasets like MBPP and HumanEval
- Automatically score correctness and capture runtime metrics
- Compare sandbox performance (speed, reliability, success rate) side-by-side
- Interact through a simple FastAPI backend and a Streamlit UI
In short: one click → any sandbox → consistent results.
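To make "one click → any sandbox" concrete, here is a minimal sketch of what a request to the `/run_single` endpoint might look like. The payload field names (`backend`, `code`, `timeout`) are assumptions for illustration, not PolySandbox's documented schema.

```python
# Hypothetical request shape for PolySandbox's /run_single endpoint.
# Field names here are illustrative assumptions, not the project's
# actual API contract.
import json


def build_run_request(backend: str, code: str, timeout: int = 30) -> str:
    """Serialize a single-run request body for POST /run_single."""
    payload = {"backend": backend, "code": code, "timeout": timeout}
    return json.dumps(payload)


body = build_run_request("daytona", "print(2 + 2)")
# Send with any HTTP client, e.g.:
#   httpx.post(f"{BASE_URL}/run_single", content=body,
#              headers={"Content-Type": "application/json"})
```

Swapping `"daytona"` for `"e2b"` or `"docker"` would be the only change needed to target a different backend.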
How I built it
- Backend: FastAPI server that exposes `/run_single`, `/run_agent`, and `/run_batch` endpoints for different sandboxes
- Adapters: Modular clients for Daytona, E2B, and Docker, each implementing a shared interface (`SandboxClient`)
- Evaluator: Unified pipeline that executes code, measures runtime, and normalizes outputs into a common `ExecutionResult` schema
- Datasets: Integrated loaders for MBPP and HumanEval via Hugging Face
- Frontend: Streamlit app that lets users pick a backend, edit code, and visualize results in real time
- Testing & Tooling: Built and managed using uv, pytest, and async test harnesses
Challenges I ran into
- Orchestrating multiple sandbox APIs — Each platform (Daytona, E2B, Docker) had unique authentication, request formats, and runtime limits. Normalizing these into a single clean interface took careful abstraction.
- Error handling and resilience — Handling network timeouts, cold starts, and inconsistent API responses gracefully without crashing the pipeline.
- Runtime safety — Ensuring user code executes securely and deterministically inside sandbox containers.
- Cross-backend consistency — Guaranteeing that equivalent tests behave identically across environments.
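The resilience problem above (timeouts, cold starts, flaky APIs) is typically handled with a retry-with-backoff wrapper around each sandbox call. This is a hedged sketch of that pattern, not PolySandbox's actual implementation; the attempt count, timeout, and backoff constants are illustrative.

```python
# Illustrative retry/timeout wrapper for flaky async sandbox calls.
# Constants and exception choices are assumptions, not the project's code.
import asyncio


async def run_with_retries(coro_factory, attempts: int = 3, timeout_s: float = 10.0):
    """Retry an async call, treating timeouts and connection errors as retryable."""
    last_exc = None
    for attempt in range(attempts):
        try:
            # coro_factory() builds a fresh coroutine for each attempt.
            return await asyncio.wait_for(coro_factory(), timeout=timeout_s)
        except (asyncio.TimeoutError, ConnectionError) as exc:
            last_exc = exc
            # Exponential backoff: 0.1s, 0.2s, 0.4s, ...
            await asyncio.sleep(2 ** attempt * 0.1)
    raise last_exc
```

Keeping the retry policy outside the adapters means every backend gets the same resilience behavior for free, which also helps the cross-backend consistency goal.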
Accomplishments I am proud of
- Achieved end-to-end runs across three sandbox systems with a single `/run` call
- Built a clean, extensible adapter interface (`SandboxClient`) that new backends can plug into easily
- Integrated benchmark datasets like MBPP for reproducible AI code evaluation
- Delivered a working UI → API → Sandbox → Scorer pipeline in under 48 hours
- Demonstrated measurable performance improvements on Daytona’s cold starts
What I learned
- The importance of interface design in building flexible infrastructure — good abstractions made it possible to add backends rapidly.
- How subtle runtime differences between sandbox providers affect reproducibility.
- How to balance developer experience and system reliability under tight time constraints.
- Deepened understanding of async orchestration and error resilience in distributed systems.
What’s next for PolySandbox
- Add cost metrics to compare sandbox providers economically.
- Expand datasets (e.g., ARC-AGI, LiveBench) for broader evaluation coverage.
- Integrate LLM-as-Judge agents to automatically score and explain failures.
- Open-source the framework so researchers can plug in their own backends and datasets.
- Deploy a public demo showcasing Daytona’s low-latency execution as a benchmark standard.