BenchMe: Real-Time Evaluation of AI-Generated Code
👩🔬 About Us
We are software engineers with deep industry experience in building and testing software.
Having seen firsthand how AI-generated code can silently introduce bugs and technical debt, we built BenchMe to bring scientific rigor to evaluating and approving code generated by AI agents.
🚀 Motivation
As AI agents increasingly generate code, the question becomes:
How do we verify, approve, and ensure the quality of this code?
Traditional review methods often fail to detect bugs, verify functionality, or prevent technical debt in AI-generated code. Addressing this requires rigorous, scientific benchmarking, both in real time and during PR review.
💡 Inspiration
LLMs now play a central role in unit test generation. However, choosing the best model for a specific codebase is non-trivial—LLMs behave differently depending on the task and architecture.
Two critical benchmarks inspired BenchMe:
TestGenEval
Evaluates LLMs on their ability to generate effective unit tests.
📊 Metrics:
- Line Coverage: How much of the source code is exercised by generated tests
- Mutation Score: The percentage of injected code mutations caught by the tests
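For intuition, both metrics reduce to simple ratios. The sketch below is illustrative only; the helpers are hypothetical, and in practice the numbers come from coverage and mutation-testing reports rather than raw counts:

```python
# Illustrative definitions of the two TestGenEval-style metrics.
# Hypothetical helpers, not BenchMe's scoring code.

def line_coverage(executed_lines: int, total_lines: int) -> float:
    """Fraction of source lines exercised by the generated tests."""
    return executed_lines / total_lines if total_lines else 0.0

def mutation_score(killed_mutants: int, survived_mutants: int) -> float:
    """Fraction of injected mutants that the tests detect (kill)."""
    total = killed_mutants + survived_mutants
    return killed_mutants / total if total else 0.0

# e.g. 420 of 500 lines executed, 38 of 50 mutants killed:
print(f"{line_coverage(420, 500):.0%} coverage, {mutation_score(38, 12):.0%} mutation score")
# -> 84% coverage, 76% mutation score
```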
SWE-bench
Evaluates an LLM’s ability to resolve real-world GitHub issues by submitting semantically correct PRs.
📊 Metrics:
- Pass@1 (functional correctness) against ground-truth solutions
- Multi-file / multi-function change coordination success
However, both benchmarks are static and frequently appear in model training data, limiting their usefulness for identifying which model is best now, for your codebase.
🛠️ What It Does
BenchMe brings real-time, benchmark-driven LLM evaluation to your actual GitHub pull requests.
Using an MCP (Model Context Protocol) server, BenchMe:
- Listens to PRs via webhook or scheduled trigger
- Extracts context (file diffs, function changes, test files, etc.)
- Runs inference with multiple LLMs on real-world testing tasks
- Evaluates results using:
  - ✅ Line Coverage (statement execution)
  - 🧬 Mutation Score (killed mutants vs. survived)
  - 🧪 Optional: assertion quality, test runtime stability
- Ranks models by performance on your codebase
- Selects the best model to generate final unit tests
These metrics are computed per-PR, ensuring relevance to the actual changes and eliminating reliance on outdated training datasets.
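A minimal sketch of that per-PR flow is shown below. All names (the event shape, model list, and helper functions) are hypothetical placeholders rather than BenchMe's actual API; the real pipeline runs inside the MCP server and Docker containers described in the next section.

```python
# Hypothetical sketch of the per-PR flow; names and payload shape are illustrative.
from dataclasses import dataclass


@dataclass
class PRContext:
    repo: str
    pr_number: int
    changed_files: list[str]
    diff: str


def generate_tests(model: str, ctx: PRContext) -> str:
    """Placeholder: BenchMe would call the model through the MCP server here."""
    return f"# tests generated by {model} for PR #{ctx.pr_number}"


def evaluate(test_code: str) -> dict:
    """Placeholder: BenchMe runs coverage and mutation testing (sketched later)."""
    return {"line_coverage": 0.0, "mutation_score": 0.0}


def handle_pr_event(event: dict) -> dict:
    """Entry point fired by a GitHub webhook or a scheduled trigger."""
    ctx = PRContext(
        repo=event["repository"]["full_name"],
        pr_number=event["pull_request"]["number"],
        changed_files=[f["filename"] for f in event.get("files", [])],
        diff=event.get("diff", ""),
    )
    candidates = ["model-a", "model-b", "model-c"]            # models under comparison
    suites = {m: generate_tests(m, ctx) for m in candidates}  # one test suite per model
    scores = {m: evaluate(code) for m, code in suites.items()}
    best = max(scores, key=lambda m: scores[m]["mutation_score"])  # rank, then pick the winner
    return {"ranking": scores, "selected_model": best, "tests": suites[best]}
```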
⚙️ How We Built It
- PR triggers flow through a dockerized MCP (Model Context Protocol) server
- The MCP server extracts and normalizes the source context from the PR
- LLMs are executed in parallel across real-world testing tasks using Docker containers
- Each LLM generates unit tests, which are then evaluated using:
  - coverage.py for line coverage
  - mutmut, stryker, or custom tools for mutation testing
- Benchmarking results are gathered in real time, enabling per-PR model comparison
- The best-performing model’s test output is selected and can be:
  - Committed to the repo
  - Posted as a comment on the PR
  - Delivered via API for external use
This fully containerized architecture allows teams to deploy the system self-hosted or in the cloud, with repeatable and scalable evaluation on every pull request.
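The evaluation step above (coverage.py for line coverage, mutmut for mutation testing) could look roughly like the following. This is a hedged sketch: it assumes each candidate test suite is materialized in its own working directory, and the parsing of mutmut's output is deliberately simplified and version-dependent.

```python
# Rough sketch of per-suite scoring with coverage.py and mutmut.
# Assumes the candidate tests are already written into `workdir`; the mutmut
# output parsing below is simplified and depends on the tool version.
import json
import subprocess


def run(cmd: list[str], cwd: str) -> subprocess.CompletedProcess:
    return subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)


def line_coverage(workdir: str) -> float:
    """Run the suite under coverage.py and read the total percent covered."""
    run(["coverage", "run", "-m", "pytest", "-q"], workdir)
    run(["coverage", "json", "-o", "coverage.json"], workdir)
    with open(f"{workdir}/coverage.json") as fh:
        return json.load(fh)["totals"]["percent_covered"]


def mutation_score(workdir: str) -> float:
    """Run mutmut and estimate killed / total mutants (illustrative parsing)."""
    run(["mutmut", "run"], workdir)
    report = run(["mutmut", "results"], workdir).stdout.lower()
    killed, survived = report.count("killed"), report.count("survived")
    total = killed + survived
    return 100.0 * killed / total if total else 0.0


def evaluate(workdir: str) -> dict:
    """Per-model score used for ranking in the PR comparison."""
    return {"line_coverage": line_coverage(workdir),
            "mutation_score": mutation_score(workdir)}
```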
🧱 Challenges
- Refactoring research-grade benchmarks into production workflows
- Reducing latency for real-time PR use cases
- Maintaining model consistency across multiple codebases and language boundaries
- Dealing with PRs that touch untestable or framework-incompatible code
🏆 Accomplishments
- Unified unit testing + mutation testing into a real-time scoring pipeline
- Enabled reproducible, model-vs-model evaluation over closed-source PRs
- Built a system that doesn't just evaluate code, but actively selects better models per task
🧠 What We Learned
- Smaller LLMs (e.g. 7B–13B) can outperform larger ones for specific test tasks
- Evaluation should be per-repo, not just per-dataset
- Mutation testing is a far more precise measure of test quality than coverage alone
- Even in high-quality repos, AI-generated tests often miss edge cases—evaluation is critical
🏢 BenchMe Enterprise
BenchMe Enterprise offers a self-hosted dashboard for engineering teams and code reviewers. It allows you to:
- ✅ Approve or reject AI-generated code with benchmarked testing metrics
- 📊 Visualize model performance (line coverage, mutation score, pass/fail rate)
- 🔁 Continuously evaluate models on internal PRs across different services
- 🛡️ Ensure agentic code contributions don’t silently introduce regressions or debt
This enables organizations to adopt AI-generated code confidently, backed by rigorous metrics and reproducible evaluation pipelines.
🌐 Coming Soon: TestBenchArena
Inspired by ChatBotArena, we’re building:
- A public leaderboard of LLMs evaluated on real-world PRs
- Community-driven benchmark submissions via API
- Transparent data for researchers and developers to track model performance across time and codebases
🧪 Rigorous. Reproducible. Real-time. BenchMe brings scientific evaluation into the future of autonomous software agents.