BenchMe: Real-Time Evaluation of AI-Generated Code

👩‍🔬 About Us

We are software engineers with deep industry experience in building and testing production software.
Having seen firsthand how AI-generated code can silently introduce bugs and technical debt, we built BenchMe to bring scientific rigor to evaluating and approving code generated by AI agents.


🚀 Motivation

As AI agents increasingly generate code, the question becomes:
How do we verify, approve, and ensure the quality of this code?

Traditional review methods often fail to detect bugs, verify functionality, or prevent technical debt in AI-generated code. To address this, we need rigorous, scientific benchmarking, both in real time and during PR review.


💡 Inspiration

LLMs now play a central role in unit test generation. However, choosing the best model for a specific codebase is non-trivial: LLMs behave differently depending on the task and architecture.

Two critical benchmarks inspired BenchMe:

  • TestGenEval
    Evaluates LLMs on their ability to generate effective unit tests.
    📊 Metrics:
    • Line Coverage: How much of the source code is exercised by generated tests
    • Mutation Score: The percentage of injected code mutations caught by the tests
  • SWE-bench
    Evaluates an LLM’s ability to resolve real-world GitHub issues by submitting semantically correct PRs.
    📊 Metrics:
    • Pass@1 (functional correctness) against ground-truth solutions
    • Multi-file / multi-function change coordination success
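
For concreteness, here is a minimal sketch of how these three metrics are typically defined. The function names and signatures are illustrative only, not TestGenEval's or SWE-bench's actual harnesses:

```python
# Illustrative definitions of the three benchmark metrics (sketch, not a harness).

def line_coverage(executed: set[int], source: set[int]) -> float:
    """Fraction of source lines exercised by the generated tests."""
    return len(executed & source) / len(source)

def mutation_score(killed: int, survived: int) -> float:
    """Fraction of injected mutants that the generated tests catch (kill)."""
    total = killed + survived
    return killed / total if total else 0.0

def pass_at_1(solved_first_try: int, total_tasks: int) -> float:
    """Pass@1: fraction of issues whose first generated patch passes
    the ground-truth tests."""
    return solved_first_try / total_tasks
```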

However, both benchmarks are static and frequently appear in model training data, which limits their utility for identifying which model is best right now, on your codebase.


🛠️ What It Does

BenchMe brings real-time, benchmark-driven LLM evaluation to your actual GitHub pull requests.

Using an MCP (Model Context Protocol) server, BenchMe:

  1. Listens to PRs via webhook or scheduled trigger
  2. Extracts context (file diffs, function changes, test files, etc.)
  3. Runs inference with multiple LLMs on real-world testing tasks
  4. Evaluates results using:
    • Line Coverage (statement execution)
    • 🧬 Mutation Score (killed mutants vs. survived)
    • 🧪 Optional: assertion quality, test runtime stability
  5. Ranks models by performance on your codebase
  6. Selects the best model to generate final unit tests

These metrics are computed per-PR, ensuring relevance to the actual changes and eliminating reliance on outdated training datasets.
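
In miniature, steps 3-5 look something like the sketch below. Here `evaluate` is a hypothetical hook standing in for inference plus the coverage and mutation runs, not BenchMe's real API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Score:
    model: str
    line_coverage: float   # step 4: statement execution, 0.0-1.0
    mutation_score: float  # step 4: killed mutants / total mutants

def rank_models(models: list[str],
                evaluate: Callable[[str], tuple[float, float]]) -> list[Score]:
    """Score each candidate model on the PR's testing task, then rank.

    `evaluate` runs inference and the coverage/mutation evaluation for
    one model and returns (line_coverage, mutation_score).
    """
    scores = [Score(m, *evaluate(m)) for m in models]
    # Rank with mutation score ahead of raw coverage, since killed
    # mutants are the stronger signal of test quality.
    return sorted(scores,
                  key=lambda s: (s.mutation_score, s.line_coverage),
                  reverse=True)

# Canned example (real use would run the LLMs and test tools per PR):
results = {"model-a": (0.82, 0.61), "model-b": (0.90, 0.48)}
print(rank_models(list(results), results.__getitem__))  # model-a ranks first
```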


⚙️ How We Built It

  • PR triggers flow through a Dockerized MCP server
  • The MCP server extracts and normalizes the source context from the PR
  • LLMs are executed in parallel across real-world testing tasks using Docker containers
  • Each LLM generates unit tests, which are then evaluated using:
    • coverage.py for line coverage
    • mutmut, Stryker, or custom tools for mutation testing
  • Benchmarking results are gathered in real time, enabling per-PR model comparison
  • The best-performing model’s test output is selected and can be:
    • Committed to the repo
    • Posted as a comment on the PR
    • Delivered via API for external use

This fully containerized architecture allows teams to deploy the system self-hosted or in the cloud, with repeatable and scalable evaluation on every pull request.
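
As a rough sketch of what wiring coverage.py and mutmut into that pipeline can look like, the snippet below shells out to both tools; the exact CLI flags and result formats vary by tool version, and the container orchestration around it is elided:

```python
import json
import subprocess

def measure_line_coverage(test_path: str) -> float:
    """Run candidate tests under coverage.py and read the total.
    Assumes a pytest project; flags may vary by coverage.py version."""
    subprocess.run(["coverage", "run", "-m", "pytest", test_path], check=True)
    subprocess.run(["coverage", "json", "-o", "cov.json"], check=True)
    with open("cov.json") as f:
        return json.load(f)["totals"]["percent_covered"] / 100.0

def run_mutation_tests() -> str:
    """Run mutmut over the project and return its results summary.
    mutmut's exit code reflects surviving mutants, so it is not checked;
    the summary format is version-specific and parsed downstream."""
    subprocess.run(["mutmut", "run"])
    out = subprocess.run(["mutmut", "results"], capture_output=True, text=True)
    return out.stdout
```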


🧱 Challenges

  • Refactoring research-grade benchmarks into production workflows
  • Reducing latency for real-time PR use cases
  • Maintaining model consistency across multiple codebases and language boundaries
  • Dealing with PRs that touch untestable or framework-incompatible code

🏆 Accomplishments

  • Unified unit testing + mutation testing into a real-time scoring pipeline
  • Enabled reproducible, model-vs-model evaluation over closed-source PRs
  • Built a system that doesn't just evaluate code, but actively selects better models per task

🧠 What We Learned

  • Smaller LLMs (e.g. 7B–13B) can outperform larger ones for specific test tasks
  • Evaluation should be per-repo, not just per-dataset
  • Mutation testing is a far more precise measure of test quality than coverage alone (see the sketch below)
  • Even in high-quality repos, AI-generated tests often miss edge cases, so evaluation is critical
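
A toy example of that coverage-vs-mutation point: both tests below give `add` full line coverage, but only the second can kill the mutant that flips `+` to `-`.

```python
def add(a, b):
    return a + b              # mutant under test: return a - b

def test_add_weak():
    add(2, 2)                 # executes every line (100% coverage),
                              # but asserts nothing: the +/- mutant survives

def test_add_strong():
    assert add(2, 2) == 4     # kills the mutant: 2 - 2 == 0, not 4
```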

🏢 BenchMe Enterprise

BenchMe Enterprise offers a self-hosted dashboard for engineering teams and code reviewers. It allows you to:

  • ✅ Approve or reject AI-generated code with benchmarked testing metrics
  • 📊 Visualize model performance (line coverage, mutation score, pass/fail rate)
  • 🔁 Continuously evaluate models on internal PRs across different services
  • 🛡️ Ensure agentic code contributions don’t silently introduce regressions or debt

This enables organizations to adopt AI-generated code confidently, backed by rigorous metrics and reproducible evaluation pipelines.


🌐 Coming Soon: TestBenchArena

Inspired by Chatbot Arena, we’re building:

  • A public leaderboard of LLMs evaluated on real-world PRs
  • Community-driven benchmark submissions via API
  • Transparent data for researchers and developers to track model performance across time and codebases

🧪 Rigorous. Reproducible. Real-time. BenchMe brings scientific evaluation into the future of autonomous software agents.
