BenchMe: Real-Time Evaluation of AI-Generated Code

👩‍🔬 About Us

We are software engineers with deep industry experience in building and testing production software.
Having seen firsthand how AI-generated code can silently introduce bugs and technical debt, we built BenchMe to bring scientific rigor to evaluating and approving code generated by AI agents.


🚀 Motivation

As AI agents increasingly generate code, the question becomes:
How do we verify, approve, and ensure the quality of this code?

Traditional review methods often fail to detect bugs, verify functionality, or prevent technical debt in AI-generated code. To address this, we need rigorous, scientific benchmarking, both in real time and during PR review.


💡 Inspiration

LLMs now play a central role in unit test generation. However, choosing the best model for a specific codebase is non-trivial: LLMs behave differently depending on the task and architecture.

Two critical benchmarks inspired BenchMe:

  • TestGenEval
    Evaluates LLMs on their ability to generate effective unit tests.
    📊 Metrics:
    • Line Coverage: How much of the source code is exercised by generated tests
    • Mutation Score: The percentage of injected code mutations caught by the tests
  • SWE-bench
    Evaluates an LLM’s ability to resolve real-world GitHub issues by submitting semantically correct PRs.
    📊 Metrics:
    • Pass@1 (functional correctness) against ground-truth solutions
    • Multi-file / multi-function change coordination success
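
For concreteness, here is a minimal sketch of how these three metrics are typically defined. The function names and signatures are illustrative only, not TestGenEval's or SWE-bench's actual harnesses:

```python
# Illustrative definitions of the three benchmark metrics (sketch, not a harness).

def line_coverage(executed: set[int], source: set[int]) -> float:
    """Fraction of source lines exercised by the generated tests."""
    return len(executed & source) / len(source)

def mutation_score(killed: int, survived: int) -> float:
    """Fraction of injected mutants that the generated tests catch (kill)."""
    total = killed + survived
    return killed / total if total else 0.0

def pass_at_1(solved_first_try: int, total_tasks: int) -> float:
    """Pass@1: fraction of issues whose first generated patch passes
    the ground-truth tests."""
    return solved_first_try / total_tasks
```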

However, both benchmarks are static and frequently appear in model training data, which limits their utility for identifying which model is best right now, on your codebase.


🛠️ What It Does

BenchMe brings real-time, benchmark-driven LLM evaluation to your actual GitHub pull requests.

Using an MCP (Model Context Protocol) server, BenchMe:

  1. Listens to PRs via webhook or scheduled trigger
  2. Extracts context (file diffs, function changes, test files, etc.)
  3. Runs inference with multiple LLMs on real-world testing tasks
  4. Evaluates results using:
    • Line Coverage (statement execution)
    • 🧬 Mutation Score (killed mutants vs. survived)
    • 🧪 Optional: assertion quality, test runtime stability
  5. Ranks models by performance on your codebase
  6. Selects the best model to generate final unit tests

These metrics are computed per-PR, ensuring relevance to the actual changes and eliminating reliance on outdated training datasets.
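
In miniature, steps 3-5 look something like the sketch below. Here `evaluate` is a hypothetical hook standing in for inference plus the coverage and mutation runs, not BenchMe's real API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Score:
    model: str
    line_coverage: float   # step 4: statement execution, 0.0-1.0
    mutation_score: float  # step 4: killed mutants / total mutants

def rank_models(models: list[str],
                evaluate: Callable[[str], tuple[float, float]]) -> list[Score]:
    """Score each candidate model on the PR's testing task, then rank.

    `evaluate` runs inference and the coverage/mutation evaluation for
    one model and returns (line_coverage, mutation_score).
    """
    scores = [Score(m, *evaluate(m)) for m in models]
    # Rank with mutation score ahead of raw coverage, since killed
    # mutants are the stronger signal of test quality.
    return sorted(scores,
                  key=lambda s: (s.mutation_score, s.line_coverage),
                  reverse=True)

# Canned example (real use would run the LLMs and test tools per PR):
results = {"model-a": (0.82, 0.61), "model-b": (0.90, 0.48)}
print(rank_models(list(results), results.__getitem__))  # model-a ranks first
```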


⚙️ How We Built It

  • PR triggers flow through a Dockerized MCP server
  • The MCP server extracts and normalizes the source context from the PR
  • LLMs are executed in parallel across real-world testing tasks using Docker containers
  • Each LLM generates unit tests, which are then evaluated using:
    • coverage.py for line coverage
    • mutmut, Stryker, or custom tools for mutation testing
  • Benchmarking results are gathered in real time, enabling per-PR model comparison
  • The best-performing model’s test output is selected and can be:
    • Committed to the repo
    • Posted as a comment on the PR
    • Delivered via API for external use

This fully containerized architecture allows teams to deploy the system self-hosted or in the cloud, with repeatable and scalable evaluation on every pull request.
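
As a rough sketch of what wiring coverage.py and mutmut into that pipeline can look like, the snippet below shells out to both tools; the exact CLI flags and result formats vary by tool version, and the container orchestration around it is elided:

```python
import json
import subprocess

def measure_line_coverage(test_path: str) -> float:
    """Run candidate tests under coverage.py and read the total.
    Assumes a pytest project; flags may vary by coverage.py version."""
    subprocess.run(["coverage", "run", "-m", "pytest", test_path], check=True)
    subprocess.run(["coverage", "json", "-o", "cov.json"], check=True)
    with open("cov.json") as f:
        return json.load(f)["totals"]["percent_covered"] / 100.0

def run_mutation_tests() -> str:
    """Run mutmut over the project and return its results summary.
    mutmut's exit code reflects surviving mutants, so it is not checked;
    the summary format is version-specific and parsed downstream."""
    subprocess.run(["mutmut", "run"])
    out = subprocess.run(["mutmut", "results"], capture_output=True, text=True)
    return out.stdout
```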


🧱 Challenges

  • Refactoring research-grade benchmarks into production workflows
  • Reducing latency for real-time PR use cases
  • Maintaining model consistency across multiple codebases and language boundaries
  • Dealing with PRs that touch untestable or framework-incompatible code

🏆 Accomplishments

  • Unified unit testing + mutation testing into a real-time scoring pipeline
  • Enabled reproducible, model-vs-model evaluation over closed-source PRs
  • Built a system that doesn't just evaluate code, but actively selects better models per task

🧠 What We Learned

  • Smaller LLMs (e.g. 7B–13B) can outperform larger ones for specific test tasks
  • Evaluation should be per-repo, not just per-dataset
  • Mutation testing is a far more precise measure of test quality than coverage alone (see the sketch below)
  • Even in high-quality repos, AI-generated tests often miss edge cases, so evaluation is critical
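
A toy example of that coverage-vs-mutation point: both tests below give `add` full line coverage, but only the second can kill the mutant that flips `+` to `-`.

```python
def add(a, b):
    return a + b              # mutant under test: return a - b

def test_add_weak():
    add(2, 2)                 # executes every line (100% coverage),
                              # but asserts nothing: the +/- mutant survives

def test_add_strong():
    assert add(2, 2) == 4     # kills the mutant: 2 - 2 == 0, not 4
```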

🏢 BenchMe Enterprise

BenchMe Enterprise offers a self-hosted dashboard for engineering teams and code reviewers. It allows you to:

  • ✅ Approve or reject AI-generated code with benchmarked testing metrics
  • 📊 Visualize model performance (line coverage, mutation score, pass/fail rate)
  • 🔁 Continuously evaluate models on internal PRs across different services
  • 🛡️ Ensure agentic code contributions don’t silently introduce regressions or debt

This enables organizations to adopt AI-generated code confidently, backed by rigorous metrics and reproducible evaluation pipelines.


🌐 Coming Soon: TestBenchArena

Inspired by Chatbot Arena, we’re building:

  • A public leaderboard of LLMs evaluated on real-world PRs
  • Community-driven benchmark submissions via API
  • Transparent data for researchers and developers to track model performance across time and codebases

🧪 Rigorous. Reproducible. Real-time. BenchMe brings scientific evaluation into the future of autonomous software agents.
