Inspiration

Code-generating AI is transforming the SDLC, but there is a critical gap: how do you know if the code an AI generated actually works? Today, teams eyeball LLM output, manually paste it into a project, and hope for the best. When an MCP server generates a Rust smart contract, there is no standardized way to verify it compiles, passes lint, and survives a test suite -- let alone to automatically fix it if it does not.

We built WhyMe QA to answer one question with executable proof: does this AI-generated code actually work?

What it does

WhyMe QA connects to any MCP server over stdio, executes scenario-defined tool calls, materializes the generated code into a workspace, and runs configurable verification gates -- format, lint, compile, test, WASM build -- each weighted into a 0-100 quality score.
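
The writeup does not spell out how gate weights map onto the 0-100 score, so here is a minimal sketch of one weighted rollup; the type and function names (GateResult, qualityScore) are hypothetical:

```typescript
// Hypothetical shapes -- actual gate names and weights come from the scenario config.
interface GateResult {
  name: string;    // e.g. "clippy" or "wasm-build"
  weight: number;  // relative weight defined in the scenario
  passed: boolean;
}

// Roll passing gates up onto a 0-100 scale, normalized by total weight.
function qualityScore(results: GateResult[]): number {
  const total = results.reduce((sum, g) => sum + g.weight, 0);
  if (total === 0) return 0;
  const earned = results
    .filter((g) => g.passed)
    .reduce((sum, g) => sum + g.weight, 0);
  return Math.round((earned / total) * 100);
}
```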

When gates fail, the LLM Agent Loop kicks in: an LLM reads workspace files and error output, returns fixes, and the platform re-runs all gates. This repeats until every gate passes or an iteration limit is reached. The agent maintains iteration history to prevent repeating failed approaches, and performs best-workspace rollback on score regression.
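
A minimal sketch of that loop, with the gate runner, LLM call, and workspace snapshot/restore injected as dependencies; every name here is illustrative, and the real runner tracks more state:

```typescript
// All interfaces and names are illustrative, not the actual runner API.
interface GateRun {
  score: number;   // 0-100 weighted quality score
  errors: string;  // concatenated gate output fed back to the LLM
}

interface LoopDeps {
  runGates: () => Promise<GateRun>;
  askLlmForFixes: (errors: string, history: string[]) => Promise<string>;
  applyFixes: (fixes: string) => Promise<void>;
  snapshot: () => Promise<string>;        // returns a workspace snapshot id
  restore: (id: string) => Promise<void>; // restores a previous snapshot
}

async function agentLoop(deps: LoopDeps, maxIterations: number): Promise<number> {
  const history: string[] = [];           // what was already tried, to avoid repeats
  let best = await deps.runGates();
  let bestSnapshot = await deps.snapshot();

  for (let i = 0; i < maxIterations && best.score < 100; i++) {
    const fixes = await deps.askLlmForFixes(best.errors, history);
    history.push(fixes);
    await deps.applyFixes(fixes);

    const next = await deps.runGates();
    if (next.score < best.score) {
      await deps.restore(bestSnapshot);   // roll back on score regression
    } else {
      best = next;
      bestSnapshot = await deps.snapshot();
    }
  }
  return best.score;
}
```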

Key capabilities:

  • Scenario-as-code: YAML defines the MCP server, tool calls, gates, and agent config, with ${VAR:-default} env var substitution (sketched after this list)
  • 7 verification gates for Stylus MVP: cargo fmt, clippy, compile, tests, WASM build, Stylus check, cargo audit
  • Rich reporting: JSON, JUnit XML (GitLab test reports), and Markdown with iteration details
  • GitLab CI pipeline with lint, test, and verify stages, plus a GitLab Duo flow with configurable inputs
  • 46 passing unit tests, zero external test frameworks
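
As referenced in the scenario-as-code bullet, here is a sketch of how ${VAR:-default} substitution can be applied to scenario text before YAML parsing; the function name and regex are illustrative, not the actual implementation:

```typescript
// Replace ${VAR} and ${VAR:-default} with values from the environment.
function substituteEnv(
  text: string,
  env: Record<string, string | undefined> = process.env,
): string {
  return text.replace(
    /\$\{([A-Z_][A-Z0-9_]*)(?::-([^}]*))?\}/g,
    (_match, name: string, fallback?: string) => env[name] ?? fallback ?? "",
  );
}

// Example: "${MAX_ITERATIONS:-5}" becomes the value of MAX_ITERATIONS, or "5" if unset.
```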

How we built it

A pnpm monorepo with four TypeScript packages (ESM, Node 22+): schemas, adapters, reporter, runner. The MCP client implements JSON-RPC 2.0 over stdio. The agent loop uses OpenRouter (supporting Claude, GPT-4, DeepSeek) with an injectable LlmClient interface. Gate scripts are standalone shell scripts receiving WORKSPACE_DIR, making them trivially replaceable. Built with native fetch, node:test, and tsc -- zero bundlers, minimal dependencies.
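
A sketch of what that injectable boundary might look like, with an OpenRouter-backed implementation on native fetch; the interface shape and field names are assumptions rather than the package's actual API:

```typescript
// Illustrative interface -- the real LlmClient likely carries more context (files, history).
interface LlmClient {
  complete(prompt: string): Promise<string>;
}

// OpenRouter speaks the OpenAI-compatible chat completions protocol.
function openRouterClient(apiKey: string, model: string): LlmClient {
  return {
    async complete(prompt: string): Promise<string> {
      const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
        method: "POST",
        headers: {
          Authorization: `Bearer ${apiKey}`,
          "Content-Type": "application/json",
        },
        body: JSON.stringify({
          model, // e.g. a Claude, GPT-4, or DeepSeek model id
          messages: [{ role: "user", content: prompt }],
        }),
      });
      if (!res.ok) throw new Error(`OpenRouter request failed: ${res.status}`);
      const data = (await res.json()) as {
        choices: { message: { content: string } }[];
      };
      return data.choices[0].message.content;
    },
  };
}
```

In unit tests, a stub implementing LlmClient can stand in for this client, which keeps the agent loop testable under node:test without network access.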

Challenges we ran into

  • Stylus SDK ecosystem quirks: alloy-consensus breaks on non-WASM targets, requiring fallbacks in gate scripts
  • Response parsing: MCP servers return code in wildly different formats, so we built a three-strategy parser (sketched after this list)
  • Preventing infinite fix loops: solved by feeding iteration history into prompts and rolling back to the best workspace on score regression
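
As referenced above, a sketch of a three-strategy extractor; the strategies shown here (fenced block, JSON field, raw fallback) are plausible guesses, not necessarily the ones the actual parser uses:

```typescript
// Try progressively looser strategies to pull source code out of a response string.
function extractCode(raw: string): string {
  // Strategy 1: a fenced code block somewhere in the response text.
  const fence = raw.match(/`{3}[a-zA-Z]*\n([\s\S]*?)`{3}/);
  if (fence) return fence[1];

  // Strategy 2: a structured JSON payload with a code-like field.
  try {
    const parsed = JSON.parse(raw);
    if (typeof parsed.code === "string") return parsed.code;
  } catch {
    // not JSON -- fall through
  }

  // Strategy 3: treat the entire response as code.
  return raw;
}
```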

Accomplishments that we're proud of

  • The agent loop genuinely works -- taking a 25/100 failing contract to 100/100 across 3 iterations
  • Truly generic architecture -- swap gate scripts for any language/framework
  • Zero unnecessary dependencies -- lightweight, auditable, fast
  • Full cryptographic provenance chain for every run (illustrated below)
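
The writeup does not detail the provenance mechanism; purely as an illustration, one way to chain artifact digests so that tampering with any earlier artifact invalidates everything after it:

```typescript
import { createHash } from "node:crypto";

// Illustration only -- not the actual provenance format used by the runner.
function provenanceChain(artifacts: { name: string; contents: string }[]): string[] {
  let prev = "";
  return artifacts.map(({ name, contents }) => {
    const digest = createHash("sha256")
      .update(prev)      // link to the previous artifact's digest
      .update(name)
      .update(contents)
      .digest("hex");
    prev = digest;
    return digest;
  });
}
```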

What we learned

  • AI-generated code quality varies enormously -- and is now measurable with reproducible scores
  • The agent loop is more powerful than expected, handling compilation errors, dependency issues, and test failures
  • MCP is the right abstraction layer for evaluating complete code generation pipelines
  • GitLab Duo flows make CI/CD accessible to non-engineers

What's next for WhyMe QA

  • Multi-language gate packs (Python, Solidity, Go, TypeScript)
  • Comparative benchmarking across MCP servers and models
  • GitLab MR comment integration for automated quality gates
  • Community scenario library for standardized benchmarks
