Inspiration
Code-generating AI is transforming the SDLC, but there is a critical gap: how do you know if the code an AI generated actually works? Today, teams eyeball LLM output, manually paste it into a project, and hope for the best. When an MCP server generates a Rust smart contract, there is no standardized way to verify it compiles, passes lint, and survives a test suite -- let alone to automatically fix it if it does not.
We built WhyMe QA to answer one question with executable proof: does this AI-generated code actually work?
What it does
WhyMe QA connects to any MCP server over stdio, executes scenario-defined tool calls, materializes the generated code into a workspace, and runs configurable verification gates -- format, lint, compile, test, WASM build -- each weighted into a 0-100 quality score.
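The weighted score can be pictured with a short sketch. The gate names and weights below are illustrative, not the shipped configuration:

```typescript
// Sketch of the weighted 0-100 quality score described above.
// Gate names and weights are examples, not WhyMe QA's actual defaults.
interface GateResult {
  name: string;
  weight: number;   // relative weight of this gate in the final score
  passed: boolean;
}

function qualityScore(results: GateResult[]): number {
  const total = results.reduce((sum, g) => sum + g.weight, 0);
  if (total === 0) return 0;
  const earned = results
    .filter((g) => g.passed)
    .reduce((sum, g) => sum + g.weight, 0);
  return Math.round((earned / total) * 100);
}

const score = qualityScore([
  { name: "fmt", weight: 1, passed: true },
  { name: "clippy", weight: 2, passed: true },
  { name: "compile", weight: 3, passed: false },
  { name: "test", weight: 3, passed: true },
  { name: "wasm", weight: 1, passed: true },
]);
// earned = 1 + 2 + 3 + 1 = 7 of total 10 → score 70
```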
When gates fail, the LLM Agent Loop kicks in: an LLM reads workspace files and error output, returns fixes, and the platform re-runs all gates. This repeats until every gate passes or an iteration limit is reached. The agent maintains iteration history to prevent repeating failed approaches, and performs best-workspace rollback on score regression.
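In rough pseudocode-made-real, the loop looks like the sketch below. The `LlmClient` shape and helper names are assumptions for illustration, not the actual package API:

```typescript
// Sketch of the agent fix loop: run gates, ask the LLM for fixes, re-run,
// keep the best-scoring workspace, stop at 100 or the iteration cap.
// Interface shapes here are illustrative, not WhyMe QA's real API.
interface LlmClient {
  // Returns file path → replacement content.
  proposeFixes(files: Record<string, string>, context: string): Promise<Record<string, string>>;
}

interface GateRun {
  score: number;   // weighted 0-100 score across all gates
  errors: string;  // concatenated output from failing gates
}

async function agentLoop(
  llm: LlmClient,
  runGates: (ws: Record<string, string>) => Promise<GateRun>,
  workspace: Record<string, string>,
  maxIterations = 5,
): Promise<{ workspace: Record<string, string>; score: number }> {
  let current = workspace;
  let best = { workspace, score: -1 };
  const history: string[] = []; // fed into prompts so failed approaches aren't repeated

  for (let i = 0; i < maxIterations; i++) {
    const run = await runGates(current);
    if (run.score > best.score) best = { workspace: current, score: run.score };
    if (run.score >= 100) break; // every gate passed
    history.push(`iteration ${i}: score ${run.score}`);
    const fixes = await llm.proposeFixes(best.workspace, run.errors + "\n" + history.join("\n"));
    // Patch on top of the best workspace so far: rollback on score regression.
    current = { ...best.workspace, ...fixes };
  }
  return { workspace: best.workspace, score: Math.max(best.score, 0) };
}
```

Patching on top of the best workspace, rather than the latest one, is what makes a regression harmless: a bad fix is simply discarded on the next iteration.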
Key capabilities:
- Scenario-as-code: YAML defines the MCP server, tool calls, gates, and agent config, with ${VAR:-default} env var substitution
- Seven verification gates for the Stylus MVP: cargo fmt, clippy, compile, tests, WASM build, Stylus check, cargo audit
- Rich reporting: JSON, JUnit XML (GitLab test reports), and Markdown with iteration details
- GitLab CI pipeline with lint, test, and verify stages, plus a GitLab Duo flow with configurable inputs
- 46 passing unit tests, zero external test frameworks
How we built it
A pnpm monorepo with four TypeScript packages (ESM, Node 22+): schemas, adapters, reporter, runner. The MCP client implements JSON-RPC 2.0 over stdio. The agent loop uses OpenRouter (supporting Claude, GPT-4, DeepSeek) with an injectable LlmClient interface. Gate scripts are standalone shell scripts receiving WORKSPACE_DIR, making them trivially replaceable. Built with native fetch, node:test, and tsc -- zero bundlers, minimal dependencies.
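For the MCP transport, each JSON-RPC 2.0 message is a single line of JSON written to the server's stdin (MCP's stdio framing is newline-delimited). A sketch of the request side; the tool name, arguments, and helper names are illustrative:

```typescript
// Minimal sketch of JSON-RPC 2.0 request framing as an MCP client sends it
// over stdio: one JSON object per newline-terminated line.
// The tool name and arguments below are examples, not a real server's API.
let nextId = 0;

function encodeRequest(method: string, params: unknown): string {
  return JSON.stringify({ jsonrpc: "2.0", id: ++nextId, method, params }) + "\n";
}

interface JsonRpcResponse {
  jsonrpc: "2.0";
  id: number;
  result?: unknown;
  error?: { code: number; message: string };
}

function decodeResponse(line: string): JsonRpcResponse {
  return JSON.parse(line) as JsonRpcResponse;
}

// A tools/call request, as written to the MCP server's stdin:
const req = encodeRequest("tools/call", {
  name: "generate_contract",     // illustrative tool name
  arguments: { kind: "erc20" },  // illustrative arguments
});
```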
Challenges we ran into
- Stylus SDK ecosystem quirks: alloy-consensus breaks on non-WASM targets, requiring fallbacks in gate scripts
- LLM response parsing: MCP servers return code in wildly different formats -- built a three-strategy parser
- Preventing infinite fix loops: Solved with iteration history in prompts and best-workspace rollback on regression
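The three-strategy parser can be sketched roughly as follows; the strategy order and field names here are assumptions, not the exact shipped logic:

```typescript
// Illustrative sketch of a multi-strategy extractor for code returned by
// MCP tools: try a fenced code block, then a JSON field, then fall back to
// treating the whole response as code. Details are assumed for illustration.
function extractCode(response: string): string {
  // Strategy 1: fenced code block (```lang ... ```)
  const fence = response.match(/```[a-zA-Z]*\n([\s\S]*?)```/);
  if (fence) return fence[1].trim();

  // Strategy 2: JSON payload with a code-bearing field
  try {
    const obj = JSON.parse(response);
    for (const key of ["code", "content", "source"]) {
      if (typeof obj[key] === "string") return obj[key].trim();
    }
  } catch {
    // not JSON; fall through
  }

  // Strategy 3: treat the raw response as code
  return response.trim();
}
```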
Accomplishments that we're proud of
- The agent loop genuinely works -- taking a 25/100 failing contract to 100/100 across 3 iterations
- Truly generic architecture -- swap gate scripts for any language/framework
- Zero unnecessary dependencies -- lightweight, auditable, fast
- Full cryptographic provenance chain for every run
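One way to picture a provenance chain like the one above: each run record commits to the hash of the previous record, so altering any step breaks verification from that point on. This is a sketch of the idea using node:crypto, not the platform's actual record schema:

```typescript
import { createHash } from "node:crypto";

// Sketch of a hash-chained provenance log: each entry commits to the
// previous entry's hash, so tampering anywhere invalidates the chain.
// Field names are illustrative, not WhyMe QA's actual format.
interface ProvenanceEntry {
  step: string;        // e.g. "tool_call", "gate:compile", "llm_fix"
  payloadHash: string; // SHA-256 of the step's artifacts
  prevHash: string;    // hash of the previous entry ("" for the first)
  hash: string;        // SHA-256 over step + payloadHash + prevHash
}

function sha256(data: string): string {
  return createHash("sha256").update(data).digest("hex");
}

function appendEntry(chain: ProvenanceEntry[], step: string, payload: string): ProvenanceEntry[] {
  const prevHash = chain.length ? chain[chain.length - 1].hash : "";
  const payloadHash = sha256(payload);
  const hash = sha256(step + payloadHash + prevHash);
  return [...chain, { step, payloadHash, prevHash, hash }];
}

function verifyChain(chain: ProvenanceEntry[]): boolean {
  return chain.every((e, i) => {
    const prevHash = i === 0 ? "" : chain[i - 1].hash;
    return e.prevHash === prevHash && e.hash === sha256(e.step + e.payloadHash + prevHash);
  });
}
```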
What we learned
- AI-generated code quality varies enormously -- and is now measurable with reproducible scores
- The agent loop is more powerful than expected, handling compilation errors, dependency issues, and test failures
- MCP is the right abstraction layer for evaluating complete code generation pipelines
- GitLab Duo flows make CI/CD accessible to non-engineers
What's next for WhyMe QA
- Multi-language gate packs (Python, Solidity, Go, TypeScript)
- Comparative benchmarking across MCP servers and models
- GitLab MR comment integration for automated quality gates
- Community scenario library for standardized benchmarks
Built With
- arbitrum-stylus
- gitlab-ci/cd
- gitlab-duo
- mcp (model-context-protocol)
- node.js
- openrouter
- pnpm
- rust
- shell
- typescript