Inspiration
Code-generating AI is transforming the SDLC, but there is a critical gap: how do you know if the code an AI generated actually works? Today, teams eyeball LLM output, manually paste it into a project, and hope for the best. When an MCP server generates a Rust smart contract, there is no standardized way to verify it compiles, passes lint, and survives a test suite -- let alone to automatically fix it if it does not.
We built WhyMe QA to answer one question with executable proof: does this AI-generated code actually work?
What it does
WhyMe QA connects to any MCP server over stdio, executes scenario-defined tool calls, materializes the generated code into a workspace, and runs configurable verification gates -- format, lint, compile, test, WASM build -- each weighted into a 0-100 quality score.
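The weighted score can be pictured with a short sketch. The gate names and weights below are illustrative, not the shipped configuration:

```typescript
// Sketch of the weighted 0-100 quality score described above.
// Gate names and weights are examples, not WhyMe QA's actual defaults.
interface GateResult {
  name: string;
  weight: number;   // relative weight of this gate in the final score
  passed: boolean;
}

function qualityScore(results: GateResult[]): number {
  const total = results.reduce((sum, g) => sum + g.weight, 0);
  if (total === 0) return 0;
  const earned = results
    .filter((g) => g.passed)
    .reduce((sum, g) => sum + g.weight, 0);
  return Math.round((earned / total) * 100);
}

const score = qualityScore([
  { name: "fmt", weight: 1, passed: true },
  { name: "clippy", weight: 2, passed: true },
  { name: "compile", weight: 3, passed: false },
  { name: "test", weight: 3, passed: true },
  { name: "wasm", weight: 1, passed: true },
]);
// earned = 1 + 2 + 3 + 1 = 7 of total 10 → score 70
```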
When gates fail, the LLM Agent Loop kicks in: an LLM reads workspace files and error output, returns fixes, and the platform re-runs all gates. This repeats until every gate passes or an iteration limit is reached. The agent maintains iteration history to prevent repeating failed approaches, and performs best-workspace rollback on score regression.
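In rough pseudocode-made-real, the loop looks like the sketch below. The `LlmClient` shape and helper names are assumptions for illustration, not the actual package API:

```typescript
// Sketch of the agent fix loop: run gates, ask the LLM for fixes, re-run,
// keep the best-scoring workspace, stop at 100 or the iteration cap.
// Interface shapes here are illustrative, not WhyMe QA's real API.
interface LlmClient {
  // Returns file path → replacement content.
  proposeFixes(files: Record<string, string>, context: string): Promise<Record<string, string>>;
}

interface GateRun {
  score: number;   // weighted 0-100 score across all gates
  errors: string;  // concatenated output from failing gates
}

async function agentLoop(
  llm: LlmClient,
  runGates: (ws: Record<string, string>) => Promise<GateRun>,
  workspace: Record<string, string>,
  maxIterations = 5,
): Promise<{ workspace: Record<string, string>; score: number }> {
  let current = workspace;
  let best = { workspace, score: -1 };
  const history: string[] = []; // fed into prompts so failed approaches aren't repeated

  for (let i = 0; i < maxIterations; i++) {
    const run = await runGates(current);
    if (run.score > best.score) best = { workspace: current, score: run.score };
    if (run.score >= 100) break; // every gate passed
    history.push(`iteration ${i}: score ${run.score}`);
    const fixes = await llm.proposeFixes(best.workspace, run.errors + "\n" + history.join("\n"));
    // Patch on top of the best workspace so far: rollback on score regression.
    current = { ...best.workspace, ...fixes };
  }
  return { workspace: best.workspace, score: Math.max(best.score, 0) };
}
```

Patching on top of the best workspace, rather than the latest one, is what makes a regression harmless: a bad fix is simply discarded on the next iteration.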
Key capabilities:
- Scenario-as-code: YAML defines the MCP server, tool calls, gates, and agent config, with ${VAR:-default} env var substitution
- Seven verification gates for the Stylus MVP: cargo fmt, clippy, compile, tests, WASM build, Stylus check, cargo audit
- Rich reporting: JSON, JUnit XML (GitLab test reports), and Markdown with iteration details
- GitLab CI pipeline with lint, test, and verify stages, plus a GitLab Duo flow with configurable inputs
- 46 passing unit tests, zero external test frameworks
How we built it
A pnpm monorepo with four TypeScript packages (ESM, Node 22+): schemas, adapters, reporter, runner. The MCP client implements JSON-RPC 2.0 over stdio. The agent loop uses OpenRouter (supporting Claude, GPT-4, DeepSeek) with an injectable LlmClient interface. Gate scripts are standalone shell scripts receiving WORKSPACE_DIR, making them trivially replaceable. Built with native fetch, node:test, and tsc -- zero bundlers, minimal dependencies.
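For the MCP transport, each JSON-RPC 2.0 message is a single line of JSON written to the server's stdin (MCP's stdio framing is newline-delimited). A sketch of the request side; the tool name, arguments, and helper names are illustrative:

```typescript
// Minimal sketch of JSON-RPC 2.0 request framing as an MCP client sends it
// over stdio: one JSON object per newline-terminated line.
// The tool name and arguments below are examples, not a real server's API.
let nextId = 0;

function encodeRequest(method: string, params: unknown): string {
  return JSON.stringify({ jsonrpc: "2.0", id: ++nextId, method, params }) + "\n";
}

interface JsonRpcResponse {
  jsonrpc: "2.0";
  id: number;
  result?: unknown;
  error?: { code: number; message: string };
}

function decodeResponse(line: string): JsonRpcResponse {
  return JSON.parse(line) as JsonRpcResponse;
}

// A tools/call request, as written to the MCP server's stdin:
const req = encodeRequest("tools/call", {
  name: "generate_contract",     // illustrative tool name
  arguments: { kind: "erc20" },  // illustrative arguments
});
```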
Challenges we ran into
- Stylus SDK ecosystem quirks: alloy-consensus breaks on non-WASM targets, requiring fallbacks in gate scripts
- LLM response parsing: MCP servers return code in wildly different formats -- built a three-strategy parser
- Preventing infinite fix loops: Solved with iteration history in prompts and best-workspace rollback on regression
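The three-strategy parser can be sketched roughly as follows; the strategy order and field names here are assumptions, not the exact shipped logic:

```typescript
// Illustrative sketch of a multi-strategy extractor for code returned by
// MCP tools: try a fenced code block, then a JSON field, then fall back to
// treating the whole response as code. Details are assumed for illustration.
function extractCode(response: string): string {
  // Strategy 1: fenced code block (```lang ... ```)
  const fence = response.match(/```[a-zA-Z]*\n([\s\S]*?)```/);
  if (fence) return fence[1].trim();

  // Strategy 2: JSON payload with a code-bearing field
  try {
    const obj = JSON.parse(response);
    for (const key of ["code", "content", "source"]) {
      if (typeof obj[key] === "string") return obj[key].trim();
    }
  } catch {
    // not JSON; fall through
  }

  // Strategy 3: treat the raw response as code
  return response.trim();
}
```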
Accomplishments that we're proud of
- The agent loop genuinely works -- taking a 25/100 failing contract to 100/100 across 3 iterations
- Truly generic architecture -- swap gate scripts for any language/framework
- Zero unnecessary dependencies -- lightweight, auditable, fast
- Full cryptographic provenance chain for every run
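One way to picture a provenance chain like the one above: each run record commits to the hash of the previous record, so altering any step breaks verification from that point on. This is a sketch of the idea using node:crypto, not the platform's actual record schema:

```typescript
import { createHash } from "node:crypto";

// Sketch of a hash-chained provenance log: each entry commits to the
// previous entry's hash, so tampering anywhere invalidates the chain.
// Field names are illustrative, not WhyMe QA's actual format.
interface ProvenanceEntry {
  step: string;        // e.g. "tool_call", "gate:compile", "llm_fix"
  payloadHash: string; // SHA-256 of the step's artifacts
  prevHash: string;    // hash of the previous entry ("" for the first)
  hash: string;        // SHA-256 over step + payloadHash + prevHash
}

function sha256(data: string): string {
  return createHash("sha256").update(data).digest("hex");
}

function appendEntry(chain: ProvenanceEntry[], step: string, payload: string): ProvenanceEntry[] {
  const prevHash = chain.length ? chain[chain.length - 1].hash : "";
  const payloadHash = sha256(payload);
  const hash = sha256(step + payloadHash + prevHash);
  return [...chain, { step, payloadHash, prevHash, hash }];
}

function verifyChain(chain: ProvenanceEntry[]): boolean {
  return chain.every((e, i) => {
    const prevHash = i === 0 ? "" : chain[i - 1].hash;
    return e.prevHash === prevHash && e.hash === sha256(e.step + e.payloadHash + prevHash);
  });
}
```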
What we learned
- AI-generated code quality varies enormously -- and is now measurable with reproducible scores
- The agent loop is more powerful than expected, handling compilation errors, dependency issues, and test failures
- MCP is the right abstraction layer for evaluating complete code generation pipelines
- GitLab Duo flows make CI/CD accessible to non-engineers
What's next for WhyMe QA
- Multi-language gate packs (Python, Solidity, Go, TypeScript)
- Comparative benchmarking across MCP servers and models
- GitLab MR comment integration for automated quality gates
- Community scenario library for standardized benchmarks
Built With
- arbitrum-stylus
- gitlab-ci/cd
- gitlab-duo
- mcp (model-context-protocol)
- node.js
- openrouter
- pnpm
- rust
- shell
- typescript