Vera | Devpost

Inspiration

Verification eats 60–70% of every chip project. Armies of engineers hand-write tests, stare at waveforms, and chase coverage checklists — and a single missed corner case can cost a $100M respin. Meanwhile, the loop itself is mechanical: find the gap → write a test → run it → check it → repeat. That's an agent loop wearing a hard hat. We wanted to prove an AI could close that entire loop — not a copilot suggesting testbench snippets, but an autonomous engineer that decides, tests, catches, and explains on its own.

What it does

Vera is an autonomous verification engineer. Point it at Verilog RTL and it:

Reads the design and identifies which corner cases haven't been exercised (a hand-rolled functional coverage model — six corner cases per design)
Writes its own tests — an LLM generates stimulus as structured JSON, one op per clock cycle
Runs them on a real simulator (Icarus Verilog + cocotb — no mocked hardware)
Judges correctness, not just coverage: every cycle, DUT outputs are compared against a golden Python reference model. Coverage says what we tried; the oracle says whether it was right:

$$\text{coverage} = \frac{\text{points hit}}{6} \times 100\% \quad\text{but}\quad \text{correct} \iff \forall t:\ \text{DUT}(t) = \text{ref}(t)$$

Catches real bugs: we planted an off-by-one (full = (count == DEPTH-1)) in a FIFO. Vera drives into the corner, catches the mismatch, shrinks the failure to a minimal 7-write repro, points at the offending RTL line, and proposes the one-line fix
Gets smarter every run: winning test patterns are saved to memory. Run a second design (a round-robin arbiter) and Vera warm-starts from patterns learned on the FIFO — faster climb, "reused N patterns" badge, zero false positives on the clean design
Explains itself: click any test in the live dashboard for a plain-English explanation; ask the built-in chat anything about the run, grounded in a deep post-run analysis

All streamed live to a mission-control dashboard.
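
The split between coverage and correctness in the list above can be sketched in a few lines of Python. This is a minimal illustration, not Vera's code: the `RefFifo` and `BuggyFifo` classes, the `check` helper, the `"idle"` op, and `DEPTH = 8` are all assumptions (a depth of 8 is what would make the planted `full = (count == DEPTH-1)` bug first visible on the 7th write), and the DUT here is a Python stand-in rather than RTL running on Icarus Verilog.

```python
DEPTH = 8  # assumed depth; the writeup only states the 7-write repro


class RefFifo:
    """Golden reference model: full only when count == DEPTH."""

    def __init__(self):
        self.count = 0

    def full(self):
        return self.count == DEPTH

    def step(self, op):
        # One op per clock cycle: "write", "read", or "idle" (hypothetical names).
        if op == "write" and not self.full():
            self.count += 1
        elif op == "read" and self.count > 0:
            self.count -= 1
        return {"full": self.full(), "empty": self.count == 0}


class BuggyFifo(RefFifo):
    """Python stand-in for the DUT with the planted off-by-one."""

    def full(self):
        return self.count == DEPTH - 1


def coverage_percent(points_hit, total_points=6):
    """Coverage says what was tried, nothing about whether it was right."""
    return 100.0 * len(points_hit) / total_points


def check(ops, dut, ref):
    """Return the first cycle where DUT and reference disagree, else None."""
    for t, op in enumerate(ops):
        if dut.step(op) != ref.step(op):
            return t
    return None
```

Under these assumptions, `check(["write"] * 7, BuggyFifo(), RefFifo())` reports a mismatch on the 7th write (cycle index 6), while six writes pass cleanly. A test can therefore add to coverage without ever tripping the oracle, which is why the two are tracked separately.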

How we built it

Three models, each doing what it's cheapest at, all through Pioneer's single Anthropic-compatible endpoint:

claude-sonnet-4-6 — the fast brain: high-frequency stimulus generation + per-test explanations
claude-fable-5 — the heavy brain: post-run engineering analysis and reasoning
claude-opus-4-8 — interactive Q&A, grounded in Fable 5's analysis + run data + the RTL itself

The stack: Python orchestrator (strategy → generate → run → check → triage agents), Icarus Verilog + cocotb in Docker, FastAPI serving a live state contract, React + Vite dashboard polling at 1 Hz. Tests are pure data (JSON op lists), so all designs share one stimulus vocabulary — write→req0, read→req1 on the arbiter — which is what makes cross-design pattern reuse honest instead of hand-waved. Generated tests that fail to run get their error fed back to the model (max 3 retries), so the loop never stalls. One ./run.sh, or one Docker container deployed on Render.
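
As a rough sketch of what "tests are pure data" could look like: the JSON shape, the `"idle"` op, and the helper names `load_ops` and `to_arbiter_signals` are hypothetical. Only the JSON op-list idea and the write→req0, read→req1 mapping come from the description above.

```python
import json

# Hypothetical test format: one op per clock cycle.
TEST_JSON = '[{"op": "write"}, {"op": "write"}, {"op": "read"}, {"op": "idle"}]'

# Mapping stated in the writeup: write -> req0, read -> req1.
ARBITER_MAP = {"write": "req0", "read": "req1"}


def load_ops(text):
    """Parse and validate a JSON op list; raise ValueError on unknown ops."""
    ops = [step["op"] for step in json.loads(text)]
    unknown = set(ops) - {"write", "read", "idle"}
    if unknown:
        raise ValueError(f"unknown ops: {sorted(unknown)}")
    return ops


def to_arbiter_signals(ops):
    """Turn each op into per-cycle request-line values for the arbiter."""
    cycles = []
    for op in ops:
        signals = {"req0": 0, "req1": 0}
        if op in ARBITER_MAP:
            signals[ARBITER_MAP[op]] = 1
        cycles.append(signals)
    return cycles
```

Because the same op list can drive either design, a pattern saved from a FIFO run can be replayed on the arbiter without being rewritten. Rejecting unknown ops up front also gives the retry loop a concrete error message to feed back to the model.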

Challenges we ran into

The stale-binary lie. Our buggy FIFO passed with zero mismatches — impossible. make was silently reusing the previous DUT's compiled simulation binary. Per-DUT build dirs fixed it, and it was a visceral lesson in why verification infrastructure must itself be verified.
The AI was too good. Given the full uncovered list, the model one-shotted a 42-cycle test that took coverage from 17% to 100% instantly: impressive, but useless for understanding which test closed which coverage point. We rearchitected so the strategy agent picks one target and the generator writes a minimal, surgical test for exactly it.
Toolchain hell: host Python was too new for cocotb, no simulator installed — solved by shipping the entire sim stack (Icarus + cocotb + engine) in one container.
Honesty engineering: keeping coverage ("exercised") strictly separate from the checker ("correct") so the demo never overclaims.
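
The per-DUT build directory fix might look something like the sketch below. The helper names and directory layout are assumptions, not the project's actual code. `SIM_BUILD` is the variable cocotb's makefiles use for the simulator's scratch directory, and the sketch assumes the project's Makefile passes it through.

```python
from pathlib import Path


def sim_build_dir(dut_name, root="sim_build"):
    """Give every DUT its own build directory so binaries are never shared."""
    return (Path(root) / dut_name).as_posix()


def make_command(dut_name):
    """Build the make invocation with a DUT-specific SIM_BUILD directory."""
    return ["make", f"SIM_BUILD={sim_build_dir(dut_name)}"]
```

With this layout, a buggy and a clean FIFO (say `fifo_buggy` and `fifo_clean`, both hypothetical names) compile into different directories, so a stale binary from one can never be mistaken for the other.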

Accomplishments that we're proud of

A fully closed agent loop on a real simulator — not a mockup: AI-written tests climb 0→100% coverage live
The bug hunt works end-to-end: caught → shrunk from 19 ops to the true minimal 7 writes → root-caused → one-line fix diff, all autonomous
The flywheel is real: design #2 verifiably warm-starts from design #1's learned patterns, and drops them the moment they stop paying
Zero false positives on the clean arbiter — Vera doesn't cry wolf
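
The shrinking step can be illustrated with a greedy one-op-at-a-time reducer, a simple relative of delta debugging. This is a sketch under assumptions, not Vera's actual shrinker: the `fails` predicate here is a Python stand-in for "the simulator reported a mismatch", the depth of 8 is inferred from the 7-write repro, and the 19-op `noisy` test is invented for illustration.

```python
def shrink(ops, fails):
    """Greedily drop ops one at a time while the failure still reproduces."""
    assert fails(ops), "shrink needs a failing test to start from"
    current = list(ops)
    changed = True
    while changed:
        changed = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if fails(candidate):
                current = candidate
                changed = True
                break
    return current


def hits_bug(ops, depth=8):
    """Stand-in oracle: the off-by-one shows once occupancy reaches depth - 1."""
    count = 0
    for op in ops:
        if op == "write" and count < depth:
            count += 1
        elif op == "read" and count > 0:
            count -= 1
        if count == depth - 1:
            return True
    return False


# Hypothetical 19-op failing test: 13 writes, 3 reads, 3 idles.
noisy = ["write", "write", "read", "idle", "write", "write", "write",
         "read", "write", "write", "idle", "write", "write", "write",
         "read", "write", "idle", "write", "write"]

minimal = shrink(noisy, hits_bug)
```

Here `minimal` ends up as exactly seven `"write"` ops: removing a read or an idle never lowers occupancy, so every such op can be dropped, and fewer than seven writes cannot reach the failing state. In a real run each call to `fails` would be a simulation, so the number of candidates tried matters more than it does in this toy.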

What we learned

The oracle is the hard part. Generating tests is easy; knowing what the answer should have been is the entire game
Cheap-model/expensive-model splits aren't cost optimization theater — they change what's economically possible to run in a loop
Agents need fallbacks at every layer (retry → seed tests → demo replay) or live demos die
Build infrastructure lies to you; verify the verifier
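
The retry → seed tests → demo replay chain could be structured roughly as follows. All function and parameter names are hypothetical, and "max 3 retries" is read here as three attempts in total, which is an assumption.

```python
def generate_with_fallbacks(generate, run, seed_tests, replay, max_retries=3):
    """Try the model up to max_retries times, then seed tests, then replay."""
    error = None
    for _ in range(max_retries):
        test = generate(error)  # the previous error is fed back to the model
        try:
            return "llm", run(test)
        except Exception as exc:  # the test failed to run (not a DUT mismatch)
            error = str(exc)
    for test in seed_tests:
        try:
            return "seed", run(test)
        except Exception:
            continue
    return "replay", replay()
```

Each layer returns a label alongside the result, so a dashboard could show whether a given test came from the model, a hand-written seed, or a recorded run. Catching bare `Exception` keeps the sketch short; real code would likely want narrower exception types.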

What's next for Vera

Grounded specs: pipe design documents into a verified knowledge base so "expected behavior" comes with a citation trail, not vibes
Auto-PR the fix: Vera files the bug with repro + patch as a pull request
Bigger designs (CPU pipelines), constrained-random stimulus, and adaptive fine-tuning of the fast brain on its own failure data
Same loop, commercial simulators — the tools companies already pay millions for. Today a FIFO; the wedge is the loop.

Built With

Python, Verilog, JSON
AI / APIs: Pioneer.ai inference platform (Anthropic-compatible API, single key) serving three models: Anthropic Claude Sonnet 4.6 (stimulus generation + test explanations), Claude Fable 5 (deep run analysis), Claude Opus 4.8 (grounded Q&A chat); Anthropic Python SDK
Backend: FastAPI, Uvicorn, Python 3.12, Python threading (background agent loop), SQL-free (state is JSON contracts)
Hardware/simulation: Icarus Verilog (open-source simulator), cocotb (Python testbench framework), GNU Make (simulation orchestration)
Frontend: React 19, Vite, JavaScript (JSX), Recharts (coverage chart), Lucide icons, CSS-in-JS with custom keyframe animations
Platforms / infra: Docker (multi-stage build, full sim toolchain containerized), Colima (local Docker runtime on macOS), Render (cloud deployment via render.yaml blueprint), GitHub
Testing: Playwright (automated UI verification)

Updates

Arihant Kaul started this project — Jun 12, 2026 07:28 PM EDT
