Inspiration
A pattern I've noticed for people using LLMs for RTL generation is that hosted credit models are great until they hit the ceilings for usage. Then, the fallback is open models running in your own GPUs, which are noticeably weaker at RTL specifically. Verilog is fairly underrepresented in training data, so models would constantly produce circuits that look right but are actually wrong. The constraint was to optimize a local open source model to a point where it can produce RTL at a reasonably equal level. And it lined up with the build the machine track where we optimize open source models to their limits. Can we take a deliberately weaker model compared to frontier ones like opus 4.8, but spend compute at inference to make it produce the correct hardware descriptive language anyway? Hardware sounded perfect to test because the Verilog either matches the simulation under test, or it doesn't (binary, so easy to deduce accuracy).
What it does
Anvil takes a weak open source LLM and optimizes it in such a way that pushes accuracy on VerilogEval from 33% to 71% on its hard set (which includes 21 sequential circuits like FSMs, shift registers, edge detectors) without making it a frontier model. Given an initial hardware spec, it generates N candidate implementations in parallel, compiles and simulates every implementation against the golden test bench (the oracle) and ranks them. if none pass, it takes the best scoring candidate, feeds it specific failing test vectors (concrete expected vs actual mismatches cycle by cycle) and back into the repair prompt until the circuit passes or the loop runs out. The model generates its output and the simulator decides whether it's right.
How we built it
The core idea on evaluating it is a harness around Icarus verilog that compiles and simulates each candidate, parses the verdict (whether its a pass/compile-error/functional-fail/timeout), and extracts the first failing test vectors as concrete (cycle (time), expected, actual) tuples. The feedback is the entire lever. Around that sits the inference time compute loop in python - scoring against the testbench, best-of-N selection, and a repair stage that feeds failing vectors back to the model. The whole thing is provider agnostic as it used Claude Haiku for the main study and used open source Qwen-2.5-7B parameters on 8xA100s. The exact same scripts drive both models.
Challenges we ran into
Ironically, one of the hardest parts was setting up the Qwen model in our local supercluster node. There were dozen incompatibilities like NumPy/SciPy version conflicts, an outdated driver, a CUDA mismatch in torchaudio, rope-scaling config errors specific to the model, a tokenizer that broke under a too-new transformers version, and zombie processes holding 70GB of VRAM. We eventually had the vLLM serve the model but then it ran into countless 500s, on both endpoints, in eager mode, at fp16. After giving up and concluding that it was a rabbit hole, I wrote a custom FastAPI inference server using the plain transformers.generate() that exposed the Open AI compatible endpoints - which thankfully worked on the first try. The other challenge was keeping track of numbers for intellectual honesty - a real database checkup.
Accomplishments that we're proud of
24 points at equal compute (all proof inside GitHub scripts and raw script log results including the scripts itself), Sampling plateaus around 48%, and past 8 samples, drawing more does nothing. The verify repair agent reaches 71% on the same budget. The two strategies used solved different problems. The 5 problems that the sampling could never solve, the repair agent could as once the model sees the failure behavior after testing it with known RTL test benches, it can reason about the fix. The sharpest finding definitely came from ablation - not more feedback that helps but concrete feedback that specifically tells what cycle of the testbench it failed at and the expected output over there. “Expected 0 instead of 1 in cycle 45” would genuinely help the model find the mismatch. And plus, it generalizes, open source Qwen-2.5 coder topped out at 60% on pure N-sampling, but the agent fixes helped it climb to 80%!! Frontier purely verilog trained models like ChipAgents reach pass rates of around 97%. But, the point stands that it’s not just a Haiku artifact but a loop that be replicated on other weak open source LLMs as well.
What we learned
AI-generated code is due to verification-based generation. The use of a testbench isn’t a limitation – it’s how the results become true. If an RTL generator has no basis for comparison, then it will simply generate speculative statements based on nothing; using a testbench is what provides the results’ validity. We’ve now seen empirically as well as abstractly that you can get so much more out of a bad model by providing it with very tangible, directly observable evidence of exactly where it failed. On the system level: deploying publically available models is quite difficult because they fail at many levels below their surface.
What's next for Anvil
A couple of directions: expand the oracle to be able to handle larger, more complex designs (multi-modules), as well as the constraints in real world synthesis vs. just simulating behavior. Test the loop over a number of open models to see where it has the greatest benefit. Continue toward a process that is designed by specification/behavioral model, with acceptance testing, and Anvil will generate valid RTL based on those specifications -- test driven development of hardware. Tighten up the repair mechanism: better select which failing vectors are surfaced to the user, and determine when to use sampling and when to spend all of the remaining budget on repair.
Built With
- 8xa100s
- accelerate
- cuda
- fastapi
- fastapi-+-uvicorn-(the-serving-layer)-hardware-simulation-(the-oracle):-icarus-verilog-(iverilog-/-vvp)-for-compile-+-simulate-benchmark:-nvidia-verilogeval-(spec-to-rtl
- git
- git/github-(version-control)
- haiku-4.5
- hard-set)-compute-/-cloud:-prime-intellect-(8x-a100-80gb-node)
- hugging-face-transformers
- icarus-verilog
- lambda-cloud
- matplotlib
- matplotlib-(the-result-charts)
- nvidia-verilogeval
- openai-compatible-api-(for-the-local-model-endpoint)-tooling-/-libraries:-threadpoolexecutor-(parallel-sampling-+-simulation)
- python
- pytorch
- qwen-2.5-coder-7b-parameters
- qwen2.5-coder-7b-instruct-(open-model-generalization)-ml-serving-/-inference:-hugging-face-transformers-(custom-openai-compatible-inference-server)
- systemverilog
- systemverilog-/-verilog-(the-generated-artifacts-+-golden-testbenches)-models:-claude-haiku-4.5-(baseline-/-main-study
- threadpoolexecutor
- ubuntu
- ubuntu-apis:-anthropic-api
- uvicorn
- verilog
- via-anthropic-api)
- vllm
- vllm-(attempted)
Log in or sign up for Devpost to join the conversation.