WireBlock
Inspiration
DeFi exploits drain hundreds of millions of dollars every year, yet post-incident analysis still relies on a handful of security researchers manually stepping through EVM traces in Tenderly or raw debug_traceTransaction output. After watching the Euler Finance hack unfold in March 2023 --- \$197M gone in a single transaction --- we were struck by how long it took the community to produce a clear, causal explanation of why the exploit worked. The information was all on-chain, buried inside 89,000 EVM opcodes. We wanted to build a system that could go from a transaction hash to a verified forensic verdict automatically: not just what happened, but which specific factors were necessary for the attack to succeed.
What It Does
WireBlock is an automated forensic analysis pipeline for DeFi exploit transactions. Given a transaction hash, it:
- Acquires the full EVM struct-log trace via debug_traceTransaction and fetches verified source code from Etherscan.
- Lifts raw opcodes (CALL, STATICCALL, DELEGATECALL, SSTORE, SELFDESTRUCT, CREATE2) into a semantic intermediate representation (IR): flash loan borrows, DEX swaps, oracle reads, token transfers, liquidations, and other high-level actions.
- Classifies the exploit technique by feeding the IR action sequence to an LLM (GPT-4o), which returns a ranked hypothesis with causal chain annotations.
- Verifies the hypothesis through deterministic predicate checks (did the attacker profit? was a flash loan used? was an oracle sandwiched between swaps?) and causal ablation testing --- forking the chain with Anvil and replaying the transaction with individual causal factors removed.
- Produces a final VERIFIED / REFUTED / INCONCLUSIVE verdict with confidence scoring, a full reasoning trace, and an HTML forensic report with embedded Mermaid diagrams.
We tested it against four real-world exploits --- Euler Finance (\$197M flash loan + donation attack), Harvest Finance (\$34M oracle manipulation), Cream Finance, and Beanstalk --- and it correctly identified the primary technique and verified causality in each case.
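To make the deterministic predicate checks more concrete, here is a minimal sketch of the "oracle sandwiched between swaps" check over an IR action sequence. The Action type, its kind labels, and the function name are hypothetical and chosen for illustration; the actual IR schema is not shown in this write-up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """Hypothetical IR action: a semantic label plus the contract it touched."""
    kind: str    # assumed labels, e.g. "flash_loan", "swap", "oracle_read"
    target: str  # contract address involved in the action


def oracle_sandwiched(actions: list[Action]) -> bool:
    """Return True if any oracle read has at least one swap before and after it."""
    kinds = [a.kind for a in actions]
    for i, kind in enumerate(kinds):
        if kind == "oracle_read" and "swap" in kinds[:i] and "swap" in kinds[i + 1:]:
            return True
    return False
```

A check like this needs no LLM call, so it gives the same answer on every run for a given action sequence.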
How We Built It
The pipeline is written in Python and structured around a 10-step orchestrator (ForensicPipeline.run()):
- Acquisition layer (trace_fetcher, etherscan_client, fork_manager): JSON-RPC calls to archive nodes (QuickNode) for traces, Etherscan V2 API for source/ABI, and Anvil subprocess management for fork-based replay.
- IR layer (patterns.py, lifter.py): A PatternMatcher maps EVM opcodes to semantic actions using known function selectors (e.g., 0xab9c4b5d $\rightarrow$ flashLoan, 0x022c0d9f $\rightarrow$ Uniswap V2 swap). Call stack tracking resolves from_addr at every depth by scanning backwards through the trace for the CALL opcode that entered the current execution context.
- Classification (classifier.py): The IR action sequence, frequency distribution, and a technique taxonomy are sent to GPT-4o in a structured JSON prompt. The model returns a primary technique, confidence score, causal chain, and alternative hypotheses.
- Verification (predicates.py, causal.py, verdict.py): Seven deterministic predicate checks run without LLM involvement. The VerdictEngine combines predicate scores (60\% weight) and ablation outcomes (40\% weight) into a final confidence value, thresholded at $\geq 0.8$ for VERIFIED and $\leq 0.2$ for REFUTED.
- Reporting (visualizer.py, render.py): Mermaid flowcharts and sequence diagrams are generated from the IR graph, annotated with vulnerability descriptions and security fixes. An HTML report is rendered via Jinja2.
- State diff (state_diff.py): Snapshots ETH and ERC-20 balances at block $n-1$ and block $n$ to compute attacker profit and victim losses, including contract creation/destruction detection.
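The backwards scan used for call stack tracking can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in lifter.py: it assumes geth-style struct logs with op, depth, and stack fields, with the top of the stack as the last list element and the top-level frame at depth 1. The function names are hypothetical, and contexts entered via CREATE/CREATE2 are left unhandled.

```python
CALL_OPS = {"CALL", "STATICCALL", "DELEGATECALL", "CALLCODE"}


def word_to_address(word: str) -> str:
    """Take the low 20 bytes of a stack word, with or without a 0x prefix."""
    digits = word.removeprefix("0x").rjust(40, "0")
    return "0x" + digits[-40:].lower()


def executing_address(logs: list[dict], index: int, tx_to: str) -> str:
    """Resolve which address is executing at logs[index]."""
    depth = logs[index]["depth"]
    if depth == 1:
        # Assumption: the top-level frame runs as the transaction's `to` address.
        return tx_to
    for j in range(index - 1, -1, -1):
        entry = logs[j]
        if entry["depth"] != depth - 1:
            continue  # still inside the current frame or a deeper one
        # The nearest earlier entry one level up is the op that opened this frame.
        if entry["op"] not in CALL_OPS:
            raise ValueError(f"frame opened by unsupported op {entry['op']}")
        if entry["op"] in {"DELEGATECALL", "CALLCODE"}:
            # Borrowed code runs in the caller's context, so resolve the caller.
            return executing_address(logs, j, tx_to)
        # For CALL and STATICCALL the stack is [..., address, gas], gas on top.
        return word_to_address(entry["stack"][-2])
    raise ValueError("no enclosing call found for this frame")
```

Each lookup rescans the trace, which is quadratic in the worst case; a single forward pass that maintains an explicit address stack would be the obvious optimization for traces with tens of thousands of opcodes.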
The test suite has 346 unit and integration tests covering every pipeline stage.
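The VerdictEngine weighting and thresholds described above can be illustrated with a short sketch. It assumes both inputs are already normalized to the range 0 to 1; how the seven predicate results and the ablation outcomes are reduced to single scores is not specified here, and the function name is hypothetical.

```python
PREDICATE_WEIGHT = 0.6  # weight stated for predicate scores
ABLATION_WEIGHT = 0.4   # weight stated for ablation outcomes


def combine_verdict(predicate_score: float, ablation_score: float) -> tuple[str, float]:
    """Blend two scores in [0, 1] and map the result onto a verdict label."""
    for name, value in (("predicate_score", predicate_score),
                        ("ablation_score", ablation_score)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    confidence = PREDICATE_WEIGHT * predicate_score + ABLATION_WEIGHT * ablation_score
    if confidence >= 0.8:
        return "VERIFIED", confidence
    if confidence <= 0.2:
        return "REFUTED", confidence
    return "INCONCLUSIVE", confidence
```

With these weights, perfect predicate scores alone top out at 0.6, so a VERIFIED verdict always needs some support from ablation.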
Challenges We Ran Into
- Memory format inconsistency: QuickNode returns EVM memory words with 0x prefixes while other nodes return raw hex. Our selector extraction was joining memory words without stripping prefixes, causing every function selector to come back as 0x00000000. A one-line fix (.removeprefix("0x")) resolved it, but it took hours to trace because the pipeline silently fell back to "unknown action" for every CALL.
- Fork manager infinite loop: Anvil accepts TCP connections before it finishes syncing state from the remote RPC. Our readiness check (eth_chainId) hung inside httpx.post with no timeout, so the deadline was never re-evaluated. Adding timeout=10.0 to the HTTP call and catching TimeoutException fixed the loop.
- Call stack resolution: The EVM struct-log is a flat list of opcodes with a depth field but no explicit caller. Resolving which address is executing at depth $d$ requires scanning backwards for the CALL at depth $d-1$ and reading the target address from its stack. DELEGATECALL adds complexity because the executing contract differs from the code source. Getting this right was essential for the classifier to see that the attacker EOA initiated the flash loan borrow.
- LLM classification consistency: Even at temperature 0, GPT-4o produces different classifications across runs for the same IR. The Euler exploit legitimately spans multiple categories (flash loan, donation, logic bug). We addressed this with a technique taxonomy in the prompt and deterministic predicate verification as a guardrail, so the final verdict is stable even when the LLM's primary hypothesis shifts.
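The memory-format fix from the first item can be illustrated with a small sketch. It assumes memory arrives as a list of 32-byte hex words and that args_offset is the byte offset of the call data in memory (taken from the CALL's stack arguments); the function name is hypothetical.

```python
def extract_selector(memory_words: list[str], args_offset: int) -> str:
    """Return the 4-byte function selector found at args_offset in EVM memory."""
    # Normalize each word first: some nodes prefix words with 0x, others do not.
    flat = "".join(word.removeprefix("0x") for word in memory_words)
    start = args_offset * 2  # two hex characters per byte
    selector = flat[start:start + 8]
    if len(selector) < 8:
        raise ValueError("call data shorter than 4 bytes; no selector present")
    return "0x" + selector.lower()
```

Raising on short call data, rather than returning a placeholder, avoids the kind of silent fallback that made the original bug slow to find.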
What We Learned
- On-chain data is incredibly rich but brutally unstructured. The gap between "the information exists" and "the information is actionable" is where most forensic time is spent.
- Causal verification matters more than classification. Knowing that it was a flash loan attack is less useful than proving the flash loan was necessary --- which is what ablation testing provides.
- LLMs are good at pattern recognition over structured action sequences but unreliable as sole classifiers. The hybrid approach (LLM hypothesis + deterministic predicates + causal ablation) gives much more trustworthy results than any single method.
What's Next
- Implement the full ablation engine (Anvil fork $\rightarrow$ counterfactual state patch $\rightarrow$ replay $\rightarrow$ compare outcome).
- Add calldata decoding using verified ABIs from Etherscan so the IR captures function names and decoded arguments, not just selectors.
- Expand the selector registry and add protocol-specific patterns (Euler's donateToReserves, Compound's liquidateBorrow, etc.) so the IR captures exploit-critical calls that are currently invisible.
- Build a CLI and web interface for interactive forensic exploration.