Hotpath
Hotpath turns your app's production runtime into a live heatmap of what actually runs, then uses that signal to verify any code change preserved behavior — surfacing a minimal failing input when it didn't.
Inspiration
When we started this hackathon, every "AI code review" tool we could name — Greptile, CodeRabbit, Qodo, Cursor Bugbot, Devin Review — was, structurally, an LLM second-guessing another LLM's diff. They read the patch, run a vibes check, and ship a verdict. None of them actually run the code. None of them know whether process_payment is called 12,000 times a minute or once a quarter. None of them can prove that a refactor preserved behavior — they can only opine.
That gap is fatal because AI coding agents are getting confident. Devin will rewrite a hot path, Cursor will refactor a class, Claude will hand you a diff that passes type-checking and looks beautiful — and silently changes behavior on an edge case nobody was thinking about. The reviewer reads the diff, nods, and merges. The first counterexample shows up in a Sentry alert at 3 AM.
We wanted to build the layer underneath all of those tools — the one that actually answers, in seconds:
"Did this refactor change user-observable behavior? If yes, here's the input that breaks."
And — critically — to do it with compute proportional to how much the function actually matters in production. A refactor of process_payment deserves 30 seconds of symbolic execution and 500 fuzz inputs. A refactor of legacy_export_csv deserves 2 seconds and a sanity check. Spending the same compute on both is what makes existing verification stacks unusable on real codebases.
That insight — runtime hotness as a budget allocator for symbolic compute — is Hotpath.
What It Does
Hotpath is three connected pieces, all running on a single laptop:
- A live heatmap of your app's actual production behavior — every function colored red→gray by runtime sample count, with edges showing the call graph, refreshing in real time as Locust drives load against the demo service.
- A behavioral verifier that takes a `(before.py, after.py, function_name)` triple and runs differential fuzzing + symbolic execution + property-based testing, with the budget scaled by the function's blast radius. On divergence, it shrinks the failing input to a minimum (often 3–5 fields) using delta-debugging (see the sketch at the end of this section).
- An MCP server exposing 9 tools so any AI coding agent — Devin, Cursor, Claude Code — can plug in:
| Tool | Purpose |
|---|---|
| `get_hotspots` | Functions ranked by blast radius, using real hotness data + call-graph |
| `get_risk_profile` | Hotness + static fanout + blast radius for a single function |
| `get_review_priority` | Score for a PR — blends blast radius with size signals |
| `find_callers` | Transitive callers from the static call-graph (BFS, capped) |
| `get_function_source` | Source text — saves agents a grep round-trip |
| `verify_refactor` | Async equivalence check on a before/after pair |
| `verify_pr` | Whole-PR verification — resolves base/head from git |
| `pr_triage` | Ranks a repo's open PRs by review priority |
| `accept_counterexample` | 3-button elicitation form (revert / accept / autofix) |
Plus a paste-link feature: drop any GitHub URL into the heatmap UI and Hotpath clones it, builds the static call-graph, and renders the blast radius even with no runtime data.
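When the verifier does find a divergence, the shrink step is classic delta-debugging over the failing input's fields. A greedy sketch of the idea (`diverges` and the dict-of-fields shape are assumptions for illustration, not Hotpath's actual internals):

```python
def shrink_failing_input(fields: dict, diverges) -> dict:
    """Greedy delta-debugging: drop chunks of fields for as long as the
    before/after divergence still reproduces, then refine the chunk size.

    `diverges(candidate)` is assumed to re-run both versions of the
    function on `candidate` and return True if their outputs differ.
    """
    keys = list(fields)
    chunk = max(1, len(keys) // 2)
    while chunk >= 1:
        for i in range(0, len(keys), chunk):
            kept = keys[:i] + keys[i + chunk:]
            if kept and diverges({k: fields[k] for k in kept}):
                keys = kept      # divergence survives without this chunk
                break            # rescan at the same granularity
        else:
            if chunk == 1:
                break            # nothing left to drop field-by-field
            chunk //= 2          # refine granularity, ddmin-style
    return {k: fields[k] for k in keys}
```

On a typical counterexample this converges to the 3–5 fields mentioned above.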
The Math
The core idea is a single equation. For every function $f$ in the call graph, we compute a blast-radius score $r(f) \in [0, 1]$ that blends runtime hotness with static fan-out over the call graph.
That score then drives the verification budget allocator:
$$B(f) = B_{\min} + (B_{\max} - B_{\min}) \cdot r(f)$$
Concrete thresholds:
- $r < 0.3$ (cold): 2s budget, 20 fuzz inputs, no symbolic execution.
- $0.3 \le r \le 0.7$ (warm): linear interpolation of resources between the cold and hot endpoints.
- $r > 0.7$ (hot): 30s budget, 500 fuzz inputs, symbolic execution enabled.
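Mapped to code, the allocator is just that clamp-and-interpolate. A sketch (constant and type names are illustrative; the real coefficients live in `hotpath-scoring`):

```python
from dataclasses import dataclass

B_MIN_S, B_MAX_S = 2.0, 30.0   # wall-clock endpoints from the thresholds above
FUZZ_MIN, FUZZ_MAX = 20, 500   # fuzz-input endpoints

@dataclass
class Budget:
    wall_clock_s: float
    fuzz_inputs: int
    symbolic: bool

def budget_for(r: float) -> Budget:
    """B(f) = B_min + (B_max - B_min) * r(f), with the cold/hot cutoffs."""
    r = min(max(r, 0.0), 1.0)
    if r < 0.3:                  # cold: quick sanity check, no symbolic pass
        return Budget(B_MIN_S, FUZZ_MIN, symbolic=False)
    if r > 0.7:                  # hot: full budget, symbolic execution on
        return Budget(B_MAX_S, FUZZ_MAX, symbolic=True)
    # warm: interpolate linearly; symbolic execution stays reserved for hot paths
    return Budget(
        B_MIN_S + (B_MAX_S - B_MIN_S) * r,
        round(FUZZ_MIN + (FUZZ_MAX - FUZZ_MIN) * r),
        symbolic=False,
    )
```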
How We Built It
Architecture
A 9-package uv workspace monorepo, glued together by two contracts:
- A `RepoContext` dataclass passed between every package.
- A file-bus at `.runtime/hotness.json` that the profiler writes and everyone else reads.
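A sketch of what those two contracts look like together (field and method names are guesses at the shape, not the actual `hotpath-types` definitions):

```python
import json
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class RepoContext:
    """Shared handle every package receives; illustrative fields only."""
    repo_root: Path
    runtime_dir: Path = field(init=False)

    def __post_init__(self) -> None:
        self.runtime_dir = self.repo_root / ".runtime"

    def read_hotness(self) -> dict[str, int]:
        """Read the file-bus the profiler writes: qualified function
        name -> sample count. Empty when nothing has been profiled yet."""
        path = self.runtime_dir / "hotness.json"
        return json.loads(path.read_text()) if path.exists() else {}
```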
The packages, in dependency order:
- `hotpath-types`: Shared dataclasses.
- `hotpath-profiler`: `py-spy` wrapper.
- `hotpath-callgraph`: `code2flow` + AST walker.
- `hotpath-scoring`: Blast-radius math & priority logic.
- `hotpath-verifier`: `CrossHair` + `Hypothesis` + ddmin.
- `hotpath-mcp`: MCP server for AI agents.
- `hotpath-autofix`: `claude-agent-sdk` loop for patching.
- `apps/demo-service`: FastAPI app with planted bugs.
- `apps/heatmap-backend`: SSE-based FastAPI server.
- `apps/heatmap`: Vite + React 19 + `visx` frontend.
Tech Stack
| Layer | Choice |
|---|---|
| Package Manager | uv 0.11.7 (10x faster than Poetry) |
| Symbolic Execution | crosshair-tool 0.0.103 |
| Property Testing | hypothesis 6.152.2 |
| Profiler | py-spy 0.4.2 (Rust-based, ~1% CPU overhead) |
| Graph Storage | networkx 3.4.x |
| Frontend | React 19 + Vite 8 |
| Visualization | @xyflow/react + visx |
| Agent Framework | claude-agent-sdk 0.1.67 |
Challenges We Ran Into
1. py-spy needs sudo on macOS
macOS's System Integrity Protection blocks the ptrace-equivalent calls py-spy relies on, so profiling requires sudo.
- Fix: Created a dual-mode profiler. If sudo isn't available, we inject a `sys.setprofile` hook inside the uvicorn worker.
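The fallback hook, sketched (the real profiler batches and throttles its writes; names here are illustrative):

```python
import json
import sys
from collections import Counter
from pathlib import Path

CALL_COUNTS: Counter[str] = Counter()

def _hook(frame, event, arg):
    # Count Python-level calls as a stand-in for py-spy's samples
    # when we can't attach from outside the process.
    if event == "call":
        code = frame.f_code
        CALL_COUNTS[f"{code.co_filename}:{code.co_name}"] += 1

def install_fallback_profiler() -> None:
    """Dual-mode fallback: used inside the uvicorn worker when sudo
    (and therefore py-spy) is unavailable."""
    sys.setprofile(_hook)

def flush_hotness(runtime_dir: Path) -> None:
    """Write the same .runtime/hotness.json file-bus the py-spy path produces."""
    runtime_dir.mkdir(exist_ok=True)
    (runtime_dir / "hotness.json").write_text(json.dumps(dict(CALL_COUNTS)))
```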
2. CrossHair path explosion
Z3 can get stuck in native code without yielding to the Python interpreter, so ordinary in-process timeouts never fire.
- Fix: Every symbolic run spawns a `multiprocessing.Process`. We use a hard `process.kill()` after the wall-clock budget expires.
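The watchdog pattern, sketched (`run_with_hard_timeout` is an illustrative name; the real verifier also ships results back over a pipe):

```python
import multiprocessing as mp

def run_with_hard_timeout(target, budget_s: float, *args) -> bool:
    """Run `target` in a child process and hard-kill it when the
    wall-clock budget expires. Returns True if it finished in time.

    SIGKILL via Process.kill() works even when Z3 is spinning in
    native code and would never honor an in-process timeout.
    """
    proc = mp.Process(target=target, args=args, daemon=True)
    proc.start()
    proc.join(timeout=budget_s)
    if proc.is_alive():
        proc.kill()   # no cleanup, but guaranteed to stop the solver
        proc.join()
        return False
    return True
```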
3. The 4MB MCP payload limit
Sending entire source files over MCP blew past the ceiling.
- Fix: Agents now pass filesystem path references (`before_path`, `after_path`), since Devin and Claude Code have direct FS access.
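A `verify_refactor` call therefore carries paths, not payloads, and the server reads from disk on its side. Sketched with the MCP wiring omitted (the handler shape here is illustrative):

```python
from pathlib import Path

def handle_verify_refactor(before_path: str, after_path: str, function_name: str) -> dict:
    """MCP tool handler: resolve path references server-side.

    Path references keep the payload far under the 4 MB ceiling; the
    agent and the server share a filesystem, so source bytes never
    cross the MCP transport.
    """
    job = {
        "function": function_name,
        "before_src": Path(before_path).read_text(),
        "after_src": Path(after_path).read_text(),
    }
    # ...hand `job` to the async verifier and return a job id (omitted).
    return job
```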
4. Graph scalability
The initial heatmap was unusable on large repos like pallets/click.
- Fix: Layered visualization. We cap the overview at the 12 hottest nodes and use an adaptive radial layout to prevent label stacking.
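The capping step is cheap with networkx (a sketch, assuming node IDs match the hotness keys):

```python
import networkx as nx

def overview_subgraph(graph: nx.DiGraph, hotness: dict[str, int], cap: int = 12) -> nx.DiGraph:
    """Keep only the `cap` hottest functions for the overview layer.

    Nodes absent from the hotness map (never sampled) rank as zero,
    so purely static nodes drop out of the overview first.
    """
    hottest = sorted(graph.nodes, key=lambda n: hotness.get(n, 0), reverse=True)[:cap]
    return graph.subgraph(hottest).copy()
```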
What We Learned
- Verification is a Budget Problem: Behavioral verification is only expensive if you apply it blindly. Runtime profiling provides the "budget" that makes symbolic execution viable in a developer's daily workflow.
- Hybrid Analysis is King: Sampling profilers + static analysis together produce a "blast radius" signal far more useful than either provides in isolation.
- MCP is the Right Abstraction: Building as an MCP server rather than a CLI tool allowed us to integrate with Devin, Cursor, and Claude Code simultaneously with zero integration overhead.
- Determinism is a Feature: In a verifier, two runs must produce byte-identical counterexamples. Pinning seeds across Z3, Hypothesis, and Python's hash randomization was crucial for credibility.
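Seed pinning, sketched (assuming Hypothesis's settings-profile API and Z3's global `set_param`; the real code threads the seed through per-run configs):

```python
import os
import sys

import z3
from hypothesis import settings

def pin_determinism(seed: int = 0) -> None:
    """Make two verifier runs produce byte-identical counterexamples."""
    # Hash randomization is fixed at interpreter startup, so re-exec
    # once if the parent process forgot to pin it.
    if os.environ.get("PYTHONHASHSEED") != str(seed):
        os.environ["PYTHONHASHSEED"] = str(seed)
        os.execv(sys.executable, [sys.executable, *sys.argv])

    # Hypothesis: derandomize makes example generation reproducible.
    settings.register_profile("verifier", derandomize=True)
    settings.load_profile("verifier")

    # Z3 (what CrossHair drives under the hood): fix the solver seed.
    z3.set_param("smt.random_seed", seed)
```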
What's Next
- Multi-language Support: Expanding the architecture to Go and Rust.
- Sandboxed Runtime Profiling: For cloned repos, run the test suite under `py-spy` automatically to generate "production-shaped" hotness data.
- Bayesian Weighting: Using historical bug locations to automatically tune our blast-radius coefficients.
- Devin Marketplace: Listing Hotpath as the default verification layer for Devin installations.