Hotpath

Hotpath turns your app's production runtime into a live heatmap of what actually runs, then uses that signal to verify any code change preserved behavior — surfacing a minimal failing input when it didn't.


Inspiration

When we started this hackathon, every "AI code review" tool we could name — Greptile, CodeRabbit, Qodo, Cursor Bugbot, Devin Review — was, structurally, an LLM second-guessing another LLM's diff. They read the patch, run a vibes check, and ship a verdict. None of them actually run the code. None of them know whether process_payment is called 12,000 times a minute or once a quarter. None of them can prove that a refactor preserved behavior — they can only opine.

That gap is fatal because AI coding agents are getting confident. Devin will rewrite a hot path, Cursor will refactor a class, Claude will hand you a diff that passes type-checking and looks beautiful — and silently changes behavior on an edge case nobody was thinking about. The reviewer reads the diff, nods, and merges. The first counterexample shows up in a Sentry alert at 3 AM.

We wanted to build the layer underneath all of those tools — the one that actually answers, in seconds:

"Did this refactor change user-observable behavior? If yes, here's the input that breaks."

And — critically — to do it with compute proportional to how much the function actually matters in production. A refactor of process_payment deserves 30 seconds of symbolic execution and 500 fuzz inputs. A refactor of legacy_export_csv deserves 2 seconds and a sanity check. Spending the same compute on both is what makes existing verification stacks unusable on real codebases.

That insight — runtime hotness as a budget allocator for symbolic compute — is Hotpath.


What It Does

Hotpath is three connected pieces, all running on a single laptop:

  1. A live heatmap of your app's actual production behavior — every function colored red→gray by runtime sample count, with edges showing the call graph, refreshing in real time as Locust drives load against the demo service.
  2. A behavioral verifier that takes a (before.py, after.py, function_name) pair and runs differential fuzzing + symbolic execution + property-based testing, with the budget scaled by the function's blast radius. On divergence, it shrinks the failing input to a minimum (often 3–5 fields) using delta-debugging; a sketch of the shrinker follows the tool list below.
  3. An MCP server exposing 9 tools so any AI coding agent — Devin, Cursor, Claude Code — can plug in:

Tools & Purpose

  • get_hotspots: functions ranked by blast radius, using real hotness data plus the call graph
  • get_risk_profile: hotness, static fanout, and blast radius for a single function
  • get_review_priority: score for a PR, blending blast radius with size signals
  • find_callers: transitive callers from the static call graph (BFS, capped)
  • get_function_source: source text, saving agents a grep round-trip
  • verify_refactor: async equivalence check on a before/after pair
  • verify_pr: whole-PR verification, resolving base/head from git
  • pr_triage: ranks a repo's open PRs by review priority
  • accept_counterexample: three-button elicitation form (revert / accept / autofix)

Plus a paste-link feature: drop any GitHub URL into the heatmap UI and Hotpath clones it, builds the static call-graph, and renders the blast radius even with no runtime data.
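
To make the shrinking step concrete, here is a minimal sketch. The verifier runs delta-debugging over structured request payloads; the greedy one-field-at-a-time variant below shows the core loop. `diverges` is an assumed oracle that returns True when before and after disagree on an input.

```python
# Simplified sketch of the counterexample shrinker. `diverges` is an
# assumed oracle: True when before/after produce different outputs.
from typing import Callable

def shrink_fields(inp: dict, diverges: Callable[[dict], bool]) -> dict:
    """Greedy one-field-at-a-time reduction of a failing dict input."""
    assert diverges(inp), "shrinking needs a failing input to start from"
    changed = True
    while changed:
        changed = False
        for key in list(inp):
            candidate = {k: v for k, v in inp.items() if k != key}
            if diverges(candidate):   # divergence still reproduces
                inp = candidate       # keep the smaller input, rescan
                changed = True
    return inp                        # minimal w.r.t. single-field removal
```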


The Math

The core idea is a single equation. For every function $f$ in the call graph, we compute a blast-radius score $r(f) \in [0, 1]$ that blends runtime hotness (the profiler's sample counts) with static fanout from the call graph; a hot, widely depended-on function scores near 1.

That score then drives the verification budget allocator:

$$B(f) = B_{\min} + (B_{\max} - B_{\min}) \cdot r(f)$$

Concrete thresholds:

  • $r < 0.3$ (cold): 2s budget, 20 fuzz inputs, no symbolic execution.
  • $0.3 \le r \le 0.7$ (warm): resources interpolated linearly between the two extremes.
  • $r > 0.7$ (hot): 30s budget, 500 fuzz inputs, symbolic execution enabled.
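
As code, the allocator is a direct translation of the formula and thresholds above (a sketch; the Budget fields, and keeping symbolic execution off in the warm band, are our reading of the thresholds):

```python
# Sketch of the budget allocator; constants mirror the thresholds above.
from dataclasses import dataclass

@dataclass
class Budget:
    wall_clock_s: float
    fuzz_inputs: int
    symbolic: bool

def allocate_budget(r: float) -> Budget:
    """Map a blast-radius score r in [0, 1] to verification resources."""
    if r < 0.3:                       # cold: quick sanity check only
        return Budget(2.0, 20, symbolic=False)
    if r > 0.7:                       # hot: full verification
        return Budget(30.0, 500, symbolic=True)
    t = (r - 0.3) / 0.4               # warm: interpolate between extremes
    return Budget(2.0 + 28.0 * t, int(20 + 480 * t), symbolic=False)
```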

How We Built It

Architecture

A uv workspace monorepo of seven library packages and three apps, glued together by two contracts:

  • A RepoContext dataclass passed between every package.
  • A file-bus at .runtime/hotness.json that the profiler writes and everyone else reads.
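
A sketch of the two contracts together (field names are illustrative, not the exact hotpath-types definitions):

```python
# Illustrative shape of the glue contracts; not the exact dataclass
# from hotpath-types.
import json
from dataclasses import dataclass
from pathlib import Path

@dataclass
class RepoContext:
    repo_root: Path
    hotness_path: Path | None = None  # defaults to .runtime/hotness.json

    def read_hotness(self) -> dict[str, int]:
        """Read the profiler's file-bus: qualified function name -> samples."""
        path = self.hotness_path or self.repo_root / ".runtime" / "hotness.json"
        if not path.exists():         # the profiler may not have written yet
            return {}
        return json.loads(path.read_text())
```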

The workspace members, in dependency order:

  • hotpath-types: Shared dataclasses.
  • hotpath-profiler: py-spy wrapper.
  • hotpath-callgraph: code2flow + AST walker.
  • hotpath-scoring: Blast radius math & priority logic.
  • hotpath-verifier: CrossHair + Hypothesis + ddmin.
  • hotpath-mcp: MCP server for AI agents.
  • hotpath-autofix: claude-agent-sdk loop for patching.
  • apps/demo-service: FastAPI app with planted bugs.
  • apps/heatmap-backend: SSE-based FastAPI server.
  • apps/heatmap: Vite + React 19 + visx frontend.

Tech Stack

  • Package manager: uv 0.11.7 (10x faster than Poetry)
  • Symbolic execution: crosshair-tool 0.0.103
  • Property testing: hypothesis 6.152.2
  • Profiler: py-spy 0.4.2 (Rust-based, ~1% CPU overhead)
  • Graph storage: networkx 3.4.x
  • Frontend: React 19 + Vite 8
  • Visualization: @xyflow/react + visx
  • Agent framework: claude-agent-sdk 0.1.67

Challenges We Ran Into

1. py-spy needs sudo on macOS

py-spy attaches to the target process with ptrace-equivalent calls that macOS blocks for unprivileged processes under SIP, so it needs sudo.

  • Fix: Created a dual-mode profiler. If sudo isn't available, we inject a sys.setprofile hook inside the uvicorn worker.
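
A sketch of the fallback hook (simplified; the real profiler also flushes periodically). It trades py-spy's low-overhead sampling for exact in-process call counting:

```python
# In-process fallback profiler for when py-spy can't attach.
import json
import sys
import threading
from collections import Counter
from pathlib import Path

_counts: Counter = Counter()

def _hook(frame, event, arg):
    if event == "call":               # count every function entry
        code = frame.f_code
        _counts[f"{code.co_filename}:{code.co_name}"] += 1

def install_fallback_profiler() -> None:
    sys.setprofile(_hook)             # current thread
    threading.setprofile(_hook)       # threads started after this point

def flush_hotness(path: Path = Path(".runtime/hotness.json")) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(dict(_counts)))
```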

2. CrossHair path explosion

Z3 can run arbitrarily long inside native solver code without yielding control back to Python, so cooperative timeouts (signals, deadline checks) never fire.

  • Fix: Every symbolic run spawns a multiprocessing.Process. We use a hard process.kill() after the wall-clock budget expires.
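
A sketch of the guard, assuming the symbolic run can be wrapped in a picklable target function:

```python
# Hard wall-clock guard: Z3 may never yield, so we kill the whole child
# process instead of relying on signals or cooperative deadlines.
import multiprocessing as mp

def run_with_hard_budget(target, args: tuple, budget_s: float):
    """Run target(*args, result_queue) in a child; kill it at the deadline."""
    queue: mp.Queue = mp.Queue()
    proc = mp.Process(target=target, args=(*args, queue))
    proc.start()
    proc.join(timeout=budget_s)
    if proc.is_alive():               # solver never returned: hard kill
        proc.kill()                   # SIGKILL, which Z3 cannot mask
        proc.join()
        return None                   # report "budget exhausted"
    return queue.get() if not queue.empty() else None
```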

3. The 4MB MCP payload limit

Sending entire source files over MCP blew past the ceiling.

  • Fix: Agents now pass filesystem path references (before_path, after_path) since Devin and Claude Code have direct FS access.
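
A sketch of the path-reference signature using the mcp Python SDK's FastMCP helper (the tool body is simplified; the real verify_refactor kicks off an async job):

```python
# Tools take filesystem paths, not file contents, keeping payloads tiny.
from pathlib import Path
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("hotpath")

@mcp.tool()
def verify_refactor(before_path: str, after_path: str, function_name: str) -> dict:
    """Start an equivalence check on two on-disk versions of a function."""
    before = Path(before_path).read_text()  # read locally, never sent over MCP
    after = Path(after_path).read_text()
    # ...enqueue the real verification job here...
    return {"status": "queued", "function": function_name}
```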

4. Graph scalability

The initial heatmap was unusable on large repos like pallets/click.

  • Fix: Layered visualization. We cap the overview at the 12 hottest nodes and use adaptive radial layout to prevent label stacking.

What We Learned

  • Verification is a Budget Problem: Behavioral verification is only expensive if you apply it blindly. Runtime profiling provides the "budget" that makes symbolic execution viable in a developer's daily workflow.
  • Hybrid Analysis is King: Sampling profilers + static analysis together produce a "blast radius" signal far more useful than either provides in isolation.
  • MCP is the Right Abstraction: Building as an MCP server rather than a CLI tool allowed us to integrate with Devin, Cursor, and Claude Code simultaneously with zero integration overhead.
  • Determinism is a Feature: In a verifier, two runs must produce byte-identical counterexamples. Pinning seeds across Z3, Hypothesis, and Python's hash randomization was crucial for credibility.
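
A condensed sketch of that pinning (smt.random_seed and sat.random_seed are real Z3 parameters and derandomize is a real Hypothesis setting; the seed value and profile name are arbitrary):

```python
# Pin every source of nondeterminism we know about before a verifier run.
import os
import random

import z3
from hypothesis import settings

SEED = 0

def pin_determinism() -> None:
    # PYTHONHASHSEED only takes effect at interpreter startup; the launcher
    # exports it before spawning workers, shown here for completeness.
    os.environ["PYTHONHASHSEED"] = str(SEED)
    random.seed(SEED)
    z3.set_param("smt.random_seed", SEED)
    z3.set_param("sat.random_seed", SEED)
    settings.register_profile("deterministic", derandomize=True)
    settings.load_profile("deterministic")
```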

What's Next

  • Multi-language Support: Expanding the architecture to Go and Rust.
  • Sandboxed Runtime Profiling: For cloned repos, run the test suite under py-spy automatically to generate "production-shaped" hotness data.
  • Bayesian Weighting: Using historical bug locations to automatically tune our blast-radius coefficients.
  • Devin Marketplace: Listing Hotpath as the default verification layer for Devin installations.

Built With

  • claude-agent
  • code2flow
  • crosshair
  • fastapi
  • mcp
  • py-spy
  • python
  • react
  • uv
  • vite