Hotpath

Hotpath turns your app's production runtime into a live heatmap of what actually runs, then uses that signal to verify any code change preserved behavior — surfacing a minimal failing input when it didn't.


Inspiration

When we started this hackathon, every "AI code review" tool we could name — Greptile, CodeRabbit, Qodo, Cursor Bugbot, Devin Review — was, structurally, an LLM second-guessing another LLM's diff. They read the patch, run a vibes check, and ship a verdict. None of them actually run the code. None of them know whether process_payment is called 12,000 times a minute or once a quarter. None of them can prove that a refactor preserved behavior — they can only opine.

That gap is fatal because AI coding agents are getting confident. Devin will rewrite a hot path, Cursor will refactor a class, Claude will hand you a diff that passes type-checking and looks beautiful — and silently changes behavior on an edge case nobody was thinking about. The reviewer reads the diff, nods, and merges. The first counterexample shows up in a Sentry alert at 3 AM.

We wanted to build the layer underneath all of those tools — the one that actually answers, in seconds:

"Did this refactor change user-observable behavior? If yes, here's the input that breaks."

And — critically — to do it with compute proportional to how much the function actually matters in production. A refactor of process_payment deserves 30 seconds of symbolic execution and 500 fuzz inputs. A refactor of legacy_export_csv deserves 2 seconds and a sanity check. Spending the same compute on both is what makes existing verification stacks unusable on real codebases.

That insight — runtime hotness as a budget allocator for symbolic compute — is Hotpath.


What It Does

Hotpath is three connected pieces, all running on a single laptop:

  1. A live heatmap of your app's actual production behavior — every function colored red→gray by runtime sample count, with edges showing the call graph, refreshing in real time as Locust drives load against the demo service.
  2. A behavioral verifier that takes a (before.py, after.py, function_name) pair and runs differential fuzzing + symbolic execution + property-based testing, with the budget scaled by the function's blast radius. On divergence, it shrinks the failing input to a minimum (often 3–5 fields) using delta-debugging; a sketch of the shrinker follows the tool list below.
  3. An MCP server exposing 9 tools so any AI coding agent — Devin, Cursor, Claude Code — can plug in:

Tools & Purpose

  • get_hotspots: functions ranked by blast radius, using real hotness data plus the call graph
  • get_risk_profile: hotness, static fanout, and blast radius for a single function
  • get_review_priority: score for a PR, blending blast radius with size signals
  • find_callers: transitive callers from the static call graph (BFS, capped)
  • get_function_source: source text, saving agents a grep round-trip
  • verify_refactor: async equivalence check on a before/after pair
  • verify_pr: whole-PR verification, resolving base/head from git
  • pr_triage: ranks a repo's open PRs by review priority
  • accept_counterexample: three-button elicitation form (revert / accept / autofix)

Plus a paste-link feature: drop any GitHub URL into the heatmap UI and Hotpath clones it, builds the static call-graph, and renders the blast radius even with no runtime data.
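
To make the shrinking step concrete, here is a minimal sketch. The verifier runs delta-debugging over structured request payloads; the greedy one-field-at-a-time variant below shows the core loop. `diverges` is an assumed oracle that returns True when before and after disagree on an input.

```python
# Simplified sketch of the counterexample shrinker. `diverges` is an
# assumed oracle: True when before/after produce different outputs.
from typing import Callable

def shrink_fields(inp: dict, diverges: Callable[[dict], bool]) -> dict:
    """Greedy one-field-at-a-time reduction of a failing dict input."""
    assert diverges(inp), "shrinking needs a failing input to start from"
    changed = True
    while changed:
        changed = False
        for key in list(inp):
            candidate = {k: v for k, v in inp.items() if k != key}
            if diverges(candidate):   # divergence still reproduces
                inp = candidate       # keep the smaller input, rescan
                changed = True
    return inp                        # minimal w.r.t. single-field removal
```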


The Math

The core idea is a single equation. For every function $f$ in the call graph, we compute a blast-radius score $r(f) \in [0, 1]$ that blends runtime hotness (the profiler's sample counts) with static fanout from the call graph; a hot, widely depended-on function scores near 1.

That score then drives the verification budget allocator:

$$B(f) = B_{\min} + (B_{\max} - B_{\min}) \cdot r(f)$$

Concrete thresholds:

  • $r < 0.3$ (cold): 2s budget, 20 fuzz inputs, no symbolic execution.
  • $0.3 \le r \le 0.7$ (warm): resources interpolated linearly between the two extremes.
  • $r > 0.7$ (hot): 30s budget, 500 fuzz inputs, symbolic execution enabled.
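
As code, the allocator is a direct translation of the formula and thresholds above (a sketch; the Budget fields, and keeping symbolic execution off in the warm band, are our reading of the thresholds):

```python
# Sketch of the budget allocator; constants mirror the thresholds above.
from dataclasses import dataclass

@dataclass
class Budget:
    wall_clock_s: float
    fuzz_inputs: int
    symbolic: bool

def allocate_budget(r: float) -> Budget:
    """Map a blast-radius score r in [0, 1] to verification resources."""
    if r < 0.3:                       # cold: quick sanity check only
        return Budget(2.0, 20, symbolic=False)
    if r > 0.7:                       # hot: full verification
        return Budget(30.0, 500, symbolic=True)
    t = (r - 0.3) / 0.4               # warm: interpolate between extremes
    return Budget(2.0 + 28.0 * t, int(20 + 480 * t), symbolic=False)
```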

How We Built It

Architecture

A uv workspace monorepo of seven library packages and three apps, glued together by two contracts:

  • A RepoContext dataclass passed between every package.
  • A file-bus at .runtime/hotness.json that the profiler writes and everyone else reads.
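
A sketch of the two contracts together (field names are illustrative, not the exact hotpath-types definitions):

```python
# Illustrative shape of the glue contracts; not the exact dataclass
# from hotpath-types.
import json
from dataclasses import dataclass
from pathlib import Path

@dataclass
class RepoContext:
    repo_root: Path
    hotness_path: Path | None = None  # defaults to .runtime/hotness.json

    def read_hotness(self) -> dict[str, int]:
        """Read the profiler's file-bus: qualified function name -> samples."""
        path = self.hotness_path or self.repo_root / ".runtime" / "hotness.json"
        if not path.exists():         # the profiler may not have written yet
            return {}
        return json.loads(path.read_text())
```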

The workspace members, in dependency order:

  • hotpath-types: Shared dataclasses.
  • hotpath-profiler: py-spy wrapper.
  • hotpath-callgraph: code2flow + AST walker.
  • hotpath-scoring: Blast radius math & priority logic.
  • hotpath-verifier: CrossHair + Hypothesis + ddmin.
  • hotpath-mcp: MCP server for AI agents.
  • hotpath-autofix: claude-agent-sdk loop for patching.
  • apps/demo-service: FastAPI app with planted bugs.
  • apps/heatmap-backend: SSE-based FastAPI server.
  • apps/heatmap: Vite + React 19 + visx frontend.

Tech Stack

  • Package manager: uv 0.11.7 (10x faster than Poetry)
  • Symbolic execution: crosshair-tool 0.0.103
  • Property testing: hypothesis 6.152.2
  • Profiler: py-spy 0.4.2 (Rust-based, ~1% CPU overhead)
  • Graph storage: networkx 3.4.x
  • Frontend: React 19 + Vite 8
  • Visualization: @xyflow/react + visx
  • Agent framework: claude-agent-sdk 0.1.67

Challenges We Ran Into

1. py-spy needs sudo on macOS

py-spy attaches to the target process with ptrace-equivalent calls that macOS blocks for unprivileged processes under SIP, so it needs sudo.

  • Fix: Created a dual-mode profiler. If sudo isn't available, we inject a sys.setprofile hook inside the uvicorn worker.
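
A sketch of the fallback hook (simplified; the real profiler also flushes periodically). It trades py-spy's low-overhead sampling for exact in-process call counting:

```python
# In-process fallback profiler for when py-spy can't attach.
import json
import sys
import threading
from collections import Counter
from pathlib import Path

_counts: Counter = Counter()

def _hook(frame, event, arg):
    if event == "call":               # count every function entry
        code = frame.f_code
        _counts[f"{code.co_filename}:{code.co_name}"] += 1

def install_fallback_profiler() -> None:
    sys.setprofile(_hook)             # current thread
    threading.setprofile(_hook)       # threads started after this point

def flush_hotness(path: Path = Path(".runtime/hotness.json")) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(dict(_counts)))
```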

2. CrossHair path explosion

Z3 can run arbitrarily long inside native solver code without yielding control back to Python, so cooperative timeouts (signals, deadline checks) never fire.

  • Fix: Every symbolic run spawns a multiprocessing.Process. We use a hard process.kill() after the wall-clock budget expires.
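
A sketch of the guard, assuming the symbolic run can be wrapped in a picklable target function:

```python
# Hard wall-clock guard: Z3 may never yield, so we kill the whole child
# process instead of relying on signals or cooperative deadlines.
import multiprocessing as mp

def run_with_hard_budget(target, args: tuple, budget_s: float):
    """Run target(*args, result_queue) in a child; kill it at the deadline."""
    queue: mp.Queue = mp.Queue()
    proc = mp.Process(target=target, args=(*args, queue))
    proc.start()
    proc.join(timeout=budget_s)
    if proc.is_alive():               # solver never returned: hard kill
        proc.kill()                   # SIGKILL, which Z3 cannot mask
        proc.join()
        return None                   # report "budget exhausted"
    return queue.get() if not queue.empty() else None
```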

3. The 4MB MCP payload limit

Sending entire source files over MCP blew past the ceiling.

  • Fix: Agents now pass filesystem path references (before_path, after_path) since Devin and Claude Code have direct FS access.
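
A sketch of the path-reference signature using the mcp Python SDK's FastMCP helper (the tool body is simplified; the real verify_refactor kicks off an async job):

```python
# Tools take filesystem paths, not file contents, keeping payloads tiny.
from pathlib import Path
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("hotpath")

@mcp.tool()
def verify_refactor(before_path: str, after_path: str, function_name: str) -> dict:
    """Start an equivalence check on two on-disk versions of a function."""
    before = Path(before_path).read_text()  # read locally, never sent over MCP
    after = Path(after_path).read_text()
    # ...enqueue the real verification job here...
    return {"status": "queued", "function": function_name}
```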

4. Graph scalability

The initial heatmap was unusable on large repos like pallets/click.

  • Fix: Layered visualization. We cap the overview at the 12 hottest nodes and use adaptive radial layout to prevent label stacking.

What We Learned

  • Verification is a Budget Problem: Behavioral verification is only expensive if you apply it blindly. Runtime profiling provides the "budget" that makes symbolic execution viable in a developer's daily workflow.
  • Hybrid Analysis is King: Sampling profilers + static analysis together produce a "blast radius" signal far more useful than either provides in isolation.
  • MCP is the Right Abstraction: Building as an MCP server rather than a CLI tool allowed us to integrate with Devin, Cursor, and Claude Code simultaneously with zero integration overhead.
  • Determinism is a Feature: In a verifier, two runs must produce byte-identical counterexamples. Pinning seeds across Z3, Hypothesis, and Python's hash randomization was crucial for credibility.
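
A condensed sketch of that pinning (smt.random_seed and sat.random_seed are real Z3 parameters and derandomize is a real Hypothesis setting; the seed value and profile name are arbitrary):

```python
# Pin every source of nondeterminism we know about before a verifier run.
import os
import random

import z3
from hypothesis import settings

SEED = 0

def pin_determinism() -> None:
    # PYTHONHASHSEED only takes effect at interpreter startup; the launcher
    # exports it before spawning workers, shown here for completeness.
    os.environ["PYTHONHASHSEED"] = str(SEED)
    random.seed(SEED)
    z3.set_param("smt.random_seed", SEED)
    z3.set_param("sat.random_seed", SEED)
    settings.register_profile("deterministic", derandomize=True)
    settings.load_profile("deterministic")
```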

What's Next

  • Multi-language Support: Expanding the architecture to Go and Rust.
  • Sandboxed Runtime Profiling: For cloned repos, run the test suite under py-spy automatically to generate "production-shaped" hotness data.
  • Bayesian Weighting: Using historical bug locations to automatically tune our blast-radius coefficients.
  • Devin Marketplace: Listing Hotpath as the default verification layer for Devin installations.

Built With

  • claude-agent
  • code2flow
  • crosshair
  • fastapi
  • mcp
  • py-spy
  • python
  • react
  • uv
  • vite