Project Story

About the project

Behavioral Provenance is a producer-side toolchain that emits a signed JSON manifest describing which input classes can reach which output classes inside a software artifact. A consumer verifies any reachability policy against the manifest in linear time — without the source, without re-running the code, and without trusting the publisher beyond a public signing key.

Inspiration

Every team that wants to know whether a dependency can route an HTTP request into eval re-runs a static analyzer against that dependency's source, every build, in every organization on earth. The producer already had the answer at build time — encoded in their tests, their review templates, their internal threat models — and then threw it away. The consumer pays the analysis cost again because nobody ever published the answer in a checkable form.

This is the same shape as the situations that gave rise to reproducible builds (byte-level provenance was re-established by every downstream mirror until producers started signing it) and to TLS (identity was re-established on every TCP connection until certificates let identity travel with the endpoint). Behavioral provenance is the next layer of the same pattern. Nothing structural prevents it from existing today — the field had simply put the work on the wrong side of the supply chain.

The contrarian move is small: invert the locus. The producer ships a graph; the consumer answers any reachability question in linear time over that graph; nobody re-derives anything.

What I learned

  • Where work belongs matters more than how clever the algorithm is. Inter-procedural reachability over a fixed call graph \(G = (V, E)\) is decidable in \(O(|V| + |E|)\) by BFS. The complexity was never the problem. The problem was that only the producer can cheaply build \(G\), because only the producer holds the source, the build system, and the catalog of what counts as a source, sink, or sanitizer for this artifact.

  • Canonical JSON is harder than it looks. Any byte that varies between two semantically identical manifests breaks ed25519 verification. Deterministic key ordering, normalized line endings, stable float formatting, and stripping the signature field before hashing all turn out to matter.

  • The corpus is the unlock. A single signed manifest is mildly interesting. A million of them in one queryable place changes what questions a consumer is allowed to ask, e.g. "show me every package on earth whose own producer attested an http-request → sql-exec path with no sanitizer," which today requires re-scanning every package on earth and tomorrow becomes one SELECT.

  • The right threat model is the narrow one. The signature only attests that this manifest was produced by the holder of this key against this catalog hash at this timestamp. It does not guarantee that the producer is honest, the catalog correct, the graph complete, or the source matched to the binary. Those layers exist already (transparency logs, binary attestation, catalog governance). Trying to absorb them into v0.1 would have killed the prototype.

How I built it

The 0.1 prototype is roughly a thousand lines of Python and lands in five pieces:

  1. A frozen schemabehavioral-provenance/0.1 — with nodes, edges, annotations, a pre-computed summary of every reachable (source class, sink class) pair with witness paths, and an ed25519 signature over the canonical JSON of everything else.

  2. A producer toolchain (audit-keygen, audit-emit, audit-publish) that walks a Python project's AST, classifies nodes against a YAML catalog of sources / sinks / sanitizers, runs the BFS, and emits the signed manifest.

  3. A consumer toolchain (audit-verify, audit-query) that evaluates a five-primitive policy DSL:

   deny:    <source_glob> -> <sink_glob> [unless sanitizer=<class_glob>]
   require: every sink in {<class_glob>, ...} has sanitizer in {<class_glob>, ...}

Verification cost is \($O(|R| \cdot |\text{summary}|)\) for \(|R|\) rules — in practice, microseconds.

  1. A registry (audit-registry) backed by SQLite by default and MongoDB Atlas when BPM_MONGO_URI is set, exposing the cross-package query that motivates the project: GET /query?source=http-request&sink=sql-exec.

  2. A hermetic demo (scripts/demo.sh) that runs end-to-end in ~3 seconds: emit → verify (with two witness-path violations) → flip one byte and watch verification fail closed → bring up the registry → ingest six seed packages → run three cross-package queries no public tool can answer today.

The whole thing is covered by 31 tests across AST extraction, classification, reachability, signing, tamper detection, policy parsing/evaluation, storage round-trips, and the HTTP surface.

Challenges I ran into

  • Inter-procedural call resolution in Python is genuinely hard. Dynamic dispatch, decorators, getattr chains, and re-exports all mean a naive AST walk under-approximates the call graph. I kept the resolver conservative and explicit about what it cannot see — better to publish a manifest that under-claims coverage than one that quietly lies.

  • Catalog drift would silently change every answer. A renamed sanitizer class can flip a manifest from compliant to violating without any code change. I bound every manifest to a catalog.sha256, and the registry treats two manifests sharing that hash as classified under identical assumptions.

  • Canonicalizing JSON for ed25519 signing. Key order, Unicode escapes, trailing whitespace, float formatting — any of them break signatures across implementations. The tamper-detection test was the single most valuable test I wrote.

  • Resisting feature creep. The 0.1 cut list is deliberately small: no transparency log, no binary attestation, no language-server integration, no IDE plugin. Those are real, they have their own literatures, and they layer on top. The contribution here is the artifact — not the trust root.

What's next

A public corpus. Package-manager integration that gates installs on policy before a package touches the file system. Manifests attached to LLM-generated snippets so agentic code carries its own audit. And eventually, behavioral provenance as a routine build artifact — the way reproducible builds and signed releases became routine before it.

Built With

Share this project:

Updates