Mycelium, The Collective Intelligence Layer for AI Agents

npm for AI agent knowledge: agents share verified skills instead of re-solving the same problems, saving the energy, water, and carbon of every redundant computation.

We didn't build another sustainability app. We built the infrastructure behind sustainability for AI.

Inspiration

Every AI agent on Earth has amnesia.

Right now, somewhere, a Claude agent is figuring out how to set up Supabase auth in a Next.js app. It searches the docs, reasons through the cookie flow, almost certainly gets the session-refresh middleware wrong on the first try, ships a subtle getSession() security bug, and debugs its way out over several expensive turns. It finally gets it right. And then it forgets everything. An hour later, a different agent starts the exact same task from zero.

We did the math on what that forgetting costs. Roughly $2.5$ billion AI queries are sent every day. Research measures that about $31\%$ of LLM queries are semantically similar to earlier ones, but we didn't want a number we'd have to defend as inflated, so we threw most of it away and kept only a conservative $10\%$ floor of genuinely skill-reusable, already-solved work:

$$ 0.10 \times 2.5\text{ billion} = 250\text{ million redundant task-sessions every single day} $$

Each one burns compute, water, and carbon for a problem the collective has already solved. At that floor, the daily waste is on the order of 105 MWh of energy, around 6 million liters of cooling water, and roughly 40 tonnes of CO2, and that's the floor, not the ceiling. Inference, not training, is the real climate story: serving GPT-4 for about 121 days emits as much carbon as training it once. Nobody is building infrastructure to stop the bleeding.

Then we looked underground.

Beneath every forest is a mycelium network, a fungal web scientists call the Wood Wide Web, through which trees that look completely separate above the soil quietly share nutrients and warnings. No central authority decides anything. The network gets smarter through use. Biologists have a word for this kind of coordination-without-a-boss: stigmergy, the same principle ants use when they lay pheromone trails that strengthen when a path works and evaporate when it stops mattering.

Nature solved collective intelligence 400 million years ago. We wanted to build the same thing for AI agents.

What it does

Mycelium is a shared, living commons of executable skills for AI agents. When one Claude agent solves a real task, it distills the solution into a reusable skill, a named, versioned, dependency-tagged writeup with a built-in success check. Any other agent can then discover that skill and apply it instead of re-solving the problem from scratch.

The twist that makes it trustworthy: a skill isn't trusted because someone stamped it "verified." It earns trust by actually working, repeatedly, in real environments, with no curator, no lab, no sandbox. Every reuse is a vote cast by reality, and every reuse is inference that never had to happen again. The savings (tokens, energy, water, carbon) are counted, not estimated, and shown live.

If GPTCache is copy-paste from Stack Overflow, a text blob, hope it works, Mycelium is npm install: a real, versioned, tested package with a compatibility matrix.

How we built it

The stack. A TypeScript monorepo (npm workspaces) split into three packages: shared/ (the contract code both halves import), mcp/ (the MCP server), and web/ (a Next.js dashboard). The backbone is Supabase, Postgres for data, pgvector for semantic search, and Realtime for the live visualization.

The discovery layer is an MCP server exposing five tools to Claude Code: search_skills, get_skill, publish_skill, report_apply, and set_sharing. Claude calls search_skills before starting work, reads the metadata, and decides for itself whether to apply a match. We augment Claude's reasoning, we don't bypass it.

Search is genuinely semantic, not keyword matching. When a skill is published, we embed its description into a $1536$-dimension vector with OpenAI's text-embedding-3-small. A query is embedded the same way, and a Postgres function ranks skills by cosine similarity using pgvector's distance operator:

$$ \text{similarity} = 1 - \big(\vec{q}\cdot\vec{s}\big) $$

so "set up login with Supabase" finds a skill titled "configure Supabase authentication" even with zero shared keywords.

Trust is a Bayesian pass-rate, not a vote. Every skill ships an executable success_check. When a skill is applied, that check runs in the real environment and produces an objective pass or fail, the only thing that moves trust. We store success_count and failure_count and compute the posterior mean (Laplace's rule of succession):

$$ \text{trust} = \frac{s + 1}{s + f + 2} $$

This behaves beautifully out of the box. A brand-new skill scores $\frac{0+1}{0+0+2} = 0.5$, so "unproven" falls out of the math and we never hand-set it. Three clean successes give $\frac{3+1}{3+0+2} = 0.8$, which is why three independent successes is our rule-of-thumb threshold (one is a fluke, two is coincidence, three is a pattern). And a skill used heavily but broken often is correctly distrusted: $20$ uses with $18$ failures gives $\frac{2+1}{2+18+2} = \frac{3}{22} \approx 0.14$. Usage alone earns nothing; only outcomes do.

Trust is scoped to the environment. A skill doesn't carry one global score, it carries the fingerprint (framework, version, OS, runtime, dependencies) of every environment it has succeeded in. When a new agent retrieves it, we compute a compatibility score against that proven envelope: high overlap surfaces it as "trusted for you," low overlap as "unproven in your environment, adapt and re-confirm." Success in a new environment widens the envelope. Universality is measured and earned, one environment at a time, exactly how a real package builds a compatibility matrix.

The whole thing comes alive through Realtime. Every apply writes to three tables in one transaction, a trails row (the event), the skills row (recomputed trust plus widened envelope), and the singleton stats row (the impact totals). Those writes stream straight to the browser over a websocket, so the force-directed graph reacts on its own: a new node blooms, a node brightens as its trust climbs or red-pulses when a check fails, an edge animates along every reuse, and the impact ticker counts up. We just write correct rows; the visualization falls out for free.

The site is the pitch. The homepage opens above ground as a forest of lonely agent-trees, then, as you scroll, the camera dives below the soil and the glowing mycelium network unfurls underground: you thought these were separate agents; watch what's underneath.

What we learned

Verification can be relocated instead of removed. Our biggest design unlock was realizing we didn't need a verification sandbox, we needed verification, and the real environment is a better test bed than any synthetic lab. We moved the test from "once, before sharing" to "every time, at apply." Cutting the sandbox didn't delete the test; it made it stronger.
"Converting a solution into a skill" isn't magic, it's writing. A skill is just a clean, generalized writeup of what already worked, and Claude has the whole session in its context window when it writes it. The one-time token cost of authoring is an investment that amortizes: pay around $18{,}000$ tokens once, save roughly $80\%$ on every future reuse, break even after a single one.
Bayesian beats the naive average for anything with sparse data. The $+1/+2$ prior is the difference between "100% trustworthy after one lucky run" and a score that earns confidence honestly.
Emergent systems are designed by getting the local rules right. We never wrote a "rank skills" algorithm; ranking emerges from correct trust updates and decay, the same way an ant colony finds the shortest path without any ant knowing the map.

Challenges we faced

Verification without a sandbox. Once we cut the lab, we had to make trust honest anyway. The answer was the executable success_check that genuinely runs a real command (an HTTP status assertion) so the pass/fail is earned from reality, not typed in by us.
The cold-start problem. A commons with no skills is useless, so we made Mycelium valuable at $N=1$: before any sharing, your own successful sessions become your own private, self-building skill library. Network effects are upside, not a precondition.
Making invisible infrastructure something judges can touch. Mycelium is plumbing, and plumbing is easy to watch but hard to play with. We built a "try it" box where anyone can type a task, run the real semantic search, and watch the live impact ticker move in response to their own input.
Keeping every number honest. It would have been easy to cite the flashy $31\%$ figure or count gross savings while hiding our own overhead. Instead every claim uses the conservative $10\%$ floor, and every saving is net: finding a skill costs one embedding lookup (a fraction of a watt-hour, constant regardless of task difficulty) against the roughly $18{,}000$ generated tokens it replaces, about three orders of magnitude less than it saves. We don't model our impact; we observe it.
Seeding the commons without burning tokens. We realized seeding is a script, not an army of agents. The only API cost is embedding descriptions ($0.02$ per million tokens), so we could populate a believable, clustered, varied commons for a fraction of a cent and reserve real agent sessions for the handful of deep "hero" skills the demo actually leans on.

What's next for Mycelium

We built the primitive and the flywheel, auto-generation, environment-scoped verification, and stigmergic trust, and seeded it. From here: real auto-distillation of skills from public framework docs and high-star repos (humanity already solved these problems in the open), and the security roadmap for a public commons, trust-gating plus a hard capability blacklist that treats every shared skill as untrusted data, never as commands.