Inspiration

I am deeply invested in building wise and virtuous AI aligned with both human and AI flourishing. Mechanistic interpretability is one of the vital tools for this alignment.

Most interpretability work focuses on attention. That work is valuable, but attention is an effect; the deeper cause lies in how meaning is organized in latent space. By uncovering the geometry of latent representations, we can explain, in terms humans can understand, why two tokens attend to each other.

Our toolkit helps reveal this geometry of meaning by:

  • Exploring experts and token highways
  • Using clustering to trace pathways through latent space
  • Leveraging LLMs for scientific analysis
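
To make the idea of a "token highway" concrete, here is a minimal sketch that counts how many tokens share each expert-to-expert hop and each full route. It is an illustration, not the toolkit's code: the `routes` records, the function names, and the expert ids are all hypothetical toy values, and it assumes top-1 routing at every MoE layer.

```python
from collections import Counter

# Hypothetical routing records: for each token, the top-1 expert chosen at
# each MoE layer. Toy values for illustration, not results from the project.
routes = {
    "axe":   [3, 7, 1],
    "gun":   [3, 7, 1],
    "thorn": [3, 7, 4],
    "happy": [5, 2, 6],
    "enjoy": [5, 2, 6],
}

def count_highways(routes):
    """Count tokens per hop, keyed as (source layer, source expert, next expert)."""
    edges = Counter()
    for experts in routes.values():
        for layer, (src, dst) in enumerate(zip(experts, experts[1:])):
            edges[(layer, src, dst)] += 1
    return edges

def count_full_routes(routes):
    """Count tokens that share the same expert sequence across all layers."""
    return Counter(tuple(experts) for experts in routes.values())

edges = count_highways(routes)
full_routes = count_full_routes(routes)
busiest_edge, busiest_count = edges.most_common(1)[0]
print(busiest_edge, busiest_count)
```

Hop counts like these are the kind of quantity a routing diagram can use for edge widths; the busiest hops are the candidate highways.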

What It Does

  1. Probe Builder: Run probes and collect activation and routing data. Built-in WordNet integration makes it easy to pick syntactic or semantic categories (e.g., sentiment), and results are stored in a Parquet-based data lake for community use.
  2. Experiment Builder:
    • Expert View: Routing diagrams with customizable coloring, expert specialization cards, filters, and LLM-driven analysis reports.
    • Latent Space View: Cluster tokens by hidden representations, visualize their flows, generate cluster cards (lineage rules, categories), and run LLM analyses over latent pathways.
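
The Latent Space View idea can be sketched roughly as follows: cluster each layer's hidden states separately, then read a token's pathway as its sequence of cluster ids across layers. This is a simplified stand-in under stated assumptions, not our implementation: it uses a small pure-Python k-means with farthest-point initialization, and the 2-D vectors, the `kmeans` helper, and the choice of `k=2` are all hypothetical.

```python
import math
from collections import Counter

def kmeans(points, k, iters=10):
    """Plain k-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = []
    for _ in range(iters):
        # Assignment step: nearest center by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels

# Toy hidden states (one 2-D vector per token per layer), made up for illustration.
tokens = ["axe", "gun", "thorn", "happy", "enjoy"]
hidden_by_layer = [
    [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [5.0, 5.1], [5.1, 5.0]],
    [[1.0, 0.0], [1.1, 0.1], [-2.9, 4.2], [-3.0, 4.0], [-3.1, 4.1]],
]

# Cluster each layer independently, then read off each token's cluster sequence.
labels_by_layer = [kmeans(layer, k=2) for layer in hidden_by_layer]
pathways = {
    tok: tuple(labels[i] for labels in labels_by_layer)
    for i, tok in enumerate(tokens)
}
flow_counts = Counter(pathways.values())
print(pathways)
print(flow_counts)
```

Cluster ids are only meaningful within a layer, so a pathway such as `(0, 1)` reads as "cluster 0 at the first layer, then cluster 1 at the next". In this toy input, one token branches away from its first-layer neighbors, which is the kind of flow the view is meant to surface.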

How We Built It

We built this in five days with Claude and ChatGPT. I directed the design, created an architecture diagram, and managed implementation. We followed a loop of plan → uncertainty analysis → code → test, with me as the human in the middle keeping the agents on track.


Challenges

  • Compressing complex interpretability outputs into clear, intuitive visuals
  • Keeping the multi-agent coding loop stable under time pressure
  • Balancing detail with clarity in a 3-minute demo

Accomplishments

We’re proud that the probes revealed clear, interpretable patterns within days, e.g., dangerous-object pathways (axe, gun, thorn) and a terminus that is almost purely positive in sentiment and majority verbs (97% positive, 61% verbs). Seeing the LLM analyses line up with the visualizations gave us confidence that the approach works.
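
Percentages like the ones above are label shares within a cluster. A minimal sketch of that computation follows, assuming each cluster member carries a sentiment tag and a part-of-speech tag; the member list, field names, and helper functions are hypothetical toy values, not the project's data or code.

```python
from collections import Counter

# Hypothetical members of one terminal cluster. Toy labels for illustration only.
cluster_members = [
    {"token": "enjoy",     "sentiment": "positive", "pos": "verb"},
    {"token": "love",      "sentiment": "positive", "pos": "verb"},
    {"token": "celebrate", "sentiment": "positive", "pos": "verb"},
    {"token": "joy",       "sentiment": "positive", "pos": "noun"},
    {"token": "ruin",      "sentiment": "negative", "pos": "verb"},
]

def label_shares(members, field):
    """Return the fraction of cluster members carrying each label of `field`."""
    counts = Counter(m[field] for m in members)
    total = len(members)
    return {label: n / total for label, n in counts.items()}

def dominant_label(shares):
    """Return the label with the largest share."""
    return max(shares, key=shares.get)

sentiment_shares = label_shares(cluster_members, "sentiment")
pos_shares = label_shares(cluster_members, "pos")
print(dominant_label(sentiment_shares), sentiment_shares)
print(dominant_label(pos_shares), pos_shares)
```

Reporting both shares side by side keeps a claim such as "positive verb terminus" honest, since a cluster can be nearly pure on one axis and only a majority on the other.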


What We Learned

We learned that mixture-of-experts (MoE) models can be cleaner and easier to interpret than standard GPT-style models. We also found new patterns in how sentiment and parts of speech are routed, including patterns relevant to internal content filtering.


What’s Next

  • Agent swarms: AI agents that read insight reports, propose probes (including discordant ones), and evolve semantic memory through cycles of testing and contradiction resolution.
  • Finishing the story: Unifying latent geometry, attention alignments, and expert routing into one integrated causal analysis.
