Inspiration
I am deeply invested in building wise and virtuous AI aligned with both human and AI flourishing. Mechanistic interpretability is one of the vital tools for this alignment.
Most interpretability work focuses on attention. That’s valuable, but attention is the effect. The deeper cause lies in how meaning is organized in latent space. By uncovering the geometry of latent representations, we can explain why two tokens attend to each other in a way humans can understand.
Our toolkit helps reveal this geometry of meaning by:
- Exploring experts and token highways
- Using clustering to trace pathways through latent space
- Leveraging LLMs for scientific analysis
What It Does
- Probe Builder: Run probes and collect activation and routing data. Integrated WordNet makes it easy to pick syntactic or semantic categories (e.g., sentiment), and results are stored in a Parquet-based data lake for community use.
- Experiment Builder:
  - Expert View: Routing diagrams with customizable coloring, expert specialization cards, filters, and LLM-driven analysis reports.
  - Latent Space View: Cluster tokens by hidden representations, visualize their flows, generate cluster cards (lineage rules, categories), and run LLM analyses over latent pathways.
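To make the idea of token pathways concrete, here is a minimal sketch of how flows between layers might be counted once each token has a per-layer assignment (a cluster id from the Latent Space View or an expert id from the Expert View). It is not the toolkit's actual code: the `assignments` data, the ids, and the function names `pathway_counts` and `transition_counts` are all hypothetical and invented for illustration.

```python
from collections import Counter

# Hypothetical toy data: for each token, the cluster (or expert) id it was
# assigned at three consecutive layers. The ids are invented; real ones would
# come from clustering hidden states or from the router's choices.
assignments = {
    "axe":   [0, 2, 1],
    "gun":   [0, 2, 1],
    "thorn": [0, 2, 1],
    "love":  [1, 0, 3],
    "enjoy": [1, 0, 3],
    "table": [0, 1, 2],
}


def pathway_counts(assignments):
    """Count how many tokens follow each full layer-by-layer path."""
    return Counter(tuple(path) for path in assignments.values())


def transition_counts(assignments):
    """Count flows between adjacent layers, keyed by (layer, source, target)."""
    flows = Counter()
    for path in assignments.values():
        for layer, (src, dst) in enumerate(zip(path, path[1:])):
            flows[(layer, src, dst)] += 1
    return flows


paths = pathway_counts(assignments)
flows = transition_counts(assignments)

# The most travelled full path is a candidate "token highway".
top_path, top_count = paths.most_common(1)[0]
print(top_path, top_count)
```

The transition counts are the kind of edge weights a flow or routing diagram needs, while the full-path counts point to groups of tokens that travel together through the model.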
How We Built It
We coded this in 5 days with Claude and ChatGPT. I directed the design, created an architecture diagram, and managed implementation. We followed a loop of plan → uncertainty analysis → code → test, with me as the human in the loop keeping the agents on track.
Challenges
- Compressing complex interpretability outputs into clear, intuitive visuals
- Keeping the multi-agent coding loop stable under time pressure
- Balancing detail with clarity in a 3-minute demo
Accomplishments
We’re proud that the probes revealed clear, interpretable patterns within days. Examples include dangerous-object pathways (axe, gun, thorn) and a terminal cluster made up almost entirely of positive-sentiment tokens, most of them verbs (97% positive, 61% verbs). Seeing the LLM analysis line up with the visualizations gave us confidence that the approach works.
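As an illustration of how composition figures like these can be computed for a cluster card, here is a small sketch under stated assumptions. The member list, its labels, and the `composition` helper are hypothetical, and the toy numbers are not our results.

```python
from collections import Counter

# Hypothetical members of one terminal cluster, each tagged with a sentiment
# label and a part of speech. Invented for illustration only.
cluster = [
    ("enjoy", "positive", "verb"),
    ("celebrate", "positive", "verb"),
    ("praise", "positive", "verb"),
    ("joy", "positive", "noun"),
    ("broken", "negative", "adjective"),
]


def composition(members, field):
    """Return the share of cluster members per label for one field index."""
    counts = Counter(member[field] for member in members)
    total = len(members)
    return {label: n / total for label, n in counts.items()}


sentiment = composition(cluster, 1)  # shares by sentiment label
pos = composition(cluster, 2)        # shares by part of speech

print(f"{sentiment['positive']:.0%} positive, {pos['verb']:.0%} verbs")
```

Sentiment and part of speech are measured independently, so the two percentages describe different properties of the same cluster and do not need to sum to 100%.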
What We Learned
We learned that MoE models can be cleaner and easier to interpret than standard GPTs. We also found new patterns in how sentiment and parts of speech are routed, including patterns relevant to internal content filtering.
What’s Next
- Agent swarms: AI agents that read insight reports, propose probes (including discordant ones), and evolve semantic memory through cycles of testing and contradiction resolution.
- Finishing the story: Unifying latent geometry, attention alignments, and expert routing into one integrated causal analysis.