MoE Concept MRI: Interpretability Science Toolkit

Inspiration

I am deeply invested in building wise and virtuous AI aligned with both human and AI flourishing. Mechanistic interpretability is one of the vital tools for this alignment.

Much interpretability work focuses on attention. That is valuable, but attention patterns are an effect; the deeper cause lies in how meaning is organized in latent space. By uncovering the geometry of latent representations, we can explain, in terms humans can understand, why two tokens attend to each other.

Our toolkit helps reveal this geometry of meaning by:

Exploring experts and token highways
Using clustering to trace pathways through latent space
Leveraging LLMs for scientific analysis

What It Does

Probe Builder: Run probes and collect activation and routing data. Integrated WordNet support makes it easy to pick syntactic or semantic categories (e.g., sentiment), and results are stored in a Parquet-based data lake for community use.
Experiment Builder:
- Expert View: Routing diagrams with customizable coloring, expert specialization cards, filters, and LLM-driven analysis reports.
- Latent Space View: Cluster tokens by hidden representations, visualize their flows, generate cluster cards (lineage rules, categories), and run LLM analyses over latent pathways.
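
As a rough illustration of what a flow diagram and a cluster card summarize, here is a minimal sketch in plain Python. It assumes clustering (or top-1 expert routing) has already run, so each token carries one id per probed layer. The record layout, the field names (`path`, `sentiment`, `pos`), and the toy values are all hypothetical, not the toolkit's actual schema or data.

```python
from collections import Counter, defaultdict

# Hypothetical probe records (made-up values): each token has category labels
# and one cluster id per probed layer. Expert ids could be used the same way.
records = [
    {"token": "love",  "sentiment": "positive", "pos": "verb", "path": (2, 0, 1)},
    {"token": "enjoy", "sentiment": "positive", "pos": "verb", "path": (2, 0, 1)},
    {"token": "gift",  "sentiment": "positive", "pos": "noun", "path": (3, 0, 1)},
    {"token": "hate",  "sentiment": "negative", "pos": "verb", "path": (2, 1, 0)},
    {"token": "axe",   "sentiment": "negative", "pos": "noun", "path": (3, 1, 0)},
]


def pathway_counts(records):
    """Count how many tokens follow each full layer-by-layer pathway."""
    return Counter(r["path"] for r in records)


def terminus_card(records, axis):
    """For each final-layer cluster, return the share of tokens per label on `axis`."""
    by_terminus = defaultdict(Counter)
    for r in records:
        by_terminus[r["path"][-1]][r[axis]] += 1
    return {
        cluster: {label: n / sum(counts.values()) for label, n in counts.items()}
        for cluster, counts in by_terminus.items()
    }


paths = pathway_counts(records)                   # pathway -> token count
sentiment_card = terminus_card(records, "sentiment")
pos_card = terminus_card(records, "pos")

for path, n in paths.most_common():
    print(" -> ".join(map(str, path)), n)
print(sentiment_card)
print(pos_card)
```

Shares like these are the kind of figure a cluster card can report for a terminus, for example the fraction of positive tokens or the fraction of verbs that end up in one final-layer cluster.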

How We Built It

We built this in 5 days with Claude and ChatGPT. I directed the design, created an architecture diagram, and managed implementation. We followed a loop of plan → uncertainty analysis → code → test, with me as the human in the middle keeping the agents on track.

Challenges

Compressing complex interpretability outputs into clear, intuitive visuals
Keeping the multi-agent coding loop stable under time pressure
Balancing detail with clarity in a 3-minute demo

Accomplishments

We’re proud that the probes revealed clear, interpretable patterns within days, e.g., dangerous-object pathways (axe, gun, thorn) and a terminus that is near-pure in sentiment and verb-heavy (97% positive, 61% verbs). Seeing the LLM analysis line up with the visualizations gave us confidence that the approach works.

What We Learned

We learned that MoE models can be cleaner and easier to interpret than standard dense GPT-style models. We also gained new insights into how sentiment and parts of speech are routed, including patterns relevant to internal content filtering.

What’s Next

Agent swarms: AI agents that read insight reports, propose probes (including discordant ones), and evolve semantic memory through cycles of testing and contradiction resolution.
Finishing the story: Unifying latent geometry, attention alignments, and expert routing into one integrated causal analysis.

Built With

node.js
parquet
python
react
tailwind

Updates

Andrew Smigaj started this project — Sep 11, 2025 06:18 PM EDT
