Example of prompt injection detection
Selfieeee

GlassBox

Inspiration

AI will happily give you a confident medical answer. We have no clear way of making it interpretable because it is inherenly polysemantic and features are held in superposition (one neuron activating to multiple features). We built GlassBox to close this gap, a window into what a model is actually doing while it answers, so the nurse, doctor, or patient on the other end can tell whether to trust the answer or question it.

A model can write a fluent, self-assured paragraph while, underneath, its internal signals are anything but sure. And it turns out you can read that wiring. Recent interpretability work shows that simple probes can catch internal states a model's text is hiding, from uncertainty to deception (Apollo Research's linear-probe work was a big influence here). We picked Google's Gemma 3 4B as our subject because it's open, it's small enough to actually instrument, and its middle layers are where these signals live cleanly. We also use Gemma Scope Sparse Auto Encoders to do unsupervisedb exploration of features in the model.

GlassBox is what happens when you stop treating the model as a black box and start watching it think.

What it does

GlassBox sits next to a medical chatbot. You ask a clinical question, the model answers like it always would, and GlassBox shows you what was going on inside while it did.

Two views, side by side with the answer:

The feature map: which concepts the model actually engaged with for your question, surfaced from its internal activations.
The trackers: live readings for things like uncertainty and harmfulness. An answer that reads confident with nothing firing underneath is very different from one that reads confident while its uncertainty signal is lit.

Instead of a lone answer, the person in the loop gets context, enough to catch the confidently-wrong cases before they reach a patient.

How we built it

GlassBox runs as two pieces. There's a lightweight application layer that handles the chat, the views, and everything a user touches, and a separate GPU service that holds the model and does the heavy lifting. The app layer never loads the model itself; it asks the GPU service for what it needs over a simple API. Keeping those two apart means the interface stays fast and responsive while the expensive, GPU-bound work scales on its own, and either side can be worked on or restarted without dragging the other down.

The feature map comes from a pretrained sparse autoencoder that turns raw activations into interpretable features. The trackers work differently: for each concept we precompute a "direction" from labeled examples ahead of time, and at answer-time we just measure the model's live activation against it. That's one cheap comparison per tracker, because the hard part already happened offline.

We used Arize Phoenix* to trace every turn, so when something looks off we can see exactly what happened inside a request, and Sentry to catch failures before they quietly corrupt the signal. An interpretability tool you can't debug is just another black box.

Challenges we ran into

Our first probes scored no better than a coin flip (AUROC = 0.5) The cause was a single oversized signal in the model's internals that drowned out everything else. Normalizing for it fixed the probes, and taught us that this correction isn't optional. It has to ship with every direction.
This project required heavy GPU compute and on-demand pods which was not easily accessible. We have to then figure out a way to use runpod's GPU and created a micro-service that handled the heavy lifting.
We first trained probes in a setup that wasn't identical to how we serve them, and the mismatch quietly degraded the signal. We had to route the training path through the exact same machinery as the live one.
Enabling ad-hoc probe monitors was very ambitious. We essentially production-alized a frontier research paper (Persona Vectors) by Anthropic to make this feasible.

Accomplishments that we're proud of

It works, end to end. A question goes in, an answer comes out, and the feature map and tracker readings light up beside it in real time. We're proud the whole loop holds together as one system, fast enough to actually use, instead of a pile of disconnected scripts. And we stayed honest about the numbers: every tracker is measured against a baseline, so the signal we put on screen is one we'd actually stand behind.

We are also really proud of converting a intensive mechanistic interpretability project into a 24 hour hackathon submission

What we learned

We learned that the messiness inside these models is real and has to be handled, not ignored, and that an interpretability number means nothing without a baseline next to it. Above all, we confirmed our starting hunch which was that the gap between what a model says and what it's signaling inside is real, measurable, and worth showing to the people who depend on it. Making models more interpretable is the first and most crucial step in safe use of AI.

What's next for GlassBox

Extend this to agents and agentic harnesses. The most natural next step would be to scale up this system to be able to detect hallucinated tool calls, and detecting misaligned behavior.
Building an end to end interpretability agent that can go beyond feature discovery and monitoring. Real time steering of models and reliably setting up guardrails is very powerful.