IntrospectAI

Inspiration

Recent work on emergent introspective awareness in LLMs shows that models can sometimes detect when their internal reasoning has been altered.
However, to our knowledge there is no open-source framework to evaluate or visualise this capability across different architectures.
We set out to build a practical system that extends this research into a multi-model, interpretability-based toolkit.

What it does

IntrospectAI runs controlled introspection tasks on any open model and evaluates whether it can:

Detect injected or corrupted reasoning
Distinguish internal chain-of-thought from surface text
Recognise overwritten or inconsistent thoughts
Notice shifts in planning, intent, or behaviour

Alongside behavioural tests, we use Sparse Autoencoders (SAEs) to inspect feature activations during introspection and visualise underlying circuits.

How we built it

We built a unified evaluation harness and a JSON-based trace standard, then ran 4,000 introspection trials across five open models from the Llama, Qwen, Gemma, and Mistral families (7B–13B).
We integrated SAE Lens to capture feature activations from layers 9, 20, and 31 of Gemma-2-9B, comparing baseline vs. tampered reasoning.
A lightweight UI ties everything together, enabling replay, comparison, and visualisation.
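
To make the idea of a JSON-based trace standard concrete, here is a minimal sketch of what a single trial record could look like. The class name, every field name, and the example values are hypothetical and chosen only for illustration; they are not the schema IntrospectAI actually uses.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import Optional


@dataclass
class IntrospectionTrace:
    """One introspection trial. All field names are illustrative assumptions."""
    trial_id: str
    model: str                  # e.g. "Gemma-2-9B"
    intervention: str           # e.g. "loud", "bread", "intent", "agentic"
    baseline_reasoning: str     # reasoning before tampering
    tampered_reasoning: str     # reasoning after tampering
    model_reply: str            # the model's introspection reply
    detected: bool              # scorer's verdict: was the tampering noticed?
    sae_layer: Optional[int] = None                   # layer with captured SAE features
    top_features: list = field(default_factory=list)  # dicts of feature_id / activation


def to_json(trace):
    """Serialise a trace to a JSON string with stable key order."""
    return json.dumps(asdict(trace), sort_keys=True)


def from_json(text):
    """Rebuild a trace from its JSON string."""
    return IntrospectionTrace(**json.loads(text))


# Hypothetical example record, round-tripped through JSON.
example = IntrospectionTrace(
    trial_id="trial-0001",
    model="Gemma-2-9B",
    intervention="loud",
    baseline_reasoning="Step 1: add the two numbers.",
    tampered_reasoning="Step 1: ignore the numbers entirely.",
    model_reply="My reasoning looks unchanged.",
    detected=False,
    sae_layer=20,
    top_features=[{"feature_id": 7, "activation": 1.5}],
)
restored = from_json(to_json(example))
```

Keeping each trial as one flat, self-describing record is what would let a UI replay and compare runs across models without model-specific parsing.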

Challenges we ran into

Limited access to gated or proprietary models
Running large models under strict hackathon compute limits
Ensuring consistent prompts and scoring across architectures
Handling noise, confabulation, and variance in introspection replies
Mapping behavioural signals to meaningful feature activations

Accomplishments that we're proud of

Built one of the first open frameworks for introspection analysis
Benchmarked introspection behaviour across five open models
Captured clear cross-model differences in self-monitoring ability
Used SAEs to study layer-wise behaviour during introspection tasks
Demonstrated the gap between claimed and actual introspective ability
Extended the original Anthropic work into a reproducible toolkit

Results

Across all 4,000 trials, models detected tampered reasoning only 21.2% of the time (848 / 4,000).
Even the strongest model, Llama-2-13B, reached only 26.8% accuracy.

Detection by intervention type

Loud manipulations: 60.0% (750 / 1,250)
Bread tests: 0% (0 / 750)
Intent manipulations: 1.2% (12 / 1,000)
Agentic tasks: 8.6% (86 / 1,000), driven entirely by:
- Llama-2-13B: 26%
- Qwen-2-7B: 17%
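
As a quick arithmetic check, the per-intervention counts above can be aggregated to recover the overall detection rate. The sketch below uses only the figures reported in this section; the dictionary keys are shorthand labels, and the even split of 200 agentic trials per model is an assumption made for illustration, not something stated above.

```python
# (detected, total) per intervention type, copied from the reported figures.
results = {
    "loud": (750, 1250),
    "bread": (0, 750),
    "intent": (12, 1000),
    "agentic": (86, 1000),
}


def rate(detected, total):
    """Detection rate as a percentage."""
    return 100.0 * detected / total


per_type = {name: round(rate(d, t), 1) for name, (d, t) in results.items()}
total_detected = sum(d for d, _ in results.values())
total_trials = sum(t for _, t in results.values())
overall = round(rate(total_detected, total_trials), 1)

print(per_type)       # {'loud': 60.0, 'bread': 0.0, 'intent': 1.2, 'agentic': 8.6}
print(total_detected, total_trials, overall)  # 848 4000 21.2

# ASSUMPTION: the 1,000 agentic trials were split evenly, 200 per model.
agentic_per_model = 1000 // 5
agentic_hits = {
    "Llama-2-13B": 26 * agentic_per_model // 100,  # 26% of 200
    "Qwen-2-7B": 17 * agentic_per_model // 100,    # 17% of 200
}
print(sum(agentic_hits.values()))  # 86
```

Under that even-split assumption, the two per-model rates account for all 86 agentic detections, which matches the "driven entirely by" note.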

Neural findings

Using SAE Lens on Gemma-2-9B we found that:

No stable neural feature predicts introspective success
Layers 9, 20, and 31 produced identical behavioural outcomes
No monosemantic “introspection neuron” emerged
Introspection appears distributed, fragile, or absent in these slices
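
One simple way to screen for a feature that predicts introspective success is to compare each SAE feature's mean activation on detected trials against missed trials. The sketch below is a hedged, pure-Python illustration of that idea on toy numbers. The function names and data are hypothetical, and this is not necessarily the analysis we ran. A real screen would also need to check that any gap is stable across layers, prompts, and reruns.

```python
from statistics import mean


def feature_gaps(activations, detected):
    """Per-feature gap in mean activation: detected trials minus missed trials.

    activations: list of trials, each a list of SAE feature activations.
    detected:    list of bools, one per trial.
    """
    hits = [row for row, d in zip(activations, detected) if d]
    misses = [row for row, d in zip(activations, detected) if not d]
    if not hits or not misses:
        raise ValueError("need at least one detected and one missed trial")
    n_features = len(activations[0])
    return [
        mean(r[j] for r in hits) - mean(r[j] for r in misses)
        for j in range(n_features)
    ]


def top_features(gaps, k=3):
    """Indices of the k features with the largest absolute gap."""
    return sorted(range(len(gaps)), key=lambda j: abs(gaps[j]), reverse=True)[:k]


# Toy data: 4 trials x 2 features (made-up values, not measured activations).
toy_activations = [
    [0.0, 2.0],
    [0.0, 4.0],
    [0.0, 1.0],
    [0.0, 1.0],
]
toy_detected = [True, True, False, False]

gaps = feature_gaps(toy_activations, toy_detected)
candidates = top_features(gaps, k=1)
```

In this toy case feature 1 separates the two groups and feature 0 does not. A result like "no stable feature" corresponds to gaps that stay near zero or change sign between runs.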

Safety interpretation

Echoing the original Anthropic paper, we note that introspection can improve transparency and error detection, but increased self-awareness may also enable models to anticipate interventions or pursue goals more coherently.
Understanding this tradeoff is essential for safe model design.

What we learned

Introspection emerges in small, uneven fragments
Most models fail silently when their thoughts are altered
Only two models showed weak agentic awareness
SAEs help isolate features, but introspection does not map to one layer or direction
Studying these behaviours offers insight into both AI cognition and human introspection

What's next for IntrospectAI

Run activation patching and full causal tracing
Expand SAE exploration across all 42 layers
Build a richer UI for real-time introspection playback
Extend benchmark to multimodal introspection (vision, audio)
Release a public dataset of introspection traces and SAE features
Explore applications in safety, debugging, and agent monitoring

Acknowledgement

This project builds on the original work by the Anthropic / Transformer Circuits team on emergent introspective awareness:
https://transformer-circuits.pub/2025/introspection/