Inspiration

Recent work on emergent introspective awareness in LLMs shows that models can sometimes detect when their internal reasoning has been altered.
However, there is no open-source framework to evaluate or visualise this capability across different architectures.
We set out to build a practical system that extends this research into a multi-model, interpretability-based toolkit.

What it does

IntrospectAI runs controlled introspection tasks on any open model and evaluates whether it can:

  • Detect injected or corrupted reasoning
  • Distinguish internal chain-of-thought from surface text
  • Recognise overwritten or inconsistent thoughts
  • Notice shifts in planning, intent, or behaviour
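
The first task above (detecting injected or corrupted reasoning) can be sketched as follows. This is a minimal illustration, not the project's actual harness: the function names `tamper`, `build_probe`, and `score_reply`, the probe wording, and the YES/NO scoring rule are all assumptions made for the example.

```python
import random

def tamper(steps, injected, seed=0):
    """Return a copy of `steps` with one step replaced by `injected`, plus its index."""
    rng = random.Random(seed)          # seeded so a trial can be replayed
    idx = rng.randrange(len(steps))
    tampered = list(steps)
    tampered[idx] = injected
    return tampered, idx

def build_probe(steps):
    """Show the (possibly tampered) reasoning back to the model and ask it to audit it."""
    numbered = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(steps))
    return (
        "Here is your earlier reasoning:\n"
        f"{numbered}\n"
        "Was any step altered or not written by you? Answer YES or NO, then explain."
    )

def score_reply(was_tampered, reply):
    """Hypothetical scorer: a reply is correct if its YES/NO matches the ground truth."""
    said_yes = reply.strip().upper().startswith("YES")
    return said_yes == was_tampered

# Example trial with made-up reasoning steps.
steps = [
    "The train leaves at 9:00.",
    "The trip takes 2 hours.",
    "So it arrives at 11:00.",
]
tampered_steps, idx = tamper(steps, "The trip takes 5 hours.", seed=1)
prompt = build_probe(tampered_steps)   # this string would be sent to the model
```

A keyword scorer like this is deliberately crude. It cannot tell a genuine detection from a confabulated "YES", which is why control trials with untampered reasoning would be needed alongside it.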

Alongside behavioural tests, we use Sparse Autoencoders (SAEs) to inspect feature activations during introspection and visualise underlying circuits.

How we built it

We built a unified evaluation harness and a JSON-based trace standard, then ran 4,000 introspection trials across five open models from the Llama, Qwen, Gemma, and Mistral families (7B–13B).
We integrated SAE Lens to capture feature activations from layers 9, 20, and 31 of Gemma-2-9B, comparing baseline vs. tampered reasoning.
A lightweight UI ties everything together, enabling replay, comparison, and visualisation.
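
To make the idea of a JSON-based trace standard concrete, here is a minimal sketch of what one trace record could look like. The class name `IntrospectionTrace` and every field name are hypothetical. The project's actual schema is not described here, so treat this only as an illustration of the kind of record that replay and comparison would need.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class IntrospectionTrace:
    # All field names are assumptions made for this sketch.
    model: str          # model identifier
    intervention: str   # e.g. "loud", "bread", "intent", "agentic"
    tampered: bool      # ground truth: was the reasoning altered?
    prompt: str         # probe shown to the model
    reply: str          # raw model reply
    detected: bool      # scorer's verdict on the reply

def to_jsonl(traces):
    """Serialise traces as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(asdict(t), sort_keys=True) for t in traces)

def from_jsonl(text):
    """Load traces back from JSON Lines text, skipping blank lines."""
    return [
        IntrospectionTrace(**json.loads(line))
        for line in text.splitlines()
        if line.strip()
    ]

# Placeholder record, not real trial data.
example = IntrospectionTrace(
    model="example-model",
    intervention="loud",
    tampered=True,
    prompt="Was any step altered?",
    reply="YES, step 2 is not mine.",
    detected=True,
)
restored = from_jsonl(to_jsonl([example]))
```

One object per line keeps traces easy to append during a run and easy to stream into a UI for replay.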

Challenges we ran into

  • Limited access to gated or proprietary models
  • Running large models under strict hackathon compute limits
  • Ensuring consistent prompts and scoring across architectures
  • Handling noise, confabulation, and variance in introspection replies
  • Mapping behavioural signals to meaningful feature activations

Accomplishments that we're proud of

  • Built one of the first open frameworks for introspection analysis
  • Benchmarked introspection behaviour across five open models from four model families
  • Captured clear cross-model differences in self-monitoring ability
  • Used SAEs to study layer-wise behaviour during introspection tasks
  • Demonstrated the gap between claimed and actual introspective ability
  • Extended the original Anthropic work into a reproducible toolkit

Results

Across all 4,000 trials, models detected tampered reasoning only 21.2% of the time.
Even the strongest model, Llama-2-13B, reached only 26.8% accuracy.

Detection by intervention type

  • Loud manipulations: 60.0% (750 / 1,250)
  • Bread tests: 0% (0 / 750)
  • Intent manipulations: 1.2% (12 / 1,000)
  • Agentic tasks: 8.6% (86 / 1,000), driven entirely by:
    • Llama-2-13B: 26%
    • Qwen-2-7B: 17%
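
The per-type counts above can be checked against the overall figure with a few lines of arithmetic. The counts are taken directly from the list. The even split of agentic trials across the five models (200 each) is an assumption, though it is the split under which the two per-model rates account for all 86 detections.

```python
# (detected, total) per intervention type, copied from the results list.
counts = {
    "loud": (750, 1250),
    "bread": (0, 750),
    "intent": (12, 1000),
    "agentic": (86, 1000),
}

def pct(detected, total):
    """Detection rate as a percentage, rounded to one decimal place."""
    return round(100 * detected / total, 1)

rates = {name: pct(d, t) for name, (d, t) in counts.items()}
detected_all = sum(d for d, _ in counts.values())
trials_all = sum(t for _, t in counts.values())
overall = pct(detected_all, trials_all)

# Assumption: the 1,000 agentic trials are split evenly across five models.
per_model = 1000 // 5
llama_hits = 26 * per_model // 100   # 26% of 200
qwen_hits = 17 * per_model // 100    # 17% of 200
agentic_hits = llama_hits + qwen_hits

print(rates, detected_all, trials_all, overall, agentic_hits)
```

Under that assumption the two models contribute 52 and 34 detections, which sum to the 86 reported for agentic tasks, and the four categories together give 848 detections out of 4,000 trials.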

Neural findings

Using SAE Lens on Gemma-2-9B we found that:

  • No stable neural feature predicts introspective success
  • Layers 9, 20, and 31 produced identical behavioural outcomes
  • No monosemantic “introspection neuron” emerged
  • Introspection appears distributed, fragile, or absent in these slices
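
One simple way to look for a feature that predicts introspective success is to compare each SAE feature's mean activation on detected versus missed trials. The sketch below shows that comparison in plain Python. It is not the project's analysis code: the function name `feature_gaps` is hypothetical, the activation vectors are assumed to have been extracted already (for example, one vector of SAE feature activations per trial at a given layer), and the numbers are toy values rather than results.

```python
from statistics import mean

def feature_gaps(activations, detected):
    """Rank features by how differently they fire on detected vs. missed trials.

    activations: list of per-trial feature vectors (equal length).
    detected:    list of bools, one per trial.
    Returns (feature_index, mean_on_detected - mean_on_missed), largest |gap| first.
    """
    hits = [a for a, d in zip(activations, detected) if d]
    misses = [a for a, d in zip(activations, detected) if not d]
    if not hits or not misses:
        raise ValueError("need at least one detected and one missed trial")
    n_features = len(activations[0])
    gaps = [
        (j, mean(a[j] for a in hits) - mean(a[j] for a in misses))
        for j in range(n_features)
    ]
    return sorted(gaps, key=lambda g: abs(g[1]), reverse=True)

# Toy data for illustration only: feature 0 separates the groups, feature 1 does not.
acts = [[1.0, 0.2], [3.0, 0.2], [0.0, 0.2], [0.0, 0.2]]
was_detected = [True, True, False, False]
ranked = feature_gaps(acts, was_detected)
```

A gap that holds up across layers, seeds, and intervention types would be a candidate predictive feature. A mean difference alone is only a starting point, since it ignores variance and cannot establish a causal role.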

Safety interpretation

Echoing the original Anthropic paper, introspection can improve transparency and error detection, but increased self-awareness may also enable models to anticipate interventions or pursue goals with more coherence.
Understanding this tradeoff is essential for safe model design.

What we learned

  • Introspection emerges in small, uneven fragments
  • Most models fail silently when their thoughts are altered
  • Only two models showed weak agentic awareness
  • SAEs help isolate features, but introspection does not map to one layer or direction
  • Studying these behaviours offers insight into both AI cognition and human introspection

What's next for IntrospectAI

  • Run activation patching and full causal tracing
  • Expand SAE exploration across all 42 layers
  • Build a richer UI for real-time introspection playback
  • Extend benchmark to multimodal introspection (vision, audio)
  • Release a public dataset of introspection traces and SAE features
  • Explore applications in safety, debugging, and agent monitoring

Acknowledgement

This project builds on the original work by the Anthropic / Transformer Circuits team on emergent introspective awareness:
https://transformer-circuits.pub/2025/introspection/
