Inspiration
Protein language models (pLMs) have shown success in capturing various features related to protein structure and function, such as contact maps and mutation effects. How exactly do they do this though?
Inspired by recent advances in mechanistic interpretability of language models, where sparse autoencoders (SAEs) have been used to decompose model activations into interpretable features, we wondered whether the same technique could be applied to protein language models.
What it does
Our project uses sparse autoencoders to identify interpretable features in pLMs by encoding their activations. By training SAEs on the ESM2-650M model's activations, we can reveal latent dimensions that correspond to distinct protein features, such as motifs, structural elements, and evolutionary patterns. We also experimented with steering the output of the model by clamping the activation of specific latent dimensions, uncovering how the model responds when particular features are emphasized.
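The core mechanism can be sketched in a few lines: an SAE encodes a model activation vector into a wide, sparse latent space and decodes it back. A minimal top-k variant is shown below, assuming PyTorch; ESM2-650M's hidden size is 1280, but the latent width (8192) and k (64) here are illustrative placeholders, not necessarily our training configuration.

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Top-k sparse autoencoder: keeps only the k largest latent
    activations per token, zeroing the rest before reconstruction."""
    def __init__(self, d_model: int, d_latent: int, k: int):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        z = torch.relu(self.encoder(x))
        # Zero out everything except the top-k activations per token.
        topk = torch.topk(z, self.k, dim=-1)
        sparse = torch.zeros_like(z)
        sparse.scatter_(-1, topk.indices, topk.values)
        return sparse

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encode(x))

# Illustrative sizes: 1280-dim ESM2-650M activations -> 8192 latents, k=64
sae = TopKSAE(d_model=1280, d_latent=8192, k=64)
acts = torch.randn(4, 1280)              # a batch of token activations
z = sae.encode(acts)
recon = sae(acts)
assert (z != 0).sum(dim=-1).max() <= 64  # at most k active latents per token
```

The top-k constraint enforces sparsity directly, so each token's activation is explained by only a handful of latent dimensions, which is what makes the latents candidates for interpretation.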
How we built it
We trained SAEs on activations from ESM2-650M using a dataset of 1 million protein sequences from UniRef50. We chose top-k sparse autoencoders to focus on the most important dimensions and employed 1-dimensional logistic regression to evaluate each latent dimension on binary classification tasks. These tasks were sourced from Swiss-Prot annotations, covering structural elements such as alpha helices and beta strands. Additionally, we built a visualization tool to inspect and analyze latent activation patterns.
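The per-latent evaluation can be sketched as follows: for each latent dimension, fit a logistic regression from that single dimension's activations to a binary annotation and score it with ROC AUC. This is a hedged reconstruction using scikit-learn, with toy data standing in for real activations and Swiss-Prot labels; in practice the AUC should be computed on held-out data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def probe_latent(latent_acts: np.ndarray, labels: np.ndarray) -> float:
    """Fit a 1-D logistic regression from one latent dimension's
    activations to a binary annotation (e.g. 'is alpha helix') and
    return ROC AUC as that latent's score for the concept."""
    clf = LogisticRegression()
    clf.fit(latent_acts.reshape(-1, 1), labels)
    scores = clf.predict_proba(latent_acts.reshape(-1, 1))[:, 1]
    return roc_auc_score(labels, scores)

# Toy example: a latent that fires strongly on positive residues only.
acts = np.concatenate([np.random.rand(100) * 0.1, np.random.rand(100) + 1.0])
labels = np.concatenate([np.zeros(100), np.ones(100)])
print(probe_latent(acts, labels))  # 1.0 for a perfectly separable latent
```

A latent with high AUC for some annotation is a candidate interpretable feature; a latent near 0.5 carries no information about that concept.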
Challenges we ran into
Initially, while running logistic regression on latent dimensions, we found only features that corresponded to amino acid identity and struggled to uncover higher-level, interpretable features. The complex and sometimes unexpected latent activation patterns led us to dig deeper through our visualizer. Another challenge was getting meaningful output when steering the model using activation clamping. While some dimensions resulted in clear changes, many others led to minimal alterations in the generated sequences.
- Automated evaluation and interpretation of SAE features is hard because our assumptions about what features should look like are mostly wrong.
Accomplishments that we're proud of
We are proud to have trained a set of SAEs that successfully identified interpretable features in a protein language model. Beyond simple structural patterns like beta strands or alpha helices, we discovered nuanced, context-dependent features: latent dimensions that fire only on specific regions, such as beta strands near binding sites or transitions between different secondary structures. The successful steering of ESM2's output by clamping amino acid-related dimensions was another exciting achievement.
- First demonstration (to our knowledge) of applying SAEs to a protein language model
- An SAE feature explorer that aids in the interpretation of features learned by SAEs
- A set of trained SAEs for ESM2-650M that other people can experiment with
What we learned
- Because it is hard to guess what the SAE features look like, visualizations are the most powerful way to interpret SAE latents.
- ESM2 explicitly stores information about amino acid identity.
- We can steer latent dimensions to design proteins with desired properties amplified.
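The steering described above can be sketched as: clamp one latent dimension to a fixed value in the SAE's latent space, decode back, and feed the steered activations into the rest of the model. The `SAE` stub below is a minimal stand-in for a trained autoencoder, and the latent index and clamp value are hypothetical.

```python
import torch
import torch.nn as nn

class SAE(nn.Module):
    """Minimal stand-in for a trained sparse autoencoder."""
    def __init__(self, d_model: int = 1280, d_latent: int = 8192):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.encoder(x))

def steer(hidden: torch.Tensor, sae: SAE,
          latent_idx: int, value: float) -> torch.Tensor:
    """Clamp one latent dimension to a fixed value and decode back,
    producing steered activations for the downstream layers."""
    z = sae.encode(hidden)
    z[..., latent_idx] = value  # clamp the chosen feature on
    return sae.decoder(z)

sae = SAE()
hidden = torch.randn(10, 1280)  # activations for a 10-residue sequence
steered = steer(hidden, sae, latent_idx=42, value=8.0)
assert steered.shape == hidden.shape
```

In practice the steered reconstruction would be substituted back into the transformer (e.g. via a forward hook on the layer the SAE was trained on) before continuing the forward pass.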
What's next for Learning features in pLM using sparse autoencoders
Next, we plan to explore models like ESM3 to take advantage of its generation capabilities, enabling deeper investigation into model steering. We also aim to refine our visualizer, making it more user-friendly for broader use in the research community.
- Add more features to the visualizer, such as letting users upload their own sequences to see how a feature activates on them and to visualize the effects of clamping.
- Analyze the other SAEs trained with different hyperparameter settings and on different ESM2 layers, and make these available in the visualizer.
- Repeat this on other ML models in biology such as AlphaFold and Evo.
- Experiment with other SAE architectures such as JumpReLU.
Built With
- amazon-ec2
- amazon-web-services
- esm2
- pytorch
