Simple Science

Cutting edge science explained simply

# Computer Science # Machine Learning # Computation and Language

Evaluating Sparse Autoencoders with SHIFT and TPP

New metrics improve understanding of Sparse Autoencoders in neural networks.

Adam Karvonen, Can Rager, Samuel Marks, Neel Nanda

― 7 min read



Sparse Autoencoders (SAEs) help make sense of how neural networks work by breaking down their activations into understandable parts. A big problem in this area is that we don't have good ways to measure how well SAEs are doing: most past studies have relied on proxy metrics that aren't very reliable. In this work, we present new ways to assess SAEs using a method called SHIFT, which removes the parts of a model's computation that shouldn't matter for the task at hand. We also introduce the Targeted Probe Perturbation (TPP) method, which measures how well an SAE can tell apart similar concepts.

The Challenge

SAEs are a useful tool for understanding neural networks. This year, many new types of SAEs have been developed, such as TopK and Gated SAEs. However, there is still a major issue: we lack reliable metrics to track progress in this area. Unlike other machine learning tasks that have straightforward goals, evaluating SAEs for interpretability lacks a clear standard.

The usual metrics like sparsity and fidelity do not always match what we want in terms of understanding the model better. This disconnect makes it hard to know if improvements in SAEs really enhance their interpretability or if they just improve these proxy metrics.

The Solution

To tackle this, we propose measuring SAEs based on how well they work for tasks outside of their training. The SHIFT method helps assess how well an SAE can identify and remove parts of a model that contribute to biased predictions. By using SHIFT, researchers can see which features influence a neural network’s outputs and which do not matter. We created new evaluations based on SHIFT called Spurious Correlation Removal (SCR) to assess an SAE's effectiveness in separating different concepts.

However, SCR has limitations when trying to scale across various types of data. To overcome this, we developed the TPP method, which looks at how an SAE can identify and change one specific class while leaving others alone. For both SCR and TPP, we choose SAE features using scores that reflect how much they affect the classification task.

Methods and Contributions

Our main contributions are:

  1. Adapting SHIFT: We adjusted the spurious correlation removal task in SHIFT to function as an evaluation tool for SAEs.
  2. Introducing TPP: We developed the Targeted Probe Perturbation metric to evaluate SAEs across various datasets.
  3. Open-Source Suite: We trained and made available a collection of SAEs and tested our metrics using different language models and datasets.

SAEs aim to find a set of understandable features from a neural network's internal workings. A good SAE should be true to the model's processes and be able to separate human-understandable concepts.
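To make this concrete, here is a minimal sketch of the kind of sparse autoencoder being discussed: a standard ReLU SAE that encodes a model activation into a wide vector of latent features and reconstructs it, trained with a reconstruction loss plus an L1 sparsity penalty. The class name, dimensions, and coefficient here are illustrative, not the paper's training code.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy ReLU SAE: encodes a model activation x into a (much wider)
    sparse vector of latent features f, then reconstructs x from f."""
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_sae)
        self.decoder = nn.Linear(d_sae, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))   # sparse latent features
        x_hat = self.decoder(f)           # reconstruction of the activation
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    """Reconstruction error plus an L1 penalty that encourages only a
    few latents to be active per input (coefficient is illustrative)."""
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity
```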

Traditionally, people have used two main unsupervised metrics to evaluate SAEs:

  1. The cross-entropy loss recovered: This checks how much of the original model's performance is kept when its activations are replaced by the SAE's reconstruction.
  2. The L0-norm of feature activations: This measures how many features are active for a given input (both proxies are sketched in code below).
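For reference, here is a hedged sketch of how these two proxies are often computed in practice; the exact conventions (for example, what the "loss recovered" number is normalized against) vary between papers.

```python
import torch

def l0_norm(f: torch.Tensor) -> float:
    """Average number of active SAE latents per input (the L0 proxy)."""
    return (f != 0).float().sum(dim=-1).mean().item()

def loss_recovered(ce_clean: float, ce_sae: float, ce_zero: float) -> float:
    """Fraction of cross-entropy loss recovered when the activation is
    replaced by the SAE reconstruction, relative to zero-ablating it.
    (A common convention; exact definitions differ between papers.)"""
    return (ce_zero - ce_sae) / (ce_zero - ce_clean)
```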

Recent studies have looked into evaluating SAEs using board games, circuits, and specific language concepts. The goal of concept removal is to find and eliminate unwanted ideas from a model while keeping its overall performance intact. Our aim is not to improve current methods for removing concepts but to turn these tasks into metrics for assessing SAE progress.

Evaluating Concept Isolation

In this research, we focus on how well an SAE can isolate different concepts as a main measure of its quality. To test our methods, we follow a systematic approach:

  1. Train a classifier for a specific concept.
  2. Identify the SAE features that relate to that concept.
  3. Check if removing features related to the concept affects the classifier as intended.

With a good SAE, removing the features tied to a concept should have a clear, targeted effect on the classifier's accuracy. Our SHIFT and TPP metrics operationalize this idea.
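A minimal sketch of this shared evaluation loop, assuming the toy SAE interface above and a linear probe with a single output logit (all names are illustrative):

```python
import torch

@torch.no_grad()
def concept_isolation_eval(probe, sae, acts, labels, concept_latents):
    """Illustrative version of the shared evaluation loop: measure a
    linear probe's accuracy before and after zero-ablating the SAE
    latents selected for a concept. `probe` maps activations to a
    single logit; `acts` has shape [N, d_model]."""
    def accuracy(a):
        preds = (probe(a).squeeze(-1) > 0).long()
        return (preds == labels).float().mean().item()

    base_acc = accuracy(acts)

    x_hat, f = sae(acts)                          # encode into SAE latents
    f[:, concept_latents] = 0.0                   # remove concept-related latents
    acts_ablated = acts - x_hat + sae.decoder(f)  # keep the SAE's reconstruction error

    return base_acc, accuracy(acts_ablated)
```

The ablated activation is built as the original activation minus the change in the SAE's reconstruction, so the SAE's reconstruction error is left untouched and only the selected latents are removed.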

Validation and Sanity Checks

To ensure that our metrics are valid, we run several tests to see if they align with the expected properties of SAEs. Each subsection below details the evaluation steps, and more information is available in the appendix.

SAE Latent Selection

Choosing which SAE features to evaluate requires figuring out which ones are most relevant for a specific concept. We do this by ranking their effects on a classifier and may filter these features for interpretability.

To find the most relevant features, we train linear probes on the model's activations and compute scores that reflect how much each SAE feature contributes to the probe's decision, then select the top-scoring ones. We can also use an LLM judge to assess whether a feature is understandable based on the contexts in which it activates.
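As a simplified stand-in for this selection step, one could rank latents by how strongly their decoder directions align with the probe and how often they fire; the paper uses attribution-style scores, so treat this only as an illustration of the idea.

```python
import torch

@torch.no_grad()
def select_top_latents(probe, sae, acts, k: int = 20):
    """Simplified stand-in for latent selection: score each SAE latent by
    how strongly its decoder direction aligns with the probe direction,
    weighted by how active the latent is on average. (The paper uses
    attribution-style scores; this is only an illustration.)"""
    _, f = sae(acts)                              # latent activations, [N, d_sae]
    probe_dir = probe.weight.squeeze(0)           # probe direction, [d_model]
    alignment = sae.decoder.weight.T @ probe_dir  # per-latent alignment, [d_sae]
    scores = f.mean(dim=0) * alignment            # mean activation x alignment
    return torch.topk(scores.abs(), k).indices    # indices of the top-k latents
```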

Applying SHIFT and TPP

The SHIFT method needs datasets that connect text to two binary labels. We use the Bias in Bios dataset for profession and gender classifications and the Amazon reviews dataset for product categories and ratings.

We filter both datasets down to two labels and train a classifier on the biased data. We then ablate the selected SAE features, using the process described earlier, and check how well the classifier works once the spurious signal is removed.
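One plausible way to report the outcome is a normalized score comparing the biased probe before and after ablation against a probe trained without the spurious cue. The paper's exact normalization may differ, so the function below is only an illustrative sketch.

```python
def scr_score(acc_biased: float, acc_ablated: float, acc_skyline: float) -> float:
    """Illustrative normalized score: how much of the gap between the
    biased probe and a probe trained without the spurious cue is closed
    by ablating the selected SAE latents. (Hedged sketch; the paper's
    exact normalization may differ.)"""
    return (acc_ablated - acc_biased) / (acc_skyline - acc_biased)

# Example: a biased probe at 70% accuracy on the intended label rises to
# 85% after ablation; a probe trained without the cue reaches 90%.
print(scr_score(0.70, 0.85, 0.90))  # -> 0.75
```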

The TPP approach generalizes SHIFT and works for any text classification dataset. Here, we find SAE features that help differentiate classes and check how well removing them affects model accuracy.
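Here is a hedged sketch of that idea, reusing the toy SAE and probe interfaces from the earlier snippets: ablating the latents selected for class i should hurt class i's probe much more than the probes for the other classes. The exact aggregation used in the paper may differ.

```python
import torch

@torch.no_grad()
def tpp_score(probes, sae, acts, labels_per_class, latents_per_class):
    """Hedged sketch of Targeted Probe Perturbation: ablating the latents
    selected for class i should hurt class i's probe much more than the
    probes for the other classes. Score = mean targeted accuracy drop
    minus mean off-target drop (the paper's aggregation may differ)."""
    def accuracy(probe, a, y):
        return ((probe(a).squeeze(-1) > 0).long() == y).float().mean().item()

    n = len(probes)
    drops = torch.zeros(n, n)  # drops[i, j]: effect on probe j of ablating class i's latents
    for i, latents in enumerate(latents_per_class):
        x_hat, f = sae(acts)
        f[:, latents] = 0.0
        acts_abl = acts - x_hat + sae.decoder(f)
        for j, probe in enumerate(probes):
            drops[i, j] = (accuracy(probe, acts, labels_per_class[j])
                           - accuracy(probe, acts_abl, labels_per_class[j]))

    targeted = drops.diag().mean()
    off_target = (drops.sum() - drops.diag().sum()) / (n * (n - 1))
    return (targeted - off_target).item()
```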

Experimental Results

We trained SAEs on two models, Pythia-70M and Gemma-2-2B, to test our metrics. Both metrics show that ablating the selected SAE features can remove the spurious signal and improve the classifier's accuracy on the intended label. The SHIFT evaluation distinguishes between various SAE types and architectures.

Findings

The results consistently show that TopK and JumpReLU architectures outperform Standard SAEs. We also note that the performance of SAEs improves during training, with the first part of training contributing significantly to overall score gains.

Our findings indicate that most top SAE features, regardless of selection method, are seen as interpretable by the LLM judge. The noise-informed method, which doesn't require the LLM, is faster and provides decent evaluations.

Discussion and Limitations

Our experiments confirm that SHIFT and TPP successfully differentiate between different SAE architectures. However, the best sparsity levels for each metric vary. More work is needed to relate the TPP metric to sparsity measurements.

The LLM judge we used has a lower standard for interpretability than other implementations. While our simpler methods are faster and cheaper, they can miss some interpretations. Thus, there's a balance between quality and efficiency when deciding whether to use the LLM judge.

SHIFT and TPP depend on human-chosen notions of which concepts an SAE should learn, which may not match what the model actually represents. This reliance can overlook important features.

Despite their strengths, both metrics have limitations in terms of complexity and undefined parameters. They should complement other evaluation methods rather than serve as standalone measures.

Conclusion

The SHIFT and TPP methods provide valuable tools for assessing Sparse Autoencoders. They are easy to apply across different datasets, reflect improvements during training, and can be computed quickly. We recommend that researchers use our metrics to evaluate their own SAEs and to track training progress.

Acknowledgments

This research was supported by the ML Alignment Theory Scholars Program. We thank all those who contributed their insights and expertise during this project. Additionally, we appreciate the computational resources provided by various labs.

Future Directions

In the future, we aim to improve evaluations that cover not just causal isolation but also other important qualities of SAEs. We recognize that developing a comprehensive framework to examine all aspects of SAE quality remains a significant challenge.

Probe Training Insights

When training probes on biased datasets, it's crucial to balance the signals detected. If a probe is heavily biased towards one label, it limits the effectiveness of removing unwanted features. We found that adjusting batch sizes and learning rates can significantly affect probe accuracy.

To minimize dependence on dataset labels, we averaged scores over multiple class pairs. By selecting pairs with at least 60% accuracy for both classes, we could improve the reliability of our evaluations.
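A simple sketch of such a balanced probe-training loop on cached activations (hyperparameters and names are illustrative):

```python
import torch
import torch.nn as nn

def train_balanced_probe(acts, labels, d_model, steps=200, lr=1e-3, batch_size=256):
    """Sketch of training a linear probe on cached activations with
    class-balanced batches so one label cannot dominate the signal.
    (Hyperparameters and names are illustrative.)"""
    probe = nn.Linear(d_model, 1)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()

    pos = torch.where(labels == 1)[0]
    neg = torch.where(labels == 0)[0]
    for _ in range(steps):
        # Sample equal numbers of positive and negative examples per batch.
        idx = torch.cat([pos[torch.randint(len(pos), (batch_size // 2,))],
                         neg[torch.randint(len(neg), (batch_size // 2,))]])
        loss = loss_fn(probe(acts[idx]).squeeze(-1), labels[idx].float())
        opt.zero_grad()
        loss.backward()
        opt.step()
    return probe
```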

Sparse Autoencoder Training Procedures

We train and make available a variety of SAEs based on the Pythia-70M and Gemma-2-2B models. Our training parameters aim to ensure good feature identification across different datasets.

With our findings, we hope to encourage more research in SAE evaluation methods, enhancing the understanding of how these models operate and are improved over time.

Original Source

Title: Evaluating Sparse Autoencoders on Targeted Concept Erasure Tasks

Abstract: Sparse Autoencoders (SAEs) are an interpretability technique aimed at decomposing neural network activations into interpretable units. However, a major bottleneck for SAE development has been the lack of high-quality performance metrics, with prior work largely relying on unsupervised proxies. In this work, we introduce a family of evaluations based on SHIFT, a downstream task from Marks et al. (Sparse Feature Circuits, 2024) in which spurious cues are removed from a classifier by ablating SAE features judged to be task-irrelevant by a human annotator. We adapt SHIFT into an automated metric of SAE quality; this involves replacing the human annotator with an LLM. Additionally, we introduce the Targeted Probe Perturbation (TPP) metric that quantifies an SAE's ability to disentangle similar concepts, effectively scaling SHIFT to a wider range of datasets. We apply both SHIFT and TPP to multiple open-source models, demonstrating that these metrics effectively differentiate between various SAE training hyperparameters and architectures.

Authors: Adam Karvonen, Can Rager, Samuel Marks, Neel Nanda

Last Update: 2024-11-27 00:00:00

Language: English

Source URL: https://arxiv.org/abs/2411.18895

Source PDF: https://arxiv.org/pdf/2411.18895

Licence: https://creativecommons.org/licenses/by/4.0/

Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.

Thank you to arxiv for use of its open access interoperability.
