Zhengxuan Wu

Assessing the accuracy of neuron explanations in language models reveals significant flaws in those explanations.

2025-09-24T10:54:24+00:00 ― 5 min read

Explore how interpretability illusions affect our view of neural networks.

2025-09-14T21:24:42+00:00 ― 7 min read

A study assessing various methods for interpreting language model neurons.

2025-09-03T17:28:12+00:00 ― 7 min read