Sparse Autoencoders Find Highly Interpretable Features in Language Models
Interpretability
Identification
Deep Neural Networks
DOI:
10.48550/arxiv.2309.08600
Publication Date:
2023-01-01
AUTHORS (5)
ABSTRACT
One of the roadblocks to a better understanding of neural networks' internals is polysemanticity, where neurons appear to activate in multiple, semantically distinct contexts. Polysemanticity prevents us from identifying concise, human-understandable explanations for what networks are doing internally. One hypothesised cause of polysemanticity is superposition, where neural networks represent more features than they have neurons by assigning features to an overcomplete set of directions in activation space, rather than to individual neurons. Here, we attempt to identify those directions, using sparse autoencoders to reconstruct the internal activations of a language model. These autoencoders learn sets of sparsely activating features that are more interpretable and monosemantic than directions identified by alternative approaches, where interpretability is measured by automated methods. Moreover, we show that with our learned set of features, we can pinpoint the features that are causally responsible for counterfactual behaviour on the indirect object identification task (Wang et al., 2022) to a finer degree than previous decompositions. This work indicates that it is possible to resolve superposition in language models using a scalable, unsupervised method. Our method may serve as a foundation for future mechanistic interpretability work, which we hope will enable greater model transparency and steerability.
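The abstract describes training sparse autoencoders to reconstruct a language model's internal activations, learning an overcomplete dictionary of sparsely activating feature directions. The sketch below is a minimal illustration of that general setup, not the authors' exact implementation; the dictionary size, L1 sparsity coefficient, and training-loop details are assumptions for the example.

```python
# Minimal sketch of a sparse autoencoder trained to reconstruct language-model
# activations. Dictionary size, the L1 coefficient, and other hyperparameters
# below are illustrative assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        # Overcomplete dictionary: d_dict > d_model feature directions.
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model, bias=False)

    def forward(self, activations: torch.Tensor):
        # Sparse, non-negative feature coefficients.
        features = F.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return reconstruction, features


def train_step(sae, optimizer, activations, l1_coeff=1e-3):
    """One optimisation step: reconstruction loss plus an L1 sparsity penalty."""
    reconstruction, features = sae(activations)
    recon_loss = F.mse_loss(reconstruction, activations)
    sparsity_loss = l1_coeff * features.abs().sum(dim=-1).mean()
    loss = recon_loss + sparsity_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    d_model, d_dict = 512, 4096          # hypothetical activation width and dictionary size
    sae = SparseAutoencoder(d_model, d_dict)
    opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
    batch = torch.randn(64, d_model)     # stand-in for cached LM activations
    print(train_step(sae, opt, batch))
```

In this setup, each row of the decoder weight matrix plays the role of one candidate feature direction in activation space, and the L1 penalty encourages only a few features to be active on any given input.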