Sparse Autoencoders Find Highly Interpretable Features in Language Models
Interpretability
Identification
Deep Neural Networks
DOI:
10.48550/arxiv.2309.08600
Publication Date:
2023-01-01
AUTHORS (5)
ABSTRACT
One of the roadblocks to a better understanding of neural networks' internals is polysemanticity, where neurons appear to activate in multiple, semantically distinct contexts. Polysemanticity prevents us from identifying concise, human-understandable explanations for what networks are doing internally. One hypothesised cause of polysemanticity is superposition, where neural networks represent more features than they have neurons by assigning features to an overcomplete set of directions in activation space, rather than to individual neurons. Here, we attempt to identify those directions, using sparse autoencoders to reconstruct the internal activations of a language model. These autoencoders learn sets of sparsely activating features that are more interpretable and monosemantic than directions identified by alternative approaches, where interpretability is measured by automated methods. Moreover, we show that with our learned set of features, we can pinpoint the features that are causally responsible for counterfactual behaviour on the indirect object identification task (Wang et al., 2022) to a finer degree than previous decompositions. This work indicates that it is possible to resolve superposition in language models using a scalable, unsupervised method. Our method may serve as a foundation for future mechanistic interpretability work, which we hope will enable greater model transparency and steerability.
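The abstract describes training sparse autoencoders to reconstruct a language model's internal activations, learning an overcomplete dictionary of sparsely activating feature directions. The sketch below is a minimal illustration of that general setup, not the authors' exact implementation; the dictionary size, L1 sparsity coefficient, and training-loop details are assumptions for the example.

```python
# Minimal sketch of a sparse autoencoder trained to reconstruct language-model
# activations. Dictionary size, the L1 coefficient, and other hyperparameters
# below are illustrative assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        # Overcomplete dictionary: d_dict > d_model feature directions.
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model, bias=False)

    def forward(self, activations: torch.Tensor):
        # Sparse, non-negative feature coefficients.
        features = F.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return reconstruction, features


def train_step(sae, optimizer, activations, l1_coeff=1e-3):
    """One optimisation step: reconstruction loss plus an L1 sparsity penalty."""
    reconstruction, features = sae(activations)
    recon_loss = F.mse_loss(reconstruction, activations)
    sparsity_loss = l1_coeff * features.abs().sum(dim=-1).mean()
    loss = recon_loss + sparsity_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    d_model, d_dict = 512, 4096          # hypothetical activation width and dictionary size
    sae = SparseAutoencoder(d_model, d_dict)
    opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
    batch = torch.randn(64, d_model)     # stand-in for cached LM activations
    print(train_step(sae, opt, batch))
```

In this setup, each row of the decoder weight matrix plays the role of one candidate feature direction in activation space, and the L1 penalty encourages only a few features to be active on any given input.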