Lee Sharkey

ORCID: 0009-0009-2137-6027
Research Areas
  • Adversarial Robustness in Machine Learning
  • Explainable Artificial Intelligence (XAI)
  • Natural Language Processing Techniques
  • Neural Networks and Applications
  • Empathy and Medical Education
  • Machine Learning and Data Classification
  • Information and Cyber Security
  • Topic Modeling
  • Privacy-Preserving Technologies in Data
  • Ethics and Social Impacts of AI
  • Generative Adversarial Networks and Image Synthesis
  • Statistical and Computational Modeling
  • Chronic Disease Management Strategies
  • Multimodal Machine Learning Applications
  • Palliative Care and End-of-Life Issues
  • Security and Verification in Computing
  • Model Reduction and Neural Networks
  • Video Analysis and Summarization
  • Text and Document Classification Technologies
  • Health Systems, Economic Evaluations, Quality of Life
  • American Literature and Culture

World Health Organization
2017

External audits of AI systems are increasingly recognized as a key mechanism for governance. The effectiveness of an audit, however, depends on the degree of access granted to auditors. Recent audits of state-of-the-art AI systems have primarily relied on black-box access, in which auditors can only query the system and observe its outputs. However, white-box access to the system's inner workings (e.g., weights, activations, gradients) allows an auditor to perform stronger attacks, more thoroughly interpret models, and conduct fine-tuning....

10.1145/3630106.3659037 article EN cc-by 2022 ACM Conference on Fairness, Accountability, and Transparency 2024-06-03
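
As a rough illustration of the two access levels this abstract contrasts, here is a minimal PyTorch sketch; the toy model, dimensions, and hook are my own assumptions, not taken from the paper.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))  # stand-in for the audited system
x = torch.randn(1, 16)

# Black-box access: the auditor can only query inputs and observe outputs.
with torch.no_grad():
    outputs = model(x)

# White-box access: the auditor can also read weights, record intermediate
# activations, and compute gradients with respect to the input.
weights = [p.detach().clone() for p in model.parameters()]
activations = {}
model[0].register_forward_hook(lambda mod, inp, out: activations.update(first_linear=out))
x.requires_grad_(True)
model(x)[0, 0].backward()   # gradient of one output logit w.r.t. the input
input_gradient = x.grad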

Mechanistic interpretability aims to understand the computational mechanisms underlying neural networks' capabilities in order to accomplish concrete scientific and engineering goals. Progress in this field thus promises to provide greater assurance over AI system behavior and to shed light on exciting questions about the nature of intelligence. Despite recent progress toward these goals, there are many open problems in the field that require solutions before practical benefits can be realized: Our methods require both conceptual...

10.48550/arxiv.2501.16496 preprint EN arXiv (Cornell University) 2025-01-27

Previous estimates of global palliative care development have not been based on official country data. The World Health Organization Noncommunicable Disease Country Capacity Survey of member state officials monitors countries' capacities for the prevention and control of noncommunicable diseases. In 2015, for the first time, questions were included on a number of metrics to generate baseline data for monitoring development. Participants were given instructions, a glossary of terms, and 3 months to complete this closed,...

10.1177/0269216317716060 article EN Palliative Medicine 2017-07-05

One of the roadblocks to a better understanding of neural networks' internals is \textit{polysemanticity}, where neurons appear to activate in multiple, semantically distinct contexts. Polysemanticity prevents us from identifying concise, human-understandable explanations for what networks are doing internally. One hypothesised cause of polysemanticity is \textit{superposition}, where neural networks represent more features than they have neurons by assigning features to an overcomplete set of directions in activation space, rather than to individual neurons....

10.48550/arxiv.2309.08600 preprint EN cc-by arXiv (Cornell University) 2023-01-01
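
To make the superposition framing above concrete, here is a minimal sparse-autoencoder sketch in PyTorch; the dimensions, penalty weight, and random "activations" are illustrative assumptions, not the paper's setup.

import torch
import torch.nn as nn

d_model, d_dict = 64, 512           # overcomplete dictionary: many more directions than dimensions
acts = torch.randn(1024, d_model)   # stand-in for cached network activations

encoder = nn.Linear(d_model, d_dict)
decoder = nn.Linear(d_dict, d_model, bias=False)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

for step in range(100):
    codes = torch.relu(encoder(acts))       # sparse coefficients over dictionary directions
    recon = decoder(codes)                  # reconstruction of the activations
    loss = (recon - acts).pow(2).mean() + 1e-3 * codes.abs().mean()  # MSE + L1 sparsity penalty
    opt.zero_grad()
    loss.backward()
    opt.step()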

Mechanistic interpretability aims to understand the internal mechanisms learned by neural networks. Despite recent progress toward this goal, it remains unclear how best to decompose network parameters into mechanistic components. We introduce Attribution-based Parameter Decomposition (APD), a method that directly decomposes a network's parameters into components that (i) are faithful to the parameters of the original network, (ii) require a minimal number of components to process any input, and (iii) are maximally simple. Our approach thus optimizes for a minimal length...

10.48550/arxiv.2501.14926 preprint EN arXiv (Cornell University) 2025-01-24
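
The three criteria named in this abstract can be sketched as loss terms. The following is a heavily simplified, hypothetical proxy (a single weight matrix, crude norm-based attribution and simplicity penalties), not APD as defined in the paper.

import torch

W = torch.randn(32, 32)                                   # weights of a trained layer (stand-in)
k = 8
components = torch.randn(k, 32, 32, requires_grad=True)   # candidate parameter components P_1..P_k
x = torch.randn(16, 32)                                   # a batch of inputs

# (i) Faithfulness: components should sum to the original parameters.
faithfulness = (components.sum(dim=0) - W).pow(2).mean()

# (ii) Minimality: few components should be needed to process any given input
# (here approximated by penalizing per-input component output norms).
per_component_out = torch.einsum('bi,kji->bkj', x, components)   # (batch, k, out)
minimality = per_component_out.norm(dim=-1).mean()

# (iii) Simplicity: each component should be individually simple
# (squared Frobenius norm as a crude proxy).
simplicity = components.pow(2).sum(dim=(1, 2)).mean()

loss = faithfulness + 1e-2 * minimality + 1e-3 * simplicity
loss.backward()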

A common goal of mechanistic interpretability is to decompose the activations of neural networks into features: interpretable properties of the input computed by the model. Sparse autoencoders (SAEs) are a popular method for finding these features in LLMs, and it has been postulated that they can be used to find a \textit{canonical} set of units: a unique and complete list of atomic features. We cast doubt on this belief using two novel techniques: SAE stitching to show they are incomplete, and meta-SAEs to show they are not atomic. SAE stitching involves inserting or...

10.48550/arxiv.2502.04878 preprint EN arXiv (Cornell University) 2025-02-07
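
The "SAE stitching" idea, as far as this truncated abstract describes it, can be sketched as transplanting latents between dictionaries of different sizes; the sizes, the choice of latents, and the bias-free encode/decode below are my assumptions.

import torch

d_model = 64
small = {'W_enc': torch.randn(256, d_model), 'W_dec': torch.randn(256, d_model)}
large = {'W_enc': torch.randn(1024, d_model), 'W_dec': torch.randn(1024, d_model)}

# Insert a handful of latents from the larger SAE into the smaller one.
idx = torch.arange(10)   # hypothetical choice of latents to transplant
stitched = {
    'W_enc': torch.cat([small['W_enc'], large['W_enc'][idx]]),
    'W_dec': torch.cat([small['W_dec'], large['W_dec'][idx]]),
}

def reconstruct(sae, acts):
    codes = torch.relu(acts @ sae['W_enc'].T)   # encode (biases omitted for brevity)
    return codes @ sae['W_dec']                 # decode

acts = torch.randn(128, d_model)
err_small = (reconstruct(small, acts) - acts).pow(2).mean()
err_stitched = (reconstruct(stitched, acts) - acts).pow(2).mean()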

Artificial intelligence (AI) systems are poised to become deeply integrated into society. If developed responsibly, AI has the potential to benefit humanity immensely. However, it also poses a range of risks, including risks of catastrophic accidents. It is crucial that we develop oversight mechanisms that prevent harm. This article outlines a framework for evaluating and auditing AI to provide assurance of responsible development and deployment, focusing on these risks. We argue that this requires comprehensive and proportional...

10.20944/preprints202401.1424.v1 preprint EN 2024-01-18

Mechanistic interpretability aims to explain what a neural network has learned at a nuts-and-bolts level. What are the fundamental primitives of neural network representations? Previous mechanistic descriptions have used individual neurons or their linear combinations to understand the representations a network has learned. But there are clues that neurons and their linear combinations are not the correct fundamental units of description: directions cannot describe how networks use nonlinearities to structure their representations. Moreover, many instances of neurons and their combinations are polysemantic (i.e. they have multiple...

10.48550/arxiv.2211.12312 preprint EN other-oa arXiv (Cornell University) 2022-01-01

Identifying the features learned by neural networks is a core challenge in mechanistic interpretability. Sparse autoencoders (SAEs), which learn a sparse, overcomplete dictionary that reconstructs a network's internal activations, have been used to identify these features. However, SAEs may learn more about the structure of the dataset than the computational structure of the network. There is therefore only indirect reason to believe that the directions found in these dictionaries are functionally important. We propose end-to-end (e2e) sparse dictionary learning,...

10.48550/arxiv.2405.12241 preprint EN arXiv (Cornell University) 2024-05-17
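
A minimal sketch of the end-to-end idea, under assumed stand-in layers and dimensions (not the paper's implementation): instead of only reconstructing activations, the SAE is trained so that replacing the activations with its reconstruction leaves the model's output distribution unchanged.

import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, d_dict, vocab = 64, 512, 100
layer1 = nn.Linear(32, d_model)       # stand-in for the model up to the hooked layer
layer2 = nn.Linear(d_model, vocab)    # stand-in for the rest of the model
for p in list(layer1.parameters()) + list(layer2.parameters()):
    p.requires_grad_(False)           # the model itself stays frozen

sae_enc = nn.Linear(d_model, d_dict)
sae_dec = nn.Linear(d_dict, d_model)
opt = torch.optim.Adam(list(sae_enc.parameters()) + list(sae_dec.parameters()), lr=1e-3)

x = torch.randn(8, 32)
for step in range(50):
    acts = torch.relu(layer1(x))
    target_logits = layer2(acts)                      # original outputs
    codes = torch.relu(sae_enc(acts))
    recon = sae_dec(codes)
    logits = layer2(recon)                            # outputs with activations replaced
    kl = F.kl_div(F.log_softmax(logits, -1), F.log_softmax(target_logits, -1),
                  log_target=True, reduction='batchmean')
    loss = kl + 1e-3 * codes.abs().mean()             # match outputs + stay sparse
    opt.zero_grad()
    loss.backward()
    opt.step()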

A mechanistic understanding of how MLPs do computation in deep neural networks remains elusive. Current interpretability work can extract features from hidden activations over an input dataset but generally cannot explain how MLP weights construct features. One challenge is that element-wise nonlinearities introduce higher-order interactions and make it difficult to trace computations through the layer. In this paper, we analyze bilinear MLPs, a type of Gated Linear Unit (GLU) without any...

10.48550/arxiv.2410.08417 preprint EN arXiv (Cornell University) 2024-10-10
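
For concreteness, here is a small PyTorch sketch of a bilinear MLP layer (dimensions assumed): with the element-wise nonlinearity removed, the layer is quadratic in its input and can be written with a fixed third-order tensor, which is what makes weight-based analysis tractable.

import torch
import torch.nn as nn

class BilinearMLP(nn.Module):
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.W = nn.Linear(d_model, d_hidden, bias=False)
        self.V = nn.Linear(d_model, d_hidden, bias=False)
        self.out = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        return self.out(self.W(x) * self.V(x))   # elementwise product of two linear maps

mlp = BilinearMLP(d_model=16, d_hidden=64)
x = torch.randn(4, 16)
y = mlp(x)

# Equivalent third-order-tensor form: y_k = sum_{i,j} B[k, i, j] * x_i * x_j
B = torch.einsum('kh,hi,hj->kij', mlp.out.weight, mlp.W.weight, mlp.V.weight)
y_check = torch.einsum('kij,bi,bj->bk', B, x, x)
assert torch.allclose(y, y_check, atol=1e-4)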

Sparse Autoencoders (SAEs) have emerged as a useful tool for interpreting the internal representations of neural networks. However, naively optimising SAEs for reconstruction loss and sparsity results in a preference for SAEs that are extremely wide and sparse. We present an information-theoretic framework casting SAEs as lossy compression algorithms for communicating explanations of activations. We appeal to the Minimal Description Length (MDL) principle to motivate explanations of activations which are both accurate and concise. We further argue that interpretable SAEs require...

10.48550/arxiv.2410.11179 preprint EN arXiv (Cornell University) 2024-10-14
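
The MDL framing can be made concrete with a back-of-the-envelope calculation; the dictionary size, number of active latents, and precision below are hypothetical numbers, not the paper's.

import math

d_dict = 49152          # hypothetical dictionary size
active = 64             # hypothetical number of active latents per token (L0)
bits_per_value = 8      # hypothetical precision for each active coefficient

bits_for_indices = active * math.log2(d_dict)   # which latents fired
bits_for_values = active * bits_per_value       # how strongly they fired
description_length = bits_for_indices + bits_for_values
print(f"~{description_length:.0f} bits per token under these assumptions")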

Do you plead guilty to this – No– So why did confess to– I was not involved in– Perhaps pled acting in concert with–

10.1353/mar.2015.0035 article EN The Massachusetts Review 2015-01-01

The ability of neural networks to represent more features than they have neurons makes interpreting them challenging. This phenomenon, known as superposition, has spurred efforts to find architectures that are more interpretable than standard multilayer perceptrons (MLPs) with elementwise activation functions. In this note, I examine bilinear layers, which are a type of MLP layer that is mathematically much easier to analyze while simultaneously performing better than standard MLPs. Although they are nonlinear functions of their input, I demonstrate...

10.48550/arxiv.2305.03452 preprint EN cc-by arXiv (Cornell University) 2023-01-01

The Gift of a Cradle, and: Studying War Lee Sharkey (bio) Cradle Lovely beneath the lovely curve of horizon rests baby curve lovely nests in mother's breast sweet stream slips in sinuous meander curved air plays a celebratory cello wet mouth is shaping love sounds Peach peach was color when I opened my eyes peach through tree trunks pulse whole bud of fruit globe peace stubborn as soil and futile love stood there while train blew to shrapnel Prado dwarves lamp-jawed Infantes the featherhead Counts...

10.1353/psg.2007.0149 article EN Prairie schooner 2007-06-01

The increasing capabilities of artificial intelligence (AI) systems make it ever more important that we interpret their internals to ensure that their intentions are aligned with human values. Yet there is reason to believe that a misaligned AI will have a convergent instrumental incentive to make its thoughts difficult for us to interpret. In this article, I discuss many ways that a capable AI might circumvent scalable interpretability methods and suggest a framework for thinking about these potential future risks.

10.48550/arxiv.2212.11415 preprint EN cc-by arXiv (Cornell University) 2022-01-01