Lee Sharkey

ORCID: 0009-0009-2137-6027
Research Areas
  • Adversarial Robustness in Machine Learning
  • Explainable Artificial Intelligence (XAI)
  • Natural Language Processing Techniques
  • Neural Networks and Applications
  • Empathy and Medical Education
  • Machine Learning and Data Classification
  • Information and Cyber Security
  • Topic Modeling
  • Privacy-Preserving Technologies in Data
  • Ethics and Social Impacts of AI
  • Generative Adversarial Networks and Image Synthesis
  • Statistical and Computational Modeling
  • Chronic Disease Management Strategies
  • Multimodal Machine Learning Applications
  • Palliative Care and End-of-Life Issues
  • Security and Verification in Computing
  • Model Reduction and Neural Networks
  • Video Analysis and Summarization
  • Text and Document Classification Technologies
  • Health Systems, Economic Evaluations, Quality of Life
  • American Literature and Culture

World Health Organization
2017

External audits of AI systems are increasingly recognized as a key mechanism for governance. The effectiveness of an audit, however, depends on the degree of access granted to auditors. Recent audits of state-of-the-art AI systems have primarily relied on black-box access, in which auditors can only query the system and observe its outputs. However, white-box access to the system's inner workings (e.g., weights, activations, gradients) allows an auditor to perform stronger attacks, more thoroughly interpret models, and conduct fine-tuning....

10.1145/3630106.3659037 article EN cc-by 2022 ACM Conference on Fairness, Accountability, and Transparency 2024-06-03
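
As a rough illustration of the two access levels this abstract contrasts, here is a minimal PyTorch sketch; the toy model, dimensions, and hook are my own assumptions, not taken from the paper.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))  # stand-in for the audited system
x = torch.randn(1, 16)

# Black-box access: the auditor can only query inputs and observe outputs.
with torch.no_grad():
    outputs = model(x)

# White-box access: the auditor can also read weights, record intermediate
# activations, and compute gradients with respect to the input.
weights = [p.detach().clone() for p in model.parameters()]
activations = {}
model[0].register_forward_hook(lambda mod, inp, out: activations.update(first_linear=out))
x.requires_grad_(True)
model(x)[0, 0].backward()   # gradient of one output logit w.r.t. the input
input_gradient = x.grad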

Mechanistic interpretability aims to understand the computational mechanisms underlying neural networks' capabilities in order to accomplish concrete scientific and engineering goals. Progress in this field thus promises to provide greater assurance over AI system behavior and to shed light on exciting questions about the nature of intelligence. Despite recent progress toward these goals, there are many open problems in the field that require solutions before practical benefits can be realized: Our methods require both conceptual...

10.48550/arxiv.2501.16496 preprint EN arXiv (Cornell University) 2025-01-27

Previous estimates of global palliative care development have not been based on official country data. The World Health Organization Noncommunicable Disease Country Capacity Survey of member state officials monitors countries' capacities for the prevention and control of noncommunicable diseases. In 2015, for the first time, questions were included on a number of metrics to generate baseline data for monitoring development. Participants were given instructions, a glossary of terms, and 3 months to complete this closed,...

10.1177/0269216317716060 article EN Palliative Medicine 2017-07-05

One of the roadblocks to a better understanding of neural networks' internals is \textit{polysemanticity}, where neurons appear to activate in multiple, semantically distinct contexts. Polysemanticity prevents us from identifying concise, human-understandable explanations for what networks are doing internally. One hypothesised cause of polysemanticity is \textit{superposition}, where neural networks represent more features than they have neurons by assigning features to an overcomplete set of directions in activation space, rather than to individual neurons....

10.48550/arxiv.2309.08600 preprint EN cc-by arXiv (Cornell University) 2023-01-01
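
To make the superposition framing above concrete, here is a minimal sparse-autoencoder sketch in PyTorch; the dimensions, penalty weight, and random "activations" are illustrative assumptions, not the paper's setup.

import torch
import torch.nn as nn

d_model, d_dict = 64, 512           # overcomplete dictionary: many more directions than dimensions
acts = torch.randn(1024, d_model)   # stand-in for cached network activations

encoder = nn.Linear(d_model, d_dict)
decoder = nn.Linear(d_dict, d_model, bias=False)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

for step in range(100):
    codes = torch.relu(encoder(acts))       # sparse coefficients over dictionary directions
    recon = decoder(codes)                  # reconstruction of the activations
    loss = (recon - acts).pow(2).mean() + 1e-3 * codes.abs().mean()  # MSE + L1 sparsity penalty
    opt.zero_grad()
    loss.backward()
    opt.step()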

Mechanistic interpretability aims to understand the internal mechanisms learned by neural networks. Despite recent progress toward this goal, it remains unclear how best to decompose network parameters into mechanistic components. We introduce Attribution-based Parameter Decomposition (APD), a method that directly decomposes a network's parameters into components that (i) are faithful to the parameters of the original network, (ii) require a minimal number of components to process any input, and (iii) are maximally simple. Our approach thus optimizes for a minimal length...

10.48550/arxiv.2501.14926 preprint EN arXiv (Cornell University) 2025-01-24
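
The three criteria named in this abstract can be sketched as loss terms. The following is a heavily simplified, hypothetical proxy (a single weight matrix, crude norm-based attribution and simplicity penalties), not APD as defined in the paper.

import torch

W = torch.randn(32, 32)                                   # weights of a trained layer (stand-in)
k = 8
components = torch.randn(k, 32, 32, requires_grad=True)   # candidate parameter components P_1..P_k
x = torch.randn(16, 32)                                   # a batch of inputs

# (i) Faithfulness: components should sum to the original parameters.
faithfulness = (components.sum(dim=0) - W).pow(2).mean()

# (ii) Minimality: few components should be needed to process any given input
# (here approximated by penalizing per-input component output norms).
per_component_out = torch.einsum('bi,kji->bkj', x, components)   # (batch, k, out)
minimality = per_component_out.norm(dim=-1).mean()

# (iii) Simplicity: each component should be individually simple
# (squared Frobenius norm as a crude proxy).
simplicity = components.pow(2).sum(dim=(1, 2)).mean()

loss = faithfulness + 1e-2 * minimality + 1e-3 * simplicity
loss.backward()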

A common goal of mechanistic interpretability is to decompose the activations of neural networks into features: interpretable properties of the input computed by the model. Sparse autoencoders (SAEs) are a popular method for finding these features in LLMs, and it has been postulated that they can be used to find a \textit{canonical} set of units: a unique and complete list of atomic features. We cast doubt on this belief using two novel techniques: SAE stitching to show they are incomplete, and meta-SAEs to show they are not atomic. SAE stitching involves inserting or...

10.48550/arxiv.2502.04878 preprint EN arXiv (Cornell University) 2025-02-07
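
The "SAE stitching" idea, as far as this truncated abstract describes it, can be sketched as transplanting latents between dictionaries of different sizes; the sizes, the choice of latents, and the bias-free encode/decode below are my assumptions.

import torch

d_model = 64
small = {'W_enc': torch.randn(256, d_model), 'W_dec': torch.randn(256, d_model)}
large = {'W_enc': torch.randn(1024, d_model), 'W_dec': torch.randn(1024, d_model)}

# Insert a handful of latents from the larger SAE into the smaller one.
idx = torch.arange(10)   # hypothetical choice of latents to transplant
stitched = {
    'W_enc': torch.cat([small['W_enc'], large['W_enc'][idx]]),
    'W_dec': torch.cat([small['W_dec'], large['W_dec'][idx]]),
}

def reconstruct(sae, acts):
    codes = torch.relu(acts @ sae['W_enc'].T)   # encode (biases omitted for brevity)
    return codes @ sae['W_dec']                 # decode

acts = torch.randn(128, d_model)
err_small = (reconstruct(small, acts) - acts).pow(2).mean()
err_stitched = (reconstruct(stitched, acts) - acts).pow(2).mean()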

Artificial intelligence (AI) systems are poised to become deeply integrated into society. If developed responsibly, AI has the potential to benefit humanity immensely. However, it also poses a range of risks, including risks of catastrophic accidents. It is crucial that we develop oversight mechanisms that prevent harm. This article outlines a framework for evaluating and auditing AI to provide assurance of responsible development and deployment, focusing on these risks. We argue that this requires comprehensive and proportional...

10.20944/preprints202401.1424.v1 preprint EN 2024-01-18

Mechanistic interpretability aims to explain what a neural network has learned at a nuts-and-bolts level. What are the fundamental primitives of neural network representations? Previous mechanistic descriptions have used individual neurons or their linear combinations to understand the representations a network has learned. But there are clues that neurons and their linear combinations are not the correct fundamental units of description: directions cannot describe how networks use nonlinearities to structure their representations. Moreover, many instances of neurons and their combinations are polysemantic (i.e. they have multiple...

10.48550/arxiv.2211.12312 preprint EN other-oa arXiv (Cornell University) 2022-01-01

Identifying the features learned by neural networks is a core challenge in mechanistic interpretability. Sparse autoencoders (SAEs), which learn a sparse, overcomplete dictionary that reconstructs a network's internal activations, have been used to identify these features. However, SAEs may learn more about the structure of the dataset than the computational structure of the network. There is therefore only indirect reason to believe that the directions found in these dictionaries are functionally important. We propose end-to-end (e2e) sparse dictionary learning,...

10.48550/arxiv.2405.12241 preprint EN arXiv (Cornell University) 2024-05-17
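
A minimal sketch of the end-to-end idea, under assumed stand-in layers and dimensions (not the paper's implementation): instead of only reconstructing activations, the SAE is trained so that replacing the activations with its reconstruction leaves the model's output distribution unchanged.

import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, d_dict, vocab = 64, 512, 100
layer1 = nn.Linear(32, d_model)       # stand-in for the model up to the hooked layer
layer2 = nn.Linear(d_model, vocab)    # stand-in for the rest of the model
for p in list(layer1.parameters()) + list(layer2.parameters()):
    p.requires_grad_(False)           # the model itself stays frozen

sae_enc = nn.Linear(d_model, d_dict)
sae_dec = nn.Linear(d_dict, d_model)
opt = torch.optim.Adam(list(sae_enc.parameters()) + list(sae_dec.parameters()), lr=1e-3)

x = torch.randn(8, 32)
for step in range(50):
    acts = torch.relu(layer1(x))
    target_logits = layer2(acts)                      # original outputs
    codes = torch.relu(sae_enc(acts))
    recon = sae_dec(codes)
    logits = layer2(recon)                            # outputs with activations replaced
    kl = F.kl_div(F.log_softmax(logits, -1), F.log_softmax(target_logits, -1),
                  log_target=True, reduction='batchmean')
    loss = kl + 1e-3 * codes.abs().mean()             # match outputs + stay sparse
    opt.zero_grad()
    loss.backward()
    opt.step()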

A mechanistic understanding of how MLPs do computation in deep neural networks remains elusive. Current interpretability work can extract features from hidden activations over an input dataset but generally cannot explain how MLP weights construct features. One challenge is that element-wise nonlinearities introduce higher-order interactions and make it difficult to trace computations through the layer. In this paper, we analyze bilinear MLPs, a type of Gated Linear Unit (GLU) without any...

10.48550/arxiv.2410.08417 preprint EN arXiv (Cornell University) 2024-10-10
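
For concreteness, here is a small PyTorch sketch of a bilinear MLP layer (dimensions assumed): with the element-wise nonlinearity removed, the layer is quadratic in its input and can be written with a fixed third-order tensor, which is what makes weight-based analysis tractable.

import torch
import torch.nn as nn

class BilinearMLP(nn.Module):
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.W = nn.Linear(d_model, d_hidden, bias=False)
        self.V = nn.Linear(d_model, d_hidden, bias=False)
        self.out = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        return self.out(self.W(x) * self.V(x))   # elementwise product of two linear maps

mlp = BilinearMLP(d_model=16, d_hidden=64)
x = torch.randn(4, 16)
y = mlp(x)

# Equivalent third-order-tensor form: y_k = sum_{i,j} B[k, i, j] * x_i * x_j
B = torch.einsum('kh,hi,hj->kij', mlp.out.weight, mlp.W.weight, mlp.V.weight)
y_check = torch.einsum('kij,bi,bj->bk', B, x, x)
assert torch.allclose(y, y_check, atol=1e-4)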

Sparse Autoencoders (SAEs) have emerged as a useful tool for interpreting the internal representations of neural networks. However, naively optimising SAEs for reconstruction loss and sparsity results in a preference for SAEs that are extremely wide and sparse. We present an information-theoretic framework casting SAEs as lossy compression algorithms for communicating explanations of activations. We appeal to the Minimal Description Length (MDL) principle to motivate explanations of activations which are both accurate and concise. We further argue that interpretable SAEs require...

10.48550/arxiv.2410.11179 preprint EN arXiv (Cornell University) 2024-10-14
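
The MDL framing can be made concrete with a back-of-the-envelope calculation; the dictionary size, number of active latents, and precision below are hypothetical numbers, not the paper's.

import math

d_dict = 49152          # hypothetical dictionary size
active = 64             # hypothetical number of active latents per token (L0)
bits_per_value = 8      # hypothetical precision for each active coefficient

bits_for_indices = active * math.log2(d_dict)   # which latents fired
bits_for_values = active * bits_per_value       # how strongly they fired
description_length = bits_for_indices + bits_for_values
print(f"~{description_length:.0f} bits per token under these assumptions")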

Do you plead guilty to this – No– So why did confess to– I was not involved in– Perhaps pled acting in concert with–

10.1353/mar.2015.0035 article EN The Massachusetts Review 2015-01-01

The ability of neural networks to represent more features than they have neurons makes interpreting them challenging. This phenomenon, known as superposition, has spurred efforts to find architectures that are more interpretable than standard multilayer perceptrons (MLPs) with elementwise activation functions. In this note, I examine bilinear layers, which are a type of MLP layer that is mathematically much easier to analyze while simultaneously performing better than standard MLPs. Although they are nonlinear functions of their input, I demonstrate...

10.48550/arxiv.2305.03452 preprint EN cc-by arXiv (Cornell University) 2023-01-01

The Gift of a Cradle, and: Studying War Lee Sharkey (bio) Cradle Lovely beneath the lovely curve of horizon rests baby curve lovely nests in mother's breast sweet stream slips in sinuous meander curved air plays a celebratory cello wet mouth is shaping love sounds Peach peach was color when I opened my eyes peach through tree trunks pulse whole bud of fruit globe peace stubborn as soil and futile love stood there while train blew to shrapnel Prado dwarves lamp-jawed Infantes the featherhead Counts...

10.1353/psg.2007.0149 article EN Prairie schooner 2007-06-01

The increasing capabilities of artificial intelligence (AI) systems make it ever more important that we interpret their internals to ensure that their intentions are aligned with human values. Yet there is reason to believe that a misaligned AI will have a convergent instrumental incentive to make its thoughts difficult for us to interpret. In this article, I discuss many ways that a capable AI might circumvent scalable interpretability methods and suggest a framework for thinking about these potential future risks.

10.48550/arxiv.2212.11415 preprint EN cc-by arXiv (Cornell University) 2022-01-01