NFDI4DS | UHH-SEMS - Publication Details

Black-Box Access is Insufficient for Rigorous AI Audits

OPENALEX - Publications

Stephen Casper Carson Ezell Charlotte Siegmann Noam Kolt Taylor Lynn Curtis and 16 more

External audits of AI systems are increasingly recognized as a key mechanism for AI governance. The effectiveness of an audit, however, depends on the degree of access granted to auditors. Recent audits of state-of-the-art AI systems have primarily relied on black-box access, in which auditors can only query the system and observe its outputs. However, white-box access to the system's inner workings (e.g., weights, activations, gradients) allows an auditor to perform stronger attacks, more thoroughly interpret models, and conduct fine-tuning....

10.1145/3630106.3659037 article EN cc-by 2022 ACM Conference on Fairness, Accountability, and Transparency 2024-06-03

Black-Box Access is Insufficient for Rigorous AI Audits

OPENALEX - Publications

Stephen T. Casper Carson Ezell Charlotte Siegmann Noam Kolt Taylor Lynn Curtis and 16 more

External audits of AI systems are increasingly recognized as a key mechanism for AI governance. The effectiveness of an audit, however, depends on the degree of access granted to auditors. Recent audits of state-of-the-art AI systems have primarily relied on black-box access, in which auditors can only query the system and observe its outputs. However, white-box access to the system's inner workings (e.g., weights, activations, gradients) allows an auditor to perform stronger attacks, more thoroughly interpret models, and conduct fine-tuning....

10.1145/3630106.3659037 preprint EN arXiv (Cornell University) 2024-01-25

Cannot or Should Not? Automatic Analysis of Refusal Composition in IFT/RLHF Datasets and Refusal Behavior of Black-Box LLMs

OPENALEX - Publications

Alexander von Recum Christoph Schnabl Gabor Hollbeck Silas Alberti Philip Blinde and 1 more

Refusals - instances where large language models (LLMs) decline or fail to fully execute user instructions - are crucial for both AI safety and AI capabilities, and the reduction of hallucinations in particular. These behaviors are learned during post-training, especially in instruction fine-tuning (IFT) and reinforcement learning from human feedback (RLHF). However, existing taxonomies and evaluation datasets for refusals are inadequate, often focusing solely on should-not-related (instead of cannot-related) categories,...

10.48550/arxiv.2412.16974 preprint EN arXiv (Cornell University) 2024-12-22