- Adversarial Robustness in Machine Learning
- Information and Cyber Security
- Privacy-Preserving Technologies in Data
- Software Engineering Research
- Ethics and Social Impacts of AI
- Security and Verification in Computing
- Law, AI, and Intellectual Property
- Explainable Artificial Intelligence (XAI)
- Anomaly Detection Techniques and Applications
Stanford University
2024
External audits of AI systems are increasingly recognized as a key mechanism for AI governance. The effectiveness of an audit, however, depends on the degree of access granted to auditors. Recent audits of state-of-the-art AI systems have primarily relied on black-box access, in which auditors can only query the system and observe its outputs. However, white-box access to the system's inner workings (e.g., weights, activations, gradients) allows an auditor to perform stronger attacks, more thoroughly interpret models, and conduct fine-tuning....
Refusals - instances where large language models (LLMs) decline or fail to fully execute user instructions - are crucial for both AI safety and AI capabilities, and the reduction of hallucinations in particular. These behaviors are learned during post-training, especially in instruction fine-tuning (IFT) and reinforcement learning from human feedback (RLHF). However, existing taxonomies and evaluation datasets for refusals are inadequate, often focusing solely on should-not-related (instead of cannot-related) categories,...