Silas Alberti

ORCID: 0000-0003-1611-5737
Research Areas
  • Adversarial Robustness in Machine Learning
  • Information and Cyber Security
  • Privacy-Preserving Technologies in Data
  • Software Engineering Research
  • Ethics and Social Impacts of AI
  • Security and Verification in Computing
  • Law, AI, and Intellectual Property
  • Explainable Artificial Intelligence (XAI)
  • Anomaly Detection Techniques and Applications

Stanford University
2024

External audits of AI systems are increasingly recognized as a key mechanism for governance. The effectiveness of an audit, however, depends on the degree of access granted to auditors. Recent audits of state-of-the-art systems have primarily relied on black-box access, in which auditors can only query the system and observe its outputs. However, white-box access to the system's inner workings (e.g., weights, activations, gradients) allows an auditor to perform stronger attacks, more thoroughly interpret models, and conduct fine-tuning...

10.1145/3630106.3659037 article EN cc-by 2024 ACM Conference on Fairness, Accountability, and Transparency 2024-06-03

10.1145/3630106.3659037 preprint EN arXiv (Cornell University) 2024-01-25

Refusals - instances where large language models (LLMs) decline or fail to fully execute user instructions - are crucial for both AI safety and capabilities, and for the reduction of hallucinations in particular. These behaviors are learned during post-training, especially in instruction fine-tuning (IFT) and reinforcement learning from human feedback (RLHF). However, existing taxonomies and evaluation datasets for refusals are inadequate, often focusing solely on should-not-related (instead of cannot-related) categories,...

10.48550/arxiv.2412.16974 preprint EN arXiv (Cornell University) 2024-12-22