- Reinforcement Learning in Robotics
- Adversarial Robustness in Machine Learning
- Auction Theory and Applications
- Robot Manipulation and Learning
- Ethics and Social Impacts of AI
- Advanced Bandit Algorithms Research
- Explainable Artificial Intelligence (XAI)
- Experimental Behavioral Economics Studies
- AI-based Problem Solving and Planning
- Machine Learning and Algorithms
- Topic Modeling
- Game Theory and Applications
- Anomaly Detection Techniques and Applications
- Decision-Making and Behavioral Economics
- Robotic Path Planning Algorithms
- Law, Economics, and Judicial Systems
- Computability, Logic, AI Algorithms
- Evolutionary Game Theory and Cooperation
- Complex Systems and Decision Making
- Domain Adaptation and Few-Shot Learning
- Data Stream Mining Techniques
- Multi-Agent Systems and Negotiation
- Artificial Intelligence in Healthcare and Education
- Recommender Systems and Techniques
- Blood donation and transfusion practices
- Massachusetts Institute of Technology (2013-2024)
- Vassar College (2024)
- University of California, Berkeley (2014-2022)
- IIT@MIT (2022)
- Moscow Institute of Thermal Technology (2021)
- Berkeley College (2019)
- University of New Mexico (2015)
- Mind Research Network (2015)
- Intel (United States) (2013)
For an autonomous system to be helpful to humans and to pose no unwarranted risks, it needs to align its values with those of the humans in its environment in such a way that its actions contribute to the maximization of value for the humans. We propose a formal definition of the value alignment problem as cooperative inverse reinforcement learning (CIRL). A CIRL problem is a cooperative, partial-information game with two agents, human and robot; both are rewarded according to the human's reward function, but the robot does not initially know what this is. In contrast...
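The cooperative, partial-information structure described above can be sketched in miniature: a robot that does not know the human's reward function maintains a belief over reward hypotheses, updates it from an observed human action, and then acts to maximize expected human reward. The two-hypothesis setup and the Boltzmann-rational human model are illustrative assumptions, not the paper's construction.

```python
import numpy as np

# Two reward hypotheses over two outcomes; the robot starts uncertain
# about which one describes the human.
REWARD_HYPOTHESES = np.array([[1.0, 0.0],   # theta_0: human prefers outcome 0
                              [0.0, 1.0]])  # theta_1: human prefers outcome 1

def human_action(theta_idx, beta=5.0):
    """Boltzmann-rational human: picks an outcome with prob ∝ exp(beta * reward)."""
    r = REWARD_HYPOTHESES[theta_idx]
    p = np.exp(beta * r) / np.exp(beta * r).sum()
    return np.random.choice(2, p=p)

def robot_posterior(prior, observed_action, beta=5.0):
    """Bayesian update over reward hypotheses given the human's action."""
    likelihood = np.array([
        np.exp(beta * REWARD_HYPOTHESES[k][observed_action])
        / np.exp(beta * REWARD_HYPOTHESES[k]).sum()
        for k in range(2)
    ])
    post = prior * likelihood
    return post / post.sum()

def robot_action(posterior):
    """Robot maximizes expected human reward under its current belief."""
    expected = posterior @ REWARD_HYPOTHESES
    return int(np.argmax(expected))

np.random.seed(0)
prior = np.array([0.5, 0.5])
a_h = human_action(theta_idx=1)      # true preference is outcome 1
post = robot_posterior(prior, a_h)
print(robot_action(post))
```

The key CIRL feature this preserves is that both agents share the human's reward, but only the human knows it; the robot's behavior is driven entirely by its posterior.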
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals. RLHF has emerged as the central method used to finetune state-of-the-art large language models (LLMs). Despite this popularity, there has been relatively little public work systematizing its flaws. In this paper, we (1) survey open problems and fundamental limitations of RLHF and related methods; (2) overview techniques to understand, improve, and complement RLHF in practice; and (3) propose auditing and disclosure...
The last decade of machine learning has seen drastic increases in scale and capabilities. Deep neural networks (DNNs) are increasingly being deployed in the real world. However, they are difficult to analyze, raising concerns about using them without a rigorous understanding of how they function. Effective tools for interpreting them will be important for building more trustworthy AI by helping to identify problems, fix bugs, and improve basic understanding. In particular, "inner" interpretability techniques, which focus...
Recommender systems are the algorithms which select, filter, and personalize content across many of the world's largest platforms and apps. As such, their positive and negative effects on individuals and on societies have been extensively theorized and studied. Our overarching question is how to ensure that recommender systems enact the values of the individuals and societies that they serve. Addressing this question in a principled fashion requires technical knowledge of recommender design and operation, but also critically depends on insights from diverse fields including social science, ethics,...
External audits of AI systems are increasingly recognized as a key mechanism for AI governance. The effectiveness of an audit, however, depends on the degree of access granted to auditors. Recent audits of state-of-the-art AI systems have primarily relied on black-box access, in which auditors can only query the system and observe its outputs. However, white-box access to the system's inner workings (e.g., weights, activations, gradients) allows an auditor to perform stronger attacks, more thoroughly interpret models, and conduct fine-tuning....
Tasks in mobile manipulation planning often require thousands of individual motions to complete. Such tasks require reasoning about complex goals as well as the feasibility of movements in configuration space. In discrete representations, the complexity of planning is exponential in the length of the plan. In manipulation, the parameters for an action draw from a continuous space, so we must also cope with an infinite branching factor. Task and motion planning (TAMP) methods integrate logical search over high-level actions with geometric reasoning to address this challenge....
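The integration described above can be sketched as a loop in which a discrete search proposes high-level action sequences and each symbolic action must then be bound to a continuous parameter (here, a grasp angle) that passes a geometric feasibility test. The action names, the feasibility model, and the sampling-based refinement are illustrative placeholders, not any specific TAMP algorithm.

```python
import random

random.seed(0)

def feasible(action, grasp_angle):
    """Stand-in for a collision/kinematics check on a continuous parameter."""
    return abs(grasp_angle) < 0.5 if action == "pick" else True

def refine(plan, samples=50):
    """Try to bind each symbolic action to a feasible continuous parameter."""
    bindings = []
    for action in plan:
        for _ in range(samples):
            theta = random.uniform(-3.14, 3.14)
            if feasible(action, theta):
                bindings.append((action, theta))
                break
        else:
            return None            # refinement failed: prune this plan
    return bindings

# Candidate plans from the (here, trivial) logical search, shortest first.
candidate_plans = [["pick", "place"], ["push", "pick", "place"]]
for plan in candidate_plans:
    result = refine(plan)
    if result is not None:
        print([a for a, _ in result])
        break
```

The infinite branching factor shows up in `refine`: the continuous parameter cannot be enumerated, only sampled, and a plan that is logically valid can still die at the geometric level.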
How do societies learn and maintain social norms? Here we use multiagent reinforcement learning to investigate the learning dynamics of enforcement and compliance behaviors. Artificial agents populate a foraging environment and need to learn to avoid a poisonous berry. Agents learn to avoid eating poisonous berries better when doing so is taboo, meaning the behavior is punished by other agents. The taboo helps overcome a credit assignment problem in discovering the berries' delayed health effects. Critically, introducing an additional taboo, which results in punishment for...
Adversarial examples are a pervasive phenomenon of machine learning models where seemingly imperceptible perturbations to the input lead to misclassifications for otherwise statistically accurate models. We propose a geometric framework, drawing on tools from the manifold reconstruction literature, to analyze the high-dimensional geometry of adversarial examples. In particular, we highlight the importance of codimension: for low-dimensional data manifolds embedded in high-dimensional space, there are many directions off the manifold in which to construct...
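The codimension point can be illustrated numerically: data on a k-dimensional manifold in R^n leaves n - k off-manifold directions, and a classifier with small spurious weights in those directions (as trained models typically have) can be flipped by a per-coordinate step far smaller than the on-manifold change needed to cross the true decision boundary. All numbers are synthetic; this is a sketch of the geometric intuition, not the paper's construction.

```python
import numpy as np

n, k = 50, 1
codim = n - k                                  # 49 off-manifold directions
rng = np.random.default_rng(0)

w = np.zeros(n)
w[0] = 1.0                                     # true signal direction
w[1:] = 0.05 * rng.standard_normal(codim)      # small spurious weights

x = np.zeros(n)
x[0] = 0.5                                     # confidently classified point
margin = w @ x                                 # flipping on-manifold needs a 0.5 step

# Per-coordinate step just large enough to flip the sign, taken only in the
# off-manifold coordinates; it shrinks as the codimension grows.
eps = 1.1 * margin / np.abs(w[1:]).sum()
delta = np.zeros(n)
delta[1:] = -eps * np.sign(w[1:])

print(round(eps, 3), np.sign(w @ (x + delta)))
```

By construction `w @ (x + delta) = -0.1 * margin`, so the prediction flips, and the required per-coordinate budget `eps` falls roughly like 1/codim: more off-manifold directions make small adversarial perturbations easier to find.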
It is clear that one of the primary tools we can use to mitigate the potential risk from a misbehaving AI system is the ability to turn the system off. As the capabilities of AI systems improve, it is important to ensure that such systems do not adopt subgoals that prevent a human from switching them off. This is a challenge because many formulations of rational agents create strong incentives for self-preservation. This is not caused by a built-in instinct; rather, an agent will maximize expected utility and cannot achieve whatever objective it has been given if it is dead. Our goal is to study the incentives an agent has to allow...
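A small Monte Carlo sketch shows why uncertainty matters here: a robot that can act now, switch itself off, or wait for a human (who blocks the action exactly when its utility U is negative) weakly prefers waiting whenever it is uncertain about U. The Gaussian belief over U is an illustrative assumption, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(1)
U_samples = rng.normal(loc=0.3, scale=1.0, size=100_000)  # robot's belief over U

ev_act = U_samples.mean()                      # act now: get U regardless
ev_off = 0.0                                   # switch off: get 0
ev_wait = np.maximum(U_samples, 0.0).mean()    # rational human blocks when U < 0

print(f"act={ev_act:.3f} off={ev_off:.3f} wait={ev_wait:.3f}")
```

Since `max(U, 0) >= U` and `max(U, 0) >= 0` pointwise, waiting weakly dominates both alternatives under any belief; the gap (the value of deferring) grows with the robot's uncertainty and vanishes only when the robot is certain about U.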
All natural languages are structured hierarchically. In humans, this structural restriction is neurologically coded: when two grammars are presented with identical vocabularies, brain areas responsible for language processing are only sensitive to hierarchical grammars. Using large language models (LLMs), we investigate whether such functionally distinct hierarchical processing regions can arise solely from exposure to large-scale language distributions. We generate inputs using English, Italian, Japanese, or nonce words, varying the...
Leading AI developers and startups are increasingly deploying agentic systems that can plan and execute complex tasks with limited human involvement. However, there is currently no structured framework for documenting the technical components, intended uses, and safety features of agentic systems. To fill this gap, we introduce the AI Agent Index, the first public database to document information about currently deployed agentic systems. For each system that meets the criteria for inclusion in the index, we document the system's components (e.g., base model, reasoning...
Evaluations of large language model (LLM) risks and capabilities are increasingly being incorporated into AI risk management and governance frameworks. Currently, most risk evaluations are conducted by designing inputs that elicit harmful behaviors from the system. However, a fundamental limitation of this approach is that the harmfulness of the behaviors identified during any particular evaluation can only lower bound the model's worst-possible-case behavior. As a complementary method for eliciting harmful behaviors, we propose evaluating LLMs...
Nations across the world are working to govern AI. However, from a technical perspective, there is uncertainty and disagreement on the best way to do this. Meanwhile, recent debates over AI regulation have led to calls for "evidence-based policy," which emphasize holding regulatory action to a high evidentiary standard. Evidence is of irreplaceable value to policymaking. However, holding regulation to too high an evidentiary standard can lead to systematic neglect of certain risks. In historical policy debates (e.g., over tobacco ca. 1965 and fossil fuels ca. 1985), such rhetoric has also...
Fundamental to robotics is the debate between model-based and model-free learning: should the robot build an explicit model of the world, or learn a policy directly? In the context of HRI, part of the world to be modeled is the human. One option is for the robot to treat the human as a black box and learn how they act directly. But it can also model the human as an agent, and rely on a "theory of mind" to guide or bias the learning (grey box). We contribute a characterization of the performance of these methods under the optimistic case of having an ideal theory of mind, as well as different scenarios in which the assumptions...
The execution of long-horizon tasks under uncertainty is a fundamental challenge in robotics. Recent approaches have made headway on these problems through an integration of task and motion planning. In this paper, we present Interfaced Belief Space Planning (IBSP): a modular approach to task and motion planning in belief space. We use a task-independent interface layer to combine an off-the-shelf classical planner with motion planning and inference. We determinize the problem under the maximum likelihood observation assumption to obtain a deterministic representation where...
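The maximum likelihood observation determinization can be sketched in one dimension: each future observation is replaced by its most likely value under the current belief, which turns the stochastic belief-update into a deterministic one that a classical planner can reason over. The Gaussian belief and Kalman-style update are illustrative assumptions, not IBSP's full machinery.

```python
def kalman_update(mean, var, obs, obs_var):
    """Standard 1-D Gaussian belief update given an observation."""
    k = var / (var + obs_var)
    return mean + k * (obs - mean), (1 - k) * var

mean, var = 2.0, 1.0          # belief over, e.g., an object's position
obs_var = 0.5                 # sensor noise

# MLO determinization: assume we will observe exactly the belief mean.
mlo_obs = mean
new_mean, new_var = kalman_update(mean, var, mlo_obs, obs_var)
print(new_mean, round(new_var, 3))
```

Under the MLO assumption the mean is unchanged and only the variance contracts, so "take an observation" becomes a deterministic action with a predictable information gain, which is exactly what lets an off-the-shelf classical planner operate on the determinized problem.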
Our goal is to enable robots to time their motion in a way that is purposefully expressive of their internal states, making them more transparent to people. We start by investigating what types of states motion timing is capable of expressing, focusing on robot manipulation and keeping the path constant while systematically varying timing. We find that users naturally pick up on certain properties of the robot (like confidence), of the motion (like naturalness), or of the task (like the weight of the object being carried). We then conduct a hypothesis-driven experiment to tease out the directions...
We suggest that the analysis of incomplete contracting developed by law and economics researchers can provide a useful framework for understanding the AI alignment problem and help to generate a systematic approach to finding solutions. We first provide an overview of the incomplete contracting literature and explore parallels between this work and AI alignment. As we emphasize, misalignment between principal and agent is a core focus of economic analysis. We highlight some technical results from the economics of incomplete contracts that may provide insights for alignment researchers. Our core contribution, however, is to bring to bear...
As governments and industry turn to increased use of automated decision systems, it becomes essential to consider how closely such systems can reproduce human judgment. We identify a core potential failure, finding that annotators label objects differently depending on whether they are being asked a factual question or a normative question. This challenges a natural assumption maintained in many standard machine-learning (ML) data acquisition procedures: that there is no difference between predicting the...
Recent work [1], [2] has shown promising results in enabling robotic manipulation of deformable objects through learning from demonstrations. Their method computes a registration from the training scene to the test scene, and then applies an extrapolation of this registration to the gripper motion in the training scene to obtain the gripper motion for the test scene. The warping cost of scene-to-scene registrations is used to determine the nearest neighbor from a set of training demonstrations. Then, once the demonstration has been generalized to the new situation, they apply trajectory optimization [3] to plan robot motions that will track the predicted...
Reward functions are easy to misspecify; although designers can make corrections after observing mistakes, an agent pursuing a misspecified reward function can irreversibly change the state of its environment. If that change precludes optimization of the correctly specified reward function, then correction is futile. For example, a robotic factory assistant could break expensive equipment due to a reward misspecification; even if the designers immediately correct the reward function, the damage is done. To mitigate this risk, we introduce an approach that balances optimization of the primary reward function with...
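One way to make this balance concrete is an attainable-utility-style penalty: the agent's effective reward at a state-action pair is the primary reward minus a scaled measure of how much the action changes its attainable value for auxiliary reward functions, relative to doing nothing. The Q-values below are toy numbers and the scenario is invented for illustration; only the penalty's general form follows the attainable utility preservation idea.

```python
def aup_reward(r_primary, q_aux_action, q_aux_noop, lam=0.1):
    """Primary reward minus a penalty on shifts in attainable auxiliary value."""
    penalty = sum(abs(qa - qn) for qa, qn in zip(q_aux_action, q_aux_noop))
    return r_primary - lam * penalty

# "Break the vase to reach the goal faster" vs. "go around it":
q_noop = [5.0, 3.0]                           # attainable auxiliary values if idle
print(aup_reward(1.0, [0.5, 0.2], q_noop))    # destructive shortcut
print(aup_reward(0.8, [4.9, 3.0], q_noop))    # conservative route
```

The destructive shortcut earns more primary reward but destroys most of the attainable auxiliary value, so after the penalty the conservative route scores higher; irreversible side effects are discouraged even though the primary reward never mentions them.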
Learning the preferences implicit in the choices humans make is a well-studied problem in both economics and computer science. However, most work makes the assumption that humans are acting (noisily) optimally with respect to their preferences. Such approaches can fail when people are themselves learning about what they want. In this work, we introduce the assistive multi-armed bandit, where a robot assists a human playing a bandit task to maximize cumulative reward. In this problem, the robot does not know the reward function but can learn it through...
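The setting can be simulated in miniature: a human who is still learning plays a two-armed bandit, and the robot never sees rewards, only the human's pulls, from which it infers the preferred arm. The epsilon-greedy human model with decaying exploration, and the robot's "trust late pulls" heuristic, are illustrative assumptions, not the paper's algorithms.

```python
import numpy as np

rng = np.random.default_rng(2)
true_means = [0.2, 0.8]            # unknown to the robot; the human learns them

est, counts = [0.0, 0.0], [0, 0]   # the human's running value estimates
pulls = []
for t in range(500):
    eps = 1.0 / (t + 1) ** 0.5     # the human explores less over time
    arm = rng.integers(2) if rng.random() < eps else int(np.argmax(est))
    reward = rng.normal(true_means[arm], 0.1)
    counts[arm] += 1
    est[arm] += (reward - est[arm]) / counts[arm]   # incremental mean
    pulls.append(int(arm))

# Robot's inference from behavior alone: late pulls reflect learned preference,
# whereas early pulls are dominated by exploration and would mislead it.
robot_choice = int(np.bincount(pulls[-100:], minlength=2).argmax())
print(robot_choice)
```

The point of the setup is visible in the data: an inference rule that treated all pulls as noisily optimal would be confused by the exploratory early rounds, while conditioning on the human's learning process recovers the true preference.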