- Reinforcement Learning in Robotics
- Advanced Bandit Algorithms Research
- Adversarial Robustness in Machine Learning
- Machine Learning and Algorithms
- Smart Grid Energy Management
- Optimization and Search Problems
- IPv6, Mobility, Handover, Networks, Security
- Network Security and Intrusion Detection
- Machine Learning and Data Classification
- Software Engineering Research
- Energy Load and Power Forecasting
- Explainable Artificial Intelligence (XAI)
- Data Stream Mining Techniques
- Topic Modeling
- Fault Detection and Control Systems
- Generative Adversarial Networks and Image Synthesis
- Network Packet Processing and Optimization
- Energy Efficiency and Management
- Access Control and Trust
- Model Reduction and Neural Networks
- Advanced Authentication Protocols Security
- Digital Communication and Language
- Topological Materials and Phenomena
- Digital Media Forensic Detection
- Infrastructure Resilience and Vulnerability Analysis
ETH Zurich
2019-2023
University of Cologne
2018-2019
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals. RLHF has emerged as the central method used to finetune state-of-the-art large language models (LLMs). Despite this popularity, there has been relatively little public work systematizing its flaws. In this paper, we (1) survey open problems and fundamental limitations of RLHF and related methods; (2) overview techniques to understand, improve, and complement RLHF in practice; and (3) propose auditing and disclosure...
Stable Diffusion is a recent open-source image generation model comparable to proprietary models such as DALLE, Imagen, or Parti. Stable Diffusion comes with a safety filter that aims to prevent generating explicit images. Unfortunately, the filter is obfuscated and poorly documented. This makes it hard for users to prevent misuse in their applications, and to understand the filter's limitations and improve it. We first show that it is easy to generate disturbing content that bypasses the safety filter. We then reverse-engineer the filter and find that, while it aims to prevent sexual content, it ignores violence, gore,...
To understand the risks posed by a new AI system, we must understand what it can and cannot do. Building on prior work, we introduce a programme of "dangerous capability" evaluations and pilot them on Gemini 1.0 models. Our evaluations cover four areas: (1) persuasion and deception; (2) cyber-security; (3) self-proliferation; and (4) self-reasoning. We do not find evidence of strong dangerous capabilities in the models we evaluated, but we flag early warning signs. Our goal is to help advance a rigorous science of dangerous capability evaluation, in preparation for future models.
Future advanced AI systems may learn sophisticated strategies through reinforcement learning (RL) that humans cannot understand well enough to safely evaluate. We propose a training method which avoids agents learning undesired multi-step plans that receive high reward (multi-step "reward hacks") even if humans are not able to detect that the behaviour is undesired. The method, Myopic Optimization with Non-myopic Approval (MONA), works by combining short-sighted optimization with far-sighted reward. We demonstrate that MONA can...
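A toy illustration of the idea, assuming a tabular setting and a hypothetical `approval` overseer signal: the update target contains only the immediate reward plus the overseer's far-sighted approval, with no bootstrapping over future steps, so the optimizer itself never reinforces multi-step plans. This is a minimal sketch of the principle, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, alpha = 10, 4, 0.1
Q = np.zeros((n_states, n_actions))

def env_step(state, action):
    # Hypothetical environment: random next state and immediate reward.
    return rng.integers(n_states), rng.random()

def approval(state, action):
    # Stand-in for a far-sighted overseer scoring the action in context.
    return float(action == 0)

state = 0
for _ in range(1000):
    action = Q[state].argmax() if rng.random() > 0.1 else rng.integers(n_actions)
    next_state, reward = env_step(state, action)
    # Myopic target: no gamma * max_a' Q[next_state, a'] term.
    target = reward + approval(state, action)
    Q[state, action] += alpha * (target - Q[state, action])
    state = next_state
```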
The ability to track and monitor relevant and important news in real-time is of crucial interest to multiple industrial sectors. In this work, we focus on the set of cryptocurrency news, which recently became of emerging interest to the general and financial audience. In order to track them in real-time, we (i) match news from the web with tweets from social media, (ii) track their intraday tweet activity, and (iii) explore different machine learning models for predicting the number of article mentions on Twitter within the first 24 hours after its publication. We compare several...
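A hedged sketch of step (iii), comparing regression models for the 24-hour mention count; the feature matrix and target below are placeholders, since the paper's actual features and data are not shown here.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))       # placeholder article features (source, timing, early activity, ...)
y = rng.poisson(lam=20, size=500)   # placeholder target: tweet mentions within 24 hours

for name, model in [("ridge", Ridge()), ("gbrt", GradientBoostingRegressor())]:
    mae = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error").mean()
    print(f"{name}: MAE = {mae:.2f}")
```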
We show how to "compile" human-readable programs into standard decoder-only transformer models. Our compiler, Tracr, generates models with known structure. This structure can be used to design experiments. For example, we use it to study "superposition" in transformers that execute multi-step algorithms. Additionally, the known structure of Tracr-compiled models can serve as ground-truth for evaluating interpretability methods. Commonly, because the "programs" learned by transformers are unknown, it is unclear whether an interpretation...
Scalable oversight protocols aim to enable humans to accurately supervise superhuman AI. In this paper we study debate, where two AI's compete to convince a judge; consultancy, where a single AI tries to convince a judge that asks questions; and compare to a baseline of direct question-answering, where the judge just answers outright without the AI. We use large language models (LLMs) as both the AI agents and as stand-ins for human judges, taking the judge models to be weaker than the agent models. We benchmark on a diverse range of asymmetries between judges and agents, extending previous...
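A minimal sketch of the debate protocol described above, with a placeholder `llm` call standing in for whichever LLM API is used; the prompts and turn structure are illustrative assumptions, not the paper's exact setup.

```python
def llm(prompt: str) -> str:
    """Placeholder for a call to an LLM API; hypothetical, to be replaced."""
    raise NotImplementedError

def debate(question: str, answer_a: str, answer_b: str, turns: int = 3) -> str:
    """Two debaters argue for fixed answers; a (weaker) judge model decides."""
    transcript = []
    for _ in range(turns):
        for side, answer in (("A", answer_a), ("B", answer_b)):
            argument = llm(
                f"Question: {question}\nYou argue that the answer is {side}: {answer}.\n"
                f"Transcript so far: {transcript}\nGive your next argument."
            )
            transcript.append((side, argument))
    return llm(
        f"Question: {question}\nA: {answer_a}\nB: {answer_b}\n"
        f"Debate transcript: {transcript}\nWhich answer is better supported, A or B?"
    )
```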
We propose Convex Constraint Learning for Reinforcement Learning (CoCoRL), a novel approach for inferring shared constraints in a Constrained Markov Decision Process (CMDP) from a set of safe demonstrations with possibly different reward functions. While previous work is limited to demonstrations with known rewards or fully known environment dynamics, CoCoRL can learn constraints from demonstrations with unknown rewards without knowledge of the environment dynamics. CoCoRL constructs a convex safe set based on the demonstrations, which provably guarantees safety even for potentially sub-optimal (but safe)...
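To illustrate the convexity idea, here is a small sketch assuming constraints that are linear in policy feature expectations (as in CMDPs): any policy whose feature expectations lie in the convex hull of the safe demonstrations' feature expectations satisfies every constraint those demonstrations satisfy. The numbers are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def in_demo_hull(policy_features, demo_features):
    """Feasibility LP: is policy_features a convex combination of the rows of demo_features?"""
    n = demo_features.shape[0]
    A_eq = np.vstack([demo_features.T, np.ones((1, n))])
    b_eq = np.concatenate([policy_features, [1.0]])
    result = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n)
    return result.success

demos = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # feature expectations of safe demos
candidate = np.array([0.6, 0.6])                         # feature expectations of a new policy
print("inside inferred safe set:", in_demo_hull(candidate, demos))
```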
Inverse Reinforcement Learning (IRL) is a powerful paradigm for inferring a reward function from expert demonstrations. Many IRL algorithms require a known transition model and sometimes even a known expert policy, or they at least require access to a generative model. However, these assumptions are too strong for many real-world applications, where the environment can be accessed only through sequential interaction. We propose a novel IRL algorithm: Active exploration for Inverse Reinforcement Learning (AceIRL), which actively explores an unknown environment and expert policy to quickly...
Learning optimal control policies directly on physical systems is challenging, since even a single failure can lead to costly hardware damage. Most existing model-free learning methods that guarantee safety, i.e., no failures, during exploration are limited to local optima. A notable exception is the GoSafe algorithm, which, unfortunately, cannot handle high-dimensional systems and hence cannot be applied to most real-world dynamical systems. This work proposes GoSafeOpt as the first algorithm that can safely discover globally...
Reinforcement learning (RL) requires either manually specifying a reward function, which is often infeasible, or learning a reward model from a large amount of human feedback, which is often very expensive. We study a more sample-efficient alternative: using pretrained vision-language models (VLMs) as zero-shot reward models (RMs) to specify tasks via natural language. We propose a natural and general approach to using VLMs as reward models, which we call VLM-RMs. We use VLM-RMs based on CLIP to train a MuJoCo humanoid to learn complex tasks without a manually specified reward function, such as kneeling, doing the splits,...
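The core mechanism can be sketched in a few lines, assuming Hugging Face's CLIP implementation: embed the rendered frame and a natural-language task description, and use their cosine similarity as the reward. The model name, prompt, and `env.render()` usage are illustrative, not the paper's exact configuration.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_reward(frame: Image.Image, task: str = "a humanoid robot kneeling") -> float:
    """Cosine similarity between the frame embedding and the task-description embedding."""
    inputs = processor(text=[task], images=frame, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    image_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((image_emb * text_emb).sum())

# e.g. reward = clip_reward(env.render())  # hypothetical call on each rendered frame
```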
Semimetals, in which conduction and valence bands touch but do not form Fermi surfaces, have attracted considerable interest for their anomalous properties, starting with the discovery of Dirac matter in graphene and other two-dimensional honeycomb materials. Here we introduce a family of three-dimensional systems whose electronic band structures exhibit a variety of topological semimetals with nodal lines. We show that these nodal lines appear in varying numbers and mutual geometries, depending on the underlying lattice...
Reinforcement Learning from Human Feedback (RLHF) has become a powerful tool to fine-tune or train agentic machine learning models. Similar to how humans interact in social contexts, we can use many types of feedback to communicate our preferences, intentions, and knowledge to an RL agent. However, applications of human feedback in RL are often limited in scope and disregard human factors. In this work, we bridge the gap between machine learning and human-computer interaction efforts by developing a shared understanding of feedback in interactive learning scenarios. We first introduce...
Machine Learning (ML) increasingly informs the allocation of opportunities to individuals and communities in areas such as lending, education, employment, and beyond. Such decisions often impact their subjects' future characteristics and capabilities in an a priori unknown fashion. The decision-maker, therefore, faces exploration-exploitation dilemmas akin to those in multi-armed bandits. Following prior work, we model communities as arms. To capture the long-term effects of ML-based decisions, we study a setting in which the reward from each arm...
We sketch how developers of frontier AI systems could construct a structured rationale -- a 'safety case' -- that an AI system is unlikely to cause catastrophic outcomes through scheming. Scheming is a potential threat model where AI systems pursue misaligned goals covertly, hiding their true capabilities and objectives. In this report, we propose three arguments that safety cases could use in relation to scheming. For each argument, we sketch how evidence could be gathered from empirical evaluations, and what assumptions would need to be met to provide strong assurance...
Using vision-language models (VLMs) as reward models in reinforcement learning holds promise for reducing costs and improving safety. So far, VLM reward models have only been used for goal-oriented tasks, where the agent must reach a particular final outcome. We explore VLMs' potential to supervise tasks that cannot be scored by the final state alone. To this end, we introduce ViSTa, a dataset for evaluating Vision-based understanding of Sequential Tasks. ViSTa comprises over 4,000 videos with step-by-step descriptions in virtual...
We propose a suite of tasks to evaluate the instrumental self-reasoning ability of large language model (LLM) agents. Instrumental self-reasoning could improve adaptability and enable self-modification, but it could also pose significant risks, such as enabling deceptive alignment. Prior work has only evaluated self-reasoning in non-agentic settings or in limited domains. In this paper, we propose evaluations for instrumental self-reasoning ability in agentic settings across a wide range of scenarios, including knowledge seeking and opaque self-reasoning. We evaluate agents built using state-of-the-art LLMs,...
For many reinforcement learning (RL) applications, specifying a reward function is difficult. This paper considers an RL setting where the agent obtains information about the reward only by querying an expert that can, for example, evaluate individual states or provide binary preferences over trajectories. From such expensive feedback, we aim to learn a model of the reward that allows standard RL algorithms to achieve high expected returns with as few expert queries as possible. To this end, we propose Information Directed Reward Learning (IDRL),...
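The selection principle behind such query-efficient reward learning can be sketched with a Bayesian linear reward model: maintain a Gaussian posterior over reward weights, identify the two candidate policies that currently look best, and query the state whose answer most reduces uncertainty about their return difference. Everything below (features, noise model, candidates) is a simplified stand-in for the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, noise_var = 4, 0.1
mu, Sigma = np.zeros(d), np.eye(d)               # Gaussian posterior over reward weights w
policy_features = rng.normal(size=(5, d))        # feature expectations of candidate policies
query_features = rng.normal(size=(20, d))        # features of states the expert could evaluate

def posterior_update(mu, Sigma, phi, y):
    """Bayesian linear-Gaussian update after observing y ~ w^T phi."""
    s = Sigma @ phi
    k = s / (phi @ s + noise_var)
    return mu + k * (y - phi @ mu), Sigma - np.outer(k, s)

def pick_query():
    returns = policy_features @ mu
    order = np.argsort(returns)
    gap = policy_features[order[-1]] - policy_features[order[-2]]  # best vs. runner-up
    def gap_variance_after(phi):
        # The posterior covariance does not depend on the observed value, so y=0 suffices.
        _, Sigma_new = posterior_update(mu, Sigma, phi, 0.0)
        return gap @ Sigma_new @ gap
    return min(range(len(query_features)), key=lambda i: gap_variance_after(query_features[i]))

print("index of the most informative state to query:", pick_query())
```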
Reinforcement learning (RL) commonly assumes access to well-specified reward functions, which many practical applications do not provide. Instead, recently, more work has explored learning what to do from interacting with humans. So far, most of these approaches model humans as being (noisily) rational and, in particular, as giving unbiased feedback. We argue that these models are too simplistic and that RL researchers need to develop more realistic human models to design and evaluate their algorithms. In particular, we argue that human models have to be personal, contextual,...
Designing reward functions for reinforcement learning is difficult: besides specifying which behavior is rewarded for a task, the reward function also has to discourage undesired outcomes. Misspecified reward functions can lead to unintended negative side effects and overall unsafe behavior. To overcome this problem, recent work proposed to augment the specified reward function with an impact regularizer that discourages behavior that has a big impact on the environment. Although initial results with impact regularizers seem promising in mitigating some types of side effects, important challenges...
To use reinforcement learning from human feedback (RLHF) in practical applications, it is crucial to learn reward models from diverse sources of human feedback and to consider the human factors involved in providing feedback of different types. However, the systematic study of diverse feedback types is held back by the limited standardized tooling available to researchers. To bridge this gap, we propose RLHF-Blender, a configurable, interactive interface for learning from diverse human feedback. RLHF-Blender provides a modular experimentation framework and implementation that enables researchers...