- Reinforcement Learning in Robotics
- Advanced Bandit Algorithms Research
- Adversarial Robustness in Machine Learning
- Machine Learning and Algorithms
- Smart Grid Energy Management
- Optimization and Search Problems
- IPv6, Mobility, Handover, Networks, Security
- Network Security and Intrusion Detection
- Machine Learning and Data Classification
- Software Engineering Research
- Energy Load and Power Forecasting
- Explainable Artificial Intelligence (XAI)
- Data Stream Mining Techniques
- Topic Modeling
- Fault Detection and Control Systems
- Generative Adversarial Networks and Image Synthesis
- Network Packet Processing and Optimization
- Energy Efficiency and Management
- Access Control and Trust
- Model Reduction and Neural Networks
- Advanced Authentication Protocols Security
- Digital Communication and Language
- Topological Materials and Phenomena
- Digital Media Forensic Detection
- Infrastructure Resilience and Vulnerability Analysis
ETH Zurich
2019-2023
University of Cologne
2018-2019
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals. RLHF has emerged as the central method used to finetune state-of-the-art large language models (LLMs). Despite this popularity, there has been relatively little public work systematizing its flaws. In this paper, we (1) survey open problems and fundamental limitations of RLHF and related methods; (2) overview techniques to understand, improve, and complement RLHF in practice; and (3) propose auditing and disclosure...
Stable Diffusion is a recent open-source image generation model comparable to proprietary models such as DALLE, Imagen, or Parti. Stable Diffusion comes with a safety filter that aims to prevent generating explicit images. Unfortunately, the filter is obfuscated and poorly documented. This makes it hard for users to prevent misuse in their applications, and to understand the filter's limitations and improve it. We first show that it is easy to generate disturbing content that bypasses the safety filter. We then reverse-engineer the filter and find that, while it aims to prevent sexual content, it ignores violence, gore,...
To understand the risks posed by a new AI system, we must understand what it can and cannot do. Building on prior work, we introduce a programme of "dangerous capability" evaluations and pilot them on Gemini 1.0 models. Our evaluations cover four areas: (1) persuasion and deception; (2) cyber-security; (3) self-proliferation; and (4) self-reasoning. We do not find evidence of strong dangerous capabilities in the models we evaluated, but we flag early warning signs. Our goal is to help advance a rigorous science of dangerous capability evaluation, in preparation for future models.
Future advanced AI systems may learn sophisticated strategies through reinforcement learning (RL) that humans cannot understand well enough to safely evaluate. We propose a training method which avoids agents learning undesired multi-step plans that receive high reward (multi-step "reward hacks") even if humans are not able to detect that the behaviour is undesired. The method, Myopic Optimization with Non-myopic Approval (MONA), works by combining short-sighted optimization with far-sighted reward. We demonstrate that MONA can...
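A toy illustration of the idea, assuming a tabular setting and a hypothetical `approval` overseer signal: the update target contains only the immediate reward plus the overseer's far-sighted approval, with no bootstrapping over future steps, so the optimizer itself never reinforces multi-step plans. This is a minimal sketch of the principle, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, alpha = 10, 4, 0.1
Q = np.zeros((n_states, n_actions))

def env_step(state, action):
    # Hypothetical environment: random next state and immediate reward.
    return rng.integers(n_states), rng.random()

def approval(state, action):
    # Stand-in for a far-sighted overseer scoring the action in context.
    return float(action == 0)

state = 0
for _ in range(1000):
    action = Q[state].argmax() if rng.random() > 0.1 else rng.integers(n_actions)
    next_state, reward = env_step(state, action)
    # Myopic target: no gamma * max_a' Q[next_state, a'] term.
    target = reward + approval(state, action)
    Q[state, action] += alpha * (target - Q[state, action])
    state = next_state
```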
The ability to track and monitor relevant and important news in real-time is of crucial interest to multiple industrial sectors. In this work, we focus on the set of cryptocurrency news, which recently became of emerging interest to the general and financial audience. In order to track them in real-time, we (i) match news from the web with tweets from social media, (ii) track their intraday tweet activity, and (iii) explore different machine learning models for predicting the number of article mentions on Twitter within the first 24 hours after its publication. We compare several...
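A hedged sketch of step (iii), comparing regression models for the 24-hour mention count; the feature matrix and target below are placeholders, since the paper's actual features and data are not shown here.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))       # placeholder article features (source, timing, early activity, ...)
y = rng.poisson(lam=20, size=500)   # placeholder target: tweet mentions within 24 hours

for name, model in [("ridge", Ridge()), ("gbrt", GradientBoostingRegressor())]:
    mae = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error").mean()
    print(f"{name}: MAE = {mae:.2f}")
```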
We show how to "compile" human-readable programs into standard decoder-only transformer models. Our compiler, Tracr, generates models with known structure. This structure can be used to design experiments. For example, we use it to study "superposition" in transformers that execute multi-step algorithms. Additionally, the known structure of Tracr-compiled models can serve as ground-truth for evaluating interpretability methods. Commonly, because the "programs" learned by transformers are unknown, it is unclear whether an interpretation...
Scalable oversight protocols aim to enable humans to accurately supervise superhuman AI. In this paper we study debate, where two AI's compete to convince a judge; consultancy, where a single AI tries to convince a judge that asks questions; and compare to a baseline of direct question-answering, where the judge just answers outright without the AI. We use large language models (LLMs) as both the AI agents and as stand-ins for human judges, taking the judge models to be weaker than the agent models. We benchmark on a diverse range of asymmetries between judges and agents, extending previous...
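A minimal sketch of the debate protocol described above, with a placeholder `llm` call standing in for whichever LLM API is used; the prompts and turn structure are illustrative assumptions, not the paper's exact setup.

```python
def llm(prompt: str) -> str:
    """Placeholder for a call to an LLM API; hypothetical, to be replaced."""
    raise NotImplementedError

def debate(question: str, answer_a: str, answer_b: str, turns: int = 3) -> str:
    """Two debaters argue for fixed answers; a (weaker) judge model decides."""
    transcript = []
    for _ in range(turns):
        for side, answer in (("A", answer_a), ("B", answer_b)):
            argument = llm(
                f"Question: {question}\nYou argue that the answer is {side}: {answer}.\n"
                f"Transcript so far: {transcript}\nGive your next argument."
            )
            transcript.append((side, argument))
    return llm(
        f"Question: {question}\nA: {answer_a}\nB: {answer_b}\n"
        f"Debate transcript: {transcript}\nWhich answer is better supported, A or B?"
    )
```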
We propose Convex Constraint Learning for Reinforcement Learning (CoCoRL), a novel approach for inferring shared constraints in a Constrained Markov Decision Process (CMDP) from a set of safe demonstrations with possibly different reward functions. While previous work is limited to demonstrations with known rewards or fully known environment dynamics, CoCoRL can learn constraints from demonstrations with unknown rewards without knowledge of the environment dynamics. CoCoRL constructs a convex safe set based on the demonstrations, which provably guarantees safety even for potentially sub-optimal (but safe)...
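To illustrate the convexity idea, here is a small sketch assuming constraints that are linear in policy feature expectations (as in CMDPs): any policy whose feature expectations lie in the convex hull of the safe demonstrations' feature expectations satisfies every constraint those demonstrations satisfy. The numbers are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def in_demo_hull(policy_features, demo_features):
    """Feasibility LP: is policy_features a convex combination of the rows of demo_features?"""
    n = demo_features.shape[0]
    A_eq = np.vstack([demo_features.T, np.ones((1, n))])
    b_eq = np.concatenate([policy_features, [1.0]])
    result = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n)
    return result.success

demos = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # feature expectations of safe demos
candidate = np.array([0.6, 0.6])                         # feature expectations of a new policy
print("inside inferred safe set:", in_demo_hull(candidate, demos))
```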
Inverse Reinforcement Learning (IRL) is a powerful paradigm for inferring a reward function from expert demonstrations. Many IRL algorithms require a known transition model and sometimes even a known expert policy, or they at least require access to a generative model. However, these assumptions are too strong for many real-world applications, where the environment can be accessed only through sequential interaction. We propose a novel IRL algorithm: Active exploration for Inverse Reinforcement Learning (AceIRL), which actively explores an unknown environment and expert policy to quickly...
Learning optimal control policies directly on physical systems is challenging, since even a single failure can lead to costly hardware damage. Most existing model-free learning methods that guarantee safety, i.e., no failures, during exploration are limited to local optima. A notable exception is the GoSafe algorithm, which, unfortunately, cannot handle high-dimensional systems and hence cannot be applied to most real-world dynamical systems. This work proposes GoSafeOpt as the first algorithm that can safely discover globally...
Reinforcement learning (RL) requires either manually specifying a reward function, which is often infeasible, or learning a reward model from a large amount of human feedback, which is often very expensive. We study a more sample-efficient alternative: using pretrained vision-language models (VLMs) as zero-shot reward models (RMs) to specify tasks via natural language. We propose a natural and general approach to using VLMs as reward models, which we call VLM-RMs. We use VLM-RMs based on CLIP to train a MuJoCo humanoid to learn complex tasks without a manually specified reward function, such as kneeling, doing the splits,...
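The core mechanism can be sketched in a few lines, assuming Hugging Face's CLIP implementation: embed the rendered frame and a natural-language task description, and use their cosine similarity as the reward. The model name, prompt, and `env.render()` usage are illustrative, not the paper's exact configuration.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_reward(frame: Image.Image, task: str = "a humanoid robot kneeling") -> float:
    """Cosine similarity between the frame embedding and the task-description embedding."""
    inputs = processor(text=[task], images=frame, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    image_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((image_emb * text_emb).sum())

# e.g. reward = clip_reward(env.render())  # hypothetical call on each rendered frame
```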
Semimetals, in which conduction and valence bands touch but do not form Fermi surfaces, have attracted considerable interest for their anomalous properties, starting with the discovery of Dirac matter in graphene and other two-dimensional honeycomb materials. Here we introduce a family of three-dimensional systems whose electronic band structures exhibit a variety of topological semimetals with nodal lines. We show that these nodal lines appear in varying numbers and mutual geometries, depending on the underlying lattice...
Reinforcement Learning from Human Feedback (RLHF) has become a powerful tool to fine-tune or train agentic machine learning models. Similar to how humans interact in social contexts, we can use many types of feedback to communicate our preferences, intentions, and knowledge to an RL agent. However, applications of human feedback in RL are often limited in scope and disregard human factors. In this work, we bridge the gap between machine learning and human-computer interaction efforts by developing a shared understanding of feedback in interactive learning scenarios. We first introduce...
Machine Learning (ML) increasingly informs the allocation of opportunities to individuals and communities in areas such as lending, education, employment, and beyond. Such decisions often impact their subjects' future characteristics and capabilities in an a priori unknown fashion. The decision-maker, therefore, faces exploration-exploitation dilemmas akin to those in multi-armed bandits. Following prior work, we model communities as arms. To capture the long-term effects of ML-based decisions, we study a setting in which the reward from each arm...
We sketch how developers of frontier AI systems could construct a structured rationale -- a 'safety case' -- that an AI system is unlikely to cause catastrophic outcomes through scheming. Scheming is a potential threat model where AI systems pursue misaligned goals covertly, hiding their true capabilities and objectives. In this report, we propose three arguments that safety cases could use in relation to scheming. For each argument, we sketch how evidence could be gathered from empirical evaluations, and what assumptions would need to be met to provide strong assurance...
Using vision-language models (VLMs) as reward models in reinforcement learning holds promise for reducing costs and improving safety. So far, VLM reward models have only been used for goal-oriented tasks, where the agent must reach a particular final outcome. We explore VLMs' potential to supervise tasks that cannot be scored by the final state alone. To this end, we introduce ViSTa, a dataset for evaluating Vision-based understanding of Sequential Tasks. ViSTa comprises over 4,000 videos with step-by-step descriptions in virtual...
We propose a suite of tasks to evaluate the instrumental self-reasoning ability of large language model (LLM) agents. Instrumental self-reasoning could improve adaptability and enable self-modification, but it could also pose significant risks, such as enabling deceptive alignment. Prior work has only evaluated self-reasoning in non-agentic settings or in limited domains. In this paper, we propose evaluations for instrumental self-reasoning ability in agentic settings across a wide range of scenarios, including knowledge seeking and opaque self-reasoning. We evaluate agents built using state-of-the-art LLMs,...
For many reinforcement learning (RL) applications, specifying a reward function is difficult. This paper considers an RL setting where the agent obtains information about the reward only by querying an expert that can, for example, evaluate individual states or provide binary preferences over trajectories. From such expensive feedback, we aim to learn a model of the reward that allows standard RL algorithms to achieve high expected returns with as few expert queries as possible. To this end, we propose Information Directed Reward Learning (IDRL),...
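The selection principle behind such query-efficient reward learning can be sketched with a Bayesian linear reward model: maintain a Gaussian posterior over reward weights, identify the two candidate policies that currently look best, and query the state whose answer most reduces uncertainty about their return difference. Everything below (features, noise model, candidates) is a simplified stand-in for the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, noise_var = 4, 0.1
mu, Sigma = np.zeros(d), np.eye(d)               # Gaussian posterior over reward weights w
policy_features = rng.normal(size=(5, d))        # feature expectations of candidate policies
query_features = rng.normal(size=(20, d))        # features of states the expert could evaluate

def posterior_update(mu, Sigma, phi, y):
    """Bayesian linear-Gaussian update after observing y ~ w^T phi."""
    s = Sigma @ phi
    k = s / (phi @ s + noise_var)
    return mu + k * (y - phi @ mu), Sigma - np.outer(k, s)

def pick_query():
    returns = policy_features @ mu
    order = np.argsort(returns)
    gap = policy_features[order[-1]] - policy_features[order[-2]]  # best vs. runner-up
    def gap_variance_after(phi):
        # The posterior covariance does not depend on the observed value, so y=0 suffices.
        _, Sigma_new = posterior_update(mu, Sigma, phi, 0.0)
        return gap @ Sigma_new @ gap
    return min(range(len(query_features)), key=lambda i: gap_variance_after(query_features[i]))

print("index of the most informative state to query:", pick_query())
```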
Reinforcement learning (RL) commonly assumes access to well-specified reward functions, which many practical applications do not provide. Instead, recently, more work has explored learning what to do from interacting with humans. So far, most of these approaches model humans as being (noisily) rational and, in particular, as giving unbiased feedback. We argue that these models are too simplistic and that RL researchers need to develop more realistic human models to design and evaluate their algorithms. In particular, we argue that human models have to be personal, contextual,...
Designing reward functions for reinforcement learning is difficult: besides specifying which behavior is rewarded for a task, the reward function also has to discourage undesired outcomes. Misspecified reward functions can lead to unintended negative side effects and overall unsafe behavior. To overcome this problem, recent work proposed to augment the specified reward function with an impact regularizer that discourages behavior that has a big impact on the environment. Although initial results with impact regularizers seem promising in mitigating some types of side effects, important challenges...
To use reinforcement learning from human feedback (RLHF) in practical applications, it is crucial to learn reward models from diverse sources of human feedback and to consider the human factors involved in providing feedback of different types. However, the systematic study of diverse feedback types is held back by the limited standardized tooling available to researchers. To bridge this gap, we propose RLHF-Blender, a configurable, interactive interface for learning from diverse human feedback. RLHF-Blender provides a modular experimentation framework and implementation that enables researchers...