Tomasz Korbak

ORCID: 0000-0002-6258-2013
Research Areas
  • Topic Modeling
  • Natural Language Processing Techniques
  • Language and cultural evolution
  • Embodied and Extended Cognition
  • Philosophy and History of Science
  • Origins and Evolution of Life
  • Speech Recognition and Synthesis
  • Speech and dialogue systems
  • Multimodal Machine Learning Applications
  • Explainable Artificial Intelligence (XAI)
  • Cognitive Science and Education Research
  • Machine Learning and Data Classification
  • Semantic Web and Ontologies
  • Neural Networks and Applications
  • Neural dynamics and brain function
  • Hate Speech and Cyberbullying Detection
  • Software Engineering Research
  • Reinforcement Learning in Robotics
  • Management and Organizational Practices
  • Law, AI, and Intellectual Property
  • Education and Cultural Studies
  • Catholicism, Bioethics, Media, Education
  • Diverse Scientific and Engineering Research
  • Paranormal Experiences and Beliefs
  • Psychiatry, Mental Health, Neuroscience

Institute of Philosophy and Sociology
2019-2022

University of Sussex
2021-2022

Polish Academy of Sciences
2019-2022

Sussex County Community College
2022

New York University
2022

University of Warsaw
2015-2021

Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals. RLHF has emerged as the central method used to finetune state-of-the-art large language models (LLMs). Despite this popularity, there has been relatively little public work systematizing its flaws. In this paper, we (1) survey open problems and fundamental limitations of RLHF and related methods; (2) overview techniques to understand, improve, and complement RLHF in practice; and (3) propose auditing and disclosure...

10.48550/arxiv.2307.15217 preprint EN cc-by arXiv (Cornell University) 2023-01-01

Language models (LMs) are pretrained to imitate internet text, including content that would violate human preferences if generated by an LM: falsehoods, offensive comments, personally identifiable information, low-quality or buggy code, and more. Here, we explore alternative objectives for pretraining LMs in a way that also guides them to generate text aligned with human preferences. We benchmark five objectives for pretraining with human feedback across three tasks and study how they affect the trade-off between alignment and capabilities of pretrained LMs...

10.48550/arxiv.2302.08582 preprint EN cc-by arXiv (Cornell University) 2023-01-01
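
A minimal sketch of one such pretraining objective, conditional training, in which each training segment is prefixed with a control token derived from a reward score so that generation can later be conditioned on the "good" token. The reward function, threshold, and token names below are illustrative placeholders, not the exact setup of the paper.

    # Conditional training sketch: tag pretraining text with reward-derived control tokens.
    GOOD, BAD = "<|good|>", "<|bad|>"

    def score_text(text: str) -> float:
        """Placeholder reward: 1.0 if the text contains an unwanted word, else 0.0."""
        return float(any(w in text.lower() for w in ("idiot", "stupid")))

    def to_conditional_example(text: str, threshold: float = 0.5) -> str:
        token = GOOD if score_text(text) < threshold else BAD
        return f"{token} {text}"

    corpus = ["You make a fair point.", "You are an idiot."]
    training_examples = [to_conditional_example(t) for t in corpus]
    print(training_examples)  # at inference time, decoding is conditioned on the <|good|> prefix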

We expose a surprising failure of generalization in auto-regressive large language models (LLMs). If a model is trained on a sentence of the form "A is B", it will not automatically generalize to the reverse direction "B is A". This is the Reversal Curse. For instance, if a model is trained on "Valentina Tereshkova was the first woman to travel to space", it will not automatically be able to answer the question, "Who was the first woman to travel to space?". Moreover, the likelihood of the correct answer ("Valentina Tereshkova") will not be higher than for a random name. Thus, models do not generalize a prevalent pattern in their training set: if "A is B" occurs, "B is A" is more likely...

10.48550/arxiv.2309.12288 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Work on scaling laws has found that large language models (LMs) show predictable improvements to overall loss with increased scale (model size, training data, and compute). Here, we present evidence for the claim that LMs may show inverse scaling, or worse task performance with increased scale, e.g., due to flaws in the training objective and data. We present empirical evidence of inverse scaling on 11 datasets collected by running a public contest, the Inverse Scaling Prize, with a substantial prize pool. Through analysis of the datasets, along with other examples from the literature, we identify four...

10.48550/arxiv.2306.09479 preprint EN cc-by arXiv (Cornell University) 2023-01-01

Human feedback is commonly utilized to finetune AI assistants. But human feedback may also encourage model responses that match user beliefs over truthful ones, a behaviour known as sycophancy. We investigate the prevalence of sycophancy in models whose finetuning procedure made use of human feedback, and the potential role of human preference judgments in such behavior. We first demonstrate that five state-of-the-art AI assistants consistently exhibit sycophancy across four varied free-form text-generation tasks. To understand if human preferences...

10.48550/arxiv.2310.13548 preprint EN cc-by arXiv (Cornell University) 2023-01-01

Pretrained language models often generate outputs that are not in line with human preferences, such as harmful text or factually incorrect summaries. Recent work approaches the above issues by learning from a simple form of human feedback: comparisons between pairs of model-generated outputs. However, comparison feedback only conveys limited information about human preferences. In this paper, we introduce Imitation learning from Language Feedback (ILF), a new approach that utilizes more informative language feedback. ILF consists of three...

10.48550/arxiv.2303.16755 preprint EN other-oa arXiv (Cornell University) 2023-01-01
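
The three-step loop can be sketched schematically as below; generate, select_best, and finetune are placeholders for an LM sampling call, a selector over candidate refinements, and supervised fine-tuning, and are not the paper's actual API.

    # Schematic ILF iteration: refine outputs with language feedback, select, imitate.
    def ilf_iteration(model, tasks, feedback, generate, select_best, finetune):
        selected = []
        for task in tasks:
            draft = generate(model, task)                         # initial model output
            candidates = [generate(model, task, draft, feedback[task])
                          for _ in range(4)]                      # feedback-conditioned refinements
            selected.append((task, select_best(candidates, feedback[task])))
        return finetune(model, selected)                          # imitate the selected refinements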

This work identifies 18 foundational challenges in assuring the alignment and safety of large language models (LLMs). These challenges are organized into three different categories: scientific understanding of LLMs, development and deployment methods, and sociotechnical challenges. Based on the identified challenges, we pose 200+ concrete research questions.

10.48550/arxiv.2404.09932 preprint EN arXiv (Cornell University) 2024-04-15

The potential for pre-trained large language models (LLMs) to use natural language feedback at inference time has been an exciting recent development. We build upon this observation by formalizing an algorithm for learning from language feedback at training time instead, which we call Imitation learning from Language Feedback (ILF). ILF requires only a small amount of human-written feedback during training and does not require the same feedback at test time, making it both user-friendly and sample-efficient. We further show that ILF can be seen as a form of minimizing the KL divergence to the ground-truth...

10.48550/arxiv.2303.16749 preprint EN cc-by arXiv (Cornell University) 2023-01-01
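
A hedged sketch of that distribution-matching reading, with p* denoting a ground-truth distribution over desired outputs for a context c and \pi_\theta the model being finetuned (notation assumed here, not copied from the paper):

    \mathrm{KL}\!\left( p^*(\cdot \mid c) \,\|\, \pi_\theta(\cdot \mid c) \right)
      \;=\; \mathbb{E}_{x \sim p^*(\cdot \mid c)}\!\left[ \log p^*(x \mid c) \right]
      \;-\; \mathbb{E}_{x \sim p^*(\cdot \mid c)}\!\left[ \log \pi_\theta(x \mid c) \right]

Since the first term does not depend on \theta, minimizing the KL amounts to maximizing the expected log-likelihood of samples from p*, which fine-tuning on selected refinements approximates.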

In this paper, I argue that enactivism and computationalism, two seemingly incompatible research traditions in modern cognitive science, can be fruitfully reconciled under the framework of the free energy principle (FEP). The FEP holds that cognitive systems encode generative models of their niches and that cognition can be understood in terms of minimizing the free energy of these models. There are two philosophical interpretations of this picture. A computationalist will argue that, as the FEP claims that Bayesian inference underpins both perception and action, it entails a...

10.1007/s11229-019-02243-4 article EN cc-by Synthese 2019-05-16
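
The free energy at issue is the standard variational free energy; as a reminder of its usual decomposition (notation mine, not the paper's), for observations o, hidden states s, a generative model p(o, s) and a recognition density q(s):

    F(q, o) \;=\; \mathbb{E}_{q(s)}\!\left[ \log q(s) - \log p(o, s) \right]
            \;=\; \mathrm{KL}\!\left( q(s) \,\|\, p(s \mid o) \right) \;-\; \log p(o)

Minimizing F with respect to q performs approximate Bayesian inference (perception), while acting so as to minimize F changes the observations o (action).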

The proliferation of generative models, combined with pretraining on web-scale data, raises a timely question: what happens when these models are trained on their own generated outputs? Recent investigations into model-data feedback loops discovered that such loops can lead to model collapse, a phenomenon where performance progressively degrades with each model-fitting iteration until the latest model becomes useless. However, several recent papers studying model collapse assumed that new data replace old data over time rather...

10.48550/arxiv.2404.01413 preprint EN arXiv (Cornell University) 2024-04-01
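
A toy sketch of the two data regimes at issue, replacing old data with synthetic data versus accumulating both, using a one-dimensional Gaussian fitted by maximum likelihood as a stand-in for a generative model (purely illustrative, not the paper's experimental setup):

    import random, statistics

    def fit(data):                 # "training": estimate mean and standard deviation
        return statistics.mean(data), statistics.pstdev(data)

    def sample(model, n):          # "generation": draw from the fitted model
        mu, sigma = model
        return [random.gauss(mu, sigma) for _ in range(n)]

    real = [random.gauss(0.0, 1.0) for _ in range(1000)]
    replace_data, accumulate_data = list(real), list(real)

    for _ in range(20):
        replace_data = sample(fit(replace_data), 1000)         # old data discarded
        accumulate_data += sample(fit(accumulate_data), 1000)  # old data kept

    print("replace:   ", fit(replace_data))
    print("accumulate:", fit(accumulate_data))

Under the replace regime the fitted parameters drift further from the original distribution with each iteration, while accumulation keeps the estimate anchored to the real data.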

Aligning language models with preferences can be posed as approximating a target distribution representing some desired behavior. Existing approaches differ both in the functional form of the target distribution and the algorithm used to approximate it. For instance, Reinforcement Learning from Human Feedback (RLHF) corresponds to minimizing a reverse KL from an implicit target distribution arising from a KL penalty in the objective. On the other hand, Generative Distributional Control (GDC) has an explicit target distribution and minimizes a forward KL from it using the Distributional Policy Gradient (DPG) algorithm. In...

10.48550/arxiv.2302.08215 preprint EN cc-by arXiv (Cornell University) 2023-01-01
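
The contrast can be summarized as follows, for a target distribution p* encoding the desired behaviour and a trainable policy \pi_\theta (a summary sketch, not notation copied from the paper):

    \text{reverse KL (RLHF-style):}\quad
      \min_\theta \mathrm{KL}\!\left( \pi_\theta \,\|\, p^* \right)
      = \min_\theta \mathbb{E}_{x \sim \pi_\theta}\!\left[ \log \tfrac{\pi_\theta(x)}{p^*(x)} \right],
    \qquad
    \text{forward KL (GDC-style):}\quad
      \min_\theta \mathrm{KL}\!\left( p^* \,\|\, \pi_\theta \right)
      = \min_\theta \mathbb{E}_{x \sim p^*}\!\left[ \log \tfrac{p^*(x)}{\pi_\theta(x)} \right]

The two divergences penalize different mistakes: reverse KL is mode-seeking (the policy avoids regions where p* is small), whereas forward KL is mass-covering.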

We aim to better understand the emergence of 'situational awareness' in large language models (LLMs). A model is situationally aware if it's aware that it's a model and can recognize whether it's currently in testing or deployment. Today's LLMs are tested for safety and alignment before they are deployed. An LLM could exploit situational awareness to achieve a high score on safety tests, while taking harmful actions after deployment. Situational awareness may emerge unexpectedly as a byproduct of model scaling. One way to foresee this emergence is to run scaling experiments on abilities...

10.48550/arxiv.2309.00667 preprint EN other-oa arXiv (Cornell University) 2023-01-01

This paper explores a novel approach to achieving emergent compositional communication in multi-agent systems. We propose a training regime implementing template transfer, the idea of carrying over learned biases across contexts. In our method, a sender-receiver pair is first trained with disentangled loss functions and then the receiver is transferred to train a new sender with a standard loss. Unlike other methods (e.g. the obverter algorithm), our approach does not require imposing inductive biases on the architecture of the agents...

10.48550/arxiv.1910.06079 preprint EN other-oa arXiv (Cornell University) 2019-01-01
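
The regime can be summarized in pseudocode; Sender, Receiver, the loss objects, and train() are placeholders rather than the paper's implementation:

    # Template transfer: pretrain a sender-receiver pair with disentangled losses,
    # then reuse the receiver to train a fresh sender on the standard game loss.
    def template_transfer(Sender, Receiver, disentangled_loss, standard_loss, train):
        sender_a, receiver = Sender(), Receiver()
        train(sender_a, receiver, disentangled_loss)   # phase 1: instil the bias in the receiver
        sender_b = Sender()                            # fresh sender, reused receiver
        train(sender_b, receiver, standard_loss)       # phase 2: standard signaling game
        return sender_b, receiver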

The availability of large pre-trained models is changing the landscape of Machine Learning research and practice, moving from a training-from-scratch to a fine-tuning paradigm. While in some applications the goal is to "nudge" the pre-trained distribution towards preferred outputs, in others it is to steer it towards a different distribution over the sample space. Two main paradigms have emerged to tackle this challenge: Reward Maximization (RM) and, more recently, Distribution Matching (DM). RM applies standard Reinforcement Learning (RL) techniques, such as Policy...

10.48550/arxiv.2206.00761 preprint EN cc-by arXiv (Cornell University) 2022-01-01

Compositionality is an important explanatory target in emergent communication and language evolution. The vast majority of computational models, however, account for the emergence of only a very basic form of compositionality: trivial compositionality. A compositional protocol is trivially compositional if the meaning of a complex signal (e.g. blue circle) boils down to the intersection of the meanings of its constituents (e.g. the set of blue objects and the set of circles). A protocol is non-trivially compositional (NTC) if the meaning of a complex signal (e.g. biggest apple) is a more complex function of the meanings of its constituents. In this paper, we review several...

10.48550/arxiv.2010.15058 preprint EN other-oa arXiv (Cornell University) 2020-01-01

Hutto and Myin (2013) famously argue that basic minds are not contentful and that content exists only as far as it is scaffolded by social and linguistic practices. This view, however, rests on a troublesome distinction between basic and scaffolded minds. Since they have to account for language purely in terms of joint action and guidance, there is no reason why simpler communication systems, such as cellular signaling pathways, should not give rise to content as well. This conclusion remains valid even if one rejects the view that content is mediated through public...

10.1515/slgr-2015-0022 article EN cc-by Studies in Logic Grammar and Rhetoric 2015-06-01

In this paper, we explore interaction history as a particular source of pressure for achieving emergent compositional communication in multi-agent systems. We propose a training regime implementing template transfer, the idea of carrying over learned biases across contexts. In the presented method, a sender-receiver dyad is first trained with a disentangled pair of objectives, and then the receiver is transferred to train a new sender with the standard objective. Unlike other methods (e.g. the obverter algorithm), template transfer...

10.1075/is.21020.kor article EN cc-by Interaction Studies Social Behaviour and Communication in Biological and Artificial Systems 2021-12-31

Communication is compositional if complex signals can be represented as a combination of simpler subparts. In this paper, we theoretically show that inductive biases on both the training framework and the data are needed to develop compositional communication. Moreover, we prove that compositionality spontaneously arises in signaling games where agents communicate over a noisy channel. We experimentally confirm that a range of noise levels, which depends on the model and the data, indeed promotes compositionality. Finally, we provide...

10.48550/arxiv.2111.06464 preprint EN other-oa arXiv (Cornell University) 2021-01-01

Machine learning is shifting towards general-purpose pretrained generative models, trained in a self-supervised manner on large amounts of data, which can then be applied to solve a large number of tasks. However, due to their generic training methodology, these models often fail to meet some of the downstream requirements (e.g., hallucinations in abstractive summarization or style violations in code generation). This raises the important question of how to adapt pre-trained models to all downstream requirements without destroying their general capabilities...

10.48550/arxiv.2112.00791 preprint EN cc-by arXiv (Cornell University) 2021-01-01

The notion of self-organisation plays a major role in enactive cognitive science. In this paper, I review several formal models of self-organisation that various approaches in modern cognitive science rely upon. I then focus on Rosen's account of self-organisation as closure to efficient cause and his argument that systems closed to efficient cause, known as (M, R) systems, are uncomputable. Despite being sometimes relied upon by enactivists, this argument is problematic because it rests on assumptions unacceptable for enactivists: that living systems can be modelled as time-invariant and material-independent. I argue that there exists a simple...

10.1177/10597123211066155 article EN Adaptive Behavior 2022-02-23

Reinforcement learning (RL) is frequently employed in fine-tuning large language models (LMs), such as GPT-3, to penalize them for undesirable features of generated sequences, such as offensiveness, social bias, harmfulness or falsehood. The RL formulation involves treating the LM as a policy and updating it to maximise the expected value of a reward function which captures human preferences, such as non-offensiveness. In this paper, we analyze challenges associated with treating a language model as an RL policy and show how avoiding those challenges requires moving...

10.18653/v1/2022.findings-emnlp.77 article EN cc-by 2022-01-01
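
The KL-regularized objective at issue and the distribution that maximizes it can be written as follows (a standard formulation, with \beta the KL coefficient and \pi_0 the pretrained LM; notation assumed here):

    J(\theta) \;=\; \mathbb{E}_{x \sim \pi_\theta}\!\left[ r(x) \right]
              \;-\; \beta\, \mathrm{KL}\!\left( \pi_\theta \,\|\, \pi_0 \right),
    \qquad
    \pi^*(x) \;\propto\; \pi_0(x)\, \exp\!\left( r(x)/\beta \right)

This reads as a Bayesian update: the pretrained model plays the role of a prior and exp(r(x)/\beta) that of a likelihood encoding evidence about human preferences.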

Language models (LMs) trained on vast quantities of text data can acquire sophisticated skills such as generating summaries, answering questions or writing code. However, they also manifest behaviors that violate human preferences, e.g., they can generate offensive content, falsehoods or perpetuate social biases. In this thesis, I explore several approaches to aligning LMs with human preferences. First, I argue that aligning LMs can be seen as Bayesian inference: conditioning a prior (base, pretrained LM) on evidence about human preferences (Chapter...

10.48550/arxiv.2404.12150 preprint EN arXiv (Cornell University) 2024-04-18

Neural language models can be successfully trained on source code, leading to applications such as code completion. However, their versatile autoregressive self-supervision objective overlooks important global sequence-level features that are present in the data, such as syntactic correctness or compilability. In this work, we pose the problem of learning to generate compilable code as constraint satisfaction. We define an Energy-Based Model (EBM) representing a pre-trained generative model with an imposed constraint of generating...

10.48550/arxiv.2106.04985 preprint EN cc-by arXiv (Cornell University) 2021-01-01
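
A toy sketch of such an EBM: an unnormalized density that keeps the pretrained model's probability only on sequences satisfying the constraint, here Python source that byte-compiles. lm_logprob stands in for a pretrained model's log-probability and is an assumption of this sketch, not the paper's code.

    import math

    def compiles(source: str) -> bool:
        """Binary constraint b(x): does the candidate byte-compile as Python?"""
        try:
            compile(source, "<candidate>", "exec")
            return True
        except SyntaxError:
            return False

    def ebm_unnormalized(source: str, lm_logprob) -> float:
        # P(x) proportional to a(x) * b(x): pretrained LM density times the constraint.
        return math.exp(lm_logprob(source)) * (1.0 if compiles(source) else 0.0)

    print(ebm_unnormalized("def f(x): return x + 1", lambda s: -5.0))  # positive mass
    print(ebm_unnormalized("def f(x) return x + 1",  lambda s: -5.0))  # zero mass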