Zac Hatfield-Dodds
- Topic Modeling
- Natural Language Processing Techniques
- Software Testing and Debugging Techniques
- Service-Oriented Architecture and Web Services
- Explainable Artificial Intelligence (XAI)
- Scientific Computing and Data Management
- Software Engineering Research
- Software System Performance and Reliability
- Software Reliability and Analysis Research
- Computational Physics and Python Applications
- Advanced Graph Neural Networks
- Text Readability and Simplification
- Reinforcement Learning in Robotics
- Neural Networks and Applications
- Multimodal Machine Learning Applications
- Ethics and Social Impacts of AI
- Human-Automation Interaction and Safety
- Astronomy and Astrophysical Research
- Particle Detector Development and Performance
- Gamma-ray bursts and supernovae
- Machine Learning in Bioinformatics
- Occupational Health and Safety Research
- Machine Learning and ELM
- Multi-Agent Systems and Negotiation
- Speech and dialogue systems
Australian National University
2019-2024
The Astropy Project supports and fosters the development of open-source and openly developed Python packages that provide commonly needed functionality to the astronomical community. A key element of the Project is the core package astropy, which serves as the foundation for more specialized projects and packages. In this article, we summarize key features in the recent major release, version 5.0, and provide updates on the Project. We then discuss supporting a broader ecosystem of interoperable packages, including connections with several...
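As a minimal illustration (not drawn from the paper itself) of the kind of commonly needed functionality the core package provides, units and coordinate transformations can be combined in a few lines:

    from astropy import units as u
    from astropy.coordinates import SkyCoord

    # Attach physical units to quantities and convert between them.
    speed = (3 * u.km / u.s).to(u.m / u.s)

    # Represent a sky position in ICRS and transform it to Galactic coordinates.
    m31 = SkyCoord(ra=10.684 * u.deg, dec=41.269 * u.deg, frame="icrs")
    print(speed, m31.galactic)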
We apply preference modeling and reinforcement learning from human feedback (RLHF) to finetune language models to act as helpful and harmless assistants. We find this alignment training improves performance on almost all NLP evaluations, and is fully compatible with training for specialized skills such as python coding and summarization. We explore an iterated online mode of training, where RL policies are updated on a weekly cadence with fresh data, efficiently improving our datasets and models. Finally, we investigate the robustness...
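As a rough sketch of the preference-modeling step in this kind of pipeline (a generic Bradley-Terry-style pairwise objective, not code from the paper), the preference model is trained to score the human-preferred response above the rejected one:

    import torch.nn.functional as F

    def preference_loss(reward_chosen, reward_rejected):
        # Pairwise comparison loss: push the scalar reward of the preferred
        # response above that of the rejected response.
        return -F.logsigmoid(reward_chosen - reward_rejected).mean()

The learned preference model then provides the reward signal that the RL policy is finetuned against.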
As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune...
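A schematic of the supervised critique-and-revision phase might look like the sketch below; `generate` and the prompt templates are placeholders, not the paper's actual prompts:

    def critique_and_revise(generate, prompt, principles):
        # `generate(text)` stands in for sampling a completion from the model.
        response = generate(prompt)
        for principle in principles:
            critique = generate(
                f"{prompt}\n\nResponse: {response}\n\n"
                f"Critique the response according to this principle: {principle}"
            )
            response = generate(
                f"{prompt}\n\nResponse: {response}\n\nCritique: {critique}\n\n"
                "Rewrite the response to address the critique."
            )
        return prompt, response  # a (prompt, revised response) finetuning pair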
Large-scale pre-training has recently emerged as a technique for creating capable, general purpose, generative models such as GPT-3, Megatron-Turing NLG, Gopher, and many others. In this paper, we highlight a counterintuitive property of such models and discuss the policy implications of this property. Namely, these models have an unusual combination of predictable loss on a broad training distribution (as embodied in their "scaling laws"), and unpredictable specific capabilities, inputs, and outputs. We believe that the high-level...
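The "predictable loss" half of that combination is usually expressed as a smooth power-law fit over model size; a toy sketch with illustrative numbers (not data from the paper) is:

    import numpy as np
    from scipy.optimize import curve_fit

    def scaling_law(n, a, alpha, c):
        # L(N) ~ a * N**(-alpha) + c : average loss falls smoothly with scale.
        return a * n ** (-alpha) + c

    n_params = np.array([1e6, 1e7, 1e8, 1e9])   # illustrative model sizes
    losses = np.array([4.2, 3.6, 3.1, 2.7])     # illustrative eval losses
    (a, alpha, c), _ = curve_fit(scaling_law, n_params, losses, p0=[20.0, 0.1, 1.0])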
Ethan Perez, Sam Ringer, Kamile Lukosiute, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, Andy Jones, Anna Chen, Benjamin Mann, Brian Israel, Bryan Seethor, Cameron McKinnon, Christopher Olah, Da Yan, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Guro Khundadze, Jackson Kernion, James Landis, Jamie Kerr, Jared Mueller, Jeeyoon Hyun, Joshua Landau, Kamal Ndousse, Landon Goldberg, Liane Lovitt, Martin Lucas, Michael...
Given the broad capabilities of large language models, it should be possible to work towards a general-purpose, text-based assistant that is aligned with human values, meaning that it is helpful, honest, and harmless. As an initial foray in this direction we study simple baseline techniques and evaluations, such as prompting. We find that the benefits from modest interventions increase with model size, generalize to a variety of alignment evaluations, and do not compromise the performance of large models. Next we investigate scaling trends for several training...
"Induction heads" are attention heads that implement a simple algorithm to complete token sequences like [A][B] ... [A] -> [B]. In this work, we present preliminary and indirect evidence for hypothesis induction might constitute the mechanism majority of all "in-context learning" in large transformer models (i.e. decreasing loss at increasing indices). We find develop precisely same point as sudden sharp increase in-context learning ability, visible bump training loss. six complementary...
We test the hypothesis that language models trained with reinforcement learning from human feedback (RLHF) have the capability to "morally self-correct" -- to avoid producing harmful outputs if instructed to do so. We find strong evidence in support of this hypothesis across three different experiments, each of which reveal different facets of moral self-correction. We find that the capability for moral self-correction emerges at 22B model parameters, and typically improves with increasing model size and RLHF training. We believe that at this level of scale, language models obtain two capabilities that they can use...
Large language models (LLMs) may not equitably represent diverse global perspectives on societal issues. In this paper, we develop a quantitative framework to evaluate whose opinions model-generated responses are more similar to. We first build a dataset, GlobalOpinionQA, comprised of questions and answers from cross-national surveys designed to capture opinions on issues across different countries. Next, we define a metric that quantifies the similarity between LLM-generated survey responses and human responses, conditioned...
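A plausible instantiation of such a metric (not necessarily the paper's exact choice) is one minus the Jensen-Shannon distance between the model's answer distribution and the human response distribution for a given country:

    from scipy.spatial.distance import jensenshannon

    def opinion_similarity(model_probs, human_probs):
        # Both arguments are probability distributions over a question's
        # answer options; higher values mean closer agreement.
        return 1.0 - jensenshannon(model_probs, human_probs)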
Property-based testing is a style of testing popularised by the QuickCheck family of libraries, first in Haskell (Claessen & Hughes, 2000) and later in Erlang (Arts, Johansson, & Wiger, 2006), which integrates generated test cases into existing software testing workflows: instead of tests that provide examples of a single concrete behaviour, developers specify properties that hold for a wide range of inputs, which the library then attempts to refute by generating counterexamples. For a general introduction to property-based testing, see (MacIver, 2019). Hypothesis is a mature and widely...
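A minimal Hypothesis property, illustrating the style described above:

    from hypothesis import given, strategies as st

    @given(st.lists(st.integers()))
    def test_sorting_is_idempotent(xs):
        # A property over every list of integers, rather than a handful of
        # hand-picked examples; Hypothesis searches for a counterexample.
        assert sorted(sorted(xs)) == sorted(xs)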
Neural networks often pack many unrelated concepts into a single neuron - a puzzling phenomenon known as 'polysemanticity' which makes interpretability much more challenging. This paper provides a toy model where polysemanticity can be fully understood, arising as a result of models storing additional sparse features in "superposition." We demonstrate the existence of a phase change, a surprising connection to the geometry of uniform polytopes, and evidence of a link to adversarial examples. We also discuss potential...
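A minimal sketch of the kind of toy autoencoder studied (sizes are illustrative): sparse features are squeezed through a small bottleneck and reconstructed through the transpose of the same weight matrix.

    import torch

    class ToyModel(torch.nn.Module):
        def __init__(self, n_features=20, n_hidden=5):
            super().__init__()
            self.W = torch.nn.Parameter(torch.randn(n_hidden, n_features) * 0.1)
            self.b = torch.nn.Parameter(torch.zeros(n_features))

        def forward(self, x):
            # x: (batch, n_features) of mostly-zero (sparse) feature activations.
            h = x @ self.W.T                        # compress into the bottleneck
            return torch.relu(h @ self.W + self.b)  # reconstruct the features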
Human feedback is commonly utilized to finetune AI assistants. But human feedback may also encourage model responses that match user beliefs over truthful ones, a behaviour known as sycophancy. We investigate the prevalence of sycophancy in models whose finetuning procedure made use of human feedback, and the potential role of human preference judgments in such behavior. We first demonstrate that five state-of-the-art AI assistants consistently exhibit sycophancy across four varied free-form text-generation tasks. To understand if human preferences...
Recent large language models have been trained on vast datasets, but also often on repeated data, either intentionally for the purpose of upweighting higher quality data, or unintentionally because data deduplication is not perfect and the model is exposed to repeated data at the sentence, paragraph, or document level. Some works have reported substantial negative performance effects of this repeated data. In this paper we attempt to study repeated data systematically and to understand its effects mechanistically. To do this, we train a family of models where most of the data is unique but a small fraction of it is repeated many...
Developing safe and useful general-purpose AI systems will require us to make progress on scalable oversight: the problem of supervising systems that potentially outperform us on most skills relevant to the task at hand. Empirical work on this problem is not straightforward, since we do not yet have systems that broadly exceed our abilities. This paper discusses one of the major ways we think about this problem, with a focus on how it can be studied empirically. We first present an experimental design centered on tasks for which human specialists succeed but...
Large language models (LLMs) perform better when they produce step-by-step, "Chain-of-Thought" (CoT) reasoning before answering a question, but it is unclear if the stated reasoning is a faithful explanation of the model's actual reasoning (i.e., its process for answering the question). We investigate hypotheses for how CoT reasoning may be unfaithful, by examining how the model's predictions change when we intervene on the CoT (e.g., by adding mistakes or paraphrasing it). Models show large variation across tasks in how strongly they condition on the CoT when predicting their answer, sometimes...
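One such intervention, sketched below with a placeholder `ask_model` function and prompt format (hypothetical, not the paper's exact setup), truncates the chain of thought and re-queries the model to see whether the final answer changes:

    def answer_after_truncation(ask_model, question, cot_steps, fraction):
        # Keep only the first `fraction` of the reasoning steps, then ask for
        # a final answer; comparing this with the original answer measures how
        # strongly the model conditions on its stated reasoning.
        kept = cot_steps[: int(len(cot_steps) * fraction)]
        prompt = question + "\n" + "\n".join(kept) + "\nTherefore, the answer is"
        return ask_model(prompt)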
We present Schemathesis, a tool for finding semantic errors and crashes in OpenAPI or GraphQL web APIs through property-based testing. Our evaluation, comprising thirty independent runs of eight tools against sixteen containerized open-source services, shows that Schemathesis wildly outperforms all previous tools. It is the only tool to find defects in four targets, finds 1.4× to 4.5× more unique defects than the respectively second-best tool in each remaining target, and can handle two-thirds of our target services without a fatal internal...
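A minimal pytest-style usage sketch; the schema URL is a placeholder and the exact loader name varies across Schemathesis versions:

    import schemathesis

    # Load the service's OpenAPI schema (placeholder URL).
    schema = schemathesis.from_uri("http://localhost:8080/openapi.json")

    @schema.parametrize()
    def test_api(case):
        # Sends a schema-derived request and checks the response for crashes
        # and conformance problems.
        case.call_and_validate()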
As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from instructing LMs to write yes/no questions to making complex Winogender schemas with multiple stages of LM-based generation and filtering. Crowdworkers rate...
As large language models (LLMs) perform more difficult tasks, it becomes harder to verify the correctness and safety of their behavior. One approach to help with this issue is to prompt LLMs to externalize their reasoning, e.g., by having them generate step-by-step reasoning as they answer a question (Chain-of-Thought; CoT). The reasoning may enable us to check the process that models use to perform tasks. However, this approach relies on the stated reasoning faithfully reflecting the model's actual reasoning, which is not always the case. To improve over the faithfulness of CoT reasoning, we have...
Human feedback can prevent overtly harmful utterances in conversational models, but may not automatically mitigate subtle problematic behaviors such as a stated desire for self-preservation or power. Constitutional AI offers an alternative, replacing human feedback with feedback from AI models conditioned only on a list of written principles. We find this approach effectively prevents the expression of such behaviors. The success of simple principles motivates us to ask: can models learn general ethical behaviors from a single principle? To test this,...
Where traditional example-based tests check software using manually specified input-output pairs, property-based tests exploit a general description of valid inputs and program behaviour to automatically search for falsifying examples. Given that Python has excellent testing tools, such tests are often easier to work with, and routinely find serious bugs that all other techniques have missed.
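For example, a single round-trip property covers the whole input space that a handful of example-based tests would only sample:

    import json
    from hypothesis import given, strategies as st

    @given(st.dictionaries(st.text(), st.integers()))
    def test_json_round_trip(d):
        # Serialising and deserialising any str->int mapping should give back
        # the original value.
        assert json.loads(json.dumps(d)) == d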