Zac Hatfield-Dodds

ORCID: 0000-0002-8646-8362
Research Areas
  • Topic Modeling
  • Natural Language Processing Techniques
  • Software Testing and Debugging Techniques
  • Service-Oriented Architecture and Web Services
  • Explainable Artificial Intelligence (XAI)
  • Scientific Computing and Data Management
  • Software Engineering Research
  • Software System Performance and Reliability
  • Software Reliability and Analysis Research
  • Computational Physics and Python Applications
  • Advanced Graph Neural Networks
  • Text Readability and Simplification
  • Reinforcement Learning in Robotics
  • Neural Networks and Applications
  • Multimodal Machine Learning Applications
  • Ethics and Social Impacts of AI
  • Human-Automation Interaction and Safety
  • Astronomy and Astrophysical Research
  • Particle Detector Development and Performance
  • Gamma-ray bursts and supernovae
  • Machine Learning in Bioinformatics
  • Occupational Health and Safety Research
  • Machine Learning and ELM
  • Multi-Agent Systems and Negotiation
  • Speech and dialogue systems

Australian National University
2019-2024

The Astropy Collaboration, Adrian M. Price-Whelan, Pey Lian Lim, N. Earl, Nathaniel Starkman, Larry Bradley, D. L. Shupe, Aarya A. Patil, Lía Corrales, C. E. Brasseur, Maximilian Nöthe, Axel Donath, Erik Tollerud, Brett M. Morris, Adam Ginsburg, Eero Vaher, Benjamin Weaver, James Tocknell, William Brian Jamieson, M. H. van Kerkwijk, Thomas Robitaille, Bruce Merry, Matteo Bachetti, Hans Moritz Günther, Thomas L. Aldcroft, Jaime A. Alvarado-Montes, Anne M. Archibald, Attila Bódi, Shreyas Bapat, Geert Barentsen, Juanjo Bazán, Manish Biswas, M. Boquien, D. J. Burke, Daria Cara, Mihai Cara, Kyle E. Conroy, Simon Conseil, Matthew Craig, R. Cross, Kelle L. Cruz, Francesco D’Eugenio, Nadia Dencheva, Hadrien A. R. Devillepoix, J. P. Dietrich, Arthur Eigenbrot, T. Erben, Leonardo Ferreira, Daniel Foreman-Mackey, Ryan Fox, Nabil Freij, Suyog Garg, Robel Geda, Lauren Glattly, Yash Gondhalekar, Karl D. Gordon, David Grant, P. Greenfield, Austen Groener, S. Guest, S. Gurovich, R. Handberg, Akeem Hart, Zac Hatfield-Dodds, D. Homeier, G. Hosseinzadeh, T. Jenness, Craig Jones, P. Joseph, J. Bryce Kalmbach, E. Karamehmetoglu, Mikołaj Kałuszyński, Michael S. P. Kelley, Nicholas S. Kern, Wolfgang Kerzendorf, Eric W. Koch, Shankar Kulumani, Antony Lee, Chun Ly, Zhiyuan Ma, C. D. MacBride, Jakob M. Maljaars, Demitri Muna, Nicholas A. Murphy, Henrik Norman, Richard O’Steen, Kyle A. Oman, Camilla Pacifici, S. Pascual, J. Pascual-Granado, Rohit R. Patil, G. I. Perren, T. E. Pickering, Tushar Rastogi, Benjamin R. Roulston, Daniel F. Ryan, E. S. Rykoff, J. Sabater, Parikshit Sakurikar, J. Salgado

The Astropy Project supports and fosters the development of open-source and openly developed Python packages that provide commonly needed functionality to the astronomical community. A key element of the Astropy Project is the core package astropy, which serves as the foundation for more specialized projects and packages. In this article, we summarize key features in the recent major release, version 5.0, and provide updates on the Project. We then discuss supporting a broader ecosystem of interoperable packages, including connections with several...

10.3847/1538-4357/ac7c74 article EN cc-by The Astrophysical Journal 2022-08-01
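
As a small illustration of the "commonly needed functionality" the core package provides, the sketch below uses the astropy.units and astropy.coordinates subpackages; the specific quantities and coordinates are arbitrary examples, not taken from the paper.

```python
# Minimal sketch of the astropy core package in use: unit-aware arithmetic
# and sky-coordinate handling, two commonly needed features.
from astropy import units as u
from astropy.coordinates import SkyCoord

# Quantities carry units through arithmetic and convert explicitly.
distance = (299_792.458 * u.km / u.s) * (1.0 * u.s)
print(distance.to(u.au))

# SkyCoord parses celestial coordinates and transforms between frames.
m31 = SkyCoord(ra=10.6847 * u.deg, dec=41.2690 * u.deg, frame="icrs")
print(m31.galactic)
```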

We apply preference modeling and reinforcement learning from human feedback (RLHF) to finetune language models to act as helpful and harmless assistants. We find this alignment training improves performance on almost all NLP evaluations, and is fully compatible with training for specialized skills such as python coding and summarization. We explore an iterated online mode of training, where preference models and RL policies are updated on a weekly cadence with fresh human feedback data, efficiently improving our datasets and models. Finally, we investigate the robustness...

10.48550/arxiv.2204.05862 preprint EN cc-by arXiv (Cornell University) 2022-01-01

As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune...

10.48550/arxiv.2212.08073 preprint EN cc-by arXiv (Cornell University) 2022-01-01
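
The supervised phase described in the abstract can be pictured as a critique-and-revision loop. The sketch below is an assumption about how such a loop might be wired up, not the paper's implementation; `generate` is a hypothetical stand-in for a text-generation call.

```python
# Hypothetical sketch of the supervised (critique/revision) phase of
# Constitutional AI: sample a response, critique it against each principle,
# revise it, and keep the final revision as finetuning data.
# `generate` is a placeholder for an arbitrary text-generation call.
from typing import Callable, List

def constitutional_revisions(
    generate: Callable[[str], str],
    prompt: str,
    principles: List[str],
) -> str:
    response = generate(prompt)
    for principle in principles:
        critique = generate(
            f"Critique the following response according to this principle: "
            f"{principle}\n\nResponse: {response}"
        )
        response = generate(
            f"Rewrite the response to address the critique.\n\n"
            f"Critique: {critique}\n\nOriginal response: {response}"
        )
    return response  # used as a target for supervised finetuning
```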

Large-scale pre-training has recently emerged as a technique for creating capable, general-purpose, generative models such as GPT-3, Megatron-Turing NLG, Gopher, and many others. In this paper, we highlight a counterintuitive property of such models and discuss the policy implications of this property. Namely, these generative models have an unusual combination of predictable loss on a broad training distribution (as embodied in their "scaling laws"), and unpredictable specific capabilities, inputs, and outputs. We believe that the high-level...

10.1145/3531146.3533229 article EN 2022 ACM Conference on Fairness, Accountability, and Transparency 2022-06-20
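
The "predictable loss" side of this combination is usually summarised by a power-law scaling fit. The sketch below fits L(N) = (N_c / N)^alpha to made-up loss measurements and is purely illustrative; the constants are not from the paper.

```python
# Illustrative only: fit a power-law scaling curve L(N) = (N_c / N)**alpha
# to hypothetical (parameter-count, loss) pairs, the kind of relationship
# described as "predictable loss on a broad training distribution".
import numpy as np

params = np.array([1e7, 1e8, 1e9, 1e10])   # model sizes (hypothetical)
losses = np.array([4.2, 3.4, 2.8, 2.3])    # eval losses (hypothetical)

# Linear regression in log-log space: log L = alpha * log N_c - alpha * log N.
slope, intercept = np.polyfit(np.log(params), np.log(losses), 1)
alpha = -slope
n_c = np.exp(intercept / alpha)
print(f"alpha ≈ {alpha:.2f}, N_c ≈ {n_c:.2e}")
```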

Ethan Perez, Sam Ringer, Kamile Lukosiute, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, Andy Jones, Anna Chen, Benjamin Mann, Brian Israel, Bryan Seethor, Cameron McKinnon, Christopher Olah, Da Yan, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Guro Khundadze, Jackson Kernion, James Landis, Jamie Kerr, Jared Mueller, Jeeyoon Hyun, Joshua Landau, Kamal Ndousse, Landon Goldberg, Liane Lovitt, Martin Lucas, Michael...

10.18653/v1/2023.findings-acl.847 article EN cc-by Findings of the Association for Computational Linguistics: ACL 2023 2023-01-01

Given the broad capabilities of large language models, it should be possible to work towards a general-purpose, text-based assistant that is aligned with human values, meaning that it is helpful, honest, and harmless. As an initial foray in this direction we study simple baseline techniques and evaluations, such as prompting. We find that the benefits from modest interventions increase with model size, generalize to a variety of alignment evaluations, and do not compromise the performance of large models. Next we investigate scaling trends for several training...

10.48550/arxiv.2112.00861 preprint EN other-oa arXiv (Cornell University) 2021-01-01

"Induction heads" are attention heads that implement a simple algorithm to complete token sequences like [A][B] ... [A] -> [B]. In this work, we present preliminary and indirect evidence for the hypothesis that induction heads might constitute the mechanism for the majority of all "in-context learning" in large transformer models (i.e. decreasing loss at increasing token indices). We find that induction heads develop at precisely the same point as a sudden sharp increase in in-context learning ability, visible as a bump in the training loss. We present six complementary...

10.48550/arxiv.2209.11895 preprint EN cc-by arXiv (Cornell University) 2022-01-01
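
The completion rule [A][B] ... [A] -> [B] can be stated directly as a toy function over token ids; the sketch below illustrates the rule itself and makes no claim about how transformer attention heads implement it.

```python
# Toy illustration of the induction rule [A][B] ... [A] -> [B]: predict the
# token that followed the most recent earlier occurrence of the current token.
from typing import List, Optional

def induction_guess(tokens: List[int]) -> Optional[int]:
    current = tokens[-1]
    # Scan backwards for a previous occurrence of the current token.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]  # copy the token that followed it
    return None  # no earlier occurrence, so the rule makes no prediction

# The final token 7 appeared earlier, followed by 3, so the rule predicts 3.
print(induction_guess([7, 3, 9, 4, 7]))  # -> 3
```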

We test the hypothesis that language models trained with reinforcement learning from human feedback (RLHF) have the capability to "morally self-correct" -- to avoid producing harmful outputs if instructed to do so. We find strong evidence in support of this hypothesis across three different experiments, each of which reveals different facets of moral self-correction. We find that the capability for moral self-correction emerges at 22B model parameters, and typically improves with increasing model size and RLHF training. We believe that at this level of scale, language models obtain two capabilities they can use...

10.48550/arxiv.2302.07459 preprint EN cc-by arXiv (Cornell University) 2023-01-01

Large language models (LLMs) may not equitably represent diverse global perspectives on societal issues. In this paper, we develop a quantitative framework to evaluate whose opinions model-generated responses are more similar to. We first build a dataset, GlobalOpinionQA, comprised of questions and answers from cross-national surveys designed to capture diverse opinions on global issues across different countries. Next, we define a metric that quantifies the similarity between LLM-generated survey responses and human responses, conditioned...

10.48550/arxiv.2306.16388 preprint EN cc-by arXiv (Cornell University) 2023-01-01
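
One plausible way to quantify similarity between a model's answer distribution and a country's survey responses is one minus the Jensen-Shannon distance; the paper's exact metric may differ, so the sketch below is an assumption for illustration.

```python
# Hypothetical similarity metric between a model's answer distribution and a
# country's human survey distribution over the same answer options. One
# plausible choice is 1 minus the Jensen-Shannon distance; the paper's exact
# metric may differ.
import numpy as np
from scipy.spatial.distance import jensenshannon

def opinion_similarity(model_probs, human_probs) -> float:
    model_probs = np.asarray(model_probs, dtype=float)
    human_probs = np.asarray(human_probs, dtype=float)
    # With base=2 the distance lies in [0, 1], so the similarity does too.
    return 1.0 - jensenshannon(model_probs, human_probs, base=2)

# Model leans toward option A; surveyed respondents are more evenly split.
print(opinion_similarity([0.7, 0.2, 0.1], [0.4, 0.4, 0.2]))
```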

Property-based testing is a style of testing popularised by the QuickCheck family of libraries, first in Haskell (Claessen & Hughes, 2000) and later in Erlang (Arts, Johansson, & Wiger, 2006), which integrates generated test cases into existing software testing workflows: instead of tests that provide examples of a single concrete behaviour, users specify properties that hold for a wide range of inputs, and the library then attempts to generate counterexamples that refute them. For a general introduction to property-based testing, see (MacIver, 2019). Hypothesis is a mature and widely...

10.21105/joss.01891 article EN cc-by The Journal of Open Source Software 2019-11-21
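
As a concrete example of the workflow described above, a Hypothesis test states a property ("decoding an encoding returns the original string") and lets the library search for a counterexample; the run-length codec here is just an arbitrary function under test.

```python
# A property-based test with Hypothesis: instead of hand-picked examples,
# state a property over all generated inputs and let the library try to
# refute it.
from hypothesis import given, strategies as st

def run_length_encode(data: str) -> list:
    out = []
    for ch in data:
        if out and out[-1][0] == ch:
            out[-1] = (ch, out[-1][1] + 1)
        else:
            out.append((ch, 1))
    return out

def run_length_decode(encoded: list) -> str:
    return "".join(ch * count for ch, count in encoded)

@given(st.text())
def test_round_trip(s):
    # Property: decoding an encoding is the identity, for any string.
    assert run_length_decode(run_length_encode(s)) == s
```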

Neural networks often pack many unrelated concepts into a single neuron - a puzzling phenomenon known as 'polysemanticity' which makes interpretability much more challenging. This paper provides a toy model where polysemanticity can be fully understood, arising as a result of models storing additional sparse features in "superposition." We demonstrate the existence of a phase change, a surprising connection to the geometry of uniform polytopes, and evidence of a link to adversarial examples. We also discuss potential...

10.48550/arxiv.2209.10652 preprint EN other-oa arXiv (Cornell University) 2022-01-01
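
A minimal sketch of the kind of toy setup the abstract alludes to, assuming a small linear map with a ReLU readout: more sparse features than hidden dimensions, so features must share directions. The shapes and constants below are arbitrary, not the paper's.

```python
# Toy sketch (an assumption about the setup, not the paper's exact code):
# n sparse features are linearly compressed into m < n hidden dimensions and
# reconstructed with a ReLU readout, x_hat = ReLU(W.T @ (W @ x) + b).
# Storing more features than dimensions is what "superposition" refers to.
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden, sparsity = 20, 5, 0.9

W = rng.normal(scale=0.3, size=(n_hidden, n_features))
b = np.zeros(n_features)

# Sparse input: each feature is active only about 10% of the time.
x = rng.uniform(size=n_features) * (rng.uniform(size=n_features) > sparsity)

hidden = W @ x                              # compress 20 features into 5 dims
x_hat = np.maximum(W.T @ hidden + b, 0.0)   # ReLU reconstruction
print("active features:", np.count_nonzero(x),
      "reconstruction error:", float(np.mean((x - x_hat) ** 2)))
```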

Human feedback is commonly utilized to finetune AI assistants. But human feedback may also encourage model responses that match user beliefs over truthful ones, a behaviour known as sycophancy. We investigate the prevalence of sycophancy in models whose finetuning procedure made use of human feedback, and the potential role of human preference judgments in such behavior. We first demonstrate that five state-of-the-art AI assistants consistently exhibit sycophancy across four varied free-form text-generation tasks. To understand if human preferences...

10.48550/arxiv.2310.13548 preprint EN cc-by arXiv (Cornell University) 2023-01-01

Recent large language models have been trained on vast datasets, but also often on repeated data, either intentionally for the purpose of upweighting higher quality data, or unintentionally because data deduplication is not perfect and the model is exposed to repeated data at the sentence, paragraph, or document level. Some works have reported substantial negative performance effects of this repeated data. In this paper we attempt to study repeated data systematically and to understand its effects mechanistically. To do this, we train a family of models where most of the data is unique but a small fraction of it is repeated many...

10.48550/arxiv.2205.10487 preprint EN cc-by arXiv (Cornell University) 2022-01-01

Developing safe and useful general-purpose AI systems will require us to make progress on scalable oversight: the problem of supervising systems that potentially outperform us on most skills relevant to the task at hand. Empirical work on this problem is not straightforward, since we do not yet have systems that broadly exceed our abilities. This paper discusses one of the major ways we think about this problem, with a focus on how it can be studied empirically. We first present an experimental design centered on tasks for which human specialists succeed but...

10.48550/arxiv.2211.03540 preprint EN other-oa arXiv (Cornell University) 2022-01-01

We present Schemathesis, a tool for finding semantic errors and crashes in OpenAPI or GraphQL web APIs through property-based testing. Our evaluation, comprising thirty independent runs of eight tools against sixteen containerized open-source services, shows that Schemathesis wildly outperforms all previous tools.

10.1145/3510454.3528637 article EN 2022-05-21
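
A minimal sketch of pointing Schemathesis at an OpenAPI schema from pytest; the schema URL below is a placeholder and the call names follow the 3.x Python API as documented at the time, so treat this as a sketch rather than a definitive usage.

```python
# Sketch of running Schemathesis against an OpenAPI schema from pytest
# (placeholder URL; API names per Schemathesis 3.x, check current docs).
import schemathesis

schema = schemathesis.from_uri("http://localhost:8000/openapi.json")

@schema.parametrize()
def test_api(case):
    # Generate a request for every documented operation, send it, and check
    # the response against the schema (status codes, content types, payloads).
    case.call_and_validate()
```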

Large language models (LLMs) perform better when they produce step-by-step, "Chain-of-Thought" (CoT) reasoning before answering a question, but it is unclear if the stated reasoning is a faithful explanation of the model's actual reasoning (i.e., its process for answering the question). We investigate hypotheses for how CoT reasoning may be unfaithful, by examining how the model's predictions change when we intervene on the CoT (e.g., by adding mistakes or paraphrasing it). Models show large variation across tasks in how strongly they condition on the CoT when predicting their answer, sometimes...

10.48550/arxiv.2307.13702 preprint EN other-oa arXiv (Cornell University) 2023-01-01
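
One of the interventions mentioned above, truncating the chain of thought and checking whether the final answer changes, can be sketched as below; `answer_with_cot` is a hypothetical helper standing in for a model query, not a real API, and the scoring is illustrative rather than the paper's procedure.

```python
# Hypothetical sketch of one intervention: truncate the chain of thought and
# check whether the model's final answer changes. `answer_with_cot` stands in
# for an arbitrary model-querying helper and is not a real API.
from typing import Callable, List

def truncation_sensitivity(
    answer_with_cot: Callable[[str, str], str],
    question: str,
    cot_steps: List[str],
) -> float:
    full_answer = answer_with_cot(question, "\n".join(cot_steps))
    changed = 0
    for k in range(len(cot_steps)):
        truncated = "\n".join(cot_steps[:k])
        if answer_with_cot(question, truncated) != full_answer:
            changed += 1
    # Fraction of truncation points at which the answer flips; higher values
    # suggest the stated reasoning actually influences the answer.
    return changed / max(len(cot_steps), 1)
```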

We present Schemathesis, a tool for finding semantic errors and crashes in OpenAPI or GraphQL web APIs through property-based testing. Our evaluation, comprising thirty independent runs of eight tools against sixteen containerized open-source services, shows that Schemathesis wildly outperforms all previous tools. It is the only tool to find defects in four targets, finds 1.4× to 4.5× more unique defects than the respectively second-best tool in each remaining target, and is the only tool to handle two-thirds of our target services without fatal internal...

10.1109/icse-companion55297.2022.9793781 article EN 2022 IEEE/ACM 44th International Conference on Software Engineering: Companion Proceedings (ICSE-Companion) 2022-05-01

As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from instructing LMs to write yes/no questions to making complex Winogender schemas with multiple stages of LM-based generation and filtering. Crowdworkers rate...

10.48550/arxiv.2212.09251 preprint EN other-oa arXiv (Cornell University) 2022-01-01

As large language models (LLMs) perform more difficult tasks, it becomes harder to verify the correctness and safety of their behavior. One approach to help with this issue is to prompt LLMs to externalize their reasoning, e.g., by having them generate step-by-step reasoning as they answer a question (Chain-of-Thought; CoT). The reasoning may enable us to check the process that models use to perform tasks. However, this approach relies on the stated reasoning faithfully reflecting the model's actual reasoning, which is not always the case. To improve the faithfulness of CoT reasoning, we have...

10.48550/arxiv.2307.11768 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Human feedback can prevent overtly harmful utterances in conversational models, but may not automatically mitigate subtle problematic behaviors such as a stated desire for self-preservation or power. Constitutional AI offers an alternative, replacing human feedback with feedback from AI models conditioned only on a list of written principles. We find this approach effectively prevents the expression of such behaviors. The success of simple principles motivates us to ask: can models learn general ethical behaviors from only a single written principle? To test this,...

10.48550/arxiv.2310.13798 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Where traditional example-based tests check software using manually-specified input-output pairs, property-based tests exploit a general description of valid inputs and program behaviour to automatically search for falsifying examples. Given that Python has excellent property-based testing tools, such tests are often easier to work with and routinely find serious bugs that all other techniques have missed.

10.25080/majora-342d178e-016 article EN cc-by Proceedings of the Python in Science Conferences 2020-01-01
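
Applied to numerical code, the same idea looks like the sketch below: Hypothesis generates finite floating-point inputs and NumPy's sort is checked against two simple invariants. The function under test is an arbitrary stand-in for real scientific code, not an example from the paper.

```python
# Property-based check of a numerical invariant: sorting finite floats is
# idempotent and preserves the values in order-agreement with Python's sort.
import numpy as np
from hypothesis import given, strategies as st

finite_floats = st.lists(
    st.floats(allow_nan=False, allow_infinity=False), min_size=1
)

@given(finite_floats)
def test_sort_is_idempotent_and_preserving(values):
    once = np.sort(np.array(values))
    twice = np.sort(once)
    assert np.array_equal(once, twice)      # sorting twice changes nothing
    assert sorted(values) == list(once)     # same elements, same order
```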

Large-scale pre-training has recently emerged as a technique for creating capable, general-purpose, generative models such as GPT-3, Megatron-Turing NLG, Gopher, and many others. In this paper, we highlight a counterintuitive property of such models and discuss the policy implications of this property. Namely, these generative models have an unusual combination of predictable loss on a broad training distribution (as embodied in their "scaling laws"), and unpredictable specific capabilities, inputs, and outputs. We believe that the high-level...

10.48550/arxiv.2202.07785 preprint EN cc-by arXiv (Cornell University) 2022-01-01