- Topic Modeling
- Natural Language Processing Techniques
- Advanced Text Analysis Techniques
- Text Readability and Simplification
- Hate Speech and Cyberbullying Detection
- Software Engineering Research
- Social Media and Politics
- Sentiment Analysis and Opinion Mining
- Explainable Artificial Intelligence (XAI)
- Ethics and Social Impacts of AI
- Misinformation and Its Impacts
- Computational and Text Analysis Methods
- Persona Design and Applications
- Opinion Dynamics and Social Influence
- Advanced Graph Neural Networks
- Media Influence and Politics
- Aerospace Engineering and Energy Systems
- Machine Learning and Data Classification
- Adversarial Robustness in Machine Learning
- Wikis in Education and Collaboration
- Domain Adaptation and Few-Shot Learning
- Privacy-Preserving Technologies in Data
- Online Learning and Analytics
- Speech and Dialogue Systems
- Reinforcement Learning in Robotics
Stanford University
2022-2024
Cornell University
2018-2021
Columbia University
2020
New York University
2020
George Washington University
2020
AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character. This report provides a thorough account of the opportunities and risks of foundation models, ranging from their capabilities (e.g., language, vision, robotics, reasoning, human interaction) and technical principles (e.g., model architectures, training procedures, data, systems,...
Neural abstractive summarization models are prone to generate content inconsistent with the source document, i.e. unfaithful. Existing automatic metrics do not capture such mistakes effectively. We tackle the problem of evaluating the faithfulness of a generated summary given its source document. We first collected human annotations of faithfulness for outputs from numerous models on two datasets. We find that current models exhibit a trade-off between abstractiveness and faithfulness: summaries with less word overlap with the source document are more likely to be unfaithful. Next, we propose an...
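The word-overlap notion of abstractiveness referred to above can be made concrete with a small sketch. This is an illustrative measure (the fraction of summary bigrams absent from the source), not the exact metric used in the paper:

```python
# Illustrative sketch: abstractiveness as the fraction of summary n-grams
# that do not appear in the source document (higher = more abstractive).

def novel_ngram_fraction(source: str, summary: str, n: int = 2) -> float:
    """Return the fraction of summary n-grams absent from the source."""
    def ngrams(text: str):
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    source_ngrams = ngrams(source)
    summary_ngrams = ngrams(summary)
    if not summary_ngrams:
        return 0.0
    novel = sum(1 for g in summary_ngrams if g not in source_ngrams)
    return novel / len(summary_ngrams)

# Per the study's finding, summaries scoring higher on this kind of measure
# (less word overlap with the source) are more likely to be unfaithful.
```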
Sebastian Gehrmann, Tosin Adewumi, Karmanya Aggarwal, Pawan Sasanka Ammanamanchi, Anuoluwapo Aremu, Antoine Bosselut, Khyathi Raghavi Chandu, Miruna-Adriana Clinciu, Dipanjan Das, Kaustubh Dhole, Wanyu Du, Esin Durmus, Ondřej Dušek, Chris Chinenye Emezue, Varun Gangal, Cristina Garbacea, Tatsunori Hashimoto, Yufang Hou, Yacine Jernite, Harsh Jhamtani, Yangfeng Ji, Shailza Jolly, Mihir Kale, Dhruv Kumar, Faisal Ladhak, Aman Madaan, Mounica Maddela, Khyati Mahajan, Saad Mahamood, Bodhisattwa...
Machine learning models that convert user-written text descriptions into images are now widely available online and used by millions of users to generate millions of images a day. We investigate the potential for these models to amplify dangerous and complex stereotypes. We find that a broad range of ordinary prompts produce stereotypes, including prompts simply mentioning traits, descriptors, occupations, or objects. For example, we find cases of prompting for basic traits or social roles resulting in images reinforcing whiteness as ideal, and prompting for occupations resulting in amplification...
Large language models (LLMs) have shown promise for automatic summarization but the reasons behind their successes are poorly understood. By conducting a human evaluation on ten LLMs across different pretraining methods, prompts, and model scales, we make two important observations. First, we find that instruction tuning, not model size, is the key to the LLM's zero-shot summarization capability. Second, existing studies have been limited by low-quality references, leading to underestimates of human performance and lower few-shot...
Language models (LMs) are increasingly being used in open-ended contexts, where the opinions they reflect in response to subjective queries can have a profound impact, both on user satisfaction and on shaping the views of society at large. In this work, we put forth a quantitative framework to investigate the opinions reflected by LMs -- leveraging high-quality public opinion polls and their associated human responses. Using this framework, we create OpinionsQA, a new dataset for evaluating the alignment of LM opinions with those of 60 US demographic...
We introduce WikiLingua, a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems. We extract article and summary pairs in 18 languages from WikiHow, a high-quality, collaborative resource of how-to guides on a diverse set of topics written by human authors. We create gold-standard article-summary alignments across languages by aligning the images that are used to describe each step of an article. As baselines for further studies, we evaluate the performance of existing methods on our...
Despite recent progress in abstractive summarization, systems still suffer from faithfulness errors. While prior work has proposed models that improve faithfulness, it is unclear whether the improvement comes from an increased level of extractiveness of the model outputs, as one naive way to improve faithfulness is to make summarization systems more extractive. In this work, we present a framework for evaluating the effective faithfulness of summarization systems, by generating a faithfulness-abstractiveness trade-off curve that serves as a control at different operating points on...
Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models. First, we taxonomize the vast space of potential scenarios (i.e. use cases) and metrics (i.e. desiderata) that are of interest for LMs. Then we select a broad subset based on coverage and feasibility, noting what's missing or underrepresented (e.g. question answering for neglected English...
Large language models (LLMs) may not equitably represent diverse global perspectives on societal issues. In this paper, we develop a quantitative framework to evaluate whose opinions model-generated responses are more similar to. We first build a dataset, GlobalOpinionQA, comprised of questions and answers from cross-national surveys designed to capture diverse opinions on global issues across different countries. Next, we define a metric that quantifies the similarity between LLM-generated survey responses and human responses, conditioned...
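As a rough illustration of such a similarity metric, the sketch below uses one plausible instantiation: one minus the Jensen-Shannon distance between the model's answer distribution and a country's aggregated human answer distribution over the same survey options. The exact formulation in the paper may differ.

```python
# Minimal sketch (assumed formulation, not necessarily the paper's exact metric):
# similarity = 1 - Jensen-Shannon distance between two answer distributions.
import numpy as np
from scipy.spatial.distance import jensenshannon

def opinion_similarity(model_probs, human_probs) -> float:
    """Similarity in [0, 1] between two distributions over answer choices."""
    p = np.asarray(model_probs, dtype=float)
    q = np.asarray(human_probs, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    return 1.0 - jensenshannon(p, q, base=2)

# Example: a 4-option survey question compared against one country's responses.
print(opinion_similarity([0.7, 0.1, 0.1, 0.1], [0.4, 0.3, 0.2, 0.1]))
```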
To recognize and mitigate harms from large language models (LLMs), we need to understand the prevalence and nuances of stereotypes in LLM outputs. Toward this end, we present Marked Personas, a prompt-based method to measure stereotypes in LLMs for intersectional demographic groups without any lexicon or data labeling. Grounded in the sociolinguistic concept of markedness (which characterizes explicitly linguistically marked categories versus unmarked defaults), our proposed method is twofold: 1) prompting an LLM to generate personas, i.e.,...
There is growing consensus that language model (LM) developers should not be the sole deciders of LM behavior, creating a need for methods that enable the broader public to collectively shape the behavior of LM systems that affect them. To address this need, we present Collective Constitutional AI (CCAI): a multi-stage process for sourcing and integrating public input into LMs—from identifying a target population and sourcing principles to training and evaluating a model. We demonstrate the real-world practicality of this approach by creating what is, to our knowledge, the first...
We introduce GEM, a living benchmark for natural language Generation (NLG), its Evaluation, and Metrics. Measuring progress in NLG relies on a constantly evolving ecosystem of automated metrics, datasets, and human evaluation standards. Due to this moving target, new models often still evaluate on divergent anglo-centric corpora with well-established, but flawed, metrics. This disconnect makes it challenging to identify the limitations of current models and opportunities for progress. Addressing this limitation, GEM provides...
Many real-world applications of language models (LMs), such as writing assistance and code autocomplete, involve human-LM interaction. However, most benchmarks are non-interactive in the sense that a model produces output without human involvement. To evaluate human-LM interaction, we develop a new framework, Human-AI Language-based Interaction Evaluation (HALIE), that defines the components of interactive systems and the dimensions to consider when designing evaluation metrics. Compared to standard, non-interactive evaluation, HALIE captures (i)...
Esin Durmus, Claire Cardie. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 2018.
When trying to gain better visibility into a machine learning model in order to understand and mitigate the associated risks, a potentially valuable source of evidence is: which training examples most contribute to a given behavior? Influence functions aim to answer a counterfactual: how would the model's parameters (and hence its outputs) change if a given sequence were added to the training set? While influence functions have produced insights for small models, they are difficult to scale to large language models (LLMs) due to the difficulty of computing an...
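For context, the classical influence-function estimate (in the standard formulation of Koh and Liang, 2017, not necessarily the exact estimator used in this work) approximates how up-weighting a training example $z_m$ changes the loss on a query $z_c$; scaling this to LLMs is hard precisely because it requires an inverse-Hessian-vector product:

$$\mathcal{I}(z_m, z_c) \;=\; -\,\nabla_\theta \mathcal{L}(z_c, \hat{\theta})^{\top}\, H_{\hat{\theta}}^{-1}\, \nabla_\theta \mathcal{L}(z_m, \hat{\theta}), \qquad H_{\hat{\theta}} \;=\; \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta^{2}\, \mathcal{L}(z_i, \hat{\theta}).$$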
Human feedback is commonly utilized to finetune AI assistants. But human feedback may also encourage model responses that match user beliefs over truthful ones, a behaviour known as sycophancy. We investigate the prevalence of sycophancy in models whose finetuning procedure made use of human feedback, and the potential role of human preference judgments in such behavior. We first demonstrate that five state-of-the-art AI assistants consistently exhibit sycophancy across four varied free-form text-generation tasks. To understand if human preferences...
Polis is a platform that leverages machine intelligence to scale up deliberative processes. In this paper, we explore the opportunities and risks associated with applying Large Language Models (LLMs) towards challenges with facilitating, moderating and summarizing the results of Polis engagements. In particular, we demonstrate with pilot experiments using Anthropic's Claude that LLMs can indeed augment human intelligence to help more efficiently run Polis conversations. We find that summarization capabilities enable categorically new methods with immense...
Faisal Ladhak, Esin Durmus, Mirac Suzgun, Tianyi Zhang, Dan Jurafsky, Kathleen McKeown, Tatsunori Hashimoto. Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics. 2023.
Online debate forums provide users a platform to express their opinions on controversial topics while being exposed to opinions from a diverse set of viewpoints. Existing work in Natural Language Processing (NLP) has shown that linguistic features extracted from the debate text and features encoding the characteristics of the audience are both critical in persuasion studies. In this paper, we aim to further investigate the role of the discourse structure of arguments from online debates in their persuasiveness. In particular, we use a factor graph model to obtain features for the argument structure of an...
Existing argumentation datasets have succeeded in allowing researchers to develop computational methods for analyzing the content, structure and linguistic features of argumentative text. They have been much less successful in fostering studies of the effect of "user" traits -- characteristics and beliefs of the participants -- on the debate/argument outcome, as this type of user information is generally not available. This paper presents a dataset of 78,376 debates generated over a 10-year period along with surprisingly comprehensive...
Large language models (LLMs) perform better when they produce step-by-step, "Chain-of-Thought" (CoT) reasoning before answering a question, but it is unclear if the stated reasoning is a faithful explanation of the model's actual reasoning (i.e., its process for answering the question). We investigate hypotheses for how CoT reasoning may be unfaithful, by examining how the model's predictions change when we intervene on the CoT (e.g., adding mistakes or paraphrasing it). Models show large variation across tasks in how strongly they condition on the CoT when predicting their answer, sometimes...
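One intervention of the kind described above (truncating the CoT and checking whether the final answer moves) can be sketched as follows. Here `answer(question, cot)` is a hypothetical callable standing in for a model query, not an API from the paper:

```python
# Minimal sketch of a CoT intervention: truncate the chain of thought and
# measure how often the model's final answer changes as a result.

def cot_reliance(examples, answer, keep_fraction: float = 0.5) -> float:
    """Fraction of examples whose answer changes when the CoT is truncated.

    `examples` is an iterable of (question, cot) pairs; `answer` is a
    hypothetical function that returns the model's final answer given a
    question and a (possibly perturbed) chain of thought.
    """
    examples = list(examples)
    changed = 0
    for question, cot in examples:
        full_answer = answer(question, cot)
        truncated_cot = cot[: int(len(cot) * keep_fraction)]
        changed += answer(question, truncated_cot) != full_answer
    return changed / len(examples)

# A high value suggests the model conditions strongly on its stated reasoning;
# a value near zero suggests the CoT may be post-hoc rather than faithful.
```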
Sebastian Gehrmann, Abhik Bhattacharjee, Abinaya Mahendiran, Alex Wang, Alexandros Papangelis, Aman Madaan, Angelina McMillan-Major, Anna Shvets, Ashish Upadhyay, Bernd Bohnet, Bingsheng Yao, Bryan Wilie, Chandra Bhagavatula, Chaobin You, Craig Thomson, Cristina Garbacea, Dakuo Wang, Daniel Deutsch, Deyi Xiong, Di Jin, Dimitra Gkatzia, Dragomir Radev, Elizabeth Clark, Esin Durmus, Faisal Ladhak, Filip Ginter, Genta Indra Winata, Hendrik Strobelt, Hiroaki Hayashi, Jekaterina Novikova, Jenna...