Esin Durmus

ORCID: 0009-0009-7331-8160
Research Areas
  • Topic Modeling
  • Natural Language Processing Techniques
  • Advanced Text Analysis Techniques
  • Text Readability and Simplification
  • Hate Speech and Cyberbullying Detection
  • Software Engineering Research
  • Social Media and Politics
  • Sentiment Analysis and Opinion Mining
  • Explainable Artificial Intelligence (XAI)
  • Ethics and Social Impacts of AI
  • Misinformation and Its Impacts
  • Computational and Text Analysis Methods
  • Persona Design and Applications
  • Opinion Dynamics and Social Influence
  • Advanced Graph Neural Networks
  • Media Influence and Politics
  • Aerospace Engineering and Energy Systems
  • Machine Learning and Data Classification
  • Adversarial Robustness in Machine Learning
  • Wikis in Education and Collaboration
  • Domain Adaptation and Few-Shot Learning
  • Privacy-Preserving Technologies in Data
  • Online Learning and Analytics
  • Speech and dialogue systems
  • Reinforcement Learning in Robotics

Stanford University
2022-2024

Cornell University
2018-2021

Columbia University
2020

New York University
2020

George Washington University
2020

AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character. This report provides a thorough account of the opportunities and risks of foundation models, ranging from their capabilities (e.g., language, vision, robotics, reasoning, human interaction) and technical principles (e.g., model architectures, training procedures, data, systems, ...

10.48550/arxiv.2108.07258 preprint EN cc-by arXiv (Cornell University) 2021-01-01

Neural abstractive summarization models are prone to generate content inconsistent with the source document, i.e. unfaithful. Existing automatic metrics do not capture such mistakes effectively. We tackle the problem of evaluating the faithfulness of a generated summary given its source document. We first collected human annotations of faithfulness for outputs from numerous models on two datasets. We find that current models exhibit a trade-off between abstractiveness and faithfulness: outputs with less word overlap with the source document are more likely to be unfaithful. Next, we propose an...

10.18653/v1/2020.acl-main.454 preprint EN cc-by 2020-01-01
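The metric proposed in this paper is question-answering based. As a rough sketch of that family of faithfulness checks (not the exact FEQA pipeline), the example below answers the same questions from the summary and from the source document and compares the answers; the hand-written questions, toy texts, and default QA model are illustrative assumptions.

```python
# Sketch of a QA-based faithfulness check: answers that differ between summary
# and source suggest unfaithful content. In FEQA-style metrics the questions are
# generated automatically from the summary; here they are written by hand.
from transformers import pipeline

qa = pipeline("question-answering")  # default extractive QA model

document = (
    "The city council approved the new transit plan on Tuesday, "
    "allocating $40 million for additional bus routes."
)
summary = "The council approved a transit plan allocating $40 million for new subway lines."
questions = ["What did the council approve?", "What is the money allocated for?"]

def token_f1(a: str, b: str) -> float:
    """Token-overlap F1 between two answer strings."""
    a_tok, b_tok = a.lower().split(), b.lower().split()
    common = sum(min(a_tok.count(t), b_tok.count(t)) for t in set(a_tok))
    if not common:
        return 0.0
    precision, recall = common / len(a_tok), common / len(b_tok)
    return 2 * precision * recall / (precision + recall)

scores = []
for q in questions:
    ans_from_summary = qa(question=q, context=summary)["answer"]
    ans_from_source = qa(question=q, context=document)["answer"]
    scores.append(token_f1(ans_from_summary, ans_from_source))

print(f"faithfulness proxy: {sum(scores) / len(scores):.2f}")
```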

Sebastian Gehrmann, Tosin Adewumi, Karmanya Aggarwal, Pawan Sasanka Ammanamanchi, Anuoluwapo Aremu, Antoine Bosselut, Khyathi Raghavi Chandu, Miruna-Adriana Clinciu, Dipanjan Das, Kaustubh Dhole, Wanyu Du, Esin Durmus, Ondřej Dušek, Chris Chinenye Emezue, Varun Gangal, Cristina Garbacea, Tatsunori Hashimoto, Yufang Hou, Yacine Jernite, Harsh Jhamtani, Yangfeng Ji, Shailza Jolly, Mihir Kale, Dhruv Kumar, Faisal Ladhak, Aman Madaan, Mounica Maddela, Khyati Mahajan, Saad Mahamood, Bodhisattwa...

10.18653/v1/2021.gem-1.10 preprint EN cc-by 2021-01-01

Machine learning models that convert user-written text descriptions into images are now widely available online and used by millions of users to generate millions of images a day. We investigate the potential for these models to amplify dangerous and complex stereotypes. We find that a broad range of ordinary prompts produce stereotypes, including prompts simply mentioning traits, descriptors, occupations, or objects. For example, we find cases of prompting for basic traits or social roles resulting in images reinforcing whiteness as ideal, and prompting for occupations resulting in amplification...

10.1145/3593013.3594095 article EN 2023 ACM Conference on Fairness, Accountability, and Transparency 2023-06-12
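A minimal sketch of the kind of probing described above: generate several images for ordinary trait and occupation prompts and inspect the demographics of the outputs. The checkpoint id, prompts, and sample counts are assumptions for illustration, not the paper's audited systems or scale.

```python
# Generate a handful of images per ordinary prompt so the distribution of
# depicted people can be inspected. Requires a GPU and the diffusers library;
# the model id below is an assumption, substitute the system you want to audit.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompts = [
    "a photo of an attractive person",   # basic trait
    "a photo of a software developer",   # occupation
    "a photo of a housekeeper",          # occupation
]

for prompt in prompts:
    for i in range(4):  # several samples per prompt, not just one
        image = pipe(prompt).images[0]
        image.save(f"{prompt.replace(' ', '_')}_{i}.png")
```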

Large language models (LLMs) have shown promise for automatic summarization but the reasons behind their successes are poorly understood. By conducting a human evaluation on ten LLMs across different pretraining methods, prompts, and model scales, we make two important observations. First, we find instruction tuning, and not model size, is the key to the LLM's zero-shot summarization capability. Second, existing studies have been limited by low-quality references, leading to underestimates of human performance and lower few-shot...

10.1162/tacl_a_00632 article EN cc-by Transactions of the Association for Computational Linguistics 2024-01-01
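The zero-shot setting this evaluation centers on amounts to prompting an instruction-tuned model with a plain instruction and no in-context examples. A minimal sketch under assumed choices (the model and prompt wording below are illustrative, not the paper's configuration):

```python
# Zero-shot summarization with an instruction-tuned model: one instruction, no
# examples. Swap in whichever instruction-tuned checkpoint you want to study.
from transformers import pipeline

summarizer = pipeline("text2text-generation", model="google/flan-t5-base")

article = (
    "The city announced on Monday that it will expand its bike-share program to "
    "three new neighborhoods, adding 500 bikes by the end of the year."
)
prompt = f"Summarize the following news article in one sentence:\n\n{article}"
print(summarizer(prompt, max_new_tokens=60)[0]["generated_text"])
```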

Language models (LMs) are increasingly being used in open-ended contexts, where the opinions they reflect in response to subjective queries can have a profound impact, both on user satisfaction as well as in shaping the views of society at large. In this work, we put forth a quantitative framework to investigate the opinions reflected by LMs -- by leveraging high-quality public opinion polls and their associated human responses. Using this framework, we create OpinionsQA, a new dataset for evaluating the alignment of LM opinions with those of 60 US demographic...

10.48550/arxiv.2303.17548 preprint EN other-oa arXiv (Cornell University) 2023-01-01
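At its core, the kind of alignment evaluation described above reduces to comparing an LM's answer distribution for a poll question with a demographic group's human answer distribution. The sketch below uses 1 minus a normalized Wasserstein distance over ordered answer options as the similarity; treat the exact metric and the toy numbers as assumptions, not the paper's definition.

```python
# Compare an LM's distribution over ordered survey options with one demographic
# group's response distribution and report a [0, 1] alignment score.
import numpy as np
from scipy.stats import wasserstein_distance

options = np.arange(4)  # e.g. "strongly disagree" .. "strongly agree"
lm_dist = np.array([0.10, 0.20, 0.40, 0.30])     # LM probability mass per option
human_dist = np.array([0.25, 0.35, 0.25, 0.15])  # survey responses for one group

wd = wasserstein_distance(options, options, u_weights=lm_dist, v_weights=human_dist)
alignment = 1.0 - wd / (options.max() - options.min())  # normalize by option span
print(f"alignment with this demographic group: {alignment:.2f}")
```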

Large language models (LLMs) have shown promise for automatic summarization but the reasons behind their successes are poorly understood. By conducting a human evaluation on ten LLMs across different pretraining methods, prompts, and model scales, we make two important observations. First, we find instruction tuning, and not model size, is the key to the LLM's zero-shot summarization capability. Second, existing studies have been limited by low-quality references, leading to underestimates of human performance and lower few-shot and finetuning...

10.48550/arxiv.2301.13848 preprint EN cc-by arXiv (Cornell University) 2023-01-01

We introduce WikiLingua, a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems. We extract article and summary pairs in 18 languages from WikiHow, a high-quality, collaborative resource of how-to guides on a diverse set of topics written by human authors. We create gold-standard article-summary alignments across languages by aligning the images that are used to describe each how-to step in an article. As a set of baselines for further studies, we evaluate the performance of existing cross-lingual abstractive summarization methods on our...

10.18653/v1/2020.findings-emnlp.360 article EN cc-by 2020-01-01
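For readers who want to browse the corpus, a minimal loading sketch, assuming the public Hugging Face Hub mirror under the id wiki_lingua with per-language configs; the hub id, config name, and any trust_remote_code requirement depend on your datasets version, and the dataset is also distributed through the authors' release.

```python
# Load the English portion of WikiLingua from the assumed Hugging Face mirror
# and inspect one article/summary record.
from datasets import load_dataset

ds = load_dataset("wiki_lingua", "english", split="train")
print(ds[0].keys())  # article sections and their aligned summaries
```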

Despite recent progress in abstractive summarization, systems still suffer from faithfulness errors. While prior work has proposed models that improve faithfulness, it is unclear whether the improvement comes from an increased level of extractiveness of the model outputs, as one naive way to improve faithfulness is to make summarization models more extractive. In this work, we present a framework for evaluating the effective faithfulness of summarization systems, by generating a faithfulness-abstractiveness trade-off curve that serves as a control at different operating points on...

10.18653/v1/2022.acl-long.100 article EN cc-by Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2022-01-01
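The abstractiveness axis of such a trade-off curve is typically some measure of how much of the summary is copied from the source. A minimal sketch using the fraction of novel summary bigrams as that proxy (a common choice, not necessarily the paper's exact measure):

```python
# Fraction of summary n-grams that do not appear in the source document:
# higher values mean a more abstractive (less copied) summary.
def novel_ngram_ratio(source: str, summary: str, n: int = 2) -> float:
    def ngrams(text: str) -> set:
        toks = text.lower().split()
        return set(zip(*(toks[i:] for i in range(n))))
    src, summ = ngrams(source), ngrams(summary)
    return len(summ - src) / len(summ) if summ else 0.0

source = "The committee voted on Friday to extend the funding program through 2026."
summary = "Lawmakers agreed to keep the program funded for two more years."
print(f"novel bigram ratio: {novel_ngram_ratio(source, summary):.2f}")
```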

Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models. First, we taxonomize the vast space of potential scenarios (i.e. use cases) and metrics (i.e. desiderata) that are of interest for LMs. Then we select a broad subset based on coverage and feasibility, noting what's missing or underrepresented (e.g. question answering for neglected English...

10.48550/arxiv.2211.09110 preprint EN cc-by arXiv (Cornell University) 2022-01-01

Large language models (LLMs) may not equitably represent diverse global perspectives on societal issues. In this paper, we develop a quantitative framework to evaluate whose opinions model-generated responses are more similar to. We first build a dataset, GlobalOpinionQA, comprised of questions and answers from cross-national surveys designed to capture diverse opinions on global issues across different countries. Next, we define a metric that quantifies the similarity between LLM-generated survey responses and human responses, conditioned...

10.48550/arxiv.2306.16388 preprint EN cc-by arXiv (Cornell University) 2023-01-01
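A sketch of the similarity computation such a metric performs: compare the LLM's answer distribution for one survey question against each country's human response distribution. Here 1 minus the Jensen-Shannon distance stands in for the similarity; the paper's exact formulation and the toy numbers are not claimed.

```python
# Score how close the LLM's answer distribution is to each country's responses.
import numpy as np
from scipy.spatial.distance import jensenshannon

llm = np.array([0.55, 0.30, 0.15])  # LLM probabilities over the answer options
countries = {
    "Country A": np.array([0.50, 0.35, 0.15]),
    "Country B": np.array([0.15, 0.25, 0.60]),
}

for name, human in countries.items():
    similarity = 1.0 - jensenshannon(llm, human, base=2)  # distance lies in [0, 1]
    print(f"{name}: similarity = {similarity:.2f}")
```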

To recognize and mitigate harms from large language models (LLMs), we need to understand the prevalence and nuances of stereotypes in LLM outputs. Toward this end, we present Marked Personas, a prompt-based method to measure stereotypes in LLMs for intersectional demographic groups without any lexicon or data labeling. Grounded in the sociolinguistic concept of markedness (which characterizes explicitly linguistically marked categories versus unmarked defaults), our proposed method is twofold: 1) prompting an LLM to generate personas, i.e., ...

10.18653/v1/2023.acl-long.84 article EN cc-by 2023-01-01
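A minimal sketch of the analysis step of a markedness-style method: given personas an LLM generated for a marked group and for the unmarked default, surface words over-represented in the marked group's personas. The persona-generation step (prompting the LLM) is elided, the texts are toy examples, and the smoothed log-ratio below is a simple stand-in for the paper's statistic.

```python
# Rank words by how much more often they appear in the marked group's personas
# than in the unmarked default's personas (add-one smoothing).
from collections import Counter
import math

marked = [
    "a resilient woman who overcame adversity with grace and strength",
    "a strong, vibrant woman proud of her heritage and community",
]
default = [
    "a person who enjoys hiking and works as a software engineer",
    "a curious person who loves reading and spending time with friends",
]

def counts(texts):
    return Counter(w for t in texts for w in t.lower().split())

m, d = counts(marked), counts(default)
vocab = set(m) | set(d)
m_total, d_total = sum(m.values()), sum(d.values())

scores = {
    w: math.log((m[w] + 1) / (m_total + len(vocab)))
       - math.log((d[w] + 1) / (d_total + len(vocab)))
    for w in vocab
}
for word, score in sorted(scores.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{word}: {score:+.2f}")
```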

There is growing consensus that language model (LM) developers should not be the sole deciders of LM behavior, creating a need for methods that enable the broader public to collectively shape the behavior of LM systems that affect them. To address this need, we present Collective Constitutional AI (CCAI): a multi-stage process for sourcing and integrating public input into LMs—from identifying a target population to sourcing principles to training and evaluating a model. We demonstrate the real-world practicality of this approach by creating what is, to our knowledge, the first...

10.1145/3630106.3658979 article EN cc-by 2024 ACM Conference on Fairness, Accountability, and Transparency 2024-06-03

We introduce GEM, a living benchmark for natural language Generation (NLG), its Evaluation, and Metrics. Measuring progress in NLG relies on a constantly evolving ecosystem of automated metrics, datasets, and human evaluation standards. Due to this moving target, new models often still evaluate on divergent anglo-centric corpora with well-established, but flawed, metrics. This disconnect makes it challenging to identify the limitations of current models and opportunities for progress. Addressing this limitation, GEM provides...

10.48550/arxiv.2102.01672 preprint EN cc-by arXiv (Cornell University) 2021-01-01

Many real-world applications of language models (LMs), such as writing assistance and code autocomplete, involve human-LM interaction. However, most benchmarks are non-interactive in that a model produces output without human involvement. To evaluate human-LM interaction, we develop a new framework, Human-AI Language-based Interaction Evaluation (HALIE), that defines the components of interactive systems and the dimensions to consider when designing evaluation metrics. Compared to standard, non-interactive evaluation, HALIE captures (i)...

10.48550/arxiv.2212.09746 preprint EN cc-by arXiv (Cornell University) 2022-01-01

Esin Durmus, Claire Cardie. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 2018.

10.18653/v1/n18-1094 preprint EN cc-by 2018-01-01

When trying to gain better visibility into a machine learning model in order to understand and mitigate the associated risks, a potentially valuable source of evidence is: which training examples most contribute to a given behavior? Influence functions aim to answer a counterfactual: how would the model's parameters (and hence its outputs) change if a given sequence were added to the training set? While influence functions have produced insights for small models, they are difficult to scale to large language models (LLMs) due to the difficulty of computing an...

10.48550/arxiv.2308.03296 preprint EN other-oa arXiv (Cornell University) 2023-01-01
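For context, the counterfactual the abstract refers to is usually formalized with the classical influence function. The notation below is the standard Koh-and-Liang-style formulation, not an equation quoted from this paper, whose contribution is an approximation (EK-FAC) of the inverse-Hessian-vector product that makes the computation tractable at LLM scale.

```latex
% Upweighting training sequence z_m by \epsilon changes the loss on query z_q,
% to first order, by \epsilon \cdot \mathcal{I}(z_m, z_q), where \hat\theta are
% the trained parameters and \mathcal{L} is the per-example loss:
\mathcal{I}(z_m, z_q)
  = -\,\nabla_\theta \mathcal{L}(z_q, \hat\theta)^{\top}\,
      H_{\hat\theta}^{-1}\,
      \nabla_\theta \mathcal{L}(z_m, \hat\theta),
\qquad
H_{\hat\theta} = \frac{1}{N}\sum_{i=1}^{N} \nabla_\theta^{2}\,\mathcal{L}(z_i, \hat\theta).
```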

Human feedback is commonly utilized to finetune AI assistants. But human feedback may also encourage model responses that match user beliefs over truthful ones, a behaviour known as sycophancy. We investigate the prevalence of sycophancy in models whose finetuning procedure made use of human feedback, and the potential role of human preference judgments in such behavior. We first demonstrate that five state-of-the-art AI assistants consistently exhibit sycophancy across four varied free-form text-generation tasks. To understand if human preferences...

10.48550/arxiv.2310.13548 preprint EN cc-by arXiv (Cornell University) 2023-01-01
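A minimal sketch of one probe in this spirit: ask the same question with and without a stated user belief and check whether the answer drifts toward that belief. The small local model is only a stand-in for the assistant under test, and the single prompt pair is illustrative; a real evaluation scores many pairs and reports how often the stated belief changes the answer.

```python
# Compare completions for a neutral prompt vs. the same prompt with a user
# belief prepended. Swap the gpt2 stand-in for the assistant you actually test.
from transformers import pipeline

generate = pipeline("text-generation", model="gpt2")

question = "Is the following claim correct? 'The Great Wall of China is visible from space.'"
prompts = {
    "neutral": f"{question}\nAnswer:",
    "user belief stated": f"I'm quite sure this claim is correct.\n{question}\nAnswer:",
}

for label, prompt in prompts.items():
    output = generate(prompt, max_new_tokens=40)[0]["generated_text"]
    print(f"[{label}] {output[len(prompt):].strip()}\n")
```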

Polis is a platform that leverages machine intelligence to scale up deliberative processes. In this paper, we explore the opportunities and risks associated with applying Large Language Models (LLMs) to the challenges of facilitating, moderating, and summarizing the results of Polis engagements. In particular, we demonstrate with pilot experiments using Anthropic's Claude that LLMs can indeed augment human intelligence to help more efficiently run Polis conversations. We find that summarization capabilities enable categorically new methods with immense...

10.48550/arxiv.2306.11932 preprint EN cc-by-nc-nd arXiv (Cornell University) 2023-01-01

Faisal Ladhak, Esin Durmus, Mirac Suzgun, Tianyi Zhang, Dan Jurafsky, Kathleen McKeown, Tatsunori Hashimoto. Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics. 2023.

10.18653/v1/2023.eacl-main.234 article EN cc-by 2023-01-01

Online debate forums provide users a platform to express their opinions on controversial topics while being exposed to opinions from a diverse set of viewpoints. Existing work in Natural Language Processing (NLP) has shown that linguistic features extracted from the debate text and features encoding the characteristics of the audience are both critical in persuasion studies. In this paper, we aim to further investigate the role of the discourse structure of arguments from online debates in their persuasiveness. In particular, we use a factor graph model to obtain features for the argument structure of an...

10.18653/v1/2020.emnlp-main.716 article EN cc-by 2020-01-01

Existing argumentation datasets have succeeded in allowing researchers to develop computational methods for analyzing the content, structure and linguistic features of argumentative text. They have been much less successful in fostering studies of the effect of "user" traits -- characteristics and beliefs of the participants -- on the debate/argument outcome, as this type of user information is generally not available. This paper presents a dataset of 78,376 debates generated over a 10-year period along with surprisingly comprehensive...

10.18653/v1/p19-1057 preprint EN cc-by 2019-01-01

Large language models (LLMs) perform better when they produce step-by-step, "Chain-of-Thought" (CoT) reasoning before answering a question, but it is unclear if the stated reasoning is a faithful explanation of the model's actual reasoning (i.e., its process for answering the question). We investigate hypotheses for how CoT reasoning may be unfaithful, by examining how the model predictions change when we intervene on the CoT (e.g., by adding mistakes or paraphrasing it). Models show large variation across tasks in how strongly they condition on the CoT when predicting their answer, sometimes...

10.48550/arxiv.2307.13702 preprint EN other-oa arXiv (Cornell University) 2023-01-01
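A minimal sketch of the intervention logic described above: re-ask each question with the chain of thought perturbed (here, crudely truncated to its first sentence) and measure how often the final answer flips. `answer_given_cot` is a hypothetical callable standing in for whatever model or API produces an answer conditioned on a question plus an (edited) chain of thought; the paper's interventions, such as inserting mistakes or paraphrasing, are richer than this truncation.

```python
# Harness for measuring how strongly answers depend on the stated reasoning.
from typing import Callable, List, Tuple

def cot_sensitivity(
    answer_given_cot: Callable[[str, str], str],  # hypothetical model interface
    examples: List[Tuple[str, str]],  # (question, original chain of thought)
) -> float:
    """Fraction of examples whose final answer changes when the CoT is truncated."""
    flips = 0
    for question, cot in examples:
        original = answer_given_cot(question, cot)
        truncated = cot.split(".")[0] + "."
        flips += int(answer_given_cot(question, truncated) != original)
    return flips / max(len(examples), 1)

# Dummy model for illustration: it answers "B" only when the full reasoning is present.
demo = [("Which option is correct?", "Step one narrows it to B or C. Step two rules out C.")]
print(cot_sensitivity(lambda q, cot: "B" if "rules out C" in cot else "C", demo))
```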

Sebastian Gehrmann, Abhik Bhattacharjee, Abinaya Mahendiran, Alex Wang, Alexandros Papangelis, Aman Madaan, Angelina McMillan-Major, Anna Shvets, Ashish Upadhyay, Bernd Bohnet, Bingsheng Yao, Bryan Wilie, Chandra Bhagavatula, Chaobin You, Craig Thomson, Cristina Garbacea, Dakuo Wang, Daniel Deutsch, Deyi Xiong, Di Jin, Dimitra Gkatzia, Dragomir Radev, Elizabeth Clark, Esin Durmus, Faisal Ladhak, Filip Ginter, Genta Indra Winata, Hendrik Strobelt, Hiroaki Hayashi, Jekaterina Novikova, Jenna...

10.18653/v1/2022.emnlp-demos.27 article EN cc-by 2022-01-01