- Topic Modeling
- Natural Language Processing Techniques
- Advanced Text Analysis Techniques
- Multimodal Machine Learning Applications
- Software Engineering Research
- Text Readability and Simplification
- Adversarial Robustness in Machine Learning
- Hate Speech and Cyberbullying Detection
- Speech Recognition and Synthesis
- Computational and Text Analysis Methods
- Speech and Dialogue Systems
- Advanced Data Compression Techniques
- Biomedical Text Mining and Ontologies
- Explainable Artificial Intelligence (XAI)
- Music and Audio Processing
- Media Influence and Politics
- Machine Learning and Data Classification
- Human Pose and Action Recognition
- Domain Adaptation and Few-Shot Learning
- Algorithms and Data Compression
- Video Analysis and Summarization
- Online Learning and Analytics
- Web Data Mining and Analysis
- Machine Learning and Algorithms
- Misinformation and Its Impacts
Columbia University
2020-2024
Salesforce (United States)
2023
Stanford University
2023
Amazon (United States)
2020
Cornell University
2019
AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character. This report provides a thorough account of the opportunities and risks of foundation models, ranging from their capabilities (e.g., language, vision, robotics, reasoning, human interaction) and technical principles (e.g., model architectures, training procedures, data, systems,...
Sebastian Gehrmann, Tosin Adewumi, Karmanya Aggarwal, Pawan Sasanka Ammanamanchi, Anuoluwapo Aremu, Antoine Bosselut, Khyathi Raghavi Chandu, Miruna-Adriana Clinciu, Dipanjan Das, Kaustubh Dhole, Wanyu Du, Esin Durmus, Ondřej Dušek, Chris Chinenye Emezue, Varun Gangal, Cristina Garbacea, Tatsunori Hashimoto, Yufang Hou, Yacine Jernite, Harsh Jhamtani, Yangfeng Ji, Shailza Jolly, Mihir Kale, Dhruv Kumar, Faisal Ladhak, Aman Madaan, Mounica Maddela, Khyati Mahajan, Saad Mahamood, Bodhisattwa...
Machine learning models that convert user-written text descriptions into images are now widely available online and used by millions of users to generate millions of images a day. We investigate the potential for these models to amplify dangerous and complex stereotypes. We find that a broad range of ordinary prompts produce stereotypes, including prompts simply mentioning traits, descriptors, occupations, or objects. For example, we find cases of prompting for basic traits or social roles resulting in images reinforcing whiteness as ideal, and prompting for occupations resulting in amplification...
Large language models (LLMs) have shown promise for automatic summarization, but the reasons behind their successes are poorly understood. By conducting a human evaluation on ten LLMs across different pretraining methods, prompts, and model scales, we make two important observations. First, we find that instruction tuning, not model size, is the key to the LLM's zero-shot summarization capability. Second, existing studies have been limited by low-quality references, leading to underestimates of human performance and lower few-shot...
Language models (LMs) are increasingly being used in open-ended contexts, where the opinions they reflect in response to subjective queries can have a profound impact, both on user satisfaction and on shaping the views of society at large. In this work, we put forth a quantitative framework to investigate the opinions reflected by LMs -- leveraging high-quality public opinion polls and their associated human responses. Using this framework, we create OpinionsQA, a new dataset for evaluating the alignment of LM opinions with those of 60 US demographic...
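The alignment between an LM's answer distribution and a demographic group's answer distribution over a multiple-choice poll question can be sketched as a similarity between probability distributions. A minimal sketch, assuming ordered answer choices and using one minus a normalized 1-D Wasserstein distance (the paper's exact metric may differ):

```python
def opinion_alignment(lm_dist, human_dist):
    """Similarity in [0, 1] between two distributions over K ordered
    answer choices: 1 minus a normalized 1-D Wasserstein distance."""
    assert len(lm_dist) == len(human_dist) and len(lm_dist) > 1
    cdf_lm = cdf_h = wasserstein = 0.0
    for p, q in zip(lm_dist, human_dist):
        cdf_lm += p
        cdf_h += q
        wasserstein += abs(cdf_lm - cdf_h)  # unit spacing between choices
    # max distance is K-1 (all mass at opposite extremes)
    return 1.0 - wasserstein / (len(lm_dist) - 1)
```

Identical distributions score 1.0; all mass at opposite extremes scores 0.0.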
We introduce WikiLingua, a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems. We extract article and summary pairs in 18 languages from WikiHow, a high-quality, collaborative resource of how-to guides on a diverse set of topics written by human authors. We create gold-standard article-summary alignments across languages by aligning the images that are used to describe each step in an article. As a set of baselines for further studies, we evaluate the performance of existing methods on our...
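The image-based alignment described above can be sketched as a join on shared image identifiers: steps in two language versions of the same WikiHow article that are illustrated by the same image are treated as parallel. A minimal sketch (function and field names are hypothetical, not WikiLingua's actual schema):

```python
def align_steps_by_image(steps_en, steps_xx):
    """steps_*: lists of (image_id, step_text) for the same how-to article
    in two languages; returns (en_text, xx_text) pairs sharing an image."""
    xx_by_image = {image_id: text for image_id, text in steps_xx}
    return [
        (text, xx_by_image[image_id])
        for image_id, text in steps_en
        if image_id in xx_by_image
    ]
```

Steps without a shared image simply drop out of the alignment, which keeps only high-precision pairs.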
We incorporate an explicit neural interlingua into a multilingual encoder-decoder neural machine translation (NMT) architecture. We demonstrate that our model learns a language-independent representation by performing direct zero-shot translation (without using pivot translation), and by using the source sentence embeddings to create an English Yelp review classifier that, through the mediation of the interlingua, can also classify French and German reviews. Furthermore, we show that, despite using a smaller number of parameters than a pairwise collection...
Despite recent progress in abstractive summarization, systems still suffer from faithfulness errors. While prior work has proposed models that improve faithfulness, it is unclear whether the improvement comes from an increased level of extractiveness of the model outputs, as one naive way to make summarization systems more faithful is to make them more extractive. In this work, we present a framework for evaluating the effective faithfulness of summarization systems, by generating a faithfulness-abstractiveness trade-off curve that serves as a control at different operating points on...
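Extractiveness of a summary relative to its source, the quantity the trade-off curve controls for, can be approximated in several ways; a crude unigram copy-rate proxy illustrates the idea (published work typically uses extractive-fragment statistics instead):

```python
def copy_rate(article, summary):
    """Crude extractiveness proxy: the fraction of summary tokens that
    also occur anywhere in the article."""
    article_vocab = set(article.lower().split())
    summary_tokens = summary.lower().split()
    return sum(tok in article_vocab for tok in summary_tokens) / len(summary_tokens)
```

A fully copied summary scores 1.0; a summary sharing no words with the source scores 0.0.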
Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models. First, we taxonomize the vast space of potential scenarios (i.e., use cases) and metrics (i.e., desiderata) that are of interest for LMs. Then we select a broad subset based on coverage and feasibility, noting what's missing or underrepresented (e.g., question answering for neglected English...
We introduce GEM, a living benchmark for natural language Generation (NLG), its Evaluation, and Metrics. Measuring progress in NLG relies on a constantly evolving ecosystem of automated metrics, datasets, and human evaluation standards. Due to this moving target, new models often still evaluate on divergent anglo-centric corpora with well-established, but flawed, metrics. This disconnect makes it challenging to identify the limitations of current models and opportunities for progress. Addressing this limitation, GEM provides...
Many real-world applications of language models (LMs), such as writing assistance and code autocomplete, involve human-LM interaction. However, most benchmarks are non-interactive in that a model produces output without human involvement. To evaluate human-LM interaction, we develop a new framework, Human-AI Language-based Interaction Evaluation (HALIE), that defines the components of interactive systems and the dimensions to consider when designing evaluation metrics. Compared to standard, non-interactive evaluation, HALIE captures (i)...
Selecting the "right" amount of information to include in a summary is a difficult task. A good summary should be detailed and entity-centric without being overly dense and hard to follow. To better understand this tradeoff, we solicit increasingly dense GPT-4 summaries with what we refer to as a "Chain of Density" (CoD) prompt. Specifically, GPT-4 generates an initial entity-sparse summary before iteratively incorporating missing salient entities without increasing the length. Summaries generated by CoD are more abstractive, exhibit more fusion, and have...
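The CoD procedure is a simple iterative prompting loop. A minimal sketch, where `llm` is a stand-in for any GPT-4-style text-completion call and the prompt wording is paraphrased rather than the paper's exact prompt:

```python
def chain_of_density(article, llm, rounds=5):
    """Produce a chain of increasingly dense summaries of `article`.
    `llm(prompt) -> str` is a placeholder for a GPT-4-style API call."""
    # Start from an entity-sparse summary...
    summaries = [llm(f"Write a short, entity-sparse summary of:\n{article}")]
    # ...then repeatedly fold in missing entities at constant length.
    for _ in range(rounds - 1):
        summaries.append(llm(
            "Identify 1-3 salient entities from the article that are missing "
            "from the summary, then rewrite the summary to include them "
            "without increasing its length.\n"
            f"Article:\n{article}\nSummary:\n{summaries[-1]}"
        ))
    return summaries
```

Each round trades some filler text for additional entities, which is what makes the later summaries denser.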
Faisal Ladhak, Esin Durmus, Mirac Suzgun, Tianyi Zhang, Dan Jurafsky, Kathleen McKeown, Tatsunori Hashimoto. Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics. 2023.
We present a new summarization task, generating summaries of novel chapters using summary/chapter pairs from online study guides. This is a harder task than news summarization, given the chapter length as well as the extreme paraphrasing and generalization found in the summaries. We focus on extractive summarization, which requires the creation of a gold-standard set of extracts and a metric for aligning reference summary sentences with chapter sentences to create gold extracts; we also experiment with different alignment methods. Our experiments demonstrate significant...
Sebastian Gehrmann, Abhik Bhattacharjee, Abinaya Mahendiran, Alex Wang, Alexandros Papangelis, Aman Madaan, Angelina McMillan-Major, Anna Shvets, Ashish Upadhyay, Bernd Bohnet, Bingsheng Yao, Bryan Wilie, Chandra Bhagavatula, Chaobin You, Craig Thomson, Cristina Garbacea, Dakuo Wang, Daniel Deutsch, Deyi Xiong, Di Jin, Dimitra Gkatzia, Dragomir Radev, Elizabeth Clark, Esin Durmus, Faisal Ladhak, Filip Ginter, Genta Indra Winata, Hendrik Strobelt, Hiroaki Hayashi, Jekaterina Novikova, Jenna...
Research in the social sciences and psychology has shown that the persuasiveness of an argument depends not only on the language employed, but also on attributes of the source/communicator, the audience, and the appropriateness and strength of the argument's claims given the pragmatic and discourse context of the argument. Among these characteristics of persuasive arguments, prior work in NLP does not explicitly investigate the effect of the pragmatic and discourse context when determining argument quality. This paper presents a new dataset to initiate the study of this aspect of argumentation: it consists of a diverse...
Encoder-only transformer models such as BERT offer a great performance-size tradeoff for retrieval and classification tasks with respect to larger decoder-only models. Despite being the workhorse of numerous production pipelines, there have been limited Pareto improvements to BERT since its release. In this paper, we introduce ModernBERT, bringing modern model optimizations to encoder-only models and representing a major Pareto improvement over older encoders. Trained on 2 trillion tokens with a native 8192 sequence length,...
Model-based, reference-free evaluation metrics have been proposed as a fast and cost-effective approach to evaluate Natural Language Generation (NLG) systems. Despite promising recent results, we find evidence that reference-free evaluation metrics of summarization and dialog generation may be relying on spurious correlations with measures such as word overlap, perplexity, and length. We further observe that for text summarization, these metrics have high error rates when ranking current state-of-the-art abstractive...
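One way to probe for such spurious correlations is to measure rank correlation between a metric's scores and a shallow quantity like summary length. A minimal sketch of a Spearman correlation (no tie handling), not the paper's actual analysis code:

```python
def spearman(xs, ys):
    """Spearman rank correlation between two sequences (assumes no ties)."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, idx in enumerate(order):
            r[idx] = rank
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mean = (n - 1) / 2                      # mean of ranks 0..n-1
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)  # rank variance; same for ry
    return cov / var
```

A reference-free metric whose scores track length this closely, independent of summary quality, is a candidate for relying on a spurious correlation.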
Kasturi Bhattacharjee, Miguel Ballesteros, Rishita Anubhai, Smaranda Muresan, Jie Ma, Faisal Ladhak, Yaser Al-Onaizan. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020.
Badr AlKhamissi, Faisal Ladhak, Srinivasan Iyer, Veselin Stoyanov, Zornitsa Kozareva, Xian Li, Pascale Fung, Lambert Mathias, Asli Celikyilmaz, Mona Diab. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022.
Systems for automatic argument generation and debate require the ability to (1) determine the stance of any claims employed in the argument and (2) assess the specificity of each claim relative to the argument context. Existing work on understanding claim stance, however, has been limited to the study of argumentative structures that are relatively shallow, most often consisting of a single claim that directly supports or opposes the argument thesis. In this paper, we tackle these tasks in the context of complex arguments on a diverse set of topics. In particular, our dataset consists of manually...