- Topic Modeling
- Natural Language Processing Techniques
- Advanced Text Analysis Techniques
- Multimodal Machine Learning Applications
- Software Engineering Research
- Text Readability and Simplification
- Adversarial Robustness in Machine Learning
- Hate Speech and Cyberbullying Detection
- Speech Recognition and Synthesis
- Computational and Text Analysis Methods
- Speech and Dialogue Systems
- Advanced Data Compression Techniques
- Biomedical Text Mining and Ontologies
- Explainable Artificial Intelligence (XAI)
- Music and Audio Processing
- Media Influence and Politics
- Machine Learning and Data Classification
- Human Pose and Action Recognition
- Domain Adaptation and Few-Shot Learning
- Algorithms and Data Compression
- Video Analysis and Summarization
- Online Learning and Analytics
- Web Data Mining and Analysis
- Machine Learning and Algorithms
- Misinformation and Its Impacts
Columbia University
2020-2024
Salesforce (United States)
2023
Stanford University
2023
Amazon (United States)
2020
Cornell University
2019
AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character. This report provides a thorough account of the opportunities and risks of foundation models, ranging from their capabilities (e.g., language, vision, robotics, reasoning, human interaction) and technical principles (e.g., model architectures, training procedures, data, systems,...
Sebastian Gehrmann, Tosin Adewumi, Karmanya Aggarwal, Pawan Sasanka Ammanamanchi, Anuoluwapo Aremu, Antoine Bosselut, Khyathi Raghavi Chandu, Miruna-Adriana Clinciu, Dipanjan Das, Kaustubh Dhole, Wanyu Du, Esin Durmus, Ondřej Dušek, Chris Chinenye Emezue, Varun Gangal, Cristina Garbacea, Tatsunori Hashimoto, Yufang Hou, Yacine Jernite, Harsh Jhamtani, Yangfeng Ji, Shailza Jolly, Mihir Kale, Dhruv Kumar, Faisal Ladhak, Aman Madaan, Mounica Maddela, Khyati Mahajan, Saad Mahamood, Bodhisattwa...
Machine learning models that convert user-written text descriptions into images are now widely available online and used by millions of users to generate millions of images a day. We investigate the potential for these models to amplify dangerous and complex stereotypes. We find that a broad range of ordinary prompts produce stereotypes, including prompts simply mentioning traits, descriptors, occupations, or objects. For example, we find cases of prompting for basic traits or social roles resulting in images reinforcing whiteness as ideal, and prompting for occupations resulting in amplification...
Large language models (LLMs) have shown promise for automatic summarization, but the reasons behind their successes are poorly understood. By conducting a human evaluation on ten LLMs across different pretraining methods, prompts, and model scales, we make two important observations. First, we find that instruction tuning, not model size, is the key to the LLM's zero-shot summarization capability. Second, existing studies have been limited by low-quality references, leading to underestimates of human performance and lower few-shot...
Language models (LMs) are increasingly being used in open-ended contexts, where the opinions they reflect in response to subjective queries can have a profound impact, both on user satisfaction and on shaping the views of society at large. In this work, we put forth a quantitative framework to investigate the opinions reflected by LMs -- leveraging high-quality public opinion polls and their associated human responses. Using this framework, we create OpinionsQA, a new dataset for evaluating the alignment of LM opinions with those of 60 US demographic...
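The alignment between an LM's answer distribution and a demographic group's answer distribution over a multiple-choice poll question can be sketched as a similarity between probability distributions. A minimal sketch, assuming ordered answer choices and using one minus a normalized 1-D Wasserstein distance (the paper's exact metric may differ):

```python
def opinion_alignment(lm_dist, human_dist):
    """Similarity in [0, 1] between two distributions over K ordered
    answer choices: 1 minus a normalized 1-D Wasserstein distance."""
    assert len(lm_dist) == len(human_dist) and len(lm_dist) > 1
    cdf_lm = cdf_h = wasserstein = 0.0
    for p, q in zip(lm_dist, human_dist):
        cdf_lm += p
        cdf_h += q
        wasserstein += abs(cdf_lm - cdf_h)  # unit spacing between choices
    # max distance is K-1 (all mass at opposite extremes)
    return 1.0 - wasserstein / (len(lm_dist) - 1)
```

Identical distributions score 1.0; all mass at opposite extremes scores 0.0.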
We introduce WikiLingua, a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems. We extract article and summary pairs in 18 languages from WikiHow, a high-quality, collaborative resource of how-to guides on a diverse set of topics written by human authors. We create gold-standard article-summary alignments across languages by aligning the images that are used to describe each step in an article. As a set of baselines for further studies, we evaluate the performance of existing methods on our...
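The image-based alignment described above can be sketched as a join on shared image identifiers: steps in two language versions of the same WikiHow article that are illustrated by the same image are treated as parallel. A minimal sketch (function and field names are hypothetical, not WikiLingua's actual schema):

```python
def align_steps_by_image(steps_en, steps_xx):
    """steps_*: lists of (image_id, step_text) for the same how-to article
    in two languages; returns (en_text, xx_text) pairs sharing an image."""
    xx_by_image = {image_id: text for image_id, text in steps_xx}
    return [
        (text, xx_by_image[image_id])
        for image_id, text in steps_en
        if image_id in xx_by_image
    ]
```

Steps without a shared image simply drop out of the alignment, which keeps only high-precision pairs.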
We incorporate an explicit neural interlingua into a multilingual encoder-decoder neural machine translation (NMT) architecture. We demonstrate that our model learns a language-independent representation by performing direct zero-shot translation (without using pivot translation), and by using the source sentence embeddings to create an English Yelp review classifier that, through the mediation of the interlingua, can also classify French and German reviews. Furthermore, we show that, despite using a smaller number of parameters than a pairwise collection...
Despite recent progress in abstractive summarization, systems still suffer from faithfulness errors. While prior work has proposed models that improve faithfulness, it is unclear whether the improvement comes from an increased level of extractiveness of the model outputs, as one naive way to make summarization systems more faithful is to make them more extractive. In this work, we present a framework for evaluating the effective faithfulness of summarization systems, by generating a faithfulness-abstractiveness trade-off curve that serves as a control at different operating points on...
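Extractiveness of a summary relative to its source, the quantity the trade-off curve controls for, can be approximated in several ways; a crude unigram copy-rate proxy illustrates the idea (published work typically uses extractive-fragment statistics instead):

```python
def copy_rate(article, summary):
    """Crude extractiveness proxy: the fraction of summary tokens that
    also occur anywhere in the article."""
    article_vocab = set(article.lower().split())
    summary_tokens = summary.lower().split()
    return sum(tok in article_vocab for tok in summary_tokens) / len(summary_tokens)
```

A fully copied summary scores 1.0; a summary sharing no words with the source scores 0.0.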
Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models. First, we taxonomize the vast space of potential scenarios (i.e., use cases) and metrics (i.e., desiderata) that are of interest for LMs. Then we select a broad subset based on coverage and feasibility, noting what's missing or underrepresented (e.g., question answering for neglected English...
We introduce GEM, a living benchmark for natural language Generation (NLG), its Evaluation, and Metrics. Measuring progress in NLG relies on a constantly evolving ecosystem of automated metrics, datasets, and human evaluation standards. Due to this moving target, new models often still evaluate on divergent anglo-centric corpora with well-established, but flawed, metrics. This disconnect makes it challenging to identify the limitations of current models and opportunities for progress. Addressing this limitation, GEM provides...
Many real-world applications of language models (LMs), such as writing assistance and code autocomplete, involve human-LM interaction. However, most benchmarks are non-interactive in that a model produces output without human involvement. To evaluate human-LM interaction, we develop a new framework, Human-AI Language-based Interaction Evaluation (HALIE), that defines the components of interactive systems and the dimensions to consider when designing evaluation metrics. Compared to standard, non-interactive evaluation, HALIE captures (i)...
Selecting the "right" amount of information to include in a summary is a difficult task. A good summary should be detailed and entity-centric without being overly dense and hard to follow. To better understand this tradeoff, we solicit increasingly dense GPT-4 summaries with what we refer to as a "Chain of Density" (CoD) prompt. Specifically, GPT-4 generates an initial entity-sparse summary before iteratively incorporating missing salient entities without increasing the length. Summaries generated by CoD are more abstractive, exhibit more fusion, and have...
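The CoD procedure is a simple iterative prompting loop. A minimal sketch, where `llm` is a stand-in for any GPT-4-style text-completion call and the prompt wording is paraphrased rather than the paper's exact prompt:

```python
def chain_of_density(article, llm, rounds=5):
    """Produce a chain of increasingly dense summaries of `article`.
    `llm(prompt) -> str` is a placeholder for a GPT-4-style API call."""
    # Start from an entity-sparse summary...
    summaries = [llm(f"Write a short, entity-sparse summary of:\n{article}")]
    # ...then repeatedly fold in missing entities at constant length.
    for _ in range(rounds - 1):
        summaries.append(llm(
            "Identify 1-3 salient entities from the article that are missing "
            "from the summary, then rewrite the summary to include them "
            "without increasing its length.\n"
            f"Article:\n{article}\nSummary:\n{summaries[-1]}"
        ))
    return summaries
```

Each round trades some filler text for additional entities, which is what makes the later summaries denser.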
Faisal Ladhak, Esin Durmus, Mirac Suzgun, Tianyi Zhang, Dan Jurafsky, Kathleen McKeown, Tatsunori Hashimoto. Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics. 2023.
We present a new summarization task, generating summaries of novel chapters using summary/chapter pairs from online study guides. This is a harder task than news summarization, given the chapter length as well as the extreme paraphrasing and generalization found in the summaries. We focus on extractive summarization, which requires the creation of a gold-standard set of extracts and a metric for aligning reference summary sentences with chapter sentences to create gold extracts; we also experiment with different alignment methods. Our experiments demonstrate significant...
Sebastian Gehrmann, Abhik Bhattacharjee, Abinaya Mahendiran, Alex Wang, Alexandros Papangelis, Aman Madaan, Angelina McMillan-Major, Anna Shvets, Ashish Upadhyay, Bernd Bohnet, Bingsheng Yao, Bryan Wilie, Chandra Bhagavatula, Chaobin You, Craig Thomson, Cristina Garbacea, Dakuo Wang, Daniel Deutsch, Deyi Xiong, Di Jin, Dimitra Gkatzia, Dragomir Radev, Elizabeth Clark, Esin Durmus, Faisal Ladhak, Filip Ginter, Genta Indra Winata, Hendrik Strobelt, Hiroaki Hayashi, Jekaterina Novikova, Jenna...
Research in the social sciences and psychology has shown that the persuasiveness of an argument depends not only on the language employed, but also on attributes of the source/communicator, the audience, and the appropriateness and strength of the argument's claims given the pragmatic and discourse context of the argument. Among these characteristics of persuasive arguments, prior work in NLP does not explicitly investigate the effect of the pragmatic and discourse context when determining argument quality. This paper presents a new dataset to initiate the study of this aspect of argumentation: it consists of a diverse...
Encoder-only transformer models such as BERT offer a great performance-size tradeoff for retrieval and classification tasks with respect to larger decoder-only models. Despite being the workhorse of numerous production pipelines, there have been limited Pareto improvements to BERT since its release. In this paper, we introduce ModernBERT, bringing modern model optimizations to encoder-only models and representing a major Pareto improvement over older encoders. Trained on 2 trillion tokens with a native 8192 sequence length,...
Model-based, reference-free evaluation metrics have been proposed as a fast and cost-effective approach to evaluate Natural Language Generation (NLG) systems. Despite promising recent results, we find evidence that reference-free evaluation metrics of summarization and dialog generation may be relying on spurious correlations with measures such as word overlap, perplexity, and length. We further observe that for text summarization, these metrics have high error rates when ranking current state-of-the-art abstractive...
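One way to probe for such spurious correlations is to measure rank correlation between a metric's scores and a shallow quantity like summary length. A minimal sketch of a Spearman correlation (no tie handling), not the paper's actual analysis code:

```python
def spearman(xs, ys):
    """Spearman rank correlation between two sequences (assumes no ties)."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, idx in enumerate(order):
            r[idx] = rank
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mean = (n - 1) / 2                      # mean of ranks 0..n-1
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)  # rank variance; same for ry
    return cov / var
```

A reference-free metric whose scores track length this closely, independent of summary quality, is a candidate for relying on a spurious correlation.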
Kasturi Bhattacharjee, Miguel Ballesteros, Rishita Anubhai, Smaranda Muresan, Jie Ma, Faisal Ladhak, Yaser Al-Onaizan. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020.
Badr AlKhamissi, Faisal Ladhak, Srinivasan Iyer, Veselin Stoyanov, Zornitsa Kozareva, Xian Li, Pascale Fung, Lambert Mathias, Asli Celikyilmaz, Mona Diab. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022.
Systems for automatic argument generation and debate require the ability to (1) determine the stance of any claims employed in the argument and (2) assess the specificity of each claim relative to the argument context. Existing work on understanding claim stance, however, has been limited to the study of argumentative structures that are relatively shallow, most often consisting of a single claim that directly supports or opposes the argument thesis. In this paper, we tackle these tasks in the context of complex arguments on a diverse set of topics. In particular, our dataset consists of manually...