- Topic Modeling
- Natural Language Processing Techniques
- Multimodal Machine Learning Applications
- Social Media and Politics
- Hate Speech and Cyberbullying Detection
- Sentiment Analysis and Opinion Mining
- Speech and dialogue systems
- Misinformation and Its Impacts
- Online Learning and Analytics
- Text Readability and Simplification
- Wikis in Education and Collaboration
- Explainable Artificial Intelligence (XAI)
- Speech Recognition and Synthesis
- Computational and Text Analysis Methods
- Complex Network Analysis Techniques
- Recommender Systems and Techniques
- Advanced Text Analysis Techniques
- Software Engineering Research
- Mental Health via Writing
- Domain Adaptation and Few-Shot Learning
- Adversarial Robustness in Machine Learning
- Innovative Teaching and Learning Methods
- Text and Document Classification Technologies
- Online and Blended Learning
- Opinion Dynamics and Social Influence
Stanford University
2022-2025
Georgia Institute of Technology
2019-2023
University of Illinois Urbana-Champaign
2023
Amazon (United States)
2023
Laboratoire d'Informatique de Paris-Nord
2023
Google (United States)
2023
Harvard University Press
2023
University of Washington
2023
Dartmouth Hospital
2023
Harvard University
2023
Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, Eduard Hovy. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2016.
We propose a novel data augmentation approach to enhance computational behavioral analysis using social media text. In particular, we collect a Twitter corpus of the descriptions of annoying behaviors using the #petpeeve hashtags. In the qualitative analysis, we study the language use in these tweets, with a special focus on the fine-grained categories and the geographic variation of the language. In the quantitative analysis, we show that lexical and syntactic features are useful for automatic categorization of annoying behaviors, and frame-semantic features further boost...
Spurred by advancements in scale, large language models (LLMs) have demonstrated the ability to perform a variety of natural language processing (NLP) tasks zero-shot, i.e., without adaptation on downstream data. Recently, the debut of ChatGPT has drawn a great deal of attention from the NLP community due to the fact that it can generate high-quality responses to human input and self-correct previous mistakes based on subsequent conversations. However, it is not yet known whether ChatGPT can serve as a generalist model that can perform many NLP tasks zero-shot. In this...
This paper presents MixText, a semi-supervised learning method for text classification, which uses our newly designed data augmentation method called TMix. TMix creates a large amount of augmented training samples by interpolating text in hidden space. Moreover, we leverage recent advances in data augmentation to guess low-entropy labels for unlabeled data, hence making them as easy to use as labeled data. By mixing labeled, unlabeled, and augmented data, MixText significantly outperformed current pre-trained and fine-tuned models and other state-of-the-art semi-supervised learning methods on...
Humor is an essential component in personal communication. How to create computational models to discover the structures behind humor, recognize humor, and even extract humor anchors remains a challenge. In this work, we first identify several semantic structures behind humor and design sets of features for each structure, and next employ a computational approach to recognize humor. Furthermore, we develop a simple and effective method to extract anchors that enable humor in a sentence. Experiments conducted on two datasets demonstrate that our humor recognizer is effective in automatically distinguishing between humorous...
Large language models (LLMs) are capable of successfully performing many language processing tasks zero-shot (without training data). If zero-shot LLMs can also reliably classify and explain social phenomena like persuasiveness and political ideology, then LLMs could augment the computational social science (CSS) pipeline in important ways. This work provides a road map for using LLMs as CSS tools. Towards this end, we contribute a set of prompting best practices and an extensive evaluation pipeline to measure the zero-shot performance of 13 language models on 25...
Ankur Parikh, Xuezhi Wang, Sebastian Gehrmann, Manaal Faruqui, Bhuwan Dhingra, Diyi Yang, Dipanjan Das. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020.
Sebastian Gehrmann, Tosin Adewumi, Karmanya Aggarwal, Pawan Sasanka Ammanamanchi, Anuoluwapo Aremu, Antoine Bosselut, Khyathi Raghavi Chandu, Miruna-Adriana Clinciu, Dipanjan Das, Kaustubh Dhole, Wanyu Du, Esin Durmus, Ondřej Dušek, Chris Chinenye Emezue, Varun Gangal, Cristina Garbacea, Tatsunori Hashimoto, Yufang Hou, Yacine Jernite, Harsh Jhamtani, Yangfeng Ji, Shailza Jolly, Mihir Kale, Dhruv Kumar, Faisal Ladhak, Aman Madaan, Mounica Maddela, Khyati Mahajan, Saad Mahamood, Bodhisattwa...
A fundamental goal of scientific research is to learn about causal relationships. However, despite its critical role in the life and social sciences, causality has not had the same importance in Natural Language Processing (NLP), which has traditionally placed more emphasis on predictive tasks. This distinction is beginning to fade, with an emerging area of interdisciplinary research at the convergence of causal inference and language processing. Still, research on causality in NLP remains scattered across domains without unified definitions, benchmark...
Deplatforming refers to the permanent ban of controversial public figures with large followings on social media sites. In recent years, platforms like Facebook, Twitter, and YouTube have deplatformed many influencers to curb the spread of offensive speech. We present a case study of three high-profile influencers who were deplatformed on Twitter: Alex Jones, Milo Yiannopoulos, and Owen Benjamin. Working with over 49M tweets, we found that deplatforming significantly reduced the number of conversations about all three individuals on Twitter. Further,...
Mai ElSherief, Caleb Ziems, David Muchlinski, Vaishnavi Anupindi, Jordyn Seybolt, Munmun De Choudhury, Diyi Yang. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021.
Natural language processing (NLP) applications are now more powerful and ubiquitous than ever before. With rapidly developing (neural) models and ever-more available data, current NLP models have access to more information than any human speaker during their life. Still, it would be hard to argue that NLP models have reached human-level capacity. In this position paper, we argue that the reason for the current limitations is a focus on information content while ignoring language's social factors. We show that current NLP systems systematically break down when faced with...
NLP has achieved great progress in the past decade through the use of neural models and large labeled datasets. The dependence on abundant data prevents NLP models from being applied to low-resource settings or novel tasks where significant time, money, or expertise is required to label massive amounts of textual data. Recently, data augmentation methods have been explored as a means of improving data efficiency in NLP. To date, there has been no systematic empirical overview of data augmentation for NLP in the limited labeled data setting, making it difficult to understand...
Twitter enables an online public sphere for social movement actors, news organizations, and others to frame climate change and the climate movement. In this paper, we analyze five million English tweets posted from 2018 to 2021, demonstrating how peaks in Twitter activity relate to key events and how the framing of the climate strike discourse has evolved over the past three years. We also collected 30,000 news articles from major news sources in English-speaking countries (Australia, Canada, United States, United Kingdom) to demonstrate how actors and media differ in their framing of the issue,...
The study of network robustness is a critical tool in the characterization and sense making of complex interconnected systems such as infrastructure, communication, and social networks. While significant research has been conducted in all of these areas, gaps in the surveying literature still exist. Answers to key questions are currently scattered across multiple scientific fields and numerous papers. In this survey, we distill key findings across numerous domains and provide researchers crucial access to important information by (1) summarizing...
In this paper, we explore student dropout behavior in a Massively Open Online Course (MOOC). We use a survival model to measure the impact of three social factors that make predictions about attrition along the way for students who have participated in the course discussion forum.
Texts like news, encyclopedias, and some social media strive for objectivity. Yet bias in the form of inappropriate subjectivity (introducing attitudes via framing, presupposing truth, and casting doubt) remains ubiquitous. This kind of bias erodes our collective trust and fuels social conflict. To address this issue, we introduce a novel testbed for natural language generation: automatically bringing inappropriately subjective text into a neutral point of view (“neutralizing” biased text). We also offer the first parallel...
Text summarization is one of the most challenging and interesting problems in NLP. Although much attention has been paid to summarizing structured text like news reports or encyclopedia articles, summarizing conversations, an essential part of human-human/machine interaction where the most important pieces of information are scattered across various utterances of different speakers, remains relatively under-investigated. This work proposes a multi-view sequence-to-sequence model by first extracting conversational...
Thousands of students enroll in Massive Open Online Courses (MOOCs) to seek opportunities for learning and self-improvement. However, the learning process often involves struggles with confusion, which may have an adverse effect on the course participation experience, leading to dropout along the way. In this paper, we quantify that effect. We describe a classification model using discussion forum behavior and clickstream data to automatically identify posts that express confusion. We then apply survival analysis to quantify the impact...
While data from Massive Open Online Courses (MOOCs) offers the potential to gain new insights into the ways in which online communities can contribute to student learning, much of the richness of the data trace is still yet to be mined. In particular, very little work has attempted fine-grained content analyses of the student interactions in MOOCs. Survey research indicates the importance of student goals and intentions in keeping them involved in a MOOC over time. Automated fine-grained content analyses offer the potential to detect and monitor evidence of student engagement and how it relates to other aspects of their...
People with health concerns go to online health support groups to obtain help and advice. To do so, they frequently disclose personal details, many times in public. Although research in non-health settings suggests that people self-disclose less in public than in private, this pattern may not apply to health support groups where people want to get relevant help. Our work examines how the use of private and public channels influences members' self-disclosure in an online cancer support group, and how channels moderate the influence of self-disclosure on reciprocity and receiving support. By automatically measuring...
Participants in online communities often enact different roles when participating in their communities. For example, some members of cancer support communities specialize in providing disease-related information or socializing new members. This work clusters the behavioral patterns of users of a cancer support community into specific functional roles. Based on a series of quantitative and qualitative evaluations, this research identified eleven roles that members occupy, such as welcomer and story sharer. We investigated role dynamics, including how...
The spread of COVID-19 has sparked racism and hate on social media targeted towards Asian communities. However, little is known about how racial hate spreads during a pandemic and the role of counterspeech in mitigating this spread. In this work, we study the evolution and spread of anti-Asian hate speech through the lens of Twitter. We create COVID-HATE, the largest dataset of anti-Asian hate and counterspeech spanning 14 months, containing over 206 million tweets, and a social network with over 127 million nodes. By creating a novel hand-labeled dataset of 3,355 tweets, we train a text classifier to identify hateful tweets...