- Topic Modeling
- Software Engineering Research
- Natural Language Processing Techniques
- Explainable Artificial Intelligence (XAI)
- Ethics and Social Impacts of AI
- Multimodal Machine Learning Applications
- Scientific Computing and Data Management
- Software Engineering Techniques and Practices
- Data Visualization and Analytics
- Mobile Crowdsensing and Crowdsourcing
- Big Data and Business Intelligence
- Adversarial Robustness in Machine Learning
- Artificial Intelligence in Healthcare and Education
- AI in Service Interactions
- Software System Performance and Reliability
- Semantic Web and Ontologies
- Software Testing and Debugging Techniques
- Text and Document Classification Technologies
- Speech and Dialogue Systems
- Data Quality and Management
- Digital Games and Media
- Spreadsheets and End-User Computing
- Misinformation and Its Impacts
- Consumer Market Behavior and Pricing
- Information Retrieval and Search Behavior
- Carnegie Mellon University (2022-2025)
- University of Washington (2019-2023)
- Administration for Community Living (2023)
- Tokyo Institute of Technology (2023)
- IT University of Copenhagen (2023)
- American Jewish Committee (2023)
- Mongolia International University (2023)
- RIKEN Center for Advanced Intelligence Project (2023)
- Microsoft (United States) (2022)
- University of Notre Dame (2022)
Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors. Inspired by principles of behavioral testing in software engineering, we introduce CheckList, a task-agnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation, as...
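As a loose illustration of this behavioral-testing idea (a sketch, not CheckList's actual API), an invariance test perturbs inputs in label-preserving ways and flags any prediction change; `predict_sentiment` below is a hypothetical stand-in for whatever model is under test:

```python
# Minimal sketch of a CheckList-style invariance (INV) test.
# `predict_sentiment` is a hypothetical placeholder model; the real
# CheckList library offers richer templating and several test types.

def predict_sentiment(text: str) -> str:
    """Placeholder model: replace with a real classifier."""
    return "negative" if "bad" in text.lower() else "positive"

def invariance_test(pairs):
    """Predictions should not change under label-preserving perturbations."""
    failures = []
    for original, perturbed in pairs:
        if predict_sentiment(original) != predict_sentiment(perturbed):
            failures.append((original, perturbed))
    return failures

# Capability probed here: robustness to irrelevant changes (name swaps,
# innocuous suffixes) that should leave the label untouched.
pairs = [
    ("Amy had a bad flight to Boston.", "Zoe had a bad flight to Boston."),
    ("The food was great.", "The food was great, by the way."),
]
print(invariance_test(pairs))  # non-empty list => failing test cases
```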
Many researchers motivate explainable AI with studies showing that human-AI team performance on decision-making tasks improves when the AI explains its recommendations. However, prior studies observed improvements from explanations only when the AI, alone, outperformed both the human and the best team. Can explanations help lead to complementary performance, where team accuracy is higher than either the human or the AI working solo? We conduct mixed-method user studies on three datasets, where an AI with accuracy comparable to humans helps participants solve a task (explaining itself in...
Although large language models (LLMs) have demonstrated impressive potential on simple tasks, their breadth of scope, lack of transparency, and insufficient controllability can make them less effective when assisting humans on more complex tasks. In response, we introduce the concept of Chaining LLM steps together, where the output of one step becomes the input for the next, thus aggregating the gains per step. We first define a set of LLM primitive operations useful for Chain construction, then present an interactive system...
Tongshuang Wu, Marco Tulio Ribeiro, Jeffrey Heer, Daniel Weld. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021.
While LLMs have made it possible to rapidly prototype new ML functionalities, many real-world applications involve complex tasks that cannot be easily handled via a single run of an LLM. Recent work has found that chaining multiple LLM runs together (with the output of one step being the input to the next) can help users accomplish these more complex tasks, and in a way that is perceived to be more transparent and controllable. However, it remains unknown what users need when authoring their own LLM chains – a key step to lowering the barriers for non-AI-experts...
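A minimal sketch of the chaining idea described in the two abstracts above, assuming a hypothetical `call_llm` wrapper around any text-completion API (the papers' systems add richer primitive operations and interactive authoring on top of this basic pattern):

```python
# Chaining LLM steps: the output of one step becomes the input of the
# next, so each call handles a smaller, more tractable sub-task.

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for an actual LLM API call."""
    raise NotImplementedError("wire up your LLM client here")

def run_chain(review: str) -> str:
    # Step 1: decompose the complex task (extract discrete complaints).
    points = call_llm(f"List the main complaints in this review:\n{review}")
    # Step 2: feed step 1's output forward (propose one fix per complaint).
    fixes = call_llm(f"Suggest one fix per complaint:\n{points}")
    # Step 3: aggregate the intermediate results into the final response.
    return call_llm(f"Write a polite reply incorporating these fixes:\n{fixes}")
```

Because each intermediate output is visible, a user can inspect and edit the chain mid-way, which is the transparency and controllability benefit both abstracts describe.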
Despite its benefits for children's skill development and parent-child bonding, many parents do not often engage in interactive storytelling by having story-related dialogues with their child due to limited availability or challenges in coming up with appropriate questions. While recent advances have made the automatic generation of questions from stories possible, the fully-automated approach excludes parent involvement, disregards educational goals, and underoptimizes for engagement. Informed by need-finding interviews...
Sensemaking in unfamiliar domains can be challenging, demanding considerable user effort to compare different options with respect to various criteria. Prior research and our formative study found that people would benefit from reading an overview of the information space upfront, including the criteria that others previously found useful. However, existing sensemaking tools struggle with the "cold-start" problem: it not only requires significant input from previous users to generate and share these overviews, but such overviews...
Unlike traditional unsupervised clustering, semi-supervised clustering allows users to provide meaningful structure to the data, which helps the clustering algorithm match the user's intent. Existing approaches require a significant amount of feedback from an expert to improve the clusters. In this paper, we ask whether a large language model (LLM) can amplify an expert's guidance to enable query-efficient, few-shot semi-supervised text clustering. We show that LLMs are surprisingly effective at improving clustering. We explore three stages where LLMs can be...
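One plausible way an LLM could amplify expert guidance before clustering even starts, sketched under assumptions (the `llm_keyphrases` call and the TF-IDF/KMeans pipeline are illustrative stand-ins, not the paper's exact recipe):

```python
# Sketch: expand each document with LLM-generated keyphrases before
# embedding, so the resulting clusters align better with the expert's
# intent without per-document expert labels.

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def llm_keyphrases(doc: str) -> str:
    """Hypothetical LLM call returning intent-relevant keyphrases."""
    raise NotImplementedError("replace with a real LLM query")

def cluster_with_llm_hints(docs, k):
    # Append LLM hints to each document, then embed and cluster cheaply.
    expanded = [f"{d} || {llm_keyphrases(d)}" for d in docs]
    X = TfidfVectorizer().fit_transform(expanded)
    return KMeans(n_clusters=k, n_init=10).fit_predict(X)
```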
Though error analysis is crucial to understanding and improving NLP models, the common practice of manual, subjective categorization of a small sample of errors can yield biased and incomplete conclusions. This paper codifies model- and task-agnostic principles for informative error analysis, and presents Errudite, an interactive tool for better supporting this process. First, error groups should be precisely defined for reproducibility; Errudite supports this with an expressive domain-specific language. Second, to avoid spurious...
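To make the first principle concrete, here is a toy version of precisely defined, composable error groups in plain Python; this is not Errudite's actual domain-specific language, only the reproducibility idea it supports:

```python
# Error groups as named, composable predicates over examples, so the
# same analysis can be re-run exactly instead of re-sampled by hand.

def length_over(n):
    return lambda ex: len(ex["question"].split()) > n

def is_wrong(ex):
    return ex["prediction"] != ex["gold"]

def group(examples, *predicates):
    """Return the subset of examples matching every predicate."""
    return [ex for ex in examples if all(p(ex) for p in predicates)]

examples = [
    {"question": "who wrote hamlet", "prediction": "Marlowe", "gold": "Shakespeare"},
    {"question": "capital of france", "prediction": "Paris", "gold": "Paris"},
]
long_errors = group(examples, is_wrong, length_over(2))
print(len(long_errors), "errors in the 'long question' group")
```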
Automatically generated explanations of how machine learning (ML) models reason can help users understand and accept them. However, explanations can have unintended consequences: promoting over-reliance or undermining trust. This paper investigates how explanations shape users' perceptions of ML models with and without the ability to provide feedback to them: (1) does revealing model flaws increase users' desire to "fix" them; (2) does providing explanations cause users to believe - wrongly - that models are introspective, and will thus improve over time. Through two controlled experiments...
Controlled text perturbation is useful for evaluating and improving model generalizability. However, current techniques rely on training a model for every target perturbation, which is expensive and hard to generalize. We present Tailor, a semantically-controlled text generation system. Tailor builds on a pretrained seq2seq model and produces textual outputs conditioned on control codes derived from semantic representations. We craft a set of operations to modify the control codes, which in turn steer generation towards targeted attributes. These operations can be further...
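A toy sketch of the control-code input format such a system might use (the codes and bracket syntax here are assumptions for illustration; Tailor derives its codes from semantic representations and decodes with a trained seq2seq model):

```python
# Sketch: prepend control codes to the source text so a seq2seq model
# can condition generation on them; editing the codes steers the output.

def build_input(control_codes: dict, text: str) -> str:
    """Serialize control codes as a header the model is trained to read."""
    header = " ".join(f"[{k}:{v}]" for k, v in control_codes.items())
    return f"{header} {text}"

# E.g., request a passive-voice perturbation that keeps the agent.
src = build_input({"VOICE": "passive", "AGENT": "keep"},
                  "The committee approved the proposal.")
print(src)  # "[VOICE:passive] [AGENT:keep] The committee approved ..."
# A model fine-tuned on such pairs might decode:
# "The proposal was approved by the committee."
```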
A key challenge to visualization authoring is the process of getting familiar with the complex user interfaces of authoring tools. A Natural Language Interface (NLI) presents promising benefits due to its learnability and usability. However, supporting NLIs for authoring tools requires expertise in natural language processing, while existing NLIs are mostly designed for the visual analytic workflow. In this paper, we propose an authoring-oriented NLI pipeline by introducing a structured representation of users' visualization editing intents, called...
Natural language generation has witnessed significant advancements due to the training of large models on vast internet-scale datasets. Despite these advancements, there exists a critical challenge: these models can inadvertently generate content that is toxic, inaccurate, and unhelpful, and existing automatic evaluation metrics often fall short of identifying these shortcomings. As models become more capable, human feedback is an invaluable signal for evaluating and improving models. This survey aims to provide...
Humans possess an extraordinary ability to create and utilize tools, allowing them to overcome physical limitations and explore new frontiers. With the advent of foundation models, AI systems have the potential to be equally adept in tool use as humans. This paradigm, i.e., tool learning with foundation models, combines the strengths of specialized tools and foundation models to achieve enhanced accuracy, efficiency, and automation in problem-solving. Despite its immense potential, there is still a lack of a comprehensive understanding of key challenges,...
One widely cited barrier to the adoption of LLMs as proxies for humans in subjective tasks is their sensitivity to prompt wording, but interestingly, humans also display sensitivities to instruction changes in the form of response biases. We investigate the extent to which LLMs reflect human response biases, if at all. We look at survey design, where human response biases caused by changes in the wordings of "prompts" have been extensively explored in the social psychology literature. Drawing from these works, we design a dataset and framework to evaluate whether LLMs exhibit...
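A bare-bones sketch of such an evaluation loop, with `ask_llm` as a hypothetical model call (the paper's dataset and framework are far more systematic than this illustration):

```python
# Sketch: present the same survey item under an original and a
# bias-inducing wording, then compare the model's answer distributions.

from collections import Counter

def ask_llm(prompt: str) -> str:
    """Hypothetical LLM call returning one multiple-choice answer."""
    raise NotImplementedError("replace with a real LLM query")

def response_shift(original: str, modified: str, n: int = 50):
    """Sample n answers per wording and tally them for comparison."""
    orig = Counter(ask_llm(original) for _ in range(n))
    mod = Counter(ask_llm(modified) for _ in range(n))
    return orig, mod

# E.g., acquiescence bias: does rephrasing an item into an agree/disagree
# form push answers toward "agree", as it does for human respondents?
```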
The race to train language models on vast, diverse, and inconsistently documented datasets raises pressing legal and ethical concerns. To improve data transparency and understanding, we convene a multi-disciplinary effort between legal and machine learning experts to systematically audit and trace more than 1,800 text datasets. We develop tools and standards to trace the lineage of these datasets, including their source, creators, licenses, and subsequent use. Our landscape analysis highlights sharp divides in the composition...
Ying Xu, Dakuo Wang, Mo Yu, Daniel Ritchie, Bingsheng Yao, Tongshuang Wu, Zheng Zhang, Toby Li, Nora Bradford, Branda Sun, Tran Hoang, Yisi Sang, Yufang Hou, Xiaojuan Ma, Diyi Yang, Nanyun Peng, Zhou Yu, Mark Warschauer. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022.
Efficiently reviewing scholarly literature and synthesizing prior art are crucial for scientific progress. Yet, the growing scale of publications and the burden of knowledge make synthesis of research threads more challenging than ever. While significant research has been devoted to helping scholars interact with individual papers, building research threads scattered across multiple papers remains a challenge. Most top-down synthesis (and LLMs) make it difficult to personalize and iterate on the output, while bottom-up synthesis is costly in time and effort. Here, we...
Despite a surge in the collection of XAI methods, users still struggle to obtain the AI explanations they require. Previous research suggests chatbots as dynamic solutions, but the effective design of conversational XAI agents for practical human needs remains under-explored. This paper focuses on Conversational XAI for AI-assisted scientific writing tasks. Drawing from linguistic theories and formative studies, we identify four design rationales: "multifaceted", "controllability", "mix-initiative", and "context-aware drill-down"...
AI tools are increasingly deployed in community contexts. However, the datasets used to evaluate AI are typically created by developers and annotators outside a given community, which can yield misleading conclusions about AI performance. How might we empower communities to drive the intentional design and curation of evaluation datasets for the AI that impacts them? We investigate this question on Wikipedia, an online community with multiple AI-based content moderation systems deployed. We introduce Wikibench, a system that enables communities to collaboratively curate...
Prompting LLMs for complex tasks (e.g., building a trip advisor chatbot) requires humans to clearly articulate customized requirements (e.g., "start the response with tl;dr"). However, existing prompt engineering instructions often lack focused training on requirement articulation and instead tend to emphasize increasingly automatable strategies and tricks like adding role-plays and "think step-by-step". To address this gap, we introduce Requirement-Oriented Prompt Engineering (ROPE), a paradigm that focuses human...
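For illustration, a requirement-oriented prompt can simply make every customized requirement explicit and checkable; the template below is an assumption for illustration, not ROPE's actual training material:

```python
# Sketch: requirements stated explicitly, one per line, rather than
# folded into generic prompting tricks. Each line is easy to verify
# against the model's output afterwards.

REQUIREMENTS = [
    "Start the response with 'tl;dr'.",
    "Recommend at most three destinations.",
    "Ask a clarifying question if the budget is missing.",
]

def build_prompt(task: str, user_input: str) -> str:
    reqs = "\n".join(f"- {r}" for r in REQUIREMENTS)
    return f"{task}\n\nRequirements:\n{reqs}\n\nUser: {user_input}"

print(build_prompt("You are a trip advisor chatbot.",
                   "Where should I go in June?"))
```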
Existing question answering (QA) techniques are created mainly to answer questions asked by humans. But in educational applications, teachers often need to decide what questions they should ask, in order to help students improve their narrative understanding capabilities. We design an automated question-answer generation (QAG) system for this education scenario: given a story book at the kindergarten to eighth-grade level as input, our system can automatically generate QA pairs that are capable of testing a variety of dimensions...
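A rough sketch of one possible QAG pipeline shape, generating a question for each extracted answer candidate; both helper functions are hypothetical placeholders rather than the paper's trained modules:

```python
# Sketch: answer-first QAG. Step 1 picks answer-worthy spans from the
# story; step 2 generates a question conditioned on each answer.

def extract_answer_candidates(story: str) -> list[str]:
    """Hypothetical step 1: pick spans (characters, events, causes)."""
    raise NotImplementedError("replace with a trained extractor")

def generate_question(story: str, answer: str) -> str:
    """Hypothetical step 2: generate a question whose answer is `answer`."""
    raise NotImplementedError("replace with a trained generator")

def qag(story: str) -> list[tuple[str, str]]:
    return [(generate_question(story, a), a)
            for a in extract_answer_candidates(story)]
```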
Tools for Interactive Machine Learning (IML) enable end users to update models in a "rapid, focused, and incremental", yet local, manner. In this work, we study the question of local decision making in an IML context around feature selection for a sentiment classification task. Specifically, we characterize the utility of interactive feature selection through a combination of human-subjects experiments and computational simulations. We find that, in expectation, local feature modification fails to improve model performance and may hamper generalization due...
In this paper, we present a novel visual analytics system called NameClarifier to interactively disambiguate author names in publications by keeping humans in the loop. Specifically, NameClarifier quantifies and visualizes the similarities between ambiguous names and those that have been confirmed in digital libraries. The similarities are calculated using three key factors, namely, co-authorships, publication venues, and temporal information. Our system estimates all possible allocations, and then provides visual cues to help users validate every ambiguous case. By...
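For intuition, the three factors could be combined into a single weighted similarity score like the one below; the weights and Jaccard overlaps are illustrative assumptions, not NameClarifier's exact formula:

```python
# Sketch: score two publication records attributed to an ambiguous name
# using co-authorship overlap, venue overlap, and temporal proximity.

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def name_similarity(p1: dict, p2: dict,
                    w_coauthor=0.5, w_venue=0.3, w_time=0.2) -> float:
    time_gap = abs(p1["year"] - p2["year"])
    return (w_coauthor * jaccard(p1["coauthors"], p2["coauthors"])
            + w_venue * jaccard(p1["venues"], p2["venues"])
            + w_time * 1.0 / (1.0 + time_gap))

amb = {"coauthors": {"J. Heer"}, "venues": {"CHI"}, "year": 2021}
conf = {"coauthors": {"J. Heer", "D. Weld"}, "venues": {"CHI"}, "year": 2022}
print(round(name_similarity(amb, conf), 3))  # higher => likelier same person
```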