- Topic Modeling
- Natural Language Processing Techniques
- Multimodal Machine Learning Applications
- Explainable Artificial Intelligence (XAI)
- Adversarial Robustness in Machine Learning
- Software Engineering Research
- Machine Learning and Data Classification
- Anomaly Detection Techniques and Applications
- Smart Grid Security and Resilience
- Domain Adaptation and Few-Shot Learning
- Microgrid Control and Optimization
- Text Readability and Simplification
- Smart Grid Energy Management
- Advanced Text Analysis Techniques
- Software System Performance and Reliability
- Artificial Intelligence in Healthcare and Education
- Speech and Dialogue Systems
- Machine Learning and Algorithms
- Mobile Crowdsensing and Crowdsourcing
- Intelligent Tutoring Systems and Adaptive Learning
- Text and Document Classification Technologies
- Advanced Malware Detection Techniques
- Sentiment Analysis and Opinion Mining
- Semantic Web and Ontologies
- Educational Technology and Assessment
- Google (United States) (2024)
- Arizona State University (2020-2023)
- Allen Institute (2023)
- Johns Hopkins University (2023)
- University of Washington (2023)
- Mineral Products Association (2022)
- Decision Systems (United States) (2021)
- Hong Kong University of Science and Technology (2020)
- University of Hong Kong (2020)
- Carleton College (2020)
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, Hannaneh Hajishirzi. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023.
Humans (e.g., crowdworkers) have a remarkable ability in solving different tasks, by simply reading textual instructions that define them and looking at a few examples. Despite the success of conventional supervised learning on individual datasets, such models often struggle with generalization across tasks (e.g., a question-answering system cannot solve classification tasks). A long-standing challenge in AI is to build a model that learns a new task by understanding the human-readable instructions that define it. To study this, we introduce...
Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Atharva Naik, Arjun Ashok, Arut Selvan Dhanasekaran, Anjana Arunkumar, David Stap, Eshaan Pathak, Giannis Karamanolakis, Haizhi Lai, Ishan Purohit, Ishani Mondal, Jacob Anderson, Kirby Kuznia, Krima Doshi, Kuntal Kumar Pal, Maitreya Patel, Mehrad Moradshahi, Mihir Parmar, Mirali Purohit, Neeraj Varshney, Phani Rohitha Kaza, Pulkit Verma, Ravsehaj Singh Puri, Rushang Karia, Savan Doshi, Shailaja Keyur Sampat, Siddhartha...
Swaroop Mishra, Arindam Mitra, Neeraj Varshney, Bhavdeep Sachdeva, Peter Clark, Chitta Baral, Ashwin Kalyan. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022.
When answering a question, humans utilize the information available across different modalities to synthesize a consistent and complete chain of thought (CoT). This process is normally a black box in the case of deep learning models like large-scale language models. Recently, science question benchmarks have been used to diagnose the multi-hop reasoning ability and interpretability of an AI system. However, existing datasets fail to provide annotations for the answers, or are restricted to a textual-only modality, small scales,...
What kinds of instructional prompts are easier to follow for Language Models (LMs)? We study this question by conducting extensive empirical analysis that sheds light on important features of successful prompts. Specifically, we study several classes of reframing techniques for the manual reformulation of prompts into more effective ones. Some examples include decomposing a complex task instruction into multiple simpler tasks or itemizing instructions into sequential steps. Our experiments compare the zero-shot and few-shot...
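A minimal sketch of one reframing technique named above, itemizing a monolithic instruction into sequential steps; the task wording and steps below are illustrative examples, not prompts from the paper:

```python
def itemize_instruction(task_description, steps):
    """Turn one long instruction into an itemized sequence of simpler steps,
    one of the reframing techniques described in the abstract."""
    lines = [f"Task: {task_description}", "Follow these steps in order:"]
    lines += [f"{i}. {step}" for i, step in enumerate(steps, start=1)]
    return "\n".join(lines)

# Hypothetical example: reframing a reading-comprehension prompt.
prompt = itemize_instruction(
    "Answer the question using the passage.",
    ["Read the passage carefully.",
     "Find the sentence that mentions the question's key entity.",
     "Copy the shortest span from that sentence that answers the question."],
)
print(prompt)
```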
Large "instruction-tuned" language models (i.e., finetuned to respond instructions) have demonstrated a remarkable ability generalize zero-shot new tasks. Nevertheless, they depend heavily on human-written instruction data that is often limited in quantity, diversity, and creativity, therefore hindering the generality of tuned model. We introduce Self-Instruct, framework for improving instruction-following capabilities pretrained by bootstrapping off their own generations. Our pipeline...
Swaroop Mishra, Matthew Finlayson, Pan Lu, Leonard Tang, Sean Welleck, Chitta Baral, Tanmay Rajpurohit, Oyvind Tafjord, Ashish Sabharwal, Peter Clark, Ashwin Kalyan. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022.
In recent years, progress in NLU has been driven by benchmarks. These benchmarks are typically collected via crowdsourcing, where annotators write examples based on annotation instructions crafted by dataset creators. In this work, we hypothesize that annotators pick up on patterns in the crowdsourcing instructions, which bias them to write many similar examples that are then over-represented in the collected data. We study this form of bias, termed instruction bias, in 14 benchmarks, showing that instruction examples often exhibit concrete patterns that are propagated by crowdworkers. This extends previous...
How well can NLP models generalize to a variety of unseen tasks when provided with task instructions? To address this question, we first introduce Super-NaturalInstructions, a benchmark of 1,616 diverse NLP tasks and their expert-written instructions. Our collection covers 76 distinct task types, including but not limited to classification, extraction, infilling, sequence tagging, text rewriting, and composition. This large and diverse collection enables rigorous benchmarking of cross-task generalization under instructions -- training models to follow...
Large Language Models (LLMs) have emerged as a groundbreaking technology with their unparalleled text generation capabilities across various applications. Nevertheless, concerns persist regarding the accuracy and appropriateness of their generated content. A contemporary methodology, self-correction, has been proposed as a remedy to these issues. Building upon this premise, this paper critically examines the role and efficacy of self-correction within LLMs, shedding light on its true potential and limitations. Central to our...
In order to equip NLP systems with 'selective prediction' capability, several task-specific approaches have been proposed. However, which approaches work best across tasks, or even whether they consistently outperform the simplest baseline MaxProb, remains to be explored. To this end, we systematically study selective prediction in a large-scale setup of 17 datasets across several tasks. Through comprehensive experiments under in-domain (IID), out-of-domain (OOD), and adversarial (ADV) settings, we show that despite leveraging...
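For reference, the MaxProb baseline mentioned above can be sketched as follows: answer only when the highest softmax probability clears a confidence threshold, abstain otherwise, and report coverage and risk on the answered subset. The threshold and toy numbers are illustrative assumptions:

```python
import numpy as np

def maxprob_selective_prediction(probs, labels, threshold=0.8):
    """MaxProb selective prediction: abstain when the top softmax
    probability is below `threshold`; return (coverage, risk)."""
    confidence = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    answered = confidence >= threshold
    coverage = answered.mean()
    risk = (predictions[answered] != labels[answered]).mean() if answered.any() else 0.0
    return coverage, risk

# Toy usage with made-up numbers (illustrative only).
probs = np.array([[0.9, 0.1], [0.55, 0.45], [0.2, 0.8]])
labels = np.array([0, 1, 1])
print(maxprob_selective_prediction(probs, labels))
```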
We introduce SELF-DISCOVER, a general framework for LLMs to self-discover the task-intrinsic reasoning structures needed to tackle complex problems that are challenging for typical prompting methods. Core to the framework is a self-discovery process in which LLMs select multiple atomic reasoning modules, such as critical thinking and step-by-step thinking, and compose them into an explicit reasoning structure to follow during decoding. SELF-DISCOVER substantially improves GPT-4 and PaLM 2's performance on challenging reasoning benchmarks such as BigBench-Hard, grounded agent reasoning, and MATH,...
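A two-stage sketch in the spirit of the self-discovery process described above: first ask the model which atomic modules are useful, then ask it to compose them into an explicit structure. The module list, prompt wording, and the `complete` LLM-call stand-in are all illustrative assumptions, not the paper's exact prompts:

```python
ATOMIC_MODULES = [
    "How can I break this problem into smaller steps?",
    "Let's think step by step.",
    "What critical-thinking checks should I apply to my answer?",
]

def self_discover_structure(task_description, complete):
    """Select useful reasoning modules for the task, then compose them into
    an explicit, numbered reasoning structure to follow while solving."""
    select_prompt = (
        f"Task: {task_description}\n"
        "From the list below, pick the reasoning modules that would help:\n"
        + "\n".join(f"- {m}" for m in ATOMIC_MODULES)
    )
    selected = complete(select_prompt)
    compose_prompt = (
        f"Task: {task_description}\n"
        f"Selected modules:\n{selected}\n"
        "Compose these modules into a numbered, step-by-step reasoning "
        "structure to follow when solving the task."
    )
    return complete(compose_prompt)
```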
Recently several datasets have been proposed to encourage research in Question Answering domains where commonsense knowledge is expected to play an important role. Recent language models such as RoBERTa, BERT and GPT that are pre-trained on Wikipedia articles and books have shown reasonable performance with little fine-tuning on several such Multiple Choice Question-Answering (MCQ) datasets. Our goal in this work is to develop methods to incorporate additional (commonsense) knowledge into language model-based approaches for better question-answering...
Single-task models have proven pivotal in solving specific tasks; however, they have limitations in real-world applications where multi-tasking is necessary and domain shifts are exhibited. Recently, instructional prompts have shown significant improvement towards multi-task generalization; however, the effect of Multi-Task Learning (MTL) has not been systematically studied in the biomedical domain. Motivated by this, this paper explores the impact of instructional prompts for biomedical MTL. We introduce BoX, a collection of 32 instruction tasks for Biomedical NLP...
Large Language Models (LMs) have achieved state-of-the-art performance on many Natural Language Processing (NLP) benchmarks. With the growing number of new benchmarks, we build bigger and more complex LMs. However, building new LMs may not be an ideal option owing to the cost, time, and environmental impact associated with it. We explore an alternative route: can we modify the data by expressing it in terms of the model's strengths, so that a question becomes easier for models to answer? We investigate if humans can decompose a hard question into a set...
Knowledge of the difficulty level of questions helps a teacher in several ways, such as estimating students' potential quickly by asking carefully selected questions and improving the quality of an examination by modifying trivial and hard questions. Can we extract such benefits of instance difficulty in Natural Language Processing? To this end, we conduct Instance-Level Difficulty Analysis of Evaluation data (ILDAE) in a large-scale setup of 23 datasets and demonstrate its five novel applications: 1) conducting efficient-yet-accurate evaluations with fewer...
The recently introduced instruction paradigm empowers non-expert users to leverage NLP resources by defining a new task in natural language. Instruction-tuned models have significantly outperformed multitask learning models (without instructions); however, they are far from state-of-the-art task-specific models. Conventional approaches to improve model performance via creating datasets with a large number of instances or via architectural changes in the model may not be feasible for non-expert users. However, they can write alternate instructions...
It's better to say "I can't answer" than to answer incorrectly. This selective prediction ability is crucial for NLP systems to be reliably deployed in real-world applications. Prior work has shown that existing selective prediction techniques fail to perform well, especially in the out-of-domain setting. In this work, we propose a method that improves the probability estimates of models by calibrating them using the prediction confidence and difficulty score of instances. Using these two signals, we first annotate held-out instances and then train...
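A minimal sketch of calibration with the two signals named above: fit a small model on held-out instances that maps (model confidence, instance difficulty) to the probability the prediction is correct. Using logistic regression here is an illustrative choice, not necessarily the paper's exact calibrator:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_calibrator(confidences, difficulties, was_correct):
    """Fit a calibrator on held-out instances annotated with whether the
    model's prediction was correct (1) or not (0)."""
    features = np.column_stack([confidences, difficulties])
    return LogisticRegression().fit(features, was_correct)

# Calibrated probability estimates for new predictions:
# calibrator = train_calibrator(conf, diff, correct)
# calibrator.predict_proba(np.column_stack([new_conf, new_diff]))[:, 1]
```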
One core capability of Large Language Models (LLMs) is to follow natural language instructions. However, the evaluation of such abilities is not standardized: human evaluations are expensive, slow, and not objectively reproducible, while LLM-based auto-evaluation is potentially biased or limited by the ability of the evaluator LLM. To overcome these issues, we introduce Instruction-Following Eval (IFEval) for large language models. IFEval is a straightforward and easy-to-reproduce benchmark. It focuses on a set of "verifiable...
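The point of "verifiable" instructions is that compliance can be checked by a program rather than by a human or an LLM judge. A minimal sketch of what such checks could look like; the specific instruction types and thresholds are illustrative assumptions, not the benchmark's actual check suite:

```python
def verify_min_words(response, n):
    """Check a 'write at least n words' style instruction."""
    return len(response.split()) >= n

def verify_keyword(response, keyword):
    """Check a 'mention the keyword ...' style instruction."""
    return keyword.lower() in response.lower()

# Hypothetical grading of one model response against two checks.
checks = [lambda r: verify_min_words(r, 300),
          lambda r: verify_keyword(r, "ethics")]
response = "..."  # model output to grade
pass_rate = sum(check(response) for check in checks) / len(checks)
```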
In-context learning (ICL, also known as few-shot prompting) has been the standard method of adapting LLMs to downstream tasks, by learning from a few input-output examples. Nonetheless, all ICL-based approaches only learn from correct input-output pairs. In this paper, we revisit this paradigm, learning more from the few given examples. We introduce Learning Principles (LEAP): First, we intentionally induce the model to make mistakes on these examples; then the model reflects on these mistakes and derives explicit task-specific "principles" from them, which help solve similar problems and avoid...
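A sketch of the mistake-then-reflect loop described above: attempt the few-shot examples, compare against gold answers to surface mistakes, and ask the model to distill principles that can be prepended to future prompts. The `complete` callable and the prompt wording are hypothetical stand-ins, not the paper's implementation:

```python
def learn_principles(examples, complete):
    """Derive task-specific principles from the model's own mistakes on a
    few (question, gold_answer) examples."""
    mistakes = []
    for question, gold in examples:
        attempt = complete(f"Answer concisely.\nQ: {question}\nA:").strip()
        if attempt != gold:  # crude exact-match check, for illustration
            mistakes.append((question, attempt, gold))
    if not mistakes:
        return ""
    reflection = ("For each mistake below, explain what went wrong and "
                  "state a general principle to avoid it.\n\n")
    for q, wrong, gold in mistakes:
        reflection += f"Q: {q}\nModel answer: {wrong}\nCorrect answer: {gold}\n\n"
    return complete(reflection)  # text of task-specific "principles"
```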