- Time Series Analysis and Forecasting
- Topic Modeling
- Machine Learning in Healthcare
- Natural Language Processing Techniques
- Anomaly Detection Techniques and Applications
- Explainable Artificial Intelligence (XAI)
- Text and Document Classification Technologies
- Machine Learning and Data Classification
- Artificial Intelligence in Healthcare and Education
- EEG and Brain-Computer Interfaces
- Imbalanced Data Classification Techniques
- Artificial Intelligence in Healthcare
- Semantic Web and Ontologies
- Stock Market Forecasting Methods
- Biomedical Text Mining and Ontologies
- Text Readability and Simplification
- Neural dynamics and brain function
- COVID-19 diagnosis using AI
- Clostridium difficile and Clostridium perfringens research
- Hate Speech and Cyberbullying Detection
- Neural Networks and Applications
- Intelligent Tutoring Systems and Adaptive Learning
- Digital Radiography and Breast Imaging
- Advanced Memory and Neural Computing
- Machine Learning and ELM
University of Virginia
2023-2025
Massachusetts Institute of Technology
2022-2023
IIT@MIT
2023
Microsoft (United States)
2022
Allen Institute
2022
Carnegie Mellon University
2022
Worcester Polytechnic Institute
2017-2021
Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, Ece Kamar. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022.
Machine learning models in safety-critical settings like healthcare are often blackboxes: they contain a large number of parameters which are not transparent to users. Post-hoc explainability methods, where a simple, human-interpretable model imitates the behavior of these blackbox models, have been proposed to help users trust model predictions. In this work, we audit the quality of such explanations for different protected subgroups using real data from four settings: finance, healthcare, college admissions, and the US justice system. Across two...
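A minimal sketch of the kind of subgroup audit described above, assuming a LIME-style surrogate whose fidelity to the blackbox is compared across a protected attribute; the function names and the toy data are illustrative, not the paper's code.

```python
# Hypothetical sketch: does a post-hoc surrogate explain the blackbox equally
# well for every protected subgroup?
import numpy as np

def fidelity(blackbox_preds, surrogate_preds):
    """Fraction of inputs where the surrogate reproduces the blackbox label."""
    return np.mean(blackbox_preds == surrogate_preds)

def fidelity_gap(blackbox_preds, surrogate_preds, group_labels):
    """Largest difference in explanation fidelity between any two subgroups."""
    scores = {
        g: fidelity(blackbox_preds[group_labels == g],
                    surrogate_preds[group_labels == g])
        for g in np.unique(group_labels)
    }
    return max(scores.values()) - min(scores.values()), scores

# Toy usage with made-up predictions and a binary protected attribute.
bb = np.array([1, 0, 1, 1, 0, 1])
sg = np.array([1, 0, 0, 1, 0, 0])
grp = np.array([0, 0, 0, 1, 1, 1])
gap, per_group = fidelity_gap(bb, sg, grp)
```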
Motivated by human attention, computational attention mechanisms have been designed to help neural networks adjust their focus on specific parts of the input data. While attention mechanisms are claimed to achieve interpretability, little is known about the actual relationships between machine and human attention. In this work, we conduct the first quantitative assessment of human versus machine attention for the text classification task. To do this, we design a large-scale crowd-sourcing study to collect human attention maps that encode the words humans attend to when conducting text classification. Based...
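One simple way to quantify the agreement the abstract refers to is to compare a model's per-token attention weights with a crowd-sourced human attention map. The sketch below assumes rank correlation and top-k token overlap as the comparison metrics; these are illustrative choices, not the study's exact protocol.

```python
# Illustrative comparison of machine vs. human attention over one sentence.
import numpy as np
from scipy.stats import spearmanr

def compare_attention(machine_attn, human_attn, k=3):
    machine_attn = np.asarray(machine_attn, dtype=float)
    human_attn = np.asarray(human_attn, dtype=float)
    rho, _ = spearmanr(machine_attn, human_attn)    # rank agreement
    top_m = set(np.argsort(machine_attn)[-k:])      # tokens the model stresses
    top_h = set(np.argsort(human_attn)[-k:])        # tokens annotators marked
    overlap = len(top_m & top_h) / k
    return rho, overlap

# Toy example: per-token weights over a 6-token sentence.
rho, overlap = compare_attention([0.05, 0.4, 0.1, 0.3, 0.1, 0.05],
                                 [0.0, 0.5, 0.0, 0.4, 0.1, 0.0])
```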
Background: Large language models (LLMs) are increasingly used to generate medical content, yet their inherent design to follow user instructions may leave them vulnerable to producing misinformation. This risk becomes especially pronounced when LLMs generate incorrect information that could adversely affect human health. A propensity to comply with prompts, even when these prompts lead to illogical or false information, highlights a critical gap in safety for high-stakes fields like healthcare. Methods: We evaluated...
Sparse Autoencoders (SAEs) show potential for uncovering structured, human-interpretable representations in Large Language Models (LLMs), making them a crucial tool for transparent and controllable AI systems. We systematically analyze SAE-based interpretable feature extraction from LLMs for safety-critical classification tasks. Our framework evaluates (1) model-layer selection and scaling properties, (2) architectural configurations, including width and pooling strategies, and (3) the effect of binarizing...
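For readers unfamiliar with the building block, here is a minimal sparse autoencoder over LLM hidden states, plus the binarization step the abstract mentions. This is a generic sketch assuming PyTorch; the layer width, L1 coefficient, and threshold are placeholder values, not the paper's configuration.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=768, d_features=8192):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, h):
        z = torch.relu(self.encoder(h))   # sparse, non-negative features
        recon = self.decoder(z)
        return recon, z

def sae_loss(h, recon, z, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that keeps features sparse.
    return ((recon - h) ** 2).mean() + l1_coeff * z.abs().mean()

def binarize(z, threshold=0.0):
    # Active/inactive feature indicators for a downstream safety classifier.
    return (z > threshold).float()
```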
Early classification of time series is the prediction of the class label of a time series before it is observed in its entirety. In time-sensitive domains where information is collected over time, it is worth sacrificing some accuracy in favor of earlier predictions, ideally early enough for actions to be taken. However, since accuracy and earliness are contradictory objectives, a solution must address this challenge and discover task-dependent trade-offs. We design an early classification model, called EARLIEST, which tackles this multi-objective optimization problem, jointly...
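The accuracy-versus-earliness trade-off can be written as a single objective that penalizes how much of the series was consumed before halting. The sketch below is a hedged illustration of that idea, not EARLIEST's exact formulation; the halting mechanism and the weight are placeholders.

```python
import torch
import torch.nn.functional as F

def early_classification_loss(logits, labels, halt_step, series_length,
                              earliness_weight=0.1):
    """Classification loss plus a penalty proportional to how late we halted."""
    ce = F.cross_entropy(logits, labels)
    earliness = halt_step.float() / series_length  # 0 = immediate, 1 = full series
    return ce + earliness_weight * earliness.mean()
```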
Large language models (LLMs) are being applied to time series tasks, particularly time series forecasting. However, are LLMs actually useful for time series? After a series of ablation studies on three recent and popular LLM-based forecasting methods, we find that removing the LLM component or replacing it with a basic attention layer does not degrade results -- in most cases results even improved. We also find that, despite their significant computational cost, pretrained LLMs do no better than models trained from scratch, do not represent the sequential...
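The ablation described above amounts to swapping the pretrained LLM backbone for a single self-attention layer while keeping the rest of the forecasting pipeline fixed. The module below is an illustrative stand-in under that assumption; names and dimensions are made up.

```python
import torch
import torch.nn as nn

class SimpleAttnForecaster(nn.Module):
    def __init__(self, input_len, horizon, d_model=64, n_heads=4):
        super().__init__()
        self.embed = nn.Linear(1, d_model)                  # per-step embedding
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(input_len * d_model, horizon)

    def forward(self, x):                                   # x: (batch, input_len)
        h = self.embed(x.unsqueeze(-1))
        h, _ = self.attn(h, h, h)                           # stand-in for the LLM
        return self.head(h.flatten(1))                      # point forecasts
```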
Despite recent concerns about undesirable behaviors generated by large language models (LLMs), including non-factual, biased, and hateful language, we find that LLMs are inherent multi-task language checkers based on their latent representations of natural and social knowledge. We present an interpretable, unified, language checking (UniLC) method for both human and machine-generated language that aims to check whether language input is factual and fair. While fairness and fact-checking tasks have been handled separately with dedicated models, they can...
Artificial intelligence (AI) stands to improve healthcare through innovative new systems ranging from diagnosis aids to patient tools. However, such "Health AI" systems are complicated and challenging to integrate into standing clinical practice. With advancing AI, regulations, clinical practice, and policies must adapt to a wide range of risks while experts learn to interact with complex automated systems. Even in these early stages of Health AI, gaps are being identified, like severe underperformance of models for minority groups...
Early multi-label classification of time series, the assignment of a label set to a time series before it is entirely observed, is critical for time-sensitive domains such as healthcare. In such cases, waiting too long to classify can render predictions useless, regardless of their accuracy, while predicting prematurely can result in potentially costly erroneous results. When multiple labels are possible (for example, types of infections), dependencies between labels can be learned and leveraged to improve overall accuracy. Together, reliably correct...
Explainable classification is essential in high-impact settings where practitioners require evidence to support their decisions. However, state-of-the-art deep learning models lack transparency in how they make their predictions. One increasingly popular solution is attribution-based explainability, which finds the impact of input features on the model's predictions. While this works well for computer vision, little has been done to explain deep time series classifiers. In this work, we study this problem and propose PERT, a novel perturbation-based...
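To make the attribution idea concrete, here is a generic perturbation-based saliency routine for a time series classifier: occlude one window at a time with a baseline value and record how much the predicted class probability drops. This is only a sketch of the general technique under simple assumptions, not PERT itself.

```python
import numpy as np

def perturbation_saliency(model_fn, series, target_class, window=5):
    """model_fn maps a 1-D numpy series to class probabilities."""
    base_prob = model_fn(series)[target_class]
    saliency = np.zeros_like(series, dtype=float)
    baseline = series.mean()
    for start in range(0, len(series), window):
        perturbed = series.copy()
        perturbed[start:start + window] = baseline          # occlude one window
        drop = base_prob - model_fn(perturbed)[target_class]
        saliency[start:start + window] = drop               # big drop = important
    return saliency
```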
Deployed language models decay over time due to shifting inputs, changing user needs, or emergent world-knowledge gaps. When such problems are identified, we want to make targeted edits while avoiding expensive retraining. However, current model editors, which modify such behaviors of pre-trained models, degrade performance quickly across multiple, sequential edits. We propose GRACE, a lifelong model editing method, which implements spot-fixes on streaming errors of a deployed model, ensuring minimal impact...
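A simplified way to picture spot-fixing is a key-value codebook attached to one layer: hidden states that triggered errors are cached as keys with corrected values, and at inference a cached fix is returned only for queries falling within a small radius of a key, leaving all other behavior untouched. The class below is loosely in that spirit; the retrieval rule and radius are illustrative assumptions, not the published method.

```python
import torch

class KeyValueAdaptor:
    def __init__(self, radius=1.0):
        self.keys, self.values, self.radius = [], [], radius

    def add_fix(self, hidden_state, corrected_value):
        """Cache the hidden state that triggered an error and its fix."""
        self.keys.append(hidden_state.detach())
        self.values.append(corrected_value.detach())

    def __call__(self, hidden_state):
        """Return a cached fix if the query falls inside any key's radius."""
        for k, v in zip(self.keys, self.values):
            if torch.dist(hidden_state, k) < self.radius:
                return v              # override the layer's activation
        return hidden_state           # otherwise leave the model untouched
```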
Foundation models, especially LLMs, are profoundly transforming deep learning. Instead of training many task-specific models, we can adapt a single pretrained model to many tasks via few-shot prompting or fine-tuning. However, current foundation models apply to sequence data but not to time series, which present unique challenges due to the inherently diverse and multi-domain nature of time series datasets, diverging task specifications across forecasting, classification, and other types of tasks, and the apparent need for task-specialized models....
For medical imaging AI models to be clinically impactful, they must generalize. However, this goal is hindered by (i) diverse types of distribution shifts, such as temporal, demographic, and label shifts, and (ii) limited diversity in datasets that are siloed within single institutions. While these limitations have spurred interest in federated learning, current evaluation benchmarks fail to evaluate different shifts simultaneously. In real healthcare settings, multiple shifts co-exist, yet their impact on performance...
Vision-language models, like CLIP (Contrastive Language Image Pretraining), are becoming increasingly popular for a wide range of multimodal retrieval tasks. However, prior work has shown that large language and deep vision models can learn historical biases contained in their training sets, leading to perpetuation of stereotypes and potential downstream harm. In this work, we conduct a systematic analysis of the social biases present in CLIP, with a focus on the interaction between image and text modalities. We first...
Understanding the roles of human proteins remains a major challenge, with approximately 20% lacking known functions and more than 40% missing context-specific functional insights. Even well-annotated proteins are often poorly characterized in diverse biological contexts, disease states, and perturbations. We present ProCyon, a foundation model for modeling, generating, and predicting protein phenotypes across five interrelated knowledge domains: molecular functions, therapeutic mechanisms, disease associations,...
Math word problems are critical K-8 educational tools, but writing them is time-consuming and requires domain expertise. We suggest that language models can support math education by automatically generating word problems at scale. To be educational, generated problems must be 1) solvable, 2) accurate, and 3) appropriate. Existing datasets are unlabeled for these criteria, making them ill-suited for training problem generators. We introduce MATHWELL, a Llama-2 (70B) model iteratively finetuned to generate word problems using data from expert...
Medical knowledge is context-dependent and requires consistent reasoning across various natural language expressions of semantically equivalent phrases. This is particularly crucial for drug names, where patients often use brand names like Advil or Tylenol instead of their generic equivalents. To study this, we create a new robustness dataset, RABBITS, to evaluate performance differences on medical benchmarks after swapping brand and generic drug names using physician expert annotations. We assess both open-source and API-based...
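The swap-and-compare evaluation idea can be sketched in a few lines: replace brand names with generics in each benchmark question and measure the accuracy drop. The mapping, the `model_answer_fn` callable, and the scoring are stand-ins for illustration only.

```python
import re

BRAND_TO_GENERIC = {"Advil": "ibuprofen", "Tylenol": "acetaminophen"}  # toy map

def swap_drug_names(text, mapping=BRAND_TO_GENERIC):
    for brand, generic in mapping.items():
        text = re.sub(rf"\b{re.escape(brand)}\b", generic, text)
    return text

def accuracy_drop(model_answer_fn, questions, answers):
    """Compare accuracy on original vs. name-swapped questions."""
    orig = sum(model_answer_fn(q) == a for q, a in zip(questions, answers))
    swapped = sum(model_answer_fn(swap_drug_names(q)) == a
                  for q, a in zip(questions, answers))
    return (orig - swapped) / len(questions)
```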
Positive-Unlabeled (PU) learning methods train a classifier to distinguish between the positive and negative classes given only positive and unlabeled data. While traditional PU methods require the labeled samples to be an unbiased sample of the positive distribution, in practice the labeled sample is often a biased draw from the true distribution. Prior work shows that if we know the likelihood that each positive instance will be selected for labeling, referred to as the propensity score, then it can be used for learning. Unfortunately, no prior work has proposed an inference strategy for which...
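To illustrate how known propensity scores enter the picture, the standard reweighting treats each labeled positive as 1/e(x) positive examples and (1 - 1/e(x)) negative examples, with unlabeled examples counted as negatives. The snippet below is a generic sketch of that risk estimator, not this paper's inference strategy; the vectorized `loss` callable is an assumption.

```python
import numpy as np

def propensity_weighted_risk(scores_labeled, propensities,
                             scores_unlabeled, loss):
    """Estimate classification risk from labeled positives and unlabeled data.

    loss(scores, label) is assumed to return a per-example loss array.
    """
    w = 1.0 / propensities
    labeled_term = (w * loss(scores_labeled, 1)
                    + (1.0 - w) * loss(scores_labeled, 0))
    unlabeled_term = loss(scores_unlabeled, 0)     # unlabeled treated as negative
    n = len(scores_labeled) + len(scores_unlabeled)
    return (labeled_term.sum() + unlabeled_term.sum()) / n
```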
Toxic language detection systems often falsely flag text that contains minority group mentions as toxic, as those groups are often the targets of online hate. Such over-reliance on spurious correlations also causes systems to struggle with detecting implicitly toxic language. To help mitigate these issues, we create ToxiGen, a new large-scale and machine-generated dataset of 274k toxic and benign statements about 13 minority groups. We develop a demonstration-based prompting framework and an adversarial classifier-in-the-loop decoding...