NFDI4DS | UHH-SEMS - Publication Details

MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records

OPENALEX - Publications

Scott L. Fleming, Alejandro Lozano, William J. Haberkorn, Jenelle Jindal, Eduardo Pontes Reis, and 25 more

The ability of large language models (LLMs) to follow natural language instructions with human-level fluency suggests many opportunities in healthcare to reduce administrative burden and improve quality of care. However, evaluating LLMs on realistic text generation tasks for healthcare remains challenging. Existing question answering datasets for electronic health record (EHR) data fail to capture the complexity of information needs and documentation burdens experienced by clinicians. To address these challenges, we introduce...

10.1609/aaai.v38i20.30205 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2024-03-24

Automated Classification of Radiographic Knee Osteoarthritis Severity Using Deep Neural Networks

Kevin A. Thomas, Łukasz Kidziński, Eni Halilaj, Scott L. Fleming, Guhan Venkataraman, and 3 more

To develop an automated model for staging knee osteoarthritis severity from radiographs and to compare its performance to that of musculoskeletal radiologists. Radiographs from the Osteoarthritis Initiative staged by a radiologist committee using the Kellgren-Lawrence (KL) system were used. Before using the images as input to a convolutional neural network model, they were standardized and augmented automatically. The model was trained with 32 116 images, tuned with 4074 images, evaluated with a 4090-image test set, and compared to two individual radiologists...

10.1148/ryai.2020190065 article EN Radiology Artificial Intelligence 2020-03-01

Detecting Developmental Delay and Autism Through Machine Learning Models Using Home Videos of Bangladeshi Children: Development and Validation Study

Qandeel Tariq, Scott L. Fleming, Jessey Schwartz, Kaitlyn Dunlap, Conor K. Corbin, and 5 more

Autism spectrum disorder (ASD) is currently diagnosed using qualitative methods that measure between 20-100 behaviors, can span multiple appointments with trained clinicians, and take several hours to complete. In our previous work, we demonstrated the efficacy of machine learning classifiers to accelerate the process by collecting home videos of US-based children, identifying a reduced subset of behavioral features that are scored by untrained raters using a machine learning classifier to determine children's "risk scores" for autism. We...

10.2196/13822 article EN cc-by Journal of Medical Internet Research 2019-04-24

Mapping Neural Circuit Biotypes to Symptoms and Behavioral Dimensions of Depression and Anxiety

Andrea Goldstein‐Piekarski, Tali M. Ball, Zoe Samara, Brooke R. Staveland, Arielle S. Keller, and 6 more

Background: Despite tremendous advances in characterizing human neural circuits that govern emotional and cognitive functions impaired in depression and anxiety, we lack a circuit-based taxonomy for depression and anxiety that captures transdiagnostic heterogeneity and informs clinical decision making. Methods: We developed and tested a novel system for quantifying 6 brain circuits reproducibly and at the individual patient level. We implemented standardized circuit definitions relative to a healthy reference sample and algorithms to generate circuit scores for the overall...

10.1016/j.biopsych.2021.06.024 article EN cc-by-nc-nd Biological Psychiatry 2021-07-11

Ontology-driven weak supervision for clinical entity classification in electronic health records

Jason Fries, Ethan Steinberg, Saelig Khattar, Scott L. Fleming, Jose Posada, and 2 more

In the electronic health record, using clinical notes to identify entities such as disorders and their temporality (e.g. the order of an event relative to a time index) can inform many important analyses. However, creating training data for clinical entity tasks is time consuming and sharing labeled data is challenging due to privacy concerns. The information needs of the COVID-19 pandemic highlight the need for agile methods of training machine learning models for clinical notes. We present Trove, a framework for weakly supervised entity classification using medical...

10.1038/s41467-021-22328-4 article EN cc-by Nature Communications 2021-04-01

EHR foundation models improve robustness in the presence of temporal distribution shift

Lin Guo, Ethan Steinberg, Scott L. Fleming, Jose Posada, Joshua Lemmon, and 4 more

Temporal distribution shift negatively impacts the performance of clinical prediction models over time. Pretraining foundation models using self-supervised learning on electronic health records (EHR) may be effective in acquiring informative global patterns that can improve the robustness of task-specific models. The objective was to evaluate the utility of EHR foundation models in improving the in-distribution (ID) and out-of-distribution (OOD) performance of clinical prediction models. Transformer- and gated recurrent unit-based foundation models were pretrained on EHR of up to 1.8 M patients (382...

10.1038/s41598-023-30820-8 article EN cc-by Scientific Reports 2023-03-07

Evaluating Treatment Prioritization Rules via Rank-Weighted Average Treatment Effects

Steve Yadlowsky, Scott L. Fleming, Nigam H. Shah, Emma Brunskill, Stefan Wager

There are a number of available methods for selecting whom to prioritize for treatment, including ones based on treatment effect estimation, risk scoring, and hand-crafted rules. We propose rank-weighted average treatment effect (RATE) metrics as a simple and general family of metrics for comparing and testing the quality of treatment prioritization rules. RATE metrics are agnostic as to how the prioritization rules were derived, and only assess how well they identify individuals that benefit the most from treatment. We define a family of RATE estimators and prove a central limit theorem that enables asymptotically exact inference in...

10.1080/01621459.2024.2393466 article EN Journal of the American Statistical Association 2024-09-03

Characterizing the limitations of using diagnosis codes in the context of machine learning for healthcare

Lin Lawrence Guo, Keith Morse, Catherine Aftandilian, Ethan Steinberg, Jason Fries, and 6 more

Background: Diagnostic codes are commonly used as inputs for clinical prediction models, to create labels for prediction tasks, and to identify cohorts for multicenter network studies. However, the coverage rates of diagnostic codes and their variability across institutions are underexplored. The primary objective was to describe lab- and diagnosis-based labels for 7 selected outcomes at three institutions. Secondary objectives were to describe agreement, sensitivity, and specificity of diagnosis-based labels against lab-based labels. Methods: This study included three cohorts:...

10.1186/s12911-024-02449-8 article EN cc-by BMC Medical Informatics and Decision Making 2024-02-14

Assessing the accuracy of automatic speech recognition for psychotherapy

Adam S. Miner, Albert Haque, Jason Fries, Scott L. Fleming, Denise E. Wilfley, and 7 more

Accurate transcription of audio recordings in psychotherapy would improve therapy effectiveness, clinician training, and safety monitoring. Although automatic speech recognition software is commercially available, its accuracy in mental health settings has not been well described. It is unclear which metrics and thresholds are appropriate for different clinical use cases, which may range from population descriptions to individual safety monitoring. Here we show that automatic speech recognition is feasible in psychotherapy, but further improvements...

10.1038/s41746-020-0285-8 article EN cc-by npj Digital Medicine 2020-06-03

Clinfo.ai: An Open-Source Retrieval-Augmented Large Language Model System for Answering Medical Questions using Scientific Literature

Alejandro Lozano, Scott L. Fleming, Chia‐Chun Chiang, Nigam H. Shah

10.1142/9789811286421_0002 article EN Biocomputing 2023-12-01

Red Teaming Large Language Models in Medicine: Real-World Insights on Model Behavior

Crystal Chang, Hodan Farah, Haiwen Gui, Shawheen J. Rezaei, Charbel Bou-Khalil, and 75 more

Background: The integration of large language models (LLMs) in healthcare offers immense opportunity to streamline healthcare tasks, but also carries risks such as response accuracy and bias perpetration. To address this, we conducted a red-teaming exercise to assess LLMs in healthcare and developed a dataset of clinically relevant scenarios for future teams to use. Methods: We convened 80 multi-disciplinary experts to evaluate the performance of popular LLMs across multiple medical scenarios. Teams were composed of clinicians, medical and engineering...

10.1101/2024.04.05.24305411 preprint EN cc-by-nc-nd medRxiv (Cold Spring Harbor Laboratory) 2024-04-07

A multi-center study on the adaptability of a shared foundation model for electronic health records

Lin Guo, Jason Fries, Ethan Steinberg, Scott L. Fleming, Keith Morse, and 4 more

Foundation models are transforming artificial intelligence (AI) in healthcare by providing modular components adaptable for various downstream tasks, making AI development more scalable and cost-effective. Foundation models for structured electronic health records (EHR), trained on coded medical records from millions of patients, demonstrated benefits including increased performance with fewer training labels, and improved robustness to distribution shifts. However, questions remain on the feasibility of sharing these models across...

10.1038/s41746-024-01166-w article EN cc-by npj Digital Medicine 2024-06-27

Test-retest reliability of the human functional connectome over consecutive days: identifying highly reliable portions and assessing the impact of methodological choices

Leonardo Tozzi, Scott L. Fleming, Zachary Taylor, Cooper Raterink, Leanne M. Williams

Countless studies have advanced our understanding of the human brain and its organization by using functional magnetic resonance imaging (fMRI) to derive network representations of human brain function. However, we do not know to what extent these “functional connectomes” are reliable over time. In a large public sample of healthy participants (N = 833) scanned on two consecutive days, we assessed the test-retest reliability of fMRI functional connectivity and the consequences on reliability of three common sources of variation in analysis workflows:...

10.1162/netn_a_00148 article EN cc-by Network Neuroscience 2020-05-20

Red teaming ChatGPT in medicine to yield real-world insights on model behavior

Crystal Chang, Hodan Farah, Haiwen Gui, Shawheen J. Rezaei, Charbel Bou-Khalil, and 75 more

Red teaming, the practice of adversarially exposing unexpected or undesired model behaviors, is critical towards improving equity and accuracy of large language models, but non-model creator-affiliated red teaming is scant in healthcare. We convened teams of clinicians, medical and engineering students, and technical professionals (80 participants total) to stress-test models with real-world clinical cases and categorize inappropriate responses along axes of safety, privacy, hallucinations/accuracy, and bias. Six...

10.1038/s41746-025-01542-0 article EN cc-by npj Digital Medicine 2025-03-07

Assessing the Potential of USMLE-Like Exam Questions Generated by GPT-4

Suhana Bedi, Scott L. Fleming, Chia‐Chun Chiang, Keith Morse, A. Kumar, and 5 more

The United States Medical Licensing Examination (USMLE) is a critical step in assessing the competence of future physicians, yet the process of creating exam questions and study materials is both time-consuming and costly. While Large Language Models (LLMs), such as OpenAI’s GPT-4, have demonstrated proficiency in answering medical exam questions, their potential in generating such questions remains underexplored. This study presents QUEST-AI, a novel system that utilizes LLMs to (1) generate USMLE-style questions, (2) identify and flag incorrect questions, and (3)...

10.1101/2023.04.25.23288588 preprint EN cc-by medRxiv (Cold Spring Harbor Laboratory) 2023-04-28

Considerations in the reliability and fairness audits of predictive models for advance care planning

Jonathan Lu, Amelia Sattler, Samantha Wang, Ali Raza Khaki, Alison Callahan, and 22 more

Multiple reporting guidelines for artificial intelligence (AI) models in healthcare recommend that models be audited for reliability and fairness. However, there is a gap of operational guidance for performing reliability and fairness audits in practice. Following guideline recommendations, we conducted a reliability audit of two models based on model performance and calibration as well as a fairness audit based on summary statistics, subgroup performance and subgroup calibration. We assessed the Epic End-of-Life (EOL) Index model and an internally developed Stanford Hospital Medicine (HM) Advance Care Planning...

10.3389/fdgth.2022.943768 article EN cc-by Frontiers in Digital Health 2022-09-12

Intrinsic reward circuit connectivity profiles underlying symptom and quality of life outcomes following antidepressant medication: a report from the iSPOT-D trial

Adina S. Fischer, Bailey Holt-Gosselin, Scott L. Fleming, Laura M. Hack, Tali M. Ball, and 2 more

10.1038/s41386-020-00905-3 article EN Neuropsychopharmacology 2020-11-23

Intrinsic Connectivity and Family Dynamics: Striatolimbic Markers of Risk and Resilience in Youth at Familial Risk for Mood Disorders

Adina S. Fischer, Bailey Holt-Gosselin, Kelsey E. Hagan, Scott L. Fleming, Akua F. Nimarko, and 2 more

10.1016/j.bpsc.2022.02.009 article EN publisher-specific-oa Biological Psychiatry Cognitive Neuroscience and Neuroimaging 2022-03-08

Evaluation of Feature Selection Methods for Preserving Machine Learning Performance in the Presence of Temporal Dataset Shift in Clinical Medicine

Joshua Lemmon, Lin Guo, Jose Posada, Stephen Pfohl, Jason Fries, and 4 more

Background: Temporal dataset shift can cause degradation in model performance as discrepancies between training and deployment data grow over time. The primary objective was to determine whether parsimonious models produced by specific feature selection methods are more robust to temporal dataset shift as measured by out-of-distribution (OOD) performance, while maintaining in-distribution (ID) performance. Methods: Our dataset consisted of intensive care unit patients from MIMIC-IV categorized by year groups...

10.1055/s-0043-1762904 article EN Methods of Information in Medicine 2023-02-22

MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records

Scott L. Fleming, Alejandro Lozano, William J. Haberkorn, Jenelle Jindal, Eduardo P. Reis, and 25 more

The ability of large language models (LLMs) to follow natural language instructions with human-level fluency suggests many opportunities in healthcare to reduce administrative burden and improve quality of care. However, evaluating LLMs on realistic text generation tasks for healthcare remains challenging. Existing question answering datasets for electronic health record (EHR) data fail to capture the complexity of information needs and documentation burdens experienced by clinicians. To address these challenges, we introduce...

10.48550/arxiv.2308.14089 preprint EN cc-by arXiv (Cornell University) 2023-01-01

A computational approach to measure the linguistic characteristics of psychotherapy timing, responsiveness, and consistency

Adam S. Miner, Scott L. Fleming, Albert Haque, Jason Fries, Tim Althoff, and 8 more

Although individual psychotherapy is generally effective for a range of mental health conditions, little is known about the moment-to-moment language use of therapists. Increased access to computational power, coupled with a rise in computer-mediated communication (telehealth), makes feasible the large-scale analyses of language use during psychotherapy. Transparent methodological approaches are lacking, however. Here we present novel methods to increase the efficiency of efforts to examine language use in psychotherapy. We evaluate three important...

10.1038/s44184-022-00020-9 article EN cc-by npj Mental Health Research 2022-12-02

Machine learning for detection of heterogeneous effects of Medicaid coverage on depression

Ryunosuke Goto, Kosuke Inoue, Itsuki Osawa, Katherine Baicker, Scott L. Fleming, and 1 more

In 2008, Oregon expanded its Medicaid program using a lottery, creating a rare opportunity to study the effects of Medicaid coverage using a randomized controlled design (Oregon Health Insurance Experiment). Analysis showed that Medicaid coverage lowered the risk of depression. However, this effect may vary between individuals, and the identification of individuals likely to benefit the most has the potential to improve the effectiveness and efficiency of the Medicaid program. By applying the machine learning causal forest to data from this experiment, we found substantial...

10.1093/aje/kwae008 article EN American Journal of Epidemiology 2024-02-22

Self-supervised machine learning using adult inpatient data produces effective models for pediatric clinical prediction tasks

Joshua Lemmon, Lin Guo, Ethan Steinberg, Keith Morse, Scott L. Fleming, and 6 more

Development of electronic health records (EHR)-based machine learning models for pediatric inpatients is challenged by limited training data. Self-supervised learning using adult data may be a promising approach to creating robust pediatric prediction models. The primary objective was to determine whether a self-supervised model trained in adult inpatients was noninferior to logistic regression models trained in pediatric inpatients, for pediatric inpatient clinical prediction tasks.

10.1093/jamia/ocad175 article EN Journal of the American Medical Informatics Association 2023-08-28