Benjamin J. Lengerich

ORCID: 0000-0001-8690-9554
Research Areas
  • Explainable Artificial Intelligence (XAI)
  • Machine Learning in Healthcare
  • Cell Image Analysis Techniques
  • Cancer Genomics and Diagnostics
  • Pregnancy and preeclampsia studies
  • Bioinformatics and Genomic Networks
  • Gene expression and cancer classification
  • Genetics, Bioinformatics, and Biomedical Research
  • Maternal and fetal healthcare
  • Trauma and Emergency Care Studies
  • Machine Learning and Data Classification
  • Neural Networks and Applications
  • COVID-19 Clinical Research Studies
  • AI in cancer detection
  • Statistical Methods and Inference
  • Adversarial Robustness in Machine Learning
  • Scientific Computing and Data Management
  • Health Systems, Economic Evaluations, Quality of Life
  • Advanced Causal Inference Techniques
  • Gaussian Processes and Bayesian Inference
  • Sepsis Diagnosis and Treatment
  • Advanced Graph Neural Networks
  • Topic Modeling
  • Statistical Methods in Epidemiology
  • Pharmacovigilance and Adverse Drug Reactions

Broad Institute
2022-2024

Massachusetts Institute of Technology
2021-2024

Carnegie Mellon University
2017-2022

Vassar College
2021-2022

Pennsylvania State University
2014

Abstract Motivation: Association studies to discover links between genetic markers and phenotypes are central to bioinformatics. Methods of regularized regression, such as variants of the Lasso, are popular for this task. Despite the good predictive performance of these methods in the average case, they suffer from unstable selections of correlated variables and inconsistent selections of linearly dependent variables. Unfortunately, as we demonstrate empirically, problematic situations often exist in genomic datasets and lead to under-performance...

10.1093/bioinformatics/bty750 article EN cc-by Bioinformatics 2018-09-01

Abstract Deep learning, which describes a class of machine learning algorithms, has recently shown impressive results across a variety of domains. Biology and medicine are data rich, but the data are complex and often ill-understood. Problems of this nature may be particularly well-suited to deep learning techniques. We examine applications of deep learning to a variety of biomedical problems (patient classification, fundamental biological processes, and treatment of patients) and discuss whether deep learning will transform these tasks or if the biomedical sphere poses unique challenges....

10.1101/142760 preprint EN cc-by bioRxiv (Cold Spring Harbor Laboratory) 2017-05-28

Machine learning is a modern approach to problem-solving and task automation. In particular, machine learning is concerned with the development and applications of algorithms that can recognize patterns in data and use them for predictive modeling. Artificial neural networks are a particular class of machine learning models that evolved into what is now described as deep learning. Given the computational advances made in the last decade, deep learning can be applied to massive data sets and in innumerable contexts. Therefore, deep learning has become its own subfield. In the context of biological research,...

10.1371/journal.pcbi.1009803 article EN cc-by PLoS Computational Biology 2022-03-24

The positioning of catalytic groups within proteins plays an important role in enzyme catalysis, and here we investigate the positioning of the general base in ketosteroid isomerase (KSI). The oxygen atoms of Asp38, the general base in KSI, were previously shown to be involved in anion-aromatic interactions with two neighboring Phe residues. Here we ask whether those interactions are sufficient, within the overall protein architecture, to position Asp38 for catalysis or whether the side chains that pack against Asp38 and/or the residues of the structured loop that is capped by Asp38 are necessary to achieve optimal...

10.1021/bi401671t article EN publisher-specific-oa Biochemistry 2014-03-05

Abstract Cancers are shaped by somatic mutations, microenvironment, and patient background, each altering gene expression and regulation in complex ways, resulting in heterogeneous cellular states and dynamics. Inferring gene regulatory network (GRN) models from expression data can help characterize this regulation-driven heterogeneity, but network inference requires many statistical samples, traditionally limiting GRNs to cluster-level analyses that ignore intra-cluster heterogeneity. We propose to move beyond cluster-based...

10.1101/2023.12.01.569658 preprint EN cc-by bioRxiv (Cold Spring Harbor Laboratory) 2023-12-04

Heterogeneous and context-dependent systems are common in real-world processes, such as those in biology, medicine, finance, and the social sciences. However, learning accurate and interpretable models of these heterogeneous systems remains an unsolved problem. Most statistical modeling approaches make strict assumptions about data homogeneity, leading to inaccurate models, while more flexible approaches are often too complex to interpret directly. Fundamentally, existing tools force users to choose between accuracy...

10.21105/joss.06469 article EN cc-by The Journal of Open Source Software 2024-05-08

Abstract Integration of single-cell RNA-sequencing (scRNA-seq) datasets has become a standard part of the analysis, with conditional variational autoencoders (cVAE) being among the most popular approaches. Increasingly, researchers are asking to map cells across challenging cases such as cross-organs, species, or organoids and primary tissue, as well as different scRNA-seq protocols, including single-nuclei. Current computational methods struggle to harmonize substantial differences, driven by technical...

10.1101/2023.11.03.565463 preprint EN cc-by-nc-nd bioRxiv (Cold Spring Harbor Laboratory) 2023-11-05

In many applications, inter-sample heterogeneity is crucial to understanding the complex biological processes under study. For example, in genomic analysis of cancers, each patient in a cohort may have a different driver mutation, making it difficult or impossible to identify causal mutations from an averaged view of the entire cohort. Unfortunately, traditional methods for genomic analysis seek to estimate a single model which is shared by all samples in a population, ignoring this heterogeneity entirely. In order to better understand heterogeneity,...

10.1093/bioinformatics/bty250 article EN cc-by-nc Bioinformatics 2018-04-23

Knowledge graphs are a versatile framework to encode richly structured data relationships, but it can be challenging to combine these graphs with unstructured data. Methods for retrofitting pre-trained entity representations to the structure of a knowledge graph typically assume that entities are embedded in a connected space and that relations imply similarity. However, useful knowledge graphs often contain diverse entities and relations (with potentially disjoint underlying corpora) which do not accord with these assumptions. To overcome these limitations, we present...

10.48550/arxiv.1708.00112 preprint EN other-oa arXiv (Cornell University) 2017-01-01

Abstract Summarizing multiple data modalities into a parsimonious cancer “subtype” is difficult because the most informative representation of each patient’s disease is not observed. We propose to model these latent summaries as discriminative subtypes: sample representations which induce accurate and interpretable sample-specific models for downstream predictions. In this way, discriminative subtypes, which are shared between data modalities, can be estimated from one data modality and optimized according to the predictions induced...

10.1101/2020.06.25.20140053 preprint EN cc-by-nc-nd medRxiv (Cold Spring Harbor Laboratory) 2020-06-26

Modern applications of machine learning (ML) deal with increasingly heterogeneous datasets comprised of data collected from overlapping latent subpopulations. As a result, traditional models trained over large datasets may fail to recognize highly predictive localized effects in favour of weakly predictive global patterns. This is a problem because localized effects are critical to developing individualized policies and treatment plans in applications ranging from precision medicine to advertising. To address this challenge, we propose to estimate sample-specific...

10.48550/arxiv.1910.06939 preprint EN other-oa arXiv (Cornell University) 2019-01-01

Most pregnancies and births result in a good outcome, but complications are not uncommon and, when they do occur, they can be associated with serious implications for mothers and babies. Predictive modeling has the potential to improve outcomes through better understanding of risk factors, heightened surveillance, and more timely and appropriate interventions, thereby helping obstetricians deliver better care. For three types of complications we identify and study the most important risk factors using Explainable Boosting Machine (EBM), a glass box...

10.48550/arxiv.2207.05322 preprint EN other-oa arXiv (Cornell University) 2022-01-01

Models which estimate main effects of individual variables alongside interaction effects have an identifiability challenge: effects can be freely moved between main effects and interaction effects without changing the model prediction. This is a critical problem for interpretability because it permits "contradictory" models to represent the same function. To solve this problem, we propose pure interaction effects: variance in the outcome which cannot be represented by any smaller subset of features. This definition has an equivalence with the Functional ANOVA decomposition. To compute...

10.48550/arxiv.1911.04974 preprint EN other-oa arXiv (Cornell University) 2019-01-01

Abstract In many applications, inter-sample heterogeneity is crucial to understanding the complex biological processes under study. For example, in genomic analysis of cancers, each patient in a cohort may have a different driver mutation, making it difficult or impossible to identify causal mutations from an averaged view of the entire cohort. Unfortunately, traditional methods for genomic analysis seek to estimate a single model which is shared by all samples in a population, ignoring this heterogeneity entirely. In order to better understand...

10.1101/294496 preprint EN bioRxiv (Cold Spring Harbor Laboratory) 2018-04-05