- Machine Learning in Bioinformatics
- Protein Structure and Dynamics
- RNA and protein synthesis mechanisms
- Genomics and Phylogenetic Studies
- Bioinformatics and Genomic Networks
- Evolution and Genetic Dynamics
- Computational Drug Discovery Methods
- Microbial Metabolic Engineering and Bioproduction
- Coral and Marine Ecosystems Studies
- Vaccines and immunoinformatics approaches
- Gene Regulatory Network Analysis
- Aquaculture disease management and microbiota
- Gut microbiota and health
- Monoclonal and Polyclonal Antibodies Research
- Marine Sponges and Natural Products
- Mosquito-borne diseases and control
- Cancer Genomics and Diagnostics
- Aquaculture Nutrition and Growth
- Biomedical Text Mining and Ontologies
- Neurobiology and Insect Physiology Research
- Lipid Membrane Structure and Behavior
- CRISPR and Genetic Engineering
- Single-cell and spatial transcriptomics
- Animal Virus Infections Studies
- Bat Biology and Ecology Studies
Massachusetts Institute of Technology
2021-2025
Microsoft (United States)
2023-2025
Flatiron Health (United States)
2024-2025
Flatiron Institute
2024
Moscow Institute of Thermal Technology
2024
Tufts University
2022
Broad Institute
2022
University of Connecticut
2019-2020
We combine advances in neural language modeling and structurally motivated design to develop D-SCRIPT, an interpretable and generalizable deep-learning model, which predicts interaction between two proteins using only their sequence and maintains high accuracy with limited training data and across species. We show that a D-SCRIPT model trained on 38,345 human PPIs enables significantly improved functional characterization of fly proteins compared with the state-of-the-art approach. Evaluating the same D-SCRIPT model on protein complexes...
Sequence-based prediction of drug-target interactions has the potential to accelerate drug discovery by complementing experimental screens. Such computational prediction needs to be generalizable and scalable while remaining sensitive to subtle variations in the inputs. However, current computational techniques fail to simultaneously meet these goals, often sacrificing performance of one to achieve the others. We develop a deep learning model, ConPLex, successfully leveraging the advances in pretrained protein language models ("PLex") and employing...
Proteomics has been revolutionized by large protein language models (PLMs), which learn unsupervised representations from large corpora of sequences. These models are typically fine-tuned in a supervised setting to adapt the model to specific downstream tasks. However, the computational and memory footprint of fine-tuning (FT) large PLMs presents a barrier for many research groups with limited resources. Natural language processing has seen a similar explosion in the size of models, where these challenges have been addressed by methods...
Abstract Summary Computational methods to predict protein–protein interaction (PPI) typically segregate into sequence-based ‘bottom-up’ methods that infer properties from the characteristics of the individual protein sequences, or global ‘top-down’ methods that infer properties from the pattern of already known PPIs in the species of interest. However, a way to incorporate top-down insights into sequence-based bottom-up PPI prediction methods has been elusive. We thus introduce Topsy-Turvy, a method that newly synthesizes both views in a sequence-based, multi-scale, deep-learning model for...
Abstract The majority of proteins must form higher-order assemblies to perform their biological functions, yet few machine learning models can accurately and rapidly predict the symmetry of assemblies involving multiple copies of the same protein chain. Here, we address this gap by finetuning several classes of protein foundation models to predict homo-oligomer symmetry. Our best model, named Seq2Symm, which utilizes ESM2, outperforms existing template-based and deep learning methods, achieving an average AUC-PR of 0.47, 0.44 and 0.49 across symmetries...
High-quality computational structural models are now precomputed and available for nearly every protein in UniProt. However, the best way to leverage these models to predict which pairs of proteins interact in a high-throughput manner is not immediately clear. The recent Foldseek method of van Kempen et al. encodes the structural information of distances and angles along the protein backbone into a linear string of the same length as the protein string, using tokens from a 21-letter discretized structural alphabet (3Di).
Protein language models (PLMs) based on machine learning have demonstrated impressive success in predicting protein structure and function. However, general-purpose (“foundational”) PLMs have limited performance in predicting antibodies due to the latter’s hypervariable regions, which do not conform to the evolutionary conservation principles that such models rely on. In this study, we propose a new transfer learning framework called AbMAP, which fine-tunes foundational models for antibody-sequence inputs by supervising on antibody structure and binding...
Abstract Protein-protein interaction (PPI) networks have proven to be a valuable tool in systems biology to facilitate the discovery and understanding of protein function. Unfortunately, experimental PPI data remains sparse in most model organisms and even more so in other species. Existing methods for computational prediction of PPIs seek to address this limitation, and while they perform well when sufficient within-species training data is available, they generalize poorly to new species or often require specific types and sizes...
Abstract The majority of proteins must form higher-order assemblies to perform their biological functions. Despite the importance of protein quaternary structure, there are few machine learning models that can accurately and rapidly predict the symmetry of assemblies involving multiple copies of the same protein chain. Here, we address this gap by training several classes of protein foundation models, including ESM-MSA, ESM2, and RoseTTAFold2, to predict homo-oligomer symmetry. Our best model, named Seq2Symm, which utilizes ESM2, outperforms...
With the ease of gene sequencing and the technology available to study and manipulate non-model organisms, the extension of the methodological toolbox required to translate our understanding of model organisms to non-model organisms has become an urgent problem. For example, mining of large coral and their symbiont sequence data is a challenge, but also provides an opportunity for understanding the functionality and evolution of these and other non-model organisms. Much more information than for any other eukaryotic species is available for humans, especially related to signal transduction and diseases. However,...
Proteomics has been revolutionized by large pre-trained protein language models, which learn unsupervised representations from large corpora of sequences. The parameters of these models are then fine-tuned in a supervised setting to tailor the model to a specific downstream task. However, as model size increases, the computational and memory footprint of fine-tuning becomes a barrier for many research groups. In the field of natural language processing, which has seen a similar explosion in the size of models, these challenges have been addressed by methods for parameter-efficient...
Protein Language Models (PLMs) trained on large databases of protein sequences have proven effective in modeling protein biology across a wide range of applications. However, while PLMs excel at capturing individual protein properties, they face challenges in natively representing protein–protein interactions (PPIs), which are crucial to understanding cellular processes and disease mechanisms. Here, we introduce MINT, a PLM specifically designed to model sets of interacting proteins in a contextual and scalable manner. Using...
Protein language models (PLMs) have demonstrated impressive success in modeling proteins. However, general-purpose “foundational” PLMs have limited performance in modeling antibodies due to the latter’s hypervariable regions, which do not conform to the evolutionary conservation principles that such models rely on. In this study, we propose a transfer learning framework called Antibody Mutagenesis-Augmented Processing (AbMAP), which fine-tunes foundational models for antibody-sequence inputs by supervising on antibody structure and...
Abstract We consider the problem of sequence-based drug-target interaction (DTI) prediction, showing that a straightforward deep learning architecture that leverages pre-trained protein language models (PLMs) for protein embedding outperforms state of the art approaches, achieving higher accuracy, expanded generalizability, and an order of magnitude faster training. PLM embeddings are found to contain general information that is especially useful in few-shot (small training data set) and zero-shot instances (unseen...
Abstract Despite significant advances in identifying genetic drivers of neurodegenerative disorders, the majority of affected individuals lack a molecular diagnosis, with somatic mutations proposed as one potential contributor to increased risk. Here, we report the first cell-type-specific map of somatic mosaicism in Alzheimer’s Dementia (AlzD), using 4,014 cells from prefrontal cortex samples of 19 AlzD and 17 non-AlzD individuals. We integrate full-transcript single-nucleus RNA-seq (SMART-Seq) with matched...
Many existing methods for estimation of infectious disease transmission networks use a phylogeny of the infecting strains as the basis for transmission network inference, and accurate network inference relies on the accuracy of this underlying evolutionary history. However, phylogenetic reconstruction can be highly error prone, and more sophisticated methods can fail to scale to larger outbreaks, negatively impacting downstream transmission network inference. We introduce a new method, TreeFix-TP, for accurate and scalable reconstruction of transmission phylogenies based on an error-correction framework. Our method uses...
An accurate understanding of the evolutionary history of rapidly-evolving viruses like SARS-CoV-2, responsible for the COVID-19 pandemic, is crucial to tracking and preventing the spread of emerging pathogens. However, viruses undergo frequent recombination, which makes it difficult to trace their evolutionary history using traditional phylogenetic methods. In this study, we present a phylogenetic workflow, virDTL, for analyzing viral evolution in the presence of recombination. Our approach leverages reconciliation methods developed for inferring horizontal gene...
Abstract Background Many existing methods for estimation of infectious disease transmission networks use a phylogeny of the infecting strains as the basis for transmission network inference, and accurate network inference relies on the accuracy of this underlying evolutionary history. However, phylogenetic reconstruction can be highly error prone, and more sophisticated methods can fail to scale to larger outbreaks, negatively impacting downstream transmission network inference. Additionally, there are no currently available methods which are able to use within-host diversity to improve...
Abstract An accurate understanding of the evolutionary history of rapidly-evolving viruses like SARS-CoV-2, responsible for the COVID-19 pandemic, is crucial to tracking and preventing the spread of emerging pathogens. However, viruses undergo frequent recombination, which makes it difficult to trace their evolutionary history using traditional phylogenetic methods. Here, we present a phylogenetic workflow, virDTL, for analyzing viral evolution in the presence of recombination. Our approach leverages reconciliation methods developed for inferring horizontal gene...
Once thought to be a unique capability of the Langerhans Islands in the pancreas of mammals, insulin production is now recognized as an evolutionarily ancient function going back to prokaryotes, ubiquitously present in unicellular eukaryotes, fungi, worm, Drosophila and of course human. While the functionality of the insulin signaling pathway has been experimentally demonstrated in some of these organisms, it has not yet been exploited for pharmacological applications. To enable such applications, we need to understand the extent to which the structure...
Protein-protein interaction (PPI) networks are a fundamental resource for modeling cellular and molecular function, and a large and sophisticated toolbox has been developed to leverage their structure and topological organization to predict the functional roles of under-studied genes, proteins, and pathways. However, the overwhelming majority of experimentally-determined interactions from which such networks are constructed come from a small number of well-studied model organisms. Indeed, most species lack even a single interaction in these databases,...