- Protein Structure and Dynamics
- Machine Learning in Bioinformatics
- Genomics and Phylogenetic Studies
- RNA and protein synthesis mechanisms
- Enzyme Structure and Function
- Complex Network Analysis Techniques
- Microbial Metabolic Engineering and Bioproduction
- Mobile Crowdsensing and Crowdsourcing
- Human Mobility and Location-Based Analysis
- Cell Image Analysis Techniques
- Bioinformatics and Genomic Networks
- Computational and Text Analysis Methods
- Cognitive Science and Education Research
- Explainable Artificial Intelligence (XAI)
- Advanced MRI Techniques and Applications
- Monoclonal and Polyclonal Antibodies Research
- Single-cell and spatial transcriptomics
- Privacy-Preserving Technologies in Data
- Complex Systems and Time Series Analysis
- Reproductive tract infections research
- Topic Modeling
- Computational Drug Discovery Methods
- Advanced Proteomics Techniques and Applications
- Opinion Dynamics and Social Influence
- Historical Art and Architecture Studies
Courant Institute of Mathematical Sciences, 2021-2024
New York University, 2021-2024
Simons Foundation, 2019-2023
Flatiron Institute, 2021
Flatiron Health (United States), 2021
University of Vermont, 2018
The rapid increase in the number of proteins in sequence databases and the diversity of their functions challenge computational approaches for automated function prediction. Here, we introduce DeepFRI, a Graph Convolutional Network for predicting protein functions by leveraging sequence features extracted from a protein language model and protein structures. It outperforms current leading methods and sequence-based Convolutional Neural Networks and scales to the size of current sequence repositories. Augmenting the training set of experimental structures with homology models allows us...
AlphaFold2 revolutionized structural biology with the ability to predict protein structures with exceptionally high accuracy. Its implementation, however, lacks the code and data required to train new models. These are necessary to (1) tackle new tasks, like protein–ligand complex structure prediction, (2) investigate the process by which the model learns, and (3) assess the model's capacity to generalize to unseen regions of fold space. Here we report OpenFold, a fast, memory efficient, and trainable implementation of AlphaFold2. We...
AlphaFold2 revolutionized structural biology with the ability to predict protein structures with exceptionally high accuracy. Its implementation, however, lacks the code and data required to train new models. These are necessary to (i) tackle new tasks, like protein-ligand complex structure prediction, (ii) investigate the process by which the model learns, which remains poorly understood, and (iii) assess the model’s generalization capacity to unseen regions of fold space. Here we report OpenFold, a fast, memory-efficient,...
Exploiting sequence–structure–function relationships in biotechnology requires improved methods for aligning proteins that have low sequence similarity to previously annotated proteins. We develop two deep learning methods to address this gap, TM-Vec and DeepBLAST. TM-Vec allows searching for structure–structure similarities in large sequence databases. It is trained to accurately predict TM-scores as a metric of structural similarity directly from sequence pairs without the need for intermediate computation or solution of structures. Once...
For the past half-century, structural biologists relied on the notion that similar protein sequences give rise to similar structures and functions. While this assumption has driven research to explore certain parts of the protein universe, it disregards spaces that don’t rely on this assumption. Here we explore areas of the protein universe where similar protein functions can be achieved by different sequences and different structures. We predict ~200,000 structures for diverse protein sequences from 1,003 representative genomes across the microbial tree of life and annotate them functionally on a per-residue basis. Structure...
Therapeutic antibody design is a complex multi-property optimization problem that traditionally relies on expensive search through sequence space. Here, we introduce "Lab-in-the-loop," a paradigm shift for antibody design that orchestrates generative machine learning models, multi-task property predictors, active learning ranking and selection, and in vitro experimentation in a semi-autonomous, iterative optimization loop. By automating the design of antibody variants, property prediction, ranking and selection of designs to assay in the lab, and ingestion of in vitro data, we enable a holistic, end-to-end...
The large number of available sequences and the diversity of protein functions challenge current experimental and computational approaches to determining and predicting protein function. We present a deep learning Graph Convolutional Network (GCN) for predicting protein functions and concurrently identifying functionally important residues. This model is initially trained using experimentally determined structures from the Protein Data Bank (PDB) but has significant de-noising capability, with only a minor drop in performance observed when...
Protein design is challenging because it requires searching through a vast combinatorial space that is only sparsely functional. Self-supervised learning approaches offer the potential to navigate through this space more effectively and thereby accelerate protein engineering. We introduce a sequence denoising autoencoder (DAE) that learns the manifold of protein sequences from a large amount of potentially unlabelled proteins. This DAE is combined with a function predictor that guides sampling towards sequences with higher levels of desired...
Exploiting sequence-structure-function relationships in molecular biology and computational modeling relies on detecting proteins with high sequence similarities. However, the most commonly used sequence alignment-based methods, such as BLAST, frequently fail on proteins with low sequence similarity to previously annotated proteins. We developed a deep learning method, TM-Vec, that uses sequence alignments to learn structural features that can then be used to search for structure-structure similarities in large sequence databases. We train TM-Vec...
Computing sequence similarity is a fundamental task in biology, with alignment forming the basis for the annotation of genes and genomes and providing the core data structures for evolutionary analysis. Standard approaches are a mainstay of modern molecular biology and rely on variations of edit distance to obtain explicit alignments between pairs of biological sequences. However, sequence alignment algorithms struggle with remote homology tasks and cannot identify similarities between many pairs of proteins with similar structures and likely homology. Recent work suggests...
Trichomonas vaginalis is the causative agent of the venereal disease trichomoniasis, which infects men and women globally and is associated with serious outcomes during pregnancy and cancers of the human reproductive tract. Trichomonads parasitize a range of hosts in addition to humans, including birds, livestock, and domesticated animals. Recent genetic analysis of trichomonads recovered from columbid birds has provided evidence that these parasite species undergo frequent host-switching, and that a current epoch spillover...
Accurately and efficiently crowdsourcing complex, open-ended tasks can be difficult, as crowd participants tend to favor short, repetitive "microtasks". We study the crowdsourcing of large networks where the crowd provides the network topology via microtasks. Crowds can explore many types of social and information networks, but we focus on the network of causal attributions, an important network that signifies cause-and-effect relationships. We conduct experiments on Amazon Mechanical Turk (AMT) testing how workers propose and validate individual causal relationships...
We resolve difficulties in training and sampling from a discrete generative model by learning a smoothed energy function, sampling from the smoothed data manifold with Langevin Markov chain Monte Carlo (MCMC), and projecting back to the true data manifold with one-step denoising. Our Discrete Walk-Jump Sampling formalism combines the contrastive divergence training of an energy-based model and improved sample quality of a score-based model, while simplifying training and sampling by requiring only a single noise level. We evaluate the robustness of our approach on generative modeling of antibody proteins and introduce...
For the past half-century, structural biologists relied on the notion that similar protein sequences give rise to similar structures and functions. While this assumption has driven research to explore certain parts of the protein universe, it disregards spaces that don’t rely on this assumption. Here we explore areas of the protein universe where similar protein functions can be achieved by different sequences and different structures. We predict ∼200,000 structures for diverse protein sequences from 1,003 representative genomes across the microbial tree of life, and annotate them functionally on a per-residue...
Many research fields codify their findings in standard formats, often by reporting correlations between quantities of interest. But the space of all testable correlates is far larger than scientific resources can currently address, so the ability to accurately predict correlations would be useful to plan research and allocate resources. Using a dataset of approximately 170,000 correlational findings extracted from leading social science journals, we show that a trained neural network can accurately predict the reported correlations using only the text descriptions of the correlates....
Deep generative modeling for biological sequences presents a unique challenge in reconciling the bias-variance trade-off between explicit biological insight and model flexibility. The deep manifold sampler was recently proposed as a means to iteratively sample variable-length protein sequences by exploiting the gradients from a function predictor. We introduce an alternative approach to this guided sampling procedure, multi-segment preserving sampling, that enables the direct inclusion of domain-specific knowledge by designating...
Multiple sequence alignments (MSAs) of proteins encode rich biological information and have been workhorses in bioinformatic methods for tasks like protein design and protein structure prediction for decades. Recent breakthroughs like AlphaFold2 that use transformers to attend directly over large quantities of raw MSAs have reaffirmed their importance. Generation of MSAs is highly computationally intensive, however, and no datasets comparable to those used to train AlphaFold2 have been made available to the research community, hindering progress in machine...
Cause-and-effect reasoning, the attribution of effects to causes, is one of the most powerful and unique skills humans possess. Multiple surveys are mapping out causal attributions as networks, but it is unclear how well these efforts can be combined. Further, the total size of the collective causal attribution network held by humans is currently unknown, making it challenging to assess the progress of these surveys. Here we study three causal attribution networks to determine how well they can be combined into a single network. Combining these networks requires dealing with ambiguous nodes, as nodes represent...