- Genomics and Phylogenetic Studies
- Machine Learning in Bioinformatics
- Protein Structure and Dynamics
- RNA and protein synthesis mechanisms
- Microbial Community Ecology and Physiology
- Enzyme Structure and Function
- Bioinformatics and Genomic Networks
- Bacteriophages and microbial interactions
- Gut microbiota and health
- Gene expression and cancer classification
- Glycosylation and Glycoproteins Research
- Remote Sensing and LiDAR Applications
- Advanced Proteomics Techniques and Applications
- Genetics, Bioinformatics, and Biomedical Research
- Remote Sensing in Agriculture
- Algorithms and Data Compression
- Data-Driven Disease Surveillance
- Microbial Metabolic Engineering and Bioproduction
- Tryptophan and brain disorders
- RNA modifications and cancer
- Mycorrhizal Fungi and Plant Interactions
- Microbial Natural Products and Biosynthesis
- Forest ecology and management
- Chromosomal and Genetic Variations
- Genetic diversity and population structure
Seoul National University, 2016-2025
Institute of Molecular Biology and Genetics, 2021-2025
Weizmann Institute of Science, 2024
Instituto de Biomedicina y Genética Molecular de Valladolid, 2024
Max Planck Institute for Multidisciplinary Sciences, 2024
Max Planck Institute for Biophysical Chemistry, 2016-2021
The University of Tokyo, 2021
Michigan State University, 2021
Harvard University Press, 2021
Johns Hopkins University, 2018-2020
Proteins are essential to life, and understanding their structure can facilitate a mechanistic understanding of their function. Through an enormous experimental effort 1–4, the structures of around 100,000 unique proteins have been determined 5, but this represents a small fraction of the billions of known protein sequences 6,7. Structural coverage is bottlenecked by the months to years of painstaking effort required to determine a single protein structure. Accurate computational approaches are needed to address this gap and to enable large-scale structural...
ColabFold offers accelerated prediction of protein structures and complexes by combining the fast homology search of MMseqs2 with AlphaFold2 or RoseTTAFold. ColabFold’s 40−60-fold faster search and optimized model utilization enables prediction of close to 1,000 structures per day on a server with one graphics processing unit. Coupled with Google Colaboratory, ColabFold becomes a free and accessible platform for protein folding. ColabFold is open-source software available at https://github.com/sokrypton/ColabFold and its novel environmental databases are...
Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models (LMs) taken from Natural Language Processing (NLP). These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393 billion amino acids. The protein LMs (pLMs) were trained on the Summit supercomputer using 5616 GPUs and TPU Pod up-to 1024 cores. Dimensionality...
As structure prediction methods are generating millions of publicly available protein structures, searching these databases is becoming a bottleneck. Foldseek aligns the structure of a query protein against a database by describing tertiary amino acid interactions within proteins as sequences over a structural alphabet. Foldseek decreases computation times by four to five orders of magnitude with 86%, 88% and 133% of the sensitivities of Dali, TM-align and CE, respectively.
HH-suite is a widely used open source software suite for sensitive sequence similarity searches and protein fold recognition. It is based on pairwise alignment of profile Hidden Markov models (HMMs), which represent multiple sequence alignments of homologous proteins. We developed a single-instruction multiple-data (SIMD) vectorized implementation of the Viterbi algorithm for profile HMM alignment and introduced various other speed-ups. These accelerated the search methods HHsearch by a factor of 4 and HHblits by a factor of 2 over the previous version 2.0.16. HHblits3...
Metagenomic datasets contain billions of protein sequences that could greatly enhance large-scale functional annotation and structure prediction. Utilizing this enormous resource would require reducing its redundancy by similarity clustering. However, clustering hundreds of millions of sequences is impractical using current algorithms because their runtimes scale as the input set size N times the number of clusters K, which is typically of similar order as N, resulting in runtimes that increase almost quadratically with N. We developed...
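The N times K scaling described in this abstract can be illustrated with a small sketch. This is not Linclust or any published tool: it is a generic greedy incremental clustering loop, and the names `shared_kmer_fraction` and `greedy_cluster`, the k-mer-based similarity, and the 0.5 threshold are all assumptions chosen only to make the comparison count visible.

```python
def shared_kmer_fraction(a: str, b: str, k: int = 3) -> float:
    """Toy similarity (an assumption): share of the smaller k-mer set found in both."""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    if not kmers_a or not kmers_b:
        return 0.0
    return len(kmers_a & kmers_b) / min(len(kmers_a), len(kmers_b))


def greedy_cluster(seqs, threshold=0.5):
    """Greedy incremental clustering.

    Each sequence is compared against the existing cluster representatives
    until one is similar enough; otherwise it founds a new cluster.
    Returns (clusters, number of pairwise comparisons performed).
    """
    representatives = []  # one representative sequence per cluster
    clusters = []         # cluster members, parallel to representatives
    comparisons = 0
    for seq in seqs:
        for idx, rep in enumerate(representatives):
            comparisons += 1
            if shared_kmer_fraction(seq, rep) >= threshold:
                clusters[idx].append(seq)
                break
        else:
            # No representative matched: this sequence starts a new cluster.
            representatives.append(seq)
            clusters.append([seq])
    return clusters, comparisons
```

When every sequence ends up in its own cluster (K equal to N), the loop performs N(N-1)/2 comparisons, which is the near-quadratic growth the abstract refers to; when all sequences fall into one cluster, it performs only N-1.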
We present three clustered protein sequence databases, Uniclust90, Uniclust50, Uniclust30 and three databases of multiple sequence alignments (MSAs), Uniboost10, Uniboost20 and Uniboost30, as a resource for protein sequence analysis, function prediction and sequence searches. The Uniclust databases cluster UniProtKB sequences at the level of 90%, 50% and 30% pairwise sequence identity. Uniclust90 and Uniclust50 clusters showed better consistency of functional annotation than those of UniRef90 and UniRef50, owing to an optimised clustering pipeline that runs with our MMseqs2...
The MPI Bioinformatics Toolkit (https://toolkit.tuebingen.mpg.de) provides interactive access to a wide range of the best-performing bioinformatics tools and databases, including the state-of-the-art protein sequence comparison methods HHblits and HHpred. The Toolkit currently includes 35 external and in-house tools, covering functionalities such as sequence similarity searching, prediction of sequence features, and sequence classification. Due to this breadth of functionality, the tight interconnection of its constituent tools, and its ease of use, the Toolkit has become an important...
The AlphaFold Protein Structure Database (AlphaFold DB, https://alphafold.ebi.ac.uk) has significantly impacted structural biology by amassing over 214 million predicted protein structures, expanding from the initial 300k structures released in 2021. Enabled by the groundbreaking AlphaFold2 artificial intelligence (AI) system, the predictions archived in AlphaFold DB have been integrated into primary data resources such as PDB, UniProt, Ensembl, InterPro and MobiDB. Our manuscript details subsequent enhancements...
ColabFold offers accelerated protein structure and complex predictions by combining the fast homology search of MMseqs2 with AlphaFold2 or RoseTTAFold. ColabFold’s 40-60× faster search and optimized model use allows predicting close to a thousand structures per day on a server with one GPU. Coupled with Google Colaboratory, ColabFold becomes a free and accessible platform for protein folding. ColabFold is open-source software available at github.com/sokrypton/ColabFold. Its novel environmental databases are available at colabfold.mmseqs.com. Contact...
The MMseqs2 desktop and web server app facilitates interactive sequence searches through custom protein sequence and profile databases on personal workstations. By eliminating MMseqs2's runtime overhead, we reduced response times to a few seconds at sensitivities close to BLAST. The app is easy to install for non-experts. GPLv3-licensed code, pre-built desktop app packages for Windows, MacOS and Linux, Docker images for the web server application and a demo web server are available at https://search.mmseqs.com. Supplementary data are available at Bioinformatics online.
We describe the operation and improvement of AlphaFold, the system that was entered by the team AlphaFold2 to the "human" category in the 14th Critical Assessment of Protein Structure Prediction (CASP14). The AlphaFold system entered in CASP14 is entirely different to the one entered in CASP13. It used a novel end-to-end deep neural network trained to produce protein structures from amino acid sequence, multiple sequence alignments, and homologous proteins. In the assessors' ranking by summed z scores (>2.0), AlphaFold scored 244.0 compared to 90.8 by the next best group....
As structure prediction methods are generating millions of publicly available protein structures, searching these databases is becoming a bottleneck. Foldseek aligns the structure of a query protein against a database by describing the amino acid backbone of proteins as sequences over a structural alphabet. Foldseek decreases computation times by four to five orders of magnitude with 86%, 88% and 133% of the sensitivities of DALI, TM-align and CE, respectively.
Motivation: Sequence databases are growing fast, challenging existing analysis pipelines. Reducing the redundancy of sequence databases by similarity clustering improves speed and sensitivity of iterative searches. But existing tools cannot efficiently cluster databases of the size of UniProt to 50% maximum pairwise sequence identity or below. Furthermore, in metagenomics experiments typically large fractions of reads cannot be matched to any known sequence anymore because searching with sensitive but relatively slow tools (e.g. BLAST or HMMER3) through...
Since 1992 PredictProtein (https://predictprotein.org) is a one-stop online resource for protein sequence analysis with its main site hosted at the Luxembourg Centre for Systems Biomedicine (LCSB) and queried monthly by over 3,000 users in 2020. PredictProtein was the first Internet server for protein predictions. It pioneered combining evolutionary information and machine learning. Given a protein sequence as input, the server outputs multiple sequence alignments, predictions of protein structure in 1D and 2D (secondary structure, solvent accessibility, transmembrane...
MMseqs2 taxonomy is a new tool to assign taxonomic labels to metagenomic contigs. It extracts all possible protein fragments from each contig, quickly retains those that can contribute to taxonomic annotation, assigns them with robust labels and determines the contig's taxonomic identity by weighted voting. Its fragment extraction step is suitable for the analysis of all domains of life. MMseqs2 taxonomy is 2-18× faster than state-of-the-art tools and also contains new modules for creating and manipulating taxonomic reference databases as well as reporting and visualizing taxonomic...
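The final step named in this abstract, deciding a contig's identity by weighted voting over its labelled fragments, can be sketched as follows. The abstract does not specify how weights are computed or how the vote is resolved, so everything here is an assumption: the function name `vote_taxon`, the representation of a label as a root-to-leaf lineage tuple, and the rule of keeping the deepest taxon backed by strictly more than half of the total weight. It is a generic majority-vote sketch, not the tool's implementation.

```python
from collections import defaultdict


def vote_taxon(fragment_hits, min_support=0.5):
    """Pick a contig-level label from weighted fragment labels.

    fragment_hits: list of (lineage, weight) pairs, where lineage is a tuple
    of taxon names ordered from root to leaf (hypothetical input format).
    Returns the deepest lineage prefix whose share of the total weight is
    strictly greater than min_support, or () if none qualifies.
    """
    total = sum(weight for _, weight in fragment_hits)
    if total <= 0:
        return ()
    # Every fragment supports its own taxon and all of that taxon's ancestors.
    support = defaultdict(float)
    for lineage, weight in fragment_hits:
        for depth in range(1, len(lineage) + 1):
            support[lineage[:depth]] += weight
    best = ()
    for prefix, weight in support.items():
        if weight / total > min_support and len(prefix) > len(best):
            best = prefix
    return best


hits = [
    (("Bacteria", "Proteobacteria", "Escherichia"), 3.0),
    (("Bacteria", "Proteobacteria", "Salmonella"), 2.0),
    (("Archaea",), 1.0),
]
label = vote_taxon(hits)  # ("Bacteria", "Proteobacteria")
```

With the default threshold of 0.5 and a strict comparison, at most one taxon per rank can qualify, so the result does not depend on input order. Lower thresholds would need an explicit tie-breaking rule.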
Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models taken from NLP. These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393 billion amino acids. The LMs were trained on the Summit supercomputer using 5616 GPUs and TPU Pod up-to 1024 cores. Dimensionality reduction revealed that the raw...
Genomic analyses are sensitive to contamination in public databases caused by incorrectly labeled reference sequences. Here, we describe Conterminator, an efficient method to detect and remove incorrectly labeled sequences by an exhaustive all-against-all sequence comparison. Our analysis reports contamination of 2,161,746, 114,035, and 14,148 sequences in the RefSeq, GenBank, and NR databases, respectively, spanning the whole range from draft to “complete” model organism genomes. Our method scales linearly with input size and can process 3.3 TB of data in 12 days on a 32-core...
Proteins are key to all cellular processes and their structure is important in understanding their function and evolution. Sequence-based predictions of protein structures have increased in accuracy 1, and over 214 million predicted structures are available in the AlphaFold database 2. However, studying protein structures at this scale requires highly efficient methods. Here, we developed a structural-alignment-based clustering algorithm, Foldseek cluster, that can cluster hundreds of millions of structures. Using this method, we clustered the AlphaFold database, identifying...
Genes of unknown function are among the biggest challenges in molecular biology, especially in microbial systems, where 40-60% of the predicted genes are unknown. Despite previous attempts, systematic approaches to include the unknown fraction into analytical workflows are still lacking. Here, we present a conceptual framework, its translation into the computational workflow AGNOSTOS and a demonstration on how we can bridge the known-unknown gap in genomes and metagenomes. By analyzing 415,971,742 genes predicted from 1749 metagenomes and 28,941 bacterial and archaeal...
Deep-learning (DL) methods like DeepMind’s AlphaFold2 (AF2) have led to substantial improvements in protein structure prediction. We analyse confident AF2 models from 21 model organisms using a new classification protocol (CATH-Assign) which exploits novel DL methods for structural comparison and classification. Of ~370,000 confident models, 92% can be assigned to 3253 superfamilies in our CATH domain superfamily classification. The remaining models cluster into 2367 putative novel superfamilies. Detailed manual analysis on 618 of...
Adapting large language models (LLMs) to protein sequences spawned the development of powerful protein language models (pLMs). Concurrently, AlphaFold2 broke through in protein structure prediction. Now we can systematically and comprehensively explore the dual nature of proteins that act and exist as three-dimensional (3D) machines and evolve as linear strings of one-dimensional (1D) sequences. Here, we leverage pLMs to simultaneously model both modalities by combining 1D sequences with 3D structure in a single model. We encode protein structures as token sequences using...