Martin Steinegger

ORCID: 0000-0001-8781-9753
Research Areas
  • Genomics and Phylogenetic Studies
  • Machine Learning in Bioinformatics
  • Protein Structure and Dynamics
  • RNA and protein synthesis mechanisms
  • Microbial Community Ecology and Physiology
  • Enzyme Structure and Function
  • Bioinformatics and Genomic Networks
  • Bacteriophages and microbial interactions
  • Gut microbiota and health
  • Gene expression and cancer classification
  • Glycosylation and Glycoproteins Research
  • Remote Sensing and LiDAR Applications
  • Advanced Proteomics Techniques and Applications
  • Genetics, Bioinformatics, and Biomedical Research
  • Remote Sensing in Agriculture
  • Algorithms and Data Compression
  • Data-Driven Disease Surveillance
  • Microbial Metabolic Engineering and Bioproduction
  • Tryptophan and brain disorders
  • RNA modifications and cancer
  • Mycorrhizal Fungi and Plant Interactions
  • Microbial Natural Products and Biosynthesis
  • Forest ecology and management
  • Chromosomal and Genetic Variations
  • Genetic diversity and population structure

Seoul National University
2016-2025

Institute of Molecular Biology and Genetics
2021-2025

Weizmann Institute of Science
2024

Instituto de Biomedicina y Genética Molecular de Valladolid
2024

Max Planck Institute for Multidisciplinary Sciences
2024

Max Planck Institute for Biophysical Chemistry
2016-2021

The University of Tokyo
2021

Michigan State University
2021

Harvard University Press
2021

Johns Hopkins University
2018-2020

Proteins are essential to life, and understanding their structure can facilitate a mechanistic understanding of their function. Through an enormous experimental effort 1–4, the structures of around 100,000 unique proteins have been determined 5, but this represents a small fraction of the billions of known protein sequences 6,7. Structural coverage is bottlenecked by the months to years of painstaking effort required to determine a single protein structure. Accurate computational approaches are needed to address this gap and to enable large-scale structural...

10.1038/s41586-021-03819-2 article EN cc-by Nature 2021-07-15

ColabFold offers accelerated prediction of protein structures and complexes by combining the fast homology search of MMseqs2 with AlphaFold2 or RoseTTAFold. ColabFold’s 40−60-fold faster search and optimized model utilization enables prediction of close to 1,000 structures per day on a server with one graphics processing unit. Coupled with Google Colaboratory, ColabFold becomes a free and accessible platform for protein folding. ColabFold is open-source software available at https://github.com/sokrypton/ColabFold and its novel environmental databases are...

10.1038/s41592-022-01488-1 article EN cc-by Nature Methods 2022-05-30

Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models (LMs) taken from Natural Language Processing (NLP). These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393 billion amino acids. The protein LMs (pLMs) were trained on the Summit supercomputer using 5616 GPUs and a TPU Pod with up to 1024 cores. Dimensionality...

10.1109/tpami.2021.3095381 article EN cc-by IEEE Transactions on Pattern Analysis and Machine Intelligence 2021-07-07

As structure prediction methods are generating millions of publicly available protein structures, searching these databases is becoming a bottleneck. Foldseek aligns the structure of a query protein against a database by describing tertiary amino acid interactions within proteins as sequences over a structural alphabet. Foldseek decreases computation times by four to five orders of magnitude with 86%, 88% and 133% of the sensitivities of Dali, TM-align and CE, respectively.

10.1038/s41587-023-01773-0 article EN cc-by Nature Biotechnology 2023-05-08

HH-suite is a widely used open source software suite for sensitive sequence similarity searches and protein fold recognition. It is based on pairwise alignment of profile Hidden Markov models (HMMs), which represent multiple sequence alignments of homologous proteins. We developed a single-instruction multiple-data (SIMD) vectorized implementation of the Viterbi algorithm for profile HMM alignment and introduced various other speed-ups. These accelerated the search methods HHsearch by a factor of 4 and HHblits by a factor of 2 over the previous version 2.0.16. HHblits3...

10.1186/s12859-019-3019-7 article EN cc-by BMC Bioinformatics 2019-09-14

Metagenomic datasets contain billions of protein sequences that could greatly enhance large-scale functional annotation and structure prediction. Utilizing this enormous resource would require reducing its redundancy by similarity clustering. However, clustering hundreds of millions of sequences is impractical using current algorithms because their runtimes scale as the input set size N times the number of clusters K, which is typically of similar order as N, resulting in runtimes that increase almost quadratically with N. We developed...

10.1038/s41467-018-04964-5 article EN cc-by Nature Communications 2018-06-25

We present three clustered protein sequence databases, Uniclust90, Uniclust50 and Uniclust30, and three databases of multiple sequence alignments (MSAs), Uniboost10, Uniboost20 and Uniboost30, as a resource for protein sequence analysis, function prediction and sequence searches. The Uniclust databases cluster UniProtKB sequences at the level of 90%, 50% and 30% pairwise sequence identity. Uniclust90 and Uniclust50 clusters showed better consistency of functional annotation than those of UniRef90 and UniRef50, owing to an optimised clustering pipeline that runs with our MMseqs2...

10.1093/nar/gkw1081 article EN cc-by Nucleic Acids Research 2016-11-01

The MPI Bioinformatics Toolkit (https://toolkit.tuebingen.mpg.de) provides interactive access to a wide range of the best-performing bioinformatics tools and databases, including the state-of-the-art protein sequence comparison methods HHblits and HHpred. The Toolkit currently includes 35 external and in-house tools, covering functionalities such as sequence similarity searching, prediction of sequence features, and sequence classification. Due to this breadth of functionality, the tight interconnection of its constituent tools, and its ease of use, the Toolkit has become an important...

10.1002/cpbi.108 article EN cc-by Current Protocols in Bioinformatics 2020-12-01

The AlphaFold Protein Structure Database (AlphaFold DB, https://alphafold.ebi.ac.uk) has significantly impacted structural biology by amassing over 214 million predicted protein structures, expanding from the initial 300k structures released in 2021. Enabled by the groundbreaking AlphaFold2 artificial intelligence (AI) system, the predictions archived in AlphaFold DB have been integrated into primary data resources such as PDB, UniProt, Ensembl, InterPro and MobiDB. Our manuscript details subsequent enhancements...

10.1093/nar/gkad1011 article EN cc-by Nucleic Acids Research 2023-11-02

ColabFold offers accelerated protein structure and complex predictions by combining the fast homology search of MMseqs2 with AlphaFold2 or RoseTTAFold. ColabFold’s 40-60× faster search and optimized model use allows predicting close to a thousand structures per day on a server with one GPU. Coupled with Google Colaboratory, ColabFold becomes a free and accessible platform for protein folding. ColabFold is open-source software available at github.com/sokrypton/ColabFold. Its novel environmental databases are available at colabfold.mmseqs.com. Contact...

10.1101/2021.08.15.456425 preprint EN cc-by-nc-nd bioRxiv (Cold Spring Harbor Laboratory) 2021-08-15

The MMseqs2 desktop and web server app facilitates interactive sequence searches through custom protein sequence and profile databases on personal workstations. By eliminating MMseqs2's runtime overhead, we reduced response times to a few seconds at sensitivities close to BLAST. The app is easy to install for non-experts. GPLv3-licensed code, pre-built desktop app packages for Windows, MacOS and Linux, Docker images for the web server application and a demo web server are available at https://search.mmseqs.com. Supplementary data are available at Bioinformatics online.

10.1093/bioinformatics/bty1057 article EN cc-by Bioinformatics 2019-01-04

We describe the operation and improvement of AlphaFold, the system that was entered by the team AlphaFold2 to the "human" category in the 14th Critical Assessment of Protein Structure Prediction (CASP14). The AlphaFold system entered in CASP14 is entirely different to the one entered in CASP13. It used a novel end-to-end deep neural network trained to produce protein structures from amino acid sequence, multiple sequence alignments, and homologous proteins. In the assessors' ranking by summed z scores (>2.0), AlphaFold scored 244.0 compared to 90.8 by the next best group....

10.1002/prot.26257 article EN Proteins Structure Function and Bioinformatics 2021-10-04

As structure prediction methods are generating millions of publicly available protein structures, searching these databases is becoming a bottleneck. Foldseek aligns the structure of a query protein against a database by describing the amino acid backbone of proteins as sequences over a structural alphabet. Foldseek decreases computation times by four to five orders of magnitude with 86%, 88% and 133% of the sensitivities of DALI, TM-align and CE, respectively.

10.1101/2022.02.07.479398 preprint EN cc-by bioRxiv (Cold Spring Harbor Laboratory) 2022-02-09

Motivation: Sequence databases are growing fast, challenging existing analysis pipelines. Reducing the redundancy of sequence databases by similarity clustering improves speed and sensitivity of iterative searches. But existing tools cannot efficiently cluster databases of the size of UniProt to 50% maximum pairwise sequence identity or below. Furthermore, in metagenomics experiments typically large fractions of reads cannot be matched to any known sequence anymore because searching with sensitive but relatively slow tools (e.g. BLAST or HMMER3) through...

10.1093/bioinformatics/btw006 article EN Bioinformatics 2016-01-06

Since 1992 PredictProtein (https://predictprotein.org) is a one-stop online resource for protein sequence analysis with its main site hosted at the Luxembourg Centre for Systems Biomedicine (LCSB) and queried monthly by over 3,000 users in 2020. PredictProtein was the first Internet server for protein predictions. It pioneered combining evolutionary information and machine learning. Given a protein sequence as input, the server outputs multiple sequence alignments, predictions of protein structure in 1D and 2D (secondary structure, solvent accessibility, transmembrane...

10.1093/nar/gkab354 article EN cc-by Nucleic Acids Research 2021-05-11

MMseqs2 taxonomy is a new tool to assign taxonomic labels to metagenomic contigs. It extracts all possible protein fragments from each contig, quickly retains those that can contribute to taxonomic annotation, assigns them with robust labels and determines the contig's taxonomic identity by weighted voting. Its fragment extraction step is suitable for the analysis of all domains of life. MMseqs2 taxonomy is 2-18× faster than state-of-the-art tools and also contains new modules for creating and manipulating taxonomic reference databases as well as reporting and visualizing...

10.1093/bioinformatics/btab184 article EN cc-by Bioinformatics 2021-03-16

Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models taken from NLP. These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393 billion amino acids. The LMs were trained on the Summit supercomputer using 5616 GPUs and a TPU Pod with up to 1024 cores. Dimensionality reduction revealed that the raw...

10.48550/arxiv.2007.06225 preprint EN other-oa arXiv (Cornell University) 2020-01-01

Genomic analyses are sensitive to contamination in public databases caused by incorrectly labeled reference sequences. Here, we describe Conterminator, an efficient method to detect and remove incorrectly labeled sequences by an exhaustive all-against-all sequence comparison. Our analysis reports contamination of 2,161,746, 114,035, and 14,148 sequences in the RefSeq, GenBank, and NR databases, respectively, spanning the whole range from draft to “complete” model organism genomes. Our method scales linearly with input size and can process 3.3 TB of genomic sequence data in 12 days on a 32-core...

10.1186/s13059-020-02023-1 article EN cc-by Genome biology 2020-05-12

Proteins are key to all cellular processes and their structure is important in understanding their function and evolution. Sequence-based predictions of protein structures have increased in accuracy 1, and over 214 million predicted structures are available in the AlphaFold database 2. However, studying protein structures at this scale requires highly efficient methods. Here, we developed a structural-alignment-based clustering algorithm, Foldseek cluster, that can cluster hundreds of millions of structures. Using this method, we clustered the AlphaFold database, identifying...

10.1038/s41586-023-06510-w article EN cc-by Nature 2023-09-13

Genes of unknown function are among the biggest challenges in molecular biology, especially in microbial systems, where 40-60% of the predicted genes are unknown. Despite previous attempts, systematic approaches to include the unknown fraction into analytical workflows are still lacking. Here, we present a conceptual framework, its translation into the computational workflow AGNOSTOS and a demonstration on how we can bridge the known-unknown gap in genomes and metagenomes. By analyzing 415,971,742 genes predicted from 1749 metagenomes and 28,941 bacterial and archaeal...

10.7554/elife.67667 article EN cc-by eLife 2022-03-31

Deep-learning (DL) methods like DeepMind’s AlphaFold2 (AF2) have led to substantial improvements in protein structure prediction. We analyse confident AF2 models from 21 model organisms using a new classification protocol (CATH-Assign) which exploits novel DL methods for structural comparison and classification. Of ~370,000 confident models, 92% can be assigned to 3253 superfamilies in our CATH domain superfamily classification. The remaining models cluster into 2367 putative novel superfamilies. Detailed manual analysis on 618 of...

10.1038/s42003-023-04488-9 article EN cc-by Communications Biology 2023-02-08

Adapting large language models (LLMs) to protein sequences spawned the development of powerful protein language models (pLMs). Concurrently, AlphaFold2 broke through in protein structure prediction. Now we can systematically and comprehensively explore the dual nature of proteins that act and exist as three-dimensional (3D) machines and evolve as linear strings of one-dimensional (1D) sequences. Here, we leverage pLMs to simultaneously model both modalities by combining 1D sequences with 3D structure in a single model. We encode protein structures as token sequences using...

10.1101/2023.07.23.550085 preprint EN cc-by bioRxiv (Cold Spring Harbor Laboratory) 2023-07-25