NFDI4DS | UHH-SEMS - Publication Details

Martin Steinegger

ORCID: 0000-0001-8781-9753

About

Contact & Profiles

A5019985343

Research Areas

Genomics and Phylogenetic Studies
Machine Learning in Bioinformatics
Protein Structure and Dynamics
RNA and protein synthesis mechanisms
Microbial Community Ecology and Physiology
Enzyme Structure and Function
Bioinformatics and Genomic Networks
Bacteriophages and microbial interactions
Gut microbiota and health
Gene expression and cancer classification
Glycosylation and Glycoproteins Research
Remote Sensing and LiDAR Applications
Advanced Proteomics Techniques and Applications
Genetics, Bioinformatics, and Biomedical Research
Remote Sensing in Agriculture
Algorithms and Data Compression
Data-Driven Disease Surveillance
Microbial Metabolic Engineering and Bioproduction
Tryptophan and brain disorders
RNA modifications and cancer
Mycorrhizal Fungi and Plant Interactions
Microbial Natural Products and Biosynthesis
Forest ecology and management
Chromosomal and Genetic Variations
Genetic diversity and population structure

Seoul National University
2016-2025

Institute of Molecular Biology and Genetics
2021-2025

Weizmann Institute of Science
2024

Instituto de Biomedicina y Genética Molecular de Valladolid
2024

Max Planck Institute for Multidisciplinary Sciences
2024

Max Planck Institute for Biophysical Chemistry
2016-2021

The University of Tokyo
2021

Michigan State University
2021

Harvard University Press
2021

Johns Hopkins University
2018-2020

Highly accurate protein structure prediction with AlphaFold

OPENALEX - Publications

John Jumper Richard Evans Alexander Pritzel Tim Green Michael Figurnov and 29 more

Proteins are essential to life, and understanding their structure can facilitate a mechanistic understanding of their function. Through an enormous experimental effort [1–4], the structures of around 100,000 unique proteins have been determined [5], but this represents a small fraction of the billions of known protein sequences [6,7]. Structural coverage is bottlenecked by the months to years of painstaking effort required to determine a single protein structure. Accurate computational approaches are needed to address this gap and to enable large-scale structural...

10.1038/s41586-021-03819-2 article EN cc-by Nature 2021-07-15

ColabFold: making protein folding accessible to all

Milot Mirdita Konstantin Schütze Yoshitaka Moriwaki Lim Heo Sergey Ovchinnikov and 1 more

ColabFold offers accelerated prediction of protein structures and complexes by combining the fast homology search of MMseqs2 with AlphaFold2 or RoseTTAFold. ColabFold’s 40−60-fold faster search and optimized model utilization enables prediction of close to 1,000 structures per day on a server with one graphics processing unit. Coupled with Google Colaboratory, ColabFold becomes a free and accessible platform for protein folding. ColabFold is open-source software available at https://github.com/sokrypton/ColabFold and its novel environmental databases are...

10.1038/s41592-022-01488-1 article EN cc-by Nature Methods 2022-05-30

MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets

Martin Steinegger Johannes Söding

10.1038/nbt.3988 article EN Nature Biotechnology 2017-10-16

ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning

Ahmed Elnaggar Michael Heinzinger Christian Dallago Ghalia Rehawi Yu Wang and 7 more

Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models (LMs) taken from Natural Language Processing (NLP). These LMs reach new prediction frontiers at low inference costs. Here, we trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393 billion amino acids. The protein LMs (pLMs) were trained on the Summit supercomputer using 5616 GPUs and TPU Pod up-to 1024 cores. Dimensionality...

10.1109/tpami.2021.3095381 article EN cc-by IEEE Transactions on Pattern Analysis and Machine Intelligence 2021-07-07

Fast and accurate protein structure search with Foldseek

Michel van Kempen Stephanie Kim Charlotte Tumescheit Milot Mirdita Jeong-Jae Lee and 3 more

As structure prediction methods are generating millions of publicly available protein structures, searching these databases is becoming a bottleneck. Foldseek aligns the structure of a query protein against a database by describing tertiary amino acid interactions within proteins as sequences over a structural alphabet. Foldseek decreases computation times by four to five orders of magnitude with 86%, 88% and 133% of the sensitivities of Dali, TM-align and CE, respectively.

10.1038/s41587-023-01773-0 article EN cc-by Nature Biotechnology 2023-05-08

HH-suite3 for fast remote homology detection and deep protein annotation

Martin Steinegger Markus Meier Milot Mirdita Harald Vöhringer Stephan J. Haunsberger and 1 more

HH-suite is a widely used open source software suite for sensitive sequence similarity searches and protein fold recognition. It is based on pairwise alignment of profile Hidden Markov models (HMMs), which represent multiple sequence alignments of homologous proteins. We developed a single-instruction multiple-data (SIMD) vectorized implementation of the Viterbi algorithm for profile HMM alignment and introduced various other speed-ups. These accelerated the search methods HHsearch by a factor of 4 and HHblits by a factor of 2 over the previous version 2.0.16. HHblits3...

10.1186/s12859-019-3019-7 article EN cc-by BMC Bioinformatics 2019-09-14

Clustering huge protein sequence sets in linear time

Martin Steinegger Johannes Söding

Metagenomic datasets contain billions of protein sequences that could greatly enhance large-scale functional annotation and structure prediction. Utilizing this enormous resource would require reducing its redundancy by similarity clustering. However, clustering hundreds of millions of sequences is impractical using current algorithms because their runtimes scale as the input set size N times the number of clusters K, which is typically of similar order as N, resulting in runtimes that increase almost quadratically with N. We developed...

10.1038/s41467-018-04964-5 article EN cc-by Nature Communications 2018-06-25

Uniclust databases of clustered and deeply annotated protein sequences and alignments

Milot Mirdita Lars von den Driesch Clovis Galiez María Martín Johannes Söding and 1 more

We present three clustered protein sequence databases, Uniclust90, Uniclust50 and Uniclust30, and three databases of multiple sequence alignments (MSAs), Uniboost10, Uniboost20 and Uniboost30, as a resource for protein sequence analysis, function prediction and sequence searches. The Uniclust databases cluster UniProtKB sequences at the level of 90%, 50% and 30% pairwise sequence identity. Uniclust90 and Uniclust50 clusters showed better consistency of functional annotation than those of UniRef90 and UniRef50, owing to an optimised clustering pipeline that runs with our MMseqs2...

10.1093/nar/gkw1081 article EN cc-by Nucleic Acids Research 2016-11-01

Protein Sequence Analysis Using the MPI Bioinformatics Toolkit

Felix Gabler Seung‐Zin Nam Sebastian Till Milot Mirdita Martin Steinegger and 3 more

The MPI Bioinformatics Toolkit (https://toolkit.tuebingen.mpg.de) provides interactive access to a wide range of the best-performing bioinformatics tools and databases, including the state-of-the-art protein sequence comparison methods HHblits and HHpred. The Toolkit currently includes 35 external and in-house tools, covering functionalities such as sequence similarity searching, prediction of sequence features, and sequence classification. Due to this breadth of functionality, the tight interconnection of its constituent tools, and its ease of use, the Toolkit has become an important...

10.1002/cpbi.108 article EN cc-by Current Protocols in Bioinformatics 2020-12-01

AlphaFold Protein Structure Database in 2024: providing structure coverage for over 214 million protein sequences

Mihály Váradi Damian Bertoni Paulyna Magaña Urmila Paramval Ivanna Pidruchna and 18 more

The AlphaFold Protein Structure Database (AlphaFold DB, https://alphafold.ebi.ac.uk) has significantly impacted structural biology by amassing over 214 million predicted protein structures, expanding from the initial 300k structures released in 2021. Enabled by the groundbreaking AlphaFold2 artificial intelligence (AI) system, the predictions archived in AlphaFold DB have been integrated into primary data resources such as PDB, UniProt, Ensembl, InterPro and MobiDB. Our manuscript details subsequent enhancements...

10.1093/nar/gkad1011 article EN cc-by Nucleic Acids Research 2023-11-02

ColabFold - Making protein folding accessible to all

Milot Mirdita Konstantin Schütze Yoshitaka Moriwaki Lim Heo Sergey Ovchinnikov and 1 more

ColabFold offers accelerated protein structure and complex predictions by combining the fast homology search of MMseqs2 with AlphaFold2 or RoseTTAFold. ColabFold’s 40-60× faster search and optimized model use allows predicting close to a thousand structures per day on a server with one GPU. Coupled with Google Colaboratory, ColabFold becomes a free and accessible platform for protein folding. ColabFold is open-source software available at github.com/sokrypton/ColabFold. Its novel environmental databases are available at colabfold.mmseqs.com. Contact...

10.1101/2021.08.15.456425 preprint EN cc-by-nc-nd bioRxiv (Cold Spring Harbor Laboratory) 2021-08-15

MMseqs2 desktop and local web server app for fast, interactive sequence searches

Milot Mirdita Martin Steinegger Johannes Söding

The MMseqs2 desktop and web server app facilitates interactive sequence searches through custom protein sequence and profile databases on personal workstations. By eliminating MMseqs2's runtime overhead, we reduced response times to a few seconds at sensitivities close to BLAST. The app is easy to install for non-experts. GPLv3-licensed code, pre-built desktop app packages for Windows, MacOS and Linux, Docker images for the web server application and a demo web server are available at https://search.mmseqs.com. Supplementary data are available at Bioinformatics online.

10.1093/bioinformatics/bty1057 article EN cc-by Bioinformatics 2019-01-04

Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold

Martin Steinegger Milot Mirdita Johannes Söding

10.1038/s41592-019-0437-4 article EN Nature Methods 2019-06-24

Applying and improving AlphaFold at CASP14

John Jumper Richard Evans Alexander Pritzel Tim Green Michael Figurnov and 28 more

We describe the operation and improvement of AlphaFold, the system that was entered by the team AlphaFold2 to the "human" category in the 14th Critical Assessment of Protein Structure Prediction (CASP14). The AlphaFold system entered in CASP14 is entirely different to the one entered in CASP13. It used a novel end-to-end deep neural network trained to produce protein structures from amino acid sequence, multiple sequence alignments, and homologous proteins. In the assessors' ranking by summed z scores (>2.0), AlphaFold scored 244.0 compared to 90.8 by the next best group....

10.1002/prot.26257 article EN Proteins Structure Function and Bioinformatics 2021-10-04

Fast and accurate protein structure search with Foldseek

Michel van Kempen Stephanie Kim Charlotte Tumescheit Milot Mirdita Jeong-Jae Lee and 3 more

As structure prediction methods are generating millions of publicly available protein structures, searching these databases is becoming a bottleneck. Foldseek aligns the structure of a query protein against a database by describing the amino acid backbone of proteins as sequences over a structural alphabet. Foldseek decreases computation times by four to five orders of magnitude with 86%, 88% and 133% of the sensitivities of DALI, TM-align and CE, respectively.

10.1101/2022.02.07.479398 preprint EN cc-by bioRxiv (Cold Spring Harbor Laboratory) 2022-02-09

MMseqs software suite for fast and deep clustering and searching of large protein sequence sets

Maria Hauser Martin Steinegger Johannes Söding

Motivation: Sequence databases are growing fast, challenging existing analysis pipelines. Reducing the redundancy of sequence databases by similarity clustering improves speed and sensitivity of iterative searches. But existing tools cannot efficiently cluster databases of the size of UniProt to 50% maximum pairwise sequence identity or below. Furthermore, in metagenomics experiments typically large fractions of reads cannot be matched to any known sequence anymore because searching with sensitive but relatively slow tools (e.g. BLAST or HMMER3) through...

10.1093/bioinformatics/btw006 article EN Bioinformatics 2016-01-06

PredictProtein - Predicting Protein Structure and Function for 29 Years

Michael Bernhofer Christian Dallago Tim Karl Venkata Satagopam Michael Heinzinger and 23 more

Since 1992 PredictProtein (https://predictprotein.org) is a one-stop online resource for protein sequence analysis with its main site hosted at the Luxembourg Centre for Systems Biomedicine (LCSB) and queried monthly by over 3,000 users in 2020. PredictProtein was the first Internet server for protein predictions. It pioneered combining evolutionary information and machine learning. Given a protein sequence as input, the server outputs multiple sequence alignments, predictions of protein structure in 1D and 2D (secondary structure, solvent accessibility, transmembrane...

10.1093/nar/gkab354 article EN cc-by Nucleic Acids Research 2021-05-11

Fast and sensitive taxonomic assignment to metagenomic contigs

Milot Mirdita Martin Steinegger Florian P. Breitwieser Johannes Söding Eli Levy Karin

MMseqs2 taxonomy is a new tool to assign taxonomic labels to metagenomic contigs. It extracts all possible protein fragments from each contig, quickly retains those that can contribute to taxonomic annotation, assigns them with robust labels and determines the contig's taxonomic identity by weighted voting. Its fragment extraction step is suitable for the analysis of all domains of life. MMseqs2 taxonomy is 2-18× faster than state-of-the-art tools and also contains new modules for creating and manipulating taxonomic reference databases as well as reporting and visualizing...

10.1093/bioinformatics/btab184 article EN cc-by Bioinformatics 2021-03-16

ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing

Ahmed Elnaggar Michael Heinzinger Christian Dallago Ghalia Rehawi Yu Wang and 7 more

Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models taken from NLP. These LMs reach new prediction frontiers at low inference costs. Here, we trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393 billion amino acids. The LMs were trained on the Summit supercomputer using 5616 GPUs and TPU Pod up-to 1024 cores. Dimensionality reduction revealed that the raw...

10.48550/arxiv.2007.06225 preprint EN other-oa arXiv (Cornell University) 2020-01-01

Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank

Martin Steinegger Steven L. Salzberg

Genomic analyses are sensitive to contamination in public databases caused by incorrectly labeled reference sequences. Here, we describe Conterminator, an efficient method to detect and remove incorrectly labeled sequences by an exhaustive all-against-all sequence comparison. Our analysis reports contamination of 2,161,746, 114,035, and 14,148 sequences in the RefSeq, GenBank, and NR databases, respectively, spanning the whole range from draft to “complete” model organism genomes. Our method scales linearly with input size and can process 3.3 TB in 12 days on a 32-core...

10.1186/s13059-020-02023-1 article EN cc-by Genome biology 2020-05-12

Clustering predicted structures at the scale of the known protein universe

Inigo Barrio‐Hernandez Jingi Yeo Jürgen Jänes Milot Mirdita Cameron L. M. Gilchrist and 5 more

Proteins are key to all cellular processes and their structure is important in understanding their function and evolution. Sequence-based predictions of protein structures have increased in accuracy [1], and over 214 million predicted structures are available in the AlphaFold database [2]. However, studying protein structures at this scale requires highly efficient methods. Here, we developed a structural-alignment-based clustering algorithm, Foldseek cluster, that can cluster hundreds of millions of structures. Using this method, we clustered the AlphaFold database, identifying...

10.1038/s41586-023-06510-w article EN cc-by Nature 2023-09-13

Unifying the known and unknown microbial coding sequence space

Chiara Vanni Matthew S. Schechter Silvia G. Acinas Albert Barberán Pier Luigi Buttigieg and 12 more

Genes of unknown function are among the biggest challenges in molecular biology, especially in microbial systems, where 40-60% of the predicted genes are unknown. Despite previous attempts, systematic approaches to include the unknown fraction into analytical workflows are still lacking. Here, we present a conceptual framework, its translation into the computational workflow AGNOSTOS and a demonstration on how we can bridge the known-unknown gap in genomes and metagenomes. By analyzing 415,971,742 genes predicted from 1749 metagenomes and 28,941 bacterial and archaeal...

10.7554/elife.67667 article EN cc-by eLife 2022-03-31

AlphaFold2 reveals commonalities and novelties in protein structure space for 21 model organisms

Nicola Bordin Ian Sillitoe Vamsi Nallapareddy Clemens Rauer Su Datt Lam and 9 more

Deep-learning (DL) methods like DeepMind’s AlphaFold2 (AF2) have led to substantial improvements in protein structure prediction. We analyse confident AF2 models from 21 model organisms using a new classification protocol (CATH-Assign) which exploits novel DL methods for structural comparison and classification. Of ~370,000 confident models, 92% can be assigned to 3253 superfamilies in our CATH domain superfamily classification. The remaining models cluster into 2367 putative novel superfamilies. Detailed manual analysis on 618 of...

10.1038/s42003-023-04488-9 article EN cc-by Communications Biology 2023-02-08

Bilingual Language Model for Protein Sequence and Structure

Michael Heinzinger Konstantin Weißenow Joaquin Gomez Sanchez Adrian Henkel Milot Mirdita and 2 more

Adapting large language models (LLMs) to protein sequences spawned the development of powerful protein language models (pLMs). Concurrently, AlphaFold2 broke through in protein structure prediction. Now we can systematically and comprehensively explore the dual nature of proteins that act and exist as three-dimensional (3D) machines and evolve as linear strings of one-dimensional (1D) sequences. Here, we leverage pLMs to simultaneously model both modalities by combining 1D sequences with 3D structure in a single model. We encode protein structures as token sequences using...

10.1101/2023.07.23.550085 preprint EN cc-by bioRxiv (Cold Spring Harbor Laboratory) 2023-07-25
