- Genomics and Phylogenetic Studies
- Microbial Community Ecology and Physiology
- RNA and protein synthesis mechanisms
- Bacteriophages and microbial interactions
- Genetic diversity and population structure
- Machine Learning in Bioinformatics
- Algorithms and Data Compression
- Glycosylation and Glycoproteins Research
- Protist diversity and phylogeny
- Plant and animal studies
- Evolution and Genetic Dynamics
- Protein Structure and Dynamics
- Enzyme Structure and Function
- Advanced biosensing and bioanalysis techniques
- CRISPR and Genetic Engineering
- Gene expression and cancer classification
- Evolution and Paleontology Studies
- Environmental DNA in Biodiversity Studies
- Viral-associated cancers and disorders
- Marine and coastal ecosystems
- Herpesvirus Infections and Treatments
- Plant Disease Resistance and Genetics
- Peptidase Inhibition and Analysis
- Genetics, Bioinformatics, and Biomedical Research
- Bioinformatics and Genomic Networks
Seoul National University
2024
Weizmann Institute of Science
2024
Max Planck Institute for Multidisciplinary Sciences
2023-2024
Tel Aviv University
2013-2023
Max Planck Institute for Biophysical Chemistry
2019-2021
Metagenomics is revolutionizing the study of microorganisms and their involvement in biological, biomedical, and geochemical processes, allowing us to investigate by direct sequencing a tremendous diversity of organisms without the need for prior cultivation. Unicellular eukaryotes play essential roles in most microbial communities as chief predators, decomposers, phototrophs, bacterial hosts, symbionts, and parasites to plants and animals. Investigating their roles is therefore of great interest to ecology, biotechnology, human health,...
MMseqs2 taxonomy is a new tool to assign taxonomic labels to metagenomic contigs. It extracts all possible protein fragments from each contig, quickly retains those that can contribute to taxonomic annotation, assigns them with robust labels, and determines the contig's taxonomic identity by weighted voting. Its fragment extraction step is suitable for the analysis of all domains of life. MMseqs2 taxonomy is 2-18× faster than state-of-the-art tools and also contains modules for creating and manipulating taxonomic reference databases as well as reporting and visualizing...
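The weighted-voting step described above can be illustrated with a minimal sketch. This is not the MMseqs2 implementation: it assumes flat taxon labels rather than full lineages, and the function name `weighted_vote`, the input layout, and the weights are hypothetical choices made for illustration only.

```python
from collections import defaultdict


def weighted_vote(fragment_hits):
    """Pick a contig label from (label, weight) pairs of its protein fragments.

    Assumption: labels are flat strings and weights are positive numbers
    (for example, a score derived from each fragment's alignment).
    Returns (label, support), where support is the winning label's share
    of the total weight, or (None, 0.0) if there is nothing to vote on.
    """
    totals = defaultdict(float)
    for label, weight in fragment_hits:
        totals[label] += weight

    total_weight = sum(totals.values())
    if total_weight <= 0:
        return None, 0.0

    # Sorting the labels first makes tie-breaking deterministic.
    best = max(sorted(totals), key=totals.get)
    return best, totals[best] / total_weight


if __name__ == "__main__":
    hits = [("Bacteria", 2.0), ("Archaea", 1.0), ("Bacteria", 1.0)]
    print(weighted_vote(hits))
```

A real lineage-aware vote would also credit each fragment's weight to the ancestors of its label, so that a contig can be assigned at a higher rank when its fragments disagree at lower ranks.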
Advances in computational structure prediction will vastly augment the hundreds of thousands of currently available protein complex structures. Translating these into discoveries requires aligning them, which is computationally prohibitive. Foldseek-Multimer computes complex alignments from compatible chain-to-chain alignments, identified by efficiently clustering their superposition vectors. Foldseek-Multimer is 3–4 orders of magnitude faster than the gold standard, while producing comparable alignments; this allows it...
SpacePHARER (CRISPR Spacer Phage-Host Pair Finder) is a sensitive and fast tool for de novo prediction of phage-host relationships via identifying phage genomes that match CRISPR spacers in genomic or metagenomic data. SpacePHARER gains sensitivity by comparing spacers and phages at the protein level, optimizing its scores for matching very short sequences, and combining evidence from multiple matches, while controlling for false positives. We demonstrate SpacePHARER by searching a comprehensive spacer list against all complete...
Advances in computational structure prediction will vastly augment the hundreds of thousands of currently-available protein complex structures. Translating these into discoveries requires aligning them, which is computationally prohibitive. Foldseek-Multimer computes complex alignments from compatible chain-to-chain alignments, identified by efficiently clustering their superposition vectors. Foldseek-Multimer is 3-4 orders of magnitude faster than the gold standard, while producing comparable alignments; allowing it to compare...
The AlphaFold Protein Structure Database (AFDB) is the largest repository of accurately predicted structures with taxonomic labels. Despite providing predictions for over 214 million UniProt entries, the AFDB does not cover viral sequences, severely limiting their study. To address this, we created the Big Fantastic Virus Database (BFVD), a repository of 351,242 protein structures predicted by applying ColabFold to the viral sequence representatives of the UniRef30 clusters. By utilizing homology searches across two petabases of assembled sequencing...
The recent CASP15 competition highlighted the critical role of multiple sequence alignments (MSAs) in protein structure prediction, as demonstrated by the success of the top AlphaFold2-based prediction methods. To push the boundaries of MSA utilization, we conducted a petabase-scale search of the Sequence Read Archive (SRA), resulting in gigabytes of aligned homologs for CASP15 targets. These were merged with the default MSAs produced by ColabFold-search and provided to ColabFold-predict. By using SRA data, we achieved highly...
Sewon Lee1,10, Gyuri Kim1,10, Eli Levy Karin2, Milot Mirdita1, Sukhwan Park3, Rayan Chikhi4, Artem Babaian5,6, Andriy Kryshtafovych7 and Martin Steinegger1,3,8,9 1School of Biological Sciences, Seoul National University, Gwanak-gu, 08826, South Korea 2ELKMO, Copenhagen 2720, Denmark 3Interdisciplinary Program in Bioinformatics, 4Institut Pasteur, Université Paris Cité, G5 Sequence 75015 Paris, France 5Department Molecular Genetics, University Toronto, Ontario M5S 1A8, Canada 6Donnelly Centre...
The AlphaFold Protein Structure Database (AFDB) is the largest repository of accurately predicted structures with taxonomic labels. Despite providing predictions for over 214 million UniProt entries, the AFDB does not cover viral sequences, severely limiting their study. To bridge this gap, we created the Big Fantastic Virus Database (BFVD), a repository of 351,242 protein structures predicted by applying ColabFold to the viral sequence representatives of the UniRef30 clusters. BFVD holds a unique repertoire of structures, as 63% of its entries show no or low structural...
Recent years have seen incredible progress in the development of deep-learning (DL) tools for the analysis of biological data, with the most prominent example being AlphaFold2 for accurate protein structure prediction. DL-based tools are especially useful for identifying patterns and connections within sparsely labeled datasets. This makes them essential for the analysis of metagenomic data, which is mostly unannotated and bears little sequence similarity to known genes and proteins. In this review, we chose to present twelve tools that we deem as offering...
Since its public release in 2021, AlphaFold2 (AF2) has made investigating biological questions, using predicted protein structures of single monomers or full complexes, a common practice. ColabFold-AF2 is an open-source Jupyter Notebook inside Google Colaboratory and a command-line tool, which makes it easy to use AF2 while exposing its advanced options. ColabFold-AF2 shortens the turn-around times of experiments due to its optimized usage of AF2's models. In this protocol, we guide the reader through ColabFold...
Evolutionary analysis of phyletic patterns (phylogenetic profiles) is widely used in biology, representing the presence or absence of characters such as genes, restriction sites, introns, indels and methylation sites. The phyletic pattern observed in extant genomes is the result of ancestral gain and loss events along the phylogenetic tree. Here we present CoPAP (coevolution of presence–absence patterns), a user-friendly web server, which performs accurate inference of coevolving characters as manifested by co-occurring gains and losses. CoPAP uses...
Recent years have seen a constant rise in the availability of trait data, including morphological features, ecological preferences, and life history characteristics. These phenotypic data provide means to associate genomic regions with phenotypic attributes, thus allowing the identification of phenotypic traits associated with the rate of genome sequence evolution. However, inference methodologies that analyze sequence and phenotypic data in a unified statistical framework are still scarce. Here, we present TraitRateProp, a probabilistic method that allows testing whether...
The classic methodology of inferring a phylogenetic tree from sequence data is composed of two steps. First, a multiple sequence alignment (MSA) is computed. Then, a tree is reconstructed assuming the MSA is correct. Yet, inferred MSAs were shown to be inaccurate, and alignment errors reduce tree inference accuracy. It was previously proposed that filtering unreliable alignment regions can increase the accuracy of tree inference. However, it was also demonstrated that the benefit of this filtering is often obscured by the resulting loss of phylogenetic signal. In this work we explore an approach, in...
Estimating phylogenetic trees from sequence data is an extremely challenging and important statistical task. Within the maximum-likelihood paradigm, the best tree is a point estimate. To determine how strongly the data support such an evolutionary scenario, a hypothesis testing methodology is required. To this end, the Kishino–Hasegawa (KH) test was developed to determine whether one topology is significantly more supported by the data than another one. This test and its derivatives are widely used in phylogenetics and phylogenomics. Here, we show that the KH...
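For readers unfamiliar with the KH test, the following is a minimal sketch of its commonly described normal-approximation form, computed from per-site log-likelihoods of the same alignment under two topologies. It is an illustration under stated assumptions, not the procedure evaluated in the work above: the function name `kh_test` and the input layout are hypothetical, and the resampling-based variants of the test are not shown.

```python
import math


def kh_test(site_lnl_a, site_lnl_b):
    """Normal-approximation KH test from per-site log-likelihoods.

    Assumption: site_lnl_a[i] and site_lnl_b[i] are the log-likelihoods of
    alignment site i under topology A and topology B, respectively.
    Returns (delta, z, p), where delta is the total log-likelihood
    difference (A minus B) and p is the two-sided p-value.
    """
    n = len(site_lnl_a)
    if n != len(site_lnl_b) or n < 2:
        raise ValueError("need two equally long lists with at least 2 sites")

    diffs = [a - b for a, b in zip(site_lnl_a, site_lnl_b)]
    delta = sum(diffs)
    mean = delta / n

    # Variance of the summed difference, estimated from per-site differences.
    var_delta = n / (n - 1) * sum((d - mean) ** 2 for d in diffs)
    if var_delta == 0:
        raise ValueError("per-site differences are constant; test undefined")

    z = delta / math.sqrt(var_delta)
    # Two-sided p-value under a standard normal: 2 * (1 - Phi(|z|)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return delta, z, p


if __name__ == "__main__":
    lnl_a = [-1.0, -1.0, -1.0, -1.0]
    lnl_b = [-2.0, -3.0, -2.0, -3.0]
    print(kh_test(lnl_a, lnl_b))
```

The test assumes the two topologies were chosen before looking at the data.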
Certain protist lineages bear cytoskeletal structures that are germane to them and define their individual group. Trichomonadida are excavate parasites united by a unique cytoskeletal framework, which includes tubulin-based structures such as the pelta and axostyle, but also other filaments such as the striated costa, whose protein composition remains unknown. We determined the proteome of the detergent-resistant cytoskeleton of Tetratrichomonas gallinarum. 203 proteins with homology to Trichomonas vaginalis were identified, which contain significantly...
Perfect short inverted repeats (IRs) are known to be enriched in a variety of bacterial and eukaryotic genomes. Currently, it is unclear whether perfect IRs are conserved over evolutionary time scales. In this study, we aimed to characterize the prevalence and conservation of perfect IRs across 20 proteobacterial strains. We first identified perfect IRs in Escherichia coli K-12 substr. MG1655 and showed that they are overabundant. We next aimed to test whether this overabundance is reflected in conservation. To this end, for each perfect IR identified in E. coli MG1655, we collected orthologous sequences from related...
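As an illustration of what identifying perfect IRs involves, here is a minimal sketch that scans a DNA string for a left arm followed, after an optional spacer, by its exact reverse complement. The arm length, the spacer limit, and the function names are assumptions made for this example; the excerpt above does not state the definition or parameters used in the study.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def find_perfect_irs(seq, arm_len, max_spacer=0):
    """Find perfect inverted repeats in seq.

    Assumption: a perfect IR is an arm of arm_len bases followed, after a
    spacer of 0..max_spacer bases, by the arm's exact reverse complement.
    Returns a list of (start_index, spacer_length) tuples.
    """
    seq = seq.upper()
    hits = []
    for start in range(len(seq) - 2 * arm_len + 1):
        left = seq[start:start + arm_len]
        # Skip arms containing ambiguous bases such as N.
        if any(base not in COMPLEMENT for base in left):
            continue
        target = reverse_complement(left)
        for spacer in range(max_spacer + 1):
            right_start = start + arm_len + spacer
            if right_start + arm_len > len(seq):
                break
            if seq[right_start:right_start + arm_len] == target:
                hits.append((start, spacer))
    return hits


if __name__ == "__main__":
    print(find_perfect_irs("GAATTC", arm_len=3))
    print(find_perfect_irs("GAACCTTC", arm_len=3, max_spacer=2))
```

Testing for overabundance would additionally require a null model of how many such repeats are expected by chance, which this sketch does not cover.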
Zooplankton are important eukaryotic constituents of marine ecosystems characterized by limited motility in the water. These metazoans predominantly occupy intermediate trophic levels and energetically link primary producers to higher trophic levels. Through processes including diel vertical migration (DVM) and the production of sinking pellets, they also contribute to the biological carbon pump, which regulates atmospheric CO2 levels. Despite their prominent role in marine ecosystems, and perhaps because of their staggering diversity, much remains...
SpacePHARER (CRISPR Spacer Phage-Host Pair Finder) is a sensitive and fast tool for de novo prediction of phage-host relationships via identifying phage genomes that match CRISPR spacers in genomic or metagenomic data. SpacePHARER gains sensitivity by comparing spacers and phages at the protein level, optimizing its scores for matching very short sequences, and combining evidence from multiple matches, while controlling for false positives. We demonstrate SpacePHARER by searching a comprehensive spacer list against all complete genomes.
The most common evolutionary events at the molecular level are single-base substitutions, as well as insertions and deletions (indels) of short DNA segments. A large body of research has been devoted to developing probabilistic substitution models and to inferring their parameters using likelihood and Bayesian approaches. In contrast, relatively little has been done to model indel dynamics, probably due to the difficulty in writing explicit likelihood functions. Here, we contribute to the effort of modeling indel dynamics by presenting SpartaABC, an...
Understanding species adaptation at the molecular level has been a central goal of evolutionary biology and genomics research. This important task becomes increasingly relevant with the constant rise in both genotypic and phenotypic data availabilities. The TraitRateProp web server offers a unique perspective into this task by allowing the detection of associations between sequence evolution rate and whole-organism phenotypes. By analyzing sequences and phenotypes of extant species in the context of their phylogeny, it identifies sites...