Large-scale k-mer-based analysis of the informational properties of genomes, comparative genomics and taxonomy
Jaccard index
Comparative Genomics
DOI: 10.1371/journal.pone.0258693
Publication Date: 2021-10-14T19:42:37Z
AUTHORS (3)
ABSTRACT
Information theoretic approaches are ubiquitous and effective in a wide variety of bioinformatics applications. In comparative genomics, alignment-free methods, based on short DNA words, or k-mers, are particularly powerful. We evaluated the utility of varying k-mer lengths for genome comparisons by analyzing their sequence space coverage of 5805 genomes in the KEGG GENOME database. In subsequent analyses of four k-mer lengths spanning the relevant range (11, 21, 31, 41), hierarchical clustering of 1634 genus-level representative genomes using pairwise 21- and 31-mer Jaccard similarities best recapitulated a phylogenetic/taxonomic tree of life, with clear boundaries for superkingdom domains and high subtree similarity for named taxons at lower levels (family through phylum). By analyzing ~14.2M prokaryotic genome comparisons by their lowest-common-ancestor taxon levels, we detected many potential misclassification errors in a curated database, further demonstrating the need for wide-scale adoption of quantitative taxonomic classifications based on whole-genome similarity.
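The two quantities named in the abstract, sequence space coverage and pairwise k-mer Jaccard similarity, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names (`kmer_set`, `jaccard`, `space_coverage`) and the toy sequences are hypothetical, and the sketch assumes single-strand k-mers over the A/C/G/T alphabet without canonicalization of reverse complements, which a real genome comparison might handle differently.

```python
def kmer_set(seq: str, k: int) -> set:
    """Return the distinct k-mers of seq, skipping windows with non-ACGT characters."""
    seq = seq.upper()
    alphabet = set("ACGT")
    return {
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= alphabet
    }


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A & B| / |A | B|; defined here as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def space_coverage(kmers: set, k: int) -> float:
    """Fraction of the 4**k possible k-mers that are present in the set."""
    return len(kmers) / 4 ** k


# Toy sequences for illustration only (assumed, not taken from the paper).
genome_a = "ACGTAC"
genome_b = "ACGTTT"
k = 3

set_a = kmer_set(genome_a, k)
set_b = kmer_set(genome_b, k)

print("Jaccard similarity:", jaccard(set_a, set_b))
print("Sequence space coverage of genome_a:", space_coverage(set_a, k))
```

With k = 3 the toy sequences share two of six distinct 3-mers. At the k-mer lengths studied in the paper (11 to 41), the space of possible k-mers grows as 4^k, so coverage falls quickly as k increases, while the Jaccard similarity becomes more specific to closely related genomes.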
REFERENCES (99)
CITATIONS (37)