Minimally overlapping words for sequence similarity search
Similarity (geometry)
Sequence (biology)
Boosting
DOI:
10.1093/bioinformatics/btaa1054
Publication Date:
2020-12-09T05:04:36Z
AUTHORS (3)
ABSTRACT
Analysis of genetic sequences is usually based on finding similar parts of sequences, e.g. DNA reads and/or genomes. For big data, this is typically done via 'seeds': simple similarities (e.g. exact matches) that can be found quickly. For huge data, sparse seeding is useful, where we only consider seeds at a subset of positions in a sequence.

Here, we study a sparse-seeding method: using seeds at positions of certain 'words' (e.g. ac, at, gc or gt). Sensitivity is maximized by using words with minimal overlaps. That is because, in a random sequence, minimally overlapping words are anti-clumped. We provide evidence that this is often superior to acclaimed 'minimizer' sparse-seeding methods. Our approach can be unified with the design of inexact (spaced and subset) seeds, further boosting sensitivity. Thus, we present a promising approach to sequence similarity search, with open questions on how to optimize it.

Software to design and test minimally overlapping words is freely available at https://gitlab.com/mcfrith/noverlap. Supplementary data are available at Bioinformatics online.
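The word-based sparse seeding described in the abstract can be illustrated with a small sketch. It is not taken from the noverlap software: the function names `seed_positions`, `can_overlap` and `overlapping_pairs` are hypothetical, and the overlap check assumes all words have the same length. The sketch picks the positions in a sequence where one of the example words (ac, at, gc, gt) starts, and checks whether any word in a set can overlap another.

```python
from itertools import product

# Example word set quoted in the abstract.
WORDS = ("ac", "at", "gc", "gt")


def seed_positions(seq, words=WORDS):
    """Return the 0-based positions in seq where any of the words starts."""
    seq = seq.lower()
    return [i for i in range(len(seq))
            if any(seq.startswith(w, i) for w in words)]


def can_overlap(u, v):
    """True if a proper suffix of u equals a prefix of v.

    In that case an occurrence of v can start inside an occurrence of u.
    Assumption for this sketch: u and v have equal length.
    """
    return any(u[k:] == v[:len(u) - k] for k in range(1, len(u)))


def overlapping_pairs(words):
    """List all ordered pairs (u, v) of words that can overlap."""
    return [(u, v) for u, v in product(words, repeat=2) if can_overlap(u, v)]


if __name__ == "__main__":
    print(seed_positions("ttacgtaa"))   # positions where seeds would be placed
    print(overlapping_pairs(WORDS))     # pairs of words that can overlap
    print(overlapping_pairs(("aa",)))   # a self-overlapping word, for contrast
```

In this example set, every word starts with a or g and ends with c or t, so no word can begin inside another and the selected positions cannot be adjacent. A word such as aa overlaps itself, so its occurrences can clump in runs like aaaa.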
REFERENCES (28)
CITATIONS (21)