NFDI4DS | UHH-SEMS - Publication Details

Minimally overlapping words for sequence similarity search

Keywords: Similarity (geometry); Sequence (biology); Boosting

DOI: 10.1093/bioinformatics/btaa1054
Publication Date: 2020-12-09T05:04:36Z

AUTHORS (3)

Martin C Frith

Laurent Noé

Gregory Kucherov

ABSTRACT

Analysis of genetic sequences is usually based on finding similar parts of sequences, e.g. DNA reads and/or genomes. For big data, this is typically done via 'seeds': simple similarities (e.g. exact matches) that can be found quickly. For huge data, sparse seeding is useful, where we only consider seeds at a subset of positions in a sequence. Here, we study a simple sparse-seeding method: using seeds at positions of certain 'words' (e.g. ac, at, gc or gt). Sensitivity is maximized by using words with minimal overlaps. That is because, in a random sequence, minimally overlapping words are anti-clumped. We provide evidence that this is often superior to acclaimed 'minimizer' sparse-seeding methods. Our approach can be unified with the design of inexact (spaced and subset) seeds, further boosting sensitivity. Thus, we present a promising approach to sequence similarity search, with open questions on how to optimize it. Software to design and test minimally overlapping words is freely available at https://gitlab.com/mcfrith/noverlap. Supplementary data are available at Bioinformatics online.
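
The word-based seeding idea in the abstract can be illustrated with a minimal Python sketch. It is not the noverlap software: the function names `word_positions` and `overlapping_pairs` are hypothetical, and the sketch assumes all words in a set have the same length. It lists the positions where a word from the set occurs, and checks whether any word's end can coincide with any word's start, which is what allows occurrences to clump.

```python
def word_positions(seq, words):
    """Return the start positions in seq where any word from the set occurs.

    Assumption for this sketch: all words have the same length.
    """
    k = len(next(iter(words)))
    return [i for i in range(len(seq) - k + 1) if seq[i:i + k] in words]


def overlapping_pairs(words):
    """Return (u, v, n) triples where the last n letters of u equal the
    first n letters of v, for 0 < n < word length.

    An empty result means no two occurrences of words from the set can
    partially overlap in any sequence.
    """
    pairs = []
    for u in sorted(words):
        for v in sorted(words):
            for n in range(1, min(len(u), len(v))):
                if u[-n:] == v[:n]:
                    pairs.append((u, v, n))
    return pairs


# The example word set from the abstract: first letters are a or g,
# second letters are c or t, so no word can start where another ends.
non_overlapping = {"ac", "at", "gc", "gt"}

# A contrasting set (chosen here for illustration only) whose words
# overlap freely, so occurrences can sit at adjacent positions.
overlapping = {"aa", "ac", "ca", "cc"}

if __name__ == "__main__":
    seq = "acgtatgcac"
    print(word_positions(seq, non_overlapping))
    print(overlapping_pairs(non_overlapping))
    print(overlapping_pairs(overlapping))
```

With the abstract's example set, two seed positions can never be adjacent, which is one way to read "anti-clumped"; the contrasting set has no such guarantee.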

REFERENCES (28)

CITATIONS (21)
