MMseqs software suite for fast and deep clustering and searching of large protein sequence sets

UniProt Software suite Smith–Waterman algorithm
DOI: 10.1093/bioinformatics/btw006
Publication Date: 2016-02-15T01:09:07Z
ABSTRACT
Motivation: Sequence databases are growing fast, challenging existing analysis pipelines. Reducing the redundancy of sequence databases by similarity clustering improves the speed and sensitivity of iterative searches. But existing sequence clustering tools cannot efficiently cluster databases of the size of UniProt to 50% maximum pairwise sequence identity or below. Furthermore, in metagenomics experiments typically large fractions of reads cannot be matched to any known sequence anymore because searching with sensitive but relatively slow tools (e.g. BLAST or HMMER3) through comprehensive databases such as UniProt is becoming too costly.
Results: MMseqs (Many-against-Many sequence searching) is a software suite for fast and deep clustering and searching of large datasets, such as UniProt, or 6-frame translated metagenomics sequencing reads. MMseqs contains three core modules: a fast and sensitive prefiltering module that sums up the scores of similar k-mers between query and target sequences, an SSE2- and multi-core-parallelized local alignment module, and a clustering module. In our homology detection benchmarks, MMseqs is much more sensitive and 4–30 times faster than UBLAST and RAPsearch, respectively, although it does not reach BLAST sensitivity yet. Using its cascaded clustering workflow, MMseqs can cluster large databases down to ∼30% sequence identity at hundreds of times the speed of BLASTclust and much deeper than CD-HIT and USEARCH. MMseqs can also update a database clustering in linear instead of quadratic time. Its much improved sensitivity-speed trade-off should make MMseqs attractive for a wide range of large-scale sequence analysis tasks.
Availability and implementation: MMseqs is open-source software available under GPL at https://github.com/soedinglab/MMseqs
Contact: martin.steinegger@mpibpc.mpg.de, soeding@mpibpc.mpg.de
Supplementary information: Supplementary data are available at Bioinformatics online.
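The prefiltering idea described in the abstract, summing the scores of similar k-mers shared by a query and each target, can be illustrated with a toy sketch. This is not the MMseqs implementation: the k-mer length, the match/mismatch scores (standing in for a real substitution matrix), the similarity threshold, and all function names below are assumptions chosen only to keep the example small.

```python
from collections import defaultdict

K = 3                      # toy k-mer length (assumption)
MATCH, MISMATCH = 2, -1    # toy scores standing in for a substitution matrix (assumption)
THRESHOLD = 3              # minimum score for two k-mers to count as "similar" (assumption)


def kmers(seq, k=K):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_score(a, b):
    """Score two equal-length k-mers position by position."""
    return sum(MATCH if x == y else MISMATCH for x, y in zip(a, b))


def build_index(targets, k=K):
    """Map each k-mer to the set of target names that contain it."""
    index = defaultdict(set)
    for name, seq in targets.items():
        for km in kmers(seq, k):
            index[km].add(name)
    return index


def prefilter_scores(query, index, k=K, threshold=THRESHOLD):
    """Sum, per target, the scores of all similar k-mer pairs with the query."""
    scores = defaultdict(int)
    for q in kmers(query, k):
        # Brute-force comparison against every indexed k-mer; a real tool
        # would generate the list of similar k-mers directly instead.
        for t_kmer, names in index.items():
            s = kmer_score(q, t_kmer)
            if s >= threshold:
                for name in names:
                    scores[name] += s
    return dict(scores)


# Hypothetical toy sequences, for illustration only.
targets = {"t1": "MKVLAA", "t2": "GGGPPP", "t3": "MKILAA"}
index = build_index(targets)
scores = prefilter_scores("MKVLAG", index)
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this sketch, targets that accumulate no similar k-mer hits never appear in `scores`, so only the remaining candidates in `ranked` would be passed on to a slower alignment step.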