A generic motif discovery algorithm for sequential data
Models, Molecular
0301 basic medicine
Sequence Homology, Amino Acid
Protein Conformation
Gene Expression Profiling
Amino Acid Motifs
Molecular Sequence Data
Computational Biology
Sequence Analysis, DNA
Protein Structure, Secondary
Pattern Recognition, Automated
03 medical and health sciences
Cluster Analysis
Humans
Amino Acid Sequence
Sequence Alignment
Algorithms
Conserved Sequence
Software
DOI:
10.1093/bioinformatics/bti745
Publication Date:
2005-10-29T00:13:06Z
AUTHORS (4)
ABSTRACT
Motivation: Motif discovery in sequential data is a problem of great interest with many applications. However, previous methods have been unable to combine exhaustive search with complex motif representations, and each is typically applicable only to a certain class of problems.
Results: Here we present a generic motif discovery algorithm (Gemoda) for sequential data. Gemoda can be applied to any dataset with a sequential character, including both categorical and real-valued data. As we show, Gemoda deterministically discovers motifs that are maximal in composition and length. The algorithm also allows any choice of similarity metric for finding motifs. Finally, Gemoda's output motifs are representation-agnostic: they can be represented using regular expressions, position weight matrices or any number of other models for any type of sequential data. We demonstrate a number of applications of the algorithm, including the discovery of motifs in amino acid sequences, a new solution to the (l,d)-motif problem in DNA sequences and the discovery of conserved protein substructures.
Availability: Gemoda is freely available at
Contact: gregstep@mit.edu
Supplementary Information: Available at
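The claim that any similarity metric can be plugged in, for both categorical and real-valued data, can be illustrated with a short sketch. This is not Gemoda's implementation, which the abstract does not describe; it only shows an exhaustive comparison of fixed-length windows across sequences with a caller-supplied metric. All names here (`similar_window_pairs`, `match_count`, `negative_euclidean`) and the thresholds are hypothetical, chosen for illustration.

```python
import math
from itertools import combinations
from typing import Callable, List, Sequence, Tuple

# A window is identified by (sequence index, start offset).
Window = Tuple[int, int]


def match_count(a: Sequence, b: Sequence) -> float:
    """Categorical metric: number of positions where two windows agree."""
    return float(sum(x == y for x, y in zip(a, b)))


def negative_euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Real-valued metric: negated Euclidean distance (higher is more similar)."""
    return -math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def similar_window_pairs(
    sequences: List[Sequence],
    length: int,
    similarity: Callable[[Sequence, Sequence], float],
    threshold: float,
) -> List[Tuple[Window, Window]]:
    """Exhaustively compare every pair of length-`length` windows taken from
    different sequences and keep the pairs scoring at least `threshold`."""
    windows = [
        (i, start)
        for i, seq in enumerate(sequences)
        for start in range(len(seq) - length + 1)
    ]
    pairs = []
    for (i, s), (j, t) in combinations(windows, 2):
        if i == j:
            continue  # only compare windows from different sequences
        score = similarity(sequences[i][s:s + length], sequences[j][t:t + length])
        if score >= threshold:
            pairs.append(((i, s), (j, t)))
    return pairs


# Categorical data: DNA-like strings, at most one mismatch in a 4-mer.
dna_pairs = similar_window_pairs(["ACGTAC", "TTCGTA"], 4, match_count, 3)

# Real-valued data: the same routine with a different metric.
real_pairs = similar_window_pairs(
    [[0.0, 1.0, 2.0], [5.0, 0.1, 1.1]], 2, negative_euclidean, -0.2
)
```

Only the `similarity` argument changes between the two calls, which is the sense in which the comparison step is independent of the data type. Turning such window pairs into maximal motifs would need further steps that are not sketched here.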