Overcoming Species Boundaries in Peptide Identification with Bayesian Information Criterion-driven Error-tolerant Peptide Search (BICEPS)
Identification
Database search engine
DOI:
10.1074/mcp.m111.014167
Publication Date:
2012-04-08T04:47:27Z
AUTHORS (10)
ABSTRACT
Currently, the reliable identification of peptides and proteins is only feasible when thoroughly annotated sequence databases are available. Although sequencing capacities continue to grow, many organisms remain without the reliable, fully annotated reference genomes required for proteomic analyses. Standard database search algorithms fail to identify peptides that are not exactly contained in a protein database. De novo searches are generally hindered by their restricted reliability, and current error-tolerant search strategies are limited by global, heuristic tradeoffs between database and spectral information. We propose a Bayesian information criterion-driven error-tolerant peptide search (BICEPS) and offer an open source implementation based on this statistical criterion to automatically balance the information of each single spectrum and the database, while limiting the run time. We show that BICEPS performs as well as current database search algorithms when such algorithms are applied to sequenced organisms, whereas BICEPS only uses a remotely related organism database. For instance, we use a chicken instead of a human database, corresponding to an evolutionary distance of more than 300 million years (International Chicken Genome Sequencing Consortium (2004) Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature 432, 695–716). We demonstrate the successful application to cross-species proteomics with a 33% increase in the number of identified proteins for a filarial nematode sample of Litomosoides sigmodontis.

The identification of peptides and proteins from mass spectra is a key step in understanding cellular mechanisms, most of which occur at the protein level. Proteomic steps such as the classification of patterns, the identification of biomarkers, or quantitative analyses deliver additional information if the results can be linked correctly to proteins (1: McHugh L., Arthur J.W. Computational methods for protein identification from mass spectrometry data. PLoS Comput. Biol. 2008; 4: e12; 2: Wright J.C., Beynon R.J., Hubbard S.J. Cross species proteomics. Methods Mol. Biol. 2010; 604: 123-135). The success rate depends on the proteome coverage of the available databases, which is suboptimal even for important model organisms such as the Chinese hamster or the African clawed frog Xenopus laevis (3Liska A.J. Sunyaev S. Shilov I.N. Schaeffer D.A. Shevchenko A. 
Error-tolerant EST tandem MultiTag software.Proteomics. 2005; 5: 4118-4122Crossref (17) Scholar), but also economically relevant crops (4Grossmann J. Fischer B. Baerenfaller K. Owiti Buhmann J.M. Gruissem W. Baginsky A worflow detection unsequenced high-throughput experiments.Proteomics. 2007; 7: 4245-4254Crossref (38) let alone extinct including dinosaurs (5Asara Schweitzer M.H. Freimark L.M. Phillips M. Cantley L.C. Protein sequences mastodon Tyrannosaurus rex revealed spectrometry.Science. 316: 280-285Crossref (218) 6Buckley Walker Ho S.Y. Yang Y. Smith C. Ashton P. Oates J.T. Cappellini E. Koon H. Penkman Elsworth Ashford D. Solazzo Andrews Strahler Shapiro Ostrom Gandhi Miller Raney Zylber M.I. Gilbert M.T. Prigodich R.V. Ryan Rijsdijk K.F. Janoo Collins M.J. Comment "Protein spectrometry. ".Science. 319: 33Crossref 7Pevzner P.A. Kim Ng 321: 1040Crossref (47) 8Schweitzer Zheng Organ C.L. Avci R. Suo Z. Lebleu V.S. Duncan M.B. Vander Heiden M.G. Neveu Lane W.S. Cottrell J.S. Horner J.R. Kalluri Asara Biomolecular characterization Campanian hadrosaur canadensis.Science. 2009; 324: 626-631Crossref (160) However, cases, exist whose has been thus facilitate proteins, e.g. X. tropicalis laevis. existing could used make currently unannotated genomes. Still, problems arise because locations substitutions nature modifications unknown. Even cannot estimated priori. It vary strongly different protein. similar problem arises single-nucleotide polymorphisms (SNPs), where departures particular interest (9Dasari Chambers M.C. Slebos Zimmerman L.J. Ham Tabb D.L. TagRecon: High-throughput mutation through tagging.J. Proteome Res. 9: 1716-1726Crossref (95) 10Li Su Ma Z.Q. Halvey Liebler D.C. Pao Zhang bioinformatics workflow variant shotgun proteomics.Mol. Cell. Proteomics. 2011; 10.1074/mcp.M110.006536Abstract Full Text PDF (84) Scholar) regard antibodies. Next generation genomes, challenges persist creating (11Florea Souvorov Kalbfleisch T.S. Salzberg S.L. 
assembly major impact gene content: comparison annotation two Bos taurus assemblies.PLoS ONE. 6: e21400Crossref (42) Three classes approaches distinguished spectra: library searches, de sequencing, (see Refs. 1McHugh Scholar 12Nesvizhskii A.I. Vitek O. Aebersold Analysis validation data generated spectrometry.Nat. Methods. 787-797Crossref (514) reviews). Spectral approaches, compare libraries already peptides, exclusively find previously peptides. infer directly differences fragment ions spectra. they definition, do sufficient reliability low quality regions (13Kim Bandeira N. Pevzner profiles: novel representation applications identification.Mol. 8: 1391-1400Abstract (32) 14Liu Yan Song Xu Cai Peptide tag-based blind post-translational point process model.Bioinformatics. 2006; 22: E307-E313Crossref (29) 15Shevchenko Valcu C.M. Junqueira Tools exploring proteomosphere.J. 72: 137-144Crossref considered full solution (2Wright Database procedures how observed fits theoretical obtained Popular include Mascot (16Perkins D.N. Pappin D.J. Creasy D.M. Probability-based searching using data.Electrophoresis. 1999; 20: 3551-3567Crossref (6763) Sequest (17Eng J.K. McCormack A.L. Yates An approach correlate massspectral amino-acid-sequences database.J. Am. Soc. Mass Spectrom. 1994; 976-989Crossref (5420) X!Tandem (18Craig Beavis R.C. TANDEM: Matching spectra.Bioinformatics. 2004; 1466-1467Crossref (1987) PepSplice (19Roos F.F. Jacob Grossmann Widmayer PepSplice: Cache-eficient comprehensive 23: 3016-3023Crossref ProteinPilot (20Shilov I.V. Seymour Patel A.A. Loboda Tang W.H. Keating S.P. Hunter Nuwaysir Paragon algorithm: next engine temperature values feature probabilities spectra.Mol. 1638-1655Abstract (1059) highly identifying present these standard mode contained. several extensions have implemented. naïve expanding (21Yates 3rd, Eng Schieltz Method modified amino acid database.Anal. Chem. 1995; 67: 1426-1436Crossref (1110) becomes infeasible containing large considered. 
average tryptic consisting 11 acids, considering limitations ends fact isoleucine leucine spectrometry, space expanded factor 191 substitution 16,452 within one peptide. Not times memory requirements become excessive, risk false positives increased enormous size space. High accuracy simplify filtering out differing precursor mass, happen after enumeration all possible sequences. Iterative proposed (22Craig method reducing time match spectra.Rapid Commun. 2003; 17: 2310-2316Crossref (398) 23Creasy Error tolerant uninterpreted data.Proteomics. 2002; 2: 1426-1434Crossref (207) 24Starkweather Barnes C.S. Wyckoff G.J. Keightley J.A. Virtual polymorphism: Finding divergent matches data.Anal. 79: 5030-5039Crossref (7) These rely assumption every identifiable at least any substitutions. Consequently, unmodified changes conducted first run. Tag-based generate characteristic tags 3–5 acids filter those tag 25Mann Wilm tags.Anal. 66: 4390-4399Crossref (1316) 26Tabb Saraf GutenTag: tagging via empirically derived fragmentation model.Anal. 75: 6415-6421Crossref (247) Multitag (27Sunyaev Liska Golod MultiTag: Multiple sequence-similarity spectrometry.Anal. 1307-1315Crossref (107) LookUp Peaks (28Bern Goldberg Lookup peaks: hybrid 1393-1400Crossref (158) shorter, tags, UStags (29Shen Tolić Hixson K.K. Purvine S.O. Anderson G.A. R.D. discovery proteins.Anal. 80: 7742-7754Crossref (34) 30Shen Pasa-Tolić Qian W.J. Adkins J.N. Moore Proteome-wide decreased ambiguities improved rates 1871-1882Crossref Gapped recover longer tags. Because part (either itself flanking masses) allowed depart original sequence, there upper limit (15Shevchenko error tolerance. Extending idea search, combine initial species. DeNovoID (31Halligan B.D. Ruotti V. Twigger S.N. Greene A.S. DeNovoID: web-based tool deduced spectroscopy.Nucleic Acids 33: W376-W381Crossref (16) identifies chemical composition other (32DiMaggio Jr., Floudas C.A. 
Lu integer linear optimization, local quadrupole time-of-flight OrbiTrap spectrometry.J. 1584-1593Crossref (22) 33Han SPIDER: Software error.J. Bioinform. 3: 697-716Crossref (167) 34Searle B.C. Dasari Turner Reddy A.P. Choi Wilmarth David L.L. Nagalla S.R. unanticipated mass-based alignment algorithm MS/MS results.Anal. 76: 2220-2230Crossref (129) apply FASTA-like thresholds decide whether reason mismatch. individual multiple proteolytic enzymes, gives very promising (35Bandeira Pham Arnott Lill Automated monoclonal antibodies.Nat. Biotechnol. 26: 1336-1338Crossref (94) 36Liu Han Yuen (re)sequencing homologous yields almost accuracy.Bioinformatics. 25: 2174-2180Crossref (26) overlaps help mitigate error, usually protocols. Conversely, two-step novo-BLAST step. There, submitted Either below prespecified cutoff threshold (37Habermann Oegema power spectrometry-driven similarity searches.Mol. 238-249Abstract (134) 38Shevchenko Bork Ens Standing K.G. Charting proteomes MALDI-quadrupole BLAST homology searching.Anal. 2001; 73: 1917-1926Crossref (529) meeting certain criteria 39Junqueira Spirin Balbuena Thomas Adzhubei I. pipeline homology-driven proteomics.J. 71: 346-356Crossref 40Waridel Frank Surendranath similarity-driven unknown LC-MS/MS automated sequencing.Proteomics. 2318-2329Crossref (91) 41Wielsch Waridel Rapid identifications borderline confidence MS searches.J. 2448-2456Crossref (40) then subjected interpretation subsequent search. choice critical, depend further relies hypothesis found organism, mutated might compiling numerous organisms. applicable cases available, may some combined database-driven strategies. it helpful phylogenetically exist, same line heritage. Then, gathered references. If heritage (with interest), mutations should included closest relative; adding another relative give (but rather basis positive identifications). One difficulty shared lies question allow allowing random hits incurring absurd times. imperative prematurely. 
varying amounts arbitrary need avoided Any reduce amount introduce builds adjusts adaptively appropriate regularization. This regularization does require thresholds, trades attainable goodness fit inflation ion spectra, allows restrict quick selection small extensive performed necessary per increasing accordingly. To avoid hits, incur penalty admitted resulting outweighs penalty. overall detailed Fig. 1. In (Fig. 1, A–C), mapped candidate reasonable choices function (fit) (distmodify) λ, parameter quantifies tradeoff D E). Then derive bound early termination 1F). requires exhaustive entire 1G). motivate strategy determine level estimate 1H) aggregate 1I).

Footnote 1. The abbreviations used are: BICEPS, Bayesian information criterion-driven error-tolerant peptide search; BIC, Bayesian information criterion; FDR, false discovery rate; PSM, peptide spectrum match; CHO, Chinese hamster ovary.

general, understood problem. formalize description mathematical terms our approach. Given a spectrum S, a peptide sequence M, and a set of positions A, we define modify: (M,A) → M as a modification of the sequence at the positions A (e.g. a substitution or modification), and distmodify: (M,M) → [0,∞) as a function that measures the difference between the modified and the original sequence. Further, fit: (M,S) → [0,1] measures how well a sequence fits the given spectrum. Finally, a parameter λ ∈ [0,∞) weights the distance. The problem of finding the best matching peptide can then be rewritten as

$$\underset{\mathrm{modify}(M,A)}{\operatorname{argmax}} \; \mathrm{fit}\bigl(\mathrm{modify}(M,A),\, S\bigr) \;-\; \lambda \times \mathrm{dist}_{\mathrm{modify}}\bigl(\mathrm{modify}(M,A),\, M\bigr) \qquad \text{(Eq. 1)}$$

The first term of the equation is the fit; the second is the distance weighted by λ. The goal combines both aspects. For λ → ∞, any departure from the database sequence carries an infinite penalty; setting λ = 0 imposes no penalty for departing from the database. case, irrelevant scoring replaced remaining value provides focus contribution

Tags DirecTag (42Tabb Martin D.B. DirecTag: Accurate scoring.J. 3838-3846) Pepnovo (43Frank PepNovo: probabilistic network modeling.Anal. 77: 964-973) Up 20 length 5 computed approach, screened error-tolerance. combination increases sensitivity maintaining specificity. following generation, thorough supplemental material. Once detectable collected, specific identified. Including complete would comprise lead potential hits. restricting decrease miscleaved nontryptic missed. ideas, cutout procedure subsequence comprising adjacent excised so measured mass.
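The penalized objective of Eq. 1 can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the distance is the number of substituted positions (Hamming distance) and that the fit function is supplied by the caller. The names `hamming`, `penalized_score`, and `best_candidate` are hypothetical and are not taken from the BICEPS implementation.

```python
from typing import Callable, Iterable, Tuple


def hamming(candidate: str, original: str) -> int:
    """Count differing positions; an assumed stand-in for dist_modify."""
    if len(candidate) != len(original):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(candidate, original))


def penalized_score(candidate: str, original: str,
                    fit: Callable[[str], float], lam: float) -> float:
    """Eq. 1 for one candidate: fit minus lambda times the distance."""
    return fit(candidate) - lam * hamming(candidate, original)


def best_candidate(original: str, candidates: Iterable[str],
                   fit: Callable[[str], float], lam: float) -> Tuple[str, float]:
    """Return the candidate that maximizes the penalized score."""
    scored = [(c, penalized_score(c, original, fit, lam)) for c in candidates]
    return max(scored, key=lambda pair: pair[1])


# Toy fit values in [0, 1]; these numbers are invented for the illustration.
toy_fit = {"PEPTIDEK": 0.40, "PEPSIDEK": 0.70, "AEPSIDEK": 0.75}
print(best_candidate("PEPTIDEK", toy_fit, toy_fit.get, lam=0.1))
```

With a small λ the single substitution wins because its gain in fit exceeds the penalty; with a large λ the unmodified database sequence is kept, and with λ = 0 the best-fitting sequence is chosen regardless of its distance.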
add both ensure case heavier lighter 1C). spectrum, fast scheme extend triply charged ions. applies hypergeometric (44Sadygov R.G. probability databases.Anal. 3792-3798) yielding (fit()). logarithmic goes hand consideration incorporate prior knowledge likelihood varies significantly. justice procedure. Rather penalizing
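The hypergeometric scoring mentioned above can be sketched as a tail probability: the chance of matching at least the observed number of fragment ions by random draws alone. The following Python sketch is a generic illustration under assumed parameter names (`population`, `theoretical`, `peaks`, `matched`). How BICEPS bins fragment masses and maps this probability onto fit() is not specified here, so the sketch should not be read as the authors' scoring function.

```python
from math import comb


def hypergeom_tail(population: int, theoretical: int,
                   peaks: int, matched: int) -> float:
    """P(X >= matched) for a hypergeometric random variable X.

    population  -- number of distinguishable fragment mass bins (assumed)
    theoretical -- bins occupied by the candidate's theoretical fragments
    peaks       -- number of observed peaks, treated as random draws
    matched     -- observed peaks that coincide with theoretical fragments
    """
    if not (0 <= theoretical <= population and 0 <= peaks <= population):
        raise ValueError("theoretical and peaks must lie in [0, population]")
    upper = min(theoretical, peaks)
    lower = max(matched, 0)
    total = comb(population, peaks)
    # math.comb returns 0 when k > n, so impossible terms vanish.
    favorable = sum(
        comb(theoretical, k) * comb(population - theoretical, peaks - k)
        for k in range(lower, upper + 1)
    )
    return favorable / total


# A small tail probability means the match is unlikely to be random.
print(hypergeom_tail(population=1000, theoretical=20, peaks=50, matched=8))
```

A small tail probability indicates a match that is unlikely to arise by chance; taking its logarithm is one common way to turn it into an additive score.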
REFERENCES (53)
CITATIONS (25)