The Paragon Algorithm, a Next Generation Search Engine That Uses Sequence Temperature Values and Feature Probabilities to Identify Peptides from Tandem Mass Spectra
DOI:
10.1074/mcp.t600050-mcp200
Publication Date:
2007-05-28T00:13:16Z
AUTHORS (9)
ABSTRACT
The Paragon™ Algorithm, a novel database search engine for the identification of peptides from tandem mass spectrometry data, is presented. Sequence Temperature Values are computed using a sequence tag algorithm, allowing the degree of implication by an MS/MS spectrum of each region of a database to be determined on a continuum. Counter to conventional approaches, features such as modifications, substitutions, and cleavage events are modeled with probabilities rather than by discrete user-controlled settings to consider or not consider a feature. The use of feature probabilities in conjunction with Sequence Temperature Values allows for a very large increase in the effective search space with only a very small increase in the actual number of hypotheses that must be scored. The algorithm has a new kind of user interface that removes the user expertise requirement, presenting control settings in the language of the laboratory that are translated to optimal algorithmic settings. To validate this new algorithm, a comparison with Mascot is presented for a series of analogous searches to explore the relative impact of increasing search space probed with Mascot by relaxing the tryptic digestion conformance requirements from trypsin to semitrypsin to no enzyme and with the Paragon Algorithm using its Rapid mode and Thorough mode with and without tryptic specificity. Although they performed similarly for small search space, dramatic differences were observed in large search space. With the Paragon Algorithm, hundreds of biological and artifact modifications, all possible substitutions, and all levels of conformance to the expected digestion pattern can be searched in a single search step, yet the typical cost in search time is only 2–5 times that of conventional small search space. Despite this large increase in effective search space, there is no drastic loss of discrimination that typically accompanies the exploration of large search space.

This study presents a new software technology for the identification of peptides from tandem mass spectra called the Paragon Algorithm, hereafter referred to interchangeably as "Paragon." The most common application of this class of tools is in so-called "shotgun" or "bottom-up" proteomics experiments (1Aebersold R. Mann M. Mass spectrometry-based proteomics. Nature. 2003; 422: 198-207) where a protein mixture of any complexity is digested with a proteolytic reagent, analyzed by tandem mass spectrometry, and then this type of software is used to identify peptides (2Sadygov R.G. Cociorva D. Yates J.R. Large-scale database searching using tandem mass spectra: looking up the answer in the back of the book. Nat. Methods. 2004; 1: 195-202, 3Kapp E.A. Schütz F. Connolly L.M. Chakel J.A. Meza J.E. Miller C.A. Fenyo D. Eng J.K.
Adkins J.N. Omenn G.S. Simpson R.J. An evaluation, comparison, accurate benchmarking several publicly available algorithms: sensitivity specificity analysis. Proteomics. 2005; 5: 3475-3490) and, inference, determine which proteins have been detected (4Nesvizhskii A. Aebersold Interpretation shotgun proteomic data. Mol. Cell. Proteomics. 4: 1419-1440). 1Seymour, S. L., Loboda, A., Tang, W. H., Nimkar, S., Schaeffer, (2004) Poster at 52nd ASMS Conference Spectrometry Allied Topics, Nashville, TN (May 23–27, 2004). it currently much less common, also applied direct analysis endogenous result natural proteolysis organism (5Fricker L.D. Lim J. Pan H. Che F.-Y. Peptidomics: quantification neuroendocrine tissues. Mass Spectrom. Rev. 2006; 25: 327-344, 6Hardt Thomas L.R. Dixon S.E. Newport G. Agabian N. Prakobphol Hall S.C. Witkowska H.E. Fisher S.J. Toward defining human parotid gland salivary proteome peptidome: characterization 2D SDS-PAGE, ultrafiltration, HPLC, spectrometry. Biochemistry. 44: 2885-2899, 7Hardt Webb Assessing effects diurnal variation composition saliva: quantitative native iTRAQ reagents. Anal. Chem. 77: 4947-4954, 8Geho D.H. Liotta L.A. Petricoin E.F. Zhao Araujo R.P. amplified treasure chest candidate biomarkers. Curr. Opin. Biol. 10: 50-55, 9Villanueva Shaffer D.R. Philip Chaparro Erdjument-Bromage Olshen A.B. Fleisher Lilja Brogi E. Boyd Sanchez-Carbayo Holland E.C. Cordon-Cardo C. Scher H.I. Tempst P. Differential exoprotease activities confer tumor-specific serum peptidome patterns. J. Clin. Investig. 116: 271-284, 10Purcell A.W. Gorman J.J. Immunoproteomics: methods targets immune response. Mol. 3: 193-208) specifically focus peptide process. part larger package ProteinPilot™ Software, uses approach described here automatically conducts inference Pro Group™ discussed elsewhere. 2004)., 2Seymour, L.
(2005) PowerPoint presentation MCP Workshop: Criteria Publication Proteomic Data, Paris, France 12–13, 2005) (www.mcponline.org/misc/PariReport_PP.shtml)., 3S. Seymour, Patel, I. V. Shilov, manuscript preparation.

Protein fragmentation data bottom-up thought having four main stages: 1) preprocessing, 2) selection hypotheses, 3) scoring 4) inference. preprocessing stage 1 include conversion raw simplified peak lists, averaging deemed sufficiently similar, filtering considered unlikely yield good identification, etc. Most fall into one two categories differing how selected: approaches some de novo estimation information (11Mann Wilm Error-tolerant databases tags. Anal. 1994; 66: 4390-4399, 12Pappin D.J.C. Chemistry, peptide-mass databases: evolution rapid mapping cellular proteins. in: Burlingame A.L. Carr S.A. 3rd International Symposium Health & Life Sciences. Humana Press, Clifton, NJ 1994, 13Tabb D.L. Saraf GutenTag: high-throughput tagging via empirically derived model. Anal. 75: 6415-6421, 14Tanner Shu Frank Wang Zandi Mumby Pevzner P.A. Bafna InsPecT: posttranslationally modified spectra. Anal. 4626-4639, 15Taylor Johnson R.S. sequencing spectrometry. Rapid Commun. 1997; 11: 1067-1075, 16Tsur Tanner Identification post-translational modifications blind spectra. Nat. Biotechnol. 23: 1562-1567, 17Clauser K.R. Baker P.R. Role measurement (±10 ppm) strategies employing MS searching. Anal. 1999; 71: 2871-2882), whereas precursor rely filter (17Clauser 18Eng McCormack III, correlate spectral amino acid sequences database. J. Am. Soc. 976-989, 19Perkins D.N. Pappin D.J. Creasy D.M. Cottrell J.S. Probability-based data. Electrophoresis. 20: 3551-3567, 20Bafna Edwards SCOPE: probabilistic model against database. Bioinformatics. 2001; 17: S13-S21, 21Craig Beavis R.C.
TANDEM: matching spectra. Bioinformatics. 1466-1467, 22Field Fenyö RADARS, bioinformatics solution automates analysis, optimises archives relational database. Proteomics. 2002; 2: 36-47, 23Colinge Masselot Giron Dessingy T. Magnin OLAV: towards identification. Proteomics. 1454-1463, 24Tang W.H. Halpern B.R. Shilov I.V. Seymour S.L. Keating S.P. Loboda Patel A.A. Schaeffer D.A. Nuwaysir Discovering known unanticipated 3931-3946, 25Chalkley Huang Hansen K.C. Allen N.P. Rexach Comprehensive multidimensional liquid chromatography dataset acquired quadrupole selecting collision cell, time-of-flight spectrometer. II. New developments Protein Prospector allow reliable comprehensive automatic datasets. Mol. 1194-1204). goal both gain efficiency constraining universe smaller tractable manual inspection.

In methods, sequence(s) automated full partial initial constraint. In earliest example method section sequence, "sequence tag," would manually interpreted provided their along masses unsequenced regions flanking tag. They three pieces, preceding tag, following "peptide tag." was subsequently scanned find matches elements "error-tolerant" mode, required match, successful even presence unsuspected modifications. At same time, co-workers developing similar (12Pappin now exists "Sequence query" (26Pappin Rahman H.F. Bartlet-Jones Jeffery Bleasby A.J. Biological Totowa, NJ 1996: 135-150) sequence-based implemented forms, including MS-Seq More recently, category attempt call stretches particularly "homology searching" problem species interest poorly represented (27Shevchenko Sunyaev Shevchenko Bork Ens Standing K.G. Charting proteomes organisms genomes MALDI-quadrupole BLAST homology 73: 1917-1926, 28Liska EST multiTag software. Proteomics. 4118-4122, 29Sunyaev Liska Golod MultiTag: multiple error-tolerant sequence-similarity spectrometry. Anal.
1307-1315) tags derive metrics quality step precursor-type (30Ma B. Zhang K. Hendrie Liang Li Doherty-Kirby Lajoie PEAKS: powerful 2337-2342).

In algorithms, MS/MS-derived used, selected solely basis theoretical peptide. exhaustively enumerated given constraints allowed rules, match within prescribed tolerance scoring. brute force approach, dominant current use, eclipsing tags. engines, "MS/MS ions" (19Perkins SEQUEST (18Eng type. reason almost certainly ease often require sequencing.

Despite being algorithms should, theory, more selectivity during hypothesis giving potential faster well. However, addition practical high throughput applications, come significant risk: incorrect may exclude right consideration. Initially tag-based relied per assumption made interpretation correct. That is, hard filter; portions considered. Newer GutenTag (13Tabb InsPecT (14Tanner offered improvements determining sets many restrict contain least tags.

Although broadly do limitations. Unlike will prevent ever identified. For example, if N-terminally acetylated, but search, wrong answers returned It might seem simply variations search. feasible, however, because bring combinatorial explosion additional need scored, yielding unacceptable poor practice, upper limit what engines around 6–10 Partly challenges analyses fraction total acquired, roughly 5–20% low resolution ion trap instruments (3Kapp 31von Haller P.D. Yi Donohoe Vaughn Keller Nesvizhskii A.I. X.-j. Goodlet Watts J.D. profiling ICAT spectrometry: Statistically annotated identified co-purifying T cell lipid rafts. Mol. 426-427) 15–70% (24Tang 32Chalkley Medzihrahszky K.F. How theoretically interpretable engines? Mol.
1189-1193) cases, 2–3-fold sufficient go unidentified unexpected cleavages, monoisotopic assignments, charge state determinations, substitutions considered, frequency relatively small, collectively allowance frequent account spectra, thus desirable ways improve space.

The identification. contrast recent advances relies key innovations nothing stage. Our efforts focused stage, driven belief greater improvement score, score it. First, likely relevance segment quantified continuum weighted compute Sequence Temperature Value (STV). 4The abbreviations are: STV, Sequence Temperature Value; ROC, receiver operating characteristic; ID, identification; SS, Search Space; CDS, Celera Discovery System. Second, formally frequencies events, net probability hypothesis. great reduction through implementation translation layer between describes understands. Third, overall threshold effect STV probabilities, highly selective triage worth assessment evidence efficient balancing effort commensurate likelihood related correct "searched extensively" sense lower combined weakly implicated segments less," probable features.

The offers performance informatics barrier doing while maintaining automation engines. fundamental description validation technology.

RESULTS

Paragon Components—Fig. 1A diagram core components invoked depending needs particular first component, Fraglet, essentially standard mass-filtered run isolation second component case digest searching. Taglet, component. always "separate pass" well searches. coordination controlled based input. our previous Interrogator designed work directly create index file. indexing types faster, decided flexibility support different digestions, filters, important. A components. scored p value absolute measure chance randomly fragment ions spectrum, ignoring homology. generally done b y ions. percent confidence taking other distinct these basic measure, various attributes peptide, j, Equation 1.

\[
\mathrm{Confidence}_j = \frac{f(p\text{-value})_j \,(p_{\mathrm{hypothesis}})_j}{\sum_{i=1}^{n} f(p\text{-value})_i \,(p_{\mathrm{hypothesis}})_i} \qquad \text{(Eq. 1)}
\]
summation denominator includes member set identical peptides. (where among ambiguous actually right) brings beneficial competitive element dilutes confidences cases dissimilar marginal matches. hypothesis, $p_{\mathrm{hypothesis}}$, independent fragmentation,

\[
p_{\mathrm{hypothesis}} = \prod_{f=1}^{m} p_f \qquad \text{(Eq. 2)}
\]

$p_f$ factors lack termini patterns, consistency ratio. cysteine alkylation modification higher neither end conforming missing modification. We estimate measuring occurrences lysine proline could estimated follows.

\[
p_{\mathrm{cleavage}}(\mathrm{K{-}P}) = \frac{v_{\mathrm{cleaved\ K{-}P}}}{v_{\mathrm{cleaved\ K{-}P}} + v_{\mathrm{uncleaved\ K{-}P}}} \qquad \text{(Eq. 3)}
\]

Clearly vary putatively treated way. found Paragon, estimating average values samples captures enough variation, importantly, proven quite robust rough estimates sufficient.

Although precision internally, never reported 99.00% place
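
As an illustration only, the three relations described above (the frequency-based estimate of a K-P cleavage probability, the hypothesis probability as a product of feature probabilities, and the confidence normalized over competing hypotheses) can be sketched in a few lines of Python. This is not the Paragon implementation. The function names are hypothetical, the numbers in the usage note are invented, and the text does not specify the form of f(p-value), so the default `f = 1/p` below is purely an assumption chosen to make the sketch runnable.

```python
from math import prod
from typing import Callable, Sequence


def cleavage_probability(n_cleaved: int, n_uncleaved: int) -> float:
    """Frequency estimate in the spirit of Eq. 3: cleaved / (cleaved + uncleaved)."""
    total = n_cleaved + n_uncleaved
    if total == 0:
        raise ValueError("no observed sites to estimate a frequency from")
    return n_cleaved / total


def hypothesis_probability(feature_probs: Sequence[float]) -> float:
    """Eq. 2: product of the per-feature probabilities p_f of one hypothesis."""
    return prod(feature_probs)


def confidences(
    p_values: Sequence[float],
    hypothesis_probs: Sequence[float],
    # ASSUMPTION: the source does not define f; 1/p is a placeholder only.
    f: Callable[[float], float] = lambda p: 1.0 / p,
) -> list:
    """Eq. 1: weight each competing hypothesis by f(p-value) * p_hypothesis,
    then divide by the sum of the weights over all competitors."""
    if len(p_values) != len(hypothesis_probs):
        raise ValueError("need one p-value and one p_hypothesis per candidate")
    weights = [f(p) * h for p, h in zip(p_values, hypothesis_probs)]
    total = sum(weights)
    return [w / total for w in weights]
```

For example, with invented counts of 5 cleaved and 95 uncleaved K-P sites, `cleavage_probability(5, 95)` gives 0.05. Passing that value as the only feature probability of a weakly matching candidate, alongside a strongly matching candidate with feature probabilities 0.9 and 0.8, `confidences([1e-6, 1e-3], [hypothesis_probability([0.9, 0.8]), 0.05])` assigns nearly all of the confidence to the first candidate, and the values sum to 1 because of the shared denominator.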