Annotation of the Zebrafish Genome through an Integrated Transcriptomic and Proteomic Analysis

Proteome Proteogenomics Cancer genome sequencing Gene prediction Gene Annotation
DOI: 10.1074/mcp.m114.038299 Publication Date: 2014-07-25T02:46:22Z
ABSTRACT
Accurate annotation of protein-coding genes is one of the primary tasks upon the completion of whole genome sequencing of any organism. In this study, we used an integrated transcriptomic and proteomic strategy to validate and improve the existing zebrafish genome annotation. We undertook high-resolution mass-spectrometry-based proteomic profiling of 10 adult organs, whole adult fish body, and two developmental stages of zebrafish (SAT line), in addition to transcriptomic profiling of six organs. More than 7,000 proteins were identified from proteomic analyses, and ∼69,000 high-confidence transcripts were assembled from the RNA sequencing data. Approximately 15% of the transcripts mapped to intergenic regions, the majority of which are likely long non-coding RNAs. These high-quality transcriptomic and proteomic data were used to manually reannotate the zebrafish genome. We report the identification of 157 novel protein-coding genes. In addition, our data led to modification of existing gene structures including novel exons, changes in exon coordinates, changes in frame of translation, translation in annotated UTRs, and joining of genes. Finally, we discovered four instances of genome assembly errors that were supported by both proteomic and transcriptomic data. Our study shows how an integrative analysis of the transcriptome and the proteome can extend our understanding of even well-annotated genomes.

Zebrafish (Danio rerio) is an important vertebrate model organism that has been widely used in biomedical research in several areas, including developmental biology, disease biology, toxicology, and behavior. The latest zebrafish genome assembly, Zv9, was released in October 2011 and combines the advantages of clone-by-clone sequencing and whole genome shotgun sequencing technologies. Approximately 83% of the sequences were generated from capillary sequencing of clones, with the gaps filled by whole genome shotgun reads generated via next-generation sequencing (1Howe K. Clark M.D. Torroja C.F. Torrance J. Berthelot C. Muffato M. Collins J.E. Humphray S. McLaren Matthews L. Sealy I. Caccamo Churcher Scott Barrett J.C. Koch R. Rauch G.J. White Chow W. Kilian B. Quintais L.T. Guerra-Assuncao J.A. Zhou Y. Gu Yen Vogel J.H. Eyre T. Redmond Banerjee Chi Fu Langley E. Maguire S.F. Laird G.K. Lloyd D. Kenyon Donaldson Sehra H. Almeida-King Loveland Trevanion Jones Quail Willey Hunt A. Burton Sims McLay Plumb Davis Clee Oliver Riddle Eliott Threadgold G. Harden Ware Mortimer Kerry Heath P. Phillimore Tracey Corby N.
Dunn Johnson Wood Pelan Griffiths Smith Glithero Howden Barker Stevens Harley Holt Panagiotidis Lovell Beasley Henderson Gordon Auger Wright Raisen Dyer Leung Robertson Ambridge Leongamornlert McGuire Gilderthorp Manthravadi Nichol Whitehead Kay Brown Murnane Gray Humphries Sycamore Saunders Wallis Babbage Hammond Mashreghi-Mohammadi Barr Martin Wray Ellington Ellwood Woodmansey Cooper Tromans Grafham Skuce Pandian Andrews Harrison Kimberley Garnett Fosker Hall Garner Kelly Bird Palmer Gehring Berger Dooley C.M. Ersan-Urun Z. Eser Geiger Geisler Karotki Kirn Konantz Oberlander Rudolph-Geiger Teucke Osoegawa Zhu Rapp Widaa Langford Yang F. Carter N.P. Harrow Ning Herrero Searle S.M. Enright Plasterk R.H. Lee Westerfield de Jong P.J. Zon L.I. Postlethwait Nusslein-Volhard Hubbard T.J. Roest Crollius Rogers Stemple D.L. The zebrafish reference genome sequence and its relationship to the human genome. Nature. 2013; 496: 498-503). The assembly includes 25 chromosomes along with 995 contigs that could not be assembled into chromosomes. The extent and quality of the annotation ultimately determine the usefulness of the genome assembly itself. The current Ensembl gene set (genebuild release 75) includes 56,754 transcripts corresponding to 33,737 genes. This includes annotations from the Ensembl automated pipeline, a VEGA manual annotation pipeline (VEGA Release 55), and transcript models derived from RNA-Seq data from five tissues and seven developmental stages (2Collins Incorporating RNA-seq data into the zebrafish Ensembl genebuild. Genome Res. 2012; 22: 2067-2078).

1 The abbreviations used are: RNA-Seq, RNA sequencing; PSM, peptide spectrum match; CPAT, Coding-Potential Assessment Tool; FDR, false discovery rate.

Shotgun proteomics NextGen have great potential assist through as well strategies. With advancements methods for processing, increasing number studies being carried out analyzed manner (3Lundberg Fagerberg Klevebring Matic Cox Algenas Lundeberg Mann Uhlen Defining the transcriptome and proteome in three functionally different human cell lines. Mol. Syst. Biol. 2010; 6: 450; 4Evans V.C. Heesom K.J. Fan Bessant D.A.
De novo derivation proteomes transcriptomes protein identification.Nat. Methods. 9: 1207-1211Crossref (128) There also reports coding loci using alone or combination (5Peterson E.S. McCue L.A. Schrimpe-Rutledge A.C. Jensen J.L. Walker Kobold M.A. Webb S.R. Payne S.H. Ansong Adkins J.N. Cannon W.R. Webb-Robertson B.J. VESPA: software facilitate genomic prokaryotic organisms integration data.BMC Genomics. 13: 131Crossref (30) 6Mohien C.U. Colquhoun D.R. Mathias D.K. Gibbons J.G. Armistead J.S. Rodriguez M.C. M.H. Edwards N.J. Hartler Thallinger G.G. Graham Martinez-Barnetche Rokas Dinglasan R.R. A bioinformatics approach comparative analyses non-sequenced anopheline vectors malaria parasites.Mol. Cell. Proteomics. 12: 120-131Abstract Full Text PDF (18) previous efforts successfully demonstrated power proteogenomic improving annotation, exemplified on Mycobacterium tuberculosis, Candida glabrata, Leishmania donovani, Anopheles gambiae, Homo sapiens (7Chaerkady Kelkar D.S. Muthusamy Kandasamy Dwivedi S.B. Sahasrabuddhe N.A. Kim M.S. Renuse Pinto Sharma Pawar Sekhar N.R. Mohanty A.K. Getnet Zhong Dash A.P. MacCallum R.M. Delanghe Mlambo Kumar Keshava Prasad T.S. Okulate Pandey gambiae Fourier transform mass spectrometry.Genome 2011; 21: 1872-1881Crossref (47) 8Prasad Harsha H.C. Keerthikumar Selvan L.D. Subbannayya Chaerkady Mathur P.P. Ravikumar Proteogenomic glabrata high resolution spectrometry.J. Proteome 11: 247-260Crossref (37) 9Kelkar Balakrishnan Yadav Shrivastava Marimuthu Anand Sundaram Kingsbury Nair Chauhan Katoch V.M. Ramachandran tuberculosis spectrometry.Mol. (M111.011627)Abstract 10Pawar G.S. Venugopal Nemade Khobragade S.N. Patole map unsequenced pathogen - donovani.Proteomics. 832-844Crossref (40) 11Kim Nirujogi R.S. Manda S.S. Madugundu Isserlin Jain Thomas J.K. Leal-Rojas Advani George L.D.N. Patil A.H. Nanjappa V. Radhakrishnan Raju Sreenivasamurthy S.K. Sathe Chavan Datta K.K. Sahu Yelamanchi S.D. Jayaram Rajagopalan Murthy K.R. 
Syed Goel Khan Ahmad Dey Mudgal Chatterjee Huang Wu X. Shaw P.G. Freed Zahari Mukherjee Shankar Mahadevan Lam Mitchell C.J. Satishchandra Schroeder J.T. Sirdeshmukh Maitra Leach Drake C.G. Halushka M.K. T.S.K. Hruban Kerr C.L. Bader G.D. Iacobuzio-Donahue C.H. Gowda draft proteome.Nature. 2014; 509: 575-581Crossref (1494) Here, use in-depth refine (Fig. 1). (RNA-Seq) 69,206 transcripts, 22,585 9,404 transcribed loci. total, 6,975 stages. employed various strategies included searching spectra against custom databases, six-frame translated database, RNA-Seq prediction set. To reduce positives (12Gupta Bandeira Keich U. Pevzner P.A. Target-decoy rate: when things may go wrong.J. Am. Soc. Mass Spectrom. 1111-1120Crossref (112) 13Blakeley Overton I.M. S.J. Addressing statistical biases nucleotide-derived databases search strategies.J. 5221-5234Crossref (64) Scholar), verified matches (PSMs) each these searches. Novel peptides obtained only good-quality spectral considered improvement. Apart genes, significant findings include errors, splice forms, alternate translational start sites. genetically defined SAT line (Sanger AB Tübingen) procured cultured in-house facility. Muscle, liver, intestine/pancreas, testis, eye, spleen dissected collected RNAlater ice before extraction. Total isolated organ Qiagen RNeasy Kit (Qiagen, Inc., Carlsbad, CA) according manufacturer's protocol. organs/tissues performed protocol Illumina TruSeq Sample Preparation SBS v3 (Illumina, San Diego, CA). Briefly, determined Agilent Bioanalyzer Nano 6000 chip. library construction started 500 ng total then subjected poly(A)+ selection fragmentation. Followed first second strand synthesis, cDNA end repair, adenylation 3′ ends, adapter ligation. One unique indices individual sample. After AMPure XP magnetic bead (Beckman Coulter, Brea, clean-up, sample 15 cycles PCR amplification ABI 9700 thermal cycler. size distribution checked DNA 1000 libraries showed between 200 bp peak at ∼260 bp. 
All carefully quantitated Qubit 2.0 fluorometer (Invitrogen, Grand Island, NY) stored microfuge tubes (Invitrogen) −20 °C freezer. cluster generation done V3 flow lane, repeated lanes, concentration ∼8.6 pm. Illumina's HiScanSQ system (Illumina) kit 50 paired reads. filtered Phred-based base (Q > 20) FastX tools. 99% passed threshold downstream steps. TopHat (version 1.4.1) default parameters align Zv9 (14Trapnell Pachter Salzberg S.L. TopHat: discovering junctions RNA-Seq.Bioinformatics. 2009; 25: 1105-1111Crossref (8995) Transcript Cufflinks 2.0). RABT (Reference Annotation Based Assembly) option used. An coordinate file (.gtf) provided file. Transcripts separately combined Cuffcompare. categorized (class codes) known isoforms, Cuffcompare (15Trapnell Roberts Goff Pertea Kelley Pimentel Rinn Differential expression experiments Cufflinks.Nat. Protoc. 7: 562-578Crossref (168) From filtering shown supplemental Fig. S1. all fragments per kilobase million (FPKM) ≥ 1. remaining set, class codes e, p, c, o, s eliminated. u, i, x, o (multi-exonic), smaller 250 = j retained. evidence retained regardless their code size. Protein predicted sets CPAT (16Wang Park H.J. Dasari Wang Kocher J.P. Li CPAT: alignment-free logistic regression model.Nucleic Acids 41: e74Crossref (1076) had probability greater 0.38 potentially transcripts. Different organs (eye, brain, spleen, ovary, muscle, heart, head) ∼100 strain). embryos 48 120 h post-fertilization. samples lysed 2% SDS lysis buffer 8 m urea buffer. Lysates homogenized sonicated, estimation followed. Proteins lysates separated SDS-PAGE, in-gel digestion described previously (17Amanchy Kalume D.E. Stable isotope labeling amino acids culture (SILAC) studying dynamics abundance posttranslational modifications.Sci. STKE. 2005; 2005: l2Google bands destained, reduced alkylated, trypsin Lys-C 8:1 ratio. extracted, vacuum dried, −80 until further analysis. 1-mg in-solution digestion. Samples reduced, digested (8:1) overnight 37 °C. 
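The transcript filtering and coding-potential steps mentioned in this section (FPKM ≥ 1, removal of certain Cuffcompare class codes, retention of transcripts with peptide evidence, and a CPAT coding probability cutoff of 0.38) can be illustrated with a minimal sketch. The field names and the `Transcript` record are hypothetical, the size-based rule is omitted because its exact form is not recoverable from the text, and treating every class code "o" transcript as eliminated is an assumption. This is not the authors' pipeline.

```python
from dataclasses import dataclass

# Thresholds taken from the text.
MIN_FPKM = 1.0       # transcripts need FPKM >= 1
CPAT_CUTOFF = 0.38   # coding probability above this is "potentially coding"

# Class codes listed as eliminated. Assumption: "o" is dropped entirely here,
# although the source seems to keep some multi-exonic "o" transcripts.
ELIMINATED_CLASS_CODES = {"e", "p", "c", "o", "s"}


@dataclass
class Transcript:
    """Hypothetical record for one assembled transcript."""
    transcript_id: str
    class_code: str              # Cuffcompare class code, e.g. "=", "j", "u"
    fpkm: float
    coding_probability: float    # CPAT output
    has_peptide_evidence: bool = False


def keep_transcript(t: Transcript) -> bool:
    """Apply the expression filter first, then the class-code filter.

    Assumption: peptide evidence overrides the class-code rule but not
    the FPKM rule.
    """
    if t.fpkm < MIN_FPKM:
        return False
    if t.has_peptide_evidence:
        return True
    return t.class_code not in ELIMINATED_CLASS_CODES


def is_potentially_coding(t: Transcript) -> bool:
    """Classify a transcript as potentially protein-coding using the CPAT cutoff."""
    return t.coding_probability > CPAT_CUTOFF


if __name__ == "__main__":
    candidates = [
        Transcript("t1", "u", 2.0, 0.50),
        Transcript("t2", "e", 5.0, 0.10),
        Transcript("t3", "e", 5.0, 0.10, has_peptide_evidence=True),
        Transcript("t4", "=", 0.5, 0.90),
    ]
    for t in candidates:
        print(t.transcript_id, keep_transcript(t), is_potentially_coding(t))
```

Keeping the two decisions in separate functions lets the expression and class-code filter run before, and independently of, the coding-potential classification.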
digests desalted C18 cartridge lyophilized. lyophilized reconstituted basic reverse-phase liquid chromatography solvent (10 mm tetraethyl ammonium bicarbonate (TEABC), pH 8.5), loaded XBridge 5 μm × 4.6 column (Waters, Milford, MA), eluted 0% 100% B TEABC acetonitrile, 8.5) 50-min gradient. fractions dried pooled concatenation 24 fractions. Enrichment N-terminal acetylated slightly modified Taouatas et al. (18Taouatas Altelaar A.F. Drugan M.M. Helbig A.O. Mohammed Heck A.J. Strong cation exchange-based fractionation Lys-N-generated facilitates targeted post-translational modifications.Mol. 8: 190-200Abstract (68) purified extracts pairs fractionated polysulfoethyl (PolyLC, Columbia, MD; 2.1 μm, Å) low-ionic-strength (solvent A, KH2PO4, 30% acetonitrile 2.7; B, 350 KCl, acetonitrile). re-fractionated chromatography. Peptide Q-TOF 6540 spectrometer interfaced HPLC Chip (Agilent Technologies, Santa Clara, CA.). (0.1% formic acid) onto chip trap 1200 series system. Both analytical columns embedded made up Zorbax 300SB-C18 5-μm particle gradient 5% 40% acid 90% acetonitrile) over min. operated voltage 1800 V, fragmentor 175 medium isolation width 4 m/z, energy slope 3 V plus 2-V offset. MS acquired MassHunter acquisition (Version B.04.00, Technologies). range m/z 350–1,800, followed MS/MS scan 50–2,000. duty cycle scans second. precursor based preference charge state order 2+, 3+, >3+ ions level abundance. Additionally, testis Orbitrap Velos spectrometer. Enriched N-terminally Elite analyzer 60K 15K settings, respectively. Fragmentation higher-energy collisional dissociation mode. processed generate Mascot generic format files (B.04.00) Discoverer 1.3. searched database Ensembl-HAVANA (release 70) common contaminants like trypsin, keratin, BSA (42,200 sequences). 1.3 (Thermo Scientific, Bremen, Germany) Sequest (SCM build 59) 2.2) algorithms. protease missed cleavage allowed. 
Carbamidomethylation of cysteine was specified as a fixed modification, and oxidation of methionine as a variable modification. minimum length acids. The mass error tolerance for parent ions was 20 ppm, whereas for fragment ions it was 0.05 Da. The LC-MS/MS data were also searched against a reversed-sequence database to calculate the 1% false discovery rate (FDR) score. The FDR for a given PSM score was calculated as FDR (%) = (number of hits in the reverse database above the score / total number of hits in the target database above the score) × 100 (19Kall Storey J.D. MacCoss M.J. Noble W.S. Assigning significance tandem spectrometry decoy databases. J. 2008; 29-34). parsimonious list grouping Discoverer. For quantitative analysis, intensity- PSM-number-based values similar lines intensity-based absolute quantification (20Schwanhausser Busse Dittmar Schuchhardt Wolf Chen Selbach Global mammalian control. Nature. 473: 337-342). sum intensities PSMs belonging divided possible tryptic gene. value normalized across dividing experiment (e.g. brain fractions). ratio log2 transformed. Gene functional classification Web-based DAVID resource, significantly enriched (p < 0.5) selected (21Jiao Sherman B.T. da Stephens Baseler M.W. Lane Lempicki R.A. DAVID-WS: stateful web service gene/protein analysis. Bioinformatics. 28: 1805-1806). Seven alternative identifying (i) (ii) (from study), (iii) (iv) graph split junctions, (v) ab initio GENSCAN, (vi) hypothetical (vii) three-frame RNAs Common contaminant added prior searching. created reversing databases. X!Tandem CYCLONE, 2011.12.01) unless otherwise specified. Search searches (a) allowed (b) Da, (c) carbamidomethylation (d) (e) consideration cleavage. validation apart FDR. Specific details about parameters, post-processing outcomes below. version downloaded FTP server. Similarly, JHU-IOB Sanger Perl API "other features" database. (Transcripts falling categories =, j, c frames; other frames.) thus consisted stop-codon-to-stop-codon template sequence. Sequences shorter included. read alignments. non-redundant compact junction introns.
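The target-decoy FDR estimate described in this section (hits in the reversed database above a score, divided by hits in the target database above the same score, times 100) can be sketched as follows. The function names and the use of plain score lists are assumptions made for illustration. "Above the score" is implemented here as greater than or equal to the threshold, which is also an assumption.

```python
from typing import Optional, Sequence


def fdr_percent(target_scores: Sequence[float],
                decoy_scores: Sequence[float],
                threshold: float) -> float:
    """FDR (%) = decoy hits at or above threshold / target hits at or above threshold * 100."""
    target_hits = sum(1 for s in target_scores if s >= threshold)
    decoy_hits = sum(1 for s in decoy_scores if s >= threshold)
    if target_hits == 0:
        return 0.0  # no target hits pass, so nothing is reported at this threshold
    return 100.0 * decoy_hits / target_hits


def score_cutoff_for_fdr(target_scores: Sequence[float],
                         decoy_scores: Sequence[float],
                         max_fdr_percent: float = 1.0) -> Optional[float]:
    """Return the lowest target score whose estimated FDR is within the limit.

    Candidate thresholds are the observed target scores, scanned from the
    most permissive (lowest) upward. Returns None if no threshold qualifies.
    """
    for threshold in sorted(set(target_scores)):
        if fdr_percent(target_scores, decoy_scores, threshold) <= max_fdr_percent:
            return threshold
    return None


if __name__ == "__main__":
    targets = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
    decoys = [15, 35]
    print(fdr_percent(targets, decoys, 10))       # 2 decoys / 10 targets
    print(score_cutoff_for_fdr(targets, decoys))  # lowest score with FDR <= 1%
```

Scanning from the lowest score upward returns the most permissive cutoff that still meets the FDR limit, which maximizes the number of accepted PSMs.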
generating intervals (exonic regions) correspond nodes, edges exons putatively spliced together. detailed method creation conversion compatible FASTA found Ref. 22Woo Cha S.W. Merrihew He Castellana Guest Maccoss Bafna driven large scale data.J. 21-28Crossref (91) Scholar. MS-GFDB 20120106) algorithm. results algorithm explained Spectra matched multiple equal scores Prediction GENSCAN FTP, pseudogenes, unprocessed pseudogenes BioMart; frames. Additional acetylation Genscan fetching began ended K/R lengths ranging 6 considered. identification. searches, without subpeptides "quick acetyl" available compared (Ensembl Genebuild identify peptides. inspection validity major criteria evaluation assignment intense peaks (intense unassigned see whether they arising internal ions); y ions; low-m/z-range b ions, is, b1, b2, b3 a2 a4 observed typical spectrum; immonium ion indicated presence present assigned (if so, rejected); Y1 confirmed ending either K (m/z 147.11) R 175.12); (f) un-assigned present, especially higher range, part (g) cleavages acidic acids, E D; (h) many noise neutral loss
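The y1 ion check mentioned among the manual validation criteria (m/z 147.11 for a peptide ending in K, 175.12 for one ending in R) follows from standard monoisotopic masses: a singly charged y1 ion is the C-terminal residue plus water plus one proton. A minimal sketch, with hypothetical function names and an assumed matching tolerance of 0.02 m/z:

```python
from typing import Optional

# Standard monoisotopic masses in daltons.
RESIDUE_MASS = {
    "K": 128.094963,  # lysine residue
    "R": 156.101111,  # arginine residue
}
WATER_MASS = 18.010565
PROTON_MASS = 1.007276


def y1_mz(c_terminal_residue: str) -> float:
    """m/z of the singly charged y1 ion for the given C-terminal residue."""
    return RESIDUE_MASS[c_terminal_residue] + WATER_MASS + PROTON_MASS


def c_terminal_residue_from_y1(observed_mz: float,
                               tolerance: float = 0.02) -> Optional[str]:
    """Return "K" or "R" if the observed peak matches a tryptic y1 ion, else None.

    The tolerance is an illustrative assumption, not a value from the source.
    """
    for residue in RESIDUE_MASS:
        if abs(observed_mz - y1_mz(residue)) <= tolerance:
            return residue
    return None


if __name__ == "__main__":
    for residue in ("K", "R"):
        print(residue, round(y1_mz(residue), 2))
```

A peak near one of these two values supports a tryptic C terminus, which is how the criterion is used to confirm that a peptide ends in either K or R.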