Author manuscript; available in PMC: 2018 Mar 30.

Published in final edited form as: Nat Rev Genet. 2016 Oct 24;17(12):758–772. doi: 10.1038/nrg.2016.119

These workflows illustrate general annotation principles, not the specific pipelines of any particular genebuild.

a) Protein-coding genes within reference genomes were largely annotated based on the computational genomic alignment of Sanger-sequenced transcripts and protein-coding sequences, followed by manual annotation via interface tools such as Zmap¹, WebApollo²⁶, Artemis¹²⁶ or the Integrative Genomics Viewer¹²⁷. Transcripts were typically taken from GenBank¹²⁸ and proteins from Swiss-Prot¹².

b) Protein-coding genes within non-reference genomes are usually annotated based on fewer resources; here, RNA sequencing (RNA-seq) data are used in combination with protein homology information extrapolated from a closely related genome. RNA-seq pipelines for read alignment include STAR¹²⁹ and TopHat¹³⁰, whereas model creation is commonly performed by Cufflinks²².

c) Long non-coding RNA (lncRNA) structures can be annotated in a similar manner to protein-coding transcripts, as in (a) and (b), although coding potential must be ruled out. This is typically done by examining sequence conservation with phyloCSF¹³¹ or by using experimental datasets such as mass spectrometry or ribosome profiling. Here, 5’ Cap Analysis of Gene Expression (CAGE)⁴⁵ and polyA-seq data⁴⁶ are also incorporated to obtain true transcript endpoints. Designated lncRNA pipelines include PLAR⁴⁸.

d) Small RNAs are typically added to genebuilds by mining repositories such as RFAM¹³² or miRBase¹³³. These entries can also be used to search for additional loci based on homology.

e) Pseudogene annotation is based on the identification of loci with protein homology to either paralogous or orthologous protein-coding genes. Computational annotation pipelines include PseudoPipe⁵², although manual annotation is more accurate⁵³.

Finally, all annotation methods can be thwarted by the existence of sequence gaps in the genome assembly (right-angled arrow).
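The caption's closing point about assembly gaps can be illustrated with a minimal sketch that flags gene models whose coordinates overlap a gap. It is not taken from any pipeline named above: the `Interval` type, the `models_hit_by_gaps` helper, the half-open coordinate convention and the example coordinates are all assumptions made for illustration.

```python
from typing import Dict, List, NamedTuple


class Interval(NamedTuple):
    """A region on a sequence; coordinates are 0-based, half-open (assumed convention)."""
    seq: str
    start: int
    end: int


def overlaps(a: Interval, b: Interval) -> bool:
    """True if both intervals lie on the same sequence and share at least one base."""
    return a.seq == b.seq and a.start < b.end and b.start < a.end


def models_hit_by_gaps(
    models: Dict[str, Interval], gaps: List[Interval]
) -> Dict[str, List[Interval]]:
    """Map each gene model name to the assembly gaps it overlaps (unaffected models are omitted)."""
    hits: Dict[str, List[Interval]] = {}
    for name, region in models.items():
        touching = [gap for gap in gaps if overlaps(region, gap)]
        if touching:
            hits[name] = touching
    return hits


# Hypothetical gene models and a single hypothetical assembly gap.
models = {
    "geneA": Interval("chr1", 100, 500),
    "geneB": Interval("chr1", 900, 1200),
    "geneC": Interval("chr2", 100, 500),
}
gaps = [Interval("chr1", 450, 600)]

flagged = models_hit_by_gaps(models, gaps)
for name, touching in flagged.items():
    print(name, "overlaps", len(touching), "gap(s)")
```

Models flagged in this way would be candidates for closer manual inspection, since a gap may truncate or split a transcript structure regardless of which of the workflows in (a) to (e) produced it.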