Author manuscript; available in PMC: 2020 Sep 9.
Published in final edited form as: Annu Rev Genomics Hum Genet. 2020 May 18;21:55–79. doi: 10.1146/annurev-genom-121119-083418

Table 1. Evidence relevant to the annotation of different types of genes.

| Biotype | Transcription data (INSDC, RNA-seq, PacBio, ONT) | Terminal transcription data (CAGE, RAMPAGE, polyA-seq) | Protein homology data (UniProt) | Protein experimental data (MS, ribo-seq) | Conservation data (PhyloCSF, PhastCons, GERP) | RNA secondary structure data (Infernal) | External expert database (miRBase, Rfam, IMGT) |
|---|---|---|---|---|---|---|---|
| Protein coding | Yes | Yes | Yes | Yes | Yes | No | No |
| lncRNA | Yes | Yes | No | No | No | No | No |
| sRNA | Yes | No | No | No | Yes | Yes | Yes |
| Pseudogene | No^a | No^a | Yes | No | Yes^b | No | No |
| IG/TR | No | No | No | Yes | No | No | Yes |

This table illustrates the evidence types generally used by manual annotators in the Ensembl team to determine the correct structure and function of a transcript model. Protein-coding genes require transcriptomic evidence to define structure and terminal transcription data sets to define transcript start and end coordinates. Homology with UniProt and proteomics data informs or validates the decision to assign a transcript or locus the protein-coding biotype—that is, to decide whether a functional protein is encoded. Similarly, evolutionary conservation of sequence and of protein-coding potential also informs this decision. Decisions about protein-coding genes do not generally use RNA secondary structure or other expert databases, although they may be consulted on a case-by-case basis. The annotation of lncRNAs utilizes the same transcriptomic data sets as protein-coding genes; however, the absence of protein homology, experimental proteomics data, and conservation is a key determinant in choosing not to annotate a transcript as protein coding. For sRNAs, transcriptomic data sets, conservation data, RNA secondary structure data, and expert external databases are utilized. Pseudogenes are annotated based solely on their homology to annotated protein sequences, although transcriptomic data are used to support the transcribed pseudogene biotypes. IG/TR gene segments are annotated on the basis of protein experimental data and homology to IG/TR sequences from the IMGT database. 
Abbreviations: CAGE, cap analysis gene expression; GERP, Genomic Evolutionary Rate Profiling; IG, immunoglobulin; IMGT, International Immunogenetics Information System; Infernal, Inference of RNA Alignment; INSDC, International Nucleotide Sequence Database Collaboration; lncRNA, long noncoding RNA; miRBase, MicroRNA Database; MS, mass spectrometry; ONT, Oxford Nanopore Technologies; PacBio, Pacific Biosciences; PhyloCSF, Phylogenetic Codon Substitution Frequencies; polyA-seq, polyA sequencing; RAMPAGE, RNA annotation and mapping of promoters for the analysis of gene expression; ribo-seq, ribosome profiling; RNA-seq, RNA sequencing; sRNA, small RNA; TR, T cell receptor.

^a For nontranscribed pseudogenes only; transcribed pseudogenes may be supported by these data.

^b While pseudogenes are not conserved over large evolutionary distances, known artifacts in the whole-genome alignments on which conservation detection is based permit their identification with care.
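For readers who want to query Table 1 programmatically, the evidence matrix can be encoded as a small lookup structure. The following is a minimal sketch in Python, not part of any Ensembl tooling: the short key names and the helper functions `evidence_for` and `biotypes_using` are hypothetical names chosen for illustration, and the two footnoted pseudogene cells are reduced to plain booleans, which loses the nuance of footnotes a and b.

```python
# Evidence columns of Table 1, in table order.
# The short key names are illustrative, not taken from the source.
EVIDENCE_TYPES = (
    "transcription",             # INSDC, RNA-seq, PacBio, ONT
    "terminal_transcription",    # CAGE, RAMPAGE, polyA-seq
    "protein_homology",          # UniProt
    "protein_experimental",      # MS, ribo-seq
    "conservation",              # PhyloCSF, PhastCons, GERP
    "rna_secondary_structure",   # Infernal
    "external_expert_database",  # miRBase, Rfam, IMGT
)

# One row per biotype, transcribed from Table 1 (True = "Yes", False = "No").
TABLE_1 = {
    "protein_coding": (True, True, True, True, True, False, False),
    "lncRNA": (True, True, False, False, False, False, False),
    "sRNA": (True, False, False, False, True, True, True),
    # Footnote a: the two "No" transcription cells apply to nontranscribed
    # pseudogenes only. Footnote b: conservation is usable only with care.
    "pseudogene": (False, False, True, False, True, False, False),
    "IG/TR": (False, False, False, True, False, False, True),
}


def evidence_for(biotype):
    """Return the evidence types marked "Yes" for a biotype."""
    row = TABLE_1[biotype]
    return [name for name, used in zip(EVIDENCE_TYPES, row) if used]


def biotypes_using(evidence):
    """Return the biotypes whose row marks the given evidence type "Yes"."""
    column = EVIDENCE_TYPES.index(evidence)
    return [biotype for biotype, row in TABLE_1.items() if row[column]]


if __name__ == "__main__":
    print(evidence_for("lncRNA"))
    print(biotypes_using("external_expert_database"))
```

A tuple per row keeps the encoding visually close to the printed table, so a transcription error is easy to spot by comparing rows side by side.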