Table 1. Evidence relevant to the annotation of different types of genes.
Biotype | Transcription data (INSDC, RNA-seq, PacBio, ONT) | Terminal transcription data (CAGE, RAMPAGE, polyA-seq) | Protein homology data (UniProt) | Protein experimental data (MS, ribo-seq) | Conservation data (PhyloCSF, PhastCons, GERP) | RNA secondary structure data (Infernal) | External expert database (miRBase, Rfam, IMGT) |
---|---|---|---|---|---|---|---|
Protein coding | Yes | Yes | Yes | Yes | Yes | No | No |
lncRNA | Yes | Yes | No | No | No | No | No |
sRNA | Yes | No | No | No | Yes | Yes | Yes |
Pseudogene | No^a | No^a | Yes | No | Yes^b | No | No |
IG/TR | No | No | No | Yes | No | No | Yes |
This table illustrates the evidence types generally used by manual annotators in the Ensembl team to determine the correct structure and function of a transcript model. Protein-coding genes require transcriptomic evidence to define structure and terminal transcription data sets to define transcript start and end coordinates. Homology with UniProt and proteomics data informs or validates the decision to assign a transcript or locus the protein-coding biotype—that is, to decide whether a functional protein is encoded. Similarly, evolutionary conservation of sequence and of protein-coding potential also informs this decision. Decisions about protein-coding genes do not generally use RNA secondary structure or other expert databases, although they may be consulted on a case-by-case basis. The annotation of lncRNAs utilizes the same transcriptomic data sets as protein-coding genes; however, the absence of protein homology, experimental proteomics data, and conservation is a key determinant in choosing not to annotate a transcript as protein coding. For sRNAs, transcriptomic data sets, conservation data, RNA secondary structure data, and expert external databases are utilized. Pseudogenes are annotated based solely on their homology to annotated protein sequences, although transcriptomic data are used to support the transcribed pseudogene biotypes. IG/TR gene segments are annotated on the basis of protein experimental data and homology to IG/TR sequences from the IMGT database. 
Abbreviations: CAGE, cap analysis gene expression; GERP, Genomic Evolutionary Rate Profiling; IG, immunoglobulin; IMGT, international ImMunoGeneTics information system; Infernal, Inference of RNA Alignment; INSDC, International Nucleotide Sequence Database Collaboration; lncRNA, long noncoding RNA; miRBase, MicroRNA Database; MS, mass spectrometry; ONT, Oxford Nanopore Technologies; PacBio, Pacific Biosciences; PhyloCSF, Phylogenetic Codon Substitution Frequencies; polyA-seq, polyA sequencing; RAMPAGE, RNA annotation and mapping of promoters for the analysis of gene expression; ribo-seq, ribosome profiling; RNA-seq, RNA sequencing; sRNA, small RNA; TR, T cell receptor.
^a For nontranscribed pseudogenes only; transcribed pseudogenes may be supported by these data.
^b Although pseudogenes are not conserved over large evolutionary distances, known artifacts in the whole-genome alignments that underlie conservation detection allow pseudogenes to be identified, provided the signal is interpreted with care.
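For readers who want to work with Table 1 programmatically, the evidence matrix can be written as a small lookup structure. The sketch below is only an illustration: the names `EVIDENCE_TYPES`, `EVIDENCE_MATRIX`, `evidence_for`, and `biotypes_using` are hypothetical and are not part of any Ensembl tool or API. Footnoted cells are recorded with the value shown in the table body, so the caveats in the footnotes still need to be applied by hand.

```python
# Hypothetical encoding of Table 1; all names here are illustrative,
# not an Ensembl API.

# Column order follows the table header, left to right.
EVIDENCE_TYPES = (
    "transcription",             # INSDC, RNA-seq, PacBio, ONT
    "terminal_transcription",    # CAGE, RAMPAGE, polyA-seq
    "protein_homology",          # UniProt
    "protein_experimental",      # MS, ribo-seq
    "conservation",              # PhyloCSF, PhastCons, GERP
    "rna_secondary_structure",   # Infernal
    "external_expert_database",  # miRBase, Rfam, IMGT
)

# One row per biotype, True for "Yes" and False for "No".
# Pseudogene cells carry footnotes in the table: the two transcription
# columns are "No" for nontranscribed pseudogenes only, and the
# conservation "Yes" is subject to the caveat about alignment artifacts.
EVIDENCE_MATRIX = {
    "protein_coding": (True, True, True, True, True, False, False),
    "lncRNA": (True, True, False, False, False, False, False),
    "sRNA": (True, False, False, False, True, True, True),
    "pseudogene": (False, False, True, False, True, False, False),
    "IG/TR": (False, False, False, True, False, False, True),
}


def evidence_for(biotype):
    """Return the evidence types marked "Yes" for a biotype, in column order."""
    row = EVIDENCE_MATRIX[biotype]
    return [name for name, used in zip(EVIDENCE_TYPES, row) if used]


def biotypes_using(evidence_type):
    """Return the biotypes whose row is "Yes" for the given evidence type."""
    column = EVIDENCE_TYPES.index(evidence_type)
    return [biotype for biotype, row in EVIDENCE_MATRIX.items() if row[column]]
```

With this encoding, `evidence_for("lncRNA")` returns the two transcription columns, and `biotypes_using("external_expert_database")` returns the sRNA and IG/TR rows, matching the table.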