Table 1.
Name (version) | Reported size (gene loci) | Methods [a] | Comments | Completeness | Comprehensiveness [b] | Exhaustiveness [c] |
---|---|---|---|---|---|---|
NONCODE (v5) | 96,308 | Integration of other databases | The most comprehensive resource | 8.9% | 67,276 | 2.3 |
MiTranscriptome (v2) | 63,615 | Assembly from short reads | Mainly cancer samples | 4.4% | 45,088 | 4.4 |
FANTOM CAT (v1) | 27,919 | Assembly, other annotations and CAGE evidence | Mapped 5′ ends using CAGE tags | 15.8% | 27,278 | 3.3 |
RefSeq (GCF_000001405.37_GRCh38.p11) | 15,791 | Manual (based on cDNA) and automated annotation (based on RNA-seq data) | The oldest annotation | 11.0% | 14,889 | 1.9 |
GENCODE (v27) | 15,778 | Manual annotation based on cDNA, ESTs and high-quality long-read data | Used by most consortia and integrated with Ensembl | 13.5% | 15,063 | 1.9 |
BIGTranscriptome (v1) | 14,158 | Assembly, with CAGE and 3P-seq evidence | Full-length transcripts | 27.7% | 12,632 | 2.1 |
GENCODE+ | 13,434 | Union of GENCODE (v20) and CLS lncRNAs with anchor-merged CLS transcript models | Extension of GENCODE by CLS | 24.0% | 13,434 | 3.3 |
CLS FL | 807 | lncRNAs from GENCODE+ with CAGE and poly(A) evidence | Full-length transcripts | 71.7% | 807 | 5.5 |
Protein-coding [d] | 19,502 | GENCODE confident protein-coding transcripts | Tagged neither mRNA_end_NF nor mRNA_start_NF in the original GENCODE v27 GTF file | 53.8% | 18,995 | 2.9 |
All numbers correct as of the end of 2017. MiTranscriptome, Functional Annotation of the Mammalian genome (FANTOM) cap analysis of gene expression (CAGE)-associated transcriptome (CAT) and BIGTranscriptome long non-coding RNA (lncRNA) catalogues were lifted over to the Genome Reference Consortium Human Build 38 (GRCh38) genome assembly. 3P-seq, poly(A)-position profiling by sequencing; CLS, capture long-read sequencing; EST, expressed sequence tag; RNA-seq, RNA sequencing.
[a] "Assembly" in the Methods column refers to transcriptome assembly using short reads from RNA-seq.
[b] Comprehensiveness is the total number of gene loci, with boundaries defined using buildLoci. To compare gene sets consistently, the assembly patches were excluded and the gene loci boundaries were redefined using buildLoci, which explains the discrepancies between the gene numbers presented here and those reported in the original publications.
[c] Exhaustiveness is the average number of isoforms per gene locus. The values for completeness, comprehensiveness and exhaustiveness shown here are those presented in FIG. 3.
[d] A set of protein-coding transcripts was used as a reference.
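The definitions of comprehensiveness and exhaustiveness given in the footnotes can be illustrated with a minimal sketch. It assumes that each transcript has already been assigned to a gene locus (for example, by a tool such as buildLoci, whose interface is not reproduced here); the function name `summarize_catalogue` and the toy identifiers are hypothetical and serve only to show the arithmetic.

```python
from collections import defaultdict


def summarize_catalogue(assignments):
    """Return (comprehensiveness, exhaustiveness) for a catalogue.

    `assignments` is an iterable of (locus_id, transcript_id) pairs.
    This input format is an assumption made for illustration only.
    """
    # Collect the distinct isoforms observed at each gene locus.
    isoforms_by_locus = defaultdict(set)
    for locus_id, transcript_id in assignments:
        isoforms_by_locus[locus_id].add(transcript_id)

    # Comprehensiveness: total number of gene loci.
    n_loci = len(isoforms_by_locus)
    if n_loci == 0:
        return 0, 0.0

    # Exhaustiveness: average number of isoforms per gene locus.
    n_isoforms = sum(len(isoforms) for isoforms in isoforms_by_locus.values())
    return n_loci, n_isoforms / n_loci


# Toy catalogue (hypothetical identifiers): three isoforms at one locus, one at another.
toy = [
    ("locus1", "tx1"),
    ("locus1", "tx2"),
    ("locus1", "tx3"),
    ("locus2", "tx4"),
]
comprehensiveness, exhaustiveness = summarize_catalogue(toy)
print(comprehensiveness, exhaustiveness)  # 2 2.0
```

Applied to a real catalogue, the first value would correspond to the Comprehensiveness column and the second to the Exhaustiveness column, provided that locus boundaries were defined in the same way as for the table.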