eLife. 2020 Aug 24;9:e55792. doi: 10.7554/eLife.55792

Figure 3. The V4.3 annotation improves detection of cell-type-specific genes from bulk RNA-seq data.

(A) Schematic outline for generating a new zebrafish transcriptome annotation. See Results and Materials and methods sections for details. (B) Pie charts showing the proportion of reference genes with the same, longer, or shorter 3' UTR in the V4.3 annotation compared to relative 3' UTR length between Ens95 and RefSeq. (C, E) Venn diagrams showing intersection of reference genes with commonly annotated NCBI ID that are significantly enriched in (C) kdrlpos- or (E) pdgfrbpos-cells in each indicated annotation. (D, F) Volcano plots of reference genes with common NCBI ID identified as (D) kdrlpos- or (F) pdgfrbpos-enriched only by Ens95 in comparison to RefSeq. Indicated values are from the same genes quantified using V4.3. Red dots indicate log2 fold change >1 and padj <0.05. (G) 3' UTR lengths and (H) log10 average expression (n = 3) using RefSeq and V4.3 for reference genes in the indicated dataset. (G, H) Data are not normally distributed; a Wilcoxon matched-pairs signed-rank test was used, and p values are indicated. Error bars denote mean and standard deviation. (I) Log10 average expression (n = 3) and (J) 3' UTR lengths across all annotations for reference genes uniquely identified as enriched in the indicated transgene-positive cell type using V4.3 (log2 fold change >1, padj <0.05). Data are not normally distributed. A Friedman test was used to assess variance (p<0.0001 in all cases). Dunn's multiple comparison test was used for pairwise comparisons; p values are indicated. Error bars denote mean and standard deviation.
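The enrichment criterion used throughout this legend (log2 fold change >1 and padj <0.05) can be illustrated with a minimal Python sketch. It is not the authors' code. The column names `log2FoldChange` and `padj` follow DESeq2's default result columns and are an assumption about how a table would be loaded. The function name `enriched` and the gene entries are hypothetical toy values, not data from the study.

```python
def enriched(rows, lfc_min=1.0, padj_max=0.05):
    """Return gene names passing the legend's thresholds.

    rows: iterable of dicts with keys 'gene', 'log2FoldChange', 'padj'
    (assumed layout; the published tables may use different headers).
    """
    hits = []
    for row in rows:
        padj = row["padj"]
        if padj is None:
            # DESeq2 can report NA for adjusted p-values; skip those genes.
            continue
        if row["log2FoldChange"] > lfc_min and padj < padj_max:
            hits.append(row["gene"])
    return hits


# Toy rows for illustration only (hypothetical genes and values).
toy_rows = [
    {"gene": "geneA", "log2FoldChange": 2.3, "padj": 0.001},  # passes both
    {"gene": "geneB", "log2FoldChange": 0.4, "padj": 0.001},  # fold change too small
    {"gene": "geneC", "log2FoldChange": 3.1, "padj": 0.20},   # not significant
    {"gene": "geneD", "log2FoldChange": 1.8, "padj": None},   # NA adjusted p-value
]
print(enriched(toy_rows))
```

Both cutoffs are strict inequalities here, matching the ">1" and "<0.05" wording of the legend.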

Figure 3—source data 1. List of SRA accession numbers, stages, and read numbers from GSE32900 for associated RNA-seq datasets used in this study.
Figure 3—source data 2. List of manually identified discrepancies in Ensembl gene annotation due to spurious fusion or overlapping transcripts.
Table includes Ens95 gene symbol, gene ID, and spurious transcript ID. Persistence of observed discrepancy in Ens99 is indicated, as is previous status of curation in ZFIN. All of these have been reported to ZFIN.
Figure 3—source data 3. RefSeq (worksheet 1) and Ens99 (worksheet 2) genes missing from the V4.2 annotation.
Figure 3—source data 4. Novel genes from V4.2 genome annotation.
This table includes information regarding blastx hits against zebrafish and human proteins, matches with lincRNAs, number of exons per gene, and whether the novel locus was included in the V4.3 annotation.
Figure 3—source data 5. V4.3 gene information table, including unique LL ID numbers, associated Ens99 gene ID, NCBI ID, and ZFIN gene ID numbers, gene symbols, and gene names.
Annotation notes are also included regarding the relative strength of coordinate-based incorporation of NCBI (Entrez) and Ens99 gene identifiers.
Figure 3—source data 6. Output from DESeq2 analysis comparing kdrlpos and kdrlneg RNA-seq.
Gene expression levels were quantified using RSEM with the V4.3 annotation. Median ratio normalized expression values are shown for each sample, along with adjusted p-value and log2 fold change. Matching Ensembl and NCBI gene IDs are included.
Figure 3—source data 7. Output from DESeq2 analysis comparing pdgfrbpos and pdgfrbneg RNA-seq.
Gene expression levels were quantified using RSEM with the V4.3 annotation. Median ratio normalized expression values are shown for each replicate, along with adjusted p-value and log2 fold change. Matching Ensembl and NCBI gene IDs are included.
Figure 3—source data 8. Worksheet 1 - Output from DESeq2 analysis comparing Nr2f2pos and Nr2f2neg RNA-seq.
Gene expression levels were quantified using RSEM with the V4.3 annotation. Median ratio normalized expression values are shown for each replicate, along with adjusted p-value and log2 fold change. Matching Ensembl and NCBI gene IDs are included. Worksheet 2 - Nr2f2pos-enriched genes with matched entries from reference gene set (Figure 2—source data 2) and associated 3' UTR lengths (Figure 2—source data 2).
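The "median ratio normalized" values in the DESeq2 source data tables above refer to DESeq2's median-of-ratios size-factor normalization. The sketch below shows the idea in plain Python. It is an illustration of the general technique, not the DESeq2 implementation or the authors' pipeline. The function names, sample labels, and counts are hypothetical.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict mapping sample name -> list of raw counts per gene,
    with genes in the same order for every sample (assumed layout).
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    # Only genes with a nonzero count in every sample have a usable
    # geometric mean across samples.
    usable = [g for g in range(n_genes)
              if all(counts[s][g] > 0 for s in samples)]
    # Log of the per-gene geometric mean across samples.
    log_geo = {g: sum(math.log(counts[s][g]) for s in samples) / len(samples)
               for g in usable}
    # Size factor = median over genes of (count / geometric mean).
    return {s: math.exp(median(math.log(counts[s][g]) - log_geo[g]
                               for g in usable))
            for s in samples}


def normalize(counts):
    """Divide each sample's counts by that sample's size factor."""
    sf = size_factors(counts)
    return {s: [c / sf[s] for c in counts[s]] for s in counts}


# Toy counts (hypothetical): sample "rep2" was sequenced twice as deeply.
toy_counts = {"rep1": [10, 20, 30], "rep2": [20, 40, 60]}
toy_sf = size_factors(toy_counts)
toy_norm = normalize(toy_counts)
```

In this toy case the two replicates differ only by sequencing depth, so their normalized values coincide after division by the size factors.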


Figure 3—figure supplement 1. Ensembl naming conflicts and improved transcript diversity in V4.3.


(A, B) Annotated UCSC Genome Browser screenshots of (A) cenpq and mrpl39 loci and (B) tagln3b and abhd10b loci. RefSeq, Ens95, Ens99 and V4.3 transcript annotations are shown. Ensembl-annotated transcripts overlapping an adjacent locus and leading to misassignment of gene names are indicated. (C) Apparent detected expression for abhd10b and tagln3b in pdgfrbpos RNA-seq reads (n = 3) using indicated annotation. Error bars denote SEM. The abhd10b transcripts are annotated as belonging to the tagln3b gene in Ens95 and Ens99. This leads to a failure to detect abhd10b when using Ensembl annotations, along with spurious inflation of tagln3b expression levels due to the inclusion of reads that should be mapped to abhd10b. (D) Histogram plot of numbers of transcripts per gene for indicated annotation. The plot is limited to genes with 10 or fewer transcripts. (E) UCSC browser image of the erg locus showing transcripts from indicated annotation and mapped reads from GSE32900.
Figure 3—figure supplement 2. The V4.3 annotation improves the detection of cell-type-specific genes from bulk RNA-seq data.


(A, B) Volcano plots of RNA-seq data from (A) kdrl-positive and negative and (B) pdgfrb-positive and negative cells quantified using V4.3. (A, B) Numbers of differentially expressed genes are indicated, and selected known (A) endothelial or (B) mural cell genes are marked with green dots. (A, B) Genes with significant differences (padj <0.05) are shown as red (log2 fold change pos/neg > 1) or blue (log2 fold change pos/neg < -1). (C–E) Left panels, UCSC browser images of (C) slc7a5 (minus strand), (D) slc2a1a (plus strand), and (E) cspg4 (minus strand) loci showing 3' UTR annotations from V4.3, Ens95, and RefSeq. Mapped depth of RNA-seq reads from indicated cell type on the genome, or assigned to each annotation, is indicated, as are 3P-seq features. (C–E) Right panels, log10 normalized expression of (C) slc7a5, (D) slc2a1a, and (E) cspg4 in replicate RNA-seq samples (n = 3) quantified using indicated annotations. Values are normalized with median ratio normalization. (C, D) Values display normal distribution (Shapiro-Wilk test) and analysis of variance revealed statistical significance (slc2a1a, p=0.0160; slc7a5, p=0.0002). Adjusted p-values from Dunnett's multiple comparison tests are indicated. (E) Values are not normally distributed; variance was assessed by Friedman test (p=0.0278). p-values from Dunn's multiple comparison test are indicated. Error bars represent mean and standard deviation.
Figure 3—figure supplement 3. Analysis of Nr2f2pos and Nr2f2neg datasets using V4.3.


(A) Volcano plot of RNA-seq data from Nr2f2pos and Nr2f2neg cells quantified using V4.3. Numbers of differentially expressed genes are shown. Selected known venous endothelial genes are indicated with green dots, as is erg, which is not detected as differentially expressed by Ens95. Genes with significant differences (padj <0.05) are red (log2 fold change pos/neg > 1) or blue (Nr2f2pos-enriched, log2 fold change pos/neg < -1). (B) Venn diagram showing intersection of reference genes (using commonly annotated NCBI ID) that are significantly enriched in Nr2f2pos-cells in each indicated annotation. (C) Volcano plot of Nr2f2pos-enriched reference genes identified only by Ens95 in comparison to RefSeq. Indicated values are from the same genes quantified using V4.3. Red dots indicate log2 fold change < -1 and padj <0.05. (D) 3' UTR lengths and (E) log10 average expression (n = 3) for reference genes in Ens95 and V4.3 for the indicated dataset. (D, E) Data are normally distributed; a paired t-test was used, and p values are indicated. Error bars denote mean and standard deviation.