Figure 1.
Combining short read RNA-seq and long read RNA-seq to assemble an hPSC-specific transcriptome. (A) Read density of different experimental techniques across the length of the transcript. Pileups of hPSC data from deepCAGE, short read RNA-seq, and polyA-seq reads across the lengths of the transcripts from the 5′ ends to the 3′ ends and the flanking 2 kb regions. Each transcript is scaled to the same size and orientated to the same strand. DeepCAGE specifically sequences the 5′ ends of transcripts and can identify TSSs. DeepCAGE data, taken from Ref. (47), is measured in normalized tag counts. PolyA-seq data is from the 3′ RNA-seq data set GSE138759 (49) and is measured in normalized counts. RNA-seq (SR) refers to pileups of the short read (SR) RNA-seq data only, across the transcripts. The SR sample accessions used in this study are described in Supplementary Table S1. (B) The number of transcripts (in thousands) that are supported by short reads only (SR-only), long reads only (LR-only), or both (SR + LR). (C) The number of transcripts (in thousands) that were defined as matching (all internal exon boundaries match those of a GENCODE transcript exactly; exact 5′ and 3′ ends of the transcript are not enforced), variant (shares any exon or overlapping exon segment with a GENCODE transcript) or novel (does not share any exonic nucleotide with a GENCODE transcript). (D) Pie charts showing the proportion of nucleotide sequences at the 5′ or 3′ splice sites. The transcripts are divided into the matching, variant or novel classes, and all GENCODE transcripts are shown for comparison. (E) Violin plots showing normalized RNA counts for matching, variant and novel transcripts, for RNA-seq (from short read data) and deepCAGE data. RNA-seq is presented in log2 transcripts per million (TPM). DeepCAGE is presented in log2 normalized tag counts. As deepCAGE only sequences the 5′ ends, only transcripts with unique 5′ ends were used in the analysis of deepCAGE data.
(F) Number and percentage of coding and noncoding transcripts by transcript class. Coding and noncoding here refer to the prediction by FEELnc. Novel transcripts have no exons overlapping GENCODE annotations; variant transcripts overlap the GENCODE annotations by at least one base pair; matching transcripts have internal exon splice sites that match a GENCODE transcript exactly. ‘All’ refers to all assembled hPSC transcripts. (G) RNA levels of coding and noncoding transcripts, for short read RNA-seq (left violins) or deepCAGE data (right violins). For deepCAGE, only transcripts with a unique 5′ end were used.
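
The matching, variant and novel definitions used in panels (C) and (F) can be illustrated with a minimal Python sketch. This is not the pipeline used in the study. The function names (`intron_chain`, `exons_overlap`, `classify`), the half-open coordinate convention, and the representation of a transcript as a list of exon intervals are all assumptions made for illustration. The sketch further assumes that all transcripts compared lie on the same chromosome and strand, and that single-exon transcripts, which have no internal exon boundaries, are never called matching.

```python
from typing import List, Tuple

# Hypothetical representation: an exon is a half-open interval [start, end),
# and a transcript is a list of exons on one chromosome and strand.
Exon = Tuple[int, int]
Transcript = List[Exon]


def intron_chain(exons: Transcript) -> List[Tuple[int, int]]:
    """Return the internal exon boundaries as a list of (donor, acceptor) pairs.

    The outer 5' and 3' transcript ends are not part of the chain, so they
    are not enforced when two chains are compared.
    """
    ordered = sorted(exons)
    return [(ordered[i][1], ordered[i + 1][0]) for i in range(len(ordered) - 1)]


def exons_overlap(a: Transcript, b: Transcript) -> bool:
    """True if any exon of `a` shares at least one base with any exon of `b`."""
    return any(s1 < e2 and s2 < e1 for s1, e1 in a for s2, e2 in b)


def classify(transcript: Transcript, reference: List[Transcript]) -> str:
    """Classify a transcript as 'matching', 'variant' or 'novel'."""
    chain = intron_chain(transcript)
    # Assumption: an empty chain (single-exon transcript) cannot be 'matching'.
    if chain and any(chain == intron_chain(ref) for ref in reference):
        return "matching"
    if any(exons_overlap(transcript, ref) for ref in reference):
        return "variant"
    return "novel"


# Toy reference annotation with one three-exon transcript.
reference = [[(100, 200), (300, 400), (500, 600)]]

# Same internal boundaries, different outer ends: matching.
print(classify([(50, 200), (300, 400), (500, 650)], reference))
# One internal boundary shifted but exons still overlap: variant.
print(classify([(100, 200), (300, 450), (500, 600)], reference))
# No exonic base shared with the reference: novel.
print(classify([(1000, 1100), (1200, 1300)], reference))
```

Comparing intron chains rather than whole exon lists is what lets the outer transcript ends differ while the internal splice sites are still required to agree exactly.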