Abstract
For transcriptome analysis, it is critical to precisely define all the transcripts across the whole genome. A growing number of digital gene expression (DGE) studies have indicated the presence of large numbers of novel transcripts in addition to the known gene models. However, almost all of these studies still depend crucially on existing annotation. Here, we present Gene2DGE, a Perl software package for gene model renewal with DGE data. We applied Gene2DGE to the mouse blastomere transcriptome and defined 98,532 read-enriched regions (RERs) by read clustering, with each base pair supported by more than four reads. Taking advantage of this ab initio method, we refined 2,104 exonic regions (4% of a total of 48,501 annotated transcribed regions) that extend markedly (>50 bp) into un-annotated regions. From the 5% of uniquely mapped reads that fall within intronic regions, we identified 13,291 possible additional exons. As a result, we renewed 4,788 gene models, which account for 39% of a total of 12,277 transcribed genes. Furthermore, we identified 12,613 intergenic RERs, suggesting the possible presence of novel genes outside the existing gene models. Gene2DGE is therefore a suitable tool for renewing known gene models by ab initio prediction in transcriptome dissection. The Gene2DGE package is freely available at http://bighapmap.big.ac.cn/.
Key words: transcriptome, annotation, ab initio prediction
Introduction
Digital gene expression sequencing (DGE-seq) refers to the use of high-throughput sequencing technologies to sequence cDNA in order to obtain information about the RNA content of a sample (1). It provides researchers with a powerful tool for obtaining unbiased and unparalleled information about gene transcripts (2, 3). Computational methods are currently being developed to identify and annotate these transcripts, including their alternative splice forms (4, 5). Although most DGE-seq studies have identified expression outside of known loci, in intronic or intergenic regions (6-10), few attempts have been made to define the read-enriched regions (RERs) ab initio in detail and compare them with known gene models.
Here, we present Gene2DGE, a free Perl software package for RER detection and gene model update. The method consists of RER definition based on read clustering, followed by comparison of the RERs with the annotation of known gene models. The input to Gene2DGE is a file of mapped reads from RNA-seq data and a gene annotation file for the corresponding genome. In addition, a cmap file needs to be prepared for each species to correct the chromosome numbers. The output of Gene2DGE includes a text file containing the set of RERs and a series of text files containing annotation information for the eligible RERs. The Gene2DGE package is freely available at http://bighapmap.big.ac.cn/.
Implementation
We developed Gene2DGE as an ab initio tool to annotate the transcriptome using mapped reads from the SOLiD platform (Applied Biosystems) and annotation information downloaded from Ensembl. Gene2DGE consists of three steps. First, we filter the “uniquely mapped” reads from the aligned results of the SOLiD Whole Transcriptome Pipeline. A uniquely mapped read is defined as one whose best alignment to the genome scores at least 24 and at least four points higher than any other alignment of that read to the genome (11). To limit memory use, this step is performed for one chromosome at a time, so the package can conveniently be run on any personal computer.
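The uniqueness criterion can be illustrated with a minimal Python sketch. It is not taken from the Gene2DGE source (which is written in Perl); the function name and the representation of a read as a plain list of alignment scores are assumptions made for illustration. Only the two thresholds (24 and 4) come from the text.

```python
from typing import Sequence

MIN_SCORE = 24  # minimum score of the best alignment (value from the text)
MIN_GAP = 4     # required lead over every other alignment (value from the text)


def is_uniquely_mapped(scores: Sequence[int],
                       min_score: int = MIN_SCORE,
                       min_gap: int = MIN_GAP) -> bool:
    """Return True if a read's best alignment satisfies both criteria.

    `scores` holds the alignment scores of one read against the genome
    (hypothetical input format; the real pipeline output differs).
    """
    if not scores:
        return False  # unmapped read
    ranked = sorted(scores, reverse=True)
    best = ranked[0]
    if best < min_score:
        return False
    # A read with a single alignment has no competitor to beat.
    return len(ranked) == 1 or best - ranked[1] >= min_gap


# Example: the first read passes, the second has a runner-up that is too close.
reads = {"read_a": [30, 25], "read_b": [30, 27], "read_c": [23]}
unique = [name for name, s in reads.items() if is_uniquely_mapped(s)]
```

In this example only `read_a` is retained: `read_b` fails the four-point lead and `read_c` fails the minimum score.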
Second, based on the uniquely mapped reads, we construct the RERs by grouping overlapping reads and retaining the groups whose read number reaches a threshold (at least four reads) (12). For each RER we list the start position, the end position and the number of mapped reads. In addition, we set a parameter for the “maximal distance between RERs”, defined as the start position of an RER minus the start position of the nearest RER upstream. It can be customized according to the specific requirements of the experiment (the default value is 50 bp).
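The clustering step can be sketched as follows. This is an illustrative Python reconstruction under stated assumptions, not the Gene2DGE implementation: reads are taken to be 1-based inclusive (start, end) intervals on one chromosome, and we assume that RERs whose start-to-start distance does not exceed the “maximal distance” parameter are merged, which the text does not state explicitly. The names `RER`, `cluster_reads` and `merge_close_rers` are hypothetical.

```python
from typing import Iterable, List, NamedTuple, Tuple


class RER(NamedTuple):
    start: int
    end: int
    n_reads: int


def cluster_reads(reads: Iterable[Tuple[int, int]], min_reads: int = 4) -> List[RER]:
    """Group overlapping reads and keep groups with at least `min_reads` reads."""
    rers: List[RER] = []
    cur_start = cur_end = None
    count = 0
    for start, end in sorted(reads):
        if cur_start is not None and start <= cur_end:
            # Read overlaps the current group: extend it.
            cur_end = max(cur_end, end)
            count += 1
        else:
            # Close the previous group, if any, and start a new one.
            if cur_start is not None and count >= min_reads:
                rers.append(RER(cur_start, cur_end, count))
            cur_start, cur_end, count = start, end, 1
    if cur_start is not None and count >= min_reads:
        rers.append(RER(cur_start, cur_end, count))
    return rers


def merge_close_rers(rers: Iterable[RER], max_distance: int = 50) -> List[RER]:
    """Merge an RER into the previous one when their start positions are
    at most `max_distance` bp apart (assumed use of the parameter)."""
    merged: List[RER] = []
    prev_start = None
    for rer in sorted(rers):
        if merged and rer.start - prev_start <= max_distance:
            last = merged[-1]
            merged[-1] = RER(last.start, max(last.end, rer.end),
                             last.n_reads + rer.n_reads)
        else:
            merged.append(rer)
        prev_start = rer.start
    return merged


# Four overlapping reads form one RER; the isolated read is discarded.
example = cluster_reads([(100, 134), (110, 144), (120, 154), (130, 164), (300, 334)])
```

Because each chromosome is handled independently, a function like `cluster_reads` only ever needs the reads of one chromosome in memory, in line with the per-chromosome processing described above.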
Finally, we compare the RERs with existing gene models and generate a catalogue of candidate genes with new annotation information, including exon extensions, possible additional exons and novel genes. The file of existing gene models in GTF format can be downloaded from the Ensembl website for the species of interest. We pick out eligible RERs and then check their overlap with known gene models. A series of annotated files is then output for use in further analysis.
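The comparison with gene models can be sketched in Python as below. This is a simplified illustration, not the package's code: exons and genes are assumed to be given as 1-based inclusive (start, end) intervals on a single chromosome, strand is ignored, and the function names are hypothetical. The sign convention for the overhang follows Figure 1B (positive when the RER is longer than the exon).

```python
from typing import Iterable, Tuple

Interval = Tuple[int, int]  # 1-based inclusive (start, end), one chromosome


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two closed intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def classify_rer(rer: Interval,
                 exons: Iterable[Interval],
                 genes: Iterable[Interval]) -> str:
    """Label an RER as exonic, intronic or intergenic."""
    if any(overlaps(rer, e) for e in exons):
        return "exonic"
    if any(overlaps(rer, g) for g in genes):
        return "intronic"    # inside a gene span but outside every exon
    return "intergenic"


def overhang(rer: Interval, exon: Interval) -> Tuple[int, int]:
    """Return (left, right) overhang in bp in genomic coordinates.

    Positive values mean the RER extends beyond the exon end;
    negative values mean the RER is shorter on that side.
    """
    return exon[0] - rer[0], rer[1] - exon[1]


exons = [(1000, 1200), (2000, 2300)]
genes = [(1000, 2300)]

labels = [classify_rer(r, exons, genes)
          for r in [(1150, 1290), (1500, 1600), (5000, 5100)]]
left, right = overhang((1150, 1290), (1000, 1200))
is_extension = max(left, right) > 50  # >50 bp threshold used in the text
```

Here the first RER overlaps an annotated exon and runs 90 bp past its right end, so it would count as an exon extension; the second suggests a possible additional exon and the third a possible novel gene.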
Application
We applied Gene2DGE to RNA-seq data from the mouse blastomere dataset obtained by single-cell whole-transcriptome sequencing (13). The mRNA-Seq short reads were analyzed using whole-transcriptome software tools (Applied Biosystems, http://www.solidsoftwaretools.com/). The reads generated were mapped to the mouse genome (mm9, NCBI build 37). We obtained more than 6.6 million reads that could be uniquely aligned to the mouse genomic reference (“uniquely mapped reads”). Based on the Ensembl annotation (NCBI M37.61), 89% of reads (5.9 million) were mapped to annotated exonic regions, including coding sequences (CDS) and untranslated regions (UTR), which is substantially higher than the proportions mapped to intronic (0.3 million, 5%) and intergenic (0.4 million, 6%) regions (Figure 1A).
Figure 1.
Summary of read-enriched regions (RERs) across the mouse genome. A. Distribution of read counts within RERs demonstrates possible transcription in previously non-annotated regions. B. Deviation between exon ends and corresponding RER boundaries. Negative values indicate that RERs are shorter than known exons, while positive values indicate that RERs are longer than known exons. The apparent shortening at both the first 5′ and last 3′ ends is possibly caused by transcript degradation.
Across the mouse genome, 98,532 RERs were identified, each containing more than 4 reads. A total of 62.3% of RER boundaries were within 10 bp of the ends of the corresponding exons (Figure 1B). Meanwhile, we identified 2,217 exon ends with remarkable extension into un-annotated regions (>50 bp), suggesting that the mouse transcriptome is more complex than expected. The transcript levels of RERs overlapping with known exons (exonic RERs) were significantly higher than those of novel RERs (Mann-Whitney U test, P < 10^−35).
We detected 12,277 expressed genes (with at least 1 RER across the genic region), of which 11,261 (92%) contained at least one exonic RER. For the 72,628 exonic RERs (74% of all 98,532 RERs), we found that the known gene models were well defined by the ab initio method. For example, the read distribution on chr 7 (56900000-57200000) revealed sharp RER boundaries (Figure 2).
Figure 2.
Transcriptome features of the mouse blastomere illustrated by DGE data, based on the available annotation. The upper panel (Top/Reverse) shows the read distribution on chr 7 (56900000-57200000) from the mouse blastomere sequencing data. Boundaries of RERs were generated by Gene2DGE from the read distribution. Improved annotation of gene models and novel transcription are also illustrated.
Furthermore, we detected 25,904 RERs (26% of all RERs identified) located outside of the annotated regions in this transcriptome dataset. Among them, 13,291 intronic RERs (about 13% of all RERs) were identified in 4,449 genes, indicating possible additional exons. Interestingly, 1,016 of these genes (23%) had transcripts detected only in intronic regions and not in exonic regions. As a result, we renewed 4,788 gene models, which account for 39% of the 12,277 transcribed genes. The remaining 12,613 RERs are located in intergenic regions, suggesting the possible presence of novel genes outside of the existing gene models.
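The RER and gene counts reported in this section can be cross-checked with a few lines of arithmetic. The Python sketch below uses only figures quoted in the text; the variable names are ours.

```python
total_rers = 98_532
exonic, intronic, intergenic = 72_628, 13_291, 12_613

# The three classes partition the full set of RERs.
assert exonic + intronic + intergenic == total_rers
novel = intronic + intergenic  # RERs outside annotated regions


def pct(part: int, whole: int) -> int:
    """Percentage rounded to the nearest integer."""
    return round(100 * part / whole)


expressed_genes = 12_277
genes_with_exonic_rer = 11_261
genes_with_intronic_rer = 4_449
intron_only_genes = expressed_genes - genes_with_exonic_rer
renewed_models = 4_788

summary = {
    "exonic RERs (%)": pct(exonic, total_rers),
    "novel RERs (%)": pct(novel, total_rers),
    "intronic RERs (%)": pct(intronic, total_rers),
    "genes with exonic RER (%)": pct(genes_with_exonic_rer, expressed_genes),
    "intron-only genes among genes with intronic RERs (%)":
        pct(intron_only_genes, genes_with_intronic_rer),
    "renewed gene models (%)": pct(renewed_models, expressed_genes),
}
```

The rounded percentages agree with those given in the text (74%, 26%, 13%, 92%, 23% and 39%), and the 1,016 intron-only genes equal the difference between the 12,277 expressed genes and the 11,261 genes with an exonic RER.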
Conclusion
Here we have developed an exploratory tool, Gene2DGE, which can be employed to determine RERs and improve genome annotation. The package and methods can be applied to mapped short reads from other RNA-seq sources, such as the AB SOLiD and Illumina Solexa platforms. Moreover, Gene2DGE can be used on any personal computer with modest memory capacity, since the data are processed one chromosome at a time.
In this study, we provide an example of Gene2DGE usage to illustrate its application in transcriptome analysis. Gene2DGE has also been applied to analyze datasets from other mouse tissues or tissues from other species (data not shown). All these results indicate that Gene2DGE is a suitable tool for the renewal of known gene models by ab initio prediction in transcriptome dissection.
Authors’ contributions
XT and LD designed the study and performed the majority of data analysis. DZ, JL, YW, QZ, XL and GL collected the dataset and participated in data analysis and visualization. XT, LD and LS supervised the project and wrote the manuscript. All authors read and approved the final manuscript.
Competing interests
The authors declare that no competing interests exist.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant Nos. 81171184, 31060139 and 30871384), the Natural Science Foundation of Jiangxi Province (Grant No. 20114BAB215019), the Department of Health of Jiangxi Province (Grant No. 20111209) and the Technology Pedestal and Society Development Project of Jiangxi Province (Grant Nos. 2010BSA09500 and 20111BBG70009-1).
References
1. Wang Z. RNA-Seq: a revolutionary tool for transcriptomics. Nat. Rev. Genet. 2009;10:57–63. doi: 10.1038/nrg2484.
2. Eveland A.L. Digital gene expression signatures for maize development. Plant Physiol. 2010;154:1024–1039. doi: 10.1104/pp.110.159673.
3. Lai Y. Differential expression analysis of Digital Gene Expression data: RNA-tag filtering, comparison of t-type tests and their genome-wide co-expression based adjustments. Int. J. Bioinform. Res. Appl. 2010;6:353–365. doi: 10.1504/IJBRA.2010.035999.
4. Laporta J. Short communication: expression and alternative splicing of POU1F1 pathway genes in preimplantation bovine embryos. J. Dairy Sci. 2011;94:4220–4223. doi: 10.3168/jds.2011-4144.
5. Shang H. Identification and characterization of alternative promoters, transcripts and protein isoforms of zebrafish R2 gene. PLoS One. 2011;6 doi: 10.1371/journal.pone.0024089.
6. Cloonan N. Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nat. Methods. 2008;5:613–619. doi: 10.1038/nmeth.1223.
7. Sultan M. A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science. 2008;321:956–960. doi: 10.1126/science.1160342.
8. Wilhelm B.T. Dynamic repertoire of a eukaryotic transcriptome surveyed at single-nucleotide resolution. Nature. 2008;453:1239–1243. doi: 10.1038/nature07002.
9. Robinson M.D. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26:139–140. doi: 10.1093/bioinformatics/btp616.
10. Wang L. DEGseq: an R package for identifying differentially expressed genes from RNA-seq data. Bioinformatics. 2010;26:136–138. doi: 10.1093/bioinformatics/btp612.
11. Tuch B.B. Tumor transcriptome sequencing reveals allelic expression imbalances associated with copy number alterations. PLoS One. 2010;5 doi: 10.1371/journal.pone.0009317.
12. Mortazavi A. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods. 2008;5:621–628. doi: 10.1038/nmeth.1226.
13. Tang F. mRNA-Seq whole-transcriptome analysis of a single cell. Nat. Methods. 2009;6:377–382. doi: 10.1038/nmeth.1315.