Abstract
Microarray provides genome‐wide transcript profiles, whereas RNA‐seq is an alternative approach applied for transcript discovery and genome annotation. Both high‐throughput techniques provide quantitative measurements of gene expression. To explore differential gene expression rates and understand biological functions, the authors designed a system which utilises annotations from Kyoto Encyclopedia of Genes and Genomes (KEGG) biological pathways and Gene Ontology (GO) associations for integrating multiple RNA‐seq or microarray datasets. The system starts by either estimating gene expression levels from mapping next generation sequencing short reads onto reference genomes or performing intensity analysis of raw microarray images. Normalisation procedures for expression levels are evaluated and compared through different approaches including Reads Per Kilobase per Million mapped reads (RPKM) and housekeeping gene selection. The resulting gene expression levels are shown in different colour shades and graphically displayed in the designed temporal pathways. To highlight the functional relationships of clustered genes, representative GO terms associated with each differentially expressed gene cluster are visually illustrated in a tag cloud representation.
Inspec keywords: genetics, genomics, lab‐on‐a‐chip, molecular biophysics, molecular configurations, RNA
Other keywords: gene expression rate, multiple high‐throughput datasets, genome‐wide transcript profiles, transcript discovery, genome annotation, high‐throughput techniques, quantitative measurement, genome biological pathways, gene ontology associations, microarray datasets, multiple RNA‐seq datasets, gene expression levels, next generation sequencing, microarray raw images, housekeeping gene selection, colour shades, temporal pathways, representative GO terms, gene cluster, tag cloud representation
1 Introduction
Gene expression level is an essential feature for studying how genes are activated and generate products in living organisms. Transcriptomic studies of cells play a crucial role in understanding how a specific phenotype is formed and how genes interact with each other under different circumstances. For over a decade, the DNA microarray has been an important and powerful tool that allows researchers to examine the expression of thousands of genes from a sample simultaneously. The microarray approach is still a very popular experimental tool in the post‐next generation sequencing (NGS) era because of its maturity, continually increasing throughput and relatively low price per experiment [1]. However, traditional microarray analysis generally requires prior sequence knowledge of the target genes, which makes it inefficient for transcriptome‐wide analysis. In addition, optical intensity‐based protocols limit the dynamic range of the measured gene expression levels [2].
Following the microarray‐based era, high‐throughput sequencing of complementary DNA, an approach termed high‐throughput RNA sequencing (RNA‐seq), became an important technology [3]. The benefits of RNA‐seq compared with microarray and expressed sequence tag sequencing can be summarised as discovery of novel genes, higher quality, a wider range of measurable gene expression levels, shorter experimental time and cost‐efficient scaling [4]. It has therefore been applied extensively in recent years for transcriptome analysis. To date, deep sequencing has been widely used to study gene expression in complex diseases such as cancer, quantitative analysis of transcript expression in studies of organism diversity and evolution, antisense transcriptome analysis and discovery of novel isoforms [5–7]. Another important advantage of RNA‐seq is its capacity for quantitative measurement of each expressed element at transcriptome scale, which helps researchers to analyse differential gene expression under various circumstances [8]. With the rapid growth of bioinformatics techniques, RNA‐seq‐related research and applications have attracted increasing attention in recent years [9]. Typically, an RNA‐seq experiment generates a large number of short reads for transcriptome analysis, and these reads are mapped or aligned to the expressed genes by reference mapping tools [10]. The expression level of each assembled gene segment can be determined from the coverage, which indicates the number of times a nucleotide within a gene is read during sequencing. However, most analyses still focus on evaluating the existence of a specific gene or a small group of genes related to a specified function at a time. Some important associated information might therefore be neglected because of a limited analytical scale or non‐quantitative measurements in gene analysis.
In order to comprehensively and automatically analyse variations of gene expression from different transcriptome datasets, transforming profiles from gene expression levels into corresponding function‐orientated gene clusters becomes an intuitive and systematic approach.
Two popular functional annotation methods for clustered gene groups including biological pathways and Gene Ontology (GO) terms were adopted. A biological pathway is one of the most important annotation methods to describe related biological functions. It represents consequent chains of chemical reactions catalysed by cells, enzymes or ligands. Each pathway contains a signal transduction starting with a signal to another receptor and ending with changes in cellular behaviours [11]. The expression level of each gene within a regulatory network is usually different in distinct organs and tissues. It could respond dynamically because of various environmental conditions, different disease stages, or distinct phases within a cell cycle. By integrating coverage information from RNA‐seq datasets within biological pathways, differential gene expressions among different strains or time points could reveal dynamic status through a constrained gene cluster.
Another major functional annotation method for describing gene products is GO. It is a set of structured vocabularies defined by the GO Consortium [12], and aims to provide a universal standard of functional annotation for gene products. All GO terms are connected with each other in directed acyclic graphs with hierarchical relationships. Each GO term belongs to one of three independent ontologies: biological process, molecular function and cellular component (CC), which represent different aspects of a gene in the temporal, functional and spatial domains. To date, GO annotation is the most frequently used de facto standard of gene annotation, and various studies have shown that GO terms provide conserved functional information for a gene group through over‐representation analysis [13, 14]. Several tools based on the GO annotation approach translate expression data into gene annotation. For example, GOMiner [15] is a tool for analysing microarray data through GO properties to identify specific functions with a gene‐by‐gene approach. DAVID [16] is an integrated functional analyser that annotates and categorises gene functions from gene/protein identifier lists. These tools map identifiers to common GO terms or gene‐interaction maps through bioinformatics resources, and perform well in annotating gene functions and analysing gene networks.
Through previously discovered functional features, both sequenced RNA‐seq and microarray data could be transformed either from quantitatively measured coverage rates or scanned microarray images respectively into gene expression levels as a global view of biological system responses. In this study, we have focused on the evaluation of dynamic expression of specific gene clusters among different conditions. The differential gene expression levels among various RNA‐seq or microarray imaging datasets regarding a mapped pathway or a GO term would be statistically analysed and graphically shown by a novel representation for on‐line users.
2 Materials and methods
2.1 System flowchart
To emphasise differential gene expression under different circumstances, data from more than one experiment should be imported for comparison. The experimental samples for transcriptome profiling might be selected from different tissues, different strains or various controlled environments. Differential gene expression levels could also be observed at different time points, such as sampling at various embryonic stages or before and after drug treatment. The proposed system, shown in Fig. 1, is designed to accept quantitative coverage rates of RNA‐seq data and estimated intensities of microarray images. It includes four major phases: expression level quantisation, expression level normalisation, functional gene pathway mapping and GO tag cloud visualisation. At the first stage, for RNA‐seq, reads from multiple sequenced datasets are individually mapped to known reference genes by existing reference mapping tools. A reference mapping tool reports how each read is mapped to the known coding regions of the selected target model species. According to the mapping results, the initial coverage rates can be calculated for each expressed gene. For microarray data, image intensity analysis and background normalisation are employed to process raw image files and obtain the corresponding expression level of each gene. Next, expression rates are normalised to balance the experimental results under different conditions and to eliminate bias caused by technical variations. The data derived from the normalisation procedures are taken as the expression level of each gene. After all gene expression levels have been retrieved, active genes are recognised and assigned to their biological pathways to dynamically display expression differences together with biological functions.
These identified genes possessing most differential effectiveness among various experiments could be selected and clustered automatically according to Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway and GO annotations. Besides, the over representation analysis of GO terms could be applied for comparing different gene groups with differential gene expression, and the major variations would be displayed by a GO tag cloud technique which is a novel visualisation approach facilitating users with a clear and intuitive representation of how dynamic changes of molecular function correspond largely to the differences of function level among various transcriptome datasets.
Fig. 1.

System flowchart for analysing multiple microarray/RNA‐seq datasets for dynamic gene expression level analysis
2.2 Gene expression rate
The first step for differential expression analysis is to calculate gene expression rates from experimental data. For RNA‐seq data, the short‐reads generated by NGS are mapped to a set of known reference genes. The referred gene sequences should contain UTRs and coding regions only, but not intron segments for preventing coverage bias. Here, we selected Ensembl database as the resource of reference genes, which provides not only detailed coordinates and annotation information, but also corresponding GO information for most collected genes [17]. There are many reference mapping tools available from commercial software or open‐source projects [18–20]. Most mapping tools generated mapped results with sequence alignment/map (SAM) or binary alignment/map (BAM) formats, and provided information of how reads were mapped to the references. The expression rate of each gene could be determined by taking the average coverage rate from reference mapping results. An average coverage rate for each gene could be counted according to the number of accumulated times at each nucleotide position. On the other hand, for the traditional microarray approach, the gene expression rate is estimated by evaluating optical intensities on each array spot, which is also available from commercial or free software solutions [21].
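The average coverage rate described above can be illustrated with a minimal Python sketch. It assumes that the alignments have already been reduced to (gene, start, read length) triples, for example by parsing a SAM file; the function name `average_coverage`, the input layout and the gene identifiers are hypothetical and are not taken from the tool provided with the system. Summing the per‐position depth over a gene and dividing by its length is equivalent to dividing the total number of aligned bases by the gene length.

```python
from collections import defaultdict


def average_coverage(gene_lengths, alignments):
    """Return the mean per-nucleotide read depth for each reference gene.

    gene_lengths: dict mapping gene id -> reference length in bases
    alignments:   iterable of (gene_id, start, read_length), 0-based start
                  (hypothetical layout, e.g. extracted from SAM records)
    """
    aligned_bases = defaultdict(int)
    for gene_id, start, read_length in alignments:
        length = gene_lengths[gene_id]
        # Clip the read to the gene boundaries before counting its bases.
        begin = max(start, 0)
        end = min(start + read_length, length)
        if end > begin:
            aligned_bases[gene_id] += end - begin
    # Total aligned bases / gene length = average depth over all positions.
    return {g: aligned_bases[g] / n for g, n in gene_lengths.items()}


lengths = {"geneA": 100, "geneB": 50}          # placeholder gene ids
reads = [("geneA", 0, 50), ("geneA", 25, 50), ("geneA", 90, 50)]
coverage = average_coverage(lengths, reads)
```

In this toy input, the third read overhangs the end of geneA and contributes only its 10 overlapping bases; geneB receives no reads and therefore has an average coverage of zero.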
2.3 Expression rate normalisation
Since the gene expression rates of each dataset are calculated from different experiments, the expression rates should be normalised across experiments to prevent any range bias caused by technical variations. For NGS reads, bias might appear because of throughput deviation between individual NGS runs, which is considered the first essential problem to overcome before performing RNA‐seq data analysis [22]. A similar situation also occurs in microarray data because of fluorescent dye performance, which suggests that a normalisation procedure is also required when utilising microarray data [23]. For microarray data, different normalisation procedures can be applied according to various assumptions. For example, total intensity normalisation assumes that the total quantity of gene expression is the same in two experiments, whereas mean log centering assumes that the mean log expression ratio should be equal to zero for the whole gene set. In this study, the normalisation procedure for microarray data is based on the total intensities of the individual raw images among different datasets [24]. For RNA‐seq reference mapping results, a gene expression rate is obtained by counting the total number of short reads aligned to the target gene at each nucleotide position. The most commonly used normalisation method is Reads Per Kilobase per Million mapped reads (RPKM), which considers not only the different lengths of genes, but also the total throughput of aligned short reads [25]. RPKM can be easily obtained from the reference mapping results and the total size of the output dataset. Here, the RPKM normalisation procedure is applied as the default method if the user provides no further information.
In addition to normalisation based on the total throughput of each experiment, another approach that preserves biological sense is to utilise housekeeping genes as standard references. Housekeeping genes are a set of genes involved in fundamental cellular functions. These genes are expressed in almost all kinds of tissues under a very wide range of conditions, since they are crucial elements for basic cell functionality. Expression rates of housekeeping genes at different times or in different tissues are expected to be relatively constant [26]. According to previous reports, a normalisation method based on housekeeping genes performed better in a benchmark on RNA‐seq data than simply utilising the total reads of each experiment [27]. Similarly, the housekeeping gene normalisation approach has been widely applied to microarray analysis for many years [28]. However, previously published research also demonstrated that housekeeping gene expression rates may be highly variable under disease conditions, in specific tissues or under extreme experimental conditions [29, 30]. Therefore, the specific set of housekeeping genes used for normalisation should be selected according to the biological experiment. In the proposed system, we have designed a housekeeping gene‐based normalisation module as a user option. If a set of housekeeping genes is assigned by the user, the average coverage rate among all housekeeping genes is used as a reference factor for the normalisation process. All gene coverage rates are then multiplied linearly by the resulting scaling parameter for read coverage normalisation. In this system, housekeeping gene sets are enumerated from previously published research [31, 32].
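The two throughput‐based normalisations mentioned above can be sketched as follows. This is a minimal illustration in Python rather than the system's implementation; the function names are hypothetical. RPKM is computed from its standard definition (reads mapped to a gene, divided by the gene length in kilobases and by the number of mapped reads in millions), and total intensity normalisation is shown in its simplest form, where one array is rescaled so that its summed intensity equals that of a reference array.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase per Million mapped reads for a single gene."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    # reads / (length / 1e3) / (total / 1e6) == reads * 1e9 / (length * total)
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def total_intensity_normalise(reference, sample):
    """Rescale `sample` so that its summed intensity equals that of `reference`.

    Assumes both arrays measure the same total quantity of expression.
    """
    factor = sum(reference) / sum(sample)
    return [value * factor for value in sample]


# A 2 kb gene with 500 mapped reads in a run of 10 million mapped reads.
value = rpkm(500, 2000, 10_000_000)
```

Because RPKM divides by both gene length and sequencing depth, values from runs of different throughput can be compared on a common scale.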
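The housekeeping gene‐based scaling can be sketched in Python as below. The text states only that the average housekeeping coverage serves as the reference factor and that all coverage rates are multiplied linearly by a scaling parameter; the choice of a common target mean (`target_mean`, defaulting to 1.0) is an assumption made for this illustration, and the function and gene names are hypothetical.

```python
def housekeeping_normalise(coverage, housekeeping_ids, target_mean=1.0):
    """Linearly rescale coverage so the housekeeping genes average `target_mean`.

    coverage:         dict mapping gene id -> average coverage rate
    housekeeping_ids: user-selected housekeeping gene ids
    target_mean:      assumed common reference value shared by all datasets
    """
    values = [coverage[g] for g in housekeeping_ids if g in coverage]
    if not values:
        raise ValueError("no housekeeping gene found in the coverage table")
    mean_housekeeping = sum(values) / len(values)
    if mean_housekeeping <= 0:
        raise ValueError("housekeeping genes have no coverage")
    factor = target_mean / mean_housekeeping   # the linear scaling parameter
    return {gene: rate * factor for gene, rate in coverage.items()}


dataset = {"hk1": 20.0, "hk2": 40.0, "geneX": 90.0}   # placeholder ids
normalised = housekeeping_normalise(dataset, ["hk1", "hk2"])
```

After scaling each dataset in this way, the housekeeping genes have the same mean in every dataset, so the remaining differences are expressed relative to that reference.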
2.4 Biological pathway
The Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway database is applied as the fundamental resource for differential gene expression analysis among RNA‐seq and microarray datasets [33]. In each KEGG pathway, a rectangular component represents a set of genes or enzymes; a circular component represents compounds; and a linkage regulates genes in corresponding metabolic reactions. According to the distribution of expression levels normalised from RNA‐seq coverage and microarray intensities, the proposed system employs different colours and shade levels in each component to distinguish various extents of gene expression. There are ten levels of variation score for each gene in a resulting pathway map. The quantitatively changed value for each functional gene is normalised to l_i according to (1), where S_i is the standard deviation for the ith node, S_max is the maximum standard deviation in a specified KEGG map, and the ZC count is the number of zero‐crossings, that is, sign changes in the slope of the expression level of a specific gene across the RNA‐seq or microarray datasets. To obtain the ZC count, the sign of the slope between each pair of consecutive gene expression levels is first extracted, and the total number of sign changes is then counted.
| (1) |
Hence, all gene boxes within a pathway directly show the dynamic changes of gene expression across the sequenced datasets. A gene node is filled with a red rectangle when its ZC value is greater than or equal to 1, and the shade of red is selected according to the previously defined level l_i. When the ZC value of a gene node is 0 across multiple datasets, the trend of gene expression is further examined to see whether it strictly increases or decreases. For a strictly increasing trend, the gene box is marked with a blue triangle, whereas a strictly decreasing trend is marked with a green triangle.
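The zero‐crossing count and the resulting display rule can be sketched in Python as follows. The ZC procedure follows the description above; two points are assumptions of this sketch and not statements from the text: flat segments (zero slope) are ignored when counting sign changes, and the ten‐level mapping in `variation_level` is a simple proportional placeholder, because the exact form of (1) is not reproduced here. All function names are hypothetical.

```python
import math


def zero_crossings(levels):
    """Count sign changes of the slope between consecutive expression levels."""
    slopes = [b - a for a, b in zip(levels, levels[1:])]
    # Assumption: a flat segment has no sign and is skipped.
    signs = [1 if s > 0 else -1 for s in slopes if s != 0]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)


def trend(levels):
    """Classify a gene node: fluctuating (red), increasing (blue), decreasing (green)."""
    if len(levels) < 2:
        return "undefined"
    if zero_crossings(levels) >= 1:
        return "fluctuating"
    slopes = [b - a for a, b in zip(levels, levels[1:])]
    if all(s > 0 for s in slopes):
        return "increasing"
    if all(s < 0 for s in slopes):
        return "decreasing"
    return "other"  # ZC = 0 but not strictly monotonic


def variation_level(s_i, s_max, n_levels=10):
    """Placeholder mapping of a standard deviation onto levels 1..n_levels."""
    if s_max <= 0:
        return 1
    return min(n_levels, max(1, math.ceil(n_levels * s_i / s_max)))
```

With only two datasets there is a single slope, so the ZC count is always 0 and every gene with a changed expression level is drawn as a triangle; red boxes can appear only when three or more datasets are compared.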
2.5 Gene Ontology
In order to present the functions related to a differentially expressed gene set, we employed GO over‐representation analysis to select the representative GO terms. Since GO terms are organised in a hierarchical structure, the annotation of each child term can be considered as inherited by its parents. Hence, terms located at higher levels are always more likely to be annotated to any given gene [12]. Each GO term is therefore normalised by its frequency of appearance in the whole genome set in order to remove the bias arising from the hierarchical term levels. In GO, each gene annotation is marked with a two‐ or three‐letter evidence code, which indicates the type of evidence supporting the annotation. For example, the evidence code IDA denotes an annotation Inferred from Direct Assay, and ISS denotes one Inferred from Sequence or Structural Similarity. A previous study indicated that evidence codes could significantly influence the accuracy of a GO‐based classifier, and suggested that computationally predicted annotations be used with caution [34]. Therefore, GO analysis is frequently computed separately for different evidence code groups. Five evidence code groups defined by the GO Consortium, namely electronic, experimental, computational analysis, author statement and curatorial statement, are applied and categorised in the proposed system.
The system provides a novel tag cloud visualisation method for GO variations through efficient identification of important GO terms from dynamic changes among RNA‐seq or microarray datasets. The system assigns different size weighting coefficients to the GO terms of the mapped gene set in order to generate a tag cloud from the identified differential gene expression of GO terms. The size of a GO term entry in a tag cloud indicates quantitative changes among multiple RNA‐seq or microarray experiments. A linear accumulation formula is therefore applied for assigning weighting coefficients. This formula simply accumulates the differences in the coverage rate of a specific gene among multiple experimental sets of short reads, or the differences in intensity between microarray gene spots. Accordingly, if the genes of an identified GO term show dramatic changes in expression level, the term is given a higher weighting coefficient, and the text size of the term entry in the tag cloud is drawn according to the weighting value. Here, the weighting scores are initially normalised into ten different levels according to the average distribution. Larger GO terms in the tag cloud therefore represent GO terms containing genes with higher differential expression.
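The separation of annotations by evidence code group can be illustrated with a short Python sketch. The grouping below follows the GO Consortium's commonly documented categories; it lists the six experimental codes named later in this paper, and the exact membership used inside the proposed system is an assumption. The function names and the annotation layout are hypothetical.

```python
EVIDENCE_GROUPS = {
    "experimental": {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"},
    "computational_analysis": {
        "ISS", "ISO", "ISA", "ISM", "IGC", "IBA", "IBD", "IKR", "IRD", "RCA",
    },
    "author_statement": {"TAS", "NAS"},
    "curatorial_statement": {"IC", "ND"},
    "electronic": {"IEA"},
}


def evidence_group(code):
    """Return the group name of an evidence code, or None if it is unknown."""
    for group, codes in EVIDENCE_GROUPS.items():
        if code in codes:
            return group
    return None


def split_by_group(annotations):
    """Partition (gene_id, go_term, evidence_code) triples by evidence group."""
    grouped = {group: [] for group in EVIDENCE_GROUPS}
    for gene_id, go_term, code in annotations:
        group = evidence_group(code)
        if group is not None:   # annotations with unknown codes are skipped
            grouped[group].append((gene_id, go_term))
    return grouped
```

Running the over‐representation analysis on each partition separately keeps electronically inferred annotations from dominating the experimentally supported ones.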
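The weighting scheme for the tag cloud can be sketched in Python under stated assumptions. The text describes a linear accumulation of expression differences per GO term followed by normalisation into ten levels, but gives no explicit formula; here the weight of a term is assumed to be the sum, over its annotated genes, of the absolute differences between consecutive datasets, and the ten levels are assumed to be equal‐width bins relative to the largest weight. The function names and placeholder identifiers are hypothetical.

```python
import math


def go_term_weights(expression, annotations):
    """Accumulate expression differences of the genes annotated to each GO term.

    expression:  dict gene id -> list of normalised levels, one per dataset
    annotations: dict GO term -> list of annotated gene ids
    """
    weights = {}
    for term, genes in annotations.items():
        total = 0.0
        for gene in genes:
            levels = expression.get(gene, [])
            # Assumed accumulation: sum of absolute consecutive differences.
            total += sum(abs(b - a) for a, b in zip(levels, levels[1:]))
        weights[term] = total
    return weights


def size_levels(weights, n_levels=10):
    """Map each weight onto a text-size level from 1 to n_levels."""
    top = max(weights.values(), default=0.0)
    if top <= 0:
        return {term: 1 for term in weights}
    return {
        term: max(1, math.ceil(n_levels * w / top)) for term, w in weights.items()
    }


expression = {"g1": [10.0, 30.0], "g2": [5.0, 5.0], "g3": [100.0, 0.0]}
annotations = {"term_a": ["g1", "g2"], "term_b": ["g3"]}
levels = size_levels(go_term_weights(expression, annotations))
```

In this toy input, term_b accumulates five times the variation of term_a and is therefore drawn at the largest of the ten sizes.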
2.6 System implementation
The comprehensive system was fully developed and verified, but it is not open to the public because of Internet bandwidth considerations and partial licence issues. However, to provide a web‐based proof‐of‐concept system, we also built a light version that is freely available and accepts constrained experimental data uploaded by users. The system can be freely accessed at the URL http://deepgo.cs.ntou.edu.tw. Input data files for this system are limited to gene expression rate tables instead of experimental raw data. For general microarray data, output formats vary among protocols, and the data usually need to be normalised together with the original image data. For NGS RNA‐seq raw reads, transferring several raw‐read files through the Internet requires significant bandwidth and unpredictable transmission time, since the NGS datasets may occupy several hundred gigabytes. The reference mapping procedures also take a long time, which might eventually cause a connection timeout. Therefore, to utilise the proposed system, it is suggested that users perform reference mapping or microarray image processing with their preferred tools on a local machine, and generate a table of average coverage for the reference genes. A small tool for generating an average coverage table from a SAM file is also provided as standard C source code. The average coverage tables for each time/spatial dataset can be stored in Comma Separated Values (*.csv) format or Microsoft Excel format. The genes in the uploaded file should be identified by their Ensembl IDs in order to prevent ambiguity of gene names. After the file has been uploaded and the analyse button clicked, the system automatically performs gene expression rate analysis on pathway and GO term annotation, and the results are shown in webpage format. Users can access the results through an assigned unique URL within the next 72 h.
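As an illustration of the kind of input table described above, the following Python sketch reads an average coverage table in CSV format. The column layout (an identifier column followed by one column per dataset), the header names and the placeholder gene identifiers are assumptions of this sketch; the exact layout expected by the web system is not specified in the text. The two rows reuse the coverage values reported for Ste2 and Mcm1 in the Results section.

```python
import csv
import io


def read_coverage_table(handle):
    """Read a CSV table: first column gene id, remaining columns per-dataset coverage."""
    reader = csv.reader(handle)
    header = next(reader)
    dataset_names = header[1:]
    table = {}
    for row in reader:
        if not row:            # skip blank lines
            continue
        table[row[0].strip()] = [float(value) for value in row[1:]]
    return dataset_names, table


# Hypothetical file content; real uploads should use Ensembl gene IDs.
example = io.StringIO(
    "ensembl_id,co_culture,hybrid\n"
    "GENE_A,2083.16,14.08\n"
    "GENE_B,94.31,126.94\n"
)
names, table = read_coverage_table(example)
```

Keeping one row per gene and one column per dataset means that the per‐gene list of levels can be passed directly to a trend or variation analysis across datasets.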
3 Result
Several experimental RNA‐seq datasets were used to examine the performance of our proposed system. Three query datasets were collected from the NCBI SRA database: ‘SRP002237’, ‘SRP005380‐DatasetN1’ and ‘SRP005380‐DatasetN2’ [35]. The first transcriptome dataset, SRP002237, included 24 sets of RNA‐seq experiments, and these cDNA datasets were sequenced for a study of natural selection on cis‐ and trans‐ regulation in yeast [36]. Of these, 12 datasets were obtained from co‐culture yeast (run: SRR039256–SRR039267) originating from two strains of S. cerevisiae yeast, and the other 12 datasets were obtained from hybrid yeast (run: SRR039244–SRR039255) originating from their hybrid offspring (F1 hybrids). The second dataset, ‘SRP005380‐DatasetN1’, included 4 datasets sequenced at four different time points during meiosis: 0, 1, 2 and 3 h. The last dataset, ‘SRP005380‐DatasetN2’, contained only 2 datasets sequenced at two different time points: 0 and 4 h. The latter two datasets were used for temporal analysis of meiotic diploid S. cerevisiae [37]. The calculated average coverage rates of all yeast genes from these three RNA‐seq datasets are shown in Table 1. All selected RNA‐seq reads were generated by Illumina high‐throughput sequencing technologies.
Table 1.
Average coverage rate in experimental datasets
| Datasets | Average coverage rates |
|---|---|
| SRP002237 | |
| SRR039244–55 (hybrid) | 24.83 |
| SRR039256–67 (co‐culture) | 25.18 |
| SRP005380 Dataset‐1 | |
| SRR094602_0 hr | 10.22 |
| SRR094603_1 hr | 8.94 |
| SRR094604_2 hr | 12.71 |
| SRR094605_3 hr | 13.05 |
| SRP005380 Dataset‐2 | |
| SRR094606_0 hr | 24.29 |
| SRR094607_4 hr | 28.79 |
The first step of the analytical pipeline mapped short reads to the S. cerevisiae genes. Here, we used a commercial reference mapping tool, CLC Genomics Workbench (version 5.1), to obtain aligned short reads on reference genes [18]. To utilise the data from the SRA website, the ‘fastq‐dump’ program from the SRA toolkit was first executed to obtain FASTQ sequences. Next, the extracted FASTQ reads were imported into the system after removing failed reads. After all reads were imported, the ‘Map Reads to Reference’ tool of CLC Genomics Workbench was used for reference mapping. The resulting data produced by the mapping tool contained the coverage rate of each aligned gene. It should be noted that the reference mapping tool in this step need not be commercial software; it could be substituted by any open‐source reference mapping tool such as Bowtie or SOAP [19, 20].
In the next phase, statistical analysis was performed for average coverage rate of each gene, and normalisation procedures were carried out by featuring a housekeeping gene list. In this study, we selected 14 housekeeping genes from TAFs family within S. cerevisiae [32]. Next, according to the expression levels from the selected housekeeping genes, previously defined biological pathways from KEGG dataset [38] and GO term association were automatically detected. All orthologous genes within a gene node from an identified pathway were individually annotated with normalised expression level among various RNA‐seq datasets. Accordingly, these analysed gene expression levels of all mapped genes among various datasets were visually displayed through temporal/cross‐strains pathway maps and GO tag cloud representation.
For the SRP002237 dataset, there were in total 95 yeast KEGG pathway maps identified and retrieved after gene clustering procedures. The depth of coverage variation in RNA‐seq for each gene in an identified pathway map was colour coded for transforming gene expression quantities into systematic visual representation. For example, the pheromone signal transfer pathway in the MAPK signaling pathway (Map ID: 04011) was shown in Fig. 2. The trend of average coverage rate for each gene among co‐culture and hybrid datasets were calculated, and details of individual gene expression were statistically shown after clicking on the colour coded gene boxes. In the statistical plot of expression levels, the green bars represented gene expression levels for the first co‐cultural experimental RNA‐seq, and the blue bars showed the expression quantities for the second hybrid experiment. To easily recognise the trend of differential feature of gene expression, an ascending blue triangle within a gene box represented the depth of coverage being increased from the first experiment to the second one; a descending green triangle within a gene box denoted the gene expression levels being decreased in an opposite trend. Fig. 2 showed that the average coverage rate of the Ste2 gene in the co‐culture experiment was 2083.16 and decreased to an average of 14.08 for the hybrid experiment. Conversely, the average coverage of the Mcm1 gene in co‐culture experiment was 94.31 and increased to an average of 126.94 for the hybrid experiment. With the information of coverage rate variations between different generations, the developed system could imply differential gene expression in a specific biological pathway, which could provide useful information for selecting appropriate genes for various applications such as cis ‐ and trans‐ changes in regulatory evolution of genes.
Fig. 2.

Variations of gene coverage rate in MAPK signaling pathway (Map ID:04011) from SRP002237 RNA‐seq datasets
System responded to the comparative results between two different experimental datasets
Regarding the same datasets, most GO term variations in the CC category are shown in Fig. 3 by a tag cloud visualisation approach. Two different types of evidence code were selected for demonstration. In Fig. 3 a, the electronic evidence code (IEA) for the SRP002237 dataset was selected, while the experimental evidence codes (EXP, IDA, IPI, IMP, IGI and IEP) for the SRP002237 dataset are shown in Fig. 3 b. A larger GO term indicates that its corresponding genes possess higher coverage variation rates among different RNA‐seq datasets. Users can click on any GO term in the tag cloud to visualise the coverage rates among different RNA‐seq datasets. Here, the relatively high RNA‐seq coverage variations of the top 4 GO terms are shown in Fig. 4, including nuclear matrix (GO:0016363), eukaryotic translation elongation factor 1 complex (GO:0005853), actin cytoskeleton (GO:0015629) and 3‐isopropylmalate dehydratase complex (GO:0009316). In this example, the gene expression levels of the co‐culture yeast genes for the GO terms ‘nuclear matrix’, ‘actin cytoskeleton’ and ‘3‐isopropylmalate dehydratase complex’ were significantly higher than those of the hybrid yeast, as seen from the bar chart distributions in Figs. 4 a, c and d. Conversely, the expression level of the hybrid yeast gene for the GO term ‘eukaryotic translation elongation factor 1 complex’ was significantly higher than that of the co‐culture yeast according to the bar charts in Fig. 4 b.
Fig. 3.

GO term variations in CC category
a GO term variations associated with CC based on IEA for SRP002237. The differences of average coverage rate between two experiments are more than 100 units, and the variations were normalised to show in tag cloud representation. The bottom figure shows the gene expression variations for GO term of eukaryotic translation elongation factor 1 complex (GO:0005853)
b GO term variations associated with CC based on experimental evidence codes (EXP, IDA, IPI, IMP, IGI and IEP) for SRP002237. The bottom figure shows the gene expression variations for GO term of histone deacetylase complex (GO:0000118)
Fig. 4.

Top 4 variation of GO terms with CC in IEA of SRP002237
a GO:0016363
b GO:0005853
c GO:0015629
d GO:0009316
The other two yeast RNA‐seq datasets, ‘SRP005380‐DatasetN1’ and ‘SRP005380‐DatasetN2’, were applied for temporal pathway analysis. In these two test cases, 95 yeast KEGG pathways were also identified and retrieved through gene mapping and identification processes. The data visualisation method was exactly the same as described in the previous case. The ‘SRP005380_N1’ dataset contained 4 RNA‐seq datasets which were sequenced at hourly intervals, and ‘SRP005380_N2’ contained only two datasets which were sequenced at two time points 4 h apart. From the comparison results of both datasets, several mapped pathways showed differential gene expression at significant levels. For example, both datasets reflected higher differential gene expression rates in the meiotic pathway map (ID:04113). Fig. 5 a shows the yeast meiosis pathway map for SRP005380_N1 and Fig. 5 b that for SRP005380_N2. The Mek1 gene showed decreased expression in both datasets, whereas the Glc7 gene showed increased expression in both. In addition to the temporal pathway analysis for these two RNA‐seq datasets, associated GO term analysis was also performed. The coverage counts at different time points for each gene and its associated GO terms were accumulated and compared for temporal GO term variation analysis. For example, the GO terms ‘GO:000045’ (THO complex part of transcription export complex), ‘GO:0000446’ (nucleoplasmic THO complex) and ‘GO:0019897’ (extrinsic to plasma membrane) in the CC ontology with the experimental evidence code group under dataset SRP005380_N1 revealed higher gene expression variations than other GO terms.
Fig. 5.

Meiosis yeast pathway maps (MAP:04113) with gene expression indication for
a SRP005380_N1
b SRP005380_N2. In both RNA‐seq datasets of the meiosis experiments, expression of the Mek1 gene decreases and expression of the Glc7 gene increases
4 Conclusion
In this study, we have developed a system for analysing differential gene expression from RNA‐seq or microarray datasets. The system normalises gene expression levels either by total size (total intensity) or against selected housekeeping genes. The KEGG pathway database was integrated into the system to efficiently retrieve gene clusters of interest and to analyse variations in gene expression under different environmental conditions. Users can select a group of functionally associated genes and display their levels of differential gene expression for a biological function of interest. In addition, a tag cloud representation of GO annotations with selectable evidence codes was designed for visualising functional conservation among genes with highly dynamic expression. We used publicly available RNA‐seq reads as test datasets to demonstrate that the workflow can clearly indicate associations between differential gene expression and biological functions. The workflow can be applied to various experimental conditions that induce different gene expression, and it is useful for the further design of biological experiments.
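As an illustration of the two normalisation options mentioned above, the following minimal sketch computes RPKM using the standard definition (reads per kilobase of transcript per million mapped reads) and then rescales a sample against housekeeping genes. Scaling by the mean level of the housekeeping genes is an assumption made for illustration, not necessarily the procedure implemented in the system. All function names, gene names and numbers are hypothetical.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase per Million mapped reads for one gene."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def normalise_to_housekeeping(levels, housekeeping_genes):
    """Rescale a sample so that the mean level of its housekeeping genes equals 1.

    Assumed scheme: divide every gene by the mean housekeeping level.
    """
    reference = sum(levels[g] for g in housekeeping_genes) / len(housekeeping_genes)
    return {gene: value / reference for gene, value in levels.items()}


# Hypothetical sample: read counts, gene lengths (bp) and total mapped reads.
counts = {"geneA": 200, "geneB": 1000, "hk1": 400}
lengths = {"geneA": 1000, "geneB": 2000, "hk1": 4000}
total_mapped = 1_000_000

rpkm_levels = {g: rpkm(counts[g], lengths[g], total_mapped) for g in counts}
scaled = normalise_to_housekeeping(rpkm_levels, ["hk1"])
print(rpkm_levels)
print(scaled)
```

RPKM corrects for gene length and sequencing depth within a sample, whereas the housekeeping step makes samples comparable under the assumption that the chosen reference genes are stably expressed across conditions.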
5 Acknowledgment
This work is supported by the Center of Excellence for Marine Bioenvironment and Biotechnology, National Taiwan Ocean University and National Science Council, Taiwan, R.O.C. (NSC 101‐2321‐B‐019‐001 and NSC 100‐2627‐B‐019‐006 to T.‐W. Pai).
6 References
- 1. Miller M.B., and Tang Y.W.: ‘Basic concepts of microarrays and potential applications in clinical microbiology’, Clin. Microbiol. Rev., 2009, 22, (4), pp. 611–33 (doi: 10.1128/CMR.00019-09)
- 2. Fu X., Fu N., and Guo S. et al.: ‘Estimating accuracy of RNA‐seq and microarrays with proteomics’, BMC Genomics, 2009, 10, p. 161 (doi: 10.1186/1471-2164-10-161)
- 3. Marguerat S., and Bahler J.: ‘RNA‐seq: from technology to biology’, Cell. Mol. Life Sci., 2010, 67, (4), pp. 569–79 (doi: 10.1007/s00018-009-0180-6)
- 4. Wang Z., Gerstein M., and Snyder M.: ‘RNA‐seq: a revolutionary tool for transcriptomics’, Nat. Rev. Genet., 2009, 10, (1), pp. 57–63 (doi: 10.1038/nrg2484)
- 5. Sugarbaker D.J., Richards W.G., and Gordon G.J. et al.: ‘Transcriptome sequencing of malignant pleural mesothelioma tumors’, Proc. Natl. Acad. Sci. USA, 2008, 105, (9), pp. 3521–3526 (doi: 10.1073/pnas.0712399105)
- 6. Maher C.A., Kumar‐Sinha C., and Cao X. et al.: ‘Transcriptome sequencing to detect gene fusions in cancer’, Nature, 2009, 458, (7234), pp. 97–101 (doi: 10.1038/nature07638)
- 7. Zhao Q., Caballero O.L., and Levy S. et al.: ‘Transcriptome‐guided characterization of genomic rearrangements in a breast cancer cell line’, Proc. Natl. Acad. Sci. USA, 2009, 106, (6), pp. 1886–1891 (doi: 10.1073/pnas.0812945106)
- 8. Wilhelm B.T., and Landry J.‐R.: ‘RNA‐seq‐quantitative measurement of expression through massively parallel RNA‐sequencing’, Methods, 2009, 48, (3), pp. 249–57 (doi: 10.1016/j.ymeth.2009.03.016)
- 9. Garber M., Grabherr M.G., Guttman M., and Trapnell C.: ‘Computational methods for transcriptome annotation and quantification using RNA‐seq’, Nat. Methods, 2011, 8, (6), pp. 469–477 (doi: 10.1038/nmeth.1613)
- 10. Shendure J., and Ji H.: ‘Next‐generation DNA sequencing’, Nat. Biotechnol., 2008, 26, (10), pp. 1135–45 (doi: 10.1038/nbt1486)
- 11. Macneil L.T., and Walhout A.J.: ‘Gene regulatory networks and the role of robustness and stochasticity in the control of gene expression’, Genome Res., 2011, 21, (5), pp. 645–657 (doi: 10.1101/gr.097378.109)
- 12. Ashburner M., Ball C.A., and Blake J.A. et al.: ‘Gene ontology: tool for the unification of biology. The Gene Ontology Consortium’, Nat. Genet., 2000, 25, (1), pp. 25–9 (doi: 10.1038/75556)
- 13. Beissbarth T., and Speed T.P.: ‘GOstat: find statistically overrepresented Gene Ontologies within a group of genes’, Bioinformatics, 2004, 20, (9), pp. 1464–5 (doi: 10.1093/bioinformatics/bth088)
- 14. Bauer S., Grossmann S., Vingron M., and Robinson P.N.: ‘Ontologizer 2.0 – a multifunctional tool for GO term enrichment analysis and data exploration’, Bioinformatics, 2008, 24, (14), pp. 1650–1 (doi: 10.1093/bioinformatics/btn250)
- 15. Zeeberg B.R., Feng W., and Wang G. et al.: ‘GoMiner: a resource for biological interpretation of genomic and proteomic data’, Genome Biol., 2003, 4, (4), p. R28 (doi: 10.1186/gb-2003-4-4-r28)
- 16. Huang da W., Sherman B.T., and Lempicki R.A.: ‘Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources’, Nat. Protoc., 2009, 4, (1), pp. 44–57 (doi: 10.1038/nprot.2008.211)
- 17. Flicek P., Amode M.R., and Barrell D. et al.: ‘Ensembl 2012’, Nucleic Acids Res., 2012, 40, (Database issue), pp. D84–90 (doi: 10.1093/nar/gkr991)
- 18. CLC bio: ‘CLC Genomics Workbench Product Sheet’
- 19. Li R., Yu C., and Li Y. et al.: ‘SOAP2: an improved ultrafast tool for short read alignment’, Bioinformatics, 2009, 25, (15), pp. 1966–7 (doi: 10.1093/bioinformatics/btp336)
- 20. Langmead B., and Salzberg S.L.: ‘Fast gapped‐read alignment with Bowtie 2’, Nat. Methods, 2012, 9, (4), pp. 357–359 (doi: 10.1038/nmeth.1923)
- 21. Dudoit S., Gentleman R.C., and Quackenbush J.: ‘Open source software for the analysis of microarray data’, Biotechniques, 2003, Suppl, pp. 45–51
- 22. Anders S., and Huber W.: ‘Differential expression analysis for sequence count data’, Genome Biol., 2010, 11, (10), p. R106 (doi: 10.1186/gb-2010-11-10-r106)
- 23. Kadanga A.K., Leroux C., and Bonnet M. et al.: ‘Image analysis and data normalization procedures are crucial for microarray analyses’, Gene Regul. Syst. Bio., 2008, 2, pp. 107–12
- 24. Grant R.P.: ‘Computational genomics: theory and application’ (Horizon Bioscience, Wymondham, 2004), pp. 225–249
- 25. Mortazavi A., Williams B.A., McCue K., Schaeffer L., and Wold B.: ‘Mapping and quantifying mammalian transcriptomes by RNA‐seq’, Nat. Methods, 2008, 5, (7), pp. 621–8 (doi: 10.1038/nmeth.1226)
- 26. Hsiao L.L., Dangond F., and Yoshida T. et al.: ‘A compendium of gene expression in normal human tissues’, Physiol. Genomics, 2001, 7, (2), pp. 97–104
- 27. Bullard J.H., Purdom E., Hansen K.D., and Dudoit S.: ‘Evaluation of statistical methods for normalization and differential expression in mRNA‐seq experiments’, BMC Bioinf., 2010, 11, p. 94 (doi: 10.1186/1471-2105-11-94)
- 28. Wilson D.L., Buckley M.J., Helliwell C.A., and Wilson I.W.: ‘New normalization methods for cDNA microarray data’, Bioinformatics, 2003, 19, (11), pp. 1325–32 (doi: 10.1093/bioinformatics/btg146)
- 29. Thellin O., Zorzi W., and Lakaye B. et al.: ‘Housekeeping genes as internal standards: use and limits’, J. Biotechnol., 1999, 75, (2–3), pp. 291–5 (doi: 10.1016/S0168-1656(99)00163-7)
- 30. Lee P.D., Sladek R., Greenwood C.M., and Hudson T.J.: ‘Control genes and variability: absence of ubiquitous reference transcripts in diverse mammalian expression studies’, Genome Res., 2002, 12, (2), pp. 292–297 (doi: 10.1101/gr.217802)
- 31. de Jonge H.J., Fehrmann R.S., and de Bont E.S. et al.: ‘Evidence based selection of housekeeping genes’, PLoS One, 2007, 2, (9), p. e898 (doi: 10.1371/journal.pone.0000898)
- 32. Huisinga K.L., and Pugh B.F.: ‘A genome‐wide housekeeping role for TFIID and a highly regulated stress‐related role for SAGA in Saccharomyces cerevisiae’, Mol. Cell, 2004, 13, (4), pp. 573–85 (doi: 10.1016/S1097-2765(04)00087-5)
- 33. Kanehisa M., and Goto S.: ‘KEGG: kyoto encyclopedia of genes and genomes’, Nucleic Acids Res., 2000, 28, (1), pp. 27–30 (doi: 10.1093/nar/28.1.27)
- 34. Rogers M.F., and Ben‐Hur A.: ‘The use of gene ontology evidence codes in preventing classifier assessment bias’, Bioinformatics, 2009, 25, (9), pp. 1173–1177 (doi: 10.1093/bioinformatics/btp122)
- 35. Kodama Y., Shumway M., and Leinonen R.: ‘The sequence read archive: explosive growth of sequencing data’, Nucleic Acids Res., 2012, 40, (Database issue), pp. D54–6 (doi: 10.1093/nar/gkr854)
- 36. Emerson J.J., Hsieh L.C., and Sung H.M. et al.: ‘Natural selection on cis and trans regulation in yeasts’, Genome Res., 2010, 20, (6), pp. 826–36 (doi: 10.1101/gr.101576.109)
- 37. Pan J., Sasaki M., and Kniewel R. et al.: ‘A hierarchical combination of factors shapes the genome‐wide topography of yeast meiotic recombination initiation’, Cell, 2011, 144, (5), pp. 719–31 (doi: 10.1016/j.cell.2011.02.009)
- 38. Kotera M., Hirakawa M., Tokimatsu T., Goto S., and Kanehisa M.: ‘The KEGG databases and tools facilitating omics analysis: latest developments involving human diseases and pharmaceuticals’, Methods Mol. Biol., 2012, 802, pp. 19–39 (doi: 10.1007/978-1-61779-400-1_2)
