Abstract
Motivation
We have proposed a mixture model based approach to the concordant integrative analysis of multiple large-scale two-sample expression datasets. Since the mixture model is based on the transformed differential expression test P-values (z-scores), it is generally applicable to the expression data generated by either microarray or RNA-seq platforms. The mixture model is simple with three normal distribution components for each dataset to represent down-regulation, up-regulation and no differential expression. However, when the number of datasets increases, the model parameter space increases exponentially due to the component combination from different datasets.
Results
In this study, motivated by the well-known generalized estimating equations (GEEs) for longitudinal data analysis, we focus on the concordant components and assume that the proportions of non-concordant components follow a special structure. We discuss the exchangeable, multiset coefficient and autoregressive structures for model reduction, and their related expectation-maximization (EM) algorithms. Then, the parameter space is linear with the number of datasets. In our previous study, we have applied the general mixture model to three microarray datasets for lung cancer studies. We show that more gene sets (or pathways) can be detected by the reduced mixture model with the exchangeable structure. Furthermore, we show that more genes can also be detected by the reduced model. The Cancer Genome Atlas (TCGA) data have been increasingly collected. The advantage of incorporating the concordance feature has also been clearly demonstrated based on TCGA RNA sequencing data for studying two closely related types of cancer.
Availability and Implementation
Additional results are included in a supplemental file. R functions are freely available at http://home.gwu.edu/~ylai/research/Concordance.
Supplementary information
Supplementary data are available at Bioinformatics online.
1 Introduction
Microarray and RNA-seq technologies have been widely used in biomedical studies (Lockhart et al., 1996; Nagalakshmi et al., 2008; Schena et al., 1995; The Cancer Genome Atlas Network, 2008; Wilhelm et al., 2008). Since the collection of large-scale gene expression data, many datasets have been made available to understand different diseases (Edgar and Barrett, 2006). Differential expression analysis has been widely conducted to identify disease-related genes (Storey and Tibshirani, 2003) and gene set enrichment analysis (GSEA) has been widely conducted to identify disease-related pathways or gene sets (Mootha et al., 2003; Subramanian et al., 2005). Multiple two-sample datasets have also been collected to understand some important diseases. For example, three large-scale two-sample expression datasets have been collected to understand the disease mechanisms of lung cancer (Beer et al., 2002; Bhattacharjee et al., 2001; Garber et al., 2001). We expect to achieve more efficient analysis results if an integrative analysis of these datasets can be conducted (Choi et al., 2003; Chen et al., 2013; de Magalhaes et al., 2009; Shen and Tseng, 2010; Tanner and Agarwal, 2008).
Genes may show different behaviors among different datasets due to disease heterogeneity. Furthermore, there is usually a considerable amount of noise in large-scale gene expression data. Therefore, it is interesting to identify genes and pathways (or gene sets) with concordant behaviors among different datasets (Subramanian et al., 2005). We have proposed a mixture model based approach to the concordant integrative analysis of multiple large-scale two-sample expression datasets (Lai et al., 2009, 2014). Since the mixture model is based on the transformed differential expression test P-values (z-scores), it is generally applicable to the expression data generated by different platforms (e.g. microarray and RNA-seq). The mixture model is simple with three normal distribution components for each dataset to represent down-regulation, up-regulation and no differential expression. However, when the number of datasets increases, the model parameter space increases exponentially due to the component combination from different datasets.
This study is a statistical methodological development based on our previous work (Lai et al., 2007, 2009, 2014). In our previous work, we have proposed the general partial concordance-discordance (PCD) mixture model and its related likelihood ratio test of genome-wide concordance or genome-wide discordance (Lai et al., 2007). Then, based on this statistical framework, we have demonstrated how the concordant differential expression can be detected (Lai et al., 2009). Also based on this framework, we have developed a method for analyzing concordant integrative gene set enrichment (Lai et al., 2014). In this study, our goal is to develop a method for detecting concordantly differentially expressed genes and concordantly enriched gene sets from a relatively large number of two-sample expression datasets. As the parameter space increases exponentially with the number of datasets, we have developed some efficient approaches so that the parameter space can be reduced and it is still feasible to conduct concordant integrative analysis.
In this study, motivated by the well-known generalized estimating equations (GEEs) for longitudinal data analysis (Diggle et al., 2013), we focus on the concordant components and assume that the proportions of non-concordant components follow a special structure. We discuss three approaches: exchangeable, multiset coefficient and autoregressive structures, and their related expectation-maximization (EM) algorithms (McLachlan and Krishnan, 2008). Then, the parameter space is linear with the number of datasets. Through this model reduction strategy, we expect to achieve more efficient analysis results.
2 Materials and methods
2.1 Motivation and summary
Two-sample genome-wide expression microarray and sequencing data have been widely collected in biomedical studies. Differential expression analysis and gene set enrichment analysis (GSEA, or gene set analysis) have been widely conducted for these data. Furthermore, multiple two-sample genome-wide expression datasets have been collected for some similar or same study purposes. However, it is usually difficult to simply combine multiple datasets into one set and different datasets need to be analyzed separately. One interesting data exploration for multiple sets is to identify genes showing statistically significant changes that are concordant among multiple datasets. For example, we observe a gene showing a clearly positive change from one group to the other group, and we observe this for the same gene (and the same groups) among multiple datasets. Furthermore, gene sets (or pathways) can be defined and data exploration can be performed at the gene set level. With a given collection (a large number) of gene sets, it is also interesting to identify gene sets showing statistically significant coordinate changes that are concordant among multiple datasets. For example, in a given gene set, we observe many genes showing clearly positive changes from one group to the other group, and we observe this for the same gene set (and the same groups) among multiple datasets.
Based on the two-sample test z-scores (see Multiple datasets and z-scores for details), we have developed a mixture model based framework for exploring these concordant changes (see A general mixture model for details). As the number of datasets increases, it is necessary to reduce the number of parameters in the model (see Reduction of parameter space for details). Motivated by the well-known generalized estimating equations (GEEs) for longitudinal data analysis, we propose some model reduction approaches so that efficient analysis results can be achieved. The statistical significance can be evaluated based on the mixture model based false discovery rate (see Concordance scores and false discovery rate for details).
2.2 Multiple datasets and z-scores
For a concordant integrative analysis of multiple two-sample gene expression datasets, we have proposed a mixture model based on the z-scores (Lai et al., 2009, 2014). Let G be the number of common genes across K datasets for a concordant integrative analysis. For the ith gene in the kth dataset, $i = 1, 2, \ldots, G$ and $k = 1, 2, \ldots, K$, we perform a two-sample differential expression test and obtain the related upper-tailed P-value $p_{ik}$. Then, the P-value is transformed to a z-score: $z_{ik} = \Phi^{-1}(1 - p_{ik})$, where $\Phi(\cdot)$ is the standard normal cumulative distribution function (c.d.f.). (Notice that a one-sided P-value is necessary so that the sign of a z-score is related to the direction of differential expression.) Then, we obtain a $G \times K$ matrix of z-scores for all the common genes across different datasets. (Notice that, as any appropriate two-sample test can be considered for differential expression analysis, this approach is generally applicable to large-scale expression data like microarray or RNA-seq data.)
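The transformation from upper-tailed P-values to z-scores can be sketched as follows. This is a minimal illustration, assuming the transformation $z = \Phi^{-1}(1 - p)$; the function name `p_to_z`, the clipping constant and the example P-values are hypothetical and are not taken from the authors' R functions.

```python
from statistics import NormalDist


def p_to_z(p_upper: float) -> float:
    """Map an upper-tailed P-value to a z-score via z = Phi^{-1}(1 - p)."""
    # Clip away from 0 and 1 so the z-score stays finite (illustrative choice).
    eps = 1e-12
    p = min(max(p_upper, eps), 1.0 - eps)
    return NormalDist().inv_cdf(1.0 - p)


# Hypothetical upper-tailed P-values: rows are genes, columns are datasets.
p_values = [
    [0.001, 0.020, 0.010],  # small P-values in all datasets: positive z-scores
    [0.500, 0.450, 0.600],  # no clear change
    [0.990, 0.970, 0.999],  # large P-values in all datasets: negative z-scores
]
z_scores = [[p_to_z(p) for p in row] for row in p_values]
```

A small upper-tailed P-value gives a positive z-score and a P-value near one gives a negative z-score, so the sign of each entry carries the direction of the change.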
2.3 A general mixture model
Genes may have different behaviors across different datasets. For simplicity, we have proposed three representative normal distribution components for each gene in the kth dataset: negative mean $\mu_{1k} < 0$ (for down-regulation, to be estimated), zero mean $\mu_{2k} = 0$ (for no differential expression, fixed) and positive mean $\mu_{3k} > 0$ (for up-regulation, to be estimated). Notice that no differential expression is the null hypothesis for the two-sample test mentioned above. Since the P-values follow the uniform distribution under the null hypothesis, the transformed z-scores follow the standard normal distribution. Therefore, the variance parameter for the second component is fixed as $\sigma_{2k}^2 = 1$. The other two variance parameters $\sigma_{1k}^2$ and $\sigma_{3k}^2$ are to be estimated.
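We assume that different datasets have been independently collected. For gene i, $i = 1, 2, \ldots, G$, if we know its behavior in each dataset, then the probability of observing its z-scores is the product of z-score probabilities from individual datasets. Let $\pi_{j_1 j_2 \cdots j_K}$ be the probability to choose component $j_1$ for the first dataset, $j_2$ for the second dataset, … , $j_K$ for the Kth dataset, $j_k \in \{1, 2, 3\}$. ($\pi_{j_1 j_2 \cdots j_K}$ are the mixture proportions.) Let $\phi(\cdot\,; \mu, \sigma^2)$ be the normal probability density function (p.d.f.) with mean $\mu$ and variance $\sigma^2$. We have proposed a partial-concordance-discordance (PCD) mixture model with the following probability density function:
$$f(z_{i1}, \ldots, z_{iK}) = \sum_{j_1=1}^{3} \cdots \sum_{j_K=1}^{3} \pi_{j_1 j_2 \cdots j_K} \prod_{k=1}^{K} \phi(z_{ik}; \mu_{j_k k}, \sigma_{j_k k}^2).$$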
The parameters can be estimated by the well-known expectation-maximization (EM) algorithm (McLachlan and Krishnan, 2008), in which it is necessary to introduce a set of unobserved indicator variables, each equal to 1 when $z_{ik}$ is sampled from the $j_k$th component for the kth dataset ($k = 1, 2, \ldots, K$) and 0 otherwise, for $i = 1, 2, \ldots, G$ and $j_k \in \{1, 2, 3\}$. In the algorithm, the E-step and M-step are iterated until numerical convergence is achieved. In this study, numerical convergence is reached when the difference between the current log-likelihood and the previous one is within a given tolerance value.
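To make the structure of the general PCD model concrete, the density can be evaluated by brute force over all $3^K$ component combinations. The sketch below is illustrative only: the function names and the parameter values are hypothetical, and the components are coded 0 (down), 1 (null) and 2 (up) in the code.

```python
import itertools
import math


def normal_pdf(z, mu, var):
    """Normal density with mean mu and variance var."""
    return math.exp(-(z - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def pcd_density(z, props, mus, variances):
    """Mixture density for one gene.

    z: list of K z-scores; props: dict mapping a combination (j_1, ..., j_K)
    to its proportion; mus[k][j] and variances[k][j]: parameters of
    component j in dataset k.
    """
    total = 0.0
    for combo in itertools.product(range(3), repeat=len(z)):
        term = props[combo]
        for k, j in enumerate(combo):
            term *= normal_pdf(z[k], mus[k][j], variances[k][j])
        total += term
    return total


# Hypothetical parameters for K = 2 datasets with equal proportions.
K = 2
mus = [[-2.0, 0.0, 2.0] for _ in range(K)]
variances = [[1.0, 1.0, 1.0] for _ in range(K)]
combos = list(itertools.product(range(3), repeat=K))
props = {combo: 1.0 / len(combos) for combo in combos}
density = pcd_density([1.5, 2.0], props, mus, variances)
```

The loop visits $3^K$ terms and the dictionary `props` holds $3^K$ proportions, which is the exponential growth that the reduced structures are meant to avoid.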
2.4 Reduction of parameter space
The number of distinct proportion parameters is $3^K - 1$ in the above PCD model. It increases exponentially when the number of datasets increases. For a concordant integrative analysis, we are more interested in the concordant components with $j_1 = j_2 = \cdots = j_K$ for the proportion parameters. Therefore, motivated by the well-known generalized estimating equations (GEEs), we consider three special structures for the proportions of non-concordant components: the exchangeable and multiset coefficient structures described below, and the autoregressive structure described in the Appendix.
2.4.1 Exchangeable structure
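For the exchangeable structure, we assume that the proportions of non-concordant components are the same, i.e. $\pi_{j_1 \cdots j_K} = \pi_{j'_1 \cdots j'_K}$ when the $j_k$'s are not all the same and the $j'_k$'s are not all the same, $j_k, j'_k \in \{1, 2, 3\}$. Then, there are only three distinct proportion parameters and we have $\pi_{j_1 \cdots j_K} = (1 - \pi_1 - \pi_2 - \pi_3)/(3^K - 3)$ for every non-concordant component, where $\pi_j$ is the simplified notation for $\pi_{j j \cdots j}$ with $j = 1, 2, 3$. Define the set $A = \{(j_1, \ldots, j_K): j_k\text{'s are not all the same};\ j_k \in \{1, 2, 3\}\}$. Then, the probability density function is
$$f(z_{i1}, \ldots, z_{iK}) = \sum_{j=1}^{3} \pi_j \prod_{k=1}^{K} \phi(z_{ik}; \mu_{jk}, \sigma_{jk}^2) + \frac{1 - \pi_1 - \pi_2 - \pi_3}{3^K - 3} \sum_{(j_1, \ldots, j_K) \in A} \prod_{k=1}^{K} \phi(z_{ik}; \mu_{j_k k}, \sigma_{j_k k}^2).$$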
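We can derive the related expectation-maximization (EM) algorithm. The E-step is the calculation of expected values. Let $\tau_{i, j_1 \cdots j_K}$ denote the conditional expectation, given the z-scores of gene i, of the indicator variable for the component combination $(j_1, \ldots, j_K)$, and let $f(z_{i1}, \ldots, z_{iK})$ be the probability density function of the exchangeable model. When the $j_k$'s are all the same as j, we have
$$\tau_{i, j \cdots j} = \frac{\pi_j \prod_{k=1}^{K} \phi(z_{ik}; \mu_{jk}, \sigma_{jk}^2)}{f(z_{i1}, \ldots, z_{iK})};$$
when the $j_k$'s are not all the same, we have
$$\tau_{i, j_1 \cdots j_K} = \frac{(1 - \pi_1 - \pi_2 - \pi_3) \prod_{k=1}^{K} \phi(z_{ik}; \mu_{j_k k}, \sigma_{j_k k}^2)}{(3^K - 3)\, f(z_{i1}, \ldots, z_{iK})}.$$
The M-step is the parameter estimation and we have
$$\pi_j = \frac{1}{G} \sum_{i=1}^{G} \tau_{i, j \cdots j}, \qquad \mu_{jk} = \frac{\sum_{i=1}^{G} w_{ijk} z_{ik}}{\sum_{i=1}^{G} w_{ijk}}, \qquad \sigma_{jk}^2 = \frac{\sum_{i=1}^{G} w_{ijk} (z_{ik} - \mu_{jk})^2}{\sum_{i=1}^{G} w_{ijk}},$$
where $w_{ijk}$ is the sum of $\tau_{i, j_1 \cdots j_K}$ over all combinations with $j_k = j$, the proportions are updated for $j = 1, 2, 3$, and the means and variances are updated for $j = 1, 3$ only (the parameters of the second component are fixed at zero mean and unit variance).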
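The EM iteration for the exchangeable structure can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not the authors' R implementation: components are coded 0 (down), 1 (null, fixed at N(0, 1)) and 2 (up), the starting values, the variance floor and the simulated data are arbitrary choices, and all function names are hypothetical. The sketch uses the identity that the sum of the density products over all $3^K$ combinations equals the product over datasets of the per-dataset sums, so no explicit enumeration of the non-concordant set is needed.

```python
import math
import random


def npdf(z, mu, var):
    """Normal density with mean mu and variance var."""
    return math.exp(-(z - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_exchangeable(Z, n_iter=100, tol=1e-6):
    """Fit the exchangeable mixture to Z (G rows of K z-scores).

    Returns (pi, mu, var, history) with history the log-likelihood trace.
    """
    G, K = len(Z), len(Z[0])
    n_disc = 3 ** K - 3                      # number of non-concordant combinations
    pi = [0.1, 0.6, 0.1]                     # arbitrary starting values
    mu = [[-2.0, 0.0, 2.0] for _ in range(K)]
    var = [[1.0, 1.0, 1.0] for _ in range(K)]
    history = []
    for _ in range(n_iter):
        lam = (1.0 - sum(pi)) / n_disc       # common non-concordant proportion
        loglik = 0.0
        conc = [0.0] * 3                     # summed concordant posteriors
        sw = [[0.0] * 3 for _ in range(K)]   # weighted sums for the M-step
        swz = [[0.0] * 3 for _ in range(K)]
        swzz = [[0.0] * 3 for _ in range(K)]
        for z in Z:                          # E-step, one gene at a time
            phi = [[npdf(z[k], mu[k][j], var[k][j]) for j in range(3)]
                   for k in range(K)]
            S = [sum(row) for row in phi]    # per-dataset sum over components
            all_prod = math.prod(S)          # sum over all 3^K combinations
            conc_prod = [math.prod(phi[k][j] for k in range(K)) for j in range(3)]
            f = (sum(pi[j] * conc_prod[j] for j in range(3))
                 + lam * (all_prod - sum(conc_prod)))
            loglik += math.log(f)
            for j in range(3):
                tau = pi[j] * conc_prod[j] / f
                conc[j] += tau
                for k in range(K):
                    # Posterior probability that dataset k is in component j,
                    # over concordant and non-concordant combinations.
                    w = tau + lam * (phi[k][j] * all_prod / S[k] - conc_prod[j]) / f
                    sw[k][j] += w
                    swz[k][j] += w * z[k]
                    swzz[k][j] += w * z[k] ** 2
        history.append(loglik)
        if len(history) > 1 and abs(history[-1] - history[-2]) < tol:
            break
        pi = [c / G for c in conc]           # M-step: proportions
        for k in range(K):
            for j in (0, 2):                 # the null component stays fixed
                mu[k][j] = swz[k][j] / sw[k][j]
                var[k][j] = max(swzz[k][j] / sw[k][j] - mu[k][j] ** 2, 1e-6)
    return pi, mu, var, history


def simulate(G=300, K=3, seed=1):
    """Simulated z-scores: mostly concordant genes plus a few random patterns."""
    rng = random.Random(seed)
    means = (-2.5, 0.0, 2.5)
    Z = []
    for _ in range(G):
        u = rng.random()
        if u < 0.2:
            comp = [0] * K
        elif u < 0.4:
            comp = [2] * K
        elif u < 0.9:
            comp = [1] * K
        else:
            comp = [rng.randrange(3) for _ in range(K)]
        Z.append([rng.gauss(means[j], 1.0) for j in comp])
    return Z


Z = simulate()
pi, mu, var, history = em_exchangeable(Z)
```

Each iteration costs on the order of G times K operations rather than G times $3^K$, and the log-likelihood trace in `history` gives the quantity used for the convergence check.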
2.4.2 Multiset coefficient structure
For the set A defined above, we divide it into several groups. In each group, the numbers of occurrences of each component among $j_1, j_2, \ldots, j_K$ are exactly the same. (For example, (1,2,1) and (1,1,2) are in the same group when K = 3.) Let $d_h$ be the number of occurrences of component h, $h = 1, 2, 3$. The possible values for $d_h$ are $0, 1, \ldots, K$. Then, the number of groups is equal to the number of non-negative integer solutions to the equation $d_1 + d_2 + d_3 = K$, minus 3 (for the three solutions in which all $j_k$'s are identical). This is equivalent to the classic problem of combinations with repetition. The number of groups is $\left(\!\!\binom{3}{K}\!\!\right) - 3 = \binom{K+2}{K} - 3$, in which $\left(\!\!\binom{3}{K}\!\!\right)$ is called the multiset coefficient in combinatorics. We use $A_t$ to denote these groups, where $t = 1, 2, \ldots, \binom{K+2}{K} - 3$.
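Let $r_t$ be the number of elements in group $A_t$ and $d_{th}$ be the number of repetitions of component h in $A_t$. Then, $r_t$ is the number of ways to choose the first component $d_{t1}$ times, the second component $d_{t2}$ times, and the third component $d_{t3}$ times. Therefore, $r_t = K!/(d_{t1}!\, d_{t2}!\, d_{t3}!)$. For all non-concordant components in $A_t$, we assume that the related proportions are the same, i.e. $\pi_{j_1 \cdots j_K} = \pi_{A_t}$ when $(j_1, \ldots, j_K) \in A_t$, with $\pi_{A_t}$ denoting the common proportion in group $A_t$. Notice that $\sum_{j=1}^{3} \pi_j + \sum_{t} r_t \pi_{A_t} = 1$. Then, with $\pi_j$ defined above, the probability density function is
$$f(z_{i1}, \ldots, z_{iK}) = \sum_{j=1}^{3} \pi_j \prod_{k=1}^{K} \phi(z_{ik}; \mu_{jk}, \sigma_{jk}^2) + \sum_{t} \pi_{A_t} \sum_{(j_1, \ldots, j_K) \in A_t} \prod_{k=1}^{K} \phi(z_{ik}; \mu_{j_k k}, \sigma_{j_k k}^2).$$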
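The E-step for this model is as follows. Let $\tau_{i, j_1 \cdots j_K}$ denote the conditional expectation, given the z-scores of gene i, of the indicator variable for the component combination $(j_1, \ldots, j_K)$, let $f(z_{i1}, \ldots, z_{iK})$ be the probability density function of the multiset coefficient model, and let $\pi_{A_t}$ be the common proportion of the components in group $A_t$. When the $j_k$'s are all the same as j, we have
$$\tau_{i, j \cdots j} = \frac{\pi_j \prod_{k=1}^{K} \phi(z_{ik}; \mu_{jk}, \sigma_{jk}^2)}{f(z_{i1}, \ldots, z_{iK})};$$
when $(j_1, \ldots, j_K) \in A_t$, we have
$$\tau_{i, j_1 \cdots j_K} = \frac{\pi_{A_t} \prod_{k=1}^{K} \phi(z_{ik}; \mu_{j_k k}, \sigma_{j_k k}^2)}{f(z_{i1}, \ldots, z_{iK})}.$$
For the M-step, we have
$$\pi_j = \frac{1}{G} \sum_{i=1}^{G} \tau_{i, j \cdots j}, \qquad \pi_{A_t} = \frac{1}{G\, r_t} \sum_{i=1}^{G} \sum_{(j_1, \ldots, j_K) \in A_t} \tau_{i, j_1 \cdots j_K},$$
$$\mu_{jk} = \frac{\sum_{i=1}^{G} w_{ijk} z_{ik}}{\sum_{i=1}^{G} w_{ijk}}, \qquad \sigma_{jk}^2 = \frac{\sum_{i=1}^{G} w_{ijk} (z_{ik} - \mu_{jk})^2}{\sum_{i=1}^{G} w_{ijk}},$$
where $r_t$ is the number of elements in $A_t$, $w_{ijk}$ is the sum of $\tau_{i, j_1 \cdots j_K}$ over all combinations with $j_k = j$, and the means and variances are updated for $j = 1, 3$ only (the parameters of the second component are fixed at zero mean and unit variance).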
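The grouping behind the multiset coefficient structure can be checked numerically. The sketch below enumerates the occupancy patterns $(d_1, d_2, d_3)$ of the non-concordant combinations and the group sizes given by the multinomial coefficient; the function name `multiset_groups` is hypothetical.

```python
import math


def multiset_groups(K):
    """Map each non-concordant pattern (d1, d2, d3) to its group size r_t."""
    groups = {}
    for d1 in range(K + 1):
        for d2 in range(K + 1 - d1):
            d3 = K - d1 - d2
            if K in (d1, d2, d3):
                continue  # skip the three concordant patterns
            groups[(d1, d2, d3)] = math.factorial(K) // (
                math.factorial(d1) * math.factorial(d2) * math.factorial(d3)
            )
    return groups


# K = 3 gives 7 groups that together cover the 24 non-concordant combinations.
groups_k3 = multiset_groups(3)
n_groups_k3 = len(groups_k3)
n_combinations_k3 = sum(groups_k3.values())
```

For K = 7 the same function returns 33 groups covering 2184 non-concordant combinations, so the number of proportion parameters tied to non-concordant behavior drops from 2184 to 33.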
2.5 Concordance scores and false discovery rate
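It is biologically interesting if a gene shows a concordant behavior across different datasets. In the mixture model, this is related to the component $(1, 1, \ldots, 1)$ for down-regulation or $(3, 3, \ldots, 3)$ for up-regulation. Then, for the ith gene $X_i$, the concordant differential expression score (CS) can be defined by the following mixture-model-based conditional probability:
$$CS_i = \Pr\big(X_i \text{ belongs to component } (j, j, \ldots, j) \mid z_{i1}, \ldots, z_{iK}\big) = \frac{\pi_j \prod_{k=1}^{K} \phi(z_{ik}; \mu_{jk}, \sigma_{jk}^2)}{f(z_{i1}, \ldots, z_{iK})},$$
with $j = 3$ for up-regulation and $j = 1$ for down-regulation (Lai et al., 2009). Here $\pi_j$ is the proportion of the concordant component j, $\phi$ is the normal probability density function and $f$ is the probability density function of the fitted mixture model.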
It is also biologically interesting if a gene set or pathway shows an overall concordant gene set enrichment. It has been discussed that concordant gene set enrichment can be measured by a probability, namely the probability that the number of genes with concordant differential expression is larger than the expected value (Lai et al., 2014). The mixture-model-based concordant gene set enrichment score (CES) for a gene set S with $m_S$ genes is given below:
$$CES_S = \Pr\Big(\sum_{i \in S} I_i > m_S \pi_j \,\Big|\, \text{z-scores of the genes in } S\Big),$$
where $I_i$ is an indicator that the ith gene shows a concordant differential expression and $\pi_j$ is the proportion of the related concordant component ($j = 1$ for down-regulation and $j = 3$ for up-regulation). However, it has been shown that the exact calculation of CES is related to the heterogeneous Bernoulli process (Lai et al., 2014). The related computation is usually not practically feasible and a Monte Carlo approximation for CES of a gene set S has been suggested (Lai et al., 2014). The following is a brief description: (i) Initialize W = 0. (ii) For the ith gene in S, simulate a Bernoulli random variable with probability of event equal to the mixture-model-based conditional probability of this gene being concordantly differentially expressed. (iii) Count the number of events from all genes in S and increase W by one if the count is larger than the expected value ($m_S \pi_1$ for down-regulation or $m_S \pi_3$ for up-regulation). (iv) Repeat steps ii and iii B times and then W/B is the approximated CES. B = 2000 has been suggested (Lai et al., 2014).
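Steps (i) to (iv) of the Monte Carlo approximation can be sketched as follows. This is an illustration under stated assumptions: the function name, the example scores and the concordant proportion 0.05 are hypothetical, and the expected count is passed in as the gene set size times that proportion.

```python
import random


def ces_monte_carlo(cs_scores, expected_count, B=2000, seed=None):
    """Approximate CES for one gene set.

    cs_scores: conditional probabilities of concordant differential
    expression for the genes in the set; expected_count: the threshold
    that the simulated count must exceed.
    """
    rng = random.Random(seed)
    W = 0                                        # step (i)
    for _ in range(B):                           # step (iv)
        # Step (ii): one Bernoulli draw per gene in the set.
        count = sum(1 for p in cs_scores if rng.random() < p)
        if count > expected_count:               # step (iii)
            W += 1
    return W / B


# Hypothetical gene set with 10 genes and a hypothetical proportion of 0.05.
cs_scores = [0.95, 0.90, 0.80, 0.10, 0.05, 0.02, 0.01, 0.01, 0.60, 0.30]
proportion = 0.05
ces = ces_monte_carlo(cs_scores, len(cs_scores) * proportion, B=2000, seed=0)
```

Because each gene has its own event probability, the count is a sum of non-identical Bernoulli variables, which is why simulation is used here instead of a closed-form binomial tail.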
The relationship between the false discovery rate (FDR) and mixture models has been discussed in the literature (Lai et al., 2014; McLachlan et al., 2006). The above concordance score is a conditional probability as well as a true positive proportion. Therefore, if top T genes (ranked by their concordant differential expression scores) are identified, then the related FDR can be calculated as $\frac{1}{T} \sum_{t=1}^{T} (1 - CS_{(t)})$, where $CS_{(t)}$ is the tth largest score. Similarly, if top T gene sets or pathways (ranked by their concordant gene set enrichment scores) are identified, then the related FDR can be calculated as $\frac{1}{T} \sum_{t=1}^{T} (1 - CES_{(t)})$, where $CES_{(t)}$ is the tth largest score.
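The mixture-model-based FDR for a ranked list can be computed with a running average. The sketch below assumes the FDR of the top T items is the mean of one minus their scores; the function names and the example scores are hypothetical.

```python
def fdr_curve(scores):
    """Estimated FDR for the top T = 1, ..., n items ranked by score."""
    ranked = sorted(scores, reverse=True)
    curve, running = [], 0.0
    for t, score in enumerate(ranked, start=1):
        running += 1.0 - score       # expected number of false positives
        curve.append(running / t)
    return curve


def n_detected(scores, level=0.05):
    """Largest T whose estimated FDR does not exceed the given level."""
    # The curve is non-decreasing, so counting the qualifying entries works.
    return sum(1 for value in fdr_curve(scores) if value <= level)


# Hypothetical concordance scores for four genes.
example_scores = [0.98, 0.10, 0.99, 0.50]
example_curve = fdr_curve(example_scores)
example_hits = n_detected(example_scores, level=0.05)
```

Plotting `fdr_curve` against T gives the kind of curve used to compare models, where a lower curve means more detections at the same FDR.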
3 Results
3.1 Three lung cancer study datasets
To illustrate our reduced mixture model based approach to the concordant integrative analysis of multiple large-scale two-sample expression datasets, we consider three published experimental microarray gene expression datasets for studying lung cancer (Beer et al., 2002; Bhattacharjee et al., 2001; Garber et al., 2001). One dataset with 62 subjects was collected by a group of scientists in Boston, one dataset with 86 subjects was collected by a group of scientists in Michigan and one dataset with 24 subjects was collected by a group of scientists in Stanford. Subjects were classified as either ‘good’ or ‘poor’ outcomes (two-sample groups). 2865 genes were in common among these three datasets.
In our previous study, we have applied the general mixture model to these three datasets and we have observed interesting detections of biological pathways. In this study, since we have no prior knowledge about non-concordant gene behaviors, we consider the simple exchangeable structure for model reduction. Furthermore, the number of z-scores has been clearly reduced to 2865 as common genes are required among these three datasets. Compared to the multiset coefficient structure, the number of parameters is relatively small for the exchangeable structure.
Based on the false discovery rate (FDR), we compare the number of identified gene sets (concordantly enriched) based on the reduced mixture model versus the number of identified gene sets based on the general mixture model. Furthermore, we also compare the number of identified genes (concordantly differentially expressed) based on the reduced mixture model versus the number of identified genes based on the general mixture model.
3.1.1 z-scores
Figure 1 shows the pairwise comparison of z-scores. Overall, the pairwise comparison patterns from three pairs of datasets are all visually similar as a concordant pattern. We have developed a related statistical test for this purpose. The test considers the general PCD model as the alternative hypothesis and the reduced complete concordance model as the null hypothesis (Lai et al., 2007). As we have explained in our previous paper (Lai et al., 2007), it is more appropriate to use this genome-wide concordance test instead of the traditional Pearson correlation. The P-values can be computed by the parametric bootstrap procedure (McLachlan and Krishnan, 2008). All three P-values (for three possible pairs from the three lung cancer study datasets) were not significant (i.e. > 0.05). This implies an overall genome-wide concordance among these datasets (Fig. 1). Therefore, it is appropriate to use the exchangeable structure for our model reduction. Among 2865 common genes, there are 39 in the well-known KEGG apoptosis pathway. For an illustration of concordant gene set enrichment, we highlight these genes in the figure. Among these 39 genes, twelve genes are showing negative z-scores among all three datasets. (Notice that the number of positive/negative combinations from three datasets is $2^3 = 8$.) Although our concordant gene set enrichment analysis is based on a more complicated mixture model, this relatively high proportion ($12/39 \approx 31\%$) clearly demonstrates why this pathway is concordantly enriched in down-regulation.
Fig. 1.
Pairwise comparison of z-scores based on the Boston, Michigan and Stanford datasets. Gray dots represent 2865 common genes. Dark dots represent genes in the KEGG apoptosis pathway
3.1.2 Concordantly enriched gene sets
We use the gene set information in the Molecular Signatures Database (Mootha et al., 2003; Subramanian et al., 2005). Figures 2 and 3 show the curves based on the false discovery rate (FDR) versus the number of identified gene sets. (A lower curve means that the related method is better.) For either up-regulation or down-regulation, the curve based on the reduced mixture model is always lower than the curve based on the general mixture model. Figure 2 is based on 1320 gene sets from the C2 collection of curated gene sets (version 4 downloaded at the time of study). Figure 3 is based on 186 KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways. All these comparison results clearly support that the exchangeable structure is an efficient model reduction strategy.
Fig. 2.
Comparison of false discovery rate (FDR) curves based on the reduced mixture model versus the general mixture model. The curves are based on 1320 C2 collection of curated gene sets from the Molecular Signatures Database
Fig. 3.
Comparison of false discovery rate (FDR) curves based on the reduced mixture model versus the general mixture model. The curves are based on 186 KEGG pathways from the Molecular Signatures Database
Overall, the concordant gene set enrichment scores based on the general mixture model and these scores based on the reduced mixture model are still highly consistent. For 1320 C2 collection gene sets, the Spearman correlation is 91.4% for the up-regulation based score and the Spearman correlation is 91.5% for the down-regulation based score. For 186 KEGG pathways, the Spearman correlation is 92.0% for the up-regulation based score and the Spearman correlation is 93.7% for the down-regulation based score.
Regarding this application, we can also conduct a likelihood ratio test between the general PCD mixture model (alternative hypothesis) and the reduced mixture model (null hypothesis). The null hypothesis was not rejected (P-value > 0.05) for this application, which supported the choice of exchangeable structure. Furthermore, the difference between two curves is also visibly clear in Figure 2b (notice that Fig. 3 is based on the KEGG collection of 186 pathways, which is a subset of 1320 C2 collection for Fig. 2). Moreover, from either Figure 4a or b, the difference between two curves is also clear.
Fig. 4.
Comparison of false discovery rate (FDR) curves based on the reduced mixture model versus the general mixture model. The curves are based on 2865 common genes among the Boston, Michigan and Stanford datasets
3.1.3 Concordantly differentially expressed genes
Figure 4 shows the curves based on the false discovery rate (FDR) versus the number of identified genes. (A lower curve means that the related method is better.) Again, for either up-regulation or down-regulation, the curve based on the reduced mixture model is always lower than the curve based on the general mixture model. Among the 2865 common genes of the three datasets, at 5% FDR, for up-regulation, almost one hundred genes can be identified by the reduced mixture model but fewer than twenty genes can be identified by the general mixture model; for down-regulation, more than one hundred genes can be identified by the reduced mixture model but still only about ten genes can be identified by the general mixture model. Therefore, these comparison results also clearly support the use of the exchangeable structure as an efficient model reduction strategy.
Overall, the concordant differential expression scores based on the general mixture model and these scores based on the reduced mixture model are also highly consistent. For 2865 common genes among three datasets, the Spearman correlation is 96.3% for the up-regulation based score and the Spearman correlation is 92.9% for the down-regulation based score.
Among the identified genes with significant concordant differential expression (at 5% FDR), all the genes identified by the general mixture model can be identified by the reduced mixture model although their ranks are different by different models (for both up-regulated and down-regulated differential expression, see Supplementary File for details). Among genes additionally identified by the reduced mixture model, there are genes ARHE (or RND3, rank # 5), KRT19 (rank # 8), etc. for significant up-regulated differential expression and genes IL16 (rank # 3), LSP1 (rank # 4), etc. for significant down-regulated differential expression. It has been reported in the biomedical literature that genes ARHE/RND3 (Paysan et al., 2016), KRT19 (Ohtsuka et al., 2016), IL16 (Tang et al., 2014) and LSP1 (Park et al., 2014) are directly or indirectly associated with lung cancers.
3.1.4 Stability and reproducibility
We also conducted a comprehensive perturbation analysis so that the stability of analysis results by the reduced mixture model can be investigated. We first removed one subject (array) randomly from each dataset (there are $62 \times 86 \times 24 = 127\,968$ possible combinations from three datasets); we then recalculated the concordant differential expression scores (CS, for 2865 genes) and concordant gene set enrichment scores (CES, for 186 KEGG pathways). Both Pearson and Spearman correlations were calculated between the CS/CES scores based on the slightly reduced data and the CS/CES scores based on the original data. After 1000 random repetitions, we summarized the related (2.5%, 25%, 50%, 75%, 97.5%) quantiles. For CS scores on down-regulation, the quantiles for Pearson and Spearman correlations were (0.949, 0.981, 0.988, 0.992, 0.994) and (0.983, 0.991, 0.993, 0.994, 0.995), respectively; for CS scores on up-regulation, the quantiles for Pearson and Spearman correlations were (0.969, 0.985, 0.989, 0.990, 0.992) and (0.976, 0.987, 0.989, 0.991, 0.993), respectively. For CES scores on down-regulation, the quantiles for Pearson and Spearman correlations were (0.949, 0.961, 0.965, 0.970, 0.989) and (0.954, 0.964, 0.967, 0.971, 0.990), respectively; for CES scores on up-regulation, the quantiles for Pearson and Spearman correlations were (0.951, 0.964, 0.969, 0.980, 0.995) and (0.956, 0.967, 0.972, 0.980, 0.994), respectively. Overall, a satisfactory consistency between the results based on the slightly reduced data and the results based on the original data was observed for both CS and CES scores.
Furthermore, we conducted a comprehensive subset sampling analysis so that the reproducibility of analysis results by the reduced mixture model can be investigated, especially when the sample size is relatively small for each dataset. We first selected ten subjects (arrays) randomly from each sample group in each dataset (10 + 10 for each dataset); we then recalculated the concordant differential expression scores (CS, for 2865 genes) and concordant gene set enrichment scores (CES, for 186 KEGG pathways). Both Pearson and Spearman correlations were calculated between the CS/CES scores based on the small sampled data and the CS/CES scores based on the original data. After 1000 random repetitions, we summarized the related (2.5%, 25%, 50%, 75%, 97.5%) quantiles. For CS scores on down-regulation, the quantiles for Pearson and Spearman correlations were (0.397, 0.591, 0.645, 0.678, 0.714) and (0.438, 0.632, 0.682, 0.709, 0.736), respectively; for CS scores on up-regulation, the quantiles for Pearson and Spearman correlations were (0.325, 0.467, 0.558, 0.638, 0.698) and (0.451, 0.580, 0.648, 0.690, 0.720), respectively. For CES scores on down-regulation, the quantiles for Pearson and Spearman correlations were (0.312, 0.598, 0.713, 0.777, 0.832) and (0.288, 0.615, 0.733, 0.795, 0.848), respectively; for CES scores on up-regulation, the quantiles for Pearson and Spearman correlations were (0.189, 0.476, 0.641, 0.750, 0.833) and (0.221, 0.512, 0.672, 0.777, 0.854), respectively. Overall, a considerable consistency between the results based on the small sampled data and the results based on the original data was still observed for both CS and CES scores (although the noise level of early microarray data was relatively high).
3.2 The cancer genome atlas datasets
For a further illustration of our reduced mixture model, we applied the model with the multiset coefficient structure to the RNA sequencing (RNA-seq) data collected by The Cancer Genome Atlas (TCGA) project (The Cancer Genome Atlas Network, 2008). We selected the data for colon adenocarcinoma (COAD) and stomach adenocarcinoma (STAD). COAD and STAD are both gastrointestinal (GI) adenocarcinomas. Therefore, we expect a considerable level of similarity in genome-wide expression profiles between these two types of cancer. For these two RNA-seq datasets, we checked the numbers of subjects from different centers. At the time of study, there were four centers (two combined) with adequate numbers of normal/tumor subjects for COAD, and there were five centers (two combined) with adequate numbers of normal/tumor subjects for STAD. Therefore, we illustrate our method with these seven subsets (with seven subsets, it is necessary to reduce the number of model parameters). Furthermore, there are more than 20 000 common genes. It is therefore appropriate to consider the more complicated multiset coefficient structure (instead of the simple exchangeable structure).
After downloading the TCGA RNA-seq data, we removed the tumor samples that were paired with adjacent normal samples to avoid the related dependence issue. At the time of study, for the COAD study, the numbers of normal and tumor samples were 14 and 27 from the ‘A6’ center, 20 and 13 from the ‘AA’ center and 7 and 28 from the centers ‘AZ’ and ‘F4,’ respectively. For the STAD study, the numbers of normal and tumor samples were 12 and 118 from the ‘BR’ center, 8 and 26 from the ‘HU’ center, 7 and 32 from the ‘CG’ center and 7 and 22 from the centers ‘FP’ and ‘IN,’ respectively. (Please check the online references/manuals for the related TCGA details.) There are 20 531 common genes available for our study.
3.2.1 Gene set analysis based Fisher’s method
The gene set analysis (GSA) method was proposed by Efron and Tibshirani (2007) for analyzing enrichment in pathways (or gene sets). Maciejewski (2014) recommended this method as a preferred one in gene set enrichment analysis. For each subset, we calculated the P-value of enrichment in up-regulation for each gene set. Then, we used Fisher’s method (or Fisher’s combined probability test) to integrate the P-values from multiple datasets (for the same gene set). In summary, we summed up the log-transformed P-values and then multiplied the sum by −2. When K independent P-values are combined, this statistic follows a chi-squared distribution with 2K degrees of freedom under the null hypotheses. In this way, an integrative gene set enrichment analysis of multiple datasets was performed. (Notice that, using GSA, we also calculated the P-value of enrichment in down-regulation for each gene set and each dataset. The related chi-squared P-values were then calculated by Fisher’s method.) It is important to emphasize that our analysis purpose is to identify concordant enrichment among multiple datasets. However, this feature is usually not incorporated in a traditional integrative analysis (like Fisher’s method).
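Fisher's combined probability test is short enough to sketch directly. The example below assumes independent P-values that are strictly positive and uses the closed form of the chi-squared upper tail for an even number of degrees of freedom; the function name and the example P-values are hypothetical.

```python
import math


def fisher_combined(p_values):
    """Return (statistic, combined P-value) for independent P-values > 0."""
    K = len(p_values)
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    # Upper tail of a chi-squared distribution with 2K degrees of freedom:
    # exp(-x/2) * sum_{i=0}^{K-1} (x/2)^i / i!
    half = statistic / 2.0
    term, total = 1.0, 1.0
    for i in range(1, K):
        term *= half / i
        total += term
    return statistic, math.exp(-half) * total


# Hypothetical enrichment P-values of one gene set in seven subsets.
subset_p_values = [0.04, 0.20, 0.01, 0.50, 0.03, 0.10, 0.07]
statistic, combined_p = fisher_combined(subset_p_values)
```

The combined P-value can be small because of one or two very small inputs, which illustrates why this kind of integration does not by itself require the enrichment to be concordant across datasets.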
3.2.2 Comparison based on cancer related pathways
Among KEGG pathways, there are sixteen cancer related pathways. Fourteen of them can be downloaded from the Molecular Signatures Database (Mootha et al., 2003; Subramanian et al., 2005). Table 1 shows the comparison of our concordant gene set enrichment score (CES, up-regulation or down-regulation) to the P-values calculated by GSA-based Fisher’s method (up-regulation or down-regulation). (Notice that a lower P-value means a more significant result but a higher CES means a more significant result.) The p53 signaling pathway and cell cycle pathway are the two pathways identified by both methods. With the consideration of either Bonferroni-type adjustment or FDR-type adjustment of P-values, there is no further clear detection from GSA-based Fisher’s method. For example, at 5% level, the simple Bonferroni corrected P-value threshold for the fourteen pathways is about $0.05/14 \approx 0.0036$ (when up-regulation and down-regulation are considered separately). Only the p53 signaling and cell cycle pathways show their P-values less than this threshold. A similar probability threshold for CES is $1 - 0.05/14 \approx 0.9964$ (the larger the more significant for CES). For down-regulation, our method still clearly identified a few more pathways with concordant enrichment like the focal adhesion, adherens junction, MAPK signaling and PPAR signaling pathways (Table 1).
Table 1.
A comparison study
| KEGG pathway | P-value (GSA-Fisher, up) | CES (up) | P-value (GSA-Fisher, down) | CES (down) |
|---|---|---|---|---|
| ECM receptor interaction | 0.395 | 0.001 | 0.680 | 0.003 |
| Cytokine cytokine receptor interaction | 0.663 | < 0.001 | 0.779 | 0.033 |
| Focal adhesion | 0.641 | < 0.001 | 0.398 | > 0.999 |
| WNT signaling pathway | 0.249 | < 0.001 | 0.617 | 0.513 |
| Adherens junction | 0.817 | < 0.001 | 0.109 | > 0.999 |
| JAK STAT signaling pathway | 0.731 | < 0.001 | 0.646 | < 0.001 |
| MAPK signaling pathway | 0.967 | < 0.001 | 0.119 | > 0.999 |
| MTOR signaling pathway | 0.729 | < 0.001 | 0.123 | 0.806 |
| PPAR signaling pathway | > 0.999 | < 0.001 | 0.010 | > 0.999 |
| EGF signaling pathway | 0.948 | < 0.001 | 0.429 | 0.996 |
| Apoptosis | 0.207 | < 0.001 | 0.239 | 0.984 |
| P53 signaling pathway | < 0.001 | > 0.999 | > 0.999 | < 0.001 |
| Cell cycle | < 0.001 | > 0.999 | > 0.999 | < 0.001 |
| TGF beta signaling pathway | 0.357 | < 0.001 | 0.793 | 0.783 |
For significant detections, check pathways with low P-values or pathways with high CES.
4 Discussion
In this study, based on our previously proposed general mixture model, we discussed three model reduction approaches to an efficient concordant integrative analysis of multiple large-scale two-sample expression datasets. When the number of datasets increases, due to the combination of differential expression components from different datasets, the parameter space of the general mixture model increases exponentially. Motivated by the well-known generalized estimating equations (GEEs) for longitudinal data analysis, we focused on the concordant components and assumed that the proportions of non-concordant components follow a special structure. Then, the parameter space is linear with the number of datasets. We discussed the exchangeable, multiset coefficient and autoregressive structures (see Appendix), and their related expectation-maximization (EM) algorithms.
The exchangeable structure means that all discordant behaviors are assigned equal proportions in our mixture model. The multiset coefficient structure means that discordant behaviors with the same pattern are assigned equal proportions in our mixture model. (For example, [down,null,up], [null,down,up], [down,up,null], etc. share the same combination pattern: one null, one down and one up.)
For the three lung cancer study datasets, we checked the genome-wide concordance pairwise. Based on the parametric bootstrap procedure suggested by McLachlan and Krishnan (2008), the likelihood ratio test P-values were all non-significant (i.e. > 0.05), which suggests an overall genome-wide concordance among these datasets. Therefore, we may not need to worry much about possible discordant behaviors. With the single α parameter introduced in the exchangeable structure, discordant behaviors are still accounted for in our mixture model.
For the TCGA data, we studied two different disease types (COAD and STAD) from several different centers, which implies a considerable amount of discordant behavior. The exchangeable structure may therefore be too simple for this situation. With the additional αt parameters introduced in the multiset coefficient structure, we expect discordant behaviors to be reasonably accommodated in our mixture model.
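To make the growth of the component space and the notion of a combination pattern concrete, the following Python sketch enumerates the component combinations for a given number of datasets. It is an illustration only: the names `STATES` and `summarize_components` are hypothetical, and treating the three same-state combinations (all down, all null, all up) as the non-discordant ones is an assumption made here for counting purposes.

```python
from collections import Counter
from itertools import product
from math import comb

STATES = ("down", "null", "up")

def summarize_components(k):
    """Count component combinations for k datasets with three states each."""
    combos = list(product(STATES, repeat=k))
    same_state = [c for c in combos if len(set(c)) == 1]
    # A pattern ignores dataset order: it records only how many datasets are
    # down, null and up (a multiset of size k drawn from three states).
    patterns = Counter(tuple(sorted(c)) for c in combos if len(set(c)) > 1)
    return len(combos), len(same_state), patterns

total, n_same_state, patterns = summarize_components(3)
# Number of combinations sharing the pattern "one down, one null, one up".
mixed = patterns[("down", "null", "up")]
# Multiset coefficient for k = 3 minus the three same-state patterns.
expected_patterns = comb(3 + 2, 2) - 3
```

For three datasets this gives 27 combinations, of which 3 are same-state and 24 are discordant; the discordant ones fall into 7 patterns, and 6 of them share the pattern with one down, one null and one up. For seven datasets there are 2187 combinations but only 33 discordant patterns. How these patterns map to the proportion parameters of each structure is defined in the Appendix and is not reproduced in this sketch.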
Since we focus on the concordant component in our model reduction strategy, we expect to achieve more efficient concordant integrative analysis results. In our previous study, we applied the general mixture model to three microarray datasets for lung cancer studies. Using the updated collection of gene sets from the Molecular Signatures Database, by comparing the FDR curves, we showed that more pathways (or gene sets) and also more genes were detected by the reduced mixture model with the exchangeable structure.
Our mixture models are based on the transformed differential expression test P-values (z-scores). Although we have illustrated our methods with three early microarray gene expression datasets, they can also be applied directly to the recent gene-level RNA-seq data. We further demonstrated the advantage of incorporating the concordance feature through a comparison study based on The Cancer Genome Atlas (TCGA) RNA sequencing (RNA-seq) data. (For the analysis of exon-level RNA-seq data, many genes have isoforms and this must be considered in the analysis. Therefore, for our future research, we are interested in extending our current methods to this more complicated analysis scenario.)
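Because the models take z-scores as input, any platform that yields a differential expression test P-value can feed them. The sketch below shows one common transformation in Python. It assumes a one-sided P-value for up-regulation and the mapping z = Φ⁻¹(1 − p); this convention, the clipping bounds and the function name `p_to_z` are assumptions made for illustration and are not taken from this section.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def p_to_z(p_up, lower=1e-300, upper=1.0 - 1e-12):
    """Map a one-sided P-value for up-regulation to a z-score.

    Uses z = -Phi^{-1}(p), which equals Phi^{-1}(1 - p) by symmetry but keeps
    precision for very small p. P-values are clipped so that p = 0 or p = 1
    does not produce an infinite score.
    """
    p = min(max(p_up, lower), upper)
    return -_STD_NORMAL.inv_cdf(p)

# Hypothetical one-sided P-values for three genes.
z_scores = [p_to_z(p) for p in (0.001, 0.5, 0.999)]
```

Under this convention, small P-values map to large positive z-scores (up-regulation), P-values near 1 map to large negative z-scores (down-regulation), and P-values near 0.5 map to z-scores near zero (no differential expression), which corresponds to the three normal components of the mixture model.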
The motivation of this study is to make the concordant integrative analysis feasible for many datasets. As explained in the manuscript, the parameter space of the general mixture model increases exponentially with the number of datasets. Therefore, even when many genome-wide expression datasets are available, it may not be feasible to conduct a concordant integrative analysis because of computation and estimation limitations. In this study, we demonstrated two reduced models, each of which can be used in a concordant integrative analysis of many genome-wide expression datasets.
The first application was based on three datasets, so it is feasible to compare the analysis results from the general model with those from the reduced model with the exchangeable structure. Due to space limitations, we could not include the descriptions of the genes and gene sets identified by the different models. However, we have provided a Supplementary File with the lists of genes and gene sets (with false discovery rates included).
The second application was based on seven datasets. It is not feasible to use the general model because of the large parameter space, so we considered the reduced model with the multiset coefficient structure. Due to space limitations, we focused on 14 KEGG pathways and compared our results with the results from the GSA-Fisher method.
In practice, if there is a clear overall genome-wide concordance among multiple datasets, then the impact of the different discordant behaviors may not be strong (see our first application study for an example). In this situation, the reduced mixture model with the exchangeable structure can be considered. However, if there is a considerable amount of discordant behavior, then the reduced mixture model with the multiset coefficient structure can be considered so that different discordant behaviors can be accommodated (see our second application study for an example). Compared to the simple exchangeable structure, more proportion parameters are considered in the multiset coefficient structure. If fewer proportion parameters are preferred (e.g. when G is not relatively large) and the autoregressive pattern can be considered, then the reduced mixture model with the autoregressive structure can be used.
There are many genome-wide expression studies with multiple sample groups. We consider the following two situations. The first situation is that, among multiple sample groups, there is a common reference group and each of the other groups can be compared to it. Then, we will have multiple dependent z-scores for each gene in each dataset. Instead of univariate normal distributions described in our mixture models for two-sample data, multivariate normal distributions can be used in this situation for the consideration of the dependence among different z-scores (for each gene in each dataset with multiple sample groups). When multiple datasets are integrated, the set of proportion parameters will be expanded to accommodate this situation in our mixture models. The second situation is that, among multiple sample groups, there is an increasing or decreasing order and these groups can be listed sequentially. Then, in this situation, we can consider an (increasing/decreasing) order restricted test to generate a single z-score for each gene in each dataset. Our mixture models can still be applied in this situation.
Supplementary Material
Acknowledgements
Y.Lai conceived of the study, developed the methods, performed the statistical analysis and drafted the manuscript; F.Zhang developed the methods, performed the statistical analysis, and helped to draft the manuscript; T.K.Nayak, R.Modarres, N.H.Lee and T.A.McCaffrey helped to draft the manuscript.
Funding
This work was partially supported by the NIH grant GM-092963 (Y.Lai). The publication costs were supported by the Department of Statistics at The George Washington University.
Conflict of Interest: none declared.
References
- Beer D.G. et al. (2002) Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nat. Med., 8, 816–824.
- Bhattacharjee A. et al. (2001) Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. U. S. A., 98, 13790–13795.
- The Cancer Genome Atlas Network (2008) Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature, 455, 1061–1068.
- Chen M. et al. (2013) A powerful Bayesian meta-analysis method to integrate multiple gene set enrichment studies. Bioinformatics, 29, 862–869.
- Choi J.K. et al. (2003) Combining multiple microarray studies and modeling interstudy variation. Bioinformatics, 19, i84–i90.
- Diggle P. et al. (2013) Analysis of Longitudinal Data. 2nd edn. Oxford University Press, Oxford, United Kingdom.
- Edgar R., Barrett T. (2006) NCBI GEO standards and services for microarray data. Nat. Biotechnol., 24, 1471–1472.
- Efron B., Tibshirani R. (2007) On testing the significance of sets of genes. Ann. Appl. Stat., 1, 107–129.
- Garber M.E. et al. (2001) Diversity of gene expression in adenocarcinoma of the lung. Proc. Natl. Acad. Sci. U. S. A., 98, 13784–13789.
- Lai Y. et al. (2007) A mixture model approach to the tests of concordance and discordance between two large scale experiments with two-sample groups. Bioinformatics, 23, 1243–1250.
- Lai Y. et al. (2009) A statistical framework for integrating two microarray data sets in differential expression analysis. BMC Bioinformatics, 10, S23.
- Lai Y. et al. (2014) Concordant integrative gene set enrichment analysis of multiple large-scale two-sample expression data sets. BMC Genomics, 15, S6.
- Lockhart D. et al. (1996) Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat. Biotechnol., 14, 1675–1680.
- Maciejewski H. (2014) Gene set analysis methods: statistical models and methodological differences. Brief. Bioinf., 15, 504–518.
- de Magalhaes J.P. et al. (2009) Meta-analysis of age-related gene expression profiles identifies common signatures of aging. Bioinformatics, 25, 875–881.
- McLachlan G.J., Krishnan T. (2008) The EM Algorithm and Extensions. 2nd edn. John Wiley & Sons, Inc, Hoboken, New Jersey, USA.
- McLachlan G.J. et al. (2006) A simple implementation of a normal mixture approach to differential gene expression in multiclass microarrays. Bioinformatics, 22, 1608–1615.
- Mootha V.K. et al. (2003) PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat. Genet., 34, 267–273.
- Nagalakshmi U. et al. (2008) The transcriptional landscape of the yeast genome defined by RNA sequencing. Science, 320, 1344–1349.
- Ohtsuka T. et al. (2016) Interaction of cytokeratin 19 head domain and HER2 in the cytoplasm leads to activation of HER2-Erk pathway. Sci. Rep., 6, 39557.
- Park S.L. et al. (2014) Pleiotropic associations of risk variants identified for other cancers with lung cancer risk: the PAGE and TRICL consortia. J. Natl. Cancer Inst., 106, dju061.
- Paysan L. et al. (2016) Rnd3 in cancer: a review of the evidence for tumor promoter or suppressor. Mol. Cancer Res., 14, 1033–1044.
- Schena M. et al. (1995) Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science, 270, 467–470.
- Shen K., Tseng G.C. (2010) Meta-analysis for pathway enrichment analysis when combining multiple genomic studies. Bioinformatics, 26, 1316–1323.
- Storey J.D., Tibshirani R. (2003) Statistical significance for genomewide studies. Proc. Natl. Acad. Sci. U. S. A., 100, 9440–9445.
- Subramanian A. et al. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U. S. A., 102, 15545–15550.
- Tang W. et al. (2014) Large-scale genome-wide association studies and meta-analyses of longitudinal change in adult lung function. PLoS One, 9, e100776.
- Tanner S.W., Agarwal P. (2008) Gene Vector Analysis (Geneva): a unified method to detect differentially-regulated gene sets and similar microarray experiments. BMC Bioinformatics, 9, 348.
- Wilhelm B.T. et al. (2008) Dynamic repertoire of a eukaryotic transcriptome surveyed at single-nucleotide resolution. Nature, 453, 1239–1243.