Bioinformatics. 2017 Mar 11;33(14):2140–2147. doi: 10.1093/bioinformatics/btx138

Meta-analytic framework for liquid association

Lin Wang, Silvia Liu, Ying Ding, Shin-sheng Yuan, Yen-Yi Ho, George C Tseng
Editor: Bonnie Berger
PMCID: PMC6044323  PMID: 28334340

Abstract

Motivation

Although coexpression analysis via pair-wise expression correlation is widely used to elucidate gene-gene interactions at the whole-genome scale, many complicated multi-gene regulations require more advanced detection methods. Liquid association (LA) is a powerful tool to detect the dynamic correlation of two gene variables depending on the expression level of a third variable (the LA scouting gene). LA detection from a single transcriptomic study, however, is often unstable and not generalizable due to cohort bias, biological variation and limited sample size. With the rapid development of microarray and NGS technology, LA analysis combining multiple gene expression studies can provide more accurate and stable results.

Results

In this article, we proposed two meta-analytic approaches for LA analysis (MetaLA and MetaMLA) to combine multiple transcriptomic studies. To offset the heavy computational demand, we also proposed a two-step fast screening algorithm for more efficient genome-wide screening: bootstrap filtering and sign filtering. We applied the methods to five Saccharomyces cerevisiae datasets related to environmental changes. The fast screening algorithm reduced running time by 98%. When compared with single-study analysis, MetaLA and MetaMLA provided stronger detection signal and more consistent and stable results. The top triplets are highly enriched in fundamental biological processes related to environmental changes. Our method can help biologists understand underlying regulatory mechanisms under different environmental exposures or disease states.

Availability and Implementation

A MetaLA R package, data and code for this article are available at http://tsenglab.biostat.pitt.edu/software.htm

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Gene co-expression analysis is widely applied to study pairwise gene synchronization and to elucidate potential gene regulatory mechanisms. For example, an unweighted gene co-expression network can be constructed from a transcriptomic study given a co-expression measure (e.g. Pearson correlation) and an edge cut-off (e.g. two nodes are connected if absolute correlation ≥ 0.6 and disconnected if < 0.6). In the literature, different measures such as Pearson correlation, Spearman correlation and mutual information (Butte and Kohane, 2000) have been used (see Song et al., 2012 for a comparative study). Alternatively, Zhang et al. (2005) developed a weighted correlation network analysis (WGCNA) framework using cluster analysis to construct gene co-expression modules and their associated weighted co-expression networks. Network properties and extended pathway analysis can then be studied to investigate disease-related network alterations and mechanisms.
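The unweighted network construction described above can be sketched in a few lines. This is only an illustration under stated assumptions: the gene names and expression values are made up, the helper names `pearson` and `coexpression_edges` are hypothetical, and the 0.6 cut-off is the example value quoted in the text rather than a recommended setting.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def coexpression_edges(expr, cutoff=0.6):
    """Return gene pairs whose absolute correlation is at least `cutoff`.

    `expr` maps a gene name to its expression values across samples.
    """
    genes = sorted(expr)
    edges = []
    for i, g1 in enumerate(genes):
        for g2 in genes[i + 1:]:
            if abs(pearson(expr[g1], expr[g2])) >= cutoff:
                edges.append((g1, g2))
    return edges

# Toy data, invented for illustration only.
toy = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.1, 9.8],  # tracks geneA closely
    "geneC": [5.0, 1.0, 4.0, 2.0, 3.0],  # only weakly related to the others
}
edges = coexpression_edges(toy)
```

Only the geneA/geneB pair passes the cut-off here; a pair whose correlation changes sign with a third gene would be missed by this kind of network, which is the gap LA is meant to address.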

Although the guilt-by-association heuristic assumed in gene co-expression network analysis is widely used in genomics (Wolfe et al., 2005), many complex regulatory mechanisms in the system cannot be readily captured by direct association because of multi-way interactions. The first column in Figure 1A shows an example of liquid association (LA), first described in Li (2002). Genes YCR005C and YPL262W are overall only weakly correlated in study GSE11452 (Spearman correlation = 0.239), but they exhibit a high positive correlation (cor = 0.692) when a third gene, YGR175C, is expressed at a low level (expression intensity < −0.424) and a strongly negative correlation (cor = −0.790) when expression of YGR175C is high (>0.441). This simple interaction among the trio is biologically meaningful since the third gene YGR175C may serve as a surrogate of a certain (hidden) cellular state or regulator that controls the presence and absence of co-regulation between genes YCR005C and YPL262W.

Fig. 1.

The scatter plot of the gene expressions in the high and low bins. (A) is for the triplet selected by GSE11452 through singleMLA and (B) is for the triplet selected by the studies GSE11452, Causton and Gasch through MetaMLA (Color version of this figure is available at Bioinformatics online.)

To quantify the conditional association among the triplet genes, Li (2002) proposed an LA measure that captures the dynamic correlation of two variables depending on a third variable (Ho et al., 2011; Li, 2002; Li et al., 2004). Li (2002) introduced this concept and proposed a computationally efficient three-product-moment measure (see Section 2.2). Zhang et al. (2007) adopted a simplified LA score based on z-transformed Pearson correlation conditional on discretized expression of the third gene. Ho et al. (2011) extended the trivariate dependency structure into a parametric Gaussian framework (called modified LA; MLA) to develop improved estimation frameworks and a statistical test for the existence of the LA dependence. The computational complexity of screening all possible triplets is $O(n^3)$ in the number of genes $n$, which is generally too high for applying LA methods at a genome-wide scale. Gunderson and Ho (2014) introduced an efficient screening algorithm, fastLA, for the MLA method containing two steps: (i) screening the candidate triplets by the difference between the correlations of the LA pair when the scouting gene is high and low; (ii) fitting and estimating the model based on conditional normal distributions. The algorithm greatly improved the computing efficiency for genome-wide LA analysis.

LA estimated from a single study is often unstable and not generalizable due to cohort bias, biological variation and sample size limitation. With the rapid accumulation of transcriptomic studies in the public domain, identifying LA triplets by combining multiple studies is likely to produce more stable and biologically reproducible results. For example, Figure 1A shows an LA triplet (genes YCR005C, YPL262W and YGR175C) where the LA is statistically significant in the first yeast study GSE11452 but does not hold in the remaining four independent studies. Such an association is likely condition-specific for the first study or a false positive. On the other hand, the LA triplet (YGR264C, YOR197W and YDR519W) in Figure 1B is obtained from the combined meta-analysis of the first three studies. The association is more likely to validate in the fourth and fifth studies. In this article, we develop two meta-analytic frameworks for LA to accurately identify LA triplets that are consistent across multiple studies. The results show that meta-analytic methods generate more stable LA triplets that are more reproducible in independent studies. The LA triplets also yield better pathway enrichment results, which help to interpret the underlying biology and to generate further hypotheses.

2 Materials and methods

2.1 Datasets and databases

We used five yeast (Saccharomyces cerevisiae) datasets, namely Causton (Causton et al., 2001), Gasch (Gasch et al., 2000), Rosetta (Hughes et al., 2000), GSE60613 (Chasman et al., 2014) and GSE11452 (Knijnenburg et al., 2009), to illustrate our meta-analytic methods. In each study, yeast samples are exposed to a variety of environmental stresses and the transcriptomic expression profiles are measured. Causton et al. includes a yeast gene expression series of yeasts treated with acid, alkali, heat, hydrogen peroxide, salt and sorbitol, and during diauxic shift; Gasch et al. contains yeasts treated with amino acid starvation, diamide, DTT, exposure to peroxide, menadione, nitrogen depletion, osmolarity and temperature shifts; Rosetta corresponds to 300 diverse mutations and chemical treatments; GSE60613 analyzes the stress-activated signaling network; GSE11452 corresponds to chemostat cultures under 55 different conditions. As shown in the data preprocessing step in Figure 2, within each individual study we first deleted genes and samples with >10 and 30% missing values, respectively, imputed the missing values with the K-nearest neighbors algorithm (Altman, 1992), and quantile normalized the samples (Amaratunga and Cabrera, 2001). We further performed unbiased filtering within each study to filter out non-expressed genes (lowest 35% of mean expression) and non-informative genes (lowest 35% of expression variances). Finally, our datasets include 1770 genes shared across the five studies and 45, 173, 300, 67 and 170 samples for studies Causton, Gasch, Rosetta, GSE60613 and GSE11452, respectively.

Fig. 2.

A process map of the genome-wide application of the MetaMLA algorithm

As an in silico biological evaluation of the LA triplets, we downloaded the yeast protein–protein interaction (PPI) database from the Saccharomyces Genome Database (Cherry et al., 2011). The database included 101 325 unique PPI pairs involving 5706 genes. We applied pathway enrichment analysis on two databases, gene ontology (GO) (Cherry et al., 2011) and Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa et al., 2016), and obtained 1398 GO terms and 95 KEGG pathways with at least five genes. Additionally, to test how co-regulated genes are enriched in transcription factor (TF)-binding data, we downloaded TF-binding gene sets from the YEASTRACT database (Teixeira et al., 2013), and 96 gene sets with 5–200 validated genes were selected for further enrichment analysis. Fisher’s exact test (Upton, 1992) was used for pathway enrichment analysis. The P-values were corrected by the Benjamini-Hochberg (BH) algorithm (Benjamini and Hochberg, 1995) and the significance level was set to α = 0.05.
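As a rough illustration of the enrichment test mentioned above, the following sketch computes a one-sided Fisher's exact P-value from the hypergeometric distribution. The one-sided (over-representation) form is an assumption, since the text does not state which alternative was used, and the function name and toy counts are hypothetical.

```python
from math import comb

def enrichment_pvalue(n_universe, n_pathway, n_selected, n_overlap):
    """One-sided Fisher's exact test for over-representation.

    Returns P(X >= n_overlap), where X is the number of pathway genes
    among `n_selected` genes drawn without replacement from a universe
    of `n_universe` genes that contains `n_pathway` pathway genes.
    """
    denom = comb(n_universe, n_selected)
    upper = min(n_pathway, n_selected)
    numer = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return numer / denom

# Toy counts, invented for illustration: 3 of the 3 selected genes fall
# in a 3-gene pathway within a 10-gene universe.
p_toy = enrichment_pvalue(n_universe=10, n_pathway=3, n_selected=3, n_overlap=3)
```

In practice the resulting P-values across all gene sets would then be adjusted for multiple testing, for example with the BH procedure cited in the text.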

2.2 LA methods (LA and MLA) for a single study

Li (2002) introduced the concept of ‘LA’ and defined the LA score for a gene pair $X_1$ and $X_2$ given a scouting gene $X_3$ as $LA(X_1, X_2 \mid X_3) = E[g'(X_3)]$, where $g(x_3) = E(X_1 X_2 \mid X_3 = x_3)$ and $g'(x)$ is the first derivative of $g(x)$. After standardizing the three gene expressions to fit the Gaussian assumption and applying Stein’s lemma, they proposed a computationally efficient estimator $\widehat{LA} = \sum_{l=1}^{n} X_{1l} X_{2l} X_{3l} / n$, where $n$ is the total number of observations (samples) and $X_{1l}$, $X_{2l}$ and $X_{3l}$ are the $l$th observations for genes $X_1$, $X_2$ and $X_3$, respectively.
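A minimal sketch of the three-product-moment estimator is given below. It assumes plain z-score standardization (mean 0, variance 1) rather than a full normal-score transformation, and the function names and toy vectors are hypothetical.

```python
import math

def standardize(x):
    """Center to mean 0 and scale to (population) variance 1."""
    n = len(x)
    m = sum(x) / n
    sd = math.sqrt(sum((v - m) ** 2 for v in x) / n)
    return [(v - m) / sd for v in x]

def la_score(x1, x2, x3):
    """Three-product-moment LA estimate: mean of z1 * z2 * z3 over samples."""
    z1, z2, z3 = standardize(x1), standardize(x2), standardize(x3)
    return sum(a * b * c for a, b, c in zip(z1, z2, z3)) / len(z1)

# Toy data, invented for illustration: x1 and x2 move in opposite
# directions while x3 is low and in the same direction while x3 is high.
x1 = [1, -1, 1, -1, 1, -1, 1, -1]
x2 = [-1, 1, -1, 1, 1, -1, 1, -1]
x3 = [-1, -1, -1, -1, 1, 1, 1, 1]
la = la_score(x1, x2, x3)
```

Because the estimator is a single product moment, it is symmetric in the three genes and does not by itself say which gene acts as the scouting gene.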

Ho et al. (2011) proposed an MLA method by $MLA(X_1, X_2 \mid X_3) = E[h'(X_3)]$, where $h(X_3) = \rho(X_1, X_2 \mid X_3)$, $h'(x)$ is the first derivative of $h(x)$, and $\rho$ is the Pearson correlation coefficient. They proposed a direct estimation of the MLA score by $\widehat{MLA} = \sum_{j=1}^{M} \hat{\rho}_j \bar{X}_{3j} / M$, where $M$ is the number of bins over $X_3$, $\bar{X}_{3j}$ is the sample mean of $X_3$ within bin $j$, and $\hat{\rho}_j$ is the correlation of the LA pair $X_1$ and $X_2$ in bin $j$. A key advantage of the MLA estimator is the capability of performing hypothesis testing of $H_0: MLA(X_1, X_2 \mid X_3) = 0$ by a Wald test statistic $T_{MLA} = \widehat{MLA} / SE(\widehat{MLA})$ to assess the P-value, where $SE(\widehat{MLA})$ is the standard error of $\widehat{MLA}$.
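The direct MLA estimate can be sketched as follows. This is an illustration only: it assumes equal-sized bins formed by sorting the scouting gene, assumes the inputs are already standardized, and omits the standard error needed for the Wald test. The function names and toy values are hypothetical.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mla_score(x1, x2, x3, n_bins=3):
    """Direct MLA estimate: average over bins of rho_j * mean(x3 in bin j)."""
    order = sorted(range(len(x3)), key=lambda i: x3[i])
    size = len(order) // n_bins
    total = 0.0
    for j in range(n_bins):
        start = j * size
        stop = (j + 1) * size if j < n_bins - 1 else len(order)
        idx = order[start:stop]
        rho_j = pearson([x1[i] for i in idx], [x2[i] for i in idx])
        x3_bar_j = sum(x3[i] for i in idx) / len(idx)
        total += rho_j * x3_bar_j
    return total / n_bins

# Toy data, invented for illustration: the pair correlation is -1 in the
# low bin of x3, 0.5 in the middle bin and +1 in the high bin.
x1 = [1, 2, 3, 1, 2, 3, 1, 2, 3]
x2 = [3, 2, 1, 1, 3, 2, 1, 2, 3]
x3 = [-1.2, -1.0, -0.8, -0.1, 0.0, 0.1, 0.8, 1.0, 1.2]
mla = mla_score(x1, x2, x3)
```

With bin means of −1, 0 and 1, the toy estimate is ((−1)(−1) + 0.5 · 0 + 1 · 1) / 3 = 2/3, so a correlation that rises with the scouting gene yields a positive score.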

2.3 MetaMLA and MetaLA methods

In this section, we extend the original three-product-moment LA method (Li, 2002) and the model-based MLA method (Gunderson and Ho, 2014; Ho et al., 2011) into a meta-analytic scheme for combining information from multiple transcriptomic studies.

Suppose that we have $K$ studies. For a gene triplet $t$: $(X_1, X_2, X_3)$, if the LA scouting gene is $Z = X_i$ ($i = 1, 2, 3$), after standardizing all three genes to have mean 0 and variance 1 and the scouting gene to follow a normal distribution, the direct estimation of the MLA score (Ho et al., 2011) for the single study $k$ ($k = 1, \ldots, K$) is defined as $\widehat{MLA}_t^{(k,i)} = \sum_{j=1}^{M} \hat{\rho}_{t,j}^{(k,i)} \bar{z}_{t,j}^{(k,i)} / M$, where $M$ is the number of bins, $\hat{\rho}_{t,j}^{(k,i)}$ is the sample Pearson correlation coefficient of the LA pair in bin $j$ when the scouting gene $Z$ is $X_i$ in triplet $t$, and $\bar{z}_{t,j}^{(k,i)}$ is the mean of $Z$ in bin $j$. The test statistic for single study $k$ is $T_{MLA,t}^{(k,i)} = \widehat{MLA}_t^{(k,i)} / SE(\widehat{MLA}_t^{(k,i)})$, where $SE(\widehat{MLA}_t^{(k,i)})$ is the standard error of $\widehat{MLA}_t^{(k,i)}$ for $k = 1, \ldots, K$ and $i = 1, 2, 3$. The MetaMLA statistic combines the individual study MLA statistics $T_{MLA,t}^{(k,i)}$ and is defined as $mMLA_t^{(i)} = \bar{T}_{MLA,t}^{(i)} / (s_t^{(i)} + s_0)$, where $\bar{T}_{MLA,t}^{(i)}$ and $s_t^{(i)}$ are the sample mean and SD of $\{T_{MLA,t}^{(k,i)}, k = 1, 2, \ldots, K\}$, respectively. $s_t^{(i)}$ provides standardization according to the variance of MLA scores across studies. $s_0$ is a fudge parameter to avoid obtaining a large mMLA score caused by very small $s_t^{(i)}$ values, which happens frequently in genome-wide screening. In our yeast datasets, suppose $N$ is the total number of triplets for the hypothesis testing. We choose $s_0$ to be $10 \times \mathrm{med}\{s_t^{(i)}, i = 1, 2, 3 \text{ and } t = 1, \ldots, N\}$ (where med(·) means the median) to guarantee the stability of the test statistics, especially when sample size is small. Standardizing by the standard error in the $T_{MLA,t}^{(k,i)}$ score accounts for both sample size and sample heterogeneity effects in single studies. For a study of large sample size, the SD of the MLA score is usually smaller and thus generates a larger $T_{MLA,t}^{(k,i)}$ score. For a study containing large biological variation or considerable outliers in samples, the SD of the MLA score is large and results in a smaller $T_{MLA,t}^{(k,i)}$ score.
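The effect of the across-study SD and the fudge parameter can be seen in a short sketch. It assumes the sample SD uses the usual n − 1 denominator, and the per-study statistics below are invented so that both triplets share the same mean; the function names are hypothetical.

```python
import statistics

def fudge_parameter(stats_per_triplet, factor=10.0):
    """s0 = factor times the median of the across-study SDs."""
    sds = [statistics.stdev(s) for s in stats_per_triplet]
    return factor * statistics.median(sds)

def meta_statistic(study_stats, s0):
    """Mean of the per-study statistics divided by (across-study SD + s0)."""
    return statistics.mean(study_stats) / (statistics.stdev(study_stats) + s0)

# Invented per-study test statistics for two triplets in K = 5 studies.
consistent = [2.0, 2.5, 3.0, 2.5, 2.0]      # same direction in every study
inconsistent = [4.0, -3.0, 5.0, -2.0, 8.0]  # same mean, large disagreement

s0 = fudge_parameter([consistent, inconsistent])
m_consistent = meta_statistic(consistent, s0)
m_inconsistent = meta_statistic(inconsistent, s0)
```

Both toy triplets have a mean statistic of 2.4, yet the consistent one receives the larger meta score because its across-study SD is smaller. The same function applies unchanged to per-study LA scores, which gives the MetaLA form.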

The MetaLA statistic can be defined similarly to the MetaMLA statistic. The estimation of the LA score (Li, 2002) for the single study $k$ ($k = 1, \ldots, K$) is defined as $\widehat{LA}_t^{(k)} = \sum_{l=1}^{n_k} X_{1l}^{(k)} X_{2l}^{(k)} X_{3l}^{(k)} / n_k$, where $n_k$ is the total number of observations (samples) and $X_{1l}^{(k)}$, $X_{2l}^{(k)}$ and $X_{3l}^{(k)}$ are the $l$th observations for genes $X_1$, $X_2$ and $X_3$ in study $k$, respectively. The MetaLA statistic combines the individual study LA scores $\widehat{LA}_t^{(k)}$ and is defined as $mLA_t = \overline{\widehat{LA}_t^{(k)}} / (s_t + s_0)$, where $\overline{\widehat{LA}_t^{(k)}}$ and $s_t$ are the sample mean and SD of $\{\widehat{LA}_t^{(k)}, k = 1, 2, \ldots, K\}$, respectively. $s_t$ provides standardization according to the variance of LA scores across studies. $s_0$ is a fudge parameter to avoid obtaining a large mLA score caused by very small $s_t$ values.

2.4 Hypothesis testing and inference for MetaMLA and MetaLA

Based on MetaMLA, the hypothesis for LA in the gene triplet t: (X1,X2,X3) is

$H_0: mMLA_t^{(i)} = 0, \ \forall i \in \{1, 2, 3\}$
$H_1: \exists\, i \in \{1, 2, 3\}, \ \text{s.t.} \ MLA_t^{(i)} \neq 0,$

where $i = 1, 2, 3$ corresponds to the LA scouting gene $Z = X_i$. The null hypothesis represents all zero LAs no matter which one of $X_1$, $X_2$ and $X_3$ acts as the scouting gene $Z$. The test statistic is defined as

$T_t = \max_{i = 1, 2, 3} |mMLA_t^{(i)}|.$

The distribution of $T_t$ under the null hypothesis can be obtained by randomly permuting the samples of the LA scouting gene $Z$ when calculating each $mMLA_t^{(i)}$ in the $T_t$ statistic. We repeat the permutation $B$ times and use the resulting $B \times N$ permuted values $T_t^{(b)}$ ($1 \le b \le B$, $1 \le t \le N$) as the null distribution. The P-value is given by $P = \sum_{b=1}^{B} \sum_{t=1}^{N} I(T_t^{(b)} \ge T_{obs}) / (B \times N)$, where $T_{obs}$ is the observed value of the test statistic. The P-values are corrected by the BH algorithm (Benjamini and Hochberg, 1995) and the false discovery rate (FDR) threshold is set to α = 0.01. Since the number of possible triplets $N$ is usually very large, a small $B$ suffices, and $B = 40$ is used in this article. We note that theoretically we should perform permutation for each triplet to form its own null distribution. The computation (number of permutations × number of triplets) is, however, not feasible. In our approach, we imposed an assumption of a common null distribution across all triplets to allow affordable computation.
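The pooled-null P-value and the BH adjustment described above can be sketched as follows. The observed and permuted statistics are invented, and the function names are hypothetical; the sketch only illustrates the counting step, not the permutation of expression data itself.

```python
import bisect

def pooled_pvalues(observed, null_values):
    """P-value per observed statistic: fraction of pooled null values >= it."""
    null_sorted = sorted(null_values)
    total = len(null_sorted)
    pvals = []
    for t_obs in observed:
        n_ge = total - bisect.bisect_left(null_sorted, t_obs)
        pvals.append(n_ge / total)
    return pvals

def benjamini_hochberg(pvals):
    """BH-adjusted P-values (q-values), returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    qvals = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P-value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        qvals[i] = running_min
    return qvals

# Invented values: 10 pooled permuted statistics and 3 observed ones.
null_pool = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
observed = [0.95, 0.35, 0.05]
pvals = pooled_pvalues(observed, null_pool)
qvals = benjamini_hochberg(pvals)
```

Sorting the pooled null once and using binary search keeps the cost per triplet logarithmic in B × N, which matters when N is in the millions.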

Based on MetaLA, the hypothesis for LA in the gene triplet $t$: $(X_1, X_2, X_3)$ is $H_0: mLA_t = 0$ versus $H_1: mLA_t \neq 0$. The test statistic can be defined as $T_t = |mLA_t|$. The distribution of $T_t$ under the null hypothesis can be obtained by randomly permuting the samples inside gene $X_1$, $X_2$ or $X_3$ in turn. We repeat the permutation $B$ times and use the resulting $B \times N$ permuted values $T_t^{(b)}$ ($1 \le b \le B$, $1 \le t \le N$) as the null distribution. The P-value is given by $P = \sum_{b=1}^{B} \sum_{t=1}^{N} I(T_t^{(b)} \ge T_{obs}) / (B \times N)$, where $T_{obs}$ is the observed value of the test statistic. The P-values are corrected by the BH algorithm (Benjamini and Hochberg, 1995) and the FDR threshold is set to α = 0.01. Similar to MetaMLA, $B = 40$ is used.

2.5 Filtering to reduce computation of MetaMLA

Genome-wide calculation of the LA is usually time-consuming and resource-intensive for a single study (Ho et al., 2011; Li, 2002). This problem is further aggravated when combining multiple studies. In this section, we develop a screening algorithm to perform a genome-wide MetaMLA analysis with higher efficiency. As illustrated in Figure 2, our algorithm reduces the number of triplets that need to be examined in depth through two screening steps: bootstrap filtering and sign filtering.

In the first bootstrap filtering step, we filter out triplets with a small correlation difference between the high and low bins. Define $\rho_{diff}$ to be the difference of the LA pair correlations when the scouting gene is in the highest and lowest bins. In the literature, the fastLA algorithm for a single study (Gunderson and Ho, 2014) has used a similar screening procedure for fast computing. In meta-analysis, we aim to detect triplets with consistently large or consistently small LAs across multiple studies. For the triplet $t$: $(X_1, X_2, X_3)$, given the scouting gene $Z = X_i$ ($i = 1, 2, 3$), we define $\rho_{diff,t}^{(k,i)} = \rho_{high,t}^{(k,i)} - \rho_{low,t}^{(k,i)}$, where $\rho_{high,t}^{(k,i)}$ and $\rho_{low,t}^{(k,i)}$ are the Pearson correlations when gene $Z$ is in the high and low bins of study $k$, respectively. We use the score $\sum_{k=1}^{K} |\rho_{diff,t}^{(k,i)}| / K$ as the meta-filtering criterion. Since the scouting gene $Z$ could be $X_1$, $X_2$ or $X_3$, we use $\max_{i=1,2,3} \left( \sum_{k=1}^{K} |\rho_{diff,t}^{(k,i)}| \right) / K$ to order and filter out triplets that are unlikely to have LA. To avoid outlier effects when calculating correlations in the bins, we propose to bootstrap (Efron and Tibshirani, 1986) samples in each study $B$ times and get $\rho_{diff,t}^{(meta,b)} = \max_{i=1,2,3} \sum_{k=1}^{K} |\rho_{diff,t}^{(k,i,b)}| / K$, where $b = 1, 2, \ldots, B$. Finally, we can use

$\rho_{diff,t}^{(meta)} = \mathrm{med}(\rho_{diff,t}^{(meta,b)}, b = 1, \ldots, B)$

to screen the triplets, where med(·) means taking the median. We set $\rho_{diff,t}^{(meta)} > 0.4$ as the cutoff to keep the triplets for further testing. Using $\rho_{diff,t}^{(meta)}$ can largely reduce computational complexity for two reasons: (i) calculating $\rho_{diff,t}^{(meta)}$ is computationally much simpler than the MetaMLA statistic; (ii) $\rho_{diff,t}^{(meta)}$ can filter out a large percentage of triplets and further reduce the computational cost of P-value calculation in the permutation step.
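The bootstrap filtering score can be sketched as below. Several details are assumptions made for illustration: equal-sized bins obtained by sorting the scouting gene, a correlation of 0 returned when a resampled bin has no variance, and hypothetical function names and toy data.

```python
import math
import random
import statistics

def pearson(x, y):
    """Sample Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    den = math.sqrt(sxx * syy)
    return sxy / den if den > 0 else 0.0

def rho_diff(x1, x2, z, n_bins=3):
    """Correlation of (x1, x2) in the high bin of z minus that in the low bin."""
    order = sorted(range(len(z)), key=lambda i: z[i])
    size = len(order) // n_bins
    low, high = order[:size], order[-size:]
    r_low = pearson([x1[i] for i in low], [x2[i] for i in low])
    r_high = pearson([x1[i] for i in high], [x2[i] for i in high])
    return r_high - r_low

def meta_rho_diff(studies):
    """Max over scouting genes of the mean absolute rho_diff across studies.

    `studies` is a list with one (gene1, gene2, gene3) tuple per study.
    """
    best = 0.0
    for i in range(3):
        total = 0.0
        for genes in studies:
            pair = [genes[j] for j in range(3) if j != i]
            total += abs(rho_diff(pair[0], pair[1], genes[i]))
        best = max(best, total / len(studies))
    return best

def bootstrap_meta_rho_diff(studies, n_boot=20, seed=0):
    """Median of meta_rho_diff over bootstrap resamples of each study."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_boot):
        resampled = []
        for genes in studies:
            n = len(genes[0])
            idx = [rng.randrange(n) for _ in range(n)]
            resampled.append(tuple([g[i] for i in idx] for g in genes))
        values.append(meta_rho_diff(resampled))
    return statistics.median(values)

# Toy study, invented for illustration (9 samples, 3 genes).
study = (
    [1, 2, 3, 1, 2, 3, 1, 2, 3],
    [3, 2, 1, 1, 3, 2, 1, 2, 3],
    [-1.2, -1.0, -0.8, -0.1, 0.0, 0.1, 0.8, 1.0, 1.2],
)
score_no_boot = meta_rho_diff([study])
score_boot = bootstrap_meta_rho_diff([study, study], n_boot=15, seed=1)
```

A triplet would be kept when the bootstrap median exceeds the chosen cutoff (0.4 in the text). Only two bin correlations per study are needed, which is what makes this step cheap relative to the full MetaMLA statistic.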

In the second sign filtering step, we filter out triplets with inconsistent signs of test statistics among meta and singleMLA. The scouting gene is chosen to maximize the test statistic of MetaMLA. In other words, we keep the triplets satisfying $\prod_{k=1}^{K} I\left(\mathrm{sign}(mMLA_t^{(i_0)}) \cdot \mathrm{sign}(T_{MLA,t}^{(k,i_0)}) = 1\right) = 1$, where $I(\cdot)$ is the indicator function and $i_0 = \arg\max_{i=1,2,3} |mMLA_t^{(i)}|$. For fair comparison, we use the same triplets filtered by MetaMLA to perform MetaLA and single-study MLA.
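A small sketch of the sign filter follows. It reads the criterion as requiring every single-study statistic to share the sign of the meta statistic for the selected scouting gene; that reading, the function names and the toy numbers are assumptions for illustration.

```python
def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def passes_sign_filter(meta_by_scout, study_stats_by_scout):
    """Keep a triplet only if all studies agree in sign with the meta statistic.

    `meta_by_scout[i]` is the meta statistic with gene i as scouting gene;
    `study_stats_by_scout[i]` lists the K single-study statistics for gene i.
    """
    i0 = max(range(3), key=lambda i: abs(meta_by_scout[i]))
    s = sign(meta_by_scout[i0])
    return s != 0 and all(sign(t) == s for t in study_stats_by_scout[i0])

# Invented statistics for one triplet in K = 3 studies.
meta = [0.5, -2.0, 1.0]
per_study_agree = [[1.0, 1.0, 1.0], [-1.5, -0.2, -3.0], [1.0, -1.0, 1.0]]
per_study_mixed = [[1.0, 1.0, 1.0], [-1.5, 0.2, -3.0], [1.0, -1.0, 1.0]]

keep_agree = passes_sign_filter(meta, per_study_agree)
keep_mixed = passes_sign_filter(meta, per_study_mixed)
```

Only the statistics for the selected scouting gene are checked, so disagreement under the other two scouting genes does not remove a triplet.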

3 Results

3.1 Computational reduction by filtering

Below we describe the screening results that avoid the high computational load of evaluating all possible triplets in MetaMLA. After unbiased filtering of non-expressed and non-informative genes, we kept 1770 genes, which led to a total of $\binom{1770}{3} \approx 9.23 \times 10^{8}$ triplets. The computing time is demanding if we perform hypothesis testing for all possible triplets. By applying bootstrap filtering with $\rho_{diff,t}^{(meta)} > 0.4$ and three bins, the number of triplets was reduced to $2.18 \times 10^{7}$, ∼2.36% of the original total. Furthermore, the sign filtering step decreased the number of remaining triplets to $1.21 \times 10^{7}$, which was only 1.32% of the total.
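The triplet counts above can be checked with a few lines of arithmetic. The retained counts are the rounded values quoted in the text, so the percentages computed here are approximate.

```python
from math import comb

n_genes = 1770
total_triplets = comb(n_genes, 3)  # number of unordered gene triplets

after_bootstrap = 2.18e7  # rounded count quoted in the text
after_sign = 1.21e7       # rounded count quoted in the text

pct_bootstrap = 100 * after_bootstrap / total_triplets
pct_sign = 100 * after_sign / total_triplets
```

From these rounded counts the retained fractions come out at about 2.36% and 1.31%; the latter differs slightly from the quoted 1.32%, presumably because the quoted count is rounded.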

Given that our screening pipeline dramatically reduces the number of triplets, we assessed whether the filtering procedures discarded statistically significant LA triplets. We performed MetaMLA both on all $9.23 \times 10^{8}$ triplets and on the 1.32% of triplets retained after filtering. As shown in Supplementary Table S1, our screening steps missed only 89, 219, 375, 520 and 690 of the top 2000, 4000, 6000, 8000 and 10 000 triplets obtained from the full analysis without filtering. P-values from Fisher’s exact test are almost 0 and odds ratios are between 1000 and 1600 (Supplementary Table S1). In summary, we missed only about 4–7% of the top triplets but saved almost 99% of computing time, making genome-wide LA triplet screening possible. Since the filtering step also consumes computing time, we compared the computing time of analyses with and without filtering on a small dataset of 95 genes (selected with stringent criteria by removing genes with small means and small variances). Using five computing threads (Intel Xeon E7-2850), filtering saved about 88% of computing time (16.3 versus 134.6 min).

In general, filtering out potentially non-significant triplets will gain statistical power (Bourgon et al., 2010; van Iterson et al., 2010). In other words, we can detect more significant triplets under the same FDR control. To demonstrate the empirical effect of filtering in real data, we randomly selected 500 genes from the five Yeast studies and re-ran our MetaMLA algorithms by both filtering and non-filtering pipelines. Supplementary Figure S1 shows that for a given reasonable FDR (e.g. 0.005 and 0.01), filtering pipeline can detect more significant triplets than full studies as we expected.

3.2 MetaMLA detects more over-represented pathways

We performed pathway enrichment analysis using GO and KEGG for all the genes from the top $m$ significant triplets ($m = 200, 300, \ldots, 1000$) selected by single-study MLA, MetaMLA and MetaLA. Figure 3 shows the numbers of enriched GO terms and KEGG pathways for different numbers of top triplets under an FDR = 0.05 threshold. MetaMLA (solid square line) consistently performed better than any single-study MLA (five dashed lines) and MetaLA (solid rhombus line) by detecting more enriched pathways. Jitter plots of q-values of the GO terms and KEGG pathways for the top 500 triplets on a minus log 10 scale are further shown in Supplementary Figure S2. Since the single-study MLA and MetaMLA methods can identify the LA scouting gene Z, similar pathway enrichment analyses were done using only the Z genes from the top triplets (Supplementary Figs S3 and S4).

Fig. 3.

The number of enriched gene sets for all the genes from different numbers of top triplets detected by meta and single analysis. (A) is for GO terms and (B) is for KEGG pathways (Color version of this figure is available at Bioinformatics online.)

3.3 MetaMLA provides more consistent biomarker and pathway results with single study analyses

Figure 1A shows an example in which LA is present in the first study (the correlation drops from 0.692 in the low-expression group to −0.790 in the high-expression group of the LA scouting gene YGR175C) but fails to reproduce in the remaining four studies. Such an LA with failed reproducibility is likely a false positive. Figure 1B demonstrates another example with consistent LA in all five studies (the correlation differs markedly between the high and low expression groups of YDR519W). To inspect the agreement of top LA triplets between pairs of studies, Supplementary Figures S5 and S6 show scatter plots of test statistics and rank correlations of the pairwise top 1000 triplets. The MetaMLA method combines information from all single studies. Conceptually, MetaMLA results should agree better with each single-study MLA than the single-study MLA results agree with one another. In Figure 4A, we examined the pairwise overlap of the top 1000 triplets detected by the five single-study MLA analyses and by MetaMLA. The result shows zero overlap among the single-study MLA top triplets. (We also tried other numbers of top triplets in Supplementary Fig. S7, and they all show small overlap among single studies.) On the other hand, top triplets from MetaMLA have a much higher percentage of overlap with results from each single-study MLA.

Fig. 4.

Overlap of meta and single analysis. (A) is for the number of overlapped triplets for the top 1000 significant triplets; (B) is for the number of overlapped enriched GO terms using all the genes from top 500 triplets for gene set enrichment analysis; (C) is for the number of overlapped enriched KEGG pathways using all the genes from top 500 triplets for gene set enrichment analysis (Color version of this figure is available at Bioinformatics online.)

We next calculated the number of overlaps of enriched GO terms and KEGG pathways when we used all the genes from the top 500 triplets from each MLA analysis for pathway enrichment. The results are shown in Figure 4B and C. Numbers on the diagonal cells demonstrate the number of enriched GO or KEGG pathways from each single-study MLA and MetaMLA. (Similarly, overlapped pathways by only Z genes from the top 1000 triplets are shown in Supplementary Fig. S8). Similar to overlapped triplets in Figure 4A, we observed much higher overlapped pathways between the MetaMLA result and each single-study MLA result than results between pair-wise single-study MLA. For example, study Causton detected 27 enriched GO terms, among which 8, 9, 6, and 9 pathways overlapped with results from the other four single-study MLA. Notably, it has 12 and 15 GO terms overlapped with MetaLA and MetaMLA. Comparing the two meta-analytic methods, MetaMLA performed much better than MetaLA.

3.4 MetaLA and MetaMLA provide more stable results

Below we apply subsampling and bootstrap techniques to compare the stability of LA triplets detected by single-study MLA, MetaLA and MetaMLA. Figure 5A and B show the number of overlapped triplets between the top triplets detected from the original full dataset and from subsampled datasets (90 and 80%, respectively). The number of top triplets is shown on the x-axis and the number of overlapping triplets on the y-axis. The result shows much better reproducibility of top triplets detected from subsampled data in MetaMLA (solid square line) and MetaLA (solid rhombus line) compared to single-study MLA (five dashed lines). The comparison with bootstrapped data in Figure 5C shows a similar trend. In summary, MetaMLA provides better stability in detecting top LA triplets when compared to single-study MLA. MetaLA further outperforms MetaMLA.
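The overlap measure used in this stability comparison can be sketched as follows. The triplet labels and scores are invented, ranking by absolute score is an assumption, and the function names are hypothetical.

```python
import random

def top_m(scores, m):
    """Set of the m triplets with the largest absolute score."""
    ranked = sorted(scores, key=lambda t: abs(scores[t]), reverse=True)
    return set(ranked[:m])

def overlap_at(full_scores, sub_scores, m):
    """Number of triplets shared by the two top-m lists."""
    return len(top_m(full_scores, m) & top_m(sub_scores, m))

def subsample_indices(n_samples, proportion, seed=0):
    """Sorted indices of a random subsample without replacement."""
    rng = random.Random(seed)
    k = int(round(n_samples * proportion))
    return sorted(rng.sample(range(n_samples), k))

# Invented scores from the full data and from one subsampled dataset.
full = {"t1": 5.0, "t2": -4.0, "t3": 3.0, "t4": 1.0}
sub = {"t1": 4.5, "t2": -1.0, "t3": 3.5, "t4": 2.0}
shared_top2 = overlap_at(full, sub, 2)
kept = subsample_indices(100, 0.8)
```

Repeating the subsampling several times and averaging the overlap for each m would give curves of the kind summarized in Figure 5.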

Fig. 5.

The number of overlapped top significant triplets between the original dataset and the subsampled or bootstrap datasets. (A) and (B) are for the results of means and standard errors of ten times subsampling for the proportion of 0.90 and 0.80, respectively; (C) is for the results of means and SEs of ten times bootstrap (Color version of this figure is available at Bioinformatics online.)

3.5 Pathway enrichment analysis and network visualization

As shown in Sections 3.2–3.4, although MetaLA provides more stable results than MetaMLA (Section 3.4), it detects far fewer enriched pathways (Section 3.2) and generates less consistent biomarkers and pathways with single studies (Section 3.3). As a result, we focus on MetaMLA for further biological investigation in this subsection.

To test whether the LA genes detected by the MetaMLA method are consistent with TF binding, we downloaded the TF-binding gene sets from the YEASTRACT database (Teixeira et al., 2013) and selected 96 gene sets with 5–200 genes. Among these 96 TF genes, Hog1 (YLR113W) has the highest frequency among all the genes from the top 20 000 triplets detected by the MetaMLA method. Genes inside the same triplet as Hog1 are enriched in the Hog1 binding gene set (P = 0.027). More significantly, Hog1 is also the most frequent gene among the LA scouting genes Z in the top 100 000 triplets. Genes regulated by Hog1 (inside the same triplets) are more significantly enriched in the Hog1 binding gene set (P = 1.44E−5). Supplementary Table S2 shows the top enriched TF binding gene sets. Among them, Hot1 is another enriched gene set (P = 7.67E−6), and Alepuz et al. (2003) showed that Hot1 targets Hog1p to osmostress-responsive promoters and that Hog1 mediates recruitment/activation of RNAPII at Hot1p-dependent promoters. The analysis shows that top triplets selected by the MetaMLA method are highly consistent with known TF regulation patterns.

Table 1 shows 18 significantly enriched KEGG pathways with their hierarchical structure, using all the genes from the top 500 triplets selected by MetaMLA. Pathway enrichment using the GO database identified 68 GO terms (Supplementary Table S3). Since the five transcriptomic studies contain yeast samples treated with different environmental conditions and mutations, we observed many pathways related to energy metabolism (q = 5.67E−12), carbohydrate metabolism (q = 1.40E−8), amino acid metabolism (q = 5.87E−8) and translation (q = 0.0065).

Table 1.

Enriched KEGG pathways and their hierarchical categories for all the genes from top 500 triplets selected by MetaMLA method

Entry and category | P-value | q-value | Odds ratio | Count | Size | Name
Metabolism | 1.53E-23 | 1.85E-21 | 2.69 | 200 | 835 |
 Energy metabolism | 9.37E-14 | 5.67E-12 | 5.36 | 43 | 122 |
  sce00190 | 2.03E-12 | 8.21E-11 | 7.91 | 29 | 72 | Oxidative phosphorylation
  sce00680 | 0.007957 | 0.041109 | 3.49 | 8 | 28 | Methane metabolism
 Carbohydrate metabolism | 4.64E-10 | 1.40E-08 | 2.86 | 62 | 229 |
  sce00620 | 1.91E-06 | 3.30E-05 | 5.53 | 17 | 39 | Pyruvate metabolism
  sce00630 | 3.15E-05 | 0.000423 | 6.14 | 12 | 26 | Glyoxylate and dicarboxylate metabolism
  sce00020 | 6.30E-05 | 0.000763 | 4.99 | 13 | 32 | Citrate cycle (TCA cycle)
  sce00010 | 0.000527 | 0.004249 | 2.99 | 17 | 58 | Glycolysis/Gluconeogenesis
  sce00051 | 0.005765 | 0.034881 | 3.76 | 8 | 25 | Fructose and mannose metabolism
  sce00030 | 0.007158 | 0.039369 | 3.23 | 9 | 28 | Pentose phosphate pathway
 Amino acid metabolism | 2.42E-09 | 5.87E-08 | 3.06 | 51 | 178 |
  sce00260 | 6.40E-08 | 1.29E-06 | 8.99 | 16 | 32 | Glycine, serine and threonine metabolism
  sce00270 | 0.000146 | 0.001468 | 4.44 | 13 | 36 | Cysteine and methionine metabolism
  sce00250 | 0.002637 | 0.016791 | 3.60 | 10 | 30 | Alanine, aspartate and glutamate metabolism
 Lipid metabolism | 0.000913 | 0.006501 | 2.11 | 29 | 126 |
  sce00100 | 0.000183 | 0.001705 | 6.89 | 9 | 17 | Steroid biosynthesis
  sce01040 | 0.000454 | 0.003927 | 12.21 | 6 | 11 | Biosynthesis of unsaturated fatty acids
  sce00062 | 0.002175 | 0.014623 | 10.16 | 5 | 8 | Fatty acid elongation
 Metabolism of cofactors and vitamins | 0.011664 | 0.052272 | 1.80 | 24 | 117 |
  sce00670 | 0.008629 | 0.041763 | 4.57 | 6 | 15 | One carbon pool by folate
Genetic information processing | 0.214481 | 0.447452 | 1.10 | 114 | 1123 |
 Translation | 0.000861 | 0.006501 | 1.59 | 70 | 682 |
  sce03010 | 9.42E-05 | 0.001036 | 2.22 | 37 | 181 | Ribosome
 Folding, sorting and degradation | 0.105526 | 0.283748 | 1.27 | 41 | 263 |
  sce03050 | 0.008154 | 0.041109 | 2.91 | 10 | 35 | Proteasome
Cellular processes | 0.718769 | 0.995721 | 0.92 | 45 | 382 |
 Transport and catabolism | 0.010656 | 0.049591 | 1.61 | 36 | 194 |
  sce04145 | 2.96E-05 | 0.000423 | 5.07 | 14 | 36 | Phagosome

To further investigate the gene interactions identified by LA, we selected, from the top 20 000 LA triplets (q < 6.64E−5), the 41 triplets in which all three genes belong to the metabolism category (q = 1.85E−21 in Table 1) for network visualization (Fig. 6). Genes within one triplet are connected by edges in the same color. The dashed lines represent reported interactions or regulations in the PPI database. In this network, a total of four interactions are validated by the PPI database, more than expected from a randomly generated PPI database (0.69 random interactions on average, with P-value 0.00197 by Fisher’s exact test).

Fig. 6.

Gene network associated with metabolism. Genes within one triplet are connected by edges in the same color. A dashed line means that the edge is in the PPI database. A small circle connected to a gene means that the gene is in the corresponding subcategory (Color version of this figure is available at Bioinformatics online.)

In Figure 6, we observed a cluster of gene modules related to carbohydrate metabolism (purple background circle in Fig. 6; almost all genes annotated with gray dots). IDH2 and BDH2 are two notable hub genes that have many LA relationships with neighboring genes. IDH2 is a subunit of mitochondrial NAD(+)-dependent isocitrate dehydrogenase, a key complex in the tricarboxylic acid (TCA) cycle that catalyzes the oxidation of isocitrate to alpha-ketoglutarate (Reinders et al., 2007). BDH2 is a putative medium-chain alcohol dehydrogenase (Dickinson et al., 2003). In carbohydrate metabolism, pyruvate is the main input for a series of chemical reactions in the aerobic TCA cycle. In the subnetwork, CDC19 is a key pyruvate kinase, which converts phosphoenolpyruvate to pyruvate (Byrne and Wolfe, 2005; Xu et al., 2012), and its physical PPI with IDH2 has been previously reported in Gavin et al. (2006). In addition, PDC5 is a minor isoform of pyruvate decarboxylase and PDC1 is the major of three pyruvate decarboxylase isozymes that decarboxylate pyruvate to acetaldehyde (Dickinson et al., 2003). ENO2 is a phosphopyruvate hydratase involved in pyruvate metabolism that catalyzes the conversion of 2-phosphoglycerate to phosphoenolpyruvate during glycolysis (Byrne and Wolfe, 2005; McAlister and Holland, 1982). All these genes from the top MetaMLA triplets are potentially co-regulated and share functional annotation in the carbohydrate metabolism pathway. However, if we examine the direct gene–gene Pearson correlations, the pair-wise correlations are low and co-expression analysis would fail to identify associations among these genes (see Supplementary Table S4).

4 Conclusion and discussion

In this article, we proposed two meta-analytic methods (MetaLA and MetaMLA) for LA analysis combining multiple studies. We used the mean of the singleMLA test statistics as the main part of the MetaMLA statistic and the SD to penalize inconsistent patterns among studies. For genome-wide application, we proposed to screen genes by bootstrap filtering and sign filtering (Fig. 2) to reduce the computational load. In the yeast datasets, we removed >98% of the triplets from hypothesis testing and still captured 94–95% of the top triplets with large MetaMLA statistics. When compared with the singleMLA method, MetaMLA provides stronger pathway enrichment signal, results more consistent with single-study analysis, and more stable results under data subsampling or bootstrapping. Although MetaLA generated more stable results than MetaMLA, it detected fewer enriched pathways and was less consistent with single-study analysis. Among the top significant triplets selected by MetaMLA, we constructed a gene regulatory network visualization to investigate the complex three-way conditional associations. The result identifies a subnetwork in the carbohydrate metabolism network that cannot be identified by traditional pair-wise co-expression analysis. These findings are supported by validation in the PPI database and by focused functional annotation in the TCA cycle.
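To make the role of the SD penalty and the sign filter concrete, the following sketch uses made-up per-study statistics for two hypothetical triplets. The functional form mean / (SD + penalty), the `penalty` constant, and the reading of sign filtering as "keep triplets whose per-study statistics all share one sign" are assumptions chosen for illustration. The exact definitions are those given in the article's Methods.

```python
from statistics import fmean, stdev

def penalized_mean(stats, penalty=1.0):
    """Mean of per-study statistics, shrunk when studies disagree (large SD).

    The form mean / (SD + penalty) is an illustrative assumption.
    """
    return fmean(stats) / (stdev(stats) + penalty)

def same_sign(stats):
    """True when every per-study statistic has the same nonzero sign."""
    return all(s > 0 for s in stats) or all(s < 0 for s in stats)

# Made-up statistics from five studies; both triplets have the same mean (2.1).
triplets = {
    "consistent": [2.1, 1.9, 2.3, 2.0, 2.2],
    "inconsistent": [6.0, -1.5, 5.5, -2.0, 2.5],
}

scores = {name: penalized_mean(s) for name, s in triplets.items()}
kept = [name for name, s in triplets.items() if same_sign(s)]

for name, score in scores.items():
    print(f"{name}: score = {score:.3f}, passes sign filter = {name in kept}")
```

Both hypothetical triplets share the same mean statistic, but the one with disagreeing studies receives a much smaller penalized score and is also dropped by the sign check. A filter of this kind is cheap to evaluate, which is why it can run before any expensive hypothesis testing.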

The LA and MLA methods for detecting LA triplets each have pros and cons. On the one hand, the LA score, obtained by three-product estimation on normalized gene intensities, is much easier to compute than the model-free estimate of the MLA score. On the other hand, MLA is more accurate when interdependency among the triplet exists (i.e. the conditional mean and variance of two genes depend on the third gene), and such interdependency is theoretically ignored by the LA method. Additionally, MLA provides systematic inference to assess P-values and control the FDR. To circumvent the computational burden of MetaMLA, our proposed two-stage filtering can significantly reduce computing time. In this article, we demonstrated genome-wide screening on all possible gene triplets. To further reduce the computational load, one may pre-select scouting genes using prior biological knowledge or TF or PPI databases.

Our meta-analytic framework has the advantage of stably combining multiple studies from different microarray or next-generation sequencing platforms. Potential heterogeneity from platform, batch effect or measurement scaling issues is automatically standardized in the meta-analysis. It is well acknowledged in the literature that simple correlation and co-expression analysis are not sufficient to describe the complex system of gene regulation. Applying advanced association models can elucidate novel regulatory mechanisms, and meta-analysis combining multiple transcriptomic studies can greatly reduce false positive findings. Our proposed meta-analytic LA methods help to accurately detect complicated three-way interactions and regulatory mechanisms.

Supplementary Material

Supplementary Data

Funding

This work was supported by the National Institutes of Health (NIH) (R01CA190766 to S.L. and G.C.T.); China Scholarship Council (201508110051 to L.W.); National Natural Science Foundation of China (11526146 to L.W.); Scientific Research Level Improvement Quota Project of Capital University of Economics and Business (to L.W.); and University of Minnesota Grant-In-Aid (to Y.Y.H.).

Conflict of Interest: none declared.

References

  1. Alepuz P.M. et al. (2003) Osmostress-induced transcription by Hot1 depends on a Hog1-mediated recruitment of the RNA Pol II. EMBO J., 22, 2433–2442.
  2. Altman N.S. (1992) An introduction to kernel and nearest-neighbor nonparametric regression. Am. Stat., 46, 175–185.
  3. Amaratunga D., Cabrera J. (2001) Analysis of data from viral DNA microchips. J. Am. Stat. Assoc., 96, 1161–1170.
  4. Benjamini Y., Hochberg Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Stat. Soc. B, 57, 289–300.
  5. Bourgon R. et al. (2010) Independent filtering increases detection power for high-throughput experiments. Proc. Natl. Acad. Sci. USA, 107, 9546–9551.
  6. Butte A.J., Kohane I.S. (2000) Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. Pac. Symp. Biocomput., 5, 418–429.
  7. Byrne K.P., Wolfe K.H. (2005) The yeast gene order browser: combining curated homology and syntenic context reveals gene fate in polyploid species. Genome Res., 15, 1456–1461.
  8. Causton H.C. et al. (2001) Remodeling of yeast genome expression in response to environmental changes. Mol. Biol. Cell, 12, 323–337.
  9. Chasman D. et al. (2014) Pathway connectivity and signaling coordination in the yeast stress-activated signaling network. Mol. Syst. Biol., 10, 759.
  10. Cherry J.M. et al. (2011) Saccharomyces genome database: the genomics resource of budding yeast. Nucleic Acids Res., 40, D700–D705.
  11. Dickinson J.R. et al. (2003) The catabolism of amino acids to long chain and complex alcohols in Saccharomyces cerevisiae. J. Biol. Chem., 278, 8028–8034.
  12. Efron B., Tibshirani R. (1986) Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Stat. Sci., pages 54–75.
  13. Gasch A.P. et al. (2000) Genomic expression programs in the response of yeast cells to environmental changes. Mol. Biol. Cell, 11, 4241–4257.
  14. Gavin A.-C. et al. (2006) Proteome survey reveals modularity of the yeast cell machinery. Nature, 440, 631–636.
  15. Gunderson T., Ho Y.-Y. (2014) An efficient algorithm to explore liquid association on a genome-wide scale. BMC Bioinformatics, 15, 371.
  16. Ho Y.-Y. et al. (2011) Modeling liquid association. Biometrics, 67, 133–141.
  17. Hughes T.R. et al. (2000) Functional discovery via a compendium of expression profiles. Cell, 102, 109–126.
  18. Kanehisa M. et al. (2016) KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res., 44, D457–D462.
  19. Knijnenburg T.A. et al. (2009) Combinatorial effects of environmental parameters on transcriptional regulation in Saccharomyces cerevisiae: a quantitative analysis of a compendium of chemostat-based transcriptome data. BMC Genomics, 10, 1.
  20. Li K.-C. (2002) Genome-wide coexpression dynamics: theory and application. Proc. Natl. Acad. Sci. USA, 99, 16875–16880.
  21. Li K.-C. et al. (2004) A system for enhancing genome-wide coexpression dynamics study. Proc. Natl. Acad. Sci. USA, 101, 15561–15566.
  22. McAlister L., Holland M.J. (1982) Targeted deletion of a yeast enolase structural gene. Identification and isolation of yeast enolase isozymes. J. Biol. Chem., 257, 7181–7188.
  23. Reinders J. et al. (2007) Profiling phosphoproteins of yeast mitochondria reveals a role of phosphorylation in assembly of the ATP synthase. Mol. Cell. Proteomics, 6, 1896–1906.
  24. Song L. et al. (2012) Comparison of co-expression measures: mutual information, correlation, and model based indices. BMC Bioinformatics, 13, 328.
  25. Teixeira M.C. et al. (2013) The YEASTRACT database: an upgraded information system for the analysis of gene and genomic transcription regulation in Saccharomyces cerevisiae. Nucleic Acids Res., 42, D161–D166.
  26. Upton G.J. (1992) Fisher’s exact test. J. Roy. Stat. Soc. A Stat., 155, 395–402.
  27. van Iterson M. et al. (2010) Filtering, FDR and power. BMC Bioinformatics, 11, 450.
  28. Wolfe C.J. et al. (2005) Systematic survey reveals general applicability of ‘guilt-by-association’ within gene coexpression networks. BMC Bioinformatics, 6, 1.
  29. Xu Y.-F. et al. (2012) Regulation of yeast pyruvate kinase by ultrasensitive allostery independent of phosphorylation. Mol. Cell, 48, 52–62.
  30. Zhang B. et al. (2005) A general framework for weighted gene co-expression network analysis. Stat. Appl. Genet. Mol. Biol., 4, 1128.
  31. Zhang J. et al. (2007) Extracting three-way gene interactions from microarray data. Bioinformatics, 23, 2903–2909.

Articles from Bioinformatics are provided here courtesy of Oxford University Press
