Author manuscript; available in PMC: 2023 Dec 1.
Published in final edited form as: Biometrics. 2021 Aug 1;78(4):1542–1554. doi: 10.1111/biom.13526

Multidimensional molecular measurements–environment interaction analysis for disease outcomes

Yaqing Xu 1, Mengyun Wu 2, Shuangge Ma 1
PMCID: PMC9366385  NIHMSID: NIHMS1825899  PMID: 34213006

Abstract

Multiple types of molecular (genetic, genomic, epigenetic, etc.) measurements, environmental risk factors, and their interactions have been found to contribute to the outcomes and phenotypes of complex diseases. In each of the previous studies, only the interactions between one type of molecular measurement and environmental risk factors have been analyzed. In recent biomedical studies, multidimensional profiling, in which data from multiple types of molecular measurements are collected from the same subjects, is becoming popular. A myriad of recent studies have shown that collectively analyzing multiple types of molecular measurements is not only biologically sensible but also leads to improved estimation and prediction. In this study, we conduct an M–E interaction analysis, with M standing for multidimensional molecular measurements and E standing for environmental risk factors. This can accommodate multiple types of molecular measurements and sufficiently account for their overlapping as well as independent information. Extensive simulation shows that it outperforms several closely related alternatives. In the analysis of TCGA (The Cancer Genome Atlas) data on lung adenocarcinoma and cutaneous melanoma, we make some stable biological findings and achieve stable prediction.

Keywords: environmental risk factors, interaction analysis, multidimensional molecular data

1 | INTRODUCTION

For the outcomes and phenotypes of cancer, cardiovascular diseases, asthma, mental disorders, and other complex diseases, multiple types of molecular measurements, environmental risk (E) factors, and their interactions play important roles (Bookman et al., 2011). For example, the expression of gene IL9 was found to interact with environmental dust mites to increase severe asthma exacerbations in children (Sordillo et al., 2015). In a lung cancer study, cigarette smoking was suggested to increase the copy number variation (CNV) of IGF1 and induce its overexpression and subsequent oncogenesis (Huang et al., 2011). Teschendorff et al. (2015) found that smoking is associated with DNA methylation in buccal cells, which may provide insights into the development of smoking-related cancers. These examples and others demonstrate that molecular effects on diseases can be modified by environmental exposures such as smoking, and such circumstances have been referred to as biological interactions (Hunter, 2005). We note that in each of the aforementioned examples and in other published studies, only the interactions between a single type of molecular measurement and environmental risk factors have been analyzed. From a methodological standpoint, the existing interaction analysis approaches can be categorized into marginal analysis (which examines one or a small number of molecular measurements at a time) and joint analysis (which models a large number of molecular measurements in a single model). We refer to Simonds et al. (2016) and McAllister et al. (2017) for more discussion. These studies have also recognized the great successes of the existing gene–environment (G–E) interaction analyses as well as their limitations, calling for more effective approaches and expanded analysis scopes.

In recent biomedical studies, multidimensional profiling is being increasingly employed, utilizing data from multiple types of molecular measurements collected from the same subjects (Tseng et al., 2015). Such studies make it possible to more deeply understand disease biology and construct more effective models for disease outcomes and phenotypes. A myriad of novel statistical methods have been developed. For example, Wang et al. (2012) proposed an integrative Bayesian approach to identify gene expression and methylation measurements associated with clinical outcomes such as survival. Gross and Tibshirani (2014) developed collaborative regression, which applies penalization to explicitly accommodate overlapping information (correlations) from gene expressions and CNVs for marker identification. Zhu et al. (2016) developed a linear regulatory module-based method using the sparse singular value decomposition (SVD) and penalization techniques to integrate gene expressions and their regulators for cancer outcomes. We refer to Kristensen et al. (2014) and Wu et al. (2019) for more discussions. These and other published studies have convincingly shown that integrating multidimensional molecular data is not only biologically sensible but also improves estimation, marker identification, and prediction. It is noted that these studies are focused on the main effects of molecular measurements. Thomas (2010), Ritchie et al. (2017), and others have suggested that the main effects of molecular measurements can only explain a limited proportion of variations in diseases. This is also observed in the aforementioned multidimensional profiling studies as well as others, suggesting a demand for new modeling components (e.g., interactions).

Strongly motivated by the limitations of existing interaction analyses (which are limited to a single type of molecular data) and multidimensional molecular data analyses (which are limited to main effects), in this study we develop an M–E interaction analysis, where M stands for multidimensional molecular measurements and E stands for environmental risk factors. The objective is to collectively accommodate multiple types of high-dimensional molecular measurements, environmental risk factors, and their interactions in modeling disease phenotypes and outcomes. This analysis is a natural next step from the integrated analysis of the main effects of multidimensional molecular data and studies that conduct interaction analysis of a single type of molecular measurement and environmental risk factors. Beyond the “ordinary” high dimensionality and noisy nature of molecular data, our analysis faces new challenges. Specifically, multiple types of molecular measurements are interconnected, leading to overlapping information. For example, gene expression levels are regulated by genetic and epigenetic effects. On the other hand, they also contain independent information on disease outcomes (Risch and Plass, 2008). Further, an interaction analysis needs to respect the “main effects, interactions” hierarchy, which only allows an interaction into the model if its corresponding main effects are also included. This hierarchy has been extensively adopted (Bien et al., 2013; Wu et al., 2020).

The proposed approach is tailored to the M–E interaction analysis and advances significantly beyond the aforementioned ones. Specifically, it accommodates the regulation relationships (overlapping information) among multidimensional molecular data via biclustering, through which the regulatory modules are identified and information within the modules is integrated. With the integrated molecular information, a novel joint interaction analysis based on penalization is developed that respects the “main effects, interactions” hierarchy (Bien et al., 2013; Wu et al., 2020).

2 | METHODS

Without loss of generality, we use gene expressions and their regulators as an example in the following description. Such a combination has been popular in published studies (Wang et al., 2012; Zhu et al., 2016). Other combinations, for example, proteins and gene expressions, can be analyzed in the same manner. Assume n iid subjects. Denote G = (G1, … , Gp) and R = (R1, … , Rq) as the n × p and n × q design matrices of p gene expression and q regulator measurements. Denote E = (E1, … , EM) as the n × M design matrix of environmental risk factors, and Y as the length n vector of outcomes. We first consider continuous outcomes and later discuss accommodating other types of outcomes in the Supporting Information. Assume that Y has been properly centered so that the intercept term is 0. Continuous variables in E, G, and R are standardized to have zero means and unit standard deviations.

Our goal is to identify important M–E interactions (as well as main effects) and construct a comprehensive outcome model. The proposed approach consists of the following main steps. Refer to Figure S1 in the Supporting Information for the analysis flowchart.

Step I

We conduct a penalized regression to estimate the gene expression-regulator relationships and then sequentially conduct biclustering to identify the regulatory modules. Consider the model G = RΘ + ϵ, where ϵ is the n × p matrix of random errors and Θ = (θ1, … , θp) is the q × p matrix of unknown coefficients. For estimating Θ, consider:

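$$\hat{\Theta} = \arg\min_{\Theta} \left\{ \frac{1}{2n} \| G - R\Theta \|_F^2 + \lambda \sum_{j=1}^{p} \| \theta_j \|_1 \right\}, \tag{1}$$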

where || · ||F and || · ||1 denote the Frobenius norm of a matrix and L1 norm of a vector, respectively, and λ ≥ 0 is the tuning parameter.
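As a rough illustration of (1), the objective separates over the columns of G, so each gene expression can be regressed on the regulators with an ordinary Lasso. The sketch below uses plain cyclic coordinate descent in NumPy; the function names, the toy dimensions, and the value of λ are assumptions made for this example, and this is not the authors' implementation (their algorithm is described in the Supporting Information).

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_columnwise(G, R, lam, n_iter=200):
    """Minimize (1/2n)||G - R Theta||_F^2 + lam * sum_j ||theta_j||_1
    by cyclic coordinate descent (illustrative, not optimized)."""
    n, q = R.shape
    p = G.shape[1]
    Theta = np.zeros((q, p))
    col_sq = (R ** 2).sum(axis=0) / n            # (1/n) * ||R_l||^2 per regulator
    for _ in range(n_iter):
        for l in range(q):
            if col_sq[l] == 0.0:
                continue
            # partial residual with regulator l removed, for all genes at once
            resid = G - R @ Theta + np.outer(R[:, l], Theta[l])
            rho = R[:, l] @ resid / n            # length-p vector
            Theta[l] = soft_threshold(rho, lam) / col_sq[l]
    return Theta

# Toy data (hypothetical sizes): 10 regulators, 5 genes, 2 true regulations.
rng = np.random.default_rng(0)
R = rng.standard_normal((100, 10))
Theta_true = np.zeros((10, 5))
Theta_true[0, 0] = 1.0
Theta_true[3, 2] = -0.8
G = R @ Theta_true + 0.1 * rng.standard_normal((100, 5))
Theta_hat = lasso_columnwise(G, R, lam=0.1)
Theta_null = lasso_columnwise(G, R, lam=100.0)   # very large lambda removes every regulation
```

Because the update for regulator l only needs the partial residual, all p genes are updated in one vectorized step, which mirrors the column-wise separability of (1).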

To identify the regulatory modules, we conduct biclustering with Θ^. Here a regulatory module corresponds to a bicluster, which contains a small number of co-expressed gene expressions and their regulators. For estimation, we adopt the sparse clustering technique developed in Helgeson et al. (2020), which first introduces weights for gene expressions and then maximizes the weighted between-cluster distance for regulators. More specifically, the estimation is defined as:

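$$\max_{C, \bar{C}, w} \sum_{j=1}^{p} w_j \left( \frac{1}{q} \sum_{l=1}^{q} \sum_{l'=1}^{q} d_{l,l',j} - \frac{1}{q_1} \sum_{l,l' \in C} d_{l,l',j} - \frac{1}{q_2} \sum_{l,l' \in \bar{C}} d_{l,l',j} \right), \tag{2}$$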

subject to $\|w\|_2 \le 1$, $\|w\|_1 \le \sqrt{p}$, and $w_j \ge 0$ for $j = 1, \ldots, p$, where $\hat{\theta}_{lj}$ is the $(l, j)$th component of $\hat{\Theta}$; $d_{l,l',j} = (\hat{\theta}_{lj} - \hat{\theta}_{l'j})^2$ measures the distance between the $l$th and $l'$th regulators; $C$ and $\bar{C}$ are the disjoint index sets of regulator clusters; $q_1 = |C|$ and $q_2 = |\bar{C}|$ are the cardinalities of $C$ and $\bar{C}$, respectively, with $q_1 < q_2$ and $q_1 + q_2 = q$; and $w = (w_1, \ldots, w_p)'$ is the weight vector for gene expressions, with a larger weight indicating a higher importance for clustering. With the constraints for $w$, each $w_j$ has a nonzero value between 0 and 1. With the estimated $\hat{w}$, a two-sample, permutation-based Kolmogorov–Smirnov (K-S) test is conducted to quantify the significance of the difference between two clusters and select $D$, the set of gene expressions with significantly larger weights. Following Helgeson et al. (2020), we choose 0.05 as the cutoff for significance. This process leads to one regulatory module $\{C, D\}$ with regulators in $C$ and gene expressions in $D$. To obtain the subsequent modules, we update $\hat{\Theta}$ by subtracting the module just identified and repeat the above procedure. This process is iterated until the K-S test fails to reject the null hypothesis of no clusters. With the sparsity of $\hat{\Theta}$, it is expected that only a subset of gene expressions and regulators can form modules. Suppose that there are $S$ identified modules: $\{C_1, D_1\}, \ldots, \{C_S, D_S\}$.
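To make the bracketed term in (2) concrete, the sketch below evaluates it for a fixed split of the regulators and then forms weights for that split. Two assumptions are made here and are not taken from the source: the double sums run over ordered pairs, and the L1 bound on w is not binding, in which case the maximizer of a linear objective under the L2 and non-negativity constraints is the normalized positive part of the scores. The alternating search over clusters and the permutation K-S test are not shown, and all names are hypothetical.

```python
import numpy as np

def between_cluster_scores(theta_hat, in_C):
    """Per-gene bracketed term of objective (2) for a fixed split of regulators.

    theta_hat : (q, p) array of estimated regulation coefficients
    in_C      : boolean mask of length q, True for regulators in C
    """
    in_C = np.asarray(in_C, dtype=bool)
    q = theta_hat.shape[0]
    q1, q2 = in_C.sum(), (~in_C).sum()

    def pairwise_sum(block):
        # sum of squared differences over all ordered pairs (l, l'), per gene
        diff = block[:, None, :] - block[None, :, :]
        return (diff ** 2).sum(axis=(0, 1))

    total = pairwise_sum(theta_hat) / q
    within = pairwise_sum(theta_hat[in_C]) / q1 + pairwise_sum(theta_hat[~in_C]) / q2
    return total - within

def weights_for_fixed_clusters(scores):
    """Maximize sum_j w_j a_j over ||w||_2 <= 1, w >= 0 (L1 bound assumed slack)."""
    a_plus = np.maximum(scores, 0.0)
    norm = np.linalg.norm(a_plus)
    return a_plus / norm if norm > 0 else a_plus

# Toy example: regulators 0-1 strongly regulate genes 0-1 only.
rng = np.random.default_rng(1)
theta_hat = 0.01 * rng.standard_normal((6, 4))
theta_hat[:2, :2] += 1.0
in_C = np.array([True, True, False, False, False, False])
a = between_cluster_scores(theta_hat, in_C)
w = weights_for_fixed_clusters(a)
```

For a single gene, the score equals twice the between-cluster sum of squares, that is, 2(q1 q2 / q) times the squared difference of the two cluster means, which is why genes whose coefficients differ sharply between C and its complement receive the largest weights.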

Rationale

Linear regression is used to describe the regulations between the two types of molecular measurements. Multiple published studies (Shi et al., 2015; Zhu et al., 2016) have shown that this is a sensible choice, especially considering the high dimensionality. One gene expression is regulated by only a few regulators, and one regulator affects the expression of only a few genes. As such, Θ is assumed to be sparse, and Lasso is applied for the estimation and identification of important regulations.

The concept of regulatory module has been developed in Zhu et al. (2016) and other studies. A regulatory module consists of a small number of gene expressions and regulators that behave in a coordinated manner. The construction in Zhu et al. (2016), which is based on sparse SVD, limits each regulatory module to have rank one. Here, we lift this constraint via biclustering. By construction, each bicluster (regulatory module) consists of gene expressions and regulators sharing similar patterns in Θ. We adopt the sparse biclustering method developed in Helgeson et al. (2020) because of its competitive numerical performance. Note that here we cluster regulators into two disjoint groups with weighted gene expressions. It is also possible to reverse the roles of gene expressions and regulators, leading to similar clustering results in our numerical investigations. With the sequential cluster construction strategy, different regulatory modules may overlap. This is desirable as one gene/regulator can participate in multiple biological processes.

Step II

We integrate information within each regulatory module $\{C_s, D_s\}$, $s = 1, \ldots, S$, using the principal component analysis (PCA) technique. Given a matrix $A$ and an index set $I$, denote $A_I$ as the columns of $A$ indexed by $I$. For the $s$th module, we apply PCA to the stacked matrix $(G_{D_s}, R_{C_s})$ and select the top PCs with a cumulative variance contribution rate ≥ 80%. Denote the resulting matrix composed of the $p_s$ PCs as $X_s = (X_{s,1}, \ldots, X_{s,p_s})$. In addition, for the gene expressions and regulators not involved in any of the identified modules, we collect and combine them as $Z = (Z_1, \ldots, Z_{p_z}) = (G_{D^c}, R_{C^c})$, where $D^c = \{j \in \{1, \ldots, p\} : j \notin D_s, s = 1, \ldots, S\}$ and $C^c = \{j \in \{1, \ldots, q\} : j \notin C_s, s = 1, \ldots, S\}$. $X = (X_1, \ldots, X_S)$ and $Z$ form the input for downstream analysis.
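A minimal sketch of the module-level integration in Step II follows, assuming PCA is computed from the singular value decomposition of the column-centered stacked matrix. The function name, index lists, and toy data are hypothetical; only the 80% cumulative variance rule comes from the text.

```python
import numpy as np

def module_pcs(G, R, genes, regulators, threshold=0.80):
    """Top principal components of the stacked module matrix (G_D, R_C)."""
    A = np.hstack([G[:, genes], R[:, regulators]])
    A = A - A.mean(axis=0)                        # center columns before PCA
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    var_ratio = s ** 2 / np.sum(s ** 2)
    # smallest number of PCs whose cumulative variance share reaches the threshold
    k = int(np.searchsorted(np.cumsum(var_ratio), threshold) + 1)
    k = min(k, len(s))
    return U[:, :k] * s[:k], var_ratio[:k]

rng = np.random.default_rng(2)

# Coordinated module: all members share one latent factor, so one PC suffices.
f = rng.standard_normal((50, 1))
G = f + 0.1 * rng.standard_normal((50, 6))
R = f + 0.1 * rng.standard_normal((50, 4))
X_s, ratio = module_pcs(G, R, genes=[0, 1, 2], regulators=[0, 1])

# Uncoordinated columns: several PCs are needed to reach 80%.
X_ind, ratio_ind = module_pcs(rng.standard_normal((50, 6)),
                              rng.standard_normal((50, 4)),
                              genes=[0, 1, 2], regulators=[0, 1])
```

The contrast between the two toy modules illustrates why strongly coordinated modules are compressed to very few PCs, which is what reduces dimensionality and collinearity before Step III.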

Rationale

The previous step does not directly limit the sizes of the modules. Thus, it is possible for some modules to have moderate to large sizes. Additionally, with regulations, measurements within the same modules can have strong correlations. To reduce dimensionality, remove collinearity, and simplify computation, we apply PCA. Overall, the input for the next step consists of the PCs (representing overlapping information) as well as gene expressions and regulators that do not have strong patterns (representing independent information).

Step III

We conduct interaction analysis that respects the “main effects, interactions” hierarchy (Bien et al., 2013). Consider the regression model:

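$$Y = E\alpha + \sum_{s=1}^{S} X_s \beta_s + Z\gamma + \sum_{m=1}^{M} \sum_{s=1}^{S} (E_m \odot X_s)(\beta_s * \eta_{sm}) + \sum_{m=1}^{M} (E_m \odot Z)(\gamma * \tau_m) + \xi = g(X, Z, E) + \xi. \tag{3}$$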

Here $\alpha = (\alpha_1, \ldots, \alpha_M)'$, $\beta = (\beta_1', \ldots, \beta_S')'$, and $\gamma = (\gamma_1, \ldots, \gamma_{p_z})'$ correspond to the main effects of the environmental factors, regulatory modules, and individual molecular measurements (that do not belong to any modules), respectively. For the $m$th environmental factor, $\beta_s * \eta_{sm}$ and $\gamma * \tau_m$ correspond to the interactions with the $s$th regulatory module and all individual molecular measurements, respectively, with $*$ being the component-wise product (e.g., $\gamma * \tau_m = (\gamma_1 \tau_{m1}, \ldots, \gamma_{p_z} \tau_{m,p_z})'$). $\odot$ is the “matching column-wise” Khatri–Rao product (Liu and Trenkler, 2008). $\xi$ is the random error vector. To accommodate the hierarchical structure, the interaction effects $\beta_{sj}\eta_{smj}$ and $\gamma_j \tau_{mj}$ are decomposed into two components, with the first for the corresponding main effects ($\beta_{sj}$ and $\gamma_j$) and the second for the interaction-specific effects ($\eta_{smj}$ and $\tau_{mj}$).
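The structure of model (3) can be sketched as follows. The example assumes that the product of an environmental factor with a block of molecular variables multiplies each column of the block elementwise by that factor, which is our reading of the “matching column-wise” product; all names and toy values are hypothetical.

```python
import numpy as np

def interaction_block(E_m, X):
    """Multiply every column of X elementwise by the environmental factor E_m."""
    return E_m[:, None] * X

def fitted_mean(E, X_list, Z, alpha, beta_list, gamma, eta, tau):
    """g(X, Z, E) in model (3).

    eta[s][m] has the same length as beta_list[s]; tau[m] has the same length as gamma.
    """
    g = E @ alpha + Z @ gamma
    for s, (X_s, beta_s) in enumerate(zip(X_list, beta_list)):
        g = g + X_s @ beta_s
        for m in range(E.shape[1]):
            g = g + interaction_block(E[:, m], X_s) @ (beta_s * eta[s][m])
    for m in range(E.shape[1]):
        g = g + interaction_block(E[:, m], Z) @ (gamma * tau[m])
    return g

rng = np.random.default_rng(4)
n, M = 20, 2
E = rng.standard_normal((n, M))
X_list = [rng.standard_normal((n, 2))]          # one module with p_1 = 2 PCs
Z = rng.standard_normal((n, 3))                 # three individual measurements
alpha = np.array([0.3, -0.2])
beta_list = [np.array([0.5, -0.4])]
gamma = np.array([0.6, 0.0, 0.0])               # only the first individual main effect is nonzero
eta = [[np.array([1.0, 1.0]), np.zeros(2)]]     # eta[s][m]
tau = [np.array([1.0, 1.0, 1.0]), np.zeros(3)]  # tau[m]

interaction_coef = gamma * tau[0]               # zero wherever gamma is zero
g = fitted_mean(E, X_list, Z, alpha, beta_list, gamma, eta, tau)
```

Because each interaction coefficient is the product of a main-effect component and an interaction-specific component, a zero main effect forces the corresponding interaction to zero, whatever the value of the interaction-specific component.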

To estimate and identify important interactions (and main effects), we propose the following penalized objective function:

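$$Q(\Phi) = \frac{1}{2} \| Y - g(X, Z, E) \|_2^2 + \lambda_1 \sum_{s=1}^{S} \sqrt{p_s} \left( \|\beta_s\|_2 + \sum_{m=1}^{M} \|\eta_{sm}\|_2 \right) + \lambda_2 \left( \|\gamma\|_1 + \sum_{m=1}^{M} \|\tau_m\|_1 \right), \tag{4}$$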

where $\Phi = (\alpha', \beta_1', \ldots, \beta_S', \gamma', \eta_{11}', \ldots, \eta_{SM}', \tau_1', \ldots, \tau_M')'$, $\| \cdot \|_2$ is the L2 norm of a vector, and $\lambda_1, \lambda_2 \ge 0$ are tuning parameters. Gene expressions and regulators that are involved in modules with nonzero estimated $\beta_s$ and $\beta_s * \eta_{sm}$ are identified as having important main effects and M–E interactions, respectively. In addition, for individual molecular measurements, the nonzero components of $\gamma$ and $\gamma * \tau_m$ correspond to important main effects and interactions, respectively.
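For readers who prefer code to notation, the value of a penalized objective of the form (4) can be evaluated as below. The group weight sqrt(p_s) is an assumption of this sketch (it is the usual group Lasso convention), the residual vector is passed in directly rather than recomputed from (3), and all names and numbers are hypothetical.

```python
import numpy as np

def penalized_objective(resid, beta_list, eta, gamma, tau, lam1, lam2):
    """0.5 * ||resid||^2 + lam1 * group penalty + lam2 * Lasso penalty.

    resid     : Y - g(X, Z, E)
    beta_list : list of module main-effect vectors beta_s
    eta       : eta[s] is a list of interaction vectors eta_{sm}, m = 1..M
    tau       : tau[m] is the interaction vector for individual measurements
    """
    loss = 0.5 * np.sum(resid ** 2)
    group_pen = 0.0
    for beta_s, eta_s in zip(beta_list, eta):
        weight = np.sqrt(len(beta_s))             # assumed group weight sqrt(p_s)
        group_pen += weight * (np.linalg.norm(beta_s)
                               + sum(np.linalg.norm(e) for e in eta_s))
    lasso_pen = np.sum(np.abs(gamma)) + sum(np.sum(np.abs(t)) for t in tau)
    return loss + lam1 * group_pen + lam2 * lasso_pen

value = penalized_objective(
    resid=np.array([1.0, -1.0]),
    beta_list=[np.array([3.0, 4.0])],
    eta=[[np.array([0.0, 0.0])]],
    gamma=np.array([1.0, -2.0]),
    tau=[np.array([0.5, 0.0])],
    lam1=1.0,
    lam2=2.0,
)
```

In this toy call the loss is 1, the group term is sqrt(2) times 5, and the Lasso term is 2 times 3.5, so the three pieces of (4) can be checked by hand.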

Rationale

A joint model is developed to accommodate all molecular and environmental effects and their interactions. Here, the linear regression model can be replaced with other models, such as the accelerated failure time (AFT) model for survival data (Supporting Information). For estimation and selection, we adopt penalization, which has been popular in interaction analysis (Wu et al., 2020). Choosing Lasso and group Lasso penalization facilitates computation and enables a direct comparison with alternative analysis strategies in numerical studies. For many datasets, including those analyzed in this study, the environmental factors are preselected based on existing knowledge and usually considered important, and their coefficients are not subject to penalization. As such, the “main effects, interactions” hierarchy postulates that an identified interaction corresponds to an identified main molecular effect. To achieve this, we decompose the interaction effects into two components and have βsjηsmj ≠ 0 only if βsj ≠ 0 and γjτmj ≠ 0 only if γj ≠ 0. In (4), we employ group Lasso for regulatory modules (where PCs corresponding to the same module form a group) and Lasso for individual molecular measurements to identify M–E interactions and main effects. Here, all PCs corresponding to the same module are selected or not simultaneously, which is motivated by the coordinated nature of the molecular measurements in the module.
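The different selection behavior of the two penalties can be seen from their proximal (shrinkage) operators. The sketch below shows the standard block soft-thresholding operator for the group Lasso next to elementwise soft-thresholding for the Lasso; it illustrates the general technique, not the specific algorithm used for (4), and the function names are ours.

```python
import numpy as np

def group_soft_threshold(v, t):
    """Proximal operator of t * ||v||_2: shrink the whole block, or zero it out."""
    norm = np.linalg.norm(v)
    if norm <= t:
        return np.zeros_like(v)
    return (1.0 - t / norm) * v

def soft_threshold(v, t):
    """Proximal operator of t * ||v||_1: shrink each coordinate separately."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

kept = group_soft_threshold(np.array([3.0, 4.0]), 2.5)      # block norm 5 exceeds 2.5
dropped = group_soft_threshold(np.array([0.3, 0.4]), 2.5)   # block norm 0.5 does not
mixed = soft_threshold(np.array([3.0, 0.4]), 1.0)           # coordinates treated one by one
```

A block is either retained with all of its components nonzero or removed entirely, which matches the statement that all PCs of a module are selected or not simultaneously, whereas the Lasso operator can zero out individual measurements.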

Optimization is carried out using existing algorithms and the coordinate descent techniques. The two tuning parameters in (4) are selected using the extended Bayesian information criterion. Refer to the Supporting Information for details on the algorithms and computational complexity. In addition, we provide heuristic theoretical justifications for consistency in the Supporting Information.
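As an illustration of tuning by an extended Bayesian information criterion, the sketch below uses a common form for Gaussian linear models, n log(RSS/n) + df log(n) + 2κ log C(P, df), where P is the number of candidate effects. The exact form, the way degrees of freedom are counted, and the constant κ used in the paper are given in the Supporting Information and may differ; the candidate fits below are made-up numbers.

```python
import math

def ebic(rss, n, df, n_candidates, kappa=0.5):
    """Extended BIC for a Gaussian linear model (one common form, assumed here)."""
    log_binom = (math.lgamma(n_candidates + 1) - math.lgamma(df + 1)
                 - math.lgamma(n_candidates - df + 1))
    return n * math.log(rss / n) + df * math.log(n) + 2.0 * kappa * log_binom

# Hypothetical fits on a grid of (lambda1, lambda2) values.
candidates = {
    (0.05, 0.05): {"rss": 55.0, "df": 10},
    (0.10, 0.10): {"rss": 60.0, "df": 2},
    (0.50, 0.50): {"rss": 95.0, "df": 0},
}
scores = {
    lams: ebic(fit["rss"], n=100, df=fit["df"], n_candidates=500)
    for lams, fit in candidates.items()
}
best = min(scores, key=scores.get)
```

With κ = 0 the criterion reduces to the ordinary BIC; the extra term penalizes model size more heavily when the number of candidate effects is large, which is the relevant regime for interaction analysis.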

3 | SIMULATION

We set p = q = 500, M = 5, and n = 250, and generate environmental factors from independent standard normal distributions. In addition, (a) we consider two settings for Θ to represent different regulation patterns. The first setting (Θ1) contains 15 regulatory modules with 1 overlap. Each regulatory module contains on average 12.3 gene expressions and 16.6 regulators. The elements in the regulatory modules are independently generated from normal distributions, with means ranging from −0.7 to 1.5 and standard deviations of 0.1, covering different levels and directions of regulations. The remaining elements of Θ1 equal 0. The second setting (Θ2) contains 20 nonzero regulatory modules with 1 overlap, and the nonzero values are generated in a similar way to those of Θ1. These modules consist of 6.0 gene expressions and 8.1 regulators, on average. Compared to Θ1, Θ2 contains more modules with smaller sizes, representing a different type of regulation. (b) The values of regulators R involved in each regulatory module are generated from a multivariate normal distribution with marginal means of 0 and variances of 1. We consider three correlation structures. The first (R1) is an autoregressive structure in which the correlation between the jth and lth variables is $(-0.5)^{|j-l|}$. The second (R2) is a banded structure in which the correlation between the jth and lth variables is −0.5 if $|j - l| = 1$ and 0 otherwise. The third (R3) has a structure in which the correlation between the jth and lth variables is $(-1)^{|j-l|}/(|C_s| + |D_s|)$. R1 and R2 are diagonally dominant, while R3 has all correlations at the same level. Individual regulators that are not involved in any regulatory modules are independently generated from a standard normal distribution. As such, regulators in different modules are independent from each other and also independent from the individual regulators. 
(c) Gene expression measurements are generated from G = RΘ + ϵ, where the elements of ϵ follow independent standard normal distributions. (d) Given G, R, and Θ, generate the integrated information Xs for each module using the top PCs and Z for the individual molecular units. (e) With Xs, s = 1, … , S and Z, consider response model (3). Two types of nonzero coefficient settings are considered, leading to a total of 100 (P1) and 70 (P2) important main molecular effects and M–E interactions, respectively. These nonzero coefficients are generated uniformly from (0.5, 0.8) (B1) or (0.8, 1.2) (B2), with the “main effects, interactions” hierarchical structure satisfied. The molecular factors with important effects include gene expressions and regulators involved in the regulatory modules as well as individual molecular measurements. Additional information is provided in the Supporting Information. Random errors ξ are generated from independent standard normal distributions.
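Parts (b) and (c) of the data-generating process can be sketched as follows. The three correlation builders follow our reading of the R1, R2, and R3 descriptions (in particular, the alternating sign in R3 is an assumption), and the module size, the coefficient mean of 0.8, and the number of genes are arbitrary toy values rather than the settings used in the paper.

```python
import numpy as np

def corr_ar(k, rho=-0.5):
    """R1: autoregressive, corr(j, l) = rho ** |j - l|."""
    idx = np.arange(k)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def corr_banded(k, rho=-0.5):
    """R2: banded, corr(j, l) = rho if |j - l| = 1 and 0 otherwise."""
    idx = np.arange(k)
    lag = np.abs(idx[:, None] - idx[None, :])
    return np.where(lag == 0, 1.0, np.where(lag == 1, rho, 0.0))

def corr_equal_level(k, module_size):
    """R3 (assumed form): corr(j, l) = (-1) ** |j - l| / module_size for j != l."""
    idx = np.arange(k)
    lag = np.abs(idx[:, None] - idx[None, :])
    mat = (-1.0) ** lag / module_size
    np.fill_diagonal(mat, 1.0)
    return mat

rng = np.random.default_rng(3)
n, k, d = 250, 8, 20                      # d stands in for |C_s| + |D_s|
structures = [corr_ar(k), corr_banded(k), corr_equal_level(k, d)]

# Regulators of one module under R1, then gene expressions from G = R Theta + eps.
R_mod = rng.multivariate_normal(np.zeros(k), structures[0], size=n)
Theta = np.zeros((k, 5))
Theta[:, :3] = rng.normal(0.8, 0.1, size=(k, 3))   # 3 regulated genes, 2 unregulated
G = R_mod @ Theta + rng.standard_normal((n, 5))
```

All three matrices are valid correlation matrices for these sizes (symmetric, unit diagonal, positive definite), so they can be passed directly to a multivariate normal sampler.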

To better appreciate the operating characteristics of the proposed module detection procedure, we simulate one dataset under setting Θ1 and correlation structure R1. We present the true regulation relationships in Figure 1, together with their estimated values and identified regulatory modules. We observe that with moderate associations between small sets of molecular measurements, the estimate Θ^ closely reflects the true regulation relationships. Furthermore, biclustering can properly identify the regulatory modules based on the estimated regulations.

FIGURE 1.

Simulation under setting Θ1 and R1. Upper-left: true values of regulations; upper-right: estimated regulations; lower: identified regulatory modules (shaded in light yellow; color in this figure carries no numerical meaning). This figure appears in color in the electronic version of this article, and any mention of color refers to that version.

Besides the proposed approach, we also consider the following alternatives that have closely related frameworks. Comparison with these alternatives directly establishes the necessity of the considerations regarding gene expression-regulator regulations, correlations within regulatory modules, and hierarchical interactions. Specifically, Alt.1 excludes the integration Step II and builds the hierarchical interaction model with gene expressions and regulators directly combined as groups based on the identified regulatory modules. Alt.2 excludes the decomposition of the interaction coefficients in Step III, and so the “main effects, interactions” hierarchy may be violated. Alt.3 builds a hierarchical joint model directly using the stacked gene expression and regulator measurements without accounting for the regulations. Alt.4 incorporates the stacked gene expression and regulator measurements directly into the interaction model. It ignores the regulation relationships and interaction hierarchy. For evaluation, we consider the numbers of true positives (TPs) and false positives (FPs) for main effects and interactions together.
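The evaluation metric can be written down in a few lines. In the sketch below, each effect is represented by a label, with main effects and interactions pooled into one set as described above; the labels themselves are hypothetical.

```python
def count_tp_fp(selected, truth):
    """Number of true positives and false positives among the selected effects."""
    selected, truth = set(selected), set(truth)
    tp = len(selected & truth)    # selected and truly important
    fp = len(selected - truth)    # selected but not important
    return tp, fp

# Main effects and interactions are pooled into a single set of labels.
truth = {("main", "gene_3"), ("main", "cnv_7"), ("int", "gene_3", "E1")}
selected = {("main", "gene_3"), ("int", "gene_3", "E1"), ("main", "gene_9")}
tp, fp = count_tp_fp(selected, truth)
```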

For each scenario, 200 replicates are simulated. The summary results under settings P1 and P2 are presented in Tables 1 and 2, respectively. We observe that the proposed approach achieves better or comparable performance in terms of identification accuracy. For example, in Table 1, with weak effects (B1), regulation pattern Θ1, and correlation structure R1, the proposed approach selects on average 95.94 TPs compared to 71.90 (Alt.1), 65.20 (Alt.2), 23.15 (Alt.3), and 16.80 (Alt.4). When there are more correlated molecular measurements, the proposed approach remains superior in identification. For instance, in Table 1, with weak effects (B1), regulation pattern Θ1, and correlation structure R3, the proposed approach selects on average 99.70 TPs with 8.50 FPs. In comparison, Alt.1, Alt.2, Alt.3, and Alt.4 select fewer TPs and more FPs, with (TP,FP)=(83.68,12.26), (95.90,54.75), (27.30,14.70), and (20.75,136.65), respectively. Similarly, we observe that with regulation pattern Θ2 (under which the correlations are stronger) and weak effects (B1), the proposed approach identifies on average 79.40 TPs and 39.13 FPs, whereas the alternative methods identify far fewer TPs. With a higher signal level under setting B2, all approaches behave better, while with more regulation modules under setting Θ2, the performance of all the approaches decays. Under both settings, the proposed approach remains advantageous. We observe that Alt.1 generally achieves the second-best identification accuracy. In some scenarios, it is competitive in TP identification, though at the cost of more FPs. This is because the proposed approach uses PCs for the joint interaction model, effectively removes collinearity, and reduces false discovery. The proposed approach performs better than Alt.2, which can be attributed to respecting the interaction hierarchy. 
The superior performance of the proposed approach over Alt.3 and Alt.4 provides direct support for the strategy of accommodating regulations among multidimensional molecular data in interaction analysis.

TABLE 1.

Summary results of the simulation under setting P1, with a total of 100 true positives: mean (sd) from 200 replicates

Coefficient Correlation Approach TP (Θ1) FP (Θ1) TP (Θ2) FP (Θ2)
B1 R1 Proposed 95.94 (4.63) 11.39 (13.83) 80.06 (5.37) 6.31 (6.02)
Alt.1 71.90 (35.35) 20.45 (24.20) 27.06 (19.83) 1.94 (1.12)
Alt.2 65.20 (28.39) 28.75 (14.03) 31.69 (15.05) 7.94 (18.32)
Alt.3 23.15 (3.47) 7.45 (3.90) 20.35 (9.10) 19.85 (8.43)
Alt.4 16.80 (2.09) 122.20 (38.35) 29.85 (5.73) 127.40 (40.31)
R2 Proposed 97.30 (1.75) 5.70 (10.99) 80.72 (4.64) 20.89 (25.90)
Alt.1 86.60 (30.13) 13.00 (16.06) 47.00 (16.82) 6.11 (4.92)
Alt.2 85.15 (16.11) 39.95 (6.87) 33.61 (11.44) 19.06 (21.78)
Alt.3 23.35 (4.18) 7.65 (2.89) 21.05 (5.77) 25.79 (6.27)
Alt.4 16.95 (2.86) 126.05 (46.47) 16.00 (9.56) 71.15 (65.52)
R3 Proposed 99.70 (0.57) 8.50 (14.60) 79.40 (3.22) 36.13 (38.10)
Alt.1 83.68 (27.69) 12.26 (18.29) 51.14 (25.72) 8.71 (11.69)
Alt.2 95.90 (8.09) 54.75 (12.48) 30.50 (14.39) 4.50 (7.60)
Alt.3 27.30 (1.63) 14.70 (17.41) 20.21 (7.79) 22.11 (8.46)
Alt.4 20.75 (2.65) 136.65 (37.06) 20.00 (8.55) 103.05 (63.77)
B2 R1 Proposed 99.80 (0.41) 14.25 (18.95) 83.90 (4.43) 12.60 (10.56)
Alt.1 99.80 (0.41) 57.80 (22.75) 32.00 (23.43) 14.35 (13.92)
Alt.2 85.80 (14.06) 55.55 (27.20) 34.75 (13.98) 18.25 (9.48)
Alt.3 27.17 (2.46) 5.28 (1.02) 27.15 (8.67) 35.15 (11.45)
Alt.4 21.45 (2.98) 142.05 (31.31) 30.30 (10.98) 110.10 (66.72)
R2 Proposed 99.82 (0.39) 4.12 (14.69) 77.88 (3.67) 20.81 (12.93)
Alt.1 90.80 (27.98) 38.65 (21.69) 42.69 (22.46) 12.06 (8.73)
Alt.2 77.85 (19.63) 47.05 (16.62) 21.75 (19.49) 19.05 (22.55)
Alt.3 27.37 (2.29) 7.79 (3.31) 17.45 (5.84) 18.95 (11.87)
Alt.4 19.60 (2.19) 135.35 (36.02) 14.95 (9.74) 52.40 (47.98)
R3 Proposed 99.35 (0.67) 12.05 (17.72) 77.77 (2.95) 9.85 (6.67)
Alt.1 96.45 (13.77) 50.45 (17.38) 35.65 (19.45) 21.95 (13.06)
Alt.2 86.45 (12.17) 46.45 (11.91) 16.50 (16.99) 10.00 (13.13)
Alt.3 28.88 (3.14) 7.71 (2.52) 14.35 (4.89) 16.35 (7.19)
Alt.4 21.25 (2.71) 140.65 (30.84) 14.95 (10.79) 65.15 (62.45)

Note. B1–2 represent the coefficient settings, R1–3 represent the correlation structure settings, and Θ1–2 represent the regulation relationship settings. Abbreviations: FP, false positive; TP, true positive.

TABLE 2.

Summary results of the simulation under setting P2 with a total of 70 true positives: mean (sd) from 200 replicates

Coefficient Correlation Approach TP (Θ1) FP (Θ1) TP (Θ2) FP (Θ2)
B1 R1 Proposed 68.85(0.88) 0.40(0.50) 67.30(3.26) 2.30(5.65)
Alt.1 63.30(15.69) 34.80(23.26) 33.95(20.68) 6.68(10.37)
Alt.2 65.25(4.27) 31.85(8.55) 52.15(10.98) 38.30(36.79)
Alt.3 22.25(4.46) 7.85(4.49) 20.95(4.32) 23.85(9.28)
Alt.4 15.30(2.75) 129.45(42.29) 28.10(4.10) 127.65(43.63)
R2 Proposed 57.15(18.43) 10.50(6.36) 53.90(12.49) 2.65(2.21)
Alt.1 42.55(28.65) 18.00(25.46) 38.25(15.21) 4.40(8.18)
Alt.2 42.50(21.19) 24.05(14.57) 28.00(17.26) 3.70(6.14)
Alt.3 24.50(2.50) 10.90(8.42) 18.00(11.31) 23.00(24.04)
Alt.4 14.30(1.95) 107.05(30.44) 16.95(5.31) 83.85(51.99)
R3 Proposed 67.15(6.71) 1.40(3.98) 51.90(13.63) 2.55(2.74)
Alt.1 61.00(19.74) 27.80(17.56) 34.25(6.54) 8.10(14.49)
Alt.2 65.35(5.05) 36.35(11.94) 44.45(15.43) 47.00(43.04)
Alt.3 22.75(2.90) 9.95(7.49) 15.75(4.88) 18.85(12.33)
Alt.4 15.40(2.26) 117.65(30.91) 19.20(5.69) 105.85(61.27)
B2 R1 Proposed 69.75(0.55) 1.70(6.67) 67.05(5.88) 10.50(11.00)
Alt.1 69.75(0.44) 32.30(11.68) 65.40(7.38) 28.65(30.47)
Alt.2 66.40(5.23) 38.15(30.67) 46.00(12.02) 2.30(5.25)
Alt.3 25.79(5.54) 11.05(17.48) 20.95(4.32) 23.85(9.28)
Alt.4 16.65(2.21) 136.00(36.41) 36.25(4.27) 150.50(45.17)
R2 Proposed 67.15(11.35) 1.15(3.77) 57.10(11.11) 4.55(4.08)
Alt.1 69.80(0.52) 33.70(12.69) 41.85(15.79) 16.35(9.42)
Alt.2 58.15(9.91) 36.55(11.76) 43.55(15.74) 11.50(13.61)
Alt.3 26.85(2.89) 6.85(2.13) 18.00(11.31) 23.00(24.04)
Alt.4 17.50(2.65) 148.40(45.18) 18.90(6.21) 91.40(53.36)
R3 Proposed 69.85(0.37) 2.80(7.25) 56.77(8.12) 3.69(6.32)
Alt.1 68.90(4.46) 32.95(9.29) 39.35(12.57) 19.35(18.79)
Alt.2 66.35(6.67) 35.45(7.99) 50.05(11.98) 7.15(14.18)
Alt.3 26.70(3.37) 7.40(3.42) 15.75(4.88) 18.85(12.33)
Alt.4 16.75(1.97) 133.65(32.08) 16.20(6.05) 63.40(43.57)

Note. B1–2 represent the coefficient settings, R1–3 represent the correlation structure settings, and Θ1–2 represent the regulation relationship settings. Abbreviations: FP, false positive; TP, true positive.

We have examined additional simulation scenarios, including those with outcomes generated based on the original G and R measurements, zero-inflated gene expressions, binary environmental factors, and various settings for p, q, M, and n. Similar patterns are observed and reported in the Supporting Information.

4 | DATA ANALYSIS

The Cancer Genome Atlas (TCGA) is one of the largest data resources with multidimensional profiling. TCGA data have been analyzed in interaction analyses with one type of molecular measurement as well as integrated modeling with the main effects of multiple types of molecular measurements. We analyze data on lung adenocarcinoma (LUAD) and cutaneous melanoma (SKCM) (NCI and NHGRI, 2021). Data are downloaded from TCGA using R package cgdsr.

4.1 | Analysis of LUAD data

The response of interest is the reference value for the pre-bronchodilator forced expiratory volume in one second in percent (FEV1). This is an important indicator of lung capacity, with a lower value suggesting a potential functional disorder of the lungs, and has been shown to be a powerful indicator of future morbidity and mortality (Young et al., 2007). It is continuously distributed and ranges from 1.95 to 156 with a mean of 80.58 and a standard deviation of 23.55. We focus on the primary tumor samples of Caucasians. Methodologically speaking, it is straightforward to include data from other races in the analysis and use race as an additional variable in modeling. However, in the TCGA LUAD (and SKCM) data, the number of observations of other races is extremely low, and this imbalance may lead to unreliable estimations. Restricting the analysis to samples from Caucasians has also been done by Jiang et al. (2016) and other studies. For E factors, we consider age, American Joint Committee on Cancer (AJCC) tumor pathologic stage (Stage), tobacco smoking history indicator (Smoking), and gender, all of which have been extensively investigated in the literature. We analyze mRNA gene expression measurements collected using the Illumina HiSeq 2000 RNA Sequencing Version 2 analysis platform. For regulators, we include CNV measurements collected using the Genome-Wide Human SNP Array 6.0 platform and DNA methylation measurements collected using the Illumina Infinium HumanMethylation450 platform. In total, 18,345 gene expression, 23,321 CNV, and 15,288 methylation measurements are available for each subject. In principle, the proposed approach can be directly applied. However, considering that only a small number of molecular measurements are potentially associated with the outcome and that the analysis may be unstable with the high dimensionality and small sample size, we conduct a prescreening. 
Specifically, we select the top 1000 molecular measurements with the smallest p-values using marginal regression. This leads to 164 subjects with 467 gene expression and 533 regulator (316 CNV and 217 methylation) measurements for downstream analysis.
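A simplified version of such a prescreening step is sketched below. It ranks measurements by absolute marginal correlation with the outcome, which gives the same ordering as the p-values from unadjusted simple linear regressions fitted on the same samples; whether the marginal regressions in the actual analysis also adjust for the E factors is not stated here, so this equivalence is an assumption of the sketch. The simulated data and names are hypothetical.

```python
import numpy as np

def top_k_marginal(y, M, k):
    """Indices of the k columns of M with the largest absolute correlation with y."""
    yc = y - y.mean()
    Mc = M - M.mean(axis=0)
    denom = np.linalg.norm(yc) * np.linalg.norm(Mc, axis=0)
    # constant columns get correlation 0 instead of a division-by-zero warning
    corr = np.divide(Mc.T @ yc, denom, out=np.zeros(M.shape[1]), where=denom > 0)
    return np.argsort(-np.abs(corr))[:k]

rng = np.random.default_rng(5)
n_subjects, n_features = 164, 2000
M = rng.standard_normal((n_subjects, n_features))
y = 2.0 * M[:, 5] - 2.0 * M[:, 17] + rng.standard_normal(n_subjects)
keep = top_k_marginal(y, M, k=10)
```

In the analysis above the same idea is applied with k = 1000 across the pooled gene expression, CNV, and methylation measurements.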

The proposed analysis identifies 20 regulatory modules in Step I, and each module on average contains 11.70 gene expression and 7.35 regulator measurements. A graphical representation of the modules is provided in Figure S2 in the Supporting Information, in which some overlaps of modules are observed. In the interaction analysis, the proposed approach identifies 62 main molecular effects and 29 M–E interactions, of which 50 main effects and 27 interactions correspond to 6 regulatory modules. The identified main effects consist of 41 gene expression, 9 CNV, and 12 methylation measurements, and the identified interactions consist of 20 with gene expressions and 9 with methylations. Detailed estimation results are presented in Table 3. We examine the biological implications of our findings by mining the published literature and conducting a gene ontology (GO) enrichment analysis. Details are provided in the Supporting Information.

TABLE 3.

Analysis of the LUAD data using the proposed method

Group Type Gene Main Age Stage Smoking Gender
0.010 −0.031 −0.201 −0.067
1 GE VIT 0.006
1 GE PRH1 0.007
1 GE NOXRED1 0.006
1 GE RYR3 0.007
1 GE SERPINB11 0.007
1 GE ZNF273 0.004
1 GE WRAP53 0.003
1 GE SNORA7B 0.006
1 GE GUCY2F 0.007
1 GE STATH 0.007
1 GE CACNG6 0.007
1 DM WIPI2 −0.005
3 GE LINC00922 −0.059 0.009 0.001 0.004
3 GE NDP −0.059 0.008 0.001 0.004
3 GE TNMD −0.055 0.008 0.001 0.004
3 GE IBSP −0.055 0.008 0.001 0.004
3 GE PWRN1 −0.053 0.008 0.001 0.004
3 GE CACNG3 −0.053 0.008 0.001 0.004
3 DM MIS18A −0.045 0.007 0.001 0.003
3 DM RRP1 −0.036 0.005 0.001 0.002
3 DM ZDHHC2 −0.044 0.006 0.001 0.003
9 GE ZXDA −0.014
9 GE EXOSC8 0.022
9 GE EPSTI1 0.020
9 GE UGT2B4 −0.012
9 CNV SLC22A10 0.009
9 CNV PABPC5 0.018
9 DM ATP8A2 0.010
9 DM DHX32 0.013
15 GE KL −0.016
15 CNV MAP4K4 −0.013
15 CNV KCMF1 −0.016
15 DM SATB2 −0.013
16 GE HIST1H2AA −0.009
16 GE KCNIP3 −0.008
16 GE LRRTM3 −0.011
16 GE DCLRE1A −0.012
16 GE PPP1R3D −0.007
16 GE NHLRC2 −0.009
16 GE NPAP1 −0.010
16 CNV MAP4K4 −0.011
20 GE FTSJ1 0.013
20 GE DGUOK 0.012
20 GE SESN3 −0.008
20 GE CAPZB 0.009
20 CNV PABPC5 0.005
20 CNV MRGPRD 0.008
20 DM IL17D 0.008
20 DM ATP8A2 0.006
20 DM DHX32 0.004
21 GE AFF3 −0.106 0.147 −0.013
27 GE SGPP2 −0.009
47 GE FNIP2 −0.027
50 GE C11orf65 0.005
68 GE DRD3 0.012
102 GE DPRX 0.026
124 GE PRIMA1 −0.016
178 GE FAM217B −0.013
304 CNV AK4 0.014
319 CNV MIR582 −0.024
423 DM HOXA1 −0.027
520 DM SDE2 0.014

Note. Group is the regulatory module membership or individual molecular measurement; Type is the type of molecular measurement; Main is the main molecular effect. The first row under the environmental risk factors includes the main E effects, and the rows further down include the M–E interactions.

Abbreviations: CNV, copy number variation; DM, DNA methylation; GE, gene expression.

Analyses are also conducted using the alternative approaches. In Table S5 in the Supporting Information, we report the detailed estimation results using Alt.3. This alternative identifies fewer regulators, whose effects may be “disguised” by gene expressions because the regulations are not effectively described. In Table S6 in the Supporting Information, we provide the numbers of identified main effects and interactions, as well as the numbers of overlaps and RV coefficients between the identifications using different approaches (Smilde et al., 2009). An RV coefficient measures the common information of two data matrices. It lies between 0 and 1, and a larger value indicates a higher degree of overlap. We observe that different approaches select significantly different sets of factors, with moderate overlaps as measured by the RV coefficients. In practical data analysis, it is difficult to objectively evaluate identification accuracy. To provide indirect support, we evaluate prediction performance and selection stability. Specifically, for prediction evaluation, we consider the prediction mean squared error (PMSE) based on 200 random resamplings (9/10 training and 1/10 testing samples). The proposed approach has competitive performance with an average PMSE of 1.02, compared to 1.25 (Alt.1), 1.16 (Alt.2), 1.05 (Alt.3), and 1.02 (Alt.4). We also assess selection stability using the observed occurrence index (OOI) (Huang et al., 2006). For each identified main effect (interaction), the OOI computes the selection frequency in the 200 resamplings, and a larger value suggests higher stability. The proposed approach has satisfactory stability with an average OOI of 0.77, compared to 0.53 (Alt.1), 0.45 (Alt.2), 0.26 (Alt.3), and 0.21 (Alt.4).

4.2 |. Analysis of SKCM data

The response is overall survival, which is subject to right censoring. We focus on the primary tumor samples of Caucasians. For E variables, we consider age, AJCC tumor pathologic stage (Stage), gender, and Clark level at diagnosis (Clark), all of which have been suggested as associated with melanoma prognosis in the literature. In total, 18,925 gene expression, 23,287 CNV, and 15,616 methylation measurements are available. We utilize the same prescreening procedure as in the previous analysis, and the data used for downstream analysis contains 314 gene expression and 686 regulator (397 CNV and 289 methylation) measurements from 231 subjects, of which 139 died during follow-up. The observed times range from 2.04 to 357.10 months, with a median of 56.31 months.

The proposed analysis identifies 17 regulatory modules, each including on average 7.60 gene expressions and 6.45 regulators. A graphical representation is provided in Figure S2 in the Supporting Information. The AFT model is adopted to model survival. In total, 28 main effects and 12 interactions are identified, of which 14 main effects belong to 1 regulatory module and the remaining are related to the individual molecular units. The identified main effects consist of 15 gene expressions and 13 methylation loci, and the identified interactions correspond to 9 with gene expressions and 3 with methylations. The estimated coefficients are presented in Table 4. Results of the biological interpretations and GO enrichment analysis are provided in the Supporting Information.

TABLE 4.

Analysis of the SKCM data using the proposed method

Group Type Gene Main Age Stage Gender Clark
−0.176 −0.099 0.150 −0.042
14 GE MYCNOS 0.003
14 GE MRGPRX3 0.004
14 GE MFSD6L 0.005
14 GE IMP3 0.005
14 GE TBC1D7 0.003
14 GE A2M 0.004
14 GE NEURL2 0.005
14 GE IL24 0.004
14 DM MAU2 0.004
14 DM ZDHHC4 0.004
14 DM ENOX1 0.002
14 DM PTPN12 0.005
14 DM BRF2 0.002
14 DM SYT6 0.002
70 GE DSTYK −0.054 0.123 −0.164
71 GE GLDN 0.044 −0.058 0.012
82 GE RBP2 −0.029 −0.026
124 GE SATB2 −0.057 −0.032 0.034
153 GE RPL36AL 0.003
204 GE RNPS1 −0.057 0.112 0.084
214 GE ARL6IP1 −0.014
573 DM DPY19L3 0.006
640 DM RABEP1 −0.080 −0.071 0.010
647 DM SLU7 −0.004
654 DM KLHL31 −0.023
696 DM GLMP −0.016
714 DM BNIP1 −0.025
759 DM MS4A15 0.055 −0.045

Note. Group is the regulatory module membership or individual molecular measurement; Type is the type of molecular measurement; Main is the main molecular effect. The first row under the environmental risk factors includes the main E effects, and the rows further down include the M–E interactions.

Abbreviations: CNV, copy number variation; DM, DNA methylation; GE, gene expression.

We also conduct analyses using the alternatives and provide the estimation results using Alt.3 and the comparison results in Tables S7 and S6 in the Supporting Information, respectively. Patterns similar to those in the previous analysis are observed: the different approaches have small numbers of overlapping identifications and moderate RV coefficients. We also conduct the prediction and selection stability evaluations. With the censored survival response, we adopt the C statistic to measure prediction accuracy (Uno et al., 2011). A larger C statistic indicates better prediction. The proposed approach has an average C statistic of 0.60, compared to 0.57 (Alt.1), 0.48 (Alt.2), 0.47 (Alt.3), and 0.59 (Alt.4). In addition, it has superior selection stability with an average OOI of 0.74, compared to 0.56 (Alt.1), 0.50 (Alt.2), 0.38 (Alt.3), and 0.26 (Alt.4).
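To make the C statistic concrete, the following minimal sketch computes an unweighted concordance index for right-censored data. It is a simplification: the statistic of Uno et al. (2011) additionally reweights comparable pairs using the estimated censoring distribution and a truncation time, neither of which is implemented here. A pair of subjects is treated as comparable when the subject with the shorter observed time experienced the event. Predictions are taken to be on the survival-time scale, as with an AFT model, so a larger prediction should correspond to longer survival. The function name and the toy data are hypothetical.

```python
def concordance_index(times, events, predicted_times):
    """Unweighted concordance for right-censored survival data.

    times: observed follow-up times.
    events: 1 if death was observed, 0 if the time is censored.
    predicted_times: predictions on the survival-time scale
        (larger value means longer predicted survival).
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # the shorter time in a pair must be an observed event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if predicted_times[i] < predicted_times[j]:
                    concordant += 1.0
                elif predicted_times[i] == predicted_times[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical): four subjects, the third one censored.
times = [5.0, 10.0, 15.0, 20.0]
events = [1, 1, 0, 1]

c_perfect = concordance_index(times, events, [1.0, 2.0, 3.0, 4.0])
c_reversed = concordance_index(times, events, [4.0, 3.0, 2.0, 1.0])
c_constant = concordance_index(times, events, [1.0, 1.0, 1.0, 1.0])
print(c_perfect, c_reversed, c_constant)
```

A value of 0.5 corresponds to uninformative predictions, which gives context for reading the averages reported above against each other.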

5 |. DISCUSSION

Modeling the outcomes of complex diseases is an “old” but still widely open problem. In this study, we have developed the M–E interaction analysis, which is the natural next step from the existing literature. In particular, this analysis is built on and also advances from the existing gene–environment interaction analysis by incorporating multiple types of molecular measurements (which have overlapping and independent information) in a single analysis. It also advances from the existing multidimensional molecular data analysis by incorporating interactions and respecting the hierarchical structure. The proposed approach has sound biological and statistical bases. Its working characteristics have been carefully examined, and our simulation and data analyses have demonstrated its satisfactory performance.

It remains an open question how to best accommodate multidimensional molecular data in modeling. The proposed Step I of this analysis has been motivated by Wang et al. (2012), Zhu et al. (2016), and several other studies. Similar to the literature, linear modeling and regularized estimation have been applied to estimate the regulations. Different from the literature, biclustering has been conducted to identify local regulations. This differs from Zhu et al. (2016) and others by relaxing the rank-one constraint. For the K-S test, we have followed the literature and adopted the 0.05 cutoff. Other approaches (e.g., the false discovery rate-controlling procedure) can also be considered. The Step II dimension reduction analysis can also be achieved using other techniques such as partial least squares, which may also lead to reasonable performance. We also note that Steps I and II both adopt unsupervised methods, which exploit multidimensional molecular information without using disease outcome information. As such, there is a need to avoid overfitting and establish a joint interaction model in the final step. There are alternative techniques for the interaction analysis in Step III. For example, when there are highly overlapping regulatory modules with potentially high correlations, an additional L2 penalty can be incorporated, so that important overlapping molecular measurements can be selected simultaneously. In our data analysis, strong overlaps have not been observed, making it unnecessary to pursue this additional penalty.

In our analysis, both G and R have been standardized following common practice, and there is no assumption on the cardinalities of Cs and Ds in Step II. Thus, the proposed approach is applicable when G and R are on different scales or when there is a difference in the cardinalities of Cs and Ds. Although the proposed analysis does not directly accommodate missing data, ordinary techniques such as imputation can be incorporated to handle missingness. We have identified important interactions based on the estimated regression coefficients, as opposed to statistical inference. High-dimensional inference is challenging, and extensive research may be needed for inference with the proposed analysis. We also note that the M–E interaction analysis inevitably involves multiple steps, given the complexity of multidimensional molecular data and disease mechanisms. In fact, even “simpler” multidimensional data integrations often involve multiple steps (Chen et al., 2013; Tseng et al., 2015; Zhu et al., 2016). Although our analysis consists of three steps, its implementation is rather straightforward, especially with the developed R code. We have used gene expressions and their regulators for descriptions. The proposed approach can be directly applied to other and potentially more complex data structures, thus enjoying broad applicability.

Supplementary Material

ACKNOWLEDGMENTS

The authors thank the editors and reviewers for their careful review and insightful comments. This work was supported by the National Institutes of Health [CA204120, CA241699, CA216017]; National Science Foundation [1916251]; Yale Cancer Center Pilot Award; National Natural Science Foundation of China [12071273]; Program for Innovative Research Team of Shanghai University of Finance and Economics; and Shanghai Pujiang Program [19PJ1403600].

Footnotes

SUPPORTING INFORMATION

Additional computation, simulation, and data analysis referenced in Sections 2, 3, and 4, along with the R code, are available with this paper at the Biometrics website on Wiley Online Library. The R code is also publicly available at https://github.com/shuanggema/omics_interaction.

DATA AVAILABILITY STATEMENT

The data that support the findings in this paper are openly available in TCGA (The Cancer Genome Atlas) at https://portal.gdc.cancer.gov/.

REFERENCES

  1. Bien J, Taylor J and Tibshirani R (2013) A lasso for hierarchical interactions. Annals of Statistics, 41, 1111–1141.
  2. Bookman EB, McAllister K, Gillanders E, Wanke K, Balshaw D, Rutter J et al. (2011) Gene-environment interplay in common complex diseases: forging an integrative model—recommendations from an NIH workshop. Genetic Epidemiology, 35, 217–225.
  3. Chen Y, Wu X and Jiang R (2013) Integrating human omics data to prioritize candidate genes. BMC Medical Genomics, 6, 57.
  4. Gross SM and Tibshirani R (2014) Collaborative regression. Biostatistics, 16, 326–338.
  5. Helgeson ES, Liu Q, Chen G, Kosorok MR and Bair E (2020) Biclustering via sparse clustering. Biometrics, 76, 348–358.
  6. Huang J, Ma S and Xie H (2006) Regularized estimation in the accelerated failure time model with high-dimensional covariates. Biometrics, 62, 813–820.
  7. Huang YT, Lin X, Liu Y, Chirieac LR, McGovern R, Wain J et al. (2011) Cigarette smoking increases copy number alterations in nonsmall-cell lung cancer. Proceedings of the National Academy of Sciences, 108, 16345–16350.
  8. Hunter DJ (2005) Gene–environment interactions in human diseases. Nature Reviews Genetics, 6, 287–298.
  9. Liu S and Trenkler G (2008) Hadamard, Khatri–Rao, Kronecker and other matrix products. International Journal of Information and Systems Sciences, 4, 160–177.
  10. Jiang Y, Shi X, Zhao Q, Krauthammer M, Rothberg BEG and Ma S (2016) Integrated analysis of multidimensional omics data on cutaneous melanoma prognosis. Genomics, 107, 223–230.
  11. Kristensen VN, Lingjærde OC, Russnes HG, Vollan HKM, Frigessi A and Borresen-Dale AL (2014) Principles and methods of integrative genomic analyses in cancer. Nature Reviews Cancer, 14, 299–313.
  12. McAllister K, Mechanic LE, Amos C, Aschard H, Blair IA, Chatterjee N et al. (2017) Current challenges and new opportunities for gene–environment interaction studies of complex diseases. American Journal of Epidemiology, 186, 753–761.
  13. NCI and NHGRI (2021) The Cancer Genome Atlas. Available at: https://portal.gdc.cancer.gov/ [Accessed on 15 May 2021].
  14. Risch A and Plass C (2008) Lung cancer epigenetics and genetics. International Journal of Cancer, 123, 1–7.
  15. Ritchie MD, Davis JR, Aschard H, Battle A, Conti D, Du M et al. (2017) Incorporation of biological knowledge into the study of gene–environment interactions. American Journal of Epidemiology, 186, 771–777.
  16. Smilde AK, Kiers H, Bijlsma S, Rubingh CM and Van E (2009) Matrix correlations for high-dimensional data: the modified RV-coefficient. Bioinformatics, 25, 401–405.
  17. Shi X, Zhao Q, Huang J, Xie Y and Ma S (2015) Deciphering the associations between gene expression and copy number alteration using a sparse double Laplacian shrinkage approach. Bioinformatics, 31, 3977–3983.
  18. Sordillo JE, Kelly R, Bunyavanich S, McGeachie M, Qiu W, Croteau-Chonka DC et al. (2015) Genome-wide expression profiles identify potential targets for gene–environment interactions in asthma severity. Journal of Allergy and Clinical Immunology, 136, 885–892.
  19. Simonds NI, Ghazarian AA, Pimentel CB, Schully SD, Ellison GL, Gillanders EM et al. (2016) Review of the gene–environment interaction literature in cancer: what do we know? Genetic Epidemiology, 40, 356–365.
  20. Teschendorff AE, Yang Z, Wong A, Pipinikas CP, Jiao Y, Jones A et al. (2015) Correlation of smoking-associated DNA methylation changes in buccal cells with DNA methylation changes in epithelial cancer. JAMA Oncology, 1, 476–485.
  21. Thomas D (2010) Gene–environment-wide association studies: emerging approaches. Nature Reviews Genetics, 11, 259–272.
  22. Tseng G, Ghosh D and Zhou XJ (2015) Integrating Omics Data. Cambridge, UK: Cambridge University Press.
  23. Uno H, Cai T, Pencina MJ, D’Agostino RB and Wei LJ (2011) On the C-statistics for evaluating overall adequacy of risk prediction procedures with censored survival data. Statistics in Medicine, 30, 1105–1117.
  24. Wang W, Baladandayuthapani V, Morris JS, Broom BM, Manyam G and Do KA (2012) iBAG: integrative Bayesian analysis of high-dimensional multiplatform genomics data. Bioinformatics, 29, 149–159.
  25. Wu C, Zhou F, Ren J, Li X, Jiang Y and Ma S (2019) A selective review of multi-level omics data integration using variable selection. High-throughput, 8, 4.
  26. Wu M, Zhang Q and Ma S (2020) Structured gene–environment interaction analysis. Biometrics, 76, 23–35.
  27. Young R, Hopkins R and Eaton T (2007) Forced expiratory volume in one second: not just a lung function test but a marker of premature death from all causes. European Respiratory Journal, 30, 616–622.
  28. Zhu R, Zhao Q, Zhao H and Ma S (2016) Integrating multidimensional omics data for cancer outcome. Biostatistics, 17, 605–618.
