Abstract
High-throughput cancer studies have been extensively conducted, searching for genetic markers associated with outcomes beyond clinical and environmental risk factors. Gene–environment interactions can have important implications beyond main effects. The commonly adopted single-marker analysis cannot accommodate the joint effects of a large number of markers. The existing joint-effects methods also have limitations. Specifically, they may suffer from high computational cost, fail to respect the “main effect, interaction” hierarchical structure, or use ineffective techniques. We develop a penalization method for the identification of important G × E interactions and main effects. It has an intuitive formulation, respects the hierarchical structure, accommodates the joint effects of multiple markers, and is computationally affordable. In the numerical study, we analyze prognosis data under the AFT (accelerated failure time) model. Simulation shows satisfactory performance of the proposed method. Analysis of an NHL (non-Hodgkin lymphoma) study with SNP measurements shows that the proposed method identifies markers with important implications and has satisfactory prediction performance.
Keywords: Gene–environment interaction, Penalized marker identification, Cancer prognosis
1. Introduction
In cancer research, high-throughput profiling has been extensively conducted, searching for genetic markers that are independently associated with outcomes and phenotypes beyond clinical and environmental risk factors. Many studies have focused on the main effects of risk factors. Recent studies have shown that the interactions between genetic and clinical/environmental risk factors, also referred to as G × E interactions, can have important implications beyond the main effects. Comprehensive discussions of the existing methods can be found in the literature [1,7,14,15]. Multiple families of approaches have been developed, including for example the joint approach and the stratification approach. In this study, we focus on the statistical modeling approach, where interactions are described using products of variables in statistical models.
Denote Y as the cancer outcome or phenotype. It can be a continuous marker, categorical cancer status, or cancer survival time. Denote Z = (Z1, ..., Zp) as the p SNPs (genes, or other genetic functional units) and X = (X1, ..., Xq) as the q clinical/environmental risk factors. Assume n iid samples. A popular approach proceeds as follows. (1) For j = 1, ..., p, fit the regression model $Y \sim \phi\left(\sum_{k=1}^{q} \alpha_k X_k + \gamma_j Z_j + \sum_{k=1}^{q} \beta_{kj} X_k Z_j\right)$, where ϕ is the known link function. With for example a binary Y, ϕ can be the logistic link. αks, γj, and βkjs are the unknown regression coefficients. As the number of regression coefficients is usually much smaller than n, for each j, this step can be carried out using standard techniques and software. Denote pkj as the p-value of $\hat\beta_{kj}$, the estimate of βkj; (2) With {pkj : k = 1, ..., q, j = 1, ..., p}, conduct multiple comparison adjustment. Approaches such as the FDR (false discovery rate) can be applied to identify significant interactions. Multiple existing approaches [7,15] belong to this category. Different approaches may differ in terms of statistical models, hypothesis testing methods, and multiple comparison adjustment techniques. However, they share the common strategy of analyzing one or a small number of genetic markers at a time. There are also non-parametric, more robust approaches, for example the MDR (multifactor dimensionality reduction [12]), that share a similar single-marker strategy. The most significant advantage of these approaches is computational simplicity. As Step (1) only involves low-dimensional models and can be conducted in a parallel manner, the overall computational cost is low.
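Step (2) of the marginal approach can be sketched as follows. This is a minimal illustration only: it assumes the Benjamini–Hochberg step-up rule as the FDR procedure (the text does not name a specific one), and the function name `benjamini_hochberg` and the q × p array of p-values are hypothetical.

```python
import numpy as np

def benjamini_hochberg(pvals, level=0.05):
    """Flag p-values rejected by the Benjamini-Hochberg step-up rule (assumed FDR method)."""
    p = np.asarray(pvals, dtype=float)
    flat = p.ravel()
    m = flat.size
    order = np.argsort(flat)
    ranked = flat[order]
    # Find the largest rank i with p_(i) <= level * i / m.
    below = ranked <= level * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        cutoff = np.max(np.nonzero(below)[0])
        reject[order[: cutoff + 1]] = True
    # Same shape as the input, e.g. (q, p) for interaction p-values p_kj.
    return reject.reshape(p.shape)
```

Applied to the array of interaction p-values, the returned mask marks the (k, j) pairs declared significant at the chosen FDR level.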
The development and progression of cancer are associated with the combined effects of multiple clinical, environmental, and genetic risk factors, and their interactions. Except under strong independence conditions (which are unlikely to hold in practice), estimation and inference based on the marginal models are biased. With additive genetic effects, the significant discrepancy between marginal and joint analyses has been thoroughly discussed [16,18].
In this article, we study G × E interactions in high-throughput cancer studies. A model-based approach is adopted, which detects important interactions by estimating the βkjs. Unlike some alternatives, particularly single-marker analysis, the proposed approach simultaneously considers the joint effects of a large number of genetic markers and their interactions with clinical/environmental risk factors. For the identification of important interactions and main effects, a penalization method is adopted, with which identification amounts to finding those terms with nonzero estimated regression coefficients. Under the present setting, one advantage of penalization is that it can accommodate a large number of interactions and main effects with affordable computational cost. More importantly, it naturally respects the “main effect–interaction” hierarchy. More details are described in Section 2. In the numerical study, we use cancer prognosis data as an example and assume the AFT (accelerated failure time) model, which has a simple form and low computational cost.
2. Penalized identification of G × E interactions
Consider the effects of all p genetic markers, q clinical/environmental risk factors, and their p × q interactions:
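$$Y \sim \phi\Big(\alpha_0 + \sum_{k=1}^{q} \alpha_k X_k + \sum_{j=1}^{p} \gamma_j Z_j + \sum_{j=1}^{p}\sum_{k=1}^{q} \beta_{kj} X_k Z_j\Big) = \phi\Big(\alpha_0 + \alpha' X + \sum_{j=1}^{p} b_j' W_j\Big). \tag{1}$$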
α0 is the intercept, which may be needed for linear, logistic, and other models. bj = (γj, β1j, ..., βqj)′ and Wj = (Zj, X1Zj, ..., XqZj)′. Wj and bj represent all effects (main and interactions) corresponding to the jth SNP. Denote α = (α1, ..., αq)′, $b = (b_1', \ldots, b_p')'$, and $W = (W_1', \ldots, W_p')'$.
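The grouped design can be assembled directly from Z and X. The sketch below assumes numeric arrays and a column ordering in which the q + 1 columns of each SNP are adjacent; the function name `build_group_design` is hypothetical.

```python
import numpy as np

def build_group_design(Z, X):
    """Stack W_j = (Z_j, X_1 Z_j, ..., X_q Z_j) for j = 1, ..., p.

    Z: (n, p) genetic markers; X: (n, q) clinical/environmental factors.
    Returns an (n, p * (q + 1)) matrix; columns j*(q+1) to (j+1)*(q+1)-1
    hold the main effect and interactions of SNP j (0-based j).
    """
    Z = np.asarray(Z, dtype=float)
    X = np.asarray(X, dtype=float)
    n, p = Z.shape
    q = X.shape[1]
    # Prepend a column of ones so that Z_j * [1, X_1, ..., X_q] gives W_j.
    X1 = np.hstack([np.ones((n, 1)), X])      # (n, q + 1)
    W = Z[:, :, None] * X1[:, None, :]        # (n, p, q + 1)
    return W.reshape(n, p * (q + 1))
```

Keeping each SNP's columns contiguous makes it straightforward to treat bj as one group in the penalty.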
There are multiple ways of describing the relationship between main effects and interactions [3]. The first and simplest way is to treat interactions and main effects in the same manner. The second is the weak hierarchy, under which if an interaction term is identified as important, then at least one of the two corresponding main effects will be identified. The third is the strong hierarchy, under which if an interaction term is identified, then both of the corresponding main effects will be identified. The first approach is overly simplified and may not be sensible. In a recent study, Bien and others [3] develop a penalization approach for gene–gene interactions that respects the strong hierarchy. Even with a moderate number of variables, the number of all second-order interactions can be prohibitively large. In addition, respecting the strong hierarchy poses highly nontrivial constraints. Thus, the approach in [3] can only accommodate a moderate number of variables.
The present study differs from [3] and others. As the number of clinical/environmental risk factors is usually low, we are able to accommodate a much larger number of genetic risk factors. Clinical/environmental risk factors have a “different status” from genetic markers. They are usually pre-selected, with evidence from previous studies of being associated with cancer. They may always be included in the model, and their selection is not of interest. Thus in this study, the respected hierarchical structure states that if a G × E interaction term is selected, then the corresponding main genetic effect is selected.
Under model (1), denote L(α,b) as the goodness-of-fit measure. For example, it can be the negative log–likelihood function. In our numerical example, L(α,b) is from an estimating equation. Consider the penalized estimate
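$$(\hat\alpha, \hat b) = \underset{\alpha,\, b}{\arg\min}\ \Big\{ L(\alpha, b) + \sum_{j=1}^{p} \rho\big(\|b_j\|; \sqrt{d}\lambda_1, \xi\big) + \sum_{j=1}^{p}\sum_{k=2}^{q+1} \rho\big(|b_{jk}|; \lambda_2, \xi\big) \Big\}. \tag{2}$$
Here $\rho(t; \lambda, \xi) = \lambda \int_0^{|t|} \left(1 - \frac{x}{\lambda\xi}\right)_+ dx$ is the MCP penalty (minimax concave penalty [17]). It has estimation and selection properties better than some alternative penalization methods such as Lasso and comparable to others such as bridge and SCAD. d = q + 1 is the “size” of bj and can be absorbed into λ1. We use the present notation to be consistent with the literature. λ1 and λ2 are data-dependent tuning parameters. ξ is the regularization parameter. bjk is the kth element of bj.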
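For reference, assuming the standard MCP definition of [17], $\rho(t; \lambda, \xi) = \lambda \int_0^{|t|} (1 - x/(\lambda\xi))_+ \, dx$, the integral can be evaluated in closed form:

$$\rho(t; \lambda, \xi) = \begin{cases} \lambda |t| - \dfrac{t^2}{2\xi}, & |t| \le \xi\lambda, \\[2mm] \dfrac{\xi\lambda^2}{2}, & |t| > \xi\lambda, \end{cases} \qquad \frac{\partial \rho}{\partial |t|} = \Big(\lambda - \frac{|t|}{\xi}\Big)_+ .$$

The penalty thus starts with the same slope λ as Lasso at zero and becomes flat beyond ξλ, so coefficients larger than ξλ in magnitude incur no additional penalization.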
Formulation (2) has been motivated by the following considerations. In our analysis, SNPs are the functional units. The penalty is the sum over p terms, with one for each SNP. Effect of the jth SNP is represented with bj, a length q + 1 vector. The first penalty in (2) determines whether bj ≡ 0, that is, whether the jth SNP has any effect at all. This step is achieved using the group MCP (gMCP) penalty [6]. If bj ≠ 0, then either the main effect or interaction or both are nonzero. In the second penalty, we penalize the interaction terms and determine which are nonzero. This step amounts to examining the q individual interaction terms and is achieved using the MCP penalty. The sum of the two penalties can thus identify important SNPs as well as important interaction terms. Clinical and environmental risk factors are not subject to penalized selection.
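The two-part penalty described above can be evaluated as in the following sketch. It assumes the group term is MCP applied to ‖bj‖ with tuning parameter √d·λ1 and the individual term is MCP applied to each interaction coefficient with tuning parameter λ2; the function names and the (p, q + 1) layout of the coefficient array are hypothetical choices for illustration.

```python
import numpy as np

def mcp(t, lam, xi):
    """MCP value: lam*|t| - t^2/(2*xi) if |t| <= xi*lam, else xi*lam^2/2."""
    t = np.abs(np.asarray(t, dtype=float))
    return np.where(t <= xi * lam, lam * t - t**2 / (2.0 * xi), 0.5 * xi * lam**2)

def composite_penalty(B, lam1, lam2, xi):
    """Penalty for B of shape (p, q + 1); row j is b_j = (gamma_j, beta_1j, ..., beta_qj)."""
    B = np.asarray(B, dtype=float)
    d = B.shape[1]                                   # group size q + 1
    group_norms = np.linalg.norm(B, axis=1)
    # First penalty: decides whether SNP j has any effect at all.
    group_part = mcp(group_norms, np.sqrt(d) * lam1, xi).sum()
    # Second penalty: applied to interactions only (columns 1..q), not to gamma_j.
    inter_part = mcp(B[:, 1:], lam2, xi).sum()
    return float(group_part + inter_part)
```

Because the main effect γj enters only through the group norm, an interaction cannot be nonzero while the whole group is set to zero, which is how this form of penalty is intended to encode the hierarchy.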
Using penalization for high-dimensional marker selection has been studied in a large number of publications. Because of the “main effect, interaction” hierarchy, simple penalization such as MCP or gMCP is insufficient. The proposed penalty shares a similar spirit with that in [5]. However, there are significant differences. First, the data settings differ: in this study, one “group” corresponds to one SNP and its interactions, as opposed to multiple variables. Second, to respect the specific hierarchical structure, the individual penalties are only imposed on the interactions. Third, we replace Lasso-type penalties with MCP penalties, which under simpler settings have been shown to have better performance. The Lasso-type penalization developed in [3] respects the strong hierarchy. It is computationally much more complicated and hence cannot accommodate a large number of markers. In addition, it treats all variables in the same manner and cannot discriminate between genetic markers and clinical/environmental risk factors.
2.1. Computation
First consider a linear regression model
$$Y = \alpha_0 + \alpha' X + \sum_{j=1}^{p} b_j' W_j + \varepsilon,$$
where ε is the random error. Assume n iid observations {(Yi, Xi, Zi), i = 1, ..., n}. Under proper normalization, α0 = 0. Denote Y as the n-vector composed of Yis and X and W as the matrices composed of Xis and Wis, respectively. In the matrix form, the least squares objective function is $L(\alpha, b) = \frac{1}{2n}\|Y - X\alpha - Wb\|^2$, where ∥ · ∥ is the ℓ2 norm.
Consider the following iterative algorithm: (i) Initialize b̂ = 0 component-wise; (ii) Compute $\hat\alpha = \arg\min_{\alpha} L(\alpha, \hat b) = (X'X)^{-1}X'(Y - W\hat b)$; (iii) Compute b̂ as the minimizer of (2) with α fixed at $\hat\alpha$; (iv) Iterate Steps (ii) and (iii) until convergence. In this algorithm, the most challenging step is (iii), for which we adopt a group coordinate descent (GCD) algorithm. The GCD algorithm optimizes the objective function with respect to one group of regression coefficients at a time (which correspond to the main and interaction effects of one SNP), and iteratively cycles through all groups. The overall cycling is repeated until convergence. Details of the GCD algorithm are provided in the Appendix.
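The outer loop of this algorithm can be sketched as follows. The sketch assumes the linear model with least squares loss, so that the α step is an ordinary least squares fit on the partial residual. The b step is passed in as a function; `unpenalized_update` is only a stand-in used to make the example runnable and is not the GCD update, whose details are in the Appendix. All names are hypothetical.

```python
import numpy as np

def alternate(Y, X, W, update_b, max_iter=200, tol=1e-8):
    """Alternate between the closed-form alpha step and a user-supplied b step."""
    b = np.zeros(W.shape[1])                          # step (i)
    alpha = np.zeros(X.shape[1])
    for _ in range(max_iter):
        # Step (ii): least squares for alpha with b held fixed.
        alpha_new = np.linalg.lstsq(X, Y - W @ b, rcond=None)[0]
        # Step (iii): update b with alpha fixed (penalized in the actual method).
        b_new = update_b(Y - X @ alpha_new, W, b)
        change = max(np.max(np.abs(alpha_new - alpha)), np.max(np.abs(b_new - b)))
        alpha, b = alpha_new, b_new
        if change < tol:                              # step (iv)
            break
    return alpha, b

def unpenalized_update(residual, W, b_current):
    """Stand-in b step (ordinary least squares); NOT the GCD update of the paper."""
    return np.linalg.lstsq(W, residual, rcond=None)[0]
```

With the stand-in update and a full-rank low-dimensional design, the loop converges to the joint least squares fit, which gives a simple check of the alternating structure before a penalized b step is plugged in.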
Consider other types of data and models, for example right censored survival data under the Cox model or binary data under the logistic regression model. Here the objective function is the negative log-likelihood function. Consider the following iterative algorithm: (i) Initialize $\hat\alpha = 0$ and b̂ = 0 component-wise; (ii) at the current estimate $(\hat\alpha, \hat b)$, make a Taylor expansion of the objective function and keep the linear and quadratic terms; (iii) call the algorithm developed for linear regression; (iv) repeat Steps (ii) and (iii) until convergence.
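As an illustration of Step (ii), consider the logistic model with linear predictor $\eta_i = \alpha' X_i + b' W_i$ and negative log-likelihood $\ell(\eta) = \sum_{i=1}^{n} \{\log(1 + e^{\eta_i}) - Y_i \eta_i\}$. The following is the standard second-order expansion around a current value $\tilde\eta$ and is given here as a sketch, not necessarily the exact form used in our implementation:

$$\ell(\eta) \approx \frac{1}{2} \sum_{i=1}^{n} w_i \,(\tilde Y_i - \eta_i)^2 + \text{const}, \qquad \tilde\pi_i = \frac{1}{1 + e^{-\tilde\eta_i}}, \quad w_i = \tilde\pi_i (1 - \tilde\pi_i), \quad \tilde Y_i = \tilde\eta_i + \frac{Y_i - \tilde\pi_i}{w_i}.$$

The approximation is a weighted least squares criterion in the working response $\tilde Y_i$, so the linear regression algorithm can be applied after rescaling each observation by $\sqrt{w_i}$.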
The proposed penalty contains tuning parameters λ1, λ2 and regularization parameter ξ. Smaller values of ξ are better at retaining the unbiasedness of the MCP penalties for larger coefficients, but they also risk creating nonconvex objective functions that are difficult to optimize and yield solutions that are discontinuous with respect to λ1, λ2. It is therefore advisable to choose a ξ value that is big enough to avoid this problem but not too big. In our numerical study, we consider ξ values including 1.8, 3, 6, and 10 [17]. Larger λ1, λ2 lead to fewer identified markers. In our numerical study, we jointly search for the optimal values of (λ1, λ2, ξ) using V-fold cross validation with V = 5. As the proposed algorithm only involves simple calculations, the proposed approach is computationally feasible. For example, the analysis of one simulated dataset with n = 250 takes less than ten minutes on a regular desktop PC.
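The role of ξ can be seen in the simplest case of a single standardized covariate, where the MCP-penalized least squares problem has a well-known closed-form solution (sometimes called firm thresholding). The sketch below assumes that univariate setting and ξ > 1; it is not the group update used in the proposed algorithm, and the function name is hypothetical.

```python
import numpy as np

def mcp_threshold(z, lam, xi):
    """Minimizer of 0.5*(z - theta)^2 + MCP(theta; lam, xi), assuming xi > 1."""
    if xi <= 1:
        raise ValueError("xi must exceed 1 for a unique minimizer")
    z = np.asarray(z, dtype=float)
    # Soft-thresholding, then rescaling to offset the shrinkage.
    soft = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)
    # Inputs beyond xi*lam are returned unchanged (no bias for large effects).
    return np.where(np.abs(z) <= xi * lam, soft / (1.0 - 1.0 / xi), z)
```

Inputs larger than ξλ in magnitude are left unshrunk, so a smaller ξ widens the unbiased range, while the rescaling factor 1/(1 − 1/ξ) grows without bound as ξ approaches 1, which reflects the instability noted above.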
3. Simulation
As a specific example for demonstrating the proposed method, we consider right censored survival data under the AFT model. Details on the data settings and estimation procedure are described in the Appendix. The simulation settings are as follows. The SNP values are generated using a two-step approach. We first generate a 1000-dimensional vector with a multivariate normal distribution. The marginal means are equal to zero and marginal variances equal to one. We consider two correlation structures. The first is the auto-regressive correlation structure where the jth and kth components have correlation coefficient $\rho^{|j-k|}$. We consider ρ = 0.2, 0.5, and 0.8, corresponding to weak, moderate, and strong correlation, respectively. The second is the banded correlation structure. Here two scenarios are considered. Under the first scenario, the jth and kth elements have correlation coefficient 0.33 if |j – k| = 1 and 0 otherwise. Under the second scenario, the jth and kth elements have correlation coefficient 0.6 if |j – k| = 1, 0.33 if |j – k| = 2, and 0 otherwise. For each component of the 1000-dimensional vector, we discretize at the 1st and 3rd quartiles and generate the 3-level SNP value. For each subject, we simulate three clinical/environmental risk factors, with one continuously and two categorically distributed. For any two clinical/environmental risk factors, the correlation coefficient is 0.5. Under this simulated setting, there are a total of 1003 main effects and 3000 interactions. Among them, six main effects and six interactions are set as associated with prognosis, with regression coefficients generated from Unif[0.2, 0.8]. We generate the log event times from the AFT model with intercept equal to zero and standard normally distributed random errors. The log censoring times are independently generated from a normal distribution. The censoring rate is about 40%. We simulate a total of 150 or 250 observations.
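The two-step SNP generation under the auto-regressive structure can be sketched as below. Several details are assumptions made for illustration: the cut points are taken to be the theoretical quartiles of the standard normal distribution, the three levels are coded 0, 1, 2, and the latent vector is generated by an AR(1) recursion, which yields the stated correlation without forming the 1000 × 1000 covariance matrix.

```python
import numpy as np

Q1, Q3 = -0.6744897501960817, 0.6744897501960817    # N(0, 1) quartiles

def simulate_snps_ar(n, p, rho, rng):
    """Generate an (n, p) matrix of 3-level SNPs with latent correlation rho^|j-k|."""
    latent = np.empty((n, p))
    latent[:, 0] = rng.standard_normal(n)
    scale = np.sqrt(1.0 - rho**2)
    for j in range(1, p):
        # AR(1) recursion keeps unit marginal variance and corr(z_j, z_k) = rho^|j-k|.
        latent[:, j] = rho * latent[:, j - 1] + scale * rng.standard_normal(n)
    # Cut at the 1st and 3rd quartiles: 0 below Q1, 1 in between, 2 above Q3.
    return (latent > Q1).astype(int) + (latent > Q3).astype(int)
```

For example, `simulate_snps_ar(250, 1000, 0.5, np.random.default_rng(1))` produces one design under the moderate correlation setting; the banded structures would instead require sampling from the corresponding covariance matrix.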
For comparison, we consider the following alternative approaches. [Alt.1] is a marginal analysis approach as described in Section 1. Here it serves as a benchmark. [Alt.2] is a penalization approach, where the objective function is $L(\alpha, b) + \sum_{j=1}^{p}\sum_{k=1}^{q+1} \rho(|b_{jk}|; \lambda, \xi)$, with ρ the MCP penalty. Here we impose a penalty on each main effect and interaction. This approach can conduct selection but does not respect the hierarchical structure. [Alt.3] is also a penalization approach, where the objective function is $L(\alpha, b) + \sum_{j=1}^{p} \rho(\|b_j\|; \sqrt{d}\lambda, \xi)$. This approach evaluates whether a SNP is associated with prognosis at all, but does not discriminate whether the association comes from the main effect or interaction. We acknowledge that a large number of approaches can be used to analyze the simulated data. These three approaches are chosen for comparison as Alt.1 has been adopted in a large number of studies and can serve as a benchmark, and as comparison with Alt.2 and Alt.3 can directly establish the merit of each penalty term in formulation (2).
Summary statistics based on 200 replicates are shown in Table 1. Under all simulation settings, the proposed approach is able to identify the majority or all of the true positive interactions, while having a small number of false positives. Performance of the proposed approach depends on the correlation structure, strength of correlation, and sample size, as has been observed in other penalization studies. The proposed approach also has satisfactory performance with the main effects. The marginal approach Alt.1 identifies a very small number of true positives. The unsatisfactory performance can be explained by the fact that interactions and main effects important in the joint model are not necessarily important in the marginal models, especially when there exist correlations among variables. Alt.2 has satisfactory performance identifying the important main effects. However, without respecting the hierarchical structure, it identifies fewer true positive interactions compared with the proposed approach. Alt.3 identifies a large number of false interactions. Such an observation is expected, as under this approach, if an SNP has nonzero interaction effects with at least one clinical/environmental risk factor, it is concluded to have interactions with all clinical/environmental risk factors. Additional simulation results are reported in the Appendix. Similar conclusions can be drawn.
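For clarity, the TP and FP counts reported for one replicate can be computed as in the following sketch, where the index sets are hypothetical identifiers of the selected and truly nonzero terms (main effects and interactions are counted separately).

```python
def count_tp_fp(selected, truth):
    """Return (true positives, false positives) for collections of term indices."""
    selected, truth = set(selected), set(truth)
    return len(selected & truth), len(selected - truth)
```

The table entries are then the mean and standard deviation of these counts over the 200 replicates.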
Table 1.
Simulation: mean (sd) based on 200 replicates. TP/FP: true/false positives, for main effects and interactions separately. There are 6 nonzero main effects and 6 nonzero interactions.
| Method | Correlation | n | Main effect TP | Main effect FP | Interaction TP | Interaction FP |
|---|---|---|---|---|---|---|
| Alt.1 | AR ρ = 0.2 | 150 | 0.36(0.61) | 0.03(0.17) | 0.28(0.53) | 0.67(1.56) |
| | AR ρ = 0.5 | 150 | 1.96(1.49) | 1.59(10.55) | 0.23(0.49) | 1.44(7.72) |
| | AR ρ = 0.8 | 150 | 5.46(0.74) | 0.63(3.63) | 0.66(0.78) | 0.86(1.68) |
| | AR ρ = 0.2 | 250 | 1.00(1.12) | 0.11(0.82) | 0.87(0.91) | 2.36(14.48) |
| | AR ρ = 0.5 | 250 | 4.05(1.25) | 5.18(50.49) | 0.60(0.82) | 3.05(17.53) |
| | AR ρ = 0.8 | 250 | 5.98(0.14) | 1.45(5.79) | 1.63(1.17) | 2.22(4.74) |
| | Banded 1 | 150 | 0.60(0.97) | 0.25(1.31) | 0.18(0.41) | 0.47(1.10) |
| | Banded 2 | 150 | 2.46(1.35) | 0.03(0.30) | 0.16(0.39) | 0.66(1.37) |
| | Banded 1 | 250 | 1.75(1.40) | 0.05(0.26) | 0.70(0.80) | 1.35(5.73) |
| | Banded 2 | 250 | 4.34(1.13) | 0.09(0.55) | 0.49(0.73) | 1.05(3.11) |
| Alt.2 | AR ρ = 0.2 | 150 | 5.74(0.54) | 17.30(7.88) | 3.61(1.29) | 4.81(4.29) |
| | AR ρ = 0.5 | 150 | 5.53(0.66) | 14.68(6.67) | 3.72(1.39) | 3.62(3.21) |
| | AR ρ = 0.8 | 150 | 4.72(0.71) | 10.46(4.97) | 3.32(1.11) | 3.77(3.05) |
| | AR ρ = 0.2 | 250 | 5.98(0.14) | 12.82(5.25) | 5.27(0.78) | 2.01(1.92) |
| | AR ρ = 0.5 | 250 | 5.90(0.30) | 14.00(5.63) | 5.30(0.73) | 2.48(2.08) |
| | AR ρ = 0.8 | 250 | 5.29(0.52) | 9.97(4.59) | 4.69(0.66) | 2.58(2.12) |
| | Banded 1 | 150 | 5.77(0.51) | 15.97(6.14) | 3.55(1.55) | 3.91(3.37) |
| | Banded 2 | 150 | 5.14(0.85) | 15.07(6.07) | 3.11(1.57) | 4.95(3.98) |
| | Banded 1 | 250 | 5.96(0.20) | 13.88(6.01) | 5.41(0.75) | 2.26(1.82) |
| | Banded 2 | 250 | 5.70(0.48) | 13.49(5.49) | 4.91(0.85) | 2.64(1.89) |
| Alt.3 | AR ρ = 0.2 | 150 | 6.00(0.00) | 12.12(4.76) | 6.00(0.00) | 48.36(14.28) |
| | AR ρ = 0.5 | 150 | 5.96(0.20) | 13.34(5.29) | 5.96(0.20) | 51.94(15.85) |
| | AR ρ = 0.8 | 150 | 5.30(0.83) | 12.62(5.47) | 5.30(0.83) | 48.46(17.00) |
| | AR ρ = 0.2 | 250 | 6.00(0.00) | 8.25(6.86) | 6.00(0.00) | 36.75(20.58) |
| | AR ρ = 0.5 | 250 | 6.00(0.00) | 8.66(5.60) | 6.00(0.00) | 37.98(16.80) |
| | AR ρ = 0.8 | 250 | 5.97(0.17) | 14.16(6.73) | 5.97(0.17) | 54.42(20.30) |
| | Banded 1 | 150 | 5.98(0.14) | 12.71(4.72) | 5.98(0.14) | 50.09(14.11) |
| | Banded 2 | 150 | 5.81(0.49) | 14.96(5.20) | 5.81(0.49) | 56.50(15.79) |
| | Banded 1 | 250 | 6.00(0.00) | 8.72(6.33) | 6.00(0.00) | 38.16(18.98) |
| | Banded 2 | 250 | 5.99(0.10) | 12.24(7.12) | 5.99(0.10) | 48.70(21.39) |
| Proposed | AR ρ = 0.2 | 150 | 6.00(0.00) | 19.34(6.81) | 5.75(0.50) | 4.40(3.01) |
| | AR ρ = 0.5 | 150 | 5.98(0.14) | 19.75(6.92) | 5.64(0.63) | 4.64(3.35) |
| | AR ρ = 0.8 | 150 | 5.58(0.62) | 20.10(6.10) | 4.97(1.09) | 6.10(3.53) |
| | AR ρ = 0.2 | 250 | 6.00(0.00) | 11.23(8.89) | 5.99(0.10) | 2.36(3.48) |
| | AR ρ = 0.5 | 250 | 6.00(0.00) | 10.88(7.51) | 5.97(0.17) | 2.12(2.50) |
| | AR ρ = 0.8 | 250 | 5.98(0.14) | 14.53(8.58) | 5.77(0.45) | 3.28(3.57) |
| | Banded 1 | 150 | 5.99(0.10) | 19.32(6.34) | 5.83(0.43) | 3.85(2.65) |
| | Banded 2 | 150 | 5.91(0.29) | 21.92(6.72) | 5.62(0.62) | 5.19(3.27) |
| | Banded 1 | 250 | 6.00(0.00) | 9.69(6.68) | 5.95(0.22) | 1.90(2.01) |
| | Banded 2 | 250 | 6.00(0.00) | 11.85(8.51) | 5.95(0.22) | 2.33(3.17) |
4. Analysis of an NHL prognosis study
NHL is a heterogeneous group of lymphocytic disorders ranging in aggressiveness from very indolent cellular proliferation to highly aggressive and rapidly proliferative process. It is the fifth leading cause of cancer incidence and mortality in the US and remains poorly understood and largely incurable. Our group conducted a genetic association study, searching for risk factors associated with the overall survival in NHL patients [19,20]. The prognostic cohort consists of 575 NHL patients, among whom 496 donated either blood or buccal cell samples. All cases were classified into NHL subtypes according to the World Health Organization classification system. Specifically, 155 had DLBCL (diffuse large B-cell lymphoma), 117 had FL (follicular lymphoma), 57 had CLL/SLL (chronic lymphocytic leukemia/small lymphocytic lymphoma), 34 had MZBL (marginal zone B-cell lymphoma), 37 had T/NK-cell lymphoma, and 96 had other subtypes. The study cohort was assembled in Connecticut between 1996 and 2000. Vital status of all subjects was abstracted from CTR (Connecticut Tumor Registry) in 2010. In our analysis, we first analyze the whole cohort. In addition, we also analyze DLBCL, the largest subtype. Other subtypes are not analyzed because of sample size consideration.
When genotyping, we took a candidate gene approach. A total of 1462 tag SNPs from 210 candidate genes related to immune response were genotyped using a custom-designed GoldenGate Assay. In addition, 302 SNPs in 143 candidate genes previously genotyped by TaqMan Assay were also included. There were a total of 1764 SNPs, representing 333 genes. We remove patients with more than 20% SNPs missing and then remove SNPs with more than 20% measurements missing. The genotyping data were missing for the following reasons: the amount of DNA was too low, samples failed to amplify, samples amplified but their genotype could not be determined due to ambiguous results, or the DNA quality was poor. We then impute missing SNP measurements. A total of 1633 SNPs pass processing, representing 238 genes.
For the whole cohort, 346 patients pass processing. Among them, 159 died, with survival times ranging from 0.04 to 11.01 years (mean 4.23 years). For the 187 censored patients, the follow-up times range from 4.85 to 11.50 years (mean 9.00 years). For DLBCL, 139 patients pass processing. Among them, 61 died, with survival times ranging from 0.47 to 10.46 years (mean 4.16 years). For the 78 censored patients, the follow-up times range from 5.58 to 11.45 years (mean 9.08 years).
The following demographic and clinical factors were measured: age (rescaled to mean zero and variance one), education (level 1 = high school or less; level 2 = some college; level 3 = college graduate or more), tumor stage (levels 1–4 and unknown), B-symptom presence (no; yes; unknown), and initial treatment (none; surgery; radiation; chemotherapy; other). They include all widely accepted prognostic factors [18].
We recognize that this dataset has limitations. For example, a candidate gene approach was adopted, which may miss important markers during profiling. In addition, the sample size is limited. However, genetic association studies on NHL prognosis are extremely limited [10,18]. To the best of our knowledge, this study is among the largest. Analysis of this dataset may suggest potential markers and provide valuable information for future larger studies. It can also test the practical feasibility of the proposed approach.
We apply the proposed approach and the three alternatives described in the last section. In addition, we also apply MCP penalization to the main effects only (additive effects of clinical/environmental and genetic risk factors). The difference in marker identification is summarized in Table 2. For all subtypes combined, the proposed approach identifies 11 main effects and 59 interactions. For DLBCL, it identifies 5 main effects and 24 interactions. Table 2 suggests that the proposed approach identifies markers significantly different from the alternatives. Detailed results for the proposed approach are presented in Table 3 (all subtypes combined) and Table 4 (DLBCL). Results for the alternative approaches are provided in Tables 6–10 (all subtypes combined) and 11–14 (DLBCL) in the Appendix. In Tables 3 and 4, the small magnitudes of the estimated regression coefficients are caused by the logarithm transformation, which makes the event times “clustered”. With the proposed approach, the dummy variables for a categorical clinical/environmental risk factor (for example tumor stage) are not selected as a whole. We intentionally design the proposed approach this way, so that it may identify which levels differ from the baseline. The proposed approach can be modified so that the dummy variables corresponding to the same risk factor are selected together.
Table 2.
Analysis of NHL data: main effects and interactions identified by different methods and overlaps. In each cell, number of identified main effects/number of identified interactions. MCP: MCP penalization applied to the main effects only.
| | MCP | Alt.1 | Alt.2 | Alt.3 | Proposed |
|---|---|---|---|---|---|
| All subtypes combined | | | | | |
| MCP | 23/0 | 2/0 | 3/0 | 1/0 | 8/0 |
| Alt.1 | | 3/0 | 1/0 | 0/0 | 2/0 |
| Alt.2 | | | 9/36 | 0/3 | 1/5 |
| Alt.3 | | | | 4/52 | 3/26 |
| Proposed | | | | | 11/59 |
| DLBCL | | | | | |
| MCP | 13/0 | 0/0 | 2/0 | 1/0 | 1/0 |
| Alt.1 | | 0/1 | 0/0 | 0/0 | 0/0 |
| Alt.2 | | | 2/9 | 1/1 | 0/0 |
| Alt.3 | | | | 4/52 | 1/9 |
| Proposed | | | | | 5/24 |
Table 3.
Analysis of NHL overall survival (all subtypes combined): identified main effects and interactions.
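| SNP | Main | Interaction: Age | Interaction: Education, level 2 | Interaction: Education, level 3 | Interaction: Tumor stage, level 2 | Interaction: Tumor stage, level 3 | Interaction: Tumor stage, level 4 | Interaction: Tumor stage, unknown |
|---|---|---|---|---|---|---|---|---|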
| MBL2_03 | –0.2355 | –0.0088 | 0.0059 | –0.0294 | ||||
| ALOXE3_03 | 0.0147 | –0.0031 | 0.0076 | |||||
| C4BPA_04 | –0.3286 | –0.0190 | –0.0266 | |||||
| CCR4_01 | 0.0516 | 0.0184 | –0.0023 | –0.0486 | –0.0051 | 0.0460 | ||
| MASP1_69 | –0.1931 | 0.0401 | –0.0495 | |||||
| MIF_16 | –0.1901 | –0.0299 | 0.0374 | –0.0784 | –0.0809 | |||
| NCF4_35 | –0.0023 | –0.0003 | –0.0001 | |||||
| NOS1_18 | 0.0053 | 0.0040 | 0.0132 | |||||
| SELP_26 | –0.1533 | –0.1072 | ||||||
| STAT4_33 | –0.3080 | 0.0650 | –0.0180 | |||||
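| VCAM1_02 | –0.0021 | –0.0001 | 0.0001 | –0.0010 | ||||

| SNP | Interaction: B-symptom, yes | Interaction: B-symptom, unknown | Interaction: Initial treatment, surgery | Interaction: Initial treatment, radiation | Interaction: Initial treatment, chemo | Interaction: Initial treatment, other |
|---|---|---|---|---|---|---|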
| MBL2_03 | 0.1570 | 0.0749 | ||||
| ALOXE3_03 | ||||||
| C4BPA_04 | 0.0276 | –0.0158 | 0.0144 | |||
| CCR4_01 | 0.0060 | –0.0019 | 0.0458 | |||
| MASP1_69 | ||||||
| MIF_16 | –0.0031 | 0.0808 | 0.0270 | –0.0024 | 0.0115 | |
| NCF4_35 | 0.0005 | 0.0002 | ||||
| NOS1_18 | –0.0013 | –0.0006 | 0.0015 | |||
| SELP_26 | –0.0184 | 0.0146 | 0.0078 | |||
| STAT4_33 | 0.0265 | 0.0655 | 0.0524 | 0.0260 | ||
| VCAM1_02 | 0.0009 | –0.0019 | 0.0004 | 0.0010 | –0.0009 | 0.0005 |
Table 4.
Analysis of DLBCL overall survival: identified main effects and interactions.
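| SNP | Main | Interaction: Age | Interaction: Education, level 2 | Interaction: Education, level 3 | Interaction: Tumor stage, level 2 | Interaction: Tumor stage, level 3 | Interaction: Tumor stage, level 4 | Interaction: Tumor stage, unknown |
|---|---|---|---|---|---|---|---|---|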
| CCND1_01 | –0.1445 | 0.1385 | –0.0800 | |||||
| C5_15 | –1.6346 | 1.0346 | 1.7137 | 0.2060 | ||||
| CCR7_03 | 1.2630 | 0.8515 | 1.1905 | –0.5183 | –0.7506 | 1.0540 | ||
| RAC2_20 | –0.0327 | –0.0081 | ||||||
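| SOCS4_01 | 0.0324 | –0.0011 | 0.0113 | |||||

| SNP | Interaction: B-symptom, yes | Interaction: B-symptom, unknown | Interaction: Initial treatment, surgery | Interaction: Initial treatment, radiation | Interaction: Initial treatment, chemo | Interaction: Initial treatment, other |
|---|---|---|---|---|---|---|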
| CCND1_01 | 0.1084 | |||||
| C5_15 | 0.9845 | |||||
| CCR7_03 | –0.9726 | 0.3497 | 0.7904 | –0.0765 | ||
| RAC2_20 | 0.0187 | –0.0076 | 0.0027 | –0.0052 | ||
| SOCS4_01 | –0.0032 | |||||
Searching published literature suggests that the identified genes have important implications. For all subtypes combined, the protein encoded by gene MBL2 belongs to the collectin family and is an important element in the innate immune system. Deficiencies of this gene have been associated with susceptibility to autoimmune and infectious diseases. Polymorphisms of MBL2 have been associated with NHL [13]. Mutations of MBL2 are suggested to be associated with NHL patients' survival [11]. Gene C4BPA encodes a member of a superfamily of proteins composed predominantly of tandemly arrayed short consensus repeats of approximately 60 amino acids. It is found to be significantly over-expressed in cancer patients with non-metastatic solid tumors [2]. Protein encoded by gene CCR4 belongs to the G-protein-coupled receptor family. It is a receptor for the CC chemokines, which play fundamental roles in the development, homeostasis, and function of the immune system, and have effects on cells of the central nervous system as well as on endothelial cells involved in angiogenesis or angiostasis. MASP1 encodes a serine protease that functions as a component of the lectin pathway of complement activation. The complement pathway plays an essential role in the innate and adaptive immune response. MIF is a ubiquitously expressed pro-inflammatory mediator that has also been implicated in the process of oncogenic transformation and tumor progression. Deletion of the MIF gene in mice has been shown to have several major consequences for the proliferative and transforming properties of cells. MIF-deficient cells exhibit increased resistance to oncogenic transformation [4]. The protein encoded by gene NCF4 is a cytosolic regulatory component of the superoxide-producing phagocyte NADPH-oxidase, a multicomponent enzyme system important for host defense. This gene belongs to the oxidative stress pathway, which is associated with NHL risk [8]. 
The protein encoded by gene NOS1 belongs to the family of nitric oxide synthases, which synthesize nitric oxide from l-arginine. Nitric oxide is a reactive free radical, which acts as a biologic mediator in several processes, including neurotransmission and antimicrobial and antitumoral activities. Gene SELP plays an important role in the pathogenesis of inflammation, thrombosis, and the growth and metastasis of cancers. The protein encoded by gene STAT4 is a member of the STAT protein family, which plays important roles in lymphoma prognosis. Gene VCAM1 is a member of the Ig superfamily and encodes a cell surface sialoglycoprotein expressed by cytokine-activated endothelium. This type I membrane protein mediates leukocyte-endothelial cell adhesion and signal transduction.
Among the genes identified for DLBCL, the protein encoded by gene CCND1 belongs to the highly conserved cyclin family, whose members are characterized by a dramatic periodicity in protein abundance throughout the cell cycle. This protein has been shown to interact with tumor suppressor protein Rb, and the expression of this gene is regulated positively by Rb. Mutations, amplification and overexpression of this gene, which alter cell cycle progression, are observed frequently in a variety of tumors and may contribute to tumorigenesis. The protein encoded by gene C5 is the fifth component of complement, which plays an important role in inflammatory and cell killing processes. Mutations in this gene cause complement component 5 deficiency, a disease where patients show a propensity for severe recurrent infections. The protein encoded by gene CCR7 is a member of the G protein-coupled receptor family. This receptor was identified as a gene induced by the Epstein–Barr virus (EBV), and is thought to be a mediator of EBV effects on B lymphocytes. This receptor is expressed in various lymphoid tissues and activates B and T lymphocytes. It has been shown to control the migration of memory T cells to inflamed tissues, as well as stimulate dendritic cell maturation. The protein encoded by gene RAC2 is a GTPase which belongs to the RAS superfamily of small GTP-binding proteins. Members of this superfamily appear to regulate a diverse array of cellular events, including the control of cell growth, cytoskeletal reorganization, and activation of protein kinases. SOCS4 is a negative feedback regulator of EGF signaling, and has significantly attenuated expression in tumor tissue [9].
In the literature, genetic markers for NHL prognosis are still under extensive debate [17]. Our literature review suggests that the genetic markers identified using the proposed approach may have important implications. However, we are unable to evaluate whether they are “more meaningful” than those identified using the alternatives. Studies of G × E interactions are even more limited. Our analysis suggests that there may exist complex interactions between genetic effects and demographics, socioeconomic status, and the natural course of the disease. In addition, treatment and genetic markers also interact with each other. Compared with those of the main effects, the magnitudes of the interactions are usually smaller, but there is no clear pattern. For example, for SNP VCAM1_02, the interactions with different treatments have different signs and different magnitudes.
To further compare different approaches, we evaluate the prediction performance of the identified main effects and interactions. It should be noted that although this evaluation can be informative, prediction and marker identification are different analysis goals. We randomly sample 3/4 of the subjects without replacement to construct the training set. The corresponding testing set consists of the remaining subjects. We apply the proposed approach and the alternatives to the training set. The training set model is then used to make predictions for subjects in the testing set. We dichotomize the testing set risk scores at the median to create two risk groups, and compute the logrank statistic, which measures the survival difference between the two groups. To reduce the influence of an extreme split, we repeat the above process 500 times and compute the average logrank statistic. Alt.1 is a marginal approach, does not lead to a joint prediction model, and is not evaluated. For all subtypes combined, the logrank statistics are 9.1 (Alt.2), 5.6 (Alt.3), and 11.7 (proposed, p-value 0.0006). For DLBCL, the logrank statistics are 27.6 (Alt.2), 46.1 (Alt.3), and 61.4 (proposed, p-value < 0.0001). The proposed approach outperforms the two alternatives in prediction. It should be noted that the prediction evaluation is cross-validation-based and should be interpreted with caution.
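The split-and-evaluate procedure described above can be illustrated with a minimal Python sketch. It is not the code used in this study. The function names (`logrank_statistic`, `average_split_logrank`, `toy_fit`) and the `fit` callback interface are hypothetical, and `toy_fit` is a deliberately simple least-squares stand-in on log event times, not the proposed penalization method or any of the alternatives. Only NumPy is assumed.

```python
import numpy as np

def logrank_statistic(time, event, group):
    """Two-group logrank chi-square statistic (1 degree of freedom)."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)   # True = event observed, False = censored
    group = np.asarray(group, dtype=bool)   # True = first risk group
    obs_minus_exp, var = 0.0, 0.0
    for t in np.unique(time[event]):        # loop over distinct event times
        at_risk = time >= t
        n = at_risk.sum()                   # subjects at risk just before t
        n1 = (at_risk & group).sum()        # of those, in the first group
        died = (time == t) & event
        d = died.sum()                      # events at t
        d1 = (died & group).sum()           # events at t in the first group
        obs_minus_exp += d1 - d * n1 / n
        if n > 1:                           # hypergeometric variance term
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return obs_minus_exp ** 2 / var if var > 0 else 0.0

def average_split_logrank(x, time, event, fit, n_splits=500, train_frac=0.75, seed=0):
    """Average testing-set logrank statistic over repeated random splits.

    `fit(x_train, time_train, event_train)` is a hypothetical callback that
    returns a function mapping covariates to risk scores (higher = higher risk).
    """
    rng = np.random.default_rng(seed)
    n = len(time)
    n_train = int(round(train_frac * n))
    stats = []
    for _ in range(n_splits):
        perm = rng.permutation(n)           # sampling without replacement
        train, test = perm[:n_train], perm[n_train:]
        predict = fit(x[train], time[train], event[train])
        score = predict(x[test])
        high_risk = score > np.median(score)  # dichotomize at the median
        stats.append(logrank_statistic(time[test], event[test], high_risk))
    return float(np.mean(stats))

def toy_fit(x_train, time_train, event_train):
    """Placeholder model: least squares of log time on covariates, events only."""
    obs = np.asarray(event_train, dtype=bool)
    design = np.column_stack([np.ones(obs.sum()), x_train[obs]])
    coef, *_ = np.linalg.lstsq(design, np.log(time_train[obs]), rcond=None)
    # Shorter predicted survival is treated as higher risk.
    return lambda x_new: -(coef[0] + x_new @ coef[1:])
```

In use, `fit` would be replaced by a wrapper around whichever marker identification approach is being evaluated, with `x` containing the environmental factors, genetic markers, and their products. Averaging over many splits, as in the text, reduces the dependence of the result on any single partition.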
5. Discussion
In high-throughput cancer studies, it is of interest to identify G × E interactions which may be independently associated with outcomes/phenotypes beyond main effects. In this study, we propose a penalization approach for identifying G × E interactions. The proposed approach has an intuitive formulation, can accommodate the joint effects of a large number of markers, respects the “main effect, interaction” hierarchical structure, and is computationally feasible. The simulation study shows that it outperforms the alternatives by identifying more true positives and fewer false positives. In data analysis, it identifies markers different from those identified by the alternatives. The identified genetic markers have important implications and satisfactory prediction performance.
There is a vast literature on statistical methods for analyzing G × E interactions. The goal of this study is to develop a new alternative, which may have certain advantages. A comprehensive review and comparison are beyond the scope of this article. In the numerical study, we use censored survival data and the AFT model as an example. We conjecture that the proposed approach is also applicable to other data and model settings. The performance of a statistical method can be data-dependent, and examining the performance of the proposed method under other settings is postponed to future study. Data analysis in Section 4 and the Appendix shows that the proposed method can identify genetic markers different from those identified by the alternatives. Literature mining in Section 4 shows that the identified markers have important implications. In addition, they have satisfactory prediction performance. A future confirmation study is needed to fully validate the findings.
Acknowledgments
We thank the editor, associate editor, and three reviewers for very careful review and insightful comments, which have led to a significant improvement of the article. This study was supported by CA142774 and CA165923 from NIH, a pilot grant from the Yale Comprehensive Cancer Center, and 2012LD001 from National Bureau of Statistics of China.
Appendix A. Supplementary data
Supplementary data to this article can be found online at http://dx.doi.org/10.1016/j.ygeno.2013.08.006.
References
- 1. Amato R, Pinelli M, D'Andrea D, Miele G, Nicodemi M, Raiconi G, Cocozza S. A novel approach to simulate gene–environment interactions in complex diseases. BMC Bioinforma. 2010;11:8. doi: 10.1186/1471-2105-11-8.
- 2. Battistelli S, Vittoria A, Cappelli R, Stefanoni M, Roviello F. Protein S in cancer patients with non-metastatic solid tumours. Eur. J. Surg. Oncol. 2005;31(7):798–802. doi: 10.1016/j.ejso.2005.05.001.
- 3. Bien J, Taylor J, Tibshirani R. A Lasso for hierarchical interactions. Ann. Stat. 2013;41:1111–1141. doi: 10.1214/13-AOS1096.
- 4. Fingerle-Rowson G, Petrenko O. MIF coordinates the cell cycle with DNA damage checkpoints. Lessons from knockout mouse models. Cell Div. 2007;2:22. doi: 10.1186/1747-1028-2-22.
- 5. Friedman J, Hastie T, Tibshirani R. A note on the group Lasso and a sparse group Lasso. 2010. (arXiv:1001.0736)
- 6. Huang J, Wei F, Ma S. Semiparametric regression pursuit. Stat. Sin. 2012;22:1403–1426. doi: 10.5705/ss.2010.298.
- 7. Hunter DJ. Gene–environment interactions in human diseases. Nat. Rev. Genet. 2005;6:287–298. doi: 10.1038/nrg1578.
- 8. Kim C, Zheng T, Lan Q, Chen Y, Foss F, Chen X, Holford T, Leaderer B, Boyle P, Chanock SJ, Rothman N, Zhang Y. Genetic polymorphisms in oxidative stress pathway genes and modification of BMI and risk of non-Hodgkin lymphoma. Cancer Epidemiol. Biomarkers Prev. 2012;21(5):866–868. doi: 10.1158/1055-9965.EPI-12-0010.
- 9. Kobayashi D, Nomoto S, Kodera Y, Fujiwara M, Koike M, Nakayama G, Ohashi N, Nakao A. Suppressor of cytokine signaling 4 detected as a novel gastric cancer suppressor gene using double combination array analysis. World J. Surg. 2012;36(2):362–372. doi: 10.1007/s00268-011-1358-2.
- 10. Ma S. Risk factors of follicular lymphoma. Expert Opin. Med. Diagn. 2012;6(4):323–333. doi: 10.1517/17530059.2012.686996.
- 11. Martinez-Lopez J, Rivero A, Rapado I, Montalban C, Paz-Carreira J, Canales M, Martinez R, Sanchez-Godoy P, Fernandez de Sevilla A, Penalver FJ, Gonzalez M, Prieto E, Salar A, Burgaleta C, Queizan JA, Penarrubia MJ, Monteagudo MD, Cabrera C, De la Serna J, Tomas JF. Influence of MBL-2 mutations in the infection risk of patients with follicular lymphoma treated with rituximab, fludarabine, and cyclophosphamide. Leuk. Lymphoma. 2009;50(8):1283–1289. doi: 10.1080/10428190903040006.
- 12. Moore JH, Gilbert JC, Tsai CT, Chiang FT, Holden T, Barney N, White BC. A flexible computational framework for detecting, characterizing, and interpreting statistical patterns of epistasis in genetic studies of human disease susceptibility. J. Theor. Biol. 2006;241(2):252–261. doi: 10.1016/j.jtbi.2005.11.036.
- 13. Mullighan CG, Heatley S, Doherty K, Szabo F, Grigg A, Hughes TP, Schwarer AP, Szer J, Tait BD, Bik To L, Bardy PG. Mannose-binding lectin gene polymorphisms are associated with major infection following allogeneic hemopoietic stem cell transplantation. Blood. 2002;99(10):3524–3529. doi: 10.1182/blood.v99.10.3524.
- 14. North KE, Martin LJ. The importance of gene–environment interaction: implications for social scientists. Sociol. Methods Res. 2008;37:164–200.
- 15. Thomas D. Methods for investigating gene–environment interactions in candidate pathway and genome-wide association studies. Annu. Rev. Public Health. 2010;31:21–36. doi: 10.1146/annurev.publhealth.012809.103619.
- 16. Witten DM, Tibshirani R. Survival analysis with high-dimensional covariates. Stat. Methods Med. Res. 2010;19:29–51. doi: 10.1177/0962280209105024.
- 17. Zhang CH. Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 2010;38:894–942.
- 18. Zhang Y, Dai Y, Zheng T, Ma S. Risk factors of Non-Hodgkin lymphoma. Expert Opin. Med. Diagn. 2011;5:539–550. doi: 10.1517/17530059.2011.618185.
- 19. Zhang Y, Holford TR, Leaderer B, Boyle P, Zahm SH, Flynn S, Tallini G, Owens PH, Zheng T. Hair-coloring product use and risk of non-Hodgkin's lymphoma: a population-based case–control study in Connecticut. Am. J. Epidemiol. 2004;159:148–154. doi: 10.1093/aje/kwh033.
- 20. Zhang Y, Lan Q, Rothman N, Zhu Y, Zahm SH, Wang SS, Holford TR, Leaderer B, Boyle P, Zhang B, Zou K, Chanock S, Zheng T. A putative exonic splicing polymorphism in the BCL6 gene and the risk of non-Hodgkin lymphoma. J. Natl. Cancer Inst. 2005;97:1616–1618. doi: 10.1093/jnci/dji344.
