Abstract
Summary
For understanding complex diseases, gene–environment (G–E) interactions have important implications beyond main G and E effects. Most of the existing analysis approaches and software packages cannot accommodate data contamination/long-tailed distribution. We develop GEInter, a comprehensive R package tailored to robust G–E interaction analysis. For both marginal and joint analysis, for data without and with missingness, for continuous and censored survival responses, it comprehensively conducts identification, estimation, visualization and prediction. It can fill an important gap in the existing literature and enjoy broad applicability.
Availability and implementation
Supplementary information
Supplementary data are available at Bioinformatics online.
1 Introduction
For understanding complex diseases, gene–environment (G–E) interactions have important implications beyond main G and E effects (Mcallister et al., 2017; Thomas, 2010). In published studies, G factors include gene expressions, SNPs and other types of molecular measurements. E factors include environmental exposures as well as demographic, clinical and socioeconomic variables. G–E interaction analysis can be classified into marginal and joint analysis, depending on whether a small or large number of G factors are analyzed at a time (Thomas, 2010). Beyond high dimensionality and noisy data, G–E interaction analysis is uniquely challenged by the ‘main effects, interactions’ hierarchy, under which an interaction can be identified only if its main effects are also identified (Wu and Ma, 2019). Available G–E interaction analysis packages include PLINK (Purcell et al., 2007), rareGE (Chen et al., 2014), aGE (Yang et al., 2019), spinBayes (Ren et al., 2020) and others.
Most of the existing analysis approaches and software packages cannot accommodate data contamination/long-tailed distributions, which are not uncommon in practice (Wu and Ma, 2019). In addition, many cannot directly accommodate missingness in E measurements (in contrast, with the development of profiling techniques, missingness in G measurements is becoming increasingly rare). To fill this gap, we develop the GEInter package, which is tailored to robust G–E interaction analysis of data that may also have missing E measurements. As shown in Figure 1, GEInter is distinguished by its comprehensiveness: it conducts marginal and joint analysis, for data without and with missingness in E variables and for continuous and censored survival outcomes. It implements five recently published methods. It conducts identification, estimation, prediction and visualization.
Fig. 1.
Workflow. Pre/Res: robustness to contamination/long-tailed distribution in predictors/response
2 Materials and methods
For data without missingness, GEInter implements three approaches:
QPCorr (Xu et al., 2019) is a robust marginal analysis approach that can accommodate contaminated/long-tailed outcome data. It is built on quantile regression and adopts partial correlation to identify important interactions while properly controlling for main G and E effects. Two functions QPCorr.matrix and QPCorr.pval are developed to compute quantile partial correlations for interactions and their P values via a permutation method.
RobSBoosting (Wu et al., 2019) conducts robust joint analysis using the sparse boosting technique. It respects the interaction hierarchy by searching over only interactions whose main effects are already identified. It achieves robustness to contaminated/long-tailed response using the Huber loss and accommodates nonlinear effects of continuous E variables using B spline expansion. It is realized using function RobSBoosting.
PTReg (Xu et al., 2018) conducts robust joint analysis using penalized trimmed regression. It accommodates contamination/long-tailed distribution in both predictors and response. It conducts selection based on minimax concave penalty (MCP) and respects the interaction hierarchy using a decomposition technique. It is realized using function PTReg.
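To make the robustness idea behind the Huber loss used by RobSBoosting concrete, the following is a minimal base-R sketch. It is not the package's internal code: the helper name `huber_loss` and the threshold `delta = 1.345` are illustrative assumptions.

```r
# Illustrative Huber loss (hypothetical helper, not part of GEInter).
# Quadratic for small residuals and linear for large ones, so a
# contaminated response contributes less than under squared-error loss.
huber_loss <- function(r, delta = 1.345) {
  ifelse(abs(r) <= delta,
         0.5 * r^2,
         delta * (abs(r) - 0.5 * delta))
}

residuals <- c(-0.5, 0.2, 1.0, 8.0)   # the last value mimics an outlying response
squared   <- 0.5 * residuals^2        # squared-error loss on the same scale
huber     <- huber_loss(residuals)
print(cbind(residuals, squared, huber))
```

For the outlying residual, the Huber loss grows linearly rather than quadratically, which is what limits the influence of contaminated observations on the fit.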
In the above analyses, continuous responses are modeled using linear regression. For survival data, accelerated failure time models are adopted, and censoring is accounted for with Kaplan–Meier (KM) weights.
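As an illustration of how KM weights can be computed for right-censored data, the sketch below uses the standard Kaplan–Meier (Stute) weight formula in base R. The helper `km_weights` is hypothetical and is not a GEInter function; the package's own implementation may differ in details such as tie handling.

```r
# Kaplan-Meier (Stute) weights for right-censored data.
# Hypothetical helper for illustration; not a GEInter function.
# time:   observed times (event or censoring)
# status: 1 = event observed, 0 = censored
km_weights <- function(time, status) {
  n   <- length(time)
  ord <- order(time, -status)          # sort by time; events before censored at ties
  d   <- status[ord]
  w   <- numeric(n)
  surv_prod <- 1                       # running product over earlier observations
  for (i in seq_len(n)) {
    w[i] <- d[i] / (n - i + 1) * surv_prod
    surv_prod <- surv_prod * ((n - i) / (n - i + 1))^d[i]
  }
  out <- numeric(n)
  out[ord] <- w                        # return weights in the original order
  out
}

time   <- c(2, 5, 3, 8)                # toy data, made up for this example
status <- c(1, 0, 1, 1)
w <- km_weights(time, status)
print(w)
```

Censored observations receive zero weight, and their mass is redistributed to later events. Under an accelerated failure time model, such weights would typically enter a weighted loss on the log survival times.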
For data with missingness in E measurements, GEInter implements two approaches:
AugmBLMCP (Wu et al., 2017) includes two steps, which are realized using functions Augmented.data and BLMCP. In Step 1, subjects with missingness are augmented, and a nonparametric kernel-based weight is assigned to each augmented observation, achieving robustness to contamination/long-tailed distributions in both predictors and response. An additional KM weight is introduced to accommodate censoring. In Step 2, a bi-level penalization approach (BLMCP; Liu et al., 2013) is applied for interaction analysis.
MissBoosting (Wu et al., 2019) includes three steps, is realized using function Miss.boosting and accommodates contamination/long-tailed distribution in response. In Step 1, a multiple imputation approach is applied to accommodate missingness. In Step 2, for each imputed dataset, RobSBoosting is adopted for interaction analysis. In Step 3, results from Step 2 are combined using stability selection.
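The combination step of MissBoosting can be pictured with a small base-R sketch. The selection matrix below is made up, and the rule of keeping effects whose selection frequency across imputations reaches a threshold `tau` is an assumption about how stability selection is applied; the exact rule inside Miss.boosting is described in the original publication.

```r
# Toy combination step: rows = imputed datasets, columns = candidate effects.
# A 1 means the effect was selected in that imputed dataset (made-up values).
selected <- rbind(
  c(1, 0, 1, 0),
  c(1, 0, 0, 0),
  c(1, 1, 1, 0),
  c(1, 0, 0, 0),
  c(1, 0, 1, 0)
)
colnames(selected) <- c("G1", "G2", "G1:E1", "G2:E1")

tau  <- 0.3                            # selection-frequency threshold (assumed rule)
freq <- colMeans(selected)             # proportion of imputations selecting each effect
keep <- names(freq)[freq >= tau]
print(freq)
print(keep)
```

Effects that appear in only a few imputed datasets are dropped, so the final set is less sensitive to any single imputation.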
For QPCorr, interaction identification is achieved with P values from function QPCorr.pval, along with the control for multiple comparisons using function p.adjust. For PTReg and BLMCP, two tuning parameters control the number of identified interactions and are selected using BIC (functions bic.PTReg and bic.BLMCP). For RobSBoosting and Miss.boosting, interaction identification is ‘automatically’ realized with the stopping iteration determined by BIC. For joint analysis, GEInter also provides functions coef (for extracting estimates), predict (for making prediction for new observations) and plot (for visualizing estimates).
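For the marginal analysis, the multiple-comparison step can be sketched with the base-R function p.adjust. The P values below are made up, and the Benjamini–Hochberg method and the 0.05 cutoff are assumptions chosen for illustration; in practice the input would be the P values returned by QPCorr.pval.

```r
# Hypothetical permutation P values for five G-E interactions.
pvals <- c(gene1 = 0.001, gene2 = 0.012, gene3 = 0.021,
           gene4 = 0.400, gene5 = 0.800)

# Benjamini-Hochberg adjustment (the choice of method is an assumption here).
padj <- p.adjust(pvals, method = "BH")

# Interactions declared important at an assumed 0.05 false discovery rate.
identified <- names(pvals)[padj < 0.05]
print(padj)
print(identified)
```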
We refer to the original publications and Section 1 of the Supplementary Material for details on the methods, and to Section 2 of the Supplementary Material for information on computation time.
3 Application examples
We analyze the TCGA head and neck squamous cell carcinoma data. The processed data are included in the package. The response is overall survival, which is subject to right censoring. There are seven E factors, namely alcohol consumption frequency (ACF), smoking pack-years (SPY), age, gender, PN, PT and ICD O3 site. For G factors, 2000 gene expressions are considered. Data are available on 484 subjects, among whom 70.8% have missingness in ACF and/or SPY. We apply MissBoosting, which can accommodate missingness, nonlinear effects of continuous E factors (EC) (Supplementary Fig. S2) and contaminated outcome (Supplementary Fig. S3). Analysis is realized using: fit<-Miss.boosting(G, E, Y, im_time = 10, loop_time = 1000, v = 0.25, num.knots = 5, degree = 3, tau = 0.3, family = 'survival', E_type = c(rep('EC', 3), rep('ED', 4))), where G and E are the G and E measurement matrices, Y is the two-column response matrix including survival times and censoring indicators, im_time is the number of imputations, loop_time and v are the number of iterations and step size, num.knots and degree are the parameters for B spline expansion, tau is the threshold used in stability selection, and family and E_type indicate the types of response and E factors. The output fit is a list that includes the identification results (unique_variable), estimated effects of main E (alpha0) and main G and interactions (beta0) and others. For this dataset, MissBoosting identifies 3 main E effects (age, ACF, SPY), 27 main G effects and 13 interactions. Visualization based on fit is realized using plot(fit), which returns the heatmap and curves for the linear and nonlinear effects (Supplementary Fig. S4). Prediction is realized using y.hat<-predict(fit, newE, newG), where newE and newG are the testing data matrices. With 100 resamplings, the average C-statistic (which measures the overall adequacy of prediction, with a larger value indicating better prediction) is 0.68.
We provide additional demonstrations in Section 3 of the Supplementary Material. In Section 4 of the Supplementary Material, we provide simulations to gain more insights into the methods and software.
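To clarify what the C-statistic measures, the following base-R sketch computes a Harrell-type concordance for right-censored data. The helper `c_statistic` is hypothetical and is not part of GEInter; it assumes that predictions are on a survival-time scale (larger means longer predicted survival), and it is not necessarily the exact evaluation procedure used for the result reported above.

```r
# Harrell-type concordance for right-censored data (hypothetical helper).
# time:   observed times; status: 1 = event, 0 = censored
# pred:   predicted survival time (larger = longer predicted survival)
c_statistic <- function(time, status, pred) {
  concordant <- 0
  comparable <- 0
  n <- length(time)
  for (i in seq_len(n)) {
    if (status[i] != 1) next           # the earlier time of a pair must be an event
    for (j in seq_len(n)) {
      if (time[i] < time[j]) {
        comparable <- comparable + 1
        if (pred[i] < pred[j]) {
          concordant <- concordant + 1
        } else if (pred[i] == pred[j]) {
          concordant <- concordant + 0.5   # tied predictions count as half
        }
      }
    }
  }
  concordant / comparable
}

time   <- c(5, 8, 3, 10, 6)            # toy data, made up for this example
status <- c(1, 0, 1, 1, 1)
c_same     <- c_statistic(time, status, pred = time)    # same ordering as outcomes
c_reversed <- c_statistic(time, status, pred = -time)   # fully reversed ordering
print(c(c_same, c_reversed))
```

A value of 0.5 corresponds to random ordering, so an average of 0.68 indicates that the predictions order subjects better than chance.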
4 Discussion
Robustness is a highly desirable property in G–E interaction analysis and has drawn wide attention. GEInter is the first R package that comprehensively conducts robust G–E interaction analysis using state-of-the-art methods. With its comprehensiveness, user-friendly functions and demand for only basic R settings, it can significantly facilitate routine analysis.
Funding
This work was supported by the National Institutes of Health [CA204120] and National Natural Science Foundation of China [12071273].
Conflict of Interest: none declared.
Contributor Information
Mengyun Wu, School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai 200433, China.
Xing Qin, School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai 200433, China.
Shuangge Ma, Department of Biostatistics, Yale University, New Haven, CT 06520, USA.
References
- Chen H. et al. (2014) Incorporating gene-environment interaction in testing for association with rare genetic variants. Hum. Hered., 78, 81–90.
- Liu J. et al. (2013) Identification of gene-environment interactions in cancer studies using penalization. Genomics, 102, 189–194.
- Mcallister K. et al. (2017) Current challenges and new opportunities for gene-environment interaction studies of complex diseases. Am. J. Epidemiol., 186, 753–761.
- Purcell S. et al. (2007) PLINK: a toolset for whole-genome association and population-based linkage analysis. Am. J. Hum. Genet., 81, 559–575.
- Ren J. et al. (2020) Semiparametric Bayesian variable selection for gene-environment interactions. Stat. Med., 39, 617–638.
- Thomas D. (2010) Gene-environment-wide association studies: emerging approaches. Nat. Rev. Genet., 11, 259–272.
- Wu M. et al. (2017) Accommodating missingness in environmental measurements in gene-environment interaction analysis. Genet. Epidemiol., 41, 523–554.
- Wu M. et al. (2019) Robust semiparametric gene-environment interaction analysis using sparse boosting. Stat. Med., 38, 4625–4641.
- Wu M., Ma S. (2019) Robust genetic interaction analysis. Brief. Bioinform., 20, 624–637.
- Xu Y. et al. (2018) Robust gene-environment interaction analysis using penalized trimmed regression. J. Stat. Comput. Simul., 88, 3502–3528.
- Xu Y. et al. (2019) Robust identification of gene-environment interactions for prognosis using a quantile partial correlation approach. Genomics, 111, 1115–1123.
- Yang T. et al. (2019) A powerful and data-adaptive test for rare-variant-based gene-environment interaction analysis. Stat. Med., 38, 1230–1244.