Abstract
For many practical high-dimensional problems, interactions have been increasingly found to play important roles beyond main effects. A representative example is gene-gene interaction. Joint analysis, which analyzes all interactions and main effects in a single model, can be seriously challenged by high dimensionality. For high-dimensional data analysis in general, marginal screening has been established as effective for reducing computational cost, increasing stability, and improving estimation/selection performance. Most of the existing marginal screening methods are designed for the analysis of main effects only. The existing screening methods for interaction analysis are often limited by making stringent model assumptions, lacking robustness, and/or requiring predictors to be continuous (and hence lacking flexibility). A unified marginal screening approach tailored to interaction analysis is developed, which can be applied to regression, classification, and survival analysis. Predictors are allowed to be continuous and discrete. The proposed approach is built on Coefficient of Variation (CV) filters based on information entropy. Statistical properties are rigorously established. It is shown that the CV filters are almost insensitive to the distribution tails of predictors, correlation structure among predictors, and sparsity level of signals. An efficient two-stage algorithm is developed to make the proposed approach scalable to ultrahigh-dimensional data. Simulations and the analysis of TCGA LUAD data further establish the practical superiority of the proposed approach.
Keywords: Coefficient of variation, Conditional entropy, Interaction analysis, Marginal Screening
1. Introduction
For many practical high-dimensional analysis problems, interactions have been increasingly confirmed as playing critical roles beyond main effects [1, 2]. The most representative example is perhaps gene-gene interaction. For a wide array of diseases including cancer and cardiovascular diseases, gene-gene interactions with significant implications for disease risk, progression, survival, and other endpoints have been identified. “Genes” analyzed in published studies include SNPs, gene expressions, methylation and other epigenetic changes, microRNAs, and others. Interaction analysis has also been extensively conducted beyond biomedicine.
Most of the existing interaction analyses can be classified as marginal and joint. In marginal analysis, a small number of variables are analyzed at a time. As such, a large number of analyses are needed, leading to a multiple comparison adjustment problem. In contrast, in joint analysis, a large number of variables are collectively analyzed in a single model, leading to a regularized estimation and variable selection problem [3, 4, 5, 6]. The two analysis paradigms serve different purposes, with joint analysis possibly better reflecting, for example, the biology of complex diseases [7]. In this article, we focus on joint analysis. In the literature, many joint interaction analysis methods have been developed, and we refer to [8] and others for review. Despite successful methodological and theoretical developments, in practice, joint interaction analysis is usually seriously challenged by extremely high dimensionality, which can lead to an intolerably high computational cost, lack of stability, and inferior estimation and selection.
For high-dimensional data analysis in general (without and with interactions), marginal screening has been established as highly effective for reducing computational cost and improving stability, estimation, and selection performance. The essence of marginal screening lies in linking effects that are important in a joint model with those that are important in marginal models. Most of the existing screening methods are limited to main effects only. They can be roughly classified as model-based and model-free (robust). In model-based marginal screening, specific parametric or semiparametric models are assumed [9, 10]. In such analysis, when models are correctly specified, consistency properties can be established. However, in high-dimensional data analysis, model misspecification is not uncommon, which can lead to the failure of model-based screening. To tackle this problem, model-free screening methods have been developed, built on the quantile [11], rank correlation [12], distance correlation [13], and other techniques [14, 15, 16, 18]. It is noted that some of the existing techniques have relatively narrow applications. For example, the mean-variance-based sure independence screening [16] is limited to classification problems. The distance-correlation-based sure independence screening [13] requires predictors to have continuous distributions. The existing methods sometimes take quite different forms for different types of response variables, lacking uniformity – this is especially true for model-based methods.
Compared to the analysis of main effects, marginal screening for interaction analysis has been less developed. It may seem that the methods for main-effects screening can be directly applied. However, the validity of such methods often relies on certain no/weak correlation assumptions, which easily break down in interaction analysis. One solution is offered by the unique variable selection hierarchy of interaction analysis. In particular, it has been argued both statistically and biologically that, if an interaction term is important, then one (under the weak hierarchy) or both (under the strong hierarchy) of the corresponding main effects should also be important [20]. With this hierarchy, progressive screening methods have been developed, which first conduct marginal screening with main effects, and then screen for important interactions corresponding to the selected main effects [19, 20, 21]. It is noted that the existing progressive methods are mostly model-based. In addition, they also demand certain correlation conditions to ensure that important main effects can be identified in the first place. To identify gene-gene and gene-environment interactions, entropy-based methods have been proposed. Examples include an index based on information gain to quantify the interaction effect between two categorical predictors and a response [22] and the utilization of mutual information and mutual information gain to quantify gene-environment [23] and gene-gene [24] interactions. There are also approaches that take GINI purity gain into account to detect gene-gene interactions [25, 26]. However, these entropy-based methods generally overestimate dependence [27] and focus on binary responses, which limits their practical applications.
In this article, our goal is to develop a new marginal screening approach tailored to interaction analysis. The significance of interaction analysis and marginal screening for such analysis has been well established and will not be reiterated. This study may complement and advance the existing literature in multiple ways. First, the proposed analysis accommodates interactions and can complement those limited to main effects only – it is also noted that it is directly applicable to analysis that involves main effects only. Second, the proposed approach provides a unified solution to feature screening. It can comprehensively cover categorical, continuous, and censored survival outcomes. This methodological uniformity is much desired and not shared by many of the existing methods. Third, the proposed nonparametric coefficient of variation (CV) filters, built on Shannon’s entropy theory and the coefficient of variation statistic, are model-free and have the much-desired robustness property. Their implementation does not require full specification of the distribution of the response and covariates. The main consideration for adopting CV is that the standard deviation (SD) of entropy information usually changes as the mean changes, and dividing by the mean can remove this impact on variation. Such a formulation can be even more useful when different predictors have different numbers of categories, as predictors with more categories are likely to be associated with larger information gain regardless of whether interactions are important or not. In this case, the mean can serve as an adjustment factor to deal with this problem. The proposed CV filter is a standardization of the SD that allows the comparison of variability regardless of the magnitudes of the original features. With the assistance of a two-stage procedure, the proposed approach is also superior by being computationally scalable and fast.
In addition, it is shown to have satisfactory performance when signals are weak and/or variables are dependent and heavy-tailed – such a property is not shared by most of the existing methods. Here it is noted that the merit of robustness for joint interaction analysis has been well established, for which we refer to the quantile [28], exponential loss [29, 30], rank-based [31, 32], and many other works. Fourth, the proposed approach can flexibly accommodate discrete, categorical, and continuous predictors, overcoming the stringent demand for continuous distributions made by some methods. Lastly, statistical properties are rigorously established, providing the proposed approach with a strong statistical grounding and also shedding insights into the coefficient of variation and entropy theory under high-dimensional settings. Overall, this study can provide a statistically well-grounded and numerically well-performing approach for alleviating the computational burden and improving performance in high-dimensional interaction analysis.
2. Methods
The proposed CV filters are based on Shannon’s entropy theory [33], which has demonstrated great power in other fields but has not been well employed in interaction analysis. In Section 2.1, we first develop a new interaction filter based on conditional entropy for data with a categorical response and predictors. Based on this development, a new screening strategy is developed in Section 2.2, and its statistical properties are established in Section 2.3. In Section 2.4, we consider data with continuous and censored survival responses.
2.1. A new interaction filter based on conditional entropy
Let Y be a categorical response with R categories {1, …, R}, and X = (X1, …, Xp) be a p-dimensional vector of predictors, where Xk is categorical with Jk categories for k = 1, …, p. In this study, we focus on second-order interactions. Higher-order interactions have been very limitedly investigated in high-dimensional settings. For two predictors Xk and Xl, they do not have an interaction effect if and only if they are conditionally independent given Y.
Entropy is a key information measure for the uncertainty of random variables. Consider a categorical random variable X and its probability mass function p(x) = P(X = x). Set 0 × log 0 = 0. The entropy of X is defined as:
H(X) = −Σx p(x) log p(x).   (2.1)
It is minimal (=0), if X has probability one for a specific category. On the other hand, it is maximal, if X has the same probability for all categories. For an equiprobable variable, H(X) increases with the number of categories.
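These properties can be checked directly. Below is a minimal Python sketch (the helper name `entropy` is our own, not from the paper):

```python
import math

def entropy(probs):
    """Shannon entropy of a categorical pmf, using the convention 0 * log 0 = 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Minimal (= 0): all probability mass on a single category.
print(entropy([1.0, 0.0, 0.0]))   # 0.0

# Maximal for a given number of categories: the equiprobable pmf,
# and the maximum grows with the number of categories.
print(entropy([1/3, 1/3, 1/3]))   # log 3 ≈ 1.0986
print(entropy([1/4] * 4))         # log 4 ≈ 1.3863
```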
Consider two variables Xk, Xl. We propose using conditional entropy to quantify the dependence between their interaction and Y. Denote by H(Y | Xk = i, Xl = j) the conditional entropy of Y given Xk = i and Xl = j, for i = 1, …, Jk and j = 1, …, Jl. Let wij = P(Xk = i, Xl = j) be the weight characterizing the probability of (Xk, Xl) falling into class (i, j). Consider the quantity:
Θkl(i, j) = wij H(Y | Xk = i, Xl = j).   (2.2)
It measures the weighted amount of information remaining in response Y given Xk = i and Xl = j. Consequently, the conditional entropy H(Y | Xk, Xl) = Σij Θkl(i, j) is the sum of Θkl(i, j) over all possible values that Xk and Xl can take. When (Xk, Xl) has a strong relationship with Y, the uncertainty of Y is expected to significantly decrease after removing the effects of (Xk, Xl). Specifically, if Y is completely determined by Xk and Xl (such that Y = f(Xk, Xl), where f is a deterministic function), H(Y | Xk, Xl) should be 0. By simple calculations, H(Y | Xk, Xl) can be reformulated as:
H(Y | Xk, Xl) = H(Y) − I(Xk; Y) − I(Xl; Y) − I(Xk; Xl | Y) + I(Xk; Xl),   (2.3)

where I(·; ·) and I(·; · | ·) denote mutual information and conditional mutual information, respectively.
We note from (2.3) that H(Y | Xk, Xl) contains multiple sources of information, including the intrinsic information H(Y), main-effect information I(Xk; Y) and I(Xl; Y), and interaction-related information I(Xk; Xl | Y) and I(Xk; Xl). When there is no interaction, Xk and Xl are conditionally independent given Y, and I(Xk; Xl | Y) = 0. And so, H(Y | Xk, Xl) = H(Y) − I(Xk; Y) − I(Xl; Y) + I(Xk; Xl). When Xk and Xl are independent, I(Xk; Xl) = 0. These results motivate us to use conditional entropy to develop an efficient interaction screening filter.
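The decomposition of the conditional entropy just discussed is the standard information-theoretic identity H(Y | Xk, Xl) = H(Y) − I(Xk; Y) − I(Xl; Y) − I(Xk; Xl | Y) + I(Xk; Xl), and it can be verified numerically on an arbitrary joint pmf. The sketch below (our own illustration, not from the paper) does so for three binary variables:

```python
import itertools, math, random

random.seed(0)

# A random joint pmf over (Y, Xk, Xl), each binary; coordinate 0 is Y.
support = list(itertools.product(range(2), repeat=3))
raw = [random.random() for _ in support]
p = {v: r / sum(raw) for v, r in zip(support, raw)}

def marg(coords):
    """Marginal pmf over the listed coordinates."""
    out = {}
    for v, pv in p.items():
        key = tuple(v[c] for c in coords)
        out[key] = out.get(key, 0.0) + pv
    return out

def H(pmf):
    return -sum(q * math.log(q) for q in pmf.values() if q > 0)

def I(a, b):
    """Mutual information I(A; B) = H(A) + H(B) - H(A, B)."""
    return H(marg([a])) + H(marg([b])) - H(marg([a, b]))

# Left-hand side: H(Y | Xk, Xl) = H(Y, Xk, Xl) - H(Xk, Xl).
lhs = H(marg([0, 1, 2])) - H(marg([1, 2]))

# I(Xk; Xl | Y) = H(Xk, Y) + H(Xl, Y) - H(Y) - H(Y, Xk, Xl).
i_cond = H(marg([0, 1])) + H(marg([0, 2])) - H(marg([0])) - H(marg([0, 1, 2]))

rhs = H(marg([0])) - I(0, 1) - I(0, 2) - i_cond + I(1, 2)
print(abs(lhs - rhs) < 1e-12)   # True
```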
Define Θkl = {Θkl(i, j) : i = 1, …, Jk, j = 1, …, Jl} as the set of weighted conditional entropies of Y given all the possible values of (Xk, Xl). Intuitively, if there is an interaction, the effect of Xk on Y will not be the same for all levels of Xl. This can be fully reflected in the variation of Θkl. Specifically, if all the components of Θkl are equal, the interaction between Xk and Xl should have no effect on Y. On the other hand, if there is a strong interaction, Θkl should have notable variability. As such, the standard deviation (SD) of Θkl can potentially be used to quantify interaction. When different predictors have different numbers of categories, the pair with more categories is likely to have smaller conditional entropy, regardless of the absence or presence of interaction. In addition, the SD of Θkl generally depends on the mean and is not dimensionless. With these considerations, we propose using the coefficient of variation (CV) to quantify interaction. Specifically,
CVkl = σkl / μkl,   (2.4)
where σkl and μkl are the SD and mean of Θkl, respectively. This CV filter is a standardization of the SD. Consequently, it allows for the comparison of variability without being affected by the magnitudes of the original variables. It is easy to see that CVkl ≥ 0. Its properties are further established in the following proposition.
Proposition 1.
Let Y be a categorical random variable with R categories {1, …, R}, and P(Y = r) > 0 for all r = 1, …, R. Let Xk be a categorical variable with Jk categories for k = 1, …, p, and P(Xk = i, Xl = j) > 0 for all i = 1, …, Jk and j = 1, …, Jl. Then, (1) Θkl(i, j) > 0 for i = 1, …, Jk and j = 1, …, Jl, and 0 < μkl ≤ H(Y). (2) CVkl ≥ 0 for all k, l. (3) CVkl is bounded above by a universal constant (≈ 0.35) if (Xk, Xl) and Y are independent. (4) CVkl = 0 if and only if (Xk, Xl) and Y are independent and P(Xk = i, Xl = j) = 1/(JkJl) for all i and j.
Proof is provided in the Supplementary Materials. Result (1) implies that each component of Θkl is positive, and so is the mean value μkl. By Jensen’s inequality, μkl is bounded by the entropy of Y. As R increases, the upper bound also increases. In addition, if (Xk, Xl) and Y are independent, CVkl cannot be too large and is bounded by ∼ 0.35. If, in addition, (Xk, Xl) falls into each category with an equal probability, CVkl achieves its minimum of 0. By definition, larger CV values indicate stronger interaction effects. As such, CVkl can be utilized as a marginal utility for interaction screening.
Although categorical distributions have been assumed, the CV interaction filter can be generalized to continuous and mixture distributions. Specifically, if predictor Xk has a continuous distribution, we can employ slicing and partition Xk into Jk slices, and then the CV interaction filter can be applied. In our numerical studies, we adopt uniform slicing and note that data-dependent, possibly more effective slicing techniques have been developed in the literature.
Let q(j) be the j/Jk-th percentile of Xk for j = 1, …, Jk − 1, with q(0) = −∞ and q(Jk) = ∞. The CV interaction filter is then defined by replacing the events {Xk = i} and {Xl = j} in equation (2.2) with the slice-membership events {q(i − 1) < Xk ≤ q(i)} and the analogous events for Xl, respectively. A similar idea can be applied to mixture distributions. When Xk is continuous and Xl is categorical, the CV filter can be constructed by replacing only the events {Xk = i} in equation (2.2) with the corresponding slice-membership events.
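Uniform slicing of a continuous predictor can be sketched as follows (our own illustration; the helper name `uniform_slice` is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

def uniform_slice(x, J):
    """Partition a continuous predictor into J equal-frequency slices,
    cutting at its empirical j/J-th percentiles, j = 1, ..., J - 1."""
    cuts = np.quantile(x, [j / J for j in range(1, J)])
    return np.searchsorted(cuts, x, side="right")   # slice labels 0, ..., J-1

# Heavy tails are no obstacle: only the ranks matter for the slice labels.
x = rng.standard_cauchy(1000)
labels = uniform_slice(x, 3)
print(np.bincount(labels))   # roughly a third of the observations per slice
```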
With a sample of n iid observations (Yi, Xi), i = 1, …, n, CVkl can be estimated by plugging in the sample mean and standard deviation of the estimated conditional entropies, with all probabilities replaced by sample proportions. That is,
ĈVkl = σ̂kl / μ̂kl,   (2.5)
where σ̂kl and μ̂kl are the sample standard deviation and mean of the estimated conditional entropy set Θ̂kl, in which each probability in (2.2) is replaced by the corresponding sample proportion (computed with the indicator function I(·)).
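The plug-in recipe above can be sketched in a few lines. This is our own illustration (the function name and the toy data-generating model are not from the paper); in the toy data, the pair whose interaction drives Y receives a visibly larger value than a null pair:

```python
import numpy as np

rng = np.random.default_rng(2)

def cv_hat(y, xk, xl):
    """Plug-in estimate of CV_kl: coefficient of variation of the estimated
    weighted conditional entropies w_ij * H(Y | Xk = i, Xl = j)."""
    thetas = []
    for i in np.unique(xk):
        for j in np.unique(xl):
            cell = (xk == i) & (xl == j)
            if not cell.any():
                continue
            w = cell.mean()                           # sample proportion for w_ij
            _, counts = np.unique(y[cell], return_counts=True)
            q = counts / counts.sum()                 # estimate of P(Y = r | cell)
            thetas.append(w * -(q * np.log(q)).sum())
    thetas = np.asarray(thetas)
    return thetas.std() / thetas.mean()

n = 2000
x1, x2, x3 = (rng.integers(0, 2, n) for _ in range(3))
y = ((x1 & x2) | (rng.random(n) < 0.15)).astype(int)  # interaction between x1, x2

print(cv_hat(y, x1, x2) > cv_hat(y, x1, x3))          # True: (x1, x2) ranks higher
```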
Illustrative examples
In Section 2 of the Supplementary Materials, we present two illustrative examples, which can provide more insights into the working characteristics of the CV interaction filter and show that it performs well for both continuous and categorical predictors.
Remark 1.
Following the same principle, the CV filter approach can be applied to screen main effects. This can be achieved by replacing (Xk, Xl) with one of its marginals Xk or Xl in (2.4) and rewriting CVkl as CVk or CVl. Properties are similar to the above. In particular, CVk = 0 if and only if Xk and Y are independent and P(Xk = j) = 1/Jk for j = 1, …, Jk. We refer to the approach of applying the CV filter to main effects only as CVMS (which will be further considered below).
2.2. Screening Strategy
Define the index sets of predictors and their second-order terms as:
Define the active main effect and interaction index sets as:
The full model index set and the true model index set are defined accordingly. For a model M, we use |M| to denote its size. As described above and in the literature, interaction analysis faces the additional complexity of the variable selection hierarchy. Here we consider the weak hierarchy, under which if the interaction between Xk and Xl is identified, then at least one of Xk and Xl should also be identified. Extending to the strong hierarchy can be easily carried out.
We propose the following two-stage approach, which screens main effects and interactions in two consecutive steps. In particular,
Stage 1.
CVMS: Apply the CV main-effect filter to all p predictors, and identify the set of top-ranked main effects.
Stage 2.
CVIS: Apply the CV interaction filter to all pairs involving at least one main effect retained in Stage 1 (respecting the weak hierarchy), and identify the set of top-ranked interactions.
The working active set for downstream analysis is then the union of the main effects and interactions retained in the two stages. Here, we note that some existing progressive methods demand multiple iterations to update the retained main-effect and interaction sets [20]. In contrast, the proposed approach is not iterative. In our theoretical investigations below, we examine the asymptotic requirements on the sizes of the two retained sets. In our numerical study, the retained main-effect set size is consistent with that in the literature [11, 13, 15, 16], and the retained interaction set size has been motivated by the squared dimensionality of interactions. Our numerical study below suggests satisfactory performance. On the other hand, we note that when the sample size is small in practical data analysis, “to be cautious”, larger values can be taken.
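The two-stage procedure can be sketched end to end as below. This is our own minimal implementation for binary predictors and a categorical response; the retained-set sizes d1 and d2 here are illustrative choices, not the paper's values:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)

def _cv(thetas):
    thetas = np.asarray(thetas, float)
    return thetas.std() / thetas.mean()

def _cell_entropy(y, mask):
    """Weighted conditional entropy w * H(Y | cell), with sample proportions."""
    _, counts = np.unique(y[mask], return_counts=True)
    q = counts / counts.sum()
    return mask.mean() * -(q * np.log(q)).sum()

def cv_main(y, x):
    return _cv([_cell_entropy(y, x == i) for i in np.unique(x)])

def cv_pair(y, xk, xl):
    return _cv([_cell_entropy(y, (xk == i) & (xl == j))
                for i in np.unique(xk) for j in np.unique(xl)
                if ((xk == i) & (xl == j)).any()])

def two_stage_screen(y, X, d1, d2):
    p = X.shape[1]
    # Stage 1 (CVMS): keep the d1 top-ranked main effects.
    main_scores = [cv_main(y, X[:, k]) for k in range(p)]
    kept = set(np.argsort(main_scores)[-d1:])
    # Stage 2 (CVIS): score pairs obeying the weak hierarchy, keep the top d2.
    pairs = [kl for kl in combinations(range(p), 2) if kept & set(kl)]
    pair_scores = {kl: cv_pair(y, X[:, kl[0]], X[:, kl[1]]) for kl in pairs}
    top_pairs = sorted(pair_scores, key=pair_scores.get, reverse=True)[:d2]
    return kept, top_pairs

n, p = 2000, 20
X = rng.integers(0, 2, (n, p))
y = ((X[:, 0] & X[:, 1]) | (rng.random(n) < 0.15)).astype(int)

kept, top_pairs = two_stage_screen(y, X, d1=5, d2=3)
print((0, 1) in top_pairs)   # True: the active interaction survives screening
```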
Remark 2.
Conceptually, with binary distributions, the CV filter may lose power, as the SD in the definition of CVk may not be sufficiently informative when computed from only two conditional entropy values. However, our numerical study below still suggests reasonable performance with binary distributions. When there are three or more categories, empirical study suggests highly satisfactory performance of the proposed approach. As a possible variation, one can directly apply the CV filter to the combination of main effects and interactions. Then the selected set, if needed, can be enriched to satisfy the variable selection hierarchy.
2.3. Statistical Properties
First consider the scenario with all predictors being categorical. Assume the following conditions:
(C1) There exist two positive constants c1 and c2, such that , , . There exist two positive constants c3 and c4, such that for , and .
(C2) There exist two positive constants c > 0 and , such that .
(C3) , , where s ≥ 0, t ≥ 0 and .
(C4) , where δ is a positive constant.
Condition (C1) guarantees that the proportion of each category of the response and pair of predictors cannot be too large or too small. Similar assumptions have been made in the literature [14, 16]. Condition (C2) is common in the marginal screening literature and requires that the minimum true signal does not vanish too quickly. Condition (C3) allows the number of categories for the response and predictors to diverge with a certain order, and the maximum number of categories for predictors is allowed to vary with sample size n. Condition (C4) is assumed to separate the active interaction set from noise. It ensures that CVkl of an active interaction is always larger than that of an inactive one at the population level. Compared to the partial orthogonality condition [17] (that marginal utilities are nonzero for all active effects and exactly zero for all inactive ones), Condition (C4) is weaker in that the effects are not required to be 0 for all inactive interactions to have the consistency property in ranking. In fact, an inactive interaction does not necessarily have CVkl = 0. The quantity CVkl is zero only if (Xk, Xl) is independent of Y and falls into each category with an equal probability. In comparison, the Pearson’s Chi-squared-based sure independence screening [14] requires the effects of all inactive covariates to be zero to enjoy the strong sure screening property.
Theorem 1.
(Sure screening for categorical predictors) Under conditions (C1)–(C3),
where m is a positive constant. Therefore, if and , CVIS has the sure screening property.
Proof is provided in the Supplementary Materials. This result ensures that the estimated set of interactions contains the truly important ones with probability approaching one. It is noted that, as conditional entropy has the much-desired robustness property, CVIS is robust to heavy-tailed distributions of predictors and presence of outliers – a property not shared by most of the existing methods. Further, the sure screening property holds when predictors and/or response have a diverging number of categories.
Remark 3.
Under the same conditions, the filter also possesses the screening consistency property for main effects. In particular, it can be shown that, as , , where m′ is a positive constant. So if in Theorem 1 is satisfied, . Then the sure screening property holds for main effects.
To accommodate continuous distributions, additional assumptions are needed, and condition (C3) needs to be revised. Specifically,
(C5) If both Xk and Xl are continuous, then there exists a constant c5 such that fk,r(x) ≤ c5 for any r and x in the domain of Xk, where fk,r is the Lebesgue density function of Xk conditional on Y = r. There exists a constant c6 such that fk,r,l(x) ≤ c6 for any r, x in the domain of Xk, and xl in the domain of Xl, where fk,r,l is the Lebesgue density function of Xk conditional on Y = r and Xl = xl.
(C5’) If Xk is continuous and Xl is categorical, then there exists a constant c5′ such that fk,r,j(x) ≤ c5′ for any r, j, and x in the domain of Xk, where fk,r,j is the Lebesgue density function of Xk conditional on Y = r and Xl = j.
(C6) There exist positive constants c7 and such that for any and x in the domain of Xk, where is the Lebesgue density function of Xk. Further, is continuous in the domain of Xk.
(C3’) , , where , and .
Conditions (C5) and (C5’) exclude the extreme scenario where Xk places a heavy mass in a small range. Condition (C6) is mild and assumed for technical considerations. It requires a lower bound that is in the order of for the density of Xk. For data with both categorical and continuous predictors, we can establish the following results.
Theorem 2.
(Sure screening for both categorical and continuous predictors) Under Conditions (C1), (C2), (C3’), (C5), (C5’), and (C6),
where m is a positive constant. Therefore, if and , CVIS has the sure screening property.
Theorem 3.
(Ranking consistency) If Conditions (C1), (C4), (C5), and (C6) hold for and , then
This result establishes that for continuous, categorical, and mixture distributions, in a unified way, the proposed approach can properly rank and hence separate important and unimportant interaction terms. We note that Condition (C4) may be slightly stronger than some of its counterparts. However, the corresponding consistency result is also stronger. It justifies a clear gap between active and inactive interactions at the sample level. That is, the CVkl values of active interactions are always ranked above those of inactive ones with an overwhelming probability. Thus, with an appropriate cutoff, active and inactive interactions can be separated.
Remark 4.
(Computational complexity) By definition, the CV interaction filter allows the numbers of categories to differ across predictors. It can be derived that the computational complexity is . Further, by Condition (C3), it is , where . Therefore, it is less than .
2.4. Accommodating continuous and censored survival responses
With the assistance of slicing, the CV interaction filter can accommodate continuous responses. Specifically, we define a partition
where −∞ = a(0) < a(1) < ⋯ < a(G) = ∞. Each interval (a(g − 1), a(g)] is referred to as a slice. We then define a random variable Ỹ such that Ỹ = g if and only if Y is in the gth slice. The slicing counterpart of Θkl(i, j) is Θ̃kl(i, j) = wij H(Ỹ | Xk = i, Xl = j).
The slicing version of the conditional entropy set can be formulated as Θ̃kl = {Θ̃kl(i, j) : i = 1, …, Jk, j = 1, …, Jl}. As such, we have the CV interaction filter for a continuous response:
C̃Vkl = σ̃kl / μ̃kl,   (2.6)
where σ̃kl and μ̃kl are the standard deviation and mean of Θ̃kl, respectively. In our numerical analysis, we adopt uniform slicing, with the number of slices G chosen following [15].
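Slicing the response works in the same way as slicing a predictor. A minimal sketch follows (our own illustration; the choice G = 3 is illustrative, not the paper's proposal):

```python
import numpy as np

rng = np.random.default_rng(4)

def slice_response(y, G):
    """Discretize a continuous response into G equal-frequency slices."""
    cuts = np.quantile(y, [g / G for g in range(1, G)])
    return np.searchsorted(cuts, y, side="right")   # labels 0, ..., G-1

n = 3000
x1, x2 = rng.integers(0, 2, (2, n))
y = x1 * x2 + 0.3 * rng.standard_normal(n)   # continuous, interaction-driven response

y_tilde = slice_response(y, 3)
# y_tilde is categorical, so the filter of (2.5) applies to (y_tilde, x1, x2).
print(np.unique(y_tilde))   # [0 1 2]
```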
Now consider data with right-censored responses. Instead of (Yi, Xi), we observe (Y*i, δi, Xi), where Y*i = min(Yi, Ci) and δi = I(Yi ≤ Ci). Here, we assume that the censoring variable Ci is independent of Yi and Xi. Denote by Ŝ(t) the Kaplan-Meier estimator of the survival function S(t). To apply the CV interaction filter, we first apply uniform slicing and partition Y* into G slices. The inverse-probability-of-censoring CV filter for screening main effects (IPCW-CVMS) is based on the statistic:
where the component quantities are inverse-probability-of-censoring weighted analogues of those in (2.5). The rationale behind this is that:
With the same strategy, the inverse-probability-of-censoring CV filter for screening interactions (IPCW-CVIS) is based on the statistic:
| (2.7) |
where the sample standard deviation and mean are computed from the IPCW-weighted conditional entropy estimates, with the probabilities in (2.2) replaced by inverse-probability-of-censoring weighted sample proportions, and I(·) is the indicator function.
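The IPCW construction can be sketched as follows. This is our own reading: we take the survival function being estimated to be that of the censoring time, as is standard for inverse-probability-of-censoring weighting, and estimate it with a from-scratch Kaplan-Meier estimator; all names and implementation details are ours:

```python
import numpy as np

rng = np.random.default_rng(5)

def km_survival_at(times, events, eval_times):
    """Kaplan-Meier survival estimate evaluated at eval_times
    (right-continuous version; ties handled by sequential processing)."""
    order = np.argsort(times)
    t_sorted, d_sorted = times[order], events[order]
    at_risk = len(times) - np.arange(len(times))
    factors = np.where(d_sorted == 1, 1.0 - 1.0 / at_risk, 1.0)
    surv = np.cumprod(factors)
    idx = np.searchsorted(t_sorted, eval_times, side="right") - 1
    return np.where(idx >= 0, surv[np.clip(idx, 0, None)], 1.0)

def ipcw_weights(y_obs, delta):
    """Weights delta_i / G_hat(Y*_i), with G_hat the KM estimator of the
    censoring survival function (the censoring indicator is 1 - delta)."""
    g = km_survival_at(y_obs, 1 - delta, y_obs)
    return delta / np.maximum(g, 1e-10)

# Simulated right-censored data: Y* = min(Y, C), delta = 1(Y <= C).
y_true = rng.exponential(1.0, 300)
c = rng.exponential(1.5, 300)
y_obs = np.minimum(y_true, c)
delta = (y_true <= c).astype(int)

w = ipcw_weights(y_obs, delta)
print(w[delta == 0].max())   # 0.0: censored observations get zero weight
```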
For continuous and censored responses, with the statistics defined above, screening can be conducted in the same manner as described in Section 2.2. In addition, as described in the previous subsections, the proposed screening can accommodate categorical, continuous, and mixture predictor distributions.
3. Simulation
To gauge performance of the proposed CVMS+CVIS, we compare with the following competitors: (a) PCS, which conducts the Pearson’s Chi-squared-based screening of main effects [14]; (b) DCS, which conducts the distance correlation-based screening of main effects [13]; (c) IGS, which conducts the information gain-based screening of main effects [18]; (d) CVMS+PCIS, which conducts the screening of main effects using the proposed CV filter and the screening of interactions using the Pearson’s Chi-squared-based technique; (e) CVMS+KIF, which is similar to approach (d), with the interaction screening based on the Kendall Interaction Filter [34]; and (f) PCS+PCIS, which conducts the screening of main effects and interactions using the Pearson’s Chi-squared-based technique [14]. For Examples 2 and 3, we also include IIS [6], which conducts the screening of interactions for nonlinear classification, for comparison. We acknowledge that there are many other screening methods. The above have been chosen because of their competitive performance. In particular, comparing with alternatives (a)-(c) can reveal the merit of the proposed CV filter in the main-effect screening step, and comparing with alternatives (d)-(f) can reveal merit in the interaction screening step. With 500 replicates, we compare performance using the following criteria: (a) MMS, the minimum model size required to include all of the true active predictors, for which the 5%, 25%, 50%, 75%, and 95% quantiles are reported; (b) the probability that all active main effects are ranked in the top positions retained in Stage 1, and the probability that all active interactions are ranked in the top positions retained in Stage 2; (c) CZ, the percentage of correctly identified inactive predictors (among all identified inactive predictors); and (d) IZ, the percentage of mistakenly identified active predictors (among all identified active predictors). With the following examples, we consider n = 200, 500 and p = 1000, 5000.
Here we note that, although p may seem moderate, the dimensionality of interaction analysis is actually extremely high, and screening is warranted.
Example 1.
(Index model) Denote with . Cauchy(0, Ip) is the p-dimensional standard Cauchy distribution. Consider the index model:
For X and ε, we consider the following three cases:
Case (1a): , and ρ = 0.5;
Case (1b): , , and ρ = 0.5;
Case (1c): the same as Case (1a) except that ρ = 0.8.
The active main effect set and interaction set are and , respectively. When slicing, we partition each predictor into three categories and the response into two and three categories (R = 2 and 3). Results are summarized in Table 1. We can see that all approaches tend to be more accurate when the number of slices for the response increases. The proposed approach performs the best with the highest selection probability and smallest model size. In the main effect screening, performance of all the methods is insensitive to the number of slices, when the dependence structure of the predictors is complicated. DCS performs worse with the heavy-tailed predictors. IGS and PCS cannot maintain reasonable model sizes at the 75% and 95% quantiles. In comparison, CVMS performs well in all settings. In the interaction screening, CVIS outperforms the alternatives by a large margin. And its performance is almost insensitive to the dependence structure of covariates and extreme values. The alternatives fail with too many false discoveries, especially when the sample size is small.
Table 1:
Simulation Example 1: means of performance measures based on 500 replicates. A cell is left empty if the corresponding method is not applied.
| | | Main-effect selection = 3.0 | | | | | | | | Interaction selection = 2.0 | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (n,p) | Method | 5% | 25% | 50% | 75% | 95% | P | CZ | IZ | 5% | 25% | 50% | 75% | 95% | P | CZ | IZ |
| Case (1a): uniform slicing, R = 2 | |||||||||||||||||
| (200,1000) | DCS | 3.0 | 3.0 | 3.0 | 3.0 | 6.0 | 0.990 | 0.997 | 0.000 | ||||||||
| IGS | 3.0 | 3.0 | 3.0 | 3.0 | 6.0 | 0.980 | 0.997 | 0.000 | |||||||||
| CVMS+CVIS | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.000 | 0.997 | 0.000 | 4.0 | 8.0 | 11.0 | 24.0 | 813 | 0.780 | 0.996 | 0.150 | |
| PCS+PCIS | 3.0 | 3.0 | 3.0 | 3.0 | 7.0 | 0.980 | 0.997 | 0.000 | 6.0 | 10.0 | 17.0 | 109 | 1317 | 0.710 | 0.996 | 0.200 | |
| CVMS+KIF | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.000 | 0.997 | 0.000 | 3.0 | 16.0 | 64.0 | 708 | 4137 | 0.400 | 0.996 | 0.375 | |
| (500,1000) | DCS | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.000 | 0.997 | 0.000 | ||||||||
| IGS | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.000 | 0.997 | 0.000 | |||||||||
| CVMS+CVIS | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.000 | 0.997 | 0.000 | 4.0 | 4.0 | 8.0 | 13.0 | 18.0 | 1.000 | 0.996 | 0.000 | |
| PCS+PCIS | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.000 | 0.997 | 0.000 | 9.0 | 10.0 | 14.0 | 15.0 | 18.0 | 1.000 | 0.996 | 0.000 | |
| CVMS+KIF | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.000 | 0.997 | 0.000 | 3.0 | 3.03 | 5.0 | 17.0 | 483 | 0.920 | 0.996 | 0.050 | |
| Case (1a): uniform slicing, R = 3 | |||||||||||||||||
| (200,1000) | DCS | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.000 | 0.997 | 0.000 | ||||||||
| IGS | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.000 | 0.997 | 0.000 | |||||||||
| CVMS+CVIS | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.000 | 0.997 | 0.000 | 2.0 | 8.0 | 15.0 | 23.0 | 4382 | 0.815 | 0.996 | 0.050 | |
| PCS+PCIS | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.000 | 0.997 | 0.000 | 8.0 | 15.0 | 96.0 | 1242 | 8027 | 0.365 | 0.996 | 0.351 | |
| CVMS+KIF | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.000 | 0.997 | 0.000 | 3.0 | 5.0 | 47.0 | 403 | 3024 | 0.420 | 0.996 | 0.250 | |
| (500,1000) | DCS | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.000 | 0.997 | 0.000 | ||||||||
| IGS | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.000 | 0.997 | 0.000 | |||||||||
| CVMS+CVIS | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.000 | 0.997 | 0.000 | 2.0 | 3.0 | 4.0 | 8.0 | 15.0 | 1.000 | 0.996 | 0.000 | |
| PCS+PCIS | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.000 | 0.997 | 0.000 | 5.0 | 8.0 | 15.0 | 17.0 | 33.0 | 0.960 | 0.996 | 0.000 | |
| CVMS+KIF | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.000 | 0.997 | 0.000 | 3.0 | 3.0 | 5.0 | 5.0 | 11.0 | 1.000 | 0.996 | 0.000 | |
| Case (1b): uniform slicing, R = 2 | |||||||||||||||||
| (200,1000) | DCS | 3.0 | 3.0 | 3.0 | 5.0 | 23.0 | 1.000 | 0.996 | 0.000 | ||||||||
| IGS | 3.0 | 3.0 | 3.0 | 4.0 | 20.0 | 1.000 | 0.996 | 0.000 | |||||||||
| CVMS+CVIS | 3.0 | 3.0 | 3.0 | 5.0 | 16.0 | 1.000 | 0.996 | 0.000 | 5.0 | 9.0 | 13.0 | 16.0 | 4480 | 0.820 | 0.996 | 0.050 | |
| PCS+PCIS | 3.0 | 3.0 | 3.0 | 7.0 | 85.0 | 0.950 | 0.996 | 0.002 | 5.0 | 8.0 | 9.0 | 14.0 | 9996 | 0.800 | 0.996 | 0.075 | |
| CVMS+KIF | 3.0 | 3.0 | 3.0 | 5.0 | 16.0 | 1.000 | 0.996 | 0.000 | 49.0 | 183 | 482 | 1421 | 9996 | 0.170 | 0.996 | 0.400 | |
| (500,1000) | DCS | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.000 | 0.997 | 0.000 | ||||||||
| IGS | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.000 | 0.997 | 0.000 | |||||||||
| CVMS+CVIS | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.000 | 0.997 | 0.000 | 3.0 | 6.0 | 9.0 | 13.0 | 17.0 | 1.000 | 0.996 | 0.000 | |
| PCS+PCIS | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.000 | 0.997 | 0.000 | 4.0 | 8.0 | 10.0 | 13.0 | 17.0 | 1.000 | 0.996 | 0.000 | |
| CVMS+KIF | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.000 | 0.997 | 0.000 | 7.0 | 7.0 | 7.0 | 9.0 | 13.0 | 1.000 | 0.996 | 0.000 | |
| Case (1b): uniform slicing, R = 3 | |||||||||||||||||
| (200,1000) | DCS | 2.0 | 3.0 | 3.0 | 4.0 | 35.0 | 0.960 | 0.997 | 0.000 | ||||||||
| IGS | 3.0 | 3.0 | 4.0 | 5.0 | 15.0 | 1.000 | 0.997 | 0.000 | |||||||||
| CVMS+CVIS | 3.0 | 3.0 | 3.0 | 6.0 | 17.0 | 1.000 | 0.997 | 0.000 | 3.0 | 6.0 | 11.0 | 16.0 | 4519 | 0.860 | 0.996 | 0.075 |
| PCS+PCIS | 3.0 | 3.0 | 4.0 | 7.0 | 23.0 | 1.000 | 0.997 | 0.000 | 4.0 | 9.0 | 13.0 | 2512 | 8993 | 0.750 | 0.996 | 0.125 | |
| CVMS+KIF | 3.0 | 3.0 | 3.0 | 6.0 | 17.0 | 1.000 | 0.997 | 0.000 | 4.0 | 6.0 | 9.0 | 89.0 | 9996 | 0.700 | 0.996 | 0.200 | |
| (500,1000) | DCS | 2.0 | 2.0 | 3.0 | 3.0 | 46.0 | 0.950 | 0.997 | 0.000 | ||||||||
| IGS | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.000 | 0.997 | 0.000 | |||||||||
| CVMS+CVIS | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.000 | 0.997 | 0.000 | 3.0 | 4.0 | 5.0 | 6.0 | 10.0 | 1.000 | 0.996 | 0.000 | |
| PCS+PCIS | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.000 | 0.997 | 0.000 | 3.0 | 5.0 | 6.0 | 8.0 | 11.0 | 1.000 | 0.996 | 0.000 | |
| CVMS+KIF | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.000 | 0.997 | 0.000 | 5.0 | 5.0 | 7.0 | 9.0 | 12.0 | 1.000 | 0.996 | 0.000 | |
| Case (1c): uniform slicing, R = 2 | |||||||||||||||||
| (200,1000) | DCS | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.000 | 0.997 | 0.000 | ||||||||
| IGS | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.000 | 0.997 | 0.000 | |||||||||
| CVMS+CVIS | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.000 | 0.997 | 0.000 | 10.0 | 15.0 | 19.0 | 25.0 | 36.0 | 1.000 | 0.996 | 0.000 | |
| PCS+PCIS | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.000 | 0.997 | 0.000 | 13.0 | 18.0 | 21.0 | 27.0 | 34.0 | 1.000 | 0.996 | 0.000 | |
| CVMS+KIF | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.000 | 0.997 | 0.000 | 79.0 | 459 | 818 | 2030 | 7580 | 0.080 | 0.996 | 0.500 | |
| (500,1000) | DCS | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.000 | 0.997 | 0.000 | ||||||||
| IGS | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.000 | 0.997 | 0.000 | |||||||||
| CVMS+CVIS | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.000 | 0.997 | 0.000 | 2.0 | 4.0 | 5.0 | 8.0 | 14.0 | 1.000 | 0.996 | 0.000 | |
| PCS+PCIS | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.000 | 0.997 | 0.000 | 16.0 | 18.0 | 18.0 | 19.0 | 26.0 | 1.000 | 0.996 | 0.000 | |
| CVMS+KIF | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.000 | 0.997 | 0.000 | 44.0 | 101 | 253 | 356 | 555 | 0.350 | 0.996 | 0.475 |
| Case (1c): uniform slicing, R = 3 | |||||||||||||||||
| (200,1000) | DCS | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.000 | 0.997 | 0.000 | ||||||||
| IGS | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.000 | 0.997 | 0.000 | |||||||||
| CVMS+CVIS | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.000 | 0.997 | 0.000 | 6.0 | 12.0 | 15.0 | 22.0 | 37.0 | 1.000 | 0.996 | 0.000 | |
| PCS+PCIS | 3.0 | 3.0 | 3.0 | 3.0 | 4.0 | 1.000 | 0.997 | 0.000 | 10.0 | 14.0 | 17.0 | 21.0 | 22.0 | 1.000 | 0.996 | 0.000 | |
| CVMS+KIF | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.000 | 0.997 | 0.000 | 89.0 | 128 | 1242 | 1711 | 2072 | 0.350 | 0.996 | 0.325 | |
| (500,1000) | DCS | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.000 | 0.997 | 0.000 | ||||||||
| IGS | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.000 | 0.997 | 0.000 | |||||||||
| CVMS+CVIS | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.000 | 0.997 | 0.000 | 3.0 | 6.0 | 13.0 | 20.0 | 34.0 | 1.000 | 0.996 | 0.000 | |
| PCS+PCIS | 3.0 | 3.0 | 3.0 | 3.0 | 4.0 | 1.000 | 0.997 | 0.000 | 10.0 | 16.0 | 17.0 | 19.0 | 30.0 | 1.000 | 0.996 | 0.000 | |
| CVMS+KIF | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.000 | 0.996 | 0.000 | 13.0 | 18.0 | 54.0 | 1088 | 1945 | 1.000 | 0.996 | 0.000 | |
Example 2.
Consider data with a binary response and categorical predictors. For the ith observation, Yi is generated from two settings: (a) , , and (b) and . Conditional on Yi, the predictors are generated under the following cases.
Case (2a): (binary) for j = 1, 3 and k = 1, 2.
for , where , and for the other k and j.
Case (2b): (3-level categorical) for , , and .
for and . For , for . Detailed values for are presented in Table S2 of the Supplementary Materials.
Case (2c): (continuous) , for .
for and . Other Xj’s for follow a standard normal distribution.
For Case (2a) and Case (2c), , , and for Case (2b), and .
For Case (2b), we find that PCS has highly unsatisfactory performance with main effects. As a “remedy”, we consider CVMS+PCIS, which adopts the proposed CV filter for main-effect screening, as opposed to PCS+PCIS. Results are summarized in Tables S3–S6 in the Supplementary Materials. The overall findings are similar to those of Example 1. Specifically, as the number of categories of the predictors increases, the performance of all methods deteriorates slightly. Better performance is observed when the sample size increases. For main-effect screening, all methods perform well with binary predictors. With 3-level predictors, DCS, PCS, and IGS fail: they may miss important main effects even when the sample size is large. In comparison, CVMS consistently has higher coverage rates and smaller MMS values. For interaction screening, KIF and IIS break down, while PCIS and CVIS perform reasonably well. This is expected, since IIS requires the predictors to be sub-Gaussian to enjoy the sure screening property. CVIS performs slightly better in almost all settings. When the sample size is small, PCIS tends to have larger MMS and lower coverage rates as the number of predictor categories increases. As expected, the difference is more obvious with imbalanced data. For Case (2c), where the predictors are continuous, we further include two other main-effect screening methods for comparison, namely the mean-variance based sure independence screening (MVS, [16]) and the fused Kolmogorov filter (FKF, [15]). To apply CVMS, we dichotomize each continuous predictor at its median. Results are provided in Tables S5–S6 in the Supplementary Materials, where we again observe the superiority of the proposed approach. Compared to DCS, IGS, PCS, MVS, and FKF, CVMS is either the best or among the best. Its performance can be further improved by using three or more categories, as described in Remark 2.
In addition, CVIS has much smaller model sizes at high quantiles and higher probabilities of including all active interactions, especially under the imbalanced design and when the number of the predictors is large but the sample size is small.
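To make the dichotomization-plus-screening step above concrete, the following Python sketch implements median dichotomization together with a generic entropy-reduction screen. This is an illustration of entropy-based marginal screening in the spirit of CVMS, not the paper's exact CV filter (whose definition is not reproduced in this section); all function names are ours.

```python
import numpy as np

def dichotomize_at_median(X):
    """Split each continuous column at its median (used before applying CVMS)."""
    return (X > np.median(X, axis=0)).astype(int)

def conditional_entropy(y, x):
    """Empirical conditional entropy H(y | x) for discrete y and x (natural log)."""
    h = 0.0
    for v in np.unique(x):
        mask = x == v
        _, counts = np.unique(y[mask], return_counts=True)
        p = counts / counts.sum()
        h -= mask.mean() * (p * np.log(p)).sum()
    return h

def entropy_screen(X, y, d):
    """Rank discrete predictors by the entropy reduction H(y) - H(y | x_j),
    i.e., their mutual information with y, and keep the indices of the top d."""
    _, cy = np.unique(y, return_counts=True)
    py = cy / cy.sum()
    hy = -(py * np.log(py)).sum()
    scores = np.array([hy - conditional_entropy(y, X[:, j]) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:d]
```

A predictor that determines the response attains the maximal score H(y), so it is ranked first; noise predictors score near zero.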
Example 3.
(Generalized linear model) We simulate from the logistic model:
We consider two different cases for .
Case (3a): (continuous predictors) , where has off-diagonal entries being 0. The first and third diagonal entries are 2 and 4, respectively, and the other diagonal entries are 1.
Case (3b): (a mixture of continuous and categorical predictors) For , Xj is independently generated from . And for , Xj follows a Bernoulli distribution with , and the others are zero. The active interaction term, , is a product of a binary predictor and a continuous one.
Under this example, and . This model has a relatively simple structure. Here, we do not adopt the two-stage strategy and instead resort directly to interaction screening. With CVIS and PCIS, all the continuous predictors are converted to binary via dichotomizing at the medians. Results are presented in Table S7 in the Supplementary Materials. The patterns of the findings are comparable to those above. The proposed CVIS is able to separate the true nonzero effects from the zeros with high accuracy. It outperforms IIS in terms of . It is comparable to KIF in terms of MMS and slightly outperforms PCIS. In terms of CZ, it is superior to the other two approaches, which have more false discoveries.
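As an illustration of direct interaction screening on dichotomized predictors, the sketch below scores every pair by the mutual information between the response and the product of the two median-dichotomized predictors, and returns the top-ranked pairs. The mutual-information score is a stand-in for the CVIS statistic, whose exact form is not reproduced here; names are illustrative.

```python
import numpy as np
from itertools import combinations

def mutual_information(y, x):
    """Empirical mutual information I(y; x) for discrete y and x (natural log)."""
    mi = 0.0
    for v in np.unique(x):
        for u in np.unique(y):
            p_xy = np.mean((x == v) & (y == u))
            if p_xy > 0:
                mi += p_xy * np.log(p_xy / (np.mean(x == v) * np.mean(y == u)))
    return mi

def screen_interactions(X, y, d):
    """Score every pair (j, k) by I(y; x_j * x_k) computed on median-dichotomized
    predictors, and return the d top-ranked pairs."""
    Xb = (X > np.median(X, axis=0)).astype(int)
    pairs = list(combinations(range(X.shape[1]), 2))
    scores = [mutual_information(y, Xb[:, j] * Xb[:, k]) for j, k in pairs]
    order = np.argsort(scores)[::-1][:d]
    return [pairs[i] for i in order]
```

For p predictors this loops over all p(p-1)/2 pairs; the two-stage strategy in the paper exists precisely to avoid this full enumeration when p is ultrahigh.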
Example 4.
(Transformation model for a censored response) Yi is generated from the transformation model:
where Xi is the p-dimensional vector of predictors, contains all two-way interactions, and . The predictors are generated from a multivariate normal distribution with marginal means 0 and covariance Σ with . The censoring variable Ci is generated from a uniform distribution on [0, 7], and the censoring rate is around 15%. We set and . Thus, and . For this example, we compare against IPCW-tau [35] for main-effect screening, and against PC-IPCW-tau [36], PCIS, and KIF for interaction screening. In addition, we also treat all main effects and interactions “equally” and apply IPCW-tau. When applying the IPCW-CV filters, we equally discretize the response and continuous predictors into three categories. The results are summarized in Table S8 in the Supplementary Materials. We observe similar superior performance of the proposed approach. It is also noted that the performance of the proposed approach does not seem to depend strongly on censoring.
4. Analysis of TCGA data
We analyze data on lung adenocarcinoma (LUAD). The dataset is obtained from The Cancer Genome Atlas (TCGA, https://cancergenome.nih.gov/). TCGA has published high-quality omics and clinical data on multiple cancer types. The TCGA LUAD data has been analyzed in multiple studies, and both main effects and interactions have been examined [38, 28]. We refer to the TCGA website and existing literature for information on study design and data collection. Multiple types of omics data are available. Here we analyze mRNA gene expressions, which have been considered in multiple interaction analyses. To demonstrate the broad applicability of the proposed approach, we consider both censored survival and categorical response variables. In the original dataset, the expression values of 19,559 genes are available. Although in principle the proposed approach can be directly used, to generate more reliable results with a limited sample size, we first conduct a moderate unsupervised screening and retain the 5,000 genes with the largest marginal variances. Thus, in the following analysis, there are 5,000 candidate main effects and 12,497,500 possible second-order interactions. Such a dimensionality is considerably higher than in most published studies. Some demographic/clinical variables are also available. We focus on gene expressions but note that the proposed approach can be potentially coupled with conditional screening to accommodate these additional variables.
4.1. Analysis of censored overall survival
We first consider overall survival, which is subject to right censoring. Among the 493 subjects, 178 have observable survival times, and the remaining 315 are censored. The observed survival times range from 0.13 to 165.37 months, with a median of 20.52 months. The censoring times range from 0.37 to 241.6 months, with a median of 22.32 months.
For the proposed approach, we first equally discretize the survival outcome into two categories and take a uniform slicing to partition each gene expression measurement into three slices. The proposed screening leads to 79 main effects and 158 interactions. Detailed results are shown in Table 2. A quick literature search suggests that many of the retained genes have sound biological implications. For instance, gene DDX59 has been identified to promote DNA replication in lung adenocarcinoma. SOD3 re-expression in tumor-associated endothelial cells increases doxorubicin delivery into, and the chemotherapeutic effect on, tumors. CSAG2 has been found to be necessary and sufficient to drive cell and tumor growth. Gene MAGEA4 is overexpressed and can serve as an immunotherapy target in various malignant tumors, including non-small cell lung cancer. TENM1 has been identified in vertebrates, coding for membrane proteins that are mainly involved in embryonic and neuronal development. CENATAC depletion or expression of disease mutants results in excessive retention of AT-AC minor introns in about 100 genes enriched for nucleocytoplasmic transport and cell cycle regulators, and causes chromosome segregation errors. We acknowledge that those in Table 2 are not the final identification results. However, the highly sensible candidates can still provide some support for the validity of the proposed approach.
Table 2:
Analysis of censored overall survival: 79 genes identified by CVMS and 158 interactions identified by CVMS+CVIS.
| CVMS-main effects | CVMS+CVIS-interactions | |||
|---|---|---|---|---|
| DDX59 | PERM1 | CSAG2-ELOVL4 | MAGEA4-CNTN1 | MNDA-CPVL |
| CPVL | NOTUM | CSAG2-CARD14 | MAGEA4-LGR5 | PAGE2-SLC40A1 |
| TNFRSF11B | RHBDL1 | CSAG2-ALDH1L2 | PAGE2-MAL | CD4-CPVL |
| MYOZ1 | RAB36 | CSAG2-TDO2 | MAGEA4-ACKR3 | GUCA2B-ACKR3 |
| TDO2 | ATP8B2 | VCX3A-CENATAC | GUCA2B-LGR5 | GUCA2B-TENM1 |
| GPD1L | FAM83A | CSAG2-ACKR3 | TLR4-CPVL | TAC1-ELOVL4 |
| CENATAC | HAVCR1 | CSAG2-MYOZ1 | PAGE2-CARD14 | GUCA2B-EHF |
| RNF213 | PLAC8 | CSAG2-TNFRSF11B | ATG16L2-CENATAC | HOXD13-CARD14 |
| ACKR3 | CPXM2 | CSAG2-CNTN1 | SEC31B-CENATAC | PAGE2-SOD3 |
| EHF | COL18A1 | CSAG2-CPVL | BEX2-BEX4 | HOXD13-CNTN1 |
| CNTN1 | SLC1A3 | VCX3A-ELOVL4 | SLCO2B1-CPVL | GUCA2B-MYOZ1 |
| MAL | GLRB | CSAG2-CCDC184 | MAGEA4-CARD14 | GUCA2B-DAAM2 |
| TENM1 | CELSR1 | VCX3A-TENM1 | PAGE2-CPVL | GOLGA8B-CENATAC |
| SOD3 | NPIPA5 | CSAG2-EHF | MAGEA4-MAL | GUCA2B-ALDH1L2 |
| CARD14 | ACSS3 | CSAG2-DAAM2 | GUCA2B-CARD14 | NPY-LGR5 |
| LGR5 | OGT | VCX3A-CCDC184 | GUCA2B-BEX4 | GUCA2B-CENATAC |
| ELOVL4 | C15orf48 | VCX3A-MYOZ1 | GUCA2B-MAL | P2RY13-CPVL |
| BEX4 | CABCOCO1 | HOXD13-ELOVL4 | MAGEA4-GPD1L | GUCA2B-ELOVL4 |
| CCDC184 | AHNAK | CSAG2-GPD1L | MAGEA4-TENM1 | GUCA2B-SLC40A1 |
| ALDH1L2 | RACGAP1 | CSAG2-BEX4 | PAGE2-BEX4 | GUCA2B-GPD1L |
| DAAM2 | CRISPLD2 | VCX3A-CNTN1 | PAGE2-ALDH1L2 | TLR8-CPVL |
| RGS20 | CDK14 | CSAG2-SOD3 | MAGEA4-TDO2 | MAGEB2-ALDH1L2 |
| ARHGEF26 | SDSL | CSAG2-CENATAC | MAGEA4-MYOZ1 | FAM193B-CENATAC |
| ABCA7 | LIMS2 | VCX3A-GPD1L | PAGE2-TDO2 | CD84-CPVL |
| SLC22A18 | ENPP5 | VCX3A-CARD14 | GUCA2B-TNFRSF11B | GUCA2B-TDO2 |
| AGAP9 | SLC16A8 | VCX3A-MAL | HOXD13-CCDC184 | PAGE2-DAAM2 |
| ZMYND12 | VCX3A-RNF213 | GUCA2B-CCDC184 | NPY-ELOVL4 | |
| TONSL | MAGEA4-ELOVL4 | MAGEA4-DAAM2 | AIF1-CPVL | |
| GUCY1A1 | VCX3A-CPVL | MAGEA4-BEX4 | MAGEB2-TDO2 | |
| MYO5C | VCX3A-BEX4 | MAGEA4-SOD3 | HOXD13-CPVL | |
| TNFSF4 | VCX3A-ALDH1L2 | MAGEA4-RNF213 | HOXD13-ACKR3 | |
| PRLR | MAGEA4-CPVL | PAGE2-TNFRSF11B | HOXD13-SOD3 | |
| BOK | VCX3A-ACKR3 | PAGE2-GPD1L | LMNTD2-CENATAC | |
| SLC9A5 | VCX3A-TNFRSF11B | HOXD13-LGR5 | HOXD13-TNFRSF11B | |
| GPR143 | VCX3A-TDO2 | TLR7-CPVL | GUCA2B-SOD3 | |
| USP27X | VCX3A-EHF | LENG8-CENATAC | CSF1R-CPVL | |
| RFTN1 | VCX3A-SOD3 | PAGE2-CCDC184 | NPY-CNTN1 | |
| TRIB3 | CCNL2-CENATAC | MAGEA4-EHF | NCKAP1L-CPVL | |
| WDHD1 | VCX3A-DAAM2 | HOXD13-TENM1 | HOXD13-BEX4 | |
| OSBPL6 | VCX3A-SLC40A1 | TAC1-BEX4 | MAGEB2-CARD14 | |
| C8B | PAGE2-CENATAC | PAGE2-RNF213 | NPY-BEX4 | |
| HOXB3 | PABPC1L-CENATAC | PAGE2-MYOZ1 | NPY-MYOZ1 | |
| CBLC | MAGEA4-TNFRSF11B | PAGE2-CNTN1 | CASP14-CARD14 | |
| GLB1L2 | DDX39B-CENATAC | GUCA2B-CNTN1 | CSAG3-ELOVL4 | |
| GPX3 | TTLL3-CENATAC | PAGE2-LGR5 | HOXD13-EHF | |
| WSB1 | MAGEA4-CCDC184 | PAGE2-EHF | HOXD13-SLC40A1 | |
| ARG2 | MS4A6A-CPVL | PAGE2-ACKR3 | TAC1-ACKR3 | |
| ELMO3 | MAGEA4-ALDH1L2 | GUCA2B-RNF213 | TAC1-TENM1 | |
| BST1 | NPY-CCDC184 | MAGEA4-SLC40A1 | NPY-ACKR3 | |
| BOP1 | HOXD13-ALDH1L2 | MAGEA4-CENATAC | IQGAP2-CPVL | |
| CDC42BPG | PAGE2-ELOVL4 | TAC1-CCDC184 | HOXD13-TDO2 | |
| SYT8 | MS4A7-CPVL | CSAD-CENATAC | NEUROD1-LGR5 | |
| PDPN | PAGE2-TENM1 | GUCA2B-CPVL | ||
Analysis is also conducted using the alternative approaches. A summary of the comparisons is provided in Table 3. The differences are quantified using the numbers of overlapping effects as well as RV coefficients [37]. The RV coefficient measures the similarity of two data matrices and ranges between 0 and 1, with a larger value indicating a higher overlap in information (contained in two sets of main effects or interactions). It is observed that the proposed approach identifies considerably different sets of main effects and interactions from the alternatives. However, the amount of overlapping information, as measured by the RV coefficient, is moderate to high, which is reasonable, as different genes can contain similar information.
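The RV coefficient has a standard closed form: for column-centered matrices X and Y on the same samples, RV = tr(S_xy S_yx) / sqrt(tr(S_xx^2) tr(S_yy^2)), where S denotes the cross-product matrices. A minimal sketch (the function name is ours):

```python
import numpy as np

def rv_coefficient(X, Y):
    """RV coefficient between two data matrices sharing the same rows (samples).
    Ranges in [0, 1]; larger values indicate a higher overlap in information."""
    Xc = X - X.mean(axis=0)          # column-center both matrices
    Yc = Y - Y.mean(axis=0)
    Sxy = Xc.T @ Yc                  # cross-product matrix
    Sxx = Xc.T @ Xc
    Syy = Yc.T @ Yc
    num = np.trace(Sxy @ Sxy.T)
    den = np.sqrt(np.trace(Sxx @ Sxx) * np.trace(Syy @ Syy))
    return num / den
```

The coefficient is invariant to shifting and rescaling either matrix, so RV(X, 2X + 5) = 1, which is why it is used here to compare information content rather than raw values.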
Table 3:
Numbers of main effects and interactions identified by different approaches (diagonal elements) and their overlaps (off-diagonal). RV coefficients in “()”.
| Overall survival | Approach | CVMS+CVIS | PCS+PCIS | IGS+CVIS | CVMS+PCIS | IGS+PCIS |
|---|---|---|---|---|---|---|
| Main effects | CVMS+CVIS | 79 | 1(0.561) | 1(0.563) | ||
| PCS+PCIS | 79 | 78(0.999) | ||||
| IGS+CVIS | 79 | |||||
| Interaction | CVMS+CVIS | 158 | 0(0.714) | 0(0.775) | 49(0.939) | 0(0.714) |
| PCS+PCIS | 158 | 85(0.961) | 93(0.888) | 158(0.961) | ||
| IGS+CVIS | 158 | 72(0.990) | 109(0.961) | |||
| CVMS+PCIS | 158 | 93(0.888) | ||||
| IGS+PCIS | 158 | |||||
| Stage | ||||||
| Main effects | CVMS+CVIS | 76 | 14(0.339) | 14(0.350) | ||
| PCS+PCIS | 76 | 71(0.977) | ||||
| IGS+CVIS | 76 | |||||
| Interaction | CVMS+CVIS | 152 | 0(0.125) | 0(0.707) | 4(0.192) | 0(0.128) |
| PCS+PCIS | 152 | 1(0.200) | 14(0.939) | 92(0.996) | ||
| IGS+CVIS | 152 | 0(0.199) | 1(0.210) | |||
| CVMS+PCIS | 152 | 42(0.927) | ||||
| IGS+PCIS | 152 | |||||
As in some published studies [32, 38], we conduct downstream analysis to further examine the effect of screening. More specifically, (a) we randomly split the data into a training set of size 393 and a testing set of size 100; (b) with the training set, the proposed and alternative screenings are conducted; (c) with the obtained main effects and interactions, we apply a penalization method, which can identify the important main effects and interactions in a joint interaction analysis model and respects the variable-selection hierarchy. In this step of the analysis, we adopt the Cox model. This may have a “conflict” with the proposed model-free spirit. Extending the proposed approach to joint modeling is highly nontrivial and will not be pursued here; (d) the training set model is then used for prediction with the testing set samples. We adopt the C-statistic to evaluate prediction performance. The C-statistic has range [0, 1], with a larger value indicating better prediction; and (e) Steps (a)-(d) are repeated 200 times. The average C-statistics are 0.7810 (CVMS), 0.7605 (PCS), 0.7733 (IGS), 0.8325 (CVMS+CVIS), 0.8019 (PCS+PCIS), 0.8129 (CVMS+PCIS), 0.8032 (IGS+CVIS), and 0.8078 (IGS+PCIS), which provides “indirect” support for the superiority of the proposed approach. To comprehend this result more intuitively, in Figure S3 (Supplementary Materials), we present the Kaplan-Meier curves for one random split. The two groups are generated by dichotomizing the predicted risk scores at the median. We see that the difference between the good and bad survival groups is larger under the proposed approach (and the corresponding p-value is smaller).
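The C-statistic in Step (d) is the usual Harrell-type concordance index for right-censored data: among comparable pairs (the subject with the shorter observed time had an event), it counts the fraction in which that subject also received the higher predicted risk. A minimal O(n^2) sketch, with naming of our own choosing:

```python
def c_statistic(time, event, risk):
    """Harrell-type concordance index for right-censored survival data.
    time:  observed times; event: 1 = event observed, 0 = censored;
    risk:  predicted risk scores (higher risk should mean shorter survival)."""
    concordant, comparable = 0.0, 0
    n = len(time)
    for i in range(n):
        for j in range(n):
            # pair is comparable only if subject i failed before time[j]
            if time[i] < time[j] and event[i] == 1:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1
                elif risk[i] == risk[j]:
                    concordant += 0.5   # ties in risk count as half
    return concordant / comparable
```

A value of 0.5 corresponds to random prediction and 1 to perfect ranking, which is the scale on which the averages 0.76-0.83 above should be read.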
4.2. Analysis of categorical stage
In this set of analyses, the outcome variable is the pathological stage. In the original data, there are nine stages: Stage I, Stage IA, Stage IB, Stage II, Stage IIA, Stage IIB, Stage IIIA, Stage IIIB, and Stage IV. With a limited sample size, to avoid small counts, we combine them into Stages I, II, III, and IV, which have sample sizes 270, 119, 81, and 26, respectively.
The proposed screening leads to 76 main effects and 152 interactions. Details are provided in Table 4. Similar to the above subsection, we observe that many of the screened genes have sound biological implications. For example, CSAG2 and MAGEA4 have been found to play critical roles in cancer development. LIN28B has been reported to be highly expressed during embryogenesis but silent in most adult tissues; it can block the maturation of the tumor suppressor microRNA let-7 family and mediates diverse biological functions. GUCA2B has been suggested as a susceptibility gene for essential hypertension. UGT1A10 is expressed exclusively in extrahepatic tissues, where it is a highly active and important enzyme. MAGEA1 is a promising candidate marker for LUAD therapy, and MAGEA1-specific CAR-T cell immunotherapy may be an effective strategy for the treatment of MAGEA1-positive LUAD. In Table 3, we summarize the comparison between the proposed and alternative screenings. The overall pattern is similar to that for overall survival. The random split-based evaluation is conducted in a manner similar to the previous subsection. The difference is that in Step (c), a logistic model is fit. Accordingly, we use classification error as the criterion for comparison. With 200 random splits, the average classification accuracy (1 - error) values are 0.770 (CVMS), 0.753 (PCS), 0.746 (IGS), 0.786 (CVMS+CVIS), 0.769 (PCS+PCIS), 0.760 (IGS+PCIS), 0.764 (IGS+CVIS), and 0.778 (CVMS+PCIS), which again suggests the superiority of the proposed approach.
Table 4:
Analysis of categorical stage: 76 genes identified by CVMS and 152 interactions identified by CVMS+CVIS.
| CVMS-main effects | CVMS+CVIS-interactions | |||
|---|---|---|---|---|
| CSAG2 | ZIC1 | CSAG2-CSAG3 | PSG4-CASP14 | PSG4-STRA8 |
| PSG4 | PDX1 | CSAG2-MAGEA4 | LIN28B-HOXD13 | PSG4-REG1A |
| LIN28B | NLGN4Y | CSAG2-MAGEB2 | DPPA2-PAGE2 | GAGE2A-CSAG2 |
| VCX3A | TFAP2D | CSAG2-MAGEA6 | LIN28B-REG1A | VCX3A-REG1A |
| MAGEA4 | DPYSL5 | CSAG2-MAGEA1 | LIN28B-MAGEA1 | PSG4-UGT1A10 |
| PAGE2 | PHGR1 | VCX3A-VCX | DPPA2-VCX3A | GAGE2A-DLK1 |
| GUCA2B | TM4SF5 | CSAG2-PSG4 | LIN28B-PIWIL3 | PAGE2-VCX3A |
| NPY | SCGN | CSAG2-PAGE2B | DPPA2-MAGEA4 | GAGE2A-DEFB4A |
| HOXD13 | LIPK | CSAG2-PAGE2 | DPPA2-PAGE2B | CSAG2-ZNF560 |
| UGT1A10 | ALX1 | CSAG2-LIN28B | LIN28B-IRX4 | MAGEA6-MAGEA12 |
| TAC1 | APOBEC1 | CSAG2-CASP14 | VCX3A-TAC1 | GAGE2A-PIWIL3 |
| MAGEB2 | C1orf21 | CSAG2-MAGEA12 | GAGE2A-MAGEA4 | GAGE2A-MAGEB2 |
| CASP14 | NSG1 | CSAG2-MAGEA10 | DPPA2-TAC1 | PAGE2-TAC1 |
| SOX14 | GPR160 | CSAG2-MAGEC2 | GAGE2A-UGT1A10 | LIN28B-ZNF560 |
| CSAG3 | MEOX1 | CSAG2-VCX3A | LIN28B-PAGE2B | VCX3A-IRX4 |
| MAGEA1 | GMNC | CSAG2-NPY | LIN28B-CSAG2 | ZFY-NLGN4Y |
| CGB5 | CDC25C | MAGEA6-MAGEA3 | GAGE2A-REG1A | CSAG2-PAGE5 |
| PIWIL3 | PIMREG | LIN28B-TAC1 | DPPA2-REG1A | DPPA2-LIN28B |
| PRR20G | CDT1 | VCX3A-PAGE2 | PSG4-PAGE2B | CSAG2-HOXC12 |
| PAGE2B | STAP1 | GAGE2A-VCX3A | MAGEA4-CSAG2 | DPPA2-CASP14 |
| PRAC2 | ADSS1 | MAGEA3-MAGEA6 | GAGE2A-LIN28B | VCX3A-MAGEA4 |
| MAGEA6 | RNASE1 | GAGE2A-PSG4 | DPPA2-NPY | GAGE2A-GUCA2B |
| HOXC12 | PTGDS | GAGE2A-TAC1 | MAGEA4-MAGEA10 | PSG4-DEFB4A |
| SST | CLUL1 | CSAG2-PRR20G | DPPA2-UGT1A10 | LIN28B-UGT1A10 |
| MAGEC2 | SMURF2 | CSAG2-PIWIL3 | PSG4-NPY | GUCA2B-UGT1A10 |
| SLC10A2 | GAGE2A-PAGE2 | LIN28B-PAGE2 | LIN28B-PSG4 | |
| IRX4 | CSAG2-HOXD13 | CSAG2-DLK1 | LIN28B-DLK1 | |
| REG1A | CSAG3-CSAG2 | GAGE2A-STRA8 | NPY-TAC1 | |
| STRA8 | GAGE2A-PAGE2B | LIN28B-MAGEB2 | VCX3A-NPY | |
| MAGEA10 | PSG4-CGB5 | PSG4-PIWIL3 | MAGEA4-CASP14 | |
| LCN15 | CSAG2-STRA8 | LIN28B-GUCA2B | CSAG2-DEFB4A | |
| VCX | CSAG2-UGT1A10 | DPPA2-CGB5 | LIN28B-CGB5 | |
| MAGEC1 | CSAG2-MAGEA3 | CSAG2-VCX | LIN28B-PRR20G | |
| DEFB4A | LIN28B-MAGEA4 | DPPA2-STRA8 | GAGE2A-CSAG3 | |
| NR0B1 | CSAG2-MAGEC1 | DPPA2-GUCA2B | MAGEA4-REG1A | |
| SPRR2F | VCX3A-STRA8 | LIN28B-HOXC12 | VCX3A-MAGEA10 | |
| MAGEA3 | GAGE2A-CGB5 | VCX3A-PSG4 | MAGEA4-UGT1A10 | |
| WFDC5 | DPPA2-PSG4 | LIN28B-MAGEA10 | LIN28B-STRA8 | |
| PAGE5 | PSG4-TAC1 | PSG4-PAGE2 | PSG4-MAGEA4 | |
| DLK1 | LIN28B-VCX3A | LIN28B-NPY | PSG4-HOXD13 | |
| ACTL8 | CSAG2-TAC1 | MAGEA4-MAGEA6 | VCX3A-DLK1 | |
| MAGEA12 | PSG4-VCX3A | PSG4-DLK1 | GAGE2A-MAGEA1 | |
| HOXA13 | CSAG2-IRX4 | MAGEA4-VCX3A | VCX3A-ZNF560 | |
| GP2 | MAGEA4-MAGEA1 | DPPA2-PIWIL3 | MAGEB2-CSAG2 | |
| TFF2 | PAGE2-PAGE2B | LIN28B-CASP14 | DDX3Y-NLGN4Y | |
| UGT2B11 | CSAG2-GUCA2B | DPPA2-IRX4 | GAGE2A-PAGE5 | |
| ETNPPL | GAGE2A-CASP14 | GAGE2A-PRR20G | VCX3A-CASP14 | |
| SPRR2A | CSAG2-REG1A | DPPA2-DLK1 | FTHL17-UGT1A10 | |
| ZNF560 | GAGE2A-IRX4 | MAGEA4-PAGE2 | VCX3A-PIWIL3 | |
| KRT75 | VCX3A-PAGE2B | CSAG2-CGB5 | CSAG2-ZIC1 | |
| INSL4 | GAGE2A-NPY | PSG4-IRX4 | ||
5. Discussion
In this article, we have developed a new marginal screening approach. Although marginal screening is not a new topic, with the increasing resolution of profiling (and hence increasing dimensionality), it still plays an essential role in data analysis, and there is still a strong demand for more effective screening methods. This study advances beyond many existing studies by focusing on interactions, whose significance is increasingly recognized. The proposed approach is based on Shannon’s information theory, whose applications to screening remain limited. It can flexibly accommodate different types of distributions of the response and predictors under one unified framework. It has the much-desired robustness properties not shared by model-based and many other approaches. The theoretical development has provided a uniquely strong basis, and the numerical studies have convincingly established its practical superiority.
For convenience, as in the literature, we have employed a hard-thresholding cutoff in each stage to retain a fixed number of predictors. It is possible to determine dn1 and dn2 in a more data-dependent manner. First consider . Let be a permutation of such that . We adopt the maximum ratio criterion [14], with which . Asymptotically, it can be proved that is when and , since can be arbitrarily small. Here, . However, this criterion can be unstable with very large or very small , when there are predictors with very strong or weak effects [18]. To remedy this problem, a resampling-based method can be adopted, and can be restricted to be smaller than a user-specified constant. This technique proceeds as follows: (i) generate B bootstrap samples; (ii) calculate the CV filters for each bootstrap sample; for the ith bootstrap sample, the CV estimates are ordered from largest to smallest, and we calculate ; (iii) obtain . Similar discussions hold for . Given the satisfactory performance of the hard cutoffs, we do not pursue this computationally more expensive determination in this article.
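The maximum ratio criterion described above admits a direct implementation: order the screening statistics decreasingly and cut at the largest ratio of consecutive ordered statistics. A sketch under the assumption of strictly positive statistics (the resampling-based stabilization is omitted; the function name is ours):

```python
import numpy as np

def max_ratio_cutoff(stats):
    """Return the number of predictors to retain by the maximum ratio criterion:
    sort the screening statistics decreasingly and cut where the ratio of
    consecutive ordered statistics w_(k)/w_(k+1) is largest.
    Assumes all statistics are strictly positive."""
    w = np.sort(np.asarray(stats, dtype=float))[::-1]
    ratios = w[:-1] / w[1:]          # w_(1)/w_(2), ..., w_(p-1)/w_(p)
    return int(np.argmax(ratios)) + 1
```

For example, with statistics 10, 9, 8 for the active predictors and 0.1, 0.05 for the noise, the largest jump occurs between the third and fourth ordered statistics, so three predictors are retained. The instability noted above arises exactly when one ratio among the weak statistics happens to dominate.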
This study can be potentially extended in multiple ways. The proposed approach has been designed for interactions between predictors of the same type. In omics studies, this amounts to gene-gene interactions. It will be almost straightforward to extend the proposed CV filters to gene-environment interactions, which involve two different types of predictors. We have focused on two-way interactions. Higher-order interactions are statistically meaningful; however, they still have very limited practical applications under high-dimensional settings. We have focused on screening. It may be of interest to further develop joint interaction modeling also based on Shannon’s information theory, so that the overall analysis, consisting of screening and joint modeling, can be more coherent.
Supplementary Material
Acknowledgements
We thank the editor and reviewers for their careful review and insightful comments, which have led to a significant improvement of the article. This study has been partly supported by NSFC grants 12001101 and 20YQ18, and NIH CA204120, CA121974, and CA196530.
References
- [1].Moore J, and Williams S. (2009). Epistasis and Its Implications for Personal Genetics. American Journal of Human Genetics, 85(3): 309–320. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [2].Khan A, Dinh DM, Schneider D, Lenski R, and Cooper T. (2011). Negative epistasis between beneficial mutations in an evolving bacterial population. Science, 332(6034), 1193–1196. [DOI] [PubMed] [Google Scholar]
- [3].Yuan M, Joseph VR and Zou H. (2009). Structured variable selection and estimation. Annals of Applied Statistics, 3, 1738–1757. [Google Scholar]
- [4].Choi N, Li W, and Zhu J. (2010). Variable selection with the strong heredity constraint and its oracle property. Journal of the American Statistical Association, 105, 354–364. [Google Scholar]
- [5].Bien J, Taylor J, and Tibshirnani R. (2013). A LASSO for hierarchical interactions. The Annals of Statistics, 41, 1111–1141. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [6].Fan Y, Kong Y, Li D, and Zheng Z. (2015). Innovated interaction screening for high-dimensional nonlinear classification. The Annals of Statistics, 43(3), 1243–1272. [Google Scholar]
- [7].Yan J, Risacher S, Shen L, and Andrew S. (2018). Network approaches to systems biology analysis of complex disease: integrative methods for multi-omics data. Briefings in Bioinformatics, 19(6), 1370–1381. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [8].Hao N, and Zhang H. (2017). A Note on High-Dimensional Linear Regression With Interactions. The American Statistician, 71(4), 291–297 [Google Scholar]
- [9].Fan J, Feng Y, and Song R. (2011). Nonparametric Independence Screening in Sparse Ultra-high Dimensional Additive Models. Journal of the American Statistical Association, 106, 544–557. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [10].Liu J, Li R, and Wu R. (2014). Feature Selection for Varying Coefficient Models with Ultrahigh Dimensional Covariates. Journal of the American Statistical Association, 109, 266–274. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [11].He X, Wang L, and Hong H. (2013). Quantile-Adaptive Model-Free Variable Screening for High-Dimensional Heterogeneous Data. The Annals of Statistics, 41, 342–369. [Google Scholar]
- [12].Li G, Peng H, Zhang J, and Zhu L. (2012). Robust Rank Correlation Based Screening. The Annals of Statistics, 40, 1846–1877. [Google Scholar]
- [13].Li R, Zhong W, and Zhu L. (2012). Feature Screening Via Distance Correlation Learning. Journal of American Statistical Association, 107, 1129–1139. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [14].Huang D, Li R, and Wang H. (2014). Feature screening for ultrahigh dimensional categorical data with applications. Journal of Business & Economic Statistics, 32(2), 237–244. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [15].Mai Q, and Zou H. (2015). The fused Kolmogorov filter: a nonparametric model-free screening method. The Annals of Statistics, 43(4), 1471–1497. [Google Scholar]
- [16].Cui H, Li R, and Zhong W. (2015). Model-free feature screening for ultrahigh dimensional discriminant analysis. Journal of the American Statistical Association, 110(510), 630–641. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [17].Huang J, Horowitz J, and Ma S. (2008). Asymptotic Properties of Bridge Estimators in Sparse High-Dimensional Regression Models. The Annals of Statistics, 36, 587–613. [Google Scholar]
- [18].Ni L, and Fang F. (2016). Entropy-based Model-free Feature Screening for Ultrahigh-dimensional Multiclass Classification. Journal of Nonparametric Statistics, 28(3), 515–530. [Google Scholar]
- [19].Hall P, and Xue. J. (2014). On selecting interacting features from high-dimensional data. Computational Statistics & Data Analysis, 71, 694–708. [Google Scholar]
- [20]. Hao N, and Zhang H. (2014). Interaction screening for ultrahigh-dimensional data. Journal of the American Statistical Association, 109(507), 1285–1301.
- [21]. Li Y, and Liu J. (2019). Robust variable and interaction selection for logistic regression and general index models. Journal of the American Statistical Association, 114(525), 271–286.
- [22]. Dong C, Chu X, Wang Y, et al. (2008). Exploration of gene–gene interaction effects using entropy-based methods. European Journal of Human Genetics, 16, 229–235.
- [23]. Wu X, Jin L, and Xiong M. (2009). Mutual information for testing gene-environment interaction. PLoS One, 4(2), e4578.
- [24]. Fan R, Zhong M, Wang S, Zhang Y, Andrew A, Karagas M, Chen H, Amos C, Xiong M, and Moore J. (2011). Entropy-based information gain approaches to detect and to characterize gene-gene and gene-environment interactions/correlations of complex diseases. Genetic Epidemiology, 35(7), 706–721.
- [25]. Jiang R, Tang W, Wu X, and Fu W. (2009). A random forest approach to the detection of epistatic interactions in case-control studies. BMC Bioinformatics, 10(Suppl 1), S65.
- [26]. O’Hagan S, Wright Muelas M, Day P, Lundberg E, and Kell D. (2018). GeneGini: assessment via the Gini coefficient of reference “housekeeping” genes and diverse human transporter expression profiles. Cell Systems, 6(2), 230–244.e1.
- [27]. Zhao J, Zhou Y, Zhang X, and Chen L. (2016). Part mutual information for quantifying direct associations in networks. Proceedings of the National Academy of Sciences of the United States of America, 113(18), 5130–5135.
- [28]. Xu Y, Wu M, Zhang Q, and Ma S. (2019). Robust identification of gene-environment interactions for prognosis using a quantile partial correlation approach. Genomics, 111, 1115–1123.
- [29]. Pan W. (2009). Asymptotic tests of association with multiple SNPs in linkage disequilibrium. Genetic Epidemiology, 33(6), 497–507.
- [30]. Pan W, and Shen X. (2011). Adaptive tests for association analysis of rare variants. Genetic Epidemiology, 35(5), 381–388.
- [31]. Shi X, Liu J, Huang J, Zhou Y, Xie Y, and Ma S. (2014). A penalized robust method for identifying gene-environment interactions. Genetic Epidemiology, 38(3), 220–230.
- [32]. Wu C, Shi X, Cui Y, and Ma S. (2015). A penalized robust semiparametric approach for gene–environment interactions. Statistics in Medicine, 34(30), 4016–4030.
- [33]. Shannon C. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423.
- [34]. Anzarmou Y, Mkhadri A, and Oualkacha K. (2022). The Kendall interaction filter for variable interaction screening in ultra high dimensional classification problems. Journal of Applied Statistics, published online.
- [35]. Song R, Lu W, Ma S, and Jeng X. (2014). Censored rank independence screening for high-dimensional survival data. Biometrika, 101(4), 799–814.
- [36]. Wang J, and Chen Y. (2020). Interaction screening by Kendall’s partial correlation for ultrahigh-dimensional data with survival trait. Bioinformatics, 36(9), 2763–2769.
- [37]. Escoufier Y. (1973). Le traitement des variables vectorielles [The treatment of vector variables]. Biometrics, 29, 751–760.
- [38]. Wu M, Huang J, and Ma S. (2018). Identifying gene-gene interactions using penalized tensor regression. Statistics in Medicine, 37(4), 598–610.
