Abstract
For many complex business and industry problems, high-dimensional data collection and modeling have been conducted. It has been shown that interactions may have important implications beyond main effects. The number of unknown parameters in an interaction analysis can be larger or much larger than the sample size. As such, results generated from analyzing a single dataset are often unsatisfactory. Integrative analysis, which jointly analyzes the raw data from multiple independent studies, has been conducted in a series of recent studies and shown to outperform single-dataset analysis, meta-analysis, and other multi-datasets analyses. In this study, our goal is to conduct integrative analysis in the context of interaction analysis. For regularized estimation and selection of important interactions (and main effects), we apply a Threshold Gradient Directed Regularization (TGDR) approach. Advancing from the existing studies, the TGDR approach is modified to respect the “main effects, interactions” hierarchy. The proposed approach has an intuitive formulation and is computationally simple and broadly applicable. Simulations and the analyses of financial early warning system data and news-APP recommendation behavior data demonstrate its satisfactory practical performance.
Keywords: High-dimensional data, Integrative analysis, Interaction analysis, TGDR
1. Introduction
In multiple areas of business and industry, the collection and modeling of high-dimensional data have been extensively conducted, searching for important predictors associated with gross domestic product (GDP) growth, percentage changes of stock markets, optimal portfolio allocations, and others. Accumulating evidence has suggested that interactions may have important implications beyond the main effects. Extensive methodological development and data analysis have been conducted [1, 2, 3]. Promising findings have been made for multiple business and industry problems [4, 5].
In high-dimensional interaction analysis, there are two generic paradigms. In marginal analysis, one or a small number of variables are analyzed at a time. In contrast, in joint analysis, a large number of variables are analyzed in a single model. As complex business and industry processes are attributable to the joint effects of multiple factors, joint analysis can be more sensible but, at the same time, more challenging. Consider a dataset with n samples and d predictors. In a joint interaction analysis, the number of unknown parameters is of the order $d^2$. For some business and industry studies, with the high cost of data collection, even with pre-processing, d can still be larger than or comparable to n. As such, often $d^2 \gg n$. There is thus a regularized model estimation problem. In addition, for some specific business/industry outcomes, as most of the interactions and main effects are not expected to be relevant, there is also a selection problem. In the literature, a large number of regularization techniques have been developed for estimation and selection with high-dimensional models [6, 7].
In high-dimensional data analysis, it has been recognized that, with small sample sizes, results generated from analyzing a single dataset are often unsatisfactory. For many business/industry problems of common interest, there are often multiple independent studies with comparable designs, making it possible to pool multiple datasets and increase sample size. There are generally two types of methods for pooling data. The first one is the classic meta-analysis which analyzes each dataset separately and pools summary statistics based on different rules. Among them, the pretest, Stein and Bayes rules have been the common choices in the literature. For example, the Stein rule has been developed for combining regression estimates with multiple low-dimensional datasets [8], combining regression estimates with one high-dimensional dataset under multiple penalized regression models [9] and combining estimates of mean with multiple datasets [10]. The second one is the more recent integrative analysis which jointly analyzes raw data from multiple independent datasets. In a series of recent high-dimensional studies on genetics [11, 12], integrative analysis techniques have been developed and shown to outperform single-dataset analysis, classic meta-analysis, and other multi-datasets methods. Despite its promising successes in the analysis of main effects with high-dimensional data, integrative analysis has not been well conducted in business studies in the context of interaction analysis, which has a much higher dimensionality and has a stronger need to increase sample size by pooling data.
Interaction analysis has unique characteristics, making directly adopting the existing integrative analysis techniques inappropriate. Specifically, in interaction analysis, there is a need to respect the “main effects, interactions” hierarchical structure [1]. Under the strong hierarchy, if an interaction is identified, then both of the corresponding main effects also need to be identified. In contrast, under the weak hierarchy, only one of the corresponding main effects needs to be identified. Directly applying the existing integrative analysis techniques may generate results that violate the hierarchy, causing trouble in interpretation and inference.
Motivated by the need to pool data and increase sample size in interaction analysis and success of integrative analysis in the analysis of main effects, the goal of this study is to conduct integrative interaction analysis. Significantly advancing from the existing interaction analysis, the integrative analysis of multiple independent datasets is conducted, which provides an effective way of increasing sample size and improving analysis results. Advancing from the existing integrative analysis, the more challenging interaction analysis, which has a need for respecting the hierarchy, is conducted. This study also contains a novel development of the TGDR (Threshold Gradient Directed Regularization) technique, which may have independent methodological value. Last but not least, this study extends the promising integrative analysis paradigm, which has been developed in genetics and other scientific fields, to the analysis of business and industry data.
2. Methods
2.1. Interaction analysis with a single dataset
First consider a single dataset with d-dimensional predictors X = (X1, …, Xd)′ and response variable Y. In the literature, multiple frameworks have been developed for interaction analysis. Here we adopt the regression-based framework, which has a solid statistical grounding and a lucid interpretation. Consider the model
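$$
Y \sim \phi\Big(\beta_0 + \sum_{j=1}^{d} \beta_j X_j + \sum_{1 \le j < k \le d} \gamma_{jk} X_j X_k\Big), \tag{1}
$$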
where the form of model ϕ(·) is known, β0 is the intercept, βj’s represent the main effects, and γjk’s represent interactions. In (1), “self interactions” (squared terms) are not included but can be easily added back. Note that model (1) is very generic, and the (data, model) pair can be (continuous data, linear regression model), (categorical/count data, generalized linear model), (survival data, AFT (accelerated failure time) or Cox model), and many others. Denote θ = (β0, …,βj, …,βd,γ12, γ13, …)′ as the vector of all unknown regression coefficients. With n iid subjects, denote l(θ) as the log-likelihood function. Note that l(·) can also be another goodness-of-fit measure.
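To make the dimensionality of model (1) concrete, the following minimal sketch builds the design matrix of main effects and pairwise interactions with NumPy. The function name `interaction_design` is hypothetical and used only for illustration; squared terms are excluded, as in model (1).

```python
import numpy as np
from itertools import combinations

def interaction_design(X):
    """Return [main effects | pairwise products] and the list of index pairs (j, k), j < k."""
    n, d = X.shape
    pairs = list(combinations(range(d), 2))
    if pairs:
        inter = np.column_stack([X[:, j] * X[:, k] for j, k in pairs])
    else:
        inter = np.empty((n, 0))
    return np.hstack([X, inter]), pairs

# Small illustration: d = 4 predictors give 4 main effects + 6 interactions.
X = np.random.default_rng(0).normal(size=(5, 4))
Z, pairs = interaction_design(X)

# With d = 100 there are 100 + 100 * 99 / 2 = 5050 coefficients besides the intercept.
n_params_d100 = interaction_design(np.zeros((2, 100)))[0].shape[1]
```

Even for moderate d, the number of columns grows quadratically, which is why regularized estimation and selection are needed.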
For regularized estimation and selection of important interactions and main effects, we propose using the TGDR technique. Thresholding is a popular technique in high-dimensional analysis, and many other regularization techniques, for example penalization, are closely connected to thresholding. TGDR was first developed for continuous data and linear regression [13] and later extended to other data and model settings. To accommodate interactions, the original TGDR algorithm needs to be revised. Specifically, we propose the following algorithm:
Algorithm I: TGDR for single dataset interaction analysis
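(a) Initialize $t = 0$ and $\theta(t) = 0$, where $\theta(t)$ denotes the estimate of $\theta$ at step $t$;
(b) Update $t = t + 1$. Compute $f = \partial l(\theta)/\partial \theta\,|_{\theta = \theta(t-1)}$, the vector of first-order derivatives. Denote the components of $f$ as $(f_0, \ldots, f_j, \ldots, f_d, f_{12}, f_{13}, \ldots)'$, where $f_j$ and $f_{jk}$ correspond to those of $\beta_j$ and $\gamma_{jk}$, respectively;
(c) Compute the thresholding indicator $g = (g_0, \ldots, g_j, \ldots, g_d, g_{12}, g_{13}, \ldots)'$ of $f$. Specifically,
- (c.1) $g_0 = 1$;
- (c.2) $g_{jk} = I(|f_{jk}| > \tau \max_{u<v} |f_{uv}|)$;
- (c.3) $g_j = I(|f_j| > \tau \max_u |f_u|)$. In addition, if $g_{jk} = 1$, then set $g_j = 1$ and $g_k = 1$;
(d) Update $\theta(t) = \theta(t-1) + \Delta\, g \odot f$, where $\odot$ is the component-wise product and $\Delta$ is the step size;
(e) Repeat Steps (b)-(d) a large number of times. Select the optimal number of iterations $t_{opt}$. The final estimate is $\theta(t_{opt})$. Interactions and main effects that correspond to the nonzero components of $\theta(t_{opt})$ are identified as important.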
Similar to the standard TGDR algorithm, the proposed algorithm starts with a null model. In each step, gradients are computed and used to direct the update. Different from ordinary gradient-based optimization techniques, variables are selected based on the magnitudes of gradients, and only coefficients of the selected variables are updated. There are multiple differences between the proposed and existing TGDR algorithms. First, in Step (c), as interactions and main effects are on different footings, an interaction term is only compared with other interactions, and likewise a main effect is only compared with other main effects. In addition, in the second part of Step (c.3), the proposed algorithm ensures that if an interaction is selected, then the corresponding main effects are also selected. That is, the strong hierarchy is respected. The proposed algorithm can also be revised to respect the weak hierarchy. Specifically, in (c.3), if $g_{jk} = 1$ and $g_j = g_k = 0$, then set $g_j = 1$ if $|f_j| > |f_k|$, and set $g_k = 1$ otherwise.
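The steps of Algorithm I can be sketched as follows for the (continuous data, linear regression) case. This is a minimal illustration rather than the authors' implementation: the function name `tgdr_single`, the use of a least-squares goodness-of-fit function normalized by n, and the default tuning values are assumptions made here for the example.

```python
import numpy as np
from itertools import combinations

def tgdr_single(X, y, tau=0.9, n_steps=200, delta=0.01, hierarchy="strong"):
    """Sketch of Algorithm I with l(theta) = -(1/2n) * sum of squared residuals (assumed)."""
    n, d = X.shape
    pairs = list(combinations(range(d), 2))
    XI = np.column_stack([X[:, j] * X[:, k] for j, k in pairs])
    b0, beta, gamma = 0.0, np.zeros(d), np.zeros(len(pairs))     # step (a): null model
    for _ in range(n_steps):
        resid = y - b0 - X @ beta - XI @ gamma
        f0 = resid.mean()                                        # step (b): gradients
        f_main = X.T @ resid / n
        f_int = XI.T @ resid / n
        # step (c): interactions compete with interactions, main effects with main effects
        g_int = np.abs(f_int) > tau * np.abs(f_int).max()
        g_main = np.abs(f_main) > tau * np.abs(f_main).max()
        for idx in np.flatnonzero(g_int):
            j, k = pairs[idx]
            if hierarchy == "strong":            # both parent main effects are selected
                g_main[j] = g_main[k] = True
            elif not (g_main[j] or g_main[k]):   # weak: select the parent with larger gradient
                g_main[j if abs(f_main[j]) > abs(f_main[k]) else k] = True
        b0 += delta * f0                         # step (d): update selected coefficients only
        beta += delta * g_main * f_main
        gamma += delta * g_int * f_int
    return b0, beta, gamma, pairs

# Toy data (hypothetical coefficients): two main effects and their interaction.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = 2 * X[:, 0] + 2 * X[:, 1] + 3 * X[:, 0] * X[:, 1] + rng.normal(size=200)
b0, beta, gamma, pairs = tgdr_single(X, y, tau=0.8, n_steps=300)
```

Because the hierarchy adjustment is applied to the indicator before the update, any interaction with a nonzero estimate has both of its main effects updated in the same step.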
In the proposed algorithm, Δ is the step size. Published studies suggest that the value of Δ is not critical, as long as it is small enough. In our numerical study, we set Δ = 0.01. The proposed algorithm also involves the threshold 0 ≤ τ ≤ 1 and the number of steps topt, both of which affect selection and estimation. More specifically, when topt is fixed, a larger τ leads to a sparser model. When τ is fixed, a larger topt leads to a denser model. In our numerical study, we conduct a two-dimensional grid search and select the optimal values of τ and topt using five-fold cross validation. Specifically, the subjects are randomly partitioned into five disjoint sets of equal size. For each combination (τ, topt), the proposed method is trained on four of the five sets, and then tested on the remaining one to obtain the prediction error. This process is conducted five times, with each set serving as the testing set once. The combination (τ, topt) with the smallest average prediction error is selected as the optimal one. More discussions on the operating characteristics are provided in Section 3.1.
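The two-dimensional grid search can be sketched as below. Since a single run with a fixed τ yields the estimates at every step, all candidate values of topt are evaluated from one coefficient path. The names `cv_select` and `tgd_path`, the callable interface, and the simplified path function (no intercept, no hierarchy step) are assumptions made for this illustration only.

```python
import numpy as np

def tgd_path(X, y, tau, n_steps, delta=0.01):
    """Simplified thresholded-gradient path for a linear model; returns (n_steps + 1, p)."""
    theta = np.zeros(X.shape[1])
    path = [theta.copy()]
    for _ in range(n_steps):
        f = X.T @ (y - X @ theta) / len(y)
        g = np.abs(f) > tau * np.abs(f).max()
        theta = theta + delta * g * f
        path.append(theta.copy())
    return np.array(path)

def cv_select(X, y, taus, n_steps, fit_path, k=5, seed=0):
    """Five-fold CV over (tau, t); fit_path is any function with tgd_path's signature."""
    n = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k)
    err = np.zeros((len(taus), n_steps + 1))
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        for a, tau in enumerate(taus):
            path = fit_path(X[train_idx], y[train_idx], tau, n_steps)
            pred = X[test_idx] @ path.T                       # predictions at every step
            err[a] += ((y[test_idx][:, None] - pred) ** 2).mean(axis=0) / k
    a, t = np.unravel_index(err.argmin(), err.shape)
    return taus[a], int(t), err

# Toy data with hypothetical coefficients.
rng = np.random.default_rng(3)
X = rng.normal(size=(150, 8))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=150)
taus = [0.3, 0.6, 0.9]
tau_opt, t_opt, err = cv_select(X, y, taus, 200, tgd_path)
```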
2.2. Integrative interaction analysis
Consider the integrative analysis of M independent datasets on the same scientific problem. In multi-datasets analysis, the selection of data and pre-processing are challenging tasks. However, they have been extensively discussed in the literature [14] and will not be reiterated here. Further assume that variables have been matched across datasets. Use notations similar to those in Section 2.1, and add superscript “(m)” to denote the mth dataset. Specifically, in dataset m, there are n(m) iid samples, each with measurements on the response variable Y(m) and d predictors X(m). Assume the regression model (1) for each dataset. Note that, in practice, the M datasets are collected under similar but not identical protocols. With possible differences in sample characteristics and other factors, the regression coefficients in different datasets are not assumed to be equal. This assumption is more flexible than that in many other multi-dataset analyses. In dataset m, denote l(m)(θ(m)) as the log-likelihood function normalized by sample size.
For integrative interaction analysis, the proposed method proceeds as follows:
Algorithm II: TGDR for integrative interaction analysis
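(a) Initialize $t = 0$ and $\theta^{(m)}(t) = 0$ for $m = 1, \ldots, M$;
(b) Update $t = t + 1$. For $m = 1, \ldots, M$, compute $f^{(m)} = \partial l^{(m)}(\theta^{(m)})/\partial \theta^{(m)}\,|_{\theta^{(m)} = \theta^{(m)}(t-1)}$, the vector of first-order derivatives. Denote the components of $f^{(m)}$ as $(f_0^{(m)}, \ldots, f_j^{(m)}, \ldots, f_d^{(m)}, f_{12}^{(m)}, f_{13}^{(m)}, \ldots)'$, where $f_j^{(m)}$ and $f_{jk}^{(m)}$ correspond to those of $\beta_j^{(m)}$ and $\gamma_{jk}^{(m)}$, respectively;
(c) Compute the thresholding indicator $g = (g_0, \ldots, g_j, \ldots, g_d, g_{12}, g_{13}, \ldots)'$, which is shared by the $M$ datasets. Specifically,
- (c.1) $g_0 = 1$;
- (c.2) $g_{jk}$ indicates whether the overall gradient of the interaction $\gamma_{jk}$ across the $M$ datasets exceeds $\tau$ times the largest overall gradient among all interactions;
- (c.3) $g_j$ indicates whether the overall gradient of the main effect $\beta_j$ across the $M$ datasets exceeds $\tau$ times the largest overall gradient among all main effects. In addition, if $g_{jk} = 1$, then set $g_j = 1$ and $g_k = 1$;
(d) For $m = 1, \ldots, M$, update $\theta^{(m)}(t) = \theta^{(m)}(t-1) + \Delta\, g \odot f^{(m)}$, where $\Delta$ is the step size;
(e) Repeat Steps (b)-(d) a large number of times. Select the optimal number of iterations $t_{opt}$. The final estimate is $\theta^{(m)}(t_{opt})$. Interactions and main effects that correspond to the nonzero components of $\theta^{(m)}(t_{opt})$ are identified as important.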
This algorithm shares a similar spirit with Algorithm I. The parameters Δ, τ, and topt have similar implications and will be chosen in the same manner. A small modification is that the log-likelihood functions need to be normalized so that the analysis is not dominated by larger datasets.
The key difference from single-dataset analysis is Step (c), where, in determining the thresholding indicator (and hence selection and estimation results), we jointly consider all M datasets. Here, we identify interactions and main effects with the largest overall gradients and update their estimates. In this way, if an interaction (or main effect) has weak “evidence” in one dataset but strong “evidence” in other datasets, it can still be identified. In integrative analysis when variable selection is of interest, two model structures have been proposed [15]. The first is the homogeneity structure, under which multiple datasets share the same sparsity structure. The other is the heterogeneity structure, under which multiple datasets can have different sparsity structures. Algorithm II enforces the homogeneity structure, which is appropriate when multiple datasets are “similar enough”. Extension to the heterogeneity structure will be discussed in Section 5.
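A minimal sketch of Algorithm II for linear regression is given below. The exact way the per-dataset gradients are combined into an "overall gradient" is an assumption here: the sum of absolute gradients across datasets is used, which does not cancel when an effect has different signs in different datasets. The function name `tgdr_integrative`, the normalized least-squares objective, and the toy coefficients are likewise illustrative assumptions.

```python
import numpy as np
from itertools import combinations

def tgdr_integrative(Xs, ys, tau=0.9, n_steps=200, delta=0.01):
    """Sketch of Algorithm II; one shared indicator g, dataset-specific coefficients."""
    M, d = len(Xs), Xs[0].shape[1]
    pairs = list(combinations(range(d), 2))
    XIs = [np.column_stack([X[:, j] * X[:, k] for j, k in pairs]) for X in Xs]
    b0, beta, gamma = np.zeros(M), np.zeros((M, d)), np.zeros((M, len(pairs)))
    for _ in range(n_steps):
        f0, f_main, f_int = np.zeros(M), np.zeros((M, d)), np.zeros((M, len(pairs)))
        for m in range(M):
            resid = ys[m] - b0[m] - Xs[m] @ beta[m] - XIs[m] @ gamma[m]
            n_m = len(resid)                       # normalize by each sample size
            f0[m] = resid.mean()
            f_main[m] = Xs[m].T @ resid / n_m
            f_int[m] = XIs[m].T @ resid / n_m
        # ASSUMPTION: overall gradient = sum over datasets of absolute gradients
        F_main = np.abs(f_main).sum(axis=0)
        F_int = np.abs(f_int).sum(axis=0)
        g_int = F_int > tau * F_int.max()
        g_main = F_main > tau * F_main.max()
        for idx in np.flatnonzero(g_int):          # strong hierarchy
            j, k = pairs[idx]
            g_main[j] = g_main[k] = True
        b0 += delta * f0
        beta += delta * g_main * f_main            # g is broadcast over the M datasets
        gamma += delta * g_int * f_int
    return b0, beta, gamma, pairs

# Three toy datasets: same sparsity structure, different (hypothetical) coefficients.
rng = np.random.default_rng(2)
Xs = [rng.normal(size=(n, 5)) for n in (100, 80, 70)]
coefs = [(2.0, 2.0, 3.0), (1.5, 2.5, 2.0), (-2.0, 1.5, -2.5)]
ys = [a * X[:, 0] + b * X[:, 1] + c * X[:, 0] * X[:, 1] + rng.normal(size=len(X))
      for X, (a, b, c) in zip(Xs, coefs)]
b0, beta, gamma, pairs = tgdr_integrative(Xs, ys, tau=0.8, n_steps=300)
```

Because the indicator is shared, the estimated sparsity structure is the same in all datasets, while the magnitudes and signs of the estimates remain dataset-specific.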
2.3. Analysis of censored survival data under the AFT model
In our numerical study (simulation and data analysis), we analyze censored survival data, which can be more complicated than continuous and categorical data. Such data are encountered in the analysis of financial default, device malfunction, and others. In stock and mortgage market analysis, survival data are at least as important as other data types. To be complete, here we provide details on the data and model. In addition, we also intend to demonstrate goodness-of-fit functions other than the likelihood.
In the mth dataset, the response variable Y(m) is a survival time. Consider the AFT model
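$$
\log Y^{(m)} = \beta_0^{(m)} + \sum_{j=1}^{d} \beta_j^{(m)} X_j^{(m)} + \sum_{1 \le j < k \le d} \gamma_{jk}^{(m)} X_j^{(m)} X_k^{(m)} + \epsilon^{(m)}, \tag{2}
$$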
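where ϵ(m) is the random error with an unknown distribution, making it impossible to construct the likelihood function. Compared to alternatives such as the Cox model, the AFT model has a more lucid interpretation and, more importantly, the lowest computational cost, which is especially desirable with high-dimensional data. Denote C(m) as the censoring time. We observe $(\tilde{Y}^{(m)}, \delta^{(m)}, X^{(m)})$, where $\tilde{Y}^{(m)} = \min(Y^{(m)}, C^{(m)})$ and $\delta^{(m)} = I(Y^{(m)} \le C^{(m)})$. With a slight abuse of notation, assume that data have been sorted according to the $\tilde{Y}_i^{(m)}$’s, from the smallest to the largest. Compute the Kaplan-Meier (KM) weights as
$$
w_1^{(m)} = \frac{\delta_1^{(m)}}{n^{(m)}}, \qquad
w_i^{(m)} = \frac{\delta_i^{(m)}}{n^{(m)} - i + 1} \prod_{j=1}^{i-1} \left( \frac{n^{(m)} - j}{n^{(m)} - j + 1} \right)^{\delta_j^{(m)}}, \quad i = 2, \ldots, n^{(m)}. \tag{3}
$$
Following [16], the goodness-of-fit function can be constructed as
$$
l^{(m)}(\theta^{(m)}) = -\sum_{i=1}^{n^{(m)}} w_i^{(m)} \Big( \log \tilde{Y}_i^{(m)} - \beta_0^{(m)} - \sum_{j} \beta_j^{(m)} X_{ij}^{(m)} - \sum_{j<k} \gamma_{jk}^{(m)} X_{ij}^{(m)} X_{ik}^{(m)} \Big)^2, \tag{4}
$$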
with . The rest of the operation is the same as with likelihood functions.
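The KM weights and the gradient of a weighted least squares goodness-of-fit function can be computed as in the sketch below. The weight formula is the standard Kaplan-Meier (Stute) weighting; the function names `km_weights` and `aft_gradient`, and the exact scaling and sign of the objective (taken here as minus the weighted sum of squared residuals, so that it is maximized), are assumptions for illustration.

```python
import numpy as np

def km_weights(time, delta):
    """Kaplan-Meier weights for observations sorted by time; returns (weights, sort order)."""
    order = np.argsort(time, kind="stable")
    d_sorted = np.asarray(delta, dtype=float)[order]
    n = len(d_sorted)
    i = np.arange(1, n + 1)
    factors = ((n - i) / (n - i + 1)) ** d_sorted               # one factor per sorted subject
    prefix = np.concatenate(([1.0], np.cumprod(factors[:-1])))  # product over j < i
    return d_sorted / (n - i + 1) * prefix, order

def aft_gradient(Z, log_time, delta, theta):
    """Gradient of -sum_i w_i (log_time_i - Z_i theta)^2 with respect to theta (assumed form)."""
    w, order = km_weights(log_time, delta)
    Zs, ys = Z[order], log_time[order]
    resid = ys - Zs @ theta
    return 2.0 * Zs.T @ (w * resid)

# Three sorted subjects, the second one censored: its mass is passed to later subjects.
w_example, _ = km_weights(np.array([1.0, 2.0, 3.0]), np.array([1, 0, 1]))

# Without censoring, all weights equal 1/n and the gradient vanishes at the OLS solution.
rng = np.random.default_rng(4)
Z = np.column_stack([np.ones(30), rng.normal(size=(30, 2))])
log_t = Z @ np.array([0.5, 1.0, -1.0]) + rng.normal(size=30)
w_full, _ = km_weights(log_t, np.ones(30))
theta_ols = np.linalg.lstsq(Z, log_t, rcond=None)[0]
grad_at_ols = aft_gradient(Z, log_t, np.ones(30), theta_ols)
```

Censored subjects receive zero weight, so only the sorted order and the event indicators are needed in addition to the usual least-squares quantities.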
3. Simulation
Simulation is conducted to better gauge performance of the proposed method. Three datasets are simulated (M = 3). Two sample size settings are considered, with n(m) = 100, 80, 70 (total sample size n = 250) and n(m) = 180,170,150 (total sample size n = 500), respectively. For the number of predictors, consider d = 50 and 100, respectively, comparable to that in many business/industry studies. Note that although d is seemingly moderate, the number of unknown parameters is considerably larger than the total sample size. Consider the following simulation settings.
(S1) In (2), the predictors $X_j^{(m)}$’s are independently generated from N(0,1). ϵ(m)’s are also independently generated from N(0,1). For the coefficient vector θ = (β0, …,βj, …,βd, γ12, γ13, …)′, we set
The remaining coefficients are zero. Note that the three datasets have different regression coefficients but share the same sparsity structure. All satisfy the strong hierarchy.
(S2) For the coefficient vector θ, we set
The other settings are the same as those under S1. Compared to S1, the differences across datasets are smaller under S2.
(S3) For the coefficient vector θ, we set
The other settings are the same as those under S1. Under S3, a main effect/interaction may have different signs in different datasets, which allows for a greater level of across-dataset heterogeneity.
(S4) The settings are the same as those under S1, except that predictors have an auto-regressive correlation structure, with the jth and kth variables having correlation coefficient $0.5^{|j-k|}$.
(S5) For the coefficient vector θ, we set
and
The other settings are the same as those under S1. Under S5, the homogeneity structure of the parameters is not satisfied.
Under all settings, the log survival times are computed from model (2). The censoring times are independently generated from exponential distributions, and the parameters are adjusted so that the censoring rates are around 25%. Note that as (continuous data, linear regression model) is a special case of (censored survival data, AFT model), a separate simulation for continuous data is not conducted, and similar findings are expected.
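The data-generating mechanism can be sketched as follows: predictors with an auto-regressive correlation structure, log survival times from the AFT model, and exponential censoring times whose scale is tuned by bisection to reach a target censoring rate. The function name `simulate_aft`, the bisection scheme, and the coefficient values in the usage lines are hypothetical and do not reproduce the coefficient settings of S1-S5.

```python
import numpy as np

def simulate_aft(n, d, beta, gamma_pairs, rho=0.0, target_cens=0.25, seed=0):
    """Simulate (X, observed time, event indicator) under an AFT model with interactions."""
    rng = np.random.default_rng(seed)
    idx = np.arange(d)
    cov = rho ** np.abs(idx[:, None] - idx[None, :])   # rho = 0 gives independent predictors
    X = rng.multivariate_normal(np.zeros(d), cov, size=n)
    log_t = X @ beta + rng.normal(size=n)              # N(0, 1) errors
    for (j, k), g in gamma_pairs.items():
        log_t += g * X[:, j] * X[:, k]
    t = np.exp(log_t)
    base = -np.log1p(-rng.uniform(size=n))             # unit-scale exponential draws
    lo, hi = 1e-8, 1e8
    for _ in range(100):                               # bisection on the log scale
        scale = np.sqrt(lo * hi)
        if np.mean(t > scale * base) > target_cens:
            lo = scale                                 # too much censoring: lengthen censoring times
        else:
            hi = scale
    c = np.sqrt(lo * hi) * base
    return X, np.minimum(t, c), (t <= c).astype(int)

beta = np.zeros(10)
beta[:2] = [1.0, -1.0]                                 # hypothetical main effects
X, time, event = simulate_aft(500, 10, beta, {(0, 1): 1.0}, rho=0.5, seed=0)
cens_rate = 1 - event.mean()
```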
3.1. Parameter path and operating characteristics
To better appreciate properties of the proposed method, we simulate one replicate with three datasets under setting S1 with n = 500 and d = 50. The optimal values of the parameters (τ, topt) are (0.9, 230) using the grid search based on five-fold cross validation. To examine the effect of the parameters on the estimation and selection results, we plot the estimates of interactions/main effects as a function of τ or topt with the other tuning parameter fixed at its optimal value. To improve presentation, we only show estimates for three sets of effects (for each set, one interaction and two corresponding main effects), including six true positives and three true negatives. The other sets have similar properties.
As can be seen from Figure 1, the model gets denser (with more effects identified) as topt increases (left panels of Figure 1) and tends to be sparser as τ increases (right panels of Figure 1), which is consistent with the discussion in Section 2.1. There are three different scenarios in the left panels of Figure 1. For the set represented by the blue lines, the main effects enter the models first, later followed by the interaction. For the set represented by the red lines, one main effect is “dragged” into the models by the interaction, enforcing the strong hierarchy. The set represented by the black lines is not associated with the response. The three effects do not enter the models until after a large number of steps. With a properly selected number of steps, they are not selected. More definitive conclusions on the numerical properties of the proposed method are obtained below from simulation and data analysis.
Figure 1:
Parameter paths for one simulated replicate. The three rows correspond to three datasets. The two columns correspond to topt (left) and τ (right) with the other tuning fixed at its optimal value. The solid/dashed lines correspond to main effects/interactions. Lines with the same color correspond to the same set of effects (two main effects and one interaction).
3.2. Computational cost
Algorithm II involves only very simple calculations in each step. With fixed tunings, Table A.1 (Appendix) provides the average computational time for the simulated datasets under setting S1 with M = 3, n = 250 and various values of d, suggesting that the proposed method is computationally affordable. For example, for the datasets with d = 100, for which the total number of unknown parameters is 5050, the proposed analysis takes about 3.6 minutes on a laptop with standard configurations.
3.3. Comparison with the alternative methods
Beyond the proposed method (referred to as M1), we also analyze data with the following alternatives: (M2) Each dataset is analyzed separately using the method described in Section 2.1. Analysis results from the three datasets are combined using a meta-analysis approach. (M3) The three datasets are combined and analyzed as one. This analysis has also been referred to as an “intensity approach” in the literature. (M4) Each dataset is analyzed separately using marginal analysis, which analyzes two predictors and their interaction at a time. The corresponding p values are combined using a meta-analysis approach, which can be realized with the R package meta. Given the combined p values, a false discovery rate (FDR) approach is adopted for multiple comparison adjustment. The three methods M1, M3 and M4 enforce the homogeneity structure, while M2 assumes the heterogeneity structure. We acknowledge that there are other methods that can be potentially applied to the simulated data. Comparison with M2 and M3, which are also based on the TGDR technique, can directly establish the merit of integrative analysis. M4 has been one of the most popular multi-datasets analysis methods and is a suitable benchmark for comparison.
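The last two steps of M4 (combining p values across datasets and adjusting for multiple comparisons) can be illustrated as below. The text does not specify which combination rule or FDR procedure is used, so Fisher's method and the Benjamini-Hochberg adjustment are assumptions chosen for this sketch; the function names are hypothetical. The closed-form chi-square survival function for even degrees of freedom keeps the example NumPy-only.

```python
import numpy as np
from math import factorial

def fisher_combine(pvals):
    """Combine an (M, K) array of p values over M datasets with Fisher's method (assumed rule)."""
    p = np.clip(np.asarray(pvals, dtype=float), 1e-300, 1.0)
    M = p.shape[0]
    half = -np.log(p).sum(axis=0)                 # x / 2, where x ~ chi-square with 2M df
    # Survival function of chi-square with even df 2M: exp(-x/2) * sum_{i<M} (x/2)^i / i!
    return np.exp(-half) * sum(half ** i / factorial(i) for i in range(M))

def bh_adjust(p):
    """Benjamini-Hochberg adjusted p values (assumed FDR procedure)."""
    p = np.asarray(p, dtype=float)
    K = len(p)
    order = np.argsort(p)
    ranked = p[order] * K / np.arange(1, K + 1)
    adj_sorted = np.minimum.accumulate(ranked[::-1])[::-1]   # enforce monotonicity
    adj = np.empty(K)
    adj[order] = np.minimum(adj_sorted, 1.0)
    return adj

# Hypothetical p values for K = 4 effects from M = 3 datasets.
pvals = np.array([[0.001, 0.20, 0.03, 0.70],
                  [0.010, 0.50, 0.04, 0.40],
                  [0.002, 0.90, 0.20, 0.60]])
combined = fisher_combine(pvals)
selected = bh_adjust(combined) < 0.05
```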
To evaluate the performance of different methods, we consider (a) identification accuracy, which is measured using true positive rate (TPR) and false positive rate (FPR) for main effects and interactions separately; (b) estimation accuracy, which is measured using MSEs (mean squared errors) for main effects and interactions separately; and (c) prediction performance, which is measured using PE (prediction error) for independent data generated under the same settings. Note that for the independently generated testing data, there is no censoring and hence PE can be simply calculated.
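These measures can be computed as in the following sketch, applied separately to the main-effect and interaction blocks of the coefficient vector. The function names and the exact definitions (MSE as the average squared difference over all coefficients in a block, PE as the mean squared prediction error) are assumptions for illustration.

```python
import numpy as np

def selection_metrics(est, truth):
    """Return (TPR, FPR, MSE) comparing estimated and true coefficients of one block."""
    est, truth = np.asarray(est, dtype=float), np.asarray(truth, dtype=float)
    selected, positive = est != 0, truth != 0
    tpr = (selected & positive).sum() / positive.sum()
    fpr = (selected & ~positive).sum() / (~positive).sum()
    mse = np.mean((est - truth) ** 2)
    return tpr, fpr, mse

def prediction_error(y_test, y_pred):
    """Mean squared prediction error on uncensored testing data."""
    return np.mean((np.asarray(y_test) - np.asarray(y_pred)) ** 2)

# One true effect found, one missed, one false positive, one true negative.
tpr, fpr, mse = selection_metrics([1.0, 0.0, 0.5, 0.0], [1.2, 0.8, 0.0, 0.0])
pe = prediction_error([1.0, 2.0], [1.5, 2.0])
```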
We simulate 200 replicates under each scenario. Summary statistics are presented in Table 1 for S1. The rest of the results are presented in the Appendix. Simulation suggests that, overall, the proposed method has superior or similar performance compared to the three alternatives in terms of identification, estimation, and prediction. For example, in Table 1 with d = 100 and n = 500, for the identification of important interactions, the proposed method (M1) has (TPR, FPR) = (0.978, 0.000), compared to (0.476, 0.001) for M2, (0.968, 0.000) for M3 and (0.538, 0.006) for M4. For the estimation of main effects, the four methods have MSEs 0.016 (M1), 0.202 (M2), 0.058 (M3), and 0.132 (M4), respectively. In the evaluation of prediction performance, the four methods have PEs 2.293 (M1), 11.968 (M2), 6.182 (M3), and 3.254 (M4), respectively. Under settings S1-S4 (Table 1 and Tables A.2–A.4), M2 performs worst among the four methods, as it assumes the heterogeneity structure, which is not consistent with the homogeneity structure in these settings. Under setting S5 (Table A.5), for which the homogeneity structure is not satisfied, the performance of the three homogeneity-based methods M1, M3 and M4 decays compared to that in Table 1, especially in terms of TPR for interactions. However, the proposed method is still observed to perform better than the alternatives, including M2.
Table 1:
Simulation results under S1. In each cell, mean (sd) based on 200 replicates.
| d | n | Method | TPR (Main) | TPR (Inter) | FPR (Main) | FPR (Inter) | MSE (Main) | MSE (Inter) | PE |
|---|---|---|---|---|---|---|---|---|---|
| 50 | 250 | M1 | 0.965(0.02) | 0.864(0.12) | 0.082(0.06) | 0.001(0.00) | 0.098(0.05) | 0.010(0.00) | 4.076(1.35) |
| | | M2 | 0.735(0.16) | 0.356(0.10) | 0.145(0.06) | 0.006(0.00) | 0.512(0.15) | 0.028(0.01) | 12.103(3.60) |
| | | M3 | 0.970(0.00) | 0.844(0.08) | 0.114(0.11) | 0.005(0.01) | 0.139(0.04) | 0.011(0.00) | 5.747(1.38) |
| | | M4 | 0.807(0.11) | 0.364(0.14) | 0.275(0.08) | 0.008(0.00) | 0.305(0.13) | 0.021(0.01) | 6.278(1.89) |
| 50 | 500 | M1 | 0.991(0.04) | 0.996(0.02) | 0.073(0.08) | 0.000(0.00) | 0.024(0.02) | 0.003(0.00) | 1.746(0.57) |
| | | M2 | 0.981(0.03) | 0.725(0.08) | 0.306(0.11) | 0.008(0.01) | 0.196(0.08) | 0.012(0.00) | 4.052(1.21) |
| | | M3 | 0.900(0.00) | 0.989(0.03) | 0.076(0.13) | 0.001(0.00) | 0.108(0.02) | 0.009(0.00) | 5.627(0.62) |
| | | M4 | 0.887(0.10) | 0.716(0.15) | 0.229(0.09) | 0.006(0.00) | 0.129(0.09) | 0.010(0.00) | 5.666(2.17) |
| 100 | 250 | M1 | 0.955(0.07) | 0.738(0.15) | 0.059(0.03) | 0.000(0.00) | 0.087(0.04) | 0.003(0.00) | 6.884(1.91) |
| | | M2 | 0.438(0.17) | 0.178(0.10) | 0.097(0.03) | 0.003(0.00) | 0.380(0.06) | 0.009(0.00) | 21.754(4.17) |
| | | M3 | 0.950(0.03) | 0.778(0.10) | 0.075(0.03) | 0.003(0.00) | 0.099(0.04) | 0.003(0.00) | 7.776(1.65) |
| | | M4 | 0.855(0.09) | 0.400(0.15) | 0.118(0.07) | 0.005(0.00) | 0.204(0.06) | 0.005(0.00) | 12.480(3.41) |
| 100 | 500 | M1 | 0.985(0.03) | 0.978(0.05) | 0.028(0.03) | 0.000(0.00) | 0.016(0.01) | 0.001(0.00) | 2.293(0.79) |
| | | M2 | 0.828(0.11) | 0.476(0.09) | 0.079(0.04) | 0.001(0.00) | 0.202(0.05) | 0.006(0.00) | 11.968(3.07) |
| | | M3 | 0.990(0.00) | 0.968(0.06) | 0.028(0.04) | 0.000(0.00) | 0.058(0.01) | 0.002(0.00) | 6.182(0.62) |
| | | M4 | 0.905(0.10) | 0.538(0.14) | 0.387(0.07) | 0.006(0.00) | 0.132(0.06) | 0.004(0.00) | 3.254(1.36) |
One characteristic of the proposed method is that it respects the strong hierarchy. To better appreciate this characteristic, we conduct another set of simulations (referred to as S6 in the Appendix). Here data generation is the same as under S1. For the proposed M1, in Step (c.3), we remove “if $g_{jk} = 1$, then set $g_j = 1$ and $g_k = 1$”. That is, the strong hierarchy is not necessarily satisfied. For M2 and M3, similar modifications are made. Comparing Table A.6 (Appendix) and Table 1 suggests that the analysis that does not respect the strong hierarchy may have worse identification, estimation, and prediction performance.
In a small number of published studies, especially the early ones, it has been suggested that the “main effects, interactions” hierarchy may not hold. Specific practical examples have been provided, for which interactions exist but the corresponding main effects are not important. To be comprehensive, we conduct another set of simulations. Under S7, the data settings are mostly similar to S1 except that the hierarchy is violated for some interactions. As shown in Table A.7 (Appendix), as the proposed method, M2 and M3 enforce the hierarchy (which does not hold here), their performance is slightly worse than that in Table 1.
4. Data analysis
4.1. Data on a financial early warning system
To illustrate the effectiveness of the proposed approach for real business problems, we analyze data on a financial early warning system, which is usually used for evaluating financial performance, assessing financial risk, and predicting potential bankruptcy [17]. The analyzed dataset is part of the China Stock Market and Accounting Research Database (CSMAR), which is published by the GTA Information Technology Company (http://www.gtarsc.com/). The outcome variable of interest is the price-to-earnings ratio (P/E ratio), which is a continuous variable defined as the market price per share divided by annual earnings per share. We aim to search for important financial indicators and interactions that are associated with the (log-transformed) P/E ratio measured during the period of January 1, 2013 to December 31, 2013. More specifically, we consider 557 stocks from companies of three different industry sectors. Among them, 290 are machinery listed companies (dataset 1), 161 are metals and non-metals listed companies (dataset 2), and 106 are mechanical listed companies (dataset 3). The predictors include 83 financial indicators reported on December 31, 2010, all of which have been extensively examined in published studies (detailed information provided in Appendix). The time lag between the financial indicators and the response is between 2 and 3 years, which has been suggested in previous research [17, 18]. Note that in this analysis there is a single database. However, with the significant differences across industry sectors, it is reasonable to expect significant differences in data characteristics. It is thus sensible to dissect the data into three datasets.
We apply the proposed method as well as the three alternatives. The summary comparison results are provided in Table 2. The detailed estimation results are provided in Table 3. The proposed method identifies 13 main effects and 9 interactions, which overlap with but also differ from those identified by the alternatives. Specifically, M3 identifies 15 main effects (12 overlap with the proposed method) and 9 interactions (5 overlap with the proposed method).
Table 2:
Data analysis: numbers of overlapping main effects and interactions. In each cell: dataset 1/dataset 2/dataset 3.
| | M1 | M2 | M3 | M4 |
|---|---|---|---|---|
| Financial early warning system data | | | | |
| Main effects | | | | |
| M1 | 13/13/13 | 9/7/6 | 12/12/12 | 4/4/4 |
| M2 | | 11/10/8 | 9/10/6 | 3/2/2 |
| M3 | | | 15/15/15 | 5/5/5 |
| M4 | | | | 9/9/9 |
| Interactions | | | | |
| M1 | 9/9/9 | 4/4/0 | 5/5/5 | 0/0/0 |
| M2 | | 4/6/3 | 3/1/1 | 0/0/0 |
| M3 | | | 9/9/9 | 0/0/0 |
| M4 | | | | 3/3/3 |
| News-APP recommendation behavior data | | | | |
| Main effects | | | | |
| M1 | 12/12/12 | 8/10/8 | 11/11/11 | 5/5/5 |
| M2 | | 11/11/12 | 9/11/10 | 2/4/3 |
| M3 | | | 13/13/13 | 4/4/4 |
| M4 | | | | 9/9/9 |
| Interactions | | | | |
| M1 | 10/10/10 | 0/7/3 | 7/7/7 | 5/5/5 |
| M2 | | 3/9/6 | 0/6/2 | 0/4/2 |
| M3 | | | 7/7/7 | 3/3/3 |
| M4 | | | | 7/7/7 |
Table 3:
Analysis of financial early warning system data: estimated coefficients for main effects and interactions.
| Dataset 1 | Dataset 2 | Dataset 3 | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| M1 | M2 | M3 | M4 | M1 | M2 | M3 | M4 | M1 | M2 | M3 | M4 | |
| CTR | −0.111 | −0.095 | 0.021 | 0.219 | 0.189 | 0.021 | −0.013 | 0.021 | ||||
| EntfcfPS | −0.031 | −0.002 | −0.022 | 0.008 | −0.048 | −0.022 | 0.033 | 0.023 | −0.022 | |||
| Equass | −0.113 | 0.010 | −0.127 | |||||||||
| Equtotlia | −0.016 | −0.016 | −0.062 | −0.016 | −0.016 | |||||||
| FAGR | 0.091 | 0.070 | 0.077 | 0.105 | 0.070 | 0.084 | 0.070 | |||||
| NAPS | −0.163 | 0.112 | 0.338 | |||||||||
| NcfffaPS | 0.073 | 0.037 | 0.045 | 0.008 | 0.045 | 0.030 | 0.045 | |||||
| Netprfgrrt | −0.063 | −0.041 | 0.014 | 0.037 | 0.052 | 0.014 | −0.077 | 0.039 | 0.014 | 0.076 | ||
| OpeCass | 0.171 | 0.138 | −0.149 | 0.134 | 0.138 | −0.206 | 0.051 | 0.017 | 0.138 | 0.065 | ||
| OpeCPSgrrt | −0.019 | 0.064 | 0.220 | 0.159 | 0.064 | 0.144 | 0.066 | 0.064 | ||||
| Opeprfgrrt | −0.108 | −0.062 | −0.132 | −0.160 | −0.069 | −0.132 | −0.169 | −0.094 | −0.132 | |||
| OwnCon11 | −0.151 | −0.231 | −0.151 | −0.151 | ||||||||
| PCF | 0.150 | 0.109 | 0.176 | −0.018 | 0.162 | 0.132 | 0.176 | −0.065 | 0.170 | 0.176 | −0.104 | |
| ROAgrrt | 0.079 | −0.025 | −0.076 | −0.043 | −0.025 | 0.051 | 0.010 | −0.025 | ||||
| Salcostrt | −0.099 | −0.228 | −0.137 | −0.099 | −0.134 | −0.099 | 0.098 | |||||
| SalesevOpeincm | −0.133 | −0.079 | 0.227 | 0.327 | ||||||||
| ShrhfcfPS | 0.182 | 0.147 | 0.074 | −0.167 | 0.074 | 0.166 | 0.082 | 0.074 | ||||
| TopecostTOR | −0.046 | −0.060 | ||||||||||
| UndivprfPS | −0.064 | −0.030 | −0.031 | 0.244 | 0.182 | −0.031 | −0.097 | −0.172 | −0.031 | −0.135 | ||
| WCNA | 0.121 | 0.131 | −0.037 | 0.132 | ||||||||
| WCTA | −0.098 | −0.323 | −0.326 | |||||||||
| CTR×NcfffaPS | 0.117 | 0.100 | 0.082 | 0.076 | 0.082 | −0.053 | 0.082 | |||||
| CTR×OpeCPSgrrt | −0.074 | 0.092 | 0.047 | −0.104 | ||||||||
| CTR×PCF | −0.041 | 0.117 | 0.258 | −0.003 | ||||||||
| CTR×ShrhfcfPS | 0.083 | 0.083 | 0.083 | |||||||||
| CTR×UndivprfPS | 0.078 | 0.090 | 0.043 | 0.036 | 0.043 | 0.007 | 0.043 | |||||
| EntfcfPS×ROAgrrt | 0.067 | 0.124 | 0.248 | −0.098 | ||||||||
| Equtotlia×Opeprfgrrt | −0.066 | −0.148 | −0.066 | −0.066 | ||||||||
| FAGR×OpeCPSgrrt | 0.049 | 0.072 | 0.108 | 0.148 | 0.072 | −0.014 | 0.072 | |||||
| Netprfgrrt×ShrhfcfPS | 0.144 | 0.213 | 0.157 | 0.035 | 0.157 | 0.009 | 0.157 | |||||
| NAPS×PCF | 0.130 | −0.0467 | 0.192 | |||||||||
| OpeCass×Opeprfgrrt | −0.050 | −0.050 | −0.076 | −0.050 | ||||||||
| OpeCass×PCF | 0.032 | 0.042 | −0.182 | |||||||||
| OpeCass×Salcostrt | −0.075 | 0.128 | 0.0771 | |||||||||
| OpeCPSgrrt×TopecostTOR | 0.010 | |||||||||||
| OpeCPSgrrt×UndivprfPS | −0.12 | −0.1215 | −0.116 | −0.121 | −0.204 | −0.121 | ||||||
| Opeprfgrrt×TopecostTOR | −0.188 | |||||||||||
| Opeprfgrrt×WCNA | 0.064 | 0.264 | −0.020 | 0.018 | ||||||||
| OwnCon11×Salcostrt | −0.035 | −0.121 | −0.035 | −0.035 | ||||||||
With practical data, it is difficult to objectively evaluate identification accuracy. We therefore evaluate prediction performance and stability, which may provide indirect support. Specifically, we randomly select 2/3 of the subjects from each dataset to form the training data; the remaining subjects form the testing data. Estimates are generated using the training data and used to make predictions for the testing data. As the outcome variable is continuous, we use the prediction error (PE) as the evaluation criterion. To avoid the influence of an extreme split, this process is repeated 500 times and the average PE is computed. For the four methods, the average PEs are 0.639 (M1), 1.310 (M2), 0.769 (M3), and 2.436 (M4), respectively. In addition, for each main effect/interaction identified using the full data, we compute its probability of being identified in the 500 resamplings. This probability has been referred to as the “Observed Occurrence Index (OOI)” in the literature, with a larger value indicating a higher degree of stability [19]. The OOI results are shown in Figure A.1 (Appendix). The proposed method has OOIs superior or comparable to those of the alternatives.
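The resampling evaluation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: `fit_select` is a hypothetical stand-in (least squares followed by hard thresholding) for whichever of M1 to M4 is being evaluated, PE is assumed to be the mean squared prediction error on the testing data, and a single dataset is used for brevity (in the integrative setting the 2/3 split would be applied within each dataset).

```python
import numpy as np

def fit_select(X, y, threshold=0.1):
    """Hypothetical stand-in for a selection method: least squares + hard thresholding."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    beta[np.abs(beta) < threshold] = 0.0
    return beta

def resampling_evaluation(X, y, n_rep=500, train_frac=2 / 3, seed=0):
    """Return the average PE and the OOI of every effect selected on the full data."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    n_train = int(round(train_frac * n))
    # Effects identified using the full data; OOI is reported for these only.
    full_selected = np.flatnonzero(fit_select(X, y))
    pe = np.empty(n_rep)
    counts = np.zeros(p)
    for r in range(n_rep):
        perm = rng.permutation(n)
        train, test = perm[:n_train], perm[n_train:]
        beta = fit_select(X[train], y[train])
        # Assumed PE definition: mean squared error on the held-out subjects.
        pe[r] = np.mean((y[test] - X[test] @ beta) ** 2)
        counts += (beta != 0)
    ooi = counts[full_selected] / n_rep
    return pe.mean(), dict(zip(full_selected.tolist(), ooi.tolist()))

# Toy data (not from the paper): two strong effects among five predictors.
rng = np.random.default_rng(1)
X = rng.standard_normal((150, 5))
y = 1.0 * X[:, 0] - 0.8 * X[:, 2] + 0.1 * rng.standard_normal(150)
avg_pe, ooi = resampling_evaluation(X, y, n_rep=50)
print(avg_pe, ooi)
```

The OOI is computed only for effects selected on the full data, matching the description in the text; an effect that appears in every resample has an OOI of 1.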
4.2. Data on news-APP recommendation behaviors
We analyze a news-APP (application) recommendation dataset collected by a commercial market-research firm in 2014. The study focused on the recommendation behaviors of 882 customers in China, including 410, 181, and 291 customers who used the news-APPs “Baidu” (dataset 1), “Tencent” (dataset 2), and “Sina” (dataset 3), respectively. It has been suggested that users of the three major tools have significantly different characteristics. It is thus sensible to treat the collected data as three separate datasets and conduct integrative analysis. The outcome variable of interest is the time at which the customer recommended the APP to others, which is subject to right censoring. Specifically, the survival time T = t if a customer recommended the APP to others at time t. The duration of the study is six weeks. Thus, T is right censored if the customer had not recommended the APP after six weeks of use. There are 341, 155, and 181 uncensored customers during follow-up in datasets 1, 2, and 3, respectively. A total of 78 predictors, measured on a Likert-type scale, are analyzed (detailed information is provided in the Appendix).
The summary analysis results using the four methods are provided in Table 2. The detailed estimation results are provided in Table 4. The proposed method identifies 12 main effects and 10 interactions. Method M3 identifies 13 main effects (11 overlap with the proposed method) and 10 interactions (7 overlap with the proposed method). The identifications of methods M2 and M4 differ more from those of the proposed method. We conduct the same prediction and stability evaluation as for the financial early warning system data. As the outcome variable is a right censored time-to-event, we use the logrank statistic to measure prediction performance, where a larger value indicates better prediction. The mean prediction logrank statistics are 7.832 (M1), 4.261 (M2), 6.892 (M3), and 7.366 (M4), respectively. The OOI results in Figure A.1 (Appendix) again suggest that the proposed method has a higher degree of stability. The improved prediction and stability results suggest the superiority of the proposed method.
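As an illustration of how a logrank statistic can serve as a prediction measure for right censored outcomes, the sketch below computes the standard two-group logrank chi-squared statistic. How the testing subjects are grouped is not stated here; the example assumes, purely for illustration, that they are split into two risk groups at the median of a predicted value. The function name and the toy data are hypothetical.

```python
import numpy as np

def logrank_statistic(time, event, group):
    """Two-group logrank chi-squared statistic (one degree of freedom)."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)   # True = event observed, False = censored
    group = np.asarray(group, dtype=bool)   # True = member of group 1
    obs_minus_exp, var = 0.0, 0.0
    for t in np.unique(time[event]):        # distinct observed event times
        at_risk = time >= t
        n = at_risk.sum()                    # number at risk overall
        n1 = (at_risk & group).sum()         # number at risk in group 1
        d = (event & (time == t)).sum()      # events at t overall
        d1 = (event & (time == t) & group).sum()
        obs_minus_exp += d1 - d * n1 / n
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return obs_minus_exp ** 2 / var if var > 0 else 0.0

# Toy data (not from the paper): predicted log event times with noise,
# administratively censored at the end of a six-unit study period.
rng = np.random.default_rng(0)
pred = rng.standard_normal(200)
true_time = np.exp(pred + 0.5 * rng.standard_normal(200))
censor = 6.0
time = np.minimum(true_time, censor)
event = true_time <= censor
# Assumed grouping rule: split at the median predicted value.
group = pred < np.median(pred)
stat = logrank_statistic(time, event, group)
print(stat)
```

A larger statistic indicates that the two predicted risk groups have more clearly separated event-time distributions, which is the sense in which a larger value reflects better prediction.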
Table 4:
Analysis of news-APP recommendation behavior data: estimated coefficients for main effects and interactions.
| Variable | Dataset 1 | | | | Dataset 2 | | | | Dataset 3 | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | M1 | M2 | M3 | M4 | M1 | M2 | M3 | M4 | M1 | M2 | M3 | M4 |
| D19_2 | 0.001 | −0.067 | 0.002 | −0.110 | −0.039 | −0.067 | −0.255 | −0.126 | 0.004 | −0.067 | −0.298 | |
| D19_4 | 0.056 | −0.006 | −0.043 | 0.031 | −0.069 | −0.031 | −0.043 | 0.043 | −0.132 | −0.002 | −0.043 | −0.074 |
| D19_3 | −0.029 | −0.290 | −0.132 | |||||||||
| D19_5 | 0.069 | 0.023 | 0.070 | 0.092 | 0.051 | 0.070 | 0.002 | 0.036 | 0.070 | |||
| D19_7 | 0.034 | 0.214 | 0.070 | |||||||||
| D19_15 | 0.075 | 0.262 | 0.033 | 0.105 | −0.148 | 0.488 | ||||||
| D19_18 | −0.020 | −0.049 | 0.013 | −0.049 | −0.029 | −0.049 | ||||||
| D19_21 | 0.059 | 0.011 | −0.029 | −0.118 | −0.052 | −0.029 | −0.148 | −0.029 | ||||
| D19_24 | 0.069 | 0.042 | −0.110 | −0.243 | −0.072 | −0.110 | −0.070 | 0.010 | −0.110 | |||
| D20_2 | 0.012 | −0.047 | −0.230 | −0.061 | −0.020 | −0.047 | −0.064 | −0.070 | −0.011 | −0.047 | −0.108 | |
| D20_4 | 0.097 | 0.048 | −0.020 | −0.126 | −0.056 | −0.020 | −0.066 | −0.007 | −0.020 | |||
| D20_7 | 0.085 | 0.079 | 0.082 | |||||||||
| D20_11 | 0.026 | |||||||||||
| D20_18 | −0.004 | 0.050 | 0.212 | 0.075 | 0.050 | −0.063 | −0.002 | 0.050 | ||||
| D20_22 | 0.050 | 0.023 | 0.030 | 0.054 | 0.048 | 0.030 | −0.005 | −0.002 | 0.030 | |||
| D20_27 | 0.060 | 0.018 | 0.047 | 0.097 | 0.047 | −0.003 | 0.047 | |||||
| D20_31 | 0.068 | 0.019 | −0.082 | 0.042 | −0.187 | 0.002 | −0.082 | −0.417 | −0.076 | −0.082 | −0.137 | |
| D21_3 | −0.011 | −0.031 | ||||||||||
| D21_5 | 0.094 | 0.098 | −0.455 | |||||||||
| D21_13 | 0.028 | 0.017 | 0.017 | 0.025 | 0.017 | |||||||
| D19_2×D19_4 | −0.001 | −0.002 | −0.019 | −0.002 | −0.016 | −0.002 | −0.064 | −0.002 | −0.009 | −0.002 | −0.004 | |
| D19_2×D19_24 | 0.001 | −0.001 | −0.019 | −0.001 | −0.007 | |||||||
| D19_2×D20_2 | −0.001 | −0.149 | −0.001 | −0.004 | −0.028 | −0.001 | −0.012 | −0.076 | ||||
| D19_2×D20_4 | −0.019 | −0.007 | ||||||||||
| D19_2×D20_22 | −0.007 | |||||||||||
| D19_2×D20_31 | −0.003 | −0.005 | −0.053 | −0.007 | −0.033 | −0.005 | −0.042 | −0.006 | −0.005 | −0.035 | ||
| D19_4×D19_18 | −0.001 | −0.001 | −0.001 | |||||||||
| D19_4×D19_21 | −0.007 | −0.009 | −0.008 | −0.009 | −0.009 | −0.009 | ||||||
| D19_4×D20_2 | −0.002 | −0.003 | −0.003 | −0.010 | −0.003 | −0.003 | −0.003 | |||||
| D19_4×D20_22 | 0.001 | 0.001 | 0.002 | 0.001 | ||||||||
| D19_4×D20_31 | −0.010 | −0.013 | −0.058 | −0.014 | −0.017 | −0.013 | −0.136 | −0.012 | −0.013 | −0.103 | ||
| D19_7×D19_15 | 0.002 | −0.171 | −0.138 | |||||||||
| D19_15×D20_31 | 0.001 | −0.032 | −0.001 | −0.003 | −0.002 | −0.067 | ||||||
| D19_21×D20_31 | −0.004 | −0.007 | −0.008 | −0.007 | −0.007 | −0.007 | ||||||
| D19_24×D20_27 | 0.006 | |||||||||||
| D19_24×D21_3 | −0.001 | |||||||||||
| D20_2×D20_31 | −0.001 | −0.017 | −0.001 | −0.001 | ||||||||
| D20_4×D20_31 | −0.001 | −0.003 | −0.004 | −0.024 | −0.003 | −0.002 | −0.003 | |||||
| D20_27×D21_3 | −0.009 | |||||||||||
| D20_31×D21_5 | −0.029 | −0.011 | −0.145 | |||||||||
5. Discussion
For the analysis of high-dimensional data in genetics and other scientific fields, integrative analysis has established its effectiveness in pooling multiple independent datasets, increasing sample size, and improving analysis results. In this study, we significantly extend integrative analysis to interaction analysis and to the analysis of business and industry data. The proposed method is based on the TGDR technique, which has been extensively applied to the analysis of main effects but not interactions. The TGDR technique is modified to respect the “main effects, interactions” strong hierarchy. The proposed method has an intuitive formulation. In simulation, it outperforms the direct competitors. In the analysis of financial early warning system data and news-APP recommendation behavior data, it leads to findings different from those of the alternatives. The improved prediction and stability support its validity. Overall, this study provides a practically useful new avenue for studying interactions under the integrative analysis paradigm.
This study can be potentially extended in multiple directions. Interaction analysis plays an important role in the study of complex business and industry problems. It is of interest to conduct other types of interaction analysis, or to use other techniques, under the integrative analysis paradigm. In our study, we consider continuous data under the linear model and censored survival data under the AFT model. The proposed method can be directly applied with other goodness-of-fit measures, in particular log-likelihood functions. The strong hierarchy is assumed. An extension that accommodates the weak hierarchy can be obtained by modifying step (c.3) of Algorithm II. With multiple datasets, the homogeneity model is assumed. Extension to the heterogeneity model requires making the thresholding indicators dataset-specific. In data analysis, the evaluation of prediction and stability provides support for the identification results. More evaluations, especially scientific evaluations, are needed to confirm the findings.
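The difference between the strong and weak hierarchy can be made concrete with a small check on a selected model. The sketch below is illustrative only and does not reproduce any step of Algorithm II; the function name, the representation of an interaction as a pair of variable names, and the example selections are assumptions made for this example.

```python
def satisfies_hierarchy(main_effects, interactions, kind="strong"):
    """Check whether every selected interaction respects the given hierarchy.

    strong: both constituent main effects must be selected.
    weak:   at least one constituent main effect must be selected.
    """
    if kind not in ("strong", "weak"):
        raise ValueError("kind must be 'strong' or 'weak'")
    required = 2 if kind == "strong" else 1
    main = set(main_effects)
    for a, b in interactions:
        n_parents = (a in main) + (b in main)
        if n_parents < required:
            return False
    return True

# Hypothetical selections (variable names borrowed from Table A.8 for readability).
main = ["CTR", "PCF"]
print(satisfies_hierarchy(main, [("CTR", "PCF")], kind="strong"))   # both parents selected
print(satisfies_hierarchy(main, [("CTR", "NAPS")], kind="strong"))  # one parent missing
print(satisfies_hierarchy(main, [("CTR", "NAPS")], kind="weak"))    # one parent suffices
```

Under the strong hierarchy an interaction can enter the model only if both of its main effects are present, whereas the weak hierarchy requires only one of them.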
Acknowledgements
We thank the organizers and participants of the International Workshop on Perspectives on High-Dimensional Data Analysis (HDDA-VII). This study was supported by the National Natural Science Foundation of China (71771211, 11401013, 91546202), the MOE Project of Key Research Institute of Humanities and Social Sciences at Universities (16JJD910002), and the NIH (CA204120 and CA191383).
Appendix
A.1. Additional numerical results of the simulations
Table A.1:
Computational time of the proposed method for setting S1 with M = 3 and n = 250.
| d | Number of the total unknown parameters | Computational time (minutes) |
|---|---|---|
| 50 | 1275 | 0.374 |
| 70 | 2485 | 0.965 |
| 80 | 3240 | 1.422 |
| 100 | 5050 | 3.588 |
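The parameter counts in Table A.1 agree with d main effects plus d(d − 1)/2 pairwise interactions. This reading is inferred from the tabulated values (an intercept is not counted), and the short check below, with a hypothetical helper name, reproduces the second column.

```python
def n_unknown_parameters(d):
    """Number of main effects plus pairwise interactions for d predictors."""
    return d + d * (d - 1) // 2

for d in (50, 70, 80, 100):
    print(d, n_unknown_parameters(d))
```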
Table A.2:
Simulation results under S2. In each cell, mean (sd) based on 200 replicates.
| d | n | Method | TPR (Main) | TPR (Inter) | FPR (Main) | FPR (Inter) | MSE (Main) | MSE (Inter) | PE |
|---|---|---|---|---|---|---|---|---|---|
| 50 | 250 | M1 | 0.988(0.03) | 0.887(0.14) | 0.063(0.06) | 0.000(0.00) | 0.066(0.03) | 0.007(0.00) | 3.259(1.18) |
| | | M2 | 0.787(0.11) | 0.408(0.11) | 0.124(0.05) | 0.006(0.00) | 0.383(0.10) | 0.020(0.00) | 9.198(2.39) |
| | | M3 | 1.000(0.00) | 0.870(0.05) | 0.126(0.14) | 0.001(0.00) | 0.070(0.02) | 0.007(0.00) | 3.853(1.05) |
| | | M4 | 0.820(0.11) | 0.415(0.13) | 0.270(0.10) | 0.008(0.00) | 0.223(0.10) | 0.019(0.01) | 4.493(1.27) |
| 50 | 500 | M1 | 1.000(0.00) | 0.990(0.03) | 0.106(0.09) | 0.000(0.00) | 0.020(0.01) | 0.003(0.00) | 1.672(0.30) |
| | | M2 | 0.973(0.03) | 0.782(0.09) | 0.275(0.08) | 0.006(0.00) | 0.103(0.04) | 0.008(0.00) | 2.839(0.68) |
| | | M3 | 1.000(0.00) | 0.995(0.02) | 0.275(0.23) | 0.000(0.00) | 0.054(0.01) | 0.005(0.00) | 3.462(0.43) |
| | | M4 | 0.902(0.10) | 0.578(0.14) | 0.257(0.10) | 0.007(0.00) | 0.095(0.06) | 0.011(0.00) | 3.825(1.26) |
| 100 | 250 | M1 | 0.987(0.05) | 0.855(0.18) | 0.062(0.07) | 0.000(0.00) | 0.061(0.03) | 0.003(0.00) | 5.586(1.67) |
| | | M2 | 0.437(0.20) | 0.169(0.10) | 0.083(0.07) | 0.003(0.00) | 0.300(0.05) | 0.006(0.00) | 17.499(4.08) |
| | | M3 | 1.000(0.00) | 0.833(0.10) | 0.157(0.23) | 0.001(0.00) | 0.046(0.02) | 0.002(0.00) | 5.848(1.04) |
| | | M4 | 0.848(0.08) | 0.439(0.17) | 0.461(0.14) | 0.005(0.01) | 0.487(0.09) | 0.005(0.01) | 7.520(1.73) |
| 100 | 500 | M1 | 1.000(0.00) | 0.975(0.05) | 0.072(0.09) | 0.000(0.00) | 0.012(0.00) | 0.001(0.00) | 2.139(0.51) |
| | | M2 | 0.864(0.08) | 0.497(0.10) | 0.160(0.11) | 0.004(0.00) | 0.146(0.04) | 0.004(0.00) | 8.292(1.77) |
| | | M3 | 1.000(0.00) | 0.953(0.03) | 0.214(0.33) | 0.000(0.00) | 0.029(0.01) | 0.002(0.00) | 3.908(0.38) |
| | | M4 | 0.927(0.07) | 0.545(0.16) | 0.444(0.16) | 0.005(0.01) | 0.109(0.06) | 0.004(0.00) | 4.315(0.90) |
Table A.3:
Simulation results under S3. In each cell, mean (sd) based on 200 replicates.
| d | n | Method | TPR (Main) | TPR (Inter) | FPR (Main) | FPR (Inter) | MSE (Main) | MSE (Inter) | PE |
|---|---|---|---|---|---|---|---|---|---|
| 50 | 250 | M1 | 0.978(0.05) | 0.882(0.09) | 0.073(0.08) | 0.000(0.00) | 0.075(0.04) | 0.009(0.00) | 3.436(1.13) |
| | | M2 | 0.779(0.11) | 0.401(0.11) | 0.162(0.07) | 0.007(0.00) | 0.467(0.14) | 0.027(0.00) | 11.588(3.56) |
| | | M3 | 0.953(0.06) | 0.667(0.10) | 0.054(0.07) | 0.002(0.00) | 0.333(0.06) | 0.020(0.00) | 12.669(1.63) |
| | | M4 | 0.800(0.11) | 0.370(0.14) | 0.268(0.09) | 0.008(0.00) | 0.324(0.13) | 0.029(0.01) | 6.539(2.42) |
| 50 | 500 | M1 | 1.000(0.00) | 0.990(0.04) | 0.077(0.07) | 0.000(0.00) | 0.018(0.01) | 0.003(0.00) | 1.701(0.60) |
| | | M2 | 0.989(0.02) | 0.750(0.07) | 0.354(0.12) | 0.009(0.01) | 0.166(0.07) | 0.010(0.00) | 3.379(1.01) |
| | | M3 | 0.973(0.05) | 0.733(0.07) | 0.092(0.05) | 0.001(0.00) | 0.277(0.04) | 0.017(0.00) | 11.598(1.23) |
| | | M4 | 0.912(0.08) | 0.537(0.16) | 0.257(0.09) | 0.007(0.00) | 0.124(0.08) | 0.017(0.01) | 5.291(2.14) |
| 100 | 250 | M1 | 0.939(0.06) | 0.725(0.16) | 0.014(0.01) | 0.000(0.00) | 0.100(0.06) | 0.004(0.00) | 7.785(2.97) |
| | | M2 | 0.357(0.18) | 0.133(0.08) | 0.031(0.03) | 0.001(0.00) | 0.402(0.04) | 0.009(0.00) | 24.232(3.96) |
| | | M3 | 0.908(0.13) | 0.561(0.14) | 0.022(0.03) | 0.000(0.00) | 0.217(0.06) | 0.006(0.00) | 15.782(2.90) |
| | | M4 | 0.867(0.10) | 0.411(0.13) | 0.201(0.05) | 0.006(0.00) | 0.735(0.49) | 0.018(0.01) | 21.707(3.85) |
| 100 | 500 | M1 | 0.995(0.02) | 0.973(0.06) | 0.019(0.02) | 0.000(0.00) | 0.016(0.01) | 0.001(0.00) | 2.381(0.94) |
| | | M2 | 0.824(0.09) | 0.474(0.09) | 0.070(0.04) | 0.001(0.00) | 0.192(0.07) | 0.006(0.00) | 11.239(3.08) |
| | | M3 | 0.970(0.05) | 0.703(0.08) | 0.014(0.01) | 0.000(0.00) | 0.162(0.03) | 0.005(0.00) | 11.990(1.24) |
| | | M4 | 0.905(0.08) | 0.532(0.11) | 0.198(0.08) | 0.006(0.00) | 0.141(0.05) | 0.006(0.00) | 5.236(1.36) |
Table A.4:
Simulation results under S4. In each cell, mean (sd) based on 200 replicates.
| d | n | Method | TPR (Main) | TPR (Inter) | FPR (Main) | FPR (Inter) | MSE (Main) | MSE (Inter) | PE |
|---|---|---|---|---|---|---|---|---|---|
| 50 | 250 | M1 | 0.990(0.03) | 0.988(0.03) | 0.171(0.12) | 0.001(0.00) | 0.052(0.03) | 0.004(0.00) | 1.771(0.63) |
| | | M2 | 0.972(0.03) | 0.797(0.08) | 0.335(0.08) | 0.008(0.01) | 0.168(0.05) | 0.012(0.00) | 3.197(1.29) |
| | | M3 | 0.985(0.02) | 0.965(0.05) | 0.171(0.10) | 0.002(0.00) | 0.113(0.02) | 0.010(0.00) | 8.993(1.75) |
| | | M4 | 0.945(0.07) | 0.833(0.10) | 0.323(0.12) | 0.017(0.01) | 0.214(0.13) | 0.021(0.01) | 2.719(0.95) |
| 50 | 500 | M1 | 0.995(0.02) | 1.000(0.00) | 0.125(0.10) | 0.000(0.00) | 0.029(0.02) | 0.002(0.00) | 1.370(0.30) |
| | | M2 | 0.997(0.01) | 0.964(0.03) | 0.498(0.10) | 0.010(0.01) | 0.078(0.02) | 0.004(0.00) | 2.167(0.38) |
| | | M3 | 1.000(0.00) | 0.988(0.03) | 0.092(0.13) | 0.001(0.00) | 0.094(0.01) | 0.009(0.00) | 6.948(1.15) |
| | | M4 | 0.970(0.05) | 0.892(0.09) | 0.302(0.08) | 0.017(0.01) | 0.066(0.04) | 0.010(0.01) | 1.793(0.96) |
| 100 | 250 | M1 | 0.991(0.03) | 0.966(0.06) | 0.145(0.08) | 0.000(0.00) | 0.035(0.01) | 0.001(0.00) | 2.704(1.33) |
| | | M2 | 0.898(0.09) | 0.587(0.13) | 0.132(0.07) | 0.002(0.00) | 0.143(0.06) | 0.005(0.00) | 9.211(5.42) |
| | | M3 | 0.991(0.00) | 0.943(0.07) | 0.183(0.10) | 0.001(0.00) | 0.063(0.02) | 0.003(0.00) | 9.575(2.77) |
| | | M4 | 0.950(0.07) | 0.758(0.14) | 0.385(0.06) | 0.007(0.00) | 2.594(4.77) | 0.157(0.28) | 4.973(4.66) |
| 100 | 500 | M1 | 0.996(0.02) | 0.997(0.02) | 0.140(0.07) | 0.000(0.00) | 0.020(0.01) | 0.001(0.00) | 1.741(0.50) |
| | | M2 | 0.986(0.02) | 0.880(0.06) | 0.282(0.09) | 0.001(0.00) | 0.069(0.02) | 0.002(0.00) | 2.502(1.19) |
| | | M3 | 1.000(0.00) | 0.978(0.04) | 0.136(0.06) | 0.000(0.00) | 0.047(0.01) | 0.002(0.00) | 7.534(1.24) |
| | | M4 | 0.986(0.04) | 0.883(0.11) | 0.404(0.08) | 0.009(0.00) | 0.120(0.12) | 0.005(0.00) | 2.156(0.94) |
Table A.5:
Simulation results under S5. In each cell, mean (sd) based on 200 replicates.
| d | n | Method | TPR (Main) | TPR (Inter) | FPR (Main) | FPR (Inter) | MSE (Main) | MSE (Inter) | PE |
|---|---|---|---|---|---|---|---|---|---|
| 50 | 250 | M1 | 0.825(0.08) | 0.523(0.15) | 0.112(0.06) | 0.003(0.00) | 0.251(0.09) | 0.024(0.00) | 9.902(2.23) |
| | | M2 | 0.789(0.12) | 0.407(0.11) | 0.160(0.08) | 0.006(0.00) | 0.510(0.11) | 0.027(0.00) | 11.634(3.77) |
| | | M3 | 0.816(0.09) | 0.521(0.17) | 0.117(0.06) | 0.004(0.00) | 0.366(0.08) | 0.027(0.00) | 15.306(2.79) |
| | | M4 | 0.713(0.12) | 0.269(0.11) | 0.262(0.09) | 0.007(0.00) | 0.381(0.14) | 0.034(0.01) | 9.479(2.81) |
| 50 | 500 | M1 | 0.884(0.07) | 0.714(0.09) | 0.207(0.11) | 0.002(0.00) | 0.099(0.04) | 0.015(0.00) | 6.307(1.14) |
| | | M2 | 0.887(0.02) | 0.660(0.07) | 0.327(0.11) | 0.010(0.01) | 0.179(0.08) | 0.011(0.00) | 6.631(1.14) |
| | | M3 | 0.892(0.06) | 0.688(0.11) | 0.171(0.04) | 0.003(0.00) | 0.258(0.06) | 0.023(0.00) | 12.882(1.57) |
| | | M4 | 0.761(0.09) | 0.343(0.10) | 0.241(0.07) | 0.007(0.00) | 0.220(0.07) | 0.028(0.00) | 9.023(1.89) |
| 100 | 250 | M1 | 0.705(0.14) | 0.371(0.14) | 0.028(0.02) | 0.000(0.00) | 0.204(0.07) | 0.008(0.00) | 16.356(3.34) |
| | | M2 | 0.393(0.20) | 0.140(0.09) | 0.033(0.02) | 0.001(0.00) | 0.386(0.06) | 0.009(0.00) | 22.753(4.86) |
| | | M3 | 0.705(0.14) | 0.358(0.16) | 0.032(0.04) | 0.001(0.00) | 0.235(0.08) | 0.008(0.00) | 18.425(3.26) |
| | | M4 | 0.675(0.09) | 0.230(0.10) | 0.412(0.07) | 0.006(0.00) | 1.026(0.74) | 0.025(0.01) | 22.424(4.26) |
| 100 | 500 | M1 | 0.863(0.06) | 0.570(0.14) | 0.057(0.05) | 0.000(0.00) | 0.069(0.02) | 0.005(0.00) | 9.070(1.67) |
| | | M2 | 0.860(0.11) | 0.477(0.11) | 0.085(0.05) | 0.001(0.00) | 0.197(0.05) | 0.006(0.00) | 11.122(3.08) |
| | | M3 | 0.864(0.06) | 0.577(0.12) | 0.058(0.02) | 0.001(0.00) | 0.158(0.04) | 0.006(0.00) | 14.746(1.73) |
| | | M4 | 0.826(0.09) | 0.328(0.10) | 0.406(0.07) | 0.006(0.00) | 0.215(0.07) | 0.010(0.00) | 10.011(1.36) |
Table A.6:
Simulation results under S6. In each cell, mean (sd) based on 200 replicates.
| d | n | Method | TPR (Main) | TPR (Inter) | FPR (Main) | FPR (Inter) | MSE (Main) | MSE (Inter) | PE |
|---|---|---|---|---|---|---|---|---|---|
| 50 | 250 | M1 | 0.953(0.04) | 0.845(0.12) | 0.081(0.09) | 0.005(0.00) | 0.115(0.05) | 0.009(0.00) | 3.829(1.54) |
| | | M2 | 0.404(0.17) | 0.260(0.09) | 0.096(0.05) | 0.008(0.00) | 0.688(0.13) | 0.029(0.00) | 16.593(5.30) |
| | | M3 | 0.950(0.07) | 0.908(0.08) | 0.116(0.14) | 0.011(0.01) | 0.175(0.08) | 0.011(0.00) | 5.735(1.47) |
| | | M4 | 0.375(0.15) | 0.445(0.14) | 0.036(0.01) | 0.010(0.00) | 0.490(0.16) | 0.017(0.00) | 9.804(2.44) |
| 50 | 500 | M1 | 0.970(0.00) | 0.963(0.03) | 0.081(0.06) | 0.004(0.00) | 0.024(0.01) | 0.002(0.00) | 1.549(0.35) |
| | | M2 | 0.843(0.10) | 0.679(0.10) | 0.239(0.09) | 0.016(0.01) | 0.289(0.14) | 0.012(0.00) | 5.263(3.20) |
| | | M3 | 0.965(0.02) | 0.965(0.02) | 0.069(0.10) | 0.011(0.02) | 0.148(0.07) | 0.008(0.00) | 5.384(1.00) |
| | | M4 | 0.680(0.13) | 0.683(0.11) | 0.034(0.01) | 0.009(0.00) | 0.262(0.10) | 0.010(0.01) | 8.601(1.87) |
| 100 | 250 | M1 | 0.903(0.13) | 0.715(0.16) | 0.042(0.02) | 0.002(0.00) | 0.115(0.06) | 0.003(0.00) | 7.498(2.90) |
| | | M2 | 0.383(0.13) | 0.115(0.08) | 0.047(0.03) | 0.002(0.00) | 0.432(0.04) | 0.009(0.00) | 26.146(4.55) |
| | | M3 | 0.920(0.15) | 0.753(0.10) | 0.046(0.03) | 0.003(0.00) | 0.127(0.07) | 0.004(0.00) | 8.139(2.89) |
| | | M4 | 0.430(0.15) | 0.393(0.15) | 0.036(0.01) | 0.007(0.00) | 0.253(0.09) | 0.009(0.00) | 16.123(1.86) |
| 100 | 500 | M1 | 0.998(0.02) | 0.973(0.04) | 0.028(0.04) | 0.001(0.00) | 0.016(0.01) | 0.001(0.00) | 2.051(0.62) |
| | | M2 | 0.671(0.15) | 0.454(0.04) | 0.058(0.11) | 0.003(0.00) | 0.250(0.07) | 0.006(0.00) | 12.416(3.75) |
| | | M3 | 0.998(0.00) | 0.980(0.08) | 0.026(0.03) | 0.002(0.00) | 0.072(0.03) | 0.002(0.00) | 6.029(1.05) |
| | | M4 | 0.610(0.14) | 0.508(0.01) | 0.005(0.14) | 0.007(0.00) | 0.150(0.07) | 0.004(0.00) | 8.730(1.89) |
Table A.7:
Simulation results under S7. In each cell, mean (sd) based on 200 replicates.
| d | n | Method | TPR (Main) | TPR (Inter) | FPR (Main) | FPR (Inter) | MSE (Main) | MSE (Inter) | PE |
|---|---|---|---|---|---|---|---|---|---|
| 50 | 250 | M1 | 0.970(0.05) | 0.830(0.15) | 0.080(0.06) | 0.001(0.00) | 0.110(0.06) | 0.012(0.01) | 4.488(1.66) |
| | | M2 | 0.682(0.16) | 0.344(0.11) | 0.148(0.08) | 0.006(0.00) | 0.539(0.14) | 0.029(0.00) | 12.851(3.94) |
| | | M3 | 1.000(0.00) | 0.828(0.09) | 0.104(0.09) | 0.004(0.00) | 0.141(0.04) | 0.012(0.00) | 5.823(1.13) |
| | | M4 | 0.752(0.11) | 0.415(0.15) | 0.286(0.08) | 0.007(0.00) | 0.320(0.12) | 0.028(0.01) | 5.987(2.31) |
| 50 | 500 | M1 | 1.000(0.00) | 0.965(0.06) | 0.119(0.10) | 0.000(0.00) | 0.021(0.01) | 0.004(0.00) | 1.846(0.61) |
| | | M2 | 0.978(0.03) | 0.711(0.08) | 0.362(0.11) | 0.011(0.01) | 0.205(0.10) | 0.012(0.00) | 3.904(1.33) |
| | | M3 | 1.000(0.00) | 0.983(0.04) | 0.119(0.15) | 0.003(0.01) | 0.102(0.02) | 0.009(0.00) | 5.223(0.85) |
| | | M4 | 0.848(0.11) | 0.532(0.13) | 0.268(0.10) | 0.007(0.00) | 0.151(0.08) | 0.019(0.01) | 5.650(2.06) |
| 100 | 250 | M1 | 0.933(0.10) | 0.753(0.17) | 0.027(0.02) | 0.000(0.00) | 0.111(0.06) | 0.005(0.00) | 8.478(2.36) |
| | | M2 | 0.317(0.14) | 0.127(0.07) | 0.029(0.02) | 0.001(0.00) | 0.393(0.05) | 0.009(0.00) | 23.482(4.99) |
| | | M3 | 0.973(0.07) | 0.820(0.14) | 0.043(0.05) | 0.001(0.00) | 0.120(0.07) | 0.004(0.00) | 10.880(2.98) |
| | | M4 | 0.800(0.14) | 0.463(0.15) | 0.042(0.07) | 0.007(0.00) | 0.873(0.51) | 0.611(0.08) | 13.808(7.61) |
| 100 | 500 | M1 | 0.970(0.05) | 0.710(0.07) | 0.017(0.03) | 0.000(0.00) | 0.018(0.01) | 0.002(0.00) | 2.730(0.88) |
| | | M2 | 0.410(0.16) | 0.180(0.08) | 0.036(0.02) | 0.001(0.00) | 0.209(0.06) | 0.006(0.00) | 11.894(3.07) |
| | | M3 | 0.940(0.07) | 0.600(0.12) | 0.007(0.01) | 0.000(0.00) | 0.059(0.02) | 0.002(0.00) | 5.927(1.14) |
| | | M4 | 0.830(0.08) | 0.390(0.13) | 0.041(0.07) | 0.006(0.00) | 0.148(0.08) | 0.006(0.00) | 9.819(1.31) |
A.2. Definitions of variables used in data analysis
Table A.8:
Financial early warning system data: definitions of variables
| Definition | Variable |
|---|---|
| Earnings per share | BasicEPS |
| Net asset value per share | NAPS |
| Main income per share | MincmPS |
| Operating profit per share | OpeprfPS |
| Earnings before interest and tax per share | EBITPS |
| Surplus reserve fund per share | SurrefdPS |
| Undistributed profit per share | UndivprfPS |
| Operating cash flow per share | OpeCFPS |
| Cash flow per share | CFPS |
| Enterprise free cash flow per share | EntfcfPS |
| Shareholder free cash flow per share | ShrhfcfPS |
| Average return on equity | AvgROE |
| Return on assets earnings before interest and tax | ROAEBIT |
| Return on assets | ROA |
| Rate of return on invested capital | ROIC |
| Net profit ratio | Netprfrt |
| Gross income ratio | Gincmrt |
| Sales cost ratio | Salcostrt |
| Net profit to total operation income | NprTOR |
| Operating profit to total operation income | OpeprTOR |
| Earnings before interest and tax to total operation income | EBITTOR |
| Total operating costs to total operation income | TopecostTOR |
| Net profits | Netprf |
| Earnings before interest and tax | EBIT |
| Operating profit ratio | Opeprfrt |
| Current ratio | Currt |
| Quick ratio | Qckrt |
| Equity to total liability | Equtotlia |
| Net operating cash flow to total liabilities | NOCFtotlia |
| Net operating cash flow to current liabilities | NOCFtotcurlia |
| Operating cash flow liabilities ratio | OpeCcurdb |
| Earnings per share growth rate | EPSgrrt |
| Operating income growth rate | Opeincmgrrt |
| Operating profit growth rate | Opeprfgrrt |
| Net profit growth rate | Netprfgrrt |
| Operating cash flow per share growth rate | OpeCPSgrrt |
| Return on assets growth rate | ROAgrrt |
| Net asset value growth rate | Netassgrrt |
| Net asset value per share growth rate | NAPSgrrt |
| Inventory turnover ratio | Invtrtrrat |
| Account receivable turnover ratio | ARTrat |
| Account payable turnover ratio | AccrPayrat |
| Current assets turnover ratio | Currat |
| Fixed assets turnover ratio | Fixassrat |
| Equity turnover ratio | Equrat |
| Total assets turnover ratio | Totassrat |
| Sales and service cash to operating income | SalesevOpeincm |
| Cash rate of sales | Casrtsale |
| Capital expenditure to depreciation and amortization | CapexpDM |
| Sales and service render cash | Salesevcash |
| Operation cash into asset | OpeCass |
| Debt to asset ratio | Dbastrt |
| Current asset to total asset | Curtotast |
| Noncurrent asset to total asset | Noncurtotast |
| Fixed assets ratio | Fixassrt |
| Current liability to total liability | Curtotlia |
| Equity to asset | Equass |
| Long asset to fit asset | Lassfitass |
| Price to book value ratio | PB |
| Price cash flow ratio | PCF |
| Price sales ratio | PS |
| Ownership concentration 1 | OwnCon1 |
| Ownership concentration 5 | OwnCon5 |
| Ownership concentration 10 | OwnCon10 |
| Ownership concentration 11 | OwnCon11 |
| H1 index | H1Index |
| H5 index | H5Index |
| H10 index | H10Index |
| Z index | ZIndex |
| Net cash flow from investing activities per share | NcffiaPS |
| Net cash flow from financing activities per share | NcfffaPS |
| Net cash flow per share | NcfPS |
| Fixed asset growth rate | FAGR |
| Equity to fixed asset | EFA |
| Current debt ratio | CDR |
| Debt to equity market ratio | DEMR |
| Capital turnover ratio | CTR |
| Long-term asset turnover ratio | LATR |
| Net profit margin of current assets | NPMCA |
| Net profit margin of fixed assets | NPMFA |
| Working capital ratio | WCR |
| Working capital to total assets | WCTA |
| Working capital to net assets | WCNA |
Table A.9:
News-APP recommendation behavior data: definitions of variables.
| Definition | Variable |
|---|---|
| The news are true and accurate without false information. | D19_1 |
| The news are objective without personal bias. | D19_2 |
| The news present the scene where the events occurred with the pictures and videos. | D19_3 |
| The analysis of the news is thorough, leading to the nature of the event. | D19_4 |
| The news present multiple points on the events. | D19_5 |
| There are well-known experts’ columns or comments. | D19_6 |
| The news reflect the professionalism of editors. | D19_7 |
| The hot topics cover comprehensive news. | D19_8 |
| There are news about history, humanities, geography, arts and others, which can broaden my horizons. | D19_9 |
| The news include the global information. | D19_10 |
| The news include rich live events related to my daily life. | D19_11 |
| The news include discount information about living and consuming, which bring great benefits to my life. | D19_12 |
| The news include all kinds of useful information in life. | D19_13 |
| The news are reported in time when an important event happens. | D19_14 |
| The updates of news are in time with fresh information. | D19_15 |
| There are live news and sports. | D19_16 |
| The overall style of the news is acceptable and desirable. | D19_17 |
| The news have different kinds of styles. | D19_18 |
| There are some desirable special columns. | D19_19 |
| The news reflect the standpoints of the editors. | D19_20 |
| The APP provides a sense of participation. | D19_21 |
| The interactive contents are readable with good quality. | D19_22 |
| The APP lets me interact with a group of people who have the same hobbies as me. | D19_23 |
| I can find someone with the same likes and dislikes when I participate in the APP. | D19_24 |
| There is plenty of news that I am interested in. | D19_25 |
| It is very convenient to read the news that I am interested in. | D19_26 |
| The APP can present the news that I am interested in positively. | D19_27 |
| The APP gives prominence to the key points. | D20_1 |
| The images and text layout match properly. | D20_2 |
| The color assortment is reasonable and desirable. | D20_3 |
| The interface design is novel. | D20_4 |
| The interface and logo design can reflect the characteristics of the APP. | D20_5 |
| The interface and logo design can indicate that the APP is news-APP clearly. | D20_6 |
| The video is vivid with high quality. | D20_7 |
| The video works fine. | D20_8 |
| It is very convenient to find and play the video. | D20_9 |
| The pictures are clear with high quality. | D20_10 |
| It is very convenient to find and read the pictures. | D20_11 |
| The presentations of the pictures are novel and beautiful. | D20_12 |
| The types of multimedia are rich with high selectivity. | D20_13 |
| The APP does not consume too much cell phone traffic. | D20_14 |
| The APP runs with less memory and does not affect the speed of the phone. | D20_15 |
| The APP automatically cleans up the cache content and does not take up the phone storage resources. | D20_16 |
| The operations of APP are smooth and do not crash and flare. | D20_17 |
| It is convenient to view the comments. | D20_18 |
| It is convenient to submit the comments. | D20_19 |
| The APP can be shared to many social platforms. | D20_20 |
| It is easy to find the function set menu. | D20_21 |
| The set of function keys is reasonable and does not waste the limited interface. | D20_22 |
| The function design is in line with conventional operating habits. | D20_23 |
| Content sections can be set freely. | D20_24 |
| The function and design can be set in accordance with my preferences and needs. | D20_25 |
| There are many types of subscription. | D20_26 |
| Each type of subscription includes many sources. | D20_27 |
| The subscription process is simple with only a few steps. | D20_28 |
| The news can be subscribed based on keywords. | D20_29 |
| It is easy to find the news I want to subscribe to. | D20_30 |
| Push news can be selected according to my preferences. | D20_31 |
| There are hot search words presented in the search interface. | D20_32 |
| The recommended news are in accordance with my preferences. | D20_33 |
| Search results are rich and accurate. | D20_34 |
| It is easy to find the news that I want to add. | D20_35 |
| The navigation bar is clear. | D20_36 |
| It is easy to find what I am interested in. | D20_37 |
| It is easy to find the advertisement of this APP. | D21_1 |
| I am impressed by the APP’s advertisement. | D21_2 |
| The APP’s advertisement is desirable. | D21_3 |
| This APP is praised and recognized by the relevant experts. | D21_4 |
| This APP is always associated with some major events and sports. | D21_5 |
| This APP has good reputation. | D21_6 |
| People around are using this APP. | D21_7 |
| The parent brand of the APP has a good image and is recognized. | D21_8 |
| The parent brand of the APP has great influencing power. | D21_9 |
| The image of the parent brand of the APP is in line with the characteristics of the news product. | D21_10 |
| This APP has a clear brand personality, such as youthful, rigorous and so on. | D21_11 |
| The APP’s user group is the same type as me. | D21_12 |
| The personality of this APP is my favorite. | D21_13 |
| The overall style of the APP is acceptable and desirable. | D21_14 |
A.3. Additional numerical results of the data analysis
Figure A.1:
Data analysis: OOI. Top/Blue: financial early warning system data. Bottom/Red: news-APP recommendation behavior data.
A.4. Additional data analysis: gas station customers’ psychology and behavior
In this section, we analyze another real dataset with a higher dimension (d = 108) to further examine the effectiveness of the proposed method. The dataset is based on a questionnaire survey of gas station customers in Guangzhou, Guangdong Province, China, in 2014. It was collected by an energy company with the goal of studying the relationship between consumers’ psychology and behavior. A multi-stage stratified sampling design was adopted, and the survey includes a total of 486 customers of gas stations in both urban and suburban areas with various road conditions. As psychology and behavior differ significantly across customers of different ages, we partition the customers into three datasets according to their birth years: earlier than 1970 (dataset 1, n(1) = 96), between 1970 and 1980 (dataset 2, n(2) = 182), and later than 1980 (dataset 3, n(3) = 208). The response of interest is the amount of gas consumed in the past year, which is continuous and analyzed using the linear model. The predictors include three personal information variables (gender, marital status, and education) and 105 psychological and behavioral measurements, which are recorded on a ten-point Likert scale with one and ten indicating extreme dissatisfaction and complete satisfaction, respectively. Owing to business confidentiality, the detailed variables and questionnaire cannot be made publicly available.
The summary results and detailed estimation results for the four methods are provided in Tables A.10 and A.11, respectively. The proposed method identifies 10 main effects and 11 interactions, which differ from those identified by the alternatives. The same prediction and stability evaluations are conducted. The mean prediction errors (PEs) are 7.708 (M1), 13.233 (M2), 10.109 (M3), and 10.367 (M4), suggesting the superiority of the proposed method. Figure A.2 shows that the proposed method also has the best selection stability.
Table A.10:
Additional data analysis: numbers of overlapping main effects and interactions. In each cell: dataset 1/dataset 2/dataset 3.
| Main effects | M1 | M2 | M3 | M4 |
|---|---|---|---|---|
| M1 | 10/10/10 | 7/6/6 | 8/8/8 | 4/4/4 |
| M2 | | 9/11/9 | 6/6/7 | 2/1/1 |
| M3 | | | 12/12/12 | 3/3/3 |
| M4 | | | | 7/7/7 |

| Interactions | M1 | M2 | M3 | M4 |
|---|---|---|---|---|
| M1 | 11/11/11 | 5/6/9 | 6/6/6 | 1/1/1 |
| M2 | | 7/9/15 | 5/4/8 | 0/0/0 |
| M3 | | | 9/9/9 | 0/0/0 |
| M4 | | | | 3/3/3 |
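Each cell of Table A.10 is a set overlap: a diagonal cell counts the terms one method identifies, and an off-diagonal cell counts the terms two methods share. A minimal Python sketch of this bookkeeping for a single dataset follows. The selected sets are hypothetical toy values, not the results in Table A.11, and `overlap_table` is an illustrative name.

```python
def overlap_table(selected):
    """Return {(a, b): |S_a ∩ S_b|} for the upper triangle, diagonal included."""
    methods = list(selected)
    return {
        (a, b): len(selected[a] & selected[b])
        for i, a in enumerate(methods)
        for b in methods[i:]
    }


# Hypothetical selections for one dataset (toy values only).
toy_selected = {
    "M1": {"V1", "V15", "V22"},
    "M2": {"V1", "V22", "V4"},
    "M3": {"V4"},
}

table = overlap_table(toy_selected)
for (a, b), count in table.items():
    print(a, b, count)
```

Repeating the computation for each of the three datasets and joining the counts with slashes gives cells in the "dataset 1/dataset 2/dataset 3" layout used above.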
Table A.11:
Analysis of gas station customers’ psychology and behavior data: estimated coefficients for main effects and interactions.
| Dataset 1 | Dataset 2 | Dataset 3 | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| M1 | M2 | M3 | M4 | M1 | M2 | M3 | M4 | M1 | M2 | M3 | M4 | |
| V1 | −0.091 | −0.094 | −0.073 | 0.100 | 0.099 | −0.073 | −0.103 | 0.098 | −0.073 | |||
| V15 | −0.070 | −0.095 | −0.081 | 0.090 | 0.098 | −0.081 | −0.092 | 0.090 | −0.081 | |||
| V19 | 0.086 | |||||||||||
| V22 | −0.084 | −0.093 | −0.084 | 0.084 | 0.093 | −0.084 | −0.077 | 0.106 | −0.084 | |||
| V24 | −0.095 | −0.095 | −0.084 | −0.095 | ||||||||
| V3 | −0.050 | −0.050 | −0.050 | |||||||||
| V31 | −0.087 | −0.086 | 0.027 | −0.093 | 0.112 | 0.161 | 0.027 | 0.069 | −0.080 | 0.102 | 0.027 | 0.110 |
| V35 | 0.100 | 0.100 | −0.035 | 0.048 | ||||||||
| V4 | 0.095 | 0.092 | 0.062 | −0.098 | −0.022 | 0.062 | −0.155 | 0.025 | 0.062 | 0.109 | ||
| V48 | −0.071 | |||||||||||
| V5 | −0.091 | −0.088 | 0.102 | 0.100 | −0.088 | −0.100 | 0.091 | −0.088 | ||||
| V50 | 0.092 | 0.092 | 0.092 | |||||||||
| V59 | −0.095 | |||||||||||
| V6 | −0.085 | 0.099 | 0.086 | |||||||||
| V60 | −0.093 | −0.093 | −0.093 | |||||||||
| V64 | 0.191 | 0.202 | 0.196 | |||||||||
| V67 | 0.066 | −0.194 | −0.088 | −0.174 | −0.011 | 0.155 | ||||||
| V75 | −0.061 | −0.100 | ||||||||||
| V78 | 0.070 | 0.096 | −0.234 | −0.032 | 0.096 | −0.205 | 0.100 | 0.096 | 0.113 | |||
| V87 | 0.074 | |||||||||||
| V88 | −0.167 | −0.103 | 0.127 | |||||||||
| V9 | −0.100 | −0.093 | −0.077 | 0.112 | 0.096 | −0.077 | −0.111 | 0.104 | −0.077 | |||
| V95 | 0.169 | 0.155 | 0.122 | |||||||||
| V1×V15 | −0.078 | −0.094 | 0.015 | 0.085 | 0.015 | −0.096 | 0.096 | 0.015 | ||||
| V1×V22 | −0.088 | −0.094 | 0.011 | 0.084 | 0.097 | 0.011 | −0.082 | 0.112 | 0.011 | |||
| V1×V31 | −0.086 | 0.083 | −0.190 | 0.101 | 0.093 | −0.190 | −0.088 | 0.015 | −0.190 | |||
| V4×V35 | 0.100 | 0.100 | −0.019 | 0.028 | ||||||||
| V6×V31 | −0.094 | −0.090 | 0.094 | |||||||||
| V9×V31 | −0.351 | −0.094 | 0.363 | 0.356 | 0.182 | 0.363 | −0.364 | 0.024 | 0.363 | |||
| V15×V31 | −0.095 | 0.090 | 0.090 | 0.091 | 0.090 | |||||||
| V1×V5 | −0.098 | 0.108 | 0.103 | −0.103 | 0.097 | |||||||
| V1×V9 | −0.102 | 0.109 | 0.094 | −0.111 | 0.110 | |||||||
| V19×V31 | 0.092 | |||||||||||
| V22×V31 | −0.095 | 0.093 | 0.092 | 0.358 | 0.093 | −0.089 | 0.025 | 0.093 | ||||
| V48×V75 | −0.094 | |||||||||||
| V1×V6 | 0.094 | |||||||||||
| V6×V33 | 0.101 | |||||||||||
| V9×V15 | 0.013 | −0.085 | 0.010 | −0.085 | −0.011 | 0.012 | −0.085 | |||||
| V9×V22 | −0.095 | 0.092 | −0.090 | 0.115 | ||||||||
| V15×V22 | −0.089 | −0.089 | 0.012 | −0.089 | ||||||||
| V24×V75 | −0.100 | |||||||||||
| V31×V78 | 0.092 | 0.092 | 0.092 | |||||||||
| V67×V78 | 0.072 | 0.032 | −0.051 | 0.025 | 0.036 | −0.027 | ||||||
| V4×V78 | 0.026 | 0.026 | −0.020 | |||||||||
| V64×V95 | −0.030 | −0.029 | −0.025 | |||||||||
Figure A.2:
Additional data analysis (gas station customers’ psychology and behavior): OOI.
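The stability evaluation can be illustrated with a short Python sketch. It assumes that OOI denotes an observed occurrence index, computed as the fraction of random subsamples in which a term identified on the full data is identified again; the subsampling fraction, the number of resamples, and the toy selector below are assumptions made for illustration and do not reproduce the proposed TGDR-based method.

```python
import random


def ooi(data, select, n_resamples=100, frac=0.8, seed=0):
    """Fraction of subsamples that re-select each term found on the full data."""
    rng = random.Random(seed)
    full = select(data)
    counts = {term: 0 for term in full}
    k = max(1, int(frac * len(data)))
    for _ in range(n_resamples):
        chosen = select(rng.sample(data, k))  # subsample without replacement
        for term in full:
            if term in chosen:
                counts[term] += 1
    return {term: c / n_resamples for term, c in counts.items()}


def toy_select(rows):
    """Hypothetical selector: keep features whose mean absolute value exceeds 0.5."""
    names = rows[0].keys()
    return {
        name
        for name in names
        if sum(abs(r[name]) for r in rows) / len(rows) > 0.5
    }


# Toy data: "a" is always selected, "b" never, "c" only in some subsamples.
toy_data = [
    {"a": 1.0, "b": 0.0, "c": 1.0 if i % 2 == 0 else 0.2} for i in range(20)
]

result = ooi(toy_data, toy_select)
print(result)
```

A term with an index close to one is selected in nearly every subsample, which is the sense in which a method with higher values is described as more stable.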
References
- 1. Bien J, Taylor J, Tibshirani R. A Lasso for hierarchical interactions. Annals of Statistics 2013;41(3):1111–1141.
- 2. Lim M, Hastie T. Learning interactions via hierarchical group-Lasso regularization. Journal of Computational and Graphical Statistics 2015;24(3):627–654.
- 3. Zeng X, Ma S, Qin Y, Li Y. Variable selection in strong hierarchical semiparametric models for longitudinal data. Statistics and Its Interface 2015;8(3):355–365.
- 4. Hsu D. Identifying key variables and interactions in statistical models of building energy consumption using regularization. Energy 2015;83:144–155.
- 5. Hayashi M, Boadway R. An empirical analysis of intergovernmental tax interaction: the case of business income taxes in Canada. Canadian Journal of Economics 2001;34(2):481–503.
- 6. Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2005;67(2):301–320.
- 7. Huang J, Horowitz JL, Ma S. Asymptotic properties of bridge estimators in sparse high-dimensional regression models. The Annals of Statistics 2008;36(2):587–613.
- 8. Ziemer RF, Wetzstein ME. A Stein-rule method for pooling data. Economics Letters 1983;11:137–143.
- 9. Ejaz Ahmed S, Yüzbasi B. Big data analytics: integrating penalty strategies. International Journal of Management Science and Engineering Management 2016;11(2):105–115.
- 10. Shah MKA, Lisawadi S, Ejaz Ahmed S. Merging data from multiple sources: pretest and shrinkage perspectives. Journal of Statistical Computation and Simulation 2017;87(8):1577–1592.
- 11. Liu J, Huang J, Ma S. Integrative analysis of cancer diagnosis studies with composite penalization. Scandinavian Journal of Statistics 2014;41(1):87–103.
- 12. Huang Y, Liu J, Yi H, Shia BC, Ma S. Promoting similarity of model sparsity structures in integrative analysis of cancer genetic data. Statistics in Medicine 2017;36(3):509–559.
- 13. Friedman J, Popescu BE. Gradient directed regularization for linear regression and classification. Technical Report, Statistics Department, Stanford University, 2003.
- 14. Shi X, Liu J, Huang J, Zhou Y, Shia B, Ma S. Integrative analysis of high-throughput cancer studies with contrasted penalization. Genetic Epidemiology 2014;38(2):144–151.
- 15. Liu J, Huang J, Ma S. Integrative analysis of multiple cancer genomic datasets under the heterogeneity model. Statistics in Medicine 2013;32(20):3509–3521.
- 16. Stute W. Distributional convergence under random censorship when covariables are present. Scandinavian Journal of Statistics 1996;23:461–471.
- 17. Li J, Qin Y, Yi D, Li Y, Shen Y. Feature selection for support vector machine in the study of financial early warning system. Quality and Reliability Engineering International 2014;30(6):867–877.
- 18. Koyuncugil AS, Ozgulbas N. Financial early warning system model and data mining application for risk detection. Expert Systems with Applications 2012;39(6):6238–6253.
- 19. Huang J, Ma S. Variable selection in the accelerated failure time model via the bridge method. Lifetime Data Analysis 2010;16(2):176–195.



