Author manuscript; available in PMC: 2020 Oct 16.
Published in final edited form as: Appl Stoch Models Bus Ind. 2018 May 29;35(2):354–375. doi: 10.1002/asmb.2342

Integrative Interaction Analysis using Threshold Gradient Directed Regularization

Yang Li 1,2, Rong Li 2, Yichen Qin 3, Mengyun Wu 4,5, Shuangge Ma 2,5
PMCID: PMC7565571  NIHMSID: NIHMS966821  PMID: 33071651

Abstract

For many complex business and industry problems, high-dimensional data collection and modeling have been conducted. It has been shown that interactions may have important implications beyond main effects. The number of unknown parameters in an interaction analysis can be larger or much larger than the sample size. As such, results generated from analyzing a single dataset are often unsatisfactory. Integrative analysis, which jointly analyzes the raw data from multiple independent studies, has been conducted in a series of recent studies and shown to outperform single-dataset analysis, meta-analysis, and other multi-datasets analyses. In this study, our goal is to conduct integrative analysis in interaction analysis. For regularized estimation and selection of important interactions (and main effects), we apply a Threshold Gradient Directed Regularization (TGDR) approach. Advancing from the existing studies, the TGDR approach is modified to respect the “main effects, interactions” hierarchy. The proposed approach has an intuitive formulation and is computationally simple and broadly applicable. Simulations and the analyses of financial early warning system data and news-APP recommendation behavior data demonstrate its satisfactory practical performance.

Keywords: High-dimensional data, Integrative analysis, Interaction analysis, TGDR

1. Introduction

In multiple areas of business and industry, high-dimensional data have been extensively collected and modeled in the search for important predictors of gross domestic product (GDP) growth, percentage changes in stock markets, optimal portfolio allocations, and other outcomes. Accumulating evidence suggests that interactions may have important implications beyond the main effects. Extensive methodological development and data analysis have been conducted [1, 2, 3]. Promising findings have been made for multiple business and industry problems [4, 5].

In high-dimensional interaction analysis, there are two generic paradigms. In marginal analysis, one or a small number of variables are analyzed at a time. In contrast, in joint analysis, a large number of variables are analyzed in a single model. As complex business and industry processes are attributable to the joint effects of multiple factors, joint analysis can be more sensible but, at the same time, more challenging. Consider a dataset with n samples and d predictors. In a joint interaction analysis, the number of unknown parameters is of the order d^2. For some business and industry studies, with the high cost of data collection, even with pre-processing, d can still be larger than or comparable to n. As such, often d^2 >> n. There is thus a regularized model estimation problem. In addition, for some specific business/industry outcomes, as most of the interactions and main effects are not expected to be relevant, there is also a selection problem. In the literature, a large number of regularization techniques have been developed for estimation and selection with high-dimensional models [6, 7].

In high-dimensional data analysis, it has been recognized that, with small sample sizes, results generated from analyzing a single dataset are often unsatisfactory. For many business/industry problems of common interest, there are often multiple independent studies with comparable designs, making it possible to pool multiple datasets and increase sample size. There are generally two types of methods for pooling data. The first one is the classic meta-analysis which analyzes each dataset separately and pools summary statistics based on different rules. Among them, the pretest, Stein and Bayes rules have been the common choices in the literature. For example, the Stein rule has been developed for combining regression estimates with multiple low-dimensional datasets [8], combining regression estimates with one high-dimensional dataset under multiple penalized regression models [9] and combining estimates of mean with multiple datasets [10]. The second one is the more recent integrative analysis which jointly analyzes raw data from multiple independent datasets. In a series of recent high-dimensional studies on genetics [11, 12], integrative analysis techniques have been developed and shown to outperform single-dataset analysis, classic meta-analysis, and other multi-datasets methods. Despite its promising successes in the analysis of main effects with high-dimensional data, integrative analysis has not been well conducted in business studies in the context of interaction analysis, which has a much higher dimensionality and has a stronger need to increase sample size by pooling data.

Interaction analysis has unique characteristics, making directly adopting the existing integrative analysis techniques inappropriate. Specifically, in interaction analysis, there is a need to respect the “main effects, interactions” hierarchical structure [1]. Under the strong hierarchy, if an interaction is identified, then both of the corresponding main effects also need to be identified. In contrast, under the weak hierarchy, only one of the corresponding main effects needs to be identified. Directly applying the existing integrative analysis techniques may generate results that violate the hierarchy, causing trouble in interpretation and inference.

Motivated by the need to pool data and increase sample size in interaction analysis and success of integrative analysis in the analysis of main effects, the goal of this study is to conduct integrative interaction analysis. Significantly advancing from the existing interaction analysis, the integrative analysis of multiple independent datasets is conducted, which provides an effective way of increasing sample size and improving analysis results. Advancing from the existing integrative analysis, the more challenging interaction analysis, which has a need for respecting the hierarchy, is conducted. This study also contains a novel development of the TGDR (Threshold Gradient Directed Regularization) technique, which may have independent methodological value. Last but not least, this study extends the promising integrative analysis paradigm, which has been developed in genetics and other scientific fields, to the analysis of business and industry data.

2. Methods

2.1. Interaction analysis with a single dataset

First consider a single dataset with d-dimensional predictors X = (X1, …, Xd)′ and response variable Y. In the literature, multiple frameworks have been developed for interaction analysis. Here we adopt the regression-based framework, which has a solid statistical ground and very lucid interpretation. Consider the model

$$Y \sim \phi\Big(\beta_0 + \sum_{j=1}^{d} \beta_j X_j + \sum_{j<k} \gamma_{jk} X_j X_k\Big), \qquad (1)$$

where the form of model ϕ(·) is known, β0 is the intercept, βj’s represent the main effects, and γjk’s represent interactions. In (1), “self interactions” (squared terms) are not included but can be easily added back. Note that model (1) is very generic, and the (data, model) dual can be (continuous data, linear regression model), (categorical/count data, generalized linear model), (survival data, AFT–accelerated failure time or Cox model), and many others. Denote θ = (β0, …,βj, …,βd,γ12, γ13, …)′ as the vector of all unknown regression coefficients. With n iid subjects, denote l(θ) as the log-likelihood function. It is noted that l(·) can also be other goodness-of-fit measures.

For regularized estimation and selection of important interactions and main effects, we propose using the TGDR technique. Thresholding is a popular technique in high-dimensional analysis, and many other regularization techniques, for example penalization, are closely connected to thresholding. TGDR was first developed for continuous data and linear regression [13] and later extended to other data and model settings. To accommodate interactions, the original TGDR algorithm needs to be revised. Specifically, we propose the following algorithm:

Algorithm I: TGDR for single dataset interaction analysis

  (a) Initialize $t = 0$ and $\theta(t) = 0$, where $\theta(t)$ denotes the estimate of $\theta$ at step $t$;

  (b) Update $t = t + 1$. Compute $f = \partial l(\theta)/\partial\theta \,|_{\theta = \theta(t-1)}$, the vector of first-order derivatives. Denote the components of $f$ as $(f_0, f_1, \ldots, f_d, \tilde f_{12}, \ldots, \tilde f_{jk}, \ldots)$, where $f_j$ and $\tilde f_{jk}$ correspond to those of $\beta_j$ and $\gamma_{jk}$, respectively;

  (c) Compute the thresholding indicator $g = (g_0, g_1, \ldots, g_d, \tilde g_{12}, \ldots, \tilde g_{jk}, \ldots)$ of $f$. Specifically,
    • (c.1) $g_0 = 1$;
    • (c.2) $\tilde g_{jk} = I(|\tilde f_{jk}| > \tau \max_{u,v} |\tilde f_{uv}|)$;
    • (c.3) $g_j = I(|f_j| > \tau \max_{u} |f_u|)$. In addition, if $\tilde g_{jk} = 1$, then set $g_j = 1$ and $g_k = 1$;

  (d) Update $\theta(t) = \theta(t-1) + \Delta\, g \odot f$, where $\odot$ is the component-wise product and $\Delta$ is the step size;

  (e) Repeat Steps (b)-(d) a large number of times. Select the optimal number of iterations $t_{opt}$. The final estimate is $\theta(t_{opt})$. Interactions and main effects that correspond to the nonzero components of $\theta(t_{opt})$ are identified as important.
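To make the steps of Algorithm I concrete, the following is a minimal sketch in plain Python for the (continuous data, linear regression) case. It is an illustration under stated assumptions, not the authors' implementation: the goodness-of-fit is taken to be l(θ) = −(1/(2n)) Σ (y_i − z_i′θ)², the function names `expand` and `tgdr_single` are hypothetical, and the toy data at the end are invented for demonstration.

```python
import random
from itertools import combinations


def expand(x):
    """Map one observation to [1, main effects, pairwise interactions (j < k)]."""
    return [1.0] + list(x) + [x[j] * x[k] for j, k in combinations(range(len(x)), 2)]


def tgdr_single(X, y, tau=0.9, n_steps=300, delta=0.01):
    """Sketch of Algorithm I with l(theta) = -(1/(2n)) * sum of squared residuals."""
    n, d = len(X), len(X[0])
    pairs = list(combinations(range(d), 2))
    Z = [expand(x) for x in X]
    theta = [0.0] * len(Z[0])                      # step (a): start from the null model
    for _ in range(n_steps):
        # step (b): gradient of l at the current estimate
        resid = [yi - sum(z * t for z, t in zip(zi, theta)) for zi, yi in zip(Z, y)]
        f = [sum(zi[c] * r for zi, r in zip(Z, resid)) / n for c in range(len(theta))]
        f_main, f_int = f[1:d + 1], f[d + 1:]
        # step (c): main effects and interactions are thresholded separately
        max_main = max((abs(v) for v in f_main), default=0.0)
        max_int = max((abs(v) for v in f_int), default=0.0)
        g_main = [abs(v) > tau * max_main for v in f_main]
        g_int = [abs(v) > tau * max_int for v in f_int]
        for (j, k), selected in zip(pairs, g_int):
            if selected:                           # (c.3): strong hierarchy
                g_main[j] = g_main[k] = True
        g = [True] + g_main + g_int                # (c.1): intercept always updated
        # step (d): move only the selected coordinates along the gradient
        theta = [t + delta * fi if gi else t for t, fi, gi in zip(theta, f, g)]
    return theta, pairs


# Toy data (hypothetical): two main effects and their interaction matter.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(300)]
y = [2 * x[0] + 2 * x[1] + 2 * x[0] * x[1] + rng.gauss(0, 1) for x in X]
theta, pairs = tgdr_single(X, y)
selected = [i for i, t in enumerate(theta) if t != 0.0]
print(len(selected), "nonzero coefficients out of", len(theta))
```

Because a coefficient only moves away from zero once its indicator is switched on, the set of nonzero entries of `theta` after a given number of steps is the selected model, and any nonzero interaction carries both of its main effects with it.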

Similar to the standard TGDR algorithm, the proposed algorithm starts with a null model. In each step, gradients are computed and used to direct the update. Different from ordinary gradient-based optimization techniques, variables are selected based on the magnitudes of gradients, and only coefficients of the selected variables are updated. There are multiple differences between the proposed and existing TGDR algorithms. First, in Step (c), as interactions and main effects have different grounds, an interaction term is only compared with other interactions, and a main effect is only compared with other main effects. In addition, in the second part of Step (c.3), the proposed algorithm ensures that if an interaction is selected, then the corresponding main effects are also selected. That is, the strong hierarchy is respected. The proposed algorithm can also be revised to respect the weak hierarchy. Specifically, in (c.3), if $\tilde g_{jk} = 1$ and $g_j = g_k = 0$, then set $g_j = 1$ if $|f_j| > |f_k|$, and set $g_k = 1$ otherwise.
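The weak-hierarchy variant of Step (c.3) can be sketched as a small helper. The function name and argument layout are hypothetical; `pairs` lists the (j, k) index pair of each interaction in the same order as `g_int`.

```python
def enforce_weak_hierarchy(g_main, g_int, f_main, pairs):
    """For each selected interaction (j, k) with neither main effect selected,
    switch on the main effect whose gradient has the larger magnitude."""
    g_main = list(g_main)  # work on a copy so the caller's list is untouched
    for (j, k), selected in zip(pairs, g_int):
        if selected and not g_main[j] and not g_main[k]:
            if abs(f_main[j]) > abs(f_main[k]):
                g_main[j] = True
            else:
                g_main[k] = True
    return g_main
```

Under this rule a selected interaction brings in at most one additional main effect, and none at all if one of its main effects has already passed the threshold.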

In the proposed algorithm, Δ is the step size. Published studies suggest that the value of Δ is not critical, as long as it is small enough. In our numerical study, we set Δ = 0.01. The proposed algorithm also involves the threshold 0 ≤ τ ≤ 1 and the number of steps topt, both of which affect selection and estimation. More specifically, when topt is fixed, a larger τ leads to a sparser model. When τ is fixed, a larger topt leads to a denser model. In our numerical study, we conduct a two-dimensional grid search and select the optimal values of τ and topt using five-fold cross validation. Specifically, the subjects are randomly partitioned into five disjoint sets of equal size. For each combination (τ, topt), the proposed method is trained on four of the five sets and then tested on the remaining one to obtain the prediction error. This process is conducted five times, once for each held-out set. The combination (τ, topt) with the smallest average prediction error is selected as the optimal one. More discussions on the operating characteristics are provided in Section 3.1.
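The tuning procedure can be sketched generically as follows. This is only an outline under assumptions: `fit(X_train, y_train, tau, t)` and `pred_error(model, X_test, y_test)` are hypothetical callables supplied by the user (for instance, a TGDR fit and a mean squared prediction error), and the fold assignment is one simple way of obtaining five near-equal disjoint sets.

```python
import random


def five_folds(n, seed=0):
    """Randomly partition indices 0..n-1 into five disjoint, near-equal sets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::5] for i in range(5)]


def grid_search_cv(fit, pred_error, X, y, taus, steps, seed=0):
    """Return (average error, tau, t) for the best combination on the grid."""
    folds = five_folds(len(X), seed)
    best = None
    for tau in taus:
        for t in steps:
            errors = []
            for test in folds:
                held_out = set(test)
                train = [i for i in range(len(X)) if i not in held_out]
                model = fit([X[i] for i in train], [y[i] for i in train], tau, t)
                errors.append(pred_error(model, [X[i] for i in test], [y[i] for i in test]))
            avg = sum(errors) / len(errors)
            if best is None or avg < best[0]:   # keep the smallest average error
                best = (avg, tau, t)
    return best
```

Since TGDR is iterative, an implementation would likely compute one path per τ and read off all candidate values of t from it rather than refitting for every pair; the nested loop above is kept for clarity.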

2.2. Integrative interaction analysis

Consider the integrative analysis of M independent datasets on the same scientific problem. In multi-datasets analysis, the selection of data and pre-processing are challenging tasks. However, they have been extensively discussed in the literature [14] and will not be reiterated here. Further assume that variables have been matched across datasets. Use notations similar to those in Section 2.1, and add superscript “(m)” to denote the mth dataset. Specifically, in dataset m, there are n(m) iid samples, each with measurements on the response variable Y(m) and d predictors X(m). Assume the regression model (1) for each dataset. Note that, in practice, the M datasets are collected under similar but not identical protocols. With possible differences in sample characteristics and other factors, the regression coefficients in different datasets are not assumed to be equal. This assumption is more flexible than that in many other multi-dataset analyses. In dataset m, denote l(m)(θ(m)) as the log-likelihood function normalized by sample size.

For integrative interaction analysis, the proposed method proceeds as follows:

Algorithm II: TGDR for integrative interaction analysis

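  (a) Initialize $t = 0$ and $\theta^{(m)}(t) = 0$ for $m = 1, \ldots, M$;

  (b) Update $t = t + 1$. For $m = 1, \ldots, M$, compute $f^{(m)} = \partial l^{(m)}(\theta^{(m)})/\partial\theta^{(m)} \,|_{\theta^{(m)} = \theta^{(m)}(t-1)}$, the vector of first-order derivatives. Denote the components of $f^{(m)}$ as $(f_0^{(m)}, \ldots, f_d^{(m)}, \tilde f_{12}^{(m)}, \ldots, \tilde f_{jk}^{(m)}, \ldots)$, where $f_j^{(m)}$ and $\tilde f_{jk}^{(m)}$ correspond to those of $\beta_j^{(m)}$ and $\gamma_{jk}^{(m)}$, respectively;

  (c) Compute the thresholding indicator $g = (g_0, g_1, \ldots, g_d, \tilde g_{12}, \ldots, \tilde g_{jk}, \ldots)$. Specifically,
    • (c.1) $g_0 = 1$;
    • (c.2) $\tilde g_{jk} = I\big(\sum_m |\tilde f_{jk}^{(m)}| > \tau \max_{u,v} \sum_m |\tilde f_{uv}^{(m)}|\big)$;
    • (c.3) $g_j = I\big(\sum_m |f_j^{(m)}| > \tau \max_{u} \sum_m |f_u^{(m)}|\big)$. In addition, if $\tilde g_{jk} = 1$, then set $g_j = 1$ and $g_k = 1$;

  (d) For $m = 1, \ldots, M$, update $\theta^{(m)}(t) = \theta^{(m)}(t-1) + \Delta\, g \odot f^{(m)}$, where $\Delta$ is the step size;

  (e) Repeat Steps (b)-(d) a large number of times. Select the optimal number of iterations $t_{opt}$. The final estimate is $\theta^{(m)}(t_{opt})$. Interactions and main effects that correspond to the nonzero components of $\theta^{(m)}(t_{opt})$ are identified as important.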
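A minimal sketch of Algorithm II in plain Python follows, again for the linear regression case with a squared-error goodness-of-fit normalized by sample size. The names `expand`, `gradient`, and `tgdr_integrative` and the toy data are hypothetical; the point of the sketch is only that the indicator g is computed once from gradients summed over datasets, while each dataset keeps its own coefficients.

```python
import random
from itertools import combinations


def expand(x):
    """Map one observation to [1, main effects, pairwise interactions (j < k)]."""
    return [1.0] + list(x) + [x[j] * x[k] for j, k in combinations(range(len(x)), 2)]


def gradient(Z, y, theta):
    """Gradient of l = -(1/(2n)) * sum of squared residuals (normalized by n)."""
    n = len(Z)
    resid = [yi - sum(z * t for z, t in zip(zi, theta)) for zi, yi in zip(Z, y)]
    return [sum(zi[c] * r for zi, r in zip(Z, resid)) / n for c in range(len(theta))]


def tgdr_integrative(datasets, tau=0.9, n_steps=200, delta=0.01):
    """datasets: list of (X, y) pairs sharing the same d matched predictors."""
    d = len(datasets[0][0][0])
    pairs = list(combinations(range(d), 2))
    Zs = [[expand(x) for x in X] for X, _ in datasets]
    ys = [y for _, y in datasets]
    p = 1 + d + len(pairs)
    thetas = [[0.0] * p for _ in datasets]                        # step (a)
    for _ in range(n_steps):
        fs = [gradient(Z, y, th) for Z, y, th in zip(Zs, ys, thetas)]  # step (b)
        total = [sum(abs(f[c]) for f in fs) for c in range(p)]    # sum over m of |f^(m)|
        tot_main, tot_int = total[1:d + 1], total[d + 1:]
        max_main = max(tot_main, default=0.0)
        max_int = max(tot_int, default=0.0)
        g_main = [v > tau * max_main for v in tot_main]           # step (c)
        g_int = [v > tau * max_int for v in tot_int]
        for (j, k), selected in zip(pairs, g_int):
            if selected:                                          # strong hierarchy
                g_main[j] = g_main[k] = True
        g = [True] + g_main + g_int
        thetas = [[t + delta * fi if gi else t for t, fi, gi in zip(th, f, g)]
                  for th, f in zip(thetas, fs)]                   # step (d)
    return thetas, pairs


# Toy data (hypothetical): same sparsity structure, different effect sizes.
rng = random.Random(1)
datasets = []
for n_m, b in [(100, 2.0), (80, 1.5), (70, 1.0)]:
    X = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(n_m)]
    y = [b * (x[0] + x[1] + x[0] * x[1]) + rng.gauss(0, 1) for x in X]
    datasets.append((X, y))
thetas, pairs = tgdr_integrative(datasets)
```

Because g is shared, all datasets end up with the same set of nonzero coefficients, while the magnitudes are free to differ.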

This algorithm shares a similar spirit with Algorithm I. The parameters Δ, τ, and topt have similar implications and will be chosen in the same manner. A small modification is that the log-likelihood functions need to be normalized so that the analysis is not dominated by larger datasets.

The key difference from single-dataset analysis is Step (c), where, in determining the thresholding indicator (and hence selection and estimation results), we jointly consider all M datasets. Here, we identify interactions and main effects with the largest overall gradients and update their estimates. In this way, if an interaction (or main effect) has weak “evidence” in one dataset but strong “evidence” in other datasets, it can also be identified. In integrative analysis when variable selection is of interest, two model structures have been proposed [15]. The first is the homogeneity structure, under which multiple datasets share the same sparsity structure. The other is the heterogeneity structure, under which multiple datasets can have possibly different sparsity structures. Algorithm II reinforces the homogeneity structure, which is appropriate when multiple datasets are “similar enough”. Extension to the heterogeneity structure will be discussed in Section 5.

2.3. Analysis of censored survival data under the AFT model

In our numerical study (simulation and data analysis), we analyze censored survival data, which can be more complicated than continuous and categorical data. Such data are encountered in the analysis of financial default, device malfunction, and others. In stock and mortgage market analysis, survival data are at least as important as other data types. To be complete, here we provide details on the data and model. In addition, we also intend to demonstrate goodness-of-fit functions other than the likelihood.

In the mth dataset, the response variable Y(m) is a survival time. Consider the AFT model

$$\log(Y^{(m)}) = \beta_0^{(m)} + \sum_{j=1}^{d} \beta_j^{(m)} X_j^{(m)} + \sum_{j<k} \gamma_{jk}^{(m)} X_j^{(m)} X_k^{(m)} + \epsilon^{(m)}, \qquad (2)$$

where $\epsilon^{(m)}$ is the random error with an unknown distribution, making it impossible to construct the likelihood function. Compared to alternatives such as the Cox model, the AFT model has a more lucid interpretation and, more importantly, the lowest computational cost, which is especially desirable with high-dimensional data. Denote $C^{(m)}$ as the censoring time. We observe $\{(T_i^{(m)} = \min(Y_i^{(m)}, C_i^{(m)}),\ \delta_i^{(m)} = I(Y_i^{(m)} \le C_i^{(m)}),\ X_i^{(m)}) : i = 1, \ldots, n^{(m)}\}$. With a slight abuse of notation, assume that data have been sorted according to the $T_i^{(m)}$’s, from the smallest to the largest. Compute the Kaplan-Meier (KM) weights as

$$\omega_1^{(m)} = \frac{\delta_1^{(m)}}{n^{(m)}}, \qquad \omega_i^{(m)} = \frac{\delta_i^{(m)}}{n^{(m)} - i + 1} \prod_{j=1}^{i-1} \left(\frac{n^{(m)} - j}{n^{(m)} - j + 1}\right)^{\delta_j^{(m)}}, \quad i = 2, \ldots, n^{(m)}. \qquad (3)$$

Following [16], the goodness-of-fit function can be constructed as

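$$l^{(m)}(\theta^{(m)}) = -\sum_{i=1}^{n^{(m)}} \omega_i^{(m)} \Big(\log(T_i^{(m)}) - \beta_0^{(m)} - \sum_{j=1}^{d} \beta_j^{(m)} X_{ij}^{(m)} - \sum_{j<k} \gamma_{jk}^{(m)} X_{ij}^{(m)} X_{ik}^{(m)}\Big)^2, \qquad (4)$$

with $\theta^{(m)} = (\beta_0^{(m)}, \ldots, \beta_j^{(m)}, \ldots, \beta_d^{(m)}, \gamma_{12}^{(m)}, \gamma_{13}^{(m)}, \ldots)'$, where $X_{ij}^{(m)}$ denotes the jth predictor of subject i. The rest of the operation is the same as with likelihood functions.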
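As an illustration of the Kaplan-Meier weights in (3) and the weighted least squares criterion built from them, here is a small sketch in plain Python. The function names are hypothetical. The sketch returns the weighted sum of squared residuals; using its negative as the quantity to be maximized by the gradient-directed updates is a sign convention assumed here.

```python
def km_weights(times, deltas):
    """Kaplan-Meier weights of equation (3), returned in the original subject order.

    times: observed times T_i; deltas: event indicators (1 = event, 0 = censored).
    """
    n = len(times)
    order = sorted(range(n), key=lambda i: times[i])   # sort by observed time
    weights = [0.0] * n
    prod = 1.0                                         # running product in (3)
    for rank, i in enumerate(order, start=1):
        weights[i] = deltas[i] / (n - rank + 1) * prod
        prod *= ((n - rank) / (n - rank + 1)) ** deltas[i]
    return weights


def weighted_sq_error(log_times, fitted, weights):
    """Weighted sum of squared residuals on the log-time scale."""
    return sum(w * (lt - f) ** 2 for lt, f, w in zip(log_times, fitted, weights))


# Without censoring every subject receives weight 1/n; a censored subject
# receives weight zero and its mass is passed on to later event times.
w_full = km_weights([1.0, 2.0, 3.0, 4.0], [1, 1, 1, 1])
w_cens = km_weights([1.0, 2.0, 3.0], [1, 0, 1])
```

Censored observations therefore do not enter the criterion directly; they influence the fit only through the weights of the uncensored observations that follow them.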

3. Simulation

Simulation is conducted to better gauge performance of the proposed method. Three datasets are simulated (M = 3). Two sample size settings are considered, with n(m) = 100, 80, 70 (total sample size n = 250) and n(m) = 180,170,150 (total sample size n = 500), respectively. For the number of predictors, consider d = 50 and 100, respectively, comparable to that in many business/industry studies. Note that although d is seemingly moderate, the number of unknown parameters is considerably larger than the total sample size. Consider the following simulation settings.

(S1) In (2), Xj(m)’s are independently generated from N(0,1). ϵ(m)’s are also independently generated from N(0,1). For the coefficient vector θ = (β0, …,βj, …,βd, γ12, γ13, …)′, we set

            β1   β2   β3   β4   β5   β6   β7   β8   β9   β10  γ12  γ13  γ14  γ23  γ24  γ34  γ56  γ57  γ67  γ89
Dataset 1   2    2    2    2    2    1    1    1    1    1    2    2    2    2    2    1    1    1    1    1
Dataset 2   1.5  1.5  1.5  1.5  1.5  1    1    1    1    1    1.5  1.5  1.5  1.5  1.5  1    1    1    1    1
Dataset 3   1    1    1    1    1    0.5  0.5  0.5  0.5  0.5  1    1    1    1    1    0.5  0.5  0.5  0.5  0.5

The remaining coefficients are zero. Note that the three datasets have different regression coefficients but share the same sparsity structure. All satisfy the strong hierarchy.

(S2) For the coefficient vector θ, we set

            β1   β2   β3   β4   β5   β6   β7   β8   β9   β10  γ12  γ13  γ14  γ23  γ24  γ34  γ56  γ57  γ67  γ89
Dataset 1   1    1    1    1    1    1    1    1    1    1    2    1.5  1.5  1    1    1    0.5  0.5  0.5  0.5
Dataset 2   1.5  1.5  1.5  1.5  1.5  1    1    1    1    1    2    1    1    0.5  0.5  0.5  1.5  1.5  1.5  1.5
Dataset 3   1.5  1.5  1.5  1.5  1.5  0.5  0.5  0.5  0.5  0.5  1.5  1    1    0.5  0.5  0.5  1    1    1    1

The other settings are the same as those under S1. Compared to S1, the differences across datasets are smaller under S2.

(S3) For the coefficient vector θ, we set

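            β1   β2   β3   β4   β5   β6   β7   β8   β9   β10  γ12  γ13  γ14  γ23  γ24  γ34  γ56  γ57  γ67  γ89
Dataset 1   2    2    2    2    2    1    1    1    1    1    2    1.5  1.5  1    1    1    0.5  0.5  0.5  0.5
Dataset 2   1.5  1.5  1.5  1.5  1.5  1    1    1    1    1    2    1    1    0.5  0.5  0.5  1.5  1.5  1.5  1.5
Dataset 3   1    1    1    1    1    0.5  0.5  0.5  0.5  0.5  1.5  1    1    0.5  0.5  0.5  2    2    2    2

(Values are shown as magnitudes; the signs that differ across datasets are not preserved in this copy.)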

The other settings are the same as those under S1. Under S3, a main effect/interaction may have different signs in different datasets, which allows for a greater level of across-dataset heterogeneity.

(S4) The settings are the same as those under S1, except that predictors have an auto-regressive correlation structure, with the jth and kth variables having correlation coefficient $0.5^{|j-k|}$.
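One simple way to generate predictors with this correlation structure is a stationary first-order auto-regressive recursion, sketched below in plain Python. The recursion X_j = ρ X_{j−1} + sqrt(1 − ρ²) Z_j with standard normal Z_j gives unit variances and corr(X_j, X_k) = ρ^{|j−k|}. The function names are hypothetical, and the source does not state how the correlated predictors were actually generated.

```python
import math
import random


def ar1_predictors(n, d, rho=0.5, seed=0):
    """Draw n rows of d standard normal predictors with corr(X_j, X_k) = rho**|j-k|."""
    rng = random.Random(seed)
    scale = math.sqrt(1.0 - rho ** 2)      # keeps every X_j at unit variance
    rows = []
    for _ in range(n):
        x = [rng.gauss(0, 1)]
        for _ in range(d - 1):
            x.append(rho * x[-1] + scale * rng.gauss(0, 1))
        rows.append(x)
    return rows


def sample_corr(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)
```

For large n, the sample correlation between the first and second columns should be close to 0.5, and between the first and fourth close to 0.125.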

(S5) For the coefficient vector θ, we set

            β1   β2   β3   β4   β5   β6   β7   β8   β9   β10  β11  β12  β13  β14
Dataset 1   2    2    2    2    2    1    1    1    1    1    0    0    0    0
Dataset 2   1.5  1.5  1.5  1.5  1.5  0    0    1    1    1    1    1    0    0
Dataset 3   1    1    1    1    1    0    0    0    0    0.5  0.5  0.5  0.5  0.5

and

            γ12    γ13    γ14    γ23    γ24    γ34    γ56    γ57    γ67    γ89    γ8,10  γ9,10  γ10,11 γ10,12 γ11,12 γ13,14
Dataset 1   2      1.5    1.5    1      1      1      0.5    0.5    0.5    0.5    0      0      0      0      0      0
Dataset 2   2      1      1      0.5    0.5    0.5    0      0      0      1.5    1.5    1.5    0      0      1.5    0
Dataset 3   1.5    1      1      0.5    0.5    0.5    0      0      0      0      0      0      2      2      2      2

The other settings are the same as those under S1. Under S5, the homogeneity structure of the parameters is not satisfied.

Under all settings, the log survival times are computed from model (2). The censoring times are independently generated from exponential distributions, and the parameters are adjusted so that the censoring rates are around 25%. Note that as (continuous data, linear regression model) is a special case of (censored survival data, AFT model), a separate simulation for continuous data is not conducted, and similar findings are expected.

3.1. Parameter path and operating characteristics

To better appreciate properties of the proposed method, we simulate one replicate with three datasets under setting S1 with n = 500 and d = 50. The optimal values of the parameters (τ, topt) are (0.9, 230), obtained using the grid search based on five-fold cross validation. To examine the effect of the parameters on the estimation and selection results, we plot the estimates of interactions/main effects as a function of τ or topt with the other tuning parameter fixed at its optimal value. To improve presentation, we only show estimates for three sets of effects (for each set, one interaction and two corresponding main effects), including six true positives and three true negatives; the other sets have similar properties.

As can be seen from Figure 1, the model gets denser (with more effects identified) as topt increases (left panels of Figure 1) and tends to be sparser as τ increases (right panels of Figure 1), which is consistent with the theoretical analysis in Section 2.1. There are three different scenarios in left panels of Figure 1. For the set represented by the blue lines, the main effects enter the models first, later followed by the interaction. For the set represented by the red lines, one main effect is “dragged” into the models by the interaction, reinforcing the strong hierarchy. The set represented by the black lines is not associated with the response. The three effects do not enter the models until after a large number of steps. With a properly selected number of steps, they are not selected. More definitive conclusions on the numerical properties of the proposed method are obtained below from simulation and data analysis.

Figure 1: Parameter paths for one simulated replicate. The three rows correspond to three datasets. The two columns correspond to topt (left) and τ (right) with the other tuning parameter fixed at its optimal value. The solid/dashed lines correspond to main effects/interactions. Lines with the same color correspond to the same set of effects (two main effects and one interaction).

3.2. Computational cost

In Algorithm II, only very simple calculations are involved in each step. With fixed tunings, Table A.1 (Appendix) provides the average computational time for the simulated datasets under setting S1 with M = 3, n = 250 and various values of d, suggesting that the proposed method is computationally affordable. For example, for the dataset with d = 100, for which the total number of unknown parameters is 5050, the proposed analysis takes about 3.6 minutes on a laptop with standard configurations.

3.3. Comparison with the alternative methods

Beyond the proposed method (referred to as M1), we also analyze data with the following alternatives: (M2) Each dataset is analyzed separately using the method described in Section 2.1. Analysis results from the three datasets are combined using a meta-analysis approach. (M3) The three datasets are combined and analyzed. This analysis has also been referred to as an “intensity approach” in the literature. (M4) Each dataset is analyzed separately using marginal analysis, which analyzes two predictors and their interaction at a time. The corresponding p values are combined using a meta-analysis approach, which can be realized with the R package meta. Given the combined p values, a false discovery rate (FDR) approach is adopted for multiple comparison adjustment. The three methods M1, M3 and M4 reinforce the homogeneity structure, while M2 assumes the heterogeneity structure. We acknowledge that there are other methods that can be potentially applied to the simulated data. Comparison with M2 and M3, which are also based on the TGDR technique, can directly establish the merit of integrative analysis. M4 has been one of the most popular multi-datasets analysis methods and is a suitable benchmark for comparison.

For evaluating performance of different methods, we comprehensively consider (a) identification accuracy, which is measured using true positive rate (TPR) and false positive rate (FPR) for main effects and interactions separately; (b) estimation accuracy, which is measured using MSEs (mean squared errors) for main effects and interactions separately; and (c) prediction performance, which is measured using PE (prediction error) for independent data generated under the same settings. Note that for the independently generated testing data, there is no censoring and hence PE can be simply calculated.
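The identification and estimation measures can be computed along the following lines. This is a sketch with hypothetical function names; in particular, the source does not spell out how the MSE is normalized, so averaging over the coefficients in the group is an assumption.

```python
def tpr_fpr(estimated, truth):
    """True and false positive rates; a coefficient counts as identified if nonzero."""
    tp = sum(1 for e, t in zip(estimated, truth) if e != 0 and t != 0)
    fp = sum(1 for e, t in zip(estimated, truth) if e != 0 and t == 0)
    pos = sum(1 for t in truth if t != 0)
    neg = len(truth) - pos
    tpr = tp / pos if pos else float("nan")
    fpr = fp / neg if neg else float("nan")
    return tpr, fpr


def mse(estimated, truth):
    """Mean squared error over the coefficients of one group (assumed normalization)."""
    return sum((e - t) ** 2 for e, t in zip(estimated, truth)) / len(truth)
```

Applying these separately to the main-effect block and the interaction block of the coefficient vector gives the "Main" and "Inter" columns of the kind reported in the simulation tables.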

We simulate 200 replicates under each scenario. Summary statistics are presented in Table 1 for S1. The rest of the results are presented in the Appendix. Simulation suggests that, overall, the proposed method has superior or similar performance compared to the three alternatives in terms of identification, estimation, and prediction. For example, in Table 1 with d = 100 and n = 500, for the identification of important interactions, the proposed method (M1) has (TPR, FPR) = (0.978, 0.000), compared to (0.476, 0.001) for M2, (0.968, 0.000) for M3 and (0.538, 0.006) for M4. For the estimation of main effects, the four methods have MSEs 0.016 (M1), 0.202 (M2), 0.058 (M3), and 0.132 (M4), respectively. In the evaluation of prediction performance, the four methods have PEs 2.293 (M1), 11.968 (M2), 6.182 (M3), and 3.254 (M4), respectively. Under settings S1-S4 (Tables A.2–A.4), M2 performs worst among the four methods, as it assumes the heterogeneity structure, which is not consistent with the homogeneity of these settings. Under setting S5 (Table A.5), for which the homogeneity structure is not satisfied, the performance of the three homogeneity-based methods M1, M3 and M4 decays compared to that in Table 1, especially in terms of TPR for interactions. However, the proposed method is still observed to perform better than the alternatives, including M2.

Table 1: Simulation results under S1. In each cell, mean (sd) based on 200 replicates.

d    n    Method  TPR Main     TPR Inter    FPR Main     FPR Inter    MSE Main     MSE Inter    PE
50   250  M1      0.965(0.02)  0.864(0.12)  0.082(0.06)  0.001(0.00)  0.098(0.05)  0.010(0.00)  4.076(1.35)
          M2      0.735(0.16)  0.356(0.10)  0.145(0.06)  0.006(0.00)  0.512(0.15)  0.028(0.01)  12.103(3.60)
          M3      0.970(0.00)  0.844(0.08)  0.114(0.11)  0.005(0.01)  0.139(0.04)  0.011(0.00)  5.747(1.38)
          M4      0.807(0.11)  0.364(0.14)  0.275(0.08)  0.008(0.00)  0.305(0.13)  0.021(0.01)  6.278(1.89)

50   500  M1      0.991(0.04)  0.996(0.02)  0.073(0.08)  0.000(0.00)  0.024(0.02)  0.003(0.00)  1.746(0.57)
          M2      0.981(0.03)  0.725(0.08)  0.306(0.11)  0.008(0.01)  0.196(0.08)  0.012(0.00)  4.052(1.21)
          M3      0.900(0.00)  0.989(0.03)  0.076(0.13)  0.001(0.00)  0.108(0.02)  0.009(0.00)  5.627(0.62)
          M4      0.887(0.10)  0.716(0.15)  0.229(0.09)  0.006(0.00)  0.129(0.09)  0.010(0.00)  5.666(2.17)

100  250  M1      0.955(0.07)  0.738(0.15)  0.059(0.03)  0.000(0.00)  0.087(0.04)  0.003(0.00)  6.884(1.91)
          M2      0.438(0.17)  0.178(0.10)  0.097(0.03)  0.003(0.00)  0.380(0.06)  0.009(0.00)  21.754(4.17)
          M3      0.950(0.03)  0.778(0.10)  0.075(0.03)  0.003(0.00)  0.099(0.04)  0.003(0.00)  7.776(1.65)
          M4      0.855(0.09)  0.400(0.15)  0.118(0.07)  0.005(0.00)  0.204(0.06)  0.005(0.00)  12.480(3.41)

100  500  M1      0.985(0.03)  0.978(0.05)  0.028(0.03)  0.000(0.00)  0.016(0.01)  0.001(0.00)  2.293(0.79)
          M2      0.828(0.11)  0.476(0.09)  0.079(0.04)  0.001(0.00)  0.202(0.05)  0.006(0.00)  11.968(3.07)
          M3      0.990(0.00)  0.968(0.06)  0.028(0.04)  0.000(0.00)  0.058(0.01)  0.002(0.00)  6.182(0.62)
          M4      0.905(0.10)  0.538(0.14)  0.387(0.07)  0.006(0.00)  0.132(0.06)  0.004(0.00)  3.254(1.36)

One characteristic of the proposed method is that it respects the strong hierarchy. To better appreciate this characteristic, we conduct another set of simulations (referred to as S6 in the Appendix). Here data generation is the same as under S1. For the proposed M1, in Step (c.3), we remove “if $\tilde g_{jk} = 1$, then set $g_j = 1$ and $g_k = 1$”. That is, the strong hierarchy is not necessarily satisfied. For M2 and M3, similar modifications are made. Comparing Table A.6 (Appendix) and Table 1 suggests that the analysis that does not respect the strong hierarchy may have worse identification, estimation, and prediction performance.

In a small number of published studies, especially the early ones, it has been suggested that the “main effects, interactions” hierarchy may not hold. Specific practical examples have been provided, for which interactions exist but the corresponding main effects are not important. To be comprehensive, we conduct another set of simulations. Under S7, the data settings are mostly similar to S1 except that the hierarchy is violated for some interactions. As shown in Table A.7 (Appendix), because the proposed method, M2 and M3 reinforce the hierarchy (which does not hold here), their performance is slightly worse than that in Table 1.

4. Data analysis

4.1. Data on a financial early warning system

To illustrate the effectiveness of the proposed approach for real business problems, we analyze data on a financial early warning system, which is usually used for evaluating financial performance, assessing financial risk, and predicting potential bankruptcy [17]. The analyzed dataset is part of the China Stock Market and Accounting Research Database (CSMAR), which is published by the GTA Information Technology Company (http://www.gtarsc.com/). The outcome variable of interest is the price-to-earnings ratio (P/E ratio), which is a continuous variable defined as the market price per share divided by annual earnings per share. We aim to search for important financial indicators and interactions that are associated with the (log-transformed) P/E ratio measured during the period of January 1, 2013 to December 31, 2013. More specifically, we consider 557 stocks from companies of three different industry sectors. Among them, 290 are machinery listed companies (dataset 1), 161 are metals and non-metals listed companies (dataset 2), and 106 are mechanical listed companies (dataset 3). The predictors include 83 financial indicators reported on December 31, 2010, all of which have been extensively examined in published studies (detailed information provided in the Appendix). The time lag between the financial indicators and the response is between 2 and 3 years, which has been suggested in previous research [17, 18]. Note that in this analysis there is a single database. However, with the significant differences across industry sectors, it is reasonable to expect significant differences in data characteristics. It is thus sensible to split the database into three datasets.

We apply the proposed method as well as the three alternative methods. The summary comparison results are provided in Table 2. The detailed estimation results are provided in Table 3. The proposed method identifies 13 main effects and 9 interactions, which overlap with but also differ from those identified by the alternatives. Specifically, M3 identifies 15 main effects (12 overlap with the proposed method) and 9 interactions (5 overlap with the proposed method).

Table 2: Data analysis: numbers of overlapping main effects and interactions. In each cell: dataset 1/dataset 2/dataset 3.

Financial early warning system data

Main effects  M1        M2        M3        M4
M1            13/13/13  9/7/6     12/12/12  4/4/4
M2                      11/10/8   9/10/6    3/2/2
M3                                15/15/15  5/5/5
M4                                          9/9/9

Interactions  M1        M2        M3        M4
M1            9/9/9     4/4/0     5/5/5     0/0/0
M2                      4/6/3     3/1/1     0/0/0
M3                                9/9/9     0/0/0
M4                                          3/3/3

News-APP recommendation behavior data

Main effects  M1        M2        M3        M4
M1            12/12/12  8/10/8    11/11/11  5/5/5
M2                      11/11/12  9/11/10   2/4/3
M3                                13/13/13  4/4/4
M4                                          9/9/9

Interactions  M1        M2        M3        M4
M1            10/10/10  0/7/3     7/7/7     5/5/5
M2                      3/9/6     0/6/2     0/4/2
M3                                7/7/7     3/3/3
M4                                          7/7/7

Table 3:

Analysis of financial early warning system data: estimated coefficients for main effects and interactions.

Dataset 1 Dataset 2 Dataset 3

M1 M2 M3 M4 M1 M2 M3 M4 M1 M2 M3 M4
CTR −0.111 −0.095 0.021 0.219 0.189 0.021 −0.013 0.021
EntfcfPS −0.031 −0.002 −0.022 0.008 −0.048 −0.022 0.033 0.023 −0.022
Equass −0.113 0.010 −0.127
Equtotlia −0.016 −0.016 −0.062 −0.016 −0.016
FAGR 0.091 0.070 0.077 0.105 0.070 0.084 0.070
NAPS −0.163 0.112 0.338
NcfffaPS 0.073 0.037 0.045 0.008 0.045 0.030 0.045
Netprfgrrt −0.063 −0.041 0.014 0.037 0.052 0.014 −0.077 0.039 0.014 0.076
OpeCass 0.171 0.138 −0.149 0.134 0.138 −0.206 0.051 0.017 0.138 0.065
OpeCPSgrrt −0.019 0.064 0.220 0.159 0.064 0.144 0.066 0.064
Opeprfgrrt −0.108 −0.062 −0.132 −0.160 −0.069 −0.132 −0.169 −0.094 −0.132
OwnCon11 −0.151 −0.231 −0.151 −0.151
PCF 0.150 0.109 0.176 −0.018 0.162 0.132 0.176 −0.065 0.170 0.176 −0.104
ROAgrrt 0.079 −0.025 −0.076 −0.043 −0.025 0.051 0.010 −0.025
Salcostrt −0.099 −0.228 −0.137 −0.099 −0.134 −0.099 0.098
SalesevOpeincm −0.133 −0.079 0.227 0.327
ShrhfcfPS 0.182 0.147 0.074 −0.167 0.074 0.166 0.082 0.074
TopecostTOR −0.046 −0.060
UndivprfPS −0.064 −0.030 −0.031 0.244 0.182 −0.031 −0.097 −0.172 −0.031 −0.135
WCNA 0.121 0.131 −0.037 0.132
WCTA −0.098 −0.323 −0.326
CTR×NcfffaPS 0.117 0.100 0.082 0.076 0.082 −0.053 0.082
CTR×OpeCPSgrrt −0.074 0.092 0.047 −0.104
CTR×PCF −0.041 0.117 0.258 −0.003
CTR×ShrhfcfPS 0.083 0.083 0.083
CTR×UndivprfPS 0.078 0.090 0.043 0.036 0.043 0.007 0.043
EntfcfPS×ROAgrrt 0.067 0.124 0.248 −0.098
Equtotlia×Opeprfgrrt −0.066 −0.148 −0.066 −0.066
FAGR×OpeCPSgrrt 0.049 0.072 0.108 0.148 0.072 −0.014 0.072
Netprfgrrt×ShrhfcfPS 0.144 0.213 0.157 0.035 0.157 0.009 0.157
NAPS×PCF 0.130 −0.0467 0.192
OpeCass×Opeprfgrrt −0.050 −0.050 −0.076 −0.050
OpeCass×PCF 0.032 0.042 −0.182
OpeCass×Salcostrt −0.075 0.128 0.0771
OpeCPSgrrt×TopecostTOR 0.010
OpeCPSgrrt×UndivprfPS −0.12 −0.1215 −0.116 −0.121 −0.204 −0.121
Opeprfgrrt×TopecostTOR −0.188
Opeprfgrrt×WCNA 0.064 0.264 −0.020 0.018
OwnCon11×Salcostrt −0.035 −0.121 −0.035 −0.035

With practical data, it is difficult to objectively evaluate identification accuracy. Here we evaluate prediction performance and stability, which may provide indirect support. Specifically, we randomly select 2/3 of the subjects from each dataset to form the training data; the subjects not selected form the testing data. Estimates are generated using the training data and used to make predictions for the testing data. As the outcome variable is continuous, we use the prediction error (PE) for evaluation. To reduce the impact of an extreme split, the above process is repeated 500 times and the average PE is computed. For the four methods, the average PEs are 0.639 (M1), 1.310 (M2), 0.769 (M3), and 2.436 (M4), respectively. In addition, for each main effect/interaction identified using the full data, we compute its probability of being identified in the 500 resamplings, which has been referred to as the “Observed Occurrence Index (OOI)” in the literature, with a larger value indicating a higher degree of stability [19]. The OOI results are shown in Figure A.1 (Appendix). The proposed method has OOIs superior or comparable to those of the alternatives.
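The resampling evaluation described above can be illustrated with a minimal sketch for a single dataset (in the integrative setting, the same 2/3 split would be drawn within each dataset). The selector `toy_fit` is a hypothetical stand-in that keeps the predictors with the largest marginal correlations; it is not the TGDR approach. The PE is taken here to be the mean squared prediction error on the testing data, which is an assumption, and all function names are illustrative.

```python
import numpy as np

def split_two_thirds(n, rng):
    """Randomly assign 2/3 of the n subjects to training and the rest to testing."""
    perm = rng.permutation(n)
    n_train = (2 * n) // 3
    return perm[:n_train], perm[n_train:]

def toy_fit(X, y, k=2):
    """Hypothetical stand-in selector (NOT TGDR): keep the k predictors with the
    largest absolute marginal correlation and fit least squares on them."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    score = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    keep = np.argsort(score)[-k:]
    beta = np.zeros(X.shape[1])
    beta[keep] = np.linalg.lstsq(X[:, keep], y, rcond=None)[0]
    return beta

def resample_pe_and_ooi(X, y, fit, n_rep=500, seed=0):
    """Average PE over random splits, and the OOI of each effect selected on full data."""
    rng = np.random.default_rng(seed)
    full_selected = np.flatnonzero(fit(X, y))    # effects identified using the full data
    counts = np.zeros(X.shape[1])
    pes = []
    for _ in range(n_rep):
        tr, te = split_two_thirds(len(y), rng)
        beta = fit(X[tr], y[tr])                 # estimate on the training data
        pes.append(np.mean((y[te] - X[te] @ beta) ** 2))  # assumed PE: test mean squared error
        counts += beta != 0                      # record which effects were selected
    ooi = {int(j): counts[j] / n_rep for j in full_selected}
    return float(np.mean(pes)), ooi

# Simulated toy data: predictors 0 and 3 carry signal, the rest are noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 6))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.5 * rng.normal(size=150)
avg_pe, ooi = resample_pe_and_ooi(X, y, toy_fit, n_rep=50)
```

An OOI close to one indicates that the effect is selected in nearly every resampling, which is the sense of stability used in the comparison.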

4.2. Data on news-APP recommendation behaviors

We analyze a news-APP (application) recommendation dataset collected by a commercial market-research firm in 2014. The study focused on the recommendation behaviors of 882 customers in China, including 410, 181, and 291 customers who used the news-APPs “Baidu” (dataset 1), “Tencent” (dataset 2), and “Sina” (dataset 3), respectively. It has been suggested that people using the three major tools have significantly different characteristics. It is thus sensible to treat the collected data as three separate datasets and conduct integrative analysis. The outcome variable of interest is the time at which the customer recommended the APP to others, which is subject to right censoring. Specifically, the survival time T = t if a customer recommended the APP to others at time t. The duration of the study is six weeks. Thus, T is right censored if the customer had not recommended the APP after six weeks of use. There are 341, 155, and 181 uncensored customers during follow-up in datasets 1, 2, and 3, respectively. A total of 78 predictors are analyzed, all measured on a Likert-type scale (detailed information is provided in the Appendix).

The summary analysis results using the four methods are provided in Table 2. The detailed estimation results are provided in Table 4. The proposed method identifies 12 main effects and 10 interactions. Method M3 identifies 13 main effects (11 of which overlap with the proposed method) and 10 interactions (7 of which overlap with the proposed method). The identifications made by methods M2 and M4 differ more from those of the proposed method. We conduct the same prediction and stability evaluation as for the financial early warning system data. As the outcome variable is a right-censored time-to-event, we use the logrank statistic to measure prediction performance, where a larger value indicates better prediction. The mean prediction logrank statistics are 7.832 (M1), 4.261 (M2), 6.892 (M3), and 7.366 (M4), respectively. The OOI results in Figure A.1 (Appendix) again suggest that the proposed method has a higher degree of stability. The improved prediction and stability results suggest the superiority of the proposed method.
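As an illustration of a logrank-based prediction measure, the following sketch computes the standard two-group logrank chi-squared statistic. How the testing subjects are divided into two groups is not specified here; the helper `risk_groups`, which dichotomizes the predicted values at their median, is an assumption made for illustration, and both function names are hypothetical.

```python
import numpy as np

def logrank_statistic(time, event, group):
    """Two-group logrank chi-squared statistic (one degree of freedom).
    time: observed times; event: 1 if uncensored; group: boolean group labels."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    group = np.asarray(group, dtype=bool)
    obs_minus_exp, var = 0.0, 0.0
    for t in np.unique(time[event]):             # loop over distinct event times
        at_risk = time >= t
        n = at_risk.sum()                        # subjects at risk
        n1 = (at_risk & group).sum()             # at risk in group 1
        dead = (time == t) & event
        d = dead.sum()                           # events at time t
        d1 = (dead & group).sum()                # events at time t in group 1
        obs_minus_exp += d1 - d * n1 / n
        if n > 1:                                # hypergeometric variance term
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if var == 0:                                 # e.g. all subjects in one group
        return 0.0
    return float(obs_minus_exp ** 2 / var)

def risk_groups(pred):
    """Assumed grouping rule: split subjects at the median predicted value."""
    pred = np.asarray(pred, dtype=float)
    return pred > np.median(pred)

# Tiny worked example: four uncensored subjects, two per group.
time = [1, 2, 3, 4]
event = [1, 1, 1, 1]
pred = [0.1, 0.2, 0.8, 0.9]                      # hypothetical predicted values
stat = logrank_statistic(time, event, risk_groups(pred))  # equals 49/17
```

In the toy example the two groups are perfectly separated in time, and the statistic works out to 49/17 by hand; a larger statistic indicates that the predicted groups differ more clearly in their observed event times.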

Table 4:

Analysis of news-APP recommendation behavior data: estimated coefficients for main effects and interactions.

Dataset 1 Dataset 2 Dataset 3

M1 M2 M3 M4 M1 M2 M3 M4 M1 M2 M3 M4
D19_2 0.001 −0.067 0.002 −0.110 −0.039 −0.067 −0.255 −0.126 0.004 −0.067 −0.298
D19_4 0.056 −0.006 −0.043 0.031 −0.069 −0.031 −0.043 0.043 −0.132 −0.002 −0.043 −0.074
D19_3 −0.029 −0.290 −0.132
D19_5 0.069 0.023 0.070 0.092 0.051 0.070 0.002 0.036 0.070
D19_7 0.034 0.214 0.070
D19_15 0.075 0.262 0.033 0.105 −0.148 0.488
D19_18 −0.020 −0.049 0.013 −0.049 −0.029 −0.049
D19_21 0.059 0.011 −0.029 −0.118 −0.052 −0.029 −0.148 −0.029
D19_24 0.069 0.042 −0.110 −0.243 −0.072 −0.110 −0.070 0.010 −0.110
D20_2 0.012 −0.047 −0.230 −0.061 −0.020 −0.047 −0.064 −0.070 −0.011 −0.047 −0.108
D20_4 0.097 0.048 −0.020 −0.126 −0.056 −0.020 −0.066 −0.007 −0.020
D20_7 0.085 0.079 0.082
D20_11 0.026
D20_18 −0.004 0.050 0.212 0.075 0.050 −0.063 −0.002 0.050
D20_22 0.050 0.023 0.030 0.054 0.048 0.030 −0.005 −0.002 0.030
D20_27 0.060 0.018 0.047 0.097 0.047 −0.003 0.047
D20_31 0.068 0.019 −0.082 0.042 −0.187 0.002 −0.082 −0.417 −0.076 −0.082 −0.137
D21_3 −0.011 −0.031
D21_5 0.094 0.098 −0.455
D21_13 0.028 0.017 0.017 0.025 0.017
D19_2×D19_4 −0.001 −0.002 −0.019 −0.002 −0.016 −0.002 −0.064 −0.002 −0.009 −0.002 −0.004
D19_2×D19_24 0.001 −0.001 −0.019 −0.001 −0.007
D19_2×D20_2 −0.001 −0.149 −0.001 −0.004 −0.028 −0.001 −0.012 −0.076
D19_2×D20_4 −0.019 −0.007
D19_2×D20_22 −0.007
D19_2×D20_31 −0.003 −0.005 −0.053 −0.007 −0.033 −0.005 −0.042 −0.006 −0.005 −0.035
D19_4×D19_18 −0.001 −0.001 −0.001
D19_4×D19_21 −0.007 −0.009 −0.008 −0.009 −0.009 −0.009
D19_4×D20_2 −0.002 −0.003 −0.003 −0.010 −0.003 −0.003 −0.003
D19_4×D20_22 0.001 0.001 0.002 0.001
D19_4×D20_31 −0.010 −0.013 −0.058 −0.014 −0.017 −0.013 −0.136 −0.012 −0.013 −0.103
D19_7×D19_15 0.002 −0.171 −0.138
D19_15×D20_31 0.001 −0.032 −0.001 −0.003 −0.002 −0.067
D19_21×D20_31 −0.004 −0.007 −0.008 −0.007 −0.007 −0.007
D19_24×D20_27 0.006
D19_24×D21_3 −0.001
D20_2×D20_31 −0.001 −0.017 −0.001 −0.001
D20_4×D20_31 −0.001 −0.003 −0.004 −0.024 −0.003 −0.002 −0.003
D20_27×D21_3 −0.009
D20_31×D21_5 −0.029 −0.011 −0.145

5. Discussion

For the analysis of high-dimensional data in genetics and other scientific fields, integrative analysis has established its effectiveness in pooling multiple independent datasets, increasing sample size, and improving analysis results. In this study, we significantly extend integrative analysis to interaction analysis and to the analysis of business and industry data. The proposed method is based on the TGDR technique, which has been extensively applied to the analysis of main effects but not interactions. The TGDR technique is modified to respect the “main effects, interactions” strong hierarchy. The proposed method has an intuitive formulation. In simulation, it outperforms the direct competitors. In the analysis of financial early warning system data and news-APP recommendation behavior data, it leads to findings different from those of the alternatives. The improved prediction and stability support its validity. Overall, this study provides a practically useful new avenue for studying interactions under the integrative analysis paradigm.

This study can potentially be extended in multiple directions. Interaction analysis has an important role in the study of complex business and industry problems. It is of interest to conduct other types of interaction analysis, or to use other techniques, under the integrative analysis paradigm. In our study, we consider continuous data under the linear model and censored survival data under the AFT model. The proposed method can be directly applied with other goodness-of-fit measures, in particular log-likelihood functions. The strong hierarchy is assumed. Extension to accommodate the weak hierarchy can be easily obtained by modifying step (c.3) of Algorithm II. With multiple datasets, the homogeneity model is assumed. Extension to the heterogeneity model demands making the thresholding indicators data-specific. In data analysis, the evaluation of prediction and stability provides support to the identification results. More evaluations, especially scientific evaluations, are needed to confirm the findings.
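The difference between the strong and weak hierarchies can be made concrete with a small check on a selected model. Under the strong hierarchy, an interaction may be selected only if both of its main effects are selected; under the weak hierarchy, at least one suffices. The sketch below is not Algorithm II; the function name is hypothetical, and the example selection is invented for illustration using variable names from Table A.8.

```python
def respects_hierarchy(main_effects, interactions, strong=True):
    """Check a selected model against the 'main effects, interactions' hierarchy.
    strong=True : both parent main effects of every interaction must be selected.
    strong=False: at least one parent main effect must be selected (weak hierarchy)."""
    main = set(main_effects)
    needed = 2 if strong else 1
    for a, b in interactions:
        parents_selected = (a in main) + (b in main)
        if parents_selected < needed:
            return False
    return True

# Hypothetical selection, for illustration only.
main = {"CTR", "PCF"}
inter = [("CTR", "PCF"), ("CTR", "NAPS")]        # NAPS is not among the main effects

strong_ok = respects_hierarchy(main, inter, strong=True)   # False: NAPS is missing
weak_ok = respects_hierarchy(main, inter, strong=False)    # True: CTR is present
```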

Acknowledgements

We thank the organizers and participants of the International Workshop on Perspectives on High-Dimensional Data Analysis (HDDA-VII). This study has been supported by the National Natural Science Foundation of China (71771211, 11401013, 91546202), the MOE Project of Key Research Institute of Humanities and Social Sciences at Universities (16JJD910002), and the NIH (CA204120 and CA191383).

Appendix

A.1. Additional numerical results of the simulations

Table A.1:

Computational time of the proposed method for setting S1 with M = 3 and n = 250.

d Number of the total unknown parameters Computational time (minutes)
50 1275 0.374
70 2485 0.965
80 3240 1.422
100 5050 3.588
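The parameter counts in Table A.1 agree with d main effects plus all d(d − 1)/2 pairwise interactions, that is, d(d + 1)/2 in total. A short check of this arithmetic (the function name is illustrative):

```python
def n_parameters(d):
    """d main effects plus all d*(d-1)/2 pairwise interactions."""
    return d + d * (d - 1) // 2

counts = {d: n_parameters(d) for d in (50, 70, 80, 100)}
# counts == {50: 1275, 70: 2485, 80: 3240, 100: 5050}
```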

Table A.2:

Simulation results under S2. In each cell, mean (sd) based on 200 replicates.

d n Method TPR (Main) TPR (Inter) FPR (Main) FPR (Inter) MSE (Main) MSE (Inter) PE
50 250 M1 0.988(0.03) 0.887(0.14) 0.063(0.06) 0.000(0.00) 0.066(0.03) 0.007(0.00) 3.259(1.18)
M2 0.787(0.11) 0.408(0.11) 0.124(0.05) 0.006(0.00) 0.383(0.10) 0.020(0.00) 9.198(2.39)
M3 1.000(0.00) 0.870(0.05) 0.126(0.14) 0.001(0.00) 0.070(0.02) 0.007(0.00) 3.853(1.05)
M4 0.820(0.11) 0.415(0.13) 0.270(0.10) 0.008(0.00) 0.223(0.10) 0.019(0.01) 4.493(1.27)

50 500 M1 1.000(0.00) 0.990(0.03) 0.106(0.09) 0.000(0.00) 0.020(0.01) 0.003(0.00) 1.672(0.30)
M2 0.973(0.03) 0.782(0.09) 0.275(0.08) 0.006(0.00) 0.103(0.04) 0.008(0.00) 2.839(0.68)
M3 1.000(0.00) 0.995(0.02) 0.275(0.23) 0.000(0.00) 0.054(0.01) 0.005(0.00) 3.462(0.43)
M4 0.902(0.10) 0.578(0.14) 0.257(0.10) 0.007(0.00) 0.095(0.06) 0.011(0.00) 3.825(1.26)

100 250 M1 0.987(0.05) 0.855(0.18) 0.062(0.07) 0.000(0.00) 0.061(0.03) 0.003(0.00) 5.586(1.67)
M2 0.437(0.20) 0.169(0.10) 0.083(0.07) 0.003(0.00) 0.300(0.05) 0.006(0.00) 17.499(4.08)
M3 1.000(0.00) 0.833(0.10) 0.157(0.23) 0.001(0.00) 0.046(0.02) 0.002(0.00) 5.848(1.04)
M4 0.848(0.08) 0.439(0.17) 0.461(0.14) 0.005(0.01) 0.487(0.09) 0.005(0.01) 7.520(1.73)

100 500 M1 1.000(0.00) 0.975(0.05) 0.072(0.09) 0.000(0.00) 0.012(0.00) 0.001(0.00) 2.139(0.51)
M2 0.864(0.08) 0.497(0.10) 0.160(0.11) 0.004(0.00) 0.146(0.04) 0.004(0.00) 8.292(1.77)
M3 1.000(0.00) 0.953(0.03) 0.214(0.33) 0.000(0.00) 0.029(0.01) 0.002(0.00) 3.908(0.38)
M4 0.927(0.07) 0.545(0.16) 0.444(0.16) 0.005(0.01) 0.109(0.06) 0.004(0.00) 4.315(0.90)
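For reference, the identification measures reported in these tables can be computed from the true and estimated coefficient vectors as sketched below. The sketch uses the standard definitions of the true positive rate (TPR) and false positive rate (FPR); the function name and the toy coefficient vectors are illustrative only.

```python
import numpy as np

def tpr_fpr(beta_true, beta_hat):
    """TPR: fraction of truly nonzero effects that are selected.
    FPR: fraction of truly zero effects that are selected."""
    true_nz = np.asarray(beta_true) != 0
    est_nz = np.asarray(beta_hat) != 0
    tpr = np.sum(true_nz & est_nz) / true_nz.sum()
    fpr = np.sum(~true_nz & est_nz) / (~true_nz).sum()
    return float(tpr), float(fpr)

# Toy example: two true effects, one of which is found, plus one false selection.
beta_true = np.array([1.0, 0.5, 0.0, 0.0, 0.0, 0.0])
beta_hat = np.array([0.9, 0.0, 0.2, 0.0, 0.0, 0.0])
tpr, fpr = tpr_fpr(beta_true, beta_hat)          # 0.5 and 0.25
```

In the tables, these quantities are reported separately for main effects and interactions, which corresponds to applying the function to each block of coefficients.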

Table A.3:

Simulation results under S3. In each cell, mean (sd) based on 200 replicates.

d n Method TPR (Main) TPR (Inter) FPR (Main) FPR (Inter) MSE (Main) MSE (Inter) PE
50 250 M1 0.978(0.05) 0.882(0.09) 0.073(0.08) 0.000(0.00) 0.075(0.04) 0.009(0.00) 3.436(1.13)
M2 0.779(0.11) 0.401(0.11) 0.162(0.07) 0.007(0.00) 0.467(0.14) 0.027(0.00) 11.588(3.56)
M3 0.953(0.06) 0.667(0.10) 0.054(0.07) 0.002(0.00) 0.333(0.06) 0.020(0.00) 12.669(1.63)
M4 0.800(0.11) 0.370(0.14) 0.268(0.09) 0.008(0.00) 0.324(0.13) 0.029(0.01) 6.539(2.42)

50 500 M1 1.000(0.00) 0.990(0.04) 0.077(0.07) 0.000(0.00) 0.018(0.01) 0.003(0.00) 1.701(0.60)
M2 0.989(0.02) 0.750(0.07) 0.354(0.12) 0.009(0.01) 0.166(0.07) 0.010(0.00) 3.379(1.01)
M3 0.973(0.05) 0.733(0.07) 0.092(0.05) 0.001(0.00) 0.277(0.04) 0.017(0.00) 11.598(1.23)
M4 0.912(0.08) 0.537(0.16) 0.257(0.09) 0.007(0.00) 0.124(0.08) 0.017(0.01) 5.291(2.14)

100 250 M1 0.939(0.06) 0.725(0.16) 0.014(0.01) 0.000(0.00) 0.100(0.06) 0.004(0.00) 7.785(2.97)
M2 0.357(0.18) 0.133(0.08) 0.031(0.03) 0.001(0.00) 0.402(0.04) 0.009(0.00) 24.232(3.96)
M3 0.908(0.13) 0.561(0.14) 0.022(0.03) 0.000(0.00) 0.217(0.06) 0.006(0.00) 15.782(2.90)
M4 0.867(0.10) 0.411(0.13) 0.201(0.05) 0.006(0.00) 0.735(0.49) 0.018(0.01) 21.707(3.85)

100 500 M1 0.995(0.02) 0.973(0.06) 0.019(0.02) 0.000(0.00) 0.016(0.01) 0.001(0.00) 2.381(0.94)
M2 0.824(0.09) 0.474(0.09) 0.070(0.04) 0.001(0.00) 0.192(0.07) 0.006(0.00) 11.239(3.08)
M3 0.970(0.05) 0.703(0.08) 0.014(0.01) 0.000(0.00) 0.162(0.03) 0.005(0.00) 11.990(1.24)
M4 0.905(0.08) 0.532(0.11) 0.198(0.08) 0.006(0.00) 0.141(0.05) 0.006(0.00) 5.236(1.36)

Table A.4:

Simulation results under S4. In each cell, mean (sd) based on 200 replicates.

d n Method TPR (Main) TPR (Inter) FPR (Main) FPR (Inter) MSE (Main) MSE (Inter) PE
50 250 M1 0.990(0.03) 0.988(0.03) 0.171(0.12) 0.001(0.00) 0.052(0.03) 0.004(0.00) 1.771(0.63)
M2 0.972(0.03) 0.797(0.08) 0.335(0.08) 0.008(0.01) 0.168(0.05) 0.012(0.00) 3.197(1.29)
M3 0.985(0.02) 0.965(0.05) 0.171(0.10) 0.002(0.00) 0.113(0.02) 0.010(0.00) 8.993(1.75)
M4 0.945(0.07) 0.833(0.10) 0.323(0.12) 0.017(0.01) 0.214(0.13) 0.021(0.01) 2.719(0.95)

50 500 M1 0.995(0.02) 1.000(0.00) 0.125(0.10) 0.000(0.00) 0.029(0.02) 0.002(0.00) 1.370(0.30)
M2 0.997(0.01) 0.964(0.03) 0.498(0.10) 0.010(0.01) 0.078(0.02) 0.004(0.00) 2.167(0.38)
M3 1.000(0.00) 0.988(0.03) 0.092(0.13) 0.001(0.00) 0.094(0.01) 0.009(0.00) 6.948(1.15)
M4 0.970(0.05) 0.892(0.09) 0.302(0.08) 0.017(0.01) 0.066(0.04) 0.010(0.01) 1.793(0.96)

100 250 M1 0.991(0.03) 0.966(0.06) 0.145(0.08) 0.000(0.00) 0.035(0.01) 0.001(0.00) 2.704(1.33)
M2 0.898(0.09) 0.587(0.13) 0.132(0.07) 0.002(0.00) 0.143(0.06) 0.005(0.00) 9.211(5.42)
M3 0.991(0.00) 0.943(0.07) 0.183(0.10) 0.001(0.00) 0.063(0.02) 0.003(0.00) 9.575(2.77)
M4 0.950(0.07) 0.758(0.14) 0.385(0.06) 0.007(0.00) 2.594(4.77) 0.157(0.28) 4.973(4.66)

100 500 M1 0.996(0.02) 0.997(0.02) 0.140(0.07) 0.000(0.00) 0.020(0.01) 0.001(0.00) 1.741(0.50)
M2 0.986(0.02) 0.880(0.06) 0.282(0.09) 0.001(0.00) 0.069(0.02) 0.002(0.00) 2.502(1.19)
M3 1.000(0.00) 0.978(0.04) 0.136(0.06) 0.000(0.00) 0.047(0.01) 0.002(0.00) 7.534(1.24)
M4 0.986(0.04) 0.883(0.11) 0.404(0.08) 0.009(0.00) 0.120(0.12) 0.005(0.00) 2.156(0.94)

Table A.5:

Simulation results under S5. In each cell, mean (sd) based on 200 replicates.

d n Method TPR (Main) TPR (Inter) FPR (Main) FPR (Inter) MSE (Main) MSE (Inter) PE
50 250 M1 0.825(0.08) 0.523(0.15) 0.112(0.06) 0.003(0.00) 0.251(0.09) 0.024(0.00) 9.902(2.23)
M2 0.789(0.12) 0.407(0.11) 0.160(0.08) 0.006(0.00) 0.510(0.11) 0.027(0.00) 11.634(3.77)
M3 0.816(0.09) 0.521(0.17) 0.117(0.06) 0.004(0.00) 0.366(0.08) 0.027(0.00) 15.306(2.79)
M4 0.713(0.12) 0.269(0.11) 0.262(0.09) 0.007(0.00) 0.381(0.14) 0.034(0.01) 9.479(2.81)

50 500 M1 0.884(0.07) 0.714(0.09) 0.207(0.11) 0.002(0.00) 0.099(0.04) 0.015(0.00) 6.307(1.14)
M2 0.887(0.02) 0.660(0.07) 0.327(0.11) 0.010(0.01) 0.179(0.08) 0.011(0.00) 6.631(1.14)
M3 0.892(0.06) 0.688(0.11) 0.171(0.04) 0.003(0.00) 0.258(0.06) 0.023(0.00) 12.882(1.57)
M4 0.761(0.09) 0.343(0.10) 0.241(0.07) 0.007(0.00) 0.220(0.07) 0.028(0.00) 9.023(1.89)

100 250 M1 0.705(0.14) 0.371(0.14) 0.028(0.02) 0.000(0.00) 0.204(0.07) 0.008(0.00) 16.356(3.34)
M2 0.393(0.20) 0.140(0.09) 0.033(0.02) 0.001(0.00) 0.386(0.06) 0.009(0.00) 22.753(4.86)
M3 0.705(0.14) 0.358(0.16) 0.032(0.04) 0.001(0.00) 0.235(0.08) 0.008(0.00) 18.425(3.26)
M4 0.675(0.09) 0.230(0.10) 0.412(0.07) 0.006(0.00) 1.026(0.74) 0.025(0.01) 22.424(4.26)

100 500 M1 0.863(0.06) 0.570(0.14) 0.057(0.05) 0.000(0.00) 0.069(0.02) 0.005(0.00) 9.070(1.67)
M2 0.860(0.11) 0.477(0.11) 0.085(0.05) 0.001(0.00) 0.197(0.05) 0.006(0.00) 11.122(3.08)
M3 0.864(0.06) 0.577(0.12) 0.058(0.02) 0.001(0.00) 0.158(0.04) 0.006(0.00) 14.746(1.73)
M4 0.826(0.09) 0.328(0.10) 0.406(0.07) 0.006(0.00) 0.215(0.07) 0.010(0.00) 10.011(1.36)

Table A.6:

Simulation results under S6. In each cell, mean (sd) based on 200 replicates.

d n Method TPR (Main) TPR (Inter) FPR (Main) FPR (Inter) MSE (Main) MSE (Inter) PE
50 250 M1 0.953(0.04) 0.845(0.12) 0.081(0.09) 0.005(0.00) 0.115(0.05) 0.009(0.00) 3.829(1.54)
M2 0.404(0.17) 0.260(0.09) 0.096(0.05) 0.008(0.00) 0.688(0.13) 0.029(0.00) 16.593(5.30)
M3 0.950(0.07) 0.908(0.08) 0.116(0.14) 0.011(0.01) 0.175(0.08) 0.011(0.00) 5.735(1.47)
M4 0.375(0.15) 0.445(0.14) 0.036(0.01) 0.010(0.00) 0.490(0.16) 0.017(0.00) 9.804(2.44)

50 500 M1 0.970(0.00) 0.963(0.03) 0.081(0.06) 0.004(0.00) 0.024(0.01) 0.002(0.00) 1.549(0.35)
M2 0.843(0.10) 0.679(0.10) 0.239(0.09) 0.016(0.01) 0.289(0.14) 0.012(0.00) 5.263(3.20)
M3 0.965(0.02) 0.965(0.02) 0.069(0.10) 0.011(0.02) 0.148(0.07) 0.008(0.00) 5.384(1.00)
M4 0.680(0.13) 0.683(0.11) 0.034(0.01) 0.009(0.00) 0.262(0.10) 0.010(0.01) 8.601(1.87)

100 250 M1 0.903(0.13) 0.715(0.16) 0.042(0.02) 0.002(0.00) 0.115(0.06) 0.003(0.00) 7.498(2.90)
M2 0.383(0.13) 0.115(0.08) 0.047(0.03) 0.002(0.00) 0.432(0.04) 0.009(0.00) 26.146(4.55)
M3 0.920(0.15) 0.753(0.10) 0.046(0.03) 0.003(0.00) 0.127(0.07) 0.004(0.00) 8.139(2.89)
M4 0.430(0.15) 0.393(0.15) 0.036(0.01) 0.007(0.00) 0.253(0.09) 0.009(0.00) 16.123(1.86)

100 500 M1 0.998(0.02) 0.973(0.04) 0.028(0.04) 0.001(0.00) 0.016(0.01) 0.001(0.00) 2.051(0.62)
M2 0.671(0.15) 0.454(0.04) 0.058(0.11) 0.003(0.00) 0.250(0.07) 0.006(0.00) 12.416(3.75)
M3 0.998(0.00) 0.980(0.08) 0.026(0.03) 0.002(0.00) 0.072(0.03) 0.002(0.00) 6.029(1.05)
M4 0.610(0.14) 0.508(0.01) 0.005(0.14) 0.007(0.00) 0.150(0.07) 0.004(0.00) 8.730(1.89)

Table A.7:

Simulation results under S7. In each cell, mean (sd) based on 200 replicates.

d n Method TPR (Main) TPR (Inter) FPR (Main) FPR (Inter) MSE (Main) MSE (Inter) PE
50 250 M1 0.970(0.05) 0.830(0.15) 0.080(0.06) 0.001(0.00) 0.110(0.06) 0.012(0.01) 4.488(1.66)
M2 0.682(0.16) 0.344(0.11) 0.148(0.08) 0.006(0.00) 0.539(0.14) 0.029(0.00) 12.851(3.94)
M3 1.000(0.00) 0.828(0.09) 0.104(0.09) 0.004(0.00) 0.141(0.04) 0.012(0.00) 5.823(1.13)
M4 0.752(0.11) 0.415(0.15) 0.286(0.08) 0.007(0.00) 0.320(0.12) 0.028(0.01) 5.987(2.31)

50 500 M1 1.000(0.00) 0.965(0.06) 0.119(0.10) 0.000(0.00) 0.021(0.01) 0.004(0.00) 1.846(0.61)
M2 0.978(0.03) 0.711(0.08) 0.362(0.11) 0.011(0.01) 0.205(0.10) 0.012(0.00) 3.904(1.33)
M3 1.000(0.00) 0.983(0.04) 0.119(0.15) 0.003(0.01) 0.102(0.02) 0.009(0.00) 5.223(0.85)
M4 0.848(0.11) 0.532(0.13) 0.268(0.10) 0.007(0.00) 0.151(0.08) 0.019(0.01) 5.650(2.06)

100 250 M1 0.933(0.10) 0.753(0.17) 0.027(0.02) 0.000(0.00) 0.111(0.06) 0.005(0.00) 8.478(2.36)
M2 0.317(0.14) 0.127(0.07) 0.029(0.02) 0.001(0.00) 0.393(0.05) 0.009(0.00) 23.482(4.99)
M3 0.973(0.07) 0.820(0.14) 0.043(0.05) 0.001(0.00) 0.120(0.07) 0.004(0.00) 10.880(2.98)
M4 0.800(0.14) 0.463(0.15) 0.042(0.07) 0.007(0.00) 0.873(0.51) 0.611(0.08) 13.808(7.61)

100 500 M1 0.970(0.05) 0.710(0.07) 0.017(0.03) 0.000(0.00) 0.018(0.01) 0.002(0.00) 2.730(0.88)
M2 0.410(0.16) 0.180(0.08) 0.036(0.02) 0.001(0.00) 0.209(0.06) 0.006(0.00) 11.894(3.07)
M3 0.940(0.07) 0.600(0.12) 0.007(0.01) 0.000(0.00) 0.059(0.02) 0.002(0.00) 5.927(1.14)
M4 0.830(0.08) 0.390(0.13) 0.041(0.07) 0.006(0.00) 0.148(0.08) 0.006(0.00) 9.819(1.31)

A.2. Definitions of variables used in data analysis

Table A.8:

Financial early warning system data: definitions of variables

Definition Variable
Earnings per share BasicEPS
Net asset value per share NAPS
Main income per share MincmPS
Operating profit per share OpeprfPS
Earnings before interest and tax per share EBITPS
Surplus reserve fund per share SurrefdPS
Undistributed profit per share UndivprfPS
Operating cash flow per share OpeCFPS
Cash flow per share CFPS
Enterprise free cash flow per share EntfcfPS
Shareholder free cash flow per share ShrhfcfPS
Average return on equity AvgROE
Return on assets earnings before interest and tax ROAEBIT
Return on assets ROA
Rate of return on invested capital ROIC
Net profit ratio Netprfrt
Gross income ratio Gincmrt
Sales cost ratio Salcostrt
Net profit to total operation income NprTOR
Operating profit to total operation income OpeprTOR
Earnings before interest and tax to total operation income EBITTOR
Total operating costs to total operation income TopecostTOR
Net profits Netprf
Earnings before interest and tax EBIT
Operating profit ratio Opeprfrt
Current ratio Currt
Quick ratio Qckrt
Equity to total liability Equtotlia
Net operating cash flow to total liabilities NOCFtotlia
Net operating cash flow to current liabilities NOCFtotcurlia
Operating cash flow liabilities ratio OpeCcurdb
Earnings per share growth rate EPSgrrt
Operating income growth rate Opeincmgrrt
Operating profit growth rate Opeprfgrrt
Net profit growth rate Netprfgrrt
Operating cash flow per share growth rate OpeCPSgrrt
Return on assets growth rate ROAgrrt
Net asset value growth rate Netassgrrt
Net asset value per share growth rate NAPSgrrt
Inventory turnover ratio Invtrtrrat
Account receivable turnover ratio ARTrat
Account payable turnover ratio AccrPayrat
Current assets turnover ratio Currat
Fixed assets turnover ratio Fixassrat
Equity turnover ratio Equrat
Total assets turnover ratio Totassrat
Sales and service cash to operating income SalesevOpeincm
Cash rate of sales Casrtsale
Capital expenditure to depreciation and amortization CapexpDM
Sales and service render cash Salesevcash
Operation cash into asset OpeCass
Debt to asset ratio Dbastrt
Current asset to total asset Curtotast
Noncurrent asset to total asset Noncurtotast
Fixed assets ratio Fixassrt
Current liability to total liability Curtotlia
Equity to asset Equass
Long asset to fit asset Lassfitass
Price to book value ratio PB
Price cash flow ratio PCF
Price sales ratio PS
Ownership concentration 1 OwnCon1
Ownership concentration 5 OwnCon5
Ownership concentration 10 OwnCon10
Ownership concentration 11 OwnCon11
H1 index H1Index
H5 index H5Index
H10 index H10Index
Z index ZIndex
Net cash flow from investing activities per share NcffiaPS
Net cash flow from financing activities per share NcfffaPS
Net cash flow per share NcfPS
Fixed asset growth rate FAGR
Equity to fixed asset EFA
Current debt ratio CDR
Debt to equity market ratio DEMR
Capital turnover ratio CTR
Long-term asset turnover ratio LATR
Net profit margin of current assets NPMCA
Net profit margin of fixed assets NPMFA
Working capital ratio WCR
Working capital to total assets WCTA
Working capital to net assets WCNA

Table A.9:

News-APP recommendation behavior data: definitions of variables.

Definition Variable
The news are true and accurate without false information. D19_1
The news are objective without personal bias. D19_2
The news present the scene where the events occurred with the pictures and videos. D19_3
The analysis of the news is thorough, leading to the nature of the event. D19_4
The news present multiple points on the events. D19_5
There are well-known experts’ columns or comments. D19_6
The news reflect the professionalism of editors. D19_7
The hot topics cover comprehensive news. D19_8
There are news about history, humanities, geography, arts and others, which can broaden my horizons. D19_9
The news include the global information. D19_10
The news include rich live events related to my daily life. D19_11
The news include discount information about living and consuming, which bring great benefits to my life. D19_12
The news include all kinds of useful information in life. D19_13
The news are reported in time when an important event happens. D19_14
The updates of news are in time with fresh information. D19_15
There are live news and sports. D19_16
The overall style of the news is acceptable and desirable. D19_17
The news have different kinds of styles. D19_18
There are some desirable special columns. D19_19
The news reflect the standpoints of the editors. D19_20
The APP provides a sense of participation. D19_21
The interactive contents are readable with good quality. D19_22
The APP lets me interact with a group of people who have the same hobbies as me. D19_23
I can find someone with the same likes and dislikes when I participate in the APP. D19_24
There is plenty of news that I am interested in. D19_25
It is very convenient to read the news that I am interested in. D19_26
The APP can present the news that I am interested in positively. D19_27
The APP gives prominence to the key points. D20_1
The Images and text layout match properly. D20_2
The color assortment is reasonable and desirable. D20_3
The interface design is novel. D20_4
The interface and logo design can reflect the characteristics of the APP. D20_5
The interface and logo design can indicate that the APP is news-APP clearly. D20_6
The video is vivid with high quality. D20_7
The video works fine. D20_8
It is very convenient to find and play the video. D20_9
The pictures are clear with high quality. D20_10
It is very convenient to find and read the pictures. D20_11
The presentations of the pictures are novel and beautiful. D20_12
The types of multimedia are rich, with a wide selection. D20_13
The APP does not consume too much cell phone traffic. D20_14
The APP runs with less memory and does not affect the speed of the phone. D20_15
The APP automatically cleans up the cache content and does not take up the phone storage resources. D20_16
The operations of APP are smooth and do not crash and flare. D20_17
It is convenient to view the comments. D20_18
It is convenient to submit the comments. D20_19
The APP can be shared to many social platforms. D20_20
It is easy to find the function set menu. D20_21
The set of function keys is reasonable and does not waste the limited interface. D20_22
The function design is in line with conventional operating habits. D20_23
Content sections can be set freely. D20_24
The function and design can be set in accordance with my preferences and needs. D20_25
There are many types of subscription. D20_26
Each type of subscription includes many sources. D20_27
The subscription process is simple with only a few steps. D20_28
The news can be subscribed based on keywords. D20_29
It is easy to find the news I want to subscribe to. D20_30
Push news can be selected according to my preferences. D20_31
There are hot search words presented in the search interface. D20_32
The recommended news are in accordance with my preferences. D20_33
Search results are rich and accurate. D20_34
It is easy to find the news that I want to add. D20_35
The navigation bar is clear. D20_36
It is easy to find what I am interested in. D20_37
It is easy to find the advertisement of this APP. D21_1
I am impressed by the APP’s advertisement. D21_2
The APP’s advertisement is desirable. D21_3
This APP is praised and recognized by the relevant experts. D21_4
This APP is always associated with some major events and sports. D21_5
This APP has good reputation. D21_6
People around are using this APP. D21_7
The parent brand of the APP has a good image and is recognized. D21_8
The parent brand of the APP has great influencing power. D21_9
The image of the parent brand of the APP is in line with the characteristics of the news product. D21_10
This APP has a clear brand personality, such as youthful, rigorous and so on. D21_11
The APP’s user group is the same type as me. D21_12
The personality of this APP is my favorite. D21_13
The overall style of the APP is acceptable and desirable. D21_14

A.3. Additional numerical results of the data analysis

Figure A.1:

Data analysis: OOI. Top/Blue: financial early warning system data. Bottom/Red: news-APP recommendation behavior data.

A.4. Additional data analysis: gas station customers’ psychology and behavior

In this section, we analyze another real dataset with a higher dimension (d = 108) to further examine the effectiveness of the proposed method. This dataset is based on a questionnaire survey of gas station customers in Guangzhou, Guangdong Province, China in 2014. It was collected by an energy company with the goal of studying the relationship between consumers’ psychology and behavior. A multi-stage stratified sampling design is adopted in the survey, which includes a total of 486 customers of gas stations in both urban and suburban areas with various road conditions. As psychology and behavior differ significantly across customers of different ages, we partition the customers into three datasets according to their years of birth: earlier than 1970 (dataset 1, n(1) = 96), between 1970 and 1980 (dataset 2, n(2) = 182), and later than 1980 (dataset 3, n(3) = 208). The response of interest is the amount of gas consumed in the past year, which is continuous and analyzed using the linear model. The predictors include three personal information variables (gender, marital status, and education) and 105 psychological and behavioral measurements, which are measured on a ten-point Likert scale, with one and ten indicating extreme dissatisfaction and the highest satisfaction, respectively. Due to business confidentiality, the detailed variables and questionnaire cannot be made publicly available.

The summary analysis results and detailed estimation results using the four methods are provided in Tables A.10 and A.11, respectively. The proposed method identifies 10 main effects and 11 interactions, which differ from those identified by the alternatives. The same prediction and stability evaluations are conducted. The mean prediction errors (PEs) are 7.708 (M1), 13.233 (M2), 10.109 (M3), and 10.367 (M4), respectively, suggesting the superiority of the proposed method. Figure A.2 shows that the proposed method has the best selection stability.

Table A.10:

Additional data analysis: numbers of overlapping main effects and interactions. In each cell: dataset 1/dataset 2/dataset 3.

Main effects M1 M2 M3 M4
M1 10/10/10 7/6/6 8/8/8 4/4/4
M2 - 9/11/9 6/6/7 2/1/1
M3 - - 12/12/12 3/3/3
M4 - - - 7/7/7

Interactions M1 M2 M3 M4
M1 11/11/11 5/6/9 6/6/6 1/1/1
M2 - 7/9/15 5/4/8 0/0/0
M3 - - 9/9/9 0/0/0
M4 - - - 3/3/3

Table A.11:

Analysis of gas station customers’ psychology and behavior data: estimated coefficients for main effects and interactions.

Dataset 1 Dataset 2 Dataset 3

M1 M2 M3 M4 M1 M2 M3 M4 M1 M2 M3 M4
V1 −0.091 −0.094 −0.073 0.100 0.099 −0.073 −0.103 0.098 −0.073
V15 −0.070 −0.095 −0.081 0.090 0.098 −0.081 −0.092 0.090 −0.081
V19 0.086
V22 −0.084 −0.093 −0.084 0.084 0.093 −0.084 −0.077 0.106 −0.084
V24 −0.095 −0.095 −0.084 −0.095
V3 −0.050 −0.050 −0.050
V31 −0.087 −0.086 0.027 −0.093 0.112 0.161 0.027 0.069 −0.080 0.102 0.027 0.110
V35 0.100 0.100 −0.035 0.048
V4 0.095 0.092 0.062 −0.098 −0.022 0.062 −0.155 0.025 0.062 0.109
V48 −0.071
V5 −0.091 −0.088 0.102 0.100 −0.088 −0.100 0.091 −0.088
V50 0.092 0.092 0.092
V59 −0.095
V6 −0.085 0.099 0.086
V60 −0.093 −0.093 −0.093
V64 0.191 0.202 0.196
V67 0.066 −0.194 −0.088 −0.174 −0.011 0.155
V75 −0.061 −0.100
V78 0.070 0.096 −0.234 −0.032 0.096 −0.205 0.100 0.096 0.113
V87 0.074
V88 −0.167 −0.103 0.127
V9 −0.100 −0.093 −0.077 0.112 0.096 −0.077 −0.111 0.104 −0.077
V95 0.169 0.155 0.122
V1×V15 −0.078 −0.094 0.015 0.085 0.015 −0.096 0.096 0.015
V1×V22 −0.088 −0.094 0.011 0.084 0.097 0.011 −0.082 0.112 0.011
V1×V31 −0.086 0.083 −0.190 0.101 0.093 −0.190 −0.088 0.015 −0.190
V4×V35 0.100 0.100 −0.019 0.028
V6×V31 −0.094 −0.090 0.094
V9×V31 −0.351 −0.094 0.363 0.356 0.182 0.363 −0.364 0.024 0.363
V15×V31 −0.095 0.090 0.090 0.091 0.090
V1×V5 −0.098 0.108 0.103 −0.103 0.097
V1×V9 −0.102 0.109 0.094 −0.111 0.110
V19×V31 0.092
V22×V31 −0.095 0.093 0.092 0.358 0.093 −0.089 0.025 0.093
V48×V75 −0.094
V1×V6 0.094
V6×V33 0.101
V9×V15 0.013 −0.085 0.010 −0.085 −0.011 0.012 −0.085
V9×V22 −0.095 0.092 −0.090 0.115
V15×V22 −0.089 −0.089 0.012 −0.089
V24×V75 −0.100
V31×V78 0.092 0.092 0.092
V67×V78 0.072 0.032 −0.051 0.025 0.036 −0.027
V4×V78 0.026 0.026 −0.020
V64×V95 −0.030 −0.029 −0.025

Figure A.2:

Additional data analysis (gas station customers’ psychology and behavior): OOI.

References

  • 1.Bien J, Taylor J, Tibshirani R. A Lasso for hierarchical interactions. Annals of Statistics 2013;41(3):1111–1141. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Lim M, Hastie T. Learning interactions via hierarchical group-Lasso regularization. Journal of Computational and Graphical Statistics 2015;24(3):627–654. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Zeng X, Ma S, Qin Y, Li Y. Variable selection in strong hierarchical semiparametric models for longitudinal data. Statistics and Its Interface 2015;8(3):355–365. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4. Hsu D. Identifying key variables and interactions in statistical models of building energy consumption using regularization. Energy 2015;83:144–155.
  • 5. Hayashi M, Boadway R. An empirical analysis of intergovernmental tax interaction: the case of business income taxes in Canada. Canadian Journal of Economics 2001;34(2):481–503.
  • 6. Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2005;67(2):301–320.
  • 7. Huang J, Horowitz JL, Ma S. Asymptotic properties of bridge estimators in sparse high-dimensional regression models. The Annals of Statistics 2008;36(2):587–613.
  • 8. Ziemer RF, Wetzstein ME. A Stein-rule method for pooling data. Economics Letters 1983;11:137–143.
  • 9. Ejaz Ahmed S, Yüzbasi B. Big data analytics: integrating penalty strategies. International Journal of Management Science and Engineering Management 2016;11(2):105–115.
  • 10. Shah MKA, Lisawadi S, Ejaz Ahmed S. Merging data from multiple sources: pretest and shrinkage perspectives. Journal of Statistical Computation and Simulation 2017;87(8):1577–1592.
  • 11. Liu J, Huang J, Ma S. Integrative analysis of cancer diagnosis studies with composite penalization. Scandinavian Journal of Statistics 2014;41(1):87–103.
  • 12. Huang Y, Liu J, Yi H, Shia BC, Ma S. Promoting similarity of model sparsity structures in integrative analysis of cancer genetic data. Statistics in Medicine 2017;36(3):509–559.
  • 13. Friedman J, Popescu BE. Gradient directed regularization for linear regression and classification. Technical Report, Statistics Department, Stanford University, 2003.
  • 14. Shi X, Liu J, Huang J, Zhou Y, Shia B, Ma S. Integrative analysis of high-throughput cancer studies with contrasted penalization. Genetic Epidemiology 2014;38(2):144–151.
  • 15. Liu J, Huang J, Ma S. Integrative analysis of multiple cancer genomic datasets under the heterogeneity model. Statistics in Medicine 2013;32(20):3509–3521.
  • 16. Stute W. Distributional convergence under random censorship when covariables are present. Scandinavian Journal of Statistics 1996;23:461–471.
  • 17. Li J, Qin Y, Yi D, Li Y, Shen Y. Feature selection for support vector machine in the study of financial early warning system. Quality and Reliability Engineering International 2014;30(6):867–877.
  • 18. Koyuncugil AS, Ozgulbas N. Financial early warning system model and data mining application for risk detection. Expert Systems with Applications 2012;39(6):6238–6253.
  • 19. Huang J, Ma S. Variable selection in the accelerated failure time model via the bridge method. Lifetime Data Analysis 2010;16(2):176–195.
