Journal of Applied Statistics. 2020 Jun 29;48(11):2022–2041. doi: 10.1080/02664763.2020.1783522

Feasibility as a mechanism for model identification and validation

Corrine F Elliott a, Joshua W Lambert a,b, Arnold J Stromberg a, Pei Wang a, Ting Zeng a, Katherine L Thompson a
PMCID: PMC9042105  PMID: 35706432

Abstract

As new technologies permit the generation of hitherto unprecedented volumes of data (e.g. genome-wide association study data), researchers struggle to keep up with the added complexity and time commitment required for its analysis. For this reason, model selection commonly relies on machine learning and data-reduction techniques, which tend to afford models with obscure interpretations. Even in cases with straightforward explanatory variables, the so-called ‘best’ model produced by a given model-selection technique may fail to capture information of vital importance to the domain-specific questions at hand. Herein we propose a new concept for model selection, feasibility, for use in identifying multiple models that are in some sense optimal and may unite to provide a wider range of information relevant to the topic of interest, including (but not limited to) interaction terms. We further provide an R package and associated Shiny Applications for use in identifying or validating feasible models, the performance of which we demonstrate on both simulated and real-life data.

Keywords: Data analysis, feasibility, model selection, model validation, regression, statistical model

1. Introduction

As technological advancements permit the acquisition and storage of increasingly massive quantities of electronic data (e.g. genome-wide association study data), discovery of viable descriptive or predictive models becomes progressively more difficult to achieve. Not only are massive databases difficult to synthesize manually, but exhaustively fitting models becomes impracticable, consuming excessive computational time and analytical effort even when models are constrained to a maximum number of parameters. Already the volumes of data available to researchers overwhelm even the fastest modern computers [7].

Bearing in mind the limitations of existing model-selection techniques (see Section 2), we introduce feasibility as a new framework for model selection and validation, designed to overcome the difficulties faced by model builders working in a variety of research areas. By performing model selection in a feasibility setting, we can not only identify combinations of factors related to a response variable of interest – including higher-order terms and interactions – but also improve the descriptive ability of existing models, thereby affording researchers a deeper understanding of the relationships underlying the data they seek to explore.

In the following sections, we present an overview of existing model-selection techniques (Section 2), followed by an introduction to feasibility and to our tool founded on this concept (Section 3). Section 4.1 characterizes the behavior of feasible models with increasing sample size in the regression setting. In Section 4.2, we present simulations exploring the performance of the feasibility-based algorithm relative to conventional forward selection, and investigations into the feasibility of models identified using each of stepwise selection, forward selection, and LASSO methods. To demonstrate the viability of the feasibility concept in analyzing real-life data, we then (Section 4.3) leverage our tools on two publicly available datasets. Finally, in Sections 5 and 6, we suggest some possible future directions for research into and empowered by the feasibility framework.

2. Background

A variety of classical and modern statistical algorithms exist to assist researchers in discovering data-driven models, many of which rely on iterative, computerized selection of predictor variables using predetermined cutoffs for information gain. In this context, information gain is generally quantified by reduction in error sum of squares, or by equivalent metrics such as the F statistic. Such automatic model-selection techniques commonly fall into one of three categories: (i) forward selection, which starts with an intercept-only model and adds a predictor to the working model on each iteration; (ii) backward elimination, which starts with a model containing all available variables, and removes a predictor from the working model on each iteration; or (iii) stepwise selection, a hybrid of forward and backward regression in which predictors iteratively enter and leave the working model [12]. Each of these automatic procedures continues iterating until the working model stabilizes to contain solely the predictors whose information contribution surpasses the chosen threshold. In contrast, penalized regression techniques, such as LASSO [22,23] and ridge regression, produce working models in a single fitting procedure rather than by iterative variable selection, shrinking small regression coefficient estimates toward zero; the LASSO penalty sets some coefficients exactly to zero, thereby excluding the corresponding predictors.
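
For concreteness, the two families of procedures described above can be sketched in a few lines of R (a minimal illustration with a toy data frame; object names are ours, and step() uses AIC rather than an F-statistic threshold as its measure of information gain):

```r
## Toy data: ten candidate predictors, response driven by X1 and X2.
set.seed(1)
dat <- data.frame(matrix(runif(100 * 10), ncol = 10))
names(dat) <- paste0("X", 1:10)
dat$Y <- dat$X1 + dat$X2 + rnorm(100)

## (i) Forward selection: start from the intercept-only model and add,
## on each iteration, the predictor yielding the largest criterion gain.
fwd <- step(lm(Y ~ 1, data = dat),
            scope = reformulate(paste0("X", 1:10), response = "Y"),
            direction = "forward", trace = 0)

## Penalized regression: the LASSO shrinks some coefficients exactly to zero.
library(glmnet)
lasso <- glmnet(x = as.matrix(dat[, paste0("X", 1:10)]), y = dat$Y, alpha = 1)
coef(lasso, s = 0.05)   # coefficients at penalty lambda = 0.05
```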

Despite being more efficient than an exhaustive search, none of these procedures is universally appropriate. Researchers typically rely on automatic model-selection procedures when they have no theory-driven hypothesis for the mechanism underlying their data [18], and/or when the available predictor variables are too numerous to permit manual investigation. Unfortunately, the very situations that encourage use of automatic model-selection techniques render them suspect. Most metrics for quantifying information gain, such as the F statistic, are only strictly appropriate for testing predetermined hypotheses; by definition, testing of a data-driven model occurs post hoc [12,20]. Although other metrics exist for quantifying model quality, choice of criterion function is itself a multifaceted process, as stepwise procedures of model selection depend heavily on the chosen path through the variable space [5]. Finally, datasets containing numerous predictor variables do not necessarily contain sufficient observations for the resulting parameter estimates to be reliable, especially when the predictors are correlated [20].

Machine-learning and data-reduction approaches constitute the primary rivals to classical statistical methods. These techniques, including neural networks and random forests [9], can accommodate larger volumes of data and may generate models with higher model-quality scores, especially those designed to capture prediction error. However, the resulting models do not benefit from the well-established statistical theory underlying many parametric model forms. Moreover, they are generally convoluted or mechanistically opaque – hence the widespread likening of such processes to ‘black boxes’ – rendering difficult both interpretation and generalization to new datasets. Thus, machine-learning techniques find common application when their users seek high predictive accuracy, but fall short when researchers' interest lies in describing or understanding processes.

Perhaps most limiting of all, existing techniques in both categories almost ubiquitously converge to a single model, selected as ‘best’ under the chosen metric of model quality. They therefore risk overlooking factors related to the variable of interest – possibly including the true relationship underlying the data: In simulation studies of small to medium effects with only one set of (randomly selected) variables truly related to the response, exhaustive search has been shown to identify an incorrect set of variables even in the well established context of classical regression with optimization by R2 [21]. In less-extreme cases, researchers may overlook models that, while statistically suboptimal for a particular dataset, are of interest from a clinical or scientific viewpoint [18]. For example, if we seek knowledge of risk factors for a complex disease, a single model is likely insufficient to capture the myriad complexities of the target relationship.

In particular, researchers tend to overlook or omit higher-order terms such as interactions, in large part because searching exhaustively for models containing interaction terms quickly becomes computationally intensive. To illustrate: A dataset containing a mere 250 variables incurs 250 quadratic terms, $\binom{250}{2}$ pair-wise interactions, and $\binom{250}{3}$ three-way interactions, for a minimum of more than 2.6 million models under consideration even without searching exhaustively for subsets of main effects. The inability of existing model-building procedures to identify optimal models under such mundane conditions encourages one of two alternatives: considering only those interactions suggested by an expert investigator, thereby limiting the design space to the scope of an individual opinion; or ignoring interactions altogether, thereby limiting the descriptive and predictive power of the final model [6]. Neither scenario encourages the discovery of new relationships on the basis of novel datasets.
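
The combinatorial count above can be verified directly in R:

```r
## Number of candidate terms for p = 250 variables: quadratic terms plus
## all pair-wise and three-way interactions.
p <- 250
p + choose(p, 2) + choose(p, 3)   # 2,604,375 models -- more than 2.6 million
```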

In the following section, we introduce a new paradigm for model building aimed at giving researchers a more-complete conception of the relationships in their data than is offered by any existing technique of either the automatic-selection or the machine-learning persuasion. We further introduce our publicly accessible, computationally efficient tools founded on this framework, which we designed to accommodate nearly any model form and criterion for gauging model quality [13]. With sufficient repetitions, these tools produce an ensemble of models that have been shown to capture the optimal model with arbitrarily high probability [10].

3. Methods

In this section, we introduce a highly flexible framework, termed feasibility, that can be used to explore the optimality of statistical models (see Section 3.1). This framework can be applied either to test existing models for optimality (Section 3.2) or to identify new models (Section 3.3), and is available in both R package and applet formats (Section 3.5).

3.1. Feasibility

Consider a statistical model in which a response, an $N \times 1$ vector denoted $Y$, is described by a subset of predictors from a pool of $p$ explanatory variables. Suppose we wish to fix $q$ variables in the model, denoted $X_1,\ldots,X_q$, and to identify the $k$ additional variables $X_{(1)},\ldots,X_{(k)}$ that will afford the most-informative final model. Let $X_K$ denote the $N \times k$ matrix formed by binding the $k$ column vectors $X_{(1)},\ldots,X_{(k)}$. Then the model defined by $(X_1,\ldots,X_q,X_K)$ is said to be feasible (alternatively, is a feasible solution) under a given criterion if the $k$ discretionary variables composing $X_K$ optimize said criterion function within the data space accessible via single-variable exchanges.

In a regression context, the criterion function is commonly a measure of model quality as in Equations 1 and 2.

$$X_K^{*} = \operatorname*{arg\,min}_{X_K}\ \mathrm{AIC}(X_1,\ldots,X_q,X_K) \qquad (1)$$

$$X_K^{*} = \operatorname*{arg\,max}_{X_K}\ R^2(X_1,\ldots,X_q,X_K) \qquad (2)$$

Here $X_K^{*}$ denotes the optimal set of $k$ discretionary variables accessible by means of sequential replacement. More generally, let $C$ denote the objective function – namely, the criterion function or, in cases of minimization, the negative of the criterion function. Then we write:

$$X_K^{*} = \operatorname*{arg\,max}_{X_K}\ C(X_1,\ldots,X_q,X_K) \qquad (3)$$

Hence, within the class of models exhibiting a specified functional form, a feasible solution is a model in which no single exchange of a non-fixed, first-order term (and associated higher-order terms) improves the objective function [10].

Although we restrict our examples to linear models for simplicity, note that nothing in this description of the feasibility framework limits its use to a specific model structure or criterion function. Indeed, the framework (and our implementation thereof) admits a variety of model forms – including nonlinear models, nonparametric regression models, and generalized additive models – and criterion functions – including (possibly penalized) sum of squared errors. Moreover, models identified under the feasibility framework remain amenable to the standard modes of inference associated with the chosen model form. A minor limitation on leveraging feasibility is that the criterion function must return a value amenable to maximization (or minimization), thereby permitting a direct comparison across fitted models.
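
To make the definition concrete, the following minimal R sketch (our illustration, not the rFSA implementation) checks whether a set of discretionary variables is feasible for a linear model under adjusted R2 by examining every single-variable exchange; the response is assumed to be a column named Y.

```r
## Criterion: adjusted R^2 of a linear model built from `vars`.
adj_r2 <- function(vars, dat) {
  summary(lm(reformulate(vars, response = "Y"), data = dat))$adj.r.squared
}

## A model is feasible if no single exchange of a discretionary variable
## (`disc`) for a variable outside the model improves the criterion,
## holding the fixed variables (`fixed`) in place.
is_feasible <- function(fixed, disc, dat) {
  pool    <- setdiff(names(dat), c("Y", fixed, disc))
  current <- adj_r2(c(fixed, disc), dat)
  for (j in seq_along(disc)) {       # each discretionary position...
    for (cand in pool) {             # ...against each excluded variable
      trial <- disc
      trial[j] <- cand
      if (adj_r2(c(fixed, trial), dat) > current) return(FALSE)
    }
  }
  TRUE
}
```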

3.2. Validating models for feasibility

Many published models are not feasible, as we illustrate with two examples in Section 4.3. Such models may nonetheless contain valuable domain-specific information – but validation under the feasibility framework offers a method of improving upon (or supplementing) these models to better understand the mechanisms underlying a dataset. Here we do not seek an optimal solution, as under an exhaustive search, nor to dictate a ‘correct’ method of selecting models – but rather to improve the descriptive or predictive power of models that otherwise would constitute our best understanding of the data at hand.

The above definition of a feasible solution lends itself readily to validating feasible models of the first order (Figure 1). Consider a user-provided model of the form $Y \sim \beta_a X_a + \beta_b X_b$, where $X_a$ and $X_b$ are selected from a pool of $p$ variables. Specifically, suppose we take $Y \sim \beta_4 X_4 + \beta_{98} X_{98}$ with an AIC of 0.65. This model is feasible if no exchange of $X_4$ or $X_{98}$ for any of the other $p-2$ variables will afford an improvement to the criterion function. Then the feasible model accessible from our initialized model is identically the original model.

Figure 1. Flowchart depicting the Feasible Solution Algorithm optimization procedure acting on a simple example.

Suppose instead that an exchange of $X_{98}$ for $X_5$ affords an AIC of 0.32, constituting the largest improvement in AIC achievable by exchanging either of $X_4$ or $X_{98}$. If the resulting model ($Y \sim \beta_4 X_4 + \beta_5 X_5$) is still not feasible, then we may continue the procedure of repeatedly exchanging one of the discretionary variables for a variable not yet included in the model. On each iteration, a single explanatory variable out of the $k$ discretionary variables (here $k = 2$) is replaced by the variable that most improves the criterion function, as follows:

$$X_K = \operatorname*{arg\,max}_{X_{(j)};\; j \in \{1,\ldots,k\}} \left[ \operatorname*{arg\,max}_{X^{(j)} \in \{X_1,\ldots,X_p\}} C\bigl(X_1,\ldots,X_q;\ \{X^{(j)}\} \cup X_K \setminus \{X_{(j)}\}\bigr) \right] \qquad (4)$$

Eventually, no single exchange further improves upon the model criterion, yielding a combination of factors ($X_K$) that the data suggest influence the response variable of interest in the presence of the fixed variables $X_1,\ldots,X_q$. By this method, we have both (a) located a feasible solution and, because we were able to improve upon the initialized model, (b) demonstrated that the original model was not feasible.
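
This sequential-replacement loop can be sketched compactly in R (our illustration rather than the rFSA source, using AIC as in the toy example; the response is again assumed to be a column named Y):

```r
## Validate (and, if necessary, improve) a first-order model by repeated
## single-variable exchanges; `disc` holds the discretionary variables,
## e.g. c("X4", "X98"), and `fixed` any variables held in the model.
validate_feasible <- function(disc, dat, fixed = character(0)) {
  crit <- function(vars) AIC(lm(reformulate(vars, response = "Y"), data = dat))
  repeat {
    best <- crit(c(fixed, disc))
    swap <- NULL
    pool <- setdiff(names(dat), c("Y", fixed, disc))
    for (j in seq_along(disc)) {
      for (cand in pool) {
        trial <- disc
        trial[j] <- cand
        if (crit(c(fixed, trial)) < best) {
          best <- crit(c(fixed, trial))
          swap <- trial
        }
      }
    }
    if (is.null(swap)) return(disc)  # feasible: no single exchange improves AIC
    disc <- swap                     # otherwise accept the best exchange and repeat
  }
}
```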

Beyond first-order models, the same procedure suits models whose discretionary matrix $X_K$ is defined entirely by a single, higher-order interaction term. This problem may be considered equivalently as a test of the feasibility of the interaction term itself. Take the example of a model with the form $Y \sim \beta_a X_a + \beta_b X_b + \beta_{ab} X_a X_b$, where the main effects are determined by their participation in the interaction term. We assume a pool of $p$ variables. Given a desired interaction order, here $m = 2$, we again attempt to improve the working model by exchanging one of the $m$ interaction components (either of $X_a$ or $X_b$ in the interaction $X_a X_b$) for one of the $p-2$ variables not yet included in the interaction. Pursuant to the hierarchical paradigm, we exchange all associated lower-order interactions and main effects prior to calculating the new criterion value. The resulting feasible solution is the model defined by the feasible interaction term accessible from $X_a X_b$ by means of one or more single-variable exchanges.

3.3. Locating feasible solutions

The procedure described above for validating models under feasibility entails a single repetition of the algorithm for locating feasible solutions. The validation procedure identifies a single feasible model: In our toy example of a first-order model, from a user-defined starting point of $Y \sim \beta_4 X_4 + \beta_{98} X_{98}$, we identify the feasible solution $Y \sim \beta_4 X_4 + \beta_5 X_5$.

When instead we seek to discover new feasible solutions, we leverage the same algorithm with multiple repetitions, each initialized at a random model of the desired functional form. Each repetition of the algorithm affords the feasible solution accessible by means of sequential replacement from the random initialization [13]. The full output therefore offers a composite of feasible solutions, accompanied by the corresponding criterion values and frequencies of occurrence for each model (the latter, because a single feasible solution may be accessible from multiple initializations). An in-depth discussion of locating feasible solutions with this algorithm is given in Lambert et al.[13].

Note again that our goal in searching for feasible solutions is neither to locate all feasible solutions, nor to identify a single optimal solution – although either objective may be achieved with sufficient repetitions [10]. Rather, the algorithm returns an ensemble of models that describe the data well individually, and may unite to elucidate the natural processes governing a dataset. The analyst may then investigate these models, whether rationally or empirically, either to winnow the candidate models or simply to understand better the mechanistic underpinnings of the data.
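
Building on the validate_feasible() sketch in Section 3.2, the multiple-random-starts strategy might be outlined as follows (illustrative only; the rFSA package provides the full implementation [13]):

```r
## Repeatedly initialize at a random k-variable model and record the
## feasible solution reached from each start, with frequencies.
find_feasible <- function(dat, k = 2, numrs = 50) {
  pool <- setdiff(names(dat), "Y")
  sols <- replicate(numrs, {
    start <- sample(pool, k)                        # random initialization
    paste(sort(validate_feasible(start, dat)), collapse = " + ")
  })
  sort(table(sols), decreasing = TRUE)              # ensemble of feasible solutions
}
```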

3.4. Term-wise feasibility

Thus far we have (a) defined feasibility with respect to linear models and models defined by a single interaction, and (b) illustrated a procedure for validating or locating feasible solutions of either type. We next turn our attention to defining feasibility for models more generally, such as a linear model containing two or more interactions.

Consider the model form $Y \sim \beta_a X_a + \beta_b X_b + \beta_c X_c + \beta_d X_d + \beta_{cd} X_c X_d$, where the main effects of $X_c$ and $X_d$ are again determined by their participation in the interaction term. By extension, a more-parsimonious representation may be given by $Y \sim \beta_a X_a + \beta_b X_b + \beta_{cd} X_c X_d$ because the omitted main effects are assumed present in accordance with the hierarchical paradigm.

We define a single term as feasible if it optimizes the desired criterion function when all other terms in the parent model are held fixed. For the model above, $X_a$ is feasible if we cannot exchange it for any other variable $X_e$ such that $Y \sim \beta_e X_e + \beta_b X_b + \beta_{cd} X_c X_d$ yields a superior criterion value. By extension, a term-wise feasible model is simply a model whose every component term is feasible. That is, the aforementioned model is term-wise feasible if and only if each of the terms $X_a$, $X_b$, and $X_c X_d$ is feasible (Figure 2). Note that a feasible solution is necessarily term-wise feasible, as the model would not be feasible if any component term could be replaced to yield an improvement to the criterion function value.

Figure 2. Flowchart depicting a test of term-wise feasibility for a user-specified model.
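
A term-wise feasibility check can be sketched along the same lines (our illustration; for brevity, this sketch exchanges an interaction term as a unit without enforcing the hierarchical inclusion of its main effects):

```r
## Check each term of a model, e.g. terms = c("Xa", "Xb", "Xc:Xd"),
## holding the remaining terms fixed; returns one verdict per term.
term_feasible <- function(terms, dat) {
  crit <- function(tm) summary(lm(reformulate(tm, response = "Y"),
                                  data = dat))$adj.r.squared
  base <- crit(terms)
  vars <- setdiff(names(dat), "Y")
  sapply(seq_along(terms), function(j) {
    others <- terms[-j]
    cands  <- if (grepl(":", terms[j], fixed = TRUE)) {
      apply(combn(vars, 2), 2, paste, collapse = ":")   # candidate interactions
    } else {
      vars                                              # candidate main effects
    }
    cands <- setdiff(cands, terms)
    !any(sapply(cands, function(cc) crit(c(others, cc)) > base))
  })
}
```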

3.5. Implementation

To support validation of a simple linear or interaction-based model under the feasibility framework, we provide the R package rFSA and accompanying Feasible Solution Shiny Application (available from https://shiny.as.uky.edu/mcfsa/). The algorithm proceeds as described in Section 3.2 above; for a more-detailed description of the procedure and instructions for its use in locating multiple feasible solutions (surveyed briefly in Section 3.3), we refer readers to Lambert et al.[13].
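
A rough usage sketch follows; the argument names reflect our reading of Lambert et al. [13], may differ across versions of rFSA, and should be checked against the package documentation.

```r
## Hedged sketch: search for a feasible two-way interaction describing Y,
## starting from several random initializations (argument names assumed).
library(rFSA)
fit <- FSA(formula   = Y ~ 1,      # fixed portion of the model
           data      = dat,        # data frame containing Y and the candidate pool
           fitfunc   = lm,         # underlying model form
           m         = 2,          # order of the interaction under exchange
           numrs     = 10,         # number of random starts
           criterion = AIC,        # criterion function
           minmax    = "min")      # minimize AIC
print(fit)                         # feasible solutions, criterion values, frequencies
```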

By design, the output of the validation procedure is the ‘best’ model achievable by sequential replacement initialized at the model of interest. If the model is feasible, then no exchange of variables will afford an improvement, and the result will be identical to the initialized variables. Conversely, an infeasible model returns a different combination of variables, assembled by iteratively exchanging a less-informative predictor for one that improves the criterion function.

Vetting of more-general models for term-wise feasibility is available through the Shiny Application for Feasibility, available from https://shiny.as.uky.edu/feas/. This applet accepts a model from the user and determines the feasibility of each component term. If all terms are feasible, then the application returns the appropriate verdict without supplement. In the case of an infeasible model, the application returns the best replacement available for each infeasible term, along with the improved criterion values achieved by making each of the proposed exchanges individually.

4. Results & discussion

The feasibility framework overcomes or mitigates many challenges associated with existing model-selection procedures. It addresses the computational challenges of exhaustive search – and increasingly, as the size of a typical dataset swells, the logistical challenges of other model-selection procedures as well – in that the strategy of iterative improvement requires fitting fewer models.

This approach is highly versatile, accommodating any model-building problem for which an optimality criterion exists. As noted in Section 2, the performance of iterative procedures depends strongly on the choice of criterion function, while our implementation of feasibility accommodates efficient discovery of feasible models under a variety of criteria to yield a small subset of models for further investigation. Beyond such exploratory efforts, feasibility can be employed to validate hypothesis-driven models, or in support of existing automatic-selection procedures – and in contrast to so-called ‘black box’ methods, feasibility maintains the interpretability and theoretical foundation associated with the underlying model fit.

These advantages unite to render the feasibility framework a valuable tool for researchers seeking insight from novel datasets, as demonstrated in the simulation studies and analyses to follow.

4.1. Feasibility as sample size increases

Consider the typical linear regression setting (Kutner, p. 218 [12]; Equation (5)) and denote the explanatory variables $X_1, X_2, \ldots, X_p$.

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip} + \epsilon_i, \qquad (5)$$

where $\epsilon_i \overset{\text{iid}}{\sim} N(0,\sigma^2)$. Let $\beta_i \ne 0$ for $i = 1,2,\ldots,k$ and $\beta_i = 0$ for $i = k+1, k+2, \ldots, p$. Using notation from Kutner [12], the sums of squares can be defined as follows.

$$\mathrm{SSTO} = Y^{T}Y - \Bigl(\tfrac{1}{n}\Bigr)Y^{T}JY \qquad (6)$$

$$\mathrm{SSE} = (Y - X\hat{\beta})^{T}(Y - X\hat{\beta}) \qquad (7)$$

$$\mathrm{SSR} = Y^{T}\Bigl[H - \Bigl(\tfrac{1}{n}\Bigr)J\Bigr]Y \qquad (8)$$

Consider models with $k$ explanatory variables, and use the least-squares estimates $\hat{\beta}_{j,\mathrm{LS}}$ to estimate the slope parameters $\beta_j$, where $j \in \{1,2,\ldots,p\}$.

4.1.1. Theorem

For regression setting (5) with non-collinear predictors, there exists $N$ such that for all $n > N$, the least-squares model containing exactly the true predictors $X_1,\ldots,X_k$ is the only feasible solution when using $R^2$ as the optimization criterion.

4.1.2. Proof

Note that as n increases,

$$\frac{\mathrm{SSR}(X_i \mid X_2,\ldots,X_p)}{\mathrm{SSTO}} \longrightarrow 0, \qquad i > k. \qquad (9)$$

Since $X_1,\ldots,X_k$ are not collinear, for some $\epsilon > 0$ and all $n > N$,

$$\frac{\mathrm{SSR}(X_i \mid X_2,\ldots,X_p)}{\mathrm{SSTO}} > \epsilon, \qquad i \le k. \qquad (10)$$

Consider any model with $k$ explanatory variables, denoted $X_{(1)}, X_{(2)}, \ldots, X_{(k)}$. If some $X_i \in \{X_1,\ldots,X_k\}$ is not in the model, then for large enough $n$, Equations (9) and (10) imply that the Feasible Solution Algorithm will replace one of $X_{(1)}, X_{(2)}, \ldots, X_{(k)}$ with one of $X_1,\ldots,X_k$. Replacements continue until the model contains only $X_1, X_2, \ldots, X_k$.

4.2. Simulation studies

In this section, we present three simulation studies to investigate model-selection algorithms in the context of feasibility. Specifically, we explore (1) the ability of our feasibility-based algorithm to identify the true effects underlying a simulated dataset, as referenced against conventional forward selection in SAS; (2) the feasibility of models generated by each of two standard feature-selection algorithms, forward selection and LASSO; and (3) the feasibility of models generated by stepwise selection using the R package, leaps [14], acting on large datasets containing hard-coded interaction effects.

4.2.1. Accuracy and parsimony of feasibility-based algorithm vs. forward selection

To compare the accuracy of the feasibility-based algorithm against that of existing automatic-selection methods, we first simulate datasets containing N = 100 observations of ten independent predictor variables ($X_1,\ldots,X_{10}$) drawn from $U(0,1)$. For each dataset, the response variable is simulated as

$$Y_i \overset{\text{ind}}{\sim} N\bigl(\mu = X_{i1} + X_{i2} + X_{i3} + X_{i4} + c\,X_{i1}X_{i2},\ \sigma^2 = 1\bigr),$$

where $c$ is a constant and $i = 1,\ldots,100$. Thus the true relationship underlying each dataset is given by:

$$Y \sim X_1 + X_2 + X_3 + X_4 + c\,X_1X_2$$

Here we simulate 100 such datasets for each of $c \in \{0.5, 1.0, 1.5, 2.0\}$ to elucidate the consequences of interaction effect size on the ability of each algorithm to identify relevant interaction terms.
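
The data-generating mechanism for a single simulated dataset can be sketched as follows (seed and object names are ours; the case c = 2.0 is shown):

```r
## One simulated dataset: ten U(0,1) predictors, response driven by
## X1--X4 plus the X1*X2 interaction with coefficient c.
set.seed(2020)
n <- 100; c_int <- 2.0
X  <- matrix(runif(n * 10), nrow = n,
             dimnames = list(NULL, paste0("X", 1:10)))
mu <- X[, "X1"] + X[, "X2"] + X[, "X3"] + X[, "X4"] +
      c_int * X[, "X1"] * X[, "X2"]
dat <- data.frame(X, Y = rnorm(n, mean = mu, sd = 1))
```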

On each dataset, we then execute each of rFSA in R, searching for two-way interactions, and forward selection with PROC GLMSELECT in SAS, with up to two-way interactions. In both cases, we employ the conventional criterion function of adjusted R2. Here we choose to compare against SAS because the traditional R package for forward selection, leaps [14], does not accommodate interaction terms unless we define them manually in the analysis dataset – and even then does not account for the associated main-effect terms. (We explore the latter approach in a later simulation study.) In this scenario, SAS uniformly fails to identify a useful relationship, consistently generating models that contain numerous first-order and interaction effects. As an example, we report the model generated by SAS on the first dataset of the simulation experiment with interaction effect size c = 2.0. Note that this value represents the largest interaction effect size under consideration, and thus should afford the best performance:

$$\begin{aligned} Y \sim{}& \beta_0 + \beta_1 X_1 + \beta_{1,2} X_1X_2 + \beta_3 X_3 + \beta_{1,3} X_1X_3 + \beta_{2,3} X_2X_3 + \beta_4 X_4 + \beta_{1,4} X_1X_4 + \beta_{2,4} X_2X_4 \\ &+ \beta_{3,4} X_3X_4 + \beta_{1,5} X_1X_5 + \beta_{1,7} X_1X_7 + \beta_8 X_8 + \beta_{4,8} X_4X_8 + \beta_9 X_9 + \beta_{1,9} X_1X_9 + \beta_{2,9} X_2X_9 \\ &+ \beta_{4,9} X_4X_9 + \beta_{8,9} X_8X_9 + \beta_{5,10} X_5X_{10} + \beta_{9,10} X_9X_{10} \end{aligned}$$

Omitting the true effects identified correctly, this model contains 17 extraneous terms. Figure 3 depicts boxplots of the number of extraneous terms contained in each SAS model across sizes of the interaction effect, calculated analogously to the preceding example. Because these models are so consistently unwieldy, we do not attempt to validate them for feasibility.

Figure 3. Number of extraneous effects identified by either of forward selection with PROC GLMSELECT or searching for feasible two-way interactions with rFSA, acting on simulated datasets with various sizes of interaction effect.

Under our specified function parameters, the models identified by rFSA each contain two main effects and their interaction. Figure 3 also depicts boxplots of the number of extraneous predictors identified by rFSA; this value never exceeds two for a given dataset because, in the worst-case scenario, rFSA identifies a single, incorrect interaction term.

The four variables most commonly selected by rFSA are X1, X2, X3, and X4, all of which are influential predictors in truth. As depicted in Figure 4, rFSA identifies the correct interaction (between X1 and X2) on three datasets out of 100 with interaction effect size 1.0, 17 with interaction effect size 1.5, and 52 with interaction effect size 2.0; we do not plot analogous values for the SAS models because the abundance of superfluous terms renders the models uninformative even when the true predictors are present.

Figure 4. Number of simulated datasets out of 100 in which rFSA discovers the true interaction effect (between X1 and X2) or the interaction between true main effects X3 and X4, across various sizes of interaction effect.

Although rFSA does not always identify the correct interaction term, it discovers an interaction between X3 and X4 (i.e. the true main effects) on all 100 datasets with effect size 0.5; 99 out of 100 datasets with effect size 1.0; all datasets with effect size 1.5; and 96 out of 100 datasets with effect size 2.0. This interaction will likely manifest less often as the interaction between X1 and X2 becomes clearer; hence the number of occurrences drops when the effect size is relatively large. (We do not necessarily expect a monotonic trend for this trade-off: Recall that the algorithm can return multiple models, and thus sometimes reports both the true interaction and a model containing the true main effects.) We thus observe that even when the feasibility algorithm fails to identify the true interaction term underlying a dataset, it consistently identifies valuable relationships in the data in scenarios where other algorithms falter.

4.2.2. Feasibility of models from LASSO and forward selection

Having shown that feasibility has the potential to elucidate true effects underlying a simulated dataset, we next assess the ability of standard statistical techniques to identify feasible models. Specifically, we investigate the performance of LASSO and of forward selection in terms of generating feasible models. Note that we do not seek to compare the performance of these two methods, only to validate the resulting models for feasibility.

For this study, we analyzed N = 1000 observations of p = 500 independent predictor variables ($X_1, X_2, \ldots, X_{500}$) drawn from $U(-1,1)$. The response variable is simulated as $Y_i \overset{\text{iid}}{\sim} N(0,1)$ for $i = 1,2,\ldots,n$. We select features from each of 100 such simulated datasets using forward selection (with values for α-to-enter of 0.01, 0.025, and 0.05 using the R package olsrr [11]), and LASSO [22] (with values for λ of 0.06, 0.07, and 0.08 using the R package glmnet [19]). Upon completion of each analysis, we employ rFSA to validate the resulting models for feasibility under a criterion of R2.
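
For illustration, the LASSO step for one simulated dataset might look as follows (the seed and object names are ours, and U(−1, 1) reflects our reading of the predictor distribution; the subsequent feasibility check uses rFSA as described in Section 3.2):

```r
## Null-model data: 500 independent predictors, pure-noise response.
library(glmnet)
set.seed(2020)
n <- 1000; p <- 500
X <- matrix(runif(n * p, min = -1, max = 1), nrow = n,
            dimnames = list(NULL, paste0("X", 1:p)))
Y <- rnorm(n)

## LASSO at a fixed penalty; nonzero coefficients define the selected model.
fit <- glmnet(X, Y, alpha = 1, lambda = 0.06)
cf  <- coef(fit)
selected <- setdiff(rownames(cf)[as.vector(cf) != 0], "(Intercept)")
selected   # predictors entering the model at lambda = 0.06
```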

The results of this study are provided in Figures 5 and 6. In panels (A) and (B) of Figure 5, boxplots describe the number of predictors composing each model fitted to the simulated datasets by LASSO or by forward selection, respectively. The criterion for entering a model becomes more stringent as λ increases or α-to-enter decreases, and thus fewer predictors are able to enter the model. The values of α-to-enter are standard across the statistical literature, while we selected values of λ to favor similar numbers of predictors in the models generated by either method; hence the feasibility results reported in Figure 6 are approximately comparable across methods.

Figure 5. Number of predictors composing each model recommended by either of (A) LASSO or (B) forward selection acting on simulated datasets with various sizes of λ (LASSO) or α-to-enter (forward selection).

Figure 6. Feasible models as a proportion of all models suggested by either of (A) LASSO or (B) forward selection acting on simulated datasets with various sizes of λ (LASSO) or α-to-enter (forward selection).

Figure 6 depicts the proportion of models generated by each of LASSO (panel A) and forward selection (panel B) that were deemed feasible by rFSA (Note 1). Note that for p < 300, leaps can provide the exact (and therefore feasible) solution, but for p > 300, leaps is not computationally viable with 40 GB of RAM. In either case, models were less often feasible when produced under less-stringent levels of the respective parameter for LASSO or forward selection. Moreover, LASSO produced feasible models 0% of the time under λ = 0.06, and forward selection produced feasible models 41% of the time under α-to-enter = 0.05 – suggesting that for these simulated datasets, a single exchange of variables would produce better-fitting models 100% of the time under LASSO and 59% of the time under forward selection. Hence the results of this simulation suggest that feasibility is a viable extension to improve or complement the results generated by existing statistical methods in common usage.

4.2.3. Feasibility of models including interaction effects from stepwise selection

Recall that the R package leaps [14] accommodates stepwise selection with interaction terms only when the researcher defines them manually as predictors in the analysis dataset. Our final simulation study explores the feasibility of models generated under this approach. Specifically, we simulate datasets containing columns for each predictor and for the associated two-way interactions, and then perform stepwise selection using the regsubsets function provided by leaps [14].

Each simulated dataset contains n observations, p explanatory ($X$) variables, $\binom{p}{2}$ variables representing pairwise interactions, and a single response variable ($Y$). For each observation $i \in \{1,2,\ldots,n\}$ and predictor $j \in \{1,2,\ldots,p\}$, we simulate the data as follows, where predictors 5, 10, 46, and 83 were selected randomly under the smallest p:

$$X_{ij} \overset{\text{iid}}{\sim} U(0,1) \qquad (11)$$

$$Y_i \overset{\text{ind}}{\sim} N\bigl(X_{i,5} + X_{i,10} + X_{i,46} + X_{i,83} + X_{i,5}X_{i,10},\ \sigma^2 = 1\bigr) \qquad (12)$$

We simulate 100 datasets for each combination of parameters drawn from $n \in \{100, 500, 1000\}$ and $p \in \{100, 150, 200, 250, 300\}$.

Note that we were prevented from extending beyond p = 300 due to the computational constraints of leaps [14]. For example, memory usage exceeded 40 GB for two of our simulated datasets containing 1000 observations of 300 explanatory variables with all associated pairwise interactions.

For each dataset, stepwise selection with regsubsets identified the ‘best’ model containing 2, 3, 4, 5, and 6 terms, respectively, where each term is either one of the p explanatory variables or one of the associated two-way interactions. We then validated each term of these models for feasibility under a criterion of adjusted R2 (holding fixed the other terms in the model) and aggregated the term-level verdicts to determine the term-wise feasibility of each model. To mimic the behavior of leaps, for this study we do not assume that the main effects associated with a pair-wise interaction term are present. Note that for a model containing multiple linear terms, we check the feasibility of all linear terms simultaneously; a solitary linear term cannot be validated under the current implementation and thus does not contribute to the final model verdict.
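
A sketch of the selection step for one dataset follows (object names and seed are ours; the call is time- and memory-intensive for larger p, as noted above):

```r
## Build explicit columns for all pairwise interactions, then ask
## regsubsets() for the best 2- to 6-term models by sequential replacement.
library(leaps)
set.seed(2020)
n <- 100; p <- 100
X  <- matrix(runif(n * p), nrow = n, dimnames = list(NULL, paste0("X", 1:p)))
XX <- model.matrix(~ .^2 - 1, data = as.data.frame(X))   # mains + pairwise interactions
Y  <- rnorm(n, mean = X[, 5] + X[, 10] + X[, 46] + X[, 83] + X[, 5] * X[, 10], sd = 1)

fits <- regsubsets(x = XX, y = Y, nvmax = 6, method = "seqrep", really.big = TRUE)
summary(fits)$which[2:6, ]   # terms composing the best 2- to 6-term models
```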

Figure 7 reports the proportion of the 100 datasets under each parameter setting for which stepwise selection returned a feasible model. In general, we see that models containing more terms are less often feasible. Moreover, for models containing few terms, feasible models are more common under datasets with more observations (n). However, for models with several terms, all sample sizes produce low percentages of feasible models. Thus, under a variety of reasonably sized datasets, conventional stepwise selection in R is both (a) computationally intensive and (b) liable to produce results that can be improved or supplemented by leveraging the feasibility framework.

Figure 7. Feasible models as a proportion of all models including two-way interactions suggested by stepwise selection acting on simulated datasets with various model sizes.

As a final note, it is worthwhile to consider the approach – common among some practitioners – of using stepwise selection to winnow a pool of explanatory variables, and then searching exhaustively for models built from the resulting subset of predictors. This strategy can be computationally intensive if we wish to consider interactions, as in the example above. Perhaps more importantly, the lack of feasibility demonstrated in this study suggests that the winnowed pool is likely to contain the wrong predictors to achieve the best performance, and thus this technique could be further strengthened by considering the feasibility of stepwise-selected models prior to the exhaustive search.

4.3. Case studies of published datasets

For each of the datasets under consideration, we first replicated the authors' original analysis to verify the data and ensure that future results are directly comparable to the original findings. In cases where the authors select one or more models for their data, we then test these models to determine whether they are feasible, or if instead we can improve upon the published model by making one or more simple variable exchanges. Data-specific methodological details are given in the corresponding subsections.

4.3.1. Eye data

Our first target dataset originated from a study by Rong et al. aimed at modelling the best corrected visual acuity (BCVA) achieved in children with congenital cataracts who underwent cataract extraction followed by intraocular lens (IOL) implantation [16]. For each of N = 110 children, the dataset [17] contains a variety of demographic and medical data, including: sex; age at each operation and at follow-up; uncorrected visual acuity (UCVA); opacity (partial or total); compliance with post-operation therapy (none/poor/good); refractive error at last follow-up; and incidence of various post-operational complications (Figure 8). In children with bilateral cataracts, Rong et al. randomly selected one eye for inclusion in the dataset.

Figure 8. Subset of cataract dataset.

In the original analysis, Rong et al. analyzed separately those eyes belonging to children with unilateral versus bilateral cataracts. Nonetheless, stepwise multiple linear regression afforded the same multiple regression model for both groups, regressing BCVA on opacity, age at cataract extraction, compliance with therapy, and (refractive) error (Table 1). It is important to note that all categorical variables in the original analysis were treated as continuous and scaled accordingly to afford the published standardized parameters [16]. For the sake of comparison, we maintain this treatment throughout the following analysis.

Table 1. Comparison of published models [16] and feasible models identified by rFSA in the course of vetting published models for feasibility.
| | Model | Age.extract | Compliance | Complication | Error | Follow.up | Nystagmus | Opacity | UCVA | adj R2 | PRESS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Bilateral | Published | X | X | | X | | | X | | 0.4100 | 51.42 |
| | Feasible 1 | | X | X | | | | X | X | 0.6186 | 31.73 |
| Unilateral | Published | X | X | | X | | | X | | 0.7393 | 11.04 |
| | Feasible 1 | | X | | X | | | X | X | 0.7711 | 9.98 |
| | Feasible 2 | | | | X | X | X | X | | 0.8326 | 6.41 |

After verifying reproducibility of the original analysis, we investigate whether the authors' model is a feasible solution. We explore the published model separately for each group of children, and discover that it is feasible for neither unilateral nor bilateral cataracts. Because the authors do not cite a preferred criterion by which to judge the relative quality of models, we execute rFSA using each of five built-in criteria for linear models: R2, adjusted R2, Akaike and Bayesian information criteria (AIC and BIC), and predicted residual error sum of squares (Allen's PRESS) [1,2,13]. The published model is infeasible regardless of the criterion function used to validate it. Furthermore, with the occasional exception of PRESS (which is distinct in optimizing the prediction accuracy rather than descriptive ability of a model), all criteria exhibit good agreement with respect to the model(s) favored for a specified number of parameters. For these reasons, we report only the adjusted R2 and PRESS statistic in the main text. Values for the other criteria are provided as an Appendix.

Table 1 contains a list of feasible models, identified using the feasibility-validation procedure described in Section 3.2, to supplement or replace the original published model. These include one model for patients with bilateral and two models for unilateral cataracts, with the latter models afforded by optimizing the adjusted R2 and PRESS criteria, respectively. Regardless of the chosen criterion function, we are able to improve upon the performance of the model published by Rong et al., thereby supporting our assertion that the feasibility framework is a viable supplement to existing model-selection paradigms.

In addition to searching for more-informative models containing an equal number of parameters to the published model (i.e. the same value of k), we also test whether any of the predictors in the published model remain significant when added to the feasible models in Table 1. Several variables appear in the feasible models as well as in the published model, and thus need not be considered in this manner. The feasible model for bilateral eyes omits the original model variables for age at extraction and refractive error; when either of these variables is added to the feasible model, individual t-tests with significance level α = 0.05 reveal that the effect of neither age at extraction (p = 0.126) nor refractive error (p = 0.487) is significant. Likewise, the predictors in the published model for unilateral subjects are not significant when added individually to either feasible model.

With this case study, we demonstrate that the feasibility paradigm can be used to discover models that are equally parsimonious to those generated by the original authors, yet exhibit superior descriptive and/or predictive power. Thus by incorporating the notion of feasibility, we can identify models that might not be considered on the basis of an expert opinion, but which nonetheless may provide valuable information in response to domain-specific research questions.

4.3.2. Fawn data

Bonar et al. (2016) compiled our second target dataset for use in exploring how a fawn's juvenile environment affects its risk of predation prior to reaching maturity [3]. The dataset [4] (Figure 9) thus contains demographic, environmental, and longevity information for fawns under study during their first eight weeks of life. Demographic information includes sex, species, birth year, initial capture date and age, mother identification number, and whether the fawn had a twin. Environmental information includes values averaged over all sightings for a fawn's elevation; height of surrounding vegetation; steepness and ruggedness of the immediate environment; and distances to human settlement and riparian ground [3]. Climate information was aggregated to afford annual measures of spring vegetation, and winter precipitation and wind speed. Finally, for a subset of 82 mule and 65 white-tailed fawns, the authors report each fawn's distance to the nearest coyote den, as well as estimated annual densities of adult, female mule or white-tailed deer. After verifying the reproducibility of published results, we choose to focus our analysis on this subset of N = 147 fawns because they compose the largest self-contained subset with data for all variables of interest (Note 2).

Figure 9. Subset of fawn dataset.

Bonar et al. undertook a model-selection procedure aimed at modelling a fawn's survival based on a variety of physical and social factors in its immediate environment, including the terrain, climate, and density of adult female deer. Species was included as a covariate in all models, and mother ID as a random effect; we maintain these requirements for comparability of results. Bonar et al. fitted generalized estimating equations (GEEs) using the R package geepack with an exchangeable covariance structure; empirical standard errors (default); and a binomial random component with logit link function [8,24,25]. Models were ranked according to penalized Quasi-likelihood under the Independence model Criterion (QICu), a generalized analog to AIC [15]. This subset of our study therefore serves as an apt demonstration of the flexibility of the feasibility paradigm to any desired model form and criterion function.
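
A hedged sketch of the corresponding GEE fit is given below; the variable names (survival, species, md_density, mother_id) are placeholders rather than the names used in the Dryad dataset, and the QICu used for ranking is computed separately following Pan [15].

```r
## Binomial GEE with logit link, exchangeable working correlation, and
## mother ID as the clustering variable (placeholder variable names).
library(geepack)
fit <- geeglm(survival ~ species + md_density,
              id     = mother_id,
              family = binomial(link = "logit"),
              corstr = "exchangeable",
              data   = fawns)
summary(fit)
```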

Bonar et al. examined 14 models acting on the targeted subset of fawns. As for our first dataset, we consider the feasibility of each non-trivial, first-order model. All models containing more than two predictors were infeasible (Table 2; Note 3), and thus would benefit from consideration of feasible models as an extension to those published. Among models containing one or two predictors (with and without the associated two-way interaction), rFSA initialized at a random starting model was able to identify the feasible model reported by Bonar et al. These results demonstrate the viability of the feasibility framework for use in identifying data-driven models with equal or superior performance relative to models formulated manually, even in the context of more-complex model structures.

Table 2. Comparison of published first-order models [3] and feasible models identified by rFSA in the course of vetting published models for feasibility.
| Model | Species | Steepness | NDVI | MD Density | WT Density | Winter rain | Winter wind | Slope | Usage | Dist. Builds. | Birth Year | Dist. Coyote | QICu |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Model 1A | X | X | X | | | | | | | | | | 143.15 |
| Feasible | X | | | X | | | | X | | | | | 135.45 |
| Model 1B | X | X | X | | | X | X | | | | | | 136.66 |
| Feasible | X | | X | | | | X | X | X | | | | 129.85 |
| Model 2A | X | | | X | | | | | | | | | 141.93 |
| Feasible | X | | | X | | | | | | | | | 141.93 |
| Model 3A | X | X | X | X | | | | | | | | | 135.35 |
| Feasible | X | | | X | | | | X | | X | | | 132.01 |
| Model 3C | X | X | X | X | | X | X | | | | | | 137.44 |
| Feasible | X | | X | X | | | | X | X | | X | | 130.41 |
| Model 4A | X | X | X | | X | | | | | | | | 144.46 |
| Feasible | X | | | X | | | | X | | X | | | 132.01 |
| Model 4C | X | X | X | | X | X | X | | | | | | 137.44 |
| Feasible | X | | X | X | | | | X | X | | X | | 130.41 |

5. Future work

The value of feasibility lies as much in its viability for characterizing models as for selecting them. By extension, we would like to explore more deeply the models produced by various existing model-selection procedures and criteria (a) with and without complementation by feasibility, and (b) as compared to one another when each is supplemented by feasibility (e.g. LASSO + feasibility as compared to stepwise selection + feasibility). As we argued earlier in the manuscript, gaining a deep understanding of the complex natural phenomena governing a high-dimensional dataset likely requires exploring a variety of models. Nonetheless, we also intend to characterize the practicality of, and derive statistical optimality guarantees for, models identified under the feasibility framework.

6. Concluding remarks

In this paper, we introduce feasibility as a framework for subset selection and model validation. We further illustrate the versatility of the paradigm by executing a feasibility-based algorithm, implemented as R package rFSA (with associated Shiny Application available online), on real-life datasets requiring a combination of linear and generalized linear models. These prototypical analytical efforts demonstrate the viability of the feasibility concept as a framework for identifying valuable relationships underlying a dataset and/or improving upon existing models by validating for feasibility. These features of the feasibility framework would benefit a variety of research fields analyzing topics anywhere from Census data to Medicaid or genomic (single nucleotide polymorphism) data.

The concept of feasibility, and our implementation thereof, overcomes many of the challenges associated with automatic model-selection procedures. Not only can one use rFSA or the Shiny Application to analyze high-dimensional data on standard computing resources, but the algorithm reduces artificial limitations on researchers' understanding of the relationships they seek to explore by producing multiple feasible solutions for empirical validation. Moreover, in contrast to rival machine-learning techniques, the resulting models permit facile interpretation and deeper exploration by standard modes of statistical inference. By leveraging feasibility as a framework for model selection, researchers have the potential to identify promising candidate models, and thus draw valuable insights, from large data obeying virtually any modelling paradigm.

Funding Statement

This work was supported by the Kentucky Biomedical Research Infrastructure and INBRE National Institute of General Medical Sciences Grant [P20 RR16481]; and a National Multiple Sclerosis Society Pilot Grant [PP-1609-25975].

Notes

1. Note that one analysis by LASSO with λ = 0.08 and five analyses by forward selection with α-to-enter = 0.01 each produced only one recommended variable, and were therefore omitted from Figures 6(A) and 6(B), respectively.

2. One additional fawn was omitted from this analysis due to missing data, yielding N = 146.

3. QICu values reported for published models are updated from those in Bonar et al. [3] to omit a fawn that died from illness rather than predation. All table values are internally comparable.

Data availability statement

The data that support the findings of this study are publicly available in the Dryad Digital Repository at https://datadryad.org/, with digital object identification numbers of 10.5061/dryad.5t9d1 (cataract data, [17]) and 10.5061/dryad.bg04r (fawn data, [4]), respectively.

Disclosure statement

No potential conflict of interest was reported by the authors.

Appendix 1. Full table of criterion values for cataract models

Table 3. (Appendix) Criterion values for published models [16] and feasible models identified by rFSA in the course of vetting published models for feasibility.

| | Model | R2 | adj R2 | AIC | BIC | PRESS |
|---|---|---|---|---|---|---|
| Bilateral | Published | 0.4415 | 0.4100 | 182.41 | 196.39 | 51.42 |
| | Feasible 1 | 0.6389 | 0.6186 | 149.25 | 163.24 | 31.73 |
| Unilateral | Published | 0.7709 | 0.7393 | 57.37 | 66.53 | 11.04 |
| | Feasible 1 | 0.7989 | 0.7711 | 52.95 | 62.11 | 9.98 |
| | Feasible 2 | 0.8529 | 0.8326 | 42.30 | 51.46 | 6.41 |

References

1. Allen D.M., The prediction sum of squares as a criterion for selecting predictor variables, University of Kentucky, 1971.
2. Allen D.M., The relationship between variable selection and data augmentation and a method for prediction, Technometrics 16 (1974), pp. 125–127. doi: 10.1080/00401706.1974.10489157
3. Bonar M., Manseau M., Geisheimer J., Bannatyne T., and Lingle S., The effect of terrain and female density on survival of neonatal white-tailed deer and mule deer fawns, Ecol. Evol. 6 (2016), pp. 4387–4402. doi: 10.1002/ece3.2178
4. Bonar M., Manseau M., Geisheimer J., Bannatyne T., and Lingle S., The effect of terrain and female density on survival of neonatal white-tailed deer and mule deer fawns, electronic dataset, Dryad Digital Repository (2016). doi: 10.5061/dryad.bg04r
5. Cantoni E., Mills Flemming J., and Ronchetti E., Variable selection in additive models by non-negative garrote, Stat. Model. 11 (2011), pp. 237–252. doi: 10.1177/1471082X1001100304
6. Foster D. and Stine R., Variable selection in data mining: building a predictive model for bankruptcy, J. Am. Stat. Assoc. 99 (2004), pp. 303–313. doi: 10.1198/016214504000000287
7. Goudey B., Abedini M., Hopper J.L., Inouye M., Makalic E., Schmidt D.F., Wagner J., Zhou Z., Zobel J., and Reumann M., High performance computing enabling exhaustive analysis of higher order single nucleotide polymorphism interaction in Genome Wide Association Studies, Health Inf. Sci. Syst. 3, S3 (2015). doi: 10.1186/2047-2501-3-S1-S3
8. Halekoh U., Hojsgaard S., and Yan J., The R package geepack for generalized estimating equations, J. Stat. Softw. 15 (2006), pp. 1–11. doi: 10.18637/jss.v015.i02
9. Hastie T., Tibshirani R., and Friedman J., The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer, New York, 2009.
10. Hawkins D., The feasible solution algorithm for least trimmed squares regression, Comput. Stat. Data Anal. 17 (1994), pp. 185–196. doi: 10.1016/0167-9473(92)00070-8
11. Hebbali A., olsrr: Tools for building OLS regression models (2018). https://CRAN.R-project.org/package=olsrr, R package version 0.5.1.
12. Kutner M., Nachtsheim C., Neter J., and Li W., Applied Linear Statistical Models, 5th ed., McGraw-Hill/Irwin, New York, 2004.
13. Lambert J., Gong L., Elliott C., Thompson K., and Stromberg A., rFSA: an R package for finding best subsets and interactions, R J. 10 (2018), pp. 295–308. doi: 10.32614/RJ-2018-059
14. Lumley T. and Miller A., leaps: regression subset selection (2009). https://cran.r-project.org/web/packages/leaps/, R package version 2.9.
15. Pan W., Akaike's information criterion in generalized estimating equations, Biometrics 57 (2001), pp. 120–125. doi: 10.1111/j.0006-341X.2001.00120.x
16. Rong X., Ji Y., Fang Y., Jiang Y., and Lu Y., Long-term visual outcomes of secondary intraocular lens implantation in children with congenital cataracts, PLoS ONE 10 (2015), e0134864. doi: 10.1371/journal.pone.0134864
17. Rong X., Ji Y., Fang Y., Jiang Y., and Lu Y., Long-term visual outcomes of secondary intraocular lens implantation in children with congenital cataracts, electronic dataset, Dryad Digital Repository (2015). doi: 10.5061/dryad.5t9d1
18. Sauerbrei W., Royston P., and Binder H., Selection of important variables and determination of functional form for continuous predictors in multivariable model building, Stat. Med. 26 (2007), pp. 5512–5528. doi: 10.1002/sim.3148
19. Simon N., Friedman J., Hastie T., and Tibshirani R., Regularization paths for Cox's proportional hazards model via coordinate descent, J. Stat. Softw. 39 (2011), pp. 1–13. doi: 10.18637/jss.v039.i05
20. Steyerberg E., Eijkemans M., Harrell F. Jr., and Habbema J., Prognostic modelling with logistic regression analysis: a comparison of selection and estimation methods in small data sets, Stat. Med. 19 (2000), pp. 1059–1079.
21. Thompson K., Correct model selection using R2 and Akaike Information Criterion in big data analysis, in preparation (2018).
22. Tibshirani R., Regression shrinkage and selection via the lasso, J. R. Stat. Soc. Ser. B (Methodological) 58 (1996), pp. 267–288.
23. Tibshirani R., Regression shrinkage and selection via the lasso: a retrospective, J. R. Stat. Soc. Ser. B (Statistical Methodology) 73 (2011), pp. 273–282. doi: 10.1111/j.1467-9868.2011.00771.x
24. Yan J., geepack: yet another package for generalized estimating equations, R-News 2 (2002), pp. 12–14.
25. Yan J. and Fine J., Estimating equations for association structures, Stat. Med. 23 (2004), pp. 859–880. doi: 10.1002/sim.1650


