1. Introduction
We thank the editors for inviting a discussion of our article, and the discussants (Lu and Ishwaran, 2020, hereafter referred to as IL) for engaging in a stimulating conversation about our work. The discussants raised several interesting points, including:
- the ties between our proposal and the work of Breiman (and others) on variable importance in machine learning;
- the extension of our procedure to an adjusted R2; and
- its performance in scenarios with higher dimensionality.
Here, we comment further on what we view as the primary distinction between our work and previous proposals, as well as on the potential role of cross-validation (also referred to as cross-fitting in this context) in addressing small-sample performance. Below, we use the abbreviation VIM to refer to a variable importance measure, and refer to our follow-up paper (Williamson et al., 2020) as WGSC.
2. Genesis of variable importance measures
IL highlight the relationship between our proposed measures and the Breiman-Cutler family of variable importance measures (BC-VIMs). Indeed, apart from how the reduced model is defined, our procedures do fall under the umbrella of BC-VIMs. However, we arrive at these procedures from a somewhat different perspective than in the existing literature, and this distinction has important repercussions for both interpretation and inference. While we make this point explicit and expand on it in various directions in WGSC, we summarize it below. For this, it is useful to delineate the various definitions and objectives of variable importance assessment — in this discussion, we focus on what we believe are the two predominant perspectives on variable importance.
Traditionally, the objective in variable importance assessment is to quantify the importance of a variable (or set of variables) within the confines of a given (possibly black-box) prediction algorithm. For example, BC-VIMs play a key role in helping to interpret a fitted algorithm, leading to their tremendous popularity in the machine learning literature. The BC-VIM estimand must be interpreted relative to the fitted algorithm, and helps us to understand the extent to which a fitted model makes use of features. In the notation of IL, a BC-VIM is typically given by
\[
\widehat{\mathrm{VIM}} = \mathrm{PE}(\text{reduced model}) - \mathrm{PE}(\text{full model}),
\]
where the full model is considered fixed, the reduced model is defined in some fashion relative to the full model (e.g., via permutation of certain model inputs), and PE(model) refers to the empirically estimated prediction error of the given model. Because it pertains to describing an externally-specified algorithm for prediction, we find it useful to refer to this perspective as a form of extrinsic variable importance.
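To make the extrinsic notion concrete, the following minimal numpy sketch computes a permutation-style BC-VIM. The simulated data and the least-squares "full model" are our own illustrative assumptions, not part of IL's setup; the point is only that the fitted model is held fixed while one input is permuted:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulated data: y depends on x0 only; x1 is pure noise.
n = 2000
X = rng.normal(size=(n, 2))
y = 2.0 * X[:, 0] + rng.normal(size=n)

# The "full model" here is an ordinary least squares fit, treated as fixed.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
pe_full = np.mean((y - X @ beta) ** 2)

def permutation_vim(j):
    """Extrinsic (BC-style) importance of feature j: the increase in
    empirically estimated prediction error after permuting column j,
    with the fitted model held fixed."""
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])
    return np.mean((y - Xp @ beta) ** 2) - pe_full

print(permutation_vim(0))  # large: the fitted model relies heavily on x0
print(permutation_vim(1))  # near zero: the fitted model barely uses x1
```

Note that the resulting quantity describes the fitted model itself: a feature the algorithm happens to ignore receives low importance regardless of its population-level predictive value.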
In contrast, we focus on a notion of variable importance that captures the “optimized predictiveness” attributable to one variable or a set of variables. Specifically, we consider a measure that quantifies the extent to which the oracle predictiveness of a collection of variables decreases when a variable (or set of variables) is removed from consideration. Here, predictiveness of a candidate prediction function can be measured however we wish — in this paper, we focus on the standardized R2, the complement of the standardized mean squared error, but provide many other examples in WGSC — and oracle predictiveness is the greatest predictiveness achievable by any prediction function based on the entire versus reduced set of variables. Concretely, in our current work, the predictiveness of a candidate prediction function f is given by
\[
V(f, P_0) = 1 - \frac{E_{P_0}\{Y - f(X)\}^2}{\mathrm{var}_{P_0}(Y)},
\]
and for the entire set of variables, the oracle predictiveness is
\[
V_0 = \sup_{f \in \mathcal{F}} V(f, P_0) = V(\mu_0, P_0),
\]
whereas the oracle predictiveness of the reduced set of variables can be shown to equal
\[
V_{0,s} = \sup_{f \in \mathcal{F}_s} V(f, P_0) = V(\mu_{0,s}, P_0).
\]
Here, \( \mathcal{F} \) represents the class of all real-valued functions that take as input realizations of the entire variable set and map into the sample space of the outcome variable, \( \mu_0 \) is the true conditional mean outcome given these same variables, and \( \mathcal{F}_s \) and \( \mu_{0,s} \) are defined similarly but with only the reduced set of variables. Variable importance is then taken to be \( \psi_{0,s} = V(\mu_0, P_0) - V(\mu_{0,s}, P_0) \), the standardized difference in R2. This formulation in terms of a contrast in oracle predictiveness, which we study in generality in WGSC, leads to a population parameter that does not describe any externally-provided prediction function. Rather, it captures the prediction potential of one variable (or a set of variables) — in this sense, it provides an intrinsic metric of variable importance — and we are interested in making inference about this parameter. In practice, \( \mu_0 \) and \( \mu_{0,s} \) are unknown and must be estimated, ideally in a flexible manner to reduce the risk of inconsistency, thus leading to a plug-in estimator that is precisely a BC-VIM variant. Nevertheless, the underlying target of inference may differ between the two approaches, and so may the corresponding approaches to statistical inference.
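As an illustration of the plug-in construction for the intrinsic measure, the sketch below contrasts the estimated predictiveness of the full and reduced variable sets. All simulation settings are hypothetical, and an in-sample least-squares fit stands in for the flexible learners one would use in practice to estimate the oracle prediction functions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical simulated data: y = x0 + x1 + noise, so each feature has
# genuine intrinsic importance.
n = 5000
X = rng.normal(size=(n, 2))
y = X[:, 0] + X[:, 1] + rng.normal(size=n)

def predictiveness(fitted, y):
    """Empirical analogue of V(f, P_0): one minus the standardized MSE."""
    return 1.0 - np.mean((y - fitted) ** 2) / np.var(y)

def ols_fit_predict(Xd, y):
    """In-sample least-squares predictions; a stand-in for a flexible
    estimate of the oracle prediction function."""
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return Xd @ beta

v_full = predictiveness(ols_fit_predict(X, y), y)             # all variables
v_reduced = predictiveness(ols_fit_predict(X[:, [1]], y), y)  # x0 removed

vim_hat = v_full - v_reduced  # plug-in estimate of the importance of x0
print(vim_hat)
```

Under this data-generating mechanism the population importance of x0 is 1/3 (its removal raises the residual variance from 1 to 2 out of a total variance of 3), and the plug-in estimate lands near that value.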
We note that this common estimator will always provide a good assessment of the contribution of a variable (or a set of variables) to the predictive mechanism underlying the estimated conditional mean function, be it algorithmic or model-based. However, it will only truly reflect the intrinsic predictiveness of this variable (or set of variables) if the conditional mean function estimator is consistent for its intended target. As such, while we find the intrinsic (agnostic) notion of variable importance particularly appealing, we recognize that it is also a more ambitious measure, and is thus more challenging to learn from data.
3. Finite-sample performance and dimensionality
In simulation studies, IL show that our proposed estimator increasingly suffers from artificial inflation — and thus greater mean squared error in estimating the true variable importance measure — in the presence of a growing number of noise variables, and that a correction factor can successfully mitigate this phenomenon. This is an interesting suggestion that appears at least in part motivated by classical Gaussian theory results. We too had noted this pattern of behavior in numerical investigations. We expected that the use of cross-validation would mitigate this phenomenon, particularly given that cross-validation is known to reduce overfitting in many applications. This would be an added benefit of cross-validation, since we were primarily motivated by the need to eliminate Donsker class conditions unlikely to be satisfied when machine learning tools are used to estimate the oracle prediction functions. In our context, cross-validation does appear to mitigate precisely the poor behavior induced by the injection of noise variables, as can be seen in Figure 1, which augments the simulation study of IL with our cross-validated proposal (details are available on GitHub). We note furthermore that while a theoretical analysis of the properties of this cross-validated procedure is slightly more difficult, it is possible to show that it enjoys desirable large-sample properties (e.g., consistency, root-n convergence to a Gaussian limit) under strictly weaker conditions than its non-cross-validated counterpart. Comprehensive theory and additional details are provided in WGSC. We look forward to further study to delineate the relative pros and cons of the corrected estimator of IL and of our cross-validated procedure.
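A minimal sketch of the cross-fitted construction follows; the simulation settings are again hypothetical, and least-squares fits stand in for flexible learners. The key feature is that, within each fold, the oracle predictions are estimated on the training split and predictiveness is evaluated on the held-out split, so no observation is used for both fitting and evaluation:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data with several pure-noise features (x2, x3, x4).
n = 4000
X = rng.normal(size=(n, 5))
y = X[:, 0] + X[:, 1] + rng.normal(size=n)

def ols_predict(X_train, y_train, X_eval):
    """Fit least squares on the training split, predict on the eval split."""
    beta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    return X_eval @ beta

def cross_fitted_vim(X, y, drop, K=5):
    """Cross-fitted plug-in VIM for the feature(s) in `drop`: average,
    over folds, of the held-out contrast in estimated predictiveness
    between the full and reduced variable sets."""
    n_obs = X.shape[0]
    keep = [j for j in range(X.shape[1]) if j not in drop]
    folds = np.array_split(rng.permutation(n_obs), K)
    estimates = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(n_obs), held_out)
        full = ols_predict(X[train], y[train], X[held_out])
        reduced = ols_predict(X[train][:, keep], y[train], X[held_out][:, keep])
        var_y = np.var(y[held_out])
        v_full = 1.0 - np.mean((y[held_out] - full) ** 2) / var_y
        v_red = 1.0 - np.mean((y[held_out] - reduced) ** 2) / var_y
        estimates.append(v_full - v_red)
    return float(np.mean(estimates))

print(cross_fitted_vim(X, y, drop={0}))  # truly important feature
print(cross_fitted_vim(X, y, drop={4}))  # pure noise feature
```

Because the held-out evaluation cannot reward overfitting to noise variables, the estimated importance of a pure noise feature concentrates near zero rather than being artificially inflated.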
Figure 1: Estimated variable importance for the specified variable(s) using our proposed estimators, the adjusted estimators proposed in IL, and the BC-VIM and k-adjusted BC-VIM based on random forests with default tuning parameter values. This figure appears in color in the electronic version of this article, and any mention of color refers to that version.
4. Bridging the ‘two cultures’?
Breiman (2001) described a dichotomy of cultures in statistics, which pits the use of simple, transparent models to represent the data-generating mechanism — at the cost of likely model misspecification — against the adoption of flexible algorithmic approaches — at the cost of opaqueness and perhaps greater difficulties in making valid statistical inference. Our view is that the development and popularization of principled inferential methods based on machine learning tools has blurred the lines between these two cultures.
While much of conventional practice follows the traditional model-based approach to statistics, wherein the estimands considered (explicitly or implicitly) are often coefficients indexing simple statistical models, there is increasing focus on ‘nonparametric’ estimands whose definition is interpretable without relying on unrealistic modeling assumptions. The fields of causal inference and targeted learning, for example, have been driving forces for this idea, emphasizing the importance of carefully defining the statistical estimand before any deliberation about estimation techniques even occurs. Such parameters involve features of the data-generating distribution that, in the absence of traditional modeling assumptions (e.g., linearity of the conditional mean function), may be more challenging to estimate, prompting the consideration of flexible learning strategies (e.g., machine learning) for this purpose. Dedicated techniques are then needed in order to perform valid inference on these estimands despite the use of flexible learners — thankfully, such techniques (e.g., influence function-based debiasing) have been popularized and made more accessible in the past decade. Thus, nowadays, it is relatively straightforward to make inference on interpretable summaries of a distribution in more realistic (nonparametric or semiparametric) models, aided by flexible learning algorithms. As such, we see the two cultures of Breiman (2001) as no longer forming a dichotomy, since the primary tenets of both cultures can indeed be respected simultaneously.
We see our approach as an example of precisely this closure of the culture gap: using our results, a practitioner can perform valid statistical inference on an interpretable population VIM without relying on correctly specifying a simple population model. This involved, as a key step, choosing a target parameter defined broadly rather than only within the confines of some particular regression model. Surprisingly, as we highlighted briefly in our paper but studied in generality in WGSC, in our particular context, sophisticated debiasing techniques are not needed, at least if the estimand is taken as a contrast in best-case relative R2 (as opposed to the equivalent ANOVA formulation), even though our estimation procedure relies on the use of flexible learners. This observation is a fortunate anomaly that stems from the definition of oracle predictiveness as an optimized criterion, thus rendering second-order the contribution of estimators of the oracle prediction functions to the large-sample behavior of the resulting VIM plug-in estimator. This fact greatly simplifies our inferential procedure.
5. Conclusion
We again thank the discussants for their thoughtful discussion of our work and for the interesting ideas they have put forth. We look forward to further research toward bridging the gap between traditional statistical inference and machine learning.
References
- Breiman L (2001). Statistical modeling: The two cultures. Statistical Science 16, 199–231.
- Lu M and Ishwaran H (2020). Discussion of “Nonparametric variable importance assessment using machine learning techniques”. Biometrics.
- Williamson B, Gilbert P, Simon N, and Carone M (2020). A unified approach for inference on algorithm-agnostic variable importance. arXiv preprint.
