Published in final edited form as: Biometrics. 2020 Dec 8;77(1):23–27. doi: 10.1111/biom.13391

Discussion on “Nonparametric variable importance assessment using machine learning techniques”

Brian D Williamson 1, Peter B Gilbert 1, Marco Carone 1, Noah Simon 1, Min Lu 1, Hemant Ishwaran 1

1. Introduction: Breiman-Cutler variable importance

Williamson et al. present a variable importance index that is related to $R^2$ but has the property of being free of model specification. We congratulate the authors on this very interesting paper.

Our first point is to draw an important connection between this work and existing work in machine learning. In their Introduction, the authors briefly mention the variable importance measure used in Breiman (2001). The authors state that this and related measures used for random forests are intimately tied to the specific estimation technique used, in contrast to the agnostic procedure they propose. However, we will argue that there is a much deeper connection than might be apparent.

Writing PE for prediction error, the general principle underlying Breiman (2001) is to define importance through prediction error. Breiman (2001) defined the variable importance $\hat{I}_{n,s}$ for a set of variables $s$ as the difference in prediction error between the model without $s$ and the full model,

$$\hat{I}_{n,s} = \mathrm{PE}(\text{model without } s) - \mathrm{PE}(\text{full model}). \tag{A}$$

The rationale is that if $s$ contains informative variables, then removing $s$ will increase prediction error and $\hat{I}_{n,s} > 0$; the larger the value, the stronger the evidence of the importance of $s$. On the other hand, if $s$ contains only noise variables, then removing $s$ may reduce prediction error relative to the full model, or at the very least will not increase it, and thus $\hat{I}_{n,s} \leq 0$.
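To make principle (A) concrete, here is a minimal R sketch of our own (not code from either paper): the importance of a set $s$ is estimated by refitting without $s$ and comparing held-out prediction error. The toy data and the use of lm are purely illustrative.

```r
## Minimal sketch of principle (A): difference in held-out prediction error
## between the model without s and the full model. Illustrative only.
set.seed(1)
n <- 500; p <- 10
x <- as.data.frame(matrix(rnorm(n * p), n, p))
names(x) <- paste0("x", 1:p)
y <- 2 * x$x6 + 2 * x$x7 + rnorm(n)            # only x6, x7 are informative
dat <- cbind(y = y, x)

train <- sample(n, n / 2); test <- setdiff(1:n, train)
s <- c("x6", "x7")                              # variable set under assessment

full <- lm(y ~ ., data = dat[train, ])
restricted <- lm(y ~ ., data = dat[train, !(names(dat) %in% s)])

pe <- function(fit) mean((dat$y[test] - predict(fit, dat[test, ]))^2)
pe(restricted) - pe(full)                       # equation (A); > 0 suggests importance
```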

How one actually calculates prediction error for the $s$-restricted model is crucial, and this is one of the clever aspects of Breiman (2001). The idea was developed by Leo Breiman in collaboration with Adele Cutler and is therefore often called Breiman-Cutler variable importance; we hereafter abbreviate it as BC-VIMP (where VIMP stands for variable importance; Ishwaran, 2007; Ishwaran et al., 2008). Let $\hat{\mu}_T(X, \Theta_v)$, $v = 1, \ldots, V$, be the tree estimators of the unknown target function $\mu_0(X)$, estimated with respect to a loss function $\ell(Y, \mu)$. Here $\{\Theta_v\}_{v=1}^{V}$ are i.i.d. random instructions used to grow each tree. These include, for example, instructions for splitting the learning data into training data used to grow the tree and out-of-sample data used for testing; the latter indices are denoted by $O_v$ for tree $v$.

BC-VIMP is obtained by averaging over the tree VIMP values, $\hat{I}_{n,s} = V^{-1} \sum_{v=1}^{V} \hat{I}_{n,s}^{v}$, where $\hat{I}_{n,s}^{v}$ is the VIMP for tree $v$ defined by

$$\hat{I}_{n,s}^{v} = \frac{1}{|O_v|} \sum_{i \in O_v} \ell\left(Y_i, \hat{\mu}_T\left(\tilde{X}_i^{(s)}, \Theta_v\right)\right) - \frac{1}{|O_v|} \sum_{i \in O_v} \ell\left(Y_i, \hat{\mu}_T\left(X_i, \Theta_v\right)\right). \tag{B}$$

The value $\tilde{X}^{(s)}$ represents $X$ with the coordinates $X_s$ randomly permuted. In (B), only the out-of-sample cases $O_v$ have their $s$ coordinates randomly permuted. These values $(\tilde{X}_i^{(s)})_{i \in O_v}$ are run through the tree to estimate $\mathrm{PE}(\text{model without } s)$.

There are two key points in the above calculation that we highlight:

(P1) Calculating the $s$-restricted model estimator: How one calculates the $s$-restricted model estimator is very flexible. BC-VIMP "noises up" the $s$ coordinates, $X_s$, by permuting them, thereby obtaining a noised-up estimator intended to mimic a model with $s$ removed. The main advantage of this is that it is computationally fast, and it can therefore be used for high-dimensional and big data problems.

(P2) The same estimator is used to calculate both terms in (B): The same estimator $\hat{\mu}_T(X, \Theta_v)$ is used to calculate the prediction error for both the $s$-restricted and the full model. Using the same tree harness eliminates Monte Carlo variability that would otherwise occur.
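A companion sketch of (P1) and (P2), again our own illustration rather than Breiman's implementation: a single fitted learner stands in for a tree, the held-out copies of $X_s$ are permuted (P1), and the same fit is used for both prediction error terms (P2).

```r
## Permutation-style VIMP in the spirit of (B): permute X_s on held-out cases
## (P1) and reuse the same fitted estimator for both terms (P2). Illustrative only.
set.seed(2)
n <- 500; p <- 10
dat <- as.data.frame(matrix(rnorm(n * p), n, p)); names(dat) <- paste0("x", 1:p)
dat$y <- 2 * dat$x6 + 2 * dat$x7 + rnorm(n)      # only x6, x7 are informative

train <- sample(n, n / 2); test <- setdiff(1:n, train)
s <- c("x6", "x7")

fit <- lm(y ~ ., data = dat[train, ])            # stand-in for a single tree
newdat <- dat[test, ]
for (j in s) newdat[[j]] <- sample(newdat[[j]])  # noise up X_s out of sample only

pe_perm <- mean((dat$y[test] - predict(fit, newdat))^2)
pe_full <- mean((dat$y[test] - predict(fit, dat[test, ]))^2)
pe_perm - pe_full                                # analogue of (B); > 0 suggests importance
```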

2. VIMP for regression and a general framework

The framework (B) and its overarching principle (A) are very general and applicable to many settings. For example, extensions include random survival forests (Ishwaran et al., 2008) and competing risk forests (Ishwaran et al., 2014). Approaches based on (A) are also used by many machine learning methods, such as boosting. In fact, nothing specific to a tree or a random forest is required in order to use (A). Thus we can describe a more general framework by replacing the estimator $\hat{\mu}_T$ with any other estimator, which we call $\hat{\mu}^*$. As before, the estimator is constructed from training data and prediction error is calculated using out-of-sample data, denoted by $O_v$, $v = 1, \ldots, V$. Write $\hat{\mu}_v^*$ for the training-data estimator. We now have the following general VIMP framework,

$$\hat{I}_{n,s}^{v} = \frac{1}{|O_v|} \sum_{i \in O_v} \ell\left(Y_i, \hat{\mu}_{s,v}^{*}(X_i)\right) - \frac{1}{|O_v|} \sum_{i \in O_v} \ell\left(Y_i, \hat{\mu}_{v}^{*}(X_i)\right). \tag{C}$$

In (C), $\hat{\mu}_{s,v}^{*}$ denotes the $s$-restricted estimator. The only constraint is that it satisfies (P2), requiring that it be built from the full model estimator $\hat{\mu}_v^{*}$. One example we have discussed is permutation importance, $\hat{\mu}_{s,v}^{*}(X_i) = \hat{\mu}_v^{*}(\tilde{X}_i^{(s)})$. However, this is not the only method, nor necessarily the best method. For example, Lu and Ishwaran (2018) described a general technique for restricted model estimation in parametric models: rather than permuting $X_s$, they replace the estimated coefficients for $X_s$ in the full model with zeros.
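A minimal sketch of this zero-coefficient idea for a linear model, written as we understand Lu and Ishwaran (2018) rather than as their implementation; the helper name restricted_predict is ours.

```r
## Restricted estimation by zeroing coefficients (in the spirit of Lu and
## Ishwaran, 2018): fit the full parametric model once, then form the
## s-restricted predictions by setting the coefficients for X_s to zero.
restricted_predict <- function(fit, newdata, s) {
  beta <- coef(fit)
  beta[names(beta) %in% s] <- 0                  # zero out the coefficients for X_s
  tt <- delete.response(terms(fit))
  X <- model.matrix(tt, model.frame(tt, newdata))
  drop(X %*% beta)
}

## Example usage with the toy data from the earlier sketches:
## full <- lm(y ~ ., data = dat[train, ])
## yhat_s <- restricted_predict(full, dat[test, ], s = c("x6", "x7"))
```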

Now we show how this relates to the authors' work. Let us begin by looking at equation (10) of their paper. For the comparison we rescale their estimator by the sample variance and denote the rescaled estimator with a tilde. The authors' rescaled equation (10) is

$$\tilde{\psi}_{n,s} = \frac{1}{n} \sum_{i=1}^{n} \left\{Y_i - \hat{\mu}_s(X_i)\right\}^2 - \frac{1}{n} \sum_{i=1}^{n} \left\{Y_i - \hat{\mu}(X_i)\right\}^2.$$

Thus $\tilde{\psi}_{n,s}$ is the difference in sum-of-squares between the restricted and the full model. As this is a measure calculated from the full learning data, it is not directly comparable to (C). However, Williamson et al. also describe a cross-validated version of their estimator in their Algorithm 2. Their rescaled estimator (line 6, Algorithm 2) is

$$\tilde{\psi}_{n,s}^{v} = \frac{1}{|D_v|} \sum_{i \in D_v} \left\{Y_i - \hat{\mu}_{s,v}(X_i)\right\}^2 - \frac{1}{|D_v|} \sum_{i \in D_v} \left\{Y_i - \hat{\mu}_{v}(X_i)\right\}^2, \tag{D}$$

where $D_v$ is a test-set fold, $v = 1, \ldots, V$, using $V$-fold estimation. Here $\hat{\mu}_v$ denotes the full model estimator calculated using the $v$th-fold training data, and $\hat{\mu}_{s,v}$ is the $s$-restricted model estimator calculated using the same $v$th-fold training data.

When specialized to $L_2$ loss, $\ell(Y, \mu) = (Y - \mu)^2$, a comparison of (C) with (D) reveals an obvious similarity. It is also now clear that (D) is an example of the general VIMP principle (A). Indeed, the key issue comes down to (P1): how one defines the restricted model estimator. The approach used by the authors to calculate $\hat{\mu}_{s,v}$ (line 5, Algorithm 2) is to take the full model estimator $\hat{\mu}_v$ and regress it on the remaining covariates, those not in $s$ (a sketch of this step is given after Table 1). This is very interesting, because what the authors are proposing is essentially a new technique for restricted model estimation for regression. It can be added to the growing list of such techniques (Table 1) and is, in our opinion, a valuable hidden contribution of the paper. Note that the rationale for using the full model estimator $\hat{\mu}_v$ to calculate both prediction error terms, as in (P2), is explained by the authors on page 9 of their paper. They make special mention of this, stating that they did so because the more obvious method of regressing $Y$ on the remaining covariates was found to generally lead to incompatible results.

Table 1.

Different methods and their relationship to the general VIMP framework.

Method | Loss, $\ell(Y, \mu)$ | Learner, $\mu^*$ | $s$-restricted estimator, $\hat{\mu}_{s,v}^{*}$
$\hat{\psi}_{n,s}^{v}$ † | $(Y - \mu)^2$ | Any | Regress $\hat{\mu}_v$ on the covariates not in $s$
BC-VIMP †† | Any | Tree | Permute $X_s$
LI-VIMP ‡ | Any | Parametric model | Set coefficients for $X_s$ to zero
LI-VIMP ‡ | Any | Nonparametric model | Permute $X_s$
$\hat{I}_{n,s}^{v}$ ‡‡ | Any | Any | Any

† Williamson et al.'s estimator from Algorithm 2.

†† Breiman-Cutler VIMP.

‡ Refers to the method of Lu and Ishwaran (2018).

‡‡ Generalized VIMP; see (C).
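To make the cross-fitted estimator (D) and the authors' style of restricted estimation concrete, here is a minimal R sketch of our own. The learner is lm purely to keep the sketch self-contained (the authors use flexible learners such as gradient boosted trees), and the restricted fit follows our reading of line 5 of Algorithm 2: regress the full-model fitted values on the covariates not in $s$.

```r
## Minimal sketch of the cross-fitted estimator (D) with restricted estimation
## done by regressing the full-model fitted values on the covariates not in s.
## lm() is used only to keep the sketch self-contained. Illustrative only.
set.seed(3)
n <- 500; p <- 10; V <- 5
dat <- as.data.frame(matrix(rnorm(n * p), n, p)); names(dat) <- paste0("x", 1:p)
dat$y <- 2 * dat$x6 + 2 * dat$x7 + rnorm(n)
s <- c("x6", "x7")
fold <- sample(rep(1:V, length.out = n))

psi_v <- sapply(1:V, function(v) {
  trn <- dat[fold != v, ]; tst <- dat[fold == v, ]
  full <- lm(y ~ ., data = trn)                              # full-model estimator
  trn$fit <- predict(full, trn)
  reduced <- reformulate(setdiff(names(dat), c("y", s)), response = "fit")
  restricted <- lm(reduced, data = trn)                      # regress the fit on covariates not in s
  mean((tst$y - predict(restricted, tst))^2) - mean((tst$y - predict(full, tst))^2)
})

psi_tilde <- mean(psi_v)          # analogue of (D), averaged over folds
psi_tilde / var(dat$y)            # divide by the sample variance to recover the authors' scale
```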

3. Dimensionality: some extensions and the benefit of prediction error

Another point we wish to mention relates to dimensionality and noise. We show that the magnitude of the authors' estimator $\hat{\psi}_{n,s}$ can depend on the size of $s$; hence values of $\hat{\psi}_{n,s}$ may not be comparable across models of different dimension. As an attempt to account for this phenomenon, we can define an adjusted $\hat{\psi}_{n,s}$, similar in spirit to adjusted $R^2$,

$$\hat{\psi}_{n,s}^{\mathrm{adj}} = \left[\frac{1}{n - (p - |s|) - 1} \sum_{i=1}^{n} \left\{Y_i - \hat{\mu}_s(X_i)\right\}^2 - \frac{1}{n - p - 1} \sum_{i=1}^{n} \left\{Y_i - \hat{\mu}(X_i)\right\}^2\right] \Big/ \mathrm{VAR}_{\mathrm{tot}},$$

where $\mathrm{VAR}_{\mathrm{tot}} = \sum_{i=1}^{n} (Y_i - \bar{Y}_n)^2 / (n - 1)$. However, this adjustment could be too weak because $n - (p - |s|) - 1$ can be dominated by $n$. Another option is to multiply the first term by a factor $k := k(p, s)$ and to modify the estimator as follows:

$$\hat{\psi}_{n,s}^{k} = \left[\frac{k(p,s)}{n - (p - |s|) - 1} \sum_{i=1}^{n} \left\{Y_i - \hat{\mu}_s(X_i)\right\}^2 - \frac{1}{n - p - 1} \sum_{i=1}^{n} \left\{Y_i - \hat{\mu}(X_i)\right\}^2\right] \Big/ \mathrm{VAR}_{\mathrm{tot}}.$$

One example of such a factor is $k(p, s) = \ln(p/|s|)/\ln(p)$; when $|s| = 1$ we have $k(p, s) = 1$ and $\hat{\psi}_{n,s}^{k} = \hat{\psi}_{n,s}^{\mathrm{adj}}$.
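Both adjustments are simple to compute from fitted values. The small R helper below is our own illustrative code; mu_hat and mu_hat_s stand for fitted values from the full and $s$-restricted regressions, however those are obtained.

```r
## Adjusted VIMP estimators: psi_adj uses adjusted-R^2-style degrees-of-freedom
## corrections; psi_k additionally multiplies the restricted term by
## k(p, s) = log(p / |s|) / log(p) to penalize larger feature sets.
psi_adjusted <- function(y, mu_hat, mu_hat_s, p, s_size) {
  n <- length(y)
  var_tot <- sum((y - mean(y))^2) / (n - 1)
  rss_s <- sum((y - mu_hat_s)^2)                 # restricted-model residual sum of squares
  rss   <- sum((y - mu_hat)^2)                   # full-model residual sum of squares
  k <- log(p / s_size) / log(p)                  # dimensionality factor; k = 1 when |s| = 1
  psi_adj <- (rss_s / (n - (p - s_size) - 1) - rss / (n - p - 1)) / var_tot
  psi_k   <- (k * rss_s / (n - (p - s_size) - 1) - rss / (n - p - 1)) / var_tot
  c(psi_adj = psi_adj, psi_k = psi_k)
}
```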

We use the simulation of setting A in Section 3.3 to display how $\hat{\psi}_{n,s}$, $\hat{\psi}_{n,s}^{\mathrm{adj}}$, and $\hat{\psi}_{n,s}^{k}$ change as the feature set expands: $s = \{6\}, \{6, 7\}, \{6, 7, 8\}, \{6, 7, 8, 9\}, \{6, 7, 8, 9, 10\}$. We follow Algorithm 1 and estimate $\hat{\mu}$ and $\hat{\mu}_s$ using gradient boosted trees, as the authors did. A total of 50 independent datasets of sizes $n = 300$ and $n = 500$ were simulated. Results are displayed in Figure 1. Since $X_6$ and $X_7$ are the only informative variables, we expect variable importance to increase from $s = \{6\}$ to $s = \{6, 7\}$. In other words, for an estimator $\hat{\psi}_{n,s}^{*}$, we would expect $\hat{\psi}_{n,\{6,7\}}^{*} > \hat{\psi}_{n,\{6\}}^{*}$ if $\hat{\psi}_{n,\{7\}}^{*} > \hat{\psi}_{n,\{6\}}^{*}$ and $\hat{\psi}_{n,s}^{*}$ measures the "average" effect size of the features in $s$, or $\hat{\psi}_{n,\{6,7\}}^{*} > \hat{\psi}_{n,\{6\}}^{*}$ if $\hat{\psi}_{n,\{7\}}^{*} > 0$ and $\hat{\psi}_{n,s}^{*}$ measures the "joint" effect size of the features in $s$. However, we would not want such increases to occur from $s = \{6, 7\}$ to $s = \{6, 7, 8\}$, from $s = \{6, 7, 8\}$ to $s = \{6, 7, 8, 9\}$, and so forth, since $X_8$, $X_9$, and $X_{10}$ are noise variables.

Figure 1.

Comparison of variable importance measures with increasing size of the feature set $s$ for sample sizes $n = 300$ (left) and $n = 500$ (right). The adjusted variable importance measures $\hat{\psi}_{n,s}^{\mathrm{adj}}$ and $\hat{\psi}_{n,s}^{k}$ are shown in red and blue, respectively, and the unadjusted measure $\hat{\psi}_{n,s}$ is shown in black. The datasets are generated according to the simulation of setting A in Section 3.3, where only $X_6$ and $X_7$ are informative variables in all the chosen feature sets. Values of $\hat{\psi}_{n,s}$ and $\hat{\psi}_{n,s}^{\mathrm{adj}}$ are similar and both become inflated as noise variables are added. On the other hand, $\hat{\psi}_{n,s}^{k}$ performs much better because of its heavy dimensionality penalty. Also included in the figure are BC-VIMP values, displayed in purple. Because it is based on prediction error, BC-VIMP automatically adjusts for dimensionality, does not inflate with the addition of noise, and performs correctly. Orange lines are BC-VIMP multiplied by the dimensionality factor $k$, which further pushes VIMP values downward for overfit models. Finally, note that BC-VIMP has been scaled by $10^{-4}$ and $10^{-3}$ for the left and right panels, respectively. Thus for the model $s = \{6, 7\}$, the value is approximately 7% for $n = 300$ and 10% for $n = 500$. This is very interpretable: it states that the model with the two non-zero variables explains a relatively high fraction of the variance over test data. This type of interpretation is not possible with estimators such as $\hat{\psi}_{n,s}$ that are not prediction error based. This figure appears in color in the electronic version of this article, and any mention of color refers to that version.

Figure 1 shows that, as noise variables are added and the size of $s$ increases, $\hat{\psi}_{n,s}^{k}$ helps reduce the inflated values seen for $\hat{\psi}_{n,s}$. Values for $\hat{\psi}_{n,s}^{\mathrm{adj}}$ are similar to $\hat{\psi}_{n,s}$, confirming that its dimensionality adjustment is too weak. Of course, with such small sample sizes the estimation of $\hat{\psi}_{n,s}$ could be biased; hence variable importance may not measure the average or joint effect sizes in perfect proportion. To test this, we added to Figure 1 another line, "BC-VIMP", giving Breiman-Cutler importance values standardized by $\mathrm{VAR}_{\mathrm{tot}}$. Values were calculated using the vimp function from the randomForestSRC R package (Ishwaran and Kogalur, 2020). We can immediately see that the BC-VIMP values conform to what we expected: importance values increase from $s = \{6\}$ to $s = \{6, 7\}$ and then immediately flatten off. In fact, there even appears to be a downward trend for BC-VIMP for the overfit models. The latter is a benefit of using prediction error, since prediction error will increase with the addition of noise. Finally, we have also adjusted BC-VIMP by $k$ for comparison. As can be seen, the adjustment further pushes importance downward for the overfit models.
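As a rough sketch of how such joint BC-VIMP values can be obtained, the following uses randomForestSRC on toy data; the xvar.names and joint arguments reflect our reading of the package documentation and should be checked against the installed version.

```r
## Joint Breiman-Cutler VIMP for a feature set s using randomForestSRC,
## standardized by the total variance of the response. Illustrative sketch only.
library(randomForestSRC)

set.seed(4)
n <- 300; p <- 10
dat <- as.data.frame(matrix(rnorm(n * p), n, p)); names(dat) <- paste0("x", 1:p)
dat$y <- 2 * dat$x6 + 2 * dat$x7 + rnorm(n)

o <- rfsrc(y ~ ., data = dat)                         # grow the forest
var_tot <- var(dat$y)

bc <- vimp(o, xvar.names = c("x6", "x7"), joint = TRUE)  # joint permutation importance for s
bc$importance / var_tot                                  # standardized BC-VIMP
```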

4. Conclusions

In their paper, the authors have introduced not one but two estimators. The first, $\hat{\psi}_{n,s}$ (Algorithm 1), is constructed from the full learning dataset. Although we did not comment much on this version in our discussion, one major advantage of $\hat{\psi}_{n,s}$ is that it is far more amenable to theoretical analysis than a prediction error based estimator such as their second estimator, $\hat{\psi}_{n,s}^{v}$. In fact, this is what the authors have done. Using empirical processes and semiparametric theory, they have developed a comprehensive analysis of $\hat{\psi}_{n,s}$ that provides justification for the procedure and identifies its asymptotic limiting distribution in certain cases, and we commend the authors for doing so. Regarding the limiting distribution of Theorem 1, the authors use it to develop confidence intervals based on a plug-in estimator. We would like to mention another technique that has been used for constructing confidence intervals for VIMP, based on subsampling (Ishwaran and Lu, 2019). This method is applicable to many settings, including survival, regression, and classification problems. Because it is based on subsampling theory, the conditions it requires differ from those used here. In particular, condition (A1) requires a rate condition for the underlying estimation technique, whereas the subsampling estimator only requires the existence of a limiting distribution. This could be useful, as verifying (A1) may be difficult for machine learning methods. From a more practical standpoint, subsampling is highly computationally efficient and therefore opens up applications to high-dimensional and big data scenarios.
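For completeness, a brief sketch of how the subsampling approach might be invoked in practice; the subsample and plot.subsample calls reflect our reading of the randomForestSRC documentation (Ishwaran and Kogalur, 2020) and are illustrative only.

```r
## Subsampling-based confidence intervals for VIMP (Ishwaran and Lu, 2019),
## sketched with randomForestSRC on toy data. Illustrative only.
library(randomForestSRC)

set.seed(5)
dat <- data.frame(matrix(rnorm(300 * 10), 300, 10)); names(dat) <- paste0("x", 1:10)
dat$y <- 2 * dat$x6 + 2 * dat$x7 + rnorm(300)

o <- rfsrc(y ~ ., data = dat)       # grow the forest
ss <- subsample(o, B = 100)         # B subsampled forests for VIMP variability
plot.subsample(ss)                  # boxplots / confidence intervals for VIMP
```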

This brings us to the authors' second estimator, $\hat{\psi}_{n,s}^{v}$ (Algorithm 2), which we have discussed in more detail. Unlike the full data estimator, this estimator is prediction error based and, as we have argued, is an example of the general principle (A) described by Breiman (2001). As we commented, the authors are proposing a new technique for restricted model estimation (Table 1): to calculate the first term in (A), they regress the full-model estimator on the covariates not in $s$. This is interesting, and we wish they had provided empirical results on its effectiveness. In fact, we believe this estimator will prove superior to $\hat{\psi}_{n,s}$. We have already illustrated some problems with the latter. Other issues appear in the results of the real data applications: all values of $\hat{\psi}_{n,s}$ are positive and the confidence intervals do not cover zero, which could lead one to believe that $\hat{\psi}_{n,s}$ can only rank variables but is unable to separate out noise variables. We believe these and other issues will be remedied by a prediction error based VIMP.

In closing, we congratulate the authors for their contributions to the area of variable importance. We also thank the editor(s) for giving us the opportunity to share our insights on this work. There are many interesting ideas here, and potentially important theoretical work may find inspiration in this article and its discussion.

References

  1. Breiman L (2001). Random forests. Machine Learning 45, 5–32.
  2. Ishwaran H (2007). Variable importance in binary regression trees and forests. Electronic Journal of Statistics 1, 519–537.
  3. Ishwaran H, Kogalur UB, Blackstone EH, and Lauer MS (2008). Random survival forests. The Annals of Applied Statistics 2(3), 841–860.
  4. Ishwaran H, Gerds TA, Kogalur UB, Moore RD, Gange SJ, and Lau BM (2014). Random survival forests for competing risks. Biostatistics 15(4), 757–773.
  5. Ishwaran H and Kogalur UB (2020). Random Forests for Survival, Regression, and Classification (RF-SRC). R package version 2.9.3.
  6. Ishwaran H and Lu M (2019). Standard errors and confidence intervals for variable importance in random forest regression, classification, and survival. Statistics in Medicine 38(4), 558–582.
  7. Lu M and Ishwaran H (2018). A prediction-based alternative to P values in regression models. The Journal of Thoracic and Cardiovascular Surgery 155(3), 1130–1136.
