Published in final edited form as: Biometrics. 2020 Dec 8;77(1):23–27. doi: 10.1111/biom.13391

Discussion on “Nonparametric variable importance assessment using machine learning techniques”

Brian D Williamson 1, Peter B Gilbert 1, Marco Carone 1, Noah Simon 1, Min Lu 1, Hemant Ishwaran 1

1. Introduction: Breiman-Cutler variable importance

Williamson et al. present a variable importance index that is related to $R^2$ but has the property of being free of model specification. We congratulate the authors on this very interesting paper.

Our first point is to draw an important connection between this work and existing work in machine learning. In their Introduction, the authors briefly mention the variable importance measure used in Breiman (2001). The authors state that this and related measures used for random forests are intimately tied to the specific estimation technique used, in contrast to the agnostic procedure they propose. However, we will argue that there is a much deeper connection than might be apparent.

Writing PE for prediction error, the general principle underlying Breiman (2001) is to define importance through prediction error. Breiman (2001) defined the variable importance $\hat{I}_{n,s}$ for a set of variables $s$ as the difference in prediction error between the model without $s$ and the full model,

$$\hat{I}_{n,s} = \mathrm{PE}(\text{model without } s) - \mathrm{PE}(\text{full model}). \tag{A}$$

The rationale is that if $s$ contains informative variables, then removing $s$ will increase prediction error and $\hat{I}_{n,s} > 0$; the larger the value, the stronger the evidence of the importance of $s$. On the other hand, if $s$ contains only noise variables, then removing $s$ may reduce prediction error relative to the full model, or at the very least will not increase it, and thus $\hat{I}_{n,s} \leq 0$.
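To make principle (A) concrete, here is a minimal R sketch of our own (not code from either paper): the importance of a set $s$ is estimated by refitting without $s$ and comparing held-out prediction error. The toy data and the use of lm are purely illustrative.

```r
## Minimal sketch of principle (A): difference in held-out prediction error
## between the model without s and the full model. Illustrative only.
set.seed(1)
n <- 500; p <- 10
x <- as.data.frame(matrix(rnorm(n * p), n, p))
names(x) <- paste0("x", 1:p)
y <- 2 * x$x6 + 2 * x$x7 + rnorm(n)            # only x6, x7 are informative
dat <- cbind(y = y, x)

train <- sample(n, n / 2); test <- setdiff(1:n, train)
s <- c("x6", "x7")                              # variable set under assessment

full <- lm(y ~ ., data = dat[train, ])
restricted <- lm(y ~ ., data = dat[train, !(names(dat) %in% s)])

pe <- function(fit) mean((dat$y[test] - predict(fit, dat[test, ]))^2)
pe(restricted) - pe(full)                       # equation (A); > 0 suggests importance
```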

How one actually calculates prediction error for the $s$-restricted model is crucial, and this is one of the clever aspects of Breiman (2001). The idea was developed by Leo Breiman in collaboration with Adele Cutler and is therefore often called Breiman-Cutler variable importance; we hereafter abbreviate it as BC-VIMP (where VIMP stands for variable importance; Ishwaran, 2007; Ishwaran et al., 2008). Let $\hat{\mu}_T(X, \Theta_v)$, $v = 1, \ldots, V$, be the tree estimators of the unknown target function $\mu_0(X)$, estimated with respect to a loss function $\ell(Y, \mu)$. Here $\{\Theta_v\}_{v=1}^{V}$ are i.i.d. random instructions used to grow each tree. These include, for example, instructions for splitting the learning data into training data used to grow the tree and out-of-sample data used for testing; the latter indices are denoted by $O_v$ for tree $v$.

BC-VIMP is obtained by averaging over the tree VIMP values, $\hat{I}_{n,s} = V^{-1} \sum_{v=1}^{V} \hat{I}_{n,s}^{v}$, where $\hat{I}_{n,s}^{v}$ is the VIMP for tree $v$ defined by

$$\hat{I}_{n,s}^{v} = \frac{1}{|O_v|} \sum_{i \in O_v} \ell\left(Y_i, \hat{\mu}_T\left(\tilde{X}_i^{(s)}, \Theta_v\right)\right) - \frac{1}{|O_v|} \sum_{i \in O_v} \ell\left(Y_i, \hat{\mu}_T\left(X_i, \Theta_v\right)\right). \tag{B}$$

The value $\tilde{X}^{(s)}$ represents $X$ with the coordinates $X_s$ randomly permuted. In (B), only the out-of-sample cases $O_v$ have their $s$ coordinates randomly permuted. These values $(\tilde{X}_i^{(s)})_{i \in O_v}$ are run through the tree to estimate $\mathrm{PE}(\text{model without } s)$.

There are two key points in the above calculation that we highlight:

(P1) Calculating the $s$-restricted model estimator: How one calculates the $s$-restricted model estimator is very flexible. BC-VIMP "noises up" the $s$ coordinates, $X_s$, by permuting them, thereby obtaining a noised-up estimator intended to mimic a model with $s$ removed. The main advantage of this is that it is computationally fast, and it can therefore be used for high-dimensional and big data problems.

(P2) The same estimator is used to calculate both terms in (B): The same estimator $\hat{\mu}_T(X, \Theta_v)$ is used to calculate the prediction error for both the $s$-restricted and the full model. Using the same tree harness eliminates Monte Carlo variability that would otherwise occur.
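A companion sketch of (P1) and (P2), again our own illustration rather than Breiman's implementation: a single fitted learner stands in for a tree, the held-out copies of $X_s$ are permuted (P1), and the same fit is used for both prediction error terms (P2).

```r
## Permutation-style VIMP in the spirit of (B): permute X_s on held-out cases
## (P1) and reuse the same fitted estimator for both terms (P2). Illustrative only.
set.seed(2)
n <- 500; p <- 10
dat <- as.data.frame(matrix(rnorm(n * p), n, p)); names(dat) <- paste0("x", 1:p)
dat$y <- 2 * dat$x6 + 2 * dat$x7 + rnorm(n)      # only x6, x7 are informative

train <- sample(n, n / 2); test <- setdiff(1:n, train)
s <- c("x6", "x7")

fit <- lm(y ~ ., data = dat[train, ])            # stand-in for a single tree
newdat <- dat[test, ]
for (j in s) newdat[[j]] <- sample(newdat[[j]])  # noise up X_s out of sample only

pe_perm <- mean((dat$y[test] - predict(fit, newdat))^2)
pe_full <- mean((dat$y[test] - predict(fit, dat[test, ]))^2)
pe_perm - pe_full                                # analogue of (B); > 0 suggests importance
```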

2. VIMP for regression and a general framework

The framework (B) and its overarching principle (A) are very general and applicable to many settings. For example, extensions include random survival forests (Ishwaran et al., 2008) and competing risk forests (Ishwaran et al., 2014). Approaches based on (A) are also used by many machine learning methods, such as boosting. In fact, nothing specific to a tree or a random forest is required in order to use (A). Thus we can describe a more general framework by replacing the estimator $\hat{\mu}_T$ with any other estimator, which we call $\hat{\mu}^*$. As before, the estimator is constructed from training data and prediction error is calculated using out-of-sample data, denoted by $O_v$, $v = 1, \ldots, V$. Write $\hat{\mu}_v^*$ for the training-data estimator. We now have the following general VIMP framework,

$$\hat{I}_{n,s}^{v} = \frac{1}{|O_v|} \sum_{i \in O_v} \ell\left(Y_i, \hat{\mu}_{s,v}^{*}(X_i)\right) - \frac{1}{|O_v|} \sum_{i \in O_v} \ell\left(Y_i, \hat{\mu}_{v}^{*}(X_i)\right). \tag{C}$$

In (C), $\hat{\mu}_{s,v}^{*}$ denotes the $s$-restricted estimator. The only constraint is that it satisfies (P2), requiring that it be built from the full model estimator $\hat{\mu}_v^{*}$. One example we have discussed is permutation importance, $\hat{\mu}_{s,v}^{*}(X_i) = \hat{\mu}_v^{*}(\tilde{X}_i^{(s)})$. However, this is not the only method, nor necessarily the best method. For example, Lu and Ishwaran (2018) described a general technique for restricted model estimation in parametric models: rather than permuting $X_s$, they replace the estimated coefficients for $X_s$ in the full model with zeros.
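A minimal sketch of this zero-coefficient idea for a linear model, written as we understand Lu and Ishwaran (2018) rather than as their implementation; the helper name restricted_predict is ours.

```r
## Restricted estimation by zeroing coefficients (in the spirit of Lu and
## Ishwaran, 2018): fit the full parametric model once, then form the
## s-restricted predictions by setting the coefficients for X_s to zero.
restricted_predict <- function(fit, newdata, s) {
  beta <- coef(fit)
  beta[names(beta) %in% s] <- 0                  # zero out the coefficients for X_s
  tt <- delete.response(terms(fit))
  X <- model.matrix(tt, model.frame(tt, newdata))
  drop(X %*% beta)
}

## Example usage with the toy data from the earlier sketches:
## full <- lm(y ~ ., data = dat[train, ])
## yhat_s <- restricted_predict(full, dat[test, ], s = c("x6", "x7"))
```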

Now we show how this relates to the authors' work. Let us begin by looking at equation (10) of their paper. For the comparison we rescale their estimator by the sample variance and denote the rescaled estimator with a tilde. The authors' rescaled equation (10) is

$$\tilde{\psi}_{n,s} = \frac{1}{n} \sum_{i=1}^{n} \left\{Y_i - \hat{\mu}_s(X_i)\right\}^2 - \frac{1}{n} \sum_{i=1}^{n} \left\{Y_i - \hat{\mu}(X_i)\right\}^2.$$

Thus $\tilde{\psi}_{n,s}$ is the difference in sum-of-squares between the restricted and the full model. As this is a measure calculated from the full learning data, it is not directly comparable to (C). However, Williamson et al. also describe a cross-validated version of their estimator in their Algorithm 2. Their rescaled estimator (line 6, Algorithm 2) is

$$\tilde{\psi}_{n,s}^{v} = \frac{1}{|D_v|} \sum_{i \in D_v} \left\{Y_i - \hat{\mu}_{s,v}(X_i)\right\}^2 - \frac{1}{|D_v|} \sum_{i \in D_v} \left\{Y_i - \hat{\mu}_{v}(X_i)\right\}^2, \tag{D}$$

where $D_v$ is a test-set fold, $v = 1, \ldots, V$, using $V$-fold estimation. Here $\hat{\mu}_v$ denotes the full model estimator calculated using the $v$th-fold training data, and $\hat{\mu}_{s,v}$ is the $s$-restricted model estimator calculated using the same $v$th-fold training data.

When specialized to $L_2$ loss, $\ell(Y, \mu) = (Y - \mu)^2$, a comparison of (C) with (D) reveals an obvious similarity. It is also now clear that (D) is an example of the general VIMP principle (A). Indeed, the key issue comes down to (P1): how one defines the restricted model estimator. The approach used by the authors to calculate $\hat{\mu}_{s,v}$ (line 5, Algorithm 2) is to take the full model estimator $\hat{\mu}_v$ and regress it on the remaining covariates, those not in $s$ (a sketch of this step is given after Table 1). This is very interesting, because what the authors are proposing is essentially a new technique for restricted model estimation for regression. It can be added to the growing list of such techniques (Table 1) and is, in our opinion, a valuable hidden contribution of the paper. Note that the rationale for using the full model estimator $\hat{\mu}_v$ to calculate both prediction error terms, as in (P2), is explained by the authors on page 9 of their paper. They make special mention of this, stating that they did so because the more obvious method of regressing $Y$ on the remaining covariates was found to generally lead to incompatible results.

Table 1.

Different methods and their relationship to the general VIMP framework.

Method | Loss, $\ell(Y, \mu)$ | Learner, $\mu^*$ | $s$-restricted estimator, $\hat{\mu}_{s,v}^{*}$
$\hat{\psi}_{n,s}^{v}$ † | $(Y - \mu)^2$ | Any | Regress $\hat{\mu}_v$ on the covariates not in $s$
BC-VIMP †† | Any | Tree | Permute $X_s$
LI-VIMP ‡ | Any | Parametric model | Set coefficients for $X_s$ to zero
LI-VIMP ‡ | Any | Nonparametric model | Permute $X_s$
$\hat{I}_{n,s}^{v}$ ‡‡ | Any | Any | Any

† Williamson et al.'s estimator from Algorithm 2.

†† Breiman-Cutler VIMP.

‡ Refers to the method of Lu and Ishwaran (2018).

‡‡ Generalized VIMP; see (C).
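To make the cross-fitted estimator (D) and the authors' style of restricted estimation concrete, here is a minimal R sketch of our own. The learner is lm purely to keep the sketch self-contained (the authors use flexible learners such as gradient boosted trees), and the restricted fit follows our reading of line 5 of Algorithm 2: regress the full-model fitted values on the covariates not in $s$.

```r
## Minimal sketch of the cross-fitted estimator (D) with restricted estimation
## done by regressing the full-model fitted values on the covariates not in s.
## lm() is used only to keep the sketch self-contained. Illustrative only.
set.seed(3)
n <- 500; p <- 10; V <- 5
dat <- as.data.frame(matrix(rnorm(n * p), n, p)); names(dat) <- paste0("x", 1:p)
dat$y <- 2 * dat$x6 + 2 * dat$x7 + rnorm(n)
s <- c("x6", "x7")
fold <- sample(rep(1:V, length.out = n))

psi_v <- sapply(1:V, function(v) {
  trn <- dat[fold != v, ]; tst <- dat[fold == v, ]
  full <- lm(y ~ ., data = trn)                              # full-model estimator
  trn$fit <- predict(full, trn)
  reduced <- reformulate(setdiff(names(dat), c("y", s)), response = "fit")
  restricted <- lm(reduced, data = trn)                      # regress the fit on covariates not in s
  mean((tst$y - predict(restricted, tst))^2) - mean((tst$y - predict(full, tst))^2)
})

psi_tilde <- mean(psi_v)          # analogue of (D), averaged over folds
psi_tilde / var(dat$y)            # divide by the sample variance to recover the authors' scale
```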

3. Dimensionality: some extensions and the benefit of prediction error

Another point we wish to mention relates to dimensionality and noise. We show that the magnitude of the authors' estimator $\hat{\psi}_{n,s}$ can depend on the size of $s$; hence values of $\hat{\psi}_{n,s}$ may not be comparable across models of different dimension. As an attempt to account for this phenomenon, we can define an adjusted $\hat{\psi}_{n,s}$, similar in spirit to adjusted $R^2$,

$$\hat{\psi}_{n,s}^{\mathrm{adj}} = \left[\frac{1}{n - (p - |s|) - 1} \sum_{i=1}^{n} \left\{Y_i - \hat{\mu}_s(X_i)\right\}^2 - \frac{1}{n - p - 1} \sum_{i=1}^{n} \left\{Y_i - \hat{\mu}(X_i)\right\}^2\right] \Big/ \mathrm{VAR}_{\mathrm{tot}},$$

where $\mathrm{VAR}_{\mathrm{tot}} = \sum_{i=1}^{n} (Y_i - \bar{Y}_n)^2 / (n - 1)$. However, this adjustment could be too weak because $n - (p - |s|) - 1$ can be dominated by $n$. Another option is to multiply the first term by a factor $k := k(p, s)$ and to modify the estimator as follows:

$$\hat{\psi}_{n,s}^{k} = \left[\frac{k(p,s)}{n - (p - |s|) - 1} \sum_{i=1}^{n} \left\{Y_i - \hat{\mu}_s(X_i)\right\}^2 - \frac{1}{n - p - 1} \sum_{i=1}^{n} \left\{Y_i - \hat{\mu}(X_i)\right\}^2\right] \Big/ \mathrm{VAR}_{\mathrm{tot}}.$$

One example of such a factor is $k(p, s) = \ln(p/|s|)/\ln(p)$; when $|s| = 1$ we have $k(p, s) = 1$ and $\hat{\psi}_{n,s}^{k} = \hat{\psi}_{n,s}^{\mathrm{adj}}$.
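Both adjustments are simple to compute from fitted values. The small R helper below is our own illustrative code; mu_hat and mu_hat_s stand for fitted values from the full and $s$-restricted regressions, however those are obtained.

```r
## Adjusted VIMP estimators: psi_adj uses adjusted-R^2-style degrees-of-freedom
## corrections; psi_k additionally multiplies the restricted term by
## k(p, s) = log(p / |s|) / log(p) to penalize larger feature sets.
psi_adjusted <- function(y, mu_hat, mu_hat_s, p, s_size) {
  n <- length(y)
  var_tot <- sum((y - mean(y))^2) / (n - 1)
  rss_s <- sum((y - mu_hat_s)^2)                 # restricted-model residual sum of squares
  rss   <- sum((y - mu_hat)^2)                   # full-model residual sum of squares
  k <- log(p / s_size) / log(p)                  # dimensionality factor; k = 1 when |s| = 1
  psi_adj <- (rss_s / (n - (p - s_size) - 1) - rss / (n - p - 1)) / var_tot
  psi_k   <- (k * rss_s / (n - (p - s_size) - 1) - rss / (n - p - 1)) / var_tot
  c(psi_adj = psi_adj, psi_k = psi_k)
}
```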

We use the simulation of setting A in Section 3.3 to display how $\hat{\psi}_{n,s}$, $\hat{\psi}_{n,s}^{\mathrm{adj}}$, and $\hat{\psi}_{n,s}^{k}$ change as the feature set expands: $s = \{6\}, \{6, 7\}, \{6, 7, 8\}, \{6, 7, 8, 9\}, \{6, 7, 8, 9, 10\}$. We follow Algorithm 1 and estimate $\hat{\mu}$ and $\hat{\mu}_s$ using gradient boosted trees, as the authors did. A total of 50 independent datasets of sizes $n = 300$ and $n = 500$ were simulated. Results are displayed in Figure 1. Since $X_6$ and $X_7$ are the only informative variables, we expect variable importance to increase from $s = \{6\}$ to $s = \{6, 7\}$. In other words, for an estimator $\hat{\psi}_{n,s}^{*}$, we would expect $\hat{\psi}_{n,\{6,7\}}^{*} > \hat{\psi}_{n,\{6\}}^{*}$ if $\hat{\psi}_{n,\{7\}}^{*} > \hat{\psi}_{n,\{6\}}^{*}$ and $\hat{\psi}_{n,s}^{*}$ measures the "average" effect size of the features in $s$, or $\hat{\psi}_{n,\{6,7\}}^{*} > \hat{\psi}_{n,\{6\}}^{*}$ if $\hat{\psi}_{n,\{7\}}^{*} > 0$ and $\hat{\psi}_{n,s}^{*}$ measures the "joint" effect size of the features in $s$. However, we would not want such increases to occur from $s = \{6, 7\}$ to $s = \{6, 7, 8\}$, from $s = \{6, 7, 8\}$ to $s = \{6, 7, 8, 9\}$, and so forth, since $X_8$, $X_9$, and $X_{10}$ are noise variables.

Figure 1.

Comparison of variable importance measures with increasing size of the feature set $s$ for sample sizes $n = 300$ (left) and $n = 500$ (right). The adjusted variable importance measures $\hat{\psi}_{n,s}^{\mathrm{adj}}$ and $\hat{\psi}_{n,s}^{k}$ are shown in red and blue, respectively, and the unadjusted measure $\hat{\psi}_{n,s}$ is shown in black. The datasets are generated according to the simulation of setting A in Section 3.3, where only $X_6$ and $X_7$ are informative variables in all the chosen feature sets. Values of $\hat{\psi}_{n,s}$ and $\hat{\psi}_{n,s}^{\mathrm{adj}}$ are similar and both become inflated as noise variables are added. On the other hand, $\hat{\psi}_{n,s}^{k}$ performs much better because of its heavy dimensionality penalty. Also included in the figure are BC-VIMP values, displayed in purple. Because it is based on prediction error, BC-VIMP automatically adjusts for dimensionality, does not inflate with the addition of noise, and performs correctly. Orange lines are BC-VIMP multiplied by the dimensionality factor $k$, which further pushes VIMP values downward for overfit models. Finally, note that BC-VIMP has been scaled by $10^{-4}$ and $10^{-3}$ for the left and right panels, respectively. Thus for the model $s = \{6, 7\}$, the value is approximately 7% for $n = 300$ and 10% for $n = 500$. This is very interpretable: it states that the model with the two non-zero variables explains a relatively high fraction of the variance over test data. This type of interpretation is not possible with estimators such as $\hat{\psi}_{n,s}$ that are not prediction error based. This figure appears in color in the electronic version of this article, and any mention of color refers to that version.

Figure 1 shows that, as noise variables are added and the size of $s$ increases, $\hat{\psi}_{n,s}^{k}$ helps reduce the inflated values seen for $\hat{\psi}_{n,s}$. Values for $\hat{\psi}_{n,s}^{\mathrm{adj}}$ are similar to $\hat{\psi}_{n,s}$, confirming that its dimensionality adjustment is too weak. Of course, with such small sample sizes the estimation of $\hat{\psi}_{n,s}$ could be biased; hence variable importance may not measure the average or joint effect sizes in perfect proportion. To test this, we added to Figure 1 another line, "BC-VIMP", giving Breiman-Cutler importance values standardized by $\mathrm{VAR}_{\mathrm{tot}}$. Values were calculated using the vimp function from the randomForestSRC R package (Ishwaran and Kogalur, 2020). We can immediately see that the BC-VIMP values conform to what we expected: importance values increase from $s = \{6\}$ to $s = \{6, 7\}$ and then immediately flatten off. In fact, there even appears to be a downward trend for BC-VIMP for the overfit models. The latter is a benefit of using prediction error, since prediction error will increase with the addition of noise. Finally, we have also adjusted BC-VIMP by $k$ for comparison. As can be seen, the adjustment further pushes importance downward for the overfit models.
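As a rough sketch of how such joint BC-VIMP values can be obtained, the following uses randomForestSRC on toy data; the xvar.names and joint arguments reflect our reading of the package documentation and should be checked against the installed version.

```r
## Joint Breiman-Cutler VIMP for a feature set s using randomForestSRC,
## standardized by the total variance of the response. Illustrative sketch only.
library(randomForestSRC)

set.seed(4)
n <- 300; p <- 10
dat <- as.data.frame(matrix(rnorm(n * p), n, p)); names(dat) <- paste0("x", 1:p)
dat$y <- 2 * dat$x6 + 2 * dat$x7 + rnorm(n)

o <- rfsrc(y ~ ., data = dat)                         # grow the forest
var_tot <- var(dat$y)

bc <- vimp(o, xvar.names = c("x6", "x7"), joint = TRUE)  # joint permutation importance for s
bc$importance / var_tot                                  # standardized BC-VIMP
```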

4. Conclusions

In their paper, the authors have introduced not one but two estimators. The first, $\hat{\psi}_{n,s}$ (Algorithm 1), is constructed from the full learning dataset. Although we did not comment much on this version in our discussion, one major advantage of $\hat{\psi}_{n,s}$ is that it is far more amenable to theoretical analysis than a prediction error based estimator such as their second estimator, $\hat{\psi}_{n,s}^{v}$. In fact, this is what the authors have done. Using empirical processes and semiparametric theory, they have developed a comprehensive analysis of $\hat{\psi}_{n,s}$ that provides justification for the procedure and identifies its asymptotic limiting distribution in certain cases, and we commend the authors for doing so. Regarding the limiting distribution of Theorem 1, the authors use it to develop confidence intervals based on a plug-in estimator. We would like to mention another technique that has been used for constructing confidence intervals for VIMP, based on subsampling (Ishwaran and Lu, 2019). This method is applicable to many settings, including survival, regression, and classification problems. Because it is based on subsampling theory, the conditions it requires differ from those used here. In particular, condition (A1) requires a rate condition for the underlying estimation technique, whereas the subsampling estimator only requires the existence of a limiting distribution. This could be useful, as verifying (A1) may be difficult for machine learning methods. From a more practical standpoint, subsampling is highly computationally efficient and therefore opens up applications to high-dimensional and big data scenarios.
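For completeness, a brief sketch of how the subsampling approach might be invoked in practice; the subsample and plot.subsample calls reflect our reading of the randomForestSRC documentation (Ishwaran and Kogalur, 2020) and are illustrative only.

```r
## Subsampling-based confidence intervals for VIMP (Ishwaran and Lu, 2019),
## sketched with randomForestSRC on toy data. Illustrative only.
library(randomForestSRC)

set.seed(5)
dat <- data.frame(matrix(rnorm(300 * 10), 300, 10)); names(dat) <- paste0("x", 1:10)
dat$y <- 2 * dat$x6 + 2 * dat$x7 + rnorm(300)

o <- rfsrc(y ~ ., data = dat)       # grow the forest
ss <- subsample(o, B = 100)         # B subsampled forests for VIMP variability
plot.subsample(ss)                  # boxplots / confidence intervals for VIMP
```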

This brings us to the authors' second estimator, $\hat{\psi}_{n,s}^{v}$ (Algorithm 2), which we have discussed in more detail. Unlike the full data estimator, this estimator is prediction error based and, as we have argued, is an example of the general principle (A) described by Breiman (2001). As we commented, the authors are proposing a new technique for restricted model estimation (Table 1): to calculate the first term in (A), they regress the full-model estimator on the covariates not in $s$. This is interesting, and we wish they had provided empirical results on its effectiveness. In fact, we believe this estimator will prove superior to $\hat{\psi}_{n,s}$. We have already illustrated some problems with the latter. Other issues appear in the results of the real data applications: all values of $\hat{\psi}_{n,s}$ are positive and the confidence intervals do not cover zero, which could lead one to believe that $\hat{\psi}_{n,s}$ can only rank variables but is unable to separate out noise variables. We believe these and other issues will be remedied by a prediction error based VIMP.

In closing, we congratulate the authors for their contributions to the area of variable importance. We also thank the editor(s) for giving us the opportunity to share our insights on this work. There are many interesting ideas here, and potentially important theoretical work may find inspiration in this article and its discussion.

References

  1. Breiman L (2001). Random forests. Machine Learning 45, 5–32.
  2. Ishwaran H (2007). Variable importance in binary regression trees and forests. Electronic Journal of Statistics 1, 519–537.
  3. Ishwaran H, Kogalur UB, Blackstone EH, and Lauer MS (2008). Random survival forests. The Annals of Applied Statistics 2(3), 841–860.
  4. Ishwaran H, Gerds TA, Kogalur UB, Moore RD, Gange SJ, and Lau BM (2014). Random survival forests for competing risks. Biostatistics 15(4), 757–773.
  5. Ishwaran H and Kogalur UB (2020). Random Forests for Survival, Regression, and Classification (RF-SRC). R package version 2.9.3.
  6. Ishwaran H and Lu M (2019). Standard errors and confidence intervals for variable importance in random forest regression, classification, and survival. Statistics in Medicine 38(4), 558–582.
  7. Lu M and Ishwaran H (2018). A prediction-based alternative to P values in regression models. The Journal of Thoracic and Cardiovascular Surgery 155(3), 1130–1136.
