Author manuscript; available in PMC: 2021 Apr 20.
Published in final edited form as: Proc Mach Learn Res. 2020 Jul;119:10282–10291.

Efficient nonparametric statistical inference on population feature importance using Shapley values

Brian D Williamson 1,*, Jean Feng 2,*
PMCID: PMC8057672  NIHMSID: NIHMS1678924  PMID: 33884372

Abstract

The true population-level importance of a variable in a prediction task provides useful knowledge about the underlying data-generating mechanism and can help in deciding which measurements to collect in subsequent experiments. Valid statistical inference on this importance is a key component in understanding the population of interest. We present a computationally efficient procedure for estimating and obtaining valid statistical inference on the Shapley Population Variable Importance Measure (SPVIM). Although the computational complexity of the true SPVIM scales exponentially with the number of variables, we propose an estimator based on randomly sampling only Θ(n) feature subsets given n observations. We prove that our estimator converges at an asymptotically optimal rate. Moreover, by deriving the asymptotic distribution of our estimator, we construct valid confidence intervals and hypothesis tests. Our procedure has good finite-sample performance in simulations, and for an in-hospital mortality prediction task produces similar variable importance estimates when different machine learning algorithms are applied.

1. Introduction

In many scientific applications, understanding the intrinsic predictive value of a variable can shed light on the internal mechanisms relating the variable to the outcome of interest, help build future models, and guide experimental design. For example, hospital administrators may want to know the important features to collect for predicting patient outcomes. Likewise, vaccine researchers may want to know the most important molecular phenotypes to measure that are most predictive of binding or vaccine efficacy (see, e.g., Dunning, 2006). Variable importance measures (VIMs) provide necessary information towards answering these questions.

Our interest here is in statistical inference on the population VIM. This VIM quantifies the predictive value of a variable within the oracle prediction model $f_0$ defined relative to an arbitrary predictiveness measure $V$. For many choices of $V$, $f_0$ is either the conditional mean outcome given covariates (e.g., if $V = R^2$) or a simple functional of this conditional mean (e.g., if $V$ is classification accuracy). We note that population VIMs are distinct from algorithmic VIMs, which describe the importance of a variable within a fitted model $\hat{f}$ (see, e.g., Breiman, 2001; Garson, 1991; Murdoch et al., 2019). Although algorithmic VIMs have been used as a proxy for population VIMs out of convenience, differences between $\hat{f}$ and $f_0$ can often lead to substantially different interpretations of the resulting VIMs. Whereas an algorithmic VIM necessarily varies across fitted models, a population VIM is independent of the specific procedure used to estimate $f_0$.

Existing population VIMs suffer from a number of issues. Traditionally, population VIMs have relied on restrictive parametric assumptions (e.g., $R^2$ in linear models; see, e.g., Grömping, 2007; Nathans et al., 2012), which can lead to misleading results if the parametric model does not hold. Recent work has extended these definitions by removing the parametric assumptions (Feng et al., 2018; Williamson et al., 2020b); however, the resulting measures define the importance of a variable relative to all the others and therefore assign near-zero importance when features are highly correlated. Other VIMs require strong assumptions on the design to be valid (e.g., ANOVA) and likewise fail in simple cases with correlated variables. To address this, Owen and Prieur (2017) proposed using Shapley values to quantify the population VIM, where the value function is the variance explained; these VIMs inherit many desirable theoretical properties from the Shapley value. In fact, contemporary work has also defined the ideal estimand of algorithmic VIM estimation procedures to be the Shapley population VIM (SPVIM) (Covert et al., 2020).

Unfortunately, exact estimation of SPVIM is computationally intractable in general settings (Owen and Prieur, 2017): the SPVIM is defined as the sum of $2^p$ terms, where $p$ is the number of features and each term depends on estimating the conditional mean function with respect to a unique feature subset. Previous approaches have either suggested sampling as many subsets as possible to estimate the Shapley value (see, e.g., Castro et al., 2009) or utilized special properties of tree estimators to reduce the number of subsets required (Lundberg et al., 2020). Notably, Štrumbelj and Kononenko (2014) analyzed the asymptotic distribution of a sampling-based estimator of Shapley algorithmic variable importance to derive confidence intervals.

In this paper, we combine the aforementioned developments and provide a nonparametric statistical inference procedure for SPVIM. We generalize previous definitions of SPVIM and use an arbitrary measure of predictiveness V. We tackle the computational complexity of the problem by randomly sampling feature subsets according to the Shapley value weights and then fitting corresponding models. We derive the asymptotic distribution of this sampling-based SPVIM estimator and show that the error from our proposed procedure can be decomposed into two components: the error from estimating the oracle prediction models and the error from omitting summands from the Shapley value estimand. Given n training observations, we find that our estimator only needs to sample m = Θ(n) subsets to converge at an asymptotically optimal rate. Moreover, since the subset sampling distribution is highly skewed, the number of unique feature subsets is much smaller than m in practice. We then use the asymptotic distribution to construct asymptotically unbiased point estimates, valid confidence intervals, and hypothesis tests with proper type I error control.

We demonstrate the validity of our approach in a simulation study and estimate the SPVIM of hospital measurements for predicting mortality in the intensive care unit (ICU). All numerical results can be replicated using code available on GitHub at bdwilliamson/spvim_supplementary; the proposed methods are also implemented in the Python package vimpy and the R package vimp.

2. Variable importance

2.1. Data structure and notation

Let $\mathcal{M}$ be a nonparametric class of joint distributions over covariates $X = (X_1, \ldots, X_p) \in \mathcal{X} \subseteq \mathbb{R}^p$ and response $Y \in \mathcal{Y}$, where $\mathcal{X}$ and $\mathcal{Y}$ denote the sample spaces of $X$ and $Y$, respectively. Suppose that each observation $O$ consists of $(X, Y)$. In this article, we consider observations $O_1, \ldots, O_n$ drawn independently according to a joint probability distribution $P_0 \in \mathcal{M}$.

Next, we define the feature subsets and oracle prediction models of interest. We take $\mathcal{S}$ to be the power set of $N := \{1, \ldots, p\}$. Let $s(j)$ for $j = 1, \ldots, 2^p$ be an ordered sequence of the subsets in $\mathcal{S}$, where $s(1) = \emptyset$ and $s(2^p) = N$. For any index set $s \in \mathcal{S}$, we denote by $\mathcal{X}_s$ and $\mathcal{X}_{-s}$ the sample spaces of $X_s$ and $X_{-s}$, respectively. We denote by $a_s$ and $a_{-s}$ the elements of a vector $a$ with indices in $s$ and not in $s$, respectively. We also consider the binary vector $z(s) \in \mathbb{R}^{p+1}$ for each $s \in \mathcal{S}$, where $z(s)_1 = 1$ for all $s \in \mathcal{S}$ and $z(s)_{k+1} = I(k \in s)$ for $k = 1, \ldots, p$. Finally, we consider a rich class $\mathcal{F}$ of functions from $\mathcal{X}$ to $\mathcal{Y}$ endowed with a norm $\|\cdot\|_{\mathcal{F}}$. For any $s \in \mathcal{S}$, we define the subset $\mathcal{F}_s := \{f \in \mathcal{F} : f(u) = f(v) \text{ for all } u, v \in \mathcal{X} \text{ satisfying } u_s = v_s\}$ of functions in $\mathcal{F}$ whose evaluation ignores elements of the input $x$ with index not in $s$. In all examples we consider, we take $\mathcal{F}$ to be a rich class of functions that is essentially unrestricted up to regularity conditions.

2.2. Oracle predictiveness

We define the importance of a variable at the population level in terms of its oracle predictiveness. This predictiveness is measured by a real-valued functional $V : \mathcal{F} \times \mathcal{M} \to \mathbb{R}$. We assume that larger values of $V(f, P)$ imply higher predictiveness. Examples of predictiveness measures, including $R^2$, deviance, area under the ROC curve, and classification accuracy, are provided in Williamson et al. (2020b).

The oracle predictiveness is the maximum achievable predictiveness over a class of prediction functions. More formally, we define the total oracle predictiveness $v_{0,N} := \max_{f \in \mathcal{F}} V(f, P_0)$ and its associated oracle prediction function $f_{0,N} \in \arg\max_{f \in \mathcal{F}} V(f, P_0)$. For many machine learning algorithms, $f_{0,N}$ is the target of interest. We further define the oracle prediction function $f_{0,s}$ that maximizes $V(f, P_0)$ over all $f \in \mathcal{F}_s$; the marginal oracle predictiveness $v_{0,s} := V(f_{0,s}, P_0)$ quantifies the prediction potential of features with index in $s$. The null oracle predictiveness $v_{0,\emptyset} := V(f_{0,\emptyset}, P_0)$ quantifies the prediction potential of a model that uses no covariate information. Finally, let $v_0 := [v_{0,\emptyset}, v_{0,s(2)}, \ldots, v_{0,N}]$ denote the $2^p$-dimensional vector of predictiveness measures for all subsets in $\mathcal{S}$. The predictiveness measure $v_{0,s(j)}$ is defined relative to the population $P_0$, a joint distribution in the nonparametric statistical model $\mathcal{M}$; thus, its interpretation is tied to neither any particular estimation procedure nor any parametric assumptions.

2.3. The Shapley population variable importance measure

We now define a population VIM using the classical form of the Shapley value (see, e.g., Shapley, 1953; Charnes et al., 1988) with an arbitrary measure of predictiveness $V$. Specifically, the Shapley population variable importance measure (SPVIM) of the variable $X_j$ is the average gain in oracle predictiveness from including feature $X_j$ over all possible subsets:

$$\psi_{0,0,j} := \sum_{s \subseteq N \setminus \{j\}} \frac{1}{p} \binom{p-1}{|s|}^{-1} \left\{ V(f_{0,s \cup j}, P_0) - V(f_{0,s}, P_0) \right\},$$ (1)

where the indices of $\psi$ describe the number of subsets, the distribution $P_0$, and the feature of interest $j$, respectively. We use the index 0 to indicate that the SPVIM is computed using all subsets and the true distribution $P_0$. SPVIMs inherit the following properties from Shapley values (Shapley, 1953):

  • Non-negativity: by construction, $\psi_{0,0,j} \geq 0$.

  • Additivity: the sum of the SPVIMs across all variables is equal to the difference between the total and null oracle predictiveness,
    $$\sum_{j=1}^{p} \psi_{0,0,j} = v_{0,N} - v_{0,\emptyset}.$$ (2)

  • Symmetry: if $X_i = X_j$, then $\psi_{0,0,i} = \psi_{0,0,j}$.

  • Null feature: if $X_j$ provides no added predictive value, i.e., $v_{0,s \cup j} = v_{0,s}$ for all $s \subseteq N \setminus \{j\}$, then its SPVIM value is $\psi_{0,0,j} = 0$.

  • Linearity: if $\tilde{V} = \alpha V$, then its associated SPVIM values are $\tilde{\psi}_{0,0,j} = \alpha \psi_{0,0,j}$ for all $j \in \{1, \ldots, p\}$.

Because SPVIMs satisfy these properties, they address the issue of correlated features: given collinear variables $X_j$ and $X_k$ that are each marginally predictive, previous nonparametric population VIMs (see, e.g., Williamson et al., 2020b) would assign zero importance to both variables, whereas SPVIM would assign the same positive value to both variables.
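To make the collinearity argument concrete, the following minimal sketch evaluates definition (1) by brute-force enumeration for three features. The predictiveness table `toy_value` is a hypothetical stand-in for the oracle predictiveness (features 0 and 1 are exact copies of each other), and the function names are illustrative only; they are not part of vimpy or vimp.

```python
from itertools import combinations
from math import comb


def shapley_values(p, value):
    """Exact Shapley values by enumerating every subset of N \\ {j}.

    `value` maps a frozenset of 0-based feature indices to a number and
    plays the role of the oracle predictiveness v_{0,s}.
    """
    psi = []
    for j in range(p):
        others = [k for k in range(p) if k != j]
        total = 0.0
        for size in range(p):
            weight = 1.0 / (p * comb(p - 1, size))
            for s in combinations(others, size):
                s = frozenset(s)
                total += weight * (value(s | {j}) - value(s))
        psi.append(total)
    return psi


# Hypothetical predictiveness: features 0 and 1 are copies (either one alone
# contributes 0.6); feature 2 contributes 0.2 independently of the others.
def toy_value(s):
    return (0.6 if (0 in s or 1 in s) else 0.0) + (0.2 if 2 in s else 0.0)


full = frozenset({0, 1, 2})
psi = shapley_values(3, toy_value)
# Importance of feature 0 relative to all other features (drop-one gain).
drop_one = toy_value(full) - toy_value(full - {0})

print("Shapley:", [round(v, 3) for v in psi])
print("drop-one gain for feature 0:", drop_one)
```

In this toy table the drop-one gain for either copied feature is zero, while the Shapley values split the shared 0.6 equally between the two copies and sum to the total-minus-null predictiveness.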

In this paper, we take advantage of an alternate formulation of the Shapley value noted in previous work (see, e.g., Charnes et al., 1988; Lundberg and Lee, 2017). In particular, we can rewrite the weighted average in (1) as the solution of a weighted linear regression problem, where we treat the predictiveness of a feature subset $v_{0,s}$ as the response and the subset membership $z(s)$ as the covariates. Define a diagonal matrix of weights $W \in \mathbb{R}^{2^p \times 2^p}$ where $W_{1,1} = W_{2^p,2^p} = 1$, and for any $j \in \{2, \ldots, 2^p - 1\}$, $W_{j,j} = \binom{p-2}{|s(j)|-1}^{-1}$. The matrix $Z \in \mathbb{R}^{2^p \times (p+1)}$ consists of the stacked $z(s)$ vectors for each $s \in \mathcal{S}$. Setting $\psi_{0,0,\emptyset} := v_{0,\emptyset}$, we denote by $\psi_{0,0}$ the $(p+1)$-dimensional vector of population Shapley values. Then (1) is equivalent to

$$\psi_{0,0} := \arg\min_{\psi \in \mathbb{R}^{p+1}} \left\| \sqrt{W} (Z\psi - v_0) \right\|_2^2,$$ (3)

a result that we prove in the Supplement. If we define the distribution $Q_0$ over subsets $S \in \mathcal{S}$ with probability mass function assigning weight $\binom{p-2}{|S|-1}^{-1}$ for $S \in \mathcal{S} \setminus \{\emptyset, N\}$ and weight 1 for $S \in \{\emptyset, N\}$ (scaled so that the weights sum to one), then (3) is equivalent to a population average:

$$\psi_{0,0} \in \arg\min_{\psi \in \mathbb{R}^{p+1}} E_{Q_0}\left[ \left( z(S)^\top \psi - v_{0,S} \right)^2 \right].$$

We will use this fact in our estimation procedure below.
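As an illustration of the subset distribution just described, the sketch below computes the probability that a draw from Q0 has a given size and then samples subsets by first drawing a size and then a uniformly random subset of that size. This two-stage scheme is valid because the weight of a subset depends only on its size; the function names are assumptions made for this example.

```python
import random
from math import comb


def subset_size_pmf(p):
    """Probability that a draw from Q0 has size k, for k = 0, ..., p.

    Each interior subset of size k has unnormalized weight 1 / C(p-2, k-1)
    and there are C(p, k) of them; the empty and full sets have weight 1.
    """
    mass = [1.0] + [comb(p, k) / comb(p - 2, k - 1) for k in range(1, p)] + [1.0]
    total = sum(mass)
    return [m / total for m in mass]


def sample_subsets(p, m, seed=0):
    """Draw m subsets from Q0: pick a size, then a uniform subset of that size."""
    rng = random.Random(seed)
    sizes = rng.choices(range(p + 1), weights=subset_size_pmf(p), k=m)
    return [frozenset(rng.sample(range(p), k)) for k in sizes]


pmf = subset_size_pmf(10)
draws = sample_subsets(10, 2000)
print([round(q, 3) for q in pmf])
print(len(set(draws)), "unique subsets among", len(draws), "draws")
```

The size distribution is symmetric and puts more mass on sizes near 0 and p than on sizes near p/2, so repeated draws revisit the same small and large subsets and the number of distinct models to fit is smaller than the number of draws.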

3. Estimation and inference

3.1. Plug-in estimation

We now discuss how to estimate the SPVIM values for all $p$ features using independent observations $O_1, \ldots, O_n$ drawn from $P_0$. Definition (3) suggests considering an estimator based on plugging in estimators of each individual component. We discuss each component in turn.

First, we estimate the predictiveness measure $v_{0,s} = V(f_{0,s}, P_0)$ for a subset $s \in \mathcal{S}$ by plugging in estimates of the oracle function $f_{0,s}$ and the distribution $P_0$. A simple approach is to partition the data into a training set and a validation set, construct an estimator $f_{n,s}$ for $f_{0,s}$ on the training data (using only the observed covariates in $s$), and estimate $P_0$ using the empirical distribution of the validation set $P_V$. Using this training-validation split, our estimate of predictiveness is then

$$v_{n,s} = V(f_{n,s}, P_V).$$ (4)

An alternative approach is to perform $K$-fold cross-fitting, where we partition the data into $K$ subsets of roughly equal size and, for each $k \in \{1, \ldots, K\}$, construct an estimator $f_{k,n,s}$ based on all the data except for the $k$th subset. Let $P_{k,n}$ be the empirical distribution of the $k$th subset. Then we could estimate $v_{0,s}$ using

$$v_{n,s} = \frac{1}{K} \sum_{k=1}^{K} V(f_{k,n,s}, P_{k,n}).$$ (5)
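The cross-fitted estimate in (5) can be sketched as follows for the case where V is R-squared. The learner here is ordinary least squares, chosen only to keep the example short; it is a stand-in assumption and not the flexible estimator the procedure is intended for. All function names are hypothetical.

```python
import numpy as np


def r_squared(y, pred):
    """Empirical R^2: one minus MSE over the variance of y."""
    return 1.0 - np.mean((y - pred) ** 2) / np.var(y)


def fit_linear(x_train, y_train):
    """Stand-in learner: least squares with intercept. Returns a predictor."""
    design = np.column_stack([np.ones(len(x_train)), x_train])
    coef, *_ = np.linalg.lstsq(design, y_train, rcond=None)
    return lambda x: np.column_stack([np.ones(len(x)), x]) @ coef


def cross_fitted_predictiveness(x, y, subset, n_folds=5, seed=0):
    """Equation (5): average held-out R^2 of models that only see `subset`."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    cols = sorted(subset)
    scores = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[i] for i in range(n_folds) if i != k])
        predict = fit_linear(x[train][:, cols], y[train])
        scores.append(r_squared(y[test], predict(x[test][:, cols])))
    return float(np.mean(scores))


# Simulated data (hypothetical): only the first covariate matters.
rng = np.random.default_rng(1)
x = rng.normal(size=(400, 3))
y = 2.0 * x[:, 0] + rng.normal(size=400)

v_empty = cross_fitted_predictiveness(x, y, set())  # intercept-only model
v_first = cross_fitted_predictiveness(x, y, {0})
print(round(v_empty, 3), round(v_first, 3))
```

With an empty subset the learner reduces to predicting the training mean, which gives the estimated null predictiveness; it can be slightly negative because the held-out fold mean differs from the training mean.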

If we had the entire estimated vector of predictiveness measures $v_n$, we could estimate $\psi_{0,0}$ using the plug-in estimator

$$\psi_{0,n} := \arg\min_{\psi \in \mathbb{R}^{p+1}} E_{Q_0}\left[ \left( z(S)^\top \psi - v_{n,S} \right)^2 \right].$$ (6)

Unfortunately, obtaining $v_n$ requires training $2^p$ models, rendering this a computationally intractable task in general. Instead, we can replace $Q_0$ in (6) with an empirical distribution estimator $Q_m$ obtained by sampling $m$ subsets from $\mathcal{S}$ according to $Q_0$. This leads us to the SPVIM estimator $\psi_{m,n}$, which solves the constrained least squares problem

$$\min_{\psi \in \mathbb{R}^{p+1}} E_{Q_m}\left[ \left( z(S)^\top \psi - v_{n,S} \right)^2 \right] \quad \text{subject to} \quad G\psi = c_n,$$ (7)

where $G := [z(\emptyset), z(N)]^\top \in \mathbb{R}^{2 \times (p+1)}$ and $c_n := [v_{n,\emptyset}, v_{n,N}] \in \mathbb{R}^2$. The constraint ensures that the estimated SPVIMs satisfy the additivity property (2) and that the estimated SPVIM for the null set is the estimated null predictiveness value.

This constrained least squares problem can be solved by forming a Lagrangian and inverting its Karush-Kuhn-Tucker (KKT) conditions (Boyd and Vandenberghe, 2004). More specifically, let $s_1, \ldots, s_\ell$ be the unique subsets in $Q_m$. Let $W_m$ be the $\ell \times \ell$ diagonal matrix whose $k$th diagonal element is the probability mass of $s_k$ in $Q_m$. Let $v_{m,n} = (v_{n,s_1}, \ldots, v_{n,s_\ell})$ be the estimated predictiveness measures for the $\ell$ subsets. Let $Z_m$ be the stack of vectors $z(s_1), \ldots, z(s_\ell)$. Then (7) can also be written as

$$\min_{\psi \in \mathbb{R}^{p+1}} \left\| \sqrt{W_m} (Z_m \psi - v_{m,n}) \right\|_2^2 \quad \text{subject to} \quad G\psi = c_n.$$

Solving the KKT conditions with Lagrange multipliers denoted by $\lambda$, we obtain a closed-form SPVIM estimator:

$$\begin{bmatrix} \psi_{m,n} \\ \lambda \end{bmatrix} = \begin{bmatrix} 2 Z_m^\top W_m Z_m & G^\top \\ G & 0 \end{bmatrix}^{-1} \begin{bmatrix} 2 Z_m^\top W_m v_{m,n} \\ c_n \end{bmatrix}.$$ (8)

To ensure that (7) has a unique solution, we select a sufficiently large value of $m$ so that $Q_m$ includes at least $p + 1$ unique subsets. The full estimation procedure is given in Algorithm 1.
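The constrained fit can be sketched numerically by assembling the KKT system and solving it directly. In the example below, sampling is replaced by the full set of interior subsets with their exact Q0 weights (an assumption made so the output can be checked by hand), and `toy_value` is a hypothetical predictiveness table in which features 0 and 1 are copies of each other.

```python
from itertools import combinations
from math import comb

import numpy as np


def constrained_shapley_fit(subsets, weights, values, p, v_empty, v_full):
    """Solve the constrained weighted least squares problem via its KKT system.

    subsets: index sets s_1..s_l; weights: their probability mass;
    values: predictiveness estimates for those subsets. Returns psi of
    length p + 1, where psi[0] is the null-set entry.
    """
    z = np.array([[1.0] + [float(k in s) for k in range(p)] for s in subsets])
    w = np.diag(weights)
    g = np.array([[1.0] + [0.0] * p, [1.0] * (p + 1)])  # rows z(empty), z(N)
    c = np.array([v_empty, v_full])
    kkt = np.block([[2 * z.T @ w @ z, g.T], [g, np.zeros((2, 2))]])
    rhs = np.concatenate([2 * z.T @ w @ np.asarray(values), c])
    return np.linalg.solve(kkt, rhs)[: p + 1]


# Hypothetical predictiveness: features 0 and 1 are copies worth 0.6,
# feature 2 adds 0.2 independently.
def toy_value(s):
    return (0.6 if (0 in s or 1 in s) else 0.0) + (0.2 if 2 in s else 0.0)


p = 3
interior = [frozenset(s) for k in range(1, p) for s in combinations(range(p), k)]
raw = np.array([1.0 / comb(p - 2, len(s) - 1) for s in interior])
psi = constrained_shapley_fit(
    interior,
    raw / raw.sum(),
    [toy_value(s) for s in interior],
    p,
    toy_value(frozenset()),
    toy_value(frozenset(range(p))),
)
print(np.round(psi, 3))
```

Because the two constraints are enforced exactly, the empty and full sets do not need to appear among the weighted rows. For this toy table the solution gives 0.3 to each copied feature and 0.2 to the third, matching direct enumeration of the Shapley formula.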

We now describe the properties listed in Section 2.3 that are satisfied by this sampling-based SPVIM estimator. It is easy to see that the additivity, symmetry, and linearity properties always hold. One possible concern is that the nonnegativity property can be violated. Nevertheless, in practice we find that negative SPVIM estimates are close to zero and the 95% confidence intervals cover zero. If nonnegativity is truly a concern, one can also add a nonnegativity constraint to (7). Finally, the null feature property holds with respect to estimated predictiveness values and the sampled subsets. Note that this property is only relevant for discrete predictiveness measures like 0–1 classification accuracy, since the estimated predictiveness values are rarely exactly the same for continuous predictiveness measures like $R^2$.

The plug-in estimator $\psi_{m,n}$ is appealing due to its simplicity. In general, however, such an estimator may fail to be consistent at rate $n^{-1/2}$ if the population optimizers $f_{0,s}$ are flexibly estimated. This phenomenon is due in large part to the optimal bias-variance tradeoff for estimating $f_{0,s}$ differing in general from the optimal bias-variance tradeoff for estimating $v_{0,s}$. Plug-in estimators typically inherit much of the bias from estimating $f_{0,s}$, and this bias does not in general tend to zero sufficiently fast to allow $n^{-1/2}$-rate estimation of $\psi_{0,0}$ (Williamson et al., 2020a). In the next section, we extend the results of Williamson et al. (2020b) to describe conditions under which the estimator $\psi_{m,n}$ is asymptotically normal.

Algorithm 1. Estimation of SPVIM

1: Input initial parameter $\gamma \geq 1$.
2: Sample $m = \gamma n$ subsets from $Q_0$, denoted $s_1, \ldots, s_m$.
3: Estimate prediction functions $f_{n,s}$ for each $s \in \{s_1, \ldots, s_m\} \cup \{\emptyset, N\}$.
4: Compute predictiveness estimates $v_{n,s}$ for $s \in \{s_1, \ldots, s_m\} \cup \{\emptyset, N\}$ using a training-validation split (see Equation (4)) or $K$-fold cross-fitting (see Equation (5)).
5: Solve for $\psi_{m,n}$ using Equation (8).

3.2. Large-sample inferential properties

We now study the conditions under which $\psi_{m,n}$ is an asymptotically normal estimator of the SPVIM $\psi_{0,0}$. Using these conditions, we can design a procedure to construct confidence intervals and hypothesis tests. To do this, we decompose the error of our estimator $\psi_{m,n}$ into the following components:

$$\psi_{m,n} - \psi_{0,0} = (\psi_{0,n} - \psi_{0,0}) + (\psi_{m,0} - \psi_{0,0}) + r_{m,n},$$ (9)

where $\psi_{m,0}$ is obtained by replacing $v_{n,S}$ with $v_{0,S}$ in (7) and $r_{m,n} := (\psi_{m,n} - \psi_{m,0}) - (\psi_{0,n} - \psi_{0,0})$. Each term on the right-hand side of (9) can then be studied separately to determine the large-sample behavior of $\psi_{m,n}$. The first term is the error of the estimator $\psi_{0,n}$ in (6), constructed using prediction functions $f_{n,s}$ estimated using $n$ observations for all subsets $s$. The second term is the error of the estimator $\psi_{m,0}$ constructed using oracle prediction functions for sampled subsets in $Q_m$. In other words, the first term characterizes the error contribution from sampling training observations and the second term characterizes the error contribution from sampling subsets. The third term is a difference-in-differences remainder term that we prove to be negligible under some regularity conditions. Based on this decomposition, we will show that the asymptotic variance of $\sqrt{n}(\psi_{m,n} - \psi_{0,0})$ is simply the sum of the asymptotic variances of the first and second error terms.

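Our result makes use of several conditions that require additional notation. These conditions were initially provided in Williamson et al. (2020b). We define the linear space $\mathcal{R} := \{c(P_1 - P_2) : c \in \mathbb{R}, P_1, P_2 \in \mathcal{M}\}$ of finite signed measures generated by $\mathcal{M}$. For any $R \in \mathcal{R}$, e.g., $R = c(P_1 - P_2)$, we consider the supremum norm $\|R\|_\infty := |c| \sup_o |F_1(o) - F_2(o)|$, where $F_1$ and $F_2$ are the distribution functions corresponding to $P_1$ and $P_2$, respectively. Next, we define the following notation for each subset $s \in \mathcal{S}$. For distribution $P_{0,\epsilon} := P_0 + \epsilon h$ with $\epsilon \in \mathbb{R}$ and $h \in \mathcal{R}$, we define $f_{0,\epsilon,s} = f_{P_{0,\epsilon},s}$ to be its corresponding oracle prediction function with respect to subset $s$. Let $\dot{V}(f, P_0; h)$ denote the Gâteaux derivative of $P \mapsto V(f, P)$ at $P_0$ in the direction $h \in \mathcal{R}$, and define the random function $g_{n,s} : o \mapsto \dot{V}(f_{n,s}, P_0; \delta_o - P_0) - \dot{V}(f_{0,s}, P_0; \delta_o - P_0)$, where $\delta_o$ is the degenerate distribution on $\{o\}$. Consider the following set of deterministic [(A1)–(A4)] and stochastic [(B1)–(B2)] conditions for each subset $s \in \mathcal{S}$:

(A1) (optimality) there is some $C > 0$ such that for each sequence $f_1, f_2, \ldots \in \mathcal{F}_s$ with $\|f_j - f_{0,s}\|_{\mathcal{F}_s} \to 0$, there is a $J$ such that for all $j > J$, $|V(f_j, P_0) - V(f_{0,s}, P_0)| \leq C \|f_j - f_{0,s}\|_{\mathcal{F}_s}^2$;

(A2) (differentiability) there is some $\delta > 0$ such that for each sequence $\epsilon_1, \epsilon_2, \ldots \in \mathbb{R}$ and $h, h_1, h_2, \ldots \in \mathcal{R}$ satisfying $\epsilon_j \to 0$ and $\|h_j - h\|_\infty \to 0$, it holds that

$$\sup_{f \in \mathcal{F}_s : \|f - f_{0,s}\|_{\mathcal{F}_s} < \delta} \left| \frac{V(f, P_0 + \epsilon_j h_j) - V(f, P_0)}{\epsilon_j} - \dot{V}(f, P_0; h_j) \right| \to 0;$$

(A3) (optimizer continuity) $\|f_{0,\epsilon,s} - f_{0,s}\|_{\mathcal{F}_s} = o(\epsilon)$ for each $h \in \mathcal{R}$;

(A4) (derivative continuity) $f \mapsto \dot{V}(f, P_0; h)$ is continuous at $f_{0,s}$ relative to $\|\cdot\|_{\mathcal{F}_s}$ for each $h \in \mathcal{R}$;

(B1) (minimum rate of convergence) $\|f_{n,s} - f_{0,s}\|_{\mathcal{F}_s} = o_P(n^{-1/4})$;

(B2) (weak consistency) $E_0\left[ \int \{g_{n,s}(o)\}^2 \, dP_0(o) \right] = o_P(1)$.

The Gâteaux derivative $\dot{V}$ is provided in Williamson et al. (2020b) for several common measures of predictiveness, including classification accuracy, AUC, and $R^2$. Assuming conditions (A1)–(A4) and (B1)–(B2) hold for every subset in $\mathcal{S}$, $v_n$ is an asymptotically linear estimator of $v_0$ with influence function $\dot{V}_0 : o \mapsto [\dot{V}(f_{0,\emptyset}, P_0; \delta_o - P_0), \ldots, \dot{V}(f_{0,N}, P_0; \delta_o - P_0)]$ by Theorem 2 in Williamson et al. (2020b). Finally, we introduce a condition that specifies the number of subsets to sample:

(C1) (minimum number of subsets) for $\gamma > 0$ and a sequence $\gamma_1, \gamma_2, \ldots \in \mathbb{R}^+$ satisfying $|\gamma_j - \gamma| \to 0$, $m = \gamma_n n$.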

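For convenience, we define several objects that simplify the notation in our main result below. Set $A := Z^\top W Z$, where $Z$ is the stack of vectors $z(s)$ for all $s \in \mathcal{S}$, and define $C := A^{-1} G^\top (G A^{-1} G^\top)^{-1}$. Let the QR decomposition of $G^\top$ be

$$G^\top = \begin{bmatrix} U_1 & U_2 \end{bmatrix} \begin{bmatrix} R \\ 0 \end{bmatrix},$$

where $R$ is an upper-triangular matrix. We define the functions

$$\phi_{0,1}(O) = A^{-1} Z^\top W \dot{V}_0(O) \quad \text{and} \quad \phi_{0,2}(S; v_0) = U_2 V^{-1} \left[ z(S)^\top \psi_{0,0} - v_{0,S} \right] U_2^\top z(S),$$

where $V = U_2^\top Z^\top W Z U_2$. Assuming all of the aforementioned conditions hold, $\psi_{m,n}$ is a consistent and asymptotically normal estimator of $\psi_{0,0}$.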

Theorem 1.

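If the collection of conditions implied by (A1)–(A4) and (B1)–(B2) holds for every subset in $\mathcal{S}$ and condition (C1) holds, then $\psi_{m,n}$ has the asymptotic distribution

$$\sqrt{n}(\psi_{m,n} - \psi_{0,0}) \xrightarrow{d} N(0, \Sigma_0), \quad \text{where} \quad \Sigma_0 := \mathrm{Cov}_{P_0}(\phi_{0,1}(O)) + \gamma^{-1} \mathrm{Cov}_{Q_0}(\phi_{0,2}(S; v_0)).$$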

To construct Wald-based confidence intervals (CIs) for $\psi_{0,0}$, we estimate the asymptotic covariance $\Sigma_0$ by plugging in consistent estimators of each component. That is, we use consistent estimators $A_m$, $Z_m$, and $W_m$ of $A$, $Z$, and $W$, respectively. Note that the estimators and CIs may be constructed using only the sampled subsets. If $\psi_{0,0,j} = 0$ for any $j$, then the contribution from sampling observations to the asymptotic covariance term corresponding to index $j$ will be zero, leading to some additional complications. We discuss this case further in the next section.
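A Wald interval built from the asymptotic distribution in Theorem 1 can be sketched as below. The two variance inputs stand for estimated diagonal entries of the observation-sampling and subset-sampling covariance terms; the numbers used are hypothetical and the function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def total_variance(var_obs, var_subsets, gamma):
    """Diagonal entry of Sigma_0: observation part plus subset part / gamma."""
    return var_obs + var_subsets / gamma


def wald_interval(estimate, sigma2, n, level=0.95):
    """Two-sided Wald interval: estimate +/- z * sqrt(sigma2 / n)."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * sqrt(sigma2 / n)
    return estimate - half_width, estimate + half_width


# Hypothetical inputs: SPVIM estimate 0.19 from n = 1000 observations.
lo, hi = wald_interval(0.19, total_variance(0.5, 0.4, gamma=2.0), n=1000)
lo_more, hi_more = wald_interval(0.19, total_variance(0.5, 0.4, gamma=10.0), n=1000)
print((round(lo, 3), round(hi, 3)), (round(lo_more, 3), round(hi_more, 3)))
```

Increasing gamma (sampling more subsets per observation) shrinks only the subset-sampling part of the variance, so the interval narrows toward the width set by the observation-sampling term.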

Conditions (A1)–(A4) are required to control the contribution from estimating $f_{0,s}$ for each $s \in \mathcal{S}$. Williamson et al. (2020b) show that these conditions are satisfied for $R^2$, deviance, accuracy, and AUC. Conditions (B1)–(B2) place restrictions on the class of estimators of $f_{0,s}$ that we may consider. While condition (B1) holds for many estimators (e.g., generalized additive models (Hastie and Tibshirani, 1990)), we show in Section 5 that this condition may only need to be approximately satisfied. Condition (B2) is implied by a form of consistency of $f_{n,s}$.

Finally, condition (C1) is necessary to control the contribution from having had to estimate $Q_0$. Because $\psi_{0,n}$ is an asymptotically efficient estimator of $\psi_{0,0}$, this condition implies that sampling $m = \Theta(n)$ subsets is asymptotically optimal, up to a constant factor proportional to $\gamma^{-1}$. Intuitively, this is because there is an irremovable error contribution from having sampled $n$ training observations. As such, we simply need to sample enough subsets for the second error term in (9) to be on the same order as the first term. Moreover, because the distribution $Q_0$ places the heaviest weight on subset sizes at the extremes (closest in size to the empty set and full set), we do not need to estimate a large number of unique prediction functions in practice. To our knowledge, this is the first result that delineates the number of feature subsets to sample for constructing an asymptotically normal estimator of Shapley values.

3.3. Testing the null SPVIM hypothesis

We now use Theorem 1 to construct a test for the null hypothesis that a variable is not important, i.e., $\psi_{0,0,j} = 0$ for some $j$. When a variable $X_j$ has null importance, the true value $\psi_{0,0,j}$ is at the boundary of the parameter space, and the contribution to the asymptotic variance from sampling observations in Theorem 1 is zero. This may cause difficulties in hypothesis testing: as the number of sampled subsets grows, the contribution to the asymptotic variance from sampling subsets tends to zero. Thus, in the limit, a hypothesis test based on the estimator of this asymptotic variance proposed in the previous section will fail to appropriately control the type I error.

Instead, we rely on sample-splitting to construct a valid test of the $\delta$-null hypothesis of the $j$th SPVIM value, i.e., $H_{0,j} : \psi_{0,0,j} \in [0, \delta]$. In our approach, we make use of the fact that $\psi_{0,0,\emptyset}$ may be nonzero for some predictiveness measures (e.g., AUC). Based on one portion of the data, construct the estimator $\psi_{m,n,j,+} := \psi_{m,n,j} + \psi_{m,n,\emptyset}$ of $\psi_{0,0,j} + \psi_{0,0,\emptyset}$ and obtain an estimator $\sigma_{n,j}^2$ of the variance $\sigma_{0,j}^2 := (\Sigma_0)_{jj}$. Based on the remaining data, obtain an estimator $\psi_{m,n,\emptyset,1}$ of $\psi_{0,0,\emptyset}$ with corresponding variance estimator $\sigma_{n,\emptyset}^2$. Then, we calculate the test statistic

$$T_n := \frac{(\psi_{m,n,j,+} - \psi_{m,n,\emptyset,1}) - \delta}{\sqrt{n_1^{-1} \sigma_{n,j}^2 + 2 n_2^{-1} \sigma_{n,\emptyset}^2}}$$

and its corresponding p-value $p_n := 1 - \Phi(T_n)$, where $n_1$ and $n_2$ denote the respective sample sizes of the split dataset and $\Phi$ denotes the standard normal cumulative distribution function. We reject $H_{0,j}$ if and only if $p_n < \alpha$ for some pre-specified level $\alpha$. Under conditions (A1)–(A4), (B1)–(B2), and (C1), for any $\alpha \in (0, 1)$, the proposed test is consistent and has type I error equal to $\alpha$.
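The sample-splitting test can be sketched as a small function that takes the two point estimates and their variance estimates as inputs. The denominator follows the test statistic as printed in the text, including the factor of 2 on the null-set variance term; that reading of the formula, the function name, and all input values are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def delta_null_test(psi_plus, psi_null, var_plus, var_null, n1, n2,
                    delta=0.0, alpha=0.05):
    """One-sided test of H0: psi_j in [0, delta] using split-sample estimates.

    psi_plus: estimate of psi_j + psi_empty from the first split (size n1)
    psi_null: estimate of psi_empty from the second split (size n2)
    var_plus, var_null: the corresponding variance estimates
    """
    denom = sqrt(var_plus / n1 + 2.0 * var_null / n2)
    t_stat = ((psi_plus - psi_null) - delta) / denom
    p_value = 1.0 - NormalDist().cdf(t_stat)
    return t_stat, p_value, p_value < alpha


# Hypothetical AUC-based example: null predictiveness near 0.5.
t_stat, p_value, reject = delta_null_test(
    psi_plus=0.74, psi_null=0.50, var_plus=1.2, var_null=0.3,
    n1=1000, n2=1000, delta=0.05,
)
print(round(t_stat, 2), p_value, reject)
```

When the estimated difference equals delta exactly, the statistic is zero and the p-value is one half, so the test does not reject at conventional levels.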

4. Local and group variable importance

Until now, we have focused on a global measure of importance by integrating over the entire distribution $P_0$. For certain settings, we may be interested instead in a local version of variable importance. A simple extension of (1) or (3) allows us to define a local version of variable importance: for a subpopulation $A \subseteq \mathcal{X}$,

$$\psi_{0,0,j}(A) := \frac{1}{p} \sum_{s \subseteq N \setminus \{j\}} \binom{p-1}{|s|}^{-1} \left\{ V(f_{0,s \cup j}, P_0 \mid X \in A) - V(f_{0,s}, P_0 \mid X \in A) \right\},$$

where we have simply plugged the conditional distribution $P_0 \mid X \in A$ into (1). Taken to the extreme, where the subpopulation $A$ consists only of a single observation, this definition of local feature importance is equivalent to the SHAP values considered by Lundberg and Lee (2017), though here we use an arbitrary measure of predictiveness in place of the conditional expectation. Unfortunately, valid statistical inference on this individual-observation-level importance appears difficult, if not impossible.

In addition, if there is some scientifically meaningful partition of the features, we can extend SPVIM to these feature subgroups. For example, one may group together all measurements from the same medical device. Let the partition of features into groups be denoted $\mathcal{P} := \{s_1, \ldots, s_k\}$, where $s_i \in \mathcal{S}$, $\bigcup_{i=1}^{k} s_i = N$, and $s_i \cap s_j = \emptyset$ for every pair $i \neq j$. Then the Shapley-based population variable group importance measure may be determined as in (1), where the sum is taken over all subsets in $\mathcal{P}$.

5. Simulation study

In this section, we present simulation results validating our statistical inference procedure for SPVIM in finite samples. We consider 200 covariates $X \sim N_{200}(0, \Sigma)$. The variance-covariance matrix $\Sigma$ has diagonal equal to 1 and several correlated features: $\mathrm{Cov}(X_1, X_{11}) = 0.7$; $\mathrm{Cov}(X_3, X_{12}) = \mathrm{Cov}(X_3, X_{13}) = 0.3$; and $\mathrm{Cov}(X_5, X_{14}) = 0.05$. The covariance of the remaining feature pairs is zero. Based on these covariates, we observe a continuous outcome $Y \mid X = x \sim N(f(x), 1)$, where

$$f(x) = \sum_{j \in \{1, 3, 5\}} f_j(x_j),$$
$$f_1(x) = \mathrm{sign}(x),$$
$$f_3(x) = (-6) I(x \leq -4) + (-4) I(-4 < x \leq -2) + (-2) I(0 \leq x < 2) + 2 I(2 < x \leq 4) + 4 I(x > 4), \text{ and}$$
$$f_5(x) = (-1) I(x \leq -4 \text{ or } -2 < x \leq 0 \text{ or } 2 < x \leq 4) + I(-4 < x \leq -2 \text{ or } 0 < x \leq 2 \text{ or } x > 4).$$

In this data-generating mechanism, the vector $(X_1, X_3, X_5)$ is directly relevant to predicting the outcome, while the vector $(X_{11}, \ldots, X_{14})$ is only related to the outcome through correlation with $(X_1, X_3, X_5)$; the remaining 193 features are pure noise. We generated 1,000 random datasets for each sample size $n \in \{500, 1000, 2000, 3000, 4000\}$. The true SPVIM values for predictiveness defined in terms of $R^2$ are approximately (0.19, 0.29, 0.23, 0.04, 0.01, 0.01, 0) for the non-noise features, respectively, and zero for the remaining features.
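The covariate design described above can be reproduced with a few lines of NumPy. This sketch only builds the covariance matrix and draws covariates; it does not generate the outcome, and it is not taken from the authors' replication code.

```python
import numpy as np

p = 200
sigma = np.eye(p)
# (feature, feature, covariance) using the 1-based feature labels in the text
pairs = [(1, 11, 0.7), (3, 12, 0.3), (3, 13, 0.3), (5, 14, 0.05)]
for a, b, cov in pairs:
    sigma[a - 1, b - 1] = sigma[b - 1, a - 1] = cov

# The smallest eigenvalue must be positive for a valid covariance matrix.
min_eig = np.linalg.eigvalsh(sigma).min()

rng = np.random.default_rng(0)
x = rng.multivariate_normal(np.zeros(p), sigma, size=500)
sample_corr = np.corrcoef(x[:, 0], x[:, 10])[0, 1]  # X_1 versus X_11
print(x.shape, round(min_eig, 3), round(sample_corr, 2))
```

The matrix is block-structured, so its smallest eigenvalue comes from the most strongly correlated pair and equals 1 - 0.7 = 0.3.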

To obtain each $f_{n,s}$ we fit boosted trees (Friedman, 2001) using the Python package xgboost (Chen and Guestrin, 2016) with maximum tree depth equal to one, learning rate equal to $10^{-2}$, and $\ell_2$-regularization parameter equal to zero. The number of trees varied among {50, 100, 250, 500, 1000, …, 3000} and the $\ell_1$-regularization parameter varied among $\{10^{-3}, 10^{-2}, 0.1, 1, 5, 10\}$; the combination of these parameters was tuned using five-fold cross-validation to minimize the mean squared error (MSE).

We computed the relevant SPVIM estimator using Algorithm 1, where we sampled m = 2n subsets and estimated predictiveness using five-fold cross-fitting. For comparison, we computed the mean absolute SHAP value (Lundberg and Lee, 2017), where the average was taken over all observations. This allows us to directly evaluate the accuracy of algorithmic VIMs for estimating the population VIMs. We then computed the empirical MSE scaled by n, the empirical coverage of nominal 95% CIs, and the empirical power of our proposed hypothesis test. Finally, we compare the accuracy of our SPVIM estimates and the mean SHAP values in terms of their correlation with the true SPVIM values. All analyses were performed on a computer cluster with 32-core CPU nodes with 64 GB RAM.
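The three summary metrics named above can be computed from per-replicate outputs as in the following sketch. The replicate values are made up for illustration and the function names are not from the authors' code.

```python
def scaled_mse(estimates, truth, n):
    """Empirical mean squared error across replicates, multiplied by n."""
    return n * sum((e - truth) ** 2 for e in estimates) / len(estimates)


def coverage(intervals, truth):
    """Fraction of replicate intervals (lo, hi) that contain the truth."""
    return sum(lo <= truth <= hi for lo, hi in intervals) / len(intervals)


def rejection_rate(p_values, alpha=0.05):
    """Fraction of replicates rejecting: power, or type I error under the null."""
    return sum(pv < alpha for pv in p_values) / len(p_values)


# Hypothetical results from four replicates at n = 500, true value 0.19.
mse_n = scaled_mse([0.18, 0.20, 0.21, 0.17], truth=0.19, n=500)
cover = coverage([(0.15, 0.22), (0.16, 0.24), (0.20, 0.25), (0.12, 0.18)], 0.19)
power = rejection_rate([0.01, 0.20, 0.04, 0.60])
print(mse_n, cover, power)
```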

We display the results of this experiment in Figure 1. We see that as n increases, the scaled empirical MSE of our estimator decreases to a fixed level — namely, the scaled empirical variance — for each feature. This matches our expectations from Section 3.2: the scaled empirical bias of our proposed estimator should tend to zero with increasing sample size, while the scaled empirical variance tends to the asymptotic variance. We note here that while boosted trees are a popular estimation procedure, they do not necessarily satisfy condition (B1) (see, e.g., Zhang and Yu, 2005). Thus, the convergence observed here provides some empirical evidence that condition (B1) may only need to hold approximately in practice. We also find that the coverage of nominal 95% confidence intervals increases to the nominal level as the sample size increases. Our proposed hypothesis test controls the type I error rate and is consistent: the empirical type I error rate is at the nominal level for null feature X6, while the empirical power is near one for each of the directly important features. Power tends to be small for the indirectly important features (X11, …, X14), especially at small sample sizes; this reflects the fact that the importance of these features is closer to the null hypothesis than the importance of the directly relevant features. Finally, we see that SPVIM estimates are more correlated with the true population importance than SHAP values. We provide the estimated SPVIM and mean absolute SHAP values in the Supplement.

Figure 1.


Performance of our statistical inference procedure for estimating the Shapley-based population variable importance (SPVIM) with respect to $R^2$ using $n$ training observations and $2n$ sampled subsets. (A, E) Empirical MSE for the proposed plug-in estimator scaled by $n$ for $j \in \{1, 3, 5, 6\}$ and $j \in \{11, 12, 13, 14\}$, respectively; (B, F) Empirical coverage of nominal 95% confidence intervals; (C, G) Empirical power of the hypothesis testing procedure for the null hypothesis that the $j$th variable has null importance; (D) Kendall's tau between the true and estimated SPVIM values using our approach versus the mean absolute SHAP value.

6. Predicting mortality of patients in the intensive care unit

We now analyze data on patients’ stays in the ICU from the Multiparameter Intelligent Monitoring in Intensive Care II (MIMIC-II) database (Silva et al., 2012). We consider 4000 records on several features: five general descriptors collected upon admission to the ICU, and 15 features — including the Glasgow Coma Scale (GCS), blood urea nitrogen (BUN), and heart rate — measured over the course of the first 48 hours after admission to the ICU. The outcome of interest is in-hospital mortality. Rather than use the entire time series, we simplify the analysis by first computing the minimum, average, and maximum value of each of the time-series features used in the simplified acute physiology (SAPS) I or II scores. The SAPS scores are established measures for estimating the mortality risk of ICU patients. We then remove any features that are measured in fewer than 70% of the patients. When combined with the general descriptor variables, a total of 37 extracted features remain. We provide a full list of these extracted features in the Supplement.
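The feature-extraction step can be sketched in plain Python as follows: drop any time-series feature measured in fewer than 70% of patients, then summarize each remaining series by its minimum, mean, and maximum. The record layout, the function name, and the toy values are assumptions for illustration; the actual MIMIC-II preprocessing may differ.

```python
from statistics import mean


def extract_features(records, min_coverage=0.7):
    """records: {patient_id: {feature_name: [measurements...]}}.

    Returns {patient_id: {"<feature>_min" / "_mean" / "_max": value}} for
    features measured in at least `min_coverage` of patients.
    """
    names = sorted({name for series in records.values() for name in series})
    n = len(records)
    kept = [
        name for name in names
        if sum(1 for series in records.values() if series.get(name)) / n >= min_coverage
    ]
    table = {}
    for pid, series in records.items():
        row = {}
        for name in kept:
            vals = series.get(name) or []
            for label, fn in (("min", min), ("mean", mean), ("max", max)):
                # Patients without any measurement get a missing value.
                row[f"{name}_{label}"] = fn(vals) if vals else None
        table[pid] = row
    return table


# Toy records (hypothetical values): BUN is measured in only 2 of 4 patients.
records = {
    "a": {"GCS": [15, 13, 14], "BUN": [20.0]},
    "b": {"GCS": [7, 9]},
    "c": {"GCS": [10], "BUN": [35.0, 31.0]},
    "d": {"GCS": [15, 15]},
}
table = extract_features(records)
print(table["a"])
```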

We estimate the SPVIM for each variable using AUC to measure predictiveness. For comparison, we also provide the mean absolute SHAP value obtained from Tree SHAP (Lundberg et al., 2020) and Kernel SHAP (Lundberg and Lee, 2017); and the proportion of times a feature was selected across test instances using LIME (Ribeiro et al., 2016). We discuss conditions under which the mean absolute SHAP value is a suitable proxy for the SPVIM in Section 2.2 in the Supplement.

We obtained estimates of each f_{0,s} using two separate procedures. In the first analysis, we maximized the empirical log likelihood using boosted trees with maximum depth equal to four, learning rate equal to 10⁻³, and a number of estimators in {2000, 4000, …, 12000} selected using five-fold cross-validation. In the second analysis, we maximized the empirical log likelihood by fitting ensembles of five dense ReLU neural networks (NNs) with architectures chosen from {(37, 25, 25, 20, 10, 1), (37, 25, 20, 1), (37, 25, 20, 20, 1)} using five-fold cross-validation. The NNs were trained using Adam (Kingma and Ba, 2014) with a maximum of 2000 iterations and with ℓ2 regularization parameter equal to 0.1. We again used five-fold cross-fitting to estimate the predictiveness measures for the sampled subsets. Using our procedure, we fit models for only 119 unique subsets and computed SPVIM estimates in two hours for each analysis. LIME had similar computation time (1.7 hours) in the case of NNs, but longer computation time (4 hours) in the case of trees. The computation time of both our procedure and LIME falls between the highly specialized Tree SHAP algorithm, which completed in a few minutes, and the general-purpose Kernel SHAP, which took approximately 20 hours.

In Figure 2, we display the estimates from each VIM and both estimation procedures. We first focus on the SPVIM estimates provided in Panel A. The GCS is estimated to be the most important feature using both trees and NNs, though different summaries of the GCS are most important across the two procedures (mean for trees and max for NNs). This result matches prior knowledge: GCS is used to assess the level of consciousness of patients and is the highest scoring item in the SAPS scores. We find that the confidence intervals for SPVIM are quite wide, which is important information for placing the results in context.

Figure 2.

We estimated importance of features for predicting in-hospital death in the ICU using our statistical inference procedure for SPVIM with respect to AUC (A), the mean absolute SHAP value (B), and LIME (C). Gray circles and blue triangles denote estimates from fitting boosted trees and neural networks, respectively. The features are ordered from top to bottom by their point estimate from the neural networks procedure. 95% confidence intervals only appear in (A) since there is no statistical inference procedure for mean absolute SHAP values or LIME.

Next, we compare the agreement between rankings calculated based on the fitted boosted trees and NNs for the SPVIM estimates, mean absolute SHAP values (Figure 2 panel B), and LIME (panel C). There is considerably more agreement between the two procedures for the SPVIM estimates than for the SHAP value estimates and LIME proportions. The estimated Kendall’s tau between procedures is 0.71 for our SPVIM estimator vs 0.37 for SHAP and 0.39 for LIME. Given the large discrepancies between the algorithmic VIMs, we conclude that they are poor proxies for our population VIM. Instead, one should use a procedure specifically designed to estimate SPVIM.
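The rank-agreement comparison above relies on Kendall's tau between two sets of importance estimates for the same features. A minimal sketch of the statistic follows. It computes the tau-a variant and assumes no tied values; the text does not state which variant or software was used, so this is an illustration rather than a reproduction of the reported numbers.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a between two equal-length score lists (no ties assumed).

    Counts feature pairs ordered the same way by both lists (concordant)
    versus pairs ordered in opposite ways (discordant).
    """
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs
```

Here `x` and `y` would hold, for example, the tree-based and NN-based estimates of one VIM in the same feature order. A value of 1 means the two procedures rank every pair of features identically, and a value of −1 means the rankings are exactly reversed.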

Finally, we find that the feature rankings within trees or NNs from our SPVIM estimator, SHAP, and LIME are substantively different. One noticeable difference is that SHAP and LIME values for several summary statistics derived from the same measurement (e.g., min, mean, and max GCS) differ widely; this should not occur, since these summary statistics are highly correlated. On the other hand, SPVIM estimates for summary statistics derived from the same measurement tend to be more similar.

7. Discussion

We have proposed a computationally tractable statistical inference procedure for the Shapley population variable importance measure (SPVIM). Methods for estimating SPVIM are complementary to those for estimating algorithmic variable importance. The former helps us understand the underlying data-generating mechanism and can help guide future experiments; the latter helps us interpret a particular fitted model. Here, we define SPVIM with respect to an arbitrary measure of predictiveness, allowing the data analyst to select the most appropriate measure for the task at hand. Since the SPVIM is also defined relative to the population, the target of inference is not affected by the choice of prediction algorithm. We have derived the asymptotic distribution of an SPVIM estimator based on randomly sampled feature subsets, and have used this distribution to construct asymptotically normal point estimates, valid confidence intervals, and hypothesis tests with the correct type I error control. Notably, we determined a minimum number of feature subsets to sample: we show that our estimator only needs to fit prediction models for m = Θ(n) sampled subsets for its error to be on the same order as an estimator that fits prediction models for all possible subsets.
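To make the cost of using all possible subsets concrete, the following sketch computes exact Shapley values by brute force for a toy predictiveness function. It is not the sampling-based estimator proposed in this manuscript; `toy_predictiveness` and its numeric values are hypothetical, chosen only so that the result can be checked by hand.

```python
from itertools import combinations
from math import factorial


def shapley_values(p, value):
    """Exact Shapley values for features 0..p-1.

    value: function mapping a frozenset of feature indices to a number
           (here, a stand-in for the predictiveness of that subset).
    Evaluates every subset, so the cost grows as 2**p.
    """
    phi = []
    for j in range(p):
        others = [k for k in range(p) if k != j]
        total = 0.0
        for size in range(p):
            # Shapley weight for subsets of this size that exclude j
            weight = factorial(size) * factorial(p - size - 1) / factorial(p)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {j}) - value(s))
        phi.append(total)
    return phi


def toy_predictiveness(s):
    """Hypothetical measure: features 0 and 1 overlap, feature 2 is null."""
    v = 0.0
    if 0 in s:
        v += 0.3
    if 1 in s:
        v += 0.2
    if 0 in s and 1 in s:
        v -= 0.1  # shared information is not counted twice
    return v


phi = shapley_values(3, toy_predictiveness)
```

The three values sum to the predictiveness of the full feature set minus that of the empty set, and the null feature receives zero importance. With the 37 features of the ICU analysis, the same brute-force loop would need 2^37 (about 1.4 × 10^11) subset evaluations, each requiring a fitted prediction model, which is why sampling subsets matters.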

In this manuscript, we have focused on quantifying the importance of a variable averaged across the entire population. Local importance measures can be obtained by restricting to smaller subpopulations. However, as the subpopulations decrease in size, the uncertainty of our estimates increases. Our asymptotic results do not apply to the most extreme case, the variable importance at the level of a single observation. Nevertheless, this value may be of interest in some tasks. Further work is necessary to define relevant importance measures at the single-observation level and derive procedures with the desired performance.

Finally, we caution against interpreting SPVIM estimates in a causal manner. SPVIM reflects importance in the oracle prediction model rather than importance in the oracle causal model. In many scientific applications, the importance in the causal model is of ultimate interest. To get causal importance, one may need to employ techniques from causal inference. Recent developments relating prediction models and causal models may also be of use in these cases (Arjovsky et al., 2019).

Acknowledgments

The authors wish to thank Jessica Perry, Noah Simon, the anonymous reviewers, and the meta-reviewer for insightful comments that improved this manuscript. BDW was supported by NIH award F31 AI140836. The opinions expressed in this manuscript are those of the authors and do not necessarily represent the official views of the NIAID or the NIH.

Footnotes

Supplementary Material

Technical details are available in the Supplement. Code is available on Github at bdwilliamson/spvim_supplementary.

1. In the Shapley value literature, this additivity property is referred to as “efficiency”. However, this notion of efficiency is very different from statistical efficiency, which is related to the asymptotic variance of a statistical estimator.

References

  1. Arjovsky M, Bottou L, Gulrajani I, and Lopez-Paz D. Invariant risk minimization. arXiv:1907.02893, 2019. [Google Scholar]
  2. Boyd S and Vandenberghe L. Convex optimization. Cambridge University Press, 2004. [Google Scholar]
  3. Breiman L. Random forests. Machine Learning, 45(1): 5–32, 2001. [Google Scholar]
  4. Castro J, Gómez D, and Tejada J. Polynomial calculation of the Shapley value based on sampling. Computers & Operations Research, 36(5):1726–1730, 2009. [Google Scholar]
  5. Charnes A, Golany B, Keane M, and Rousseau J. Extremal principle solutions of games in characteristic function form: core, Chebychev and Shapley value generalizations. In Sengupta JK and Kadekodi GK, editors, Econometrics of Planning and Efficiency, pages 123–133. Springer, 1988. [Google Scholar]
  6. Chen T and Guestrin C. XGBoost: A Scalable Tree Boosting System. arXiv:1603.02754, 2016. [Google Scholar]
  7. Covert I, Lundberg S, and Lee SI. Understanding global feature contributions through additive importance measures. arXiv, 2020. https://arxiv.org/abs/2004.00668. [Google Scholar]
  8. Dunning AJ. A model for immunological correlates of protection. Statistics in Medicine, 25(9):1485–1497, 2006. [DOI] [PubMed] [Google Scholar]
  9. Feng J, Williamson BD, Carone M, and Simon N. Non-parametric variable importance using an augmented neural network with multi-task learning. In Proceedings of the 35th International Conference on Machine Learning, pages 1495–1504, 2018. [Google Scholar]
  10. Friedman JH. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5):1189–1232, 2001. [Google Scholar]
  11. Garson DG. Interpreting neural network connection weights. Artificial Intelligence Expert, 1991. [Google Scholar]
  12. Grömping U. Estimators of relative importance in linear regression based on variance decomposition. The American Statistician, 61(2):139–147, 2007. [Google Scholar]
  13. Hastie TJ and Tibshirani RJ. Generalized Additive Models, volume 43. CRC Press, 1990. [Google Scholar]
  14. Kingma D and Ba J. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014. [Google Scholar]
  15. Lundberg SM and Lee S-I. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, 2017. [Google Scholar]
  16. Lundberg SM, Erion G, Chen H, DeGrave A, et al. From local explanations to global understanding with explainable AI for trees. Nature Machine Intelligence, 2(1):56–67, 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Murdoch WJ, Singh C, Kumbier K, Abbasi-Asl R, and Yu B. Interpretable machine learning: definitions, methods, and applications. arXiv:1901.04592, 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Nathans LL, Oswald FL, and Nimon K. Interpreting multiple linear regression: A guidebook of variable importance. Practical Assessment, Research & Evaluation, 17 (9), 2012. [Google Scholar]
  19. Owen AB and Prieur C. On Shapley value for measuring importance of dependent units. SIAM/ASA Journal on Uncertainty Quantification, 5, 2017. doi: 10.1137/16M1097717. [DOI] [Google Scholar]
  20. Ribeiro MT, Singh S, and Guestrin C. Why should I trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144, 2016. [Google Scholar]
  21. Shapley LS. A value for n-person games. Contributions to the Theory of Games, 2(28):307–317, 1953. [Google Scholar]
  22. Silva I, Moody G, Scott DJ, Celi LA, and Mark RG. Predicting in-hospital mortality of ICU patients: The PhysioNet/Computing in Cardiology Challenge 2012. In Computing in Cardiology (CinC), 2012. IEEE, 2012. [PMC free article] [PubMed] [Google Scholar]
  23. Štrumbelj E and Kononenko I. Explaining prediction models and individual predictions with feature contributions. Knowledge and Information Systems, 41(3):647–665, 2014. [Google Scholar]
  24. Williamson BD, Gilbert PB, Carone M, and Simon N. Non-parametric variable importance assessment using machine learning techniques. Biometrics, (to appear), 2020a. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Williamson BD, Gilbert PB, Simon N, and Carone M. A unified approach for inference on algorithm-agnostic variable importance. arXiv, 2020b. https://arxiv.org/abs/2004.03683. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Zhang T and Yu B. Boosting with early stopping: Convergence and consistency. The Annals of Statistics, 33(4): 1538–1579, 2005. [Google Scholar]
