Variable importance measures for heterogeneous causal effects

Oliver J Hines; Karla Diaz-Ordaz; Stijn Vansteelandt

doi:10.1093/biomtc/ujaf140

Author manuscript; available in PMC: 2026 Mar 6.

Published in final edited form as: Biometrics. 2025 Dec 24;81(4):ujaf140. doi: 10.1093/biomtc/ujaf140


PMCID: PMC7618827 EMSID: EMS212112 PMID: 41437933

Abstract

Motivated by applications in precision medicine and treatment effect heterogeneity, recent research has focused on estimating conditional average treatment effects (CATEs) using machine learning (ML). CATE estimates may represent complicated functions that provide little insight into the key drivers of heterogeneity. Therefore, we introduce nonparametric treatment effect variable importance measures (TE-VIMs), based on the mean-squared error (MSE) in predicting the individual treatment effect. More precisely, TE-VIMs represent the increase in MSE when variables are removed from the CATE conditioning set. We derive efficient TE-VIM estimators which can be used with any CATE estimation strategy and are amenable to ML estimation. We propose several strategies to calculate these VIMs (e.g. leave-one-out or keep-one-in), using popular meta-learners for the CATE. We study the finite-sample performance of the estimators through a simulation study and illustrate their application using clinical trial data.

Keywords: Causal inference, Conditional effects, Data-adaptive estimation, Effect modification

1. Introduction

In the medical and social sciences there is a longstanding interest in quantifying the heterogeneity in the effect of a treatment or intervention on a population. Understanding such heterogeneity is essential for informing scientific research and optimizing treatment decisions. Attention focused initially on subgroup analyses, which identify population subgroups that benefit most/least from treatment, to be evaluated further in potential future studies (Rothwell, 2005; Slamon et al., 2001). Typical challenges of subgroup analyses are selecting stratification variables in a systematic way and handling the resulting multiplicity problem. Endeavours to address these were soon followed by methodological developments on personalized medicine in causal inference, pioneered by Murphy (2003), with the primary focus being on policy learning; i.e., determining the treatment policy that minimizes some measure of population risk (van der Laan and Luedtke, 2014; Athey and Wager, 2021).

Recently, attention has partially shifted towards learning the conditional average treatment effect (CATE) τ(x) ≡ E(Y ¹ − Y ⁰|X = x), where Y^a is the outcome observed if treatment A were set to a ∈ {0, 1}, and X ∈ ℝ^p are pre-treatment covariates (Athey and Imbens, 2016; Wager and Athey, 2018; Künzel et al., 2019; Kennedy, 2023). The CATE provides insight into the magnitude of the treatment effect for each individual and the optimal dynamic treatment rule (OTR), obtained from the sign of the CATE (VanderWeele et al., 2019).

These foregoing developments are important, but leave unanswered a key question: what are the key drivers of treatment effect heterogeneity? Answers to this question may inform about treatment mechanism, suggest future therapies, help compare clinical trial populations, or help quantify systematic treatment biases (e.g. due to race or socio-economic status). One pioneering proposal for CATE variable attribution is to extend random forest variable importance measures (VIMs) through the ‘causal forest’ estimator of the CATE (Athey et al., 2019; Wager and Athey, 2018). The resulting VIMs rely on the tree architecture of causal forests, and may inherently assign greater importance to continuous variables, or categorical variables with many categories (Grömping, 2009). VIMs based on specific modeling strategies (decision trees, linear regression, etc.) are referred to as ‘algorithmic’ (Williamson and Feng, 2020). Whilst algorithmic-VIMs can provide useful insights, there remains a need for model-agnostic alternatives, especially when the chosen CATE estimating strategy does not have a well-established algorithmic-VIM, but also in order to compare VIMs following different CATE estimation procedures. Therefore, we propose treatment effect VIMs (TE-VIMs) that are nonparametric summary statistics, which measure the importance of variable subsets in predicting the individual treatment effect (ITE) Y ¹ − Y ⁰. TE-VIMs are relatively easy to communicate to researchers already familiar with traditional goodness-of-fit methods such as ANOVA, and, unlike algorithmic-VIMs, allow researchers to compare heterogeneous treatment effect insights across different CATE learning algorithms.

More precisely, we consider the mean-squared-error L{f} ≡ E[{Y ¹ − Y ⁰ − f (X)}²], which for arbitrary f : ℝ^p ↦ ℝ is not identified without strong assumptions on the joint distribution of (Y¹, Y⁰) (Levy et al., 2021; Ding et al., 2016). One key insight is that L{f} = E[{τ (X) − f (X)}²] + E{var(Y ¹ − Y ⁰|X)} comprises a first term that is identified under standard causal assumptions and another that does not depend on f. Exploiting this decomposition, we define the TE-VIM Θ_s ≡ L{τ_s} − L{τ}, where τ_s(x) ≡ E(Y ¹ − Y ⁰|X_−s = x_−s) is the CATE conditional on X_−s, and u_−s denotes the vector of all the components of u with index not in s ⊆ {1, …, p}. Note that τ_s(x) only depends on x_−s, but we write it as a function of x to simplify notation. We interpret Θ_s = var{τ (X)} − var{τ_s(X)} in terms of the variance var{τ (X)} of the treatment effect (VTE), a global measure of treatment effect heterogeneity due to Levy et al. (2021); see Supplement D.1. Specifically, Θ_s ≥ 0 represents the reduction in CATE variance when variables in s are excluded from the conditioning set. Thus, Θ_s quantifies the treatment effect heterogeneity explained by X_s, beyond that already explained by X_−s, where u_s denotes the vector of all components of u with index in s.
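The variance representation of Θ_s follows in a few lines from the definitions above. The sketch below assumes only the tower property E{τ(X)|X_−s} = τ_s(X), which also gives E{τ(X)} = E{τ_s(X)}; it is offered as a reading aid, not as the derivation in Supplement D.1.

```latex
\begin{align*}
\Theta_s &= L\{\tau_s\} - L\{\tau\}
          = E[\{\tau(X) - \tau_s(X)\}^2] \\
         &= E\{\tau(X)^2\} - 2E\{\tau(X)\,\tau_s(X)\} + E\{\tau_s(X)^2\} \\
         &= E\{\tau(X)^2\} - E\{\tau_s(X)^2\}
            \qquad \text{since } E\{\tau(X)\,\tau_s(X)\} = E\{\tau_s(X)^2\} \\
         &= \mathrm{var}\{\tau(X)\} - \mathrm{var}\{\tau_s(X)\}
            \qquad \text{since } E\{\tau(X)\} = E\{\tau_s(X)\}.
\end{align*}
```

The first equality uses the fact that the term E{var(Y¹ − Y⁰|X)} is common to L{τ_s} and L{τ} and therefore cancels.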

Moreover, TE-VIMs connect to regression-VIMs (Williamson et al., 2021; Zhang and Janson, 2020), also called ‘leave-out covariates’ (Lei et al., 2018; Verdinelli and Wasserman, 2024a), and the nonparametric VIM framework of Williamson et al. (2023). The latter framework covers VIMs that represent differences in value (negative risk) functions. Our work represents a step towards applying this framework in more complicated settings, such as in causal inference where identification may be a concern.

In Section 2 we derive efficient TE-VIM estimators and motivate the DR-learner of the CATE by interpreting our estimators in terms of pseudo-outcomes (Kennedy, 2023). Results on simulated data are presented in Section 3 and in Section 4 we use TE-VIMs to identify drivers of treatment effect heterogeneity in a clinical trial setting. In Section 5 we compare TE-VIMs with existing OTR-VIMs and outline extensions to continuous treatments.

2. Methodology

2.1. Motivating the estimand

We consider n i.i.d. observations (z₁, …, z_n) of a random variable Z = (Y, A, X) ~ P₀ distributed according to an unknown distribution P₀ ∈ ℳ and consisting of an ‘outcome’ Y ∈ ℝ, an ‘exposure’ or ‘treatment’ A ∈ {0, 1}, and covariates X ∈ ℝ^p. In a slight abuse of notation, we let p denote the index set {1, …, p} so that τ_p is the average treatment effect (ATE) and Θ_p is the VTE, which we assume to be non-zero. Assuming consistency (A = a ⇒ Y = Y^a), conditional exchangeability (Y^a ⫫ A|X for a = 0, 1), and positivity (0 < π(X) < 1 almost surely), the CATE is identified by τ(x) = μ(1, x) − μ(0, x), where μ(a, x) ≡ E(Y |A = a, X = x) and π(x) ≡ E(A|X = x) is the ‘propensity score’. We let ‖f(Z)‖ ≡ E[{f(Z)}²]^{1/2} denote the L₂(P₀) norm of an arbitrary function f, and when f is estimated from data, the expectation is taken only over the random inputs Z. Assuming that ‖τ(X)‖ < ∞, the TE-VIM Θ_s = ‖τ(X) − τ_s(X)‖² is also finite. By construction, s′ ⊆ s ⊆ p implies $Θ_{s'} \leq Θ_{s} \leq Θ_{p}$, i.e. the set s′ cannot be more important than s. Generally, the covariates used to define the CATE need not be the same as the covariates X required for conditional exchangeability to hold. For example, one could consider the target estimand $L\{τ_{s'}\} - L\{τ_{s}\} = Θ_{s'} - Θ_{s}$, which quantifies the importance of s′ with s treated as the full covariate set. This extension follows from results for Θ_s, which we study in the current work.

The TE-VIM Θ_s is analogous to the regression-VIM Ω_s ≡ ‖Y − μ_s(X)‖² − ‖Y − μ(X)‖² due to Williamson et al. (2021), which replaces the ITE Y ¹ − Y⁰ with Y, and hence τ (x) with μ(x) ≡ E(Y |X = x) and τ_s(x) with μ_s(x) ≡ E(Y |X_−s = x_−s). The two proposals differ, however, in how the VIM is scaled, with Williamson et al. (2021) defining the scaled regression-VIM Ω_s/var(Y) in analogy with the familiar R² statistic. Since var(Y ¹ − Y⁰) is not easily identified, we instead propose scaled TE-VIMs as Ψ_s ≡ Θ_s/Θ_p = 1 − var{τ_s(X)}/Θ_p. Like an R² statistic, we interpret Ψ_s ∈ [0, 1] as the proportion of treatment effect heterogeneity explained by X_−s compared with X. For instance, under the linear model E(Y^a|X = x) = μ(a, x) = β(x) + aτ (x), where β(x) and τ (x) are both linear in x, Ψ_s is the limiting R² value obtained from a linear regression of the effect modifier τ (X) on X_−s. Moreover, Ψ_s and Ω_s/var(Y) are both invariant to linear transformations of the outcome and invertible component-wise transformations of X; see Supplement D.2 for details. In practice, VIM scaling makes little difference when using Ψ_s and $Ψ_{s'}$ to compare the relative importance of s and s′. Instead, the main decision for investigators is which covariate sets should be compared, and we identify the following modes of operation in this regard:

Leave-one-out (LOO): The set s contains a single covariate of interest. This mode may ‘under represent’ importance when covariates are highly correlated.
Keep-one-in (KOI): The set s contains all but a single covariate of interest. This mode may ‘over represent’ importance when covariates are highly correlated, and is less sensitive to multi-covariate interactions.
Shapley values: TE-VIMs for all possible (2^p) covariate combinations are considered and aggregated in a game theoretic manner (Owen and Prieur, 2017; Williamson and Feng, 2020). This is a theoretically appealing compromise between LOO and KOI, but may be computationally impractical for even modest p and render clinical interpretation more subtle. A definition of TE-VIM Shapley values is given in Supplement D.3.
Covariate grouping: Domain specific knowledge is used to group covariates, simplifying the above modes by considering covariates block-wise (e.g. comparing biological vs. non-biological factors).

Note that we use ‘under’ and ‘over’ represent in a relative sense, since the ground truth variable importance depends on the importance definition used. See e.g. Hama et al. (2022); Verdinelli and Wasserman (2024b) for recent discussions on this topic and Tang and Westling (2025) for variable selection proposals. In Section 4 we demonstrate the LOO and KOI modes through an applied example.
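The modes above differ only in which index sets s are passed to the TE-VIM estimator. The following Python sketch enumerates those sets for p covariates indexed 1, …, p; the function names are hypothetical and chosen for illustration only.

```python
from itertools import combinations


def loo_sets(p):
    """Leave-one-out: each s contains a single covariate index."""
    return [frozenset({j}) for j in range(1, p + 1)]


def koi_sets(p):
    """Keep-one-in: each s contains every index except one."""
    full = frozenset(range(1, p + 1))
    return [full - {j} for j in range(1, p + 1)]


def all_subsets(p):
    """Every subset of {1, ..., p}, as needed for Shapley aggregation."""
    idx = range(1, p + 1)
    return [frozenset(c) for r in range(p + 1) for c in combinations(idx, r)]
```

With p = 2 the LOO and KOI collections contain the same two sets, and `all_subsets(p)` has 2^p elements, which is why the Shapley mode becomes expensive as p grows.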

2.2. CATE estimation

Estimation of TE-VIMs will rely on initial CATE estimates, obtained via flexible machine learning based methods, which we review first. CATE estimation is challenging since common machine learning algorithms (random forests, neural networks, boosting etc.) are designed for mean outcome regression, e.g. by minimizing the mean squared error loss. CATE estimation strategies therefore take one of two routes. Some modify existing machine learning methods to target CATEs; e.g. Athey et al. (2019) modify the random forest algorithm for CATE estimation. Alternatively, ‘metalearning’ strategies decompose CATE estimation into a sequence of sub-regression problems, which can be solved using off-the-shelf machine learning algorithms (Künzel et al., 2019; Kennedy, 2023).

In the current work we focus on two metalearning algorithms which, following the naming convention of Künzel et al. (2019) and Kennedy (2023), we refer to as the T-learner and the DR-learner. The T-learner is based on the decomposition τ(x) = μ(1, x) − μ(0, x), and estimates the CATE by $\hat{τ}^{(T)}(x) \equiv \hat{μ}(1, x) - \hat{μ}(0, x)$, where $\hat{μ}(a, x)$ represents an estimate of μ(a, x) obtained by a regression of Y on X using observations where A = a. The T-learner, however, is problematic for two main reasons. Firstly, whilst regularization methods can be used to control the smoothness of $\hat{μ}(a, x)$, the same is not true of $\hat{τ}^{(T)}(x)$, which may be erratic. Slow convergence rates affecting $\hat{μ}(a, x)$ may therefore propagate into $\hat{τ}^{(T)}(x)$. Secondly, $\hat{μ}(1, x)$ is chosen to make an optimal bias-variance trade-off over the covariate distribution of the treated population. Likewise, $\hat{μ}(0, x)$ is chosen to make an optimal bias-variance trade-off over the covariate distribution of the untreated population. When there is poor overlap between the treated and untreated subgroups, $\hat{τ}^{(T)}(x)$ may fail to deliver an optimal bias-variance trade-off over the population covariate distribution, making the T-learner potentially poorly targeted towards CATE estimation.

The DR-learner (Kennedy, 2023; Luedtke and van der Laan, 2016; van der Laan, 2013) is an alternative metalearning algorithm based on the decomposition τ (x) = E{φ(Z)|X = x} where, for z = (y, a, x)

$$φ(z) \equiv \{y - μ(a, x)\} \frac{a - π(x)}{π(x)\{1 - π(x)\}} + μ(1, x) - μ(0, x) \tag{1}$$

is called the ‘pseudo-outcome’, or the augmented inverse propensity weighted score (Robins et al., 1994), and acts like the ITE in expectation. The DR-learner first estimates μ(a, x) and π(x) to obtain the pseudo-outcome estimator $\hat{φ}(Z)$, where μ(a, x) and π(x) in (1) are replaced with estimates $\hat{μ}(a, x)$ and $\hat{π}(x)$.

In a second step, the estimated pseudo-outcome $\hat{φ}(Z)$ is regressed on covariates X to obtain $\hat{τ}^{(DR)}(x)$. A sample splitting scheme is also recommended, whereby the regression steps to obtain $\hat{μ}(a, x)$, $\hat{π}(x)$, and $\hat{τ}^{(DR)}(x)$ are performed on three independent samples.
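As a concrete reading of the pseudo-outcome in (1), the following minimal Python sketch evaluates it for one observation, given fitted nuisance values at that observation's covariates. The function name and argument names are hypothetical; how the nuisance values are fitted is left open, as in the text.

```python
def pseudo_outcome(y, a, mu1, mu0, pi):
    """Pseudo-outcome (1) for a single observation.

    y   : observed outcome
    a   : observed binary treatment (0 or 1)
    mu1 : estimate of mu(1, x)
    mu0 : estimate of mu(0, x)
    pi  : estimate of the propensity score pi(x), strictly between 0 and 1
    """
    mu_a = mu1 if a == 1 else mu0  # outcome regression at the observed arm
    weight = (a - pi) / (pi * (1.0 - pi))  # signed inverse-propensity weight
    return (y - mu_a) * weight + (mu1 - mu0)
```

For a treated unit the weight reduces to 1/π(x), and for an untreated unit to −1/{1 − π(x)}, so the first term is an inverse-propensity-weighted residual added to the plug-in contrast μ(1, x) − μ(0, x).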

The DR-learner alleviates the issues related to the T-learner since the complexity of $\hat{τ}^{(DR)}(x)$ can be controlled by regularizing the regression in the final stage of the procedure, mitigating concerns regarding the smoothness of the T-learner. With regard to consistency, taking expectation over the random inputs Z, the square of $E\{\hat{φ}(Z) ∣ X = x\} - τ(x)$ is bounded above by the product of the squared estimation errors of the propensity score and regression estimators (up to constant scaling). In practice, this means that the regression of $\hat{φ}(Z)$ on X mimics the oracle regression of φ(Z) on X provided that

$$‖\{π(X) - \hat{π}(X)\}\{μ(a, X) - \hat{μ}(a, X)\}‖ = o_{P}(n^{-1/2}) \text{ for } a = 0, 1, \tag{A1}$$

and a suitable cross-fitting procedure is used. (A1) implies that one can trade-off accuracy in the outcome and propensity score estimators, a property known as double robustness, hence the name ‘DR-learner’. Note that the n^−1/2 rate in (A1) differs from the ‘stability’ condition provided for the DR-learner by Kennedy (2023, Proposition 1), as it reflects what will be required by our TE-VIM estimators in Section 2.3.

Estimation of the CATE τ_s(x) is complicated by the fact that one cannot assume that Y^a ⫫ A|X_−s for an arbitrary subset of covariates s, a problem that is sometimes referred to as ‘runtime confounding’ (Coston et al., 2020). The DR-learner readily accommodates runtime confounding through the identity τ_s(x) = E{φ(Z)|X_−s = x_−s}. This identity implies that one may estimate τ_s(x) by regressing $\hat{φ}(Z)$ on X_−s, i.e. modifying the final regression step of the DR-learner.

We recommend a metalearner for τ_s(x) based on the identity τ_s(x) = E{τ (X)|X_−s = x_−s}. Specifically, we propose estimating τ_s(x) by regressing initial CATE estimates, $\hat{τ}(X)$, on X_−s via a machine learning algorithm. Like the DR-learner, one can regularize the resulting CATE estimator $\hat{τ}_{s}(x)$. We advocate this approach, since it usually results in estimates of τ_s(x) which are compatible with those of τ (x), in the sense of respecting the fact that the two conditional means are related. For instance, we expect that E{φ(Z)} = E{τ (X)} = E{τ_s(X)}, but these equalities are generally violated for the corresponding estimated quantities $E\{\hat{φ}(Z)\}$, $E\{\hat{τ}(X)\}$, and $E\{\hat{τ}_{s}(X)\}$.

2.3. TE-VIM estimation

2.3.1. Estimation of Θ_s

We consider estimators based on the efficient influence curve (IC) of Θ_s under the non-parametric model. Briefly, ICs are mean zero functions that characterize the sensitivity of so-called pathwise differentiable estimands to small changes in the data generating law. Thus, ICs are useful for constructing efficient estimators and determining their asymptotic distribution, see e.g. Hines et al. (2022) for an introduction.

TE-VIMs fall under the VIM framework of Williamson et al. (2023), for which generic IC results are available. These results cannot be directly applied, however, since the risk L{f} is not identified. Instead, we consider that Θ_s = L_U {τ_s} − L_U {τ} where

$$L_{U}\{f\} = E[\{U + τ(X) - f(X)\}^{2}] = E[\{τ(X) - f(X)\}^{2}] + E(U^{2})$$

and U is a random variable such that E(U|X) = 0. Note that U = Y¹ −Y⁰ −τ(X) recovers the unidentified risk L{f} and U = φ(Z) − τ (X) recovers the DR-learner population risk. To simplify derivations we consider the risk L*{f} obtained by setting U = 0. Theorem 3 of Williamson et al. (2023) states that, for this risk, there is no price to pay to estimate its minimizer τ_s(x), insofar as the IC for L*{τ_s} that is derived when τ_s(x) is known is the same as that derived when τ_s(x) is unknown. In Supplement A we use point mass contamination to show that, for known f, the IC of L*{f} at a single observation z is {φ(z) − f(x)}² − {φ(z) − τ (x)}² − L* {f}. Hence, applying the aforementioned Theorem, the IC of Θ_s is

$$ϕ_{s}(z) \equiv \{φ(z) - τ_{s}(x)\}^{2} - \{φ(z) - τ(x)\}^{2} - Θ_{s}. \tag{2}$$

The interpretation of φ(Z) as a pseudo-outcome, which plays the role of the unobserved ITE Y¹ − Y⁰, holds in the present context. To see why, we compare (2) to the IC of Ω_s by Williamson et al. (2021), {y − μ_s(x)}² − {y − μ(x)}² − Ω_s, which is of the same form as (2), but with the outcome y replacing the pseudo-outcome φ(z).
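That the IC in (2) has mean zero can be verified directly. The sketch below uses only E{φ(Z)|X} = τ(X) and the identity Θ_s = ‖τ(X) − τ_s(X)‖²; it is a consistency check, not the point mass contamination argument of Supplement A.

```latex
\begin{align*}
\{\varphi(Z) - \tau_s(X)\}^2 - \{\varphi(Z) - \tau(X)\}^2
  &= \{\tau(X) - \tau_s(X)\}^2
     + 2\{\tau(X) - \tau_s(X)\}\{\varphi(Z) - \tau(X)\}, \\
E\big[\{\tau(X) - \tau_s(X)\}\{\varphi(Z) - \tau(X)\}\big]
  &= E\big[\{\tau(X) - \tau_s(X)\}\, E\{\varphi(Z) - \tau(X) \mid X\}\big] = 0, \\
E\big[\{\varphi(Z) - \tau_s(X)\}^2 - \{\varphi(Z) - \tau(X)\}^2\big]
  &= \|\tau(X) - \tau_s(X)\|^2 = \Theta_s .
\end{align*}
```

Subtracting Θ_s therefore centers the expression, as required of an influence curve.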

The IC in (2) may be used to construct efficient estimating equation estimators of Θ_s by setting (an estimate of) the sample mean IC to zero. This strategy is equivalent to the so-called one-step correction outlined in Supplement B. We thus obtain the estimator

$$\hat{Θ}_{s} \equiv n^{-1} \sum_{i=1}^{n} \left[\{\hat{φ}(z_{i}) - \hat{τ}_{s}(x_{i})\}^{2} - \{\hat{φ}(z_{i}) - \hat{τ}(x_{i})\}^{2}\right], \tag{3}$$

where superscript hat denotes consistent estimators. In practice, we recommend a cross-fitting procedure of the type described in Algorithms SS-A and SS-B, to obtain the fitted models and evaluate the estimators using a single sample (Chernozhukov et al., 2018; Zheng and van der Laan, 2011). We discuss the reasons for sample splitting with reference to Theorem 1, which gives conditions under which ${\hat{Θ}}_{s}$ is regular asymptotically linear (RAL).

Theorem 1

Assume that there exist constants ϵ, K, δ ∈ (0, ∞) such that almost surely $\hat{π}(X) \in (ϵ, 1 - ϵ)$, $\mathrm{var}\{φ(Z) ∣ X\} < K$, and $‖\hat{τ}(X) - \hat{τ}_{s}(X)‖ < δ$. Suppose also that at least one of the following two conditions holds:

Sample splitting: $\hat{π}(x), \hat{μ}(x), \hat{τ}(x)$, and $\hat{τ}_{s}(x)$ are obtained from a sample independent of the one used to construct $\hat{Θ}_{s}$.
Donsker condition: The quantities $\{φ(Z) - \hat{τ}(X)\}^{2}$, $\{φ(Z) - \hat{τ}_{s}(X)\}^{2}$, and $\{\hat{τ}(X) - \hat{τ}_{s}(X)\}\hat{φ}(Z)$ fall within a P₀-Donsker class with probability approaching 1.

Finally assume (A1) holds, and (A2) that $‖τ(X) - \hat{τ}(X)‖$ and $‖τ_{s}(X) - \hat{τ}_{s}(X)‖$ are both o_P (n^−1/4). Then $\hat{Θ}_{s}$ is asymptotically linear with IC ϕ_s(Z), and hence $\hat{Θ}_{s}$ converges to Θ_s in probability, and for Θ_s > 0, $n^{1/2}(\hat{Θ}_{s} - Θ_{s})$ converges in distribution to a mean-zero normal random variable with variance ‖ϕ_s(Z)‖².

Assumptions (A1-A2) both require nuisance function estimators to converge at sufficiently fast rates. The requirement for n^1/4 rate convergence in (A2) is standard in the recent VIM framework of Williamson et al. (2023), whilst (A1) is additionally required to control for errors in estimating the pseudo-outcomes.

Together, (A1-A2) suggest that the DR-learner may be preferred over the T-learner due to its robustness. In particular, the T-learner of the CATE satisfies $∥ τ (X) - {\hat{τ}}^{(T)} (X) ∥ = o_{P} (n^{- 1 / 4})$ , provided that $‖ μ (a, X) - \hat{μ} (a, X) ‖ = o_{P} (n^{- α})$ , with α ≥ 1/4 for a = 0, 1. (A1) then implies that $‖ π (X) - \hat{π} (X) ‖$ must be at least o_P (n^−1/2+α). I.e. the propensity score estimator may converge at a slower rate, if the outcome estimator converges at a faster rate, but the converse is not true since fast convergence of the outcome estimator is needed to assure sufficiently fast convergence of the T-learner. This is unsatisfying for example in clinical trial settings, where the exposure is randomized and propensity scores are known.

The DR-learner, however, satisfies $‖τ(X) - \hat{τ}^{(DR)}(X)‖ = o_{P}(n^{-1/4})$ when (A1) holds and $‖E\{\hat{φ}(Z) ∣ X\} - \hat{τ}^{(DR)}(X)‖ = o_{P}(n^{-1/4})$, i.e. when the final DR-learning regression estimator is consistent at n^−1/4 rate. Applying the same reasoning as before, (A1) implies that $‖μ(A, X) - \hat{μ}(A, X)‖$ can be o_P (n^−α), if $‖π(X) - \hat{π}(X)‖$ is o_P (n^−1/2+α), for any α ∈ (0, 1/2). In other words, the outcome estimator is allowed to converge at a slower rate, provided the propensity score estimator converges at a faster rate and vice-versa, which marks an improvement over the T-learner, at the expense of an additional requirement on the final DR-learning step. The requirement on the DR-learning step, however, will likely be weaker than that on the outcome estimator, since τ (x) is likely smoother than μ(a, x), e.g. when the CATE depends only on a subset of X.

The Donsker condition in Theorem 1 controls the ‘empirical process’ term in the estimator expansion (Newey and Robins, 2018; Hines et al., 2022). This condition is usually not guaranteed to hold when flexible machine learning methods are used to estimate nuisance functions. Fortunately, cross-fitting using two or more splits of the data offers a way of avoiding Donsker conditions, at the expense of making nuisance functions more computationally expensive to learn (Chernozhukov et al., 2018; Zheng and van der Laan, 2011).

2.3.2. Importance testing

One property shared by $\hat{Θ}_{s}$ and the analogous Ω_s estimator (Williamson et al., 2021) concerns their behavior under the zero-importance null hypothesis, H₀ : Θ_s = 0. For TE-VIMs, H₀ corresponds to treatment effect homogeneity τ (x) = τ_s(x), in which case ϕ_s(Z) = 0. This IC degeneracy makes H₀ difficult to test, since $\hat{Θ}_{s}$ is not necessarily asymptotically normal. For this reason, Theorem 1 considers the asymptotic distribution only when Θ_s > 0. One solution to the IC degeneracy problem is to estimate var{τ (X)} and var{τ_s(X)} using efficient estimators in separate samples (Williamson et al., 2023). Each estimand has a non-zero IC provided that var{τ_s(X)} > 0, despite both ICs being identical under H₀. Thus both estimators are independent and asymptotically normal, hence their difference (an estimator of Θ_s) is also asymptotically normal even when Θ_s = 0. One therefore obtains a valid Wald-type test for H₀, at the expense of using an estimator for Θ_s that is inefficient because each component estimator uses only half of the data. Similarly, one could test the zero-VTE null hypothesis (var{τ (X)} = 0) by estimating E{τ²(X)} and E{τ (X)}² using efficient estimators in separate samples and taking their difference. Such approaches are an active area of research and evaluating them in the context of TE-VIMs is beyond the scope of the current work (Hudson, 2023; Guo and Shah, 2025). Moreover, the distribution of $\hat{Θ}_{s}$ under H₀ may depend on higher-order pathwise derivatives of the estimand, which are also an open research area (Carone et al., 2018).

2.3.3. Estimation of Ψ_s

The scaled TE-VIM Ψ_s ≡ Θ_s/Θ_p, has IC, Φ_s(z) ≡ {ϕ_s(z) − Ψ_sϕ_p(z)}/Θ_p, where ϕ_p(z) denotes (2) for the index set s = p. This IC implies an estimating equations estimator, ${\hat{Ψ}}_{s} = {\hat{Θ}}_{s} / {\hat{Θ}}_{p}$ , where ${\hat{Θ}}_{p}$ is the VTE estimator obtained when ${\hat{τ}}_{s} (x)$ in (3) is replaced with an ATE estimate ${\hat{τ}}_{p}$ .

The estimators ${\hat{Θ}}_{p}$ and ${\hat{Ψ}}_{s}$ both rely on an ATE estimator ${\hat{τ}}_{p}$ . The IC of the ATE, φ(z) − τ_p, implies an efficient estimator ${\hat{τ}}_{p}^{*} \equiv n^{- 1} \sum_{i = 1}^{n} \hat{φ} (z_{i})$ , known as the augmented inverse propensity weighted (AIPW) estimator (Robins et al., 1994). We recommend the AIPW estimator in the current context since ${\hat{Θ}}_{p}$ and ${\hat{Ψ}}_{s}$ are locally insensitive to small perturbations in ${\hat{τ}}_{p}$ about ${\hat{τ}}_{p}^{*}$ . To see why, consider the partial derivative

$$\frac{\partial \hat{Θ}_{p}}{\partial \hat{τ}_{p}} = \frac{\partial}{\partial \hat{τ}_{p}} n^{-1} \sum_{i=1}^{n} \left[\{\hat{φ}(z_{i}) - \hat{τ}_{p}\}^{2} - \{\hat{φ}(z_{i}) - \hat{τ}(x_{i})\}^{2}\right] = -2(\hat{τ}_{p}^{*} - \hat{τ}_{p}),$$

which is zero at ${\hat{τ}}_{p} = {\hat{τ}}_{p}^{*}$ . This orthogonality means that uncertainty in the AIPW estimator can be ignored when estimating Θ_p, see Vermeulen and Vansteelandt (2015) for details.

Like ϕ_s(z), Φ_s(z) degenerates to Φ_s(z) = 0 when Ψ_s = 0 or when Ψ_s = 1, i.e. Θ_s = 0 or Θ_s = Θ_p. For this reason, the asymptotic normality of $\hat{Ψ}_{s}$, described in Theorem 2, holds only for Ψ_s ∈ (0, 1), i.e. when covariates in s account for some, but not all, heterogeneity. Similar estimators for log(Θ_s) and logit(Ψ_s) are derived in Supplement D.5, which may be used to construct alternative bounded estimators $\hat{Θ}_{s}^{*} > 0$ and $\hat{Ψ}_{s}^{*} \in (0, 1)$.

Theorem 2

Assume that the conditions in Theorem 1 are satisfied, Θ_p > 0, and there exists δ ∈ (0, ∞) such that $‖\hat{τ}(X) - \hat{τ}_{p}^{*}‖ < δ$. Then $\hat{Ψ}_{s}$, with the ATE estimated by $\hat{τ}_{p} = \hat{τ}_{p}^{*}$, is asymptotically linear with IC Φ_s(Z), and hence $\hat{Ψ}_{s}$ converges to Ψ_s in probability, and for Ψ_s ∈ (0, 1), $n^{1/2}(\hat{Ψ}_{s} - Ψ_{s})$ converges in distribution to a mean-zero normal random variable with variance ‖Φ_s(Z)‖².

2.3.4. Plug-in estimation

A common framework for constructing debiased estimators is through targeted maximum likelihood estimators (TMLEs) (van der Laan and Gruber, 2016). TMLEs are ‘plug-in’, in that they are defined through estimand mappings, e.g. Θ_s : ℳ ↦ [0, ∞) evaluated at a distribution estimate $\hat{P}_{n} \in ℳ$. Despite the apparent similarity of $\hat{Θ}_{s}$ with the representation Θ_s = E [{φ(Z) − τ_s(X)}² − {φ(Z) − τ (X)}²], the estimators $\hat{Θ}_{s}$, $\hat{Θ}_{p}$, and $\hat{Ψ}_{s}$ are not plug-in. This is evident from the fact that $\hat{Θ}_{s}$ may take negative values, whereas such codomain violations are not possible for plug-in estimators. We emphasize this point since Williamson et al. (2021) use ‘plug-in’ to refer to representation similarities, and their Ω_s estimator is also not plug-in in the estimand mapping sense. Moreover, TMLEs for Θ_s are challenging, since the targeting step must target compatible estimators for τ(x_i) and τ_s(x_i) simultaneously. The VTE does not suffer this issue, with TMLEs proposed by Levy et al. (2021). TMLEs for TE-VIMs are investigated by Li et al. (2023).

2.3.5. Algorithms

The estimators ${\hat{Θ}}_{s}, {\hat{Θ}}_{p}$ , and ${\hat{Ψ}}_{s}$ are indexed by the choice of pseudo-outcome and CATE estimators. Generally, we are unrestricted in the choice of CATE metalearner, and outcome and propensity score learners. We propose two Algorithms based on the T- and DR-learners, with and without sample splitting. In Algorithm noSS substeps marked (A) and (B) refer to the T- and DR-learners respectively. Where the algorithms require models to be ‘fitted’, any suitable regression method/learner can be used.

Each algorithm returns pseudo-outcome and CATE estimates, $\{\hat{φ}_{i}\}_{i=1}^{n}$, $\{\hat{τ}_{i}\}_{i=1}^{n}$, and $\{\hat{τ}_{s,i}\}_{i=1}^{n}$, which can be used to obtain $\hat{τ}_{p} = n^{-1} \sum_{i=1}^{n} \hat{φ}_{i}$ and the uncentered ICs $\hat{ϕ}_{i,s} = \{\hat{φ}_{i} - \hat{τ}_{s,i}\}^{2} - \{\hat{φ}_{i} - \hat{τ}_{i}\}^{2}$ and $\hat{ϕ}_{i,p} = \{\hat{φ}_{i} - \hat{τ}_{p}\}^{2} - \{\hat{φ}_{i} - \hat{τ}_{i}\}^{2}$. These imply the estimators $\hat{Θ}_{s} = n^{-1} \sum_{i=1}^{n} \hat{ϕ}_{i,s}$, $\hat{Θ}_{p} = n^{-1} \sum_{i=1}^{n} \hat{ϕ}_{i,p}$, and $\hat{Ψ}_{s} = \hat{Θ}_{s} / \hat{Θ}_{p}$, with variances respectively estimated by $n^{-2} \sum_{i=1}^{n} (\hat{ϕ}_{i,s} - \hat{Θ}_{s})^{2}$, $n^{-2} \sum_{i=1}^{n} (\hat{ϕ}_{i,p} - \hat{Θ}_{p})^{2}$, and $(n \hat{Θ}_{p})^{-2} \sum_{i=1}^{n} (\hat{ϕ}_{i,s} - \hat{Ψ}_{s} \hat{ϕ}_{i,p})^{2}$. The algorithms differ in their CATE metalearner and use of sample splitting. Comparing Algorithms SS-A and SS-B, the DR-learner requires additional cross-fitting because it is trained on pseudo-outcome estimates that are learned from a separate sample. As such, Algorithm SS-B has 𝒪(K²) regression operations compared with 𝒪(K) for SS-A.
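The point estimates and variance estimates described above are simple averages of the fitted values, and can be sketched in a few lines of Python. The function name and dictionary keys below are hypothetical, and the inputs are assumed to be the per-observation pseudo-outcome, CATE, and reduced-CATE estimates returned by whichever algorithm is used.

```python
from math import sqrt


def te_vim_estimates(phi, tau, tau_s):
    """TE-VIM estimates and standard errors from per-observation fits.

    phi, tau, tau_s: equal-length sequences holding the estimated
    pseudo-outcomes, full CATE estimates, and reduced CATE estimates.
    Assumes the estimated VTE is non-zero (otherwise psi_s is undefined).
    """
    n = len(phi)
    tau_p = sum(phi) / n  # ATE estimate: sample mean of pseudo-outcomes
    # Uncentered influence-curve contributions for Theta_s and Theta_p.
    ic_s = [(f - ts) ** 2 - (f - t) ** 2 for f, t, ts in zip(phi, tau, tau_s)]
    ic_p = [(f - tau_p) ** 2 - (f - t) ** 2 for f, t in zip(phi, tau)]
    theta_s = sum(ic_s) / n
    theta_p = sum(ic_p) / n
    psi_s = theta_s / theta_p
    # Variance estimates as given in the text.
    var_theta_s = sum((v - theta_s) ** 2 for v in ic_s) / n ** 2
    var_theta_p = sum((v - theta_p) ** 2 for v in ic_p) / n ** 2
    var_psi_s = sum((u - psi_s * v) ** 2 for u, v in zip(ic_s, ic_p)) / (n * theta_p) ** 2
    return {
        "theta_s": theta_s, "se_theta_s": sqrt(var_theta_s),
        "theta_p": theta_p, "se_theta_p": sqrt(var_theta_p),
        "psi_s": psi_s, "se_psi_s": sqrt(var_psi_s),
    }
```

Because the estimators are not plug-in, `theta_s` can be negative in finite samples, and no truncation is applied in this sketch.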

Algorithm noSS: No sample splitting

(1) Fit $\hat{μ} (., .)$ and $\hat{π} (.)$. Use these fitted models to obtain ${\hat{φ}}_{i} \equiv \hat{φ} (z_{i})$.
(2) (A) Use the model for $\hat{μ} (., .)$ from Step 1 to obtain $\hat{τ} (x) \equiv \hat{μ} (1, x) - \hat{μ} (0, x)$. Or (B) Fit $\hat{τ} (.)$ by regressing $\hat{φ} (Z)$ on X. After doing (A) or (B), use the fitted models to obtain ${\hat{τ}}_{i} \equiv \hat{τ} (x_{i})$.
(3) Fit ${\hat{τ}}_{s} (.)$ by regressing $\hat{τ} (X)$ on X_−s. Use the fitted model to obtain ${\hat{τ}}_{s, i} \equiv {\hat{τ}}_{s} (x_{i})$.
(4) Optionally repeat Step 3 for other covariate sets of interest.

Algorithm SS-A: Sample splitting with the T-Learner

(1) Split the data into K ≥ 2 folds. For each fold k: Fit $\hat{μ} (., .)$ and $\hat{π} (.)$ using the data set excluding fold k. Use these fitted models to obtain ${\hat{φ}}_{i} \equiv \hat{φ} (z_{i})$ and ${\hat{τ}}_{i} \equiv \hat{μ} (1, x_{i}) - \hat{μ} (0, x_{i})$ for i in fold k.
(2) Fit ${\hat{τ}}_{s} (.)$ by regressing $\hat{μ} (1, X) - \hat{μ} (0, X)$ on X_−s using the data excluding fold k. Use the fitted model to obtain ${\hat{τ}}_{s, i} \equiv {\hat{τ}}_{s} (x_{i})$ for i in fold k.
(3) Optionally repeat Step 2 for other covariate sets of interest. End for.

Algorithm SS-B: Sample splitting with the DR-Learner

(1) Split the data into K ≥ 3 folds. For each pair of folds j ≠ k: Fit $\hat{μ} (., .)$ and $\hat{π} (.)$ using the data set that excludes both folds j and k. Use these fitted models to obtain ${\hat{φ}}_{i}^{(k)} \equiv \hat{φ} (z_{i})$ for i in fold j, and ${\hat{φ}}_{i}^{(j)} \equiv \hat{φ} (z_{i})$ for i in fold k. End for.
(2) For each fold k: Fit $\hat{τ} (.)$ by regressing ${\hat{φ}}^{(k)} (Z)$ on X using the data excluding fold k. Use the fitted models to obtain ${\hat{τ}}_{i} \equiv \hat{τ} (x_{i})$ for i in fold k.
(3) Obtain ${\hat{φ}}_{i} \equiv {(K - 1)}^{- 1} \sum_{j \neq k} {\hat{φ}}^{(j)} (z_{i})$ for i in fold k.
(4) Fit ${\hat{τ}}_{s} (.)$ by regressing $\hat{τ} (X)$ on X_−s using the data excluding fold k. Use the fitted model to obtain ${\hat{τ}}_{s, i} \equiv {\hat{τ}}_{s} (x_{i})$ for i in fold k.
(5) Optionally repeat Step 4 for other covariate sets of interest. End for.
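To make the difference in computational cost concrete, the held-out fold sets used for the nuisance fits in Step 1 of each algorithm can be enumerated. The sketch below assumes that Algorithm SS-B needs one fit of $\hat{μ}$ and $\hat{π}$ per unordered pair of folds, since a single fit excluding folds j and k serves both folds; the function name `nuisance_fit_schedule` is hypothetical.

```python
from itertools import combinations

def nuisance_fit_schedule(n_folds):
    """Held-out fold sets for the (mu, pi) fits in Step 1 of SS-A and SS-B."""
    folds = range(n_folds)
    ss_a = [(k,) for k in folds]            # SS-A: one fit per excluded fold
    ss_b = list(combinations(folds, 2))     # SS-B: one fit per excluded pair {j, k}
    return ss_a, ss_b

ss_a, ss_b = nuisance_fit_schedule(8)
n_fits_a, n_fits_b = len(ss_a), len(ss_b)   # 8 and 28 when K = 8
```

Under this reading, SS-B requires K(K − 1)/2 Step 1 fits against K for SS-A, which is the quadratic versus linear growth noted above.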

3. Simulation study

We compared all Algorithms (K = 8 folds) under three data generating processes (DGPs). For each, generalized additive models (GAMs) were used to estimate μ(a, x), π(x), τ_s(x), and in the case of the DR-learner, τ(x). GAMs are flexible spline smoothing models implemented in the mgcv R package (Wood et al., 2016), and for DGPs 1 and 2, (X₁, X₂) interaction terms were included. Propensity score models were additive on the logit scale.

DGP 1

We generated 500 datasets for each size n ∈ {500, 1000, 2000, 3000, 4000, 5000} according to X₁, X₂ ~ Uniform(−1, 1), A ~ Bernoulli{expit(−0.4X₁+0.1X₁X₂)}, $τ^{(1)} (X) = X_{1}^{3} + 1.4 X_{1}^{2} + 25 X_{2}^{2} / 9$ , and $Y \sim \mathcal{N}_{1} (X_{1} X_{2} + 2 X_{2}^{2} - X_{1} + A τ^{(1)} (X), 1)$ , where 𝒩_p(μ, Σ) denotes a p-dimensional normal variable with mean μ and covariance matrix Σ. In this case τ_p = 1.39 and Θ_p = 1.003. Since we consider only two covariates, the LOO and KOI TE-VIM modes are equivalent, with Θ₁ = 0.32, Θ₂ = 0.69, Ψ₁ = 0.32, and Ψ₂ = 0.68.
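The true values quoted for DGP 1 can be checked numerically. The sketch below assumes that X₁ and X₂ are independent and uses Gauss-Legendre quadrature, which integrates the polynomial τ⁽¹⁾ exactly over the uniform distribution; variable names are illustrative.

```python
import numpy as np

nodes, weights = np.polynomial.legendre.leggauss(20)
w = weights / 2.0                                # Uniform(-1, 1) has density 1/2
x1, x2 = np.meshgrid(nodes, nodes, indexing="ij")
w2 = np.outer(w, w)                              # joint weights under independence

tau = x1**3 + 1.4 * x1**2 + 25 * x2**2 / 9
tau_p = np.sum(w2 * tau)                         # ATE
theta_p = np.sum(w2 * (tau - tau_p) ** 2)        # VTE

tau_given_x2 = (w[:, None] * tau).sum(axis=0)    # E{tau | X2}: average over x1
tau_given_x1 = (w[None, :] * tau).sum(axis=1)    # E{tau | X1}: average over x2

theta_1 = np.sum(w2 * (tau - tau_given_x2[None, :]) ** 2)   # importance of X1
theta_2 = np.sum(w2 * (tau - tau_given_x1[:, None]) ** 2)   # importance of X2
psi_1, psi_2 = theta_1 / theta_p, theta_2 / theta_p
```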

DGP 2

The setup is as in DGP 1, but with τ⁽¹⁾(X) replaced by τ⁽²⁾(X) = τ⁽¹⁾(X)/10. In this way, the relative importance of X₁ and X₂ is unchanged, but the overall effect size and heterogeneity are much smaller. This results in τ_p = 0.139, Θ_p = 0.01003, Θ₁ = 0.0032 and Θ₂ = 0.0069, but the scaled TE-VIMs Ψ₁ = 0.32 and Ψ₂ = 0.68 are the same as in DGP 1.

DGP 3

We generated 500 datasets with n = 5000 according to

$(X_{1}, X_{2}), (X_{3}, X_{4}), (X_{5}, X_{6}) \sim \mathcal{N}_{2} \left(0, \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix}\right)$,

A ~ Bernoulli{expit(−0.4X₁ +0.1X₁X₂+0.2X₅)}, τ⁽³⁾(X) = X₁+2X₂+X₃, Y ~ 𝒩₁(X₃ − X₆ + Aτ⁽³⁾(X), 3). In this case τ_p = 0 and Θ_p = 8, but the LOO and KOI TE-VIM modes are not equivalent (see Table 1). Under the KOI mode, some importance is assigned to X₄ due to its correlation with X₃. Greater importance is also assigned to X₁ than to X₃, due to the correlation of X₁ with X₂. The LOO mode assigns little importance to X₁ and X₃, since they are correlated with X₂ and X₄ respectively. Shapley values represent a compromise between these modes, but require 2^p = 64 TE-VIMs to be evaluated. For this reason we compared only the LOO and KOI modes in our simulation.

Table 1. True TE-VIM values for DGP 3.

For the KOI mode we report Θ_p − Θ_s. Scaled TE-VIMs are obtained by dividing each value by Θ_p = 8. The Shapley values sum to Θ_p by design.

Target covariate	Leave-one-out	Keep-one-in	Shapley
X₁	0.75	4	2.375
X₂	3	6.25	4.625
X₃	0.75	1	0.875
X₄	0	0.25	0.125
X₅	0	0	0
X₆	0	0	0
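Because τ⁽³⁾ is linear and the covariates are jointly Gaussian, true values of this kind are available in closed form: for a covariate subset S, var[E{τ⁽³⁾(X)|X_S}] = c′Σ_{·S} Σ_{SS}⁻¹ Σ_{S·} c, where c is the coefficient vector of τ⁽³⁾. The following sketch evaluates the three columns under that formula; the function names `explained` and `shapley` are hypothetical, and the Shapley weights are the standard ones for a value function defined on subsets.

```python
import numpy as np
from itertools import combinations
from math import factorial

p = 6
block = np.array([[1.0, 0.5], [0.5, 1.0]])
sigma = np.kron(np.eye(3), block)                # block-diagonal covariance of X
c = np.array([1.0, 2.0, 1.0, 0.0, 0.0, 0.0])     # tau(X) = X1 + 2 X2 + X3

def explained(subset):
    """var[E{tau(X) | X_S}] for jointly Gaussian X and linear tau."""
    idx = list(subset)
    if not idx:
        return 0.0
    cov_s_tau = sigma[idx] @ c                   # cov(X_S, tau)
    return float(cov_s_tau @ np.linalg.solve(sigma[np.ix_(idx, idx)], cov_s_tau))

theta_p = explained(range(p))                    # VTE, var{tau(X)}
loo = [theta_p - explained([k for k in range(p) if k != j]) for j in range(p)]
koi = [explained([j]) for j in range(p)]         # reported as Theta_p - Theta_s

def shapley(j):
    """Shapley value of covariate j for the value function `explained`."""
    others = [k for k in range(p) if k != j]
    total = 0.0
    for size in range(p):
        weight = factorial(size) * factorial(p - size - 1) / factorial(p)
        for subset in combinations(others, size):
            total += weight * (explained(subset + (j,)) - explained(subset))
    return total

shap = [shapley(j) for j in range(p)]
```

The efficiency property of Shapley values means `sum(shap)` should equal `theta_p`, which gives a quick consistency check on the Shapley column.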


3.1. Results

For each dataset, Θ_s and Ψ_s were estimated, along with their standard errors and Wald based (95%) confidence intervals (CIs). For DGPs 1 and 2, we also report the empirical probability that TE-VIMs correctly rank X₂ as more important than X₁. Figure 1 shows bias, variance, and coverage plots for DGP 1. Additional plots for DGP 2 and DGP 3 are in Supplement C.

Figure 1. Dashed lines indicate zero bias and nominal 95% CI coverage. For readability, a small amount of ‘jitter’ has been added to the sample size, n. In order to present scaled and unscaled TE-VIMs together, the standard deviation and bias of scaled TE-VIMs have been multiplied by the true VTE. Note that the bias and variance scales differ between sub-plots A, B and D, E.

For all DGPs, we see that TE-VIM estimators which do not use sample splitting tend to overestimate TE-VIMs (positive bias), whilst sample splitting estimators tend to underestimate TE-VIMs (negative bias). For DGP 1, we observe that, in small samples, DR-learner based algorithms (-B) produce larger bias, variance, and reduced CI coverage, than their T-learner counterparts (-A). Moreover, the scaled TE-VIM estimators tend to have smaller bias and variance than TE-VIM estimators. This trend appears reversed in the low-heterogeneity regime (DGP 2) when sample splitting is used. We believe this is due to extreme inverse weighting in the VTE estimate ${\hat{Θ}}_{p}$ , which appears in the denominator of ${\hat{Ψ}}_{1}$ and ${\hat{Ψ}}_{2}$ .

For DGP 1, all algorithms recover the correct ranking with a high degree of accuracy. For DGP 2, this accuracy is reduced and conclusions based on scaled and unscaled TE-VIMs do not always agree, with the latter generally being more correct when cross-fitting is used (see Supplement C). For a given dataset, the ranking of scaled and unscaled TE-VIMs differs only when ${\hat{Θ}}_{p} < 0$ . Therefore, we recommend that scaled TE-VIMs are only used when sensible VTE estimates are obtained, though TE-VIMs are also scientifically less relevant when there is little heterogeneity to account for.

In DGP 3 we observe that null importance does not seem to affect estimator bias, but does lead to reduced estimator standard deviations, as expected from theory, and decreased CI coverage. This phenomenon is especially clear when examining covariate X₄, which has, in truth, null importance under the LOO mode, but not under the KOI mode. For the X₄ LOO TE-VIM estimators we observe low variance and low CI coverage, whereas for the X₄ KOI TE-VIM estimators we see higher variance and closer to nominal coverage.

4. Applied example: AIDS clinical trial

The AIDS Clinical Trials Group Protocol 175 (ACTG175) (Hammer et al., 1996) considers 2139 HIV patients with CD4 T-cell count between 200 and 500 mm⁻³ randomized to 4 treatment groups: (i) zidovudine (ZDV); (ii) didanosine (ddI); (iii) ZDV+ddI; (iv) ZDV+zalcitabine. We compare groups (ii) and (iii) (A = 0, 1), with 561 and 522 patients respectively. We consider CD4 count at 20±5 weeks as a continuous outcome Y and 12 baseline covariates, 5 continuous: age, weight, Karnofsky score, CD4 count, CD8 count; and 7 binary: sex, homosexual activity (y/n), race (white/non-white), symptomatic (y/n), intravenous drug use history (y/n), hemophilia (y/n), and antiretroviral history (experienced/naive).

TE-VIMs for each covariate were estimated using all algorithms with K = 10 folds (around 10 folds is typical for cross-fitting procedures). A constant propensity score of 522/1083 ≈ 0.48 was used, since treatment is randomized. Fitted models for the outcome and CATEs were obtained using the ‘discrete’ Super Learner (van der Laan et al., 2007), an ensemble learning method which selects the regression algorithm in a ‘learner library’ that minimizes some cross-validated risk. We used the SuperLearner R package implementation of this algorithm with 10 cross-validation folds, mean-squared-error loss, and a learner library containing various routines (glm, glmnet, gam, xgboost, ranger). Similar results are obtained when Super Learner regression is ablated and replaced with GAMs (see Supplement F).

AIPW estimates of the ATE using pseudo-outcomes from Algorithms noSS, SS-A, and SS-B were similar, respectively: 28.2mm⁻³ (CI: 14.0, 42.3; p<0.01); 28.4mm⁻³ (CI: 13.8, 42.9; p<0.01); 27.9mm⁻³ (CI: 13.3, 42.5; p<0.01), where all CIs are reported at the 95% confidence level and p-values are of Wald type. VTE estimates differed substantially between algorithms with/without cross-fitting. Algorithms noSS-A and -B returned estimates of 3100mm⁻⁶ (CI: 1410, 4790; p<0.01) and 3600mm⁻⁶ (CI: 1810, 5380; p<0.01), whereas Algorithms SS-A and -B returned 1260mm⁻⁶ (CI: -425, 2940; p=0.14) and 1250mm⁻⁶ (CI: -580, 3080; p=0.18). It is helpful to also consider the square root of the VTE estimates, which is on the same scale as the ATE. These are 55.7, 60.0, 35.5, and 35.3mm⁻³ for Algorithms noSS-A, -B, SS-A, and -B respectively. Based on the VTE CIs from Algorithms SS-A and -B, low treatment effect heterogeneity is a concern in this analysis. Figures 2 and 3 show unscaled and scaled TE-VIM estimates using LOO and KOI modes. All Algorithms rank CD4 count and homosexual activity as the most important covariates, with CD8 count also ranked highly by Algorithm SS-B under the KOI mode. We also observe that (i) standard errors are small for unimportant covariates, as expected due to the importance testing issues in Section 2.3.2; (ii) Algorithms SS-A and -B produce scaled TE-VIM point estimates outside of (0, 1) more often than noSS-A and -B; and (iii) Algorithms noSS-A and -B do not produce scaled or unscaled TE-VIM point estimates with 95% CIs overlapping zero.
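The reported intervals and p-values are linked by simple Wald arithmetic, which can be used as a sanity check on any row. The sketch below backs out the standard error from a reported 95% CI and recomputes the two-sided normal p-value for the SS-A VTE estimate; the function name `wald_summary` is hypothetical, and small discrepancies are expected because the published figures are rounded.

```python
from math import erfc, sqrt

def wald_summary(estimate, ci_lower, ci_upper, z_crit=1.959964):
    """Standard error, z-statistic and two-sided p-value implied by a Wald CI."""
    se = (ci_upper - ci_lower) / (2 * z_crit)
    z = estimate / se
    p_value = erfc(abs(z) / sqrt(2))      # equals 2 * {1 - Phi(|z|)}
    return se, z, p_value

se, z, p = wald_summary(1260, -425, 2940)  # SS-A VTE estimate and its CI
root_vte = sqrt(1260)                      # same scale as the ATE
```

Here `p` rounds to 0.14 and `root_vte` to 35.5, in line with the values reported for Algorithm SS-A.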

Figure 2. Error bars indicate 95% CIs. In each plot, covariates are sorted according to their TE-VIM point estimate. Dashed lines indicate no importance. For the KOI mode, the TE-VIM represents the importance of the complement variable set, i.e. low values denote high importance of the KOI covariate.

Figure 3. Error bars indicate 95% CIs. In each plot, covariates are sorted according to their TE-VIM point estimate. Dashed lines indicate the [0, 1] support of the scaled TE-VIM. For the KOI mode, the TE-VIM represents the importance of the complement variable set, i.e. low values denote high importance of the KOI covariate.

5. Related work and extensions

Here we discuss VIMs for the OTR and TE-VIMs for continuous treatments. Discussions of treatment effect scales, linear CATE projections (Boileau et al., 2023) and treatment effect cumulative distribution functions (Levy and van der Laan, 2018) are in Supplement D.

5.1. Optimal treatment rule - VIMs

Although TE-VIMs capture the importance of variables in explaining the CATE, this should not be confused with the importance of those variables in determining the OTR d(x) ≡ 𝕀{τ (x) > 0}, where 𝕀(.) is an indicator and, w.l.o.g., a greater outcome is preferred. For instance, treatment could be uniformly beneficial, in which case the OTR is to always treat, despite possible treatment effect heterogeneity. In such settings, analysis using TE-VIMs may provide useful insight to further improve therapies. Faced with OTR heterogeneity, one might consider nonparametric VIMs related to the OTR, e.g. by considering the risk $f \mapsto - E \{Y^{f (X)}\}$ , where f : ℝ^p ↦ {0, 1} is a policy. This risk implies the OTR-VIM $Γ_{s}^{*} \equiv E \{Y^{d (X)} - Y^{d_{s}^{*} (X)}\} \geq 0$ where $d_{s}^{*} (x) \equiv \mathbb{I} \{τ_{s} (x) > 0\}$ is the OTR given X_−s (Zhang et al., 2015; Williamson et al., 2023). Thus, OTR-VIMs compare the OTR with suboptimal policies that are optimal within a restricted set of policies.

Under standard causal assumptions (consistency, positivity, exchangeability), the OTR-VIM is identified by $Γ_{s}^{*} = E [τ (X) \{d (X) - d_{s}^{*} (X)\}]$ . Unlike TE-VIMs, $Γ_{s}^{*}$ is not pathwise differentiable without additional assumptions, e.g. that the OTR is insensitive to small changes in P₀ (Luedtke and van der Laan, 2016). This matters because the pathwise derivative is usually used to construct efficient estimators. Analogous to Θ_s and Ω_s, one could alternatively define OTR-VIMs using the risk f ↦ E[{d(X) − f(X)}²], which implies the VIM Γ_s ≡ E[d_s(X){1 − d_s(X)}] where d_s(x) ≡ E{d(X)|X_−s = x_−s} = Pr{τ (X) > 0|X_−s = x_−s}. Note that d(X) ∈ {0, 1} and d_s(X) ∈ [0, 1]. Like Θ_s and Ω_s, Γ_s ∈ [0, 0.25] is invariant to linear transformations of the outcome. However, like $Γ_{s}^{*}$ , additional assumptions on the OTR are required for pathwise differentiability of Γ_s.
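For completeness, the form of Γ_s and its range follow from a short calculation, sketched here under the assumption that the risk is minimized over all functions of X_−s, so that the minimizer is the conditional mean d_s:

```latex
\begin{align*}
\Gamma_s
  &= E\big[\{d(X) - d_s(X)\}^2\big]
   = E\big[\operatorname{var}\{d(X) \mid X_{-s}\}\big] \\
  &= E\big[E\{d(X)^2 \mid X_{-s}\} - d_s(X)^2\big]
   = E\big[d_s(X)\{1 - d_s(X)\}\big],
\end{align*}
```

where the last step uses $d(X)^2 = d(X)$ for a binary rule. Since $u(1 - u) \leq 1/4$ for $u \in [0, 1]$, with equality at $u = 1/2$, it follows that $\Gamma_s \in [0, 0.25]$.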

5.2. Continuous treatments

Continuous analogues of the CATE based on linear model projections are proposed by Hines et al. (2023). In particular, λ(x) ≡ cov(A, Y |X = x)/var(A|X = x) is well defined when A is continuous, and identifies the CATE under standard causal assumptions (consistency, positivity, exchangeability) when A is binary. Appealing to the risk f ↦ ‖λ(X) − f(X)‖², one might extend the ATE, VTE, and TE-VIMs to continuous exposures using the estimands: E{λ(X)}, var{λ(X)}, and E[var{λ(X)|X_−s}]/var{λ(X)}, which identify their CATE counterparts when A is binary. ICs for these estimands are obtained by replacing the pseudo-outcome φ(z) in the current work, with

$[y - μ (x) - λ (x) \{a - π (x)\}] \frac{a - π (x)}{\mathrm{var} (A \mid X = x)} + λ (x)$

which reduces to φ(z) when A is binary. See Supplement D.9 for details.
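The reduction to the binary case can be verified numerically. The sketch below assumes that φ(z) is the usual AIPW pseudo-outcome, $\{y - μ(a, x)\}\{a - π(x)\} / [π(x)\{1 - π(x)\}] + μ(1, x) - μ(0, x)$, and that μ(x) in the display above denotes E(Y|X = x); the nuisance values are drawn at random because the identity is purely algebraic.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
pi = rng.uniform(0.1, 0.9, n)           # pi(x) = E(A | X = x)
mu0 = rng.normal(size=n)                # mu(0, x)
mu1 = mu0 + rng.normal(size=n)          # mu(1, x)
a = rng.binomial(1, pi)                 # binary exposure
y = rng.normal(size=n)                  # arbitrary outcomes

lam = mu1 - mu0                         # lambda(x) equals the CATE when A is binary
mu_marg = pi * mu1 + (1 - pi) * mu0     # mu(x) = E(Y | X = x)
var_a = pi * (1 - pi)                   # var(A | X = x) for binary A

# Pseudo-outcome for general exposures, as displayed above.
general = (y - mu_marg - lam * (a - pi)) * (a - pi) / var_a + lam

# AIPW pseudo-outcome for binary exposures (assumed form of phi(z)).
mu_a = np.where(a == 1, mu1, mu0)
aipw = (y - mu_a) * (a - pi) / (pi * (1 - pi)) + mu1 - mu0

max_gap = np.max(np.abs(general - aipw))
```

The two expressions agree up to floating-point error because μ(x) + λ(x){a − π(x)} = μ(a, x) when A is binary.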

6. Conclusion

We propose TE-VIMs, which capture the importance of variable subsets in explaining the CATE, and complement the VTE as a global heterogeneity measure (Levy et al., 2021). We derive efficient TE-VIM estimators that are amenable to machine learning of working models and thus, unlike existing proposals using causal forests, are not tied to specific algorithms (Athey et al., 2019). In studies where heterogeneous treatment effects are of primary interest, we recommend using TE-VIMs to select stratification variables for subgroup analyses, to support variable selection decisions for CATE/OTR models, or to rank effect modifiers. In studies where the main goal is to estimate the ATE, we recommend that VTE inference should also form part of the analysis, since the ATE and VTE can be used to inform on the marginal probability of adverse CATEs (see Supplement D.8). TE-VIMs may then form part of a secondary analysis when large treatment effect heterogeneity cannot be ruled out. We recommend that domain-specific knowledge is used to select potential effect modifiers for VTE and TE-VIM analyses. We have outlined several frameworks for choosing the sets of variables to be included for the comparisons needed in the TE-VIM algorithm, such as leave-one-out or keep-one-in. Finally, we caution users against over-interpreting TE-VIM p-values due to the difficulty of testing the zero importance null and due to the multiplicity problem when the number of covariate subsets is large. The latter problem may be somewhat attenuated using a Bonferroni approach. Any suggested covariate that may explain heterogeneity of treatment effects should be confirmed using independent data.

Supplementary Material

Supplement: EMS212112-supplement-Supplement.pdf (834.8KB, pdf)

Acknowledgments

For the purpose of open access, the author(s) has applied a Creative Commons Attribution (CC BY) license to any Author Accepted Manuscript version arising from this submission.

Funding

This work was supported by the Medical Research Council [grant number MR/N013638/1]. K.DO. was funded by a Royal Society-Wellcome Trust Sir Henry Dale fellowship, grant number 218554/Z/19/Z.

Data Availability

AIDS Clinical Trials Group Protocol 175 (ACTG175) data (Hammer et al., 1996) are available on CRAN at https://CRAN.R-project.org/package=speff2trial.

References

Athey S, Imbens G. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences of the United States of America. 2016;113(27):7353–7360. doi: 10.1073/pnas.1510489113.
Athey S, Tibshirani J, Wager S. Generalized random forests. Annals of Statistics. 2019;47(2):1179–1203.
Athey S, Wager S. Policy learning with observational data. Econometrica. 2021;89(1):133–161.
Boileau P, Qi NT, van der Laan MJ, Dudoit S, Leng N. A Flexible Approach for Predictive Biomarker Discovery. Biostatistics. 2023;24(4):1085–1105. doi: 10.1093/biostatistics/kxac029.
Carone M, Díaz I, van der Laan MJ. Targeted Learning in Data Science. Springer International Publishing; Cham: 2018. Higher-Order Targeted Loss-Based Estimation; pp. 483–510.
Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C, Newey WK, Robins JM. Double/debiased machine learning for treatment and structural parameters. Econometrics Journal. 2018;21(1):C1–C68.
Coston A, Kennedy EH, Chouldechova A. Advances in Neural Information Processing Systems. Vol. 33. Curran Associates, Inc; 2020. Counterfactual Predictions under Runtime Confounding; pp. 4150–4162.
Ding P, Feller A, Miratrix L. Randomization inference for treatment effect variation. Journal of the Royal Statistical Society Series B: Statistical Methodology. 2016;78(3):655–671.
Grömping U. Variable importance assessment in regression: linear regression versus random forest. The American Statistician. 2009;63(4):308–319.
Guo FR, Shah RD. Rank-transformed subsampling: inference for multiple data splitting and exchangeable p-values. Journal of the Royal Statistical Society Series B: Statistical Methodology. 2025;87(1):256–286.
Hama N, Mase M, Owen AB. Model free variable importance for high dimensional data. arXiv (2211.08414) 2022.
Hammer SM, Katzenstein DA, Hughes MD, Gundacker H, Schooley RT, Haubrich RH, Henry WK, Lederman MM, Phair JP, Niu M, Hirsch MS, et al. A trial comparing nucleoside monotherapy with combination therapy in HIV-infected adults with CD4 cell counts from 200 to 500 per cubic millimeter. The New England Journal of Medicine. 1996;335:1081–1090. doi: 10.1056/NEJM199610103351501.
Hastie T. gam: Generalized additive models. CRAN: Contributed Packages. 2004.
Hines O, Diaz-Ordaz K, Vansteelandt S. Optimally weighted average derivative effects. arXiv (2308.05456) 2023.
Hines O, Dukes O, Diaz-Ordaz K, Vansteelandt S. Demystifying statistical learning based on efficient influence functions. The American Statistician. 2022;76(3):292–304.
Hudson A. Nonparametric inference on non-negative dissimilarity measures at the boundary of the parameter space. arXiv (2306.07492) 2023.
Kennedy EH. Towards optimal doubly robust estimation of heterogeneous causal effects. Electronic Journal of Statistics. 2023;17(2):3008–3049.
Künzel SR, Sekhon JS, Bickel PJ, Yu B. Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the National Academy of Sciences of the United States of America. 2019;116(10):4156–4165. doi: 10.1073/pnas.1804597116.
Lei J, G’Sell M, Rinaldo A, Tibshirani RJ, Wasserman L. Distribution-free predictive inference for regression. Journal of the American Statistical Association. 2018;113(523):1094–1111.
Levy J, van der Laan MJ. Kernel smoothing of the treatment effect CDF. arXiv (1811.06514) 2018.
Levy J, van der Laan MJ, Hubbard A, Pirracchio R. A fundamental measure of treatment effect heterogeneity. Journal of Causal Inference. 2021;9(1):83–108.
Li H, Hubbard A, van der Laan MJ. Targeted learning on variable importance measure for heterogeneous treatment effect. arXiv (2309.13324) 2023.
Luedtke AR, van der Laan MJ. Super-learning of an optimal dynamic treatment rule. International Journal of Biostatistics. 2016;12(1):305–332. doi: 10.1515/ijb-2015-0052.
Murphy SA. Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2003;65(2):331–355.
Newey WK, Robins JM. Cross-fitting and fast remainder rates for semi-parametric estimation. arXiv (1801.09138) 2018.
Owen AB, Prieur C. On Shapley value for measuring importance of dependent inputs. SIAM-ASA Journal on Uncertainty Quantification. 2017;5(1):986–1002.
Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association. 1994;89(427):846–866.
Rothwell PM. Subgroup analysis in randomised controlled trials: importance, indications, and interpretation. The Lancet. 2005;365(9454):176–186. doi: 10.1016/S0140-6736(05)17709-5.
Slamon DJ, Leyland-Jones B, Shak S, Fuchs H, Paton V, Bajamonde A, Fleming T, Eiermann W, Wolter J, Pegram M, Baselga J, et al. Use of chemotherapy plus a monoclonal antibody against HER2 for metastatic breast cancer that overexpresses HER2. New England Journal of Medicine. 2001;344(11):783–792. doi: 10.1056/NEJM200103153441101.
Tang Z, Westling T. Nonparametric Assessment of Variable Selection and Ranking Algorithms. Journal of Computational and Graphical Statistics. 2025:1–12.
van der Laan MJ. Targeted learning of an optimal dynamic treatment, and statistical inference for its mean outcome. UC Berkeley Division of Biostatistics Working Paper Series. 2013;(317):1–90.
van der Laan MJ, Gruber S. One-step targeted minimum loss-based estimation based on universal least favorable one-dimensional submodels. International Journal of Biostatistics. 2016;12(1):351–378. doi: 10.1515/ijb-2015-0054.
van der Laan MJ, Luedtke AR. Targeted learning of the mean outcome under an optimal dynamic treatment rule. Journal of Causal Inference. 2014;3(1):61–95. doi: 10.1515/jci-2013-0022.
van der Laan MJ, Polley EC, Hubbard AE. Super learner. Statistical Applications in Genetics and Molecular Biology. 2007;6(1). doi: 10.2202/1544-6115.1309.
VanderWeele TJ, Luedtke AR, van der Laan MJ, Kessler RC. Selecting optimal subgroups for treatment using many covariates. Epidemiology (Cambridge, Mass). 2019;30(3):334. doi: 10.1097/EDE.0000000000000991.
Verdinelli I, Wasserman L. Decorrelated Variable Importance. Journal of Machine Learning Research. 2024a;25(7):1–27.
Verdinelli I, Wasserman L. Feature Importance: A Closer Look at Shapley Values and LOCO. Statistical Science. 2024b;39(4).
Vermeulen K, Vansteelandt S. Bias-reduced doubly robust estimation. Journal of the American Statistical Association. 2015;110(511):1024–1036.
Wager S, Athey S. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association. 2018;113(523):1228–1242.
Williamson BD, Feng J. Efficient nonparametric statistical inference on population feature importance using Shapley values. 37th International Conference on Machine Learning, ICML 2020; 2020. pp. 10213–10222.
Williamson BD, Gilbert PB, Carone M, Simon NR. Nonparametric variable importance assessment using machine learning techniques. Biometrics. 2021;77(1):9–22. doi: 10.1111/biom.13392.
Williamson BD, Gilbert PB, Simon NR, Carone M. A general framework for inference on algorithm-agnostic variable importance. Journal of the American Statistical Association. 2023;118(543):1645–1658. doi: 10.1080/01621459.2021.2003200.
Wood SN, Pya N, Säfken B. Smoothing parameter and model selection for general smooth models. Journal of the American Statistical Association. 2016;111(516):1548–1563.
Zhang L, Janson L. Floodgate: inference for model-free variable importance. arXiv (2007.01283) 2020.
Zhang Y, Laber EB, Tsiatis A, Davidian M. Using decision lists to construct interpretable and parsimonious treatment regimes. Biometrics. 2015;71(4):895–904. doi: 10.1111/biom.12354.
Zheng W, van der Laan MJ. Targeted Learning. Springer New York; New York, NY: 2011. Cross-validated targeted minimum-loss-based estimation; pp. 459–474.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplement

EMS212112-supplement-Supplement.pdf^{(834.8KB, pdf)}

Data Availability Statement

AIDS Clinical Trials Group Protocol 175 (ACTG175) data Hammer et al. (1996) is available on CRAN at https://CRAN.R-project.org/package=speff2trial.

[R1] Athey S, Imbens G. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences of the United States of America. 2016;113(27):7353–7360. doi: 10.1073/pnas.1510489113. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R2] Athey S, Tibshirani J, Wager S. Generalized random forests. Annals of Statistics. 2019;47(2):1179–1203. [Google Scholar]

[R3] Athey S, Wager S. Policy learning with observational data. Econometrica. 2021;89(1):133–161. [Google Scholar]

[R4] Boileau P, Qi NT, van der Laan MJ, Dudoit S, Leng N. A Flexible Approach for Predictive Biomarker Discovery. Biostatistics. 2023;24(4):1085–1105. doi: 10.1093/biostatistics/kxac029. [DOI] [PubMed] [Google Scholar]

[R5] Carone M, Díaz I, van der Laan MJ. Targeted Learning in Data Science. Springer International Publishing; Cham: 2018. Higher-Order Targeted Loss-Based Estimation; pp. 483–510. [Google Scholar]

[R6] Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C, Newey WK, Robins JM. Double/debiased machine learning for treatment and structural parameters. Econometrics Journal. 2018;21(1):C1–C68. [Google Scholar]

[R7] Coston A, Kennedy EH, Chouldechova A. Advances in Neural Information Processing Systems. Vol. 33. Curran Associates, Inc; 2020. Counterfactual Predictions under Runtime Confounding; pp. 4150–4162. [Google Scholar]

[R8] Ding P, Feller A, Miratrix L. Randomization inference for treatment effect variation. Journal of the Royal Statistical Society Series B: Statistical Methodology. 2016;78(3):655–671. [Google Scholar]

[R9] Grömping U. Variable importance assessment in regression: linear regression versus random forest. The American Statistician. 2009;63(4):308–319. [Google Scholar]

[R10] Guo FR, Shah RD. Rank-transformed subsampling: inference for multiple data splitting and exchangeable p-values. Journal of the Royal Statistical Society Series B: Statistical Methodology. 2025;87(1):256–286. [Google Scholar]

[R11] Hama N, Mase M, Owen AB. Model free variable importance for high dimensional data. arXiv (2211.08414) 2022 [Google Scholar]

[R12] Hammer SM, Katzenstein DA, Hughes MD, Gundacker H, Schooley RT, Haubrich RH, Henry WK, Lederman MM, Phair JP, Niu M, Hirsch MS, et al. A trial comparing nucleoside monotherapy with combination therapy in HIV-infected adults with CD4 cell counts from 200 to 500 per cubic millimeter. The New England Journal of Medicine. 1996;335:1081–1090. doi: 10.1056/NEJM199610103351501. [DOI] [PubMed] [Google Scholar]

[R13] Hastie T. gam: Generalized additive models CRAN: Contributed Packages. 2004.

[R14] Hines O, Diaz-Ordaz K, Vansteelandt S. Optimally weighted average derivative effects. arXiv (2308.05456) 2023 [Google Scholar]

[R15] Hines O, Dukes O, Diaz-Ordaz K, Vansteelandt S. Demystifying statistical learning based on efficient influence functions. The American Statistician. 2022;76(3):292–304. [Google Scholar]

[R16] Hudson A. Nonparametric inference on non-negative dissimilarity measures at the boundary of the parameter space. arXiv (2306.07492) 2023 [Google Scholar]

[R17] Kennedy EH. Towards optimal doubly robust estimation of heterogeneous causal effects. Electronic Journal of Statistics. 2023;17(2):3008–3049. [Google Scholar]

[R18] Künzel SR, Sekhon JS, Bickel PJ, Yu B. Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the National Academy of Sciences of the United States of America. 2019;116(10):4156–4165. doi: 10.1073/pnas.1804597116. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R19] Lei J, G’Sell M, Rinaldo A, Tibshirani RJ, Wasserman L. Distribution-free predictive inference for regression. Journal of the American Statistical Association. 2018;113(523):1094–1111. [Google Scholar]

[R20] Levy J, van der Laan MJ. Kernel smoothing of the treatment effect CDF. arXiv (1811.06514) 2018 [Google Scholar]

[R21] Levy J, van der Laan MJ, Hubbard A, Pirracchio R. A fundamental measure of treatment effect heterogeneity. Journal of Causal Inference. 2021;9(1):83–108. [Google Scholar]

[R22] Li H, Hubbard A, van der Laan MJ. Targeted learning on variable importance measure for heterogeneous treatment effect. arXiv (2309.13324) 2023 [Google Scholar]

[R23] Luedtke AR, van der Laan MJ. Super-learning of an optimal dynamic treatment rule. International Journal of Biostatistics. 2016;12(1):305–332. doi: 10.1515/ijb-2015-0052. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R24] Murphy SA. Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2003;65(2):331–355. [Google Scholar]

[R25] Newey WK, Robins JM. Cross-fitting and fast remainder rates for semi-parametric estimation. arXiv (1801.09138) 2018 [Google Scholar]

[R26] Owen AB, Prieur C. On Shapley value for measuring importance of dependent inputs. SIAM-ASA Journal on Uncertainty Quantification. 2017;5(1):986–1002. [Google Scholar]

[R27] Robins JM, Rotnitzky A, Zhao LP. When some of regression coefficients estimation regressors are not always observed. Methods. 1994;89(427):846–866. [Google Scholar]

[R28] Rothwell PM. Subgroup analysis in randomised controlled trials: importance, indications, and interpretation. The Lancet. 2005;365(9454):176–186. doi: 10.1016/S0140-6736(05)17709-5. [DOI] [PubMed] [Google Scholar]

[R29] Slamon DJ, Leyland-Jones B, Shak S, Fuchs H, Paton V, Bajamonde A, Fleming T, Eiermann W, Wolter J, Pegram M, Baselga J, et al. Use of chemotherapy plus a monoclonal antibody against HER2 for metastatic breast cancer that overexpresses HER2. New England Journal of Medicine. 2001;344(11):783–792. doi: 10.1056/NEJM200103153441101. [DOI] [PubMed] [Google Scholar]

[R30] Tang Z, Westling T. Nonparametric Assessment of Variable Selection and Ranking Algorithms. Journal of Computational and Graphical Statistics. 2025:1–12. [Google Scholar]

[R31] van der Laan MJ. Targeted learning of an optimal dynamic treatment, and statistical inference for its mean outcome UC Berkeley Division of Biostatistics Working Paper Series. 2013;(317):1–90.

[R32] van der Laan MJ, Gruber S. One-step targeted minimum loss-based estimation based on universal least favorable one-dimensional submodels. International Journal of Biostatistics. 2016;12(1):351–378. doi: 10.1515/ijb-2015-0054. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R33] van der Laan MJ, Luedtke AR. Targeted learning of the mean outcome under an optimal dynamic treatment rule. Journal of Causal Inference. 2014;3(1):61–95. doi: 10.1515/jci-2013-0022. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R34] van der Laan MJ, Polley EC, Hubbard AE. Super learner. Statistical Applications in Genetics and Molecular Biology. 2007;6(1) doi: 10.2202/1544-6115.1309. [DOI] [PubMed] [Google Scholar]

[R35] VanderWeele TJ, Luedtke AR, van der Laan MJ, Kessler RC. Selecting optimal subgroups for treatment using many covariates. Epidemiology (Cambridge, Mass) 2019;30(3):334. doi: 10.1097/EDE.0000000000000991. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R36] Verdinelli I, Wasserman L. Decorrelated Variable Importance. Journal of Machine Learning Research. 2024a;25(7):1–27. [Google Scholar]

[R37] Verdinelli I, Wasserman L. Feature Importance: A Closer Look at Shapley Values and LOCO. Statistical Science. 2024b;39(4) [Google Scholar]

[R38] Vermeulen K, Vansteelandt S. Bias-reduced doubly robust estimation. Journal of the American Statistical Association. 2015;110(511):1024–1036. [Google Scholar]

[R39] Wager S, Athey S. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association. 2018;113(523):1228–1242. [Google Scholar]

[R40] Williamson BD, Feng J. Efficient nonparametric statistical inference on population feature importance using Shapley values; 37th International Conference on Machine Learning, ICML 2020; 2020. pp. 10213–10222.PartF168147-14. [PMC free article] [PubMed] [Google Scholar]

[R41] Williamson BD, Gilbert PB, Carone M, Simon NR. Nonparametric variable importance assessment using machine learning techniques. Biometrics. 2021;77(1):9–22. doi: 10.1111/biom.13392. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R42] Williamson BD, Gilbert PB, Simon NR, Carone M. A general framework for inference on algorithm-agnostic variable importance. Journal of the American Statistical Association. 2023;118(543):1645–1658. doi: 10.1080/01621459.2021.2003200.

[R43] Wood SN, Pya N, Säfken B. Smoothing parameter and model selection for general smooth models. Journal of the American Statistical Association. 2016;111(516):1548–1563.

[R44] Zhang L, Janson L. Floodgate: inference for model-free variable importance. arXiv:2007.01283. 2020.

[R45] Zhang Y, Laber EB, Tsiatis A, Davidian M. Using decision lists to construct interpretable and parsimonious treatment regimes. Biometrics. 2015;71(4):895–904. doi: 10.1111/biom.12354.

[R46] Zheng W, van der Laan MJ. Cross-validated targeted minimum-loss-based estimation. In: Targeted Learning. New York, NY: Springer New York; 2011. pp. 459–474.

