Abstract
We discuss results on improving the generalizability of individualized treatment rules, following the work of Kallus [1] and Mo et al. [5]. We note that the weights advocated in Kallus [1] are connected to the efficient score of the contrast function. We further propose a likelihood-ratio-based method (LR-ITR) to accommodate covariate shifts, and compare it to the CTE-DR-ITR method proposed in Mo et al. [5]. We provide an upper bound on the risk function of the target population when both covariate shift and contrast function shift are present. Numerical studies show that LR-ITR can outperform CTE-DR-ITR when there is only covariate shift.
Keywords: Generalizability, covariate shift, efficient score, density-ratio estimation
1. Introduction
The problem of constructing individualized treatment rules (ITRs), functions that map patient characteristics to an available treatment, has recently received significant attention among statistical researchers [6, 11, 8, 12, 4, 9]. These methods typically assume that the population of the training samples and the target population where the ITR will be implemented in the future are identical. However, when these two populations differ, the estimated ITR may perform poorly on the target population [10]. We congratulate Kallus [1] and Mo et al. [5] on their contributions in proposing robust ITR estimation approaches when there are covariate changes between the training and target populations, also known as covariate shift.
Let (X, A, Y) be the triplet, where X denotes the patient covariates, A denotes the treatment, and Y denotes the outcome. Suppose that the treatment space is {1, −1}. Let Y(1) and Y(−1) be the potential outcomes given treatment A = 1 and −1, respectively. Let P be the distribution of the training population and Q be the distribution of the target (testing) population. Covariate shift assumes that the conditional distribution of (Y(1), Y(−1)) | X is the same on the training and target populations, but the covariate distributions may be different. Consequently, given X, the contrast functions on the two populations satisfy Ctest(X) = C(X). In the presence of covariate shifts, Kallus [1] defines the retargeted policy value and proposes a weighting approach that minimizes the efficient variance in estimating the finite-sample objective. Mo et al. [5] maximize the worst-case value function over a set of possible weights.
We point out that the weights advocated in Kallus [1] to adjust for treatment-control overlap are related to the efficient score for estimating the contrast function. Therefore, his work can be considered as improving the efficiency of estimating the ITR. Different from Kallus [1], Mo et al. [5] focus on controlling the possible bias when the covariate distribution shifts. We propose an alternative method, termed likelihood-ratio weighted ITR (LR-ITR), which re-weights the learning objective with the likelihood ratio estimated directly from the training and target data. Numerical results show that the LR-ITR method can improve the performance of the standard ITR approach, and can outperform the CTE-DR-ITR proposed in Mo et al. [5] when there is only covariate shift. On the other hand, the CTE-DR-ITR method can be more generalizable when there is contrast shift in addition to covariate shift, i.e., Ctest(X) ≠ C(X).
This article is organized as follows. In Section 2, we discuss the connection between the weights in Kallus [1] and the efficient estimation of the contrast function [3]. In Section 3, we introduce the LR-ITR approach, along with its theoretical properties, and conduct simulation studies. In Section 4, we briefly summarize our discussion.
2. Retargeting weights and efficient estimation
The weighting approach for adjusting treatment-control overlapping is closely connected with the efficient score of the contrast functions. Given a policy π(a | X), which is a distribution on the treatment space given X, the retargeted learning objective in Kallus [1] is defined as
| (2.1) |
where μ(a | X) = E{Y(a) | X}. Kallus [1] advocates using ρ(a | X) = 1/2 and w(X) ∝ [Var{Y(1) | X} / P(A = 1 | X) + Var{Y(−1) | X} / P(A = −1 | X)]^{−1}. The proposed choice of ρ(a | X) = 1/2 compares the value function under the policy π with the value function under pure randomization.
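As a small numerical illustration of the weight w(X) above, the following sketch evaluates it for given conditional variances and propensities. The function name `retargeting_weights`, the mean-one normalization, and the input values are assumptions made for illustration only; in practice the variances and propensities would have to be estimated.

```python
import numpy as np

def retargeting_weights(var1, var_neg1, prop1):
    """w(X) proportional to [Var{Y(1)|X}/P(A=1|X) + Var{Y(-1)|X}/P(A=-1|X)]^{-1}."""
    var1 = np.asarray(var1, dtype=float)
    var_neg1 = np.asarray(var_neg1, dtype=float)
    prop1 = np.asarray(prop1, dtype=float)
    raw = 1.0 / (var1 / prop1 + var_neg1 / (1.0 - prop1))
    # Normalizing to mean one is a convention assumed here, not taken from the text.
    return raw / raw.mean()

# Three hypothetical patients with equal outcome variances and
# propensities P(A = 1 | X) of 0.5, 0.9, and 0.1.
w = retargeting_weights([1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [0.5, 0.9, 0.1])
```

With equal variances the weight reduces, up to scale, to P(A = 1 | X) P(A = −1 | X), so patients whose propensity is near 1/2 receive the largest weight and patients with poor treatment-control overlap are down-weighted.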
We highlight that the retargeting weight w(X) is connected with the efficient estimation of the contrast function, Δ(X) = μ(1 | X) − μ(−1 | X). Liang and Yu [3] show that, assuming Δ(X) = g(X⊤β), where g is an unknown function, the efficient score for β is
where ϵ = Y − Ag(X⊤ β) / 2 and . Since ϵ can be written as , where , we have
Given that and , is also proportional to [Var{Y(1) | X} / P(A = 1 | X) + Var{Y(−1) | X} / P(A = −1 | X)]^{−1}. As such, the adjustment for treatment-control overlap by the weights in Kallus [1] can also benefit the estimation of the contrast function.
3. Likelihood-ratio weighted ITR
Mo et al. [5] propose targeting a class of populations rather than a specific population to learn a distributionally robust individualized treatment rule (ITR). We propose an alternative likelihood-ratio-weighted approach. Following Mo et al. [5], we consider the value function given an ITR d(X). Under Assumption 1 (Covariate Changes) in Mo et al. [5], we have
where qx and px are the density functions of the target population and the training population, respectively. Let w*(X) = qx(X) / px(X). The target value function, V(d), can be reformulated as
3.1. Algorithm
We estimate w*(X) using the covariate information from the training population and the target population. Unconstrained least-squares importance fitting (uLSIF) is an efficient algorithm for estimating the likelihood ratio [2]; it has a closed-form solution that can be calculated by solving a linear system. Let w(X; α) = α1ϕ1(X) + ⋯ + αbϕb(X), where the ϕl(X)’s are pre-specified basis functions and b is the number of basis functions. Notice that . The uLSIF aims to minimize , and
| (3.1) |
where C is a constant irrelevant to . Let α = (α1,⋯, αb)⊤. (3.1) can be written as
| (3.2) |
where is a b × b matrix with its (i, j)th coordinate, , and h is a b-dimensional vector with its ith coordinate, . The two empirical expectations are defined by the samples from the training and target populations, respectively. Typically, we only need a small calibration set from the target population to calculate the target-sample expectations. The uLSIF estimator is
where minimizes (3.2). To stabilize the estimated likelihood ratio, we cap it at 10. This choice of cap corresponds to the bound on w*(X) assumed in the simulations of Mo et al. [5]. The implementation procedure is outlined in Algorithm 1. We call this method the likelihood-ratio weighted ITR (LR-ITR).
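The estimation step described above can be sketched in a few lines. This is a minimal illustration of regularized uLSIF in its standard form from Kanamori et al. [2], not the authors' code. Several choices are assumptions: the function names `gaussian_basis` and `ulsif_weights` are hypothetical, the kernel centers are placed at the calibration points, σ and λ are fixed rather than tuned by cross-validation, and negative coefficients are truncated at zero before the cap of 10 is applied.

```python
import numpy as np

def gaussian_basis(X, centers, sigma):
    """phi_l(x) = exp(-||x - c_l||^2 / (2 sigma^2)) for each center c_l."""
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def ulsif_weights(X_train, X_calib, sigma=1.0, lam=0.1, cap=10.0):
    """Estimate the likelihood ratio q_x(X) / p_x(X) at the training points."""
    centers = X_calib                                      # assumption: centers at calibration points
    Phi_train = gaussian_basis(X_train, centers, sigma)    # n x b
    Phi_calib = gaussian_basis(X_calib, centers, sigma)    # n_calib x b
    H = Phi_train.T @ Phi_train / X_train.shape[0]         # training-sample mean of phi_i * phi_j
    h = Phi_calib.mean(axis=0)                             # calibration-sample mean of phi_i
    # Closed-form ridge solution obtained from a single linear system.
    alpha = np.linalg.solve(H + lam * np.eye(H.shape[0]), h)
    alpha = np.maximum(alpha, 0.0)                         # truncate negative coefficients
    return np.minimum(Phi_train @ alpha, cap)              # cap the estimated ratio

# Hypothetical one-dimensional example: the target mean is shifted by one unit.
rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(1000, 1))
X_calib = rng.normal(1.0, 1.0, size=(50, 1))
w_hat = ulsif_weights(X_train, X_calib)
```

Because only covariates enter the computation, the calibration sample needs no treatment or outcome information, which matches the input requirements of Algorithm 1.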
Algorithm 1: Likelihood-ratio weighted ITR.
Input: n samples (X, A, Y) from the training population and ncalib samples from the target population containing covariates only.
Output: An ITR .
- Estimate where α is the minimizer of
where the ϕl(·)’s are chosen as Gaussian kernels with kernel bandwidth σ. The parameters σ and λ are tuned by cross-validation. The ncalib samples from the target population are used to calculate h;
- Obtain an estimator of the contrast function by fitting a causal forest [7] using the training data;
- Obtain the linear decision rule , where β minimizes
and ψ is the robust smoothed ramp loss [12].
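The final step of Algorithm 1 can be illustrated with a sketch of a weighted surrogate risk. The exact objective is not reproduced in this text, so everything below is an assumption: the risk is taken to be the sample mean of ŵ(X)|Ĉ(X)|ψ(sign(Ĉ(X)) X⊤β), the rule is taken to be d(X) = sign(X⊤β) without an intercept, and ψ is taken to be a commonly used piecewise-quadratic smoothed ramp loss. The names `smoothed_ramp_loss` and `weighted_itr_risk` are hypothetical.

```python
import numpy as np

def smoothed_ramp_loss(u):
    """Assumed form: 0 for u >= 1, (1-u)^2 on [0, 1), 2-(1+u)^2 on [-1, 0), 2 for u < -1."""
    u = np.asarray(u, dtype=float)
    return np.where(u >= 1.0, 0.0,
           np.where(u >= 0.0, (1.0 - u) ** 2,
           np.where(u >= -1.0, 2.0 - (1.0 + u) ** 2, 2.0)))

def weighted_itr_risk(beta, X, c_hat, w_hat):
    """Assumed objective: mean of w_hat * |c_hat| * loss(sign(c_hat) * X @ beta)."""
    margin = np.sign(c_hat) * (X @ beta)
    return np.mean(w_hat * np.abs(c_hat) * smoothed_ramp_loss(margin))

# Hypothetical inputs: a contrast estimate that favors treatment 1 when x1 > 0,
# and likelihood-ratio weights all equal to one.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
c_hat = X[:, 0]
w_hat = np.ones(200)

beta_good = np.array([5.0, 0.0])   # rule agrees with sign(c_hat)
risk_good = weighted_itr_risk(beta_good, X, c_hat, w_hat)
risk_bad = weighted_itr_risk(-beta_good, X, c_hat, w_hat)
```

A rule that agrees with the sign of the estimated contrast has non-negative margins and therefore a smaller weighted risk than the reversed rule. In an actual fit, β would be found by numerically minimizing this non-convex objective.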
3.2. Risk Bounds on the Target Population
In this section, we provide bounds on the target-population risk of the ITRs obtained by the different methods. Define the risk function on the target population as
where . The proposed LR-ITR method minimizes the LR-risk function, defined as
where . The CTE-DR-ITR approach proposed in Mo et al. [5] minimizes the DR-risk function associated with parameters (c, k) on the training population, defined as
where . The weight function w(X) represents a general density ratio in , and we assume that . Given a pre-specified class of ITRs, , let the minimizer of LLR(d(X)) in be and the minimizer of LDR(d(X)) in be . We provide upper bounds on and .
Let be the minimizer (or the sequence converging to the minimum) of for v > 0, and . Also, let be the minimizer (or the sequence converging to the minimum) of over , where , and . Given k ≥ 2, we have the following inequalities.
Theorem 3.1.
- For the LR-ITR approach, we have
- For the CTE-DR-ITR approach, we have
where , and
with .
When there is a shift in the contrast function, i.e., Ctest (X) ≠ C(X), we have .
If the difference between and dominates the difference between and , then leads to a larger upper bound compared with . When Ctest (X) = C(X), we have . We also have and . Further, we have . Notice that LDR(d(X)) ≥ LLR(d(X)) when . The can lead to a lower upper bound compared with .
3.3. Simulations
We compare the performance of the LR-ITR approach with the CTE-DR-ITR approach. Both methods only require a calibration dataset containing the covariate vector from the target distribution.
3.3.1. Mixture of Subgroups with only Covariate Shifts
We first consider the simulation settings with only covariate shifts present. We modify the simulation setup on the mixture of subgroups in Mo et al. [5] such that the two subgroups share the same contrast function between the training and the target populations, but have different covariate distributions.
We set n = 1000 and p = 10. We generate the covariate vector as: , where ξ ~ Bernoulli(pmix) is the unobserved indicator of the mixture with pmix determining the proportion of the two subgroups, μ1 = (0,0,0,⋯,0)⊤ and μ2 = (1.958,1.958,0,⋯,0)⊤. We consider different mixture proportions on the training and the target populations. We fix pmix = 0.75 on the training population. For the target population, we change pmix ∈{0.1,0.25,0.5,0.75,0.9}. We then generate A | X ~ Bernoulli(1/2) and , where and .
We generate a calibration dataset from the target population with ncalib = 50, which contains only the covariate information. While the CTE-DR-ITR approach uses this calibration dataset to select the tuning parameter, we use it to estimate the likelihood ratio. We compare the performances of the Standard ITR, the CTE-DR-ITR, and the LR-ITR approaches. To evaluate the estimated ITRs, we generate a testing dataset with ntest = 10^6. We calculate the value function as , where represents the empirical mean on the testing dataset.
Table 1 presents the simulation results summarized over 500 replicates. The LR-ITR attains the highest value at every target pmix, which shows that it can outperform the CTE-DR-ITR when only covariate shifts are present.
Table 1.
Simulation results with ncalib = 50 and mixture of subgroups. Mean (Standard error) over 500 repeats are reported.
| Type | Target pmix = 0.1 | Target pmix = 0.25 | Target pmix = 0.5 | Target pmix = 0.75 | Target pmix = 0.9 |
|---|---|---|---|---|---|
| Standard ITR | 7.827 (0.0170) | 6.607 (0.0145) | 4.582 (0.0105) | 2.560 (0.00654) | 1.347 (0.00470) |
| CTE-DR-ITR | 7.868 (0.0185) | 6.641 (0.0158) | 4.602 (0.0113) | 2.557 (0.00738) | 1.337 (0.00534) |
| LR-ITR | 8.074 (0.0165) | 6.846 (0.0133) | 4.730 (0.0101) | 2.616 (0.00608) | 1.417 (0.00403) |
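To make the comparison in Table 1 concrete, the following snippet computes the gain in mean value of the LR-ITR over the other two methods at each target pmix. The numbers are copied from Table 1; the variable names are illustrative.

```python
import numpy as np

# Mean values from Table 1, ordered by target pmix.
p_mix = np.array([0.1, 0.25, 0.5, 0.75, 0.9])
standard = np.array([7.827, 6.607, 4.582, 2.560, 1.347])
cte_dr = np.array([7.868, 6.641, 4.602, 2.557, 1.337])
lr = np.array([8.074, 6.846, 4.730, 2.616, 1.417])

gain_vs_standard = lr - standard   # improvement of LR-ITR over the Standard ITR
gain_vs_cte_dr = lr - cte_dr       # improvement of LR-ITR over the CTE-DR-ITR
```

The gain is positive in every column and is largest, at roughly 0.2 to 0.25, when the target pmix is 0.1 or 0.25, the settings farthest from the training value of 0.75.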
3.3.2. Mixture of Subgroups with Contrast Shifts
The simulation setting is the same as in Section 4.2 of Mo et al. [5], which involves contrast shifts in addition to covariate shifts. Specifically, for the CTE function, we follow the same generative model of Y | (A, X), but replace C(X) with C(ξ, X) = −1.5 × (2ξ − 1) − 2x1 + x2. The optimal decision rule given X is , which involves the unobserved ξ. Consequently, the distributions of (Y(1), Y(−1)) | X also shift on the target population compared with those on the training population. We choose μ1 = (−1/2,1/2,0,⋯,0)⊤ and μ2 = μ1.
Table 2 shows that the LR-ITR does not improve over the Standard ITR when the proportion of the mixture changes. In contrast, the CTE-DR-ITR performs better than the Standard ITR when the target pmix is 0.5 or smaller, and is slightly lower otherwise.
Table 2.
Simulation results with ncalib = 50 and mixture of subgroups. Mean (Standard error) over 500 repeats are reported.
| Type | Target pmix = 0.1 | Target pmix = 0.25 | Target pmix = 0.5 | Target pmix = 0.75 | Target pmix = 0.9 |
|---|---|---|---|---|---|
| Standard ITR | 1.143 (0.00434) | 1.232 (0.00329) | 1.383 (0.0015) | 1.535 (0.000543) | 1.632 (0.00142) |
| CTE-DR-ITR | 1.16 (0.00409) | 1.247 (0.00323) | 1.388 (0.00137) | 1.534 (0.00055) | 1.628 (0.00149) |
| LR-ITR | 1.111 (0.00416) | 1.216 (0.00319) | 1.379 (0.00151) | 1.530 (0.00279) | 1.616 (0.00147) |
4. Discussion
Kallus [1] and Mo et al. [5] provide different methods to accommodate covariate shift. We point out the connection between the weight advocated in Kallus [1] and the efficient score for estimating the contrast function. We also propose a likelihood-ratio weighted approach to address covariate shift. Upper bounds on the risk function under the target population are provided when distributional shifts exist. The LR-ITR performs well when only covariate shift exists. When other distributional shifts exist, such as shifts in the conditional distributions of (Y(1), Y(−1)) | X, the CTE-DR-ITR can perform better than the LR-ITR even if the likelihood ratio between the two populations is known.
Acknowledgments
The authors gratefully acknowledge support by R01DK108073 awarded by the National Institutes of Health.
Contributor Information
Muxuan Liang, Public Health Sciences Division, Fred Hutchinson Cancer Research Center.
Ying-Qi Zhao, Public Health Sciences Division, Fred Hutchinson Cancer Research Center.
References
- [1] Kallus N [2020], ‘More efficient policy learning via optimal retargeting’, Journal of the American Statistical Association pp. 1–13.
- [2] Kanamori T, Hido S and Sugiyama M [2009], ‘A least-squares approach to direct importance estimation’, The Journal of Machine Learning Research 10, 1391–1445.
- [3] Liang M and Yu M [2020], ‘A semiparametric approach to model effect modification’, Journal of the American Statistical Association (just-accepted), 1–33.
- [4] Liu Y, Wang Y, Kosorok MR, Zhao Y and Zeng D [2018], ‘Augmented outcome-weighted learning for estimating optimal dynamic treatment regimens’, Statistics in Medicine 37(26), 3776–3788.
- [5] Mo W, Qi Z and Liu Y [2020], ‘Learning optimal distributionally robust individualized treatment rules’, Journal of the American Statistical Association (just-accepted), 1–40.
- [6] Qian M and Murphy SA [2011], ‘Performance guarantees for individualized treatment rules’, Annals of Statistics 39(2), 1180.
- [7] Wager S and Athey S [2018], ‘Estimation and inference of heterogeneous treatment effects using random forests’, Journal of the American Statistical Association 113(523), 1228–1242.
- [8] Zhang B, Tsiatis AA, Laber EB and Davidian M [2012], ‘A robust method for estimating optimal treatment regimes’, Biometrics 68(4), 1010–1018.
- [9] Zhao Y-Q, Laber EB, Ning Y, Saha S and Sands BE [2019], ‘Efficient augmentation and relaxation learning for individualized treatment rules using observational data’, Journal of Machine Learning Research 20, 48–1.
- [10] Zhao Y-Q, Zeng D, Tangen CM and LeBlanc ML [2019], ‘Robustifying trial-derived optimal treatment rules for a target population’, Electronic Journal of Statistics 13(1), 1717.
- [11] Zhao Y, Zeng D, Rush AJ and Kosorok MR [2012], ‘Estimating individualized treatment rules using outcome weighted learning’, Journal of the American Statistical Association 107(499), 1106–1118.
- [12] Zhou X, Mayer-Hamblett N, Khan U and Kosorok MR [2017], ‘Residual weighted learning for estimating individualized treatment rules’, Journal of the American Statistical Association 112(517), 169–187.