Author manuscript; available in PMC: 2022 Jan 1.
Published in final edited form as: J Am Stat Assoc. 2021 Apr 1;116(534):690–693. doi: 10.1080/01621459.2020.1833887

Discussion of Kallus (2020) and Mo et al (2020)

Muxuan Liang 1, Ying-Qi Zhao 2,*
PMCID: PMC8409173  NIHMSID: NIHMS1722688  PMID: 34483404

Abstract

We discuss the results on improving the generalizability of individualized treatment rules (ITRs), following the work of Kallus [1] and Mo et al. [5]. We note that the weights advocated in Kallus [1] are connected to the efficient score of the contrast function. We further propose a likelihood-ratio-based method (LR-ITR) to accommodate covariate shifts, and compare it to the CTE-DR-ITR method proposed in Mo et al. [5]. We provide upper bounds on the risk function of the target population when both covariate shift and contrast function shift are present. Numerical studies show that LR-ITR can outperform CTE-DR-ITR when there is only covariate shift.

Keywords: Generalizability, covariate shift, efficient score, density-ratio estimation

1. Introduction

The problem of constructing an individualized treatment rule (ITR), a function that maps patient characteristics to an available treatment, has recently received significant attention among statistical researchers [6, 11, 8, 12, 4, 9]. These methods typically assume that the training population and the target population, where the ITR will be implemented in the future, are identical. However, when the two populations differ, the estimated ITR may perform poorly on the target population [10]. We congratulate Kallus (2020) and Mo, Qi, and Liu (2020) on their contributions in proposing robust ITR estimation approaches for settings where the covariate distribution changes between the training and target populations, also known as covariate shift.

Let $(X, A, Y)$ be the observed triplet, where $X$ denotes the patient's covariates, $A$ denotes the treatment, and $Y$ denotes the outcome. Suppose that the treatment space is $\mathcal{A} = \{1, -1\}$. Let $Y(1)$ and $Y(-1)$ be the potential outcomes under treatment $A = 1$ and $A = -1$, respectively. Let $\mathbb{P}$ be the distribution of the training population and $\mathbb{P}_{test}$ be the distribution of the testing (target) population. Covariate shift assumes that the conditional distribution of $(Y(1), Y(-1)) \mid X$ is the same on the training and target populations, but the covariate distributions may differ. Consequently, given $X$, the contrast functions satisfy $E[Y(1) - Y(-1) \mid X] = E_{test}[Y(1) - Y(-1) \mid X]$. In the presence of covariate shift, Kallus [1] defines the retargeted policy value and proposes a weighting approach that minimizes the efficient variance in estimating the finite-sample objective. Mo et al. [5] maximize the worst-case value function over a set of possible weights.

We point out that the weights advocated in Kallus [1] to adjust for treatment-control overlap are related to the efficient score for estimating the contrast function. Therefore, his work can be viewed as improving the efficiency of ITR estimation. In contrast to Kallus [1], Mo et al. [5] focus on controlling the possible bias when the covariate distribution shifts. We propose an alternative method, termed likelihood-ratio weighted ITR (LR-ITR), which re-weights the learning objective with the likelihood ratio estimated directly from the training and target data. Numerical results show that the LR-ITR method can improve on the standard ITR approach, and can outperform the CTE-DR-ITR proposed in Mo et al. [5] when there is only covariate shift. On the other hand, the CTE-DR-ITR method can be more generalizable when there is contrast shift in addition to covariate shift, i.e., $E[Y(1) - Y(-1) \mid X] \ne E_{test}[Y(1) - Y(-1) \mid X]$.

This article is organized as follows. In Section 2, we discuss the connection between the weights in Kallus [1] and the efficient estimation of the contrast function [3]. In Section 3, we introduce the LR-ITR approach, along with its theoretical properties, and conduct simulation studies. In Section 4, we briefly summarize our discussion.

2. Retargeting weights and efficient estimation

The weighting approach for adjusting treatment-control overlapping is closely connected with the efficient score of the contrast functions. Given a policy π(a | X), which is a distribution on the treatment space A given X, the retargeted learning objective in Kallus [1] is defined as

$$R(\pi; w, \rho) = E\Big[w(X) \sum_{a \in \mathcal{A}} \big(\pi(a \mid X) - \rho(a \mid X)\big)\, \mu(a \mid X)\Big], \tag{2.1}$$

where $\mu(a \mid X) = E[Y(a) \mid X]$. Kallus [1] advocates using $\rho(a \mid X) = 1/2$ and $w(X) \propto [\mathrm{Var}\{Y(1) \mid X\}/P(A = 1 \mid X) + \mathrm{Var}\{Y(-1) \mid X\}/P(A = -1 \mid X)]^{-1}$. The choice $\rho(a \mid X) = 1/2$ compares the value function under the policy $\pi$ with the value function under pure randomization.
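As a small illustration of (2.1), the sketch below evaluates a sample analogue of the retargeted objective for two arms. The function name, the simulated inputs, and the parameterisation of the reference policy by a single probability `rho_plus` are assumptions made for this example; they are not taken from Kallus [1].

```python
import numpy as np

def retargeted_objective(pi_plus, w, mu_plus, mu_minus, rho_plus=0.5):
    """Sample analogue of R(pi; w, rho) for the two arms {1, -1}.

    pi_plus  : probability that the policy assigns a = 1 given X_i
    w        : retargeting weights w(X_i)
    mu_plus  : mu(1 | X_i);  mu_minus : mu(-1 | X_i)
    rho_plus : rho(1 | X); rho(-1 | X) is taken as 1 - rho_plus
    """
    pi_minus = 1.0 - pi_plus
    rho_minus = 1.0 - rho_plus
    per_unit = (pi_plus - rho_plus) * mu_plus + (pi_minus - rho_minus) * mu_minus
    return float(np.mean(w * per_unit))

rng = np.random.default_rng(0)
n = 1000
mu_plus, mu_minus = rng.normal(size=n), rng.normal(size=n)   # toy outcome models
w = rng.uniform(0.5, 1.5, size=n)                            # toy weights
pi_plus = (mu_plus > mu_minus).astype(float)                 # rule picking the better arm

value = retargeted_objective(pi_plus, w, mu_plus, mu_minus)
# With rho = 1/2 the objective reduces to E[w(X) (pi(1|X) - 1/2) Delta(X)].
shortcut = float(np.mean(w * (pi_plus - 0.5) * (mu_plus - mu_minus)))
```

With $\rho(a \mid X) = 1/2$ the two quantities agree, which makes explicit that the objective depends on the outcome models only through the contrast $\mu(1 \mid X) - \mu(-1 \mid X)$.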

We highlight that the retargeting weight $w(X)$ is connected with the efficient estimation of the contrast function, $\Delta(X) = \mu(1 \mid X) - \mu(-1 \mid X)$. Liang and Yu [3] show that, assuming $\Delta(X) = g(X^\top \beta)$ where $g$ is an unknown function, the efficient score for $\beta$ is

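$$\tilde{w}(X)\, \frac{A}{P(A \mid X)}\, \nabla g(X^\top \beta) \left\{ X - \frac{E[\tilde{w}(X) X \mid X^\top \beta]}{E[\tilde{w}(X) \mid X^\top \beta]} \right\} \{\epsilon - E[\epsilon \mid X]\},$$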

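where $\epsilon = Y - A\, g(X^\top \beta)/2$ and $\tilde{w}(X) \propto \big( E[\epsilon^2 / P(A \mid X)^2 \mid X] - E[\epsilon \mid X]^2\, E[1 / P(A \mid X)^2 \mid X] \big)^{-1}$. Since $\epsilon$ can be written as $\epsilon = S + \tilde{\epsilon}$, where $E[\tilde{\epsilon} \mid X, A] = 0$, we have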

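$$
\begin{aligned}
& E\left[ \frac{\epsilon^2}{[P(A \mid X)]^2} \,\Big|\, X \right] - E[\epsilon \mid X]^2\, E\left[ \frac{1}{[P(A \mid X)]^2} \,\Big|\, X \right] \\
&\quad = E\left[ \frac{(S + \tilde{\epsilon})^2}{[P(A \mid X)]^2} \,\Big|\, X \right] - S^2 \left\{ \frac{1}{P(A = 1 \mid X)} + \frac{1}{P(A = -1 \mid X)} \right\} \\
&\quad = E\left[ \frac{\tilde{\epsilon}^2}{[P(A \mid X)]^2} \,\Big|\, X \right] \\
&\quad = \frac{E[\tilde{\epsilon}^2 \mid A = 1, X]}{P(A = 1 \mid X)} + \frac{E[\tilde{\epsilon}^2 \mid A = -1, X]}{P(A = -1 \mid X)}.
\end{aligned}
$$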

Given that $E[\tilde{\epsilon}^2 \mid A = 1, X] = \mathrm{Var}(Y(1) \mid X)$ and $E[\tilde{\epsilon}^2 \mid A = -1, X] = \mathrm{Var}(Y(-1) \mid X)$, $\tilde{w}(X)$ is also proportional to $[\mathrm{Var}\{Y(1) \mid X\}/P(A = 1 \mid X) + \mathrm{Var}\{Y(-1) \mid X\}/P(A = -1 \mid X)]^{-1}$. As such, the adjustment for treatment-control overlap by the weights in Kallus [1] can also benefit the estimation of the contrast function.
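To make the shape of this weight concrete, the following sketch computes it for a few hypothetical propensities under unit outcome variances. The function name and the normalisation to mean one are conventions assumed here for illustration only.

```python
import numpy as np

def retargeting_weight(var_plus, var_minus, e_plus):
    """Weight proportional to
    [Var{Y(1)|X}/P(A=1|X) + Var{Y(-1)|X}/P(A=-1|X)]^{-1}."""
    var_plus = np.asarray(var_plus, dtype=float)
    var_minus = np.asarray(var_minus, dtype=float)
    e_plus = np.asarray(e_plus, dtype=float)
    raw = 1.0 / (var_plus / e_plus + var_minus / (1.0 - e_plus))
    return raw / raw.mean()   # normalised to mean one (assumed convention)

# Hypothetical propensities P(A = 1 | X) with unit conditional variances.
e = np.array([0.5, 0.2, 0.05])
w = retargeting_weight(np.ones(3), np.ones(3), e)
# With unit variances the unnormalised weight equals e * (1 - e), so
# covariate regions with poor treatment-control overlap are down-weighted.
```

In this toy case the weight is largest at a propensity of 1/2 and shrinks as the propensity approaches 0 or 1.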

3. Likelihood-ratio weighted ITR

Mo et al. [5] propose targeting a class of populations rather than a specific population to learn a distributionally robust individualized treatment rule (ITR). We propose an alternative likelihood-ratio-weighted approach. Following Mo et al. [5], we consider the value function $V(d) = E_{test}[C(X) d(X)]$ given an ITR $d(X)$. Under Assumption 1 (Covariate Changes) in Mo et al. [5], we have

$$\frac{d\mathbb{P}_{test}}{d\mathbb{P}} = \frac{q_x(X)}{p_x(X)},$$

where $q_x$ and $p_x$ are the density functions of the target population and the training population, respectively. Let $w^*(X) = q_x(X) / p_x(X)$. The target value function, $V(d)$, can be reformulated as

$$V(d) = E[w^*(X) C(X) d(X)].$$
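The reweighting identity can be checked by Monte Carlo in a one-dimensional toy problem. Everything in the sketch below (the Gaussian training and target distributions, the contrast, and the rule) is a hypothetical choice made so that the likelihood ratio is known in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def contrast(x):
    """Hypothetical contrast function C(X)."""
    return x

def rule(x):
    """Hypothetical ITR d(X) taking values in {1, -1}."""
    return np.where(x >= 0, 1.0, -1.0)

def w_star(x):
    """Density ratio q_x / p_x for target N(1, 1) over training N(0, 1)."""
    return np.exp(x - 0.5)

x_train = rng.normal(0.0, 1.0, size=n)    # draws from the training population
x_target = rng.normal(1.0, 1.0, size=n)   # draws from the target population

v_direct = np.mean(contrast(x_target) * rule(x_target))                    # E_test[C d]
v_weighted = np.mean(w_star(x_train) * contrast(x_train) * rule(x_train))  # E[w* C d]
```

Up to simulation error, `v_direct` and `v_weighted` estimate the same target value, even though the second uses training covariates only.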

3.1. Algorithm

We estimate $w^*(X)$ using the covariate information from the training population and the target population. Unconstrained least-squares importance fitting (uLSIF) is an efficient algorithm for estimating the likelihood ratio [2]; it has a closed-form solution that can be computed by solving a linear system. Let $\gamma(X) = \sum_{l=1}^{b} \alpha_l \phi_l(X)$, where the $\phi_l(X)$'s are pre-specified basis functions and $b$ is the number of basis functions. Notice that $E[\gamma(X)] = 1$. The uLSIF aims to minimize $E[(\gamma(X) - w^*(X))^2]$, and

$$E[(\gamma(X) - w^*(X))^2] = E[\gamma^2(X)] - 2 E_{test}[\gamma(X)] + C, \tag{3.1}$$

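where $C$ is a constant that does not depend on the $\alpha_l$'s; the middle term uses $E[\gamma(X) w^*(X)] = E_{test}[\gamma(X)]$. Let $\alpha = (\alpha_1, \cdots, \alpha_b)^\top$. With a ridge penalty added, the empirical version of minimizing (3.1) can be written as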

$$\min_{\alpha}\; \alpha^\top \hat{\Psi} \alpha - 2 \alpha^\top h + \lambda \|\alpha\|_2^2, \tag{3.2}$$

where $\hat{\Psi}$ is a $b \times b$ matrix with $(i, j)$th entry $\hat{\Psi}_{i,j} = E[\phi_i(X)\phi_j(X)]$, and $h$ is a $b$-dimensional vector with $i$th coordinate $\hat{h}_i = E_{test}[\phi_i(X)]$. Here $E[\cdot]$ and $E_{test}[\cdot]$ are the empirical expectations over the samples from the training and target populations, respectively. Typically, only a small calibration set from the target population is needed to calculate the $\hat{h}_i$'s. The uLSIF estimator is

$$\hat{\gamma}(X) = \sum_{l=1}^{b} \max\{0, \hat{\alpha}_l\}\, \phi_l(X),$$

where $\hat{\alpha} = (\hat{\alpha}_1, \cdots, \hat{\alpha}_b)^\top$ minimizes (3.2). To stabilize the estimation, we cap $\hat{\gamma}(X)$ at 10. This cap corresponds to the bound on $w^*(X)$ assumed in the simulations of Mo et al. [5]. The implementation procedure is outlined in Algorithm 1. We call this method the likelihood-ratio weighted ITR (LR-ITR).
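A minimal NumPy sketch of the closed-form uLSIF step is given below. It is an illustration under stated assumptions rather than the implementation used in the paper: the kernel centres are placed at the calibration points, the bandwidth and penalty are fixed instead of cross-validated, and the function names and simulated data are hypothetical.

```python
import numpy as np

def gaussian_basis(X, centers, sigma):
    """phi_l(x) = exp(-||x - c_l||^2 / (2 sigma^2)) for each centre c_l."""
    sq = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

def ulsif_fit(X_train, X_target, sigma=1.0, lam=0.1, cap=10.0):
    """Return a function x -> gamma_hat(x) estimating q_x(x) / p_x(x)."""
    centers = X_target                        # assumption: one kernel per calibration point
    Phi_tr = gaussian_basis(X_train, centers, sigma)
    Phi_te = gaussian_basis(X_target, centers, sigma)
    Psi = Phi_tr.T @ Phi_tr / len(X_train)    # Psi_hat[i, j]: training mean of phi_i * phi_j
    h = Phi_te.mean(axis=0)                   # h_hat[i]: target mean of phi_i
    # Setting the gradient of (3.2) to zero gives (Psi + lam * I) alpha = h.
    alpha = np.linalg.solve(Psi + lam * np.eye(len(h)), h)
    alpha = np.maximum(alpha, 0.0)            # keep max{0, alpha_l}

    def gamma_hat(X):
        return np.minimum(gaussian_basis(X, centers, sigma) @ alpha, cap)

    return gamma_hat

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 2))                          # training covariates
X_calib = rng.normal(size=(50, 2)) + np.array([1.0, 0.0])     # shifted calibration set
gamma_hat = ulsif_fit(X_train, X_calib)
g = gamma_hat(X_train)                                        # estimated weights on training data
```

In this toy setup the target population is shifted to the right along the first coordinate, so the estimated weights should on average be larger for training points with a positive first coordinate.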

Algorithm 1: Likelihood-ratio weighted ITR.

Input: $n$ samples $(X, A, Y)$ from the training population and $n_{calib}$ samples from the target population containing only covariates.

Output: An ITR $\hat{d}(X) = \mathrm{sgn}\{X^\top \hat{\beta}\}$.

  1. Estimate $\hat{\gamma}(X) = \sum_{l=1}^{b} \max\{0, \hat{\alpha}_l\}\, \phi_l(X)$, where $\hat{\alpha}$ is the minimizer of
    $$\min_{\alpha}\; \alpha^\top \hat{\Psi} \alpha - 2 \alpha^\top h + \lambda \|\alpha\|_2^2,$$
    and the $\phi_l(\cdot)$'s are chosen as Gaussian kernels with bandwidth $\sigma$. The parameters $\sigma$ and $\lambda$ are tuned by cross-validation. The $n_{calib}$ samples from the target population are used to calculate $h$;
  2. Obtain an estimator $\hat{C}(X)$ by fitting a causal forest [7] using the training data;
  3. Obtain the linear decision rule $\hat{d}(X) = \mathrm{sgn}\{X^\top \hat{\beta}\}$, where $\hat{\beta}$ minimizes
    $$E[\hat{\gamma}(X)\, \hat{C}(X)\, \{\psi(X^\top \beta) - 1\}],$$
    and $\psi$ is the robust smoothed ramp loss [12].
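Step 3 of Algorithm 1 can be sketched as follows, taking the weights and the contrast estimates as given. The piecewise form of the smoothed ramp loss is written from memory of [12] and should be treated as an assumption, as should the use of a generic quasi-Newton optimiser from SciPy started at zero; the data below are simulated, with a known linear contrast standing in for the causal forest estimate.

```python
import numpy as np
from scipy.optimize import minimize

def smoothed_ramp(u):
    """Smoothed ramp loss with values in [0, 2] (form assumed from [12])."""
    u = np.asarray(u, dtype=float)
    return np.where(u >= 1, 0.0,
           np.where(u >= 0, (1 - u) ** 2,
           np.where(u >= -1, 2 - (1 + u) ** 2, 2.0)))

def fit_weighted_linear_rule(X, gamma_hat, C_hat):
    """Minimise mean(gamma_hat * C_hat * (psi(X beta) - 1)) over beta."""
    def objective(beta):
        return np.mean(gamma_hat * C_hat * (smoothed_ramp(X @ beta) - 1.0))
    res = minimize(objective, x0=np.zeros(X.shape[1]), method="L-BFGS-B")
    return res.x

def linear_rule(X, beta):
    """d(X) = sgn{X beta}, with ties assigned to treatment 1."""
    return np.where(X @ beta >= 0, 1, -1)

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
C_hat = X[:, 0] - 0.5 * X[:, 1]     # stand-in for the causal forest estimate
gamma = np.ones(len(X))             # stand-in for the uLSIF weights (no shift)
beta_hat = fit_weighted_linear_rule(X, gamma, C_hat)
agreement = np.mean(linear_rule(X, beta_hat) == np.sign(C_hat))
```

Because $\psi(u) - 1$ equals $-1$ for $u \ge 1$ and $1$ for $u \le -1$, the objective acts as a smooth surrogate for the negative weighted value $-E[\hat{\gamma}(X) \hat{C}(X) d(X)]$.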

3.2. Risk Bounds on the Target Population

In this section, we bound the target-population risk of the ITRs obtained by the different methods. Define the risk function on the target population as

$$L_{test}(d(X)) = E_{test}[C_{test}(X)(-d(X))],$$

where $C_{test}(X) = E_{test}[Y(1) - Y(-1) \mid X]$. The proposed LR-ITR method minimizes the LR-risk function, defined as

$$L_{LR}(d(X)) = E[w^*(X) C(X)(-d(X))],$$

where $C(X) = E[Y(1) - Y(-1) \mid X]$. The CTE-DR-ITR approach proposed in Mo et al. [5] minimizes the DR-risk function associated with parameters $(c, k)$ on the training population, defined as

$$L_{DR}(d(X)) = \sup_{w(X) \in \mathcal{W}(c,k)} E[w(X) C(X)(-d(X))],$$

where $\mathcal{W}(c,k) = \{w(X) : E[w(X)] = 1,\ E[w^k(X)] \le c^k,\ w(X) \in \mathbb{R}_+\}$. The weight function $w(X)$ represents a general density ratio in $\mathcal{W}(c,k)$, and we assume that $w^*(X) \in \mathcal{W}(c,k)$. Given a pre-specified class of ITRs, $\mathcal{D}$, let the minimizer of $L_{LR}(d(X))$ in $\mathcal{D}$ be $d^*_{LR}(X)$ and the minimizer of $L_{DR}(d(X))$ in $\mathcal{D}$ be $d^*_{DR}(X)$. We provide upper bounds on $L_{test}(d^*_{LR}(X))$ and $L_{test}(d^*_{DR}(X))$.

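Let $v^*_{LR}$ be the minimizer (or the sequence converging to the minimum) of $\|C_{test}(X) - v\, C(X)\|_{L_2(\mathbb{P}_{test})}$ for $v > 0$, and $\delta^*_{LR}(X) = C_{test}(X) - v^*_{LR}\, C(X)$. Also, let $v^*_{DR}(X)$ be the minimizer (or the sequence converging to the minimum) of $\|C_{test}(X) - v(X)\, C(X)\|_{L_2(\mathbb{P}_{test})}$ over $\mathcal{V}$, where $\mathcal{V} = \{v(X) \in L_\infty(\mathbb{P}) : v(X) \in \mathbb{R}_+\}$, and $\delta^*_{DR}(X) = C_{test}(X) - v^*_{DR}(X)\, C(X)$. Given $k \ge 2$, we have the following inequalities.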

Theorem 3.1.

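  1. For the LR-ITR approach, we have
    $$L_{test}(d^*_{LR}(X)) \le v^*_{LR}\, L_{LR}(d^*_{LR}(X)) + \|\delta^*_{LR}(X)\|_{L_2(\mathbb{P}_{test})}.$$
  2. For the CTE-DR-ITR approach, we have
    $$L_{test}(d^*_{DR}(X)) \le v^*_{DR}\, L_{DR}(d^*_{DR}(X)) + \|\delta^*_{DR}(X)\|_{L_2(\mathbb{P}_{test})},$$
    where $v^*_{DR} = E[w^*(X)\, v^*_{DR}(X)]$, and
    $$L_{DR}(d^*_{DR}(X)) = \inf_{d} \Big\{ \sup_{w(X) \in \mathcal{W}(\tilde{c},k)} E[w(X) C(X)(-d(X))] \Big\}$$
    with $\tilde{c} = c\, \|v^*_{DR}(X)\|_{L_\infty(\mathbb{P})} / v^*_{DR}$.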
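The first inequality can be motivated by a short calculation. The following is a sketch rather than the authors' proof: it assumes that the risk is the negative value, $L_{test}(d) = E_{test}[C_{test}(X)(-d(X))]$, that $d(X) \in \{1, -1\}$, that the minimizer $v^*_{LR}$ is attained, and that the two populations differ in covariates only through the ratio $w^*(X)$.

```latex
\begin{aligned}
L_{test}(d)
  &= E_{test}\big[\{v^*_{LR}\, C(X) + \delta^*_{LR}(X)\}\,(-d(X))\big] \\
  &= v^*_{LR}\, E\big[w^*(X)\, C(X)\,(-d(X))\big]
     + E_{test}\big[\delta^*_{LR}(X)\,(-d(X))\big] \\
  &\le v^*_{LR}\, L_{LR}(d) + E_{test}\big[\,|\delta^*_{LR}(X)|\,\big] \\
  &\le v^*_{LR}\, L_{LR}(d) + \|\delta^*_{LR}(X)\|_{L_2(\mathbb{P}_{test})}.
\end{aligned}
```

The second line changes measure using $w^*(X)$, the third uses $|d(X)| = 1$, and the last applies the Cauchy-Schwarz inequality. Taking $d = d^*_{LR}$ gives the form stated in part 1.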

When there is a shift in the contrast function, i.e., $C_{test}(X) \ne C(X)$, we have $\|\delta^*_{LR}(X)\|_{L_2(\mathbb{P}_{test})} \ge \|\delta^*_{DR}(X)\|_{L_2(\mathbb{P}_{test})}$.

If the difference between $\|\delta^*_{LR}(X)\|_{L_2(\mathbb{P}_{test})}$ and $\|\delta^*_{DR}(X)\|_{L_2(\mathbb{P}_{test})}$ dominates the difference between $v^*_{LR}\, L_{LR}(d^*_{LR}(X))$ and $v^*_{DR}\, L_{DR}(d^*_{DR}(X))$, then $d^*_{LR}(X)$ leads to a larger upper bound than $d^*_{DR}(X)$. When $C_{test}(X) = C(X)$, we have $\delta^*_{LR}(X) = \delta^*_{DR}(X) = 0$, $v^*_{LR} = 1$, and $v^*_{DR}(X) \equiv 1$, so that $\tilde{c} = c$. Notice that $L_{DR}(d(X)) \ge L_{LR}(d(X))$ when $\tilde{c} = c$. Hence, in this case $d^*_{LR}(X)$ can lead to a smaller upper bound than $d^*_{DR}(X)$.

3.3. Simulations

We compare the performance of the LR-ITR approach with the CTE-DR-ITR approach. Both methods only require a calibration dataset with the covariate vector from the targeted distribution.

3.3.1. Mixture of Subgroups with only Covariate Shifts

We first consider the simulation settings with only covariate shifts present. We modify the simulation setup on the mixture of subgroups in Mo et al. [5] such that the two subgroups share the same contrast function between the training and the target populations, but have different covariate distributions.

We set $n = 1000$ and $p = 10$. We generate the covariate vector as $X \mid \xi \sim \xi\, N_p(\mu_1, I_p) + (1 - \xi)\, N_p(\mu_2, I_p)$, where $\xi \sim \mathrm{Bernoulli}(p_{mix})$ is the unobserved mixture indicator, $p_{mix}$ determines the proportions of the two subgroups, $\mu_1 = (0, 0, 0, \cdots, 0)$ and $\mu_2 = (1.958, 1.958, 0, \cdots, 0)$. We consider different mixture proportions on the training and the target populations. We fix $p_{mix} = 0.75$ on the training population. For the target population, we vary $p_{mix} \in \{0.1, 0.25, 0.5, 0.75, 0.9\}$. We then generate $A \mid X \sim \mathrm{Bernoulli}(1/2)$ and $Y(X, A) = m(X) + (A - 1/2)\, C(X) + N(0, 1)$, where $m(x) = 1 + \sum_{j=1}^{p} x_j / p$ and $C(X) = x_2 - (x_1^3 - 2 x_1)$.

We generate a calibration dataset from the target population with $n_{calib} = 50$, which contains only the covariate information. While the CTE-DR-ITR approach uses this calibration dataset to select the tuning parameter, we use it to estimate the likelihood ratio. We compare the performance of the Standard ITR, the CTE-DR-ITR, and the LR-ITR approaches. To evaluate the estimated ITRs, we generate a testing dataset with $n_{test} = 10^6$. We calculate the value function as $E_{n_{test}}[C(X)\hat{d}(X)]$, where $E_{n_{test}}[\cdot]$ represents the empirical mean on the testing dataset.
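The data-generating process and the evaluation step can be sketched as below. The contrast is read as $C(X) = x_2 - (x_1^3 - 2x_1)$, which is an interpretation of the source formula; the function names, the 0/1 coding of the Bernoulli treatment draw, the reduced test size, and the example coefficient vector are all assumptions made for illustration.

```python
import numpy as np

def contrast(X):
    """C(X) = x2 - (x1^3 - 2 x1), as read from the simulation description."""
    return X[:, 1] - (X[:, 0] ** 3 - 2.0 * X[:, 0])

def generate(n, p_mix, p=10, rng=None):
    """Draw (X, A, Y) from the two-subgroup mixture with proportion p_mix."""
    rng = np.random.default_rng() if rng is None else rng
    mu1 = np.zeros(p)
    mu2 = np.zeros(p)
    mu2[:2] = 1.958
    xi = rng.binomial(1, p_mix, size=n)                 # latent subgroup indicator
    X = rng.normal(size=(n, p)) + np.where(xi[:, None] == 1, mu1, mu2)
    A = rng.binomial(1, 0.5, size=n)                    # A | X ~ Bernoulli(1/2), coded 0/1
    m = 1.0 + X.sum(axis=1) / p
    Y = m + (A - 0.5) * contrast(X) + rng.normal(size=n)
    return X, A, Y

def value(beta, X_test):
    """Empirical value E_ntest[C(X) d(X)] of the linear rule sgn{X beta}."""
    d = np.where(X_test @ beta >= 0, 1, -1)
    return float(np.mean(contrast(X_test) * d))

rng = np.random.default_rng(1)
X_tr, A_tr, Y_tr = generate(1000, p_mix=0.75, rng=rng)       # training sample
X_calib, _, _ = generate(50, p_mix=0.25, rng=rng)            # covariate-only calibration set
X_te, _, _ = generate(100_000, p_mix=0.25, rng=rng)          # test set (smaller than 10^6 for speed)

beta_example = np.zeros(10)
beta_example[0] = -1.0                                       # hypothetical rule: treat when x1 <= 0
v = value(beta_example, X_te)
```

Changing `p_mix` between the training and target calls alters only the covariate distribution, since the contrast does not depend on the latent subgroup indicator.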

Table 1 presents the simulation results summarized over 500 replicates. The results show that the LR-ITR could outperform the CTE-DR-ITR with the presence of only covariate shifts.

Table 1.

Simulation results with ncalib = 50 and mixture of subgroups. Mean (Standard error) over 500 repeats are reported.

Columns give the pmix on the target population.

| Type | 0.1 | 0.25 | 0.5 | 0.75 | 0.9 |
|---|---|---|---|---|---|
| Standard ITR | 7.827 (0.0170) | 6.607 (0.0145) | 4.582 (0.0105) | 2.560 (0.00654) | 1.347 (0.00470) |
| CTE-DR-ITR | 7.868 (0.0185) | 6.641 (0.0158) | 4.602 (0.0113) | 2.557 (0.00738) | 1.337 (0.00534) |
| LR-ITR | 8.074 (0.0165) | 6.846 (0.0133) | 4.730 (0.0101) | 2.616 (0.00608) | 1.417 (0.00403) |

3.3.2. Mixture of Subgroups with Contrast Shifts

The simulation setting is the same as Section 4.2 in Mo et al. [5], which involves contrast shifts in addition to covariate shifts. Specifically, for the CTE function, we follow the same generative model of Y | (A, X), but replace C(X) with C(ξ, X) = −1.5 × (2ξ − 1) − 2x1 + x2. The optimal decision rule given X is $\mathrm{sgn}\{E[C(\xi, X) \mid X]\}$, which involves the unobserved ξ. Consequently, the distributions of (Y(1),Y(−1)) | X also shift on the target population compared with those on the training population. We choose μ1 = (−1/2,1/2,0,⋯,0) and μ2 = μ1.

Table 2 shows that the LR-ITR does not improve over the standard ITR when the proportion of the mixture changes. Nonetheless, the CTE-DR-ITR performs better than the Standard ITR.

Table 2.

Simulation results with ncalib = 50 and mixture of subgroups. Mean (Standard error) over 500 repeats are reported.

Columns give the pmix on the target population.

| Type | 0.1 | 0.25 | 0.5 | 0.75 | 0.9 |
|---|---|---|---|---|---|
| Standard ITR | 1.143 (0.00434) | 1.232 (0.00329) | 1.383 (0.0015) | 1.535 (0.000543) | 1.632 (0.00142) |
| CTE-DR-ITR | 1.16 (0.00409) | 1.247 (0.00323) | 1.388 (0.00137) | 1.534 (0.00055) | 1.628 (0.00149) |
| LR-ITR | 1.111 (0.00416) | 1.216 (0.00319) | 1.379 (0.00151) | 1.530 (0.00279) | 1.616 (0.00147) |

4. Discussion

Kallus [1] and Mo et al. [5] provide different methods to accommodate covariate shift. We point out the connection between the weight advocated in Kallus [1] and the efficient score for estimating the contrast function. We also propose a likelihood-ratio weighted approach to address covariate shift. Upper bounds on the risk function under the target population are provided when distributional shifts exist. The LR-ITR performs well when only covariate shift exists. When other distributional shifts exist, such as shifts in the conditional distributions of (Y(1),Y(−1)) | X, CTE-DR-ITR can perform better than LR-ITR even if the likelihood ratio between the two populations is known.

Supplementary Material

Supp 1

Acknowledgments

The authors gratefully acknowledge support by R01DK108073 awarded by the National Institutes of Health.

Contributor Information

Muxuan Liang, Public Health Sciences Division, Fred Hutchinson Cancer Research Center.

Ying-Qi Zhao, Public Health Sciences Division, Fred Hutchinson Cancer Research Center.

References

  • [1]. Kallus N [2020], ‘More efficient policy learning via optimal retargeting’, Journal of the American Statistical Association pp. 1–13.
  • [2]. Kanamori T, Hido S and Sugiyama M [2009], ‘A least-squares approach to direct importance estimation’, The Journal of Machine Learning Research 10, 1391–1445.
  • [3]. Liang M and Yu M [2020], ‘A semiparametric approach to model effect modification’, Journal of the American Statistical Association (just-accepted), 1–33.
  • [4]. Liu Y, Wang Y, Kosorok MR, Zhao Y and Zeng D [2018], ‘Augmented outcome-weighted learning for estimating optimal dynamic treatment regimens’, Statistics in Medicine 37(26), 3776–3788.
  • [5]. Mo W, Qi Z and Liu Y [2020], ‘Learning optimal distributionally robust individualized treatment rules’, Journal of the American Statistical Association (just-accepted), 1–40.
  • [6]. Qian M and Murphy SA [2011], ‘Performance guarantees for individualized treatment rules’, Annals of Statistics 39(2), 1180.
  • [7]. Wager S and Athey S [2018], ‘Estimation and inference of heterogeneous treatment effects using random forests’, Journal of the American Statistical Association 113(523), 1228–1242.
  • [8]. Zhang B, Tsiatis AA, Laber EB and Davidian M [2012], ‘A robust method for estimating optimal treatment regimes’, Biometrics 68(4), 1010–1018.
  • [9]. Zhao Y-Q, Laber EB, Ning Y, Saha S and Sands BE [2019], ‘Efficient augmentation and relaxation learning for individualized treatment rules using observational data’, Journal of Machine Learning Research 20, 48–1.
  • [10]. Zhao Y-Q, Zeng D, Tangen CM and LeBlanc ML [2019], ‘Robustifying trial-derived optimal treatment rules for a target population’, Electronic Journal of Statistics 13(1), 1717.
  • [11]. Zhao Y, Zeng D, Rush AJ and Kosorok MR [2012], ‘Estimating individualized treatment rules using outcome weighted learning’, Journal of the American Statistical Association 107(499), 1106–1118.
  • [12]. Zhou X, Mayer-Hamblett N, Khan U and Kosorok MR [2017], ‘Residual weighted learning for estimating individualized treatment rules’, Journal of the American Statistical Association 112(517), 169–187.
