Author manuscript; available in PMC: 2025 Jan 1.
Published in final edited form as: J Am Stat Assoc. 2022 Dec 12;119(545):757–772. doi: 10.1080/01621459.2022.2144737

Matching on Generalized Propensity Scores with Continuous Exposures

Xiao Wu 1, Fabrizia Mealli 2,3, Marianthi-Anna Kioumourtzoglou 4, Francesca Dominici 5, Danielle Braun 5
PMCID: PMC10958667  NIHMSID: NIHMS1862278  PMID: 38524247

Abstract

In the context of a binary treatment, matching is a well-established approach in causal inference. However, in the context of a continuous treatment or exposure, matching is still underdeveloped. We propose an innovative matching approach to estimate an average causal exposure-response function under the setting of continuous exposures that relies on the generalized propensity score (GPS). Our approach maintains the following attractive features of matching: a) clear separation between the design and the analysis; b) robustness to model misspecification or to the presence of extreme values of the estimated GPS; c) straightforward assessments of covariate balance. We first introduce an assumption of identifiability, called local weak unconfoundedness. Under this assumption and mild smoothness conditions, we provide theoretical guarantees that our proposed matching estimator attains point-wise consistency and asymptotic normality. In simulations, our proposed matching approach outperforms existing methods under settings with model misspecification or in the presence of extreme values of the estimated GPS. We apply our proposed method to estimate the average causal exposure-response function between long-term PM2.5 exposure and all-cause mortality among 68.5 million Medicare enrollees, 2000–2016. We found strong evidence of a harmful effect of long-term PM2.5 exposure on mortality. Code for the proposed matching approach is provided in the CausalGPS R package, which is available on CRAN and provides a computationally efficient implementation.

Keywords: Causal Inference, Continuous Treatment, Covariate Balance, Non-parametric, Observational Study

1. Introduction

In large-scale observational studies, estimating the causal effects is challenging because: 1) the treatment (named exposure in epidemiology) is often continuous in nature, and thus one has to allow for flexible estimation of the exposure-response function (ERF) on a continuous scale; 2) the exposure assignment is not random, and thus we need to properly adjust for potential confounders (i.e., pre-exposure covariates associated with both exposure and outcome); and 3) in the presence of large datasets, causal inference analyses can be computationally burdensome.

In our motivational example of air pollution epidemiology, confounding adjustment is traditionally achieved by fitting a multivariate regression model with the health outcome as the dependent variable, air pollution exposure as an independent variable, and many potential confounders as additional independent variables (e.g., Di et al. (2017), Liu et al. (2019)). It has been well documented in the literature that traditional regression methods do not allow for a clear distinction between the design and analysis stages, are susceptible to model misspecification, offer limited sensitivity analyses tools to assess underlying assumptions, and often their results cannot be interpreted as causal effects (Rubin et al. 2008). Researchers have advocated the development and implementation of causal inference methods to inform air pollution policy (e.g., Goldman & Dominici (2019)).

Under the potential outcomes framework for causal inference, the design stage (i.e., where we a) define the causal estimands and the target population, b) implement a design-based method such as matching or weighting to construct a matched or weighted dataset, and c) assess the quality of the design using metrics such as covariate balance) and the analysis stage (i.e., where we estimate the causal effects) are distinct (Imbens & Rubin, 2015). A common approach for confounding adjustment in this framework is using the propensity score, i.e., the probability of a unit being assigned to a particular level of a binary exposure, given the pre-exposure covariates. Rosenbaum & Rubin (1983) introduced the idea of using propensity scores to adjust for confounding in observational studies under the potential outcomes framework. After this seminal paper, several propensity score techniques, both for estimation and implementation, have been developed to estimate causal effects in observational studies (see Harder et al. (2010) for a review). However, for the most part, propensity score approaches have been developed in the context of a binary exposure. To handle settings where the exposure might have more than two levels, Joffe & Rosenbaum (1999) and Imbens (2000) introduced the generalized propensity score (GPS) for categorical exposures. Imbens (2000) proposed inverse probability of treatment weighting (IPTW) based on the GPS for confounding adjustment. Although there is no natural analogue of matching and subclassification for the GPS, Lechner (2001) and Yang et al. (2016) proposed alternative ways to estimate causal effects using matching and subclassification in this categorical exposure setting.

Hirano & Imbens (2004) extended the GPS to the continuous exposure setting and defined the GPS as the conditional probability density function of the exposure given the pre-exposure covariates. Hirano & Imbens (2004) proposed a procedure to estimate the causal ERF in which the estimated GPS is included as a covariate in the outcome model (i.e., GPS adjustment). The validity of their approach relies on the assumption that both the GPS model and the outcome model are correctly specified.

Robins et al. (2000) proposed a causal inference approach that relies on weighting by the GPS and can also be used in the continuous exposure setting. More specifically, they introduced marginal structural models in which the causal parameters can be consistently estimated using a class of IPTW estimators that rely on the GPS. However, this approach also requires the correct specification of both the GPS and outcome models. To relax the parametric modeling assumption on the GPS under the weighting framework, Fong et al. (2018), Yiu & Su (2018), Vegetabile et al. (2020), and Tübbicke (2022) proposed various balancing approaches that directly optimize certain features of the weights rather than explicitly modeling the GPS. The distinction between balancing vs. modeling approaches in the context of weighting was reviewed by Chattopadhyay et al. (2020). Kennedy et al. (2017) proposed a non-parametric doubly robust (DR) approach for causal exposure-response estimation in the context of a continuous exposure. Additionally, Colangelo & Lee (2020) and Semenova & Chernozhukov (2021) proposed double machine learning (DML) approaches, which rely on DR moment functions, in the continuous exposure setting. They additionally propose the use of cross-fitting via sample splitting to avoid complexity restrictions and accommodate a wide class of modern regularized methods for the GPS and outcome models (see Kennedy (2022) for a review). The DR estimator is an augmented IPTW estimator that is more robust to model misspecification of either the GPS model or the outcome model (Robins & Rotnitzky, 2001, Bang & Robins, 2005, Cao et al. 2009). To produce consistent estimates, this approach only requires that either the GPS model or the outcome model be correctly specified. Yet, in observational studies, neither the GPS model nor the outcome model is known, and neither is likely to be correctly specified. The literature shows that DR approaches often perform unstably in finite samples when both the GPS and outcome models are misspecified, and that they are sensitive to extreme values of the estimated GPS (Kang & Schafer, 2007, Robins et al. 2007, Waernbaum, 2012). In addition, assessments of covariate balance are often not emphasized in DR methods.

Matching methods, another class of popular causal inference approaches in binary and categorical exposure settings (Rosenbaum & Rubin, 1983, Lechner, 2001, Yang et al. 2016), have the following attractive features: 1) clear separation between the design and the analysis, improving the objectiveness of causal inference (Ho et al. 2007, Rubin et al. 2008); 2) robustness to model misspecification and/or to the presence of extreme values of the estimated GPS (Waernbaum, 2012, Greifer & Stuart, 2021); 3) maintaining the unit of analysis intact and creating an actual matched set, allowing for straightforward assessments of covariate balance and additional sensitivity analyses (Zubizarreta, 2012, Stuart et al. 2020). Yet, to our knowledge, matching approaches have rarely been extended and implemented in causal inference for continuous exposures. The only exception is Zhang et al. (2020), who proposed a non-bipartite matching approach to divide the original dataset into subclasses by grouping units with similar observed covariates. The authors then fitted an agnostic parametric regression model with a fixed effect to each subclass in the continuous exposure setting, and developed the randomization-based inferential procedure of a specific causal parameter, derived from the regression coefficient of the aforementioned parametric model. Zhang et al. (2020) did not create a new matched dataset and did not estimate the non-parametric ERF for a target population. Their approach also does not rely on the GPS as a dimension reduction for multi-dimensional covariates.

In this paper, we develop a novel matching approach for flexibly estimating a non-parametrically specified causal ERF of a continuous exposure. We introduce a GPS matching framework that jointly matches on both the estimated scalar GPS and exposure levels to adjust for confounding bias. In Section 2, we introduce identifiability assumptions and provide identification results for a population average causal ERF. In Section 3, we describe the GPS matching algorithm. In this section, we also introduce measures of covariate balance under the matching framework and a bootstrap inferential procedure for uncertainty quantification. In Section 4, we provide theoretical results showing that our proposed matching estimator attains point-wise consistency and asymptotic normality. In Section 5, via simulations, we demonstrate that the proposed matching estimator has superior finite sample performance compared to existing causal inference methods under several data generating mechanisms. In Section 6, we estimate a causal ERF relating long-term PM2.5 exposure levels to mortality in a large observational administrative cohort comprising 68,503,979 Medicare beneficiaries in the continental United States (2000–2016). We conclude with a discussion in Section 7.

2. The Generalized Propensity Score Function

We use the following mathematical notation: let $N$ denote the study sample size. For each unit $j \in \{1, \ldots, N\}$, let $C_j$ denote the pre-exposure covariates for unit $j$, characterized by a vector $(C_{1j}, \ldots, C_{qj})$ of length $q$; $W_j$ denote the observed continuous exposure for unit $j$, $W_j \in \mathcal{W}$; $Y_j^{obs}$ denote the observed outcome for unit $j$; and $Y_j(w)$ denote the counterfactual outcome for unit $j$ at the exposure level $w$. Let $f_{W_j \mid C_j}(w \mid c)$, for all $w \in \mathcal{W}$, denote the assignment mechanism, defined as the conditional probability density function of each exposure level given the pre-exposure covariates $C_j = c$. One target estimand is the population average causal ERF defined on the specific range of the exposure levels $w \in \mathcal{W}$, $\mu(w) = E\{Y_j(w)\}$.

Under the potential outcomes framework (Rubin, 1974) which was adapted to continuous exposures (Hirano & Imbens, 2004), we establish the following assumptions of identifiability:

Assumption 1 (Consistency). For each unit $j$, $W_j = w$ implies $Y_j^{obs} = Y_j(w)$.

Assumption 2 (Overlap). For all possible values of $c$, the conditional probability density function of receiving any possible exposure $w \in \mathcal{W}$ is positive: $f_{W_j \mid C_j}(w \mid c) \geq p$ for some constant $p > 0$.

This assumption bounds the values of the GPS away from zero. It guarantees that for all possible values of pre-exposure covariates Cj = c, we will be able to consistently estimate μ(w) for each exposure w without relying on extrapolation. This assumption aligns with the positivity assumption of Kennedy et al. (2017).

Condition 1 (Weak Unconfoundedness). The assignment mechanism is weakly unconfounded if, for each unit $j$ and for all $w \in \mathcal{W}$, in which $w$ is continuously distributed with respect to the Lebesgue measure on $\mathcal{W}$, $W_j \perp\!\!\!\perp Y_j(w) \mid C_j$.

Condition 1 refers to the fact that we do not require (conditional) independence of the potential outcomes, $Y_j(w)$, for all $w \in \mathcal{W}$ jointly, i.e., $W_j \perp\!\!\!\perp \{Y_j(w)\}_{w \in \mathcal{W}} \mid C_j$. Instead, we only require conditional independence of the potential outcome, $Y_j(w)$, for a given exposure level $w$. Most causal inference studies using continuous exposures rely on this condition (Hirano & Imbens, 2004, Imai & Van Dyk, 2004, Flores et al. 2007, Galvao & Wang, 2015, Kennedy et al. 2017).

We now introduce Assumption 3, the Local Weak Unconfoundedness assumption, which is less stringent than Condition 1 defined above. We first define the caliper $\delta$ as the radius of the neighborhood set for any exposure level $w$ (i.e., $[w - \delta, w + \delta]$). We specify $\delta$ as a constant for a given dataset with sample size $N$, and we require $\delta \to 0$ as $N \to \infty$. We provide details on the practical considerations in the selection of $\delta$ in Section 3, and theoretical considerations in Section 4.

Assumption 3 (Local Weak Unconfoundedness). The assignment mechanism is locally weakly unconfounded if, for each unit $j$ and all $w \in \mathcal{W}$, in which $w$ is continuously distributed with respect to the Lebesgue measure on $\mathcal{W}$, then for any $\tilde{w} \in [w - \delta, w + \delta]$, $f(Y_j(w) \mid C_j, W_j = \tilde{w}) = f(Y_j(w) \mid C_j)$, where we use $f$ to denote a generic probability density function.

The local refers to the fact that we focus on the conditional independence $I(W_j = \tilde{w}) \perp\!\!\!\perp Y_j(w) \mid C_j$, where the indicator function $I(\cdot)$ is defined by the event $\{W_j = \tilde{w}\}$ and $\tilde{w}$ is in the neighborhood set $[w - \delta, w + \delta]$ around $w$. This assumption is mathematically weaker than Condition 1 and can be deduced from Condition 1, as $I(W_j = \tilde{w})$ is measurable with respect to the $\sigma$-algebra generated by $W_j$. Assumption 3 (together with the other assumptions of identifiability) is sufficient to identify our causal estimand of interest. We would like our method to rely on a minimal set of assumptions because a weaker assumption is more plausible empirically, although we understand all of these assumptions are unverifiable using observational data. We give an example of a multi-valued exposure to provide some intuition (see Example S.1 of the Supplementary Materials). We do not attempt to argue that Assumption 3 is substantively weaker than Condition 1, but that there may be some settings in which Assumption 3 will be satisfied while Condition 1 will not. Also, we find that causal estimands defined in our paper and in the literature are, in general, identified under either assumption (Imbens, 2000, Hirano & Imbens, 2004, Kennedy et al. 2017).

We follow the generalization of the propensity score from binary exposure to continuous exposure as proposed by Hirano & Imbens (2004).

Definition 1. The generalized propensity scores are the conditional probability density functions of the exposure given pre-exposure covariates: $e(c) = \{f_{W_j \mid C_j}(w \mid c),\, w \in \mathcal{W}\}$. The individual generalized propensity score $e(w, c) = f_{W_j \mid C_j}(w \mid c)$ is called an evaluation of $e(c)$ at $W_j = w$.
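
To make Definition 1 concrete, the following minimal R sketch evaluates the GPS under one common modeling choice used later in the paper, a linear model for the exposure with homoscedastic normal residuals. The data layout and function name (estimate_gps_normal) are illustrative and not part of the CausalGPS package.

```r
# Sketch: evaluate e(w, c) assuming W | C ~ N(mean(C), sigma^2), a normal linear GPS model.
# `dat` is assumed to contain the exposure column `w` and covariate columns (e.g., c1, c2).
estimate_gps_normal <- function(dat) {
  fit <- lm(w ~ ., data = dat)                 # conditional mean model for W given C
  sigma_hat <- summary(fit)$sigma              # residual standard deviation
  # return a function that evaluates e(w, c) at any exposure level w_new
  function(w_new, newdata = dat) {
    dnorm(w_new, mean = predict(fit, newdata = newdata), sd = sigma_hat)
  }
}

# Example usage on simulated data:
set.seed(1)
dat <- data.frame(c1 = rnorm(500), c2 = rnorm(500))
dat$w <- 1 + 0.5 * dat$c1 - 0.3 * dat$c2 + rnorm(500)
e_hat <- estimate_gps_normal(dat)
gps_at_obs <- e_hat(dat$w)   # e(W_j, C_j): GPS evaluated at each unit's own exposure
gps_at_w0  <- e_hat(2.0)     # evaluation of e(c) at a fixed level w = 2 for every unit
```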

It is natural to couple Assumption 3 with the following smoothness assumption, which has been used in models with counterfactual outcomes (Kim et al. 2018).

Assumption 4 (Smoothness). For each unit $j$ and any $w \in \mathcal{W}$, (1) $e(w, c)$ is Lipschitz continuous with respect to $w$ for all $c$, and (2) $\mu_{gps}(w, e) \equiv E[Y_j(w) \mid e(W_j, C_j) = e, W_j = w]$ is Lipschitz continuous with respect to $w$ for all $e$. That is, $\forall\, w, w' \in \mathcal{W}$, $|e(w, c) - e(w', c)| \leq A\,|w - w'|$, $\forall\, c$, for some constant $A$, and $|\mu_{gps}(w, e) - \mu_{gps}(w', e)| \leq B\,|w - w'|$, $\forall\, e$, for some constant $B$.

The following Lemmas show that 1) the local weak unconfoundedness holds when we condition on the GPS, and 2) the population average causal ERF is identifiable under Assumptions 1–4.

Lemma 1 (Local Weak Unconfoundedness Given GPS). Suppose the assignment mechanism is locally weakly unconfounded. Then for each unit $j$, all $w \in \mathcal{W}$, and $\tilde{w} \in [w - \delta, w + \delta]$, $f\{Y_j(w) \mid e(\tilde{w}, C_j), W_j = \tilde{w}\} = f\{Y_j(w) \mid e(\tilde{w}, C_j)\}$, where we use $f$ to denote a generic probability density function.

Lemma 2 (Average Causal ERF). Suppose Assumptions 1–4 hold. Then for all $w \in \mathcal{W}$,

$$\mu(w) = E[Y_j(w)] = \lim_{\delta \to 0} E\big[E\{Y_j^{obs} \mid e(W_j, C_j),\, W_j \in [w - \delta, w + \delta]\}\big].$$

Lemmas 1–2 state that, under the local weak unconfoundedness assumption, the population average causal ERF is identifiable (Hirano & Imbens, 2004). Importantly, we can estimate, for each exposure level $w$, the population average ERF by averaging over a set of conditional expectations of observed outcomes, conditioning on a scalar GPS, $e(W_j, C_j)$, and the exposure $W_j \in [w - \delta, w + \delta]$, i.e., $E\{Y_j^{obs} \mid e(W_j, C_j), W_j \in [w - \delta, w + \delta]\}$. This shows that the GPS is able to provide a dimension reduction from multi-dimensional potential confounders to a scalar function even under continuous exposure settings. The proofs of both Lemmas are presented in the Supplementary Materials.

3. Matching Framework

3.1. GPS Matching Algorithm

In a completely randomized experiment, study units are randomized to receive different exposure levels, and therefore units assigned to different levels of exposures will have similar distributions of their pre-exposure covariates (i.e., they will be balanced). In observational studies, because units are not randomized to different exposure levels, their pre-exposure covariates might be imbalanced, and this can lead to confounding bias. The goal of matching is to create a new dataset where the distribution of pre-exposure covariates across different exposure levels is as balanced as possible. When the exposure is binary, this can be achieved by pairing an exposed unit with an unexposed unit that has nearly identical values of pre-exposure covariates and/or of the estimated propensity score. When the exposure is continuous, there is no explicit way to distinguish units as exposed vs. unexposed, and thus a different matching procedure is needed.

Algorithm 1.

GPS Matching Algorithm

a) Design Stage:
 1) We fit a GPS model $e(w, c)$ on the observed data, $\{(w_1, c_1), (w_2, c_2), \ldots, (w_N, c_N)\}$, using either a parametric model (e.g., a parametric linear regression model) or a non-parametric model (e.g., a flexible ensemble learning model). We denote by $\hat{e}(w_j, c_j)$ the estimated GPS for an arbitrary unit $j$ having exposure $w_j$ and covariates $c_j$. We denote $\min(w) = \min_{j \in \{1,2,\ldots,N\}} w_j$ and $\max(w) = \max_{j \in \{1,2,\ldots,N\}} w_j$; $\min(\hat{e}) = \min_{j \in \{1,2,\ldots,N\}} \hat{e}(w_j, c_j)$ and $\max(\hat{e}) = \max_{j \in \{1,2,\ldots,N\}} \hat{e}(w_j, c_j)$. Let $w^*$ and $e^*$ represent the standardized Euclidean transformation of the quantities $w$ and $e$, i.e., for any $(w, e) \in \mathbb{R} \times \mathbb{R}^{+}$, $w^* = \frac{w - \min(w)}{\max(w) - \min(w)}$, $e^* = \frac{e - \min(\hat{e})}{\max(\hat{e}) - \min(\hat{e})}$.
 2) We specify a caliper $\delta$ and we define a predetermined set of exposure levels $w^{(l)}$, which are the mid-points of $L$ equally sized bins, $[w^{(l)} - \delta, w^{(l)} + \delta]$. More specifically, $\{w^{(1)} = \min(w) + \delta,\, w^{(2)} = \min(w) + 3\delta,\, \ldots,\, w^{(L)} = \min(w) + (2L - 1)\delta\}$, where $L = \big\lfloor \frac{\max(w) - \min(w)}{2\delta} + \frac{1}{2} \big\rfloor$.
 3) For each $l$, we create template units $j' = 1, 2, \ldots, N$ with observed covariate values $c_{j'}$ and fixed exposure level $w^{(l)}$. For each $l$ and for each $j'$, we create a matched dataset of dimension $L \times N$ of imputed values of the missing potential outcomes $Y_{j'}(w^{(l)})$. More specifically, we implement a nested-loop algorithm, with $l$ in $1, 2, \ldots, L$ as the outer loop, and $j'$ in $1, \ldots, N$ as the inner loop.
  for l = 1, 2, …, L do
  Choose one exposure level w(l) ∈ {w(1), w(2), …, w(L)}.
  for j′ = 1, 2,…, N do
   3.1) We fix the template unit $j'$ to have exposure $w^{(l)}$ and evaluate the GPS at $(w^{(l)}, c_{j'})$, denoted by $e_{j'}^{(l)}$, based on the fitted GPS model in Step 1.
   3.2) We implement the matching to find an observed unit $j$, denoted by $j_{gps}(e_{j'}^{(l)}, w^{(l)})$, such that $j_{gps}(e_{j'}^{(l)}, w^{(l)}) = \operatorname{argmin}_{j:\, w_j \in [w^{(l)} - \delta,\, w^{(l)} + \delta]} \big\| \big(\lambda \hat{e}^*(w_j, c_j),\, (1 - \lambda) w_j^*\big) - \big(\lambda e_{j'}^{(l)*},\, (1 - \lambda) w^{(l)*}\big) \big\|$, where $\|\cdot\|$ is a pre-specified two-dimensional metric, $\lambda$ is the scale parameter assigning weights to the corresponding two dimensions (i.e., the GPS and the exposure), $\lambda \in [0, 1]$, and $\delta$ is the caliper defined in Step 2.
   3.3) We impute $Y_{j'}(w^{(l)})$ as: $\hat{Y}_{j'}(w^{(l)}) = Y^{obs}_{j_{gps}(e_{j'}^{(l)}, w^{(l)})}$.
  end for
  Note: We allow multiple matches of an observed unit j to different template units j′ throughout the inner-loop j′ in 1, …, N (“matching with replacement”).
  end for
 4) After implementing the algorithm in Step 3, we construct the matched dataset with $N \times L$ units by combining all $\big\{Y^{obs}_{j_{gps}(e_{j'}^{(l)}, w^{(l)})},\, w_{j_{gps}(e_{j'}^{(l)}, w^{(l)})},\, c_{j_{gps}(e_{j'}^{(l)}, w^{(l)})}\big\}$ for $j' = 1, 2, \ldots, N$ and all $l = 1, 2, \ldots, L$.
 5) We assess covariate balance for the matched dataset. If the covariate balance assessment is passed, proceed to the analysis stage, else, rerun steps 1–4 with different specifications. The details of covariate balance assessment are provided in Section 3.2.
b) Analysis Stage:
 6) We compute the estimated quantity of interest $\hat{\mu}_{gps}(w^{(l)}) = \hat{E}[Y_{j'}(w^{(l)})] = \frac{1}{N}\sum_{j'=1}^{N} Y^{obs}_{j_{gps}(e_{j'}^{(l)}, w^{(l)})}$ at the predetermined exposures $w^{(l)}$, for $l = 1, 2, \ldots, L$.
 7) We estimate a smoothed average causal ERF. The point-wise matching estimator $\hat{\mu}_{gps}(w^{(l)})$ in Step 6 can be regarded as a non-parametric estimator with a rectangular kernel. The resulting curve may not be smooth. To improve the smoothness of the curve, we introduce kernel smoothing by either 1) fitting a kernel smoother on the entire matched set constructed in Step 4 to obtain a smoothed average ERF $\hat{\mu}^{(2)}_{gps}(\cdot)$, or 2) replacing the rectangular kernel in $\hat{\mu}_{gps}(\cdot)$ with an Epanechnikov/Gaussian kernel to obtain $\hat{\mu}^{(2)}_{gps}(\cdot)$. Note the smoothed estimator $\hat{\mu}^{(2)}_{gps}(w)$ can be evaluated at any exposure level $w \in [\min(w), \max(w)]$, rather than at the $L$ predetermined exposure levels $\{w^{(1)}, w^{(2)}, \ldots, w^{(L)}\}$, given the extrapolation of kernel smoothing.

We provide details of our proposed GPS matching approach when the exposure is continuous; see Algorithm 1 for details. Briefly, we first specify a caliper $\delta$ and create $L$ equally sized disjoint bins of exposure values $[w^{(l)} - \delta, w^{(l)} + \delta]$, $l = 1, 2, \ldots, L$. For each $l$, we create a new set of hypothetical units $j' = 1, 2, \ldots, N$ with observed covariate values $c_{j'}$ ($c_{j'} = c_j$ if $j' = j$) but we fix their exposure level at $w^{(l)}$. We call these hypothetical units template units. Our goal is, for each exposure bin $l$ and for each template unit $j'$, to impute the $L \times N$ missing potential outcomes $Y_{j'}(w^{(l)})$. To achieve this, for each $l$ and then for each $j'$, we need to find a matched observed unit $j$ such that: 1) unit $j$ has observed exposure $w_j$ that belongs to the bin $l$; and 2) unit $j$ is the nearest neighbor of the template unit $j'$ with respect to a two-dimensional metric (e.g., the Manhattan L1 distance) on the exposure level and the estimated GPS, on a standardized scale. We denote this newly matched observed unit $j$ as $j_{gps}(e_{j'}^{(l)}, w^{(l)})$. Then for each $(l, j')$ we impute the missing potential outcomes as: $\hat{Y}_{j'}(w^{(l)}) = Y^{obs}_{j_{gps}(e_{j'}^{(l)}, w^{(l)})}$. We allow matching with replacement: an observed unit $j$ can be used as a match for multiple template units (see Figure S.1 of the Supplementary Materials for an illustrative example).
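
To make the nested loop concrete, the following is a simplified base-R sketch of the design stage (Steps 2–4 of Algorithm 1), assuming an estimated GPS function e_hat() such as the one sketched in Section 2 and an L1 distance on min-max standardized coordinates; it is an illustrative stand-in, not the CausalGPS implementation.

```r
# Sketch of Algorithm 1, Steps 2-4: jointly match on the standardized GPS and exposure.
gps_match <- function(dat, e_hat, delta, lambda = 1) {
  w <- dat$w
  e_obs <- e_hat(w, dat)                               # e(W_j, C_j) at observed exposures
  std <- function(x, lo, hi) (x - lo) / (hi - lo)      # min-max standardization
  w_star <- std(w, min(w), max(w))
  e_lo <- min(e_obs); e_hi <- max(e_obs)

  L <- floor((max(w) - min(w)) / (2 * delta) + 1 / 2)  # number of exposure levels
  w_levels <- min(w) + (2 * seq_len(L) - 1) * delta    # mid-points w^(l)

  matched <- vector("list", L)
  for (l in seq_len(L)) {
    w_l <- w_levels[l]
    in_caliper <- which(abs(w - w_l) <= delta)         # units with W_j in [w^(l)-delta, w^(l)+delta]
    if (length(in_caliper) == 0) next
    e_template <- e_hat(w_l, dat)                      # GPS of each template unit j' at (w^(l), c_j')
    e_tmpl_star <- std(e_template, e_lo, e_hi)
    e_cand_star <- std(e_obs[in_caliper], e_lo, e_hi)
    w_cand_star <- w_star[in_caliper]
    w_l_star <- std(w_l, min(w), max(w))
    # For each template unit j', nearest neighbour in the two-dimensional L1 metric
    idx <- vapply(seq_len(nrow(dat)), function(jp) {
      d <- lambda * abs(e_cand_star - e_tmpl_star[jp]) +
           (1 - lambda) * abs(w_cand_star - w_l_star)
      in_caliper[which.min(d)]
    }, integer(1))
    matched[[l]] <- data.frame(w_level = w_l, dat[idx, , drop = FALSE], n_match_id = idx)
  }
  do.call(rbind, matched)                              # matched dataset with up to L x N rows
}
```

Because matching is with replacement, the same observed row can appear in the output many times; the illustrative n_match_id column records which observed unit was used, which is what the balance measures in Section 3.2 and the replication counts K(j) in Section 4 rely on.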

Throughout the GPS matching algorithm, decisions need to be made about different elements of the proposed method, including the specification of the GPS model, the distance metrics, the hyperparameters (δ, λ), the measures used to assess covariate balance in the design stage, and the type of non-parametric estimator for exposure-response estimation in the analysis stage. In Sections 3.2–3.3 we provide guidelines on how to make these choices. In Section 3.2, we introduce two covariate balance measures. In Section 3.3, we provide details on how to select the hyperparameters (δ, λ). We also provide an R package, CausalGPS, available on CRAN, to implement Algorithm 1. The package uses OpenMP (Open Multi-Processing) to support multi-platform shared-memory multiprocessing, and thus provides a computationally scalable solution to handle datasets with millions of observations. Algorithm 1 collapses to the propensity score matching proposed in Abadie & Imbens (2006) under the binary exposure setting, and to the GPS matching proposed in Yang et al. (2016) under the categorical exposure setting, if we force the set of exposure levels $w^{(l)}$ to be equal to the discrete exposure levels (e.g., in the binary exposure setting, force $w^{(1)} = 0$ and $w^{(2)} = 1$), and set (δ, λ) = (0, 1) afterwards. In Section S.5 of the Supplementary Materials, we propose a modified bootstrap procedure, named the m-out-of-n bootstrap, to estimate the variance of the GPS matching estimator. In Section S.6 of the Supplementary Materials, we discuss the computational effort of the proposed algorithm.

3.2. Covariate Balance

The goal of covariate balance assessment is to check the degree to which the distribution of observed pre-exposure covariates is similar across all exposure levels (i.e., the balancing condition). We introduce two new measures to assess covariate balance in the design stage for continuous exposures: the absolute correlation and the block absolute standardized bias (BASB). The absolute correlation between the exposure and each pre-exposure covariate is a global measure and can inform whether the whole matched set is balanced. The BASB is a local measure that informs whether a specific exposure block is balanced or not. For the BASB, we estimate differences in means (and associated standard deviations) for each pre-exposure covariate between units with $w_j \in \mathcal{W}_k$ vs. $w_j \notin \mathcal{W}_k$, where we categorize the exposure range $\mathcal{W} = [\min(w), \max(w)]$ into $K$ blocks $\mathcal{W}_k$, $k = 1, 2, \ldots, K$. The term block refers to the fact that the absolute standardized bias is calculated for $W_j$ in the block $\mathcal{W}_k$. The measures above build upon the work of Fong et al. (2018) and Austin (2018), who examine covariate balance conditions with continuous exposures under a weighting framework; we adapt them to the GPS matching framework.

Formally, we define $\{w^{(1)} = \min(w),\, w^{(2)} = \min(w) + \frac{\max(w) - \min(w)}{K},\, \ldots,\, w^{(K+1)} = \max(w)\} \subset [\min(w), \max(w)]$, where $K$ is the number of blocks, and we set $\mathcal{W}_k = [w^{(k)}, w^{(k+1)}]$. For example, the exposure range is categorized by quintile when $K = 5$. Let $r_k$ denote the number of units within the block $\mathcal{W}_k$. Suppose the $i$-th unit in the $k$-th block $\mathcal{W}_k$ has exposure $w_{ik}$ and $q$-dimensional pre-exposure covariates $c_{ik}$, and appears $n_{ik}$ times in the matched dataset. We centralize and orthogonalize the covariates $c_{ik}$ and the exposure $w_{ik}$ as $c_{ik} = S_c^{-1/2}(c_{ik} - \bar{c})$, $w_{ik} = S_w^{-1/2}(w_{ik} - \bar{w})$, where $\bar{c} = \sum_{k=1}^{K}\sum_{i=1}^{r_k} n_{ik} c_{ik} \big/ \sum_{k=1}^{K}\sum_{i=1}^{r_k} n_{ik}$, $S_c = \sum_{k=1}^{K}\sum_{i=1}^{r_k} n_{ik} (c_{ik} - \bar{c})(c_{ik} - \bar{c})^{T} \big/ \sum_{k=1}^{K}\sum_{i=1}^{r_k} n_{ik}$, $\bar{w} = \sum_{k=1}^{K}\sum_{i=1}^{r_k} n_{ik} w_{ik} \big/ \sum_{k=1}^{K}\sum_{i=1}^{r_k} n_{ik}$, and $S_w = \sum_{k=1}^{K}\sum_{i=1}^{r_k} n_{ik} (w_{ik} - \bar{w})^{2} \big/ \sum_{k=1}^{K}\sum_{i=1}^{r_k} n_{ik}$.

Global Measure.

Based on the balancing condition, the correlations between the exposure and the pre-exposure covariates should, on average, be equal to zero if covariate balance is achieved. We assess covariate balance in the matched dataset as $\big|\sum_{k=1}^{K}\sum_{i=1}^{r_k} n_{ik}\, c_{ik} w_{ik} \big/ \sum_{k=1}^{K}\sum_{i=1}^{r_k} n_{ik}\big| < \epsilon_1$, in which the $q$-dimensional vector $\epsilon_1$ indicates pre-specified thresholds, e.g., $\epsilon_1 = (0.1, 0.1, \ldots, 0.1)$ ($q$-dimensional) (Zhu et al. 2015). The pre-specified $\epsilon_1$ serves as a practical guideline, whereas covariate balance measures should be minimized where possible (Imai et al. 2008).

Local Measure.

Based on the balancing condition, any exposure block $k$ should, on average, have zero BASB if covariate balance is achieved. We assess the covariate balance between units with exposure levels within the block $\mathcal{W}_k$ and outside of this block in the matched dataset as $\Big|\sum_{i=1}^{r_k} n_{ik} c_{ik} \big/ \sum_{i=1}^{r_k} n_{ik} \;-\; \sum_{k' \neq k}\sum_{i=1}^{r_{k'}} n_{ik'} c_{ik'} \big/ \sum_{k' \neq k}\sum_{i=1}^{r_{k'}} n_{ik'}\Big| < \epsilon_2$, in which the $q$-dimensional vector $\epsilon_2$ indicates pre-specified thresholds, e.g., $\epsilon_2 = (0.2, 0.2, \ldots, 0.2)$ ($q$-dimensional) (Harder et al. 2010).

Researchers can also specify covariate balance measures that average over all $q$ observed pre-exposure covariates. The average absolute correlation is defined as the average of the absolute correlations of all $q$ observed pre-exposure covariates. Similarly, the average BASB is defined as the average of the absolute standardized biases of all $q$ observed pre-exposure covariates for each block $k$: 1) average absolute correlation, $\overline{\big|\sum_{k=1}^{K}\sum_{i=1}^{r_k} n_{ik} c_{ik} w_{ik} \big/ \sum_{k=1}^{K}\sum_{i=1}^{r_k} n_{ik}\big|}$; 2) average BASB, $\overline{\big|\sum_{i=1}^{r_k} n_{ik} c_{ik} \big/ \sum_{i=1}^{r_k} n_{ik} - \sum_{k' \neq k}\sum_{i=1}^{r_{k'}} n_{ik'} c_{ik'} \big/ \sum_{k' \neq k}\sum_{i=1}^{r_{k'}} n_{ik'}\big|}$, for $k = 1, 2, \ldots, K$, where $\bar{V}$ indicates the mean across the elements of vector $V$.
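
The sketch below computes both measures on a matched dataset, assuming an exposure vector w, a covariate matrix C, and optional replication counts n (how many times each row appears); for simplicity it standardizes each covariate marginally, a diagonal stand-in for the $S_c^{-1/2}$ orthogonalization above.

```r
# Sketch: average absolute correlation (global) and average BASB (local) balance measures.
balance_measures <- function(w, C, n = rep(1, length(w)), K = 5) {
  C <- as.matrix(C)
  # weighted centering and marginal scaling of each column
  wc <- function(x) {
    m <- sum(n * x) / sum(n)
    (x - m) / sqrt(sum(n * (x - m)^2) / sum(n))
  }
  C_std <- apply(C, 2, wc)
  w_std <- wc(w)

  # per-covariate weighted absolute correlation (normalized to lie in [-1, 1])
  abs_cor <- abs(colSums(n * C_std * w_std) / sum(n))

  breaks <- seq(min(w), max(w), length.out = K + 1)
  block  <- cut(w, breaks, include.lowest = TRUE, labels = FALSE)
  basb <- sapply(seq_len(K), function(k) {
    inside <- block == k
    m_in  <- colSums(n[inside]  * C_std[inside, , drop = FALSE])  / sum(n[inside])
    m_out <- colSums(n[!inside] * C_std[!inside, , drop = FALSE]) / sum(n[!inside])
    abs(m_in - m_out)                              # absolute standardized bias per covariate
  })
  basb <- matrix(basb, nrow = ncol(C_std))         # q x K matrix of BASB values
  list(avg_abs_cor = mean(abs_cor), avg_basb = colMeans(basb))
}
```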

3.3. Selecting the Hyperparameters (δ, λ)

As detailed in Algorithm 1, the hyperparameters δ and λ need to be specified. Intuitively, 1) the caliper δ should be relatively small so that the matched observed unit $j$ is ensured to have an exposure $W_j$ that is close to the exposure level $w^{(l)}$ of the template unit $j'$; 2) the scale parameter λ should be close to 1 so that the observed unit is a good match to the template unit $j'$ with respect to the GPS, thus potentially achieving the desired covariate balance in the matched dataset (Flores et al. 2007). Furthermore, δ should depend on the sample size N to align with the asymptotic results for the matching estimator in Section 4. Although there is no absolute restriction on the caliper size, the practical guideline for determining the caliper size is similar to the bandwidth selection procedure for kernel smoothing methods (i.e., conduct a grid search among a candidate set of reasonably small δ and choose the optimal hyperparameter based on a pre-specified criterion).

Selecting λ = 1 is a practical solution motivated by empirical evidence when δ is properly selected a priori and computational resources are restricted. Yet, the choice of both hyperparameters may depend on data, and researchers may have no prior information on how to choose hyperparameters. Setting an overly small δ may result in no feasible match; whereas, for some larger δ, there may be scenarios in which multiple observed units are qualified for a match and we want to choose one among them based on covariate balance measures. Also, setting λ = 1 does not always result in optimal covariate balance if the caliper δ varies at the same time.

We introduce a data-driven approach to select the hyperparameters (δ, λ) simultaneously, aiming to achieve optimal covariate balance. The optimal (δ, λ) can be specified by optimizing a utility function that measures the degree of covariate balance (e.g., the average absolute correlation or the average BASB) (McCaffrey et al. 2004, Zhu et al. 2015). Because the optimal (δ, λ) aim at achieving covariate balance on the entire matched dataset, the average absolute correlation is a suitable global measure in practice. We summarize our data-driven tuning procedure as follows:

  1. Specify a candidate set of (δ, λ), where the candidate δ's are relatively small and a grid of λ's ranges from 0 to 1 (fixing λ = 1 when computational resources are restricted).

  2. Construct the matched dataset by implementing the design stage of Algorithm 1 with a pair of (δ, λ) from the pre-specified candidate set.

  3. Calculate the average absolute correlation (or other pre-specified measures for covariate balance) on this matched dataset.

  4. Repeat steps 2–3 using grid search on the pre-specified candidate set of (δ, λ).

  5. Find the (δ, λ) which minimizes the average absolute correlation (or optimizing other pre-specified measures for covariate balance), leading to the best covariate balance.

The tuning procedure is conducted in the design stage without access to outcome information; thus, this procedure neither biases analyses of outcomes nor requires corrections for multiple comparisons (Zhu et al. 2015, Rosenbaum, 2020).
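
A sketch of this tuning loop, reusing the gps_match() and balance_measures() helpers sketched in Sections 3.1–3.2 and grid-searching (δ, λ) for the smallest average absolute correlation:

```r
# Sketch: data-driven selection of (delta, lambda) by minimizing average absolute correlation.
tune_delta_lambda <- function(dat, e_hat, covariate_cols,
                              deltas = c(0.05, 0.1, 0.2),
                              lambdas = c(0.5, 0.75, 1)) {
  grid <- expand.grid(delta = deltas, lambda = lambdas)
  grid$avg_abs_cor <- NA_real_
  for (g in seq_len(nrow(grid))) {
    matched <- gps_match(dat, e_hat, delta = grid$delta[g], lambda = grid$lambda[g])
    # each matched row is one use of an observed unit, so counts of 1 per row already
    # weight units by how often they were matched
    bal <- balance_measures(w = matched$w, C = matched[, covariate_cols, drop = FALSE])
    grid$avg_abs_cor[g] <- bal$avg_abs_cor
  }
  grid[which.min(grid$avg_abs_cor), ]   # the (delta, lambda) pair with the best balance
}
```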

4. Asymptotic Properties

We present the asymptotic properties of the proposed matching estimators for the population average causal ERF μ(w), where we match either 1) on a scalar covariate, 2) on the true GPS, or 3) on the GPS consistently estimated by a parametric model, given the fixed scale parameter λ = 1 with caliper size $\delta = o(N^{-1/3})$ and $N\delta \to \infty$. We focus on point-wise asymptotic properties with respect to each exposure level $w$, given that the data are independent and identically distributed (iid). The summary conclusions are that the proposed matching estimator is asymptotically unbiased, consistent, and asymptotically normal with a non-parametric rate $(N\delta)^{-1/2}$ when matching on a scalar covariate (e.g., the GPS), yet these properties do not necessarily hold when matching on multidimensional covariates, which justifies the GPS matching. Finally, we propose to smooth the estimator by using a kernel smoother with a proper bandwidth parameter $h$. Assuming $h \geq \delta = o(N^{-1/3})$, asymptotic normality holds for the smoothed matching estimator with rate $(Nh)^{-1/2}$.

We begin by defining the conditional means and variances of potential outcomes given pre-exposure covariates and given the GPS as follows:

$$\mu_C(w, c) = E\{Y_j(w) \mid W_j = w, C_j = c\}; \qquad \mu_{gps}(w, e) = E\{Y_j(w) \mid W_j = w, e(W_j, C_j) = e\};$$
$$\sigma_C^2(w, c) = \mathrm{Var}\{Y_j(w) \mid W_j = w, C_j = c\}; \qquad \sigma_{gps}^2(w, e) = \mathrm{Var}\{Y_j(w) \mid W_j = w, e(W_j, C_j) = e\}.$$

To simplify the algebraic expression, we only consider one-to-one nearest neighbor matching on a set of continuous covariates Cj. The matching estimator for μ(w) can be defined as,

$$\hat{\mu}(w) = \frac{1}{N}\sum_{j=1}^{N} K(j)\, Y_j\, I_j(w, \delta),$$

where $K(j)$ indicates the number of times observed unit $j$ is used as a match (matching with replacement), and $I_j(w, \delta) = I(W_j \in [w - \delta, w + \delta])$. The difference between the matching estimator $\hat{\mu}(w)$ and the true population average causal ERF $\mu(w)$ can be decomposed as,

$$\hat{\mu}(w) - \mu(w) = \{\bar{\mu}(w) - \mu(w)\} + B_{\mu}(w) + \mathcal{E}_{\mu}(w), \qquad (1)$$

where $\bar{\mu}(w)$ is the average of the conditional means of potential outcomes given pre-exposure covariates, $B_{\mu}(w)$ is the conditional bias of the matching estimator related to $\bar{\mu}(w)$, and $\mathcal{E}_{\mu}(w)$ is the average of the conditional residuals of the matching estimator. Specifically, let $j(j')$ indicate the nearest neighbor match for the template unit $(w, C_{j'})$. Note the nearest neighbor match for $(w, C_{j'})$ depends on $w$. Since we focus on each $w$ point-wise, we omit $w$ in the definition of $j(j')$ for conciseness. We have,

$$\bar{\mu}(w) = \frac{1}{N}\sum_{j'=1}^{N} \mu_C(w, C_{j'}); \qquad B_{\mu}(w) = \frac{1}{N}\sum_{j'=1}^{N} B_{\mu, j'} = \frac{1}{N}\sum_{j'=1}^{N} \big\{\mu_C\big(W_{j(j')}, C_{j(j')}\big) - \mu_C(w, C_{j'})\big\};$$
$$\mathcal{E}_{\mu}(w) = \frac{1}{N}\sum_{j=1}^{N} K(j)\, \mathcal{E}_{\mu, j}\, I_j(w, \delta) = \frac{1}{N}\sum_{j=1}^{N} K(j)\big\{Y_j - \mu_C(W_j, C_j)\big\}\, I_j(w, \delta).$$

Theorem 1 (The Order of Bias). Assume Assumptions 1–4 and the uniform boundedness assumption (S.1 in the Supplementary Materials) hold. If $C_j$ is scalar, the order of the bias of the proposed matching estimator, that is, $B_{\mu}(w)$, is $O_p\big(\max\{(N\delta)^{-1}, \delta\}\big)$.

Theorem 1 provides the stochastic order of the bias terms in Equation 1. Under the given conditions on δ, the bias term will be asymptotically negligible. The rate is faster than $(N\delta)^{-1/2}$ given $\delta = o(N^{-1/3})$, which guarantees the bias does not dominate the asymptotic behavior of $\hat{\mu}(w)$.

Theorem 2 (Variance). Assume Assumptions 1–4 and S.1 in the Supplementary Materials hold. If $C_j$ is scalar, $(N\delta)\,\mathrm{Var}\{\hat{\mu}(w)\} = E\Big[\sigma_C^2(w, C_j)\Big\{\frac{3 f_w(w)}{2\, e(w, C_j)}\Big\}\Big] + o_p(1)$.

Theorem 2 shows the asymptotic variance for μ^(w) is finite and provides an expression for it.

Theorem 3 (Consistency). Assume Assumptions 1–4 and S.1 in the Supplementary Materials hold. If $C_j$ is scalar, $\hat{\mu}(w) - \mu(w) \overset{p}{\to} 0$.

Theorem 3 shows the proposed matching estimator is point-wise consistent.

Theorem 4 (Asymptotic Normality). Assume Assumptions 1–4 and S.1 in the Supplementary Materials hold. If $C_j$ is scalar, $\Sigma_C^{-1/2}(N\delta)^{1/2}\{\hat{\mu}(w) - \mu(w)\} \overset{d}{\to} \mathcal{N}\{0, 1\}$, where $\Sigma_C = \frac{1}{N}\sum_{j=1}^{N}\big[\delta\, K(j)^2\, \sigma_C^2(W_j, C_j)\, I_j(w, \delta)\big]$.

Theorem 4 shows that when the set of matching covariates contains only one continuously distributed variable, the matching estimator is $(N\delta)^{1/2}$-consistent and asymptotically normal. Note $\Sigma_C$ depends on $w$. Relative to matching directly on the covariates, propensity score matching has the advantage of reducing the dimensionality of matching to a single dimension (Abadie & Imbens, 2016). Therefore, for GPS matching, we have the following corollary.

Corollary 1 (Asymptotic Normality with GPS). Assume Assumptions 1–4 and the uniform boundedness assumption (S.2 in the Supplementary Materials) hold.

$$\Sigma_{gps}^{-1/2}(N\delta)^{1/2}\{\hat{\mu}_{gps}(w) - \mu(w)\} \overset{d}{\to} \mathcal{N}\{0, 1\}, \qquad \Sigma_{gps} = \frac{1}{N}\sum_{j=1}^{N}\Big[\delta\, K(j)^2\, \sigma_{gps}^2\{W_j, e(W_j, C_j)\}\, I_j(w, \delta)\Big].$$

In observational studies, we may never know the underlying assignment mechanism of exposures, and thus the true GPS values are unknown. Consequently, the GPS has to be estimated by statistical models prior to matching. Abadie & Imbens (2016) derived and proved large sample properties of propensity score matching estimators that correct for the first step estimation of the propensity score. Following Abadie & Imbens (2016), we consider a parametric specification for the GPS model e(w, c) = g(w, c; θ), where g is a known link function and θ is a finite-dimensional set of parameters. We estimate θ by maximum likelihood estimation (MLE) (see Section S.2 of the Supplementary Materials for additional details).

To simplify the notation, we define $e(w, c; \theta)$ as the GPS with parameter vector $\theta$; therefore, $e(w, c; \hat{\theta})$ denotes the estimated GPS. We further define $\hat{\mu}_{gps}(w; \theta)$ as the matching estimator using the GPS with parameter vector $\theta$; therefore, $\hat{\mu}_{gps}(w; \hat{\theta})$ denotes the matching estimator using the estimated GPS. We define $\sigma_{gps}^2(w, e; \theta)$ analogously.

Theorem 5 (Asymptotic Normality with Estimated GPS). Assume Assumptions 1–4, the uniform boundedness assumption, and the convergence in probability assumption (S.2–S.3 in the Supplementary Materials) hold. The GPS model has a parametric specification with parameter vector $\theta$, and the parameter estimate $\hat{\theta}$ is obtained by MLE. Then, the matching estimator $\hat{\mu}_{gps}(w; \hat{\theta})$ satisfies

$$\Sigma_{gps}^{-1/2}(N\delta)^{1/2}\{\hat{\mu}_{gps}(w; \hat{\theta}) - \mu(w)\} \overset{d}{\to} \mathcal{N}\{0, 1\}, \qquad \Sigma_{gps} = \frac{1}{N}\sum_{j=1}^{N}\Big[\delta\, K(j)^2\, \sigma_{gps}^2\{W_j, e(W_j, C_j; \theta); \theta\}\, I_j(w, \delta)\Big].$$

Theorem 5 states that no matter whether we match on the true GPS or the GPS consistently estimated by a parametric model, the asymptotic properties are unchanged. Importantly, the form of the asymptotic variance remains the same if the GPS model has a parametric specification and thus the estimated parameter obtained by MLE in the GPS model has a convergence rate of N−1/2.

To reduce the jaggedness and improve the finite sample performance of the matching estimator proposed in Theorem 5, we propose to smooth the estimator by using a kernel smoother with a proper bandwidth parameter h (Wand & Jones, 1994; Heller, 2007).

Proposition 1 (Asymptotic Normality of Smoothed ERF). We denote the smoothed matching estimator as $\hat{\mu}^{(2)}_{gps}(w; \hat{\theta}) = \frac{1}{N}\sum_{j=1}^{N} K(j)\, Y_j\, \Psi\big(\frac{W_j - w}{h}\big)$, where $w$ is an interior point of the support of $W_j$, $\Psi(\cdot)$ is a kernel, a unimodal symmetric probability density function with maximum at 0 and support $[-1, 1]$, and $h \geq 0$ is the bandwidth. Assume Assumptions 1–4 and S.2–S.3 in the Supplementary Materials hold, and $h \geq \delta = o(N^{-1/3})$. The GPS model has a parametric specification with parameter vector $\theta$, and the parameter estimate $\hat{\theta}$ is obtained by MLE. Then, the smoothed matching estimator $\hat{\mu}^{(2)}_{gps}(w; \hat{\theta})$ satisfies

$$\Big[\tfrac{\delta}{h}\,\Sigma_{gps}\Big]^{-1/2}(Nh)^{1/2}\big\{\hat{\mu}^{(2)}_{gps}(w; \hat{\theta}) - \mu(w)\big\} \overset{d}{\to} \mathcal{N}\Big\{0,\ \int_{-1}^{1}\Psi^{2}(u)\,du\Big\},$$

where Σgps is the same variance function as defined in Theorem 5.

Proofs of Theorems 1–5 and Proposition 1 are provided in the Supplementary Materials.

5. Simulations

We conduct simulation studies to evaluate the performance of the newly proposed GPS matching approach compared to five state-of-the-art alternatives: 1) the GPS adjustment estimator (Hirano & Imbens, 2004); 2) the IPTW estimator (Robins et al. 2000); 3) the non-parametric DR estimator (Kennedy et al. 2017); 4) the double machine learning (DML) estimator (Colangelo & Lee, 2020); and 5) the covariate balancing propensity score (CBPS) weighting estimator (Fong et al. 2018). We also compare the performance of each estimator (except for DML and CBPS weighting) when estimating the GPS using 1) a parametric linear regression model assuming normal residuals (Hirano & Imbens, 2004) and 2) a cross-validation-based Super Learner algorithm (Van der Laan et al. 2007, Kennedy et al. 2017). For DML, we estimate the GPS using the lasso regression recommended by Colangelo & Lee (2020), and we choose 2-fold sample splits in cross-fitting. In the Supplementary Materials, we vary the model specifications and sample splits. For CBPS, we calculate the GPS by directly optimizing the covariate balancing condition, i.e., minimizing the weighted correlation between exposures and pre-exposure covariates. For the IPTW and DR estimators, we follow the common practice of stabilizing the weights, and consider both untrimmed and trimmed weights.

5.1. Simulation Settings

We generate six pre-exposure covariates $(C_1, C_2, \ldots, C_6)$, which include a combination of continuous and categorical variables: $C_1, \ldots, C_4 \sim \mathcal{N}(0, I_4)$, $C_5 \sim V\{-2, 2\}$, $C_6 \sim U(-3, 3)$, where $\mathcal{N}(0, I_4)$ denotes a multivariate normal distribution, $V\{-2, 2\}$ denotes a discrete uniform distribution, and $U(-3, 3)$ denotes a continuous uniform distribution. We generate $W$ using seven different specifications of the GPS model, all relying on the cardinal function $\gamma(C) = -0.8 + (0.1, 0.1, -0.1, 0.2, 0.1, 0.1)C$. We describe the details and rationale behind the choice of these data generating mechanisms in Section S.3.1 of the Supplementary Materials. We generate $Y$ from an outcome model that is assumed to be a cubic function of $W$ with additive terms for the confounders and interactions between $W$ and the confounders, $Y \mid W, C \sim \mathcal{N}\{\mu(W, C), 10^2\}$, where $\mu(W, C) = 1 - (2, 2, 3, 1, 2, 2)C - W(0.1 - 0.1C_1 + 0.1C_4 + 0.1C_5 + 0.1C_3^2) + 0.13^2 W^3$. For each of the seven GPS model specifications (scenarios), we vary the sample size $N$ (= 200, 1000, 5000). For each combination of model specification and sample size, we generate $S = 500$ simulated datasets.

After generating the data, we estimate the ERF for each simulation scenario using six different approaches, including the GPS matching approach and the five state-of-the-art alternatives. For the IPTW and DR estimators, we report results based on untrimmed and trimmed weights, respectively (see Section S.3.2 of the Supplementary Materials for the implementation details). For all scenarios, we present two sets of simulation studies: one based on the GPS estimated by a parametric linear regression model assuming normal residuals (i.e., parametric MLE), and one based on the GPS estimated by the cross-validation-based Super Learner algorithm, i.e., an ensemble learning method with four learners, including extreme gradient boosting machines (GBM), multivariate adaptive regression splines, generalized additive models, and random forests (implemented by the SuperLearner R package with four algorithms: SL.xgboost, SL.earth, SL.gam, SL.ranger).
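
For reference, a hedged sketch of the Super Learner step: the conditional mean of the exposure is modeled with the four learners named above and then, as one common simplification, the residuals are treated as Gaussian to turn the fitted mean into a conditional density; the paper's exact residual-density choice may differ.

```r
library(SuperLearner)  # the wrappers below also require xgboost, earth, gam, and ranger

# Sketch: estimate the GPS with Super Learner by modeling E[W | C], then converting the
# fitted mean into a conditional density under a working normal residual model (an assumption).
estimate_gps_sl <- function(w, C) {
  sl_fit <- SuperLearner(Y = w, X = as.data.frame(C), family = gaussian(),
                         SL.library = c("SL.xgboost", "SL.earth", "SL.gam", "SL.ranger"))
  mu_hat    <- as.numeric(predict(sl_fit, newdata = as.data.frame(C))$pred)
  sigma_hat <- sd(w - mu_hat)                      # working residual SD
  function(w_new, C_new = C) {
    mu_new <- as.numeric(predict(sl_fit, newdata = as.data.frame(C_new))$pred)
    dnorm(w_new, mean = mu_new, sd = sigma_hat)    # estimated e(w, c)
  }
}
```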

To assess the performance of the different estimators, we calculate the absolute bias and root mean squared error (RMSE) of the estimated ERF. These two quantities were estimated empirically at each point within the range $\mathcal{W}^*$ and integrated across the range $\mathcal{W}^*$. They are defined as follows: $\text{Absolute Bias} = \int_{\mathcal{W}^*} \big|\frac{1}{S}\sum_{s=1}^{S} \hat{\mu}_s(w) - \mu(w)\big|\, f_w(w)\, dw$, $\text{RMSE} = \int_{\mathcal{W}^*} \big[\frac{1}{S}\sum_{s=1}^{S} \{\hat{\mu}_s(w) - \mu(w)\}^2\big]^{1/2} f_w(w)\, dw$, where $S$ denotes the number of replicates, and $\mathcal{W}^*$ denotes a trimmed version of the support $\mathcal{W}$, excluding 10% of mass at the boundaries to avoid boundary instability (Kennedy et al. 2017). We also evaluate the coverage rates of the m-out-of-n bootstrap point-wise Wald $100(1 - \alpha)\%$ confidence intervals (setting $\alpha = 0.05$). The (average) coverage rate is defined as follows: $\text{Coverage Rate} = \int_{\mathcal{W}^*} \Big[\frac{1}{S}\sum_{s=1}^{S} I\Big(\mu(w) \in \big\{\hat{\mu}_s(w) \pm z_{1-\alpha/2} \times \sqrt{\mathrm{Var}_{boot,s}[\hat{\mu}_s(w)]}\big\}\Big)\Big] f_w(w)\, dw$.
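
The sketch below computes these three summaries empirically. It assumes mu_hat and var_boot are S x G matrices of point estimates and bootstrap variances over a grid of exposure points drawn from f_w (so simple averages over the grid approximate the f_w-weighted integrals), and it trims 5% of mass in each tail as one reading of the 10% boundary exclusion; all names are illustrative.

```r
# Sketch: integrated absolute bias, RMSE, and coverage of point-wise Wald intervals.
erf_metrics <- function(mu_hat, var_boot, mu_true, w_grid, trim = 0.05, alpha = 0.05) {
  keep <- w_grid >= quantile(w_grid, trim) & w_grid <= quantile(w_grid, 1 - trim)
  mu_hat   <- mu_hat[, keep, drop = FALSE]
  var_boot <- var_boot[, keep, drop = FALSE]
  mu_true  <- mu_true[keep]

  bias_w <- abs(colMeans(mu_hat) - mu_true)                        # point-wise |bias|
  rmse_w <- sqrt(colMeans(sweep(mu_hat, 2, mu_true)^2))            # point-wise RMSE
  z <- qnorm(1 - alpha / 2)
  covered <- abs(sweep(mu_hat, 2, mu_true)) <= z * sqrt(var_boot)  # point-wise CI coverage
  list(abs_bias = mean(bias_w), rmse = mean(rmse_w), coverage = mean(colMeans(covered)))
}
```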

5.2. Covariate Balance Assessment

We can compare the values of the covariate balance measures, e.g., the absolute correlation or the BASB, described in Section 3.2 between the matched dataset and the unadjusted dataset. If the average absolute correlation and the average BASB for the observed covariates in the matched dataset are smaller than those in the unadjusted dataset, we conclude our approach improves covariate balance. We choose an L1 distance metric and use the data-driven approach proposed in Section 3.3 to select the hyperparameters (δ, λ). The implementation details and selected hyperparameters (δ, λ) for each simulation scenario are described in Section S.3.5 of the Supplementary Materials.

For two of the approaches, 1) the proposed GPS matching and 2) CBPS weighting, we assess covariate balance using the absolute correlation measure. We calculate the absolute correlations for each of the six covariates in the matched dataset constructed by GPS matching, and compare them with the weighted absolute correlations obtained by CBPS and the absolute correlations in the unadjusted dataset. We use an average absolute correlation < 0.1 as the threshold indicating good covariate balance. We fail to achieve this pre-specified threshold under scenario 7, with a minimal average absolute correlation of 0.15 among all simulated datasets. Scenario 7 suggests that our proposed method is unable to achieve covariate balance when the exposure assignment is highly imbalanced in the covariates. We recommend that researchers proceed to the analysis stage only if covariate balance has been achieved in the design stage. In Figure 1, we show the absolute correlation results of GPS matching from the remaining six simulation settings (scenarios 1–6) under two different approaches to estimate the GPS (i.e., Super Learner and linear regression) with sample size N = 5000, and compare them to the results of CBPS weighting using the same simulated dataset. We see that covariate balance improves substantially for both GPS matching and CBPS across all six covariates for all simulation settings. Under five out of six settings (scenarios 1, 3–6), the absolute correlations for the GPS matched dataset for all confounders are < 0.10, which indicates excellent covariate balance. For matching, the absolute correlations are, in general, slightly smaller when using a Super Learner algorithm to estimate the GPS, compared to using a parametric linear regression model. Compared to CBPS, we found the performance in terms of covariate balance is, in general, comparable between GPS matching and CBPS weighting (shown in Figure 1). As expected, the GPS matching approach is advantageous under settings with extreme values of the GPS (scenario 2). We also calculate the BASB in the matched dataset constructed by GPS matching, where we categorize the exposure range by quintile (K = 5), in all six simulation scenarios. We use an average BASB < 0.2 as the threshold indicating good covariate balance. In Figure S.23 of the Supplementary Materials, we present the BASB under the six simulation scenarios. The results show that our GPS matching approach also improves covariate balance in terms of the BASB for all six covariates. The results based on the absolute correlation and the BASB measures are consistent.

Fig. 1.

Fig. 1

Absolute Correlations. Each panel represents the absolute correlations for each covariate in the matched dataset (GPS estimated by a Super Learner algorithm; solid blue line); matched dataset (GPS estimated by a parametric linear regression model; solid red line); CBPS weighted dataset (solid green line) and original unadjusted dataset (dashed line) under six simulation settings under sample size N = 5000. The dotted line represents the threshold for covariate balance suggested by Zhu et al. (2015). The GPS in CBPS was calculated by directly optimizing the covariate balancing condition. Both GPS matching and CBPS weighting improve covariate balance for all six covariates in all settings.

5.3. Simulation Results

Table 1 shows the simulation results where we estimate the GPS model using parametric linear regression models assuming normal residuals, while Table S.4 of the Supplementary Materials shows results for the settings where we estimate the GPS model using Super Learner algorithms. For CBPS, the GPS was calculated by optimizing a covariate balance condition.

Table 1.

Absolute Bias and Root Mean Squared Error (RMSE). We estimate the GPS using parametric linear regression models (except for CBPS, where the GPS was calculated by optimizing the covariate balancing condition; and for DML, where the GPS was estimated by using regularized linear regression (lasso)). All results are based on S = 500 replicates

GPS generation N Matching Adjustment IPTW DR IPTW (trim) DR (trim) DML CBPS
1) N(0, 5)-distributed residuals 200 1.02 (3.87) 1.19 (3.50) 2.17 (4.08) 1.08 (3.71) 2.17 (4.08) 1.22 (3.29) * (*) 1.70 (3.78)
1000 0.54 (1.90) 1.31 (2.49) 1.97 (3.59) 0.90 (2.31) 1.98 (3.56) 0.61 (1.47) 0.09 (2.68) 1.49 (3.33)
5000 0.21 (1.29) 1.01 (1.90) 1.42 (3.23) 0.60 (1.45) 1.44 (3.16) 0.49 (0.87) 0.10 (1.34) 1.66 (3.76)
2) t(2)-distributed residuals 200 3.16 (6.98) 3.45 (42.78) * (*) * (*) 5.69 (22.27) 14.61 (108.74) * (*) 2.11 (6.43)
1000 2.20 (4.17) * (*) 85.08 (*) * (*) 7.53 (21.46) 48.29 (704.54) 11.44 (210.38) 4.75 (14.15)
5000 1.44 (2.91) * (*) * (*) * (*) * (*) 114.06 (700.18) 6.43 (71.69) 5.09 (17.69)
3) 2nd order term 200 1.41 (4.28) 1.81 (4.25) 2.58 (4.80) 2.18 (4.85) 2.59 (4.73) 1.60 (3.52) * (*) 1.67 (4.44)
1000 0.86 (2.07) 1.50 (2.71) 2.00 (4.10) 2.35 (4.73) 2.07 (3.94) 0.85 (1.70) 0.24 (2.83) 1.77 (3.75)
5000 0.55 (1.42) 1.16 (2.07) 1.51 (4.01) 2.68 (13.77) 1.55 (3.44) 0.83 (1.33) 0.27 (1.40) 1.87 (4.17)
4) logistic link 200 1.35 (4.27) 1.61 (4.13) 2.46 (4.58) 1.29 (3.59) 2.47 (4.57) 1.29 (3.39) 48.6 (*) 1.62 (4.04)
1000 0.62 (2.03) 1.71 (2.90) 1.98 (3.89) 0.89 (2.56) 2.01 (3.83) 0.63 (1.49) 0.09 (2.84) 1.49 (3.60)
5000 0.34 (1.36) 1.19 (2.11) 1.44 (3.56) 0.63 (1.49) 1.46 (3.46) 0.56 (1.01) 0.14 (1.35) 1.65 (4.21)
5) 1-logistic link 200 0.60 (3.81) 1.14 (3.09) 1.30 (3.19) 0.92 (3.55) 1.31 (3.17) 0.53 (2.71) * (*) 1.51 (3.51)
1000 0.43 (1.84) 1.48 (2.32) 1.51 (2.77) 0.83 (2.20) 1.50 (2.72) 0.35 (1.32) 0.32 (2.69) 1.61 (2.76)
5000 0.19 (1.24) 1.00 (1.63) 1.05 (2.47) 0.54 (1.26) 1.04 (2.37) 0.22 (0.75) 0.44 (1.46) 1.43 (2.84)
6) log link 200 1.26 (4.17) 3.30 (9.87) 2.68 (5.39) 2.62 (45.00) 2.74 (5.26) 0.99 (4.18) * (*) 2.91 (5.50)
1000 0.97 (2.17) 2.46 (4.18) 2.51 (4.09) 3.57 (97.76) 2.53 (4.07) 0.44 (1.80) 0.48 (3.96) 2.54 (4.04)
5000 0.62 (1.48) 2.55 (3.59) 4.06 (47.33) 10.51 (146.82) 1.97 (4.96) 0.91 (1.42) 0.54 (1.49) 2.60 (4.25)

Notes: Matching = the proposed GPS matching; Adjustment = includes GPS as covariates in an outcome model proposed in Hirano & Imbens (2004); IPTW = inverse probability of treatment weighting; DR = doubly robust proposed in Kennedy et al. (2017); DML = double machine learning proposed in Colangelo & Lee (2020); CBPS = covariate balancing propensity score proposed in Fong et al. (2018); trim = trim the stabilized weight that is larger than 10.

* represents values larger than 1000, or settings in which more than 50% of simulations fail to converge.

Under scenario 1, when the GPS model is correctly specified as a linear regression model assuming normal residuals and thus does not contain many extreme GPS values, all approaches perform reasonably well (see Table 1). The GPS matching, trimmed non-parametric DR and DML approaches, in general, outperform the GPS adjustment, IPTW, CBPS, and untrimmed non-parametric DR approaches, in terms of both absolute bias and RMSE. The GPS matching estimator provides smaller absolute bias, yet the trimmed non-parametric DR estimator provides smaller RMSE. The DML estimator performs poorly when the sample size is small (N = 200), yet performs well when the sample size is relatively large (N = 1000, 5000), though the RMSE is still larger compared to the GPS matching estimator.

Under scenario 2, when the GPS model is still linear yet includes extreme GPS values (misspecified in the residual distribution), the GPS adjustment, IPTW, CBPS, DR, and DML estimators all produce very large RMSE, and are not able to reduce confounding bias even as the sample sizes increase. There is plenty of literature suggesting stabilizing weights and trimming extreme weights under binary/categorical exposure settings (Harder et al. 2010, Crump et al. 2009, Yang et al. 2016), yet the guidelines on handling extreme weights under continuous exposure regimes are sparse. In these simulation studies, we see that the common practical guideline for trimming (capping the stabilized weights at 10 (Harder et al. 2010)) does not provide a sufficient remedy. In contrast, we found that our matching estimator has better finite sample performance in scenarios where there are extreme values of the estimated GPS, producing much more stable estimates than any of the other alternatives evaluated. Importantly, the absolute bias and RMSE of the proposed matching estimator decrease as the sample size increases. This may be because the performance of the GPS matching estimator is not driven by one or a few units with extreme GPS values.

In scenarios 3–6, when the functional forms of the GPS model are misspecified in various ways, the GPS matching approach consistently provides smaller bias and smaller RMSE compared to most of the alternatives. The performance of the GPS matching and DML estimators is comparable when the sample size is relatively large (N = 1000, 5000). These results show that when the GPS is modeled by a misspecified parametric linear regression model (which can happen in practice), matching provides notably improved performance compared to other approaches (see Table 1). This finding is aligned with results from Waernbaum (2012) under binary exposure settings, showing that when matching on a parametric model (e.g., a propensity score), the matching estimator is robust to model misspecifications.

In terms of the performance of the m-out-of-n bootstrap procedure, we found that the coverage rates are close to the nominal level under scenario 1, although slightly below the nominal level with sample size N = 5000. As expected, the coverage becomes worse in other scenarios with misspecified GPS models, especially, when the GPS model includes extreme GPS values (misspecified in the residual distribution; scenario 2) (see Section S.3.3 and Table S.23 of the Supplementary Materials). When the GPS is estimated by a Super Learner algorithm, the performance of other approaches improves for almost all scenarios (see Table S.4 of the Supplementary Materials), likely because the flexible non-parametric modeling techniques have a greater potential to effectively recover the correct form of the GPS. Still, matching outperforms the GPS adjustment, IPTW, and CBPS approaches both in terms of absolute bias and RMSE, though it is slightly less efficient than the DR estimator. We show that the use of an ensemble machine learning model for the GPS estimation has the potential to improve the robustness to GPS model misspecification in finite sample studies, given that flexible non-parametric models themselves are often less prone to model misspecifications compared to parametric models. The use of cross-fitting leads to small improvements in the performance of the DML estimator (G = 2 vs. G = 1) when the sample size is relatively large (N = 1000, 5000), but decreases the performance significantly when the sample size is small (N = 200) (see Table S.5 of the Supplementary Materials).

We conduct additional simulation studies to assess the sensitivity of the GPS matching approach to different values of the hyperparameters (δ, λ) and differing distance metrics in the matching function (see Section S.3.6 and Table S.6 of the Supplementary Materials). We found the matching estimator is relatively insensitive to the choice of distance metrics (L1 vs. L2 distance). We also compared our data-driven tuning procedure (optimized for covariate balance), which was used in the main simulations, i.e., Tables 1 and S.7, to pre-specified hyperparameters (δ, λ). We found that for some settings the matching estimator is sensitive to the choice of hyperparameters, and matching estimators based on our data-driven approach achieve small absolute bias and RMSE.

6. Data Application

The key scientific question in air pollution epidemiology studies is to assess whether, and in what magnitude, exposure to air pollution is causally linked to adverse health outcomes. We apply the GPS matching approach to a cohort of Medicare enrollees to estimate the causal ERF of long-term PM2.5 exposure on all-cause mortality. Medicare claims data, obtained from the Centers for Medicare and Medicaid Services (CMS), provide a rich data platform to conduct air pollution studies on a national scale (Di et al. 2017). To this end, we use the largest-to-date Medicare enrollee cohort across the contiguous US from 2000 to 2016. This study population includes a total of 68.5 million individuals, who reside in 31,414 zip codes across 17 years. Our unit of analysis is zip code by year. That is, for each year, we count the number of deaths among Medicare enrollees for each zip code, resulting in a total of 0.5 million units. Daily PM2.5 exposures were estimated at a 1 km × 1 km grid cell resolution using a spatio-temporal prediction model with excellent predictive accuracy (cross-validated R² = 0.86) (Di et al. 2019). To obtain the annual average PM2.5 at each zip code, we average the gridded concentrations within the boundary of each zip code and then average the daily zip code level concentrations within each year. We assign the annual average PM2.5 to the corresponding zip code for each year. The range of annual average PM2.5 from 2000 to 2016 was 0.01–30.92 μg/m3, with 1% and 99% quantiles equal to (2.76, 17.16).

Design Stage.

We estimate the GPS using an extreme gradient boosting machine (GBM) (Chen & Guestrin 2016, Zhu et al. 2015), with annual zip code-level PM2.5 exposure as the dependent variable and 19 zip code-level potential confounders as independent variables, including population demographics, Medicaid eligibility (as a surrogate for socioeconomic status), meteorological variables, a time trend (year), and a spatial trend (census region) (see Figure 2). We use an extreme GBM (i.e., a single learner within the Super Learner algorithm) to estimate the GPS model because 1) it is more flexible and achieved better covariate balance than a linear regression model on these complex data, and 2) it is computationally feasible on this large dataset. After obtaining the estimated GPS, we implement our GPS matching procedure. Specifically, we choose a pre-specified two-dimensional L1 distance metric and follow the data-driven tuning procedure described in Section 3.3 to select (δ, λ). The selected optimal caliper is δ = 0.16 (corresponding to L = 100 exposure levels), and the optimal scale parameter is λ = 1. We construct the matched dataset by collecting all imputed observations. Additional details on the grid search over the hyperparameters (δ, λ) and on the GPS model specification can be found in Sections S.4.1 and S.4.2 of the Supplementary Materials.
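As a rough illustration of this design-stage step, the sketch below fits a boosted regression of the exposure on covariates with the xgboost package (classic xgboost() interface) and converts the fit into a GPS by assuming a Gaussian density for the residuals. The Gaussian residual assumption, the tuning parameters, and the toy data are ours for illustration and do not reproduce the specification detailed in the Supplementary Materials.

```r
# Sketch only: GPS for a continuous exposure from a boosted conditional-mean
# model plus an assumed Gaussian residual density.
library(xgboost)

gps_xgb <- function(w, X, nrounds = 200) {
  fit <- xgboost(data = as.matrix(X), label = w,
                 params = list(objective = "reg:squarederror",
                               max_depth = 4, eta = 0.1),
                 nrounds = nrounds, verbose = 0)
  mu  <- predict(fit, as.matrix(X))   # fitted E[W | C]
  sig <- sd(w - mu)                   # residual scale
  dnorm(w, mean = mu, sd = sig)       # estimated GPS at the observed (w, C)
}

# Toy data standing in for the zip code-by-year units.
set.seed(2)
n <- 2000
X <- matrix(rnorm(n * 5), n, 5, dimnames = list(NULL, paste0("c", 1:5)))
w <- 10 + 2 * X[, 1] + X[, 2]^2 + rnorm(n)
gps_hat <- gps_xgb(w, X)
```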

Fig. 2.

Absolute Correlations. The figure shows the absolute correlation for each covariate in the matched dataset (solid line) and in the original, unadjusted dataset (dashed line). The dotted line represents the covariate balance cut-off suggested by Zhu et al. (2015). The average absolute correlation is 0.19 before matching and is minimized at 0.04 after matching, when we use caliper δ = 0.16 (i.e., L = 100 exposure levels) and scale parameter λ = 1. Importantly, the time trend (year) is strongly imbalanced before matching yet is balanced after matching.

We assess covariate balance by calculating the absolute correlation for each potential confounder, as discussed in Section 3.2. We specify an average absolute correlation below 0.1 as the threshold for covariate balance. The GPS matching implementation substantially improves covariate balance for 16 of the 19 potential confounders. The average absolute correlation is 0.19 before matching, whereas it is 0.04 after matching (see Figure 2).
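The balance diagnostic itself is simple to compute. Below is a minimal R sketch using Pearson correlations between the exposure and each covariate before and after matching; the specific correlation measure, any variable transformations, and the object names are illustrative rather than the exact diagnostic used in the package.

```r
# Sketch only: exposure-covariate absolute correlations before and after matching.
abs_cor <- function(w, X) apply(as.matrix(X), 2, function(x) abs(cor(w, x)))

balance_table <- function(w_before, X_before, w_after, X_after, cutoff = 0.1) {
  nm <- colnames(as.matrix(X_before))
  if (is.null(nm)) nm <- paste0("c", seq_len(ncol(as.matrix(X_before))))
  out <- data.frame(covariate = nm,
                    before    = abs_cor(w_before, X_before),
                    after     = abs_cor(w_after,  X_after))
  out$balanced <- out$after < cutoff
  out
}
# The data-driven tuning procedure selects (delta, lambda) that minimize
# mean(out$after), the average absolute correlation in the matched dataset.
```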

Analysis Stage.

After obtaining the matched dataset, we fit a kernel smoother with Gaussian kernels to the matched dataset to estimate the causal ERF relating long-term PM2.5 levels to the all-cause mortality rate. We construct a point-wise Wald 95% confidence band for the ERF using the m-out-of-n bootstrap procedure, implemented as a block bootstrap with zip codes as the block units; the “block” structure of the bootstrap thereby accounts for the correlation among observations from the same zip code across different years. We recalculate the GPS and refit the outcome model in each bootstrap replicate so that the bootstrap procedure jointly accounts for the variability associated with both the GPS and outcome model estimation. After fitting the kernel smoother, to avoid extrapolation at the boundaries of the support, we trim the highest 1% and lowest 1% of PM2.5 exposures from the ERF, consistent with Liu et al. (2019) and Di et al. (2017).
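The analysis stage can be sketched in R as follows, using stats::ksmooth for the Gaussian kernel smoother and an m-out-of-n block bootstrap over zip codes on a toy matched dataset. For brevity the sketch does not redo the GPS estimation and matching inside each replicate (which the actual procedure does), and the bandwidth, the choice of m, and the number of replicates are illustrative.

```r
# Toy matched dataset: zip code-by-year units with exposure w and mortality rate y.
set.seed(3)
md <- data.frame(zip = rep(sprintf("z%03d", 1:200), each = 17),
                 w   = runif(200 * 17, 3, 17))
md$y <- 0.040 + 0.0010 * md$w + rnorm(nrow(md), sd = 0.005)

erf_hat <- function(w, y, grid, bw)
  ksmooth(w, y, kernel = "normal", bandwidth = bw, x.points = grid)$y

# Trim to the 1%-99% exposure range to avoid boundary extrapolation.
grid  <- seq(quantile(md$w, 0.01), quantile(md$w, 0.99), length.out = 100)
point <- erf_hat(md$w, md$y, grid, bw = 1)

# m-out-of-n block bootstrap: resample m (< n) zip codes with replacement.
zips <- unique(md$zip)
m    <- floor(length(zips)^0.8)
boot <- replicate(200, {
  zb <- sample(zips, m, replace = TRUE)
  db <- do.call(rbind, lapply(zb, function(z) md[md$zip == z, ]))
  erf_hat(db$w, db$y, grid, bw = 1)
})
se   <- apply(boot, 1, sd) * sqrt(m / length(zips))  # rescale from m back to n
band <- cbind(lower = point - 1.96 * se, upper = point + 1.96 * se)

# Hazard-ratio scale: divide by the estimated rate at the lowest (1% quantile) level.
hr <- point / point[1]
```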

Figure 3 shows the average causal ERF on the mortality-rate scale (left panel) and its transformation to the hazard-ratio scale (right panel). For the hazard ratio, we define the baseline rate as the estimated average mortality rate at an exposure level equal to the 1% quantile of PM2.5 exposures (i.e., 2.76 μg/m3). To our knowledge, this is the first exposure-response curve assessing the effects of long-term PM2.5 exposure on all-cause mortality using a causal inference approach that accounts for measured confounders, and it provides strong evidence of a causal link. We find a consistently harmful effect of long-term PM2.5 exposure on mortality across the range of annual average PM2.5 (2.76–17.16 μg/m3) for the entire dataset of Medicare enrollees from 2000 to 2016 across the continental US. Importantly, the curve is steeper at exposure levels below the current national standard (annual average ≤ 12 μg/m3), indicating aggravated harmful effects even at exposure levels below the national standard.

Fig. 3.

The causal ERF relating all-cause mortality to long-term PM2.5 exposure. The left panel presents the smoothed causal ERF on the mortality-rate scale, obtained by a kernel smoother with the optimal bandwidth (solid line), and its point-wise Wald 95% confidence band calculated by the m-out-of-n bootstrap (dashed lines). The right panel presents the smoothed curve on the hazard-ratio scale with its point-wise Wald 95% confidence band. The GPS was estimated using an extreme GBM (Chen & Guestrin 2016).

We also implement the non-parametric DR approach proposed by Kennedy et al. (2017) on the same observational dataset. The GPS matching estimator and the non-parametric DR estimator yield exposure-response curves with similar shapes (see Section S.4.3 of the Supplementary Materials). The data analysis took approximately 6 hours to complete using two computing clusters with 64 CPU cores and 500 GB of memory. The computational effort is detailed in Section S.6 of the Supplementary Materials.

7. Discussion

We have developed a GPS matching approach for estimating causal ERFs. Our proposed approach fills an important gap in the literature as it provides a theoretically justified generalization of matching to the context of a continuous exposure. We demonstrate that, 1) under local weak unconfoundedness and the other identifiability assumptions, 2) when the GPS is consistently estimated by a parametric model, and 3) when the caliper δ is well chosen, the GPS matching estimator attains point-wise (Nδ)^{1/2}-consistency and asymptotic normality.

When the GPS is estimated by non-parametric machine learning models, the asymptotic properties described in Section 4 are not guaranteed, since the asymptotic properties of most machine learning algorithms are unknown. DML methods with cross-fitting via sample splitting are becoming increasingly popular in causal inference (Chernozhukov et al. 2018), including recent extensions to continuous exposure settings (Colangelo & Lee 2020, Semenova & Chernozhukov 2021). DML methods construct pathwise differentiable functionals that depend on the GPS and outcome models (as nuisance parameters), and rely on cross-fitting via sample splitting to allow non-parametric machine learning models for nuisance parameter estimation. Yet these methods require pathwise differentiability of the target functional (i.e., it must be possible to compute the functional’s pathwise derivative) (Kennedy 2022), whereas the GPS matching estimator is a non-pathwise-differentiable functional of the distribution of the GPS, which makes it difficult to establish an asymptotic approximation to the distribution of matching estimators with a non-parametrically estimated GPS (Abadie & Imbens 2016). We conjecture that if the non-parametric model for the GPS estimation converges faster than the residuals of the matching estimator (which converge at the rate of (Nδ)^{−1/2}), the estimation error of the GPS is negligible compared to the error created by the matching residuals and thus would not affect the asymptotic properties of the matching estimator. A formal analysis of the asymptotic properties of the matching estimator when the GPS is estimated by non-parametric machine learning models remains an open area of research.
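In symbols, the conjecture can be stated as a sufficient condition on the uniform estimation error of the GPS. The display below is schematic and the notation is ours (ê denotes the estimated GPS and e the true GPS); the precise regularity conditions are beyond this sketch.

```latex
% Schematic sufficient condition: the GPS estimation error is first-order
% negligible if it vanishes faster than the matching residuals, whose rate
% is (N delta)^{-1/2}.
\[
  \sup_{w,\,\mathbf{c}} \bigl|\hat{e}(w,\mathbf{c}) - e(w,\mathbf{c})\bigr|
  \;=\; o_p\!\bigl((N\delta)^{-1/2}\bigr).
\]
```

Under such a condition, the first-order asymptotic behavior of the matching estimator would be unchanged relative to the case in which the GPS is known or estimated parametrically.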

While the asymptotic properties of the GPS matching estimator rely on a more stringent theoretical condition than those of the DML estimator, the GPS matching methods have several advantages, outlined below. First, like many other matching methods in binary exposure settings, the GPS matching methods, in the context of a continuous exposure, also separate the design stage from the analysis stage. The design stage does not involve any outcome information, and a careful design can improve the objectivity of the outcome analysis (Rubin et al. 2008).

Second, the proposed GPS matching methods are robust to both GPS and outcome model misspecification, according to the empirical evidence from the simulation results in Section 5. In observational studies, neither the GPS model nor the outcome model is known. Although DR and DML methods produce consistent estimates as long as either the GPS model or the outcome model is consistently estimated, both models might be misspecified in practice. Regarding the propensity score model, Dehejia & Wahba (1999) and Waernbaum (2012) point out that when a misspecified propensity score model still constitutes a balancing score (Rosenbaum & Rubin 1983), or belongs to a larger class of covariate scores (Waernbaum 2012), the matching estimator remains consistent. This property, if it extends to the GPS model, implies that matching is robust to GPS model misspecification whenever the misspecified GPS model belongs to a class of balancing or covariate scores; there are therefore multiple routes by which the matching estimator can deliver reliable inference, which highlights its robustness to GPS model misspecification. Moreover, the GPS model can be selected based on measures of covariate balance in the design stage of the GPS matching approach, a practice that provides a safeguard against GPS model misspecification (Abadie & Imbens 2016). Regarding the outcome model, matching provides non-parametric preprocessing that reduces outcome model dependence and offers the promise of causal inference with fewer assumptions (Ho et al. 2007). The matching step reduces the dependence between the exposure and potential confounders, so estimates of causal effects are less dependent on outcome modeling choices; when the data allow proper matches, causal estimates are robust to different modeling assumptions in the outcome analysis (Ho et al. 2007). Matching is also more robust than weighting to the presence of extreme values of the estimated GPS. For weighting approaches, if the GPS value for a unit is 0.001, the unit is assigned a weight of 1000. Such extreme weights are likely to dramatically inflate the variance of the weighting estimator, and small changes in the GPS estimates (e.g., from 0.001 to 0.0001) may produce huge changes in the causal estimates. Although methods to stabilize and trim large weights exist, we found in simulations that trimmed/stabilized weighting estimators improve only moderately and do not perform as well as the proposed GPS matching estimator. In contrast, for matching, if there is another unit j with an exposure similar to unit j′’s exposure and an estimated GPS value close to j′’s GPS value of 0.001, we simply match unit j′ to unit j. Ultimately, the performance of the matching estimator is not driven by one or a few units with very extreme weights. Moreover, matching depends only on the relative distance between unit j′ and unit j in terms of GPS values and exposure levels (“nearest neighbor”), so small changes in the GPS estimates are unlikely to change the matches dramatically and are therefore also unlikely to affect the causal estimates. In a comprehensive set of simulation studies, we found that the GPS matching approach consistently performs well in finite samples under settings with extreme estimated GPS values or a misspecified GPS model. Formal theoretical analyses establishing the robustness of the proposed matching methods to GPS and outcome model misspecification are the subject of future work.
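The contrast between inverse weighting and nearest-neighbor matching under extreme GPS values can be seen in a short numerical illustration (toy numbers, not from the data):

```r
# Toy illustration: a small perturbation of one tiny GPS value changes its
# inverse weight dramatically, while the identity of the nearest neighbour
# in GPS space is unchanged.
gps      <- c(0.30, 0.25, 0.20, 0.001)
gps_pert <- c(0.30, 0.25, 0.20, 0.0001)

1 / gps[4]        # weight = 1000
1 / gps_pert[4]   # weight = 10000: a 10-fold jump from a tiny GPS change

# Matching depends only on relative distances, e.g. to a target gps* = 0.0012:
target <- 0.0012
which.min(abs(gps - target))        # unit 4 is matched
which.min(abs(gps_pert - target))   # still unit 4 after the perturbation
```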

Third, the GPS matching methods keep the unit of analysis intact and create an actual matched dataset (often referred to as hot deck imputation in the literature). In contrast, with weighting it can be challenging to interpret what it means when, for example, a subject receives a weight of 1.3 (Stuart et al. 2020). Matching also shares with weighting the advantage of allowing extensive diagnostics (e.g., covariate balance assessments) without invalidating the outcome analyses. In the GPS matching approach, we proposed two assessments of covariate balance (i.e., absolute correlations and BASB). Such easy-to-implement covariate balance assessments are often not straightforward for model-based GPS adjustment or DR approaches (Greifer & Stuart 2021). Under matching, in addition to covariate balance assessments, researchers can conduct additional diagnostics, including data visualization, to assess the robustness of their results, since an actual matched set is readily available. Based on the matched set, other distributional causal estimands, e.g., quantile causal effects, can be estimated in addition to population average causal effects (see Section 18.1 of Imbens & Rubin (2015)). Such extensions are often not straightforward for existing causal inference methods for continuous exposures.
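As an example of such a distributional estimand, the sketch below computes a local quantile of the outcome at a given exposure level directly from a matched dataset with exposure w and outcome y (such as the toy md in the earlier analysis-stage sketch); the window half-width, the quantile level, and the simple sample-quantile estimator are illustrative choices of ours.

```r
# Sketch only: a quantile of the outcome distribution at exposure level w0,
# computed from matched units whose exposures fall within w0 +/- h.
local_quantile <- function(w, y, w0, tau = 0.9, h = 0.5)
  quantile(y[abs(w - w0) <= h], probs = tau, names = FALSE)

# e.g., the 90th percentile of the matched mortality rates near 8 and 12 ug/m3
local_quantile(md$w, md$y, w0 = 8)
local_quantile(md$w, md$y, w0 = 12)
```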

Still, there are several areas for future development. The GPS matching approach relies on four main assumptions: 1) consistency, 2) overlap, 3) local weak unconfoundedness, and 4) smoothness. The consistency assumption is fundamental to the classical potential outcomes framework. Recent literature (Tchetgen & VanderWeele 2012) has begun to relax it by allowing interference, yet future work is needed to extend these relaxations to (generalized) propensity score-based analyses. The overlap assumption is another fundamental assumption for the validity of most causal inference methods. Under binary or categorical exposure settings, investigators widely use diagnostic plots to check overlap (Wu et al. 2019) and trimming to ensure it (Crump et al. 2009, Harder et al. 2010, Yang et al. 2016). Under continuous exposure settings, however, overlap is defined through a probability density function, so it is conceptually difficult to check directly in finite samples. One potential way to check overlap in this setting is to categorize the continuous exposure and check/ensure overlap among the categories using standard approaches developed for categorical exposures (Yang et al. 2016, Wu et al. 2019), yet no current approach can directly verify overlap on a continuous scale. Future work is needed to develop rigorous approaches to check and ensure overlap in continuous exposure settings.
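One crude diagnostic in the spirit of the categorization strategy mentioned above is sketched below: coarsen the exposure into quantile-based categories and visually compare the distribution of the estimated GPS (or of key covariates) across categories. The number of categories and the use of a boxplot are illustrative choices, reusing the toy w and gps_hat from the earlier GPS sketch.

```r
# Sketch only: coarsen the continuous exposure and compare the estimated GPS
# distribution across exposure categories as a rough overlap check.
w_cat <- cut(w, breaks = quantile(w, probs = seq(0, 1, by = 0.2)),
             include.lowest = TRUE)
boxplot(gps_hat ~ w_cat,
        xlab = "exposure quintile", ylab = "estimated GPS",
        main = "Overlap check via exposure categorization")
```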

We introduced the local weak unconfoundedness assumption, which is less stringent than the common weak unconfoundedness assumption, though it remains unverifiable since the data are always uninformative about the distribution of the counterfactual outcomes. In addition, as with other (generalized) propensity score-based approaches, our approach does not resolve potential bias due to unmeasured confounding, in which case the unconfoundedness assumption is violated. The caliper δ in the assumption has important theoretical and practical implications: by choosing a suitable caliper δ that depends on the sample size, under local weak unconfoundedness we identify the point at which the proposed matching estimator achieves the desired asymptotic properties. The smoothness assumption is essentially the standard Lipschitz continuity condition imposed in non-parametric regression problems and has been used in models with counterfactual outcomes (Kim et al. 2018). In the smoothed matching estimator, we also require the rate of smoothing, i.e., the bandwidth, to satisfy h = o(N^{-1/3}), to ensure that the bias from the matching discrepancy is asymptotically negligible and that the original and smoothed matching estimators share similar asymptotic normal distributions. In finite samples, both the caliper δ and the bandwidth h are treated as tuning parameters and selected via data-driven approaches. The focus of this paper is not to find a non-parametric estimator with the sharpest rate of convergence; thus, we obtain an asymptotically unbiased matching estimator via under-smoothing (Wand & Jones 1994). A natural extension is to generalize the bias-corrected matching estimator proposed in Abadie & Imbens (2011) to our non-parametric setting, which has the potential to yield sharper results on the rate of convergence. Our theoretical results for the matching estimator rely on the true GPS or on a GPS estimated by a parametric model; in future work, it would be worthwhile to explore the asymptotic properties of the matching estimator when the GPS is estimated non-parametrically. Both the bootstrap procedure and the theoretical results of this paper are developed point-wise for a given exposure level. It would be useful to develop inference procedures that quantify the uncertainty of the entire exposure-response curve via simultaneous confidence bands, and to derive uniform consistency and weak convergence of the matching estimator.

We applied the GPS matching approach to estimate the causal relationship between long-term PM2.5 exposure and all-cause mortality in a massive Medicare administrative cohort, and found strong evidence of a positive and near-linear causal ERF. Some previous air pollution studies have used propensity score-based analyses; however, researchers often dichotomize or categorize continuous exposures in order to apply propensity score methods (Wu et al. 2019). The GPS matching approach introduced in this paper is the first matching approach that allows for the estimation of a causal ERF for a continuous exposure together with the assessment of covariate balance. Computational feasibility is an important consideration: the proposed matching-with-replacement procedure eases the computational burden (Imbens & Rubin 2015) and can exploit parallel computing to accelerate the matching procedure.

Finally, the GPS matching approach can be used in any field where interventions (termed exposures or treatments in different settings) are continuous. Environmental health research is one application area in which many exposures are naturally continuous, e.g., air pollution, temperature, and ultraviolet radiation. We anticipate that the simplicity and generality of our matching framework will promote awareness of causal inference in future science- and policy-relevant research across many application areas, including the social sciences, economics, and many subfields of public health.

Supplementary Material

Supp 1

Acknowledgement

The authors are grateful to Agnese Panzera, Xihao Li, Boyu Ren, Naeem Khoshnevis, Ying-Ying Lee, Junwei Lu, Ziyang Wei, Jose R. Zubizarreta and Elizabeth A. Stuart for helpful discussions. We thank the referees for their thoughtful comments. Funding was provided by the HEI grant 4953-RFA14-3/16-4, US EPA grant 83587201-0, NIH grants R01 ES026217, R01 MD012769, R01 ES028033, R01 ES028033-S1, 1R01 ES030616, 1R01 AG066793-01R01, 3R01 AG066793-02S1, 1R01 ES029950, 1RF1 AG071024, 1RF1 AG074372-01A1, Alfred P. Sloan Foundation grant G-2020-13946, Harvard University Climate Change Solutions Fund, Fernholz Foundation, Department of Excellence 2018–2022 grant by Italian Ministry of University and Research (MUR).

References

  1. Abadie A & Imbens GW (2006), ‘Large sample properties of matching estimators for average treatment effects’, Econometrica 74(1), 235–267. [Google Scholar]
  2. Abadie A & Imbens GW (2011), ‘Bias-corrected matching estimators for average treatment effects’, Journal of Business & Economic Statistics 29(1), 1–11. [Google Scholar]
  3. Abadie A & Imbens GW (2016), ‘Matching on the estimated propensity score’, Econometrica 84(2), 781–807. [Google Scholar]
  4. Austin PC (2018), ‘Assessing covariate balance when using the generalized propensity score with quantitative or continuous exposures’, Statistical Methods in Medical Research p. 0962280218756159. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Bang H & Robins JM (2005), ‘Doubly robust estimation in missing data and causal inference models’, Biometrics 61(4), 962–973. [DOI] [PubMed] [Google Scholar]
  6. Cao W, Tsiatis AA & Davidian M (2009), ‘Improving efficiency and robustness of the doubly robust estimator for a population mean with incomplete data’, Biometrika 96(3), 723–734. [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Chattopadhyay A, Hase CH & Zubizarreta JR (2020), ‘Balancing vs modeling approaches to weighting in practice’, Statistics in Medicine 39(24), 3227–3254. [DOI] [PubMed] [Google Scholar]
  8. Chen T & Guestrin C (2016), Xgboost: A scalable tree boosting system, in ‘Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining’, ACM, pp. 785–794. [Google Scholar]
  9. Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C, Newey W and Robins J (2018), ‘Double/debiased machine learning for treatment and structural parameters’.
  10. Colangelo K & Lee Y-Y (2020), ‘Double debiased machine learning nonparametric inference with continuous treatments’, arXiv preprint arXiv:2004.03036 [Google Scholar]
  11. Crump RK, Hotz VJ, Imbens GW & Mitnik OA (2009), ‘Dealing with limited overlap in estimation of average treatment effects’, Biometrika 96(1), 187–199. [Google Scholar]
  12. Dehejia RH & Wahba S (1999), ‘Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs’, Journal of the American statistical Association 94(448), 1053–1062. [Google Scholar]
  13. Di Q, Amini H, Shi L, Kloog I, Silvern R, Kelly J, Sabath MB, Choirat C, Koutrakis P, Lyapustin A et al. (2019), ‘An ensemble-based model of pm2. 5 concentration across the contiguous united states with high spatiotemporal resolution’, Environment International 130, 104909. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Di Q, Wang Y, Zanobetti A, Wang Y, Koutrakis P, Choirat C, Dominici F & Schwartz JD (2017), ‘Air pollution and mortality in the medicare population’, New England Journal of Medicine 376(26), 2513–2522. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Flores CA et al. (2007), ‘Estimation of dose-response functions and optimal doses with a continuous treatment’, University of Miami. Typescript. [Google Scholar]
  16. Fong C, Hazlett C, Imai K et al. (2018), ‘Covariate balancing propensity score for a continuous treatment: Application to the efficacy of political advertisements’, The Annals of Applied Statistics 12(1), 156–177. [Google Scholar]
  17. Galvao AF & Wang L (2015), ‘Uniformly semiparametric efficient estimation of treatment effects with a continuous treatment’, Journal of the American Statistical Association 110(512), 1528–1542. [Google Scholar]
  18. Goldman GT & Dominici F (2019), ‘Don’t abandon evidence and process on air pollution policy’, Science 363(6434), 1398–1400. [DOI] [PubMed] [Google Scholar]
  19. Greifer N & Stuart EA (2021), ‘Matching methods for confounder adjustment: An addition to the epidemiologist’s toolbox’, Epidemiologic Reviews. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Harder VS, Stuart EA & Anthony JC (2010), ‘Propensity score techniques and the assessment of measured covariate balance to test causal associations in psychological research.’, Psychological Methods 15(3), 234. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Heller G (2007), ‘Smoothed rank regression with censored data’, Journal of the American Statistical Association 102(478), 552–559. [Google Scholar]
  22. Hirano K & Imbens GW (2004), ‘The propensity score with continuous treatments’, Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives 226164, 73–84. [Google Scholar]
  23. Ho DE, Imai K, King G & Stuart EA (2007), ‘Matching as nonparametric preprocessing for reducing model dependence in parametric causal inference’, Political Analysis 15(3), 199–236. [Google Scholar]
  24. Imai K, King G, Stuart EA et al. (2008), ‘Misunderstandings between experimentalists and observationalists about causal inference’, Journal of the Royal Statistical Society Series A 171(2), 481–502. [Google Scholar]
  25. Imai K & Van Dyk DA (2004), ‘Causal inference with general treatment regimes: Generalizing the propensity score’, Journal of the American Statistical Association 99(467), 854–866. [Google Scholar]
  26. Imbens GW (2000), ‘The role of the propensity score in estimating dose-response functions.’, Biometrika 87(3). [Google Scholar]
  27. Imbens GW & Rubin DB (2015), Causal inference in statistics, social, and biomedical sciences, Cambridge University Press. [Google Scholar]
  28. Joffe MM & Rosenbaum PR (1999), ‘Invited commentary: propensity scores’, American Journal of Epidemiology 150(4), 327–333. [DOI] [PubMed] [Google Scholar]
  29. Kang JD & Schafer JL (2007), ‘Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data’, Statistical Science 22(4), 523–539. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Kennedy EH (2022), ‘Semiparametric doubly robust targeted double machine learning: a review’, arXiv preprint arXiv:2203.06469 [Google Scholar]
  31. Kennedy EH, Ma Z, McHugh MD & Small DS (2017), ‘Non-parametric methods for doubly robust estimation of continuous treatment effects’, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79(4), 1229–1245. [DOI] [PMC free article] [PubMed] [Google Scholar]
  32. Kim W, Kwon K, Kwon S & Lee S (2018), ‘The identification power of smoothness assumptions in models with counterfactual outcomes’, Quantitative Economics 9(2), 617–642. [Google Scholar]
  33. Lechner M (2001), Identification and estimation of causal effects of multiple treatments under the conditional independence assumption, in ‘Econometric evaluation of labour market policies’, Springer, pp. 43–58. [Google Scholar]
  34. Liu C, Chen R, Sera F, Vicedo-Cabrera AM, Guo Y, Tong S, Coelho MS, Saldiva PH, Lavigne E, Matus P et al. (2019), ‘Ambient particulate air pollution and daily mortality in 652 cities’, New England Journal of Medicine 381(8), 705–715. [DOI] [PMC free article] [PubMed] [Google Scholar]
  35. McCaffrey DF, Ridgeway G & Morral AR (2004), ‘Propensity score estimation with boosted regression for evaluating causal effects in observational studies.’, Psychological Methods 9(4), 403. [DOI] [PubMed] [Google Scholar]
  36. Robins JM, Hernan MA & Brumback B (2000), ‘Marginal structural models and causal inference in epidemiology’. [DOI] [PubMed]
  37. Robins JM & Rotnitzky A (2001), ‘Inference for semiparametric models: Some questions and an answer-comments’.
  38. Robins J, Sued M, Lei-Gomez Q & Rotnitzky A (2007), ‘Comment: Performance of double-robust estimators when” inverse probability” weights are highly variable’, Statistical Science 22(4), 544–559. [Google Scholar]
  39. Rosenbaum PR (2020), ‘Modern algorithms for matching in observational studies’, Annual Review of Statistics and Its Application 7. [Google Scholar]
  40. Rosenbaum PR & Rubin DB (1983), ‘The central role of the propensity score in observational studies for causal effects’, Biometrika pp. 41–55. [Google Scholar]
  41. Rubin DB (1974), ‘Estimating causal effects of treatments in randomized and nonrandomized studies.’, Journal of Educational Psychology 66(5), 688. [Google Scholar]
  42. Rubin DB et al. (2008), ‘For objective causal inference, design trumps analysis’, The Annals of Applied Statistics 2(3), 808–840. [Google Scholar]
  43. Semenova V & Chernozhukov V (2021), ‘Debiased machine learning of conditional average treatment effects and other causal functions’, The Econometrics Journal 24(2), 264–289. [Google Scholar]
  44. Stuart EA, Ackerman B et al. (2020), ‘Commentary on yu et al.: Opportunities and challenges for matching methods in large databases’, Statistical Science 35(3), 367–370. [Google Scholar]
  45. Tchetgen EJT & VanderWeele TJ (2012), ‘On causal inference in the presence of interference’, Statistical Methods in Medical Research 21(1), 55–75. [DOI] [PMC free article] [PubMed] [Google Scholar]
  46. Tübbicke S (2022), ‘Entropy balancing for continuous treatments’, Journal of Econometric Methods 11(1), 71–89. [Google Scholar]
  47. Van der Laan MJ, Polley EC & Hubbard AE (2007), ‘Super learner’, Statistical Applications in Genetics and Molecular Biology 6(1). [DOI] [PubMed] [Google Scholar]
  48. Vegetabile BG, Griffin BA, Coffman DL, Cefalu M & McCaffrey DF (2020), ‘Nonparametric estimation of population average dose-response curves using entropy balancing weights for continuous exposures’, arXiv preprint arXiv:2003.02938 [DOI] [PMC free article] [PubMed] [Google Scholar]
  49. Waernbaum I (2012), ‘Model misspecification and robustness in causal inference: comparing matching with doubly robust estimation’, Statistics in Medicine 31(15), 1572–1581. [DOI] [PubMed] [Google Scholar]
  50. Wand MP & Jones MC (1994), Kernel smoothing, Chapman and Hall/CRC. [Google Scholar]
  51. Wu X, Braun D, Kioumourtzoglou M-A, Choirat C, Di Q, Dominici F et al. (2019), ‘Causal inference in the context of an error prone exposure: air pollution and mortality’, The Annals of Applied Statistics 13(1), 520–547. [DOI] [PMC free article] [PubMed] [Google Scholar]
  52. Yang S, Imbens GW, Cui Z, Faries DE & Kadziola Z (2016), ‘Propensity score matching and subclassification in observational studies with multi-level treatments’, Biometrics 72(4), 1055–1065. [DOI] [PubMed] [Google Scholar]
  53. Yiu S & Su L (2018), ‘Covariate association eliminating weights: a unified weighting framework for causal effect estimation’, Biometrika 105(3), 709–722. [DOI] [PMC free article] [PubMed] [Google Scholar]
  54. Zhang B, Mackay EJ & Baiocchi M (2020), ‘Statistical matching and subclassification with a continuous dose: characterization, algorithms, and inference’, arXiv preprint arXiv:2012.07182 [Google Scholar]
  55. Zhu Y, Coffman DL & Ghosh D (2015), ‘A boosting algorithm for estimating generalized propensity scores with continuous treatments’, Journal of Causal Inference 3(1), 25–40. [DOI] [PMC free article] [PubMed] [Google Scholar]
  56. Zubizarreta JR (2012), ‘Using mixed integer programming for matching in an observational study of kidney failure after surgery’, Journal of the American Statistical Association 107(500), 1360–1371. [Google Scholar]
