Stat Methods Med Res. 2023 Jan 24;32(4):691–711. doi: 10.1177/09622802221146308

Estimation of the average treatment effect with variable selection and measurement error simultaneously addressed for potential confounders

Grace Y Yi 1,2, Li-Pang Chen 1,3
PMCID: PMC10119903  PMID: 36694932

Abstract

In the framework of causal inference, the inverse probability weighting estimation method and its variants have been commonly employed to estimate the average treatment effect. Such methods, however, are challenged by the presence of irrelevant pre-treatment variables and measurement error. Ignoring these features and naively applying the usual inverse probability weighting estimation procedures may typically yield biased inference results. In this article, we develop an inference method for estimating the average treatment effect with those features taken into account. We establish theoretical properties for the resulting estimator and carry out numerical studies to assess the finite sample performance of the proposed estimator.

Keywords: Causal inference, inverse probability weight, measurement error, propensity score, simulation–extrapolation, variable selection

1. Introduction

Causal inference offers an important paradigm for answering a multitude of questions arising from a broad variety of areas such as healthcare, epidemiological studies, and social sciences. With observational studies, various estimation methods have been developed based on propensity scores, where the propensity score is defined as the conditional probability for an individual to receive a treatment, given pre-treatment confounders. 1 In particular, the inverse probability weighting estimation methods are widely used due to their easy implementation and transparent interpretation (e.g. Rosenbaum and Rubin, 2 Lunceford and Davidian, 3 and Bang and Robins 4 ). They adjust for the effects of measured confounders by re-weighting the data as if the weighted data were collected from randomized controlled trials.

The validity of those methods requires the treatment model to be correctly specified so that the propensity scores can be consistently estimated. To accommodate possible misspecification of the treatment model, Bang and Robins 4 developed a doubly robust estimation method. This protection, nevertheless, does not come for free: it comes at the price of correctly specifying the outcome model to characterize the relationship between the outcome and the confounders.

If correctly specifying the outcome model is not possible, postulating a feasible treatment model becomes necessary, which basically requires no omission of relevant pre-treatment confounders. Furthermore, to ensure that the no-unmeasured-confounders assumption is reasonable, one may tend to include as many variables as possible when building a treatment model. However, naively including all available variables in the treatment model would degrade inference results or even yield erroneous conclusions. In applications, some collected variables are unimportant in explaining the treatment variable, yet it is often unclear which variables should or should not be included in the treatment model, and we often rely on subject-matter knowledge to decide which variables are to be used for building a treatment model (e.g. Westreich et al. 5 ). It is therefore desirable to develop an analytical variable selection procedure for building a suitable treatment model.

Utilizing variable selection techniques in causal inference has recently attracted interest. For example, Shortreed and Ertefaie 6 proposed the outcome-adaptive lasso method to select covariates associated with outcomes. Ertefaie et al. 7 developed a penalized objective function employing both the outcome and treatment models to perform variable selection. Assuming spike-and-slab priors for the covariate coefficients, Koch et al. 8 explored a Bayesian method to estimate causal effects with the outcome and treatment models employed simultaneously. Ghosh et al. 9 proposed the “multiply impute, then select” approach by employing the lasso method. Vansteelandt et al. 10 utilized the focused information criterion to select important confounders.

While those methods are useful for formulating treatment models, their development hinges on an implicit but subtle condition that the variables need to be precisely measured. This assumption, however, is commonly violated in applications. Measurement error arises inevitably and ubiquitously for various reasons, such as the impossibility of measuring a long-term average quantity, inevitable recall bias in answering a questionnaire, the unaffordability of precise measurements, unwillingness to answer sensitive questions, and so on. Causal inference with measurement error has attracted extensive attention (e.g. Imai and Yamamoto 11 and Edwards et al. 12 ). For example, McCaffrey et al. 13 and Shu and Yi 14,15 proposed methods for adjusting for measurement error effects in causal inference. To ameliorate measurement error effects, Kyle et al. 16 explored the simulation–extrapolation (SIMEX) method for marginal structural models with time-varying covariates. These methods, however, do not consider variable selection to exclude unimportant pre-treatment variables.

In this article, we consider inverse probability weighting estimation in the presence of measurement error and unimportant pre-treatment variables. We focus on the circumstance where no prior knowledge is available to guide us to select relevant variables for building the treatment model and we rely on using variable selection techniques to do so. To provide valid estimation results, we propose a simulation-based estimation method for the average treatment effect. Variable selection for building the treatment model and accommodation of measurement error effects are simultaneously conducted in inferential procedures. Theoretical results for the proposed method are established rigorously. Our method has a broad scope of applications. It does not require modeling of the outcome process to specify the relationship of the outcome variable with confounders. The proposed method applies to any parametric models used to describe the treatment model. The implementation procedure is straightforward.

The rest of this article is organized as follows. Section 2 introduces the notation and inference framework. Section 3 reports the proposed method together with theoretical justifications. Simulation studies for the proposed method are included in Section 4 and an application of the proposed method is described in Section 5. Discussions and extensions are presented in the last section.

2. Notation and framework

For an individual, let T be the observed binary treatment variable, with T=1 if treated and T=0 if untreated. Let $Y(1)$ denote the potential outcome that would have been observed had the subject been treated, and let $Y(0)$ represent the potential outcome that would have been observed had the subject been untreated. We are interested in estimating the average treatment effect (ATE), $\tau_0 \equiv E\{Y(1)\} - E\{Y(0)\}$. While T is generically termed a treatment indicator here, its practical meaning varies in applications. For example, T can represent the exposure to a certain condition, as is the case in the example presented in Section 5.

Let Y represent the observed outcome for an individual, which is assumed to be linked with the potential outcomes via $Y = TY(1) + (1-T)Y(0)$. This assumption, called consistency, basically says that the potential outcome under the observed treatment status equals the observed outcome.

In a randomized trial, the treatment indicator T is determined randomly, so the potential outcomes $\{Y(1), Y(0)\}$ and T are independent. Consequently, the sample averages of the response measurements for the treated and untreated groups can be directly used to estimate $\tau_0$ by taking their difference. However, in a non-randomized trial or an observational study, the relationship $E\{Y(j)\} = E\{Y(j) \mid T=j\}$ no longer holds for $j = 0, 1$, due to the presence of confounders that are not controlled between the treated and untreated groups.

Let W be the vector of pre-treatment covariates or confounders. We assume that W contains all confounders associated with both the potential outcomes and the treatment, and that only some components of W are predictive of T. That is, W can be written as $W = (W_{\mathrm{I}}^T, W_{\mathrm{II}}^T)^T$ so that $\{Y(1), Y(0)\}$ and T are conditionally independent, given W; and $P(T=1 \mid W) = P(T=1 \mid W_{\mathrm{I}})$. In other words, $W_{\mathrm{I}}$ includes the variables important for predicting T, whereas $W_{\mathrm{II}}$ contains the variables unimportant for predicting T.

The introduction of W allows us to express $E\{Y(j)\}$ using the observed outcome Y and the treatment indicator T. Specifically, let $\pi \equiv P(T=1 \mid W_{\mathrm{I}})$ denote the propensity score for an individual; then, with the consistency assumption, we obtain that

$$E\left(\frac{TY}{\pi}\right) = E\left[\frac{T\{TY(1)+(1-T)Y(0)\}}{\pi}\right] = E\left\{\frac{TY(1)}{\pi}\right\} = E\left[\frac{E\{TY(1)\mid W\}}{\pi}\right] = E\left[\frac{P(T=1\mid W)\,E\{Y(1)\mid W\}}{\pi}\right] = E\left[\frac{P(T=1\mid W_{\mathrm{I}})\,E\{Y(1)\mid W\}}{\pi}\right] = E[E\{Y(1)\mid W\}] = E\{Y(1)\},$$

where the second step follows from $T^2 = T$ and $T(1-T) = 0$, the fourth step is due to the assumption of no unmeasured confounders, and the fifth step results from the nature of $W_{\mathrm{I}}$.

Similarly, we obtain that $E\{(1-T)Y/(1-\pi)\} = E\{Y(0)\}$, and thus $E\{TY/\pi - (1-T)Y/(1-\pi)\} = \tau_0$. These identities provide a basis for constructing a consistent estimator of $\tau_0$ using the observed data, as described below. Here the positivity assumption $0 < \pi < 1$ is implicitly made so that the denominators are meaningful.

2.1. Consistent estimator

To estimate $\tau_0$, suppose we have a sample $\{(T_i, Y_i, W_i) : i = 1, \ldots, n\}$ of size n, where $T_i$, $Y_i$, and $W_i = (W_{\mathrm{I}i}^T, W_{\mathrm{II}i}^T)^T$ represent the corresponding variables for subject i in the sample, $i = 1, \ldots, n$. To use the confounder information together with the sample response measurements obtained from observational studies, Rosenbaum and Rubin 1 initiated the propensity score method to balance the distribution of the covariates between the treated and untreated groups. For $i = 1, \ldots, n$, let

$$\pi_i = P(T_i = 1 \mid W_{\mathrm{I}i})$$

denote the conditional probability for subject i to receive the treatment, given the predictive covariates WIi ; this probability is called the propensity score of subject i.

If $\pi_i$ can be consistently estimated, say by $\hat{\pi}_i$, for $i = 1, \ldots, n$, then a consistent estimator of $\tau_0$ can be constructed by

$$\hat{\tau} = \frac{1}{n}\sum_{i=1}^{n}\frac{T_i Y_i}{\hat{\pi}_i} - \frac{1}{n}\sum_{i=1}^{n}\frac{(1-T_i)Y_i}{1-\hat{\pi}_i}. \quad (1)$$

To mitigate unstable numerical results caused by extreme π^i close to 0 or 1, Lunceford and Davidian 3 proposed a stable version of (1), which is also consistent:

$$\hat{\tau} = \left(\sum_{i=1}^{n}\frac{T_i}{\hat{\pi}_i}\right)^{-1}\sum_{i=1}^{n}\frac{T_i Y_i}{\hat{\pi}_i} - \left(\sum_{i=1}^{n}\frac{1-T_i}{1-\hat{\pi}_i}\right)^{-1}\sum_{i=1}^{n}\frac{(1-T_i)Y_i}{1-\hat{\pi}_i}. \quad (2)$$

We use (2) for the following development.
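For concreteness, the stabilized estimator (2) can be coded in a few lines of R; the function below is a minimal sketch in which `trt`, `y`, and `pihat` are hypothetical vectors holding the values of $T_i$, $Y_i$, and $\hat{\pi}_i$.

```r
# Stabilized inverse probability weighting estimator of the ATE, as in (2).
# trt: 0/1 treatment indicators; y: observed outcomes; pihat: estimated
# propensity scores; all are vectors of length n (names are illustrative).
ipw_ate <- function(trt, y, pihat) {
  w1 <- trt / pihat              # weights for the treated group
  w0 <- (1 - trt) / (1 - pihat)  # weights for the untreated group
  sum(w1 * y) / sum(w1) - sum(w0 * y) / sum(w0)
}
```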

2.2. Variable selection and measurement error

The consistency of the estimator (2) hinges on the consistent estimation of the propensity score $\pi_i$ for $i = 1, \ldots, n$. While the propensity score $\pi_i$ is assumed to be determined by $W_{\mathrm{I}i}$ but not $W_{\mathrm{II}i}$, in building the treatment model for $\pi_i$ it is usually unclear which components of $W_i$ should be included in $W_{\mathrm{I}i}$, though subject matter knowledge is often utilized to help. Here we consider the circumstance where no prior knowledge is available to guide us in forming $W_{\mathrm{I}i}$, and we rely on variable selection techniques to decide which variables in $W_i$ are to be selected to form $W_{\mathrm{I}i}$. Without loss of generality, the components of $W_i$ are assumed to be standardized to have mean 0 and variance 1.

Suppose that we start with postulating the propensity score πi using a parametric model, denoted g(Wi;γ) , with all the components WIi and WIIi in Wi included:

$$\pi_i = g(W_{\mathrm{I}i}, W_{\mathrm{II}i}; \gamma), \quad (3)$$

where $g(\cdot)$ is a link function, such as the logit, probit, or complementary log–log function; and $\gamma = (\gamma_0, \gamma_{\mathrm{I}}^{*T}, \gamma_{\mathrm{II}}^T)^T$ is the vector of regression parameters, with $\gamma_0$ representing the intercept and $\gamma_{\mathrm{I}}^*$ and $\gamma_{\mathrm{II}}$ standing for the parameters corresponding to $W_{\mathrm{I}i}$ and $W_{\mathrm{II}i}$, respectively. The importance of $W_{\mathrm{I}i}$ and the irrelevance of $W_{\mathrm{II}i}$ with respect to $T_i$ are then reflected by $\gamma_{\mathrm{I}}^* \neq 0$ and $\gamma_{\mathrm{II}} = 0$. Here and elsewhere, we loosely use 0 to denote a zero vector whose dimension is inferred from the context. In other words, to identify salient variables in $W_i$, one may apply variable selection techniques to model (3) and then examine the estimates of the components of $\gamma$ to determine which components of $W_i$ are to be selected to form $W_{\mathrm{I}i}$. For ease of exposition, we write $\gamma_{\mathrm{I}} = (\gamma_0, \gamma_{\mathrm{I}}^{*T})^T$ and $\gamma = (\gamma_{\mathrm{I}}^T, \gamma_{\mathrm{II}}^T)^T$; let p denote the dimension of $(\gamma_{\mathrm{I}}^{*T}, \gamma_{\mathrm{II}}^T)^T$; and we also write $\gamma = (\gamma_0, \gamma_1, \ldots, \gamma_p)^T$.

The estimation of γ may proceed as follows. Let Si(γ;Wi) denote the likelihood score function, or more generally, an unbiased estimating function, derived from model (3) to reflect the contribution from subject i in the sample. If the true value of Wi is available, one may solve

$$\sum_{i=1}^{n} S_i(\gamma; W_i) = 0 \quad (4)$$

for γ to obtain a consistent estimator under regularity conditions.

In applications, some variables in $W_i$ are error-prone and their accurate measurements are not available for all subjects in the study. To characterize this feature, we re-write $W_{\mathrm{I}i} = (Z_{\mathrm{I}i}^T, X_{\mathrm{I}i}^T)^T$ and $W_{\mathrm{II}i} = (Z_{\mathrm{II}i}^T, X_{\mathrm{II}i}^T)^T$ so that $Z_i \equiv (Z_{\mathrm{I}i}^T, Z_{\mathrm{II}i}^T)^T$ represents the subvector of error-free covariates in $W_i$ and $X_i \equiv (X_{\mathrm{I}i}^T, X_{\mathrm{II}i}^T)^T$ is the subvector of error-prone covariates in $W_i$. Furthermore, let $X_i^* = (X_{\mathrm{I}i}^{*T}, X_{\mathrm{II}i}^{*T})^T$ denote the observed version of $X_i$, where $X_{\mathrm{I}i}^*$ and $X_{\mathrm{II}i}^*$ are the observed measurements of $X_{\mathrm{I}i}$ and $X_{\mathrm{II}i}$, respectively.

Suppose that Xi* and Xi are linked by the classical additive error model

$$X_i^* = X_i + e_i, \quad (5)$$

where the error term ei is independent of {Ti,Xi,Zi,Y(0)i,Y(1)i} and follows N(0,Σe) with covariance matrix Σe . Here Y(1)i (or Y(0)i ) represents the potential outcome that would have been observed had subject i been treated (or untreated). Model (5) features situations where the observed value fluctuates around the true value with an error term; this model is most commonly used in the literature (e.g. Carroll et al. 17 and Yi 18 ).

While model (5) is useful to describe measurement error problems in applications, its introduction brings in an issue of parameter non-identifiability (e.g. Yi 18 and Yi et al. 19 ). To circumvent this issue and highlight the main ideas, we assume that Σe is known for now and defer the discussions on handling unknown Σe to Section 6.

3. Simulation-based method

In this section, we develop a method of estimating the average treatment effect τ0 by incorporating the features of variable selection and measurement error. We also establish the asymptotic properties of the resulting estimator.

3.1. Implementation procedure

Here we describe a simulation-based method that applies to any parametric form of the treatment model (3). The method is rooted in the SIMEX algorithm developed for the non-causal inference framework by Cook and Stefanski, 20 but it extends the scope of the usual SIMEX algorithm by including extra steps for variable selection and inverse probability treatment weighting estimation. The estimation procedure consists of five steps. The first three steps come from the SIMEX method, which aims to correct for measurement error effects in the estimation of the parameters associated with the treatment model (3); the fourth step performs the selection of active variables for building a suitable treatment model; and the last step outputs a valid estimator of $\tau_0$. The five implementation steps are described as follows, and the programming code is available at GitHub: https://github.com/lchen723/Causal-inference-R-code.git.

Step 1. Simulation:

Let K be a given positive integer (say, $K = 500$) and let $\mathcal{C} = \{\psi_1, \psi_2, \ldots, \psi_M\}$ be a sequence of M (say, $M = 10$) non-negative numbers taken from $[0, \psi_M]$ with a given $\psi_M$ (say, $\psi_M = 1$), where $\psi_1 = 0$. For $k = 1, \ldots, K$, independently generate random variates $e_{ik}$ from $N(0, \Sigma_e)$ and define

$$X_i^*(k, \psi) = X_i^* + \sqrt{\psi}\, e_{ik} \quad \text{for } \psi \in \mathcal{C}.$$

Step 2. Estimation of treatment model parameters:

In (4), we replace Xi with Xi*(k,ψ) and solve

$$\sum_{i=1}^{n} S_i(\gamma; X_i^*(k, \psi)) = 0$$

for γ to obtain an estimator, denoted γ^(k,ψ) , where the dependence of Si(γ;Xi*(k,ψ)) on Zi is suppressed in the notation. Next, calculate

$$\hat{\gamma}(\psi) = K^{-1} \sum_{k=1}^{K} \hat{\gamma}(k, \psi).$$

Step 3. Extrapolation:

For $j = 0, 1, \ldots, p$, let $\hat{\gamma}_j(\psi)$ denote the jth element of $\hat{\gamma}(\psi)$; fit a regression model to $\{(\psi, \hat{\gamma}_j(\psi)) : \psi \in \mathcal{C}\}$ and extrapolate it to $\psi = -1$; and let $\tilde{\gamma}_j$ denote the resulting extrapolated estimator of $\gamma_j$, the jth element of $\gamma$. Write $\tilde{\gamma} = (\tilde{\gamma}_0, \tilde{\gamma}_1, \ldots, \tilde{\gamma}_p)^T$.
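To fix ideas, the following R sketch outlines Steps 1 to 3 for the special case of a logistic treatment model, where the score equation in Step 2 is solved by maximum likelihood via `glm`; the data objects `trt`, `xstar`, `z`, and `Sigma_e`, the use of `MASS::mvrnorm` for generating the $e_{ik}$, and the quadratic working extrapolant are all assumptions of this illustration.

```r
library(MASS)  # mvrnorm() for multivariate normal variates

# Steps 1-3 (sketch): SIMEX estimation of the treatment model parameters.
# trt: 0/1 vector; xstar: n x q matrix of error-prone surrogates;
# z: n x r matrix of error-free covariates; Sigma_e: q x q error covariance.
simex_gamma <- function(trt, xstar, z, Sigma_e, K = 500,
                        psi_grid = seq(0, 2, by = 0.25)) {
  n <- nrow(xstar)
  est <- sapply(psi_grid, function(psi) {
    fits <- replicate(K, {
      e <- mvrnorm(n, mu = rep(0, ncol(xstar)), Sigma = Sigma_e)
      x_psi <- xstar + sqrt(psi) * e                 # Step 1: remeasurement
      coef(glm(trt ~ x_psi + z, family = binomial))  # Step 2: estimation
    })
    rowMeans(fits)                                   # gamma_hat(psi)
  })
  # Step 3: fit a quadratic in psi to each coefficient and extrapolate to
  # psi = -1, the error-free limit.
  apply(est, 1, function(g) {
    fit <- lm(g ~ psi_grid + I(psi_grid^2))
    unname(predict(fit, newdata = data.frame(psi_grid = -1)))
  })
}
```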

Step 4. Variable selection:

Define a quadratic loss function:

$$\ell(\gamma) = \frac{1}{2}(\gamma - \tilde{\gamma})^T V_n (\gamma - \tilde{\gamma}), \quad (6)$$

where $V_n$ is a user-specified positive definite weight matrix. Different specifications of $V_n$ may result in estimators with different efficiency; taking $V_n$ to be the inverse of the covariance matrix of $\tilde{\gamma}$ yields an efficient result (see Theorem 3.1(c) below), while setting $V_n$ to be an identity matrix gives an easy implementation.

Consider the penalized quadratic loss function:

$$\ell_P(\gamma) = \ell(\gamma) + n \sum_{j=1}^{p} p_{\lambda}(|\gamma_j|), \quad (7)$$

where $p_{\lambda}(\cdot)$ is a penalty function with a tuning parameter $\lambda$. Here we consider a weighted $L_1$-penalty 21 :

$$p_{\lambda}(|\gamma_j|) = p'_{\lambda}(|\tilde{\gamma}_j|)\, |\gamma_j|,$$

where $p'_{\lambda}(u)$ is the first-order derivative of $p_{\lambda}(u)$, and $p_{\lambda}(u)$ can be set as a commonly used penalty function such as the least absolute shrinkage and selection operator (LASSO) penalty 22 or the smoothly clipped absolute deviation (SCAD) penalty. 23 With the LASSO penalty, we set

$$p_{\lambda}(|\gamma_j|) = \lambda |\gamma_j|,$$

and for the SCAD penalty, we take

$$p'_{\lambda}(x) = \lambda\left\{ I(x \le \lambda) + \frac{(a\lambda - x)_+}{(a-1)\lambda}\, I(x > \lambda) \right\},$$

where $I(\cdot)$ is the indicator function, $x_+ = \max\{x, 0\}$, and $a = 3.7$.

To achieve a satisfactory performance of the selection procedure, we consider a grid $\Lambda$ of possible values for the tuning parameter $\lambda$, and for $\lambda \in \Lambda$, let $\hat{\gamma}(\lambda) = \operatorname{argmin}_{\gamma} \ell_P(\gamma)$ and let $\mathrm{df}_{\lambda}$ denote the number of non-zero elements of $\hat{\gamma}(\lambda)$. We define

$$\mathrm{BIC}(\lambda) = 2\,\ell(\hat{\gamma}(\lambda)) + 2(\log n)\, \mathrm{df}_{\lambda}.$$

Then the optimal tuning parameter λ* is chosen as the minimizer of BIC(λ) :

$$\lambda^* = \operatorname{argmin}_{\lambda \in \Lambda} \mathrm{BIC}(\lambda)$$

and the resulting estimator γ^(λ*) , denoted γ^ , is taken as the estimator of γ .
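When $V_n$ is taken proportional to the identity matrix, the penalized loss (7) with the weighted $L_1$ penalty separates across coordinates, and each component of the minimizer has a closed soft-thresholding form. The R sketch below illustrates Step 4 under this choice, using the SCAD-derivative weights and the BIC criterion defined above; the scaling $V_n = nI$ (consistent with condition (A) in Section 3.2) and the unpenalized intercept are assumptions of the sketch, not prescriptions of the method.

```r
# First-order derivative of the SCAD penalty with a = 3.7.
scad_deriv <- function(u, lambda, a = 3.7) {
  lambda * (u <= lambda) + (pmax(a * lambda - u, 0) / (a - 1)) * (u > lambda)
}

# Step 4 (sketch): with V_n = n * I, minimizing (7) reduces to coordinate-wise
# soft-thresholding of the SIMEX estimate gamma_tilde; the tuning parameter
# is chosen by the BIC criterion. The intercept (first element) is unpenalized.
select_gamma <- function(gamma_tilde, n,
                         lambda_grid = seq(0.01, 0.5, by = 0.01)) {
  best <- NULL; best_bic <- Inf
  for (lambda in lambda_grid) {
    w <- scad_deriv(abs(gamma_tilde[-1]), lambda)   # adaptive weights
    g <- c(gamma_tilde[1],
           sign(gamma_tilde[-1]) * pmax(abs(gamma_tilde[-1]) - w, 0))
    loss <- 0.5 * n * sum((g - gamma_tilde)^2)      # (6) with V_n = n * I
    bic  <- 2 * loss + 2 * log(n) * sum(g != 0)     # BIC(lambda)
    if (bic < best_bic) { best_bic <- bic; best <- g }
  }
  best
}
```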

Step 5. Estimation of ATE:

Write $\hat{\gamma} = (\hat{\gamma}_{\mathrm{I}}^T, \hat{\gamma}_{\mathrm{II}}^T)^T$ with $\hat{\gamma}_{\mathrm{I}} = (\hat{\gamma}_0, \hat{\gamma}_{x\mathrm{I}}^T, \hat{\gamma}_{z\mathrm{I}}^T)^T$ and $\hat{\gamma}_{\mathrm{II}} = (\hat{\gamma}_{x\mathrm{II}}^T, \hat{\gamma}_{z\mathrm{II}}^T)^T$, respectively, corresponding to the non-zero and zero components in $\hat{\gamma}$. The importance of the covariates $\{X_{\mathrm{I}i}, Z_{\mathrm{I}i}\}$ and the unimportance of the covariates $\{X_{\mathrm{II}i}, Z_{\mathrm{II}i}\}$ are thereby suggested by the estimates of the corresponding coefficients. With the unimportant variables $X_{\mathrm{II}i}$ and $Z_{\mathrm{II}i}$ excluded from the initial model (3), the final treatment model is taken as

$$\pi_i = g(X_{\mathrm{I}i}, Z_{\mathrm{I}i}; \gamma_{\mathrm{I}}), \quad (8)$$

where γI includes the intercept and the model parameters associated with important covariates {XIi,ZIi} .

Let $X_{\mathrm{I}i}^*(k, \psi)$ denote the subvector of $X_i^*(k, \psi)$, generated in Step 1, corresponding to $X_{\mathrm{I}i}^*$. Using the selected treatment model (8) with $X_{\mathrm{I}i}$ replaced by $X_{\mathrm{I}i}^*(k, \psi)$, we calculate the fitted value $\hat{\pi}_i(k, \psi) \equiv g(X_{\mathrm{I}i}^*(k, \psi), Z_{\mathrm{I}i}; \hat{\gamma}_{\mathrm{I}})$. Then we obtain an estimate, say $\hat{\tau}(k, \psi)$, of $\tau_0$ using (2) with $\hat{\pi}_i$ replaced by $\hat{\pi}_i(k, \psi)$, and calculate

$$\hat{\tau}(\psi) = K^{-1} \sum_{k=1}^{K} \hat{\tau}(k, \psi).$$

Finally, we fit a regression model to $\{(\psi, \hat{\tau}(\psi)) : \psi \in \mathcal{C}\}$ and extrapolate it to $\psi = -1$. The resulting value, denoted $\hat{\tau}$, is taken as an estimate of $\tau_0$.
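A companion sketch of Step 5, continuing the hypothetical objects used above (`MASS` loaded and `ipw_ate` defined as earlier): `gamma_hat` is the estimate from Step 4, `keep` indexes its non-zero components within the design matrix, and the quadratic extrapolant is again one working choice.

```r
# Step 5 (sketch): estimate the ATE for each (k, psi) from the selected
# treatment model, average over k, and extrapolate to psi = -1.
simex_ate <- function(trt, y, xstar, z, Sigma_e, gamma_hat, keep,
                      K = 500, psi_grid = seq(0, 2, by = 0.25)) {
  n <- nrow(xstar)
  tau_psi <- sapply(psi_grid, function(psi) {
    mean(replicate(K, {
      e <- mvrnorm(n, rep(0, ncol(xstar)), Sigma_e)
      x_psi  <- xstar + sqrt(psi) * e               # reuse of Step 1
      design <- cbind(1, x_psi, z)[, keep, drop = FALSE]
      pihat  <- plogis(design %*% gamma_hat[keep])  # fitted propensity scores
      ipw_ate(trt, y, pihat)                        # stabilized estimator (2)
    }))
  })
  fit <- lm(tau_psi ~ psi_grid + I(psi_grid^2))
  unname(predict(fit, newdata = data.frame(psi_grid = -1)))
}
```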

We conclude this section with a few remarks. The basic idea of the implementation is rooted in the available estimation method (2), which was developed for the ideal situation where all the pre-treatment covariates are relevant and measured without error. We start with an initial treatment model (3) that uses all pre-treatment variables in $W_i$ to express the propensity scores.

The first three steps are directed at addressing the measurement error effects in the estimation of the propensity scores, and the last step estimates the ATE $\tau_0$ with the measurement error effects accounted for by employing the SIMEX algorithm. The idea of the SIMEX is to first establish the trend of measurement error-induced biases as a function of the variance of the measurement error by artificially creating a sequence of surrogate measurements, and then extrapolate this trend back to the case without measurement error. Specifically, in Step 1, we artificially create a sequence of error-contaminated surrogate measurements by introducing different degrees of measurement error, and then we apply those surrogate measurements in Step 2 to obtain biased estimates by running an estimation method developed for error-free settings. Step 3 traces the pattern of the biased estimates against the varying magnitudes of measurement error and then does the extrapolation. To analytically demonstrate this rationale, one may consider the simple linear regression model with an additive measurement error in the covariate; an intuitive illustration of the idea can be found in Yi 18 (pp. 63–64). In implementing Steps 3 and 5, it is ideal to use the true extrapolation function, as is required in establishing the theoretical results in the next subsection. However, the exact functional form is typically unknown in applications, and one has to invoke a specified regression model to approximate it. As a result, this approximation makes the resulting SIMEX estimators not exactly but only approximately consistent. Further, this suggests that the performance of the SIMEX estimators can be sensitive to the choice of a working extrapolation function form. In our simulation studies reported in Section 4, we compare how different specifications of the extrapolation function in Steps 3 and 5 may affect the performance of the estimation.

Step 4 requires the optimal value for the tuning parameter λ . While different criteria such as generalized cross validation (GCV) and Akaike information criterion (AIC) are commonly used in applications, Wang et al. 24 and Zhang et al. 25 showed that the optimal tuning parameter derived from the AIC and GCV criteria has a nonignorable overfitting effect and that the tuning parameter derived from the Bayesian information criterion (BIC) with the SCAD can identify the true model consistently under linear or partial linear regression models. Here we consider a criterion based on the BIC following Yi et al. 26 and Chen and Yi. 27

3.2. Theoretical results

Now we justify the validity of the algorithm described in Section 3.1 by establishing asymptotic properties for the associated estimators. Following authors such as Fan and Li, 23 Yi et al., 26 and Carroll et al., 28 we consider the following regularity conditions:

  • (A). As $n \to \infty$, $n^{-1} V_n \to_p V$, where V is positive definite.

  • (B). As $n \to \infty$, $\sqrt{n}\,(\tilde{\gamma} - \gamma) \to_d N(0, \Sigma)$, where $\Sigma$ is positive definite.

  • (C). To express the dependence on the sample size, we write the tuning parameter as $\lambda_n$. Define

$$a_n = \max\{p'_{\lambda_n}(|\gamma_j|) : \gamma_j \neq 0,\ j = 1, \ldots, p\} \quad \text{and} \quad b_n = \max\{p''_{\lambda_n}(|\gamma_j|) : \gamma_j \neq 0,\ j = 1, \ldots, p\}.$$

    Assume that as $n \to \infty$,

$$a_n \to 0, \quad b_n \to 0, \quad \text{and} \quad \liminf_{n \to \infty}\ \liminf_{u \to 0^+} \sqrt{n}\, p'_{\lambda_n}(u) = \infty.$$

  • (D). The extrapolation function in Step 5 is correctly specified.

Repeating the proof of Theorem 1 of Fan and Li, 23 we can show that under the regularity conditions, there exists an estimator $\hat{\gamma}$ such that

$$\hat{\gamma} - \gamma = O_p(n^{-1/2} + a_n).$$

This suggests that $\hat{\gamma}$ is a $\sqrt{n}$-consistent estimator of $\gamma$ if $a_n = O_p(n^{-1/2})$.

Next, we discuss the asymptotic properties of $\hat{\gamma}$. Split V into a $2 \times 2$ block matrix:

$$V = \begin{pmatrix} V_{\mathrm{I,I}} & V_{\mathrm{I,II}} \\ V_{\mathrm{II,I}} & V_{\mathrm{II,II}} \end{pmatrix},$$

where $V_{u,v}$ is the submatrix of dimension $p_u \times p_v$ for $u, v = \mathrm{I}, \mathrm{II}$, with $p_{\mathrm{I}}$ representing the dimension of $\gamma_{\mathrm{I}}$ and $p_{\mathrm{II}}$ the dimension of $\gamma_{\mathrm{II}}$. Write $V_{\mathrm{I}} = [V_{\mathrm{I,I}}\ \ V_{\mathrm{I,II}}]$. Using the arguments of Yi et al., 26 we can prove the following results.

Theorem 3.1.

Assume that regularity conditions (A) to (C) hold. Then the following results hold:

  • (a). As $n \to \infty$,

$$\sqrt{n}\,(\hat{\gamma}_{\mathrm{I}} - \gamma_{\mathrm{I}}) \to_d N\big(0,\ V_{\mathrm{I,I}}^{-1} V_{\mathrm{I}}\, \Sigma\, V_{\mathrm{I}}^T V_{\mathrm{I,I}}^{-1}\big).$$
  • (b). 

    $\hat{\gamma}_{z\mathrm{II}} = 0$ and $\hat{\gamma}_{x\mathrm{II}} = 0$.

  • (c). When V is equal to $\Sigma^{-1}$, the covariance matrix of $\hat{\gamma}_{\mathrm{I}}$ is no greater than the covariance matrix of $\tilde{\gamma}_{\mathrm{I}}$ in the Loewner order, where $\tilde{\gamma}_{\mathrm{I}}$ is the subvector of the SIMEX estimator $\tilde{\gamma}$ corresponding to $\gamma_{\mathrm{I}}$.

Theorem 3.1(a) establishes the asymptotic distribution of the estimators of the effects corresponding to the important pre-treatment variables in model (3), or equivalently, of the estimators of the parameters of the selected treatment model (8). Theorem 3.1(b) ensures the oracle property, in the sense of Fan and Li, 23 of the variable selection procedure for building the final treatment model (8). The results in Theorem 3.1(a) and (b) are established along the same lines as Zou and Li 21 and Fan and Li, 23 and basically require regularity condition (C). This condition is satisfied by the SCAD penalty but not the LASSO penalty. Condition (B) is needed in showing Theorem 3.1(a), and its validity is ensured by the result of Carroll et al., 28 which assumes that the extrapolation function in Step 3 is known.

Theorem 3.1(c) suggests that the estimator obtained by conducting an extra step of variable selection (i.e. obtained from Steps 1–4) is more efficient than the usual SIMEX estimator (i.e. obtained from Steps 1–3). This shows the necessity and importance of excluding inactive pre-treatment variables when building the treatment model. Furthermore, we establish the following asymptotic distribution of the estimator $\hat{\tau}$ obtained in Step 5 of Section 3.1; its proof is deferred to the Appendix.

Theorem 3.2.

Assume that regularity conditions (A) to (D) hold. Then the estimator τ^ obtained in Section 3.1 has the asymptotic distribution

$$\sqrt{n}\,(\hat{\tau} - \tau_0) \to_d N(0, v(\tau_0)) \quad \text{as } n \to \infty,$$

where v(τ0) is defined in the Appendix.

4. Simulation studies

In this section, we conduct simulation studies to assess the finite sample performance of the proposed method. As described in Section 3, we first select the variables and estimate the associated parameters for the treatment model, and then estimate the average treatment effect τ0 . For each parameter configuration, we repeat the simulation 500 times, where the sample size is set as n=400 .

4.1. Simulation designs

We consider one of the following two outcome models, which show different types of dependence of Y on the covariates {X,Z} and the treatment indicator variable T:

$$\text{Model 1:} \quad Y = T + \beta_x^T X + \beta_z^T Z + \epsilon$$
$$\text{Model 2:} \quad \operatorname{logit} P(Y = 1 \mid T, X, Z) = T + \beta_x^T X + \beta_z^T Z$$

where $\epsilon$ is independent of $\{X, Z, T\}$ with $\epsilon \sim N(0, 1)$; and $\beta_x$ and $\beta_z$ are the model parameters of dimensions $p_x$ and $p_z$, respectively. Here we take $\beta_x = (1_{d_x}^T, 0_{p_x - d_x}^T)^T$ and $\beta_z = (1_{d_z}^T, 0_{p_z - d_z}^T)^T$ with $d_x \equiv \lceil p_x/2 \rceil$ and $d_z \equiv \lceil p_z/2 \rceil$, where $\lceil a \rceil$ represents the ceiling function of a, that is, the least integer no smaller than a for a real number a; and $1_r$ and $0_r$ represent the $r \times 1$ unit and zero vectors, respectively, for a positive integer r. We set $p_x$ and $p_z$ both to be 15, so that Models 1 and 2 include 16 important covariates, of which 8 are error-prone, and 14 unimportant covariates, of which 7 are error-prone.

The treatment indicator $T_i$ is independently generated from the treatment model (3) for $i = 1, \ldots, n$. We consider three useful model forms for (3), respectively given by

  • logistic regression model:

$$\pi_i = \frac{\exp(\gamma_0 + X_i^T \gamma_x + Z_i^T \gamma_z)}{1 + \exp(\gamma_0 + X_i^T \gamma_x + Z_i^T \gamma_z)} \quad (9)$$

  • probit regression model:

$$\pi_i = \Phi(\gamma_0 + X_i^T \gamma_x + Z_i^T \gamma_z) \quad (10)$$

  • complementary log–log regression model:

$$\pi_i = 1 - \exp\{-\exp(\gamma_0 + X_i^T \gamma_x + Z_i^T \gamma_z)\} \quad (11)$$

where $\gamma = (\gamma_0, \gamma_x^T, \gamma_z^T)^T$ is the vector of parameters, and $\Phi(\cdot)$ is the standard normal cumulative distribution function.

For all these treatment models, we consider the case with $\gamma_x = \gamma_z = (1_{d_x/2}^T, -1_{d_x/2}^T, 0_{p_x - d_x}^T)^T$ and $\gamma_0 = 1$. Thus, the numbers of important error-prone and important error-free variables are both 8, and the numbers of unimportant error-prone and unimportant error-free variables are both 7.

To generate measurements of $Y_i$ and $T_i$, respectively, from an outcome and a treatment model independently for $i = 1, \ldots, n$, we first need to simulate measurements for the covariates $\{X_i, Z_i\}$. For $i = 1, \ldots, n$, we generate covariate measurements independently from the normal distribution $(X_i^T, Z_i^T)^T \sim N(0_p, \Sigma_w)$, where

$$\Sigma_w = \begin{pmatrix} \Sigma_{xx} & \Sigma_{xz} \\ \Sigma_{zx} & \Sigma_{zz} \end{pmatrix},$$

with $\Sigma_{xz} = \Sigma_{zx}^T$, and $\Sigma_{xx}$ and $\Sigma_{zz}$, respectively, representing the covariance matrices of $X_i$ and $Z_i$. Let $\sigma_{xz}^{jk}$, $\sigma_x^{jk}$, and $\sigma_z^{jk}$ denote element $(j, k)$ of $\Sigma_{xz}$, $\Sigma_{xx}$, and $\Sigma_{zz}$, respectively. We consider the case with $\sigma_{xz}^{jk} = 0.4^{(2 + |j-k|)}$, $\sigma_x^{jk} = \sigma_x^2 \rho_x^{|j-k|}$, and $\sigma_z^{jk} = \sigma_z^2 \rho_z^{|j-k|}$, setting $\sigma_x^2 = \sigma_z^2 = 1.0$ and $\rho_x = \rho_z = 0.5$.

To generate surrogate measurements $X_i^*$ for $X_i$, we consider the classical additive error model (5), where $\Sigma_e$ is set as a diagonal matrix with common diagonal element $\sigma_e^2$. We set $\sigma_e^2$ to be 0.15, 0.50, or 0.75, yielding the signal-to-noise ratio, measured by $\{\operatorname{var}(X_j)/\operatorname{var}(e_j)\}^2$, to be approximately 44, 4, and 1.8, respectively.
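To make the design concrete, the following R sketch generates one dataset under Model 1 with the logistic treatment model (9); since $p_x = p_z$ here, the same coefficient vectors serve for the X and Z blocks, and the sign pattern of $\gamma_x$ follows the specification above.

```r
library(MASS)
set.seed(1)
n <- 400; px <- 15; pz <- 15; p <- px + pz
dx <- ceiling(px / 2)   # number of important X (and Z) variables

# Covariance of (X, Z): AR(1)-type diagonal blocks with the cross-covariances above.
idx      <- outer(1:px, 1:px, function(j, k) abs(j - k))
Sigma_xx <- 0.5^idx; Sigma_zz <- 0.5^idx
Sigma_xz <- 0.4^(2 + idx)
Sigma_w  <- rbind(cbind(Sigma_xx, Sigma_xz), cbind(t(Sigma_xz), Sigma_zz))

w <- mvrnorm(n, rep(0, p), Sigma_w)
x <- w[, 1:px]; z <- w[, (px + 1):p]

# Treatment model (9): half the important coefficients are 1, half are -1.
gamma_x <- c(rep(1, dx / 2), rep(-1, dx / 2), rep(0, px - dx))
trt <- rbinom(n, 1, plogis(1 + x %*% gamma_x + z %*% gamma_x))

# Outcome Model 1, for which tau_0 = 1.
beta_x <- c(rep(1, dx), rep(0, px - dx))
y <- trt + x %*% beta_x + z %*% beta_x + rnorm(n)

# Classical additive measurement error (5) with Sigma_e = sigma_e2 * I.
sigma_e2 <- 0.50
xstar <- x + mvrnorm(n, rep(0, px), diag(sigma_e2, px))
```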

Finally, we examine the value of the average treatment effect, τ0 , under each of the outcome models. Model 1 shows a scenario with continuous outcomes and yields the true value τ0 to be

$$\tau_0 = E\{Y(1)\} - E\{Y(0)\} = E[E\{Y(1) \mid T, X, Z\}] - E[E\{Y(0) \mid T, X, Z\}] = E(1 + \beta_x^T X + \beta_z^T Z + \epsilon) - E(0 + \beta_x^T X + \beta_z^T Z + \epsilon) = 1.$$

In contrast, Model 2 forms a logistic regression model for binary outcomes, yielding the true value of $\tau_0$ to be

$$\tau_0 = P\{Y(1) = 1\} - P\{Y(0) = 1\} = \int_{\mathbb{R}^{p_x}} \int_{\mathbb{R}^{p_z}} P(Y = 1 \mid T = 1, X = x, Z = z) f(x, z)\, dz\, dx - \int_{\mathbb{R}^{p_x}} \int_{\mathbb{R}^{p_z}} P(Y = 1 \mid T = 0, X = x, Z = z) f(x, z)\, dz\, dx \quad (12)$$

where $f(x, z)$ is the density function of $\{X, Z\}$, specified earlier as a multivariate normal distribution, and $P(Y = 1 \mid T = 1, X = x, Z = z)$ is determined by Model 2. As (12) does not have a closed form, we use the Monte Carlo method to obtain an approximate value of $\tau_0$. First, we generate a sequence of values $\{(x_j, z_j) : j = 1, \ldots, N\}$ from $f(x, z)$ for a sufficiently large N, and then we approximate $\tau_0$ by $N^{-1} \sum_{j=1}^{N} P(Y = 1 \mid T = 1, X = x_j, Z = z_j) - N^{-1} \sum_{j=1}^{N} P(Y = 1 \mid T = 0, X = x_j, Z = z_j)$. In our numerical studies, we set $N = 5000$, yielding the true value of $\tau_0$ to be approximately 0.187.
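A short R sketch of this Monte Carlo approximation, reusing `Sigma_w`, `beta_x`, and the dimensions from the data-generation sketch above:

```r
# Monte Carlo approximation of tau_0 under Model 2, following (12).
N   <- 5000
wN  <- mvrnorm(N, rep(0, p), Sigma_w)
lin <- wN[, 1:px] %*% beta_x + wN[, (px + 1):p] %*% beta_x  # beta_x^T x + beta_z^T z
tau0 <- mean(plogis(1 + lin)) - mean(plogis(0 + lin))
```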

4.2. Simulation results

Our objectives here are to (a) select important variables for the treatment model (3), (b) assess the performance of the proposed estimator of $\tau_0$, and (c) demonstrate the effects of measurement error. When implementing the proposed method described in Section 3.1, we consider the following choices of the relevant quantities. In Steps 3 and 5, we set $\mathcal{C} = \{0, 0.25, 0.50, 0.75, 1.0, 1.25, 1.50, 1.75, 2.0\}$ and $K = 500$; to evaluate how the estimation results for $\tau_0$ may change, we use the quadratic, linear, or rational linear extrapolation function to approximate the true extrapolation function, as discussed by Carroll et al. 17 (Section 5.3.2). In Step 4, we set $V_n$ in (6) to be the identity matrix and take the SCAD penalty for (7); for comparison, we also consider the use of the LASSO penalty, which does not satisfy the requirement “$a_n \to 0$ as $n \to \infty$” in condition (C).

In contrast to variable selection for objective (a), we also examine the performance based on the full model without variable selection. To compare the performance of the methods, we use the $L_1$ and $L_2$ loss functions, respectively given by $\sum_{k=0}^{p} |\hat{\gamma}_k - \gamma_k|$ and $\sum_{k=0}^{p} (\hat{\gamma}_k - \gamma_k)^2$, where $\hat{\gamma}_k$ and $\gamma_k$ represent the kth components of $\hat{\gamma}$ and $\gamma$, respectively. In addition, we report the total number of selected variables and the number of falsely excluded important variables, denoted #S and #FN, respectively. Regarding objective (c) of demonstrating the impact of measurement error, we examine the naive approach, which disregards the difference between $X_i^*$ and $X_i$ in both variable selection and estimation of $\tau_0$.

To save space, here we report only the results for the cases with the treatment model specified as the logistic regression model and defer the results for the other two treatment models to the Supplemental material. Table 1 records the results for variable selection of the treatment model, and Table 2 reports finite sample biases (Bias), empirical standard errors (S.E.), root mean squared errors (RMSE), and coverage rates in percent (CR%) for 95% confidence intervals for τ0 obtained under the response Models 1 and 2.

Table 1.

Simulation results: variable selection for the treatment model postulated by the logistic model.

Proposed: quadratic extrapolation Proposed: linear extrapolation Proposed: rational linear extrapolation Naive
σe² Method L1-loss L2-loss #S #FN L1-loss L2-loss #S #FN L1-loss L2-loss #S #FN L1-loss L2-loss #S #FN
0.15 LASSO 0.322 0.013 18.363 0.000 0.405 0.028 18.557 0.000 0.395 0.023 18.530 0.000 2.528 0.468 22.190 0.000
SCAD 0.313 0.011 17.256 0.000 0.353 0.022 17.477 0.000 0.349 0.020 17.309 0.000 2.489 0.447 22.167 0.000
full 0.553 0.028 0.718 0.062 0.624 0.053 2.843 0.655
0.50 LASSO 0.337 0.015 18.570 0.000 0.423 0.033 18.593 0.000 0.414 0.029 18.554 0.000 2.607 0.474 22.231 0.000
SCAD 0.326 0.014 17.368 0.000 0.388 0.029 17.536 0.000 0.380 0.027 17.388 0.000 2.538 0.456 22.184 0.000
full 0.580 0.034 0.724 0.069 0.642 0.054 2.896 0.660
0.75 LASSO 0.349 0.018 18.615 0.000 0.440 0.037 18.673 0.000 0.423 0.034 18.650 0.000 2.634 0.486 22.325 0.000
SCAD 0.342 0.017 17.381 0.000 0.421 0.034 17.589 0.000 0.411 0.032 17.430 0.000 2.597 0.475 22.249 0.000
full 0.595 0.037 0.731 0.071 0.660 0.056 2.923 0.678
True X LASSO 0.316 0.011 17.231 0.000
SCAD 0.310 0.010 17.156 0.000
full 0.536 0.025

LASSO: least absolute shrinkage and selection operator; SCAD: smoothly clipped absolute deviation. “Proposed” refers to the procedure in Section 3 using the surrogate Xi* together with other measurements; “Naive” represents the estimation procedure in Section 2 with Xi replaced by Xi* ; “True X” denotes the estimation procedure in Section 2 using Xi together with other measurements.

Table 2.

Simulation results: estimation of ATE τ0 with the treatment model postulated by the logistic model.

Proposed: quadratic extrapolation Proposed: linear extrapolation Proposed: rational linear extrapolation Naive
Model σe² Method Bias S.E. RMSE CR% Bias S.E. RMSE CR% Bias S.E. RMSE CR% Bias S.E. RMSE CR%
1 0.15 LASSO 0.031 0.023 0.039 94.7 0.035 0.030 0.046 94.5 0.034 0.026 0.043 94.3 0.146 0.020 0.147 47.3
SCAD 0.026 0.020 0.033 95.2 0.034 0.028 0.044 94.7 0.031 0.023 0.039 94.5 0.137 0.018 0.138 49.5
full 0.058 0.026 0.064 88.4 0.064 0.035 0.073 86.3 0.061 0.031 0.068 87.8 0.159 0.024 0.161 42.6
0.50 LASSO 0.033 0.025 0.041 94.7 0.039 0.033 0.051 94.4 0.035 0.029 0.045 94.1 0.155 0.022 0.157 43.4
SCAD 0.028 0.023 0.036 95.0 0.037 0.030 0.048 94.5 0.034 0.028 0.044 94.3 0.149 0.020 0.150 45.2
full 0.063 0.029 0.069 86.8 0.068 0.042 0.080 84.6 0.066 0.036 0.075 86.1 0.167 0.027 0.169 38.4
0.75 LASSO 0.035 0.028 0.045 94.5 0.042 0.036 0.055 93.8 0.038 0.032 0.050 94.0 0.163 0.025 0.165 40.6
SCAD 0.033 0.026 0.042 94.8 0.041 0.034 0.053 94.0 0.036 0.030 0.047 94.2 0.157 0.023 0.159 42.0
full 0.069 0.032 0.076 83.7 0.071 0.045 0.084 82.4 0.069 0.038 0.079 86.0 0.173 0.029 0.175 34.2
True X LASSO 0.026 0.019 0.032 95.1
SCAD 0.023 0.017 0.029 95.3
full 0.053 0.024 0.058 91.3
2 0.15 LASSO 0.020 0.027 0.034 95.2 0.039 0.032 0.050 94.3 0.036 0.030 0.047 94.4 0.216 0.020 0.217 23.6
SCAD 0.018 0.023 0.029 95.4 0.035 0.028 0.045 94.4 0.032 0.026 0.041 94.6 0.208 0.017 0.209 26.7
full 0.063 0.036 0.073 86.5 0.071 0.042 0.082 85.1 0.068 0.039 0.078 86.7 0.256 0.030 0.258 18.5
0.50 LASSO 0.024 0.029 0.038 95.0 0.042 0.034 0.054 94.0 0.040 0.032 0.051 94.1 0.223 0.023 0.224 20.1
SCAD 0.022 0.025 0.033 95.2 0.039 0.030 0.049 94.1 0.037 0.030 0.048 94.3 0.216 0.019 0.217 22.4
full 0.072 0.039 0.082 84.7 0.076 0.045 0.088 84.7 0.073 0.041 0.084 85.8 0.266 0.034 0.268 16.1
0.75 LASSO 0.027 0.030 0.040 95.0 0.044 0.037 0.057 93.8 0.043 0.034 0.055 94.0 0.250 0.026 0.251 18.6
SCAD 0.026 0.028 0.038 95.0 0.043 0.034 0.055 93.8 0.040 0.033 0.052 94.0 0.238 0.024 0.239 20.1
full 0.079 0.043 0.090 82.6 0.084 0.047 0.096 84.1 0.082 0.044 0.093 84.7 0.307 0.035 0.309 11.4
True X LASSO 0.017 0.023 0.029 95.3
SCAD 0.015 0.020 0.025 95.4
full 0.063 0.030 0.070 91.1

ATE: average treatment effect; S.E.: standard error; RMSE: root mean square error; LASSO: least absolute shrinkage and selection operator; SCAD: smoothly clipped absolute deviation. CR%: coverage rates in percent. “Proposed” refers to the procedure in Section 3 using the surrogate Xi* together with other measurements, “Naive” represents the estimation procedure in Section 2 with Xi replaced by Xi* , and “True X” denotes the estimation procedure in Section 2 using Xi together with other measurements.

First, we examine the performance of the proposed method. Comparing the values of the $L_1$ and $L_2$ loss functions produced by the proposed method and by the likelihood estimation method based on the full model, Table 1 shows the benefits of conducting variable selection for the treatment model (3). With the proposed method, the quadratic extrapolation function tends to perform the best, while the linear extrapolation function seems to perform the least well. The proposed method with the SCAD penalty tends to slightly outperform the proposed method with the LASSO penalty. As expected, as the degree of measurement error increases, the proposed method produces increasing values of the $L_1$ and $L_2$ loss functions. Regarding the number (#S) of selected variables, the LASSO penalty tends to produce larger values than the SCAD penalty, and both seem to be slightly larger than the true value.

Regarding the performance of estimating the ATE τ0 , Table 2 shows that the proposed method outperforms the method without excluding unimportant variables (i.e. under the full model). Estimation of τ0 without incorporating variable selection for building the treatment model (3) yields larger finite sample biases and standard errors than the proposed method, and the resulting coverage rates for 95% confidence intervals of τ0 deviate from the nominal level 95%. This suggests that imposing variable selection in the procedures of estimating τ0 is necessary.

With the three extrapolation functions, the proposed method generally performs well, with small finite sample biases and good coverage rates for 95% confidence intervals of $\tau_0$; the quadratic extrapolation function tends to perform the best. Unsurprisingly, the performance of the proposed method deteriorates as the degree of measurement error increases. In implementing the proposed method, using the SCAD penalty yields slightly better results than using the LASSO penalty.

On the other hand, we inspect the results produced by the naive method, which ignores the measurement error effects. Evidently, as shown in Table 1, the naive method produces noticeably larger biases for the parameters of the treatment model (3) than the proposed method, though the number of falsely excluded important variables by the naive method, #FN, is near zero in all settings. Notably, Table 2 shows that the naive method produces unsatisfactory estimation results for $\tau_0$, with much larger finite sample biases yet smaller standard errors than those yielded by the proposed method, regardless of the outcome model; consequently, it leads to useless coverage rates for 95% confidence intervals of $\tau_0$. Furthermore, the RMSE values yielded by the naive method are larger than those of the proposed method under the different settings, suggesting that the bias–variance trade-off of the proposed method is beneficial in contrast to that of the naive method. Similar patterns are also revealed in the tables included in Section A of the Supplemental material.

In summary, the simulation studies demonstrate the adverse effects of ignoring measurement error in the analysis, as well as the importance of excluding irrelevant covariates when building the treatment model for calculating proper propensity scores. The studies confirm the satisfactory performance of the proposed estimator of $\tau_0$ in finite sample settings; the method with the SCAD penalty tends to slightly outperform the method with the LASSO penalty. As the proposed method is simulation-based, it may be time-consuming. However, its implementation is easy and can be built into an automated program. Further, its scope of applicability is broad in the sense that any parametric or semiparametric model for the treatment model (3) can be applied.

5. Analysis of NHANES I Epidemiologic Follow-up Study data

5.1. Data description

We apply the proposed method to analyze the data arising from the NHANES I Epidemiologic Follow-up Study (NHEFS), a national longitudinal study that was jointly initiated by the National Center for Health Statistics and the National Institute on Aging in collaboration with other agencies of the Public Health Service. The study was designed to investigate the relationship among clinical, nutritional, and behavioral factors. The NHEFS cohort includes participants of age 25 to 74 years who completed a medical examination at NHANES I in 1971–1975, and the first wave of data collection was conducted for all members from 1982 to 1984. The details can be found at https://wwwn.cdc.gov/nchs/nhanes/nhefs/#dfd.

In our analysis here, we consider a dataset of 1624 subjects, which is available at https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/. We are interested in understanding possible causal effects of smoking behavior on weight change. Using the notation in Section 2, for an individual we let T represent the binary exposure variable for the smoking status (qsmk) (1 for quitting smoking or not smoking between the first questionnaire and 1982, and 0 otherwise). We want to estimate the average effect of T, $\tau_0 = E\{Y(1)\} - E\{Y(0)\}$, where $Y(1)$ represents the weight an individual would have at the entry of the study if this person had been a nonsmoker or had quit smoking, and $Y(0)$ represents the weight an individual would have at the entry of the study if this person had been a smoker. Let Y denote the actual weight in kilograms (wt) for an individual at the entry of the study.

Since the data are collected from an observational study (and hence smoking status cannot be randomized among the study subjects), directly calculating the difference of the sample averages between the smokers and nonsmokers fails to yield a consistent estimator of $\tau_0$. To control for the confounding effects and use (2) to estimate $\tau_0$, we first need to build a suitable treatment model to calculate the propensity scores.

The dataset we consider contains the following covariates: systolic blood pressure (sbp, in millimeters of mercury (mmHg)), serum cholesterol (cholesterol, in mg/100 mL), diastolic blood pressure (dbp, in mmHg), height in centimeters (ht), average tobacco price in the state of residence in 1982 (price82, in US dollars), age in 1971 (age, in years), sex (0 for male and 1 for female), use of nerve medication (nerves, with 0 representing never and 1 otherwise), use of high blood pressure medication (hbpmed, with 0 representing never and 1 otherwise), and race (0 for white and 1 otherwise). Each variable is standardized by subtracting its sample mean and then dividing by its sample standard deviation.

Blood pressure and cholesterol are error-prone, as discussed by Bauldry et al. 29 and Glasziou et al., 30 respectively. As commented by Lebow and Rudd, 31 (p.168) the price of tobacco is subject to measurement error due to respondents’ imprecise recall of their own expenditure or their unwillingness to report expenditures. Let X represent the vector of error-prone covariates, including sbp ($X_1$), dbp ($X_2$), cholesterol ($X_3$), and price82 ($X_4$). Let Z denote the vector of error-free covariates, including ht, age, sex, nerves, hbpmed, and race. Using the notation in Section 4.1, we have $p_x = 4$ and $p_z = 1 + 6$, with 1 indicating the inclusion of the intercept in Z. We employ model (5) to describe the relationship between the observed surrogate measurement $X^*$ and the value of the true covariate X.

5.2. Analysis using external information

To estimate the covariance matrix in model (5), we make use of external information on repeated measurements of sbp, dbp, cholesterol, and price82. Specifically, two repeated measurements, $X_{i1k}^*$ and $X_{i2k}^*$, of the variables sbp ($X_{i1}$) and dbp ($X_{i2}$) for the same subjects are available at https://archive.ics.uci.edu/ml/datasets/Myocardial+infarction+complications, where $i = 1, \ldots, 574$ and $k = 1, \ldots, n_{(1,2)}$ with $n_{(1,2)} = 2$. Three repeated measurements, $X_{i3k}^*$, of the variable cholesterol ($X_{i3}$) for a group of people are posted at https://www.sheffield.ac.uk/mash/statistics/datasets, where $i = 1, \ldots, 18$ and $k = 1, \ldots, n_3$ with $n_3 = 3$. Eleven repeated measurements, $X_{i4k}^*$, of the variable price82 ($X_{i4}$) for the same individuals are available in the R package “Ecdat,” where $i = 1, \ldots, 48$ and $k = 1, \ldots, n_4$ with $n_4 = 11$.

With those repeated measurements, the estimate of the covariance matrix $\Sigma_e$ is given by

$$\hat{\Sigma}_e = \operatorname{diag}(\hat{\Sigma}_{12}^*, \hat{\sigma}_{33}^{*2}, \hat{\sigma}_{44}^{*2}),$$

where

$$\hat{\Sigma}_{12}^* = \begin{pmatrix} 0.499 & 0.434 \\ 0.434 & 0.543 \end{pmatrix},$$

$\hat{\sigma}_{33}^{*2} = 0.006$, and $\hat{\sigma}_{44}^{*2} = 0.150$ are, respectively, obtained from the three different data sources described above using the method of moments estimator

$$\frac{\sum_{i=1}^{n} \sum_{k=1}^{n_j} (X_{ijk}^* - \bar{X}_{ij}^*)(X_{ijk}^* - \bar{X}_{ij}^*)^T}{\sum_{i=1}^{n} (n_j - 1)}$$

for $j = (1, 2), 3, 4$, where $\bar{X}_{ij}^* = n_j^{-1} \sum_{k=1}^{n_j} X_{ijk}^*$ and the independence of the $X_{ijk}^*$ is assumed for $j = 1, 2, 3, 4$.
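For a single error-prone covariate, the moment estimator above amounts to a pooled within-subject variance, as in the following R sketch; `xrep` is a hypothetical matrix of repeated surrogate measurements with one row per subject and one column per replicate. For the bivariate (sbp, dbp) case, the analogous cross-product formula yields the off-diagonal entry.

```r
# Method-of-moments estimate of the measurement error variance from repeated
# surrogate measurements of one covariate: rows of xrep index subjects,
# columns index the n_j replicates.
moment_sigma_e <- function(xrep) {
  nj   <- ncol(xrep)
  xbar <- rowMeans(xrep)   # subject-specific means
  sum((xrep - xbar)^2) / (nrow(xrep) * (nj - 1))
}
```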

To implement the proposed method in Section 3.1, we consider three forms of the treatment model (3): the logistic, probit, and complementary log–log models; and we examine three extrapolation functions in Steps 3 and 5 of Section 3.1: the quadratic, linear, and rational linear extrapolation functions. For variable selection, we use either the LASSO or the SCAD penalty when implementing the proposed method based on (7). For comparison, we also examine the implementation based on the full model without variable selection, as in Section 4, and call this method “full.” In the top two panels of Table 3, we report the estimates of the model parameters of the logistic regression treatment model obtained from the different methods, as well as the estimation results for $\tau_0$, and defer the results for the other two treatment models to Section B of the Supplemental material.

Table 3.

Analysis results of NHEFS data with propensity scores determined by the logistic model: external data are used to characterize the measurement error degree.

Quadratic Linear RL
Covariate full LASSO SCAD full LASSO SCAD full LASSO SCAD
Intercept −1.162 −1.098 −1.159 −1.145 −1.089 −1.150 −1.093 −1.049 −1.099
sbp 0.052 0.065 0.073
dbp 1.199 1.146 1.188 1.141 1.092 1.098 1.280 1.222 1.274
cholesterol −0.005 −0.029 −0.013
price82 −1.268 −0.216 −0.272 −1.080 −0.024 −0.032 −1.120 −0.062 −0.066
ht 0.060 0.090 0.036
age 0.047 0.268 0.273
sex 0.040 0.029 0.028
nerves −0.050 −0.034 −0.070
hbpmed 0.015 0.022 −0.004
race −1.300 −0.236 −0.296 −1.276 −0.220 −0.280 −1.319 −0.274 −0.324
τ^ 1.313 3.256 3.477 1.478 3.345 3.500 1.600 3.587 3.744
S.E.(τ^) 1.013 0.931 0.925 1.024 0.977 0.963 1.130 1.044 1.038
p-value 0.194 <0.001 <0.001 0.148 <0.001 <0.001 0.157 <0.001 <0.001
τ^F 3.554 3.951 3.375 3.245 3.754 3.609
S.E.(τ^F) 1.312 1.293 0.588 0.590 1.048 1.038
p-value 0.007 0.002 <0.001 <0.001 <0.001 <0.001

ATE: average treatment effect; LASSO: least absolute shrinkage and selection operator; SCAD: smoothly clipped absolute deviation. Headings “Quadratic”, “Linear”, and “RL” refer to the extrapolation function approximated by the quadratic, linear, and rational linear functions, respectively. The top panel reports the results of variable selection for the treatment model; the middle panel displays the estimation results of the ATE τ0 ; and the bottom panel shows the estimation results of the ATE τ0 by forcefully including “age” and “sex” to the selected variables to form the final treatment model to estimate τ0 .

Those results show that, regardless of the model assumption for the treatment model and the extrapolation function form, both the proposed LASSO and SCAD methods suggest the same important covariates, dbp and race, for determining the propensity scores. The estimates of $\tau_0$ produced by the proposed LASSO and SCAD methods are closer to each other than to those obtained from the “full” method, and the former estimates are larger than the latter ones. The standard errors produced by the proposed method with the SCAD penalty are the smallest, and they are closer to those produced by the proposed method with the LASSO penalty than to those obtained from the “full” method.

Under all the settings, the estimates produced by the proposed LASSO or SCAD method have p-values <0.001, revealing evidence that the exposure T affects the average weight difference between nonsmokers and smokers. This finding is in line with the results of Kaufman et al. 32 On the contrary, if irrelevant covariates are not excluded when determining the propensity scores, the resulting p-values derived from the “full” method are all >0.05, showing no evidence that the smoking status affects individuals’ weights.

To reflect the application scenario where the assignment of the treatment must depend on certain covariates, a referee suggested forcefully including some covariates in the final treatment model, together with the selected variables obtained from Steps 1 to 4 in Section 3.1. To this end, here we take the covariates age and sex as those that must enter the final treatment model. Let $\{X_{\mathrm{I}i}, Z_{\mathrm{I}i}\}$ denote the covariates selected from Steps 1 to 4 in Section 3.1. We then modify model (8) by replacing $\{X_{\mathrm{I}i}, Z_{\mathrm{I}i}\}$ with $\{X_{\mathrm{I}i}, Z_{\mathrm{I}i}, \text{age}, \text{sex}\}$, and re-run Step 5 in Section 3.1 to obtain an estimate of $\tau_0$; let $\hat{\tau}_F$ and S.E.$(\hat{\tau}_F)$ denote the resulting estimate and the associated standard error, respectively. The results for the logistic treatment model are presented in the bottom panel of Table 3; the results for the probit and complementary log–log treatment models are summarized, respectively, in the bottom panels of Tables B1 and B2 in the Supplemental material.

5.3. Sensitivity analyses

Using external data helps us estimate the covariance matrix in model (5) as illustrated in Section 5.2. The validity of the analysis in Section 5.2 requires the comparability between the measurement error processes of the external data and the data we analyze, also called the transportability condition. 33 As it is not apparent that this condition holds, here we further carry out sensitivity analyses to understand the impact of measurement error on estimation (e.g. Chen and Yi 27 and Chen and Yi 34 ).

Let $\Sigma_X$ and $\Sigma_{X^*}$ denote the covariance matrices of X and $X^*$, respectively, and let $\sigma_{X,ij}$, $\sigma_{X^*,ij}$, and $\sigma_{e,ij}$ denote the $(i, j)$ entries of $\Sigma_X$, $\Sigma_{X^*}$, and $\Sigma_e$, respectively. Measurement error model (5) suggests that $\sigma_{X,ij}$ is smaller than $\sigma_{X^*,ij}$ for all i and j. To consider possible scenarios of measurement error, we calculate the sample estimate $\hat{\Sigma}_{X^*}$ of $\Sigma_{X^*}$ and set $\sigma_{X,ij} = 0.9\, \hat{\sigma}_{X^*,ij}$, where $\hat{\sigma}_{X^*,ij}$ is the $(i, j)$ entry of $\hat{\Sigma}_{X^*}$, and

$$\hat{\Sigma}_{X^*} = \begin{pmatrix} 1.000 & 0.560 & 0.155 & 0.066 \\ 0.560 & 1.000 & 0.052 & 0.026 \\ 0.155 & 0.052 & 1.000 & 0.053 \\ 0.066 & 0.026 & 0.053 & 1.000 \end{pmatrix}.$$

To specify $\sigma_{e,ij}$, we use the reliability ratio $R_{ij} = \sigma_{X,ij}/\sigma_{X^*,ij} = \sigma_{X,ij}/(\sigma_{X,ij} + \sigma_{e,ij})$ as a guide:

$$\sigma_{e,ij} = (R_{ij}^{-1} - 1)\, \sigma_{X,ij}. \quad (13)$$

For ease of exposition, we take $R_{ij}$ to be a constant for all i and j, denoted R. In our analysis, we specifically consider $R = 0.65$, 0.75, or 0.85, and in the top two panels of Tables 4 to 6 we report the estimates of the parameters of the logistic treatment model as well as the estimation results for $\tau_0$. The results for the probit and complementary log–log treatment models are included in Section B of the Supplemental material.
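In R, this sensitivity specification amounts to a couple of lines; `Sigma_xstar_hat` below stands for the sample matrix $\hat{\Sigma}_{X^*}$ displayed above.

```r
# Sensitivity specification of Sigma_e via the reliability ratio R,
# following (13) with sigma_X = 0.9 * sigma_Xstar elementwise.
make_sigma_e <- function(Sigma_xstar_hat, R) {
  Sigma_x <- 0.9 * Sigma_xstar_hat
  (1 / R - 1) * Sigma_x
}
# Example: Sigma_e under R = 0.75.
# make_sigma_e(Sigma_xstar_hat, R = 0.75)
```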

Table 4.

Sensitivity analyses for NHEFS data with propensity scores determined by the logistic model: a quadratic extrapolation function is used.

R=0.65 R=0.75 R=0.85
Covariate full LASSO SCAD full LASSO SCAD full LASSO SCAD
intercept −1.292 −1.501 −1.161 −1.303 −1.191 −1.124 −1.792 −1.895 −1.570
sbp 0.626 0.236 0.152
dbp 2.427 1.915 2.510 3.301 2.575 3.347 3.212 2.758 3.434
cholesterol −0.057 −0.197 −0.250
price82 −1.321 −0.530 −0.538 −1.087 −0.414 −0.406 −1.281 −0.384 −0.384
ht 0.011 0.011 0.011
age 0.020 0.022 0.024
sex 0.078 0.068 0.037
nerves −0.235 −0.179 −0.190
hbpmed 0.023 0.022 0.013
race −1.034 −0.243 −0.251 −1.028 −0.225 −0.217 −1.098 −0.201 −0.201
τ^ 1.012 3.155 3.617 1.497 3.631 3.713 1.424 3.672 3.286
S.E.(τ^) 1.411 1.203 1.022 1.658 1.259 1.104 1.623 1.375 1.259
p-value 0.473 0.008 <0.001 0.366 0.003 <0.001 0.380 0.007 0.009
τ^F 3.886 3.424 3.633 3.915 3.094 3.143
S.E.(τ^F) 1.152 1.474 1.296 0.898 1.379 1.119
p-value 0.001 0.020 0.005 <0.001 0.024 0.005

ATE: average treatment effect; LASSO: least absolute shrinkage and selection operator; SCAD: smoothly clipped absolute deviation. The top panel reports the results of variable selection for the treatment model; the middle panel displays the estimation results of the ATE τ0 ; and the bottom panel shows the estimation results of the ATE τ0 by forcefully including “age” and “sex” to the selected variables to form the final treatment model to estimate τ0 .

Table 6.

Sensitivity analyses for NHEFS data with propensity scores determined by the logistic model: a rational linear extrapolation function is used.

R=0.65 R=0.75 R=0.85
Covariate full LASSO SCAD full LASSO SCAD full LASSO SCAD
intercept −1.163 −1.116 −1.170 −1.145 −1.092 −1.145 −1.149 −1.099 −1.156
sbp 0.069 0.035 0.062
dbp 1.295 1.234 1.288 1.277 1.224 1.277 1.208 1.152 1.187
cholesterol −0.001 −0.008 −0.002
price82 −0.080 −0.033 −0.033 −0.103 −0.050 −0.050 −0.100 −0.050 −0.051
ht 0.100 0.091 0.093
age 0.271 0.260 0.260
sex 0.011 0.011 0.014
nerves −0.070 −0.041 −0.067
hbpmed 0.012 0.016 0.015
race −0.815 −0.268 −0.322 −0.827 −0.275 −0.328 −0.834 −0.254 −0.312
τ^ 1.236 3.188 3.433 0.955 3.209 3.107 1.436 3.139 3.650
S.E.(τ^) 1.520 1.062 1.024 1.255 1.189 1.160 1.198 0.961 0.918
p-value 0.219 <0.001 <0.001 0.447 0.001 0.002 0.231 <0.001 <0.001
τ^F 3.210 3.084 3.166 3.073 3.796 3.445
S.E.(τ^F) 1.025 1.014 0.803 0.737 1.216 1.177
p-value 0.002 0.002 <0.001 <0.001 0.002 0.003

ATE: average treatment effect; LASSO: least absolute shrinkage and selection operator; SCAD: smoothly clipped absolute deviation. The top panel reports the results of variable selection for the treatment model; the middle panel displays the estimation results of the ATE τ0 ; and the bottom panel shows the estimation results of the ATE τ0 by forcefully including “age” and “sex” to the selected variables to form the final treatment model to estimate τ0 .

Table 5.

Sensitivity analyses for NHEFS data with propensity scores determined by the logistic model: a linear extrapolation function is used.

R=0.65 R=0.75 R=0.85
Covariate full LASSO SCAD full LASSO SCAD full LASSO SCAD
intercept −1.478 −1.031 −1.387 −1.956 −1.441 −1.866 −1.128 −1.648 −1.032
sbp 0.108 0.191 0.130
dbp 1.117 0.888 1.123 1.265 0.967 1.202 1.186 0.929 1.172
cholesterol −0.016 −0.019 −0.052
price82 −0.598 −0.151 −0.169 −0.470 −0.145 −0.153 −0.428 −0.133 −0.146
ht 0.008 0.008 0.009
age 0.022 0.023 0.023
sex 0.049 0.072 0.066
nerves −0.171 −0.153 −0.140
hbpmed 0.009 0.005 0.003
race −0.881 −0.434 −0.493 −0.898 −0.383 −0.402 −0.900 −0.420 −0.451
τ^ 1.658 3.713 3.520 1.212 3.915 3.756 1.249 3.844 3.652
S.E.(τ^) 1.155 1.088 1.128 1.612 1.014 0.997 1.374 1.245 1.224
p-value 0.265 0.001 0.002 0.452 <0.001 <0.001 0.363 0.001 0.003
τ^F 3.523 3.501 3.467 3.263 3.055 2.977
S.E.(τ^F) 1.216 1.215 0.883 0.900 1.050 1.048
p-value 0.004 0.004 <0.001 <0.001 0.004 0.003

ATE: average treatment effect; LASSO: least absolute shrinkage and selection operator; SCAD: smoothly clipped absolute deviation. The top panel reports the results of variable selection for the treatment model; the middle panel displays the estimation results of the ATE τ0; and the bottom panel shows the estimation results of the ATE τ0 by forcefully including “age” and “sex” to the selected variables to form the final treatment model to estimate τ0.

The estimation results are consistent in nature with the numerical results reported in Section 5.2. The variables dbp and race are selected to determine the propensity scores regardless of the value of R. For the estimation of $\tau_0$, the proposed method with either the LASSO or the SCAD penalty shows a significant effect of T, whereas the results obtained from the “full” model do not reveal such an effect.

Similar to Section 5.2, here we report the sensitivity analysis results for $\hat{\tau}_F$ and S.E.$(\hat{\tau}_F)$. The bottom panels of Tables 4 to 6 show the results for the logistic regression form of the treatment model. The results under the probit and complementary log–log treatment models are placed in the bottom panels of Tables B3 to B8 in the Supplemental material.

6. Discussions and extensions

The inverse probability weighting estimation method and its variants have proved useful for estimating the average treatment effect in the causal inference framework. Despite the popularity of these methods, their applications are hindered by two critical conditions: their validity relies on the proper determination of the propensity scores and on the precise measurement of the confounders. In this article, we develop a simulation-based method by adapting the inverse probability weighting scheme to accommodate measurement error effects as well as variable selection for calculating the propensity scores.

To highlight the idea, we present the proposed method by assuming the covariance matrix Σe for model (5) to be given. This condition is needed only for Step 1 in Section 3. However, it is restrictive for applications. To circumvent the issues induced by unknown Σe , we often need an additional data source to gain an understanding of the measurement error degree. If a validation sample having measurements for both Xi and Xi* is available, Σe can be estimated by fitting model (5) to the validation data. If a prior study on the same variables is available for the estimation of Σe , one may use the estimated Σe to implement Step 1 in Section 3. When repeated surrogate measurements are available, one may use them to estimate Σe , as done in Section 5.2, or alternatively, one may modify Step 1 in Section 3 to get around the problem of unknown Σe by applying the method of Devanarayan and Stefanski. 35 In circumstances where no additional data sources are available, we often conduct sensitivity analyses to understand the impact of different degrees of measurement error on estimation of τ0 , as demonstrated in Section 5.3. Basically, we first specify a sequence of values for the covariance matrix Σe to reflect possible scenarios of the measurement error process, and then we apply the proposed method to conduct the estimation of τ0 for each specified Σe . Finally, we evaluate the sensitivity of estimation results to different magnitudes of measurement error. Such a study helps us understand how varying amounts of measurement error may affect estimation results.

The proposed method has a major limitation. As discussed in Section 3.1, the extrapolation functions in Steps 3 and 5 are unknown in applications and can only be approximated by user-specified functions. In line with many studies of the SIMEX algorithm, we approximate the underlying true extrapolation functions in our numerical studies by one of three choices: quadratic, linear, and rational linear functions. The simulation studies show that quadratic functions tend to outperform linear and rational linear functions, consistent with the numerical experience reported in the literature (Carroll et al., 17 Section 5.3.2). We stress that the inference results can be sensitive to the choice of the approximating function. It would thus be useful to develop a procedure for constructing data-driven functions that approximate the extrapolation functions well and thereby yield robust inference results. One possible approach is to start with a class of candidate functions, say $\mathcal{F}$, repeat the steps in Section 3.1 with each $f \in \mathcal{F}$ taken to approximate the extrapolation function, and let $\hat{\tau}_f$ and $v(\hat{\tau}_f)$ denote the resulting estimator and its associated variance, respectively. The optimal estimator of $\tau_0$, denoted $\hat{\tau}_{f_0}$, is then the estimator derived from the function $f_0 = \operatorname{argmin}_{f \in \mathcal{F}} v(\hat{\tau}_f)$, which minimizes $v(\hat{\tau}_f)$. It would be interesting to explore this idea carefully through numerical studies and theoretical justifications for various classes $\mathcal{F}$.
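As a concrete illustration of these steps, the sketch below fits a user-specified extrapolant to the SIMEX pairs $\{(\psi_j, \hat{\tau}(\psi_j))\}$, evaluates it at $\psi = -1$, and includes a skeleton of the variance-minimizing selection of $f_0$ just described. The function names, the grid, and the attenuation pattern in the example are hypothetical, and the variance estimates $v(\hat{\tau}_f)$ are assumed to be supplied externally (e.g. by a bootstrap).

```python
import numpy as np

def extrapolate(psi_grid, tau_hat, kind="quadratic"):
    """Fit a user-specified approximate extrapolant f to the SIMEX pairs
    {(psi_j, tau_hat(psi_j))} and evaluate it at psi = -1."""
    psi = np.asarray(psi_grid, dtype=float)
    tau = np.asarray(tau_hat, dtype=float)
    degree = {"linear": 1, "quadratic": 2}[kind]
    coef = np.polyfit(psi, tau, deg=degree)   # least-squares polynomial fit
    return np.polyval(coef, -1.0)             # extrapolation to psi = -1

def select_extrapolant(psi_grid, tau_hat, variance_of):
    """Pick f0 = argmin_{f in F} v(tau_hat_f) over a candidate class F;
    `variance_of(f)` must supply an estimate of v(tau_hat_f), e.g. from
    the sandwich formula or a bootstrap (not shown here)."""
    F = ["linear", "quadratic"]
    f0 = min(F, key=variance_of)
    return f0, extrapolate(psi_grid, tau_hat, kind=f0)

# Hypothetical illustration: tau_hat(psi) computed on a grid of psi values,
# with an attenuation pattern 1/(1 + 0.5 psi) chosen purely for display.
psi_grid = np.linspace(0.0, 2.0, 9)
tau_hat = 1.0 / (1.0 + 0.5 * psi_grid)
print(round(extrapolate(psi_grid, tau_hat, "quadratic"), 3))
```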

This research can be sharpened through several extensions, outlined as follows. The proposed method focuses on the case where the error-prone variables are all continuous. If the error-prone variables are all discrete, one may modify the development here by replacing the SIMEX steps with the MC-SIMEX algorithm of Küchenhoff et al. 36 It is also interesting to generalize the development to accommodate a mix of error-contaminated discrete and continuous variables, for which one may adapt the augmented simulation–extrapolation method developed by Yi et al. 33 and discussed by Chen and Yi 27 and Zhang and Yi. 37 To shed light on this extension, we consider the case with the vector $X_i$ only, where $X_i = (X_{Di}, X_{Ci}^T)^T$, with $X_{Di}$ representing a binary variable and $X_{Ci}$ a subvector of continuous variables. Let $X_{Di}^*$ and $X_{Ci}^*$ denote the observed surrogate measurements of $X_{Di}$ and $X_{Ci}$, respectively. Let $p_i = P(X_{Di}^* = 0 \mid X_{Di} = 1, X_{Ci})$ and $q_i = P(X_{Di}^* = 1 \mid X_{Di} = 0, X_{Ci})$ be the conditional misclassification probabilities, given $X_i$, where the dependence of $p_i$ and $q_i$ on $X_i$ is suppressed in the notation.

One may employ logistic regression models to describe the dependence of the misclassification probabilities on the true variables $X_i$:

$$\operatorname{logit} p_i = \alpha_{01} + \alpha_{x1}^T X_{Ci}; \qquad \operatorname{logit} q_i = \alpha_{00} + \alpha_{x0}^T X_{Ci},$$

where $\alpha = (\alpha_{01}, \alpha_{x1}^T, \alpha_{00}, \alpha_{x0}^T)^T$ is the vector of regression parameters.

For the error-prone continuous variables $X_{Ci}$, we consider a model similar to (5):

$$X_{Ci}^* = X_{Ci} + e_{Ci},$$

where the error term $e_{Ci}$ is independent of $\{T_i, X_i, X_{Di}^*, Y_{(0)i}, Y_{(1)i}\}$ and follows $N(0, \Sigma_{Ce})$ with covariance matrix $\Sigma_{Ce}$.

Now we write $S_i(\gamma; X_i)$ in (4) as $S_i(\gamma; X_{Di}, X_{Ci})$ to show separately the involvement of the discrete and continuous covariates. Define

$$S_i^*(\beta; \gamma, X_{Di}^*, X_{Ci}) = (1 - p_i - q_i)^{-1} \big[ (1 - X_{Di}^*) \{ (1 - p_i) S_i(\beta; \gamma, 0, X_{Ci}) - q_i S_i(\beta; \gamma, 1, X_{Ci}) \} - X_{Di}^* \{ p_i S_i(\beta; \gamma, 0, X_{Ci}) - (1 - q_i) S_i(\beta; \gamma, 1, X_{Ci}) \} \big],$$

where $S_i(\beta; \gamma, r, X_{Ci})$ represents $S_i(\gamma; X_{Di}, X_{Ci})$ with $X_{Di}$ taking the value r, for r = 0, 1.

By Problem 2.10 of Yi 18 (p. 83), $S_i^*(\beta; \gamma, X_{Di}^*, X_{Ci})$ is an unbiased estimating function expressed in terms of the observed binary variable $X_{Di}^*$, together with $X_{Ci}$. To address the error effects induced by replacing $X_{Ci}$ with its surrogate $X_{Ci}^*$, we may then employ the implementation procedure in Section 3.1, with $S_i(\beta; \gamma, X_{Di}, X_{Ci})$ replaced by $S_i^*(\beta; \gamma, X_{Di}^*, X_{Ci})$ in Step 2. The same steps then carry through to yield an estimator of $\tau_0$ with the measurement error effects in both $X_{Di}$ and $X_{Ci}$ accounted for.
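To make the correction concrete, here is a minimal numerical sketch, in our own notation, of the corrected function displayed above for a single binary covariate, together with a check that its conditional expectation given the true $X_{Di}$ recovers the naive score. The score values and the probabilities p and q are hypothetical; in practice $p_i$ and $q_i$ would be obtained from models such as the logistic forms above.

```python
import numpy as np

def s_star(s0, s1, x_star, p, q):
    """Misclassification-corrected estimating function for a binary
    covariate, following the display above: s0 and s1 are the naive
    scores evaluated at X_D = 0 and X_D = 1; x_star is the observed
    surrogate; p = P(X* = 0 | X = 1) and q = P(X* = 1 | X = 0)."""
    return ((1 - x_star) * ((1 - p) * s0 - q * s1)
            - x_star * (p * s0 - (1 - q) * s1)) / (1 - p - q)

# Unbiasedness check: E{S*(X*) | X} should recover the naive score S(X).
p, q = 0.15, 0.10
s0, s1 = 2.0, -1.3          # hypothetical score values at X_D = 0 and 1
# Given X = 1: X* = 0 with probability p, X* = 1 with probability 1 - p.
e_given_1 = p * s_star(s0, s1, 0, p, q) + (1 - p) * s_star(s0, s1, 1, p, q)
# Given X = 0: X* = 1 with probability q, X* = 0 with probability 1 - q.
e_given_0 = q * s_star(s0, s1, 1, p, q) + (1 - q) * s_star(s0, s1, 0, p, q)
print(np.isclose(e_given_1, s1), np.isclose(e_given_0, s0))  # True True
```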

For the more general case with multiple binary or categorical variables subject to mismeasurement, one may use the same ideas to develop an inference procedure by introducing a misclassification matrix and modifying the construction of the function $S_i^*(\beta; \gamma, X_{Di}^*, X_{Ci})$ accordingly, at the cost of more involved notation. A careful study is warranted to work out the technical details.

In Step 4 of the implementation procedure in Section 3.1, a common penalty function $p_\lambda(\cdot)$ is applied to the components of $\gamma$ to construct the penalized quadratic loss function (6). This practice essentially treats all the pre-treatment variables in $W_i$ in model (3) identically. In applications, subject-matter knowledge may dictate that certain pre-treatment variables be included when building the treatment model. In this case, the penalized quadratic loss function (6) can be modified by not imposing a penalty on the parameters corresponding to such variables, as sketched below. Alternatively, one may first carry out the variable selection developed here and then add the must-include variables (if not selected) to the selected variables to build the final treatment model. It would be useful to explore these two strategies in depth.
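The first strategy amounts to attaching a per-coefficient penalty factor to the penalty term and setting that factor to zero for the variables that must stay in the model. The following is a minimal sketch of this device under a simple least-squares loss rather than the actual loss (6); all names and numbers are ours, for illustration only.

```python
import numpy as np

def lasso_with_forced_in(X, y, lam, penalty_factor, n_iter=500):
    """Proximal-gradient LASSO with per-coefficient penalty factors:
    components with penalty_factor[j] = 0 (e.g. variables that
    subject-matter knowledge says must stay in the treatment model)
    are never shrunk. A sketch on a least-squares loss, not (6) itself."""
    n, d = X.shape
    lr = n / np.linalg.norm(X, 2) ** 2        # inverse Lipschitz constant
    beta = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n       # gradient of (1/2n)||y - Xb||^2
        z = beta - lr * grad
        thr = lr * lam * penalty_factor       # zero threshold => no shrinkage
        beta = np.sign(z) * np.maximum(np.abs(z) - thr, 0.0)
    return beta

# Example: 5 candidate confounders; the first two ("age", "sex") forced in.
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 5))
y = X @ np.array([0.8, -0.5, 0.0, 0.0, 0.3]) + rng.normal(scale=0.5, size=200)
pf = np.array([0.0, 0.0, 1.0, 1.0, 1.0])      # zero = unpenalized
print(np.round(lasso_with_forced_in(X, y, lam=0.1, penalty_factor=pf), 2))
```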

Our development is rooted in the use of the parametric model (3) to describe propensity scores, a choice primarily driven by the attractive features of parametric modeling. Parametric approaches are more effective than nonparametric methods in handling data of large dimension, and they allow us to address measurement error effects using existing methods. Further, asymptotic results for the resulting estimator can be established, enabling rigorous statistical inference. Despite these advantages, the parametric modeling scheme is vulnerable to model misspecification. To ameliorate this, machine learning methods have been employed to characterize propensity scores, including classification and regression trees (e.g. Lee et al. 38 ), neural networks and support vector machines (e.g. Westreich et al. 39 ), and cross-fit estimators (e.g. Zivich and Breskin 40 ). It would be interesting to extend our development by using such machine learning methods to delineate the treatment model, though one would need to carefully investigate the typical issues associated with machine learning methods, such as the lack of transparent interpretation, the "black-box" nature of the algorithms, and the bias–variance trade-off.

Supplemental Material

sj-pdf-1-smm-10.1177_09622802221146308: Supplemental material for this article, published in Statistical Methods in Medical Research.

Acknowledgements

The authors thank the Editor and the review team for their helpful comments on the initial submission. Yi is Canada Research Chair in Data Science (Tier 1). Her research was undertaken, in part, thanks to funding from the Canada Research Chairs Program.

Appendix: Proof of Theorem 3.2

Let $\gamma_I$ denote the parameters of the selected treatment model (8) and let $\phi(\gamma_I; T_i, X_{Ii}, Z_{Ii})$ denote the score function for $\gamma_I$ derived from model (8). Let $\theta = (\gamma_I^T, \tau)^T$ denote the parameters and $\theta_0 = (\gamma_{I0}^T, \tau_0)^T$ their true values. Let $\hat{\theta} = (\hat{\gamma}_I^T, \hat{\tau})^T$ denote the estimator of $\theta$ referred to in Theorems 3.1 and 3.2.

Define

$$U_i(\theta; Y_i, T_i, X_{Ii}, Z_{Ii}) = \begin{pmatrix} \phi(\gamma_I; T_i, X_{Ii}, Z_{Ii}) \\[4pt] \dfrac{T_i Y_i}{\pi_i} - \dfrac{(1 - T_i) Y_i}{1 - \pi_i} - \tau \end{pmatrix}. \tag{14}$$

It is readily shown that $U_i(\theta; Y_i, T_i, X_{Ii}, Z_{Ii})$ is an unbiased estimating function for $\theta$, i.e.

$$E\{U_i(\theta; Y_i, T_i, X_{Ii}, Z_{Ii})\} = 0,$$

where the expectation is evaluated with respect to the joint distribution of the associated random variables.
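To fix ideas, setting the second component of (14) to zero and solving for $\tau$ yields the inverse probability weighting estimator. Below is a minimal synthetic sketch of this step; the data-generating values are hypothetical, and the propensity scores are treated as known purely for simplicity.

```python
import numpy as np

def tau_ipw(y, t, pi):
    """Solve the tau-component of the estimating function (14):
    sum_i { T_i Y_i / pi_i - (1 - T_i) Y_i / (1 - pi_i) - tau } = 0,
    whose root is the inverse probability weighting estimator."""
    return np.mean(t * y / pi - (1 - t) * y / (1 - pi))

# Synthetic illustration (all quantities hypothetical): one confounder x,
# true ATE = 1, propensity scores taken as known for simplicity.
rng = np.random.default_rng(3)
n = 5000
x = rng.normal(size=n)
pi = 1.0 / (1.0 + np.exp(-0.8 * x))      # true propensity scores
t = rng.binomial(1, pi)
y = 1.0 * t + 0.5 * x + rng.normal(size=n)
print(round(tau_ipw(y, t, pi), 3))       # close to the true ATE of 1
```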

Now we derive the asymptotic distribution of the estimator $\hat{\theta}$ of $\theta$ by modifying the arguments of Yi et al. 26 and Carroll et al. 28 For each k and $\psi$, define $U_i^*(\theta; k, \psi) = U_i(\theta; Y_i, T_i, X_{Ii}^*(k, \psi), Z_{Ii})$, where $X_{Ii}^*(k, \psi)$ is the subvector of $X_i^*(k, \psi)$ corresponding to the subvector $X_{Ii}^*$ of $X_i^*$. Since, for a given $\psi$, the $X_{Ii}^*(k, \psi)$ are independent and identically distributed across $k = 1, \ldots, K$, the solutions of $E[U_i^*(\theta; k, \psi)] = 0$ are free of the value of k, where the expectation is evaluated under the distribution of the associated variables. Assuming that this equation has a unique solution, we let $\theta(\psi)$ denote that solution.

Assume further that, for each k and $\psi$,

$$\sum_{i=1}^n U_i^*(\theta; k, \psi) = 0 \tag{15}$$

has a unique solution, and let $\hat{\theta}(k, \psi)$ denote that solution. Applying Theorem 1 of Yi and Reid 41 then gives

$$\hat{\theta}(k, \psi) \xrightarrow{p} \theta(\psi) \quad \text{as } n \to \infty.$$

Applying a Taylor series expansion to (15) gives

$$\sqrt{n}\{\hat{\theta}(k, \psi) - \theta(\psi)\} = n^{-1/2} \sum_{i=1}^n [\Gamma(\psi)]^{-1} U_i^*(\theta(\psi); k, \psi) + o_p(1), \tag{16}$$

where $\Gamma(\psi) = E\{-\partial U_i^*(\theta(\psi); k, \psi)/\partial \theta^T\}$.

Let

$$\hat{\theta}(\psi) = \frac{1}{K} \sum_{k=1}^K \hat{\theta}(k, \psi) \quad \text{and} \quad V_i(\psi) = \frac{1}{K} \sum_{k=1}^K [\Gamma(\psi)]^{-1} U_i^*(\theta(\psi); k, \psi).$$

Thus, summing (16) over k and then dividing by K leads to

$$\sqrt{n}\{\hat{\theta}(\psi) - \theta(\psi)\} = n^{-1/2} \sum_{i=1}^n V_i(\psi) + o_p(1) \quad \text{for } \psi \in \mathcal{C}. \tag{17}$$

Now we examine the extrapolation step used to obtain the SIMEX estimator $\hat{\theta}$. Let d be the dimension of $\theta$. Suppose the exact extrapolation function is known; that is, there is a known $d \times 1$ vector $h(\cdot)$ of functions of the M d-dimensional arguments such that the SIMEX estimator can be written as

$$\hat{\theta} = h(\hat{\theta}(\psi_1), \ldots, \hat{\theta}(\psi_M)),$$

and the true parameter value $\theta_0$ is related to $\{\theta(\psi_1), \ldots, \theta(\psi_M)\}$ by

$$\theta_0 = h(\theta(\psi_1), \ldots, \theta(\psi_M)).$$

For $j = 1, \ldots, M$, let $\dot{h}_j = \partial h(\theta(\psi_1), \ldots, \theta(\psi_M)) / \partial \theta^T(\psi_j)$. Applying a Taylor series expansion to $h(\hat{\theta}(\psi_1), \ldots, \hat{\theta}(\psi_M))$ gives

$$h(\hat{\theta}(\psi_1), \ldots, \hat{\theta}(\psi_M)) = h(\theta(\psi_1), \ldots, \theta(\psi_M)) + \sum_{j=1}^M \dot{h}_j \{\hat{\theta}(\psi_j) - \theta(\psi_j)\} + o_p(n^{-1/2}),$$

and thus

$$\sqrt{n}(\hat{\theta} - \theta_0) = \sum_{j=1}^M \dot{h}_j \sqrt{n}\{\hat{\theta}(\psi_j) - \theta(\psi_j)\} + o_p(1).$$

Define $H_i = \sum_{j=1}^M \dot{h}_j V_i(\psi_j)$. Then, by (17), we have

$$\sqrt{n}(\hat{\theta} - \theta_0) = n^{-1/2} \sum_{i=1}^n H_i + o_p(1). \tag{18}$$

Applying the central limit theorem to (18) gives

$$\sqrt{n}(\hat{\theta} - \theta_0) \xrightarrow{d} N(0, \Sigma_{\text{SIM}}) \quad \text{as } n \to \infty, \tag{19}$$

where $\Sigma_{\text{SIM}} = \operatorname{var}(H_i)$. Let $v(\tau_0)$ denote the lower-right corner element of $\Sigma_{\text{SIM}}$. Then (19) yields that $\sqrt{n}(\hat{\tau} - \tau_0)$ has an asymptotic normal distribution with mean zero and variance $v(\tau_0)$.
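In practice, $\Sigma_{\text{SIM}}$ can be estimated by the sample covariance of estimated influence contributions $\hat{H}_i$. The following sketch illustrates this final step; the matrix H and its dimensions are hypothetical, and estimating the $H_i$ themselves requires the quantities derived above.

```python
import numpy as np

def simex_variance(H):
    """Given the n x d matrix of (estimated) influence contributions H_i
    from (18), estimate Sigma_SIM = var(H_i) and read off v(tau_0) as the
    lower-right corner element, so that an approximate standard error of
    tau_hat is sqrt(v / n). A sketch; H_i must itself be estimated."""
    n, d = H.shape
    sigma_sim = np.cov(H, rowvar=False)      # d x d sample covariance
    v_tau = sigma_sim[-1, -1]                # lower-right corner element
    return sigma_sim, np.sqrt(v_tau / n)     # (Sigma_SIM, SE of tau_hat)

# Hypothetical illustration with d = 3 (two gamma components plus tau):
rng = np.random.default_rng(11)
H = rng.normal(size=(400, 3)) @ np.diag([1.0, 1.5, 2.0])
print(simex_variance(H)[1])                  # approximately 2/sqrt(400) = 0.1
```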

Footnotes

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding: The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC). Yi’s research was undertaken, in part, by funding from the Canada Research Chairs Program.

Supplemental material: Supplemental material for this article is available online.

References

1. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika 1983; 70: 41–55.
2. Rosenbaum PR, Rubin DB. Reducing bias in observational studies using subclassification on the propensity score. J Am Stat Assoc 1984; 79: 516–524.
3. Lunceford JK, Davidian M. Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Stat Med 2004; 23: 2937–2960.
4. Bang H, Robins JM. Doubly robust estimation in missing data and causal inference models. Biometrics 2005; 61: 962–973.
5. Westreich D, Cole SR, Funk MJ, et al. The role of the c-statistic in variable selection for propensity score models. Pharmacoepidemiol Drug Saf 2011; 20: 317–320.
6. Shortreed SM, Ertefaie A. Outcome-adaptive lasso: variable selection for causal inference. Biometrics 2017; 73: 1111–1122.
7. Ertefaie A, Asgharian M, Stephens DA. Variable selection in causal inference using a simultaneous penalization method. J Causal Inference 2018; 6: 20170010.
8. Koch B, Vock DM, Wolfson J, et al. Variable selection and estimation in causal inference using Bayesian spike and slab priors. Stat Methods Med Res 2020; 29: 2445–2469.
9. Ghosh D, Zhu Y, Coffman DL. Penalized regression procedures for variable selection in the potential outcomes framework. Stat Med 2015; 34: 1645–1658.
10. Vansteelandt S, Bekaert M, Claeskens G. On model selection and model misspecification in causal inference. Stat Methods Med Res 2010; 21: 7–30.
11. Imai K, Yamamoto T. Causal inference with differential measurement error: nonparametric identification and sensitivity analysis. Am J Pol Sci 2010; 54: 543–560.
12. Edwards J, Cole SR, Westreich D. All your data are always missing: incorporating bias due to measurement error into the potential outcomes framework. Int J Epidemiol 2015; 44: 1452–1459.
13. McCaffrey DF, Lockwood JR, Setodji CM. Inverse probability weighting with error-prone covariates. Biometrika 2013; 100: 671–680.
14. Shu D, Yi GY. Causal inference with measurement error in outcomes: bias analysis and estimation methods. Stat Methods Med Res 2019; 28: 2049–2068.
15. Shu D, Yi GY. Inverse-probability-of-treatment weighted estimation of causal parameters in the presence of error-contaminated and time-dependent confounders. Biometrical J 2019; 61: 1507–1525.
16. Kyle RP, Moodie EEM, Klein MB, et al. Correcting for measurement error in time-varying covariates in marginal structural models. Am J Epidemiol 2016; 184: 249–258.
17. Carroll RJ, Ruppert D, Stefanski LA, et al. Measurement error in nonlinear models. 2nd ed. Boca Raton, FL: Chapman & Hall, 2006.
18. Yi GY. Statistical analysis with measurement error or misclassification: strategy, method and application. New York: Springer, 2017.
19. Yi GY, Delaigle A, Gustafson P. Handbook of measurement error models. Boca Raton, FL: Chapman & Hall/CRC, 2021.
20. Cook JR, Stefanski LA. Simulation-extrapolation estimation in parametric measurement error models. J Am Stat Assoc 1994; 89: 1314–1328.
21. Zou H, Li R. One-step sparse estimates in nonconcave penalized likelihood models. Ann Stat 2008; 36: 1509–1533.
22. Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Ser B 1996; 58: 267–288.
23. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 2001; 96: 1348–1360.
24. Wang H, Li R, Tsai C-L. Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika 2007; 94: 553–568.
25. Zhang Y, Li R, Tsai C-L. Regularization parameter selections via generalized information criterion. J Am Stat Assoc 2010; 105: 312–323.
26. Yi GY, Tan X, Li R. Variable selection and inference procedures for marginal analysis of longitudinal data with missing observations and covariate measurement error. Can J Stat 2015; 43: 498–518.
27. Chen L-P, Yi GY. Analysis of noisy survival data with graphical proportional hazards measurement error models. Biometrics 2021; 77: 956–969.
28. Carroll RJ, Lombard F, Küchenhoff H, et al. Asymptotics for the SIMEX estimator in structural measurement error models. J Am Stat Assoc 1996; 91: 242–250.
29. Bauldry S, Bollen KA, Adair LS. Evaluating measurement error in readings of blood pressure for adolescents and young adults. Blood Press 2015; 24: 96–102.
30. Glasziou PP, Irwig L, Heritier S, et al. Monitoring cholesterol levels: measurement error or true change? Ann Intern Med 2008; 148: 656–661.
31. Lebow DE, Rudd JB. Measurement error in the consumer price index: where do we stand? J Econ Lit 2003; 41: 159–201.
32. Kaufman A, Augustson EM, Patrick H. Unraveling the relationship between smoking and weight: the role of sedentary behavior. J Obes 2012; Article ID 735465.
33. Yi GY, Ma Y, Spiegelman D, Carroll RJ. Functional and structural methods with mixed measurement error and misclassification in covariates. J Am Stat Assoc 2015; 110: 681–696.
34. Chen L-P, Yi GY. Model selection and model averaging for analysis of truncated and censored data with measurement error. Electron J Stat 2020; 14: 4054–4109.
35. Devanarayan V, Stefanski LA. Empirical simulation extrapolation for measurement error models with replicate measurements. Stat Probab Lett 2002; 59: 219–225.
36. Küchenhoff H, Mwalili SM, Lesaffre E. A general method for dealing with misclassification in regression: the misclassification SIMEX. Biometrics 2006; 62: 85–96.
37. Zhang Q, Yi GY. R package for analysis of data with mixed measurement error and misclassification in covariates: augSIMEX. J Stat Comput Simul 2019; 89: 2293–2315.
38. Lee BK, Lessler J, Stuart EA. Improving propensity score weighting using machine learning. Stat Med 2010; 29: 337–346.
39. Westreich D, Lessler J, Funk MJ. Propensity score estimation: neural networks, support vector machines, decision trees (CART), and meta-classifiers as alternatives to logistic regression. J Clin Epidemiol 2010; 63: 826–833.
40. Zivich PN, Breskin A. Machine learning for causal inference: on the use of cross-fit estimators. Epidemiology 2021; 32: 393–401.
41. Yi GY, Reid N. A note on misspecified estimating functions. Stat Sin 2010; 20: 1749–1769.
