Modeling Longitudinal Change in Biomarkers Using Data from a Complex Survey Sampling Design: An Application to the Hispanic Community Health Study/Study of Latinos (HCHS/SOL)

Nicole M Butera; Donglin Zeng; Gerardo Heiss; Jianwen Cai

doi:10.1002/sim.9635

Author manuscript; available in PMC: 2024 Mar 13.

Published in final edited form as: Stat Med. 2023 Jan 11;42(5):632–655. doi: 10.1002/sim.9635

PMCID: PMC10936944 NIHMSID: NIHMS1962085 PMID: 36631123

Summary

In observational cohort studies, there is frequently interest in modeling longitudinal change in a biomarker (i.e., physiological measure indicative of metabolic dysregulation or disease; e.g., blood pressure) in the absence of treatment (i.e., medication), and its association with modifiable risk factors expected to affect health (e.g., body mass index). However, individuals may start treatment during the study period, and consequently biomarker values observed while on treatment may be different than those that would have been observed in the absence of treatment. If treated individuals are excluded from analysis, then effect estimates may be biased if treated individuals differ systematically from untreated individuals. We addressed this concern in the setting of the Hispanic Community Health Study/Study of Latinos (HCHS/SOL), an observational cohort study that employed a complex survey sampling design to enable inference to a finite target population. We considered biomarker values measured while on treatment to be missing data, and applied missing data methodology (inverse probability weighting (IPW) and doubly robust estimation) to this problem. The proposed methods leverage information collected between study visits on when individuals started treatment, by adapting IPW and doubly robust approaches to model the treatment mechanism using survival analysis methods. This methodology also incorporates sampling weights and uses a bootstrap approach to estimate standard errors accounting for the complex survey sampling design. We investigated variance estimation for these methods, conducted simulation studies to assess statistical performance in finite samples, and applied the methodology to model temporal change in blood pressure in HCHS/SOL.

Keywords: Blood pressure, Complex survey, Doubly robust, Inverse probability weighting, Longitudinal change

1 |. INTRODUCTION

The Hispanic Community Health Study/Study of Latinos (HCHS/SOL) is a large, community-based, ongoing longitudinal cohort study of Hispanic/Latino adults living in the US¹. HCHS/SOL sampled 16,415 Hispanic/Latino adults from 4 US cities (Bronx, NY; Chicago, IL; Miami, FL; San Diego, CA) using a complex survey sampling design. Details of the sampling design used in HCHS/SOL were described by LaVange et al², and are summarized briefly here. A stratified three-stage sampling design was implemented within each study field center, where stage 1 sampled census block groups, stage 2 sampled households within each sampled block group, and stage 3 sampled individuals within each sampled household. A stratified simple random sample was implemented at each stage, where probability of selection at each stage differed by stratum. Study participants attended an initial baseline clinic visit (2008–2011), where a variety of survey and clinical measures were collected. Every year since baseline, participants are contacted by phone to participate in an annual follow-up survey, collecting health data such as recommended changes to treatment regimens. All living study participants were invited to a follow-up clinic visit (2014–2017), which included collection of many of the same variables that were collected at the baseline visit. Due to the complex survey sampling design, sampling weights (incorporating the unequal probability of sampling and differential non-response rates for study participation at baseline and follow-up) and indicators of stage 1 study stratum and cluster membership are distributed to study investigators, and all analyses use statistical methods that accommodate the complex survey sampling design.

In longitudinal cohort studies, such as HCHS/SOL, there is often interest in modeling the association of modifiable health risk factors (e.g., body mass index) with natural (i.e., treatment-free) longitudinal change in biomarkers (i.e., physiological measure indicative of metabolic dysregulation or disease; e.g., blood pressure). However, in an observational study, some participants may start disease treatment (e.g., antihypertensive medication) during the study that would affect the biomarker values of interest. The initiation of treatment between study visits is highly informative for the biomarker change between visits (i.e., treated biomarker values are expected to be in a healthier range than the untreated values), and consequently calculating biomarker change using the observed value at follow-up might be expected to produce biased estimates for the association of risk factors with the natural longitudinal change, since the observed biomarker change may be largely due to treatment rather than due to the risk factor of interest. In addition, excluding participants who initiated treatment between study visits could result in a biased sample because the biomarker trajectory of excluded participants might differ from the analysis sample. Some ad-hoc approaches commonly used to address this problem in practice involve adding an adjustment value to any biomarker values measured while on treatment³, usually based on treatment effects estimated by previous research. However, these naive approaches do not incorporate participant characteristics or study design features in the adjustment method, and could result in biased results.

We propose to consider biomarker values measured while on treatment to be missing data, and to adapt missing data methodology (inverse probability weighting (IPW) and doubly robust approaches) to address this problem. IPW involves predicting the probability of data being observed for each data record, and performing statistical analysis among the sub-sample with observed data, weighted by the inverse of this estimated probability⁴. In addition, Scharfstein et al⁵ proposed predicting, for each data record, the probability that the outcome is observed and an imputed value of the outcome, based on separately specified working models for missingness and for the outcome respectively, and then solving a set of estimating equations based on these predicted probabilities and predicted outcomes to estimate the effects of interest; they showed that this estimator is doubly robust (statistically consistent if the missingness model, the outcome imputation model, or both are specified correctly).

In HCHS/SOL, as in many longitudinal cohort studies, health information was collected between main study visits through annual telephone follow-up. During the annual follow-up surveys, participants were asked about starting new treatments or changing existing treatments. For those who were in the healthy range of the biomarker or not healthy but untreated at baseline and started treatment by the time the follow-up visit was conducted, the intermediate data can be used to determine approximately when treatment started. In other words, there are two complementary pieces of information available about the treatment process: (1) binary information about whether the participant was treated by follow-up or not, and (2) information about the timing of treatment initiation (which is provided by the intermediate data collected between study visits). For IPW and/or doubly robust approaches, predicting the probability of no treatment at follow-up from a model for a binary indicator of whether the individual was treated at follow-up (e.g., using a logistic regression model) would discard the second piece of information about the timing of treatment initiation that is available from these intermediate data. Therefore, for IPW and doubly robust approaches, we instead propose to model these time-to-treatment data using survival analysis methods, which use both the binary information about whether participants were treated by follow-up and the additional information about the timing of treatment initiation, and then to predict the probability of no treatment at the follow-up time based on this model.

Since the sample for HCHS/SOL was selected using a complex survey sampling design to enable inference to a finite target population, the IPW and doubly robust approaches developed based on simple random samples need to incorporate appropriate statistical methods to obtain unbiased effect estimates and the associated standard errors in the design-based inference for a finite population. In particular, weighted pseudo-likelihood methods can be used for estimation for generalized linear models with complex survey sample data. Taylor series linearization (i.e., sandwich variance estimator) is commonly used for variance estimation based on data from a complex survey sample. Taylor series linearization provides accurate variance estimation based on samples for which the primary sampling units (PSU; i.e., clusters from the first sampling stage) have been sampled with replacement or sampling probabilities for the PSUs were small⁶, and has been shown to conservatively estimate variances based on samples for which PSUs have been sampled without replacement⁷. Alternatively, bootstrap approaches have also been developed for variance estimation accounting for a complex survey sampling design^8,9,10. In addition, when there are missing data in a study that uses a complex survey sampling design, any methods that are used to account for the missing data (e.g., IPW, multiple imputation) should incorporate the sampling design features^11,12. Therefore, the proposed methodology will incorporate the sampling design features (e.g., strata, clusters, sampling weights) in the time-to-treatment and imputation models.

The rest of this paper is organized in the following way. Section 2 describes the application of missing data methodology to modeling longitudinal biomarker change when a subset of the sample started treatment during the study period, incorporating the complex survey sampling design. Section 3 describes variance estimation for the proposed estimators. Section 4 evaluates the statistical performance of the proposed approaches via simulation studies. Section 5 applies these approaches to modeling blood pressure change in HCHS/SOL. Section 6 concludes with a discussion.

2 |. PROPOSED APPROACHES FOR MODELING LONGITUDINAL BIOMARKER CHANGE WITH COMPLEX SURVEY DATA

Here we describe the general framework for the statistical problem of modeling longitudinal biomarker change in the presence of treatment based on complex survey data. Consider a longitudinal study where the cohort was selected from a finite population using a multi-stage sampling design with unequal probability of sampling, stratification, and clustering at each stage. Data were collected from two study visits (baseline and follow-up). Data regarding when treatment was initiated for those that started treatment between baseline and follow-up were also collected. Let $h = 1, \dots, H$ indicate the first stage sampling stratum, $i = 1, \dots, n_{h}$ indicate the PSU selected into the sample from stratum $h$ , $j = 1, \dots, m_{h i}$ indicate the individual selected into the sample from the selected PSU $i$ within stratum $h$ , and $n = \sum_{h = 1}^{H} \sum_{i = 1}^{n_{h}} m_{h i}$ indicate the sample size. Without loss of generality, let $k = 1, \dots, N$ indicate the individual from the population of size $N$ , and $s = 1, \dots, n$ correspond to the individuals selected into the sample. Let $ω$ be the sampling weight; without loss of generality, all mathematical derivations will consider the sampling weight to simply be the inverse of the sampling probability, but all results continue to hold if the sampling weights are multiplied by a normalizing constant. Let $Y_{0}$ be the baseline biomarker value and $Y_{V}$ be the untreated biomarker value at follow-up. 
Assume that the finite population of interest satisfies some generalized linear model, such that $U_{N} (β) = \sum_{k = 1}^{N} (Δ Y_{k} - μ (X_{* k}^{T} β)) \partial_{β} μ (X_{* k}^{T} β) = 0$ for some constant $β$, where $Δ Y = Y_{V} - Y_{0}$ is the untreated biomarker change between baseline and the follow-up visit, $X_{*} = (X, T_{V}, Y_{0})$, $T_{V}$ is the time between baseline and follow-up, $X$ is a vector of baseline covariates of interest, and $μ (\cdot)$ is a known link function (e.g., an identity link function for a linear model). Let $R_{V}$ be a binary indicator of having received treatment by the follow-up visit, $B_{V}$ be a binary indicator of disease status at follow-up, and $T_{M}$ be the time between baseline and treatment initiation. Note that the trajectory of the untreated biomarker is expected to be different among those with disease at follow-up $(B_{V} = 1)$ compared to those without disease at follow-up $(B_{V} = 0)$, and in fact sometimes the disease status may be defined partly based on values of the untreated biomarker (for example, hypertension status is generally defined based on systolic and diastolic blood pressure values above a threshold value).
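As a concrete instance, consider the identity case $μ (u) = u$, for which $\partial_{β} μ (X_{* k}^{T} β) = X_{* k}$. The display below only writes out this special case of the population estimating equation; the invertibility of $\sum_{k} X_{* k} X_{* k}^{T}$ is an assumption made here for illustration.

$$U_{N} (β) = \sum_{k = 1}^{N} (Δ Y_{k} - X_{* k}^{T} β) X_{* k} = 0 \quad \Longrightarrow \quad β = {\left(\sum_{k = 1}^{N} X_{* k} X_{* k}^{T}\right)}^{-1} \sum_{k = 1}^{N} X_{* k} Δ Y_{k}.$$

In this case $β$ is the finite-population least squares coefficient vector for the regression of untreated biomarker change on $X_{*}$.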

Consistent with medical practice guidelines¹³, assume that individuals continue to receive treatment after starting treatment for their chronic disease. Also assume that only those with the disease will receive treatment; in particular, even if the abnormal condition or phenotype subsides once treated, the individual is still considered to have the disease since the treatment is needed to control the disease. This framework is fully applicable to many chronic diseases, such as hypertension, with conventional categories of hypertensives aware, treated, or controlled¹³. In addition, assume $Y_{V} ⊥ R_{V} ∣ (Z, B_{V} = 1)$, where $Z$ is a vector of baseline auxiliary variables (including $X$, $T_{V}$, $Y_{0}$, and all variables related to the study design, among other variables) related to the treatment process and/or the untreated biomarker value at follow-up; in other words, we assume that those with untreated disease at follow-up are representative of those with treated disease, conditional on observed data including the study design. Note that based on this set-up, there are three key subgroups to consider at follow-up: individuals without disease $(B_{V} = 0)$, individuals with untreated disease $(B_{V} = 1, R_{V} = 0)$, and individuals with treated disease $(B_{V} = 1, R_{V} = 1)$; for example, at the follow-up visit in the analysis of HCHS/SOL data considered in Section 5 of this article, 66% of the weighted sample did not have hypertension (i.e., were without disease), 22% of the weighted sample had untreated hypertension (i.e., had untreated disease), and 12% of the weighted sample had treated hypertension (i.e., had treated disease). Since we are interested in untreated biomarker values, biomarker change will be considered as missing for the subgroup with treated disease (see Figure 1).

Figure 1. Diagram to illustrate which data are missing and observed. At the follow-up visit in the analysis of HCHS/SOL data considered in Section 5 of this article, 66% of the weighted sample did not have hypertension (i.e., no disease), 22% of the weighted sample had untreated hypertension (i.e., had disease with no treatment), and 12% of the weighted sample had treated hypertension (i.e., had disease with treatment).

2.1 |. Inverse Probability Weighting (IPW) Approach

An IPW approach for modeling natural (i.e., treatment-free) biomarker change is proposed, consisting of the following steps (see Figure 2): (1) predict the treatment probability for each individual, (2) calculate IP weights based on these predicted probabilities, and (3) fit a weighted version of the final regression model of interest. For step 1, the treatment probability at follow-up for individuals with disease can be predicted from a working Cox proportional hazards model for time-to-treatment $(T_{M})$, fit only among individuals with disease at follow-up $(B_{V} = 1)$. The hazard function for this model has the form $\tilde{λ} (t ∣ Z, B_{V} = 1) = {\tilde{λ}}_{0} (t) e^{Z^{T} \tilde{α}}$, where ${\tilde{λ}}_{0} (t)$ is an unspecified baseline hazard function, $\tilde{α}$ is an unknown constant vector of regression coefficients, and time-to-treatment is considered to be right censored at the time of the follow-up visit $(T_{V})$ for those not treated at follow-up $(R_{V} = 0)$. The time-to-treatment model should include sampling weights and cluster membership as covariates^14,11,12; if it is not computationally feasible to include cluster membership as covariates in the model due to a large number of clusters, then include stratum indicators and/or a smaller set of proxy variables for cluster membership as covariates instead. Note that the purpose of the time-to-treatment model is to predict the probability of starting treatment before the follow-up visit, not to estimate or interpret model parameters. Therefore, it is important to include as much relevant information in this model as possible to improve prediction of these treatment probabilities, including design variables when the study design is informative for the treatment process.

Figure 2. Flowchart to illustrate the steps for the proposed methods (inverse probability weighting (IPW) approach and doubly robust approach), and the output obtained from each step.

Let ${\hat{α}}_{n}$ be the maximum partial likelihood estimator for $\tilde{α}$¹⁵, and ${\hat{Λ}}_{0 n} (\cdot)$ be the Breslow estimator for ${\tilde{Λ}}_{0} (t) = \int_{0}^{t} {\tilde{λ}}_{0} (u) d u$¹⁶. Let $\hat{π} = \tilde{P} (R_{V} = 0 ∣ Z, B_{V}; {\hat{α}}_{n}, {\hat{Λ}}_{0 n} (\cdot))$ be the predicted probability of not starting treatment by the follow-up visit, where $\hat{π} = \tilde{P} (R_{V} = 0 ∣ Z, B_{V} = 0) = 1$ for individuals without the disease at the follow-up visit, and $\hat{π} = \tilde{P} (R_{V} = 0 ∣ Z, B_{V} = 1; {\hat{α}}_{n}, {\hat{Λ}}_{0 n} (\cdot)) = \tilde{P} (T_{M} > T_{V} ∣ Z, B_{V} = 1; {\hat{α}}_{n}, {\hat{Λ}}_{0 n} (\cdot)) = \exp \{- {\hat{Λ}}_{0 n} (T_{V}) e^{Z^{T} {\hat{α}}_{n}}\}$ for individuals with the disease at follow-up, where $\tilde{P} (\cdot; \tilde{α}, {\tilde{Λ}}_{0} (\cdot))$ indicates the probability based on the specified working model for time-to-treatment and the constants $\tilde{α}$ and ${\tilde{Λ}}_{0} (\cdot)$. For step 2, we assign a weight of $1 / \hat{π}$ to untreated individuals, and a weight of 0 to treated individuals; in other words, the IP weights are calculated as $\frac{1 - R_{V}}{\hat{π}}$. Finally, for step 3, we would like to calculate a final set of weights such that the weighted distribution of the covariates among untreated individuals is comparable to the distribution of the covariates in the population of interest (i.e., such that $\text{weight} \times P (Z, B_{V} ∣ R_{V} = 0, I = 1) = P (Z, B_{V})$, where $I$ is an indicator of being selected into the sample). 
Since the sampling weight represents the inverse of the probability of selection in the sample (i.e., $1 / P (I = 1 ∣ Z, B_{V})$), and the IP weight represents the inverse of the probability of not receiving treatment by the follow-up study visit conditional on being selected in the sample (i.e., $1 / P (R_{V} = 0 ∣ Z, B_{V}, I = 1)$), the product of the sampling weight $(ω)$ and the estimated IP weight $(\frac{1 - R_{V}}{\hat{π}})$ can be applied to make the covariate distribution for untreated individuals resemble the covariate distribution in the population (i.e., $\text{constant} \times ω \times \frac{1 - R_{V}}{\hat{π}} \times P (Z, B_{V} ∣ R_{V} = 0, I = 1) = P (R_{V} = 0, I = 1) \times \frac{1}{P (I = 1 ∣ Z, B_{V})} \times \frac{1}{\tilde{P} (R_{V} = 0 ∣ Z, B_{V}, I = 1; {\hat{α}}_{n}, {\hat{Λ}}_{0 n} (\cdot))} \times P (Z, B_{V} ∣ R_{V} = 0, I = 1) ≃ P (Z, B_{V})$). Therefore, the regression effect of interest $β$ is estimated by solving a set of estimating equations, weighted by the product of the sampling weight and the estimated IP weight:

$$U_{n}^{IPW} (β ∣ {\hat{π}}_{1}, \dots, {\hat{π}}_{n}) = \sum_{s = 1}^{n} ω_{s} \frac{1 - R_{V s}}{{\hat{π}}_{s}} (Δ Y_{s} - μ (X_{* s}^{T} β)) \partial_{β} μ (X_{* s}^{T} β) = 0, \tag{1}$$

where ${\hat{π}}_{s} = \tilde{P} (R_{V s} = 0 ∣ Z_{s}, B_{V s}; {\hat{α}}_{n}, {\hat{Λ}}_{0 n} (\cdot))$. For those with disease $(B_{V s} = 1)$, ${\hat{π}}_{s} = \exp \{- {\hat{Λ}}_{0 n} (T_{V s}) e^{Z_{s}^{T} {\hat{α}}_{n}}\}$; for those without disease $(B_{V s} = 0)$, ${\hat{π}}_{s} = 1$.
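To make steps 1 and 2 concrete, the following is a minimal Python sketch of how the predicted probabilities ${\hat{π}}_{s}$ and the combined weights $ω_{s} \frac{1 - R_{V s}}{{\hat{π}}_{s}}$ might be computed. It assumes the Cox coefficients have already been estimated elsewhere, so only the linear predictor $Z^{T} {\hat{α}}_{n}$ is passed in. The function names and the toy data are hypothetical and are not taken from the paper or its Web Appendix.

```python
import math

def breslow_cumulative_hazard(times, events, lin_pred, t):
    """Breslow-type estimate of the cumulative baseline hazard at time t.

    times    : time to treatment start, or to the follow-up visit if censored
    events   : 1 if treatment started before follow-up, 0 if censored
    lin_pred : Z^T alpha per individual (alpha is assumed already estimated)
    """
    risk = [math.exp(lp) for lp in lin_pred]
    event_times = sorted({u for u, e in zip(times, events) if e == 1 and u <= t})
    cum_haz = 0.0
    for u in event_times:
        n_events = sum(1 for ti, e in zip(times, events) if e == 1 and ti == u)
        at_risk = sum(r for ti, r in zip(times, risk) if ti >= u)
        cum_haz += n_events / at_risk
    return cum_haz

def prob_untreated(has_disease, cum_haz_at_visit, lin_pred):
    """pi-hat: 1 without disease, exp(-Lambda0(T_V) * exp(Z^T alpha)) with disease."""
    if not has_disease:
        return 1.0
    return math.exp(-cum_haz_at_visit * math.exp(lin_pred))

def analysis_weight(sampling_weight, treated, pi_hat):
    """Product of the sampling weight and the IP weight (1 - R_V) / pi-hat."""
    return sampling_weight * (1 - treated) / pi_hat

# Toy data for four individuals with disease at follow-up (hypothetical values).
times = [1.0, 2.0, 3.0, 3.0]
events = [1, 0, 1, 0]
lin_pred = [0.0, 0.0, 0.0, 0.0]  # alpha = 0, so every relative hazard is 1

cum_haz = breslow_cumulative_hazard(times, events, lin_pred, t=3.0)
pi_hat = prob_untreated(True, cum_haz, 0.0)
w_untreated = analysis_weight(100.0, treated=0, pi_hat=pi_hat)
w_treated = analysis_weight(100.0, treated=1, pi_hat=pi_hat)
```

In the toy data the cumulative hazard at time 3 is 1/4 + 1/2, so an untreated individual with a visit at time 3 has a sampling weight inflated by the factor $1 / \exp (-0.75)$, while a treated individual receives weight zero.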

Note that the estimating equations in (1) can be solved using standard complex survey regression software (such as the svyglm function in R) with the weight specified as the product of the sampling weight and $\frac{1 - R_{V}}{\hat{π}}$ (see Web Appendix D for sample R code using the svyglm function to solve the estimating equations in (1)).
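For readers who want to see what solving (1) amounts to under an identity link, the sketch below solves the weighted normal equations directly in plain Python. It is an illustration only: it returns point estimates and none of the design-based standard errors that survey software provides, and the helper names and toy data are hypothetical. Treated individuals carry weight zero, so their missing biomarker change is never used.

```python
def solve_linear(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ipw_linear_fit(x_rows, delta_y, weights):
    """Solve sum_s w_s (dY_s - x_s^T beta) x_s = 0 for beta (identity link)."""
    p = len(x_rows[0])
    a = [[0.0] * p for _ in range(p)]
    b = [0.0] * p
    for x, y, w in zip(x_rows, delta_y, weights):
        if w == 0.0:
            continue  # treated individuals: weight zero, outcome treated as missing
        for r in range(p):
            b[r] += w * x[r] * y
            for c in range(p):
                a[r][c] += w * x[r] * x[c]
    return solve_linear(a, b)

# Hypothetical toy data: intercept plus one covariate; the third person is treated.
x_rows = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
delta_y = [1.0, 3.0, None, 7.0]
weights = [10.0, 20.0, 0.0, 15.0]  # sampling weight times (1 - R_V) / pi-hat

beta_hat = ipw_linear_fit(x_rows, delta_y, weights)
```

Because the three untreated points lie exactly on the line with intercept 1 and slope 2, the fit recovers those values whatever the weights are; with real data the weights change the solution.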

2.2 |. Doubly Robust Approach

We also propose a doubly robust approach, which is statistically consistent if the time-to-treatment model, the imputation model for biomarker change, or both are correctly specified. This doubly robust approach consists of the following steps (see Figure 2): (1) predict the treatment probability and imputed untreated biomarker change for each individual, (2) calculate IP weights, and (3) solve a set of estimating equations based on the predicted IP weights and imputed values of biomarker change. For step 1, we fit a Cox proportional hazards model for time-to-treatment $(T_{M})$ similarly to the IPW method. In addition, we fit a working generalized linear model only among untreated individuals with disease at follow-up $(B_{V} = 1, R_{V} = 0)$, with the following form: $\tilde{E} [Δ Y ∣ Z, B_{V} = 1; \tilde{γ}] = v (Z^{T} \tilde{γ})$, where $\tilde{E} [\cdot; \tilde{γ}]$ indicates the expectation based on the specified working imputation model and the unknown constant vector of regression coefficients $\tilde{γ}$, and $v (\cdot)$ is a specified link function. Again, the imputation model for biomarker change should include sampling weights and indicators of cluster membership as covariates. Then we predict biomarker change for those with disease at follow-up (regardless of whether they were treated or not) from the imputation model ($\hat{v} = \tilde{E} [Δ Y ∣ Z, B_{V} = 1; {\hat{γ}}_{n}]$, where ${\hat{γ}}_{n}$ is the maximum likelihood estimator for $\tilde{γ}$ based on fitting the model among those with untreated disease at follow-up).

For step 2, we estimate IP weights as in the IPW method. Finally, for step 3, the regression effect of interest $β$ is estimated by solving a “sample-weighted” version of the doubly robust estimating equations proposed by Scharfstein et al⁵:

$$\begin{aligned} U_{n}^{DR} (β ∣ {\hat{π}}_{1}, \dots, {\hat{π}}_{n}, {\hat{v}}_{1}, \dots, {\hat{v}}_{n}) &= \sum_{s = 1}^{n} ω_{s} \left\{\frac{1 - R_{V s}}{{\hat{π}}_{s}} (Δ Y_{s} - μ (X_{* s}^{T} β)) \partial_{β} μ (X_{* s}^{T} β) - \left(\frac{1 - R_{V s}}{{\hat{π}}_{s}} - 1\right) ({\hat{v}}_{s} - μ (X_{* s}^{T} β)) \partial_{β} μ (X_{* s}^{T} β)\right\} \\ &= \sum_{s = 1}^{n} ω_{s} \left\{{\hat{v}}_{s} + \frac{1 - R_{V s}}{{\hat{π}}_{s}} (Δ Y_{s} - {\hat{v}}_{s}) - μ (X_{* s}^{T} β)\right\} \partial_{β} μ (X_{* s}^{T} β) \\ &= \sum_{s = 1}^{n} ω_{s} \left\{\left[\frac{1 - R_{V s}}{{\hat{π}}_{s}} Δ Y_{s} + \left(1 - \frac{1 - R_{V s}}{{\hat{π}}_{s}}\right) {\hat{v}}_{s}\right] - μ (X_{* s}^{T} β)\right\} \partial_{β} μ (X_{* s}^{T} β) = 0. \end{aligned} \tag{2}$$

This set of estimating equations is equivalent to the estimating equations for a sample-weighted regression model with a modified outcome variable equal to $[\frac{1 - R_{V}}{\hat{π}} Δ Y + (1 - \frac{1 - R_{V}}{\hat{π}}) \hat{v}]$ , and therefore can be solved using standard complex survey regression software (such as the svyglm function in R) using this modified outcome variable (see Web Appendix D for sample R code using the svyglm function to solve the estimating equations in (2)). Further, this modified outcome variable equals the observed biomarker change $(\frac{1 - 0}{1} Δ Y + (1 - \frac{1 - 0}{1}) \hat{v} = Δ Y)$ for individuals without the disease (i.e., the expression does not depend on $\hat{v}$ for those without disease), equals the imputed biomarker change $(\frac{1 - 1}{\hat{π}} Δ Y + (1 - \frac{1 - 1}{\hat{π}}) \hat{v} = \hat{v})$ for treated individuals, and equals a weighted combination of the observed biomarker change and the imputed biomarker change for individuals with untreated disease.
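The three cases described above can be checked with a small Python sketch of the modified outcome variable. The function name and the numeric values are hypothetical. Biomarker change is passed as `None` for treated individuals to reflect that it is regarded as missing, and the imputed value is irrelevant for individuals without disease because its multiplier is zero.

```python
def dr_modified_outcome(delta_y, treated, pi_hat, v_hat=0.0):
    """Modified outcome: ip * dY + (1 - ip) * v-hat, with ip = (1 - R_V) / pi-hat."""
    ip = (1 - treated) / pi_hat
    observed = 0.0 if treated else delta_y  # dY is missing (unused) when treated
    return ip * observed + (1.0 - ip) * v_hat

# No disease: pi-hat = 1 and R_V = 0, so the observed change is returned.
no_disease = dr_modified_outcome(delta_y=5.0, treated=0, pi_hat=1.0)

# Treated disease: the IP weight is 0, so the imputed change is returned.
treated_disease = dr_modified_outcome(delta_y=None, treated=1, pi_hat=0.4, v_hat=3.0)

# Untreated disease: combination of observed and imputed change,
# here (1 / 0.5) * 6 + (1 - 1 / 0.5) * 4.
untreated_disease = dr_modified_outcome(delta_y=6.0, treated=0, pi_hat=0.5, v_hat=4.0)
```

The resulting values would then serve as the outcome in a sample-weighted regression on $X_{*}$, with the sampling weight $ω$ alone as the regression weight.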

3 |. VARIANCE ESTIMATION FOR THE PROPOSED METHODS

Here we study variance estimation for the proposed estimators in the context of design-based inference based on complex survey samples selected from a finite population.

First we study the asymptotic distribution of the proposed IPW estimator, ${\hat{β}}_{n}^{I P W}$ . It has been shown that ${\hat{α}}_{n}$ converges in probability to a constant $α^{*}$ , and ${\hat{Λ}}_{0 n} (t)$ converges in probability to a function $Λ_{0}^{*} (t)$ ^17,18; therefore, ${\hat{π}}_{s}$ converges in probability to $π_{s}^{*}$ , $s = 1, \dots, n$ . If the time-to-treatment model is specified correctly, then $α^{*}$ is the true parameter value for $\tilde{α}$ , $Λ_{0}^{*} (\cdot)$ is the true cumulative baseline hazard function, and $π_{s}^{*}$ is the true conditional probability of no treatment. This result can be used to show that the expected value of the sample estimating equations $U_{n}^{I P W} (β)$ equals the expected value of the population estimating equations $U_{N} (β)$ when the time-to-treatment model is specified correctly.

Theorem 1.

Assume there is a sequence of populations indexed by $q = 1,2, 3, \dots$ of size $N_{q}$ , and a corresponding sequence of samples of size $n_{q}$ drawn from each population, such that $n_{q} / N_{q} < 1 - ϵ$ for some $ϵ > 0$ for sufficiently large $q$ . Under assumptions (A1)–(A9) in the Supplementary Materials (Web Appendix A), if the correct IP weights are known (i.e., if $π_{s}^{*} = P (R_{V s} = 0 ∣ Z_{s}, B_{V s} = 1)$ is known for $s = 1, \dots, n_{q}$ ), then $n_{q}^{1 / 2} ({\hat{β}}_{n_{q}}^{I P W} - β^{*})$ converges to a normal distribution with mean zero as $q \to \infty$ , where ${\hat{β}}_{n_{q}}^{I P W}$ is the solution to $U_{n_{q}}^{I P W} (β) = 0$ in equation (1). Furthermore, the asymptotic covariance of ${\hat{β}}_{n}^{I P W}$ can be estimated by the following robust estimator:

$$\widehat{\mathrm{Cov}} ({\hat{β}}_{n}^{IPW}) \equiv {\left[\partial_{β} U_{n}^{IPW} ({\hat{β}}_{n}^{IPW}; π_{1}^{*}, \dots, π_{n}^{*})\right]}^{-1} {\hat{Σ}}_{U}^{IPW} ({\hat{β}}_{n}^{IPW}; π_{1}^{*}, \dots, π_{n}^{*}) {\left[{\partial_{β} U_{n}^{IPW} ({\hat{β}}_{n}^{IPW}; π_{1}^{*}, \dots, π_{n}^{*})}^{\prime}\right]}^{-1}, \tag{3}$$

where ${\hat{Σ}}_{U}^{IPW} ({\hat{β}}_{n}^{IPW}; π_{1}^{*}, \dots, π_{n}^{*}) = \sum_{h = 1}^{H} \frac{n_{h}}{n_{h} - 1} \sum_{i = 1}^{n_{h}} {\left(e_{h i}^{IPW} - \frac{1}{n_{h}} \sum_{g = 1}^{n_{h}} e_{h g}^{IPW}\right)}^{\otimes 2}$, $e_{h i}^{IPW} = \sum_{j = 1}^{m_{h i}} ω_{h i j} \frac{1 - R_{V h i j}}{π_{h i j}^{*}} (Δ Y_{h i j} - μ (X_{* h i j}^{T} {\hat{β}}_{n}^{IPW})) \partial_{β} μ (X_{* h i j}^{T} {\hat{β}}_{n}^{IPW})$, and $r^{\otimes 2} = r r^{T}$ for a $p \times 1$ vector $r$.
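The middle term of the sandwich estimator is a between-PSU sum of squares computed within each stratum. The Python sketch below shows that computation for per-individual weighted score vectors that have already been evaluated at the estimate. The function name and toy data are hypothetical, and the outer derivative terms of (3) are left out.

```python
def between_psu_covariance(scores, strata, psus):
    """Sum over strata of n_h / (n_h - 1) * sum_i (e_hi - mean_h)(e_hi - mean_h)^T.

    scores : per-individual weighted score vectors (each of length p)
    strata : stratum label per individual
    psus   : PSU label per individual (labels need only be unique within stratum)
    """
    p = len(scores[0])
    totals = {}
    for e, h, i in zip(scores, strata, psus):
        acc = totals.setdefault(h, {}).setdefault(i, [0.0] * p)
        for r in range(p):
            acc[r] += e[r]  # e_hi: PSU-level total of individual scores
    sigma = [[0.0] * p for _ in range(p)]
    for by_psu in totals.values():
        n_h = len(by_psu)
        if n_h < 2:
            raise ValueError("each stratum needs at least two sampled PSUs")
        mean = [sum(v[r] for v in by_psu.values()) / n_h for r in range(p)]
        for v in by_psu.values():
            d = [v[r] - mean[r] for r in range(p)]
            for r in range(p):
                for c in range(p):
                    sigma[r][c] += n_h / (n_h - 1) * d[r] * d[c]
    return sigma

# Hypothetical scalar scores (p = 1): stratum "a" has 2 PSUs, stratum "b" has 3.
scores = [[0.5], [0.5], [3.0], [2.0], [1.0], [1.0], [5.0]]
strata = ["a", "a", "a", "b", "b", "b", "b"]
psus = [1, 1, 2, 1, 2, 2, 3]

sigma_hat = between_psu_covariance(scores, strata, psus)
```

In the toy data the PSU totals are (1, 3) in stratum "a" and (2, 2, 5) in stratum "b", giving contributions 2/1 × 2 = 4 and 3/2 × 6 = 9.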

The proof for Theorem 1 is outlined in the Supplementary Materials (Web Appendix B). In particular, under assumptions (A1)–(A9) in the Supplementary Materials (Web Appendix A), Corollary 1 from Binder⁶ can be used to obtain the asymptotic distribution of the IPW estimator ${\hat{β}}_{n}^{I P W}$ in Theorem 1, assuming the correct IP weights are known.

Next we study the asymptotic distribution of the proposed doubly robust estimator, ${\hat{β}}_{n}^{D R}$. It has been shown that ${\hat{γ}}_{n}$ converges in probability to a constant $γ^{*}$¹⁹; therefore, ${\hat{v}}_{s}$ converges in probability to $v_{s}^{*}$, $s = 1, \dots, n$. If the imputation model is specified correctly, then $γ^{*}$ is the true parameter value for $\tilde{γ}$, and $v_{s}^{*}$ is the true conditional expected value of biomarker change. These results, combined with the asymptotic results of the parameters from the time-to-treatment model $({\hat{α}}_{n}, {\hat{Λ}}_{0 n})$, can be used to show that the expected value of the sample estimating equations $U_{n}^{D R} (β)$ equals the expected value of the population estimating equations $U_{N} (β)$ when the time-to-treatment model and/or the imputation model are specified correctly.

Theorem 2.

Assume there is a sequence of populations indexed by $q = 1,2, 3, \dots$ of size $N_{q}$, and a corresponding sequence of samples of size $n_{q}$ drawn from each population, such that $n_{q} / N_{q} < 1 - ϵ$ for some $ϵ > 0$ for sufficiently large $q$. Under assumptions (A1)–(A9) in the Supplementary Materials (Web Appendix A), if the correct IP weights and/or the correct imputed biomarker change are known (i.e., if using known constants $π_{s}^{*}$ and $v_{s}^{*}$ where $π_{s}^{*} = P (R_{V s} = 0 ∣ Z_{s}, B_{V s} = 1)$ and/or $v_{s}^{*} = E [Δ Y_{s} ∣ Z_{s}, B_{V s} = 1]$ for $s = 1, \dots, n_{q}$), then $n_{q}^{1 / 2} ({\hat{β}}_{n_{q}}^{D R} - β^{*})$ converges to a normal distribution with mean zero as $q \to \infty$, where ${\hat{β}}_{n_{q}}^{D R}$ is the solution to $U_{n_{q}}^{D R} (β) = 0$ in equation (2). Furthermore, the asymptotic covariance of ${\hat{β}}_{n}^{D R}$ can be estimated by the following robust estimator:

$$\widehat{\mathrm{Cov}} ({\hat{β}}_{n}^{DR}) \equiv {\left[\partial_{β} U_{n}^{DR} ({\hat{β}}_{n}^{DR}; π_{1}^{*}, \dots, π_{n}^{*}, v_{1}^{*}, \dots, v_{n}^{*})\right]}^{-1} {\hat{Σ}}_{U}^{DR} ({\hat{β}}_{n}^{DR}; π_{1}^{*}, \dots, π_{n}^{*}, v_{1}^{*}, \dots, v_{n}^{*}) {\left[{\partial_{β} U_{n}^{DR} ({\hat{β}}_{n}^{DR}; π_{1}^{*}, \dots, π_{n}^{*}, v_{1}^{*}, \dots, v_{n}^{*})}^{\prime}\right]}^{-1}, \tag{4}$$

where ${\hat{Σ}}_{U}^{DR} ({\hat{β}}_{n}^{DR}; π_{1}^{*}, \dots, π_{n}^{*}, v_{1}^{*}, \dots, v_{n}^{*}) = \sum_{h = 1}^{H} \frac{n_{h}}{n_{h} - 1} \sum_{i = 1}^{n_{h}} {\left(e_{h i}^{DR} - \frac{1}{n_{h}} \sum_{g = 1}^{n_{h}} e_{h g}^{DR}\right)}^{\otimes 2}$ and $e_{h i}^{DR} = \sum_{j = 1}^{m_{h i}} ω_{h i j} \left\{\frac{1 - R_{V h i j}}{π_{h i j}^{*}} (Δ Y_{h i j} - μ (X_{* h i j}^{T} {\hat{β}}_{n}^{DR})) \partial_{β} μ (X_{* h i j}^{T} {\hat{β}}_{n}^{DR}) - \left(\frac{1 - R_{V h i j}}{π_{h i j}^{*}} - 1\right) (v_{h i j}^{*} - μ (X_{* h i j}^{T} {\hat{β}}_{n}^{DR})) \partial_{β} μ (X_{* h i j}^{T} {\hat{β}}_{n}^{DR})\right\}$.

The proof for Theorem 2 is similar to the proof for Theorem 1 and is outlined in the Supplementary Materials (Web Appendix B). In particular, under assumptions (A1)–(A9) in the Supplementary Materials (Web Appendix A), Corollary 1 from Binder⁶ can be used to obtain the asymptotic distribution of the doubly robust estimator ${\hat{β}}_{n}^{D R}$ in Theorem 2, assuming the correct IP weights and/or correct imputed biomarker change values are known.

Note that the conditional treatment probabilities and expected values of biomarker change are generally not known in practice, but are predicted from a Cox proportional hazards model for time-to-treatment and a generalized linear model for biomarker change respectively. Therefore, variance estimation for the proposed IPW and doubly robust estimators should incorporate the variability from the estimation of the time-to-treatment model and/or imputation model for biomarker change. Although it would be possible to derive an asymptotic variance estimator for the proposed IPW and doubly robust estimators accounting for the estimation of the time-to-treatment and/or imputation models, this derivation would be complex, and is outside the scope of this paper. Alternatively, the variance of these proposed estimators can be estimated, including accounting for the variability due to estimation of the time-to-treatment model and/or imputation model, using a bootstrap approach. The idea behind the bootstrap approach for variance estimation is to approximate the sampling distribution of the statistic of interest based on simulating the sampling procedure used to obtain the survey sample^8,9,10. In the case of complex survey samples, this can be done by taking simple random samples with replacement of the PSUs within each stratum for a large number of replicates $A$ , adjusting the sampling weights in some way for each replicate sample to reflect the bootstrap sampling, and estimating the variance as $\frac{1}{A} \sum_{a = 1}^{A} {\{{\hat{β}}^{(a)} - \bar{\hat{β}}\}}^{2}$ , where ${\hat{β}}^{(a)}$ is the statistic of interest from bootstrap replicate sample $a$ using the adjusted bootstrap sampling weights^20,21, and $\bar{\hat{β}} = \frac{1}{A} \sum_{a = 1}^{A} {\hat{β}}^{(a)}$ is the average of the statistic of interest across all bootstrap replicates. There are many ways to implement this general bootstrap approach for complex survey samples. 
A common approach, which will be implemented in this paper, is to take simple random samples of $n_{h} - 1$ PSUs within each stratum $h$ ²², and use the rescaling bootstrap procedure of Rao, Wu, and Yue²¹ to calculate the bootstrap sampling weights for each bootstrap replicate sample $a$ as $ω_{h i j}^{(* a)} = \{1 + \frac{n_{h}}{n_{h} - 1} m_{h i}^{(* a)}\} ω_{h i j}$ , where $m_{h i}^{(* a)}$ is the number of times PSU $i$ was re-sampled from stratum $h$ in bootstrap sample $a$ . Note that the bootstrap approach for variance estimation requires that the analysis be repeated on a large number of bootstrap replicate samples (usually 100 to 1000 replicates), and so may be computationally burdensome in some situations. For cases where computational resources are too limited to use a bootstrap approach, the variance estimators assuming known IP weights and imputed biomarker change values from (3) and (4) can be used to provide conservative variance estimation, where the estimated IP weights $(\frac{1 - R_{V}}{\hat{π}})$ and imputed biomarker change $(\hat{v})$ are substituted for $\frac{1 - R_{V}}{π^{*}}$ and $v^{*}$ respectively^23,12.
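As a rough illustration of the resampling and variance steps described above, the following Python sketch draws $n_{h} - 1$ PSUs with replacement within each stratum and applies the replicate variance formula. The function names and the dictionary input are hypothetical, and the weight rescaling and model refitting steps are deliberately left out; this is a sketch under those assumptions, not the implementation used in the paper.

```python
import numpy as np

def resample_psu_counts(n_psus_by_stratum, rng):
    """For each stratum h with n_h PSUs, draw n_h - 1 PSUs with replacement.

    Returns a dict mapping stratum -> array of counts m_hi, the number of
    times each PSU i was re-sampled in this bootstrap replicate.
    """
    counts = {}
    for h, n_h in n_psus_by_stratum.items():
        draws = rng.integers(0, n_h, size=n_h - 1)
        counts[h] = np.bincount(draws, minlength=n_h)
    return counts

def bootstrap_variance(replicate_estimates):
    """Variance as (1/A) * sum over replicates of squared deviations
    from the mean of the replicate estimates."""
    est = np.asarray(replicate_estimates, dtype=float)
    return np.mean((est - est.mean(axis=0)) ** 2, axis=0)
```

In use, the counts from each replicate would feed the rescaled bootstrap weights, the full analysis would be rerun with those weights, and the resulting estimates would be passed to `bootstrap_variance`.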

4 |. SIMULATION STUDIES

In this section, we conducted simulation studies to evaluate the statistical performance of the proposed methods for estimating longitudinal biomarker change using data obtained by a complex survey sampling design. Two simulation studies were conducted. Simulation Study 1 compared the statistical performance of the proposed methods to common ad hoc methods. Simulation Study 2 evaluated the performance of the proposed methods when the time-to-treatment model and/or the imputation model was misspecified by omitting an important covariate.

First, all variables were generated for a single population, and then 1000 independent samples were randomly selected without replacement from this population. Data for this single population and the 1000 random samples were used for both simulation studies. Finally, for each simulation study, the proposed methods were applied to each of these simulated samples, and the statistical properties of the proposed methods (e.g., bias, confidence interval coverage) were assessed.

4.1 |. Sampling Design

A stratified two-stage sampling design was used for these simulations, where primary sampling units (PSUs) were selected in stage 1 (denoted by the subscript $i$ ) and individuals were selected in stage 2 (denoted by the subscript $j$ ; see Figure 3 for an illustration of the sampling design used to randomly select samples in these simulations).

FIGURE 3. Flowchart to illustrate sampling design for simulated data

At stage 1, a stratified random sample of PSUs was selected without replacement from the population. The population had 376 PSUs, divided into 2 strata. Separately within each stratum, PSUs were sampled without replacement using a stratum-specific sampling probability. Figure 3 lists the number of population PSUs and sampling probability for each stratum.

At stage 2, a stratified random sample of individuals was selected without replacement from the PSUs selected at stage 1. The number of individuals in each PSU was independently generated as $1 + \text{Poisson}(500)$ . Within each selected PSU, individuals were divided into 2 strata. Separately within each stage 2 stratum, individuals were sampled without replacement from the selected PSUs, using sampling probabilities that differed by stage 2 stratum (see Figure 3).

Based on the design, sampling weights were generated in three steps. First, weights for each sampling stage were calculated as the inverse of the sampling probability used for that PSU or individual at the corresponding sampling stage. Second, base weights were calculated as the product of the weights from both sampling stages. Third, base weights were normalized by dividing each individual base weight by the sample mean of the base weights, to produce final sampling weights that summed to the sample size within each sample dataset.
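The three weighting steps can be sketched in a few lines of Python. The function name and the per-individual probability arrays are hypothetical inputs chosen for illustration.

```python
import numpy as np

def normalized_sampling_weights(p_stage1, p_stage2):
    """Compute normalized sampling weights for sampled individuals.

    p_stage1: selection probability of each individual's PSU (stage 1)
    p_stage2: selection probability of the individual within the PSU (stage 2)
    """
    p1 = np.asarray(p_stage1, dtype=float)
    p2 = np.asarray(p_stage2, dtype=float)
    # Step 1: stage-specific weights are inverse selection probabilities.
    stage1_w = 1.0 / p1
    stage2_w = 1.0 / p2
    # Step 2: base weight is the product across the two stages.
    base = stage1_w * stage2_w
    # Step 3: divide by the sample mean so the weights sum to the sample size.
    return base / base.mean()
```

Because of the normalization, only the relative sizes of the weights carry information; their sum always equals the number of sampled individuals.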

4.2 |. Simulated Population Variables

Population variables were simulated based on the example of estimating longitudinal change in systolic blood pressure when a subset of the sample initiated antihypertensive treatment between baseline and follow-up. The analysis of interest was to estimate the association of a predictor $X$ with systolic blood pressure change $(Δ Y)$ , based on a linear regression model with an identity link, adjusted for baseline systolic blood pressure $(Y_{0})$ and time between study visits $(T_{V})$ . Figure 4 illustrates the process of generating the data for the population in the simulations presented in this paper, and Table 1 presents the distributions used to generate all variables for the simulations. First, variables involved in the complex survey sampling design were generated, including the sampling stratum for the PSU (i.e., for the stage 1 sampling stratum) and the sampling stratum for the individuals (i.e., for the stage 2 sampling stratum); see Section 4.1 for details about the complex survey sampling design used to select the samples, and how the design variables were generated for the simulations. Then, data for the variables from the analysis of interest (i.e., predictor variable $X$ , baseline systolic blood pressure $Y_{0}$ , and time between study visits $T_{V}$ ) were generated.

FIGURE 4. Flowchart to illustrate the data generation process for the simulated population. Notation: $\text{PSU} =$ primary sampling unit from target population, $\text{Individual} =$ person from target population, $X =$ predictor variable for final regression model, $Z =$ auxiliary variable for time-to-treatment and imputation models, $T_{V} =$ time between baseline and follow-up, $T_{M} =$ time between baseline and treatment initiation, $Y_{0} =$ baseline biomarker, $Y_{V} =$ biomarker at follow-up if participant had not received treatment, $Y_{V}^{trt} =$ biomarker at follow-up if participant had received treatment, $Y_{D} =$ observed biomarker at follow-up, $B_{V} =$ indicator of disease status at follow-up, $R_{V} =$ indicator of treatment at follow-up.

TABLE 1.

Summary of distributions used to generate all population variables for simulations to evaluate the statistical performance of the proposed methods

Variable	Description	Distribution

$X$	Predictor	$X_{i j} ~ N (1, 0.5)$
$Z$	Auxiliary Variable	$Z_{i j} ~ N (2, 9)$
$T_{V}$	Follow-up Time	Truncated $N (6, 1)$ , bounded between 3 and 9
$Y_{0}$	Baseline Biomarker	Normal distribution: $110 + a_{i} + η_{i j}$
		$a_{i} ~ N (0, 4), η_{i j} ~ N (0, 4)$
$B_{V}$	Disease at Follow-up	Bernoulli distribution with the following probability: $\text{logit}^{-1} (c_{0} + b_{i})$
		$b_{i} ~ N (0, 0.09)$
$T_{M}$	Time-to-treatment	If $B_{V i j} = 1$ : Weibull distribution with the following hazard function (varied by stage 1 and stage 2 sampling strata):
		Stage 1 Stratum 1, Stage 2 Stratum A: $2 λ t \exp \{- 0.2 X_{i j} - 0.2 Z_{i j} - 0.001 Y_{0 i j} + c_{i}\}$
		Stage 1 Stratum 1, Stage 2 Stratum B: $2 λ t \exp \{0.1 - 0.15 X_{i j} - 0.2 Z_{i j} - 0.001 Y_{0 i j} + c_{i}\}$
		Stage 1 Stratum 2, Stage 2 Stratum A: $2 λ t \exp \{- 0.1 - 0.2 X_{i j} - 0.2 Z_{i j} - 0.001 Y_{0 i j} + c_{i}\}$
		Stage 1 Stratum 2, Stage 2 Stratum B: $2 λ t \exp \{- 0.15 X_{i j} - 0.2 Z_{i j} - 0.001 Y_{0 i j} + c_{i}\}$
		$c_{i} ~ N (0, 0.09)$
$Y_{V}$	Untreated Biomarker at Follow-up	If $B_{V i j} = 0$ : The following truncated normal distribution bounded from above by 130 (varied by stage 1 and stage 2 sampling stratum):
		Stage 1 Stratum 1, Stage 2 Stratum A: $Y_{0 i j} + 0.2 X_{i j} - 1.6 Z_{i j} + 0.2 T_{V i j} + d_{i} + ϵ_{i j}$
		Stage 1 Stratum 1, Stage 2 Stratum B: $Y_{0 i j} + 0.1 + 0.25 X_{i j} - 1.6 Z_{i j} + 0.2 T_{V i j} + d_{i} + ϵ_{i j}$
		Stage 1 Stratum 2, Stage 2 Stratum A: $Y_{0 i j} - 0.1 + 0.2 X_{i j} - 1.6 Z_{i j} + 0.2 T_{V i j} + d_{i} + ϵ_{i j}$
		Stage 1 Stratum 2, Stage 2 Stratum B: $Y_{0 i j} + 0.25 X_{i j} - 1.6 Z_{i j} + 0.2 T_{V i j} + d_{i} + ϵ_{i j}$
		If $B_{V i j} = 1$ : The following truncated normal distribution bounded from below by 130 (varied by stage 1 and stage 2 sampling stratum):
		Stage 1 Stratum 1, Stage 2 Stratum A: $Y_{0 i j} + 20 + 1.8 X_{i j} - 1.1 Z_{i j} + 4 T_{V i j} + d_{i} + ϵ_{i j}$
		Stage 1 Stratum 1, Stage 2 Stratum B: $Y_{0 i j} + 20.1 + 1.85 X_{i j} - 1.1 Z_{i j} + 4 T_{V i j} + d_{i} + ϵ_{i j}$
		Stage 1 Stratum 2, Stage 2 Stratum A: $Y_{0 i j} + 19.9 + 1.8 X_{i j} - 1.1 Z_{i j} + 4 T_{V i j} + d_{i} + ϵ_{i j}$
		Stage 1 Stratum 2, Stage 2 Stratum B: $Y_{0 i j} + 20 + 1.85 X_{i j} - 1.1 Z_{i j} + 4 T_{V i j} + d_{i} + ϵ_{i j}$
		$d_{i} ~ N (0, 4), ϵ_{i j} ~ N (0, 4)$ ¹
$Y_{V}^{trt}$	Treated Biomarker at Follow-up	$(Y_{V i j} - Y_{V i j}^{t r t})$ generated from the following normal distribution (varied by stage 1 and stage 2 sampling stratum):
		Stage 1 Stratum 1, Stage 2 Stratum A: $- 11 + 0.1 Y_{0 i j} - 3 X_{i j} + 1.5 Z_{i j} + 1.1 T_{V i j} + e_{i} + κ_{i j}$
		Stage 1 Stratum 1, Stage 2 Stratum B: $- 10.9 + 0.1 Y_{0 i j} - 2.95 X_{i j} + 1.5 Z_{i j} + 1.1 T_{V i j} + e_{i} + κ_{i j}$
		Stage 1 Stratum 2, Stage 2 Stratum A: $- 11.1 + 0.1 Y_{0 i j} - 3 X_{i j} + 1.5 Z_{i j} + 1.1 T_{V i j} + e_{i} + κ_{i j}$
		Stage 1 Stratum 2, Stage 2 Stratum B: $- 11 + 0.1 Y_{0 i j} - 2.95 X_{i j} + 1.5 Z_{i j} + 1.1 T_{V i j} + e_{i} + κ_{i j}$
		$e_{i} ~ N (0, 4), κ_{i j} ~ N (0, 4)$


NOTE: Subscript $i$ denotes the PSU, and subscript $j$ denotes the individual.

NOTE: The first 79 PSUs were assigned to Stage 1 Stratum 1, and the last 297 PSUs were assigned to Stage 1 Stratum 2. 60% of the individuals were randomly assigned to Stage 2 Stratum A, and the remaining 40% of individuals were assigned to Stage 2 Stratum B.

¹ $\text{Var}(d_{i}) / [\text{Var}(d_{i}) + \text{Var}(ϵ_{i j})]$ equals the intra-class correlation of $Y_{V i j}$ at the PSU level.

For systolic blood pressure at the follow-up visit (used to calculate systolic blood pressure change between study visits), both an untreated value $(Y_{V})$ and a treated value $(Y_{V}^{trt})$ were generated for all individuals (regardless of whether the individual was actually treated by the follow-up visit). The observed value of systolic blood pressure at the follow-up visit depended on whether the individual actually started treatment prior to the follow-up visit (i.e., only the untreated value would be observed if the individual was untreated, and only the treated value would be observed if the individual was treated). In addition, only individuals with hypertension by the follow-up visit would be eligible to start treatment by the follow-up visit. Therefore, a binary indicator of hypertension at follow-up $(B_{V})$ was generated, and then time between baseline and treatment initiation $(T_{M})$ was generated only for those with hypertension at follow-up (i.e., $B_{V} = 1$ ). An indicator of treatment at follow-up $(R_{V})$ was defined as 0 for those without hypertension (i.e., no one without hypertension was treated) and as $I (T_{M} \leq T_{V})$ for those with hypertension (i.e., an indicator that time to treatment initiation was not longer than the time to follow-up). Then, observed systolic blood pressure at follow-up was obtained for all individuals as $Y_{D} = R_{V} Y_{V}^{trt} + (1 - R_{V}) Y_{V}$ (i.e., as the treated value for those who received treatment by the follow-up visit, and as the untreated value for those who did not receive treatment by the follow-up visit). 
The untreated value of systolic blood pressure at follow-up $(Y_{V})$ was generated from a mixture of two truncated normal distributions, where one normal distribution was for those without hypertension at follow-up $(B_{V} = 0)$ and was bounded from above by 130 mmHg (the clinical cut-point for high systolic blood pressure), and the other normal distribution was for those with hypertension at follow-up $(B_{V} = 1)$ and was bounded from below by 130 mmHg. Individual-level treatment effect, $(Y_{V} - Y_{V}^{trt})$ , was generated from a normal distribution.
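To make the treatment-time and observed-outcome steps concrete, the sketch below uses the fact that a hazard of the form $2 λ t \exp \{η\}$ implies the survival function $S(t) = \exp \{- λ t^{2} e^{η}\}$ , so that $T_{M}$ can be drawn by inverse-transform sampling. The source does not state how the draws were actually made, so the sampling approach, the function names, and the argument `linear_predictor` (standing for $η$ , the term inside the exponential in Table 1) are assumptions for illustration.

```python
import numpy as np

def simulate_treatment_time(lam, linear_predictor, rng):
    """Draw T_M with hazard 2*lam*t*exp(lp) by inverting
    S(t) = exp(-lam * t**2 * exp(lp))."""
    lp = np.asarray(linear_predictor, dtype=float)
    u = 1.0 - rng.uniform(size=lp.shape)  # uniform on (0, 1]
    return np.sqrt(-np.log(u) / (lam * np.exp(lp)))

def observed_biomarker(b_v, t_m, t_v, y_v, y_v_trt):
    """Return (R_V, Y_D): the treatment indicator and observed follow-up value."""
    b_v = np.asarray(b_v)
    # R_V = 0 without disease; otherwise R_V = I(T_M <= T_V).
    r_v = np.where(b_v == 1, np.asarray(t_m) <= np.asarray(t_v), False).astype(int)
    # Y_D = R_V * Y_V^trt + (1 - R_V) * Y_V
    y_d = r_v * np.asarray(y_v_trt, dtype=float) + (1 - r_v) * np.asarray(y_v, dtype=float)
    return r_v, y_d
```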

In order to simulate an informative sampling design, the distributions for time between baseline and treatment initiation $(T_{M})$ , and the untreated $(Y_{V})$ and treated $(Y_{V}^{trt})$ systolic blood pressure variables at follow-up depended on elements of the sampling design (e.g., separate distributions for each stage 1 sampling stratum and stage 2 sampling stratum, interactions with the stage 2 sampling stratum, random effects for the PSU clusters). In addition, to induce dependence between untreated systolic blood pressure change and the treatment process, an auxiliary variable was generated $(Z)$ , and the distributions for time-to-treatment and the untreated and treated systolic blood pressure at follow-up also depended on this auxiliary variable; therefore conditional independence of $R_{V}$ and $Y_{V}$ held for these simulated data within the subgroup with hypertension (i.e., with $B_{V} = 1$ ), conditional on this auxiliary variable, the predictor, time between baseline and follow-up, baseline systolic blood pressure, and the study design (i.e., assumption (A3) from Web Appendix A held with these data).

To assess how the performance of the proposed methods depends on the population proportions with disease and with treatment by follow-up, four data scenarios were considered based on different combinations of low (50%) and high (70%) population proportion with no disease by the follow-up visit $(B_{V} = 0)$ , and low (15%) and high (25%) population proportion with treatment by the follow-up visit $(R_{V} = 1)$ . Table 2 summarizes how the data generation parameter values differed for these four data scenarios.

TABLE 2.

Summary of data generation parameters for different population proportions with disease and with treatment at follow-up visit in the simulation study

Scenario	% No Disease ( $B_{V} = 0$ )	% Treated ( $B_{V} = 1$ , $R_{V} = 1$ )	% Untreated Disease ( $B_{V} = 1$ , $R_{V} = 0$ )	$c_{0}$	$λ$

1	50	15	35	0	0.02
2		25	25	0	0.04
3	70	15	15	−0.9	0.04
4		25	5	−0.9	0.15


4.3 |. Simulation Study 1: Comparison of the Proposed Methods to Common Ad Hoc Methods

This section describes Simulation Study 1, which compared the proposed methods to common ad hoc methods.

4.3.1 |. Analysis Methods

Two proposed methods were compared: IPW Proposed and Doubly Robust (DR) Proposed. For each proposed method, the time-to-treatment model and/or imputation model was fit only among those with disease at follow-up; the time-to-treatment model was specified as a Cox proportional hazards model, and the imputation model was specified as a linear regression model for untreated change in systolic blood pressure with an identity link function. For comparison, the IPW method was also repeated predicting the treatment probability based on a logistic regression model for a binary indicator of treatment at follow-up (IPW Logistic), and the doubly robust method was repeated fitting the imputation model among all untreated individuals (i.e., regardless of disease status; DR All Imputation). All time-to-treatment models and imputation models controlled for the predictor, time between baseline and follow-up (imputation model only), auxiliary variable related to treatment and systolic blood pressure change, baseline systolic blood pressure, stage 1 sampling stratum, stage 2 sampling stratum, sampling weights, and any interactions from the true data generation models for time-to-treatment and systolic blood pressure change (see the data generation models for time-to-treatment and untreated systolic blood pressure at follow-up, which are provided in Table 1). Standard errors for the IPW and DR methods were estimated using the bootstrap procedure described in Section 3. For comparison, standard errors for the IPW Proposed and DR Proposed methods were also estimated using the robust variance estimators based on formulas (3) and (4) assuming the IP weights and imputed biomarker change values were known with certainty (IPW Proposed Robust SE and DR Proposed Robust SE).
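For intuition about the IPW Proposed weights, the sketch below converts Cox model output into an IP weight of the form $(1 - R_{V}) / \hat{π}$ , where $\hat{π}$ is the predicted probability of remaining untreated through $T_{V}$ . It assumes that the baseline cumulative hazard at $T_{V}$ and the Cox linear predictor are already available from a fitted model, and that $\hat{π} = 1$ for individuals without disease (since they could not be treated); the function and argument names are hypothetical, and this is not a reproduction of the authors' code.

```python
import numpy as np

def ip_weights(r_v, b_v, cum_baseline_hazard_at_tv, linear_predictor):
    """IP weights (1 - R_V) / pi_hat from Cox model quantities."""
    r_v = np.asarray(r_v)
    b_v = np.asarray(b_v)
    h0 = np.asarray(cum_baseline_hazard_at_tv, dtype=float)
    lp = np.asarray(linear_predictor, dtype=float)
    # Cox model: P(untreated through T_V | covariates) = exp(-H0(T_V) * exp(lp))
    pi_hat = np.exp(-h0 * np.exp(lp))
    # Assumption: individuals without disease are untreated with probability 1.
    pi_hat = np.where(b_v == 1, pi_hat, 1.0)
    # Treated individuals receive weight 0; untreated receive 1 / pi_hat.
    return (1 - r_v) / pi_hat
```

An untreated individual with disease whose predicted probability of remaining untreated is 0.5 receives a weight of 2, which is how a small untreated group can end up representing a much larger treated group.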

These methods were compared to common ad hoc methods. The Observed method used the observed systolic blood pressure value at follow-up for the entire sample. The Exclude Treated method excluded treated observations from analysis. Adjustment value methods added an adjustment value to the observed systolic blood pressure value at follow-up for treated individuals, and used this adjusted systolic blood pressure value in analysis. The adjustment value equaled a constant value of 10 mmHg for the Add Constant method; equaled $\exp \{k\}$ , where $k$ was randomly drawn from a normal distribution with mean $\log (10)$ and standard deviation $\log (1.7)$ , for the Add Random Value method; and equaled 9.1 + 0.1 × (baseline systolic blood pressure − 154) for the Add Expected Effect method. The magnitudes of all adjustment values were based on results from a meta-analysis of the treatment effect of antihypertensive medication on systolic blood pressure²⁴. In addition, a Gold Standard method was implemented, where the final regression model of interest was fit using the untreated systolic blood pressure at follow-up for the entire sample. Note that the Gold Standard method was only possible in the simulation study because the untreated values were generated for everyone; it could not be implemented in practice. The Gold Standard and all ad hoc comparison methods incorporated the complex sampling design using the survey package in R^25,26.
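The three adjustment value methods can be sketched as follows. The expected-effect expression is taken to be 9.1 + 0.1 × (baseline systolic blood pressure − 154), which is a reading of the text rather than a confirmed formula, and the function name and `method` labels are hypothetical.

```python
import numpy as np

def adjusted_followup_sbp(y_obs, treated, baseline_sbp, method, rng=None):
    """Add an adjustment value (mmHg) to observed follow-up SBP for treated individuals."""
    y_obs = np.asarray(y_obs, dtype=float)
    treated = np.asarray(treated).astype(bool)
    baseline_sbp = np.asarray(baseline_sbp, dtype=float)
    if method == "constant":
        adj = np.full(y_obs.shape, 10.0)
    elif method == "random":
        # k ~ Normal(mean = log 10, sd = log 1.7); adjustment = exp(k)
        k = rng.normal(np.log(10.0), np.log(1.7), size=y_obs.shape)
        adj = np.exp(k)
    elif method == "expected":
        # Assumed reading of the expected-effect expression in the text.
        adj = 9.1 + 0.1 * (baseline_sbp - 154.0)
    else:
        raise ValueError(f"unknown method: {method}")
    # Untreated individuals keep their observed value.
    return np.where(treated, y_obs + adj, y_obs)
```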

4.3.2 |. Results

Figure 5 displays regression estimates (i.e., increase in systolic blood pressure change (mmHg) associated with a 1-unit increase in the predictor) and 95% confidence interval coverage probabilities of the regression coefficient for the predictor variable. Simulation results for all regression coefficients are presented in the Supplementary Material (Web Appendix C). The true population parameters were calculated by fitting the linear regression model of interest using the population dataset, where systolic blood pressure change was calculated using the true untreated blood pressure at follow-up; bias and 95% confidence interval coverage were calculated by comparing the sample estimates to these true population parameters. The IPW Proposed and DR Proposed methods were essentially unbiased with approximately nominal confidence interval coverage for most scenarios, with one exception: when the proportion of the sample with untreated disease was small (scenario 4: 25% with treated disease, 5% with untreated disease), the IPW Proposed method exhibited bias and low confidence interval coverage. In this scenario, a small group of people (with untreated disease) was being weighted to represent a treated group that was 5 times as large, and for some individuals in the treated group there were no individuals in the untreated disease group who were comparable in terms of the predicted probability of no treatment (i.e., in terms of the observed auxiliary variables). Precision for the IPW Proposed method generally improved as the untreated proportion of the subgroup with disease increased. The estimates from the ad hoc (Observed, Exclude Treated, Add Constant, Add Random Value, Add Expected Effect) and IPW Logistic methods were biased. 
For the methods that involved adding an adjustment value to any biomarker values measured while on treatment (Add Constant, Add Random Value, Add Expected Effect), bias increased and confidence interval coverage decreased as the treated proportion of the sample increased. Although the imputation model was misspecified for the DR All Imputation method (since the imputation model was fit based on all untreated individuals regardless of disease status), this method was unbiased in most situations due to the double robustness property and the correctly specified time-to-treatment model; however, the DR All Imputation method was generally less precise than the DR Proposed method. Bootstrap standard errors estimated the empirical standard errors well for the proposed methods; robust standard errors assuming known IP weights and imputed values also performed reasonably well for the proposed methods, but generally over-estimated the empirical standard errors (Web Appendix C).

FIGURE 5. Regression estimates (mmHg) and 95% confidence interval coverage of the coefficient for the predictor variable from all compared methods in Simulation Study 1 based on 1000 simulated datasets for varying population proportions of disease and treatment. Dashed vertical lines correspond to the true population parameter, based on fitting the model of interest in the population dataset. The 95% confidence interval coverage, presented on the right side of each scenario, was calculated as the proportion of samples with a 95% confidence interval that contained the true population parameter.

4.4 |. Simulation Study 2: Impact of Time-to-Treatment and Imputation Model Misspecification

This section describes Simulation Study 2, which evaluated the proposed methods when the time-to-treatment model and/or the imputation model was misspecified.

4.4.1 |. Analysis Methods

The two proposed methods were compared: IPW Proposed and DR Proposed (see the descriptions of these methods in Section 4.3.1). The time-to-treatment model and/or imputation model was fit only among those with disease at follow-up; the time-to-treatment model was specified as a Cox proportional hazards model, and the imputation model was specified as a linear regression model for untreated change in systolic blood pressure with an identity link function. For each method, the time-to-treatment model and/or imputation model either included the correct set of effects (including the predictor, time between baseline and follow-up (imputation model only), auxiliary variable, baseline systolic blood pressure, stage 1 stratum, stage 2 stratum, sampling weights, and relevant interactions) or omitted the auxiliary variable. Standard errors were estimated using the bootstrap procedure described in Section 3.

4.4.2 |. Results

Table 3 presents the bias, relative bias, empirical SE, mean estimated SE, and 95% confidence interval coverage of the regression coefficient for time between study visits. The IPW Proposed (with a correctly specified time-to-treatment model) and DR Proposed (with a correctly specified time-to-treatment model and/or imputation model) methods were essentially unbiased with approximately nominal confidence interval coverage, with the exception of the IPW Proposed method for simulation scenario 4 where the proportion of the sample with untreated disease was small (as was also noted for Simulation Study 1 presented in Section 4.3.2). For the regression coefficient for time between study visits, there was small bias for the IPW Proposed method when the time-to-treatment model was misspecified and for the DR Proposed method when both the time-to-treatment and imputation models were misspecified.
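The summary measures reported for each method can be computed from the per-sample estimates and standard errors along the following lines. This is a minimal sketch that assumes Wald-type intervals (estimate ± 1.96 × SE), relative bias defined as bias divided by the true parameter, and empirical SE defined as the standard deviation of the estimates across samples; the function name is hypothetical.

```python
import numpy as np

def simulation_summary(estimates, std_errors, truth, z=1.959964):
    """Summarize simulation performance for one coefficient."""
    est = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    bias = est.mean() - truth
    # Assumed Wald-type 95% confidence limits.
    lower, upper = est - z * se, est + z * se
    covered = (lower <= truth) & (truth <= upper)
    return {
        "bias": bias,
        "relative_bias": bias / truth,
        "empirical_se": est.std(ddof=1),
        "mean_estimated_se": se.mean(),
        "coverage_pct": 100.0 * covered.mean(),
    }
```

Comparing `empirical_se` with `mean_estimated_se` indicates whether the standard error estimator tracks the actual sampling variability.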

TABLE 3.

Comparison of proposed methods with misspecified models (Simulation Study 2): Time between study visits ( $T_{V}$ )

Scenario	% No disease	% Treated disease	% Untreated disease	Method	$T_{M}^{a}$	$Y_{V}^{a}$	Bias	Relative Bias	Empirical SE	Estimated SE	Coverage

1	50	15	35	IPW Proposed	T		0.020	0.010	0.390	0.375	94.1
					F		−0.091	−0.043	0.356	0.348	93.5
				DR Proposed	T	T	0.015	0.007	0.318	0.323	94.6
					T	F	0.020	0.009	0.322	0.325	95.0
					F	T	0.013	0.006	0.318	0.323	94.8
					F	F	−0.081	−0.038	0.314	0.318	94.5
2		25	25	IPW Proposed	T		−0.010	−0.005	0.522	0.489	93.0
					F		−0.203	−0.098	0.370	0.378	91.2
				DR Proposed	T	T	0.024	0.011	0.324	0.324	93.4
					T	F	0.038	0.018	0.345	0.338	93.5
					F	T	0.019	0.009	0.322	0.322	94.2
					F	F	−0.137	−0.066	0.315	0.315	91.9
3	70	15	15	IPW Proposed	T		−0.025	−0.020	0.522	0.486	92.4
					F		−0.128	−0.102	0.370	0.370	93.1
				DR Proposed	T	T	0.002	0.001	0.298	0.298	95.0
					T	F	0.003	0.003	0.306	0.306	94.7
					F	T	0.010	0.008	0.296	0.297	95.0
					F	F	−0.080	−0.064	0.288	0.290	93.9
4		25	5	IPW Proposed	T		−1.534	−1.180	0.984	0.900	54.9
					F		−0.512	−0.394	0.717	0.664	81.6
				DR Proposed	T	T	−0.001	−0.000	0.306	0.304	94.2
					T	F	−0.236	−0.182	0.320	0.322	88.2
					F	T	−0.004	−0.003	0.299	0.301	94.1
					F	F	−0.175	−0.134	0.285	0.288	91.4


Abbreviations: SE, standard error.

Note: The true population parameter was calculated based on fitting the model of interest in the population dataset.

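^a Columns $T_{M}$ and $Y_{V}$ indicate the specification of the time-to-treatment model and the imputation model, respectively. $T$ = The model was specified correctly. $F$ = The model was misspecified by excluding the auxiliary variable $Z$ .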

5 |. APPLICATION TO HCHS/SOL

Previous observational studies and randomized controlled trials of weight loss have shown body mass and blood pressure to be directly associated²⁷. Therefore, the proposed methods were applied to HCHS/SOL to assess whether this association of body mass index (BMI; kg/m²) with blood pressure change holds in the target population of Hispanic/Latino adults living in the US, and to exemplify the properties of these proposed methods. The goal of this analysis was to estimate the association of baseline BMI with natural (i.e., untreated) blood pressure change between study visits (separately for systolic and diastolic blood pressure change), estimated by a linear regression model with an identity link function, adjusted for the baseline blood pressure, age, sex, time between study visits, and study field center. Since a subset of the analytic sample was taking antihypertensive medications at the follow-up visit (n=1,520, weighted proportion: 12%), and due to the complex survey sampling design, we applied the proposed methods to this analysis. This analysis included 8,935 individuals (see Figure 6 for a description of exclusion criteria for the analytic sample).

FIGURE 6. Description of exclusion criteria for the analytic sample in analysis of HCHS/SOL data

5.1 |. Measures

5.1.1 |. Blood Pressure

HCHS/SOL collected sitting blood pressure measurements at both the baseline and follow-up clinic visits following the same protocol. Three consecutive blood pressure measurements were taken with an automated oscillometric device after a five-minute rest period, and the average of the three measurements was used.

5.1.2 |. Body Mass

To allow for a non-linear association between BMI and blood pressure change, BMI was categorized according to the World Health Organization classifications²⁸. Since very few individuals in the analytic sample were classified as underweight (BMI <18.5 kg/m², weighted prevalence 1%), underweight and normal weight were combined to create the following categories: (1) underweight or normal weight (BMI <25 kg/m², weighted proportion 26%), (2) overweight (25 kg/m² ≤ BMI <30 kg/m², weighted proportion 37%), (3) obese (BMI ≥ 30 kg/m², weighted proportion 37%).
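The three-level BMI variable follows directly from the cut-points given above; a minimal sketch (with a hypothetical function name and labels) is:

```python
def bmi_category(bmi):
    """Map BMI (kg/m^2) to the three analysis categories."""
    if bmi < 25:
        return "underweight or normal weight"
    if bmi < 30:
        return "overweight"
    return "obese"
```

Note that the boundaries are half-open, so a BMI of exactly 25 kg/m² is classified as overweight and exactly 30 kg/m² as obese.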

5.1.3 |. Covariates

The following baseline auxiliary variables were included in all time-to-treatment and blood pressure change imputation models (in addition to baseline systolic and diastolic blood pressure, age, sex, time between study visits (imputation model only), and study field center from the model of interest): Hispanic/Latino background, born in the US, years lived in the US, physical activity, dietary sodium, dietary potassium, energy intake, diet quality, education, income, marital status, health insurance coverage, cigarette use, diabetes, chronic kidney disease, total blood cholesterol, family history of coronary heart disease, fasting glucose, employment status, frequency of doctor visits, history of cardiovascular disease, and pregnancy history (for women). In addition, since HCHS/SOL used a complex survey sampling design, the sampling design variables also were included as covariates in all time-to-treatment and blood pressure change imputation models. Since the number of clusters from sampling stage 1 was large in HCHS/SOL (643 clusters for the analytic sample), it was not feasible to directly include cluster membership as covariates in the time-to-treatment and imputation models. Therefore, sampling weights and stage 1 study strata (i.e., a proxy for cluster membership) were included as covariates in all time-to-treatment and imputation models. Twenty-seven participants (<1%) were excluded from the analytic sample due to missing blood pressure measurements at either the baseline or follow-up clinic visits. A small subset of the final analytic sample (n=515, 6%) had missing data on at least one of the covariates; since the amount of missing data was small, and the focus of this analysis was to illustrate application of the proposed methods, missing data for the covariates were imputed by the sample mean (for continuous and binary covariates) or by randomly assigning a category with probabilities equal to the sample proportions (for categorical covariates).

5.1.4 |. Time-to-Treatment

During annual follow-up (AFU) phone calls, participants were asked “Since our last phone interview with you…, has a doctor or health professional told you that you had high blood pressure or hypertension?”; if the answer was “yes”, then the participant was asked “Did the doctor recommend any new or different treatments?”; if the answer was again “yes”, the participant was asked “What treatment was recommended?” (e.g., start new medicine, increase dose of existing medicine, advice to change health behaviors). For those taking antihypertensive treatment at the follow-up visit, the time of treatment initiation was approximated by the time of the first annual follow-up survey in which the participant indicated that their doctor recommended starting a new medicine and/or increasing the dose of an existing medicine, or the time of the follow-up visit, whichever occurred first. A sensitivity analysis was also conducted by instead approximating the time of treatment initiation as the mid-point of the time interval between the first survey in which the participant indicated treatment and the previous survey. Figure 7 illustrates how time-to-treatment was defined for both the original analysis and the sensitivity analysis.
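The two time-to-treatment definitions can be sketched for a single treated participant as below. The function and argument names are hypothetical, and two details are assumptions not stated in the text: the baseline visit (time 0) is treated as the contact preceding the first AFU call, and the follow-up visit time is used under both definitions when no earlier call reported treatment.

```python
def time_to_treatment(survey_times, reported_treatment, visit_time, midpoint=False):
    """Approximate treatment initiation time for a participant treated at follow-up.

    survey_times: ascending AFU call times, measured from baseline
    reported_treatment: booleans, True if new or increased medication was reported
    visit_time: time of the follow-up clinic visit
    midpoint: if True, use the sensitivity-analysis definition
    """
    previous = 0.0  # assumption: baseline precedes the first AFU call
    for t, flag in zip(survey_times, reported_treatment):
        if flag and t <= visit_time:
            # Primary: time of the first call reporting treatment.
            # Sensitivity: mid-point between that call and the previous one.
            return (previous + t) / 2.0 if midpoint else t
        previous = t
    # Assumption: no earlier report, so fall back to the follow-up visit time.
    return visit_time
```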

FIGURE 7. Illustration of time-to-treatment definition for HCHS/SOL for both the original analysis and the sensitivity analysis

5.1.5 |. Hypertension

Hypertension at the follow-up visit was defined as systolic blood pressure above 130 mmHg, diastolic blood pressure above 80 mmHg, or self-reported use of antihypertensive medications at the follow-up visit¹³ (3,634 participants with hypertension at follow-up visit, weighted proportion 34%).

5.2 |. Analysis Methods

Both proposed methods (IPW Proposed and DR Proposed) and the ad hoc methods (Observed, Exclude Treated, Add Constant, Add Random Value, Add Expected Effect) were implemented in this analysis. Again, the time-to-treatment model was specified as a Cox proportional hazards model, and the imputation model was specified as a linear regression model with an identity link function. The median estimated IP weight among the subgroup with untreated disease was 1.4 (minimum=1.0, 1st percentile=1.1, 99th percentile=6.3, maximum=42.4). Table 4 compares descriptive statistics for the covariates for the full analytic sample, for the untreated subgroup without applying IP weighting, and for the untreated subgroup after applying the proposed IP weighting. The descriptive statistics for the untreated subgroup with IP weighting were very similar to those from the full analytic sample, suggesting that the IP weighted sample sufficiently resembled the analytic sample based on the distributions of the observed covariates. Standard errors for the IPW Proposed and DR Proposed methods were estimated using the bootstrap approach described in Section 3. Scaled bootstrap sampling weights for the HCHS/SOL sample were calculated as described in Section 3 and recommended by Rao, Wu, and Yue²¹, and then the scaled bootstrap weights were adjusted for non-response, trimmed, calibrated, and normalized based on the 2010 Census, using procedures similar to those used for the sampling weights for the original HCHS/SOL sample^2,29.

TABLE 4.

Means/percents and standard errors (SE) for the covariates from the time-to-treatment model in the HCHS/SOL analysis for the full analytic sample, the untreated subgroup without IP weights, and the untreated subgroup with IP weights

	Analytic Sample^a		Exclude Treated^b		IPW Proposed^c

Covariate	Mean/Percent	SE	Mean/Percent	SE	Mean/Percent	SE

Systolic blood pressure	116.67	0.25	114.55	0.24	116.20	0.29
Diastolic blood pressure	70.96	0.20	69.74	0.20	70.74	0.22
Age	37.82	0.25	36.14	0.23	37.44	0.26
Gender (%)	49.53	0.74	49.68	0.79	49.68	0.79
Hispanic/Latino background (%)
Dominican	9.03	0.75	8.76	0.75	8.98	0.75
Central American	8.19	0.77	8.38	0.81	8.40	0.80
Cuban	17.55	1.51	16.61	1.47	17.25	1.53
Mexican	41.42	1.73	42.67	1.76	41.47	1.73
Puerto Rican	14.06	0.81	13.40	0.82	14.06	0.86
South American	4.97	0.42	5.14	0.46	5.02	0.43
Multiple/other	4.77	0.42	5.05	0.46	4.82	0.43
US born (%)	24.91	0.97	26.70	1.04	25.48	1.01
Years in the US	18.69	0.32	18.19	0.32	18.71	0.34
Meets 2008 activity level guidelines (%)	70.00	0.80	71.13	0.82	70.38	0.83
Usual daily sodium intake	3416.75	23.15	3440.27	23.87	3421.49	24.20
Usual daily potassium intake	2499.65	12.90	2496.94	13.92	2495.33	13.35
Usual daily energy intake	2027.61	10.28	2045.72	11.01	2032.75	10.71
Alternative Healthy Eating Index 2010	47.34	0.18	47.13	0.19	47.26	0.18
Education (%)
No high school diploma/GED	30.31	0.91	29.12	0.96	29.62	0.98
At most high school diploma/GED	29.01	0.79	29.73	0.85	29.36	0.85
Greater than high school diploma/GED	40.68	1.02	41.15	1.09	41.02	1.08
Income (%)
< $30,000	60.66	1.09	59.76	1.16	60.52	1.15
$30,000+	33.73	1.14	34.61	1.19	33.96	1.16
Missing	5.61	0.43	5.63	0.47	5.52	0.46
Marital status (%)
Single	37.71	0.96	39.72	1.03	38.21	0.98
Married or living with a partner	49.45	1.05	49.15	1.09	49.32	1.05
Separated, divorced, or widow(er)	12.85	0.57	11.13	0.52	12.47	0.60
Health insurance coverage (%)	46.48	1.10	45.50	1.15	46.29	1.15
Current smoker (%)	20.19	0.72	19.65	0.77	20.19	0.79
Body mass index (BMI) (%)
Underweight/normal weight (BMI < 25 kg/m²)	25.67	0.79	27.31	0.86	25.98	0.82
Overweight (25 kg/m² ≤ BMI < 30 kg/m²)	37.45	0.81	37.20	0.83	37.22	0.82
Obese (BMI ≥ 30 kg/m²)	36.88	0.84	35.50	0.91	36.80	0.92
Diabetes (%)	8.85	0.39	6.63	0.36	8.01	0.46
Chronic kidney disease (stage 2 – 5) (%)	39.77	0.85	37.74	0.91	39.57	0.93
Total cholesterol (mg/dL)	193.34	0.73	191.28	0.79	193.20	0.80
Family history of coronary heart disease (%)	24.37	0.74	22.43	0.77	24.13	0.81
Fasting glucose (mg/dL)	98.60	0.45	96.53	0.39	97.54	0.44
Employment status (%)
Retired and not currently employed	4.04	0.28	2.84	0.25	3.62	0.33
Not retired and not currently employed	40.66	0.88	40.74	0.92	40.48	0.93
Employed part-time	18.55	0.61	19.18	0.67	18.76	0.65
Employed full-time	36.75	0.82	37.24	0.87	37.14	0.87
Frequency of doctor visits in past year (%)
None	35.15	0.91	36.10	0.99	35.45	0.98
Once	18.63	0.69	19.12	0.74	18.69	0.72
2–3 times	23.73	0.76	23.82	0.81	23.72	0.81
> 3 times	22.49	0.69	20.96	0.70	22.14	0.74
High total cholesterol	38.28	0.80	36.56	0.86	38.13	0.86
History of cardiovascular disease (%)	3.10	0.30	2.71	0.33	3.10	0.39
Any pregnancy history (%)	40.82	0.72	39.82	0.77	40.62	0.78


NOTE: All values weighted for study design and non-response at follow-up (i.e., weighted by the study sampling weights).

^a Based on entire analytic sample. The purpose of the IP weights is to re-weight the untreated subgroup to resemble this sample.

^b Based on the untreated subgroup only, without applying the IP weights.

^c Based on the untreated subgroup only, weighted by the product of the sampling weights and the proposed IP weights.
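The comparison in Table 4 amounts to three weighted means per covariate. The sketch below shows the arithmetic on made-up records; the variable names and numbers are hypothetical, and the probability of remaining untreated is taken as given, whereas in the analysis it comes from the Cox time-to-treatment model. Standard errors are not computed here.

```python
from typing import Sequence


def weighted_mean(values: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted mean: sum(w * x) / sum(w)."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * x for w, x in zip(weights, values)) / total


# Toy records (made-up numbers): baseline systolic blood pressure, sampling
# weight, untreated indicator, and estimated probability of remaining untreated.
sbp = [110.0, 118.0, 135.0, 142.0]
sampling_w = [1.0, 2.0, 1.0, 2.0]
untreated = [True, True, True, False]
p_untreated = [0.9, 0.8, 0.4, 0.3]

# Column 1: full analytic sample, sampling weights only.
full = weighted_mean(sbp, sampling_w)

# Column 2: untreated subgroup, sampling weights only.
idx = [i for i, u in enumerate(untreated) if u]
excl = weighted_mean([sbp[i] for i in idx], [sampling_w[i] for i in idx])

# Column 3: untreated subgroup, sampling weight x IP weight (1 / probability).
ipw = weighted_mean(
    [sbp[i] for i in idx],
    [sampling_w[i] / p_untreated[i] for i in idx],
)
```

In this toy example the IP-weighted mean of the untreated subgroup lies between the unweighted subgroup mean and the full-sample mean, which is the direction of adjustment the weights are meant to produce.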

5.3 |. Results

Table 5 presents estimated regression coefficients and standard errors (SE) for the association of categorized BMI with blood pressure change (systolic and diastolic) over an average follow-up of 6.1 years (standard deviation = 0.9 years), based on all compared methods. For both systolic and diastolic blood pressure change, estimated associations for the proposed methods (IPW Proposed and DR Proposed) were similar to each other, but differed from the estimated associations for some of the ad hoc methods. For example, the results from the DR Proposed method show that being overweight was associated with a systolic blood pressure change during follow-up that was 0.432 mmHg lower compared to being underweight/normal weight, and being obese was associated with a systolic blood pressure change during follow-up that was 0.169 mmHg higher compared to being underweight/normal weight, but these associations were not statistically significant; note that the average systolic blood pressure change over follow-up was a decrease of 1.307 mmHg (standard deviation = 10.7) for the subgroup without hypertension at follow-up and an increase of 10.023 mmHg (standard deviation = 13.5) for the subgroup with untreated hypertension at follow-up. Under the same method, being overweight was associated with a diastolic blood pressure change during follow-up that was 1.154 mmHg higher compared to being underweight/normal weight, and being obese was associated with a diastolic blood pressure change during follow-up that was 1.666 mmHg higher compared to being underweight/normal weight; note that the average diastolic blood pressure change over follow-up was a decrease of 0.414 mmHg (standard deviation = 8.8) for the subgroup without hypertension at follow-up and an increase of 6.150 mmHg (standard deviation = 9.5) for the subgroup with untreated hypertension at follow-up. The sensitivity analysis based on the alternative approximation for time of treatment initiation produced similar results for the proposed methods (IPW Proposed and DR Proposed) (Table 5).

TABLE 5.

Regression coefficient estimates and standard errors for association^a of categorized body mass index with systolic and diastolic blood pressure change (mmHg) in the HCHS/SOL target population using all compared methods

	Systolic blood pressure (mmHg)				Diastolic blood pressure (mmHg)
	Overweight^b		Obese^b		Overweight^b		Obese^b

Method	Estimate	SE	Estimate	SE	Estimate	SE	Estimate	SE

Original Analysis^c

Observed	−0.252	0.440	−0.036	0.467	1.312	0.326	1.722	0.391
Exclude Treated	−0.465	0.448	−0.122	0.459	1.017	0.340	1.506	0.401
Add Constant	−0.272	0.442	0.303	0.465	1.144	0.326	1.699	0.384
Add Random	−0.284	0.447	0.487	0.474	1.126	0.331	1.856	0.392
Add Expected Effect	−0.330	0.439	0.073	0.462	1.244	0.324	1.651	0.386
IPW Proposed	−0.503	0.364	0.093	0.389	0.994	0.294	1.601	0.343
DR Proposed	−0.432	0.370	0.169	0.379	1.154	0.292	1.666	0.333

Sensitivity Analysis^d

IPW Proposed	−0.533	0.366	0.130	0.391	0.976	0.293	1.610	0.346
DR Proposed	−0.462	0.373	0.196	0.380	1.132	0.291	1.669	0.335


Abbreviations: SE, standard error.

NOTE: There were 5,301 participants in the analytic sample (weighted proportion 66%) without hypertension at follow-up, 2,114 participants (weighted proportion 22%) with untreated hypertension at follow-up, and 1,520 participants (weighted proportion 12%) with treated hypertension at follow-up. The average length of follow-up was 6.1 years (standard deviation = 0.9 years). The average systolic blood pressure change over follow-up was a decrease of 1.307 mmHg (standard deviation = 10.7) for the subgroup without hypertension at follow-up and an increase of 10.023 mmHg (standard deviation = 13.5) for the subgroup with untreated hypertension at follow-up. The average diastolic blood pressure change over follow-up was a decrease of 0.414 mmHg (standard deviation = 8.8) for the subgroup without hypertension at follow-up and an increase of 6.150 mmHg (standard deviation = 9.5) for the subgroup with untreated hypertension at follow-up.

^a Adjusted for baseline blood pressure, age, sex, time between study visits, and study field center.

^b Body mass index (BMI) was categorized as underweight or normal weight (BMI < 25 kg/m²), overweight (25 kg/m² ≤ BMI < 30 kg/m²), or obese (BMI ≥ 30 kg/m²). The reference category was underweight or normal weight.

^c For the original analysis, time of treatment initiation was approximated as the time of the first survey in which the participant indicated treatment.

^d For the sensitivity analysis, time of treatment initiation was approximated as the mid-point of the time interval between the first survey in which the participant indicated treatment and the previous survey.
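As a reading aid for Table 5, the estimates and bootstrap standard errors can be turned into approximate Wald statistics and 95% confidence intervals. The sketch below assumes a normal approximation, which is an assumption of this example and not necessarily how significance was assessed in the analysis; the function name is hypothetical.

```python
from statistics import NormalDist


def wald_interval(estimate: float, se: float, level: float = 0.95):
    """Return (z statistic, lower bound, upper bound) under a normal approximation."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate / se, estimate - z_crit * se, estimate + z_crit * se


# DR Proposed rows of Table 5 (original analysis): (estimate, SE) in mmHg.
dr_proposed = {
    "SBP, overweight": (-0.432, 0.370),
    "SBP, obese": (0.169, 0.379),
    "DBP, overweight": (1.154, 0.292),
    "DBP, obese": (1.666, 0.333),
}

for label, (est, se) in dr_proposed.items():
    z, lo, hi = wald_interval(est, se)
    print(f"{label}: z = {z:.2f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```

Under this approximation, both systolic intervals include zero, in line with the statement that the systolic associations were not statistically significant.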

6 |. DISCUSSION

We proposed IPW and doubly robust approaches for estimating associations between risk factors and longitudinal biomarker change, incorporating a complex survey sampling design. We considered biomarker values measured while on treatment as missing data, and used all available information about the treatment mechanism by modeling time-to-treatment using survival analysis methods to predict the treatment probability at the follow-up time. The proposed estimators were solutions to sample-weighted sets of estimating equations. Simulation studies showed that both the IPW (with correctly specified time-to-treatment model) and doubly robust (with correctly specified time-to-treatment model and/or imputation model) approaches were essentially unbiased and produced approximately nominal confidence interval coverage for most simulation scenarios considered. Web Appendix D provides sample R code to implement the proposed IPW and doubly robust methods in practice.

Bootstrap estimators were proposed to estimate the covariance matrix of the proposed estimators²¹; these account for both the variability due to the complex survey sampling and the uncertainty due to estimation of the time-to-treatment model and/or the imputation model for biomarker change. Alternatively, if computational resources are too limited to accommodate the bootstrap approach, sandwich variance estimators based on Taylor series linearization⁶ can be used, treating the IP weights and imputed biomarker change values as known. This approach provides conservative variance estimation in practice when the IP weights and imputed biomarker change values are actually estimated^23,12, and it performed reasonably well in the practical situations considered in the simulation studies presented in Section 4. Both approaches to variance estimation can be easily implemented using existing and widely available statistical software for the analysis of complex survey data, such as the survey package in R^25,26. Recently, Beaumont and Emond³⁰ proposed a more general bootstrap procedure that has been shown to provide accurate variance estimation based on complex survey samples. Investigating the performance of this approach in our setting is a topic for future work.
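To make the resampling step concrete, the following sketch generates one replicate of rescaled bootstrap weight multipliers for a stratified design with primary sampling units (PSUs). It assumes the commonly used form of the Rao, Wu, and Yue²¹ rescaling, in which n - 1 PSUs are drawn with replacement from a stratum containing n PSUs; the function name and data layout are hypothetical, and the later adjustments applied in the HCHS/SOL analysis (non-response adjustment, trimming, calibration, normalization) are not shown.

```python
import random
from collections import Counter
from typing import Dict, List, Tuple


def rescaled_bootstrap_multipliers(
    psus_by_stratum: Dict[str, List[str]],
    rng: random.Random,
) -> Dict[Tuple[str, str], float]:
    """One bootstrap replicate of weight multipliers, one per (stratum, PSU).

    Within a stratum with n PSUs, draw n - 1 PSUs with replacement and give
    each PSU the multiplier (n / (n - 1)) * (number of times it was drawn).
    """
    multipliers = {}
    for stratum, psus in psus_by_stratum.items():
        n = len(psus)
        if n < 2:
            raise ValueError(f"stratum {stratum!r} needs at least two PSUs")
        draws = Counter(rng.choices(psus, k=n - 1))
        for psu in psus:
            # PSUs that were not drawn get a multiplier of zero.
            multipliers[(stratum, psu)] = n / (n - 1) * draws[psu]
    return multipliers


# Hypothetical design: two strata with three and two PSUs.
design = {"A": ["a1", "a2", "a3"], "B": ["b1", "b2"]}
replicate = rescaled_bootstrap_multipliers(design, random.Random(2023))
```

Each participant's replicate weight is the sampling weight multiplied by the multiplier of that participant's PSU. Re-fitting the time-to-treatment model, the IP weights or imputations, and the outcome model within every replicate is what lets the bootstrap standard errors reflect the estimation of those nuisance models.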

It is important to note that although the proposed IPW method performed well in most reasonable scenarios considered in the simulation studies presented here, it was observed that there are some circumstances where the proposed IPW method did not perform well. We found that the IPW method produced some bias when the proportion of the population with untreated disease was small relative to the proportion that were treated (e.g., 5% untreated disease vs. 25% treated). In this case, a relatively small group was being weighted to represent a much larger group. However, for a variety of populations (e.g., target population for HCHS/SOL, general population of adults in the US) the prevalence of untreated hypertension (e.g., 22% in the HCHS/SOL analysis, 21% in NHANES³¹) and the ratio of the prevalences of untreated hypertension to treated hypertension (e.g., 22%/12%=1.8 in the HCHS/SOL analysis, 21%/24%=0.9 in NHANES³¹) are higher than those for this problematic simulation scenario, suggesting that this issue may not affect common practical applications for modeling blood pressure change. In particular, it seems that this issue is due to insufficient overlap for the predicted probabilities of no treatment from the time-to-treatment model between the untreated with disease subgroup and the treated subgroup; for example, in the simulation study scenario where the proportion of the population with untreated disease was small relative to the proportion that were treated (e.g., 5% untreated disease vs. 25% treated), there was a considerable subset of treated individuals for whom there were no comparable untreated individuals with disease in the population. This issue of insufficient overlap has been known to lead to extrapolation bias when using IPW methods for causal inference, in that it leads to the treatment model extrapolating predicted values beyond the range of the data available³². 
In the simulation scenario just described, since there were treated individuals for whom there were no comparable untreated individuals with disease, this subset of treated individuals could not be represented adequately by the IP-weighted subgroup of untreated individuals; in other words, the IP-weighted sample would not represent the full sample (including all treated and untreated individuals). Therefore, when applying the proposed IPW-based methods in practice, it is important to always check that there is sufficient overlap in the probabilities of no treatment between the treated subgroup and the untreated subgroup with disease (by comparing boxplots, quantiles, or other graphical or numerical representations of the distribution of the probabilities between the two subgroups). This check helps ensure that the time-to-treatment model will not extrapolate predicted probabilities of no treatment beyond the range of the available auxiliary variable data, and that all treated individuals will be sufficiently represented in the IP-weighted subsample of untreated individuals. If there is insufficient overlap, then application of the proposed IPW methods to draw inference about untreated biomarker change in the full population (including both treated and untreated individuals) would not be recommended. It is worth noting that in the real analysis example using HCHS/SOL data presented here, we observed that the probabilities of no treatment overlapped well between the treated subgroup and the untreated subgroup with disease, based on comparing boxplots and quantile summaries of the probabilities of no treatment between the two groups.
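A numerical version of the overlap check can be as simple as comparing quartiles of the predicted probabilities of no treatment in the two subgroups. The sketch below is one possible summary under stated assumptions: the function name is hypothetical, and the share of treated individuals whose probability falls outside the range seen among untreated individuals with disease is a crude heuristic added for illustration, not a criterion taken from the analysis.

```python
from statistics import quantiles
from typing import Dict, Sequence


def overlap_summary(
    p_treated: Sequence[float],
    p_untreated_disease: Sequence[float],
) -> Dict[str, object]:
    """Compare predicted probabilities of no treatment between two subgroups."""
    lo = min(p_untreated_disease)
    hi = max(p_untreated_disease)
    # Treated individuals with no comparable untreated individual with disease,
    # judged only by the range of predicted probabilities (illustrative rule).
    outside = [p for p in p_treated if p < lo or p > hi]
    return {
        "quartiles_treated": quantiles(p_treated, n=4),
        "quartiles_untreated_disease": quantiles(p_untreated_disease, n=4),
        "untreated_disease_range": (lo, hi),
        "share_treated_outside_range": len(outside) / len(p_treated),
    }


# Made-up probabilities for illustration only.
summary = overlap_summary(
    p_treated=[0.1, 0.2, 0.5, 0.6],
    p_untreated_disease=[0.4, 0.5, 0.7, 0.9],
)
```

A large share of treated individuals outside the range, or quartiles that barely intersect, would point to the insufficient-overlap situation in which the IPW-based methods are not recommended.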

The statistical ideas presented in this research can be extended to other types of missing data methodology (e.g., multiple imputation), as a future direction. In addition, in some applications it is possible that the set of auxiliary variables $Z$ may need to include time-dependent variables for the assumption $Y_{V} \perp R_{V} \mid (Z, B_{V} = 1)$ to hold. Therefore, extending these methods to incorporate time-varying covariates in the time-to-treatment and/or imputation models would be another useful future direction for this research. The presented application to HCHS/SOL data had only a small amount of missing data on covariates included in the time-to-treatment and biomarker change imputation models (n=515, 6%). However, when applying these proposed methods in the presence of substantial missing data, appropriate methods to handle informative missing data (e.g., multiple imputation) should be used. Evaluation of how to best integrate these methods is a topic for future research.

In summary, application of these proposed methods would be useful when the research goal is to model untreated change in a biomarker using observational cohort data, but a subset of the cohort starts using medication during the course of the study, thereby interfering with the desired estimation of untreated biomarker change.


ACKNOWLEDGMENTS

This research used data from The Hispanic Community Health Study/Study of Latinos, a collaborative study supported by contracts from the National Heart, Lung, and Blood Institute (NHLBI) to the University of North Carolina (HHSN268201300001I / N01-HC-65233), University of Miami (HHSN268201300004I / N01-HC-65234), Albert Einstein College of Medicine (HHSN268201300002I / N01-HC-65235), University of Illinois at Chicago (HHSN268201300003I / N01-HC-65236 Northwestern Univ), and San Diego State University (HHSN268201300005I / N01-HC-65237). The following Institutes/Centers/Offices have contributed to the HCHS/SOL through a transfer of funds to the NHLBI: National Institute on Minority Health and Health Disparities, National Institute on Deafness and Other Communication Disorders, National Institute of Dental and Craniofacial Research, National Institute of Diabetes and Digestive and Kidney Diseases, National Institute of Neurological Disorders and Stroke, NIH Institution-Office of Dietary Supplements. This research was also partially supported by grant U01-DK-098246 from the National Institute of Diabetes, Digestive and Kidney Diseases (NIDDK), NIH, for the Glycemia Reduction Approaches in Diabetes: A Comparative Effectiveness (GRADE) Study, and by grant P01-CA-142538 from the National Cancer Institute (NCI), NIH. The authors thank Ionut Bebu for feedback on an early draft of this manuscript.

Footnotes

Conflict of interest

The authors declare no potential conflicts of interest.

Financial disclosure

None reported.

SUPPORTING INFORMATION

Additional supporting information may be found online in the Supporting Information section at the end of the article: A list of assumptions needed for Theorems 1 and 2 to hold (Web Appendix A), proofs for Theorems 1 and 2 (Web Appendix B), additional results from the simulation study (Web Appendix C), and R code to implement the proposed methods in practice (Web Appendix D).

Data availability statement

The data that support the findings of this study are available from the Hispanic Community Health Study/Study of Latinos (HCHS/SOL). Restrictions apply to the availability of these data. Information about collaborating with HCHS/SOL can be found at https://sites.cscc.unc.edu/hchs/.

References

1. Sorlie PD, Aviles-Santa LM, Wassertheil-Smoller S, et al. Design and implementation of the Hispanic Community Health Study/Study of Latinos. Annals of Epidemiology 2010; 20(8): 629–641.
2. LaVange LM, Kalsbeek W, Sorlie PD, et al. Sample design and cohort selection in the Hispanic Community Health Study/Study of Latinos. Annals of Epidemiology 2010; 20(8): 642–649.
3. Tobin MD, Sheehan NA, Scurrah KJ, Burton PR. Adjusting for treatment effects in studies of quantitative traits: Antihypertensive therapy and systolic blood pressure. Statistics in Medicine 2005; 24: 2911–2935.
4. Tsiatis AA. Semiparametric Theory and Missing Data. New York: Springer. 2006.
5. Scharfstein DO, Rotnitzky A, Robins JM. Rejoinder to adjusting for non-ignorable drop-out using semiparametric non-response models. Journal of the American Statistical Association 1999; 94(448): 1135–1146.
6. Binder DA. On the variances of asymptotically normal estimators from complex surveys. International Statistical Review 1983; 51(3): 279–292.
7. LaVange LM, Koch GG, Schwartz TA. Applying sample survey methods to clinical trials data. Statistics in Medicine 2001; 20: 2609–2623.
8. Rust KF, Rao JNK. Variance estimation for complex surveys using replication techniques. Statistical Methods in Medical Research 1996; 5: 283–310.
9. Shao J. Resampling methods in sample surveys (with discussion). Statistics 1996; 27: 203–254.
10. Shao J. Impact of the bootstrap on sample surveys. Statistical Science 2003; 18: 191–198.
11. Reiter JP, Raghunathan TE, Kinney SK. The importance of modeling the sampling design in multiple imputation for missing data. Survey Methodology 2006; 32(2): 143–149.
12. Seaman SR, White IR. Review of inverse probability weighting for dealing with missing data. Statistical Methods in Medical Research 2013; 22(3): 278–295.
13. Whelton PK, Carey RM, Aronow WS, et al. 2017 ACC/AHA/AAPA/ABC/ACPM/AGS/APhA/ASH/ASPC/NMA/PCNA guideline for the prevention, detection, evaluation, and management of high blood pressure in adults: a report of the American College of Cardiology/American Heart Association Task Force on Clinical Practice Guidelines. Journal of the American College of Cardiology 2018; 71(19): e127–e248.
14. Little RJ, Vartivarian S. On weighting the rates in non-response weights. Statistics in Medicine 2003; 22: 1589–1599.
15. Cox DR. Partial Likelihood. Biometrika 1975; 62(2): 269–276.
16. Breslow NE. Discussion of regression models and life-tables by DR Cox. Journal of the Royal Statistical Society, Series B 1972; 34: 216.
17. Tsiatis AA. A large sample study of Cox’s regression model. The Annals of Statistics 1981; 9(1): 93–108.
18. Andersen PK, Gill RD. Cox’s regression model for counting processes: A large sample study. The Annals of Statistics 1982; 10(4): 1100–1120.
19. McCullagh P, Nelder JA. Generalized Linear Models. London: Chapman and Hall. 2nd ed. 1989.
20. Rao JNK, Wu CFJ. Resampling inference with complex survey data. Journal of the American Statistical Association 1988; 83: 231–241.
21. Rao JNK, Wu CFJ, Yue K. Some recent work on resampling methods for complex surveys. Survey Methodology 1992; 18: 209–217.
22. McCarthy PJ, Snowden CB. The bootstrap and finite population sampling. In: Vital and Health Statistics. Washington, DC: U.S. Government Printing Office. 1985. (pp. 1–23).
23. Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association 1994; 89(427): 846–866.
24. Law MR, Morris JK, Wald NJ. Use of blood pressure lowering drugs in the prevention of cardiovascular disease: meta-analysis of 147 randomised trials in the context of expectations from prospective epidemiological studies. BMJ 2009; 338: b1665.
25. Lumley T. Analysis of Complex Survey Samples. Journal of Statistical Software 2004; 9(1): 1–19. R package version 2.2.
26. Lumley T. survey: analysis of complex survey samples. 2019. R package version 3.35-1.
27. MacMahon S, Cutler J, Brittain E, Higgins M. Obesity and hypertension: Epidemiological and clinical issues. European Heart Journal 1987; 8 Suppl B: 57–70.
28. World Health Organization (WHO). Physical status: The use of and interpretation of anthropometry, report of a WHO expert committee. Tech. Rep. 854, WHO Technical Report Series; 1995.
29. Cai J, Sotres-Alvarez D, Zeng D, et al. HCHS/SOL Analysis Methods - Visit 2. HCHS/SOL Coordinating Center; 3 ed. 2022.
30. Beaumont JF, Emond N. A bootstrap variance estimation method for multistage sampling and two-phase sampling when Poisson sampling is used at the second phase. Stats 2022; 5: 339–357.
31. Centers for Disease Control and Prevention (CDC). Hypertension cascade: Hypertension prevalence, treatment and control estimates among US adults aged 18 years and older applying the criteria from the American College of Cardiology and American Heart Association’s 2017 Hypertension Guideline - NHANES 2013–2016. Tech. rep., US Department of Health and Human Services; Atlanta, GA: 2019.
32. King G, Zeng L. The dangers of extreme counterfactuals. Political Analysis 2006; 14(2): 131–159.
