SUMMARY:
Persons living with HIV engage in routine clinical care, generating large amounts of data in observational HIV cohorts. These data are often error-prone, and directly using them in biomedical research could bias estimation and give misleading results. A cost-effective solution is the two-phase design, under which the error-prone variables are observed for all patients during Phase I, and that information is used to select patients for data auditing during Phase II. For example, the Caribbean, Central, and South America Network for HIV Epidemiology (CCASAnet) selected a random sample from each site for data auditing. Herein, we consider efficient odds ratio estimation with partially audited, error-prone data. We propose a semiparametric approach that uses all information from both phases and accommodates a number of error mechanisms. We allow both the outcome and covariates to be error-prone and these errors to be correlated, and selection of the Phase II sample can depend on Phase I data in an arbitrary manner. We devise a computationally efficient, numerically stable EM algorithm to obtain estimators that are consistent, asymptotically normal, and asymptotically efficient. We demonstrate the advantages of the proposed methods over existing ones through extensive simulations. Finally, we provide applications to the CCASAnet cohort.
Keywords: case-control sampling, data audits, electronic health records, measurement error, missing data, semiparametric efficiency
1. Introduction
1.1. Background
Electronic health records and other observational databases routinely collect a wide array of information on large numbers of patients. While initially created to support clinical care, financial billing, and insurance claims, these databases are increasingly being used for clinical investigations that influence disease prevention and policy-making. In particular, there has been an uptick in observational studies of HIV/AIDS, where patients engage in regular, frequent care, generating large amounts of routine clinical data (Zaniewski et al., 2018).
Observational databases can be error-prone since data collection is often secondary to clinical care, and the error mechanisms can be quite complicated. For example, a binary outcome of interest may be differentially misclassified; that is, the sensitivity and specificity may depend on other variables such as study site. Errors in continuous variables can be additive or multiplicative, symmetric or skewed, and centered around zero or systematically biased. In addition, errors can be correlated across multiple error-prone variables. Naive logistic regression analyses could yield biased odds ratio (OR) and standard error estimates in the presence of outcome misclassification and/or covariate errors (Quake et al., 1980).
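The bias from a naive analysis can be seen in a small simulation. The sketch below is an illustrative toy model with hypothetical parameter values (not the CCASAnet setting): it fits a logistic regression of a nondifferentially misclassified outcome on a binary exposure and compares the result with the fit on the true outcome.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, iters=25):
    """Maximum likelihood logistic regression via Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = expit(X @ beta)
        grad = X.T @ (y - p)
        hess = X.T @ (X * (p * (1 - p))[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(2021)
n = 200_000
x = rng.binomial(1, 0.5, n)                  # binary exposure
y = rng.binomial(1, expit(-1.0 + 1.0 * x))   # true outcome, log OR = 1
# nondifferential misclassification: 10% false positives and false negatives
flip = rng.random(n) < 0.10
y_star = np.where(flip, 1 - y, y)

design = np.column_stack([np.ones(n), x])
beta_true = fit_logistic(design, y)[1]        # close to the true log OR of 1
beta_naive = fit_logistic(design, y_star)[1]  # attenuated toward 0
print(beta_true, beta_naive)
```

For these particular error rates, the population value of the naive log OR works out to about 0.78 rather than 1, so the naive fit is attenuated even with a very large sample.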
To ensure data integrity, clinical trials have long relied on source document verification and data auditing. Observational studies have begun to advocate for those procedures as well (Duda et al., 2012; Giganti et al., 2019; Lotspeich et al., 2020). In a typical data audit, external, trained auditors visit the site, compare existing records to clinical source documents, and report discrepancies. Complete data re-entry would be ideal, but could be too resource-intensive, especially for large databases. A cost-effective alternative is the partial audit or two-phase design, wherein error-prone variables are observed for all subjects from the research database during Phase I, and then this information is used to select a validation subsample in Phase II. Thus, the available data consist of validated records for subjects chosen in Phase II and unvalidated records for everyone else. Two-phase sampling greatly reduces the burden of data validation and thus has been used in several large-scale observational studies, including the Caribbean, Central, and South America Network for HIV Epidemiology (CCASAnet).
1.2. CCASAnet Dataset
CCASAnet is a multi-site research network that uses routinely collected HIV patient care data to address questions about the characteristics of the HIV epidemic and to improve the quality and consistency of clinical research activities across Latin America. Individual-level data from CCASAnet sites are sent to the Data Coordinating Center at Vanderbilt University (VDCC) in Nashville, Tennessee, where they are compiled into a research database (McGowan et al., 2007). Participating sites are regularly audited by the VDCC to ensure data quality; we focus on one round of audits from 2013–2014 (Giganti et al., 2019).
We are interested in estimating the associations between the risk factors CD4 count and AIDS diagnosis, both measured at the time of antiretroviral therapy (ART) initiation, and the odds of subsequently developing an AIDS-defining event (ADE) within 2 years after initiating ART. Error-prone data were available for 5109 patients in the collaborative CCASAnet database (Phase I sample), and audit data were available for a site-stratified simple random sample of 117 patients (Phase II sample). The audits revealed many data errors. Giganti et al. (2019) noted a higher prevalence of ADE in the audit database than in the original research database. In addition, from previous audits we have learned that date variables tend to be particularly prone to errors. Since the outcome (ADE within 2 years of ART initiation) and primary predictors (CD4 and AIDS diagnosis at ART initiation) were derived based on the date of ART initiation, errors in the date of ART initiation could induce dependent errors in these key study variables. More details on the definitions of these variables can be found in Section 4. Here we propose a method to combine the audit data with the pre-audit data to obtain unbiased and efficient OR estimators.
1.3. Existing Work
Statistical methods have been developed to obtain valid inference from error-prone data under a two-phase design. The majority of these methods address classical measurement error in covariates only (Carroll et al., 2006) or binary outcome misclassification alone (Magder and Hughes, 1997; Neuhaus, 1999). Fewer methods handle outcome misclassification and covariate measurement error simultaneously. When the sensitivity and specificity or the false positive and false negative rates (FPR and FNR, respectively) of a binary outcome and covariate are known or can be reliably estimated, the matrix method (Barron, 1977) or the inverse matrix method (Marshall, 1990), respectively, can be extended to correct naive OR estimates based on misclassified data (Tang et al., 2013). Fully-parametric maximum likelihood estimators (MLE) have been proposed for outcome and/or binary covariate misclassification (Tang et al., 2015), but they do not accommodate error-prone continuous covariates. In addition, they require explicit specification of the distribution of the measurement errors, which may introduce bias when the error models are misspecified (Robins et al., 1994). Design-based estimators, such as the Horvitz–Thompson (HT) (Horvitz and Thompson, 1952) and generalized raking (Deville et al., 1993) estimators, can also be used. Of these, generalized raking estimators are particularly attractive because they (1) belong to a class that contains the most efficient augmented inverse probability weighted estimator and (2) offer a straightforward implementation of these doubly robust estimators (Lumley et al., 2011). While these design-based estimators do not require specification of the error models, they tend to be less efficient than model-based estimators, especially when the Phase II sampling probabilities are extremely unequal.
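As a concrete illustration of the design-based approach, the sketch below implements an HT-type estimator in a toy two-phase study: the true-data logistic score equations are weighted by the inverse of known Phase II sampling probabilities, here depending on an error-prone outcome. All parameter values and error mechanisms are hypothetical.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_wlogistic(X, y, w, iters=25):
    """Weighted logistic regression (solves the inverse-probability-weighted
    score equations via Newton-Raphson)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = expit(X @ beta)
        grad = X.T @ (w * (y - p))
        hess = X.T @ (X * (w * p * (1 - p))[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(7)
N = 100_000
x = rng.normal(size=N)                             # true covariate
y = rng.binomial(1, expit(-1.0 + 0.5 * x))         # true outcome, log OR = 0.5
y_star = np.where(rng.random(N) < 0.1, 1 - y, y)   # error-prone Phase I outcome

# Phase II selection depends on the error-prone outcome Y*
pi = np.where(y_star == 1, 0.5, 0.1)               # known sampling probabilities
v = rng.random(N) < pi

# HT: fit the true-data model on validated subjects only, weighted by 1/pi
design = np.column_stack([np.ones(v.sum()), x[v]])
beta_ht = fit_wlogistic(design, y[v], 1.0 / pi[v])
print(beta_ht)
```

The weighting removes the selection bias induced by sampling on Y*, but only the validated records contribute, which is the efficiency loss relative to model-based estimators noted above.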
1.4. Overview
In this manuscript, we propose a general framework to accommodate a misclassified binary outcome and error-prone categorical or continuous covariates under a two-phase design. Our methods handle settings where the error rates depend on other error-prone and error-free covariates. We model the covariate error distribution nonparametrically with B-spline sieves (Grenander, 1981) to accommodate continuous covariates subject to arbitrary measurement error patterns. An EM algorithm maximizes the resulting semiparametric likelihood. Our estimators are shown to be consistent, asymptotically normal, and asymptotically efficient.
The rest of the paper is organized as follows: in Section 2 we describe the proposed methods, while relegating most of the technical details to Web Appendices; in Section 3 we evaluate the performance of the methods using extensive simulations; in Section 4 we illustrate their use with the CCASAnet dataset; and in Section 5 we discuss our findings.
2. Methods
Consider a binary outcome, Y, and vector of continuous or categorical covariates, X, which are assumed to be related through the logistic regression model Pθ(Y = 1|X) = [1 + exp{−(α + Xβ)}]−1, where θ = (α,βT)T. Error-prone measures of the outcome and covariates recorded in the database are denoted by Y* and X*, respectively. Throughout the paper, we use “error-prone” to describe both categorical variables subject to misclassification and continuous variables subject to measurement error. A complete, validated observation (Y*,X*, Y,X) randomly drawn from the population of interest is assumed to be generated from
P(Y*, X*, Y, X) = Pθ(Y|X) P(Y*|X*, Y, X) P(X|X*) P(X*), (1)
where Pθ(Y|X) is the logistic regression model of primary interest, P(Y*|X*, Y, X) is the conditional probability of Y* given (X*, Y, X), P(X|X*) is the conditional density of X given X*, and P(X*) is the marginal density of X*. Conditional independence of Y and X* given X is assumed, such that P(Y|X, X*) = Pθ(Y|X). No additional assumptions are made, and equation (1) allows for differential outcome misclassification and correlated covariate errors. This general setup covers the classical scenarios with either outcome misclassification only, letting X* = X and P(Y*, Y, X) = Pθ(Y|X)P(Y*|Y, X)P(X), or covariate measurement error only, setting Y* = Y and P(X*, Y, X) = Pθ(Y|X)P(X|X*)P(X*). Error-free covariates can also be included by decomposing X = (Xa, Xb) and X* = (Xa, Xb*), where subscripts a and b denote error-free and error-prone covariates, respectively. Thus, complete observations are assumed to be generated from Pθ(Y|Xa, Xb) P(Y*|Xa, Xb*, Y, Xb) P(Xb|Xa, Xb*) P(Xa, Xb*).
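The factorization in equation (1) suggests a sequential way to simulate a complete observation. The sketch below does so for a single binary covariate with hypothetical error mechanisms, and checks that Y and X retain the assumed logistic relationship; note that Y is drawn from X alone, which is exactly the assumed conditional independence of Y and X* given X.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
N = 200_000

# P(X*): marginal of the error-prone covariate
x_star = rng.binomial(1, 0.5, N)
# P(X | X*): the covariate error mechanism (80% agreement, hypothetical)
x = np.where(rng.random(N) < 0.8, x_star, 1 - x_star)
# P_theta(Y | X): the logistic model of interest, alpha = -1, beta = 1
y = rng.binomial(1, expit(-1.0 + 1.0 * x))
# P(Y* | X*, Y, X): a differential outcome error mechanism (hypothetical)
y_star = rng.binomial(1, expit(-2.0 + 3.5 * y + 0.5 * x_star))

# sanity check: Y depends on X exactly through the logistic model
for xv in (0, 1):
    print(xv, y[x == xv].mean())  # close to expit(-1 + xv)
```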
With complete data on all N subjects, estimates can be obtained by maximizing the likelihood ∏i=1N Pθ(Yi|Xi) P(Yi*|Xi*, Yi, Xi) P(Xi|Xi*) P(Xi*). In a two-phase study, however, the likelihood contributions of validated and unvalidated subjects are different. While validated subjects contribute complete data to the likelihood via equation (1), unvalidated subjects contribute incomplete data to the likelihood by marginalizing out the unobserved Phase II variables. Let V denote the validation indicator, where Vi = 1 if subject i was selected for Phase II and 0 otherwise. The observed-data log-likelihood can be expressed as
lN = Σi=1N Vi log{Pθ(Yi|Xi) P(Yi*|Xi*, Yi, Xi) P(Xi|Xi*)} + Σi=1N (1 − Vi) log{Σy ∫ Pθ(y|x) P(Yi*|Xi*, y, x) P(x|Xi*) dx} + Σi=1N log P(Xi*). (2)
Because X* is fully observed at Phase I, P(X*) can be ignored in the inference of θ. Further, two-phase designs assume that the selection of the Phase II sample depends on Phase I variables Y* and X* only. Thus, the Phase II variables are missing at random and the distribution of V can be omitted from expression (2).
Our primary interest lies in the inference of θ, the true conditional log OR of X on Y. The unknown error mechanisms P(Y*|X*, Y, X) and P(X|X*) are nuisance parameters, rarely of interest on their own. Tang et al. (2015) proposed a fully-parametric MLE for misclassified Y* with a single binary covariate Xb and error-free covariates Xa. They instead factored P(Y*, Xb*, Y, Xb, Xa) = P(Y*|Xb*, Y, Xb, Xa) P(Xb*|Y, Xb, Xa) P(Y|Xb, Xa) P(Xb|Xa) P(Xa), modeled the first four terms with four logistic regressions, and ignored the last term because Xa was fully observed at Phase I. This approach explicitly specifies all error mechanisms, and thus could perform poorly if those specifications are incorrect. In particular, the MLE will be biased if errors are incorrectly assumed to be independent of other variables. Usually, little is known about how errors were introduced to data, so proper specification of error models can be challenging, especially when there are continuous error-prone covariates.
As we extend to settings with continuous covariate error, we aim to develop a more robust estimator for θ by requiring fewer assumptions on the error mechanisms. Since we already assume that Y|X follows a logistic model, it is reasonable to assume a logistic model for the outcome error mechanism Pγ(Y*|X*, Y,X) in a similar manner, where γ denotes the model parameters. The algorithm derived herein treats this model generally and does not dictate the form of the linear predictor. No assumption is made about the distribution of X|X*; that is, P(X|X*) is estimated nonparametrically. Let m denote the number of distinct values of X in the validation sample, and let x1, …, xm denote these values. For each X* = x*, we use discrete probability functions to estimate P(X = xk|X* = x*) (k = 1, …, m). This works when X* is categorical. However, with continuous components of X*, few validated subjects will have X* = x* for each distinct x*. In this situation, this nonparametric estimator for P(X = xk|X* = x*) is not directly applicable and smoothing techniques are required.
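For fully categorical X*, the nonparametric estimator of P(X = xk|X* = x*) is just the conditional relative frequency among validated records. A minimal sketch with hypothetical data:

```python
from collections import Counter, defaultdict

def cond_freq(pairs):
    """Estimate P(X = x | X* = x_star) from validated (x_star, x) pairs
    as conditional relative frequencies."""
    counts = defaultdict(Counter)
    for x_star, x in pairs:
        counts[x_star][x] += 1
    return {
        x_star: {x: n / sum(c.values()) for x, n in c.items()}
        for x_star, c in counts.items()
    }

# hypothetical validated records (x_star, x)
validated = [("a", 1), ("a", 1), ("a", 2), ("b", 2)]
p_hat = cond_freq(validated)
print(p_hat["a"][1])  # 2/3
```

With continuous components of X*, each observed value x* appears at most once among validated subjects, so this estimator degenerates, which is what motivates the smoothing below.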
We extend Tao et al. (2017) and use the method of sieves (Grenander, 1981) to handle error-prone continuous covariates. Specifically, B-splines (Schumaker, 1981) are used to approximate the covariate error mechanism. We note that B-splines have been used to promote robustness in measurement error settings elsewhere (Staudenmayer et al., 2008; Sarkar et al., 2014). The error-prone covariates are assumed to have bounded support. Without loss of generality, each component X* in X* is standardized to have support on the interval [0,1]. Let q and bN denote the order and number of interior knots for the B-spline basis, respectively, and let t1 ⩽ t2 ⩽ ⋯ ⩽ tbN+2q denote the knots, where t1 = ⋯ = tq = 0 and tbN+q+1 = ⋯ = tbN+2q = 1. The bN interior knots are assumed to be evenly spaced across the range of X*; this can be revised in practice to best suit the covariate data. Then, the interval [0,1] can be partitioned around the knots as Δ: 0 = tq < tq+1 < ⋯ < tbN+q+1 = 1. For one covariate X* in X*, the q order B-spline basis associated with the partition Δ is denoted by {Blq(x): l = 1, …, bN + q}, where Blq corresponds to the lth B-spline basis function of order q and is defined according to the recursive formula Blq(x) = {(x − tl)/(tl+q−1 − tl)}Blq−1(x) + {(tl+q − x)/(tl+q − tl+1)}Bl+1q−1(x). The first order B-spline basis function is defined as the histogram function Bl1(x) = I(tl ⩽ x < tl+1), with I(·) being the indicator function. The multivariate B-spline basis function on the full set of d error-prone covariates is defined as the tensor product Bjq(x*) = Bl1q(x1*) × ⋯ × Bldq(xd*) (j = 1, …, sN), where sN = (bN + q)d represents the total number of multivariate B-spline basis functions. In essence, the scalar index j replaces the multivariate index (l1, …, ld) to simplify notation. We approximate P(Xi|Xi*) and P(xk|Xi*) in expression (2) with Σj=1sN Bjq(Xi*)pkij and Σj=1sN Bjq(Xi*)pkj, respectively, where pkj is the coefficient of Bjq at xk (k = 1, …, m; j = 1, …, sN) and ki is the index such that Xi = xki for a validated subject i. Thus, expression (2) can be rewritten as lN(θ, γ, {pkj})
≈ Σi=1N Vi log{Pθ(Yi|Xi) Pγ(Yi*|Xi*, Yi, Xi) Σj=1sN Bjq(Xi*)pkij} + Σi=1N (1 − Vi) log{Σk=1m Σy Pθ(y|xk) Pγ(Yi*|Xi*, y, xk) Σj=1sN Bjq(Xi*)pkj}, (3)
which is to be maximized with respect to θ, γ, and {pkj} under the constraints Σk=1m pkj = 1 and pkj ⩾ 0 (j = 1, …, sN; k = 1, …, m).
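The B-spline recursion above can be coded directly. The sketch below is a simplified illustration assuming clamped, evenly spaced knots on [0,1]; it builds an order-q basis and checks two properties the sieve relies on: there are bN + q basis functions per covariate, and they are nonnegative and sum to one at every point (a partition of unity).

```python
import numpy as np

def bspline_basis(x, q, n_interior):
    """Order-q B-spline basis on [0, 1] with evenly spaced interior knots,
    built from the Cox-de Boor recursion (terms with 0/0 treated as 0)."""
    interior = np.linspace(0.0, 1.0, n_interior + 2)[1:-1]
    t = np.concatenate([np.zeros(q), interior, np.ones(q)])  # clamped knots
    # order 1: histogram functions I(t_l <= x < t_{l+1})
    B = [((t[l] <= x) & (x < t[l + 1])).astype(float)
         for l in range(len(t) - 1)]
    for k in range(2, q + 1):
        nxt = []
        for l in range(len(t) - k):
            term = np.zeros_like(x)
            if t[l + k - 1] > t[l]:
                term = term + (x - t[l]) / (t[l + k - 1] - t[l]) * B[l]
            if t[l + k] > t[l + 1]:
                term = term + (t[l + k] - x) / (t[l + k] - t[l + 1]) * B[l + 1]
            nxt.append(term)
        B = nxt
    return np.array(B)  # shape: (n_interior + q, len(x))

x = np.linspace(0.0, 0.999, 200)
basis = bspline_basis(x, q=4, n_interior=5)  # cubic splines, 5 interior knots
print(basis.shape)  # (9, 200)
```

Because each row is a smooth, local, nonnegative weight and the rows sum to one, the sieve Σj Bjq(X*)pkj is a valid smoother of the conditional probabilities P(X = xk|X*).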
The maximization of the right-hand side of equation (3) is carried out through an EM algorithm. Variance estimators are obtained via the profile likelihood method of Murphy and Van der Vaart (2000). Details of these procedures are presented in Web Appendices A and B, respectively. Under mild regularity conditions, the resulting sieve maximum likelihood estimators (SMLE) are consistent, asymptotically normal, and asymptotically efficient. Proofs of these asymptotic properties are included in Web Appendix C.
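To give a sense of the EM algorithm, the sketch below runs a simplified version on a toy model: a single binary covariate (so no B-splines are needed), a nondifferential outcome error model Pγ(Y*|Y), and Phase II chosen by simple random sampling. It illustrates the E-step (posterior weights over the candidate (y, x) values for unvalidated subjects) and the weighted M-steps, not the full algorithm of Web Appendix A; all simulation parameters are hypothetical.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_wlogistic(X, y, w, iters=30):
    """Weighted logistic regression via Newton-Raphson (the M-step solver)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = expit(X @ beta)
        grad = X.T @ (w * (y - p))
        hess = X.T @ (X * (w * p * (1 - p))[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# ---- simulate a toy two-phase study (all values hypothetical) ----
rng = np.random.default_rng(42)
N = 4000
x_star = rng.binomial(1, 0.5, N)                       # error-prone covariate
x = np.where(rng.random(N) < 0.8, x_star, 1 - x_star)  # true covariate
y = rng.binomial(1, expit(-0.5 + 1.0 * x))             # true log OR = 1
y_star = np.where(rng.random(N) < 0.9, y, 1 - y)       # error-prone outcome
v = rng.random(N) < 0.25                               # Phase II by SRS

states_y = np.array([0, 0, 1, 1])  # enumeration of candidate (y, x) pairs
states_x = np.array([0, 1, 0, 1])

# ---- initialize from the validated (complete-case) records ----
ones_v = np.ones(v.sum())
theta = fit_wlogistic(np.column_stack([ones_v, x[v]]), y[v], ones_v)
gamma = fit_wlogistic(np.column_stack([ones_v, y[v]]), y_star[v], ones_v)
p = np.array([[np.mean(x[v & (x_star == s)] == t) for t in (0, 1)]
              for s in (0, 1)])  # p[s, t] = P(X = t | X* = s)

loglik = []
for _ in range(100):
    # E-step: posterior weights over (y, x) given the observed data
    py1 = expit(theta[0] + theta[1] * states_x)
    f_y = np.where(states_y == 1, py1, 1 - py1)          # P(y | x)
    ps1 = expit(gamma[0] + gamma[1] * states_y)
    f_ys = np.where(y_star[:, None] == 1, ps1, 1 - ps1)  # P(Y* | y)
    f_x = p[x_star][:, states_x]                         # P(x | X*)
    lik = f_y * f_ys * f_x                               # (N, 4)
    w = lik / lik.sum(axis=1, keepdims=True)
    obs = (states_y == y[:, None]) & (states_x == x[:, None])
    w[v] = obs[v].astype(float)  # validated: point mass at observed (Y, X)
    loglik.append(np.log(lik.sum(axis=1))[~v].sum() + np.log(lik[obs][v]).sum())
    # M-step: weighted logistic fits on the expanded pseudo-data
    ww = w.ravel()
    theta = fit_wlogistic(
        np.column_stack([np.ones(4 * N), np.tile(states_x, N)]),
        np.tile(states_y, N), ww)
    gamma = fit_wlogistic(
        np.column_stack([np.ones(4 * N), np.tile(states_y, N)]),
        np.repeat(y_star, 4), ww)
    for s in (0, 1):
        rows = w[x_star == s]
        p[s] = [rows[:, states_x == t].sum() / rows.sum() for t in (0, 1)]

print(theta, gamma, p)
```

Because each M-step exactly maximizes its piece of the expected complete-data log-likelihood, the observed-data log-likelihood is nondecreasing across iterations, which is a useful check on any implementation.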
The selection of bN and q needs to satisfy condition (C4) in Web Appendix C. Specifically, bN should increase at a much slower rate than the Phase I and Phase II sample sizes, and q should increase with the dimension of the error-prone continuous covariates but is usually restricted to be less than or equal to four (corresponding to cubic splines). In practice, bN and q can be chosen in a data-adaptive manner, e.g., by cross-validation. For any fixed bN and q, one evaluates equation (3) in the validation fold using estimates obtained from the training folds. The optimal bN and q are those that maximize the average cross-validation likelihood. For the purposes of this paper, and to save computation time, we choose q and bN such that the model is stable within a reasonable range of bN values.
3. Simulation Studies
Our simulation studies begin with the setting of an error-prone outcome and binary covariate (Section 3.1). The SMLE, fully-parametric MLE (Tang et al., 2015), complete-case analysis (CC), and design-based generalized raking and HT estimators (detailed in Web Appendix D) are compared. Next, binary outcome misclassification and a continuous error-prone covariate are considered (Section 3.2), illustrating the performance of the SMLE under settings where the MLE does not apply. Method performance is assessed in terms of bias, confidence interval (CI) coverage, and efficiency. Error-free covariates are included in all settings. All settings focus on estimation of β: the conditional log OR for Xb on Y. The bootstrap method of Koehler et al. (2009) was used to estimate the Monte Carlo simulation errors for bias and coverage.
3.1. Error-Prone Binary Covariate
True binary covariate Xb, error-free covariate Xa, and true outcome Y were generated from Bernoulli distributions with P(Xb = 1) = 0.5, P(Xa = 1) = 0.25, and P(Y = 1|Xb,Xa) = [1 + exp{−(−0.65 − 0.2Xb − 0.1Xa)}]−1, respectively. Misclassification-prone Xb* and Y* were then generated from Bernoulli distributions conditional on the true values and the other variables. These settings followed from Tang et al. (2015), except that conditional independence between Xb* and Y given (Xb, Xa) was assumed here. These settings yielded approximately 32% and 35% prevalence of Y and Y*, respectively, and 8% and 25% misclassification in Y* (FPR = 8%; FNR = 6%) and Xb* (FPR = 28%; FNR = 23%), respectively. From N = 1000 and 2000 Phase I subjects, proportions of pυ = 0.1, 0.25, or 0.5 were selected for validation through simple random sampling (SRS) or 1:1 case-control sampling based on Y* (naive case-control). The likelihood for the fully-parametric MLE (Tang et al., 2015) was correctly specified for the data generation scheme, and the nlm function in R (R Core Team, 2019) was used to obtain the estimates. For generalized raking, we calibrated sampling weights to the naive, error-prone influence functions following Chen and Lumley (2020) and used the survey package in R (Lumley, 2019).
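The quoted prevalence of Y can be verified by averaging the outcome model over the covariate distribution:

```python
import itertools
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

# P(Y = 1) = sum over (xb, xa) of P(Y = 1 | xb, xa) P(Xb = xb) P(Xa = xa)
p_xb, p_xa = 0.5, 0.25
prev = sum(
    expit(-0.65 - 0.2 * xb - 0.1 * xa)
    * (p_xb if xb else 1 - p_xb)
    * (p_xa if xa else 1 - p_xa)
    for xb, xa in itertools.product((0, 1), (0, 1))
)
print(round(prev, 3))  # 0.316, i.e., approximately 32%
```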
Results for these settings are presented in Table 1. All methods were essentially unbiased. The average SEE for the SMLE was reasonably close to the empirical standard error (SE), and improved with increasing N. As expected, increasing pυ for a fixed N decreased both bias and standard error. Results were similar with N = 5000 (Table S1 in Web Appendix E.1). In comparison, a naive analysis using only the Phase I data yielded an average bias of 14%. All estimators had smaller standard errors under naive case-control than SRS. The SMLE was as efficient as the MLE, suggesting that the robustness of the SMLE came with little cost. The CC, HT, and raking estimators, by comparison, lost as much as 31%, 33%, and 23% efficiency to the SMLE, respectively. Our EM algorithm was stable, with convergence rates ⩾ 99% for audit sizes n > 100. Additional simulations, reported in Web Appendix E.2, compared the SMLE with the MLE under model misspecification. In general, these simulations suggest that with binary Xb*, both estimators behave similarly and are fairly robust to misspecification.
Table 1.
Simulation results for outcome misclassification and a binary error-prone covariate for increasing Phase I sample size N and audit proportion pυ
| N | pυ | SMLE Bias | SMLE SE | SMLE SEE | SMLE CP | MLE Bias | MLE SE | MLE RE | CC Bias | CC SE | CC RE | HT Bias | HT SE | HT RE | Raking Bias | Raking SE | Raking RE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SRS | | | | | | | | | | | | | | | | | |
| 1000 | 0.10 | −0.003 | 0.381 | 0.369 | 0.950 | 0.010 | 0.376 | 1.027 | −0.012 | 0.452 | 0.711 | −0.012 | 0.452 | 0.711 | −0.008 | 0.419 | 0.825 |
| | 0.25 | 0.007 | 0.247 | 0.242 | 0.950 | 0.007 | 0.246 | 1.008 | −0.008 | 0.283 | 0.732 | −0.008 | 0.283 | 0.732 | −0.003 | 0.271 | 0.831 |
| | 0.50 | 0.001 | 0.183 | 0.181 | 0.938 | 0.001 | 0.183 | 1.000 | 0.000 | 0.193 | 0.902 | 0.000 | 0.193 | 0.902 | −0.001 | 0.187 | 0.954 |
| 2000 | 0.10 | 0.002 | 0.271 | 0.258 | 0.946 | 0.004 | 0.270 | 1.007 | −0.008 | 0.310 | 0.762 | −0.008 | 0.310 | 0.762 | −0.006 | 0.285 | 0.905 |
| | 0.25 | 0.001 | 0.177 | 0.171 | 0.941 | 0.001 | 0.177 | 1.000 | −0.005 | 0.195 | 0.820 | −0.005 | 0.195 | 0.820 | −0.008 | 0.178 | 0.984 |
| | 0.50 | 0.000 | 0.128 | 0.128 | 0.954 | 0.000 | 0.128 | 1.000 | −0.007 | 0.133 | 0.924 | −0.007 | 0.133 | 0.924 | −0.006 | 0.128 | 1.001 |
| 1:1 case-control sampling based on Y* | | | | | | | | | | | | | | | | | |
| 1000 | 0.10 | 0.018 | 0.366 | 0.343 | 0.936 | 0.020 | 0.363 | 1.017 | −0.015 | 0.423 | 0.750 | −0.026 | 0.430 | 0.724 | −0.022 | 0.393 | 0.869 |
| | 0.25 | −0.011 | 0.239 | 0.228 | 0.943 | −0.011 | 0.238 | 1.008 | 0.001 | 0.263 | 0.824 | −0.009 | 0.267 | 0.799 | −0.010 | 0.253 | 0.895 |
| | 0.50 | −0.001 | 0.169 | 0.171 | 0.954 | −0.001 | 0.169 | 1.000 | 0.016 | 0.183 | 0.856 | 0.009 | 0.186 | 0.827 | 0.007 | 0.181 | 0.874 |
| 2000 | 0.10 | −0.008 | 0.242 | 0.240 | 0.952 | −0.009 | 0.241 | 1.008 | 0.002 | 0.291 | 0.690 | −0.008 | 0.295 | 0.671 | −0.013 | 0.276 | 0.770 |
| | 0.25 | 0.004 | 0.163 | 0.161 | 0.949 | 0.004 | 0.162 | 1.012 | 0.003 | 0.183 | 0.800 | −0.005 | 0.184 | 0.784 | −0.002 | 0.169 | 0.930 |
| | 0.50 | 0.005 | 0.118 | 0.121 | 0.957 | 0.005 | 0.118 | 1.000 | 0.002 | 0.129 | 0.835 | −0.006 | 0.131 | 0.811 | −0.006 | 0.125 | 0.894 |
Note: Bias and SE are, respectively, the empirical bias and standard error of the parameter estimator; SEE is the average of the standard error estimator; CP is the coverage probability of the 95% confidence interval. RE is the relative efficiency of the estimator to the SMLE. Each entry is based on 1000 replicates. Convergence rates for the SMLE with an audit size of n = 100 were 89% and 94%, respectively, under SRS and 1:1 case-control sampling based on Y*. This was due to complete or quasi-complete separation of the outcome error model in these settings. The SMLE had greater than 99% convergence rates in other settings. The Monte Carlo simulation error for the bias and CP of the SMLE did not exceed 0.013 and 0.8%, respectively.
3.2. Error-Prone Continuous Covariate
3.2.1. Varying covariate error variance.
Continuous covariate Xb was generated from a standard normal distribution. Error-free covariate Xa and outcome Y were generated from Bernoulli distributions with P(Xa = 1) = 0.25 and a logistic model in (Xb, Xa), respectively. Error-prone covariate Xb* was constructed as Xb* = Xb + U, where U was a normal random variable with mean zero and variance σu². We considered σu² values of 0.1, 0.25, 0.5, and 1, corresponding to correlations of 0.95, 0.9, 0.82, and 0.71, respectively, between Xb and Xb*. The misclassification-prone outcome Y* was generated from a Bernoulli distribution conditional on Y and the covariates. This simulation setup yielded approximately 28% and 37% prevalence of Y and Y*, respectively, regardless of the choice of σu², and 12%–13% misclassification in Y* (FPR = 14%–15%; FNR = 6%–7%). We used a cubic B-spline basis (q = 4) and varied bN from 16 to 28 to assess its effects on model fitting, maintaining a 3:1 ratio of knots allocated to subjects with Xa = 0:Xa = 1. This ratio allocates the knots proportionally to the available data, distributing 25% of the knots to the 25% of subjects with Xa = 1. When N = 1000, the results were very similar for bN ⩾ 20, in the sense that the maximum difference in the coverage probability of the 95% CI was less than 0.5%. Consequently, separate cubic B-splines with 15 and 5 interior knots were used for subjects with Xa = 0 and Xa = 1, respectively; when N = 2000, 18 and 6 interior knots were used.
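The quoted correlations follow directly from the classical additive error model: since Var(Xb) = 1, corr(Xb, Xb*) = 1/√(1 + σu²).

```python
import numpy as np

# corr(Xb, Xb + U) = Cov(Xb, Xb + U) / sqrt(Var(Xb) Var(Xb + U))
#                  = 1 / sqrt(1 + var_u)   when Var(Xb) = 1
var_u = np.array([0.1, 0.25, 0.5, 1.0])
corr = 1.0 / np.sqrt(1.0 + var_u)
print(np.round(corr, 2))  # [0.95 0.89 0.82 0.71]
```

These match the values quoted above up to rounding.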
Simulation results when SRS was used to select Phase II are shown in Table 2. The proposed SMLE continued to be unbiased with accurate SEE and reasonable coverage probabilities. The EM algorithm remained stable, converging in ⩾ 96% of replicates. The CC and HT estimators are equivalent under SRS. The efficiency gain of the SMLE, which used all available information on all subjects, over the CC analyses, which used information on audited subjects only, was higher for smaller values of σu². This makes sense because Xb* was more informative about Xb when σu² was smaller. Thus, more information could be gained by including the Phase I data. In some settings, the CC was as much as 41% less efficient than the SMLE. The SMLE was generally, although not always, more efficient than the raking estimator. For a fixed σu², the relative efficiency (RE) of the SMLE to the CC or raking estimator decreased as pυ increased. This also makes sense because as audited data became available on more subjects, less information could be extracted from the unvalidated subjects.
Table 2.
Simulation results for outcome misclassification and a continuous covariate with varied additive measurement error variance when the Phase II design is simple random sampling
| σu² | N | pυ | SMLE Bias | SMLE SE | SMLE SEE | SMLE CP | CC/HT Bias | CC/HT SE | CC/HT RE | Raking Bias | Raking SE | Raking RE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.10 | 1000 | 0.10 | 0.005 | 0.250 | 0.237 | 0.943 | 0.069 | 0.323 | 0.599 | 0.038 | 0.271 | 0.849 |
| | | 0.25 | −0.005 | 0.157 | 0.160 | 0.953 | 0.019 | 0.183 | 0.736 | 0.016 | 0.166 | 0.898 |
| | | 0.50 | 0.005 | 0.121 | 0.121 | 0.940 | 0.009 | 0.132 | 0.840 | 0.005 | 0.116 | 1.085 |
| | 2000 | 0.10 | −0.012 | 0.166 | 0.164 | 0.953 | 0.031 | 0.216 | 0.592 | 0.020 | 0.181 | 0.843 |
| | | 0.25 | −0.011 | 0.112 | 0.112 | 0.949 | 0.010 | 0.132 | 0.721 | 0.005 | 0.119 | 0.887 |
| | | 0.50 | 0.002 | 0.083 | 0.085 | 0.956 | 0.002 | 0.087 | 0.905 | 0.004 | 0.085 | 0.949 |
| 0.25 | 1000 | 0.10 | 0.003 | 0.267 | 0.251 | 0.950 | 0.070 | 0.322 | 0.688 | 0.045 | 0.285 | 0.878 |
| | | 0.25 | −0.004 | 0.166 | 0.166 | 0.959 | 0.019 | 0.183 | 0.823 | 0.019 | 0.173 | 0.925 |
| | | 0.50 | 0.007 | 0.125 | 0.123 | 0.944 | 0.009 | 0.132 | 0.897 | 0.006 | 0.120 | 1.091 |
| | 2000 | 0.10 | −0.015 | 0.179 | 0.173 | 0.943 | 0.031 | 0.216 | 0.686 | 0.025 | 0.187 | 0.915 |
| | | 0.25 | −0.011 | 0.117 | 0.116 | 0.944 | 0.010 | 0.132 | 0.784 | 0.005 | 0.122 | 0.918 |
| | | 0.50 | 0.004 | 0.084 | 0.086 | 0.956 | 0.002 | 0.087 | 0.936 | 0.004 | 0.086 | 0.958 |
| 0.50 | 1000 | 0.10 | 0.030 | 0.292 | 0.273 | 0.948 | 0.067 | 0.318 | 0.843 | 0.050 | 0.298 | 0.959 |
| | | 0.25 | 0.001 | 0.171 | 0.173 | 0.957 | 0.019 | 0.183 | 0.873 | 0.020 | 0.179 | 0.910 |
| | | 0.50 | 0.008 | 0.128 | 0.126 | 0.941 | 0.009 | 0.132 | 0.940 | 0.005 | 0.122 | 1.103 |
| | 2000 | 0.10 | 0.001 | 0.196 | 0.187 | 0.941 | 0.031 | 0.216 | 0.824 | 0.029 | 0.194 | 1.021 |
| | | 0.25 | −0.006 | 0.123 | 0.121 | 0.949 | 0.010 | 0.132 | 0.862 | 0.006 | 0.127 | 0.931 |
| | | 0.50 | 0.003 | 0.085 | 0.088 | 0.961 | 0.002 | 0.087 | 0.965 | 0.004 | 0.089 | 0.922 |
| 1.00 | 1000 | 0.10 | 0.060 | 0.318 | 0.292 | 0.951 | 0.070 | 0.322 | 0.975 | 0.053 | 0.310 | 1.052 |
| | | 0.25 | 0.010 | 0.177 | 0.180 | 0.964 | 0.019 | 0.183 | 0.936 | 0.022 | 0.183 | 0.931 |
| | | 0.50 | 0.008 | 0.129 | 0.128 | 0.948 | 0.009 | 0.132 | 0.955 | 0.006 | 0.124 | 1.083 |
| | 2000 | 0.10 | 0.026 | 0.212 | 0.201 | 0.940 | 0.031 | 0.216 | 0.960 | 0.032 | 0.202 | 1.097 |
| | | 0.25 | 0.001 | 0.126 | 0.125 | 0.953 | 0.010 | 0.132 | 0.908 | 0.005 | 0.129 | 0.951 |
| | | 0.50 | 0.002 | 0.086 | 0.089 | 0.957 | 0.002 | 0.087 | 0.988 | 0.003 | 0.091 | 0.903 |
Note: Bias and SE are, respectively, the empirical bias and standard error of the parameter estimator; SEE is the average of the standard error estimator; CP is the coverage probability of the 95% confidence interval; RE is the relative efficiency of the estimator to the SMLE. Under SRS the HT and CC estimators are equivalent. Each entry is based on 1000 replicates. The SMLE had greater than 96% convergence rates in all settings. The Monte Carlo simulation errors for bias and CP did not exceed 0.01 and 0.8%, respectively.
The naive analysis was most biased in settings where σu² was smaller and improved as σu² increased. Specifically, the naive estimator yielded an average of 65%, 57%, 36%, and 3% bias when σu² = 0.1, 0.25, 0.5, and 1, respectively. This counterintuitive phenomenon was due to the way we generated Xb* and Y*. In Web Appendix E.4, we conducted additional simulation studies and found that the bias of the naive estimator could increase as σu² increased or reverse direction in various settings.
Simulation results when naive case-control was used to select Phase II subjects are included in Table 3. The SMLE performed well under this sampling scheme, with smaller standard errors than under SRS. The CC estimators were 5%–10% biased because the case-control sampling was based on an outcome subject to differential misclassification, but the HT and raking estimators remained unbiased. Efficiency was slightly better with naive case-control sampling than SRS, although RE of SMLE to the other estimators was similar.
Table 3.
Simulation results for outcome misclassification and a continuous covariate with varied additive measurement error variance when the Phase II design is 1:1 case-control sampling based on Y*
| σu² | N | pυ | SMLE Bias | SMLE SE | SMLE SEE | SMLE CP | CC Bias | CC SE | HT Bias | HT SE | HT RE | Raking Bias | Raking SE | Raking RE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.10 | 1000 | 0.10 | −0.049 | 0.234 | 0.222 | 0.932 | −0.059 | 0.286 | 0.046 | 0.298 | 0.617 | 0.043 | 0.248 | 0.891 |
| | | 0.25 | −0.028 | 0.148 | 0.151 | 0.952 | −0.077 | 0.160 | 0.028 | 0.170 | 0.758 | 0.004 | 0.156 | 0.898 |
| | | 0.50 | −0.008 | 0.118 | 0.115 | 0.941 | −0.091 | 0.115 | 0.009 | 0.124 | 0.906 | 0.006 | 0.118 | 1.004 |
| | 2000 | 0.10 | −0.046 | 0.159 | 0.155 | 0.930 | −0.073 | 0.192 | 0.031 | 0.207 | 0.590 | 0.024 | 0.172 | 0.853 |
| | | 0.25 | −0.026 | 0.103 | 0.106 | 0.945 | −0.090 | 0.111 | 0.010 | 0.119 | 0.749 | 0.015 | 0.111 | 0.863 |
| | | 0.50 | −0.006 | 0.079 | 0.080 | 0.946 | −0.096 | 0.078 | 0.005 | 0.084 | 0.884 | 0.002 | 0.081 | 0.947 |
| 0.25 | 1000 | 0.10 | −0.042 | 0.233 | 0.237 | 0.950 | −0.047 | 0.274 | 0.054 | 0.294 | 0.650 | 0.040 | 0.267 | 0.764 |
| | | 0.25 | −0.021 | 0.150 | 0.157 | 0.958 | −0.081 | 0.161 | 0.021 | 0.170 | 0.853 | 0.005 | 0.162 | 0.856 |
| | | 0.50 | 0.000 | 0.118 | 0.117 | 0.949 | −0.085 | 0.115 | 0.013 | 0.124 | 0.890 | 0.009 | 0.115 | 1.057 |
| | 2000 | 0.10 | −0.038 | 0.172 | 0.165 | 0.930 | −0.069 | 0.186 | 0.028 | 0.200 | 0.681 | 0.026 | 0.176 | 0.960 |
| | | 0.25 | −0.029 | 0.109 | 0.110 | 0.933 | −0.094 | 0.112 | 0.002 | 0.120 | 0.840 | 0.011 | 0.113 | 0.924 |
| | | 0.50 | −0.004 | 0.080 | 0.082 | 0.948 | −0.093 | 0.079 | 0.005 | 0.086 | 0.909 | 0.001 | 0.081 | 0.968 |
| 0.50 | 1000 | 0.10 | −0.004 | 0.270 | 0.256 | 0.940 | −0.038 | 0.270 | 0.060 | 0.292 | 0.855 | 0.037 | 0.272 | 0.987 |
| | | 0.25 | −0.006 | 0.160 | 0.165 | 0.958 | −0.068 | 0.162 | 0.027 | 0.172 | 0.865 | 0.013 | 0.157 | 1.037 |
| | | 0.50 | 0.005 | 0.122 | 0.119 | 0.946 | −0.075 | 0.118 | 0.019 | 0.127 | 0.923 | 0.012 | 0.125 | 0.954 |
| | 2000 | 0.10 | −0.019 | 0.178 | 0.177 | 0.938 | −0.068 | 0.183 | 0.028 | 0.193 | 0.851 | 0.020 | 0.190 | 0.880 |
| | | 0.25 | −0.017 | 0.114 | 0.114 | 0.954 | −0.084 | 0.115 | 0.007 | 0.123 | 0.859 | 0.013 | 0.117 | 0.946 |
| | | 0.50 | −0.005 | 0.080 | 0.083 | 0.962 | −0.089 | 0.078 | 0.004 | 0.084 | 0.907 | 0.010 | 0.084 | 0.902 |
| 1.00 | 1000 | 0.10 | 0.013 | 0.288 | 0.270 | 0.941 | −0.034 | 0.285 | 0.051 | 0.300 | 0.922 | 0.044 | 0.287 | 1.007 |
| | | 0.25 | 0.001 | 0.172 | 0.169 | 0.942 | −0.063 | 0.168 | 0.021 | 0.179 | 0.923 | 0.012 | 0.180 | 0.918 |
| | | 0.50 | 0.002 | 0.118 | 0.120 | 0.953 | −0.072 | 0.114 | 0.013 | 0.121 | 0.951 | 0.008 | 0.121 | 0.947 |
| | 2000 | 0.10 | 0.007 | 0.193 | 0.189 | 0.952 | −0.057 | 0.186 | 0.028 | 0.198 | 0.950 | 0.019 | 0.199 | 0.936 |
| | | 0.25 | −0.012 | 0.118 | 0.118 | 0.948 | −0.080 | 0.117 | 0.004 | 0.124 | 0.906 | 0.014 | 0.121 | 0.945 |
| | | 0.50 | −0.002 | 0.082 | 0.084 | 0.953 | −0.078 | 0.079 | 0.005 | 0.084 | 1.000 | 0.005 | 0.082 | 1.000 |
Note: Bias and SE are, respectively, the empirical bias and standard error of the parameter estimator; SEE is the average of the standard error estimator; CP is the coverage probability of the 95% confidence interval; RE is the efficiency of the estimator relative to the SMLE. The RE of the CC estimator is not reported because the CC estimator was biased under this sampling scheme. Each entry is based on 1000 replicates. The SMLE had greater than 96% convergence rates in all settings. The Monte Carlo simulation errors were ⩽ 0.009 for bias and ⩽ 0.8% for CP.
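The Monte Carlo error bounds quoted in the note follow from standard formulas (Koehler et al., 2009): the error in an empirical bias is the empirical SE divided by the square root of the number of replicates, and the error in an estimated coverage probability comes from the binomial variance. A minimal sketch, with function names of our choosing:

```python
import math

def mc_error_bias(se_hat, reps=1000):
    """Monte Carlo standard error of an empirical bias estimate:
    the empirical SE of the estimator divided by sqrt(replicates)."""
    return se_hat / math.sqrt(reps)

def mc_error_cp(cp=0.95, reps=1000):
    """Monte Carlo standard error of an estimated coverage probability,
    from the binomial variance cp * (1 - cp) / reps."""
    return math.sqrt(cp * (1 - cp) / reps)

# The largest empirical SE in the table is about 0.29, so the bias
# MC error is at most about 0.29 / sqrt(1000) ~ 0.009.
print(round(mc_error_bias(0.29), 3))
# Nominal 95% coverage with 1000 replicates has MC error ~ 0.7%.
print(round(100 * mc_error_cp(), 1))
```

These bounds match the "⩽ 0.009 for bias and ⩽ 0.8% for CP" figures reported in the note.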
3.2.2. Varying outcome misclassification rate.
We also varied the misclassification rate in Y* by changing the intercept and regression coefficient of Y, denoted by γ0 and γ1, respectively, in its generation model. The values of γ0 and γ1 were determined by the “baseline” sensitivity and specificity of Y* when it depended on Y only, i.e., sensitivity = P(Y* = 1 | Y = 1) = expit(γ0 + γ1) and specificity = P(Y* = 0 | Y = 0) = 1 − expit(γ0), where expit(x) = eˣ/(1 + eˣ). The baseline sensitivity of Y* was varied from 0.95 to 0.55 by decrements of 0.1, and the baseline specificity was set to be 0.05 lower than the baseline sensitivity. A Phase I sample of N = 1000 subjects was generated, and a validation subsample of n = 250 subjects was selected via SRS or naive case-control sampling. The error variance was fixed at 0.1, and all other variables were generated as in Section 3.2.1. The results are shown in Table 4.
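Under a logistic generation model in which Y* depends on Y alone, the intercept γ0 and slope γ1 implied by a target baseline sensitivity and specificity have a closed form. A sketch of that calculation (function names are ours, not from the paper's code):

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def expit(x):
    return 1 / (1 + math.exp(-x))

def gamma_from_sens_spec(sens, spec):
    """Solve expit(g0 + g1) = sens and 1 - expit(g0) = spec for (g0, g1),
    i.e., the intercept and slope of a logistic model for Y* given Y alone."""
    g0 = logit(1 - spec)       # P(Y* = 1 | Y = 0) = 1 - specificity
    g1 = logit(sens) - g0      # P(Y* = 1 | Y = 1) = sensitivity
    return g0, g1

g0, g1 = gamma_from_sens_spec(0.95, 0.90)
# Round-trip check: the implied sensitivity/specificity match the targets.
print(round(expit(g0 + g1), 2), round(1 - expit(g0), 2))  # 0.95 0.9
```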
Table 4.
Simulation results for outcome misclassification with varied baseline sensitivity and specificity and an error-prone continuous covariate
| Sensitivity | Specificity | SMLE Bias | SMLE SE | SMLE SEE | SMLE CP | HT Bias | HT SE | HT RE | Raking Bias | Raking SE | Raking RE |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SRS | | | | | | | | | | | |
| 0.95 | 0.90 | −0.007 | 0.156 | 0.160 | 0.953 | 0.018 | 0.178 | 0.768 | 0.021 | 0.170 | 0.842 |
| 0.85 | 0.80 | −0.006 | 0.170 | 0.175 | 0.961 | 0.020 | 0.180 | 0.892 | 0.022 | 0.176 | 0.933 |
| 0.75 | 0.70 | 0.001 | 0.177 | 0.182 | 0.965 | 0.020 | 0.183 | 0.936 | 0.015 | 0.189 | 0.877 |
| 0.65 | 0.60 | 0.010 | 0.182 | 0.185 | 0.962 | 0.019 | 0.184 | 0.978 | 0.021 | 0.193 | 0.899 |
| 0.55 | 0.50 | 0.018 | 0.183 | 0.186 | 0.963 | 0.019 | 0.183 | 1.000 | 0.021 | 0.190 | 0.928 |
| 1:1 case-control sampling based on Y* | | | | | | | | | | | |
| 0.95 | 0.90 | −0.028 | 0.148 | 0.151 | 0.952 | 0.028 | 0.170 | 0.758 | 0.012 | 0.154 | 0.924 |
| 0.85 | 0.80 | −0.016 | 0.164 | 0.168 | 0.950 | 0.021 | 0.176 | 0.868 | 0.021 | 0.172 | 0.909 |
| 0.75 | 0.70 | −0.003 | 0.178 | 0.177 | 0.951 | 0.019 | 0.185 | 0.926 | 0.009 | 0.180 | 0.978 |
| 0.65 | 0.60 | 0.018 | 0.182 | 0.184 | 0.956 | 0.028 | 0.185 | 0.968 | 0.025 | 0.184 | 0.978 |
| 0.55 | 0.50 | 0.020 | 0.188 | 0.186 | 0.952 | 0.021 | 0.188 | 1.000 | 0.026 | 0.194 | 0.939 |
Note: Bias and SE are, respectively, the empirical bias and standard error of the parameter estimator; SEE is the average of the standard error estimator; CP is the coverage probability of the 95% confidence interval; RE is the efficiency of the estimator relative to the SMLE. Each entry is based on 1000 replicates. The SMLE had greater than 99% convergence rates in all settings. The Monte Carlo simulation errors for the bias and CP of the SMLE were ⩽ 0.006 and ⩽ 0.7%, respectively.
The largest efficiency gains of the SMLE over the HT estimator under SRS (equivalent to the CC analysis) and naive case-control were seen when the sensitivity and specificity of Y* were highest. In fact, the RE decreased with these diagnostic measures until it was approximately equal to one at 0.55 sensitivity and 0.5 specificity. This was expected because for there to be an efficiency gain of the SMLE from incorporating information in unvalidated subjects, there needs to be a fair degree of correlation between Y and Y*. Sensitivity and specificity of Y* near 0.5 resulted in near random misclassification, in which case the unvalidated subjects were not very informative about the relationship between Y and Xb. The SMLE was always more efficient than the raking estimator, which also incorporates information on unvalidated subjects.
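This intuition can be made concrete: for binary Y and Y*, their correlation is proportional to the Youden index (sensitivity + specificity − 1), so it vanishes as both measures approach 0.5. A quick numerical check, assuming a 30% outcome prevalence purely for illustration:

```python
import math

def phi_corr(sens, spec, prev):
    """Correlation (phi coefficient) between true Y and misclassified Y*:
    Cov(Y, Y*) = prev * (1 - prev) * (sens + spec - 1), and
    q = P(Y* = 1) = prev * sens + (1 - prev) * (1 - spec)."""
    q = prev * sens + (1 - prev) * (1 - spec)
    num = prev * (1 - prev) * (sens + spec - 1)
    return num / math.sqrt(prev * (1 - prev) * q * (1 - q))

# Strong correlation at high sensitivity/specificity...
print(round(phi_corr(0.95, 0.90, 0.3), 2))  # ~0.81
# ...but near-random misclassification once both approach 0.5.
print(round(phi_corr(0.55, 0.50, 0.3), 2))  # ~0.05
```

At 0.55 sensitivity and 0.5 specificity the correlation is nearly zero, which is why the unvalidated subjects carry little information in that setting.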
3.2.3. Other simulations with an error-prone continuous covariate.
In Web Appendix E.3, we include comparisons of the SMLE to regression calibration (RC) (Prentice, 1982) and generalized raking under the classical measurement error setting with covariate error only. The robustness of the SMLE to different covariate error mechanisms, including non-zero mean additive errors and multiplicative errors, is illustrated in Web Appendices E.5 and E.6, respectively. In those simulations, the errors in the error-prone covariate depended on the error-free covariate Xa. The SMLE continued to perform well in those settings.
4. Application to the CCASAnet Dataset
We now apply our method to the CCASAnet dataset. As in Giganti et al. (2019), the risk of developing an ADE after initiating ART was of primary interest. Specifically, we were interested in estimating the relative odds of developing an ADE within 2 years of ART initiation for two risk factors, CD4 count and prior AIDS diagnosis, conditional on other covariates. Both risk factors were measured at ART initiation: CD4 count was defined as the lab measurement closest to the ART initiation date but no more than 6 months prior to or 30 days after ART initiation, and prior AIDS diagnosis was any evidence of a clinical AIDS event before ART initiation. Because both risk factors were derived from the error-prone ART initiation date, their errors could be correlated. Other error-free covariates included clinical site, age at baseline, sex, and year of ART initiation. CD4 count was rescaled to units of ten cells per microliter before being square-root transformed, age was rescaled to 10-year increments, and year of ART initiation was centered at the median, 2004.
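The covariate transformations described above amount to the following (a sketch with hypothetical variable names, not the actual CCASAnet column names):

```python
import math

def preprocess(cd4, age, art_year):
    """Apply the analysis transformations described in the text:
    CD4 rescaled to tens of cells/uL then square-root transformed,
    age in 10-year increments, and ART year centered at the 2004 median."""
    return {
        "cd4_sqrt10": math.sqrt(cd4 / 10),  # sqrt of CD4 in tens of cells/uL
        "age10": age / 10,                  # age in 10-year increments
        "art_year_c": art_year - 2004,      # years since the median ART year
    }

print(preprocess(360, 45, 2010))
# {'cd4_sqrt10': 6.0, 'age10': 4.5, 'art_year_c': 6}
```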
Clinical data from five sites (anonymously labeled as sites A–E) were compiled into the CCASAnet research database. Each site underwent an on-site audit by VDCC investigators between 2013 and 2014. Approximately 30 patient records were randomly selected from each site for auditing. Pre-audit records were compared with clinical source documents, including paper-based patient charts and electronic medical records; see Giganti et al. (2019) for details about the audit protocol and findings. The values found in clinical source documents were treated as the reference standard and assumed to be more accurate than the database values.
To be included in our analysis, patients needed to (1) initiate ART while in care at a CCASAnet clinic, (2) be at least 18 years old at cohort enrollment, (3) have a valid CD4 measurement at the time of ART initiation, and (4) remain in care for at least 2 years after initiating ART. Applying these inclusion criteria to the unvalidated data yielded a Phase I sample of 5109 subjects from the CCASAnet research database, of whom 117 were audited. The number of audited records meeting these criteria ranged from 16 to 36 per site. There were 510 ADEs among the unvalidated subjects (10% prevalence) and 13 ADEs among the audited subjects (11% prevalence). Giganti et al. (2019) noted that the risk of an ADE was higher in the audited data than in the pre-audit data over a ten-year follow-up period.
In these audits, the VDCC identified 6% misclassification in the ADE outcome, all of which were false negatives. AIDS prior to ART initiation had 6% misclassification, with a higher false-positive rate (13%) than false-negative rate (3%). CD4 count had an error rate of 8%, with errors of mean −0.11 and variance 2.51 on the square-root scale. Errors in CD4 count were assumed to be additive on the square-root scale, so error magnitudes were calculated by subtracting the error-prone values from the validated values. No subject had errors in both their outcome and covariates, and only one had errors in both CD4 count and AIDS status, suggesting little evidence of error correlation. Sites A, B, and C each had five or six erroneous records, while sites D and E had two or three. The low error rates and small audit size led us to choose the histogram basis for the SMLE. Specifically, we used separate histogram bases with one interior knot for subjects with and without unvalidated AIDS at ART initiation. Further stratification by site did not noticeably alter the results; see Table S8 in Web Appendix F. Thus, errors in AIDS and CD4 count were assumed to be independent of the other error-free covariates.
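The CD4 error summaries reported above can be reproduced from audited pairs along these lines (a sketch with our own function name and toy data; it assumes at least two discrepant records so the sample variance is defined):

```python
import math

def cd4_error_summary(validated, unvalidated):
    """Summarize additive CD4 errors on the square-root scale:
    error = sqrt(validated value) - sqrt(error-prone database value).
    Returns the discrepancy rate and the mean/variance among nonzero errors."""
    diffs = [math.sqrt(v) - math.sqrt(u) for v, u in zip(validated, unvalidated)]
    errors = [d for d in diffs if d != 0]
    n = len(errors)
    mean = sum(errors) / n
    var = sum((e - mean) ** 2 for e in errors) / (n - 1)
    return {"error_rate": n / len(diffs), "mean": mean, "variance": var}

# Toy audited pairs: two of four records disagree with the database.
summary = cd4_error_summary([100, 225, 150, 81], [121, 225, 169, 81])
print(round(summary["error_rate"], 2))  # 0.5
```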
Results are presented in Table 5. The naive analysis using only Phase I data indicated that both CD4 count (log OR = −0.28; 95% CI: (−0.34, −0.22)) and prior AIDS (log OR = 1.54; 95% CI: (1.32, 1.77)) were strongly associated with ADE during the first two years after initiating ART. The CC and HT analyses, which used only Phase II data, yielded larger-magnitude point estimates for the CD4 count association but point estimates closer to the null for the prior AIDS association; confidence intervals for the CC and HT analyses were quite wide due to the small audit size and included zero for the prior AIDS association. The CI for AIDS in the raking analysis was narrower than those in the HT and CC analyses but still contained zero. Using the SMLE, the estimates for CD4 count (log OR = −0.48; 95% CI: (−0.73, −0.24)) and AIDS (log OR = 1.39; 95% CI: (0.58, 2.19)) were significant and fell between those of the naive and complete-data-based analyses, capturing the information from the validated data while harnessing the statistical power of the full cohort.
Table 5.
log OR estimates and 95% confidence intervals from the analysis of the CCASAnet dataset
| Covariate | Naive log OR | Naive 95% CI | CC log OR | CC 95% CI | HT log OR | HT 95% CI | Raking log OR | Raking 95% CI | SMLE log OR | SMLE 95% CI |
|---|---|---|---|---|---|---|---|---|---|---|
| CD4 count | −0.280 | (−0.343, −0.217) | −0.688 | (−1.164, −0.212) | −0.755 | (−1.154, −0.356) | −0.620 | (−0.922, −0.318) | −0.482 | (−0.725, −0.240) |
| AIDS | 1.543 | (1.317, 1.770) | 0.243 | (−1.166, 1.653) | 0.579 | (−0.850, 2.009) | 0.093 | (−1.131, 1.318) | 1.388 | (0.582, 2.194) |
| Site: A | −1.399 | (−1.755, −1.042) | −0.396 | (−2.433, 1.642) | −0.357 | (−2.601, 1.887) | −0.289 | (−1.945, 1.368) | 1.129 | (0.278, 1.980) |
| Site: C | 0.409 | (0.154, 0.664) | 0.561 | (−1.368, 2.491) | 0.658 | (−1.447, 2.764) | 0.543 | (−1.099, 2.184) | 0.184 | (0.003, 0.365) |
| Site: D | −0.991 | (−1.412, −0.570) | −2.416 | (−5.027, 0.194) | −2.638 | (−5.015, −0.261) | −2.548 | (−5.226, 0.131) | −1.225 | (−2.394, −0.056) |
| Site: E | −0.225 | (−0.581, 0.131) | −0.688 | (−3.353, 1.976) | −0.686 | (−3.615, 2.244) | −0.542 | (−2.855, 1.772) | −0.732 | (−1.725, 0.260) |
| Male | 0.073 | (−0.169, 0.316) | −0.728 | (−2.330, 0.874) | −0.823 | (−2.395, 0.749) | −1.195 | (−2.669, 0.280) | −0.703 | (−1.933, 0.527) |
| Age/10 years | 0.014 | (−0.091, 0.119) | 0.354 | (−0.296, 1.003) | 0.310 | (−0.315, 0.935) | 0.223 | (−0.311, 0.756) | −0.690 | (−1.644, 0.263) |
| Year of ART | −0.023 | (−0.051, 0.006) | 0.092 | (−0.144, 0.327) | 0.155 | (−0.206, 0.516) | 0.081 | (−0.134, 0.297) | −0.508 | (−1.225, 0.210) |
Note: 95% CI is the 95% confidence interval.
Our analyses excluded 283 (5%) unaudited and 5 (4%) audited subjects who died within two years of initiating ART. Analyses were repeated including these patients and using the composite endpoint of death or ADE; results were largely similar (Web Appendix F.1).
5. Discussion
Measurement error is a wide-reaching problem in biomedical research. As error-prone observational data are increasingly supporting decision-making in health policy and patient care, there is a demonstrated need for statistical methods that can retain the high power lent by large cohorts while accounting for data errors. We proposed a new SMLE method that can address dependent errors in binary outcomes and categorical or continuous covariates, and we illustrated its performance in our simulations and CCASAnet data application. The SMLE is robust, efficient, and can handle measurement error settings not yet addressed by the MLE of Tang et al. (2015). We note that other methods, including multiple imputation approaches proposed by Edwards et al. (2013) and Giganti et al. (2020), could be adapted to handle the same problem. Because those approaches rely on proper specification of the error-generating mechanisms, we expect them to perform similarly to the MLE.
The SMLE has limitations. First, it can accommodate only two or three continuous covariates in the B-spline basis because the dimension of the basis grows exponentially as the number of continuous covariates increases. This is a manifestation of the curse of dimensionality. There are workarounds: (1) error-free covariates that can be assumed to be independent of the error-prone covariates can be omitted, or (2) dimension-reduction techniques can summarize the covariates into a few representative features on which the basis is constructed. Second, although the logistic regression model Pγ(Y* | X*, Y, X) seems to be fairly robust to certain types of model misspecification, e.g., when a quadratic term of an error-free covariate is missed (Web Appendix E.2), proper specification is still desirable. One may include additional covariates that affect Y* but not Y, as well as interaction terms or splines, to facilitate flexible modeling of the outcome error mechanism. Third, X* is assumed to be a surrogate for X such that P(Y|X, X*) = Pθ(Y|X). Relaxing this assumption is straightforward but changes the marginal interpretation of the estimates.
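To see how quickly the basis grows, note that in one standard construction, a tensor-product B-spline basis with d interior knots and degree q per covariate has (d + q + 1)^p functions for p continuous covariates. A sketch (the function name and default knot/degree choices are ours, for illustration only):

```python
def bspline_basis_size(n_continuous, interior_knots=10, degree=3):
    """Size of a tensor-product B-spline basis: each continuous covariate
    contributes (interior_knots + degree + 1) univariate basis functions,
    and the joint basis is the product across covariates."""
    per_covariate = interior_knots + degree + 1
    return per_covariate ** n_continuous

# With a cubic basis and 10 interior knots per dimension, the basis size
# explodes well beyond what a small audit sample can support:
for p in range(1, 5):
    print(p, bspline_basis_size(p))  # 1 14, 2 196, 3 2744, 4 38416
```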
The proposed SMLE allows the Phase II sample selection to depend on the Phase I data in any manner. An interesting topic worth further investigation is efficient design under outcome misclassification and covariate measurement error. Because the outcome is subject to misclassification, traditional case-control sampling may not be ideal. Multi-wave designs like those proposed by McIsaac and Cook (2015) and Chen and Lumley (2020) are promising because one can use validated data obtained from earlier waves to gain insights about error mechanisms and then use this knowledge to optimally allocate audit efforts in later waves. Another future direction is to consider the situation where the Phase I sample is not a representative sample but a naive case-control sample of the population of interest (Lyles et al., 2011). To obtain proper inference in this situation, the observed-data log-likelihood (expression (2)) needs to be modified to reflect the retrospective nature of the Phase I design.
Supplementary Material
ACKNOWLEDGEMENTS
This research was supported by the National Institutes of Health grants R01AI131771, R01HL094786, and U01AI069923, the National Institute of Environmental Health Sciences grant T32ES007018, and the Patient-Centered Outcomes Research Institute grant R-1609-36207. The authors thank CCASAnet for permission to present their data and the Advanced Computing Center for Research and Education at Vanderbilt University.
Footnotes
SUPPORTING INFORMATION
The Web Appendices referenced in Sections 2–4, along with code to replicate the simulation studies, are available with this article at the Biometrics website on Wiley Online Library. The R package logreg2ph implementing the SMLE, together with all simulation and analysis code, is available on GitHub (https://github.com/sarahlotspeich/logreg2ph).
DATA AVAILABILITY STATEMENT
The data that support the findings in this paper are available on request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions.
REFERENCES
- Barron BA (1977). The effects of misclassification on the estimation of relative risk. Biometrics 33, 414–418.
- Carroll RJ, Ruppert D, Stefanski LA, and Crainiceanu CM (2006). Measurement Error in Nonlinear Models: A Modern Perspective. Boca Raton: Chapman and Hall/CRC.
- Chen T and Lumley T (2020). Optimal multiwave sampling for regression modeling in two-phase designs. Statistics in Medicine 39, 4912–4921.
- Deville JC, Särndal CE, and Sautory O (1993). Generalized raking procedures in survey sampling. Journal of the American Statistical Association 88, 1013–1020.
- Duda S, Shepherd B, Gadd C, Masys D, and McGowan C (2012). Measuring the quality of observational study data in an international HIV research network. PLoS One 7, e33908.
- Edwards JK, Cole SR, Troester MA, and Richardson DB (2013). Accounting for misclassified outcomes in binary regression models using multiple imputation with internal validation data. American Journal of Epidemiology 177, 904–912.
- Giganti MJ, Shaw PA, Chen G, Bebawy SS, Turner MM, Sterling TR, and Shepherd BE (2020). Accounting for dependent errors in predictors and time-to-event outcomes using electronic health records, validation samples, and multiple imputation. Annals of Applied Statistics 14, 1045–1061.
- Giganti MJ, Shepherd BE, Caro-Vega Y, Luz PM, Rebeiro PF, Maia M, Julmiste G, Cortes C, McGowan CC, and Duda SN (2019). The impact of data quality and source data verification on epidemiologic inference: a practical application using HIV observational data. BMC Public Health 19, 1748.
- Grenander U (1981). Abstract Inference. New York: Wiley.
- Horvitz DG and Thompson DJ (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association 47, 663–685.
- Koehler E, Brown E, and Haneuse JPA (2009). On the assessment of Monte Carlo error in simulation-based statistical analyses. The American Statistician 63, 155–162.
- Lotspeich S, Giganti M, Maia M, Vieira R, Machado D, Succi R, Ribeiro S, Pereira M, Rodriguez M, Julmiste G, Luque M, Caro-Vega Y, Mejia F, Shepherd B, McGowan C, and Duda S (2020). Self-audits as alternatives to travel-audits for improving data quality in the Caribbean, Central and South America network for HIV epidemiology. Journal of Clinical and Translational Science 4, 125–132.
- Lumley T (2019). survey: analysis of complex survey samples. R package version 3.35–1.
- Lumley T, Shaw PA, and Dai JY (2011). Connections between survey calibration estimators and semiparametric models for incomplete data. International Statistical Review 79, 200–220.
- Lyles RH, Tang L, Superak HM, King CC, Celentano DD, Lo Y, and Sobel JD (2011). Validation data-based adjustments for outcome misclassification in logistic regression: an illustration. Epidemiology 22, 589–597.
- Magder LS and Hughes JP (1997). Logistic regression when the outcome is measured with uncertainty. American Journal of Epidemiology 146, 195–203.
- Marshall RJ (1990). Validation study methods for estimating exposure proportions and odds ratios with misclassified data. Journal of Clinical Epidemiology 43, 941–947.
- McGowan C, Cahn P, Gotuzzo E, Padgett D, Pape J, Wolff M, Schechter M, and Masys D (2007). Cohort profile: Caribbean, Central and South America Network for HIV research (CCASAnet) collaboration within the International Epidemiologic Databases to Evaluate AIDS (IeDEA) programme. International Journal of Epidemiology 36, 969–976.
- McIsaac MA and Cook RJ (2015). Adaptive sampling in two-phase designs: a biomarker study for progression in arthritis. Statistics in Medicine 34, 2899–2912.
- Murphy S and Van der Vaart A (2000). On profile likelihood. Journal of the American Statistical Association 95, 449–465.
- Neuhaus JM (1999). Bias and efficiency loss due to misclassified responses in binary regression. Biometrika 86, 843–855.
- Prentice RL (1982). Covariate measurement errors and parameter estimation in a failure time regression model. Biometrika 69, 331–342.
- Quade D, Lachenbruch PA, Whaley FS, McClish DK, and Haley RW (1980). Effects of misclassifications on statistical inferences in epidemiology. American Journal of Epidemiology 111, 503–515.
- R Core Team (2019). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
- Robins JM, Rotnitzky A, and Zhao LP (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association 89, 846–866.
- Sarkar M, Mallick BK, and Carroll RJ (2014). Bayesian semiparametric regression in the presence of conditionally heteroscedastic measurement and regression errors. Biometrics 70, 823–834.
- Schumaker L (1981). Spline Functions: Basic Theory. New York: Wiley-Interscience.
- Staudenmayer J, Ruppert D, and Buonaccorsi JP (2008). Density estimation in the presence of heteroscedastic measurement error. Journal of the American Statistical Association 103, 726–736.
- Tang L, Lyles RH, King CC, Celentano DD, and Lo Y (2015). Binary regression with differentially misclassified response and exposure variables. Statistics in Medicine 34, 1605–1620.
- Tang L, Lyles RH, Ye Y, Lo Y, and King CC (2013). Extended matrix and inverse matrix methods utilizing internal validation data when both disease and exposure status are misclassified. Epidemiologic Methods 2, 49–66.
- Tao R, Zeng D, and Lin DY (2017). Efficient semiparametric inference under two-phase sampling, with applications to genetic association studies. Journal of the American Statistical Association 112, 1468–1476.
- Zaniewski E, Tymejczyk O, Kariminia A, Desmonde S, Leroy V, Ford N, Sohn AH, Nash D, Yotebieng M, Cornell M, Althoff KN, Rebeiro PF, and Egger M (2018). IeDEA-WHO Research-Policy Collaboration: contributing real-world evidence to HIV progress reporting and guideline development. Journal of Virus Eradication 4, 9–15.