Abstract
A Bayesian approach for variable selection is developed for use in models with a misclassified binary predictor variable. We define the main outcome model containing the latent predictor, the measurement model associated with the prevalence of the predictor, and the sensitivity and specificity models of the fallible classifier conditioned on the true value of the predictor. We use binary indicator variables to execute the Gibbs sampler-based variable selection process, and we identify the highest posterior probability model given the data. We demonstrate the performance of the procedure in several simulation studies, and we utilize the selection method to optimize model performance in two datasets.
Keywords: Bayesian variable selection, validation sample, differential misclassification, sensitivity, specificity, Gibbs, automobile safety, retinopathy
1. Introduction
Misclassification in covariates results in biased relationships between explanatory and response variables[26]. The process of adjusting for misclassification to minimize bias is often a subjective process because few methods exist for using data to optimize misclassification measurement models[19, 6]. Optimally performing statistical models are unbiased and parsimonious, i.e., no more complicated than necessary[8, 20], and statistical models that adjust for one or more misclassified covariates frequently make simplifying assumptions that measurement error is nondifferential or that covariates are known. Thus, despite the existence of numerous methods for variable selection and for misclassification correction, rarely is the need for parsimony considered for bias correction due to a differentially misclassified covariate.
Variable selection methods focus on identifying a subset of model covariates that predict a desired response satisfying some optimality criteria. Both frequentist and Bayesian variable selection models typically assume all response and design variables are measured perfectly such that the observed predictor-response relationship lacks systematic bias that requires correction. This paradigm has historically been the focus of seminal frequentist papers [23, 5, 25]. Important advances in Bayesian approaches for variable selection have permitted the incorporation of prior probabilities and the Bayesian inferential perspective including Markov chain Monte Carlo methods [14, 7, 24, 13].
Few variable selection methods have been developed specifically for models that contain misclassified or error-prone data. One important contribution developed a Bayesian variable selection approach for a misclassified binary response in a logistic regression scenario[16]. Similarly, a subsequent paper advanced this approach for a Poisson response model subject to underreporting bias[29]. Methods using the frequentist perspective have been proposed that apply simultaneous measurement error adjustments and variable selection to identify optimal models in the case where covariates may have measurement error. These include scenarios where the outcome is assumed measured without error[9] and when the outcome is misclassified[10, 11]. However, these works assumed mismeasured continuous covariates and did not allow for differential measurement error. No methods thus far that we are aware of have utilized a Bayesian perspective for variable selection and error adjustment with differential misclassification in a covariate. Such an approach ideally should allow for: identification of posterior probabilities for competing models, control over which covariates to consider for selection, identification and quantification of possible differential misclassification, and incorporation of prior information if available. Thus, we propose a method for adjusting for covariate misclassification that utilizes Bayesian variable selection to identify not only what covariates contribute significantly to the main outcome, but also which covariates, if any, make the misclassification differential.
Our approach merges the ideas of variable selection and correction for misclassification in a single procedure, similar to previous Bayesian methodological approaches [15, 16, 29], except in our situation the binary predictor rather than the response is misclassified. The method optimizes the set of covariates for parsimoniously estimating three important relationships: the effect of the true binary predictor on the outcome, the sensitivity and specificity of the imperfectly measured binary surrogate predictor, and the association between the covariates and the predictor. Importantly, through variable selection we can use a data-driven method to identify whether misclassification is differential or nondifferential. In Section 2, we introduce the models and define the potential nature of the misclassification, followed by a discussion of the variable selection method we use. Next, we perform simulations illustrating gains in model performance for the proposed method on randomly generated datasets in Section 3. Then, we apply the method to two data sets in Section 4 to demonstrate its effectiveness in identifying a parsimonious model while adjusting for misclassification due to self-reported data. Finally, we discuss the results as well as considerations for future research in Section 5.
2. Methods
2.1. Model Specification
Assume we are interested in the association between an outcome $Y$ and a binary predictor $X$, such that $X$ represents the true value of the predictor, as measured by a gold standard. The association between the outcome and predictor is modified by covariates represented by $\boldsymbol{Z}$. A simple approach for estimating this association assuming linear relationships between predictors and outcome is

$$g(E[Y_i]) = \beta_0 + \beta_X X_i + \boldsymbol{Z}_i^T \boldsymbol{\beta}_Z, \qquad (1)$$

where $g(\cdot)$ is the link function and $\beta_X$ is the key parameter of interest.
Next, we assume that $X$ is expensive or difficult to obtain in all study participants, but all study subjects have a more easily obtained binary measure $X^*$ that is a fallible surrogate test for the classifier $X$. Because both $X$ and $X^*$ are binary, let us define the sensitivity of $X^*$ as

$$SE = P(X^* = 1 \mid X = 1)$$

and the specificity

$$SP = P(X^* = 0 \mid X = 0).$$

The prevalence of the gold standard is $\pi = P(X = 1)$, and the prevalence of $X^*$ is

$$\pi^* = P(X^* = 1) = SE\,\pi + (1 - SP)(1 - \pi),$$

which is a function of $\pi$, $SE$, and $SP$. Importantly, naïvely substituting $X^*$ for $X$ in (1) yields a biased relationship between the predictor and $Y$. If $X^*$ is a nondifferentially misclassified version of $X$, then the relationship between $X^*$ and $Y$ is attenuated compared to the relationship between $X$ and $Y$. However, if the misclassification is differential, that is, if misclassification rates vary between subpopulations, the relationship between $X^*$ and $Y$ could be stronger or weaker than the relationship between $X$ and $Y$.
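To make the attenuation concrete, the following Python sketch computes the surrogate prevalence $\pi^*$ and compares the true odds ratio relating $X$ to $Y$ with the odds ratio observed when $X^*$ replaces $X$ under nondifferential misclassification. All numeric values are illustrative assumptions, not values from this paper.

```python
def surrogate_prevalence(pi, se, sp):
    """P(X* = 1) = SE*pi + (1 - SP)*(1 - pi)."""
    return se * pi + (1 - sp) * (1 - pi)

def observed_odds_ratio(p1, p0, pi, se, sp):
    """Odds ratio relating Y to X* when X* nondifferentially misclassifies X.

    p1 = P(Y=1 | X=1), p0 = P(Y=1 | X=0).
    """
    pi_star = surrogate_prevalence(pi, se, sp)
    # P(X=1 | X*=1) and P(X=1 | X*=0) by Bayes' rule
    w1 = se * pi / pi_star
    w0 = (1 - se) * pi / (1 - pi_star)
    # mix the outcome probabilities over true exposure status
    q1 = p1 * w1 + p0 * (1 - w1)   # P(Y=1 | X*=1)
    q0 = p1 * w0 + p0 * (1 - w0)   # P(Y=1 | X*=0)
    return (q1 / (1 - q1)) / (q0 / (1 - q0))

# true odds ratio is (0.5/0.5)/(0.2/0.8) = 4.0; the surrogate attenuates it
or_true = (0.5 / 0.5) / (0.2 / 0.8)
or_star = observed_odds_ratio(p1=0.5, p0=0.2, pi=0.4, se=0.8, sp=0.9)
```

With these values the observed odds ratio falls to roughly 2.6, illustrating attenuation toward the null without requiring any simulation.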
In the present approach, we assume that $X^*$ is measured in all study participants and that both $X$ and $X^*$ are measured in a validation subsample of participants, as originally proposed by [21]. In such a scenario, we can directly estimate $\pi$, $SE$, and $SP$ to ultimately yield an unbiased estimate of $\beta_X$ using the subsample of participants with measurements of both $X$ and $X^*$. We assume that all three probabilities are independent of the outcome $Y$, as is common in the literature on misclassification and measurement error.
The prevalence of $X$ for the $i^{th}$ participant, denoted as $\pi_i = P(X_i = 1)$, depends on the measured covariates $\boldsymbol{Z}_i$, such that

$$\text{logit}(\pi_i) = \alpha_0 + \boldsymbol{Z}_i^T \boldsymbol{\alpha}. \qquad (2)$$

Next, sensitivity and specificity are estimated using $X^*$ conditioned on $X$. For the sensitivity, i.e. $SE_i = P(X_i^* = 1 \mid X_i = 1)$, assume the probability varies according to covariates such that

$$\text{logit}(SE_i) = \eta_0 + \boldsymbol{Z}_i^T \boldsymbol{\eta}, \qquad (3)$$

and for the specificity, i.e. $SP_i = P(X_i^* = 0 \mid X_i = 0)$, we have

$$\text{logit}(SP_i) = \delta_0 + \boldsymbol{Z}_i^T \boldsymbol{\delta}. \qquad (4)$$
Using these models, we can then estimate $X$ among the participants where $X$ was not measured directly using the posterior predictive distribution of $X_i$. Fortunately, posterior probability estimation using MCMC methods makes such imputation of missing values straightforward using the validation subsample relationships.
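Ignoring the contribution of the outcome model for simplicity, the conditional probability that drives this imputation follows directly from Bayes' rule. The sketch below is an illustration under that simplifying assumption, not the paper's full posterior predictive step: it computes $P(X = 1 \mid X^*)$ from the prevalence, sensitivity, and specificity alone.

```python
def prob_x_given_xstar(xstar, pi, se, sp):
    """P(X = 1 | X* = xstar) by Bayes' rule, given prevalence pi,
    sensitivity se, and specificity sp.

    The full sampler would additionally condition on the outcome Y and
    covariates; this simplified version conditions on X* alone.
    """
    if xstar == 1:
        num = se * pi
        den = se * pi + (1 - sp) * (1 - pi)
    else:
        num = (1 - se) * pi
        den = (1 - se) * pi + sp * (1 - pi)
    return num / den
```

For example, with $\pi = 0.4$, $SE = 0.8$, and $SP = 0.9$, a positive surrogate implies $P(X = 1 \mid X^* = 1) \approx 0.84$, while a negative surrogate implies $P(X = 1 \mid X^* = 0) \approx 0.13$.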
Our general formulation of the models from the previous section defines different covariates for the outcome, exposure, sensitivity, and specificity models, denoted by $\boldsymbol{Z}_Y$, $\boldsymbol{Z}_\pi$, $\boldsymbol{Z}_{SE}$, and $\boldsymbol{Z}_{SP}$, respectively, each with a potentially differing number of covariates. This is appropriate because in practice, one might reasonably assume that, for example, age is a confounder that would be accounted for by inclusion in the disease model, but age may not be relevant for the sensitivity and/or specificity models. However, the use of variable selection procedures assumes a lack of knowledge about which covariates should be included in models (1)–(4), and thus the user depends on the procedure to make data-driven decisions to identify covariates. For this reason, in subsequent sections we will use the simplifying notation $\boldsymbol{Z}$ to denote the list of potential covariates under consideration for all models; for example, we seek to identify whether including one or more of age, sex, body weight, etc. in each of the 4 models yields the highest posterior model probability. This allows for scenarios in which a covariate may be “significant” in one or more of the models but not all. We proceed assuming that the user of the method would not know the ideal covariate vector and thus relies on our selection method to determine which subset of covariates to include for each of the models.
2.2. Gibbs Variable Selection
The availability of relatively large amounts of data could lead to inefficient specification of both the main outcome model and the ancillary models for sensitivity, specificity, and prevalence of $X$. Ideally, only a parsimonious subset of covariates, namely those that increase the model posterior probability, would be included. Numerous methods for variable selection to optimize various criteria have been developed; however, few if any to date have the capacity to select a parsimonious set of predictors for a misclassified covariate.
Our proposal is to adapt a Gibbs variable selection strategy. Gibbs variable selection is characterized by multiplying binary indicator variables, which we denote with $\gamma_j$, onto all covariates whose presence in the model is in question and then fitting the posterior distribution using a Markov chain Monte Carlo (MCMC) sampling approach. The marginal posterior distribution of the indicator variables provides the necessary information for which predictors ultimately contribute to the model based on the highest posterior probability of the vector of indicator variables, jointly represented by $\boldsymbol{\gamma} = (\gamma_1, \ldots, \gamma_p)$ assuming $p$ covariates are considered for selection[24]. Covariate parameters whose values do not deviate from 0 will have an indicator that stays at approximately 0, while covariates that differ from 0 will have indicators that predominantly take a value of 1.
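In practice, the joint posterior probability of each model is estimated by the relative frequency of each distinct indicator vector among the retained MCMC draws. A minimal Python sketch of this tabulation follows; the function and variable names are our own, not taken from the paper's appendix code.

```python
from collections import Counter

def model_probabilities(gamma_draws):
    """Estimate posterior model probabilities from MCMC indicator draws.

    gamma_draws: iterable of length-p sequences of 0/1 indicator samples,
    one per retained MCMC iteration. Returns (model, probability) pairs
    sorted from highest to lowest posterior probability.
    """
    counts = Counter(tuple(int(g) for g in draw) for draw in gamma_draws)
    total = sum(counts.values())
    return [(model, c / total) for model, c in counts.most_common()]
```

The first element of the returned list is the highest posterior probability model; for example, `model_probabilities([(1, 0), (1, 0), (1, 1), (1, 0)])` puts probability 0.75 on the model including only the first covariate.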
This approach is particularly useful for a misclassification model because all components of the model are interrelated, and fitting each compartment of the data likelihood separately may have unintended effects on other parameters in the model. Fortunately, by using Gibbs variable selection, all parameters and indicators are estimated within the same MCMC sampler, and the highest posterior probability model can be identified that conditions on all available data.
We can thus expand the models from the previous section to include the variable selection indicators as follows:

$$g(E[Y_i]) = \beta_0 + \gamma_X \beta_X X_i + \textstyle\sum_k \gamma^{Y}_k \beta_k Z_{ik}$$
$$\text{logit}(\pi_i) = \alpha_0 + \textstyle\sum_k \gamma^{\pi}_k \alpha_k Z_{ik}$$
$$\text{logit}(SE_i) = \eta_0 + \textstyle\sum_k \gamma^{SE}_k \eta_k Z_{ik}$$
$$\text{logit}(SP_i) = \delta_0 + \textstyle\sum_k \gamma^{SP}_k \delta_k Z_{ik}$$

and thus we can model the main outcome $Y$, prevalence probability $\pi$, sensitivity $SE$, and specificity $SP$, respectively. This includes the length-$p$ vector $\boldsymbol{\gamma}$, and consequently we assign all $2^p$ models positive prior probability. Here, we include an indicator for the $X$ term, although if this is the key relationship of interest, we can easily remove the indicator parameter and limit variable selection to the covariates. As previously mentioned, the potential covariates of each of the 4 models need not be the same. Furthermore, if all possible relevant covariates are considered and the selection process does not identify any covariates, then all probabilities are defined based on the intercept, and for the sensitivity and specificity models, an intercept-only misclassification model indicates nondifferential misclassification.
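The indicator construction is mechanically simple: each candidate coefficient is multiplied by its indicator inside the linear predictor, so an excluded covariate contributes nothing without changing the dimension of the parameter vector. A short illustrative sketch (our own code, not the paper's):

```python
import numpy as np

def linear_predictor(intercept, gamma, beta, Z):
    """Linear predictor with selection indicators:
    eta_i = intercept + sum_k gamma_k * beta_k * Z_ik.

    Setting gamma[k] = 0 drops covariate k from the model while its
    coefficient beta[k] continues to be sampled (from the pseudoprior).
    """
    gamma = np.asarray(gamma, dtype=float)
    beta = np.asarray(beta, dtype=float)
    return intercept + np.asarray(Z) @ (gamma * beta)
```

For instance, with `gamma = [1, 0]` the second covariate is removed from the predictor regardless of the current value of its coefficient.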
Identifiability is a concern for frequentist consideration of such a model, but this is not a concern for the Bayesian approach[24]. Our interest is in the posterior probability of vectors of indicator parameters, and parameters that deviate from zero will be paired with a corresponding nonzero indicator parameter within the MCMC algorithm; more succinctly, if the best fitting model depends on a nonzero covariate, the indicator parameter will accommodate this by taking a value of 1 over a large proportion of iterations of the MCMC sampler. Once an optimal model is identified via the highest posterior probability vector of nonzero $\gamma$'s, the selected model can be fit by including the relevant covariates and excluding the indicator selection parameters. As a result, even if the same covariates (i.e., sex, age, body weight, etc.) are considered for inclusion in each of the four models, the optimized final model could include different combinations of these covariates, such that a covariate could be nonzero, for example, for the sensitivity model but not the main outcome model.
2.3. Sampling Procedure
The Gibbs sampling-based selection procedure is easily implemented by most Bayesian software packages and follows the method described previously [27] except that we generalize to include a differentially misclassified covariate.
Borrowing notation heavily from [27], let $\boldsymbol{\theta}$ be a parameter vector of length $p$, and then define $\boldsymbol{\gamma}$ as the corresponding $p$-length vector of binary indicators ($\gamma_j \in \{0, 1\}$). We can partition this into subvectors for each component model, $\boldsymbol{\gamma} = (\boldsymbol{\gamma}^{Y}, \boldsymbol{\gamma}^{\pi}, \boldsymbol{\gamma}^{SE}, \boldsymbol{\gamma}^{SP})$. The full (condensed) likelihood is $f(\boldsymbol{y} \mid \boldsymbol{\theta}, \boldsymbol{\gamma})$, and we define the prior distribution as $p(\boldsymbol{\theta}, \boldsymbol{\gamma}) = p(\boldsymbol{\theta} \mid \boldsymbol{\gamma})\,p(\boldsymbol{\gamma})$. Next, let $\boldsymbol{\gamma}_{-j}$ be the vector of indicator variables that excludes $\gamma_j$, and we can let $\boldsymbol{\theta}$ be partitioned into $\boldsymbol{\theta}_{\gamma}$ and $\boldsymbol{\theta}_{\setminus\gamma}$, corresponding to the variables that are included in and excluded from the model, respectively, via variable selection. This partition allows the prior to be written as the joint density of included and excluded variables, i.e. $p(\boldsymbol{\theta} \mid \boldsymbol{\gamma}) = p(\boldsymbol{\theta}_{\gamma} \mid \boldsymbol{\gamma})\,p(\boldsymbol{\theta}_{\setminus\gamma} \mid \boldsymbol{\gamma})$.
The likelihood only uses the parameters in the $\boldsymbol{\theta}_{\gamma}$ vector, so $f(\boldsymbol{y} \mid \boldsymbol{\theta}, \boldsymbol{\gamma}) = f(\boldsymbol{y} \mid \boldsymbol{\theta}_{\gamma})$, and the posterior for $\boldsymbol{\theta}$ is

$$p(\boldsymbol{\theta} \mid \boldsymbol{y}, \boldsymbol{\gamma}) \propto f(\boldsymbol{y} \mid \boldsymbol{\theta}_{\gamma})\,p(\boldsymbol{\theta}_{\gamma} \mid \boldsymbol{\gamma})\,p(\boldsymbol{\theta}_{\setminus\gamma} \mid \boldsymbol{\gamma}). \qquad (5)$$
The quantity $p(\boldsymbol{\theta}_{\setminus\gamma} \mid \boldsymbol{\gamma})$ is referred to as the “pseudoprior” because it contains no information from the data, but it captures parameters excluded from the posterior. Importantly, sampling from the pseudoprior remains a key component of the Gibbs process because it permits excluded covariates to reenter the model through the sampling sequence if supported by the data.
Sampling from the posterior distribution is more complex due to the adjustment for the misclassified predictor. When $X$ and $X^*$ are both available within the dataset, the first stage of estimation of posterior parameters is performed by sampling from the joint posterior

$$p(\boldsymbol{\theta}_{\gamma} \mid \boldsymbol{y}, \boldsymbol{\gamma}) \propto f(\boldsymbol{y} \mid \boldsymbol{\theta}_{\gamma})\,p(\boldsymbol{\theta}_{\gamma} \mid \boldsymbol{\gamma}),$$

so that we can sample directly from the (joint) full conditional posterior for each component model. Sampling for each parameter $\theta_j$ included in the model generally follows the full conditional posterior of the form

$$p(\theta_j \mid \boldsymbol{y}, \boldsymbol{\gamma}, \boldsymbol{\theta}_{-j}) \propto f(\boldsymbol{y} \mid \boldsymbol{\theta}_{\gamma})\,p(\theta_j \mid \boldsymbol{\gamma}).$$

This general proportionality statement applies to all the full conditional distributions.
For observations in which $X$ is observed, the joint likelihood can inform the parameters associated with $\pi$, $SE$, and $SP$. However, for observations where $X$ is not observed, we must sample from the posterior predictive distribution of $X_i$, which is influenced heavily by the parameters of the $\pi$, $SE$, and $SP$ models, to predict $X_i$ based on $X_i^*$ and other included covariates.
Next, we sample the excluded parameters from the pseudoprior,

$$\boldsymbol{\theta}_{\setminus\gamma} \sim p(\boldsymbol{\theta}_{\setminus\gamma} \mid \boldsymbol{\gamma}).$$

Note, similar to above, this distribution could be broken into component models, but because these draws do not yield information for posterior estimation based on the observed data, we simply reiterate that $\boldsymbol{\theta}_{\setminus\gamma}$ contains all excluded parameters from the estimation process.
The third and final portion of the process is where we randomly sample each indicator, $\gamma_j \sim \text{Bernoulli}\!\left(\frac{O_j}{1 + O_j}\right)$, where

$$O_j = \frac{f(\boldsymbol{y} \mid \boldsymbol{\theta}, \gamma_j = 1, \boldsymbol{\gamma}_{-j})\,p(\boldsymbol{\theta} \mid \gamma_j = 1, \boldsymbol{\gamma}_{-j})\,p(\gamma_j = 1 \mid \boldsymbol{\gamma}_{-j})}{f(\boldsymbol{y} \mid \boldsymbol{\theta}, \gamma_j = 0, \boldsymbol{\gamma}_{-j})\,p(\boldsymbol{\theta} \mid \gamma_j = 0, \boldsymbol{\gamma}_{-j})\,p(\gamma_j = 0 \mid \boldsymbol{\gamma}_{-j})}.$$

This step may modify $\gamma_j$ and thus lies at the heart of the Gibbs selection process.
We include a sample R program and JAGS code in the Appendix that can be modified to other covariate arrangements. The reason model identifiability is not a concern lies in the sampling process for each individual $\gamma_j$. It is clear that we do not require simultaneous estimation of $\boldsymbol{\theta}$ and $\boldsymbol{\gamma}$ as would be required for frequentist estimation; rather, the probabilities for each $\gamma_j$ are a function of the parameters' values at any step of the MCMC procedure.
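As a self-contained illustration of the three-step cycle above, the following Python sketch implements indicator-based Gibbs variable selection for an ordinary linear model with known error variance, using the prior as the pseudoprior (a simplification in the spirit of Kuo and Mallick). The paper's own implementation is the R/JAGS code in its Appendix; all names and values here are our own assumptions.

```python
import numpy as np

def gvs_sampler(y, X, sigma2=1.0, tau2=10.0, n_iter=3000, burn=500, seed=0):
    """Gibbs variable selection for y = X @ (gamma * beta) + N(0, sigma2).

    Priors: beta_j ~ N(0, tau2), gamma_j ~ Bernoulli(0.5); the prior on
    beta_j doubles as the pseudoprior for excluded coefficients.
    Returns the posterior inclusion frequency of each column of X.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)
    gamma = np.ones(p, dtype=int)
    incl = np.zeros(p)
    for it in range(n_iter):
        for j in range(p):
            # residual with covariate j removed
            mask = gamma.astype(float)
            mask[j] = 0.0
            r = y - X @ (beta * mask)
            # log-likelihood with gamma_j = 1 (at current beta_j) vs gamma_j = 0
            ll1 = -0.5 * np.sum((r - X[:, j] * beta[j]) ** 2) / sigma2
            ll0 = -0.5 * np.sum(r ** 2) / sigma2
            # Bernoulli(0.5) prior on gamma_j cancels in the odds O_j
            p1 = 1.0 / (1.0 + np.exp(np.clip(ll0 - ll1, -700.0, 700.0)))
            gamma[j] = int(rng.random() < p1)
            if gamma[j]:
                # conjugate full conditional for an included coefficient
                v = 1.0 / (X[:, j] @ X[:, j] / sigma2 + 1.0 / tau2)
                m = v * (X[:, j] @ r) / sigma2
                beta[j] = rng.normal(m, np.sqrt(v))
            else:
                # pseudoprior draw lets gamma_j flip back to 1 later
                beta[j] = rng.normal(0.0, np.sqrt(tau2))
        if it >= burn:
            incl += gamma
    return incl / (n_iter - burn)
```

On data generated with one truly nonzero coefficient, the inclusion frequency of the active covariate approaches 1 while the null covariate's stays low, reproducing the indicator behavior described above.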
3. Simulations
We assessed the model with a number of simulation scenarios designed to demonstrate the reduction in bias due to adjustment for misclassification compared to a naïve approach; the performance of the selection procedure in identifying the correct “data generating” model; and the improvement in posterior credible interval widths and Deviance Information Criterion (DIC)[30] of the reduced adjusted model relative to an adjusted model without selection. For each simulation we used varying sample sizes for the overall study (sample size denoted by $N$) and for the smaller substudy (sample size denoted by $n$) from which we obtain gold standard observations and estimates of the misclassification and measurement models. Across all simulations we assess the expected value of $\beta_X$ or the biased $\beta^*_X$, the credible interval width of $\beta_X$ or $\beta^*_X$, and the DIC of the models. Example R code is in the Appendix.
3.1. Simulation Details
We generated simulated data for binary covariates $Z_1$ and $Z_2$ with success probabilities 0.4 and 0.5, respectively. The outcome state of interest, $Y$, was generated from a logistic model of form (1) containing $X$ and the covariates; the true value of the exposure of interest, $X$, was generated from a prevalence model of form (2); and we generated our fallible classifier $X^*$ with sensitivity and specificity following models of forms (3) and (4), conditioned on the true value of $X$.
We modeled the data under varying scenarios to evaluate model performance. We generated data from the models described above using total sample sizes $N = 1200$, $1800$, and $2400$; then, for each of these, we set a subsample size $n = 300$, $600$, and $900$. For each of the 9 pairings of $N$ and $n$, we generated 300 simulated datasets and fit each dataset 4 times as described in Sections 3.1.1 through 3.1.4.
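A data-generating sketch in Python mirroring this structure is below. The paper's generating coefficient values are not reproduced in this excerpt, so every coefficient here is an illustrative assumption; only the structure (covariate-dependent prevalence and misclassification that is differential through $Z_1$) follows the description above.

```python
import numpy as np

def expit(t):
    """Inverse logit."""
    return 1.0 / (1.0 + np.exp(-t))

def simulate_dataset(N=1200, n=300, beta_x=1.0, seed=0):
    """Simulate one dataset: outcome Y, true exposure X, fallible X*,
    covariates Z1, Z2, and a validation indicator for the subsample of
    size n in which the gold standard X is observed.

    All coefficients besides beta_x are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    z1 = rng.binomial(1, 0.4, N)
    z2 = rng.binomial(1, 0.5, N)
    # prevalence model (2): true exposure depends on the covariates
    x = rng.binomial(1, expit(-0.5 + 0.7 * z1 + 0.4 * z2))
    # outcome model (1) with key parameter beta_x
    y = rng.binomial(1, expit(-1.0 + beta_x * x + 0.5 * z1 - 0.3 * z2))
    # misclassification models (3)-(4): differential through z1
    se = expit(1.4 + 0.8 * z1)
    sp = expit(2.0 - 0.6 * z1)
    xstar = np.where(x == 1, rng.binomial(1, se), rng.binomial(1, 1.0 - sp))
    validated = np.zeros(N, dtype=bool)
    validated[rng.choice(N, size=n, replace=False)] = True
    return {"y": y, "x": x, "xstar": xstar, "z1": z1, "z2": z2,
            "validated": validated}
```

Each simulated dataset then feeds the four fitting strategies described next, with $X$ masked outside the validation subsample.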
3.1.1. Naïve Model
First, we fit a model naïvely assuming $X^*$ can be used in place of $X$ for all observations. Under this scenario, we fit

$$\text{logit}(P(Y_i = 1)) = \beta^*_0 + \beta^*_X X^*_i + \beta^*_1 Z_{1i} + \beta^*_2 Z_{2i}, \qquad (6)$$

where the posterior of $\beta^*_X$ is a biased parameter of the association between the outcome and exposure.
3.1.2. Full adjusted model
For the full adjusted model, we fit each of the models (1), (2), (3), and (4) using the full set of covariates, including those whose generating coefficients were 0. This model adjusts for misclassification by allowing estimation of $\beta_X$. However, it includes covariates that do not significantly contribute to the model, thus potentially adding prediction error.
3.1.3. Variable selection model
For the variable selection, we fit the model described in Section 2.2 to identify the vector of indicator variables with highest posterior probability. As implied above, the data-generating vector $\boldsymbol{\gamma}^{\text{true}}$ has a 1 indicating each nonzero parameter value, and we estimated the proportion of simulated datasets in which this vector was identified as the highest posterior probability model for varying sample sizes.
3.1.4. Reduced adjusted model
Finally, we fit the reduced model that includes only the nonzero covariates used to generate the datasets, i.e., the data-generating model that excludes unnecessary covariates. This model demonstrates the gains in model performance that are achievable when the correct model is identified.
3.2. Simulation Results
3.2.1. Variable selection performance
The frequency with which the variable selection method identified the correct data-generating model is shown in Figure 1. The lines are grouped by the subsample size $n$, the number of observations in which the gold standard was observed, across different values of the total sample size $N$. The vertical axis is the relative frequency with which the selection method correctly identified $\boldsymbol{\gamma}^{\text{true}}$ as the highest posterior probability model.
Figure 1:

Relative frequency of selecting the correct model across varying overall and subsample sizes.
The probability of obtaining perfect identification of all nonzero parameter estimates in each of the 4 models was most highly dependent on the size of the subsample. Because the “correct” model includes parameters from all 4 models, there is low power and wide parameter credible intervals for the sensitivity and specificity models at small values of $n$. Probabilities increased slowly as $N$ grew but rapidly as $n$ grew. This indicates that significant attention should be paid to the validation subsample size when planning the measurement of gold standard data.
3.2.2. Gains in model performance due to misclassification adjustment and variable selection
Table 1 presents features of the model performances using simulated datasets across the full and validation sample sizes. The columns that present results for the naïve model show the posterior mean and standard deviation of $\beta^*_X$ when data were fit ignoring the misclassification, using only the proxy $X^*$ as if it were perfectly observed. The next two sets of columns took the misclassification into account to estimate $\beta_X$, the parameter describing the model-adjusted association between $X$ and $Y$. First, the adjusted-without-selection columns used the observations of $X$ to estimate parameters for the prevalence, sensitivity, and specificity models and then generated the posterior predictive distribution of $X$ to produce unbiased estimates of model parameters. The final set of columns presents estimates of $\beta_X$ using only the reduced set of parameters that were used to generate the data, representing a “best case scenario” for a reduced set of covariates. For the latter two groupings, we present the 95% credible interval widths and DIC under each scenario.
Table 1:
Posterior model estimates and DIC quantities across 300 simulated datasets for each $(N, n)$ pair. $N$ indicates the total sample size and $n$ indicates the substudy sample size in which the gold standard was observed.
| $N$ | $n$ | Naïve mean | Naïve SD | Adj. mean | Adj. SD | Adj. 95% CI width | Adj. DIC | Red. mean | Red. SD | Red. 95% CI width | Red. DIC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1200 | 300 | 0.529 | 0.131 | 0.997 | 0.198 | 0.775 | 4388 | 0.994 | 0.185 | 0.727 | 4322 |
| 1200 | 600 | 0.542 | 0.131 | 1.015 | 0.166 | 0.651 | 3854 | 1.015 | 0.159 | 0.623 | 3839 |
| 1200 | 900 | 0.539 | 0.131 | 1.015 | 0.148 | 0.579 | 3889 | 1.015 | 0.142 | 0.557 | 3885 |
| 1800 | 300 | 0.534 | 0.107 | 1.006 | 0.175 | 0.687 | 7616 | 1.006 | 0.162 | 0.638 | 7363 |
| 1800 | 600 | 0.529 | 0.107 | 1.006 | 0.150 | 0.589 | 6122 | 1.006 | 0.142 | 0.558 | 6051 |
| 1800 | 900 | 0.541 | 0.107 | 1.009 | 0.135 | 0.529 | 5794 | 1.007 | 0.129 | 0.507 | 5771 |
| 2400 | 300 | 0.524 | 0.092 | 0.991 | 0.159 | 0.625 | 11802 | 0.990 | 0.146 | 0.574 | 11225 |
| 2400 | 600 | 0.527 | 0.092 | 0.989 | 0.138 | 0.540 | 8860 | 0.998 | 0.130 | 0.509 | 8671 |
| 2400 | 900 | 0.535 | 0.092 | 1.012 | 0.126 | 0.495 | 7988 | 1.012 | 0.120 | 0.471 | 7918 |

Naïve = model (6) ignoring misclassification; Adj. = adjusted without selection; Red. = adjusted with the reduced set of covariates.
Data were generated using $\beta_X = 1$ for all scenarios. The naïve model yields biased estimates via the posterior mean of $\beta^*_X$, which hovered near 0.53 across scenarios. The bias does not improve with increasing sample sizes, which has previously been demonstrated for covariate misclassification models [12]. Compared to the model-adjusted estimates, the posterior standard deviations and 95% credible interval widths are small, but this must be interpreted in the context of the large bias; importantly, on average the 95% credible intervals centered at the posterior mean of $\beta^*_X$ excluded the true value of $\beta_X$, and thus the high degree of precision is misleading.
The columns with adjustment without selection and with the reduced set of covariates show that both models that adjusted for misclassification yielded unbiased posterior mean estimates of $\beta_X$. However, the analysis that excluded the unnecessary covariates demonstrated added precision in the posterior estimates via reduced standard deviations and credible interval widths for the more parsimonious models. This applied uniformly for increasing $N$ and $n$. The improvement is further reinforced by the reductions in DIC for the reduced model compared to the full model.

We conclude that, generally, when it is feasible to select the significant contributors to the parameters for the prevalence, sensitivity, and specificity models, there can be large reductions in estimation bias of the association of $X$ with $Y$ compared to the naïve model that uses the proxy $X^*$. Furthermore, the precision of the estimates of $\beta_X$ can be increased by using a parsimonious model identified through variable selection compared to the full model.
4. Examples
We demonstrated the method on two datasets. The first dataset explored the relationship between motor vehicle injury and seatbelt use where seatbelt use may be misclassified [21], and the second explored retinopathy status and cardiovascular disease where self-reported retinopathy status may be misclassified [17].
4.1. Example 1: Injury in Automobile Accidents
Our first example dataset used data from highway safety research in which we wished to determine whether the probability of an automobile accident producing an injury depends on the binary covariates driver gender, car damage (low or high), and seat belt use[21]. However, the information on the police reports regarding belt usage was subject to systematic misclassification errors that may depend on the other covariates, gender and car damage. We assumed all other data contained in the police report were accurate. The majority of the data observations (80,084) had data recorded solely from police accident reports, while a smaller subsample of drivers had the police accident report along with a more intensive follow-up interview that contained what we will consider gold standard information regarding seat belt usage; the total sample size was the sum of these two groups.
Initially, we analyzed the data ignoring misclassification. We defined the response $Y_i$ as the binary indicator of injury status (0 = uninjured, 1 = injured) for the $i^{th}$ individual. Furthermore, the fallible measure for belt usage was denoted by $X^*_i$, where $X^*_i = 1$ indicated that belt usage was reported by the police and $X^*_i = 0$ indicated no reported belt usage. The indicators $Z_{1i}$ and $Z_{2i}$ represented gender (1 = male) and car damage (1 = high), respectively. The unadjusted model that ignored misclassification (naïve model) was thus

$$\text{logit}(P(Y_i = 1)) = \beta^*_0 + \beta^*_X X^*_i + \beta^*_1 Z_{1i} + \beta^*_2 Z_{2i}.$$

In this formulation, $\boldsymbol{\beta}^*$ was the parameter vector for covariates of the naïve model that ignored misclassification, and $\beta^*_X$ was the parameter for the association of interest using the fallible covariate. We subsequently multiplied each covariate coefficient by a Bernoulli indicator variable so that our model became

$$\text{logit}(P(Y_i = 1)) = \beta^*_0 + \gamma_1 \beta^*_X X^*_i + \gamma_2 \beta^*_1 Z_{1i} + \gamma_3 \beta^*_2 Z_{2i}. \qquad (7)$$
Because there were three parameters in the model to consider, the procedure must choose among $2^3 = 8$ possible models. We placed diffuse normal prior distributions on the model parameters such that $\boldsymbol{\beta}^* \sim N_3(\boldsymbol{0}, c\,\boldsymbol{I}_3)$, where $c$ was large and $\boldsymbol{I}_3$ was the 3 × 3 identity matrix. Furthermore, we placed independent prior distributions on each indicator where $\gamma_j \sim \text{Bernoulli}(0.5)$.
To account for the misclassified covariate $X^*$, we defined the true belt usage indicator $X_i$ for the $i^{th}$ individual, where $X_i = 1$ indicated the study participant was wearing a belt and $X_i = 0$ if not. The variables $X_i$ and $X^*_i$ were both observed only for the validation subsample. We then defined the Bernoulli response $Y_i$ as occurring with probability $p_i$, where

$$\text{logit}(p_i) = \beta_0 + \beta_X X_i + \beta_1 Z_{1i} + \beta_2 Z_{2i}.$$

Furthermore, we allowed the prevalence of $X$ as well as the sensitivity and specificity of $X^*$ to depend on $Z_1$ and $Z_2$ as follows:

$$\text{logit}(\pi_i) = \alpha_0 + \alpha_1 Z_{1i} + \alpha_2 Z_{2i},$$
$$\text{logit}(SE_i) = \eta_0 + \eta_1 Z_{1i} + \eta_2 Z_{2i},$$
$$\text{logit}(SP_i) = \delta_0 + \delta_1 Z_{1i} + \delta_2 Z_{2i}.$$
To develop a more parsimonious model, we multiplied binary indicator parameters $\gamma_j$ onto each covariate coefficient of this set of models, exactly as in (7), so that the outcome, prevalence, sensitivity, and specificity models were all subject to selection. Assuming no prior information was available, we placed diffuse normal prior distributions on each of the model parameters, and for each $\gamma_j$ we assigned the prior distribution $\gamma_j \sim \text{Bernoulli}(0.5)$. In this situation we had $2^9 = 512$ possible models with an equivalent prior probability induced for each model. We assigned initial values to each of three chains, executed a 5,000 iteration burn-in, and then kept 15,000 samples from the posterior distribution for each chain. The highest posterior probability models are listed in Table 4.
Table 4:
Automobile Accidents: Top 5 Posterior Model Probabilities for Misclassification Model
| Covariate Vector | Posterior probability |
|---|---|
| | 0.54 |
| | 0.24 |
| | 0.19 |
| | 0.015 |
| | 0.0059 |
We first observe in Table 2 the posterior model probabilities for the variable selection procedure applied to model (7). Of the eight possible models, overwhelmingly the highest posterior probability model included all three binary covariates. Given the large sample size, all parameter standard deviation estimates were small enough that the posterior supported $\gamma_j = 1$ for each covariate.
Table 2:
Automobile Accidents: Top Posterior Model Probabilities for Naïve Model
| Covariate Vector | Posterior probability |
|---|---|
| | 0.996 |
| | 0.004 |
| All others | 0 |
In Table 3 we observed the parameter estimates for the naïve model. These were used as a comparator with the posterior values from Table 5. All model parameters were nonzero with small standard deviations, which indicates the variables were significant predictors.
Table 3:
Automobile Accidents: Highest posterior probability model parameter estimates
| Parameter | Mean | Standard Deviation | 95% Credible Interval |
|---|---|---|---|
| | −2.13 | 0.02 | (−2.18, −2.10) |
| | 1.54 | 0.02 | (1.50, 1.59) |
| | −0.38 | 0.02 | (−0.42, −0.35) |
| | −0.28 | 0.033 | (−0.35, −0.22) |
Table 5:
Automobile Accidents: Highest posterior probability model parameter estimates
| Parameter | Mean | Standard Deviation | 95% Credible Interval |
|---|---|---|---|
| | −2.11 | 0.02 | (−2.15, −2.06) |
| | 1.54 | 0.02 | (1.51, 1.59) |
| | −0.39 | 0.02 | (−0.43, −0.35) |
| | −0.38 | 0.05 | (−0.47, −0.29) |
| | −1.60 | 0.05 | (−1.70, −1.50) |
| | −0.18 | 0.09 | (−0.34, −0.00) |
| | 0.37 | 0.05 | (0.26, 0.48) |
| | 3.58 | 0.14 | (3.33, 3.86) |
| | 0.87 | 0.20 | (0.53, 1.30) |
We observe in Table 4 the models ranked by their posterior probability when we corrected for misclassification. The highest posterior probability model had no covariates in the prevalence model, indicating that neither gender nor crash severity contributed significantly to the estimation of $\pi$. However, crash severity and gender were identified in the sensitivity and specificity components of the highest posterior probability model, indicating differential misclassification.
Given this information, we fit the final model corresponding to the highest posterior probability configuration, with the indicator parameters removed, which yielded the posterior estimates listed in Table 5.
The posterior mean estimates in Table 5 were similar to the posterior estimates in Table 3, with the notable exception of the belt usage parameter: −0.38 in the corrected model compared to −0.28 in the naïve model. It appears that the misclassification resulted in an attenuation of the effect of seat belt usage on the probability of injury; when the misclassification is accounted for, we observe an even lower probability of injury for belt users than was suggested by the fallible police-report measure. Furthermore, the corrected model had a slightly larger parameter standard deviation, which reflects our added uncertainty due to the misclassified binary predictor.
4.2. Example 2: Retinopathy in the ACCORD-Eye Study
In the Action to Control Cardiovascular Risk in Diabetes (ACCORD) Study, baseline diabetic retinopathy (an indicator of microvascular disease) was assessed via questionnaire among all participants included in this analysis, and within the ACCORD-Eye ancillary study a smaller group of individuals was assessed for retinopathy using gold-standard fundus photography. ACCORD and ACCORD-Eye were designed to test the impacts of glycemia control, blood pressure control, and lipid control on cardiovascular disease and retinopathy [1, 3, 2, 4]. In a subsequent paper [17], researchers were interested in the relationship between retinopathy and cardiovascular disease, to determine whether microvascular damage could be associated with more serious cardiovascular outcomes. The outcome of interest was prospectively ascertained cardiovascular disease assessed as a binary indicator, as predicted by retinopathy status (assessed via fallible questionnaire and, in a subset, gold-standard fundus photography), gender, age (≥ 60 yrs vs < 60), and high systolic blood pressure (SBP, ≥ 130 mmHg vs < 130). Using a model framework similar to the previous sections, we defined CVD as the primary disease outcome $Y$, self-reported retinopathy as $X^*$, retinopathy based on fundus photography as $X$, and the remaining covariates as follows: gender as $Z_1$, the age indicator as $Z_2$, and the SBP indicator as $Z_3$. There were a total of $2^{13} = 8{,}192$ possible unique configurations of parameters.
We fit the full model, and all of the highest posterior probability models excluded the gold-standard retinopathy predictor. Because this was the primary quantity of interest, we modified the procedure to force the retinopathy term to remain in the model by redefining the model as
resulting in one fewer variable selection parameter and a possible 2¹² = 4,096 models.
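In the JAGS specification (see the Appendix), forcing the gold-standard predictor to remain in the model amounts to dropping its selection indicator from the outcome linear predictor; a minimal sketch using the Appendix's variable names:

```
# Outcome model with g[3] removed: beta[4]*tx3[i] is always retained
logit(pi[i]) <- beta[1] + beta[2]*x1[i]*g[1] + beta[3]*x2[i]*g[2] +
                beta[4]*tx3[i]
```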
The highest posterior probability model in Table 6 included only gender and age in the main CVD prediction model and age in the specificity model. When the top two models were fit separately, the deviance information criterion (DIC) for the highest posterior probability model was 15,976, while the DIC for the second-best model was 15,983, where lower DIC values indicate better fit. The form of the highest posterior probability model, again including the forced-in retinopathy term, was
and we fit this model with noninformative priors for all parameters. In Table 7, all included parameters excluded 0 from their 95% credible intervals except the forced-in variable of interest, which the selection procedure would otherwise have excluded. Notably, when the self-report measure was substituted for the gold-standard measure in the model, the posterior mean (95% CI) of the self-report retinopathy parameter was 0.36 (0.15, 0.57), indicating an increased CVD risk among those with self-reported retinopathy. Importantly, the parameter estimates suggested the sensitivity of self-report retinopathy was quite poor.
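The DIC values above follow the definition of Spiegelhalter et al. [30]:

```latex
\mathrm{DIC} = \bar{D} + p_D, \qquad p_D = \bar{D} - D(\bar{\theta}),
```

where $\bar{D}$ is the posterior mean of the deviance and $D(\bar{\theta})$ is the deviance evaluated at the posterior means, so smaller DIC values reflect a better trade-off between fit and effective model complexity.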
Table 6:
ACCORD Eye Variable Selection: Highest posterior probability models
| Covariate Vector | Posterior probability |
|---|---|
|  | 0.58 |
|  | 0.13 |
|  | 0.05 |
|  | 0.05 |
|  | 0.03 |
Table 7:
ACCORD Eye: Highest posterior probability model parameter estimates
| Parameter | Mean | Standard Deviation | 95% Credible Interval |
|---|---|---|---|
|  | −2.38 | 0.08 | (−2.54, −2.22) |
|  | −0.39 | 0.07 | (−0.54, −0.24) |
|  | 0.49 | 0.08 | (0.34, 0.64) |
|  | −0.06 | 0.12 | (−0.31, 0.17) |
|  | −0.73 | 0.04 | (−0.80, −0.66) |
|  | −2.26 | 0.09 | (−2.44, −2.08) |
|  | 2.39 | 0.09 | (2.22, 2.57) |
|  | −0.38 | 0.10 | (−0.58, −0.18) |
5. Discussion
Obtaining parsimonious models and selecting covariates in complex situations such as misclassification creates difficulty for traditional variable selection methods. However, we demonstrate that a Bayesian approach using Gibbs variable selection can identify parsimonious sets of covariates using only the computational resources available within standard, freely available MCMC software packages. The method simultaneously identifies a model that yields an unbiased relationship between covariate and response, and we further show that the resulting parsimonious model has reduced credible interval widths and lower DIC than the model that includes all covariates. The method is useful and flexible for complex real-world datasets, as we demonstrated in our examples from auto safety and biomedical science.
Future work should identify alternate approaches that optimize other model performance criteria, such as the Bayesian LASSO [28] or reversible-jump MCMC [18]. To our knowledge, this is the first approach that addresses variable selection for a misclassified covariate, so there are no published comparable methods against which to evaluate performance. Furthermore, the method could be expanded to allow for variable selection when misclassification exists in both the response variable and one or more covariates, permitting even greater complexity [22]. Finally, logistic regression using normal priors on parameters is quite slow due to the lack of conjugacy between the priors and the likelihood; future work should explore computational or distributional advances to reduce computation time.
Funding
The ACCORD Trial was supported by the National Heart, Lung, and Blood Institute under grant 5M01RR007122 and the ACCORD Eye study was supported by the National Eye Institute under grant 1Z01EY000410.
6. Appendix
This method was executed in JAGS. The R code below demonstrates the procedure on a randomly generated dataset:
set.seed(192)
library(R2jags)
library(boot)
N=1000 # Total sample size = N
n=500 # Validation sample size = n
terms <- 9 # Number of non-intercept parameters
probx1=0.4 # Covariate probabilities
probx2=0.5
# Known parameters: first value is the intercept
betv=c(-1,0,1.1,-0.8)
lamv=c(-0.2,1.6,-2.0)
gamv=c(2.5,0,-1.8)
tauv=c(0.4,1.6,0)
# Generate fake data
# Define x1 & x2 (independent)
x1=rbinom(N,1,probx1)
x2=rbinom(N,1,probx2)
# Define the true unobserved covariate z (tx3 will be a subset of z)
p=exp(lamv[1]+lamv[2]*x1+lamv[3]*x2)/(1+exp(lamv[1]+lamv[2]*x1+lamv[3]*x2))
z=rbinom(N,1,p)
# Create validation dataset: true status z observed for the first n subjects
tx3=rep(NA,N)
tx3[1:n]=z[1:n]
# Define the sensitivity and specificity of fallible test, generate
# misclassified x3
se=inv.logit(gamv[1]+gamv[2]*x1+gamv[3]*x2)
sp=inv.logit(tauv[1]+tauv[2]*x1+tauv[3]*x2)
x3=rbinom(N,1,z*se+(1-z)*(1-sp))
# Generate probabilities & output variable y
pi=inv.logit(betv[1]+betv[2]*x1+betv[3]*x2+betv[4]*z)
y=rbinom(N,1,pi)
#######################################################
# #
# Calling JAGS - Variable Selection #
# #
#######################################################
covbvs<-function()
{
for (i in 1:N){
# y is perfectly observed response (1= injured, 0 = uninjured)
y[i] ~ dbern(pi[i])
logit(pi[i]) <- beta[1] + beta[2]*x1[i]*g[1] + beta[3]*x2[i]*g[2] +
beta[4]*tx3[i]*g[3]
tx3[i] ~ dbern(p[i])
logit(p[i]) <- lambda[1] + lambda[2]*x1[i]*g[4] + lambda[3]*x2[i]*g[5]
x3[i] ~ dbern(q[i])
q[i] <- tx3[i]*se[i] + (1-tx3[i])*(1-sp[i])
logit(se[i]) <- gamma[1] + gamma[2]*x1[i]*g[6] + gamma[3]*x2[i]*g[7]
logit(sp[i]) <- tau[1] + tau[2]*x1[i]*g[8] + tau[3]*x2[i]*g[9]
}
for (i in 1:4){
beta[i] ~ dnorm(0.0,1.0E-3)
}
for (i in 1:3){
gamma[i] ~ dnorm(0.0,1.0E-3)
tau[i] ~ dnorm(0.0,1.0E-3)
lambda[i] ~ dnorm(0.0, 1.0E-3)
}
for (k in 1:terms){
g[k] ~ dbern(0.5); #Prior on selection parameters
}
}
j.data <- list("N", "y", "x1", "x2", "x3", "tx3", "terms")
j.inits <- list(
  list(lambda=c(0,0.5,-0.2),
       gamma=c(2.6,0.45,-2.3),
       tau=c(1,1.7,-0.3),
       beta=c(-1.1,0.7,1.6,-0.3),
       g=rep(1,terms)),
  list(lambda=c(-0.8,0.8,0),
       gamma=c(2.0,0.5,-1.5),
       tau=c(0,2.5,-0.8),
       beta=c(-0.5,1.5,0.6,-0.9),
       g=rep(1,terms)),
  list(lambda=c(-0.2,0.2,0.8),
       gamma=c(3.3,0.4,-2.8),
       tau=c(0.5,1,0),
       beta=c(0,0,2.6,0.3),
       g=rep(1,terms))
)
j.parameters <- c("g")
# Using R2jags
n.chns=3
n.it=13000
n.burn=4000
n.keep=n.it-n.burn
jags.fixed.sim <- jags(data=j.data, inits=j.inits,
parameters.to.save = j.parameters,
model.file=covbvs, working.directory=getwd(),
n.chains=n.chns, n.iter=n.it, n.burnin=n.burn,
n.thin=3)
j.sel<-as.mcmc(jags.fixed.sim)
n.out=floor(n.keep/3) #output rows divided by number thinned
v<-matrix(nrow=n.out,ncol=n.chns)
for (k in 1:n.chns){
for (h in 1:n.out){
v[h,k]<-paste(j.sel[[k]][h,2],j.sel[[k]][h,3],j.sel[[k]][h,4],
j.sel[[k]][h,5],j.sel[[k]][h,6],j.sel[[k]][h,7],j.sel[[k]][h,8],
j.sel[[k]][h,9],j.sel[[k]][h,10])
}
}
v.all<-rep(NA,n.out*n.chns)
for (i in 1:n.chns){
v.all[((i-1)*nrow(v)+1):(i*nrow(v))]<-v[,i]
}
v.out<-table(v.all)/length(v.all)
# Selection vectors with posterior probabilities
sort(v.out,decreasing=TRUE)
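The entries of v.out are labeled by the selection-vector strings themselves (e.g. "1 0 1 1 0 0 0 0 0"). A small helper — hypothetical, not part of the original analysis — can map such a string back to the terms it retains, assuming the g[1]–g[9] ordering used in the model above:

```r
# Map a selection-vector string (a name of v.out) back to covariate labels.
# Label order follows g[1]-g[9] in the JAGS model above.
decode_g <- function(gstr, labels) {
  bits <- as.numeric(strsplit(gstr, " ")[[1]])
  labels[bits == 1]
}

g_labels <- c("outcome:x1", "outcome:x2", "outcome:tx3",
              "prevalence:x1", "prevalence:x2",
              "sensitivity:x1", "sensitivity:x2",
              "specificity:x1", "specificity:x2")

decode_g("1 0 1 1 0 0 0 0 0", g_labels)
# returns "outcome:x1" "outcome:tx3" "prevalence:x1"
```

This makes the highest posterior probability configurations directly readable as lists of retained model terms.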
Footnotes
Disclosure statement
The authors report there are no competing interests to declare.
Data Availability Statement
Data for the ACCORD trial are available through the NHLBI Biologic Specimen and Data Repository Information Coordinating Center: https://biolincc.nhlbi.nih.gov/studies/accord/, and data for the seatbelt example are available in the cited work [21].
References
- [1]. ACCORD Study Group. Effects of intensive glucose lowering in type 2 diabetes. New England Journal of Medicine, 358(24):2545–2559, June 2008. doi: 10.1056/nejmoa0802743.
- [2]. ACCORD Study Group. Effects of intensive blood-pressure control in type 2 diabetes mellitus. New England Journal of Medicine, 362(17):1575–1585, Apr. 2010. doi: 10.1056/nejmoa1001286.
- [3]. ACCORD Study Group. Effects of combination lipid therapy in type 2 diabetes mellitus. New England Journal of Medicine, 362(17):1563–1574, Apr. 2010. doi: 10.1056/nejmoa1001282.
- [4]. ACCORD Study Group and ACCORD Eye Study Group. Effects of medical therapies on retinopathy progression in type 2 diabetes. New England Journal of Medicine, 363(3):233–244, July 2010. doi: 10.1056/nejmoa1001288.
- [5]. Akaike H. Information theory and an extension of the maximum likelihood principle. In Petrov BN and Csáki F, editors, 2nd International Symposium on Information Theory, pages 267–281. Akadémiai Kiadó, 1973.
- [6]. Arezzo MF, Guagnano G, and Vitale D. Estimating the size of undeclared work from partially misclassified survey data via the Expectation–Maximization algorithm. Journal of the Royal Statistical Society Series C: Applied Statistics, 73(3):816–834, Mar. 2024. doi: 10.1093/jrsssc/qlae013.
- [7]. Carlin B and Chib S. Bayesian model choice via Markov chain Monte Carlo methods. Journal of the Royal Statistical Society, Series B, 57(3):473–484, 1995.
- [8]. Carroll R, Ruppert D, Stefanski L, and Crainiceanu C. Measurement Error in Nonlinear Models: A Modern Perspective, volume 105 of Monographs on Statistics and Applied Probability. Chapman & Hall, Boca Raton, Florida, second edition, 2006.
- [9]. Chen L-P. De-noising boosting methods for variable selection and estimation subject to error-prone variables. Statistics and Computing, 33(2):38, Feb. 2023. doi: 10.1007/s11222-023-10209-3.
- [10]. Chen L-P. Variable selection and estimation for misclassified binary responses and multivariate error-prone predictors. Journal of Computational and Graphical Statistics, 33(2):407–420, 2024. doi: 10.1080/10618600.2023.2218428.
- [11]. Chen L-P and Hsu W-H. CHEMIST: an R package for causal inference with high-dimensional error-prone covariates and misclassified treatments. Japanese Journal of Statistics and Data Science, Sept. 2023. doi: 10.1007/s42081-023-00217-y.
- [12]. Cheng D, Branscum AJ, and Stamey JD. Accounting for response misclassification and covariate measurement error improves power and reduces bias in epidemiologic studies. Annals of Epidemiology, 20(7):562–567, 2010. doi: 10.1016/j.annepidem.2010.03.012.
- [13]. Dellaportas P, Forster J, and Ntzoufras I. On Bayesian model and variable selection using MCMC. Statistics and Computing, 12:27–36, 2002.
- [14]. George EI and McCulloch RE. Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88(423):881–889, 1993.
- [15]. Gerlach R and Eyre T. Bayesian model selection and averaging for discrete regression models. 2004.
- [16]. Gerlach R and Stamey J. Bayesian model selection for logistic regression with misclassified outcomes. Statistical Modelling, 7(3):255–273, 2007.
- [17]. Gerstein HC and Ambrosius WT. Diabetic retinopathy, its progression, and incident cardiovascular events in the ACCORD trial. Diabetes Care, 36:1266–1271, 2013.
- [18]. Green PJ. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82(4):711–732, Dec. 1995. doi: 10.1093/biomet/82.4.711.
- [19]. Gustafson P. Measurement Error and Misclassification in Statistics and Epidemiology: Impacts and Bayesian Adjustments. Chapman & Hall, Boca Raton, Florida, 2004.
- [20]. Heinze G, Wallisch C, and Dunkler D. Variable selection – a review and recommendations for the practicing statistician. Biometrical Journal, 60(3):431–449, 2018. doi: 10.1002/bimj.201700067.
- [21]. Hochberg Y. On the use of double sampling schemes in analyzing categorical data with misclassification errors. Journal of the American Statistical Association, 72(360):914–921, 1977.
- [22]. Liu J, Afful A, and Ma Y. Bias analysis for misclassification errors in both the response variable and covariate. The American Statistician, 76(4):353–362, 2022. doi: 10.1080/00031305.2022.2066725.
- [23]. Kullback S and Leibler R. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, 1951.
- [24]. Kuo L and Mallick B. Variable selection for regression models. Sankhyā: The Indian Journal of Statistics, Series B, 60(1):65–81, 1998.
- [25]. Mallows CL. Some comments on Cp. Technometrics, 15(4):661–675, Nov. 1973.
- [26]. Neuhaus J. Bias and efficiency loss due to misclassified responses in binary regression. Biometrika, 86(4):843–855, Dec. 1999. doi: 10.1093/biomet/86.4.843.
- [27]. Ntzoufras I. Gibbs variable selection using BUGS. Journal of Statistical Software, 7(7), 2002.
- [28]. Park T and Casella G. The Bayesian lasso. Journal of the American Statistical Association, 103(482):681–686, 2008. doi: 10.1198/016214508000000337.
- [29]. Powers S, Gerlach R, and Stamey J. Bayesian variable selection for Poisson regression with underreported responses. Computational Statistics & Data Analysis, 54(12):3289–3299, 2010. doi: 10.1016/j.csda.2010.04.003.
- [30]. Spiegelhalter DJ, Best NG, Carlin BP, and van der Linde A. Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society Series B: Statistical Methodology, 64(4):583–639, Oct. 2002. doi: 10.1111/1467-9868.00353.
