Combining information from two data sources with misreporting and incompleteness to assess hospice-use among cancer patients: a multiple imputation approach

Yulei He; Mary Beth Landrum; Alan M Zaslavsky

doi:10.1002/sim.6173

Author manuscript; available in PMC: 2015 Sep 20.

Published in final edited form as: Stat Med. 2014 May 7;33(21):3710–3724. doi: 10.1002/sim.6173

PMCID: PMC4125445 NIHMSID: NIHMS581454 PMID: 24804628

Abstract

Combining information from multiple data sources can enhance estimates of health-related measures by using one source to supply information that is lacking in another, assuming the former has accurate and complete data. However, little research has addressed methods for combining sources when each might be imperfect, for example, subject to measurement errors and/or missing data. In a multisite study of hospice-use by late-stage cancer patients, this variable was available from patients’ abstracted medical records, in which it may be considerably underreported because of incomplete acquisition of these records. Therefore, data for Medicare-eligible patients were supplemented with their Medicare claims, which contained information on hospice-use and may also be subject to underreporting, yet to a lesser degree. In addition, both sources suffered from missing data because of unit nonresponse from medical record abstraction and sample undercoverage for Medicare claims. We treat the true hospice-use status of these patients as a latent variable and propose to multiply impute it using information from both data sources, borrowing strength from each. We characterize the complete-data model as a product of an ‘outcome’ model for the probability of hospice-use and a ‘reporting’ model for the probability of underreporting from both sources, adjusting for other covariates. Assuming that the reports of hospice-use from both sources are missing at random and that underreporting in the two sources is conditionally independent, we develop a Bayesian multiple imputation algorithm and conduct multiple imputation analyses of patient hospice-use in demographic and clinical subgroups. The proposed approach yields more sensible results than alternative methods in our example. Our model is also related to dual system estimation in population censuses and dual exposure assessment in epidemiology.

Keywords: data augmentation, health services research, measurement error, model diagnostics, multilevel models

1. Introduction

Combining information from multiple data sources (e.g., surveys, health claims, medical records, and registries) can enhance estimates of health-related measures by using one source to supply information that is lacking in another. For example, health services researchers frequently use health claims data to supplement information available from disease registries such as the linked Surveillance, Epidemiology, and End Results (SEER)-Medicare data. Furthermore, data sources are typically subject to nonsampling errors including missing data due to nonresponse, noncoverage, measurement and/or response errors. If different sources have different limitations, combining information for the same set of variables reported from multiple sources might alleviate these errors and produce improved estimates of these variables.

Schenker and Raghunathan [1] described several examples of combining information from multiple surveys, including the race bridging project that predicted single-race reports for census data using multiple-race reports from a national survey [2]. In health services research, Yucel and Zaslavsky [3] corrected underreporting of cancer patients’ receipt of adjuvant chemotherapy in a statewide registry, using a validation sample from medical records data. He and Zaslavsky [4] extended this approach to multivariate outcomes. A common theme of this research is to combine information on key variable(s) from two sources, one of which is assumed to be accurate and complete (without reporting errors or missing data) for the variable(s) of interest in a validation subsample of the population; errors in the other source are corrected using a multiple imputation approach [5].

However, there is a lack of methods for studies in which both (or all) sources are subject to error. Our research is motivated by a multisite cohort study of care patterns for colorectal and lung cancer patients diagnosed between 2003 and 2005, the Cancer Care Outcomes Research and Surveillance Consortium (CanCORS), sponsored by the National Cancer Institute [6]. Patient data were collected from surveys, medical records, Medicare claims, and cancer registries, to provide a rich set of information on a wide variety of topics and issues in the care process. The availability of multiple data sources in CanCORS also allows combining information when necessary.

National guidelines recommend that physicians discuss end-of-life (EOL) care planning with patients who have incurable cancer and a life expectancy of less than 1 year. Taking advantage of the unique CanCORS data, several CanCORS studies describe patterns and quality of EOL care for patients diagnosed in late stages of lung and colorectal cancer (e.g., [7, 8]), including patients’ enrollment for hospice services prior to death, a key measure assessed and validated by previous literature (e.g., [9]). For CanCORS participants, this information was obtained from their medical records, which were abstracted at hospitals and physicians’ offices within 15 months of cancer diagnosis. However, because some medical records were not abstracted owing to patient nonconsent, provider noncooperation, or inaccessibility, hospice-use might be underreported or missing. In addition, medical records from some physicians providing cancer care (e.g., oncologists or radiologists) may miss hospice-use because hospice enrollment involves a change of providers responsible for care. The misreporting or missingness might also be correlated with the quality of the abstraction process. To address such concerns, Medicare enrollment and claims data for Medicare-eligible enrollees (typically of age ≥ 65 years) within a similar time frame, which are generally recognized as a reliable data source by health services researchers, were obtained as a supplement for analyses.

Table I crosstabulates hospice-use reports from the two data sources for patients who died within 15 months of diagnosis in the analytic sample. The off-diagonal counts (445 for Medicare claims YES/medical records NO and 54 for Medicare claims NO/medical records YES) show considerable inconsistency between reports from the two sources. We conjecture that both sources are subject to underreporting, whereas overreporting is relatively unusual and can be neglected. Furthermore, the much larger count in the cell ‘Medicare claims YES/medical records NO’ than in the cell ‘medical records YES/Medicare claims NO’ (i.e., 445 vs. 54) is consistent with the a priori expectation that both sources might underreport but that Medicare claims might be more reliable because they are required for payment, not abstracted for specific research purposes as in CanCORS. Finally, missing data occur for both sources: unit nonresponse from medical records because the abstraction was not implemented for all CanCORS participants and noncoverage from Medicare claims for patients under 65 years old.

Table I.

Hospice-use reports from medical records and Medicare claims, a subsample of CanCORS data.

Medical records	Medicare claims: Yes	Medicare claims: No	Medicare claims: Missing
Yes	395	54	260
No	445	617	646
Missing	136	116	358

Note: The subsample consists of 3027 CanCORS patients who died within 15 months of diagnosis (Section 3.1).

A natural analytic strategy is to treat hospice-use as a missing variable (Table II) and impute it, using information from both sources. Some ad hoc imputation procedures might include the ‘OR’ algorithm, which assigns a ‘YES’ for an individual if either source reports ‘YES’, and the ‘AND’ algorithm, which assigns a ‘YES’ only if both sources are ‘YES’. These procedures, however, lack rigorous statistical justifications and offer no method for imputing missing reports. They also ignore the possible associations between hospice-use and other covariates in the study.
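To make the limitation of these ad hoc rules concrete, the following minimal sketch applies them to the Table I counts. The function names and the choice to return `None` whenever a report is missing are our own illustrative assumptions, not part of the CanCORS analyses.

```python
# Table I counts keyed by (medical records report, Medicare claims report);
# 1 = yes, 0 = no, None = missing.
table_i = {
    (1, 1): 395, (1, 0): 54, (1, None): 260,
    (0, 1): 445, (0, 0): 617, (0, None): 646,
    (None, 1): 136, (None, 0): 116, (None, None): 358,
}

def or_rule(r1, r2):
    """'YES' if either source says yes; undefined (None) if a report is missing.

    Treating every case with a missing report as undefined is an assumption
    made here for illustration.
    """
    if r1 is None or r2 is None:
        return None
    return int(r1 == 1 or r2 == 1)

def and_rule(r1, r2):
    """'YES' only if both sources say yes; undefined if a report is missing."""
    if r1 is None or r2 is None:
        return None
    return int(r1 == 1 and r2 == 1)

def tally(rule):
    """Count patients assigned yes (1), no (0), or left undefined (None)."""
    out = {1: 0, 0: 0, None: 0}
    for (r1, r2), n in table_i.items():
        out[rule(r1, r2)] += n
    return out

or_counts = tally(or_rule)
and_counts = tally(and_rule)
print("OR :", or_counts)
print("AND:", and_counts)
```

Under this strict reading, about half of the 3027 patients receive no value from either rule, and neither rule uses the covariates.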

Table II.

Combining information from two data sources in a missing-data analysis framework.

True hospice-use status, Y_O	Medical records report, Y_R1	Medicare claims report, Y_R2	Covariates, X
?	1	1	…
?	1	0	…
?	0	?	…
?	0	0	…
?	0	1	…
?	?	1	…
?	?	0	…
?	…	…	…

Note: 1, yes (patient had hospice-use); 0, no; ‘?’, missing.

In this paper, we aim to develop a more principled imputation approach. However, missing data methods that handle partially classified contingency tables in the form of Table I (Chapter 13 of [10]) assume no misreports from the two sources. More closely related research on combining information from two sources assumes that one of them can be treated as a gold standard, while the other might be subject to misreporting or missing data [1, 3, 4]. Here, we extend previous research to account for misreports and missing data in both sources. In Section 2, we introduce the notation and modeling strategy. Section 3 presents the analysis of CanCORS data. Section 4 points out connections with related methods and discusses future research topics.

2. Method

Let Y_O be true hospice-use status (1, yes/0, no) and Y_R1 and Y_R2 be its reports from medical records and Medicare claims, respectively. As shown in Table II, Y_O is a latent variable (100% missing), and missing values (due to nonresponses for the data sources) can also occur for Y_Rl = (Y_Rl,obs, Y_Rl,mis), (l = 1, 2). We assume that the mechanisms leading to misreporting and missing responses from each source can be related to some covariates X. For simplicity, we assume that X has no measurement errors and is fully observed. We also treat the linked cases from both sources as a simple random sample from the combined population.

We further assume that missing Y_Rl’s are missing at random (MAR), meaning that the probability of a missing report can depend on X but not on Y_O. Let θ denote the parameters governing the process of misreporting and nonresponse. In our context, around 90% of the missing Medicare claims data are due to noncoverage of patients younger than 65 years old. This is largely ‘missing by design’ and satisfies MAR if we include age as a predictor in X in the imputation model for hospice-use. The rest of the missing claims data resulted from nonmatches in the data linking process (Section 3.1). On the other hand, the missing cases from the medical records are mainly caused by subject nonconsent and inaccessibility of some records. If the probabilities of unit nonresponse in medical records and of nonmatches in Medicare claims are largely associated with demographic and clinical variables, which are included in X in the modeling process, then the MAR assumption is more plausible.

Under MAR, valid inferences about θ can be made using the observed-data likelihood P(Y_R1,obs, Y_R2,obs|X, θ) = ∫ P(Y_R1, Y_R2, Y_O|X, θ) dY_R1,mis dY_R2,mis dY_O, where P(Y_R1, Y_R2, Y_O|X, θ) is the complete-data model.

We further consider the following decomposition of the complete-data model

P(Y_{R1}, Y_{R2}, Y_{O} | X, θ) = P_{O}(Y_{O} | X, θ_{O}) P_{R}(Y_{R1}, Y_{R2} | Y_{O}, X, θ_{R})

Following [3], we refer to P_O(Y_O|X, θ_O) as the ‘outcome’ model. It relates hospice-use to covariate X, with regression parameters θ_O that might be of subject-matter interest.

The reporting model (or the measurement error model), P_R(Y_R1, Y_R2|Y_O, X, θ_R), characterizes reporting in the two sources given true hospice-use status, covariates, and parameters θ_R. Our reporting model rests on two assumptions:

Assumption 1

Reporting in the two sources is independent conditional on true status and observed covariates:

P_{R}(Y_{R1}, Y_{R2} | Y_{O}, X, θ_{R}) = P_{R1}(Y_{R1} | Y_{O}, X, θ_{R1}) P_{R2}(Y_{R2} | Y_{O}, X, θ_{R2}).

Assumption 2

Both sources may be subject to underreporting but not overreporting: P_Rl(Y_Rl = 1|Y_O = 0, X, θ_Rl) = 0.

Assumption 1 holds strictly only if X contains all factors that are predictive of misreporting. However, with enough scientifically relevant covariates, the residual correlation between the two reporting systems might be minimal. Section 1 presents arguments for the plausibility of Assumption 2 in our application. Generalizations of these assumptions are considered in Section 4.
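As a worked illustration of what the two assumptions imply, the observed report pairs can be written out for a fixed covariate value x. The shorthand π, ρ_1, and ρ_2 below is introduced here only for illustration and is not the paper's notation; the expressions follow from the decomposition of the complete-data model by summing over Y_O.

```latex
% Shorthand (assumed for illustration), all conditional on X = x:
%   \pi    = P(Y_O = 1 \mid x)
%   \rho_l = P(Y_{R_l} = 1 \mid Y_O = 1, x), \quad l = 1, 2
\begin{align*}
P(Y_{R_1}=1, Y_{R_2}=1 \mid x) &= \pi \rho_1 \rho_2, \\
P(Y_{R_1}=1, Y_{R_2}=0 \mid x) &= \pi \rho_1 (1-\rho_2), \\
P(Y_{R_1}=0, Y_{R_2}=1 \mid x) &= \pi (1-\rho_1) \rho_2, \\
P(Y_{R_1}=0, Y_{R_2}=0 \mid x) &= (1-\pi) + \pi (1-\rho_1)(1-\rho_2), \\
P(Y_O=1 \mid Y_{R_1}=0, Y_{R_2}=0, x)
  &= \frac{\pi (1-\rho_1)(1-\rho_2)}{(1-\pi) + \pi (1-\rho_1)(1-\rho_2)}.
\end{align*}
```

The last line shows why a patient with two negative reports can still have a nonzero probability of true hospice-use, whereas any positive report implies Y_O = 1 under Assumption 2.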

We propose to multiply impute Y_O to facilitate statistical analyses involving the hospice-use variable. Although originally proposed as a tool for statistical agencies to handle nonresponse in large-sample public-use household surveys, the multiple imputation framework has been adapted for other statistical contexts over the past 30 years [11]. Relevant examples include latent variables [12] and measurement error problems [13]. See Section 3.3 for more details related to our example.

3. Application

3.1. Data background

The CanCORS Consortium is a collaboration of seven teams of investigators across the nation to evaluate prospectively the quality of cancer care for patients with lung or colorectal cancer. Approximately 10,000 patients newly diagnosed during 2003–2005 were identified from five geographic sites (Northern California, Los Angeles County, central and eastern North Carolina, Iowa, and Alabama), five large health management organizations (Group Health Cooperative, Harvard Pilgrim Healthcare, Henry Ford Health System, Kaiser Permanente Health Insurance, and Kaiser Permanente North West), and 10 Veterans Health Administration hospitals, capturing variation in cancer care across geographical regions and health care systems.

The CanCORS investigators abstracted detailed clinical data from medical records, including cancer-related diagnostic and staging procedures, surgery, chemotherapy, radiation therapy, and EOL care measures [6]. Additional predictors of clinical outcomes included tumor stage, comorbid illnesses, and relevant test results. Medical records for some patients were unavailable because of nonconsent or inaccessibility. As an additional data source, Medicare enrollment and claims data for Medicare enrollees among CanCORS participants (excluding Veterans Health Administration patients) were linked for 89% of CanCORS enrollees aged 65 years and older within the study period. We used inpatient, outpatient, provider, and hospice files to code the EOL care quality indicators including hospice-use.

The analytic sample for the pattern of hospice-use includes 3027 CanCORS patients who died within 15 months of diagnosis (Table I). Among them, 358 patients missed hospice reports from both sources. Under the MAR assumption for reported data, these cases contain no information on the parameters of the outcome and reporting models proposed in Section 2. Thus, the analysis focuses on the remaining 2669 cases with observed data from at least one source. Table III describes patient demographic and clinical characteristics, which might have been associated with the use of EOL care among terminally ill patients as suggested by previous literature [9]. These variables constitute the covariates X in both the outcome and reporting models. They were fully observed and assumed to be without measurement errors. Note that in the reporting model for Medicare claims P_R2(Y_R2|Y_O, X, θ_R2), there exists no less-than-65 age group due to noncoverage, and therefore we could only use the remaining age groups in the covariates.

Table III.

Distribution of covariates from medical records and Medicare claims, a subsample of CanCORS data.

Variables	Levels (%)
Cancer type	Lung (78%), CRC (22%)
Gender	Male (56%), Female (44%)
Marital status	Married (56%), Unmarried (44%)
Age	<65 (20%), 65–69 (22%), 70–74 (18%), 75–79 (16%), 80+(24%)
Education	<High school (25%), high school (57%), college (18%)
Income	<20K (34%), 20–39K (36%), 40–59K (17%), >60K (13%)
Race	NH-White (74%), NH-Black (11%), Hispanic (6%), NH-Asian (4%), Others (5%)
Cancer stage	I (9%), II (12%), III (28%), IV (46%), unknown (5%)
Comorbidity	Myocardial infarction (21%)
	Chronic heart failure (15%)
	Stroke (18%)
	Lung disease (35%)
	Diabetes (25%)
	Depression (23%)
Time between diagnosis and death	Mean = 123 days, standard deviation = 132 days
Study site	GHC (5.1%), HPHC (.4%), HFHS (2.0%), KPHI (1.6%), KPNW (5.8%), NCCC (22%), UAB (13%), UCLA (22%), IOWA (23%), UNC (5.2%)

Note: The subsample consists of 2669 CanCORS patients who died within 15 months of diagnosis and had at least one observed hospice-use record from two sources (Section 3.1).

CRC, colorectal cancer; GHC, Group Health Cooperative; HPHC, Harvard Pilgrim Healthcare; HFHS, Henry Ford Health System; KPHI, Kaiser Permanente Health Insurance; KPNW, Kaiser Permanente North West; NCCC, Northern California Cancer Center; UAB, University of Alabama; UCLA, University of California at Los Angeles; IOWA, University of Iowa; UNC, University of North Carolina.

3.2. Exploratory analyses

We conduct some exploratory analyses initially to establish the basis for model-based imputation. Under our assumptions, hospice-use status might be 1 even if reports from both sources are 0. This phenomenon can be connected with the dual system estimation (DSE) [14] of population counts and other capture–recapture problems. The classical DSE approach conceptualizes that each person in a population is either in or not in the two lists. Analogously in our context, a patient’s hospice-use status is either reported or not in the two data sources, medical records, and Medicare claims data.

We first treat Table I as a simple DSE example, ignoring the cells containing missing values. Note that the cell with ‘NO’ from both sources (617) can be viewed as the count of cases excluded from both lists in the DSE framework. The reporting completeness (inclusion probability or capturing rate) for data source 1 (2) can be simply calculated as E(Y_R1|Y_R2 = 1) (E(Y_R2|Y_R1 = 1)). For example, in Table I, the 449 (= 395 + 54) cases that were reported as ‘YES’ in the medical records data are true positives under Assumption 2. Yet only 395 of them were reported as ‘YES’ in the Medicare claims data. This suggests that the reporting completeness rate for the Medicare claims data is 395/(395 + 54) ≈ 88%. Similarly, the capturing rate for medical records is estimated as 395/(395 + 445) ≈ 47%. In addition, the total number of ‘YES’ on hospice-use in the sample is estimated as (395 + 54)(395 + 445)/395 ≈ 955. That is, among the 617 cases with ‘NO’ reports from both sources, 445 × 54/395 ≈ 61 cases should be imputed as ‘YES’ in hospice-use. If we further consider cases with missing reports from either source, an expectation–maximization (EM) algorithm (details not shown) can be developed to fit a model with three parameters, P(Y_O = 1), P(Y_R1 = 1|Y_O = 1), and P(Y_R2 = 1|Y_O = 1), the simplest form of Model (1) (Section 3.3.1).
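The arithmetic above, and one possible form of the EM algorithm whose details are not shown, can be sketched as follows. The EM updates are our own reconstruction under Assumptions 1 and 2 and MAR, with no covariates; the function and variable names are hypothetical, and this is not the authors' code.

```python
# Complete cells of Table I: (medical records, Medicare claims) -> count.
n11, n10, n01, n00 = 395, 54, 445, 617

# Closed-form dual system estimates from the complete cells.
rate_claims = n11 / (n11 + n10)     # E(Y_R2 | Y_R1 = 1)
rate_records = n11 / (n11 + n01)    # E(Y_R1 | Y_R2 = 1)
total_yes = (n11 + n10) * (n11 + n01) / n11
missed_yes = n10 * n01 / n11        # estimated 'YES' cases in the NO/NO cell

def em(cells, n_iter=500):
    """EM sketch for pi = P(Y_O=1), rho1 = P(Y_R1=1|Y_O=1), rho2 = P(Y_R2=1|Y_O=1).

    cells maps (r1, r2) -> count, with r in {1, 0, None}; None means missing.
    """
    pi, rho1, rho2 = 0.5, 0.5, 0.5
    total = sum(cells.values())
    for _ in range(n_iter):
        sum_w = num1 = den1 = num2 = den2 = 0.0
        for (r1, r2), n in cells.items():
            if r1 == 1 or r2 == 1:
                w = 1.0  # no overreporting: any YES report is a true YES
            else:
                # E-step: P(Y_O = 1 | every observed report is NO)
                p_hidden = pi
                if r1 == 0:
                    p_hidden *= 1 - rho1
                if r2 == 0:
                    p_hidden *= 1 - rho2
                w = p_hidden / (p_hidden + 1 - pi)
            sum_w += n * w
            if r1 is not None:      # source 1 informs rho1 only when observed
                den1 += n * w
                num1 += n * (r1 == 1)
            if r2 is not None:
                den2 += n * w
                num2 += n * (r2 == 1)
        # M-step: weighted proportions
        pi, rho1, rho2 = sum_w / total, num1 / den1, num2 / den2
    return pi, rho1, rho2

complete = {(1, 1): n11, (1, 0): n10, (0, 1): n01, (0, 0): n00}
partial = {(1, None): 260, (0, None): 646, (None, 1): 136, (None, 0): 116}

pi_c, rho1_c, rho2_c = em(complete)                 # complete cells only
pi_m, rho1_m, rho2_m = em({**complete, **partial})  # adding one-source cases
print(rate_records, rate_claims, total_yes, missed_yes)
print(pi_m, rho1_m, rho2_m)
```

On the complete cells alone the three-parameter model is saturated, so the EM fixed point coincides with the closed-form DSE quantities; this gives a simple check of the updates before the partially observed cells are added.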

The classic DSE assumes independent reporting between the two data sources, with reporting probabilities that are homogeneous across individuals. A nonhomogeneous population may be divided into several homogeneous post-strata based on demographic variables, or the heterogeneous inclusion probability might be related to covariates using logistic regression models [15]. Our modeling approach (Section 2) is consistent with the latter strategy. However, because some sites have perfect reporting completeness or sparse data, directly including study site as a fixed-effect covariate in the reporting models would entail the data separation issue: a linear combination of the predictors perfectly predicts the binary outcome, making estimates for the associated parameters unidentifiable [16]. Table AI (Appendix) describes hospice reports stratified by site, showing 100% sample reporting completeness for Medicare claims data from sites HPHC, KPHI, UAB, and UNC and for medical records from site HPHC. On the other hand, removing site as a predictor is not desirable because the heterogeneity of reporting completeness across sites might not be fully attributable to other patient-level characteristics. Use of hospice (the outcome) may also vary across sites, possibly revealing interesting geographical and organizational variations among health care providers. We instead treat site as a random effect, making site effects estimable. At sites with a perfect unadjusted reporting rate, the model-based estimate is shrunk toward the population average adjusted for other covariates. In addition, the random effects model treats the included sites as a random sample from a population of potential data collection sites, making the inferences more generalizable.

3.3. Model-based imputation

3.3.1. Model specification

We consider Bayesian random effects probit models [17] for both the outcome and reporting processes. Let i = 1, …, 10 index sites and j = 1, …, n_i index patients within site i. The complete-data model (1) is

Φ^{-1}(Pr(Y_{Oij} = 1 | X_{ij})) = Z_{Oij} = X_{ij}^{T} β_{O} + γ_{Oi}    (1.1)

Y_{R1ij} = Y_{R2ij} = 0 if Y_{Oij} = 0    (1.2)

Φ^{-1}(Pr(Y_{R1ij} = 1 | Y_{Oij} = 1, X_{ij})) = Z_{R1ij} = X_{ij}^{T} β_{R1} + γ_{R1i}    (1.3)

Φ^{-1}(Pr(Y_{R2ij} = 1 | Y_{Oij} = 1, X_{ij})) = Z_{R2ij} = X_{ij}^{T} β_{R2} + γ_{R2i}    (1.4)

where Z_Oij, Z_R1ij, and Z_R2ij are the normal latent variables; β_O, β_R1, and β_R2 are the fixed-effects parameters; and γ_Oi, γ_R1i, and γ_R2i are the random site effects for the outcome and reporting processes, respectively, with γ_Oi ~ N(0, σ_O²), γ_R1i ~ N(0, σ_R1²), and γ_R2i ~ N(0, σ_R2²) independently across sites. Following our reasoning in Section 2, Equation (1.1) specifies the outcome model, Equation (1.2) imposes the underreporting assumption, and Equations (1.3) and (1.4) specify the reporting models for the medical records data and Medicare claims data, respectively.
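One way to read Model (1) is as a data-generating recipe. The sketch below simulates one site with a single scalar covariate plus an intercept; the coefficient and standard deviation values are hypothetical and chosen only for illustration, not estimates from CanCORS.

```python
import random
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal CDF (inverse of the probit link)

def linear_predictor(beta, gamma, x):
    """Intercept + slope * x + site random effect."""
    return beta[0] + beta[1] * x + gamma

def simulate_site(n_patients, beta_o, beta_r1, beta_r2, sd_o, sd_r1, sd_r2, rng):
    """Simulate (x, y_o, y_r1, y_r2) for one site under Model (1)."""
    # Independent random site effects for the outcome and two reporting models.
    g_o, g_r1, g_r2 = (rng.gauss(0, s) for s in (sd_o, sd_r1, sd_r2))
    rows = []
    for _ in range(n_patients):
        x = rng.gauss(0, 1)
        y_o = int(rng.random() < PHI(linear_predictor(beta_o, g_o, x)))        # (1.1)
        if y_o == 0:
            y_r1 = y_r2 = 0                                                    # (1.2)
        else:
            y_r1 = int(rng.random() < PHI(linear_predictor(beta_r1, g_r1, x)))  # (1.3)
            y_r2 = int(rng.random() < PHI(linear_predictor(beta_r2, g_r2, x)))  # (1.4)
        rows.append((x, y_o, y_r1, y_r2))
    return rows

# Hypothetical parameter values, for illustration only.
rows = simulate_site(200, (0.2, 0.5), (-0.1, 0.0), (1.0, 0.0),
                     0.3, 0.7, 0.2, random.Random(1))
print(sum(r[1] for r in rows), sum(r[2] for r in rows), sum(r[3] for r in rows))
```

By construction a simulated report can never be positive when the true status is negative, which is the content of Equation (1.2).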

We consider vague priors for parameters given that little is known about the mechanism governing the outcome and reporting processes. We impose flat priors for the β’s (p(β) ∝ 1). For the between-site variances σ_O², σ_R1², and σ_R2², we lacked solid information and field expertise on how study sites might vary in hospice-use and reporting completeness and therefore used noninformative or weakly informative priors [18]. Despite the popular use of inverse gamma priors (IG(a, a)) for σ² because of their conjugacy [19], recent literature (e.g., [18, 20]) has shown the sensitivity of the corresponding posterior inferences to the choice of a, especially with a small number of clusters. We confirmed this in our own application: under inverse gamma priors with a = 0.001, 0.01, 0.1, 1, the posterior estimate of σ² increases considerably as a increases. Gelman [18] recommended using a noninformative uniform prior density for σ, that is, p(σ) ∝ 1, which implies p(σ²) ∝ (σ²)^(−1/2). If the number of clusters is small, say below five, a weakly informative half-t prior can be used to avoid extremely large posterior estimates. Note that our data had 10 sites, offering a reasonable number of clusters for applying the uniform prior. Table V lists posterior estimates of σ_O², σ_R1², and σ_R2². We also implemented the half-t priors, and results were similar.
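The prior implied for the variance by a uniform prior on the standard deviation follows from a standard change of variables, written out here as a short check (the symbol v is shorthand introduced for this derivation only):

```latex
% Change of variables from \sigma to v = \sigma^2, so that \sigma = v^{1/2}.
\begin{align*}
p(v) &= p(\sigma)\,\left|\frac{d\sigma}{dv}\right|
      \;\propto\; 1 \cdot \tfrac{1}{2}\, v^{-1/2}
      \;\propto\; (\sigma^2)^{-1/2}.
\end{align*}
```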

Table V.

Coefficient estimates from outcome and reporting models using the medical records and Medicare claims, a subsample of CanCORS data.

Covariates	Outcome model	Reporting model for medical records	Reporting model for Medicare claims
Colorectal cancer	−.58^*	.16	−.41
65–69 years	.45^*	−.39^*	NA
70–74 years	.31	−.23	.00
75–79 years	.35	−.25	−.11
80 years or older	.54^*	−.29	.10
Female	.09	−.09	.31
Black	−.08	−.18	−.01
Hispanic	.30	.03	−.70^*
Asian	.01	.03	−.23
Other race	−.28	.19	−.40
Less than high school	.17	.11	−.08
High school	.04	.22	−.23
Less than 20k	.08	−.21	−.02
20–39k	.09	−.28	−.01
40–59k	.19	−.18	.16
Married	−.10	.03	.15
Stage 2	.12	−.43	−.42
Stage 3	.40^*	−.36	−.07
Stage 4	.81^*	−.36	−.31
Stage missing	.49^*	−.61^*	.03
MI	−.05	−.07	−.18
CHF	−.07	.09	.54^*
Stroke	.09	.03	−.23
Lung disease	−.09	−.09	.16
Diabetes	−.02	.06	.37^*
Depression	.22^*	−.12	−.11
Days till death	.30^*	−.06	.24^*
Site random effects variance	.10	.48	.05

Note: The subsample consists of 2669 CanCORS patients who died within 15 months of diagnosis and had at least one hospice-use record from two sources. Outcome model: Equation (1.1); reporting model for medical records: Equation (1.3); reporting model for Medicare claims: Equation (1.4). Posterior medians for β_O and the β_R’s are based on 10⁵ iterations of the Gibbs chain. Estimates with 95% CIs (not shown) excluding 0 are highlighted with *. The reference categories are lung cancer, younger than 65 years old for the outcome model and the reporting model for medical records (younger than 70 years old for the reporting model for Medicare claims), White, college, more than 60K, and stage I. The variable ‘Days till death’ is included in the model as a standardized Z-score.

3.3.2. Imputation algorithm

We implemented a data augmentation (DA) algorithm [21] to draw model parameters and impute missing values. The main steps are sketched here, and details on the conditional distributions appear in the Appendix. Some sample R (http://www.r-project.org) code is also attached.

Draw latent variables (Z_Oij, Z_R1ij, Z_R2ij) of Model (1) from truncated normal distributions.
Draw fixed-effects coefficients β_O and β_R from multivariate normal distributions.
Draw random effects {γ_Oi, γ_Ri} from independent normal distributions for each i.
Draw missing values of Y_O from independent Bernoulli distributions whose probabilities are functions of the standard normal cumulative distribution function. Note that ∫ P_Rl(Y_Rl|Y_O = 1, X, θ_Rl)dY_Rl,mis = Π P(Y_Rl,obs|Y_O = 1, X, θ_Rl), l = 1, 2. This implies that, conditional on Y_O = 1 and under MAR for missing Y_Rl, likelihood inference on θ_Rl does not involve the missing reports. Therefore, it is unnecessary to impute missing Y_Rl in the algorithm.
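The last step above can be sketched as follows. This is a minimal illustration derived from Equations (1.1)–(1.4) and Assumptions 1 and 2, not the R code referred to in the text; the function names are hypothetical, and the arguments `z_o`, `z_r1`, and `z_r2` stand for the current values of the linear predictors X^T β + γ.

```python
import random
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal CDF

def prob_true_yes(z_o, z_r1, z_r2, r1, r2):
    """P(Y_O = 1 | observed reports, current linear predictors).

    r1, r2 are 1 (yes), 0 (no), or None (missing). A missing report
    contributes no factor, which is why it never needs to be imputed.
    """
    if r1 == 1 or r2 == 1:
        return 1.0  # no overreporting: a positive report implies true use
    p_yes = PHI(z_o)
    if r1 == 0:
        p_yes *= 1 - PHI(z_r1)  # true user missed by medical records
    if r2 == 0:
        p_yes *= 1 - PHI(z_r2)  # true user missed by Medicare claims
    p_no = 1 - PHI(z_o)         # true nonusers always report 0
    return p_yes / (p_yes + p_no)

def draw_true_status(z_o, z_r1, z_r2, r1, r2, rng=random):
    """One Bernoulli draw of Y_O for a single patient."""
    return int(rng.random() < prob_true_yes(z_o, z_r1, z_r2, r1, r2))

print(prob_true_yes(0.0, 0.0, 0.0, 0, 0))     # both reports negative
print(prob_true_yes(0.0, 0.0, 0.0, 0, None))  # claims report missing
```

For a patient with no Medicare coverage, `z_r2` is never used because `r2` is `None`, so any placeholder value can be passed.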

We diagnosed the convergence of the DA algorithm using statistics developed by Gelman and Rubin [22] and concluded that the Gibbs chain converged after 10⁵ iterations. The posterior inferences and multiple imputation analyses are based on running the chain for another 10⁵ iterations.

3.3.3. Model diagnostics

We used the posterior predictive checking method [23] for the model diagnostics, comparing the observed reports Y_Rl,obs with distributions of their replicates Y_Rl,obs^rep under Model (1). If Q is a diagnostic statistic (or the discrepancy function as defined in [23]), then the posterior predictive p-value for assessing the model fit in terms of Q is calculated as P(Q(Y_Rl,obs, X) > Q(Y_Rl,obs^rep, X)). An extreme p-value (close to 0 or 1) would imply some model misfit.

A natural choice of Q is the average of reports, Q(Y_Rl) = Ȳ_Rl, the mean rate of hospice-use estimated from each data source or within certain strata defined by covariates (e.g., for all female patients or patients in certain study sites). These diagnostics (not shown) do not reveal any apparent misfit. We also check the model fit for the joint distribution of Y_R1 and Y_R2. Going back to the DSE framework (Section 3.2), we use unadjusted estimates of the capturing rate for both sources as the discrepancy function (i.e., Q(Y_R1, Y_R2) = E(Y_R1|Y_R2 = 1) and E(Y_R2|Y_R1 = 1)). Here, the term ‘unadjusted’ merely means that they are assessed in a straightforward way as opposed to fitting Model (1) with an extensive set of covariates. These rates can be calculated across the whole sample (e.g., Section 3.2) or within covariate groups (e.g., for each site). In addition, we used the sample variance of these capturing rates as a discrepancy measure for heterogeneity.
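The two ingredients of this check, a discrepancy statistic and the resulting posterior predictive p-value, can be sketched as below. The function names and the toy inputs are hypothetical; in practice the replicated statistics would be computed from reports simulated under Model (1) at each retained posterior draw.

```python
def capture_rate(target, other):
    """Unadjusted E(target = 1 | other = 1), using cases with both reports observed."""
    hits = [t for t, o in zip(target, other)
            if t is not None and o is not None and o == 1]
    return sum(hits) / len(hits)

def posterior_predictive_p(q_obs, q_reps):
    """Proportion of replicates whose statistic is below the observed statistic."""
    return sum(q_obs > q for q in q_reps) / len(q_reps)

# Toy inputs, for illustration only (1 = yes, 0 = no, None = missing).
records = [1, 0, 1, None, 0, 1]
claims = [1, 1, 0, 1, 1, 1]
q_obs = capture_rate(records, claims)  # capturing rate of medical records
ppp = posterior_predictive_p(q_obs, [0.2, 0.4, 0.6, 0.8])
print(q_obs, ppp)
```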

Table IV shows the observed capturing rates, medians from their posterior predictions, and posterior predictive p-values for each study site. For example, in the GHC site, the capturing rate estimate is 23/(23 + 38) ≈ .38 for medical records and 23/(23 + 2) ≈ .92 for Medicare claims. The corresponding numerators and denominators can be found in Table AI. The posterior predictive p-values are mostly between 0.1 and 0.9, showing a good fit for the model. Correspondingly, the (medians of) predictions are in general not far from the observed statistics. For the sites with perfect observed capturing rates (e.g., UAB and UNC from Medicare claims data), the predictions tend to be slightly lower, as expected from the shrinkage-to-the-mean effect under the random site effects model. The sample variance of the site-specific capturing rates (shown in the line ‘Var’ in Table IV) is also well predicted by the model.

Table IV.

Posterior predictive checking results from the medical records and Medicare claims, a subsample of CanCORS data.

	Medical records			Medicare claims

Study site	Observed capturing rate	Median from predictions	ppp	Observed capturing rate	Median from predictions	ppp
GHC	.38	.31	.78	.92	.89	.64
HPHC	1	1	.41	1	1	.16
HFHS	.78	.67	.73	.70	.83	.18
KPHI	.20	.38	.48	1	1	.20
KPNW	.67	.67	.48	.85	.85	.51
NCCC	.25	.26	.43	.78	.83	.25
UAB	.20	.22	.30	1	.89	.92
UCLA	.44	.41	.73	.87	.83	.79
IOWA	.73	.71	.66	.90	.88	.72
UNC	.22	.23	.46	1	1	.44
Var	.08	.07	.21	.01	.01	.54

Note: The subsample consists of 2669 CanCORS patients who died within 15 months of diagnosis and had at least one observed hospice-use record from two sources. The posterior predictive checks are based on 10⁵ replicates. ppp, posterior predictive p-values. For example, in the GHC site, the capturing rate is 23/(23 + 38) ≈ .38 for medical records and 23/(23 + 2) = .92 for Medicare claims. The corresponding numerators and denominators can be found in Table AI.

3.4. Analytic results

3.4.1. Model estimates

Table V shows the parameter estimates (posterior medians) of β_O and β_R from the outcome and reporting models (Equations (1.1)–(1.4)). Estimates with 95% CIs (not shown) excluding 0 are highlighted with *. Results from the outcome model (β_O from Equation (1.1)) suggest that hospice was used less often by CRC patients than by lung cancer patients, more often in some older groups (65–69 years or 80 years and older) than among patients younger than 65, more often among patients with late (stage 3 or 4) or missing stage, and more often among patients with depression symptoms. Patients who lived longer after diagnosis were more likely to use hospice services. Results from the reporting models (β_R1 and β_R2 from Equations (1.3) and (1.4)) suggest that medical record data were less complete for hospice-use among patients aged between 65 and 69 years and much less complete among patients with missing stage information. Medicare claims data were less complete for hospice-use among Hispanic patients and more complete among patients with heart failure or diabetes or those who lived longer.

As can be seen from the ‘Site random effects variance’ row in Table V, there was considerable between-site variation in hospice-use and in reporting from medical records data but less so in reporting from Medicare claims. Under Model (1), the posterior median rates for hospice-use range from around 43% to 100% across sites. Those for the reporting completeness range from around 21% to 78% for medical records and from around 77% to 100% for Medicare claims. The much smaller between-site heterogeneity for coverage of Medicare claims than for medical records abstraction is to be expected, given that the former is collected by the nationally uniform system that pays for the services, while the latter is collected by separate staff groups within each CanCORS site.

The posterior median number of imputed ‘YES’ cases when both sources have negative reports is around 88, substantially more than the estimate (61, Section 3.2) from the simple DSE approach. In addition, the corresponding posterior medians when the reports from medical records (Medicare claims) are negative and Medicare claims (medical records) are missing are around 272 (21). Neither quantity, however, can be estimated using the simple DSE approach, which does not account for the missing reports. Therefore, methods that do not capture the heterogeneous reporting probabilities associated with covariates or do not account for missing reports might underestimate the number of false negatives. On the other hand, the posterior medians of the average reporting completeness rate for the medical records and Medicare claims are 46.8% and 85.6%, respectively. They are close to the simple DSE estimates (44.0% and 88.0%). Therefore, the simple and model-based approaches agree that reporting completeness is much lower for medical records than for Medicare claims.
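As an illustration of the simple DSE calculation referred to above, the sketch below applies the standard dual-system (Lincoln-Petersen type) formula to counts pooled over the ten site tables in Table AI. Pooling across sites is an assumption about how the simple estimate was formed, and the function name `dse_missed` is hypothetical.

```python
# Counts pooled over the ten site tables in Table AI
# (patients with reports observed from both sources).
n_both = 395          # 'yes' in both sources
n_records_only = 54   # 'yes' in medical records, 'no' in Medicare claims
n_claims_only = 445   # 'no' in medical records, 'yes' in Medicare claims


def dse_missed(n_both, n_first_only, n_second_only):
    """Estimated number of true 'yes' cases missed by both sources,
    assuming independent and homogeneous capture by the two sources."""
    return n_first_only * n_second_only / n_both


missed = dse_missed(n_both, n_records_only, n_claims_only)
claims_rate = n_both / (n_both + n_records_only)
print(round(missed), round(claims_rate, 2))
# 61 0.88
```

Under this pooling assumption the calculation gives about 61 missed cases and a Medicare claims completeness of about 88%, in line with the simple DSE figures quoted above. It ignores covariates and patients with a missing report, which is the limitation the model-based approach addresses.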

Although the primary analytic goal is to impute the hospice-use status for substantive analyses (Section 3.4.2), the model outputs in Table V might inform us about certain factors correlated with misreporting. Therefore, targeted efforts might be planned in the process of data collection and dissemination to improve the accuracy and reliability of large data systems (e.g., reducing the variation across multiple sites with good coordination).

3.4.2. Post-imputation analyses

We can use the multiply imputed data for substantive analyses involving the hospice-use variable. Because of the 100% missingness rate, we use 50 imputed datasets, one taken every 2000 iterations of the Gibbs chain, to minimize potential autocorrelations and ensure a high relative efficiency [5] for multiple imputation inferences. Model coefficients are estimated from each of the 50 imputed datasets and combined using rules for multiple imputation inference for scalar estimands [5]. Thus, point estimates are means of the estimates from the 50 imputations, and the standard errors combine between-imputation variance because of missing data with sampling (within-imputation) variability conditional on imputed data.
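The combining rules for a scalar estimand can be sketched as follows. This is an illustrative implementation of the standard rules in [5], not the authors' code; the function names and the three example estimates are hypothetical.

```python
import math


def combine_mi(estimates, variances):
    """Combine M point estimates and their squared standard errors
    using the multiple imputation combining rules for a scalar estimand."""
    m = len(estimates)
    q_bar = sum(estimates) / m                               # pooled point estimate
    w = sum(variances) / m                                   # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)   # between-imputation variance
    t = w + (1 + 1 / m) * b                                  # total variance
    return q_bar, math.sqrt(t)


def relative_efficiency(gamma, m):
    """Efficiency (variance scale) of M imputations relative to infinitely many,
    for a fraction of missing information gamma."""
    return 1 / (1 + gamma / m)


# Hypothetical coefficient estimates and variances from three imputed datasets
est, se = combine_mi([0.50, 0.60, 0.55], [0.04, 0.05, 0.045])
print(round(est, 3), round(se, 3))

# Even if the fraction of missing information were as large as 1, 50 imputations
# would retain roughly 98% efficiency.
print(round(relative_efficiency(1.0, 50), 3))
```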

We consider two analyses. One is to describe the hospice-use rate for the analytic sample. The other is a logistic regression model predicting hospice-use for lung cancer patients. These patients are more likely to be diagnosed at late stage and to die more quickly after diagnosis. Therefore, hospice-use might be a more relevant issue for them and their family members. We exclude from the predictors the time from diagnosis till death and study sites to make the inferences more predictive for patients seeking EOL care and therefore more generalizable.

For comparative purposes, we consider several alternative missing data strategies. The ‘OR’ algorithm assigns true status ‘YES’ if either source reports ‘YES’, ‘NO’ if both sources report ‘NO’, and ‘MISSING’ if one source reports ‘NO’ and the other reports ‘MISSING’. The ‘AND’ algorithm assigns true status ‘YES’ if both sources report ‘YES’, ‘NO’ if neither source reports ‘MISSING’ and at least one source reports ‘NO’, and ‘MISSING’ if at least one source reports ‘MISSING’. We also analyze the data using either source alone.
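The two ad hoc rules can be written compactly as below. This is a sketch only; the encoding (`True` for ‘YES’, `False` for ‘NO’, `None` for ‘MISSING’) and the function names are assumptions made for illustration.

```python
from typing import Optional


def or_rule(records: Optional[bool], claims: Optional[bool]) -> Optional[bool]:
    """'OR' algorithm: 'YES' if either source says 'YES', 'NO' if both say 'NO',
    otherwise 'MISSING' (one 'NO' paired with one 'MISSING')."""
    if records is True or claims is True:
        return True
    if records is False and claims is False:
        return False
    return None


def and_rule(records: Optional[bool], claims: Optional[bool]) -> Optional[bool]:
    """'AND' algorithm: 'MISSING' if either source is 'MISSING',
    'YES' only if both say 'YES', otherwise 'NO'."""
    if records is None or claims is None:
        return None
    return records and claims


# A 'YES' in claims with a missing medical record is 'YES' under OR but 'MISSING' under AND.
print(or_rule(None, True), and_rule(None, True))
```

Both rules treat a reported ‘NO’ as if it could be a true negative, which is why they cannot correct for underreporting.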

Table VI shows estimates from the two aforementioned post-imputation analyses. The first row lists marginal estimates of the hospice-use rate, and other rows list the logistic regression coefficients. Except for the analysis using the ‘OR’ algorithm, all other methods yield substantially lower estimates for the rate of hospice-use than that from multiple imputation. This is because these methods treat reported ‘NO’ as true negatives and therefore tend to underestimate the rate under the assumption of independent underreporting. The rate estimate from the ‘OR’ algorithm is higher because it removes cases with missing reports from either source and therefore deflates the denominator. The logistic regression analysis results from the multiply imputed data suggest that older patients (65–69 or 80 years or older), patients in late stage (III/IV) or with stages missing, and patients with depression are more likely to use hospice. The directions and magnitudes of these coefficients from other missing data methods can be considerably different. For example, the analysis using the ‘OR’ algorithm shows patients older than 65 years old are less likely to use hospice than those younger than 65 years old. This counterintuitive result might be caused by solely using the misreported medical records data for patients younger than 65 years old. In addition, as opposed to the multiple imputation approach, the other missing data methods rarely show strong associations between late stage and hospice-use, which is less plausible because we might expect more use of hospice by patients with more predictable mortality.

Table VI.

Results of the two post-imputation analyses under alternative missing data methods using the medical records and Medicare claims, a subsample of CanCORS data.

Variables	MI	OR	AND	Medical records	Medicare claims

Hospice-use	62.8%	67.6%	39.0%	29.3%	55.4%
65–69 years	.56*	−1.51*	−.05	−.11	NA
70–74 years	.26	−1.61*	.08	−.03	.10
75–79 years	.36	−1.52*	.13	.04	.09
80 years or older	.68*	−1.22*	.26	.08	.42*
Female	.12	.27*	.20	−.03	.24*
Black	−.31	−.28	−.69*	−.89*	−.20
Hispanic	.31	.11	−.63	−.55*	−.17
Asian	−.36	−.42	−1.54*	−.62*	−.66*
Other race	−.58	−.58*	−1.12*	−.50*	−.83*
Less than high school	.11	.00	.33	.33*	.02
High school	−.03	−.10	.25	.42*	−.17
Less than 20K	.23	−.16	−.08	.01	.18
20–39K	.32	−.09	−.18	−.05	.22
40–59K	.44*	.14	−.10	.12	.37
Married	−.23	−.16	−.05	.02	−.14
Stage 2	.49	.34	.20	−.04	.30
Stage 3	.48*	.35	.23	.13	.29
Stage 4	.90*	.48*	.21	.21	.34
Stage missing	.78*	.68*	−.10	−.22	.44
MI	−.02	−.18	−.16	−.02	−.11
CHF	−.27	−.04	−.04	.06	−.06
Stroke	.08	−.09	.04	.17	−.10
Lung disease	−.13	−.11	−.10	−.18*	−.03
Diabetes	.03	.09	.30*	.04	.28*
Depression	.33*	.18	.03	.00	.15


Note: The subsample consists of 2669 CanCORS patients who died within 15 months of diagnosis and had at least one observed hospice-use record from two sources. The first row (‘Hospice-use’) lists marginal estimates of the hospice-use rate. The following rows list the logistic regression coefficients (Section 3.4.2). MI, multiple imputation (in the column header); in the row labels, MI and CHF are comorbidity indicators. Standard errors are not shown. Logistic regression coefficients that are significant at the 10% level are marked with *. The reference categories are lung cancer; younger than 65 years old for MI, OR, AND, and Medical records, and younger than 70 years old for Medicare claims; White; college; more than 60K; and stage I.

Because the alternative methods do not appropriately account for measurement errors in the response variable, the resulting coefficients are susceptible to bias [24], which might lead to misleading scientific conclusions. For example, an analysis using medical records only, which contain substantial measurement errors, might be less capable of detecting true effects. On the other hand, an analysis using Medicare claims only, which might be more accurate yet are limited to the sample older than 65 years old, might produce association estimates that do not generalize to younger patients. The multiple imputation analysis overcomes the weaknesses of the two data sources. It yields overall larger standard errors than the alternative approaches, which overstate the precision of the coefficients because they fail to account for the misclassification of the outcome (hospice-use) in an appropriate way.

In multiple imputation analysis, an important statistic is the rate of missing information, which is approximately the ratio of between-imputation to total variance and shows the increase of variance associated with missing data under the model. In the logistic regression analysis, the estimated rate of missing information is substantial, ranging largely between 40% and 70% for various estimates (results not shown). These rates are nevertheless considerably lower than the rate of imputation (100%). This range suggests that using mismeasured reports and covariates is as informative as having observed around 30–60% (i.e., 1 − 70% to 1 − 40%) of the true hospice-use data for the analyses, supporting the utility of Model (1) for imputation [25].
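For reference, the approximation can be written out as follows. The symbols are assumptions introduced here for illustration: M is the number of imputations, W the within-imputation variance, B the between-imputation variance, T the total variance, and λ the rate of missing information, following the standard combining rules [5].

```latex
T = W + \left(1 + \frac{1}{M}\right) B,
\qquad
\hat{\lambda} \approx \frac{\left(1 + 1/M\right) B}{T},
\qquad
0.40 \le \hat{\lambda} \le 0.70
\;\Longleftrightarrow\;
0.30 \le 1 - \hat{\lambda} \le 0.60 .
```

The quantity 1 − λ is what the text interprets as the share of information retained relative to fully observing the true hospice-use status.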

4. Discussion

In our application, a binary measure of hospice-use is available from two data sources, yet it is subject to misclassification and missing data in both datasets, errors commonly encountered in population research. We multiply impute the true status of this measure, using models that capture both the outcome and reporting processes (Equations (1.1)–(1.4)). The subject-matter analyses yield conclusions more sensible than those of ad hoc approaches that fail to account for these errors. Although Medicare claims data are shown to be more accurate than medical records in our case, the former are limited to patients older than 65 years old. Therefore, a major advantage of the proposed approach is to analyze hospice-use across the entire age spectrum in a more reliable way.

Some major limitations of the proposed approach stem from the assumptions (Section 2.1), including MAR for the reported data, conditional independence of reporting, and underreporting for both sources. We discuss possible generalizations that can lead to future research topics. First, we might generalize the missingness mechanism to not MAR [5] by assuming that, for example, the medical records for patients who did not use hospice were more likely to be missing. However, not MAR models are usually weakly identifiable [26] and are especially likely to be so in our application because true hospice-use is 100% missing. A more feasible strategy might be to perform sensitivity analysis under possible specifications of models for missingness mechanisms that rely on strong subject-matter expertise on how medical records data might be missing during the abstracting process.

Our topic is related to dual exposure assessment in the measurement error problems literature [27]. There, two assessments are used to determine whether subjects were exposed to putative binary risk factors in case-control studies. For example, Drews et al. [28] used patient interviews and medical records to provide two assessments of a variety of exposures (e.g., maternal anemia during pregnancy) for a case-control study on sudden infant death syndrome. The objective is to estimate exposure-outcome associations (Section 5.3–5.4 of [29]). In our context, the analog to exposure is hospice-use status, reported by two data sources. Unlike typical case-control studies, there is no analog to the disease outcome measure in our setting, although hospice-use could be either the exposure (predictor) or outcome (response) in post-imputation analyses. Our model factorization corresponds to the ‘exposure given outcome model’ in Equation (5.14) of [29].

Gustafson’s [29] variations on dual exposure assessment models include relaxing the conditional independence assumption and allowing both overreporting and underreporting. For example, we might consider a more general specification of the reporting model P_R(Y_R₁, Y_R₂|Y_O, X, θ_R) relaxing both assumptions, with complete-data model

$$
\begin{aligned}
g(\Pr(Y_{O} = 1 \mid X)) &= X^{T} β_{0}, \\
g(\Pr(Y_{R1} = 1 \mid Y_{O}, X)) &= β_{1} Y_{O} + X^{T} β_{2}, \\
g(\Pr(Y_{R2} = 1 \mid Y_{O}, Y_{R1}, X)) &= β_{3} Y_{O} + β_{4} Y_{R1} + X^{T} β_{5},
\end{aligned}
\tag{2}
$$

where g is a link function for the binary outcomes, X is the vector of covariates including the intercept term, and β’s are the coefficients. Thus, the misreporting of the two data sources can go in either direction, and the reports can still be correlated after controlling for covariates.
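To make the structure of Model (2) concrete, the sketch below simulates data from it with a probit link and a single covariate. Everything numerical here is hypothetical: the link choice, the coefficient values, and the function name `simulate_model2` are assumptions for illustration and are not estimates from the CanCORS data.

```python
import random
from statistics import NormalDist

PHI = NormalDist().cdf  # inverse probit link (assumed choice of g)


def simulate_model2(n, b0, b1, b2, b3, b4, b5, seed=0):
    """Simulate (x, y_o, y_r1, y_r2) from Model (2).

    b0, b2, b5 are (intercept, slope) pairs; b1, b3, b4 are scalars."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y_o = int(rng.random() < PHI(b0[0] + b0[1] * x))
        # Source 1 depends on the true status and covariates.
        y_r1 = int(rng.random() < PHI(b1 * y_o + b2[0] + b2[1] * x))
        # Source 2 also depends on the report from source 1 (residual correlation).
        y_r2 = int(rng.random() < PHI(b3 * y_o + b4 * y_r1 + b5[0] + b5[1] * x))
        rows.append((x, y_o, y_r1, y_r2))
    return rows


data = simulate_model2(5000, b0=(0.3, 0.5), b1=2.0, b2=(-1.5, 0.2),
                       b3=2.5, b4=0.5, b5=(-1.5, 0.1))
over = sum(1 for _, y_o, y_r1, _ in data if y_o == 0 and y_r1 == 1)
under = sum(1 for _, y_o, y_r1, _ in data if y_o == 1 and y_r1 == 0)
print(over, under)  # both kinds of misreporting occur in source 1
```

Because the intercepts in the reporting equations are finite, a source can report ‘yes’ when the true status is ‘no’, which is exactly what the underreporting-only Model (1) rules out.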

However, Model (2) might be weakly identified. Although an increasing number of covariates X in the data would seem to provide enough degrees of freedom for estimating the associated parameters, practical experience with such estimation has been largely negative (Section 5.4.2 of [29]). See also the comments in Section 15.3.2.1 of [27] in a similar context with unknown response misclassifications in logistic regression models. They argued that the practical nonidentifiability of these models might be due to the fact that the misclassification probabilities are only weakly identified by the data if both overreporting and underreporting are allowed. Our own application of Model (2) to the hospice-use data shows a similar lack of identifiability (results not shown). Further methodological investigation of model nonidentifiability is beyond the scope of this paper.

On the other hand, identifiability of more general misreporting models might be improved if data are available from more than two sources. With L > 2 sources, the complete-data model can be factorized as

$$P(Y_{R1}, Y_{R2}, \dots, Y_{RL}, Y_{O} \mid X, θ) = P_{O}(Y_{O} \mid X, θ_{O}) \, P_{R}(Y_{R1}, Y_{R2}, \dots, Y_{RL} \mid Y_{O}, X, θ_{R})$$

The main challenge is to model the multivariate reported outcomes P_R(Y_R1, Y_R2,…, Y_RL|Y_O, X, θ_R) (i.e., the reporting model). With L increasing, the data might contain more flexibility for allowing both overreporting and underreporting, as well as residual correlations among reporting sources after adjusting for covariates. Intuitively, binary data Y_Rl’s constitute a 2^L table, which opens more degrees of freedom for model parameters. With L = 3, the modeling framework resembles triple system estimation, which augments census and post-enumeration survey data (both of which are included in DSE) with administrative-list data to improve estimates of population count [30]. Similarly, CanCORS patients or their surrogates were also surveyed on whether they received hospice care. Patient survey data can be treated as the third source for this measure, which might include both overreporting and underreporting due to recall bias. Further modeling efforts to impute hospice-use will be necessary in such a complicated data structure. Extensions can also be considered for mismeasured variables with distributions other than binary (e.g., ordinal, nominal, or continuous).
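A rough parameter count illustrates the degrees-of-freedom intuition in the simplest setting. The setting is an assumption made here for illustration and is not the model fitted in this paper: no covariates, conditionally independent sources, and each source allowed its own probabilities of correct positive and correct negative reports. The 2^L table then supplies 2^L − 1 free cell probabilities, while the model has 2L + 1 parameters (one prevalence plus two reporting probabilities per source).

```latex
\begin{array}{c|c|c}
L & \text{free cells } 2^{L} - 1 & \text{parameters } 2L + 1 \\ \hline
2 & 3 & 5 \\
3 & 7 & 7 \\
4 & 15 & 9
\end{array}
```

With two sources there are fewer free cells than parameters, whereas from three sources onward the count is at least balanced. Matching counts are a necessary condition only and do not by themselves guarantee identifiability.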

Increased efforts on data collection and assembly offer great opportunities to use information from multiple data sources for scientific investigations. The imputation strategy provides an effective means to synthesize information, closely related to advances in survey sampling and measurement error problems. Methods under more general modeling assumptions or for more complicated data structures deserve further research.

Supplementary Material

Supp Material

NIHMS581454-supplement-Supp_Material.pdf^{(17.6KB, pdf)}

Acknowledgements

The findings and conclusions in this study are those of the authors and do not necessarily represent the views of the Centers for Disease Control and Prevention. The authors acknowledge the journal editors and reviewers, Meena Khare and Jennifer Madans, for their valuable input. The work of the CanCORS Consortium was supported by grants from the National Cancer Institute (NCI) to the Statistical Coordinating Center (U01 CA093344) and the NCI-supported Primary Data Collection and Research Centers (Dana-Farber Cancer Institute/Cancer Research Network U01 CA093332, Harvard Medical School/Northern California Cancer Center U01 CA093324, RAND/UCLA U01 CA093348, University of Alabama at Birmingham U01 CA093329, University of Iowa U01 CA093339, and University of North Carolina U01 CA093326) and by the Department of Veterans Affairs grant to the Durham VA Medical Center CRS 02-164.

Appendix

Let $θ = (β_{O}, {γ_{O i}}, σ_{O}^{2}, β_{R 1}, {γ_{R 1 i}}, σ_{R 1}^{2}, β_{R 2}, {γ_{R 2 i}}, σ_{R 2}^{2})$ . The imputation step of the DA algorithm:

Set Y_O = 1 if Y_R1 = 1 or Y_R2 = 1 by the assumption of underreporting

When Y_R1 = 0 and Y_R2 = 0, by the Bayes theorem,

$$
\begin{aligned}
& P(Y_{O} = 1 \mid Y_{R1} = 0, Y_{R2} = 0, X, θ) \\
&\quad = \frac{P(Y_{R1} = 0 \mid Y_{O} = 1, X, β_{R1}, \{γ_{R1i}\}) \, P(Y_{R2} = 0 \mid Y_{O} = 1, X, β_{R2}, \{γ_{R2i}\}) \, P(Y_{O} = 1 \mid X, β_{O}, \{γ_{Oi}\})}{P(Y_{R1} = 0 \mid Y_{O} = 1, X, β_{R1}, \{γ_{R1i}\}) \, P(Y_{R2} = 0 \mid Y_{O} = 1, X, β_{R2}, \{γ_{R2i}\}) \, P(Y_{O} = 1 \mid X, β_{O}, \{γ_{Oi}\}) + P(Y_{O} = 0 \mid X, β_{O}, \{γ_{Oi}\})} \\
&\quad = \frac{Φ(-X^{T} β_{R1} - γ_{R1}) \, Φ(-X^{T} β_{R2} - γ_{R2}) \, Φ(X^{T} β_{O} + γ_{O})}{Φ(-X^{T} β_{R1} - γ_{R1}) \, Φ(-X^{T} β_{R2} - γ_{R2}) \, Φ(X^{T} β_{O} + γ_{O}) + Φ(-X^{T} β_{O} - γ_{O})};
\end{aligned}
$$

draw Y_O from a Bernoulli distribution with this probability.

When Y_R1 = 0 and Y_R2 is missing, draw Y_O from a Bernoulli distribution with the following probability

$$
\begin{aligned}
P(Y_{O} = 1 \mid Y_{R1} = 0, X, θ)
&= \frac{P(Y_{R1} = 0 \mid Y_{O} = 1, X, β_{R1}, \{γ_{R1i}\}) \, P(Y_{O} = 1 \mid X, β_{O}, \{γ_{Oi}\})}{P(Y_{R1} = 0 \mid Y_{O} = 1, X, β_{R1}, \{γ_{R1i}\}) \, P(Y_{O} = 1 \mid X, β_{O}, \{γ_{Oi}\}) + P(Y_{O} = 0 \mid X, β_{O}, \{γ_{Oi}\})} \\
&= \frac{Φ(-X^{T} β_{R1} - γ_{R1}) \, Φ(X^{T} β_{O} + γ_{O})}{Φ(-X^{T} β_{R1} - γ_{R1}) \, Φ(X^{T} β_{O} + γ_{O}) + Φ(-X^{T} β_{O} - γ_{O})}
\end{aligned}
$$

Similarly when Y_R1 is missing and Y_R2 is 0, draw Y_O from a Bernoulli distribution with the following probability
$P (Y_{O} = 1 | Y_{R 2} = 0, X, θ) = \frac{Φ (- X^{T} β_{R 2} - γ_{R 2}) Φ (X^{T} β_{O} + γ_{O})}{Φ (- X^{T} β_{R 2} - γ_{R 2}) Φ (X^{T} β_{O} + γ_{O}) + Φ (- X^{T} β_{O} - γ_{O})}$
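
The three cases of the imputation step share one formula: each source that reports ‘no’ contributes its probability of missing a true ‘yes’, and a missing source contributes nothing. The sketch below is illustrative; the function name `prob_true_yes` and the encoding of reports as 1, 0, or `None` (missing) are assumptions, and the linear predictors are passed in as already computed values of X^T β + γ.

```python
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal CDF used by the probit models


def prob_true_yes(eta_o, eta_r1, eta_r2, r1, r2):
    """P(Y_O = 1 | reports, X, theta) under the underreporting-only assumption.

    eta_o, eta_r1, eta_r2 -- linear predictors of the outcome and reporting models
    r1, r2                -- reports from the two sources: 1, 0, or None (missing)
    """
    if r1 == 1 or r2 == 1:
        return 1.0  # a positive report is taken as true
    miss = 1.0  # probability that the observed sources all miss a true 'yes'
    if r1 == 0:
        miss *= PHI(-eta_r1)
    if r2 == 0:
        miss *= PHI(-eta_r2)
    num = miss * PHI(eta_o)
    return num / (num + PHI(-eta_o))


# With all linear predictors equal to 0, every probit probability is 0.5.
print(prob_true_yes(0.0, 0.0, 0.0, 0, 0))     # both sources report 'no'
print(prob_true_yes(0.0, 0.0, 0.0, 0, None))  # second source missing
```

The imputed status is then a Bernoulli draw with this probability, for example `int(random.random() < p)`.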

The posterior step of the DA algorithm draws a new value of θ conditional on imputed Y_O via the auxiliary variable Gibbs sampling algorithm for probit models [31]. To describe the algorithm in a general way, suppose the data contain m sites and, in site i, there are n_i patients. The observed data consist of {Y_R1ij} and {Y_R2ij}, and patient-level covariates X_ij. Unless noted otherwise (e.g., for $σ_{O}^{2}$ ), the posterior distributions are standard, and we omit the details. The Gibbs sampler is as follows:

Step 1. Draw latent variables {Z_Oij} from truncated univariate normal distributions with mean X_ijβ_O + γ_Oi and unit variance, with the signs of the latent variables depending on Y_Oij, that is, Z_Oij > 0 iff Y_Oij = 1.
Step 2. For the cases with Y_Oij = 1, draw latent variables {Z_Rlij} from truncated univariate normal distributions with mean X_ijβ_Rl + γ_Rli and unit variance, with the signs of the latent variables depending on Y_Rlij, that is, Z_Rlij > 0 iff Y_Rlij = 1 (l = 1, 2).
Step 3. Draw
$β_{O} \sim N ({(\sum_{i, j} X_{i j}^{T} X_{i j})}^{- 1} (\sum_{i, j} X_{i j}^{T} (Z_{O i j} - γ_{O i})), {(\sum_{i, j} X_{i j}^{T} X_{i j})}^{- 1}) .$
Step 4. Draw
$β_{R l} \sim N ({(\sum_{i, j, Y_{O i j} = 1} X_{i j}^{T} X_{i j})}^{- 1} (\sum_{i, j, Y_{O i j} = 1} X_{i j}^{T} (Z_{R l i j} - γ_{R l i})), {(\sum_{i, j, Y_{O i j} = 1} X_{i j}^{T} X_{i j})}^{- 1}) .$
Step 5. Let S be the design matrix for random effects γ_Oi. Draw random effects {γ_Oi} from a multivariate normal with covariance matrix

$Ω_{γ_{O}} = {(diag {n_{i}} + 1 / σ_{O}^{2} I_{m})}^{- 1}$ and mean vector μ_{γ_O} = Ω_{γ_O} (S^T (Z_O − Xβ_O)), where I_m is the identity matrix with dimension m and diag{n_i} is a m × m diagonal matrix with the i-th element as n_i.

Step 6. Let S_{Y_O = 1} be the submatrix of S for the cases with Y_O = 1. Draw random effects {γ_Rli} from a multivariate normal with covariance matrix

$Ω_{γ_{R l}} = {(S_{Y_{O} = 1}^{T} S_{Y_{O} = 1} + 1 / σ_{R l}^{2} I_{m})}^{- 1}$ and mean vector $μ_{γ_{R l}} = Ω_{γ_{R l}} (S_{Y_{O} = 1}^{T} (Z_{R l} - X_{Y_{O} = 1} β_{R l}))$ (l = 1, 2).

Step 7. Draw $σ_{O}^{2}$ from $I G ((m - 1) / 2, \sum_{i} γ_{O i}^{T} γ_{O i} / 2)$ . Note that the conditional distribution of $σ_{O}^{2}$ satisfies $p (σ_{O}^{2} | Others) \propto p (\{γ_{O i}\} | σ_{O}^{2}) p (σ_{O}^{2})$ . As in [18], a uniform prior on σ_O implies that $p (σ_{O}^{2}) \propto {(σ_{O}^{2})}^{- \frac{1}{2}}$ . Therefore, $p (σ_{O}^{2} | Others) \propto {(σ_{O}^{2})}^{- \frac{m}{2}} \exp (- \frac{1}{2 σ_{O}^{2}} \sum_{i} γ_{O i}^{T} γ_{O i}) {(σ_{O}^{2})}^{- \frac{1}{2}} = {(σ_{O}^{2})}^{- \frac{m + 1}{2}} \exp (- \frac{1}{2 σ_{O}^{2}} \sum_{i} γ_{O i}^{T} γ_{O i})$ . On the other hand, for a random variable X ~ IG(a, b), $p (X | a, b) \propto X^{- (a + 1)} \exp (- b / X)$ . Matching exponents gives a + 1 = (m + 1)/2, and therefore $σ_{O}^{2} | Others \sim I G (a = (m - 1) / 2, b = \sum_{i} γ_{O i}^{T} γ_{O i} / 2)$ . In addition, note that $σ_{O}^{2} | Others \sim Inv - χ^{2} (ν = m - 1, s^{2} = \sum_{i} γ_{O i}^{T} γ_{O i} / (m - 1))$ by the connection between the inverse gamma distribution and the scaled inverse chi-square distribution with ν degrees of freedom and scale s [19].
Step 8. Draw $σ_{R l}^{2}$ from $I G ((m - 1) / 2, \sum_{i} γ_{R l i}^{T} γ_{R l i} / 2)$ , where the conditional distributions can be derived similarly as in Step 7.

Table AI.

Frequency tables of hospice-use reports from medical records and Medicare claims, by study site.

Site	Medical records	Medicare claims yes	Medicare claims no
GHC	Yes	23	2
GHC	No	38	22
HPHC	Yes	2	0
HPHC	No	0	0
HFHS	Yes	7	3
HFHS	No	2	11
KPHI	Yes	1	0
KPHI	No	4	8
KPNW	Yes	41	7
KPNW	No	20	23
NCCC	Yes	43	12
NCCC	No	130	179
UAB	Yes	20	0
UAB	No	82	101
UCLA	Yes	62	9
UCLA	No	78	122
IOWA	Yes	190	21
IOWA	No	70	112
UNC	Yes	6	0
UNC	No	21	39

Note: The subsample consists of 2669 CanCORS patients who died within 15 months of diagnosis and had at least one observed hospice-use record from two sources. An example of calculating the unadjusted site-specific capturing rate for the two data sources: for GHC, the estimate for Medicare claims is 23/(23 + 2) ≈ 92%; the estimate for medical records is 23/(23 + 38) ≈ 38% (see also the footnote of Table IV).
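
Two building blocks of the Gibbs sampler described in this Appendix, the truncated normal draws of the latent variables (Steps 1 and 2) and the inverse gamma draw of a site-level variance, can be sketched with the Python standard library alone. This is an illustrative sketch rather than the authors' implementation; the function names and the small numerical guards are assumptions.

```python
import random
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def draw_latent(mean, y, rng):
    """Draw Z ~ N(mean, 1) truncated to Z > 0 if y == 1 and to Z < 0 otherwise,
    using the inverse-CDF method (adequate for moderate values of |mean|)."""
    p_neg = STD.cdf(-mean)  # P(Z <= 0) for the untruncated normal
    u = min(max(rng.random(), 1e-12), 1 - 1e-12)  # keep u strictly inside (0, 1)
    if y == 1:
        p = p_neg + u * (1 - p_neg)  # upper tail: Z > 0
    else:
        p = u * p_neg                # lower tail: Z < 0
    p = min(max(p, 1e-15), 1 - 1e-15)  # numerical guard for inv_cdf
    return mean + STD.inv_cdf(p)


def draw_variance(gammas, rng):
    """Draw sigma^2 ~ IG((m - 1) / 2, sum(gamma_i^2) / 2) given m site effects.
    If X ~ Gamma(shape=a, rate=b), then 1 / X ~ IG(a, b)."""
    m = len(gammas)
    a = (m - 1) / 2
    b = sum(g * g for g in gammas) / 2
    return 1 / rng.gammavariate(a, 1 / b)  # gammavariate takes (shape, scale)


rng = random.Random(2014)
print(draw_latent(0.3, 1, rng) > 0, draw_latent(0.3, 0, rng) < 0)
print(draw_variance([0.4, -0.2, 0.1, -0.5, 0.3], rng) > 0)
```

The multivariate normal draws for the regression coefficients and random effects would typically rely on a linear algebra library and are omitted here.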

References

1. Schenker N, Raghunathan TE. Combining information from multiple surveys to enhance estimation of measures of health. Statistics in Medicine. 2007;26:1802–1811. doi: 10.1002/sim.2801.
2. Schenker N, Parker JD. From single-race reporting to multiple-race reporting: using imputation methods to bridge the transition. Statistics in Medicine. 2003;22:1571–1587. doi: 10.1002/sim.1512.
3. Yucel RM, Zaslavsky AM. Imputation of binary treatment variables with measurement error in administrative data. Journal of the American Statistical Association. 2005;100:1123–1132.
4. He Y, Zaslavsky AM. Combining information from cancer registry and medical records data to improve analyses of adjuvant cancer therapies. Biometrics. 2009;65:946–952. doi: 10.1111/j.1541-0420.2008.01164.x.
5. Rubin DB. Multiple Imputation for Nonresponse in Surveys. New York: Wiley; 1987.
6. Ayanian JZ, Chrischilles EA, Wallace RB, Fletcher RH, Fouad MN, Kiefe CI, Harrington DP, Weeks JC, Kahn KL, Malin JL, Lipscomb J, Potosky AL, Provenzale DT, Sandler RS, Ryn MV, West DW. Understanding cancer treatment and outcomes: the cancer care outcomes research and surveillance consortium. Journal of Clinical Oncology. 2004;22:2992–2996. doi: 10.1200/JCO.2004.06.020.
7. Huskamp HA, Keating NL, Malin JL, Zaslavsky AM, Weeks JC, Earle CC, Teno JM, Virnig BA, Kahn KL, He Y, Ayanian JZ. Discussions with physicians about hospice among patients with metastatic lung cancer. Archives of Internal Medicine. 2009;169:954–962. doi: 10.1001/archinternmed.2009.127.
8. Mack JW, Cronin A, Taback N, Huskamp HA, Keating NL, Malin JL, Earle CC, Weeks JC. End-of-Life care discussions among patients with advanced cancer: a cohort study. Annals of Internal Medicine. 2012;156:204–210. doi: 10.1059/0003-4819-156-3-201202070-00008.
9. Earle CC, Landrum MB, Souza JM, Neville BA, Weeks JC, Ayanian JZ. Aggressiveness of cancer care near the end of life: is it a quality-of-care issue. Journal of Clinical Oncology. 2008;26:3860–3866. doi: 10.1200/JCO.2007.15.8253.
10. Little RJA, Rubin DB. Statistical Analysis of Missing Data. New York: Wiley; 2002.
11. Reiter JP, Raghunathan TE. The multiple adaptations of multiple imputation. Journal of the American Statistical Association. 2007;102:1462–1471.
12. Mislevy RJ. Randomized-based inference about latent variables from complex samples. Psychometrika. 1991;56:177–196.
13. Cole SR, Chu H, Greenland S. Multiple imputation for measurement-error correction. International Journal of Epidemiology. 2006;35:1074–1081. doi: 10.1093/ije/dyl097.
14. Sekar C, Deming EW. On a method of estimating birth and death rates and the extent of registration. Journal of the American Statistical Association. 1949;44:101–115.
15. Alho JM, Mulry MH, Wurdeman K, Kim J. Estimating heterogeneity in the probabilities of enumeration for dual-system estimation. Journal of the American Statistical Association. 1993;88:1130–1136.
16. Albert A, Anderson JA. On the existence of maximum likelihood estimates in logistic regression models. Biometrika. 1984;71:1–10.
17. Albert JH, Chib S. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association. 1993;88:669–679.
18. Gelman A. Prior distributions for variance parameters in hierarchical models. Bayesian Analysis. 2006;1:515–533.
19. Gelman A, Carlin JB, Stern HS, Rubin DB. Bayesian Data Analysis. 2nd edn. New York, NY: CRC Press; 2004.
20. Spiegelhalter DJ, Abrams KR, Myles JP. Bayesian Approaches to Clinical Trials and Health-Care Evaluation. Chichester: Wiley; 2004. Section 5.7.3.
21. Tanner MA, Wong WH. The calculation of posterior distributions by data augmentation (with discussion). Journal of the American Statistical Association. 1987;82:528–550.
22. Gelman A, Rubin DB. Inference from iterative simulation using multiple sequences. Statistical Science. 1992;7:457–472.
23. Gelman A, Meng XL, Stern HS. Posterior predictive assessment of model fitness via realized discrepancies (with discussion). Statistica Sinica. 1996;6:733–807.
24. Neuhaus JM. Bias and efficiency loss due to misclassified responses in binary regression. Biometrika. 1999;86:843–855.
25. Harel O, Miglioretti D. Missing information as a diagnostic tool for latent class analysis. Journal of Data Science. 2007;5:269–288.
26. Molenberghs G, Kenward MG. Missing Data in Clinical Studies. West Sussex: Wiley; 2007.
27. Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM. Measurement Error in Nonlinear Models: A Modern Perspective. 3rd edn. New York, NY: CRC Press; 2006.
28. Drews CD, Flanders WD, Kosinski AS. Use of two data sources to estimate odds-ratios in case-control studies. Epidemiology. 1993;4:327–355. doi: 10.1097/00001648-199307000-00008.
29. Gustafson P. Measurement Error and Misclassification in Statistics and Epidemiology. Boca Raton, FL: CRC Press; 2004.
30. Zaslavsky A, Wolfgang GS. Triple-system modeling of census, post-enumeration survey, and administrative-list data. Journal of Business and Economic Statistics. 1993;11:279–288.
31. Chib S, Greenberg E. Analysis of multivariate probit models. Biometrika. 1998;85:347–361.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supp Material

NIHMS581454-supplement-Supp_Material.pdf^{(17.6KB, pdf)}

[R1] 1.Schenker N, Raghunathan TE. Combining information from multiple surveys to enhance estimation of measures of health. Statistics in Medicine. 2007;26:1802–1811. doi: 10.1002/sim.2801. [DOI] [PubMed] [Google Scholar]

[R2] 2.Schenker N, Parker JD. From single-race reporting to multiple-race reporting: using imputation methods to bridge the transition. Statistics in Medicine. 2003;22:1571–1587. doi: 10.1002/sim.1512. [DOI] [PubMed] [Google Scholar]

[R3] 3.Yucel RM, Zaslavsky AM. Imputation of binary treatment variables with measurement error in administrative data. Journal of American Statistical Association. 2005;100:1123–1132. [Google Scholar]

[R4] 4.He Y, Zaslavsky AM. Combining information from cancer registry and medical records data to improve analyses of adjuvant cancer therapies. Biometrics. 2009;65:946–952. doi: 10.1111/j.1541-0420.2008.01164.x. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R5] 5.Rubin DB. Multiple Imputation for Nonresponse in Surveys. New York: Wiley; 1987. [Google Scholar]

[R6] 6.Ayanian JZ, Chrischilles EA, Wallace RB, Fletcher RH, Fouad MN, Kiefe CI, Harrington DP, Weeks JC, Kahn KL, Malin JL, Lipscomb J, Potosky AL, Provenzale DT, Sandler RS, Ryn MV, West DW. Understanding cancer treatment and outcomes: the cancer care outcomes research and surveillance consortium. Journal of Clinical Oncology. 2004;22:2992–2996. doi: 10.1200/JCO.2004.06.020. [DOI] [PubMed] [Google Scholar]

[R7] 7.Huskamp HA, Keating NL, Malin JL, Zaslavsky AM,Weeks JC, Earle CC, Teno JM, Virnig BA, Kahn KL, He Y, Ayanian JZ. Discussions with physicians about hospice among patients with metastatic lung cancer. Archives of Internal Medicine. 2009;169:954–962. doi: 10.1001/archinternmed.2009.127. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R8] 8.Mack JW, Cronin A, Taback N, Huskamp HA, Keating NL, Malin JL, Earle CC, Weeks JC. End-of-Life care discussions among patients with advanced cancer: a cohort study. Annals of Internal Medicine. 2012;156:204–210. doi: 10.1059/0003-4819-156-3-201202070-00008. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R9] 9.Earle CC, Landrum MB, Souza JM, Neville BA, Weeks JC, Ayanian JZ. Aggressiveness of cancer care near the end of life: is it a quality-of-care issue. Journal of Clinical Oncology. 2008;26:3860–3866. doi: 10.1200/JCO.2007.15.8253. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R10] 10.Little RJA, Rubin DB. Statistical Analysis of Missing Data. New York: Wiley; 2002. [Google Scholar]

[R11] 11.Reiter JP, Raghunathan TE. The multiple adaptations of multiple imputation. Journal of the American Statistical Association. 2007;102:1462–1471. [Google Scholar]

[R12] 12.Mislevy RJ. Randomized-based inference about latent variables from complex samples. Psychometrika. 1991;56:177–196. [Google Scholar]

[R13] 13.Cole SR, Chu H, Greenland S. Multiple imputation for measurement-error correction. International Journal of Epidemiology. 2006;35:1074–1081. doi: 10.1093/ije/dyl097. [DOI] [PubMed] [Google Scholar]

[R14] 14.Sekar C, Deming EW. On a method of estimating birth and death rates and the extent of registration. Journal of the American Statistical Association. 1949;44:101–115. [Google Scholar]

[R15] 15.Alho JM, Mulry MH, Wurdeman K, Kim J. Estimating heterogeneity in the probabilities of enumeration for dual-system estimation. Journal of the American Statistical Association. 1993;88:1130–1136.

[R16] 16.Albert A, Anderson JA. On the existence of maximum likelihood estimates in logistic regression models. Biometrika. 1984;71:1–10.

[R17] 17.Albert JH, Chib S. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association. 1993;88:669–679.

[R18] 18.Gelman A. Prior distributions for variance parameters in hierarchical models. Bayesian Analysis. 2006;1:515–533.

[R19] 19.Gelman A, Carlin JB, Stern HS, Rubin DB. Bayesian Data Analysis. 2nd edn. New York, NY: CRC Press; 2004.

[R20] 20.Spiegelhalter DJ, Abrams KR, Myles JP. Bayesian Approaches to Clinical Trials and Health-Care Evaluation. Chichester: Wiley; 2004. Section 5.7.3.

[R21] 21.Tanner MA, Wong WH. The calculation of posterior distributions by data augmentation (with discussion). Journal of the American Statistical Association. 1987;82:528–550.

[R22] 22.Gelman A, Rubin DB. Inference from iterative simulation using multiple sequences. Statistical Science. 1992;7:457–472.

[R23] 23.Gelman A, Meng XL, Stern HS. Posterior predictive assessment of model fitness via realized discrepancies (with discussion). Statistica Sinica. 1996;6:733–807.

[R24] 24.Neuhaus JM. Bias and efficiency loss due to misclassified responses in binary regression. Biometrika. 1999;86:843–855.

[R25] 25.Harel O, Miglioretti D. Missing information as a diagnostic tool for latent class analysis. Journal of Data Science. 2007;5:269–288.

[R26] 26.Molenberghs G, Kenward MG. Missing Data in Clinical Studies. West Sussex: Wiley; 2007.

[R27] 27.Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM. Measurement Error in Nonlinear Models: A Modern Perspective. 3rd edn. New York, NY: CRC Press; 2006.

[R28] 28.Drews CD, Flanders WD, Kosinski AS. Use of two data sources to estimate odds-ratios in case-control studies. Epidemiology. 1993;4:327–355. doi: 10.1097/00001648-199307000-00008.

[R29] 29.Gustafson P. Measurement Error and Misclassification in Statistics and Epidemiology. Boca Raton, FL: CRC Press; 2004.

[R30] 30.Zaslavsky A, Wolfgang GS. Triple-system modeling of census, post-enumeration survey, and administrative-list data. Journal of Business and Economic Statistics. 1993;11:279–288.

[R31] 31.Chib S, Greenberg E. Analysis of multivariate probit models. Biometrika. 1998;85:347–361.
