Abstract
Outcome under-ascertainment, characterized by the incomplete identification or reporting of cases, poses a substantial challenge in epidemiologic research. While capture–recapture methods can estimate unknown case numbers, their role in estimating exposure effects in observational studies is not well established. This paper presents an ascertainment probability weighting framework that integrates capture–recapture and propensity score weighting. We propose a nonparametric estimator of effects on binary outcomes that combines exposure propensity scores with data from two conditionally independent outcome measurements to simultaneously adjust for confounding and under-ascertainment. Demonstrating its practical application, we apply the method to estimate the relationship between health care work and coronavirus disease 2019 testing in a Swedish region. We find that ascertainment probability weighting greatly influences the estimated association compared to conventional inverse probability weighting, underscoring the importance of accounting for under-ascertainment in studies with limited outcome data coverage. We conclude with practical guidelines for the method’s implementation, discussing its strengths, limitations, and suitable scenarios for application.
Keywords: Information bias, Misclassification, Register completeness, Sensitivity, Under-coverage
Administrative population registers play a vital role in epidemiology, providing data on disease incidence and exposure across more or less entire populations.1 However, despite their richness in data, these databases can have shortcomings such as incomplete coverage, data inaccuracies, and absent records. For instance, it is often unclear if the data source(s) used to measure outcomes can capture all relevant outcome events. Unlike studies with original data collection, where researchers are typically aware of which individuals have missing data, unrecorded events in register-based studies lead to a problem called “under-ascertainment.”2 This issue arises from our inability to differentiate between individuals with missing outcomes and those without, which means that conventional missing data imputation cannot be employed.
Under-ascertainment leads to an underestimation of disease prevalence and incidence and can therefore hamper disease surveillance efforts. When outcome ascertainment depends on individual characteristics, it can also result in biased exposure (or treatment) effect estimates by biasing the observed outcome numbers downward in one group more than the other, compromising the validity of causal estimates in a way similar to confounding.1,3 Despite these serious implications, under-ascertainment often goes unaddressed in empirical studies.4
Researchers often conduct register validation studies to assess the accuracy of disease classifiers by comparing them with gold standard reference data, which are presumed to be complete.1,4 In situations where such reference data are unavailable, capture–recapture methods can be employed. These methods use imperfect case data from multiple sources to estimate the true prevalence of underreported diseases.5–10 For example, techniques have been applied to uncover “hidden” public health challenges such as drug use,11 modern slavery,12 homelessness,13 and more.14 However, existing capture–recapture methods have not been developed to estimate exposure effects, limiting their value for epidemiologic research.
In this paper, we introduce an ascertainment probability weighting (APW) framework that combines capture–recapture with exposure propensity scores to simultaneously address under-ascertainment and confounding. Unlike existing empirical approaches for handling outcome misclassification, which require a single, perfect validation dataset, our approach exploits outcome data from two imperfect sources.15–25 We begin by introducing our framework and estimator, highlighting the assumptions required for valid estimation. We then demonstrate the approach through a hypothetical example and an empirical case study on under-ascertained coronavirus disease 2019 (COVID-19) tests in Sweden, while providing practical implementation guidance, discussing strengths and limitations, and highlighting its appropriate usage scenarios.
METHODS
Preliminaries and Notation
This section introduces the notation used throughout the paper. Let $Y$ denote a binary “target outcome” of interest, such as disease outcomes (e.g., lung cancer), health behaviors (smoking), or other characteristics (being homeless). $X$ represents a predictor of interest (e.g., an exposure or treatment variable), and $Z$ represents covariate(s).
To be concise, we often employ shorthand expressions when presenting conditional probabilities. For instance, instead of writing out the full expression $P(Y=1 \mid X=x, Z=z)$, we write $P(Y=1 \mid x, z)$.
Under-ascertainment
Our framework addresses under-ascertainment, a type of one-sided outcome misclassification. Under-ascertainment occurs when individuals with $Y=1$ may either be correctly classified as $Y^*=1$ or wrongly classified as $Y^*=0$ (where $Y^*$ denotes the observed outcome classification), but all individuals with $Y=0$ are always classified correctly as $Y^*=0$ (i.e., there are no false positives). In the machine learning literature, this is often referred to as a positive-unlabeled classification problem.26
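As a minimal illustration of this one-sided misclassification structure, the following sketch (with hypothetical prevalence and ascertainment values, not taken from the paper) generates data in which true cases are missed with some probability but recorded cases are never spurious:

```python
import random

random.seed(1)

# Hypothetical illustration of one-sided misclassification: a true case
# (Y = 1) is recorded (Y* = 1) only with some ascertainment probability,
# while non-cases (Y = 0) are never falsely recorded as cases.
N = 100_000
prevalence = 0.3           # assumed P(Y = 1); hypothetical value
ascertainment = 0.6        # assumed P(Y* = 1 | Y = 1); hypothetical value

Y = [1 if random.random() < prevalence else 0 for _ in range(N)]
Y_star = [1 if y == 1 and random.random() < ascertainment else 0 for y in Y]

# No false positives: every recorded case is a true case ...
assert all(y == 1 for y, ys in zip(Y, Y_star) if ys == 1)
# ... but the observed prevalence is biased downward (about 0.18 vs 0.30).
print(sum(Y) / N, sum(Y_star) / N)
```

In positive-unlabeled terms, the records with $Y^*=0$ are a mix of true negatives and unlabeled positives, which is why conventional missing data imputation cannot be applied.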
Causal Framework and Assumptions
To relate the realized $Y$ to causal estimands, we apply the potential outcomes framework.27 This framework defines a potential outcome $Y^x$ as the outcome that would occur if the exposure (or treatment), perhaps contrary to fact, were set to $x$. We assume standard causal assumptions: consistency ($Y = Y^x$ whenever $X = x$ is observed), exchangeability of $X$ and the potential outcomes conditional on observed covariates ($Y^x \perp X \mid Z$ for all $x$), and exposure positivity ($P(X=x \mid Z=z) > 0$ for all $x$ and $z$). We also assume that all variables except $Y$ are measured without error, and that the study sample is representative of the (super-)population of interest.
Ascertainment Probability Weighting
Our objective is to estimate $P(Y^x=1)$, representing the potential outcome probability under exposure level $x$, which can be used to compute various marginal causal effects, such as risk differences (RDs) or risk ratios (RRs).
In Equation (A4) in the Appendix, we establish a relationship between $P(Y^x=1)$ and the under-ascertained data according to the following expression:

$$P(Y^x=1) = \sum_z \frac{P(Y^*=1, X=x, Z=z)}{P(Y^*=1 \mid Y=1, X=x, Z=z)\, P(X=x \mid Z=z)}. \tag{1}$$
Here, $P(Y^*=1 \mid Y=1, X=x, Z=z)$ is the ascertainment probability conditional on $X$ and $Z$. The rest of the expression follows the principles of conventional inverse probability weighting (IPW), including adjustment for confounding by inclusion of the exposure propensity score, $P(X=x \mid Z=z)$, in the denominator, which adjusts for observable confounders ($Z$) by balancing their distribution between exposure groups.28 For continuous $Z$, the summation can be replaced by integration.
Although Equation (1) cannot be computed directly due to the unobserved nature of $Y$, it can be estimated using capture–recapture. In our data setup, we observe under-ascertained outcome data from two data sources, where $Y_j^*=1$ indicates the outcome’s ascertainment in source $j \in \{1, 2\}$. In turn, $Y^*=1$ means that the outcome has been ascertained in at least one source, that is, $Y^*=1$ if $Y_1^*=1$ or $Y_2^*=1$.
We make two core assumptions beyond those mentioned in the previous section.
Assumption 1 (conditional source independence). $P(Y_1^*=1, Y_2^*=1 \mid Y=1, X=x, Z=z) = P(Y_1^*=1 \mid Y=1, X=x, Z=z)\, P(Y_2^*=1 \mid Y=1, X=x, Z=z)$ for all $x$ and $z$.
Assumption 1 means that we assume conditional independence of the probability of ascertainment between the two data sources, given $Y=1$ and observed characteristics $X$ and $Z$. This relaxation of the basic capture–recapture method’s marginal independence assumption, which has also been exploited by others (e.g., Das et al.,7 Alho,8 and Tilling and Sterne9), is akin to the conditional exchangeability assumption with respect to exposure, used for causal estimation in observational studies when adjusting for confounding by observed covariates. In practice, the covariates involved should correspond to factors that are correlated with ascertainment in both sources, such as geographical factors and health behaviors.
Assumption 2 (ascertainment overlap). $P(Y_1^*=1, Y_2^*=1 \mid Y^*=1, X=x, Z=z) > 0$ for all $x$ and $z$.
Assumption 2 means we assume a positive probability for ascertained cases to appear simultaneously in both data sources within all strata of $X$ and $Z$. This condition is analogous to the exposure positivity assumption and can be verified by examining whether at least some ascertained cases are observed in both sources within all covariate strata.
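In practice, this verification amounts to cross-tabulating dual-source ascertainment within covariate strata. A small sketch of such a check (toy records with hypothetical values, not from the paper):

```python
from collections import Counter

# Toy ascertained cases (Y* = 1), one tuple per case: (x, z, s1, s2),
# where s1/s2 indicate ascertainment in source 1/source 2.
cases = [
    (1, 1, 1, 1), (1, 1, 1, 0), (1, 0, 1, 1), (1, 0, 0, 1),
    (0, 1, 1, 1), (0, 1, 0, 1), (0, 0, 1, 1), (0, 0, 1, 0),
]

strata = {(x, z) for x, z, _, _ in cases}
both = Counter((x, z) for x, z, s1, s2 in cases if s1 == 1 and s2 == 1)

# Assumption 2 requires dual-ascertained cases in every stratum of X and Z.
for stratum in sorted(strata):
    print(stratum, "cases ascertained in both sources:", both[stratum])
assert all(both[s] > 0 for s in strata)
```

An empty overlap count in any stratum would signal that the capture–recapture ascertainment probability is not estimable there.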
If Assumptions 1 and 2 hold, Proposition 1 from Das et al.7 shows that
$$P(Y^*=1 \mid Y=1, X=x, Z=z) = \frac{P(Y_1^*=1, Y_2^*=1 \mid Y^*=1, X=x, Z=z)}{P(Y_1^*=1 \mid Y^*=1, X=x, Z=z)\, P(Y_2^*=1 \mid Y^*=1, X=x, Z=z)}. \tag{2}$$
The right-hand side of Equation (2), which reflects capture–recapture-derived ascertainment probabilities, depends solely on probabilities within the subpopulation with $Y^*=1$ (see Appendix Equations (A1–A3)). Consequently, it can be estimated from observable data. Substituting the right-hand side of Equation (2) into Equation (1), we have that
$$P(Y^x=1) = \sum_z \frac{P(Y^*=1, X=x, Z=z)\, P(Y_1^*=1 \mid Y^*=1, x, z)\, P(Y_2^*=1 \mid Y^*=1, x, z)}{P(Y_1^*=1, Y_2^*=1 \mid Y^*=1, x, z)\, P(X=x \mid Z=z)}. \tag{3}$$
This formulation allows us to apply APW using capture–recapture methods, providing a practical solution for handling under-ascertainment while estimating causal effects.
Estimation
Equation (3) provides an avenue for estimation using a plug-in estimator. In the Appendix (Equation (A5)), we also derive the following estimator to simplify estimation with individual-level data:
$$\hat{P}(Y^x=1) = \frac{1}{N} \sum_{i=1}^{N} \frac{\mathbb{1}\{Y_i^*=1, X_i=x\}\, \hat{P}(Y_1^*=1 \mid Y^*=1, X_i, Z_i)\, \hat{P}(Y_2^*=1 \mid Y^*=1, X_i, Z_i)}{\hat{P}(Y_1^*=1, Y_2^*=1 \mid Y^*=1, X_i, Z_i)\, \hat{P}(X=x \mid Z_i)}. \tag{4}$$
This estimation approach simplifies the process by avoiding the need to parametrically estimate the numerator. In Equation (4), $N$ represents the size of the study sample, and $\mathbb{1}\{Y_i^*=1, X_i=x\}$ is an indicator function coded as 1 for individuals with observed $Y_i^*=1$ and whose observed exposure is equal to $x$, and 0 otherwise. The remaining terms represent predicted probabilities for each individual $i$, which can be obtained either through nonparametric methods or parametric techniques, such as logistic regressions. Naturally, the consistency of the estimator hinges on these models being correctly specified.
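To make the estimator concrete, here is a minimal NumPy sketch of Equation (4). The function name and array-based interface are our own, and the predicted-probability arrays are assumed to come from previously fitted ascertainment and propensity models (nonparametric or logistic):

```python
import numpy as np

def apw_estimate(x, y1_star, y2_star, X, p1, p2, p12, ps_x):
    """Plug-in version of Equation (4) (hypothetical helper, not the
    authors' code).

    y1_star, y2_star : 0/1 ascertainment indicators from the two sources.
    X                : observed exposure for each individual.
    p1, p2, p12      : predicted P(Y1* = 1 | Y* = 1, X_i, Z_i), the analogue
                       for source 2, and for both sources jointly.
    ps_x             : predicted propensity score P(X = x | Z_i).
    """
    y_star = np.maximum(y1_star, y2_star)   # ascertained in at least one source
    ind = (y_star == 1) & (X == x)          # indicator {Y*_i = 1, X_i = x}
    # Capture-recapture ascertainment probability (Equation (2)) equals
    # p12 / (p1 * p2); weighting divides by it and by the propensity score.
    weights = (p1 * p2) / (p12 * ps_x)
    return float(np.mean(ind * weights))
```

Each individual's weight is the inverse of the product of their estimated ascertainment probability and propensity score, so the weighted average targets the potential outcome probability rather than the under-ascertained one.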
eAppendix 1; http://links.lww.com/EDE/C109 contains a simulation validating the estimator’s performance when all assumptions are met. The simulation also verifies that percentile-based bootstrap confidence intervals, derived from resampling the full study sample with replacement, can be used to quantify the uncertainty of APW estimates.
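The percentile bootstrap itself is straightforward to sketch. In the APW setting, the `estimator` callback below should rerun the full pipeline (propensity model, ascertainment models, and Equation (4)) on each resample; the helper itself is our own generic sketch, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)

def percentile_bootstrap_ci(data, estimator, n_boot=1000, alpha=0.05):
    """Percentile bootstrap CI: resample the full study sample (rows of a
    NumPy array) with replacement and re-estimate on each resample."""
    n = len(data)
    estimates = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample rows with replacement
        estimates.append(estimator(data[idx]))
    lo, hi = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```

Refitting all models within each resample is what lets the interval reflect uncertainty from both the ascertainment and the propensity estimation steps.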
Numerical Example
To aid intuition about the method, we first present a hypothetical example focusing on the association between seatbelt use and injury risk in car crashes. In eAppendix 2; http://links.lww.com/EDE/C109, we also provide code to generate similar data and show how the method can be applied with individual-level observations.
Our hypothetical data follow the causal model in the Figure. The observed outcome data are gathered from two sources: police records, $Y_1^*$, and hospital admissions, $Y_2^*$. The unit of observation is car crash events, where injury outcomes are imperfectly ascertained. In our data-generating process, confounding bias arises from drunk driving ($Z$), which is a common cause affecting both seatbelt use ($X$) and the true injury outcome ($Y$). Under-ascertainment bias occurs because both $X$ and $Z$ impact ascertainment in the two data sources. Specifically, seatbelt use affects injury likelihood and severity, which, in turn, influence ascertainment in both police and hospital records. Additionally, drunk driving may affect the likelihood of police ascertainment of injury outcomes for all involved parties, as well as influence injury and crash severity. Consequently, under-ascertainment varies across the four strata defined by $X$ and $Z$.
FIGURE.
Directed acyclic graph depicting the causal model used to generate the hypothetical example data. In our hypothetical example, $Y$ stands for car crash injury, $X$ is seatbelt use, and $Z$ is drunk driving. Our exposure of interest is $X$, and $Z$ is a confounding variable (a common cause of $X$ and $Y$). The true $Y$ is unobserved; instead, we observe under-ascertained outcomes from two sources, represented by $Y_1^*$ (police records) and $Y_2^*$ (hospital admissions data). Seatbelt use and drunk driving ($X$ and $Z$) each affect the probability of ascertainment in both sources, as depicted by the arrows from these variables into $Y_1^*$ and $Y_2^*$.
Table 1 contains conditional probabilities based on a scenario such as the one outlined above. Setting $P(Z=1)$ to 0.5 for simplicity, the potential outcome probabilities $P(Y^1=1)$ and $P(Y^0=1)$ in the population are 0.379 and 0.621, respectively. The true causal RD is therefore −0.242, and the causal RR is 0.61. Estimating these effects using the observed $Y^*$, we instead get −0.099 for the RD and 0.76 for the RR, which are both clearly biased. These estimates are already adjusted for confounding by $Z$, so the bias is solely related to differential ascertainment.
TABLE 1.
Conditional Probabilities From a Hypothetical Register-based Study Suffering From Outcome Under-ascertainment With Imperfect Outcome Data From Two Sources
| Conditional Probability | X = 1, Z = 1 | X = 1, Z = 0 | X = 0, Z = 1 | X = 0, Z = 0 |
|---|---|---|---|---|
| Outcomes | | | | |
| $P(Y=1 \mid x, z)$ a,b | 0.449 | 0.309 | 0.691 | 0.550 |
| $P(Y^*=1 \mid x, z)$ | 0.411 | 0.215 | 0.552 | 0.272 |
| $P(Y^*=1, X=x \mid Z=z)$ | 0.256 | 0.082 | 0.208 | 0.169 |
| Exposure propensities | | | | |
| $P(X=x \mid Z=z)$ | 0.623 | 0.379 | 0.377 | 0.621 |
| Ascertainment | | | | |
| $P(Y^*=1 \mid Y=1, x, z)$ a,c | 0.917 | 0.698 | 0.797 | 0.494 |
| $P(Y_1^*=1 \mid Y^*=1, x, z)$ | 0.751 | 0.648 | 0.688 | 0.627 |
| $P(Y_2^*=1 \mid Y^*=1, x, z)$ | 0.799 | 0.643 | 0.689 | 0.539 |
| $P(Y_1^*=1, Y_2^*=1 \mid Y^*=1, x, z)$ | 0.550 | 0.291 | 0.378 | 0.167 |
a These conditional probabilities (and their sample analogues) are presumed to be unobservable in our data setup but are displayed here for reference.
b $\sum_z P(Y=1 \mid x, z)\, P(Z=z)$ gives the marginal potential outcome probabilities. We apply this formula to compute the true RD and RR and use the same method on $P(Y^*=1 \mid x, z)$ to obtain the biased RD and RR presented in the text.
c True ascertainment probabilities. These are unobservable but estimable from observable data by capture–recapture methods using Equation (2). For instance, plugging the three observable ascertainment probabilities from the left-most column into the equation gives us 0.550/(0.751 × 0.799) ≈ 0.917, which is equivalent to the true ascertainment probability.
To address the remaining bias, we first apply Equation (3) to estimate $P(Y^1=1)$:
$$\hat{P}(Y^1=1) = \frac{0.5 \times 0.256 \times 0.751 \times 0.799}{0.550 \times 0.623} + \frac{0.5 \times 0.082 \times 0.648 \times 0.643}{0.291 \times 0.379} \approx 0.224 + 0.155 = 0.379. \tag{5}$$
The above expressions demonstrate how the observable data in Table 1 can be used to estimate the potential outcome probabilities. Applying the same steps for $x=0$, we get $\hat{P}(Y^0=1) = 0.621$. From these estimates, we obtain an RD of −0.242 and an RR of 0.61, which correspond to the true effects.
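The arithmetic above can be replayed directly from the Table 1 values, confirming that the capture–recapture-weighted sums reproduce the true effects:

```python
# Replaying the hypothetical example using the Table 1 values, with
# P(Z = 1) = 0.5. Each tuple holds, for a stratum (x, z):
# (P(Y*=1, X=x | Z=z), P(X=x | Z=z), and the three observable
# ascertainment probabilities among Y* = 1 cases: source 1, source 2, both).
table = {
    (1, 1): (0.256, 0.623, 0.751, 0.799, 0.550),
    (1, 0): (0.082, 0.379, 0.648, 0.643, 0.291),
    (0, 1): (0.208, 0.377, 0.688, 0.689, 0.378),
    (0, 0): (0.169, 0.621, 0.627, 0.539, 0.167),
}
p_z1 = 0.5

def potential_outcome(x):
    total = 0.0
    for z in (1, 0):
        joint, ps, p1, p2, p12 = table[(x, z)]
        pi = p12 / (p1 * p2)              # Equation (2): ascertainment probability
        pz = p_z1 if z == 1 else 1 - p_z1
        total += joint * pz / (pi * ps)   # stratum contribution to Equation (3)
    return total

p_hat_1, p_hat_0 = potential_outcome(1), potential_outcome(0)
print(round(p_hat_1, 3), round(p_hat_0, 3))                      # 0.379 0.621
print(round(p_hat_1 - p_hat_0, 3), round(p_hat_1 / p_hat_0, 2))  # -0.242 0.61
```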
Empirical Example
In this section, we apply our approach to real-world data from the Swedish COVID-19 Investigation for Future Insights, a Population Epidemiology Approach using Register Linkage (SCIFI-PEARL) project.29 The SCIFI-PEARL project is a comprehensive register-based study covering COVID-19-related outcomes, disease histories, and sociodemographic characteristics of the entire Swedish population. The project benefits from robust national register infrastructures and the ability to link data between registers using personal identification numbers with almost perfect accuracy.30,31 In this example, we focus on investigating variations in the propensity to get tested for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) across different population groups. Such variations can have implications for control strategies (e.g., contact tracing) as well as for interpreting endpoints that rely on individuals’ willingness to get tested, such as confirmed cases.32
The SCIFI-PEARL project was approved by the Swedish Ethical Review Authority (diary number 2020-01800).
Data Sources and Factors Assumed to Cause Under-ascertainment of Tests
Estimating testing propensities in Sweden is a challenge due to the lack of a nationally complete register on tests actually conducted that are linkable to population registers. In our example, we instead rely on data from two imperfect sources linked to the SCIFI-PEARL database: a complete register of all positive polymerase chain reaction (PCR) tests reported to the national infectious disease reporting system, SmiNet,33 and tests recorded in a database at Sahlgrenska University Hospital, the largest hospital in Sweden’s second most populated region (Västra Götaland).
Laboratories where PCR tests are analyzed are required by Swedish law to report all positive PCR tests to SmiNet. However, the database lacks information on negative and invalid tests and, therefore, provides an incomplete account of all tests that were conducted in the country. This means that ascertainment of tests in this data source is driven entirely by factors that influence test-positivity (i.e., the probability that a performed test is positive).
Our second data source encompasses all PCR tests, both negative and invalid, from Sahlgrenska University Hospital in Gothenburg, Sweden’s second-largest city. The inclusion of tests in this database hinges on the test’s analysis location, largely determined by the health care provider’s type and location. Sweden’s health care, funded by taxes and divided into 21 regions, allows both regional and contracted private providers to offer services. In Västra Götaland, about half of the primary health care centers are privately operated. The Sahlgrenska database logs tests from the university hospital, two neighboring hospitals in Kungälv and Alingsås, and an external lab serving regional outpatient care facilities. However, it lacks comprehensive data on tests from private providers, which often use a nonincluded lab. We estimate that the Sahlgrenska database holds roughly half of the region’s PCR tests, as about half of the positive tests we observe in SmiNet in the Västra Götaland population are also present in the Sahlgrenska database.
We argue that the observed tests from these two sources fit our definition of under-ascertainment for two reasons: (1) observing a PCR test in either source implies that the individual was indeed tested for COVID-19 ($Y_j^*=1$ implies $Y=1$), and (2) the absence of tests in either source does not necessarily mean that the individual was not tested, as these sources do not encompass all tests conducted in the region ($Y^*=0$ does not necessarily imply $Y=0$).
To ensure the validity of the capture–recapture-based ascertainment adjustments, we must consider factors affecting ascertainment of tests in both data sources. Translating Assumption 1 into our context implies that once a test is conducted ($Y=1$), it must be as good as random whether a positive test ends up in the Sahlgrenska database or not, conditional on the exposure and observed covariates. In our case, this means that we must consider factors influencing both test-positivity and an individual’s choice of testing location (particularly, their choice of private or public health care providers). Existing research supports the hypothesis that proximity to the health care center more strongly influences provider choice in Sweden than health care quality indicators.34 Hence, one might assume that area of residence or work is the primary driver of this choice. There were also notable spatial variations in infection rates,35 which may affect test-positivity rates. We therefore suspect that geographic factors may be important to adjust for, because they may influence test ascertainment in both data sources. More broadly, sociodemographic factors may also impact provider choice, infection risk, and individual vulnerability, so we will want to adjust for those as well.
Application of Ascertainment Probability Weighting
We perform the APW analysis following the steps outlined in the Box. Our target outcome, $Y$, is defined as “tested at least once between 1 July and 26 December 2020.” We selected this period because it reflects a time when community testing was available, but largely before COVID-19 antigen self-test kits were commonly used and prior to the Swedish vaccination campaign (which started 27 December 2020). During this time, PCR testing therefore played an essential role in the public’s response to the pandemic. In our dataset, we code any individual with an observed test in one of our two data sources during this period as $Y^*=1$, and $Y^*=0$ otherwise.
Our example compares testing propensities between two exposure groups, health care workers ($X=1$) and other essential occupations ($X=0$; a composite group including teachers, social care workers, service sector workers, postal/delivery, transport services workers, police/security, and cleaners). Our occupation data come from the Longitudinal Integrated Database for Health Insurance and Labor Market Studies (LISA), and we used four-digit codes from the Swedish Standard Occupational Classification (SSYK2012) to classify occupations (see eTable 1; http://links.lww.com/EDE/C109 for details). Our study population comprises the entire working-age population (20–64 years) working in essential occupations and residing in Västra Götaland at the end of 2019 (N = 261,314).
We use logistic regression to estimate the exposure propensity score and to compute the three “source ascertainment models” required to compute Equation (4) (see Box). To adjust for confounding and help satisfy the conditional source independence assumption (Assumption 1), we include a set of covariates ($Z$) in each model. Based on the rationale in the previous section, we include adjustment for age (5-year groups), sex, health care worker (yes/no), disposable income in national quartiles, country of birth (Sweden or abroad; where abroad is classified into low income, lower-middle income, middle-high income, and high income countries based on the World Bank’s classification in 2020), educational attainment (primary, secondary, or tertiary), marital status (married, unmarried), number of children living at home (0, 1, 2, 3, 4+), household type (living alone, single parent, or living with partner), if the person scored one or higher on the Charlson Comorbidity Index before the pandemic, and municipality of residence fixed effects (49 municipalities in the region). Each variable is based on data from Swedish registers linked to the SCIFI-PEARL database; details and data sources are provided in eTable 2; http://links.lww.com/EDE/C109.
RESULTS
Compared to ignoring the under-ascertainment of tests in our data, our APW-based estimates show substantially higher testing propensities in health care workers and other essential occupations (Table 2).
TABLE 2.
Estimates of the Association Between Health Care Work and the Propensity to Get Tested for COVID-19 in the Working-age Population (20–64 Years) of Västra Götaland, Sweden (N = 261,314) Working in Essential Occupations, With and Without APW Using Testing Data From Two Incomplete Data Sources
| Estimate | Health Care Workers, % (n = 76,033) a | Other Essential Workers, % (n = 185,281) a | Difference | Ratio |
|---|---|---|---|---|
| Unadjusted | 46.1 (45.7, 46.4) | 23.2 (23.0, 23.4) | 22.9 (22.5, 23.3) | 1.99 (1.96, 2.01) |
| IPW b | 45.8 (45.4, 46.2) | 23.8 (23.6, 24.0) | 22.0 (21.6, 22.5) | 1.92 (1.90, 1.95) |
| APW b | 60.7 (59.9, 62.3) | 38.5 (37.9, 39.6) | 22.2 (21.0, 23.6) | 1.58 (1.54, 1.62) |
Confidence intervals (95% CIs), which are presented in parentheses, are based on a nonparametric percentile bootstrap with 1000 resamples (resampling the entire study sample with replacement).
a Binary outcome (testing propensity) reflecting if the individual underwent PCR testing for COVID-19 at least once between 1 July and 26 December 2020.
b Confounding adjustment includes sociodemographic factors (see text for details). Under-ascertainment adjustment includes the same factors and the exposure variable. The IPW estimate includes only confounding adjustment, while the APW estimate includes both confounding and under-ascertainment adjustment.
While absolute differences in testing propensities between the two occupational groups are relatively unaffected in this case, applying the APW estimator substantially impacts the estimated relative differences, that is, testing propensity ratios (Table 2). Specifically, the unadjusted testing propensity ratio estimate between health care workers and other essential occupations changes from 1.99 to 1.92 with conventional IPW adjustment for confounding, and further reduces to 1.58 with APW adjustment for under-ascertainment. Additionally, the 95% confidence intervals for the IPW and APW estimates do not overlap.
These findings emphasize the importance of considering potential under-ascertainment in register-based studies, as it may impact estimated differences on an absolute or, as in this example, relative scale. Ultimately, all estimates, including the APW-adjusted estimate, suggest that health care workers were more likely to be tested for COVID-19 than workers in other essential occupations.
DISCUSSION
We have introduced a novel framework to adjust for both under-ascertainment and confounding, addressing a common challenge, especially for studies investigating sensitive outcomes, such as drug use, and studies relying on administrative data from resource-limited settings. Our contribution relates to misclassification,3,36 which is a well-established problem in epidemiology that has gained renewed attention in recent years.15–25 Importantly, we extend IPW methods for causal inference to account for incomplete ascertainment of outcomes through capture–recapture estimation, which to our knowledge has not previously been explored.
Our APW estimator corrects for under-ascertainment using two imperfect data sources. While existing methods can be used to account for more general classes of misclassification,15–25 including settings with false positives, they require perfect validation data, which is not always available. Unlike sensitivity or simulation-based methods dependent on external data or assumptions,37,38 methods that rely on internal data, including APW, ensure that adjustments are relevant to the study sample.15,19 Hence, this method provides a contextually relevant way to adjust for under-ascertainment bias without requiring perfect validation data.
Every analytical method, including APW, is subject to limitations and requires a thorough evaluation of its assumptions. The method’s exposure propensity aspect is subject to the same limitations as traditional propensity score weighting, including that all confounding must be adjustable by observable factors and the need for sufficient overlap in covariate distributions between exposure groups (see, e.g., Austin and Stuart28).
The additional considerations required are instead closely related to considerations for capture–recapture estimation. Like other capture–recapture techniques, APW assumes a minimal rate of false positives, as their presence would lead to overestimated outcome probabilities.39 The extent of this bias depends on the false positive rate and the difference in this rate across exposure groups. Register validation studies often highlight under-ascertainment as a primary concern in administrative data compared to false positives. Systematic reviews suggest that specificities usually range from 90% to 100% in administrative registers,40–46 suggesting that the false positive rate may be reasonably low in many practical contexts. Nonetheless, it is still important to carefully evaluate this assumption in each application. In our empirical example, for instance, there is no complex diagnostic process involved. We therefore deem it safe to assume that individuals observed with a positive test in SmiNet or with a test in the Sahlgrenska database were indeed tested (unless there are mistakes in data linkage or recording).
On the other hand, if a subset of true positives systematically evades detection in both data sources, capture–recapture methods may underestimate the true outcome probabilities. For instance, if both data sources exclusively include hospital admissions data, generalizing findings to nonhospitalized cases would be inappropriate. However, in such settings, inferences can still be drawn about ascertainable cases, though caution is required when making conclusions about subpopulations or types of outcomes not ascertained in any data source.
The conditional source independence assumption (Assumption 1) should be critically evaluated in every application. Similar to confounding adjustment, this can be done using contextual, empirical, and theoretical knowledge, possibly incorporated into causal diagrams.47 To see how this can be done, note that Assumption 1 implies that adjustment variables should be selected so that they block any open path between $Y_1^*$ and $Y_2^*$, after conditioning on $Y=1$. For example, the Figure suggests that $X$ and $Z$ must be included in the models predicting ascertainment probabilities to block all open paths and enable valid estimation. For reference, Brenner48 also provides general guidance on the effects of violations of Assumption 1, namely that positive ascertainment covariance leads to an underestimation of outcome probabilities, whereas negative ascertainment covariance leads to overestimation. Whether the estimated exposure effects are also biased depends on the direction and magnitude of these covariances in each exposure group and the degree to which they differ. A similar concern arises when one source’s ascertainment influences the other (e.g., depicted by an arrow $Y_1^* \to Y_2^*$ in a causal diagram), which can be difficult to adjust away.49 In our example, all laboratories, regardless of their affiliation with the Sahlgrenska database, must report positive test results to SmiNet. This arrangement minimizes additional ascertainment pathways, which should only occur through test-positivity. However, Assumption 1 could still be compromised if systematic reporting biases exist, for example, if Sahlgrenska database labs are more prone to report positive tests to SmiNet.
Finally, our paper leaves room for several possible refinements to the APW framework, such as accommodating misclassified exposures, integrating with design-based sampling techniques to handle false positives,50 and expanding it to multisource capture–recapture estimation, which may offer additional possibilities to account for source dependence.10 Further analysis of the statistical properties of alternative estimators for the method, including finite sample performance under different scenarios, would also be of interest.
PROPOSED STEP-BY-STEP PROCESS TO COMPUTE THE CAPTURE–RECAPTURE-BASED ASCERTAINMENT PROBABILITY WEIGHTING ESTIMATOR.
Step 1. Define the target outcome ($Y$) and its under-ascertained counterpart ($Y^*$).
Step 2. Define the exposure of interest ($X$) and estimate a model for the propensity score for exposure, $P(X=x \mid Z=z)$, controlling for observed confounders ($Z$). Predict the propensity score for exposure level $x$ (e.g., $x=1$) for all individuals in the data.
Step 3. Using data only from the subset of individuals with $Y^*=1$, estimate one model each for the following conditional probabilities, controlling for the exposure ($X$) and covariates ($Z$) assumed to influence ascertainment in both data sources:
the probability of observing the outcome in source 1 ($P(Y_1^*=1 \mid Y^*=1, X, Z)$),
the probability of observing the outcome in source 2 ($P(Y_2^*=1 \mid Y^*=1, X, Z)$),
the probability of observing the outcome in both sources ($P(Y_1^*=1, Y_2^*=1 \mid Y^*=1, X, Z)$).
Step 4. Use the predicted probabilities from the models in Step 3 to estimate ascertainment probabilities for all individuals in the data (including those with $Y^*=0$) using Equation (2).
Step 5. Use Equation (4) to estimate the potential outcome probability for exposure level $x$ ($\hat{P}(Y^x=1)$), for example, $x=1$ (the “exposed group”).
Step 6. Repeat Step 5 for the other exposure level, for example, $x=0$ (the “unexposed group”).
Step 7. Use the estimated potential outcome probabilities to compute effect estimates on a desired scale (e.g., RD, RR).
Step 8. Repeat Steps 2–7 on M (e.g., 1,000) bootstrap resamples of the data (resample the entire study sample with replacement) to compute percentile-bootstrap confidence intervals.
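As a sanity check of the workflow, the Box’s steps can be run end-to-end on simulated data. The sketch below (all data-generating numbers are hypothetical) uses simple stratified frequencies in place of logistic regressions, which the framework also permits for a single discrete covariate:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated population with one binary covariate Z (hypothetical numbers).
N = 500_000
Z = rng.binomial(1, 0.5, N)
X = rng.binomial(1, np.where(Z == 1, 0.6, 0.4))       # exposure depends on Z
Y = rng.binomial(1, 0.3 + 0.2 * X + 0.1 * Z)          # true risk difference: 0.2
S1 = Y * rng.binomial(1, 0.5 + 0.1 * Z, N)            # source 1 ascertainment
S2 = Y * rng.binomial(1, 0.6 + 0.1 * X, N)            # source 2 ascertainment
Y_star = np.maximum(S1, S2)                           # Step 1: observed outcome

def cond_mean(event, cond):
    return event[cond].mean()                         # empirical frequency

est = {}
for x in (0, 1):
    total = 0.0
    for z in (0, 1):
        asc = (Y_star == 1) & (X == x) & (Z == z)     # Step 3: Y* = 1 subset
        p1 = cond_mean(S1 == 1, asc)
        p2 = cond_mean(S2 == 1, asc)
        p12 = cond_mean((S1 == 1) & (S2 == 1), asc)
        pi = p12 / (p1 * p2)                          # Step 4: Equation (2)
        ps = cond_mean(X == x, Z == z)                # Step 2: propensity score
        joint = ((Y_star == 1) & (X == x) & (Z == z)).mean()
        total += joint / (pi * ps)                    # Steps 5-6: Equation (4)
    est[x] = total

print(est[1] - est[0])  # close to the true risk difference of 0.2 (Step 7)
```

Here the two sources are conditionally independent given $Y=1$, $X$, and $Z$ by construction, so the stratified capture–recapture correction recovers the true potential outcome probabilities despite both sources being incomplete.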
Supplementary Material
APPENDIX
First, we give the proof of Equation (2). For any source $j$, including the product of multiple sources, we have that

$$\begin{aligned} P(Y_j^*=1 \mid Y=1, x, z) &= P(Y_j^*=1, Y^*=1 \mid Y=1, x, z) \\ &= P(Y_j^*=1 \mid Y^*=1, Y=1, x, z)\, P(Y^*=1 \mid Y=1, x, z) \\ &= P(Y_j^*=1 \mid Y^*=1, x, z)\, P(Y^*=1 \mid Y=1, x, z), \end{aligned} \tag{A1}$$

where the first equality follows from the fact that $Y_j^*=1$ implies $Y^*=1$, the second from the product rule, and the third because $Y^*=1$ implies $Y=1$.
By Assumption 1,

$$P(Y_1^*=1, Y_2^*=1 \mid Y=1, x, z) = P(Y_1^*=1 \mid Y=1, x, z)\, P(Y_2^*=1 \mid Y=1, x, z). \tag{A2}$$
It follows from these expressions that

$$P(Y_1^*=1, Y_2^*=1 \mid Y^*=1, x, z)\, P(Y^*=1 \mid Y=1, x, z) = P(Y_1^*=1 \mid Y^*=1, x, z)\, P(Y_2^*=1 \mid Y^*=1, x, z)\, P(Y^*=1 \mid Y=1, x, z)^2. \tag{A3}$$
Canceling common factors by dividing both sides of (A3) by $P(Y^*=1 \mid Y=1, x, z)$ and then solving for the remaining $P(Y^*=1 \mid Y=1, x, z)$, which requires Assumption 2, gives Equation (2).
Next, we give the proof of Equations (1) and (3). For discrete , we have that
P(Y^x = 1)
 = Σ_z P(Y^x = 1 | Z = z) P(Z = z)
 = Σ_z P(Y^x = 1 | X = x, Z = z) P(Z = z)
 = Σ_z P(Y = 1 | X = x, Z = z) P(Z = z)
 = Σ_z [P(Y = 1 | X = x, Z = z) P(X = x | Z = z) ψ(x, z) / (P(X = x | Z = z) ψ(x, z))] P(Z = z)
 = Σ_z [P(Y* = 1, Y = 1, X = x | Z = z) / (P(X = x | Z = z) ψ(x, z))] P(Z = z)
 = Σ_z [P(Y* = 1, X = x | Z = z) / (P(X = x | Z = z) ψ(x, z))] P(Z = z)
 = Σ_z [P(Y* = 1, X = x | Z = z) p1(x, z) p2(x, z) / (P(X = x | Z = z) p12(x, z))] P(Z = z)
 = Σ_z P(Y* = 1, X = x | Z = z) w(x, z) P(Z = z), (A4)
with ψ(x, z) = P(Y* = 1 | Y = 1, X = x, Z = z), with p1, p2, and p12 as in Equation (2) (here conditioned on X = x and Z = z), and with w(x, z) = p1(x, z) p2(x, z)/[P(X = x | Z = z) p12(x, z)],
where the first equality follows from the law of total probability; the second from conditional independence of the exposure and potential outcomes; the third from counterfactual consistency; the fourth by introducing common terms to the numerator and denominator, which requires exposure and ascertainment positivity; the fifth by the chain rule; the sixth because Y* = 1 implies Y = 1 by the definition of Y*; and the seventh by Equation (2). The final expression redefines terms (the weight w) to simplify notation in the derivations below. For continuous Z, the summation in (A4) can be replaced with integration.
While the final expression in (A4) could in principle be estimated using a plug-in estimator, we rewrite the estimator to avoid having to estimate the numerator parametrically:
Σ_z P(Y* = 1, X = x | Z = z) w(x, z) P(Z = z)
 = Σ_z [(1/N_z) Σ_{i: Z_i = z} I(X_i = x) Y_i*] w(x, z) (N_z/N)
 = (1/N) Σ_i I(X_i = x) Y_i* w(x, Z_i)
 = (1/N) Σ_i I(X_i = x) Y_i* w(X_i, Z_i), (A5)
where N_z is the z-specific stratum size. In the final expression, we replace w(x, Z_i) with w(X_i, Z_i). While our target quantity is defined for a fixed x, the observed X_i can be used with equivalent results when estimating with the expressions in (A5) that include I(X_i = x), because the I(X_i = x) terms will always evaluate to zero for individuals with X_i ≠ x. Thus, one can either predict w(x, Z_i) for each x separately or, more conveniently, use w(X_i, Z_i) for all individuals.
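The zero-weight argument at the end of the appendix can be demonstrated directly: evaluating the denominator at each individual's observed exposure rather than at the fixed level x changes nothing, because the indicator term removes every individual for whom the two versions differ. A small Python sketch (our own illustration; the data-generating probabilities and the function `psi` are arbitrary stand-ins for fitted ascertainment probabilities):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000
z = rng.binomial(1, 0.5, n)                        # covariate
x = rng.binomial(1, 0.4 + 0.2 * z)                 # exposure
y_star = rng.binomial(1, 0.1 + 0.1 * x + 0.1 * z)  # any 0/1 outcome indicator

def psi(x_level, z):
    # Arbitrary stand-in for predicted ascertainment probabilities psi(x, z);
    # note that it genuinely differs between x = 0 and x = 1.
    return 0.6 + 0.1 * x_level + 0.05 * z

e1 = 0.4 + 0.2 * z           # P(X = 1 | Z), known here by construction
i1 = (x == 1).astype(float)  # the indicator I(X = x) for x = 1

est_fixed = np.mean(i1 * y_star / (e1 * psi(1, z)))  # psi evaluated at fixed x = 1
est_obs = np.mean(i1 * y_star / (e1 * psi(x, z)))    # psi evaluated at observed X

# Identical: wherever the two psi versions differ (X = 0), the indicator
# makes the corresponding term exactly zero.
assert est_fixed == est_obs
```

The equality is exact, not approximate: the two weight vectors differ only at positions where the indicator is zero, so the weighted sums agree term by term.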
Footnotes
The results reported herein correspond to specific aims of a grant held by C.B. and A.N. from the Swedish Research Council for Health, Working Life and Welfare (Forte; grant number 2020-00962). J.B. was also supported by the Swedish Research Council (VR; grant numbers 2019-00198 and 2021-04665), Sweden’s Innovation Agency (Vinnova; grant number 2021-02648), and by internal grants for thematic collaboration initiatives at Lund University. M.G. was also supported by the Swedish state, under an agreement between the Swedish government and the county councils (ALF agreement ALFGBG-965885), by SciLifeLab from the Knut and Alice Wallenberg Foundation (2020.0182 & 2020.0241), by the Swedish Research Council (2021-05405 & 2021-06545), and by King Gustaf V’s and Queen Victoria’s Foundation. The SCIFI-PEARL project also received additional funding from the ALF agreement (Avtal om Läkarutbildning och Forskning/Medical Training and Research Agreement) (grant numbers 938453 and 971130) and FORMAS (Forskningsrådet för miljö, areella näringar och samhällsbyggande/Research Council for Environment, Agricultural Sciences and Spatial Planning), also known as the Swedish Research Council for Sustainable Development (grant number 2020-02828). The funders played no role in the design of the study, data collection or analysis, decision to publish, or preparation of the manuscript.
Disclosure: M.G. has received research grants from Gilead Sciences and Janssen-Cilag and honoraria as speaker, DSMB committee member, and/or scientific advisor from Amgen, AstraZeneca, Biogen, Bristol-Myers Squibb, Gilead Sciences, GlaxoSmithKline/ViiV, Janssen-Cilag, MSD, Novocure, Novo Nordisk, Pfizer, and Sanofi. N.H. is a scientific advisor at Sobi. The other authors have no conflicts to report.
Supplemental digital content is available through direct URL citations in the HTML and PDF versions of this article (www.epidem.com).
J.B. and F.N. have contributed equally to this work.
R code to generate simulated data and apply our method is available in the eAppendix, but processing data from the empirical example requires ethical approval from the Swedish Ethical Review Authority and adherence to the European General Data Protection Regulation (GDPR). The SCIFI-PEARL project has received ethical approval (dnr 2020-01800), but external researchers wanting data access must first seek and receive their own approval from the Swedish Ethical Review Authority (www.etikprovningsmyndigheten.se/en/).
REFERENCES
- 1. Thygesen LC, Ersbøll AK. When the entire population is the sample: strengths and limitations in register-based epidemiology. Eur J Epidemiol. 2014;29:551–558.
- 2. Gibbons CL, Mangen MJJ, Plass D, et al.; Burden of Communicable Diseases in Europe (BCoDE) Consortium. Measuring underreporting and under-ascertainment in infectious disease datasets: a comparison of methods. BMC Public Health. 2014;14:147.
- 3. Greenland S. Basic methods for sensitivity analysis of biases. Int J Epidemiol. 1996;25:1107–1116.
- 4. Bernatsky S, Joseph L, Bélisle P, et al. Bayesian modelling of imperfect ascertainment methods in cancer studies. Stat Med. 2005;24:2365–2379.
- 5. Bird SM, King R. Multiple systems estimation (or capture–recapture estimation) to inform public policy. Annu Rev Stat Appl. 2018;5:95–118.
- 6. Huggins R, Hwang WH. A review of the use of conditional likelihood in capture–recapture experiments. Int Stat Rev. 2011;79:385–400.
- 7. Das M, Kennedy EH, Jewell NP. Doubly robust capture–recapture methods for estimating population size [published online ahead of print April 12, 2023]. J Am Stat Assoc. 2023. doi:10.1080/01621459.2023.2187814.
- 8. Alho JM. Logistic regression in capture–recapture models. Biometrics. 1990;46:623–635.
- 9. Tilling K, Sterne JAC. Capture–recapture models including covariate effects. Am J Epidemiol. 1999;149:392–400.
- 10. Chao A, Tsay PK, Lin SH, Shau WY, Chao DY. The applications of capture–recapture models to epidemiological data. Stat Med. 2001;20:3123–3157.
- 11. Mastro TD, Kitayaporn D, Weniger BG, et al. Estimating the number of HIV-infected injection drug users in Bangkok: a capture–recapture method. Am J Public Health. 1994;84:1094–1099.
- 12. Bales K, Hesketh O, Silverman B. Modern slavery in the UK: how many victims? Significance. 2015;12:16–21.
- 13. Fisher N, Turner SW, Pugh R, Taylor C. Estimated numbers of homeless and homeless mentally ill people in north east Westminster by using capture–recapture analysis. BMJ. 1994;308:27–30.
- 14. Tilling K. Capture–recapture methods—useful or misleading? Int J Epidemiol. 2001;30:12–14.
- 15. Gravel CA, Filion KB, Reynier PM, Platt RW. Postmyocardial infarction statin exposure and the risk of stroke with weighting for outcome misclassification. Epidemiology. 2020;31:880–888.
- 16. Gravel CA, Farrell PJ, Krewski D. Conditional validation sampling for consistent risk estimation with binary outcome data subject to misclassification. Pharmacoepidemiol Drug Saf. 2019;28:227–233.
- 17. Gravel CA, Platt RW. Weighted estimation for confounded binary outcomes subject to misclassification. Stat Med. 2018;37:425–436.
- 18. Edwards JK, Cole SR, Troester MA, Richardson DB. Accounting for misclassified outcomes in binary regression models using multiple imputation with internal validation data. Am J Epidemiol. 2013;177:904–912.
- 19. Lyles RH, Tang L, Superak HM, et al. Validation data-based adjustments for outcome misclassification in logistic regression: an illustration. Epidemiology. 2011;22:589–597.
- 20. Tang L, Lyles RH, King CC, Celentano DD, Lo Y. Binary regression with differentially misclassified response and exposure variables. Stat Med. 2015;34:1605–1620.
- 21. Tang L, Lyles RH, Ye Y, Lo Y, King CC. Extended matrix and inverse matrix methods utilizing internal validation data when both disease and exposure status are misclassified. Epidemiol Methods. 2013;2:49–66.
- 22. Penning de Vries BB, van Smeden M, Groenwold RH. A weighting method for simultaneous adjustment for confounding and joint exposure-outcome misclassifications. Stat Methods Med Res. 2021;30:473–487.
- 23. Shu D, Yi GY. Causal inference with noisy data: bias analysis and estimation approaches to simultaneously addressing missingness and misclassification in binary outcomes. Stat Med. 2020;39:456–468.
- 24. Shu D, Yi GY. Weighted causal inference methods with mismeasured covariates and misclassified outcomes. Stat Med. 2019;38:1835–1854.
- 25. Edwards JK, Cole SR, Fox MP. Flexibly accounting for exposure misclassification with external validation data. Am J Epidemiol. 2020;189:850–860.
- 26. Li F, Dong S, Leier A, et al. Positive-unlabeled learning in bioinformatics and computational biology: a brief review. Brief Bioinform. 2022;23:bbab461.
- 27. Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol. 1974;66:688–701.
- 28. Austin PC, Stuart EA. Moving towards best practice when using inverse probability of treatment weighting (IPTW) using the propensity score to estimate causal treatment effects in observational studies. Stat Med. 2015;34:3661–3679.
- 29. Nyberg F, Franzén S, Lindh M, et al. Swedish COVID-19 investigation for future insights – a population epidemiology approach using register linkage (SCIFI-PEARL). Clin Epidemiol. 2021;13:649–659.
- 30. Ludvigsson JF, Otterblad-Olausson P, Pettersson BU, Ekbom A. The Swedish personal identity number: possibilities and pitfalls in healthcare and medical research. Eur J Epidemiol. 2009;24:659–667.
- 31. Ludvigsson JF, Almqvist C, Bonamy AKE, et al. Registers of the Swedish total population and their use in medical research. Eur J Epidemiol. 2016;31:125–136.
- 32. Griffith GJ, Morris TT, Tudball MJ, et al. Collider bias undermines our understanding of COVID-19 disease risk and severity. Nat Commun. 2020;11:5749.
- 33. Rolfhamre P, Janson A, Arneborn M, Ekdahl K. SmiNet-2: description of an internet-based surveillance system for communicable diseases in Sweden. Euro Surveill. 2006;11:15–16.
- 34. Dahlgren C, Dackehag M, Wändell P, Rehnberg C. Simply the best? The impact of quality on choice of primary healthcare provider in Sweden. Health Policy. 2021;125:1448–1454.
- 35. Parkes B, Stafoggia M, Fecht D, et al. Community factors and excess mortality in the COVID-19 pandemic in England, Italy and Sweden. Eur J Public Health. 2023;33:695–703.
- 36. Copeland KT, Checkoway H, McMichael AJ, Holbrook RH. Bias due to misclassification in the estimation of relative risk. Am J Epidemiol. 1977;105:488–495.
- 37. Magder LS, Hughes JP. Logistic regression when the outcome is measured with uncertainty. Am J Epidemiol. 1997;146:195–203.
- 38. Gilbert R, Martin RM, Donovan J, et al. Misclassification of outcome in case–control studies: methods for sensitivity analysis. Stat Methods Med Res. 2016;25:2377–2393.
- 39. Ramos PL, Sousa I, Santana R, et al. A review of capture–recapture methods and its possibilities in ophthalmology and vision sciences. Ophthalmic Epidemiol. 2020;27:310–324.
- 40. McCormick N, Lacaille D, Bhole V, Avina-Zubieta JA. Validity of heart failure diagnoses in administrative databases: a systematic review and meta-analysis. PLoS One. 2014;9:e104519.
- 41. McCormick N, Lacaille D, Bhole V, Avina-Zubieta JA. Validity of myocardial infarction diagnoses in administrative databases: a systematic review. PLoS One. 2014;9:e92286.
- 42. Mbizvo GK, Bennett KH, Schnier C, Simpson CR, Duncan SE, Chin RFM. The accuracy of using administrative healthcare data to identify epilepsy cases: a systematic review of validation studies. Epilepsia. 2020;61:1319–1335.
- 43. Abraha I, Montedori A, Serraino D, et al. Accuracy of administrative databases in detecting primary breast cancer diagnoses: a systematic review. BMJ Open. 2018;8:e019264.
- 44. Samadoulougou S, Idzerda L, Dault R, Lebel A, Cloutier AM, Vanasse A. Validated methods for identifying individuals with obesity in health care administrative databases: a systematic review. Obes Sci Pract. 2020;6:677–693.
- 45. Leong A, Dasgupta K, Bernatsky S, Lacaille D, Avina-Zubieta A, Rahme E. Systematic review and meta-analysis of validation studies on a diabetes case definition from health administrative records. PLoS One. 2013;8:e75256.
- 46. Byrne N, Regan C, Howard L. Administrative registers in psychiatric research: a systematic review of validity studies. Acta Psychiatr Scand. 2005;112:409–414.
- 47. Tennant PWG, Murray EJ, Arnold KF, et al. Use of directed acyclic graphs (DAGs) to identify confounders in applied health research: review and recommendations. Int J Epidemiol. 2021;50:620–632.
- 48. Brenner H. Use and limitations of the capture–recapture method in disease monitoring with two dependent sources. Epidemiology. 1995;6:42–48.
- 49. Jones HE, Hickman M, Welton NJ, De Angelis D, Harris RJ, Ades AE. Recapture or precapture? Fallibility of standard capture–recapture methods in the presence of referrals between sources. Am J Epidemiol. 2014;179:1383–1393.
- 50. Ge L, Zhang Y, Ward KC, Lash TL, Waller LA, Lyles RH. Tailoring capture–recapture methods to estimate registry-based case counts based on error-prone diagnostic signals. Stat Med. 2023;42:2928–2943.

