Abstract
The application of serial principled sampling designs for diagnostic testing is often viewed as an ideal approach to monitoring prevalence and case counts of infectious or chronic diseases. Considering logistics and the need for timeliness and conservation of resources, surveillance efforts can generally benefit from creative designs and accompanying statistical methods to improve the precision of sampling-based estimates and reduce the size of the necessary sample. One option is to augment the analysis with available data from other surveillance streams that identify cases from the population of interest over the same timeframe, but may do so in a highly nonrepresentative manner. We consider monitoring a closed population (e.g., a long-term care facility, patient registry, or community), and encourage the use of capture–recapture methodology to produce an alternative case total estimate to the one obtained by principled sampling. With care in its implementation, even a relatively small simple or stratified random sample not only provides its own valid estimate, but provides the only fully defensible means of justifying a second estimate based on classical capture–recapture methods. We initially propose weighted averaging of the two estimators to achieve greater precision than can be obtained using either alone, and then show how a novel single capture–recapture estimator provides a unified and preferable alternative. We develop a variant on a Dirichlet-multinomial-based credible interval to accompany our hybrid design-based case count estimates, with a view toward improved coverage properties. Finally, we demonstrate the benefits of the approach through simulations designed to mimic an acute infectious disease daily monitoring program or an annual surveillance program to quantify new cases within a fixed patient registry.
Keywords: Bias, Efficiency, Inverse-variance weights, Sampling, Stratification, Surveillance
Statement of Significance
Ideally, disease surveillance to monitor prevalence and case counts in closed populations relies on principled (e.g., simple or stratified) random sampling. Registry-based studies permit such a sample, but resource considerations often limit its feasible size. However, it is common for existing data streams to be in place, whereby cases are identified preferentially. We examine the benefits of a single principled sample, even if small, carefully administered to be agnostic with respect to such existing streams. This sampling-based “anchor stream” allows all available information to be efficiently leveraged through capture–recapture methods, and leads to remarkably precise novel estimators of the case count that remain valid no matter what the operational associations may be among the existing data sources.
1. INTRODUCTION
Capture–recapture (CRC) methods have a long and rich history, having derived original motivation through tag and release experiments in ecological settings (Petersen 1896; Lincoln 1930; Schnabel 1938) and subsequently spawned multiple texts devoted to design and statistical considerations (e.g., Seber 1982; Krebs 1998; Borchers, Buckland, and Zucchini 2002). Applications of the approach have more recently become increasingly common in social science and epidemiological contexts, when the goal is to quantify human populations (e.g., Böhning, van der Heijden and Bunge 2018). Fundamentally, the statistical problem in CRC arises due to missing count data when studying the overlap of multiple capture or surveillance efforts, where T ≥ 2 such efforts result in 2^T possible capture profiles. Because the number of units in the final profile category (not captured in any of the T efforts) is unobserved, one must generally make unverifiable assumptions in order to infer an estimate of the total population size based on the CRC paradigm.
Within the broader epidemiological context, applications of CRC methods have been utilized in conjunction with efforts to estimate the prevalence of activities such as injection drug use (e.g., Hickman, Higgins, Hope, Bellis, Tilling, et al. 2004; Platt, Hickman, Rhodes, Mikhailova, Karavashkin, et al. 2004), to monitor chronic diseases like cancer (e.g., McClish and Penberthy 2004; Parkin and Bray 2009), or in the surveillance of infectious diseases such as HIV (e.g., Poorolajal, Mohammadi, and Farzinara 2017), hepatitis C (e.g., Wu, Chang, McNutt, and Smith 2005), and tuberculosis (World Health Organization 2012). One important weakness of CRC analyses, however, is that practitioners often derive estimates based on statistical models made accessible through available software (e.g., Baillargeon and Rivest 2007), without a clear and complete understanding of the underlying assumptions and/or the ramifications of violations of those assumptions (e.g., Agresti 1994; Hook and Regal 1995; Jones, Hickman, Welton, De Angelis, Harris, et al. 2014; Lyles, Wilkinson, Williamson, Chen, Taylor, et al. 2021). The result is that almost all CRC estimates of population totals are likely biased and/or tend to be accompanied by overly optimistic measures of precision. A recent note (Zhang and Small 2020) illustrates this point in the case of monitoring COVID-19 case counts, by emphasizing the need for sensitivity analyses to accompany model-based CRC estimates.
In this report, we explore a unique setting in which a CRC estimator and its accompanying measure of precision can be fully justified in epidemiologic practice. This setting requires that one of the disease surveillance streams is implemented by design as a random or representative sample of the population, and is conducted agnostically with respect to another available data stream that one seeks to use to identify cases from the same population and over the same timeframe. We refer to such a random or representative surveillance effort as an anchor stream. By “agnostically,” we mean independently, in the sense that identification (or not) by the anchor stream does not influence and is not influenced by identification (or not) by at least one other stream. Importantly, this second (non-anchor) stream can be highly nonrandom in terms of how it identifies cases from the population and we make no assumptions about the mechanisms behind that selection process. It should be noted that many sources (e.g., Seber 1982; Chao, Tsay, Lin, Shau, and Chao 2001; van der Heijden, Whittaker, Cruyff, Bakker, and van der Vliet 2012) have acknowledged that the presence of a nonrandom sample taken independently of a second sample reflecting homogeneous selection probabilities provides justification for the use of classical CRC estimators. However, as we will see, these estimators do not leverage the full statistical precision unlocked by design, when and if an anchor stream sample can be definitively obtained.
In practice, a true anchor stream can only be fully realized by a principled sampling effort. In most cases this would require that the population of interest be enumerated and accessible, whether in person or through records abstraction, so that a simple or stratified random sample can be drawn for disease status assessment. As such, there will already be a fully defensible sampling-based estimate of the case count in the population. In this report, our primary goal is to leverage the additional precision obtained by making full use of data from a second, nonrepresentative, surveillance stream. In the primary applications that we envision, at least one such non-anchor (sometimes called “passive”) surveillance stream will typically already be in place. In contrast, the anchor stream must be implemented actively and carefully by design, but can and generally will be much smaller in scale and identify far fewer cases. Further, if there are multiple surveillance streams with respect to which the anchor stream is agnostic, we will see that a valid and highly efficient estimator is available without any assumptions about the association parameters that govern identification by those non-anchor streams.
In what follows, we provide multiple variants of a novel unbiased CRC-based estimator that harnesses the full power of both data streams (anchor and non-anchor). These estimators often yield far greater statistical efficiency than either the sampling-based estimator or the standard bias-corrected CRC estimator (Chapman 1951) that is in fact rendered defensible via the anchor stream study design. The extension to efficiently handle multiple non-anchor streams then becomes immediate, assuming the anchor stream is taken agnostically with respect to each of them. The resulting hybrid surveillance estimation procedure is mutually beneficial in the sense that information from the non-anchor data stream(s) is calibrated by the anchor stream, while information from the anchor stream gains precision from the non-anchor stream data.
Setting 1: Daily Surveillance of an Acute Infectious Disease
Consider a large but essentially closed population (e.g., a university or long-term care facility community), and assume the goal is to estimate the number (and/or prevalence) of positive cases in the community on a particular day. Suppose that members of this population have access to testing on a volunteer basis if they are willing to go through the necessary steps. Clearly, such a program is extremely important for the health of the community, but leads to a highly nonrepresentative sample and potential selection bias that could impact any resulting estimates of prevalence or case numbers. For example, symptomatic individuals and those who believe they are at high risk of having or transmitting the infection will likely be oversampled, while asymptomatics (a large subgroup in the case of diseases like COVID-19) and uninfected individuals will be undersampled. In addition, since we assume there is an enumerated list of the members of this population, it is feasible for a principled simple or stratified random sample to be drawn. Taken together, note that the volunteer testing and sampling-based testing produce two surveillance data streams.
Setting 2: Annual Chronic Disease Surveillance
Consider now the case of monitoring cumulative incidence of a chronic disease or condition (e.g., cancer recurrence), where the goal is to estimate the number of new cases occurring within a geographically defined community over a specified timeframe (e.g., one calendar year). As in Setting 1, we assume the initially disease-free population of individuals is essentially closed and enumerated (e.g., registered survivors of cancers treated with curative intent), so that a principled simple or stratified random sample toward which medical records abstraction is applicable can be drawn from the registry. Also as before, we assume there is at least one other surveillance effort that identifies new cases over the same region and timeframe, yielding a sample of cases that may be far from random (e.g., pathologically confirmed recurrence).
Anchor Stream Characteristics
To effectively implement the principled simple or stratified random sample as an anchor stream, similar considerations arise under Settings 1 and 2. We delineate these as follows:
a) Individuals are chosen to be part of the random sample without any regard to whether they are identified by the other surveillance stream (meaning voluntarily tested on the day in question in Setting 1, or detected as an incident case through the non-anchor stream in Setting 2). In other words, there are no “referrals” into the random sample or “denials” out of it; this should be feasible to assure (also see point c).
b) The factors leading to case identification through the non-anchor surveillance effort are unaffected by selection (or not) into the anchor stream random sample. Conceptually this could be more difficult to achieve than a), because, for example, being randomly selected for testing under Setting 1 could obviate an individual’s need to also seek a test voluntarily that day. This could lead to the opposite of a referral effect, that is, a population-level tendency toward what is known as “trap aversion” (as opposed to “trap happiness”) in the CRC animal abundance literature. However, from an implementation standpoint, this tendency could be eliminated by controlling the timing of notification and testing/assessment for those sampled. That is, selected individuals could be notified and assessed at the end of the time period under consideration (at the end of the day in Setting 1 or at the end of the year in Setting 2), so that identification by the non-anchor surveillance stream would already have taken place if it were to occur. This is akin to a small-scale version of a “post-enumeration” survey, as is often implemented subsequent to the US Census (e.g., Chao, Pan, and Chiang 2008).
c) We assume the closed community maintains a record of cases identified in the non-anchor stream (i.e., positive volunteer test results for the day in Setting 1, or incident cases recorded in Setting 2). Note that anyone selected into the random sample who was previously identified as a case need not be approached or re-assessed, as such individuals are known to be captured by both surveillance streams. Under Setting 1, if negative tests are recorded through the non-anchor stream on the day in question, then there is no need to approach any of those negative individuals again if they are selected into the anchor stream. Thus, taking advantage of known case status via the non-anchor stream increases the number of test results available for the anchor stream random sample beyond what might be budgeted, as tests for some of those selected into the anchor stream need not be performed.
d) We assume there is community support for the random sampling. That is, the community recognizes the need for monitoring, and the number of individuals who would be unwilling to participate in the random testing or assessment if sampled at the end of the monitoring period (and not already voluntarily tested that day in Setting 1, or already known to have become an incident case in Setting 2) is negligible.
e) We assume all test results obtained accurately identify cases and non-cases, which implies that results obtained on individuals selected for testing in both streams would be the same. This assumption is possible to relax, for example, given access to manufacturer-specified sensitivity and specificity values for diagnostic tests that are used (see Section 4).
2. METHODS
In what follows, assume either of the hypothetical settings described above to be in effect along with the accompanying caveats. Assuming the total number (Ntot) of individuals in the community is known and that the principled “anchor stream” sample is drawn completely at random provides immediate access to two simple, valid, and defensible estimates of the number (N) of diseased individuals. We introduce notation for the first of these based on the sample proportion of population members selected into the random sample who are positive, that is:
$$\hat{N}_{RS} = N_{tot}\,\hat{p}_{RS}, \qquad \hat{p}_{RS} = \frac{n^{+}}{n},$$
with n denoting the number selected and n+ the number of those testing positive. Given that the population is closed and finite, a finite population correction (fpc) should be applied when estimating the variance of $\hat{p}_{RS}$ and thus of $\hat{N}_{RS}$, that is,
$$\hat{V}(\hat{N}_{RS}) = N_{tot}^{2}\,\hat{V}(\hat{p}_{RS}), \qquad \hat{V}(\hat{p}_{RS}) = \frac{\hat{p}_{RS}(1-\hat{p}_{RS})}{n}\left(\frac{N_{tot}-n}{N_{tot}-1}\right).$$
While the preceding reflects a common and familiar form for the fpc multiplier and is a plug-in estimator of the true variance, Cochran (1977, eq. 3.11) also provides an unbiased variance estimator that replaces n(Ntot − 1) in the denominator by Ntot(n − 1); however, in practice the two versions are virtually indistinguishable. As is well known, the reduction in variance relative to sampling from a conceptually infinite population is substantial, if n is large relative to Ntot.
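As an illustration of the random-sample estimate, the following minimal Python sketch scales the sample proportion up to the population and applies a finite population correction. The function name and arguments are hypothetical. The two variance forms shown are standard textbook forms for a proportion under simple random sampling without replacement, assumed here for illustration rather than taken from the authors' software.

```python
def random_sample_estimate(n, n_pos, n_tot, unbiased=False):
    """Case-count estimate from a simple random sample (hypothetical helper).

    n      : number of individuals sampled and assessed
    n_pos  : number of sampled individuals found positive
    n_tot  : known size of the closed population
    Returns (estimated case count, estimated variance of that count).
    """
    if not (1 < n <= n_tot) or not (0 <= n_pos <= n):
        raise ValueError("need 1 < n <= n_tot and 0 <= n_pos <= n")
    p_hat = n_pos / n
    if unbiased:
        # Assumed form: textbook unbiased variance estimator for a proportion.
        var_p = p_hat * (1 - p_hat) / (n - 1) * (n_tot - n) / n_tot
    else:
        # Assumed form: plug-in variance with fpc (n_tot - n) / (n_tot - 1).
        var_p = p_hat * (1 - p_hat) / n * (n_tot - n) / (n_tot - 1)
    return n_tot * p_hat, n_tot ** 2 * var_p


if __name__ == "__main__":
    est, var = random_sample_estimate(n=100, n_pos=22, n_tot=1000)
    print(f"estimate = {est:.1f}, standard error = {var ** 0.5:.1f}")
```

With a 10 percent sample the two variance versions differ by less than 1 percent in this example, in line with the remark that they are virtually indistinguishable in practice.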
2.1 Weighted (Composite) Estimators
Importantly, the set of assumed conditions accompanying Settings 1 and 2 serves to fully validate the classical Lincoln-Petersen (LP; Petersen 1896; Lincoln 1930) or Chapman (1951) CRC estimators. This follows as a direct result of the design and implementation of the anchor stream. Henceforth, we will refer to the principled anchor stream sample as Stream 2, and refer to the other nonrandom surveillance sample as Stream 1. In the interest of a strictly or at least approximately unbiased estimate (see Wittes 1972) of the case total (N), we focus initially on Chapman’s estimator. This is readily calculated based on the three observed cell counts in table 1, where the count n00 of individuals not identified by either stream is unknown.
Table 1.
Cell Counts for Two-Stream Capture–Recapture
| Positive test obtained in Stream 1 | Positive test obtained in Stream 2: Yes | Positive test obtained in Stream 2: No | Total |
|---|---|---|---|
| Yes | n11 | n10 | n1• |
| No | n01 | n00 = ? | |
| Total | n•1 | | N = ? |
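The well-known Chapman estimate and its estimated variance (Chapman 1951; Seber 1982) are:
$$\hat{N}_{Chap} = \frac{(n_{1\bullet}+1)(n_{\bullet 1}+1)}{n_{11}+1} - 1, \qquad \hat{V}(\hat{N}_{Chap}) = \frac{(n_{1\bullet}+1)(n_{\bullet 1}+1)\,n_{10}\,n_{01}}{(n_{11}+1)^{2}(n_{11}+2)}.$$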
Through a prior review of the CRC literature (Lyles et al. 2021), we view a transformed logit-based confidence interval (Sadinle 2009) as the most reliable closed-form interval to report in conjunction with $\hat{N}_{Chap}$ (see Section 3). We note here that n•1 is the same as n+ as defined in the previous section, and that $\hat{N}_{Chap}$ (being a standard CRC estimator) does not utilize the sampling-based information about the negative tests and the unsampled individuals that is made available through the anchor stream (Stream 2). Further, the Stream 2 data definitively indicate that the true N cannot exceed Ntot − (n − n+), because the n − n+ sampled individuals who tested negative are known non-cases. In an extreme case where $\hat{N}_{Chap}$ (or any other estimator) should exceed that number, it would clearly be an overestimate.
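The Chapman calculation from the three observed cells of table 1 can be sketched in a few lines of Python. The function names are hypothetical. The upper bound used in the second helper, Ntot minus the number of sampled individuals who tested negative, is an assumption inferred from the remark that the Stream 2 data cap the possible value of N.

```python
def chapman_estimate(n11, n10, n01):
    """Chapman (1951) estimate of N and its estimated variance (Seber 1982).

    n11: cases found by both streams
    n10: cases found by Stream 1 only
    n01: cases found by Stream 2 only
    """
    n1_dot = n11 + n10  # all cases found by Stream 1
    n_dot1 = n11 + n01  # all cases found by Stream 2
    n_hat = (n1_dot + 1) * (n_dot1 + 1) / (n11 + 1) - 1
    var_hat = ((n1_dot + 1) * (n_dot1 + 1) * n10 * n01
               / ((n11 + 1) ** 2 * (n11 + 2)))
    return n_hat, var_hat


def cap_at_known_maximum(n_hat, n, n_pos, n_tot):
    """Truncate an estimate at the largest N compatible with the anchor sample.

    Assumption: the n - n_pos sampled individuals who tested negative are
    known non-cases, so N cannot exceed n_tot - (n - n_pos).
    """
    return min(n_hat, n_tot - (n - n_pos))


if __name__ == "__main__":
    est, var = chapman_estimate(n11=10, n10=20, n01=30)
    print(f"Chapman estimate = {est:.1f}, standard error = {var ** 0.5:.1f}")
```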
We define an initial “composite” estimator of N, designed to enhance precision relative to either or alone, as the following weighted average:
In particular, three options (; ; and ) are natural to consider, with weights defined as follows:
Note that “B” assigns inverse-variance weights to each estimator, while in theory “C” estimates the optimal weight from the standpoint of minimizing the variance of the composite estimator.
In principle, the simple estimator “A” that weights the two estimates equally would tend to be inefficient whenever their variances differ markedly. However, one advantage to “A” is that the assigned weights (0.5) are not correlated with the component estimators and . As a result, the estimator “A” is unbiased whenever is unbiased (and at least approximately unbiased in almost all realistic applications). In contrast, “B” and “C” can be subject to variance stabilization issues whereby low point estimates that occur by chance are upweighted, leading to a tendency toward negative bias in the resulting composite estimators. Given empirical evidence of little to no correlation between and , we focus attention on and as candidate composite estimators in simulation studies examining bias and precision (see Section 3). We estimate the variance of as , while an estimated variance to accompany is given by
| (1) |
where is the estimated variance of the LP estimator (Seber 1982). We use in place of in (1) to convey a slight measure of conservatism, and we calculate after replacing any 0 observed cell (n11, n10, or n01) by 0.5.
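To make the weighting idea concrete, here is a minimal Python sketch of schemes “A” (equal weights) and “B” (inverse-variance weights). It is a sketch under stated assumptions: the function name is hypothetical, the variance is computed by treating the weight as fixed and the two component estimates as uncorrelated, and scheme “C” is not attempted because its weight formula is not reproduced in the text above.

```python
def composite_estimate(est_rs, var_rs, est_crc, var_crc, scheme="A"):
    """Weighted average of a random-sample estimate and a CRC estimate.

    scheme "A": equal weights (0.5 each).
    scheme "B": weights proportional to the inverse estimated variances.
    Returns (composite estimate, approximate variance, weight on est_rs).
    """
    if scheme == "A":
        w = 0.5
    elif scheme == "B":
        if var_rs <= 0 or var_crc <= 0:
            raise ValueError("inverse-variance weights need positive variances")
        w = (1 / var_rs) / (1 / var_rs + 1 / var_crc)
    else:
        raise ValueError("only schemes 'A' and 'B' are sketched here")
    est = w * est_rs + (1 - w) * est_crc
    # Assumption: weight treated as fixed, component estimates uncorrelated.
    var = w ** 2 * var_rs + (1 - w) ** 2 * var_crc
    return est, var, w


if __name__ == "__main__":
    print(composite_estimate(200.0, 1600.0, 240.0, 400.0, scheme="A"))
    print(composite_estimate(200.0, 1600.0, 240.0, 400.0, scheme="B"))
```

In scheme “B” the weights depend on the data, which is the source of the negative-bias tendency described above: an estimate that comes out low by chance also tends to have a low estimated variance and is therefore upweighted.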
2.2 Novel and Efficient CRC Estimators
As a tool for sensitivity analysis in the two-stream CRC setting, Chen (2020) and Lyles et al. (2021) proposed the following estimator:
$$\hat{N}_{\psi} = n_{11} + n_{10} + \frac{n_{01}}{\psi}, \tag{2}$$
where ψ is the overall proportion of individuals identified in Stream 2 among those not identified in Stream 1. Specifically, $\hat{N}_{\psi}$ is the maximum likelihood estimator (MLE) for N based on a general population-level multinomial model for the four cell counts (including the unobserved count n00) in table 1 regardless of the distribution of the catch events across subjects, if the true value of the population parameter ψ is specified. In general, this crucial parameter cannot be known or even reliably estimated based on the observed data. However, if Stream 2 satisfies the anchor stream conditions laid out in Section 1, $\hat{N}_{\psi}$ becomes an estimator that is available for practical use and, to our knowledge, novel to the CRC literature. Because Stream 2 is a random sample drawn independently of Stream 1, it follows that $\psi = p_{2|\bar{1}} = p_{2|1} = p_{2}$, where $p_{2}$ is the selection probability into the Stream 2 sample and the conditional probabilities reflect identification (or not) by Stream 1. A natural sampling mechanism for Stream 2 is to apply a specified probability ψ to generate a Bernoulli selection event (yes or no) for each of the Ntot enumerated subjects in the population, in which case ψ is indeed known and presumably chosen based on available sampling and testing resources.
Advantages of the estimator $\hat{N}_{\psi}$ include strict unbiasedness (a noteworthy feature relative to most if not all existing CRC estimators) and considerable statistical precision, as it harnesses the power of the anchor characteristics of Stream 2 together with all the additional case identification data obtained via Stream 1. The multivariate delta method-based variance estimate to accompany $\hat{N}_{\psi}$ based on the general multinomial model is as follows (Chen 2020):
$$\hat{V}(\hat{N}_{\psi}) = \frac{n_{01}(1-\psi)}{\psi^{2}}. \tag{3}$$
In practice, empirical studies (not shown) demonstrate that inserting the estimated selection probability $\hat{\psi} = n/N_{tot}$ in place of ψ in the equation for $\hat{N}_{\psi}$ yields a somewhat more precise estimator than using the actual specified Bernoulli probability ψ applied to select each subject into Stream 2. That is, we recommend the following estimator for practical use:
$$\hat{N}_{\hat{\psi}} = n_{11} + n_{10} + \frac{n_{01}}{\hat{\psi}}. \tag{4}$$
The accompanying standard error can be obtained by inserting $\hat{\psi}$ into (3), and may be slightly conservative as a result of the enhanced precision of $\hat{N}_{\hat{\psi}}$ relative to $\hat{N}_{\psi}$. From an inferential standpoint this is not a significant drawback considering benefits to the coverage properties of a standard Wald-based confidence interval, although further improvements can be obtained via an adaptation of a Bayesian credible interval (see next section).
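The recommended anchor stream estimator can be sketched as follows: add the cases found by Stream 1 to the Stream 2-only cases divided by the estimated selection probability n/Ntot. The function name is hypothetical. The variance form n01(1 − ψ)/ψ², evaluated at the estimated selection probability, is an assumption derived here from a multinomial model with N held fixed, so it should be read as a sketch rather than as the authors' exact expression.

```python
import math


def anchor_stream_estimate(n11, n10, n01, n, n_tot, z=1.96):
    """Anchor stream case-count estimate with a Wald-type interval.

    n11, n10, n01 : observed CRC cell counts (both, Stream 1 only, Stream 2 only)
    n             : number of individuals selected into the anchor sample
    n_tot         : known size of the closed population
    """
    if not (0 < n <= n_tot):
        raise ValueError("need 0 < n <= n_tot")
    psi_hat = n / n_tot                    # estimated selection probability
    n_hat = n11 + n10 + n01 / psi_hat      # Stream 1 cases + scaled-up remainder
    # Assumed variance form: n01 * (1 - psi) / psi**2, with psi_hat plugged in.
    var_hat = n01 * (1 - psi_hat) / psi_hat ** 2
    se = math.sqrt(var_hat)
    return n_hat, var_hat, (n_hat - z * se, n_hat + z * se)


if __name__ == "__main__":
    est, var, (lo, hi) = anchor_stream_estimate(10, 20, 30, n=100, n_tot=1000)
    print(f"estimate = {est:.1f}, Wald interval = ({lo:.1f}, {hi:.1f})")
```

Only the n01 term is scaled up, because cases found by Stream 1 are counted directly and only the cases missed by Stream 1 need to be estimated from the random sample.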
In the case of Setting 1, a similar alternative estimator can be derived if both Stream 1 and Stream 2 document negative as well as positive test results. This estimator is the MLE for N under a general multinomial model for the seven cell counts defined in table 2.
Table 2.
Cell Counts and Likelihood Contributions for Observations in Setting 1 with Negative Tests Recorded in Stream 1
| Cell count | Observation type | Multinomial likelihood contribution |
|---|---|---|
| n 1 | Sampled in both streams, tested negative | |
| n 2 | Sampled in both streams, tested positive | |
| n 3 | Sampled in Stream 1 but not 2, tested negative | |
| n 4 | Sampled in Stream 1 but not 2, tested positive | |
| n 5 | Sampled in Stream 2 but not 1, tested negative | |
| n 6 | Sampled in Stream 2 but not 1, tested positive | |
| n 7 | Not sampled in either stream |
The likelihood contributions given in table 2 are based on defining the parameters =Pr(Sampled in Stream 2), =Pr(Sampled in Stream 1), =Pr(Diseased | Sampled in Stream 1), and =Pr(Diseased | Not sampled in Stream 1). The vector of seven cell counts is distributed as Multinomial(Ntot; p1, p2, p3, p4, p5, p6, p7), with the likelihood proportional to . The MLEs of the four parameters have the following intuitive closed forms:
In turn, these yield the following closed-form MLE for N:
| (5) |
where the CRC-based cell counts (n11, n10, n01) retain their prior interpretations, with n11 = n2, n10 = n4, and n01 = n6 in the notation of table 2. A multivariate delta method-based variance estimator to accompany is readily obtained in conjunction with numerical maximization of the likelihood, or by taking advantage of the closed form expression in (5) together with well-known distributional properties of the multinomial for the seven cell counts. A SAS IML program (SAS Institute Inc. 2008) to facilitate the numerical standard error calculation is available from the authors by request, based on the following equivalent expression to (5):
Note that the algebraic form of reveals it to be nearly identical in spirit to (4), except that divides the cell count n01 by an estimate of the selection probability (ψ) into Stream 2 that is derived only among individuals not sampled into Stream 1.
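The relationship between the seven-cell MLE and the CRC-style form can be checked numerically. The sketch below uses hypothetical names (`phi_hat`, `pi1_hat`, `pi0_hat`) for the Stream 1 sampling probability and the two conditional disease probabilities, and it assumes the intuitive sample-proportion forms for the MLEs, since the closed-form expressions are not reproduced in the text above.

```python
def seven_cell_estimate(n1, n2, n3, n4, n5, n6, n7):
    """Case-count estimate when Stream 1 also records negative tests.

    Cells follow table 2: n1/n2 in both streams (neg/pos), n3/n4 in Stream 1
    only (neg/pos), n5/n6 in Stream 2 only (neg/pos), n7 in neither stream.
    Returns the estimate computed two ways, which should agree.
    """
    n_tot = n1 + n2 + n3 + n4 + n5 + n6 + n7
    in_s1 = n1 + n2 + n3 + n4
    if in_s1 == 0 or (n5 + n6) == 0:
        raise ValueError("need observations in Stream 1 and in Stream 2 only")

    # Assumed MLEs: simple sample proportions under the multinomial model.
    phi_hat = in_s1 / n_tot          # Pr(sampled in Stream 1)
    pi1_hat = (n2 + n4) / in_s1      # Pr(diseased | sampled in Stream 1)
    pi0_hat = n6 / (n5 + n6)         # Pr(diseased | not sampled in Stream 1)
    mle_form = n_tot * (phi_hat * pi1_hat + (1 - phi_hat) * pi0_hat)

    # CRC-style form: n11 = n2, n10 = n4, n01 = n6, with the selection
    # probability estimated only among those not sampled in Stream 1.
    psi1_hat = (n5 + n6) / (n5 + n6 + n7)
    crc_form = n2 + n4 + n6 / psi1_hat
    return mle_form, crc_form


if __name__ == "__main__":
    print(seven_cell_estimate(5, 3, 100, 40, 60, 12, 780))
```

Under these assumed forms the two expressions coincide algebraically, which matches the description of the estimator as dividing n01 by a selection probability estimated among individuals not sampled into Stream 1.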
We expect that will generally be unavailable for use in Setting 2, because even if negative test records for individuals identified by Stream 1 are maintained throughout the time period (e.g., year), there will be no way to know whether those individuals became incident cases by the end of the time period unless they happen to be selected into the Stream 2 sample. That is, not all cell counts in table 2 will be available. Interestingly, as suggested by an associate editor, there is in fact another novel estimator in this case that can be derived as an MLE based on a condensed version of table 2. In particular, the observation types underlying the cell counts n1 and n5 become indistinguishable when negative test results in Stream 1 are unavailable, as do the observation types underlying the counts n3 and n7. Grouping these two pairs together and allocating the cell totals (n1+n5) and (n3+n7) and the corresponding summed likelihood contributions to the two new observation types lead to a multinomial likelihood based upon which the individual parameters , , and are no longer identifiable. However, ψ of course remains estimable, as does the product , and, crucially, the overall disease prevalence . It turns out that a closed form for the new ML estimator for N becomes:
where =(Selected in Stream 2 | Not Selected in Stream 1 OR Selected in Stream 1 and Negative)
We should note here that the newly proposed estimators [ in (4), in (5), and ] are all subtle variations on the same theme, differing only in terms of the chosen valid estimate of the selection probability ψ by which the cell count n01 is divided. Our empirical studies (Section 3) suggest that, in comparison with in (4), the precision benefits obtainable through the use of based on the full version of table 2 when negative Stream 1 test results are available are generally small but worth seeking. In contrast, any precision gains through the use of relative to tend to be very small to negligible. Thus, for Setting 2 we recommend the estimator in (4), whereas under Setting 1 we recommend the use of in (5) if all seven cell counts are available (and the use of otherwise).
With regard to estimated standard errors, the delta method-based variance of is based on a full multinomial model for the Ntot members of the population that permits variation in the number (N) of cases upon repeated sampling. As such, it will be conservative if the sampling rate (ψ) into Stream 2 is relatively large and if the analyst’s vision of the inferential model for repeated sampling keeps N fixed. Under Setting 1 with available data on all seven cell counts, the delta method-based variance associated with is reasonable to use for small sampling rates; we suggest using it when ψ ≤ 0.10. For larger values of ψ, we recommend use of the variance estimator in (3) with ψ replaced by to obtain the standard error to accompany .
2.3 A Bayesian Credible Interval Approach
Standard confidence intervals (CIs) based on asymptotic normality of MLEs, typically of the form , are known to yield unsatisfying coverage properties in many CRC settings when a relatively low proportion of the total population is identified by at least one surveillance stream. For example, under the conditions required for validity of the LP estimator, it is common to calculate Wald-type intervals on a log- or inverse-transformed scale (Chao 1987; Jensen 1989; Krebs 1998), or to employ more nuanced adjustments aimed at improved coverage (Sadinle 2009). While the unbiasedness and precision of the proposed estimators and tend to translate into relatively more favorable properties when using the standard Wald CI approach, we propose an adaptation of a Bayesian credible interval based on a weakly informative Dirichlet prior on cell probabilities as an approach designed to enhance interval coverage properties.
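Our proposed technique begins with a Dirichlet(½, ½, ½) prior for the three multinomial proportions $\mathbf{p}^{*} = (p^{*}_{11}, p^{*}_{10}, p^{*}_{01})$, where the $p^{*}_{ij}$ represent the probabilities that Stream 1 capture status=i and Stream 2 capture status=j, conditional on being captured by at least one of the two streams. This yields the following conjugate posterior distribution (e.g., Sangeetha, Subbiah, and Srinivasan 2013):
$$\mathbf{p}^{*} \mid \mathbf{n} \sim \text{Dirichlet}\left(n_{11}+\tfrac{1}{2},\; n_{10}+\tfrac{1}{2},\; n_{01}+\tfrac{1}{2}\right), \tag{6}$$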
where is the vector of observed cell counts in table 1. We obtain a large sample of draws from this posterior through appropriate sequences of gamma random variates. From each draw (j = 1,…, D), we generate the corresponding posterior probability of identification in at least one of the two captures as a sum of posterior unconditional capture probabilities, that is, . Making use of the estimated Stream 2 selection probability (), we use these quantities to mimic posterior draws of the estimand in (4) and subsequently adjust these so that the variance of the posterior distribution more closely matches that associated with the actual sampling distribution of . The proposed approach is then to report a (2.5 percent, 97.5 percent) percentile interval based on the resulting posterior. Further details to facilitate the implementation of this procedure are given in the Appendix.
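The posterior sampling step can be sketched with gamma random variates from the Python standard library: independent gamma draws with shapes equal to the cell counts plus ½, normalized to sum to one, are distributed as the corresponding Dirichlet. The function name and defaults are hypothetical. The sketch covers only this sampling step; the mapping of each draw to the estimand and the variance adjustment described above are given in the Appendix and are not attempted here.

```python
import random


def dirichlet_posterior_draws(n11, n10, n01, draws=5000, prior=0.5, seed=0):
    """Draw from Dirichlet(n11 + prior, n10 + prior, n01 + prior).

    Each draw is a tuple of conditional cell probabilities (p11, p10, p01)
    given capture by at least one stream.
    """
    rng = random.Random(seed)
    shapes = (n11 + prior, n10 + prior, n01 + prior)
    samples = []
    for _ in range(draws):
        # Normalized independent gamma variates follow a Dirichlet law.
        gammas = [rng.gammavariate(shape, 1.0) for shape in shapes]
        total = sum(gammas)
        samples.append(tuple(g / total for g in gammas))
    return samples


if __name__ == "__main__":
    samples = dirichlet_posterior_draws(10, 20, 30)
    mean_p11 = sum(s[0] for s in samples) / len(samples)
    print(f"posterior mean of p11 is about {mean_p11:.3f}")
```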
2.4 Extension 1: Multiple Non-Anchor Streams and Future Monitoring
A considerable advantage of an anchor stream design implemented as outlined under Settings 1 and 2 is that the extension to efficiently incorporate data from any number of non-anchor streams is immediate. Specifically, assume there are T (> 2) surveillance streams, with Stream T implemented as a random sample in a way that is independent (agnostic) of the other T − 1 data collection efforts. Regardless of what dependencies exist among those T − 1 streams, one need only combine them and designate any individual found disease positive in at least one of them as positive and any individual found positive in none of them as negative via a single new “Stream 1*.” The proposed point and interval estimation approaches then carry over automatically, with Stream T taking the role of Stream 2 in Section 2. We note that this result continues to assume that diagnostic tests used in the various data streams accurately determine case status (see Section 4).
To clarify this approach, consider the case in which T = 3, with Stream 3 implemented agnostically with respect to Streams 1 and 2 as an anchor stream simple random sample. With three streams, there are seven observed cell counts (n111, n110, n101, n100, n011, n010, n001), where nijk is the number of individuals with capture status i =(0,1) in Stream 1, j =(0,1) in Stream 2, and k =(0,1) in Stream 3. Regardless of the level or directionality of any capture dependencies between Streams 1 and 2, we can combine them into a new “Stream 1*” and construct table 3 to summarize the overlap between Stream 1* and the anchor Stream 3.
Table 3.
Cell Counts for Three-Stream CRC Where Streams 1 and 2 Are Combined to Form Stream 1* and Stream 3 is an Anchor Stream
| Positive test obtained in Stream 1* | Positive test obtained in Stream 3: Yes | Positive test obtained in Stream 3: No |
|---|---|---|
| Yes | n11* | n10* |
| No | n01* | n00* = ? |
Note that the observed cell counts in table 3 are defined in terms of the original cell counts as follows: n11* = n111 + n101 + n011, n10* = n110 + n100 + n010, and n01* = n001. A robust and efficient anchor stream-based estimator making use of all the data then directly mimics (4), as follows:
$$\hat{N}_{\hat{\psi}^{*}} = n_{11^{*}} + n_{10^{*}} + \frac{n_{01^{*}}}{\hat{\psi}^{*}}, \tag{7}$$
where $\hat{\psi}^{*} = n/N_{tot}$ and n is the number of individuals selected for testing in anchor Stream 3. As noted by an associate editor, one might consider also whether the analogue to the estimator in (5) is available. This is potentially the case, but only if all of the T − 1 non-anchor streams record negative as well as positive test results so that table 2 can be constructed with respect to Stream 1*.
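The collapsing of several non-anchor streams into a single Stream 1* can be sketched as below. The representation of capture profiles as strings such as "101", with the last character reserved for the anchor stream, is a hypothetical convention chosen for this example. The final step applies the same form as the two-stream anchor estimator with the estimated selection probability n/Ntot.

```python
def combine_non_anchor_streams(profile_counts):
    """Collapse T-stream capture profiles into the three observed table 3 cells.

    profile_counts maps a profile string (one '0'/'1' per stream, anchor
    stream last) to the number of cases with that profile.
    """
    n11 = n10 = n01 = 0
    for profile, count in profile_counts.items():
        in_any_non_anchor = "1" in profile[:-1]
        in_anchor = profile[-1] == "1"
        if in_any_non_anchor and in_anchor:
            n11 += count
        elif in_any_non_anchor:
            n10 += count
        elif in_anchor:
            n01 += count
    return n11, n10, n01


def multi_stream_anchor_estimate(profile_counts, n, n_tot):
    """Anchor stream estimate after combining all non-anchor streams."""
    n11, n10, n01 = combine_non_anchor_streams(profile_counts)
    psi_hat = n / n_tot  # estimated selection probability into the anchor sample
    return n11 + n10 + n01 / psi_hat


if __name__ == "__main__":
    counts = {"111": 2, "110": 5, "101": 3, "100": 20,
              "011": 4, "010": 15, "001": 30}
    print(combine_non_anchor_streams(counts))
    print(multi_stream_anchor_estimate(counts, n=100, n_tot=1000))
```

Because only membership in at least one non-anchor stream matters, the dependence structure among those streams never enters the calculation.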
In addition to admitting a direct extension to accommodate any number of non-anchor streams regardless of their inherent dependencies, note that this approach offers the further advantage of facilitating future monitoring based only on the non-anchor streams. Specifically, an efficient estimator of the lynchpin parameter , the overall proportion of cases identified in Stream 2 among those not identified in Stream 1, is obtained as follows:
| (8) |
with defined as in (7). Under the assumption that the population-level dependency between non-anchor Streams 1 and 2 remains stable over the next monitoring period, the analyst now has access to a CRC estimator for the case count over that next period based only on those two streams without implementing another anchor stream sample. This estimator, obtained by inserting in place of ψ in (2), may well be more defensible than any other available CRC estimator in the absence of a new anchor stream.
2.5 Extension 2: Stratified Anchor Stream Sampling
Consider the case in which anchor Stream 2 (or, Stream T in the preceding section) is implemented as a stratified random sample (SRS) in order to permit sampling-based estimates of case counts for S mutually exclusive strata. Assuming the selection probability (ψs) into each stratum (s = 1,…, S) is known, a natural overall prevalence estimator (Horvitz and Thompson 1952) becomes:
| (9) |
where ns and ns+ are the number tested and the number tested positive in stratum s. Note that is the analogue to in Section 2. Multiplying in (9) by a census-based total population (Ntot) yields , where is itself the analogue to in Section 2. Assuming availability of the same stratification variables in non-anchor Stream 1 (or Stream 1* in the case of multiple non-anchor streams), the stratum-specific version of in (4) is immediately available and valid given the SRS-based anchor stream. An efficient overall estimate of case totals is then obtained by summing these estimates across the S mutually exclusive strata. While the estimated variance is also immediate as a sum of the stratum-specific variances, a corresponding extension of the proposed adjusted Dirichlet-multinomial credible interval approach based on separate posterior samples from each stratum would be expected to improve coverage rates.
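The summation across strata described above can be sketched as follows. The stratum-specific estimates and variances are taken as given inputs (the values shown are hypothetical), and the Wald-type interval with a 1.96 multiplier is an assumption made only for illustration; as noted, a credible interval built from separate stratum-level posterior samples would be expected to give better coverage.

```python
import math


def combine_strata(estimates, variances, z=1.96):
    """Sum stratum-specific case count estimates and their variances.

    Variances add because the strata are mutually exclusive and are sampled
    independently. Returns the total, its standard error, and a Wald-type
    interval (the interval form is an illustrative assumption).
    """
    if len(estimates) != len(variances):
        raise ValueError("one variance is required per stratum estimate")
    total = sum(estimates)
    se = math.sqrt(sum(variances))
    return total, se, (total - z * se, total + z * se)


# Hypothetical stratum-level results, for illustration only
total, se, interval = combine_strata([120.0, 60.0, 40.0], [400.0, 225.0, 100.0])
print(f"total = {total:.1f}, SE = {se:.1f}, "
      f"interval = ({interval[0]:.1f}, {interval[1]:.1f})")
```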
3. RESULTS
We conducted simulations to compare the properties of competing estimators of N under the “anchor stream” design. As an initial scenario, we generated data under the following conditions characteristic of Setting 1 (e.g., in the context of monitoring for prevalence and case counts of an infectious disease in a long-term care facility on a given day): 20 percent of subjects have symptoms; 50 percent of those with symptoms have the disease; 15 percent of those without symptoms have the disease (yielding an overall prevalence of 22 percent); 75 percent of those with symptoms seek and obtain a voluntary test in Stream 1; and 5 percent of those without symptoms obtain a voluntary test in Stream 1. We kept the total number of individuals (Ntot) fixed at 1,000 and considered three choices (10 percent, 25 percent, and 50 percent) for the sampling rate (p2) into the anchor Stream 2 random sample, which was generated independently of Stream 1. Under these simulation conditions, the true total number of cases was fixed at N = 220 in each case. Stream 1 captured diseased individuals at an overall rate of 36.8 percent (on average, 81 of the 220 cases), because it preferentially selected those with symptoms (and thus those with disease) at a rate that would generally be unknowable in practice. Table 4 summarizes the results of this initial simulation study.
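The headline quantities implied by these simulation conditions can be checked with a few lines of arithmetic. The sketch below assumes that test-seeking in Stream 1 depends only on symptom status, which is how the conditions are described; the variable names are illustrative. Under that assumption, the 36.8 percent figure is the proportion of diseased individuals captured by Stream 1, while the proportion of all individuals tested in Stream 1 is 19 percent.

```python
n_tot = 1000
p_sym = 0.20                               # proportion with symptoms
p_dis_sym, p_dis_nosym = 0.50, 0.15        # disease risk by symptom status
p_test_sym, p_test_nosym = 0.75, 0.05      # Stream 1 testing rate by symptom status

cases_sym = n_tot * p_sym * p_dis_sym              # symptomatic cases
cases_nosym = n_tot * (1 - p_sym) * p_dis_nosym    # asymptomatic cases
n_cases = cases_sym + cases_nosym

prevalence = n_cases / n_tot
tested_overall = p_sym * p_test_sym + (1 - p_sym) * p_test_nosym
case_capture = (cases_sym * p_test_sym + cases_nosym * p_test_nosym) / n_cases

print(f"cases = {n_cases:.0f}, prevalence = {prevalence:.2f}")
print(f"share of all individuals tested in Stream 1 = {tested_overall:.3f}")
print(f"share of cases captured by Stream 1 = {case_capture:.3f}")
```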
Table 4.
Simulations Mimicking Setting 1, with Ntot = 1,000 and N = 220 Diseased Subjects (a)
| Estimator | p2 = 0.1 | | p2 = 0.25 | | p2 = 0.5 | |
|---|---|---|---|---|---|---|
| | Mean (SD) [avg. SE] | CI coverage [avg. width] | Mean (SD) [avg. SE] | CI coverage [avg. width] | Mean (SD) [avg. SE] | CI coverage [avg. width] |
| | 219.8 (40.1) [39.2] | 93.7% [153.7] | 220.1 (22.5) [22.7] | 95.1% [88.9] | 220.1 (13.0) [13.1] | 95.2% [51.3] |
| | 219.6 (65.4) [56.3] | 96.2% [331.0] | 219.8 (34.7) [33.1] | 95.1% [149.6] | 220.5 (19.7) [19.4] | 94.9% [82.9] |
| | 219.7 (38.3) [35.2] | 92.9% [138.0] | 219.9 (20.8) [20.3] | 94.2% [79.4] | 220.3 (11.9) [11.7] | 94.3% [46.0] |
| | 207.0 (33.3) [31.7] | 88.3% [124.2] | 215.0 (19.2) [18.6] | 91.8% [72.9] | 218.2 (11.0) [10.8] | 93.5% [42.3] |
| | 219.7 (33.1) [35.2] | 94.7%, **94.0%** [138.1], **[127.8]** | 220.0 (19.0) [20.4] | 96.0%, **94.7%** [80.0], **[74.7]** | 220.2 (10.9) [11.8] | 96.0%, **95.6%** [46.2], **[43.4]** |
| | 219.8 (32.6) [34.5] | 95.0% [135.3] | 220.0 (18.6) [22.8] | 98.1% [89.2] | 220.2 (10.7) [17.1] | 99.8% [67.1] |
(a) Wald-based CIs are evaluated, except we report the transformed logit CI of Sadinle (2009) in conjunction with the Chapman estimator (second row) and the proposed adjusted Bayesian credible interval (bold) in conjunction with the estimator in (4) (fifth row).
Key takeaways from table 4 include the fact that the two component estimators (the random sampling-based estimator from anchor Stream 2 and the Chapman estimator; first and second rows) are empirically unbiased or very nearly so in all cases, as expected. The two composite estimators (third and fourth rows) are always more precise, given that they take into account information about N available through the Stream 2 random sample as well as through the overlap of Streams 1 and 2 under the assured LP conditions. However, the composite estimator in the fourth row is markedly biased downward as anticipated (due to variance stabilization issues), while the one in the third row, though virtually unbiased, achieves relatively modest efficiency gains relative to the random sampling-based estimator alone. The best estimators are clearly those in (4) and (5) (fifth and sixth rows), with the latter offering slightly better precision (lower empirical SD) in each case. We note again that the estimator in (5) will only be available in practice if both positive and negative test results are recorded in both surveillance streams, making the estimator in (4) more generally applicable.
Most confidence intervals evaluated in table 4 are of the common Wald form, estimate ± 1.96 × SE, where the appropriate standard errors are noted in Section 2. The exceptions to this are that a transformed logit CI (Sadinle 2009) is reported as an accompaniment to the Chapman estimate, and the proposed adjusted Bayesian credible interval is evaluated along with the Wald-type CI as an accompaniment to the estimator in (4). As expected, standard error estimates to accompany the estimator in (4), based on (3) with ψ replaced by the value used in (4), are slightly conservative, which benefits overall coverage for the corresponding Wald-based CI under the conditions studied in table 4. When p2 is relatively small (e.g., for p2 = 0.1 in table 4), the numerically calculated delta method-based standard error for the estimator in (5), described subsequent to (5), matches the empirical SD well and produces a well-behaved Wald CI. However, this standard error becomes exceedingly conservative as p2 increases, since the sampling process underlying the likelihood corresponding to table 2 allows N to vary across repeated samples. Considering the joint criteria of near nominal coverage and width, we note that the proposed adjusted Bayesian credible interval merits the analyst’s preference (see bold type in table 4). As a point of interest, the transformed logit CI reported with the Chapman estimator displays excellent coverage as expected (Sadinle 2009; Lyles et al. 2021), but is excessively wide relative to the other intervals. While this CI is among the best available under the classical LP conditions in two-stream CRC analysis, it is not built to harness the information about N contained in the anchor stream alone. Similarly, the CI associated with the random sampling-based estimator is unnecessarily wide because it is not built to incorporate the case count information in non-anchor Stream 1.
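As a small check on the Wald-type intervals, the sketch below assumes the usual 1.96 multiplier for a nominal 95 percent interval; this assumption is consistent with the widths reported in table 4, where the average width is very nearly 2 × 1.96 times the average standard error. The function name is illustrative.

```python
def wald_interval(estimate, se, z=1.96):
    """Return the (lower, upper) limits of a Wald-type interval."""
    return estimate - z * se, estimate + z * se


# First row of table 4 at p2 = 0.1: mean estimate 219.8, average SE 39.2
lower, upper = wald_interval(219.8, 39.2)
width = upper - lower
print(f"interval = ({lower:.1f}, {upper:.1f}), width = {width:.1f}")
# The reported average width for that cell is 153.7.
```

Since the width of a Wald interval is linear in the standard error, averaging widths over simulated data sets gives the same result as applying the multiplier to the average standard error, which is why the reported pairs agree so closely.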
From a design standpoint, note that while the 10 percent, 25 percent, and 50 percent random samples into Stream 2 under table 4 conditions mean selecting on average 100, 250, and 500 members of the population for testing, the number of tests actually required to ascertain case status is markedly smaller given access to the documented results in Stream 1. Specifically, assuming only positive Stream 1 results are known, the proposed design requires an average of 92, 230, and 460 tests to be conducted in Stream 2 for p2 = 0.1, 0.25, and 0.5, respectively. If both positive and negative Stream 1 test results are available, an average of only 81, 202, and 405 tests are required for Stream 2.
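The test counts quoted above follow from simple expected-value arithmetic, sketched below under two assumptions not stated explicitly in the text: that Stream 1 test results are error-free, so recorded positives coincide with captured cases, and that Stream 1 testing depends only on symptom status. The function and variable names are illustrative.

```python
def expected_new_tests(n_tot, p2, share_already_known):
    """Expected number of anchor stream tests still needed after linking Stream 1 records."""
    return n_tot * p2 * (1.0 - share_already_known)


n_tot = 1000
# Table 4 conditions: 100 symptomatic and 120 asymptomatic cases per 1,000 subjects
share_positive_in_stream1 = (100 * 0.75 + 120 * 0.05) / n_tot   # recorded positives
share_tested_in_stream1 = 0.20 * 0.75 + 0.80 * 0.05             # any recorded result

for p2 in (0.10, 0.25, 0.50):
    pos_only = expected_new_tests(n_tot, p2, share_positive_in_stream1)
    pos_and_neg = expected_new_tests(n_tot, p2, share_tested_in_stream1)
    print(f"p2 = {p2:.2f}: positives known -> {pos_only:.1f} tests; "
          f"positives and negatives known -> {pos_and_neg:.1f} tests")
```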
Table 5 summarizes a second simulation study under similar conditions, except assuming 90 percent and 30 percent of those with and without symptoms, respectively, seek and obtain a voluntary test in Stream 1. This yields a larger rate of selection of diseased individuals for testing (57.3 percent as opposed to 36.8 percent) via Stream 1. In these new simulations, however, we also varied Ntot and p2 while keeping the average size of the Stream 2 random sample fixed at only 50. The goal is to study the benefits of implementing a small anchor stream sample for testing to accompany a relatively large but nonrepresentative data stream of voluntary test results.
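A data-generating process of the kind described here can be sketched as follows. This is not the authors' simulation code: it assumes fixed numbers of symptomatic subjects and of cases, Bernoulli selection into each stream, and error-free tests, and it computes only two standard estimators, the random sampling-based estimate (Ntot times the sample proportion positive) and Chapman's estimate, (n1 + 1)(n2 + 1)/(m + 1) − 1. All function and variable names are hypothetical.

```python
import random
import statistics


def simulate_once(n_tot=1000, p2=0.05, p_test_sym=0.90, p_test_nosym=0.30, seed=None):
    """Generate one data set and return cell counts plus two case count estimates."""
    rng = random.Random(seed)
    n_sym = round(0.20 * n_tot)                      # symptomatic subjects
    n_case_sym = round(0.50 * n_sym)                 # cases among symptomatic
    n_case_nosym = round(0.15 * (n_tot - n_sym))     # cases among asymptomatic
    people = ([(True, True)] * n_case_sym
              + [(True, False)] * (n_sym - n_case_sym)
              + [(False, True)] * n_case_nosym
              + [(False, False)] * (n_tot - n_sym - n_case_nosym))

    n11 = n10 = n01 = n_sampled = 0
    for has_symptoms, is_case in people:
        in1 = rng.random() < (p_test_sym if has_symptoms else p_test_nosym)
        in2 = rng.random() < p2                      # anchor stream, independent of Stream 1
        n_sampled += in2
        if is_case:                                  # only cases enter the capture table
            n11 += in1 and in2
            n10 += in1 and not in2
            n01 += (not in1) and in2

    # Random sampling-based estimate: Ntot times the proportion positive in the sample
    n_rs = n_tot * (n11 + n01) / n_sampled if n_sampled else float("nan")
    # Chapman's estimate from the two-stream capture table
    chapman = (n11 + n10 + 1) * (n11 + n01 + 1) / (n11 + 1) - 1
    return {"n11": n11, "n10": n10, "n01": n01, "n_sampled": n_sampled,
            "n_cases": n_case_sym + n_case_nosym, "n_rs": n_rs, "chapman": chapman}


runs = [simulate_once(seed=s) for s in range(200)]
print("mean anchor sample size:", statistics.mean(r["n_sampled"] for r in runs))
print("mean random sampling estimate:", statistics.mean(r["n_rs"] for r in runs))
print("mean Chapman estimate:", statistics.mean(r["chapman"] for r in runs))
```

Because the anchor stream is drawn independently of Stream 1, both estimates should center near the true case count of 220 under these settings, with the Chapman estimate valid here only because of that design feature.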
Table 5.
Simulations Mimicking Setting 1, Varying Ntot and p2 to Maintain Average Size of Anchor Stream Sample at 50
| Estimator | Ntot = 1,000; p2 = 0.05; True no. of cases = 220 | | Ntot = 500; p2 = 0.10; True no. of cases = 110 | | Ntot = 200; p2 = 0.25; True no. of cases = 44 | |
|---|---|---|---|---|---|---|
| | Mean (SD) [avg. SE] | CI coverage [avg. width] | Mean (SD) [avg. SE] | CI coverage [avg. width] | Mean (SD) [avg. SE] | CI coverage [avg. width] |
| | 220.4 (57.6) [56.8] | 93.9% [222.5] | 110.0 (28.1) [27.6] | 93.3% [108.3] | 43.9 (10.3) [10.1] | 93.6% [39.6] |
| | 219.5 (64.6) [52.3] | 97.8% [398.0] | 110.3 (31.4) [25.6] | 97.6% [199.4] | 44.0 (11.6) [9.3] | 97.2% [73.8] |
| | 220.0 (43.2) [40.2] | 92.7% [157.5] | 110.1 (21.2) [19.6] | 91.9% [76.8] | 44.0 (7.5) [7.1] | 92.1% [28.0] |
| | 203.7 (38.7) [38.3] | 85.6% [150.3] | 102.4 (19.0) [18.8] | 86.3% [73.5] | 41.1 (6.9) [6.8] | 85.4% [26.6] |
| | 219.9 (40.3) [41.5] | 91.7%, **95.1%** [162.8], **[163.0]** | 110.2 (19.9) [20.2] | 91.2%, **95.6%** [79.4], **[78.6]** | 44.0 (7.2) [7.4] | 91.8%, **95.1%** [28.9], **[28.1]** |
| | 220.2 (39.3) [40.9] | 93.3% [160.1] | 110.2 (19.3) [21.0] | 94.1% [82.3] | 43.9 (7.0) [9.0] | 97.4% [35.4] |
As in table 4, all estimators considered remain virtually unbiased with the exception of the composite estimator in the fourth row, which displays unacceptable downward bias with correspondingly poor CI coverage. The new estimators proposed [(4) and (5); fifth and sixth rows] are far more efficient than either the random sampling-based estimator or the Chapman estimator taken alone, with the estimator in (5) slightly more precise than the estimator in (4). In the scenarios shown in table 5 based on small (average size of 50) anchor Stream 2 samples, we note that no Wald-type interval offers a satisfying combination of properties in terms of the joint criteria of coverage and width. The proposed adjusted Bayesian credible interval (bold) clearly performs best, with favorable width and outstanding coverage rates despite the small Stream 2 samples.
As a third simulation study, we consider annual monitoring to count new cases of a disease among a large enumerated cohort population. For example, Lash, Cronin-Fenton, Ahern, Rosenberg, Lunetta, et al. (2011) considered the surveillance of colorectal cancer recurrences among a Danish registry cohort of over 20,000 patients whose initial cancer had been treated. The goal is to investigate the benefits of even a small random sample of such a registry population for definitive determination of incident case status (e.g., through medical record abstraction and/or direct follow-up) according to the criteria of Setting 2.
As in the simulation summarized in table 5, we assume 20 percent of subjects are in a high-risk group, among which 50 percent become a new case over the course of the year. In contrast, 15 percent of those not at high risk become a new case (yielding an overall cumulative incidence of 22 percent), while 90 percent and 30 percent of individuals in the high and non–high-risk groups are selected for testing in Stream 1 (for an overall rate of selection for testing of 57.3 percent among new cases). We varied the large enumerated population size (Ntot = 5,000; 10,000; 20,000), with small rates of sampling (p2 = 0.01; 0.005; 0.0025) to again yield an average size of the Stream 2 random sample of only 50 subjects. The results in table 6 are comparable to those in table 5, despite the difference in scale of the total population sizes and the Stream 2 (anchor stream) sampling rates. The estimators in (4) and (5) again perform best, offering major precision benefits relative to the random sampling-based and Chapman estimators. While standard errors accompanying the two proposed estimators match closely with the empirical SDs here, Wald-based CIs still suffer in terms of overall coverage. This continues to be effectively mitigated by the proposed Bayesian credible intervals.
Table 6.
Simulations Mimicking Setting 2, Varying Ntot and p2 to Maintain Average Size of Anchor Stream Sample at 50
| Estimator | Ntot = 5,000; p2 = 0.01; True no. of cases = 1,100 | | Ntot = 10,000; p2 = 0.005; True no. of cases = 2,200 | | Ntot = 20,000; p2 = 0.0025; True no. of cases = 4,400 | |
|---|---|---|---|---|---|---|
| | Mean (SD) [avg. SE] | CI coverage [avg. width] | Mean (SD) [avg. SE] | CI coverage [avg. width] | Mean (SD) [avg. SE] | CI coverage [avg. width] |
| | 1,098 (295) [288] | 93.4% [1,130] | 2,184 (598) [577] | 92.5% [2,263] | 4,416 (1,177) [1,163] | 93.2% [4,560] |
| | 1,100 (322) [268] | 97.9% [2,236] | 2,210 (676) [543] | 97.8% [4,165] | 4,388 (1,383) [1,070] | 97.9% [8,843] |
| | 1,099 (220) [205] | 92.5% [802] | 2,197 (453) [413] | 92.4% [1,617] | 4,402 (913) [823] | 92.8% [3,227] |
| | 1,019 (199) [197] | 86.1% [771] | 2,032 (397) [395] | 85.7% [1,548] | 4,070 (790) [785] | 86.0% [3,078] |
| | 1,102 (207) [212] | 92.0%, **95.6%** [830], **[837]** | 2,201 (414) [424] | 92.2%, **96.1%** [1,662], **[1,679]** | 4,400 (833) [850] | 92.3%, **95.6%** [3,333], **[3,373]** |
| | 1,101 (200) [199] | 92.2% [779] | 2,199 (402) [397] | 92.1% [1,556] | 4,402 (810) [799] | 92.1% [3,133] |
Our final simulation study was designed to illustrate points raised in the section entitled “Extension 1.” We considered the case of T = 3 surveillance streams, with Stream 3 serving as an anchor stream random sample. Data were generated under three different sets of conditions, each with the total population size and number of cases (Ntot and N) set at 2,000 and 440, respectively. As in our initial simulation study, data were generated such that 20 percent of subjects had symptoms, 50 percent of those with symptoms had the disease, and 15 percent of those without symptoms had the disease (yielding an overall prevalence of 22 percent). Under the first scenario, Streams 1, 2, and 3 randomly and independently sampled individuals from the population with sampling rates of 0.3, 0.1, and 0.2, respectively. In the second, data were generated such that 75 percent of those with symptoms and 5 percent of those without symptoms obtained a voluntary test in Stream 1, while 50 percent of those with symptoms and 10 percent of those without symptoms did so in Stream 2. In the third, the same conditions held for Stream 1 but for Stream 2 they were reversed so that 10 percent of those with symptoms and 50 percent of those without symptoms were voluntarily tested. Note that the second and third scenarios reflect conditions leading to positive and negative association, respectively, between Streams 1 and 2.
For each of the three scenarios considered, we evaluated the mean and empirical SD of the anchor stream-based estimator [see (7)], obtained subsequent to merging the data from Streams 1 and 2 into a single stream (“Stream 1*”). For comparison, we report the same for the random sampling-based estimator based only on anchor Stream 3, Chapman’s estimator using only Streams 1 and 2, and the numerically obtained MLE based on the three-stream independence model (Darroch 1958; Wittes 1974). The latter estimator is the direct extension of the LP estimator to the setting of three or more data streams, but is seldom well-justified for use in practice.
As expected and reflected in table 7, all four estimators of N are valid in the unlikely first scenario with the three streams as independent random samples. Note, however, that the anchor stream estimator provides the best precision, easily outdoing the MLE based on the independence model that generated the data. This finding stems from the fact that the latter is a classical CRC estimator that fails to take advantage of the full information offered in the sampling-based anchor Stream 3.
Table 7.
Simulations of Three Data Streams with Stream 3 as Anchor (Ntot=2,000, True N = 440)
| | Streams 1 and 2 independent | Streams 1 and 2 positively associated | Streams 1 and 2 negatively associated |
|---|---|---|---|
| Estimator | Mean (SD) | Mean (SD) | Mean (SD) |
| a | 439.2 (36.6) | 439.6 (37.2) | 439.9 (37.1) |
| b | 439.6 (104.0) | 263.3 (15.3) | 1,079.8 (212.2) |
| c | 445.8 (50.2) | 339.4 (18.9) | 597.4 (44.6) |
| d | 439.9 (30.7) | 439.9 (28.9) | 439.7 (23.9) |
| e | 0.101 (0.020) | 0.174 (0.029) | 0.432 (0.048) |
a: Estimate based on anchor Stream 3 alone.
b: Chapman estimate based on non-anchor Streams 1 and 2.
c: CRC estimate assuming independence of all three streams.
d: Anchor stream-based estimate (7) after combining Streams 1 and 2.
e: Estimate (8) of the proportion of cases identified in Stream 2 among those not identified in Stream 1, available for future monitoring by anchor stream design.
When non-anchor Streams 1 and 2 are positively and negatively associated in table 7, respectively, the corresponding under- and over-estimation characterizing the Chapman estimator (row b) and the three-stream independence-model estimator (row c) are predictable but striking. Note that the proposed anchor stream-based estimator remains valid regardless of the nature of the association between Streams 1 and 2, while offering a significant reduction in variance relative to the estimator based on anchor Stream 3 alone (row a). The final row of the table demonstrates properties of the valid estimator of the association parameter (8) that could be used to facilitate case count estimation over the next monitoring period based only on Streams 1 and 2. The magnitude (and even directionality) of that association would generally be impossible to uncover in practice using only data from those two streams, but both are unlocked through the implementation of anchor Stream 3 during the current monitoring period.
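The direction and rough size of the bias in the two-stream estimate can be anticipated from expected cell counts. The sketch below assumes that Streams 1 and 2 are conditionally independent given symptom status (an assumption implied by, but not stated in, the scenario descriptions) and uses the simple Lincoln–Petersen ratio n1 × n2 / m as a large-sample stand-in for Chapman's estimator. The function names are illustrative.

```python
def lp_limit(cases_by_group, p1, p2):
    """Large-sample Lincoln-Petersen value n1 * n2 / m computed from expected counts."""
    n1 = sum(c * a for c, a in zip(cases_by_group, p1))
    n2 = sum(c * b for c, b in zip(cases_by_group, p2))
    m = sum(c * a * b for c, a, b in zip(cases_by_group, p1, p2))
    return n1 * n2 / m


def stream2_rate_among_missed(cases_by_group, p1, p2):
    """Proportion of cases found by Stream 2 among cases not found by Stream 1."""
    missed = [c * (1 - a) for c, a in zip(cases_by_group, p1)]
    return sum(x * b for x, b in zip(missed, p2)) / sum(missed)


# With Ntot = 2,000: 200 symptomatic cases and 240 asymptomatic cases (N = 440)
cases = (200, 240)
scenarios = {
    "independent": ((0.30, 0.30), (0.10, 0.10)),
    "positively associated": ((0.75, 0.05), (0.50, 0.10)),
    "negatively associated": ((0.75, 0.05), (0.10, 0.50)),
}
for name, (p1, p2) in scenarios.items():
    print(f"{name}: LP limit = {lp_limit(cases, p1, p2):.1f}, "
          f"Stream 2 rate among Stream 1 misses = "
          f"{stream2_rate_among_missed(cases, p1, p2):.3f}")
```

Under these assumptions the limiting values are about 440, 264, and 1,080 for the three scenarios, in line with the simulated Chapman means in row b of table 7, and the limiting proportions (about 0.10, 0.17, and 0.43) are close to the simulated means in row e.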
4. DISCUSSION
In this report, we have presented two primary variants [(4) and (5)] of a novel CRC estimator that are available in simple closed form, virtually unbiased, and leverage the full power of the CRC paradigm to augment a valid principled sampling-based estimator implemented independently through what we refer to as an “anchor stream” mechanism. The result is substantially enhanced precision relative to sole use of either the sampling-based estimator or the classical CRC estimator (Chapman 1951) that is made assuredly valid by design in this unique setting. The message is twofold: (1) carefully implemented principled sampling with respect to at least one surveillance stream (e.g., Stream 2 in our notation) is the only way to fully calibrate and validate any CRC estimator, and (2) implementing Stream 2 as an “anchor” stream allows the data obtained by a potentially nonrepresentative sample (Stream 1) to improve surveillance precision without compromising validity in the estimation of N. Importantly, the latter point holds true no matter what the actual sampling process is for identifying disease cases via Stream 1 (the non-anchor stream). Note that, in addition to typically being nonrepresentative in ways that may be unresolvable, Stream 1 on its own often fails to facilitate any attempt to estimate the case total unless it reliably records negative as well as positive test results.
As we have illustrated, immediate extensions of the approach and estimators described in this report are available if there are multiple non-anchor streams. The information from these streams need only be merged into a single stream (Stream 1*), such that only individuals identified as a case by at least one of them are considered captured in Stream 1*. The anchor stream then plays the role of Stream 2 in our original notation, and an efficient estimate of the key parameter capturing the population-level dependency between the non-anchor streams becomes available for future monitoring based only on those streams [see (8)]. Likewise, an immediate extension is available if Stream 2 is implemented as a stratified random sample (SRS) with the same stratification variables available in Stream 1. In particular, the proposed novel estimators [i.e., (4) and (5)] apply within each stratum, and the estimates would be summed across strata for an overall estimate of the case count. This point somewhat recalls the work of Chandra Sekar and Deming (1949), which advocated the summing of classical CRC estimates across strata within which assumed independence of data streams is deemed to be reasonable. However, the difference in this case is that the SRS-based anchor stream estimator is both assuredly valid and potentially far more precise. In ongoing work, we are refining variants of the proposed Bayesian credible interval approach to account for stratified anchor stream sampling. We are also working to incorporate adjustments for misclassification, when one or both data streams identify cases by means of an imperfect diagnostic test or assessment process. Depending on the context, these adjustments could leverage manufacturer-supplied sensitivity and specificity values for diagnostic tests, or estimates of these parameters and/or of positive predictive values based on validation data.
With regard to stipulations (a)–(e) outlined in Section 1, we note again that the presence of an enumerated list of population members from which to sample as well as the temporality of the subsequent testing are key factors in the implementation of a true anchor stream. Efforts to relax assumption (b) through a combination of design and analytic considerations are the target of a large CRC literature (e.g., Wolter 1990; Bell 1993; Chao and Tsay 1998; Elliott and Little 2005, among others). All such efforts, however, rely at least to some extent upon unverifiable modeling assumptions. The key advantage of an anchor stream, if feasible, is that the reliance upon such assumptions is negated by design. With regard to feasibility, we acknowledge that assumption (d), in particular, is strong and difficult to assume in typical surveys or surveillance efforts involving human populations. Nevertheless, both Setting 1 involving self-contained communities to which monitoring programs can be applied or perhaps mandated and Setting 2 involving abstraction of available medical information (e.g., through electronic health records) are relevant scenarios within which the anchor stream design may be an option. In other scenarios, the outlined stipulations of an anchor stream can still serve as an ideal guide when planning the design of an active surveillance effort to supplement one or more existing passive ones for case count estimation.
We believe the novel estimators presented here based on the “anchor stream” approach will make useful additions linking the survey sampling and CRC literature, both from a design and analytic standpoint. In terms of design, our results suggest potentially great value in implementing non–sampling-based surveillance efforts (e.g., Stream 1) in closed communities in such a way that individual-level positive test results (and negative results in the case of daily monitoring, if possible) are duly recorded. Implementing even a relatively small anchor stream based on principled random sampling at the end of the time period of interest and linking existing non-anchor Stream 1 results to the selected individuals then permits demonstrable savings in resources with respect to the number of tests requiring administration to those randomly sampled. The precision gains unlocked relative to a sampling-based or a classical CRC estimator alone are essentially guaranteed and can be substantial. Potential applications of this approach to current collaborative efforts include the daily monitoring of acute infectious diseases (e.g., COVID-19), and the annual monitoring of colorectal and breast cancer recurrences based on multiple surveillance streams.
Appendix: Details for Calculating Proposed Adjusted Bayesian Credible Intervals
Based on each posterior draw (j = 1,…, D), we calculate the posterior probability of capture in Stream 1 as , from which the posterior unconditional capture probabilities are derived as , , and . The posterior probability of identification in at least one of the two captures follows as . From this, we obtain a draw from the posterior distribution of N conditional on the observed number of unique individuals captured (nc = n11 + n10 + n01), that is,
| (A.1) |
The variance of the posterior distribution obtained via (A.1) is too small to make an appropriate basis for a Bayesian credible interval, due to conditioning on the observed value nc. Thus for the jth draw , we generate a new value ncj from the Binomial(, ) distribution. Posterior draws for the three observed cell counts accounting for unconditioning with respect to nc follow as: , , and . We then compute a posterior draw mimicking the estimand of interest, which is (4), as follows:
| (A.2) |
This posterior distribution essentially mimics the sampling distribution of in (2), whereas greater precision is available in the estimator in (4). An asymptotically negligible adjustment to calibrate the posterior variance by first scaling and then re-shifting each draw (A.2) based on the originally obtained frequentist point and variance estimates is as follows:
| (A.3) |
where
We obtain the final adjusted 95 percent Bayesian credible interval by taking the 2.5th and 97.5th sample percentiles of the D draws () in (A.3).
Contributor Information
Robert H Lyles, Professor with the Department of Biostatistics and Bioinformatics, The Rollins School of Public Health of Emory University, 1518 Clifton Rd. N.E., Atlanta, GA 30322, USA.
Yuzi Zhang, Research Assistant with the Department of Biostatistics and Bioinformatics, The Rollins School of Public Health of Emory University, 1518 Clifton Rd. N.E., Atlanta, GA 30322, USA.
Lin Ge, Research Assistant with the Department of Biostatistics and Bioinformatics, The Rollins School of Public Health of Emory University, 1518 Clifton Rd. N.E., Atlanta, GA 30322, USA.
Cameron England, Associate Director with the Department of Epidemiology, The Rollins School of Public Health of Emory University, Atlanta, GA, USA.
Kevin Ward, Research Assistant Professor with the Department of Epidemiology, The Rollins School of Public Health of Emory University, Atlanta, GA, USA.
Timothy L Lash, Professor with the Department of Epidemiology, The Rollins School of Public Health of Emory University, Atlanta, GA, USA.
Lance A Waller, Professor, Department of Biostatistics and Bioinformatics, The Rollins School of Public Health of Emory University, Atlanta, GA, USA.
We thank Drs. Aaron Siegler, Sarita Shah, and Imon Banerjee for helpful discussions, and the Editor, Associate Editor, and Reviewer for insightful comments that improved the manuscript.
Partial support was provided by the National Institutes of Health-funded Emory Center for AIDS Research (P30AI050409; Del Rio PI), the National Center for Advancing Translational Sciences of the National Institutes of Health (UL1TR002378; Taylor PI), and the National Institutes of Health/National Cancer Institute-funded Cancer Recurrence Information and Surveillance Program (CRISP) study (1 R01 CA208367-01; Ward/Lash MPIs). The content is the sole responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
REFERENCES
- Agresti A. (1994), “Simple Capture-Recapture Models Permitting Unequal Catchability and Variable Sampling Effort,” Biometrics, 50, 494–500. [PubMed] [Google Scholar]
- Baillargeon S., Rivest L.-P. (2007), “Rcapture: Loglinear Models for Capture-Recapture in R,” Journal of Statistical Software, 19, 1–31.21494410 [Google Scholar]
- Bell W. R. (1993), “Using Information from Demographic Analysis in Post-Enumeration Survey Estimation,” Journal of the American Statistical Association, 88, 1106–1118. [PubMed] [Google Scholar]
- Borchers D. L., Buckland S. T., Zucchini W. (2002), Estimating Animal Abundance: Closed Populations, London: Springer-Verlag. [Google Scholar]
- Böhning D., van der Heijden P.G.M., Bunge J. (eds.) (2018), Capture-Recapture Methods for the Social and Medical Sciences, Boca Raton, FL: Taylor and Francis Group LLC. [Google Scholar]
- Chandra Sekar C., Deming W. E. (1949), “On a Method of Estimating Birth and Death Rates and the Extent of Registration,” Journal of the American Statistical Association, 44, 101–115. [Google Scholar]
- Chao A. (1987), “Estimating the Population Wize for Capture-Recapture Data with Unequal Catchability,” Biometrics, 43, 783–791. [PubMed] [Google Scholar]
- Chao A., Pan H.-Y., Chiang S.-C. (2008), “The Petersen-Lincoln Estimator and Its Extension to Estimate the Size of a Shared Population,” Biometrical Journal, 50, 957–970. [DOI] [PubMed] [Google Scholar]
- Chao A., Tsay P. K. (1998), “A Sample Coverage Approach to Multiple-System Estimation with Application to Census Undercount,” Journal of the American Statistical Association, 93, 283–293. [Google Scholar]
- Chao A., Tsay P. K., Lin S.-H., Shau W.-Y., Chao D.-Y. (2001), “The Applications of Capture-Recapture Models to Epidemiological Data,” Statistics in Medicine, 20, 3123–3157. [DOI] [PubMed] [Google Scholar]
- Chapman D. G. (1951), “Some Properties of the Hypergeometric Distribution with Applications to Zoological Simple Censuses,” University of California Publications in Statistics, 1, 131–160. [Google Scholar]
- Chen J. (2020), “Sensitivity and Uncertainty Analysis for Two-Stream Capture-Recapture in Epidemiological Surveillance,” Master of Science in Public Health Thesis, Department of Biostatistics and Bioinformatics, The Rollins School of Public Health, Emory University.
- Cochran W. D. (1977), Sampling Techniques (3rd ed.), New York: John Wiley & Sons. [Google Scholar]
- Darroch J. N. (1958), “The Multiple Recapture Census I. Estimation of a Closed Population,” Biometrika, 45, 343–359. [Google Scholar]
- Elliott M. R., Little R. J. A. (2005), “A Bayesian Approach to 2000 Census Evaluation Using ACE Survey Data and Demographic Analysis,” Journal of the American Statistical Association, 100, 380–388. [Google Scholar]
- Hickman M., Higgins V., Hope V., Bellis M., Tilling K., Walker A., Henry J. (2004), “Injecting Drug Use in Brighton, Liverpool, and London: Best Estimates of Prevalence and Coverage of Public Health Indicators,” Journal of Epidemiological Community Health, 58, 766–771. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hook E. B., Regal R. R. (1995), “Capture-Recapture Methods in Epidemiology: Methods and Limitations,” Epidemiologic Reviews, 17, 243–263. [DOI] [PubMed] [Google Scholar]
- Horvitz D. G., Thompson D. J. (1952), “A Generalization of Sampling without Replacement from a Finite Population,” Journal of the American Statistical Association, 47, 663–685. [Google Scholar]
- Jensen A. L. (1989), “Confidence Intervals for Nearly Unbiased Estimators in Single-Mark and Single-Recapture Experiments,” Biometrics, 45, 1233–1237.
- Jones H. E., Hickman M., Welton N. J., De Angelis D., Harris R. J., Ades A. E. (2014), “Recapture or Precapture? Fallibility of Standard Capture-Recapture Methods in the Presence of Referrals between Sources,” American Journal of Epidemiology, 179, 1383–1393.
- Krebs C. J. (1998), Ecological Methodology, Menlo Park, CA: Benjamin/Cummings.
- Lash T. L., Cronin-Fenton D., Ahern T. P., Rosenberg C. L., Lunetta K. L., Silliman R. A., Garne J. P., Sorensen H. T., Hellberg Y., Christensen M., Pedersen L., Hamilton-Dutoit S. (2011), “CYP2D6 Inhibition and Breast Cancer Recurrence in a Population-Based Study in Denmark,” Journal of the National Cancer Institute, 103, 489–500.
- Lincoln F. C. (1930), “Calculating Waterfowl Abundance on the Basis of Banding Returns,” U.S. Department of Agriculture Circular, 118, 1–4.
- Lyles R. H., Wilkinson A. L., Williamson J. M., Chen J., Taylor A. W., Jambai A., Jallow M., Kaiser R. (2021), “Alternative Capture-Recapture Point and Interval Estimators Based on Two Surveillance Streams,” in Modern Statistical Methods for Health Research, eds. Y. Zhao and D.-G. Chen, pp. 43–82, Cham, Switzerland: Springer.
- McClish D., Penberthy L. (2004), “Using Medicare Data to Estimate the Number of Cases Missed by a Cancer Registry: A 3-Source Capture–Recapture Model,” Medical Care, 42, 1111–1116.
- Parkin D. M., Bray F. (2009), “Evaluation of Data Quality in the Cancer Registry: Principles and Methods Part II. Completeness,” European Journal of Cancer, 45, 756–764.
- Petersen C. G. J. (1896), “The Yearly Immigration of Young Plaice into the Limfjord from the German Sea,” Report of the Danish Biological Station, 6, 5–48.
- Platt L., Hickman M., Rhodes T., Mikhailova L., Karavashkin V., Vlasov A., Tilling K., Hope V., Khutorksoy M., Renton A. (2004), “The Prevalence of Injecting Drug Use in a Russian City: Implications for Harm Reduction and Coverage,” Addiction, 99, 1430–1438.
- Poorolajal J., Mohammadi Y., Farzinara F. (2017), “Using the Capture-Recapture Method to Estimate the Human Immunodeficiency Virus-Positive Population,” Epidemiology and Health, 39, e2017042.
- Sadinle M. (2009), “Transformed Logit Confidence Intervals for Small Populations in Single Capture-Recapture Estimation,” Communications in Statistics – Simulation and Computation, 38, 1909–1924.
- Sangeetha U., Subbiah M., Srinivasan M. R. (2013), “Estimation of Confidence Intervals for Multinomial Proportions of Sparse Contingency Tables Using Bayesian Methods,” International Journal of Scientific and Research Publications, 3, 1–7.
- SAS Institute, Inc. (2008), SAS IML User’s Guide 9.2, Cary, NC: SAS Institute Inc.
- Schnabel Z. E. (1938), “The Estimation of the Total Fish Population of a Lake,” American Mathematical Monthly, 45, 348–352.
- Seber G. A. F. (1982), The Estimation of Animal Abundance and Related Parameters, Caldwell, NJ: The Blackburn Press.
- van der Heijden P. G. M., Whittaker J., Cruyff M., Bakker B., van der Vliet R. (2012), “People Born in the Middle East but Residing in The Netherlands: Invariant Population Size Estimates and the Role of Active and Passive Covariates,” The Annals of Applied Statistics, 6, 831–852.
- Wittes J. T. (1972), “On the Bias and Estimated Variance of Chapman’s Two-Sample Capture-Recapture Population Estimate,” Biometrics, 28, 592–597.
- Wittes J. T. (1974), “Applications of a Multinomial Capture-Recapture Model to Epidemiological Data,” Journal of the American Statistical Association, 69, 93–97.
- Wolter K. M. (1990), “Capture-Recapture Estimation in the Presence of a Known Sex Ratio,” Biometrics, 46, 157–162.
- World Health Organization (2012), Assessing Tuberculosis Under-Reporting Through Inventory Studies, Geneva, Switzerland: World Health Organization.
- Wu C., Chang H.-G., McNutt L.-A., Smith P. F. (2005), “Estimating the Mortality Rate of Hepatitis C Using Multiple Data Sources,” Epidemiology & Infection, 133, 121–125.
- Zhang B., Small D. S. (2020), “Number of Healthcare Workers Who Have Died of COVID-19,” Epidemiology, 31, e46.
