Abstract
Group testing, introduced by Dorfman (1943), has been used to reduce costs when estimating the prevalence of a binary characteristic based on a screening test of $n/c$ groups that include $n$ independent individuals in total. If the unknown prevalence is low and the screening test suffers from misclassification, it is also possible to obtain more precise prevalence estimates than those obtained from testing all $n$ samples separately (Tu et al., 1994). In some applications, the individual binary response corresponds to whether an underlying time-to-event variable $T$ is less than an observed screening time $C$, a data structure known as current status data. Given sufficient variation in the observed $C$ values, it is possible to estimate the distribution function $F$ of $T$ nonparametrically, at least at some points in its support, using the pool-adjacent-violators algorithm (Ayer et al., 1955). Here, we consider nonparametric estimation of $F$ based on group-tested current status data for groups of size $c$, where the group tests positive if and only if any individual's unobserved $T$ is less than the corresponding observed $C$. We investigate the performance of the group-based estimator as compared to the individual test nonparametric maximum likelihood estimator, and show that the former can be more precise in the presence of misclassification for low values of $F$. Potential applications include testing for the presence of various diseases in pooled samples where interest focuses on the age-at-incidence distribution rather than overall prevalence. We apply this estimator to the age-at-incidence curve for hepatitis C infection in a sample of U.S. women who gave birth to a child in 2014, where group assignment is done both at random and based on maternal age. We discuss connections to other work in the literature, as well as potential extensions.
Keywords: Current status data, Expectation-maximization algorithm, Group testing, Pool-adjacent-violators algorithm
1. Introduction
In the past decade, group testing of a binary response has once again become a topic of great interest (Remlinger et al., 2006; Wahed et al., 2006; Dhand et al., 2010). The idea was first introduced in 1943 as a potential cost-saving measure for the detection of syphilis in U.S. army recruits (Dorfman, 1943). Group testing reduces the number of tests by allocating, randomly or otherwise, $n$ individuals into $n/c$ groups of equal size $c$ and testing each pooled group only once, in order to provide an estimate of the prevalence of a binary characteristic in a population.
More recent work has considered potential issues with group testing, such as dilution effects, non-random group assignment, and misclassification (Hwang, 1976; Wein & Zenios, 1996; Delaigle & Hall, 2012; Liu et al., 2012). Tu et al. (1995) suggested that if the unknown prevalence of a binary characteristic is sufficiently low and the screening test suffers from misclassification, more precise estimates of the prevalence can be obtained from $n/c$ group tests than by testing all $n$ individuals separately. The intuition behind this finding is as follows. When a test has a rate of misclassification independent of the number of individuals in the pooled sample, performing fewer tests introduces less misclassification noise into the observations, which can increase the precision of the prevalence estimate. This is particularly the case when the prevalence is sufficiently small, so that it is uncommon for two positives to occur in the same group.
The data structure in which an individual's binary response corresponds to an underlying time-to-event variable $T$ occurring before an observed screening time $C$ is known as current status data, or interval censoring of type I (Jewell & van der Laan, 2003; Jewell & Emerson, 2013). The nonparametric maximum likelihood estimator of the distribution function, $F$, of $T$ for current status data is computed by the pool-adjacent-violators algorithm, although this estimator is only useful if there is sufficient variation in the observed screening times (Ayer et al., 1955).
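As an illustration of this classical estimator, the pool-adjacent-violators algorithm can be implemented in a few lines. The following Python sketch, with function and variable names of our own choosing and purely illustrative toy data, computes the current status nonparametric maximum likelihood estimator by isotonic regression of the binary indicators on the sorted screening times.

```python
import numpy as np

def pava(values, weights):
    """Weighted pool-adjacent-violators: non-decreasing fit to `values`."""
    # Maintain blocks of (weighted mean, total weight, size); merge adjacent
    # blocks while the monotonicity constraint is violated.
    means, wts, sizes = [], [], []
    for v, w in zip(values, weights):
        means.append(v); wts.append(w); sizes.append(1)
        while len(means) > 1 and means[-2] > means[-1]:
            m2, w2, s2 = means.pop(), wts.pop(), sizes.pop()
            means[-1] = (wts[-1] * means[-1] + w2 * m2) / (wts[-1] + w2)
            wts[-1] += w2
            sizes[-1] += s2
    out = []
    for m, s in zip(means, sizes):
        out.extend([m] * s)
    return np.array(out)

# Current status NPMLE: isotonic regression of y_i = 1(t_i <= c_i)
# on the screening times c_i, taken in increasing order.
c = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.0, 1.0, 0.0, 1.0, 1.0])
order = np.argsort(c)
F_hat = pava(y[order], np.ones(len(y)))
print(F_hat)  # non-decreasing step estimates of F at the sorted c values
```

The adjacent violating pair $(1, 0)$ at the second and third screening times is pooled to its average, yielding a monotone step function.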
In this paper, we develop a simple algorithm to compute a nonparametric maximum likelihood estimator of $F$ for group-tested current status data, and extend it to settings where the test is subject to misclassification. When misclassification is present, we hypothesize that there will sometimes be substantial gains in precision at values of $t$ for which $F(t)$ is sufficiently small, as described by Tu et al. (1995) in the case of estimating a single fixed prevalence.
2. Notation and likelihood function
We assume that the underlying data, prior to grouping, arise from $n$ independent realizations of a bivariate random variable, $(T, C)$, where the survival random variable $T$ and the screening random variable $C$ follow distribution functions $F$ and $G$, respectively. Throughout, we assume that $T$ and $C$ are independent. The observed data are based on grouping these realizations at random into blocks of size $c$, where for convenience we assume that $n/c$ is an integer. It is trivial to extend all the results below to situations where the block sizes may vary. Thus each original unit corresponds to the $i$th individual in the $j$th group, where $i = 1, \ldots, c$ and $j = 1, \ldots, n/c$. The group-tested result from the $j$th group, $Z_j$, is the only test result available, whereas individual screening times, $C_{ij}$, are observed for all participants. Specifically, $Z_j = 0$ if and only if $T_{ij} > C_{ij}$ for all $i = 1, \ldots, c$, and $Z_j = 1$ otherwise. The group test detects the presence of one or more positives in the group, but cannot distinguish between a single, or several, positive individuals. The immediate goal is to estimate the distribution function $F$.
Owing to the assumed independence of $T$ and $C$, we can focus on the conditional likelihood of the data given the observed screening times $c_{ij}$. Since $\mathrm{pr}(Z_j = 0) = \prod_{i=1}^{c} S(c_{ij})$, this conditional likelihood is

$$L(F) = \prod_{j=1}^{n/c} \Bigl\{ 1 - \prod_{i=1}^{c} S(c_{ij}) \Bigr\}^{z_j} \Bigl\{ \prod_{i=1}^{c} S(c_{ij}) \Bigr\}^{1 - z_j}, \tag{1}$$

where $S = 1 - F$ is the survival function of $T$. This conditional likelihood applies to various methods of selecting the screening times $C$ and assigning the observations to groups for testing. At one extreme, the $C$ values in each group are selected completely at random; at the other end of the spectrum, individuals with a common value of $C$ are assigned to the same group. The latter sampling scheme is only fully feasible if the distribution function $G$ is discrete. While the estimation strategy pursued here applies generally, estimation is much simpler with a common $C$ value in each group, and asymptotic properties of the estimator are more easily derived in that case. For example, with a common value of $C$ in each grouping of fixed group size $c$, the likelihood (1) simplifies to that for the standard current status data problem with underlying survival function $S^c$. Estimates and inference regarding $S^c$ can then be immediately translated to corresponding statements regarding $S$, and hence $F$, itself. In practice, with a continuous $G$, it may be advantageous to group together individuals with approximately the same value of $C$.
This development assumes a perfect screening test of whether or not the true group test result, $Z_j$, was positive. We can extend these ideas to permit misclassification of the test results, and we now use the notation $Z_j^*$ to distinguish the potentially misclassified test result from the true result $Z_j$. Assume that the test has known sensitivity and specificity, independent of both the screening time $C$ and the group size, given by $\alpha = \mathrm{pr}(Z_j^* = 1 \mid Z_j = 1)$ and $\beta = \mathrm{pr}(Z_j^* = 0 \mid Z_j = 0)$, with the assumption that $\alpha + \beta > 1$. Then the conditional likelihood of the potentially misclassified data, given the observed screening times $c_{ij}$, can be written as

$$L^*(F) = \prod_{j=1}^{n/c} \bigl\{ \alpha (1 - \gamma_j) + (1 - \beta) \gamma_j \bigr\}^{z_j^*} \bigl\{ (1 - \alpha)(1 - \gamma_j) + \beta \gamma_j \bigr\}^{1 - z_j^*},$$

where $\gamma_j = \prod_{i=1}^{c} S(c_{ij}) = \mathrm{pr}(Z_j = 0)$.
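This conditional likelihood is straightforward to evaluate numerically. The following Python sketch, with function and variable names of our own and purely illustrative toy values, computes its logarithm; setting $\alpha = \beta = 1$ recovers the perfectly classified likelihood (1).

```python
import numpy as np

def group_loglik(F_vals, z_star, groups, alpha=1.0, beta=1.0):
    """Log-likelihood of (possibly misclassified) group test results.

    F_vals : array of F(c_ij), one entry per individual
    z_star : observed group results, one per group
    groups : list of index arrays giving the individuals in each group
    """
    ll = 0.0
    for z, idx in zip(z_star, groups):
        gamma = np.prod(1.0 - F_vals[idx])               # pr(true group result = 0)
        p_pos = alpha * (1 - gamma) + (1 - beta) * gamma  # pr(observed positive)
        ll += np.log(p_pos) if z == 1 else np.log(1 - p_pos)
    return ll

# Two groups of size 2; with the default alpha = beta = 1 this is likelihood (1).
F_vals = np.array([0.3, 0.3, 0.1, 0.1])
groups = [np.array([0, 1]), np.array([2, 3])]
print(group_loglik(F_vals, [1, 0], groups))
```

In an applied setting this function would be evaluated at the candidate solutions returned from different starting values, in order to compare their achieved likelihoods.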
3. An expectation-maximization pool-adjacent-violators algorithm
3.1. Development of the algorithm
Group-tested current status data can be formulated as a missing data problem. First, consider the setting without misclassification of test results. While the full set of screening times is observed, only group-tested results $z_j$ are available, whereas a complete dataset would include all individual test results, $y_{ij}$. This missing information setting naturally allows use of the expectation-maximization algorithm (Dempster et al., 1977).
To implement the expectation-maximization algorithm, we calculate the expected value of the true individual test result, $Y_{ij} = 1(T_{ij} \le c_{ij})$, given the observed value of the group-tested result, $Z_j$, based on a current estimate of $F$. These calculations are straightforward when there is no misclassification:

$$E(Y_{ij} \mid Z_j = 0) = 0, \tag{2}$$

$$E(Y_{ij} \mid Z_j = 1) = \frac{F(c_{ij})}{1 - \prod_{k=1}^{c} S(c_{kj})}. \tag{3}$$

For misclassified data with sensitivity $\alpha$ and specificity $\beta$, computing the expected value of an individual true disease status $Y_{ij}$ given the potentially misclassified observed group-test result $Z_j^*$ becomes slightly more complicated; see the Supplementary Material. Letting $\gamma_j = \prod_{i=1}^{c} S(c_{ij})$, this step becomes

$$E(Y_{ij} \mid Z_j^* = z_j^*) = \frac{\alpha^{z_j^*} (1 - \alpha)^{1 - z_j^*} F(c_{ij})}{\{\alpha (1 - \gamma_j) + (1 - \beta) \gamma_j\}^{z_j^*} \{(1 - \alpha)(1 - \gamma_j) + \beta \gamma_j\}^{1 - z_j^*}}.$$
For the maximization step, we simply use a weighted version of the pool-adjacent-violators algorithm on the full dataset of individual observations. An individual contributes the observation $y_{ij} = 0$ with weight 1 if $z_j = 0$, per (2). On the other hand, according to (3), if $z_j = 1$, then the individual contributes the observation $y_{ij} = 1$ with weight given by the right-hand side of (3), together with an additional observation $y_{ij} = 0$ whose weight is 1 minus the right-hand side of (3). The complete algorithm is thus described as follows.
Step 1 (Initialization). Initialize values of $\hat F(c_{ij})$ for each individual and set a threshold $\epsilon$ for convergence.

Step 2 (Expectation). For each individual $i$ in group $j$, calculate the probability $p_{ij}$ that the individual tested positive, given their group's test result. For perfectly classified results, $Z_j$, use

$$p_{ij} = z_j \frac{\hat F(c_{ij})}{1 - \prod_{k=1}^{c} \{1 - \hat F(c_{kj})\}}. \tag{4}$$

For group-tested results subject to misclassification, $Z_j^*$, with sensitivity $\alpha$ and specificity $\beta$ such that $\alpha + \beta > 1$, use

$$p_{ij} = \frac{\alpha^{z_j^*} (1 - \alpha)^{1 - z_j^*} \hat F(c_{ij})}{\{\alpha (1 - \hat\gamma_j) + (1 - \beta) \hat\gamma_j\}^{z_j^*} \{(1 - \alpha)(1 - \hat\gamma_j) + \beta \hat\gamma_j\}^{1 - z_j^*}}, \tag{5}$$

where $\hat\gamma_j = \prod_{i=1}^{c} \{1 - \hat F(c_{ij})\}$.
Step 3 (Maximization). Use the group-tested results, $z_j$ or $z_j^*$, as the observations for each individual, and the probabilities from Step 2 as the weights in the weighted pool-adjacent-violators algorithm to calculate updated estimates of $\hat F(c_{ij})$.
Step 4. Repeat Steps 2 and 3, using the estimate of $\hat F$ from Step 3 as the initial value for Step 2, until convergence, for example until the sum of the squared differences between successive estimates falls below $\epsilon$.

It is important to run the algorithm with several choices of starting values, not only to reduce the possibility of converging to a local extremum, but also to discover possibly different nonunique versions of the nonparametric maximum likelihood estimator. We recommend choosing a large set of random starting values of $\hat F$ at the observed screening times by generating random Un$[0, 1]$ values ordered so that the starting values are monotonically increasing with $c_{ij}$.
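The algorithm described in Steps 1–4 above can be sketched in Python as follows. This is an illustrative implementation under our notation, with function names of our own; it is not production code, and the grouping is assumed to be encoded by integer labels $0, \ldots, n/c - 1$.

```python
import numpy as np

def pava(values, weights):
    """Weighted pool-adjacent-violators: non-decreasing fit to `values`."""
    means, wts, sizes = [], [], []
    for v, w in zip(values, weights):
        means.append(v); wts.append(w); sizes.append(1)
        while len(means) > 1 and means[-2] > means[-1]:
            m2, w2, s2 = means.pop(), wts.pop(), sizes.pop()
            means[-1] = (wts[-1] * means[-1] + w2 * m2) / (wts[-1] + w2)
            wts[-1] += w2
            sizes[-1] += s2
    return np.concatenate([np.full(s, m) for m, s in zip(means, sizes)])

def em_pava(c, z_star, group, alpha=1.0, beta=1.0, eps=1e-8, max_iter=5000):
    """EM-PAVA for group-tested current status data (one random start).

    c      : screening times, one per individual
    z_star : observed (possibly misclassified) group results, indexed by label
    group  : integer group label of each individual
    """
    order = np.argsort(c)
    F = np.empty(len(c))
    F[order] = np.sort(np.random.uniform(size=len(c)))  # monotone random start
    for _ in range(max_iter):
        # E-step: pr(individual positive | group result), per (4)-(5).
        p = np.empty_like(F)
        for j in np.unique(group):
            idx = group == j
            gamma = np.prod(1.0 - F[idx])
            if z_star[j] == 1:
                num, den = alpha * F[idx], alpha * (1 - gamma) + (1 - beta) * gamma
            else:
                num, den = (1 - alpha) * F[idx], (1 - alpha) * (1 - gamma) + beta * gamma
            p[idx] = num / den
        # M-step: each individual contributes y=1 with weight p and y=0 with
        # weight 1-p, so the unrestricted mean at c_i is p_i with unit weight.
        F_new = np.empty_like(F)
        F_new[order] = pava(p[order], np.ones(len(p)))
        if np.sum((F_new - F) ** 2) < eps:
            return F_new
        F = F_new
    return F
```

Running `em_pava` repeatedly from different random starts, and comparing the achieved likelihoods, addresses the nonuniqueness discussed above.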
3.2. Comments regarding asymptotics
Asymptotic results for standard current status data are nonstandard. The nonparametric maximum likelihood estimator is known to be consistent, although converging only at the rate $n^{1/3}$, and has a non-Gaussian limiting distribution related to Chernoff's distribution (Groeneboom & Wellner, 1992) in situations where the monitoring time distribution, $G$, is continuous; Banerjee (2012) provides a concise discussion of this result. Rather than using Wald-type pointwise confidence intervals derived from this limit, Banerjee & Wellner (2001, 2005) suggest using a likelihood ratio approach to construct confidence bands.
On the other hand, when $G$ has finite support, the likelihood is parametric, since $F$ can then be estimated only at this finite number of support points, namely the observed censoring times. As expected from this observation, the nonparametric maximum likelihood estimator now converges to a Gaussian limit at rate $n^{1/2}$, with the asymptotic variance at a specific monitoring time $t$ given simply by $F(t)\{1 - F(t)\}$ divided by the expected number of individuals screened at $t$, which is straightforward to estimate using the obvious plug-in estimators (Yu et al., 1998; Maathuis & Hudgens, 2011). The hybrid problem where the number of support points grows with the sample size is discussed beautifully in Tang et al. (2012). Sal y Rosas & Hughes (2010) proposed the inversion of a likelihood ratio test to obtain pointwise confidence intervals for $F$ when the data are subject to misclassification.
These results can be applied directly to the group-testing scenario only in the simplest situations. For the extreme situation of only one monitoring time, estimation of $F$ reduces to the simple estimation of prevalence. This scenario has been studied extensively in the literature on group testing with misclassification; for example, Tu et al. (1994) provided asymptotically normal confidence intervals with convergence rate $n^{1/2}$. Generalizing slightly, the situation with finite support for $G$, and with no misclassification, simplifies to the case considered by Yu et al. (1998) if individuals within a group all share a common value of $C$. In this case, $\mathrm{pr}(Z_j = 0) = S(c_j)^c$, so that asymptotic results for the nonparametric maximum likelihood estimator applied to the group-tested data immediately follow through for the plug-in estimator of $S$, or $F = 1 - S$, at the finite number of screening times by using the delta method. We anticipate that this will extend straightforwardly in the presence of misclassification, and we also suggest that use of the bootstrap will be effective here.
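To make this plug-in calculation concrete, the following Python sketch (the function name is ours, and the inputs are illustrative) estimates $F(t)$ for groups of size $c$ that share a common screening time $t$, with no misclassification: such a group tests negative with probability $S(t)^c$, so $\hat S(t)$ is the $c$th root of the observed proportion of negative groups, and the delta method gives a standard error.

```python
import numpy as np

def f_hat_common_time(neg_groups, n_groups, group_size):
    """Plug-in estimate of F(t), with delta-method standard error, when all
    members of each group share the screening time t (no misclassification)."""
    q = neg_groups / n_groups          # estimate of pr(Z = 0) = S(t)^c
    s = q ** (1.0 / group_size)        # estimate of S(t)
    # delta method: dS/dq = q^(1/c - 1) / c, and var(q_hat) = q(1 - q)/m
    var_s = (q ** (2.0 / group_size - 2.0) / group_size ** 2) * q * (1 - q) / n_groups
    return 1.0 - s, np.sqrt(var_s)

# Illustration: 180 of 200 groups of size 5 test negative at time t.
F_t, se = f_hat_common_time(neg_groups=180, n_groups=200, group_size=5)
print(F_t, se)
```

A Wald-type 95% interval is then $\hat F(t) \pm 1.96\,\mathrm{se}$, consistent with the Gaussian limit described above for a finite number of screening times.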
Even with a finite number of monitoring times, the situation becomes more complex when screening times are randomly assigned to the groups. This is clear even in the case of only two monitoring times and with pair groupings done at random. Further, there are as yet no known asymptotic results for the nonparametric maximum likelihood estimator of §3.1 with a continuous screening time distribution, although we anticipate that convergence will remain at an $n^{1/3}$ rate.
4. Elementary example
4.1. An analytic solution
For illustration, consider a simple example in a setting without misclassified test results, where there are two groups each containing two individuals; that is, $n = 4$ and $c = 2$. There are twelve possible combinations of group assignments and test results, corresponding to three different possible pair assignments with each pair having two possible test outcomes. Consideration of the conditional likelihood (1) reveals a simple solution in all but one of these cases; we focus on the remaining case, which has the grouping shown in Fig. 1: with ordered screening times $c_1 < c_2 < c_3 < c_4$, the first group contains the individuals screened at $c_1$ and $c_3$ and has tested positive, $z_1 = 1$, while the second group contains the individuals screened at $c_2$ and $c_4$ and has tested negative, $z_2 = 0$.
Fig. 1.

Elementary example of data configuration with two groups, each of size 2, where the first group has tested positive and the second group has tested negative.
The conditional likelihood (1) in this setting is

$$L(F) = \bigl[ 1 - \{1 - F(c_1)\}\{1 - F(c_3)\} \bigr] \{1 - F(c_2)\}\{1 - F(c_4)\}.$$

It is immediate that the nonparametric maximum likelihood estimator must have $\hat F(c_1) = \hat F(c_2)$ and $\hat F(c_3) = \hat F(c_4)$. Hence, the nonparametric maximum likelihood estimator is not unique but is achieved by any set of values with $\hat F(c_1) = \hat F(c_2)$, $\hat F(c_3) = \hat F(c_4)$ and $\{1 - \hat F(c_1)\}\{1 - \hat F(c_3)\} = 1/2$. We show how the expectation-maximization pool-adjacent-violators algorithm converges to one such solution, with the specific value depending directly on the starting values for $\hat F$.
Given an initial set of values $\hat F(c_1)$, $\hat F(c_2)$, $\hat F(c_3)$ and $\hat F(c_4)$ such that $\hat F(c_1) \le \hat F(c_2) \le \hat F(c_3) \le \hat F(c_4)$, the first step of the algorithm calculates each of the conditional probabilities, $p_{ij}$ ($i = 1, 2$; $j = 1, 2$), as given in (4) and (5), i.e., the probability that each individual was positive given the known group-tested result. For two of these probabilities, in a setting without misclassification, this calculation is trivial: the second pair tested negative, so neither of its individuals was positive. Hence we can set $p_{12} = p_{22} = 0$. For the pair that tested positive, this calculation follows directly from (4):

$$p_{11} = \frac{\hat F(c_1)}{1 - \{1 - \hat F(c_1)\}\{1 - \hat F(c_3)\}}, \qquad p_{21} = \frac{\hat F(c_3)}{1 - \{1 - \hat F(c_1)\}\{1 - \hat F(c_3)\}}.$$
The next step of the algorithm applies the weighted pool-adjacent-violators algorithm to make the resulting unrestricted estimates, $(p_{11}, 0, p_{21}, 0)$ at the ordered times $c_1 < c_2 < c_3 < c_4$, monotonic. This yields the following updated estimates of $F$:

$$\hat F(c_1) = \hat F(c_2) = p_{11}/2, \tag{6}$$

$$\hat F(c_3) = \hat F(c_4) = p_{21}/2. \tag{7}$$

These steps are then iterated until a determination of convergence based on comparing, say, the sum of the squared differences between successive estimates of $F$ at each observed screening time to a prespecified threshold $\epsilon$.
4.2. Multiple convergence values
As we demonstrated in §4.1, the initial values for the pair that tested negative, $\hat F(c_2)$ and $\hat F(c_4)$, are not relevant to the update step in our expectation-maximization pool-adjacent-violators algorithm. Therefore, when discussing convergence of the algorithm, we will only consider initial values for $\hat F(c_1)$ and $\hat F(c_3)$.
In all settings where $\hat F(c_1) = \hat F(c_3)$, the update step given by (6) and (7) becomes

$$\hat F(c_1) \leftarrow \frac{\hat F(c_1)}{2 [1 - \{1 - \hat F(c_1)\}^2]} = \frac{1}{2 \{2 - \hat F(c_1)\}}.$$

Therefore, at convergence, $2 \hat F(c_1) \{2 - \hat F(c_1)\} = 1$, so that the algorithm converges to $\hat F(c_1) = \hat F(c_3) = 1 - 2^{-1/2}$, the only solution in $[0, 1]$. This can, of course, also be expressed as $\hat F(c_k) = 1 - 2^{-1/2}$ for $k = 1, \ldots, 4$.
For any other set of starting values, the ratio $\hat F(c_3) / \hat F(c_1)$ remains unchanged by the iterations. We can therefore write $\hat F(c_3) = \rho \hat F(c_1)$, where $\rho > 1$ and stays fixed, as determined by the starting values for $\hat F(c_1)$ and $\hat F(c_3)$. At convergence, (6) then simplifies to

$$\hat F(c_1) = \frac{\hat F(c_1)}{2 [1 - \{1 - \hat F(c_1)\}\{1 - \rho \hat F(c_1)\}]}.$$

Thus, convergence occurs when $2 [1 - \{1 - \hat F(c_1)\}\{1 - \rho \hat F(c_1)\}] = 1$. After an application of the quadratic formula, this simplifies to

$$\hat F(c_1) = \frac{(1 + \rho) - \{(1 + \rho)^2 - 2\rho\}^{1/2}}{2\rho},$$

the only feasible solution. It immediately follows that at convergence, $\{1 - \hat F(c_1)\}\{1 - \hat F(c_3)\} = 1/2$, so that the condition noted in §4.1 holds.
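This fixed-point calculation is easy to verify numerically. The following Python sketch (variable names are ours, and the starting values are arbitrary illustrative choices) iterates the updates (6) and (7) and compares the limit with the closed-form solution.

```python
# Numerical check of the fixed point: f1 and f3 denote the current
# estimates of F(c1) and F(c3) for the pair that tested positive.
f1, f3 = 0.10, 0.40              # arbitrary starting values with f1 <= f3
rho = f3 / f1                    # ratio fixed by the starting values
for _ in range(200):
    D = 1 - (1 - f1) * (1 - f3)          # pr(group tests positive)
    f1, f3 = f1 / (2 * D), f3 / (2 * D)  # update steps (6) and (7)

closed_form = ((1 + rho) - ((1 + rho) ** 2 - 2 * rho) ** 0.5) / (2 * rho)
print(f1, closed_form)             # the iterates agree with the formula
print((1 - f1) * (1 - f3))         # the product condition from Section 4.1
```

Different starting ratios $\rho$ lead to different limits, all satisfying $\{1 - \hat F(c_1)\}\{1 - \hat F(c_3)\} = 1/2$, illustrating the nonuniqueness numerically.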
This simple example demonstrates the nonuniqueness of the nonparametric maximum likelihood estimator, with the algorithm converging to a specific solution for $\hat F$ determined by the ratio of the starting values of $\hat F$ at $c_1$ and $c_3$. When using this algorithm in an applied setting, we suggest repeating it many times, using a different set of randomly drawn starting values each time, and then computing the likelihood function to identify as many different unique solutions to the optimization as possible.
5. Simulations
5.1. Design of simulations
We carry out two series of simulations to examine the behaviour of the expectation-maximization pool-adjacent-violators algorithm for group-tested data, as compared to the pool-adjacent-violators algorithm, which is the nonparametric maximum likelihood estimator for individual-level current status data (Barlow et al., 1972). We consider two scenarios, one where the tests are subject to no misclassification, and another where the test is subject to misclassification with known, constant error rates. In the latter case, the comparative estimator for misclassified individual-level current status data was derived by McKeown & Jewell (2010). We consider both continuous and discrete independent screening times. The former are described and discussed below, and the latter in the Supplementary Material.
Each simulation is characterized by a set of fixed parameters: $n$, the number of individuals; $c$, the group size; and $\alpha$ and $\beta$, the sensitivity and specificity of the screening test, respectively. We set $\alpha = \beta = 1$ in scenarios without misclassification. We first simulate traditional current status data for each individual from the distribution of the true event times, $F$, and the censoring distribution, $G$. Each run of the simulations begins with simulating data of sample size $n$ at the individual level, followed by assigning individuals to groups randomly.
The distribution $F$ of the event times $T$ is Weibull with shape and scale parameters 4 and 25, respectively; here $T$ has mean 22.7 and variance 40.4. For the perfectly classified test simulations, the screening distribution $G$ for $C$ is uniform over a range wide enough that almost all of the distribution $F$ can be identified. The necessary binary datum $y_{ij} = 1(t_{ij} \le c_{ij})$ is then determined from the generated individual values of $t_{ij}$ and $c_{ij}$. The values of $z_j$, the group-tested results, follow immediately from the values of $y_{ij}$ for each individual in the group, as described in §2. Each simulation is performed 1000 times in six different settings, given by two choices of the sample size $n$ and groupings of sizes $c = 2$, 5 and 10.
For misclassified test results, we are most interested in examining performance of the expectation-maximization pool-adjacent-violators estimator in the left tail of $F$, where false positive test results could have the largest effect on the estimate of $F$ (Tu et al., 1994). Hence, while $F$ remains the same Weibull distribution, we now take $G$ to be uniform over a range restricted to the left tail of $F$, to ensure that $F$ remains small over the observed screening times. Here we select a single sample size, $n = 5000$, in 12 different settings with group sizes $c = 2$, 5, 10 and four common misclassification rates $1 - \alpha = 1 - \beta$, the largest being 10%. In these simulations, the observed misclassified data are obtained by, first, subjecting each individual test result $y_{ij}$ to misclassification under the specified test characteristics and, second, generating the group-tested outcome $z_j^*$ separately by misclassifying the corresponding true group-test result $z_j$. Here we have used the same test classification probabilities for both, assuming independence between the group size and the error rates of the testing procedure.
In each run of the two sets of simulations, for perfectly classified data and misclassified data, we compute both the appropriate expectation-maximization pool-adjacent-violators algorithm for the group-tested data and the appropriate pool-adjacent-violators algorithm for individual data. To select initial values for the expectation-maximization pool-adjacent-violators algorithm, we first draw $n$ values uniformly from the range $[0, 1]$ and sort them from smallest to largest; we then order the observations so that the screening times $c_{ij}$ are monotonically increasing, and match the ordered initial probabilities to the ordered data. Although, as noted earlier, for a specific application we recommend choosing multiple starting values, here we opt to randomly select only one set of initial values for each simulated dataset, thereby achieving only one of potentially many possible nonparametric maximum likelihood estimates.
The averages of the estimates of $F$ obtained from each algorithm over the 1000 runs are calculated for each $t$ in the support of $G$. To calculate the estimate of $F$ at a value of $t$ not observed in a specific simulation, we assume left-continuity of both estimators in situations where this is not imposed by monotonicity. To provide a sense of the variability of each estimator, we also calculate the 2.5th and 97.5th quantiles of the estimates over the 1000 simulations. For the second set of simulations, we use these quantities to compute a measure of pseudo-relative efficiency: the ratio of the widths of these 95% Monte Carlo quantile intervals for the group-based estimator relative to the individual-based estimator. The variances of the simulated estimates are less relevant, since we hypothesize that this estimator does not converge to a Gaussian distribution, nor at an $n^{1/2}$ rate.
The Supplementary Material contains results from two further simulations, with 10 fixed, equal-frequency screening times and with the true event probabilities at each screening time fixed in advance. In the first simulation, we randomly group individuals by values of $C$ to allow for the presentation of asymptotically normal confidence intervals, as described in §3.2; in the second, we group across screening times and again present the widths of the 95% Monte Carlo quantile intervals.
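The data-generating mechanism for the misclassified simulations can be sketched as follows. The Weibull parameters are those stated above, while the screening-time range, error rates, seed, and all function names are illustrative choices of our own.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_group_tests(n, group_size, alpha, beta, c_max=50.0):
    """One simulated dataset: Weibull(shape 4, scale 25) event times, uniform
    screening times, random grouping, and a misclassified result per group.
    (c_max is an illustrative choice of screening-time range.)"""
    t = 25.0 * rng.weibull(4.0, size=n)             # event times T ~ F
    c = rng.uniform(0.0, c_max, size=n)             # screening times C ~ G
    y = (t <= c).astype(int)                        # individual current status
    n_groups = n // group_size
    group = rng.permutation(n) % n_groups           # random equal-size groups
    z = np.array([y[group == j].max() for j in range(n_groups)])  # true results
    false_pos = rng.uniform(size=n_groups) > beta   # specificity errors
    false_neg = rng.uniform(size=n_groups) > alpha  # sensitivity errors
    z_star = np.where(z == 1, 1 - false_neg.astype(int), false_pos.astype(int))
    return c, group, z_star

c, group, z_star = simulate_group_tests(n=1000, group_size=5, alpha=0.95, beta=0.95)
print(len(z_star))  # one observed result per group
```

Each run of the simulation study would feed `c`, `group` and `z_star` to the expectation-maximization pool-adjacent-violators algorithm, and `c` together with misclassified individual results to the individual-level comparator.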
5.2. Results: perfectly classified data
Figure 2 displays the results from applying the expectation-maximization pool-adjacent-violators algorithm and the pool-adjacent-violators algorithm to data generated in the six simulations where there is no misclassification of the test results. These simulations show that the finite-sample bias is small, except perhaps when the group size is large, e.g., $c = 10$, and the sample size $n$ is small. Even then, this bias declines systematically as the sample size increases. As anticipated, in all situations, the bias is also smaller for the estimator based on individual test results. Similarly, and also to be expected, the latter is more precise, although the gain in precision decreases for larger sample sizes and smaller group sizes $c$. This being said, the group-tested estimator stands up remarkably well given that the screening costs are reduced by 50%, 80% and 90% when $c = 2$, 5 and 10, respectively, assuming that costs are proportional to the number of tests.
Fig. 2.
Results from six simulations of the estimation of $F$, with 1000 runs each, for different sample sizes $n$ and group sizes $c$. In each panel, the black lines are the average estimates of $F$ over the 1000 simulations, with the solid line representing the true cumulative distribution function, Weibull(4, 25), and the dashed and dotted lines representing, respectively, the estimates from the pool-adjacent-violators algorithm and the expectation-maximization pool-adjacent-violators algorithm; the grey lines are the 2.5th and 97.5th quantiles from the simulation runs for each estimator, using the same line types.
Because the asymptotic properties of the expectation-maximization pool-adjacent-violators algorithm are currently unknown, to demonstrate variability in the estimates we delineate the 95% Monte Carlo quantile intervals by dashed and dotted grey lines in Fig. 2. The width of this interval for the pool-adjacent-violators algorithm applied to individual data is always smaller than that for the expectation-maximization pool-adjacent-violators algorithm applied to group-tested data. This is to be expected, as there is no misclassification in these simulations. Smaller group sizes $c$ in the expectation-maximization pool-adjacent-violators algorithm provide 95% quantile intervals more similar to those estimated from individual data, and as $n$ increases for fixed $c$, the width of the 95% quantile interval decreases. Overall, Fig. 2 demonstrates that the expectation-maximization pool-adjacent-violators algorithm provides an approximately unbiased estimate of the true underlying distribution, $F$.
5.3. Results: misclassified data
Figures 3 and 4 present results from the twelve simulations in settings with $n = 5000$ individuals and varying group sizes and misclassification rates. Figure 3 shows that the percentage relative bias of both estimators in these finite samples is large, e.g., greater than 100%, for estimates of $F$ that are very small, e.g., less than 0.002, and is very close to zero for estimates of $F$ that are greater than 0.02, even at large group sizes with high misclassification rates. Although the individual-based estimator is less biased at small group sizes and low misclassification rates, we do see similar or lower amounts of bias from the group-testing estimator at the highest misclassification rates considered, particularly with the larger grouping sizes $c = 5$ and 10 and at lower values of $F$. Ultimately, the shapes of the finite-sample relative bias curves for these two estimators are very similar, so, at the very least, grouping does not introduce substantial amounts of additional bias.
Fig. 3.
Graphical representation of the finite-sample percentage relative bias from 12 simulations repeated 1000 times with 5000 individuals each, based on different group sizes $c$ and misclassification rates $1 - \alpha = 1 - \beta$, with values of the latter noted along the right-hand side. In each plot, the solid black line represents results obtained from the expectation-maximization pool-adjacent-violators algorithm for group tests, and the short-dashed black line represents results from the pool-adjacent-violators algorithm for misclassified individual test data; the long-dashed black line represents the reference level of 0% bias.
Fig. 4.
Logarithm of the pseudo-relative efficiency of the expectation-maximization pool-adjacent-violators algorithm relative to the adjusted pool-adjacent-violators algorithm from 12 simulations of 1000 runs with 5000 individuals each, based on different group sizes $c$ and misclassification rates $1 - \alpha = 1 - \beta$, with values of the latter noted along the right-hand side. In each plot, the solid black line is a lowess curve showing the overall trend in pseudo-relative efficiency as $F$ increases; the dashed black line represents equal-width 95% Monte Carlo quantile intervals for reference: if the solid black line is below zero, the width of the expectation-maximization pool-adjacent-violators 95% Monte Carlo quantile interval is smaller than that obtained from the individual test pool-adjacent-violators algorithm.
With regard to variability, a comparison of the widths of the 95% Monte Carlo quantile intervals associated with both estimators, as shown in Fig. 4, demonstrates a considerable advantage of our estimator from group-tested data at low values of $F$ and high levels of misclassification. For example, $F(t) = 0.025$ corresponds to a true prevalence of 2.5%. If a test is subject to 10% misclassification, i.e., $\alpha = \beta = 0.9$, then test results from data grouped into pools of size 10 will provide a more or equally precise estimate of $F(t)$ for $F(t) \le 0.025$ than data from individual tests. This implies that if the cumulative failure rate in question is less than 2.5%, a testing procedure that involves groups of size 10 will cost 90% less than testing everyone individually, and will result in a less biased and more precise estimate of $F$ in this range. In general, the specific threshold on $F$ below which such precision gains can be expected depends on both the group size and the misclassification rate, as suggested by Tu et al. (1994) for estimation of a single fixed prevalence.
The Supplementary Material includes results from simulations of group-tested current status data on a grid, with grouping done solely according to common observation times, which more easily ensures a sufficiently small maximum value of $F$ within groups. As seen in Tu et al. (1994), we observe a reduction in the width of the 95% confidence intervals as the group size increases, and separately a reduction in their width as the misclassification rates decrease. Additionally, there appears to be no substantial increase in bias as group size increases.
6. Application to hepatitis C data
To investigate the performance of our estimator in a practical setting, we use publicly available data from the 2014 U.S. Birth Data File, created by the National Center for Health Statistics, to investigate the age-at-incidence distribution for hepatitis C in non-Hispanic white women of child-bearing age. The dataset includes all such women of ages 13–40 who gave birth in 2014. We are therefore making the tacit assumption that women who gave birth are a representative sample of women of the same ages that could have given birth in terms of their risk of infection with hepatitis C. This is not exactly correct but seems to be a reasonable approximation, at least for sexually active women. Of the 1 981 521 eligible women, we randomly sampled 10%, creating a sample of
observations, for greater ease of illustration and computation. The data include the mother’s age in years and her hepatitis C status at the birth of her child. Of the
women in our investigation, only 901 tested positive for hepatitis C, a cumulative incidence of 0
46%. When accounting for potential misclassification of these test results, we used the sensitivity,
, and specificity,
, associated with the most commonly used test for hepatitis C: an enzyme immunoassay test. Although hepatitis C can be spread via sexual contact, it is primarily transmitted through blood, and an increase in the incidence of hepatitis C after age 25 would imply that people are beginning or continuing to engage in risky drug behaviour.
These data are based on individual blood testing for each mother separately. To illustrate our proposed methods, we consider group testing of pooled blood samples, representing potentially enormous savings in test costs depending on the size of the grouping used. These savings persist even if specific infected individuals need to be identified. As discussed above, given the low misclassification rates, we anticipate some loss of accuracy in estimating the prevalence, but this may nonetheless be worth the considerable cost reduction. We created artificial group-test results in two ways: (i) by assigning the data to groups of sizes 2, 5 and 10 according to age (gridded group assignment), and (ii) by randomly assigning the data to groups of sizes 2, 5 and 10. Then, each group test was assigned a positive result if at least one individual test was positive. For gridded group assignments, we computed point estimates and 95% confidence intervals adjusted for misclassification using the method described in §3.2. For random group assignments, we applied the adjusted pool-adjacent-violators algorithm to the individual test results and, for comparison, the expectation-maximization pool-adjacent-violators algorithm to the group-tested results.
Figure 5 displays estimates obtained from individual and group-tested results with groups of sizes 2, 5 and 10 in a setting where group assignment is done by common age. The results are satisfying, as they lead to the same public health implications. Although the estimates differ slightly, increasing somewhat with group size, the major jumps in the estimates occur at ages 19 and 21 for each of the group sizes considered. From these results, we can be fairly certain that any intervention to potentially reduce the public health burden due to hepatitis C infection would best occur during adolescence, ideally before risky behaviours such as drug use and unprotected sexual activity begin. In this example, major cost reductions could be achieved by decreasing the number of tests performed, assuming costs are proportional to the number of tests, without changing the conclusions of the analysis.
Fig. 5.
Four estimates of the cumulative incidence of hepatitis C in non-Hispanic white child-bearing U.S. women of ages 13–40 in 2014 when grouping is assigned according to common values of age. Group sizes considered were 2, 5 and 10. In each panel, the solid line is the estimate from the individual or group-tested results, and the dashed lines represent the upper and lower bounds of 95% confidence intervals.
Figure 6 displays estimates obtained from individual and group-tested results with groups of sizes 2, 5 and 10 in a setting where group assignment is done completely at random. Unlike the estimates in Fig. 5 obtained from data grouped according to the women’s age, here the estimates from data in groups of different sizes yield different implications. The results from the individual tests suggest an essentially flat cumulative incidence of hepatitis C after age 21, having reached a cumulative incidence of approximately 0.38%. This has significant implications for a public health intervention: it potentially indicates, for example, that any future hepatitis C vaccination would be most effective if implemented during late adolescence. No vaccine currently exists, although several candidates are under development. The group-tested results from groups of size 2 support the same conclusion, although they suggest that the cumulative incidence does not increase after age 19. However, the results from groups of sizes 5 and 10 tell a slightly different story: while these estimates increase to a cumulative incidence of roughly 0.4% before age 20, they then both continue to increase with age to somewhere in the range of 0.45–0.55% by age 40, suggesting that a substantial fraction of hepatitis C infections occur post-adolescence.
Fig. 6.
Four estimates of the cumulative incidence of hepatitis C in non-Hispanic white child-bearing U.S. women of ages 13–40 in 2014 when groups are assigned completely at random. The solid line is the pool-adjacent-violators estimate from the individual test results, and the dotted, short-dashed and long-dashed lines are the estimates obtained from the expectation-maximization pool-adjacent-violators algorithm with the individual test results artificially assigned to groups of sizes 2, 5 and 10, respectively.
Because these estimates seem to imply public health interventions at different times in life, it is important to consider which estimate is most reliable in this particular setting. As noted earlier, there is very little misclassification in the testing procedure, so we would expect the results from the adjusted pool-adjacent-violators algorithm based on individual data to be more accurate, albeit obtained at significantly higher cost. However, the pool-adjacent-violators algorithm adjusted for misclassification has a limitation: it automatically sets to 0 any estimated cumulative incidence below the assumed false-positive probability. Because the cumulative incidences at the early ages are less than 0.5%, had we assumed a false-positive probability above that level in this application, our estimate from the individual data adjusted for misclassification would have been zero at all ages. This suggests a potential issue with individual test results that may not be as much of a problem with group-tested results.
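The truncation just described can be seen in a minimal sketch of a misclassification-adjusted pool-adjacent-violators estimate. With assumed false-positive rate alpha and false-negative rate beta, the probability of a positive test at time t is alpha + (1 − alpha − beta)F(t); inverting the isotonic fit of the test indicators and clipping to [0, 1] therefore maps any fitted proportion below alpha to an estimated incidence of exactly 0. This is an illustration of the general form of such adjustments (cf. McKeown & Jewell, 2010), not the authors' exact procedure, and all function names are hypothetical.

```python
import numpy as np

def pava(y, w):
    """Weighted pool-adjacent-violators: isotonic (nondecreasing) fit."""
    blocks = []  # each block: [mean, weight, count]
    for yi, wi in zip(y, w):
        blocks.append([float(yi), float(wi), 1])
        # Merge adjacent blocks while monotonicity is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            v2, w2, c2 = blocks.pop()
            v1, w1, c1 = blocks.pop()
            blocks.append([(v1 * w1 + v2 * w2) / (w1 + w2), w1 + w2, c1 + c2])
    fit = []
    for v, _, c in blocks:
        fit.extend([v] * c)
    return np.array(fit)

def adjusted_estimate(t, delta, alpha, beta):
    """Estimate F at the sorted screening times t, adjusting for
    misclassification via P(positive at t) = alpha + (1-alpha-beta)F(t).
    Fitted proportions below alpha are clipped to 0: the truncation
    discussed in the text.
    """
    order = np.argsort(t, kind="stable")
    p = pava(np.asarray(delta, float)[order], np.ones(len(t)))
    F = np.clip((p - alpha) / (1.0 - alpha - beta), 0.0, 1.0)
    return np.asarray(t)[order], F
```

For instance, with alpha = 0.1, any isotonic fitted proportion at or below 0.1 yields an estimated cumulative incidence of exactly 0.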
7. Discussion
In this paper we have proposed a modified expectation-maximization algorithm to estimate a distribution function from data obtained by group-tested current status screening with test misclassification. Simulations show that the estimator based on group-tested data adds relatively little extra small-sample bias compared to an estimator based on individual data, at a far lower cost, although this conclusion necessarily requires a larger sample size as the group size increases. Additionally, when substantial misclassification is present and the cumulative incidence is low, estimates obtained from the expectation-maximization pool-adjacent-violators algorithm with groups of size 5 or larger may be less biased and have improved precision, although inferential properties for this procedure need further development. This offers the possibility that a significantly less expensive testing procedure might yield a less biased and more precise estimate for the left tail of the distribution function.
In the presence of misclassification, these observations suggest possible hybrid grouping strategies that may improve precision at low values of the distribution function and maintain performance at higher levels, all in comparison to individual tests whose costs are far greater. That is, where possible, if the screening times are known in advance of pooling, it will likely be advantageous to first group individuals according to the observed screening times, using larger group sizes at the smaller screening times and decreasing the group size as the screening time increases, even down to individual tests. Simulations to examine variations of these possibilities are currently under way. As noted earlier, when individuals in a group have similar screening times, it is also possible to use an approximate individual group-tested current status estimator by treating all screening times in the group as being the same.
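The approximate estimator just mentioned can be sketched as follows. If every member of a group of size c shares (approximately) a common screening time t, the group tests negative only when all c event times exceed t, so P(group positive at t) = 1 − (1 − F(t))^c; isotonizing the group indicators and inverting this map gives an estimate of F. This is a hedged illustration under that common-time assumption, with hypothetical names; `pava` is the standard pool-adjacent-violators routine.

```python
import numpy as np

def pava(y, w):
    """Weighted pool-adjacent-violators: isotonic (nondecreasing) fit."""
    blocks = []  # each block: [mean, weight, count]
    for yi, wi in zip(y, w):
        blocks.append([float(yi), float(wi), 1])
        # Merge adjacent blocks while monotonicity is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            v2, w2, c2 = blocks.pop()
            v1, w1, c1 = blocks.pop()
            blocks.append([(v1 * w1 + v2 * w2) / (w1 + w2), w1 + w2, c1 + c2])
    return np.array([v for v, _, c in blocks for _ in range(c)])

def group_cdf_estimate(t_group, z_group, c):
    """Approximate F from group-tested current status data, treating all
    screening times within each group of size c as equal: since
    P(group positive at t) = 1 - (1 - F(t))^c, isotonize the group
    indicators and invert this map.
    """
    order = np.argsort(t_group, kind="stable")
    p = pava(np.asarray(z_group, float)[order], np.ones(len(z_group)))
    return np.asarray(t_group)[order], 1.0 - (1.0 - p) ** (1.0 / c)
```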
There are a number of important extensions of these results. As noted, the pool-adjacent-violators estimator for classic current status data converges at a rate of n^{1/3} with a nonstandard asymptotic limit; see a 1987 technical report by P. Groeneboom from the University of Amsterdam. We conjecture that the same asymptotics will hold for the group-tested estimator, although this remains to be established. In practice, in a setting with misclassified individual current status data, the m-out-of-n bootstrap (McKeown & Jewell, 2010) has been shown to provide one method of obtaining valid inference procedures. We look forward to further theoretical progress in this area.
It is natural to anticipate that misclassification rates may depend on group size. This may occur, for example, if the screening test is more sensitive to detecting a positive group when there are more individual positives in the pool, related to the so-called dilution effect (Hwang, 1976; McMahan et al., 2013). Separately, covariate-adjusted regression analysis has been a primary focus of the statistical literature on group testing (Vansteelandt et al., 2000; Xie, 2001; Chen et al., 2009; Delaigle & Meister, 2011). In addition, in many applications, interest is focused on regression effects or group comparisons of time-to-event properties rather than on estimation of the underlying distribution function itself, often through use of standard multiplicative or additive regression models. Such regression models have been widely studied for individual current status data (Jewell & Emerson, 2013). Future work will investigate the use of additive hazard regression models for group-tested current status data.
Acknowledgement
The authors thank the editor, associate editor and reviewers for their insightful feedback. This work was supported by the National Heart, Lung, and Blood Institute, U.S. National Institutes of Health.
Supplementary material
Supplementary material available at Biometrika online contains a derivation of the expectation step of our expectation-maximization pool-adjacent-violators algorithm in the presence of misclassification, results from both sets of simulations with fixed censoring times, and code needed to replicate the simulations outlined in § 5.1.
References
- Ayer, M., Brunk, H. D., Ewing, G. M., Reid, W. T. & Silverman, E. (1955). An empirical distribution function for sampling with incomplete information. Ann. Math. Statist. 26, 641–7.
- Banerjee, M. (2012). Current status data in the 21st century: Some interesting developments. In Interval-Censored Time-to-Event Data: Methods and Applications, Chen, D. G., Sun, J. & Peace, K. E., eds. Boca Raton, Florida: Chapman & Hall/CRC, pp. 45–90.
- Banerjee, M. & Wellner, J. A. (2001). Likelihood ratio test for monotone functions. Ann. Statist. 29, 1699–731.
- Banerjee, M. & Wellner, J. A. (2005). Confidence intervals for current status data. Scand. J. Statist. 32, 405–24.
- Barlow, R. E., Bartholomew, D. J., Bremner, J. M. & Brunk, H. D. (1972). Statistical Inference Under Order Restrictions. New York: Wiley.
- Chen, P., Tebbs, J. M. & Bilder, C. R. (2009). Group testing regression models with fixed and random effects. Biometrics 65, 1270–8.
- Delaigle, A. & Meister, A. (2011). Nonparametric regression analysis for group testing data. J. Am. Statist. Assoc. 106, 640–50.
- Delaigle, A. & Hall, P. (2012). Nonparametric regression with homogeneous group testing data. Ann. Statist. 40, 131–58.
- Dempster, A. P., Laird, N. M. & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. R. Statist. Soc. B 39, 1–38.
- Dhand, N. K., Johnson, W. O. & Toribio, J. A. L. (2010). A Bayesian approach to estimate OJD prevalence from pooled fecal samples of variable pool size. J. Agric. Biol. Envir. Statist. 15, 452–73.
- Dorfman, R. (1943). The detection of defective members of large populations. Ann. Math. Statist. 14, 436–40.
- Groeneboom, P. & Wellner, J. A. (1992). Nonparametric Maximum Likelihood Estimators for Interval Censoring and Deconvolution. Boston: Birkhäuser.
- Hwang, F. K. (1976). Group testing with a dilution effect. Biometrika 63, 671–80.
- Jewell, N. P. & Emerson, R. (2013). Current status data: An illustration with data on avalanche victims. In Handbook of Survival Analysis. Boca Raton, Florida: Chapman & Hall/CRC, pp. 391–412.
- Jewell, N. P. & van der Laan, M. (2003). Current status data: Review, recent developments and open problems. In Handbook of Statistics, vol. 23. Amsterdam: Elsevier, pp. 625–42.
- Liu, A., Liu, C., Zhang, Z. & Albert, P. S. (2012). Optimality of group testing in the presence of misclassification. Biometrika 99, 245–51.
- Maathuis, M. & Hudgens, M. G. (2011). Nonparametric inference for competing risks current status data with continuous, discrete or grouped observation times. Biometrika 98, 325–40.
- McKeown, K. & Jewell, N. P. (2010). Misclassification of current status data. Lifetime Data Anal. 16, 215–30.
- McMahan, C. S., Tebbs, J. M. & Bilder, C. R. (2013). Regression models for group testing data with pool dilution effects. Biostatistics 14, 284–98.
- Remlinger, K. S., Hughes-Oliver, J. M., Young, S. S. & Lam, R. L. (2006). Statistical design of pools using optimal coverage and minimal collision. Technometrics 48, 133–43.
- Sal y Rosas, V. G. & Hughes, J. P. (2010). Nonparametric and semiparametric analysis of current status data subject to outcome misclassification. Statist. Commun. Inf. Dis. 2010, article no. 364.
- Tang, R., Banerjee, M. & Kosorok, M. R. (2012). Likelihood based inference for current status data on a grid: A boundary phenomenon and an adaptive inference procedure. Ann. Statist. 40, 45–72.
- Tu, X. M., Litvak, E. & Pagano, M. (1994). Screening tests: Can we get more by doing less? Statist. Med. 13, 1905–19.
- Tu, X. M., Litvak, E. & Pagano, M. (1995). On the informativeness and accuracy of pooled testing in estimating prevalence of a rare disease: Application to HIV screening. Biometrika 82, 287–97.
- Vansteelandt, S., Goetghebeur, E. & Verstraeten, T. (2000). Regression models for disease prevalence with diagnostic tests on pools of serum samples. Biometrics 56, 1126–33.
- Wahed, M. A., Chowdhury, D., Nermell, B., Khan, S. I., Ilias, M., Rahman, M., Persson, L. A. & Vahter, M. (2006). A modified routine analysis of arsenic content in drinking-water in Bangladesh by hydride generation-atomic absorption spectrophotometry. J. Health Pop. Nutr. 24, 36–41.
- Wein, L. M. & Zenios, S. A. (1996). Pooled testing for HIV screening: Capturing the dilution effect. Oper. Res. 44, 543–69.
- Xie, M. (2001). Regression analysis of group testing samples. Statist. Med. 20, 1957–69.
- Yu, G., Schick, A., Li, L. & Wong, G. Y. C. (1998). Asymptotic properties of the GMLE in the case 1 interval-censorship model with discrete inspection times. Can. J. Statist. 26, 619–27.