Abstract
Screening procedures for infectious diseases, such as HIV, often involve pooling individual specimens together and testing the pools. For diseases with low prevalence, group testing (or pooled testing) can be used to classify individuals as diseased or not while providing considerable cost savings when compared to testing specimens individually. The pooling literature is replete with group testing case identification algorithms including Dorfman testing, higher-stage hierarchical procedures, and array testing. Although these algorithms are usually evaluated on the basis of the expected number of tests and classification accuracy, most evaluations in the literature do not account for the continuous nature of the testing responses and thus invoke potentially restrictive assumptions to characterize an algorithm’s performance. Commonly used case identification algorithms in group testing are considered and are evaluated by taking a different approach. Instead of treating testing responses as binary random variables (i.e., diseased/not), evaluations are made by exploiting an assay’s underlying continuous biomarker distributions for positive and negative individuals. In doing so, a general framework to describe the operating characteristics of group testing case identification algorithms is provided when these distributions are known. The methodology is illustrated using two HIV testing examples taken from the pooling literature.
Keywords: Classification, Measurement error, Pooled testing, Screening, Sensitivity, Specificity
1. Introduction
Testing individual specimens in pools, which is known as group testing (or pooled testing), is widespread in disease screening applications. Individuals in pools that test negatively are declared to be negative, and positive pools are resolved (or “decoded”) to determine which individuals are positive. The origins of group testing are usually traced back to Dorfman (1943), who proposed that it be used to screen World War II soldiers for syphilis. Since this seminal work, group testing has been applied to numerous infectious disease applications. A literature review reveals recent public health and surveillance applications for HIV (Krajden et al., 2014), HBV and HCV (Page-Shafer et al., 2008; Candotti and Allain, 2009), chlamydia and gonorrhea (Lewis et al., 2012), West Nile virus (Busch et al., 2005), and influenza (Edouard et al., 2015). Group testing is also routinely used by national organizations around the world to screen blood and plasma donations for HIV/HBV/HCV and other diseases (see, e.g., Schmidt et al., 2010; O’Brien et al., 2012; Stramer et al., 2013).
The original procedure proposed by Dorfman (1943) is a two-stage hierarchical algorithm; i.e., non-overlapping pools are tested in the first stage and individuals from positive pools are tested in the second. Hierarchical algorithms using a larger number of stages can reduce the number of tests needed when the disease prevalence is small. For example, Mehta et al. (2011) describe a three-stage algorithm for HIV testing in San Diego that uses master pools of size 10 in the first stage, subpools of size 5 in the second stage, and individual testing in the third. The most common non-hierarchical algorithm is two-dimensional array testing (Phatarfod and Sudbury, 1994; Hudgens and Kim, 2011; McMahan et al., 2012b), where individuals are tested in the rows and columns of an array. A recent HIV application in New Jersey (Martin et al., 2013) illustrates how array testing can even be used in higher dimensions (Kim and Hudgens, 2009). Comprehensive summaries of group testing algorithms and their operating characteristics are found in Kim et al. (2007) and Westreich et al. (2008).
When faced with the task of choosing an appropriate case identification algorithm for screening purposes, public health officials and lab technicians are interested in cost and accuracy. Laboratories with large budgets may opt to test specimens individually as pooling can reduce an assay’s sensitivity. In the group testing literature, this reduction is known as “the dilution effect” and can result in an increased number of false negative diagnoses. Group testing algorithms can be selected on the basis of minimizing the expected number of tests per individual to minimize costs (Kim et al., 2007; Westreich et al., 2008) or perhaps in a way that incorporates both the expected number of tests and classification accuracy (see, e.g., Malinovsky et al., 2016). Of course, additional practical considerations such as testing platform constraints, the time needed for testing, and the availability of individuals to pool should also be carefully considered.
When an individual or pooled specimen is tested, an assay typically elicits a binary diagnosis (positive/negative) that is derived from measuring a continuous biomarker; large values of this continuous measurement are usually evidence that the disease is present. Although it is widely known that dichotomizing a continuous outcome can lead to a loss in information, previous evaluations in group testing have largely ignored this underlying aspect and instead have relied explicitly on binary results. Doing so helps to facilitate the derivation of closed-form expressions for the expected number of tests and classification accuracy probabilities; however, this also usually requires one to make assumptions such as (a) the sensitivity and specificity are unaffected by pool size; i.e., there is no dilution effect, and (b) testing outcomes on pools containing common individuals are independent conditional on the true pool statuses. An important contribution of this article is to provide a general framework for case identification evaluation where these assumptions are not needed.
In offering this framework, our approach exploits the underlying continuous biomarker distributions associated with positive and negative individuals. In other words, we do not dichotomize testing outcomes into “positive” or “negative” categories, but instead we make our evaluations in terms of the biomarker distributions themselves. Our work is related to the methodology in Wein and Zenios (1996), who proposed using biomarker concentrations to determine an optimized Dorfman algorithm for HIV testing. However, our article takes a somewhat different perspective. We are not focused on determining optimal designs for specific group testing procedures per se; instead, our goal is to enhance previous case identification algorithm evaluations, such as those in Kim et al. (2007) and Westreich et al. (2008), in group testing applications where biomarker distributions are known. Our evaluations can be performed for any group testing procedure, including Dorfman testing, higher-stage hierarchical algorithms, and array testing. We obtain closed-form expressions for operating characteristics for normally distributed biomarkers in specific algorithms; however, even these expressions may be of limited utility for practitioners. We therefore use simulation to overcome the computational challenges when incorporating biomarker information.
2. Notation and Preliminaries
We modify the notation from Wang et al. (2015), who used biomarker distributions to acknowledge the dilution effect in group testing regression. Let Ti = 1 if the ith individual is truly positive; Ti = 0 otherwise. We assume the Ti’s are independent and identically distributed statuses with pr(Ti = 1) = p, the prevalence of the population. Generalizing our evaluation framework to allow for unequal individual disease probabilities (McMahan et al., 2012a; 2012b) or correlated individuals (Lendle et al., 2012) is straightforward; see Section 6. Let 𝒞̃i denote the true biomarker level of the ith individual (e.g., viral load, optical density reading, antibody concentration, etc.). We assume the 𝒞̃i’s are mutually independent random variables and that the conditional probability density function of 𝒞̃i given the true status Ti = t is
$$ f_{\tilde{\mathcal{C}}_i \mid T_i = t}(u) = t\, f_{\tilde{\mathcal{C}}^{+}}(u) + (1 - t)\, f_{\tilde{\mathcal{C}}^{-}}(u), \quad t \in \{0, 1\}, $$
where f𝒞̃+ and f𝒞̃− denote the true biomarker density functions for positive and negative individuals, respectively. In other words, positive individuals in the population have true biomarker levels described by the common density f𝒞̃+; similarly, negative individuals’ true biomarker levels are described by f𝒞̃−.
We are interested in calculating quantities like the expected number of tests per individual and classification accuracy probabilities commonly seen in the group testing case identification literature (i.e., pooling sensitivity, pooling specificity, predictive values). To fix ideas, we assume a hierarchical group testing algorithm is used in S ≥ 2 stages, although we later modify our notation to account for array testing in two dimensions (Phatarfod and Sudbury, 1994; Hudgens and Kim, 2011; McMahan et al., 2012b); see Section 3.3. An S-stage hierarchical algorithm begins by testing a master pool of individual specimens. If the master pool tests negatively, all individuals are declared to be disease-free and no further testing is performed. Otherwise, non-overlapping subpools are formed and are tested in the second stage. Any second-stage subpool that tests positively is split again while subpools that test negatively in the second stage are declared to be disease-free. This process continues until all subpools in a particular stage test negatively or until individual testing (in stage S) is performed.
For an S-stage hierarchical algorithm, let ℘sl denote the index set of individuals in the lth pool formed at the sth stage of testing, for l = 1, 2, …, n1/ns and s = 1, 2, …, S, where ns = |℘sl| is the number of individuals in ℘sl. To illustrate this notation, Figure 1 displays the S = 3 stage hierarchical algorithm described in Mehta et al. (2011) from Section 1. In this example, the master pool is ℘11 = {1, 2, …, 10}, the two second-stage pools are ℘21 = {1, 2, …, 5} and ℘22 = {6, 7, …, 10}, and the singleton pools ℘31 = {1},℘32 = {2}, …,℘3,10 = {10} are for individual testing in the third stage. These pools are of size n1 = 10, n2 = 5, and n3 = 1. Additional examples of hierarchical algorithms used in HIV testing are found in Sherlock et al. (2007). Henceforth, a general S-stage hierarchical algorithm is denoted by H(n1: n2: ⋯: nS), where nS = 1. Note that Dorfman’s seminal strategy uses S = 2 stages.
Let T℘sl = 1 if the lth pool in the sth stage is truly positive; i.e., ℘sl contains at least one truly positive individual, T℘sl = 0 otherwise. Similarly, let Z℘sl = 1 if ℘sl tests positively, Z℘sl = 0 otherwise. To acknowledge the continuous nature of the diagnostic assay, we assume that Z℘sl = I(𝒞℘sl > τ℘sl); i.e., the pool ℘sl tests positively if 𝒞℘sl, the measured biomarker level of the pool, exceeds a threshold τ℘sl which potentially depends on the pool size ns at stage s. To acknowledge the potential of error when measuring the true biomarker level 𝒞̃℘sl, we assume that 𝒞℘sl | 𝒞̃℘sl ~ fε, where fε = fε(·| 𝒞̃℘sl) is a known probability density function. Therefore, our framework utilizes three distributions: the true biomarker distributions for positive and negative individuals, f𝒞̃+ and f𝒞̃−, respectively, and fε, which incorporates the effect of assay measurement error. Threshold selection for τ℘sl is discussed in Section 4.
As noted in Section 1, previous evaluations of case identification algorithms have largely assumed the sensitivity and specificity are constant and hence are unaffected by pool size. Although this assumption may be reasonable when testing negative pools (i.e., constant specificity), it is potentially more dubious when testing positive pools. Using the Law of Total Probability, note that the sensitivity associated with testing ℘sl can be written as
$$ \mathrm{pr}(Z_{\wp_{sl}} = 1 \mid T_{\wp_{sl}} = 1) = \sum_{m=1}^{n_s} \mathrm{pr}\Big(Z_{\wp_{sl}} = 1 \,\Big|\, \sum_{i \in \wp_{sl}} T_i = m\Big)\, \mathrm{pr}\Big(\sum_{i \in \wp_{sl}} T_i = m \,\Big|\, T_{\wp_{sl}} = 1\Big), $$
where the random variable Σi∈℘sl Ti counts the number of positive individuals in ℘sl. Therefore, for the sensitivity to remain constant throughout the testing process, one would have to require that pr(Z℘sl = 1| Σi∈℘sl Ti = m) are equal for each m = 1, 2, …, ns, l = 1, 2, …, n1/ns and s = 1, 2, …, S. Clearly, this requirement may be unsuitable, especially when testing results are heavily influenced by dilution.
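On the other hand, when written in terms of the true biomarker distributions, f𝒞̃+ and f𝒞̃−, and the measurement error density fε, the sensitivity of ℘sl is given by
$$ \mathrm{pr}(Z_{\wp_{sl}} = 1 \mid T_{\wp_{sl}} = 1) = \sum_{m=1}^{n_s} \binom{n_s}{m} \frac{p^{m} q^{n_s - m}}{1 - q^{n_s}}\, Se(n_s: m), $$
where q = 1 − p and
$$ Se(n_s: m) = \mathrm{pr}\Big(Z_{\wp_{sl}} = 1 \,\Big|\, \sum_{i \in \wp_{sl}} T_i = m\Big) = \int_{\tau_{\wp_{sl}}}^{\infty} \int_{-\infty}^{\infty} f_{\varepsilon}(u \mid v / n_s)\, f_{\sum_{i \in \wp_{sl}} \tilde{\mathcal{C}}_i \mid \sum_{i \in \wp_{sl}} T_i = m}(v)\, dv\, du, \tag{1} $$
where
$$ f_{\sum_{i \in \wp_{sl}} \tilde{\mathcal{C}}_i \mid \sum_{i \in \wp_{sl}} T_i = m}(v) = \big( \underbrace{f_{\tilde{\mathcal{C}}^{+}} * \cdots * f_{\tilde{\mathcal{C}}^{+}}}_{m \text{ times}} * \underbrace{f_{\tilde{\mathcal{C}}^{-}} * \cdots * f_{\tilde{\mathcal{C}}^{-}}}_{n_s - m \text{ times}} \big)(v). \tag{2} $$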
The expression in Equation (2) is the density of Σi∈℘sl 𝒞̃i, the sum of the mutually independent biomarker levels in ℘sl when ℘sl contains exactly m ≥ 1 positive and ns − m negative individuals; we obtain this density by convolving the true individual biomarker densities f𝒞̃+ and f𝒞̃− m and ns − m times, respectively.
In writing Equation (1), we assume the true biomarker level 𝒞̃℘sl is the arithmetic average of the individual biomarker levels in ℘sl; i.e., 𝒞̃℘sl = (1/ns) Σi∈℘sl 𝒞̃i. This assumption is often viewed as sacrosanct in the biomarker pooling literature (see, e.g., Zhang and Albert, 2011; Malinovsky et al., 2012; Mitchell et al., 2014; Delaigle and Hall, 2015) and is likely reasonable when pools are formed from aliquots of equal volume. Under this assumption, the specificity of ℘sl is given by
$$ Sp(n_s) = \mathrm{pr}(Z_{\wp_{sl}} = 0 \mid T_{\wp_{sl}} = 0) = \int_{-\infty}^{\tau_{\wp_{sl}}} \int_{-\infty}^{\infty} f_{\varepsilon}(u \mid v / n_s)\, f_{\sum_{i \in \wp_{sl}} \tilde{\mathcal{C}}_i \mid \sum_{i \in \wp_{sl}} T_i = 0}(v)\, dv\, du, \tag{3} $$
where fΣi∈℘sl 𝒞̃i|Σi∈℘sl Ti=0 is the density that convolves f𝒞̃− ns times, once for each of the negative individuals in ℘sl. Note that Equations (1) and (3) are similar in form to the analogous expressions found in McMahan et al. (2013) and Delaigle and Hall (2015), both of whom incorporate biomarker and measurement error distributions in group testing regression.
As an example, suppose the true individual biomarker distributions for negative and positive individuals are 𝒞̃− ~ 𝒩(3, 0.25) and 𝒞̃+ ~ 𝒩(6, 1), respectively, and that the measurement error density is 𝒩(𝒞̃, 0.0025). For these distribution choices, the threshold that maximizes Youden’s index (Youden, 1950) for individual testing is τ* = 4.11, which provides values of sensitivity and specificity (for individual testing) equal to 0.970 and 0.987, respectively. To illustrate the effect of pooling, Figure 2 displays the densities of the measured biomarker level on ℘sl; i.e., f𝒞℘sl|ΣiTi=m(u), for different values of m when the pool size is ns = 5 and ns = 10. This figure illustrates how relevant operating characteristics in group testing could ultimately depend on the individual biomarker distributions, the pool size, the threshold used for pools (see Section 4), and the number of positive individuals in each pool. In other words, once one moves beyond treating pool and individual diagnoses as binary, case identification evaluation becomes far more complicated. Note that we have created Figure 2 assuming normality for 𝒞̃−, 𝒞̃+, and the measurement error so that f𝒞℘sl|ΣiTi=m(u) can be calculated exactly. However, biomarkers in real applications are rarely normally distributed and calculating f𝒞℘sl|ΣiTi=m(u) for non-normal biomarkers, if it is even possible to do so, potentially involves high-dimensional integration (i.e., of dimension equal to the pool size).
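Under the normality assumptions of this example, the measured level of a pool is itself normally distributed, so pool-specific sensitivities and specificities reduce to normal tail probabilities. The following Python sketch illustrates the dilution effect in that special case. The function names are ours and purely illustrative; the sketch assumes the second parameter of each normal distribution is a variance, that the pool level is the arithmetic average of its members, and that the individual threshold 4.11 is applied unchanged to pools.

```python
from statistics import NormalDist

# Example distributions from the text (second parameters treated as variances).
MU_NEG, VAR_NEG = 3.0, 0.25   # true biomarker level, negative individuals
MU_POS, VAR_POS = 6.0, 1.0    # true biomarker level, positive individuals
VAR_ERR = 0.0025              # measurement error variance


def measured_pool_dist(n, m):
    """Normal distribution of the measured level of a pool of size n with m positives."""
    mean = (m * MU_POS + (n - m) * MU_NEG) / n
    # Variance of the average of independent levels, plus measurement error.
    var = (m * VAR_POS + (n - m) * VAR_NEG) / n**2 + VAR_ERR
    return NormalDist(mean, var ** 0.5)


def pool_sensitivity(n, m, tau):
    """pr(measured pool level > tau | exactly m positives in the pool)."""
    return 1.0 - measured_pool_dist(n, m).cdf(tau)


def pool_specificity(n, tau):
    """pr(measured pool level <= tau | no positives in the pool)."""
    return measured_pool_dist(n, 0).cdf(tau)


if __name__ == "__main__":
    tau = 4.11  # individual-testing threshold quoted in the text
    print("individual:", pool_sensitivity(1, 1, tau), pool_specificity(1, tau))
    for n in (5, 10):
        for m in (1, 2, 3):
            print(f"n={n}, m={m}: Se={pool_sensitivity(n, m, tau):.3f}")
```

With a single positive in a pool of size 5, the pooled mean moves only from 3 to 3.6, so a threshold tuned for individuals detects such a pool far less often than it detects a positive individual; this is the dilution effect in numerical form.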
3. Operating Characteristics
3.1. Efficiency
The most important characteristic of a group testing case identification algorithm is its expected number of tests per individual, or efficiency. Because the cost of screening is usually highly correlated with the number of tests expended, algorithms with lower values of this expectation are generally preferred. For example, an algorithm whose efficiency is 0.5 is twice as efficient as individual testing. An algorithm whose efficiency is larger than 1 uses more tests than individual testing on average. In the group testing literature, optimal algorithms are usually identified as those that are the most efficient.
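Unfortunately, within the general framework we have outlined in this article, calculating the efficiency quickly becomes unmanageable, even for simple algorithms. For example, consider the S = 3 stage algorithm H(10: 5: 1) depicted in Figure 1. This algorithm always uses one master pool test, two subpool tests if the master pool tests positively, and five individual tests for each subpool that also tests positively. It is therefore easy to see that the efficiency of this algorithm is {1 + 2 pr(Z℘11 = 1) + 10 pr(Z℘11 = 1, Z℘21 = 1)}/10, where recall Z℘11 and Z℘21 denote the (binary) testing responses of ℘11 and ℘21, respectively. When written in terms of the biomarker distributions, the first-stage probability is
$$ \mathrm{pr}(Z_{\wp_{11}} = 1) = \sum_{m=1}^{10} \binom{10}{m} p^{m} q^{10 - m}\, Se(10: m) + q^{10} \{1 - Sp(10)\}, $$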
where Se(10: m) is calculated using Equations (1) and (2) and Sp(10) is calculated using Equation (3) with ns = n1 = 10, τ℘sl = τ℘11, and ℘sl = ℘11. Even more daunting, a second-stage pool tests positively with probability pr(Z℘11 = 1, Z℘21 = 1), which equals
where n1 = 10 and n2 = 5. In this expression, it is understood that products of the form and , a > b, are vacuous.
As this simple example illustrates, offering a biomarker-based framework for group testing case identification presents nearly overwhelming computational challenges. Unfortunately, this is the price one must pay when relaxing assumptions used in previous evaluations. For example, under classical assumptions in Kim et al. (2007) and Westreich et al. (2008), the probabilities we have just presented reduce to pr(Z℘11 = 1) = Se(1 − q^10) + (1 − Sp)q^10 and
$$ \mathrm{pr}(Z_{\wp_{11}} = 1, Z_{\wp_{21}} = 1) = Se^{2}(1 - q^{5}) + Se(1 - Sp)\, q^{5}(1 - q^{5}) + (1 - Sp)^{2} q^{10}, $$
respectively, where Se and Sp are the assumed common sensitivity and specificity for pools of size n1 = 10 and n2 = 5. The simplified formula for pr(Z℘11 = 1, Z℘21 = 1) above arises only when the testing responses Z℘11 and Z℘21 are conditionally independent given the true pool statuses T℘11 and T℘21. This assumption is required under classical evaluations because ℘11 and ℘21 contain common individuals.
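For comparison with the biomarker-based calculations, the classical probabilities are simple enough to code directly. The Python sketch below evaluates them for a three-stage algorithm H(n1: n2: 1), assuming a constant Se and Sp and conditional independence of the testing responses given the true pool statuses. The function name and argument names are ours, and the generalization from (10, 5) to arbitrary (n1, n2) follows the same conditioning argument rather than a formula quoted from the article.

```python
def classical_three_stage(n1, n2, p, se, sp):
    """Classical first- and second-stage probabilities and efficiency for H(n1: n2: 1)."""
    q = 1.0 - p
    # Master pool tests positive: truly positive and detected, or a false positive.
    pr_master = se * (1 - q**n1) + (1 - sp) * q**n1
    # Master pool and one given subpool both test positive. Condition on whether
    # the subpool is truly positive, only the rest of the master pool is, or neither.
    pr_both = (se**2 * (1 - q**n2)
               + se * (1 - sp) * q**n2 * (1 - q**(n1 - n2))
               + (1 - sp)**2 * q**n1)
    # One master test, n1/n2 subpool tests if the master pool is positive,
    # and n2 individual tests for each of the n1/n2 subpools that is positive.
    efficiency = (1 + (n1 / n2) * pr_master + n1 * pr_both) / n1
    return pr_master, pr_both, efficiency


if __name__ == "__main__":
    print(classical_three_stage(10, 5, p=0.05, se=0.99, sp=0.99))
    print(classical_three_stage(8, 4, p=0.05, se=0.989, sp=0.980))
```

With p = 0.05, Se = 0.989, and Sp = 0.980, this sketch gives an efficiency of approximately 0.396 for H(8: 4: 1), which agrees with the classical column reported for Application 2 in Table 1.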
In Appendix A in the Supplementary Material, we have derived a general expression for the efficiency of an S-stage hierarchical algorithm. This derivation has been described previously in the group testing literature; see, e.g., Kim et al. (2007) and the references therein. In our notation, the efficiency can be expressed as
$$ EFF\{H(n_1: n_2: \cdots: n_S)\} = \frac{1}{n_1} \Big\{ 1 + \sum_{s=1}^{S-1} \frac{n_1}{n_{s+1}}\, \mathrm{pr}\Big( \prod_{j=1}^{s} Z_{\wp_{j1}} = 1 \Big) \Big\}, $$
where the random variable ∏j≤s Z℘j1 equals 1 if and only if the first pool in each of the first s stages tests positively. Calculating pr(∏j≤s Z℘j1 = 1) within our framework involves accounting for the joint uncertainty that arises in the correlated, error-laden biomarker measurements 𝒞℘11, 𝒞℘21, …, 𝒞℘s1, an extremely difficult problem analytically. Although this probability can be calculated exactly under normal biomarker assumptions, in general we recommend using Monte Carlo simulation and estimating EFF{H(n1: n2: ⋯: nS)} instead. Such a strategy is flexible and will accommodate any biomarker and measurement error distributions. In addition, one can quickly estimate the variance of the number of tests per individual (Kim et al., 2007), which would otherwise be an intractable calculation. A description of our simulation procedure is now given.
SIMULATION PROCEDURE
1. Generate T1, T2, …, Tn1 ~ iid Bernoulli(p). Generate 𝒞̃i ~ f𝒞̃i|Ti=t(u) = tf𝒞̃+(u) + (1 − t)f𝒞̃−(u), i = 1, 2, …, n1.
2. (Stage 1). Calculate 𝒞̃℘11 = (1/n1) Σi∈℘11 𝒞̃i and generate 𝒞℘11 from fε(·| 𝒞̃℘11).
   (a) If Z℘11 = I(𝒞℘11 > τ℘11) = 0, stop and classify the n1 individuals in ℘11 as negative.
   (b) If Z℘11 = I(𝒞℘11 > τ℘11) = 1, divide the 𝒞̃i, i ∈ ℘11, into subgroups of size n2.
3. (Stage 2). Calculate 𝒞̃℘2l = (1/n2) Σi∈℘2l 𝒞̃i for each subgroup in Step 2(b) and generate 𝒞℘2l from fε(·| 𝒞̃℘2l). Calculate Z℘2l = I(𝒞℘2l > τ℘2l). For each l,
   (a) if Z℘2l = I(𝒞℘2l > τ℘2l) = 0, classify the n2 individuals in ℘2l as negative (stop if all second-stage subgroups are negative).
   (b) if Z℘2l = I(𝒞℘2l > τ℘2l) = 1, divide the 𝒞̃i, i ∈ ℘2l, into subgroups of size n3.
4. (Stage 3). For each subgroup in Step 3(b), calculate 𝒞̃℘3l = (1/n3) Σi∈℘3l 𝒞̃i, generate 𝒞℘3l from fε(·| 𝒞̃℘3l), and calculate Z℘3l = I(𝒞℘3l > τ℘3l). Continue this overall process until all subgroups in a particular stage test negatively or until individual testing (in stage S) is performed.
We implement this procedure B times and estimate the efficiency of H(n1: n2: ⋯: nS) using
$$ \widehat{EFF}\{H(n_1: n_2: \cdots: n_S)\} = \frac{1}{B\, n_1} \sum_{b=1}^{B} M_b, $$
where Mb is the number of tests observed in the bth replication. The variance of the number of tests per individual, denoted by var{H(n1: n2: ⋯: nS)}, can be estimated using the sample variance of M1/n1,M2/n1, …,MB/n1. Our simulation procedure is extremely fast and thus can be performed using very large values of B. Under normal biomarker assumptions, we show in Appendix B in the Supplementary Material that the difference between calculating EFF{H(n1: n2: ⋯: nS)} exactly and estimating it using a large number of replications is negligible.
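The simulation procedure above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the callable arguments `draw_pos`, `draw_neg`, and `measure`, and the thresholds used in the example call are all ours, and the normal distributions in the example are the illustrative ones from Section 2 with second parameters read as variances. Pool sizes are assumed to divide evenly from one stage to the next.

```python
import random
from statistics import mean, stdev


def simulate_tests(sizes, taus, p, draw_pos, draw_neg, measure, rng):
    """Run one replication of H(n1: n2: ...: 1) and return the number of tests."""
    n1 = sizes[0]
    status = [rng.random() < p for _ in range(n1)]                    # T_i
    level = [draw_pos(rng) if t else draw_neg(rng) for t in status]   # true levels
    pools, tests = [list(range(n1))], 0
    next_sizes = list(sizes[1:]) + [None]       # None marks the final stage
    for next_size, tau in zip(next_sizes, taus):
        positive_pools = []
        for pool in pools:
            tests += 1
            true_level = sum(level[i] for i in pool) / len(pool)      # arithmetic average
            if measure(true_level, rng) > tau:                        # Z = I(C > tau)
                positive_pools.append(pool)
        if next_size is None or not positive_pools:
            break
        # Split each positive pool into non-overlapping subpools of the next size.
        pools = [pool[k:k + next_size]
                 for pool in positive_pools
                 for k in range(0, len(pool), next_size)]
    return tests


def estimate_efficiency(sizes, taus, p, draw_pos, draw_neg, measure, B=20000, seed=1):
    """Monte Carlo estimate of the mean and SD of the number of tests per individual."""
    rng = random.Random(seed)
    per_individual = [
        simulate_tests(sizes, taus, p, draw_pos, draw_neg, measure, rng) / sizes[0]
        for _ in range(B)
    ]
    return mean(per_individual), stdev(per_individual)


if __name__ == "__main__":
    # Hypothetical thresholds for pools of size 10, 5, and 1 (not from the article).
    eff, sd = estimate_efficiency(
        sizes=[10, 5, 1], taus=[3.16, 3.29, 4.11], p=0.05,
        draw_pos=lambda rng: rng.gauss(6.0, 1.0),     # N(6, 1)
        draw_neg=lambda rng: rng.gauss(3.0, 0.5),     # N(3, 0.25), SD 0.5
        measure=lambda c, rng: rng.gauss(c, 0.05),    # error variance 0.0025
    )
    print(f"estimated EFF = {eff:.3f}, SD = {sd:.3f}")
```

Passing the distributions as callables keeps the sketch agnostic about their form, so non-normal biomarkers or a multiplicative error model could be substituted without changing the testing logic. Tracking which individuals are classified positively in the same loop would give estimates of the accuracy measures as well.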
3.2. Classification Accuracy
Although the efficiency of a group testing case identification algorithm is its most important characteristic, being able to quantify an algorithm’s classification accuracy is also critical. Two commonly used measures of accuracy in the case identification literature are pooling sensitivity and pooling specificity. For an S-stage hierarchical algorithm, the pooling sensitivity
is the probability a truly positive individual is classified positively. Analogously, the pooling specificity
is the probability a truly negative individual is classified negatively. Values of PSE and PSP close to unity are preferred as this translates to a small percentage of false negative and false positive diagnoses. Simple formulae for PSE and PSP are available under classical assumptions (see, e.g., Kim et al., 2007). For example, PSE{H(n1: n2: ⋯: nS)} = Se^S implies that a larger number of stages decreases pooling sensitivity. Of course, this formula no longer applies in our more general framework.
We derive expressions for PSE{H(n1: n2: ⋯: nS)} and PSP{H(n1: n2: ⋯: nS)} in terms of f𝒞̃+, f𝒞̃−, and fε in Appendix B in the Supplementary Material. However, as with the efficiency, these expressions may ultimately be too complicated for practical use. Therefore, simulation details to estimate PSE and PSP for an S-stage hierarchical algorithm are also provided. With these estimates in hand, one can also estimate the pooling positive predictive value
$$ PPV = \frac{p\, PSE}{p\, PSE + (1 - p)(1 - PSP)} $$
and the pooling negative predictive value
$$ NPV = \frac{(1 - p)\, PSP}{(1 - p)\, PSP + p\,(1 - PSE)}. $$
These probabilities measure how likely an individual is truly positive (negative) given that the individual has been classified positively (negatively).
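Given estimates of PSE and PSP and a value of the prevalence p, the two predictive values follow from Bayes' rule. A short Python sketch (the function name is ours) is:

```python
def pooling_predictive_values(p, pse, psp):
    """Return (PPV, NPV) from prevalence p, pooling sensitivity, and pooling specificity."""
    # pr(truly positive | classified positive)
    ppv = p * pse / (p * pse + (1 - p) * (1 - psp))
    # pr(truly negative | classified negative)
    npv = (1 - p) * psp / ((1 - p) * psp + p * (1 - pse))
    return ppv, npv


if __name__ == "__main__":
    # Classical values reported for H(5: 1) in Application 2 of Table 1, p = 0.05.
    print(pooling_predictive_values(0.05, 0.978, 0.996))
```

Because the inputs in Table 1 are rounded to three decimals, values computed this way can differ from the tabulated PPV and NPV in the third decimal place.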
3.3. Array Testing
Our simulation methodology can be extended to estimate the operating characteristics of array testing algorithms. In two-dimensional array testing, individuals are first assigned to the cells of an array with R rows and C columns (Phatarfod and Sudbury, 1994; McMahan et al., 2012b). In the first stage, the rows and the columns of the array are tested. The second stage uses individual testing for individuals not classified as negative after the first stage. When the prevalence p is small, two-dimensional array testing can be more efficient than hierarchical algorithms (Kim et al., 2007; Westreich et al., 2008).
We modify our notation from Section 2 to accommodate array testing in two dimensions. Let Tr,c denote the true binary status of the individual in the (r, c)th position, and let 𝒞̃r,c denote this individual’s true biomarker level so that f𝒞̃r,c|Tr,c=t(u) = tf𝒞̃+(u) + (1 − t)f𝒞̃−(u), for r = 1, 2, …, R and c = 1, 2, …, C. The rth row and cth column pools are denoted by ℘r+ = {(r, 1), (r, 2), …, (r, C)} and ℘+c = {(1, c), (2, c), …, (R, c)}, respectively. Let 𝒞̃℘r+ = (1/C) Σc 𝒞̃r,c and 𝒞℘r+ denote the true and measured biomarker level of ℘r+, respectively. Let 𝒞̃℘+c = (1/R) Σr 𝒞̃r,c and 𝒞℘+c be defined analogously for ℘+c. In the first stage, row and column testing provide Z℘r+ = I(𝒞℘r+ > τ℘r+) and Z℘+c = I(𝒞℘+c > τ℘+c), where τ℘r+ and τ℘+c are first-stage thresholds (see Section 4) and where 𝒞℘r+| 𝒞̃℘r+ ~ fε(·| 𝒞̃℘r+) and 𝒞℘+c| 𝒞̃℘+c ~ fε(·| 𝒞̃℘+c). We follow the convention in Kim et al. (2007) when identifying which individuals to test in the second stage; i.e., those individuals in
$$ \mathcal{M} = \Big\{ (r, c): Z_{\wp_{r+}} = Z_{\wp_{+c}} = 1, \ \text{or} \ Z_{\wp_{r+}} = 1 \text{ and } \sum_{c'=1}^{C} Z_{\wp_{+c'}} = 0, \ \text{or} \ Z_{\wp_{+c}} = 1 \text{ and } \sum_{r'=1}^{R} Z_{\wp_{r'+}} = 0 \Big\}. $$
The event {Z℘r+ = Z℘+c= 1} occurs at the intersection of the rth row and cth column. The other two events in ℳ represent ambiguous first-stage outcomes that could arise from testing error. Second-stage testing observes Zr,c = I(𝒞r,c > τ) for each individual in ℳ, where 𝒞r,c| 𝒞̃r,c ~ fε(·| 𝒞̃r,c) and τ is a threshold for individual testing. Figure 3 illustrates this notation when R = C = 5 (i.e., for a square array). Complete simulation details to estimate the efficiency and accuracy probabilities are provided in Appendix C in the Supplementary Material.
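To make the second-stage rule concrete, the Python sketch below builds the set of individuals to retest from the first-stage row and column results. It encodes our reading of the Kim et al. (2007) convention: individuals at the intersection of a positive row and a positive column are retested, and when rows test positively without any positive column (or vice versa), every member of the positive rows (columns) is retested. The function name and the zero-based indexing are ours.

```python
def second_stage_set(row_pos, col_pos):
    """Return the set of (r, c) cells to test individually in the second stage.

    row_pos[r] and col_pos[c] are the binary first-stage results for the row
    and column pools (True if the pool tested positively).
    """
    any_row = any(row_pos)
    any_col = any(col_pos)
    retest = set()
    for r, zr in enumerate(row_pos):
        for c, zc in enumerate(col_pos):
            intersect = zr and zc                   # positive row meets positive column
            row_only = zr and not any_col           # positive row, no positive columns
            col_only = zc and not any_row           # positive column, no positive rows
            if intersect or row_only or col_only:
                retest.add((r, c))
    return retest


if __name__ == "__main__":
    rows = [False, True, False]
    cols = [True, False, False]
    cells = second_stage_set(rows, cols)
    # Total tests for this array: R + C first-stage tests plus one per retested cell.
    print(cells, len(rows) + len(cols) + len(cells))
```

In a simulation, the first-stage results would come from comparing simulated row and column measurements to their thresholds, and the number of tests for an R × C array would be R + C plus the size of the returned set.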
4. Threshold Selection
There are different types of assays used for infectious disease detection, including antibody tests (e.g., ELISA, Western Blot, combination tests which also detect antigens, etc.) and more modern tests which utilize amplification methods. Before an assay is approved for commercial use, it is usually applied to known positive and known negative specimens to determine suitable thresholds for individual testing. Ideally, these thresholds provide high levels of sensitivity and specificity when testing individual specimens. A complete list of screening assays for HIV/HBV/HCV and other infectious agents in the United States is available at www.fda.gov. An approved assay’s product insert typically recommends which threshold should be used to identify positive individuals.
When an assay is applied to pooled specimens, choosing the appropriate threshold can be more subjective. Early work in group testing estimation (see, e.g., Chen and Swallow, 1990; Tu et al., 1994) suggested that individual testing assay thresholds might also be used for pools; see Stephens et al. (2000) and Currie et al. (2004) for specific applications. In the infectious disease pooling literature, a common strategy is to take the individual testing threshold, say τ, and divide it by the number of individuals in the pool; e.g., τ℘11 = τ/n1 for a master pool in an S-stage hierarchical algorithm H(n1: n2: ⋯: nS), τ℘2l = τ/n2 for a second-stage pool, and so on. Note that selecting a pooled threshold that is inappropriately large will decrease the pooling sensitivity, thereby increasing the number of false negative diagnoses. On the other hand, a pooled threshold that is too small will produce far too many false positive pools, thereby weakening the efficiency of group testing.
For individual testing with threshold τ, it is easy to see that the sensitivity is a decreasing function of τ while the specificity is an increasing function of τ. Therefore, one way to choose an individual testing threshold is to maximize Youden’s index (Youden, 1950); i.e., τ* = arg maxτ∈ℝ{Se(τ) + Sp(τ) − 1}, as this offers a balance between maximizing both sensitivity and specificity. For a pool generically denoted by ℘ consisting of individuals whose true disease statuses are denoted by Ti, we propose a pooled threshold that is similar in spirit to Youden’s index for individual testing; i.e.,
$$ \tau^{*}_{\wp} = \arg\max_{\tau \in \mathbb{R}} \Big\{ \int_{\tau}^{\infty} f_{\mathcal{C}_{\wp} \mid \sum_i T_i = 1}(u)\, du + \int_{-\infty}^{\tau} f_{\mathcal{C}_{\wp} \mid \sum_i T_i = 0}(u)\, du - 1 \Big\}. $$
The conditional density f𝒞℘| ΣiTi=1(·) describes the distribution of the measured biomarker level of pool ℘ when there is exactly one positive individual in it. We have selected this density for two reasons. First, in low disease prevalence applications, it is almost always true that a truly positive pool is positive because there is only one positive individual in the pool. Therefore, the proposed threshold will be the appropriate threshold for a large majority of the positive pools. Second, as positive pools could conceivably contain more than one positive individual, the proposed threshold favors the adoption of a smaller-than-necessary threshold. Although this may inflate the efficiency slightly, it also promotes the detection of positive individuals.
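As a numerical illustration of this pooled threshold, the Python sketch below maximizes the Youden-type criterion by grid search for the normal example of Section 2, where the measured pool distributions are available in closed form. The function names, the search interval, and the grid resolution are our choices; second parameters of the normal distributions are read as variances, and the pool level is taken to be the arithmetic average of its members.

```python
from statistics import NormalDist

MU_NEG, VAR_NEG = 3.0, 0.25   # negative individuals
MU_POS, VAR_POS = 6.0, 1.0    # positive individuals
VAR_ERR = 0.0025              # measurement error variance


def measured_pool_dist(n, m):
    """Normal distribution of the measured level of a pool of size n with m positives."""
    mean = (m * MU_POS + (n - m) * MU_NEG) / n
    var = (m * VAR_POS + (n - m) * VAR_NEG) / n**2 + VAR_ERR
    return NormalDist(mean, var ** 0.5)


def pooled_youden_threshold(n, lo=MU_NEG, hi=MU_POS, steps=30000):
    """Grid search for the threshold maximizing Se(n: 1) + Sp(n) - 1 on [lo, hi]."""
    one_pos = measured_pool_dist(n, 1)   # exactly one positive in the pool
    all_neg = measured_pool_dist(n, 0)   # no positives in the pool
    best_tau, best_index = lo, float("-inf")
    for k in range(steps + 1):
        tau = lo + (hi - lo) * k / steps
        index = (1.0 - one_pos.cdf(tau)) + all_neg.cdf(tau) - 1.0
        if index > best_index:
            best_tau, best_index = tau, index
    return best_tau, best_index


if __name__ == "__main__":
    for n in (1, 5, 10):
        tau, index = pooled_youden_threshold(n)
        print(f"pool size {n}: threshold {tau:.3f}, index {index:.3f}")
```

For pool size 1 the search recovers the individual threshold of about 4.11 quoted in Section 2. For larger pools the threshold falls between the two simpler options, below the individual threshold but well above the individual threshold divided by the pool size.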
5. Applications
We illustrate our simulation methodology using two examples taken from the HIV pooling literature. The first example is from Wein and Zenios (1996) and Zenios and Wein (1998), who consider HIV testing with an antibody assay. The second example from May et al. (2010) is not a classical HIV screening application, but instead describes a virological assay to detect treatment failure among HIV patients. The salient feature of each application is that biomarker distributions for 𝒞+ and 𝒞− are presented as well as posited distributions for the assay measurement error. We illustrate our biomarker-based evaluations in each application using Dorfman testing, an S = 3 stage hierarchical algorithm using halving (Black et al., 2012), and two-dimensional array testing.
For Application 1 (Wein and Zenios, 1996; Zenios and Wein, 1998), the biomarker distributions provided by the authors are ln 𝒞+ ~ 𝒩(0.958, 0.8652), 𝒞− ~ I(𝒞− = 0.086), and the measurement error distribution is 𝒞℘|𝒞̃℘ ~ 𝒩{𝒞̃℘/(1 + 𝒞̃℘), 0.0088 × 𝒞̃℘/(1 + 𝒞̃℘)²}. The use of a degenerate distribution for negative individuals is described in Zenios and Wein (1998). For this collection of distributions, the threshold that maximizes Youden’s index for individual testing is τ* = 0.0485, which provides values of Se > 0.999 and Sp > 0.999; i.e., individual testing is nearly perfect. For Application 2 (May et al., 2010), log10 𝒞+ is specified to have a two-component mixture 0.93G1 + 0.07G2, where G1 (G2) is a three-parameter gamma random variable with shape parameter 1.6 (3.2), scale parameter 0.5 (0.5), and location parameter 2.7 (2.7). For negative individuals, 𝒞− ~ 0.85U1 + 0.05U2 + 0.10U3, where U1 ~ 𝒰(0, 50), U2 ~ 𝒰(50, 100), and U3 ~ 𝒰(100, 500), and where 𝒰(a, b) denotes a uniform distribution from a to b. The measurement error distribution is specified as log10 𝒞℘| 𝒞̃℘ ~ 𝒩(log10 𝒞̃℘, 0.122). The threshold that maximizes Youden’s index for individual testing in Application 2 is τ* = 436.11, which provides values of Se = 0.989 and Sp = 0.980.
For both applications, we illustrate the differences between our biomarker-based calculations of efficiency, variability, and classification accuracy and the same calculations which rely on classical assumptions (Kim et al., 2007; Westreich et al., 2008); i.e., constant Se (Sp) and conditional independence of testing responses given the true statuses. In doing so, we consider values of p ∈ {0.01, 0.05, 0.10} while utilizing the three threshold options described in Section 4: τ* (same for individual testing and pools), τ* divided by pool size, and our proposed Youden index threshold for pools. For each combination of p and the threshold used, we calculate the efficiency, the standard deviation of the number of tests per individual, and the four accuracy probabilities in Section 3.2. All of our biomarker-based characteristics are estimated using B = 1,000,000 Monte Carlo data sets. Operating characteristics under classical assumptions are calculated exactly using the expressions in Kim et al. (2007).
Our results when p = 0.05 are provided in Table 1; the same tables for p = 0.01 and p = 0.10 are given in Appendix D in the Supplementary Material. In each table, we first determine the most efficient Dorfman algorithm H(n1: 1), three-stage halving algorithm H(n1: n1/2: 1), and square array algorithm A(n1 × n1) under the classical assumptions in Kim et al. (2007) and then compare our biomarker-based evaluations to this optimal setting. Our goal is not to try to outperform the operating characteristics under classical assumptions per se, but instead to illustrate the differences between these calculations and those which exploit underlying biomarker distributions and measurement error, and, more pointedly, how these differences depend on the threshold used. This comparison simultaneously allows one to assess how robust group testing characteristics are under classical assumptions. To the best of our knowledge, this is the first assessment of this type in the case identification literature.
Table 1.
Application | Algorithm | Measure | Biomarker-based: τ* | Biomarker-based: τ*/pool size | Biomarker-based: Youden index threshold for pools | Classical |
---|---|---|---|---|---|---|
Application 1 (Se > 0.999; Sp > 0.999) | H(5: 1) | EFF (SD) | 0.426 (0.418) | 0.788 (0.492) | 0.426 (0.418) | 0.426 (0.418) |
 | | PSE | 0.996 | >0.999 | 0.997 | >0.999 |
 | | PSP | >0.999 | >0.999 | >0.999 | >0.999 |
 | | PPV | >0.999 | >0.999 | >0.999 | >0.999 |
 | | NPV | >0.999 | >0.999 | >0.999 | >0.999 |
 | H(8: 4: 1) | EFF (SD) | 0.391 (0.388) | 0.704 (0.450) | 0.393 (0.391) | 0.395 (0.389) |
 | | PSE | 0.986 | 0.999 | 0.993 | >0.999 |
 | | PSP | >0.999 | >0.999 | >0.999 | >0.999 |
 | | PPV | >0.999 | >0.999 | >0.999 | >0.999 |
 | | NPV | 0.999 | >0.999 | >0.999 | >0.999 |
 | A(9 × 9) | EFF (SD) | 0.374 (0.115) | 0.835 (0.161) | 0.378 (0.116) | 0.380 (0.117) |
 | | PSE | 0.973 | 0.999 | 0.986 | >0.999 |
 | | PSP | >0.999 | >0.999 | >0.999 | >0.999 |
 | | PPV | >0.999 | >0.999 | >0.999 | >0.999 |
 | | NPV | 0.999 | >0.999 | >0.999 | >0.999 |
Application 2 (Se = 0.989; Sp = 0.980) | H(5: 1) | EFF (SD) | 0.337 (0.344) | 0.588 (0.487) | 0.458 (0.437) | 0.439 (0.426) |
 | | PSE | 0.633 | 0.987 | 0.952 | 0.978 |
 | | PSP | 0.998 | 0.983 | 0.991 | 0.996 |
 | | PPV | 0.931 | 0.756 | 0.853 | 0.930 |
 | | NPV | 0.981 | >0.999 | 0.997 | 0.999 |
 | H(8: 4: 1) | EFF (SD) | 0.258 (0.302) | 0.575 (0.424) | 0.406 (0.397) | 0.396 (0.387) |
 | | PSE | 0.513 | 0.986 | 0.920 | 0.967 |
 | | PSP | 0.999 | 0.985 | 0.994 | 0.997 |
 | | PPV | 0.947 | 0.776 | 0.894 | 0.948 |
 | | NPV | 0.975 | 0.999 | 0.996 | 0.998 |
 | A(10 × 10) | EFF (SD) | 0.253 (0.053) | 0.751 (0.169) | 0.394 (0.121) | 0.385 (0.120) |
 | | PSE | 0.412 | 0.989 | 0.892 | 0.967 |
 | | PSP | 0.999 | 0.981 | 0.994 | 0.997 |
 | | PPV | 0.964 | 0.733 | 0.889 | 0.948 |
 | | NPV | 0.970 | 0.999 | 0.994 | 0.998 |
From Table 1 and the additional tables in Appendix D, it is clear that the efficiency (EFF), the standard deviation of the number of tests per individual (SD), and the pooling sensitivity (PSE) of group testing are the most heavily influenced by the choice of threshold. One should not be deceived by the ostensibly efficient results that arise when the threshold for individual testing τ* is also used with pools, as this is also accompanied by a decrease in PSE, sharply so in Application 2 where Se and Sp are lower. On the other hand, dividing τ* by the pool size leads to a threshold that is too small. This results in too many negative pools testing positively, which inflates the efficiency (i.e., more tests are needed per individual). Our proposed threshold for pools offers a nice compromise between these two extremes by providing approximately the same efficiency as under classical assumptions. Both applications show that accuracy probabilities under classical assumptions may be slightly optimistic, an important finding for practitioners who are concerned about classification accuracy. This is seen more noticeably in Application 2 where the error rates for individual testing are comparatively larger and also for lower values of p in both applications (e.g., p = 0.01, shown in Appendix D).
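The idea of choosing a pool threshold through the Youden index can be illustrated with a small simulation. The Python sketch below simulates readings for truly positive and truly negative pools and picks the grid value that maximizes pool sensitivity plus pool specificity minus one. It is only one plausible reading of a Youden-type criterion for pools and is not claimed to reproduce the construction in Section 4. The normal biomarker distributions, their parameters, the averaging of concentrations within a pool, and all function names are assumptions made for illustration.

```python
import random


def pool_readings(p, n, n_pools, rng,
                  mu_neg=1.0, mu_pos=7.0, sd_bio=1.0, sd_err=0.5):
    """Simulate pool readings, split by the pool's true status.

    Hypothetical model: normal biomarkers, pool concentration equal to
    the average of its members, plus normal measurement error.
    """
    pos, neg = [], []
    for _ in range(n_pools):
        status = [rng.random() < p for _ in range(n)]
        conc = [rng.gauss(mu_pos if s else mu_neg, sd_bio) for s in status]
        reading = sum(conc) / n + rng.gauss(0.0, sd_err)
        (pos if any(status) else neg).append(reading)
    return pos, neg


def youden_threshold(pos, neg, grid):
    """Grid value maximizing J(t) = Se(t) + Sp(t) - 1 for the pool test."""
    def youden(t):
        se = sum(r > t for r in pos) / len(pos)    # positive pools flagged
        sp = sum(r <= t for r in neg) / len(neg)   # negative pools cleared
        return se + sp - 1
    return max(grid, key=youden)


if __name__ == "__main__":
    rng = random.Random(0)
    pos, neg = pool_readings(p=0.05, n=5, n_pools=20000, rng=rng)
    grid = [i * 0.05 for i in range(81)]  # candidate thresholds 0.00 to 4.00
    print(youden_threshold(pos, neg, grid))
```

With these hypothetical parameters and an individual threshold of 4, dividing by the pool size gives 0.8, which falls below the mean reading of a negative pool (1.0). The Youden-type threshold instead lands between the typical readings of negative pools and of pools containing one positive individual.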
6. Discussion
We have proposed a simulation-based methodology to evaluate the operating characteristics of group testing case identification algorithms when individual biomarker distributions are known. Our approach allows the investigator to incorporate the effect of assay measurement error and proposes a new strategy for selecting thresholds when testing pools. Our research web site www.chrisbilder.com/grouptesting contains R programs that implement our simulation methods for hierarchical algorithms and two-dimensional array testing with normally distributed biomarkers. These programs can be changed to include other biomarker distributions; e.g., gamma, lognormal, or nonstandard choices like those found in Section 5. In addition, these programs can be modified to include other group testing strategies, such as array testing designs that include master pools (Kim et al., 2007), higher dimensional arrays (Kim and Hudgens, 2009), and other algorithms outside the H(n1: n2: ⋯: nS) family described in Section 2.
Our evaluations of case identification algorithms do not require one to assume anything about the sensitivity and specificity of testing pools, because operating characteristics are estimated directly from the biomarker distributions themselves. Our approach also does not force one to assume that testing results are conditionally independent given the true statuses of the individuals being tested. This assumption is required under classical evaluations because pools formed throughout the testing process can contain common individuals. Litvak et al. (1994) have described scenarios where the conditional independence assumption is reasonable empirically; however, there is a large body of evidence in the diagnostic testing literature suggesting that this assumption may be too restrictive. Finally, because the framework described in this article incorporates Monte Carlo simulation, it would be straightforward to generalize our evaluations to allow for unequal disease probabilities p_i, say, which may arise when covariate information is available on individuals (McMahan et al., 2012a; 2012b). For this same reason, our approach could also be extended to accommodate individual disease statuses that are correlated (Lendle et al., 2012) or to applications where biomarkers are measured for multiple diseases (Tebbs et al., 2013).
Throughout this article, we have assumed that the biomarker distributions for positive and negative individuals, f𝒞̃+ and f𝒞̃−, respectively, and the measurement error density fε are known exactly. This assumption may be prohibitive in applications where biomarker and measurement error information is not available (e.g., in surveillance studies). It should be possible to estimate these distributions with continuous group testing responses on pools and individuals, although this would require the development of new deconvolution methods and hence we leave this to future work. In lieu of perfect knowledge about these distributions, an anonymous referee has suggested that one could perform a sensitivity analysis to assess the impact of misspecifying f𝒞̃+, f𝒞̃−, or fε. This is straightforward to accomplish within the framework outlined in this article because our methods make use of Monte Carlo simulation. In Appendix E in the Supplementary Material, we provide an example showing how such an analysis could be implemented.
Supplementary Material
Acknowledgments
We are grateful to two anonymous referees who provided insightful comments and suggestions. We thank Dr. Elizabeth Torrone at the Centers for Disease Control and Prevention for her consultation on infectious disease screening practices in the United States. This research was funded by Grant R01 AI121351 from the National Institutes of Health.
Footnotes
Supplementary material related to this article can be found online at [insert address here].
References
- Black M, Bilder C, Tebbs J. Group testing in heterogeneous populations by using halving algorithms. Journal of the Royal Statistical Society: Series C. 2012;61:277–290. doi: 10.1111/j.1467-9876.2011.01008.x.
- Busch M, Caglioti S, Robertson E, McAuley J, Tobler L, Kamel H, Linnen J, Shyamala V, Tomasulo P, Kleinman S. Screening the blood supply for West Nile virus RNA by nucleic acid amplification testing. New England Journal of Medicine. 2005;353:460–467. doi: 10.1056/NEJMoa044029.
- Candotti D, Allain J. Transfusion-transmitted hepatitis B virus infection. Journal of Hepatology. 2009;51:798–809. doi: 10.1016/j.jhep.2009.05.020.
- Chen C, Swallow W. Using group testing to estimate a proportion, and to test the binomial model. Biometrics. 1990;46:1035–1046. doi: 10.2307/2532446.
- Currie M, McNiven M, Yee T, Schiemer U, Bowden F. Pooling of clinical specimens prior to testing for Chlamydia trachomatis by PCR is accurate and cost saving. Journal of Clinical Microbiology. 2004;42:4866–4867. doi: 10.1128/JCM.42.10.4866-4867.2004.
- Delaigle A, Hall P. Nonparametric methods for group testing data, taking dilution into account. Biometrika. 2015;102:871–887. doi: 10.1093/biomet/asv049.
- Dorfman R. The detection of defective members of large populations. Annals of Mathematical Statistics. 1943;14:436–440. doi: 10.1214/aoms/1177731363.
- Edouard S, Prudent E, Gautret P, Memish Z, Raoult D. Cost-effective pooling of DNA from nasopharyngeal swab samples for large-scale detection of bacteria by real-time PCR. Journal of Clinical Microbiology. 2015;52:1002–1004. doi: 10.1128/JCM.03609-14.
- Hudgens M, Kim H. Optimal configuration of a square array group testing algorithm. Communications in Statistics - Theory and Methods. 2011;40:436–448. doi: 10.1080/03610920903391303.
- Kim H, Hudgens M. Three-dimensional array-based group testing algorithms. Biometrics. 2009;65:903–910. doi: 10.1111/j.1541-0420.2008.01158.x.
- Kim H, Hudgens M, Dreyfuss J, Westreich D, Pilcher C. Comparison of group testing algorithms for case identification in the presence of testing error. Biometrics. 2007;63:1152–1163. doi: 10.1111/j.1541-0420.2007.00817.x.
- Krajden M, Cook D, Mak A, Chu K, Chahil N, Steinberg M, Rekart M, Gilbert M. Pooled nucleic acid testing increases the diagnostic yield of acute HIV infections in a high-risk population compared to 3rd and 4th generation HIV enzyme immunoassays. Journal of Clinical Virology. 2014;61:132–137. doi: 10.1016/j.jcv.2014.06.024.
- Lendle S, Hudgens M, Qaqish B. Group testing for case identification with correlated responses. Biometrics. 2012;68:532–540. doi: 10.1111/j.1541-0420.2011.01674.x.
- Lewis J, Lockary V, Kobic S. Cost savings and increased efficiency using a stratified specimen pooling strategy for Chlamydia trachomatis and Neisseria gonorrhoeae. Sexually Transmitted Diseases. 2012;39:46–48. doi: 10.1097/OLQ.0b013e318231cd4a.
- Litvak E, Tu X, Pagano M. Screening for the presence of a disease by pooling sera samples. Journal of the American Statistical Association. 1994;89:424–434. doi: 10.1080/01621459.1994.10476764.
- Malinovsky Y, Albert P, Roy A. A note on the evaluation of group testing algorithms in the presence of misclassification. Biometrics. 2016;72:299–302. doi: 10.1111/biom.12385.
- Malinovsky Y, Albert P, Schisterman E. Pooling designs for outcomes under a Gaussian random effects model. Biometrics. 2012;68:45–52. doi: 10.1111/j.1541-0420.2011.01673.x.
- Martin E, Salaru G, Mohammed D, Coombs R, Paul S, Cadoff E. Finding those at risk: Acute HIV infection in Newark, NJ. Journal of Clinical Virology. 2013;58:24–28. doi: 10.1016/j.jcv.2013.07.016.
- May S, Gamst A, Haubrich R, Benson C, Smith D. Pooled nucleic acid testing to identify antiretroviral treatment failure during HIV infection. Journal of Acquired Immune Deficiency Syndromes. 2010;53:194–201. doi: 10.1097/QAI.0b013e3181ba37a7.
- McMahan C, Tebbs J, Bilder C. Informative Dorfman screening. Biometrics. 2012a;68:287–296. doi: 10.1111/j.1541-0420.2011.01644.x.
- McMahan C, Tebbs J, Bilder C. Two-dimensional informative array testing. Biometrics. 2012b;68:793–804. doi: 10.1111/j.1541-0420.2011.01726.x.
- McMahan C, Tebbs J, Bilder C. Regression models for group testing data with pool dilution effects. Biostatistics. 2013;14:284–298. doi: 10.1093/biostatistics/kxs045.
- Mehta S, Nguyen V, Osorio G, Little S, Smith D. Evaluation of pooled rapid HIV antibody screening of patients admitted to a San Diego hospital. Journal of Virological Methods. 2011;174:94–98. doi: 10.1016/j.jviromet.2011.04.002.
- Mitchell E, Lyles R, Manatunga A, Danaher M, Perkins N, Schisterman E. Regression for skewed biomarker outcomes subject to pooling. Biometrics. 2014;70:202–211. doi: 10.1111/biom.12134.
- O’Brien S, Yi Q, Fan W, Scalia V, Fearon M, Allain J. Current incidence and residual risk of HIV, HBV and HCV at Canadian Blood Services. Vox Sanguinis. 2012;103:83–86. doi: 10.1111/j.1423-0410.2012.01584.x.
- Page-Shafer K, Pappalardo B, Tobler L, Phelps B, Edlin B, Moss A, Wright T, Wright D, O’Brien T, Caglioti S, Busch M. Testing strategy to identify cases of acute hepatitis C virus (HCV) infection and to project HCV incidence rates. Journal of Clinical Microbiology. 2008;46:499–506. doi: 10.1128/JCM.01229-07.
- Phatarfod R, Sudbury A. The use of a square array scheme in blood testing. Statistics in Medicine. 1994;13:2337–2343. doi: 10.1002/sim.4780132205.
- Schmidt M, Pichl L, Jork C, Hourfar M, Schottstedt V, Wagner F, Seifried E, Muller T, Bux J, Saldanha J. Blood donor screening with cobas s 201/cobas TaqScreen MPX under routine conditions at German Red Cross institutes. Vox Sanguinis. 2010;98:37–46. doi: 10.1111/j.1423-0410.2009.01219.x.
- Sherlock M, Zelota N, Klausner J. Routine detection of acute HIV infection through RNA pooling: Survey of current practice in the United States. Sexually Transmitted Diseases. 2007;34:314–316. doi: 10.1097/01.olq.0000263262.00273.9c.
- Stephens G, Raboud J, Karakas L, Sherlock H. Can pooling be used for seroprevalence studies of hepatitis C? Journal of Clinical Microbiology. 2000;38:4264–4265. doi: 10.1128/jcm.38.11.4264-4265.2000.
- Stramer S, Krysztof D, Brodsky J, Fickett T, Reynolds B, Dodd R, Kleinman S. Comparative analysis of triplex nucleic acid test assays in United States blood donors. Transfusion. 2013;53:2525–2537. doi: 10.1111/trf.12178.
- Tebbs J, McMahan C, Bilder C. Two-stage hierarchical group testing for multiple infections with application to the Infertility Prevention Project. Biometrics. 2013;69:1064–1073. doi: 10.1111/biom.12080.
- Tu X, Litvak E, Pagano M. Screening tests: Can we get more by doing less? Statistics in Medicine. 1994;13:1905–1919. doi: 10.1002/sim.4780131904.
- Wang D, McMahan C, Gallagher C. A general regression framework for group testing data, which incorporates pool dilution effects. Statistics in Medicine. 2015;34:3606–3621. doi: 10.1002/sim.6578.
- Wein L, Zenios S. Pooled testing for HIV screening: Capturing the dilution effect. Operations Research. 1996;44:543–569. doi: 10.1287/opre.44.4.543.
- Westreich D, Hudgens M, Fiscus S, Pilcher C. Optimizing screening for acute human immunodeficiency virus infection with pooled nucleic acid amplification tests. Journal of Clinical Microbiology. 2008;46:1785–1792. doi: 10.1128/JCM.00787-07.
- Youden W. Index for rating diagnostic tests. Cancer. 1950;3:32–35. doi: 10.1002/1097-0142(1950)3:1<32::AID-CNCR2820030106>3.0.CO;2-3.
- Zenios S, Wein L. Pooled testing for HIV prevalence estimation: Exploiting the dilution effect. Statistics in Medicine. 1998;17:1447–1467. doi: 10.1002/(SICI)1097-0258(19980715)17:13<1447::AID-SIM862>3.0.CO;2-K.
- Zhang Z, Albert P. Binary regression analysis with pooled exposure measurements: A regression calibration approach. Biometrics. 2011;67:636–645. doi: 10.1111/j.1541-0420.2010.01464.x.