Abstract
Group testing is widely used to reduce the cost of screening individuals for infectious diseases. There is an extensive literature on group testing, most of which traditionally has focused on estimating the probability of infection in a homogeneous population. More recently, this research area has shifted towards estimating individual-specific probabilities in a regression context. However, existing regression approaches have assumed that the sensitivity and specificity of pooled biospecimens are constant and do not depend on the pool sizes. In applications where this assumption is not realistic, these existing approaches can lead to inaccurate inference, especially when pool sizes are large. Our new approach, which exploits the information readily available from underlying continuous biomarker distributions, provides reliable inference in settings where pooling would be most beneficial and does so even for larger pool sizes. We illustrate our methodology using hepatitis B data from a study involving Irish prisoners.
Keywords: Binary response, Biomarker, Maximum likelihood, Pooled testing, Sensitivity, Specificity
1. Introduction
Group testing involves pooling individual specimens (e.g. blood, urine, or swabs) and testing the pooled samples for the presence of infection. Since the seminal work of Dorfman (1943), group testing has been used to reduce the cost of testing individuals for a variety of sexually transmitted diseases, including HIV, hepatitis B/C, and chlamydia (Cardoso and others, 1998; Pilcher and others, 2005; Lewis and others, 2012), and for other infections such as the West Nile virus (Busch and others, 2005) and the H1N1 virus (Van and others, 2012). The use of pooling through group testing has also been exploited in other areas, including drug discovery (Remlinger and others, 2006), genetics (Gastwirth, 2000), and bioterrorism detection (Schmidt and others, 2005).
Statistical research in group testing generally splits into two areas: classification and estimation. In the classification problem, the goal is to identify all positives (e.g. HIV-infected individuals, etc.) among those individuals tested. This is accomplished by retesting individuals belonging to pools which test positive; see Kim and others (2007) for a review. In the statistics literature, estimation has received more attention, but most of this work historically has regarded the population of individuals as being homogeneous; i.e. each individual has the same probability of positivity. More recently, the estimation problem has been extended to incorporate individual covariate information in regression models with pooled binary responses. Vansteelandt and others (2000) proposed a generalized linear modeling approach using initial pool responses, generalizing the earlier work of Farrington (1992). Xie (2001) presents an expectation–maximization algorithm that flexibly accommodates different classes of regression functions and additional retest information. Chen and others (2009) and Huang (2009) extend the work of Vansteelandt and others (2000) to incorporate random effects and covariate measurement error, respectively. Delaigle and Meister (2011) and Delaigle and Hall (2012) have proposed non-parametric approaches.
Modeling group testing data has been an important advance because it allows one to assess the significance of covariate effects and/or to report subject-specific estimates of the probability of infection at a fraction of the cost of individual testing. However, previous regression work in group testing has proceeded under the simplifying assumption that testing error rates (sensitivity and specificity) for pooled specimens are constant and are functionally independent of the pool sizes. In practice, this assumption can be met with skepticism, especially when larger pool sizes are used, because positive specimens in a pool can be diluted by negative ones beyond the threshold of detection of an assay. As we demonstrate, regression methods which ignore this effect (when it is present) can lead to inaccurate estimates, potentially affecting the conclusions reached about the covariate effects and severely underestimating subject-specific probabilities of infection.
Some authors have already considered the so-called “dilution effect” in group testing estimation; unfortunately, this research has been limited to a homogeneous population. To estimate a common individual probability of infection, say p, Hung and Swallow (1999) specify that misclassification probabilities on the group scale are known functions of the pool size. In separate work, motivated by antibody testing for HIV, Wein and Zenios (1996) and Zenios and Wein (1998) proposed a parametric hierarchical model that relates the continuous response of an assay test to the (latent) antibody concentration in the pool. In this paper, we extend this hierarchical modeling framework to include individual covariate information in a regression setting. Our model formulation is general, so it is applicable with any test where a continuous biomarker can be measured, with or without error, on individual or pooled specimens. This includes antibody tests and tests based on nucleic acid amplification technology.
In Section 2, we state our model assumptions and discuss some practical issues associated with testing pooled specimens in an infectious disease context. In Section 3, we derive pool-specific misclassification probabilities (sensitivity and specificity) in terms of underlying biomarker distributions and then generalize the regression approach taken by Vansteelandt and others (2000) to include them. In Section 4, we provide simulation evidence showing that our new methods perform very well in realistic settings and that traditional regression approaches can perform poorly when the sensitivity and specificity of pooled specimens are assumed to be constant. In Section 5, we apply our regression techniques to a hepatitis B virus (HBV) data set involving Irish prisoners. In Section 6, we conclude with a summary discussion.
2. Preliminaries
2.1. Notation and assumptions
Suppose N individuals are to be tested for infection and that each individual is assigned to exactly one of J pools. Let $\widetilde{Y}_{ij}=1$ if the ith individual in the jth pool is truly positive and $\widetilde{Y}_{ij}=0$ otherwise, for i=1,2,…,cj and j=1,2,…,J. We assume throughout that the $\widetilde{Y}_{ij}$'s are independent Bernoulli random variables with mean pij. Let $\widetilde{Z}_j$ denote the true binary status of the jth pool; i.e. $\widetilde{Z}_j=1$ if the pool contains at least one positive individual and $\widetilde{Z}_j=0$ otherwise. Allowing for the possibility of misclassification, we let Zj=1 if the jth pool tests positive and Zj=0 otherwise. An appeal to the Law of Total Probability shows that

$$\mathrm{pr}(Z_j=1) = S_{e_j} + (1 - S_{e_j} - S_{p_j})\prod_{i=1}^{c_j}(1-p_{ij}),$$

where $S_{e_j}=\mathrm{pr}(Z_j=1\mid\widetilde{Z}_j=1)$ and $S_{p_j}=\mathrm{pr}(Z_j=0\mid\widetilde{Z}_j=0)$ are the test sensitivity and specificity for the jth pool, respectively. Previous research in group testing regression has assumed that $S_{e_j}$ and $S_{p_j}$ do not depend on the pool size cj and are the same for all pools. In this paper, our primary goal is to remove this assumption.
To acknowledge heterogeneity, we assume that pij is related to the linear predictor x′ijβ through a monotonic, differentiable link function g(⋅), where xij=(1,xij1,…,xijr)′ is an (r+1)×1 vector of covariates for the ith individual in the jth pool and β=(β0,β1,…,βr)′ is a vector of regression parameters; i.e. $g(p_{ij})=\mathbf{x}_{ij}'\boldsymbol{\beta}$. As in Vansteelandt and others (2000), we assume that the available data are {(Zj,x′1j,x′2j,…,x′cjj)′,j=1,2,…,J} so that the log-likelihood function, on the basis of pool responses and individual covariates, is given by

$$l(\boldsymbol{\beta}\mid\mathbf{Z}) = \sum_{j=1}^{J}\left\{Z_j\log\theta_j + (1-Z_j)\log(1-\theta_j)\right\}, \tag{2.1}$$

where $\theta_j = S_{e_j} + (1-S_{e_j}-S_{p_j})\prod_{i=1}^{c_j}(1-p_{ij})$ is the probability the jth pool tests positive and Z=(Z1,Z2,…,ZJ)′. Note that our log-likelihood in (2.1) is identical to that in Vansteelandt and others (2000) except we allow the pool sensitivity $S_{e_j}$ and the pool specificity $S_{p_j}$ to be possibly different for different pools.
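The pool-level probability and the log-likelihood (2.1) can be sketched in a few lines. This is a minimal illustration rather than the authors' R implementation: the function names and the toy inputs are hypothetical, and the individual probabilities are passed in directly instead of being computed from covariates through a link function.

```python
import math

def pool_positive_prob(p, se, sp):
    """pr(Z_j = 1) for one pool whose members have infection probabilities p."""
    prob_all_negative = math.prod(1.0 - p_i for p_i in p)
    return se + (1.0 - se - sp) * prob_all_negative

def log_likelihood(z, pools_p, se, sp):
    """Log-likelihood (2.1); se[j] and sp[j] may differ from pool to pool."""
    total = 0.0
    for z_j, p_j, se_j, sp_j in zip(z, pools_p, se, sp):
        theta_j = pool_positive_prob(p_j, se_j, sp_j)
        total += z_j * math.log(theta_j) + (1 - z_j) * math.log(1.0 - theta_j)
    return total

# Two hypothetical pools of sizes 2 and 3 (all numbers are made up)
z = [1, 0]
pools_p = [[0.10, 0.20], [0.05, 0.05, 0.10]]
se = [0.95, 0.90]
sp = [0.98, 0.99]
print(log_likelihood(z, pools_p, se, sp))
```

Passing `se` and `sp` as per-pool lists is the only structural difference from a constant Se/Sp likelihood, which would use the same two numbers for every pool.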
2.2. Assay thresholds
In most applications, the outcome of a diagnostic test is a binary response derived from the measured concentration of a continuous biomarker. For example, enzyme-linked immunosorbent assay is a technique that detects the concentration of antigens or antibodies in a specimen. As in Wein and Zenios (1996) and Zenios and Wein (1998), we adopt the standard convention that if a specimen's measured concentration (e.g. optical density (OD) reading, viral load, etc.) is above a predetermined threshold, say t0, then the test classifies the specimen as positive. Otherwise, the test classifies the specimen as negative. Previous group testing regression approaches have included only the (dichotomized) binary testing responses as part of the modeling process, ignoring the information available from the underlying (continuous) biomarker distributions.
When individual specimens are pooled together, an important consideration is determining the threshold t0 that discriminates positive pools from the negative ones. Unfortunately, there is no established consensus on how this should be done with pooled samples, perhaps because many types of assays are available and diagnostic performance varies greatly from assay to assay. Early work in group testing estimation seemingly conjectured that t0 could be taken to be the same when testing pools as when testing individuals (Tu and others, 1994), and we have found studies in the infectious disease testing literature that have proceeded under this assumption (Currie and others, 2004). Another approach is to decrease the individual testing threshold t0 to account for the effect of pooling; for example, Vansteelandt and others (2005) change t0 to t0/c when pooling c specimens. In the biomarker pooling literature, Schisterman and others (2005) describe how t0 can be chosen to maximize Youden's index.
While certainly relevant for practical use, methods for choosing the “best” threshold t0 in group testing applications are not a focus of this paper. For our purposes, we assume that the threshold t0 has been determined a priori after the assay has been rigorously validated for use with pooled samples. Our regression methodology, to be described next, allows for pools of different sizes to use different thresholds. We write t0=t0(cj) if this is to be emphasized.
3. Regression with pool-specific misclassification probabilities
In this section, we first obtain closed-form expressions for the pool sensitivity $S_{e_j}$ and the pool specificity $S_{p_j}$ on a pool-by-pool basis in terms of the relevant biomarker distributions. We then show how to incorporate these probabilities into a regression model fitting procedure using maximum likelihood.
3.1. Biomarker distributions
Let $f_{\widetilde{C}^+}$ ($f_{\widetilde{C}^-}$) denote the probability density function of the true biomarker concentration for positive (negative) individuals. We interpret these distributions as conditional distributions given an individual's true status; additionally, we assume that these distributions do not depend on individual-level covariates. To acknowledge the inherent, assay-specific error associated with measuring the true biomarker concentration level $\widetilde{C}$ of a specimen (individual or pooled), let $f_{\zeta\mid\widetilde{C}}$ denote the conditional density of the measured concentration ζ, given the true concentration $\widetilde{C}$. In this section, we assume that $f_{\widetilde{C}^+}$, $f_{\widetilde{C}^-}$, and $f_{\zeta\mid\widetilde{C}}$ are known. If the individual biomarker distributions $f_{\widetilde{C}^+}$ and $f_{\widetilde{C}^-}$ are unknown, training data can be used to estimate them under parametric or non-parametric assumptions; we illustrate this in Sections 4 and 5.
3.2. Expressions for pool-specific sensitivity and specificity
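Let $\widetilde{C}_{ij}$ denote the true biomarker concentration for the ith individual in the jth pool so that $\widetilde{C}_{ij}\mid\{\widetilde{Y}_{ij}=1\}\sim f_{\widetilde{C}^+}$ and $\widetilde{C}_{ij}\mid\{\widetilde{Y}_{ij}=0\}\sim f_{\widetilde{C}^-}$. For notational purposes, denote the jth pool by $\mathcal{P}_j$ and let $\widetilde{C}_j$ denote the true biomarker concentration of $\mathcal{P}_j$. Throughout this paper, we assume that $\widetilde{C}_j=c_j^{-1}\sum_{i=1}^{c_j}\widetilde{C}_{ij}$, that is, the true biomarker concentration of $\mathcal{P}_j$ is the average of the true individual biomarker concentrations. This assumption is very common in the biomarker pooling literature (Vexler and others, 2008; Malinovsky and others, 2012); see also Section 6.

Let $f_{\widetilde{C}_j\mid k}$ denote the (conditional) density of $\widetilde{C}_j$, given that $\mathcal{P}_j$ contains exactly k positive individuals, for k=0,1,…,cj. To write an expression for $f_{\widetilde{C}_j\mid k}$ in terms of $f_{\widetilde{C}^+}$ and $f_{\widetilde{C}^-}$, we first consider the (conditional) density of the sum $\sum_{i=1}^{c_j}\widetilde{C}_{ij}$, given that $\mathcal{P}_j$ contains k positive individuals and cj−k negative individuals; denote this density by $f_{\Sigma_j\mid k}$. It is well known that the density of the sum of independent random variables is the convolution of their respective densities; see, e.g. Casella and Berger (2002, Chapter 5). Therefore, we can write

$$f_{\Sigma_j\mid k}(s) = \left(f_{\widetilde{C}^+}^{\,k*} * f_{\widetilde{C}^-}^{\,(c_j-k)*}\right)(s),$$

where “*” denotes the usual convolution operator and $f^{m*}$ is the m-fold convolution of f with itself. A linear transformation argument (mapping sums to averages) shows that $f_{\widetilde{C}_j\mid k}(u)=c_j\,f_{\Sigma_j\mid k}(c_j u)$, for k=0,1,…,cj.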
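As a worked special case of the convolution and rescaling steps, suppose (an assumption made here only for illustration) that both individual densities are normal, with means $\mu_+$ and $\mu_-$ and variances $\sigma_+^2$ and $\sigma_-^2$; these four symbols are introduced for this example and are not part of the general model. Sums of independent normal random variables are normal, so the sum and the pool average, given k positive individuals, have closed forms:

```latex
\sum_{i=1}^{c_j}\widetilde{C}_{ij}\,\Big|\,k \;\sim\;
  N\!\big(k\mu_+ + (c_j-k)\mu_-,\; k\sigma_+^2 + (c_j-k)\sigma_-^2\big),
\qquad
\widetilde{C}_j\,\Big|\,k \;\sim\;
  N\!\Big(\frac{k\mu_+ + (c_j-k)\mu_-}{c_j},\; \frac{k\sigma_+^2 + (c_j-k)\sigma_-^2}{c_j^{2}}\Big).
```

With k held fixed, the mean equals $\mu_- + k(\mu_+-\mu_-)/c_j$ and therefore moves towards $\mu_-$ as the pool size grows, which is the dilution effect expressed in distributional terms.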
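We now obtain expressions for $S_{e_j}$ and $S_{p_j}$. To account for assay measurement error, let $\zeta_j$ denote the measured biomarker concentration of $\mathcal{P}_j$ which, by assumption and conditional on $\widetilde{C}_j$, is distributed as $f_{\zeta\mid\widetilde{C}_j}$. Under our classification rule $Z_j=I(\zeta_j>t_0)$, the probability $\mathcal{P}_j$ tests negative (Zj=0), given that it contains only negative individuals; i.e. the specificity for $\mathcal{P}_j$, is given by

$$S_{p_j} = S_p(c_j) = \int_{-\infty}^{t_0}\int_{-\infty}^{\infty} f_{\zeta\mid\widetilde{C}_j}(\zeta\mid u)\, f_{\widetilde{C}_j\mid 0}(u)\,du\,d\zeta.$$

To derive an expression for $S_{e_j}$, we first note that the probability $\mathcal{P}_j$ tests positive (Zj=1), given that it contains exactly k positive individuals, is given by

$$S_e(c_j,k) = \int_{t_0}^{\infty}\int_{-\infty}^{\infty} f_{\zeta\mid\widetilde{C}_j}(\zeta\mid u)\, f_{\widetilde{C}_j\mid k}(u)\,du\,d\zeta.$$

It is possible to express $S_{e_j}$ as a linear combination of Se(cj,k), for k=1,2,…,cj. To see why, note that

$$S_{e_j} = \mathrm{pr}(Z_j=1\mid\widetilde{Z}_j=1) = \sum_{k=1}^{c_j} S_e(c_j,k)\,\mathrm{pr}\Big(\sum_{i=1}^{c_j}\widetilde{Y}_{ij}=k \,\Big|\, \widetilde{Z}_j=1\Big)$$

and write $K_j=\sum_{i=1}^{c_j}\widetilde{Y}_{ij}$ so that

$$\mathrm{pr}(K_j=k\mid\widetilde{Z}_j=1) = \frac{\mathrm{pr}(K_j=k)}{1-\prod_{i=1}^{c_j}(1-p_{ij})}, \quad k=1,2,\ldots,c_j.$$

Therefore, the sensitivity for $\mathcal{P}_j$ can be written as $S_{e_j}=\sum_{k=1}^{c_j}\gamma_{jk}\,S_e(c_j,k)$, where

$$\gamma_{jk} = \frac{\mathrm{pr}(K_j=k)}{1-\prod_{i=1}^{c_j}(1-p_{ij})}.$$

The probability $\mathrm{pr}(K_j=k)$ can be calculated using the approach outlined in Wang (1993). It is insightful to note that $S_{e_j}$ depends on the pool size cj and on the number of positive individuals in $\mathcal{P}_j$ through the infection probabilities pij, i=1,2,…,cj. On the other hand, $S_{p_j}$ depends on the pool size cj but not on the infection probabilities pij of those individuals in $\mathcal{P}_j$.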
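The weighting of Se(cj,k) by the distribution of the number of positives in a pool can be sketched as follows. The distribution of a sum of independent, non-identical Bernoulli variables is computed here with a standard dynamic-programming recursion; this is an assumption of the sketch and not necessarily the algorithm of Wang (1993). The function names and the numerical inputs are hypothetical.

```python
def poisson_binomial_pmf(p):
    """Return [pr(K = 0), ..., pr(K = c)] for independent Bernoulli(p_i) statuses."""
    pmf = [1.0]
    for p_i in p:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p_i)   # this individual is negative
            nxt[k + 1] += mass * p_i       # this individual is positive
        pmf = nxt
    return pmf

def pool_sensitivity(p, se_by_k):
    """Pool sensitivity as a weighted sum of Se(c, k); se_by_k[k - 1] holds Se(c, k)."""
    pmf = poisson_binomial_pmf(p)
    prob_truly_positive = 1.0 - pmf[0]     # pr(at least one positive in the pool)
    weighted = sum(pmf[k] * se_by_k[k - 1] for k in range(1, len(p) + 1))
    return weighted / prob_truly_positive

p = [0.05, 0.10, 0.20]           # hypothetical infection probabilities
se_by_k = [0.80, 0.95, 0.99]     # hypothetical Se(3, k) for k = 1, 2, 3
print(pool_sensitivity(p, se_by_k))
```

Because most truly positive pools contain a single positive individual when prevalence is low, the result sits close to Se(c,1), the value most affected by dilution.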
3.3. Maximum likelihood
Estimating β via maximum likelihood proceeds by maximizing (2.1) numerically, after replacing the pool sensitivity $S_{e_j}$ and the pool specificity $S_{p_j}$ with the expressions in Section 3.2 and setting pij=g−1(xij′β) throughout. Because the pool sensitivity $S_{e_j}$ is a function of the regression parameter β, maximizing (2.1) might seem to be formidable at first glance. However, we have had little difficulty in implementing a quasi-Newton optimization procedure in R (using the optim function); see also Section 6. The large-sample covariance matrix of the maximum-likelihood estimator (MLE) $\widehat{\boldsymbol{\beta}}$ can be estimated using the negative inverse Hessian of (2.1) at the last iteration of the procedure, making the construction of large-sample Wald confidence intervals possible (applicable when J is large). Our simulation results show that this matrix is estimated well and that Wald intervals attain the nominal coverage probability in realistic settings.
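The idea of maximizing the log-likelihood numerically and then inverting the negative second derivative to obtain a Wald interval can be illustrated in a deliberately simplified setting. The sketch below assumes a single common infection probability p, equal pool sizes, and constant Se and Sp, and it uses a derivative-free golden-section search rather than the quasi-Newton routine described in the text; all function names and data values are hypothetical.

```python
import math

def log_lik(p, n_pos, n_pools, c, se, sp):
    """Log-likelihood for a common infection probability p and equal pool size c."""
    theta = se + (1.0 - se - sp) * (1.0 - p) ** c   # pr(pool tests positive)
    return n_pos * math.log(theta) + (n_pools - n_pos) * math.log(1.0 - theta)

def golden_section_max(f, lo, hi, tol=1e-10):
    """Maximize a unimodal function f on [lo, hi] without derivatives."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    while hi - lo > tol:
        x1 = hi - inv_phi * (hi - lo)
        x2 = lo + inv_phi * (hi - lo)
        if f(x1) < f(x2):
            lo = x1        # the maximum lies to the right of x1
        else:
            hi = x2        # the maximum lies to the left of x2
    return 0.5 * (lo + hi)

def wald_interval(f, est, h=1e-4, z=1.96):
    """Wald interval using a central-difference second derivative of f at est."""
    second_deriv = (f(est + h) - 2.0 * f(est) + f(est - h)) / h ** 2
    std_err = math.sqrt(-1.0 / second_deriv)
    return est - z * std_err, est + z * std_err

# Hypothetical data: 60 of 200 pools of size 5 test positive
n_pos, n_pools, c, se, sp = 60, 200, 5, 0.95, 0.98
f = lambda p: log_lik(p, n_pos, n_pools, c, se, sp)
p_hat = golden_section_max(f, 1e-6, 0.5)
print(p_hat, wald_interval(f, p_hat))
```

In this scalar case the maximizer can be checked against the closed form obtained by equating the pool-positive probability to the observed proportion of positive pools; the regression setting replaces the scalar search with a multivariate optimizer and the second derivative with the Hessian matrix.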
4. Simulation evidence
4.1. Simulation description
We use simulation to assess the performance of our proposed regression methods. Of primary interest is to compare our approach with the approach that assumes the pool sensitivity $S_{e_j}$ and the pool specificity $S_{p_j}$ are constant for all pools and do not depend on pool size. We consider the following models:
; β=(β0,β1)′=(−3,2)′,
; β=(β0,β1,β2)′=(−3,1,0.5)′,
; β=(β0,β1,β2)′=(−3,2,1)′,
where
and xij2∼Bernoulli(0.1). These distributions were chosen to emulate situations where pooling is sensible. The mean prevalence for the three models ranges from 8 to 10%, which is consistent with our application in Section 5. We first simulate N=1500 individual probabilities pij according to each model, where i=1,2,…,c and j=1,2,…,J, J=N/c, and use common pool sizes c=1,5,10,15 (c=1 corresponds to individual testing). True individual statuses $\widetilde{Y}_{ij}$ are then simulated from a Bernoulli(pij) distribution.
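We specify $f_{\widetilde{C}^-}$ and $f_{\widetilde{C}^+}$ to be normal densities with means 0.1 and 1, respectively, where τ∈{0.1,0.2,0.3} controls the amount of separation between the negative and positive individual biomarker distributions. Normal distributions were chosen for $f_{\widetilde{C}^-}$ and $f_{\widetilde{C}^+}$ so that the corresponding pool biomarker distributions are available in closed form. While normality is often assumed in the biomarker literature for this reason, we show in Appendix A of the supplementary material (available at Biostatistics online) that our main findings herein are unaffected when the assumed distributions for $f_{\widetilde{C}^-}$ and $f_{\widetilde{C}^+}$ are skewed. To allow for assay-specific error, we specify the distribution of the measured concentration, conditional on the true biomarker concentration, to be normal and centered at the true concentration. Under these assumptions, we calculate

$$S_e(c,k) = 1-\Phi\left\{\frac{t_0-\mu_k(c)}{\sigma_k(c)}\right\} \quad\text{and}\quad S_p(c) = \Phi\left\{\frac{t_0-\mu_0(c)}{\sigma_0(c)}\right\},$$

where Φ(⋅) is the standard normal cumulative distribution function, μk(c)=c−1{(c−k)(0.1)+k}, and $\sigma_k^2(c)$ is the variance of the measured pool concentration given that the pool contains k positive individuals, for k=0,1,…,c.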
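The closed-form normal calculation can be sketched numerically to show how sensitivity falls as one positive specimen is diluted by more negatives. The means 0.1 and 1 come from the formula for μk(c); the three variances below are hypothetical placeholders, not the values used in the simulation study, and the function names are illustrative.

```python
from statistics import NormalDist

MU_NEG, MU_POS = 0.1, 1.0   # individual means implied by mu_k(c) in the text

def prob_pool_positive(c, k, t0, var_neg, var_pos, var_err):
    """pr(measured pool concentration > t0 | k positives among c specimens)."""
    mean = ((c - k) * MU_NEG + k * MU_POS) / c
    var = ((c - k) * var_neg + k * var_pos) / c ** 2 + var_err
    return 1.0 - NormalDist(mean, var ** 0.5).cdf(t0)

def sensitivity(c, k, t0, var_neg, var_pos, var_err):
    """Se(c, k): a pool with k >= 1 positives tests positive."""
    return prob_pool_positive(c, k, t0, var_neg, var_pos, var_err)

def specificity(c, t0, var_neg, var_pos, var_err):
    """Sp(c): a pool with no positives tests negative."""
    return 1.0 - prob_pool_positive(c, 0, t0, var_neg, var_pos, var_err)

# Hypothetical variances (assumed for illustration only)
V = dict(var_neg=0.03 ** 2, var_pos=0.2 ** 2, var_err=0.01 ** 2)
for c in (1, 5, 10, 15):
    print(c, round(sensitivity(c, 1, 0.2, **V), 4), round(specificity(c, 0.2, **V), 4))
```

With the threshold held at 0.2, the mean of a pool containing one positive drops below the threshold once c reaches 10, so Se(c,1) falls below one half under these assumed variances.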
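To simulate the pool response Zj, we first generate the true individual concentrations $\widetilde{C}_{ij}$ from $f_{\widetilde{C}^+}$ if $\widetilde{Y}_{ij}=1$ and from $f_{\widetilde{C}^-}$ if $\widetilde{Y}_{ij}=0$, for i=1,2,…,c. The measured concentration of the jth pool is then calculated as $\zeta_j=c^{-1}\sum_{i=1}^{c}\widetilde{C}_{ij}+\epsilon_j$, where the measurement error ϵj is generated independently from the measurement error distribution specified above. Finally, the observed response Zj is recorded as $Z_j=I(\zeta_j>t_0)$, where t0=0.2 for each pool size. This value of t0 provides nearly perfect discrimination between the positive and negative biomarker distributions (regardless of τ) for individual testing. For each (c,τ) combination and for each model, this entire procedure is repeated B=500 times.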
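The data-generating steps for a single pool can be sketched as below. The means 0.1 and 1 and the threshold 0.2 follow the text; the three standard deviations are hypothetical defaults chosen for illustration, and the function name is not from the paper.

```python
import random

def simulate_pool(p, t0, rng, sd_neg=0.03, sd_pos=0.2, sd_err=0.01):
    """Return (Z_j, true statuses) for one pool with infection probabilities p."""
    # Step 1: true individual statuses
    statuses = [1 if rng.random() < p_i else 0 for p_i in p]
    # Step 2: true concentrations (mean 1 for positives, 0.1 for negatives)
    conc = [rng.gauss(1.0, sd_pos) if y == 1 else rng.gauss(0.1, sd_neg)
            for y in statuses]
    # Step 3: pool concentration is the average, measured with additive error
    measured = sum(conc) / len(conc) + rng.gauss(0.0, sd_err)
    # Step 4: dichotomize at the threshold
    return (1 if measured > t0 else 0), statuses

rng = random.Random(2013)
z, statuses = simulate_pool([0.08] * 10, t0=0.2, rng=rng)
print(z, statuses)
```

Returning the true statuses alongside Zj makes it easy to tabulate how often truly positive pools are missed at each pool size.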
4.2. Estimating $f_{\widetilde{C}^+}$ and $f_{\widetilde{C}^-}$ with training data
The exact form of $f_{\widetilde{C}^+}$ and $f_{\widetilde{C}^-}$ may be unknown in some applications. However, if individual biomarker training data are available, there is nothing to prevent one from first estimating $f_{\widetilde{C}^+}$ and $f_{\widetilde{C}^-}$ and then implementing our methods using these estimates. We therefore assess the impact of estimating $f_{\widetilde{C}^+}$ and $f_{\widetilde{C}^-}$ in our simulations, both parametrically and non-parametrically. To create a training data set, we first generate true (unobserved) individual biomarker concentrations $\widetilde{C}_t^+\sim f_{\widetilde{C}^+}$, for t=1,2,…,nt, where nt=100. This is then repeated for $\widetilde{C}_t^-\sim f_{\widetilde{C}^-}$, also with nt=100. Such training data set sizes are not unrealistic; for example, Wein and Zenios (1996) cite infectious disease examples where thousands of individual training observations are available. We calculate the measured concentrations (i.e. the training data) according to $\zeta_t^+=\widetilde{C}_t^++\epsilon_t^+$ and $\zeta_t^-=\widetilde{C}_t^-+\epsilon_t^-$, t=1,2,…,nt, where the independent measurement errors $\epsilon_t^+$ and $\epsilon_t^-$ follow the same distribution as the pool measurement errors in Section 4.1; that is, we continue to assume that the conditional distribution of the measured concentration, given the true biomarker concentration, is known.
Under the normality assumptions outlined in the last paragraph, it is straightforward to calculate estimates of Se(c,k) and Sp(c) in Section 3.2 based on the two samples of training data $\{\zeta_t^+\}_{t=1}^{n_t}$ and $\{\zeta_t^-\}_{t=1}^{n_t}$; denote these estimates by $\widehat{S}_e(c,k)$ and $\widehat{S}_p(c)$, respectively. To estimate the biomarker distributions non-parametrically with the training data, we implement a Fourier transform deconvolution kernel method for additive measurement error models using the decon package in R; see Wang and Wang (2011). With the estimated distributions, say $\widehat{f}_{\widetilde{C}^+}$ and $\widehat{f}_{\widetilde{C}^-}$, we then use Monte Carlo integration to calculate the corresponding non-parametric estimates of Se(c,k) and Sp(c). Complete details are given in Appendix B of the supplementary material (available at Biostatistics online). Note that in the simulation results which follow, we have used a new training data set each time a regression model is estimated.
4.3. Simulation results
Figure 1 displays boxplots of the B=500 MLEs from individual testing (I), the approach in Vansteelandt and others (2000) with constant Se and Sp across pools (C), and our dilution approach for Model 2. In this figure, we allow the biomarker distributions $f_{\widetilde{C}^+}$ and $f_{\widetilde{C}^-}$ to be known (DT), estimated parametrically (DP), and estimated non-parametrically (DN). Complete simulation results are shown in Web Appendix C of the supplementary material (available at Biostatistics online), including numerical summaries of the estimates for all models. To fit the individual data and constant Se/Sp models, we used Se=Se(1,1) and Sp=Sp(1) under the assumption that $f_{\widetilde{C}^+}$ and $f_{\widetilde{C}^-}$ are known; these are the values of sensitivity and specificity for individual testing, respectively.
Fig. 1.
Simulation study for Model 2. Boxplots of maximum likelihood estimates for B=500 data sets at each configuration of c and τ. True parameter values are shown using horizontal dotted lines. On the horizontal axes in each subfigure, “I” refers to individual testing, “C” refers to the constant Se/Sp model, “DT” refers to the dilution model when individual biomarker distributions are known, and “DP” (“DN”) refers to the dilution model when individual biomarker distributions are estimated parametrically (non-parametrically) as described in Section 4.2. Pool sizes c are within the parentheses.
For smaller pool sizes (e.g. c=5) and even when the threshold t0 is unchanged from that under individual testing, Figure 1 shows that ignoring a dilution effect has only a minimal impact on average; both the constant Se/Sp and dilution model estimates are generally on target and have about the same amount of variability. However, for larger pool sizes (e.g. c=10, 15), the bias associated with the constant Se/Sp model can be substantial, while the corresponding dilution estimates remain on target, regardless of whether or not individual biomarker distributions $f_{\widetilde{C}^+}$ and $f_{\widetilde{C}^-}$ are estimated first with training data. It is not surprising that the variability in the regression estimates increases as the pool size c does; individual parameter estimates are based on N=1500 responses, whereas group testing estimates (constant and dilution) are based on J=N/c responses. The amount of separation between the positive and negative true biomarker distributions, as measured by τ, has little effect on the dilution model regression estimates.
Figure 2 shows the estimated regression functions for Models 1 and 2, averaged over the B=500 data sets, when τ=0.1. The main message taken from Figure 2 is the same; namely, both the constant Se/Sp and dilution approaches accurately estimate the true regression function for smaller pool sizes. However, it is surprising to see how inaccurate the constant Se/Sp regression can be when c=10 and especially when c=15 for Model 2; it is evident that covariate-specific probabilities of infection can be drastically underestimated when the dilution effect is ignored. This inaccuracy also manifests itself in the estimated Wald coverage probabilities for the regression parameters, shown in the supplementary material (available at Biostatistics online). For all models considered, the constant Se/Sp model's estimated coverage probability for β0 is incongruously low when c=10 and 15. On the other hand, confidence intervals calculated from the dilution model estimates maintain the nominal level regardless of the pool size used. Finally, one notes in Figure 2 that there is little difference on average between the dilution model estimates when $f_{\widetilde{C}^+}$ and $f_{\widetilde{C}^-}$ are known and those when $f_{\widetilde{C}^+}$ and $f_{\widetilde{C}^-}$ are estimated; in fact, it is very difficult to distinguish these estimates from each other in the figure.
Fig. 2.
Simulation study for Model 1 (left) and Model 2 (right). Estimated regression functions averaged over B=500 data sets for different pool sizes c and τ=0.1. Dilution model results include those cases where individual biomarker distributions are known (T), estimated parametrically (P), and estimated non-parametrically (N). The true regression function and the constant Se/Sp model results are also shown.
5. Irish HBV data
We apply our group testing regression methods to an HBV data set from Ireland. The data were collected as part of a public health study to estimate the prevalence and to identify risk factors of HBV infection among Irish prisoners; for further discussion, see Allwright and others (2000). The data provided to us by Dr Allwright consist of OD readings from a Murex ICE enzyme immunoassay along with a confirmed diagnosis for each individual. Additionally, the data set contains covariate information obtained via voluntary questionnaire, including age, drug use, and sexual practices for each participating individual. We illustrate our methods using the 1098 individuals for whom complete covariate and testing information is available. Note that in this section, the OD notation corresponds to the ζ notation in Section 3.
The purpose of our analysis here is to compare the performance of the constant Se/Sp and dilution modeling approaches. To do this, we assign individuals to pools at random, observe the resulting pool diagnoses, and fit both models. Because $f_{\widetilde{C}^+}$, $f_{\widetilde{C}^-}$, and $f_{\zeta\mid\widetilde{C}}$ are unknown in this application, we assume that the observed OD readings are linearly related to the true antibody concentrations and are measured without error. Under these assumptions, the OD reading for a pooled specimen is the average of the OD readings of individual specimens in the pool, that is, $\mathrm{OD}_j=c_j^{-1}\sum_{i=1}^{c_j}\mathrm{OD}_{ij}$. To choose an assay threshold t0 in our analysis (this was not provided to us), we first dichotomized all 1098 OD readings based on the true statuses to form $\{\mathrm{OD}_i^+: i=1,\ldots,N^+\}$ and $\{\mathrm{OD}_i^-: i=1,\ldots,N^-\}$ corresponding to the N+=60 HBV-positive and N−=1038 HBV-negative individuals. We then selected t0 to minimize the discrepancies between the individuals’ diagnosed statuses and their true statuses based on the OD readings; i.e.

$$t_0 = \arg\min_{t}\left\{\sum_{i=1}^{N^+} I(\mathrm{OD}_i^+\le t) + \sum_{i=1}^{N^-} I(\mathrm{OD}_i^- > t)\right\}.$$

Finally, testing responses for pools were determined using $Z_j=I(\mathrm{OD}_j>t_0)$, for each j=1,2,…,J and for each pool size. In Web Appendix D of the supplementary material (available at Biostatistics online), we display histograms of the values of OD+ and OD− observed in this study.
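Choosing a threshold that minimizes the number of misclassified individuals can be sketched with a direct search over the observed readings. The misclassification count only changes at observed values, so it suffices to evaluate it there (plus one candidate below every reading). The function name and the toy readings are hypothetical, and the minimizer is generally not unique; this sketch returns the smallest observed value attaining the minimum.

```python
def choose_threshold(od_pos, od_neg):
    """Return (t0, errors) minimizing the number of misclassified individuals."""
    values = sorted(set(od_pos) | set(od_neg))
    candidates = [values[0] - 1.0] + values   # include a threshold below all readings
    best_t, best_errors = None, None
    for t in candidates:
        missed_pos = sum(od <= t for od in od_pos)   # positives classified negative
        false_pos = sum(od > t for od in od_neg)     # negatives classified positive
        errors = missed_pos + false_pos
        if best_errors is None or errors < best_errors:
            best_t, best_errors = t, errors
    return best_t, best_errors

# Toy OD readings (made up for illustration)
od_neg = [0.05, 0.08, 0.10, 0.12, 0.30]
od_pos = [0.25, 0.90, 1.20, 1.50]
print(choose_threshold(od_pos, od_neg))
```

In the toy data one negative reading (0.30) exceeds one positive reading (0.25), so no threshold classifies everyone correctly and the minimum number of errors is one.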
To calculate pool-specific misclassification probabilities, we begin by estimating fOD+ and fOD−, the probability density functions of OD readings for positive and negative individuals, respectively. To do this, we first create a training data set by taking a simple random sample of $n_t^+=10$ HBV-positive OD readings and $n_t^-=38$ HBV-negative OD readings from the HBV-positive and HBV-negative groups, respectively. These training data set sample sizes are large enough to estimate fOD+ and fOD− under parametric assumptions and leave us with N=1050 individuals to fit both the constant Se/Sp and dilution regression models. We consider three different parametric models for OD+ and OD−; namely the gamma, Weibull, and log-normal, and we use maximum likelihood to estimate fOD+ and fOD− with the training data for each model. Under our assumptions for the observed OD readings described previously, estimates of Se(c,k) and Sp(c) are given by

$$\widehat{S}_e(c,k)=\int_{t_0}^{\infty}\widehat{f}_{\mathrm{OD}\mid k}(u)\,du \quad\text{and}\quad \widehat{S}_p(c)=\int_{-\infty}^{t_0}\widehat{f}_{\mathrm{OD}\mid 0}(u)\,du,$$

where $\widehat{f}_{\mathrm{OD}\mid k}(u)=c\,\big(\widehat{f}_{\mathrm{OD}^+}^{\,k*} * \widehat{f}_{\mathrm{OD}^-}^{\,(c-k)*}\big)(cu)$, and where $\widehat{f}_{\mathrm{OD}^+}$ ($\widehat{f}_{\mathrm{OD}^-}$) denotes the estimated OD density for HBV-positive (HBV-negative) individuals. For each of the three OD models we consider (gamma, Weibull, and log-normal), it is particularly difficult to obtain closed-form expressions of the convolutions needed to calculate $\widehat{S}_e(c,k)$ and $\widehat{S}_p(c)$. To obviate this difficulty, we use Monte Carlo simulation to approximate these two integrals.
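A Monte Carlo approximation of this kind can be sketched by repeatedly drawing individual readings from the fitted densities, averaging them within a simulated pool, and recording how often the average exceeds the threshold. The gamma parameters below are hypothetical and are not estimates from the HBV data; the function name, the threshold, and the number of draws are likewise illustrative.

```python
import random

def mc_pool_error_rates(c, t0, draw_pos, draw_neg, n_draws, rng):
    """Monte Carlo estimates of Se(c, k), k = 1..c, and Sp(c) for averaged readings."""
    exceed_frac = []
    for k in range(c + 1):
        exceed = 0
        for _ in range(n_draws):
            total = sum(draw_pos(rng) for _ in range(k))
            total += sum(draw_neg(rng) for _ in range(c - k))
            if total / c > t0:             # pool reading is the average OD
                exceed += 1
        exceed_frac.append(exceed / n_draws)
    se = {k: exceed_frac[k] for k in range(1, c + 1)}
    sp = 1.0 - exceed_frac[0]
    return se, sp

# Hypothetical fitted gamma densities, parameterized as (shape, scale)
draw_pos = lambda rng: rng.gammavariate(4.0, 0.5)
draw_neg = lambda rng: rng.gammavariate(2.0, 0.05)
se, sp = mc_pool_error_rates(c=5, t0=0.4, draw_pos=draw_pos, draw_neg=draw_neg,
                             n_draws=5000, rng=random.Random(1))
print(se, sp)
```

Swapping in Weibull or log-normal draws only requires changing the two sampling functions, which is why simulation is convenient when the convolutions have no closed form.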
To make our comparisons, we consider the first-order logistic model

$$\log\left(\frac{p_{ij}}{1-p_{ij}}\right)=\beta_0+\beta_1x_{ij}, \tag{5.1}$$

where xij denotes the age of the ith individual in the jth pool. For simplicity, we specify a common pool size c∈{2,3,5,6,7,10}; note that each of these pool sizes divides N=1050 so that there are no remainder pools. For each c, we randomly assign each individual to one of J=1050/c pools, we record the resulting pool responses Zj, and we fit both the constant Se/Sp and dilution models. The dilution model is fit using the estimates $\widehat{S}_e(c,k)$ and $\widehat{S}_p(c)$ calculated from the training data. We fit the constant Se/Sp model and also the individual data model (i.e. c=1) using $\widehat{S}_e(1,1)$ and $\widehat{S}_p(1)$. To cover a large number of possible arrangements of the individuals for the group testing models, and also to account for the variability associated with estimating Se(c,k) and Sp(c) with training data, we repeat this process B=1000 times for each pool size. A new training data set is selected each time model (5.1) is estimated.
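The random assignment of individuals to equal-sized pools and the resulting pool responses can be sketched as follows, under the stated assumption that a pool's OD reading is the average of its members' readings. The function name and the toy readings are hypothetical.

```python
import random

def pool_responses(od, c, t0, rng):
    """Randomly partition individuals into pools of size c and dichotomize pool ODs."""
    order = list(range(len(od)))
    rng.shuffle(order)                                    # random assignment
    pools = [order[s:s + c] for s in range(0, len(order), c)]
    z = [1 if sum(od[i] for i in pool) / len(pool) > t0 else 0 for pool in pools]
    return pools, z

# Toy readings: eight low (negative-like) and two high (positive-like) values
od = [0.1] * 8 + [2.0] * 2
pools, z = pool_responses(od, c=5, t0=0.3, rng=random.Random(7))
print(pools, z)
```

Calling the function repeatedly with different seeds mimics the repeated re-pooling used to average over many possible arrangements of the same individuals.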
Figure 3 shows the estimated regression functions, averaged over the B=1000 data sets, for the constant Se/Sp, dilution, and individual data models, under the gamma OD assumption. The corresponding Weibull and log-normal figures are in Web Appendix D of the supplementary material (available at Biostatistics online), along with additional figures which allow one to visually assess the variability in the estimates. In addition, Table 1 presents summary statistics for the 1000 estimates of β0 and β1 for each OD model. From the results, one first notes that estimating fOD+ and fOD− under different parametric assumptions has little impact on the resulting regression estimates. Furthermore, Figure 3 and the additional figures located in the supplementary material (available at Biostatistics online) largely reinforce the main findings from Section 4. For very small pool sizes (e.g. c=2), there are usually only minor differences between the constant Se/Sp and dilution model fits on average, and both are similar to the fit with the individual data. However, for larger pool sizes, it is clear that age-specific probabilities of HBV infection can be drastically underestimated when the dilution effect is not accounted for. Finally, when examining Figure 3, note that there were only 49 individuals (out of 1098) in the data set with ages >45. This fact, coupled with the inherent information reduction that results when constructing larger pools, likely explains the discrepancy between the individual data and the dilution model regression functions in this region.
Fig. 3.
Irish HBV data. Estimated regression functions averaged over B=1000 sets of pools for different pool sizes c. A gamma model for the OD readings is assumed. In each subfigure, the estimated individual data regression function is shown for comparison purposes. The corresponding figures for the Weibull and log-normal models are shown in the supplementary material (available at Biostatistics online).
Table 1.
Irish HBV data. Mean (standard deviation) of the B=1000 maximum likelihood estimates of β0 and β1 in model (5.1) for three OD distributions (gamma, Weibull, and log-normal) and for different pool sizes c. The individual data results (c=1) are also shown
| Distribution | c | β0: Constant | β0: Dilution | β1: Constant | β1: Dilution |
|---|---|---|---|---|---|
| Gamma | 1 | −4.59 (0.13) | — | 0.06 (0.01) | — |
| | 2 | −4.74 (0.30) | −4.65 (0.32) | 0.05 (0.01) | 0.06 (0.01) |
| | 3 | −4.75 (0.44) | −4.53 (0.48) | 0.05 (0.01) | 0.05 (0.01) |
| | 5 | −4.93 (0.79) | −4.35 (0.85) | 0.04 (0.03) | 0.04 (0.03) |
| | 6 | −5.02 (0.92) | −4.29 (1.01) | 0.04 (0.03) | 0.04 (0.03) |
| | 7 | −5.11 (1.14) | −4.18 (1.24) | 0.04 (0.04) | 0.04 (0.05) |
| | 10 | −5.33 (1.95) | −3.97 (2.02) | 0.03 (0.08) | 0.04 (0.08) |
| Weibull | 1 | −4.58 (0.13) | — | 0.06 (0.01) | — |
| | 2 | −4.74 (0.31) | −4.65 (0.32) | 0.06 (0.01) | 0.06 (0.01) |
| | 3 | −4.72 (0.45) | −4.51 (0.48) | 0.05 (0.01) | 0.05 (0.01) |
| | 5 | −4.97 (0.72) | −4.46 (0.80) | 0.04 (0.02) | 0.05 (0.03) |
| | 6 | −5.01 (0.84) | −4.33 (0.94) | 0.04 (0.03) | 0.04 (0.03) |
| | 7 | −5.13 (1.07) | −4.28 (1.18) | 0.04 (0.04) | 0.04 (0.04) |
| | 10 | −5.41 (1.92) | −4.14 (1.97) | 0.03 (0.08) | 0.04 (0.07) |
| Log-normal | 1 | −4.60 (0.14) | — | 0.06 (0.01) | — |
| | 2 | −4.75 (0.31) | −4.65 (0.33) | 0.05 (0.01) | 0.06 (0.01) |
| | 3 | −4.79 (0.46) | −4.51 (0.52) | 0.05 (0.01) | 0.05 (0.02) |
| | 5 | −4.98 (0.74) | −4.30 (0.85) | 0.04 (0.03) | 0.05 (0.03) |
| | 6 | −5.06 (0.89) | −4.21 (1.04) | 0.04 (0.03) | 0.04 (0.03) |
| | 7 | −5.06 (1.18) | −4.04 (1.32) | 0.04 (0.04) | 0.04 (0.05) |
| | 10 | −5.36 (2.17) | −3.94 (2.30) | 0.03 (0.09) | 0.04 (0.10) |
6. Discussion
We have generalized the group testing regression approach of Vansteelandt and others (2000) to account for the effect that pooling specimens can have on pool misclassification probabilities. We have shown that previously used regression methods which assume constant sensitivity and specificity can provide poor estimates when a dilution effect is present. Our approach exploits the information available from underlying individual biomarker distributions and can offer substantially improved estimates. The web site www.chrisbilder.com/grouptesting/dilution contains R functions that implement the methodology in this paper.
Unlike regression methods which assume constant sensitivity and specificity, our dilution approach requires that individual biomarker distributions (or estimates of them) be specified a priori. Such a requirement is likely not prohibitive. For example, manufacturers often undertake large preliminary studies to assess the operating characteristics of assays for truly positive and truly negative individuals (Wein and Zenios, 1996), and the individual biomarker distributions for these two groups can be estimated from these studies. To that end, an anonymous referee has suggested that an extension of this work could include reconstructing estimates of these two distributions when training data are available only on pools. This is a difficult deconvolution problem, but it has been addressed in the biomarker literature for the case wherein training pools contain all negative or all positive individuals (Vexler and others, 2008). Finally, our methodology also relies on the assumption that pool biomarker concentrations are averages of concentrations of individual specimens in the pools. We view this assumption to be reasonable as long as individual specimens being pooled are of equal size and do not interact (Zenios and Wein, 1998).
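The averaging assumption can be illustrated with a small calculation. The sketch below assumes, purely for tractability, that individual biomarker concentrations are independent and normally distributed, with one distribution for negatives and another for positives, and that a pool is declared positive when its average concentration exceeds a fixed threshold. The normal distributions, all parameter values, and the function names are illustrative assumptions and are not taken from the paper.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def pool_positive_prob_given_m(m, c, mu_neg, sd_neg, mu_pos, sd_pos, threshold):
    """P(pool average > threshold | m of the c pooled specimens are positive).

    Under the averaging assumption with independent normal concentrations,
    the pool average is normal with the mean and variance computed below.
    """
    mean = (m * mu_pos + (c - m) * mu_neg) / c
    var = (m * sd_pos ** 2 + (c - m) * sd_neg ** 2) / c ** 2
    return 1.0 - normal_cdf((threshold - mean) / math.sqrt(var))


# Hypothetical biomarker parameters and a fixed positivity threshold.
params = dict(mu_neg=0.0, sd_neg=1.0, mu_pos=6.0, sd_pos=1.0, threshold=3.0)

# Probability of detecting a pool that contains exactly one positive
# specimen, as the pool size grows and the positive signal is diluted.
for c in (1, 2, 5, 10):
    sens = pool_positive_prob_given_m(1, c, **params)
    print(f"pool size {c:2d}: P(test positive | one positive) = {sens:.6f}")
```

With the threshold held fixed, the probability of detecting a single positive specimen falls as the pool size increases, which is the dilution effect in its simplest form. In practice the threshold may be chosen differently for pools, and the biomarker distributions need not be normal.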
In light of this work, we anticipate that incorporating pool-specific error probabilities could lead to similar gains for other regression modeling techniques involving pooled response data (Chen and others, 2009; Huang, 2009; Delaigle and Meister, 2011). Additionally, it might be possible to generalize our approach to incorporate information from retesting subsets of positive pools, similarly to Xie (2001) under the constant Se/Sp assumption, although we would expect the mathematical details required for this extension to be far more formidable than those shown herein. Finally, the classification problem in group testing has generally proceeded under the assumption that assay sensitivity and specificity are constant throughout the decoding process (see, e.g., Kim and others, 2007; McMahan and others, 2012). Especially when a dilution effect is suspected, it may be worthwhile to relax this assumption and revisit the design and evaluation of classification procedures that are used in practice. The methods outlined in Section 3 of this paper could serve as a starting point towards accomplishing this.
Supplementary material
Supplementary material is available at http://biostatistics.oxfordjournals.org.
Funding
This work was supported by the National Institutes of Health (Grant R01 AI067373).
Acknowledgements
The authors thank the Editor, Associate Editor, and two anonymous referees for their comments on an earlier version of this paper. We also thank Dr Shane Allwright and her colleagues for providing us with the Irish HBV data and Dr Aiyi Liu for his helpful remarks. Conflict of Interest: None declared.
References
- Allwright S., Bradley F., Long J., Barry J., Thornton L., Parry J. Prevalence of antibodies to hepatitis B, hepatitis C, and HIV and risk factors in Irish prisoners: results of a national cross sectional survey. British Medical Journal. 2000;321:78–82. doi: 10.1136/bmj.321.7253.78.
- Busch M., Caglioti S., Robertson E., McAuley J., Tobler L., Kamel H., Linnen J., Shyamala V., Tomasulo P., Kleinman S. Screening the blood supply for West Nile Virus RNA by nucleic acid amplification testing. New England Journal of Medicine. 2005;353:460–467. doi: 10.1056/NEJMoa044029.
- Cardoso M., Koerner K., Kubanek B. Mini-pool screening by nucleic acid testing for hepatitis B virus, hepatitis C virus, and HIV: preliminary results. Transfusion. 1998;38:905–907. doi: 10.1046/j.1537-2995.1998.381098440853.x.
- Casella G., Berger R. Statistical Inference. 2nd edition. Duxbury, MA: Duxbury Press; 2002.
- Chen P., Tebbs J., Bilder C. Group testing regression models with fixed and random effects. Biometrics. 2009;65:1270–1278. doi: 10.1111/j.1541-0420.2008.01183.x.
- Currie M., McNiven M., Yee T., Schiemer U., Bowden F. Pooling of clinical specimens prior to testing for Chlamydia trachomatis by PCR is accurate and cost saving. Journal of Clinical Microbiology. 2004;42:4866–4867. doi: 10.1128/JCM.42.10.4866-4867.2004.
- Delaigle A., Hall P. Nonparametric regression with homogeneous group testing data. Annals of Statistics. 2012;40:131–158.
- Delaigle A., Meister A. Nonparametric regression analysis for group testing data. Journal of the American Statistical Association. 2011;106:640–650. doi: 10.1198/jasa.2011.tm10355.
- Dorfman R. The detection of defective members of large populations. Annals of Mathematical Statistics. 1943;14:436–440.
- Farrington C. Estimating prevalence by group testing using generalized linear models. Statistics in Medicine. 1992;11:1591–1597. doi: 10.1002/sim.4780111206.
- Gastwirth J. The efficiency of pooling in the detection of rare mutations. American Journal of Human Genetics. 2000;67:1036–1039. doi: 10.1086/303097.
- Huang X. An improved test of latent-variable model misspecification in structural measurement error models for group testing data. Statistics in Medicine. 2009;28:3316–3327. doi: 10.1002/sim.3698.
- Hung M., Swallow W. Robustness of group testing in the estimation of proportions. Biometrics. 1999;55:231–237. doi: 10.1111/j.0006-341x.1999.00231.x.
- Kim H., Hudgens M., Dreyfuss J., Westreich D., Pilcher C. Comparison of group testing algorithms for case identification in the presence of testing error. Biometrics. 2007;63:1152–1163. doi: 10.1111/j.1541-0420.2007.00817.x.
- Lewis J., Lockary V., Kobic S. Cost savings and increased efficiency using a stratified specimen pooling strategy for Chlamydia trachomatis and Neisseria gonorrhoeae. Sexually Transmitted Diseases. 2012;39:46–48. doi: 10.1097/OLQ.0b013e318231cd4a.
- Malinovsky Y., Albert P., Schisterman E. Pooling designs for outcomes under a Gaussian random effects model. Biometrics. 2012;68:45–52. doi: 10.1111/j.1541-0420.2011.01673.x.
- McMahan C., Tebbs J., Bilder C. Informative Dorfman screening. Biometrics. 2012;68:287–296. doi: 10.1111/j.1541-0420.2011.01644.x.
- Pilcher C., Fiscus S., Nguyen T., Foust E., Wolf L., Williams D., Ashby R., O’Dowd J., McPherson J., Stalzer B., Hightow L., Miller W., and others. Detection of acute infections during HIV testing in North Carolina. New England Journal of Medicine. 2005;352:1873–1883. doi: 10.1056/NEJMoa042291.
- Remlinger K., Hughes-Oliver J., Young S., Lam R. Statistical design of pools using optimal coverage and minimal collision. Technometrics. 2006;48:133–143.
- Schisterman E., Perkins N., Liu A., Bondell H. Optimal cut-point and its corresponding Youden index to discriminate individuals using pooled blood samples. Epidemiology. 2005;16:73–81. doi: 10.1097/01.ede.0000147512.81966.ba.
- Schmidt M., Roth W., Meyer H., Seifried E., Hourfar M. Nucleic acid test screening of blood donors for orthopoxviruses can potentially prevent dispersion of viral agents in case of bioterrorism. Transfusion. 2005;45:399–403. doi: 10.1111/j.1537-2995.2005.04242.x.
- Tu X., Litvak E., Pagano M. Screening tests: Can we get more by doing less? Statistics in Medicine. 1994;13:1905–1919. doi: 10.1002/sim.4780131904.
- Van T., Miller J., Warshauer D., Reisdorf E., Jerrigan D., Humes R., Shult P. Pooling nasopharyngeal/throat swab specimens to increase testing capacity for influenza viruses by PCR. Journal of Clinical Microbiology. 2012;50:891–896. doi: 10.1128/JCM.05631-11.
- Vansteelandt S., Goetghebeur E., Thomas I., Mathys E., Van Loock F. On the viral safety of plasma pools and plasma derivatives. Journal of the Royal Statistical Society: Series A. 2005;168:345–363.
- Vansteelandt S., Goetghebeur E., Verstraeten T. Regression models for disease prevalence with diagnostic tests on pools of serum samples. Biometrics. 2000;56:1126–1133. doi: 10.1111/j.0006-341x.2000.01126.x.
- Vexler A., Schisterman E., Liu A. Estimation of ROC curves based on stably distributed biomarkers subject to measurement error and pooling mixtures. Statistics in Medicine. 2008;27:280–296. doi: 10.1002/sim.3035.
- Wang Y. On the number of successes in independent trials. Statistica Sinica. 1993;3:295–312.
- Wang X., Wang B. Deconvolution estimation in measurement error models: the R package decon. Journal of Statistical Software. 2011;39:1–24.
- Wein L., Zenios S. Pooled testing for HIV screening: capturing the dilution effect. Operations Research. 1996;44:543–569.
- Xie M. Regression analysis of group testing samples. Statistics in Medicine. 2001;20:1957–1969. doi: 10.1002/sim.817.
- Zenios S., Wein L. Pooled testing for HIV prevalence estimation: exploiting the dilution effect. Statistics in Medicine. 1998;17:1447–1467. doi: 10.1002/(sici)1097-0258(19980715)17:13<1447::aid-sim862>3.0.co;2-k.