Regression models for group testing data with pool dilution effects

Christopher S McMahan; Joshua M Tebbs; Christopher R Bilder

doi:10.1093/biostatistics/kxs045

Biostatistics. 2012 Nov 28;14(2):284–298. doi: 10.1093/biostatistics/kxs045

PMCID: PMC3590921 PMID: 23197382

Abstract

Group testing is widely used to reduce the cost of screening individuals for infectious diseases. There is an extensive literature on group testing, most of which traditionally has focused on estimating the probability of infection in a homogeneous population. More recently, this research area has shifted towards estimating individual-specific probabilities in a regression context. However, existing regression approaches have assumed that the sensitivity and specificity of pooled biospecimens are constant and do not depend on the pool sizes. In applications where this assumption is not realistic, these existing approaches can lead to inaccurate inference, especially when pool sizes are large. Our new approach, which exploits the information readily available from underlying continuous biomarker distributions, provides reliable inference in settings where pooling would be most beneficial and does so even for larger pool sizes. We illustrate our methodology using hepatitis B data from a study involving Irish prisoners.

Keywords: Binary response, Biomarker, Maximum likelihood, Pooled testing, Sensitivity, Specificity

1. Introduction

Group testing involves pooling individual specimens (e.g., blood, urine, or swabs) and testing the pooled samples for the presence of infection. Since the seminal work of Dorfman (1943), group testing has been used to reduce the cost of testing individuals for a variety of sexually transmitted diseases, including HIV, hepatitis B/C, and chlamydia (Cardoso and others, 1998; Pilcher and others, 2005; Lewis and others, 2012), and for other infections such as the West Nile virus (Busch and others, 2005) and the H1N1 virus (Van and others, 2012). The use of pooling through group testing has also been exploited in other areas, including drug discovery (Remlinger and others, 2006), genetics (Gastwirth, 2000), and bioterrorism detection (Schmidt and others, 2005).

Statistical research in group testing generally splits into two areas: classification and estimation. In the classification problem, the goal is to identify all positives (e.g. HIV-infected individuals, etc.) among those individuals tested. This is accomplished by retesting individuals belonging to pools which test positive; see Kim and others (2007) for a review. In the statistics literature, estimation has received more attention, but most of this work historically has regarded the population of individuals as being homogeneous; i.e. each individual has the same probability of positivity. More recently, the estimation problem has been extended to incorporate individual covariate information in regression models with pooled binary responses. Vansteelandt and others (2000) proposed a generalized linear modeling approach using initial pool responses, generalizing the earlier work of Farrington (1992). Xie (2001) presents an expectation–maximization algorithm that flexibly accommodates different classes of regression functions and additional retest information. Chen and others (2009) and Huang (2009) extend the work of Vansteelandt and others (2000) to incorporate random effects and covariate measurement error, respectively. Delaigle and Meister (2011) and Delaigle and Hall (2012) have proposed non-parametric approaches.

Modeling group testing data has been an important advance because it allows one to assess the significance of covariate effects and/or to report subject-specific estimates of the probability of infection at a fraction of the cost of individual testing. However, previous regression work in group testing has proceeded under the simplifying assumption that testing error rates (sensitivity and specificity) for pooled specimens are constant and are functionally independent of the pool sizes. In practice, this assumption can be met with skepticism, especially when larger pool sizes are used, because positive specimens in a pool can be diluted by negative ones beyond the threshold of detection of an assay. As we demonstrate, regression methods which ignore this effect (when it is present) can lead to inaccurate estimates, potentially affecting the conclusions reached about the covariate effects and severely underestimating subject-specific probabilities of infection.

Some authors have already considered the so-called “dilution effect” in group testing estimation; unfortunately, this research has been limited to a homogeneous population. To estimate a common individual probability of infection, say p, Hung and Swallow (1999) specify that misclassification probabilities on the group scale are known functions of the pool size. In separate work, motivated by antibody testing for HIV, Wein and Zenios (1996) and Zenios and Wein (1998) proposed a parametric hierarchical model that relates the continuous response of an assay test to the (latent) antibody concentration in the pool. In this paper, we extend this hierarchical modeling framework to include individual covariate information in a regression setting. Our model formulation is general, so it is applicable with any test where a continuous biomarker can be measured, with or without error, on individual or pooled specimens. This includes antibody tests and tests based on nucleic acid amplification technology.

In Section 2, we state our model assumptions and discuss some practical issues associated with testing pooled specimens in an infectious disease context. In Section 3, we derive pool-specific misclassification probabilities (sensitivity and specificity) in terms of underlying biomarker distributions and then generalize the regression approach taken by Vansteelandt and others (2000) to include them. In Section 4, we provide simulation evidence showing that our new methods perform very well in realistic settings and that traditional regression approaches can perform poorly when the sensitivity and specificity of pooled specimens are assumed to be constant. In Section 5, we apply our regression techniques to a hepatitis B virus (HBV) data set involving Irish prisoners. In Section 6, we conclude with a summary discussion.

2. Preliminaries

2.1. Notation and assumptions

Suppose N individuals are to be tested for infection and that each individual is assigned to exactly one of J pools. Let $\tilde{Y}_{ij}=1$ if the ith individual in the jth pool is truly positive, $\tilde{Y}_{ij}=0$ otherwise, for i=1,2,…,c_j and j=1,2,…,J. We assume throughout that the $\tilde{Y}_{ij}$'s are independent Bernoulli random variables with mean p_ij. Let $\tilde{Z}_j$ denote the true binary status of the jth pool; i.e. $\tilde{Z}_j=1$ if the pool contains at least one positive individual, $\tilde{Z}_j=0$ otherwise. Allowing for the possibility of misclassification, we let Z_j=1 if the jth pool tests positive and Z_j=0 otherwise. An appeal to the Law of Total Probability shows that

$$\mathrm{pr}(Z_j=1) = S_{e_j} + (1 - S_{e_j} - S_{p_j}) \prod_{i=1}^{c_j} (1 - p_{ij}),$$

where $S_{e_j}=\mathrm{pr}(Z_j=1\mid\tilde{Z}_j=1)$ and $S_{p_j}=\mathrm{pr}(Z_j=0\mid\tilde{Z}_j=0)$ are the test sensitivity and specificity for the jth pool, respectively. Previous research in group testing regression has assumed that $S_{e_j}$ and $S_{p_j}$ do not depend on the pool size c_j and are the same for all pools. In this paper, our primary goal is to remove this assumption.

To acknowledge heterogeneity, we assume that p_ij is related to the linear predictor x′_ijβ through a monotonic, differentiable link function g(⋅), where x_ij=(1,x_ij1,…,x_ijr)′ is an (r+1)×1 vector of covariates for the ith individual in the jth pool and β=(β₀,β₁,…,β_r)′ is a vector of regression parameters; i.e. $p_{ij}=g^{-1}(x_{ij}'\beta)$. As in Vansteelandt and others (2000), we assume that the available data are {(Z_j,x′_1j,x′_2j,…,x′_{c_jj})′,j=1,2,…,J} so that the log-likelihood function, on the basis of pool responses and individual covariates, is given by

$$\ell(\beta\mid Z) = \sum_{j=1}^{J}\left\{Z_j\log\theta_j + (1-Z_j)\log(1-\theta_j)\right\}, \tag{2.1}$$

where $\theta_j = S_{e_j} + (1-S_{e_j}-S_{p_j})\prod_{i=1}^{c_j}(1-p_{ij})$ is the probability the jth pool tests positive and Z=(Z₁,Z₂,…,Z_J)′. Note that our log-likelihood in (2.1) is identical to that in Vansteelandt and others (2000) except we allow $S_{e_j}$ and $S_{p_j}$ to be possibly different for different pools.
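As a rough illustration of the pool-level likelihood in (2.1), the following Python sketch evaluates the probability that a pool tests positive and the resulting log-likelihood. The function names and the toy inputs are hypothetical and are not taken from the authors' R code; the sensitivities and specificities are simply supplied as numbers here.

```python
import math

def pool_positive_prob(p, se, sp):
    """pr(Z_j = 1) = Se_j + (1 - Se_j - Sp_j) * prod_i (1 - p_ij)."""
    prod_neg = math.prod(1.0 - pij for pij in p)
    return se + (1.0 - se - sp) * prod_neg

def log_likelihood(z, pools, se, sp):
    """Log-likelihood (2.1) from pool responses z, per-pool lists of
    individual probabilities, and pool-specific Se_j and Sp_j."""
    total = 0.0
    for zj, p, sej, spj in zip(z, pools, se, sp):
        theta = pool_positive_prob(p, sej, spj)
        total += zj * math.log(theta) + (1 - zj) * math.log(1.0 - theta)
    return total

# Hypothetical toy data: two pools of size 2 (values are illustrative only).
z = [1, 0]
pools = [[0.10, 0.20], [0.05, 0.05]]
se = [0.95, 0.95]
sp = [0.98, 0.98]
print(log_likelihood(z, pools, se, sp))
```

Passing `se` and `sp` as per-pool lists, rather than as two scalars, mirrors the point of this section: the error rates are allowed to differ from pool to pool.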

2.2. Assay thresholds

In most applications, the outcome of a diagnostic test is a binary response derived from the measured concentration of a continuous biomarker. For example, enzyme-linked immunosorbent assay is a technique that detects the concentration of antigens or antibodies in a specimen. As in Wein and Zenios (1996) and Zenios and Wein (1998), we adopt the standard convention that if a specimen's measured concentration (e.g. optical density (OD) reading, viral load, etc.) is above a predetermined threshold, say t₀, then the test classifies the specimen as positive. Otherwise, the test classifies the specimen as negative. Previous group testing regression approaches have included only the (dichotomized) binary testing responses as part of the modeling process, ignoring the information available from the underlying (continuous) biomarker distributions.

When individual specimens are pooled together, an important consideration is determining the threshold t₀ that discriminates positive pools from the negative ones. Unfortunately, there is no established consensus on how this should be done with pooled samples, perhaps because many types of assays are available and diagnostic performance varies greatly from assay to assay. Early work in group testing estimation seemingly conjectured that t₀ could be taken to be the same when testing pools as when testing individuals (Tu and others, 1994), and we have found studies in the infectious disease testing literature that have proceeded under this assumption (Currie and others, 2004). Another approach is to decrease the individual testing threshold t₀ to account for the effect of pooling; for example, Vansteelandt and others (2005) change t₀ to t₀/c when pooling c specimens. In the biomarker pooling literature, Schisterman and others (2005) describe how t₀ can be chosen to maximize Youden's index.

While certainly relevant for practical use, methods for choosing the “best” threshold t₀ in group testing applications are not a focus of this paper. For our purposes, we assume that the threshold t₀ has been determined a priori after the assay has been rigorously validated for use with pooled samples. Our regression methodology, to be described next, allows for pools of different sizes to use different thresholds. We write t₀=t₀(c_j) if this is to be emphasized.

3. Regression with pool-specific misclassification probabilities

In this section, we first obtain closed-form expressions for $S_{e_j}$ and $S_{p_j}$ on a pool-by-pool basis in terms of the relevant biomarker distributions. We then show how to incorporate these probabilities into a regression model fitting procedure using maximum likelihood.

3.1. Biomarker distributions

Let $f_{\tilde\zeta^+}$ ($f_{\tilde\zeta^-}$) denote the probability density function of the true biomarker concentration for positive (negative) individuals. We interpret these distributions as conditional distributions given an individual's true status; additionally, we assume that these distributions do not depend on individual-level covariates. To acknowledge the inherent, assay-specific error associated with measuring the true biomarker concentration level $\tilde\zeta$ of a specimen (individual or pooled), let $f_{\zeta\mid\tilde\zeta}$ denote the conditional density of the measured concentration ζ, given the true concentration $\tilde\zeta$. In this section, we assume that $f_{\tilde\zeta^+}$, $f_{\tilde\zeta^-}$, and $f_{\zeta\mid\tilde\zeta}$ are known. If the individual biomarker distributions $f_{\tilde\zeta^+}$ and $f_{\tilde\zeta^-}$ are unknown, training data can be used to estimate them under parametric or non-parametric assumptions; we illustrate this in Sections 4 and 5.

3.2. Expressions for pool-specific sensitivity and specificity

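Let $\tilde\zeta_{ij}$ denote the true biomarker concentration for the ith individual in the jth pool so that $\tilde\zeta_{ij}\mid\{\tilde{Y}_{ij}=1\}\sim f_{\tilde\zeta^+}$ and $\tilde\zeta_{ij}\mid\{\tilde{Y}_{ij}=0\}\sim f_{\tilde\zeta^-}$. For notational purposes, denote the jth pool by $\mathcal{P}_j$ and let $\tilde\zeta_j$ denote the true biomarker concentration of $\mathcal{P}_j$. Throughout this paper, we assume that $\tilde\zeta_j=c_j^{-1}\sum_{i=1}^{c_j}\tilde\zeta_{ij}$, that is, the true biomarker concentration of $\mathcal{P}_j$ is the average of the true individual biomarker concentrations. This assumption is very common in the biomarker pooling literature (Vexler and others, 2008; Malinovsky and others, 2012); see also Section 6.

Let $f_{\tilde\zeta_j\mid k}$ denote the (conditional) density of $\tilde\zeta_j$, given that $\mathcal{P}_j$ contains exactly k positive individuals, for k=0,1,…,c_j. To write an expression for $f_{\tilde\zeta_j\mid k}$ in terms of $f_{\tilde\zeta^+}$ and $f_{\tilde\zeta^-}$, we first consider the (conditional) density of the sum $\sum_{i=1}^{c_j}\tilde\zeta_{ij}$, given that $\mathcal{P}_j$ contains k positive individuals and c_j−k negative individuals; denote this density by $f_{\Sigma_j\mid k}$. It is well known that the density of the sum of independent random variables is the convolution of their respective densities; see, e.g. Casella and Berger (2002, Chapter 5). Therefore, we can write

$$f_{\Sigma_j\mid k}(u) = \left(f_{\tilde\zeta^+}^{k*} * f_{\tilde\zeta^-}^{(c_j-k)*}\right)(u),$$

where “*” denotes the usual convolution operator and f^m* is the m-fold convolution of f with itself. A linear transformation argument (mapping sums to averages) shows that $f_{\tilde\zeta_j\mid k}(u)=c_j\,f_{\Sigma_j\mid k}(c_j u)$, for k=0,1,…,c_j.

We now obtain expressions for $S_{e_j}$ and $S_{p_j}$. To account for assay measurement error, let $\zeta_j$ denote the measured biomarker concentration of $\mathcal{P}_j$ which, by assumption and conditional on $\tilde\zeta_j$, is distributed as $f_{\zeta\mid\tilde\zeta}$. Under our classification rule $Z_j=I(\zeta_j>t_0)$, the probability $\mathcal{P}_j$ tests negative (Z_j=0), given that it contains only negative individuals; i.e. the specificity for $\mathcal{P}_j$, is given by

$$S_{p_j} = S_p(c_j) = \int_{-\infty}^{t_0}\int_{-\infty}^{\infty} f_{\zeta\mid\tilde\zeta}(v\mid u)\,f_{\tilde\zeta_j\mid 0}(u)\,du\,dv.$$

To derive an expression for $S_{e_j}$, we first note that the probability $\mathcal{P}_j$ tests positive (Z_j=1), given that it contains exactly k positive individuals, is given by

$$S_e(c_j,k) = \int_{t_0}^{\infty}\int_{-\infty}^{\infty} f_{\zeta\mid\tilde\zeta}(v\mid u)\,f_{\tilde\zeta_j\mid k}(u)\,du\,dv.$$

It is possible to express $S_{e_j}$ as a linear combination of S_e(c_j,k), for k=1,2,…,c_j. To see why, note that

$$S_{e_j} = \mathrm{pr}(Z_j=1\mid\tilde{Z}_j=1) = \frac{\mathrm{pr}(Z_j=1,\tilde{Z}_j=1)}{\mathrm{pr}(\tilde{Z}_j=1)}$$

and write $\{\tilde{Z}_j=1\}=\bigcup_{k=1}^{c_j}\{\sum_{i=1}^{c_j}\tilde{Y}_{ij}=k\}$ so that

$$\mathrm{pr}(Z_j=1,\tilde{Z}_j=1) = \sum_{k=1}^{c_j} S_e(c_j,k)\,\mathrm{pr}\!\left(\sum_{i=1}^{c_j}\tilde{Y}_{ij}=k\right).$$

Therefore, the sensitivity for $\mathcal{P}_j$ can be written as $S_{e_j}=\sum_{k=1}^{c_j}\omega_{jk}\,S_e(c_j,k)$, where

$$\omega_{jk} = \frac{\mathrm{pr}\!\left(\sum_{i=1}^{c_j}\tilde{Y}_{ij}=k\right)}{1-\prod_{i=1}^{c_j}(1-p_{ij})}.$$

The probability $\mathrm{pr}(\sum_{i=1}^{c_j}\tilde{Y}_{ij}=k)$ can be calculated using the approach outlined in Wang (1993). It is insightful to note that $S_{e_j}$ depends on the pool size c_j and on the number of positive individuals in $\mathcal{P}_j$ through the infection probabilities p_ij, i=1,2,…,c_j. On the other hand, $S_{p_j}$ depends on the pool size c_j but not on the infection probabilities p_ij of those individuals in $\mathcal{P}_j$.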
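The weighting of the S_e(c_j,k) terms by the distribution of the number of positives in a pool can be sketched in Python as follows. The distribution of the number of positives among independent, non-identical Bernoulli trials is computed here by a simple dynamic-programming recursion, which is a stand-in chosen for brevity and is not claimed to be the method of Wang (1993). All function names are hypothetical.

```python
def poisson_binomial_pmf(p):
    """pmf[k] = pr(exactly k positives) for independent Bernoulli(p_i) statuses."""
    pmf = [1.0]
    for pi in p:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - pi)      # individual is negative
            nxt[k + 1] += mass * pi          # individual is positive
        pmf = nxt
    return pmf

def pool_sensitivity(p, se_ck):
    """Pool sensitivity as a weighted sum of S_e(c_j, k), k = 1..c_j.

    se_ck[k - 1] holds S_e(c_j, k); the weights are
    pr(k positives) / pr(at least one positive)."""
    pmf = poisson_binomial_pmf(p)
    denom = 1.0 - pmf[0]
    return sum(pmf[k] * se_ck[k - 1] for k in range(1, len(p) + 1)) / denom

# Hypothetical pool of size 2 with illustrative S_e(2, 1) and S_e(2, 2).
print(pool_sensitivity([0.10, 0.20], [0.5, 1.0]))
```

Because the weights depend on the p_ij, two pools of the same size can have different sensitivities, which is the property noted at the end of this subsection.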

3.3. Maximum likelihood

Estimating β via maximum likelihood proceeds by maximizing (2.1) numerically, after replacing $S_{e_j}$ and $S_{p_j}$ with the expressions in Section 3.2 and setting p_ij=g⁻¹(x_ij′β) in the resulting expression for pr(Z_j=1). Because the pool sensitivity is a function of the regression parameter β, maximizing (2.1) might seem to be formidable at first glance. However, we have had little difficulty in implementing a quasi-Newton optimization procedure in R (using the optim function); see also Section 6. The large-sample covariance matrix of the maximum-likelihood estimator (MLE) $\hat\beta$ can be estimated using the negative inverse Hessian of the log-likelihood at the last iteration of the procedure, making the construction of large-sample Wald confidence intervals possible (applicable when J is large). Our simulation results show that this matrix is estimated well and that Wald intervals attain the nominal coverage probability in realistic settings.
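To make the Wald interval construction concrete, the sketch below approximates the Hessian of a log-likelihood by central finite differences and inverts its negative to obtain standard errors. This is only an illustration in Python under simplifying assumptions: it handles two parameters, the helper names are hypothetical, and the toy log-likelihood is a stand-in quadratic, not the dilution likelihood (the paper itself uses the Hessian returned by R's optim).

```python
import math

def numerical_hessian(f, x, h=1e-4):
    """Central finite-difference approximation to the Hessian of f at x."""
    n = len(x)
    hess = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            def shifted(da, db):
                y = list(x)
                y[a] += da
                y[b] += db
                return f(y)
            hess[a][b] = (shifted(h, h) - shifted(h, -h)
                          - shifted(-h, h) + shifted(-h, -h)) / (4.0 * h * h)
    return hess

def wald_intervals_2d(loglik, beta_hat, z_crit=1.96):
    """Wald intervals for a two-parameter model from the negative inverse Hessian."""
    hess = numerical_hessian(loglik, beta_hat)
    a, b = -hess[0][0], -hess[0][1]
    c, d = -hess[1][0], -hess[1][1]
    det = a * d - b * c
    cov = [[d / det, -b / det], [-c / det, a / det]]   # inverse of the 2x2 matrix
    ses = [math.sqrt(cov[0][0]), math.sqrt(cov[1][1])]
    return [(bh - z_crit * s, bh + z_crit * s) for bh, s in zip(beta_hat, ses)]

# Stand-in quadratic log-likelihood with maximiser (1, -2); illustration only.
def toy_loglik(beta):
    return -(beta[0] - 1.0) ** 2 / (2 * 0.04) - (beta[1] + 2.0) ** 2 / (2 * 0.25)

print(wald_intervals_2d(toy_loglik, [1.0, -2.0]))
```

In practice the Hessian would be evaluated at the maximizer returned by the optimizer, and a general matrix inverse would replace the hand-coded 2×2 inverse when more covariates are present.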

4. Simulation evidence

4.1. Simulation description

We use simulation to assess the performance of our proposed regression methods. Of primary interest is to compare our approach with the approach that assumes $S_{e_j}$ and $S_{p_j}$ are constant for all pools and do not depend on pool size. We consider the following models:

Model 1: [regression function missing in source]; β=(β₀,β₁)′=(−3,2)′,
Model 2: [regression function missing in source]; β=(β₀,β₁,β₂)′=(−3,1,0.5)′,
Model 3: [regression function missing in source]; β=(β₀,β₁,β₂)′=(−3,2,1)′,

where [distribution of x_ij1 missing in source] and x_ij2∼Bernoulli(0.1). These distributions were chosen to emulate situations where pooling is sensible. The mean prevalence for the three models ranges from 8 to 10%, which is consistent with our application in Section 5. We first simulate N=1500 individual probabilities p_ij according to each model, where i=1,2,…,c and j=1,2,…,J, J=N/c, and use common pool sizes c=1,5,10,15 (c=1 corresponds to individual testing). True individual statuses $\tilde{Y}_{ij}$ are then simulated from a Bernoulli(p_ij) distribution.

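We specify $f_{\tilde\zeta^-}$ and $f_{\tilde\zeta^+}$ to be normal densities with means 0.1 and 1, respectively [variance specifications missing in source], where τ∈{0.1,0.2,0.3} controls the amount of separation between the negative and positive individual biomarker distributions. Normal distributions were chosen for $f_{\tilde\zeta^+}$ and $f_{\tilde\zeta^-}$ so that the corresponding pool biomarker distributions are available in closed form. While normality is often assumed in the biomarker literature for this reason, we show in Appendix A of the supplementary material (available at Biostatistics online) that our main findings herein are unaffected when the assumed distributions for $f_{\tilde\zeta^+}$ and $f_{\tilde\zeta^-}$ are skewed. To allow for assay-specific error, we specify the distribution of the measured concentration, conditional on the true biomarker concentration, to be normal [parameters missing in source]. Under these assumptions, we calculate

$$S_e(c,k) = 1-\Phi\!\left(\frac{t_0-\mu_k(c)}{\sigma_k(c)}\right) \quad\text{and}\quad S_p(c) = \Phi\!\left(\frac{t_0-\mu_0(c)}{\sigma_0(c)}\right),$$

where Φ(⋅) is the standard normal cumulative distribution function, μ_k(c)=c⁻¹{(c−k)(0.1)+k}, and $\sigma_k(c)$ is the standard deviation of the measured pool concentration given k positive individuals [expression missing in source], for k=0,1,…,c.

To simulate the pool response Z_j, we first generate $\tilde\zeta_{ij}$ from $f_{\tilde\zeta^+}$ or $f_{\tilde\zeta^-}$ according to the true status $\tilde{Y}_{ij}$, for i=1,2,…,c. The measured concentration of $\mathcal{P}_j$ is then calculated as $\zeta_j=c^{-1}\sum_{i=1}^{c}\tilde\zeta_{ij}+\epsilon_j$, where the measurement error ϵ_j is generated independently from the measurement error distribution specified above. Finally, the observed response Z_j is recorded as $Z_j=I(\zeta_j>t_0)$, where t₀=0.2 for each pool size. This value of t₀ provides nearly perfect discrimination between the positive and negative biomarker distributions (regardless of τ) for individual testing. For each (c,τ) combination and for each model, this entire procedure is repeated B=500 times.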
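The dilution mechanism in this design can be illustrated with a short Python sketch under normal biomarker and measurement error distributions. The threshold t₀=0.2 and the pool mean μ_k(c) follow the text; the three standard deviations below are assumed values chosen purely for illustration and are not the settings used in the paper. Function names are hypothetical.

```python
import math
import random
from statistics import NormalDist

T0 = 0.2                       # threshold from the text
MU_NEG, MU_POS = 0.1, 1.0      # means implied by mu_k(c) = {(c - k)(0.1) + k} / c
# ASSUMED standard deviations (illustration only, not the paper's values):
SD_NEG, SD_POS, SD_ERR = 0.03, 0.2, 0.03

def mu_k(c, k):
    """Mean pool concentration with k positives among c specimens."""
    return ((c - k) * MU_NEG + k * MU_POS) / c

def sd_k(c, k):
    """SD of the measured pool concentration: averaged true values plus error."""
    var_true = ((c - k) * SD_NEG ** 2 + k * SD_POS ** 2) / c ** 2
    return math.sqrt(var_true + SD_ERR ** 2)

def se_ck(c, k, t0=T0):
    """S_e(c, k): probability the pool exceeds t0 given k positives."""
    return 1.0 - NormalDist(mu_k(c, k), sd_k(c, k)).cdf(t0)

def sp_c(c, t0=T0):
    """S_p(c): probability an all-negative pool falls at or below t0."""
    return NormalDist(mu_k(c, 0), sd_k(c, 0)).cdf(t0)

def simulate_pool_response(statuses, rng, t0=T0):
    """Draw true concentrations by status, average, add error, threshold."""
    true_conc = [rng.gauss(MU_POS, SD_POS) if y else rng.gauss(MU_NEG, SD_NEG)
                 for y in statuses]
    measured = sum(true_conc) / len(true_conc) + rng.gauss(0.0, SD_ERR)
    return int(measured > t0)

for c in (1, 5, 10, 15):
    print(c, round(se_ck(c, 1), 3), round(sp_c(c), 3))
```

With a single positive in the pool, μ_1(c) is 0.28 for c=5 but only 0.19 and 0.16 for c=10 and c=15, so the pool mean drops below t₀=0.2 once c reaches 10. This is one way to see why holding the threshold fixed erodes sensitivity for the larger pools.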

4.2. Estimating $f_{\tilde\zeta^+}$ and $f_{\tilde\zeta^-}$ with training data

The exact form of $f_{\tilde\zeta^+}$ and $f_{\tilde\zeta^-}$ may be unknown in some applications. However, if individual biomarker training data are available, there is nothing to prevent one from first estimating $f_{\tilde\zeta^+}$ and $f_{\tilde\zeta^-}$ and then implementing our methods using these estimates. We therefore assess the impact of estimating $f_{\tilde\zeta^+}$ and $f_{\tilde\zeta^-}$ in our simulations, both parametrically and non-parametrically. To create a training data set, we first generate true (unobserved) individual biomarker concentrations $\tilde\zeta^+_t\sim f_{\tilde\zeta^+}$, for t=1,2,…,n_t, where n_t=100. This is then repeated for $\tilde\zeta^-_t\sim f_{\tilde\zeta^-}$, also with n_t=100. Such training data set sizes are not unrealistic; for example, Wein and Zenios (1996) cite infectious disease examples where thousands of individual training observations are available. We calculate the measured concentrations (i.e. the training data) according to $\zeta^+_t=\tilde\zeta^+_t+\epsilon^+_t$ and $\zeta^-_t=\tilde\zeta^-_t+\epsilon^-_t$, t=1,2,…,n_t, where the independent measurement errors $\epsilon^+_t$ and $\epsilon^-_t$ follow the same measurement error distribution as in Section 4.1; that is, we continue to assume that the conditional distribution of the measured concentration, given the true biomarker concentration, is known.

Under the normality assumptions outlined in the last paragraph, it is straightforward to calculate estimates of S_e(c,k) and S_p(c) in Section 3.2 based on the two samples of training data $\{\zeta^+_t\}$ and $\{\zeta^-_t\}$; denote these estimates by $\hat{S}_e(c,k)$ and $\hat{S}_p(c)$, respectively. To estimate the biomarker distributions non-parametrically with the training data, we implement a Fourier transform deconvolution kernel method for additive measurement error models using the decon package in R; see Wang and Wang (2011). With the estimated distributions, say $\hat{f}_{\tilde\zeta^+}$ and $\hat{f}_{\tilde\zeta^-}$, we then use Monte Carlo integration to calculate the estimates $\hat{S}_e(c,k)$ and $\hat{S}_p(c)$. Complete details are given in Appendix B of the supplementary material (available at Biostatistics online). Note that in the simulation results which follow, we have used a new training data set each time a regression model is estimated.

4.3. Simulation results

Figure 1 displays boxplots of the B=500 MLEs from individual testing (I), the approach in Vansteelandt and others (2000) with constant S_e and S_p across pools (C), and our dilution approach for Model 2. In this figure, we allow the biomarker distributions $f_{\tilde\zeta^+}$ and $f_{\tilde\zeta^-}$ to be known (DT), estimated parametrically (DP), and estimated non-parametrically (DN). Complete simulation results are shown in Web Appendix C of the supplementary material (available at Biostatistics online), including numerical summaries of the estimates for all models. To fit the individual data and constant S_e/S_p models, we used S_e=S_e(1,1) and S_p=S_p(1) under the assumption that $f_{\tilde\zeta^+}$ and $f_{\tilde\zeta^-}$ are known; these are the values of sensitivity and specificity for individual testing, respectively.

Fig. 1. Simulation study for Model 2. Boxplots of maximum likelihood estimates for B=500 data sets at each configuration of c and τ. True parameter values are shown using horizontal dotted lines. On the horizontal axes in each subfigure, “I” refers to individual testing, “C” refers to the constant S_e/S_p model, “DT” refers to the dilution model when individual biomarker distributions are known, and “DP” (“DN”) refers to the dilution model when individual biomarker distributions are estimated parametrically (non-parametrically) as described in Section 4.2. Pool sizes c are within the parentheses.

For smaller pool sizes (e.g. c=5) and even when the threshold t₀ is unchanged from that under individual testing, Figure 1 shows that ignoring a dilution effect has only a minimal impact on average; both the constant S_e/S_p and dilution model estimates are generally on target and have about the same amount of variability. However, for larger pool sizes (e.g. c=10, 15), the bias associated with the constant S_e/S_p model can be substantial, while the corresponding dilution estimates remain on target, regardless of whether or not individual biomarker distributions $f_{\tilde\zeta^+}$ and $f_{\tilde\zeta^-}$ are estimated first with training data. It is not surprising that the variability in the regression estimates increases as the pool size c does; individual parameter estimates are based on N=1500 responses whereas group testing estimates (constant and dilution) are based on J=N/c responses. The amount of separation between the positive and negative true biomarker distributions, as measured by τ, has little effect on the dilution model regression estimates.

Figure 2 shows the estimated regression functions for Models 1 and 2, averaged over the B=500 data sets, when τ=0.1. The main message taken from Figure 2 is the same; namely, both the constant S_e/S_p and dilution approaches accurately estimate the true regression function for smaller pool sizes. However, it is surprising to see how inaccurate the constant S_e/S_p regression can be when c=10 and especially when c=15 for Model 2; it is evident that covariate-specific probabilities of infection can be drastically underestimated when the dilution effect is ignored. This inaccuracy also manifests itself in the estimated Wald coverage probabilities for the regression parameters, shown in the supplementary material (available at Biostatistics online). For all models considered, the constant S_e/S_p model's estimated coverage probability for β₀ is incongruously low when c=10 and 15. On the other hand, confidence intervals calculated from the dilution model estimates maintain the nominal level regardless of the pool size used. Finally, one notes in Figure 2 that there is little difference on average between the dilution model estimates when $f_{\tilde\zeta^+}$ and $f_{\tilde\zeta^-}$ are known and those when $f_{\tilde\zeta^+}$ and $f_{\tilde\zeta^-}$ are estimated; in fact, it is very difficult to distinguish these estimates from each other in the figure.

Fig. 2. Simulation study for Model 1 (left) and Model 2 (right). Estimated regression functions averaged over B=500 data sets for different pool sizes c and τ=0.1. Dilution model results include those cases where individual biomarker distributions are known (T), estimated parametrically (P), and estimated non-parametrically (N). The true regression function and the constant S_e/S_p model results are also shown.

5. Irish HBV data

We apply our group testing regression methods to an HBV data set from Ireland. The data were collected as part of a public health study to estimate the prevalence and to identify risk factors of HBV infection among Irish prisoners; for further discussion, see Allwright and others (2000). The data provided to us by Dr Allwright consist of OD readings from a Murex ICE enzyme immunoassay along with a confirmed diagnosis for each individual. Additionally, the data set contains covariate information obtained via voluntary questionnaire, including age, drug use, and sexual practices for each participating individual. We illustrate our methods using the 1098 individuals for whom complete covariate and testing information is available. Note that in this section, the OD notation corresponds to the ζ notation in Section 3.

The purpose of our analysis here is to compare the performance of the constant S_e/S_p and dilution modeling approaches. To do this, we assign individuals to pools at random, observe the resulting pool diagnoses, and fit both models. Because $f_{\tilde\zeta^+}$, $f_{\tilde\zeta^-}$, and $f_{\zeta\mid\tilde\zeta}$ are unknown in this application, we assume that the observed OD readings are linearly related to the true antibody concentrations and are measured without error. Under these assumptions, the OD reading for a pooled specimen is the average of the OD readings of individual specimens in the pool, that is, $OD_j=c_j^{-1}\sum_{i=1}^{c_j}OD_{ij}$. To choose an assay threshold t₀ in our analysis (this was not provided to us), we first dichotomized all 1098 OD readings based on the true statuses to form $\{OD^+_i\}_{i=1}^{N^+}$ and $\{OD^-_i\}_{i=1}^{N^-}$ corresponding to the N⁺=60 HBV-positive and N⁻=1038 HBV-negative individuals. We then selected t₀ to minimize the discrepancies between the individuals’ diagnosed statuses and their true statuses based on the OD readings; i.e.

$$t_0 = \arg\min_{t}\left\{\sum_{i=1}^{N^+} I(OD^+_i\le t) + \sum_{i=1}^{N^-} I(OD^-_i> t)\right\}.$$

Finally, testing responses for pools were determined using $Z_j=I(OD_j>t_0)$, for each j=1,2,…,J and for each pool size. In Web Appendix D of the supplementary material (available at Biostatistics online), we display histograms of the values of OD⁺ and OD⁻ observed in this study.
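One plausible reading of "minimize the discrepancies" is to pick the cutoff that misclassifies the fewest individuals when a specimen is called positive if its OD reading exceeds the cutoff. The Python sketch below implements that reading; the function name and the toy OD values are hypothetical, and the tie-breaking rule (smallest qualifying cutoff) is an arbitrary choice.

```python
def choose_threshold(od_pos, od_neg):
    """Return (t, errors): a cutoff minimising the number of misclassified
    individuals, where an individual is called positive when OD > t."""
    # The error count only changes at observed readings, so it suffices to
    # check those values plus a cutoff below every reading.
    candidates = [float("-inf")] + sorted(set(od_pos) | set(od_neg))
    best_t, best_err = None, None
    for t in candidates:
        false_neg = sum(od <= t for od in od_pos)
        false_pos = sum(od > t for od in od_neg)
        errors = false_neg + false_pos
        if best_err is None or errors < best_err:
            best_t, best_err = t, errors
    return best_t, best_err

# Hypothetical OD readings (not the Irish HBV data).
print(choose_threshold([0.9, 1.2, 0.4], [0.1, 0.2, 0.5]))
```

Any cutoff between two adjacent observed readings gives the same error count, so the returned value is one representative of an interval of equally good thresholds.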

To calculate pool-specific misclassification probabilities, we begin by estimating f_OD⁺ and f_OD⁻, the probability density functions of OD readings for positive and negative individuals, respectively. To do this, we first create a training data set by taking a simple random sample of n⁺_t=10 HBV-positive OD readings and n⁻_t=38 HBV-negative OD readings from $\{OD^+_i\}_{i=1}^{N^+}$ and $\{OD^-_i\}_{i=1}^{N^-}$, respectively. These training data set sample sizes are large enough to estimate f_OD⁺ and f_OD⁻ under parametric assumptions and leave us with N=1050 individuals to fit both the constant S_e/S_p and dilution regression models. We consider three different parametric models for OD⁺ and OD⁻; namely the gamma, Weibull, and log-normal, and we use maximum likelihood to estimate f_OD⁺ and f_OD⁻ with the training data for each model. Under our assumptions for the observed OD readings described previously, estimates of S_e(c,k) and S_p(c) are given by

$$\hat{S}_e(c,k) = \int_{t_0}^{\infty}\hat{f}_{OD\mid k}(u)\,du \quad\text{and}\quad \hat{S}_p(c) = \int_{-\infty}^{t_0}\hat{f}_{OD\mid 0}(u)\,du,$$

where $\hat{f}_{OD\mid k}(u)=c\,\big(\hat{f}_{OD^+}^{\,k*} * \hat{f}_{OD^-}^{\,(c-k)*}\big)(cu)$, and where $\hat{f}_{OD^+}$ ($\hat{f}_{OD^-}$) denotes the estimated OD density for HBV-positive (HBV-negative) individuals. For each of the three OD models we consider (gamma, Weibull, and log-normal), it is particularly difficult to obtain closed-form expressions of the convolutions needed to calculate $\hat{S}_e(c,k)$ and $\hat{S}_p(c)$. To obviate this difficulty, we use Monte Carlo simulation to approximate these two integrals.
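A Monte Carlo approximation of this kind can be sketched in Python by repeatedly drawing k positive and c−k negative readings, averaging them, and recording how often the average exceeds t₀. The gamma shape and scale values and the threshold below are invented for illustration and are not estimates from the HBV data; the function name is hypothetical.

```python
import random

def mc_prob_positive(c, k, t0, draw_pos, draw_neg, n_sim=20000, rng=None):
    """Monte Carlo estimate of pr(pooled OD > t0 | k positives among c)."""
    rng = rng or random.Random()
    hits = 0
    for _ in range(n_sim):
        total = sum(draw_pos(rng) for _ in range(k))
        total += sum(draw_neg(rng) for _ in range(c - k))
        hits += (total / c) > t0          # pooled reading is the average
    return hits / n_sim

# Hypothetical fitted gamma densities (shape, scale); illustration only.
draw_pos = lambda rng: rng.gammavariate(4.0, 0.5)    # mean 2.0
draw_neg = lambda rng: rng.gammavariate(2.0, 0.05)   # mean 0.1
t0 = 0.4                                             # assumed threshold

rng = random.Random(1)
se_hat = mc_prob_positive(5, 1, t0, draw_pos, draw_neg, rng=rng)        # S_e(5, 1)
sp_hat = 1.0 - mc_prob_positive(5, 0, t0, draw_pos, draw_neg, rng=rng)  # S_p(5)
print(se_hat, sp_hat)
```

Swapping `gammavariate` for `weibullvariate` or `lognormvariate` covers the other two parametric families without changing the rest of the routine, which is the practical appeal of simulation over closed-form convolution here.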

To make our comparisons, we consider the first-order logistic model

$$\log\left(\frac{p_{ij}}{1-p_{ij}}\right) = \beta_0 + \beta_1 x_{ij}, \tag{5.1}$$

where x_ij denotes the age of the ith individual in the jth pool. For simplicity, we specify a common pool size c∈{2,3,5,6,7,10}; note that each of these pool sizes divides N=1050 so that there are no remainder pools. For each c, we randomly assign each individual to one of J=1050/c pools, we record the resulting pool responses Z_j, and we fit both the constant S_e/S_p and dilution models. The dilution model is fit using the estimates $\hat{S}_e(c,k)$ and $\hat{S}_p(c)$ calculated from the training data. We fit the constant S_e/S_p model and also the individual data model (i.e. c=1) using $\hat{S}_e(1,1)$ and $\hat{S}_p(1)$. To cover a large number of possible arrangements of the individuals for the group testing models, and also to account for the variability associated with estimating S_e(c,k) and S_p(c) with training data, we repeat this process B=1000 times for each pool size. A new training data set is selected each time model (5.1) is estimated.
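The random pooling step described above can be sketched in Python as follows: individuals are shuffled, split into pools of a common size, and each pool's response is obtained by thresholding the average OD reading. The function name and toy readings are hypothetical, and the sketch assumes the number of individuals is a multiple of c, as in the analysis.

```python
import random

def make_pool_responses(od, c, t0, rng):
    """Randomly split individuals into pools of size c and return
    (pools, z): index lists and the thresholded pool responses."""
    idx = list(range(len(od)))
    rng.shuffle(idx)
    pools = [idx[s:s + c] for s in range(0, len(idx), c)]
    z = [int(sum(od[i] for i in pool) / len(pool) > t0) for pool in pools]
    return pools, z

# Hypothetical readings: nine low values and one high value.
od = [0.1] * 9 + [5.0]
pools, z = make_pool_responses(od, c=5, t0=0.4, rng=random.Random(0))
print(len(pools), z)
```

Repeating this call with fresh random states gives the B different pool arrangements over which the regression estimates are averaged.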

Figure 3 shows the estimated regression functions, averaged over the B=1000 data sets, for the constant S_e/S_p, dilution, and individual data models, under the gamma OD assumption. The corresponding Weibull and log-normal figures are in Web Appendix D of the supplementary material (available at Biostatistics online), along with additional figures which allow one to visually assess the variability in the estimates. In addition, Table 1 presents summary statistics for the 1000 estimates of β₀ and β₁ for each OD model. From the results, one first notes that estimating f_OD⁺ and f_OD⁻ under different parametric assumptions has little impact on the resulting regression estimates. Furthermore, Figure 3 and the additional figures located in the supplementary material (available at Biostatistics online) largely reinforce the main findings from Section 4. For very small pool sizes (e.g. c=2), there are usually only minor differences between the constant S_e/S_p and dilution model fits on average, and both are similar to the fit with the individual data. However, for larger pool sizes, it is clear that age-specific probabilities of HBV infection can be drastically underestimated when the dilution effect is not accounted for. Finally, when examining Figure 3, note that there were only 49 individuals (out of 1098) in the data set with ages >45. This fact, coupled with the inherent information reduction that results when constructing larger pools, likely explains the discrepancy between the individual data and the dilution model regression functions in this region.

Fig. 3. Irish HBV data. Estimated regression functions averaged over B=1000 sets of pools for different pool sizes c. A gamma model for the OD readings is assumed. In each subfigure, the estimated individual data regression function is shown for comparison purposes. The corresponding figures for the Weibull and log-normal models are shown in the supplementary material (available at Biostatistics online).

Table 1. Irish HBV data. Mean (standard deviation) of the B=1000 maximum likelihood estimates of β₀ and β₁ in model (5.1) for three OD distributions (gamma, Weibull, and log-normal) and for different pool sizes c. The individual data results (c=1) are also shown.

Distribution	c	β₀ (Constant)	β₀ (Dilution)	β₁ (Constant)	β₁ (Dilution)
Gamma	1	−4.59 (0.13)	—	0.06 (0.01)	—
	2	−4.74 (0.30)	−4.65 (0.32)	0.05 (0.01)	0.06 (0.01)
	3	−4.75 (0.44)	−4.53 (0.48)	0.05 (0.01)	0.05 (0.01)
	5	−4.93 (0.79)	−4.35 (0.85)	0.04 (0.03)	0.04 (0.03)
	6	−5.02 (0.92)	−4.29 (1.01)	0.04 (0.03)	0.04 (0.03)
	7	−5.11 (1.14)	−4.18 (1.24)	0.04 (0.04)	0.04 (0.05)
	10	−5.33 (1.95)	−3.97 (2.02)	0.03 (0.08)	0.04 (0.08)
Weibull	1	−4.58 (0.13)	—	0.06 (0.01)	—
	2	−4.74 (0.31)	−4.65 (0.32)	0.06 (0.01)	0.06 (0.01)
	3	−4.72 (0.45)	−4.51 (0.48)	0.05 (0.01)	0.05 (0.01)
	5	−4.97 (0.72)	−4.46 (0.80)	0.04 (0.02)	0.05 (0.03)
	6	−5.01 (0.84)	−4.33 (0.94)	0.04 (0.03)	0.04 (0.03)
	7	−5.13 (1.07)	−4.28 (1.18)	0.04 (0.04)	0.04 (0.04)
	10	−5.41 (1.92)	−4.14 (1.97)	0.03 (0.08)	0.04 (0.07)
Log-normal	1	−4.60 (0.14)	—	0.06 (0.01)	—
	2	−4.75 (0.31)	−4.65 (0.33)	0.05 (0.01)	0.06 (0.01)
	3	−4.79 (0.46)	−4.51 (0.52)	0.05 (0.01)	0.05 (0.02)
	5	−4.98 (0.74)	−4.30 (0.85)	0.04 (0.03)	0.05 (0.03)
	6	−5.06 (0.89)	−4.21 (1.04)	0.04 (0.03)	0.04 (0.03)
	7	−5.06 (1.18)	−4.04 (1.32)	0.04 (0.04)	0.04 (0.05)
	10	−5.36 (2.17)	−3.94 (2.30)	0.03 (0.09)	0.04 (0.10)


6. Discussion

We have generalized the group testing regression approach of Vansteelandt and others (2000) to account for the effect that pooling specimens can have on pool misclassification probabilities. We have shown that previously used regression methods which assume constant sensitivity and specificity can provide poor estimates when a dilution effect is present. Our approach exploits the information available from underlying individual biomarker distributions and can offer substantially improved estimates. The web site www.chrisbilder.com/grouptesting/dilution contains R functions that implement the methodology in this paper.

Unlike regression methods that assume constant sensitivity and specificity, our dilution approach requires that individual biomarker distributions (or estimates of them) be specified a priori. Such a requirement is likely not prohibitive. For example, manufacturers often undertake large preliminary studies to assess the operating characteristics of assays for truly positive and truly negative individuals (Wein and Zenios, 1996), and the individual biomarker distributions for these two groups can be estimated from such studies. To that end, an anonymous referee has suggested that an extension of this work could include reconstructing estimates of these distributions when training data are available only on pools. This is a difficult deconvolution problem, but it has been addressed in the biomarker literature for the case wherein training pools contain all negative or all positive individuals (Vexler and others, 2008). Finally, our methodology also relies on the assumption that pool biomarker concentrations are averages of the concentrations of the individual specimens in the pools. We view this assumption as reasonable as long as the individual specimens being pooled are of equal size and do not interact (Zenios and Wein, 1998).
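The averaging assumption can be illustrated with a small Monte Carlo sketch in Python. Everything numeric here is hypothetical: the gamma shapes and scales, the threshold of 4.0, and the function names are chosen only for illustration and are not taken from the paper or the Irish HBV data. The sketch estimates the probability that a pool tests positive when its biomarker level is the average of its members' levels.

```python
import random

# Hypothetical biomarker distributions (illustrative values only):
# truly negative individuals ~ Gamma(shape=2, scale=0.5), mean 1
# truly positive individuals ~ Gamma(shape=20, scale=0.5), mean 10
NEG_SHAPE, NEG_SCALE = 2.0, 0.5
POS_SHAPE, POS_SCALE = 20.0, 0.5


def prob_pool_tests_positive(c, n_pos, threshold, n_sims=20000, seed=1):
    """Monte Carlo estimate of P(pool average > threshold).

    c         : pool size
    n_pos     : number of truly positive specimens in the pool
    threshold : assay cutoff applied to the pooled biomarker level
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        total = sum(rng.gammavariate(POS_SHAPE, POS_SCALE) for _ in range(n_pos))
        total += sum(rng.gammavariate(NEG_SHAPE, NEG_SCALE) for _ in range(c - n_pos))
        # Averaging assumption: pool level is the mean of individual levels.
        if total / c > threshold:
            hits += 1
    return hits / n_sims


if __name__ == "__main__":
    for c in (1, 2, 3, 5, 10):
        # Sensitivity for a pool containing exactly one positive specimen.
        se = prob_pool_tests_positive(c, n_pos=1, threshold=4.0)
        # Specificity for a pool containing no positive specimens.
        sp = 1.0 - prob_pool_tests_positive(c, n_pos=0, threshold=4.0)
        print(c, round(se, 3), round(sp, 3))
```

Under these illustrative settings, the estimated sensitivity for a pool with a single positive specimen falls as the pool size grows, because the positive specimen's signal is averaged with more negative specimens while the threshold stays fixed.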

In light of this work, we anticipate that incorporating pool-specific error probabilities could lead to similar gains for other regression modeling techniques involving pooled response data (Chen and others, 2009; Huang, 2009; Delaigle and Meister, 2011). Additionally, it might be possible to generalize our approach to incorporate information from retesting subsets of positive pools, similarly to Xie (2001) under the constant S_e/S_p assumption, although we would expect the mathematical details required for this extension to be far more formidable than those shown herein. Finally, the classification problem in group testing has generally proceeded under the assumption that assay sensitivity and specificity are constant throughout the decoding process (see, e.g., Kim and others, 2007; McMahan and others, 2012). Especially when a dilution effect is suspected, it may be worthwhile to relax this assumption and revisit the design and evaluation of classification procedures that are used in practice. The methods outlined in Section 3 of this paper could serve as a starting point towards accomplishing this.

Supplementary material

Supplementary material is available at http://biostatistics.oxfordjournals.org.

Funding

This work was supported by the National Institutes of Health (Grant R01 AI067373).


Acknowledgements

The authors thank the Editor, Associate Editor, and two anonymous referees for their comments on an earlier version of this paper. We also thank Dr Shane Allwright and her colleagues for providing us with the Irish HBV data and Dr Aiyi Liu for his helpful remarks. Conflict of Interest: None declared.

References

Allwright S., Bradley F., Long J., Barry J., Thornton L., Parry J. Prevalence of antibodies to hepatitis B, hepatitis C, and HIV and risk factors in Irish prisoners: results of a national cross sectional survey. British Medical Journal. 2000;321:78–82. doi: 10.1136/bmj.321.7253.78.
Busch M., Caglioti S., Robertson E., McAuley J., Tobler L., Kamel H., Linnen J., Shyamala V., Tomasulo P., Kleinman S. Screening the blood supply for West Nile Virus RNA by nucleic acid amplification testing. New England Journal of Medicine. 2005;353:460–467. doi: 10.1056/NEJMoa044029.
Cardoso M., Koerner K., Kubanek B. Mini-pool screening by nucleic acid testing for hepatitis B virus, hepatitis C virus, and HIV: preliminary results. Transfusion. 1998;38:905–907. doi: 10.1046/j.1537-2995.1998.381098440853.x.
Casella G., Berger R. Statistical Inference. 2nd edition. Duxbury, MA: Duxbury Press; 2002.
Chen P., Tebbs J., Bilder C. Group testing regression models with fixed and random effects. Biometrics. 2009;65:1270–1278. doi: 10.1111/j.1541-0420.2008.01183.x.
Currie M., McNiven M., Yee T., Schiemer U., Bowden F. Pooling of clinical specimens prior to testing for Chlamydia trachomatis by PCR is accurate and cost saving. Journal of Clinical Microbiology. 2004;42:4866–4867. doi: 10.1128/JCM.42.10.4866-4867.2004.
Delaigle A., Hall P. Nonparametric regression with homogeneous group testing data. Annals of Statistics. 2012;40:131–158.
Delaigle A., Meister A. Nonparametric regression analysis for group testing data. Journal of the American Statistical Association. 2011;106:640–650. doi: 10.1198/jasa.2011.tm10355.
Dorfman R. The detection of defective members of large populations. Annals of Mathematical Statistics. 1943;14:436–440.
Farrington C. Estimating prevalence by group testing using generalized linear models. Statistics in Medicine. 1992;11:1591–1597. doi: 10.1002/sim.4780111206.
Gastwirth J. The efficiency of pooling in the detection of rare mutations. American Journal of Human Genetics. 2000;67:1036–1039. doi: 10.1086/303097.
Huang X. An improved test of latent-variable model misspecification in structural measurement error models for group testing data. Statistics in Medicine. 2009;28:3316–3327. doi: 10.1002/sim.3698.
Hung M., Swallow W. Robustness of group testing in the estimation of proportions. Biometrics. 1999;55:231–237. doi: 10.1111/j.0006-341x.1999.00231.x.
Kim H., Hudgens M., Dreyfuss J., Westreich D., Pilcher C. Comparison of group testing algorithms for case identification in the presence of testing error. Biometrics. 2007;63:1152–1163. doi: 10.1111/j.1541-0420.2007.00817.x.
Lewis J., Lockary V., Kobic S. Cost savings and increased efficiency using a stratified specimen pooling strategy for Chlamydia trachomatis and Neisseria gonorrhoeae. Sexually Transmitted Diseases. 2012;39:46–48. doi: 10.1097/OLQ.0b013e318231cd4a.
Malinovsky Y., Albert P., Schisterman E. Pooling designs for outcomes under a Gaussian random effects model. Biometrics. 2012;68:45–52. doi: 10.1111/j.1541-0420.2011.01673.x.
McMahan C., Tebbs J., Bilder C. Informative Dorfman screening. Biometrics. 2012;68:287–296. doi: 10.1111/j.1541-0420.2011.01644.x.
Pilcher C., Fiscus S., Nguyen T., Foust E., Wolf L., Williams D., Ashby R., O’Dowd J., McPherson J., Stalzer B., Hightow L., Miller W., and others. Detection of acute infections during HIV testing in North Carolina. New England Journal of Medicine. 2005;352:1873–1883. doi: 10.1056/NEJMoa042291.
Remlinger K., Hughes-Oliver J., Young S., Lam R. Statistical design of pools using optimal coverage and minimal collision. Technometrics. 2006;48:133–143.
Schisterman E., Perkins N., Liu A., Bondell H. Optimal cut-point and its corresponding Youden index to discriminate individuals using pooled blood samples. Epidemiology. 2005;16:73–81. doi: 10.1097/01.ede.0000147512.81966.ba.
Schmidt M., Roth W., Meyer H., Seifried E., Hourfar M. Nucleic acid test screening of blood donors for orthopoxviruses can potentially prevent dispersion of viral agents in case of bioterrorism. Transfusion. 2005;45:399–403. doi: 10.1111/j.1537-2995.2005.04242.x.
Tu X., Litvak E., Pagano M. Screening tests: Can we get more by doing less? Statistics in Medicine. 1994;13:1905–1919. doi: 10.1002/sim.4780131904.
Van T., Miller J., Warshauer D., Reisdorf E., Jerrigan D., Humes R., Shult P. Pooling nasopharyngeal/throat swab specimens to increase testing capacity for influenza viruses by PCR. Journal of Clinical Microbiology. 2012;50:891–896. doi: 10.1128/JCM.05631-11.
Vansteelandt S., Goetghebeur E., Thomas I., Mathys E., Van Loock F. On the viral safety of plasma pools and plasma derivatives. Journal of the Royal Statistical Society: Series A. 2005;168:345–363.
Vansteelandt S., Goetghebeur E., Verstraeten T. Regression models for disease prevalence with diagnostic tests on pools of serum samples. Biometrics. 2000;56:1126–1133. doi: 10.1111/j.0006-341x.2000.01126.x.
Vexler A., Schisterman E., Liu A. Estimation of ROC curves based on stably distributed biomarkers subject to measurement error and pooling mixtures. Statistics in Medicine. 2008;27:280–296. doi: 10.1002/sim.3035.
Wang Y. On the number of successes in independent trials. Statistica Sinica. 1993;3:295–312.
Wang X., Wang B. Deconvolution estimation in measurement error models: the R package decon. Journal of Statistical Software. 2011;39:1–24.
Wein L., Zenios S. Pooled testing for HIV screening: capturing the dilution effect. Operations Research. 1996;44:543–569.
Xie M. Regression analysis of group testing samples. Statistics in Medicine. 2001;20:1957–1969. doi: 10.1002/sim.817.
Zenios S., Wein L. Pooled testing for HIV prevalence estimation: exploiting the dilution effect. Statistics in Medicine. 1998;17:1447–1467. doi: 10.1002/(sici)1097-0258(19980715)17:13<1447::aid-sim862>3.0.co;2-k.

