Author manuscript; available in PMC: 2020 Oct 15.
Published in final edited form as: Stat Med. 2019 Jul 11;38(23):4519–4533. doi: 10.1002/sim.8311

Regression analysis and variable selection for two-stage multiple-infection group testing data

Juexin Lin 1, Dewei Wang 1,*, Qi Zheng 2
PMCID: PMC6736686  NIHMSID: NIHMS1037528  PMID: 31297869

Abstract

Group testing, as a cost-effective strategy, has been widely used to perform large-scale screening for rare infections. Recently, the use of multiplex assays has transformed the goal of group testing from detecting a single disease to diagnosing multiple infections simultaneously. Existing research on multiple-infection group testing data either excludes individual covariate information or ignores possible retests on suspicious individuals. To incorporate both, we propose a new regression model. This new model allows us to perform a regression analysis for each infection using multiple-infection group testing data. Furthermore, we introduce an efficient variable selection method to reveal truly relevant risk factors for each disease. Our methodology also allows for the estimation of the assay sensitivity and specificity when they are unknown. We examine the finite sample performance of our method through extensive simulation studies and apply it to a chlamydia and gonorrhea screening data set to illustrate its practical usefulness.

Keywords: adaptive LASSO, multiplex assay, pooled testing, sensitivity, specificity

1. INTRODUCTION

1.1. Motivation

This article is motivated by the annual Chlamydia trachomatis (CT) and Neisseria gonorrhoeae (NG) screening practice conducted by the State Hygienic Laboratory (SHL) in Iowa. CT and NG are two of the most common notifiable sexually transmitted diseases (STDs) in the United States; over two million cases were reported to the Centers for Disease Control and Prevention (CDC) in 2016.1 Both infections are commonly asymptomatic in women. If left untreated, they can cause pelvic inflammatory disease and further lead to tubal infertility, ectopic pregnancy, or chronic pelvic pain.2 In addition, both diseases can facilitate the transmission of HIV and human papillomavirus infection.3 Concerned by these severe sequelae, the CDC continually supports nationwide CT/NG screening and recommends annual CT/NG screening for all sexually active women under 25 years old.4

In this nationwide screening practice, specimens (swab or urine) are collected across each state and shipped to major state laboratories to be tested. Due to different budgets, laboratories conduct the screening differently. For example, the Nebraska Public Health Laboratory (NPHL) uses a traditional individual testing protocol which tests individual specimens one-by-one. The SHL tests male specimens and female urine specimens individually, but tests female swab specimens according to a two-stage pooling protocol:

The SHL Pooling Protocol

  • Individual swab specimens are randomly assigned to non-overlapping groups of size four. A pool is constructed by mixing individual specimens in the same group.

  • Stage 1: Each pool is tested for CT and NG simultaneously using a multiplex assay. If a pool tests negative for both infections, all the involved individuals are diagnosed as negative for each infection with no additional tests; otherwise, the protocol proceeds to the next stage.

  • Stage 2: Swabs of individuals in pools that test positive for either infection are retested separately using the same multiplex assay for final diagnosis.

The most practical reason for using pooling is cost reduction. When a pool tests negative for both infections, four individuals are diagnosed at the expense of one assay. Since switching from individual testing to pooling in 1999, Iowa has saved over $2.2 million in CT/NG screening.5
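As a rough illustration of this saving, the expected number of tests per individual under a two-stage protocol is easy to work out when individuals are treated as independent and, for simplicity, the assay as perfect. The R sketch below is ours, and the prevalence values in it are illustrative only, not SHL figures.

    # Expected number of tests per individual for a pool of size c, assuming
    # independent individuals and a perfect assay; prevalences are illustrative.
    expected_tests_per_individual <- function(c, p1, p2) {
      p_pool_negative <- ((1 - p1) * (1 - p2))^c      # pool free of both infections
      (1 + c * (1 - p_pool_negative)) / c             # one pooled test, plus c retests if flagged
    }
    expected_tests_per_individual(4, 0.07, 0.01)      # roughly 0.53 tests per person, vs 1 under individual testing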

As per the screening guidelines, many risk factors are collected as well, such as age, number of partners, and any symptoms of the infections. A motivating question is how to incorporate this covariate information so that one can identify the truly relevant risk factors for each infection and understand their effects. Challenges arise from the use of the multiplex Aptima Combo 2 Assay (Gen-Probe, San Diego), an imperfect discriminatory test that produces diagnoses for both diseases simultaneously. Because the assay is imperfect, it is possible to observe discrepancies between the testing outcomes of the two stages, as shown in Figure 1. Whenever a discrepancy occurs, the SHL ignores the pooled-level results from Stage 1 and makes the diagnosis solely based on the individual testing from Stage 2. However, when the objective is probing the impact of risk factors rather than case identification, disregarding testing outcomes from any stage could impair the estimation. It is important to seamlessly incorporate outcomes from both stages. Towards this goal, we need to account for how likely it is that the retests were triggered by either infection.

FIGURE 1.


A possible set of the SHL pooled testing outcomes from a group of 4 individuals: the rectangle with rounded corners represents the pooled specimen that is constructed by mixing the 4 individual specimens (in circles) together. The pool tested negative for CT (i.e., CT = 0) but positive for NG (i.e., NG = 1). As per the SHL pooling protocol, the positivity of NG triggered the second stage of screening for both infections. Due to testing errors, we see a discrepancy between the two stages; i.e., the fourth individual retested positive for CT but the pool tested negative for CT.

1.2. Literature review

Pooled testing (also known as group testing) was initially proposed to screen for syphilis among World War II American army recruits.6 Since this seminal work, pooling techniques have been successfully implemented to screen for many other infectious diseases, including HIV, HBV, and HCV,7 influenza,8 and herpes.9 Besides disease screening, many other areas, including genetics,10 veterinary science,11 medical entomology,12 blood safety,13 and drug discovery,14 have also used the method of pooling. Statistical research in group testing has primarily focused on improving the diagnostic accuracy and cost-saving ability of a pooling protocol15 or on estimating individual-level characteristics from pooled testing data. This article falls into the latter category. When group testing data involve only a single infection, the research on estimation started with estimating a disease prevalence.16,17 This research avenue was then expanded to incorporate individual covariate information through the use of parametric regression models, such as generalized linear models,18,19 mixed models,20 and Bayesian regression models.21 Semiparametric and nonparametric regression methods have also been developed.22,23 However, all of these works are limited to one infection.

The use of multiplex assays has made pooled testing data with multiple infections widely available. For example, in addition to CT/NG, HIV/syphilis, HIV/HCV, or HIV/HBV/HCV can be detected simultaneously.24 In the statistical literature, research on estimation with multiple-infection group testing data is scarce. A few works have studied the estimation of disease prevalence.25,26,27,28 Regression analysis for this type of data remains largely unexplored. To the best of our knowledge, the only existing work is an approach based on generalized estimating equations.29 However, it did not consider retesting outcomes arising from the second stage of screening and thus does not apply to the SHL screening practice.

When using an imperfect assay, the values of the assay sensitivity and specificity are crucial for estimation in pooled testing. Most of the aforementioned literature assumed that preliminary studies were available to provide these misclassification parameters. However, this assumption could be impractical because the preliminary study might have used unrepresentative samples.17 If inaccurate values of the assay sensitivity and specificity were used for estimation, inference could be compromised. In this article, we treat the testing error rates as unknown and estimate them from the data along with the regression coefficients.

Existing literature has not considered the combination of incorporating retesting results into regression and estimating misclassification parameters in the context of multiple-infection group testing. Only one Bayesian work has provided inference for disease prevalence and estimates of assay sensitivity and specificity without consideration of individual covariates.27 In this article, we propose a copula-based multivariate binary regression model to incorporate the covariates. We introduce a generalized expectation-maximization (GEM) algorithm to facilitate the numerical computation of the maximum likelihood estimates (MLEs) of the regression coefficients and misclassification parameters. When compared to the traditional EM algorithm, the GEM only requires the maximization step to search for an increase in the objective function rather than achieving the maximum.30,31 This feature greatly accelerates the computation of the MLE.

In addition, we provide a variable selection technique that can identify the truly relevant risk factors for each infection. A recent work introduced a regularized regression technique for group testing,32 but it is for a single infection. Our work is designed to allow for multiple infections. We believe that a package of regression, estimation of misclassification parameters, and variable selection provides a useful toolbox for epidemiological studies of CT and NG based on group testing data.

The rest of the article is organized as follows. In Section 2, we propose a new copula-based regression model for multiple-infection group testing data. In Section 3, we introduce the GEM algorithm that accelerates the computation of the MLE. Section 4 presents a variable selection method that can identify important risk factors for each infection. In Section 5.1, we use simulation to illustrate that, with fewer tests, the SHL pooling protocol can lead to more efficient regression estimates, better prediction of infection probabilities, and more accurate variable selection than traditional individual testing. These advantages are further demonstrated by analyzing a CT/NG screening data set in Section 5.2. Section 6 presents a discussion of this work. All technical details and additional numerical results are relegated to the supplementary materials.

2. MODEL

Suppose $N$ individuals are to be tested. We randomly assign each individual to one of $J$ groups, each of size $c_j$; i.e., $N = \sum_{j=1}^{J} c_j$. For generality, we allow the group size $c_j$ to vary across groups. Motivated by the CT/NG screening practice, we mainly consider two infections; Section 6 discusses an extension to more than two diseases. The true infection statuses of the $i$th individual in the $j$th group are denoted by a binary vector $\tilde{Y}_{ij} = (\tilde{Y}_{ij1}, \tilde{Y}_{ij2})^T$, where $\tilde{Y}_{ijk} = 1$ if the individual is positive for the $k$th infection and $\tilde{Y}_{ijk} = 0$ otherwise, for $i = 1,\ldots,c_j$, $j = 1,\ldots,J$, and $k = 1, 2$. Denote the covariates (risk factors and an intercept term) of the $i$th individual in the $j$th group by a $(p+1)$-dimensional vector $x_{ij} = (1, x_{ij1},\ldots,x_{ijp})^T$. We assume that the $\tilde{Y}_{ij} \mid x_{ij}$'s are independent across $ij$ and that $\tilde{Y}_{ijk}$ is related to the linear predictor $x_{ij}^T\beta_k$ via

\mathrm{pr}(\tilde{Y}_{ijk} = 1 \mid x_{ij}) = g_k(x_{ij}^T\beta_k), \quad \text{for } k = 1, 2,  (1)

where $\beta_k = (\beta_{k0}, \beta_{k1},\ldots,\beta_{kp})^T$ is a vector of $(p+1)$ regression coefficients to be estimated and $g_k$ is a user-chosen known link function (e.g., the inverse of the logit or probit link). One could use different links for different infections. Equation (1) builds the marginal probability models of the random vector $\tilde{Y}_{ij} \mid x_{ij}$.

In pooled testing, the true infection statuses are often latent due to pooling and potential misclassification. In each group, individual specimens are mixed together to form a pool. We denote the true status of the $j$th pool by $\tilde{Z}_j = (\tilde{Z}_{j1}, \tilde{Z}_{j2})^T$, where $\tilde{Z}_{jk} = \max\{\tilde{Y}_{ijk} : i = 1,\ldots,c_j\}$; i.e., $\tilde{Z}_{jk} = 1$ if the pool involves at least one individual who is positive for the $k$th infection, and $\tilde{Z}_{jk} = 0$ otherwise. With the use of an imperfect assay, both the $\tilde{Y}_{ij}$'s and the $\tilde{Z}_j$'s are latent. The observed data are the testing outcomes from the imperfect multiplex assay. Pools are tested in Stage 1. We denote the testing outcomes of the $j$th pool by $Z_j = (Z_{j1}, Z_{j2})^T$, where $Z_{jk} = 1(0)$ if the pool tests positive (negative) for the $k$th infection. If $Z_j = (0,0)^T$, then $Z_j$ is the only observed test response for the $j$th group of individuals. Otherwise, those individuals are tested separately in Stage 2. We denote by $Y_{ij} = (Y_{ij1}, Y_{ij2})^T$ the retesting outcome of the $i$th individual in the $j$th group; i.e., $Y_{ijk} = 1(0)$ if the individual retests positive (negative) for the $k$th infection. Note that the $Y_{ij}$'s can only be observed if $Z_j \neq (0,0)^T$. In summary, the observed testing outcomes from the $j$th group, denoted by $P_j$, take one of two forms: either $Z_j = (0,0)^T$, or $Z_j \in \{(1,0)^T, (0,1)^T, (1,1)^T\}$ together with $Y_{1j},\ldots,Y_{c_j j}$.

The discrepancy between true statuses and testing outcomes is often measured by the assay sensitivity and specificity. Denote by $S_{e:k}$ and $S_{p:k}$ the assay sensitivity and specificity, respectively, for the $k$th infection. In practice, an assay used for large-scale screening is often imperfect; we let the $S_{e:k}$'s and $S_{p:k}$'s lie in $(0,1)$. Our methodology posits three assumptions on these misclassification parameters. Assumption 1 is that the $S_{e:k}$'s and $S_{p:k}$'s do not depend on the group size; e.g., $S_{e:k} = \mathrm{pr}(Z_{jk}=1 \mid \tilde{Z}_{jk}=1) = \mathrm{pr}(Y_{ijk}=1 \mid \tilde{Y}_{ijk}=1)$ and $S_{p:k} = \mathrm{pr}(Z_{jk}=0 \mid \tilde{Z}_{jk}=0) = \mathrm{pr}(Y_{ijk}=0 \mid \tilde{Y}_{ijk}=0)$ hold for all $i$, $j$, and $k$. Assumption 2 assumes that, conditional on the true statuses of the specimens being tested, testing responses are independent of each other and also across infections. Assumption 3 further assumes that, given the true statuses, testing responses are independent of the covariates; e.g., $\mathrm{pr}(Z_{j1}=0, Z_{j2}=1, Y_{ij1}=1, Y_{ij2}=0 \mid \tilde{Z}_{j1}=0, \tilde{Z}_{j2}=0, \tilde{Y}_{ij1}=1, \tilde{Y}_{ij2}=1, x_{ij}) = \mathrm{pr}(Z_{j1}=0 \mid \tilde{Z}_{j1}=0)\,\mathrm{pr}(Z_{j2}=1 \mid \tilde{Z}_{j2}=0)\,\mathrm{pr}(Y_{ij1}=1 \mid \tilde{Y}_{ij1}=1)\,\mathrm{pr}(Y_{ij2}=0 \mid \tilde{Y}_{ij2}=1) = S_{p:1}(1-S_{p:2})S_{e:1}(1-S_{e:2})$. All of these assumptions are standard in the group testing literature (see most references in Section 1.2). In practice, one may need to conduct proper assay calibration to ensure the applicability of these assumptions.

Our primary goal is to estimate the $\beta_k$'s, $S_{e:k}$'s, and $S_{p:k}$'s. Towards this goal, we want to incorporate the retesting outcomes for two main reasons: 1) ignoring the retesting outcomes could severely inflate the variance of the estimators of the $\beta_k$'s (see the supplementary materials for a numerical illustration); 2) including the retesting outcomes gives us repeated measurements (i.e., many specimens are tested in pools and also individually), which provide valuable information for estimating the misclassification parameters. To seamlessly incorporate all retesting outcomes, we propose a copula-based multivariate binary regression model. We assume that there exists a vector of standard uniform random variables, $U_{ij} = (U_{ij1}, U_{ij2})^T$, such that the event $\{\tilde{Y}_{ijk} = 1 \mid x_{ij}\}$ is equivalent to $\{U_{ijk} \leq g_k(x_{ij}^T\beta_k)\}$, where the $U_{ij}$'s are independent and follow a bivariate copula.33 Denote the chosen copula by $C\{u_1, u_2 \mid \delta\}$, where $u_1, u_2 \in (0,1)$; the copula is known up to a parameter $\delta$ (which could be a vector). Then the marginal regression models in (1) naturally hold, and the co-infection probability is

\mathrm{pr}(\tilde{Y}_{ij1}=1, \tilde{Y}_{ij2}=1 \mid x_{ij}) = C\{g_1(x_{ij}^T\beta_1), g_2(x_{ij}^T\beta_2) \mid \delta\}.  (2)

Combining (1) and (2) defines our joint probability model for $\tilde{Y}_{ij} \mid x_{ij}$.

3. ESTIMATION

We maximize the likelihood function to obtain our estimators of the $\beta_k$'s, $S_{e:k}$'s, $S_{p:k}$'s, and $\delta$. For notational simplicity, we write $\theta_1 = (\beta_1^T, \beta_2^T, \delta)^T$, $\theta_2 = (S_{e:1}, S_{e:2}, S_{p:1}, S_{p:2})^T$, and $\theta = (\theta_1^T, \theta_2^T)^T$. Furthermore, we denote by $p_{ij}^{y_1 y_2}(\theta_1)$ the cell probability $\mathrm{pr}(\tilde{Y}_{ij1}=y_1, \tilde{Y}_{ij2}=y_2 \mid x_{ij})$ defined by (1) and (2) under $\theta_1$, for $y_1, y_2 \in \{0,1\}$, $i = 1,\ldots,c_j$, and $j = 1,\ldots,J$. Then $p_{ij}^{11}(\theta_1) = C\{g_1(x_{ij}^T\beta_1), g_2(x_{ij}^T\beta_2) \mid \delta\}$, $p_{ij}^{10}(\theta_1) = g_1(x_{ij}^T\beta_1) - p_{ij}^{11}(\theta_1)$, $p_{ij}^{01}(\theta_1) = g_2(x_{ij}^T\beta_2) - p_{ij}^{11}(\theta_1)$, and $p_{ij}^{00}(\theta_1) = 1 - p_{ij}^{11}(\theta_1) - p_{ij}^{10}(\theta_1) - p_{ij}^{01}(\theta_1)$. In the supplementary materials, we derive an expression for the log-likelihood function $\ell(\theta \mid P, X)$, where $P$ and $X$ denote the collections of the $P_j$'s and $x_{ij}$'s, respectively. However, due to the complexity of $\ell(\theta \mid P, X)$, a direct maximization could be time-consuming. The supplementary materials include a numerical illustration of this disadvantage.
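For concreteness, the cell probabilities above are straightforward to compute once the links and the copula are chosen. The minimal R sketch below assumes logit links and the Gumbel copula that we use later in the simulations (Section 5.1); the function names are ours, not from the authors' software.

    # A minimal sketch (not the authors' code): cell probabilities of (Y1, Y2)
    # given x under logit links and a Gumbel copula with parameter delta.
    gumbel_copula <- function(u1, u2, delta) {
      # C(u1, u2 | delta) = exp(-[(-log u1)^(1/delta) + (-log u2)^(1/delta)]^delta)
      exp(-((-log(u1))^(1 / delta) + (-log(u2))^(1 / delta))^delta)
    }
    cell_probs <- function(x, beta1, beta2, delta) {
      g1 <- plogis(sum(x * beta1))          # marginal probability for infection 1, model (1)
      g2 <- plogis(sum(x * beta2))          # marginal probability for infection 2
      p11 <- gumbel_copula(g1, g2, delta)   # co-infection probability, model (2)
      c(p00 = 1 - g1 - g2 + p11, p10 = g1 - p11, p01 = g2 - p11, p11 = p11)
    }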

We propose a GEM algorithm to accelerate the computation. The algorithm incorporates $\tilde{Y} = \{\tilde{Y}_{11},\ldots,\tilde{Y}_{c_J J}\}$ as latent variables. The complete log-likelihood function of $\theta$, derived from the conditional distribution of $P$ and $\tilde{Y}$ given $X$, can be written as $\ell_c(\theta \mid P, \tilde{Y}, X) = \ell_{c1}(\theta_1 \mid \tilde{Y}, X) + \ell_{c2}(\theta_2 \mid P, \tilde{Y})$, where

\ell_{c1}(\theta_1 \mid \tilde{Y}, X) = \sum_{j=1}^{J}\sum_{i=1}^{c_j}\big[(1-\tilde{Y}_{ij1})(1-\tilde{Y}_{ij2})\log p_{ij}^{00}(\theta_1) + \tilde{Y}_{ij1}(1-\tilde{Y}_{ij2})\log p_{ij}^{10}(\theta_1) + (1-\tilde{Y}_{ij1})\tilde{Y}_{ij2}\log p_{ij}^{01}(\theta_1) + \tilde{Y}_{ij1}\tilde{Y}_{ij2}\log p_{ij}^{11}(\theta_1)\big]  (3)

and

\ell_{c2}(\theta_2 \mid P, \tilde{Y}) = \sum_{j=1}^{J}\sum_{k=1}^{2}\Big[\big\{\tilde{Z}_{jk}Z_{jk} + I(Z_j \neq (0,0)^T)\sum_{i=1}^{c_j}\tilde{Y}_{ijk}Y_{ijk}\big\}\log S_{e:k}
+ \big\{\tilde{Z}_{jk}(1-Z_{jk}) + I(Z_j \neq (0,0)^T)\sum_{i=1}^{c_j}\tilde{Y}_{ijk}(1-Y_{ijk})\big\}\log(1-S_{e:k})
+ \big\{(1-\tilde{Z}_{jk})(1-Z_{jk}) + I(Z_j \neq (0,0)^T)\sum_{i=1}^{c_j}(1-\tilde{Y}_{ijk})(1-Y_{ijk})\big\}\log S_{p:k}
+ \big\{(1-\tilde{Z}_{jk})Z_{jk} + I(Z_j \neq (0,0)^T)\sum_{i=1}^{c_j}(1-\tilde{Y}_{ijk})Y_{ijk}\big\}\log(1-S_{p:k})\Big],  (4)

in which $\tilde{Z}_{jk} = \max\{\tilde{Y}_{ijk} : i = 1,\ldots,c_j\}$ and $I(\cdot)$ is the indicator function.

Our GEM algorithm starts at an initial value and then iterates between an E-step and an M-step to update the value until numerical convergence. At the current value $\theta^{(d)}$, the E-step calculates $Q(\theta \mid \theta^{(d)}) = Q_1(\theta_1 \mid \theta^{(d)}) + Q_2(\theta_2 \mid \theta^{(d)})$, where $Q_1(\theta_1 \mid \theta^{(d)}) = E\{\ell_{c1}(\theta_1 \mid \tilde{Y}, X) \mid P, X, \theta^{(d)}\}$ and $Q_2(\theta_2 \mid \theta^{(d)}) = E\{\ell_{c2}(\theta_2 \mid P, \tilde{Y}) \mid P, X, \theta^{(d)}\}$. After an inspection of (3) and (4), it suffices to calculate $\eta_{ij}^{00(d)}$, $\eta_{ij}^{10(d)}$, $\eta_{ij}^{01(d)}$, $\eta_{ij}^{11(d)}$ (for $Q_1$) and $\eta_{P,jk}^{(d)}$ (for $Q_2$), where

\eta_{ij}^{y_1 y_2 (d)} = \mathrm{pr}(\tilde{Y}_{ij1}=y_1, \tilde{Y}_{ij2}=y_2 \mid P, X, \theta^{(d)}) \quad\text{and}\quad \eta_{P,jk}^{(d)} = \mathrm{pr}(\tilde{Z}_{jk}=1 \mid P, X, \theta^{(d)}),  (5)

for $i = 1,\ldots,c_j$, $j = 1,\ldots,J$, $y_1, y_2 \in \{0,1\}$, and $k = 1, 2$. Though the $\eta_{ij}^{y_1 y_2 (d)}$'s have been studied without the consideration of $X$,26 they were not updated in closed form, and thus a Gibbs sampler was employed to approximate these quantities. However, in the regression context, using such approximations requires enlarging the tolerance of the numerical convergence and hence might induce bias. To improve the computational accuracy, we calculate all the probabilities in (5) exactly (see the supplementary materials for details).

With the probabilities in (5) calculated, we rewrite $Q_1(\theta_1 \mid \theta^{(d)})$ as

Q_1(\beta_1, \beta_2, \delta \mid \theta^{(d)}) = \sum_{j=1}^{J}\sum_{i=1}^{c_j}\sum_{y_1=0}^{1}\sum_{y_2=0}^{1} \eta_{ij}^{y_1 y_2 (d)} \log p_{ij}^{y_1 y_2}(\theta_1),

and $Q_2(\theta_2 \mid \theta^{(d)})$ as

\sum_{k=1}^{2}\big\{W_{1k}^{(d)}\log S_{e:k} + W_{2k}^{(d)}\log(1-S_{e:k}) + W_{3k}^{(d)}\log S_{p:k} + W_{4k}^{(d)}\log(1-S_{p:k})\big\},  (6)

where

W_{1k}^{(d)} = \sum_{j=1}^{J}\big\{\eta_{P,jk}^{(d)}Z_{jk} + I(Z_j \neq (0,0)^T)\sum_{i=1}^{c_j}\eta_{ij,k}^{(d)}Y_{ijk}\big\},
W_{2k}^{(d)} = \sum_{j=1}^{J}\big\{\eta_{P,jk}^{(d)}(1-Z_{jk}) + I(Z_j \neq (0,0)^T)\sum_{i=1}^{c_j}\eta_{ij,k}^{(d)}(1-Y_{ijk})\big\},
W_{3k}^{(d)} = \sum_{j=1}^{J}\big\{(1-\eta_{P,jk}^{(d)})(1-Z_{jk}) + I(Z_j \neq (0,0)^T)\sum_{i=1}^{c_j}(1-\eta_{ij,k}^{(d)})(1-Y_{ijk})\big\},
W_{4k}^{(d)} = \sum_{j=1}^{J}\big\{(1-\eta_{P,jk}^{(d)})Z_{jk} + I(Z_j \neq (0,0)^T)\sum_{i=1}^{c_j}(1-\eta_{ij,k}^{(d)})Y_{ijk}\big\},

in which $\eta_{ij,1}^{(d)} = \eta_{ij}^{11(d)} + \eta_{ij}^{10(d)}$ and $\eta_{ij,2}^{(d)} = \eta_{ij}^{11(d)} + \eta_{ij}^{01(d)}$. The M-step in our GEM algorithm updates $\theta_1^{(d)}$ by $\theta_1^{(d+1)} = (\beta_1^{(d+1)T}, \beta_2^{(d+1)T}, \delta^{(d+1)})^T$, where $\beta_1^{(d+1)} = \arg\max_{\beta_1} Q_1(\beta_1, \beta_2^{(d)}, \delta^{(d)} \mid \theta^{(d)})$, $\beta_2^{(d+1)} = \arg\max_{\beta_2} Q_1(\beta_1^{(d+1)}, \beta_2, \delta^{(d)} \mid \theta^{(d)})$, and $\delta^{(d+1)} = \arg\max_{\delta} Q_1(\beta_1^{(d+1)}, \beta_2^{(d+1)}, \delta \mid \theta^{(d)})$. The value of $\theta_2^{(d+1)}$ is obtained by maximizing (6) and can be written as

\theta_2^{(d+1)} = (S_{e:1}^{(d+1)}, S_{e:2}^{(d+1)}, S_{p:1}^{(d+1)}, S_{p:2}^{(d+1)})^T,

where $S_{e:k}^{(d+1)} = W_{1k}^{(d)}/(W_{1k}^{(d)} + W_{2k}^{(d)})$ and $S_{p:k}^{(d+1)} = W_{3k}^{(d)}/(W_{3k}^{(d)} + W_{4k}^{(d)})$, for $k = 1, 2$. Combining $\theta_1^{(d+1)}$ and $\theta_2^{(d+1)}$ provides $\theta^{(d+1)}$. Because $Q(\theta^{(d+1)} \mid \theta^{(d)}) \geq Q(\theta^{(d)} \mid \theta^{(d)})$, the convergence of $\{\theta^{(d)}\}_{d=1}^{\infty}$ is guaranteed.30 We denote by $\hat{\theta}$ the limit of the $\theta^{(d)}$'s.
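The closed-form part of this M-step is easy to code once the E-step probabilities are available. The R sketch below computes the weight sums $W_{1k}^{(d)},\ldots,W_{4k}^{(d)}$ and the resulting sensitivity/specificity updates; the data layout (a J x 2 matrix Z of pooled outcomes, a list Y of retest matrices, a J x 2 matrix etaP of the $\eta_{P,jk}^{(d)}$'s, and a list etaI of the marginal $\eta_{ij,k}^{(d)}$'s) is our own choice for illustration, not the authors' implementation.

    # A minimal sketch of the closed-form M-step for the misclassification parameters.
    # Z: J x 2 pooled outcomes; Y: list of c_j x 2 retest matrices (NULL if the pool
    # tested negative for both); etaP: J x 2 matrix; etaI: list of c_j x 2 matrices.
    update_error_rates <- function(Z, Y, etaP, etaI) {
      W <- matrix(0, nrow = 4, ncol = 2)            # rows: W1, W2, W3, W4; columns: k = 1, 2
      for (j in seq_len(nrow(Z))) {
        for (k in 1:2) {
          W[1, k] <- W[1, k] + etaP[j, k] * Z[j, k]
          W[2, k] <- W[2, k] + etaP[j, k] * (1 - Z[j, k])
          W[3, k] <- W[3, k] + (1 - etaP[j, k]) * (1 - Z[j, k])
          W[4, k] <- W[4, k] + (1 - etaP[j, k]) * Z[j, k]
          if (any(Z[j, ] == 1)) {                   # retests exist only for flagged pools
            W[1, k] <- W[1, k] + sum(etaI[[j]][, k] * Y[[j]][, k])
            W[2, k] <- W[2, k] + sum(etaI[[j]][, k] * (1 - Y[[j]][, k]))
            W[3, k] <- W[3, k] + sum((1 - etaI[[j]][, k]) * (1 - Y[[j]][, k]))
            W[4, k] <- W[4, k] + sum((1 - etaI[[j]][, k]) * Y[[j]][, k])
          }
        }
      }
      list(Se = W[1, ] / (W[1, ] + W[2, ]),         # Se_{:k}^{(d+1)}
           Sp = W[3, ] / (W[3, ] + W[4, ]))         # Sp_{:k}^{(d+1)}
    }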

Denote by $I(\theta)$ the observed data information matrix. Following standard arguments for the MLE,34 $I(\hat{\theta})^{1/2}(\hat{\theta} - \theta)$ converges in distribution to $N(0, I_{2p+7})$ as $N \to \infty$, where $I_m$ denotes the $m$-dimensional identity matrix. Applying Louis' method35 provides

I(\theta) = -E\Big\{\frac{\partial^2 \ell_c(\theta \mid P, \tilde{Y}, X)}{\partial\theta\,\partial\theta^T} \,\Big|\, P, X, \theta\Big\} - \mathrm{cov}\Big\{\frac{\partial \ell_c(\theta \mid P, \tilde{Y}, X)}{\partial\theta} \,\Big|\, P, X, \theta\Big\}.

Again, instead of approximating $I(\theta)$ via the Gibbs sampling approach,26 we are able to calculate it exactly. The calculations are included in the supplementary materials. With $I(\hat{\theta})$, one can make large-sample Wald-type inferences. For example, let $\theta_l$, $\hat{\theta}_l$, and $\hat{\sigma}_{ll}^2$ be the $l$th component of $\theta$, the $l$th component of $\hat{\theta}$, and the $l$th diagonal entry of $I(\hat{\theta})^{-1}$, respectively, for $l = 1,\ldots,2p+7$. The estimated standard error (SE) of $\hat{\theta}_l$ is $\hat{\sigma}_{ll}$, and an approximate $100(1-\alpha)\%$ confidence interval for $\theta_l$ is $\hat{\theta}_l \pm z_{\alpha/2}\hat{\sigma}_{ll}$, where $z_{\alpha}$ is the upper $\alpha$ quantile of $N(0,1)$.
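As a small illustration of these Wald-type inferences, the following R sketch turns an estimate and its observed information matrix into standard errors and confidence intervals; the function name is ours.

    # Wald-type inference from the observed data information matrix
    # (a minimal sketch; 'info' stands for I(theta_hat)).
    wald_ci <- function(theta_hat, info, level = 0.95) {
      se <- sqrt(diag(solve(info)))                 # estimated standard errors
      z  <- qnorm(1 - (1 - level) / 2)
      cbind(estimate = theta_hat, SE = se,
            lower = theta_hat - z * se, upper = theta_hat + z * se)
    }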

4. VARIABLE SELECTION FOR EACH INFECTION

With $\hat{\theta}$ and $I(\hat{\theta})$ computed, we further identify which risk factors are truly relevant for each infection. Denote by $\beta_1^*$ and $\beta_2^*$ the values of $\beta_1$ and $\beta_2$ that generate the true individual statuses $\tilde{Y}$, respectively, where $\beta_k^* = (\beta_{k0}^*, \beta_{k1}^*,\ldots,\beta_{kp}^*)^T$. One can index the significant risk factors for the $k$th infection by $\mathcal{M}_k = \{j \in \mathcal{M} : \beta_{kj}^* \neq 0\}$, where we take $\mathcal{M} = \{1, 2,\ldots,p\}$ because an intercept term is always included in the model by default. One must note that $\mathcal{M}_1$ and $\mathcal{M}_2$ might be different.

We apply a shrinkage method to simultaneously select the $\mathcal{M}_k$'s and estimate the nonzero $\beta_{kj}^*$'s. To unify notation, we write $\theta_T$ and $\hat{\Sigma}_{TT}$ for the sub-vector of $\theta$ and the sub-matrix of $\hat{\Sigma}$ corresponding to an index set $T \subseteq \{1,\ldots,2p+7\}$, respectively. Let $A = \{2,\ldots,p+1, p+3,\ldots,2p+2\}$. Our shrinkage estimator of $\theta_A$ is defined by

\tilde{\theta}_{A,\lambda} = \arg\min_{\theta_A}\Big\{\tfrac{1}{2}(\hat{\theta}_A - \theta_A)^T\hat{\Sigma}_{AA}(\hat{\theta}_A - \theta_A) + \sum_{k=1}^{2}\lambda_k\sum_{j=1}^{p}\omega_{kj}|\beta_{kj}|\Big\},  (7)

where $\lambda_k\sum_{j=1}^{p}\omega_{kj}|\beta_{kj}|$ is an adaptive LASSO penalty,36 $\lambda_k \geq 0$ is a tuning parameter that controls the shrinkage level, and $\omega_{kj} = |\hat{\beta}_{kj}|^{-1}$ is an adaptive weight. When the $\lambda_k$'s are 0, $\tilde{\theta}_{A,\lambda} = \hat{\theta}_A$. As the $\lambda_k$'s increase, due to the singularity of the absolute value function at the origin, components of $\tilde{\theta}_{A,\lambda}$ are penalized to zero one by one. Writing $\tilde{\theta}_{A,\lambda} = (\tilde{\beta}_{11,\lambda},\ldots,\tilde{\beta}_{1p,\lambda},\tilde{\beta}_{21,\lambda},\ldots,\tilde{\beta}_{2p,\lambda})^T$, we estimate $\mathcal{M}_1$ and $\mathcal{M}_2$ by $\tilde{\mathcal{M}}_{1,\lambda} = \{j \in \mathcal{M} : \tilde{\beta}_{1j,\lambda} \neq 0\}$ and $\tilde{\mathcal{M}}_{2,\lambda} = \{j \in \mathcal{M} : \tilde{\beta}_{2j,\lambda} \neq 0\}$, respectively.

Computing $\tilde{\theta}_{A,\lambda}$ is fast. The objective function in (7) is simply the sum of a quadratic function and a weighted $\ell_1$-norm of $\theta_A$ and can therefore be quickly minimized by slightly modifying the seminal least angle regression.37 Let $A^c = \{1, 2,\ldots,2p+7\}\setminus A$ and let $\ell(\theta_A \mid P, X, \hat{\theta}_{A^c})$ be the log-likelihood function $\ell(\theta \mid P, X)$ with $\theta_{A^c}$ fixed at $\hat{\theta}_{A^c}$. One could also construct a shrinkage estimator via the traditional penalized MLE,38 which minimizes $-\ell(\theta_A \mid P, X, \hat{\theta}_{A^c}) + \sum_{k=1}^{2}\lambda_k\sum_{j=1}^{p}\omega_{kj}|\beta_{kj}|$. Because the quadratic term in (7) is the leading component of the Taylor expansion of $-\ell(\theta_A \mid P, X, \hat{\theta}_{A^c})$ at $\theta_A = \hat{\theta}_A$, it can easily be shown that $\tilde{\theta}_{A,\lambda}$ and the penalized MLE are asymptotically equivalent. However, the computational cost of obtaining the penalized MLE is much higher due to the complexity of the log-likelihood function.
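Because (7) is a quadratic function plus a weighted $\ell_1$ penalty, it can also be minimized by plain cyclic coordinate descent with soft-thresholding. The R sketch below is a simple substitute for the modified least angle regression used above, not the authors' implementation; it assumes $\hat{\Sigma}_{AA}$ is positive definite, and the weight vector w stacks the products $\lambda_k\omega_{kj}$.

    # Minimize 0.5 * (theta_hat - theta)' Sigma (theta_hat - theta) + sum(w * |theta|)
    # by cyclic coordinate descent with soft-thresholding (a minimal sketch).
    soft <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)
    shrink_quadratic <- function(theta_hat, Sigma, w, tol = 1e-8, maxit = 1000) {
      theta <- theta_hat
      for (it in seq_len(maxit)) {
        theta_old <- theta
        for (j in seq_along(theta)) {
          # coordinate-wise minimizer given the other components
          r_j <- Sigma[j, j] * theta_hat[j] -
                 sum(Sigma[j, -j] * (theta[-j] - theta_hat[-j]))
          theta[j] <- soft(r_j, w[j]) / Sigma[j, j]
        }
        if (max(abs(theta - theta_old)) < tol) break
      }
      theta
    }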

The use of the adaptive weights $\omega_{kj}$ is critical to achieving the oracle properties.36 They assign sufficiently large penalties to insignificant covariates so that these are excluded from the model; on the other hand, they impose mild penalties on significant covariates so that these are retained in the model. The oracle properties are stated as follows. As $N \to \infty$, if $\max(\lambda_1, \lambda_2)/\sqrt{N} \to 0$ and $\min(\lambda_1, \lambda_2) \to \infty$, we have both the selection consistency, $\mathrm{pr}(\tilde{\mathcal{M}}_{1,\lambda} = \mathcal{M}_1, \tilde{\mathcal{M}}_{2,\lambda} = \mathcal{M}_2) \to 1$, and the estimation consistency, $\sup_{k,j}|\tilde{\beta}_{kj,\lambda} - \beta_{kj}^*| = O_p(N^{-1/2})$. The proof follows arguments similar to those in the proofs of Theorems 1 and 2 in Reference 39 and is thus omitted.

To select λ1 and λ2, we propose to minimize a BIC-type criterion40,

\mathrm{BIC}(\lambda_1, \lambda_2) = (\hat{\theta}_A - \tilde{\theta}_{A,\lambda})^T\hat{\Sigma}_{AA}(\hat{\theta}_A - \tilde{\theta}_{A,\lambda}) + \{df_{1,\lambda} + df_{2,\lambda}\}\log N,  (8)

where $df_{k,\lambda} = |\tilde{\mathcal{M}}_{k,\lambda}|$ for $k = 1, 2$. Following the proof of Theorem 3 in Reference 41, one can show that, with the optimal $(\lambda_1, \lambda_2)$ from (8), $\mathrm{pr}(\tilde{\mathcal{M}}_{1,\lambda} = \mathcal{M}_1, \tilde{\mathcal{M}}_{2,\lambda} = \mathcal{M}_2) \to 1$ as $N \to \infty$. In other words, any $(\lambda_1, \lambda_2)$ that does not lead to the correct variable selection cannot be selected by (8) when the number of individuals is large.
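A direct way to use (8) is a two-dimensional grid search over $(\lambda_1, \lambda_2)$. The R sketch below does this with the hypothetical shrink_quadratic() sketched above, assuming the first p entries of $\theta_A$ correspond to $\beta_1$ and the last p to $\beta_2$; the grid itself is arbitrary.

    # A minimal sketch of tuning-parameter selection by criterion (8).
    bic_select <- function(theta_hat_A, Sigma_AA, omega1, omega2, N,
                           grid = exp(seq(log(0.1), log(50), length.out = 20))) {
      p <- length(omega1)
      best <- list(bic = Inf)
      for (l1 in grid) for (l2 in grid) {
        w <- c(l1 * omega1, l2 * omega2)                     # weights lambda_k * omega_kj in (7)
        theta <- shrink_quadratic(theta_hat_A, Sigma_AA, w)
        d <- theta_hat_A - theta
        df <- sum(theta[1:p] != 0) + sum(theta[(p + 1):(2 * p)] != 0)
        bic <- drop(t(d) %*% Sigma_AA %*% d) + df * log(N)   # criterion (8)
        if (bic < best$bic) best <- list(bic = bic, lambda = c(l1, l2), theta = theta)
      }
      best
    }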

The purpose of this section is to provide a shrinkage estimator of the regression coefficients whose sparsity pattern can help us identify the truly relevant risk factors for each infection. Inference procedures based on this shrinkage estimator, such as constructing confidence intervals or conducting hypothesis tests, are beyond the scope of this work. Numerous studies have demonstrated that, even in classical linear regression, finite-sample inference procedures based on asymptotic properties of the adaptive LASSO estimator perform poorly.42 Developing valid inferential methods for shrinkage estimators in group testing, even with a single infection, could be an interesting but challenging future research topic. In this article, variable selection is of primary interest.

5. NUMERICAL STUDIES

5.1. Simulation

We consider three different settings for the joint distribution of $\tilde{Y}_{ij} \mid x_{ij}$. In all of them, we take both $g_1$ and $g_2$ in the marginal regression model (1) to be the inverse of the logit link function and use a Gumbel copula,43 $C(u_1, u_2 \mid \delta) = \exp\{-[(-\log u_1)^{1/\delta} + (-\log u_2)^{1/\delta}]^{\delta}\}$ with $\delta = 0.3$, to generate the co-infection probability (2). The difference across the three settings comes from the choices of $(\beta_1, \beta_2, x)$, where $x$ is generic notation for the $x_{ij}$'s:

  • (S1) β1 = (−5,−3,2,0,0,0)T, β2 = (−5,−3,0,3,0,0)T and x = (1,x1,⋯,x5)T, where we independently simulate x1 from (0,1), x2 and x3 from Bernoulli(0.4), x4 from Uniform(−0.5,0.5), and x5 from N(0, 0.75^2).

  • (S2) β1 = (−4,−2,2,0,0,0)T, β2 = (−5,−2,0,−2,0,0)T and x = (1,x1,⋯,x5)T, where x is simulated from N(0,Ω) with [Ω]st = 1 if s = t and [Ω]st = 0.5 if s ≠ t.

  • (S3) β1 = (−5, (−2,−2,−2,2,2) ⊗ (1,0))T, β2 = (−6, (−3,−3,2,3,0) ⊗ (1,0))T and x = (1,x1,⋯,x10)T, where ⊗ is the Kronecker product and x is simulated from N(0,Ω) with [Ω]st = 1 if s = t and [Ω]st = 0.5 if s ≠ t.

Note that β1 and β2 have different sparsity patterns (e.g., in S1, x2 is significant to the first infection but not to the second infection). This is to emulate the situation where two infections have different sets of significant risk factors. The values of β1 and β2 are chosen in a way such that the prevalence of each infection is about 7%–10%.

Under each setting, we simulate two types of data: individual testing data and the SHL pooled testing data. To do so, we first generate $N = 3000$ sets of individual covariates. Given a set of covariates, we calculate the individual's cell probabilities (the $p_{ij}^{y_1 y_2}$'s) using the specified copula-based multivariate binary regression model, and then generate the true infection statuses for both infections from a multinomial distribution with those cell probabilities. We denote the covariates and the true infection statuses of the $n$th individual by $x_n$ and $\tilde{Y}_n = (\tilde{Y}_{n1}, \tilde{Y}_{n2})^T$, respectively, for $n = 1,\ldots,3000$. Herein, because groups have not been created yet, we use the subscript $n$ instead of $ij$ (as in $\tilde{Y}_{ij}$ and $x_{ij}$). Given the $(\tilde{Y}_n, x_n)$'s, we simulate individual testing data and the SHL pooled testing data. We let $S_{e:k} = S_{p:k} = 0.95$ for $k = 1, 2$. Values other than 0.95 are considered in the supplementary materials.
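Generating the true statuses as described amounts to one multinomial draw per individual once the cell probabilities are available. The R sketch below reuses the hypothetical cell_probs() function from the sketch in Section 3 and is for illustration only.

    # A minimal sketch: draw true statuses (Y1, Y2) for each row of a design
    # matrix X whose first column is all ones.
    simulate_truth <- function(X, beta1, beta2, delta) {
      t(apply(X, 1, function(x) {
        p <- cell_probs(x, beta1, beta2, delta)        # (p00, p10, p01, p11)
        cell <- rmultinom(1, size = 1, prob = p)       # pick one of the four cells
        c(Y1 = cell[2] + cell[4], Y2 = cell[3] + cell[4])
      }))
    }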

Based on the $\tilde{Y}_n$'s, we generate the individual testing outcomes of the $n$th specimen as $T_n = (T_{n1}, T_{n2})^T$, where $T_{nk} \sim \mathrm{Bernoulli}\{S_{e:k}\tilde{Y}_{nk} + (1-S_{p:k})(1-\tilde{Y}_{nk})\}$. We then estimate $(\beta_1, \beta_2, \delta)$ from the $(T_n, x_n)$'s. This estimation procedure is similar to the one outlined in Section 3. We also use a GEM algorithm to compute the MLEs and Louis' method to calculate the observed data information matrix for making large-sample Wald-type inferences. Furthermore, we slightly modify our variable selection method (in Section 4) to accommodate individual testing data. All the details are provided in the supplementary materials. It is worth noting that the $S_{e:k}$'s and $S_{p:k}$'s are not estimable from individual testing data. Hence, with the individual testing data $(T_n, x_n)$'s, we have to assume the true values of the $S_{e:k}$'s and $S_{p:k}$'s are known in order to estimate $(\beta_1, \beta_2, \delta)$.

We generate the SHL pooled testing data from the $\tilde{Y}_n$'s. A common group size is used in our simulations; i.e., $c_j = c$ with $c \in \{2, 5, 10\}$. For a fixed $c$, we randomly assign the 3000 individuals to one of $J = 3000/c$ groups. With the group membership identified, we relabel the $(\tilde{Y}_n, x_n)$'s as $(\tilde{Y}_{ij}, x_{ij})$, where $i = 1,\ldots,c$ and $j = 1,\ldots,J$. The true statuses of the $j$th pool are calculated as $\tilde{Z}_{jk} = \max_i \tilde{Y}_{ijk}$ for $k = 1, 2$. Then we generate the pooled testing outcomes $Z_j = (Z_{j1}, Z_{j2})^T$, where $Z_{jk} \sim \mathrm{Bernoulli}\{S_{e:k}\tilde{Z}_{jk} + (1-S_{p:k})(1-\tilde{Z}_{jk})\}$. As per the SHL pooling protocol, only if $\max(Z_{j1}, Z_{j2}) = 1$ do we generate the retesting outcomes of the $i$th individual in this group, $Y_{ij} = (Y_{ij1}, Y_{ij2})^T$, where $Y_{ijk} \sim \mathrm{Bernoulli}\{S_{e:k}\tilde{Y}_{ijk} + (1-S_{p:k})(1-\tilde{Y}_{ijk})\}$. Collecting all the $Z_j$'s and $Y_{ij}$'s yields the SHL pooled testing data $P$. Note that the number of tests used to obtain $P$ is the sum of $J$ and the number of $Y_{ij}$'s. From $P$ and the $x_{ij}$'s, we estimate $(\beta_1, \beta_2, \delta, S_{e:1}, S_{e:2}, S_{p:1}, S_{p:2})$.
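A sketch of this two-stage data generation for one group follows, again with our own (hypothetical) data layout; applying it to every group and collecting the results yields P and the total number of tests.

    # A minimal sketch of the two-stage outcomes for one group. Ytrue is the
    # c x 2 matrix of true statuses; Se and Sp are length-2 vectors.
    simulate_shl_pool <- function(Ytrue, Se, Sp) {
      Ztrue <- apply(Ytrue, 2, max)                              # true pool status per infection
      Z <- rbinom(2, 1, Se * Ztrue + (1 - Sp) * (1 - Ztrue))     # Stage 1 pooled outcomes
      if (any(Z == 1)) {
        # Stage 2: retest every member of a flagged pool for both infections
        probs <- sweep(Ytrue, 2, Se, "*") + sweep(1 - Ytrue, 2, 1 - Sp, "*")
        Y <- matrix(rbinom(length(probs), 1, probs), nrow = nrow(Ytrue))
        list(Z = Z, Y = Y, tests = 1 + nrow(Ytrue))
      } else {
        list(Z = Z, Y = NULL, tests = 1)
      }
    }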

We repeat the process of generating the $T_n$'s and $P$ 500 times for each $c \in \{2, 5, 10\}$. For each set of individual testing data or SHL pooled testing data, we first treat the diagnosis results for each infection as the true statuses and fit them using our copula-based multivariate binary regression model. The resulting MLE of $\theta_1$ is used as the initial value of $\theta_1$. The initial values of the assay sensitivity and specificity are chosen to be 0.9. We then run our GEM algorithm to compute the MLE and use Louis' method to construct a 95% confidence interval for each unknown parameter (see the last paragraph of Section 3). In addition to the BIC-type shrinkage estimator, we also compute an AIC-type44 and an ERIC-type45 estimator using the tuning parameters selected by minimizing $\mathrm{AIC}(\lambda_1,\lambda_2) = (\hat{\theta}_A - \tilde{\theta}_{A,\lambda})^T\hat{\Sigma}_{AA}(\hat{\theta}_A - \tilde{\theta}_{A,\lambda}) + 2\{df_{1,\lambda} + df_{2,\lambda}\}$ and $\mathrm{ERIC}(\lambda_1,\lambda_2) = (\hat{\theta}_A - \tilde{\theta}_{A,\lambda})^T\hat{\Sigma}_{AA}(\hat{\theta}_A - \tilde{\theta}_{A,\lambda}) + df_{1,\lambda}\log(N/\lambda_1) + df_{2,\lambda}\log(N/\lambda_2)$, respectively. For individual testing data, slightly modified versions are available in the supplementary materials.

To compare the overall performance of the MLE and the three shrinkage estimators, we consider the prediction error, $\mathrm{PE} = N^{-1}\sum_{j=1}^{J}\sum_{i=1}^{c_j}\{\sum_{y_1=0}^{1}\sum_{y_2=0}^{1}(\hat{p}_{ij}^{y_1 y_2} - p_{ij}^{*y_1 y_2})^2\}^{1/2}$, where the $p_{ij}^{*y_1 y_2}$'s are the true cell probabilities and the $\hat{p}_{ij}^{y_1 y_2}$'s are the predicted cell probabilities using an estimator of $(\beta_1, \beta_2, \delta)$. To evaluate the variable selection performance of the shrinkage estimators, we define the selection rate (SR) as the proportion of replications in which the true model is exactly selected by a shrinkage estimator. Results from the 500 replications under S1–S3 are summarized in Tables 1–4.
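For completeness, the prediction error defined above amounts to averaging the Euclidean distance between the predicted and true cell-probability vectors; a minimal R sketch (our own helper):

    # p_hat, p_true: N x 4 matrices of predicted and true cell probabilities.
    prediction_error <- function(p_hat, p_true) {
      mean(sqrt(rowSums((p_hat - p_true)^2)))
    }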

TABLE 1.

Summary statistics of the 500 MLEs obtained under S1, including the sample mean (Mean), the sample standard deviation (SD), the average of the estimated standard errors (SE), and the empirical coverage (EC) of 95% confidence intervals under either individual testing (IT) or the SHL pooling with c = 2, 5, 10. The average number of tests (# of tests) under each protocol is also provided. The prevalences (averaged over 500 repetitions) of the first and second infections are 7.64% and 8.22%, respectively.

IT c = 2 c = 5 c = 10

# tests 3000 2351 2078 2445

Truth Mean(SD) EC(SE) Mean(SD) EC(SE) Mean(SD) EC(SE) Mean(SD) EC(SE)
β10 −5 −5.08(0.36) 0.94(0.37) −5.06(0.29) 0.94(0.29) −5.06(0.31) 0.94(0.29) −5.07(0.34) 0.95(0.32)
β11 −3 −3.05(0.25) 0.96(0.26) −3.03(0.21) 0.94(0.21) −3.04(0.22) 0.94(0.21) −3.04(0.24) 0.95(0.23)
β12 2 2.03(0.27) 0.94(0.27) 2.02(0.24) 0.94(0.24) 2.02(0.25) 0.94(0.24) 2.03(0.26) 0.95(0.25)
β13 0 −0.01(0.24) 0.95(0.23) −0.01(0.22) 0.95(0.21) −0.01(0.22) 0.95(0.21) −0.01(0.22) 0.94(0.21)
β14 0 0.01(0.38) 0.95(0.39) 0.01(0.34) 0.96(0.35) −0.01(0.35) 0.96(0.35) 0.00(0.37) 0.96(0.36)
β15 0 0.00(0.19) 0.96(0.20) 0.00(0.17) 0.97(0.18) 0.00(0.17) 0.97(0.18) 0.00(0.19) 0.94(0.19)

β20 −5 −5.08(0.37) 0.95(0.37) −5.05(0.28) 0.94(0.30) −5.05(0.30) 0.94(0.30) −5.04(0.33) 0.96(0.32)
β21 −3 −3.04(0.26) 0.95(0.26) −3.03(0.21) 0.94(0.22) −3.03(0.22) 0.94(0.21) −3.02(0.24) 0.95(0.23)
β22 0 −0.01(0.24) 0.94(0.23) −0.01(0.21) 0.93(0.21) 0.00(0.22) 0.93(0.21) −0.01(0.23) 0.94(0.21)
β23 3 3.04(0.33) 0.94(0.32) 3.03(0.27) 0.94(0.27) 3.03(0.29) 0.94(0.27) 3.03(0.30) 0.94(0.29)
β24 0 0.00(0.40) 0.95(0.38) 0.02(0.34) 0.95(0.35) 0.01(0.35) 0.95(0.35) 0.00(0.36) 0.96(0.36)
β25 0 0.01(0.20) 0.95(0.20) 0.00(0.18) 0.94(0.18) 0.00(0.18) 0.94(0.18) 0.00(0.19) 0.95(0.19)

δ 0.3 0.28(0.09) 0.97(0.10) 0.29(0.06) 0.95(0.06) 0.29(0.06) 0.95(0.06) 0.29(0.07) 0.95(0.07)

Se:1 0.95  –  – 0.95(0.02) 0.93(0.02) 0.95(0.02) 0.93(0.02) 0.95(0.02) 0.90(0.02)
Se:2 0.95  –  – 0.95(0.01) 0.95(0.01) 0.95(0.02) 0.91(0.01) 0.95(0.02) 0.92(0.02)
Sp:1 0.95  –  – 0.95(0.01) 0.94(0.01) 0.95(0.01) 0.94(0.01) 0.95(0.01) 0.93(0.01)
Sp:2 0.95  –  – 0.95(0.01) 0.94(0.01) 0.95(0.01) 0.93(0.01) 0.95(0.01) 0.93(0.01)

TABLE 4.

The average prediction error PE × 100 and the SR value (in parentheses) of the MLE and the shrinkage estimators under the AIC, BIC, and ERIC tuning parameter criteria over 500 replications under S1–S3, for individual testing (IT) and the SHL pooling with c = 2, 5, and 10. Recall that the SR (selection rate) is defined as the proportion of replications in which the true model is exactly selected by a shrinkage estimator. The highest SR value under each setting is underlined.

IT c = 2 c = 5 c = 10

Setting Estimate PE×100(SR) PE×100(SR) PE×100(SR) PE×100(SR)
S1 MLE 0.148(0.000) 0.126(0.000) 0.130(0.000) 0.142(0.000)
AIC 0.106(0.414) 0.092(0.430) 0.092(0.442) 0.102(0.462)
BIC 0.079(0.910) 0.071(0.908) 0.073(0.926) 0.083(0.898)
ERIC 0.085(0.724) 0.075(0.736) 0.076(0.744) 0.085(0.734)

S2 MLE 0.133(0.000) 0.106(0.000) 0.117(0.000) 0.121(0.000)
AIC 0.095(0.414) 0.074(0.414) 0.084(0.436) 0.087(0.418)
BIC 0.074(0.908) 0.059(0.910) 0.067(0.892) 0.069(0.876)
ERIC 0.084(0.702) 0.064(0.696) 0.074(0.702) 0.074(0.702)

S3 MLE 0.284(0.000) 0.231(0.000) 0.250(0.000) 0.266(0.000)
AIC 0.193(0.266) 0.160(0.294) 0.175(0.274) 0.184(0.298)
BIC 0.158(0.818) 0.130(0.820) 0.145(0.786) 0.153(0.808)
ERIC 0.183(0.428) 0.150(0.448) 0.163(0.420) 0.170(0.448)

Tables 1–3 provide summary statistics of the MLEs for S1–S3, respectively. Under both individual testing and the SHL pooling protocol, the MLEs of the unknown parameters obtained by our GEM algorithm exhibit little, if any, evidence of bias across all considered settings. Regarding the use of Louis' method, we notice that the average standard errors agree well with the sample standard deviations of the estimates. In addition, the empirical coverage probabilities of the 95% confidence intervals are predominantly at the nominal level. These results indicate that the observed data information matrix is estimated correctly via Louis' method.

TABLE 3.

Summary statistics of the 500 MLEs obtained under S3, including the sample mean (Mean), the sample standard deviation (SD), the average of the estimated standard errors (SE), and the empirical coverage (EC) of 95% confidence intervals under either individual testing (IT) or the SHL pooling with c = 2, 5, 10. The average number of tests (# of tests) under each protocol is also provided. The prevalences (averaged over 500 repetitions) of the first and the second infections are 9.97% and 8.54%, respectively.

IT c = 2 c = 5 c = 10

# tests 3000 2508 2337 2701

Truth Mean(SD) EC(SE) Mean(SD) EC(SE) Mean(SD) EC(SE) Mean(SD) EC(SE)
β10 −5 −5.10(0.39) 0.94(0.36) −5.07(0.30) 0.93(0.27) −5.10(0.33) 0.92(0.30) -5.09(0.36) 0.95(0.34)
β11 −2 −2.04(0.22) 0.94(0.21) −2.03(0.18) 0.95(0.17) −2.05(0.20) 0.93(0.18) −2.04(0.20) 0.94(0.20)
β12 0 0.00(0.13) 0.97(0.14) 0.00(0.12) 0.98(0.13) 0.00(0.13) 0.96(0.13) 0.00(0.13) 0.97(0.14)
β13 −2 −2.04(0.22) 0.93(0.21) −2.03(0.18) 0.94(0.17) −2.04(0.19) 0.94(0.18) −2.04(0.21) 0.93(0.20)
β14 0 0.01(0.14) 0.96(0.14) 0.00(0.12) 0.95(0.13) 0.00(0.13) 0.94(0.13) 0.00(0.13) 0.97(0.14)
β15 −2 −2.05(0.22) 0.94(0.21) −2.04(0.18) 0.94(0.17) −2.05(0.19) 0.93(0.18) −2.05(0.21) 0.92(0.19)
β16 0 0.01(0.14) 0.96(0.14) 0.00(0.13) 0.95(0.13) 0.01(0.13) 0.96(0.13) 0.01(0.14) 0.95(0.14)
β17 2 2.04(0.22) 0.95(0.21) 2.03(0.18) 0.93(0.17) 2.05(0.19) 0.93(0.18) 2.04(0.21) 0.94(0.20)
β18 0 0.00(0.14) 0.96(0.14) 0.00(0.12) 0.96(0.13) 0.00(0.13) 0.95(0.13) 0.01(0.14) 0.96(0.14)
β19 2 2.04(0.21) 0.94(0.21) 2.03(0.18) 0.93(0.17) 2.04(0.19) 0.94(0.18) 2.04(0.20) 0.96(0.20)
β110 0 0.00(0.15) 0.94(0.14) 0.00(0.13) 0.96(0.13) 0.01(0.13) 0.95(0.13) 0.00(0.14) 0.95(0.14)

β20 −6 −6.13(0.49) 0.95(0.48) −6.10(0.36) 0.96(0.36) −6.10(0.41) 0.95(0.39) −6.12(0.43) 0.95(0.43)
β21 −3 −3.07(0.30) 0.95(0.29) −3.05(0.24) 0.94(0.24) −3.05(0.26) 0.94(0.25) −3.06(0.27) 0.94(0.27)
β22 0 0.00(0.16) 0.95(0.16) 0.01(0.14) 0.95(0.14) 0.00(0.15) 0.95(0.15) 0.00(0.15) 0.95(0.15)
β23 −3 −3.07(0.30) 0.96(0.29) −3.05(0.24) 0.94(0.24) −3.05(0.26) 0.94(0.25) −3.06(0.27) 0.94(0.27)
β24 0 0.00(0.16) 0.96(0.16) 0.00(0.13) 0.94(0.14) 0.00(0.14) 0.95(0.15) 0.01(0.15) 0.95(0.16)
β25 2 2.04(0.21) 0.97(0.23) 2.04(0.19) 0.96(0.19) 2.04(0.20) 0.95(0.20) 2.04(0.20) 0.95(0.20)
β26 0 0.01(0.17) 0.94(0.16) 0.01(0.15) 0.93(0.14) 0.01(0.15) 0.95(0.15) 0.00(0.16) 0.94(0.16)
β27 3 3.06(0.29) 0.95(0.29) 3.04(0.24) 0.95(0.24) 3.04(0.26) 0.94(0.25) 3.05(0.27) 0.95(0.27)
β28 0 0.01(0.16) 0.96(0.16) 0.00(0.14) 0.96(0.14) 0.00(0.15) 0.95(0.15) 0.01(0.15) 0.96(0.16)
β29 0 0.00(0.16) 0.95(0.16) −0.01(0.14) 0.94(0.14) −0.01(0.15) 0.95(0.15) −0.01(0.16) 0.96(0.15)
β210 0 0.01(0.16) 0.95(0.16) 0.01(0.14) 0.94(0.14) 0.01(0.15) 0.95(0.15) 0.01(0.15) 0.94(0.16)

 δ 0.3 0.29(0.09) 0.98(0.13) 0.28(0.07) 0.99(0.08) 0.29(0.07) 0.98(0.09) 0.29(0.07) 0.99(0.11)

Se:1 0.95  –  – 0.95(0.01) 0.93(0.01) 0.95(0.01) 0.94(0.01) 0.95(0.02) 0.94(0.01)
Se:2 0.95  –  – 0.95(0.01) 0.95(0.01) 0.95(0.02) 0.93(0.01) 0.95(0.02) 0.91(0.02)
Sp:1 0.95  –  – 0.95(0.01) 0.96(0.01) 0.95(0.01) 0.93(0.01) 0.95(0.01) 0.92(0.01)
Sp:2 0.95  –  – 0.95(0.01) 0.96(0.01) 0.95(0.01) 0.94(0.01) 0.95(0.01) 0.93(0.01)

To examine the performance of the variable selection, Table 4 provides the SR (in parentheses) of each shrinkage estimator across all considered settings. One can see that our BIC-type estimator performs best in identifying the true model in each scenario. For example, in S3 with c = 2, the SR using the BIC criterion is 0.820, which is considerably larger than those using the AIC (0.294) and ERIC (0.448) criteria. These results demonstrate the advantage of using the BIC criterion to identify risk factors that are truly relevant for each infection.

Table 4 also provides the average PE×100 values of the MLE and the three shrinkage estimators across all settings. It is clear that all the shrinkage estimators produce smaller prediction errors than the MLE. For example, the BIC-type estimator reduces the prediction error of the MLE by almost 50%. This is because the adaptive LASSO penalty in (7) can eliminate unnecessary risk factors. Furthermore, because our BIC-type estimator outperforms the other two in terms of variable selection, its prediction errors are the smallest under all settings. In conclusion, the BIC-type shrinkage estimator not only provides a high chance of identifying the truly relevant covariates but also yields high prediction accuracy.

Finally, we want to see whether the SHL pooling protocol causes a loss of information and thus compromises regression inference when compared to individual testing. To find the answer, we revisit Tables 1–4, this time focusing on the comparison between individual testing and the SHL pooling. Tables 1–3 provide the average number of tests under each setting. Obviously, the SHL pooling protocol uses fewer tests than individual testing (saving about 16% of the costs). This is an expected and appealing feature of the SHL pooling.26 And we observe more: (i) in Tables 1–3, the standard deviations obtained using the pooled data are uniformly smaller than those obtained using individual testing, suggesting that the SHL pooling provides a less variable MLE; (ii) all the averaged standard errors under the SHL pooling are smaller than those under individual testing, meaning that one could use the SHL pooled testing data to construct narrower confidence intervals while maintaining the same nominal level; (iii) the advantage of pooling also holds when comparing the average PE×100 values in Table 4, indicating that the SHL pooling enables better prediction of an individual's infection probabilities; and (iv) in terms of variable selection, the highest SR value (in Table 4) always occurs at c > 1 under each setting; that is, using the SHL pooled testing data gives a larger chance of identifying the true model. Hence, instead of compromising regression inference, the SHL pooling produces more precise inference. In addition, one must note that these advantages are achieved at a lower cost and with a larger number of parameters to be estimated. This finding could be very encouraging to laboratories that are not using pooling (such as the NPHL).

5.2. A CT/NG screening data set

To further encourage the use of pooling, we analyze a data set collected from the NPHL, which currently uses individual testing for CT/NG screening. We illustrate what benefits for regression could be achieved by switching from individual testing to the two-stage hierarchical pooling used by the SHL. To do so, we first reiterate how the SHL implements the pooling protocol.26 Only female swab specimens are screened using the SHL pooling protocol. The testing is carried out on the TECAN DTS platform with the Aptima Combo 2 assay. The platform is calibrated for a group size of c = 4. The sensitivity and specificity of the assay are Se:1 = 0.942 (Se:2 = 0.992) and Sp:1 = 0.976 (Sp:2 = 0.987) for CT (NG), respectively (Gen-Probe, San Diego).

In 2009, 14530 female swab specimens were tested individually at the NPHL. The employed assay was also the Aptima Combo 2 Assay. We were provided with the diagnosed result of each specimen for CT and NG. Based on these diagnoses, the approximate prevalences of CT and NG are 0.069 and 0.013, respectively. To reveal the benefits of pooling, we mimic the SHL screening practice in the most realistic way. We use a group size of c = 4, as used by the SHL. We then construct pools by assigning specimens according to their arrival time at the NPHL. Because the arrival times of specimens at the NPHL are random, our way of pooling is also random. We treat the diagnoses as "true" statuses and simulate a two-stage group testing data set using the above testing error rates. For comparison, we also simulate an individual testing data set using the same testing error rates. The considered covariates include age, prenatal, symptoms, cervical friability, pelvic inflammatory disease, cervicitis, multiple partners, new partner in the last 90 days, and contact with someone who has an STD. All covariates, except age, are binary. With these covariates on each individual, we first fit the individual diagnosis results, viewing them as the truth. The resulting estimates are used as the "reference" estimates. We then fit the individual testing data and the two-stage group testing data using the regression and variable selection methods previously described. In our analysis, we standardize age and code dichotomous covariates as either −0.5 or 0.5.

Table 5 summarizes the parameter estimates and variable selection results. The estimates from both testing protocols are close to the "reference" estimates, but the SEs under c = 4 are uniformly smaller than those under individual testing. The testing error rates are estimated accurately from the group testing data. In terms of variable selection, the reference shrinkage estimates identified different sets of significant risk factors for the two infections, where prenatal is significant for CT but not for NG. The same results are identified by the three shrinkage estimates based on the group testing data. However, based on the individual testing data, none of the three shrinkage estimates selects prenatal for CT. These comparisons reinforce our conclusion that, in addition to a significant cost reduction (i.e., it saves 14530 − 7737 = 6793 tests), the two-stage pooling protocol leads to more precise inference than individual testing while estimating the testing error rates simultaneously. In addition, we considered randomly assigning individuals to groups as in Section 5.1 and used group sizes varying from 2 to 10. The supplementary materials include these results, which reinforce the aforementioned conclusions on the advantages of the two-stage pooling protocol when compared to individual testing. We believe these numerical findings could encourage more laboratories to consider the two-stage pooling protocol.

TABLE 5.

The NPHL screening data analysis: parameter estimates (MLE), estimated standard errors (SE), and variable selection results (using the AIC, BIC, and ERIC criteria) from the reference estimates (Reference), the individual testing estimates (IT), and the SHL pooling estimates with a group size of 4 (c = 4). The number of tests under each protocol is provided as well.

Reference IT c = 4

number of tests 14530 (IT) 7737 (c = 4)

MLE(SE) AIC BIC ERIC MLE(SE) AIC BIC ERIC MLE(SE) AIC BIC ERIC
CT Intercept −1.382(0.241) −1.528(0.286) −1.269(0.262)
Age −0.559(0.045) −0.535(0.057) −0.561(0.051)
Prenatal 0.390(0.220) 0.141(0.291) × × × 0.480(0.229)
Symptoms 0.356(0.079) 0.324(0.095) 0.356(0.088)
Cervical F 0.065(0.163) −0.058(0.202) 0.003(0.182)
PID 0.443(0.392) 0.443(0.448) 0.492(0.427)
Cervicitis 0.611(0.106) 0.746(0.118) 0.645(0.116)
Multi Partner 0.476(0.099) 0.522(0.116) 0.532(0.109)
New Partner −0.069(0.091) −0.205(0.116) −0.067(0.102)
Contact STD 1.006(0.098) 1.023(0.111) 1.048(0.108)

NG Intercept −2.426(0.416) −2.727(0.595) −2.683(0.507)
Age −0.251(0.083) −0.278(0.112) × −0.258(0.087)
Prenatal 0.283(0.591) 0.003(0.929) −0.073(0.750)
Symptoms 1.202(0.164) 1.176(0.219) 1.234(0.174)
Cervical F 0.277(0.288) 0.290(0.327) 0.270(0.301)
PID 1.032(0.496) 0.719(0.635) 0.879(0.554)
Cervicitis 0.625(0.199) 0.746(0.225) 0.712(0.201)
Multi Partner 1.070(0.177) 0.894(0.216) 1.106(0.185)
New Partner −0.130(0.189) −0.060(0.229) −0.127(0.198)
Contact STD 1.405(0.173) 1.208(0.216) 1.402(0.180)

δ 0.573(0.030) 0.604(0.042) 0.563(0.033)

Se:1 = 0.942 0.922(0.016)
Se:2 = 0.992 0.989(0.029)
Sp:1 = 0.976 0.974(0.004)
Sp:2 = 0.987 0.985(0.002)

6. DISCUSSION

Motivated by the SHL CT/NG screening practice, we have developed a regression method for two-stage hierarchical pooling data. Our proposed technique jointly models the unobserved individual disease statuses and produces interpretable marginal inference for each infection. The assay sensitivity and specificity for each infection can be estimated as well. In addition, we developed a shrinkage estimator that consistently selects the truly relevant risk factors for each infection. To disseminate this work, R code implementing our new methodology is available upon request.

From the simulation studies and the CT/NG screening data analysis, it is exciting to observe that, compared to individual testing, the SHL pooling protocol can significantly reduce costs and yet produce more efficient regression estimators. An interesting future project would be to investigate theoretically how to construct groups so as to obtain the most efficient regression estimators for each infection within a budget limit. Intuitively, individuals with high probabilities of being infected should be tested individually and those with low probabilities could be tested in pools. But what is the criterion to differentiate between high and low probabilities? How can these probabilities be known before the screening? For those tested in pools, what is the optimal pool size for inference? These are interesting but challenging questions to be answered in future work. Possible guidance could be found in References 46 and 17.

In our simulation studies, we used a Gumbel copula, chosen for two reasons: 1) compared to Gaussian copulas, it has an analytic expression, which facilitates computation; 2) it delivers robust estimates of the regression coefficients and misclassification parameters even when the true copula is not Gumbel. To reveal this robustness, we have included a simulation study in the supplementary materials. In practice, users are welcome to choose other copulas, such as the Gaussian, Clayton, or Frank copula.33 In addition, the logit link for the $g_k$'s could be replaced by the probit or complementary log-log link. Our GEM algorithm is general enough to incorporate those choices.

Though this work mainly focuses on two infections, the model can be extended to incorporate more infections. For example, suppose there are three infections, so that $\tilde{Y}_{ij} = (\tilde{Y}_{ij1}, \tilde{Y}_{ij2}, \tilde{Y}_{ij3})^T$. A joint model for $\tilde{Y}_{ij} \mid x_{ij}$ is built by assuming that there exists a random vector $U_{ij} = (U_{ij1}, U_{ij2}, U_{ij3})^T$, whose distribution function is a three-dimensional copula $C(u_1, u_2, u_3 \mid \delta)$, such that the event $\{\tilde{Y}_{ijk} = 1 \mid x_{ij}\}$ is equivalent to $\{U_{ijk} \leq g_k(x_{ij}^T\beta_k)\}$ for $k = 1, 2, 3$. Consequently, the marginal regression model (1) naturally holds for each disease, and the cell probabilities of $\tilde{Y}_{ij} \mid x_{ij}$ can be calculated in terms of $C$; e.g., $\mathrm{pr}(\tilde{Y}_{ij1}=1, \tilde{Y}_{ij2}=1, \tilde{Y}_{ij3}=0 \mid x_{ij}) = C\{g_1(x_{ij}^T\beta_1), g_2(x_{ij}^T\beta_2), 1 \mid \delta\} - C\{g_1(x_{ij}^T\beta_1), g_2(x_{ij}^T\beta_2), g_3(x_{ij}^T\beta_3) \mid \delta\}$. Our GEM algorithm can be generalized to incorporate more than two infections as well. We omit the details but include some simulation results in the supplementary materials to demonstrate this generalizability.
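As a small illustration of this extension, the R sketch below evaluates one three-infection cell probability under an exchangeable Gumbel-type three-dimensional copula; the choice of this particular copula is our own assumption, since the extension only requires some three-dimensional C.

    # A minimal sketch (illustrative copula choice, not prescribed by the paper).
    gumbel3 <- function(u, delta) exp(-sum((-log(u))^(1 / delta))^delta)
    cell_prob_110 <- function(x, beta1, beta2, beta3, delta) {
      g <- plogis(c(sum(x * beta1), sum(x * beta2), sum(x * beta3)))
      # pr(Y1 = 1, Y2 = 1, Y3 = 0 | x) = C(g1, g2, 1) - C(g1, g2, g3)
      gumbel3(c(g[1], g[2], 1), delta) - gumbel3(g, delta)
    }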

Lastly, we discuss the three assumptions (Assumptions 1–3) on the assay sensitivity and specificity and possible ways to relax them. For Assumption 1, when the assay uses the concentration level of a specific biological marker (biomarker) to make a diagnosis, mixing a positive specimen with negative ones could dilute the concentration and thus change the assay sensitivity and specificity significantly as the group size changes. This "dilution effect" can be taken into consideration if the distribution of the biomarker concentration is provided in advance.47,48 To relax Assumption 2, one could use a multinomial distribution to account for the cross-disease dependence of the testing outcomes given the true statuses. The number of misclassification parameters then increases from 4 to 12 when the number of diseases is two. One could modify the GEM algorithm to estimate the twelve parameters along with the regression; however, some of these parameters may require an impractically large sample size to be estimated accurately. The last assumption can be relaxed by assuming a covariate-adjusted model for the misclassification parameters.49 But caution must be taken regarding model identifiability when the covariate-adjusted misclassification parameters are to be estimated along with the regression.


TABLE 2.

Summary statistics of the 500 MLEs obtained under S2, including the sample mean (Mean), the sample standard deviation (SD), the average of the estimated standard errors (SE), and the empirical coverage (EC) of 95% confidence intervals under either individual testing (IT) or the SHL pooling with c = 2, 5, 10. The average number of tests (# of tests) under each protocol is also provided. The prevalences (averaged over 500 repetitions) of the first and the second infections are 6.77% and 9.98%, respectively.

IT c = 2 c = 5 c = 10

# tests 3000 2493 2312 2678

Truth Mean(SD) EC(SE) Mean(SD) EC(SE) Mean(SD) EC(SE) Mean(SD) EC(SE)
β10 −4 −4.05(0.25) 0.95(0.26) −4.02(0.20) 0.95(0.19) −4.03(0.19) 0.97(0.20) −4.03(0.21) 0.96(0.21)
β11 −2 −2.03(0.19) 0.96(0.20) −2.02(0.15) 0.96(0.15) −2.03(0.16) 0.96(0.16) −2.02(0.17) 0.95(0.17)
β12 2 2.03(0.19) 0.94(0.20) 2.02(0.16) 0.95(0.16) 2.02(0.16) 0.96(0.16) 2.02(0.17) 0.95(0.17)
β13 0 0.00(0.13) 0.95(0.14) 0.00(0.12) 0.95(0.12) 0.00(0.12) 0.96(0.12) 0.00(0.12) 0.95(0.13)
β14 0 0.00(0.13) 0.96(0.14) 0.00(0.12) 0.95(0.12) 0.00(0.12) 0.96(0.12) −0.01(0.13) 0.95(0.13)
β15 0 0.00(0.14) 0.95(0.14) 0.00(0.11) 0.96(0.12) 0.00(0.12) 0.95(0.12) 0.00(0.13) 0.95(0.13)

β20 −5 −5.06(0.36) 0.94(0.35) −5.04(0.26) 0.96(0.27) −5.03(0.28) 0.97(0.29) −5.04(0.32) 0.94(0.33)
β21 −2 −2.04(0.20) 0.95(0.20) −2.03(0.16) 0.97(0.17) −2.02(0.17) 0.97(0.17) −2.03(0.19) 0.95(0.19)
β22 0 0.01(0.13) 0.97(0.13) 0.00(0.12) 0.95(0.12) 0.01(0.12) 0.96(0.12) 0.01(0.13) 0.95(0.13)
β23 −2 −2.04(0.20) 0.95(0.20) −2.03(0.17) 0.96(0.17) −2.02(0.17) 0.95(0.17) −2.02(0.19) 0.94(0.18)
β24 0 0.01(0.14) 0.93(0.13) 0.01(0.12) 0.93(0.12) 0.00(0.12) 0.94(0.12) 0.00(0.13) 0.95(0.13)
β25 0 0.01(0.13) 0.95(0.13) 0.01(0.12) 0.95(0.12) 0.00(0.12) 0.95(0.12) 0.01(0.12) 0.96(0.13)

δ 0.3 0.30(0.08) 0.99(0.11) 0.30(0.06) 0.97(0.07) 0.30(0.07) 0.97(0.07) 0.30(0.08) 0.97(0.08)

Se:1 0.95  –  – 0.95(0.02) 0.93(0.02) 0.95(0.02) 0.93(0.02) 0.95(0.02) 0.92(0.02)
Se:2 0.95  –  – 0.95(0.02) 0.95(0.01) 0.95(0.02) 0.92(0.01) 0.95(0.02) 0.91(0.02)
Sp:1 0.95  –  – 0.95(0.01) 0.97(0.01) 0.95(0.01) 0.95(0.01) 0.95(0.01) 0.92(0.01)
Sp:2 0.95  –  – 0.95(0.01) 0.92(0.01) 0.95(0.01) 0.93(0.01) 0.95(0.01) 0.92(0.01)

ACKNOWLEDGMENTS

This article was supported by Grant R03 AI135614 from the National Institutes of Health. The authors thank an Associate Editor and three reviewers for their constructive comments, which led to a better presentation of this work, and Drs. Joshua M. Tebbs and Christopher S. McMahan for their insightful comments. The data that support the findings of this study are available on request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions.

Footnotes

SUPPORTING INFORMATION

Supplementary materials are available along with the submission. These materials contain:

  • a numerical comparison showing that ignoring the retesting outcomes could inflate the variance of estimators of the regression coefficients in Section 2;

  • the observed log-likelihood function and a numerical study showing the computational advantages of the GEM algorithm;

  • detailed derivations of the E-step and the observed data information matrix introduced in Section 3;

  • additional numerical results for other values of the Se:k's and Sp:k's (Section 5.1);

  • extensions of our method to fit individual testing data as discussed in Section 5.1;

  • additional results of the real data analysis in Section 5.2;

  • simulation studies that reveal the robustness of the Gumbel copula and demonstrate the generalizability of our method to more than two infections in Section 6.

References

  • 1.Centers for Disease Control and Prevention. 2016 STD surveillance report https://www.cdc.gov/std/stats16/default.htm; Last accessed April, 2018.
  • 2.Lewis JL, Lockary VM, Kobic S. Cost savings and increased efficiency using a stratified specimen pooling strategy for Chlamydia trachomatis and Neisseria gonorrhoeae. Sexually Transmitted Diseases 2012;39(1):46–48. [DOI] [PubMed] [Google Scholar]
  • 3.Samoff E, Koumans EH, Markowitz LE, et al. Association of Chlamydia trachomatis with persistence of high-risk types of human papillomavirus in a cohort of female adolescents. American Journal of Epidemiology 2005;162(7):668–675. [DOI] [PubMed] [Google Scholar]
  • 4.Centers for Disease Control and Prevention. STDs & Infertility https://www.cdc.gov/std/infertility/default.htm; Last accessed April, 2018.
  • 5.Jirsa S Pooling specimens: A decade of successful cost savings. In: National STD Prevention Conference; 2008. [Google Scholar]
  • 6.Dorfman R The detection of defective members of large populations. The Annals of Mathematical Statistics 1943;14(4):436–440. [Google Scholar]
  • 7.Stramer SL, Krysztof DE, Brodsky JP, et al. Comparative analysis of triplex nucleic acid test assays in United States blood donors. Transfusion 2013;53:2525–2537. [DOI] [PubMed] [Google Scholar]
  • 8.Edouard S, Prudent E, Gautret P, Memish ZA, Raoult D. Cost-effective pooling of DNA from nasopharyngeal swab samples for large-scale detection of bacteria by real-time PCR. Journal of Clinical Microbiology 2015;53(3):1002–1004. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Hill JA, Hall Sedlak R, Magaret A, et al. Efficient identification of inherited chromosomally integrated human herpesvirus 6 using specimen pooling. Journal of Clinical Virology 2016;77:71–76. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Gastwirth JL. The efficiency of pooling in the detection of rare mutations. American Journal of Human Genetics 2000;67(4):1036–1039. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Zanzi CA, Johnson WO, Thurmond MC, Hietala SK. Pooled-sample testing as a herd-screening tool for detection of bovine viral diarrhea virus persistently infected cattle. Journal of Veterinary Diagnostic Investigation 2000;12(3):195–203. [DOI] [PubMed] [Google Scholar]
  • 12.Venette RC, Moon RD, Hutchison WD. Strategies and statistics of sampling for rare individuals. Annual Review of Entomology 2002;47(1):143–174. [DOI] [PubMed] [Google Scholar]
  • 13.Dodd RY, Notari EP 4th, Stramer SL. Current prevalence and incidence of infectious disease markers and estimated window-period risk in the American Red Cross donor population. Transfusion 2002;42(8):975–979. [DOI] [PubMed] [Google Scholar]
  • 14.Remlinger KS, Hughes-Oliver JM, Young SS, Lam RL. Statistical design of pools using optimal coverage and minimal collision. Technometrics 2006;48(1):133–143. [Google Scholar]
  • 15.Kim HY, Hudgens MG, Dreyfuss JM, Westreich DJ, Pilcher CD. Comparison of group testing algorithms for case identification in the presence of test error. Biometrics 2007;63(4):1152–1163. [DOI] [PubMed] [Google Scholar]
  • 16.Liu A, Liu C, Zhang Z, Albert PS. Optimality of group testing in the presence of misclassification. Biometrika 2012;99(1):245–251. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Huang SH, Huang MN Lo, Shedden K, Wong WK. Optimal group testing designs for estimating prevalence with uncertain testing errors. Journal of the Royal Statistical Society: Series B 2017;79(5):1547–1563. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Vansteelandt S, Goetghebeur E, Verstraeten T. Regression models for disease prevalence with diagnostic tests on pools of serum samples. Biometrics 2000;56(4):1126–1133. [DOI] [PubMed] [Google Scholar]
  • 19.Xie M Regression analysis of group testing samples. Statistics in Medicine 2001;20(13):1957–1969. [DOI] [PubMed] [Google Scholar]
  • 20.Chen P, Tebbs JM, Bilder CR. Group testing regression models with fixed and random effects. Biometrics 2009;65(4):1270–1278. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.McMahan CS, Tebbs JM, Hanson TE, Bilder CR. Bayesian regression for group testing data. Biometrics 2017;73:1443–1452. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Delaigle A, Meister A. Nonparametric regression analysis for group testing data. Journal of the American Statistical Association 2011;106(494):640–650. [Google Scholar]
  • 23.Delaigle A, Hall P, Wishart JR. New approaches to nonparametric and semiparametric regression for univariate and multivariate group testing data. Biometrika 2014;101(3):567–585. [Google Scholar]
  • 24.Xiao X, Zhai J, Zeng J, Tian C, Wu H, Yu Y. Comparative evaluation of a triplex nucleic acid test for detection of HBV DNA, HCV RNA, and HIV-1 RNA, with the Procleix Tigris System. Journal of Virological Methods 2013;187(2):357–361. [DOI] [PubMed] [Google Scholar]
  • 25.Hughes-Oliver JM, Rosenberger WF. Efficient estimation of the prevalence of multiple rare traits. Biometrika 2000;87(2):315–327. [Google Scholar]
  • 26.Tebbs JM, McMahan CS, Bilder CR. Two-stage hierarchical group testing for multiple infections with application to the Infertility Prevention Project. Biometrics 2013;69(4):1064–1073. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Warasi MS, Tebbs JM, McMahan CS, Bilder CR. Estimating the prevalence of multiple diseases from two-stage hierarchical pooling. Statistics in Medicine 2016;35(21):3851–3864. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Li Q, Liu A, Xiong W. D-optimality of group testing for joint estimation of correlated rate diseases with misclassification. Statistica Sinica 2017;27(2):823–838. [Google Scholar]
  • 29.Zhang B, Bilder CR, Tebbs JM. Regression analysis for multiple-disease group testing data. Statistics in Medicine 2013;32(28):4954–4966. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Wu CJ. On the convergence properties of the EM algorithm. The Annals of Statistics 1983;11:95–103. [Google Scholar]
  • 31.Neal RM, Hinton GE. A view of the EM algorithm that justifies incremental, sparse, and other variants. In: Learning in Graphical Models. Springer; 1998:355–368. [Google Scholar]
  • 32.Gregory KB, Wang D, McMahan CS. Adaptive elastic net for group testing. Biometrics, in press. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Nelsen RB. An introduction to copulas. Springer Science & Business Media; 2007. [Google Scholar]
  • 34.Lehmann EL. Theory of point estimation. Pacific Grove, CA: Wadsworth and Brooks/Cole; 1983. [Google Scholar]
  • 35.Louis TA. Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society: Series B 1982;44:226–233. [Google Scholar]
  • 36.Zou H The adaptive lasso and its oracle properties. Journal of the American Statistical Association 2006;101(476):1418–1429. [Google Scholar]
  • 37.Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. The Annals of Statistics 2004;32(2):407–499. [Google Scholar]
  • 38.Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 2001;96(456):1348–1360. [Google Scholar]
  • 39.Wang H, Leng C. Unified LASSO estimation by least squares approximation. Journal of the American Statistical Association 2007;102(479):1039–1048. [Google Scholar]
  • 40.Schwarz G Estimating the dimension of a model. The Annals of Statistics 1978;6(2):461–464. [Google Scholar]
  • 41.Wang H, Li B, Leng C. Shrinkage tuning parameter selection with a diverging number of parameters. Journal of the Royal Statistical Society: Series B 2009;71(3):671–683. [Google Scholar]
  • 42.Minnier J, Tian L, Cai T. A perturbation method for inference on regularized regression estimates. Journal of the American Statistical Association 2011;106(496):1371–1382. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Gumbel EJ. Bivariate exponential distributions. Journal of the American Statistical Association 1960;55(292):698–707. [Google Scholar]
  • 44.Akaike H A new look at the statistical model identification. IEEE Transactions on Automatic Control 1974;19(6):716–723. [Google Scholar]
  • 45.Hui FK, Warton DI, Foster SD. Tuning parameter selection for the adaptive LASSO using ERIC. Journal of the American Statistical Association 2015;110(509):262–269. [Google Scholar]
  • 46.McMahan CS, Tebbs JM, Bilder CR. Informative Dorfman screening. Biometrics 2012;68(1):287–296. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Wang D, McMahan CS, Gallagher CM. A general parametric regression framework for group testing data with dilution effects. Statistics in Medicine 2015;34(27):3606–3621. [DOI] [PubMed] [Google Scholar]
  • 48.Wang D, McMahan CS, Tebbs JM, Bilder CR. Group testing case identification with biomarker information. Computational Statistics & Data Analysis 2018;122:156–166. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Janes H, Pepe MS. Adjusting for covariates in studies of diagnostic, screening, or prognostic markers: an old concept in a new setting. American Journal of Epidemiology 2008;168(1):89–97. [DOI] [PubMed] [Google Scholar]
