Author manuscript; available in PMC: 2020 Oct 15.
Published in final edited form as: Stat Med. 2019 Jul 11;38(23):4519–4533. doi: 10.1002/sim.8311

Regression analysis and variable selection for two-stage multiple-infection group testing data

Juexin Lin 1, Dewei Wang 1,*, Qi Zheng 2
PMCID: PMC6736686  NIHMSID: NIHMS1037528  PMID: 31297869

Abstract

Group testing, as a cost-effective strategy, has been widely used to perform large-scale screening for rare infections. Recently, the use of multiplex assays has transformed the goal of group testing from detecting a single disease to diagnosing multiple infections simultaneously. Existing research on multiple-infection group testing data either excludes individual covariate information or ignores possible retests on suspicious individuals. To incorporate both, we propose a new regression model. This new model allows us to perform a regression analysis for each infection using multiple-infection group testing data. Furthermore, we introduce an efficient variable selection method to reveal truly relevant risk factors for each disease. Our methodology also allows for the estimation of the assay sensitivity and specificity when they are unknown. We examine the finite sample performance of our method through extensive simulation studies and apply it to a chlamydia and gonorrhea screening data set to illustrate its practical usefulness.

Keywords: adaptive LASSO, multiplex assay, pooled testing, sensitivity, specificity

1. INTRODUCTION

1.1. Motivation

This article is motivated by the annual Chlamydia trachomatis (CT) and Neisseria gonorrhoeae (NG) screening practice conducted by the State Hygienic Laboratory (SHL) in Iowa. CT and NG are two of the most common notifiable sexually transmitted diseases (STDs) in the United States; over two million cases were reported to the Centers for Disease Control and Prevention (CDC) in 2016.1 Both infections are commonly asymptomatic in women. If left untreated, they can cause pelvic inflammatory disease and further lead to tubal infertility, ectopic pregnancy, or chronic pelvic pain.2 In addition, both diseases can facilitate the transmission of HIV and human papillomavirus infection.3 Concerned by these severe sequelae, the CDC continually supports nationwide CT/NG screening and recommends annual CT/NG screening for all sexually active women under 25 years old.4

In this nationwide screening practice, specimens (swab or urine) are collected across each state and shipped to major state laboratories to be tested. Due to different budgets, laboratories conduct the screening differently. For example, the Nebraska Public Health Laboratory (NPHL) uses a traditional individual testing protocol which tests individual specimens one-by-one. The SHL tests male specimens and female urine specimens individually, but tests female swab specimens according to a two-stage pooling protocol:

The SHL Pooling Protocol

  • Individual swab specimens are randomly assigned to non-overlapping groups of size four. A pool is constructed by mixing individual specimens in the same group.

  • Stage 1: Each pool is tested for CT and NG simultaneously using a multiplex assay. If a pool tests negative for both infections, all the involved individuals are diagnosed as negative for each infection with no additional tests; otherwise, the protocol proceeds to the next stage.

  • Stage 2: Swabs of individuals in pools that test positive for either infection are retested separately using the same multiplex assay for final diagnosis.

The most practical reason for using pooling is cost reduction. When a pool tests negative for both infections, four individuals are diagnosed at the expense of one assay. Since switching from individual testing to pooling in 1999, Iowa has saved over $2.2 million in CT/NG screening.5
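As a rough illustration of this saving, the expected number of tests per individual under a two-stage protocol is easy to work out when individuals are treated as independent and, for simplicity, the assay as perfect. The R sketch below is ours, and the prevalence values in it are illustrative only, not SHL figures.

    # Expected number of tests per individual for a pool of size c, assuming
    # independent individuals and a perfect assay; prevalences are illustrative.
    expected_tests_per_individual <- function(c, p1, p2) {
      p_pool_negative <- ((1 - p1) * (1 - p2))^c      # pool free of both infections
      (1 + c * (1 - p_pool_negative)) / c             # one pooled test, plus c retests if flagged
    }
    expected_tests_per_individual(4, 0.07, 0.01)      # roughly 0.53 tests per person, vs 1 under individual testing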

As per the screening guidelines, many risk factors are collected as well, such as age, number of partners, and any symptoms of the infections. A motivating question is how to incorporate this covariate information so that one can identify the truly relevant risk factors for each infection and understand their effects. Challenges arise from the use of the multiplex Aptima Combo 2 Assay (Gen-Probe, San Diego), an imperfect discriminatory test that produces diagnoses for both diseases simultaneously. Because the assay is imperfect, it is possible to observe discrepancies between the testing outcomes of the two stages, as shown in Figure 1. Whenever a discrepancy occurs, the SHL ignores the pooled-level results from Stage 1 and makes the diagnosis solely based on the individual testing from Stage 2. However, when the objective is probing the impact of risk factors rather than case identification, disregarding testing outcomes from any stage could impair the estimation. It is important to seamlessly incorporate outcomes from both stages. Towards this goal, we need to account for how likely it is that the retests were triggered by either infection.

FIGURE 1.


A possible set of the SHL pooled testing outcomes from a group of 4 individuals: the rectangle with rounded corners represents the pooled specimen that is constructed by mixing the 4 individual specimens (in circles) together. The pool tested negative for CT (i.e., CT = 0) but positive for NG (i.e., NG = 1). As per the SHL pooling protocol, the positivity of NG triggered the second stage of screening for both infections. Due to testing errors, we see a discrepancy between the two stages; i.e., the fourth individual retested positive for CT but the pool tested negative for CT.

1.2. Literature review

Pooled testing (also known as group testing) was initially proposed to screen for syphilis among World War II American army recruits.6 Since this seminal work, pooling techniques have been successfully implemented to screen for many other infectious diseases, including HIV, HBV, and HCV,7 influenza,8 and herpes.9 Besides disease screening, many other areas, including genetics,10 veterinary science,11 medical entomology,12 blood safety,13 and drug discovery,14 have also used the method of pooling. Statistical research in group testing has primarily focused on improving the diagnostic accuracy and cost-saving ability of a pooling protocol15 or on estimating individual-level characteristics from pooled testing data. This article falls into the latter category. When group testing data involve only a single infection, the research on estimation started with estimating a disease prevalence.16,17 This research avenue was then expanded to incorporate individual covariate information through the use of parametric regression models, such as generalized linear models,18,19 mixed models,20 and Bayesian regression models.21 Semiparametric and nonparametric regression methods have also been developed.22,23 However, all of these works are limited to one infection.

The use of multiplex assays has made pooled testing data with multiple infections widely available. For example, in addition to CT/NG, HIV/syphilis, HIV/HCV, or HIV/HBV/HCV can be detected simultaneously.24 In the statistical literature, research on estimation with multiple-infection group testing data is scarce. A few works have studied the estimation of disease prevalence.25,26,27,28 Regression analysis for this type of data remains largely unexplored. To the best of our knowledge, the only existing work is an approach based on generalized estimating equations.29 However, it did not consider retesting outcomes arising from the second stage of screening and thus does not apply to the SHL screening practice.

When using an imperfect assay, the values of the assay sensitivity and specificity are crucial for estimation in pooled testing. Most of the aforementioned literature assumed that preliminary studies were available to provide these misclassification parameters. However, this assumption could be impractical because the preliminary study might have used unrepresentative samples.17 If inaccurate values of the assay sensitivity and specificity were used for estimation, inference could be compromised. In this article, we treat the testing error rates as unknown and estimate them from the data along with the regression coefficients.

Existing literature has not considered the combination of incorporating retesting results into regression and estimating misclassification parameters in the context of multiple-infection group testing. Only one Bayesian work has provided inference for disease prevalence and estimates of assay sensitivity and specificity without consideration of individual covariates.27 In this article, we propose a copula-based multivariate binary regression model to incorporate the covariates. We introduce a generalized expectation-maximization (GEM) algorithm to facilitate the numerical computation of the maximum likelihood estimates (MLEs) of the regression coefficients and misclassification parameters. When compared to the traditional EM algorithm, the GEM only requires the maximization step to search for an increase in the objective function rather than achieving the maximum.30,31 This feature greatly accelerates the computation of the MLE.

In addition, we provide a variable selection technique that can identify the truly relevant risk factors for each infection. A recent work introduced a regularized regression technique for group testing,32 but it is for a single infection. Our work is designed to allow for multiple infections. We believe that a package of regression, estimation of misclassification parameters, and variable selection provides a useful toolbox for epidemiological studies of CT and NG based on group testing data.

The rest of the article is organized as follows. In Section 2, we propose a new copula-based regression model for multiple-infection group testing data. In Section 3, we introduce the GEM algorithm that accelerates the computation of the MLE. Section 4 presents a variable selection method that can identify important risk factors for each infection. In Section 5.1, we use simulation to illustrate that, with fewer tests, the SHL pooling protocol can lead to more efficient regression estimates, better prediction of infection probabilities, and more accurate variable selection than traditional individual testing. These advantages are further demonstrated by analyzing a CT/NG screening data set in Section 5.2. Section 6 presents a discussion of this work. All technical details and additional numerical results are relegated to the supplementary materials.

2. MODEL

Suppose $N$ individuals are to be tested. We randomly assign each individual to one of $J$ groups, each of size $c_j$; i.e., $N = \sum_{j=1}^{J} c_j$. For generality, we allow the group size $c_j$ to vary across groups. Motivated by the CT/NG screening practice, we mainly consider two infections; Section 6 discusses an extension to more than two diseases. The true infection statuses of the $i$th individual in the $j$th group are denoted by a binary vector $\tilde{Y}_{ij} = (\tilde{Y}_{ij1}, \tilde{Y}_{ij2})^T$, where $\tilde{Y}_{ijk} = 1$ if the individual is positive for the $k$th infection and $\tilde{Y}_{ijk} = 0$ otherwise, for $i = 1,\ldots,c_j$, $j = 1,\ldots,J$, and $k = 1, 2$. Denote the covariates (risk factors and an intercept term) of the $i$th individual in the $j$th group by a $(p+1)$-dimensional vector $x_{ij} = (1, x_{ij1},\ldots,x_{ijp})^T$. We assume that the $\tilde{Y}_{ij} \mid x_{ij}$'s are independent across $ij$ and that $\tilde{Y}_{ijk}$ is related to the linear predictor $x_{ij}^T\beta_k$ via

\mathrm{pr}(\tilde{Y}_{ijk} = 1 \mid x_{ij}) = g_k(x_{ij}^T\beta_k), \quad \text{for } k = 1, 2,  (1)

where $\beta_k = (\beta_{k0}, \beta_{k1},\ldots,\beta_{kp})^T$ is a vector of $(p+1)$ regression coefficients to be estimated and $g_k$ is a user-chosen known link function (e.g., the inverse of the logit or probit link). One could use different links for different infections. Equation (1) builds the marginal probability models of the random vector $\tilde{Y}_{ij} \mid x_{ij}$.

In pooled testing, the true infection statuses are often latent due to pooling and potential misclassification. In each group, individual specimens are mixed together to form a pool. We denote the true status of the $j$th pool by $\tilde{Z}_j = (\tilde{Z}_{j1}, \tilde{Z}_{j2})^T$, where $\tilde{Z}_{jk} = \max\{\tilde{Y}_{ijk} : i = 1,\ldots,c_j\}$; i.e., $\tilde{Z}_{jk} = 1$ if the pool involves at least one individual who is positive for the $k$th infection, and $\tilde{Z}_{jk} = 0$ otherwise. With the use of an imperfect assay, both the $\tilde{Y}_{ij}$'s and the $\tilde{Z}_j$'s are latent. The observed data are the testing outcomes from the imperfect multiplex assay. Pools are tested in Stage 1. We denote the testing outcomes of the $j$th pool by $Z_j = (Z_{j1}, Z_{j2})^T$, where $Z_{jk} = 1(0)$ if the pool tests positive (negative) for the $k$th infection. If $Z_j = (0,0)^T$, then $Z_j$ is the only observed test response for the $j$th group of individuals. Otherwise, those individuals are tested separately in Stage 2. We denote by $Y_{ij} = (Y_{ij1}, Y_{ij2})^T$ the retesting outcome of the $i$th individual in the $j$th group; i.e., $Y_{ijk} = 1(0)$ if the individual retests positive (negative) for the $k$th infection. Note that the $Y_{ij}$'s can only be observed if $Z_j \neq (0,0)^T$. In summary, the observed testing outcomes from the $j$th group, denoted by $P_j$, take one of two forms: either $Z_j = (0,0)^T$, or $Z_j \in \{(1,0)^T, (0,1)^T, (1,1)^T\}$ together with $Y_{1j},\ldots,Y_{c_j j}$.

The discrepancy between true statuses and testing outcomes is often measured by the assay sensitivity and specificity. Denote by $S_{e:k}$ and $S_{p:k}$ the assay sensitivity and specificity, respectively, for the $k$th infection. In practice, an assay used for large-scale screening is often imperfect; we let the $S_{e:k}$'s and $S_{p:k}$'s lie in $(0,1)$. Our methodology posits three assumptions on these misclassification parameters. Assumption 1 is that the $S_{e:k}$'s and $S_{p:k}$'s do not depend on the group size; e.g., $S_{e:k} = \mathrm{pr}(Z_{jk}=1 \mid \tilde{Z}_{jk}=1) = \mathrm{pr}(Y_{ijk}=1 \mid \tilde{Y}_{ijk}=1)$ and $S_{p:k} = \mathrm{pr}(Z_{jk}=0 \mid \tilde{Z}_{jk}=0) = \mathrm{pr}(Y_{ijk}=0 \mid \tilde{Y}_{ijk}=0)$ hold for all $i$, $j$, and $k$. Assumption 2 assumes that, conditional on the true statuses of the specimens being tested, testing responses are independent of each other and also across infections. Assumption 3 further assumes that, given the true statuses, testing responses are independent of the covariates; e.g., $\mathrm{pr}(Z_{j1}=0, Z_{j2}=1, Y_{ij1}=1, Y_{ij2}=0 \mid \tilde{Z}_{j1}=0, \tilde{Z}_{j2}=0, \tilde{Y}_{ij1}=1, \tilde{Y}_{ij2}=1, x_{ij}) = \mathrm{pr}(Z_{j1}=0 \mid \tilde{Z}_{j1}=0)\,\mathrm{pr}(Z_{j2}=1 \mid \tilde{Z}_{j2}=0)\,\mathrm{pr}(Y_{ij1}=1 \mid \tilde{Y}_{ij1}=1)\,\mathrm{pr}(Y_{ij2}=0 \mid \tilde{Y}_{ij2}=1) = S_{p:1}(1-S_{p:2})S_{e:1}(1-S_{e:2})$. All of these assumptions are standard in the group testing literature (see most references in Section 1.2). In practice, one may need to conduct proper assay calibration to ensure the applicability of these assumptions.

Our primary goal is to estimate the $\beta_k$'s, $S_{e:k}$'s, and $S_{p:k}$'s. Towards this goal, we want to incorporate the retesting outcomes for two main reasons: 1) ignoring the retesting outcomes could severely inflate the variance of the estimators of the $\beta_k$'s (see the supplementary materials for a numerical illustration); 2) including the retesting outcomes gives us repeated measurements (i.e., many specimens are tested in pools and also individually), which provide valuable information for estimating the misclassification parameters. To seamlessly incorporate all retesting outcomes, we propose a copula-based multivariate binary regression model. We assume that there exists a vector of standard uniform random variables, $U_{ij} = (U_{ij1}, U_{ij2})^T$, such that the event $\{\tilde{Y}_{ijk} = 1 \mid x_{ij}\}$ is equivalent to $\{U_{ijk} \leq g_k(x_{ij}^T\beta_k)\}$, where the $U_{ij}$'s are independent and follow a bivariate copula.33 Denote the chosen copula by $C\{u_1, u_2 \mid \delta\}$, where $u_1, u_2 \in (0,1)$; the copula is known up to a parameter $\delta$ (which could be a vector). Then the marginal regression models in (1) naturally hold, and the co-infection probability is

\mathrm{pr}(\tilde{Y}_{ij1}=1, \tilde{Y}_{ij2}=1 \mid x_{ij}) = C\{g_1(x_{ij}^T\beta_1), g_2(x_{ij}^T\beta_2) \mid \delta\}.  (2)

Combining (1) and (2) defines our joint probability model for $\tilde{Y}_{ij} \mid x_{ij}$.

3. ESTIMATION

We maximize the likelihood function to obtain our estimators of the $\beta_k$'s, $S_{e:k}$'s, $S_{p:k}$'s, and $\delta$. For notational simplicity, we write $\theta_1 = (\beta_1^T, \beta_2^T, \delta)^T$, $\theta_2 = (S_{e:1}, S_{e:2}, S_{p:1}, S_{p:2})^T$, and $\theta = (\theta_1^T, \theta_2^T)^T$. Furthermore, we denote by $p_{ij}^{y_1 y_2}(\theta_1)$ the cell probability $\mathrm{pr}(\tilde{Y}_{ij1}=y_1, \tilde{Y}_{ij2}=y_2 \mid x_{ij})$ defined by (1) and (2) under $\theta_1$, for $y_1, y_2 \in \{0,1\}$, $i = 1,\ldots,c_j$, and $j = 1,\ldots,J$. Then $p_{ij}^{11}(\theta_1) = C\{g_1(x_{ij}^T\beta_1), g_2(x_{ij}^T\beta_2) \mid \delta\}$, $p_{ij}^{10}(\theta_1) = g_1(x_{ij}^T\beta_1) - p_{ij}^{11}(\theta_1)$, $p_{ij}^{01}(\theta_1) = g_2(x_{ij}^T\beta_2) - p_{ij}^{11}(\theta_1)$, and $p_{ij}^{00}(\theta_1) = 1 - p_{ij}^{11}(\theta_1) - p_{ij}^{10}(\theta_1) - p_{ij}^{01}(\theta_1)$. In the supplementary materials, we derive an expression for the log-likelihood function $\ell(\theta \mid P, X)$, where $P$ and $X$ denote the collections of the $P_j$'s and $x_{ij}$'s, respectively. However, due to the complexity of $\ell(\theta \mid P, X)$, a direct maximization could be time-consuming. The supplementary materials include a numerical illustration of this disadvantage.
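For concreteness, the cell probabilities above are straightforward to compute once the links and the copula are chosen. The minimal R sketch below assumes logit links and the Gumbel copula that we use later in the simulations (Section 5.1); the function names are ours, not from the authors' software.

    # A minimal sketch (not the authors' code): cell probabilities of (Y1, Y2)
    # given x under logit links and a Gumbel copula with parameter delta.
    gumbel_copula <- function(u1, u2, delta) {
      # C(u1, u2 | delta) = exp(-[(-log u1)^(1/delta) + (-log u2)^(1/delta)]^delta)
      exp(-((-log(u1))^(1 / delta) + (-log(u2))^(1 / delta))^delta)
    }
    cell_probs <- function(x, beta1, beta2, delta) {
      g1 <- plogis(sum(x * beta1))          # marginal probability for infection 1, model (1)
      g2 <- plogis(sum(x * beta2))          # marginal probability for infection 2
      p11 <- gumbel_copula(g1, g2, delta)   # co-infection probability, model (2)
      c(p00 = 1 - g1 - g2 + p11, p10 = g1 - p11, p01 = g2 - p11, p11 = p11)
    }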

We propose a GEM algorithm to accelerate the computation. The algorithm incorporates $\tilde{Y} = \{\tilde{Y}_{11},\ldots,\tilde{Y}_{c_J J}\}$ as latent variables. The complete log-likelihood function of $\theta$, derived from the conditional distribution of $P$ and $\tilde{Y}$ given $X$, can be written as $\ell_c(\theta \mid P, \tilde{Y}, X) = \ell_{c1}(\theta_1 \mid \tilde{Y}, X) + \ell_{c2}(\theta_2 \mid P, \tilde{Y})$, where

\ell_{c1}(\theta_1 \mid \tilde{Y}, X) = \sum_{j=1}^{J}\sum_{i=1}^{c_j}\big[(1-\tilde{Y}_{ij1})(1-\tilde{Y}_{ij2})\log p_{ij}^{00}(\theta_1) + \tilde{Y}_{ij1}(1-\tilde{Y}_{ij2})\log p_{ij}^{10}(\theta_1) + (1-\tilde{Y}_{ij1})\tilde{Y}_{ij2}\log p_{ij}^{01}(\theta_1) + \tilde{Y}_{ij1}\tilde{Y}_{ij2}\log p_{ij}^{11}(\theta_1)\big]  (3)

and

\ell_{c2}(\theta_2 \mid P, \tilde{Y}) = \sum_{j=1}^{J}\sum_{k=1}^{2}\Big[\big\{\tilde{Z}_{jk}Z_{jk} + I(Z_j \neq (0,0)^T)\sum_{i=1}^{c_j}\tilde{Y}_{ijk}Y_{ijk}\big\}\log S_{e:k}
+ \big\{\tilde{Z}_{jk}(1-Z_{jk}) + I(Z_j \neq (0,0)^T)\sum_{i=1}^{c_j}\tilde{Y}_{ijk}(1-Y_{ijk})\big\}\log(1-S_{e:k})
+ \big\{(1-\tilde{Z}_{jk})(1-Z_{jk}) + I(Z_j \neq (0,0)^T)\sum_{i=1}^{c_j}(1-\tilde{Y}_{ijk})(1-Y_{ijk})\big\}\log S_{p:k}
+ \big\{(1-\tilde{Z}_{jk})Z_{jk} + I(Z_j \neq (0,0)^T)\sum_{i=1}^{c_j}(1-\tilde{Y}_{ijk})Y_{ijk}\big\}\log(1-S_{p:k})\Big],  (4)

in which $\tilde{Z}_{jk} = \max\{\tilde{Y}_{ijk} : i = 1,\ldots,c_j\}$ and $I(\cdot)$ is the indicator function.

Our GEM algorithm starts at an initial value and then iterates between an E-step and an M-step to update the value until numerical convergence. At the current value $\theta^{(d)}$, the E-step calculates $Q(\theta \mid \theta^{(d)}) = Q_1(\theta_1 \mid \theta^{(d)}) + Q_2(\theta_2 \mid \theta^{(d)})$, where $Q_1(\theta_1 \mid \theta^{(d)}) = E\{\ell_{c1}(\theta_1 \mid \tilde{Y}, X) \mid P, X, \theta^{(d)}\}$ and $Q_2(\theta_2 \mid \theta^{(d)}) = E\{\ell_{c2}(\theta_2 \mid P, \tilde{Y}) \mid P, X, \theta^{(d)}\}$. After an inspection of (3) and (4), it suffices to calculate $\eta_{ij}^{00(d)}$, $\eta_{ij}^{10(d)}$, $\eta_{ij}^{01(d)}$, $\eta_{ij}^{11(d)}$ (for $Q_1$) and $\eta_{P,jk}^{(d)}$ (for $Q_2$), where

\eta_{ij}^{y_1 y_2 (d)} = \mathrm{pr}(\tilde{Y}_{ij1}=y_1, \tilde{Y}_{ij2}=y_2 \mid P, X, \theta^{(d)}) \quad\text{and}\quad \eta_{P,jk}^{(d)} = \mathrm{pr}(\tilde{Z}_{jk}=1 \mid P, X, \theta^{(d)}),  (5)

for $i = 1,\ldots,c_j$, $j = 1,\ldots,J$, $y_1, y_2 \in \{0,1\}$, and $k = 1, 2$. Though the $\eta_{ij}^{y_1 y_2 (d)}$'s have been studied without the consideration of $X$,26 they were not updated in closed form, and thus a Gibbs sampler was employed to approximate these quantities. However, in the regression context, using such approximations requires enlarging the tolerance of the numerical convergence and hence might induce bias. To improve the computational accuracy, we calculate all the probabilities in (5) exactly (see the supplementary materials for details).

With the probabilities in (5) calculated, we rewrite $Q_1(\theta_1 \mid \theta^{(d)})$ as

Q_1(\beta_1, \beta_2, \delta \mid \theta^{(d)}) = \sum_{j=1}^{J}\sum_{i=1}^{c_j}\sum_{y_1=0}^{1}\sum_{y_2=0}^{1} \eta_{ij}^{y_1 y_2 (d)} \log p_{ij}^{y_1 y_2}(\theta_1),

and $Q_2(\theta_2 \mid \theta^{(d)})$ as

\sum_{k=1}^{2}\big\{W_{1k}^{(d)}\log S_{e:k} + W_{2k}^{(d)}\log(1-S_{e:k}) + W_{3k}^{(d)}\log S_{p:k} + W_{4k}^{(d)}\log(1-S_{p:k})\big\},  (6)

where

W_{1k}^{(d)} = \sum_{j=1}^{J}\big\{\eta_{P,jk}^{(d)}Z_{jk} + I(Z_j \neq (0,0)^T)\sum_{i=1}^{c_j}\eta_{ij,k}^{(d)}Y_{ijk}\big\},
W_{2k}^{(d)} = \sum_{j=1}^{J}\big\{\eta_{P,jk}^{(d)}(1-Z_{jk}) + I(Z_j \neq (0,0)^T)\sum_{i=1}^{c_j}\eta_{ij,k}^{(d)}(1-Y_{ijk})\big\},
W_{3k}^{(d)} = \sum_{j=1}^{J}\big\{(1-\eta_{P,jk}^{(d)})(1-Z_{jk}) + I(Z_j \neq (0,0)^T)\sum_{i=1}^{c_j}(1-\eta_{ij,k}^{(d)})(1-Y_{ijk})\big\},
W_{4k}^{(d)} = \sum_{j=1}^{J}\big\{(1-\eta_{P,jk}^{(d)})Z_{jk} + I(Z_j \neq (0,0)^T)\sum_{i=1}^{c_j}(1-\eta_{ij,k}^{(d)})Y_{ijk}\big\},

in which $\eta_{ij,1}^{(d)} = \eta_{ij}^{11(d)} + \eta_{ij}^{10(d)}$ and $\eta_{ij,2}^{(d)} = \eta_{ij}^{11(d)} + \eta_{ij}^{01(d)}$. The M-step in our GEM algorithm updates $\theta_1^{(d)}$ by $\theta_1^{(d+1)} = (\beta_1^{(d+1)T}, \beta_2^{(d+1)T}, \delta^{(d+1)})^T$, where $\beta_1^{(d+1)} = \arg\max_{\beta_1} Q_1(\beta_1, \beta_2^{(d)}, \delta^{(d)} \mid \theta^{(d)})$, $\beta_2^{(d+1)} = \arg\max_{\beta_2} Q_1(\beta_1^{(d+1)}, \beta_2, \delta^{(d)} \mid \theta^{(d)})$, and $\delta^{(d+1)} = \arg\max_{\delta} Q_1(\beta_1^{(d+1)}, \beta_2^{(d+1)}, \delta \mid \theta^{(d)})$. The value of $\theta_2^{(d+1)}$ is obtained by maximizing (6) and can be written as

\theta_2^{(d+1)} = (S_{e:1}^{(d+1)}, S_{e:2}^{(d+1)}, S_{p:1}^{(d+1)}, S_{p:2}^{(d+1)})^T,

where $S_{e:k}^{(d+1)} = W_{1k}^{(d)}/(W_{1k}^{(d)} + W_{2k}^{(d)})$ and $S_{p:k}^{(d+1)} = W_{3k}^{(d)}/(W_{3k}^{(d)} + W_{4k}^{(d)})$, for $k = 1, 2$. Combining $\theta_1^{(d+1)}$ and $\theta_2^{(d+1)}$ provides $\theta^{(d+1)}$. Because $Q(\theta^{(d+1)} \mid \theta^{(d)}) \geq Q(\theta^{(d)} \mid \theta^{(d)})$, the convergence of $\{\theta^{(d)}\}_{d=1}^{\infty}$ is guaranteed.30 We denote by $\hat{\theta}$ the limit of the $\theta^{(d)}$'s.
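The closed-form part of this M-step is easy to code once the E-step probabilities are available. The R sketch below computes the weight sums $W_{1k}^{(d)},\ldots,W_{4k}^{(d)}$ and the resulting sensitivity/specificity updates; the data layout (a J x 2 matrix Z of pooled outcomes, a list Y of retest matrices, a J x 2 matrix etaP of the $\eta_{P,jk}^{(d)}$'s, and a list etaI of the marginal $\eta_{ij,k}^{(d)}$'s) is our own choice for illustration, not the authors' implementation.

    # A minimal sketch of the closed-form M-step for the misclassification parameters.
    # Z: J x 2 pooled outcomes; Y: list of c_j x 2 retest matrices (NULL if the pool
    # tested negative for both); etaP: J x 2 matrix; etaI: list of c_j x 2 matrices.
    update_error_rates <- function(Z, Y, etaP, etaI) {
      W <- matrix(0, nrow = 4, ncol = 2)            # rows: W1, W2, W3, W4; columns: k = 1, 2
      for (j in seq_len(nrow(Z))) {
        for (k in 1:2) {
          W[1, k] <- W[1, k] + etaP[j, k] * Z[j, k]
          W[2, k] <- W[2, k] + etaP[j, k] * (1 - Z[j, k])
          W[3, k] <- W[3, k] + (1 - etaP[j, k]) * (1 - Z[j, k])
          W[4, k] <- W[4, k] + (1 - etaP[j, k]) * Z[j, k]
          if (any(Z[j, ] == 1)) {                   # retests exist only for flagged pools
            W[1, k] <- W[1, k] + sum(etaI[[j]][, k] * Y[[j]][, k])
            W[2, k] <- W[2, k] + sum(etaI[[j]][, k] * (1 - Y[[j]][, k]))
            W[3, k] <- W[3, k] + sum((1 - etaI[[j]][, k]) * (1 - Y[[j]][, k]))
            W[4, k] <- W[4, k] + sum((1 - etaI[[j]][, k]) * Y[[j]][, k])
          }
        }
      }
      list(Se = W[1, ] / (W[1, ] + W[2, ]),         # Se_{:k}^{(d+1)}
           Sp = W[3, ] / (W[3, ] + W[4, ]))         # Sp_{:k}^{(d+1)}
    }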

Denote by $I(\theta)$ the observed data information matrix. Following standard arguments for the MLE,34 $I(\hat{\theta})^{1/2}(\hat{\theta} - \theta)$ converges in distribution to $N(0, I_{2p+7})$ as $N \to \infty$, where $I_m$ denotes the $m$-dimensional identity matrix. Applying Louis' method35 provides

I(\theta) = -E\Big\{\frac{\partial^2 \ell_c(\theta \mid P, \tilde{Y}, X)}{\partial\theta\,\partial\theta^T} \,\Big|\, P, X, \theta\Big\} - \mathrm{cov}\Big\{\frac{\partial \ell_c(\theta \mid P, \tilde{Y}, X)}{\partial\theta} \,\Big|\, P, X, \theta\Big\}.

Again, instead of approximating $I(\theta)$ via the Gibbs sampling approach,26 we are able to calculate it exactly. The calculations are included in the supplementary materials. With $I(\hat{\theta})$, one can make large-sample Wald-type inferences. For example, let $\theta_l$, $\hat{\theta}_l$, and $\hat{\sigma}_{ll}^2$ be the $l$th component of $\theta$, the $l$th component of $\hat{\theta}$, and the $l$th diagonal entry of $I(\hat{\theta})^{-1}$, respectively, for $l = 1,\ldots,2p+7$. The estimated standard error (SE) of $\hat{\theta}_l$ is $\hat{\sigma}_{ll}$, and an approximate $100(1-\alpha)\%$ confidence interval for $\theta_l$ is $\hat{\theta}_l \pm z_{\alpha/2}\hat{\sigma}_{ll}$, where $z_{\alpha}$ is the upper $\alpha$ quantile of $N(0,1)$.
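As a small illustration of these Wald-type inferences, the following R sketch turns an estimate and its observed information matrix into standard errors and confidence intervals; the function name is ours.

    # Wald-type inference from the observed data information matrix
    # (a minimal sketch; 'info' stands for I(theta_hat)).
    wald_ci <- function(theta_hat, info, level = 0.95) {
      se <- sqrt(diag(solve(info)))                 # estimated standard errors
      z  <- qnorm(1 - (1 - level) / 2)
      cbind(estimate = theta_hat, SE = se,
            lower = theta_hat - z * se, upper = theta_hat + z * se)
    }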

4. VARIABLE SELECTION FOR EACH INFECTION

With $\hat{\theta}$ and $I(\hat{\theta})$ computed, we further identify which risk factors are truly relevant for each infection. Denote by $\beta_1^*$ and $\beta_2^*$ the values of $\beta_1$ and $\beta_2$ that generate the true individual statuses $\tilde{Y}$, respectively, where $\beta_k^* = (\beta_{k0}^*, \beta_{k1}^*,\ldots,\beta_{kp}^*)^T$. One can index the significant risk factors for the $k$th infection by $\mathcal{M}_k = \{j \in \mathcal{M} : \beta_{kj}^* \neq 0\}$, where we take $\mathcal{M} = \{1, 2,\ldots,p\}$ because an intercept term is always included in the model by default. One must note that $\mathcal{M}_1$ and $\mathcal{M}_2$ might be different.

We apply a shrinkage method to simultaneously select the $\mathcal{M}_k$'s and estimate the nonzero $\beta_{kj}^*$'s. To unify notation, we write $\theta_T$ and $\hat{\Sigma}_{TT}$ for the sub-vector of $\theta$ and the sub-matrix of $\hat{\Sigma}$ corresponding to an index set $T \subseteq \{1,\ldots,2p+7\}$, respectively. Let $A = \{2,\ldots,p+1, p+3,\ldots,2p+2\}$. Our shrinkage estimator of $\theta_A$ is defined by

\tilde{\theta}_{A,\lambda} = \arg\min_{\theta_A}\Big\{\tfrac{1}{2}(\hat{\theta}_A - \theta_A)^T\hat{\Sigma}_{AA}(\hat{\theta}_A - \theta_A) + \sum_{k=1}^{2}\lambda_k\sum_{j=1}^{p}\omega_{kj}|\beta_{kj}|\Big\},  (7)

where $\lambda_k\sum_{j=1}^{p}\omega_{kj}|\beta_{kj}|$ is an adaptive LASSO penalty,36 $\lambda_k \geq 0$ is a tuning parameter that controls the shrinkage level, and $\omega_{kj} = |\hat{\beta}_{kj}|^{-1}$ is an adaptive weight. When the $\lambda_k$'s are 0, $\tilde{\theta}_{A,\lambda} = \hat{\theta}_A$. As the $\lambda_k$'s increase, due to the singularity of the absolute value function at the origin, components of $\tilde{\theta}_{A,\lambda}$ are penalized to zero one by one. Writing $\tilde{\theta}_{A,\lambda} = (\tilde{\beta}_{11,\lambda},\ldots,\tilde{\beta}_{1p,\lambda},\tilde{\beta}_{21,\lambda},\ldots,\tilde{\beta}_{2p,\lambda})^T$, we estimate $\mathcal{M}_1$ and $\mathcal{M}_2$ by $\tilde{\mathcal{M}}_{1,\lambda} = \{j \in \mathcal{M} : \tilde{\beta}_{1j,\lambda} \neq 0\}$ and $\tilde{\mathcal{M}}_{2,\lambda} = \{j \in \mathcal{M} : \tilde{\beta}_{2j,\lambda} \neq 0\}$, respectively.

Computing $\tilde{\theta}_{A,\lambda}$ is fast. The objective function in (7) is simply the sum of a quadratic function and a weighted $\ell_1$-norm of $\theta_A$ and can therefore be quickly minimized by slightly modifying the seminal least angle regression.37 Let $A^c = \{1, 2,\ldots,2p+7\}\setminus A$ and let $\ell(\theta_A \mid P, X, \hat{\theta}_{A^c})$ be the log-likelihood function $\ell(\theta \mid P, X)$ with $\theta_{A^c}$ fixed at $\hat{\theta}_{A^c}$. One could also construct a shrinkage estimator via the traditional penalized MLE,38 which minimizes $-\ell(\theta_A \mid P, X, \hat{\theta}_{A^c}) + \sum_{k=1}^{2}\lambda_k\sum_{j=1}^{p}\omega_{kj}|\beta_{kj}|$. Because the quadratic term in (7) is the leading component of the Taylor expansion of $-\ell(\theta_A \mid P, X, \hat{\theta}_{A^c})$ at $\theta_A = \hat{\theta}_A$, it can easily be shown that $\tilde{\theta}_{A,\lambda}$ and the penalized MLE are asymptotically equivalent. However, the computational cost of obtaining the penalized MLE is much higher due to the complexity of the log-likelihood function.
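Because (7) is a quadratic function plus a weighted $\ell_1$ penalty, it can also be minimized by plain cyclic coordinate descent with soft-thresholding. The R sketch below is a simple substitute for the modified least angle regression used above, not the authors' implementation; it assumes $\hat{\Sigma}_{AA}$ is positive definite, and the weight vector w stacks the products $\lambda_k\omega_{kj}$.

    # Minimize 0.5 * (theta_hat - theta)' Sigma (theta_hat - theta) + sum(w * |theta|)
    # by cyclic coordinate descent with soft-thresholding (a minimal sketch).
    soft <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)
    shrink_quadratic <- function(theta_hat, Sigma, w, tol = 1e-8, maxit = 1000) {
      theta <- theta_hat
      for (it in seq_len(maxit)) {
        theta_old <- theta
        for (j in seq_along(theta)) {
          # coordinate-wise minimizer given the other components
          r_j <- Sigma[j, j] * theta_hat[j] -
                 sum(Sigma[j, -j] * (theta[-j] - theta_hat[-j]))
          theta[j] <- soft(r_j, w[j]) / Sigma[j, j]
        }
        if (max(abs(theta - theta_old)) < tol) break
      }
      theta
    }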

The use of the adaptive weights $\omega_{kj}$ is critical to achieving the oracle properties.36 They assign sufficiently large penalties to insignificant covariates so that these are excluded from the model; on the other hand, they impose mild penalties on significant covariates so that these are retained in the model. The oracle properties are stated as follows. As $N \to \infty$, if $\max(\lambda_1, \lambda_2)/\sqrt{N} \to 0$ and $\min(\lambda_1, \lambda_2) \to \infty$, we have both the selection consistency, $\mathrm{pr}(\tilde{\mathcal{M}}_{1,\lambda} = \mathcal{M}_1, \tilde{\mathcal{M}}_{2,\lambda} = \mathcal{M}_2) \to 1$, and the estimation consistency, $\sup_{k,j}|\tilde{\beta}_{kj,\lambda} - \beta_{kj}^*| = O_p(N^{-1/2})$. The proof follows arguments similar to those in the proofs of Theorems 1 and 2 in Reference 39 and is thus omitted.

To select λ1 and λ2, we propose to minimize a BIC-type criterion40,

\mathrm{BIC}(\lambda_1, \lambda_2) = (\hat{\theta}_A - \tilde{\theta}_{A,\lambda})^T\hat{\Sigma}_{AA}(\hat{\theta}_A - \tilde{\theta}_{A,\lambda}) + \{df_{1,\lambda} + df_{2,\lambda}\}\log N,  (8)

where $df_{k,\lambda} = |\tilde{\mathcal{M}}_{k,\lambda}|$ for $k = 1, 2$. Following the proof of Theorem 3 in Reference 41, one can show that, with the optimal $(\lambda_1, \lambda_2)$ from (8), $\mathrm{pr}(\tilde{\mathcal{M}}_{1,\lambda} = \mathcal{M}_1, \tilde{\mathcal{M}}_{2,\lambda} = \mathcal{M}_2) \to 1$ as $N \to \infty$. In other words, any $(\lambda_1, \lambda_2)$ that does not lead to the correct variable selection cannot be selected by (8) when the number of individuals is large.
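A direct way to use (8) is a two-dimensional grid search over $(\lambda_1, \lambda_2)$. The R sketch below does this with the hypothetical shrink_quadratic() sketched above, assuming the first p entries of $\theta_A$ correspond to $\beta_1$ and the last p to $\beta_2$; the grid itself is arbitrary.

    # A minimal sketch of tuning-parameter selection by criterion (8).
    bic_select <- function(theta_hat_A, Sigma_AA, omega1, omega2, N,
                           grid = exp(seq(log(0.1), log(50), length.out = 20))) {
      p <- length(omega1)
      best <- list(bic = Inf)
      for (l1 in grid) for (l2 in grid) {
        w <- c(l1 * omega1, l2 * omega2)                     # weights lambda_k * omega_kj in (7)
        theta <- shrink_quadratic(theta_hat_A, Sigma_AA, w)
        d <- theta_hat_A - theta
        df <- sum(theta[1:p] != 0) + sum(theta[(p + 1):(2 * p)] != 0)
        bic <- drop(t(d) %*% Sigma_AA %*% d) + df * log(N)   # criterion (8)
        if (bic < best$bic) best <- list(bic = bic, lambda = c(l1, l2), theta = theta)
      }
      best
    }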

The purpose of this section is to provide a shrinkage estimator of the regression coefficients whose sparsity pattern can help us identify the truly relevant risk factors for each infection. Inference procedures based on this shrinkage estimator, such as constructing confidence intervals or conducting hypothesis tests, are beyond the scope of this work. Numerous studies have demonstrated that, even in classical linear regression, finite-sample inference procedures based on asymptotic properties of the adaptive LASSO estimator perform poorly.42 Developing valid inferential methods for shrinkage estimators in group testing, even with a single infection, could be an interesting but challenging future research topic. In this article, variable selection is of primary interest.

5. NUMERICAL STUDIES

5.1. Simulation

We consider three different settings for the joint distribution of $\tilde{Y}_{ij} \mid x_{ij}$. In all of them, we take both $g_1$ and $g_2$ in the marginal regression model (1) to be the inverse of the logit link function and use a Gumbel copula,43 $C(u_1, u_2 \mid \delta) = \exp\{-[(-\log u_1)^{1/\delta} + (-\log u_2)^{1/\delta}]^{\delta}\}$ with $\delta = 0.3$, to generate the co-infection probability (2). The difference across the three settings comes from the choices of $(\beta_1, \beta_2, x)$, where $x$ is generic notation for the $x_{ij}$'s:

  • (S1) β1 = (−5,−3,2,0,0,0)T, β2 = (−5,−3,0,3,0,0)T and x = (1,x1,⋯,x5)T, where we independently simulate x1 from (0,1), x2 and x3 from Bernoulli(0.4), x4 from Uniform(−0.5,0.5), and x5 from N(0, 0.75^2).

  • (S2) β1 = (−4,−2,2,0,0,0)T, β2 = (−5,−2,0,−2,0,0)T and x = (1,x1,⋯,x5)T, where x is simulated from N(0,Ω) with [Ω]st = 1 if s = t and [Ω]st = 0.5 if s ≠ t.

  • (S3) β1 = (−5, (−2,−2,−2,2,2) ⊗ (1,0))T, β2 = (−6, (−3,−3,2,3,0) ⊗ (1,0))T and x = (1,x1,⋯,x10)T, where ⊗ is the Kronecker product and x is simulated from N(0,Ω) with [Ω]st = 1 if s = t and [Ω]st = 0.5 if s ≠ t.

Note that β1 and β2 have different sparsity patterns (e.g., in S1, x2 is significant to the first infection but not to the second infection). This is to emulate the situation where two infections have different sets of significant risk factors. The values of β1 and β2 are chosen in a way such that the prevalence of each infection is about 7%–10%.

Under each setting, we simulate two types of data: individual testing data and the SHL pooled testing data. To do so, we first generate $N = 3000$ sets of individual covariates. Given a set of covariates, we calculate the individual's cell probabilities (the $p_{ij}^{y_1 y_2}$'s) using the specified copula-based multivariate binary regression model, and then generate the true infection statuses for both infections from a multinomial distribution with those cell probabilities. We denote the covariates and the true infection statuses of the $n$th individual by $x_n$ and $\tilde{Y}_n = (\tilde{Y}_{n1}, \tilde{Y}_{n2})^T$, respectively, for $n = 1,\ldots,3000$. Herein, because groups have not been created yet, we use the subscript $n$ instead of $ij$ (as in $\tilde{Y}_{ij}$ and $x_{ij}$). Given the $(\tilde{Y}_n, x_n)$'s, we simulate individual testing data and the SHL pooled testing data. We let $S_{e:k} = S_{p:k} = 0.95$ for $k = 1, 2$. Values other than 0.95 are considered in the supplementary materials.
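Generating the true statuses as described amounts to one multinomial draw per individual once the cell probabilities are available. The R sketch below reuses the hypothetical cell_probs() function from the sketch in Section 3 and is for illustration only.

    # A minimal sketch: draw true statuses (Y1, Y2) for each row of a design
    # matrix X whose first column is all ones.
    simulate_truth <- function(X, beta1, beta2, delta) {
      t(apply(X, 1, function(x) {
        p <- cell_probs(x, beta1, beta2, delta)        # (p00, p10, p01, p11)
        cell <- rmultinom(1, size = 1, prob = p)       # pick one of the four cells
        c(Y1 = cell[2] + cell[4], Y2 = cell[3] + cell[4])
      }))
    }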

Based on the $\tilde{Y}_n$'s, we generate the individual testing outcomes of the $n$th specimen as $T_n = (T_{n1}, T_{n2})^T$, where $T_{nk} \sim \mathrm{Bernoulli}\{S_{e:k}\tilde{Y}_{nk} + (1-S_{p:k})(1-\tilde{Y}_{nk})\}$. We then estimate $(\beta_1, \beta_2, \delta)$ from the $(T_n, x_n)$'s. This estimation procedure is similar to the one outlined in Section 3. We also use a GEM algorithm to compute the MLEs and Louis' method to calculate the observed data information matrix for making large-sample Wald-type inferences. Furthermore, we slightly modify our variable selection method (in Section 4) to accommodate individual testing data. All the details are provided in the supplementary materials. It is worth noting that the $S_{e:k}$'s and $S_{p:k}$'s are not estimable from individual testing data. Hence, with the individual testing data $(T_n, x_n)$'s, we have to assume the true values of the $S_{e:k}$'s and $S_{p:k}$'s are known in order to estimate $(\beta_1, \beta_2, \delta)$.

We generate the SHL pooled testing data from the $\tilde{Y}_n$'s. A common group size is used in our simulations; i.e., $c_j = c$ with $c \in \{2, 5, 10\}$. For a fixed $c$, we randomly assign the 3000 individuals to one of $J = 3000/c$ groups. With the group membership identified, we relabel the $(\tilde{Y}_n, x_n)$'s as $(\tilde{Y}_{ij}, x_{ij})$, where $i = 1,\ldots,c$ and $j = 1,\ldots,J$. The true statuses of the $j$th pool are calculated as $\tilde{Z}_{jk} = \max_i \tilde{Y}_{ijk}$ for $k = 1, 2$. Then we generate the pooled testing outcomes $Z_j = (Z_{j1}, Z_{j2})^T$, where $Z_{jk} \sim \mathrm{Bernoulli}\{S_{e:k}\tilde{Z}_{jk} + (1-S_{p:k})(1-\tilde{Z}_{jk})\}$. As per the SHL pooling protocol, only if $\max(Z_{j1}, Z_{j2}) = 1$ do we generate the retesting outcomes of the $i$th individual in this group, $Y_{ij} = (Y_{ij1}, Y_{ij2})^T$, where $Y_{ijk} \sim \mathrm{Bernoulli}\{S_{e:k}\tilde{Y}_{ijk} + (1-S_{p:k})(1-\tilde{Y}_{ijk})\}$. Collecting all the $Z_j$'s and $Y_{ij}$'s yields the SHL pooled testing data $P$. Note that the number of tests used to obtain $P$ is the sum of $J$ and the number of $Y_{ij}$'s. From $P$ and the $x_{ij}$'s, we estimate $(\beta_1, \beta_2, \delta, S_{e:1}, S_{e:2}, S_{p:1}, S_{p:2})$.
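A sketch of this two-stage data generation for one group follows, again with our own (hypothetical) data layout; applying it to every group and collecting the results yields P and the total number of tests.

    # A minimal sketch of the two-stage outcomes for one group. Ytrue is the
    # c x 2 matrix of true statuses; Se and Sp are length-2 vectors.
    simulate_shl_pool <- function(Ytrue, Se, Sp) {
      Ztrue <- apply(Ytrue, 2, max)                              # true pool status per infection
      Z <- rbinom(2, 1, Se * Ztrue + (1 - Sp) * (1 - Ztrue))     # Stage 1 pooled outcomes
      if (any(Z == 1)) {
        # Stage 2: retest every member of a flagged pool for both infections
        probs <- sweep(Ytrue, 2, Se, "*") + sweep(1 - Ytrue, 2, 1 - Sp, "*")
        Y <- matrix(rbinom(length(probs), 1, probs), nrow = nrow(Ytrue))
        list(Z = Z, Y = Y, tests = 1 + nrow(Ytrue))
      } else {
        list(Z = Z, Y = NULL, tests = 1)
      }
    }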

We repeat the process of generating the $T_n$'s and $P$ 500 times for each $c \in \{2, 5, 10\}$. For each set of individual testing data or SHL pooled testing data, we first treat the diagnosis results for each infection as the true statuses and fit them using our copula-based multivariate binary regression model. The resulting MLE of $\theta_1$ is used as the initial value of $\theta_1$. The initial values of the assay sensitivity and specificity are chosen to be 0.9. We then run our GEM algorithm to compute the MLE and use Louis' method to construct a 95% confidence interval for each unknown parameter (see the last paragraph of Section 3). In addition to the BIC-type shrinkage estimator, we also compute an AIC-type44 and an ERIC-type45 estimator using the tuning parameters selected by minimizing $\mathrm{AIC}(\lambda_1,\lambda_2) = (\hat{\theta}_A - \tilde{\theta}_{A,\lambda})^T\hat{\Sigma}_{AA}(\hat{\theta}_A - \tilde{\theta}_{A,\lambda}) + 2\{df_{1,\lambda} + df_{2,\lambda}\}$ and $\mathrm{ERIC}(\lambda_1,\lambda_2) = (\hat{\theta}_A - \tilde{\theta}_{A,\lambda})^T\hat{\Sigma}_{AA}(\hat{\theta}_A - \tilde{\theta}_{A,\lambda}) + df_{1,\lambda}\log(N/\lambda_1) + df_{2,\lambda}\log(N/\lambda_2)$, respectively. For individual testing data, slightly modified versions are available in the supplementary materials.

To compare the overall performance of the MLE and the three shrinkage estimators, we consider the prediction error, $\mathrm{PE} = N^{-1}\sum_{j=1}^{J}\sum_{i=1}^{c_j}\{\sum_{y_1=0}^{1}\sum_{y_2=0}^{1}(\hat{p}_{ij}^{y_1 y_2} - p_{ij}^{*y_1 y_2})^2\}^{1/2}$, where the $p_{ij}^{*y_1 y_2}$'s are the true cell probabilities and the $\hat{p}_{ij}^{y_1 y_2}$'s are the predicted cell probabilities using an estimator of $(\beta_1, \beta_2, \delta)$. To evaluate the variable selection performance of the shrinkage estimators, we define the selection rate (SR) as the proportion of replications in which the true model is exactly selected by a shrinkage estimator. Results from the 500 replications under S1–S3 are summarized in Tables 1–4.
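For completeness, the prediction error defined above amounts to averaging the Euclidean distance between the predicted and true cell-probability vectors; a minimal R sketch (our own helper):

    # p_hat, p_true: N x 4 matrices of predicted and true cell probabilities.
    prediction_error <- function(p_hat, p_true) {
      mean(sqrt(rowSums((p_hat - p_true)^2)))
    }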

TABLE 1.

Summary statistics of the 500 MLEs obtained under S1, including the sample mean (Mean), the sample standard deviation (SD), the average of the estimated standard errors (SE), and the empirical coverage (EC) of 95% confidence intervals under either individual testing (IT) or the SHL pooling with c = 2, 5, 10. The average number of tests (# of tests) under each protocol is also provided. The prevalences (averaged over 500 repetitions) of the first and second infections are 7.64% and 8.22%, respectively.

IT c = 2 c = 5 c = 10

# tests 3000 2351 2078 2445

Truth Mean(SD) EC(SE) Mean(SD) EC(SE) Mean(SD) EC(SE) Mean(SD) EC(SE)
β10 −5 −5.08(0.36) 0.94(0.37) −5.06(0.29) 0.94(0.29) −5.06(0.31) 0.94(0.29) −5.07(0.34) 0.95(0.32)
β11 −3 −3.05(0.25) 0.96(0.26) −3.03(0.21) 0.94(0.21) −3.04(0.22) 0.94(0.21) −3.04(0.24) 0.95(0.23)
β12 2 2.03(0.27) 0.94(0.27) 2.02(0.24) 0.94(0.24) 2.02(0.25) 0.94(0.24) 2.03(0.26) 0.95(0.25)
β13 0 −0.01(0.24) 0.95(0.23) −0.01(0.22) 0.95(0.21) −0.01(0.22) 0.95(0.21) −0.01(0.22) 0.94(0.21)
β14 0 0.01(0.38) 0.95(0.39) 0.01(0.34) 0.96(0.35) −0.01(0.35) 0.96(0.35) 0.00(0.37) 0.96(0.36)
β15 0 0.00(0.19) 0.96(0.20) 0.00(0.17) 0.97(0.18) 0.00(0.17) 0.97(0.18) 0.00(0.19) 0.94(0.19)

β20 −5 −5.08(0.37) 0.95(0.37) −5.05(0.28) 0.94(0.30) −5.05(0.30) 0.94(0.30) −5.04(0.33) 0.96(0.32)
β21 −3 −3.04(0.26) 0.95(0.26) −3.03(0.21) 0.94(0.22) −3.03(0.22) 0.94(0.21) −3.02(0.24) 0.95(0.23)
β22 0 −0.01(0.24) 0.94(0.23) −0.01(0.21) 0.93(0.21) 0.00(0.22) 0.93(0.21) −0.01(0.23) 0.94(0.21)
β23 3 3.04(0.33) 0.94(0.32) 3.03(0.27) 0.94(0.27) 3.03(0.29) 0.94(0.27) 3.03(0.30) 0.94(0.29)
β24 0 0.00(0.40) 0.95(0.38) 0.02(0.34) 0.95(0.35) 0.01(0.35) 0.95(0.35) 0.00(0.36) 0.96(0.36)
β25 0 0.01(0.20) 0.95(0.20) 0.00(0.18) 0.94(0.18) 0.00(0.18) 0.94(0.18) 0.00(0.19) 0.95(0.19)

δ 0.3 0.28(0.09) 0.97(0.10) 0.29(0.06) 0.95(0.06) 0.29(0.06) 0.95(0.06) 0.29(0.07) 0.95(0.07)

Se:1 0.95  –  – 0.95(0.02) 0.93(0.02) 0.95(0.02) 0.93(0.02) 0.95(0.02) 0.90(0.02)
Se:2 0.95  –  – 0.95(0.01) 0.95(0.01) 0.95(0.02) 0.91(0.01) 0.95(0.02) 0.92(0.02)
Sp:1 0.95  –  – 0.95(0.01) 0.94(0.01) 0.95(0.01) 0.94(0.01) 0.95(0.01) 0.93(0.01)
Sp:2 0.95  –  – 0.95(0.01) 0.94(0.01) 0.95(0.01) 0.93(0.01) 0.95(0.01) 0.93(0.01)

TABLE 4.

The average prediction error PE × 100 and the SR value (in parentheses) of the MLE and the shrinkage estimators under the AIC, BIC, and ERIC tuning parameter criteria over 500 replications under S1–S3, for individual testing (IT) and the SHL pooling with c = 2, 5, and 10. Recall that the SR (selection rate) is defined as the proportion of replications in which the true model is exactly selected by a shrinkage estimator. The highest SR value under each setting is underlined.

IT c = 2 c = 5 c = 10

Setting Estimate PE×100(SR) PE×100(SR) PE×100(SR) PE×100(SR)
S1 MLE 0.148(0.000) 0.126(0.000) 0.130(0.000) 0.142(0.000)
AIC 0.106(0.414) 0.092(0.430) 0.092(0.442) 0.102(0.462)
BIC 0.079(0.910) 0.071(0.908) 0.073(0.926) 0.083(0.898)
ERIC 0.085(0.724) 0.075(0.736) 0.076(0.744) 0.085(0.734)

S2 MLE 0.133(0.000) 0.106(0.000) 0.117(0.000) 0.121(0.000)
AIC 0.095(0.414) 0.074(0.414) 0.084(0.436) 0.087(0.418)
BIC 0.074(0.908) 0.059(0.910) 0.067(0.892) 0.069(0.876)
ERIC 0.084(0.702) 0.064(0.696) 0.074(0.702) 0.074(0.702)

S3 MLE 0.284(0.000) 0.231(0.000) 0.250(0.000) 0.266(0.000)
AIC 0.193(0.266) 0.160(0.294) 0.175(0.274) 0.184(0.298)
BIC 0.158(0.818) 0.130(0.820) 0.145(0.786) 0.153(0.808)
ERIC 0.183(0.428) 0.150(0.448) 0.163(0.420) 0.170(0.448)

Tables 1–3 provide summary statistics of the MLEs for S1–S3, respectively. Under both individual testing and the SHL pooling protocol, the MLEs of the unknown parameters obtained by our GEM algorithm exhibit little, if any, evidence of bias across all considered settings. Regarding the use of Louis' method, we notice that the average standard errors agree well with the sample standard deviations of the estimates. In addition, the empirical coverage probabilities of the 95% confidence intervals are predominantly at the nominal level. These results indicate that the observed data information matrix is estimated correctly via Louis' method.

TABLE 3.

Summary statistics of the 500 MLEs obtained under S3, including the sample mean (Mean), the sample standard deviation (SD), the average of the estimated standard errors (SE), and the empirical coverage (EC) of 95% confidence intervals under either individual testing (IT) or the SHL pooling with c = 2, 5, 10. The average number of tests (# of tests) under each protocol is also provided. The prevalences (averaged over 500 repetitions) of the first and the second infections are 9.97% and 8.54%, respectively.

IT c = 2 c = 5 c = 10

# tests 3000 2508 2337 2701

Truth Mean(SD) EC(SE) Mean(SD) EC(SE) Mean(SD) EC(SE) Mean(SD) EC(SE)
β10 −5 −5.10(0.39) 0.94(0.36) −5.07(0.30) 0.93(0.27) −5.10(0.33) 0.92(0.30) -5.09(0.36) 0.95(0.34)
β11 −2 −2.04(0.22) 0.94(0.21) −2.03(0.18) 0.95(0.17) −2.05(0.20) 0.93(0.18) −2.04(0.20) 0.94(0.20)
β12 0 0.00(0.13) 0.97(0.14) 0.00(0.12) 0.98(0.13) 0.00(0.13) 0.96(0.13) 0.00(0.13) 0.97(0.14)
β13 −2 −2.04(0.22) 0.93(0.21) −2.03(0.18) 0.94(0.17) −2.04(0.19) 0.94(0.18) −2.04(0.21) 0.93(0.20)
β14 0 0.01(0.14) 0.96(0.14) 0.00(0.12) 0.95(0.13) 0.00(0.13) 0.94(0.13) 0.00(0.13) 0.97(0.14)
β15 −2 −2.05(0.22) 0.94(0.21) −2.04(0.18) 0.94(0.17) −2.05(0.19) 0.93(0.18) −2.05(0.21) 0.92(0.19)
β16 0 0.01(0.14) 0.96(0.14) 0.00(0.13) 0.95(0.13) 0.01(0.13) 0.96(0.13) 0.01(0.14) 0.95(0.14)
β17 2 2.04(0.22) 0.95(0.21) 2.03(0.18) 0.93(0.17) 2.05(0.19) 0.93(0.18) 2.04(0.21) 0.94(0.20)
β18 0 0.00(0.14) 0.96(0.14) 0.00(0.12) 0.96(0.13) 0.00(0.13) 0.95(0.13) 0.01(0.14) 0.96(0.14)
β19 2 2.04(0.21) 0.94(0.21) 2.03(0.18) 0.93(0.17) 2.04(0.19) 0.94(0.18) 2.04(0.20) 0.96(0.20)
β110 0 0.00(0.15) 0.94(0.14) 0.00(0.13) 0.96(0.13) 0.01(0.13) 0.95(0.13) 0.00(0.14) 0.95(0.14)

β20 −6 −6.13(0.49) 0.95(0.48) −6.10(0.36) 0.96(0.36) −6.10(0.41) 0.95(0.39) −6.12(0.43) 0.95(0.43)
β21 −3 −3.07(0.30) 0.95(0.29) −3.05(0.24) 0.94(0.24) −3.05(0.26) 0.94(0.25) −3.06(0.27) 0.94(0.27)
β22 0 0.00(0.16) 0.95(0.16) 0.01(0.14) 0.95(0.14) 0.00(0.15) 0.95(0.15) 0.00(0.15) 0.95(0.15)
β23 −3 −3.07(0.30) 0.96(0.29) −3.05(0.24) 0.94(0.24) −3.05(0.26) 0.94(0.25) −3.06(0.27) 0.94(0.27)
β24 0 0.00(0.16) 0.96(0.16) 0.00(0.13) 0.94(0.14) 0.00(0.14) 0.95(0.15) 0.01(0.15) 0.95(0.16)
β25 2 2.04(0.21) 0.97(0.23) 2.04(0.19) 0.96(0.19) 2.04(0.20) 0.95(0.20) 2.04(0.20) 0.95(0.20)
β26 0 0.01(0.17) 0.94(0.16) 0.01(0.15) 0.93(0.14) 0.01(0.15) 0.95(0.15) 0.00(0.16) 0.94(0.16)
β27 3 3.06(0.29) 0.95(0.29) 3.04(0.24) 0.95(0.24) 3.04(0.26) 0.94(0.25) 3.05(0.27) 0.95(0.27)
β28 0 0.01(0.16) 0.96(0.16) 0.00(0.14) 0.96(0.14) 0.00(0.15) 0.95(0.15) 0.01(0.15) 0.96(0.16)
β29 0 0.00(0.16) 0.95(0.16) −0.01(0.14) 0.94(0.14) −0.01(0.15) 0.95(0.15) −0.01(0.16) 0.96(0.15)
β210 0 0.01(0.16) 0.95(0.16) 0.01(0.14) 0.94(0.14) 0.01(0.15) 0.95(0.15) 0.01(0.15) 0.94(0.16)

 δ 0.3 0.29(0.09) 0.98(0.13) 0.28(0.07) 0.99(0.08) 0.29(0.07) 0.98(0.09) 0.29(0.07) 0.99(0.11)

Se:1 0.95  –  – 0.95(0.01) 0.93(0.01) 0.95(0.01) 0.94(0.01) 0.95(0.02) 0.94(0.01)
Se:2 0.95  –  – 0.95(0.01) 0.95(0.01) 0.95(0.02) 0.93(0.01) 0.95(0.02) 0.91(0.02)
Sp:1 0.95  –  – 0.95(0.01) 0.96(0.01) 0.95(0.01) 0.93(0.01) 0.95(0.01) 0.92(0.01)
Sp:2 0.95  –  – 0.95(0.01) 0.96(0.01) 0.95(0.01) 0.94(0.01) 0.95(0.01) 0.93(0.01)

To examine the performance of the variable selection, Table 4 provides the SR (in parentheses) of each shrinkage estimator across all considered settings. One can see that our BIC-type estimator performs best in identifying the true model in each scenario. For example, in S3 with c = 2, the SR using the BIC criterion is 0.820, which is considerably larger than those using the AIC (0.294) and ERIC (0.448) criteria. These results demonstrate the advantage of using the BIC criterion to identify risk factors that are truly relevant for each infection.

Table 4 also provides the average PE×100 values of the MLE and the three shrinkage estimators across all settings. It is clear that all the shrinkage estimators produce smaller prediction errors than the MLE. For example, the BIC-type estimator reduces the prediction error of the MLE by almost 50%. This is because the adaptive LASSO penalty in (7) can eliminate unnecessary risk factors. Furthermore, because our BIC-type estimator outperforms the other two in terms of variable selection, its prediction errors are the smallest under all settings. In conclusion, the BIC-type shrinkage estimator not only provides a high chance of identifying the truly relevant covariates but also yields high prediction accuracy.

Finally, we want to see whether the SHL pooling protocol causes a loss of information and thus compromises regression inference when compared to individual testing. To find the answer, we revisit Tables 1–4, this time focusing on the comparison between individual testing and the SHL pooling. Tables 1–3 provide the average number of tests under each setting. Obviously, the SHL pooling protocol uses fewer tests than individual testing (saving about 16% of the costs). This is an expected and appealing feature of the SHL pooling.26 And we observe more: (i) in Tables 1–3, the standard deviations obtained using the pooled data are uniformly smaller than those obtained using individual testing, suggesting that the SHL pooling provides a less variable MLE; (ii) all the averaged standard errors under the SHL pooling are smaller than those under individual testing, meaning that one could use the SHL pooled testing data to construct narrower confidence intervals while maintaining the same nominal level; (iii) the advantage of pooling also holds when comparing the average PE×100 values in Table 4, indicating that the SHL pooling enables better prediction of an individual's infection probabilities; and (iv) in terms of variable selection, the highest SR value (in Table 4) always occurs at c > 1 under each setting; that is, using the SHL pooled testing data gives a larger chance of identifying the true model. Hence, instead of compromising regression inference, the SHL pooling produces more precise inference. In addition, one must note that these advantages are achieved at a lower cost and with a larger number of parameters to be estimated. This finding could be very encouraging to laboratories that are not using pooling (such as the NPHL).

5.2. A CT/NG screening data set

To further encourage the use of pooling, we analyze a data set collected from the NPHL, which currently uses individual testing for CT/NG screening. We illustrate what benefits for regression could be achieved by switching from individual testing to the two-stage hierarchical pooling used by the SHL. To do so, we first reiterate how the SHL implements the pooling protocol.26 Only female swab specimens are screened using the SHL pooling protocol. The testing is carried out on the TECAN DTS platform with the Aptima Combo 2 assay. The platform is calibrated for a group size of c = 4. The sensitivity and specificity of the assay are Se:1 = 0.942 (Se:2 = 0.992) and Sp:1 = 0.976 (Sp:2 = 0.987) for CT (NG), respectively (Gen-Probe, San Diego).

In 2009, 14530 female swab specimens were tested individually at the NPHL. The employed assay was also the Aptima Combo 2 Assay. We were provided with the diagnosed result of each specimen for CT and NG. Based on these diagnoses, the approximate prevalences of CT and NG are 0.069 and 0.013, respectively. To reveal the benefits of pooling, we mimic the SHL screening practice in the most realistic way. We use a group size of c = 4, as used by the SHL. We then construct pools by assigning specimens according to their arrival time at the NPHL. Because the arrival times of specimens at the NPHL are random, our way of pooling is also random. We treat the diagnoses as "true" statuses and simulate a two-stage group testing data set using the above testing error rates. For comparison, we also simulate an individual testing data set using the same testing error rates. The considered covariates include age, prenatal, symptoms, cervical friability, pelvic inflammatory disease, cervicitis, multiple partners, new partner in the last 90 days, and contact with someone who has an STD. All covariates, except age, are binary. With these covariates on each individual, we first fit the individual diagnosis results, viewing them as the truth. The resulting estimates are used as the "reference" estimates. We then fit the individual testing data and the two-stage group testing data using the regression and variable selection methods previously described. In our analysis, we standardize age and code dichotomous covariates as either −0.5 or 0.5.

Table 5 summarizes the parameter estimates and variable selection results. The estimates from both testing protocols are close to the "reference" estimates, but the SEs under c = 4 are uniformly smaller than those under individual testing. The testing error rates are estimated accurately from the group testing data. In terms of variable selection, the reference shrinkage estimates identified different sets of significant risk factors for the two infections, where prenatal is significant for CT but not for NG. The same results are identified by the three shrinkage estimates based on the group testing data. However, based on the individual testing data, none of the three shrinkage estimates selects prenatal for CT. These comparisons reinforce our conclusion that, in addition to a significant cost reduction (i.e., it saves 14530 − 7737 = 6793 tests), the two-stage pooling protocol leads to more precise inference than individual testing while estimating the testing error rates simultaneously. In addition, we considered randomly assigning individuals to groups as in Section 5.1 and used group sizes varying from 2 to 10. The supplementary materials include these results, which reinforce the aforementioned conclusions on the advantages of the two-stage pooling protocol when compared to individual testing. We believe these numerical findings could encourage more laboratories to consider the two-stage pooling protocol.

TABLE 5.

The NPHL screening data analysis: parameter estimates (MLE), estimated standard errors (SE), and variable selection results (using the AIC, BIC, and ERIC criteria) from the reference estimates (Reference), the individual testing estimates (IT), and the SHL pooling estimates with a group size of 4 (c = 4). The number of tests under each protocol is provided as well.

Reference IT c = 4

number of tests 14530 (IT) 7737 (c = 4)

MLE(SE) AIC BIC ERIC MLE(SE) AIC BIC ERIC MLE(SE) AIC BIC ERIC
CT Intercept −1.382(0.241) −1.528(0.286) −1.269(0.262)
Age −0.559(0.045) −0.535(0.057) −0.561(0.051)
Prenatal 0.390(0.220) 0.141(0.291) × × × 0.480(0.229)
Symptoms 0.356(0.079) 0.324(0.095) 0.356(0.088)
Cervical F 0.065(0.163) −0.058(0.202) 0.003(0.182)
PID 0.443(0.392) 0.443(0.448) 0.492(0.427)
Cervicitis 0.611(0.106) 0.746(0.118) 0.645(0.116)
Multi Partner 0.476(0.099) 0.522(0.116) 0.532(0.109)
New Partner −0.069(0.091) −0.205(0.116) −0.067(0.102)
Contact STD 1.006(0.098) 1.023(0.111) 1.048(0.108)

NG Intercept −2.426(0.416) −2.727(0.595) −2.683(0.507)
Age −0.251(0.083) −0.278(0.112) × −0.258(0.087)
Prenatal 0.283(0.591) 0.003(0.929) −0.073(0.750)
Symptoms 1.202(0.164) 1.176(0.219) 1.234(0.174)
Cervical F 0.277(0.288) 0.290(0.327) 0.270(0.301)
PID 1.032(0.496) 0.719(0.635) 0.879(0.554)
Cervicitis 0.625(0.199) 0.746(0.225) 0.712(0.201)
Multi Partner 1.070(0.177) 0.894(0.216) 1.106(0.185)
New Partner −0.130(0.189) −0.060(0.229) −0.127(0.198)
Contact STD 1.405(0.173) 1.208(0.216) 1.402(0.180)

δ 0.573(0.030) 0.604(0.042) 0.563(0.033)

Se:1 = 0.942 0.922(0.016)
Se:2 = 0.992 0.989(0.029)
Sp:1 = 0.976 0.974(0.004)
Sp:2 = 0.987 0.985(0.002)

6. DISCUSSION

Motivated by the SHL CT/NG screening practice, we have developed a regression method for two-stage hierarchical pooling data. Our proposed technique jointly models the unobserved individual disease statuses and produces interpretable marginal inference for each infection. The assay sensitivity and specificity for each infection can be estimated as well. In addition, we developed a shrinkage estimator that consistently selects the truly relevant risk factors for each infection. To disseminate this work, R code implementing our new methodology is available upon request.

From the simulation studies and the CT/NG screening data analysis, it is exciting to observe that, compared to individual testing, the SHL pooling protocol can significantly reduce costs and yet produce more efficient regression estimators. An interesting future project would be to investigate theoretically how to construct groups so as to obtain the most efficient regression estimators for each infection within a budget limit. Intuitively, individuals with high probabilities of being infected should be tested individually and those with low probabilities could be tested in pools. But what is the criterion to differentiate between high and low probabilities? How can these probabilities be known before the screening? For those tested in pools, what is the optimal pool size for inference? These are interesting but challenging questions to be answered in future work. Possible guidance could be found in References 46 and 17.

In our simulation studies, we used a Gumbel copula, chosen for two reasons: 1) compared to Gaussian copulas, it has an analytic expression, which facilitates computation; 2) it delivers robust estimates of the regression coefficients and misclassification parameters even when the true copula is not Gumbel. To reveal this robustness, we have included a simulation study in the supplementary materials. In practice, users are welcome to choose other copulas, such as the Gaussian, Clayton, or Frank copula.33 In addition, the logit link for the $g_k$'s could be replaced by the probit or complementary log-log link. Our GEM algorithm is general enough to incorporate those choices.

Though this work mainly focuses on two infections, the model can be extended to incorporate more infections. For example, suppose there are three infections, so that $\tilde{Y}_{ij} = (\tilde{Y}_{ij1}, \tilde{Y}_{ij2}, \tilde{Y}_{ij3})^T$. A joint model for $\tilde{Y}_{ij} \mid x_{ij}$ is built by assuming that there exists a random vector $U_{ij} = (U_{ij1}, U_{ij2}, U_{ij3})^T$, whose distribution function is a three-dimensional copula $C(u_1, u_2, u_3 \mid \delta)$, such that the event $\{\tilde{Y}_{ijk} = 1 \mid x_{ij}\}$ is equivalent to $\{U_{ijk} \leq g_k(x_{ij}^T\beta_k)\}$ for $k = 1, 2, 3$. Consequently, the marginal regression model (1) naturally holds for each disease, and the cell probabilities of $\tilde{Y}_{ij} \mid x_{ij}$ can be calculated in terms of $C$; e.g., $\mathrm{pr}(\tilde{Y}_{ij1}=1, \tilde{Y}_{ij2}=1, \tilde{Y}_{ij3}=0 \mid x_{ij}) = C\{g_1(x_{ij}^T\beta_1), g_2(x_{ij}^T\beta_2), 1 \mid \delta\} - C\{g_1(x_{ij}^T\beta_1), g_2(x_{ij}^T\beta_2), g_3(x_{ij}^T\beta_3) \mid \delta\}$. Our GEM algorithm can be generalized to incorporate more than two infections as well. We omit the details but include some simulation results in the supplementary materials to demonstrate this generalizability.
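As a small illustration of this extension, the R sketch below evaluates one three-infection cell probability under an exchangeable Gumbel-type three-dimensional copula; the choice of this particular copula is our own assumption, since the extension only requires some three-dimensional C.

    # A minimal sketch (illustrative copula choice, not prescribed by the paper).
    gumbel3 <- function(u, delta) exp(-sum((-log(u))^(1 / delta))^delta)
    cell_prob_110 <- function(x, beta1, beta2, beta3, delta) {
      g <- plogis(c(sum(x * beta1), sum(x * beta2), sum(x * beta3)))
      # pr(Y1 = 1, Y2 = 1, Y3 = 0 | x) = C(g1, g2, 1) - C(g1, g2, g3)
      gumbel3(c(g[1], g[2], 1), delta) - gumbel3(g, delta)
    }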

Lastly, we discuss the three assumptions (Assumptions 1–3) on the assay sensitivity and specificity and possible ways to relax them. For Assumption 1, when the assay uses the concentration level of a specific biological marker (biomarker) to make a diagnosis, mixing a positive specimen with negative ones could dilute the concentration and thus change the assay sensitivity and specificity significantly as the group size changes. This "dilution effect" can be taken into consideration if the distribution of the biomarker concentration is provided in advance.47,48 To relax Assumption 2, one could use a multinomial distribution to account for the cross-disease dependence of the testing outcomes given the true statuses. The number of misclassification parameters then increases from 4 to 12 when the number of diseases is two. One could modify the GEM algorithm to estimate the twelve parameters along with the regression; however, some of these parameters may require an impractically large sample size to be estimated accurately. The last assumption can be relaxed by assuming a covariate-adjusted model for the misclassification parameters.49 But caution must be taken regarding model identifiability when the covariate-adjusted misclassification parameters are to be estimated along with the regression.


TABLE 2.

Summary statistics of the 500 MLEs obtained under S2, including the sample mean (Mean), the sample standard deviation (SD), the average of the estimated standard errors (SE), and the empirical coverage (EC) of 95% confidence intervals under either individual testing (IT) or the SHL pooling with c = 2, 5, 10. The average number of tests (# of tests) under each protocol is also provided. The prevalences (averaged over 500 repetitions) of the first and the second infections are 6.77% and 9.98%, respectively.

IT c = 2 c = 5 c = 10

# tests 3000 2493 2312 2678

Truth Mean(SD) EC(SE) Mean(SD) EC(SE) Mean(SD) EC(SE) Mean(SD) EC(SE)
β10 −4 −4.05(0.25) 0.95(0.26) −4.02(0.20) 0.95(0.19) −4.03(0.19) 0.97(0.20) −4.03(0.21) 0.96(0.21)
β11 −2 −2.03(0.19) 0.96(0.20) −2.02(0.15) 0.96(0.15) −2.03(0.16) 0.96(0.16) −2.02(0.17) 0.95(0.17)
β12 2 2.03(0.19) 0.94(0.20) 2.02(0.16) 0.95(0.16) 2.02(0.16) 0.96(0.16) 2.02(0.17) 0.95(0.17)
β13 0 0.00(0.13) 0.95(0.14) 0.00(0.12) 0.95(0.12) 0.00(0.12) 0.96(0.12) 0.00(0.12) 0.95(0.13)
β14 0 0.00(0.13) 0.96(0.14) 0.00(0.12) 0.95(0.12) 0.00(0.12) 0.96(0.12) −0.01(0.13) 0.95(0.13)
β15 0 0.00(0.14) 0.95(0.14) 0.00(0.11) 0.96(0.12) 0.00(0.12) 0.95(0.12) 0.00(0.13) 0.95(0.13)

β20 −5 −5.06(0.36) 0.94(0.35) −5.04(0.26) 0.96(0.27) −5.03(0.28) 0.97(0.29) −5.04(0.32) 0.94(0.33)
β21 −2 −2.04(0.20) 0.95(0.20) −2.03(0.16) 0.97(0.17) −2.02(0.17) 0.97(0.17) −2.03(0.19) 0.95(0.19)
β22 0 0.01(0.13) 0.97(0.13) 0.00(0.12) 0.95(0.12) 0.01(0.12) 0.96(0.12) 0.01(0.13) 0.95(0.13)
β23 −2 −2.04(0.20) 0.95(0.20) −2.03(0.17) 0.96(0.17) −2.02(0.17) 0.95(0.17) −2.02(0.19) 0.94(0.18)
β24 0 0.01(0.14) 0.93(0.13) 0.01(0.12) 0.93(0.12) 0.00(0.12) 0.94(0.12) 0.00(0.13) 0.95(0.13)
β25 0 0.01(0.13) 0.95(0.13) 0.01(0.12) 0.95(0.12) 0.00(0.12) 0.95(0.12) 0.01(0.12) 0.96(0.13)

δ 0.3 0.30(0.08) 0.99(0.11) 0.30(0.06) 0.97(0.07) 0.30(0.07) 0.97(0.07) 0.30(0.08) 0.97(0.08)

Se:1 0.95  –  – 0.95(0.02) 0.93(0.02) 0.95(0.02) 0.93(0.02) 0.95(0.02) 0.92(0.02)
Se:2 0.95  –  – 0.95(0.02) 0.95(0.01) 0.95(0.02) 0.92(0.01) 0.95(0.02) 0.91(0.02)
Sp:1 0.95  –  – 0.95(0.01) 0.97(0.01) 0.95(0.01) 0.95(0.01) 0.95(0.01) 0.92(0.01)
Sp:2 0.95  –  – 0.95(0.01) 0.92(0.01) 0.95(0.01) 0.93(0.01) 0.95(0.01) 0.92(0.01)

ACKNOWLEDGMENTS

This article was supported by Grant R03 AI135614 from the National Institutes of Health. The authors thank an Associate Editor and three reviewers for their constructive comments, which led to a better presentation of this work, and Drs. Joshua M. Tebbs and Christopher S. McMahan for their insightful comments. The data that support the findings of this study are available on request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions.

Footnotes

SUPPORTING INFORMATION

Supplementary materials are available along with the submission. These materials contain:

  • a numerical comparison showing that ignoring the retesting outcomes could inflate the variance of estimators of the regression coefficients in Section 2;

  • the observed log-likelihood function and a numerical study showing the computational advantages of the GEM algorithm;

  • detailed derivations of the E-step and the observed data information matrix introduced in Section 3;

  • additional numerical results for other values of the Se:k's and Sp:k's (Section 5.1);

  • extensions of our method to fit individual testing data as discussed in Section 5.1;

  • additional results of the real data analysis in Section 5.2;

  • simulation studies that reveal the robustness of the Gumbel copula and demonstrate the generalizability of our method to more than two infections in Section 6.

References

  • 1.Centers for Disease Control and Prevention. 2016 STD surveillance report https://www.cdc.gov/std/stats16/default.htm; Last accessed April, 2018.
  • 2.Lewis JL, Lockary VM, Kobic S. Cost savings and increased efficiency using a stratified specimen pooling strategy for Chlamydia trachomatis and Neisseria gonorrhoeae. Sexually Transmitted Diseases 2012;39(1):46–48. [DOI] [PubMed] [Google Scholar]
  • 3.Samoff E, Koumans EH, Markowitz LE, et al. Association of Chlamydia trachomatis with persistence of high-risk types of human papillomavirus in a cohort of female adolescents. American Journal of Epidemiology 2005;162(7):668–675. [DOI] [PubMed] [Google Scholar]
  • 4.Centers for Disease Control and Prevention. STDs & Infertility https://www.cdc.gov/std/infertility/default.htm; Last accessed April, 2018.
  • 5.Jirsa S Pooling specimens: A decade of successful cost savings. In: National STD Prevention Conference; 2008. [Google Scholar]
  • 6.Dorfman R The detection of defective members of large populations. The Annals of Mathematical Statistics 1943;14(4):436–440. [Google Scholar]
  • 7.Stramer SL, Krysztof DE, Brodsky JP, et al. Comparative analysis of triplex nucleic acid test assays in United States blood donors. Transfusion 2013;53:2525–2537. [DOI] [PubMed] [Google Scholar]
  • 8.Edouard S, Prudent E, Gautret P, Memish ZA, Raoult D. Cost-effective pooling of DNA from nasopharyngeal swab samples for large-scale detection of bacteria by real-time PCR. Journal of Clinical Microbiology 2015;53(3):1002–1004. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Hill JA, Hall Sedlak R, Magaret A, et al. Efficient identification of inherited chromosomally integrated human herpesvirus 6 using specimen pooling. Journal of Clinical Virology 2016;77:71–76. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Gastwirth JL. The efficiency of pooling in the detection of rare mutations. American Journal of Human Genetics 2000;67(4):1036–1039. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Zanzi CA, Johnson WO, Thurmond MC, Hietala SK. Pooled-sample testing as a herd-screening tool for detection of bovine viral diarrhea virus persistently infected cattle. Journal of Veterinary Diagnostic Investigation 2000;12(3):195–203. [DOI] [PubMed] [Google Scholar]
  • 12.Venette RC, Moon RD, Hutchison WD. Strategies and statistics of sampling for rare individuals. Annual Review of Entomology 2002;47(1):143–174. [DOI] [PubMed] [Google Scholar]
  • 13.Dodd RY, Notari EP 4th, Stramer SL. Current prevalence and incidence of infectious disease markers and estimated window-period risk in the American Red Cross donor population. Transfusion 2002;42(8):975–979. [DOI] [PubMed] [Google Scholar]
  • 14.Remlinger KS, Hughes-Oliver JM, Young SS, Lam RL. Statistical design of pools using optimal coverage and minimal collision. Technometrics 2006;48(1):133–143. [Google Scholar]
  • 15.Kim HY, Hudgens MG, Dreyfuss JM, Westreich DJ, Pilcher CD. Comparison of group testing algorithms for case identification in the presence of test error. Biometrics 2007;63(4):1152–1163. [DOI] [PubMed] [Google Scholar]
  • 16.Liu A, Liu C, Zhang Z, Albert PS. Optimality of group testing in the presence of misclassification. Biometrika 2012;99(1):245–251. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Huang SH, Huang MN Lo, Shedden K, Wong WK. Optimal group testing designs for estimating prevalence with uncertain testing errors. Journal of the Royal Statistical Society: Series B 2017;79(5):1547–1563. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Vansteelandt S, Goetghebeur E, Verstraeten T. Regression models for disease prevalence with diagnostic tests on pools of serum samples. Biometrics 2000;56(4):1126–1133. [DOI] [PubMed] [Google Scholar]
  • 19.Xie M Regression analysis of group testing samples. Statistics in Medicine 2001;20(13):1957–1969. [DOI] [PubMed] [Google Scholar]
  • 20.Chen P, Tebbs JM, Bilder CR. Group testing regression models with fixed and random effects. Biometrics 2009;65(4):1270–1278. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.McMahan CS, Tebbs JM, Hanson TE, Bilder CR. Bayesian regression for group testing data. Biometrics 2017;73:1443–1452. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Delaigle A, Meister A. Nonparametric regression analysis for group testing data. Journal of the American Statistical Association 2011;106(494):640–650. [Google Scholar]
  • 23.Delaigle A, Hall P, Wishart JR. New approaches to nonparametric and semiparametric regression for univariate and multivariate group testing data. Biometrika 2014;101(3):567–585. [Google Scholar]
  • 24.Xiao X, Zhai J, Zeng J, Tian C, Wu H, Yu Y. Comparative evaluation of a triplex nucleic acid test for detection of HBV DNA, HCV RNA, and HIV-1 RNA, with the Procleix Tigris System. Journal of Virological Methods 2013;187(2):357–361. [DOI] [PubMed] [Google Scholar]
  • 25.Hughes-Oliver JM, Rosenberger WF. Efficient estimation of the prevalence of multiple rare traits. Biometrika 2000;87(2):315–327. [Google Scholar]
  • 26.Tebbs JM, McMahan CS, Bilder CR. Two-stage hierarchical group testing for multiple infections with application to the Infertility Prevention Project. Biometrics 2013;69(4):1064–1073. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Warasi MS, Tebbs JM, McMahan CS, Bilder CR. Estimating the prevalence of multiple diseases from two-stage hierarchical pooling. Statistics in Medicine 2016;35(21):3851–3864. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Li Q, Liu A, Xiong W. D-optimality of group testing for joint estimation of correlated rate diseases with misclassification. Statistica Sinica 2017;27(2):823–838. [Google Scholar]
  • 29.Zhang B, Bilder CR, Tebbs JM. Regression analysis for multiple-disease group testing data. Statistics in Medicine 2013;32(28):4954–4966. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Wu CJ. On the convergence properties of the EM algorithm. The Annals of Statistics 1983;11:95–103. [Google Scholar]
  • 31.Neal RM, Hinton GE. A view of the EM algorithm that justifies incremental, sparse, and other variants. In: Learning in Graphical Models. Springer; 1998:355–368. [Google Scholar]
  • 32.Gregory KB, Wang D, McMahan CS. Adaptive elastic net for group testing. Biometrics, in press. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Nelsen RB. An introduction to copulas. Springer Science & Business Media; 2007. [Google Scholar]
  • 34.Lehmann EL. Theory of point estimation. Pacific Grove, CA: Wadsworth and Brooks/Cole; 1983. [Google Scholar]
  • 35.Louis TA. Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society: Series B 1982;44:226–233. [Google Scholar]
  • 36.Zou H The adaptive lasso and its oracle properties. Journal of the American Statistical Association 2006;101(476):1418–1429. [Google Scholar]
  • 37.Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. The Annals of Statistics 2004;32(2):407–499. [Google Scholar]
  • 38.Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 2001;96(456):1348–1360. [Google Scholar]
  • 39.Wang H, Leng C. Unified LASSO estimation by least squares approximation. Journal of the American Statistical Association 2007;102(479):1039–1048. [Google Scholar]
  • 40.Schwarz G Estimating the dimension of a model. The Annals of Statistics 1978;6(2):461–464. [Google Scholar]
  • 41.Wang H, Li B, Leng C. Shrinkage tuning parameter selection with a diverging number of parameters. Journal of the Royal Statistical Society: Series B 2009;71(3):671–683. [Google Scholar]
  • 42.Minnier J, Tian L, Cai T. A perturbation method for inference on regularized regression estimates. Journal of the American Statistical Association 2011;106(496):1371–1382. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Gumbel EJ. Bivariate exponential distributions. Journal of the American Statistical Association 1960;55(292):698–707. [Google Scholar]
  • 44.Akaike H A new look at the statistical model identification. IEEE Transactions on Automatic Control 1974;19(6):716–723. [Google Scholar]
  • 45.Hui FK, Warton DI, Foster SD. Tuning parameter selection for the adaptive LASSO using ERIC. Journal of the American Statistical Association 2015;110(509):262–269. [Google Scholar]
  • 46.McMahan CS, Tebbs JM, Bilder CR. Informative Dorfman screening. Biometrics 2012;68(1):287–296. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Wang D, McMahan CS, Gallagher CM. A general parametric regression framework for group testing data with dilution effects. Statistics in Medicine 2015;34(27):3606–3621. [DOI] [PubMed] [Google Scholar]
  • 48.Wang D, McMahan CS, Tebbs JM, Bilder CR. Group testing case identification with biomarker information. Computational Statistics & Data Analysis 2018;122:156–166. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Janes H, Pepe MS. Adjusting for covariates in studies of diagnostic, screening, or prognostic markers: an old concept in a new setting. American Journal of Epidemiology 2008;168(1):89–97. [DOI] [PubMed] [Google Scholar]
