Abstract
In the group testing procedure, several individual samples are grouped and the pooled samples, instead of each individual sample, are tested for outcome status (e.g., infectious disease status). Although this cost-effective data collection strategy saves both labor and time, it poses statistical challenges for deriving statistically and computationally efficient estimators under semiparametric models. We consider semiparametric isotonic regression models for the simultaneous estimation of the conditional probability curve and covariate effects, in which a parametric form for combining the covariate information is assumed and the monotonic link function is left unspecified. We develop an expectation-maximization algorithm to overcome the computational challenge and embed the pool-adjacent violators algorithm in the M-step to facilitate the computation. We establish the large sample behavior of the proposed estimators and examine their finite sample performance in simulation studies. We apply the proposed method to data from the National Health and Nutrition Examination Survey for illustration.
Keywords: Expectation-maximization algorithm, group testing data, isotonic regression, pool-adjacent violators algorithm
1. INTRODUCTION
Group testing has been commonly used to save time and cost in screening for infectious diseases (Dorfman, 1943; Lewis et al., 2012; Van et al., 2012), drug discovery (Remlinger et al., 2006), quality control (Johnson and Pearson, 1999) and genetic studies (Gastwirth, 2000). When screening for infectious diseases, group testing begins by collecting specimens (e.g., blood, urine, and plasma) from individuals; these samples are then physically combined to form a pooled specimen. The pooled specimen is then tested for the infection of interest. Accordingly, the test response provides evidence of whether or not the pool contains at least one individual who has a positive status for the disease of interest. If a pool tests negative, all individuals within the pool are declared to have negative status for the infection of interest. It is known that group testing is a cost-effective strategy when the disease prevalence is low. Furthermore, the utility of array-based group testing algorithms has been shown to reduce rates of misclassification (Kim and Hudgens, 2009). Another area where group testing methods have been found useful is biomarker evaluation, in which pooled samples yield the average of the biomarker concentrations over the pooled individuals (Bondell et al., 2007). The cost-effectiveness of group testing does not, however, come without a price. The use of group testing creates challenges for statistical analysis due to the absence of individual responses.
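To make the cost savings at low prevalence concrete, here is a small Python sketch (the function name is ours, not from the paper) of the expected number of tests per person under Dorfman's two-stage pooling, assuming independent infections with a common prevalence p:

```python
# Expected tests per individual under Dorfman two-stage pooling,
# assuming independent infections with prevalence p and pool size k.
def dorfman_tests_per_person(p: float, k: int) -> float:
    # One pooled test is shared by k people; if the pool is positive,
    # which happens with probability 1 - (1 - p)^k, all k members are
    # retested individually.
    return 1.0 / k + (1.0 - (1.0 - p) ** k)

# At 2% prevalence with pools of 10, this is roughly 0.28 tests per
# person instead of 1, illustrating the savings at low prevalence.
```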
There is an extensive literature on estimation procedures for group testing, most of which has focused on nonparametric procedures (Delaigle and Meister, 2011; Delaigle et al., 2012; Wang et al., 2013; Mitchell et al., 2014; Delaigle and Zhou, 2015). In many studies that use the group testing procedure, one or several covariates are also available, and it is of interest to estimate the conditional probability of the response given such covariate information. In a parametric regression model, the shape of the link function is assumed to be known up to finitely many unknown regression parameters (Xie, 2001; Zhang et al., 2013). However, in many situations, researchers have no prior information regarding the mathematical specification of the true link function. When the assumed link function cannot capture the underlying shape of the conditional probability, the fitted results have undesirable properties, such as inaccurate estimators and misleading inference. Delaigle et al. (2014) relaxed the parametric regression models and developed semiparametric models by using a nonparametric weight function, which requires selection of a bandwidth.
Isotonic regression is a least squares problem under order restrictions. Pioneering work was conducted by Ayer et al. (1955), and a comprehensive review can be found in Barlow et al. (1972). A unique solution to standard isotonic regression exists and can be obtained using the pool-adjacent violators algorithm (PAVA) (Barlow et al., 1972; Best and Chakravarti, 1990). The computational aspects of PAVA and its fast implementation in R are discussed in Mair et al. (2009). Standard isotonic regression with PAVA cannot be directly applied to group testing data, since individual responses are not available. Our strategy is to use a semiparametric isotonic regression model and a computationally appealing algorithm by integrating PAVA and the expectation-maximization (EM) algorithm.
The remaining sections of this paper are as follows. In Section 2, we introduce the semiparametric isotonic regression model and then describe the proposed computational algorithm for the estimation. In Section 3, we establish the asymptotic properties of the estimators. We conduct simulation studies to assess the finite sample performance of the proposed algorithm, and compare it with existing estimating methods in Section 4. We apply the proposed method to data from the National Health and Nutrition Examination Survey (NHANES) in Section 5 and provide a brief discussion in Section 6.
2. MODELS AND ESTIMATION
Suppose there are N subjects to be tested and each subject is assigned to exactly one of n pools. The binary outcome Yij represents the true binary status for subject i in group j, which cannot be observed directly. Here, Yij = 1 indicates that the response of this subject is positive, and Yij = 0 otherwise. Let Y*j = max{Y1j, …, Ynjj} denote the observed binary status for the jth pool with a size of nj, N = n1 + ⋯ + nn; i.e., Y*j = 1 if the pool contains at least one positive subject, and Y*j = 0 otherwise. For each subject in the pool, a vector of subject characteristics Xij = (xij,1, …, xij,d)T is observed, i = 1, …, nj; j = 1, …, n. To acknowledge heterogeneity, we assume the mean of Yij is related to the subject's characteristics through a monotonic function,
E(Yij | Xij) = π{v(Xij; β)}, | (1) |
where π(·) denotes the unknown monotonic nondecreasing function, v(Xij; β) is a pre-specified function of the subjects’ characteristics and β = (β1, …, βd)T is a finite-dimensional parameter. A commonly used linear model summarizes the subject information as v(Xij; β) = βTXij. Due to the unspecified link function, we need to impose an additional constraint on the regression coefficients to ensure model identifiability. It has been well studied that such a model is identifiable if either one of the components of β is fixed or if ‖β‖ = 1 and the first component of β > 0 (Ichimura, 1993; Hristache et al., 2001; Groeneboom and Hendrickx, 2019). We chose the first option due to its computational ease.
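As an illustration of the data structure, the following Python sketch (hypothetical helper name; a logistic link is used only as a stand-in, since the method leaves π unspecified) simulates pooled outcomes under model (1) with the covariate design used later in Section 4:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_group_testing(n_pools=500, pool_size=2, beta=(1.0, 0.5, 0.9)):
    """Simulate pooled outcomes under model (1); the logistic link and
    coefficient values here are illustrative assumptions."""
    N = n_pools * pool_size
    X = np.column_stack([
        rng.standard_normal(N),     # X1 ~ N(0, 1)
        rng.binomial(1, 0.5, N),    # X2 ~ Bernoulli(0.5)
        rng.uniform(-3, 3, N),      # X3 ~ Uniform(-3, 3)
    ])
    v = X @ np.asarray(beta)        # linear index v(X; beta) = beta^T X
    pi = 1.0 / (1.0 + np.exp(-v))   # a monotone link, for illustration only
    Y = rng.binomial(1, pi)         # latent individual statuses Y_ij
    pools = Y.reshape(n_pools, pool_size)
    Y_star = pools.max(axis=1)      # pooled test: positive iff any member is
    return X, Y_star

X, Y_star = simulate_group_testing()
```

Only X and Y_star would be available to the analyst; the individual Y values are latent, which is exactly what the EM algorithm below addresses.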
2.1. Estimation algorithm
The mass-density of the observed data from pool j, (y*j, xij, i = 1, …, nj), is

P(y*j | x1j, …, xnjj; β, π) ∏i=1nj g(xij) = [1 − ∏i=1nj {1 − π(v(xij; β))}]^{y*j} [∏i=1nj {1 − π(v(xij; β))}]^{1−y*j} ∏i=1nj g(xij),

where g(·) is the density function of the covariate. Denote O = {(y*j, xij), i = 1, …, nj; j = 1, …, n} for the observed data. Leaving out g(·), we have the log-likelihood of the observed data

ℓ(β, π | O) = Σj=1n [y*j log{1 − ∏i=1nj (1 − π{v(xij; β)})} + (1 − y*j) Σi=1nj log{1 − π(v(xij; β))}]. | (2) |
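For concreteness, here is a Python sketch (hypothetical helper name) of evaluating the observed-data log-likelihood in Equation (2), given the fitted values π{v(xij; β)} for each individual:

```python
import numpy as np

def obs_loglik(pi_vals, y_star, pool_index):
    """Observed-data log-likelihood of Equation (2).

    pi_vals:    pi{v(x_ij; beta)} for each individual (length N)
    y_star:     pooled outcomes, one per pool (length n)
    pool_index: pool label j for each individual (length N)
    """
    ll = 0.0
    for j, yj in enumerate(y_star):
        # probability that every member of pool j is negative
        q = np.prod(1.0 - pi_vals[pool_index == j])
        ll += np.log(1.0 - q) if yj == 1 else np.log(q)
    return ll
```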
Let B be the range of β, and Π = {π(·) on R: 0 ≤ π(·) ≤ 1, π(·) is monotone nondecreasing}. Then, (β, π) can be estimated by maximizing the above likelihood, subject to the monotonic constraint for π. However, directly maximizing the likelihood in Equation (2) is computationally challenging, due to the nonparametric part involved. Alternatively, to fully use the existing computing algorithm for isotonic regression, we treat the unobserved data Yij as the missing data and use the EM algorithm to maximize the log-likelihood of the complete data,

ℓc(β, π) = Σj=1n Σi=1nj [Yij log π{v(xij; β)} + (1 − Yij) log{1 − π(v(xij; β))}],

subject to the monotonic constraint for π. Following the principle of the EM algorithm, the following iterative steps are used.
E-step: Given an initial estimator (β(0), π(0)) and O, calculate the expected complete log-likelihood function,
E(ℓc | O; β(0), π(0)) = Σj=1n Σi=1nj [wij log π{v(xij; β)} + (1 − wij) log{1 − π(v(xij; β))}], | (3) |

where wij = E(Yij | O; β(0), π(0)). If y*j = 0, then wij = 0; if y*j = 1, then wij = π(0){v(xij; β(0))} / [1 − ∏k=1nj (1 − π(0){v(xkj; β(0))})].
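The conditional expectations in the E-step have closed forms; a Python sketch (hypothetical helper name) is:

```python
import numpy as np

def e_step_weights(pi_vals, y_star, pool_index):
    """Conditional expectations w_ij = E(Y_ij | O) at the current (beta, pi).

    If the pool tested negative, every member is known to be negative
    (w_ij = 0). If it tested positive, Bayes' rule gives
    w_ij = pi_ij / P(pool positive).
    """
    w = np.zeros_like(pi_vals, dtype=float)
    for j, yj in enumerate(y_star):
        members = (pool_index == j)
        if yj == 1:
            p_pos = 1.0 - np.prod(1.0 - pi_vals[members])
            w[members] = pi_vals[members] / p_pos
    return w
```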
M-step: Maximize the conditional expectation of the log-likelihood of the complete data from the E-step, and update the estimates for the unknown parameters. Note that directly maximizing the likelihood in Equation (3) with the monotonic constraint for π is not straightforward. Considering that the likelihood in Equation (3) belongs to the exponential family, we can apply PAVA to simplify the computational task. Theorems 1.5.1 and 1.5.2 in Robertson (1988) unify the theory of order-restricted inference, including the maximization of the likelihood function within the exponential family. In our setting, we can easily verify that the regularity conditions in the Theorems are satisfied. It follows that maximizing the likelihood in Equation (3) under the monotonic constraint is equivalent to minimizing the following sum of squares, denoted as Q(β, π) under the same constraint:
Q(β, π) = Σk=1N [w(k) − π{v(k)(β)}]²,

where w(k) denotes the E-step weight wij paired with v(k)(β), and v(1)(β), …, v(N)(β) are given in the following step. The minimization can be carried out through the following three steps:
Step 1. For a given β, sort v(xij; β), i = 1, …, nj, j = 1, …, n from the smallest to the largest, denoted as v(1)(β) ≤ v(2)(β) ≤ ⋯ ≤ v(N)(β).
Step 2. Use PAVA to minimize the objective function Q(β, π) with respect to π(·), subject to the condition that π{v(1)(β)} ≤ π{v(2)(β)} ≤ ⋯ ≤ π{v(N)(β)} whenever v(1)(β) ≤ v(2)(β) ≤ ⋯ ≤ v(N)(β).
Step 3. Minimize Q(β, π), with π(·) fixed at the minimizer from Step 2, with respect to β, and denote the resulting minimizer as (β(1), π(1)). In fact, β(1) is a profile maximum likelihood estimator (MLE) at this iteration. See, for example, Severini and Wong (1992) and Murphy and Van der Vaart (2000) for the properties of such an estimator.
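Step 2 relies on PAVA; a minimal textbook implementation in Python (a sketch, not the authors' code) is:

```python
def pava(y, w=None):
    """Pool-adjacent-violators: weighted least-squares fit of a
    nondecreasing sequence to y (a minimal textbook implementation)."""
    n = len(y)
    w = [1.0] * n if w is None else list(w)
    # each block holds [weighted mean, total weight, count]
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # merge adjacent blocks while monotonicity is violated
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(w1 * m1 + w2 * m2) / wt, wt, c1 + c2])
    fit = []
    for m, _, c in blocks:
        fit.extend([m] * c)
    return fit
```

Each merged block receives the weighted average of its members, which is exactly the closed-form solution of the order-restricted least squares problem on that block.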
After obtaining the r-th iteration estimator (β(r), π(r)) we re-calculate the E-step E(lc|O; β(r), π(r)) and then obtain the next iteration estimator (β(r+1), π(r+1)) by the M-step. With a given convergence criterion, we can iterate the E-step and M-step until the convergence criterion is satisfied.
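Putting the pieces together, one EM pass with β held fixed can be sketched in Python as follows (hypothetical helper name; the profile update of β via a generic optimizer is omitted for brevity):

```python
import numpy as np

def em_iteration(v, y_star, pool_index, pi_vals):
    """One EM pass for pi with beta held fixed: E-step weights, then an
    isotonic (PAVA) fit of the weights along the sorted index v."""
    # E-step: w_ij = E(Y_ij | observed pooled outcomes)
    w = np.zeros_like(pi_vals)
    for j, yj in enumerate(y_star):
        m = pool_index == j
        if yj == 1:
            w[m] = pi_vals[m] / (1.0 - np.prod(1.0 - pi_vals[m]))
    # M-step: isotonic least squares of w on v (PAVA in sorted order)
    order = np.argsort(v)
    blocks = []  # each block holds [mean, count]
    for wi in w[order]:
        blocks.append([wi, 1])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, c2 = blocks.pop()
            m1, c1 = blocks.pop()
            blocks.append([(c1 * m1 + c2 * m2) / (c1 + c2), c1 + c2])
    fit_sorted = np.concatenate([[m] * c for m, c in blocks])
    pi_new = np.empty_like(pi_vals)
    pi_new[order] = fit_sorted
    return pi_new
```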
3. ASYMPTOTIC PROPERTIES
Without loss of generality, we assume the group size nj to be 2 and n = N/2 for notational simplicity. We establish the asymptotic properties of the estimator of the true parameters (β0, π0). Let ℓ(β, π|o) denote the log-likelihood based on o = {y*, x}, x = (x1, x2), let ℓβ(β, π|o) = ∂ℓ(β, π|o)/∂β, and let ℓπ(β, π|o)[h] be the Hadamard derivative of ℓ(β, π|o) with respect to π in the direction h. As a convention, O and X etc. denote random vectors, and o and x etc. denote observations. Denote
(in the above, x3−j = x2 for j = 1, and x3−j = x1 for j = 2).
We assume the following regularity conditions and summarize the asymptotic properties in Theorems 1 and 2, the proofs of which we sketch in the Appendix.
(C1). B is bounded.
(C2). The derivatives ℓβ(β, π|O) and ℓπ(β, π|O)[h] exist and have finite second moments. Some discussion of this condition is provided in the Supplementary Materials.
(C3). ℓβ(β, π|o) and ℓπ(β, π|o)[h] are Lipschitz, uniformly for π in a neighborhood of π0.
(C4). ℓβ(β, π|o) and ℓπ(β, π|o)[h] are second-order differentiable and L1(P) bounded, uniformly for π in a neighborhood of π0.
Theorem 1. Under regularity conditions (C1)-(C2), the estimators are consistent: d(β̂n, π̂n; β0, π0) = ‖β̂n − β0‖ + supt∈R |π̂n(t) − π0(t)| → 0 in probability.
Theorem 2. Under regularity conditions (C1)-(C4), n1/3 d(β̂n, π̂n; β0, π0) = OP(1), where d(β, π; β0, π0) = ‖β − β0‖ + ‖π − π0‖L2(P), with ‖·‖ the Euclidean distance.
Remark 1. When β is known, it can be shown that, under suitable regularity conditions, n1/3{π̂(v) − π0(v)} converges in distribution to A(v) argmaxt{B(t) − t²} for some A(v), with B(·) being the two-sided Brownian motion starting from zero. This implies, for each fixed v, π̂(v) − π0(v) = OP(n−1/3). When β is unknown and must be estimated, we should still have π̂(v) − π0(v) = OP(n−1/3). Thus the convergence rate n1/3 given in Theorem 2 is optimal.
Remark 2. As β̂ is bundled together with π̂, which is a non-smooth step function, the asymptotic distribution of β̂ and its weak limit is still an open problem (see Section 3.2.2 of Huang and Wellner (1997)). This, in turn, affects the derivation of the asymptotic distribution of π̂, which is also currently open. The conjecture is that, as long as β̂ converges fast enough, the weak limit of π̂ is the same as in the known-β case. But this fact needs to be justified, which is not easy.
4. SIMULATION RESULTS
We conducted simulation studies to evaluate the finite sample performance of the proposed method and compared it to that of Delaigle et al. (2014) and the standard logistic regression. We generated group testing data with a group size of 2 from n pairs, which were set to be 100, 250, 500, and 5000, representing scenarios with small (N = 200), moderate (N = 500 or 1000), or large (N = 10000) sample sizes. The covariate vector included three components (X1, X2, X3), where X1 followed a standard normal distribution, X2 followed a Bernoulli distribution with a probability of 0.5, and X3 followed a uniform distribution on [−3, 3]. Two models were considered to generate the binary outcome:
P(Yij = 1 | Xij) = exp(Xij,1 + 0.5Xij,2 + 0.9Xij,3) / {1 + exp(Xij,1 + 0.5Xij,2 + 0.9Xij,3)}; | (i) |
| (ii) |
Table 1 summarizes the average estimates and empirical standard deviations of β obtained by the proposed method (left panel), the method of Delaigle et al. (2014) (middle panel), and the logistic regression method (right panel). Under both scenarios, our proposed method clearly outperformed the method of Delaigle et al. (2014), with both smaller empirical biases and smaller standard deviations, especially under scenarios with small (200 or 500) or moderate (1000) sample sizes. For example, under scenario (ii) with a sample size of 200, the standard deviation of the estimate of β2 by Delaigle et al. (2014) was 2.25 times larger than that of the proposed method. When the sample size increased to 1000 or 10000, the efficiency gain of the proposed method over that of Delaigle et al. (2014) remained, with the ratio of the standard deviation of the estimate of β2 by Delaigle et al. (2014) over that of the proposed method being 2.28 and 1.74, respectively. Under Scenario (i), as expected, the logistic regression method outperformed the other two methods, since the logistic regression method used the information on the underlying link function (logistic curve), while the other two semiparametric methods did not. However, under Scenario (ii), the logistic regression method produced substantial biases, due to the misspecified model assumptions. For example, when N = 1000, the bias of the estimate of β3 was 6.86 times larger than the corresponding empirical standard deviation, resulting in misleading inference.
Table 1:
Simulation results of the proposed method, the Delaigle (2014) method, and logistic regression under Scenarios (i) and (ii). MEAN: empirical average; SD: empirical standard deviation.
| Scenario | N | PARA | true | Proposed MEAN | Proposed SD | Delaigle MEAN | Delaigle SD | Logistic MEAN | Logistic SD |
|---|---|---|---|---|---|---|---|---|---|
| (i) | 200 | β2 | 0.5 | 0.487 | 0.990 | 0.585 | 1.673 | 0.449 | 0.864 |
| (i) | 200 | β3 | 0.9 | 1.048 | 0.649 | 1.203 | 1.213 | 1.003 | 0.498 |
| (i) | 500 | β2 | 0.5 | 0.537 | 0.604 | 0.582 | 0.961 | 0.561 | 0.562 |
| (i) | 500 | β3 | 0.9 | 0.983 | 0.487 | 1.116 | 0.729 | 0.928 | 0.254 |
| (i) | 1000 | β2 | 0.5 | 0.529 | 0.496 | 0.535 | 0.679 | 0.511 | 0.378 |
| (i) | 1000 | β3 | 0.9 | 0.985 | 0.366 | 1.009 | 0.450 | 0.923 | 0.171 |
| (i) | 10000 | β2 | 0.5 | 0.491 | 0.165 | 0.503 | 0.165 | 0.496 | 0.122 |
| (i) | 10000 | β3 | 0.9 | 0.907 | 0.123 | 0.926 | 0.135 | 0.898 | 0.051 |
| (ii) | 200 | β2 | 0.5 | 0.541 | 1.001 | 0.629 | 2.253 | 0.158 | 0.899 |
| (ii) | 200 | β3 | 0.9 | 0.909 | 0.858 | 0.977 | 1.602 | 0.296 | 0.260 |
| (ii) | 500 | β2 | 0.5 | 0.514 | 0.822 | 0.449 | 1.884 | 0.191 | 0.525 |
| (ii) | 500 | β3 | 0.9 | 1.013 | 0.787 | 0.943 | 1.134 | 0.291 | 0.159 |
| (ii) | 1000 | β2 | 0.5 | 0.446 | 0.785 | 0.567 | 1.786 | 0.156 | 0.375 |
| (ii) | 1000 | β3 | 0.9 | 1.016 | 0.645 | 1.201 | 0.955 | 0.279 | 0.105 |
| (ii) | 10000 | β2 | 0.5 | 0.527 | 0.252 | 0.585 | 0.437 | 0.162 | 0.114 |
| (ii) | 10000 | β3 | 0.9 | 0.944 | 0.190 | 0.985 | 0.234 | 0.276 | 0.032 |
Figures 1 and 3 display the average estimated curves with 95% empirical percentile confidence intervals under the two scenarios. For the proposed method, the empirical biases were relatively large at the right tail of the curves, but were in a reasonable range and decreased with increasing sample sizes. With a sample size of 1000 or 10000, the estimated curves were close to the true curve, with indistinguishable biases under both scenarios. Compared with the two semiparametric methods, the logistic regression model was very sensitive to the underlying true link function. In Figure 1, when the true link function is the logistic function, the estimated curves from the logistic regression were very close to the true curves. However, when the true curves deviated from the logistic function (Figure 3), the logistic regression could not provide reliable risk estimation due to the substantial bias in the estimated curves. Also, such biases did not decrease with increasing sample sizes. Regarding the variation of the estimated curves, we only considered the two semiparametric methods, for a fair comparison. The estimated curves obtained by our proposed method had smaller variations than those by the method of Delaigle et al. (2014) for small or moderate sample sizes. In particular, the estimated curves by Delaigle et al. (2014) had poor performance at the left tail (low risk region) of the curves due to data sparsity. Even when the sample size increased to 1000, the empirical confidence intervals for the low risk region were still substantially wider than those by the proposed method. Under the large sample size (N = 10000), both methods had similar performance. Figures 2 and 4 illustrate the median, as well as the first and third quartiles, of the 200 integrated squared errors obtained by the two methods, where the integrated squared error is defined as ∫ab {π̂(v) − π0(v)}² dv and [a, b] is the x-range of the figure.
Under the large sample size (N = 10000), the two methods had comparable performance, whereas our method outperformed the method of Delaigle et al. (2014) under small (N = 200 or 500) and moderate (N = 1000) sample sizes.
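The integrated squared error used in this comparison can be approximated by numerical quadrature; a Python sketch (hypothetical helper name, trapezoidal rule) is:

```python
import numpy as np

def integrated_squared_error(pi_hat, pi_true, a, b, num=1000):
    """Approximate ISE = integral over [a, b] of {pi_hat(v) - pi_true(v)}^2 dv
    by the trapezoidal rule; pi_hat and pi_true are vectorized callables."""
    grid = np.linspace(a, b, num)
    diff2 = (pi_hat(grid) - pi_true(grid)) ** 2
    # trapezoidal rule written out explicitly
    return float(np.sum(0.5 * (diff2[1:] + diff2[:-1]) * np.diff(grid)))
```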
Figure 1 (Scenario 1):

True curves (black solid), estimated curves from logistic regression (red dash), estimated curves from the proposed method (blue dash), and the Delaigle (2014) method (purple dash) with 95% shaded confidence intervals. Subplot (1), (3), (5), and (7): proposed method and logistic regression with sample sizes of 200, 500, 1000, and 10000 respectively; Subplot (2), (4), (6), and (8): Delaigle method and logistic regression with sample sizes of 200, 500, 1000, and 10000, respectively.
Figure 3 (Scenario 2):

True curves (black solid), estimated curves from logistic regression (red dash), estimated curves from the proposed method (blue dash), and the Delaigle (2014) method (purple dash) with 95% shaded confidence intervals. Subplot (1), (3), (5), and (7): proposed method and logistic regression with sample sizes of 200, 500, 1000, and 10000, respectively; Subplot (2), (4), (6), and (8): Delaigle method and logistic regression with sample sizes of 200, 500, 1000, and 10000, respectively.
Figure 2 (Scenario 1):

True curves (black solid) and estimated curves corresponding to the 50th (dotted), 25th (dashed), and 75th (dot-dash) percentile values of the 200 integrated squared errors. Subplot (1), (3), (5), and (7): proposed method with sample sizes of 200, 500, 1000, and 10000, respectively; Subplot (2), (4), (6), and (8): Delaigle (2014) method with sample sizes of 200, 500, 1000, and 10000, respectively.
Figure 4 (Scenario 2):

True curves (black solid) and estimated curves corresponding to the 50th (dotted), 25th (dashed), and 75th (dot-dash) percentile values of the 200 integrated squared errors. Subplot (1), (3), (5), and (7): proposed method with sample sizes of 200, 500, 1000, and 10000, respectively; Subplot (2), (4), (6), and (8): Delaigle (2014) method with sample sizes of 200, 500, 1000, and 10000, respectively.
We conducted simulation studies to evaluate the finite sample performance of the proposed method using data with a group size of 5. Table S2 and Figure S2 in the online supplementary materials summarize the simulation results. Overall, we observed similar simulation results, regardless of the group size.
In summary, the simulation studies suggest that the proposed method performs well for the simultaneous estimation of the conditional probability curve and covariate effects. Our method needs prior knowledge that the underlying probability curve is monotonic, which is a commonly used assumption in studies for deriving risk scores. When the underlying probability curve is monotonic and the sample size is small or moderate, our proposed method yields a substantial gain in statistical efficiency for estimating regression coefficients especially compared with the method of Delaigle et al. (2014), which was designed for a more general case without the monotonic constraint. In application, the two methods can provide valuable complementary information for robust estimation and improved efficiency.
5. APPLICATION
We used the same data as those in Delaigle and Meister (2011) and Delaigle et al. (2014) to illustrate the proposed method. NHANES is a large survey designed to assess the health and nutritional status of adults and children in the United States. The survey has been conducted as a series of questionnaires (Zipf et al., 2013). Details of the design and content of each survey are available from the NHANES website. We used the NHANES 1999-2000 data, downloaded from the NHANES website with a cutoff date of 11/15/2017. Although the data from NHANES are not grouped, they are ideal for illustrating the effect of data grouping on the estimation precision of parameters of interest. The data and computational codes with instructions are available in the Supplementary Materials.
Following the paper of Delaigle et al. (2014), we considered two covariates X = {X1, X2}, where X1 is the subject’s age and X2 is the subject’s total cholesterol level measured in a unit of 100 mg/dL. We aimed to estimate the probability of having the antibody to the hepatitis B virus core antigen (Y = 1) given the information from the two covariates. After removing the subjects with missing values, the sample size of the analytic data was 7064. Among those subjects, 339 subjects (4.8%) had the antibody. The median age was 27 years, with a range of 6-85 years, and the median total cholesterol level was 1.83 (100 mg/dL), with a range of 0.72-5.75. We applied isotonic regression with PAVA to the full cohort as the reference standard for the estimated parameters and estimated the standard error by using the bootstrap method with 200 re-samplings. We then grouped the data randomly using a group size of 2 and repeated the grouping process 200 times for estimating the standard errors. For each of the 200 randomly grouped data sets, we conducted the proposed method and the method of Delaigle et al. (2014) to estimate the probability P(Y = 1|x), assuming a linear combination of the two covariates. With the coefficient of age fixed at 1, the estimated coefficients of the cholesterol level given by the three methods were very close, and the associated inferences were consistent: 4.30 with a standard error (SE) of 1.86 by using the full data, 4.49 with an SE of 2.14 by the proposed method using the grouped data, and 4.50 with an SE of 2.26 by the method of Delaigle et al. (2014). As expected, the method using the full cohort had the smallest standard error and the method of Delaigle et al. (2014) had the largest standard error. Figure 5 presents the estimated probability curves obtained by using the full data set and by grouping the data with our proposed estimation method.
As shown in the figure, the estimated probability curve obtained by using the grouped data was close to the one obtained by using the full cohort, although it was not as smooth. The bootstrap confidence intervals obtained by using the grouped data were almost identical to those obtained by using the full cohort except for the right tail, which suggests that group testing did not lose much statistical efficiency for the probability curve.
Figure 5:

Estimated probability curve of having antibody to hepatitis B virus core antigen from NHANES data.
Here we have included bootstrap standard errors of the estimated coefficients and 95% bootstrap percentile confidence intervals of the estimated curves for the full cohort and the grouped data to explore the information loss due to the grouping. Note that we only have an n1/3 convergence rate for the estimated probability curve, and the asymptotic behavior of the bootstrap standard errors and confidence intervals is not clear in our setting. The simulation studies in the online supplementary materials suggest that the bootstrap standard errors and confidence intervals perform reasonably well in finite samples.
6. DISCUSSION
Group testing is a cost-effective sampling strategy to reduce the time and labor in large screening studies. In this paper, we have proposed a computationally appealing algorithm that integrates PAVA with the EM algorithm for fitting semiparametric isotonic regression models to group testing data. Although the estimating algorithm involves iterative steps, it is computationally simple and efficient. The conditional expectations in the E-step can be obtained in closed form. In the M-step, the unspecified monotonic function can be estimated easily by using the available R function isoreg in the base stats package or the R packages Iso and isotone. Then the regression coefficients can be estimated by the optim function in R. Our simulation studies suggest that the R function isoreg is sufficient in our setting, and our method is computationally faster than Delaigle et al.’s method regardless of the sample size. For example, in a 200-run simulation under scenario (ii) using a 3.20 GHz desktop CPU, the CPU run times of the proposed method and Delaigle et al.’s method were 0.17 and 0.77 hours for a sample size of 500, and 35.19 and 115.15 hours for a sample size of 10000, respectively.
One advantage of the PAVA algorithm is that it is statistically and computationally efficient. Alternatively, a monotone spline smoothing method can be used to estimate the unspecified link function under the monotonic constraint (He and Shi, 1998; Leitenstorfer and Tutz, 2006), in which the number of knots or the integrated smoothness parameter requires extra effort to estimate. Given the binary outcome, our simulation study using the R package scam indicated that the non-convergence of the monotone spline smoothing with the integrated smoothness parameter is not negligible in some settings. A thorough theoretical and empirical comparison of the two methods is beyond the scope of this paper and worthy of further research. Also, although the nonparametric maximum likelihood estimator does not need additional specifications, such as the kernel function and bandwidth, it is a non-smooth step function and has cube-root-n consistency. By using the techniques in Mukerjee (1988) and Groeneboom and Jongbloed (2014), we can generalize the PAVA estimator to a kernel-smoothed PAVA estimator with an improved convergence rate, at the cost of additional computation. Further developing a computational procedure to facilitate the corresponding bandwidth selection is a worthy objective for future research.
One potential limitation of the EM algorithm is that it may converge to a local maximum of the likelihood function, depending on the starting values. In our study, to avoid this issue, we chose 10 sets of initial values from a uniform distribution and selected the maximizer over the 10 resulting estimators. Alternatively, the estimated regression coefficients from logistic regression could be used as reasonable starting values for the proposed method. We compared the simulation results obtained by using the estimators from the logistic regression as the initial values to those obtained by using 10 different sets of initial values. The results were almost identical.
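The multiple-starting-value strategy can be sketched in Python as follows (hypothetical names; `fit` stands in for the full EM fit, returning an estimate and its final log-likelihood):

```python
import numpy as np

rng = np.random.default_rng(0)

def multistart(fit, n_starts=10, dim=2, low=-2.0, high=2.0):
    """Run `fit` (which returns (estimate, log_likelihood)) from several
    uniform random starting values and keep the best maximizer."""
    best_est, best_ll = None, -np.inf
    for _ in range(n_starts):
        beta0 = rng.uniform(low, high, size=dim)
        est, ll = fit(beta0)
        if ll > best_ll:
            best_est, best_ll = est, ll
    return best_est, best_ll
```

With a toy concave objective, the returned estimate is simply the start whose fitted log-likelihood is highest, mirroring the selection rule described above.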
Supplementary Material
ACKNOWLEDGEMENTS
The authors thank the editor, associate editor and reviewers for helpful comments and suggestions, which have led to improvements of this article. This work was partially supported by grants from the National Institute of Health (R01CA193878, P30CA016672, and U24CA230144) and the Andrew Sabin Family Fellowship. The authors acknowledge the Texas Advanced Computing Center at the University of Texas at Austin for providing high performance computing resources that have contributed to the research results reported within this paper.
APPENDIX
Proof of Theorem 1.
Without loss of generality, we assume the group size nj to be 2 and n = N/2 for notational simplicity. Let f(o|β, π) = P(y*|x1, x2, β, π)g(x1)g(x2) as given in Section 2.1. Let (β̂n, π̂n) be the MLE and (β0, π0) be the true parameter, and

q(o|β, π) = log{f(o|β, π)/f(o|β0, π0)}

be the log-likelihood ratio. Let P be the probability measure of f(·|β0, π0), Pq(O|β, π) = ∫q(o|β, π)dP(o) the true mean of q; Pn be the empirical measure based on the data, and Pnq(O|β, π) = n−1 Σj=1n q(oj|β, π) the empirical mean of q.
Note that Pq(O|β, π) is the negative Kullback-Leibler divergence of f(o|β, π) from f(o|β0, π0); as a function of (β, π) it is always non-positive, attaining its maximum value of 0 at (β0, π0), and the model is identifiable, so we have (β0, π0) = arg max(β, π)∈(B, Π) Pq(O|β, π). Also, recall that (β̂n, π̂n) is the MLE, so Pnq(O|β̂n, π̂n) ≥ Pnq(O|β0, π0) = 0.

Let d(β, π; β0, π0) = ‖β − β0‖ + supt∈R |π(t) − π0(t)|. By our model specification, Pq(O|β, π) is continuous with respect to (β, π); also, identifiability of the model implies (β0, π0) is the unique maximizer of Pq(O|β, π). Thus for all η > 0, sup(β, π)∈(B,Π):d(β, π;β0, π0)>η Pq(O|β, π) < Pq(O|β0, π0), and by definition of the MLE, Pnq(O|β̂n, π̂n) ≥ 0, so if we show that Q = {q(·|β, π): (β, π) ∈ (B, Π)} is a P-Glivenko-Cantelli class, then by Theorem 5.8 in van der Vaart (2002, p.386), d(β̂n, π̂n; β0, π0) → 0 in probability.
Now we show that Q is P-Glivenko-Cantelli. For any function g, let ‖g‖L1(P) = ∫|g(o)|dP(o), and let N[](ϵ, Q, ‖·‖L1(P)) be the minimum number of ϵ-brackets needed to cover Q under the norm ‖·‖L1(P). We first show that N[](ϵ, Q, ‖·‖L1(P)) is finite ∀ ϵ > 0. By our model specification, q(o|β, π)

is differentiable in (β, π), where differentiability for π is in the Hadamard sense. Denote ℓβ(β, π|o) = ∂ℓ(β, π|o)/∂β and ℓπ(β, π|o)[h] the Hadamard derivative of ℓ(β, π|o) with respect to π, in the direction h. Then ∀ (βk, πk) ∈ (B, Π) (k = 1, 2),

q(o|β2, π2) − q(o|β1, π1) = ℓβ(β, π|o)T(β2 − β1) + ℓπ(β, π|o)[π2 − π1],
where (β, π) is an intermediate point between (β1, π1) and (β2, π2). Note that
Then by the boundedness of π(·), π1(·), π2(·), (C2) and Hölder’s inequality,
Similarly, for some 0 < C2 < ∞,
Thus, ‖q(·|β2, π2) − q(·|β1, π1)‖L1(P) ≤ C1‖β2 − β1‖L2(P) + C2‖π2 − π1‖L2(P), and so, ∀ ϵ > 0, N[](ϵ, Q, ‖·‖L1(P)) ≤ N[](ϵ/(2C1), B, ‖·‖L2(P)) N[](ϵ/(2C2), Π, ‖·‖L2(P)).
By (C1), N[](ϵ/(2C1), B, ‖·‖L2(P)) = N[](ϵ/(2C1), B, ‖·‖) = O(1/ϵd), with d = dim(β). Since Π is a collection of bounded monotone functions, by Theorem 2.7.5 in van der Vaart and Wellner (1996, p.159), there is a constant 0 < C < ∞ such that, for all P and r > 0,

log N[](r, Π, ‖·‖L2(P)) ≤ C/r.

Thus, for some generic constant 0 < C < ∞, ∀ ϵ > 0,

N[](ϵ, Q, ‖·‖L1(P)) = O(ϵ−d exp{C/ϵ}) < ∞,
and so by Theorem 2.4.1 in van der Vaart and Wellner (1996, p.122), is a Glivenko-Cantelli class with respect to P.
Proof of Theorem 2.
Let q(o|β, π) and Pn be defined as in the proof of Theorem 1. Define Mn(β, π) = Pnq(O|β, π) and M(β, π) = Pq(O|β, π). Then Mn(β̂n, π̂n) ≥ Mn(β0, π0) − rn−2 = −rn−2, for any (β, π) ∈ (B, Π) and any positive sequence rn → ∞.
Since the conditions of Theorem 1 are satisfied, d(β̂n, π̂n; β0, π0) → 0 in probability. Note that M(β0, π0) = 0. Denote ℓβ, ℓπ, ℓββ, ℓβπ, ℓπβ, ℓππ, etc. as the first and second order partial derivatives of ℓ(β, π|o) with respect to (β, π). The derivatives with respect to π are in the Hadamard sense. By our specification of the model, these quantities exist. Note that Pℓβ(β0, π0|O) = 0 and Pℓπ(β0, π0|O)[h] = 0. Denote ℓ(2) as the matrix of all the second order partial derivatives; then

M(β, π) − M(β0, π0) is given by a second-order expansion evaluated at an intermediate value between (β, π) and (β0, π0), and is of order O(d2(β, π; β0, π0)). Under (C4), the above can be upper bounded by −Cd2(β, π; β0, π0), for some 0 < C < ∞, in a small neighborhood of (β0, π0). So for any 0 < ηn → 0 and any τ with ηn < τ ≤ η < ∞, for some 0 < C < ∞,
Next we show, with E* for outer expectation,
E* sup{(β, π): d(β, π; β0, π0) < τ} |√n (Pn − P) q(O|β, π)| ≤ Cϕn(τ), | (A.0) |
for some decreasing function ϕn(·) to be given.
In the proof of Theorem 1, we showed N[](ϵ, (B, Π), L2(P)) = O(1/ϵd exp{C/ϵ}) = O(exp{C/ϵ}). Similarly, N[](ϵ, (B, Π), L4(P)) = O(exp{C/ϵ}). Letting C be a generic finite positive constant, we have

In the above we used the fact that, for small τ > 0, log τ < 0, so e−t > 1 and 1 + Ce−t ≤ (1 + C)e−t on (−∞, log τ).
Let , , . Since (given in the proof of Theorem 1), so . In the proof of Theorem 1, we showed . by the same way, , for some 0 < C < ∞, and so .
Also, as in the proof of Theorem 1, with our specification of the likelihood, Pq2 < Cτ2 and ‖q‖∞ < C for all , for some 0 < C < ∞. Thus, by Lemma 3.4.2 in van der Vaart and Wellner (1996, p.324),
which implies (A.0) with ϕn(τ) = Cτ^{1/2}(1 + τ^{−3/2}n^{−1/2}). Taking rn = n^{1/3}, then
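As a check on the rate, the growth condition of Theorem 3.4.1 requires roughly that rn² ϕn(1/rn) ≲ √n; the following verification for rn = n^{1/3} is a reconstruction from the form of ϕn given above, not the authors' exact display:

```latex
\[
r_n^{2}\,\phi_n(1/r_n)
 \;=\; n^{2/3}\cdot C\,n^{-1/6}\bigl(1 + n^{1/2}\,n^{-1/2}\bigr)
 \;=\; 2C\,n^{1/2}
 \;\lesssim\; \sqrt{n}.
\]
```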
Now, all conditions of Theorem 3.4.1 in van der Vaart and Wellner (1996, p. 322) are satisfied, so by that theorem, .
BIBLIOGRAPHY
- Ayer M, Brunk HD, Ewing GM, Reid WT, & Silverman E (1955). An empirical distribution function for sampling with incomplete information. The Annals of Mathematical Statistics, 26, 641–647.
- Barlow RE, Bartholomew DJ, Bremner J, & Brunk HD (1972). Statistical inference under order restrictions: The theory and application of isotonic regression. Wiley, New York.
- Best MJ, & Chakravarti N (1990). Active set algorithms for isotonic regression: a unifying framework. Mathematical Programming, 47, 425–439.
- Bondell HD, Liu A, & Schisterman EF (2007). Statistical inference based on pooled data: a moment-based estimating equation approach. Journal of Applied Statistics, 34, 129–140.
- Delaigle A, & Hall P (2012). Nonparametric regression with homogeneous group testing data. The Annals of Statistics, 40, 131–158.
- Delaigle A, Hall P, & Wishart J (2014). New approaches to nonparametric and semiparametric regression for univariate and multivariate group testing data. Biometrika, 101, 567–585.
- Delaigle A, & Meister A (2011). Nonparametric regression analysis for group testing data. Journal of the American Statistical Association, 106, 640–650.
- Delaigle A, & Zhou WX (2015). Nonparametric and parametric estimators of prevalence from group testing data with aggregated covariates. Journal of the American Statistical Association, 110, 1785–1796.
- Dorfman R (1943). The detection of defective members of large populations. The Annals of Mathematical Statistics, 14, 436–440.
- Gastwirth JL (2000). The efficiency of pooling in the detection of rare mutations. The American Journal of Human Genetics, 67, 1036–1039.
- Groeneboom P, & Hendrickx K (2019). Estimation in monotone single-index models. Statistica Neerlandica, 73, 78–99.
- Groeneboom P, & Jongbloed G (2014). Nonparametric estimation under shape constraints. Vol. 38. Cambridge University Press.
- He X, & Shi P (1998). Monotone B-spline smoothing. Journal of the American Statistical Association, 93, 643–650.
- Hristache M, Juditsky A, & Spokoiny V (2001). Direct estimation of the index coefficient in a single-index model. The Annals of Statistics, 29(3), 595–623.
- Huang J, & Wellner JA. Interval censored survival data: a review of recent progress. In Proceedings of the First Seattle Symposium in Biostatistics, 123–169. Springer, New York, NY.
- Ichimura H (1993). Semiparametric least squares (SLS) and weighted SLS estimation of single-index models. Journal of Econometrics, 58, 71–120.
- Johnson WO, & Pearson LM (1999). Dual screening. Biometrics, 55, 867–873.
- Kim HY, & Hudgens MG (2009). Three-dimensional array-based group testing algorithms. Biometrics, 65, 903–910.
- Leitenstorfer F, & Tutz G (2006). Generalized monotonic regression based on B-splines with an application to air pollution data. Biostatistics, 8, 654–673.
- Lewis JL, Lockary VM, & Kobic S (2012). Cost savings and increased efficiency using a stratified specimen pooling strategy for Chlamydia trachomatis and Neisseria gonorrhoeae. Sexually Transmitted Diseases, 39, 46–48.
- Mair P, Hornik K, & de Leeuw J (2009). Isotone optimization in R: pool-adjacent-violators algorithm (PAVA) and active set methods. Journal of Statistical Software, 32, 1–24.
- Mitchell EM, Lyles RH, Manatunga AK, Perkins NJ, & Schisterman EF (2014). A highly efficient design strategy for regression with outcome pooling. Statistics in Medicine, 33, 5028–5040.
- Mukerjee H (1988). Monotone nonparametric regression. The Annals of Statistics, 16, 741–750.
- Murphy SA, & Van der Vaart AW (2000). On profile likelihood. Journal of the American Statistical Association, 95, 449–465.
- Remlinger KS, Hughes-Oliver JM, Young SS, & Lam RL (2006). Statistical design of pools using optimal coverage and minimal collision. Technometrics, 48, 133–143.
- Robertson T (1988). Order restricted inference. John Wiley & Sons, New York.
- Severini TA, & Wong WH (1992). Profile likelihood and conditionally parametric models. The Annals of Statistics, 20, 1768–1802.
- Van TT, Miller J, Warshauer DM, Reisdorf E, Jernigan D, Humes R, & Shult PA (2012). Pooling nasopharyngeal/throat swab specimens to increase testing capacity for influenza viruses by PCR. Journal of Clinical Microbiology, 50, 891–896.
- Wang D, Zhou H, & Kulasekera K (2013). A semi-local likelihood regression estimator of the proportion based on group testing data. Journal of Nonparametric Statistics, 25, 209–221.
- Xie M (2001). Regression analysis of group testing samples. Statistics in Medicine, 20, 1957–1969.
- Zhang B, Bilder CR, & Tebbs JM (2013). Group testing regression model estimation when case identification is a goal. Biometrical Journal, 55, 173–189.
- Zipf G, Chiappa M, Porter KS, Ostchega Y, Lewis BG, & Dostal J (2013). National Health and Nutrition Examination Survey plan and operations, 1999–2010, https://stacks.cdc.gov/view/cdc/21304.