Abstract
We propose local polynomial estimators for the conditional mean of a continuous response when only pooled response data are collected under different pooling designs. Asymptotic properties of these estimators are investigated and compared. Extensive simulation studies are carried out to compare finite sample performance of the proposed estimators under various model settings and pooling strategies. We apply the proposed local polynomial regression methods to two real-life applications to illustrate practical implementation and performance of the estimators for the mean function.
Keywords: cross validation, homogeneous pooling, random pooling
AMS subject classifications: 62G08, 62G20
1. Introduction
Instead of measuring individual specimens to collect data for biomarkers or analytes of interest, collecting such data on pools of specimens has become increasingly common in epidemiological studies (Kendziorski et al. 2003; Shih et al. 2004) and environmental studies (Kärrman et al. 2006; Kato et al. 2009; Heffernan et al. 2016; Mosites et al. 2020). Collecting pooled data can reduce information loss when there is a detection limit and offers a more timely way to gather information, in addition to the obvious benefits of reducing the cost of laboratory assays and preserving irreplaceable specimens. In some econometrics applications, pooled data are all that is available to researchers, such as data aggregated by family or by region (Martinez-Espineira 2003; Fukuda 2006; Jiang et al. 2009). In these applications, data on other attributes at the individual level are often also recorded, and researchers are interested in associations between quantities at the individual level even though some data are collected at the pool level. Our study is motivated by these research questions, which require methodologies for regression analysis based on pooled continuous response data and individual-level covariate data.
Traditional regression methodology applicable to individual response data cannot be directly used to analyze pooled response data, and a body of research on regression analysis for pooled continuous responses exists. Under the parametric framework, Malinovsky et al. (2012) considered Gaussian random effects models for pooled repeated measures, and studied inference for variance components under different pooling strategies. Mitchell et al. (2014) proposed a Monte Carlo expectation maximization algorithm to carry out regression analyses of pooled biomarker assessments assuming that the biomarker follows a log-normal distribution given covariates. McMahan et al. (2016) developed methods to infer receiver-operating characteristic curves using pooled biomarker measurements. Liu et al. (2017) provided a general strategy based on Monte Carlo maximum likelihood for regression analysis of pooled data under generic parametric models assumed for the individual response given covariates. Under the semiparametric framework, Mitchell et al. (2015) proposed a semiparametric method for regression analysis of a right-skewed and positive response when data for the response are taken from pooled specimens. Without imposing parametric assumptions on the biomarker distribution, Lin and Wang (2018) developed a semiparametric approach for analyzing pooled biomarker measurements originating from a single-index model for the individual response. Different from these works, we develop nonparametric estimation methods without imposing a functional form for the conditional mean of the response or a distribution family on the response given covariates. The estimation methods proposed in this article are thus more generally applicable, even though prediction based on a nonparametrically estimated mean response is less convenient than when one employs a parametric estimation method.
Also under the nonparametric framework, Linton and Whang (2002) proposed a kernel-based estimator for the regression function for pooled data when covariate data are also aggregated, with both aggregated response data and covariate data subject to additive measurement error. Instead of the framework of kernel regression adopted in their work, we consider in this study local polynomial regression with kernel weights, which has been shown to have advantages both asymptotically and in finite sample performance over kernel regression (Fan et al. 1995).
Among the existing works on regression analysis of pooled response data, many consider various pooling designs. For example, Ma et al. (2011) compared two pooling designs in the context of linear regression analysis for a pooled continuous response and aggregated covariates, one being random pooling where pools are randomly formed without taking into account covariate information, and the other termed as optimal pooling by the authors, where pools are formed by gathering specimens corresponding to similar covariate values. This latter strategy is better known as homogeneous pooling in the pool/group testing literature (Shu and Burn 2003; Bilder and Tebbs 2009; Deckert et al. 2020), and many researchers have shown efficiency gain in prediction and covariate effects estimation when homogeneous pooled data are used than when random pooled data are used (Vansteelandt et al. 2000; Ma et al. 2011). Mitchell et al. (2014) developed a regression methodology for log-normal response data subject to a special form of homogeneous pooling where covariate values within a pool are identical. Like the regression analysis discussed in Ma et al. (2011), Mitchell et al. (2014) also regressed the pooled continuous response on aggregated covariates to infer the association between the response and covariates at the individual level.
In this article, we propose local polynomial estimators for the mean of a continuous response given covariates using pooled response data and individual-level covariate data. More specifically, the proposed estimators are for the mean function $m(x) = E(Y \mid X = x)$, where Y is a continuous response of an experimental unit, and X is the covariate, which can be vector-valued and relates to attributes of the experimental unit or individual. For ease of exposition, we consider a scalar covariate in this article. Observed data available for inferring m(x) include pooled responses from J groups of individuals, $\{Z_j\}_{j=1}^J$, where $Z_j = c_j^{-1}\sum_{k=1}^{c_j} Y_{jk}$, in which cj is the number of individuals in pool j, and Yjk is the unobserved response of individual k in that pool, for j = 1,...,J, k = 1,...,cj. Also observed are covariate data $\{\mathbb{X}_j\}_{j=1}^J$, where $\mathbb{X}_j = (X_{j1}, \ldots, X_{jc_j})^{\mathrm{T}}$, with $X_{jk}$ being the covariate associated with individual k in pool j, for k = 1,...,cj and j = 1,...,J. Three proposed local polynomial estimators for m(x) based on data $\{(Z_j, \mathbb{X}_j)\}_{j=1}^J$ are presented in Section 2 next, where we assume that data arise from random pooling. Section 3 presents local polynomial estimators based on homogeneous pooled data. Asymptotic properties of these estimators are investigated and compared in Section 4 under each of the two pooling designs. Section 5 describes bandwidth selection methods tailored for the proposed estimators. Section 6 presents a simulation study where we compare finite sample performance of the proposed estimators under different model settings and various pooling designs. We further illustrate the implementation and performance of the proposed methods in two real-life applications in Section 7. Finally, in Section 8, we summarize contributions of our study and discuss follow-up research directions.
2. Local polynomial estimators under random pooling
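Local polynomial regression has been a well-received and widely applicable nonparametric strategy for estimating m(x) when individual data are available (Fan and Gijbels 1996). To estimate the regression function m(x) based on individual data $\{(Y_i, X_i)\}_{i=1}^N$, this strategy exploits the weighted least squares method to construct an objective function following a p-th order Taylor expansion of m(s) around x, $m(s) \approx \sum_{\ell=0}^{p} \beta_\ell (s - x)^\ell$, with $\ell!\,\beta_\ell$ equal to $m^{(\ell)}(s)$ evaluated at s = x. In particular, the objective function is given by
$$Q_0(\beta) = \sum_{i=1}^{N} \Big\{ Y_i - \sum_{\ell=0}^{p} \beta_\ell (X_i - x)^\ell \Big\}^2 K_h(X_i - x), \tag{1}$$
where $K(\cdot)$ is a symmetric kernel, h is a bandwidth, $K_h(t) = K(t/h)/h$, $\beta_\ell = m^{(\ell)}(x)/\ell!$ for $\ell = 0, \ldots, p$, and $\beta = (\beta_0, \ldots, \beta_p)^{\mathrm{T}}$. Minimizing Q0(β) with respect to β yields an estimate of $\beta_0 = m(x)$, along with estimates of $m^{(\ell)}(x)/\ell!$, for $\ell = 1, \ldots, p$. Denote by $\hat m_0(x)$ the so-obtained estimator for m(x).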
In what follows, we revise Q0(β) to construct new objective functions that adapt the local polynomial regression strategy to pooled response data from random pooling. Regardless of the pooling design considered, it is assumed that the individual data, $\{(Y_i, X_i)\}_{i=1}^N$, consist of independent copies of the random vector (Y, X) from a common distribution.
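As a minimal sketch of the individual-level strategy, the weighted least squares problem behind Q0(β) can be solved in closed form at a single evaluation point. The function names, the Epanechnikov kernel, and the weight scaling K(t/h)/h below are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def epanechnikov(t):
    """Epanechnikov kernel: 0.75 * (1 - t^2) on [-1, 1], zero elsewhere."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 1.0, 0.75 * (1.0 - t**2), 0.0)

def local_poly_fit(x0, X, Y, h, p=1):
    """Degree-p local polynomial estimate of m(x0) from individual data (X_i, Y_i)."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    w = epanechnikov((X - x0) / h) / h                # kernel weights (assumed scaling)
    D = np.vander(X - x0, N=p + 1, increasing=True)   # column l holds (X_i - x0)^l
    WD = D * w[:, None]
    # Solve the weighted normal equations (D' W D) beta = D' W Y.
    beta = np.linalg.lstsq(D.T @ WD, WD.T @ Y, rcond=None)[0]
    return beta[0]                                    # beta_0 estimates m(x0)
```

With p = 1 this is a local linear fit, so it reproduces an exactly linear, noise-free mean function at any point with enough nearby data.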
2.1. The average-weighted estimator
Now that individual responses in (1) are unobserved but pooled responses are instead, it is natural to switch attention from to , as if one were regressing Z on the accompanying covariates in a pool collectively. This motivates the following weighted least squares objective function,
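$$Q_1(\beta) = \sum_{j=1}^{J} \Big\{ Z_j - \frac{1}{c_j} \sum_{k=1}^{c_j} \sum_{\ell=0}^{p} \beta_\ell (X_{jk} - x)^\ell \Big\}^2 \frac{1}{c_j} \sum_{k=1}^{c_j} K_h(X_{jk} - x). \tag{2}$$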
In (1), the weight function quantifies the proximity of the ith covariate data point to x, producing a larger weight for an individual whose covariate value is closer to x. In (2), the average of such proximity measures associated with cj covariate data points in pool j is used to assess the overall closeness of this collection of covariate values to x.
Minimizing Q1(β) with respect to β and extracting the first element of the resultant minimizer gives a p-th order local polynomial estimator for m(x). This estimator can be explicitly expressed as , where , in which, D1(x) is a matrix with , for , and . Elaborated expressions of entries in S1(x) and T1(x) are given in Appendix A of the supplementary materials. To highlight the weight function construction in (2), is referred to as the average-weighted estimator in this article.
2.2. The product-weighted estimator
Instead of averaging individual-level weights to construct a weight function as in Q1(β), one may view as a multivariate covariate resulting from stacking the cj individual-level covariates in pool j on top of each other, and an alternative weight function can be formulated to measure the nearness of this multivariate covariate to , where 1cj denotes the cj × 1 vector of one’s. Mimicking the product kernel used in multivariate kernel density estimation, we propose the following weighted least squares objective function with a different weight function,
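$$Q_2(\beta) = \sum_{j=1}^{J} \Big\{ Z_j - \frac{1}{c_j} \sum_{k=1}^{c_j} \sum_{\ell=0}^{p} \beta_\ell (X_{jk} - x)^\ell \Big\}^2 \prod_{k=1}^{c_j} K_h(X_{jk} - x). \tag{3}$$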
More succinctly, the estimator for m(x) resulting from minimizing Q2(β) is given by , where and , in which . Detailed expressions of entries in S2(x) and T2(x) are provided in Appendix B in the supplementary materials. Due to the construction of the weight function in (3), we call the product-weighted estimator in the sequel.
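The two pool-level objective functions differ only in how the individual kernel weights are combined within a pool, which the following sketch makes explicit. It assumes equal pool sizes, pooled responses that are within-pool averages, and a pool-level design whose columns are within-pool averages of powers of the centered covariates; `pooled_local_poly_fit` and its `weight` argument are hypothetical names introduced here for illustration.

```python
import numpy as np

def epanechnikov(t):
    """Epanechnikov kernel: 0.75 * (1 - t^2) on [-1, 1], zero elsewhere."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 1.0, 0.75 * (1.0 - t**2), 0.0)

def pooled_local_poly_fit(x0, Xp, Z, h, p=1, weight="average"):
    """Pool-level weighted least squares fit at x0.

    Xp : (J, c) array of individual covariates, one row per pool.
    Z  : (J,) array of pooled responses (assumed to be within-pool averages).
    weight : "average" or "product", the way individual kernel weights are combined.
    """
    Xp = np.asarray(Xp, dtype=float)
    Z = np.asarray(Z, dtype=float)
    diff = Xp - x0
    Kh = epanechnikov(diff / h) / h
    if weight == "average":
        w = Kh.mean(axis=1)      # average of the c individual weights
    elif weight == "product":
        w = Kh.prod(axis=1)      # product of the c individual weights
    else:
        raise ValueError("weight must be 'average' or 'product'")
    # Column l of the design is the within-pool average of (X_jk - x0)^l.
    D = np.stack([(diff**l).mean(axis=1) for l in range(p + 1)], axis=1)
    WD = D * w[:, None]
    beta = np.linalg.lstsq(D.T @ WD, WD.T @ Z, rcond=None)[0]
    return beta[0]
```

Under the product weight a pool contributes only when every member lies within one bandwidth of x0, so far fewer pools carry weight as the pool size grows.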
2.3. The marginal-integration estimator
The first two estimators are motivated by the mean of Zj given all covariate data in pool j. The third estimator is inspired by the mean of cjZj given one arbitrary individual’s covariate in pool j derived next under the assumption that and the pools are formed randomly independent of covariate information. By the definition of Zj, we have
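$$E(c_j Z_j \mid X_{jk} = x) = E(Y_{jk} \mid X_{jk} = x) + \sum_{k' \neq k} E(Y_{jk'} \mid X_{jk} = x) = m(x) + (c_j - 1)\mu,$$
where $E(Y_{jk'} \mid X_{jk} = x) = E(Y_{jk'}) = \mu$ for $k' \neq k$ and $\mu = E(Y)$. Hence,
$$E\{ c_j Z_j - (c_j - 1)\mu \mid X_{jk} = x \} = m(x). \tag{4}$$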
If one views as a pseudo response, (4) is reminiscent of the conditional mean model for individual-level data, , except for the dependence of the pseudo response on the unknown parameter μ. Since μ is the marginal mean of Y, one may use the overall sample mean response, , to estimate μ. This yields a surrogate of the pseudo response defined by , for j = 1,...,J. However, due to the estimation of μ in Rj. In fact, one can show that . Suggested by an anonymous referee, in each Rj, we replace and define . One can view as a bias-corrected version of Rj, and also as an “estimator” for Yjk that satisfies .
Using the surrogate of the pseudo response and (4), we formulate the following weighted least squares objective function,
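$$Q_3(\beta) = \sum_{j=1}^{J} \sum_{k=1}^{c_j} \Big\{ \tilde Y_{jk} - \sum_{\ell=0}^{p} \beta_\ell (X_{jk} - x)^\ell \Big\}^2 K_h(X_{jk} - x), \tag{5}$$
where $\tilde Y_{jk}$ denotes the bias-corrected surrogate of the pseudo response introduced above.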
Minimizing Q3(β) with respect to β yields our third proposed p-th order local polynomial estimator for m(x), denoted by $\hat m_3(x)$. As one can see from its elaborated expression given in Appendix C of the supplementary materials, $\hat m_3(x)$ is simply the local polynomial estimator based on individual-level data with Yjk replaced by its bias-corrected surrogate, for j = 1,...,J, k = 1,...,cj. The construction of $\hat m_3(x)$ stems from the marginal integration result (4). For this reason, we refer to $\hat m_3(x)$ as the marginal-integration estimator henceforth. Using marginal integration is not new in the pooling literature. Lin and Wang (2018) used it to estimate a single-index model with a focus on the parametric part of their model. The asymptotic properties of $\hat m_3(x)$ in Section 4 do not follow directly from their derivation; in addition, they used Rj, whereas we use its bias-corrected version to correct the bias induced by Rj.
All three estimators reduce to the local polynomial estimator based on individual-level data when cj = 1 for j = 1,...,J, but are otherwise typically very different from each other. In-depth comparisons between the three estimators that go beyond their formulations demand a more systematic investigation of their theoretical properties. This is the content of Section 4, where we look into the asymptotic bias and variance of these estimators under each of the two considered pooling designs.
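A rough sketch of the marginal-integration idea follows. The exact bias-corrected surrogate is defined in the text; here we assume, purely for illustration, equal pool sizes and a leave-one-pool-out estimate of μ, so that the pseudo response for pool j is c Zj minus (c − 1) times the mean of the other pools' responses. The names `pseudo_responses` and `marginal_integration_fit` are hypothetical.

```python
import numpy as np

def epanechnikov(t):
    """Epanechnikov kernel: 0.75 * (1 - t^2) on [-1, 1], zero elsewhere."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 1.0, 0.75 * (1.0 - t**2), 0.0)

def pseudo_responses(Z, c):
    """Assumed surrogate: c * Z_j - (c - 1) * (mean of Z over the other pools)."""
    Z = np.asarray(Z, dtype=float)
    J = Z.shape[0]
    loo_mean = (Z.sum() - Z) / (J - 1)     # leave-one-pool-out estimate of mu
    return c * Z - (c - 1) * loo_mean

def marginal_integration_fit(x0, Xp, Z, h, p=1):
    """Individual-level local polynomial fit applied to pseudo responses."""
    Xp = np.asarray(Xp, dtype=float)
    J, c = Xp.shape
    X = Xp.ravel()                                    # X_11, ..., X_1c, X_21, ...
    Y_tilde = np.repeat(pseudo_responses(Z, c), c)    # same ordering as X
    w = epanechnikov((X - x0) / h) / h
    D = np.vander(X - x0, N=p + 1, increasing=True)
    WD = D * w[:, None]
    beta = np.linalg.lstsq(D.T @ WD, WD.T @ Y_tilde, rcond=None)[0]
    return beta[0]
```

When c = 1 the pseudo responses coincide with the observed responses, so this sketch reduces to the individual-level fit, in line with the reduction property noted above.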
3. Local polynomial estimators under homogeneous pooling
When pooled data result from homogeneous pooling, it is no longer sensible to consider the mean of given one “arbitrary” covariate data point in pool j as we just did to construct , since individuals’ covariates within a pool are not that “arbitrary” now after all, and is typically not equal to for . But it is still meaningful to consider the mean of Zj given all covariate data in pool j as we did under random pooling that leads to and .
To be more concrete, consider the homogeneous pooling design under which pools of individuals are created according to the sorted covariate data in . This yields covariate data associated with pool j given by , where . Even though the response data are not sorted, we use to denote the corresponding pooled response, where is the response of the individual whose covariate value is , for k = 1,...,cj, and j = 1,...,J. Evaluating the objective functions in (2) and (3) at gives the following objective functions, which one minimizes with respect to β in order to obtain the average-weighted estimator, , and the product-weighted estimator, , respectively, under homogeneous pooling,
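The two pooling designs can be sketched as follows, assuming equal pool sizes, a sample size divisible by the pool size, and pooled responses recorded as within-pool averages. The helper name `make_pools` and its arguments are hypothetical and introduced only for illustration.

```python
import numpy as np

def make_pools(X, Y, c, design="homogeneous", rng=None):
    """Form pools of size c and return (Xp, Z).

    Xp : (J, c) individual covariates by pool.
    Z  : (J,) pooled responses, taken here to be within-pool averages.
    design : "homogeneous" sorts individuals by covariate before pooling;
             "random" pools individuals in a random order.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    N = X.shape[0]
    if N % c != 0:
        raise ValueError("sample size must be a multiple of the pool size")
    if design == "homogeneous":
        order = np.argsort(X, kind="stable")     # neighbours in X share a pool
    elif design == "random":
        rng = np.random.default_rng() if rng is None else rng
        order = rng.permutation(N)               # pooling ignores covariates
    else:
        raise ValueError("design must be 'homogeneous' or 'random'")
    Xp = X[order].reshape(-1, c)
    Z = Y[order].reshape(-1, c).mean(axis=1)     # individual responses are then discarded
    return Xp, Z
```

In a simulation the individual responses are available only to build the pools; the estimators then see the pairs (Xp, Z) alone.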
4. Comparisons between different estimators
4.1. Asymptotic properties
Under certain regularity conditions listed in the supplementary materials, we derive asymptotic means and variances of the proposed estimators for β as J → ∞ with bounded. Conditions listed there relate to m(x), the variance function , the density function of X, fX(x), and the kernel K(t), which are mostly common conditions seen in the context of local polynomial regression using individual-level data. In what follows, we summarize findings from these derivations (with details provided in the supplementary materials) in two theorems that highlight some interesting contrasts between different estimators for m(x) when pools are of equal size with cj = c, for j = 1,...,J, with additional conditions imposed in each theorem when needed. Several quantities appearing in these theorems are defined next for ease of reference:
| (6) |
where and , for .
The first theorem concerns the three estimators under random pooling. Appendices A, B, and C in the supplementary materials provide the proof for the three parts of this theorem that allow unequal pool sizes.
Theorem 4.1
As J → ∞ and h → 0, one has the following results regarding the difference between an estimator for m(x) and m(x).
(i) If the $\ell$-th moment of X exists, for $\ell$ = 1,...,2p, then
where (7)
in which $(t)_+ = \max(t, 0)$.
(ii) If m(x) is (p + 3)th-order continuously differentiable, then
(iii) Let exists, then
(8)
Theorem 4.1-(i) indicates that is an inconsistent estimator for m(x), with the dominating bias given by that does not depend on h, and thus does not diminish as h → 0, but it does vanish when c = 1. Considering a local constant estimator by setting p = 0 in (7), we show in Appendix A in the supplementary material that
| (9) |
of which the second term (of order h2) is c−1 times the dominating bias of the Nadaraya-Watson estimator based on individual-level data. Observing that the dominating bias in (9) is equal to , one can easily derive an improved local constant estimator by correcting for this dominating bias. This leads to a consistent local constant estimator given by , of which the bias is of order OP(h2). For p > 0, correcting for its dominating bias requires estimating functionals of m(x) more involved than μ = E{m(X)} that appear in in (6).
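The persistent bias can be seen from a short heuristic calculation for p = 0. The sketch below assumes random pooling with equal pool size c, pooled responses that are within-pool averages, and the ratio form of the local constant estimator obtained by averaging kernel weights within a pool; the symbols $w_j$ and $\hat m_1(x)$ are notation introduced here for illustration, and remainder terms are stated loosely.

```latex
\begin{align*}
\hat m_1(x) &= \frac{\sum_{j=1}^{J} w_j Z_j}{\sum_{j=1}^{J} w_j},
\qquad w_j = \frac{1}{c} \sum_{k=1}^{c} K_h(X_{jk} - x), \\
E(w_j) &= f_X(x) + O(h^2), \\
E(w_j Z_j) &= \frac{1}{c^2} \sum_{k=1}^{c} \sum_{k'=1}^{c}
  E\{ K_h(X_{jk} - x)\, m(X_{jk'}) \}
  = \frac{f_X(x)}{c} \{ m(x) + (c - 1)\mu \} + O(h^2), \\
\frac{E(w_j Z_j)}{E(w_j)} - m(x) &= \frac{c - 1}{c} \{ \mu - m(x) \} + O(h^2).
\end{align*}
```

The c diagonal terms (k' = k) each contribute approximately $m(x) f_X(x)$, whereas the c(c − 1) off-diagonal terms each contribute approximately $\mu f_X(x)$ because covariates within a randomly formed pool are independent. Under these assumptions the leading term vanishes when c = 1, and a rescaling of the form $c\,\hat m_1(x) - (c - 1)\hat\mu$, with $\hat\mu$ an estimate of μ, would remove it.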
Theorem 4.1-(ii) suggests that is a consistent estimator for m(x) with the asymptotic variance of order , which inflates quickly as c increases. It is worth pointing out that, Q2(β) in (3) is essentially a special form of the objective function associated with the regular multivariate local polynomial estimator for the c-variate conditional mean based on individual level multivariate covariate data of c dimensional. Hence, following Masry (1996) and Gu et al. (2015), under the same set of regularity conditions listed in the supplementary materials, is asymptotically normal when .
Comparing Theorem 4.1-(ii) and (iii) reveals that and typically do not share the same dominating bias except when c = 1, and exhibits the same asymptotic bias as that of regardless of the pool size. The variability of is understandably higher than that of , but it only grows linearly in c and thus is much less inflated than the variance of . More specifically, (8) implies that the amount of variance inflation of depends linearly on the pool size and .
Summarizing the above remarks on Theorem 4.1, we conclude that the marginal-integration estimator is the preferred estimator among the three proposed under random pooling. It outperforms the average-weighted estimator for its consistency, and it surpasses the product-weighted estimator for its much less inflated variance when compared with . In practice, we recommend using the local linear version of which corresponds to p = 1. For this case, Theorem 4.1-(iii) yields that
Thus the mean squared error of which attains the optimal nonparametric rate when is used. Furthermore, if there exists η > 0 such that is bounded for all x1,...,xc, then we have converges in distribution to N(0, 1) as N goes to infinity.
Despite these virtues of , it is no longer well justified under homogeneous pooling as pointed out in Section 3. The following theorem is regarding the average-weighted estimator and the product-weighted estimator applied to data from the homogeneous pooling design. Appendix D in the supplementary material provides the proof for this theorem.
Theorem 4.2
Assume that x is an interior point of a compact and nondegenerate interval , the pdf of X, fX(·), is bounded away from zero on an interval , where , with bounded. Then, as J → ∞, h → 0, and Jh4 → ∞,
| (10) |
and
| (11) |
If Condition (C5) is satisfied for the kernel defined by K†(t) = Kc(t), then and
| (12) |
and
| (13) |
where are the counterparts of , and , respectively, with K(t) replaced by K†(t).
Among the additional assumptions imposed in Theorem 4.2, the one on x and the assumption on K(t) are similar to Conditions (T1) and (T5) in Delaigle and Hall (2012), respectively. Theorem 4.2 indicates that both and are consistent estimators for m(x) under homogeneous pooling, with the former sharing the same dominating bias as that of , and the latter exhibiting the same form of dominating bias with a redefined kernel that depends on c. Moreover, the asymptotic variances of both estimators are of the same order as that of despite the pool size. The practical implication of Theorem 4.2 is that, if one uses homogeneous pooled data to infer m(x) via either one of the two proposed local polynomial estimators, one only needs J assays without losing accuracy or efficiency asymptotically compared with when un-pooled data are used that require N = cJ assays.
In practice, we recommend using or with p = 1. For when p = 1, Theorem 4.2 implies that
The mean squared error of is also which attains the optimal nonparametric rate when is used. Furthermore, if there exists η > 0 such that is bounded for all x, we have converges in distribution to N(0,1) as N → ∞. Similar properties hold for .
4.2. Further remarks
We are now in the position to reflect on the findings in Theorems 4.1 and 4.2 to gain a deeper understanding of the three proposed estimators for m(x) using pooled data.
The stark contrast between properties of the average-weighted estimator under the two pooling designs may seem peculiar at first glance. As natural as it initially appears to be, the use of average weights is the root cause of the persistent bias of the average-weighted estimator under random pooling. For ease of exposition, assume for the time being cj = 2, for j = 1,...,J. The objective function Q1(β) in (2) is essentially constructed for estimating the bivariate function $m^*(x_1, x_2) = E(Z_j \mid X_{j1} = x_1, X_{j2} = x_2) = \{m(x_1) + m(x_2)\}/2$ evaluated at $(x_1, x_2)^{\mathrm{T}} = x\mathbf{1}_2$. The same weight, $\{K_h(X_{j1} - x) + K_h(X_{j2} - x)\}/2$, is assigned to both individuals in pool j whose covariate values are $(X_{j1}, X_{j2})$. This can yield a misleading weight when, for example, Xj1 is close to x but Xj2 is far away from x, which can often happen under random pooling. In contrast, the product weight in Q2(β) in (3) avoids such a misleading weighting scheme because $K_h(X_{j1} - x) K_h(X_{j2} - x)$ is small if either one of the two individual weights is small, and thus $Z_j$ contributes more to estimating $m^*(x, x)$ only when both Xj1 and Xj2 are close to x. In particular, when K(t) is the Gaussian kernel, the product weight function amounts to evaluating the bivariate Gaussian density function at the Euclidean distance between $(X_{j1}, X_{j2})^{\mathrm{T}}$ and $x\mathbf{1}_2$, whereas the average weight function lacks such a connection with a meaningful distance measure between the two points in $\mathbb{R}^2$.
Even though the product-weighted estimator exploits a more sensible weight function than the average-weighted estimator under random pooling, downplaying Xj1 even when it is close to x simply because the covariate value of the other individual in the same pool is far away from x is not an efficient use of data. Such waste of data information is more severe when the pool size is bigger, which is essentially the curse of dimensionality one faces when estimating the multivariate function $m^*(\cdot)$ based on a response along with a c-dimensional covariate. It is this inefficient use of data that causes the much inflated variance of the product-weighted estimator concluded in Theorem 4.1. Figure 1 illustrates the average weight function and the product weight function (in bottom panels) under random pooling when c = 2 and K(t) is the Epanechnikov kernel. Also shown in Figure 1 (see the top-left panel) are individual-level data generated according to the model specified in (D1) described in Section 6, overlaid with the pseudo response data from random pooling, which are used for the construction of the marginal-integration estimator. From there one can see that the pseudo data are much more variable than the original data used to obtain the individual-level estimator, and thus the increased variance of the marginal-integration estimator relative to the individual-level estimator is expected. Despite the higher variability, the pseudo data cloud does preserve the overall pattern of the original data cloud, which explains the common dominating bias shared between these two estimators. Unlike Q1(β) and Q2(β), the construction of Q3(β) in (5) is directly designed for estimating the univariate function m(x) instead of $m^*(\cdot)$, and thus overcomes the pitfall of misleading weight assignment in the average-weighted estimator, as well as the curse of dimensionality from which the product-weighted estimator suffers.
Figure 1.
Plots under random pooling. Top-left panel: Individual-level data as black circles, pseudo individual-level responses and covariate data as red circles, overlaid with the true m(x) as the green curve. Top-right panel: the bivariate function as the curved light green surface, with its value evaluated at (x, x), i.e., m*(x, x) = m(x), highlighted as the green curve, overlaid with the pool-level data as blue circles. Bottom-left panel: the shape of the average kernel weights when x = 0, along with and the pool-level data. Bottom-right panel: the shape of the product kernel weights when x = 0, along with and the pool-level data.
Figure 2 is the counterpart of Figure 1 under homogeneous pooling. Now one can see (in the top-left panel of Figure 2) that the pseudo data clearly distort the original data pattern, and thus are inappropriate for estimating m(x). With individuals sharing similar covariate values gathered in the same pool, the concern about the average-weighted estimator assigning inadequate weights no longer exists, nor does the concern about the product-weighted estimator using data inefficiently. The bottom panels of Figure 2 depict the average weight function and the product weight function, both of which are reminiscent of a symmetric kernel function.
Figure 2.
Plots under homogeneous pooling. Top-left panel: Individual-level data as black circles, pseudo individual-level responses and covariate data as red circles, overlaid with the true m(x) as the green curve. Top-right panel: the bivariate function as the curved light green surface, with its value evaluated at (x, x), i.e., m*(x, x) = m(x), highlighted as the green curve, overlaid with pool-level data as blue circles. Bottom-left panel: the shape of the average kernel weights when x = 0, along with m*(x, x) and the pool-level data. Bottom-right panel: the shape of the product kernel weights when x = 0, along with m*(x, x) and the pool-level data.
5. Bandwidth selection
The choice of bandwidths in local polynomial estimators plays a key role in the performance of these estimators. Besides the usual challenges encountered in bandwidth selection in local polynomial regression, a unique complication we face here is the lack of individual-level response data, which makes loss functions used for bandwidth selection that are based on individual-level residuals (or prediction errors) inapplicable in our context. Next we develop leave-one-pool-out cross-validation (CV) procedures to choose bandwidths in the three proposed local polynomial estimators for m(x) using random pooled data.
For the average-weighted estimator, , we choose the bandwidth h that minimizes the following pool-level residual sum of squares,
| (14) |
where is the realization of based on the observed data excluding data from pool j, , with the bandwidth set at h. The bandwidth in the product-weighted estimator, , is chosen by minimizing a CV criterion similarly defined as (14),
| (15) |
Admittedly, CV criteria or loss functions constructed based on prediction errors at the pool level may not be sensitive to the influence of h on prediction power at the individual level, and thus may not serve as effective criteria for the purpose of choosing bandwidths.
Given (14) and (15), one can easily envision a similar CV criterion, denoted by RSS3(h), defined for choosing h in . We however take into account the close tie between and local polynomial estimators designed for individual-level data, and propose a new and more effective CV criterion. This new criterion tailored for is mostly thanks to the pseudo individual-level observations, , used in . In particular, we choose h used in that minimizes the following pseudo (individual-level) residual sum of squares,
| (16) |
where is the realization of based on the pseudo individual-level data excluding data from pool j, , with the bandwidth set at h. Empirical evidence suggests that PRSS3(h) is a more effective CV criterion for bandwidth selection than RSS3(h).
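The leave-one-pool-out idea behind the pseudo residual sum of squares can be sketched as below. This is a hedged illustration, not the authors' procedure: it assumes a local linear fit with the Epanechnikov kernel, pseudo individual-level responses supplied as a (J, c) array, and a user-chosen bandwidth grid; `loo_pool_cv` and `local_linear` are hypothetical names.

```python
import numpy as np

def epanechnikov(t):
    """Epanechnikov kernel: 0.75 * (1 - t^2) on [-1, 1], zero elsewhere."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 1.0, 0.75 * (1.0 - t**2), 0.0)

def local_linear(x0, X, Y, h):
    """Local linear estimate at x0 from individual-level (or pseudo) data."""
    w = epanechnikov((X - x0) / h) / h
    D = np.column_stack([np.ones_like(X), X - x0])
    WD = D * w[:, None]
    return np.linalg.lstsq(D.T @ WD, WD.T @ Y, rcond=None)[0][0]

def loo_pool_cv(Xp, Y_tilde, h_grid):
    """Pick h minimizing the pseudo residual sum of squares.

    Xp, Y_tilde : (J, c) covariates and pseudo individual-level responses.
    For each pool j, predictions at its covariates use all pools except j.
    """
    Xp = np.asarray(Xp, dtype=float)
    Y_tilde = np.asarray(Y_tilde, dtype=float)
    J = Xp.shape[0]
    scores = []
    for h in h_grid:
        total = 0.0
        for j in range(J):
            keep = np.arange(J) != j                 # drop the whole pool j
            X_train = Xp[keep].ravel()
            Y_train = Y_tilde[keep].ravel()
            for x_jk, y_jk in zip(Xp[j], Y_tilde[j]):
                total += (y_jk - local_linear(x_jk, X_train, Y_train, h)) ** 2
        scores.append(total)
    scores = np.array(scores)
    return h_grid[int(np.argmin(scores))], scores
```

Leaving out an entire pool rather than a single individual keeps pseudo responses that share the same pooled measurement out of the training data for that pool. The double loop is written for clarity, not speed.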
6. Simulation study
6.1. Design of simulation experiments
To compare different estimators of m(x) in regard to their finite sample performance, and to explore other factors that may influence the estimation, we carry out an empirical study using synthetic data. More specifically, we adopt the following data generating processes reported in Delaigle et al. (2009) to generate individual-level response data:
(D1) , where X1 follows a distribution with pdf given by $0.1875x^2 I(-2 \le x \le 2)$ and X2 ∼ uniform(−1,1);
(D2) , where the distributions of X1 and X2 are as those specified in (D1);
(D3) ;
(D4) .
Under each generating process, we generate individual-level data, , where N ∈ {600, 1200}. Given an individual-level data set, we create pooled data, first using random pooling and then using homogeneous pooling, with a common pool size c = 2, 3, 4, 5, 6 across all J pools. Given each pooled data set, we obtain three local linear estimates for the mean function, , and . In addition, we also compute the local linear estimate using individual-level data, , as a benchmark estimate. In all four estimators, we set K(t) as the Epanechnikov kernel. The empirical integrated squared error (ISE) is the metric we use to assess the overall quality of an estimated mean function, defined by for an estimator . Additionally, we monitor in the simulation the pointwise empirical bias and standard error of each estimate for m(·).
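Two ingredients of such a simulation can be sketched directly. The first draws from the density 0.1875x² on [−2, 2] by inverting its distribution function, (x³ + 8)/16. The second approximates an integrated squared error on a grid by the trapezoidal rule; the integration range and grid are assumptions made here, since they depend on the ISE definition used in the study. Function names are hypothetical.

```python
import numpy as np

def sample_x1(n, rng):
    """Inverse-CDF draws from the density 0.1875 * x^2 on [-2, 2]."""
    u = rng.uniform(size=n)
    return np.cbrt(16.0 * u - 8.0)       # solves (x^3 + 8) / 16 = u for x

def empirical_ise(m_hat, m_true, grid):
    """Trapezoidal approximation of the integral of (m_hat - m_true)^2 over grid.

    m_hat, m_true : callables accepting a NumPy array of evaluation points.
    grid          : increasing evaluation points spanning the assumed range.
    """
    grid = np.asarray(grid, dtype=float)
    sq = (np.asarray(m_hat(grid)) - np.asarray(m_true(grid))) ** 2
    return float(np.sum(0.5 * (sq[1:] + sq[:-1]) * np.diff(grid)))
```

For example, `empirical_ise(lambda x: x + 1.0, lambda x: x, np.linspace(-1, 1, 101))` integrates a constant squared difference of 1 over an interval of length 2.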
6.2. Simulation results
We summarize in this section simulation results when individual-level data are generated according to (D2) with N = 600. Counterpart results relating to (D1), (D3), and (D4) are provided in Appendix E in the supplementary materials.
More specifically, Figure 3 shows boxplots of 500 realizations of ISE associated with the two proposed consistent estimators based on random pooling data, and , and those corresponding to the two proposed consistent estimators based on homogeneous pooling data, and , at each considered pool size, all comparing with ISEs of . Evidently, when homogeneous pooling data are used, the overall performances of the two proposed estimators are similar to the benchmark estimator based on individual level data, . In contrast, when random pooling data are used, albeit consistent, both and exhibit much higher ISE than does, especially when the pool size is larger.
Figure 3.
Boxplots of 500 ISEs associated with each of the consistent local linear estimators for m(x) based on random pooling data (upper panels) and those based on homogeneous pooling data (lower panels) under (D2) at each pool size configuration, all compared with boxplots of ISEs associated with the local linear estimator based on individual-level data (IT). Consistent estimators based on random pooling data include the product-weighted estimator and the marginal-integration estimator. Consistent estimators based on homogeneous pooling data include the average-weighted estimator and the product-weighted estimator.
Instead of the overall performance of a consistent estimator over a range of covariate values, Figures 4 and 5 depict the pointwise performance of all four estimators (the individual-level benchmark, the average-weighted estimator, the product-weighted estimator, and the marginal-integration estimator) in regard to bias, variance, and mean squared error (MSE), when c = 2 and c = 5, respectively. Under random pooling (see upper panels of Figures 4 and 5), the average-weighted estimator is unable to capture the shape of m(x), and it fails more severely in regions with more curvature. The product-weighted estimator is able to recover the overall shape of m(x), although it is more variable than the benchmark, especially around the inflection points of m(x). With c = 2 (as in Figure 4), the marginal-integration estimator performs similarly to the product-weighted estimator. When c = 5 (as in Figure 5), the marginal-integration estimator outperforms the product-weighted estimator substantially in every regard. This is in line with the implication of Theorem 4.1 that the variance of the product-weighted estimator inflates faster as the pool size increases than the variance of the marginal-integration estimator does. Under homogeneous pooling (see lower panels of Figures 4 and 5), the marginal-integration estimator distorts the functional form of m(x), whereas both the average-weighted and the product-weighted estimators perform similarly to the benchmark, in regard to both accuracy and precision.
Figure 4.
Four estimates for m(x) under (D2) when c = 2: the local linear estimate based on individual-level data (first column), the average-weighted estimate (second column), the product-weighted estimate (third column), and the marginal-integration estimate (fourth column). The latter three estimates are based on random pooled data in the upper panels, and are based on homogeneous pooled data in the lower panels. Within each panel, the dot-dashed curve is the true function m(x), the blue curve is the pointwise mean curve based on the 500 function estimates, the gray band is constructed by the mean curve plus and minus 1.96 times the pointwise standard deviation curve, and the dotted line provides a comparison of the pointwise mean squared error curve across different estimates, plotted with respect to the right axis of each subfigure.
Figure 5.
Four estimates for m(x) under (D2) when c = 5: the local linear estimate based on individual-level data (the first column), the average-weighted estimate (the second column), the product-weighted estimate (the third column), and the marginal-integration estimate (the fourth column). The latter three estimates are based on random pooled data in the upper panels and on homogeneous pooled data in the lower panels. Within each panel, the dot-dashed curve is the true function m(x), the blue curve is the pointwise mean curve based on the 500 function estimates, the gray band is constructed as the mean curve plus and minus 1.96 times the pointwise standard deviation curve, and the dotted line is the pointwise mean squared error curve, plotted with respect to the right axis of each subfigure to allow comparison across estimates.
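The pointwise summaries plotted in Figures 4 and 5 (mean curve, band of plus and minus 1.96 standard deviations, and MSE curve) can be computed from a matrix of replicated function estimates. The following is a minimal sketch under the assumption that the replicates are stored row-wise on a common grid; the function name and the returned keys are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def pointwise_summary(estimates, truth, z=1.96):
    """Summarize replicated curve estimates on a common grid.

    estimates : (R, G) array, one row per Monte Carlo replicate.
    truth     : (G,) array, the true m(x) evaluated on the same grid.
    """
    estimates = np.asarray(estimates, dtype=float)
    truth = np.asarray(truth, dtype=float)
    mean_curve = estimates.mean(axis=0)            # pointwise mean curve
    sd_curve = estimates.std(axis=0, ddof=1)       # pointwise standard deviation
    bias = mean_curve - truth                      # pointwise bias
    mse = ((estimates - truth) ** 2).mean(axis=0)  # pointwise mean squared error
    return {
        "mean": mean_curve,
        "sd": sd_curve,
        "bias": bias,
        "mse": mse,
        "lower": mean_curve - z * sd_curve,        # lower edge of the band
        "upper": mean_curve + z * sd_curve,        # upper edge of the band
    }
```

With 500 replicates and a grid of x values, `estimates` would have 500 rows; the "mse" entry corresponds to the curve drawn against the right axis in each panel.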
7. Real-life applications
In this section, we analyze data from two real-life applications to illustrate the proposed local linear estimators for a conditional mean function. The individual-level observations are available in both applications, making it feasible to compute the local linear estimate based on individual-level data, against which we compare our proposed estimates based on pooled data. In all considered estimators, we set K(t) as the Epanechnikov kernel.
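For reference, the benchmark used throughout this section can be illustrated by a standard local linear fit with the Epanechnikov kernel on individual-level data. The sketch below is an illustration only: the function names and the bandwidth argument h are assumptions, bandwidth selection (for example, by cross validation) is not shown, and the estimators for pooled data are not reproduced here.

```python
import numpy as np

def epanechnikov(t):
    """Epanechnikov kernel: K(t) = 0.75 * (1 - t^2) for |t| <= 1, else 0."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 1.0, 0.75 * (1.0 - t ** 2), 0.0)

def local_linear(x0, x, y, h):
    """Local linear estimate of E[Y | X = x0] from individual-level (x, y).

    Solves the kernel-weighted least squares problem
        min over (b0, b1) of sum_i K((x_i - x0) / h) * (y_i - b0 - b1 * (x_i - x0))^2
    in closed form and returns the intercept b0.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    d = x - x0
    w = epanechnikov(d / h)
    s0, s1, s2 = w.sum(), (w * d).sum(), (w * d ** 2).sum()
    t0, t1 = (w * y).sum(), (w * d * y).sum()
    denom = s0 * s2 - s1 ** 2
    if denom <= 0.0:
        # Too few distinct points inside the kernel window around x0.
        return float("nan")
    return (s2 * t0 - s1 * t1) / denom
```

Evaluating `local_linear` over a grid of x0 values gives the estimated mean curve. Because the fit is locally linear, it reproduces an exactly linear mean function without bias.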
Example 1 (Perfluorinated chemicals):
The first data set is from the National Health and Nutrition Examination Survey, relating to a study of the bioaccumulation of perfluorinated chemicals (PFCs) in human bodies. PFCs, many of which are toxic and accumulate in human bodies, are widely used in the coating of industrial products, such as food packaging foams and non-stick cookware surfaces. Kärrman et al. (2006) studied the relationship between the concentration levels of PFCs in an individual’s blood and one’s age, gender, and geographic region using pooled serum samples of individuals in Australia. The data we consider here include concentration levels of multiple PFCs in the serum samples of 1,904 residents in the United States between 2011 and 2012, along with their demographic information. The goal of our analysis is to infer the relationship between the concentration level of one particular type of PFC, perfluorohexane sulfonic acid (PFHxS, Y), in an individual’s blood and his or her age (X).
To assess the uncertainty of each estimation procedure, we generate 500 bootstrap samples from the raw individual-level data. Based on each bootstrap version of the individual-level data, we compute the local linear estimate for the mean concentration level of PFHxS given one’s age. Additionally, using the original data, we randomly create 952 pools, each of size two, producing a set of random pooled data; we also create 952 pools of equal size based on the sorted data for age, producing a set of homogeneous pooled data. With the pool composition under each pooling design fixed, 500 bootstrap versions of random pooled data and 500 bootstrap versions of homogeneous pooled data are generated by resampling pools with replacement. Using each pooled data set, we compute the average-weighted, product-weighted, and marginal-integration estimates, resulting in 500 realizations of each estimator.
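The pool construction and resampling scheme described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, the arrays x (age) and y (PFHxS level) stand for the individual-level data, and the pooled response is assumed to be the average of the individual responses within a pool.

```python
import numpy as np

def random_pools(n, c, rng):
    """Randomly partition indices 0..n-1 into n // c pools of size c.
    Any leftover indices (when n is not a multiple of c) are dropped."""
    idx = rng.permutation(n)[: (n // c) * c]
    return idx.reshape(-1, c)

def homogeneous_pools(x, c):
    """Form pools from individuals with adjacent covariate values after
    sorting by x; the largest leftover values are dropped if needed."""
    idx = np.argsort(x, kind="stable")[: (len(x) // c) * c]
    return idx.reshape(-1, c)

def pooled_data(x, y, pools):
    """Return individual covariates per pool (J x c) and the pooled
    response per pool (length J). Assumption: the pooled response is the
    within-pool average of the individual responses."""
    return x[pools], y[pools].mean(axis=1)

def bootstrap_pools(x_pool, z, n_boot, rng):
    """Yield bootstrap data sets by resampling whole pools with
    replacement, keeping each pool's composition fixed."""
    n_pools = len(z)
    for _ in range(n_boot):
        pick = rng.integers(0, n_pools, size=n_pools)
        yield x_pool[pick], z[pick]
```

With n = 1,904 individuals and c = 2, both pooling functions return 952 pools, matching the design in this example. Each bootstrap data set would then be passed to an estimator for pooled data.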
Figure 6 depicts the average of each estimate across 500 bootstrap samples and two quantiles of selected estimates. When random pooled data are used, the marginal-integration estimate closely matches the benchmark estimate based on individual-level data, both indicating a relatively stable level of PFHxS with a slight decrease as one approaches age 40, and then a steep increase in the concentration level once one passes around age 50. This pattern can be explained by the fact that PFHxS can be partly eliminated from the human body via, for instance, gastrointestinal activities, menstrual bleeding, and breast feeding (Genuis et al. 2013), but many of these pathways of PFC elimination become less active or are completely lost (for example, due to menopause) after one reaches a certain age. In contrast, the average-weighted estimate and the product-weighted estimate suggest a much slower and nearly constant increase in the concentration level as one gets older across the entire observed age range. We believe that this is one case where the average-weighted estimate fails to capture the underlying pattern of m(x) due to its inherent inconsistency in estimation, and the product-weighted estimate also misses this pattern due to its high uncertainty in estimation. In conclusion, when only random pooled data are available, the marginal-integration estimate provides a more reliable estimate for the underlying relationship between one’s PFHxS level in blood and age than the other two proposed estimates, although its variability is slightly higher than that of according to the bootstrap quantiles of the two estimates.
Figure 6.
Results from Example 1 (Perfluorinated chemicals). Top panels depict the average of each considered estimate across 500 bootstraps. The black dots are individual observations, with observations far larger than 4 omitted. Within each panel, the solid black line corresponds to the local linear estimate based on individual-level data; the solid red, blue, and green lines correspond to the average-weighted estimate, the product-weighted estimate, and the marginal-integration estimate, respectively. Bottom panels show two quantiles of the estimates across 500 bootstraps. The dashed black, red, and green lines are the 5% and 95% quantiles of the individual-level estimate, the average-weighted estimate, and the marginal-integration estimate, respectively.
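The bootstrap average and the 5% and 95% quantile curves shown in the panels can be obtained from a matrix of bootstrap curve estimates. The following is a minimal sketch, assuming the bootstrap curves are stored row-wise on a common grid of ages; the function name is hypothetical.

```python
import numpy as np

def bootstrap_band(curves, lower=0.05, upper=0.95):
    """Pointwise bootstrap average and quantile curves.

    curves : (B, G) array, one row per bootstrap sample, one column per
             grid point at which the mean function is estimated.
    Returns (average, lower quantile curve, upper quantile curve).
    """
    curves = np.asarray(curves, dtype=float)
    avg = curves.mean(axis=0)
    lo, hi = np.quantile(curves, [lower, upper], axis=0)
    return avg, lo, hi
```

A wider gap between the two quantile curves at a given age indicates higher bootstrap variability of the estimate at that age.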
When homogeneous pooled data are used (see the top-right panel of Figure 6), the marginal-integration estimate appears to exaggerate the curvature of the conditional mean function, resulting in a much faster increase in the concentration level once one passes age 50, compared to the rate of increase indicated by the same estimate under random pooling. Despite the use of pooled data, the average-weighted and product-weighted estimates are nearly indistinguishable from the estimate based on individual-level data, and these three estimates mostly preserve the earlier estimated pattern of m(x) that can be justified on scientific grounds. Moreover, the variability of the average-weighted estimate is comparable with that of the individual-level estimate according to the bootstrap quantiles associated with these two estimates. In conclusion, the marginal-integration estimate based on homogeneous pooled data leads to misleading inference for the underlying truth, whereas the other two estimates based on pooled data provide inference similar to that from the estimate based on individual-level data without noticeable efficiency loss.
Example 2 (Chemokines):
The second data set we use to illustrate local linear estimation using different types of data is from the Collaborative Perinatal Project (CPP), which is a long-standing, collaborative project on maternal and child health in the United States. More specifically, these data include chemokine levels collected from 388 pregnant females recruited in CPP, with measurements taken at the individual level as well as the pool level, with 194 non-overlapping pools of size two randomly formed. Chemokines are a family of small proteins related to the homeostatic and inflammatory process in the human body. Medical researchers have studied extensively the role that chemokines play in the immune system. For example, regarding two particular chemokines, GRO-α (also known as CXCL1) and MCP-3, Dhawan and Richmond (2002) investigated the role of the former in tumorigenesis, and Tsou et al. (2007) studied the latter in monocyte mobilization.
Based on the observed individual-level data and the random pooled data available in CPP, we infer the conditional mean concentration of GRO-α (Y) given MCP-3 (X). For illustration purposes, we generate another pooled data set, with a common pool size of two, following the homogeneous pooling design based on sorted MCP-3 levels. To assess the uncertainty of each estimation method, we generate 500 bootstrap samples for each of the three data types (individual-level data, random pooled data, and homogeneous pooled data), following the same resampling process described in the first example. Figure 7 shows the average of each considered estimate across 500 bootstrap samples and two quantiles of selected estimates.
Figure 7.
Results from Example 2 (Chemokines). Top panels depict the average of each considered estimate across 500 bootstraps. The black dots are individual observations. Within each panel, the solid black line corresponds to the local linear estimate based on individual-level data; the solid red, blue, and green lines correspond to the average-weighted estimate, the product-weighted estimate, and the marginal-integration estimate, respectively. Bottom panels show two quantiles of the estimates across 500 bootstraps. The dashed black, red, and green lines are the 5% and 95% quantiles of the individual-level estimate, the average-weighted estimate, and the marginal-integration estimate, respectively.
Similar to the phenomena in the first example, the marginal-integration estimate yields an estimate for the mean concentration level of GRO-α given the level of MCP-3 that is similar to the benchmark estimate based on individual-level data when random pooled data are used, but it grossly deviates from this benchmark estimate when homogeneous pooled data are used. In contrast, the other two proposed local linear estimates based on random pooled data go through an obviously uninteresting region of the observed data, yet both estimates applied to homogeneous pooled data closely follow the benchmark estimate, and they show only slight discrepancy from it around the region where data are relatively scarce.
8. Discussion
We present in this article methods for estimating the mean of a continuous response given covariates via local polynomial regression when only pooled response data are observed along with individual-level covariates. Two commonly adopted pooling designs in practice are considered when formulating the local polynomial estimators, and properties of the proposed estimators are compared under each of the pooling designs. We use two real-life applications to illustrate the implementation and performance of the proposed estimators in comparison with their counterpart estimator when individual response data are available. Findings from the two applications are in line with observations on their finite sample performance using synthetic data from the simulation study, which agree with the theoretical implications of the large-sample properties derived for the proposed estimators.
In summary, the marginal-integration estimator is the winner among the three proposed when pooled data are from a random pooling design, but it fails when pools are not formed randomly; the average-weighted estimator performs the best when homogeneous pooled data are used, but it is an inconsistent estimator for the mean function when pools are formed randomly; the product-weighted estimator is a consistent estimator under both pooling designs, but is subject to high variability under random pooling. The two winners, i.e., the marginal-integration estimator under random pooling and the average-weighted estimator under homogeneous pooling, share the same bias asymptotically. A closer look at their asymptotic variances (see Appendix C in the supplementary materials) reveals that the asymptotic variance of the former is times that of the latter, indicating that the former tends to be more variable than the latter given a fixed N and c. These patterns of comparisons between the two winning estimators (based on different pooling designs) are also observed in the numerical examples (e.g., Figures 4, 5, 6, and 7). Based on our discussions in Section 4.2, we believe that there is still room for improvement by more selectively incorporating individual covariate information within a pool to relate to the pooled response of that pool, as opposed to either using all covariate information (as in the average-weighted and product-weighted estimators) or using one individual’s covariate information (as in the marginal-integration estimator). Following this more selective incorporation of covariate information for each pool, an alternative construction of the weight function in the objective function may be needed accordingly to exploit a more sensible measure of distance between selected individuals’ covariate information and x, the value at which the mean function is of interest. We are hopeful that this more refined strategy for constructing the objective function can lead to a local polynomial estimator that outperforms all three estimators proposed in the current study regardless of the pooling design.
Another direction for follow-up research is motivated by the fact that, in many applications, covariates of interest cannot be measured precisely or observed directly. It is then of interest to carry out local polynomial regression to infer m(x) using pooled response data and individual-level covariate data that are prone to measurement error.
Acknowledgments
We are grateful to the Associate Editor and the two anonymous referees for their insightful comments and helpful suggestions that greatly improved our work. Wang’s work was supported by the National Institutes of Health (NIH) under the Award Number R03 AI135614.
Supplementary Materials
The online supplementary materials contain proofs of Theorems 4.1 and 4.2, and additional simulation results referred to in Section 6.
References
- Bilder CR and Tebbs JM (2009). Bias, efficiency, and agreement for group-testing regression models. Journal of Statistical Computation and Simulation, 79(1):67–80.
- Deckert A, Bärnighausen T, and Kyei N (2020). Pooled-sample analysis strategies for COVID-19 mass testing: a simulation study. Bulletin of the World Health Organization. E-pub: 2 April 2020.
- Delaigle A, Fan J, and Carroll R (2009). A design-adaptive local polynomial estimator for the errors-in-variables problem. Journal of the American Statistical Association, 104(485):348–359.
- Delaigle A and Hall P (2012). Nonparametric regression with homogeneous group testing data. The Annals of Statistics, 40(1):131–158.
- Dhawan P and Richmond A (2002). Role of CXCL1 in tumorigenesis of melanoma. Journal of Leukocyte Biology, 72(1):9–18.
- Fan J and Gijbels I (1996). Local Polynomial Modelling and Its Applications: Monographs on Statistics and Applied Probability 66, volume 66. Chapman & Hall/CRC.
- Fan J, Heckman NE, and Wand MP (1995). Local polynomial kernel regression for generalized linear models and quasi-likelihood functions. Journal of the American Statistical Association, 90(429):141–150.
- Fukuda K (2006). Age–period–cohort decomposition of aggregate data: an application to US and Japanese household saving rates. Journal of Applied Econometrics, 21(7):981–998.
- Genuis SJ, Curtis L, and Birkholz D (2013). Gastrointestinal elimination of perfluorinated compounds using cholestyramine and Chlorella pyrenoidosa. ISRN Toxicology, 2013.
- Gu J, Li Q, and Yang J-C (2015). Multivariate local polynomial kernel estimators: Leading bias and asymptotic distribution. Econometric Reviews, 34(6–10):979–1010.
- Heffernan A, English K, Toms L, Calafat A, Valentin-Blasini L, Hobson P, Broomhall S, Ware R, Jagals P, Sly P, et al. (2016). Cross-sectional biomonitoring study of pesticide exposures in Queensland, Australia, using pooled urine samples. Environmental Science and Pollution Research, 23(23):23436–23448.
- Jiang R, Manchanda P, and Rossi PE (2009). Bayesian analysis of random coefficient logit models using aggregate data. Journal of Econometrics, 149(2):136–148.
- Kärrman A, Mueller JF, Van Bavel B, Harden F, Toms L-ML, and Lindström G (2006). Levels of 12 perfluorinated chemicals in pooled Australian serum, collected 2002–2003, in relation to age, gender, and region. Environmental Science & Technology, 40(12):3742–3748.
- Kato K, Calafat AM, Wong L-Y, Wanigatunga AA, Caudill SP, and Needham LL (2009). Polyfluoroalkyl compounds in pooled sera from children participating in the National Health and Nutrition Examination Survey 2001–2002. Environmental Science & Technology, 43(7):2641–2647.
- Kendziorski C, Zhang Y, Lan H, and Attie A (2003). The efficiency of pooling mRNA in microarray experiments. Biostatistics, 4(3):465–477.
- Lin J and Wang D (2018). Single-index regression for pooled biomarker data. Journal of Nonparametric Statistics, 30(4):813–833.
- Linton O and Whang Y-J (2002). Nonparametric estimation with aggregated data. Econometric Theory, 18(2):420–468.
- Liu Y, McMahan C, and Gallagher C (2017). A general framework for the regression analysis of pooled biomarker assessments. Statistics in Medicine, 36(15):2363–2377.
- Ma C-X, Vexler A, Schisterman EF, and Tian L (2011). Cost-efficient designs based on linearly associated biomarkers. Journal of Applied Statistics, 38(12):2739–2750.
- Malinovsky Y, Albert PS, and Schisterman EF (2012). Pooling designs for outcomes under a Gaussian random effects model. Biometrics, 68(1):45–52.
- Martinez-Espineira R (2003). Estimating water demand under increasing-block tariffs using aggregate data and proportions of users per block. Environmental and Resource Economics, 26(1):5–23.
- Masry E (1996). Multivariate local polynomial regression for time series: uniform strong consistency and rates. Journal of Time Series Analysis, 17(6):571–599.
- McMahan CS, McLain AC, Gallagher CM, and Schisterman EF (2016). Estimating covariate-adjusted measures of diagnostic accuracy based on pooled biomarker assessments. Biometrical Journal, 58(4):944–961.
- Mitchell EM, Lyles RH, Manatunga AK, Danaher M, Perkins NJ, and Schisterman EF (2014). Regression for skewed biomarker outcomes subject to pooling. Biometrics, 70(1):202–211.
- Mitchell EM, Lyles RH, Manatunga AK, and Schisterman EF (2015). Semiparametric regression models for a right-skewed outcome subject to pooling. American Journal of Epidemiology, 181(7):541–548.
- Mosites E, Rodriguez E, Caudill SP, Hennessy TW, and Berner J (2020). A comparison of individual-level vs. hypothetically pooled mercury biomonitoring data from the Maternal Organics Monitoring Study (MOMS), Alaska, 1999–2012. International Journal of Circumpolar Health, 79(1):1726256.
- Shih JH, Míchalowska AM, Dobbin K, Ye Y, Qiu TH, and Green JE (2004). Effects of pooling mRNA in microarray class comparisons. Bioinformatics, 20(18):3318–3325.
- Shu C and Burn DH (2003). Spatial patterns of homogeneous pooling groups for flood frequency analysis. Hydrological Sciences Journal, 48(4):601–618.
- Tsou C-L, Peters W, Si Y, Slaymaker S, Aslanian AM, Weisberg SP, Mack M, and Charo IF (2007). Critical roles for CCR2 and MCP-3 in monocyte mobilization from bone marrow and recruitment to inflammatory sites. The Journal of Clinical Investigation, 117(4):902–909.
- Vansteelandt S, Goetghebeur E, and Verstraeten T (2000). Regression models for disease prevalence with diagnostic tests on pools of serum samples. Biometrics, 56(4):1126–1133.