Author manuscript; available in PMC: 2023 Dec 1.
Published in final edited form as: Biometrics. 2021 Aug 1;78(4):1328–1341. doi: 10.1111/biom.13516

Varying-coefficient regression analysis for pooled biomonitoring

Dewei Wang 1,*, Xichen Mou 2, Yan Liu 3
PMCID: PMC8716640  NIHMSID: NIHMS1719690  PMID: 34190334

Summary:

Human biomonitoring involves measuring the accumulation of contaminants in biological specimens (such as blood or urine) to assess individuals' exposure to environmental contamination. Because a single assay is expensive, pooling has become increasingly common in environmental studies. Pooling starts by physically mixing specimens into pools and then measures the concentration of contaminants in the pooled specimens. An important task is to reconstruct individual-level statistical characteristics from pooled measurements. In this article, we propose to use the varying-coefficient regression model for individual-level biomonitoring and provide methods to estimate the varying coefficients based on different types of pooled data. Asymptotic properties of the estimators are presented. We illustrate our methodology via simulation and with an application to pooled biomonitoring of a brominated flame retardant provided by the National Health and Nutrition Examination Survey (NHANES).

Keywords: Homogeneous pooling, Local linear fit, NHANES, Pooled biospecimens, Random pooling, Varying-coefficient models

1. Introduction

Physically combining individual specimens into pools for analysis has a long history, tracing back to the 1940s, when it was used to screen World War II U.S. inductees for syphilis (Dorfman, 1943). Since this seminal work, pooling has been widely used in many applications, including gene mutation detection (Gastwirth, 2000), drug discovery (Remlinger et al., 2006), disease screening (Lewis et al., 2012), blood safety (Stramer et al., 2013), and the motivating example of this study, biomonitoring of environmental chemicals.

In biomonitoring, assays used to measure chemical concentration levels in human specimens are often prohibitively expensive, such that only a small-scale study is financially feasible (Caudill, 2012). An increasingly common solution is pooling. By mixing multiple individual specimens into a pool and then analyzing the pooled specimen, we can use one assay to assess the chemical level of all the contributing members, and thus include more individuals in the study within financial constraints. In addition to reducing cost, pooling could also help us save irreplaceable specimens, meet a minimum volume required by the assay, and preserve the anonymity of participants. Because of these advantages, pooling has gained more and more popularity in human biomonitoring of environmental chemicals. For example, Kärrman et al. (2006) pooled serum samples to assess human exposure to perfluorinated chemicals, and Heffernan et al. (2016) pooled urine specimens to monitor multiple pesticide-related chemicals. Furthermore, the National Health and Nutrition Examination Survey (NHANES) has been continually assessing the U.S. population’s exposure to various environmental chemicals via pooling since 2005 (Kato et al., 2009).

In statistical analysis, a pooled observation is viewed as an aggregation of individual-level data that would be available only if the individual specimens were measured one by one. Analysis starts by specifying a model for the latent individual-level data and then develops a method to estimate the model from pooled observations. The estimation further helps assess how pooling influences statistical precision under different pool sizes and pooling schemes.

Various models for individual-level data have been estimated from pooled observations, predominantly focusing on topics from population-level estimation (e.g., see Caudill et al., 2007; Li et al., 2014) to parametric regression (e.g., see Ma et al., 2011; Malinovsky et al., 2012; Mitchell et al., 2014; Liu et al., 2017). Recently, Wang et al. (2020) proposed nonparametric estimators of the individual-level conditional mean given a continuous covariate based on pooled observations. When multiple covariates are present, directly applying the method of Wang et al. (2020) cannot avoid the so-called "curse of dimensionality" in the nonparametric framework. The literature provides many sophisticated models for data with more than one covariate. Examples include, but are not limited to, additive models (Hastie and Tibshirani, 1987), varying-coefficient models (Cleveland et al., 1991; Hastie and Tibshirani, 1993), partially linear models (Green and Silverman, 1993), and single-index models (Ichimura, 1993; Lin and Wang, 2018). These models emphasize different aspects of modeling and together form a useful toolkit that relaxes traditional parametric assumptions for data with multiple covariates. In this article, we focus on the varying-coefficient model.

Our choice of the varying-coefficient model is motivated by a particular interest in human biomonitoring of environmental chemicals: modeling the age-varying association between chemical exposures and covariates such as gender, race, and body mass index (BMI). Varying-coefficient models naturally capture such age-varying associations. Although varying-coefficient models have been studied extensively in the literature, none of the existing work provides solutions for pooled observations. In this article, we consider two types of pooled data obtained from two different pooling schemes, namely random pooling and homogeneous pooling. When creating pools, random pooling randomly assigns individuals into pools. This is commonly used when individual covariate information is not available before pooling. In contrast, homogeneous pooling assigns individuals with similar covariate values into pools. In Section 3, we provide kernel-based methods to estimate the varying coefficients from both homogeneously and randomly pooled biomonitoring data. The large-sample properties in Section 4 reveal how pooling affects the estimation theoretically. We illustrate the practical performance of our procedure via simulation and an application to pooled biomonitoring of a brominated flame retardant (BFR) in Sections 5 and 6, respectively. Section 7 concludes the article with a discussion.

2. A motivating example: pooled biomonitoring of BFRs

BFRs are chemicals commonly used to reduce the flammability of consumer products such as electronics, toys, clothes, and furniture. The main routes of human exposure include inhalation, ingestion, and dermal contact (Kim et al., 2014). These chemicals' persistence, bioaccumulation, and biomagnification in humans could induce many harmful health effects. For example, BFRs have been identified as a possible cause of endocrine disorders, particularly in children and during fetal development, and chronic exposure to BFRs can increase risks for lymphatic cancer, liver cancer, diabetes, and metabolic syndrome (Tilley and Fry, 2015). Because of these adverse health effects, the U.S. Centers for Disease Control and Prevention, in conjunction with NHANES, has been conducting biennial biomonitoring of these chemicals to assess the U.S. population's exposure for the past two decades.

The NHANES uses a probability sampling design to select participants who are representative of the civilian, noninstitutionalized U.S. population. Sampling weights were provided for all participants. After collection, participants’ serum specimens were shipped on dry ice to the National Center for Environmental Health and stored at −70°C for analysis. A random one-third subsample of the participants aged 12 years and older was measured for the BFRs in pools. Serum samples were pooled homogeneously based on participants’ covariate information. The creation of pooled serum followed a weighted pooling-sample design. In each pool, the volume chosen from each participant is based on the ratio of its sampling weight to the sum of the sampling weights of all participants in the pool. All the pooled measurements, along with individual-level covariates, were provided.

A motivating question is whether we could utilize NHANES' efforts in obtaining these pooled data to estimate the individual-level age-varying associations between chemical exposure and covariates. An ideal solution is the varying-coefficient model. Note that the literature has not yet reported methods to estimate a varying-coefficient model from pooled data. This article provides the first one, and we illustrate the methodology via an NHANES pooled biomonitoring data set of a BFR, PBDE-47 (known as 2,2′,4,4′-tetrabromodiphenyl ether), in Section 6.

3. Methodology

To emulate a resource-limited setting, we consider that the budget can only afford J assays. In the pooling context, these J assays analyze J pooled specimens. When different pooling schemes are used, they yield different types of pooled biomonitoring data which require different estimation strategies.

3.1. Random Pooling

The first scheme is random pooling, which randomly assigns individual serum samples into non-overlapping pools. Suppose in total there are $N$ individuals randomly assigned into the $J$ pools of size $c_j$ for $j = 1, \ldots, J$; i.e., $N = \sum_{j=1}^{J} c_j$. We label the individual-level data of the $i$th individual in the $j$th pool by subscript $ij$ as $(Y_{ij}, U_{ij}, X_{ij}, W_{ij})$, for $i = 1, \ldots, c_j$ and $j = 1, \ldots, J$, where $Y_{ij}$ is the concentration level of the pollutant, $U_{ij}$ is a continuous covariate of our key interest, $X_{ij} = (X_{ij0} = 1, X_{ij1}, \ldots, X_{ijp})^\top$ collects other covariates of the participant, and $W_{ij}$ is the sample weight assigned by the survey. We assume that these individual-level data are independent and follow a varying-coefficient model,

$$Y_{ij} = X_{ij}^\top \beta(U_{ij}) + \epsilon_{ij}, \tag{1}$$

where $\beta(u) = \{\beta_0(u), \ldots, \beta_p(u)\}^\top \in \mathbb{R}^{p+1}$ is unknown with each $\beta_k(u)$ being a smooth function of $u$. In practice, the selection of the index variable $U_{ij}$ always depends on the scientific problem at hand (see, e.g., Hoover et al., 1998; Wang and Xia, 2009; Mu et al., 2018). Our study of the NHANES data chooses $U_{ij}$ to be age, so that $\beta(u)$ describes the age-varying association between $Y_{ij}$ and $X_{ij}$. We further assume that $E(\epsilon_{ij} \mid U_{ij}, X_{ij}, W_{ij}) = 0$ and $\mathrm{var}(\epsilon_{ij} \mid U_{ij}, X_{ij}, W_{ij}) = \sigma^2(U_{ij}, X_{ij}, W_{ij})$ for some variance function $\sigma^2(\cdot, \cdot, \cdot) > 0$. For generality, the conditional variance is left unspecified so that many common scenarios are included as special cases. For example, we could take $\sigma^2(U_{ij}, X_{ij}, W_{ij}) = \sigma^2(U_{ij})$ to be univariate as in Fan and Zhang (1999), or $\sigma^2(U_{ij}, X_{ij}, W_{ij}) = \sigma^2$ for some constant $\sigma^2$ as in Eubank et al. (2004), or $\sigma^2(U_{ij}, X_{ij}, W_{ij}) = \sigma^2/W_{ij}$ to account for the sampling weights.

Under NHANES' weighted pooling-sample design, the $Y_{ij}$'s are all latent. What is observed are the chemical concentrations from pools, formulated by $Z_j = \sum_{i=1}^{c_j} W_{ij} Y_{ij} / W_j$, where $W_j = \sum_{i=1}^{c_j} W_{ij}$, for each $j$ (see, e.g., Caudill, 2012; Mitchell et al., 2014; Liu et al., 2017). Our goal is to estimate $\beta(u)$ using the $Z_j$'s and $\mathcal{D}_R = \{(U_{ij}, X_{ij}, W_{ij}) : i = 1, \ldots, c_j,\ j = 1, \ldots, J\}$.
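To make the weighted pooling-sample design concrete, the following minimal sketch forms random pools and computes the pooled concentration of each pool as the weight-averaged individual concentration. The function name `pooled_measurements`, the equal pool size `c`, and the toy inputs are assumptions made for illustration only; they are not part of the NHANES protocol or of the authors' software.

```python
import random

def pooled_measurements(y, w, c, seed=0):
    """Randomly split individuals into pools of size c; return (pools, Z).

    pools[j] lists the individual indices in pool j, and
    Z[j] = sum_i W_ij * Y_ij / sum_i W_ij is the weighted pooled concentration.
    """
    n = len(y)
    if n % c != 0:
        raise ValueError("this sketch assumes N is a multiple of c")
    idx = list(range(n))
    random.Random(seed).shuffle(idx)            # random pool membership
    pools = [idx[k:k + c] for k in range(0, n, c)]
    Z = []
    for members in pools:
        w_pool = sum(w[i] for i in members)     # W_j, total weight of the pool
        Z.append(sum(w[i] * y[i] for i in members) / w_pool)
    return pools, Z

# Toy data (hypothetical): four individuals, two pools of size two.
y = [1.0, 2.0, 3.0, 4.0]
w = [1.0, 1.0, 2.0, 2.0]
pools, Z = pooled_measurements(y, w, c=2)
```

In practice the individual values `y` are never observed; the sketch only shows how each pooled value relates to them.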

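We extend the marginal-integration method proposed by Wang et al. (2020). Because the $(Y_{ij}, U_{ij}, X_{ij}, W_{ij})$'s are independent and identically distributed (iid), the $E(W_{ij} Y_{ij})$'s are all the same, denoted by $\mu^*$. Since the pools are formed randomly, we have a marginal-integration result,

$$E(W_j Z_j \mid U_{ij}, X_{ij}, W_{ij}) = \sum_{i'=1,\, i' \neq i}^{c_j} E(W_{i'j} Y_{i'j}) + E(W_{ij} Y_{ij} \mid U_{ij}, X_{ij}, W_{ij}) = (c_j - 1)\mu^* + W_{ij} X_{ij}^\top \beta(U_{ij}), \tag{2}$$

where $\mu^*$ can be estimated by $\tilde{\mu}^* = N^{-1} \sum_{j=1}^{J} \sum_{i=1}^{c_j} W_{ij} Y_{ij} = N^{-1} \sum_{j=1}^{J} W_j Z_j$. This leads to an approximate regression equation, $E\{W_j Z_j - (c_j - 1)\tilde{\mu}^* \mid U_{ij}, X_{ij}, W_{ij}\} \approx W_{ij} X_{ij}^\top \beta(U_{ij})$. Thus, we estimate $\beta(u)$ by minimizing the following locally sample-weighted least squares criterion,

$$\sum_{j=1}^{J} \sum_{i=1}^{c_j} W_{ij}^{-1} \left\{ W_j Z_j - (c_j - 1)\tilde{\mu}^* - W_{ij} X_{ij}^\top \beta(u) - W_{ij} (U_{ij} - u) X_{ij}^\top \beta'(u) \right\}^2 K_h(U_{ij} - u),$$

with respect to $\{\beta(u), \beta'(u)\}$, where $\beta'(u) = \{\beta_0'(u), \ldots, \beta_p'(u)\}^\top$ collects all the first derivatives, $K(\cdot)$ is a symmetric kernel density function, $h > 0$ is a user-chosen bandwidth, and $K_h(\cdot) = h^{-1} K(\cdot / h)$. Denote by $\{\tilde{\beta}(u), \tilde{\beta}'(u)\}$ the minimizer of the above least squares criterion. Then $\tilde{\beta}(u) = \{\tilde{\beta}_0(u), \ldots, \tilde{\beta}_p(u)\}^\top$ is our estimator of $\beta(u)$ for randomly pooled biomonitoring. A closed-form expression of $\tilde{\beta}(u)$ is available in Web Appendix A.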
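The minimization above is an ordinary weighted least squares problem at each $u$, so it can be solved with a single linear solve. The sketch below is one possible NumPy implementation under stated assumptions; it is not the closed form of Web Appendix A. The function names, the argument layout (a `pool` vector of integer labels $0, \ldots, J-1$), and the choice of the Epanechnikov kernel are illustrative assumptions.

```python
import numpy as np

def epanechnikov(t):
    """Epanechnikov kernel, supported on (-1, 1)."""
    return 0.75 * np.clip(1.0 - t**2, 0.0, None)

def fit_random_pooling(u0, Z, U, X, W, pool, h):
    """Local-linear estimate of beta(u0) from randomly pooled data.

    Z: (J,) pooled measurements; U, W: (N,) arrays; X: (N, p+1) array whose
    first column is 1; pool: (N,) integer pool label of each individual.
    """
    Z = np.asarray(Z, dtype=float)
    J = len(Z)
    Wj = np.bincount(pool, weights=W, minlength=J)    # pool weights W_j
    cj = np.bincount(pool, minlength=J)               # pool sizes c_j
    mu = np.sum(Wj * Z) / len(U)                      # estimate of mu*
    resp = (Wj * Z - (cj - 1) * mu)[pool]             # W_j Z_j - (c_j - 1) mu*
    # Regressors for {beta(u0), beta'(u0)}: W_ij * [X_ij, (U_ij - u0) X_ij]
    D = W[:, None] * np.hstack([X, (U - u0)[:, None] * X])
    omega = epanechnikov((U - u0) / h) / (h * W)      # W_ij^{-1} K_h(U_ij - u0)
    A = D.T @ (omega[:, None] * D)
    b = D.T @ (omega * resp)
    theta = np.linalg.solve(A, b)
    return theta[: X.shape[1]]                        # the beta(u0) part only
```

Calling the function over a grid of `u0` values traces out the estimated coefficient curves. The linear system can be singular when too few observations fall within the kernel window, which this sketch does not guard against.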

3.2. Homogeneous pooling

Homogeneous pooling assigns individuals with similar covariate values into pools. There are different ways to create homogeneous pools; e.g., the X-homogeneous pooling in Vansteelandt et al. (2000) and the k-means clustering in Mitchell et al. (2014). In the varying-coefficient regression context, we specify homogeneous pooling as if the homogeneity is formed with respect to the similarity of U-covariates. For example, in the biomonitoring of BFRs, NHANES pooled serum samples from participants with similar ages. This specification also includes the homogeneous pooling studied by Delaigle et al. (2012) (for disease screening) and Wang et al. (2020) (for biological marker evaluation) in nonparametric regression as special cases by taking X = 1.

Suppose the $N$ individuals in Section 3.1 are assigned into $J$ non-overlapping pools homogeneously. We denote the individual-level data of the $i$th individual in the $j$th homogeneous pool by $(Y_{ij}, U_{[ij]}, X_{ij}, W_{ij})$ for $i = 1, \ldots, c_j$ and $j = 1, \ldots, J$. The subscript $[ij]$ in the $U_{[ij]}$'s emphasizes that individuals' pool memberships are rearranged such that in each pool individuals now have similar $U$-values. Again, following the weighted pooling-sample design, the observed chemical concentration in the $j$th pool is $Z_j = \sum_{i=1}^{c_j} W_{ij} Y_{ij} / W_j$, where $W_j = \sum_{i=1}^{c_j} W_{ij}$. Our goal is to estimate $\beta(u)$ using the $Z_j$'s and $\mathcal{D}_H = \{(U_{[ij]}, X_{ij}, W_{ij}) : i = 1, \ldots, c_j,\ j = 1, \ldots, J\}$. Because of the homogeneity of the $U_{[ij]}$'s, the marginal-integration result in (2) no longer holds. Therefore, the methodology in Section 3.1 becomes invalid, and a different approach is required.

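By construction, the $U_{[ij]}$'s are close to each other within the $j$th pool. This closeness, along with the smoothness of $\beta(u)$ in $u$, motivates the following approximation,

$$E(Z_j \mid U_{[ij]}\text{'s}, X_{ij}\text{'s}, W_{ij}\text{'s}) = \frac{\sum_{i'=1}^{c_j} W_{i'j} X_{i'j}^\top \beta(U_{[i'j]})}{W_j} \approx \bar{X}_j^\top \beta(U_{[ij]}) \tag{3}$$

for each $i$, where $\bar{X}_j = \sum_{i=1}^{c_j} W_{ij} X_{ij} / W_j$. Consequently, we estimate $\beta(u)$ by minimizing

$$\sum_{j=1}^{J} \sum_{i=1}^{c_j} W_j \left\{ Z_j - \bar{X}_j^\top \beta(u) - (U_{[ij]} - u) \bar{X}_j^\top \beta'(u) \right\}^2 K_h(U_{[ij]} - u) \tag{4}$$

with respect to $\{\beta(u), \beta'(u)\}$. Denoting by $\{\bar{\beta}(u), \bar{\beta}'(u)\}$ the minimizer, our estimator for homogeneously pooled biomonitoring is $\bar{\beta}(u) = \{\bar{\beta}_0(u), \ldots, \bar{\beta}_p(u)\}^\top$. A closed-form expression of $\bar{\beta}(u)$ is also provided in Web Appendix A.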
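Criterion (4) is again a weighted least squares problem, now with the pool-level covariate average $\bar{X}_j$ as the regressor and $W_j K_h(U_{[ij]} - u)$ as the weight. The following NumPy sketch is one way to solve it, under stated assumptions: the function name, the `pool` label vector (integers $0, \ldots, J-1$), and the hard-coded Epanechnikov kernel are illustrative choices, not the authors' code.

```python
import numpy as np

def fit_homogeneous_pooling(u0, Z, U, X, W, pool, h):
    """Local-linear estimate of beta(u0) from homogeneously pooled data.

    Z: (J,) pooled measurements; U, W: (N,) arrays; X: (N, p+1) array whose
    first column is 1; pool: (N,) integer pool label of each individual.
    """
    Z = np.asarray(Z, dtype=float)
    J = len(Z)
    Wj = np.bincount(pool, weights=W, minlength=J)     # pool weights W_j
    # Xbar_j = sum_i W_ij X_ij / W_j, accumulated pool by pool
    Xbar = np.zeros((J, X.shape[1]))
    np.add.at(Xbar, pool, W[:, None] * X)
    Xbar /= Wj[:, None]
    Xb = Xbar[pool]                                    # Xbar_j repeated per member
    D = np.hstack([Xb, (U - u0)[:, None] * Xb])        # [Xbar_j, (U_[ij]-u0) Xbar_j]
    t = (U - u0) / h
    K = 0.75 * np.clip(1.0 - t**2, 0.0, None) / h      # Epanechnikov K_h
    omega = Wj[pool] * K                               # W_j K_h(U_[ij] - u0)
    A = D.T @ (omega[:, None] * D)
    b = D.T @ (omega * Z[pool])
    theta = np.linalg.solve(A, b)
    return theta[: X.shape[1]]                         # the beta(u0) part only
```

When every member of a pool shares the same $U$-value and $\beta(u)$ is linear, approximation (3) is exact and the sketch recovers $\beta(u_0)$ without error, which is a convenient sanity check.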

If $c_j = 1$ for all $j$, both types of pooled data reduce to the individual biomonitoring data $\{(Y_\ell, U_\ell, X_\ell, W_\ell) : \ell = 1, \ldots, J\}$ with the use of $J$ assays; i.e., the $J$ assays are used to test $J$ individual serum samples. Then, both the least squares criterion in Section 3.1 and (4) become

$$\sum_{\ell=1}^{J} W_\ell \left\{ Y_\ell - X_\ell^\top \beta(u) - (U_\ell - u) X_\ell^\top \beta'(u) \right\}^2 K_h(U_\ell - u), \tag{5}$$

and the two estimators, $\tilde{\beta}(u)$ and $\bar{\beta}(u)$, coincide. We denote the common version by $\hat{\beta}(u)$, an estimator of $\beta(u)$ when the individual-level concentrations $Y_{ij}$ are available. We will use $\hat{\beta}(u)$ as a reference to investigate the effect of pooling under the two schemes across different pool sizes. Compared with existing work on the individual-level varying-coefficient model (e.g., Fan and Zhang, 1999; Eubank et al., 2004), $\hat{\beta}(u)$ can incorporate sampling weights. We acknowledge that incorporating NHANES sampling weights via weighted regression is not new in the literature (see, e.g., Li et al., 2010).

4. Asymptotics

We now present large-sample properties of $\tilde{\beta}(u)$ and $\bar{\beta}(u)$ as $J \to \infty$. These properties reveal some interesting contrasts between the estimators using pooled data and the reference estimator $\hat{\beta}(u)$. To gain a clear understanding of the role of the pool size, we let $c_j = c$ for all $j$ in the asymptotic properties, and to set the baseline performance, we first discuss the asymptotics of the reference estimator $\hat{\beta}(u)$.

Corollary 1:

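Under Conditions C1–C5 in Web Appendix B, if $h \to 0$ and $Jh \to \infty$ as $J \to \infty$, we have $\hat{\beta}(u) = B_I(u) + O_p\{V_I(u)\}$, where

$$B_I(u) = \mathrm{bias}\{\hat{\beta}(u) \mid \mathcal{D}_I\} = E\{\hat{\beta}(u) - \beta(u) \mid \mathcal{D}_I\} = \frac{\kappa_2}{2} \beta''(u) h^2 \{1 + o_p(1)\},$$
$$V_I(u) = \mathrm{var}\{\hat{\beta}(u) \mid \mathcal{D}_I\} = \frac{1}{Jh} \frac{\nu_0}{f_U(u)} \Gamma_{xw}^{-1}(u) \Sigma_{xw}(u) \Gamma_{xw}^{-1}(u) \{1 + o_p(1)\}.$$

Herein, $\mathcal{D}_I = \{(U_\ell, X_\ell, W_\ell) : \ell = 1, \ldots, J\}$, $\kappa_l = \int u^l K(u)\,du$, and $\nu_l = \int u^l K^2(u)\,du$ for $l = 0, 1, 2$, $\beta''(u) = \{\beta_0''(u), \ldots, \beta_p''(u)\}^\top$ collects all the second-order derivatives, $f_U(\cdot)$ is the density of $U$, $\Gamma_{xw}(u) = E(W X X^\top \mid U = u)$, and $\Sigma_{xw}(u) = E\{W^2 \sigma^2(U, X, W) X X^\top \mid U = u\}$.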

Because $\hat{\beta}(u)$ is a special case of $\tilde{\beta}(u)$ and $\bar{\beta}(u)$, the results in this corollary can be deduced from Theorems 1 and 2 below by taking $c_j = c = 1$. We include proofs of this and other theorems in Web Appendix C. It is clear that the conditional mean squared error of $\hat{\beta}_k(u)$, for $k = 0, \ldots, p$, is $O_p\{h^4 + (Jh)^{-1}\}$. Taking $h = O_p(J^{-1/5})$, we see that the convergence rate of $\hat{\beta}(u)$ is $O_p(J^{-2/5})$, which is the same as the optimal convergence rate in nonparametric regression (Fan and Gijbels, 1996). This result also coincides with the findings in Fan and Zhang (1999) and Eubank et al. (2004).
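The bandwidth order $J^{-1/5}$ comes from balancing the squared bias, of order $h^4$, against the variance, of order $(Jh)^{-1}$. The small sketch below illustrates this trade-off numerically. The constants `bias_const` and `var_const` are hypothetical placeholders for the unknown constants multiplying the two terms, and the function names are ours.

```python
def mse(h, J, bias_const=1.0, var_const=1.0):
    """Stylized mean squared error: bias_const * h^4 + var_const / (J h)."""
    return bias_const * h**4 + var_const / (J * h)

def optimal_bandwidth(J, bias_const=1.0, var_const=1.0):
    """Minimizer of mse(h, J) over h > 0, obtained by setting d/dh to zero:
    4 * bias_const * h^3 = var_const / (J h^2), so h is proportional to J^(-1/5)."""
    return (var_const / (4.0 * bias_const * J)) ** 0.2

h_small = optimal_bandwidth(100)
h_large = optimal_bandwidth(3200)      # 32 times more assays
ratio_h = h_large / h_small            # 32^(-1/5) = 1/2
ratio_mse = mse(h_large, 3200) / mse(h_small, 100)   # 32^(-4/5) = 1/16
```

Multiplying $J$ by 32 halves the optimal bandwidth and divides the stylized mean squared error by 16, so the root mean squared error shrinks at the rate $J^{-2/5}$.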

Theorem 1:

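Under Conditions C1–C6 in Web Appendix B, if $h \to 0$ and $Jh \to \infty$ as $J \to \infty$, we have

$$\tilde{\beta}(u) = A_R(u) + B_R(u) + O_p\{V_R(u)\},$$

where $A_R(u) = O_p\{(c - 1)/N\}$ quantifies the impact of approximating $\mu^*$ by $\tilde{\mu}^*$, and $B_R(u)$ and $V_R(u)$ are the conditional bias and variance of $\tilde{\beta}(u)$ given $\mathcal{D}_R$ when $\mu^*$ is known in advance, expressed by

$$B_R(u) = B_I(u) + (1 - c^{-1})^{1/2} B_R^*(u) \quad \text{and} \quad V_R(u) = c^{-1} V_I(u) + (1 - c^{-1}) V_R^*(u),$$

respectively, in which

$$B_R^*(u) = O_p\left[\left\{\frac{\sigma_m^2}{f_U(u)} \frac{\nu_0}{Jh}\right\}^{1/2}\right], \quad V_R^*(u) = \frac{1}{Jh} \frac{\nu_0 \sigma_e^2}{f_U(u)} \Gamma_{xw}^{-1}(u) \Gamma_x(u) \Gamma_{xw}^{-1}(u) \{1 + o_p(1)\},$$
$$\sigma_e^2 = E\{W^2 \sigma^2(U, X, W)\}, \quad \sigma_m^2 = \mathrm{var}\{W X^\top \beta(U)\}, \quad \text{and} \quad \Gamma_x(u) = E(X X^\top \mid U = u).$$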

Obviously, when $c = 1$, $A_R(u) = 0$, and $B_R(u)$ and $V_R(u)$ reduce to $B_I(u)$ and $V_I(u)$, respectively. This is expected because $\tilde{\beta}(u) = \hat{\beta}(u)$ when $c = 1$. When $c > 1$, provided that $Jh^4 \to \infty$ as $J \to \infty$, $A_R^2(u) = o_p[\max\{B_R^2(u), V_R(u)\}]$; i.e., using $\tilde{\mu}^*$ has a negligible impact on the estimation of $\beta(u)$. In addition, the conditional mean squared error of $\tilde{\beta}_k(u)$ is $O_p\{h^4 + (Jh)^{-1}\}$, for $k = 0, 1, \ldots, p$. Thus, $\tilde{\beta}(u)$ also achieves the convergence rate $O_p(J^{-2/5})$ when $h = O_p(J^{-1/5})$. However, random pooling has notable effects.

Compared with $\hat{\beta}(u)$, random pooling brings a new variance term $V_R^*(u)$, and the conditional variance $V_R(u)$ of $\tilde{\beta}(u)$ is a weighted average of $V_I(u)$ and $V_R^*(u)$. Depending on the relation between $V_I(u)$ and $V_R^*(u)$, pooling could influence $V_R(u)$ differently. If $V_I(u)$ and $V_R^*(u)$ are equal (e.g., when $W_{ij} = 1$ and $\sigma^2(U_{ij}, X_{ij}, W_{ij}) = \sigma^2$), then $V_R(u) = V_I(u)$; i.e., pooling does not affect the conditional variance. If $V_I(u) - V_R^*(u)$ is positive (or negative) definite, using a larger pool size could reduce (or increase) the conditional variance. Regarding the conditional bias, however, random pooling can only deteriorate the performance, because it brings an extra non-negligible bias term $(1 - c^{-1})^{1/2} B_R^*(u)$ in which the multiplier $(1 - c^{-1})^{1/2}$ gradually increases with $c$.
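The dependence of the conditional variance and of the extra bias multiplier on the pool size can be tabulated directly from the expressions in Theorem 1. The sketch below treats the variances as scalars for simplicity; the numerical values are hypothetical and the function names are ours.

```python
import math

def random_pooling_variance(c, v_individual, v_star):
    """V_R = c^{-1} V_I + (1 - c^{-1}) V_R^*, with scalar stand-ins."""
    return v_individual / c + (1.0 - 1.0 / c) * v_star

def extra_bias_multiplier(c):
    """Multiplier (1 - c^{-1})^{1/2} applied to B_R^*."""
    return math.sqrt(1.0 - 1.0 / c)

pool_sizes = [1, 2, 4, 8]
# Hypothetical scalar values: V_I larger than V_R^*, so pooling reduces variance.
decreasing = [random_pooling_variance(c, 2.0, 1.0) for c in pool_sizes]
# Hypothetical scalar values: V_I smaller than V_R^*, so pooling inflates variance.
increasing = [random_pooling_variance(c, 1.0, 2.0) for c in pool_sizes]
multipliers = [extra_bias_multiplier(c) for c in pool_sizes]
```

In both cases the variance moves monotonically from the individual-level value toward the pooled limit as $c$ grows, while the bias multiplier rises from 0 toward 1.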

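To present the asymptotic properties of $\bar{\beta}(u)$, we let $\{(U_l, X_l, W_l) : l = 1, \ldots, c\}$ be iid copies of $(U, X, W)$ and denote

$$\bar{\Gamma}_{xw}(u) = E\left\{ \frac{\left(\sum_{l=1}^{c} W_l X_l\right)\left(\sum_{l=1}^{c} W_l X_l\right)^\top}{\sum_{l=1}^{c} W_l} \,\middle|\, U_1 = u, \ldots, U_c = u \right\},$$
$$\bar{\Sigma}_{xw}(u) = E\left\{ \frac{\left(\sum_{l=1}^{c} W_l X_l\right)\left\{\sum_{l=1}^{c} W_l^2 \sigma^2(U_l, X_l, W_l)\right\}\left(\sum_{l=1}^{c} W_l X_l\right)^\top}{\left(\sum_{l=1}^{c} W_l\right)^2} \,\middle|\, U_1 = u, \ldots, U_c = u \right\}.$$

These two terms characterize the conditional variance of $\bar{\beta}(u)$. It is important to note that when $c = 1$, $\bar{\Gamma}_{xw}(u)$ and $\bar{\Sigma}_{xw}(u)$ reduce to $\Gamma_{xw}(u)$ and $\Sigma_{xw}(u)$, respectively. In addition, we would like to understand the influence of the homogeneity of pools on our estimator. We collect pools that contain at least one member in a neighborhood of $u$ into $\mathcal{J}(u|h) = \{j : \min_i |U_{[ij]} - u| < h\}$ and let $K$ be supported on $(-1, 1)$; e.g., the Epanechnikov kernel. Then $K_h(U_{[ij]} - u) = 0$ for all $j \notin \mathcal{J}(u|h)$; i.e., by (4), only the $Z_j$'s labeled by $j \in \mathcal{J}(u|h)$ contribute to our estimator $\bar{\beta}(u)$. Finally, we define a homogeneity parameter for $\bar{\beta}(u)$ by

$$\omega_N(u|h) = \max_{j \in \mathcal{J}(u|h)} \left\{ \max_{i_1, i_2} |U_{[i_1 j]} - U_{[i_2 j]}| \right\}.$$

This parameter measures the closeness of the $U_{[ij]}$'s in pools that contribute to $\bar{\beta}(u)$ and, consequently, reflects the accuracy of the approximation in (3).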
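The homogeneity parameter is straightforward to compute from the pooled design. The sketch below follows the definition literally: it keeps the pools with at least one member within $h$ of $u$ and returns the largest within-pool range of $U$ among them. The function name, the input layout (a list of per-pool $U$-values), and the convention of returning 0 when no pool contributes are assumptions of this sketch.

```python
def homogeneity_parameter(u, h, pools_u):
    """omega_N(u | h) for a list of pools, where pools_u[j] holds the
    U-values of the members of pool j."""
    spans = [
        max(us) - min(us)                       # U-span of a contributing pool
        for us in pools_u
        if min(abs(v - u) for v in us) < h      # pool has a member within h of u
    ]
    # Convention of this sketch: 0.0 when no pool contributes at u.
    return max(spans) if spans else 0.0

# Hypothetical pools: two tight pools and one loose pool.
pools_u = [[0.0, 0.1], [0.5, 0.9], [2.0, 2.05]]
omega_near_zero = homogeneity_parameter(0.05, 0.2, pools_u)   # only pool 0 contributes
omega_mid = homogeneity_parameter(0.6, 0.2, pools_u)          # only pool 1 contributes
```

The loose pool inflates the parameter only at values of $u$ where that pool actually enters the kernel window.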

Theorem 2:

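Under Conditions C1–C5 and C7 in Web Appendix B, if $h \to 0$, $Jh^2 \to \infty$, and $\omega_N(u|h) = o_p(h)$ as $J \to \infty$, then $\bar{\beta}(u) = B_H(u) + O_p\{V_H(u)\}$, where $B_H(u)$ and $V_H(u)$ are the conditional bias and variance of $\bar{\beta}(u)$ given $\mathcal{D}_H$, respectively, expressed by

$$B_H(u) = B_I(u) + (c - 1)\, O_p\{\omega_N(u|h)\} \quad \text{and} \quad V_H(u) = V_H^*(u) + (c - 1)\, O_p\left\{\frac{\omega_N(u|h)}{Jh^2}\right\},$$

in which

$$V_H^*(u) = \frac{1}{Jh} \frac{\nu_0}{f_U(u)} \bar{\Gamma}_{xw}^{-1}(u) \bar{\Sigma}_{xw}(u) \bar{\Gamma}_{xw}^{-1}(u) \{1 + o_p(1)\}.$$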

The estimator $\bar{\beta}(u)$ reduces to $\hat{\beta}(u)$ when $c = 1$. This is reflected in Theorem 2, because the two terms brought by the approximation (3), $(c - 1) O_p\{\omega_N(u|h)\}$ and $(c - 1) O_p\{\omega_N(u|h)/(Jh^2)\}$, vanish when $c = 1$, and $\bar{\Gamma}_{xw}(u)$ and $\bar{\Sigma}_{xw}(u)$ reduce to $\Gamma_{xw}(u)$ and $\Sigma_{xw}(u)$, respectively. When $c > 1$, the optimal convergence rate of $\bar{\beta}_k(u)$, for $k = 0, 1, \ldots, p$, depends on the homogeneity parameter $\omega_N(u|h)$; i.e., on the accuracy of the approximation (3). The expressions of $B_H(u)$ and $V_H(u)$ imply that the conditional mean squared error of $\bar{\beta}_k(u)$ is $O_p\{h^4 + \omega_N^2(u|h) + (Jh)^{-1} + (Jh^2)^{-1} \omega_N(u|h)\}$. To facilitate a clear discussion, we consider that $\omega_N(u|h)/h^a \to A$ in probability for some $a > 0$ and $A > 0$ as $J \to \infty$. Then the optimal convergence rate of $\bar{\beta}_k(u)$ is $O_p(J^{-2/5})$ if $a \geq 2$ and $h = O_p(J^{-1/5})$; $O_p(J^{-a/(2a+1)})$ if $1 \leq a < 2$ and $h = O_p(J^{-1/(2a+1)})$; and $O_p(J^{-a/(a+2)})$ if $0 < a < 1$ and $h = O_p(J^{-1/(a+2)})$. Therefore, if $a < 2$, the approximation in (3) is crude and slows down the convergence rate of the estimator, while if $a \geq 2$ and $h = O_p(J^{-1/5})$, $\bar{\beta}_k(u)$ achieves the convergence rate $O_p(J^{-2/5})$, the same as $\hat{\beta}_k(u)$ and $\tilde{\beta}_k(u)$.
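The three regimes can be collected into one small helper that maps the homogeneity exponent $a$ to the bandwidth and rate exponents stated above. The function name and the return convention (positive exponents $b$ and $r$ such that $h = O_p(J^{-b})$ and the rate is $O_p(J^{-r})$) are choices of this sketch.

```python
def homogeneous_rate(a):
    """Return (b, r) with h = O_p(J^{-b}) and convergence rate O_p(J^{-r}),
    assuming omega_N(u | h) / h^a converges to a positive constant."""
    if a <= 0:
        raise ValueError("a must be positive")
    if a >= 2:
        return 1.0 / 5.0, 2.0 / 5.0               # same as individual-level data
    if a >= 1:
        return 1.0 / (2 * a + 1), a / (2 * a + 1)
    return 1.0 / (a + 2), a / (a + 2)

rate_table = {a: homogeneous_rate(a) for a in (0.5, 1.0, 1.5, 2.0, 3.0)}
```

The rate exponent is continuous across the regime boundaries: both neighboring formulas give $1/3$ at $a = 1$ and $2/5$ at $a = 2$.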

Compared with $\tilde{\beta}(u)$, $\bar{\beta}(u)$ tells a different story about the role of the pool size. The variance $V_H(u)$ has the same rate as $V_I(u)$. The leading term of $V_H(u)$ is $V_H^*(u)$ provided $\omega_N(u|h) = o_p(h)$. Though the form of $V_H^*(u)$ does not lend itself to comparison in general, we can gain insight from certain special cases. For example, if we take $W = 1$ and $\sigma^2(U, X, W) = \sigma^2$, then $\{\bar{\Gamma}_{xw}^{-1}(u) \bar{\Sigma}_{xw}(u) \bar{\Gamma}_{xw}^{-1}(u)\}^{-1} - \{\Gamma_{xw}^{-1}(u) \Sigma_{xw}(u) \Gamma_{xw}^{-1}(u)\}^{-1} = (c - 1) E(X \mid U = u) E(X \mid U = u)^\top$, which is non-negative definite; i.e., homogeneous pooling could even yield a more efficient estimator than individual biomonitoring (within the same financial constraints; i.e., with $J$ fixed). Regarding the bias, homogeneous pooling brings $(c - 1) O_p\{\omega_N(u|h)\}$, which likely increases as $c$ increases, not only because $c - 1$ increases, but also because $\omega_N(u|h)$ increases as more individuals are pooled together. However, its magnitude is controlled by the homogeneity parameter $\omega_N(u|h)$. If $h^2 = o_p\{\omega_N(u|h)\}$, then $\bar{\beta}(u)$ has a larger asymptotic bias than $\hat{\beta}(u)$. In contrast, if $\omega_N(u|h) = o_p(h^2)$, then $\bar{\beta}(u)$ does not induce extra asymptotic bias at all, completely different from $\tilde{\beta}(u)$.

The above discussion also provides guidance on the design of homogeneous pools. Provided $h = O_p(J^{-1/5})$, we would like to have $\omega_N(u|h) = o_p(h^2) = o_p(J^{-2/5})$. This order could be achieved via three possible designs. The first one is to sort participants based on the $U$-variable and then decide the pool membership starting from the smallest to the largest $U$-value; i.e., we set $U_{[ij]}$ to be the $(cj - c + i)$th order statistic of the $U_{[ij]}$'s. Then $\omega_N(u|h) = O_p(J^{-1/2}) = o_p(J^{-2/5})$ holds under mild conditions (see Delaigle et al., 2012). The second design is to first create large demographic subgroups based on some $X$-covariates (e.g., gender and race) such that in each subgroup participants have the same or similar values of those $X$-covariates. Then within each subgroup, we sort the $U$-values to create homogeneous pools. Doing this not only yields $\omega_N(u|h) = o_p(J^{-2/5})$, but also allows us to take advantage of $X$-homogeneity (see Section 5 for numerical evidence). The last design is inspired by the observation that $\omega_N(u|h) \leq M = \max_j \{\max_{i_1, i_2} |U_{[i_1 j]} - U_{[i_2 j]}|\}$, where we call $\max_{i_1, i_2} |U_{[i_1 j]} - U_{[i_2 j]}|$ the $U$-span of the $j$th pool and $M$ the maximum $U$-span. Therefore, we can create homogeneous pools such that $M \leq b J^{-2/5} / \log J$ for some constant $b$. In this way, for any $u$ and $h$, $\omega_N(u|h) \leq M = o_p(J^{-2/5})$ (see more discussion in Web Appendix D).
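The first design and the maximum $U$-span used by the third design can be sketched in a few lines. The helper names `sorted_pools` and `max_u_span` and the toy $U$-values are hypothetical; when $N$ is not a multiple of $c$, this sketch simply leaves the last pool smaller.

```python
def sorted_pools(u_values, c):
    """Design 1: sort individuals by U and fill pools sequentially, so that
    pool j holds the order statistics (j-1)c + 1, ..., jc."""
    order = sorted(range(len(u_values)), key=lambda i: u_values[i])
    return [order[k:k + c] for k in range(0, len(order), c)]

def max_u_span(u_values, pools):
    """Maximum U-span M: the largest within-pool range of U over all pools."""
    return max(
        max(u_values[i] for i in members) - min(u_values[i] for i in members)
        for members in pools
    )

u_values = [0.9, 0.1, 0.5, 0.2, 0.8, 0.4]      # hypothetical U-values
homogeneous = sorted_pools(u_values, c=2)
arbitrary = [[0, 1], [2, 3], [4, 5]]           # pools formed in arrival order
M_homogeneous = max_u_span(u_values, homogeneous)
M_arbitrary = max_u_span(u_values, arbitrary)
```

Because $M$ bounds $\omega_N(u|h)$ for every $u$ and $h$, checking $M$ once against a target such as $bJ^{-2/5}/\log J$ is enough to certify a whole pooled design.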

5. Numerical study

We now conduct a simulation study to assess the performance of our methodology. The purpose is threefold: 1) to assess the estimation when the varying coefficients follow various nonlinear patterns; 2) to evaluate the impact of pooling ($c > 1$) compared with individual ($c = 1$) biomonitoring; and 3) to compare the two estimators $\tilde{\beta}(u)$ and $\bar{\beta}(u)$ under the same $c$ and $J$.

5.1. Data generation

We consider two population models of the form

$$Y = \beta_0(U) + X_1 \beta_1(U) + X_2 \beta_2(U) + X_3 \beta_3(U) + \epsilon. \tag{6}$$

In both models, we first generate $(T_1, T_2)$ from a normal copula with correlation parameter $\rho = 0.26$, then set $U = \psi(T_1)$ and $X_1 = \Phi^{-1}(T_2)$, where $\psi(t) = 4t - 2$ and $\Phi^{-1}$ is the quantile function of $N(0, 1)$. This setting yields $U \sim \text{Uniform}(-2, 2)$ and $X_1 \sim N(0, 1)$, emulating the (centered and scaled) age and the (standardized) logarithm of the BMI in the NHANES data (see Section 6), respectively. The copula with $\rho = 0.26$ aims to reproduce the correlation between the two predictors in the NHANES data. The covariates $X_2 \sim \text{Bernoulli}(0.52)$ and $X_3 \sim \text{Bernoulli}(0.36)$ emulate gender and race, respectively. The weight $W \sim \text{Uniform}(1, 10)$ simulates the sampling weight of each individual, and $\epsilon \sim N(0, W^{-1})$ accounts for the sampling weights. Our consideration of $\{\beta_k(u) : k = 0, \ldots, 3\}$ covers a broad range of nonlinear patterns. In the first model (M1),

$$\beta_0(u) = u^3/8, \quad \beta_1(u) = \sin(\pi u/2), \quad \beta_2(u) = 0.25u(1 - u), \quad \beta_3(u) = \exp(-u^2),$$

and in the second model (M2), letting I(·) be the indicator function,

β0(u)=u2/2,β1(u)=exp[{I(u>0)1.44I(u<0)/1.44}u2/4],
β2(u)=[exp{(U+1)2/0.72}+0.7exp{(U1)2/0.98}],
β3(u)=0.4sin{π(u0.5)/2.5}+0.8exp{u+(u+0.5)2I(u>0.5)}/6.

We consider $J \in \{250, 500\}$. For each $J$ and each model, we first randomly sample the individual biomonitoring data $S_I = \{(Y_\ell, U_\ell, X_\ell, W_\ell) : \ell = 1, \ldots, J\}$ from (6) and calculate the estimate $\hat{\beta}(u)$ based on $S_I$. The estimate $\hat{\beta}(u)$ serves as a reference for our comparison of the two pooling schemes. To understand the pooling effect, we let $c \in \{2, 4, 6, 8, 10\}$. As $c$ increases, we add new samples from (6) to produce $S_c = \{(Y_\ell, U_\ell, X_\ell, W_\ell) : \ell = 1, \ldots, cJ\}$ such that the nesting $S_I \subset S_2 \subset \cdots \subset S_{10}$ is enforced. This nesting helps us obtain a fair comparison of the estimation when $c$ varies. In addition, to fairly compare the two pooling schemes at each $c$, we generate both types of pooled data using the same $S_c$.
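As an illustration of the covariate and error generation described in this section, the sketch below draws one individual using only the Python standard library. The normal copula is implemented by correlating two standard normal variables and mapping the first through the normal distribution function. The function name `draw_individual` and the coefficient functions in `demo_beta` are hypothetical placeholders chosen for illustration; they are not the models (M1) or (M2).

```python
import math
import random
from statistics import NormalDist

def draw_individual(rng, beta, rho=0.26):
    """Draw (Y, U, X, W) from model (6) for coefficient functions `beta`."""
    g1 = rng.gauss(0.0, 1.0)
    g2 = rho * g1 + math.sqrt(1.0 - rho**2) * rng.gauss(0.0, 1.0)
    t1 = NormalDist().cdf(g1)            # T1 ~ Uniform(0, 1) under the copula
    u = 4.0 * t1 - 2.0                   # U = psi(T1) ~ Uniform(-2, 2)
    x1 = g2                              # X1 = Phi^{-1}(T2) with T2 = Phi(g2)
    x2 = 1.0 if rng.random() < 0.52 else 0.0
    x3 = 1.0 if rng.random() < 0.36 else 0.0
    w = rng.uniform(1.0, 10.0)           # sampling weight
    eps = rng.gauss(0.0, math.sqrt(1.0 / w))   # var(eps) = 1 / W
    x = (1.0, x1, x2, x3)
    y = sum(xk * bk(u) for xk, bk in zip(x, beta)) + eps
    return y, u, x, w

# Placeholder coefficient functions, for illustration only.
demo_beta = (lambda u: 0.5 * u, math.sin, lambda u: 1.0, lambda u: u**2 / 4.0)
rng = random.Random(1)
sample = [draw_individual(rng, demo_beta) for _ in range(2000)]
```

Drawing $cJ$ such individuals once and reusing the first $J$ of them as the individual-level sample gives the nested structure described above.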

To generate randomly pooled data using $S_c$, we randomly assign the observations in $S_c$ into $J$ pools of size $c$. The pool membership is recorded by relabeling these $cJ$ individuals by subscript $ij$ as in $\{(Y_{ij}, U_{ij}, X_{ij}, W_{ij}) : i = 1, \ldots, c,\ j = 1, \ldots, J\}$. We then calculate $Z_j = \sum_{i=1}^{c} W_{ij} Y_{ij} / \sum_{i=1}^{c} W_{ij}$ for $j = 1, \ldots, J$. The randomly pooled data are $\{(Z_j, U_{ij}, X_{ij}, W_{ij}) : i = 1, \ldots, c,\ j = 1, \ldots, J\}$, based on which we compute $\tilde{\beta}(u)$.

We consider two different designs to generate homogeneously pooled data. The first design, denoted by (D1), is based solely on the $U$-covariate. Using $S_c$, it proceeds in almost the same way as the generation of randomly pooled data; the only difference lies in the assignment of the observations into pools. We first sort the $U_\ell$'s into increasing order $U_{(1)} \leq \cdots \leq U_{(cJ)}$, then assign observations into pools sequentially according to this order. After the assignment, we have $S_c$ as $\{(Y_{ij}, U_{[ij]}, X_{ij}, W_{ij}) : i = 1, \ldots, c,\ j = 1, \ldots, J\}$, where $U_{[ij]} = U_{(cj - c + i)}$. We calculate $Z_j = \sum_{i=1}^{c} W_{ij} Y_{ij} / \sum_{i=1}^{c} W_{ij}$ and compute $\bar{\beta}(u)$ based on $\{(Z_j, U_{[ij]}, X_{ij}, W_{ij}) : i = 1, \ldots, c,\ j = 1, \ldots, J\}$.

The second design (D2) uses both $X$ and $U$. Note that $X_2$ and $X_3$ are binary. We first partition $S_c$ into 4 subgroups, each with a distinct value of $(X_2, X_3)$. Then within each subgroup, we apply (D1) to determine the pool membership and generate the $Z_j$'s; i.e., we sort individuals in each subgroup into increasing order with respect to $U$ and sequentially assign individual observations to pools of size $c$. Note that the number of individuals in a subgroup may not be a multiple of $c$. We simply mix the remainders together to create a pool such that the total number of pools is still $J$. Though this pool may have a larger $U$-span (i.e., be less homogeneous) than others, we found it had little impact on the performance of $\bar{\beta}(u)$.
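The pool assignment under (D2) can be sketched as follows. The function name `design_d2_pools` is hypothetical, and two details are assumptions of this sketch rather than statements of the text: leftover individuals from all subgroups are sorted by $U$ before being grouped, and they are split into as many pools of size $c$ as needed.

```python
def design_d2_pools(u, x2, x3, c):
    """Return pools (lists of indices) of size c: homogeneous in (X2, X3) and
    sorted by U within each subgroup, with leftovers mixed at the end."""
    groups = {}
    for i, key in enumerate(zip(x2, x3)):
        groups.setdefault(key, []).append(i)
    pools, leftovers = [], []
    for key in sorted(groups):
        members = sorted(groups[key], key=lambda i: u[i])   # (D1) within subgroup
        n_full = len(members) // c
        pools.extend(members[k * c:(k + 1) * c] for k in range(n_full))
        leftovers.extend(members[n_full * c:])              # remainder of subgroup
    leftovers.sort(key=lambda i: u[i])                      # assumption: sort by U
    pools.extend(leftovers[k:k + c] for k in range(0, len(leftovers), c))
    return pools

# Hypothetical data: subgroups of sizes 3, 2, and 3, pooled in pairs.
u = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
x2 = [0, 0, 0, 1, 1, 1, 0, 0]
x3 = [0, 0, 0, 0, 0, 0, 1, 1]
d2_pools = design_d2_pools(u, x2, x3, c=2)
```

In this toy example two subgroups each leave one individual over, and the two leftovers form the single mixed pool at the end of the list.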

For each $J$ and under each model, we repeat the above data generation and the estimation of $\beta(u)$ 500 times. When computing our estimates, we used the Epanechnikov kernel. Our selection of bandwidth followed the leave-one-pool-out cross-validation procedure presented in Web Appendix E, wherein we also provide bootstrap methods to construct $100(1 - \alpha)\%$ confidence intervals on $\beta_k(u)$ for each $u$. In addition to the case U ~ Uniform(0, 1), we have repeated the simulation for $U \sim N(0, 1)$. Lastly, the above setting emulates a fixed-budget setting; i.e., $J$ is fixed while $c$ varies. We also repeated the simulation study for the case where $N$ is fixed and $c$ varies (see data generation and supplemental results in Web Appendices F and G, respectively).

5.2. Results

Figures 1–3 compare the estimates under (M1) when c ∈ {1, 2, 4, 8}, J = 500 and U ~ Uniform(0, 1) under random pooling, homogeneous pooling (D1) and (D2), respectively. In each sub-figure, we plot the true function (dashed), the pointwise sample mean (solid), and the pointwise mean ± 1.96 × standard deviation (dotted) of the 500 estimates. The gap between the true function and the pointwise sample mean reveals the bias of the corresponding estimator at each u, while the gap between the two dotted lines reflects the pointwise variance of the estimator.

Figure 1.

Simulation results for (M1) when J = 500 and U ~ Uniform(0, 1). Columns from left to right correspond to the estimate when $c = 1$ ($\hat{\beta}(u)$) and $c = 2, 4, 8$ ($\tilde{\beta}(u)$) under random pooling, respectively. In each sub-figure, the dashed line plots the true varying-coefficient function, and the solid line connects the pointwise sample mean of the 500 function estimates. The top and bottom dotted lines are the pointwise mean ± 1.96 × standard deviation of the 500 estimates, respectively. This figure appears in color in the electronic version of this article, and any mention of color refers to that version.

Figure 3.

Simulation results for (M1) when J = 500 and U ~ Uniform(0, 1). Columns from left to right correspond to the estimate when $c = 1$ ($\hat{\beta}(u)$) and $c = 2, 4, 8$ ($\bar{\beta}(u)$) under homogeneous pooling (D2), respectively. In each sub-figure, the dashed line plots the true varying-coefficient function, and the solid line connects the pointwise sample mean of the 500 function estimates. The top and bottom dotted lines are the pointwise mean ± 1.96 × standard deviation of the 500 estimates, respectively. This figure appears in color in the electronic version of this article, and any mention of color refers to that version.

In general, our estimation captures the nonlinear shape of the varying coefficient well in each case. Some observable bias (the gap between the dashed and the solid lines) is present in all the estimates of $\beta_1(u)$ when $u$ is around −1 or 1 and of $\beta_3(u)$ when $u$ is near 0. As $c$ increases, the changes in the bias are too small to be seen, but the bias induced by random pooling appears to be the largest when compared with the two types of homogeneous pooling. Regarding the variance (proportional to the gap between the two dotted lines), all the estimates are affected by the boundary effect, which is expected in nonparametric kernel regression. As $c$ increases, the patterns from different pooling schemes are evidently different. For $\tilde{\beta}(u)$, the variance increases as $c$ increases; for $\bar{\beta}(u)$ under (D1), the variance stays nearly the same; while for $\bar{\beta}(u)$ under (D2), the variance decreases. For Model (M2) and/or for $U \sim N(0, 1)$, the patterns are similar and are presented in Web Figures 1–9.

Similar results for fixed-$N$ cases are presented in Web Figures 10–21. From those figures, we can also see that the bias does not change much as $c$ increases, but the variance under all pooling schemes becomes larger. This is expected because when $N$ is fixed, using a larger $c$ aggregates more data into each pool. Though this saves more cost, it makes the estimation more difficult. When comparing the three pooling schemes, the variance of $\tilde{\beta}(u)$ under random pooling increases the most in $c$, while that of $\bar{\beta}(u)$ under (D2) increases the least.

Figure 4 and Web Figures 2224 visualize a comprehensive comparison of the estimation over the settings considered. These comparison reveal the role of the pool size and clearly illustrate the difference between β˜(u) and β¯(u). We calculated the averaged integrated squared error (AISE) of each estimate; e.g., the AISE of β^(u) is

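$$\mathrm{AISE}\{\hat{\beta}(u)\} = \frac{1}{p+1}\sum_{k=0}^{p}\left[\frac{1}{M}\sum_{m=1}^{M}\left\{\hat{\beta}_k(u_m) - \beta_k(u_m)\right\}^2\right],$$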

where $\{u_m\}_{m=1}^{M}$ is a dense grid on [−2, 2]. The AISE combines both bias and variance to summarize the overall precision of an estimator.
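As a minimal sketch of this criterion, the AISE can be computed by averaging the squared error first over the grid points and then over the p + 1 coefficient functions. The function name `aise` and the toy curves below are hypothetical; they are used only to show the mechanics.

```python
import numpy as np

def aise(beta_hat, beta_true):
    """Averaged integrated squared error on a grid.

    beta_hat, beta_true: (p + 1, M) arrays; row k holds the k-th coefficient
    function evaluated at the M grid points u_1, ..., u_M.
    """
    beta_hat = np.asarray(beta_hat, dtype=float)
    beta_true = np.asarray(beta_true, dtype=float)
    if beta_hat.shape != beta_true.shape:
        raise ValueError("estimate and truth must share the same (p + 1, M) shape")
    ise_per_coefficient = ((beta_hat - beta_true) ** 2).mean(axis=1)  # inner average over m
    return ise_per_coefficient.mean()                                 # outer average over k

# Toy check: an estimate that is off by a constant 0.1 everywhere has AISE 0.1 ** 2.
grid = np.linspace(-2, 2, 201)
beta_true = np.stack([np.sin(grid), np.cos(grid)])
beta_hat = beta_true + 0.1
toy_aise = aise(beta_hat, beta_true)
```

Each boxplot in the comparison would then summarize one such value per simulated data set.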

Figure 4.

Boxplots of AISEs for J ∈ {250, 500} and c ∈ {1, 2, 4, 6, 8, 10} under both models when U ~ Uniform(0, 1), where (RP) represents random pooling, and (D1) and (D2) are the two designs for homogeneous pooling. The dotted horizontal line marks the median of the 500 AISEs of the reference estimate β^(u) where c = 1.

We first observe from Figure 4 and Web Figure 22 (fixed-J cases) that, as J increases from 250 to 500, all the boxplots move closer to zero. This reinforces the consistency of all the estimators. Compared with individual biomonitoring, random pooling quickly degrades the precision as c increases, consistent with our observations from Figure 1. When c becomes larger than 4, the boxplots tend to stabilize. This reinforces our findings in Theorem 1; i.e., $B_R(u) = B_I(u) + (1 - c^{-1})^{1/2} B_R^*(u)$ and $V_R(u) = c^{-1} V_I(u) + (1 - c^{-1}) V_R^*(u)$ converge to $B_I(u) + B_R^*(u)$ and $V_R^*(u)$, respectively, as c increases. Turning to the boxplots of the AISEs of β¯(u) under (D1) and (D2), it is encouraging to see that, when c > 1, the majority of the AISEs are below the dotted horizontal line. This indicates that homogeneous pooling could even improve precision when compared with individual biomonitoring. Of course, this improvement is achieved by including more individuals in the study, because J is fixed while c increases. From Web Figures 23–24 (fixed-N cases), we also see that all the AISEs become smaller as N increases from 2500 to 5000, again reinforcing the consistency of the estimators. At each fixed N, the AISEs under all pooling schemes become larger as c increases, which coincides with our observations from Web Figures 10–21. Lastly, when the same c and J are used, the performance of β¯(u) under (D2) is the best, while that of β˜(u) is the worst.
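To see why the boxplots for random pooling level off quickly, it may help to tabulate the c-dependent weights in the Theorem 1 expressions, read here as a factor (1 − 1/c)^{1/2} on the pooling-specific bias term and weights 1/c and 1 − 1/c on the two variance terms. The sketch below only evaluates these weights; the function name `random_pooling_weights` and the dictionary keys are hypothetical, and the terms they multiply are defined in Theorem 1, not here.

```python
import math

def random_pooling_weights(c):
    """Weights that depend on the pool size c in the bias/variance expressions.

    Assumed reading of Theorem 1:
      bias     = B_I + sqrt(1 - 1/c) * B_R_star
      variance = (1/c) * V_I + (1 - 1/c) * V_R_star
    """
    if c < 1:
        raise ValueError("pool size c must be at least 1")
    inv = 1.0 / c
    return {
        "bias_star": math.sqrt(1.0 - inv),  # weight on the pooling-specific bias term
        "var_individual": inv,              # weight on the individual-level variance
        "var_star": 1.0 - inv,              # weight on the pooling-specific variance
    }

for c in (1, 2, 4, 6, 8, 10):
    w = random_pooling_weights(c)
    print(c, round(w["bias_star"], 3), round(w["var_individual"], 3), round(w["var_star"], 3))
```

At c = 1 the pooling-specific terms vanish, and by c = 4 three quarters of the variance weight already sits on the limiting term, which is in line with the stabilization seen for c larger than 4.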

6. NHANES BFR data analysis

Since 2005, NHANES has been using the weighted pooling-sample design to assess the U.S. population’s exposure to environmental chemicals. These chemicals include 61 polychlorinated and 13 polybrominated compounds. The primary reason NHANES uses pooling is to save money. As reported by Caudill (2012), one analytical measurement of these chemicals costs $1400. After switching to pooling, the number of measurements in 2005 was reduced from 2201 to 228, translating into a savings of approximately $2.78 million in one year. Unfortunately, pooled measurements have a much more complicated structure, which often makes statistical analysis challenging. The methodology in this article can be used to analyze data like those collected at NHANES.
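As a back-of-envelope illustration of the scale of these savings, the sketch below multiplies the number of avoided measurements by the per-measurement cost. It assumes that every avoided measurement saves the full $1400 and ignores any cost of forming pools; the variable names are hypothetical. Under these simplifying assumptions the result is roughly $2.76 million, the same order as the approximately $2.78 million reported above, with the small gap presumably reflecting rounding or cost details not given here.

```python
# Figures quoted in the text; the calculation itself is a simplifying assumption.
COST_PER_MEASUREMENT = 1400   # US dollars per analytical measurement
individual_measurements = 2201
pooled_measurements = 228

avoided = individual_measurements - pooled_measurements
savings = avoided * COST_PER_MEASUREMENT
print(f"avoided measurements: {avoided}, approximate savings: ${savings:,}")
```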

For illustration, we apply our methodology to the pooled measurements of PBDE-47 collected by NHANES during 2013–2016 (CDC., 2016). In total, 3854 participants were assigned homogeneously into 509 pools. A majority (443) of the pools are of size 8, and the other 66 pools have sizes ranging from 2 to 7. All the pooled measurements are provided. At the individual level, participants’ covariate information was provided by NHANES. In addition, NHANES has assigned a sample weight to each participant to measure the number of people in the population represented by that participant. After extensive model building, we selected the following model to illustrate our methodology:

$$Y = \beta_0(U) + X_1\beta_1(U) + X_2\beta_2(U) + X_3\beta_3(U) + \epsilon,$$

where Y is the (latent) PBDE-47 concentration, U denotes age (in years), X1 indicates gender (= 1 female; = 0 otherwise), X2 is a race indicator (= 1 non-Hispanic white; = 0 otherwise), and X3 is the standardized logarithm of BMI. Because homogeneous pooling was used, we applied our method developed in Section 3.2 and Web Appendix E to compute the estimate β¯(u) for each u and a 95% confidence interval on βk(u) for all k and u.
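For readers unfamiliar with varying-coefficient fitting, the following sketch shows a standard individual-level (c = 1) local linear fit of a model of this form at a single point u0. It is not the pooled-data estimator β¯(u) of Section 3.2; the Epanechnikov kernel, the bandwidth h = 0.5, the coefficient functions in `true_beta`, and the simulated data are all assumptions made for illustration and are unrelated to the NHANES data.

```python
import numpy as np

def epanechnikov(t):
    """Epanechnikov kernel, supported on [-1, 1]."""
    return np.where(np.abs(t) <= 1.0, 0.75 * (1.0 - t ** 2), 0.0)

def local_linear_vcm(u0, U, X, Y, h):
    """Local linear estimate of beta(u0) in Y = X^T beta(U) + error.

    U: (n,) index variable; X: (n, p + 1) covariates with a leading column of ones;
    Y: (n,) responses; h: bandwidth. Returns an array of length p + 1.
    """
    d = U - u0
    w = epanechnikov(d / h) / h
    if np.count_nonzero(w) < 2 * X.shape[1]:
        raise ValueError("too few observations inside the kernel window")
    # Locally, beta_k(U) is approximated by a_k + b_k * (U - u0).
    Z = np.hstack([X, X * d[:, None]])
    sw = np.sqrt(w)
    theta, *_ = np.linalg.lstsq(Z * sw[:, None], Y * sw, rcond=None)
    return theta[: X.shape[1]]  # the a_k's estimate beta_k(u0)

def true_beta(u):
    """Hypothetical coefficient functions used only to simulate toy data."""
    u = np.asarray(u, dtype=float)
    return np.stack([1.0 + 0.5 * u, -u, 0.2 * np.ones_like(u), 0.3 * u], axis=-1)

rng = np.random.default_rng(0)
n = 2000
U = rng.uniform(-2.0, 2.0, size=n)
X = np.column_stack([
    np.ones(n),                    # intercept
    rng.integers(0, 2, size=n),    # binary covariate (like X1)
    rng.integers(0, 2, size=n),    # binary covariate (like X2)
    rng.standard_normal(n),        # standardized continuous covariate (like X3)
])
Y = (X * true_beta(U)).sum(axis=1) + 0.1 * rng.standard_normal(n)
estimate = local_linear_vcm(0.3, U, X, Y, h=0.5)
```

Repeating the call over a grid of u0 values traces out each estimated coefficient curve. The weighted least-squares problem is solved through `np.linalg.lstsq` on the square-root-weighted design rather than by forming the normal equations, which is a numerical convenience and not a feature of the method.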

The results are plotted in Figure 5. The β¯0(u) is significantly above zero but declines from age 12. This downtrend coincides with our understanding of PBDEs. Research has shown that human exposure could start from fetal development because PBDEs can cross the placental and lactational membranes (Faniband et al., 2014). In addition, children’s highly frequent hand-to-mouth behavior and their closer proximity to the ground could also increase their exposures through dermal contact and indoor air inhalation (Kim et al., 2014). If we look at β¯0(u) backward from age 80 to age 40, it also shows a downtrend. A possible reason is that the adverse health effects of the PBDEs came to attention in the 1970s, and since then, the government started regulating and even phasing out the use of many PBDEs (Stapleton et al., 2012). Elderly people might have already accumulated a higher concentration of this persistent and bioaccumulative chemical before its phase-out.

Figure 5.

NHANES BFR data analysis. We plot β¯(u) in solid lines. The top and bottom dotted lines plot the pointwise 95% confidence interval on βk(u) for each k ∈ {0, 1, 2, 3} via the use of 500 bootstrap estimates. The dashed horizontal line marks the position of zero. This figure appears in color in the electronic version of this article, and any mention of color refers to that version.
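One common way to turn a set of bootstrap curve estimates into a pointwise band is the percentile method, sketched below. The article's own bootstrap procedure is described in Web Appendix E and is not reproduced here, so the percentile rule, the function names, and the simulated curves are all assumptions made for illustration.

```python
import numpy as np

def percentile_band(boot_estimates, level=0.95):
    """Pointwise percentile interval from bootstrap curve estimates.

    boot_estimates: (B, M) array, one row per bootstrap estimate on an M-point grid.
    Returns (lower, upper), each of shape (M,).
    """
    boot = np.asarray(boot_estimates, dtype=float)
    alpha = 1.0 - level
    lower = np.quantile(boot, alpha / 2.0, axis=0)
    upper = np.quantile(boot, 1.0 - alpha / 2.0, axis=0)
    return lower, upper

def excludes_zero(lower, upper):
    """True where the pointwise interval lies entirely above or below zero."""
    return (lower > 0.0) | (upper < 0.0)

# Toy input: 500 hypothetical bootstrap curves on an age grid (not NHANES results).
rng = np.random.default_rng(1)
ages = np.linspace(12, 80, 69)
boot_curves = -0.5 + 0.01 * (ages - 12) + 0.1 * rng.standard_normal((500, ages.size))
lower, upper = percentile_band(boot_curves)
significant = excludes_zero(lower, upper)
```

Reading a coefficient as "significant" at a given age then amounts to checking whether the dashed zero line falls outside the band at that age.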

Regarding gender, we see that β¯1(u) is significantly negative at the beginning. It could be that, at a young age, girls have fewer hand-to-mouth activities than boys. This difference gradually disappears as they become more mature. That β¯2(u) is consistently negative indicates that non-Hispanic whites could have fewer exposures than other races. However, the difference might not be significant, given that the associated confidence band covers almost the entire dashed horizontal line. The β¯3(u) suggests that BMI might have an overall positive impact on the exposure. This coincides with the fact that one main route of human exposure is ingestion and that a higher intake of PBDEs has been found via beverages, sweets, fats, seafood, meat, eggs, and dairy products than via vegetables, fruits, and tubers (Eljarrat and Barceló, 2011).

7. Discussion

In this article, we have proposed a varying-coefficient regression framework for data measured in pooled specimens, a data collection mechanism becoming increasingly popular in human biomonitoring of environmental chemicals. Two types of pooling schemes are considered, and we provided kernel-based estimation of the varying-coefficients for each. Our careful investigation of the theoretical properties reveals some interesting contrasts between the two pooling schemes. We further illustrated these contrasts via simulation studies. Finally, we applied our method to a pooled biomonitoring data set collected at NHANES.

Lastly, we want to discuss two interesting topics along this line of research. First, we note that our comparisons between homogeneous pooling and random pooling via the use of β˜(u) and β¯(u) might be confounded by the fact that the two estimators come from different estimation strategies. A future topic could be to develop a unified approach to estimate β(u) for all pooling schemes. Second, we note that, although there are no missing values in the NHANES data in Section 6, it is important to develop solid statistical tools to deal with missingness in pooled biomonitoring data analysis. Traditional statistical tools for individual-level data often remove subjects with missing values from the regression analysis. This simple strategy works fine when the data are missing completely at random (MCAR). However, in the pooling context, even if the MCAR assumption is reasonable, removing one subject with missing values would force practitioners to discard an entire pool and lead to wasted information. We are currently developing techniques to address this issue.
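The cost of discarding whole pools can be quantified with a short calculation. Assuming, for illustration only, that each subject independently has a missing value with probability p, a pool of size c contains at least one such subject with probability 1 − (1 − p)^c. The function name and the 5% missingness rate below are hypothetical.

```python
def fraction_of_pools_lost(p_missing, c):
    """Probability that a pool of size c contains at least one subject with a
    missing value, assuming independent MCAR missingness with rate p_missing."""
    if not 0.0 <= p_missing <= 1.0:
        raise ValueError("p_missing must lie in [0, 1]")
    if c < 1:
        raise ValueError("pool size c must be at least 1")
    return 1.0 - (1.0 - p_missing) ** c

# Hypothetical 5% missingness rate, evaluated at several pool sizes.
for c in (1, 2, 4, 8):
    print(c, round(fraction_of_pools_lost(0.05, c), 4))
```

Under this assumption, a 5% individual-level missingness rate would lead to roughly one third of the pools of size 8 being discarded, compared with 5% of the observations in an individual-level analysis.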

Figure 2.

Simulation results for (M1) when J = 500 and U ~ Uniform(0, 1). Columns from left to right correspond to the estimate when c = 1 (β^(u)) and c = 2, 4, 8 (β¯(u)) under homogeneous pooling (D1), respectively. In each sub-figure, the dashed line plots the true varying-coefficient function, and the solid line connects the pointwise sample mean of the 500 function estimates. The top and bottom dotted lines are the pointwise mean ± 1.96 × standard deviation of the 500 estimates, respectively. This figure appears in color in the electronic version of this article, and any mention of color refers to that version.

Acknowledgements

This research was supported in part by grant R03 AI135614 from the National Institutes of Health. We thank the Co-Editor, the Associate Editor, and two referees for their constructive comments, which have greatly improved this work.

Footnotes

Supporting Information

Web Appendices and Figures referenced in Sections 3–5, and software in the form of R code that implements our methodology, are available with this paper at the Biometrics website on Wiley Online Library. The R code is also available at https://github.com/Harrindy/VCMforPB.

Data Availability Statement

The data (CDC., 2016) that support the findings in this paper are available at the website of the National Health and Nutrition Examination Survey (https://wwwn.cdc.gov/nchs/nhanes/default.aspx).

References

  1. Caudill SP (2012). Use of pooled samples from the National Health and Nutrition Examination Survey. Statistics in Medicine, 31, 3269–3277.
  2. Caudill SP, Turner WE, and Patterson DG Jr (2007). Geometric mean estimation from pooled samples. Chemosphere, 69, 371–380.
  3. CDC. (2016). National Health and Nutrition Examination Survey Data. Hyattsville, MD: U.S. Department of Health and Human Services, Centers for Disease Control and Prevention. Data repository: https://wwwn.cdc.gov/nchs/nhanes/default.aspx.
  4. Cleveland WS, Grosse E, and Shyu WM (1991). Local regression models. In Statistical models in S, pages 309–376. Wadsworth/Brooks-Cole.
  5. Delaigle A, Hall P, et al. (2012). Nonparametric regression with homogeneous group testing data. The Annals of Statistics, 40, 131–158.
  6. Dorfman R (1943). The detection of defective members of large populations. The Annals of Mathematical Statistics, 14, 436–440.
  7. Eljarrat E and Barceló D (2011). Brominated Flame Retardants. The Handbook of Environmental Chemistry. Springer, Berlin Heidelberg.
  8. Eubank R, Huang C, Maldonado YM, Wang N, Wang S, and Buchanan R (2004). Smoothing spline estimation in varying-coefficient models. Journal of the Royal Statistical Society: Series B, 66, 653–667.
  9. Fan J and Gijbels I (1996). Local polynomial modelling and its applications: monographs on statistics and applied probability. CRC Press.
  10. Fan J and Zhang W (1999). Statistical estimation in varying coefficient models. The Annals of Statistics, 27, 1491–1518.
  11. Faniband M, Lindh CH, and Jönsson BA (2014). Human biological monitoring of suspected endocrine-disrupting compounds. Asian Journal of Andrology, 16, 5.
  12. Gastwirth JL (2000). The efficiency of pooling in the detection of rare mutations. The American Journal of Human Genetics, 67, 1036–1039.
  13. Green PJ and Silverman BW (1993). Nonparametric regression and generalized linear models: a roughness penalty approach. CRC Press.
  14. Hastie T and Tibshirani R (1987). Generalized additive models: some applications. Journal of the American Statistical Association, 82, 371–386.
  15. Hastie T and Tibshirani R (1993). Varying-coefficient models. Journal of the Royal Statistical Society: Series B, 55, 757–779.
  16. Heffernan A, English K, Toms L, Calafat A, Valentin-Blasini L, Hobson P, Broomhall S, Ware R, Jagals P, Sly P, et al. (2016). Cross-sectional biomonitoring study of pesticide exposures in Queensland, Australia, using pooled urine samples. Environmental Science and Pollution Research, 23, 23436–23448.
  17. Hoover DR, Rice JA, Wu CO, and Yang L-P (1998). Nonparametric smoothing estimates of time-varying coefficient models with longitudinal data. Biometrika, 85, 809–822.
  18. Ichimura H (1993). Semiparametric least squares (SLS) and weighted SLS estimation of single-index models. Journal of Econometrics, 58, 71–120.
  19. Kärrman A, Mueller JF, Van Bavel B, Harden F, Toms L-ML, and Lindström G (2006). Levels of 12 perfluorinated chemicals in pooled Australian serum, collected 2002–2003, in relation to age, gender, and region. Environmental Science & Technology, 40, 3742–3748.
  20. Kato K, Calafat AM, Wong L-Y, Wanigatunga AA, Caudill SP, and Needham LL (2009). Polyfluoroalkyl compounds in pooled sera from children participating in the National Health and Nutrition Examination Survey 2001–2002. Environmental Science & Technology, 43, 2641–2647.
  21. Kim YR, Harden FA, Toms L-ML, and Norman RE (2014). Health consequences of exposure to brominated flame retardants: a systematic review. Chemosphere, 106, 1–19.
  22. Lewis JL, Lockary VM, and Kobic S (2012). Cost savings and increased efficiency using a stratified specimen pooling strategy for Chlamydia trachomatis and Neisseria gonorrhoeae. Sexually Transmitted Diseases, 39, 46–48.
  23. Li X, Kuk AY, and Xu J (2014). Empirical Bayes Gaussian likelihood estimation of exposure distributions from pooled samples in human biomonitoring. Statistics in Medicine, 33, 4999–5014.
  24. Li Y, Graubard BI, and Korn EL (2010). Application of nonparametric quantile regression to body mass index percentile curves from survey data. Statistics in Medicine, 29, 558–572.
  25. Lin J and Wang D (2018). Single-index regression for pooled biomarker data. Journal of Nonparametric Statistics, 30, 813–833.
  26. Liu Y, McMahan C, and Gallagher C (2017). A general framework for the regression analysis of pooled biomarker assessments. Statistics in Medicine, 36, 2363–2377.
  27. Ma C-X, Vexler A, Schisterman EF, and Tian L (2011). Cost-efficient designs based on linearly associated biomarkers. Journal of Applied Statistics, 38, 2739–2750.
  28. Malinovsky Y, Albert PS, and Schisterman EF (2012). Pooling designs for outcomes under a Gaussian random effects model. Biometrics, 68, 45–52.
  29. Mitchell EM, Lyles RH, Manatunga AK, Danaher M, Perkins NJ, and Schisterman EF (2014). Regression for skewed biomarker outcomes subject to pooling. Biometrics, 70, 202–211.
  30. Mu J, Wang G, and Wang L (2018). Estimation and inference in spatially varying coefficient models. Environmetrics, 29, e2485.
  31. Remlinger KS, Hughes-Oliver JM, Young SS, and Lam RL (2006). Statistical design of pools using optimal coverage and minimal collision. Technometrics, 48, 133–143.
  32. Stapleton HM, Sharma S, Getzinger G, Ferguson PL, Gabriel M, Webster TF, and Blum A (2012). Novel and high volume use flame retardants in US couches reflective of the 2005 PentaBDE phase out. Environmental Science & Technology, 46, 13432–13439.
  33. Stramer SL, Krysztof DE, Brodsky JP, Fickett TA, Reynolds B, Dodd RY, and Kleinman SH (2013). Comparative analysis of triplex nucleic acid test assays in United States blood donors. Transfusion, 53, 2525–2537.
  34. Tilley SK and Fry RC (2015). Priority environmental contaminants: Understanding their sources of exposure, biological mechanisms, and impacts on health. Systems Biology in Toxicology and Environmental Health, 2015, 117–169.
  35. Vansteelandt S, Goetghebeur E, and Verstraeten T (2000). Regression models for disease prevalence with diagnostic tests on pools of serum samples. Biometrics, 56, 1126–1133.
  36. Wang D, Mou X, Li X, and Huang X (2020). Local polynomial regression for pooled response data. Journal of Nonparametric Statistics, 32, 814–837.
  37. Wang H and Xia Y (2009). Shrinkage estimation of the varying coefficient model. Journal of the American Statistical Association, 104, 747–757.
