SUMMARY
Measuring a biomarker in pooled samples from multiple cases or controls can lead to cost-effective estimation of a covariate-adjusted odds ratio, particularly for expensive assays. But pooled measurements may be affected by assay-related measurement error (ME) and/or pooling-related processing error (PE), which can induce bias if ignored. Building on recently developed methods for a normal biomarker subject to additive errors, we present two related estimators for a right-skewed biomarker subject to multiplicative errors: one based on logistic regression and the other based on a Gamma discriminant function model. Applied to a reproductive health dataset with a right-skewed cytokine measured in pools of size 1 and 2, both methods suggest no association with spontaneous abortion. The fitted models indicate little ME but fairly severe PE, the latter of which is much too large to ignore. Simulations mimicking these data with a non-unity odds ratio confirm validity of the estimators and illustrate how PE can detract from pooling-related gains in statistical efficiency. These methods address a key issue associated with the homogeneous pools study design and should facilitate valid odds ratio estimation at a lower cost in a wide range of scenarios.
Keywords: Biomarkers, Discriminant function, Gamma, Maximum likelihood, Measurement error, Pooling
1. Introduction
In the regression setting where measuring a continuous exposure requires an expensive assay, a pooling study design can be extremely cost-effective (Weinberg and Umbach, 1999; Mitchell and others, 2014; Lyles and others, 2016). We consider a design in which the assay is applied to pooled rather than individual biospecimen samples, with each pooled sample comprised of an equal volume from some number of like participants with respect to case status (all cases or all controls). Assuming the assay returns the mean biomarker level for members of a given pool, the logistic regression model provided by Weinberg and Umbach (1999) can be used to estimate the log-odds ratios (log-OR’s) of interest with poolwise data.
However, two types of error may affect pooled biomarker measurements and induce bias if ignored. Measurement error (ME) is extra variability due to assay imprecision, and processing error (PE) is extra variability due to physically combining biospecimens into pools (Schisterman and others, 2010). In a hybrid design that includes some single-specimen pools (“singles” or “pools of size 1”), ME would affect all assay measurements while PE would only affect multi-specimen pools. ME could be assumed non-existent or negligible in scenarios where the assay is known to be highly accurate, but it seems generally dubious to assume no PE. That would require precise formation of exactly equal-volume pools, complete mixing, and no changes in the pooled biomarker concentration due to cross-reactions from mixing biospecimen samples from different subjects.
Following the framework of Schisterman and others (2010), Lyles and others (2015) used maximum likelihood (ML) to estimate the covariate-adjusted log-OR for a pooled exposure subject to ME and PE. They used a discriminant function approach in which the exposure log-OR is estimated not from a logistic regression, but from a linear regression of the exposure on case status and covariates. The primary assumptions were as follows: (i) exposure level given case status and covariates is normally distributed with homoscedastic errors; (ii) MEs and PEs are additive, independent, and normally distributed with mean 0 and variances $\sigma_m^2$ and $\sigma_p^2$ (independent of pool size); and (iii) MEs affect all measurements, while PEs only affect pools of size 2 or larger. Advantages of this approach include its computational simplicity, its applicability to designs with homogeneous or heterogeneous pools with respect to case status, the availability of a small-sample bias correction, and the ability to correct for both ME and PE without replicate assay measurements, provided there are at least three different pool sizes including pools of size 1. A notable disadvantage is that it produces a log-OR estimate for the pooled exposure, but not for covariates.
Van Domelen and others (2018) relied on similar error assumptions as Schisterman and others (2010) and Lyles and others (2015) to correct for assay errors in fitting the Weinberg and Umbach (1999) poolwise logistic regression model. Taking a classical ME modeling approach, they wrote the likelihood contribution for the $i$th pool as the product of three densities—case status given true biomarker level and covariates, imprecise biomarker level given true biomarker level, and true biomarker level given covariates—with the unobserved true biomarker level integrated out. This approach allows estimation of all coefficients in the logistic regression of interest.
An important limitation of the Lyles and others (2015) and Van Domelen and others (2018) methods is that they assume normally distributed biomarkers (conditional on covariates), whereas biomarkers are often skewed. In this work, we address this limitation by providing Gamma model-based analogues to both of these approaches.
In considering how to adapt the Van Domelen and others (2018) logistic regression to account for skewness, a natural idea is to assume a linear model for the log-biomarker level given covariates. This would imply a linear model for the sum of the log-biomarker levels for members of a given pool vs. summed covariates. But the summed log-biomarker level cannot be recovered from the poolwise mean, which is what the assay is assumed to measure (Mitchell and others, 2014).
We propose an alternative approach with a more convenient poolwise-sum result: a Gamma regression model for biomarker level given covariates. Following the "alternative gamma" model of Mitchell and others (2015), we assume a constant scale parameter and covariates linearly related to the log of the shape parameter. This implies that each covariate is linearly related to the log of the expected value of the biomarker, and that the variance of the biomarker is directly proportional to the expected value. Assuming independence among members of a pool, the summed biomarker level is also Gamma distributed. Unlike the log-transformed linear regression, this result is compatible with the observed data, provided individual-level covariates are available.
As for the errors, strictly positive poolwise measurements are more compatible with mean-1 multiplicative errors than mean-0 additive errors. The latter would be incongruent with observed measurements having considerable density around 0, but no 0’s or negative values, as is the case in our motivating example. So we assume the PE and ME are mean-1 lognormal and act multiplicatively on the true poolwise means, using a generally similar set of assumptions as in previous work (Schisterman and others, 2010; Lyles and others, 2015). We accommodate replicate assay measurements, which are not strictly required for identifiability but may help stabilize ML estimation at feasible sample sizes. An interesting feature of the Gamma setup is that correcting for errors is theoretically possible even with no replicates and a single pool size.
To adapt the Lyles and others (2015) discriminant function approach to account for right-skewed biomarkers, we generalize the framework of Whitcomb and others (2012) to include covariates and multiplicative errors. The Gamma discriminant function model is similar to the Gamma model just described, but with a different scale parameter for cases and controls and a case status coefficient incorporated into the shape parameter. Applying Bayes rule produces an expression for the adjusted log-OR for $X$, which is constant (and the same expression as in Whitcomb and others (2012)) only when the case status coefficient is 0. This is analogous to the normal discriminant function model, where different residual error variances for cases and controls imply a logistic regression model with a quadratic term for $X$ and thus a non-constant odds ratio (Cornfield, 1962). This approach extends to the pooling scenario and permits incorporating multiplicative lognormal errors as in the logistic regression approach.
Our motivating example is estimation of the covariate-adjusted log-OR relating serum levels of a positive, right-skewed cytokine to odds of miscarriage, with the cytokine measured in pools of size 1 (some with replicates) and 2. We use Akaike information criterion (AIC) to confirm better model fit for the Gamma analogues of the previously developed logistic regression and discriminant function approaches and perform simulations to confirm validity, gauge whether estimation is reasonably stable in scenarios where the Gamma models are identifiable but the normal models are not, and examine how errors affect the relative efficiency of pooling vs. traditional designs.
2. Methods
2.1. Homogeneous pools logistic regression
Consider a design in which a continuous biomarker $X$ is measured in pooled samples, with the $i$th sample ($i = 1, \ldots, n$) comprised of equal-volume aliquots from $g_i$ participants ($j = 1, \ldots, g_i$) that are either all cases ($Y_i = 1$) or all non-cases ($Y_i = 0$), but assumed independent of each other given case status.

The assay is assumed to produce the arithmetic mean biomarker level for members of the $i$th pool, $\bar{X}_i$, from which the summed biomarker level can be calculated as $X_i^* = g_i \bar{X}_i$. Weinberg and Umbach (1999, 2014) showed that if a logistic regression model relates individual-level case status to exposure and covariates $\mathbf{C}$, i.e. $\mathrm{logit}\, P(Y = 1 \mid X, \mathbf{C}) = \beta_0 + \beta_x X + \boldsymbol{\beta}_c' \mathbf{C}$, then the corresponding model relating $Y_i$ to $X_i^*$ and the summed covariates $\mathbf{C}_i^* = \sum_{j=1}^{g_i} \mathbf{C}_{ij}$ is:

$$\mathrm{logit}\, P(Y_i = 1 \mid X_i^*, \mathbf{C}_i^*) = q_i + g_i \beta_0 + \beta_x X_i^* + \boldsymbol{\beta}_c' \mathbf{C}_i^* \qquad (2.1)$$

with the offset $q_i$ defined as:

$$q_i = g_i \ln(\pi_1/\pi_0) + g_i \ln(n_0/n_1) + \ln\!\big(n_{1(g_i)}/n_{0(g_i)}\big) \qquad (2.2)$$

where $\pi_1$ and $\pi_0$ are accrual probabilities for individual cases and controls, $n_1$ and $n_0$ are the total number of cases and controls sampled, and $n_{1(g_i)}$ and $n_{0(g_i)}$ are the numbers of case pools and control pools of size $g_i$ (Weinberg and Umbach, 1999, 2014). The first term is 0 for prospective or cross-sectional sampling, where $\pi_1 = \pi_0$. For case–control sampling it is nonzero, but omitting it (e.g. if accrual probabilities are unknown) only compromises validity of the estimated intercept $\beta_0$. Thus with poolwise observations and no errors in $\bar{X}_i$, one can fit 2.1 to estimate $(\beta_0, \beta_x, \boldsymbol{\beta}_c)$, the same parameters that would be targeted via a traditional study without biospecimen pooling.
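To make the poolwise model concrete, the following sketch (in Python, with illustrative parameter values invented for this example, not fitted values) verifies the identity underlying 2.1 for a homogeneous pool of size 2: under the individual-level logistic model, the log-odds that both members are cases rather than both controls is exactly $2\beta_0 + \beta_x X_i^* + \boldsymbol{\beta}_c' \mathbf{C}_i^*$, with the offset terms omitted.

```python
import math

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative (made-up) individual-level logistic parameters
b0, bx, bc = -1.0, 0.5, 0.3

# Two pool members with exposures x1, x2 and a scalar covariate c1, c2
x1, x2, c1, c2 = 0.8, 1.4, 1.0, 0.0
p1 = expit(b0 + bx * x1 + bc * c1)  # P(member 1 is a case)
p2 = expit(b0 + bx * x2 + bc * c2)  # P(member 2 is a case)

# Given independence, the odds of "all cases" vs. "all controls" is the
# product of the individual odds, so the log-odds is the sum of the logits:
lhs = math.log((p1 * p2) / ((1 - p1) * (1 - p2)))

# Poolwise model: g * beta_0 + beta_x * (summed exposure) + beta_c * (summed covariate)
rhs = 2 * b0 + bx * (x1 + x2) + bc * (c1 + c2)

print(abs(lhs - rhs))  # agreement up to floating-point error
```

The summed exposure and summed covariates are exactly what the pooled assay design provides, which is why the poolwise fit targets the same coefficients as an individual-level analysis.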
2.2. Normal $X$ logistic regression (NLR)
Here, we describe the Van Domelen and others (2018) normal-$X$ logistic regression ("NLR") approach for handling errors. Suppose the assay does not return the true poolwise mean biomarker level $\bar{X}_i$, but rather an imprecise version, $\tilde{\bar{X}}_i$, from which the imprecise poolwise sum can be calculated as $\tilde{X}_i^* = g_i \tilde{\bar{X}}_i$. To immediately accommodate replicate measurements on the same pool, suppose there are $k_i$ such measurements for the $i$th pool, such that $\tilde{\bar{\mathbf{X}}}_i = (\tilde{\bar{X}}_{i1}, \ldots, \tilde{\bar{X}}_{ik_i})'$ are observed and $\tilde{\mathbf{X}}_i^* = g_i \tilde{\bar{\mathbf{X}}}_i$ can be calculated. The likelihood contribution is $f(Y_i, \tilde{\mathbf{X}}_i^* \mid \mathbf{C}_i)$, which can be factored:

$$
\begin{aligned}
f(Y_i, \tilde{\mathbf{X}}_i^* \mid \mathbf{C}_i) &= \int f(Y_i \mid \tilde{\mathbf{X}}_i^*, X_i^*, \mathbf{C}_i)\, f(\tilde{\mathbf{X}}_i^* \mid X_i^*, \mathbf{C}_i)\, f(X_i^* \mid \mathbf{C}_i)\, dX_i^* \\
&= \int f(Y_i \mid X_i^*, \mathbf{C}_i)\, f(\tilde{\mathbf{X}}_i^* \mid X_i^*)\, f(X_i^* \mid \mathbf{C}_i)\, dX_i^*
\end{aligned} \qquad (2.3)
$$

where $X_i^*$ is the unobserved true poolwise sum. Van Domelen and others (2018) make two assumptions to arrive at the second line of 2.3: (i) the imprecise $\tilde{\mathbf{X}}_i^*$ does not inform $Y_i$ given the true $X_i^*$ (and covariates), and (ii) errors in $\tilde{\mathbf{X}}_i^*$ are unrelated to covariates. These are fairly standard assumptions for ME problems, although as a reviewer noted it is conceivable that the second assumption might be violated, say if some other biomarker in the blood cross-reacts with the biomarker of interest and is itself correlated with covariates. For the purposes of this article, we follow the framework of Lyles and others (2015) and Van Domelen and others (2018) and leverage these two assumptions.
The first term under the integral is specified by 2.1. For the second term, suppose there is additive mean-0 ME and PE acting on the poolwise mean $\bar{X}_i$ such that $\tilde{\bar{\mathbf{X}}}_i$ can be written:

$$\tilde{\bar{\mathbf{X}}}_i = \left[\bar{X}_i + e_i\, I(g_i > 1)\right] \mathbf{1}_{k_i} + \mathbf{m}_i \qquad (2.4)$$

The errors are assumed to be independent of each other, independent of the true $\bar{X}_i$ and $\mathbf{C}_i$, and independent of pool size. Errors are also assumed to be normally distributed: $e_i \sim N(0, \sigma_p^2)$ and $\mathbf{m}_i \sim N(\mathbf{0}, \sigma_m^2 \mathbf{I}_{k_i})$, where $\mathbf{I}_{k_i}$ is a $k_i$-dimensional identity matrix. This says each of the $k_i$ assay measurements is the sum of the true poolwise mean, a common PE if $g_i > 1$, and a unique ME, with all errors independent of each other. Multivariate normal theory leads to:

$$\tilde{\mathbf{X}}_i^* \mid X_i^* \sim \mathrm{MVN}\!\left(X_i^* \mathbf{1}_{k_i},\; g_i^2\left[\sigma_p^2\, I(g_i > 1)\, \mathbf{J}_{k_i} + \sigma_m^2\, \mathbf{I}_{k_i}\right]\right) \qquad (2.5)$$

where $\mathbf{J}_{k_i}$ is a $k_i \times k_i$ matrix of 1's.
For the third term in 2.3, an individual-level normal linear regression $X \mid \mathbf{C} \sim N(\gamma_0 + \boldsymbol{\gamma}_c' \mathbf{C}, \sigma^2)$ leads to the poolwise result for $X_i^* \mid \mathbf{C}_i$:

$$X_i^* \mid \mathbf{C}_i \sim N\!\left(g_i \gamma_0 + \boldsymbol{\gamma}_c' \mathbf{C}_i^*,\; g_i \sigma^2\right) \qquad (2.6)$$

Optimization routines can be used to obtain ML estimates (MLEs) for $(\beta_0, \beta_x, \boldsymbol{\beta}_c, \gamma_0, \boldsymbol{\gamma}_c, \sigma^2, \sigma_p^2, \sigma_m^2)$, with standard errors estimated from the inverse of the numerically approximated Hessian matrix at the MLEs. While a closed-form approximation for the integral is available (Carroll and others, 1984, 2006; Lyles and Kupper, 2013), we consider full ML, which requires integrating out the unobserved $X_i^*$'s numerically for each pool at each iteration.
2.3. Gamma $X$ logistic regression (GLR)
We propose a Gamma-$X$ logistic regression approach ("GLR") to accommodate a right-skewed biomarker. The likelihood in 2.3 carries over, but we specify different models for the second and third densities. For the typical case where an assay produces strictly positive measurements, we assume mean-1 lognormal errors acting multiplicatively on the poolwise mean. The analogue of 2.4 becomes:

$$\tilde{\bar{\mathbf{X}}}_i = \bar{X}_i\, e_i^{I(g_i > 1)}\, \mathbf{m}_i \qquad (2.7)$$

with $\ln e_i \sim N(-\sigma_p^2/2, \sigma_p^2)$ and $\ln m_{ik} \sim N(-\sigma_m^2/2, \sigma_m^2)$ for $k = 1, \ldots, k_i$, so that $E(e_i) = E(m_{ik}) = 1$ (multiplication by the vector $\mathbf{m}_i$ is elementwise).

To determine the form of $f(\tilde{\mathbf{X}}_i^* \mid X_i^*)$, note that $\tilde{\mathbf{X}}_i^* = X_i^*\, e_i^{I(g_i > 1)}\, \mathbf{m}_i$. The product of the independent lognormal error terms, $e_i^{I(g_i > 1)} \mathbf{m}_i$, is multivariate lognormal, and thus $\tilde{\mathbf{X}}_i^*$ is also multivariate lognormal:

$$\ln \tilde{\mathbf{X}}_i^* \mid X_i^* \sim \mathrm{MVN}\!\left(\left[\ln X_i^* - \frac{\sigma_p^2}{2} I(g_i > 1) - \frac{\sigma_m^2}{2}\right] \mathbf{1}_{k_i},\; \sigma_p^2\, I(g_i > 1)\, \mathbf{J}_{k_i} + \sigma_m^2\, \mathbf{I}_{k_i}\right) \qquad (2.8)$$
Note that 2.8 can be viewed as a multiplicative lognormal analogue of 2.5.
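The mean-1 lognormal parameterization can be checked directly: if $\ln m \sim N(-\sigma^2/2, \sigma^2)$, then $E(m) = \exp(-\sigma^2/2 + \sigma^2/2) = 1$, so multiplicative errors inflate variability without systematically biasing the measurement. A quick Monte Carlo sketch in Python (the value $\sigma^2 = 0.62$ is used purely for illustration, roughly the magnitude of the PE variance fitted later):

```python
import numpy as np

sigma2 = 0.62  # illustrative error variance on the log scale
rng = np.random.default_rng(1)

# Mean-1 lognormal: ln(e) ~ N(-sigma2/2, sigma2)
e = rng.lognormal(mean=-sigma2 / 2, sigma=np.sqrt(sigma2), size=1_000_000)

# Closed-form lognormal moments: E = exp(mu + s2/2), V = (exp(s2) - 1) exp(2*mu + s2)
expected_mean = 1.0
expected_var = np.exp(sigma2) - 1.0  # after substituting mu = -sigma2/2

print(e.mean(), expected_mean)  # sample mean close to 1
print(e.var(), expected_var)    # sample variance close to exp(sigma2) - 1
```

Note how quickly the variance of a mean-1 multiplicative error grows with $\sigma^2$; this is one way severe PE erodes the efficiency gains of pooling even though the errors remain unbiased.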
For $f(X_i^* \mid \mathbf{C}_i)$, we assume a constant-scale Gamma model for individual biomarker levels: $X \mid \mathbf{C} \sim \mathrm{Gamma}\!\left(\exp(\alpha_0 + \boldsymbol{\alpha}_c' \mathbf{C}),\, b\right)$, in the shape–scale parameterization. This implies $E(X \mid \mathbf{C}) = b \exp(\alpha_0 + \boldsymbol{\alpha}_c' \mathbf{C})$, such that there is a monotone, non-linear relationship between each covariate and the expected value of the biomarker. It also implies $V(X \mid \mathbf{C}) = b^2 \exp(\alpha_0 + \boldsymbol{\alpha}_c' \mathbf{C})$, which means $V(X \mid \mathbf{C}) = b\, E(X \mid \mathbf{C})$, i.e. the variance is directly proportional to the mean.
The sum of independent Gamma random variables with shape parameters $a_j$ and common scale parameter $b$ is $\mathrm{Gamma}(\sum_j a_j, b)$, so the poolwise sum biomarker level is distributed as follows:

$$X_i^* \mid \mathbf{C}_i \sim \mathrm{Gamma}\!\left(\sum_{j=1}^{g_i} \exp(\alpha_0 + \boldsymbol{\alpha}_c' \mathbf{C}_{ij}),\; b\right) \qquad (2.9)$$
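The closure of the Gamma family under summation with a common scale, which makes 2.9 possible, is easy to verify empirically. A short Monte Carlo sketch (with arbitrary illustrative shapes standing in for $\exp(\alpha_0 + \boldsymbol{\alpha}_c' \mathbf{C}_{ij})$):

```python
import numpy as np

rng = np.random.default_rng(2)
b = 0.7                        # common scale parameter
shapes = np.array([1.2, 2.5])  # member-specific shapes for a pool of size 2

# Simulate pool sums X* = X_1 + X_2 for many hypothetical pools
n = 500_000
x_sum = sum(rng.gamma(shape=a, scale=b, size=n) for a in shapes)

# Theory: X* ~ Gamma(sum of shapes, b), so mean = b * sum(a), var = b^2 * sum(a)
theory_mean = shapes.sum() * b
theory_var = shapes.sum() * b**2

print(x_sum.mean(), theory_mean)
print(x_sum.var(), theory_var)
```

A log-transformed linear model has no analogous closure: the distribution of a sum of lognormals is not lognormal, which is exactly why the Gamma specification is the convenient one for poolwise sums.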
The likelihood is now fully specified for GLR, with the three densities in the 2.3 likelihood given by 2.1, 2.8, and 2.9. Similar computational procedures as in Van Domelen and others (2018) can be used to obtain MLEs and Hessian-based standard errors; the unobserved $X_i^*$'s have to be integrated out numerically.
2.4. Normal discriminant function approach (NDFA)
Lyles and others (2015) proposed a normal-$X$ discriminant function approach ("NDFA") to estimate the log-OR for a biomarker measured in pools while correcting for additive normal errors. The assumed model for individual biomarker level given case status and covariates is $X \mid Y, \mathbf{C} \sim N(\gamma_0 + \gamma_y Y + \boldsymbol{\gamma}_c' \mathbf{C}, \sigma^2)$. If this model holds, the quantity $\beta_x = \gamma_y / \sigma^2$ represents the covariate-adjusted log-OR relating $X$ and $Y$. The corresponding poolwise sum model is:

$$X_i^* \mid Y_i, \mathbf{C}_i \sim N\!\left(g_i \gamma_0 + \gamma_y \sum_{j=1}^{g_i} Y_{ij} + \boldsymbol{\gamma}_c' \mathbf{C}_i^*,\; g_i \sigma^2\right) \qquad (2.10)$$

where $\sum_{j} Y_{ij}$ is the number of cases in the pool, which is $g_i Y_i$ under homogeneous pooling. The likelihood contribution for the observed $\tilde{\mathbf{X}}_i^*$ (allowing for replicates) is $f(\tilde{\mathbf{X}}_i^* \mid Y_i, \mathbf{C}_i)$. The same additive normal error assumptions as for NLR lead to:

$$\tilde{\mathbf{X}}_i^* \mid Y_i, \mathbf{C}_i \sim \mathrm{MVN}\!\left(\left(g_i \gamma_0 + \gamma_y g_i Y_i + \boldsymbol{\gamma}_c' \mathbf{C}_i^*\right) \mathbf{1}_{k_i},\; \left[g_i \sigma^2 + g_i^2 \sigma_p^2\, I(g_i > 1)\right] \mathbf{J}_{k_i} + g_i^2 \sigma_m^2\, \mathbf{I}_{k_i}\right) \qquad (2.11)$$

This closed-form likelihood can be maximized to obtain $\hat{\gamma}_y$, $\hat{\sigma}^2$, and $\hat{\beta}_x = \hat{\gamma}_y / \hat{\sigma}^2$. A delta method-based variance estimator is:

$$\hat{V}(\hat{\beta}_x) = \frac{\hat{V}(\hat{\gamma}_y)}{\hat{\sigma}^4} + \frac{\hat{\gamma}_y^2\, \hat{V}(\hat{\sigma}^2)}{\hat{\sigma}^8} - \frac{2 \hat{\gamma}_y\, \widehat{\mathrm{Cov}}(\hat{\gamma}_y, \hat{\sigma}^2)}{\hat{\sigma}^6}$$

with the variances and covariance taken from the inverse of the estimated information matrix.
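The delta-method step can be sketched in a few lines of Python. The gradient of $\beta_x = \gamma_y / \sigma^2$ with respect to $(\gamma_y, \sigma^2)$ is $(1/\sigma^2, -\gamma_y/\sigma^4)$, and the variance estimate is the usual quadratic form in the estimated covariance block. The covariance entries below are placeholders for illustration, not values from the CPP fit; the sketch also checks the analytic gradient against finite differences:

```python
import numpy as np

# Placeholder estimates and a hypothetical covariance block for (gamma_y, sigma^2)
gamma_y, sigma2 = 0.08, 1.58
cov = np.array([[0.0169, 0.0020],
                [0.0020, 0.0441]])

# Analytic gradient of beta_x = gamma_y / sigma^2 w.r.t. (gamma_y, sigma^2)
grad = np.array([1 / sigma2, -gamma_y / sigma2**2])
var_delta = grad @ cov @ grad  # delta-method variance estimate

# Finite-difference check of the gradient
def f(g, s):
    return g / s

eps = 1e-6
grad_fd = np.array([
    (f(gamma_y + eps, sigma2) - f(gamma_y - eps, sigma2)) / (2 * eps),
    (f(gamma_y, sigma2 + eps) - f(gamma_y, sigma2 - eps)) / (2 * eps),
])
print(var_delta, np.sqrt(var_delta))  # variance and standard error
```

The same pattern generalizes to any smooth function of MLEs once the joint covariance matrix is available from the inverse Hessian.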
2.5. Gamma discriminant function approach (GDFA)
Towards a Gamma-$X$ discriminant function approach ("GDFA"), we assume the following individual-level Gamma model:

$$X \mid Y, \mathbf{C} \sim \mathrm{Gamma}\!\left(\exp(\alpha_0 + \alpha_y Y + \boldsymbol{\alpha}_c' \mathbf{C}),\; b_1^{Y} b_0^{(1-Y)}\right) \qquad (2.12)$$

The scale parameter is $b_1$ for cases and $b_0$ for controls. Applying Bayes rule and taking the logit gives:

$$\mathrm{logit}\, P(Y = 1 \mid X, \mathbf{C}) = c(\mathbf{C}) + \exp(\alpha_0 + \boldsymbol{\alpha}_c' \mathbf{C})\left(e^{\alpha_y} - 1\right) \ln X + \left(\frac{1}{b_0} - \frac{1}{b_1}\right) X \qquad (2.13)$$

where $c(\mathbf{C})$ is a constant term which is a function of $\mathbf{C}$ but not $X$. In general, the log-OR for a 1-unit increase in $X$ is given by:

$$\log \mathrm{OR} = \exp(\alpha_0 + \boldsymbol{\alpha}_c' \mathbf{C})\left(e^{\alpha_y} - 1\right) \ln\!\left(\frac{X + 1}{X}\right) + \frac{1}{b_0} - \frac{1}{b_1} \qquad (2.14)$$

The log-OR depends on $X$ (and $\mathbf{C}$) unless $\alpha_y = 0$, in which case it simplifies to $1/b_0 - 1/b_1$. This is very similar to the Whitcomb and others (2012) scenario without covariates, where the log-OR is constant if the shape parameter is the same for cases and controls.
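The constant-log-OR special case can be validated numerically against the Gamma density ratio. The sketch below uses scale values of 0.72 and 0.67 (chosen here for illustration) with $\alpha_y = 0$ and an arbitrary shape, and confirms that the log-OR for a 1-unit increase reduces to $1/b_0 - 1/b_1 \approx 0.104$ regardless of where the increase occurs:

```python
from scipy.stats import gamma

b1, b0 = 0.72, 0.67  # case and control scale parameters (illustrative)
k = 2.0              # common shape exp(alpha_0 + alpha_c'C); arbitrary when alpha_y = 0
x = 1.5

def logit_part(x):
    # log f(x | case) - log f(x | control); the constant c(C) cancels in differences
    return gamma.logpdf(x, k, scale=b1) - gamma.logpdf(x, k, scale=b0)

log_or = logit_part(x + 1) - logit_part(x)
closed_form = 1 / b0 - 1 / b1
print(log_or, closed_form)  # both ~0.104
```

With $\alpha_y \neq 0$ the shape parameters differ between cases and controls, the $\ln X$ term in 2.13 no longer cancels, and the same difference-of-logpdfs computation reproduces the $X$-dependent expression in 2.14.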
For poolwise data, if the $i$th pool is comprised of $g_i$ cases or controls with $X$ distributed as in 2.12, then the poolwise sum biomarker level $X_i^*$ is distributed:

$$X_i^* \mid Y_i, \mathbf{C}_i \sim \mathrm{Gamma}\!\left(\sum_{j=1}^{g_i} \exp(\alpha_0 + \alpha_y Y_i + \boldsymbol{\alpha}_c' \mathbf{C}_{ij}),\; b_1^{Y_i} b_0^{(1-Y_i)}\right) \qquad (2.15)$$

To incorporate errors, we make the same multiplicative lognormal error assumptions as for GLR. The likelihood for the $i$th pool with $k_i$ replicates is $f(\tilde{\mathbf{X}}_i^* \mid Y_i, \mathbf{C}_i) = \int f(\tilde{\mathbf{X}}_i^* \mid X_i^*)\, f(X_i^* \mid Y_i, \mathbf{C}_i)\, dX_i^*$ (utilizing the two simplifying assumptions from Section 2.2), with the terms under the integral given by 2.8 and 2.15. The $X_i^*$'s are integrated out numerically.
2.6. Implementation
We previously developed the R (R Core Team, 2019) package pooling (Van Domelen, 2019) with functions for fitting poolwise regression models, including p_ndfa (originally named p_dfa_xerrors) for implementing the Lyles and others (2015) NDFA approach and p_logreg_xerrors for the Van Domelen and others (2018) NLR approach. We have added two functions for the Gamma methods introduced in this article: p_gdfa for GDFA and p_logreg_xerrors2 for GLR. Function inputs include the data, an indicator for which error types to model, and various options for likelihood maximization; outputs include the MLEs, a variance–covariance matrix, and the AIC.
Numerical integration, which is necessary for NLR, GLR, and GDFA, is performed via the hcubature function from the cubature package v. 2.0.3 (Narasimhan and others, 2018). For log-likelihood maximization, we use the nlminb function in base R, which implements a quasi-Newton method that accommodates bounds. Starting values of 0.01 and 1 are used for regression coefficients and variance terms, respectively, and lower bounds of 0.0001 are used for variance terms (note: this may not be appropriate for biomarkers on a different scale). Hessian matrices are approximated numerically via hessian from numDeriv v. 2016.8-1 (Gilbert and Varadhan, 2016).
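The same optimization pattern, bounded quasi-Newton maximization of a log-likelihood with a small positive lower bound on variance-type parameters, can be mimicked outside R. Here is a sketch using scipy's L-BFGS-B in place of nlminb; the toy target is a plain Gamma log-likelihood rather than the full poolwise likelihood, and the data are simulated for illustration:

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

rng = np.random.default_rng(3)
data = rng.gamma(shape=2.0, scale=0.7, size=5000)  # toy "biomarker" sample

def negloglik(theta):
    # negative Gamma log-likelihood in the (shape, scale) parameterization
    a, b = theta
    return -np.sum(stats.gamma.logpdf(data, a, scale=b))

# Start at (1, 1) and bound both parameters below by 1e-4, mirroring the
# starting-value / lower-bound conventions described above
fit = minimize(negloglik, x0=[1.0, 1.0], method="L-BFGS-B",
               bounds=[(1e-4, None), (1e-4, None)])
a_hat, b_hat = fit.x
print(a_hat, b_hat)  # close to the generating values (2.0, 0.7)
```

When an estimate lands on the 1e-4 boundary in practice, that is a signal the corresponding error component may be negligible, which motivates the PE-only/ME-only refits described in the simulations below.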
Some additional details on optimization and how occasional numerical issues were handled are included as Appendix B of the supplementary material available at Biostatistics online.
2.7. Reproducibility
A figshare repository is available at https://figshare.com/s/0f73bf55d944e7158f31 with the following items: (i) a snapshot build of the pooling package (pooling.tar.gz and source_files.zip), which includes a simulated dataset intended to mimic our motivating example; (ii) R code for reproducing our analyses on that dataset (biostatistics_analysis.R); and (iii) R code for running one trial of each simulation (run_simulations.R).
3. Results
3.1. Motivating example: MCP-1 and odds of miscarriage
We use data from a nested case–control follow-up study to the Collaborative Perinatal Project (CPP) (Hardy, 2003; Whitcomb and others, 2007) to estimate the association between serum levels of the cytokine monocyte chemotactic protein (MCP-1) and odds of miscarriage controlling for mother’s age, race, and current smoking. Our dataset consists of 126 single-specimen pools, 30 of which have replicate MCP-1 measurements, and 280 pools of size 2. The pools of size 2 are all homogeneous with respect to case status, i.e. contain samples from women whose pregnancies both did or did not result in miscarriages.
Given that the 126 single-specimen pools are not subject to PE, and the 30 replicates suggest only a small amount of ME (Figure 1a of the supplementary material available at Biostatistics online), a histogram of these values should give a reasonable indication of the marginal MCP-1 distribution (Figure 1b of the supplementary material available at Biostatistics online). The data are more compatible with lognormal and Gamma distributions than normal.
Table 1 summarizes model fits for the two logistic regression methods, using all available data including replicates and modeling both error types. Covariates $C_1$–$C_3$ represent mother's age, non-white race, and current smoking, respectively, and $X$ represents the pooled exposure MCP-1. The $\beta$'s are logistic regression coefficients in 2.1; the $\gamma$'s and $\sigma^2$ (NLR) and the $\alpha$'s and $b$ (GLR) are parameters in the $X \mid \mathbf{C}$ models (2.6 for NLR, 2.9 for GLR); and $\sigma_p^2$ and $\sigma_m^2$ are the PE and ME variances.
Table 1.
Logistic regression fits for odds of miscarriage. Values are point estimates (standard errors). $\beta_x$ represents the covariate-adjusted log-OR relating MCP-1 and miscarriage

| | Naive | NLR (AIC = 2340.8) | GLR (AIC = 1787.5) |
|---|---|---|---|
| $\beta_0$ | −1.57 (0.37) | −1.58 (0.37) | −1.60 (0.39) |
| $\beta_x$ (MCP-1) | 0.01 (0.02) | 0.05 (0.08) | 0.05 (0.12) |
| $\beta_{c1}$ (age) | 0.04 (0.01) | 0.04 (0.01) | 0.04 (0.01) |
| $\beta_{c2}$ (race) | 0.56 (0.18) | 0.57 (0.18) | 0.57 (0.18) |
| $\beta_{c3}$ (smoking) | 0.34 (0.16) | 0.34 (0.16) | 0.34 (0.16) |
| $\gamma_0$ / $\alpha_0$ | — | 0.50 (0.38) | 0.38 (0.26) |
| $\gamma_{c1}$ / $\alpha_{c1}$ (age) | — | 0.03 (0.01) | 0.01 (0.01) |
| $\gamma_{c2}$ / $\alpha_{c2}$ (race) | — | −0.17 (0.17) | −0.33 (0.11) |
| $\gamma_{c3}$ / $\alpha_{c3}$ (smoking) | — | 0.02 (0.16) | −0.01 (0.09) |
| $b$ | — | — | 0.69 (0.08) |
| $\sigma^2$ | — | 1.58 (0.21) | — |
| $\sigma_p^2$ | — | 0.73 (0.18) | 0.62 (0.09) |
| $\sigma_m^2$ | — | 0.11 (0.03) | 0.02 (0.01) |
AIC favored GLR over NLR. The estimated log-OR was higher for NLR and GLR than for the naive poolwise logistic regression ignoring MCP-1 errors but still not significantly different from 0. The other logistic regression coefficients were virtually identical for the three models. Both NLR and GLR suggested much larger PE than ME.
Notably, if the replicate MCP-1 measurements had not been included, the NLR model could not be fit with both PE and ME, while the GLR model could. For identifiability, NLR would require a third pool size in addition to 1 and 2 (further details on identifiability requirements are provided in Appendix A of the supplementary material available at Biostatistics online). GLR fit without replicates gave a somewhat larger log-OR and very different variance estimates compared with the fit with replicates. It is unclear whether GLR's identifiability is practical in this scenario, given the variance estimates here and the instability of NLR without replicates reported by Van Domelen and others (2018) in what seemed to be an easier identifiability scenario. It may be that GLR's identifiability is not practical when the pooled biomarker is close to being normally distributed (Carroll and others, 2006). We explore this issue later via simulations.
Table 2 summarizes fits for the two discriminant function methods, under the simplifying assumptions such that the log-OR is constant with $X$. For NDFA, the parameters $(\gamma_0, \gamma_y, \boldsymbol{\gamma}_c, \sigma^2)$ are from the $X_i^* \mid Y_i, \mathbf{C}_i$ model in 2.10; for GDFA, $(\alpha_0, \boldsymbol{\alpha}_c, b_1, b_0)$ are from the $X_i^* \mid Y_i, \mathbf{C}_i$ model in 2.15 with $\alpha_y$ set to 0.
Table 2.
Discriminant function approach estimates for odds of miscarriage. Values are point estimates (standard errors)

| | NDFA (AIC = 1796.5) | GDFA (AIC = 1242.9) |
|---|---|---|
| $\gamma_0$ / $\alpha_0$ | 0.50 (0.38) | 0.41 (0.26) |
| $\gamma_y$ | 0.08 (0.13) | — |
| $\gamma_{c1}$ / $\alpha_{c1}$ (age) | 0.02 (0.01) | 0.01 (0.01) |
| $\gamma_{c2}$ / $\alpha_{c2}$ (race) | −0.19 (0.17) | −0.34 (0.11) |
| $\gamma_{c3}$ / $\alpha_{c3}$ (smoking) | −0.01 (0.16) | −0.02 (0.09) |
| $b_1$ | — | 0.72 (0.09) |
| $b_0$ | — | 0.67 (0.09) |
| $\sigma^2$ | 1.58 (0.21) | — |
| $\sigma_p^2$ | 0.73 (0.18) | 0.62 (0.09) |
| $\sigma_m^2$ | 0.11 (0.03) | 0.02 (0.01) |
| log-OR | 0.05 (0.08) | 0.10 (0.13) |
Results were similar to Table 1 in that AIC favored the Gamma approach over the normal one, and neither suggested a significant association between MCP-1 and odds of miscarriage. Also mirroring the logistic regression results, GDFA parameters were identifiable without replicates while NDFA parameters were not.
The constant log-OR models reported in Table 2 are the result of restrictions corresponding to testable hypotheses. For NDFA, the log-OR is constant if the residual error variance in the $X \mid Y, \mathbf{C}$ model is the same for cases and controls, i.e. if a version of 2.10 with case status-specific variances $\sigma_1^2$ and $\sigma_0^2$ satisfies $\sigma_1^2 = \sigma_0^2$. For GDFA, it is constant under $\alpha_y = 0$ (see 2.15). Likelihood ratio tests rejected neither null hypothesis.
Despite not rejecting the null hypothesis $\alpha_y = 0$, one could visualize the non-constant log-OR implied by the fitted GDFA model with $\alpha_y$ unrestricted by plotting the estimated log-OR vs. MCP-1. This is somewhat complicated by the fact that the log-OR also depends on covariates. With two binary and one continuous covariate, we plotted the association for the four combinations of race and smoking, each with age held fixed at its median of 26 years (Figure 2 of the supplementary material available at Biostatistics online). The curves suggest higher MCP-1 levels are associated with higher odds of miscarriage at the lower end of the MCP-1 range and lower odds of miscarriage at the upper end; however, confidence bands are compatible with no association over the entire range.
In summary, the Gamma models fit the CPP data better than the corresponding normal models and were noteworthy in that they could be fit without replicates. Substantive results were similar for all four methods: the estimated log-OR is small, there is little evidence of an association between MCP-1 and odds of miscarriage, and poolwise MCP-1 measurements seem to be more severely impacted by PE than by ME.
3.2. Simulation study
The purpose of the first simulation study is to confirm validity of the proposed Gamma methods, while also gauging robustness of the four methods to model misspecification. For each of the four methods, data were generated under the corresponding models described in Sections 2.2, 2.3, 2.4, or 2.5, mimicking the CPP data and using the CPP point estimates for parameter values (see Tables 1 and 2). In each case, log-OR’s were estimated by fitting the data-generating model as well as the three others.
For each trial under GLR, individual-level covariates ($C_1$ = mother's age, $C_2$ = non-white race, $C_3$ = current smoking) were generated independently for 686 participants, with the binary covariates drawn with sampling probabilities equal to the CPP proportions and age drawn to mimic the CPP age distribution. Individual-level $X$ (MCP-1) values were then generated from the individual version of 2.9 with $(\alpha_0, \boldsymbol{\alpha}_c, b)$ set to the GLR estimates in Table 1, and $Y$ (miscarriage) generated from the individual version of 2.1 with $\beta_0$ and $\boldsymbol{\beta}_c$ set to the GLR estimates in Table 1 and $\beta_x = 0.15$ (increased from the fitted 0.05). Observations were then split into $n_1$ cases and $n_0$ controls. The $n_1$ cases were randomly formed into $n_1/3$ (rounded up) pools of size 2 and the rest left as singles, and similarly for the $n_0$ controls, to produce an approximately equal number of pools of size 1 and 2.
Poolwise means $\bar{X}_i$ were then calculated, multiplied by mean-1 lognormal PEs with $\sigma_p^2 = 0.62$ (if $g_i = 2$) and mean-1 lognormal MEs with $\sigma_m^2 = 0.02$, and multiplied by $g_i$ to produce imprecise poolwise sums $\tilde{X}_i^*$. For 30 randomly selected single-specimen pools, $\tilde{\mathbf{X}}_i^*$ was generated from the same process but with two independent MEs rather than one.
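The error-generation step just described can be sketched as follows (Python; the PE and ME variances are the fitted GLR values, while pool membership and true biomarker values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
sig2_p, sig2_m = 0.62, 0.02  # PE and ME variances (GLR fit)

# True individual biomarker values for three pools: sizes 2, 2, 1
pools = [np.array([1.1, 2.3]), np.array([0.7, 0.9]), np.array([1.8])]

xtilde_sums = []
for members in pools:
    g = len(members)
    xbar = members.mean()                    # true poolwise mean
    if g > 1:                                # PE only for multi-specimen pools
        xbar *= rng.lognormal(-sig2_p / 2, np.sqrt(sig2_p))
    xbar *= rng.lognormal(-sig2_m / 2, np.sqrt(sig2_m))  # ME on every measurement
    xtilde_sums.append(g * xbar)             # imprecise poolwise sum

print(xtilde_sums)  # strictly positive, as a real assay would report
```

A replicate measurement on a single-specimen pool would repeat only the ME draw, which is what lets the replicates separate $\sigma_m^2$ from $\sigma_p^2$.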
For GDFA, $\mathbf{C}$ was generated via the same process, $Y$ based on a logistic regression of case status on covariates with intercept and coefficients chosen to mimic the CPP data, and $X$ based on 2.12 with parameters set to the GDFA estimates in Table 2, adjusted to induce a constant log-OR of 0.15 (increased from the fitted 0.10). Poolwise data were generated via the same process as above, again with $\sigma_p^2 = 0.62$ and $\sigma_m^2 = 0.02$.
For NLR, individual-level $X$ were generated from the individual version of 2.6 with $\boldsymbol{\gamma}_c$ set to the NLR estimates in Table 1 and $\sigma^2 = 1.58$. To avoid negative $X$'s, which would preclude fitting GLR and GDFA, $\gamma_0$ was set to 6.5 (increased from the fitted 0.50). $Y$ was generated from the individual version of 2.1 with $\beta_0$ and $\boldsymbol{\beta}_c$ set to the NLR estimates in Table 1 and $\beta_x = 0.15$ (increased from the fitted 0.05). Error variances were set to $\sigma_p^2 = 0.73$ and $\sigma_m^2 = 0.11$.
For NDFA, $\mathbf{C}$ was generated as for GDFA, and $X$ based on the individual version of 2.10 with $\boldsymbol{\gamma}_c$ and $\sigma^2$ set to the NDFA estimates in Table 2, $\gamma_0$ increased from the fitted 0.50 to avoid negative $X$'s, and $\gamma_y$ increased from the fitted 0.08 to induce a log-OR of 0.15. Error variances were set to $\sigma_p^2 = 0.73$ and $\sigma_m^2 = 0.11$.
Results are summarized in Table 3. For data generated under GLR, the naive poolwise logistic regression (i.e. ignoring PE and ME) underestimated the true log-OR and had poor confidence interval (CI) coverage. The correctly specified GLR estimator had a slight upward bias and nominal coverage; GDFA performed about the same as GLR. NLR and NDFA performed surprisingly well despite assuming additive normal rather than multiplicative lognormal errors; they were nearly unbiased, only slightly less efficient than the Gamma methods, and had approximately nominal coverage. For data generated under GDFA, all four methods were approximately unbiased and had good coverage, while the correctly specified GDFA estimator was slightly more efficient than GLR (SD = 0.123 vs. 0.125) and moderately more efficient than NLR (SD = 0.133) and NDFA (SD = 0.134). For data generated under NLR and NDFA, all four methods performed well; surprisingly, GDFA had a slight efficiency advantage.
Table 3.
Simulation results for estimating log-odds ratios with an approximately equal number of pools of size 1 (30 with replicates) and 2 (1000 trials; true log-OR = 0.15). Bias, SD, SE, and MSE pertain to the log-OR estimate

| | Mean bias | SD | Mean SE | MSE | 95% CI coverage, log-OR | 95% CI coverage, $\beta_{c1}$ | 95% CI coverage, $\beta_{c2}$ | 95% CI coverage, $\beta_{c3}$ |
|---|---|---|---|---|---|---|---|---|
| Generated GLR | | | | | | | | |
| Naive | −0.100 | 0.056 | 0.051 | 0.013 | 0.470 | 0.947 | 0.943 | 0.957 |
| GLR | 0.007 | 0.129 | 0.127 | 0.017 | 0.948 | 0.951 | 0.945 | 0.957 |
| NLR | 0.005 | 0.140 | 0.135 | 0.020 | 0.948 | 0.956 | 0.940 | 0.956 |
| GDFA | 0.005 | 0.126 | 0.125 | 0.016 | 0.942 | — | — | — |
| NDFA | 0.002 | 0.137 | 0.133 | 0.019 | 0.940 | — | — | — |
| Generated GDFA | | | | | | | | |
| Naive | −0.106 | 0.055 | 0.049 | 0.014 | 0.392 | — | — | — |
| GLR | 0.000 | 0.125 | 0.123 | 0.016 | 0.958 | — | — | — |
| NLR | 0.000 | 0.133 | 0.131 | 0.018 | 0.959 | — | — | — |
| GDFA | −0.001 | 0.123 | 0.121 | 0.015 | 0.953 | — | — | — |
| NDFA | 0.000 | 0.134 | 0.130 | 0.018 | 0.951 | — | — | — |
| Generated NLR | | | | | | | | |
| Naive | −0.061 | 0.052 | 0.054 | 0.006 | 0.777 | 0.950 | 0.957 | 0.945 |
| GLR | 0.001 | 0.087 | 0.089 | 0.008 | 0.958 | 0.946 | 0.958 | 0.940 |
| NLR | 0.006 | 0.088 | 0.091 | 0.008 | 0.958 | 0.950 | 0.963 | 0.944 |
| GDFA | −0.002 | 0.085 | 0.087 | 0.007 | 0.957 | — | — | — |
| NDFA | 0.004 | 0.087 | 0.090 | 0.008 | 0.956 | — | — | — |
| Generated NDFA | | | | | | | | |
| Naive | −0.060 | 0.049 | 0.049 | 0.006 | 0.752 | — | — | — |
| GLR | 0.001 | 0.082 | 0.081 | 0.007 | 0.955 | — | — | — |
| NLR | 0.004 | 0.084 | 0.083 | 0.007 | 0.959 | — | — | — |
| GDFA | −0.006 | 0.077 | 0.078 | 0.006 | 0.961 | — | — | — |
| NDFA | 0.004 | 0.081 | 0.082 | 0.007 | 0.961 | — | — | — |
AIC favored GLR over NLR in 100% of trials generated under GLR, GDFA over NDFA in 100% of trials under GDFA, NLR over GLR in 96.8% of trials under NLR, and NDFA over GDFA in 98.3% of trials under NDFA.
A reviewer asked about performance when $\sigma_m^2$ is similar to or larger than $\sigma_p^2$, so we re-ran Table 3 with the error variances flipped (see Table 3 of the supplementary material available at Biostatistics online). The correctly specified models performed fairly well, although all four had some upward mean bias. GLR and GDFA performed well under misspecification, while NLR and NDFA performed somewhat poorly for data generated under GLR and GDFA. Perhaps the robustness of NLR and NDFA in Table 3 stemmed from having very small ME and thus many observations that were nearly error-free (the singles).
The next set of simulations is aimed at assessing whether the Gamma models' identifiability absent replicates is practically useful. We consider the CPP scenario: pools of size 1 and 2 and biomarker measurements subject to multiplicative lognormal PE and ME. Data were generated under GLR and GDFA in the same manner as in previous simulations, but for various sample sizes, with and without the 30 replicates. After initially observing good performance without replicates despite $\hat{\sigma}_m^2$ frequently hitting the lower bound of 0.0001, simply because the ME was nearly small enough to ignore, we increased $\sigma_m^2$ from 0.02 to 0.2 and decreased $\sigma_p^2$ from 0.62 to 0.42. In trials where $\hat{\sigma}_p^2$ or $\hat{\sigma}_m^2$ hit the 0.0001 boundary, PE-only and ME-only models were fit and the one with lower AIC selected. Results are summarized in Table 4.
Table 4.
Simulation results for estimating the log-odds ratio with an approximately equal number of pools of size 1 and 2, with and without replicates (1000 trials, true value = 0.15)

| | GLR | | | | GDFA | | | |
|---|---|---|---|---|---|---|---|---|
| | Mean bias | Median bias | 95% CI coverage | Median CI width | Mean bias | Median bias | 95% CI coverage | Median CI width |
| n = 686 | | | | | | | | |
| No replicates | 0.023 | 0.011 | 0.978 | 0.560 | 0.141 | 0.010 | 0.970 | 0.543 |
| 30 replicates | 0.010 | 0.006 | 0.961 | 0.546 | 0.009 | 0.006 | 0.964 | 0.518 |
| n = 2000 | | | | | | | | |
| No replicates | 0.005 | 0.003 | 0.951 | 0.319 | 0.011 | 0.005 | 0.950 | 0.308 |
| 30 replicates | 0.004 | 0.005 | 0.951 | 0.316 | 0.009 | 0.003 | 0.946 | 0.303 |
Overall performance was surprisingly good for the no-replicates estimators, although there was upward mean bias at n = 686 due to occasional extreme log-OR estimates (3 trials for GLR, 3 trials for GDFA). CIs were wider without replicates, but not by much, especially for n = 2000. The ME variance estimate $\hat{\sigma}_m^2$ occasionally hit 0.0001 in the no-replicates scenarios (3.8% of trials for GLR and 2.8% for GDFA at n = 686; 0.1% for GLR and 0.3% for GDFA at n = 2000), while the PE variance estimate $\hat{\sigma}_p^2$ never did.
Next, we compare efficiency of a pooling vs. traditional design for the same number of total assays in a no-ME (PE only) scenario, where the pooling design is perhaps most attractive. For each trial we generate 686 observations via the same procedure as previously under GDFA. For the pooling design, we form $2n_1/9$ (rounded up) case pools of size 4 and leave the remaining cases as singles, and similarly for controls, to produce approximately twice as many pools of size 4 as there are singles. For the traditional design, we randomly sample the same number of cases and controls as there are case pools and control pools in the same trial and obtain individual-level $X$ values, which are precise because singles are not impacted by PE. Figure 1 shows that the pooling design is more efficient than the traditional design for small $\sigma_p^2$, but that efficiency advantage erodes and eventually reverses as $\sigma_p^2$ gets larger. This agrees with two-sample t-test efficiency arguments of Van Domelen and others (2018).
Fig. 1.
Boxplots of log-odds ratio estimates for pooling and traditional designs (1000 trials each, true value = 0.15).
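The pool-allocation rule described above can be sketched as follows. The helper is hypothetical, and the ceiling formula is just one way to satisfy the stated target of roughly twice as many size-4 pools as singles.

```python
import math

def pool_allocation(n):
    """Split n like-status subjects into k pools of size 4 plus s singles,
    aiming for roughly twice as many size-4 pools as singles:
    k ~= 2s with s = n - 4k  =>  k = ceil(2n / 9)."""
    k = math.ceil(2 * n / 9)   # number of pools of size 4
    s = n - 4 * k              # remaining subjects left as singles
    return k, s

# With 343 cases this gives 77 pools of size 4 and 35 singles, i.e. 112
# assays in total versus 343 assays to measure every case individually.
k, s = pool_allocation(343)
```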
We performed additional simulations to address issues raised by reviewers; results are included in the supplementary material available at Biostatistics online. In particular, we re-ran the Table 3 simulations with a negative log-OR (supplementary Table 4) and a null log-OR (supplementary Table 5: type I error rates were roughly nominal), and assessed performance of the Gamma methods under misspecified error structures (supplementary Tables 6 and 7: estimation was robust to mean-1 Gamma and uniform errors).
4. Discussion
We have presented two Gamma-based methods for estimating the adjusted log-OR relating a binary outcome to a continuous exposure measured in pools and subject to errors. This work integrates the poolwise logistic regression approach of Weinberg and Umbach (1999) with the error modeling assumptions of Schisterman and others (2010) and the discriminant function ideas of Lyles and others (2015) and Whitcomb and others (2012). Accommodating skewed biomarkers should broaden the scope of scenarios where a highly cost-effective homogeneous pools study design can be utilized.
The homogeneous pools design is compelling because it offers potentially large gains in statistical power over a traditional design (e.g. Figure 1 when the PE variance is near zero). Absent errors, there would be no need to worry about the distribution of the biomarker; one could simply fit the Weinberg and Umbach (1999) logistic regression model. However, while assay ME may be negligible in certain scenarios, we believe negligible PE is a strong and seldom justifiable assumption. In our motivating example, the estimated PE variance was much larger than the estimated ME variance according to all four corrective methods, and together these errors were much too large to ignore (e.g. the naive logistic regression estimator was badly biased in the Table 3 simulations). Thus, performing valid inference with poolwise data will typically require error modeling. Our Gamma-based methods extend prior approaches to accommodate skewed biomarkers, which tend to be much more common than normally distributed ones.
Our methods are not limited to the pooling scenario for which they were developed. A special case of all four methods is a design in which every pool has size 1, i.e. a traditional design with no pooling. Our R functions apply to a wide range of scenarios for estimating exposure–disease associations: they can handle pooling or traditional designs with or without covariates, for a normal or skewed exposure measured precisely or with errors (additive or multiplicative), incorporating replicates if available, and either assuming a constant log-OR or allowing it to vary with exposure level and covariates.
One potential problem with the logistic regression methods is that they are based on a likelihood function that assumes prospective sampling. The outcome model (disease given exposure and covariates) is not problematic given the Prentice and Pyke (1979) results, but the exposure model (exposure given covariates) could be affected by case-oversampling. That is, even if the individual-level exposure models (linear regression for NLR, constant-scale Gamma for GLR) are correctly specified for the population, that relationship may not hold within cases and controls, and thus may not hold in a case–control study where the proportion of cases is far higher than in the population. Guolo (2008) suggests that using the prospective likelihood is valid if the specified distribution for the error-prone covariate (the exposure, in our framework) is correct under the case–control sampling scheme, which is intuitive. Specifying and assessing a model for an imperfectly measured exposure is typically one of the hardest parts of a ME correction (Carroll and others, 2006). But a unique feature of the pooling context is that if there is PE only, the singles are actually precisely measured, so the exposure model can be assessed directly with those data. Our logistic regression methods should therefore be valid in case–control studies, provided the exposure model is supported by the data on hand, which can be directly assessed in certain cases. The discriminant function methods are based on models for the exposure given case status and covariates, and are therefore unaffected by the sampling rates of cases and controls.
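With PE-only errors, the exposure-model check described above can be carried out directly on the singles. A minimal moment-based sketch follows; the data array is a stand-in and all parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
singles = rng.gamma(shape=4.0, scale=0.5, size=500)  # stand-in for precisely
                                                     # measured pools of size 1

# Method-of-moments Gamma fit: shape = mean^2 / var, scale = var / mean.
# Comparing the implied Gamma density (or its quantiles) against the
# empirical distribution of the singles is a direct check of the
# assumed exposure model.
m, v = singles.mean(), singles.var(ddof=1)
shape_hat, scale_hat = m * m / v, v / m
```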
For the same reason, the discriminant function methods can immediately accommodate covariate-dependent or “(y,c)-pooling” (Lyles and others, 2016), where pools are formed on similar covariate values in addition to like case status. This was noted by Lyles and others (2015), and the same result holds for our Gamma extension. The rationale is that members of a pool formed on the basis of their outcomes and covariates remain mutually independent given those outcomes and covariates, and that conditional independence is sufficient to justify the poolwise-sum results for the exposures. Adapting the logistic regression methods to accommodate (y,c)-pooling should be possible, although we leave this as future work. Alternatively, covariate-dependent pooling may fit nicely into the conditional logistic regression framework of Saha-Chaudhuri and others (2011), for which we are examining extensions to handle errors in a similar manner as our NLR and GLR approaches.
A natural problem for investigators considering pooling is how to choose design parameters, especially the pool size(s), whether to include replicates, and how to form pools. These are difficult questions. Even in the two-sample t-test setting, the optimal pool size (and number of pools needed for target power) is highly sensitive to the magnitudes of MEs and PEs and the relative cost to run each assay and to recruit each subject. We do not wish to make broad recommendations but will share that we currently favor a design with pools of size 1 (with replicates, if the assay has ME) and one other pool size. The larger pools drive efficiency gains; a single non-unity pool size avoids having to specify the relationship between pool size and PEs; and replicate singles isolate MEs, which stabilizes estimation of parameters including the log-OR of primary interest. Some of the more nuanced scenarios that permit identifiability may be useful for analyzing existing data, but we would not recommend leveraging them in the design of new studies. For example, Van Domelen and others (2018) reported that NLR and NDFA were somewhat unstable under a “two pool sizes, neither of which is 1” design with both error types and no replicates. As for deciding which method to use, AIC may be helpful for choosing normal vs. Gamma models. While the discriminant function methods are somewhat obscure compared with logistic regression, they may offer better precision when the relevant distributional assumptions are met (Lyles and others, 2009).
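AIC-based selection between the normal and Gamma variants can be sketched as follows; the log-likelihood values are illustrative only, and in practice would come from the fitted models.

```python
def aic(loglik, n_params):
    """Akaike information criterion: 2k - 2*logL, smaller is better."""
    return 2 * n_params - 2 * loglik

# Hypothetical maximized log-likelihoods for competing exposure models
# with the same number of fitted parameters.
candidates = {"normal": aic(-1052.3, 5), "gamma": aic(-1041.8, 5)}
best = min(candidates, key=candidates.get)   # model with the lowest AIC
```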
As a reviewer noted, the idea of replicates is somewhat paradoxical, as a pooling design might be chosen for the very purpose of reducing the number of assays that are required. This brings to mind issues of optimal design, which may warrant future work. So far, our impression from simulation studies is that a small number of replicate singles can greatly improve stability, likely justifying the additional costs (e.g. about 7% more assays in our motivating example).
While both ME and PE reduce the efficiency advantage of a pooling design over a traditional design, PE is particularly worrisome because it can render the pooling design counterproductive. In fact, this may have occurred in our motivating example: the GDFA model gave a large estimated PE variance, and in simulations mimicking the CPP data the pooling design was less efficient than the traditional design at that level of PE (Figure 1). Absent PE and ME, pooling designs offer gains in statistical efficiency limited only by the number of samples that can feasibly be combined in the lab. With PE, if the PE variance is large enough, the pooling design may be less efficient than the traditional design for the same number of assays, regardless of how large the pools are. Adaptive study designs could be considered, whereby a pooling study is initiated but a stopping rule is in place to transition to pools of size 1 if it becomes clear that the PE variance is prohibitively large. We note that pooling may be warranted regardless of PE in cases where it is needed to reach minimum assay volumes.
In future applied work, it will be valuable to search for ways to minimize PE and to determine whether certain types of biospecimens (blood, saliva, etc.) are more or less susceptible to it. On the statistical side, the assumption that the PE variance is constant across pool sizes needs to be vetted and perhaps modified, as it seems likely that larger pools would have larger errors; this is a key assumption that directly affects identifiability requirements and efficiency results. Additionally, it would be useful to develop less parametric approaches for improved robustness, ideally relaxing the distributional assumptions on the errors and avoiding the need to specify the distribution of the exposure given covariates.
In summary, the two Gamma approaches presented here, in conjunction with the normal versions previously developed (Van Domelen and others, 2018), permit valid odds ratio estimation with a normal or skewed biomarker measured in pools and subject to errors. These methods help to quell an important concern associated with the Weinberg and Umbach (1999) homogeneous pools design, making this very cost-effective design more feasible to deploy.
Supplementary Material
Acknowledgments
Conflict of Interest: None declared.
Funding
This research was supported by the Intramural Research Program of the Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, MD, USA. This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE-0940903. The views expressed in this article are those of the authors, and no official endorsement by the Department of Health and Human Services, or the Agency for Healthcare Research and Quality, or the National Science Foundation, is intended or should be inferred.
References
- Carroll, R. J., Ruppert, D., Stefanski, L. A. and Crainiceanu, C. M. (2006). Measurement Error in Nonlinear Models: A Modern Perspective. CRC Press, Boca Raton, FL. [Google Scholar]
- Carroll, R. J., Spiegelman, C. H., Lan, K. K. G., Bailey, K. T. and Abbott, R. D. (1984). On errors-in-variables for binary regression models. Biometrika 71, 19–25. [Google Scholar]
- Cornfield, J. (1962). Joint dependence of risk of coronary heart disease on serum cholesterol and systolic blood pressure: a discriminant function analysis. Federation proceedings 21, 58–61. [PubMed] [Google Scholar]
- Gilbert, P. and Varadhan, R. (2016). numDeriv: Accurate Numerical Derivatives. R package version 2016.8-1. https://CRAN.R-project.org/package=numDeriv. [Google Scholar]
- Guolo, A. (2008). A flexible approach to measurement error correction in case–control studies. Biometrics 64, 1207–1214. [DOI] [PubMed] [Google Scholar]
- Hardy, J. B. (2003). The collaborative perinatal project: lessons and legacy. Annals of Epidemiology 13, 303–311. [DOI] [PubMed] [Google Scholar]
- Lyles, R. H., Guo, Y. and Hill, A. N. (2009). A fresh look at the discriminant function approach for estimating crude or adjusted odds ratios. The American Statistician 63, 320–327. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lyles, R. H. and Kupper, L. L. (2013). Approximate and pseudo-likelihood analysis for logistic regression using external validation data to model log exposure. Journal of Agricultural, Biological, and Environmental Statistics 18, 22–38. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lyles, R. H., Mitchell, E. M., Weinberg, C. R., Umbach, D. M. and Schisterman, E. F. (2016). An efficient design strategy for logistic regression using outcome-and covariate-dependent pooling of biospecimens prior to assay. Biometrics 72, 965–975. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lyles, R. H., Van Domelen, D., Mitchell, E. M. and Schisterman, E. F. (2015). A discriminant function approach to adjust for processing and measurement error when a biomarker is assayed in pooled samples. International Journal of Environmental Research and Public Health 12, 14723–14740. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Mitchell, E. M., Lyles, R. H., Manatunga, A. K., Danaher, M., Perkins, N. J. and Schisterman, E. F. (2014). Regression for skewed biomarker outcomes subject to pooling. Biometrics 70, 202–211. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Mitchell, E. M., Lyles, R. H. and Schisterman, E. F. (2015). Positing, fitting, and selecting regression models for pooled biomarker data. Statistics in Medicine 34, 2544–2558. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Narasimhan, B., Johnson, S. G., Hahn, T., Bouvier, A., and Kiêu, K. (2018). cubature: Adaptive Multivariate Integration over Hypercubes. R package version 2.0.3. https://CRAN.R-project.org/package=cubature. [Google Scholar]
- Prentice, R. L. and Pyke, R. (1979). Logistic disease incidence models and case-control studies. Biometrika 66, 403–411. [Google Scholar]
- R Core Team. (2019). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. [Google Scholar]
- Saha-Chaudhuri, P., Umbach, D. M. and Weinberg, C. R. (2011). Pooled exposure assessment for matched case-control studies. Epidemiology (Cambridge, Mass.) 22, 704. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Schisterman, E. F., Vexler, A., Mumford, S. L. and Perkins, N. J. (2010). Hybrid pooled–unpooled design for cost-efficient measurement of biomarkers. Statistics in Medicine 29, 597–613. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Van Domelen, D. R. (2019). pooling: Fit Poolwise Regression Models. R package version 1.1.2. https://github.com/vandomed/pooling. [Google Scholar]
- Van Domelen, D. R., Mitchell, E. M., Perkins, N. J., Schisterman, E. F., Manatunga, A. K., Huang, Y. and Lyles, R. H. (2018). Logistic regression with a continuous exposure measured in pools and subject to errors. Statistics in Medicine 37, 4007–4021. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Weinberg, C. R. and Umbach, D. M. (1999). Using pooled exposure assessment to improve efficiency in case-control studies. Biometrics 55, 718–726. [DOI] [PubMed] [Google Scholar]
- Weinberg, C. R. and Umbach, D. M. (2014). Correction to “Using pooled exposure assessment to improve efficiency in case-control studies,” by Clarice R. Weinberg and David M. Umbach; 55, 718–726, September 1999. Biometrics 70, 1061. [DOI] [PubMed] [Google Scholar]
- Whitcomb, B. W., Perkins, N. J., Zhang, Z., Ye, A. and Lyles, R. H. (2012). Assessment of skewed exposure in case-control studies with pooling. Statistics in Medicine 31, 2461–2472. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Whitcomb, B. W., Schisterman, E. F., Klebanoff, M. A., Baumgarten, M., Rhoton-Vlasak, A., Luo, X. and Chegini, N. (2007). Circulating chemokine levels and miscarriage. American Journal of Epidemiology 166, 323–331. [DOI] [PubMed] [Google Scholar]