Abstract
Pooled study designs, where individual biospecimens are combined prior to measurement via a laboratory assay, can reduce lab costs while maintaining statistical efficiency. Analysis of the resulting pooled measurements, however, often requires specialized techniques. Existing methods can effectively estimate the relation between a binary outcome and a continuous pooled exposure when pools are matched on disease status. When pools are of mixed disease status, however, the existing methods may not be applicable. By exploiting characteristics of the gamma distribution, we propose a flexible method for estimating odds ratios from pooled measurements of mixed and matched status. We use simulation studies to compare consistency and efficiency of risk effect estimates from our proposed methods to existing methods. We then demonstrate the efficacy of our method applied to an analysis of pregnancy outcomes and pooled cytokine concentrations. Our proposed approach contributes to the toolkit of available methods for analyzing odds ratios of a pooled exposure, without restricting pools to be matched on a specific outcome.
Keywords: Biomarkers, Gamma distribution, Odds ratio, Pooled specimens, Skewness
1 Introduction
Pooling biospecimens prior to performing laboratory assays is a design strategy in which individual samples are grouped, either randomly or by some criteria, and then physically combined to create a set of pooled biological samples which are then assayed. This technique can be particularly advantageous in reducing lab costs, by reducing the number of tests required to obtain information from all available specimens. Pooling is commonly used as a cost-saving tool in screening for disease (e.g., screening donated blood for HIV), but also has utility when continuous biomarkers are assessed for epidemiological studies (Kacena et al., 1998; Pilcher et al., 2005). Strategic pooling designs have been shown to promote statistical efficiency, particularly when compared with similar designs that select the same number of individual assays for analysis (Weinberg and Umbach, 1999; Liu and Schisterman, 2003; Mumford et al., 2006; Ma et al., 2011; Schisterman et al., 2011; Malinovsky et al., 2012; Saha-Chaudhuri and Weinberg, 2013; Heffernan et al., 2014). In certain situations assays may require a minimum volume for analysis that is impractical or impossible to obtain from single specimens. Pooling multiple biospecimens increases the pooled sample volume to attain that minimum necessary quantity for the assay, yielding usable measurements where none would have been available by conventional strategies.
When pools are matched based on case status, application of common statistical tools such as logistic regression models or receiver operating characteristic curves is straightforward (Weinberg and Umbach, 1999; Schisterman et al., 2005; Saha-Chaudhuri et al., 2011; Whitcomb et al., 2012). However, few techniques are available for the analysis of pooled case-control data when pools are “mixed” based on case status (Zhang and Albert, 2011). Explicitly, a mixed pool is composed of specimens from individuals where at least one member is of each disease status. This scenario could occur, for instance, if pools are formed based on a primary outcome, but analysis is later conducted on a secondary outcome. In the secondary analysis, the existing pools are unlikely to be perfectly matched with respect to this secondary outcome.
In this study, we propose and examine the statistical properties of a technique for analyzing mixed pool data from a case-control study, specifically, assessing the relation between a continuous exposure and disease status. The development of methods to analyze such mixed pools is motivated by an ancillary study of the Collaborative Perinatal Project to assess the association between cytokine levels and pregnancy outcomes. Using specimens obtained from this study, pools of size g = 2 matched on spontaneous abortion (SA) status were formed and cytokines were measured, facilitating a cost-effective strategy to estimate the odds ratio of cytokines on SA status. However, there was additional interest in assessing the relation between cytokine levels and the secondary outcome of low birth weight (LBW). Since pooling was conducted without regard for LBW, a considerable proportion of mixed pools with respect to LBW were formed. We are unaware of any readily applicable methods for analyzing such mixed pools of a highly skewed biomarker, implying that these pools (approximately 50% in this example) might be rendered useless for assessing cytokine relations with LBW based on existing methodology.
In the next section, we summarize the theoretical basis for estimating the odds ratio of interest based on disease-specific gamma distributions of the exposure when pools are mixed. In Section 3, we assess the performance of our methods that utilize mixed pools and compare these estimates to those that would have been available from only matched pools or a random sample of the same size. Finally, we apply these methods to the Collaborative Perinatal Project ancillary study on the relation between maternal cytokine levels and LBW.
2 Methods
Let OR(x) denote the odds ratio of interest for a disease (D) corresponding to a 1-unit increase in the exposure (X), such that:
(1) |
When this odds ratio is assumed to be constant for all values of X, it can easily be estimated from the full, individual data by performing logistic regression. For pooled data, Weinberg and Umbach (1999) showed that when pools are matched by case status, it is relatively straightforward to estimate Eq. (1) via a set-based logistic regression model. When pools are mixed with respect to case status, these mixed pools cannot be accommodated under the existing set-based logistic regression model, and thus must be omitted from the analysis. When pools are formed initially based on a primary outcome, the status of the pools with respect to a secondary outcome is highly dependent on the correlation between the primary and secondary outcome. Figure 1 illustrates a pooling strategy in which individual samples are grouped in pools of g = 2 matched on a primary outcome (O1). It is easy to see that for a secondary outcome (O2), even one that is highly correlated with O1, these original groupings would lead to both matched and mixed pools for O2. Omitting the mixed pools and relying on existing matched pool methods would greatly reduce sample size, information, and our ability to investigate a relation between the biomarker and O2.
Zhang and Albert (2011) developed a general method for performing binary regression with a pooled exposure, which can utilize pools of mixed case status; however, their methods assume an approximately normal distribution for the exposure, which may not apply to many highly skewed biomarkers. Whitcomb et al. (2012) proposed a flexible methodology for constant and nonconstant odds ratio estimates for individual specimens and matched pools, with particular focus on exposures that are well characterized by a gamma distribution. Here, we apply the convenient properties of the gamma distribution to develop an accessible and flexible procedure to estimate constant and nonconstant odds ratios, by extending the existing methods of Whitcomb et al. (2012) to incorporate all types of pools.
Let represent the measured value of the i-th pool, which may contain specimens from both case and control populations. Continuous measurements of the pool are considered the average of the measurements of the individual biospecimens in each pool. In other words, Yi is assumed to be the average of the individual Xil, a reasonable assumption given that most assays are per unit of volume. Applying Bayes’ theorem, Eq. (1) can be rewritten as:
(2) |
where f (x|D, θ1) and represent the exposure densities for cases (D) and controls , respectively, that depend on some unknown parameters θ0 and θ1. Equation (2) can be evaluated by assuming conditional densities and estimating the unknown parameters θ0 and θ1. Since individual level measurements, xil, of the exposure values are not available, maximum likelihood estimates of θ0 and θ1 must be obtained by maximizing the log-likelihood for the pooled exposures:
(3) |
where fY (Yi|θ0, θ1) denotes the density of the measured value of the i-th pool. Suppose that the first mi, mi ≤ gi, specimens contained in this pool are from cases, and the remaining gi − mi specimens are from controls. Then, fY (Yi|θ0, θ1) is characterized by the multidimensional integral as follows:
(4) |
where dX indicates integrating sequentially with respect to each of the members of the i-th pool, the xil, l = 1, …, gi. Due to the complicated nature of this density function, calculating maximum likelihood estimator (MLEs) based on direct optimization of the log-likelihood can be highly impractical. Through prudent choice of f(x|θ), Eq. (4) can be greatly simplified. When the conditional density of the exposure given disease status is normal, for instance, the pool measurement retains a normal distribution due to summation properties of this distribution. Subsequent optimization of the log-likelihood would be straightforward, and closed-form OR estimates could be obtained using a minor variant of normal-theory discriminant function analysis (Hosmer et al., 2013).
Many biomarkers, however, exhibit a positive, right-skewed distribution, which may be better modeled by a lognormal or gamma distribution. Lognormal is often conveniently chosen in analysis of individual level data because of the resulting normality after a log transformation. Since the log of the average is not equal to the average of the logs, the log transformation is problematic with pooled data. Alternatively, the gamma distribution is particularly compelling in this situation since it requires no transformation and benefits from similar distributional preservation properties as the normal distribution.
Suppose the exposure is a gamma random variable such that XD Gamma(α1, β1) for cases and for controls, where α represents the shape parameter and β the rate. Then, as Whitcomb et al. (2012) showed, the odds ratio of interest Eq. (2) can be written as:
(5) |
Note that the OR(x) in Eq. (5) is a function of x, when α0 ≠ α1, that reduces to a constant when α0 = α1.
To estimate MLEs for θ = {α1, β1, α0, β0}, let us partition our pooled measurements into three groups with respect to the disease status of the contributors: (i) matched pools of cases, (ii) matched pools of controls , (iii) mixed pools with m cases and g − m controls, , where 0 < m < g. The distributions of gYD and have previously been exploited in the pooling literature (Schisterman et al., 2005; Whitcomb et al., 2012). Applying the same summation property of gamma distributions, we can reduce Eq. (4) to a much more manageable likelihood by collapsing the cases and controls within a mixed pool such that and . The density of gYM can then be written as:
(6) |
where f (t|α, β) = Γ(α)−1βαtα−1 exp(−tβ) is the gamma density. Note that if β0 = β1, that is the rate parameters are identical for cases and controls, this density reduces further to , since equivalence of rate parameters is sufficient to preserve the gamma distribution. The likelihood in Eq. (3) can now be written as follows:
(7) |
Since the contribution of each pool to the likelihood contains at most a one-dimensional integral, it is fairly straightforward to apply numerical integration and optimization techniques to Eq. (7) in order to estimate the MLEs θ = {α1, β1, α0, β0} that characterize the disease-specific distributions of the exposure. Once calculated, these MLEs can then be applied to Eq. (5) in order to estimate the odds ratio of interest.
A natural concern is the potential dependence in Y induced when the pools are created based on O1. In the Appendix, we develop the likelihood (A.1) given Y and disease status for O1 and O2 and show how the resulting parameter estimates can be used to estimate OR(x) similar to Eq. (5). We will continue to focus throughout on the simplified approach employing Eqs. (7) and (5) since it is consistent and comparatively more efficient for estimating OR(x) for O2 than the fuller (A.1) and (A.3).
3 Simulation
Extensive simulation studies were conducted to evaluate the effects of mixed pooling on estimation of the odds ratio for a gamma-distributed exposure. Five thousand simulations were run in R for each scenario, where XD was generated from a gamma distribution with various values of shape and scale parameters(α1, β1), and was generated from a gamma distribution such that .
An equal number of diseased and nondiseased with respect to the primary outcome were simulated based on the total sample size (N). The number of diseased and nondiseased for the secondary outcome was then generated based on the specified proportions a and d (see Fig. 2). n pools were then formed with equal size g matched on the primary outcome, resulting in both matched and mixed pools with respect to the secondary outcome. Established methods are limited to using only the matched pools for data analysis (Weinberg and Umbach, 1999; Whitcomb et al., 2012). Log-odds ratio estimates for the secondary outcome corresponding to a 1-unit increase in exposure (X) were calculated based on these existing methods using only the matched pools (denoted “Matched Pools” in results tables). Since the gamma-distributed exposure can imply nonconstant odds ratios, logOR estimates are calculated at the first and third quartiles of X. We then compare these estimates to those that we can obtain from the entire set of pools (mixed and matched) based on our proposed analytical methods outlined in Section 2. Results from this analysis are subsequently labeled “Mixed Pools”.
Log-odds ratio estimates were calculated from Eq. (5), where MLEs of θ = {α1, β1, α0, β0} were estimated by optimizing the log-likelihood Eq. (7) using R’s optim function. Each contribution to the likelihood was defined in accordance with the type of pooling or sampling, and the density for mixed pools was evaluated using R’s numerical integration function integrate. Standard errors for the Wald 95% confidence intervals (CIs) log(OR) were calculated in straightforward fashion based on components estimated in the Hessian matrix.
Various parameters could potentially impact the proportion of mixed pools, and subsequently, the performance of the proposed estimation procedure. In particular, a strong positive correlation between O1 and O2 can result in a greater number of pools matched with respect to O2, improving performance of the estimation procedure for the O2 odds ratio. A weak correlation or a large g, however, can result in mostly mixed pools, which will almost certainly reduce precision of estimates. Figure 2 provides a visualization of these potential correlations between the two outcomes, where the effect of various a and d are tested in the simulation studies. In addition, Table 1 gives the distribution of case pools, control pools, and mixed pools from each of the simulation studies. On average, when a = 0.2 and d = 0.9, only five case pools were available for analysis. Consequently, analytical methods that exclude mixed pools can be highly unstable due to the degree of partial separation.
Table 1.
aa | db | Average number
|
||
---|---|---|---|---|
Case pools | Control pools | Mixed pools | ||
0.7 | 0.7 | 58 | 58 | 84 |
0.9 | 0.9 | 82 | 82 | 36 |
0.2 | 0.9 | 5 | 145 | 50 |
Proportion of individuals with a primary outcome that also have the secondary outcome.
Proportion of individuals negative for a primary outcome and the secondary outcome.
3.1 Nonconstant odds ratio
First, we consider the scenario when odds ratios are not constant across different values of the exposure. In this scenario, a nonconstant odds ratio has different shape parameters of the gamma distributions of the cases and controls, that is α0 ≠ α1. This characteristic has been shown to apply to the cytokine studied in the motivating example (Whitcomb et al., 2012). Relative bias, 95% CI coverage, and root mean square error (RMSE) of logOR estimates from these simulations are provided in Table 2 at various levels of a, d, and θ. Mixed and Matched columns of Table 2 display, respectively, the proposed method applied to mixed pools versus the method developed by Whitcomb et al. (2012) for matched pools. LogOR results are displayed at the first and third quartiles of the healthy population, , in each simulation, denoted by Q1 and Q3, respectively. As evidenced by these results, estimates from mixed pools calculated under the proposed method far outperform those from existing methodology for a nonconstant odds on only matched pools. Analysis using only matched pools performs particularly poorly when a = 0.2 and d = 0.9, due to the limited number of diseased matched pools (see Table 1). In fact, less than 62% of the simulations under this scenario contained enough matched pools of each type to provide reliable estimates. Analysis on mixed pools, however, was able to utilize all pools, not just those matching by chance on O2, providing reliable estimates for over 99% of the simulations (Table 3).
Table 2.
A | d | α1 | β1 | Relative bias (95% CI coverage)
|
RMSE
|
|||
---|---|---|---|---|---|---|---|---|
Mixed | Matchedb | Mixed | Matched | |||||
0.7 | 0.7 | 1.5 | 0.5 | logOR(Q1) = 1.249 | 0.024 (95.0) | 0.060 (95.3) | 0.298 | 0.376 |
logOR(Q3) = 0.772 | 0.042 (95.0) | 0.056 (95.3) | 0.146 | 0.157 | ||||
2 | 0.5 | logOR(Q1) = 1.999 | 0.019 (95.4) | 0.051 (95.3) | 0.374 | 0.517 | ||
logOR(Q3) = 1.043 | 0.038 (95.0) | 0.054 (95.1) | 0.160 | 0.189 | ||||
0.9 | 0.9 | 1.5 | 0.5 | logOR(Q1) = 1.249 | 0.030 (95.0) | 0.043 (95.8) | 0.279 | 0.304 |
logOR(Q3) = 0.772 | 0.038 (95.3) | 0.042 (95.4) | 0.124 | 0.128 | ||||
2 | 0.5 | logOR(Q1) = 1.999 | 0.021 (94.6) | 0.035 (94.9) | 0.369 | 0.426 | ||
logOR(Q3) = 1.043 | 0.029 (95.0) | 0.036 (95.3) | 0.143 | 0.154 | ||||
0.2 | 0.9 | 1.5 | 0.5 | logOR(Q1) = 1.249 | 0.091 (94.5) | 0.440 (96.5) | 0.484 | 1.239 |
logOR(Q3) = 0.772 | 0.039 (94.4) | 0.337 (98.8) | 0.137 | 0.746 | ||||
2 | 0.5 | logOR(Q1) = 1.999 | 0.076 (94.6) | 0.145 (96.4) | 0.626 | 1.124 | ||
logOR(Q3) = 1.043 | 0.046 (95.0) | 0.436 (97.5) | 0.179 | 0.923 |
Healthy population assumed gamma distribution with α0 = 1 and β0 = 1 for all simulations.
Estimators based on methods relying solely on matched pools after omitting mixed pools.
Table 3.
a | d | α1 | β1 | % Convergence
|
|
---|---|---|---|---|---|
Mixed | Matched | ||||
0.7 | 0.7 | 1.0 | 0.5 | 99.2 | 100.0 |
1.5 | 0.5 | 99.8 | 100.0 | ||
2.0 | 0.5 | 99.8 | 99.9 | ||
0.9 | 0.9 | 1.0 | 0.5 | 99.7 | 99.9 |
1.5 | 0.5 | 99.9 | 99.9 | ||
2.0 | 0.5 | 99.9 | 100.0 | ||
0.2 | 0.9 | 1.0 | 0.5 | 99.1 | 87.0 |
1.5 | 0.5 | 99.8 | 76.0 | ||
2.0 | 0.5 | 99.5 | 61.9 |
As expected, the RMSE under mixed pools (Table 2) is lowest when the correlation between O1 and O2 is strong (a = 0.9, d = 0.9), since this strong correlation results in a greater number of pools matched on the secondary outcome. When O1 and O2 are weakly correlated (a = 0.2, d = 0.9), RMSE for all methods is noticeably larger, due to a small number of matched pools with O2. Even in this scenario, however, the analysis under mixed pooling performs surprisingly well. In general, additional simulations suggested similar trends for various values of n, g, and N, where an increase in the number of assays (n) led to reduced bias and lower RMSE values.
3.2 Constant odds ratio
In some cases, the odds ratio may be constant across different values of the exposure, meaning that the shape parameters from the gamma distributions characterizing the cases and controls are equal, α0 = α1 (Whitcomb et al., 2012). In such cases, it will be more efficient to calculate a joint MLE for the shape parameter, instead of calculating this parameter separately for cases and controls. Tables 4 and 5 demonstrate the potential improvements in results when a constant odds ratio is correctly assumed. As evidenced by these results, correctly assuming equal shape parameters noticeably improves CI coverage and reduces RMSE for all situations. Note that assuming constant odds ratios and analyzing only matched pools can also be analyzed via the method proposed by Weinberg and Umbach (1999), column 2 of Tables 4 and 5. While their method is flexible in that it does not require distributional assumptions on the exposure, it is limited to analyzing only matched pools, reducing the available information and precision of the resulting estimates when a substantial number of pools are mixed.
Table 4.
a | d | Relative bias (95% CI coverage)
|
||||
---|---|---|---|---|---|---|
Constant oddsb
|
Nonconstant oddsc
|
|||||
Mixed | Matchedd | Mixed | Matched | |||
0.7 | 0.7 | logOR(Q1) = 0.5 | 0.031 (94.8) | 0.043 (95.1) | 0.020 (90.5) | 0.050 (95.8) |
logOR(Q3) = 0.5 | 0.051 (90.1) | 0.061 (95.5) | ||||
0.9 | 0.9 | logOR(Q1) = 0.5 | 0.029 (95.1) | 0.030 (95.5) | 0.028 (92.8) | 0.039 (95.3) |
logOR(Q3) = 0.5 | 0.037 (92.8) | 0.040 (95.4) | ||||
0.2 | 0.9 | logOR(Q1) = 0.5 | 0.001 (95.7) | 0.007 (98.4) | 0.137 (91.3) | 1.096 (96.6) |
logOR(Q3) = 0.5 | 0.003 (93.9) | −0.124(99.7) |
Healthy and diseased populations assumed gamma distributions with α0 = 1, β0 = 1 and α1 = 1, β1 = 0.5 for all simulations with Q1 and Q3 denoting first and third quartiles of the healthy population.
Methods applied assume constants odds.
Methods applied allow for nonconstant odds.
Method applied to only matched pools as developed by Weinberg and Umbach (1999).
Table 5.
a | d | RMSE
|
||||
---|---|---|---|---|---|---|
Constant oddsb
|
Nonconstant oddsc
|
|||||
Mixed | Matchedd | Mixed | Matched | |||
0.7 | 0.7 | logOR(Q1) = 0.5 | 0.124 | 0.136 | 0.214 | 0.236 |
logOR(Q3) = 0.5 | 0.142 | 0.148 | ||||
0.9 | 0.9 | logOR(Q1) = 0.5 | 0.104 | 0.110 | 0.188 | 0.197 |
logOR(Q3) = 0.5 | 0.115 | 0.117 | ||||
0.2 | 0.9 | logOR(Q1) = 0.5 | 0.117 | 0.370 | 0.335 | 1.199 |
logOR(Q3) = 0.5 | 0.124 | 0.681 |
Healthy and diseased populations assumed gamma distributions with α0 = 1, β0 = 1 and α1 = 1, β1 = 0.5 for all simulations with Q1 and Q3 denoting first and third quartiles of the healthy population.
Methods applied assume constants odds.
Methods applied allow for nonconstant odds.
Method applied to only matched pools as developed by Weinberg and Umbach (1999).
To test the constant odds assumption, we apply a likelihood ratio (LR) test with null hypothesis α0 = α1. Table 6 gives the Type I Errors of the LR test for the simulation studies in this section. When the proposed analytical strategies for mixed pools are applied, nominal 5% and 1% error rates are exhibited. When only matched pools are included in the analysis, however, Type I Error rates can exceed the 5% and 1% nominal values as a decreases and d = 0.9, due to the smaller number of matched case pools. Specifically, a matched pools strategy results in nominal rates when d = 0.9 from a = 0.9 to 0.6, however, Type I Errors continually increase thereafter for a = 0.5 to almost doubling at a = 0.2. Thus, the ability to include all pools in the analysis can improve accuracy in identifying a constant odds ratio, which can subsequently result in more appropriate and efficient log-odds ratio estimates.
Table 6.
a | d | LR test Type I Error Rate
|
|||
---|---|---|---|---|---|
0.05
|
0.01
|
||||
Mixed | Matched | Mixed | Matched | ||
0.7 | 0.7 | 0.050 | 0.048 | 0.009 | 0.009 |
0.9 | 0.9 | 0.047 | 0.048 | 0.010 | 0.009 |
0.8 | 0.9 | 0.043 | 0.044 | 0.008 | 0.008 |
0.7 | 0.9 | 0.053 | 0.053 | 0.013 | 0.012 |
0.6 | 0.9 | 0.051 | 0.056 | 0.012 | 0.013 |
0.5 | 0.9 | 0.055 | 0.059 | 0.014 | 0.014 |
0.4 | 0.9 | 0.052 | 0.062 | 0.011 | 0.015 |
0.3 | 0.9 | 0.048 | 0.071 | 0.010 | 0.020 |
0.2 | 0.9 | 0.054 | 0.087 | 0.010 | 0.018 |
4 Data analysis
Various cytokines were measured in stored samples from the Collaborative Perinatal Project to investigate associations with SA (Whitcomb et al., 2008). “Matched” pools of size g = 2 were formed within SA status. The Fluorokine MAP Multiplex Human Cytokine Panel A detection system (R&D Systems, Inc., Minneapolis, MN, USA) and the Luminex 100IS platform (Luminex Corporation, Austin, Texas) was used to measure cytokine levels in the pooled serum. Nonspecificity was accounted for with “blank” subtracted calibration curves and samples were randomly dispersed within the 96-well plates with respect to SA. The goal of our analysis is to investigate the relation between the secondary outcome LBW, defined by birth weight less than 2500 g, and the same cytokines. In particular, we are interested in the association between LBW and Interferon gamma-inducible protein (IP). In general, biomarkers, cytokines, and in this case IP, often have right-skewed distributions that can be well characterized by disease-specific gamma distributions (Whitcomb et al., 2012).
Since birth weight is an outcome conditional on live birth, only biospecimens from controls were available for the secondary analysis on LBW. Subsequently, IP measurements on 177 pools were available. One hundred and forty-six pooled biospecimens had both normal weight babies, 29 pooled biospecimens had a mixed outcome of a normal weight baby and an LBW baby, and two pooled biospecimens were from LBW babies. Since LBW was not considered in the initial formation of the pools and with a 9.3% prevalence in our sample (33 out of 354 women), only two of the pools in this sample were matched randomly with respect to LBW. The measurement of IP from the pooled specimens ranged from 0.0001 to 0.109, where these measurements are assumed to be the average of the measurements that would have been observed on the individual specimens. Figure 3 illustrates the distribution of the pooled measurements across each pool type.
Focusing on the columns of matched pools in Fig. 3, circles and crosses, it is easy to see that analysis excluding the middle column of mixed pools, circle-cross, greatly reduces the overall sample size. Moreover, such a small number of matched case pools preclude the estimation of ORs, constant or nonconstant, using existing analytical methods (logistic regression and MLEs, respectively) that can only accept matched pools.
Utilizing the mixed pools based in our proposed method can investigate IP’s relation with LBW. Assuming gamma distributions and maximizing Eq. (7), we estimated distribution parameters from the full set of pools, with and without the α0 = α1 constraint. Log-odds are calculated to reflect the logOR of a 0.01 unit increase in IP. Estimates and CIs of the constant and nonconstant logOR are provided in Table 7, at various quartiles (Q1–Q3). While neither result is significant, the nonconstant, ORs of 1.37, 1.19, and 1.10, are suggestive of a potential relation between IP and LBW at lower levels that can be further investigated if additional data become available. Alternatively, relying on previous methods and requiring matched pools considerably reduces sample size to and nD = 2. These analyses are unstable at best and surely unreliable.
Table 7.
IP level | logOR
|
|
---|---|---|
Constant | Nonconstant | |
Q1 | 0.000 (−0.045, 0.045) | 0.312 (−0.208, 0.832) |
Q2 | 0.000 (−0.045, 0.045) | 0.173 (−0.128, 0.473) |
Q3 | 0.000 (−0.045, 0.045) | 0.097 (−0.084, 0.279) |
5 Conclusions
In this study, we developed methods to estimate odds ratios when an exposure is measured on pools that are not matched on case status. We expanded on existing methods requiring matched pools and demonstrated the efficacy of our proposed analysis via a simulation study and an auxiliary study of the Collaborative Perinatal Project.
Pooling is an attractive design strategy to alleviate fiscal or practical limitations on analyzing individual level biospecimens while answering important research questions (Heffernan et al., 2014). Previous methods that focus on a primary outcome have been shown to be reliable and efficient when pools are matched based on this status (Weinberg and Umbach, 1999; Liu and Schisterman, 2003; Mumford et al., 2006; Saha-Chaudhuri et al., 2011; Whitcomb et al., 2012). The analytical procedures proposed here provide a reliable and effective method to alleviate the matched pool constraint and incorporate all pooled measurements, whether mixed or matched on outcome status, into calculation of gamma distribution parameters and constant or nonconstant OR estimates.
As illustrated in the simulation studies, the efficiency of the OR estimates of a secondary outcome will depend largely on the degree to which the pools are matched on the secondary outcome or the correlation with the primary outcome. Of course, OR estimates will generally gain efficiency when more assays are performed (n) relative to the total sample size (N). This potential precision gain is often limited, however, by budgetary constraints or other factors that initially motivated pooling on the primary outcome. Relying solely on matched pools also resulted in inflated Type I Error as prevalence of a secondary outcome decreased, while our mixed pool methodology performed nominally. As demonstrated in the data analysis, the proposed methods employing mixed pools was the only means to enable analysis of a secondary outcome because too few matched case pools were created to use either the existing approaches by Weinberg and Umbach (1999) or Whitcomb et al. (2012).
The gamma distributional assumption was the key in estimation of distribution parameters and the OR estimate when X is continuous and right-skewed in our developments. Additional distributions could have been assumed for X, such as a lognormal distribution, but subsequent analysis would be considerably more difficult for larger pool sizes (g > 2). The gamma distribution, in addition to enjoying much simpler computation, has previously been demonstrated to be effective at characterizing right-skewed variables (Firth, 1988; Whitcomb et al., 2012; Mitchell et al., 2015). For nonskewed continuous biomarkers where normality might be assumed on X, similar techniques can be adapted from the developments in Section 2. Another assumption of the strategies presented here is that the secondary outcome is conditionally independent of the primary outcome. Given the exposure of interest, a straightforward extension to the likelihood is provided in the Appendix that can help assess this assumption and facilitate additional analyses.
Regardless of motivation for the implementation of a pooling design, statistical techniques for pooled data must continually develop and attempt to mirror those for individual level data. As our example demonstrated, utilizing mixed pools, as a consequence of being a secondary outcome or otherwise, is vital to analysis of pooled data. The flexible framework of our extension of case-control methods to utilize the valuable information in pools of mixed outcome status is an important next step to the broader implementation of pooling designs in practice.
Acknowledgments
This research was supported by the Intramural Research Program of the Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, Maryland.
Appendix
Assume individual level biospecimens, X, are grouped into pools of size g based on the status of a primary binary outcome, O1, and then measured, Y. Then assume a known secondary outcome, O2, which may or may not be associated with O1 but was not considered in the creation of the pools. Let D1 = j and D2 = k be the disease status for O1 and O2, respectively, where j = 0,1, k = 0,1 and 1 and 0 indicating with and without the disease, respectively. Let the parameters governing the distribution of X given D1 and D2 be denoted by θjk. Also, let m1 and m2 be the number of individuals within a pool with diseases O1 and O2, respectively, noting that m1 = 0 or g by design.
Under a Gamma distribution, the likelihood (similar to Eq. (7)), conditional on Y, m1, and m2, can now be written as:
(A.1) |
where log fY (giYi|α11, β11, α10, β10, m2i) is defined as in Eq. (7). Maximizing Eq. (A.1) yields MLEs for the shape and rate parameters θ = {α11, β11, ⋯, α00, β00} that characterize the four disease-specific distributions of the exposure. Once calculated, these MLEs can be applied to Eq. (5) in order to estimate the odds ratio for a given status for D1 as follows:
(A.2) |
Table A.1.
a | d | α1 | β1 | Relative bias
|
RMSE
|
|||
---|---|---|---|---|---|---|---|---|
With O1c | Without O1d | With O1c | Without O1d | |||||
0.7 | 0.7 | 1.5 | 0.5 | logOR(Q1) = 1.249 | 0.026 | 0.018 | 0.352 | 0.306 |
logOR(Q3) = 0.772 | 0.013 | 0.047 | 0.156 | 0.156 | ||||
0.9 | 0.9 | 1.5 | 0.5 | logOR(Q1) = 1.249 | 0.024 | 0.024 | 0.304 | 0.294 |
logOR(Q3) = 0.772 | 0.015 | 0.040 | 0.132 | 0.135 | ||||
0.2 | 0.9 | 1.5 | 0.5 | logOR(Q1) = 1.249 | 0.080 | 0.077 | 0.521 | 0.473 |
logOR(Q3) = 0.772 | 0.043 | 0.036 | 0.189 | 0.140 |
To achieve the relation of interest, OR(x) for D2 not conditioned on D1, we simply sum over D1
(A.3) |
where
While this formulation is not reducible in the way that resulted in Eq. (A.2), it is easy to calculate given parameter estimates and a cross tabulation of D1 and D2. Furthermore, note that if D1 and D2 are conditionally independent given X, then Eq. (A.2) reduces to Eq. (2) since
and every component containing D1 cancels. A brief simulation similar to that displayed in Table 2 demonstrates that estimating logOR(x) from Eq. (A.3) achieves similar relative bias (Table A1) to estimates from Eq. (5), which ignores O1 in the likelihood. Table A1 displays a simulated comparison of logOR(x) estimates for O2 from Eq. (A.3) (“with O1” column) and Eq. (5) (“without O1” column), corresponding to likelihoods which do and do not consider D1 in the likelihood, respectively. Predictably, a loss of efficiency results from estimating the additional parameters in Eq. (A.1) versus Eq. (7).
Footnotes
Additional supporting information including source code to reproduce the results may be found in the online version of this article at the publisher’s web-site
Conflict of interest
The authors have declared no conflict of interest.
References
- Firth D. Multiplicative errors: log-normal or gamma? Journal of the Royal Statistical Society, Series B: Methodological. 1988;50:266–268. [Google Scholar]
- Heffernan AL, Aylward LL, Toms LM, Sly PD, Macleod M, Mueller JF. Pooled biological specimens for human biomonitoring of environmental chemicals: opportunities and limitations. Journal of Exposure Science and Environmental Epidemiology. 2014;24:225–232. doi: 10.1038/jes.2013.76. [DOI] [PubMed] [Google Scholar]
- Hosmer DW, Lemeshow S, Sturdivant RX. Applied Logistic Regression. John Wiley and Sons; New Jersey, NJ: 2013. [Google Scholar]
- Kacena K, Quinn S, Hartman S, Quinn T, Gaydos C. Pooling of urine samples for screening for Neisseria gonorrhoeae by ligase chain reaction: accuracy and application. Journal of Clinical Microbiology. 1998;36:3624–3628. doi: 10.1128/jcm.36.12.3624-3628.1998. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Liu A, Schisterman EF. Comparison of diagnostic accuracy of biomarkers with pooled assessments. Biometrical Journal. 2003;45:631–644. [Google Scholar]
- Ma CX, Vexler A, Schisterman EF, Tian L. Cost-efficient designs based on linearly associated biomarkers. Journal of Applied Statistics. 2011;38:2739–2750. [Google Scholar]
- Malinovsky Y, Albert PS, Schisterman EF. Pooling designs for outcomes under a Gaussian random effects model. Biometrics. 2012;68:45–52. doi: 10.1111/j.1541-0420.2011.01673.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Mitchell EM, Lyles RH, Schisterman EF. Positing, fitting, and selecting regression models for pooled biomarker data. Statistics in Medicine. 2015;34:2544–2558. doi: 10.1002/sim.6496. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Mumford SL, Schisterman EF, Vexler A, Liu A. Pooling biospecimens and limits of detection: effects on ROC curve analysis. Biostatistics. 2006;7:585–598. doi: 10.1093/biostatistics/kxj027. [DOI] [PubMed] [Google Scholar]
- Pilcher C, Fiscus S, Nguyen T, Foust E, Wolf L, Williams D, Ashby R, O’Dowd J, Leone P. Detection of acute infections during HIV testing in North Carolina. New England Journal of Medicine. 2005;352:1873–1883. doi: 10.1056/NEJMoa042291. [DOI] [PubMed] [Google Scholar]
- Saha-Chaudhuri P, Umbach DM, Weinberg CR. Pooled exposure assessment for matched case-control studies. Epidemiology. 2011;22:704–712. doi: 10.1097/EDE.0b013e318227af1a. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Saha-Chaudhuri P, Weinberg CR. Specimen pooling for efficient use of biospecimens in studies of time to a common event. American Journal of Epidemiology. 2013;178:126–135. doi: 10.1093/aje/kws442. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Schisterman EF, Perkins NJ, Bondell H. Optimal cut-point and its corresponding Youden index to discriminate individuals using pooled blood samples. Epidemiology. 2005;16:73–81. doi: 10.1097/01.ede.0000147512.81966.ba. [DOI] [PubMed] [Google Scholar]
- Schisterman EF, Vexler A, Yi A, Perkins NJ. A combined efficient design for biomarker data subject to a limit of detection due to measuring instrument sensitivity. The Annals of Applied Statistics. 2011;5:2651–2667. [Google Scholar]
- Weinberg CR, Umbach DM. Using pooled exposure assessment to improve efficiency in case-control studies. Biometrics. 1999;55:718–726. doi: 10.1111/j.0006-341x.1999.00718.x. [DOI] [PubMed] [Google Scholar]
- Whitcomb BW, Perkins NJ, Zhang Z, Ye A, Lyles RH. Assessment of skewed exposure in case-control studies with pooling. Statistics in Medicine. 2012;31:2461–2472. doi: 10.1002/sim.5351. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Whitcomb BW, Schisterman EF, Klebanoff MA, Baumgarten M, Luo X, Chegini N. Circulating levels of cytokines during pregnancy; thrombopoietin is elevated in miscarriage. Fertility and Sterility. 2008;89:1795–1802. doi: 10.1016/j.fertnstert.2007.05.046. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhang Z, Albert PS. Binary regression analysis with pooled exposure measurements: a regression calibration approach. Biometrics. 2011;67:636–645. doi: 10.1111/j.1541-0420.2010.01464.x. [DOI] [PMC free article] [PubMed] [Google Scholar]