Author manuscript; available in PMC: 2015 May 27.
Published in final edited form as: Biometrics. 2014 Sep 22;71(1):227–236. doi: 10.1111/biom.12220

On the Analysis of Hybrid Designs that Combine Group- and Individual-Level Data

E Smoot, S Haneuse
PMCID: PMC4445683  NIHMSID: NIHMS690961  PMID: 25251477

Summary

Ecological studies that make use of data on groups of individuals, rather than on the individuals themselves, are subject to numerous biases that cannot be resolved without some individual-level data. In the context of a rare outcome, the hybrid design for ecological inference efficiently combines group-level data with individual-level case-control data. Unfortunately, except in relatively simple settings, use of the design in practice is limited because evaluation of the hybrid likelihood is computationally prohibitive. In this article we first propose and develop an alternative representation of the hybrid likelihood. Second, based on this new representation, we propose a series of approximations that drastically reduce the computational burden. A comprehensive simulation shows that, in a broad range of scenarios, estimators based on the approximate hybrid likelihood exhibit the same operating characteristics as the exact hybrid likelihood, without any penalty in terms of increased bias or reduced efficiency. Third, in settings where the approximations may not hold, a pragmatic estimation and inference strategy is developed that uses the approximate form for some likelihood contributions and the exact form for others. The strategy gives researchers the ability to balance computational tractability with accuracy in their own settings. Finally, as a by-product of the development, we provide the first explicit characterization of the hybrid aggregate data design, which combines data from an aggregate data study (Prentice and Sheppard, 1995, Biometrika 82, 113–125) with case–control samples. The methods are illustrated using data from North Carolina on births between 2007 and 2009.

Keywords: Aggregate data study, Case–control data, Computation, Ecological study, Hybrid design

1. Introduction

As researchers plan and conduct studies they have at their disposal a broad range of designs on which to base their data collection efforts. Typically, research studies have well-defined study units and data is collected on a sub-sample of individual units. In some settings individual-level data may not be readily available and researchers may only have access to aggregated data on groups of individuals. When data is solely available on groups of individuals, the resulting study is commonly referred to as an ecological study (Sheppard, 2002). With the increasing ubiquity of large administrative databases, ecological studies are often cheaper to conduct than individual-level cohort and case–control study counterparts and can also, in some cases, provide greater exposure variability and therefore greater statistical power (Prentice and Sheppard, 1995). Recent prominent examples of ecological studies in the literature include studies of the impact of air pollution on life expectancy in the U.S. (Pope, Ezzati, and Dockery, 2009) and China (Chen et al., 2013).

Despite the benefits, ecological studies suffer from numerous sources of bias in which the observed group-level exposure-outcome association does not accurately reflect the exposure-outcome association at the individual-level (Greenland and Morgenstern, 1989; Sheppard, 2003; Salway and Wakefield, 2005). Collectively, the impact of these biases is often referred to as “ecological bias.” In the most severe case, the “ecological fallacy” arises where conclusions drawn about the exposure-outcome association differ from those that would have been drawn had an individual-level study been conducted (Robinson, 1950; Piantadosi, Byar, and Green, 1988; Wakefield and Shaddick, 2006).

Unfortunately, any attempt to draw conclusions regarding individual-level associations solely using group-level data relies on untestable assumptions in one form or another (Haneuse and Wakefield, 2008a). Consequently, when scientific interest lies in individual-level associations, the only reproducible approach to avoiding ecological bias is to collect, incorporate and analyze individual-level data (Haneuse and Wakefield, 2007). Over the last 20 years a number of statistical designs/methods have been proposed that combine group- and individual-level data (Wong and Mason, 1985; Prentice and Sheppard, 1995; Greenland, 2000; Haneuse and Wakefield, 2007; Martínez et al., 2007, 2009; Wakefield and Haneuse, 2008). Although details differ across the designs/methods, each: (i) uses individual-level data to mitigate ecological bias, and (ii) takes advantage of the group-level data to provide efficiency and power gains over designs/methods based solely on individual-level data.

In this article we focus on settings where scientific interest focuses on a rare binary outcome. In this setting, Haneuse and Bartell (2011) showed that the hybrid design for ecological inference provides the greatest potential for statistical efficiency. In its most general form, the hybrid design supplements group-level data with individual-level case–control data; the superior efficiency properties arise in part due to the design (i.e., the case–control sampling) as well as due to estimation/inference being likelihood-based. Unfortunately, evaluation of the hybrid likelihood is computationally very expensive. Indeed, when the model of interest considers more than 3 risk factors the computational burden may be sufficiently prohibitive that, in practice, researchers could be tempted to simply analyze the individual-level data and forgo the efficiency gains provided by incorporating the group-level data in the analysis. To address this problem we propose a novel approach for analyzing data from the hybrid design. Towards this we first develop an alternative representation of the hybrid likelihood. We then show that much, if not all, of the computational burden can be attributed to one component of the new decomposition. A series of approximations for this component are proposed. We show that estimation/inference based on the approximate hybrid likelihood exhibits the same operating characteristics as that based on the exact hybrid likelihood while simultaneously drastically reducing computational burden. As we will elaborate upon, the approximations correspond to a misspecified likelihood; as in any setting when approximations are used, the quality of the approximation is essential and misspecifying the likelihood can result in bias. In settings where the approximations may not hold, a pragmatic strategy that balances the use of the exact and approximate hybrid likelihood representations is developed. 
To illustrate the ideas, concepts and methods of this article we use data on all births in North Carolina from 2007 to 2009, obtained from the North Carolina State Center for Health Statistics.

2. The Hybrid Design

To ground the notation and exposition, consider the relationship between the risk of low birth weight (defined as a birth weight of <2500 g) and two risk factors: the race of the baby and whether or not the mother smoked. Throughout, while numerous choices are possible, we take the births to be “grouped” by county; in North Carolina there are K = 100 counties.

2.1. Notation

Let R be a binary indicator of race (0/1 = white/non-white), S an indicator of whether the mother smoked during pregnancy (0/1 = no/yes) and Y an indicator of low birth weight status (0/1 = no/yes). Suppose interest lies in the individual-level logistic regression model:

$$\operatorname{logit} P(Y_{ki} = 1 \mid R_{ki}, S_{ki}) = \beta_{0k} + \beta_1 R_{ki} + \beta_2 S_{ki}, \tag{1}$$

where the subscript [ki] indicates the ith birth in the kth county, for i = 1, …, Nk and k = 1, …, K. Note, model (1) is an individual-level model in the sense that it considers the relationship between risk factors and an outcome jointly measured on each individual birth (Sheppard, 2003). As such, the log odds ratios β1 and β2 are interpreted as characterizing individual-level associations. Finally, let Mrsk denote the number of births in the [R, S] = [r, s] race/smoking stratum of the kth county and Nyrsk the corresponding total number of births with Y = y.

2.2. A Complete Individual-Level Study

Suppose complete individual-level data is observed on all $N = \sum_{k=1}^{K} N_k$ individuals from all K groups. That is, suppose the collections Nyrsk = {Nyrsk; y = 0/1, r = 0/1, s = 0/1} and Mrsk = {Mrsk; r = 0/1, s = 0/1} are observed for each group. The top panel of Table 1 provides a summary of the notation for this data scenario. Assuming independence across groups, estimation and inference for β = (β01, …, β0K, β1, β2) could proceed straightforwardly using the following individual-level binomial likelihood:

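$$L_I(\beta; \mathbf{N}_{yrs}) = \prod_{k=1}^{K} L_I(\beta; \mathbf{N}_{yrsk} \mid \mathbf{M}_{rsk}) = \prod_{k=1}^{K} \left\{ \prod_{r=0}^{1} \prod_{s=0}^{1} \binom{M_{rsk}}{N_{1rsk}} \pi_{rsk}^{N_{1rsk}} (1 - \pi_{rsk})^{M_{rsk} - N_{1rsk}} \right\}, \tag{2}$$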

where Nyrs denotes the collection of R/S/Y counts across all K groups, {Nyrsk; k = 1, …, K}, and πrsk ≡ πrsk(β) = P(Y = 1|R = r, S = s, Group = k) is given by model (1). Note, one could adopt additional structure on the K group-specific β0k intercepts, for example assuming that they arise from some common random effects distribution which may or may not exhibit some specific spatial structure (Haneuse and Wakefield, 2008b). For ease of presentation, we assume that the intercept parameters are estimated without any such structure.
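As a rough illustration of how expression (2) might be evaluated, the following Python sketch computes the log of the individual-level binomial likelihood with πrsk taken from model (1). The function names, the dictionary layout used for the counts, and the example numbers are assumptions made purely for illustration; this is not the authors' implementation and the counts are not from the North Carolina data.

```python
import math


def expit(x):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-x))


def log_binom_pmf(x, n, p):
    """Log of C(n, x) p^x (1 - p)^(n - x), for 0 < p < 1."""
    log_choose = math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
    return log_choose + x * math.log(p) + (n - x) * math.log(1.0 - p)


def loglik_individual(beta0, beta1, beta2, groups):
    """Log of expression (2).

    beta0  : list of K group-specific intercepts (beta_0k)
    groups : list of K dicts mapping (r, s) -> (N_1rsk, M_rsk)
    """
    total = 0.0
    for b0k, cells in zip(beta0, groups):
        for (r, s), (n1, m) in cells.items():
            pi_rsk = expit(b0k + beta1 * r + beta2 * s)  # model (1)
            total += log_binom_pmf(n1, m, pi_rsk)
    return total


# Hypothetical counts for K = 2 groups (illustrative only).
example_groups = [
    {(0, 0): (40, 500), (0, 1): (12, 100), (1, 0): (30, 250), (1, 1): (10, 60)},
    {(0, 0): (25, 300), (0, 1): (8, 70), (1, 0): (20, 150), (1, 1): (6, 40)},
]
example_loglik = loglik_individual(
    beta0=[math.log(0.08 / 0.92), math.log(0.09 / 0.91)],
    beta1=math.log(1.5),
    beta2=math.log(1.25),
    groups=example_groups,
)
print(example_loglik)
```

Working on the log scale avoids underflow when the products in (2) run over many strata and groups.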

Table 1.

Notation for data available under three data scenarios/designs. Shown are counts for a generic group, k. Counts within square brackets are not observed in the respective design.

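I. Complete individual-level data

| | Y=0 | Y=1 | Total |
|---|---|---|---|
| R=0/S=0 | N000k | N100k | M00k |
| R=0/S=1 | N001k | N101k | M01k |
| R=1/S=0 | N010k | N110k | M10k |
| R=1/S=1 | N011k | N111k | M11k |
| Total | N0k | N1k | Nk |

II. Aggregate data study supplemented with case-control data

| | Y=0 | Y=1 | Total | | | Y=0 | Y=1 | Total |
|---|---|---|---|---|---|---|---|---|
| R=0/S=0 | [N000k] | [N100k] | M00k | | R=0/S=0 | n000k | n100k | m00k |
| R=0/S=1 | [N001k] | [N101k] | M01k | | R=0/S=1 | n001k | n101k | m01k |
| R=1/S=0 | [N010k] | [N110k] | M10k | | R=1/S=0 | n010k | n110k | m10k |
| R=1/S=1 | [N011k] | [N111k] | M11k | | R=1/S=1 | n011k | n111k | m11k |
| Total | N0k | N1k | Nk | | Total | n0k | n1k | nk |

III. Pure ecological study supplemented with case-control data

| | S=0 | S=1 | Total |
|---|---|---|---|
| R=0 | [M00k] | [M01k] | M0+k |
| R=1 | [M10k] | [M11k] | M1+k |
| Total | M+0k | M+1k | Nk |

| | Y=0 | Y=1 | Total | | | Y=0 | Y=1 | Total |
|---|---|---|---|---|---|---|---|---|
| R=0/S=0 | [N000k] | [N100k] | [M00k] | | R=0/S=0 | n000k | n100k | m00k |
| R=0/S=1 | [N001k] | [N101k] | [M01k] | | R=0/S=1 | n001k | n101k | m01k |
| R=1/S=0 | [N010k] | [N110k] | [M10k] | | R=1/S=0 | n010k | n110k | m10k |
| R=1/S=1 | [N011k] | [N111k] | [M11k] | | R=1/S=1 | n011k | n111k | m11k |
| Total | N0k | N1k | Nk | | Total | n0k | n1k | nk |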

2.3. Supplementing an Aggregate Data Design Study with Case–Control Data

In the absence of complete individual-level data, researchers may nevertheless have access to counts aggregated at the group-level. Under the aggregate data design, these data consist of the group-specific marginal outcome counts Nyk = {N0k, N1k} together with the group-specific marginal covariate counts Mrsk. Consequently, while “complete” information on the outcomes is observed along with “complete” information on the marginal covariate counts, their joint distribution is not observed. In a hybrid aggregate design, these data are supplemented with a case–control sample of n0k non-cases and n1k cases drawn from the kth group; on each of the nk = n0k+n1k individuals sampled in this scheme, complete information on the joint distribution of R/S/Y is retrospectively observed. The middle panel of Table 1 provides a summary of the notation for this data scenario. Note, the Nyrsk are within square brackets to emphasize that they are not observed.

Since complete individual-level data is not observed, one cannot proceed using the likelihood given by (2). Instead, estimation/inference is based on the induced hybrid likelihood:

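$$L_A(\beta; \mathbf{N}_{y}, \mathbf{n}_{yrs}) = \prod_{k=1}^{K} L_A(\beta; \mathbf{N}_{yk}, \mathbf{n}_{yrsk} \mid \mathbf{M}_{rsk}, \mathbf{n}_{yk}) = \prod_{k=1}^{K} \sum_{\mathbf{N}_{yrsk} \in \mathcal{N}_k} w(\mathbf{N}_{yrsk} \mid \mathbf{n}_{yrsk}, \mathbf{n}_{yk}) \, L_I(\beta; \mathbf{N}_{yrsk} \mid \mathbf{M}_{rsk}). \tag{3}$$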

Intuitively, the contribution from the kth group in expression (3) is a weighted convolution of individual-level likelihood contributions integrating over the unknown Nyrsk, with weights given as the product of the probability distribution functions from two multivariate hypergeometric distributions:

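$$w(\mathbf{N}_{yrsk} \mid \mathbf{n}_{yrsk}, \mathbf{n}_{yk}) = \mathrm{HG}(\mathbf{n}_{0rsk} \mid \mathbf{N}_{0rsk}, n_{0k}) \, \mathrm{HG}(\mathbf{n}_{1rsk} \mid \mathbf{N}_{1rsk}, n_{1k}).$$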

The set 𝒩k in expression (3) denotes the collection of Nyrsk counts that are consistent with both the aggregated group-level data, Nyk and Mrsk, and the sampled case–control data, nyrsk. The specific form of 𝒩k is given in Web Appendix A.
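To give a feel for how quickly the set 𝒩k grows, the following Python sketch enumerates it by brute force for a small hypothetical group. The precise definition of 𝒩k is in Web Appendix A, which is not reproduced here, so the constraints coded below are an assumption based only on the description above: the stratum-specific case counts must sum to N1k, must be at least the number of sampled cases, and must leave room for the sampled non-cases. All counts are invented for illustration.

```python
from itertools import product


def admissible_case_counts(M, n0, n1, N1_total):
    """List vectors (N_1rsk) over strata that are compatible with the data.

    ASSUMED constraints, per stratum c:
        n1[c] <= N_1c <= M[c] - n0[c]   and   sum_c N_1c == N1_total.
    The non-case counts then follow as N_0c = M[c] - N_1c.
    """
    ranges = [range(n1_c, M_c - n0_c + 1) for M_c, n0_c, n1_c in zip(M, n0, n1)]
    return [v for v in product(*ranges) if sum(v) == N1_total]


# Hypothetical group with four race/smoking strata (illustrative numbers).
M = (30, 10, 15, 5)   # M_rsk: births per stratum
n0 = (3, 1, 1, 0)     # sampled non-cases per stratum
n1 = (2, 1, 1, 1)     # sampled cases per stratum
N1_total = 8          # N_1k: total number of cases in the group
configs = admissible_case_counts(M, n0, n1, N1_total)
print(len(configs))   # number of terms in the inner sum of (3) for this group
```

Even this naive enumeration scans every combination of stratum ranges before filtering, which hints at why groups with thousands of births become expensive.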

2.4. Supplementing a Pure Ecological Study with Case–Control Data

In some settings, researchers may not have access to the observed joint distribution of the covariates, Mrsk. In particular, the observed data in a pure ecological study consists of marginal totals for Y, R, and S across the K groups. Using the notation developed so far, this “pure ecological” data consists of the county-specific counts (Nk, N1k, M1+k, M+1k), where Nk is the total number of births, N1k is the number of low birth weight births, M1+k is the number of non-white births, and M+1k is the number of births to mothers who smoked during pregnancy. A hybrid design would supplement these marginal counts with detailed, individual-level data on a case–control sample of n0k non-cases and n1k cases drawn from the kth group. The lower panel of Table 1 summarizes the notation for this data scenario.

In this setting the induced hybrid likelihood can again be derived as the product of K group-specific weighted convolutions. In addition to integrating over the unknown Nyrsk, as in expression (3), one also needs to integrate over the unknown Mrsk. The latter requires additional parameters specific to the joint distribution of the covariates; for the present setting (i.e., two binary covariates), the log odds ratio between S and R, denoted ϕrs, suffices. Since ϕrs will, in general, be unknown, it must be jointly estimated along with the regression parameters in model (1). The resulting induced hybrid likelihood is then given by:

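$$L_H(\beta, \phi_{rs}; \mathbf{N}_{y}, \mathbf{n}_{yrs}) = \prod_{k=1}^{K} \sum_{\mathbf{M}_{rsk} \in \mathcal{M}_k} P(\mathbf{M}_{rsk} \mid \mathbf{M}_{r+k}, \mathbf{M}_{+sk}, \phi_{rs}) \cdot L_A(\beta; \mathbf{N}_{yk}, \mathbf{n}_{yrsk} \mid \mathbf{M}_{rsk}, \mathbf{n}_{yk})$$

$$= \prod_{k=1}^{K} \sum_{\mathbf{M}_{rsk} \in \mathcal{M}_k} P(\mathbf{M}_{rsk} \mid \mathbf{M}_{r+k}, \mathbf{M}_{+sk}, \phi_{rs}) \cdot \left\{ \sum_{\mathbf{N}_{yrsk} \in \mathcal{N}_k} w(\mathbf{N}_{yrsk} \mid \mathbf{n}_{yrsk}, \mathbf{n}_{yk}) \, L_I(\beta; \mathbf{N}_{yrsk} \mid \mathbf{M}_{rsk}) \right\}, \tag{4}$$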

where P(Mrsk|Mr+k,M+sk, ϕrs) is the probability distribution function of an extended hypergeometric distribution (Johnson and Kotz, 1969; Haneuse and Wakefield, 2007). Furthermore, ℳk is the set of all possible configurations of the Mrsk counts that are consistent with both the (Mr+k, M+sk) marginal totals and the case–control counts mrsk in the lower panel of Table 1. The specific form of ℳk is given in Web Appendix B.
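For two binary covariates, the extended hypergeometric distribution reduces to a distribution for the single cell count M11k given the margins, with probabilities proportional to the usual hypergeometric weights tilted by exp(ϕrs · M11k). The Python sketch below is one way to compute it; the function name and the example margins are hypothetical, and the additional restriction to ℳk imposed by the case–control counts (Web Appendix B) is not applied here.

```python
import math


def extended_hypergeom_pmf(N, M1p, Mp1, phi):
    """pmf of M_11k given the margins, with log odds ratio phi between R and S.

    N   : total number of births in the group (N_k)
    M1p : M_1+k, number with R = 1
    Mp1 : M_+1k, number with S = 1
    Returns a dict mapping each admissible M_11k to its probability.
    """
    lo = max(0, M1p + Mp1 - N)
    hi = min(M1p, Mp1)
    log_w = {
        x: math.log(math.comb(M1p, x))
        + math.log(math.comb(N - M1p, Mp1 - x))
        + phi * x
        for x in range(lo, hi + 1)
    }
    shift = max(log_w.values())  # subtract the max for numerical stability
    total = sum(math.exp(v - shift) for v in log_w.values())
    return {x: math.exp(v - shift) / total for x, v in log_w.items()}


# Hypothetical margins; phi = log(2) means R and S are positively associated.
pmf = extended_hypergeom_pmf(N=20, M1p=8, Mp1=5, phi=math.log(2.0))
print(pmf)
```

Setting phi = 0 recovers the ordinary (central) hypergeometric distribution, which is a convenient check on the implementation.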

3. Computational Burden

From expressions (3) and (4), evaluation of the hybrid likelihood requires computing a product of summations with the number of terms in the summations determined by 𝒩k for the hybrid aggregate data design and (ℳk, 𝒩k) jointly for the hybrid pure ecological design. These evaluations can often be incredibly computationally expensive. To illustrate this point, we use data on 387,705 births in North Carolina with complete vital records during the 3-year span from 2007 to 2009. For these data, “complete” refers to the record having no missing information on birth-county, race, infant birth weight, and mother’s smoking status. Across the 100 counties, the number of births ranged from 147 to 44,076 with a median of 1,981; only seven counties had more than 10,000 births recorded. Figure 1 provides a visual representation of this information from the North Carolina data, under both an aggregate data study (panels a and b) and under a pure ecological study (panels a, c, and d).

Figure 1.

Visual representations of group-level, aggregated information derived from the North Carolina birth weight data. Panels (a) and (b) together represent observed information in an aggregate data study. Panels (a), (c), and (d) collectively represent observed information in a pure ecological study.

To illustrate the computational burden of evaluating the aggregate data hybrid likelihood (3), we drew a single stratified random sample of n0k = n1k = 25 non-cases and cases from each county. Based on simulations run on an Apple iMac with a dual-core Intel Core i5 3.6 GHz processor with 8GB RAM, running Mac OS X Lion, we estimated that a single evaluation of expression (3) corresponds to evaluating a sum with approximately $\sum_{k=1}^{100} \mathrm{size}(\mathcal{N}_k) \approx 5 \times 10^{9}$ terms. Based on an implementation in the osDesign package in R, using C as the computational work engine, we estimated that this single evaluation would take approximately 21.5 days. Under the hybrid pure ecological design, evaluation of expression (4) corresponds to evaluating a sum with more than $5 \times 10^{12}$ terms; using the same hardware/software as the calculation for the aggregate data hybrid likelihood, we estimated that a single evaluation of the pure ecological hybrid likelihood could take up to 60 years.
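The two timing figures quoted above are consistent with one another under a simple assumption, namely that run time scales linearly with the number of terms in the sum. The short Python calculation below makes that arithmetic explicit; the linear-scaling assumption is a simplification introduced here, not a claim made in the timing experiment itself.

```python
# Figures quoted in the text.
terms_aggregate = 5e9      # terms in one evaluation of expression (3)
days_aggregate = 21.5      # estimated time for that evaluation
terms_ecological = 5e12    # lower bound on terms for expression (4)

# Implied throughput, ASSUMING time is proportional to the number of terms.
seconds_aggregate = days_aggregate * 24 * 3600
terms_per_second = terms_aggregate / seconds_aggregate

# Extrapolated time for the pure ecological hybrid likelihood.
seconds_ecological = terms_ecological / terms_per_second
years_ecological = seconds_ecological / (365.25 * 24 * 3600)

print(round(terms_per_second), "terms per second")
print(round(years_ecological, 1), "years")
```

A thousand-fold increase in the number of terms turns 21.5 days into roughly 59 years, in line with the "up to 60 years" estimate.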

4. Approximating the Hybrid Likelihood

Given data from a hybrid design, the computational burden associated with evaluating the exact hybrid likelihood may lead analysts to base estimation/inference solely on the individual-level case–control data. For example, one could simply use conditional logistic regression to estimate the log odds ratio parameters in model (1). Doing so, however, ignores the observed group-level data and forgoes the efficiency benefits associated with including this information in the analysis. In this section we present a novel analysis strategy for data arising from the hybrid design that makes use of an approximation to the hybrid likelihood.

4.1. An Alternative Representation of the Hybrid Likelihood

Consider the data set-up of the hybrid aggregate data design, given by the middle panel of Table 1. As indicated in Section 2.3, the hybrid likelihood is obtained by integrating the expression for the complete data likelihood over the distribution of the unknown Nyrsk. The case–control data inform this distribution through the weighting terms P(nyrsk|Nyrsk, nyk) and by restricting the range of admissible Nyrsk. An alternative to viewing the nk = n0k+n1k case–control samples as a subset of the broader population is to consider them as distinct from the $N_k^* = N_k - n_k$ individuals in the kth group who were not sampled. Table 2 summarizes this notation, with a superscript “*” indicating that the counts refer to individuals not sampled by the case–control scheme. Note, the right-hand table is unchanged from Table 1 while the left-hand table contains group-level aggregated information on the unsampled individuals.

Table 2.

Notation for an alternative representation of the data available under the hybrid aggregate data design of Section 2.3. Shown are counts for a generic group, k. Counts within square brackets are not observed.

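| Individuals not sampled | Y=0 | Y=1 | Total | | Case–control sample | Y=0 | Y=1 | Total |
|---|---|---|---|---|---|---|---|---|
| R=0/S=0 | [N000k*] | [N100k*] | M00k* | | R=0/S=0 | n000k | n100k | m00k |
| R=0/S=1 | [N001k*] | [N101k*] | M01k* | | R=0/S=1 | n001k | n101k | m01k |
| R=1/S=0 | [N010k*] | [N110k*] | M10k* | | R=1/S=0 | n010k | n110k | m10k |
| R=1/S=1 | [N011k*] | [N111k*] | M11k* | | R=1/S=1 | n011k | n111k | m11k |
| Total | N0k* | N1k* | Nk* | | Total | n0k | n1k | nk |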

Under this new representation, the hybrid aggregate data likelihood can be re-written as:

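$$L_A(\beta; \mathbf{N}_{yk}, \mathbf{n}_{yrsk} \mid \mathbf{M}_{rsk}, \mathbf{n}_{yk}) = L_E(\beta; \mathbf{N}_{yk}^{*}) \, \frac{\mathrm{HG}(\mathbf{m}_{rsk} \mid \mathbf{M}_{rsk}, n_k)}{\mathrm{HG}(\mathbf{n}_{yk} \mid \mathbf{N}_{yk}, n_k)} \, L_I(\beta; \mathbf{n}_{yrsk} \mid \mathbf{m}_{rsk}), \tag{5}$$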

where LI (β; nyrsk|mrsk) is a (naïve) prospective likelihood contribution based on the case–control data and LE(β;Nyk*) is an ecological likelihood for those individuals not sampled:

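$$L_E(\beta; \mathbf{N}_{yk}^{*}) = \sum_{\mathbf{N}_{yrsk}^{*} \in \mathcal{N}_k^{*}} L_I(\beta; \mathbf{N}_{yrsk}^{*} \mid \mathbf{M}_{rsk}^{*}), \tag{6}$$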

where 𝒩k* denotes the collection of Nyrsk* that are consistent with the group-level data on those not sampled (Nyk*,Mrsk*). The weighting in expression (5) by the ratio of the two (multivariate) hypergeometric distributions serves to account for the case–control sampling scheme as well as the finite population sampling from the Nk individuals in the group.
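The step from expression (3) to expression (5) rests on a factorial identity within each race/smoking stratum. The following is a sketch of the argument rather than a full proof. It suppresses the group index k, writes $N^*_{yrs} = N_{yrs} - n_{yrs}$, $m_{rs} = n_{0rs} + n_{1rs}$ and $M^*_{rs} = M_{rs} - m_{rs}$, and assumes that HG denotes the multivariate hypergeometric probability mass function, as described in Section 2.3.

```latex
\begin{align*}
\binom{M_{rs}}{N_{1rs}} \binom{N_{1rs}}{n_{1rs}} \binom{N_{0rs}}{n_{0rs}}
  &= \frac{M_{rs}!}{n_{1rs}!\, n_{0rs}!\, N^*_{1rs}!\, N^*_{0rs}!}
   = \binom{M_{rs}}{m_{rs}} \binom{m_{rs}}{n_{1rs}} \binom{M^*_{rs}}{N^*_{1rs}}, \\[6pt]
\pi_{rs}^{N_{1rs}} (1 - \pi_{rs})^{N_{0rs}}
  &= \pi_{rs}^{n_{1rs}} (1 - \pi_{rs})^{n_{0rs}} \cdot
     \pi_{rs}^{N^*_{1rs}} (1 - \pi_{rs})^{N^*_{0rs}}, \\[6pt]
\frac{\prod_{r,s} \binom{M_{rs}}{m_{rs}}}{\binom{N_{0}}{n_{0}} \binom{N_{1}}{n_{1}}}
  &= \frac{\prod_{r,s} \binom{M_{rs}}{m_{rs}} \big/ \binom{N}{n}}
          {\binom{N_{0}}{n_{0}} \binom{N_{1}}{n_{1}} \big/ \binom{N}{n}}
   = \frac{\mathrm{HG}(\mathbf{m}_{rs} \mid \mathbf{M}_{rs}, n)}
          {\mathrm{HG}(\mathbf{n}_{y} \mid \mathbf{N}_{y}, n)}.
\end{align*}
```

The first line uses $N_{0rs} = M_{rs} - N_{1rs}$; its left-hand side collects the binomial coefficients contributed by the weights and by the individual-level likelihood in (3). On the right-hand side, the factor $\binom{m_{rs}}{n_{1rs}}$ combines with the first pair of powers in the second line to give the case–control contribution, the factor $\binom{M^*_{rs}}{N^*_{1rs}}$ combines with the second pair to give the contribution of the unsampled individuals, and the remaining terms do not depend on the unknown counts. Summing over the unknown counts therefore affects only the unsampled contribution, which yields the ecological likelihood in (6), and the leftover constants form the ratio in the third line.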

4.2. Approximating the Aggregate Data Hybrid Likelihood

Inspection of expression (5) reveals that the primary source of computational burden is LE(β;Nyk*). Towards mitigating computational burden we consider approximating this component by taking the total number of events in the kth group, N1k*, to be conditionally distributed according to a binomial distribution:

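$$N_{1k}^{*} \mid \mathbf{M}_{rsk}^{*} \sim \mathrm{Binomial}\left( N_k^{*}, \; \sum_{r,s} \frac{M_{rsk}^{*}}{N_k^{*}} \pi_{rsk} \right), \tag{7}$$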

where, as in Section 2.2, πrsk ≡ πrsk(β) is given by model (1). Wakefield (2004) considered this approximation in the context of a single binary exposure. Wakefield (2004) also considered approximations based on the Poisson distribution and the Normal distribution; here we restrict attention to the binomial approximation since we found it to work well in a broad range of settings; details are provided in Web Appendix C.
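To make the approximation concrete, the following Python sketch compares the exact ecological likelihood (6), evaluated by brute-force enumeration, with the binomial approximation (7) for a small hypothetical group. The function names, stratum sizes and risks are illustrative assumptions; the enumeration is only feasible because the example is tiny.

```python
import math
from itertools import product


def binom_pmf(x, n, p):
    """Binomial probability mass function."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)


def exact_ecological_lik(N1_star, M_star, pi):
    """Expression (6): sum over all stratum-specific case counts adding to N1_star."""
    total = 0.0
    for counts in product(*[range(m + 1) for m in M_star]):
        if sum(counts) == N1_star:
            total += math.prod(
                binom_pmf(x, m, p) for x, m, p in zip(counts, M_star, pi)
            )
    return total


def approx_ecological_lik(N1_star, M_star, pi):
    """Expression (7): one binomial with the stratum-size-weighted average risk."""
    N_star = sum(M_star)
    p_bar = sum(m * p for m, p in zip(M_star, pi)) / N_star
    return binom_pmf(N1_star, N_star, p_bar)


# Hypothetical unsampled counts and risks for the four R/S strata.
M_star = (12, 6, 8, 4)
pi = (0.08, 0.10, 0.12, 0.15)
exact_val = exact_ecological_lik(3, M_star, pi)
approx_val = approx_ecological_lik(3, M_star, pi)
print(exact_val, approx_val)
```

When all strata share the same risk the two functions agree exactly, since a sum of independent binomials with a common success probability is itself binomial; the approximation error comes only from heterogeneity in πrsk across strata.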

Denoting the approximate ecological likelihood contribution based on (7) by L˜E(β;Nyk*), an approximate aggregate data hybrid likelihood contribution for the kth group is:

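$$\tilde{L}_A(\beta; \mathbf{N}_{yk}, \mathbf{n}_{yrsk} \mid \mathbf{M}_{rsk}, \mathbf{n}_{yk}) = \tilde{L}_E(\beta; \mathbf{N}_{yk}^{*}) \, \frac{\mathrm{HG}(\mathbf{m}_{rsk} \mid \mathbf{M}_{rsk}, n_k)}{\mathrm{HG}(\mathbf{n}_{yk} \mid \mathbf{N}_{yk}, n_k)} \, L_I(\beta; \mathbf{n}_{yrsk} \mid \mathbf{m}_{rsk}). \tag{8}$$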

Crucially, evaluation of (8) no longer requires summing over the, often very large, collection of possible Nyrsk*. As such, the computational burden is essentially trivial.
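A minimal Python sketch of the log of expression (8) for a single group is given below. It assumes that the two hypergeometric terms enter as a ratio, with the covariate term in the numerator and the outcome term in the denominator, in line with the description of the weighting in Section 4.1, and it takes the stratum-specific risks πrsk as given. All function names and counts are hypothetical.

```python
import math


def log_comb(n, k):
    """Log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def log_mv_hypergeom(sample, population):
    """Log pmf of a multivariate hypergeometric draw of `sample` from `population`."""
    numerator = sum(log_comb(P, x) for x, P in zip(sample, population))
    return numerator - log_comb(sum(population), sum(sample))


def log_binom_pmf(x, n, p):
    return log_comb(n, x) + x * math.log(p) + (n - x) * math.log(1.0 - p)


def approx_hybrid_loglik_k(pi, M, n0, n1, N1):
    """Log of expression (8) for one group.

    pi, M, n0, n1 are indexed by R/S stratum; N1 is the group's total case count.
    """
    m = [a + b for a, b in zip(n0, n1)]            # sampled per stratum
    M_star = [M_c - m_c for M_c, m_c in zip(M, m)]  # unsampled per stratum
    N, N_star = sum(M), sum(M_star)
    N1_star = N1 - sum(n1)                          # unsampled cases
    # Binomial approximation (7) to the ecological contribution.
    p_bar = sum(Ms * p for Ms, p in zip(M_star, pi)) / N_star
    log_LE = log_binom_pmf(N1_star, N_star, p_bar)
    # Ratio of the two hypergeometric terms.
    log_ratio = log_mv_hypergeom(m, M) - log_mv_hypergeom(
        [sum(n0), sum(n1)], [N - N1, N1]
    )
    # Naive prospective likelihood for the case-control sample.
    log_LI = sum(log_binom_pmf(x, m_c, p) for x, m_c, p in zip(n1, m, pi))
    return log_LE + log_ratio + log_LI


value = approx_hybrid_loglik_k(
    pi=(0.08, 0.10, 0.12, 0.15), M=(30, 10, 15, 5),
    n0=(3, 1, 1, 0), n1=(2, 1, 1, 1), N1=8,
)
print(value)
```

The cost of one evaluation is linear in the number of strata, with no enumeration over unobserved counts.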

4.3. Approximating the Pure Ecological Hybrid Likelihood

The form of the pure ecological hybrid likelihood for the kth group, repeated here from Section 2.4 for convenience, is the summation of a series of nested summations:

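$$L_H(\beta, \phi_{rs}; \mathbf{N}_{yk}, \mathbf{n}_{yrsk} \mid \mathbf{M}_{r+k}, \mathbf{M}_{+sk}, \mathbf{n}_{yk}) = \sum_{\mathbf{M}_{rsk} \in \mathcal{M}_k} P(\mathbf{M}_{rsk} \mid \mathbf{M}_{r+k}, \mathbf{M}_{+sk}, \phi_{rs}) \cdot \left\{ \sum_{\mathbf{N}_{yrsk} \in \mathcal{N}_k} w(\mathbf{N}_{yrsk} \mid \mathbf{n}_{yrsk}, \mathbf{n}_{yk}) \, L_I(\beta; \mathbf{N}_{yrsk} \mid \mathbf{M}_{rsk}) \right\}.$$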

Unfortunately, while the nested summation corresponds to LA(β; Nyk, nyrsk|Mrsk, nyk) and can therefore be approximated using the approach of Section 4.2, the outer summation across these approximations is not amenable to approximation. Nevertheless, even if the approximation given by

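$$\tilde{L}_H(\beta, \phi_{rs}; \mathbf{N}_{yk}, \mathbf{n}_{yrsk} \mid \mathbf{M}_{r+k}, \mathbf{M}_{+sk}, \mathbf{n}_{yk}) = \sum_{\mathbf{M}_{rsk} \in \mathcal{M}_k} P(\mathbf{M}_{rsk} \mid \mathbf{M}_{r+k}, \mathbf{M}_{+sk}, \phi_{rs}) \, \tilde{L}_A(\beta; \mathbf{N}_{yk}, \mathbf{n}_{yrsk} \mid \mathbf{M}_{rsk}, \mathbf{n}_{yk}) \tag{9}$$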

does not completely eliminate the overall computational burden, that only a single (outer) summation is required will reduce it considerably.

4.4. Estimation and Inference

Given data on K groups from a hybrid design, likelihood-based estimation/inference could proceed in the usual way: a likelihood could be formed by taking the product of K terms of the form (8) or (9), point estimates can be obtained by maximization and standard error estimates from the inverse of the observed information matrix. Expressions for the approximate hybrid likelihood scores and Hessian terms under both the hybrid aggregate data and hybrid pure ecological designs are derived in Web Appendix D.

In practice, use of (8) or (9) for each of the K terms will lead to the greatest reduction of computational burden although doing so may incur a trade-off in terms of statistical operating characteristics. Since the ideal is to use the exact hybrid likelihood for each group, the extent to which use of the approximate hybrid likelihood impacts estimation and inference is crucial. When the group size is large the approximation given by (7) is readily motivated as a large sample approximation to the distribution of the number of cases, N1k*. Thus, precisely in the situations where relief of the computational burden is most needed is where the approximations are expected to be most accurate. When group sizes are small, the approximations may not be expected to hold as well. However, for these groups the computational burden may be manageable. Together, these observations suggest use of the exact form for small groups and approximate form for large groups may strike a reasonable compromise between computational tractability and accuracy. One simple strategy is to sequentially obtain MLEs and standard error estimates using an overall likelihood where:

  1. All K contributions are of the approximate form, L̃A(β; Nyk, nyrsk|Mrsk, nyk).

  2. The group with the smallest N1k* contributes the exact form, while the remaining K − 1 groups contribute the approximate form.

  3. The two groups with the smallest N1k* contribute the exact form, while the remaining K − 2 groups contribute the approximate form.

As one permits more and more of the contributions to be of the exact form, the computational burden will increase and the point estimates will get closer and closer to what one would have obtained by a gold-standard analysis that uses the exact hybrid likelihood for all K contributions. Practically, one could initiate the process and stop when point estimates and standard error estimates “converge,” to some level of tolerance, in the sense that the use of additional exact likelihood contributions does not change the conclusions one draws.
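The sequential strategy above can be sketched in Python as follows. The fitting routine is left abstract: `fit` is a hypothetical user-supplied function that maximizes the mixed likelihood for a given set of exact-form groups and returns a point estimate and standard error. The stopping rule, a relative change below `tol` between successive fits, is one possible reading of "converge" and is an assumption, as is the toy fitting function used to exercise the loop.

```python
def sequential_fit(n1_star, fit, tol=0.02):
    """Switch groups to the exact form one at a time, smallest N*_1k first.

    n1_star : list of N*_1k values, one per group
    fit     : HYPOTHETICAL callable; fit(exact_groups) returns
              (estimate, std_error) from a likelihood in which the groups in
              `exact_groups` use the exact form and the rest the approximate form
    tol     : relative change below which successive fits count as converged
    """
    order = sorted(range(len(n1_star)), key=lambda k: n1_star[k])
    previous = fit(frozenset())              # step 1: all contributions approximate
    history = [previous]
    for j in range(1, len(order) + 1):
        current = fit(frozenset(order[:j]))  # j smallest groups use the exact form
        history.append(current)
        if all(abs(c - p) <= tol * abs(p) for c, p in zip(current, previous)):
            break
        previous = current
    return history


def toy_fit(exact_groups):
    """Stand-in for a real optimizer: the estimate settles as groups are added."""
    return (1.0 + 0.5 ** len(exact_groups), 0.1)


history = sequential_fit([5, 3, 9, 1, 7, 2, 8, 4, 6, 10], toy_fit)
print(len(history) - 1, "groups switched to the exact form before stopping")
```

In practice one would also want to cap the number of exact-form groups by a time budget, since each additional group can be far more expensive than the last.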

To illustrate this strategy, we drew a single stratified case–control sample of n0k = n1k = 25 from each county in North Carolina and considered combining these data with data from the aggregate data design (presented in Figures 1a and b). Figure 2 shows how the point estimates for β1 in model (1) change as one modifies the balance of contributions that are of the approximate and exact form. Also shown is how the standard error estimates change, as well as the increase in computational burden in terms of the number of summations and the time taken to obtain the MLE; these were all evaluated using the same hardware/software configuration of Section 3. Overall, very little change is seen in the point estimates and standard error estimates; neither changes by more than 2%. In contrast, the computation time quickly becomes onerous as the number of exact-form contributions is increased. In particular, the MLE computation time increased from 21 seconds when 20% of the counties were contributing the exact form of the likelihood to 1 hour and over 4 hours when the percent of counties contributing the exact form is increased to 60% and 75%, respectively.

Figure 2.

Implementation of the compromise strategy of Section 4.4 which balances the use of the exact and approximate forms of the hybrid aggregate data likelihood and computational burden. Point and standard error estimates are for β1 in model (1).

5. Simulations

The strategy presented in the previous section is pragmatic from a computational perspective. A trade-off, however, is that use of any approximate forms corresponds to a misspecified likelihood. As such, estimation is no longer guaranteed to be consistent and/or asymptotically efficient. Furthermore, standard error estimates based on inverting the observed information matrix are not guaranteed to be valid. To investigate the potential trade-off between computational burden and operating characteristics, we conducted a simulation study to evaluate the impact of using approximate forms of the hybrid likelihood. Specifically, we investigated: (i) the magnitude of bias in point estimates, if any, associated with the use of the approximate likelihood and, (ii) whether or not the use of the approximate likelihood impacts the efficiency gains one sees when one combines group- and individual-level data using the exact hybrid likelihood. Note, rather than evaluating the “quality” of the approximations themselves, we evaluated the “quality” of the approximation in terms of a downstream impact on estimation/inference with the “gold-standard” taken to be estimation/inference based on the exact likelihood.

5.1. Simulation Set-Up

We first generated 10,000 simulated datasets under the following “baseline” scenario. For each of K = 20 groups we set the group size to be Nk = 2,000. Let Qrk = P(R = 1|group k) denote the marginal prevalence of a binary covariate R in the kth group; similarly, let Qsk = P(S = 1|group k) denote the marginal prevalence of a binary covariate S in the kth group. Values across the K groups for Qrk and Qsk were fixed at the quantiles of a Normal(0.2, 0.1²) distribution; assignment to specific values for both Qrk and Qsk was randomly permuted across the 10,000 simulated datasets. Individual values for R and S were then generated as random deviates from Bernoulli(Qrk) and Bernoulli(Qsk), according to group membership of the individual. Given these covariate values, outcomes were generated as random draws from a Bernoulli(πrsk) distribution with πrsk given by model (1) with (β1, β2) = (log 1.5, log 1.25). The group-specific intercepts, β0k, were set such that the baseline outcome rates (i.e., π00k, when R = S = 0) were the quantiles from a Normal(0.1, 0.02²) distribution.
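The baseline data-generating mechanism can be sketched in Python as below. Two choices are assumptions made here because the text does not pin them down: the K quantiles are taken at the equally spaced probabilities (i − 0.5)/K, and the standard deviation of the baseline outcome rates is taken to be 0.02, the value that scenario #3 describes as the baseline. Function names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def logit(p):
    return math.log(p / (1.0 - p))


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def normal_quantiles(mean, sd, K):
    """K quantiles of Normal(mean, sd^2). ASSUMPTION: probabilities (i - 0.5)/K."""
    dist = NormalDist(mean, sd)
    return [dist.inv_cdf((i - 0.5) / K) for i in range(1, K + 1)]


def simulate(K=20, Nk=2000, beta=(math.log(1.5), math.log(1.25)), rng=None):
    """Return a list of K dicts mapping (y, r, s) -> count, one dict per group."""
    rng = rng or random.Random()
    q_r = normal_quantiles(0.2, 0.1, K)
    q_s = normal_quantiles(0.2, 0.1, K)
    rng.shuffle(q_r)                       # random assignment of prevalences to groups
    rng.shuffle(q_s)
    pi00 = normal_quantiles(0.1, 0.02, K)  # baseline outcome rates
    data = []
    for k in range(K):
        beta0k = logit(pi00[k])            # intercept implied by the baseline rate
        counts = {(y, r, s): 0 for y in (0, 1) for r in (0, 1) for s in (0, 1)}
        for _ in range(Nk):
            r = int(rng.random() < q_r[k])
            s = int(rng.random() < q_s[k])
            y = int(rng.random() < expit(beta0k + beta[0] * r + beta[1] * s))
            counts[(y, r, s)] += 1
        data.append(counts)
    return data


dataset = simulate(rng=random.Random(2014))
print(sum(c[(1, r, s)] for c in dataset for r in (0, 1) for s in (0, 1)), "cases")
```

The aggregate and pure ecological data for each group follow by summing the appropriate cells, and the case–control sample by drawing without replacement from the cases and non-cases.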

We also considered six additional simulation scenarios, each modifying a single aspect of the data generating mechanism for the baseline scenario:

  • #1: Increase the mean Qrk across the K groups from 0.2 to 0.5.

  • #2: Decrease the standard deviation of the K Qrk and Qsk from 0.1 to 0.01.

  • #3: Decrease the standard deviation of the K π00k from 0.02 to 0.005.

  • #4: Increase the log-odds ratio associations, from (log 1.5, log 1.25) to (log 2.5, log 2.0).

  • #5: Decrease the group sizes from Nk = 2,000 to Nk = 1,000 for each k.

  • #6: Increase the number of groups from K = 20 to K = 40.

For each of the 10,000 simulated datasets, and under each of the 7 data scenarios, we computed aggregated totals for the outcomes and two covariates that would be observed under both an aggregate data design and a pure ecological study. We also drew a stratified random sample of n0k = n1k = 25 non-cases and cases from each of the K groups.

5.2. Analyses

For each simulated dataset we estimated components of model (1) using (i) the full data estimator based on all $\sum_{k=1}^{K} N_k$ individuals, denoted $\hat{\beta}_{\text{full}}$, and (ii) the estimator obtained by performing conditional logistic regression on the (stratified) case–control data alone, denoted $\hat{\beta}_{CC}$. For simulated hybrid aggregate data designs, we considered an additional two estimators: (iii) the exact hybrid likelihood estimator, denoted $\hat{\beta}_{A}$, and (iv) the approximate hybrid likelihood estimator based on the Binomial approximation, denoted $\tilde{\beta}_{A}^{Bin}$. For simulated hybrid pure ecological designs, we only considered the approximate hybrid likelihood estimator, denoted $\tilde{\beta}_{H}^{Bin}$. The exact hybrid likelihood estimator for the pure ecological setting was not considered because of the prohibitive computational burden. Throughout, for the approximate hybrid likelihood estimators all K contributions were of the approximate form.

5.3. Results

Tables 3–5 report operating characteristics for estimation of β1, the log-odds ratio for race in model (1). Estimates of β2, the log-odds ratio for smoking status, exhibited qualitatively similar operating characteristics; results are provided in Web Tables D1–3.

Table 3.

Operating characteristics for four likelihood-based estimators of β1 from model (1) using the full data, case-control data, and data from a hybrid aggregate data design, under the seven simulation scenarios described in Section 5.1. All values are based on 10,000 simulated datasets.

| | Individual-level: Full | Individual-level: Case–control | Hybrid aggregate data likelihood: Exact | Hybrid aggregate data likelihood: Binomial |
|---|---|---|---|---|
| Percent bias | | | | |
| Baseline | 0.0 | 0.2 | −0.2 | 0.4 |
| #1 | 0.1 | 0.4 | 0.8 | 1.2 |
| #2 | −0.0 | 0.3 | −0.1 | 0.4 |
| #3 | −0.1 | 0.3 | −0.1 | 0.4 |
| #4 | 0.0 | 0.2 | 0.6 | 1.0 |
| #5 | 0.0 | 0.4 | 0.2 | 2.0 |
| #6 | 0.0 | −0.2 | 0.6 | 0.5 |
| Estimated vs. true standard error × 100^a | | | | |
| Baseline | 99.6 | 99.4 | 99.9 | 99.6 |
| #1 | 98.6 | 99.7 | 99.4 | 99.2 |
| #2 | 101.0 | 100.1 | 100.6 | 100.3 |
| #3 | 99.1 | 99.0 | 99.5 | 99.2 |
| #4 | 100.1 | 99.4 | 100.2 | 100.0 |
| #5 | 99.0 | 98.5 | 98.1 | 97.2 |
| #6 | 101.0 | 98.9 | 98.3 | 98.3 |
| Coverage probability × 100^a | | | | |
| Baseline | 94.9 | 94.8 | 95.0 | 94.9 |
| #1 | 94.5 | 94.8 | 95.0 | 94.9 |
| #2 | 95.2 | 94.9 | 95.3 | 95.2 |
| #3 | 94.8 | 94.8 | 95.3 | 95.1 |
| #4 | 94.9 | 94.9 | 95.2 | 95.1 |
| #5 | 94.7 | 95.0 | 94.8 | 94.3 |
| #6 | 95.5 | 95.0 | 94.7 | 94.7 |

a. Estimated standard errors and coverage probabilities are based on the inverse of the Hessian of the corresponding (possibly misspecified) likelihood.

Table 5.

Relative uncertainty^a for five likelihood-based estimators of β1 from model (1) under the seven simulation scenarios described in Section 5.1. Shown are results for both the hybrid aggregate data and hybrid pure ecological designs. All values are based on 10,000 simulated datasets.

| | Individual-level: Full | Individual-level: Case–control | Hybrid likelihood: Exact^b | Hybrid likelihood: Binomial |
|---|---|---|---|---|
| Aggregate data | | | | |
| Baseline | 23.4 | 100 | 76.4 | 76.8 |
| #1 | 24.2 | 100 | 80.3 | 80.6 |
| #2 | 23.6 | 100 | 75.7 | 76.2 |
| #3 | 23.8 | 100 | 76.1 | 76.6 |
| #4 | 20.8 | 100 | 76.6 | 76.9 |
| #5 | 34.3 | 100 | 77.6 | 79.0 |
| #6 | 16.2 | 100 | 77.8 | 77.7 |
| Pure ecological | | | | |
| Baseline | 23.7 | 100 | | 78.5 |
| #1 | 23.9 | 100 | | 79.5 |
| #2 | 23.8 | 100 | | 77.2 |
| #3 | 24.0 | 100 | | 78.0 |
| #4 | 20.8 | 100 | | 78.9 |
| #5 | 15.0 | 100 | | 76.9 |
| #6 | 16.6 | 100 | | 77.4 |

a. Ratio (× 100) of the standard error for the estimator relative to that of the case–control estimator.

b. Not considered for the pure ecological design. See Section 5.2.

From the upper portions of Tables 3 and 4, all of the estimators exhibited very low bias. Across all estimator/data scenarios, the greatest percent bias was only 2.3%. Furthermore, none of the approximate hybrid likelihood estimators, under both the hybrid aggregate data and the hybrid pure ecological designs, exhibited any systematically greater bias than the exact hybrid likelihood estimator.

Table 4.

Operating characteristics for three estimators of β1 from model (1) using the full data, case-control data, and data from a hybrid pure ecological design, under the seven simulation scenarios described in Section 5.1. All values are based on 10,000 simulated datasets.

                 Individual-level data       Hybrid pure ecological likelihood
                 Full      Case–control      Binomial
Percent bias
  Baseline       −0.0         0.3             0.6
  #1              0.0         0.1             1.0
  #2              0.0         0.1             0.5
  #3             −0.1         0.1             0.4
  #4              0.0         0.2             1.1
  #5             −0.0         0.3            −0.4
  #6              0.0         0.2             0.5
Estimated vs. true standard error × 100 (a)
  Baseline       98.8       100.0            98.2
  #1             98.9        98.8            99.8
  #2             99.2        99.4            98.3
  #3             98.8        99.7            98.2
  #4             99.3        99.5            98.0
  #5            100.2       100.1            99.9
  #6             99.5        99.5            99.3
Coverage probability × 100 (a)
  Baseline       94.7        95.0            94.5
  #1             94.7        94.9            94.9
  #2             94.7        95.1            94.6
  #3             94.9        94.9            94.6
  #4             94.6        95.0            94.3
  #5             94.8        95.4            94.8
  #6             95.0        95.0            95.0

(a) Estimated standard errors and coverage probabilities are based on the inverse of the Hessian of the corresponding (possibly misspecified) likelihood.

From the middle and lower portions of Tables 3 and 4 we see that naïve standard error estimation for the approximate hybrid likelihood estimators was not subject to any systematic bias. Specifically, the mean of the standard error estimates based on the Hessians for the misspecified approximate likelihoods did not exhibit any systematic deviation from the true standard error (calculated as the standard deviation of the 10,000 point estimates). Furthermore, in all simulation scenarios, 95% confidence intervals based on the naïve standard error estimates attained coverage probabilities very close to the nominal rate. Consequently, despite the fact that the approximate hybrid likelihood is a misspecified likelihood, estimation and inference remain valid in a broad range of data scenarios.
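
As a minimal sketch of how the three operating characteristics reported in Tables 3 and 4 could be computed from simulation output, the following Python function takes the point estimates and model-based standard errors across simulated datasets. The function name, the normalization of percent bias by the true value, and the use of Wald intervals with z = 1.96 are assumptions made for illustration; the toy inputs are synthetic draws, not the simulation data from this article.

```python
import random
import statistics


def operating_characteristics(estimates, std_errors, true_value, z=1.96):
    """Summarize simulation output for one estimator of a single parameter.

    estimates  : point estimates, one per simulated dataset
    std_errors : model-based standard error estimates, one per dataset
    true_value : parameter value used to generate the data
    """
    mean_estimate = statistics.fmean(estimates)
    # Assumed definition: bias relative to the true value, as a percentage.
    percent_bias = 100.0 * (mean_estimate - true_value) / true_value
    # "True" standard error: standard deviation of the point estimates.
    empirical_se = statistics.stdev(estimates)
    se_ratio = 100.0 * statistics.fmean(std_errors) / empirical_se
    # Share of Wald intervals that contain the true value.
    n_covered = sum(
        1
        for est, se in zip(estimates, std_errors)
        if est - z * se <= true_value <= est + z * se
    )
    coverage = 100.0 * n_covered / len(estimates)
    return {
        "percent_bias": percent_bias,
        "se_ratio": se_ratio,
        "coverage": coverage,
    }


# Toy check with synthetic draws (hypothetical values, not the paper's data).
rng = random.Random(2014)
true_beta = 0.5
toy_estimates = [rng.gauss(true_beta, 0.1) for _ in range(5000)]
toy_std_errors = [0.1] * len(toy_estimates)
summary = operating_characteristics(toy_estimates, toy_std_errors, true_beta)
print(summary)
```

For a well-calibrated estimator such as the toy one above, percent bias should be near 0, the standard error ratio near 100, and coverage near 95.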

Finally, Table 5 reports on relative uncertainty, defined as the ratio of the standard error for a given estimator to the standard error of the conditional logistic regression estimator $\hat{\beta}_{CC}$. From the first column, as expected, estimation of β1 is substantially more efficient when the full data likelihood is used. The hybrid likelihood columns indicate the efficiency gain associated with the combination of the group- and case–control data. Under the aggregate data scenarios considered, standard errors based on the combined data are approximately 75–80% of those based on the case–control data. Furthermore, there is no systematic detriment in this efficiency gain when one uses the approximate forms of the hybrid likelihood. Under the pure ecological scenarios we again see substantial efficiency gains associated with the combination of group- and case–control data. For these scenarios we were not able to evaluate the exact hybrid likelihood (due to the computational burden). As such, it is possible that analysts could enjoy even further gains through use of the exact form, although we believe the gains from having used the approximate form are important nonetheless.
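
To make the relative uncertainty measure concrete, the sketch below computes it from two standard errors and converts a tabulated value into a rough "equivalent sample size" factor. The conversion is an illustrative heuristic that is not part of the article: it assumes the variance of the case–control estimator scales inversely with the case–control sample size. Function names are hypothetical.

```python
def relative_uncertainty(se_estimator, se_case_control):
    """Standard error of an estimator as a percentage of the case-control SE."""
    return 100.0 * se_estimator / se_case_control


def implied_sample_size_factor(relative_uncertainty_pct):
    """Factor by which a case-control sample would have to grow to match the
    estimator's standard error, ASSUMING variance is proportional to 1/n."""
    return (100.0 / relative_uncertainty_pct) ** 2


# Baseline aggregate data scenario, exact hybrid likelihood (Table 5): 76.4.
factor = implied_sample_size_factor(76.4)
print(f"Implied case-control sample size factor: {factor:.2f}")
```

Under that assumption, a relative uncertainty in the range of 75–80% corresponds to the precision of a case–control sample roughly 1.6 to 1.8 times larger.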

6. Application

To further illustrate the methods in Section 4 we consider a more detailed analysis of risk factors for low birth weight. Specifically, we expand on model (1) by considering three additional covariates: premature birth, defined as a birth at 37 weeks; plurality, taking on levels of a singleton birth, twins, or triplets or more; and whether or not the mother experienced a low weight gain during the pregnancy (i.e., < 15 lbs). Furthermore, the model is expanded to include an interaction between the mother’s race and the mother’s smoking status.

Restricting to the 373,438 births with complete data across the 100 North Carolina counties, we replicated the data that would have been observed in an aggregate data design. An individual birth could be categorized into one of 2×2×2×3×2 = 48 unique levels across the five covariates we consider (race, smoking, premature birth, plurality, and low weight gain). Hence the observed group-level data consist of K = 100 tables of dimension 48×2, each analogous to the first table in the middle row of Table 1. To emulate a hybrid aggregate data study, we took a single stratified case–control sample of $n_{0k} = n_{1k} = 25$ from each county.
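
The construction described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the covariate labels, function names, and the synthetic county are hypothetical, and the rule of taking all available cases or controls when a county has fewer than 25 is an assumption the text does not address.

```python
import itertools
import random
from collections import Counter

# Covariate levels as described in Section 6; the labels are illustrative.
LEVELS = {
    "race": ("white", "non-white"),
    "smoker": ("no", "yes"),
    "premature": ("no", "yes"),
    "plurality": ("one", "two", "three or more"),
    "low_weight_gain": ("no", "yes"),
}
CELLS = list(itertools.product(*LEVELS.values()))  # 2*2*2*3*2 = 48 cells


def group_level_table(births):
    """Return a 48 x 2 table of counts for one county.

    births is a list of (cell, outcome) tuples; rows follow CELLS and the
    columns are outcome 0 (not low birth weight) and 1 (low birth weight).
    """
    counts = Counter(births)
    return [[counts[(cell, 0)], counts[(cell, 1)]] for cell in CELLS]


def stratified_case_control(births, n_controls=25, n_cases=25, rng=None):
    """Sample controls and cases separately, without replacement."""
    rng = rng or random.Random()
    controls = [b for b in births if b[1] == 0]
    cases = [b for b in births if b[1] == 1]
    # Assumption: if fewer than requested are available, take all of them.
    return (
        rng.sample(controls, min(n_controls, len(controls))),
        rng.sample(cases, min(n_cases, len(cases))),
    )


# Synthetic county (hypothetical data, for illustration only).
rng = random.Random(0)
county = [(rng.choice(CELLS), int(rng.random() < 0.1)) for _ in range(2000)]
table = group_level_table(county)
controls, cases = stratified_case_control(county, rng=rng)
print(len(CELLS), len(table), len(controls), len(cases))
```

Repeating both steps for each of the K = 100 counties would yield the group-level tables and the stratified case–control samples that the hybrid aggregate data design combines.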

Table 6 reports point and standard error estimates for the log-odds ratio parameters in the expanded model based on three analyses. The first pair of columns reports on a fit of the model using the full data (i.e., all N = 373,438 individual records). The second pair reports on results from a conditional logistic regression analysis of the stratified case–control sample. The third pair combines the stratified case–control sample with the group-level data via an approximate hybrid aggregate data likelihood analogous to expression (8), using the binomial approximation. Note that, using a result presented in Section F of the Web Appendix, we estimated that the exact hybrid likelihood for the expanded model corresponds to a summation of approximately $10^{124}$ terms. In addition, the pure ecological data likelihood requires $8.8 \times 10^{12}$ summations. From Figure 2 we estimate that, for a model with approximately 100 parameters, likelihoods involving over $10^{8}$, $10^{9}$, and $10^{11}$ summations will require more than a day, week, and year to maximize, respectively. Hence, for all practical purposes, the exact hybrid likelihood and pure ecological likelihood could not be used to perform estimation or inference. In general, computation time for any sized model becomes infeasible when the likelihood involves $10^{9}$ summations.
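
The thresholds quoted above can be encoded as a simple lookup, shown here as a rough sketch. The function name is hypothetical, and the thresholds apply only to a model with approximately 100 parameters, as stated in the text; nothing is implied about likelihoods below the smallest threshold.

```python
def minimum_maximization_time(n_summations):
    """Lower bound on the time needed to maximize a likelihood with the given
    number of summations, using the thresholds quoted in the text for a model
    with approximately 100 parameters. Returns None below all thresholds."""
    thresholds = [
        (1e11, "more than a year"),
        (1e9, "more than a week"),
        (1e8, "more than a day"),
    ]
    for threshold, label in thresholds:
        if n_summations > threshold:
            return label
    return None


# Counts quoted in the text for the expanded model.
print(minimum_maximization_time(1e124))   # exact hybrid likelihood
print(minimum_maximization_time(8.8e12))  # pure ecological likelihood
```

Both counts quoted for the expanded model fall well beyond the largest threshold.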

Table 6.

Point and standard error estimates for log-odds ratio parameters in the extended model, described in Section 6, based on the North Carolina data. Estimates for the hybrid aggregate data are based on the binomial approximation to the hybrid likelihood for all K = 100 counties.

                         Full data          Case–control data    Hybrid aggregate data
                         Est      SE        Est      SE          Est      SE
Early birth              2.92    0.014      2.82    0.088        3.03    0.061
Number of babies
  One                    1.00               1.00                 1.00
  Two                    2.33    0.025      2.92    0.236        2.81    0.143
  Three or more          4.30    0.209      1.62    1.028        2.07    0.769
Low weight gain          0.66    0.018      0.57    0.099        0.59    0.076
Non-white                0.70    0.016      0.71    0.097        0.69    0.077
Smoker                   0.90    0.024      0.97    0.107        0.94    0.085
Non-white × Smoker      −0.40    0.041      0.36    0.215       −0.33    0.172

From the point estimates based on the full data, we find increased risk of a low birth weight event associated with an early birth, an increased plurality, low weight gain by the mother, non-white race and the mother smoking during pregnancy. That the interaction is statistically significant indicates that if the mother smokes during pregnancy, the impact is somewhat less for non-white babies than white babies. From the point estimates based on the case–control data alone and the combined data analyses, the conclusions are qualitatively the same. However, the standard errors based on the combined data sources are between 20% and 40% lower than those based on the case–control data alone. As such, use of the approximate hybrid aggregate data likelihood has resulted in substantial efficiency gains. Finally, estimates of the county-specific intercepts based on the full data likelihood and the approximate hybrid likelihood were also consistent; of the K = 100 intercept estimates, the largest discrepancy between the hybrid aggregate data design estimates and the full-data point estimates is only a 6% difference. Web Appendix G provides a scatterplot of the two sets of estimates.
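
The stated reduction in standard errors can be checked directly from Table 6. The short Python sketch below recomputes the percentage reduction for each parameter from the tabulated case–control and hybrid aggregate standard errors; the dictionary keys and function name are illustrative labels only.

```python
# (case-control SE, hybrid aggregate SE) pairs copied from Table 6.
TABLE6_SE = {
    "Early birth": (0.088, 0.061),
    "Two babies": (0.236, 0.143),
    "Three or more babies": (1.028, 0.769),
    "Low weight gain": (0.099, 0.076),
    "Non-white": (0.097, 0.077),
    "Smoker": (0.107, 0.085),
    "Non-white x Smoker": (0.215, 0.172),
}


def percent_reduction(se_case_control, se_hybrid):
    """Percentage by which the hybrid SE is lower than the case-control SE."""
    return 100.0 * (1.0 - se_hybrid / se_case_control)


reductions = {
    name: percent_reduction(se_cc, se_hybrid)
    for name, (se_cc, se_hybrid) in TABLE6_SE.items()
}
for name, value in reductions.items():
    print(f"{name:22s} {value:5.1f}%")
```

The reductions range from about 20% (the interaction term) to about 39% (twins), in line with the range stated in the text.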

7. Discussion

In this article we have proposed a pragmatic approach to efficient estimation and inference for individual-level models based on data from hybrid designs that combine group- and individual-level data. While use of the exact hybrid likelihood will be prohibitively expensive in most settings, the proposed approximations give researchers a practical tool for making use of important group-level data that could otherwise be ignored. A comprehensive simulation study showed that, despite the fact that use of the approximate hybrid likelihood corresponds to use of a misspecified likelihood, estimation and inference remain valid in a broad range of data scenarios. Furthermore, over the broad range of scenarios we considered, use of the approximate form does not induce any systematic penalty in terms of the efficiency gains that one should expect when one combines the two sources of information. In short, the proposed method provides a practical approach to combining group- and individual-level data from a hybrid design without incurring any penalties in terms of bias and efficiency.

While the simulation study of Section 5 considered both the hybrid pure ecological and the hybrid aggregate data designs, the application in Section 6 only considered the latter. We found that, even with the proposed approximation, computation for the former for the model of interest was prohibitive; that some of the groups were quite large and that the model had 106 parameters both contributed to this. Interestingly, attempts to alleviate the burden by changing the data scenario (e.g., reducing the number of counties to 50 and/or only considering data from 2009) did not reduce the burden to a point of being practical. Additional work is therefore needed on practical estimation/inference procedures for when researchers seek to combine a pure ecological study with case–control data in general data settings.

Finally, to our knowledge, this article is the first to formally describe a hybrid aggregate data design, which supplements the aggregate data design of Prentice and Sheppard (1995) with case–control data from each group. The most closely related design is the integrated aggregate data design of Martínez et al. (2007, 2009), which supplements an aggregate data design with individual-level data prospectively sampled within each group. Beyond the sampling of individual-level data, methods for the two designs differ in that estimation/inference for the integrated aggregate design typically focuses on a log-linear model; in contrast, this article considers a logistic regression model. Jointly, the hybrid and integrated aggregate data designs provide a comprehensive set of tools for rare and non-rare outcomes, respectively.

Acknowledgements

The authors are grateful for feedback from the Editor and Associate Editor. This research was supported, in part, by grants R01 CA125801-01, T32 A1007358, and T32 ES007142 from the National Institutes of Health.

Footnotes

Supplementary Materials

Web Appendices and tables referenced in Sections 2, 4, and 6, along with code, are available with this paper at the Biometrics website on Wiley Online Library.

References

  1. Chen Y, Ebenstein A, Greenstone M, Li H. Evidence on the impact of sustained exposure to air pollution on life expectancy from China’s Huai River policy. Proceedings of the National Academy of Sciences. 2013;110:12936–12941. doi: 10.1073/pnas.1300018110.
  2. Greenland S. Principles of multilevel modelling. International Journal of Epidemiology. 2000;29:158–167. doi: 10.1093/ije/29.1.158.
  3. Greenland S, Morgenstern H. Ecological bias, confounding, and effect modification. International Journal of Epidemiology. 1989;18:269–274. doi: 10.1093/ije/18.1.269.
  4. Haneuse S, Bartell S. Designs for the combination of group- and individual-level data. Epidemiology. 2011;22:382. doi: 10.1097/EDE.0b013e3182125cff.
  5. Haneuse S, Wakefield J. Hierarchical models for combining ecological and case-control data. Biometrics. 2007;63:128–136. doi: 10.1111/j.1541-0420.2006.00673.x.
  6. Haneuse S, Wakefield J. The combination of ecological and case-control data. Journal of the Royal Statistical Society, Series B. 2008a;70:73–93. doi: 10.1111/j.1467-9868.2007.00628.x.
  7. Haneuse S, Wakefield J. Geographic-based ecological correlation studies using supplemental case–control data. Statistics in Medicine. 2008b;27:864–887. doi: 10.1002/sim.2979.
  8. Johnson NL, Kotz S. Distributions in Statistics. Wiley Series in Probability and Mathematical Statistics. New York: John Wiley & Sons, Inc.; 1969.
  9. Martínez JM, Benach J, Benavides FG, Muntaner C, Clèries R, Zurriaga O, Martínez-Beneito MA, Yasui Y. Improving multilevel analyses: The integrated epidemiologic design. Epidemiology. 2009;20:525–532. doi: 10.1097/EDE.0b013e3181a48c33.
  10. Martínez JM, Benach J, Ginebra J, Benavides FG, Yasui Y. An integrated analysis of individual and aggregated health data using estimating equations. The International Journal of Biostatistics. 2007;3. doi: 10.2202/1557-4679.1060.
  11. Piantadosi S, Byar D, Green S. The ecological fallacy. American Journal of Epidemiology. 1988;127:893–904. doi: 10.1093/oxfordjournals.aje.a114892.
  12. Pope CA III, Ezzati M, Dockery DW. Fine-particulate air pollution and life expectancy in the United States. New England Journal of Medicine. 2009;360:376–386. doi: 10.1056/NEJMsa0805646.
  13. Prentice RL, Sheppard L. Aggregate data studies of disease risk factors. Biometrika. 1995;82:113–125.
  14. Robinson W. Ecological correlations and the behavior of individuals. American Sociological Review. 1950;15:351–357.
  15. Salway R, Wakefield J. Sources of bias in ecological studies of non-rare events. Environmental and Ecological Statistics. 2005;12:321–347.
  16. Sheppard L. Ecological study design. In: Chen Yi-Hau., Prof., editor. Encyclopedia of Environmetrics. Vol. 2. New York: John Wiley & Sons, Ltd.; 2002. pp. 602–606.
  17. Sheppard L. Insights on bias and information in group-level studies. Biostatistics. 2003;4:265–278. doi: 10.1093/biostatistics/4.2.265.
  18. Wakefield J. Ecological inference for 2×2 tables. Journal of the Royal Statistical Society, Series A. 2004;167:385–445.
  19. Wakefield J, Haneuse S. Overcoming ecologic bias using the two-phase study design. American Journal of Epidemiology. 2008;167:908–916. doi: 10.1093/aje/kwm386.
  20. Wakefield J, Shaddick G. Health-exposure modeling and the ecological fallacy. Biostatistics. 2006;7:438–455. doi: 10.1093/biostatistics/kxj017.
  21. Wong GY, Mason WM. The hierarchical logistic regression model for multilevel analysis. Journal of the American Statistical Association. 1985;80:513–524.
