Disease Risk Estimation by Combining Case–Control Data with Aggregated Information on the Population at Risk

Xiaohui Chang; Rasmus Waagepetersen; Herbert Yu; Xiaomei Ma; Theodore R Holford; Rong Wang; Yongtao Guan

doi:10.1111/biom.12256

Author manuscript; available in PMC: 2016 Mar 8.

Published in final edited form as: Biometrics. 2014 Oct 28;71(1):114–121. doi: 10.1111/biom.12256


PMCID: PMC4782587 NIHMSID: NIHMS669668 PMID: 25351292

Summary

We propose a novel statistical framework by supplementing case–control data with summary statistics on the population at risk for a subset of risk factors. Our approach is to first form two unbiased estimating equations, one based on the case–control data and the other on both the case data and the summary statistics, and then optimally combine them to derive another estimating equation to be used for the estimation. The proposed method is computationally simple and more efficient than standard approaches based on case–control data alone. We also establish asymptotic properties of the resulting estimator, and investigate its finite-sample performance through simulation. As a substantive application, we apply the proposed method to investigate risk factors for endometrial cancer, by using data from a recently completed population-based case–control study and summary statistics from the Behavioral Risk Factor Surveillance System, the Population Estimates Program of the US Census Bureau, and the Connecticut Department of Transportation.

Keywords: Aggregated information, Estimating equation, Spatial epidemiology, Spatial point process

1. Introduction

Population-based case–control studies typically consist of a subset of individuals who have developed the disease (cases) and a representative sample of individuals from the population at risk (controls). For each study subject, an extensive list of risk factors is often collected. The sample size of the controls can be very limited in many studies due to cost constraints. As a result, large discrepancies may occur between the distributions of risk factors in the controls and in the population, which could in turn limit one’s ability to detect significant risk factors. However, more accurate information can be obtained from other sources for at least a subset of the risk factors. For example, the US Census provides summary statistics for many demographic and socioeconomic status variables at very fine spatial scales, and the Behavioral Risk Factor Surveillance System (BRFSS), which is a large state-based system of health surveys, routinely releases summary statistics concerning lifestyle variables, such as alcohol consumption, smoking, exercise, and overweight and obesity, across different age-by-sex groups. These summary statistics are often based on a much larger number of study participants and are therefore appreciably more accurate than their counterparts that can be derived from the controls in a case–control study.

We propose a novel statistical framework for disease risk estimation by supplementing standard case–control data with aggregated information of risk factors on the population. We consider a common regression setting as is often used in analyzing case–control data alone (see Section 3 for details). To estimate the regression parameters, we first form two unbiased estimating equations, one based on the case–control data and the other based on the case data and some summary statistics, and then efficiently combine them to derive a new unbiased estimating equation. Our proposed method is computationally simple and can lead to parameter estimates with smaller standard errors than standard methods using case–control data alone. As a substantive application, we investigate risk factors for endometrial cancer, based on a recently completed population-based case–control study and summary statistics extracted from data provided by BRFSS, the Population Estimates Program of the US Census Bureau, and the Connecticut Department of Transportation.

Several other authors have studied the problem of combining individual-level and aggregated epidemiological data. For example, Prentice and Sheppard (1995) and Wakefield (2004) considered combining aggregated disease data with individual control or cohort data, and Haneuse and Wakefield (2007, 2008a,b) proposed a hybrid design to combine aggregated disease data with either case–control data or control data alone. However, all these articles are concerned with aggregated outcomes on the diseased subjects but not on the non-diseased ones. Diggle et al. (2010) developed procedures for combining individual-level case data and spatially aggregated information on the population at risk. Their methods require spatially aggregated information for all risk factors which may not always be available in practice. Moreover, they did not consider any additional control data in the analysis. In contrast, we combine both cases and controls in a case–control study as well as summary statistics on the population even when such information is available only for a subset of risk factors.

The remainder of the article is organized as follows. We describe the endometrial cancer data in Section 2 and provide some necessary background in Section 3. We introduce the proposed method in Section 4, assess its numerical properties through a simulation study in Section 5, and apply it to analyze the endometrial cancer data in Section 6. We conclude with a discussion in Section 7. Additional theoretical results, MATLAB code, and artificial data are given in the online Supplementary Materials.

2. Description of Data

2.1. Case–Control Study for Endometrial Cancer

Endometrial (uterine corpus) cancer is the fourth most common cancer among women in the United States. The American Cancer Society estimates 51,577 newly diagnosed endometrial cancer cases and 8,418 deaths in 2014 (http://www.cancer.org/research/cancerfactsfigures/). To investigate risk factors for endometrial cancer, a population-based case–control study was conducted in Connecticut between October 2004 and March 2009. The study included 668 Connecticut residents between the ages of 35 and 80 who were newly diagnosed with endometrial cancer during the study period and 665 control subjects who were identified through a random-digit dialing method and were frequency matched to cases by age group (35–51, 52–59, 60–64, 65–69, 70–74, and 75–79 years). All the study participants provided signed informed consent before an in-person interview. During the interview, a structured questionnaire was used to collect information such as ethnicity, education, lifestyle, menstrual and reproductive features, self-reported weight, height and other physical dimensions. More details on the study design can be found in Lu et al. (2011).

2.2. Behavioral Risk Factor Surveillance System Data

BRFSS is a state-based system of health surveys that collect information on health risk behaviors, preventive health practices, and health care access primarily related to chronic diseases and injury. It was first established in 1984 by the Centers for Disease Control and Prevention (CDC); with more than 350,000 adults interviewed each year, it is the largest telephone health survey in the world. The CDC routinely releases summary tables on variables collected through the BRFSS surveys. The Web Enabled Analysis Tool (WEAT), which is available on the CDC’s website http://www.cdc.gov/brfss/index.htm, allows researchers to create cross tabulation reports from the BRFSS data. Using WEAT, we have calculated annual summary statistics on tobacco use, education level and body mass index (BMI = weight (kg)/height² (m²)) from 2005 to 2008 across different age-by-sex groups. We will take advantage of this information to investigate risk factors for endometrial cancer.

2.3. Traffic and Census Data

As a major source of air pollution, automobile emissions have been investigated for their potential associations with various types of cancer; see Pearson, Wachtel, and Ebi (2000), Raaschou-Nielsen et al. (2001), Beelen et al. (2008), and the references therein. For endometrial cancer, Grant (2009) found a significant positive association between an air pollution index and endometrial cancer mortality rates in 1950–1969 in the US, but no such association between the same air pollution index and the mortality rates in 1970–1994.

In this article, we will examine the effect of exposure to traffic on endometrial cancer risk, using average daily traffic (ADT) data on all state and interstate highways in 2007 provided by the Connecticut Department of Transportation. For each subject enrolled in the case–control study, her exposure to traffic can be derived by integrating the ADT values over highways within a fixed buffer zone from the subject’s residency (Holford et al., 2010).

The Population Estimates Program of the US Census Bureau produces annual population count estimates by age, gender and ethnicity for each state at the county level. We extracted data on these variables in Connecticut from 2005 to 2008. To get aggregated traffic data at a finer spatial level, we first obtained the 2010 US Census data for Connecticut at the zip code level and then scaled them proportionally to estimate the population counts from 2005 to 2008. For each zip code, we defined an aggregated exposure as the product of the traffic exposure at the zip code centroid and the rescaled population counts. Although this aggregated exposure is not a sum of the true exposures of all subjects within a given zip code, it still contains information about how the population may be exposed to traffic differently across different zip codes. We will incorporate this new information in our analysis to investigate whether exposure to traffic is associated with risk for developing endometrial cancer.

3. Background

3.1. Notation and Set-Up

Let N and M be two spatial point processes generating the random spatial locations of cases and controls over a geographic region, D. We represent the spatially varying population density by λ₀(s). Let Z(s) be a p × 1 vector of risk factors for an individual at location s. We assume that both N and M are Poisson, with their respective intensities given by λ(s; β) = λ₀(s) exp{Z(s)′β} for some unknown β and ρ(s) = α(s)λ₀(s), where α(s) denotes the probability for an individual at s to be included in the controls of a case–control study. We assume that α(·) is known given the sampling design used to collect the controls. Our interest is to estimate β, which defines the effect of potential risk factors on cancer risk.

Suppose that Z(s) = {X(s)′, Y(s)′}′, where X(·) and Y(·) are respectively p_x × 1 and p_y × 1 subvectors of Z(·) with p = p_x + p_y. The first element of X(·) is always equal to one. We assume that aggregated information on the population is available for X(·), but not for Y(·), over K strata, D_k, for k = 1, …, K. For ease of presentation, we assume that D_k’s are geographic regions that form a partition of D, but in general, D_k’s can be strata based on non-geographic criteria such as age and sex.

3.2. Estimating Equation for Case–Control Data

Diggle and Rowlingson (1994) proposed a conditional likelihood approach to estimate β using case–control data. They argued that conditional on an observed event s ∈ (N ∪ M), the probability for it to be from N is $p (s; β) = \frac{λ (s; β)}{λ (s; β) + ρ (s)}$. The conditional log-likelihood is then given as $L (β) = \sum_{s \in N} \log p (s; β) + \sum_{s \in M} \log \{1 - p (s; β)\}$. Note that maximizing L(β) is equivalent to solving the unbiased estimating equation

\begin{array}{l} U_{c} (β) \equiv \sum_{s \in (N \cap D)} Z (s) \frac{α (s)}{α (s) + \exp \{Z (s)' β\}} \\ - \sum_{s \in (M \cap D)} Z (s) \frac{\exp \{Z (s)' β\}}{α (s) + \exp \{Z (s)' β\}} = 0_{p}, \end{array}

(1)

where 0_p is a p × 1 zero vector. If α(·) is constant, U_c(β) coincides with the score function of the commonly used logistic regression analysis for case–control data.
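As an illustration of equation (1), the following minimal Python sketch evaluates U_c(β) for given data. It is not the authors' MATLAB implementation; the function name `u_c`, the list-based data layout, and the convention of passing α as a function of the covariate vector (rather than of the location s) are assumptions made for the example.

```python
import math

def u_c(beta, cases, controls, alpha):
    """Evaluate the case-control estimating function U_c(beta) of equation (1).

    cases, controls: lists of covariate vectors Z(s), each a list of floats
    alpha: hypothetical callable returning the control sampling probability
           for a covariate vector (a simplification of alpha(s))
    """
    p = len(beta)
    total = [0.0] * p
    for z in cases:
        a = alpha(z)
        e = math.exp(sum(zj * bj for zj, bj in zip(z, beta)))
        w = a / (a + e)            # equals 1 - p(s; beta)
        for j in range(p):
            total[j] += z[j] * w
    for z in controls:
        a = alpha(z)
        e = math.exp(sum(zj * bj for zj, bj in zip(z, beta)))
        w = e / (a + e)            # equals p(s; beta)
        for j in range(p):
            total[j] -= z[j] * w
    return total
```

With an intercept-only model, constant α, and exp(β₀) = α, both weights equal 1/2, so the function returns zero whenever the numbers of cases and controls coincide. A root finder applied to `u_c` would then give the conditional-likelihood estimate.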

3.3. Estimating Equation for Individual-Level Case Data and Aggregated Population Data

Let μ_k denote a p_x × 1 vector of population summaries aggregated over D_k, for k = 1, …, K. In the subsequent development, we assume that μ_k’s are known and write $μ_{k} = \int_{D_{k}} X_{k} (s) λ_{0} (s) d s$, where X_k(·) is a p_x × 1 vector related to X(·). Often X_k(·) = X(·), that is, μ_k’s are aggregated over the risk factors of the population in D_k. However, X_k(·) can also be different from X(·). For example, we used traffic exposure at the zip code centroid to derive an aggregated traffic exposure in Section 2.3. Let β* denote the true value of β. By Campbell’s Theorem (e.g., Møller and Waagepetersen, 2004) and the definition of λ(s; β), ${\hat{μ}}_{k} (β) = \sum_{s \in (N \cap D_{k})} X_{k} (s) / \exp \{Z (s)' β\}$ is an unbiased estimator for μ_k at β = β*. Hence,

U_{a} (β) = \sum_{k = 1}^{K} w_{k} \{{\hat{μ}}_{k} (β) - μ_{k}\} = 0_{p_{x}}

(2)

forms an unbiased estimating equation (Diggle et al., 2010), where w_k’s are some pre-defined weights. Note that solving (2) alone will not yield a unique estimate for β* since p_x < p, that is, aggregated information is not available for all risk factors. If p_x = p and X(·) is spatially continuous, Diggle et al. (2010) showed that efficiency of the resulting estimator from solving (2) increased with K. As K increases, the average of X(s) for s ∈ D_k, which is denoted as ${\bar{X}}_{k}$ and can be easily derived from μ_k, approaches X(s). Thus, most efficiency gains can be achieved from incorporating the aggregated information when ${\bar{X}}_{k}$ approximates X(s) well for s ∈ D_k. In such a case, there is often a large number of strata and consequently a small population size in each stratum, as well as a large between-group variation in ${\bar{X}}_{k}$ ’s.
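Equation (2) can be sketched in the same way. The Python function below computes the inverse-intensity-weighted case sums for each stratum and accumulates the weighted differences from the known μ_k. The name `u_a` and the data layout (one list of `(x_k, z)` pairs per stratum) are illustrative assumptions, not part of the original method description.

```python
import math

def u_a(beta, cases_by_stratum, mu, weights):
    """Evaluate the aggregated-data estimating function U_a(beta) of equation (2).

    cases_by_stratum[k]: list of (x_k, z) pairs for the cases in stratum D_k,
        where x_k plays the role of X_k(s) and z the role of Z(s)
    mu[k]: known aggregated vector mu_k; weights[k]: scalar weight w_k
    """
    p_x = len(mu[0])
    total = [0.0] * p_x
    for k, cases in enumerate(cases_by_stratum):
        mu_hat = [0.0] * p_x
        for x_k, z in cases:
            # Each case is weighted by 1 / exp{Z(s)' beta}.
            inv = math.exp(-sum(zj * bj for zj, bj in zip(z, beta)))
            for j in range(p_x):
                mu_hat[j] += x_k[j] * inv
        for j in range(p_x):
            total[j] += weights[k] * (mu_hat[j] - mu[k][j])
    return total
```

Because the output has only p_x components, this function alone cannot pin down all p elements of β when p_x < p, which is the identifiability point made above.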

4. Combining Estimating Equations for Case–Control and Aggregated Data

The estimating equations given in (1) and (2) both contain information on β*. Below we develop a mechanism to ‘optimally’ combine them. To do so, we first write U(β) = {U_c(β)′, U_a(β)′}′ and define $J (β) = E \left\{\frac{\partial U (β)}{\partial β}\right\}$ and V(β) = Var {U(β)}, where expectation and variance are at β = β*. Let Ĵ(β) and $\hat{V} (β)$ be consistent estimators of J(β) and V(β) when β = β*. We then follow Heyde (1997) to consider the estimating equation

\tilde{U} (β) \equiv \hat{J} {(β)}^{'} \hat{V} {(β)}^{- 1} U (β) = 0_{p}

(3)

The resulting estimator obtained by solving (3), denoted by $\hat{β}$ , is ‘optimal’, in the sense that Ũ(β) has the maximum Godambe information (Heyde, 1997) among all estimating functions taking the form A(β) U(β), where A(β) is an arbitrary p × (p + p_x) real matrix.
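The optimality claim can be made concrete with a short calculation. The sketch below uses the standard definition of Godambe information for an unbiased estimating function G(β) = A U(β), treating A as fixed at β = β* and writing J = J(β*) and V = Var{U(β*)}. It is a textbook-style illustration under these assumptions, not a reproduction of the authors' proof.

```latex
% Sketch: G(β) = A U(β), with A a fixed p × (p + p_x) matrix evaluated at β = β*.
\begin{aligned}
E\left\{\frac{\partial G(β^{*})}{\partial β}\right\} &= A J, \qquad \mathrm{Var}\{G(β^{*})\} = A V A', \\
I_{G} &= (A J)' (A V A')^{-1} (A J), \\
A = J' V^{-1} \;\Rightarrow\; A J &= J' V^{-1} J, \quad A V A' = J' V^{-1} V V^{-1} J = J' V^{-1} J, \\
I_{\tilde{U}} &= (J' V^{-1} J)' (J' V^{-1} J)^{-1} (J' V^{-1} J) = J' V^{-1} J .
\end{aligned}
```

Under this reading, the information attained by Ũ is J′V⁻¹J, which is consistent with the positive-definiteness condition imposed on the analogous matrix in Theorem 1.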

By Campbell’s theorem, it can be shown that all components of J(β) and V(β) can be expressed as

η (f, β) = \int_{D} λ_{0} (s) f (s; β) d s

(4)

for some function f (s; β). In Web Appendix A, we give detailed expressions of f (s; β) related to J(β*) and V(β*). In the next subsection, we derive a consistent estimator for η(f, β*).

4.1. A Consistent Estimator for η(f, β*)

For any θ ∈ [0, 1], define

\hat{η} (f, β; θ) = θ \sum_{s \in (N \cap D)} \frac{f (s; β)}{\exp (Z (s)' β)} + (1 - θ) \sum_{s \in (M \cap D)} \frac{f (s; β)}{α (s)} .

(5)

By Campbell’s Theorem, it can be shown that $\hat{η} (f, β^{*}; θ)$ is an unbiased estimator of η(f, β*) for any θ. We choose θ such that the variance of $\hat{η} (f, β^{*}; θ)$ is minimized. In Web Appendix B, we show that the minimum variance is achieved at

θ_{0} = \frac{\int_{D} λ_{0} (s) f_{num} (s) d s}{\int_{D} λ_{0} (s) f_{den} (s) d s},

(6)

where f_num(s) = f (s; β*)²α(s)⁻¹ and f_den(s) = f(s; β*)² [exp{−Z(s)′β*} + α(s)⁻¹].
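The form of θ₀ can be checked with a brief calculation. The sketch below assumes that N and M are independent and uses the variance formula for sums over a Poisson process, Var{Σ g(s)} = ∫ g(s)² × intensity ds. The symbols A and B are introduced here only for illustration; the full argument is in Web Appendix B.

```latex
% A: variance of the case sum; B: variance of the control sum, both at β = β*.
\begin{aligned}
A &= \int_{D} λ_{0}(s) f(s; β^{*})^{2} \exp\{-Z(s)'β^{*}\} \, ds, \qquad
B = \int_{D} λ_{0}(s) f(s; β^{*})^{2} α(s)^{-1} \, ds, \\
\mathrm{Var}\{\hat{η}(f, β^{*}; θ)\} &= θ^{2} A + (1 - θ)^{2} B, \\
\frac{d}{dθ} \mathrm{Var}\{\hat{η}(f, β^{*}; θ)\} &= 2 θ A - 2 (1 - θ) B = 0
\;\Rightarrow\; θ_{0} = \frac{B}{A + B} .
\end{aligned}
```

Here B is the numerator integral of (6) and A + B is its denominator, so the stationary point agrees with (6). Since A + B > 0, the quadratic is convex in θ and the stationary point is a minimum.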

To estimate θ₀, we need to estimate the two integrals in (6) using (5). Let $\hat{η} (f_{num}, β; θ_{num})$ and $\hat{η} (f_{den}, β; θ_{den})$ be the resulting estimators for the integrals in the numerator and denominator, respectively, for some θ_num, θ_den ∈ [0, 1]. We then define

\hat{θ} (β; θ_{num}, θ_{den}) = \frac{\hat{η} (f_{num}, β; θ_{num})}{\hat{η} (f_{den}, β; θ_{den})} .

(7)

Let N(D) and M(D) denote the numbers of cases and controls in D from the case–control study, respectively. For any given θ_num, θ_den ∈ [0, 1] and as N(D) → ∞ and M(D) → ∞, $\hat{η} (f_{num}, β^{*}; θ_{num})$ and $\hat{η} (f_{den}, β^{*}; θ_{den})$ are consistent estimators for the numerator and denominator of (6) under mild conditions; see Web Appendix C for details. Therefore, $\hat{θ} (β^{*}; θ_{num}, θ_{den})$ is also consistent for θ₀. For simplicity, we set

θ_{num} = θ_{den} = {\hat{θ}}_{p} = \frac{N (D)}{N (D) + M (D)} .

Since ${\hat{θ}}_{p}$ is a consistent estimator of the expected number of cases divided by the total expected number of cases and controls, the resulting estimator $\hat{θ} (β) = \hat{θ} (β; {\hat{θ}}_{p}, {\hat{θ}}_{p})$ is consistent for θ₀ when evaluated at β*; see Web Appendix C.

To summarize, we use $\hat{η} (f, β) = \hat{η} \{f, β; \hat{θ} (β)\}$ to estimate a given component of Ĵ(β) and $\hat{V} (β)$. Through solving (3), we obtain an estimator $\hat{β}$ for β*. In the next subsection, we study asymptotic properties of $\hat{β}$.
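The estimators in (5) and (7) can be sketched as follows. This is a minimal Python illustration, assuming that covariate vectors are stored as lists and that α is supplied as a function of the covariate vector. The names `eta_hat` and `theta_hat` are hypothetical and do not come from the authors' code.

```python
import math

def eta_hat(f, beta, theta, cases, controls, alpha):
    """Equation (5): weighted combination of a case-based and a control-based sum."""
    case_sum = sum(f(z, beta) / math.exp(sum(a * b for a, b in zip(z, beta)))
                   for z in cases)
    control_sum = sum(f(z, beta) / alpha(z) for z in controls)
    return theta * case_sum + (1.0 - theta) * control_sum

def theta_hat(f, beta, cases, controls, alpha):
    """Equation (7) with theta_num = theta_den = N(D) / {N(D) + M(D)}."""
    theta_p = len(cases) / (len(cases) + len(controls))

    def f_num(z, b):
        return f(z, b) ** 2 / alpha(z)

    def f_den(z, b):
        lin = sum(a * c for a, c in zip(z, b))
        return f(z, b) ** 2 * (math.exp(-lin) + 1.0 / alpha(z))

    num = eta_hat(f_num, beta, theta_p, cases, controls, alpha)
    den = eta_hat(f_den, beta, theta_p, cases, controls, alpha)
    return num / den
```

For a constant f, an intercept-only model with β = 0, and α = 0.5, the function returns 2/3, which matches θ₀ = α⁻¹/(1 + α⁻¹) from (6) in that special case.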

4.2. Asymptotic Properties

For the development of asymptotic results, we assume that D is fixed and consider a sequence of increasing population densities $λ_{0, n} (\cdot) = n λ_{0} (\cdot)$ for n = 1, 2, …. This corresponds to the usual setting in spatial epidemiology where data are accumulated over time “n” in a fixed geographical region (e.g., the state of Connecticut). In the following, we modify the notation from the previous sections by adding a subscript n. Thus, N_n and M_n correspond to N and M, and are sequences of Poisson processes with intensity functions $λ_{0, n} (\cdot) \exp \{Z (\cdot)' β^{*}\}$ and $λ_{0, n} (\cdot) α (\cdot)$, respectively. Furthermore, U_n(β) and Ũ_n(β) are defined as U(β) and Ũ(β) but with N and M replaced by N_n and M_n. Other quantities that are dependent on N and M can be similarly generalized. We let U₁(·) be U_n with n = 1, and define $J_{1} (β) = E \left\{\frac{\partial U_{1} (β)}{\partial β}\right\}$ and V₁(β) = Var {U₁(β)}. Theorem 1 establishes consistency and asymptotic normality of ${\hat{β}}_{n}$, and its proof can be found in Web Appendix C.

Theorem 1

Assume $\sup_{s \in D} \| Z (s) \| \leq C$ for some 0 < C < ∞ and J₁(β*)′V₁(β*)⁻¹J₁(β*) is positive definite. Then there exists a $\sqrt{n}$-consistent asymptotically normal sequence of solutions ${\hat{β}}_{n}$ of the estimating equation Ũ_n(β) = 0.

5. Simulations

Let W(·) and Z₁(·) be independent realizations of a stationary, isotropic Gaussian process with covariance function exp(−10u), where u is the spatial lag distance. We simulated both W(·) and Z₁(·) on a 100 × 100 grid laid over a square window D = [0, 1] × [0, 1], where each grid cell had constant values of W and Z₁. We similarly simulated Z₂(·) except that its values were independent standard normal random variables. Given W(·), we defined the spatially varying population intensity $λ_{0, n} (s) = n \exp \{0.5 W (s)\}$ for n = 1, 2. Both Z₁(·) and Z₂(·) were treated as covariates but W(·) was not.

We generated realizations of cases and controls on D from two inhomogeneous Poisson processes with respective intensity functions $λ_{n} (s; β) = λ_{0, n} (s) \exp \{β_{0} + Z_{1} (s) β_{1} + Z_{2} (s) β_{2}\}$ and $ρ_{n} (s) = α λ_{0, n} (s)$, where β = (β₀, β₁, β₂) = (4.9335, 0.5, 0.5). The expected number of cases per realization was 200 and 400 for n = 1, 2, respectively. We chose α in a way such that the expected number of controls was twice as large as that of cases. We assumed that Z(·) = {Z₀(·), Z₁(·), Z₂(·)}′ was observed for every case and control event, where Z₀(s) = 1 for all s ∈ D. In addition, aggregated information $μ_{j k, n} = \int_{D_{k}} λ_{0, n} (s) Z_{j} (s) d s$ was available for j ∈ {0, 1}, {0, 2} or {0, 1, 2}, where D_k’s were equal sub-squares that partitioned D for k = 1, …, K. We considered K = 5², 10², and 20².

Table 1 compares the empirical standard errors (SEs) of our estimator and the estimator from the standard logistic regression without using any aggregated information, based on 1000 simulations. The empirical biases were all negligible. It is clear that our proposed estimator could reduce the SEs considerably compared to the logistic regression approach. Specifically, when there was aggregated information available for Z₁ (and/or Z₂), the SEs of our estimator for β₁ (and/or β₂) were appreciably smaller than those of the estimator based on the logistic regression approach, regardless of the values of n and K. This observation demonstrated the importance of including aggregated information in the analysis. When n increased from 1 to 2, the SEs of our proposed estimator dropped on average by 30%, which was comparable to the expected drop of 29.29% $(= 1 - 1 / \sqrt{2})$ following the convergence rate given in Theorem 1. When K increased, the SEs of our proposed estimator for β₁ decreased when $μ_{1 k}$’s were available, but remained nearly unchanged when only $μ_{2 k}$’s were available. The difference was due to the fact that Z₁’s were spatially correlated but Z₂’s were not. A finer partition of D could yield more information on the covariate Z₁ and further led to an improved estimator for β₁, but the same could not be said for Z₂ and β₂.
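The 30% figure can be checked against the reported numbers. The short Python sketch below recovers the proposed estimator's SEs for β₁ in the {0, 1} setting by multiplying the Table 1 ratios by the logistic-regression SEs quoted in the table caption, and then compares the relative drop with 1 − 1/√2. Only the variable and function names are introduced here; the numbers are copied from Table 1.

```python
import math

# Empirical SEs of the logistic-regression estimator for beta_1 (Table 1 caption).
logistic_se = {1: 0.1028, 2: 0.0725}

# Table 1 ratios for beta_1 with indices {0, 1}, keyed by K, as (n = 1, n = 2).
ratios = {25: (0.9095, 0.9034), 100: (0.8852, 0.8759), 400: (0.8658, 0.8538)}

def se_drop(k):
    """Relative drop in the proposed estimator's SE when n goes from 1 to 2."""
    se_n1 = ratios[k][0] * logistic_se[1]
    se_n2 = ratios[k][1] * logistic_se[2]
    return 1.0 - se_n2 / se_n1

expected_drop = 1.0 - 1.0 / math.sqrt(2.0)   # drop implied by a sqrt(n) rate
drops = {k: se_drop(k) for k in ratios}
```

For these three rows the computed drops all lie close to 0.30, in line with the value implied by the √n rate.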

Table 1.

Ratios of empirical SEs from the proposed method using aggregated information with $\hat{θ}$ in (5) chosen optimally to the empirical SEs from the standard logistic regression based on 1000 simulations. Indices indicate the collections of j’s where Z_j has available aggregated information, for j = 0, 1 or 2. The parameter n scales the spatially varying population density, and K is the number of equal sub-regions that partition the entire region. The empirical SEs of (β₁, β₂) from the standard logistic regression are (0.1028, 0.0976) when n = 1, and (0.0725, 0.0679) when n = 2.

Indices	K	β₁ (n = 1)	β₂ (n = 1)	β₁ (n = 2)	β₂ (n = 2)
{0, 1}	5²	0.9095	1.0143	0.9034	1.0103
{0, 1}	10²	0.8852	1.0154	0.8759	1.0118
{0, 1}	20²	0.8658	1.0133	0.8538	1.0103
{0, 2}	5²	1.0107	0.9314	1.0110	0.9426
{0, 2}	10²	1.0107	0.9303	1.0110	0.9396
{0, 2}	20²	1.0107	0.9262	1.0124	0.9381
{0, 1, 2}	5²	0.9105	0.9242	0.9034	0.9323
{0, 1, 2}	10²	0.8842	0.9139	0.8759	0.9190
{0, 1, 2}	20²	0.8619	0.9016	0.8538	0.9087


We estimated the SEs of our proposed estimator using bootstrap. For each bootstrap iteration, we sampled random samples of size R₁ and R₂ with replacement from the cases and controls, where R₁ and R₂ were independent Poisson random variables with means 200 and 400 for n = 1 and 400 and 800 for n = 2, respectively. We used 200 bootstrap samples. The bootstrap SEs on average were slightly smaller than the empirical SEs (their ratios can be found in Table 2) but the differences were small. The coverage probabilities for 95% confidence intervals were only slightly less than 95% (between 92.7% and 94.5%).
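The resampling scheme described above can be sketched in Python as follows. The helper names, the exponential-arrival Poisson sampler, and the use of a generic scalar `estimator` callable are assumptions for illustration. In the paper the estimator would be the solution of (3), which is not reproduced here.

```python
import random

def poisson(mean, rng):
    """Draw a Poisson(mean) variate by counting unit-rate exponential arrivals."""
    count, t = 0, rng.expovariate(1.0)
    while t <= mean:
        count += 1
        t += rng.expovariate(1.0)
    return count

def bootstrap_sample(cases, controls, mean_cases, mean_controls, rng):
    """Resample cases and controls with replacement, with Poisson sample sizes."""
    r1 = poisson(mean_cases, rng)
    r2 = poisson(mean_controls, rng)
    return rng.choices(cases, k=r1), rng.choices(controls, k=r2)

def bootstrap_se(estimator, cases, controls, mean_cases, mean_controls,
                 n_boot=200, seed=0):
    """Standard deviation of a scalar estimator across bootstrap resamples."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_boot):
        c, m = bootstrap_sample(cases, controls, mean_cases, mean_controls, rng)
        values.append(estimator(c, m))
    mean = sum(values) / len(values)
    return (sum((v - mean) ** 2 for v in values) / (len(values) - 1)) ** 0.5
```

Drawing the resample sizes from Poisson distributions, rather than fixing them, mirrors the fact that the numbers of cases and controls are themselves random under the Poisson process model.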

Table 2.

Ratios of bootstrap SEs using 200 bootstrap iterations to empirical SEs for the proposed method based on 1000 simulations. Same symbols as in Table 1.

Indices	K	β₁ (n = 1)	β₂ (n = 1)	β₁ (n = 2)	β₂ (n = 2)
{0, 1}	5²	0.9615	0.9737	0.9649	0.9869
{0, 1}	10²	0.9549	0.9758	0.9638	0.9898
{0, 1}	20²	0.9596	0.9778	0.9725	0.9825
{0, 2}	5²	0.9856	0.9802	0.9659	0.9766
{0, 2}	10²	0.9750	0.9868	0.9727	0.9765
{0, 2}	20²	0.9769	0.9812	0.9728	0.9686
{0, 1, 2}	5²	0.9658	0.9745	0.9649	0.9731
{0, 1, 2}	10²	0.9527	0.9709	0.9575	0.9744
{0, 1, 2}	20²	0.9628	0.9795	0.9661	0.9773


6. Application to Endometrial Cancer Data

6.1. Risk Factors and Aggregated Summary Statistics

We applied the proposed method to investigate potential risk factors for endometrial cancer, by supplementing the population-based case–control data with summary statistics for the population obtained through BRFSS, the population estimates and the ADT data. The population at risk were females between the ages of 35 and 80 years. In the epidemiology literature, aging, overweight, low parity, early menarche, late menopause, and hormone imbalance are generally regarded as risk factors for endometrial cancer, while race and lifestyle variables (such as smoking) are also found to be associated with it; see MacMahon (1974) and Austin, Drews, and Partridge (1993) for references.

Let Z(s) denote the vector of risk factors for a female at s. The risk factors considered in our study were age, race, education level, smoking history, alcohol consumption, pregnancy history, overweight, obesity, and exposure to traffic. More specifically, race, education level, smoking history, alcohol consumption and pregnancy history referred to whether the subject was white, attended college, ever smoked, ever drank more than once per week for over a period of six months, and ever became pregnant in the past, respectively. Overweight and obese individuals were those with 25 ≤ BMI < 30 and BMI ≥ 30, respectively. We also included exposure to traffic derived at the subject’s residency location. All factors were binary except for age and traffic, which were standardized by their respective means and standard deviations obtained from the case–control data. Since the log odds of endometrial cancer appeared to have a nonlinear relationship with age, our model also included an age-squared term. Therefore, including the intercept term, Z(·) was an 11 × 1 vector.

We were able to obtain aggregated information for seven of the eleven elements of Z(·). In particular, percentages of females who had attended college, smoked in the past, and were overweight or obese were available from BRFSS in every 5-year interval from age 35 to 80 years, which led to a total of 9 strata. The population estimates provided the total number of female residents and the percentages of white females in each of the nine age groups and in each of the eight counties in Connecticut; this resulted in 72 strata. For exposure to traffic, the aggregated summary statistics were defined at the age group-by-zip code level. There were 282 zip codes in Connecticut, which led to 2,538 (= 9 × 282) strata.

For each given form of strata, U_a(·) defined in (2) was derived given some properly defined weights w_k. We followed Diggle et al. (2010) to define $w_{k} = \exp ({\tilde{Z}}_{k} \tilde{β})$, where ${\tilde{Z}}_{k}$ and $\tilde{β}$ were estimates of Z(s) for s ∈ D_k and β*, respectively. For a given component of ${\tilde{Z}}_{k}$, we calculated it at the stratification level at which U_a(·) was formed.

Because the controls were frequency matched to the cases by age groups, the probability for an individual being included as a control depended on her age. In our data, we also observed an over-representation of whites in the controls compared to the population. Specifically, 94.4% of the controls were white, in contrast to only 87.2% in the population according to the Census. We thus estimated α(·) as the ratio of the number of controls in a particular age group-by-race category to the total number of the population in the same category. In a more general setting, we estimate α(·) by accounting for both known factors that were used in the sampling design to select the controls, for example, age in our endometrial cancer study, and any additional factors whose distributions in the controls are clearly different from those in the population, for example, race in our study. As we did in our application, this can be achieved by setting α(·) as the ratio of the number of controls in a given category defined by these factor level combinations to the total number of population in the same category. Nevertheless, a misspecification of α(·) is still possible and may result in biased estimates for the effect of the risk factors.
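The ratio estimate of α(·) described above amounts to a per-category division. The Python sketch below illustrates it; the function name, the category labels, and any counts used with it are hypothetical and are not taken from the study data.

```python
from collections import Counter

def estimate_alpha(control_categories, population_counts):
    """Ratio of control counts to population counts within each category.

    control_categories: iterable of category labels, one per control
        (for example, (age_group, race) tuples)
    population_counts: dict mapping each category label to its population total
    """
    n_controls = Counter(control_categories)
    return {cat: n_controls.get(cat, 0) / total
            for cat, total in population_counts.items() if total > 0}
```

For instance, three controls in a category with a population of 300 would give an estimated inclusion probability of 0.01 for that category. Categories that contain no controls receive an estimate of zero, which would need separate handling before dividing by α in (5).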

6.2. Results

We estimated β using our proposed approach in two ways: with α(·) adjusted for both age and race, as described in the previous subsection, and with α(·) adjusted for age only. For comparison, we also applied the conditional logistic regression approach since the controls were frequency matched to the cases using age groups. This essentially leaves α(·) nonparametric by conditioning on the matched age groups. It is worth noting that the different treatments of α(·) affect the interpretation and estimates of regression parameters in these methods.

Table 3 presents the parameter estimates and their SEs. The parameter estimates from the two analyses using our proposed approach are very similar for all risk factors except race, indicating that adjusting α(·) for race did not materially affect the risk estimates of the other factors. The SEs were computed using bootstrap with a consideration for the frequency matching used in the study design. More specifically, for each of the 1000 bootstrap iterations, if the numbers of cases and controls in an age group were n and m respectively, we randomly sampled with replacement R₁ cases and R₂ controls from that particular age group, where R₁ and R₂ were independent Poisson random variables with respective means equal to n and m. Consistent with the observations made in the simulation, our method yielded smaller SEs for all risk factors compared with the conditional logistic regression approach.

Table 3.

Risk estimates and their standard errors from 1000 bootstrap resamples (in parentheses). Method 1: conditional logistic regression; Method 2: the proposed method using aggregated information with $\hat{θ}$ in (5) chosen optimally and adjusting for age only; Method 3: similar to Method 2 but adjusting for both age and race. Method 1 is based on the case–control data alone.

Covariate	Source of aggregated information	Method 1	Method 2	Method 3
Intercept	Census population estimates	—	−6.6831 (0.2647)	−6.8842 (0.3483)
Age	—	—	0.4004 (0.0537)	0.3878 (0.0569)
Age²	—	—	−0.3666 (0.0449)	−0.3783 (0.0520)
Race	Census population estimates	−0.1324 (0.2620)	0.4006 (0.1545)	0.6276 (0.2564)
Education	BRFSS	−0.3838 (0.1442)	−0.1780 (0.1318)	−0.1790 (0.1408)
Smoking	BRFSS	−0.3373 (0.1237)	−0.3342 (0.1151)	−0.3357 (0.1218)
Alcohol	—	−0.2387 (0.1340)	−0.2311 (0.1323)	−0.2352 (0.1273)
Pregnancy	—	−0.7488 (0.1697)	−0.7334 (0.1678)	−0.7434 (0.1830)
Overweight	BRFSS	0.4454 (0.1571)	0.3545 (0.1420)	0.3852 (0.1554)
Obesity	BRFSS	1.5615 (0.1591)	1.5468 (0.1195)	1.5936 (0.1565)
Traffic	Traffic data	0.0376 (0.0626)	0.0340 (0.0562)	0.0352 (0.0606)


Our analysis suggests that endometrial cancer risk increases if one is white, overweight or obese, but decreases if one ever smoked or had past pregnancies. It also increases with age before 66 but gradually decreases after. However, exposure to traffic does not appear to be related to endometrial cancer risk. The results presented in Table 3 were based on traffic exposure derived using a 2000-meter buffer zone from the residency. Our conclusion persisted when a 500- or a 1000-meter buffer zone was used instead.

Hormone balance, especially the balance between the estrogen and progesterone, is thought to play an important part in the development of endometrial cancer. Exposure to elevated estrogen levels that are not counter balanced by progesterone is an important risk factor for the disease. Many epidemiologic studies have found evidence that prior pregnancies protect the endometrium from excessive estrogen exposure and are linked with lowered endometrial cancer risk. On the other hand, obesity is classified as a source of estrogen exposure and may increase the risk. See Lu et al. (2011) and the references therein.

Contrary to the common belief that cigarette smoking increases the incidence of chronic diseases, our method along with other cohort and case–control studies found a statistically significant lower risk of endometrial cancer among smokers; see also the meta-analysis of Zhou et al. (2008) and the references therein. While estrogen contributes directly to endometrial cancer, anti-estrogens inhibit estrogen-induced cellular proliferation and mutations in endometrial glands to protect against the cancer (Henderson and Feigelson, 2000). Smoking may lower endometrial cancer risk through raising the circulation of anti-estrogens (Tankó and Christiansen, 2004), and reducing the circulation of estrogen through weight loss and earlier menopause (Parazzini et al., 1991).

Our analysis revealed a significant effect of race, but such an effect could not be detected using the conditional logistic regression approach. Specifically, we found that endometrial cancer risk increased by a factor of 1.87 (95% confidence interval 1.13–3.10) for whites. Our finding is consistent with a prospective cohort study by Setiawan et al. (2007), which documented a higher endometrial cancer risk among Caucasians compared to other ethnic groups after adjusting for other risk factors. In terms of education, our approach found no association between education and risk of endometrial cancer, but the logistic regression suggested that attending college reduced endometrial cancer risk.
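To relate the reported risk factor of 1.87 (95% confidence interval 1.13–3.10) to the log scale on which regression coefficients are usually reported, the sketch below backs out an approximate coefficient and standard error. It assumes the interval is of Wald type, that is, symmetric on the log scale with a normal quantile of about 1.96; the text does not state how the interval was constructed, and the function name `log_scale_summary` is hypothetical.

```python
import math

def log_scale_summary(ratio, ci_low, ci_high, z=1.959964):
    """Back out a log-scale estimate and standard error from a ratio and its CI.

    Assumes the interval is exp(beta -/+ z * se), i.e. symmetric on the log scale.
    """
    beta = math.log(ratio)
    se = (math.log(ci_high) - math.log(ci_low)) / (2.0 * z)
    # Midpoint of the log interval; close to beta if the symmetry assumption holds.
    midpoint = 0.5 * (math.log(ci_low) + math.log(ci_high))
    return beta, se, midpoint

beta, se, midpoint = log_scale_summary(1.87, 1.13, 3.10)
print(f"log-scale estimate: {beta:.4f}")
print(f"implied standard error: {se:.4f}")
print(f"log-interval midpoint: {midpoint:.4f}")
```

The midpoint of the log interval agrees closely with the log of the point estimate, which is consistent with, but does not prove, the Wald-type assumption.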

7. Discussion

We have proposed a new method to combine population-based case–control data and population summary statistics in disease risk estimation. Our method is flexible and can be applied when aggregated information is available only for a subset of the risk factors. It can be used to incorporate both spatially and non-spatially aggregated information that can be obtained from diverse sources under different stratification structures. Our simulation shows that the proposed method can yield more efficient estimators than the logistic regression approach based on case–control data alone.

Combining data from diverse sources is helpful, but care must be taken when the underlying populations of the different data sources are not in close agreement. In particular, large discrepancies may occur between the population from which the controls were selected and the true population at risk. Failure to properly account for such discrepancies could lead to biased parameter estimates. In our application, where an over-representation of whites in the controls was observed, we accounted for it by adjusting α(·) for race. Our two analyses, with and without the adjustment, yielded similar parameter estimates except for race. We could adjust for race because a detailed age group-by-race distribution of the population at risk was available, which allowed us to compare it to that of the controls directly. However, in general such information may not be available for other risk factors.

Besides discrepancies between the different populations, there are other potential caveats with our method. Firstly, we assume that α(·) is known, but often it has to be estimated. Secondly, to generalize our findings to the population, we would require the cases to be a random sample of all diseased subjects, which may not hold. A violation of either assumption could lead to biased parameter estimates. However, both assumptions are also required by other existing methods (e.g., the logistic regression analysis approach) for analyzing case–control data. Future research is needed to develop new methods to mitigate such potential biases. Lastly, we have not accounted for uncertainties associated with the population summaries. Even though these uncertainties are expected to be small, the bootstrap standard errors may underestimate the true variability in the parameter estimates.

To analyze the effect of traffic exposure on endometrial cancer risk, we used Geographic Information Systems (GIS) to derive an exposure measure based on ADT and individuals’ residential locations. Our proposed measure captures varying traffic density due to patterns of intersecting roadways (Holford et al., 2010), but does not account for other factors that could also affect one’s exposure, such as prevailing wind directions. Even if all such factors were considered, the resulting measure could at best be a proxy for one’s true exposure. More accurate measures can be obtained using personal monitors. However, this would be considerably more expensive and may be infeasible for large-scale population-based case–control studies of chronic diseases, since the diseases may take years to develop and an exposure often has to be constructed retrospectively as a result. In contrast, the use of GIS to estimate exposure can be performed rather quickly and therefore remains an attractive approach in the investigation of cancer risk factors.

8. Supplementary Materials

Web Appendices referenced in Section 4, MATLAB code, and some artificial data are available with this paper at the Biometrics website on Wiley Online Library.

Supplementary Material

theory

NIHMS669668-supplement-theory.pdf (60.5 KB, pdf)

Acknowledgments

We thank the editor, associate editor, and two reviewers for their constructive comments. This research has been partially supported by NIH grants 1R01CA169043, 5R01CA098346 and 5R01ES017416, NSF grant DMS-0845368, the Danish Council for Independent Research-Natural Sciences grant 12-124675, “Mathematical and Statistical Analysis of Spatial Data,” and the Centre for Stochastic Geometry and Advanced Bioimaging, funded by the Villum Foundation.

References

Austin H, Drews C, Partridge E. A case-control study of endometrial cancer in relation to cigarette smoking, serum estrogen levels, and alcohol use. American Journal of Obstetrics & Gynecology. 1993;169:1086–1086. doi: 10.1016/0002-9378(93)90260-p.
Beelen R, Hoek G, van den Brandt PA, Goldbohm RA, Fischer P, Schouten LJ, Armstrong B, Brunekreef B. Long-term exposure to traffic-related air pollution and lung cancer risk. Epidemiology. 2008;19:702–710. doi: 10.1097/EDE.0b013e318181b3ca.
Diggle P, Guan Y, Hart C, Paize F, Stanton M. Estimating individual-level risk in spatial epidemiology using spatially aggregated information on the population at risk. Journal of the American Statistical Association. 2010;105:1394–1402. doi: 10.1198/jasa.2010.ap09323.
Diggle PJ, Rowlingson BS. Conditional approach to point process modelling of elevated risk. Journal of the Royal Statistical Society, Series A. 1994;157:433–440.
Grant WB. Air pollution in relation to US cancer mortality rates: An ecological study; likely role of carbonaceous aerosols and polycyclic aromatic hydrocarbons. Anticancer Research. 2009;29:3537–3545.
Haneuse S, Wakefield J. Hierarchical models for combining ecological and case-control data. Biometrics. 2007;63:128–136. doi: 10.1111/j.1541-0420.2006.00673.x.
Haneuse S, Wakefield J. Geographic-based ecological correlation studies using supplemental case-control data. Statistics in Medicine. 2008a;27:864–887. doi: 10.1002/sim.2979.
Haneuse S, Wakefield J. The combination of ecological and case-control data. Journal of the Royal Statistical Society, Series B. 2008b;70:73–93. doi: 10.1111/j.1467-9868.2007.00628.x.
Henderson BE, Feigelson HS. Hormonal carcinogenesis. Carcinogenesis. 2000;21:427–433. doi: 10.1093/carcin/21.3.427.
Heyde C. Quasi-Likelihood and Its Application: A General Approach to Optimal Parameter Estimation. New York: Springer-Verlag; 1997.
Holford T, Ebisu K, McKay L, Gent J, Triche E, Bracken M, Leaderer B. Integrated exposure modeling: A model using GIS and GLM. Statistics in Medicine. 2010;29:116–129. doi: 10.1002/sim.3732.
Lu L, Risch H, Irwin ML, Mayne ST, Cartmel B, Schwartz P, Rutherford T, Yu H. Long-term overweight and weight gain in early adulthood in association with risk of endometrial cancer. International Journal of Cancer. 2011;129:1237–1243. doi: 10.1002/ijc.26046.
MacMahon B. Risk factors for endometrial cancer. Gynecologic Oncology. 1974;2:122–129. doi: 10.1016/0090-8258(74)90003-1.
Møller J, Waagepetersen R. Statistical Inference and Simulation for Spatial Point Processes. Boca Raton: Chapman and Hall; 2004.
Parazzini F, La Vecchia C, Bocciolone L, Franceschi S. The epidemiology of endometrial cancer. Gynecologic Oncology. 1991;41:1–16. doi: 10.1016/0090-8258(91)90246-2.
Pearson RL, Wachtel H, Ebi KL. Distance-weighted traffic density in proximity to a home is a risk factor for leukemia and other childhood cancers. Journal of the Air & Waste Management Association. 2000;50:175–180. doi: 10.1080/10473289.2000.10463998.
Prentice R, Sheppard L. Aggregate data studies of disease risk factors. Biometrika. 1995;82:113–125.
Raaschou-Nielsen O, Hertel O, Thomsen BL, Olsen JH. Air pollution from traffic at the residence of children with cancer. American Journal of Epidemiology. 2001;153:433–443. doi: 10.1093/aje/153.5.433.
Setiawan VW, Pike MC, Kolonel LN, Nomura AM, Goodman MT, Henderson BE. Racial/ethnic differences in endometrial cancer risk: The multiethnic cohort study. American Journal of Epidemiology. 2007;165:262–270. doi: 10.1093/aje/kwk010.
Tankó LB, Christiansen C. An update on the antiestrogenic effect of smoking: A literature review with implications for researchers and practitioners. Menopause. 2004;11:104–109. doi: 10.1097/01.GME.0000079740.18541.DB.
Wakefield J. Ecological inference for 2 × 2 tables (with discussion). Journal of the Royal Statistical Society, Series A. 2004;167:385–445.
Zhou B, Yang L, Sun Q, Cong R, Gu H, Tang N, Zhu H, Wang B. Cigarette smoking and the risk of endometrial cancer: A meta-analysis. American Journal of Medicine. 2008;121:501. doi: 10.1016/j.amjmed.2008.01.044.


