Summary
We propose a novel statistical framework by supplementing case–control data with summary statistics on the population at risk for a subset of risk factors. Our approach is to first form two unbiased estimating equations, one based on the case–control data and the other on both the case data and the summary statistics, and then optimally combine them to derive another estimating equation to be used for the estimation. The proposed method is computationally simple and more efficient than standard approaches based on case–control data alone. We also establish asymptotic properties of the resulting estimator, and investigate its finite-sample performance through simulation. As a substantive application, we apply the proposed method to investigate risk factors for endometrial cancer, by using data from a recently completed population-based case–control study and summary statistics from the Behavioral Risk Factor Surveillance System, the Population Estimates Program of the US Census Bureau, and the Connecticut Department of Transportation.
Keywords: Aggregated information, Estimating equation, Spatial epidemiology, Spatial point process
1. Introduction
Population-based case–control studies typically consist of a subset of individuals who have developed the disease (cases) and a representative sample of individuals from the population at risk (controls). For each study subject, an extensive list of risk factors is often collected. The sample size of the controls can be very limited in many studies due to cost constraints. As a result, large discrepancies may occur between the distributions of risk factors in the controls and in the population, which could in turn limit one’s ability to detect significant risk factors. However, more accurate information can be obtained from other sources for at least a subset of the risk factors. For example, the US Census provides summary statistics for many demographic and socioeconomic status variables at very fine spatial scales, and the Behavioral Risk Factor Surveillance System (BRFSS), which is a large state-based system of health surveys, routinely releases summary statistics concerning lifestyle variables, such as alcohol consumption, smoking, exercise, and overweight and obesity, across different age-by-sex groups. These summary statistics are often based on a much larger number of study participants and are therefore appreciably more accurate than their counterparts that can be derived from the controls in a case–control study.
We propose a novel statistical framework for disease risk estimation by supplementing standard case–control data with aggregated information of risk factors on the population. We consider a common regression setting as is often used in analyzing case–control data alone (see Section 3 for details). To estimate the regression parameters, we first form two unbiased estimating equations, one based on the case–control data and the other based on the case data and some summary statistics, and then efficiently combine them to derive a new unbiased estimating equation. Our proposed method is computationally simple and can lead to parameter estimates with smaller standard errors than standard methods using case–control data alone. As a substantive application, we investigate risk factors for endometrial cancer, based on a recently completed population-based case–control study and summary statistics extracted from data provided by BRFSS, the Population Estimates Program of the US Census Bureau, and the Connecticut Department of Transportation.
Several other authors have studied the problem of combining individual-level and aggregated epidemiological data. For example, Prentice and Sheppard (1995) and Wakefield (2004) considered combining aggregated disease data with individual control or cohort data, and Haneuse and Wakefield (2007, 2008a,b) proposed a hybrid design to combine aggregated disease data with either case–control data or control data alone. However, all these articles are concerned with aggregated outcomes on the diseased subjects but not on the non-diseased ones. Diggle et al. (2010) developed procedures for combining individual-level case data and spatially aggregated information on the population at risk. Their methods require spatially aggregated information for all risk factors which may not always be available in practice. Moreover, they did not consider any additional control data in the analysis. In contrast, we combine both cases and controls in a case–control study as well as summary statistics on the population even when such information is available only for a subset of risk factors.
The remainder of the article is organized as follows. We describe the endometrial cancer data in Section 2 and provide some necessary background in Section 3. We introduce the proposed method in Section 4, assess its numerical properties through a simulation study in Section 5, and apply it to analyze the endometrial cancer data in Section 6. We conclude with a discussion in Section 7. Additional theoretical results, MATLAB codes and artificial data are given in the Supplementary Materials section online.
2. Description of Data
2.1. Case–Control Study for Endometrial Cancer
Endometrial (uterine corpus) cancer is the fourth most common cancer among women in the United States. The American Cancer Society estimates 51,577 newly diagnosed endometrial cancer cases and 8,418 deaths in 2014 (http://www.cancer.org/research/cancerfactsfigures/). To investigate risk factors for endometrial cancer, a population-based case–control study was conducted in Connecticut between October 2004 and March 2009. The study included 668 Connecticut residents between the ages of 35 and 80 who were newly diagnosed with endometrial cancer during the study period and 665 control subjects who were identified through random-digit dialing and were frequency matched to cases by age group (35–51, 52–59, 60–64, 65–69, 70–74, and 75–79 years). All the study participants provided signed informed consent before an in-person interview. During the interview, a structured questionnaire was used to collect information such as ethnicity, education, lifestyle, menstrual and reproductive features, self-reported weight, height and other physical dimensions. More details on the study design can be found in Lu et al. (2011).
2.2. Behavioral Risk Factor Surveillance System Data
BRFSS is a state-based system of health surveys that collect information on health risk behaviors, preventive health practices, and health care access primarily related to chronic diseases and injury. It was first established in 1984 by the Centers for Disease Control and Prevention (CDC); with more than 350,000 adults interviewed each year, it is the largest telephone health survey in the world. The CDC routinely releases summary tables on variables collected through the BRFSS surveys. The Web Enabled Analysis Tool (WEAT), which is available on the CDC’s website http://www.cdc.gov/brfss/index.htm, allows researchers to create cross tabulation reports from the BRFSS data. Using WEAT, we have calculated annual summary statistics on tobacco use, education level and body mass index (BMI = weight (kg)/height² (m²)) from 2005 to 2008 across different age-by-sex groups. We will take advantage of this information to investigate risk factors for endometrial cancer.
2.3. Traffic and Census Data
As a major source of air pollution, automobile emissions have been investigated for their potential associations with various types of cancer; see Pearson, Wachtel, and Ebi (2000), Raaschou-Nielsen et al. (2001), Beelen et al. (2008), and the references therein. For endometrial cancer, Grant (2009) found a significant positive association between an air pollution index and endometrial cancer mortality rates in 1950–1969 in the US. However, they observed no such association between the same air pollution index and the mortality rates in 1970–1994.
In this article, we will examine the effect of exposure to traffic on endometrial cancer risk, using average daily traffic (ADT) data on all state and interstate highways in 2007 provided by the Connecticut Department of Transportation. For each subject enrolled in the case–control study, her exposure to traffic can be derived by integrating the ADT values over highways within a fixed buffer zone from the subject’s residency (Holford et al., 2010).
The Population Estimates Program of the US Census Bureau produces annual population count estimates by age, gender and ethnicity for each state at the county level. We extracted data on these variables in Connecticut from 2005 to 2008. To get aggregated traffic data at a finer spatial level, we first obtained the 2010 US Census data for Connecticut at the zip code level and then scaled them proportionally to estimate the population counts from 2005 to 2008. For each zip code, we defined an aggregated exposure as the product of the traffic exposure at the zip code centroid and the rescaled population counts. Although this aggregated exposure is not a sum of the true exposures of all subjects within a given zip code, it still contains information about how the population may be exposed to traffic differently across different zip codes. We will incorporate this new information in our analysis to investigate whether exposure to traffic is associated with risk for developing endometrial cancer.
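The construction of the aggregated exposure can be sketched in a few lines. This is only an illustration: the three zip codes and all numbers are made up, the function name is hypothetical, and the proportional rescaling is assumed to preserve each zip code's 2010 share of a target-year total, a detail the text does not spell out.

```python
import numpy as np

def aggregated_exposure(census_2010_counts, target_year_total, centroid_traffic):
    """Rescale 2010 zip-level counts to a target year, then multiply by centroid traffic exposure."""
    census_2010_counts = np.asarray(census_2010_counts, dtype=float)
    # Assumed rescaling: each zip code keeps its 2010 share of the population.
    rescaled = census_2010_counts * target_year_total / census_2010_counts.sum()
    # Aggregated exposure = traffic exposure at the centroid x rescaled population count.
    return np.asarray(centroid_traffic, dtype=float) * rescaled

# Hypothetical inputs for three zip codes (not taken from the study).
exposure = aggregated_exposure([1000, 3000, 6000], target_year_total=9000,
                               centroid_traffic=[2.0, 0.5, 1.0])
```

As the text notes, this quantity is a proxy: it applies one centroid value to everyone in the zip code rather than summing individual exposures.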
3. Background
3.1. Notation and Set-Up
Let N and M be two spatial point processes generating the random spatial locations of cases and controls over a geographic region, D. We represent the spatially varying population density by λ0(s). Let Z(s) be a p × 1 vector of risk factors for an individual at location s. We assume that both N and M are Poisson, with their respective intensities given by λ(s; β) = λ0(s) exp{Z(s)′β} for some unknown β and ρ(s) = α(s)λ0(s), where α(s) denotes the probability for an individual at s to be included in the controls of a case–control study. We assume that α(·) is known given the sampling design used to collect the controls. Our interest is to estimate β, which defines the effect of potential risk factors on cancer risk.
Suppose that Z(s) = {X(s)′, Y(s)′}′, where X(·) and Y(·) are respectively px × 1 and py × 1 subvectors of Z(·) with p = px + py. The first element of X(·) is always equal to one. We assume that aggregated information on the population is available for X(·), but not for Y(·), over K strata, Dk, for k = 1, …, K. For ease of presentation, we assume that Dk’s are geographic regions that form a partition of D, but in general, Dk’s can be strata based on non-geographic criteria such as age and sex.
3.2. Estimating Equation for Case–Control Data
Diggle and Rowlingson (1994) proposed a conditional likelihood approach to estimate β using case–control data. They argued that conditional on an observed event s ∈ (N ∪ M), the probability for it to be from N is p(s; β) = λ(s; β)/{λ(s; β) + ρ(s)} = exp{Z(s)′β}/[exp{Z(s)′β} + α(s)]. The conditional log-likelihood is then given as L(β) = Σs∈N log p(s; β) + Σs∈M log{1 − p(s; β)}. Note that maximizing L(β) is equivalent to solving the unbiased estimating equation
$$U_c(\beta) = \sum_{s \in N} Z(s)\{1 - p(s;\beta)\} - \sum_{s \in M} Z(s)\,p(s;\beta) = 0_p, \tag{1}$$
where 0p is a p × 1 zero vector. If α(·) is constant, Uc(β) coincides with the score function of the commonly used logistic regression analysis for case–control data.
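A minimal sketch of this estimating function follows, assuming p(s; β) = exp{Z(s)′β}/[exp{Z(s)′β} + α(s)] and that Uc(β) is the gradient of the conditional log-likelihood L(β), that is, the sum of Z(s){1 − p(s; β)} over cases minus the sum of Z(s)p(s; β) over controls. The function and argument names are illustrative and are not taken from the authors' MATLAB code.

```python
import numpy as np

def case_probability(Z, alpha, beta):
    """p(s; beta) = exp(Z'beta) / (exp(Z'beta) + alpha(s)), one value per row of Z."""
    # Equivalent form obtained by dividing numerator and denominator by exp(Z'beta).
    return 1.0 / (1.0 + alpha * np.exp(-(Z @ beta)))

def U_c(beta, Z_cases, alpha_cases, Z_controls, alpha_controls):
    """Assumed form of (1): gradient of the conditional log-likelihood L(beta)."""
    p_cases = case_probability(Z_cases, alpha_cases, beta)
    p_controls = case_probability(Z_controls, alpha_controls, beta)
    # Cases contribute Z(1 - p); controls contribute -Z p.
    return Z_cases.T @ (1.0 - p_cases) - Z_controls.T @ p_controls
```

With α(·) ≡ 1 this reduces to the usual logistic regression score with cases coded 1 and controls coded 0; any other constant α only shifts the intercept.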
3.3. Estimating Equation for Individual-Level Case Data and Aggregated Population Data
Let μk denote a px × 1 vector of population summaries aggregated over Dk, for k = 1, …, K. In the subsequent development, we assume that μk’s are known and write μk = ∫Dk Xk(s)λ0(s) ds, where Xk(·) is a px × 1 vector related to X(·). Often Xk(·) = X(·), that is, μk’s are aggregated over the risk factors of the population in Dk. However, Xk(·) can also be different from X(·). For example, we used traffic exposure at the zip code centroid to derive an aggregated traffic exposure in Section 2.3. Let β* denote the true value of β. By Campbell’s Theorem (e.g., Møller and Waagepetersen, 2004) and the definition of λ(s; β), Σs∈N∩Dk Xk(s) exp{−Z(s)′β} is an unbiased estimator for μk at β = β*. Hence,
$$U_a(\beta) = \sum_{k=1}^{K} w_k \Big[ \sum_{s \in N \cap D_k} X_k(s) \exp\{-Z(s)'\beta\} - \mu_k \Big] = 0_{p_x} \tag{2}$$
forms an unbiased estimating equation (Diggle et al., 2010), where wk’s are some pre-defined weights. Note that solving (2) alone will not yield a unique estimate for β* since px < p, that is, aggregated information is not available for all risk factors. If px = p and X(·) is spatially continuous, Diggle et al. (2010) showed that efficiency of the resulting estimator from solving (2) increased with K. As K increases, the average of X(s) for s ∈ Dk, which is denoted as X̄k and can be easily derived from μk, approaches X(s). Thus, most efficiency gains can be achieved from incorporating the aggregated information when X̄k approximates X(s) well for s ∈ Dk. In such a case, there is often a large number of strata and consequently a small population size in each stratum, as well as a large between-group variation in X̄k’s.
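The aggregated-data estimating function can be sketched as follows, assuming scalar weights wk and the form Ua(β) = Σk wk [Σ over cases in Dk of Xk(s) exp{−Z(s)′β} − μk]. The array layout (one row per case, an integer stratum label per case) and all names are assumptions made for illustration.

```python
import numpy as np

def U_a(beta, Z_cases, X_cases, stratum, mu, w):
    """Weighted sum over strata of (case-based estimate of mu_k) minus (known mu_k).

    Z_cases: (n, p) risk factors of the cases
    X_cases: (n, px) values of X_k(s) at the cases
    stratum: (n,) integer label in 0..K-1 giving each case's stratum
    mu:      (K, px) known population summaries
    w:       (K,) pre-defined scalar weights (an assumption of this sketch)
    """
    K, px = mu.shape
    # Each case contributes X_k(s) exp{-Z(s)'beta} to its own stratum.
    contrib = X_cases * np.exp(-(Z_cases @ beta))[:, None]
    totals = np.zeros((K, px))
    np.add.at(totals, stratum, contrib)  # unbuffered accumulation by stratum label
    return (w[:, None] * (totals - mu)).sum(axis=0)
```

The function returns a px-vector, so on its own it cannot identify all p components of β when px < p, which is the point made in the text.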
4. Combining Estimating Equations for Case–Control and Aggregated Data
The estimating equations given in (1) and (2) both contain information on β*. Below we develop a mechanism to ‘optimally’ combine them. To do so, we first write U(β) = {Uc(β)′, Ua(β)′}′ and define J(β) = −E{∂U(β)/∂β′} and V(β) = Var {U(β)}, where expectation and variance are at β = β*. Let Ĵ(β) and V̂(β) be consistent estimators of J(β) and V(β) when β = β*. We then follow Heyde (1997) to consider the estimating equation
$$\tilde U(\beta) = \hat J(\beta)'\, \hat V(\beta)^{-1}\, U(\beta) = 0_p. \tag{3}$$
The resulting estimator obtained by solving (3), denoted by β̂, is ‘optimal’, in the sense that Ũ(β) has the maximum Godambe information (Heyde, 1997) among all estimating functions taking the form A(β) U(β), where A(β) is an arbitrary p × (p + px) real matrix.
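To make the combination step concrete, here is a small sketch that assumes the combined estimating function has the form Ĵ′V̂⁻¹U, with Ĵ of size (p + px) × p and V̂ of size (p + px) × (p + px). The toy example is not from the paper: it combines two independent unbiased measurements of a scalar, for which this weighting reduces to the familiar inverse-variance average.

```python
import numpy as np

def combined_estimating_function(U, J_hat, V_hat):
    """Return J_hat' V_hat^{-1} U, using a linear solve rather than an explicit inverse."""
    return J_hat.T @ np.linalg.solve(V_hat, U)

# Toy illustration (hypothetical numbers): U(beta) = x - beta for two measurements.
x = np.array([1.0, 3.0])          # two unbiased measurements of the same scalar
var = np.array([1.0, 3.0])        # their variances
J_hat = np.ones((2, 1))           # minus the derivative of U with respect to beta
V_hat = np.diag(var)              # independent measurements, so V is diagonal

# Closed-form root of the combined equation: the inverse-variance weighted mean.
beta_hat = np.sum(x / var) / np.sum(1.0 / var)
residual = combined_estimating_function(x - beta_hat, J_hat, V_hat)
```

In the setting of the paper the root has no closed form, so the combined equation would be solved numerically, for example by Newton-type iterations.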
By Campbell’s theorem, it can be shown that all components of J(β) and V(β) can be expressed as
$$\eta(f, \beta) = \int_D f(s;\beta)\,\lambda_0(s)\,ds \tag{4}$$
for some function f (s; β). In Web Appendix A, we give detailed expressions of f (s; β) related to J(β*) and V(β*). In the next subsection, we derive a consistent estimator for η(f, β*).
4.1. A Consistent Estimator for η(f, β*)
For any θ ∈ [0, 1], define
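$$\hat\eta_\theta(f, \beta) = \theta \sum_{s \in N} f(s;\beta) \exp\{-Z(s)'\beta\} + (1-\theta) \sum_{s \in M} f(s;\beta)\,\alpha(s)^{-1}. \tag{5}$$
By Campbell’s Theorem, it can be shown that η̂θ(f, β*) is an unbiased estimator of η(f, β*) for any θ. We choose θ such that the variance of η̂θ(f, β*) is minimized. In Web Appendix B, we show that the minimum variance is achieved at
$$\theta_0 = \frac{\int_D f_{\rm num}(s)\,\lambda_0(s)\,ds}{\int_D f_{\rm den}(s)\,\lambda_0(s)\,ds}, \tag{6}$$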
where fnum(s) = f (s; β*)2α(s)−1 and fden(s) = f(s; β*)2 [exp{−Z(s)′β*} + α(s)−1].
To estimate θ0, we need to estimate the two integrals in (6) using (5). Let η̂θnum(fnum, β) and η̂θden(fden, β) be the resulting estimators for the integrals in the numerator and denominator, respectively, for some θnum, θden ∈ [0, 1]. We then define
$$\hat\theta_0 = \frac{\hat\eta_{\theta_{\rm num}}(f_{\rm num}, \beta)}{\hat\eta_{\theta_{\rm den}}(f_{\rm den}, \beta)}. \tag{7}$$
Let N(D) and M(D) denote the numbers of cases and controls in D from the case–control study, respectively. For any given θnum, θden ∈ [0, 1] and as N(D) → ∞ and M(D) → ∞, η̂θnum(fnum, β*) and η̂θden(fden, β*) are consistent estimators for the numerator and denominator of (6) under mild conditions; see Web Appendix C for details. Therefore, θ̂0 is also consistent for θ0. For simplicity, we set θnum = θden = N(D)/{N(D) + M(D)}.
Since N(D)/{N(D) + M(D)} is a consistent estimator of the expected number of cases divided by the total expected number of cases and controls, the resulting estimator θ̂0 is consistent for θ0 when evaluated at β*; see Web Appendix C.
To summarize, we use η̂θ(f, β) with θ = θ̂0 to estimate a given component of Ĵ(β) and V̂(β). Through solving (3), we obtain an estimator β̂ for β*. In the next subsection, we study asymptotic properties of β̂.
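The two-step recipe of this subsection can be sketched as below. The sketch assumes that the estimator in (5) weights a case sum of f(s; β) exp{−Z(s)′β} by θ and a control sum of f(s; β)/α(s) by 1 − θ, and that θnum = θden = N(D)/{N(D) + M(D)}. Argument names are illustrative; `lin_*` stands for the linear predictor Z(s)′β at each point.

```python
import numpy as np

def eta_hat(theta, f_cases, lin_cases, f_controls, alpha_controls):
    """theta * sum_cases f exp(-Z'beta) + (1 - theta) * sum_controls f / alpha."""
    case_term = np.sum(f_cases * np.exp(-lin_cases))
    control_term = np.sum(f_controls / alpha_controls)
    return theta * case_term + (1.0 - theta) * control_term

def theta0_hat(f_cases, lin_cases, alpha_cases, f_controls, lin_controls, alpha_controls):
    """Plug-in estimate of theta_0: ratio of the estimated integrals of f_num and f_den."""
    # Simple choice from the text: the observed fraction of cases among all points.
    theta = len(f_cases) / (len(f_cases) + len(f_controls))
    # f_num = f^2 / alpha and f_den = f^2 {exp(-Z'beta) + 1/alpha}, at cases and at controls.
    num_cases = f_cases**2 / alpha_cases
    num_controls = f_controls**2 / alpha_controls
    den_cases = f_cases**2 * (np.exp(-lin_cases) + 1.0 / alpha_cases)
    den_controls = f_controls**2 * (np.exp(-lin_controls) + 1.0 / alpha_controls)
    num = eta_hat(theta, num_cases, lin_cases, num_controls, alpha_controls)
    den = eta_hat(theta, den_cases, lin_cases, den_controls, alpha_controls)
    return num / den
```

Because f_den is never smaller than f_num pointwise, the ratio always falls in [0, 1], so it can be fed back into `eta_hat` as θ.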
4.2. Asymptotic Properties
For the development of asymptotic results, we assume that D is fixed and consider a sequence of increasing population densities λ0,n(·) = nλ0(·) for n = 1, 2, …. This corresponds to the usual setting in spatial epidemiology where data are accumulated over time “n” in a fixed geographical region (e.g., the state of Connecticut). In the following, we modify the notation from the previous sections by adding a subscript n. Thus, Nn and Mn correspond to N and M, and are sequences of Poisson processes with intensity functions λ0,n(·) exp{Z(·)′ β*} and λ0,n(·)α(·), respectively. Furthermore, Un(β) and Ũn(β) are defined as U(β) and Ũ(β) but with N and M replaced by Nn and Mn. Other quantities that are dependent on N and M can be similarly generalized. We let U1(·) be Un with n = 1, and define J1(β) = −E{∂U1(β)/∂β′} and V1(β) = Var {U1(β)}. Theorem 1 establishes consistency and asymptotic normality of β̂n, and its proof can be found in Web Appendix C.
Theorem 1
Assume sups∈D ‖Z(s)‖ ≤ C for some 0 < C < ∞ and J1(β*)′V1(β*)⁻¹J1(β*) is positive definite. Then there exists a √n-consistent asymptotically normal sequence of solutions of the estimating equation Ũn(β) = 0.
5. Simulations
Let W(·) and Z1(·) be independent realizations of a stationary, isotropic Gaussian process with covariance function exp(−10u), where u is the spatial lag distance. We simulated both W(·) and Z1(·) on a 100 × 100 grid laid over a square window D = [0, 1] × [0, 1], where each grid cell had constant values of W and Z1. We similarly simulated Z2(·) except that they were independent standard normal random variables. Given W(·), we defined the spatially varying population intensity λ0,n(s) = n exp{0.5W(s)} for n = 1, 2. Both Z1(·) and Z2(·) were treated as covariates but W(·) was not.
We generated realizations of cases and controls on D from two inhomogeneous Poisson processes with respective intensity functions λn(s; β) = λ0,n(s) exp{β0 + Z1(s)β1 + Z2(s)β2} and ρn(s) = αλ0,n(s), where β = (β0, β1, β2) = (4.9335, 0.5, 0.5). The expected number of cases per realization was 200 and 400 for n = 1, 2, respectively. We chose α in a way such that the expected number of controls was twice as large as that of cases. We assumed that Z(·) = {Z0(·), Z1(·), Z2(·)}′ was observed for every case and control event, where Z0(s) = 1 for all s ∈ D. In addition, aggregated information μjk = ∫Dk Zj(s)λ0,n(s) ds was available for j ∈ {0, 1}, {0, 2} or {0, 1, 2}, where Dk’s were equal sub-squares that partitioned D for k = 1, …, K. We considered K = 5², 10², and 20².
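The data-generating step can be sketched as follows. To stay short, the sketch replaces the spatially correlated Gaussian fields W and Z1 with independent standard normal placeholders, so it illustrates the Poisson sampling and the block aggregation only, not the authors' exact simulation design. All function names are hypothetical.

```python
import numpy as np

def simulate_poisson_on_grid(intensity, rng):
    """Simulate a Poisson process on [0, 1]^2 whose intensity is constant within each grid cell."""
    m = intensity.shape[0]
    counts = rng.poisson(intensity / m**2)        # expected count = intensity x cell area
    rows, cols = np.nonzero(counts)
    reps = counts[rows, cols]
    rows, cols = np.repeat(rows, reps), np.repeat(cols, reps)
    # Given the count, points are uniformly distributed within their cell.
    x = (cols + rng.uniform(size=rows.size)) / m
    y = (rows + rng.uniform(size=rows.size)) / m
    return np.column_stack([x, y]), rows, cols

def aggregate_over_blocks(cell_values, blocks_per_side):
    """Sum cell values over equal square blocks (K = blocks_per_side ** 2 strata)."""
    m = cell_values.shape[0]
    b = m // blocks_per_side
    return cell_values.reshape(blocks_per_side, b, blocks_per_side, b).sum(axis=(1, 3))

rng = np.random.default_rng(1)
m, n = 100, 1
W, Z1, Z2 = (rng.standard_normal((m, m)) for _ in range(3))  # placeholders: W, Z1 are NOT spatially correlated here
beta = np.array([4.9335, 0.5, 0.5])
lam0 = n * np.exp(0.5 * W)                                    # population intensity
lam_cases = lam0 * np.exp(beta[0] + beta[1] * Z1 + beta[2] * Z2)
alpha = 2.0 * lam_cases.mean() / lam0.mean()                  # expected controls = 2 x expected cases
cases, case_rows, case_cols = simulate_poisson_on_grid(lam_cases, rng)
controls, _, _ = simulate_poisson_on_grid(alpha * lam0, rng)
mu_1k = aggregate_over_blocks(Z1 * lam0 / m**2, 5)            # aggregated Z1 over K = 5^2 sub-squares
```

Covariate values at each simulated case are then read off the grid, for example `Z1[case_rows, case_cols]`.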
Table 1 compares the empirical standard errors (SEs) of our estimator and the estimator from the standard logistic regression without using any aggregated information, based on 1000 simulations. The empirical biases were all negligible. It is clear that our proposed estimator could reduce the SEs considerably compared to the logistic regression approach. Specifically, when there was aggregated information available for Z1 (and/or Z2), the SEs of our estimator for β1 (and/or β2) were appreciably smaller than those of the estimator based on the logistic regression approach, regardless of the values of n and K. This observation demonstrated the importance of including aggregated information in the analysis. When n increased from 1 to 2, the SEs of our proposed estimator dropped on average by 30%, which was comparable to the expected drop of 29.29% following the convergence rate given in Theorem 1. When K increased, the SEs of our proposed estimator for β1 decreased when μ1k’s were available, but remained nearly unchanged when only μ2k’s were available. The difference was due to the fact that Z1’s were spatially correlated but Z2’s were not. A finer partition of D could yield more information on the covariate Z1 and further led to an improved estimator for β1, but the same could not be said for Z2 and β2.
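The 29.29% benchmark follows directly from the √n rate: if standard errors scale as n^{-1/2}, doubling n multiplies them by 1/√2. As a rough check, which is an illustration added here rather than a calculation from the paper, the same ratio can be computed for the logistic regression SEs quoted in the caption of Table 1.

```latex
\[
\frac{\mathrm{SE}_{n=2}}{\mathrm{SE}_{n=1}} \approx \sqrt{\tfrac{1}{2}} \approx 0.7071,
\qquad
1 - 0.7071 = 0.2929 .
\]
\[
\text{Logistic regression: }\quad
\frac{0.0725}{0.1028} \approx 0.705 \ (\beta_1),
\qquad
\frac{0.0679}{0.0976} \approx 0.696 \ (\beta_2).
\]
```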
Table 1.
Ratios of empirical SEs from the proposed method using aggregated information with θ in (5) chosen optimally to the empirical SEs from the standard logistic regression based on 1000 simulations. Indices indicate the collections of j’s where Zj has available aggregated information, for j = 0, 1 or 2. Parameter n is related to the spatially varying population density, and K is the number of equal sub-regions that partition the entire region. The empirical SEs of (β1, β2) from the standard logistic regression are (0.1028, 0.0976) when n = 1, and (0.0725, 0.0679) when n = 2.
Indices | K | n = 1: β1 | n = 1: β2 | n = 2: β1 | n = 2: β2 |
---|---|---|---|---|---|
{0, 1} | 5² | 0.9095 | 1.0143 | 0.9034 | 1.0103 |
{0, 1} | 10² | 0.8852 | 1.0154 | 0.8759 | 1.0118 |
{0, 1} | 20² | 0.8658 | 1.0133 | 0.8538 | 1.0103 |
{0, 2} | 5² | 1.0107 | 0.9314 | 1.0110 | 0.9426 |
{0, 2} | 10² | 1.0107 | 0.9303 | 1.0110 | 0.9396 |
{0, 2} | 20² | 1.0107 | 0.9262 | 1.0124 | 0.9381 |
{0, 1, 2} | 5² | 0.9105 | 0.9242 | 0.9034 | 0.9323 |
{0, 1, 2} | 10² | 0.8842 | 0.9139 | 0.8759 | 0.9190 |
{0, 1, 2} | 20² | 0.8619 | 0.9016 | 0.8538 | 0.9087 |
We estimated the SEs of our proposed estimator using the bootstrap. For each bootstrap iteration, we drew random samples of size R1 and R2 with replacement from the cases and controls, where R1 and R2 were independent Poisson random variables with means 200 and 400 for n = 1 and 400 and 800 for n = 2, respectively. We used 200 bootstrap samples. The bootstrap SEs on average were slightly smaller than the empirical SEs (their ratios can be found in Table 2) but the differences were small. The coverage probabilities for 95% confidence intervals were only slightly less than 95% (between 92.7% and 94.5%).
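The resampling scheme described above can be sketched as follows. Here `estimator` is a hypothetical user-supplied function that takes resampled case and control arrays and returns a parameter estimate; the rest follows the description in the text, with Poisson-distributed resample sizes and sampling with replacement.

```python
import numpy as np

def poisson_bootstrap_indices(n_obs, mean_size, rng):
    """Draw a Poisson-distributed number of indices, sampled with replacement."""
    size = rng.poisson(mean_size)
    return rng.integers(0, n_obs, size=size)

def bootstrap_se(estimator, cases, controls, mean_cases, mean_controls, n_boot, rng):
    """Bootstrap standard error(s) of `estimator` under Poisson resample sizes."""
    estimates = []
    for _ in range(n_boot):
        ic = poisson_bootstrap_indices(len(cases), mean_cases, rng)
        im = poisson_bootstrap_indices(len(controls), mean_controls, rng)
        estimates.append(estimator(cases[ic], controls[im]))
    # Sample standard deviation across bootstrap replicates, per parameter.
    return np.std(np.asarray(estimates), axis=0, ddof=1)
```

Letting the resample sizes vary mimics the Poisson model, in which the numbers of cases and controls are themselves random rather than fixed by design.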
Table 2.
Ratios of bootstrap SEs using 50 bootstrap iterations to empirical SEs for the proposed method based on 1000 simulations. Same symbols as in Table 1.
Indices | K | n = 1: β1 | n = 1: β2 | n = 2: β1 | n = 2: β2 |
---|---|---|---|---|---|
{0, 1} | 5² | 0.9615 | 0.9737 | 0.9649 | 0.9869 |
{0, 1} | 10² | 0.9549 | 0.9758 | 0.9638 | 0.9898 |
{0, 1} | 20² | 0.9596 | 0.9778 | 0.9725 | 0.9825 |
{0, 2} | 5² | 0.9856 | 0.9802 | 0.9659 | 0.9766 |
{0, 2} | 10² | 0.9750 | 0.9868 | 0.9727 | 0.9765 |
{0, 2} | 20² | 0.9769 | 0.9812 | 0.9728 | 0.9686 |
{0, 1, 2} | 5² | 0.9658 | 0.9745 | 0.9649 | 0.9731 |
{0, 1, 2} | 10² | 0.9527 | 0.9709 | 0.9575 | 0.9744 |
{0, 1, 2} | 20² | 0.9628 | 0.9795 | 0.9661 | 0.9773 |
6. Application to Endometrial Cancer Data
6.1. Risk Factors and Aggregated Summary Statistics
We applied the proposed method to investigate potential risk factors for endometrial cancer, by supplementing the population-based case–control data with summary statistics for the population obtained through BRFSS, the population estimates and the ADT data. The population at risk were females between the ages of 35 and 80 years. In the epidemiology literature, aging, overweight, low parity, early menarche, late menopause, and hormone imbalance are generally regarded as risk factors for endometrial cancer, while race and lifestyle variables (such as smoking) are also found to be associated with it; see MacMahon (1974) and Austin, Drews, and Partridge (1993) for references.
Let Z(s) denote the vector of risk factors for a female at s. The risk factors considered in our study were age, race, education level, smoking history, alcohol consumption, pregnancy history, overweight, obesity, and exposure to traffic. More specifically, race, education level, smoking history, alcohol consumption and pregnancy history referred to whether the subject was white, attended college, ever smoked, ever drank more than once per week for over a period of six months, and ever became pregnant in the past, respectively. Overweight and obese individuals were those with 25 ≤ BMI < 30 and BMI ≥ 30, respectively. We also included exposure to traffic derived at the subject’s residency location. All factors were binary except for age and traffic, which were standardized by their respective means and standard deviations obtained from the case–control data. Since the log odds of endometrial cancer appeared to have a nonlinear relationship with age, our model also included an age-squared term. Therefore, including the intercept term, Z(·) was an 11 × 1 vector.
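The covariate coding described above can be sketched as follows. The BMI cut-offs come from the text; the use of the sample standard deviation (`ddof=1`) and of the square of standardized age for the quadratic term are assumptions of this sketch, and the input values are made up.

```python
import numpy as np

def code_bmi(bmi):
    """Binary indicators: overweight if 25 <= BMI < 30, obese if BMI >= 30."""
    bmi = np.asarray(bmi, dtype=float)
    overweight = ((bmi >= 25) & (bmi < 30)).astype(int)
    obese = (bmi >= 30).astype(int)
    return overweight, obese

def standardize(x):
    """Center by the sample mean and scale by the sample standard deviation (assumed ddof=1)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)

# Hypothetical subjects (not study data).
overweight, obese = code_bmi([22.0, 25.0, 29.9, 30.0, 41.0])
age_std = standardize([40, 55, 62, 70, 78])
age_sq = age_std**2   # assumed form of the age-squared term
```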
We were able to obtain aggregated information for seven of the eleven elements of Z(·). In particular, percentages of females who had attended college, smoked in the past, and were overweight or obese were available from BRFSS in every 5-year interval from age 35 to 80 years, which led to a total of 9 strata. The population estimates provided the total number of female residents and the percentages of white females in each of the nine age groups and in each of the eight counties in Connecticut; this resulted in 72 strata. For exposure to traffic, the aggregated summary statistics were defined at the age group-by-zip code level. There were 282 zip codes in Connecticut, which led to 2,538 (= 9 × 282) strata.
For each given form of strata, Ua(·) defined in (2) was derived given some properly defined weights wk. We followed Diggle et al. (2010) in defining wk from an estimate of Z(s) for s ∈ Dk and an estimate of β*. Each component of the estimate of Z(s) was calculated at the stratification level at which Ua(·) was formed.
Because the controls were frequency matched to the cases by age groups, the probability for an individual being included as a control depended on her age. In our data, we also observed an over-representation of whites in the controls compared to the population. Specifically, 94.4% of the controls were white, in contrast to only 87.2% in the population according to the Census. We thus estimated α(·) as the ratio of the number of controls in a particular age group-by-race category to the total number of the population in the same category. In a more general setting, we estimate α(·) by accounting for both known factors that were used in the sampling design to select the controls, for example, age in our endometrial cancer study, and any additional factors whose distributions in the controls are clearly different from those in the population, for example, race in our study. As we did in our application, this can be achieved by setting α(·) as the ratio of the number of controls in a given category defined by these factor level combinations to the total number of population in the same category. Nevertheless, a misspecification of α(·) is still possible and may result in biased estimates for the effect of the risk factors.
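The ratio estimate of α(·) can be sketched as below. The category labels and counts are hypothetical and serve only to show the calculation: the number of controls in each age group-by-race cell divided by the population count in the same cell.

```python
from collections import Counter

def estimate_alpha(control_categories, population_counts):
    """alpha for each category = (# controls in category) / (population count in category)."""
    control_counts = Counter(control_categories)
    return {cat: control_counts.get(cat, 0) / population_counts[cat]
            for cat in population_counts}

# Hypothetical example: one age group, two race categories (not study data).
controls = [("35-39", "white")] * 3 + [("35-39", "other")]
population = {("35-39", "white"): 3000, ("35-39", "other"): 2000}
alpha = estimate_alpha(controls, population)
```

Each control then receives the α value of its own category when the estimating equations are evaluated.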
6.2. Results
We estimated β using our proposed approach, with α(·) being adjusted for both age and race as we described in the previous subsection and with α(·) being adjusted for age only. For comparison, we also applied the conditional logistic regression approach since the controls were frequency matched to the cases using age groups. This essentially leaves α(·) non-parametric by conditioning on the matched age groups. It is worth noting that the different treatments of α(·) affect the interpretation and estimates of regression parameters in these methods.
Table 3 presents the parameter estimates and their SEs. The parameter estimates from the two analyses using our proposed approach are very similar for all risk factors except race, indicating that adjusting α(·) for race did not materially affect the risk estimates of the other factors. The SEs were computed using bootstrap with a consideration for the frequency matching used in the study design. More specifically, for each of the 1000 bootstrap iterations, if the numbers of cases and controls in an age group were n and m respectively, we randomly sampled with replacement R1 cases and R2 controls from that particular age group, where R1 and R2 were independent Poisson random variables with respective means equal to n and m. Consistent with the observations made in the simulation, our method yielded smaller SEs for all risk factors compared with the conditional logistic regression approach.
Table 3.
Risk estimates and their standard errors (in parentheses) from 1000 bootstrap resamples. Method 1: conditional logistic regression; Method 2: the proposed method using aggregated information with θ in (5) chosen optimally and adjusting for age only; Method 3: same as Method 2 but adjusting for both age and race. Method 1 is based on the case–control data alone.
Covariate | Source of aggregated information | Method 1 | Method 2 | Method 3 |
---|---|---|---|---|
Intercept | Census population estimates | — | −6.6831 (0.2647) | −6.8842 (0.3483) |
Age | — | — | 0.4004 (0.0537) | 0.3878 (0.0569) |
Age² | — | — | −0.3666 (0.0449) | −0.3783 (0.0520) |
Race | Census population estimates | −0.1324 (0.2620) | 0.4006 (0.1545) | 0.6276 (0.2564) |
Education | BRFSS | −0.3838 (0.1442) | −0.1780 (0.1318) | −0.1790 (0.1408) |
Smoking | BRFSS | −0.3373 (0.1237) | −0.3342 (0.1151) | −0.3357 (0.1218) |
Alcohol | — | −0.2387 (0.1340) | −0.2311 (0.1323) | −0.2352 (0.1273) |
Pregnancy | — | −0.7488 (0.1697) | −0.7334 (0.1678) | −0.7434 (0.1830) |
Overweight | BRFSS | 0.4454 (0.1571) | 0.3545 (0.1420) | 0.3852 (0.1554) |
Obesity | BRFSS | 1.5615 (0.1591) | 1.5468 (0.1195) | 1.5936 (0.1565) |
Traffic | Traffic data | 0.0376 (0.0626) | 0.0340 (0.0562) | 0.0352 (0.0606) |
Our analysis suggests that endometrial cancer risk increases if one is white, overweight or obese, but decreases if one ever smoked or had past pregnancies. It also increases with age before 66 but gradually decreases after. However, exposure to traffic does not appear to be related to endometrial cancer risk. The results presented in Table 3 were based on traffic exposure derived using a 2000-meter buffer zone from the residency. Our conclusion persisted when a 500- or a 1000-meter buffer zone was used instead.
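The turning point in age can be read off the quadratic fit. The worked step below assumes that the age-squared term is the square of standardized age and uses the Method 3 estimates from Table 3; converting the result to years requires the mean and standard deviation of age in the case–control data, which are not reported in this section.

```latex
\[
g(a) = \beta_{\text{age}}\, a + \beta_{\text{age}^2}\, a^2,
\qquad
a^{*} = -\frac{\beta_{\text{age}}}{2\,\beta_{\text{age}^2}}
      = \frac{0.3878}{2 \times 0.3783} \approx 0.51 ,
\]
\[
\text{age}^{*} = \overline{\text{age}} + a^{*} \times \mathrm{sd}(\text{age}) .
\]
```

Because the coefficient of the squared term is negative, g(a) increases for a < a* and decreases afterwards, matching the pattern described in the text.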
Hormone balance, especially the balance between the estrogen and progesterone, is thought to play an important part in the development of endometrial cancer. Exposure to elevated estrogen levels that are not counter balanced by progesterone is an important risk factor for the disease. Many epidemiologic studies have found evidence that prior pregnancies protect the endometrium from excessive estrogen exposure and are linked with lowered endometrial cancer risk. On the other hand, obesity is classified as a source of estrogen exposure and may increase the risk. See Lu et al. (2011) and the references therein.
Contradictory to the common belief that cigarette smoking increases the incidence of chronic diseases, our method along with other cohort and case–control studies found a statistically significant lower risk of endometrial cancer among smokers; see also Zhou et al. (2008)’s meta-analysis and the references therein. While estrogen contributes directly to endometrial cancer, anti-estrogens inhibit estrogen-induced cellular proliferation and mutations in endometrial glands to protect against the cancer (Henderson and Feigelson, 2000). Smoking may lower endometrial cancer risk through raising the circulation of anti-estrogens (Tankó and Christiansen, 2004), and reducing the circulation of estrogen through weight loss and earlier menopause (Parazzini et al., 1991).
Our analysis revealed a significant effect of race but such an effect could not be detected using the conditional logistic regression approach. Specifically, we found that endometrial cancer risk increased by a factor of 1.87 (95% confidence interval 1.13–3.10) for whites. Our finding is consistent with a prospective cohort study by Setiawan et al. (2007), which documented a higher endometrial cancer risk among Caucasians compared to other ethnic groups after adjusting for other risk factors. In terms of education, we found no association between education and risk of endometrial cancer with our approach, but the conditional logistic regression suggested that attending college reduced endometrial cancer risk.
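The factor and interval quoted for race are consistent with exponentiating the Method 3 estimate and a normal-approximation interval built from its bootstrap SE. The sketch below assumes that construction (the text does not state how the interval was obtained) and takes the estimate 0.6276 and SE 0.2564 from Table 3.

```python
import math

def ratio_with_ci(estimate, se, z=1.96):
    """Exponentiate a log-scale estimate and its Wald-type interval (assumed construction)."""
    return (math.exp(estimate),
            math.exp(estimate - z * se),
            math.exp(estimate + z * se))

ratio, lower, upper = ratio_with_ci(0.6276, 0.2564)
print(f"{ratio:.2f} ({lower:.2f}, {upper:.2f})")  # prints "1.87 (1.13, 3.10)"
```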
7. Discussion
We have proposed a new method to combine population-based case–control data and population summary statistics in disease risk estimation. Our method is flexible and can be applied when aggregated information is available only for a subset of the risk factors. It can be used to incorporate both spatially and non-spatially aggregated information that can be obtained from diverse sources under different stratification structures. Our simulation shows that the proposed method can yield more efficient estimators than the logistic regression approach based on case–control data alone.
Combining data from diverse sources is helpful, but care must be taken when the underlying populations of the different data sources are not in close agreement. In particular, large discrepancies may occur between the population from which the controls were selected and the true population at risk. Failure to properly account for such discrepancies could lead to biased parameter estimates. In our application where an over-representation of whites in the controls was observed, we accounted for it by adjusting α(·) for race. Our two analyses, with or without the adjustment, yielded similar parameter estimates except for race. We could adjust for race because a detailed age group-by-race distribution of the population at risk was available, which allowed us to compare it to that of the controls directly. However, in general such information may not be available for other risk factors.
Besides discrepancies between the different populations, there are other potential caveats with our method. Firstly, we assume that α(·) is known, but often it has to be estimated. Secondly, to generalize our findings to the population, we would require the cases to be a random sample of all diseased subjects, which may not hold. A violation of either of these assumptions could lead to biased parameter estimates. However, both assumptions are also required by other existing methods (e.g., the logistic regression analysis approach) for analyzing case–control data. Future research is needed to develop new methods to mitigate such potential biases. Lastly, we have not accounted for uncertainties associated with the population summaries. Even though these uncertainties are expected to be small, the bootstrap standard errors may underestimate the true variability in the parameter estimates.
To analyze the effect of traffic exposure on endometrial cancer risk, we used Geographic Information Systems (GIS) to derive an exposure measure based on ADT and individuals’ residential locations. Our proposed measure captures varying traffic density due to patterns of intersecting roadways (Holford et al., 2010), but does not account for other factors that could also affect one’s exposure, such as prevailing wind directions. Even if all such factors were considered, the resulting measure could at best be only a proxy for one’s true exposure. More accurate measures can be obtained using personal monitors. However, this would be considerably more expensive and may be infeasible for large-scale population-based case–control studies of chronic diseases, since the diseases may take years to develop and an exposure often has to be constructed retrospectively as a result. In contrast, the use of GIS to estimate exposure can be performed rather quickly and therefore remains an attractive approach in the investigation of cancer risk factors.
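To make the idea of a GIS-derived proxy concrete, the following sketch computes a generic distance-weighted traffic score from ADT values at road-segment midpoints around a residence. It is an illustration only and is not the measure used in our analysis or in Holford et al. (2010): the Gaussian weight, the bandwidth `bandwidth_m`, the cutoff `max_dist_m`, and the function name `distance_weighted_adt` are all hypothetical choices, and coordinates are assumed to be planar and in metres.

```python
import math

def distance_weighted_adt(home, segments, bandwidth_m=150.0, max_dist_m=500.0):
    """Sum ADT over nearby road segments, down-weighted by distance.

    home     : (x, y) residence coordinates in metres (assumed planar).
    segments : iterable of (x, y, adt) for road-segment midpoints.
    The Gaussian weight and both tuning constants are hypothetical.
    """
    total = 0.0
    for x, y, adt in segments:
        d = math.hypot(x - home[0], y - home[1])    # Euclidean distance to segment
        if d <= max_dist_m:                         # ignore segments beyond the cutoff
            total += adt * math.exp(-0.5 * (d / bandwidth_m) ** 2)
    return total

# Toy example with artificial numbers: two nearby roads and one distant road.
roads = [(0.0, 100.0, 12000.0), (200.0, 0.0, 3000.0), (2000.0, 0.0, 50000.0)]
print(distance_weighted_adt((0.0, 0.0), roads))
```

Because contributions from several nearby segments add up, a score of this form increases near intersecting roadways, while the distant high-volume road in the toy example contributes nothing.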
8. Supplementary Materials
Web Appendices referenced in Section 4, MATLAB code, and some artificial data are available with this paper at the Biometrics website on Wiley Online Library.
Acknowledgments
We thank the editor, associate editor, and two reviewers for their constructive comments. This research has been partially supported by NIH grants 1R01CA169043, 5R01CA098346 and 5R01ES017416, NSF grant DMS-0845368, the Danish Council for Independent Research-Natural Sciences grant 12-124675, “Mathematical and Statistical Analysis of Spatial Data,” and the Centre for Stochastic Geometry and Advanced Bioimaging, funded by the Villum Foundation.
References
- Austin H, Drews C, Partridge E. A case-control study of endometrial cancer in relation to cigarette smoking, serum estrogen levels, and alcohol use. American Journal of Obstetrics & Gynecology. 1993;169:1086–1086. doi: 10.1016/0002-9378(93)90260-p.
- Beelen R, Hoek G, van den Brandt PA, Goldbohm RA, Fischer P, Schouten LJ, Armstrong B, Brunekreef B. Long-term exposure to traffic-related air pollution and lung cancer risk. Epidemiology. 2008;19:702–710. doi: 10.1097/EDE.0b013e318181b3ca.
- Diggle P, Guan Y, Hart C, Paize F, Stanton M. Estimating individual-level risk in spatial epidemiology using spatially aggregated information on the population at risk. Journal of the American Statistical Association. 2010;105:1394–1402. doi: 10.1198/jasa.2010.ap09323.
- Diggle PJ, Rowlingson BS. Conditional approach to point process modelling of elevated risk. Journal of the Royal Statistical Society, Series A. 1994;157:433–440.
- Grant WB. Air pollution in relation to US cancer mortality rates: An ecological study; likely role of carbonaceous aerosols and polycyclic aromatic hydrocarbons. Anticancer Research. 2009;29:3537–3545.
- Haneuse S, Wakefield J. Hierarchical models for combining ecological and case-control data. Biometrics. 2007;63:128–136. doi: 10.1111/j.1541-0420.2006.00673.x.
- Haneuse S, Wakefield J. Geographic-based ecological correlation studies using supplemental case-control data. Statistics in Medicine. 2008a;27:864–887. doi: 10.1002/sim.2979.
- Haneuse S, Wakefield J. The combination of ecological and case-control data. Journal of the Royal Statistical Society, Series B. 2008b;70:73–93. doi: 10.1111/j.1467-9868.2007.00628.x.
- Henderson BE, Feigelson HS. Hormonal carcinogenesis. Carcinogenesis. 2000;21:427–433. doi: 10.1093/carcin/21.3.427.
- Heyde C. Quasi-Likelihood and Its Application: A General Approach to Optimal Parameter Estimation. New York: Springer-Verlag; 1997.
- Holford T, Ebisu K, McKay L, Gent J, Triche E, Bracken M, Leaderer B. Integrated exposure modeling: A model using GIS and GLM. Statistics in Medicine. 2010;29:116–129. doi: 10.1002/sim.3732.
- Lu L, Risch H, Irwin ML, Mayne ST, Cartmel B, Schwartz P, Rutherford T, Yu H. Long-term overweight and weight gain in early adulthood in association with risk of endometrial cancer. International Journal of Cancer. 2011;129:1237–1243. doi: 10.1002/ijc.26046.
- MacMahon B. Risk factors for endometrial cancer. Gynecologic Oncology. 1974;2:122–129. doi: 10.1016/0090-8258(74)90003-1.
- Møller J, Waagepetersen R. Statistical Inference and Simulation for Spatial Point Processes. Boca Raton: Chapman and Hall; 2004.
- Parazzini F, La Vecchia C, Bocciolone L, Franceschi S. The epidemiology of endometrial cancer. Gynecologic Oncology. 1991;41:1–16. doi: 10.1016/0090-8258(91)90246-2.
- Pearson RL, Wachtel H, Ebi KL. Distance-weighted traffic density in proximity to a home is a risk factor for leukemia and other childhood cancers. Journal of the Air & Waste Management Association. 2000;50:175–180. doi: 10.1080/10473289.2000.10463998.
- Prentice R, Sheppard L. Aggregate data studies of disease risk factors. Biometrika. 1995;82:113–125.
- Raaschou-Nielsen O, Hertel O, Thomsen BL, Olsen JH. Air pollution from traffic at the residence of children with cancer. American Journal of Epidemiology. 2001;153:433–443. doi: 10.1093/aje/153.5.433.
- Setiawan VW, Pike MC, Kolonel LN, Nomura AM, Goodman MT, Henderson BE. Racial/ethnic differences in endometrial cancer risk: The multiethnic cohort study. American Journal of Epidemiology. 2007;165:262–270. doi: 10.1093/aje/kwk010.
- Tankó LB, Christiansen C. An update on the antiestrogenic effect of smoking: A literature review with implications for researchers and practitioners. Menopause. 2004;11:104–109. doi: 10.1097/01.GME.0000079740.18541.DB.
- Wakefield J. Ecological inference for 2 × 2 tables (with discussion). Journal of the Royal Statistical Society, Series A. 2004;167:385–445.
- Zhou B, Yang L, Sun Q, Cong R, Gu H, Tang N, Zhu H, Wang B. Cigarette smoking and the risk of endometrial cancer: A meta-analysis. American Journal of Medicine. 2008;121:501. doi: 10.1016/j.amjmed.2008.01.044.