Author manuscript; available in PMC: 2015 Jan 1.
Published in final edited form as: J Am Stat Assoc. 2014 Mar 19;109(505):11–23. doi: 10.1080/01621459.2013.870904

A new estimation approach for combining epidemiological data from multiple sources

Hui Huang 1, Xiaomei Ma 2, Rasmus Waagepetersen 3, Theodore R Holford 4, Rong Wang 5, Harvey Risch 6, Lloyd Mueller 7, Yongtao Guan 8
PMCID: PMC3964681  NIHMSID: NIHMS546629  PMID: 24683281

Abstract

We propose a novel two-step procedure to combine epidemiological data obtained from diverse sources, with the aim of quantifying risk factors affecting the probability that an individual develops a certain disease such as cancer. In the first step, we derive all possible unbiased estimating functions, each based on one group of cases and one group of controls. In the second step, we combine these estimating functions efficiently in order to make full use of the information contained in the data. Our approach is computationally simple and flexible. We illustrate its efficacy through simulation and apply it to investigate pancreatic cancer risks based on data obtained from the Connecticut Tumor Registry, a population-based case-control study, and the Behavioral Risk Factor Surveillance System, which is a state-based system of health surveys.

Keywords: Spatial epidemiology, spatial point process, estimating equation

1 Introduction

In the era of electronic records, it has become easier than ever for researchers investigating risk factors for disease to obtain epidemiological data from diverse sources. One widely used source of such data is the case-control study, where the aim is to investigate risk factors affecting the probability that an individual develops a certain disease. A book-length treatment of this topic can be found in Schlesselman (1982). In its simplest form, a case-control study consists of the values of all risk factors under consideration for a representative sample of individuals in the study region who develop the health outcome (cases), as well as for a representative sample of individuals of the population giving rise to the cases (controls). Case-control studies are attractive because they provide information for an extensive list of risk factors, more than is generally available from other data sources such as tumor registry data, which, other than patients’ residential addresses, typically contain only basic demographic, diagnostic and clinical variables. Case-control studies allow epidemiologists to perform a more comprehensive assessment of risk factors. Established statistical techniques are available for analyzing case-control data, in particular logistic regression analysis or conditional logistic regression analysis (for individually matched case-control studies).

Despite the strengths and popularity of case-control studies, their use alone can be inefficient when investigating spatial risk factors such as exposure to traffic. This is mainly because information obtained from other sources, such as tumor registry data or data from the Behavioral Risk Factor Surveillance System (BRFSS), which is a large state-based system of health surveys, may also contain important, and often richer or more precise, spatial or geographic information. Moreover, effects of spatial risk factors on cancer risk, if any, may be small. It is therefore desirable to be able to combine risk-factor data of diverse sources in a single analysis so as to increase the effective sample size.

The problem of combining epidemiological data obtained from multiple sources has been studied by several authors. For example, Prentice and Sheppard (1995) and Wakefield (2004) considered combining aggregated disease data with individual control or cohort data; Haneuse and Wakefield (2007, 2008a,b) argued that the cohort-based approach of Wakefield (2004) was inefficient for investigation of rare disease outcomes such as cancer. They instead proposed a hybrid design to combine aggregated disease data with either case-control data or case data alone. Prentice and Sheppard’s approach assumes significant variation in the risk factors across sub-populations over which the aggregated disease counts were made. The methods proposed in Wakefield (2004) and Haneuse and Wakefield (2007, 2008a,b) concerned binary risk factors, but we may also be interested in continuous risk factors (e.g., exposure to traffic) in analyses. Diggle et al. (2010) developed procedures for combining individual-level disease data and spatially aggregated information on a population at risk. They required spatially aggregated information for all risk factors under consideration, which may not be generally possible. Best et al. (2000) proposed a sophisticated parametric model to combine data at disparate spatial resolutions, where all data are related to a latent, spatially continuous stochastic process representing unexplained spatial variation in risk. In order to fit the model, they used a computationally intensive Markov chain Monte Carlo (MCMC) implementation of Bayesian inference. Jackson et al. (2006) also used MCMC methods to fit their proposed hierarchical model for combining small area and individual-level data on exposures and health outcomes. In a social science case study, Angrist and Krueger (1992) considered instrumental variable inference for a linear model with moments for the instrumental variable method obtained from different data sets.

Multiple imputation (e.g., Rubin, 2004) is a general strategy for handling inference in presence of missing data where missing data are imputed in complete-case estimating functions. In the context of combining different epidemiological datasets where variables from one dataset may be missing in another dataset, multiple imputation has been used, e.g., by Gelman et al. (1998) and Schenker et al. (2010). One drawback of multiple imputation is the need of a joint model for all variables so that the missing quantities can be simulated given observed ones (e.g., Schafer, 1999). Statistical (or file) matching (Little and Rubin, 2002) is another possibility for creating complete datasets where two incomplete data sets are merged based on a common variable appearing in both data sets. The validity of this procedure hinges on restrictive assumptions of conditional independence given the common variable used for matching. The approach in Robins et al. (1994) is also based on complete-case estimating functions but removes estimating function components due to individuals with missing data. The remaining terms are then weighted with the inverse probability of missingness in order to retain unbiasedness.

We propose a novel two-step procedure to combine data obtained from diverse sources, with the aim of fitting a parametric model to quantify risk factors affecting the probability that an individual develops a certain disease such as cancer. In the first step, we derive all possible estimating functions based on data from one single source (e.g., a case-control study) or from two different sources. By considering only one or two data sources at a time, we can significantly reduce the difficulty in handling datasets with different patterns of missing covariates. In the second step, we combine all the available estimating functions efficiently in order to make full use of the information contained in the data. Our method of combining several estimating functions is related to the generalized method of moments used to combine various sources of information for microeconomic models and longitudinal surveys in Imbens and Lancaster (1994) and Zhou and Kim (2012), respectively. Because our approach is constructed by forming unbiased estimating functions in terms of the risk factors, it avoids the use of computationally intensive MCMC algorithms. In situations with missing data, we use single imputation of the missing data in a set of modified complete-case estimating functions. Only crude estimators of the missing quantities are needed in order to maintain unbiasedness of the resulting estimating functions, as long as the variance of these estimators is asymptotically negligible. Our method does not assume conditional independence, nor does it require any precise knowledge of the correlation among the various covariates.

We organize the remainder of this article as follows. In Section 2 we describe the pancreatic cancer study that has motivated the present work. We provide necessary background in Section 3 and discuss our proposed methods in Section 4. We assess their numerical properties through simulation in Section 5. We apply the proposed methods in Section 6 to answer the substantive question of whether traffic-related exposures are related to the risk of developing pancreatic cancer. We will use data obtained from a case-control study in Connecticut, the Connecticut Tumor Registry (CTR) and BRFSS. We conclude the article with a discussion in Section 7. Additional theoretical results are given in the Web Appendices.

2 Description of the data

Our core dataset was obtained from a classical case-control study for pancreatic cancer. We enrich this dataset with further cases from the CTR and further controls from BRFSS. Finally we add traffic exposure data from the Connecticut Department of Transportation.

2.1 Case-control study of pancreatic cancer

The cases included in the case-control study were histologically and clinically validated, incident pancreatic cancer patients in Connecticut diagnosed between January 1, 2005, and August 31, 2009 (Risch et al., 2010). To identify case patients, study staff made frequent regular visits to each of the 30 general hospitals across the state. Consent to approaching patients was obtained from physicians or physician practices for 83% of 1,092 potentially eligible newly diagnosed individuals aged 35-83 years at diagnosis of pancreatic cancer. Of 906 requested in-person interviews, 421 (46%) were successfully completed, after which 19 were found to be ineligible after further clinical review. The remaining patients were not able to be located or contacted (n = 50), were too ill or had died before study contact (n = 333), or refused study participation (n = 121). To identify potential control subjects, a pre-letter-assisted random-digit dialing (RDD) method was used over the same time frame. An address was sought for each randomly selected land-line telephone number through reverse-directory lookup in order to mail a study letter before telephone contact for eligibility. Control subjects were frequency matched to case patients by gender and age group (35-51, 52-59, 60-64, 65-69, 70-74, 75-79, and 80-83 years). A total of 1,137 potentially eligible control subjects was identified and 715 (63%) of them participated. Reasons for nonparticipation included inability to locate or contact (n = 140) and subject refusal (n = 282). All subjects were interviewed in person. At interview, participants provided signed informed consent, after which a structured questionnaire was utilized to collect information on a variety of potential risk factors. The study was approved by the Yale Human Investigation Committee.

2.2 The Connecticut Tumor Registry data on pancreatic cancer

Connecticut is a small state geographically, yet has a dense population (about 3.5 million). The CTR is the oldest cancer registry in the United States and has been a Surveillance, Epidemiology and End Results (SEER) program participating site since the SEER program commenced in 1973. The CTR has reciprocal reporting agreements with cancer registries in all adjacent states (and Florida, which is a popular vacation destination) to identify Connecticut residents with cancer diagnosed or treated in these states. CTR cases included in the present study fulfilled the following eligibility criteria: 1) Incident cancer designated in the CTR as pancreatic, diagnosed between January 1, 2005 and August 31, 2009; 2) Resident at diagnosis in the state of Connecticut; and 3) Aged 35-83 years old. These criteria were set to correspond to those used in the case-control study. However, only a minority of pancreatic cancer cases in the CTR undergo rigorous research study-level validation of their primary site, thus blanket accession of CTR cases allows for some cases of cancer from other organs extending to the pancreas (e.g., Ampulla of Vater, common bile duct) or metastatic to it, to be included. The CTR subjects do include deceased cases and those not granted physician permission to be approached by the case-control study, thus their number is appreciably larger. For each CTR case, we have identified age, date of diagnosis, gender, race, Hispanic ethnic origin, and residential address at the time of diagnosis. A total of 2,335 nominally pancreatic cancer patients was found (including the case-control study cases) and we have successfully geocoded the residential addresses of 2,275 (97%) of them.

2.3 The Behavioral Risk Factor Surveillance System data

BRFSS is a state-based system of health surveys collecting information on health risk behaviors, preventive health practices, and health care access primarily related to chronic diseases and injury. BRFSS was established in 1984 by the Centers for Disease Control and Prevention; with more than 350,000 adults interviewed each year, it is the largest telephone health survey in the world. We have obtained the raw 2008 BRFSS survey data for Connecticut to gather information on life-style variables such as cigarette smoking. A total of 6,155 Connecticut residents 18 years or older participated in the survey in 2008.

The 2008 BRFSS was conducted by using RDD to select study samples. The sampling frames between the BRFSS RDD and the case-control study RDD differed somewhat because the case-control study matched controls to the distribution of case sex and age. BRFSS also used post-survey weighting techniques to maximize the representativeness of the sampled data. The current BRFSS weighting formula, which can be found at http://www.cdc.gov/brfss/technical infodata/weighting.htm, accounts for differences in the basic probability of selecting among strata (i.e., subsets of area/prefix combinations), the number of residential telephone lines in the respondent’s household, the number of adults in the household, and the age-by-sex or age-by-race-by-sex distribution in the population in general (not in the cancer cases) so as to adjust for over-coverage and non-response.

The BRFSS data provide extremely rich information on lifestyle variables such as cigarette smoking. Because the survey targets the general population, BRFSS subjects should also be treated as controls. However, unlike the CTR data and the case-control data, residential locations of the study participants, and consequently traffic-related exposures that are derived based on residential locations, are only available at the zip code level. In the present study, the 4,459 (72%) BRFSS participants aged 35-83 years old with available zip codes were included in the analysis.

2.4 Traffic data

Traffic data for Connecticut are available from the state’s department of transportation. We have obtained data on annualized average daily traffic (ADT) on all numbered state highways (including interstates) for the year 2007; the ADT was measured on segments of roadways where major changes in volume occur.

For a subject at residence u, we may define exposure to traffic as an integral of the traffic volume over all points of highways within a buffer surrounding u. In practice, however, this integral has to be estimated. The ADT estimates are given along highway segments, as shown by the highways displayed in Figure 1. The open circles therein represent nodes dividing lines into segments with a common ADT. The ith segment, Ci, i = 1, ⋯, n, would be specified by the starting and ending nodes. We divide Ci into short subsegments, si0, si1, si2, ⋯, siJi. In our analysis, we use

$$\sum_{i=1}^{n}\tau_i\sum_{j=0}^{J_i} I\left(\|s_{ij}-u\|\le D\right)\Delta s_{ij}, \tag{1}$$

as the traffic exposure variable, where τi is the common ADT value on Ci, I(·) is an indicator function with ∥·∥ denoting the Euclidean distance, D is a prespecified constant used to define the buffer extent, and Δsij is the length of the jth subsegment of Ci.
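To make the approximation in (1) concrete, the following is a minimal sketch under simplifying assumptions that are not part of the original analysis: each highway segment is treated as a straight line between its two nodes, subsegment midpoints stand in for the points s_ij, and the function name `traffic_exposure` and the `step` parameter are hypothetical.

```python
import numpy as np

def traffic_exposure(u, segments, D, step=0.05):
    """Approximate the traffic exposure in (1) for a residence u.

    segments: list of (start_xy, end_xy, adt) tuples, one per segment C_i.
              Straight-line segments are an assumption of this sketch.
    D:        buffer radius, in the same units as the coordinates.
    step:     target subsegment length (hypothetical tuning parameter).
    """
    u = np.asarray(u, dtype=float)
    total = 0.0
    for start, end, adt in segments:
        start = np.asarray(start, dtype=float)
        end = np.asarray(end, dtype=float)
        length = np.linalg.norm(end - start)
        if length == 0.0:
            continue
        n_sub = max(1, int(np.ceil(length / step)))
        delta = length / n_sub                      # subsegment length, Delta s_ij
        # Midpoints of the subsegments stand in for the points s_ij.
        frac = (np.arange(n_sub) + 0.5) / n_sub
        pts = start + frac[:, None] * (end - start)
        inside = np.linalg.norm(pts - u, axis=1) <= D   # indicator I(||s_ij - u|| <= D)
        total += adt * inside.sum() * delta
    return total
```

For a single straight highway passing through the residence, the result should be close to the ADT multiplied by the length of highway inside the buffer (2D), which gives a quick sanity check.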

Figure 1. Calculation of the traffic exposure of a residence (bullet) along highways (bold curve) within a buffer circle with radius D.

2.5 Rationale for combining data

Motor vehicle emissions are a major source of air pollution. To date, numerous studies have been conducted to investigate whether exposure to traffic is related to risk of developing cancer (e.g., Pearson et al., 2000; Reynolds et al., 2002; Beelen et al., 2008; Visser et al., 2004; Raaschou-Nielsen et al., 2001, 2011). While some of these studies concluded a significant association of exposure to traffic with the risk for certain types of cancer, others were not able to detect such an association. For example, Pearson et al. (2000) concluded significant associations between their distance-weighted traffic density metrics and all childhood cancers (including childhood leukemia), but Raaschou-Nielsen et al. (2001) found that traffic-related air pollution at the residence did not appear to cause leukemias, central nervous system tumors, or non-Hodgkins lymphomas in children. Beelen et al. (2008) found an association of exposure to traffic with lung cancer for nonsmokers but no association for ex- or current smokers in their study. They argued that the failure to detect a significant association for the latter might be because the effect was too small to measure when compared against the much stronger association between cigarette smoking and lung cancer risk.

The majority of the existing studies was based on data collected from either case-control or cohort/case cohort studies. As a result, the total numbers of incident cancers included were typically much smaller than those recorded in tumor registries. Given that the effect of traffic may be small, as argued by Beelen et al. (2008) in the case of lung cancer for ex- and current smokers, it is desirable to increase the sample size by including all available cases. However, as we pointed out earlier, an analysis based on tumor registry data alone is not useful due to the very few risk factors available. For example, Reynolds et al. (2002) estimated rate ratios for childhood cancer incidence using tumor registry data but were able to adjust only for age, sex and race; Visser et al. (2004) conducted a separate smoking survey to show that smoking was not confounded with traffic effect, but still they did not include any other risk factors.

Our primary interest is to study whether exposure to traffic is related to the risk of developing pancreatic cancer. Raaschou-Nielsen et al. (2011) found no evidence that traffic-related air pollution increased the risk for pancreatic cancer, based on a large Danish cohort study. We have conducted a preliminary analysis based on our case-control data but also failed to detect any significant association. However, given the additional CTR data, we are interested in using them to supplement the case-control data so as to increase the sample size without suffering the limitations of using tumor registry data alone. We wish to further include the BRFSS data and investigate whether a significant association can now be detected with these additional data.

2.6 Definition of risk factors

Let Z(s) be a p × 1 vector of risk factors for an individual at location s. For ease of presentation we suppose that Z(s) = [Zd(s)′, Zl(s)′, Zt(s)′]′, where Zd(·), Zl(·) and Zt(·) denote demographic, life-style and traffic-related exposure variables, respectively. Moreover, we set the first element of Zd(·) to be unity. In our pancreatic cancer data application (Section 6), in addition to the value one, Zd will contain age, age squared and indicator variables for sex and race, Zl will consist of indicator variables for smoking and education status, and Zt will be a one-dimensional measure of traffic exposure. The availability and accuracy of the different risk factors can vary with the source of data. A typical scenario is summarized in Table 1 and will be assumed throughout the paper.

Table 1. Information on risk factors in the pancreatic cancer data.

Source             Zd         Zl         Zt
CTR                available  missing    available
BRFSS              available  available  available at the zip code level
Case-control data  available  available  available

In what follows, we use Z−d, Z−l and Z−t to denote the risk factors that are in Z but not in Zd, Zl and Zt, respectively. Similarly, we use βind to denote the regression coefficients associated with Zind, where ind = d, l, t, −d, −l, −t.

3 Case-control study with complete risk factors

In this section we consider an estimating function for a simple yet generic setting with only one case group and one control group. We further assume that Z(·) is entirely observed for all available cases and controls. The estimating functions to be derived in this setting will serve as the starting point for the construction of estimating functions in the more complex situations with varying patterns of incomplete covariate data, see Section 4.

Let N and M be two spatial point processes that have generated the random spatial locations of cases and controls over a geographic region, W. We assume that the control process M is an inhomogeneous spatial Poisson process with intensity α(s)λ0(s), where α(s) is the probability for an individual at s to be included in the controls and λ0(s) represents a spatially varying population density. In general, there is some prior knowledge about the sampling design used to select the controls, so we assume that α(·) is known. However, we do not need any specific knowledge of λ0(·).

We assume that conditional on a non-negative (random) intensity measure Λ(s), the case process N is also an inhomogeneous spatial Poisson process. We further assume that

$$\lambda(s;\beta)\equiv E[\Lambda(s)]=\lambda_0(s)\exp[Z(s)'\beta] \tag{2}$$

for some unknown regression coefficients β. Intuitively speaking, exp[Z(s)′β] is the probability that an individual residing at s develops cancer. Hence, the parameter vector β provides a direct interpretation on cancer risk.

To estimate β, let N(ds) and M(ds) denote the number of cases and controls in an infinitesimal region ds containing s. It’s easy to see that

$$E[N(ds)]=\lambda_0(s)\exp[Z(s)'\beta]\,ds \quad\text{and}\quad E[M(ds)]=\alpha(s)\lambda_0(s)\,ds.$$

Hence, E[Δ(ds; β)] = 0, where

$$\Delta(ds;\beta)=N(ds)-\frac{\exp[Z(s)'\beta]}{\alpha(s)}M(ds). \tag{3}$$

Following well established theories on estimating equations (Crowder, 1986), a consistent estimator for β can be obtained by solving the following estimating equations:

$$\int_W h(s;\beta)\,\Delta(ds;\beta)=0_p, \tag{4}$$

where h(s; β) is a p × 1 real vector-valued function of β and 0p is a p × 1 vector of zeros. If we set

$$h(s;\beta)=Z(s)\frac{\alpha(s)}{\alpha(s)+\exp[Z(s)'\beta]}, \tag{5}$$

then (4), using sum-notation, can be written as

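$$U(\beta)\equiv\sum_{s\in(N\cap W)}Z(s)\frac{\alpha(s)}{\alpha(s)+\exp[Z(s)'\beta]}-\sum_{s\in(M\cap W)}Z(s)\frac{\exp[Z(s)'\beta]}{\alpha(s)+\exp[Z(s)'\beta]}=0_p. \tag{6}$$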

Note that U(β) in (6) is essentially the score function of the conditional likelihood proposed by Diggle and Rowlingson (1994). In the case where N and M are both Poisson processes, (5) is optimal in the sense of minimizing variance of parameter estimates (Rathbun, 2012). Other forms of h(·) can be used but we will consider only (5) here, as our proposed methods can be easily generalized to the new settings.
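As an illustration of how (6) might be solved in practice, the sketch below applies Newton-Raphson iterations to U(β). It is not the authors' implementation: the function names are hypothetical, the first column of each covariate matrix is assumed to be the constant one (as for Zd), and the toy simulator replaces the Poisson processes with independent Bernoulli thinning of a finite population.

```python
import numpy as np

def estimating_function(beta, Z_case, a_case, Z_ctrl, a_ctrl):
    """Evaluate U(beta) in (6); a_case and a_ctrl hold alpha(s) at each point."""
    e_case = np.exp(Z_case @ beta)
    e_ctrl = np.exp(Z_ctrl @ beta)
    w_case = a_case / (a_case + e_case)
    w_ctrl = e_ctrl / (a_ctrl + e_ctrl)
    return Z_case.T @ w_case - Z_ctrl.T @ w_ctrl

def solve_beta(Z_case, a_case, Z_ctrl, a_ctrl, n_iter=50, tol=1e-10):
    """Solve U(beta) = 0 by Newton-Raphson (assumes column 0 is the intercept)."""
    Z_all = np.vstack([Z_case, Z_ctrl])
    a_all = np.concatenate([a_case, a_ctrl])
    beta = np.zeros(Z_all.shape[1])
    beta[0] = np.log(a_all.mean())      # starting value: exp(Z'beta) close to alpha
    for _ in range(n_iter):
        U = estimating_function(beta, Z_case, a_case, Z_ctrl, a_ctrl)
        e_all = np.exp(Z_all @ beta)
        w = a_all * e_all / (a_all + e_all) ** 2
        J = -(Z_all * w[:, None]).T @ Z_all     # derivative of U with respect to beta
        step = np.linalg.solve(J, U)
        beta = beta - step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def simulate_case_control(rng, n_pop, beta, alpha):
    """Toy data: each individual is a case w.p. exp(Z'beta), a control w.p. alpha."""
    x = rng.normal(size=n_pop)
    Z = np.column_stack([np.ones(n_pop), x])
    p_case = np.exp(Z @ beta)           # must stay below 1 in this toy setup
    is_case = rng.random(n_pop) < p_case
    is_ctrl = rng.random(n_pop) < alpha
    return Z[is_case], Z[is_ctrl]
```

Because U(β) has the form of a logistic regression score with offset −log α(s), the Newton iterations are usually well behaved from the starting value used here.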

4 Data from multiple sources with potential selection bias and incomplete risk factors

Let N1 and M1 be two spatial point processes that generated the locations of cases and controls in our case-control study. Let N2 and M2 denote two other spatial point processes for the additional cases and controls in the CTR and BRFSS data. We assume that M1 and M2 are independent of each other and also of N1 and N2. We use α1(·) and α2(·) to denote the counterpart sampling probabilities to α(·) defined in Section 3 for M1 and M2.

The first problem that we need to handle is selection bias when N1 is not a simple random sample from Nc = N1 ∪ N2. In a typical case-control study, not all contacted cases are equally likely to participate. If the participation rates are correlated with any of the risk factors, estimates of the effects of those risk factors may be biased. The second problem is how to adapt the methodology in Section 3 to obtain unbiased estimating functions based on pairs of case and control datasets with potentially incomplete covariate data. Finally, we need to derive a method to combine efficiently the estimating functions obtained for the different pairs.

4.1 Handling selection bias

Suppose that N1 is a sample from Nc where the probability for a case to be included in N1 varies with the case’s own characteristics. Specifically, we assume that the probability π(s) of including a case s in N1 is of the form

$$\pi(s;\eta)=\frac{\exp[Y(s)'\eta]}{1+\exp[Y(s)'\eta]}, \tag{7}$$

where η is an unknown vector of parameters and Y(s) is a vector of covariates assumed to be known for all diseased subjects. For example, we may set Y(·) as a combination of the demographic variables and the traffic-related exposure variables. We perform a standard logistic regression analysis to estimate η. In what follows, let η̂ and η0 denote the resulting estimator and the true value of η, respectively.
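The logistic regression step for η can be sketched as follows. This is a minimal illustration with hypothetical function names, assuming that for every case in Nc we observe Y(s) and an indicator of whether the case belongs to N1; any standard logistic regression routine could be used instead.

```python
import numpy as np

def participation_prob(Y, eta):
    """pi(s; eta) in (7), evaluated row-wise for the covariate matrix Y."""
    return 1.0 / (1.0 + np.exp(-(Y @ eta)))

def fit_participation_model(Y, in_N1, n_iter=50, tol=1e-10):
    """Estimate eta by Newton-Raphson on the logistic log-likelihood.

    Y:     (n, q) covariates for all cases in N_c.
    in_N1: (n,) array of 0/1 indicators of membership in N_1.
    """
    eta = np.zeros(Y.shape[1])
    for _ in range(n_iter):
        p = participation_prob(Y, eta)
        score = Y.T @ (in_N1 - p)
        info = (Y * (p * (1.0 - p))[:, None]).T @ Y
        step = np.linalg.solve(info, score)
        eta = eta + step
        if np.max(np.abs(step)) < tol:
            break
    return eta
```

The fitted probabilities `participation_prob(Y, eta_hat)` then play the role of π(s; η̂) for cases in N1, and one minus these values the corresponding role for cases in N2.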

4.2 Deriving estimating functions for the pancreatic cancer data

We derive the estimating functions in our setting based on all possible pairs of case and control processes, i.e., (N1, M1), (N1, M2), (N2, M1) and (N2, M2). For the pair (N1, M1), Z(·) is fully observed. We may therefore modify U(β) defined in (6) as

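$$U_{11}(\beta,\hat\eta)\equiv\sum_{s\in(N_1\cap W)}Z(s)\frac{\alpha_1(s)}{\alpha_1(s)+\pi(s;\hat\eta)\exp[Z(s)'\beta]}-\sum_{s\in(M_1\cap W)}Z(s)\frac{\pi(s;\hat\eta)\exp[Z(s)'\beta]}{\alpha_1(s)+\pi(s;\hat\eta)\exp[Z(s)'\beta]}. \tag{8}$$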

Here α1(·) is used because E[M1(ds)] = α1(s)λ0(s)ds and the additional term π(·) is used because E[N1(ds)] = π(s; η0)E[Nc(ds)].

To derive the estimating functions for the remaining pairs, let Z̃t(·) and Z̃l(·) be some estimators for the traffic-related exposure variables in Zt(·) and the life-style variables in Zl(·), respectively. It is not necessary for either Z̃t(·) or Z̃l(·) to be consistent estimators, but we require their variances to be asymptotically negligible. For example, we may set Z̃t(s) = Zt[c(s)], where c(s) is the centroid of the zip code containing s, hence Z̃t(·) is non-random. However, Z̃t(·) still varies spatially and contains information about one’s exposure level. We will describe strategies to estimate Zl(·) in the simulation and real data analysis.

For the pair (N1, M2), note that the traffic-related exposure variables in Zt(·) are missing for the BRFSS data (i.e., M2). Let Z̃12(·) = [Zd(·)′, Zl(·)′, Z̃t(·)′]′ be an estimate for the complete set of risk factors Z(·). We consider

$$U_{12}(\beta,\hat\eta)=\sum_{s\in(N_1\cap W)}\tilde Z_{12}(s)\,\frac{\alpha_2(s)}{\alpha_2(s)+\pi(s;\hat\eta)\exp[\tilde Z_{12}(s)'\beta]}\,\frac{\exp[\tilde Z_t(s)'\beta_t]}{\exp[Z_t(s)'\beta_t]}-\sum_{s\in(M_2\cap W)}\tilde Z_{12}(s)\,\frac{\pi(s;\hat\eta)\exp[Z(s)'\beta]}{\alpha_2(s)+\pi(s;\hat\eta)\exp[\tilde Z_{12}(s)'\beta]}\,\frac{\exp[\tilde Z_t(s)'\beta_t]}{\exp[Z_t(s)'\beta_t]}. \tag{9}$$

Compared to U(β) defined in (6), Z̃12(·) is used instead of Z(·) due to the missing data on Zt(·) in M2. Similar to the derivation of U11, α2(·) is used because E[M2(ds)] = α2(s)λ0(s)ds and π(·) is used because E[N1(ds)] = π(s; η0)E[Nc(ds)]. By multiplying the additional term exp[Z̃t(s)′βt]/exp[Zt(s)′βt], we see that

$$\exp[Z(s)'\beta]\,\frac{\exp[\tilde Z_t(s)'\beta_t]}{\exp[Z_t(s)'\beta_t]}=\exp[\tilde Z_{12}(s)'\beta],$$

which can be calculated for the control subjects in M2 even though the traffic-related exposure variables in Zt are missing for these subjects.
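The role of the additional ratio term can be seen from a short expectation calculation. The display below is a sketch rather than a formal proof: it writes Z̃t for the surrogate traffic exposure and Z̃12 for the covariate vector with Zt replaced by that surrogate (this notation is an assumption of the sketch), treats Z̃t(·) as non-random, and replaces η̂ by its true value η0.

$$
\begin{aligned}
E\Big[\sum_{s\in(N_1\cap W)}\tilde Z_{12}(s)\,\frac{\alpha_2(s)}{\alpha_2(s)+\pi(s;\eta_0)e^{\tilde Z_{12}(s)'\beta}}\,\frac{e^{\tilde Z_t(s)'\beta_t}}{e^{Z_t(s)'\beta_t}}\Big]
&=\int_W \tilde Z_{12}(s)\,\frac{\alpha_2(s)\,\pi(s;\eta_0)\,\lambda_0(s)\,e^{Z(s)'\beta}}{\alpha_2(s)+\pi(s;\eta_0)e^{\tilde Z_{12}(s)'\beta}}\,\frac{e^{\tilde Z_t(s)'\beta_t}}{e^{Z_t(s)'\beta_t}}\,ds\\
&=\int_W \tilde Z_{12}(s)\,\frac{\alpha_2(s)\,\pi(s;\eta_0)\,\lambda_0(s)\,e^{\tilde Z_{12}(s)'\beta}}{\alpha_2(s)+\pi(s;\eta_0)e^{\tilde Z_{12}(s)'\beta}}\,ds,\\
E\Big[\sum_{s\in(M_2\cap W)}\tilde Z_{12}(s)\,\frac{\pi(s;\eta_0)\,e^{\tilde Z_{12}(s)'\beta}}{\alpha_2(s)+\pi(s;\eta_0)e^{\tilde Z_{12}(s)'\beta}}\Big]
&=\int_W \tilde Z_{12}(s)\,\frac{\pi(s;\eta_0)\,e^{\tilde Z_{12}(s)'\beta}\,\alpha_2(s)\,\lambda_0(s)}{\alpha_2(s)+\pi(s;\eta_0)e^{\tilde Z_{12}(s)'\beta}}\,ds.
\end{aligned}
$$

The two integrals coincide, so the difference of the two sums has mean zero at the true β whether or not Z̃t(·) estimates Zt(·) well.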

For the pair (N2, M1), note that the life-style variables in Zl(·) are missing for the cases that are in the CTR but not in the case-control study (i.e., N2). Let Z̃21(·) = [Zd(·)′, Z̃l(·)′, Zt(·)′]′ be an estimate for the complete set of risk factors in Z(·). We consider

$$U_{21}(\beta,\hat\eta)=\sum_{s\in(N_2\cap W)}\hat Z_{21}(s)\,\frac{\alpha_1(s)}{\alpha_1(s)+[1-\pi(s;\hat\eta)]\exp[\tilde Z_{21}(s)'\beta]}-\sum_{s\in(M_1\cap W)}\hat Z_{21}(s)\,\frac{[1-\pi(s;\hat\eta)]\exp[Z(s)'\beta]}{\alpha_1(s)+[1-\pi(s;\hat\eta)]\exp[\tilde Z_{21}(s)'\beta]}, \tag{10}$$

where Ẑ21(·) is either Z̃21(·) or a subset of it obtained by removing component(s) of Z̃l(·) that are highly correlated with other components of Z(·). Such a scenario may occur if any component of Z̃l(·) is a linear combination of components in Zd(·). In particular, if Z̃l(·) is constant, then it is linear in Zd(·) since the first element of Zd(·) is always one. Nevertheless, new information on Zd(·) and Zt(·) is still obtained by including the new data source N2. Compared to U(β) defined in (6), Z̃21(·) and Ẑ21(·) are used instead of Z(·) due to the missing data on Zl(·) in N2. Similar to the derivation of U11, α1(·) is used because E[M1(ds)] = α1(s)λ0(s)ds and [1 − π(·)] is used because E[N2(ds)] = [1 − π(s; η0)]E[Nc(ds)].

For the last pair (N2, M2), note that the traffic-related exposure variables in Zt(·) are missing for the BRFSS data (i.e., M2) and the life-style variables in Zl(·) are missing for the cases that are in the CTR but not in the case-control study (i.e., N2). Define Z̃22(·) = [Zd(·)′, Z̃l(·)′, Z̃t(·)′]′. In light of the previous derivations of U12(·) and U21(·), we consider

$$U_{22}(\beta,\hat\eta)=\sum_{s\in(N_2\cap W)}\hat Z_{22}(s)\,\frac{\alpha_2(s)}{\alpha_2(s)+[1-\pi(s;\hat\eta)]\exp[\tilde Z_{22}(s)'\beta]}\,\frac{\exp[\tilde Z_t(s)'\beta_t]}{\exp[Z_t(s)'\beta_t]}-\sum_{s\in(M_2\cap W)}\hat Z_{22}(s)\,\frac{[1-\pi(s;\hat\eta)]\exp[\tilde Z_{12}(s)'\beta]}{\alpha_2(s)+[1-\pi(s;\hat\eta)]\exp[\tilde Z_{22}(s)'\beta]}, \tag{11}$$

where, similar to Ẑ21(·) used in (10), Ẑ22(·) is either Z̃22(·) or a subset of it.

4.3 Combining estimating functions for the pancreatic cancer data

Define Uc(β, η̂) = [U11(β, η̂)′, U12(β, η̂)′, U21(β, η̂)′, U22(β, η̂)′]′. Note that Uc(·) is asymptotically unbiased for zero if the variances of Z̃t(·), Z̃l(·) and η̂ are all asymptotically negligible. Define

$$D(\beta)=E\left[\frac{\partial U_c(\beta,\hat\eta)}{\partial\beta'}\right]\quad\text{and}\quad V(\beta)=\mathrm{Var}[U_c(\beta,\hat\eta)].$$

In Web Appendix A, we derive consistent estimators for all components of D(β) under β = β0, where β0 denotes the true value of β. In Web Appendix B, we derive theoretical expressions for V(β), which involve the extra variability caused by η̂, as well as consistent estimators for all its components.

To combine the derived estimating functions, one strategy is to follow Heyde (1997) to consider

$$\tilde U(\beta,\hat\eta)=D(\beta)'V(\beta)^{-1}U_c(\beta,\hat\eta). \tag{12}$$

Solving Ũ(β, η̂) = 0 with respect to β corresponds to minimizing the generalized least squares criterion Uc(β, η̂)′V(β)−1Uc(β, η̂), except that we replace the derivative ∂Uc(β, η̂)/∂β′ with its expectation. Hence (12) is closely related to the generalized method of moments (Hansen, 1982). However, for this approach, V(·)−1 may be difficult to evaluate because 1) the estimating functions in Uc(β, η̂) tend to be highly correlated with each other and as a result V(·)−1 may be unstable and 2) the dimension of V(·) can be high depending on the numbers of risk factors and pairs of processes to be considered. To mitigate this problem, we instead combine these estimating functions sequentially in three steps.

  • Step 1
    Define U0(β, η̂) = [U11(β, η̂)′, U12(β, η̂)′]′. We follow (12) to consider
    $$\tilde U_1(\beta,\hat\eta)=D_0(\beta)'V_0(\beta)^{-1}U_0(\beta,\hat\eta),$$
    where D0(·) and V0(·) are sub-matrices of D(·) and V(·) that are associated with U11(·) and U12(·).
  • Step 2
    Define U1(β, η̂) = [Ũ1(β, η̂)′, U21(β, η̂)′]′. We follow (12) again to consider
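    $$\tilde U_2(\beta,\hat\eta)=D_1(\beta)'V_1(\beta)^{-1}U_1(\beta,\hat\eta),$$
    where
    $$D_1(\beta)=E\left[\frac{\partial U_1(\beta,\hat\eta)}{\partial\beta'}\right]\quad\text{and}\quad V_1(\beta)=\mathrm{Var}[U_1(\beta,\hat\eta)].$$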

    In Web Appendix C, we show that D1(·) and V1(·) can be written in terms of the submatrices of D(·) and V(·) that are associated with U11(·), U12(·) and U21(·).

  • Step 3

    Define U2(β, η̂) = [Ũ2(β, η̂)′, U22(β, η̂)′]′. Similar to Step 2, we now follow (12) to derive Ũ3(·); see Web Appendix C for details. Solve Ũ3(β, η̂) = 0p to estimate β.

The covariance matrices involved in all the above steps are at most 2p × 2p in dimension so it’s not difficult to calculate their inverses. Moreover, the proposed procedure can be easily modified in order to include any additional estimating functions in the analysis. Nevertheless, to apply the method we would need to decide the order in which the estimating functions are sequentially combined. The order that we considered above is natural because the level of completeness in terms of the covariates drops in the order of U11(·), U12(·), U21(·) and U22(·). Our own simulation experience also suggests that the actual order has only limited impact on the performance of the resulting estimators.
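The three steps above can be sketched numerically as follows. This is an illustration under a simplifying assumption, not the derivation in Web Appendix C: at a fixed β the weight matrix from each step is treated as a constant, so that the sensitivity and covariance of each intermediate combination follow from D(·) and V(·) by linear algebra. The function names are hypothetical.

```python
import numpy as np

def combine(D, V, U):
    """Return D' V^{-1} U, the combined estimating function in the form of (12)."""
    return D.T @ np.linalg.solve(V, U)

def sequential_weights(D, V, p):
    """Sequentially combine K stacked p-dimensional estimating functions.

    D: (K*p, p) stacked sensitivity matrices of U_c.
    V: (K*p, K*p) covariance matrix of U_c.
    Returns A of shape (p, K*p) such that the final combination is A @ U_c.
    Only 2p x 2p linear systems are solved along the way.
    """
    K = D.shape[0] // p
    eye = np.eye(K * p)
    # Row selectors picking out each p-dimensional block of U_c.
    selectors = [eye[k * p:(k + 1) * p] for k in range(K)]
    A = selectors[0]
    for k in range(1, K):
        S = np.vstack([A, selectors[k]])       # current combination stacked with next block
        D_S = S @ D                            # (2p, p) sensitivity of the stacked pair
        V_S = S @ V @ S.T                      # (2p, 2p) covariance of the stacked pair
        A = D_S.T @ np.linalg.solve(V_S, S)    # weights in the form of (12) for the pair
    return A
```

With K = 4 blocks ordered as U11, U12, U21, U22, the loop performs Steps 1 to 3 in turn. When the blocks of Uc are mutually uncorrelated, the sequential weights coincide with the one-shot weights D′V−1, which provides a simple check of the implementation.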

We finally note that instead of plugging in an estimate for η, one could alternatively follow Zhou and Kim (2012) and include the logistic regression score for estimating η in Uc as yet another estimating function so that β and η can be estimated simultaneously. However, this would add to the already considerable computational complexity and we prefer to stick with the simpler plug-in approach.

4.4 Model diagnostics

Our model diagnostics consist of two tasks. The first is to evaluate the validity of the assumed intensity model (2), and the second is to assess whether Nc is Poisson.

4.4.1 Evaluation of the intensity model

We develop residual diagnostic tools based on the complete CTR data (i.e., Nc) and the BRFSS data (i.e., M2) to check the validity of the assumed intensity model (2). Let X(·) be a statistic derived from Zd(·). For example, X(·) can be age, which is included in Zd(·), or age within a given sex-by-race category. Define the cumulative residuals

$$Q(x)=\sum_{s\in(N_c\cap W)}\frac{\exp[\hat{Z}_t(s)'\hat\beta_t]}{\exp[Z_t(s)'\hat\beta_t]}\,I[X(s)\le x]-\sum_{s\in(M_2\cap W)}\frac{\exp[\hat{Z}_t(s)'\hat\beta_t+Z_{-t}(s)'\hat\beta_{-t}]}{\alpha_2(s)}\,I[X(s)\le x],$$

where β̂t and β̂−t are the estimates of βt and β−t from our proposed sequential approach, respectively. We plot Q(x) against x to assess the overall fit. If our proposed model fits the data well, then Q(x) should be close to zero for all x. For inference, we use the bootstrap to construct confidence bands for Q(x). To do so, we keep the memberships of the case subjects (i.e., whether they belong to N1 or N2) unchanged from the original data.
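As a rough illustration of how Q(x) could be evaluated, the sketch below reads the definition as a reweighted sum over cases minus an inverse-probability-weighted sum over BRFSS controls. It assumes a single (scalar) traffic variable; the record layout (keys `x`, `zt`, `zt_hat`, `z_rest`, `alpha2`) and the function name are hypothetical, and `z_rest` stands for all non-traffic covariates including the intercept.

```python
import math

def cumulative_residual(x, cases, controls, beta_t, beta_rest):
    """Evaluate Q(x) for one threshold x.

    cases:    records with 'x' (statistic X), 'zt' (exact traffic exposure)
              and 'zt_hat' (coarsened traffic exposure).
    controls: records with 'x', 'zt_hat', 'z_rest' (other covariates)
              and 'alpha2' (sampling probability).
    """
    # Cases: swap the exact traffic term for the coarsened one.
    case_sum = sum(
        math.exp((c["zt_hat"] - c["zt"]) * beta_t)
        for c in cases
        if c["x"] <= x
    )
    # Controls: fitted relative risk, weighted by inverse sampling probability.
    control_sum = sum(
        math.exp(
            m["zt_hat"] * beta_t
            + sum(z * b for z, b in zip(m["z_rest"], beta_rest))
        ) / m["alpha2"]
        for m in controls
        if m["x"] <= x
    )
    return case_sum - control_sum

# Toy records (hypothetical values), with X taken to be age.
cases = [{"x": 50, "zt": 0.3, "zt_hat": 0.3}, {"x": 70, "zt": 1.0, "zt_hat": 1.0}]
controls = [{"x": 60, "zt_hat": 0.0, "z_rest": [1.0], "alpha2": 0.5}]
curve = [cumulative_residual(x, cases, controls, 0.2, [math.log(0.5)])
         for x in (55, 65, 80)]
```

Plotting `curve` against the thresholds gives the cumulative residual plot; the bootstrap bands would be obtained by repeating the calculation on resampled data.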

Let Wl : l = 1, …, L be a partition of W. In our analysis, Wl’s are the complete set of zip code regions in Connecticut. We also consider the residuals

$$R_l=\sum_{s\in(N_c\cap W_l)}\frac{\exp[\hat{Z}_t(s)'\hat\beta_t]}{\exp[Z_t(s)'\hat\beta_t]}-\sum_{s\in(M_2\cap W_l)}\frac{\exp[\hat{Z}_t(s)'\hat\beta_t+Z_{-t}(s)'\hat\beta_{-t}]}{\alpha_2(s)}.$$

We plot Rl against components of Ẑt,l to assess whether the assumed functional forms of the traffic-related exposure variables Zt(·) are appropriate, where Ẑt,l are the traffic-related exposure variables derived at the centroid of the lth zip code. If the proposed model fits the data well, then Rl should vary randomly around zero.

4.4.2 Detecting non-Poisson behavior

To detect non-Poisson behavior, we follow Diggle et al. (2007) to consider second-order statistics related to the K-function. Specifically, for any positive number r, define

$$K_1(r)=\mathop{\sum\sum}^{\neq}_{s,u\in(N_c\cap W)}\frac{I(\|s-u\|\le r)}{\rho(s;\hat\beta)\,\rho(u;\hat\beta)},$$

where ≠ signifies summation over distinct events and

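$$\rho(s;\hat\beta)=\begin{cases}\exp[Z(s)'\hat\beta] & \text{if } s\in N_1;\\ \exp[Z_{-l}(s)'\hat\beta_{-l}+\hat{Z}_l(s)'\hat\beta_l] & \text{if } s\in N_2.\end{cases}$$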

Under the Poisson assumption, the second-order intensity function satisfies λ2(s, u) = λ(s; β)λ(u; β). Then by a direct application of Campbell’s Theorem (Møller and Waagepetersen, 2004) and ignoring some higher-order terms, we can show that for s ≠ u,

$$E\{N_c(ds)\,N_c(du)/[\rho(s;\hat\beta)\,\rho(u;\hat\beta)]\}\approx\lambda_0(s)\,\lambda_0(u)\,\psi(s,u;\hat\beta,\hat\eta)\,ds\,du, \qquad (13)$$

where

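$$\psi(s,u;\hat\beta,\hat\eta)=\pi(s;\hat\eta)\pi(u;\hat\eta)+2\pi(s;\hat\eta)[1-\pi(u;\hat\eta)]\exp\{[Z_l(s)-\hat{Z}_l(s)]'\hat\beta_l\}+[1-\pi(s;\hat\eta)][1-\pi(u;\hat\eta)]\exp\{[Z_l(s)-\hat{Z}_l(s)]'\hat\beta_l+[Z_l(u)-\hat{Z}_l(u)]'\hat\beta_l\}.$$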

In terms of the control process M1, we may then define

$$K_2(r)=\mathop{\sum\sum}^{\neq}_{s,u\in(M_1\cap W)}\frac{I(\|s-u\|\le r)\,\psi(s,u;\hat\beta,\hat\eta)}{\alpha_1(s)\,\alpha_1(u)}.$$

Define T(r) = K1(r)/K2(r). By (13), T(r) should be close to one for a given r > 0 under the Poisson assumption. We again use bootstrap to construct confidence bands for T(r). As before, we keep the memberships of the case subjects unchanged, but we update the sampling probabilities α1(·) based on the resampled control data.
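Both K1(r) and K2(r) are weighted sums over ordered pairs of distinct events within distance r, so a single helper can serve for both. The sketch below is a plain-Python illustration with hypothetical names (`pair_sum`, `rho`, `alpha1`, `psi`); it uses a brute-force double loop, which is adequate for a few thousand events, and it takes ψ ≡ 1 in the toy example, which corresponds to the special case where every case has complete covariates.

```python
import math

def pair_sum(points, r, weight):
    """Sum weight(i, j) over ordered pairs i != j with distance at most r."""
    total = 0.0
    for i, s in enumerate(points):
        for j, u in enumerate(points):
            if i != j and math.dist(s, u) <= r:
                total += weight(i, j)
    return total

# Hypothetical toy data: three case locations and three control locations.
case_xy = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0)]
rho = [1.0, 2.0, 4.0]                      # fitted rho(s; beta_hat) for each case
ctrl_xy = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0)]
alpha1 = [0.5, 0.5, 0.5]                   # sampling probabilities for controls

def psi(i, j):
    return 1.0                             # complete-covariate special case

r = 1.5
K1 = pair_sum(case_xy, r, lambda i, j: 1.0 / (rho[i] * rho[j]))
K2 = pair_sum(ctrl_xy, r, lambda i, j: psi(i, j) / (alpha1[i] * alpha1[j]))
T = K1 / K2
```

In practice T(r) would be computed over a grid of r values and compared with bootstrap bands, as described above.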

5 Simulation Study

5.1 Simulation design

We conduct a simulation study to assess the performance of our proposed estimation method. Specifically, we generate realizations of cases and controls from inhomogeneous spatial Poisson processes over a 1 × 1 square region W. The intensity functions for the control processes are set to be αi(s)λ0(s) for i = 1, 2, where α1(s) = 0.0007, α2(s) = 0.0045 and λ0(s) = 1,000,000 for all s ∈ W. The expected numbers of control events are therefore equal to 700 for M1 and 4500 for M2. The constant probabilities α1(·) and α2(·) imply that all subjects in the population have the same probability of being selected in a control process.

The intensity function for the overall case process is λ(s; β) = λ0(s) exp[Z(s)′β], where β = (β0, βd, βl, βt)′ = (−6.5655, 1, 1, 1)′, Z(s) = [1, Zd(s), Zl(s), Zt(s)], and Zd(·), Zl(·) and Zt(·) denote demographic, life-style and traffic exposure variables, respectively. We generate Z(·) over a 100 × 100 pixel grid and assume that its value is constant over each pixel. Specifically,

  • Zd(s)’s are independent and identically distributed (i.i.d.) normal random variables with mean 0 and standard deviation 0.5;

  • Zl(s) = 0.2 * Zd(s) + 0.2 * ε(s) + e(s), where ε(s)’s are a realization of a zero-mean and unit-variance Gaussian random field with an exponential covariance function exp(−20r) with r being the lag distance, and e(s)’s are i.i.d. normal random variables with mean 0 and standard deviation 0.4.

  • Zt(s)’s are a realization of a zero-mean and unit-variance Gaussian random field with an exponential covariance function exp(−10r).

The above construction yields a correlation of approximately 0.2 between the life-style variable Zl(s) and the demographic variable Zd(s). We assign a case to be in N1 randomly with the probability given by (7), with Y(s) = [1, Zd(s)]′ and η = [−1.75, 0.5]′. The remaining cases are included in N2. Based on the simulation setup, the expected number of cases is 400 in N1 and 1900 in N2. Note that the expected numbers of cases and controls are similar to their counterparts in the real data; see Section 2. We assume that Zd(·) is observed for all cases and controls but Zl(·) and Zt(·) are missing for N2 and M2, respectively. This setup mimics the one described in Section 2.6.
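The stated correlation between Zl(s) and Zd(s) can be checked directly from the construction. The short calculation below assumes that Zd(s), ε(s) and e(s) are mutually independent, which the setup implies but does not state explicitly; the variable names are ours.

```python
import math

sd_d = 0.5       # standard deviation of Z_d
var_eps = 1.0    # variance of the Gaussian random field epsilon
sd_e = 0.4       # standard deviation of the noise term e

# Z_l = 0.2 * Z_d + 0.2 * epsilon + e
cov_ld = 0.2 * sd_d ** 2
var_l = 0.2 ** 2 * sd_d ** 2 + 0.2 ** 2 * var_eps + sd_e ** 2
corr = cov_ld / (sd_d * math.sqrt(var_l))
print(round(corr, 3))   # 0.218
```

The value is slightly above 0.2, consistent with the rounded figure quoted in the text.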

To implement our proposed method, it is necessary to obtain Ẑl(·) and Ẑt(·). For the former, we first fit the regression model E[Zl(s)] = θ0 + θ1Zd(s) based on the data given in M1, and then set the predicted value given Zd(·) as Ẑl(·). For the latter, we divide W into 10 × 10 non-overlapping equal square subblocks and define Ẑt(s) as the average of Zt(·) for the subblock that contains s. Since Ẑl(·) is linear in Zd(·), it will not be included in Ẑ21(·) and Ẑ22(·) that are used to form the estimating functions defined in (10) and (11).
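The coarsening of the traffic covariate amounts to replacing each pixel value by the mean of its subblock. A minimal plain-Python sketch is given below; the function name `block_average` is hypothetical, and the 4 × 4 grid with 2 × 2 subblocks merely stands in for the 100 × 100 pixel grid with subblocks of 10 × 10 pixels used in the simulation.

```python
def block_average(grid, block):
    """Replace every pixel by the mean of the square subblock containing it."""
    n = len(grid)
    out = [[0.0] * n for _ in range(n)]
    for bi in range(0, n, block):
        for bj in range(0, n, block):
            cells = [grid[i][j]
                     for i in range(bi, bi + block)
                     for j in range(bj, bj + block)]
            mean = sum(cells) / len(cells)
            for i in range(bi, bi + block):
                for j in range(bj, bj + block):
                    out[i][j] = mean
    return out

# Toy stand-in for Z_t on a pixel grid (hypothetical values).
z_t = [[1, 3, 0, 0],
       [5, 7, 0, 4],
       [2, 2, 8, 8],
       [2, 2, 8, 8]]
z_t_hat = block_average(z_t, 2)
```

For the simulation design itself one would call `block_average(z_t, 10)` on the 100 × 100 grid.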

5.2 Simulation results

We generate 1000 realizations of {N1,N2,M1, M2} and apply our proposed method to estimate β as described in Section 4.3. For comparison, we also conduct the estimation by combining different subsets of these pairs (see Table 2 and the discussion below for details). For the analysis based on (N1, M1) alone, we also run a logistic regression analysis without adjusting for π(·), since this is the commonly adopted approach in practice. For each estimator, we calculate its empirical mean and standard error (values in round brackets), and estimate the standard error (values in square brackets) using a standard sandwich estimator.

Table 2.

Simulation results based on 1,000 simulations. Each entry gives the empirical mean, followed by the empirical standard error in parentheses; for the last five methods, the estimated sandwich standard error follows in square brackets.

| Method | β0 | βd | βt | βl |
|---|---|---|---|---|
| Prentice & Sheppard | -6.5792 (.0776) | 1.0377 (.1840) | 1.0073 (.1683) | 1.0296 (.1996) |
| Diggle et al. | -8.4938 (.1009) | 1.4567 (.1996) | .9996 (.1724) | 1.0296 (.2013) |
| (N1, M1) Logistic | -.6656 (.1021) | 1.4282 (.1743) | 1.0107 (.1781) | 1.0040 (.1844) |
| (N1, M1) Adjusted for η | -6.5729 (.0773) [.0774] | 1.0099 (.1450) [.1476] | 1.0110 (.1781) [.1787] | 1.0041 (.1844) [.1885] |
| (N1, M1) + (N2, M1) | -6.5723 (.0679) [.0679] | 1.0062 (.1257) [.1270] | 1.0092 (.1310) [.1330] | 1.0086 (.1854) [.1876] |
| (N1, M1) + (N1, M2) | -6.5763 (.0548) [.0536] | 1.0027 (.0752) [.0755] | 1.0061 (.1381) [.1341] | 1.0051 (.1309) [.1285] |
| (N1, M1) + (N2, M2) | -6.5731 (.0520) [.0521] | 1.0014 (.0698) [.0700] | 1.0003 (.0719) [.0713] | .9978 (.1771) [.1787] |
| (N1, M1) + (N1, M2) + (N2, M1) + (N2, M2) | -6.5721 (.0440) [.0428] | 1.0028 (.0640) [.0633] | .9964 (.0711) [.0695] | .9988 (.1308) [.1261] |

We also apply the methods proposed by Prentice and Sheppard (1995) and Diggle et al. (2010). For both methods, we first divide W into 100 equal subsquare regions, Wk : k = 1, …, 100. For Prentice and Sheppard (1995), the total count of events Nc = N1 ∪ N2 in each subsquare is known in addition to the control data M1, but no information on Z(·) is available for any case event in Nc. The unweighted version of the estimator is used due to its ease of implementation as well as its good performance (Prentice and Sheppard, 1995). For Diggle et al. (2010), we assume that the summary measures ∫_{Wk} λ0(s)Z(s)ds are also available, for k = 1, …, 100. These summary measures are then combined with the case events in N1 to estimate β. The method of Diggle et al. (2010) requires that the complete set of Z(·) be known for the cases. As a result, N2 cannot be used.

The simulation results are shown in Table 2. For both the conventional logistic regression analysis approach and Diggle et al.’s approach, we can see that β̂d is severely biased. This is because the probability of selecting N1 from Nc, which depends on Zd(·), is not adjusted by these approaches. All the remaining estimators are approximately unbiased, and our proposed estimator has the smallest standard error for all regression coefficients. Moreover, the standard error estimates obtained from the sandwich method are reasonably close to their empirical counterparts.

To understand how the inclusion of additional data may affect the accuracy of the resulting estimators, we supplement (N1, M1) in the estimation with only one of the remaining three pairs at a time, i.e., (N1, M2), (N2, M1) or (N2, M2). In all these situations, the standard error of β̂d is substantially reduced, and the largest reduction occurs when (N2, M2) is included. This is expected because all these new pairs contain additional information on Zd(·) and the most information is provided by (N2, M2). Some further findings and possible explanations are: when (N1, M2) is included, the standard errors of β̂l and β̂t are both reduced due to the new individual-level and coarsened information in M2 on Zl(·) and Ẑt(·), respectively; when (N2, M1) is included, the additional information on Zt(·) from N2 leads to a better estimator for βt; when (N2, M2) is included, information on Ẑt(·) that is available in both N2 and M2 helps improve the estimation of βt. For the latter two scenarios, Ẑl(·) is needed in order to set up the necessary estimating functions. However, because Ẑl(·) is linear in Zd(·), it does not provide any additional information on Zl(·). As a result, the standard errors of β̂l are similar to those based on (N1, M1) alone.

We have conducted additional simulations with different sample sizes for both cases and controls. The primary findings described above persisted. Furthermore, when including all four pairs, we have assessed the effects of doing so in different sequential orders. The results are nearly identical so we omit a detailed presentation. We have also considered solving Ũ (β, η̂) = 0p directly, where Ũ (·) is given in (12). The resulting estimator can be much more variable than that obtained from our proposed sequential approach. Moreover, the estimated standard errors based on the sandwich method tend to underestimate the true standard errors.

6 Data Analysis

6.1 Definition of risk factors

We define the demographic risk factors Zd(·) = [1, age, age², sex, race]′, where sex = 1 for male and 0 for female, and race = 1 for white and 0 for others. For the life-style variables, we define Zl(·) = [smoking, education]′, where smoking = 1 if one ever smoked and 0 otherwise, and education = 1 if the subject received some college or above education and 0 otherwise.

To estimate Zl(·), we employ two different strategies. For smoking, we run a logistic regression analysis based on Mc = M1 ∪ M2, where M1 and M2 are the controls in the case-control study and the BRFSS, respectively. The response variable is a subject’s smoking status and the predictors are the demographic variables Zd(·). We then define the estimate of smoking status as the fitted value given Zd(·). For education, we again run a logistic regression analysis based on Mc. The response variable is the education status, but both Zd(·) and edu(·) are used as predictors, where edu(·) is the zip-code-level percentage of the population aged 25 and up that had received some college or above education, taken from the 2000 US Census.

The traffic-related exposure variable is derived as described in Section 2.4, with the subsegment length Δsij = 50 meters and the radius of the circular buffer zone D = 2000 meters. A power transformation of 0.25 is used to reduce the skewness. We have also considered D = 1000 and 3000 meters but obtained similar results. Since only one traffic-related exposure variable is considered, we treat Zt as a scalar in what follows. As described in Section 4, Ẑt(·) is defined as the exposure derived at the centroid of the zip code region in which the subject resided. Both age and traffic exposure have been standardized in the subsequent analysis. We define Y(·) = [Zd(·), Zt(·)]′ when estimating the selection probability π(·).

6.2 Derivation of sampling probabilities

Since the controls in the case-control study (i.e., M1) were selected to frequency match the age and sex distributions of the case process N1, we derive the sampling probability α1(s) given the subject’s age and sex information. To do so, we first obtain the age-by-sex distribution based on the Census for the following ten age groups: 35-40, 41-45, 46-50, 51-55, 56-60, 61-65, 66-70, 71-75, 76-80 and 81-83. Let S(s) denote the age-by-sex stratum that the subject at s is in. Then, we define α1(s) as

$$\alpha_1(s)=\frac{\text{number of subjects from } M_1 \text{ in } S(s)}{\text{total number of subjects in the population in } S(s)}.$$

Similarly, we define

$$\alpha_2(s)=\frac{\text{number of subjects from } M_2 \text{ in } S(s)}{\text{total number of subjects in the population in } S(s)}.$$
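The stratum-based sampling probabilities are simple ratios of counts. The sketch below shows the calculation for a hypothetical pair of strata; the function name, the stratum labels and all counts are invented for illustration and are not taken from the Census or from the study data.

```python
from collections import Counter

def sampling_probabilities(control_strata, population_counts):
    """Return, per stratum, (# sampled controls) / (population size)."""
    sampled = Counter(control_strata)
    return {stratum: sampled[stratum] / size
            for stratum, size in population_counts.items()}

# Hypothetical age-by-sex strata for four controls and the population.
controls = [("61-65", "M"), ("61-65", "M"), ("61-65", "M"), ("61-65", "F")]
population = {("61-65", "M"): 1000, ("61-65", "F"): 2000}
alpha = sampling_probabilities(controls, population)
```

Each control at location s is then assigned the value for its own stratum S(s).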

It may be overly simplistic to derive the sampling probabilities based on the age-by-sex distribution alone, since other factors such as the number of residential telephone lines in the respondent’s household and whether the telephone number(s) are in directory listings will also affect one’s probability to be selected in a study. Further, the likelihood of participation of selected potential controls also generally depends upon socioeconomic and other factors that are typically difficult to account for in case-control analyses since they are not measured well (or may be unknown). The BRFSS data do assign differential weights to the sampled subjects by taking some of the various factors into consideration. Let w(s) denote the weight assigned to an individual at s. The weights can be viewed as the inverse of the sampling probabilities. We therefore define

$$\tilde\alpha_2(s)=\frac{1}{w(s)}.$$

Although α̃2(·) can better describe the sampling probabilities than α2(·), it is not available for the cases since information such as the number of residential telephone lines is typically not recorded for cases. Moreover, the selection probability π(·) depends on the traffic variable Zt(·) and therefore cannot be calculated for the controls in the BRFSS data. As a result, the estimating functions U12(·) and U22(·) given in (9) and (11) cannot be calculated. For U12(·), we modify it as

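$$\tilde{U}_{12}(\beta,\hat\eta)=\sum_{s\in(N_1\cap W)}\hat{Z}(s)\,\frac{\alpha_2(s)}{\alpha_2(s)+\tilde\pi(s)\exp[\hat{Z}(s)'\beta]}\cdot\frac{\exp[\hat{Z}_t(s)\beta_t]\,\tilde\pi(s)}{\exp[Z_t(s)\beta_t]\,\pi(s;\hat\eta)}-\sum_{s\in(M_2\cap W)}\hat{Z}(s)\,\frac{\tilde\pi(s)\exp[\hat{Z}(s)'\beta]}{\alpha_2(s)+\tilde\pi(s)\exp[\hat{Z}(s)'\beta]}\cdot\frac{\alpha_2(s)}{\tilde\alpha_2(s)}, \qquad (14)$$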

where π̃(·) is an alternative estimate of π(·) obtained by defining Y(·) as the demographic variables Zd(·) alone. A similar modification can be made straightforwardly for U22(·), and we denote the resulting new estimating function by Ũ22(·).

6.3 Results

We first estimate η based on equation (7), which yields

$$\hat\eta=[1.8484,\ 0.3380,\ 0.2729,\ 0.2003,\ 0.3182,\ 0.1316].$$

The result suggests that younger, male, white patients were more likely to participate in the case-control study and that cases included in the case-control study were less exposed to traffic than those in the CTR but not in the case-control study. The demographic variables in Zd(·) are in fact also correlated with the life-style variables in Zl(·). Given these observations, biased estimates for βt and βl could be obtained if we do not adjust for η.

We assume the intensity model (2) given in Section 3, with the covariate vector Z(·) therein as being defined in Section 6.1. Our main purpose is to estimate the regression parameters β by combining the data from the following different sources: the case-control data (i.e., N1 and M1), the CTR data excluding the cases included in the case-control study (i.e., N2), and the BRFSS data (i.e., M2). To apply our proposed method, we combine the available pairs sequentially in the order (N1, M1), (N1, M2), (N2, M1) and (N2, M2). For comparison, we also conduct the estimation based on different subsets of these pairs (see Table 3 for details). For the analysis based on (N1, M1) alone, we also run a logistic regression analysis to estimate β without adjusting for η.

Table 3.

Data analysis results. Each entry gives the parameter estimate, followed by the estimated bootstrap standard error in parentheses.

| Method | Intercept | Age | Age² | Sex | Race | Traffic | Smoke | Education |
|---|---|---|---|---|---|---|---|---|
| (N1, M1) Logistic | -.249 (.332) | .175 (.136) | .047 (.080) | .017 (.144) | -.287 (.268) | .064 (.074) | .456 (.150) | -.641 (.150) |
| (N1, M1) Adjusted for η | -5.557 (.220) | .911 (.057) | -.057 (.036) | .319 (.062) | -.446 (.205) | .081 (.054) | .431 (.158) | -.665 (.153) |
| (N1, M1) + (N2, M1) | -5.416 (.214) | .779 (.046) | -.129 (.033) | .277 (.061) | -.551 (.194) | .157 (.050) | .435 (.162) | -.689 (.164) |
| (N1, M1) + (N1, M2) | -6.079 (.215) | .799 (.054) | -.136 (.059) | .288 (.097) | -.120 (.191) | .127 (.065) | .495 (.154) | -.354 (.136) |
| (N1, M1) + (N2, M2) | -6.013 (.199) | .747 (.038) | -.181 (.025) | .285 (.054) | -.009 (.114) | .233 (.051) | .424 (.159) | -.518 (.124) |
| (N1, M1) + (N1, M2) + (N2, M1) + (N2, M2) | -6.166 (.134) | .778 (.034) | -.164 (.021) | .251 (.048) | -.089 (.099) | .217 (.042) | .508 (.144) | -.235 (.132) |

Figure 2 shows plots of cumulative residuals Q(x) versus age in each sex-by-race group, and Figure 3 plots zip-code residuals Rl versus zip-code level traffic, where Q(x) and Rl are as defined in Section 4.4. The confidence bands of Q(x) include the zero line inside for all x, suggesting a satisfactory overall fit. The plot of the zip-code residuals is centered around zero and also shows no systematic pattern, and hence the functional form of Zt(·) appears to be appropriate.

Figure 2.

Plots of cumulative residuals versus age in each sex-by-race group. In each plot, the solid line is Q(x) (y-axis) versus age (x-axis), the dashed lines are the confidence band, the dotted line is the constant line of zero.

Figure 3.

Plot of zip-code residuals versus zip-code level traffic.

Figure 4 shows the plot of T(r). The values of T(r) are approximately constant and are reasonably close to one for all r. Moreover, the confidence band contains one throughout the plotted range, which indicates that the assumption of Poisson process is acceptable.

Figure 4.

Test for Poisson point process. The solid line is T(r) of data, the dashed lines are the confidence band, and the dot-dashed line is the constant line of one.

For variance estimation, we need to incorporate the complex sampling design used to produce the BRFSS data. For our data, a disproportionate stratified sampling design was used involving four geographic region-by-household density strata. Each BRFSS weight is calculated as the product of three components: the stratum weight, which accounts for differences in the basic probability of selection among strata, a raw weighting factor, which adjusts for variations in the numbers of residential telephone numbers and adults in the respondent’s household, and a post-stratification weight, which adjusts for noncoverage and nonresponse based on the age-by-sex distribution in the population at the county level. We follow Lahiri (2003) to use a bootstrap procedure to estimate the variance. Specifically, we first sample with replacement the same number of subjects within each stratum. By doing so, both the stratum weight and the raw weighting factor associated with each newly selected sample are unchanged. We then adjust the post-stratification weights by comparing the age-by-sex distributions in the population and in the resampled data. We resample the remaining data, i.e., N1, N2 and M1, as described in Section 4.4. For each resampled dataset, we apply our proposed model estimation procedures to estimate β.
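The first stage of this bootstrap, resampling with replacement within each stratum while keeping the stratum sizes fixed, can be sketched as follows. The record layout and function name are hypothetical, and the subsequent adjustment of the post-stratification weights is deliberately not implemented here.

```python
import random
from collections import defaultdict

def stratified_resample(records, stratum_of, rng):
    """Resample records with replacement, separately within each stratum."""
    by_stratum = defaultdict(list)
    for rec in records:
        by_stratum[stratum_of(rec)].append(rec)
    sample = []
    for members in by_stratum.values():
        # Draw as many subjects as the stratum originally contained.
        sample.extend(rng.choices(members, k=len(members)))
    return sample

# Hypothetical records: (stratum label, subject id).
records = [("A", i) for i in range(5)] + [("B", i) for i in range(3)]
boot = stratified_resample(records, lambda rec: rec[0], random.Random(0))
```

Because every resampled subject stays in its original stratum, any stratum-level weight attached to the record carries over unchanged.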

The estimation results are shown in Table 3. Values in parentheses are the estimated standard errors calculated by the bootstrap method. The results from our proposed sequential approach suggest that exposure to traffic is a significant risk factor for pancreatic cancer. After adjusting for the effect of age, sex, race, smoking and education, a unit change in the standardized traffic exposure variable increases the risk by a factor of 1.2427 (90% confidence interval 1.1594 to 1.3312). A significant relationship is also detected in all other analyses except the two based on (N1, M1) alone. These observations demonstrate the benefits of including the additional CTR and BRFSS data.

Our analysis also revealed significant effects of smoking and education. Specifically, after adjusting for all the other factors, the pancreatic cancer risk is multiplied by a factor of 1.6622 (90% confidence interval 1.3114 to 2.1062) if one ever smoked and by a factor of 0.7905 (90% confidence interval 0.6363 to 0.9823) if one has received college or above education. The finding on smoking is consistent with other findings in the literature (Risch et al., 2010). The decrease of pancreatic cancer risk with education may be due to unobserved socioeconomic factors that are confounded with education.

The results also suggest that both age and sex are significantly related to the risk of developing pancreatic cancer. After controlling for other risk factors, males have an increased risk versus females by a factor of 1.2850 (90% confidence interval 1.1877 to 1.3909). The risk increases with age, but the interpretation is less straightforward due to the term for age squared. However, neither age nor sex is significant from the conventional logistic regression analysis based on (N1, M1) alone. This is because the controls were frequency matched to the cases in the case-control study by age and sex.
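The multiplicative factors and 90% confidence intervals quoted above can be approximately reproduced from the last row of Table 3 by exponentiating the estimate and its normal-approximation limits. That the intervals were formed this way is our assumption; because the table entries are rounded to three decimals, the reproduced values agree with the text only to about three decimal places.

```python
import math
from statistics import NormalDist

def rate_ratio_ci(estimate, se, level=0.90):
    """Return exp(estimate) and a Wald interval on the multiplicative scale."""
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.645 for a 90% interval
    return (math.exp(estimate),
            math.exp(estimate - z * se),
            math.exp(estimate + z * se))

# Estimates and bootstrap standard errors from the last row of Table 3.
last_row = {
    "traffic": (0.217, 0.042),
    "smoking": (0.508, 0.144),
    "education": (-0.235, 0.132),
    "sex": (0.251, 0.048),
}
for name, (est, se) in last_row.items():
    rr, lower, upper = rate_ratio_ci(est, se)
    print(f"{name}: {rr:.4f} ({lower:.4f} to {upper:.4f})")
```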

Our proposed sequential approach suggests that race is not related to risk for pancreatic cancer. However, when the BRFSS data (i.e., M2) are not included, the results suggest that white race has a significantly lower risk for pancreatic cancer than other races. This observation is likely a result of the large number of white subjects included in M1.

7 Discussion

We have proposed a new method for combining epidemiologic data that are obtained from diverse sources. The proposed approach allows us to make full use of all available information, regardless of source. It is computationally simple and can also be easily generalized to more complex settings. Our simulation shows that our method can yield estimators with smaller variances than those based on only a subset of the available data sources. In the substantive application, we have supplemented a standard population-based case-control study with Tumor Registry data and BRFSS data. The inclusion of these additional data can significantly enhance study power to detect effects of risk factors over the conventional case-control analysis approach. In particular, our analysis of the Connecticut pancreatic cancer data provides evidence for a positive association between traffic-related exposure and disease risk, while such a conclusion cannot be made with the case-control data alone.

For many population-based case-control studies, individual residential histories are available and can be used to construct trajectories of past traffic-related exposures. Since we do not have such data, we derive our traffic-related exposure based on residential locations at the time of diagnosis (for cases) or interview (for controls). A small number (3%) of patients also did not have geocodable addresses on file, due to the use of post office boxes and possibly because of transcription errors when the addresses were recorded by tumor registrars. We do not include these data in our analysis. Given the successfully geocoded addresses, we define traffic-related exposure broadly based on geographic proximity to highways. In reality, vehicle emissions include both particulate and gaseous pollutants, and the impact of different pollutants on health outcomes could be influenced by the local environment (e.g., wind, rain) and mode of transmission (e.g. inhalation vs. ingestion).

When forming our proposed estimating functions (8)-(11), we require that all necessary risk factors be available in the case data, the control data, or both. The unbiasedness of these estimating functions is maintained even if there are unmeasured confounders in some of the data sources. For example, the tumor registry data do not provide any information on smoking, which can be a potential confounder. However, we form the estimating functions (10) and (11) by combining the tumor registry data with the control data in a case-control study and the BRFSS data, both of which contain information on smoking. By doing so, we can still obtain unbiased estimating functions. Nevertheless, biased estimates can result if there are omitted confounders in both the case and control data sources. In our application, residential proximity to highways may also be associated with other factors such as socioeconomic status. Although we have controlled for education, a commonly used proxy measure for socioeconomic status, residual confounding remains a possibility.

As pointed out by one referee, our real data analysis results reveal that the parameter estimates can vary with the data sources being included, even though the effect of the risk factors is common as specified by (2). In some situations, conflicting results may even be obtained; see the discussion in the last paragraph of Section 6 regarding the effect of race as an example. Such variations are due to the different forms of potential selection bias in the sampling designs used to collect the data from different sources. For the case data, we are able to account for the selection bias through the use of (7), because we have information for all available cases given the tumor registry data. Should data from an unbiased sampling design be available for the controls, then a similar mechanism can be potentially developed and incorporated in the estimation process in order to correct for the bias associated with a given control data source. More research is needed on this topic, since that would require a more careful design of future epidemiological studies and also a nontrivial extension of our proposed approach. Nevertheless, it is reasonable to believe that the BRFSS data can produce less biased results than the controls collected by individual investigators in a case-control study, because more sophisticated sampling designs and statistical tools are often used to mitigate the bias for the BRFSS data; hence, we believe that our estimates are more objective than those obtained from the commonly used approach based on case-control data alone. In terms of the effect of traffic, our main conclusion on its significance persisted across the different data sources, despite the variations in the actual estimates.

Acknowledgments

This research has been partially supported by NIH grants 1R01CA169043, R01 ES01746 and 5R01CA098870, NSF grant DMS-0845368, the Danish Council for Independent Research - Natural Sciences grant 12-124675, “Mathematical and Statistical Analysis of Spatial Data”, and the Centre for Stochastic Geometry and Advanced Bioimaging, funded by the Villum Foundation. The Connecticut Tumor Registry is supported by Contract No. HHSN261201300019I between the National Cancer Institute and State of Connecticut Department of Public Health. This study was approved by the Connecticut Department of Public Health (CDPH). Certain data used in this paper were obtained from the CDPH. The authors assume full responsibility for analysis and interpretation of these data.

Contributor Information

Hui Huang, Department of Management Science, University of Miami, Coral Gables, FL 33124.

Xiaomei Ma, Yale School of Public Health, New Haven, CT 06520.

Rasmus Waagepetersen, Department of Mathematical Sciences, Aalborg University, Fredrik Bajersvej 7G, DK-9220 Aalborg, Denmark.

Theodore R. Holford, Yale School of Public Health, New Haven, CT 06520.

Rong Wang, Yale School of Public Health, New Haven, CT 06520.

Harvey Risch, Yale School of Public Health, New Haven, CT 06520.

Lloyd Mueller, Connecticut Department of Public Health, 410 Capitol Avenue, MS# 11HCQ, Hartford, CT 06134.

Yongtao Guan, Department of Management Science, University of Miami, Coral Gables, FL 33124.

References

  1. Angrist JD, Krueger AB. The Effect of Age at School Entry on Educational Attainment: An Application of Instrumental Variables with Moments from Two Samples. Journal of the American Statistical Association. 1992;87(418):328–336. [Google Scholar]
  2. Beelen R, Hoek G, van den Brandt PA, Goldbohm RA, Fischer P, Schouten LJ, Armstrong B, Brunekreef B. Long-Term Exposure to Traffic-Related Air Pollution and Lung Cancer Risk. Epidemiology. 2008;19:702–710. doi: 10.1097/EDE.0b013e318181b3ca. [DOI] [PubMed] [Google Scholar]
  3. Best NG, Ickstadt K, Wolpert RL. Spatial Poisson regression for health and exposure data measured at disparate resolutions. Journal of the American Statistical Association. 2000;95:1076–1088. [Google Scholar]
  4. Crowder M. On Consistency and Inconsistency of Estimating Equations. Econometric Theory. 1986;2:305–330. [Google Scholar]
  5. Diggle P, Guan Y, Hart C, Paize F, Stanton M. Estimating Individual-Level Risk in Spatial Epidemiology Using Spatially Aggregated Information on the Population at Risk. Journal of the American Statistical Association. 2010;105:1394–1402. doi: 10.1198/jasa.2010.ap09323. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Diggle PJ, Gómez-Rubio V, Brown PE, Chetwynd AG, Gooding S. Second-Order Analysis of Inhomogeneous Spatial Point Processes Using Case-Control Data. Biometrics. 2007;63(2):550–557. doi: 10.1111/j.1541-0420.2006.00683.x. [DOI] [PubMed] [Google Scholar]
  7. Diggle PJ, Rowlingson BS. Conditional approach to point process modelling of elevated risk. Journal of the Royal Statistical Society: Series A. 1994;157:433–440. [Google Scholar]
  8. Gelman A, King G, Liu C. Not Asked and Not Answered: Multiple Imputation for Multiple Surveys. Journal of the American Statistical Association. 1998;93:846–857. [Google Scholar]
  9. Haneuse S, Wakefield J. Hierarchical models for combining ecological and case-control data. Biometrics. 2007;63:128–136. doi: 10.1111/j.1541-0420.2006.00673.x. [DOI] [PubMed] [Google Scholar]
  10. Haneuse S, Wakefield J. Geographic-based ecological correlation studies using supplemental case-control data. Statistics in Medicine. 2008a;27:864–887. doi: 10.1002/sim.2979. [DOI] [PubMed] [Google Scholar]
  11. Haneuse S, Wakefield J. The combination of ecological and case-control data. Journal of the Royal Statistical Society B. 2008b;70:73–93. doi: 10.1111/j.1467-9868.2007.00628.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Hansen LP. Large Sample Properties of Generalized Method of Moments Estimators. Econometrica. 1982;50(4):1029–1054. [Google Scholar]
  13. Heyde CC. Quasi likelihood and its application a general approach to optimal parameter estimation. New York: Springer-Verlag, Inc; 1997. [Google Scholar]
  14. Imbens GW, Lancaster T. Combining Micro and Macro Data in Microeconometric Models. The Review of Economic Studies. 1994;61(4):655–680.
  15. Jackson C, Best N, Richardson S. Improving ecological inference using individual-level data. Statistics in Medicine. 2006;25:2136–2159. doi: 10.1002/sim.2370.
  16. Little R, Rubin D. Statistical Analysis with Missing Data. New York: John Wiley & Sons, Inc; 2002.
  17. Møller J, Waagepetersen R. Statistical Inference and Simulation for Spatial Point Processes. Chapman and Hall; 2004.
  18. Pearson RL, Wachtel H, Ebi KL. Distance-Weighted Traffic Density in Proximity to a Home Is a Risk Factor for Leukemia and Other Childhood Cancers. Journal of the Air & Waste Management Association. 2000;50:175–180. doi: 10.1080/10473289.2000.10463998.
  19. Prentice RL, Sheppard L. Aggregate data studies of disease risk factors. Biometrika. 1995;82:113–125.
  20. Raaschou-Nielsen O, Andersen ZJ, Hvidberg M, Jensen SS, Ketzel M, Sørensen M, Hansen J, Loft S, Overvad K, Tjønneland A. Air pollution from traffic and cancer incidence: a Danish cohort study. Environmental Health. 2011;10:67–77. doi: 10.1186/1476-069X-10-67.
  21. Raaschou-Nielsen O, Hertel O, Thomsen BL, Olsen JH. Air Pollution from Traffic at the Residence of Children with Cancer. American Journal of Epidemiology. 2001;153:433–443. doi: 10.1093/aje/153.5.433.
  22. Rathbun SL. Optimal estimation of Poisson intensity with partially observed covariates. Biometrika. 2012;100:277–281.
  23. Reynolds P, Behren JV, Gunier RB, Goldberg DE, Hertz A, Smith D. Traffic patterns and childhood cancer incidence rates in California, United States. Cancer Causes and Control. 2002;13:665–673. doi: 10.1023/a:1019579430978.
  24. Risch HA, Yu H, Lu L, Kidd MS. ABO Blood Group, Helicobacter pylori Seropositivity, and Risk of Pancreatic Cancer: A Case-Control Study. Journal of the National Cancer Institute. 2010;102:502–505. doi: 10.1093/jnci/djq007.
  25. Robins JM, Rotnitzky A, Zhao L. Estimation of Regression Coefficients When Some Regressors Are Not Always Observed. Journal of the American Statistical Association. 1994;89:846–866.
  26. Rubin DB. Multiple imputation for nonresponse in surveys. New York: John Wiley & Sons, Inc; 2004.
  27. Schafer JL. Multiple imputation: a primer. Statistical Methods in Medical Research. 1999;8:3–15. doi: 10.1177/096228029900800102.
  28. Schenker N, Raghunathan T, Bondarenko I. Improving on analyses of self-reported data in a large-scale health survey by using information from an examination-based survey. Statistics in Medicine. 2010;29:533–545. doi: 10.1002/sim.3809.
  29. Schlesselman JJ. Case-Control Studies: Design, Conduct, Analysis. New York: Oxford University Press; 1982.
  30. Visser O, van Wijnen JH, van Leeuwen FE. Residential traffic density and cancer incidence in Amsterdam, 1989-1997. Cancer Causes and Control. 2004;15:331–339. doi: 10.1023/B:CACO.0000027480.32494.a3.
  31. Wakefield J. Ecological inference for 2 × 2 tables (with Discussion). Journal of the Royal Statistical Society A. 2004;167:385–445.
  32. Zhou M, Kim JK. An efficient method of estimation for longitudinal surveys with monotone missing data. Biometrika. 2012;99(3):631–648.
