Author manuscript; available in PMC: 2024 Nov 27.
Published in final edited form as: Ann Appl Stat. 2023 May 1;17(2):1722–1739. doi: 10.1214/22-aoas1692

INTEGRATING MULTIPLE BUILT ENVIRONMENT DATA SOURCES

Jung Yeon Won 1, Michael R Elliott 1, Emma V Sanchez-Vaznaugh 2, Brisa N Sánchez 3,*
PMCID: PMC11600455  NIHMSID: NIHMS1987803  PMID: 39605798

Abstract

Studies examining the contribution of the built environment to health often rely on commercial data sources to derive exposure measures, such as the number of specific food outlets in study participants’ neighborhoods. Data on the locations of community amenities (e.g., food outlets) can be collected from multiple sources. However, these commercial listings are known to have ascertainment errors and thus provide conflicting claims about the number and location of amenities. We propose a method that integrates exposure measures from different databases while accounting for ascertainment errors and obtains unbiased estimates of the health effects of the latent exposure. We frame the problem of conflicting exposure measures as one of two contingency tables with partially known margins, with the entries of the tables modeled using a multinomial distribution. Available estimates of source quality are embedded in a joint model for observed exposure counts, latent exposures, and health outcomes. Simulations show that our modeling framework yields substantially improved inferences regarding the health effects. We used the proposed method to estimate the association between children’s body mass index (BMI) and the concentration of food outlets near their schools when both the NETS and Reference USA databases are available.

Keywords: Built-environment, Count exposure, Data integration, Measurement error, Dirichlet process mixture model, Commercial business lists

1. Introduction.

Built environment and health studies examine how the availability of community amenities influences health outcomes, for example, how the availability of junk food outlets near schools influences child obesity (Larson, Story and Nelson (2009); Lee (2012)). Many of these studies rely on available commercial business data, such as InfoUSA, Google Maps, and Yelp, to quantify the availability of food outlets. A practical challenge in the field is that exposure measurements differ depending on the food outlet data source because of what is known as ascertainment error: data sources may have incomplete listings or may list businesses that no longer exist. There is growing concern about the extent of bias in health effect estimates due to ascertainment error (Lucan et al. (2013); Jones et al. (2017)), and resource-intensive strategies have been recommended to minimize measurement errors (Caspi and Friebur (2016)). However, the impact of ascertainment error on health effect estimates has not been systematically investigated, is rarely quantified in practice, and is often explained by simplified heuristics, such as citing the attenuation of regression coefficients (van Smeden, Lash and Groenwold (2020)). In this study, we show that exposure ascertainment errors introduce biases with more complex patterns than these heuristics suggest, and we propose methods that integrate the exposure information from two popular commercial business lists to construct more accurate school-level exposure measures and achieve improved association estimates.

A large body of literature has proposed methods to combine information from error-prone databases or multi-observer data. Among these, capture-recapture methods have been widely used for ecological population estimation, and many extensions of the original approach are available (Fienberg (1972a); Pollock and Otto (1983); Manrique-Vallier (2016)). The key information required for capture-recapture methods is a capture history that identifies which list captures a particular entry. However, in our application, we do not know the capture histories for individual business records unless we merge the listings. Because of the proprietary nature of commercial databases and related data use agreements, business-level matching is not permitted; thus, standard capture-recapture methods cannot be applied. In our motivating study, we instead compared aggregated exposure metrics at a given subject’s location, that is, the number of available businesses within each school’s neighborhood. Although matching sources is not allowed, field validation studies of secondary commercial data sources are available (e.g., Liese et al. (2013); Fleischhacker et al. (2013)). External validation studies use direct field observations for a given geographic area, validate the businesses listed in the databases, and publish validity scores for each source. Given this reference on source quality, latent truth models are a potential approach to improve exposure measures. Latent truth models are commonly used in crowdsourcing problems where data providers have different levels of credibility (Dong, Berti-Equille and Srivastava (2009); Zhao et al. (2012)). This modeling approach uses each user’s sensitivity and specificity to evaluate whether the claim or input from that user (i.e., the data source) is true. However, the latent truth model focuses on finding the “correct” data entry among multiple observations.
In our case, however, it is highly unlikely that the true exposure values are captured by any one list or combination thereof.

Our goal is to minimize the bias in estimating the health effects of food environment exposure when direct observations or gold-standard data are not available. After describing the motivating study in Section 2, we propose a method that accounts for errors in exposure measures derived from more than one business database in Section 3. Specifically, for each database, we consider a model that treats the observed business counts as arising from the trivariate distribution of correctly listed true businesses, falsely listed businesses that do not exist, and unlisted but truly existing businesses. The number of latent true businesses is assumed to be the same across databases. A flexible Dirichlet process (DP) prior is used for the latent true exposure. The health outcome is modeled jointly with a regression model, with the latent true business total as the primary predictor. Available estimates of each database’s credibility (sensitivity and positive predictive value) make identification of the model parameters possible. Section 4 presents the estimation approach, and Section 5 presents simulation studies of its properties. Section 6 applies our method to estimate the effect of the availability of convenience and grocery stores near schools on the body mass index (BMI) of children in California. In Section 7, we discuss final remarks and possible future research directions.

2. Food exposure in multiple sources.

Our motivating example investigates the association between the availability of food outlets near schools and the body weight of 5th- and 7th-grade children attending 2,651 urban public schools in California in 2008. Children’s weight and height were collected as part of the state-mandated Fitnessgram test (California Department of Education (2019)) and used to calculate BMI z-scores. BMI z-scores are obtained by standardizing BMI according to age- and sex-specific reference distributions, ensuring the comparability of body weight across children of different ages, who are still growing (Must and Anderson (2006)). Our outcome variable was the school-level mean BMI z-score among children in these schools. We use publicly available school-level characteristics as covariates (e.g., income) and the schools’ latitude and longitude to assess the availability of nearby outlets.

A list of food outlets was obtained from the two commercial business lists most commonly used in built environment studies in the United States: the National Establishment Time-Series (NETS) (NETS (2021)) and InfoUSA (InfoUSA (2012)). The NETS is a longitudinal dataset produced by Walls & Associates, from annual Dun & Bradstreet (D&B) establishment data. InfoUSA is an independent data vendor (a competitor of D&B) that produces annual business listings. Both databases include information regarding business names, their locations (addresses and latitude-longitude), dates of business openings and closings, and information on the goods or services offered by each business, which are denoted using standard industrial classification (SIC) codes. Validation studies show that these business lists have different validities and do not always provide matching information for a given region (Liese et al. (2013); Jones et al. (2017)).

Direct merging of these two databases at the business level (e.g., matching by business name, address, or other statistical matching algorithms) is not allowed because of specifications in the data-use agreements with the data providers. Instead, we compared school-level exposures, which are aggregated/derived quantities. We adopted one of the commonly used methods to derive exposure measures, namely, counting the number of community resources within a one-mile-radius circular buffer around each school (Athens, Duncan and Elbel (2016)). We focused on the SIC codes that identify ‘convenience stores’ and, separately, ‘grocery stores,’ as defined in a field validation study (Powell et al. (2011)). For convenience stores, establishments with SIC codes 541102, 55410000, 55419901, and 55419903 were used for NETS, whereas establishments with SIC/NAICS codes 541103, 554101, 554102, and 554103 were used for InfoUSA. For grocery stores, SIC codes 541101, 541100, and 541199 were used for NETS, whereas SIC/NAICS codes 541101, 541102, 541104, 541105, 541106, 541107, and 541108 were used for InfoUSA. The exposure distributions of convenience and grocery stores are shown in Figure 1. This figure demonstrates the discordance between NETS and InfoUSA. In the following sections, we use convenience store exposures to illustrate and compare the distributions of the exposure data across the databases, as well as overdispersion.
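As a concrete illustration of the buffer-count exposure measure described above, the following stand-alone Python sketch counts listed outlets within a one-mile great-circle radius of a school. The coordinates, the haversine helper, and the function names are illustrative choices for this sketch, not part of the study’s actual GIS pipeline.

```python
import math

def miles_between(lat1, lon1, lat2, lon2):
    # Haversine great-circle distance, using an Earth radius of 3958.8 miles.
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 3958.8 * math.asin(math.sqrt(a))

def buffer_count(school, outlets, radius_miles=1.0):
    # Number of listed outlets within the circular buffer around one school.
    return sum(miles_between(*school, *outlet) <= radius_miles for outlet in outlets)

school = (34.05, -118.25)  # hypothetical school coordinates
outlets = [(34.051, -118.251), (34.06, -118.26), (34.2, -118.4)]  # hypothetical listings
print(buffer_count(school, outlets))  # -> 2 (the third outlet is roughly 10 miles away)
```

Applying this count separately to the NETS and InfoUSA listings for the same school yields the two discordant exposure measures compared in Figure 1.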

Fig 1: Distributions of exposure to convenience stores and grocery stores from different sources. Exposures are computed by counting the number of stores within a 1-mile buffer around California public schools. Means and standard deviations are presented in the legend.

With the growing use of readily available business databases in community environment studies, there is a rich set of validation studies that compare commercial databases with ground observations from field studies, or from gold-standard data such as local government databases (Powell et al. (2011); Liese et al. (2013); Lebel et al. (2017)). The agreement between the true availability and listings on commercial databases is measured by the number of correctly listed businesses (true positives), incorrectly omitted businesses (false negatives), and incorrectly listed businesses (false positives). These measures are used to construct two major validity scores: (1) sensitivity, which is the proportion of truly existing establishments listed in the commercial database, and (2) positive predictive value (PPV), which is the proportion of establishments listed in the commercial database that are also present on the ground. Undercount (1–sensitivity) and overcount (1–PPV) rates are also commonly reported.
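The two validity scores defined above can be made concrete with a small sketch. The counts below are hypothetical, chosen only to illustrate the arithmetic; they are not from any cited validation study.

```python
def validity_scores(true_pos, false_neg, false_pos):
    # Sensitivity: share of truly existing outlets that the database lists.
    sensitivity = true_pos / (true_pos + false_neg)
    # PPV: share of database listings that exist on the ground.
    ppv = true_pos / (true_pos + false_pos)
    return sensitivity, ppv

# Hypothetical field-validation counts for one database and store type.
sen, ppv = validity_scores(true_pos=40, false_neg=60, false_pos=43)
print(round(sen, 3), round(ppv, 3))          # -> 0.4 0.482
print(round(1 - sen, 3), round(1 - ppv, 3))  # undercount and overcount rates
```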

Considering the similarity in the urbanization level of the study region and the year in which the data were collected, we used Powell et al.’s validity statistics as a reference for the California establishment data in 2008. According to the authors, the sensitivities of D&B and InfoUSA for convenience stores were 0.3769 and 0.5047, respectively, with PPVs of 0.4812 (D&B) and 0.6182 (InfoUSA). For grocery stores, the sensitivities were 0.4635 (D&B) and 0.5414 (InfoUSA), and the PPVs were 0.2880 (D&B) and 0.4446 (InfoUSA).

3. Modeling approach.

3.1. Notation for measurement error model.

Let $X_i$ denote the true, unobserved outlet count within a one-mile-radius buffer of the $i$th subject (e.g., school), and let $Y_i$ be a continuous outcome for subject $i = 1, \ldots, n$. For simplicity, we assumed a simple linear regression model, $Y_i = \beta_0 + \beta_x X_i + \epsilon_i$ with $\epsilon_i \sim N(0, \sigma)$, although the inclusion of covariates is straightforward. Our primary objective was to estimate $\beta_x$.

Instead of the true $X_i$, contaminated measures $W_{1i}$ and $W_{2i}$ were derived from the two databases. Their relationships with $X_i$ are represented in Table 1. For database $s$ ($s = 1, 2$), $N_{11s,i}$ denotes the number of outlets that are “correctly” captured (true positives) within the buffer area of the $i$th subject, $N_{12s,i}$ is the number of omitted outlets (false negatives), and $N_{21s,i}$ denotes the number of non-existing outlets that are falsely listed (false positives). Because outlets that do not exist cannot be incorrectly omitted from the databases, these tables have structural zeros. Each table has three cell probabilities $\pi_s = (\pi_{11s}, \pi_{12s}, \pi_{21s})$, which sum to one. The tables share the row margin $X_i = N_{111,i} + N_{121,i} = N_{112,i} + N_{122,i}$ (true exposure) but have different column margins $W_{1i} = N_{111,i} + N_{211,i}$ and $W_{2i} = N_{112,i} + N_{212,i}$ (observed exposures). Given that the companies providing establishment data are competitors, we assumed that the two contingency tables in Table 1 are conditionally independent given the row margin $X_i$. Thus, the data at each location can be represented by two conditionally independent contingency tables with structural zeros. Note that the model extends readily to more than two databases owing to the conditional independence assumption.

Table 1.

Contingency tables comparing exposure measures between truth and claims from two databases.

(a) Database 1
            Listed                       Not Listed                   Total
Exist       $N_{111,i}$ ($\pi_{111}$)    $N_{121,i}$ ($\pi_{121}$)    $X_i$
Not exist   $N_{211,i}$ ($\pi_{211}$)    -
Total       $W_{1i}$

(b) Database 2
            Listed                       Not Listed                   Total
Exist       $N_{112,i}$ ($\pi_{112}$)    $N_{122,i}$ ($\pi_{122}$)    $X_i$
Not exist   $N_{212,i}$ ($\pi_{212}$)    -
Total       $W_{2i}$

To develop a modeling strategy for the contingency tables at any one location, we relied on the Poisson sampling scheme for tables with structural zeros, where each cell is assumed to be an independent Poisson variate (Fienberg (1972b); Levin (1981)). Because the sum of independent Poisson random variables also has a Poisson distribution, with mean equal to the sum of the individual means, the shared row margin $X_i$ is Poisson distributed. We defined $X_i \sim \mathrm{Pois}(\lambda_i)$, where $\lambda_i$ is the location-specific mean number of stores within the $i$th subject’s buffer.

Let $\phi_s = \{sen_s, ppv_s\}$, $s = 1, 2$, denote the sensitivity and positive predictive value for source $s$, which are available from the published literature described in Section 2. The validation studies reported the sensitivity and PPV as proportions calculated using the observed number of businesses in a large study region (e.g., an entire city). Although $\phi_1$ and $\phi_2$ could be further indexed by $i$ to reflect variations in school-level characteristics, such granular data on PPV and sensitivity are rarely available. Modeling $\phi_{si}$ instead of $\phi_s$ would require additional data or stronger modeling assumptions; otherwise, the model is not identifiable. Furthermore, the majority of the validation studies that examined the association between source quality and geographical (e.g., urbanicity) or demographic (e.g., socio-economic status) characteristics of the locations found no significant difference in sensitivity or PPV across these characteristics (Lebel et al. (2017)). Thus, we assumed that the agreement statistics were constant over all subject locations (schools) and expressed them in terms of expected counts: $sen_s = E(N_{11s,i})/E(X_i)$ and $ppv_s = E(N_{11s,i})/E(W_{si})$ for $i = 1, \ldots, n$. Then, as shown in the Supplementary Material, the cell probabilities $\pi_1$, $\pi_2$ in Table 1 can be computed from $\phi_1$ and $\phi_2$.

Since the shared row margin $X_i \sim \mathrm{Pois}(\lambda_i)$, the cell counts of database $s$ follow

$$N_{11s,i} \sim \mathrm{Pois}\!\left(\tfrac{\pi_{11s}}{\pi_{11s}+\pi_{12s}}\lambda_i\right) = \mathrm{Pois}(sen_s\lambda_i),\quad N_{12s,i} \sim \mathrm{Pois}\!\left(\tfrac{\pi_{12s}}{\pi_{11s}+\pi_{12s}}\lambda_i\right) = \mathrm{Pois}((1 - sen_s)\lambda_i),\quad N_{21s,i} \sim \mathrm{Pois}\!\left(\tfrac{\pi_{21s}}{\pi_{11s}+\pi_{12s}}\lambda_i\right) = \mathrm{Pois}\!\left(\left(\tfrac{sen_s}{ppv_s} - sen_s\right)\lambda_i\right). \quad (1)$$

Consequently, the local expectations of $W_{1i}$ and $W_{2i}$ are proportional to $\lambda_i$: $E(W_{1i}) = E(N_{111,i} + N_{211,i}) = (sen_1/ppv_1)\lambda_i$ and $E(W_{2i}) = E(N_{112,i} + N_{212,i}) = (sen_2/ppv_2)\lambda_i$. The measurement error does not follow the classical zero-mean assumption because $E(W_{si} - X_i) = (sen_s/ppv_s - 1)E(X_i)$ is typically nonzero: it vanishes only when $sen_s = ppv_s$ or when the expected true exposure is zero, which generally does not hold for count exposures.
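A quick numeric check of these moment relations, using illustrative values of $sen_s$, $ppv_s$, and $\lambda_i$ in the range reported by the validation literature, confirms that the three cell means in Equation (1) reproduce the stated margins and that the mean error is nonzero:

```python
sen, ppv, lam = 0.5, 0.62, 6.0  # illustrative source quality and local mean

e_n11 = sen * lam                 # E(true positives)
e_n12 = (1 - sen) * lam           # E(false negatives)
e_n21 = (sen / ppv - sen) * lam   # E(false positives), per Equation (1)

assert abs((e_n11 + e_n12) - lam) < 1e-12    # row margin: E(X) = lambda
e_w = e_n11 + e_n21                          # column margin E(W)
assert abs(e_w - (sen / ppv) * lam) < 1e-12  # E(W) = (sen/ppv) * lambda

# Non-classical error: the mean error is (sen/ppv - 1) * E(X), nonzero when sen != ppv.
print(round(e_w - lam, 3))  # -> -1.161 (a systematic undercount here, since sen < ppv)
```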

3.2. Modeling mixture distribution of the latent exposure counts.

We assumed that the true exposure is locally Poisson distributed with mean parameter $\lambda_i$. However, not imposing any structure on the $\lambda_i$’s leads to estimation problems, whereas setting the $\lambda_i$’s to a constant leads to an unrealistic marginal distribution for the $X$’s (given the overdispersion seen in Figure 1). One approach is to model $\lambda_i$ using covariates for the $i$th subject. However, our primary interest was $\beta_x$, not identifying predictors of exposure, and misspecification of such a model could bias the estimates of $\beta_x$. Thus, we aimed to model $\lambda_i$ flexibly, reducing the number of estimated parameters while maintaining robustness. Specifically, we considered a Poisson mixture model, so that the results in Section 3.1 hold conditional on mixture component membership, without requiring prior knowledge of the mixture components.

Furthermore, to avoid specifying the number of mixture components in advance, we used a DP prior on $\lambda_i$ (Ferguson (1973)), which enables the clustering of the $\lambda_i$ based on information shared across sites. This ultimately creates a flexible marginal distribution for the latent exposures $X$, which effectively handles overdispersion. Although this prior allows for infinitely many mixture components (clusters), in practice a finite number of clusters $K$ is identified in a data-adaptive fashion. The cluster allocation of $\lambda_i$ is denoted by $Z_i \in \{1, \ldots, K\}$. Within the $k$th cluster, subjects share the same value $\lambda_k$, so that $[X_i \mid Z_i = k] \sim \mathrm{Pois}(\lambda_k)$. The model is $X_i \sim \mathrm{Pois}(\lambda_i)$; $\lambda_i \sim G$; $G \sim \mathrm{DP}(\alpha G_0)$; and $G_0 = \mathrm{Gamma}(a_0, b_0)$, where $\alpha$ denotes the concentration parameter and $G$ is a random measure on the parameter space. The choice of $\alpha$ is discussed in Section 4.
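The DP prior’s clustering behavior can be sketched through its Chinese restaurant process representation (the same representation used for estimation in Section 4). The stand-alone Python sketch below draws clustered $\lambda_i$’s from DP($\alpha G_0$) with $G_0 = \mathrm{Gamma}(a_0, b_0)$; the sample size, seed, and value of $\alpha$ are illustrative, not the paper’s fitted settings.

```python
import random

def crp_lambdas(n, alpha, a0=1.0, b0=1.0, seed=0):
    # Sequentially seat subjects: open a new cluster with probability
    # alpha / (i + alpha); otherwise join existing cluster k with
    # probability size_k / (i + alpha). New clusters draw lambda from G0.
    rng = random.Random(seed)
    sizes, values, lam = [], [], []
    for i in range(n):
        k = rng.choices(range(len(sizes) + 1), weights=sizes + [alpha])[0]
        if k == len(sizes):
            sizes.append(1)
            values.append(rng.gammavariate(a0, 1.0 / b0))  # G0 = Gamma(a0, b0) with rate b0
        else:
            sizes[k] += 1
        lam.append(values[k])
    return lam, len(sizes)

lam, n_clusters = crp_lambdas(n=500, alpha=1.0)
print(n_clusters)  # a small, data-adaptive number of distinct lambda values
```

The ties among the sampled $\lambda_i$’s are what produce the finite number of clusters $K$ described above.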

Under the Bayesian framework, we are interested in the posterior distribution of the latent $X_i$ given the observed data, which serves as the basis of the Gibbs sampler in Section 4. For each subject, we jointly modeled two independent trinomial distributions with known column margins $(W_{1i}, W_{2i})$ and an unknown shared row margin $X_i$. Using Table 1 together with the outcome model defined in Section 3.1, we modeled the conditional distribution of $X_i$ given the observed data, cluster membership, and parameters as

$$[x_i \mid y_i, w_{1i}, w_{2i}, z_i] \propto [y_i \mid x_i; \beta_0, \beta_x, \sigma] \times [w_{1i} \mid x_i; \phi_1, \lambda_k] \times [w_{2i} \mid x_i; \phi_2, \lambda_k] \times [x_i \mid z_i = k; \lambda_k].$$

By definition of the conditional distribution,

$$[w_{1i} \mid x_i; \phi_1, \lambda_k] = \frac{[w_{1i}, x_i; \phi_1, \lambda_k]}{[x_i; \lambda_k]} = \frac{\sum_{n_{111}} [n_{111,i}, w_{1i}, x_i; \phi_1, \lambda_k]}{[x_i; \lambda_k]} = \frac{\sum_{n_{111}} [n_{111,i};\, sen_1\lambda_k]\,[x_i - n_{111,i};\, (1 - sen_1)\lambda_k]\,[w_{1i} - n_{111,i};\, (sen_1/ppv_1 - sen_1)\lambda_k]}{[x_i \mid z_i = k; \lambda_k]} \quad (2)$$

since each cell count was an independent Poisson random variable. Thus, the log-likelihood is as follows:

$$\begin{aligned}
l_{x_i} \propto{}& \log N(y_i \mid x_i; \beta_0, \beta_x, \sigma)\\
&+ \log\!\left(\sum_{n_{111,i}=0}^{\min(w_{1i},\,x_i)} \frac{\mathrm{Pois}(n_{111,i};\, sen_1\lambda_k)\,\mathrm{Pois}(x_i - n_{111,i};\, (1 - sen_1)\lambda_k)\,\mathrm{Pois}(w_{1i} - n_{111,i};\, (sen_1/ppv_1 - sen_1)\lambda_k)}{\mathrm{Pois}(x_i \mid z_i = k;\, \lambda_k)}\right)\\
&+ \log\!\left(\sum_{n_{112,i}=0}^{\min(w_{2i},\,x_i)} \frac{\mathrm{Pois}(n_{112,i};\, sen_2\lambda_k)\,\mathrm{Pois}(x_i - n_{112,i};\, (1 - sen_2)\lambda_k)\,\mathrm{Pois}(w_{2i} - n_{112,i};\, (sen_2/ppv_2 - sen_2)\lambda_k)}{\mathrm{Pois}(x_i \mid z_i = k;\, \lambda_k)}\right)\\
&+ \log \mathrm{Pois}(x_i \mid z_i = k;\, \lambda_k). \quad (3)
\end{aligned}$$

Although we could estimate $X_i$ by maximizing the log-likelihood in Equation (3), treating $\lambda_k$ as a nuisance parameter, this estimation is not stable because we only have two observations ($W_{1i}$, $W_{2i}$) per site. Alternatively, the EM algorithm can be considered for frequentist inference. However, unlike the Dirichlet process mixture (DPM) model, the EM algorithm requires a fixed number of clusters, which is undesirable in our problem, where the degree of overdispersion in the true exposure data is unknown. In addition, the EM algorithm may suffer from the multimodality of the model, which typically arises with a discrete-valued latent variable (Bartolucci, Pandolfi and Pennoni (2022)).
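For intuition, the convolution sum defining $[w_{si} \mid x_i]$ in Equation (2) can be transcribed directly. The sketch below is a stand-alone illustration (the paper evaluates these terms inside a Gibbs sampler, not as a separate routine), and all parameter values are illustrative. Because the first two Poisson factors divided by the row-margin pmf reduce to a binomial thinning of $x_i$, convolved with an independent Poisson count of false positives, the resulting probabilities over $w$ sum to one, which the last line checks.

```python
import math

def pois_pmf(n, mu):
    # Poisson pmf computed via logs, to stay stable for larger n.
    return math.exp(n * math.log(mu) - mu - math.lgamma(n + 1))

def w_given_x(w, x, sen, ppv, lam):
    # Convolution sum of Equation (2) for one source: the product of the
    # three independent Poisson cells, divided by the pmf of the row margin.
    fp_mean = (sen / ppv - sen) * lam  # false-positive cell mean, per Equation (1)
    total = 0.0
    for n11 in range(0, min(w, x) + 1):
        total += (pois_pmf(n11, sen * lam)
                  * pois_pmf(x - n11, (1 - sen) * lam)
                  * pois_pmf(w - n11, fp_mean))
    return total / pois_pmf(x, lam)

# Sanity check: for fixed latent x, the probabilities over observed w sum to one.
mass = sum(w_given_x(w, x=5, sen=0.5, ppv=0.62, lam=6.0) for w in range(200))
print(round(mass, 6))  # -> 1.0
```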

4. Estimation.

We used a Bayesian approach for estimation. We selected a noninformative prior for the regression parameters, $\pi(\beta_0, \beta_x, \sigma) \propto \sigma^{-2}$, and the hyperparameters $(a_0, b_0)$ for the base measure $G_0$ were set to $(1, 1)$. Given that the DPM model is used to model the distribution of the latent exposure, there is relatively little information in the data to identify the concentration parameter $\alpha$. Thus, although more recent applications of DPM models estimate $\alpha$, we instead used a fixed value selected as follows. Assuming that the observed exposures have a distribution shape similar to that of the true exposure, we fitted a DPM model to the observed exposures, the $W$’s, from each data source separately. The models were fitted in R (R Core Team (2021)) using the dirichletprocess package (v0.4.0; J. Ross and Markwick (2020)). In these models, we used a Gamma(1, 1) prior for the source-specific concentration parameter $\alpha_s$ and thus obtained a data-driven estimate of the concentration parameter. We ran 10,000 iterations for each model and, after discarding the first half as burn-in, obtained the posterior means $\hat{\alpha}_1$ and $\hat{\alpha}_2$ for sources 1 and 2, respectively. We averaged the separate estimates to obtain $\bar{\alpha} = \mathrm{mean}(\hat{\alpha}_1, \hat{\alpha}_2)$, which we plugged into the model. This is a reasonable approach because the observed exposures contain information on the number of components and the mixing proportions that make up the mixture distribution for $X_i$. Specifically, $W_{si} \sim \mathrm{Pois}((sen_s/ppv_s)\lambda_i)$ at the $i$th subject’s location, and $\lambda_i = \lambda_k$ when the $i$th subject belongs to the $k$th cluster. Thus, the distribution of $W_{si}$ has the same number of components, although the mean parameters of the mixing components are multiplied by the nonzero scalar $sen_s/ppv_s$.

We implemented a Gibbs sampling algorithm to draw values from the posterior distribution and used standard summaries of the draws for inference (see the Supplementary Material for details). Within the algorithm, we implemented the DPM model for the latent exposure using the Chinese restaurant process (CRP) (Aldous (1985); Teh (2010)). The CRP accommodates a flexible number of clusters by allowing observations to be assigned to new clusters as needed; the conditional posterior probability of allocation to a new cluster is proportional to the concentration parameter $\bar{\alpha}$ obtained above.

5. Simulations and bias analysis.

Measurement error in the exposure predictor leads to bias in the estimation of the regression coefficients. Here, we algebraically describe the nature of the bias in linear regression coefficients when a naïve regression analysis that disregards measurement error is used. Having described the factors that drive the direction and magnitude of the bias, we demonstrate our model’s estimation properties under various data-generation scenarios.

5.1. Bias in OLS.

When exposure measures derived from a single source s are used as a predictor in a linear regression model and the observed exposures arise from the hypothesized model (Table 1), the ordinary least squares (OLS) estimate of the target coefficient is as follows:

$$\mathrm{plim}\,\hat{\beta}_{w_s} = \frac{\mathrm{Var}(X)/E(X) - (1 - ppv_s)}{(sen_s/ppv_s)\,\mathrm{Var}(X)/E(X) + (1 - sen_s/ppv_s)}\,\beta_x = \left(\frac{(ppv_s - 1)E(X) + (1 - sen_s/ppv_s)\{\mathrm{Var}(X) - E(X)\}}{E(X) + (sen_s/ppv_s)\{\mathrm{Var}(X) - E(X)\}} + 1\right)\beta_x \quad (4)$$

(see the Supplementary Material for details). The scaling factor multiplying βx in Equation (4), also known as the reliability ratio in the measurement error literature, consists of source credibility statistics and the dispersion of X relative to its mean, or, equivalently, the difference between the mean and variance of X. Consequently, the direction and magnitude of the bias depend upon the source credibility and the true exposure distribution, as follows:

When $sen_s > ppv_s$, the term $(ppv_s - 1)E(X) + (1 - sen_s/ppv_s)\{\mathrm{Var}(X) - E(X)\}$ is negative, since $\mathrm{Var}(X) - E(X) \ge 0$, and therefore $\hat{\beta}_{w_s}$ is always attenuated toward zero. However, when $sen_s < ppv_s$, either attenuation or inflation can occur, depending on the amount of overdispersion. In the case where $X$ is marginally equidispersed (e.g., $\lambda_i = \lambda$, so that $X$ has a simple Poisson distribution overall), the reliability ratio simplifies to $ppv_s$. Because $0 < ppv_s < 1$, the OLS estimate is then always attenuated; unbiasedness occurs only when there are no false positives (i.e., $ppv_s = 1$), irrespective of sensitivity.

Given $sen_s < ppv_s$, the naïve coefficient estimate is attenuated when $\mathrm{Var}(X) < \{ppv_s(1 - ppv_s)/(ppv_s - sen_s) + 1\}E(X)$; otherwise, an inflated coefficient $\hat{\beta}_{w_s}$ is obtained. Interestingly, when the overdispersion is approximately $\{ppv_s(1 - ppv_s)/(ppv_s - sen_s) + 1\}E(X)$, the naïve coefficient estimator is approximately unbiased. Note that this is a special form of negative binomial overdispersion, in which the mean of the prior gamma distribution varies with the shape while the scale remains constant (Nelder and Lee (1992)).

Considering that $W_1$ and $W_2$ are independent given $X$, the average of the two exposures can also be used as a predictor. However, the resulting estimate is biased as well, with bias depending on the combination of the source credibility parameters, $E(X)$, and $\mathrm{Var}(X)$.
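The bias mechanics of Equation (4) can be checked numerically. The sketch below, using the realistic convenience-store statistics from Section 5.2 as illustrative inputs, evaluates the reliability ratio, recovers the equidispersed limit $ppv_s$, and verifies the attenuation-to-inflation threshold derived above.

```python
def reliability_ratio(sen, ppv, mean_x, var_x):
    # Scaling factor multiplying beta_x in Equation (4) (second form).
    r = sen / ppv
    num = (ppv - 1) * mean_x + (1 - r) * (var_x - mean_x)
    den = mean_x + r * (var_x - mean_x)
    return num / den + 1

sen, ppv, mean_x = 0.5, 0.62, 5.71  # illustrative 'realistic' values

# Equidispersed X (Var = mean): the ratio collapses to ppv, i.e. pure attenuation.
print(round(reliability_ratio(sen, ppv, mean_x, mean_x), 2))  # -> 0.62

# With sen < ppv, overdispersion beyond {ppv(1-ppv)/(ppv-sen) + 1} E(X)
# flips the bias from attenuation to inflation.
threshold = (ppv * (1 - ppv) / (ppv - sen) + 1) * mean_x
print(round(reliability_ratio(sen, ppv, mean_x, threshold), 6))  # -> 1.0
print(reliability_ratio(sen, ppv, mean_x, 2 * threshold) > 1)    # -> True
```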

5.2. Simulation design.

We used the analytical results from the previous section to design our simulation study, which varied depending on three factors: (i) the distribution of X, (ii) the ratio of sensitivity to PPV, and (iii) the values of the agreement statistics (i.e., high or low source quality).

Distribution of X.

We aimed to use a realistic distribution of the latent exposure while retaining control over the variance of $X$ relative to its mean, given the dependence of the bias of $\hat{\beta}_{w_s}$ on these moments of $X$ (Equation (4)). Our strategy was to fix $E(X)$ and investigate the effect of the magnitude of the variance on the amount of bias in the naïve estimate of $\beta_x$. Specifically, we fixed $E(X) = 5.71$ and created distributions of $X$ with varying degrees of variance using mixture distributions (see Table 2). To obtain realistic mixture components from which to simulate $X$, we fitted a Poisson mixture distribution to the observed exposures from the InfoUSA data alone using the DPM model. The DPM was fitted using a concentration parameter $\alpha = 0.1$ and a Gamma(1, 1) base measure for the rate parameter. The estimated median number of mixture components was four, with mixing probabilities (0.23, 0.42, 0.33, 0.02) and mean parameters (1, 4, 10, 25).

Table 2.

Different distributions of X with the same expectation for the simulation study.

Type E(X) Var(X) Components
Mixture 1 5.71 5.71 Pois(5.71)
Mixture 2 5.71 10.72 0.23 Pois(3) + 0.42 Pois(5) + 0.33 Pois(8) + 0.02 Pois(14)
Mixture 3 5.71 25.56 0.23 Pois(1) + 0.42 Pois(4) + 0.33 Pois(10) + 0.02 Pois(25)
Mixture 4 5.71 35.98 0.23 Pois(2) + 0.42 Pois(5) + 0.33 Pois(7) + 0.02 Pois(42)
Mixture 5 5.71 41.74 0.23 Pois(3) + 0.42 Pois(5) + 0.33 Pois(6) + 0.02 Pois(47)

The mixtures were selected to have increasing variance to mimic various degrees of overdispersion. This was achieved by setting ‘Mixture 1’ to a simple Poisson(5.71). The remaining mixtures had four components with mixing proportions (0.23, 0.42, 0.33, and 0.02) and the mean parameters of the components are listed in Table 2.
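The E(X) and Var(X) columns of Table 2 follow from the standard moment identities for a mixture of Poissons, $E(X) = E(\lambda)$ and $\mathrm{Var}(X) = E(\lambda) + \mathrm{Var}(\lambda)$; a short check for ‘Mixture 3’:

```python
def poisson_mixture_moments(probs, means):
    # For X | lambda ~ Pois(lambda) with lambda drawn from a discrete mixture:
    # E(X) = E(lambda) and Var(X) = E(lambda) + Var(lambda).
    m1 = sum(p * m for p, m in zip(probs, means))
    m2 = sum(p * m * m for p, m in zip(probs, means))
    return m1, m1 + (m2 - m1 * m1)

probs = [0.23, 0.42, 0.33, 0.02]
mean_x, var_x = poisson_mixture_moments(probs, [1, 4, 10, 25])  # 'Mixture 3'
print(round(mean_x, 2), round(var_x, 2))  # -> 5.71 25.56
```

The same computation reproduces the remaining rows of Table 2, with the single-component ‘Mixture 1’ reducing to the equidispersed case $\mathrm{Var}(X) = E(X)$.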

Source quality parameters.

To select realistic values of source quality, we used Powell et al.’s agreement statistics for convenience stores, namely $sen_1 = 0.37$, $sen_2 = 0.5$, $ppv_1 = 0.48$, and $ppv_2 = 0.62$. However, since the ratio of sensitivity to PPV determines the direction of bias in the naïve analysis (Section 5.1), we also reversed the sensitivity and PPV values, that is, $sen_1 = 0.48$, $sen_2 = 0.62$, $ppv_1 = 0.37$, and $ppv_2 = 0.5$, to show the opposite scenario. Although other scenarios such as ($sen_1 > ppv_1$, $sen_2 < ppv_2$) can exist, we found that the relative size of sensitivity to PPV was mostly consistent across different data sources (see Lebel et al. (2017) for a systematic review). The source credibilities reported for different food store types in different databases also suggest that undercount error, i.e., sensitivity less than PPV, is the most likely case (Powell et al. (2011); Liese et al. (2013)). Hence, for practical relevance, we only present ($sen_1 < ppv_1$, $sen_2 < ppv_2$) or ($sen_1 > ppv_1$, $sen_2 > ppv_2$). An additional simulation study with ($sen_1 > ppv_1$, $sen_2 < ppv_2$) is available in Web Table 3 of the Supplementary Material.

To investigate the extent to which the bias varies with varying quality of the source, in addition to the ‘realistic’ scenario, we also defined ‘extremely optimistic,’ ‘optimistic,’ ‘pessimistic,’ and ‘extremely pessimistic’ scenarios for data sources by multiplying the realistic sensitivity and PPV values by 1.5, 1.25, 0.75, and 0.5, respectively. For example, for the extremely pessimistic scenario, sen1=0.185, sen2=0.25, ppv1=0.24, and ppv2=0.31. This range of source qualities covers the values of the agreement statistics reported by field studies of various other business types, aside from the convenience and grocery stores considered here.

Data generation.

For the single-component model, all $X_i$ values were generated from Poisson(5.71). For the mixture models, we first generated cluster assignments $\{Z_i\}_{i=1}^n$ to one of the four mixture components from a multinomial distribution with the given mixing probabilities (0.23, 0.42, 0.33, 0.02). We then generated $X_i \mid Z_i = k$ from Poisson($\lambda_k$) using the set of mean parameters defined for each mixture distribution in Table 2. For each distribution of $X$, we generated 100 datasets with a sample size of $n = 1{,}000$. For each subject in each dataset, we simulated the health outcome $Y_i$ from $N(1.5 + 0.5X_i, 1.5)$. To generate the observed exposure value $W_{si}$, we first obtained $N_{11s,i}$ from its conditional distribution $N_{11s,i} \mid X_i \sim \mathrm{Binom}(X_i, sen_s)$ to prevent drawing a value larger than $X_i$. For $N_{21s,i}$, we drew a value from $\mathrm{Pois}((sen_s/ppv_s - sen_s)\lambda_k)$ according to Equation (1). Then, $W_{si}$ was obtained by adding $N_{11s,i}$ and $N_{21s,i}$.
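The data-generating steps above can be sketched compactly. This stdlib-only Python sketch uses ‘Mixture 3’ and one realistic source-quality pair as illustrative settings, and it reads the second argument of $N(\cdot, 1.5)$ as the standard deviation, matching the $\epsilon_i \sim N(0, \sigma)$ convention in Section 3.1; the seed and sampler implementations are our own choices.

```python
import math
import random

rng = random.Random(2024)  # illustrative seed

def rpois(mu):
    # Knuth's product-of-uniforms Poisson sampler; adequate for small means.
    L, k, p = math.exp(-mu), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def rbinom(n, p):
    return sum(rng.random() < p for _ in range(n))

probs, lams = [0.23, 0.42, 0.33, 0.02], [1, 4, 10, 25]  # 'Mixture 3' (Table 2)
sen, ppv = 0.5, 0.62                                    # realistic source quality

def simulate_subject():
    k = rng.choices(range(4), weights=probs)[0]  # cluster assignment Z_i
    x = rpois(lams[k])                           # latent exposure X_i
    y = rng.gauss(1.5 + 0.5 * x, 1.5)            # outcome Y_i (1.5 read as the s.d.)
    n11 = rbinom(x, sen)                         # true positives, cannot exceed X_i
    n21 = rpois((sen / ppv - sen) * lams[k])     # false positives, per Equation (1)
    return x, y, n11 + n21                       # (X_i, Y_i, W_i = N11 + N21)

data = [simulate_subject() for _ in range(1000)]
```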

5.3. Simulation results for naïve bias.

Figure 2 shows how the percent bias of the naïve slope coefficient differs by the three factors. Given the source quality, the naïve slope estimates have the largest attenuation when $X$ has a simple Poisson distribution (‘Mixture 1’). When the sensitivity is smaller than the PPV in both sources, Figure 2(a) shows that the bias of the OLS estimators switches from attenuation to inflation as $\mathrm{Var}(X)$ exceeds $\{ppv_s(1 - ppv_s)/(ppv_s - sen_s) + 1\}E(X)$, as described in Section 5.1. Meanwhile, as displayed in Figure 2(b), all naïve estimates are attenuated toward zero when the ratio of sensitivity to PPV exceeds 1 in both sources.

Fig 2: Percent bias of slope ($\beta_x$) estimates from naïve regression models (“OLS, source 1”, “OLS, source 2”, “OLS, average”) under different $X$ distributions, ratios of sensitivity to PPV, and magnitudes of the agreement statistics. Values for realistic (labeled R in the legend) sensitivity and PPV are (a) $sen_1 = 0.37$, $sen_2 = 0.5$, $ppv_1 = 0.48$, $ppv_2 = 0.62$, and (b) $sen_1 = 0.48$, $sen_2 = 0.62$, $ppv_1 = 0.37$, $ppv_2 = 0.5$.

Regarding source quality, the mean $\hat{\beta}_{w_s}$ is larger for more optimistic sources. It may seem counter-intuitive that we still observed biased naïve estimates with high-quality sources. However, this result is explained by Equation (4), which shows that a higher PPV results in a higher point estimate when the ratio $sen_s/ppv_s$ remains the same. More specifically, the reliability ratio in Equation (4) is a concave function of $sen_s$ when $\mathrm{Var}(X)/E(X)$ and $ppv_s$ are held constant (Web Figure 1 of the Supplementary Material). This concavity implies that $\hat{\beta}_{w_s}$ decreases as the sensitivity increases, given the first two moments of $X$ and the value of the PPV. This counter-intuitive result can be explained by the probabilities that generate the contingency table in Table 1: increasing $sen_s/ppv_s$ implies a higher probability of adding false stores to the database (false positives) than of omitting true stores (false negatives), which leads to an attenuated $\hat{\beta}_{w_s}$. The ratio that yields the minimum bias is achieved when $sen_s = ppv_s(ppv_s + \mathrm{Var}(X)/E(X) - 2)/(\mathrm{Var}(X)/E(X) - 1)$. Therefore, even when both the sensitivity and PPV are high for source $s$, $\hat{\beta}_{w_s}$ can be biased, depending on the values of $sen_s/ppv_s$ and $\mathrm{Var}(X)/E(X)$. In our case, we simulated data while holding $sen_s/ppv_s$ at 0.77 ($s = 1$) and 0.81 ($s = 2$), shown in Figure 2(a), and at 1.30 ($s = 1$) and 1.24 ($s = 2$), shown in Figure 2(b).

5.4. Evaluation of the proposed model.

We evaluated our model when the true exposures were simulated according to the five mixture distributions in Section 5.2 and realistic values of the sensitivities and PPVs. Our simulations considered three aspects of the model. First, given our model’s use of external values for sensitivities and PPVs, we evaluated the consequences of misspecifying these values. In practice, misspecification occurs when the working credibility parameters, denoted as ϕ_1 and ϕ_2, differ from the true but unknown agreement statistics. Given that the observed exposures were generated with realistic true source credibilities, the model was evaluated under five different sets of working parameters {ϕ_1, ϕ_2}, defined by multiplying the realistic sensitivity and PPV by (0.5, 0.75, 1, 1.25, 1.5) to generate extremely pessimistic, pessimistic, realistic (correctly specified), optimistic, and extremely optimistic working models, respectively. Second, given that the model relies on the estimation of the parameters of the latent exposure distribution, we compared the performance of the model when all the parameters are estimated with its performance when the parameters of the distribution of X are known. This comparison showed the effect of estimating the λ_i’s on the estimation of regression parameters. Third, we showed the benefits of using DP priors for the λ_i by comparing our model results with those of a model that uses a common Gamma(1,1) prior for all λ_i’s. Note that our main goal in using the DP prior is nonparametric density estimation of X, not inference on the number of components or cluster memberships. It is well known that the DPM performs well for density estimation under general conditions (Ghosal (2010)). Moreover, using a DP prior generates a mixture of Poisson distributions for the underlying X. A mixture of Poissons is a robust choice for modeling overdispersed count data (Hougaard, Lee and Whitmore (1997); Joe and Zhu (2005)) and is consistent with our observed built-environment data. For each simulated dataset, we estimated ᾱ as the concentration parameter, as described in Section 4. For brevity, we present the results for Mixtures 1, 3, and 5 only.
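The simulation design above can be sketched in a few lines (illustrative values only, not the authors' R code): the five working credibility sets are the realistic sensitivity/PPV scaled by a multiplier, and a Poisson mixture's closed-form moments show the overdispersion the DP prior accommodates.

```python
REALISTIC = {"sen": (0.37, 0.50), "ppv": (0.48, 0.62)}
MULTIPLIERS = {"XP": 0.5, "P": 0.75, "R": 1.0, "O": 1.25, "XO": 1.5}

def working_sets(realistic, multipliers):
    """Scale realistic sensitivity/PPV to build each working model."""
    return {
        label: {k: tuple(m * v for v in vals) for k, vals in realistic.items()}
        for label, m in multipliers.items()
    }

def poisson_mixture_moments(weights, means):
    """Mean and variance of a finite mixture of Poisson distributions."""
    mean = sum(w * lam for w, lam in zip(weights, means))
    second = sum(w * (lam + lam ** 2) for w, lam in zip(weights, means))
    return mean, second - mean ** 2

# a hypothetical two-component mixture: Var/mean > 1, i.e., overdispersed
m, v = poisson_mixture_moments([0.5, 0.5], [2.0, 10.0])
print(m, v)  # 6.0 22.0
```

A single Poisson has Var/mean = 1; any nondegenerate mixture pushes the ratio above 1, which is the behavior the DPM exploits.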

5.4.1. Sensitivity < PPV in both sources.

We first demonstrate the estimation properties of our model when complete information about the true exposure distribution, namely cluster assignments and cluster means, is provided. As can be observed from Table 3 (middle), having knowledge of the underlying distribution of latent X is particularly helpful for overdispersed X (columns labeled Mixtures 3 and 5), as the model shows improved estimates of βx compared with naïve approaches, even with extreme misspecifications of credibility parameters in either direction.

Table 3.

(sen_1 < ppv_1, sen_2 < ppv_2) Posterior summaries of the proposed model for β_x when the true source credibility is sen_1 = 0.37, sen_2 = 0.5, ppv_1 = 0.48, and ppv_2 = 0.62. True β_x = 0.5. The proposed model was evaluated both with and without estimating the X distribution. The results shown are the averages of the point estimates, posterior standard deviations (SD), and coverage probabilities (CP) from 100 replications.

Method                        | Mixture 1                   | Mixture 3                  | Mixture 5
                              | Est. (%Bias)      SD    CP  | Est. (%Bias)     SD    CP  | Est. (%Bias)     SD    CP
Naïve estimates (average naïve estimate)
OLS, source 1                 | 0.240 (−52.05%)  0.028   0% | 0.540 (7.96%)   0.016   8% | 0.580 (16.06%)  0.012  16%
OLS, source 2                 | 0.313 (−37.40%)  0.026   0% | 0.539 (7.86%)   0.014   8% | 0.570 (14.04%)  0.011  14%
OLS, average                  | 0.449 (−10.15%)  0.033  72% | 0.602 (20.31%)  0.015  20% | 0.614 (22.77%)  0.011  23%
Proposed model, X distribution known (average posterior mean, by working credibility)
XP                            | 0.706 (41.18%)   0.042   0% | 0.516 (3.18%)   0.014  76% | 0.523 (4.68%)   0.018  82%
P                             | 0.607 (21.31%)   0.044  28% | 0.510 (2.06%)   0.013  88% | 0.513 (2.65%)   0.016  95%
R (correctly specified)       | 0.506 (1.12%)    0.037  91% | 0.503 (0.68%)   0.013  97% | 0.502 (0.36%)   0.014  98%
O                             | 0.442 (−11.59%)  0.032  58% | 0.496 (−0.82%)  0.012  93% | 0.493 (−1.38%)  0.013  91%
XO                            | 0.391 (−21.76%)  0.029   7% | 0.486 (−2.86%)  0.012  76% | 0.486 (−2.79%)  0.011  76%
Proposed model, X distribution unknown (average posterior mean, by working credibility)
XP                            | 0.709 (41.75%)   0.040   0% | 0.651 (30.22%)  0.021   0% | 0.632 (26.50%)  0.030   0%
P                             | 0.605 (20.99%)   0.043  28% | 0.576 (15.27%)  0.017   0% | 0.558 (11.50%)  0.022  14%
R (correctly specified)       | 0.504 (0.90%)    0.037  91% | 0.532 (6.49%)   0.014  36% | 0.521 (4.12%)   0.017  77%
O                             | 0.441 (−11.71%)  0.032  56% | 0.509 (1.76%)   0.013  91% | 0.502 (0.41%)   0.014  97%
XO                            | 0.401 (−19.81%)  0.028  16% | 0.490 (−1.93%)  0.012  84% | 0.491 (−1.87%)  0.012  81%

However, we typically do not have such information regarding the underlying distribution of X. We therefore estimated the X density using the DPM sampler, as described in Section 4. Table 3 (bottom) presents the posterior mean for the slope parameter β_x when the DPM sampler estimates the distribution of X (results for β_0 are available in Web Table 1 of the Supplementary Material). For Mixture 1 (Poisson X), the model performance is similar to that when the X density is known, since estimating the density of a single-cluster Poisson distribution is easier than estimating complex distributions such as Mixture 3 or Mixture 5. For these mixture distributions, the results in Table 3 show that the optimistic working model has the smallest bias and highest coverage rate. This is likely due to bias arising from the X density estimation via the DPM sampler, since the correctly specified model outperforms the misspecified models when X is Poisson distributed. Nevertheless, the DPM model has a strong advantage over a common Gamma prior for the λ_i’s by allowing a flexible X distribution (Web Table 2): using a common prior incorrectly assumes equidispersion and leads to estimates more biased than those of the naïve models.

When the working credibility parameters are much worse than the true values, the proposed model corrects less of the bias and sometimes yields estimates with larger biases than OLS. That is, when data sources are assumed to be much worse than they truly are, the proposed model is not particularly useful. In summary, when the working credibilities were at least at the level of the true sensitivity and PPV values, our model corrected the naïve bias (Table 3). With correctly specified working sensitivity and PPV, the standard error sometimes increased upon bias correction (Mixtures 1 and 5), which is not uncommon in the measurement error literature; however, the increase in variability was small relative to the reduction in bias.
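The bias/variance trade-off noted above can be quantified with a back-of-the-envelope calculation, combining the Mixture 1 numbers reported in Table 3 into an approximate mean squared error (bias² + SD²):

```python
def mse_proxy(estimate: float, truth: float, sd: float) -> float:
    """Approximate MSE of an estimator from its average estimate and SD."""
    return (estimate - truth) ** 2 + sd ** 2

TRUE_BETA = 0.5
naive = mse_proxy(0.240, TRUE_BETA, 0.028)      # OLS, source 1
corrected = mse_proxy(0.504, TRUE_BETA, 0.037)  # proposed model, R, X unknown
print(naive / corrected)  # the larger SD is dwarfed by the bias reduction
```

Despite the slightly larger posterior SD, the corrected estimator's approximate MSE is roughly fifty times smaller here.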

5.4.2. Sensitivity > PPV in both sources.

As shown in Section 5.3, the naïve model can already attain nearly unbiased estimation when the sensitivity is less than the PPV, depending on the distribution of the latent X. This section evaluates our model when sensitivity is larger than PPV, that is, when the naïve OLS estimator is attenuated towards zero regardless of the X distribution. The results in Section 5.4.1 imply that the estimates from the pessimistic and optimistic working models fall between those of the realistic and the two extreme cases. Thus, we only present results for the extremely pessimistic (XP), correctly specified (R), and extremely optimistic (XO) working models.

Table 4 shows that all naïve OLS estimators for the slope parameter are attenuated, as shown in Section 5.1. With correctly specified working credibilities, our model substantially reduces the bias and has a high coverage rate. As in Table 3, a simple Poisson X is more sensitive to the choice of working credibilities than an overdispersed X. Because overdispersion is more common in real data, our model will produce less biased estimates provided the working credibilities are reasonably close to the true source credibilities.

Table 4.

(sen_1 > ppv_1, sen_2 > ppv_2) Posterior summaries of the proposed model for β_x when the true source credibility is sen_1 = 0.48, sen_2 = 0.62, ppv_1 = 0.37, and ppv_2 = 0.5. True β_x = 0.5.

Method                        | Mixture 1                   | Mixture 3                  | Mixture 5
                              | Est. (%Bias)      SD    CP  | Est. (%Bias)     SD    CP  | Est. (%Bias)     SD    CP
Naïve estimates (average naïve estimate)
OLS, source 1                 | 0.186 (−62.87%)  0.021   0% | 0.350 (−30.10%) 0.009   0% | 0.366 (−26.84%) 0.007   0%
OLS, source 2                 | 0.252 (−49.67%)  0.021   0% | 0.375 (−25.07%) 0.009   0% | 0.388 (−22.48%) 0.007   0%
OLS, average                  | 0.353 (−29.32%)  0.026   0% | 0.389 (−22.24%) 0.009   0% | 0.393 (−21.42%) 0.007   0%
Proposed model, X distribution unknown (average posterior mean, by working credibility)
XP                            | 0.695 (38.98%)   0.041   2% | 0.600 (20.09%)  0.018   0% | 0.600 (20.02%)  0.026   0%
R (correctly specified)       | 0.504 (0.83%)    0.037  95% | 0.509 (1.76%)   0.013  91% | 0.510 (1.93%)   0.015  96%
XO                            | 0.393 (−21.30%)  0.029   7% | 0.481 (−3.74%)  0.012  66% | 0.487 (−2.60%)  0.011  72%

6. Application to children’s BMI study in California.

The influence of local convenience stores and grocery stores on obesity risk among young children has been examined in many prior studies using observed exposure estimates (W_s), typically from a single data source (Grafova (2008); Howard, Fitzpatrick and Fulfrost (2011)). In this section, we use our model to estimate the association between these two markers of the food environment near schools and childhood obesity among children in grades 5 and 7 attending urban schools in California (see also Section 2), integrating information from both data sources and correcting for ascertainment error. We conducted analyses stratified by sex and grade, given that boys and girls mature at different ages and have different obesity prevalence rates (Flegal et al. (2009)). Among urban public schools in California, 1,773 and 1,758 schools provided BMI z-scores for male and female 5th graders, respectively. For 7th graders, 623 and 622 schools reported BMI z-scores of male and female students, respectively. Separate analyses were conducted for exposure to convenience and grocery stores.

Because the responses are school means of BMI z-scores and the number of students differs by school, our outcome model requires weighting to adjust for heterogeneous error variance. In each stratum, a weighted linear model was used to regress the school-level mean BMI z-scores on the number of food stores of interest. For easier interpretation, we divided the exposure variable by 10, so that the regression coefficients are interpreted per ten stores. We also adjusted for school-level demographic characteristics, including the percentages of students in each age group, the percentages of children meeting different fitness standards, and the percentages of children in each ethnic group.
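A minimal sketch of such a weighted fit follows (hypothetical school data and pure Python rather than the authors' R code); the slope is per ten stores because the store count is divided by 10:

```python
def wls_slope(x, y, w):
    """Weighted least squares slope and intercept for a single covariate."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return slope, ybar - slope * xbar

# hypothetical schools: store counts, mean BMI z-scores, enrollments as weights
stores = [3, 8, 12, 20, 30]
x = [s / 10 for s in stores]            # per-ten-store units
y = [0.10, 0.15, 0.19, 0.27, 0.37]     # lies exactly on 0.07 + 0.1 * x
weights = [480, 350, 610, 220, 500]
print(wls_slope(x, y, weights))  # slope ~ 0.1 per ten stores, intercept ~ 0.07
```

Weighting by enrollment downweights schools whose mean z-score is averaged over few students and is therefore noisier.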

Based on the exact match agreement statistics by Powell et al., both NETS and InfoUSA have sensitivities less than PPVs for convenience stores (see Section 2 for the credibilities of both sources). Hence, the naive estimators of the effect of convenience stores on childhood BMI could be attenuated, inflated, or even approximately unbiased. For supermarket/grocery stores, both databases have a sensitivity > PPV (Section 2). Therefore, naive estimators of the effect of supermarket/grocery stores on BMI should be attenuated (here, the term ‘grocery stores’ broadly includes both supermarkets and grocery stores).

Following the SIC and NAICS codes used to compute the reference statistics (Powell et al. (2011)), 5th-grade schools had on average 4.56 (SD 3.55) convenience stores within 1 mile of the school according to NETS in 2008, whereas according to InfoUSA the mean convenience store exposure for these schools was 7.36 (4.12). Similarly, 7th-grade schools had a mean exposure of 4.47 (3.38) according to NETS and 7.18 (3.93) according to InfoUSA. For supermarket/grocery stores, the mean exposure for 5th-grade schools was 19.98 (22.42) by NETS and 14.22 (16.86) by InfoUSA; 7th-grade schools had a mean exposure of 18.71 (18.84) by NETS and 13.15 (13.96) by InfoUSA.
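The reported summaries imply strong overdispersion (Var/mean well above 1), consistent with modeling the latent exposure as a mixture of Poissons; this can be verified directly from the grade-5 means and SDs quoted above:

```python
def dispersion_index(mean: float, sd: float) -> float:
    """Variance-to-mean ratio; equals 1 for a Poisson distribution."""
    return sd ** 2 / mean

# (mean, SD) pairs reported for grade-5 schools
summaries = {
    "convenience, NETS": (4.56, 3.55),
    "convenience, InfoUSA": (7.36, 4.12),
    "grocery, NETS": (19.98, 22.42),
    "grocery, InfoUSA": (14.22, 16.86),
}
for name, (m, sd) in summaries.items():
    print(name, round(dispersion_index(m, sd), 1))  # all well above 1
```

The grocery-store counts are especially overdispersed (Var/mean above 20), which a single Poisson or a common Gamma prior could not capture.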

Figure 3 shows the corrected results from our model and the naïve estimates for the effect of ten additional food stores on the school’s mean BMI z-score. In addition to the reference values for the working credibilities, we present results from our model using 1.25 × the reference values (the optimistic misspecification in Table 3), motivated by the simulation results. As expected from the simulations in Tables 3 and 4, the correction under the optimistic working model yielded lower point estimates. Graphical posterior predictive checks showed that the replicated W’s from both working models approximated the observed exposures, suggesting that our models fit the data well (Web Figure 2 of the Supplementary Material).

Fig 3:


Comparison of naïve OLS estimators for β_x and posterior means of β_x from the proposed model under different misspecifications. For the naïve approaches, “OLS, InfoUSA” and “OLS, NETS” denote OLS estimators for simple linear regressions that use w_InfoUSA and w_NETS, respectively; “OLS, average” denotes the OLS estimator for a regression using the average of w_InfoUSA and w_NETS. For the proposed model, reference values (denoted “Proposed”) and reference values × 1.25 (denoted “Proposed×1.25”) were used for the working credibility. Plotted are the naïve estimates with 95% confidence intervals (black unfilled dots) and the posterior means with 95% credible intervals from our proposed model with reference values (red filled dots) and with 1.25 × reference values (black filled dots).

The results from our approach were qualitatively similar to those of the naïve analyses. Strong positive associations were found between exposure to convenience stores and school-level obesity risk, except among female 7th graders. For the effect of grocery stores, we expected the naïve estimates to be attenuated in all strata given the ratio of source credibilities. The corrected estimates from our model using reference values for the working credibilities indicate that naïve use of secondary databases may fail to capture meaningful associations, as seen for 7th-grade male students. For both working models, the proposed method estimated a stronger association between grocery store availability and obesity among 5th graders.

7. Discussion.

In this study, we proposed a data integration method for cases in which multiple exposure sources (e.g., business lists) yield different measures of an otherwise unobservable true exposure count. We also demonstrated, analytically and empirically, the extent of bias in linear regression parameter estimates when a mismeasured exposure predictor is used. The bias in naïve estimates was linked to the unknown true exposure density in addition to the source quality measures. One advantage of our approach is that it yields corrected estimates of the health effects of exposure from multiple erroneous sources, without requiring gold-standard data.

Although our proposed method showed substantial bias correction, it has several shortcomings. First, the choice of working sensitivity and positive predictive value is critical. While the simulation study showed that a modest misspecification of source credibility still produces corrected estimates, we suggest finding validity measures for the same businesses in similar geographical and time contexts. As many validation studies are available, obtaining a good reference for source credibility should not be overly difficult. Our data example used source quality parameters from a single validation study that matched the type of location (urban) and the time period. If multiple validation studies are available for the study location and time period, it would be useful to extend our approach to incorporate source quality parameters from all of them. Furthermore, since confidence intervals for the sensitivity and PPV are also reported in validation studies, we recommend that researchers use the upper and lower bounds of these intervals to conduct sensitivity analyses in practice. Second, our method uses a fixed concentration parameter obtained from the observed exposures in the DPM model. Although we used a plug-in value for the concentration parameter, this assumption can be relaxed in our Bayesian method by proposing a prior distribution for α, for example, based on the mean and variance as functions of α̂_1, α̂_2. However, in sensitivity analyses that drew α from its posterior distribution under different priors, we found a negligible gain in the correction of the regression parameters (results not shown). Finally, our model relies on conditional independence between sources, a common assumption in the measurement error literature (Richardson and Gilks (1993a,b)). This assumption helps identify our model, and it seems reasonable in our case since the data vendors are competitors. Relaxing it would require more data to model the correlations between measurements (Wang, Carroll and Liang (1996)), which we leave to future research.
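The sensitivity analysis recommended above, refitting the model at the reported confidence bounds of sensitivity and PPV, can be sketched as follows (the interval values are hypothetical, and each setting would be supplied as the working credibility for a refit):

```python
from itertools import product

def credibility_grid(sen_values, ppv_values):
    """All (sen, ppv) combinations from interval endpoints and point estimates."""
    return list(product(sen_values, ppv_values))

# hypothetical reported interval for one source: (lower, point, upper)
sen_ci = (0.33, 0.37, 0.41)
ppv_ci = (0.44, 0.48, 0.52)
grid = credibility_grid(sen_ci, ppv_ci)
print(len(grid))  # 9 working-credibility settings to refit under
```

Comparing the corrected association estimates across such a grid shows how much the conclusions depend on the assumed source quality.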

Despite these limitations, our method contributes to measurement error studies and built environment research. Unlike the existing measurement error literature, we incorporate the measurement error into the model without specifying a parametric model for the error variable. From the viewpoint of the missing data problem, we showed that we can construct an identifiable model given the source credibility, although the true exposure X is completely missing. Furthermore, our model specifies the marginal distribution of X in a data-adaptive manner. Our work is one of the few applications that uses the DPM model for latent Poisson data (Dorazio et al. (2008)). The DPM model is well-suited for nonparametric Bayesian density estimation and it makes our model more generalizable by approximating overdispersed count data using Poisson mixtures. Our model may struggle with underdispersed X distributions, but this is unlikely to occur in built-environment studies. We anticipate that our work will provide guidance for handling complicated count exposure variables that are prone to errors.

We suggest the following directions for future research. First, our outcome model can be extended with random effects or non-normal residual errors, although we used school-level responses with a normality assumption for the ecological analysis in this study. These extensions require different Gibbs samplers and more complex computations, but conceptually more general outcome models can be fitted. Second, there are situations in which it may be desirable to infer the subjects’ cluster memberships. Although not required in our current approach, which relies only on the marginal distribution of the latent X, correctly identifying clusters would require modifying the prior on λ_i. The existing literature shows that the traditional DPM has limitations for inference about clusters, and emerging approaches address this issue (Miller and Harrison (2014, 2018)). Third, our model can easily incorporate three or more lists. Simulation studies showed that adding one more source (even of medium quality) yields lower bias and higher efficiency (results not shown). As an increasing number of commercial lists provide point-referenced data on built environment features, our model can improve inference with more sources. We can also allow the source quality to vary by location. In our motivating example of urban schools, we assumed that the source credibility was constant across school locations. Although conclusions are inconsistent regarding a systematic relationship between source quality and area-level characteristics (Lebel et al. (2017)), the working source quality can be specified at the site level or modeled with site characteristics.

Supplementary Material


Funding.

This work was funded by NIH R01-HL131610 (PI: Sánchez); data collection used in the applications was partially funded by NIH R01-HL136718 (MPIs: Sanchez-Vaznaugh/Sánchez).

Footnotes

SUPPLEMENTARY MATERIAL

Web-based supplementary material

Technical derivation of Equation 4 and extensive simulation results for bias analyses in Section 5 are available in web-based supplementary material.

R code for the simulation studies in Section 5

R code to generate a single set of data and to implement our model is available.

REFERENCES

  1. Aldous DJ (1985). Exchangeability and related topics. In École d’Été de Probabilités de Saint-Flour XIII—1983 1–198. Springer.
  2. Athens JK, Duncan DT and Elbel B (2016). Proximity to fast-food outlets and supermarkets as predictors of fast-food dining frequency. Journal of the Academy of Nutrition and Dietetics 116 1266–1275.
  3. Bartolucci F, Pandolfi S and Pennoni F (2022). Discrete Latent Variable Models. Annual Review of Statistics and Its Application 9 425–452.
  4. Caspi CE and Friebur R (2016). Modified ground-truthing: an accurate and cost-effective food environment validation method for town and rural areas. International Journal of Behavioral Nutrition and Physical Activity 13 1–8.
  5. Dong XL, Berti-Equille L and Srivastava D (2009). Integrating conflicting data: the role of source dependence. Proceedings of the VLDB Endowment 2 550–561.
  6. Dorazio RM, Mukherjee B, Zhang L, Ghosh M, Jelks HL and Jordan F (2008). Modeling unobserved sources of heterogeneity in animal abundance using a Dirichlet process prior. Biometrics 64 635–644.
  7. Ferguson TS (1973). A Bayesian Analysis of Some Nonparametric Problems. The Annals of Statistics 1 209–230.
  8. Fienberg SE (1972a). The Multiple Recapture Census for Closed Populations and Incomplete 2^k Contingency Tables. Biometrika 59 591–603.
  9. Fienberg SE (1972b). The Analysis of Incomplete Multi-Way Contingency Tables. Biometrics 28 177–202.
  10. Flegal KM, Wei R, Ogden CL, Freedman DS, Johnson CL and Curtin LR (2009). Characterizing extreme values of body mass index-for-age by using the 2000 Centers for Disease Control and Prevention growth charts. The American Journal of Clinical Nutrition 90 1314–1320.
  11. Fleischhacker SE, Evenson KR, Sharkey J, Pitts SBJ and Rodriguez DA (2013). Validity of secondary retail food outlet data: a systematic review. American Journal of Preventive Medicine 45 462–473.
  12. Ghosal S (2010). The Dirichlet process, related priors and posterior asymptotics. Bayesian Nonparametrics 28 35.
  13. Grafova IB (2008). Overweight children: assessing the contribution of the built environment. Preventive Medicine 47 304–308.
  14. Hougaard P, Lee M-LT and Whitmore G (1997). Analysis of overdispersed count data by mixtures of Poisson variables and Poisson processes. Biometrics 1225–1238.
  15. Howard PH, Fitzpatrick M and Fulfrost B (2011). Proximity of food retailers to schools and rates of overweight ninth grade students: an ecological study in California. BMC Public Health 11 1–8.
  16. InfoUSA (2012). InfoUSA Business Listing Description. https://www.Infousa.Com/Product/Business-Lists/, Accessed: 2021-02-05.
  17. Joe H and Zhu R (2005). Generalized Poisson distribution: the property of mixture of Poisson and comparison with negative binomial distribution. Biometrical Journal 47 219–229.
  18. Jones KK, Zenk SN, Tarlov E, Powell LM, Matthews SA and Horoi I (2017). A step-by-step approach to improve data quality when using commercial business lists to characterize retail food environments. BMC Research Notes 10 1–12.
  19. Larson NI, Story MT and Nelson MC (2009). Neighborhood environments: disparities in access to healthy foods in the US. American Journal of Preventive Medicine 36 74–81.
  20. Lebel A, Daepp MI, Block JP, Walker R, Lalonde B, Kestens Y and Subramanian S (2017). Quantifying the foodscape: a systematic review and meta-analysis of the validity of commercially available business data. PLoS One 12 e0174417.
  21. Lee H (2012). The role of local food availability in explaining obesity risk among young school-aged children. Social Science & Medicine 74 1193–1203.
  22. Levin B (1981). A representation for multinomial cumulative distribution functions. The Annals of Statistics 9 1123–1126.
  23. Liese AD, Barnes TL, Lamichhane AP, Hibbert JD, Colabianchi N and Lawson AB (2013). Characterizing the food retail environment: impact of count, type, and geospatial error in 2 secondary data sources. Journal of Nutrition Education and Behavior 45 435–442.
  24. Lucan SC, Maroko AR, Bumol J, Torrens L, Varona M and Berke EM (2013). Business list vs ground observation for measuring a food environment: saving time or waste of time (or worse)? Journal of the Academy of Nutrition and Dietetics 113 1332–1339.
  25. Manrique-Vallier D (2016). Bayesian population size estimation using Dirichlet process mixtures. Biometrics 72 1246–1254.
  26. Miller JW and Harrison MT (2014). Inconsistency of Pitman-Yor process mixtures for the number of components. The Journal of Machine Learning Research 15 3333–3370.
  27. Miller JW and Harrison MT (2018). Mixture models with a prior on the number of components. Journal of the American Statistical Association 113 340–356.
  28. Must A and Anderson S (2006). Body mass index in children and adolescents: considerations for population-based applications. International Journal of Obesity 30 590–594.
  29. Nelder J and Lee Y (1992). Likelihood, quasi-likelihood and pseudolikelihood: some comparisons. Journal of the Royal Statistical Society: Series B (Methodological) 54 273–284.
  30. NETS (2021). Business Dynamics Research Consortium, National Establishment Time-Series (NETS) Database: Database Description. http://exceptionalgrowth.org, Accessed: 2021-02-05.
  31. California Department of Education (2019). Physical Fitness Testing (PFT). http://www.cde.ca.gov/ta/tg/pf/, Accessed: 2019-06-05.
  32. Pollock KH and Otto MC (1983). Robust estimation of population size in closed animal populations from capture-recapture experiments. Biometrics 39 1035–1049.
  33. Powell LM, Han E, Zenk SN, Khan T, Quinn CM, Gibbs KP, Pugach O, Barker DC, Resnick EA, Myllyluoma J and Chaloupka FJ (2011). Field validation of secondary commercial data sources on the retail food outlet environment in the U.S. Health & Place 17 1122–1131.
  34. Richardson S and Gilks WR (1993a). A Bayesian approach to measurement error problems in epidemiology using conditional independence models. American Journal of Epidemiology 138 430–442.
  35. Richardson S and Gilks WR (1993b). Conditional independence models for epidemiological studies with covariate measurement error. Statistics in Medicine 12 1703–1722.
  36. Ross GJ and Markwick D (2020). dirichletprocess: Build Dirichlet Process Objects for Bayesian Modelling. R package version 0.4.0.
  37. R Core Team (2021). R: A Language and Environment for Statistical Computing.
  38. Teh YW (2010). Dirichlet Process. In Encyclopedia of Machine Learning 280–287. Springer.
  39. van Smeden M, Lash TL and Groenwold RH (2020). Reflection on modern methods: five myths about measurement error in epidemiological research. International Journal of Epidemiology 49 338–347.
  40. Wang N, Carroll R and Liang K-Y (1996). Quasilikelihood estimation in measurement error models with correlated replicates. Biometrics 401–411.
  41. Zhao B, Rubinstein BIP, Gemmell J and Han J (2012). A Bayesian Approach to Discovering Truth from Conflicting Sources for Data Integration. Proceedings of the VLDB Endowment 5 550–561.
