Author manuscript; available in PMC: 2016 Oct 1.
Published in final edited form as: Am J Prev Med. 2015 Oct;49(4):624–630. doi: 10.1016/j.amepre.2015.06.021

Statistical Design Features of the Healthy Communities Study

Warren J Strauss 1, Christopher J Sroka 1, Edward A Frongillo 2, S Sonia Arteaga 3, Catherine M Loria 3, Eric S Leifer 3, Colin O Wu 3, Heather Patrick 4, Howard A Fishbein 5, Lisa V John 6
PMCID: PMC4575768  NIHMSID: NIHMS715908  PMID: 26384932

Abstract

The Healthy Communities Study is designed to assess relationships between characteristics of community programs and policies targeting childhood obesity and children’s BMI, diet, and physical activity. The study involved a complex data collection protocol implemented over a 2-year period (2013–2015) across a diverse sample of up to 125 communities, defined as public high school catchment areas. The protocol involved baseline assessment within each community that included in-person or telephone interviews regarding community programs and policies and in-home collection of BMI, nutritional, and physical activity outcomes from a sample of up to 81 children enrolled in kindergarten through eighth grade in public schools. The protocol also involved medical record reviews to establish a longitudinal trajectory of BMI for an estimated 70% of participating children. Staged sampling was used to collect less detailed measures of physical activity and nutrition across the entire sample of children, with a subset assessed using more costly, burdensome, and detailed measures. Data from the Healthy Communities Study will be analyzed using both cross-sectional and longitudinal models that account for the complex design and correct for measurement error and bias using a likelihood-based Markov chain Monte Carlo methodology. This methods paper provides insights into the complex design features of the Healthy Communities Study and may serve as an example for future large-scale studies that assess the relationship between community-based programs and policies and health outcomes of community residents.

Introduction

The U.S. spends more than any other country on health care, but ranks 38th in the world in life expectancy,1 and high rates of obesity likely contribute to this low ranking. Obese adults and children are at increased risk of chronic disease, with annual health-related costs exceeding $100 billion and economic losses costing the nation an estimated $1 trillion annually.2,3 Since the 1970s, the rate of obesity among children and adolescents has roughly tripled for those aged 2–19 years. The Healthy People 2020 report to the nation identified nutrition, physical activity, and obesity as one of the 12 leading health indicators on which the nation should focus during this decade.4 The nation has made minimal progress in reaching this 2020 goal, despite widely available data documenting the development of obesity and factors associated with its rise. The social and environmental determinants of obesity are less well studied. Numerous observational studies have demonstrated increased risk of obesity in environments with greater access to unhealthy foods, less access to healthy foods, and fewer opportunities to be physically active,6,7 all of which tend to be characteristic of low-income communities and may help explain health disparities. The need to identify the most promising approaches that communities can use to reduce the obesity epidemic is urgent.8,9

Thus, this is an opportune moment for conducting a comprehensive and systematic study of the strategies that communities across the country have initiated to prevent childhood obesity. The Healthy Communities Study (HCS) was designed to meet this research requirement, and specifically to address the following three primary aims10:

  1. Determine associations between characteristics of community programs and policies and BMI, diet, and physical activity in children.

  2. Identify community, family, and child factors that modify or mediate these associations.

  3. Examine the associations between characteristics of programs and policies and BMI, diet, and physical activity in children in communities that have a high proportion of groups experiencing health disparities (i.e., African American, Latino, or low-income residents).

To address these aims, the HCS design included these features:

  1. a large sample of communities, with power to identify associations between characteristics of program and policy intensity and measures of childhood obesity across communities;

  2. hierarchical data collection to efficiently reach children across the country and minimize the time burden placed on them;

  3. retrospective collection of program and policy information and BMI data; and

  4. standardized data collection instruments consistently applied across communities and among sampled children.

The HCS employed a hybrid approach for selecting communities to maximize variation among community programs and policies for reducing childhood obesity. The hybrid design initially included a national probability-based sample of 195 communities and 69 “certainty” communities, as described below, with up to 81 child/family participants within each community. Because of time and resource constraints, however, the HCS will realize a smaller number of communities (125) while striving to maintain the child/family participant sampling goal within each community. This sample provides sufficient power to address the scientific aims of the HCS, while yielding results applicable to a diverse sample of U.S. communities, including those with high proportions of Hispanic/Latino, African American, and low-income households.

Sampling of Communities and Study Participants

The HCS combined a stratified probability-based sample of 85 communities that ensured diversity across demographics and programs and policies with a purposeful sample of 40 “certainty” communities that were identified by an expert panel as having evidence of innovative and/or promising programs and policies related to childhood obesity.

Strata for the probability-based sample represented groups of Census Tracts organized according to unique combinations of factors such as race, ethnicity, income, region, and a pre-selection score of perceived program and policy intensity (Appendix Table 1).11–15 One or more Census Tracts were selected at random from each stratum, with probability proportional to the population of children aged 4–15 years. The public high school closest to the centroid of each selected Census Tract was identified to represent the selected community, with kindergarten through eighth grade (K–8) schools within that high school catchment area used for participant recruitment. The “certainty” communities were identified by:

  1. nominating candidate communities with likely high policy or programmatic activity from published literature, agency documents, and professional networks;

  2. scoring candidate communities, by six experts not affiliated with the HCS, based on available information; and

  3. selecting communities by a panel of experts and HCS investigators.

The selected communities represented geographic areas of different sizes (small towns, large cities, entire counties). For large geographic areas, a random Census Tract was selected within the area as a first step towards identifying the high school catchment area that would serve as the “certainty” community.

Once the high school catchment area was identified for an HCS community, school district approval was obtained and up to two elementary and two middle schools within the catchment area were recruited as sites for enrolling participants. If the HCS failed to gain district approval, or failed to recruit enough schools in the community to enroll children across grades K–8, a probability-based community replacement strategy was used to select another Census Tract from the stratum.
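The probability-proportional-to-size draw and the replacement step described above can be sketched as follows. This is a minimal illustration, not the study's sampling program: the tract identifiers, child counts, and the rule of simply excluding the failed tract before redrawing are all assumptions made for the example.

```python
import random

def select_tract(tracts, rng, exclude=()):
    """Draw one tract from a stratum with probability proportional to size.

    `tracts` maps a tract id to its count of children aged 4-15 years.
    `exclude` lists tracts already drawn (e.g., where recruitment failed).
    """
    candidates = {t: n for t, n in tracts.items() if t not in exclude and n > 0}
    if not candidates:
        raise ValueError("no eligible tracts remain in this stratum")
    ids = list(candidates)
    weights = [candidates[t] for t in ids]
    # random.choices draws with probability proportional to the weights
    return rng.choices(ids, weights=weights, k=1)[0]

# Hypothetical stratum: tract id -> number of children aged 4-15
stratum = {"tract_A": 1200, "tract_B": 300, "tract_C": 500}
rng = random.Random(2013)

first = select_tract(stratum, rng)
# Assumed replacement rule: if district approval or school recruitment
# fails, redraw from the same stratum without the failed tract.
replacement = select_tract(stratum, rng, exclude={first})
```

Under these made-up counts, `tract_A` holds 60% of the stratum's children and so would be expected to be drawn about 60% of the time.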

Eligible children/families were identified by recruiting within selected elementary and middle schools. All grade-eligible children took home an informational recruitment letter inviting them to participate, and study staff followed up by telephone with families that expressed interest.16 After making contact with a household adult, the age and gender of all children living within the household were identified and the willingness of the child and parent to participate was confirmed. Children who were institutionalized or non-ambulatory, or whose families lived in the community for <1 year, were excluded. Eligible children were recruited following a stratified random selection process that maintained maximum balance among gender, grade, and race/ethnicity for each community, selecting one child per household.

For the program and policy assessments, a sample of ten to 14 key informants was identified in each community to represent different settings/sectors, including schools, health organizations/coalitions, local government, non-profits, community organizations, and service agencies. Initial key informants were identified via web-based research and telephone screening, with the broader sample recruited through snowball sampling using referrals from participating key informants.

Data Collection

Documentation of programs and policies related to childhood obesity within communities occurred via key informant interviews using multiple modes of data collection (telephone interviews, web-based questionnaires, in-person meetings).17 This process was supplemented by document retrieval and abstraction by research staff. The HCS developed interview and data abstraction tools to facilitate development of a time series of standardized scores associated with each community. These scores rated the strength of the program or policy across different dimensions that promote healthy behaviors of children.

Data collection among child/parent pairs living in the selected communities was done via home visits made by field data collectors. Innovative statistical techniques for subsampling were applied in which less detailed and less burdensome measures were collected on all children (standard protocol), and more detailed dietary and physical activity measures were collected on a random subset of children (enhanced protocol):

  1. standard protocol: current height/weight status of child, questionnaires from parent/child on physical activity and diet, and medical record abstraction to develop longitudinal BMI trajectories from the entire sample18–20;

  2. enhanced protocol: additional assessments, including the use of accelerometers to assess physical activity, a physical activity behavior recall, and two 24-hour dietary recalls over a 1-week period from a random subset (one child per grade per community).19,20

This subsampling approach using two protocols builds on well-developed statistical methodology related to models that adjust for measurement error.21–26 Another attractive feature of the design is the ability to generate longitudinal BMI trajectories27 up to 10 years in length on a sample of children within each community (combining BMI measures from medical record abstraction18 with those from the household visit). These trajectories can be modeled as a function of the time series of standardized community scores.
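The enhanced-protocol subsample described above (one child per grade per community) can be sketched as a simple random draw within grade. The child identifiers and the assumption of exactly nine enrolled children in each of the nine grades are hypothetical; children flagged "enhanced" here would still receive the standard measures.

```python
import random
from collections import defaultdict

def assign_protocols(children, rng):
    """Return {child_id: "enhanced" | "standard"} for one community.

    `children` is a list of (child_id, grade) pairs; one child per grade
    is drawn at random for the enhanced protocol.
    """
    by_grade = defaultdict(list)
    for child_id, grade in children:
        by_grade[grade].append(child_id)
    protocol = {child_id: "standard" for child_id, _ in children}
    for grade in sorted(by_grade):
        protocol[rng.choice(by_grade[grade])] = "enhanced"
    return protocol

# Hypothetical community: 9 children in each of grades K-8 (81 in total)
grades = ["K", "1", "2", "3", "4", "5", "6", "7", "8"]
children = [(f"{g}-{i}", g) for g in grades for i in range(9)]

protocol = assign_protocols(children, random.Random(0))
n_enhanced = sum(p == "enhanced" for p in protocol.values())
share_enhanced = n_enhanced / len(children)
```

With a full sample of 81 children, one child per grade gives 9 enhanced assessments, or about 11% of the community sample.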

Planned Statistical Analysis of Data

By integrating both current and retrospective information across the ten to 14 key informants within each community, a time series of community-level scores going back 10 years will be computed to characterize the annual intensity of characteristics of a community’s programs and policies (or specific components or strategies embedded within community programs and policies). The community-specific intensity scores will be continuous, scaled from “0” (no intensity) to “1” (highest intensity), with the most recent/current information gathered from each community expected to have the highest accuracy and reliability.

Mixed-effects statistical models will be used to relate child obesity outcomes (i.e., measures of BMI, waist circumference, diet, and physical activity) to the community-level intensity score, while adjusting for:

  1. correlation among participants from within the same community, and repeated measures on children over time using random effects28,29; and

  2. measurement error and bias in both the child outcome measures and measures of the strength and other attributes of community-level programs and policies as they evolve over time, using both likelihood-based and hierarchical Bayesian methods with specialized software.25,26,30–32

Cross-sectional models will relate child outcomes to current measures of community program and policy intensity. Longitudinal models will examine combined BMI measures from both the in-home data collection and medical record abstraction as a function of the intensity score at the point in time when the BMI measurement was taken (either at the time of the visit for BMI measured in-home, or at the time of the physician visit for BMI obtained from medical records). Time-lagged models will also be explored.
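Matching each BMI measurement to the intensity score in force when it was taken, with an optional lag, amounts to a lookup into the community's annual score series. The sketch below assumes annual scores keyed by calendar year and a whole-year lag; the scores, years, and BMI values are invented for illustration.

```python
def intensity_at(scores_by_year, year, lag=0):
    """Look up a community's intensity score `lag` years before `year`.

    Returns None when no score was recorded for the lagged year.
    """
    return scores_by_year.get(year - lag)

# Hypothetical annual scores for one community (0 = none, 1 = highest)
scores = {2010: 0.20, 2011: 0.35, 2012: 0.50, 2013: 0.65}

# Hypothetical BMI records: (measurement year, source, BMI)
bmi_records = [(2011, "medical_record", 17.8), (2013, "home_visit", 18.4)]

rows = [
    {
        "year": year,
        "source": source,
        "bmi": bmi,
        "x_current": intensity_at(scores, year),      # score at measurement
        "x_lag1": intensity_at(scores, year, lag=1),  # score one year earlier
    }
    for year, source, bmi in bmi_records
]
```

Each row then carries both a concurrent and a lagged predictor, so the same data layout could feed either a concurrent or a time-lagged model.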

The strength of the HCS lies in common factors observed on programs and policies17 across the 125 communities that may influence BMI, physical activity, and nutrition outcomes. With these outcomes observed on a large sample of children, data can be combined across communities to identify the program and policy attributes that are most closely associated with child outcomes at different stages of development. Power calculations demonstrated that the study can detect 4.9%–7.5% differences in current BMI associated with current indices of community-based program and policy intensity using cross-sectional models, and BMI change differences of less than 1% by combining longitudinal measures of BMI from medical records18 with longitudinal indices of community-based program and policy intensity.17

The statistical models for the planned analyses use the notation defined in Table 1. The equations below give the general form of the cross-sectional (Equation 1) and longitudinal (Equation 2) models:

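$$Y_{ij(\mathrm{last})} = f(\mathrm{Age}_{ijk}, \beta_0) + \beta_1 \cdot X_i + \beta_2 \cdot C_i + \beta_4 \cdot C_{ij} + \delta_i + \varepsilon_{ij} \qquad (1)$$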

Table 1.

Statistical Notation to Support the Cross-Sectional and Longitudinal Models for the Healthy Communities Study

Yijk = the kth childhood obesity outcome for the jth study participant in the ith community, based on the most detailed measurement method available (e.g., BMI from in-home data collection; physical activity or nutritional outcomes derived from the enhanced data collection protocol methods).
Y*ijk = the kth childhood obesity outcome for the jth study participant in the ith community, based on a measurement method that may be subject to error and/or bias (e.g., BMI from medical records abstraction; physical activity or nutritional outcomes derived from the Stage 1 measurement methods).
Xi = a static yet continuous index variable ranging from 0 to 1 that measures the current intensity of a particular program or policy component within the ith community.
Xijk = a time-varying continuous index variable, ranging from 0 to 1 that measures the intensity of a particular program or policy component within the ith community relative to the time at which Yijk or Y*ijk was observed.
Ci = a vector of community-specific covariates and/or confounders not expected to vary over time, such as urbanicity.
Ci.k = a vector of community-specific covariates and/or confounders expected to vary over time, such as the number of fast-food retail outlets within the community.
Cij = a vector of child-specific covariates and/or confounders not expected to vary over time, such as race/ethnicity or gender.
Cijk = a vector of child-specific covariates and/or confounders expected to vary over time, such as whether the child has an injury that would prevent her from participating in physical activity.
Ageijk = the age of the child (in years) associated with the outcome measures (Yijk or Y*ijk).
Wijk = a weighting variable used to adjust for the selection of communities and/or the differences in the number of BMI measures available from medical record abstraction within the longitudinal models.

The cross-sectional model in Equation 1 describes the most current childhood obesity outcome (BMI,17 physical activity,19 or nutrition20) as a function of Age (likely non-linear and captured by a vector of β0 parameters), the community program and policy index variable Xi (captured by the β1 parameter), time-invariant community and child-specific covariates Ci and Cij (captured by the vectors of β2 and β4 parameters), a community-specific random effect δi, and an error term εij representing variation left unexplained by the model. Both δi and εij are expected to be independent and follow a normal distribution with mean zero and positive variance. Assuming that Yijk represents BMI or ln(BMI) scores, the β1 parameter in this model would capture the effect of the policy/program element represented by Xi on BMI, after controlling for other community- and child-specific covariates and confounders, as well as any within-community correlation in responses (via the δi random effect).
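As a brief worked step, the within-community correlation induced by the random effect can be written out explicitly. The variance symbols below are introduced here for illustration and are not part of the notation in Table 1: let the community random effect and the error term have variances $\sigma_\delta^2$ and $\sigma_\varepsilon^2$. Conditional on the fixed effects, two different children $j \neq j'$ in the same community $i$ then satisfy

$$\mathrm{Var}(Y_{ij}) = \sigma_\delta^2 + \sigma_\varepsilon^2, \qquad \mathrm{Cov}(Y_{ij}, Y_{ij'}) = \sigma_\delta^2, \qquad \rho = \frac{\sigma_\delta^2}{\sigma_\delta^2 + \sigma_\varepsilon^2}.$$

Here $\rho$ is the intraclass correlation: the share of residual variance attributable to community membership. Children in different communities are uncorrelated under the same assumptions.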

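$$Y_{ijk} = f(\mathrm{Age}_{ijk}, \beta_0) + \beta_1 \cdot X_i + \beta_2 \cdot C_i + \beta_3 \cdot C_{i.k} + \beta_4 \cdot C_{ij} + \beta_5 \cdot C_{ijk} + \delta_i + f(\mathrm{Age}_{ijk}, \alpha_{ij}) + \varepsilon_{ijk} \qquad (2a)$$

$$Y_{ijk} = f(\mathrm{Age}_{ijk}, \beta_0) + \beta_1 \cdot X_{ijk} + \beta_2 \cdot C_i + \beta_3 \cdot C_{i.k} + \beta_4 \cdot C_{ij} + \beta_5 \cdot C_{ijk} + \delta_i + f(\mathrm{Age}_{ijk}, \alpha_{ij}) + \varepsilon_{ijk} \qquad (2b)$$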

The longitudinal models, represented by Equations 2a and 2b, follow a similar format. In Equation 2a, the β1 parameter captures the association between BMI and a current program or policy element represented by Xi, whereas in Equation 2b, the β1 parameter captures the association between BMI and a time-varying program or policy element represented by the Xijk variable that is constructed as described above. The longitudinal models are adjusted for all four types of community- and child-specific static and time-varying covariates (Ci, Ci.k, Cij, and Cijk) using vector parameters β2 through β5; adjusted for within-community correlation using a similarly constructed random effect (δi); and also adjusted for child-specific BMI trajectories over time using a function, f(Ageijk, αij), where αij is assumed to be a vector of parameters associated with each child (perhaps including linear or quadratic terms). Weighted analyses will be pursued to adjust for differences in the number of BMI measures for each child in the longitudinal models.

Interpretation of Key Model Parameters

Regardless of whether the model is cross-sectional (Equation 1) or longitudinal (Equations 2a and 2b), the relationship between the community program and policy indices (Xi or Xijk) and the childhood obesity outcome (Yijk) is captured by β1. In the cross-sectional model, β1 represents the average difference in the obesity outcome (e.g., ln[BMI]) between communities whose program or policy characteristic is rated as having maximum (1) and minimum impact (0). Interpretation of β1 is similar for the longitudinal model (2a) where the community program or policy is not expected to change over time (Xi is time invariant). Whether the association between program and policy components and childhood obesity outcomes differs by other factors (urbanicity, age, gender, income, race/ethnicity, or others) can be assessed by adding interaction terms to the model.
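When the outcome is ln(BMI), a coefficient translates into a percent difference in BMI through exponentiation. The short sketch below shows that arithmetic; the 5% figure is purely illustrative and is not a study result, and the function names are made up for this example.

```python
import math

def percent_difference(beta1):
    """Percent difference in BMI implied by a coefficient on ln(BMI)."""
    return 100.0 * (math.exp(beta1) - 1.0)

def beta_for_percent(pct):
    """Coefficient on ln(BMI) that corresponds to a given percent difference."""
    return math.log(1.0 + pct / 100.0)

# Illustrative value only: BMI 5% lower at maximum (1) vs. minimum (0) intensity
beta1 = beta_for_percent(-5.0)
recovered = percent_difference(beta1)
```

A 5% lower BMI corresponds to a β1 of roughly -0.051 on the log scale, so for small effects the coefficient is close to the proportional difference itself.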

Methods to Correct for Bias and Error

Including joint (standard and enhanced) measures on an approximate 10% subset of study participants will allow the study to both:

  1. characterize the relationship between these measures; and

  2. make appropriate statistical adjustments for bias and error to assess relationships.

The relationship between BMI measures from the standardized in-person data collection and from the medical record review (including any bias or error in BMI obtained from medical records across the study population) can also be established, thereby allowing the study to make appropriate statistical adjustments for these measures.

The statistical models described above assume that precise measurements of child variables and community program and policy variables (Y and X) are observed on all study participants and communities. However, for most study participants and communities, only imprecise Y* and X* will be observed. To adjust the models for potential error and bias in the Y* and X* measures, a likelihood-based approach will be used that integrates information from all sources while fostering the ability to draw inferences within the study in a manner that preserves an interpretation as if Y and X were assessed for all children and communities:

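When Y and X are observed:

$$L = f(Y \mid X, C) \cdot f(Y^* \mid Y = y) \cdot f(X^* \mid X = x)$$

When Y is observed and X is missing:

$$L = \int_x f(Y \mid X, C) \cdot f(Y^* \mid Y = y) \cdot f(X^* \mid X = x) \cdot f(x) \, dx$$

When X is observed and Y is missing:

$$L = \int_y f(Y \mid X, C) \cdot f(Y^* \mid Y = y) \cdot f(X^* \mid X = x) \cdot f(y) \, dy$$

When both Y and X are missing:

$$L = \iint_{y,x} f(Y \mid X, C) \cdot f(Y^* \mid Y = y) \cdot f(X^* \mid X = x) \cdot f(y) \cdot f(x) \, dx \, dy.$$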

Here f(Y|X,C) represents the cross-sectional or longitudinal models given in the equations above, which assume precise measurement; f(Y*|Y=y) is assumed to be a simple linear regression model that expresses Y* as a function of Y (allowing for additive and/or multiplicative bias in Y* relative to Y, as well as variability through the error term); f(X*|X=x) is a similarly defined simple regression expressing X* as a function of X; and f(x) and f(y) represent the marginal distributions of X and Y, respectively. Specialized software has been developed to solve these likelihood equations using a Markov chain Monte Carlo approach implemented in C++.32
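To make the integral for a missing X concrete, the sketch below evaluates the likelihood contribution for one record in which Y, Y*, and X* are observed but X is not. It is not the study's software: it swaps the Markov chain Monte Carlo approach for plain trapezoid-rule integration, drops the covariates C, and assumes Gaussian forms for every component (including a normal marginal for X, even though the index is bounded between 0 and 1). All parameter names and values are hypothetical.

```python
import math

def normal_pdf(v, mean, sd):
    """Density of a normal distribution with the given mean and sd."""
    z = (v - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def likelihood_x_missing(y, y_star, x_star, p, n_grid=2001):
    """Likelihood contribution when Y is observed and X is missing.

    Integrates f(Y|X=x) * f(X*|X=x) * f(x) over x with the trapezoid rule;
    f(Y*|Y=y) does not depend on x, so it multiplies the integral.
    """
    lo = p["mu_x"] - 8.0 * p["tau_x"]
    hi = p["mu_x"] + 8.0 * p["tau_x"]
    step = (hi - lo) / (n_grid - 1)
    total = 0.0
    for i in range(n_grid):
        x = lo + i * step
        term = (
            normal_pdf(y, p["b0"] + p["b1"] * x, p["sd_y"])            # f(Y|X=x)
            * normal_pdf(x_star, p["a_x"] + p["g_x"] * x, p["sd_xs"])  # f(X*|X=x)
            * normal_pdf(x, p["mu_x"], p["tau_x"])                     # f(x)
        )
        total += term * (0.5 if i in (0, n_grid - 1) else 1.0)
    integral = total * step
    return integral * normal_pdf(y_star, p["a_y"] + p["g_y"] * y, p["sd_ys"])

# Hypothetical parameter values (assumptions for illustration only)
params = {
    "b0": 2.9, "b1": -0.05, "sd_y": 0.15,   # outcome model for ln(BMI)
    "a_x": 0.0, "g_x": 1.0, "sd_xs": 0.10,  # X* as a linear function of X
    "a_y": 0.0, "g_y": 1.0, "sd_ys": 0.05,  # Y* as a linear function of Y
    "mu_x": 0.5, "tau_x": 0.2,              # assumed normal marginal of X
}
y_obs, y_star_obs, x_star_obs = 3.0, 2.95, 0.55
value = likelihood_x_missing(y_obs, y_star_obs, x_star_obs, params)
```

Because every piece is Gaussian in this simplified setting, the integral has a closed form (the joint density of Y and X* is bivariate normal), which gives a convenient check on the numerical result. In the full models, with non-linear age functions and random effects, no such closed form is generally available, which is one reason simulation-based methods are attractive.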

Use of Modeling Framework to Build Multifactor Intensity Scores

Because the intensity score is on a scale from 0 to 1, the β1 coefficient has similar interpretation across the various program and policy predictors, and represents the difference in BMI outcome associated with the strongest observed intensity compared to the lowest observed intensity. In a single predictor model (one that relates the child outcome to a single program or policy predictor), β1 captures the direct effect of the program or policy variable (Xijk) on the obesity measure (Yijk) after adjusting for any additional covariates included in the model.

A large number (>50) of program and policy predictors are anticipated based on the ten to 14 key informant interviews per community and corresponding documentation review conducted across up to 125 communities. Because the analysis methods proposed are likelihood-based, the log-likelihood of single predictor variable models will be used to assess predictive power as a screening approach. This approach will be used for cross-sectional or longitudinal models, within specific subsets of the study population (e.g., age or gender specific analyses), for multiple outcomes (BMI, nutritional outcomes, or physical activity), and while adjusting for different covariates. As long as the same screening model is being fit to the same response, across the same subpopulation, while adjusting for the same covariates—where the only change is the specific program or policy variable (Xijk) being evaluated—the log-likelihood provides an objective metric to assess the predictive performance among the candidate Xijk variables. For each combination of outcome, subset of study population, and type/form of the model, the candidate program and policy predictors will be screened to identify which have the strongest association using model-based likelihood statistics. Among the set of strong predictors, simple principal component analysis will be used to identify any collinearity. Multi-predictor model building will then proceed by sequentially adding different program and policy predictors into the model (using an analysis of deviance to assess whether the addition of each variable significantly improves the model fit). The magnitude of the β coefficients (compared to their SEs) provides an objective weighting of the relative importance of each program and policy for a given outcome within a subpopulation of the HCS, as these indices will all be standardized to the same scale from 0 to 1.
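The screening step described above, in which the same model is fit once per candidate predictor and the candidates are ranked by log-likelihood, can be sketched with a deliberately simplified stand-in. The example uses an ordinary Gaussian regression with one predictor and no random effects, covariates, or measurement-error adjustment, and the data are simulated; in the study the log-likelihood would come from the full models.

```python
import math
import random

def gaussian_loglik(y, x):
    """Maximized Gaussian log-likelihood of y regressed on one predictor x."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sigma2 = rss / n  # maximum likelihood estimate of the error variance
    return -0.5 * n * (math.log(2.0 * math.pi * sigma2) + 1.0)

def screen(y, candidates):
    """Rank candidate predictors by log-likelihood, best first."""
    scores = {name: gaussian_loglik(y, x) for name, x in candidates.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

rng = random.Random(1)
n = 125  # one illustrative record per community
# Five hypothetical intensity indices, each on the 0-1 scale
candidates = {f"x{k}": [rng.random() for _ in range(n)] for k in range(5)}
# Simulated outcome: only x2 is associated with ln(BMI)
y = [3.0 - 0.3 * x + rng.gauss(0.0, 0.05) for x in candidates["x2"]]

ranking = screen(y, candidates)
best_name = ranking[0][0]
```

Because every candidate is fit to the same response with the same model form, the log-likelihoods are directly comparable, which is the property the screening approach relies on.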

Simple interactions among different program and policy predictors will be explored. If there were synergistic effects of including multiple programs and policies simultaneously that go beyond an additive effect, this interaction would be negative for a BMI response (leading to lower BMI scores). If there were diminishing returns, this interaction would be positive. Without loss of generality, the above models can also be expanded to include effect modifiers to assess whether programs/policies have differential effects on childhood obesity responses among different subpopulations.
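The sign convention for the interaction can be seen in a two-predictor version of the model. The symbols $X_a$, $X_b$, $\beta_a$, $\beta_b$, and $\beta_{ab}$ are introduced here for illustration only, with all other model terms held fixed:

$$Y = \cdots + \beta_a X_a + \beta_b X_b + \beta_{ab} X_a X_b + \cdots$$

Comparing a community with both components at maximum intensity ($X_a = X_b = 1$) to one with neither ($X_a = X_b = 0$) gives a difference of

$$\beta_a + \beta_b + \beta_{ab}.$$

If each component is associated with lower BMI ($\beta_a < 0$, $\beta_b < 0$), then $\beta_{ab} < 0$ makes the combined difference larger in magnitude than the additive sum $\beta_a + \beta_b$ (synergy), whereas $\beta_{ab} > 0$ makes it smaller in magnitude (diminishing returns).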

This modeling strategy allows HCS researchers to examine the component strategies and elements of various programs and policies aimed at reducing childhood obesity (or improving nutritional or physical activity outcomes), and assess which of these strategies/elements—alone or in concert with others—have the greatest association with obesity outcomes among subpopulations of interest within the HCS.

Acknowledgments

The Healthy Communities Study is funded with federal funds from the National Heart, Lung, and Blood Institute, in collaboration with the Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institute of Diabetes and Digestive and Kidney Diseases, National Cancer Institute, and NIH Office of Behavioral and Social Sciences Research; DHHS, under Contract No. HHSN268201000041C.

Footnotes

No financial disclosures were reported by the authors of this paper.

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
