Clustering regions with dynamic time warping to model obesity prevalence disparities in the United States

Katherine Vorpe; Sierra Hessinger; Rebekah Poth; Tatjana Miljkovic

doi:10.1080/02664763.2023.2192445

2023 Mar 28;51(4):793–807. doi: 10.1080/02664763.2023.2192445

PMCID: PMC10929681 PMID: 38482195

Abstract

Current methods for clustering adult obesity prevalence by state focus on creating a single map of obesity prevalence for a given year in the United States. Comparing these maps for different years may limit our understanding of the progression of state and regional obesity prevalence over time for the purpose of developing targeted regional health policies. In this application note, we adopt the non-parametric Dynamic Time Warping method for clustering longitudinal time series of obesity prevalence by state. This method captures the lead and lag relationship between the time series as part of the temporal alignment, allowing us to produce a single map that captures the regional and temporal clusters of obesity prevalence from 1990 to 2019 in the United States. We identify six regions of obesity prevalence in the United States and forecast future estimates of obesity prevalence based on ARIMA models.

Keywords: Body mass index, obesity, clustering, DTW, United States

Mathematics Subject Classifications 2020: 62-08, 62P10, 62M10, 62G99

1. Introduction

The obesity pandemic remains an important public health problem in the United States. As a result, the Centers for Disease Control and Prevention (CDC) monitors obesity prevalence across regions for different sociodemographic variables [2]. The CDC, a historically trusted source of health information in the United States [18], collects and monitors United States data by sponsoring the Behavioral Risk Factor Surveillance System (BRFSS). In the past decade, the study of obesity prevalence has become a widespread area of interest across many research domains. Various statistical methods have been explored to estimate obesity prevalence (see [6,9,10,19], and many others). [28] projected that the prevalence of obesity in the adult population will continue to increase in the United States, reaching a point where the prevalence will not fall below 35% in any state by 2022. To verify the outlook of the national trends in Body Mass Index (BMI) data, we processed approximately 8.5 million observations from the BRFSS survey compiled by [18] for the time period 1990–2019 and plotted the time series of prevalence for the normal, overweight, and obese population groups in Figure 1.

Figure 1. — Time series of obesity prevalence based on the BMI category: normal (18.5 ≤ BMI < 25), overweight (25 ≤ BMI < 30), and obese (BMI ≥ 30), for the time period 1990–2019 in the United States. The BMI is defined as a person's weight in pounds divided by the square of their height in inches multiplied by a conversion factor of 703.

From 1990 to 2019, the proportion of individuals with a normal BMI decreased sharply from roughly 0.5 to 0.28, while the proportion of obese individuals increased from roughly 0.16 to 0.36. Additionally, the proportion of overweight individuals appears to remain relatively stable, fluctuating between 0.31 and 0.35. While transitions between BMI categories can occur in both directions, it is clear that the general population has seen an increase in the combined prevalence of the overweight and obese groups and a decline in the prevalence of the normal-weight group over the past three decades.

We recognize that these national trends in BMI are driven by significant variation in socioeconomic factors across states and geographic regions. Thus we focus on studying state-level obesity prevalence. The aim of this study is to find the most efficient way of clustering the longitudinal time series of obesity prevalence, to create a single map of obesity prevalence by region for the time period 1990–2019, and to develop a predictive model for forecasting obesity prevalence for each region.

1.1. Data challenges

This study uses the data obtained from the BRFSS survey compiled by [18] for the 50 states of the United States and the District of Columbia. The BRFSS data were collected using stratified and weighted survey designs and have played an important part in many research studies [16,17,30]. We utilized annual state data samples from 1990 through 2019; consequently, we processed approximately 8.5 million total observations containing individuals' height and weight. The data were cleaned, merged, aggregated, and filtered.

The CDC uses a “post stratification statistical method” to weight BRFSS survey data. This method consists of Design Weight and Raking Adjustment components [29] to ensure that the respondent data adhere to known demographic distributions. Refused and missing data values defined in the annual BRFSS Survey Data and Documentation Overview documents [1] were excluded (less than 6%) to ensure that the data were reasonable and meaningful. All the data processing and aggregation was performed using SAS software, version 9.4 [26].

The BMI values were computed based on the height and weight variables provided in the BRFSS data set. The prevalence rates were calculated as the proportion of the total count of individuals in the obese category relative to the sample size by state and year. The aggregated data consist of fifty-one time series of annual obesity prevalence reflecting the time period 1990–2019. When plotting these time series in Figure 2, we observe a dense structure of the image where the general increasing trend can be emphasized by a bold LOESS smoother. It is challenging to visually observe any differences in the alignments between the individual trajectories.

Figure 2. — Obesity prevalence by state over time, overlayed with the bold LOESS smoother. Fifty states as well as District of Columbia are included.

A single time series of obesity prevalence exhibits considerable variability related to the granularity of the available data. Thus current predictions of obesity prevalence at the state level may suffer from a high level of uncertainty. Finding patterns in this dense group of longitudinal time series is a challenging data problem that requires exploring innovative clustering approaches. However, these patterns, when adequately extracted from the data, can help with data aggregation: pooling the potentially volatile state-level data and consolidating states that have similar obesity trends allows us to use more comprehensive data to fit more reliable forecasting models. We proceed by discussing how to solve this challenge by implementing the non-parametric Dynamic Time Warping method for clustering longitudinal time series of obesity prevalence by state.

2. Methodology

The goal of clustering longitudinal time series is to partition the data into non-overlapping groups, or clusters, that contain similar time series in each cluster based on predefined characteristics. Different similarity measures are considered in the literature to assess similarity within and between clusters. The most popular similarity measure between time series is Euclidean distance. However, a major disadvantage of this measure is its “sensitivity to distortions and shifting along the time axis” [14]. To alleviate this issue, we investigate Dynamic Time Warping (DTW). This non-parametric method is based on the Levenshtein distance [15] and assesses the similarity of two time series by finding the temporal alignment between them that minimizes their cumulative distance.

2.1. Alignment between time series using dynamic time warping

Suppose that we sampled $n_{ij}$ people (age 18 and older) in state i for $i \in \{1, \dots, L\}$ at year j for $j \in \{1, \dots, T\}$. If an individual's BMI is at least 30, then the individual is classified as obese. Thus the indicator function $1_{[.]}$ for the “obesity” event is defined as

$$1_{[\mathrm{BMI} \geq 30]} = \begin{cases} 1, & \text{if } \mathrm{BMI} \geq 30 \\ 0, & \text{if } \mathrm{BMI} < 30. \end{cases} \tag{1}$$

For each state i and year j, we divide the number of obese people by the total number of people in the sample to obtain the obesity prevalence. The obesity prevalence is defined as

$$p_{i, j} = \frac{\sum_{k = 1}^{n_{ij}} 1_{[\mathrm{BMI}_{k} \geq 30]}}{n_{ij}}, \quad \text{for } i = 1, \dots, L;\ j = 1, \dots, T. \tag{2}$$
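As a small illustration of the indicator and prevalence definitions above, the following Python sketch computes the unweighted share of obese respondents in one state-year cell. The function names and the five respondents are hypothetical; the BMI formula uses the pounds-and-inches conversion factor of 703 given in the Figure 1 caption, and survey weights are deliberately ignored here.

```python
def bmi_from_imperial(weight_lb, height_in):
    """BMI from pounds and inches, using the conversion factor 703."""
    return 703.0 * weight_lb / height_in ** 2


def obesity_prevalence(respondents):
    """Share of respondents with BMI >= 30 in one state-year cell."""
    flags = [1 if bmi_from_imperial(w, h) >= 30 else 0 for w, h in respondents]
    return sum(flags) / len(flags)


# Hypothetical (weight in lb, height in in) pairs, not BRFSS records.
sample = [(150, 68), (220, 70), (180, 64), (130, 62), (250, 72)]
prevalence = obesity_prevalence(sample)  # three of the five have BMI >= 30
```

Repeating this calculation for every state and year would give the L sequences of length T that are clustered in the remainder of this section.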

By structuring data in this way, we obtain L sequences where each sequence has length T. Consider a subset of two sequences $p_{1} = \{p_{1, 1}, p_{1, 2}, \dots, p_{1, T}\}$ and $p_{2} = \{p_{2, 1}, p_{2, 2}, \dots, p_{2, T}\}$ and compute the elements of the cumulative distance matrix $D_{T \times T}$ as follows:

$$D (1, 1) = d_{1, 1} \tag{3}$$

$$D (1, j) = d_{1, j} + D (1, j - 1) \tag{4}$$

$$D (i, 1) = d_{i, 1} + D (i - 1, 1) \tag{5}$$

$$D (i, j) = d_{i, j} + \min \{ D (i - 1, j - 1), D (i - 1, j), D (i, j - 1) \}, \tag{6}$$

where $d_{i, j} = \sqrt{(p_{1, i} - p_{2, j})^{2}}$ is the Euclidean distance between two data points $p_{1, i}$ and $p_{2, j}$ for $i = 1, \dots, T$ and $j = 1, \dots, T$. DTW aims to find an alignment between $p_{1}$ and $p_{2}$ that creates an optimal warping path of length G, $W = (w_{1}, w_{2}, \dots, w_{G})$, with elements $w_{g}$ for $g = 1, \dots, G$. The starting point $w_{1} = D (1, 1)$ and ending point $w_{G} = D (T, T)$ of the warping path represent the opposite corners (the bottom left and the top right) of the matrix $D$. This cumulative matrix is also referred to as the cost matrix since each element $D (i, j)$ in the matrix represents the cost to align point i of the first time series, $p_{1, i}$, with point j of the second time series, $p_{2, j}$. Several warping paths may be computed through the matrix $D$; however, we are interested in finding the optimal path that minimizes the total cost of aligning. This optimal path is defined as follows:

$$\mathrm{DTW} (p_{1}, p_{2}) = \underset{W = (w_{1}, \dots, w_{G})}{\operatorname{argmin}} \sqrt{\sum_{g = 1,\, w_{g} = (i, j)}^{G} (p_{1, i} - p_{2, j})^{2}} \tag{7}$$

The dynamic programming algorithm for the computation of the optimal warping path is available in several statistical software packages. [21] proposed constraints on this algorithm to speed up the computations and enhance accuracy. One of these constraints prevents the warping path from drifting too far from the diagonal elements of the matrix $D$ by restricting it to a window of size r called the Sakoe–Chiba band [24]. This constraint translates to $| i - j | \leq r$ for every $w_{g} = (i, j)$.
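The recursion in equations (3)–(6) and the band constraint can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function name `dtw_distance`, its `window` argument, and the two toy series are hypothetical, and the sketch accumulates the absolute differences $d_{i,j}$ exactly as in (3)–(6) rather than the squared form that appears under the root in (7).

```python
import math


def dtw_distance(x, y, window=None):
    """Cumulative DTW cost D(T, T) from the recursion in equations (3)-(6).

    `window` is an optional Sakoe-Chiba half-width r: cells with
    |i - j| > r are never filled and therefore cannot lie on the path.
    """
    n, m = len(x), len(y)
    r = max(n, m) if window is None else max(window, abs(n - m))
    # One extra row and column of infinities lets the boundary cases
    # (3)-(5) fall out of the general recursion (6).
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - r), min(m, i + r) + 1):
            d = abs(x[i - 1] - y[j - 1])  # sqrt((x_i - y_j)^2)
            D[i][j] = d + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return D[n][m]


a = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0]
b = [0.0, 0.0, 1.0, 2.0, 3.0, 2.0]  # same shape as `a`, delayed by one step
unconstrained = dtw_distance(a, b)
banded = dtw_distance(a, b, window=1)
pointwise = sum(abs(u - v) for u, v in zip(a, b))  # no warping allowed
```

For these toy series the warped cost is 1 with or without a band of width 1, whereas the point-by-point sum of absolute differences is 5, which shows how warping absorbs a one-step delay.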

2.2. Motivating example

To illustrate the features of DTW and how it compares two time series sequences, consider our motivating example presented in Figure 3. The dotted lines connecting the respective series link the two points that are the closest in correspondence based on their DTW temporal alignment to evaluate the similarity between the two time series. While in some cases the connecting points concern the same time period (e.g. 1, 6, and 8), we observe that in most cases earlier or later time periods provide better alignment. These temporal point-to-point alignment relationships can be referred to as leading or lagging, depending on the existence of non-vertical point-mappings. When leads and lags are relatively constant throughout the sample, the lines connecting the two time series have similar angles and appear in a parallel fashion. Vertical point-mappings indicate that the time series are not shifted.

The non-vertical point-mappings suggest that leading and lagging relationships exist between the two time series. We observe that y leads x with a significant lag during the first six periods. Then, there is a break in the lead-lag structure at period 6 when x starts leading for the remaining two periods. Since the Euclidean distance only allows for one-to-one point mappings, the alignment based on Euclidean distance produces significantly different results. The Euclidean distance was referred to as a pessimistic dissimilarity measure by [21] and the DTW was emphasized as a more intuitive distance measure. The non-parametric similarity evaluation conducted by DTW alignment allows the distance between the two time series to be computed by capturing the shape and temporal dynamics. When DTW is used as the similarity measure in clustering the time series of obesity prevalence, it allows for a more comprehensive study of disparities in obesity prevalence across regions. This information would be important to public health officials and health care agencies in their efforts to better understand and manage the obesity pandemic in the United States.
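The point-mappings described above can be recovered by backtracking through the cumulative cost matrix. The sketch below is illustrative only: the two series are hypothetical stand-ins rather than the data behind Figure 3, and `dtw_path` is a helper written for this example. A negative offset j − i means that x at time i is matched with an earlier point of y, i.e. y leads x.

```python
import math


def dtw_path(x, y):
    """Optimal warping path as a list of 1-based (i, j) index pairs."""
    n, m = len(x), len(y)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            D[i][j] = d + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    # Walk back from (n, m) to (1, 1), always taking the cheapest predecessor.
    i, j, path = n, m, [(n, m)]
    while (i, j) != (1, 1):
        _, i, j = min((D[i - 1][j - 1], i - 1, j - 1),
                      (D[i - 1][j], i - 1, j),
                      (D[i][j - 1], i, j - 1))
        path.append((i, j))
    return path[::-1]


x = [0.0, 0.0, 1.0, 2.0, 3.0, 2.0]  # hypothetical lagging series
y = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0]  # hypothetical leading series
path = dtw_path(x, y)
offsets = [j - i for i, j in path]  # 0 = vertical mapping, < 0 = y leads x
```

Here the first and last mappings are vertical and the five in between all have offset −1, the constant-lag situation in which the connecting lines appear parallel.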

2.3. Clustering time series using dynamic time warping

A few commonly used clustering approaches are based on hierarchical and partitional methods [3] that differ based on the way they handle the observations and how underlying groupings are determined. The time series clustering is typically done using partitional stochastic methods with an aim to create partitions in the data. A number of pre-specified cluster centroids are randomly selected to represent the center of each cluster. The distances between all the time series and these centroids are computed so that each time series is assigned to the cluster based on the smallest DTW distance. The distances and centroids are updated iteratively until stable clusters are established or a certain number of iterations elapse. In the context of time series clustering, the partitional method is closely related to prototyping or time series averaging. The simple prototyping function is based on the arithmetic mean calculated for each cluster C of size s as follows:

$$μ_{j} = \frac{1}{s} \sum_{c \in C} p_{c, j} \tag{8}$$

where $p_{c, j}$ represents the value of the jth element of the cth series that belongs to cluster C [25]. In prototyping based on DTW, [20] developed a global averaging method referred to as DTW barycenter averaging (DBA). The DBA iteratively refines a chosen initial average time series with an objective to minimize the sum of squared DTW distances from the average time series to the set of time series.
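To make the refinement step concrete, here is a minimal sketch of a single DBA update in Python. It is an assumption-laden illustration rather than the dtwclust implementation: the helper names (`dtw`, `dba_update`, `total_cost`) and the three toy series are hypothetical, the local cost is the squared difference so that the quantity being reduced is a sum of squared DTW distances, and the starting average is simply the arithmetic mean from equation (8).

```python
import math


def dtw(x, y):
    """Return (cost, path) using squared pointwise costs; path is 0-based."""
    n, m = len(x), len(y)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (x[i - 1] - y[j - 1]) ** 2
            D[i][j] = d + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((D[i - 1][j - 1], i - 1, j - 1),
                      (D[i - 1][j], i - 1, j),
                      (D[i][j - 1], i, j - 1))
    return D[n][m], path[::-1]


def dba_update(center, series_set):
    """One DBA pass: average the points aligned to each element of `center`."""
    buckets = [[] for _ in center]
    for s in series_set:
        _, path = dtw(center, s)
        for i, j in path:
            buckets[i].append(s[j])
    return [sum(b) / len(b) for b in buckets]


def total_cost(center, series_set):
    """Sum of squared-cost DTW distances from `center` to every series."""
    return sum(dtw(center, s)[0] for s in series_set)


series_set = [[0, 1, 2, 1, 0, 0], [0, 0, 1, 2, 1, 0], [0, 0, 0, 1, 2, 1]]
center = [sum(col) / len(col) for col in zip(*series_set)]  # equation (8)
updated = dba_update(center, series_set)
```

In practice the update is repeated until the average stops changing; each pass should leave `total_cost` no larger than before.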

Many techniques have been developed for methods that require a chosen initial average time series, such as k-means clustering. The k-means algorithm expects a well-defined average of a set of series. To choose the initial average time series for the DBA and k-means clustering, we utilized the R package dtwclust developed by [25]. Starting with an initial clustering assignment, the k-means clustering then proceeds by averaging the time series of each cluster by means of DBA and reassigning each time series to the cluster with the lowest DTW distance. This process continues until the assignment of each time series no longer changes and the optimal assignment is reached for a given number of clusters. The nature of the DBA procedure allows for the temporal shifts to be accounted for by means of DTW [20].

The temporal alignment of each time series is calculated to the current best average, and then the average of all the time series points that are currently aligned with a particular point of the average is defined as its updated version. The remaining time series are then grouped into the cluster with the “closest” centroid as assessed by the DTW distances. The optimal number of clusters is determined using a scree plot and several cluster validity indices (CVIs). The R function tsclust() from the R package dtwclust produces a list of seven CVIs, including the silhouette index (SIL) proposed by [23], the Davies–Bouldin (DB) index proposed by [7], and the modified Davies-Bouldin ( $D B^{*}$ ) index proposed by [13]. This R package also allows for the extraction of the Within Cluster Sum of Squares (WCSS) associated with each cluster.

2.4. Forecasts using ARIMA models

The time series of obesity prevalence are typically non-stationary. Thus taking first- or second-order differences is a standard approach in the analysis of non-stationary time series [11]. We are interested in developing an ARIMA model for each DTW-cluster to identify trends over time and forecast the obesity prevalence by region. Considering that the prevalence rates by state are non-additive, the sample population data by state can be aggregated within each cluster so that the obesity prevalence rates are computed for each cluster as follows:

$$p_{r, t} = \frac{\sum_{k = 1}^{n_{r, t}} 1_{[\mathrm{BMI}_{k} \geq 30]}}{n_{r, t}}, \quad \text{for } r = 1, \dots, R;\ t = 1, \dots, T, \tag{9}$$

where $n_{r, t}$ denotes the total number of individuals in region r at year t. We obtain R time series of obesity prevalence by region. We introduce the notation and briefly summarize the mathematical properties of ARIMA models below.
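The remark that state prevalence rates are non-additive can be seen with a two-state toy example. The counts below are hypothetical and the helper name is illustrative; the point is only that pooling counts, as in equation (9), differs from averaging the state rates whenever sample sizes differ.

```python
def pooled_prevalence(cells):
    """Regional prevalence from (obese_count, sample_size) pairs, one per state."""
    obese = sum(o for o, _ in cells)
    total = sum(n for _, n in cells)
    return obese / total


# Hypothetical counts for two states in the same cluster and year.
cells = [(300, 1000), (2000, 5000)]
pooled = pooled_prevalence(cells)                  # 2300 / 6000
naive = sum(o / n for o, n in cells) / len(cells)  # mean of 0.30 and 0.40
```

The pooled rate is about 0.383, while the simple mean of the two state rates is 0.35, because the larger state contributes more respondents to the regional total.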

Consider the time series $\{p_{r, t} : t = 0, \pm 1, \pm 2, \pm 3, \dots\}$ or a stochastic process for region r. The backshift operator, denoted as B, operates on the time index of this time series and shifts time back one time unit to form a new time series as $B p_{r, t} = p_{r, (t - 1)}$. Let ∇ denote the differencing operator, with $\nabla^{d} = (1 - B)^{d}$. The $ARIMA (p, d, q)$ process is a stochastic time series process where d represents the level of differencing, p is the auto-regressive order, and q is the moving average order, with all three parameters being non-negative integers [5]. Then, the general $ARIMA (p, d, q)$ model is expressed as

$$ϕ (B) (1 - B)^{d} p_{r, t} = θ (B) e_{r, t} \tag{10}$$

where $e_{r, t}$ is the white noise process with zero mean and variance $σ^{2}$. Further, $θ (B)$ is the moving average characteristic polynomial evaluated at B, i.e. $θ (B) = 1 - θ_{1} B - θ_{2} B^{2} - \dots - θ_{q} B^{q}$, while $ϕ (B)$ is the auto-regressive characteristic polynomial evaluated at B, i.e. $ϕ (B) = 1 - ϕ_{1} B - ϕ_{2} B^{2} - \dots - ϕ_{p} B^{p}$. We utilize R [22] version 4.1.1 with the R package forecast [12] to determine the optimal number of parameters in the $ARIMA (p, d, q)$ model. We perform the unit root test as well as plot the Auto Correlation Function (ACF) and Partial Auto Correlation Function (PACF) to confirm the stationarity of the selected models. However, we would like to caution the reader that for short time series ( $T \leq 30$ ) the power of the unit root test decreases. Ideally, we would like to have at least 50 observations to see an increase in the power of the unit root test and therefore have reliable forecasts (see [4]).
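As a worked illustration of equation (10), the two model orders that appear later in the application, ARIMA(0,1,0) and ARIMA(1,2,1), can be expanded by substituting the polynomials. This is standard algebra shown for orientation only; the region subscript r is dropped for brevity and no drift term is included, as in (10).

$$\mathrm{ARIMA}(0,1,0):\quad (1 - B)\, p_{t} = e_{t} \;\Longrightarrow\; p_{t} = p_{t-1} + e_{t}$$

$$\mathrm{ARIMA}(1,2,1):\quad (1 - ϕ_{1} B)(1 - B)^{2} p_{t} = (1 - θ_{1} B)\, e_{t}$$

$$\Longrightarrow\; p_{t} = (2 + ϕ_{1})\, p_{t-1} - (1 + 2 ϕ_{1})\, p_{t-2} + ϕ_{1}\, p_{t-3} + e_{t} - θ_{1} e_{t-1}$$

The last line follows from $(1 - ϕ_{1} B)(1 - 2B + B^{2}) = 1 - (2 + ϕ_{1}) B + (1 + 2 ϕ_{1}) B^{2} - ϕ_{1} B^{3}$.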

3. Simulation study

A simulation study is performed to assess the performance and accuracy of the DTW and k-means methods under three different simulation scenarios. More specifically, we test the clustering methods under three scenarios each containing lead and lagged time series. This simulation study is partially motivated by previous work of [31] and [8]. In our study, we generated 100 data sets for 6 and 8 time series respectively for each of the three scenarios. The 6 and 8 time series are represented in two clusters based on two comparable functions.

In Scenario 1, the time series within one cluster are generated from the function $y = a + bt + d t^{2} + ϵ_{1}$ , with a = 10, b = 1, d = 1, and $ϵ_{1} \sim N (0, 4^{2})$ while the time series within the other cluster are generated from the function $z = a - bt + d t^{2} + ϵ_{2}$ with the same coefficients but a different error term $ϵ_{2} \sim N (0, 1)$ . The length of each generated time series is N = 40 along the interval $t \in [0, 2 π]$ . The lead and lag are created by assuming the time shift $Δ = 0.15$ ; in particular, each time series in the corresponding cluster is shifted by one observation across all time periods. From the plots in the top row of Figure 4 we can observe a significant overlap and upward trend in the 6 (left) and 8 (right) time series for Scenario 1. This is similar to what we observed in the time series of BMI by state. Scenario 2 (with 6 time series) is created with two clusters of different shapes generated from $y = \sin (t - aπ) + ϵ$ and $z = \cos (t + aπ) + ϵ$ respectively with $a = (0, 0.1, 0.15)$ , $ϵ \sim N (0, {0.05}^{2})$ , and $Δ = 0.1$ . In Scenario 2 (with 8 time series), four possible values for a are considered, $a = (0, 0.1, 0.15, 0.20)$ , resulting in four time series in each cluster. Scenario 3 is created with two clusters of different shapes generated from $y = \sin (a (t - c)) - 1 + ϵ$ and $z = \sin (b (t - c)) - 1 + ϵ$ respectively, where a = 0.2, b = 0.4, c = 0.3, $ϵ \sim N (0, {0.05}^{2})$ , and $Δ = 0.2$ . Each simulation setting was repeated 100 times.
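One realization of Scenario 1 could be generated along the following lines. This is a sketch under explicit assumptions: the text does not spell out how the shift is applied, so the code assumes that the k-th series in a cluster is evaluated at t − kΔ; the function name, the seed, and the use of Python's `random` module are likewise illustrative choices.

```python
import math
import random


def scenario_one(n_per_cluster=3, n_points=40, shift=0.15, seed=0):
    """Two clusters of noisy quadratics on [0, 2*pi] with staggered time shifts."""
    rng = random.Random(seed)
    step = 2 * math.pi / (n_points - 1)
    grid = [k * step for k in range(n_points)]
    a, b, d = 10.0, 1.0, 1.0
    series, labels = [], []
    # Cluster 0: y = a + b t + d t^2 + N(0, 4^2); cluster 1: z = a - b t + d t^2 + N(0, 1).
    for label, (sign, sd) in enumerate([(+1, 4.0), (-1, 1.0)]):
        for k in range(n_per_cluster):
            lag = k * shift  # assumption: the k-th member is delayed by k * shift
            series.append([a + sign * b * (t - lag) + d * (t - lag) ** 2
                           + rng.gauss(0.0, sd) for t in grid])
            labels.append(label)
    return grid, series, labels


grid, series, labels = scenario_one()
```

Calling `scenario_one(n_per_cluster=4)` would give the 8-series variant under the same assumptions.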

Figure 4. — A single realization of the simulation Scenario 1 (top row) with 6 (left) and 8 (right) time series. A single realization of the simulation Scenario 2 (middle row) with 6 (left) and 8 (right) time series. A single realization of the simulation Scenario 3 (bottom row) with 6 (left) and 8 (right) time series.

The k-means and DTW methods were used on each generated data set and the proportion of correct classifications was recorded. The classifications measured the accuracy of each clustering method by evaluating whether the time series generated from the same process were grouped correctly in the same cluster. Table 1 summarizes the proportion of correct classifications across all simulation settings. As seen in Table 1 for Scenario 2, both methods perform well achieving 100% accuracy.
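The proportion of correct classifications can be computed as sketched below. The sketch assumes that a generated data set counts as correctly classified only when the whole partition is recovered, regardless of how the cluster labels are named; the helper names and the four example runs are hypothetical.

```python
def same_partition(predicted, truth):
    """True when two label vectors induce the same grouping, whatever the label names."""
    pairs = set(zip(predicted, truth))
    # A one-to-one mapping between label sets has exactly one pair per label.
    return len(pairs) == len(set(predicted)) == len(set(truth))


def proportion_correct(runs, truth):
    """Share of simulated data sets whose clustering matches the true grouping."""
    return sum(same_partition(p, truth) for p in runs) / len(runs)


truth = [0, 0, 0, 1, 1, 1]
runs = [[1, 1, 1, 2, 2, 2],   # correct, labels merely renamed
        [1, 1, 2, 2, 2, 2],   # one series misplaced
        [2, 2, 2, 1, 1, 1],   # correct, labels swapped
        [1, 1, 1, 1, 1, 1]]   # everything in one cluster
score = proportion_correct(runs, truth)
```

With these four hypothetical runs, two partitions match the truth, so the proportion is 0.5.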

Table 1.

Summary of the proportions of correct classification for the three simulation scenarios and two methods of clustering (DTW and k-means) with 6 and 8 time series (6-TS and 8-TS).

Proportion of correct classification
	DTW (6-TS)	DTW (8-TS)	k-means (6-TS)	k-means (8-TS)
Scenario 1	0.86	1.00	0.00	0.34
Scenario 2	1.00	1.00	1.00	1.00
Scenario 3	0.90	0.93	1.00	1.00

It is obvious that Scenario 2 does not require computational tools for the cluster detection as the two clusters are recognizably different. For Scenario 3, both methods show high accuracy with k-means resulting in slightly better performance than DTW. From Figure 4, we can observe that the clusters in Scenario 3 are somewhat separated, but exhibit similar cyclical behavior with an upward trend. Lastly, the results for Scenario 1 show superior performance of DTW compared to k-means. The k-means method fails to detect correct cluster structure in the setting with 6 time series. When the number of time series is increased to 8, the k-means accuracy increases to 34% but it is still very low compared to that of DTW. Scenario 1 aligns the closest with the BMI time series observed in this study. In both the scenario and study, time series are densely distributed, overlap, have varying slopes, and exhibit upward trends. The overall results, with a special focus on the Scenario 1 results, provide confidence in choosing DTW as a suitable method for clustering the BMI series in this study.

4. Application

In this section, we illustrate the application of the DTW method for the purpose of investigating the similarity structure of the states' time series of annual obesity prevalence across the United States from 1990 to 2019. By minimizing the intra-cluster dissimilarities, maximizing the inter-cluster dissimilarities, and capturing leading and lagging relationships of the time series, we were able to determine the optimal number of clusters. Based on the scree plot and the corresponding WCSS values, we determined that the optimal number of clusters is six. The choice of six clusters is also validated with three CVIs. The DB (1.3481) and $D B^{*}$ (1.4611) indices produced the smallest value and the SIL index (0.1964) produced the largest value with six clusters. All stated index results are optimal. In addition, we considered results from the simulation study and compared the six optimal clusters from the DTW approach with cluster groupings from the k-means approach to determine that the DTW technique is preferable for our data. The reader is referred to the Supplemental Material for more information.

Table 2 displays the states within each associated cluster. The smallest regions, in terms of the number of states, are Region 3 and Region 5, each including six states. The largest region, Region 6, encompasses 14 states.

Table 2.

Summary of six clusters.

Region	States	AICD
1	AZ, NV, NJ, NM, RI, UT, WY	8.384
2	AK, DE, IA, MO, ND, PA, SC, SD, WI	8.140
3	AL, IN, LA, MI, MS, WV	9.039
4	AR, GA, KS, KY, NE, OH, OK, TN, TX	8.910
5	CA, CO, CT, DC, HI, MA	10.001
6	FL, ID, IL, ME, MD, MN, MT, NH, NY, NC, OR, VT, VA, WA	8.603

Note: The AICD (Average Inter-Cluster Distance) is calculated based on the DBA for the period 1990–2019.

Figure 5 displays a map of the United States with the six optimal regions. On the map, we incorporated patterns and solid colors to better differentiate the regions. To provide additional characteristics of these geographic areas, we considered their average inter-cluster distance and which states were included in each region.

One may think of regions as areas that are in similar geographic locations or share some physical characteristics such as landform, climate, or vegetation. However, a region can alternatively be defined on the basis of artificial features such as language, government, religion, etc. In our study, DTW clustering produced regions that were formed on the basis of an artificial feature: the similarity of obesity prevalence. We interpret a subset of clusters in terms of locations or characteristics as follows. Region 3, with the highest obesity prevalence, is composed of Alabama, Indiana, Louisiana, Michigan, Mississippi, and West Virginia, with half of the states located in the south. Similar findings are reported by [27] for West Virginia, Mississippi, Arkansas, Alabama, and Louisiana. Poverty is one of the dominant characteristics for the states in this region. According to the Chamber of Commerce (www.chamberofcommerce.org), the 3-year average poverty rates (percentage of people in poverty) are the highest in the nation for Louisiana, Mississippi, West Virginia, and Alabama. Most of the states in Region 6 have relatively low poverty rates, with New Hampshire, Maryland, and Minnesota having the lowest poverty rates in the country. Since there are other characteristics of the population that have not been accounted for in this study, additional analysis of the socio-demographic and economic characteristics of these regions would help with finding common characteristics of the states included in each region.

The line plot in Figure 6 shows the time series of the obesity prevalence rates for the six regions. While all six time series exhibit an upward trend, there are distinct differences among them. Region 3 consistently has the highest obesity prevalence among all regions, starting around 0.176 in 1990 and reaching 0.405 in 2019. Region 5 and Region 1 had similar prevalence rates before 1999. However, after 1999 these two regions followed different trajectories; the prevalence rates continued to increase for Region 1, but stalled for Region 5. These two regions showed a large difference in obesity prevalence in the past two decades with Region 1 and Region 5 reaching prevalence rates of 0.341 and 0.306, respectively, in 2019. The time series for Region 6 is entirely distinct from the other regions. This region started at 0.152 in 1990 and increased to 0.353 in 2019. Lastly, we observe an interesting dynamic between the time series for Region 2 and Region 4. Both regions were unique from 1990 to 2000; then, from 2000 onwards, the regions were similar and overlapped multiple times. In 2019, however, these regions appear to diverge again with Region 2 having a prevalence rate of 0.385 and Region 4 having a prevalence rate of 0.399.

The panel plot of the normalized values of the obesity prevalence by state over thirty years for Regions 1, 2, 3, 4, 5, and 6 is included in the Supplemental Material. The averages computed using the DBA method are represented on each figure by a thick dashed line.

In an effort to study the six regions closely, we developed time series models for each region and forecasted the obesity prevalence rates for three years into the future. The time series for all six regions exhibit a clear trend in the mean. Thus the data are not stationary and differencing is performed to achieve stationarity. Regions 1, 2, 4, and 5 require first-order differencing and Regions 3 and 6 require second-order differencing. The p-value obtained from the Augmented Dickey–Fuller unit root test verifies the previously stated first-order and second-order differencing decisions. After differencing the time series, the autocorrelations – correlations between a given time series and a lagged version of itself over successive time intervals – remain constant over time. Achieving stationarity allows us to build ARIMA models for forecasting time series of obesity prevalence. A simple way to check for stationarity is to obtain plots of the ACF and PACF [5]. These plots help in determining whether neighboring models are needed with some number of AR or MA terms. To view a summary of the best fitting ARIMA models by region, refer to column two of Table 3. After using the visual plots to choose an optimal model, we verified that the chosen model is a good fit for the data using the Ljung–Box Test. The null hypothesis of this test states that the residuals are white noise. The p-value obtained from the Ljung–Box Test verifies that the model is useful in forecasting. The values of p, d, and q of the best fitted $ARIMA (p, d, q)$ model were verified with the R function auto.arima() from the R package forecast [12].

Table 3.

Summary of the forecasts by region.

Region	ARIMA	2019 Observed	2019 Validation	2022 Forecast
1	(0,1,0)	0.3410	0.3402 (0.3218, 0.3586)	0.3614 (0.3437, 0.3791)
2	(0,1,0)	0.3851	0.3920 (0.3747, 0.4093)	0.4072 (0.3897, 0.4247)
3	(1,2,1)	0.4055	0.4059 (0.3861, 0.4257)	0.4248 (0.4059, 0.4436)
4	(0,1,0)	0.3993	0.3976 (0.3838, 0.4115)	0.4245 (0.4114, 0.4377)
5	(0,1,0)	0.3066	0.3060 (0.2909, 0.3211)	0.3231 (0.3088, 0.3374)
6	(1,2,1)	0.3528	0.3555 (0.3371, 0.3739)	0.3706 (0.3532, 0.3880)

Note: The observed obesity prevalence rates for 2019 are provided in the third column. The 2019 values with 95% prediction intervals are based on the model validation using the training set (fourth column). The 2022 forecasted values with 95% prediction intervals are based on the full data set (last column).

For the purpose of this validation, the data set is split into a training set (27 years) and a test set (3 years). Further, a 3-year forecast of obesity prevalence by region is performed to validate the chosen models. A comparison between the observed and forecasted values for the year 2019 is summarized in the third and fourth columns of Table 3. These validation results show that the observed 2019 prevalence rate falls within the 95% prediction interval for every region, confirming good performance of the selected models so that they can be used for the out-of-sample forecast (last column of Table 3).

The best forecasting model for Region 1 is $\nabla p_{t} = 0.0068 + e_{t}$ , i.e. a random walk with drift, where $e_{t} \sim N (0, 0.000027)$ . Similarly, the best model for Region 2 is $\nabla p_{t} = 0.0074 + e_{t}$ where $e_{t} \sim N (0, 0.000026)$ . The best model for Region 3 is the ARIMA(1,2,1) model, formulated as $ϕ (B) \nabla^{2} p_{t} = θ (B) e_{t}$ where $ϕ (B) = 1 + 0.4022 B$ and $θ (B) = 1 - 0.7059 B$ with $e_{t} \sim N (0, 0.000029)$ . The best forecasting model for Region 4 is $\nabla p_{t} = 0.0084 + e_{t}$ , again a random walk with drift, where $e_{t} \sim N (0, 0.000015)$ . Further, the best forecasting model for Region 5 is $\nabla p_{t} = 0.0055 + e_{t}$ , also a random walk with drift, where $e_{t} \sim N (0, 0.000018)$ . Lastly, the best model for Region 6 is the ARIMA(1,2,1) model, defined by $ϕ (B) \nabla^{2} p_{t} = θ (B) e_{t}$ where $ϕ (B) = 1 + 0.5787 B$ and $θ (B) = 1 - 0.6813 B$ with $e_{t} \sim N (0, 0.000027)$ .
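For the ARIMA(0,1,0) models with drift, the h-step forecast has a simple closed form: the last observed value plus h times the drift, with a forecast variance of h times the innovation variance. The Python sketch below applies this to the Region 1 figures reported above and in Table 3. It is an approximation for illustration only: the function name is hypothetical, the drift is treated as known, and rounding in the reported coefficients means the last digit can differ from the software output.

```python
import math


def drift_forecast(last_value, drift, sigma2, horizon, z=1.96):
    """h-step point forecast and normal interval for ARIMA(0,1,0) with drift."""
    point = last_value + horizon * drift
    half_width = z * math.sqrt(horizon * sigma2)  # variance grows linearly in h
    return point, point - half_width, point + half_width


# Region 1 inputs as reported: 2019 prevalence, drift, and innovation variance.
point, lower, upper = drift_forecast(0.3410, 0.0068, 0.000027, horizon=3)
```

The resulting point forecast of 0.3614 and interval of roughly (0.344, 0.379) agree closely with the 2022 entry for Region 1 in Table 3.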

The Supplemental Material includes the 3-year forecasts and 95% prediction intervals for Region 3 (the region with the maximum average obesity prevalence over the 30 years) and Region 5 (the region with the minimum average obesity prevalence over the 30 years). Selecting these two regions allowed us to compare two distinct models, the ARIMA(1,2,1) and ARIMA(0,1,0), which are fitted to the observed data.
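To make the contrast between the two model classes concrete, the ARIMA(1,2,1) specification can be written out as an explicit recursion. The expansion below is a sketch that takes the Region 6 polynomials exactly as reported, $\phi(B) = 1 + 0.5787 B$ and $\theta(B) = 1 - 0.6813 B$; the auxiliary symbol $w_{t}$ is introduced here only for illustration.

$$
\begin{aligned}
w_{t} &= \nabla^{2} p_{t} = p_{t} - 2 p_{t-1} + p_{t-2}, \\
w_{t} + 0.5787\, w_{t-1} &= e_{t} - 0.6813\, e_{t-1}, \\
p_{t} &= 1.4213\, p_{t-1} + 0.1574\, p_{t-2} - 0.5787\, p_{t-3} + e_{t} - 0.6813\, e_{t-1}.
\end{aligned}
$$

The coefficients on the lagged prevalence values sum to one, as expected for a differenced model. By comparison, the ARIMA(0,1,0) model with drift $c$ reduces to $p_{t} = p_{t-1} + c + e_{t}$, so its forecasts extend the last observation along a straight line, whereas the ARIMA(1,2,1) forecasts also depend on the two preceding values and the most recent innovation.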

It is worth mentioning that, unlike the approach used in this study, the CDC publishes a map of Adult Obesity Prevalence of the United States for every individual year that displays the regional differences of obesity prevalence [2]. While these maps are helpful in analyzing regional differences between each individual year, our approach allows us to review the regional differences of average obesity prevalence over a span of 30 years encompassed into one United States map.

5. Conclusion and discussion

In this application note, we employed dynamic time warping as a method to cluster states' time series of obesity prevalence for the adult population over a 30-year period (1990–2019) in the United States. The time series clustering using DTW helped us define six homogeneous regions of obesity prevalence rates in the United States. We observed that the states grouped in the same region are close in DTW distance and similar in their temporal dynamics, as measured by lagging and leading relationships. We then developed predictive ARIMA models based on these regions for the purpose of forecasting obesity prevalence rates for several years ahead.

We believe that our clustering approach provides more comprehensive information about the development of obesity prevalence over time and may be more useful when creating a map of obesity prevalence by region. As noted above, the CDC's annual maps of Adult Obesity Prevalence [2] show regional differences within individual years, whereas our approach summarizes the regional differences of average obesity prevalence over a span of 30 years in one United States map. We thereby encapsulate the history of obesity prevalence and use it to forecast future obesity prevalence. The findings and results of our study thus provide an important contribution to the development of regional public health policies and initiatives to decrease the obesity rates among the adult population in the United States.

While we acknowledge that state policies tend to reflect the specific needs of the state, it is not uncommon for the underlying objectives of certain policies to diffuse between states. Moreover, it is not unusual for the policy decisions of one state to influence those of other similar states. Consequently, studying data from multiple states within one cluster could boost the reliability of findings, aid policymakers in understanding how different factors impact obesity prevalence, provide intra-cluster support in health and wellness policy development, and foster dialogue regarding future public health policies. Investigating additional factors that lead to these relationships is a promising avenue for future research. The impact of the COVID-19 pandemic on obesity prevalence should be investigated when new data become available, with our study serving as the baseline for comparison.

Supplementary Material

Supplemental Material: CJAS_A_2192445_SM6942.pdf (228.1 KB, PDF)

Acknowledgments

We thank the Editors and the anonymous Referee for their helpful comments that improved the readability and quality of this Application Note. We also thank Thomas Woods for reviewing this work and providing useful feedback for its revision.

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

1. 2019 BRFSS Survey Data and Documentation. The Behavioral Risk Factor Surveillance System. The Centers for Disease Control and Prevention. Available at https://www.cdc.gov/brfss/annual_data/annual_2019.html Accessed April 15, 2022.
2. Adult Obesity Prevalence Maps. The Centers for Disease Control and Prevention. Available at https://www.cdc.gov/obesity/data/prevalence-maps.html Accessed October 15, 2021.
3. Aghabozorgi S., Shirkhorshidi A.S., and Wah T.Y., Time-series clustering–a decade review, Inf. Syst. 53 (2015), pp. 16–38.
4. Arltová M. and Fedorová D., Selection of unit root test on the basis of length of the time series and value of AR (1) parameter, Statistika-Stat. Econ. J. 96 (2016), pp. 47–64.
5. Cryer J.D. and Chan K.-S., Time Series Analysis: With Applications in R, Vol. 2, Springer, New York, 2008.
6. Daawin P., Kim S., and Miljkovic T., Predictive modeling of obesity prevalence for the US population, N. Am. Actuar. J. 23 (2019), pp. 64–81.
7. Davies D.L. and Bouldin D.W., A cluster separation measure, IEEE Trans. Pattern Anal. Mach. Intell. 2 (1979), pp. 224–227.
8. D'Urso P., García-Escudero L.A., De Giovanni L., Vitale V., and Mayo-Iscar A., Robust fuzzy clustering of time series based on B-splines, Int. J. Approx. Reason. 136 (2021), pp. 223–246.
9. Finkelstein E.A., Khavjou O.A., Thompson H., Trogdon J.G., Pan L., Sherry B., and Dietz W., Obesity and severe obesity forecasts through 2030, Am. J. Prev. Med. 42 (2012), pp. 563–570.
10. Flegal K.M., Kruszon-Moran D., Carroll M.D., Fryar C.D., and Ogden C.L., Trends in obesity among adults in the United States, 2005 to 2014, JAMA 315 (2016), pp. 2284–2291.
11. Hamilton J.D., Time Series Analysis, Princeton University Press, 2020.
12. Hyndman R., Forecasting Functions for Time Series and Linear Models. R package version 8.16. 2022.
13. Kim M. and Ramakrishna R.S., New indices for cluster validity assessment, Pattern Recognit. Lett. 26 (2005), pp. 2353–2363.
14. Kurbalija V., Radovanović M., Geler Z., and Ivanović M., The influence of global constraints on similarity measures for time-series databases, Knowl. Based Syst. 56 (2014), pp. 49–67.
15. Levenshtein V.I., Binary codes capable of correcting deletions, insertions, and reversals, Sov. Phys. Dokl. 10 (1966), pp. 707–710.
16. Miljkovic T., Shaik S., and Miljkovic D., Redefining standards for body mass index of the US population based on BRFSS data using mixtures, J. Appl. Stat. 44 (2017), pp. 197–211.
17. Miljkovic T. and Wang X., Identifying subgroups of age and cohort effects in obesity prevalence, Biom. J. 63 (2021), pp. 168–186.
18. National Center for Chronic Disease Prevention and Health Promotion, Division of Population Health. Behavioral Risk Factor Surveillance System. Available at https://www.cdc.gov/brfss/index.html Accessed July 1, 2021.
19. Oshan T.M., Smith J.P., and Fotheringham A.S., Targeting the spatial context of obesity determinants via multiscale geographically weighted regression, Int. J. Health Geogr. 19 (2020), pp. 1–17.
20. Petitjean F., Ketterlin A., and Gançarski P., A global averaging method for dynamic time warping, with applications to clustering, Pattern Recognit. 44 (2011), pp. 678–693.
21. Ratanamahatana C.A. and Keogh E., Making time-series classification more accurate using learned constraints, in Proceedings of the 2004 SIAM International Conference on Data Mining, Society for Industrial and Applied Mathematics, 2004, pp. 11–22.
22. R Core Team. R: A Language and Environment for Statistical Computing. The R Foundation for Statistical Computing. Version 4.1.1. Vienna, Austria. 2021.
23. Rousseeuw P.J., Silhouettes: a graphical aid to the interpretation and validation of cluster analysis, J. Comput. Appl. Math. 20 (1987), pp. 53–65.
24. Sakoe H. and Chiba S., Dynamic programming algorithm optimization for spoken word recognition, IEEE Trans. Acoust. 26 (1978), pp. 43–49.
25. Sardá-Espinosa A., Comparing time-series clustering algorithms in R using the dtwclust package, R Package Vignette 12 (2017), p. 41.
26. SAS/STAT Software, Version 9.4. Cary, NC. Available at http://www.sas.com/.
27. Wang Y., Beydoun M.A., Min J., Xue H., Kaminsky L.A., and Cheskin L.J., Has the prevalence of overweight, obesity and central obesity leveled off in the United States? Trends, patterns, disparities, and future projections for the obesity epidemic, Int. J. Epidemiol. 49 (2020), pp. 810–823.
28. Ward Z.J., Bleich S.N., Cradock A.L., Barrett J.L., Giles C.M., Flax C., Long M.W., and Gortmaker S.L., Projected US state-level prevalence of adult obesity and severe obesity, N. Engl. J. Med. 381 (2019), pp. 2440–2450.
29. Weighting the BRFSS Data. The Centers for Disease Control and Prevention. https://www.cdc.gov/brfss/annual_data/2017/pdf/weighting-2017-508.pdf Accessed April 15, 2022.
30. Woods T. and Miljkovic T., Modeling the economic cost of obesity risk and its relation to the health insurance premium in the United States: a state level analysis, Risks 10 (2022), p. 197.
31. Yuan Y., Chen Y.P.P., Ni S., Xu A.G., Tang L., Vingron M., Somel M., and Khaitovich P., Development and application of a modified dynamic time warping algorithm (DTW-S) to analyses of primate brain expression time series, BMC Bioinformatics 12 (2011), pp. 1–13.
