qDIET: toward an automated, self-sustaining knowledge base to facilitate linking point-of-sale grocery items to nutritional content

Valliammai Chidambaram; Philip J Brewster; Kristine C Jordan; John F Hurdle

2013 Nov 16;2013:224–233.

PMCID: PMC3900174 PMID: 24551333

Abstract

The United States, indeed the world, struggles with a serious obesity epidemic. The costs of this epidemic in terms of healthcare dollar expenditures and human morbidity/mortality are staggering. Surprisingly, clinicians are ill-equipped in general to advise patients on effective, longitudinal weight loss strategies. We argue that one factor hindering clinicians and patients in effective shared decision-making about weight loss is the absence of a metric that can be reasoned about and monitored over time, as clinicians do routinely with, say, serum lipid levels or HbA1c. We propose that a dietary quality measure championed by the USDA and NCI, the HEI-2005/2010, is an ideal metric for this purpose. We describe a new tool, the quality Dietary Information Extraction Tool (qDIET), which is a step toward an automated, self-sustaining process that can link retail grocery purchase data to the appropriate USDA databases to permit the calculation of the HEI-2005/2010.

Introduction

The United States is in the grip of a serious obesity epidemic.¹,² While there are some recent data suggesting that obesity rates are leveling off,¹ the number of people with a body mass index (BMI) in the range of ‘overweight’ or ‘obese’ is at an alarming, all-time high. This is true of children as well as adults. This problem is not limited to the US: most developed, and several developing, countries are experiencing the same epidemic.³ Since weight-related morbidity and mortality comprise a significant burden on the nation’s healthcare system, it makes sense to improve nutrition and weight counseling at the point of care. We argue here that healthcare practitioners need a metric in the electronic health record (EHR) that reliably reflects dietary quality, especially over time. Such a metric could become as routine in shared decision-making with overweight patients as serum cholesterol is with hyperlipidemic patients, or as Hemoglobin A1C is with diabetic patients. We describe a tool, qDIET, that is a step in this direction. It addresses the chief informatics challenge in integrating dietary quality metrics into the EHR, namely the efficient and bias-free collection of household-level nutrition data.

Background

The Morbidity and Mortality of Obesity

Over the past 20 years there has been a dramatic shift in obesity incidence in the US. The US Centers for Disease Control and Prevention (CDC) uses the measure “body mass index” (BMI), a number calculated from a person’s weight and height, as a reliable measure of overall body adiposity. The weight status categories for different values of the BMI are shown in Table 1. In 1990, the CDC reported that there were no states in the US where the percentage of the obese population was greater than 15%. By 2010, in contrast, there was no state where the percentage of the obese population was less than 20%. See Figure 1 for the state-by-state details. According to the most recent data from the CDC, 35.7% of adults in the US are estimated to be obese.

Table 1.

The definitions of BMI weight status categories

BMI	Weight Status
Below 18.5	Underweight
18.5 – 24.9	Normal
25.0 – 29.9	Overweight
30.0 and Above	Obese

Figure 1. — The distribution of adult obesity incidence by state, 1990 compared to 2010. (source: CDC¹)

In both children and adults, obesity brings with it important health consequences. Obese children are more likely to have hypertension and hyperlipidemia;⁴ to manifest joint problems and musculoskeletal discomfort;⁵ to show increased risk of impaired glucose tolerance, increased insulin resistance, and type 2 diabetes;⁶ and to suffer from social stigmatization and depression.⁶ Adults manifest a similar profile of weight-related morbidity. Finkelstein et al. estimated that in 2008 adult obesity cost the healthcare system $147 billion. On a per-patient basis, it cost $1,429 more per year to provide healthcare to obese adults than to normal-weight adults.⁷

Lack of a Dietary Metric And Poor Clinician Training In Diet Counseling

The National Institutes of Health (NIH) provide clinical guidelines to aid healthcare practitioners in advising their patients on weight loss.⁸ As a group, however, physicians perform poorly when trying to provide effective advice on weight. Studying the late 1990s, Jackson et al. concluded “There is a need for mechanisms that allow health care professionals to devote sufficient attention to weight control and to link with evidence-based weight loss interventions, especially those that target groups most at risk for obesity.”⁹ In a 2008 study, Smith et al. found that less than 50% of a nationally representative sample of primary care providers reported always providing specific guidance on diet, physical activity, or weight control when it was indicated.¹⁰ On the positive side, Appel et al. found that obese patients with cardiovascular risk factors could successfully lose weight and keep that weight off both with intense person-to-person interventions and with remotely delivered interventions.¹¹

We believe that an important reason clinicians, as a group, perform poorly with weight counseling is that they lack a dietary metric that comports with their standard clinical reasoning model. For example, a physician can fine-tune the therapy for a diabetic patient presenting with an elevated Hemoglobin A1C and track that value over time. We propose a dietary quality metric that can likewise be monitored over time and that can form the basis of informed, shared decision-making about diet that can lead to weight loss. The long-term goal of our work is to validate the hypothesis that a dietary quality metric can lead to improved weight outcomes if it can be presented to clinicians and patients in a simple and intuitive way.

A Dietary Quality Metric Suitable for the EHR: the argument for the healthy eating index

A publicly available framework for much contemporary dietary assessment research in the United States has been developed by the US Department of Agriculture (USDA), in conjunction with the National Cancer Institute (NCI). Of particular significance here, the Food and Nutrition Database for Dietary Studies (FNDDS),¹² the MyPyramid Equivalents Database (MPED),¹³ and the Healthy Eating Index (HEI-2005)¹⁴ are widely recognized as established reference standards. The HEI-2005 was designed to provide a measure of overall dietary quality in accordance with the Dietary Guidelines for Americans (2005). This metric has been applied primarily in the context of population-based studies and was initially designed to analyze the dietary recall components in the What We Eat In America (WWEIA) sections of the biennial National Health and Nutrition Examination Survey (NHANES), conducted jointly by the USDA Agricultural Research Service (USDA-ARS) and the National Center for Health Statistics (NCHS) at the Centers for Disease Control and Prevention (CDC).

The food codes in the FNDDS database represent the complete range of foods reported in the corresponding WWEIA/NHANES dietary recalls. Along with a text-based food description that is often quite detailed, the FNDDS contains the nutrient values and caloric energy (kcal) for “typical” portion weights or serving sizes of that food. The MPED references the same set of food codes as the FNDDS, but represents the nutrient information in a standard (100 gram) portion size and disaggregates mixed foods or recipes proportionally into their MyPyramid components and food groups for dietary analysis. This information is used to calculate the HEI-2005 scores for any food that can be represented by an 8-digit USDA food code. The mapping techniques being developed in the qDIET framework should permit the dietary quality of grocery food items to be evaluated with the HEI-2005, leveraging the well-validated methods developed by the CDC, USDA, and NCI.

A total HEI-2005 score (range: 0–100) is derived from the sum of twelve component scores and their discrete contributions to total energy (kcal), using a ‘nutrient density’ approach that is robust and independent of individual fluctuations in intake levels. Nine of the twelve HEI-2005 component scores assess nutritional adequacy, such that higher intakes (e.g., from whole grains, fruits, and vegetables) as a proportion of total energy result in higher scores, while the remaining three components reflect guidelines for moderation (e.g., to reduce the amounts of sodium or ‘empty calories’ from added sugars). Overall, higher HEI-2005 scores indicate closer conformance to the USDA’s dietary guidance. A newly released version of the Healthy Eating Index,¹⁵ based on the USDA’s 2010 Dietary Guidelines for Americans, redefines or adjusts several of the components, but our current work is focused on the HEI-2005 since it is the better studied metric. Transitioning to the HEI-2010 will be straightforward, since both are based on FNDDS food codes.
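The nutrient-density idea behind the component scores can be sketched in a few lines of Python. This is a simplified illustration, not the USDA's SAS scoring code: the function names are ours, both scoring rules are assumed to be linear between their end points, and the numeric standards used in the usage note are placeholders rather than the published HEI-2005 standards.

```python
def density_per_1000_kcal(amount, total_kcal):
    """Express an intake amount per 1,000 kcal of total energy."""
    if total_kcal <= 0:
        raise ValueError("total_kcal must be positive")
    return amount / total_kcal * 1000.0


def adequacy_score(density, standard_for_max, max_points):
    """Adequacy component: more is better, capped at max_points.

    Assumes a linear rise from 0 points (no intake) to max_points
    at the standard (a simplifying assumption for illustration).
    """
    return max_points * min(density / standard_for_max, 1.0)


def moderation_score(density, standard_for_max, standard_for_zero, max_points):
    """Moderation component: less is better.

    Assumes a linear fall from max_points (at or below standard_for_max)
    to 0 points (at or above standard_for_zero).
    """
    if density <= standard_for_max:
        return float(max_points)
    if density >= standard_for_zero:
        return 0.0
    span = standard_for_zero - standard_for_max
    return max_points * (standard_for_zero - density) / span


def total_score(component_scores):
    """The total score is the sum of the individual component scores."""
    return sum(component_scores)
```

For example, a market basket supplying 2 cup-equivalents of fruit in 5,000 kcal has a density of 0.4 per 1,000 kcal; against a placeholder standard of 0.8 worth 5 points, `adequacy_score` would award half of the available points. Because every component is scaled to energy, the same functions apply unchanged to a single person, a household, or a larger panel.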

We argue here that the HEI is extremely well-suited for the EHR. First, it reliably implements the USDA Dietary Guidelines for Americans. Second, it is robust and, because it is based on a nutrient-density approach, it scales well across different populations (e.g., from households to provider panels to healthcare networks). In the words of its developers, “The HEI is a scoring metric that can be applied to any defined set of foods, such as previously-collected dietary data, a defined menu, or a market basket.” Using the qDIET tool described below, we intend to employ the HEI component scores as a highly scalable dietary metric to estimate the overall healthfulness of the food environment at the patient-household level. In sum, we believe that these informatics solutions will facilitate the inclusion of grocery sales data as covariates in dietary studies in ways that are readily generalizable.

The State-of-the-Art in Dietary Data Collection

In order to assess a person’s dietary quality, using the HEI or any other metric, one needs detailed nutritional information about the foods that the person is consuming. In the NHANES studies described above, a trained interviewer conducts an hours-long interview, probing which foods were consumed over the last one or two days. This is very resource intensive and it places a significant respondent burden on the interviewee. A less resource-intensive approach is to use a food frequency questionnaire (FFQ) that is filled out by the subject. Respondent burden is still quite high, however, and such a tool is vulnerable to respondent bias and faulty recall.¹⁶,¹⁷ An example FFQ is shown in Figure 2 (note that in this case respondents were recalling food frequencies over an entire year).

Figure 2. — Sample of a food frequency questionnaire.

Food item UPC data for nutritional analysis

In our work to find better ways to measure food intakes, we have been focusing on point-of-sale grocery data. Virtually all grocery stores use barcode scanners to identify items purchased. Many large grocery chains link these purchases to a shopping card that, in turn, associates households with their purchase history. Food retailers use that purchase history to predict future customer interests and tailor advertising accordingly. As a means to track food purchases by households, Universal Product Code (UPC) shopping data offer a very attractive alternative to FFQs and food diaries:

The data are collected passively, so respondent burden is eliminated, with no opportunity for reporting bias;
Both the grocer and the customer have strong incentives to use the shopping card consistently;
Food purchases are tracked continuously, rather than in snapshots like the FFQ;
Acquiring the data requires minimal resources, since the data are virtually free, being collected as a part of normal operations by the grocery retailer.

UPC data have limitations as well, and we detail those in the Discussion section. We are not the only group exploring the utility of easily scanned codes for nutritional informatics research. Lambert et al. used a smart “swipe” card to study the eating habits of students in a large boys’ school cafeteria.¹⁸ Those items were manually mapped to a UK database, the McCance and Widdowson’s Composition of Foods dataset, containing about 3,400 food items. Ni Mhurchu and her team asked shoppers in New Zealand to scan foods purchased over the course of 12 weeks in one supermarket using a handheld UPC scanner. They manually merged the scan data with the New Zealand Food Composition Database, which contains about 2,700 food items.

In our work, we have teamed with a large national grocery retailer who provides UPC purchase data that can span a year or more of transactions for any given household. That retailer tracks over 120,000 food and beverage UPC codes. Since our dietary quality metric of interest is the HEI, we are exploring ways to automate linking UPCs to the food codes of the USDA’s FNDDS and MPED databases. The USDA data cover nearly 8,000 food items. Given the sizes of these datasets, manual mapping is not practical. The retail food market is also quite dynamic, with new foods being introduced and older foods being retired all the time. For example, five years ago there were very few “energy drink” or “Greek yogurt” products on the market, whereas today there are scores of these products. Our goal is to build a self-maintaining knowledge base with our qDIET tool to facilitate linking point-of-sale grocery items to nutritional content.

Methods

Despite the fact that all packaged foods in the US carry a UPC label and an FDA-mandated nutrition facts panel, there is no open-source database that links the two. There are commercial sources for UPC-linked nutrition data,¹⁹ but they are cost-prohibitive on a large scale and their use is restricted by non-disclosure agreements that make sharing difficult. In a previous paper,²¹ we described a method for manually mapping grocery retail data to a USDA database called the Standard Reference (SR).²⁰ We were able to show, as a proof of principle, that we could map 70% of 26,854 unique UPCs to the SR. Since the USDA FNDDS and MPED are integral to calculating the HEI, they are mapping targets for our current work.

The Food Item Dataset

One of the main aims of our research has been to collect data longitudinally in order to analyze consumer market baskets and describe food purchasing patterns with the aid of data mining and statistical tools. To this end, we obtained complete sets of grocery transaction data for a sample of 50 consented households, representing their food-shopping activity over a period of 12 months, with dates ranging from February 2007 to April 2008. IRB approval was obtained under University of Utah IRB #18830 (exempt).

Participants enrolled in the study met the following inclusion criteria:

Participants were members of the loyalty card program at our grocery retail partner’s stores, with at least 12 months of retrospective purchases linked to their card. Having a complete year’s worth of data has been suggested for nutritional studies to account for seasonal changes in dietary intake or purchasing behavior;
Participants were classified by our grocery retail partner as one of their “top tier shoppers,” ensuring that the participant household shopped at this specific supermarket chain frequently;
Participants resided in the Salt Lake Valley, Utah, at the time of the pilot study in 2007–08.

Data Collection Procedure

Our grocery retailer partner contacted their Frequent Shopper Card customers through a recruitment letter. Interested participants who met the inclusion criteria were to voluntarily contact the University of Utah research team by phone for further information and enrollment. For this pilot we planned on recruiting 50 households. After mailing the recruitment letters, we received an overwhelming response and stopped taking information from potential participants after 100 households had contacted us. Households contacting us after the end of our enrollment period were routed to a voice message letting them know that their interest was appreciated, but that due to the great response, enrollment in the study was closed.

Data Cleaning

We received the transaction data from the supermarket chain in a single flat file, in a pipe-delimited (‘|’) text format, which was initially loaded into Microsoft Access for the pilot study.²¹ The USDA databases of interest, the USDA SR, the FNDDS, and the MPED, were also downloaded in the Microsoft Access format. In order to create a more unified and standardized workspace, we migrated all MS Access tables to MySQL, using the ODBC database connectors provided by the MySQL development community. The flat grocery transaction data were cleaned and converted to a normalized relational database schema. We retained the retail supermarket partner’s product classification categories in our data model, in which food items are organized into four hierarchical levels (Department, Commodity, Sub-commodity, Food-item) as shown in Figure 3. The data reside in a HIPAA-compliant, high-performance compute cluster at the University of Utah’s Center for High Performance Computing.

Figure 3. — The product hierarchy of our supermarket partner, using a milk example (this type of schema is common across grocery retailers).

We received the following product-specific information for every food item in the transaction data set: customer ID, the UPC, short text item description (for printing on a register receipt), timestamp, price, quantity, size, weight (if the item was sold by weight, e.g., fresh produce, deli products, fresh meats, bakery goods, etc.), and the corresponding grocer item hierarchy. A sample record is shown in Figure 4.

Figure 4. — Record format from the retail food item database.
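As a rough illustration of the cleaning step, the sketch below parses one pipe-delimited transaction line into typed fields. The column order, the field names, and the sample line (including its UPC) are hypothetical; the source lists which attributes were delivered but not their layout, and the original pipeline used MS Access and MySQL rather than Python.

```python
import csv
import io

# Hypothetical column order for the flat file (an assumption for illustration).
FIELDS = [
    "customer_id", "upc", "description", "timestamp", "price",
    "quantity", "size", "weight", "department", "commodity", "sub_commodity",
]

# Made-up sample line; the empty eighth field means "not sold by weight".
SAMPLE = (
    "H001|000000000001|KELL RAISIN BRAN CEREL|2007-03-14 17:42:00|"
    "3.99|1|20 OZ||GROCERY|COLD CEREAL|ADULT CEREAL\n"
)


def parse_transactions(text):
    """Parse pipe-delimited transaction lines into a list of dicts."""
    reader = csv.DictReader(
        io.StringIO(text), fieldnames=FIELDS,
        delimiter="|", quoting=csv.QUOTE_NONE,
    )
    rows = []
    for row in reader:
        row["price"] = float(row["price"])
        row["quantity"] = int(row["quantity"])
        # Weight is only present for items sold by weight.
        row["weight"] = float(row["weight"]) if row["weight"] else None
        rows.append(row)
    return rows
```

The UPC is deliberately kept as a string so that leading zeros survive; converting it to an integer would make later joins against other UPC sources unreliable.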

The qDIET tool

We are developing a tool, the quality Dietary Information Extraction Tool (qDIET), to computationally link records of point-of-sale food items, at the UPC or item descriptor level, to the corresponding nutritional information in the USDA’s FNDDS tables, and then to use the USDA’s standard 8-digit food code to retrieve information from the MPED for dietary analysis. Figure 5 shows the Entity-Relationship (ER) diagram for some of the key tables in our MySQL relational database. On the left side of the diagram are the normalized supermarket tables, and on the right are the USDA’s databases (the FNDDS, SR Links, and MPED), represented here schematically with one table each. As indicated in this ER diagram, our goal is to create an automated data crosswalk, i.e., the pink table in the center, linking the grocery UPCs on the left to the USDA database information on the right. Given size limitations, we present a high-level ER model here, to give a qualitative sense of the data relationships and the major relational groups.

Figure 5. — The Entity-Relationship diagram of the MySQL database that underpins *qDIET*.
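The crosswalk idea can be made concrete with a minimal sketch. It uses an in-memory SQLite database from Python's standard library purely so that the example is self-contained (the project database is MySQL), and every table name, column name, UPC, and short descriptor here is hypothetical; only the food code and its description are taken from Table 2.

```python
import sqlite3


def build_demo_db():
    """Create a tiny three-table schema: grocery items, FNDDS foods, crosswalk."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE grocery_item (
            upc TEXT PRIMARY KEY,
            short_description TEXT NOT NULL
        );
        CREATE TABLE fndds_food (
            food_code TEXT PRIMARY KEY,
            description TEXT NOT NULL
        );
        CREATE TABLE crosswalk (
            upc TEXT NOT NULL REFERENCES grocery_item(upc),
            food_code TEXT NOT NULL REFERENCES fndds_food(food_code),
            PRIMARY KEY (upc, food_code)
        );
        """
    )
    # Made-up UPC and receipt text, for illustration only.
    conn.execute("INSERT INTO grocery_item VALUES (?, ?)",
                 ("000000000001", "STK SUB SANDW"))
    conn.execute("INSERT INTO fndds_food VALUES (?, ?)",
                 ("27515000", "Steak submarine sandwich with lettuce and tomato"))
    conn.execute("INSERT INTO crosswalk VALUES (?, ?)",
                 ("000000000001", "27515000"))
    return conn


def food_codes_for_upc(conn, upc):
    """Follow the crosswalk from a UPC to its FNDDS food code(s)."""
    query = (
        "SELECT f.food_code, f.description "
        "FROM crosswalk AS c "
        "JOIN fndds_food AS f ON f.food_code = c.food_code "
        "WHERE c.upc = ? ORDER BY f.food_code"
    )
    return conn.execute(query, (upc,)).fetchall()
```

The composite primary key on the crosswalk allows one UPC to map to more than one food code, which matches the stated goal of linking a UPC to its matching FNDDS food code(s).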

The problem space:

A significant drawback with these sales log data is that no full descriptors are available for the items. Remarkably, even at the national level, this grocery store chain does not maintain a full-length description of items. Rather they rely on the short ‘sales-tape’ description such as in the example discussed below.

Like most other food retailers, our supermarket data-sharing partner had their own idiosyncratic way of naming food items, abbreviating their descriptions to fewer than 40 characters in order to accommodate the narrow dimensions on a point-of-sale itemized sales receipt. Consider this example: the product name “Kelloggs Raisin Bran Cereal” is displayed as “KELL RAISIN BRAN CEREL.” Presumably, the retailer does not require complete product names in order to track sales in their point-of-sale systems.

However, a complete food item descriptor is a vital piece of information for our research. Our first pass at matching records will exploit fuzzy string matching and related natural language processing techniques to match the UPC textual descriptions to the USDA item textual descriptions. Our preliminary work suggests these techniques will work better with full-text descriptors than they will with the short-form sales descriptors. In addition, a full product description from the Web (see next section) will almost always include the net quantity, size, and weight information along with the name of the food item or product brand. Those data are crucial in estimating food intakes at the household level.
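To give a sense of what fuzzy string matching looks like in practice, here is a minimal sketch using `difflib.SequenceMatcher` from the Python standard library. It is not the matching method used in qDIET, which the text describes only in general terms; the candidate list and helper names are made up for illustration.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, drop commas, and collapse runs of whitespace."""
    return " ".join(text.lower().replace(",", " ").split())


def best_match(short_descriptor, candidates):
    """Return (similarity, candidate) for the most similar candidate string."""
    query = normalize(short_descriptor)
    scored = [
        (SequenceMatcher(None, query, normalize(candidate)).ratio(), candidate)
        for candidate in candidates
    ]
    return max(scored)


# Illustrative candidates only; these are not real database records.
CANDIDATES = [
    "Kelloggs Raisin Bran Cereal",
    "Steak submarine sandwich",
    "Whole milk",
]
```

Calling `best_match("KELL RAISIN BRAN CEREL", CANDIDATES)` selects the Raisin Bran entry, because the abbreviated receipt text still shares long character runs with the full name. Heavily abbreviated descriptors share far fewer characters with their targets, which is consistent with the observation above that full-text descriptors match better.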

Web Crawling:

We have developed an object-oriented framework in Java that leverages several data integration resources and the Web to obtain complete product item descriptors along with packaging information (size, weight, quantity) of the food items in the retail supermarket database tables. It is possible to build on existing application programming interfaces (APIs) that combine data from more than one Web source. These ‘data mashups’ allow developers to connect to data resources programmatically and thus support the process of automated online information retrieval.

The two mashups we have relied on most are Google’s Search API for Shopping and Factual.com. Google provides access to retail product information that has been voluntarily uploaded to Google’s servers by participating merchants and resellers. Factual is a typical ‘data aggregator,’ a platform that provides access to well-curated global data. These APIs are cost-free, although Google imposes a maximum of 2,500 API queries per day and Factual imposes a maximum of 500 API queries per day. Unlike many other mashups that rely solely on crowd-sourcing models or volunteer contributions, information in the Google shopping API comes from retailers and resellers who use the UPC information as part of their business model, making it trustworthy. Likewise, the Factual.com platform provides a dedicated channel for highly curated data that have been cleaned, standardized, and linked to canonical, global labels (called GTINs) for accuracy.
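The daily query limits translate directly into crawl time. As a back-of-the-envelope sketch, assuming one query per UPC, no retries, and the 12,332 distinct UPCs in our transaction set, the number of days needed is the query count divided by the daily quota, rounded up:

```python
import math


def days_to_crawl(n_queries, daily_quota):
    """Whole days needed to issue n_queries under a fixed daily quota."""
    if daily_quota <= 0:
        raise ValueError("daily_quota must be positive")
    return math.ceil(n_queries / daily_quota)


DISTINCT_UPCS = 12332   # distinct UPCs purchased by the 50 households
GOOGLE_QUOTA = 2500     # queries per day
FACTUAL_QUOTA = 500     # queries per day

google_days = days_to_crawl(DISTINCT_UPCS, GOOGLE_QUOTA)    # 5
factual_days = days_to_crawl(DISTINCT_UPCS, FACTUAL_QUOTA)  # 25
```

Under these assumptions a UPC-by-UPC pass takes about 5 days with the Google quota and 25 days with the Factual quota, so a crawler has to persist its progress between daily runs.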

The process flow for Web crawling is shown in Figure 6.

Google’s Search API for Shopping

We used three different kinds of raw data as input to extract product descriptors for as many grocery food items as possible and to match the food items with the FNDDS descriptors. For the first iteration, we crawled the Web with the FNDDS descriptors as our input search string. In the second iteration, we crawled using the supermarket partner’s “sub-commodity” as input. In the third and final iteration, we crawled directly by the UPC code itself.

For each query, we requested that the Google API retrieve a maximum of 1,000 products, rank-ordered by relevance. The results were returned in JavaScript Object Notation (JSON) format, which was then parsed to extract individual product attributes like the UPC, the Product Name, Product Description, and Brand name. The result set of 1,000 UPCs was then compared with the UPC codes in the grocery database tables. If a match was found, qDIET marked it as a ‘hit’ and stored the product attributes in the Web-crawling results database.
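The hit-detection step can be sketched as follows. The JSON layout and field names (`items`, `upc`, `name`, `description`, `brand`) are assumptions made for this example and are not the documented response format of the Google API; the sample UPCs are made up.

```python
import json

# Hypothetical response body, shaped only for this illustration.
SAMPLE_RESPONSE = json.dumps({
    "items": [
        {"upc": "000000000001", "name": "Raisin Bran Cereal 20 oz",
         "description": "Whole grain cereal with raisins", "brand": "Kelloggs"},
        {"upc": "000000000777", "name": "Unrelated product",
         "description": "Not in the grocery tables", "brand": "Other"},
    ]
})


def find_hits(response_text, grocery_upcs):
    """Keep only returned products whose UPC appears in the grocery tables."""
    payload = json.loads(response_text)
    hits = {}
    for item in payload.get("items", []):
        upc = item.get("upc")
        if upc in grocery_upcs:
            hits[upc] = {
                "name": item.get("name"),
                "description": item.get("description"),
                "brand": item.get("brand"),
            }
    return hits
```

Holding the grocery UPCs in a Python `set` keeps each membership test at constant time on average, which matters when result sets of up to 1,000 products are compared against more than 12,000 UPCs for every query.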

First Iteration:

Each FNDDS descriptor has a unique 8-digit food code associated with it. The food code is assigned by the USDA according to a data scheme that allocates the first three or four digits of the code with various food groups and subgroups. The first digit in the food code can represent and identify one of nine major food groups: 1 = milk and milk products; 2 = meat, poultry, fish, and mixtures; 3 = eggs; 4 = legumes, nuts, and seeds; 5 = grain products; 6 = fruits; 7 = vegetables; 8 = fats, oils, and salad dressings; and 9 = sugars, sweets, and beverages. The second, third, and sometimes fourth digits of a food code specify increasingly more precise subgroups within these nine major food groups. The remaining digits are used for identification of particular foods within a subgroup and its numerical sequence.
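The first-digit rule is simple enough to express directly. The sketch below maps an 8-digit food code to its major food group using the nine groups listed above; the function name is ours, and the subgroup digits are not decoded because their meanings are not enumerated here.

```python
MAJOR_FOOD_GROUPS = {
    1: "milk and milk products",
    2: "meat, poultry, fish, and mixtures",
    3: "eggs",
    4: "legumes, nuts, and seeds",
    5: "grain products",
    6: "fruits",
    7: "vegetables",
    8: "fats, oils, and salad dressings",
    9: "sugars, sweets, and beverages",
}


def major_food_group(food_code):
    """Return the major food group named by the first digit of a food code."""
    code = str(food_code)
    if len(code) != 8 or not code.isdigit():
        raise ValueError("expected an 8-digit food code")
    first_digit = int(code[0])
    if first_digit not in MAJOR_FOOD_GROUPS:
        raise ValueError("first digit must be between 1 and 9")
    return MAJOR_FOOD_GROUPS[first_digit]
```

For instance, the code 27515000 from Table 2 begins with 2 and therefore falls under meat, poultry, fish, and mixtures.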

After a careful review of the various food descriptions in FNDDS, we decided to group the FNDDS descriptors by the first six digits of their food codes and to use one descriptor per group as a more generic query parameter. We hypothesized that crawling the Web with a more generalized description would enable the extraction of as many full descriptors for the grocery items as possible. At the 8-digit level, the FNDDS food descriptors were often too specific, and successive entries returned largely redundant result sets. We found six digits to be the optimal level at which the FNDDS descriptors were substantially different and distinct from one another. Table 2 illustrates the rules we applied with an example. Since all of the FNDDS food descriptors share the same first six digits in this example (the yellow highlighted digits), we select the first record of the subgroup, which in most cases is the most generalized description, and exclude all the other records with a matching code prefix. That first descriptor is then analyzed with rule-based regular expressions in Java. For example, one of these rules deletes all words after a “with” clause, until a comma or end of line is encountered. So in this case the final query parameter would read “steak submarine sandwich.”

Table 2.

Sample food codes and their corresponding descriptions in FNDDS.

Food_code	FNDDS_Main_food_description
27515000	Steak submarine sandwich with lettuce and tomato
27515010	Steak sandwich, plain, on roll
27515020	Steak & cheese submarine sandwich, with lettuce & tomato
27515030	Steak and cheese sandwich, plain, on roll
27515040	Steak and cheese submarine sandwich, plain, on roll
27515080	Steak sandwich, plain, on biscuit
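The two rules illustrated by Table 2 (keep the first descriptor per 6-digit prefix, then strip the “with” clause) can be sketched as below. The original rules were written as regular expressions in Java; this Python version is our own approximation of the one rule described in the text, and the regular expression shown is an assumption, not the authors' pattern.

```python
import re

# Remove a "with ..." clause up to the next comma or the end of the line.
WITH_CLAUSE = re.compile(r"\s*,?\s*\bwith\b[^,]*", flags=re.IGNORECASE)

TABLE_2 = [
    ("27515000", "Steak submarine sandwich with lettuce and tomato"),
    ("27515010", "Steak sandwich, plain, on roll"),
    ("27515020", "Steak & cheese submarine sandwich, with lettuce & tomato"),
    ("27515030", "Steak and cheese sandwich, plain, on roll"),
    ("27515040", "Steak and cheese submarine sandwich, plain, on roll"),
    ("27515080", "Steak sandwich, plain, on biscuit"),
]


def to_query(description):
    """Turn one FNDDS descriptor into a lowercase, generalized search string."""
    stripped = WITH_CLAUSE.sub("", description)
    return stripped.strip(" ,").lower()


def build_queries(rows):
    """Keep the first descriptor seen for each 6-digit food code prefix."""
    queries = {}
    for food_code, description in rows:
        prefix = food_code[:6]
        if prefix not in queries:  # later records with this prefix are skipped
            queries[prefix] = to_query(description)
    return queries
```

Applied to the six rows of Table 2, `build_queries` yields a single entry for the prefix 275150 whose query string is “steak submarine sandwich”, matching the example in the text. The function relies on the rows arriving in FNDDS order, since “first record” is defined by position.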

Second Iteration:

For the second iteration, we used our supermarket retail partner’s sub-commodity descriptions as the input query parameters. These sub-commodity descriptions were well formed, with complete words and few abbreviations. In a sense they represented a more generalized search term for all of the grocery food items that fell under that particular food category.

Third Iteration:

For the third iteration, we used our retail supermarket partner’s set of item UPCs as the input query parameters.

The Factual Platform

We accessed Factual Product data with UPC-based queries. Factual offers nutrition information for more than 150,000 food and beverage products and ingredient lists for more than 350,000 packaged goods. Some of the key attributes that the API returns include UPC, EAN-13, Product Name, Manufacturer, Brand, Size, Category and many more. In a significant number of cases, Factual will also provide the Nutrition Facts Panel information for the product UPC, such as energy (kcal) per serving and key macronutrients (protein, carbohydrates, total fat, etc.) as a percentage of the recommended Daily Values per serving. These attributes may facilitate dietary analysis in future work.

Results

The 50-Family Year Long Dataset

There were a total of 6,610 shopping transactions logged by the 50 families over their one year of purchasing. Collectively, they purchased a total of 98,066 items, of which 12,332 were unique. Starting at the top tier of the supermarket hierarchy (see Figure 3): there were 35 unique Departments; 194 unique Commodities; and 949 unique Sub-commodities. As a proof of concept, we manually mapped a subset of the grocery food item UPCs for one week of shopping activity to the corresponding FNDDS food codes (n=42 households). Using the HEI-2005 scoring method developed in SAS by the USDA, we were able to estimate the household HEI for each of these weekly market baskets. The distribution of Total HEI scores ranged from a minimum of 23 to a maximum of 79 (out of 100), with a mean score of 51.4, and showing a slightly bimodal curve overall. Higher densities of HEI scores in the 35–45 and in the 55–65 ranges compared to the expected midpoint values indicated a ‘less healthy’ and a ‘more healthy’ clustering of households, but the sub-sample of 42 households for the week in question was too small for this observation to be considered conclusive at this time.

Owing to its robustness across many kinds of statistical distributions, we ran the non-parametric Kolmogorov-Smirnov (K-S) empirical distribution function two-sample test (SAS PROC NPAR1WAY) to compare the Total HEI scores for these households based on our grocery data with the Total HEI scores for NHANES 2007–08 respondents who had reported over 75% daily intakes from retail food stores (DR1FS=1). The results showed no significant differences between the two distributions (p > 0.10; we treated p > 0.05 as indicating no statistically significant difference). Again, the small sample size in the case of our grocery households may have been a factor, though this outcome remained the same when we randomly selected the same (small) number of respondent records (n=42) from the NHANES data sets for the K-S test statistic, and the shapes of the distribution curves appeared similar.
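For readers unfamiliar with the test, the two-sample K-S statistic is the largest vertical gap between the two empirical cumulative distribution functions. The sketch below computes only that statistic in plain Python; it does not compute a p-value, it is not the SAS procedure used in the analysis, and any numbers passed to it here are toy values rather than study data.

```python
from bisect import bisect_right


def ks_two_sample_statistic(sample_a, sample_b):
    """Largest absolute difference between two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    # Both step functions change only at observed values,
    # so it is enough to check the gap at each observation.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap
```

Identical samples give a statistic of 0, and samples with no overlap give 1. Because the statistic depends only on the ordering of the pooled values, it makes no assumption about the shape of either distribution, which is why the test suits a possibly bimodal set of HEI scores.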

Web Crawling Results

Before starting the Web-crawling process, we had to account for the food items in our sample grocery transaction set that were sold as store brands. Store brands (also known as own brands, house brands, or private-label brands) are unique to a particular retailer. While the Google Shopping server did not have information about these store brands, Factual’s real-time data had information on several of them, including our data partner’s. qDIET’s success rates are shown in Table 3.

Table 3.

Summary of matching results using the Google and Factual APIs

Matching results	Count
Total number of distinct UPCs purchased by the 50 households in 12 months	12,332
Total Number of full descriptors extracted with Google’s Search API for Shopping	3,790
Recall rate with Google API	30.7%
Total Number of full descriptors extracted with Factual API	8,822
Recall rate with Factual API	71.5%
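The recall rates in Table 3 follow from dividing the number of full descriptors retrieved by the number of distinct UPCs. A short check, using only the counts reported in the table (the function name is ours):

```python
def recall_rate(matched, total):
    """Percentage of UPCs for which a full descriptor was retrieved."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * matched / total, 1)


DISTINCT_UPCS = 12332

google_recall = recall_rate(3790, DISTINCT_UPCS)   # 30.7
factual_recall = recall_rate(8822, DISTINCT_UPCS)  # 71.5
```

Both values agree with the table. The two counts cannot simply be added to obtain a combined recall, since the table does not report how many UPCs were found by both services.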

Discussion and Conclusion

We argue in this article that the nation, indeed the world, is facing a serious epidemic of obesity. That epidemic is extremely costly, both in healthcare dollars and in human morbidity and mortality. The literature makes it clear that clinicians are often ill-prepared to discuss meaningful ways to reduce weight with their patients. We believe that, in a shared decision-making model, having a metric in the EHR that both the patient and the physician can monitor and discuss over time is a first step in realizing the potential benefit of nutritional informatics. Traditional forms of collecting dietary data are cumbersome, expensive, and prone to bias and error. Since grocery stores collect food sales data for marketing purposes, we hypothesize that those same data could be used to build a longitudinal record of dietary quality, one suitable for inclusion in the EHR in the form of HEI-2005/2010 scores. In addition to helping calculate the HEI score, those same data could provide the raw input for a recommendation system that provides clinical decision support. For example, if the HEI component score for “Dark green and orange vegetables and legumes” is lower than recommended, the food purchase history shows exactly which of these food items a household actually prefers to purchase, and these would be a clear basis for a recommendation.

Limitations

This is early work, and there are several significant technical challenges to be addressed by the qDIET tool. First, in our 50-family longitudinal study, 2,359 (23.8%) of the UPCs represent store brands. This is not unique to our grocery chain; most food retailers maintain store brands.²² Factual has product information for some of the store brands that our grocery partner sells, but not for all. As a first pass to address this, we will attempt to expand the short food descriptor systematically into a fuller description that we can compare to FNDDS food descriptors. Second, it is evident that people eat food away from home. A recent study conducted by Todd et al. at the USDA-ERS indicates that on average Americans eat 0.67 meals away from home per day.²³ Assuming three meals per day, that means that we eat about 77.7% of our meals at home (1 − 0.67/3), on average. The Todd study estimates the overall energy burden from foods eaten away from home at only 134 kcal, so our hypothesis is that the dietary signal in grocery data, though incomplete, is still a useful surrogate indicator of overall dietary quality. Lastly, we assume that we can eventually convince grocery retailers to share data with healthcare providers. Our work to date with a national food retailer gives reason for optimism on this point.

Future Work

This exploratory study has provided evidence that leveraging these product APIs yields a much more complete set of full product descriptions and packaging information than the grocery sales transaction record alone. There are several other promising APIs for online product search that can be loosely coupled to qDIET, both for a more extensive retrieval of full product descriptions and for a more robust search solution that does not rely on any single Web services provider for data. Much work remains to be done to develop qDIET into an automated, easily maintained mapping tool that accurately and reliably links any given grocery UPC with its matching FNDDS food code(s).
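One way to picture the loose coupling described above is a chain of interchangeable lookup functions that are tried in order. The sketch below is an assumption about how such a design might look, not a description of qDIET's implementation; `describe_upc` and the provider callables are hypothetical, and no real provider API is called.

```python
from typing import Callable, Iterable, Optional, Tuple

# A provider is any callable that maps a UPC string to a full product
# description, or returns None when it has no record for that UPC.
Lookup = Callable[[str], Optional[str]]


def describe_upc(
    upc: str, providers: Iterable[Tuple[str, Lookup]]
) -> Optional[Tuple[str, str]]:
    """Try each (name, lookup) pair in order and return the first hit as
    (provider_name, description). Return None if no provider knows the UPC."""
    for name, lookup in providers:
        try:
            description = lookup(upc)
        except Exception:
            # An outage at one provider should not block the others.
            continue
        if description:
            return name, description
    return None
```

Because each provider sits behind the same small interface, a service can be added, reordered, or dropped without changing the calling code, which is the property that reduces dependence on any single Web services provider.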

Acknowledgments

We are indebted to Dr. Patricia M. Guenther at the USDA’s Center for Nutrition Policy and Promotion for her insights into the HEI-2005/2010 scoring methodology. The authors would like to thank the University of Utah Center for High Performance Computing for allocation of computer time. This work was funded in part by a Seed Grant from the University of Utah, an Innovation Research Grant administered by the Utah National Children’s Study, and training grant T15-LM007124 from the National Library of Medicine.

References

1. CDC. Adult Obesity Facts. 2013. [cited March 07, 2013]; Available: http://www.cdc.gov/obesity/data/adult.html.
2. CDC. CDC Grand Rounds: Childhood Obesity in the United States. 2013. Available: http://www.cdc.gov/mmwr/preview/mmwrhtml/mm6002a2.htm.
3. De Onis M, Blössner M. Prevalence and trends of overweight among preschool children in developing countries. Am J Clin Nutr. 2000;72(4):1032–9. doi: 10.1093/ajcn/72.4.1032.
4. Freedman DS, Mei Z, Srinivasan SR, Berenson GS, Dietz WH. Cardiovascular risk factors and excess adiposity among overweight children and adolescents: the Bogalusa Heart Study. J Pediatr. 2007 Jan;150(1):12–7.e2. doi: 10.1016/j.jpeds.2006.08.042.
5. Taylor ED, Theim KR, Mirch MC, Ghorbani S, Tanofsky-Kraff M, Adler-Wailes DC, et al. Orthopedic complications of overweight in children and adolescents. Pediatrics. 2006 Jun;117(6):2167–74. doi: 10.1542/peds.2005-1832.
6. Whitlock EP, Williams SB, Gold R, Smith PR, Shipman SA. Screening and interventions for childhood overweight: a summary of evidence for the US Preventive Services Task Force. Pediatrics. 2005 Jul;116(1):e125–44. doi: 10.1542/peds.2005-0242.
7. Finkelstein EA, Trogdon JG, Cohen JW, Dietz W. Annual medical spending attributable to obesity: payer-and service-specific estimates. Health Aff (Millwood). 2009 Sep-Oct;28(5):w822–31. doi: 10.1377/hlthaff.28.5.w822.
8. CDC. Clinical Guidelines on the Identification, Evaluation, and Treatment of Overweight and Obesity in Adults. 2013. [cited March 07, 2013]; Available: http://www.nhlbi.nih.gov/guidelines/obesity/ob_home.htm.
9. Jackson JE, Doescher MP, Saver BG, Hart LG. Trends in professional advice to lose weight among obese adults, 1994 to 2000. J Gen Intern Med. 2005 Sep;20(9):814–8. doi: 10.1111/j.1525-1497.2005.0172.x.
10. Smith AW, Borowski LA, Liu B, Galuska DA, Signore C, Klabunde C, et al. U.S. primary care physicians’ diet-, physical activity-, and weight-related care of adult patients. Am J Prev Med. 2011 Jul;41(1):33–42. doi: 10.1016/j.amepre.2011.03.017.
11. Appel LJ, Clark JM, Yeh HC, Wang NY, Coughlin JW, Daumit G, et al. Comparative effectiveness of weight-loss interventions in clinical practice. N Engl J Med. 2011 Nov 24;365(21):1959–68. doi: 10.1056/NEJMoa1108660.
12. Bodner-Montville J, Ahuja JK, Ingwersen LA, Haggerty ES, Enns CW, Perloff BP. USDA food and nutrient database for dietary studies: released on the web. J Food Compost Anal. 2006;19:S100–S7.
13. Bowman S, Friday J, Moshfegh A. MyPyramid Equivalents Database 2.0 for USDA Survey Foods, 2003–2004. Beltsville MD: Food Surveys Research Group, Beltsville Human Nutrition Research Center, Agricultural Research Service, USDA; 2008.
14. Guenther PM, Reedy J, Krebs-Smith SM, Reeve BB. Evaluation of the Healthy Eating Index-2005. J Am Diet Assoc. 2008 Nov;108(11):1854–64. doi: 10.1016/j.jada.2008.08.011.
15. Guenther PM, Casavale KO, Reedy J, Kirkpatrick SI, Hiza HA, Kuczynski KJ, et al. Update of the Healthy Eating Index: HEI-2010. J Acad Nutr Diet. 2013. doi: 10.1016/j.jand.2012.12.016.
16. Illner AK, Freisling H, Boeing H, Huybrechts I, Crispim SP, Slimani N. Review and evaluation of innovative technologies for measuring diet in nutritional epidemiology. Int J Epidemiol. 2012 Aug;41(4):1187–203. doi: 10.1093/ije/dys105.
17. Thompson FE, Subar AF, Loria CM, Reedy JL, Baranowski T. Need for technological innovation in dietary assessment. J Am Diet Assoc. 2010 Jan;110(1):48–51. doi: 10.1016/j.jada.2009.10.008.
18. Lambert N, Plumb J, Looise B, Johnson IT, Harvey I, Wheeler C, et al. Using smart card technology to monitor the eating habits of children in a school cafeteria: 1. Developing and validating the methodology. J Hum Nutr Diet. 2005 Aug;18(4):243–54. doi: 10.1111/j.1365-277X.2005.00617.x.
19. Gladson Nutrition Database. 2013. Available from: http://www.gladson.com/our-services/nutrition-database.
20. Haytowitz D, Lemar L, Pehrsson P, Exler J, et al. USDA National Nutrient Database for Standard Reference, Release 24. Beltsville: Agricultural Research Service, US Department of Agriculture; 2011.
21. Brinkerhoff KM, Brewster PJ, Clark EB, Jordan KC, Cummins MR, Hurdle JF. Linking Supermarket Sales Data To Nutritional Information: An Informatics Feasibility Study. AMIA Annu Symp Proc. American Medical Informatics Association; 2011.
22. Collins-Dodd C, Lindley T. Store brands and retail differentiation: the influence of store image and store brand attitude on store own brand perceptions. J Retail Consum Serv. 2003;10(6):345–52.
23. Todd J, Mancino L, Lin BH. The impact of food away from home on adult diet quality. USDA-ERS Economic Research Report Paper #90. 2010.

