This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

medRxiv [Preprint]. 2025 Jan 16:2022.04.23.22274217. [Version 5] doi: 10.1101/2022.04.23.22274217

GroceryDB: Prevalence of Processed Food in Grocery Stores

Babak Ravandi 1,*, Gordana Ispirova 2,*, Michael Sebek 1,*, Peter Mehler 3, Albert-László Barabási 1,2,4, Giulia Menichetti 2,1,5,*
PMCID: PMC11177926  PMID: 38883708

Abstract

The offering of grocery stores is a strong driver of consumer decisions, shaping their diet and long-term health. While highly processed foods like packaged products, processed meat, and sweetened soft drinks have been increasingly associated with unhealthy diets, information on the degree of processing characterizing an item in a store is not straightforward to obtain, limiting the ability of individuals to make informed choices. Here we introduce GroceryDB, a database with over 50,000 food items sold by Walmart, Target, and Whole Foods, unveiling how big data can be harnessed to empower consumers and policymakers with systematic access to the degree of processing of the foods they select, and the potential alternatives in the surrounding food environment. The extensive data gathered on ingredient lists and nutrition facts enables a large-scale analysis of ingredient patterns and degrees of processing, categorized by store, food category, and price range. Our findings reveal that the degree of food processing varies significantly across different food categories and grocery stores. Furthermore, this data allows us to quantify the individual contribution of over 1,000 ingredients to ultra-processing. GroceryDB and the associated http://TrueFood.Tech/ website make this information accessible, guiding consumers toward less processed food choices while assisting policymakers in reforming the food supply.

Introduction

Food ultra-processing has drastically increased productivity and shelf life, addressing the issue of food availability to the detriment of food systems sustainability and health [1–4]. Indeed, there is increasing evidence that our over-reliance on ultra-processed food (UPF) has fostered unhealthy diets [5]. The sheer number of peer-reviewed articles investigating the link between the degree of food processing and health embodies a general consensus among independent researchers on the health relevance of UPF, contributing up to 60% of consumed calories in developed nations [6–8]. For instance, recent studies have linked the consumption of UPF to non-communicable diseases like metabolic syndrome [9–15], and exposure to industrialized preservatives and pesticides [16–20]. This body of work has driven a paradigm shift from focusing solely on food security, which emphasizes access to affordable food, to prioritizing nutrition security [21, 22]. Nutrition security stresses equitable access to healthy, safe, and affordable foods essential for optimal health and well-being, as defined by the USDA [23, 24], echoing the recent White House Conference on Hunger, Nutrition, and Health [25].

Much of UPF reaches consumers through grocery stores, as documented by the National Health and Nutrition Examination Survey (NHANES), indicating that in the US over 60% of the food consumed comes from grocery stores (Figure S1). The high reliance on UPFs and their potential negative health effects raise numerous critical questions, such as: 1) How can we determine the degree of processing of food items? 2) What methods can be used to quantify the extent of food processing in the food supply? 3) What alternatives can we identify to reduce UPF consumption?

Measuring the degree of food processing is a key step in addressing these questions, but it is not straightforward. Indeed, food labels often display mixed messages, partly driven by reductionist metrics focusing on one nutrient at a time [26], and partly because of the contrasting criteria on how to classify processed foods [27]. The ambiguity and inconsistency of current food processing classification systems (FPCS) have led to conflicting results on their role as risk factors for non-communicable chronic diseases [28, 29]. Some of these classification systems also suffer from poor inter-rater reliability and lack of reproducibility, issues rooted in purely descriptive expertise-based approaches, leaving room for ambiguity and differences in interpretation [27, 28, 30]. Hence, there is a growing call among scientists for a more objective definition of the degree of food processing, based on underlying biological mechanisms rather than subjective opinions of different research groups [28]. Among the proposed areas for aligning food processing definitions, the nutritional profile of food is currently the only aspect consistently regulated and reported worldwide [27, 28, 31].

The research efforts outlined in [28] align with a growing demand for high-quality and internationally comparable statistics to promote objective metrics, reproducibility, and data-driven decision-making, advancing our convergence towards the Sustainable Development Goals (SDGs) [32, 33]. Artificial intelligence (AI) methodologies [33–36], in particular, are increasingly being utilized for their potential as objective, data-driven tools to advance populations’ nutrition security, a concept underpinning SDGs ‘zero hunger’, ‘good health and well-being’, ‘industry, innovation, and infrastructure’, and ‘reduced inequalities’.

Responding to the need for objective and scalable metrics to ensure nutrition security, we have recently harnessed machine learning (ML) to create and fully automate our Food Processing Score (FPro) [37]. FPro is a continuous index derived by training an ML model to predict manual labels of processing techniques based on the overall nutrient profile of a food (see Methods and Section S4). To teach our algorithm how to score processing from nutrients, we leveraged the labels provided by NOVA, currently the most widely used system to classify foods according to processing-related criteria, offering us an extensive array of epidemiological literature for comparative analysis [38, 39]. However, the FPro algorithm can accommodate different FPCS such as EPIC [40], UNC [41], or SIGA [42]. We rigorously tested the predictive power of FPro for epidemiological outcomes with an Environment-Wide Association Study (EWAS), leveraging multiple cycles of USDA’s model food databases and national food consumption surveys [37].

Here, building on the versatility and scalability of the FPro algorithm, we extend our analysis beyond “model foods” tailored for epidemiological databases and instead analyze real-world data encompassing over 50,000 products obtained from major US grocery store websites. This extensive dataset underpins the development of GroceryDB, an open-source database of foods and beverages, featuring comprehensive metadata on nutritional content, ingredient list, and price for each item, collected from publicly available online markets of Walmart, Target, and Whole Foods. Our objective is to demonstrate how ML can effectively analyze large-scale real-world food composition data, and translate this wealth of information into the degree of processing for any food in grocery stores, facilitating consumer decision-making and informing public health initiatives aimed at enhancing the overall quality of the food environment. GroceryDB, accessible to the public at http://TrueFood.Tech/, offers both the data and methodologies needed to quantify food processing and analyze the structure of ingredients within the U.S. food supply. This initiative not only lays the groundwork for similar efforts globally, aimed at promoting better-informed dietary choices, but also underscores the critical role of open-access, internationally comparable data in advancing global nutrition security.

Main

For each food, we automated the process of determining the extent of food processing using FPro, which translates the nutritional content of a food item, as reported by the nutrition facts, into its degree of processing [37]. In Figure 1, we illustrate the use of FPro by offering the processing scores of three products in each of the bread and yogurt categories, allowing us to compare their degree of processing. Indeed, the Manna Organics multi-grain bread is made from whole wheat kernels, barley, and rice without additives, added salt, oil, or even yeast, resulting in a low processing score of FPro=0.314. However, the Aunt Millie’s and Pepperidge Farmhouse breads include ‘resistant corn starch’, ‘soluble corn fiber’, and ‘oat fiber’, requiring additional processing to extract starch and fiber from corn and oat to be used as independent ingredients (Figure 1a), resulting in much higher processing scores of FPro=0.732 and FPro=0.997, respectively. Similarly, the Seven Stars Farm yogurt (FPro=0.355) is a whole milk yogurt made from ‘grade A pasteurized organic milk’, yet the Siggi’s yogurt (FPro=0.436) uses ‘Pasteurized Skim Milk’ that requires more processing to obtain 0% fat. Finally, the Chobani Cookies & Cream yogurt relies on cane sugar as the second most dominant ingredient, and on a cocktail of additives like ‘caramel color’, ‘fruit pectin’, and ‘vanilla bean powder’, making it a highly processed yogurt with a high processing score of FPro=0.918.

Figure 1: Degrees of Food Processing in Three Categories.

FPro allows us to assess the extent of food processing in three major US grocery stores, and it is best suited to rank foods within the same category. (a) In breads, the Manna Organics multi-grain bread, offered by WholeFoods, is mainly made from ‘whole wheat kernels’, barley, and brown rice without any additives, added salt, oil, or yeast, with FPro=0.314. However, the Aunt Millie’s (FPro=0.732) and Pepperidge Farmhouse (FPro=0.997) breads, found in Target and Walmart, include ‘soluble corn fiber’ and ‘oat fiber’ with additives like ‘sugar’, ‘resistant corn starch’, ‘wheat gluten’, and ‘monocalcium phosphate’. (b) The Seven Stars Farm yogurt (FPro=0.355) is made from ‘grade A pasteurized organic milk’. The Siggi’s yogurt (FPro=0.436) declares ‘Pasteurized Skim Milk’ as the main ingredient, a 0% fat milk that requires more food processing to eliminate fat. Lastly, the Chobani Cookies & Cream yogurt (FPro=0.918) has cane sugar as the second most dominant ingredient combined with multiple additives like ‘caramel color’, ‘fruit pectin’, and ‘vanilla bean powder’, making it a highly processed yogurt.

We assigned an FPro score to each food in GroceryDB by leveraging our ML classifier FoodProX, which takes as input the mandatory information captured by the nutrition facts (Methods). We find that the distribution of the FPro scores in the three stores is rather similar: in each store, we observe a monotonically increasing curve (Figure 2a), indicating that minimally-processed products (low FPro) represent a relatively small fraction of the inventory of grocery stores, the majority of the offerings being in the ultra-processed category (high FPro). Although less-processed items make up a smaller share of the overall inventory, they likely account for a proportionally larger portion of actual purchases, highlighting a discrepancy between sales data and available food options. Nevertheless, we identified systematic differences between stores: Whole Foods offers a greater selection of minimally processed items and fewer ultra-processed options, whereas Target has a particularly high proportion of ultra-processed products (high FPro).

Figure 2: Food Processing in Grocery Stores.

(a) The distribution of FPro scores from the three stores follows a similar trend, a monotonically increasing curve, indicating that the number of low FPro items (unprocessed and minimally-processed) offered by the grocery stores is relatively lower than the number of high FPro items (highly-processed and ultra-processed items), and the majority of offerings are ultra-processed (Methods for FPro calculation). (b) Distribution of FPro scores for different categories of GroceryDB. The distributions indicate that FPro has a remarkable variability within each food category, confirming the different degrees of food processing offered by the stores. Unprocessed foods like eggs, fresh produce, and raw meat are excluded (Section S7). (c) The distributions of FPro scores in GroceryDB compared to two USDA nationally representative food databases: the USDA Food and Nutrient Database for Dietary Studies (FNDDS) and FoodData Central Branded Products (BFPD). The similarity between the distributions of FPro scores in GroceryDB, BFPD, and FNDDS suggests that GroceryDB offers a comprehensive coverage of foods and beverages (Section S6).

FPro also captures the inherent variability in the degree of processing per food category. As illustrated in Figure 2b, we find a small variability of FPro scores in categories like jerky, popcorn, chips, bread, biscuits, and mac & cheese, indicating that consumers have limited choices in terms of degree of processing in these categories (Section S7 for harmonizing categories between stores). Yet, in categories like cereals, milk & milk-substitute, pasta-noodles, and snack bars, FPro varies widely, reflecting a wider extent of possible choices from a food processing perspective.

We compared the distribution of FPro in GroceryDB with the latest USDA Food and Nutrient Database for Dietary Studies (FNDDS), offering a representative sample of the consumed food supply (Figure 2c). The similarity between the distributions of FPro scores obtained from GroceryDB and FNDDS suggests that GroceryDB also offers a representative sample of foods and beverages in the supply chain. Additionally, we compared GroceryDB with the USDA Global Branded Food Products Database (BFPD), which contains 1,142,610 branded products, finding that the distributions of FPro in GroceryDB and BFPD follow similar trends (Figure 2c). While BFPD contains 22 times more foods than GroceryDB, only an estimated 44% of GroceryDB’s products are represented in BFPD, even after accounting for potential variability in food names and ingredient lists (Section S6). This indicates that while BFPD offers an extensive representation of branded products, it does not fully capture the current offering of stores. Furthermore, we compared GroceryDB with Open Food Facts (OFF) [43], another extensive collection of branded products collected through crowd-sourcing, containing 426,000 products with English ingredient lists. We find that less than 40% of the products in GroceryDB are present in OFF (Figure S4), a small overlap, suggesting that monitoring the products currently offered in grocery stores may provide a more accurate account of the food supply available to consumers.

Food Processing and Caloric Intake

The depth and the resolution of the data collected in GroceryDB allow us to unveil some of the complexity regarding the relation between price and calories. Among all categories in GroceryDB, a 10% increase in FPro results in an 8.7% decrease in the price per calorie of products, as captured by the dashed line in Figure 3a. However, the relationship between FPro and price per calorie strongly depends on the food category (Section S8). For example, in soups & stews the price per calorie drops by 24.3% for a 10% increase in FPro (Figure 3b), a trend observed also in cakes, mac & cheese, and ice cream (Figure S8). This means that on average, the most processed soups & stews, with FPro ≈ 1, are 67.72% cheaper per calorie than the minimally-processed alternatives with FPro ≈ 0.4 (Figure 3e). In contrast, in cereals the price per calorie drops only by 1.2% for a 10% increase in FPro (Figure 3c), a slow decrease observed also for seafood and yogurt products (Figure S8). Interestingly, we find an increasing trend between FPro and price in the milk & milk-substitute category (Figure 3d), partially explained by the higher price of plant-based milk substitutes, which require more extensive processing than dairy-based milks.

Figure 3: Price and Food Processing.

(a) Using robust linear models, we assessed the relationship between price and food processing (Figure S8 for regression coefficients of all categories). We find that price per calorie drops by 24.3% and 1.2% for a 10% increase in FPro in soup & stew and cereals, respectively. Also, we observe an 8.7% decrease across all foods in GroceryDB for a 10% increase in FPro. Interestingly, in milk & milk-substitute, price per calorie increases by 1.6% for a 10% increase in FPro, partially explained by the higher price of plant-based milks that are more processed than regular dairy milk. (b-d) Distributions of price per calorie in the linear bins of FPro scores for each store (Figure S7 illustrates the correlation between price and FPro for all categories). In soup & stew, we find a steep decreasing slope between FPro and price per calorie, while in cereals we observe a smaller effect. In milk & milk-substitute, price tends to slightly increase with higher values of FPro. (e) Percentage of change in price per calorie from the minimally-processed products to ultra-processed products in different food categories. This analysis was performed by comparing the average price per calorie of the top 10% most processed items with the top 10% least processed items within each category. In the full GroceryDB, on average, the ultra-processed items are 52.09% cheaper than their minimally-processed alternatives.
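The comparison described for panel (e) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `decile_price_change`, the tuple layout, and the toy numbers are all assumptions made for the example, and the handling of ties and small categories in the actual analysis may differ.

```python
from statistics import mean

def decile_price_change(items, fraction=0.10):
    """Percent change in mean price per calorie, least vs. most processed tail.

    items: list of (fpro, price_per_calorie) tuples for one food category.
    fraction: share of items in each tail (0.10 = top/bottom 10%).
    """
    if not items:
        raise ValueError("items must not be empty")
    ranked = sorted(items, key=lambda it: it[0])   # ascending FPro
    k = max(1, int(len(ranked) * fraction))        # at least one item per tail
    least = mean(price for _, price in ranked[:k])   # least processed tail
    most = mean(price for _, price in ranked[-k:])   # most processed tail
    return 100.0 * (most - least) / least

# Toy data, invented for illustration only (not taken from GroceryDB).
soups = [(0.40, 1.00), (0.45, 0.90), (0.50, 0.85), (0.60, 0.80), (0.70, 0.70),
         (0.80, 0.60), (0.85, 0.55), (0.90, 0.50), (0.95, 0.40), (1.00, 0.30)]
change = decile_price_change(soups)
print(f"{change:.1f}% change in price per calorie")
```

A negative value means the most processed tail is cheaper per calorie than the least processed tail, which is the direction reported for most categories.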

Choice Availability and Food Processing

Not surprisingly, GroceryDB documents differences in the offering of the three stores we analyzed: while WholeFoods offers a selection of cereals with a wide range of processing levels, from minimally-processed to ultra-processed, in Walmart the available cereals are limited to products with higher FPro values (Figure 4a). To understand the roots of these differences, we investigated the ingredients of the cereals offered by each grocery store. Cereals are one of the most popular staple crops, consumed by 283 million Americans in 2020 [44]. We find that cereals offered by WholeFoods rely less on sugar, natural flavors, and added vitamins (Figure 4b). In contrast, cereals in Target and Walmart tend to contain corn syrup, a sweetener associated with enhanced absorption of dietary fat and weight gain [45]. Corn syrup is largely absent in the WholeFoods cereals, partially explaining the wider range of processing scores characterising cereals offered by the store (Figure 4a).

Figure 4: The Difference between Stores in Terms of Processing.

The degree of processing of food choices depends on the grocery store and food category. (a) The degree of processing of food items offered in grocery stores, stratified by food category. For example, in cereals, WholeFoods shows a higher variability of FPro, implying that consumers have a choice between low and high processed cereals. Yet, in pizzas all supermarkets offer choices characterised by high FPro values. Lastly, all cheese products are minimally-processed, showing consistency across different grocery stores. (b) The top 30 most reported ingredients in cereals show that WholeFoods tends to eliminate corn syrup, uses more sunflower oil and less canola oil, and relies less on vitamin fortification. In total, GroceryDB has 1,168 cereals, of which 973 have ingredient lists. 309, 260, and 395 cereals are from Walmart, Target, and WholeFoods, respectively. (c) The brands of cereals offered in stores partially explain the different patterns of ingredients and variation of FPro. While Walmart and Target have a larger intersection in the brands of their cereals, WholeFoods tends to supply cereals from brands not available elsewhere.

The brands offered by each store could also help explain the different patterns. We found that while Walmart and Target have a large overlap in the list of brands they carry, WholeFoods relies on different suppliers (Figure 4c), largely unavailable in other grocery stores. In general, WholeFoods offers less processed soups & stews, yogurt & yogurt drinks, and milk & milk-substitute (Figure 4a). In these categories Walmart’s and Target’s offerings are limited to higher FPro values. Lastly, some food categories like pizza, mac & cheese, and popcorn are highly processed in all stores (Figure 4a). Indeed, pizzas offered in all three chains are limited to high FPro values, partially explained by the reliance on substitute ingredients like “imitation mozzarella cheese,” instead of “mozzarella cheese”.

While grocery stores offer a large variety of products, the offered processing choices can be identical in multiple stores. For example, GroceryDB has a comparable number of cookies & biscuits in each chain, with 453, 373, and 402 items in Walmart, Target, and WholeFoods, respectively. The degree of processing of cookies & biscuits in Walmart and Target is nearly identical (0.88<FPro<1), limiting consumer nutritional choices to a narrow range of processing (Figure 4a). In contrast, WholeFoods not only offers a large number of items (402 cookies & biscuits), but also a wider range of processing choices (0.57<FPro<1).

Organization of Ingredients in the Food Supply

Food and beverage companies are required to report the list of ingredients in descending order of the amount used in the final product. When an ingredient itself is a composite, consisting of two or more ingredients, FDA mandates parentheses to declare the corresponding sub-ingredients (Figure 5a,b) [46]. We organized the ingredient list as a tree (Methods), allowing us to compare a highly processed cheesecake with a less processed alternative (Figure 5). In general, we find that products with complex ingredient trees are more processed than products with simpler and fewer ingredients (Section S9.3). For example, the ultra-processed cheesecake in Figure 5a has 43 ingredients, 26 additives, and 3 branches with sub-ingredients. In contrast, the minimally-processed cheesecake has only 14 ingredients, 5 additives, and 2 branches with sub-ingredients (Figure 5b). As illustrated by the cheesecake example, ingredients used in the food supply are not equally processed, prompting us to ask: which ingredients contribute the most to the degree of processing of a product? To answer this we introduce the Ingredient Processing Score (IgFPro), defined as

$$\mathrm{IgFPro}_g = \frac{\sum_{f \in F_g} r_g^f \, \mathrm{FPro}_f}{\sum_{f \in F_g} r_g^f}, \qquad (1)$$

where $r_g^f$ ranks an ingredient $g$ in decreasing order based on its position in the ingredient list of each food $f$ that contains $g$, and $F_g$ denotes the set of such foods (Section S9.5). IgFPro ranges between 0 (unprocessed) and 1 (ultra-processed), allowing us to rank-order ingredients based on their contribution to the degree of processing of the final product. We find that not all additives contribute equally to ultra-processing. For example, the ultra-processed cheesecake (Figure 5a) has polysorbate 60 (an emulsifier used in cakes for increased volume and fine grain, with IgFPro=0.908) and corn syrup (a corn sweetener with IgFPro=0.905) [47], each emerging as a signal of ultra-processing with a high IgFPro score. In contrast, both the minimally-processed and ultra-processed cheesecakes (Figure 5) contain xanthan gum (IgFPro=0.818), guar gum (IgFPro=0.801), locust bean gum (IgFPro=0.786), and salt (IgFPro=0.777). Indeed, the European Food Safety Authority (EFSA) reported that xanthan gum as a food additive does not pose any safety concern for the general population, and FDA classified guar gum and locust bean gum as generally recognized as safe [47].
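To make the weighted average in Eq. 1 concrete, the following Python sketch computes an IgFPro-style score from a handful of products. It is illustrative only. In particular, the rank weight used here (number of ingredients in the list minus the zero-based position, so that the first-listed ingredient weighs most) is an assumption standing in for the definition given in Section S9.5, and the function name `ig_fpro`, the `min_products` parameter, and the toy products are hypothetical.

```python
from collections import defaultdict

def ig_fpro(foods, min_products=1):
    """Rank-weighted mean FPro per ingredient (sketch of Eq. 1).

    foods: list of (fpro, ingredients) where ingredients is a flat list
           in label order (most abundant first).
    min_products: keep only ingredients found in at least this many foods.
    """
    num = defaultdict(float)   # sum over foods of r * FPro
    den = defaultdict(float)   # sum over foods of r
    count = defaultdict(int)   # number of foods containing the ingredient
    for fpro, ingredients in foods:
        n = len(ingredients)
        for pos, g in enumerate(ingredients):
            r = n - pos        # ASSUMPTION: first-listed ingredient gets the largest weight
            num[g] += r * fpro
            den[g] += r
            count[g] += 1
    return {g: num[g] / den[g] for g in num if count[g] >= min_products}

# Toy products, invented for illustration only.
foods = [
    (0.9, ["corn syrup", "water", "salt"]),
    (0.3, ["milk", "salt"]),
    (0.8, ["water", "corn syrup"]),
]
scores = ig_fpro(foods)
for name in sorted(scores, key=scores.get, reverse=True):
    print(f"{name}: {scores[name]:.3f}")
```

Because each score is a weighted mean of FPro values, it stays within the range of FPro itself, consistent with IgFPro lying between 0 and 1. Setting `min_products=10` would mirror the filter described for Figure 6a.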

Figure 5: Ingredient Trees.

GroceryDB organizes the ingredient list of products into structured trees, where the additives are marked as orange nodes (Methods and Section S9). (a) Edwards Desserts Original Whipped Cheesecake is a highly processed cheesecake that contains 43 ingredients, of which 26 are additives, resulting in a complex ingredient tree with 3 branches of sub-ingredients. (b) Pearl River Mini No Sugar Added Cheesecake is a minimally-processed cheesecake that has a simpler ingredient tree with 14 ingredients, 5 additives, and 2 branches with sub-ingredients.
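As a rough illustration of how a parenthesized ingredient list can be turned into a tree, consider the Python sketch below. It is not the GroceryDB pipeline (see Methods and Section S9 for that): it assumes a well-formed label in which commas separate ingredients and parentheses enclose sub-ingredients, it ignores real-world complications such as brackets, "and/or" phrases, or percentages, and the names `parse_ingredients` and `depths` as well as the sample label are hypothetical.

```python
def parse_ingredients(text):
    """Parse 'a, b (c, d (e)), f' into nested {'name', 'children'} nodes."""
    root = {"name": "ROOT", "children": []}
    stack = [root]   # stack[-1] is the node currently receiving children
    token = []       # characters of the ingredient name being read

    def flush():
        # Attach the pending name (if any) as a child of the current node.
        name = "".join(token).strip().lower()
        token.clear()
        if name:
            stack[-1]["children"].append({"name": name, "children": []})

    for ch in text:
        if ch == ",":
            flush()
        elif ch == "(":
            flush()
            if stack[-1]["children"]:
                # Descend into the ingredient that precedes the parenthesis.
                stack.append(stack[-1]["children"][-1])
        elif ch == ")":
            flush()
            if len(stack) > 1:
                stack.pop()   # back to the parent ingredient
        else:
            token.append(ch)
    flush()
    return root

def depths(node, d=0, out=None):
    """Map each ingredient name to its distance d from the root node."""
    out = {} if out is None else out
    for child in node["children"]:
        out.setdefault(child["name"], d + 1)
        depths(child, d + 1, out)
    return out

# Hypothetical label, invented for illustration only.
label = "Cream Cheese (Milk, Cream, Salt), Sugar, Crust (Flour (Wheat, Niacin), Butter)"
tree = parse_ingredients(label)
d = depths(tree)
print([c["name"] for c in tree["children"]])
print(d)
```

The number of top-level nodes with children corresponds to the "branches with sub-ingredients" counted in the caption, and the distance from the root is the quantity denoted d in Figure 6a.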

By the same token, we looked into the oils used as ingredients in branded products to assess which oils contribute the most to UPFs. IgFPro identifies brain octane oil (IgFPro=0.573), flax seed oil (IgFPro=0.69), and olive oil (IgFPro=0.722) as the highest quality oils, having the smallest contribution to ultra-processing. In contrast, palm oil (IgFPro=0.888), vegetable oil (IgFPro=0.866), and soy bean oil (IgFPro=0.862) represent strong signals of ultra-processing (Figure 6a). Indeed, flax seed oil is high in omega-3 fatty acids with several health benefits [48]. In contrast, the blending of vegetable oils, a signature of UPF, is one of the simplest methods to create products with desired texture, stability, and nutritional properties [49].

Figure 6: Ingredient Processing Score (IgFPro).

To investigate which ingredients contribute most to ultra-processed products, we extend FPro to ingredient lists using Eq. 1. With the introduction of IgFPro, we can rank over 12,000 ingredients by their prevalence and contribution to ultra-processed products prioritizing ingredients and food groups for targeted intervention. A total of 1,676 ingredients are in more than 10 products. (a) The IgFPro of all ingredients that appeared in at least 10 products are calculated, rank-ordering ingredients based on their contribution to UPFs. The ingredients are colored based on their distance to the root node, d, of the ingredient tree (Methods). The popular oils used as an ingredient are highlighted, with the brain octane, flax seed, and olive oils contributing the least to ultra-processed products. In contrast, the palm, vegetable, and soybean oils contribute the most to ultra-processed products (Section S9.5). (b) The patterns of ingredients in the least-processed tortilla chips vs. the ultra-processed tortilla chips. The bold fonts track the IgFPro of the oils used in the three tortilla chips. The minimally-processed Siete tortilla chips (FPro=0.477) uses avocado oil (IgFPro=0.822), and the more processed El Milagro tortilla (FPro=0.769) has corn oil (IgFPro=0.886). In contrast, the ultra-processed Doritos (FPro=0.982) relies on a blend of vegetable oils (IgFPro=0.866), and is accompanied with a much more complex ingredient tree, indicating that there is no single ingredient “bio-marker” for UPFs.

Finally, to illustrate the ingredient patterns characterising UPFs, we show three tortilla chips in Figure 6b, ranked from the “minimally-processed” to the ultra-processed. Relative to the snack-chips category, Siete tortilla is minimally-processed (FPro=0.477), made with avocado oil and a blend of cassava and coconut flours. The more processed El Milagro tortilla (FPro=0.769) is made with corn oil and ground corn, and contains calcium hydroxide, an additive generally recognized as safe that is made by adding water to calcium oxide (lime) and is used to promote dispersion of ingredients [47]. In contrast, the ultra-processed Doritos (FPro=0.982) have corn flour, a blend of vegetable oils, and rely on 12 additives to ensure a palatable taste and the texture of the tortilla chip, demonstrating the complex patterns of ingredients and additives needed for ultra-processing (Figure 6b).

In summary, complex ingredient patterns accompany the production of UPFs (Section S9.4). IgFPro captures the role of individual ingredients in the food supply, enabling us to diagnose the processing characteristics of the whole food supply as well as the contribution of individual ingredients.

Discussion

By combining large-scale data on food composition and ML, GroceryDB uncovers insights on the current state of food processing in the US grocery landscape, enabling us to obtain distributions of food processing scores that capture a remarkable variability in the offerings of different grocery stores. The differences in FPro’s distributions (Figure 2a) indicate that multiple factors drive the range of choices available in grocery stores, from the cost of food and the socio-economic status of the consumers to the distinct declared missions of the supermarket chains: “quality is a state of mind” for WholeFoods Market and “helping people save money so they can live better” for Walmart [50, 51]. Furthermore, the continuous nature of FPro enabled us to conduct a data-driven investigation on the relationship between price and food processing stratified by food category. We find that overall in GroceryDB food processing tends to be associated with the production of more affordable calories, a positive correlation that raises the likelihood of habitual consumption among lower-income populations, ultimately contributing to growing socioeconomic disparities in terms of nutrition security [52–57]. However, it is important to note that the strength and direction of this correlation vary depending on the specific food category under consideration, as exemplified by the opposite trend of milk & milk-substitutes compared to soups & stews (Section S8). Further in-depth analyses are needed to evaluate the effectiveness of intervention strategies targeting specific food groups within diverse food environments.

Governments increasingly acknowledge the impact of processed foods on population health, and their long-term effect on healthcare [58, 59]. For example, the UK spends £18 billion annually on direct medical costs related to non-communicable diseases like obesity [60], while the US incurs $1.1 trillion in yearly food-related human health costs [61, 62]. GroceryDB serves as a valuable resource for both consumers and policymakers, offering essential insights to gauge the level of food processing within the food supply. For instance, in categories like cereals, milk & milk alternatives, pasta-noodles, and snack bars, FPro exhibits a wide range, highlighting the substantial variations in the processing levels of products. If consumers had access to this processing data, they could make informed choices, selecting items with significantly different degrees of processing (Figure 2b). Yet, the comprehension of nutrient and ingredient data disclosed on food packaging often poses a challenge to consumers due to unrealistic serving sizes and confusing health claims based on one or a few nutrients. Our primary objective lies in translating this wealth of data into an actionable scoring system, enabling consumers to make healthier food choices and embrace effective dietary substitutions, without overwhelming them with excessive information. Additionally, our approach holds great potential for public health initiatives aimed at improving the overall quality of our food environment, such as strategies reorganizing supermarket layouts, optimizing shelf placements, and thoughtfully designing counter displays [54, 63, 64]. Transforming health-related behaviors is a challenging task [65, 66], hence easily adoptable dietary modifications along with environmental nudges could make it easier for individuals to embrace healthier choices.

Currently, FPro partially draws from expertise-based food processing classifications due to limited data concerning compound concentrations indicative of food matrix alterations, such as cellular wall transformations or industrial processing techniques. However, a comprehensive mapping of the “Dark Matter of Nutrition”, encompassing chemical concentrations for additives and processing byproducts, aims to evolve FPro into an unsupervised system, independent of manual classifications [67, 68]. Unlike expertise-based systems, FPro functions as a quantitative algorithm, utilizing standardized inputs to generate reproducible continuous scores, facilitating sensitivity analysis and uncertainty estimations [37] (Section S5). These important features enhance analyses’ reliability, transparency, and interpretability while reducing errors linked to the descriptive nature of manual classifications [28], which have displayed a low degree of consistency among nutrition specialists [69].

The chemical composition of branded products is partially captured by the nutrition facts table and partially reported in the ingredient list, which includes additives like artificial colors, flavors, and emulsifiers. However, comprehensive and internationally well-regulated data on food ingredients is currently limited, as documented by the GS1 UK data crunch analysis, which reported an average of 80% inconsistency in products’ data [31], leading us to focus on the nutrition facts to enhance our algorithm’s portability and reproducibility. The nutrition facts alone exhibit excellent performance in discriminating between NOVA classes, confirming how food processing consistently alters nutrient concentrations with reproducible patterns, effectively harnessed by ML [37]. While FPro assesses the degree of food processing by holistically evaluating nutrient concentrations, the few nutrients available on food packaging increase the risk of identifying products with similar nutrition facts but distinct food matrices (e.g., pre-frying, puffing, extrusion-cooking). Indeed, if the chemical panel used to train the algorithm fails to exhaustively capture matrix modifications induced by processing and cooking, FPro and the substitution algorithm implemented at http://TrueFood.Tech/ remain blind to these chemical-physical changes. Incorporating disambiguated ingredients in FPro, such as the ultra-processing markers characterized by SIGA [70], may offer a solution until larger composition tables for branded products become available (Section S5).

In summary, our work represents a departure from traditional food classification systems, advancing toward the use of ML methodologies to model the chemical complexity of food (Section S1). Despite the limited information provided by FDA-regulated nutrition labels, GroceryDB and FPro offer a data-driven approach that enables a substitution algorithm capable of recommending similar but less processed alternatives for any food in GroceryDB. Together, GroceryDB and the TrueFood platform highlight the importance of data transparency in grocery store inventories, a key factor that directly shapes consumer choices.

Methods

Data Collection

We compiled publicly accessible data on food products available at Walmart, Target, and Whole Foods through their respective online platforms. Each store organizes its food items hierarchically. Utilizing these categorizations, we systematically navigated through the stores’ websites to identify specific food items. To ensure consistency, we standardized the food category hierarchy within GroceryDB by comparing and aligning the classification systems employed by each store. These stores sourced nutrition facts from physical food labels and provided digital versions for each food item. This data enabled us to standardize nutrient concentrations to a uniform measure of 100 grams and employ FoodProX to evaluate the degree of food processing for each item. Lastly, all data was collected in May 2021.
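The standardization to 100 grams mentioned above can be illustrated with a minimal sketch. It assumes each label reports nutrient amounts per serving together with a serving size in grams; the function name, the nutrient keys, and the example values are hypothetical and are not taken from the GroceryDB code.

```python
def per_100g(nutrients_per_serving, serving_size_g):
    """Rescale nutrient amounts from one serving to 100 g of product.

    Assumes every amount in `nutrients_per_serving` refers to a single
    serving weighing `serving_size_g` grams (hypothetical input layout).
    """
    if serving_size_g <= 0:
        raise ValueError("serving size must be positive")
    factor = 100.0 / serving_size_g
    return {name: amount * factor for name, amount in nutrients_per_serving.items()}


# Made-up label for illustration only: a 40 g serving.
example_label = {"protein_g": 3.0, "total_fat_g": 1.5, "sodium_mg": 160.0}
print(per_100g(example_label, serving_size_g=40.0))
```

Products whose serving size is given only by volume or by count would need a density or unit weight before this rescaling applies, which the sketch does not attempt.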

Calculation of the Food Processing Score (FPro)

Processing alters the nutrient profile of food, producing changes that are detectable and classifiable using ML [37, 71, 72]. Hence, we developed FoodProX [37], a random forest classifier that can translate the combinatorial changes in the nutrient amounts induced by food processing into a food processing score (FPro). We extensively tested and validated the stability of FPro in several databases such as the US Food and Nutrient Database for Dietary Studies (FNDDS) and the international Open Food Facts. FPro allowed us to implement an in-silico study based on US cross-sectional population data, where we showed that, on average, substituting only a single food item in a person’s diet with a minimally processed alternative from the same food category can significantly reduce the risk of developing metabolic syndrome (12.25% decrease in odds ratio) and increase vitamin blood levels (4.83% and 12.31% increase of vitamin B12 and vitamin C blood concentration) [37].

FoodProX takes as input 12 nutrients reported in the nutrition facts (Table S1), and returns FPro, a continuous score ranging between 0 (unprocessed foods like fruits and vegetables) and 1 (UPFs like instant soups and shelf-stable breads). We used the manual NOVA classification applied to the USDA Standard Reference (SR) and FNDDS databases to train FoodProX. In the original classification, NOVA labels were assigned by inspecting the ingredient list and the food description, but without taking into account nutrient content.

FPro does not assess individual nutrients in isolation but, rather, learns from the configurations of correlated nutrient changes within a fixed quantity of food (100 grams) [37]. Consequently, a single high or low nutrient value does not dictate a food’s FPro but the final score depends on the likelihood of observing the overall pattern of nutrient concentrations in unprocessed foods versus UPFs. For instance, while fortified foods may mirror mineral and vitamin content in unprocessed foods, our algorithm identifies unique concentration signatures unlikely to be found in minimally processed foods, resulting in a higher FPro [37].
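To make the idea of a likelihood-based continuous score concrete, the sketch below shows one way a classifier's class probabilities could be collapsed into a value between 0 and 1. The specific form, (1 - p_unprocessed + p_ultra_processed) / 2, is an assumption made here for illustration; this section does not state the formula, and the definition in [37] should be treated as authoritative. The function names and the probability values are hypothetical.

```python
from statistics import mean


def fpro_from_probabilities(p_unprocessed, p_ultra_processed):
    """Collapse two class probabilities into a score in [0, 1].

    Assumed form (not given in this section): the score grows with the
    probability of the ultra-processed class and shrinks with the
    probability of the unprocessed class.
    """
    return (1.0 - p_unprocessed + p_ultra_processed) / 2.0


def ensemble_fpro(probability_pairs):
    """Average the score over several classifiers (hypothetical ensemble)."""
    return mean(fpro_from_probabilities(p1, p4) for p1, p4 in probability_pairs)


# Made-up probabilities from three hypothetical ensemble members.
print(ensemble_fpro([(0.10, 0.70), (0.20, 0.60), (0.15, 0.65)]))
```

Under this assumed form, a food that every classifier places in the unprocessed class scores 0, one placed in the ultra-processed class scores 1, and ambiguous nutrient patterns fall in between.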

The calculation of FPro for all foods in GroceryDB represents a generalization task, where the model faces “never-before-seen” data [71, 73]. More details on the training dataset, including class heterogeneity and imbalance, are available in Section S4.

Price for calories trends

We applied robust linear models with Huber’s t-norm [74–76] to calculate regression coefficients and p-values for the relationship log(PricePerCalorie) ~ log(FPro). The detailed regression results for each food category are presented in Figure S8, while the overall trend across GroceryDB is depicted in Figure 3A. To illustrate the price disparity at the extremes of food processing, the percentage change in price per calorie shown in Figure 3E was calculated by comparing the average price per calorie of the top 10% minimally processed items to that of the top 10% ultra-processed items within each category.
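The two calculations in this subsection can be sketched as follows. This is a simplified illustration and not the analysis code: the first function fits a straight line by iteratively reweighted least squares with Huber weights, using a MAD-based scale and a tuning constant of 1.345 as assumptions, and it does not compute p-values. In practice a statistics package would be used, for example the RLM class with the HuberT norm in statsmodels. The second function assumes the minimally processed decile is the baseline for the percentage change, which the text does not specify. All names and data are hypothetical.

```python
from statistics import median


def huber_line_fit(x, y, t=1.345, iterations=50):
    """Fit y = intercept + slope * x with Huber-weighted least squares."""
    weights = [1.0] * len(x)
    intercept = slope = 0.0
    for _ in range(iterations):
        total = sum(weights)
        mean_x = sum(w * xi for w, xi in zip(weights, x)) / total
        mean_y = sum(w * yi for w, yi in zip(weights, y)) / total
        sxx = sum(w * (xi - mean_x) ** 2 for w, xi in zip(weights, x))
        sxy = sum(w * (xi - mean_x) * (yi - mean_y)
                  for w, xi, yi in zip(weights, x, y))
        slope = sxy / sxx
        intercept = mean_y - slope * mean_x
        residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
        # Robust residual scale from the median absolute residual.
        scale = median(abs(r) for r in residuals) / 0.6745
        if scale == 0.0:
            break
        cutoff = t * scale
        # Full weight inside the cutoff, shrinking weight outside it.
        weights = [1.0 if abs(r) <= cutoff else cutoff / abs(r) for r in residuals]
    return intercept, slope


def decile_price_change(items):
    """Percent change in mean price per calorie between the 10% of items
    with the lowest FPro and the 10% with the highest FPro.

    `items` is a list of (fpro, price_per_calorie) pairs; the low-FPro
    group is taken as the baseline (an assumption).
    """
    ranked = sorted(items)
    k = max(1, len(ranked) // 10)
    low = sum(price for _, price in ranked[:k]) / k
    high = sum(price for _, price in ranked[-k:]) / k
    return 100.0 * (high - low) / low


# Synthetic points on a line, with one gross outlier at the end.
xs = [float(i) for i in range(10)]
ys = [1.0 + 2.0 * xi for xi in xs]
ys[9] += 30.0
print(huber_line_fit(xs, ys))
```

For the regression in the text, x and y would be the logarithms of FPro and price per calorie for the items in a category, so the fitted slope describes how price per calorie scales with FPro.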

Ingredient Trees

An ingredient list is a reflection of the recipe used to prepare a branded food item. The ingredient lists are sorted based on the amount of ingredients used in the preparation of an item as required by the FDA. An ingredient tree can be created in two ways: (a) with emphasis on capturing the main and sub-ingredients, similar to a recipe, as illustrated in Figure S16A; (b) with emphasis on the order of ingredients as a proxy for their amount in a final product, as illustrated in Figure S16B, where the distance from the root, d, reflects the amount of an individual ingredient relative to all ingredients. We opted for (b) to calculate IgFPro, as ranking the amount of an ingredient in a food is essential to quantify the contribution of individual ingredients to ultra-processing. In Eq. 1, we used r_g^f = 1/d_g^f to rank the amount of an ingredient g in food f, where d_g^f captures the distance from the root (see Figure S16B for an example). Finally, IgFPro shows a remarkable variability when compared to the average FPro of products containing the selected ingredient (Figure S17), suggesting distinctive patterns of correlation between the products’ FPro and the ranking of ingredients in their ingredient lists.
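As an illustration of how the rank weight r = 1/d can enter an ingredient-level score, the sketch below computes a rank-weighted average of product FPro values for one ingredient. Eq. 1 is not reproduced in this section, so the weighted-average form is an assumption made here, and the function name and numbers are hypothetical.

```python
def ig_fpro(occurrences):
    """Rank-weighted average FPro for a single ingredient.

    `occurrences` holds one (product_fpro, distance_from_root) pair per
    product containing the ingredient. Each product is weighted by
    r = 1 / d, so products listing the ingredient closer to the root
    count more. The weighted-average form is assumed, not quoted.
    """
    weights = [1.0 / distance for _, distance in occurrences]
    weighted_sum = sum(w * fpro for w, (fpro, _) in zip(weights, occurrences))
    return weighted_sum / sum(weights)


# Made-up example: the ingredient appears first in a highly processed
# product (d = 1) and second in a less processed one (d = 2).
print(ig_fpro([(0.8, 1), (0.2, 2)]))
```

In this form, the denominator plays the role of a per-ingredient normalization term: it is the sum of the rank weights over all products that contain the ingredient.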

Database Structure

The database comprises two main files, both stored in CSV format for ease of use and accessibility:

  1. GroceryDB Foods File

    This file contains comprehensive information about all the foods included in the GroceryDB. Each row represents a distinct food item. This file includes the following columns:

    • name: The name of the food item, typically as it appears on the product packaging.

    • brand: The brand or manufacturer of the food item.

    • harmonized single category: The general category or type of food (e.g., seafood, cereal, etc.).

    • store: The retail store where the food item is available (e.g., Walmart, Target, Whole Foods).

    • f_FPro: Average FPro score of the food across the ensemble of classifiers. The FPro score is calculated using the FoodProX algorithm, taking into account the nutrition facts of the food.

    • f_FPro P: a string indicating if the food has enough nutritional descriptors as detailed in SI Section 4.

    • f_min_FPro: Minimum FPro score across the ensemble of classifiers.

    • f_std_FPro: The standard deviation of the FPro score across the ensemble of classifiers.

    • f_FPro_class: expected NOVA class assigned according to FoodProX.

    • ingredientList: A list of ingredients used in the food item, providing insight into its composition and processing level. The ingredient list is crucial for calculating the IgFPro.

    • has10_nuts: boolean value indicating if the food is described by the 10 key nutrients described in SI Section 4.

    • is_Nuts_Converted_100g: Indicator if the food nutrients are converted per 100 grams.

    • nutritional information: Detailed nutritional information for the food item, including protein, total fat, carbohydrate, total sugars, total dietary fiber, calcium, iron, sodium, vitamin C, cholesterol, total saturated fatty acids, and total vitamin A.

    Please note that the prices of the food items are not included in this public release due to potential restrictions on public disclosure. However, we are willing to provide price information upon request. The file is available at https://github.com/Barabasi-Lab/GroceryDB/blob/main/data/GroceryDBfoods.csv.

  2. GroceryDB IgFPro File

    This file contains data related to the IgFPro score of the ingredients listed in GroceryDB. Each row corresponds to a specific ingredient. The file is available at https://github.com/Barabasi-Lab/GroceryDB/blob/main/data/GroceryDB_IgFPro.csv.

    The columns in this file are as follows:

    • ingredient_name: The standardized name of the ingredient.

    • count_of_products: The total number of products in the database that contain this ingredient.

    • ingredient_FPro: IgFPro calculated for the selected ingredient.

    • average FPro of products: The average FPro score of the products containing the selected ingredient.

    • average_distance_to_root: The average distance of the ingredient from the root in the ingredient tree, representing its relative amount in the food item. Ingredients closer to the root contribute more significantly to the calculation of IgFPro.

    • ingredient_normalization term: A numerical value used to normalize a food’s contribution to the IgFPro score, based on the ingredient’s overall ranking across all foods.
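The CSV layout described above can be read with standard tools. The following minimal sketch parses a small in-memory sample that uses a subset of the Foods File column names listed earlier, then ranks the items of one category by f_FPro. The sample rows are invented for illustration and the helper names are hypothetical; with the real file, an open file handle would replace the in-memory sample.

```python
import csv
import io

# Invented rows using a subset of the documented column names.
SAMPLE_FOODS_CSV = """name,brand,harmonized single category,store,f_FPro
Example Oats,BrandA,cereal,Target,0.35
Example Puffs,BrandB,cereal,Walmart,0.92
Example Salmon,BrandC,seafood,Whole Foods,0.12
"""


def load_foods(handle):
    """Read food rows from a CSV handle and convert f_FPro to float."""
    rows = []
    for row in csv.DictReader(handle):
        row["f_FPro"] = float(row["f_FPro"])
        rows.append(row)
    return rows


def rank_category(rows, category):
    """Return the rows of one category, least processed first."""
    selected = [r for r in rows if r["harmonized single category"] == category]
    return sorted(selected, key=lambda r: r["f_FPro"])


foods = load_foods(io.StringIO(SAMPLE_FOODS_CSV))
for item in rank_category(foods, "cereal"):
    print(item["name"], item["store"], item["f_FPro"])
```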

Substitution Algorithm at TrueFood.Tech

TrueFood.Tech provides food substitution recommendations aimed at gently nudging consumers towards less processed alternatives. To accomplish this, we first identify food items that belong to the same category and share partial semantic similarity with the targeted item (range 0.10–0.95), based on both food names and ingredient lists. This approach increases the diversity of displayed recommendations while ensuring they remain within the same category.

We utilize the popular term frequency–inverse document frequency (Tf–idf) algorithm to measure the significance of words to foods in our database, adjusted for commonality across entries [77]. The similarity between weighted word vectors is calculated using cosine similarity. The final similarity between the queried food and other food items is determined by multiplying the ingredient-list-based similarity and the food-name-based similarity.

Next, we sort the semantically filtered foods by their FPro scores, ranking the recommendations in ascending order of FPro. This method allows us to identify the most similar food items with a lower FPro compared to the targeted item. Up to 50 items, listed in increasing order of FPro, are displayed on the website.
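The steps above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the code running at TrueFood.Tech: tokenization is plain whitespace splitting, the inverse document frequency uses a smoothed form chosen here for convenience, the record fields (name, ingredients, category, fpro) are hypothetical, and the filter that keeps only items with a lower FPro than the query reflects one reading of the text. The example foods are invented.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse Tf-idf vectors (dicts) for a list of text strings."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append({
            tok: (c / len(toks)) * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1)
            for tok, c in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recommend(query_index, foods, low=0.10, high=0.95, limit=50):
    """Same-category items with partial similarity, sorted by rising FPro."""
    name_vecs = tfidf_vectors([f["name"] for f in foods])
    ingr_vecs = tfidf_vectors([f["ingredients"] for f in foods])
    query = foods[query_index]
    candidates = []
    for i, food in enumerate(foods):
        if i == query_index or food["category"] != query["category"]:
            continue
        # Final similarity: name similarity times ingredient-list similarity.
        similarity = (cosine(name_vecs[query_index], name_vecs[i])
                      * cosine(ingr_vecs[query_index], ingr_vecs[i]))
        if low <= similarity <= high and food["fpro"] < query["fpro"]:
            candidates.append(food)
    return sorted(candidates, key=lambda f: f["fpro"])[:limit]


# Invented records for illustration only.
FOODS = [
    {"name": "cheddar cheese crackers", "ingredients": "wheat flour oil cheese salt",
     "category": "snacks", "fpro": 0.9},
    {"name": "cheese crackers", "ingredients": "wheat flour cheese salt",
     "category": "snacks", "fpro": 0.4},
    {"name": "cheddar cheese crackers", "ingredients": "wheat flour oil cheese salt",
     "category": "snacks", "fpro": 0.5},
    {"name": "sparkling water", "ingredients": "water",
     "category": "beverages", "fpro": 0.1},
]
print([f["name"] for f in recommend(0, FOODS)])
```

The upper bound on similarity removes near-duplicates of the queried item, such as the third record here, which is what keeps the displayed recommendations diverse.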

Supplementary Material

Supplement 1

Acknowledgments

We thank Dwijay Shanbhag at Northeastern University for his help on data collection and cleaning. We thank Daria Koshkina for help in designing the figures. A.-L.B is partially supported by NIH grant 1P01HL132825, American Heart Association grant 151708, and ERC grant 810115-DYNASET. G.M. is supported by NIH/NHLBI K25HL173665 and AHA 24MERIT1185447.

Footnotes

Competing Interests

A.-L.B. is the founder of Scipher Medicine and Naring Health, companies that explore the use of network-based tools in health and food, and Datapolis, that focuses on urban data. All other authors declare no competing interests.

Code and Data Availability

All code and data are available at BarabasiLab GitHub repository via https://github.com/Barabasi-Lab/GroceryDB/. Furthermore, GroceryDB is available to the public and consumers at http://TrueFood.Tech/.

References

Articles from medRxiv are provided here courtesy of Cold Spring Harbor Laboratory Preprints
