Abstract
We present snap-n-eat, a mobile food recognition system. The system recognizes food and estimates its calorific and nutrition content automatically, without any user intervention. To identify food items, the user simply snaps a photo of the food plate. The system detects the salient region, crops the image accordingly, and subtracts the background. Hierarchical segmentation is performed to segment the image into regions. We then extract features at different locations and scales and classify these regions into different kinds of foods using a linear support vector machine classifier. In addition, the system determines the portion size, which is then used to estimate the calorific and nutrition content of the food present on the plate. Previous approaches have mostly worked with images captured in a lab setting, or they require additional user input (eg, user-drawn crop bounding boxes). Our system achieves automatic food detection and recognition in real-life settings with cluttered backgrounds. When multiple food items appear in an image, our system can identify them and estimate their portion sizes simultaneously. We implemented this system both as an Android smartphone application and as a web service. In our experiments, we achieved above 85% accuracy when detecting 15 different kinds of foods.
Keywords: food recognition, visual food recognition, mobile food recognition, nutrition estimation
Recently, eating habit recording services have become popular (eg, myfitness). They can make users aware of problems in their eating habits, such as unhealthy diets, which is useful for disease prevention and diet control. However, most of these services require users to manually select food items from hierarchical menus, which is too time-consuming and inconvenient for people to keep using such services.
Due to the rapid spread of image sensors on smartphones (eg, iPhone and Android phones) and wearable gadgets (eg, Google Glass, smart watches), image-based food recording has emerged as an alternative to such manual logging;1-4 those sources, along with Anthimopoulos et al,5 represent the current state of the art in food recognition. Ideally the whole process involves simply pointing the camera at the food and snapping pictures. However, most of the previous approaches require additional user input (eg, drawing a bounding box over each food item), which becomes very cumbersome when there are multiple food items on the plate. Other approaches require the food items to be placed in an experimental setting with a static background, which is infeasible in real-life settings, or require taking multiple pictures,1 which is inconvenient.
Our proposed system, named snap-n-eat, is summarized in Figure 1. First, the user points a smartphone camera at the food plate and taps the screen once most of the food is inside the circle rendered on the screen. A cropped 400 × 400 image is then uploaded to the server, triggering the recognition process. Once the recognition is complete, the results are sent back to the phone, and the bounding boxes of each food item and their confidence scores are displayed on the screen. The whole process takes about 4 seconds, including the network transmission time.
Figure 1.
Overview of proposed approach.
Food is defined mainly by what it is made of (its ingredients) and how it is made (its recipe). Ingredients and food items are not mutually exclusive; for example, “potato” can be a food item on its own, or one of the ingredients of “potato salad.” The image features of individual food ingredients are highly ambiguous; however, food items are relatively easy to recognize once they are grouped into larger chunks. Note that our focus is purely on visible ingredients rather than on ingredients that are hidden or blended.
Region grouping often helps facilitate recognition. (1) It groups smaller items into larger chunks for better discrimination. (2) When multiple food items are presented on the same plate, they often belong to different food categories (eg, salads, fruits, vegetables) and can be delineated well against each other. In many cases the background plate is also homogeneous and can be separated from the food. (3) Regions provide precise localization information, and the relationships between multiple foods can be analyzed with better accuracy. Segmentation on its own is also useful for finer-grained analysis such as nutrient and calorie estimation.
Approach
In this section we describe our approach in detail. We first describe the features that are extracted from the image (or image regions), followed by the techniques used to encode these features. Next we describe how, given an image of a food plate, we use saliency to detect the location of the food, followed by segmentation (region-based sampling) to obtain image segments containing different food items. The low-level features are extracted from these image segments. We then describe our classifier training approach, followed by a brief description of nutrition estimation and user modeling.
Low-Level Features
Global shape, which refers to the overall shape of a food item (see Tamrakar et al6 for an overall review of low level features), can be very useful for distinguishing certain items from others. For example, no matter what the topping, a pizza slice often exhibits a clear wedge shape. However, in many other cases where food items are chopped, self-occluded, or amorphous (eg, salads, rice), the global shape is not distinctive enough on its own.
Color is another natural candidate. Since our goal is to detect food items in a noncontrolled setting, we want our result to be robust and, more importantly, stable under variable lighting, white balance settings, and image sensor variations. Naive implementations of color features (eg, color moments, RGB [red green blue] histograms) do not generalize across different capture conditions and dramatically increase the variance of test performance.
Gradient features, like dense HOG (histogram of oriented gradients) and dense SIFT (scale-invariant feature transform7), are more suitable in our case. We implement these two features in the following way:
Dense HOG
First, HOG descriptors are densely extracted on a regular grid at steps of 8 pixels, giving a 31-dimension descriptor for each node of the grid. Groups of 4 HOG descriptors in a 2 × 2 neighborhood are then stacked together to form a descriptor with 124 dimensions. The stacked descriptors spatially overlap. This 2 × 2 neighbor stacking is important because the higher feature dimensionality increases the descriptive power.
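As an illustration, a dense HOG extraction with 2 × 2 stacking could look roughly as follows. This is only a sketch under stated assumptions: it uses scikit-image's 9-orientation HOG cells rather than the 31-dimension descriptor used in our system, so the stacked descriptor here is 36-dimensional instead of 124-dimensional.

```python
# Sketch of dense HOG extraction with 2 x 2 neighborhood stacking.
# Assumption: scikit-image's HOG (9 orientation bins per cell) stands in
# for the 31-dimension HOG variant described in the text.
import numpy as np
from skimage.feature import hog

def dense_hog(gray_image, cell_size=8):
    # One orientation histogram per 8 x 8 cell, no block grouping.
    cells = hog(gray_image,
                orientations=9,
                pixels_per_cell=(cell_size, cell_size),
                cells_per_block=(1, 1),
                feature_vector=False)
    cells = cells.reshape(cells.shape[0], cells.shape[1], -1)  # (rows, cols, 9)

    # Stack each 2 x 2 neighborhood of cell descriptors into one descriptor.
    rows, cols, _ = cells.shape
    stacked = [cells[r:r + 2, c:c + 2, :].reshape(-1)   # 4 cells -> 36 dims
               for r in range(rows - 1)
               for c in range(cols - 1)]
    return np.asarray(stacked)
```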
Dense SIFT
SIFT7 descriptors are densely extracted on a regular grid at steps of 5 pixels, using a flat rather than a Gaussian window. The descriptors computed for the 3 HSV (hue saturation value) color channels are stacked together. We compute the SIFT feature at 2 different scales. To study the discriminative power of the different features, we use the normalized cut embedding to visualize the exemplars of different food instances for each feature, as shown in Figure 2.
Figure 2.
We computed the low level features inside labeled food items. The top 2 images show the embedding of the HOG feature. The bottom 2 images show the embedding of SIFT feature at scales of 16 × 16 and 32 × 32. Each colored dot represents 1 labeled food instance.
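For concreteness, the dense SIFT extraction described above could be sketched as follows. OpenCV's SIFT uses a Gaussian weighting window rather than the flat window we describe, so treat this as an approximation; the 5-pixel grid step, the HSV channel stacking, and the 2 patch sizes mirror the text.

```python
# Sketch of dense SIFT over the HSV channels at two scales.
# Assumes OpenCV >= 4.4 (cv2.SIFT_create); patch sizes of 16 and 32 pixels
# correspond to the two scales discussed in the text.
import cv2
import numpy as np

def dense_sift_hsv(bgr_image, step=5, patch_sizes=(16, 32)):
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    sift = cv2.SIFT_create()
    h, w = hsv.shape[:2]
    features = {}
    for size in patch_sizes:
        keypoints = [cv2.KeyPoint(float(x), float(y), float(size))
                     for y in range(0, h, step)
                     for x in range(0, w, step)]
        per_channel = []
        for c in range(3):  # H, S, V channels
            _, desc = sift.compute(hsv[:, :, c], keypoints)
            per_channel.append(desc)
        # Stack the three 128-d per-channel descriptors at each grid point.
        features[size] = np.hstack(per_channel)
    return features
```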
As HOG features are extracted on the gradient of a grey scale image, they mainly encode the texture information of different food categories. Two food items (eg, French fries and egg omelets; Figure 2, top) with similar colors can be discriminated due to their different textures as captured by the HOG feature.
SIFT features capture similar information as HOG. Interestingly we find that SIFT features at different scales have different discriminative powers. As shown in Figure 2 (bottom), steak and chicken wings are both clustered into 2 clusters. One cluster is discriminated better at a smaller scale, while the other is discriminated at a larger scale. This motivates us to fuse the detection results at multiple scales.
Feature Encoding
Traditionally the local features are represented as bag of visual words (BoVW). A codebook of K centroids (visual words) is usually obtained by k-means clustering, and each local descriptor is assigned to the closest centroid. The BoVW representation is the histogram counts of each image descriptor. The limitations of BoVW in this application are mainly 3-fold:
Lack of continuity in feature space: The dictionary of visual words is obtained in a purely generative way. The assignment of a descriptor can jump from one visual word to another with small perturbations in the original feature space.
Sparsity: In the BoVW representation, each local descriptor contributes to only one nonzero entry in the final histogram. This often requires more sophisticated kernels during the test phase.
Low dimensionality: The dimension of the final histogram is determined by the size of the dictionary. Larger dictionary size often aggravates the lack of continuity in the feature space. However, robust classification often requires high feature dimension.
Inspired by the recent progress in visual classification, we use the Fisher vector representation,8 which augments the BoW (bag of words) representation by encoding high-order statistics (first and, optionally, second order). The Fisher vector8 describes how the set of descriptors deviates from an average distribution of descriptors, modeled by a parametric generative model. Similar to BoW, the Fisher vector can encode the information contained in a set of descriptors into a fixed-length signature. Instead of using k-means to learn the codebook, a Gaussian mixture model (GMM) is used. Given a set of D-dimensional local descriptors $X = \{x_t, t = 1, \dots, T\}$ (eg, SIFT or HOG features) extracted from an image, let $\lambda = \{w_k, \mu_k, \Sigma_k, k = 1, \dots, K\}$ be the parameters of a GMM with diagonal covariances fitting the distribution of the descriptors. The GMM associates each vector $x_t$ to a mode $k$ in the mixture with a strength given by the posterior probability

$$\gamma_t(k) = \frac{w_k \, \mathcal{N}(x_t; \mu_k, \Sigma_k)}{\sum_{j=1}^{K} w_j \, \mathcal{N}(x_t; \mu_j, \Sigma_j)}.$$

For each mode $k$, the mean and covariance deviation vectors are given by

$$\mathcal{G}_{\mu,k}^{X}(j) = \frac{1}{T\sqrt{w_k}} \sum_{t=1}^{T} \gamma_t(k) \, \frac{x_t(j) - \mu_k(j)}{\sigma_k(j)}, \qquad \mathcal{G}_{\sigma,k}^{X}(j) = \frac{1}{T\sqrt{2\,w_k}} \sum_{t=1}^{T} \gamma_t(k) \left[ \frac{\big(x_t(j) - \mu_k(j)\big)^2}{\sigma_k(j)^2} - 1 \right],$$

where $j = 1, \dots, D$ spans the dimensions of the descriptor and $\sigma_k(j)^2$ is the jth diagonal entry of $\Sigma_k$. The Fisher vector of the image is then obtained by stacking the vectors $\mathcal{G}_{\mu,k}^{X}$ and $\mathcal{G}_{\sigma,k}^{X}$ of each of the K Gaussian mixture components. Thus the Fisher vector has 2KD dimensions. In our work, we choose K = 256, and the feature dimension is reduced to 64 using PCA (principal components analysis). These parameters are chosen empirically based on their performance.
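A minimal sketch of this encoding, assuming scikit-learn's GaussianMixture with diagonal covariances, is given below. It omits the power and L2 normalization commonly applied in practice, and the component count and data here are placeholders rather than our actual settings (K = 256 on 64-dimensional PCA-reduced descriptors).

```python
# Sketch of Fisher vector encoding with a diagonal-covariance GMM.
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(descriptors, gmm):
    # descriptors: (T, D) array; gmm: fitted GaussianMixture ('diag' covariance).
    T, _ = descriptors.shape
    gamma = gmm.predict_proba(descriptors)                    # (T, K) soft assignments
    w, mu, var = gmm.weights_, gmm.means_, gmm.covariances_   # (K,), (K, D), (K, D)

    diff = (descriptors[:, None, :] - mu[None, :, :]) / np.sqrt(var)[None, :, :]
    g_mu = (gamma[:, :, None] * diff).sum(axis=0) / (T * np.sqrt(w)[:, None])
    g_sigma = (gamma[:, :, None] * (diff ** 2 - 1)).sum(axis=0) / (T * np.sqrt(2 * w)[:, None])
    return np.hstack([g_mu.ravel(), g_sigma.ravel()])         # 2 * K * D dimensions

# Illustrative usage with random descriptors (placeholders for SIFT/HOG features).
gmm = GaussianMixture(n_components=16, covariance_type="diag", random_state=0)
gmm.fit(np.random.rand(5000, 64))
fv = fisher_vector(np.random.rand(300, 64), gmm)   # 2 * 16 * 64 = 2048 dims
```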
Food Detection
When a user takes a picture of the food using a cell phone, the image resolutions vary according to the sensor resolution and photo orientation (portrait or landscape). To ensure that the result is invariant to all these factors, we down-sample and crop the image to 400 × 400 resolution for processing. We fix the aspect ratio to 1:1 since most food plates have this aspect ratio.
Saliency-Based Sampling
When a user snaps a picture of the food plate, the foreground regions corresponding to the food items are usually more visually salient than the background. A possible explanation is that chefs tend to make food visually appealing. This phenomenon encourages us to apply saliency detection to guide us “where” to look for food.
We use the spectral residual approach to detect saliency. The input image is resized so that its shorter dimension is 100 pixels. The saliency values are then computed and thresholded to obtain the binary saliency map shown in Figure 3.
Figure 3.
Saliency computation. The left image shows a picture of a food plate snapped by a user. The center image shows the saliency image with red/yellow colors indicating high saliency and blue colors indicating a low saliency. The right image shows the binary saliency map obtained by applying a threshold on the center image which can be used to identify the location of food in the image. Note that the binary saliency map picks out the most important regions of the image.
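A possible implementation of the spectral residual saliency computation is sketched below; this is an assumption of how such a map can be computed, and the smoothing and thresholding constants are illustrative rather than the exact values used in our system.

```python
# Sketch of spectral residual saliency (Hou and Zhang) with thresholding.
import cv2
import numpy as np

def spectral_residual_saliency(bgr_image, short_side=100, thresh_scale=3.0):
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY).astype(np.float32)
    scale = short_side / min(gray.shape[:2])
    small = cv2.resize(gray, None, fx=scale, fy=scale)

    spectrum = np.fft.fft2(small)
    log_amplitude = np.log(np.abs(spectrum) + 1e-8).astype(np.float32)
    phase = np.angle(spectrum)

    # The spectral residual is the log amplitude minus its local average.
    residual = log_amplitude - cv2.blur(log_amplitude, (3, 3))
    saliency = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
    saliency = cv2.GaussianBlur(saliency.astype(np.float32), (9, 9), 2.5)

    # Threshold at a multiple of the mean saliency to get the binary map.
    binary = (saliency > thresh_scale * saliency.mean()).astype(np.uint8)
    return saliency, binary
```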
Region-Based Sampling
When multiple food items are present on a plate, we have to separate them before identification, which leads us to image segmentation. Furthermore, in images of food the composition is intrinsically hierarchical: a salad is composed of different ingredients and is itself contained in a plate. Hence we perform segmentation at multiple scales.
No single feature is suitable for all segmentations: a pair of regions may belong together for many different reasons, such as similarity of color or texture. Hence we also perform segmentation with multiple features. We rely on a hierarchical grouping algorithm to form the basis of our selective search (a simplified sketch in code follows Figure 4). As regions yield richer information than pixels, we use region-based features whenever possible. We initially obtain a set of small starting regions and extract features from each of them. Then we use a greedy algorithm to iteratively group regions together: first, the similarities between the features extracted from all pairs of neighboring regions are calculated. The two most similar regions are grouped together, and a new set of similarities is calculated between the resulting region and its neighbors. This grouping of the most similar regions is repeated until the whole image becomes a single region.
We use the following cues to compute the similarity:
Color similarity: We obtain a 1-dimensional color histogram for each color channel, and compute similarity using the min-intersection kernel.
Texture similarity: We represent texture using SIFT-like features. We take Gaussian derivatives in 8 orientations for each color channel. The texture is then parameterized as a histogram across the different orientations, and we again use the min-intersection kernel.
Size similarity: We want small regions of similar sizes to be merged. This is desirable because it ensures that object locations at all scales are created at all parts of the image. Figure 4 shows the sampled segments of one input image. Normally, a typical image yields about 100 segments.
Figure 4.
Sampled segmentation. The figure shows an example of our segmentation. The food items present in a food plate are segmented into different regions, each of which usually corresponds to a single food item. The regions in green correspond to the steak, the regions in blue correspond to fruit, and the regions in red correspond to the roll. Classifiers are then run on each of these regions.
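The greedy grouping described above can be sketched as follows. This simplified, hypothetical version starts from a Felzenszwalb over-segmentation, uses only the color-histogram cue with the min-intersection kernel, and, for brevity, compares all region pairs rather than only spatially adjacent ones.

```python
# Sketch of greedy hierarchical region grouping on color-histogram similarity.
import numpy as np
from skimage.segmentation import felzenszwalb

def color_hist(pixels, bins=8):
    # Per-channel 1-D histograms (uint8 image assumed), L1-normalized, concatenated.
    hists = [np.histogram(pixels[:, c], bins=bins, range=(0, 255))[0] for c in range(3)]
    h = np.concatenate(hists).astype(float)
    return h / max(h.sum(), 1.0)

def hist_intersection(h1, h2):
    return np.minimum(h1, h2).sum()

def hierarchical_grouping(rgb_image, max_merges=50):
    # Small starting regions from an over-segmentation.
    labels = felzenszwalb(rgb_image, scale=100, sigma=0.8, min_size=50)
    regions = {r: rgb_image[labels == r] for r in np.unique(labels)}
    hists = {r: color_hist(p) for r, p in regions.items()}

    merges = []
    for _ in range(max_merges):
        if len(hists) < 2:
            break
        # Find the most similar pair of regions.
        keys = list(hists)
        best, best_sim = None, -1.0
        for i in range(len(keys)):
            for j in range(i + 1, len(keys)):
                sim = hist_intersection(hists[keys[i]], hists[keys[j]])
                if sim > best_sim:
                    best, best_sim = (keys[i], keys[j]), sim
        a, b = best
        # Merge b into a: pool the pixels and recompute the histogram.
        regions[a] = np.vstack([regions[a], regions.pop(b)])
        hists[a] = color_hist(regions[a])
        del hists[b]
        merges.append(best)
    return merges
```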
Classifier Training
Online image databases like ImageNet, Flickr, and Google Images can be used to obtain a large number of food images. An obvious approach is to come up with a list of predefined foods, harvest their images, manually fix the incorrect labels, and train a classifier. However, these web images are often taken from advertisements and bear little resemblance to the pictures taken by a user. They are often shot for aesthetic purposes, with controlled lighting and nonrealistic views (eg, burgers are often shot from the side with all layers exposed). We found that classifiers learnt from these web images generalize very poorly to cell phone photos. Hence we collected our own data set by encouraging users to upload images of their food plates using a preliminary version of our app. A set of annotators then manually segmented the different food items and assigned them food categories. We collected approximately 2000 training images for 15 food categories, with about 100-400 images per category.
We train a linear SVM (support vector machine) classifier for each food category, using stochastic gradient descent to train the classifier. After the first round, we run detection on the training images and append the negative examples with a large score (hard negatives) to the negative data, and then retrain the classifier. In this way the learned classifier becomes more discriminative.
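A minimal sketch of this training loop, assuming scikit-learn's SGDClassifier with a hinge loss as the linear SVM; the feature matrices (eg, Fisher vectors of labeled regions) and the hard-negative fraction are placeholders.

```python
# Sketch of linear SVM training (SGD, hinge loss) with one round of
# hard-negative mining, as described in the text.
import numpy as np
from sklearn.linear_model import SGDClassifier

def train_with_hard_negatives(X_pos, X_neg, hard_fraction=0.1):
    clf = SGDClassifier(loss="hinge", alpha=1e-4, max_iter=1000)
    X = np.vstack([X_pos, X_neg])
    y = np.concatenate([np.ones(len(X_pos)), -np.ones(len(X_neg))])
    clf.fit(X, y)

    # Hard-negative mining: rescore the negatives and append the
    # highest-scoring (most confusing) ones to the negative data.
    scores = clf.decision_function(X_neg)
    n_hard = max(1, int(hard_fraction * len(X_neg)))
    hard = X_neg[np.argsort(scores)[-n_hard:]]

    X2 = np.vstack([X, hard])
    y2 = np.concatenate([y, -np.ones(len(hard))])
    clf.fit(X2, y2)   # retrain with the emphasized hard negatives
    return clf
```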
Deciding on the set of food items to detect is a nontrivial task. The food items have a coarse-to-fine hierarchical relationship among them. As shown in Figure 5, if the semantic granularity is too coarse, the appearance variance can be so large that the learnt classifier does not generalize well to the test set. If the semantic granularity is too fine, the number of classes can explode, affecting system performance.
Figure 5.
Semantic ambiguities in food items. Left: 4 food items are assigned the same label “sandwiches.” Right: 4 visually similar items are assigned different labels.
We use a food ontology to organize the food categories. The ontology is a directed acyclic graph (DAG), where each node represents a food category. The root node of this food ontology is simply food. Each directed edge connects a parent node to a child node and represents the ISA (is a) relationship. A node closer to the leaves is a more well-defined food category, while a node closer to the root is a more general one. One such food ontology is shown in Figure 6.
Figure 6.
Food ontology. An example of a food ontology with an enlarged view of a small section.
This DAG food ontology enables us to organize the large number of food categories and control the mapping from food labels to classifiers. Each node in the figure corresponds to a classifier; it may contain a single well-defined food category or a more general one, such as red meat.
During annotation, we label each example by specifying the path from the root node to a leaf node in the food ontology. At each level we ask a multiple-choice question to specify which child it belongs to. The number of questions depends on the depth of the DAG; in our case the maximal depth is 7. As we may have hundreds of different food categories, choosing the exact label by scrolling through a long list can be hard and extremely time-consuming. Instead we ask a few simple questions, which reduces the labeling time drastically.
During training, the food ontology renders an automatic split of the positive/negative training data. To train a classifier corresponding to a node, all the images corresponding to this node and its children are included in the positive set and all the images corresponding to the remaining nodes (including non-food) are included in the negative set. Table 1 shows classification results.
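To make the ontology-driven split concrete, the sketch below shows one way such a DAG and its positive/negative split could be represented; the class structure and bookkeeping are illustrative assumptions, not our production implementation.

```python
# Sketch of a food ontology as a DAG with ISA edges, and the automatic
# positive/negative training split for the classifier at a given node.
from collections import defaultdict

class FoodOntology:
    def __init__(self):
        self.children = defaultdict(set)   # parent -> set of children (ISA edges)
        self.images = defaultdict(list)    # node -> images labeled at that node

    def add_edge(self, parent, child):
        self.children[parent].add(child)

    def descendants(self, node):
        seen, stack = set(), [node]
        while stack:
            for child in self.children[stack.pop()]:
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        return seen

    def training_split(self, node, all_nodes):
        # Positives: images at this node and all of its descendants.
        pos_nodes = {node} | self.descendants(node)
        positives = [img for n in pos_nodes for img in self.images[n]]
        # Negatives: images at every other node (including non-food).
        negatives = [img for n in set(all_nodes) - pos_nodes for img in self.images[n]]
        return positives, negatives
```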
Table 1.
Classification Performance With 4 Different Features and Their Fusion in Our Data Set (5-Fold Cross-Validation).
| Feature | Pizza | Strawberry | Mixed Fruit | Burger | French Fries | Green Salad | Spaghetti | Sandwich | Steak | Chicken wings | Sushi roll | Cheerios | Egg omelet | Pancakes | Broccoli | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HOG | 65.3 | 65.7 | 74.2 | 48.9 | 81.9 | 78.2 | 78.6 | 52.5 | 62.2 | 50.9 | 54.5 | 85.7 | 41.4 | 51.3 | 57.7 | 63.3 |
| SIFT-16 | **76.5** | 78.4 | **76.4** | **63.9** | 85.3 | **86.5** | **79.5** | **61.0** | **77.3** | 68.9 | **64.9** | 84.1 | **62.6** | 56.4 | **75.0** | **73.1** |
| SIFT-32 | 70.4 | 78.4 | 68.5 | 61.7 | **86.2** | 85.7 | 75.9 | 59.3 | 72.3 | **72.6** | 62.3 | **93.7** | 58.6 | **62.8** | 65.4 | 71.6 |
| LAB | 42.9 | **88.2** | 52.8 | 36.1 | 53.4 | 82.7 | 35.7 | 44.1 | 66.4 | 57.5 | 63.6 | 44.4 | 47.5 | 38.5 | 73.1 | 55.1 |
| Fusion | **81.6** | **88.2** | **82.0** | **66.9** | **90.5** | **93.2** | **83.0** | **69.5** | **84.0** | **81.1** | **75.3** | **92.1** | **65.7** | **70.5** | **80.8** | **80.3** |
Note: Fusion denotes a combination of the 4 individual features (ie, HOG, SIFT-16, SIFT-32, and LAB). The bold numbers indicate the highest-performing individual feature for each food category and the performance of the fused feature.
Nutrition Estimation
In addition, the system estimates the portion size of the food by counting the pixels in each corresponding food segment. While we could also use depth images to obtain better estimates of the portion size,4 we empirically found that counting the number of pixels gives a reasonably good estimate. Once the portion size is determined, the system estimates the calorific content and the nutrition facts. For simplicity, we assume that each food category has a predefined calorific and nutrition density, which we use to compute its calorific and nutrition content. Figure 7 shows some of the nutrition and calorific contents estimated for images in our test data set.
Figure 7.
Examples of nutrition estimation on the test data set. In each image the food items and the detected class label below it have been given the same color.
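The portion and nutrition computation reduces to a simple scaling of per-category densities by the segment area. The sketch below assumes a per-pixel mass factor and a small illustrative nutrition table; the calibration and density values in our system differ and are not shown here.

```python
# Sketch of portion-size and nutrition estimation from a food segment mask.
import numpy as np

# Illustrative per-100-g values (calories, protein g, carbohydrate g, fat g);
# placeholders, not the densities used by the deployed system.
NUTRITION_PER_100G = {
    "french fries": (312, 3.4, 41.0, 15.0),
    "green salad": (17, 1.2, 3.3, 0.2),
}

def estimate_nutrition(segment_mask, food_label, grams_per_pixel=0.01):
    # Portion size is taken to be proportional to the pixel count of the segment.
    portion_g = segment_mask.sum() * grams_per_pixel
    cal, protein, carbs, fat = NUTRITION_PER_100G[food_label]
    scale = portion_g / 100.0
    return {"portion_g": portion_g, "calories": cal * scale,
            "protein_g": protein * scale, "carbs_g": carbs * scale,
            "fat_g": fat * scale}

# Example: a segment of roughly 9600 pixels classified as french fries.
mask = np.zeros((400, 400), dtype=bool)
mask[100:220, 150:230] = True
print(estimate_nutrition(mask, "french fries"))
```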
Personalization
Using the initial training set, we train classifiers for many commonly occurring food items. However, each user often consumes a number of food items that the system has not seen before. Moreover, due to the user’s environment and habits (eg, homemade food) or locations (eg, restaurant-served food), the new food items that appear follow well-defined patterns. It is impossible for us to collect enough food pictures for all such food items before deploying the system; therefore, such a system needs to be personalized. The user and location information are easily accessible given the nature of the phone app. This information can be used to fine-tune the classifier so that it works optimally for a specific user, rather than trying to build a single classifier that accurately recognizes all the different kinds of food. A model personalized to a specific user or customized to a particular location (eg, a restaurant) can significantly boost the classification accuracy of a generic classifier and makes food recognition “in the wild” much more tractable; this will be the focus of our future work.
Results
We evaluate our system on our data set of about 2000 images covering 15 predefined classes, with between 100 and 400 examples of each food in the training set. Table 1 shows the breakdown of the classification performance for 4 different features using 5-fold cross-validation. The results in Table 1 are derived from the underlying confusion matrices, which we omit for reasons of space. The table shows the average classification accuracy, that is, the percentage of the test images of each category correctly classified. The SIFT features at the 2 different scales give the best individual performance, and fusing all the features improves on that performance further. We also tested the app in a cafeteria on plates of food ordered by various customers, so the test set is an ongoing one that grows daily; for privacy and confidentiality reasons we cannot include those pictures. We have not seen any obvious pattern in the failures when they occur. Note that the results can be improved considerably by narrowing down the list of possibilities using information such as location (eg, the menu items at a restaurant), prior choices indicating user habits, and other such metadata, thus achieving personalization.
The images in Figure 7 show our nutrition estimation on images from the test set.
Discussion
The results clearly show that, with a modest number of food categories, the proposed technique is effective. The app is not in the public domain yet. There are two avenues for further improvement. First, by expending more computation, for example by computing features at more scales, the accuracy is likely to improve; we currently limit the number of scales for computational reasons. We can also use a richer set of features, especially those pertaining to shape. Second, we need to address scaling up the problem. For the application to be realistic, we need to handle hundreds of food items. For example, our study of the FNDDS (Food and Nutrient Database for Dietary Studies) indicates that about 300 foods account for a large percentage of usual American dietary intake. Scaling up to hundreds of foods will require new approaches to the food ontology described in this article, because we will need an incremental way to update the ontology whenever a new food is added.
Conclusions
We have presented a mobile food recognition system. The system recognizes the food items present on a plate and estimates their calorific and nutrition content automatically, without any user intervention. To identify food items, the user simply needs to snap a photo of the food plate. The system detects the salient regions corresponding to the food items. Hierarchical segmentation is performed to segment the image into regions. The system extracts features at different locations and scales and classifies these regions into different food categories using a linear SVM classifier. The system also estimates the portion size of the food and uses it to determine the calorific and nutrition content. We have implemented this system as an Android smartphone application as well as a web service. In our experiments, we achieved over 85% accuracy when detecting 15 different categories of food items. As part of future work, we plan to personalize the food recognition classifier based on user habits, location, and other metadata.
Footnotes
Abbreviations: BoVW, bag of visual words; BoW, bag of words; DAG, directed acyclic graph; FNDDS, Food and Nutrient Database for Dietary Studies; GMM, Gaussian mixture model; HOG, histogram of oriented gradients; HSV, hue saturation value; PCA, principal components analysis; SIFT, scale-invariant feature transform; SVM, support vector machine.
Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.
References
- 1. Puri M, Zhu Z, Yu Q, Divakaran A, Sawhney H. Recognition and volume estimation of food intake using a mobile device. Paper presented at: IEEE Workshop on Applications of Computer Vision; December 2009; Snowbird, UT.
- 2. Weiss R, Stumbo P, Divakaran A. Automatic food documentation and volume computation using digital imaging and electronic transmission. J Am Diet Assoc. 2010;110(1):42-44.
- 3. Yang S, Chen M, Pomerleau D, Sukthankar R. Food recognition using statistics of pairwise local features. Paper presented at: IEEE Conference on Computer Vision and Pattern Recognition; June 2010; San Francisco, CA.
- 4. Chen MY, Yang YH, Ho CJ, et al. Automatic Chinese food identification and quantity estimation. Paper presented at: SIGGRAPH Asia; November 2012; Singapore.
- 5. Anthimopoulos M, Dehais J, Diem P, Mougiakakou S. Segmentation and recognition of multi-food meal images for carbohydrate counting. Paper presented at: IEEE 13th International Conference on Bioinformatics and Bioengineering; November 2013; Chania, Greece.
- 6. Tamrakar A, Ali S, Yu Q, et al. Evaluation of low-level features and their combinations for complex event detection in open source videos. Paper presented at: IEEE Conference on Computer Vision and Pattern Recognition; June 2012; Providence, RI.
- 7. Lowe DG. Distinctive image features from scale-invariant keypoints. Int J Comput Vision. 2004;60(2):91-110.
- 8. Perronnin F, Dance C. Fisher kernels on visual vocabularies for image categorization. Paper presented at: IEEE Conference on Computer Vision and Pattern Recognition; June 2007; Minneapolis, MN.