Adv Nutr. 2022 Jul 8;13(6):2590–2619. doi: 10.1093/advances/nmac078

TABLE 5.

List of image-based food-recognition systems for dietary assessment based on CNNs on publicly available food datasets1

Reference User input-preprocessing Segmentation Feature extraction Dimensionality reduction Classification Volume estimation Datasets and performance
(32)  — Locality-Constrained Linear Coding, deep features with DeCAF Bag-of-Words SVM  — Dishes (deep features) ACC = 72.88%
(6) Drawing polygons Manual segmentation using polygons CNN  — CNN  — Food segmentation: UNIMIB 2016 Recall = 71.4%, Precision = 73.4%, F-measure = 72.4%. Food classification: UNIMIB 2016 ACC = 78%
(17) Food localization using CNN CNN  — CNN  — Food segmentation: UEC-Food 256 Precision = 54.33%, Recall = 50.86%; Egocentric Food Precision = 17.38%, Recall = 8.72%. Food classification: UEC-Food 256 ACC = 63.16%; Egocentric Food ACC = 90.90%
(76)  — CNN SVM  — PFID ACC = 70.13%
(48) Top-view photo including credit card Color-pixel-based k-means clustering and GrabCut CNN Based on the size of the reference object UEC-Food 100 ACC = 75%
(88)  —  —  — CNN  — Food-101 ACC = 88.28%; UEC-Food 100 ACC = 81.45%; UEC-Food 256 ACC = 76.17%
(120)  — CNN including semantics-aware features CNN  — Food-101 ACC = 72.11%
(37) CNN exploiting the joint relation between food and ingredient labels through multi-task learning  — CNN  — Food classification: VIREO Food-172 ACC = 82.06%; UEC-Food 100 ACC = 82.12%. Ingredient recognition: VIREO Food-172 F-measure = 67.17%; UEC-Food 100 F-measure = 70.72%
(77)  —  — CNN  — CNN  — UEC-Food 100 ACC = 60.9%
(121) Covariances of convolutional layer feature maps of CNN  — CNN  — Food-101 ACC = 58.65%
(122)  — CNN  — CNN  — UEC-Food 100 ACC = 76.3%; Food-101 ACC = 77.4%
(123)  — Feature vector from ensemble of 3 CNNs CNN  — Food-101 ACC = 71.12%
(124)  — Food border defined by user with a circle CNN CNN  — FooDD ACC = 94.11%
(125)  — CNN  — Multi-Task Triplet Network  — UEC-Food 256 MAP = 31.7%
(110)  —  — CNN  — CNN (NutriNet)  — UNIMIB 2016 ACC = 86.39%
(126)  —  — CNN CNN  — Food-101 ACC = 87.96%; UEC-Food 100 ACC = 86.51%; UEC-Food 256 ACC = 78.60%
(127) CNN CNN  — Food-101 ACC = 86.97%
(92) CNN (DualNet) Ensemble of CNNs  — UEC-Food 100 ACC = 49.19%
(39)  — CNN SVM  — NTUA-Food 2017 ACC = 85.94%
(128)  —  — CNN  — CNN (Inception V3)  — Food-101 ACC = 81.65%
(103) CNN (Mask R-CNN) CNN (ResNet50) CNN CNN (VolumeNet) Food segmentation: MAP = 64.7%. Food classification: Madima 2017 ACC = 93.33%
(129)  — CNN CNN (WISeR)  — UEC-Food 100 ACC = 89.58%; UEC-Food 256 ACC = 83.15%; Food-101 ACC = 90.27%
(130)  —  —  — CNN  — UNIMIB 2016 ACC = 77.5%
(22)  —  — CNN (VGG, ResNet-50, Wide ResNet-50, Inception V3)  — Food classification: ChinaFood-100 ACC = 78.26% (Inception V3). Nutrients estimation: protein, fiber, vitamin, calcium, and iron; MAPE is approximately 65%
(36)  —  — CNN (VGG) SVM  — Instagram800K ACC = 72.8%
(40) Top view image and side view image, a coin as a fiducial marker GrabCut and Faster R-CNN CNN CNN  — ECUST Food Dataset Mean Error = ±20%
(41) CNN (ResNet) CNN (ResNet)  — Food524DB ACC = 69.52%
(45)  —  — Multi-task CNN (ResNet) CNN (ResNet)  — Eating Occasion Image to Food Energy ACC = 88.67% MAE = 56.82 (kcal)
(46)  — Faster R-CNN provides bounding boxes with a foodness score CNN (DenseNet-121) Multi-task CNN (DenseNet-121)  — Food localization: UEC-Food 100 Precision = 82%, Recall = 86%, F-measure = 84%; UEC-Food 256 Precision = 94%, Recall = 88%, F-measure = 91%; VIPERFoodNet Precision = 79%, Recall = 64%, F-measure = 71%. Food classification: Food-101 ACC = 80%; UPMC Food-101 ACC = 69%; UEC-Food 100 ACC = 81%; UEC-Food 256 ACC = 72%; VIPERFoodNet ACC = 72%
(93)  —  —  — Ensemble of CNNs (VGG16, VGG19, GoogLeNet, ResNet, Inception V3) with 5 different combination rules (minimum, average, median, maximum, product)  — Food-101 ACC = 84.28%; UEC-Food 100 ACC = 84.52%; UEC-Food 256 ACC = 77.20%
(131)  — Category and ingredient oriented feature extraction based on CNN (VGG-16); fusion of 2 different kinds of features  — Adapted CNN  — Food-101 ACC = 55.3%; VIREO Food-172 ACC = 75.1%; ChineseFoodNet ACC = 66.1%
(97) Two meal images from 2 different viewing angles, 90 and 75 degrees from the table's plane, or short video, fiducial marker Automated segmentation based on Mask R-CNN. Semi-automatic segmentation based on region growing and merging algorithm  —  — CNN (Inception V3)  — Madima database segmentation results: Fmin = 83.9%, Fsum = 94.4%; Madima database food recognition results: ACC = 57.1%
(99) Multiple hypothesis segmentation: salient region detection, multi-scale segmentation and fast rejection Color, texture, and local neighborhood pixel features  — ANN  — UNIMIB 2016 ACC = 95.9%
(117)  — Multi-relational graph convolutional network, termed mRGCN (ResNet-50)  — VIREO Food-172 ACC = 24.2% on unseen ingredients; UEC-Food 100 ACC = 17.9% on unseen ingredients
(116)  —  — First approach with Bag-of-Words extracts texture (binary patterns), color, SURF, and geometry features First approach Bag-of-Words First approach ANN; second approach CNNs such as GoogLeNet, Inception-v3, ResNet101  — 16 categories selected from UEC-Food 256; ACC = 93% (ResNet)
(132) Faster R-CNN provides bounding boxes (maximum 5 per image)  — CNN (VGGNet)  — UEC-Food 100 MAP = 17.5%; UEC-Food 256 MAP = 10.5%
(38)  —  — CNN (GoogLeNet)  — FOOD-5K ACC = 99.2%; Food-11 ACC = 83.6%
(133)  —  — First approach: feature extraction from AlexNet and VGG16 First approach: SVM; second approach: fine-tuning CNN (ResNet50)  — First approach: FOOD-5K ACC = 99.00%; Food-11 ACC = 89.33%; Food-101 ACC = 62.44%. Second approach: Food-101 ACC = 79.86%
(134)  —  —  — 5-Layer CNN  — UEC-Food 100 ACC = 60.90%
(100)  — Canny edge detection, multi-scale segmentation, fast rejection of background pixels Color, texture, SIFT, and SURF features 3-Layer ANN  — UNIMIB 2016 ACC = 94.5%
(135) Data augmentation by translations, rotations, shearing, zooming, and flipping  — CNN (Inception-v3) Ingredients and nutritional value estimation from vector space embeddings of words (text data from the internet) Food-101 ACC = 80.0%
(98) Local variation segmentation Color descriptors: SCD, DCD. Texture descriptors: EFD and GFD. Local descriptors: SIFT and MDSIFT. Multi-kernel SVM  — UNIMIB 2016 ACC = 93.9%
(86)  —  — Pretrained CNNs (GoogLeNet and ResNet152)  — NB, SVM-RBF, SVM-Poly, ANN, RF  — FOOD-5K (ResNet-152 and SVM-RBF) ACC = 98.8%; Food-11 (ResNet-152 and ANN) ACC = 91.34%; Food-101 (ResNet-152 and SVM-RBF) ACC = 64.98%
(136)  — Weakly supervised CNN model with a new pooling technique, incorporating a class activation map for graph-based segmentation (VGG-16)  —  — CNN (VGG-16)  — Food-101 ACC = 74.02%
(137) JSEG algorithm consisting of color quantization and spatial segmentation  —  — UNIMIB 2016 F-measure = 58%
1 ACC, accuracy; ANN, artificial neural network; CNN, convolutional neural network; DCD, dominant color descriptor; EFD, entropy-based categorization and fractal dimension estimation; GFD, Gabor-based image decomposition and fractal dimension estimation; MAP, mean average precision; MAPE, mean absolute percentage error; MDSIFT, multi-scale dense SIFT; NB, naive Bayes; PFID, Pittsburgh fast-food image dataset; Poly, polynomial; RBF, radial basis function; RF, random forest; SCD, scalable color descriptor; SIFT, scale-invariant feature transform; SURF, speeded up robust features; SVM, support vector machine.
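
A recurring design among the systems listed above pairs deep features from a pretrained CNN with a conventional classifier such as an SVM (e.g., references 86 and 133). Below is a minimal sketch of that pattern, not drawn from any of the cited works: it assumes PyTorch/torchvision and scikit-learn, and the dataset folder, batch size, and kernel choice are illustrative placeholders.

```python
# Sketch: pretrained CNN as a frozen feature extractor + SVM classifier.
# Assumptions: torchvision >= 0.13, scikit-learn installed, and a hypothetical
# "food_images/" folder laid out as one subdirectory per food category.
import torch
import torchvision.models as models
import torchvision.transforms as T
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader
from sklearn.svm import SVC

# ResNet-50 backbone with the final classification layer replaced by an identity,
# so each image is mapped to a 2048-dimensional feature vector.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()
backbone.eval()

transform = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

dataset = ImageFolder("food_images/", transform=transform)  # hypothetical dataset path
loader = DataLoader(dataset, batch_size=32, shuffle=False)

# Extract deep features without gradient tracking (the backbone stays frozen).
features, labels = [], []
with torch.no_grad():
    for images, targets in loader:
        features.append(backbone(images))
        labels.append(targets)
features = torch.cat(features).numpy()
labels = torch.cat(labels).numpy()

# RBF-kernel SVM on the frozen deep features (cf. the SVM-RBF entries in the table).
clf = SVC(kernel="rbf").fit(features, labels)
print("Training accuracy:", clf.score(features, labels))
```

In practice, the features would be split into training and test sets (for example with sklearn.model_selection.train_test_split) before fitting, so the reported accuracy reflects held-out images rather than the training data.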