TABLE 5.
Reference | User input-preprocessing | Segmentation | Feature extraction | Dimensionality reduction | Classification | Volume estimation | Datasets and performance |
---|---|---|---|---|---|---|---|
(32) | — | — | Locality-Constrained Linear Coding, deep features with DeCAF | Bag-of-Words | SVM | — | Dishes (deep features) ACC = 72.88% |
(6) | Drawing polygons | Manual segmentation using polygons | CNN | — | CNN | — | Food segmentation: UNIMIB 2016 Recall = 71.4%, Precision = 73.4%, F-measure = 72.4%. Food classification: UNIMIB 2016 ACC = 78% |
(17) | — | Food localization using CNN | CNN | — | CNN | — | Food segmentation: UEC-Food 256 Precision = 54.33%, Recall = 50.86%; Egocentric Food Precision = 17.38%, Recall = 8.72%. Food classification: UEC-Food 256 ACC = 63.16%; Egocentric Food ACC = 90.90% |
(76) | — | — | CNN | — | SVM | — | PFID ACC = 70.13% |
(48) | Top-view photo including credit card | Color-pixel-based k-means clustering and GrabCut | — | — | CNN | Based on the size of the reference object | UEC-Food 100 ACC = 75% |
(88) | — | — | — | — | CNN | — | Food-101 ACC = 88.28%; UEC-Food 100 ACC = 81.45%; UEC-Food 256 ACC = 76.17% |
(120) | — | — | CNN including semantics-aware features | — | CNN | — | Food-101 ACC = 72.11% |
(37) | — | — | CNN exploiting the joint relation between food and ingredient labels through multi-task learning | — | CNN | — | Food classification: VIREO Food-172 ACC = 82.06%; UEC-Food 100 ACC = 82.12%. Ingredient recognition: VIREO Food-172 F-measure = 67.17%; UEC-Food 100 F-measure = 70.72% |
(77) | — | — | CNN | — | CNN | — | UEC-Food 100 ACC = 60.9% |
(121) | — | — | Covariances of convolutional layer feature maps of CNN | — | CNN | — | Food-101 ACC = 58.65% |
(122) | — | — | CNN | — | CNN | — | UEC-Food 100 ACC = 76.3%; Food-101 ACC = 77.4% |
(123) | — | — | Feature vector from ensemble of 3 CNNs | — | CNN | — | Food-101 ACC = 71.12% |
(124) | — | Food border defined by user with a circle | CNN | — | CNN | — | FooDD ACC = 94.11% |
(125) | — | — | CNN | — | Multi-Task Triplet Network | — | UEC-Food 256 MAP = 31.7% |
(110) | — | — | CNN | — | CNN (NutriNet) | — | UNIMIB 2016 ACC = 86.39% |
(126) | — | — | CNN | — | CNN | — | Food-101 ACC = 87.96%; UEC-Food 100 ACC = 86.51%; UEC-Food 256 ACC = 78.60% |
(127) | — | — | CNN | — | CNN | — | Food-101 ACC = 86.97% |
(92) | — | — | CNN (DualNet) | — | Ensemble of CNNs | — | UEC-Food 100 ACC = 49.19% |
(39) | — | — | CNN | — | SVM | — | NTUA-Food 2017 ACC = 85.94% |
(128) | — | — | CNN | — | CNN (Inception V3) | — | Food-101 ACC = 81.65% |
(103) | — | CNN (Mask R-CNN) | CNN (ResNet50) | — | CNN | CNN (VolumeNet) | Food segmentation: MAP = 64.7%. Food classification: Madima 2017 ACC = 93.33% |
(129) | — | — | CNN | — | CNN (WISeR) | — | UEC-Food 100 ACC = 89.58%; UEC-Food 256 ACC = 83.15%; Food-101 ACC = 90.27% |
(130) | — | — | — | — | CNN | — | UNIMIB 2016 ACC = 77.5% |
(22) | — | — | — | — | CNN (VGG, ResNet-50, Wide ResNet-50, Inception V3) | — | Food classification: ChinaFood-100 ACC = 78.26% (Inception V3). Nutrient estimation: protein, fiber, vitamin, calcium, and iron; MAPE is approximately 65% |
(36) | — | — | CNN (VGG) | — | SVM | — | Instagram800K ACC = 72.8% |
(40) | Top view image and side view image, a coin as a fiducial marker | GrabCut and Faster R-CNN | CNN | — | CNN | — | ECUST Food Dataset Mean Error = ±20% |
(41) | — | — | CNN (ResNet) | — | CNN (ResNet) | — | Food524DB ACC = 69.52% |
(45) | — | — | Multi-task CNN (ResNet) | — | CNN (ResNet) | — | Eating Occasion Image to Food Energy ACC = 88.67% MAE = 56.82 (kcal) |
(46) | — | Faster R-CNN provides bounding boxes with a foodness score | CNN (DenseNet-121) | — | Multi-task CNN (DenseNet-121) | — | Food localization: UEC-Food 100 Precision = 82%, Recall = 86%, F-measure = 84%; UEC-Food 256 Precision = 94%, Recall = 88%, F-measure = 91%; VIPERFoodNet Precision = 79%, Recall = 64%, F-measure = 71%. Food classification: Food-101 ACC = 80%; UPMC Food-101 ACC = 69%; UEC-Food 100 ACC = 81%; UEC-Food 256 ACC = 72%; VIPERFoodNet ACC = 72% |
(93) | — | — | — | — | Ensemble of CNNs (VGG16, VGG19, GoogLeNet, ResNet, Inception V3) with 5 different combination rules (minimum, average, median, maximum, product) | — | Food-101 ACC = 84.28%; UEC-Food 100 ACC = 84.52%; UEC-Food 256 ACC = 77.20% |
(131) | — | — | Category- and ingredient-oriented feature extraction based on CNN (VGG-16); fusion of the 2 kinds of features | — | Adapted CNN | — | Food-101 ACC = 55.3%; VIREO Food-172 ACC = 75.1%; ChineseFoodNet ACC = 66.1% |
(97) | Two meal images from 2 different viewing angles (90 and 75 degrees from the table's plane) or a short video, fiducial marker | Automated segmentation based on Mask R-CNN; semi-automatic segmentation based on a region growing and merging algorithm | — | — | CNN (Inception V3) | — | Madima database segmentation: Fmin = 83.9%, Fsum = 94.4%. Madima database food recognition: ACC = 57.1% |
(99) | — | Multiple hypothesis segmentation: salient region detection, multi-scale segmentation and fast rejection | Color, texture, and local neighborhood pixel features | — | ANN | — | UNIMIB 2016 ACC = 95.9% |
(117) | — | — | — | — | Multi-relational graph convolutional network, termed mRGCN (ResNet-50) | — | VIREO Food-172 ACC = 24.2% on unseen ingredients; UEC-Food 100 ACC = 17.9% on unseen ingredients |
(116) | — | — | First approach: Bag-of-Words with texture (binary patterns), color, SURF, and geometry features | First approach: Bag-of-Words | First approach: ANN; second approach: CNNs such as GoogLeNet, Inception-v3, ResNet101 | — | 16 categories selected from UEC-Food 256; ACC = 93% (ResNet) |
(132) | — | Faster R-CNN provides bounding boxes (maximum 5 per image) | — | — | CNN (VGGNet) | — | UEC-Food 100 MAP = 17.5%; UEC-Food 256 MAP = 10.5% |
(38) | — | — | — | — | CNN (GoogLeNet) | — | FOOD-5K ACC = 99.2%; Food-11 ACC = 83.6% |
(133) | — | — | First approach: feature extraction from AlexNet and VGG16 | — | First approach: SVM; second approach: fine-tuning CNN (ResNet50) | — | First approach: FOOD-5K ACC = 99.00%; Food-11 ACC = 89.33%; Food-101 ACC = 62.44%. Second approach: Food-101 ACC = 79.86% |
(134) | — | — | — | — | 5-Layer CNN | — | UEC-Food 100 ACC = 60.90% |
(100) | — | Canny edge detection, multi-scale segmentation, fast rejection of background pixels | Color, texture, SIFT, and SURF features | — | 3-Layer ANN | — | UNIMIB 2016 ACC = 94.5% |
(135) | Data augmentation by translations, rotations, shearing, zooming, and flipping | — | — | — | CNN (Inception-v3) | Ingredients and nutritional value estimation from vector space embeddings of words (text data from the internet) | Food-101 ACC = 80.0% |
(98) | — | Local variation segmentation | Color descriptors: SCD, DCD. Texture descriptors: EFD and GFD. Local descriptors: SIFT and MDSIFT. | — | Multi-kernel SVM | — | UNIMIB 2016 ACC = 93.9% |
(86) | — | — | Pretrained CNNs (GoogLeNet and ResNet152) | — | NB, SVM-RBF, SVM-Poly, ANN, RF | — | FOOD-5K (ResNet-152 and SVM-RBF) ACC = 98.8%; Food-11 (ResNet-152 and ANN) ACC = 91.34%; Food-101 (ResNet-152 and SVM-RBF) ACC = 64.98% |
(136) | — | Weakly supervised CNN with a new pooling technique, incorporating a class activation map for graph-based segmentation (VGG-16) | — | — | CNN (VGG-16) | — | Food-101 ACC = 74.02% |
(137) | — | JSEG algorithm consisting of color quantization and spatial segmentation | — | — | — | — | UNIMIB 2016 F-measure = 58% |
ACC, accuracy; ANN, artificial neural network; CNN, convolutional neural network; DCD, dominant color descriptor; EFD, entropy-based categorization and fractal dimension estimation; GFD, Gabor-based image decomposition and fractal dimension estimation; MAP, mean average precision; MAPE, mean absolute percentage error; MDSIFT, multi-scale dense SIFT; NB, naive Bayes; PFID, Pittsburgh fast-food image dataset; Poly, polynomial; RBF, radial basis function; RF, random forest; SCD, scalable color descriptor; SIFT, scale-invariant feature transform; SURF, speeded up robust features; SVM, support vector machine.
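A recurring pattern in the table (e.g. rows (76), (36), (39), (86), (133)) is to use a pretrained CNN as a fixed feature extractor and train a classical SVM on the resulting embeddings. A minimal sketch of that two-stage pipeline is below; since a real pretrained network is not loaded here, synthetic Gaussian clusters stand in for CNN feature vectors, and the class count and dimensions are illustrative assumptions.

```python
# Sketch of the "CNN features + SVM classifier" pipeline used by several
# surveyed methods. Synthetic 512-D feature vectors stand in for real
# CNN embeddings; 5 clusters play the role of 5 food classes.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# 600 "images", each represented by a 512-D "CNN feature" vector
X, y = make_blobs(n_samples=600, n_features=512, centers=5,
                  cluster_std=3.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# RBF-kernel SVM on standardized features, as in e.g. row (86)'s SVM-RBF
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma="scale"))
clf.fit(X_train, y_train)
acc = clf.score(X_test, y_test)
```

In practice the feature vectors would come from the penultimate layer of a network such as ResNet-152 or GoogLeNet, and only the SVM is trained on the food dataset.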
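Row (93) fuses an ensemble of five CNNs with five score-combination rules: minimum, average, median, maximum, and product. The sketch below shows how those rules operate on per-model class-probability matrices; random softmax-normalized scores stand in for real CNN outputs, and the model/sample/class counts are arbitrary.

```python
# Sketch of the five combination rules from row (93): per-class softmax
# scores from several networks are fused element-wise across models,
# then argmax over classes yields the ensemble prediction.
import numpy as np

rng = np.random.default_rng(0)
n_models, n_samples, n_classes = 5, 4, 3

# scores[m, i, c]: model m's softmax probability of class c for sample i
logits = rng.normal(size=(n_models, n_samples, n_classes))
scores = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

rules = {
    "minimum": scores.min(axis=0),
    "average": scores.mean(axis=0),
    "median":  np.median(scores, axis=0),
    "maximum": scores.max(axis=0),
    "product": scores.prod(axis=0),
}
# One predicted class label per sample, per rule
predictions = {name: fused.argmax(axis=1) for name, fused in rules.items()}
```

The average and product rules are the usual defaults; minimum and product are stricter (a single unconfident model suppresses a class), while maximum lets any one model dominate.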
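Row (135) lists its preprocessing as data augmentation by translations, rotations, shearing, zooming, and flipping. A minimal sketch of those five geometric transforms using `scipy.ndimage` follows; a random 64x64 grayscale array stands in for a food photograph, and the specific shift, angle, shear, and zoom values are illustrative.

```python
# Sketch of the augmentation scheme in row (135): each training image is
# perturbed by translation, rotation, shearing, zooming, and flipping.
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)
image = rng.random((64, 64))  # stand-in for a grayscale food photo

translated = ndimage.shift(image, shift=(4, -3), mode="nearest")
rotated = ndimage.rotate(image, angle=15, reshape=False, mode="nearest")
# Shear along x: output pixel (r, c) samples input pixel (r, c + 0.2*r)
shear = np.array([[1.0, 0.0], [0.2, 1.0]])
sheared = ndimage.affine_transform(image, shear, mode="nearest")
# Zoom in by 1.25x, then crop back to the original 64x64 size
zoomed = ndimage.zoom(image, 1.25)[:64, :64]
flipped = np.fliplr(image)  # horizontal flip

augmented = [translated, rotated, sheared, zoomed, flipped]
```

Each transform yields a new 64x64 training example with the same class label, which is the point of augmentation: the network sees label-preserving variations it would otherwise never encounter.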