Journal of Food Science. 2025 Mar 7;90(3):e70116. doi: 10.1111/1750-3841.70116

Image‐based food groups and portion prediction by using deep learning

Hidir Selcuk Nogay 1, Nalan Hakime Nogay 2, Hojjat Adeli 3
PMCID: PMC11887021  PMID: 40052549

Abstract

Chronic diseases related to nutrition, such as obesity and hypertension, can be prevented by following an appropriate diet, accurately measuring portion sizes, and developing healthy eating habits. A system that automatically measures food consumption is important for determining whether individual nutritional needs are met, for diagnosing and addressing nutritional problems quickly, and for minimizing the risk of malnutrition given the cross‐cultural diversity of foods. In this study, a deep learning system based on convolutional neural network (CNN) image recognition models was developed and implemented for automatically grouping and classifying foods and estimating portion sizes, with dishes from Turkish cuisine chosen as the sample for application and testing. With the inclusion of data augmentation, the system achieved accuracy rates of approximately 80% for food group classification and 80.47% for portion estimation.

Keywords: convolutional neural networks, data augmentation, food groups, portion, transfer learning

1. INTRODUCTION

Evaluation of nutritional status is very important for the prevention of many chronic diseases, such as obesity, diabetes, heart disease, cancer, and hypertension. Food consumption records and food frequency questionnaires are the methods commonly used to assess individuals’ dietary status. However, the information obtained about the types and amounts of food consumed depends entirely on the subjective judgment of the person providing it and may lead to an incorrect evaluation of food consumption (Lo et al., 2020). Sources of this subjectivity include differences in the perception of portion sizes, difficulties in recall, and a tendency to over‐report foods believed to be healthy (Christopher et al., 2017). Food image recognition systems aid in evaluating nutritional status and help monitor food consumption, and deep learning can automate the assessment of daily food consumption that would otherwise rely on retrospective recall (Ma et al., 2021). Such systems can also be used in restaurants for billing the food consumed by customers and for determining which dishes are most frequently preferred, helping both to reduce food waste and to increase the profit of a food business (Min et al., 2019). Image recognition systems are therefore a potential solution for the accurate assessment of nutritional status.

Numerous scientific studies have been published addressing image‐based food portion and volume estimation, classification, or automated diet evaluation. Wang et al. (2022) conducted a comprehensive review of vision‐based dietary assessment methods, focusing on multi‐task deep learning frameworks for diet assessment and food volume estimation. Their study highlighted the increasing role of convolutional neural networks (CNNs) in food analysis but also pointed out key challenges, including the difficulty of accurately estimating meal complexity and the lack of standardized large‐scale datasets. Additionally, their review did not address the applicability of these methods to specific cuisines, limiting their practical use in culturally diverse dietary assessments. Unlike previous works, our study specifically targets Turkish cuisine, incorporating both food classification and portion size estimation within a deep learning framework. By integrating advanced preprocessing techniques such as Canny edge detection (CED) and data augmentation (DA), we enhance model accuracy and robustness, addressing some of the key limitations identified in Wang et al.’s study.

Recent studies have explored various aspects of food classification and portion estimation. Sheng et al. (2022) proposed a lightweight transformer‐based deep neural network for food image recognition, optimizing computational efficiency while maintaining classification accuracy. However, their study was limited to food recognition without integrating dietary assessment components such as portion size estimation. Similarly, Gambetti and Han (2022) investigated food aesthetics in restaurant marketing, leveraging deep learning to assess visual appeal rather than nutritional or dietary aspects. Feng et al. (2023) developed a fine‐grained food image recognition model for Chinese cuisine using Laplacian pyramids and bilinear pooling to enhance texture feature extraction. However, their approach lacked generalization beyond Chinese food and did not integrate a structured dietary assessment framework. These works demonstrate the versatility of deep learning in food image analysis but fall short in providing a holistic approach to dietary assessment.

In terms of portion estimation, Siwathammarat et al. (2023) developed a multi‐task learning framework to classify food and estimate weight from images, primarily focusing on Thai cuisine. Although their study demonstrated improvements in computational efficiency, it lacked generalizability across different cuisines and did not incorporate advanced preprocessing techniques to enhance classification accuracy. Kadam et al. (2022) introduced a food volume estimation model utilizing mask‐based pre‐trained ResNet, achieving high accuracy for geometrically regular foods. However, their approach was tailored to Indian cuisine and did not integrate a structured dietary assessment framework. Konstantakopoulos et al. (2023) developed a dietary assessment system for Mediterranean foods by combining CNN‐based classification with stereo vision for volume estimation. However, their reliance on controlled imaging conditions limited the practicality of their approach for broader applications.

Efforts in food composition detection have also been explored by Kaushal et al. (2024), who applied deep learning to evaluate food composition, offering a non‐destructive approach to food analysis. However, their study primarily reviewed existing models rather than introducing a structured methodology for preprocessing techniques. Sarapisto et al. (2022) developed a meal type and weight estimation system using overhead‐mounted cameras, demonstrating feasibility in cafeteria settings but limiting its applicability to non‐standardized meal compositions. Ling et al. (2024) investigated deep learning–based food image recognition and calorie estimation for Malaysian cuisine, utilizing multiple pre‐trained CNN architectures. Their study achieved high accuracy but was restricted to 20 classes of Malaysian dishes and relied on a nutritional API that lacked cultural adaptability. Agarwal et al. (2023) proposed a hybrid deep learning model combining Mask RCNN and YOLO V5 for food recognition and calorie estimation. Although their system reported high accuracy, its reliance on predefined nutritional datasets limited its adaptability to diverse food preparation methods.

To address these limitations, our study introduces a deep learning–based system specifically designed for Turkish cuisine, integrating both food classification and portion size estimation. Unlike previous works that focus on single aspects of dietary assessment, our model provides a comprehensive solution by incorporating advanced preprocessing techniques such as CED and DA. By leveraging transfer learning (TL) with six pre‐trained CNN architectures—NasNet‐Mobile, SqueezeNet, ShuffleNet, ResNet‐18, GoogleNet, and MobileNet‐v2—our approach enhances classification robustness and accuracy, making it more adaptable to real‐world applications.

This study aims to improve the accuracy and applicability of automated dietary assessment by addressing key challenges in food classification and portion estimation. Our approach not only improves classification accuracy but also ensures that the system is culturally adaptive and practical for diverse dietary environments. By bridging the gap between food recognition and portion estimation, this work contributes to the development of robust and efficient deep learning models for real‐world nutritional analysis.

2. MATERIALS AND METHODS

2.1. Dataset

In this study, 679 food and nutrition photographs of Turkish cuisine from the database of the “BeBiS 9” package program, the Turkish version of a program developed by Pacific Electric Electronics Ltd. Şti., were used (Ebispro for Windows, 2021). Information about the database and BeBiS 9 can be accessed from the web page “http://www.bfr.bund.de/cd/801” (Bundeslebensmittelschluessel, 3.01B, n.d.).

The obtained raw images were subjected to data preprocessing and rearranged as in Tables 1 and 2. Two deep learning applications were implemented in the study. The first application was to estimate food groups; for this purpose, the image data were divided into groups according to food groups, as in Table 1. The second application aimed to estimate portion ranges, that is, the portion size of the food item in the image. The portion value ranges shown in Table 2 represent the estimated weight range of the food item in the image, where one portion of food corresponds to 150 g. Similarly, the image data were divided into groups according to the determined portion ranges and arranged to train the model designed for portion estimation. Therefore, two datasets were used in this experimental application study (a loading sketch is given after Table 2).

TABLE 1.

Food groups (G1–G6).

Food groups (G1–G6) Foods Number of images
G1 (dairy products) Cheese, milk, yogurt, buttermilk, butter 26
G2 (meat and its products, eggs and legumes and oilseeds) Fish, meat, ground beef, salami, sausage, bacon (from beef), chickpeas, beans, sunflower seeds, hazelnut, pistachio 118
G3 (bread and grains) Bread types, sesame bagel, pancake, pastry, etc. 110
G4 (vegetables and greens) Tomato, pepper, iceberg lettuce, potato, artichoke, onion, parsley, corn, mushrooms, celery, cauliflower, cabbage, capia pepper, lettuce, etc. 137
G5 (fruits) Apple, pear, quince, pomegranate, mandarin, grapefruit, grape, avocado, strawberry, date, mulberry, apricot, peach, kiwi, fig, raisin, dried fig, dried apricot, dried mulberry, nectarine, banana, orange, plum, loquat 142
G6 (other foods) Baklava, Ashura, rolling, biscuit, honey, chocolate, jam 146

TABLE 2.

Portion classes.

Classes (C1–C5) Portion value ranges (1 portion = 150 g) Number of images
C1 1 ≤ C1 < 2 221
C2 2 ≤ C2 < 4 83
C3 C3 ≥ 4 46
C4 C4 < 0.5 173
C5 0.5 ≤ C5 < 1 156
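
As an illustration of how the two datasets described above could be organized for training, the following is a minimal MATLAB (Deep Learning Toolbox) sketch; it assumes a folder‐per‐class layout, and the directory names dataset/food_groups and dataset/portion_classes are hypothetical.

```matlab
% Minimal sketch (hypothetical folder names): images are assumed to be stored
% one folder per class, so that folder names become the class labels.
groupDS   = imageDatastore('dataset/food_groups', ...      % G1 ... G6 subfolders
    'IncludeSubfolders', true, 'LabelSource', 'foldernames');
portionDS = imageDatastore('dataset/portion_classes', ...  % C1 ... C5 subfolders
    'IncludeSubfolders', true, 'LabelSource', 'foldernames');
countEachLabel(groupDS)     % should reproduce the image counts in Table 1
countEachLabel(portionDS)   % should reproduce the image counts in Table 2
```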

2.2. Data preprocessing

Two applications were performed in the study. In both applications, the image data were organized separately as 224 × 224 images with three color channels (red–green–blue). The total of 679 images before DA was quadrupled to 2716 after augmentation. An 80/20 training–test split, a widely accepted standard in deep learning applications, was adopted to balance effective model training against reliable performance evaluation. Applied to the 2716 images, this ratio ensures that the model learns from a sufficient amount of data while reserving a significant portion for unbiased testing. Because of the relatively small size of the dataset, splitting it further to create a separate validation set would shrink the training set and could increase the risk of overfitting. In addition, a five‐fold cross‐validation approach was applied to further increase reliability, reducing the risk of overfitting and providing a robust performance evaluation (Miraei Ashtiani et al., 2021). The dataset preprocessing stage consists of two steps. In the first step, edge detection was performed with the CED algorithm on each image in the dataset, and each image was cropped tangentially to the detected edges; the purpose of this cropping is to minimize the gaps between the food item and the outer frame. Figure 1 illustrates the first step of the preprocessing procedure.
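
The paper describes this step only in words; a minimal MATLAB sketch of one possible implementation (the function name cannyCrop and the fallback behavior when no edges are found are our assumptions) is:

```matlab
% Possible implementation of the first preprocessing step: Canny edge detection,
% cropping to the bounding box of the detected edges, and resizing to 224 x 224.
function Icrop = cannyCrop(I, targetSize)            % e.g. targetSize = [224 224]
    edges = edge(rgb2gray(I), 'canny');              % binary Canny edge map
    [rows, cols] = find(edges);                      % coordinates of edge pixels
    if isempty(rows)                                 % no edges found: keep full image
        Icrop = imresize(I, targetSize);
        return
    end
    bbox  = [min(cols), min(rows), ...               % [x y width height]
             max(cols) - min(cols) + 1, ...
             max(rows) - min(rows) + 1];
    Icrop = imresize(imcrop(I, bbox), targetSize);   % crop tangent to the edges
end
```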

FIGURE 1. The first stage of data preprocessing.

In the second step, the dataset was quadrupled by applying DA to each image by rotating it 90° right, rotating it 90° left, and adding salt noise.

DA plays a crucial role in this study by addressing several key challenges associated with deep learning–based food classification and portion estimation. One of the primary reasons for applying DA is to enhance the generalizability of the model by artificially increasing the diversity of the dataset (Javanmardi & Ashtiani, 2025). Given that the dataset is relatively limited in size, augmentation techniques, such as rotation, noise injection, and flipping, help the model learn more robust and invariant features, reducing its sensitivity to variations in food presentation, lighting conditions, and angles. Additionally, DA mitigates the risk of overfitting by preventing the model from memorizing specific patterns in a limited dataset and instead encouraging it to learn more generalized representations. This is particularly important in multi‐class classification tasks where certain food groups or portion categories may have fewer samples than others. By generating augmented images, the dataset becomes more balanced, improving the model's ability to correctly classify underrepresented classes. Overall, DA significantly contributes to improving model performance, enhancing classification accuracy, and ensuring better adaptability to real‐world variations in food images.

Noise injection is a DA method employed to avoid overfitting in neural network models. Typically, the image is altered with salt and pepper noise, where white or black specks are added to the image, as illustrated in Figure 2d. Incorporating noise into the process can enhance the model's learning speed and efficiency (Maharana et al., 2022).
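
As a minimal MATLAB sketch of the three augmentation operations described above (the file name and the 0.02 noise density are assumptions; note that imnoise adds salt‐and‐pepper noise, the closest built‐in option to the salt noise described):

```matlab
% The three augmented copies generated from each original image.
I     = imread('example_dish.jpg');            % hypothetical file name
right = imrotate(I, -90);                      % 90 degrees to the right (clockwise)
left  = imrotate(I,  90);                      % 90 degrees to the left
noisy = imnoise(I, 'salt & pepper', 0.02);     % noise injection (assumed density 0.02)
```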

FIGURE 2. (a) The original image; (b) rotated by 90° to the right; (c) rotated by 90° to the left; (d) adding noise to the image.

2.3. Designing and implementation

The deep learning application framework used in this study was implemented within the MATLAB Deep Learning Toolbox environment. The CNN approach, which is among the top performers in image‐based deep learning applications and allows automatic feature extraction, was utilized. Instead of designing a CNN from scratch, pre‐existing models, previously trained on approximately one million images and proven effective on the ImageNet dataset (ImageNet, n.d.), were retrained by modifying some layers with the TL technique to align with the goals of this study. Figure 3 displays the system applied in the study. This system comprises two components: in the first, data preprocessing and DA were performed; in the second, the pre‐existing CNN models were trained and tested using the TL technique.

FIGURE 3. Schematic representation of the implementation system.

2.3.1. CNN models

The pre‐trained models used with the TL approach in the study are NasNet‐Mobile (Zoph et al., 2018; Saxen et al., 2019), SqueezeNet, ShuffleNet (Zhang et al., 2018), ResNet‐18 (He et al., 2016), GoogleNet (Zhang et al., 2020), and MobileNet‐v2 (Sandler et al., 2018; Howard et al., 2017), as shown in Figure 3.

The input image size of all the pre‐trained networks except SqueezeNet is 224 × 224; SqueezeNet requires 227 × 227 inputs (Table 3). Summary information about the architecture of the pre‐trained CNN models employed in the study is provided in Table 3. The reasons for choosing these models can be explained as follows:

  1. Pre‐trained on ImageNet: Each of these models has been pre‐trained on the ImageNet dataset, a vast collection of images used extensively for image classification tasks. This pre‐training allows the models to learn a rich set of features that can be effectively transferred to other image recognition applications, such as our task of food classification and portion estimation.

  2. Proven effectiveness: The chosen models have demonstrated their effectiveness in achieving high accuracy on the ImageNet dataset for various image classification challenges. This suggests their potential for strong performance in other image recognition tasks, including our research.

  3. TL: These models are well‐suited for TL, a technique that involves fine‐tuning a pre‐trained model on a new dataset. By leveraging the knowledge learned from ImageNet, we can adapt these models to our specific task with less training time and computational resources compared to training a model from scratch.

  4. Diverse architectures: We specifically selected models with diverse architectures to explore a range of approaches and identify the most suitable one for our task. NasNet‐Mobile, for example, is optimized through reinforcement learning, whereas SqueezeNet prioritizes efficiency and ShuffleNet employs channel shuffling for improved performance. This diversity allows us to compare the performance of different model architectures and select the best one for our research goals.

  5. Specific strengths: Each model has specific strengths that make it relevant to our research. ResNet‐18 is known for its low computational cost, making it suitable for applications with limited resources. GoogleNet, on the other hand, is recognized for its memory efficiency due to its parallel modules. These strengths make these models suitable for different aspects of our research.

TABLE 3.

Structural information for the pre‐trained networks.

NasNet‐Mobile SqueezeNet ShuffleNet ResNet‐18 GoogleNet MobileNet‐v2
Size of input image 224 × 224 227 × 227 224 × 224 224 × 224 224 × 224 224 × 224
Number of convolution layers 354 26 49 20 57 42
Number of pooling layers 60 4 5 2 14 1
Number of fully connected layers 1 — 1 1 1 1
Number of normalization layers 192 — 49 20 2 52
Number of activation (ReLU) layers 192 26 33 17 57 35
Number of other layers 114 12 35 12 13 23
Total number of all layers 913 68 172 72 144 154
Rearranged layers via TL: classification layer, Softmax layer, and last fully connected layer for all models except SqueezeNet, for which the classification layer and the last convolutional layer were rearranged

Although alternative approaches such as training a custom CNN from scratch or using Vision Transformers (ViTs) were considered, pre‐trained CNN models were preferred due to their well‐established performance, reduced computational cost, and faster convergence. Training a model from scratch would require significantly larger datasets and extensive computational resources, whereas ViTs, though promising, generally require larger datasets to outperform CNNs in food image classification tasks. Given the dataset characteristics and computational constraints, CNNs with TL emerged as the most practical and effective choice for this study.

Pre‐trained CNN models were employed with TL on two occasions for two different purposes in this research: in the first application, classification of food groups, and in the second, classification and estimation of portion intervals. The training and testing settings applied in both applications are shown in Table 4. Because the study is also a comparison study, care was taken to use the same structural hyperparameters in all pre‐trained models. Because DA quadrupled the number of images, the hyperparameters with and without DA in Table 4 differ from each other.

TABLE 4.

Training and testing settings of the convolutional neural network (CNN) models.

Hyperparameters With DA Without DA
Maximum epoch 30 30
Maximum iteration 990 240
Iteration per epoch 33 8
Validation frequency 20 20
Initial learning rate 0.0001 0.0001
Mini batch size 64 64
Learning rate schedule Constant Constant

Abbreviation: DA, data augmentation.
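
The settings in Table 4 map directly onto MATLAB trainingOptions. A minimal sketch follows; the solver (sgdm) and the use of the test set as validation data are assumptions not stated in the paper, and trainDS/testDS are hypothetical datastore names (lgraph is produced by the TL sketch in the next subsection).

```matlab
% Training settings corresponding to the "With DA" column of Table 4.
opts = trainingOptions('sgdm', ...              % solver: an assumption
    'MaxEpochs', 30, ...
    'MiniBatchSize', 64, ...
    'InitialLearnRate', 1e-4, ...
    'LearnRateSchedule', 'none', ...            % constant learning rate
    'ValidationData', testDS, ...               % assumed validation source
    'ValidationFrequency', 20, ...
    'Shuffle', 'every-epoch', ...
    'Plots', 'training-progress');
trainedNet = trainNetwork(trainDS, lgraph, opts);   % lgraph: see the TL sketch below
```

Note that the iteration counts in Table 4 follow from these settings: with DA, roughly 2716 × 0.8 ≈ 2172 training images divided by a mini‐batch size of 64 give 33 iterations per epoch and 990 iterations over 30 epochs; without DA, about 543 training images give 8 iterations per epoch and 240 in total.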

2.3.2. Transfer learning (TL)

The TL approach enables a CNN trained for one task to be adapted to a new task by retraining part of the network with new data while keeping most of the previously learned weights unchanged. In this applied and comparative study, two applications and two classifications were carried out: the first classification aimed to predict food groups, and the second aimed to predict portion intervals. The TL approach was therefore used twice, for six‐class and five‐class classification, respectively. With the TL technique, the six pre‐trained models mentioned above were adapted by rearranging some layers for these two purposes. In SqueezeNet, which lacks a fully connected layer, the number of filters in the final convolutional layer and the last two layers were adjusted instead. The last row of Table 3 shows the layers modified using the TL method for each pre‐trained model utilized in the study.
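
As a minimal sketch of this rearrangement in the MATLAB Deep Learning Toolbox, using ResNet‐18 and the six food groups as an example (the new layer names are ours, and the original layer names should be verified with analyzeNetwork for the installed release):

```matlab
% Rearranging the last layers of ResNet-18 for the six food groups
% (use numClasses = 5 for the portion model). Layer names 'fc1000' and
% 'ClassificationLayer_predictions' are those of the MATLAB resnet18 model.
net        = resnet18;                           % pre-trained on ImageNet
lgraph     = layerGraph(net);
numClasses = 6;
lgraph = replaceLayer(lgraph, 'fc1000', ...
    fullyConnectedLayer(numClasses, 'Name', 'new_fc'));
lgraph = replaceLayer(lgraph, 'ClassificationLayer_predictions', ...
    classificationLayer('Name', 'new_output'));
```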

2.3.3. Convolution

Convolution is the process of generating an output matrix by applying a filter to the input matrix. During convolution, the stride determines the number of columns and rows by which the $n_K \times n_K$ filter matrix $K$ (kernel) is moved rightward and downward over the input matrix $A$. The dimension of the input matrix $A$ should be greater than, or a multiple of, $n_K$. Initially, the input matrix $A$ has dimensions $n_A \times n_A$, and after the filter is applied it is transformed into a smaller output matrix $B$. The calculation for each element of the output matrix $B$ is given by the following formula (Michelucci, 2019):

$$B_{ij} = (A \ast K)_{ij} = \sum_{f=0}^{n_K-1} \sum_{h=0}^{n_K-1} A_{i+f,\, j+h}\, K_{f,h} \qquad (1)$$

At times, it is not suitable to obtain results from convolution if the dimensions differ from those of the original image. In such cases, it is necessary to expand the edges of the image in pixels to restore it to its original dimensions. One approach to achieve this involves filling the newly added pixels with values from the nearest pixels or with zeros. This procedure, known as padding, requires specifying the padding amount (p) when designing a CNN architecture. Taking into account the padding and stride (s) values, the dimensions of the output matrix B resulting from convolution and pooling can be calculated as follows (Michelucci, 2019):

$$n_B = \left\lfloor \frac{n_A + 2p - n_K}{s} \right\rfloor + 1 \qquad (2)$$
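
As a concrete check of Equation (2): the first convolutional layer of ResNet‐18, for example, applies a 7 × 7 kernel with stride $s = 2$ and padding $p = 3$ to a 224 × 224 input (standard ResNet‐18 values, quoted here only as an illustration), giving

$$n_B = \left\lfloor \frac{224 + 2\cdot 3 - 7}{2} \right\rfloor + 1 = \left\lfloor \frac{223}{2} \right\rfloor + 1 = 112,$$

that is, a 112 × 112 output feature map.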

2.3.4. Pooling

The pooling layer is typically placed after the rectified linear unit (ReLU) layer. Its primary function is to decrease the spatial dimensions of the input (width × height) for the subsequent convolutional layer without affecting the depth of the image data. This size reduction lessens the computational load of the following network layers and helps prevent overfitting. The pooling layer employs filter matrices whose sizes are determined through trial and error to achieve the desired pooling outcome. These filters are applied according to the stride value s, extracting either the maximum pixel values (maximum pooling) or the average pixel values (average pooling). In this study, the most effective pooling method for the CNN model was chosen automatically using gravitational search optimization (GSO). The layers utilized in the study are detailed in Table 3. Pooling is applied to all feature maps produced by the convolution layer, once for each filter (Sewak et al., 2018).
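
For illustration only (not code from the study), the two pooling variants correspond to the following Deep Learning Toolbox layers; a 2 × 2 pool with stride 2 halves the spatial dimensions (e.g., 112 × 112 × 64 to 56 × 56 × 64) while leaving the depth unchanged:

```matlab
% Illustration only: the two pooling variants as Deep Learning Toolbox layers.
poolMax = maxPooling2dLayer(2, 'Stride', 2);      % keeps the maximum of each 2x2 block
poolAvg = averagePooling2dLayer(2, 'Stride', 2);  % keeps the average of each 2x2 block
```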

2.3.5. Classification and Softmax

In CNN networks, the classification layer is the final layer where the cross‐entropy loss is computed. The Softmax function, which provides the probability distribution for multiclass classification, is applied after the “fully connected” layer. For the classifications with five and six categories conducted in the study, the Softmax function is represented as follows:

$$y_r(x) = \frac{\exp\!\big(a_r(x)\big)}{\sum_{j=1}^{k} \exp\!\big(a_j(x)\big)} \qquad (3)$$

where $0 \le y_r \le 1$ and $\sum_{j=1}^{k} y_j = 1$, with $y_r$ representing the conditional probability that the sample belongs to class $r$ and $a_r$ the activation for class $r$ (Bishop & Nasrabadi, 2006; Sewak et al., 2018).
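
A small numeric illustration of Equation (3) for a six‐class output (the activation values are invented for the example):

```matlab
% Numeric illustration of the Softmax function for a six-class output.
a = [2.1 0.3 -1.0 0.5 1.2 -0.4];     % invented fully connected activations a_r(x)
y = exp(a) ./ sum(exp(a));           % Equation (3); the entries of y sum to 1
[~, predictedClass] = max(y);        % most probable food group
```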

2.4. Performance metrics

Within the classification layer, the Softmax function outputs are allocated to one of the K mutually exclusive classes through the cross‐entropy function. The model undergoes training by minimizing the loss function. Thus, loss values are used as a performance evaluation metric in this study. The loss function is computed as follows:

$$\mathrm{Loss} = -\sum_{i=1}^{N} \sum_{j=1}^{K} t_{ij} \ln y_{ij} \qquad (4)$$

where $N$ denotes the number of samples, $K$ represents the number of classes, $t_{ij}$ indicates whether the $i$th sample belongs to the $j$th class, and $y_{ij}$ is the model output for the $i$th sample and the $j$th class (Bishop & Nasrabadi, 2006; Nogay & Adeli, 2023). The accuracy rate was utilized as the second evaluation metric. The accuracy rate (Javanmardi et al., 2021) is calculated as follows:

$$\mathrm{Accuracy} = \frac{\text{Total correct prediction labels}}{\text{Total number of real labels}} \times 100 \qquad (5)$$

The study also incorporates additional performance metrics, including specificity, sensitivity, F1 score, Matthews correlation coefficient (MCC), and the area under the receiver operating characteristic (ROC) curve (AUC), which are briefly described below.

2.4.1. Specificity

Specificity represents the proportion of actual negative cases that are correctly classified as negative. It is also termed the true negative rate and indicates how effectively the model distinguishes negative instances from positive ones. A higher specificity value suggests that the model generates fewer false positives, meaning it is less likely to incorrectly label negative cases as positive.

2.4.2. Sensitivity

Sensitivity, also referred to as the true positive rate or recall, quantifies the proportion of actual positive cases that the model successfully identifies. It reflects the model's ability to accurately detect positive samples. A higher sensitivity score implies that the model has a lower number of false negatives, ensuring fewer positive instances are missed.

2.4.3. F1 score

The F1 score is the harmonic mean of precision and recall, offering a balanced evaluation of a model's classification performance by considering both false positives and false negatives. This metric ranges from 0 to 1, with a value of 1 indicating ideal precision and recall, whereas lower values suggest varying degrees of misclassification.

2.4.4. MCC (Matthews correlation coefficient)

The MCC is a statistical measure used to evaluate the correlation between actual and predicted classifications in a binary classification problem. Its values range between −1 and 1, where 1 signifies perfect alignment between predictions and actual values, 0 represents a performance no better than random chance, and −1 indicates a complete inverse correlation. Unlike some other metrics, MCC remains effective even when class distributions are highly imbalanced.

2.4.5. AUC (area under the curve)

The AUC value derived from the ROC curve measures a model's ability to separate positive and negative classes. The ROC curve illustrates how the true positive rate varies with the false positive rate across multiple threshold values. AUC scores range between 0 and 1, where 1 represents a flawless classifier, and 0.5 indicates that the model performs at the level of random guessing (Bishop & Nasrabadi, 2006).
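
For reference, the per‐class metrics reported in Tables 7 and 10 (except AUC, which additionally requires class scores, e.g., via perfcurve) can be computed from a confusion matrix; the following helper function is our own sketch, not code from the paper.

```matlab
% Per-class specificity, sensitivity, F1 score, and MCC from true/predicted labels.
function T = classMetrics(yTrue, yPred)
    C = confusionmat(yTrue, yPred);              % rows: true class, columns: predicted
    K = size(C, 1);
    [spec, sens, f1, mcc] = deal(zeros(K, 1));
    for k = 1:K
        TP = C(k, k);
        FN = sum(C(k, :)) - TP;
        FP = sum(C(:, k)) - TP;
        TN = sum(C(:)) - TP - FN - FP;
        spec(k) = TN / (TN + FP);                % specificity (true negative rate)
        sens(k) = TP / (TP + FN);                % sensitivity (recall)
        prec    = TP / (TP + FP);                % precision
        f1(k)   = 2 * prec * sens(k) / (prec + sens(k));
        mcc(k)  = (TP*TN - FP*FN) / ...
            sqrt((TP+FP) * (TP+FN) * (TN+FP) * (TN+FN));   % Matthews corr. coefficient
    end
    T = table(spec, sens, f1, mcc, ...
        'VariableNames', {'Specificity', 'Sensitivity', 'F1', 'MCC'});
end
```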

3. EXPERIMENTAL RESULTS AND DISCUSSION

As seen in Table 5, in the deep learning application performed without preprocessing for the classification of food groups, the highest test accuracy rate of 60% was obtained by the ResNet‐18 pre‐trained model. With the enlarged and more generalizable dataset produced by preprocessing, the highest accuracy rate for food groups, approximately 80% (79.23%), was achieved with MobileNet‐v2 (Table 6). The accuracy rate of the ResNet‐18 model increased from 60% to 77.76% (Tables 5 and 6). For the food groups, the test results obtained without DA and with DA are presented in Tables 5 and 6, respectively.

TABLE 5.

Results obtained without data preprocessing for food groups.

CNN models Training accuracy (%) Testing accuracy (%) Training loss Testing loss
NasNet‐Mobile 100 54.81 0.2419 1.2741
SqueezeNet 82.81 53.33 0.6315 1.3492
ShuffleNet 98.44 53.33 0.2711 1.2947
ResNet‐18 95.31 60.00 0.3010 1.1217
GoogleNet 78.12 53.33 0.7799 1.1377
MobileNet‐v2 100 52.59 0.0675 1.3342

Abbreviation: CNN, convolutional neural network.

TABLE 6.

Results obtained with data preprocessing for food groups.

CNN models Training accuracy (%) Testing accuracy (%) Training loss Testing loss
NasNet‐Mobile 89.06 61.95 0.6217 1.1048
SqueezeNet 76.56 72.43 0.4830 0.8303
ShuffleNet 100 74.08 0.1768 0.8019
ResNet‐18 100 77.76 0.1459 0.6283
GoogleNet 81.25 78.68 0.5309 0.6097
MobileNet‐v2 100 79.23 0.0376 0.7178

Abbreviation: CNN, convolutional neural network.

For the food groups, the results obtained with and without preprocessing can also be compared in the bar graph in Figure 4. In addition, the confusion matrix and performance values of the MobileNet‐v2 pre‐trained model, which achieved the highest accuracy rate for food groups, are shown in Figure 5 and Table 7, respectively.

FIGURE 4. Testing accuracy rates for food groups.

FIGURE 5. Confusion matrix for food groups (MobileNet‐v2).

TABLE 7.

Performance metrics for food group classification using MobileNet‐v2.

Class Specificity Sensitivity F1 score MCC AUC
g1 0.98 0.74 0.84 0.83 0.96
g2 0.94 0.78 0.85 0.82 0.94
g3 0.96 0.85 0.90 0.88 0.96
g4 0.94 0.81 0.87 0.84 0.94
g5 0.93 0.83 0.88 0.85 0.94
g6 0.93 0.75 0.83 0.80 0.93

Abbreviations: AUC, area under the curve; MCC, Matthews correlation coefficient.

In the application for estimating portion intervals, GoogleNet achieved the highest test accuracy obtained without preprocessing, 40.44% (Table 8). ResNet‐18, despite having the lowest accuracy without preprocessing (30.88%), achieved the highest accuracy of 80.48% after preprocessing (Table 9). The GoogleNet test result increased from 40.44% to 66.85% after preprocessing (Tables 8 and 9). Tables 8 and 9 show the test results obtained without DA and with DA for portion intervals, respectively.

TABLE 8.

Results obtained without data preprocessing for portion classes.

CNN models Training accuracy (%) Testing accuracy (%) Training loss Testing loss
NasNet‐Mobile 90.00 38.24 0.4329 1.5879
SqueezeNet 43.75 33.09 1.2872 1.5151
ShuffleNet 96.88 33.82 0.4183 1.8074
ResNet‐18 98.44 30.88 0.3391 1.6279
GoogleNet 50.00 40.44 1.2069 1.4464
MobileNet‐v2 100 36.76 0.1396 1.5983

Abbreviation: CNN, convolutional neural network.

TABLE 9.

Results obtained with data preprocessing for portion classes.

CNN models Training accuracy (%) Testing accuracy (%) Training loss Testing loss
NasNet‐Mobile 100 70.72 0.0651 0.8374
SqueezeNet 81.25 60.22 0.6801 1.0312
ShuffleNet 100 78.27 0.2244 0.6666
ResNet‐18 98.44 80.48 0.1733 0.6252
GoogleNet 64.06 66.85 0.8054 0.8346
MobileNet‐v2 100 75.87 0.0349 0.6867

Abbreviation: CNN, convolutional neural network.

Figure 6 shows that portion classes can be predicted with much more acceptable accuracy after preprocessing and DA. The confusion matrix and performance metrics obtained for portion classes with the ResNet‐18 pre‐trained CNN model are shown in Figure 7 and Table 10, respectively. The different order of the class labels in the confusion matrix in Figure 7 is a result of the automatic output produced by the algorithm. However, this does not affect the interpretation of the results.

FIGURE 6. Testing accuracy rates for portion classes.

FIGURE 7. Confusion matrix for portion classes (ResNet‐18).

TABLE 10.

Performance metrics for portion class classification using ResNet‐18.

Class Specificity Sensitivity F1 score MCC AUC
C4 0.96 0.89 0.92 0.89 0.94
C5 0.96 0.85 0.9 0.88 0.94
C1 0.91 0.87 0.89 0.85 0.93
C2 0.98 0.84 0.9 0.88 0.95
C3 0.98 0.66 0.79 0.78 0.93

Abbreviations: AUC, area under the curve; MCC, Matthews correlation coefficient.

This study contributes to the field of image‐based food analysis by developing a deep learning system for classifying food groups and estimating portion sizes of Turkish cuisine dishes. While prior research has explored various methods for food image recognition and volume estimation, our work specifically addresses the challenges of analyzing Turkish cuisine, which is not well‐represented in existing food image datasets. Our approach involves a two‐stage preprocessing process, including CED for image cropping and DA with rotation and noise injection. This comprehensive preprocessing enhances the model's ability to learn discriminative features and improves its robustness. By employing pre‐trained CNN models and fine‐tuning them with our Turkish cuisine dataset, we achieved high accuracy rates for both food group classification and portion estimation.

In comparison to other studies, our approach demonstrates competitive performance in terms of both accuracy and F1‐score. Notably, our accuracy for food group classification, reaching 79.23% with MobileNet‐v2, surpasses the 78.26% accuracy reported by Ma et al. (2021) for their nutrient estimation method using Inception V3. Similarly, our accuracy for portion estimation with ResNet‐18, reaching 80.47%, exceeds the 78.82% accuracy reported by Lo et al. (2019) for their point cloud completion method.

Table 11 highlights the superior performance of our approach, particularly in terms of accuracy for both food group classification and portion estimation. Additionally, it shows the competitive F1‐score achieved by our models, further demonstrating their effectiveness in distinguishing between different classes.

TABLE 11.

Comparison of performance metrics for food group and portion estimation.

Metric Proposed Ma et al. (2021) Lo et al. (2019)
Accuracy (food group) 79.23% 78.26% —
Accuracy (portion) 80.47% — 78.82%
F1‐score 0.88 0.90

Overall, this study contributes to the field of image‐based food analysis by providing a model specifically tailored for Turkish cuisine, achieving high accuracy rates for both food group classification and portion estimation. Our approach addresses the need for more culturally diverse and inclusive food image datasets and models, paving the way for more generalizable and relevant dietary assessment tools.

4. CONCLUSIONS AND FUTURE DIRECTIONS

In this study, a deep learning system was developed to classify food groups and estimate portion sizes, focusing on Turkish cuisine. The system utilized pre‐trained CNN models with TL, achieving 79.23% accuracy for food group classification and 80.47% for portion estimation. The use of preprocessing techniques, particularly CED and DA, significantly improved model performance by enhancing image features and increasing dataset diversity. The study demonstrated that food classification and approximate portion estimation can be performed with high accuracy using deep learning methods, making the proposed approach generalizable and widely applicable.

Although the study presents notable strengths, including high classification accuracy and a region‐specific focus, it also has certain limitations. The dataset, although expanded through augmentation, remains relatively small compared to large‐scale food image databases, which could impact model generalizability. Additionally, estimating portion sizes from 2D images poses inherent challenges due to variations in food presentation, lighting, and camera angles. Although the model successfully predicts portion ranges, integrating 3D imaging techniques or depth estimation models could further enhance accuracy. Another limitation is the lack of direct nutrient content estimation, which is essential for dietary assessment applications.

Future research could focus on addressing these limitations by expanding the dataset with a more diverse range of food images, incorporating 3D imaging methods to improve portion estimation, and developing multimodal learning approaches that combine image data with text descriptions or nutritional databases. Additionally, integrating caloric and nutrient prediction models would enhance the system's practicality for dietary monitoring. Deploying the model in a real‐time mobile or web‐based application could also provide a more interactive and user‐friendly experience, making it accessible for dietitians, healthcare professionals, and individuals tracking their nutrition.

In conclusion, this study contributes to the field of image‐based dietary assessment by providing a deep learning‐based approach tailored for Turkish cuisine. The results demonstrate that CNN‐based models can effectively classify food items and estimate portion sizes with high accuracy. These findings highlight the potential of deep learning in developing automated dietary assessment tools, paving the way for further advancements in nutrition analysis and health monitoring.

AUTHOR CONTRIBUTIONS

Hidir Selcuk Nogay: Writing—original draft; writing—review and editing; data curation; methodology; conceptualization; validation; software; visualization. Nalan Hakime Nogay: Investigation; writing—original draft; writing—review and editing; conceptualization. Hojjat Adeli: Supervision; writing—original draft; writing—review and editing; conceptualization; methodology; validation; visualization.

CONFLICT OF INTEREST STATEMENT

The authors declare no conflicts of interest.

Nogay, H. S. , Nogay, N. H. , & Adeli, H. (2025). Image‐based food groups and portion prediction by using deep learning. Journal of Food Science, 90, e70116. 10.1111/1750-3841.70116

DATA AVAILABILITY STATEMENT

Data will be made available on request.

REFERENCES

  1. Agarwal, R., Choudhury, T., Ahuja, N. J., & Sarkar, T. (2023). Hybrid deep learning algorithm‐based food recognition and calorie estimation. Journal of Food Processing and Preservation, 2023, 6612302. 10.1155/2023/6612302
  2. Bishop, C. M., & Nasrabadi, N. M. (2006). Pattern recognition and machine learning. Springer.
  3. Christopher, J. B., Barry, B., & Bridget, H. (2017). Nutritional assessment methods. In C. Geissler & H. Powers (Eds.), Human nutrition (13th ed.). Oxford University Press.
  4. Bundeslebensmittelschluessel. (n.d.). German Food Code and Nutrient Data Base (Version 3.01B). Federal Institute for Risk Assessment (BfR). http://www.bfr.bund.de/cd/801
  5. Feng, S., Wang, Y., Gong, J., Li, X., & Li, S. (2023). A fine‐grained recognition technique for identifying Chinese food images. Heliyon, 9, e21565. 10.1016/j.heliyon.2023.e21565
  6. Gambetti, A., & Han, Q. (2022). Camera eats first: Exploring food aesthetics portrayed on social media using deep learning. International Journal of Contemporary Hospitality Management, 34, 3300–3331. 10.1108/ijchm-09-2021-1206
  7. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770–778). IEEE. 10.1109/CVPR.2016.90
  8. Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., & Adam, H. (2017). MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint, arXiv:1704.04861.
  9. ImageNet. (n.d.). Large‐scale visual recognition challenge. http://www.image-net.org
  10. Javanmardi, S., & Ashtiani, S. M. (2025). AI‐driven deep learning framework for shelf life prediction of edible mushrooms. Postharvest Biology and Technology, 222, 113396.
  11. Javanmardi, S., Miraei Ashtiani, S., Verbeek, F., & Martynenko, A. I. (2021). Computer‐vision classification of corn seed varieties using deep convolutional neural network. Journal of Stored Products Research, 92, 101800.
  12. Kadam, P., Pandya, S., Phansalkar, S. P., Sarangdhar, M., Petkar, N., Kotecha, K., & Garg, D. (2022). FVEstimator: A novel food volume estimator Wellness model for calorie measurement and healthy living. Measurement, 198, 111294.
  13. Kaushal, S., Tammineni, D. K., Rana, P., Sharma, M., Sridhar, K., & Chen, H.‐H. (2024). Computer vision and deep learning‐based approaches for detection of food nutrients/nutrition: New insights and advances. Trends in Food Science & Technology, 146, 104408. 10.1016/j.tifs.2024.104408
  14. Konstantakopoulos, F. S., Georga, E. I., & Fotiadis, D. I. (2023). An automated image‐based dietary assessment system for Mediterranean foods. IEEE Open Journal of Engineering in Medicine and Biology, 4, 45–53. 10.1109/OJEMB.2023.3266135
  15. Ling, O. W., Aziz, A. A., Sulaiman, S., & Arasu, K. (2024). Deep learning‐based algorithms for Malaysian food image recognition and calories estimation. In 2024 IEEE 7th International Conference on Electrical, Electronics and System Engineering (ICEESE) (pp. 1–5). IEEE. 10.1109/ICEESE62315.2024.10828551
  16. Lo, F. P., Sun, Y., Qiu, J., & Lo, B. P. (2019). A novel vision‐based approach for dietary assessment using deep learning view synthesis. In 2019 IEEE 16th International Conference on Wearable and Implantable Body Sensor Networks (BSN) (pp. 1–4). IEEE.
  17. Lo, F. P., Sun, Y., Qiu, J., & Lo, B. P. (2020). Image‐based food classification and volume estimation for dietary assessment: A review. IEEE Journal of Biomedical and Health Informatics, 24, 1926–1939.
  18. Ma, P., Lau, C., Yu, N., Li, A., Liu, P., Wang, Q., & Sheng, J. (2021). Image‐based nutrient estimation for Chinese dishes using deep learning. Food Research International, 147, 110437.
  19. Maharana, K., Mondal, S. K., & Nemade, B. (2022). A review: Data pre‐processing and data augmentation techniques. Global Transitions Proceedings, 3, 91–99.
  20. Michelucci, U. (2019). Advanced applied deep learning: Convolutional neural networks and object detection (1st ed.). Apress.
  21. Min, W., Liu, L., Luo, Z., & Jiang, S. (2019). Ingredient‐guided cascaded multi‐attention network for food recognition. In Proceedings of the 27th ACM International Conference on Multimedia. ACM.
  22. Miraei Ashtiani, S., Javanmardi, S., Jahanbanifard, M., Martynenko, A. I., & Verbeek, F. (2021). Detection of mulberry ripeness stages using deep learning models. IEEE Access, 9, 100380–100394.
  23. Nogay, H. S., & Adeli, H. (2023). Diagnostic of autism spectrum disorder based on structural brain MRI images using grid search optimization and convolutional neural networks. Biomedical Signal Processing and Control, 79, 104234.
  24. Pasifik Elektirik Elektronik Ltd. Şti. (2021). Ebispro for Windows: Turkish version (BeBiS 9). Stuttgart, Germany: Pasifik Elektirik Elektronik Ltd. Şti. http://www.bebis.com.tr
  25. Sandler, M., Howard, A. G., Zhu, M., Zhmoginov, A., & Chen, L. (2018). MobileNetV2: Inverted residuals and linear bottlenecks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4510–4520). IEEE.
  26. Sarapisto, T., Koivunen, L., Mäkilä, T., Klami, A., & Ojansivu, P. (2022). Camera‐based meal type and weight estimation in self‐service lunch line restaurants. In 2022 12th International Conference on Pattern Recognition Systems (ICPRS) (pp. 1–7). IEEE.
  27. Saxen, F., Werner, P., Handrich, S., Othman, E., Dinges, L., & Al‐Hamadi, A. (2019). Face attribute detection with MobileNetV2 and NasNet‐Mobile. In 2019 11th International Symposium on Image and Signal Processing and Analysis (ISPA) (pp. 176–180). IEEE.
  28. Sewak, M., Karim, R. M., & Pujari, P. (2018). Practical convolutional neural networks: Implement advanced deep learning models using Python. Packt Publishing.
  29. Sheng, G., Sun, S., Liu, C., & Yang, Y. (2022). Food recognition via an efficient neural network with transformer grouping. International Journal of Intelligent Systems, 37, 11465–11481. 10.1002/int.23050
  30. Siwathammarat, P., Jesadaporn, P., & Chawachat, J. (2023). Multi‐task learning frameworks to classify food and estimate weight from a single image. In Proceedings of the 20th International Joint Conference on Computer Science and Software Engineering (JCSSE) (pp. 1–6). IEEE. 10.1109/JCSSE58229.2023.10202056
  31. Wang, W., Min, W., Li, T., Dong, X., Li, H., & Jiang, S. (2022). A review on vision‐based analysis for automatic dietary assessment. Trends in Food Science & Technology, 122, 223–237.
  32. Zhang, X., Pan, W., Bontozoglou, C., Chirikhina, E., Chen, D., & Xiao, P. (2020). Skin capacitive imaging analysis using deep learning GoogLeNet. SAI. 10.1007/978-3-030-52246-9_29
  33. Zhang, X., Zhou, X., Lin, M., & Sun, J. (2018). ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 6848–6856). IEEE.
  34. Zoph, B., Vasudevan, V., Shlens, J., & Le, Q. (2018). Learning transferable architectures for scalable image recognition. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8697–8710). IEEE.
