Journal of Food Science and Technology. 2023 Feb 18;60(5):1530–1540. doi: 10.1007/s13197-023-05694-3

Rapid detection of sunset yellow adulteration in tea powder with variable selection coupled to machine learning tools using spectral data

Rani Amsaraj 2, Sarma Mutturi 1,2
PMCID: PMC10076470  PMID: 37033304

Abstract

In the present study, sunset yellow (SY), a synthetic colour and a common adulterant in tea powders, was analysed using FT-IR spectral data coupled to machine learning tools for efficient classification and quantification of SY adulteration. A previously established real coded genetic algorithm (RCGA) was used as the variable selection method to identify the key fingerprints of SY in the FT-IR spectra. Here, RCGA was used to select 20, 30, 40, 50 and 60 characteristic wavenumbers for SY. Classification was carried out using support vector machine (SVM), random forest (RF) and extreme gradient boosting (XGBoost) classifiers. The SVM classifier using 50 variables gave an accuracy of 0.90, the highest among the three. Quantification models for SY based on PLS (partial least squares), LS-SVM (least squares-SVM), RF and XGBoost were built on the characteristic wavenumbers. Both RF and LS-SVM models were observed to be superior to PLS when coupled to the 20 fingerprint variables obtained by RCGA. Overall, the RCGA-LS-SVM model resulted in the lowest RMSECV (0.1956), with regression coefficient values RC2 = 0.9989 and RP2 = 0.9979, when 50 fingerprint variables were used. These results demonstrate that FT-IR combined with the RCGA-LS-SVM procedure could be a robust technique for rapid detection of SY in tea powder.

Supplementary Information

The online version contains supplementary material available at 10.1007/s13197-023-05694-3.

Keywords: Tea-adulteration, Sunset yellow, RCGA, LS-SVM, RF, XGBoost

Introduction

Tea is one of the most popular and low-cost beverages in the world and is consumed by a large population. Owing to its increasing demand, tea is considered one of the major commodities in the global market. Because of this wide consumption, tea powder is subject to adulteration with harmful synthetic colours that conceal its quality defects. Inferior-quality or sub-standard tea leaves and tea dust obtained during the manufacturing process are treated with various synthetic food colours to improve their appearance and increase profit (Amsaraj and Mutturi 2021; Li et al. 2016, 2017). Adding artificial colours to tea has already become a menace in India, Malaysia and Sri Lanka (Raja 2019; https://www.foodnavigator-asia.com). Several cases of counterfeit tea have been reported across India: both the FDA (Food and Drug Administration) and the FSSAI (Food Safety and Standards Authority of India), the major food regulators of India, have seized counterfeit tea treated with high concentrations of artificial colours such as sunset yellow and tartrazine in many parts of the country.

Sunset yellow (SY) is a petroleum-derived orange azo dye, which has an azo (–N=N–) functional group together with aromatic ring structures (Fig. S1). When consumed, azo dyes are metabolized in the body by the azoreductases of the intestinal microflora and by enzymes of the cytosolic and microsomal fractions of the liver. This metabolism leads to the formation of aromatic amines, which have carcinogenic effects on the liver and bladder. Although SY is a permitted synthetic food colour, excessive consumption can cause many serious health disorders in humans, such as attention deficit hyperactivity disorder in children, anxiety, asthma, eczema, migraines, and cancer (Rovina et al. 2016). A fast and reliable procedure is therefore required to check for harmful chemicals in food products. Analytical methods like thin layer chromatography (TLC), high performance liquid chromatography, enzyme linked immunosorbent assay, spectrophotometry, capillary electrophoresis, and electrochemical sensor methods have been successfully employed for detection of SY in food and beverage products. However, these traditional procedures for detection of SY are considered time consuming, cost intensive and invasive.

Infrared (IR) spectroscopy is considered one of the most effective techniques for analysing chemical constituents through the specific absorption frequencies of their functional groups. The major advantage of IR spectroscopy is that it is fast, easy, sensitive, non-destructive, non-invasive and environment-friendly. IR results can nevertheless be difficult to interpret, because the amount of data generated is very large and usually highly collinear (Li et al. 2016). To overcome this problem, IR data are coupled with different types of chemometric and machine learning methods: latent variable (LV) methods such as principal component analysis (PCA), principal component regression (PCR) and partial least squares (PLS); learners such as support vector machines (SVM) and random forest (RF); and variable selection methods such as CARS (competitive adaptive reweighted sampling), SPA (successive projections algorithm), GA (genetic algorithm), RCGA (real coded genetic algorithm) and iPLS (interval partial least squares).

Many IR studies have been done on tea analysis. Tartrazine adulteration in tea powder was determined using FT-IR (Fourier transform infrared) spectroscopy (Amsaraj and Mutturi 2021). Sibutramine in green tea, green coffee and in mixed herbal tea was detected using FT-IR based on PLS-DA (Cebi et al. 2017). Sugar and glutinous rice flour adulteration were detected in green tea using NIR (Near infrared) spectroscopy based on PCA and SVR (support vector regression) models (Li et al. 2021). Random forest model based on the NIR and UV-Vis spectra was used to classify green tea varieties (Wang et al. 2015).

The main objective of this research work was to establish a fast, non-invasive and cost-effective technique for detection and quantification of SY, which is one of the most common adulterants in tea powder according to the FSSAI. This work proposes, for the first time, the detection of SY in tea powder by FT-IR spectroscopy coupled with robust chemometric and machine learning methods. To this end, we employed PLS, LS-SVM, RF and XGBoost models built on RCGA-based variable selection and demonstrated the robustness of this approach for rapid detection of SY adulteration in tea.

Materials and methods

Preparation of samples

Tea powder was purchased from a local supermarket. SY was procured from ROHA Dyechem Pvt. Ltd. (Mumbai, India). The tea powder was ground into fine particles and passed through a 150-mesh screen to obtain a uniform particle size. The tea powder was spiked with SY to obtain the desired sample concentration levels (w/w, mg/g) of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14 and 15. All the samples were mixed thoroughly and dried at 50 °C for 2 h. Unadulterated tea powder served as a control. All the samples at the different concentrations were then subjected to FT-IR analysis.

FT-IR analysis

Tea samples were measured in the spectral range of 4000–400 cm−1 with an FT-IR spectrometer (Bruker, TENSOR II, Germany) using a diamond attenuated total reflection (ATR) crystal at a spectral resolution of 4 cm−1; 15 scans were averaged for each spectrum. Instrument operation and data acquisition were carried out using OPUS software. A total of 2515 variables were recorded for each spectrum within the FT-IR spectral range. The crystal was cleaned after each sample run with distilled water followed by ethanol and isopropanol and then dried. Each of the 16 samples (the control and the 15 adulteration levels) was measured 10 times, resulting in 160 FT-IR spectra.

Spectral data pre-processing

Pre-processing methods like standard normal variate (SNV) and Savitzky–Golay (SG) with 1st (SG1) and 2nd (SG2) derivatives were employed in this study. SNV reduces scattering effects in the spectra and corrects the interferences caused by variations in optical path and particle size. SG treatment resolves overlapping signals, enhances signal properties and suppresses unwanted spectral features that arise from the instrument and sample properties.
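
As an illustration of the SNV + SG1 pre-treatment described above, the following is a minimal Python sketch using NumPy and SciPy. The window length and polynomial order are illustrative assumptions (the study does not report them), the function names are hypothetical, and the spectra are synthetic stand-ins for the real 160 × 2515 data matrix.

```python
import numpy as np
from scipy.signal import savgol_filter


def snv(spectra):
    """Standard normal variate: centre and scale each spectrum (row) by its own mean and std."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std


def snv_sg1(spectra, window_length=11, polyorder=2):
    """SNV followed by a Savitzky-Golay first derivative along the wavenumber axis.

    window_length and polyorder are assumed values, not taken from the study.
    """
    return savgol_filter(snv(spectra), window_length=window_length,
                         polyorder=polyorder, deriv=1, axis=1)


# Synthetic stand-in for 160 spectra with 2515 variables each
rng = np.random.default_rng(0)
X = rng.random((160, 2515))
X_pre = snv_sg1(X)
```

Applying SNV row by row before differentiation keeps each spectrum on a common scale, so the derivative highlights shape changes rather than baseline offsets.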

PCA

Here the original data matrix and the sensitive region of the pre-processed matrix were each transformed into a 160 × PCs score matrix, where PCs denotes the number of principal components retained. The first PC retains the most explained variance (most of the information in the data), the second PC explains the information that is not modelled by the first PC, and so on. The singular value decomposition (SVD) technique was employed to carry out PCA.
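
The SVD route to PCA can be sketched as follows. This is a generic NumPy illustration on synthetic data, not the MATLAB code used in the study; the function name pca_svd and the matrix dimensions are assumptions.

```python
import numpy as np


def pca_svd(X, n_components=2):
    """PCA via singular value decomposition of the column-centred data matrix."""
    Xc = X - X.mean(axis=0)                      # centre each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]   # sample coordinates on the PCs
    loadings = Vt[:n_components]                       # PC directions (rows)
    explained = s ** 2 / np.sum(s ** 2)                # variance fraction per PC
    return scores, loadings, explained[:n_components]


# Synthetic stand-in: 160 spectra, 300 variables
rng = np.random.default_rng(1)
X = rng.normal(size=(160, 300))
scores, loadings, ratio = pca_svd(X)
```

The returned ratio values correspond to the "explained variability" percentages quoted for score plots, and the scores are what such plots display.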

Machine-learning based classification

A classifier is an algorithm that automatically sorts or categorizes data into one or more classes. Here the SY samples were divided into 16 classes based on their concentrations; for example, class 0 corresponds to 0 mg SY/g, whereas class 15 indicates 15 mg SY/g. Three machine learning classifiers, viz., SVM, RF and XGBoost, were used for this classification. For all three classifiers, the hyperparameters were optimized using a grid-search algorithm with code written in Python using the scikit-learn toolbox. For the SVM classifier, the radial basis function (RBF) was used as the kernel, and the C and γ values were 106 and 0.1, respectively. For the RF classifier, the grid-search-based optimal hyperparameters were: max_depth = 150, min_samples_leaf = 2, min_samples_split = 4, n_estimators = 257 and criterion = entropy. Finally, for the XGBoost classifier, the hyperparameters were: objective = multi:softmax, colsample_bytree = 0.4, learning_rate = 0.25, max_depth = 4, min_child_weight = 3 and n_estimators = 10. The confusion matrix was computed to obtain the accuracy of each classifier.
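
A grid search of this kind can be sketched with scikit-learn as below. The data are synthetic, and the parameter grid, the 62.5:37.5 stratified split and the 3-fold inner cross-validation are assumptions chosen for illustration; they are not the settings reported in the study.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in: 16 concentration classes x 10 replicates, 50 selected variables
rng = np.random.default_rng(0)
y = np.repeat(np.arange(16), 10)
X = rng.normal(size=(160, 50)) + 0.5 * y[:, None]

# Assumed split: 100 training and 60 test spectra, stratified by class
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.375, stratify=y, random_state=0)

# Illustrative grid over the RBF-SVM hyperparameters C and gamma
param_grid = {"C": [1, 10, 100, 1000], "gamma": [0.001, 0.01, 0.1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)
search.fit(X_tr, y_tr)

# Evaluate the refitted best model on the held-out spectra
y_pred = search.predict(X_te)
cm = confusion_matrix(y_te, y_pred, labels=np.arange(16))
acc = accuracy_score(y_te, y_pred)
```

Here search.best_params_ holds the selected C and gamma, and the diagonal of cm counts the correctly classified spectra per class.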

PLS regression

PLS regression was carried out using the NIPALS (nonlinear iterative partial least squares) algorithm (Geladi and Kowalski 1986). The spectral data were partitioned into calibration (62.5%) and validation (37.5%) sets: the calibration data comprised 100 spectra, whereas the validation data had 60 spectra. In this study, we compared different partition methods, namely random sampling (RS), direct sampling (DS), Kennard and Stone (KS), sample set partitioning based on joint x–y distances (SPXY) and kernel distance-based SPXY (KSPXY), by fixing the number of LVs to 4 and using a leave-one-out cross-validation procedure. During model development, the number of LVs ranged between 2 and 8, using 10-fold cross-validation on the calibration sets. The performance of the PLS models was evaluated based on minimum RMSECV (root mean square error of cross validation) and RMSEP (root mean square error of prediction) values.

RCGA for feature selection

RCGA-PLS is a modified version of the GA-PLS variable selection method for selecting the optimal wavenumbers from the spectrum. The methodology and algorithm details of the RCGA-PLS procedure are explained in our earlier study (Amsaraj and Mutturi 2021). In the RCGA-PLS implementation, the procedures for initiating the population, selection, crossover and mutation differ from those of the original GA-PLS algorithm developed by Leardi et al. (1992). Moreover, RCGA-PLS allows the user to fix the number of variables to be selected. RCGA-PLS was performed on the pre-processed data sets. Here we selected 20, 30, 40, 50 and 60 fingerprint wavenumbers from the tea data set. These wavenumbers were later used to carry out regression analysis.
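
To make the idea of a genetic search with a user-fixed number of variables concrete, here is a deliberately simplified sketch. It is not the authors' RCGA-PLS: the random-key decoding, truncation selection, blend crossover, Gaussian mutation and the hold-out least-squares fitness (used in place of a PLS cross-validation error) are all assumptions chosen to keep the example short and dependency-free.

```python
import numpy as np


def fitness(X, y, idx):
    """Hold-out RMSE of an ordinary least-squares fit on the chosen columns (lower is better)."""
    half = len(y) // 2
    A = np.column_stack([X[:half, idx], np.ones(half)])
    coef, *_ = np.linalg.lstsq(A, y[:half], rcond=None)
    B = np.column_stack([X[half:, idx], np.ones(len(y) - half)])
    return float(np.sqrt(np.mean((B @ coef - y[half:]) ** 2)))


def decode(chrom, k):
    """Map a real-valued chromosome (one key per variable) to the k variables with the largest keys."""
    return np.sort(np.argsort(chrom)[-k:])


def select_variables(X, y, k, pop_size=30, n_gen=40, seed=0):
    """Toy real-coded GA that always returns exactly k variable indices."""
    rng = np.random.default_rng(seed)
    pop = rng.random((pop_size, X.shape[1]))
    for _ in range(n_gen):
        scores = np.array([fitness(X, y, decode(c, k)) for c in pop])
        order = np.argsort(scores)
        parents = pop[order[: pop_size // 2]]              # truncation selection
        a = parents[rng.integers(len(parents), size=pop_size - 1)]
        b = parents[rng.integers(len(parents), size=pop_size - 1)]
        w = rng.random((pop_size - 1, 1))
        children = w * a + (1 - w) * b                     # blend (arithmetic) crossover
        mask = rng.random(children.shape) < 0.05           # mutate about 5% of genes
        children = children + mask * rng.normal(scale=0.3, size=children.shape)
        pop = np.vstack([pop[order[0]], children])         # elitism: keep the best
    scores = np.array([fitness(X, y, decode(c, k)) for c in pop])
    return decode(pop[np.argmin(scores)], k), float(scores.min())


# Synthetic example: 40 candidate variables, 3 of them informative
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 40))
y = 2 * X[:, 3] - X[:, 17] + 0.5 * X[:, 29] + rng.normal(scale=0.1, size=100)
selected, score = select_variables(X, y, k=5)
```

Because every chromosome decodes to exactly k indices, the number of selected variables is fixed a priori, which is the property the text attributes to RCGA-PLS.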

LS-SVM

LS-SVM is an extension of the original SVM that converts the inequality constraints to equality constraints, thereby reducing computational complexity (Amsaraj et al. 2021). In this study, LS-SVM was used to build models for predicting SY concentration in tea powder. The dataset was divided into training and testing sets using a randomization procedure. A non-linear kernel, the radial basis function (RBF), was applied, and the MATLAB-based LS-SVM toolbox was used for regression (Suykens et al. 2002). Model performance was evaluated based on the determination coefficient (R2), root mean square error of calibration (RMSEC) and root mean square error of prediction (RMSEP) values.
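
Because the equality constraints reduce LS-SVM training to a single linear system, the regression can be sketched in a few lines of NumPy. This is an illustrative re-implementation of the standard LS-SVM regression formulation, not the MATLAB toolbox used in the study; the kernel parametrisation exp(-d²/(2σ²)) and the values of gamma and sigma2 are assumptions, and the data are a synthetic one-dimensional curve.

```python
import numpy as np


def rbf_kernel(A, B, sigma2):
    """RBF kernel matrix between the rows of A and B."""
    d2 = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-np.maximum(d2, 0.0) / (2 * sigma2))


def lssvm_fit(X, y, gamma, sigma2):
    """Solve [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y] for bias b and weights alpha."""
    n = len(y)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf_kernel(X, X, sigma2) + np.eye(n) / gamma
    sol = np.linalg.solve(A, np.concatenate([[0.0], y]))
    return sol[0], sol[1:]


def lssvm_predict(X_train, b, alpha, X_new, sigma2):
    """Predict with the kernel expansion sum_i alpha_i K(x, x_i) + b."""
    return rbf_kernel(X_new, X_train, sigma2) @ alpha + b


# Synthetic non-linear example
rng = np.random.default_rng(0)
x_train = rng.uniform(0, 6, size=(100, 1))
y_train = np.sin(x_train[:, 0]) + rng.normal(scale=0.05, size=100)
x_test = rng.uniform(0, 6, size=(60, 1))
y_test = np.sin(x_test[:, 0])

b, alpha = lssvm_fit(x_train, y_train, gamma=100.0, sigma2=1.0)
y_hat = lssvm_predict(x_train, b, alpha, x_test, sigma2=1.0)
rmsep = float(np.sqrt(np.mean((y_hat - y_test) ** 2)))
```

In practice gamma (regularisation) and sigma2 (kernel width) would be tuned, for example by cross-validation, rather than fixed as here.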

RF and XGBoost

RF is an ensemble learning algorithm comprising an ensemble of tree predictors and is used for classification and regression problems (Breiman 2001). The algorithm needs to be optimized for two key parameters, viz., the number of regression trees (nTree) and the number of input variables tried at each split (mTry), before carrying out the regression. To develop a regression model with SY concentration as the response against the spectral data, the nTree and mTry values were first optimized using the grid-search method. This was done for all the variable sets (full, 60, 50, 40, 30 and 20) obtained using RCGA. The optimal values were later used for carrying out the regression. Tree boosting is an extension of ensemble tree-based machine learning methods. In the present study, the XGBoost method, which is known as a scalable machine learning system for tree boosting (Chen and Guestrin 2016), was adopted. Here too, the hyperparameters for XGBoost (colsample_bytree, eta, max_depth, min_child_weight and subsample) were tuned using a grid search procedure. Regression (using the objective reg:squarederror) was carried out for the full spectrum as well as for the variable sets derived from RCGA.
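
The nTree/mTry optimisation can be sketched with scikit-learn, where these parameters correspond to n_estimators and max_features of RandomForestRegressor. The grid values, the 5-fold cross-validation and the synthetic data below are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in: 100 calibration spectra, 20 selected variables
rng = np.random.default_rng(0)
y = rng.uniform(0, 15, size=100)
X = np.outer(y, rng.normal(size=20)) + rng.normal(scale=0.5, size=(100, 20))

# nTree -> n_estimators, mTry -> max_features (illustrative grid)
param_grid = {"n_estimators": [50, 100], "max_features": [2, 5, 10]}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(X, y)

best = search.best_params_
# RMSEC: error of the refitted best model on the calibration data itself
rmsec = float(np.sqrt(np.mean((search.predict(X) - y) ** 2)))
```

The same grid-search pattern would apply to a boosted-tree regressor, with the grid defined over its own hyperparameters.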

Software

All chemometric analyses were carried out using MATLAB 2013b (The MathWorks Inc., Natick, MA, USA). RF and XGBoost algorithms were implemented using Python 3.9. Plots were prepared using R packages such as ggplot2, cowplot and ggpubr on the R computing platform (R Core Team, 2016). All the computational codes will be made available upon request.

Results and discussion

Overview of samples and spectral pre-treatment

The FT-IR spectra acquired from the pure tea sample, the tea sample adulterated with SY (15 mg/g) and pure SY are shown in Fig. S2(a–c). The broad absorption peak at 3000–3500 cm−1 is associated with O–H stretching. The band located at 2700–3000 cm−1 is related to CH2 stretching vibrations; both are common to the pure and adulterated tea samples (Lohumi et al. 2017). There were no significant differences between the two spectra (Fig. S2a, b), so it was difficult to distinguish adulterated tea from pure tea. In the FT-IR spectrum of pure SY, by contrast, these broad peaks were absent and most of the significant peaks were clustered in the region 1700–500 cm−1, which could be the fingerprint of SY. To distinguish the absorbance variations of the different sample groups, spectral pre-treatment methods like SNV and SG were applied. Spectra pre-treated with SNV + SG1 showed good variation with SY concentration. Raw and SNV + SG1 pre-treated spectra are shown in Fig. 1a, b, whereas SNV + SG1 spectral plots on limited wavenumber ranges are shown in Fig. 1c, d, where changes in the spectral pattern and absorption intensities can be seen at around 600–1200 cm−1. These changes were not observed in the raw spectral plots. The variations in these spectral regions could be attributed to the variation in SY concentration, since they fall within the fingerprint region of pure SY (Fig. S2c). The important peaks observed in the fingerprint region include 1709 cm−1, attributed to C=O stretching; 1621 cm−1, associated with COO stretching; 1503 and 1426 cm−1, associated with C–H and N–H bending; and 1307 cm−1, assigned to S=O stretching vibrations. The absorption peaks in the range 1125–978 cm−1 are attributed to the SO₄²⁻ group, and the peaks in the range 614–500 cm−1 are attributed to the NH₃⁺ group. However, some of these peaks were common to both pure and adulterated tea.

Overall, the pre-treatment methods made it possible to observe the changes in the spectral data with varying SY concentration. Subsequently, PLS regression was carried out on the full spectrum using the chosen pre-processing procedures, and the results are provided in Table S1. SNV + SG1 resulted in the lowest RMSECV and RMSEP among the tested methods. In the study by Amsaraj and Mutturi (2021), SNV + SG1-based pre-processing of FT-IR spectral data of tea samples adulterated with tartrazine likewise provided superior results compared to other procedures.

Fig. 1.

Pre-processed spectral data of all the 160 samples with varying SY concentration: a raw spectra; b SNV spectra; c SNV + SG1 spectra (500–1400 cm−1); and d SNV + SG1 spectra (970–1060 cm−1). The legend shows the variations in SY concentration levels

PCA results

PCA is a good choice for reducing data dimensionality, as FT-IR spectra contain a large amount of redundant data. PCA was successfully applied to the spectral data to discriminate unadulterated and adulterated tea samples. Here PCA was performed on the pre-processed (SNV + SG1) data sets of the whole region and the fingerprint region (500–1000 cm−1), each consisting of 16 classes with SY concentrations ranging from 0 to 15 mg/g (Fig. S3). In the score plot of the whole region, the first two PCs explained 94.3% of the variability (Fig. S3a), whereas in the score plot of the sensitive region the first two PCs explained 99.8% of the variability, with the first PC alone accounting for 99.11% of the variance and thus carrying nearly all the information (Fig. S3b). Adulterated and unadulterated tea samples were clearly distinguished in both PCA score plots. Fig. S3b shows that the samples are clearly distributed according to the concentration level of SY in the adulterated tea samples, with higher clustering accuracy and total variance. Compared to the whole region, the clusters of the sensitive region were clearly grouped from 0 to 4 mg/g without any overlap. On the other hand, the higher concentration levels ranging from 5 to 15 mg/g were grouped closely, and some overlaps were found for these concentrations. In addition, the SY standard and samples extracted from tea powder adulterated with SY (0–15 mg/g) were subjected to TLC according to de Andrade et al. (2014) in order to visualize the varying concentrations of SY in the tea samples (Fig. S4). The figure showed more intense bands for the tea samples with higher SY concentrations.

Several studies have employed PCA models for the classification of tea samples. Some of these include classification of adulterated tea powder based on tartrazine concentration (Amsaraj and Mutturi 2021), classification of pure tea, sugar-adulterated tea, and glutinous-rice-flour-adulterated tea (Li et al. 2021), discrimination studies on green tea, green coffee and herbal tea based on sibutramine content (Cebi et al. 2017), and classification of tea samples with respect to the production process (Dankowska and Kowalewski 2019).

Classification of SY using machine learning tools

The results from the three classifiers are provided in Table 1. The overall accuracy of the SVM-based classifier was superior to that of the other two classifiers, viz., RF and XGBoost (cf. Table 1). Moreover, the accuracy of SVM and XGBoost improved when only the 50 variables selected using RCGA were used. The accuracy of the SVM-based classifier using 50 variables was 0.90, the highest among all the combinations of the three classifiers. The precision values for the individual SY adulteration groups are also provided in Table 1. The confusion matrices for the full spectrum and for 50 variables are provided in Fig. 2 for all three classifiers. Here too it can be observed that the SVM classifier using 50 variables showed negligible scatter around the diagonal of actual versus predicted values when compared to the other combinations (cf. Fig. 2b). Maximum scatter around the diagonal was observed for the XGBoost classifier (cf. Fig. 2e and f). The confusion matrices also indicate that RCGA effectively selected the most relevant variables for accurate classification (cf. Fig. 2a and b). Therefore, it can be concluded that RCGA coupled to SVM-based classification is the superior classifier for prediction of sunset yellow using the FT-IR spectral data.
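
For reference, per-class precision and overall accuracy follow directly from a confusion matrix. The sketch below assumes rows are actual classes and columns are predicted classes (the scikit-learn convention), and it assigns a precision of 0 to a class that is never predicted, which is a convention chosen here rather than something stated in the study. The 3 × 3 matrix is a made-up toy example.

```python
import numpy as np


def precision_and_accuracy(cm):
    """Per-class precision and overall accuracy from a confusion matrix.

    Rows are actual classes, columns are predicted classes.
    """
    cm = np.asarray(cm, dtype=float)
    true_positives = np.diag(cm)
    predicted_totals = cm.sum(axis=0)        # how often each class was predicted
    precision = np.divide(true_positives, predicted_totals,
                          out=np.zeros_like(true_positives),
                          where=predicted_totals > 0)
    accuracy = true_positives.sum() / cm.sum()
    return precision, accuracy


# Toy example with three classes
cm = np.array([[4, 0, 0],
               [1, 2, 1],
               [0, 0, 4]])
precision, accuracy = precision_and_accuracy(cm)
# precision is [0.8, 1.0, 0.8]; accuracy is 10/12
```

A class can therefore have a precision of 1 even when some of its samples are missed, because precision only considers the spectra that were assigned to that class.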

Table 1.

Classification results using RCGA-SVM, RCGA-RF and RCGA-XGBoost

Classification precision (Full = full spectrum; 50 = 50 variables selected using RCGA)

Class      SVM Full  SVM 50  RF Full  RF 50  XGBoost Full  XGBoost 50
Class 1    1         1       1        1      1             1
Class 2    0.57      1       0.57     0.5    0.5           0.5
Class 3    1         0.83    1        1      0.75          1
Class 4    0.8       1       0.5      0.5    0.6           0.5
Class 5    0.75      1       0.29     0.33   0.2           0.33
Class 6    1         1       0        0      0.2           0
Class 7    1         1       1        1      0             1
Class 8    0.67      1       0.25     0.33   0.33          0.33
Class 9    1         1       0        0.6    0             0.6
Class 10   1         1       1        1      0.14          1
Class 11   1         1       0.75     0.5    0.33          0.5
Class 12   0.75      1       0.67     0.6    0.5           0.6
Class 13   0.6       0.5     0.5      0.5    0.67          0.5
Class 14   0         0       0        0      0             0
Class 15   0.75      1       1        1      0.5           1
Class 16   0.75      0.75    0.6      0.6    0.33          0.6
Accuracy   0.82      0.90    0.55     0.52   0.38          0.52

Fig. 2.

Confusion matrices for the three classifiers: a, b SVM; c, d RF; and e, f XGBoost. The top row corresponds to the full spectrum, whereas the bottom row corresponds to the 50 variables selected using RCGA

Many studies have successfully employed SVM, RF and XGBoost classifiers in tea analysis. Wu et al. (2018) successfully discriminated Longjing green tea quality based on volatile compounds; an accuracy of 100% for qualitative identification of tea quality grades was obtained by SVM and RF. Other studies include discrimination of the geographical origin of Longjing tea using GA-SVM with 96.25% accuracy (Li et al. 2017) and classification of five categories of green tea using RF with 96% accuracy (Wang et al. 2015). Classification of oolong tea varieties was established using XGBoost and light gradient boosting machine (LightGBM) individually, where the BOSS-LightGBM (bootstrapping soft shrinkage-LightGBM) model achieved the best performance in discriminating tea varieties, with an accuracy of 100% in the training set and 97.33% in the prediction set (Ge et al. 2019). In our case, SVM proved to be the best classifier for detection of SY, with 90% accuracy.

Quantitative analysis of SY

PLS based regression results

Since SY in adulterated tea samples was detected using a qualitative model (PCA), a quantitative model based on PLS regression was then employed to determine the SY content of the adulterated samples. Model validation is the most important part of building supervised models. Building a model with good generalization performance requires a sensible data splitting strategy, which is even more crucial for model validation. Partitioning data into calibration and validation sets allows highly accurate prediction models to be developed (Xu and Goodacre 2018). Table S2 shows the results of partitioning the spectral data set into training and testing data sets. Data splitting methods like RS, KS, SPXY and KSPXY were employed. Here the number of latent variables was fixed to 4 and RMSECV was computed using the LOO (leave-one-out) procedure. The best splitting method was chosen based on the lowest RMSEP value. SPXY outperformed RS (3.006), KS (0.2777) and KSPXY (0.2939) with the lowest RMSEP value of 0.2423. Therefore, SPXY was considered the better choice for partitioning the data in the present work, and the data for all PLS regression analyses were partitioned using the SPXY method.
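
The SPXY idea, as commonly described, is a Kennard–Stone style max-min selection applied to a joint distance that adds the normalised distance between spectra to the normalised distance between response values. The NumPy sketch below illustrates that procedure on synthetic data; it is not the code used in the study, and the function names are hypothetical.

```python
import numpy as np


def pairwise_dist(A):
    """Euclidean distance matrix between the rows of A (1-D input is treated as one column)."""
    A = np.asarray(A, dtype=float)
    if A.ndim == 1:
        A = A[:, None]
    sq = (A ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * A @ A.T
    return np.sqrt(np.maximum(d2, 0.0))


def spxy_split(X, y, n_cal):
    """Select n_cal calibration samples by max-min selection on joint x-y distances."""
    dx, dy = pairwise_dist(X), pairwise_dist(y)
    d = dx / dx.max() + dy / dy.max()            # normalised joint distance
    i, j = np.unravel_index(np.argmax(d), d.shape)
    selected = [int(i), int(j)]                  # start with the two most distant samples
    remaining = [k for k in range(len(d)) if k not in selected]
    while len(selected) < n_cal:
        # distance from each remaining sample to its nearest already-selected sample
        nearest = d[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(nearest))]
        selected.append(pick)
        remaining.remove(pick)
    return np.array(selected), np.array(remaining)


# Synthetic stand-in: 16 concentration levels x 10 replicates, 50 variables
rng = np.random.default_rng(0)
y = np.repeat(np.arange(16.0), 10)
X = rng.normal(size=(160, 50)) + 0.5 * y[:, None]
cal_idx, val_idx = spxy_split(X, y, n_cal=100)
```

Dropping the dy term from the joint distance gives the plain Kennard–Stone selection, which considers the spectra only.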

In our earlier study (Amsaraj and Mutturi 2021), a new variant of GA-based feature selection known as RCGA was established. The key advantage of RCGA over other feature selection algorithms is that the number of variables can be defined as an a priori condition. The frequency plot of variable selection for the RCGA-PLS model is shown in Fig. S5. It can be observed that the wavenumbers 1679, 1680 and 1682 cm−1 were selected most often during frequency analysis. All these wavenumbers fall within the fingerprint region of SY. Here, we used RCGA to select 20, 30, 40, 50 and 60 variables from the pre-processed data set (SNV + SG1), and these were used for classification and regression. Table 2 shows the results of RCGA-PLS, which was performed on 20, 30, 40, 50 and 60 fixed variables with 2 to 8 LVs. The RCGA-PLS model with 60 variables and 3 LVs had the lowest RMSEP (0.2171) together with a low RMSECV (0.2069), and the difference between its RMSECV and RMSEP was among the smallest of all the RCGA-PLS models. The model with 30 variables and 5 LVs also gave reasonably good RMSECV (0.2135) and RMSEP (0.2607) values along with a high RP2 (0.9969). This model appears stable, with a small difference between RMSECV and RMSEP values, and could be considered superior given the smaller number of variables it requires. Overall, RCGA-PLS could significantly reduce the dimension of the spectral variables from 2515 to 30 while retaining effective predictive capability for detecting SY in tea powder. Regression plots showing calibration and validation of the data sets are shown in Fig. S6a.

Table 2.

PLS regression results using RCGA-PLS variable selection method

Variables  LVs  RMSEC   RC2     RMSECV  RP2     RMSEP
20         2    2.8092  0.6194  2.9256  0.706   2.5248
20         3    2.6389  0.6641  2.7517  0.7492  2.332
20         4    2.1943  0.7678  2.4655  0.7293  2.4229
20         5    3.1093  0.5337  3.779   0.4445  3.4706
20         6    2.3253  0.7392  2.7954  0.6647  2.6965
20         7    2.0814  0.791   2.6516  0.7548  2.3057
20         8    2.5253  0.6924  3.1198  0.4727  3.3814
30         2    0.4608  0.9898  0.4763  0.9915  0.4301
30         3    0.3431  0.9943  0.3687  0.993   0.3907
30         4    0.1983  0.9981  0.2172  0.9953  0.3205
30         5    0.1871  0.9983  0.2135  0.9969  0.2607
30         6    0.1867  0.9983  0.2195  0.9962  0.2865
30         7    0.1916  0.9982  0.2425  0.9943  0.3526
30         8    0.1737  0.9985  0.2182  0.9965  0.2773
40         2    0.4685  0.9894  0.4811  0.9892  0.4842
40         3    0.2874  0.996   0.306   0.9958  0.3035
40         4    0.4495  0.9903  0.5521  0.9866  0.538
40         5    0.3887  0.9927  0.4505  0.9817  0.6306
40         6    0.2804  0.9962  0.3671  0.9923  0.4084
40         7    0.5432  0.9858  0.713   0.9807  0.6473
40         8    0.522   0.9869  0.6767  0.9665  0.8529
50         2    0.2408  0.9972  0.2502  0.9972  0.2455
50         3    0.21    0.9979  0.2221  0.9974  0.2365
50         4    0.1777  0.9985  0.2036  0.9973  0.2411
50         5    0.1776  0.9985  0.2047  0.9961  0.2912
50         6    0.1611  0.9987  0.2026  0.9961  0.2921
50         7    0.1507  0.9989  0.1919  0.9965  0.2756
50         8    0.1225  0.9993  0.1653  0.996   0.2961
60         2    0.2191  0.9977  0.2286  0.9977  0.2233
60         3    0.193   0.9982  0.2069  0.9978  0.2171
60         4    0.1759  0.9985  0.2023  0.9977  0.2241
60         5    0.1642  0.9987  0.1993  0.9968  0.263
60         6    0.1413  0.999   0.172   0.9968  0.2627
60         7    0.138   0.9991  0.1773  0.9965  0.2737
60         8    0.1124  0.9994  0.1555  0.9965  0.2764

a The data sizes for training and testing were 100 and 60, respectively; the data were partitioned using SPXY; cross-validation was 10-fold; and the data were pre-processed with SNV + SG1

Sun et al. (2020) determined instant green tea components (caffeine and catechin) by using a portable near infrared (NIR) spectrometer coupled to GA-PLS. NIRS was coupled to GA-PLS to determine the moisture content of tea leaves (Zhang et al. 2020). Amsaraj and Mutturi (2021) concluded that RCGA-PLS is a robust variable selection procedure to quantify tea samples adulterated with tartrazine. In the present study too, we observed that RCGA-PLS performed reasonably well, with low RMSECV and RMSEP values for predicting SY concentrations in adulterated tea samples.

LS-SVM based regression results

To improve the accuracy of the model, LS-SVM was used to build non-linear models, which were compared with the linear models obtained by PLS. The results are shown in Table 3. It can be observed that the RMSEP values decreased as the number of variables increased from 20 to 60. The model with 60 variables had RMSECV, RMSEP, RC2 and RP2 of 0.1964, 0.2103, 0.9991 and 0.998, respectively, which were either superior or close to the full-spectrum results. This indicates that the RCGA procedure was highly effective in selecting the variables, and that an LS-SVM model with only 60 fingerprint wavenumbers could quantify SY accurately. Moreover, it has been suggested that a higher number of variables considerably increases model complexity owing to the presence of collinear data (Li et al. 2016). Generally, a good model should have a higher R2 value, lower RMSEC and RMSEP values, and a small difference between calibration and prediction values. The difference between calibration and prediction parameters was low for all the variable sets, suggesting better model stability and adaptability when LS-SVM was used. The scatter plots of actual vs. predicted values for the full set and 20 variables are shown in Fig. S6b. The PLS model with 30 variables gave results comparable to those of the LS-SVM model with only 20 variables, which suggests that LS-SVM was capable of superior regression statistics even with fewer selected variables. Comparing the results in Tables 2 and 3, it may be observed that the LS-SVM model performs better than PLS across different variable sets. This better performance can be attributed to the ability of LS-SVM to deal with nonlinearity in the spectral data (Chanda et al. 2019).

Table 3.

Regression results using LS-SVM, RCGA-RF and RCGA-XGBoost

ML regressor   Variables  RMSEC       RC2     RP2     RMSEP
LS-SVM         2515       0.00004235  1       0.9981  0.2009
LS-SVM         20         0.2767      0.9963  0.9919  0.4181
LS-SVM         30         0.1743      0.9985  0.9955  0.3125
LS-SVM         40         0.189       0.9983  0.9962  0.2886
LS-SVM         50         0.1477      0.9989  0.9979  0.2112
LS-SVM         60         0.1383      0.9991  0.998   0.2103
Random Forest  2515       0.2162      0.9977  0.9944  0.3481
Random Forest  20         0.4577      0.9899  0.9717  0.7831
Random Forest  30         0.2522      0.9969  0.9896  0.4743
Random Forest  40         0.2376      0.9973  0.9900  0.4652
Random Forest  50         0.2484      0.9970  0.9870  0.5303
Random Forest  60         0.2787      0.9963  0.9864  0.5424
XGBoost        2515       0.1396      0.9990  0.9894  0.5049
XGBoost        20         0.4730      0.9886  0.9761  0.7579
XGBoost        30         0.2786      0.9961  0.9742  0.7863
XGBoost        40         0.2277      0.9974  0.9880  0.5377
XGBoost        50         0.1396      0.9990  0.9894  0.5049
XGBoost        60         0.2395      0.9971  0.9754  0.7680

a All the data samples were split randomly in the ratio of 62.5:37.5 for training and testing

Li et al. (2013) demonstrated that the LS-SVM algorithm was an effective tool to determine the dry matter content (DMC) of tea by near- and mid-infrared spectroscopy. The optimal model obtained a high RP2 of 0.9556 and a low RMSEP of 0.0501. Chanda et al. (2019) carried out a comparative study of PLS and LS-SVM models for quantification of caffeine in tea samples. The results suggested that the LS-SVM model performed better, with good regression coefficient and root mean square error values. In the present study, RCGA-LS-SVM was observed to be effective in determining SY content in tea powder using only 20 variables.

RF and XGBoost- based regression results

Random forest, a non-linear machine learning tool based on a tree ensemble, was also tested for quantification of SY. Initially, the optimal values for parameters such as nTree and mTry were established using the grid search method, and the results are provided in Table S3. The optimal values of nTree and mTry varied with the variable set. The span of RMSEC values for varying nTree and mTry is shown for the full spectrum and for 40 variables in Fig. 3a and c. Using these optimal values, regression for quantification of SY was then carried out on the full spectrum as well as on the variable sets selected by RCGA. These results are provided in Table 3. Both RF alone and RCGA-RF were able to predict the SY values reasonably well. Among all the RF models, the one with 40 variables gave the best prediction efficiency, with RMSEC, RC2, RP2 and RMSEP of 0.2376, 0.9973, 0.9900 and 0.4652, respectively. The regression results for the full spectrum and for 40 variables are provided in Fig. 3b and d, respectively. The optimal values for the hyperparameters of the XGBoost algorithm, obtained by minimizing the mean absolute error (MAE), are provided in Table S4. Using these parameters, the non-linear regression was carried out, and the results are provided in Table 3. Among the XGBoost models, the one with 50 variables gave the highest prediction efficiency, with RMSEC, RC2, RP2 and RMSEP of 0.1396, 0.9990, 0.9894 and 0.5049, respectively. Both RCGA-RF and RCGA-XGBoost using 20 variables performed better than their RCGA-PLS counterpart. However, with 50 and 60 variables RCGA-PLS was clearly superior to both RCGA-RF and RCGA-XGBoost. It is interesting to observe that, for these larger variable sets, PLS regression outperformed non-linear regression methods such as RF and XGBoost. Overall, among all the regression algorithms tested, LS-SVM was clearly superior.

Fig. 3

Optimization of nTree and mTry by minimization of RMSEC values for (a) the full spectrum and (c) 40 variables. Model regression for (b) the full spectrum and (d) 40 variables. Circles indicate training data, whereas stars indicate test data points.

RF and XGBoost-based regression has been used in a few studies involving tea analysis. Luo et al. (2021) quantified tea polyphenols in Tibetan teas using hyperspectral technology, comparing RF, XGBoost, CatBoost (Categorical Boosting) and LightGBM regressor models with a newly built stacking model. Liang et al. (2018) predicted the moisture content of congou black tea withering leaves using RF. PLSR, SVM and RF regression models were built on E-nose, E-tongue and E-eye signals and their fusion to predict the contents of amino acids, catechins, polyphenols and caffeine in tea (Xu et al. 2019). In that study, RF regression models based on fusion signals achieved the best prediction results, with low RMSE and high R2 values. However, in our study, LS-SVM outperformed RF for all the selected variable sets (cf. Table 3). Grid-SVR, RF and XGBoost models were constructed by Yang et al. (2020) to estimate the polyphenol content of cross-category teas. In their study, the XGBoost model outperformed the RF and Grid-SVR models, with good regression coefficient values. In the present study, however, XGBoost did not improve on the results of RF, except for the 20-variable case, where its RMSEP was marginally lower. In the present study, PLSR, LS-SVM, RF and XGBoost regression were compared for quantifying SY in tea powder. Among these, LS-SVM achieved the best results.

Conclusion

This research explored, for the first time, the feasibility of FT-IR spectroscopy for determination of SY in tea powder using machine learning tools. PCA successfully discriminated adulterated and unadulterated tea samples based on SY concentration, accounting for 99.8% of the variability. The RCGA feature selection method was employed to extract the characteristic wavenumbers of SY from the spectral data. The RCGA-coupled SVM classifier had 90% accuracy in predicting the SY adulteration class. PLS, LS-SVM, RF and XGBoost regression models were built for quantification of SY based on the RCGA-selected wavenumbers. Further, the LS-SVM model performed better than the RF, XGBoost and PLS models with only 20 wavenumbers, achieving RC2 = 0.9963, RMSECV = 0.4625, RP2 = 0.9919 and RMSEP = 0.4181. These results indicate that FT-IR has the potential to discriminate and quantify SY adulteration at varying concentrations in tea powder. Furthermore, the wavenumbers selected by RCGA could be used to develop a portable device for detection and quantification of SY in the tea powder supply chain.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary Material 1 (1.2MB, docx)

Acknowledgements

RA wishes to express sincere thanks to the Indian Council of Medical Research (ICMR) for granting the SRF fellowship to carry out this research work. The authors would like to thank Ms. Asha M of Central Instruments Facility & Services (CFTRI) and Mr. Punil HN of Microbiology & Fermentation Technology Dept. (CFTRI) for their assistance during experimentation. The authors also acknowledge the Director, CSIR-CFTRI, Mysuru for providing infrastructure and support during the research work.

Abbreviations

ATR: Attenuated total reflection
BOSS-LightGBM: Bootstrapping soft shrinkage-light gradient boosting machine
CARS: Competitive adaptive reweighted sampling
CatBoost: Categorical boosting
DMC: Dry matter content
DS: Direct sampling
FDA: Food and Drug Administration
FSSAI: Food Safety and Standards Authority of India
FT-IR: Fourier transform infrared
GA: Genetic algorithm
iPLS: Interval partial least squares
KS: Kennard and Stone
KSPXY: Kernel distance-based sample set partitioning based on joint X–Y distances
LightGBM: Light gradient boosting machine
LS-SVM: Least squares-support vector machine
LVs: Latent variables
MAE: Mean absolute error
mTry: Number of input variables
NIPALS: Nonlinear iterative partial least squares
nTree: Number of regression trees
PCs: Principal components
PCA: Principal component analysis
PCR: Principal component regression
PLS: Partial least squares
RBF: Radial basis function
RCGA: Real coded genetic algorithm
RC2 and RP2: Regression coefficient of calibration and prediction
RF: Random forest
RMSEC: Root mean square error of calibration
RMSECV: Root mean square error of cross validation
RMSEP: Root mean square error of prediction
RS: Random sampling
SG: Savitzky-Golay
SNV: Standard normal variate
SPA: Successive projections algorithm
SPXY: Sample set partitioning based on joint X–Y distances
SVD: Singular value decomposition
SVM: Support vector machine
TLC: Thin layer chromatography
SY: Sunset yellow
XGB: Extreme gradient boosting

Authors’ contributions

RA carried out the experiments and wrote the original manuscript, SM conceived, supervised, and edited the manuscript.

Funding

Not applicable.

Data availability

The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

Code availability

All the computational codes will be made available upon request.

Declarations

Conflict of interest

Both the authors declare no conflict of interest.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Ethics approval

Not applicable.

Footnotes

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  1. Amsaraj R, Ambade ND, Mutturi S. Variable selection coupled to PLS2, ANN and SVM for simultaneous detection of multiple adulterants in milk using spectral data. Int Dairy J. 2021.
  2. Amsaraj R, Mutturi S. Real-coded GA coupled to PLS for rapid detection and quantification of tartrazine in tea using FT-IR spectroscopy. LWT–Food Sci Technol. 2021;139:110583. doi: 10.1016/j.lwt.2020.110583.
  3. Breiman L. Random forests. Mach Learn. 2001;45(1):5–32. doi: 10.1023/A:1010933404324.
  4. Cebi N, Yilmaz MT, Sagdic O. A rapid ATR-FTIR spectroscopic method for detection of sibutramine adulteration in tea and coffee based on hierarchical cluster and principal component analyses. Food Chem. 2017;229:517–526. doi: 10.1016/j.foodchem.2017.02.072.
  5. Chanda S, Hazarika AK, Choudhury N, Islam SA, Manna R, Sabhapondit S, et al. Support vector machine regression on selected wavelength regions for quantitative analysis of caffeine in tea leaves by near infrared spectroscopy. J Chemom. 2019;33(10):e3172. doi: 10.1002/cem.3172.
  6. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. 2016. pp 785–794.
  7. Dankowska A, Kowalewski W. Tea types classification with data fusion of UV–Vis, synchronous fluorescence and NIR spectroscopies and chemometric analysis. Spectrochim Acta Part A. 2019;5:195–202. doi: 10.1016/j.saa.2018.11.063.
  8. de Andrade FI, Guedes MIF, Vieira ÍGP, Mendes FNP, Rodrigues PAS, Maia CSC, de Ribeiro M. Determination of synthetic food dyes in commercial soft drinks by TLC and ion-pair HPLC. Food Chem. 2014;157:193–198. doi: 10.1016/j.foodchem.2014.01.100.
  9. Ge X, Sun J, Lu B, Chen Q, Xun W, Jin Y. Classification of oolong tea varieties based on hyperspectral imaging technology and BOSS-LightGBM model. J Food Process Eng. 2019;42(8):e13289.
  10. Geladi P, Kowalski BR. Partial least-squares regression: a tutorial. Anal Chim Acta. 1986;185:1–17. doi: 10.1016/0003-2670(86)80028-9.
  11. Leardi R, Boggia R, Terrile M. Genetic algorithms as a strategy for feature selection. J Chemom. 1992;6(5):267–281. doi: 10.1002/cem.1180060506.
  12. Li X, Luo L, He Y, Xu N. Determination of dry matter content of tea by near and middle infrared spectroscopy coupled with wavelet-based data mining algorithms. Comput Electron Agric. 2013;98:46–53. doi: 10.1016/j.compag.2013.07.014.
  13. Li X, Zhang Y, He Y. Rapid detection of talcum powder in tea using FT-IR spectroscopy coupled with chemometrics. Sci Rep. 2016;6(1):1–8. doi: 10.1038/srep30313.
  14. Li X, Xu K, Zhang Y, Sun C, He Y. Optical determination of lead chrome green in green tea by Fourier transform infrared (FT-IR) transmission spectroscopy. PLoS ONE. 2017;12(1):1–14. doi: 10.1371/journal.pone.0169430.
  15. Li M, Dai G, Chang T, Shi C, Wei D, Du C, Cui HL. Accurate determination of geographical origin of tea based on terahertz spectroscopy. Appl Sci. 2017;7(2):172. doi: 10.3390/app7020172.
  16. Li L, Jin S, Wang Y, Liu Y, Shen S, Li M, et al. Potential of smartphone-coupled micro NIR spectroscopy for quality control of green tea. Spectrochim Acta Part A. 2021;247:119096. doi: 10.1016/j.saa.2020.119096.
  17. Liang G, Dong C, Hu B, Zhu H, Yuan H, Jiang Y, et al. Prediction of moisture content for congou black tea withering leaves using image features and nonlinear method. Sci Rep. 2018;8(1):1–8. doi: 10.1038/s41598-018-26165-2.
  18. Lohumi S, Joshi R, Kandpal LM, Lee H, Kim MS, Cho H, et al. Quantitative analysis of Sudan dye adulteration in paprika powder using FTIR spectroscopy. Food Addit Contam Part A. 2017;34(5):678–686. doi: 10.1080/19440049.2017.1290828.
  19. Luo X, Xu L, Huang P, Wang Y, Liu J, Hu Y, et al. Nondestructive testing model of tea polyphenols based on hyperspectral technology combined with chemometric methods. Agriculture. 2021;11(7):890. doi: 10.3390/agriculture11070673.
  20. Malaysian tea manufacturer fined over banned colourings. Accessed 31 Aug 2021
  21. Raja V. Sale of fake tea powder rampant: here’s how to check your tea for adulteration. The Better India. 11/02/2019. https://www.thebetterindia.com/201889/tea-adulterated-test-fake-india-purity-check-homeindia/. Accessed 31 Aug 2021
  22. Rovina K, Prabakaran PP, Siddiquee S, Shaarani SM. Methods for the analysis of Sunset Yellow FCF (E110) in food and beverage products-a review. TrAC Trends Anal Chem. 2016;85:47–56. doi: 10.1016/j.trac.2016.05.009.
  23. Sun Y, Wang Y, Huang J, Ren G, Ning J, Deng W, et al. Quality assessment of instant green tea using portable NIR spectrometer. Spectrochim Acta Part A. 2020;240:118576. doi: 10.1016/j.saa.2020.118576.
  24. Suykens JAK, van Gestel T, de Brabanter J, de Moor B, Vandewalle JPL. Least squares support vector machines. World Sci. 2002;5:796.
  25. Wang X, Huang J, Fan W, Lu H. Identification of green tea varieties and fast quantification of total polyphenols by near-infrared spectroscopy and ultraviolet-visible spectroscopy with chemometric algorithms. Anal Methods. 2015;7(2):787–792. doi: 10.1039/C4AY02106A.
  26. Wu X, Zhu J, Wu B, Sun J, Dai C. Discrimination of tea varieties using FTIR spectroscopy and allied Gustafson-Kessel clustering. Comput Electron Agric. 2018;147:64–69. doi: 10.1016/j.compag.2018.02.014.
  27. Xu Y, Goodacre R. On splitting training and validation set: a comparative study of cross-validation, bootstrap and systematic sampling for estimating the generalization performance of supervised learning. Int J Test. 2018;2(3):249–262. doi: 10.1007/s41664-018-0068-2.
  28. Xu M, Wang J, Zhu L. The qualitative and quantitative assessment of tea quality based on E-nose, E-tongue and E-eye combined with chemometrics. Food Chem. 2019;289:482–489. doi: 10.1016/j.foodchem.2019.03.080.
  29. Yang B, Qi L, Wang M, Hussain S, Wang H, Wang B, et al. Cross-category tea polyphenols evaluation model based on feature fusion of electronic nose and hyperspectral imagery. Sensors. 2020;20(1):496. doi: 10.3390/s20010050.
  30. Zhang M, Guo J, Ma C, Qiu G, Ren J, Zeng F, Lü E. An effective prediction approach for moisture content of tea leaves based on discrete wavelet transforms and bootstrap soft shrinkage algorithm. Appl Sci. 2020;10(14):4839. doi: 10.3390/app10144839.



Articles from Journal of Food Science and Technology are provided here courtesy of Springer
