Abstract
OBJECTIVE:
To evaluate the quality of Moyao (Myrrh) by identifying its geographical origin and processing technology.
METHODS:
Raw Moyao (Myrrh) and two kinds of Moyao (Myrrh) processed with vinegar from three countries were identified using near-infrared (NIR) spectroscopy combined with chemometric techniques. Principal component analysis (PCA) was used to reduce the dimensionality of the data and visualize the clustering of samples from different categories. A classical chemometric algorithm [partial least squares discriminant analysis (PLS-DA)] and two machine learning algorithms [k-nearest neighbor (KNN) and support vector machine (SVM)] were used to classify the near-infrared spectra of the Moyao (Myrrh) samples, and their discriminative performance was evaluated.
RESULTS:
Based on the accuracy, precision, recall, and F1 score of each model, both the classical chemometric algorithm and the machine learning algorithms gave good results. Across all of the chemometric analyses, the NIR spectra of Moyao (Myrrh) preprocessed by standard normal variate (SNV) or multiplicative scatter correction (MSC) and combined with KNN achieved the highest accuracy in identifying the geographical origin, and the KNN model built after first-order derivative pretreatment was the most accurate in identifying the processing technology. The best accuracies for geographical origin discrimination and processing technology discrimination were 0.9853 and 0.9706, respectively.
CONCLUSIONS:
NIR spectroscopy combined with chemometric technology can be an important tool for tracking the origin and processing technology of Moyao (Myrrh) and can also provide a reference for evaluating its quality and guiding its clinical use.
Keywords: Moyao (Myrrh), near-infrared spectroscopy, geographical origin, processing technology
1. INTRODUCTION
Moyao (Myrrh) is a gum resin obtained from the stems and branches of Commiphora myrrha Engl. or Commiphora molmol Engl. according to the Chinese Pharmacopeia.1 Moyao (Myrrh) has been used worldwide to treat diseases for more than 2000 years.2 In China, Moyao (Myrrh) was first recorded in the Tang Dynasty's Yao Xing Lun.3 Its Chinese name, Moyao, derives from Mo, the transliteration of the Arabic word mu (bitterness).4 The plant is native to southwest Asia, including Arabia, and is most commonly found in east and northeast Africa near the region of the Red Sea or Arabian Gulf, specifically in Somalia, Kenya, Madagascar, India, Ethiopia, Iran, and Thailand.5
The quality and efficacy of Moyao (Myrrh) can vary with its source, the variety of the Moyao (Myrrh) plant, the natural environment of the place of production, and the methods of cultivation and processing. Moreover, it is not easy to identify Moyao (Myrrh) based on experience alone.
In Traditional Chinese Medicine, Moyao (Myrrh) is divided into raw products [raw Moyao (Myrrh)] and processed products [vinegar Moyao (Myrrh)], and vinegar Moyao (Myrrh) is obtained by two processing techniques. The first technique is based on the latest edition (2020) of the Chinese Pharmacopeia.6 Briefly, Moyao (Myrrh) is mixed with vinegar, soaked for a couple of tens of minutes, and then fried to a certain degree. The second technique is based on the Jiangxi Province Standards for Processing Chinese Herbal Pieces (2008 Edition)7 and the Camphor Tree Traditional Chinese Medicine Processing Book.8 In this method, raw Moyao (Myrrh) is fried to a certain extent and then sprayed with rice vinegar to obtain vinegar Moyao (Myrrh). According to the theory of Traditional Chinese Medicine, raw Moyao (Myrrh) promotes blood circulation, relieves pain, reduces swelling, and promotes muscle growth, but it has a certain stimulating effect on the stomach.9 It is mostly used externally and is often used in cosmetics and perfumes.5 Vinegar processing removes the volatile oil from Moyao (Myrrh), which alleviates irritation of the stomach and, at the same time, enhances the effects of promoting blood circulation, relieving pain, and contracting muscles.10 Therefore, raw Moyao (Myrrh) and vinegar Moyao (Myrrh) differ to some extent in quality and clinical efficacy. However, both techniques (frying after soaking in vinegar, and spraying with vinegar after frying) are used by the manufacturers of Chinese herbal decoctions, and the resulting products are used without differentiation in the clinical environment. The quality and efficacy of these products have not been systematically studied. Therefore, it is necessary to identify the geographical origin and processing technology of Moyao (Myrrh) using new technology in order to evaluate its quality and apply it reasonably in clinical practice.
Many analytical methods, such as high-performance liquid chromatography-mass spectrometry,11-13 gas chromatography-mass spectrometry (GC-MS),14-17 DNA sequencing,18,19 and nuclear magnetic resonance,20,21 have been widely used to identify traditional medicines in the past 10 years. Li et al 10 established HPLC fingerprint chromatograms for Moyao (Myrrh) from different regions; the results showed that the quality of different batches of Moyao (Myrrh) from different regions differed. Li et al 16 established a GC-MS method to analyze the volatile oil in Moyao (Myrrh) and its processed products; the results showed that the main components of Moyao (Myrrh) and their contents changed after processing. However, the equipment used in these studies is expensive and not portable, and the techniques require experienced operators. In comparison, near-infrared (NIR) spectroscopy combined with chemometrics is easy to operate, convenient, fast, and nondestructive, needs only a tiny amount of sample, and is widely used to identify traditional medicines22-24 and food.25-27 For example, Fourier transform infrared and NIR spectroscopy, combined with single-spectrum analysis and multi-sensor information fusion strategies, were used to identify the origin of Sanqi (Panax Notoginseng) samples from five cities in Yunnan Province, China.28 Coffee beans from nine countries on two continents were correctly distinguished using NIR combined with multivariate data analysis.29 However, systematic research on the quality of Moyao (Myrrh) and its processing technology using NIR spectroscopy has not been reported. Therefore, in this study we used NIR spectroscopy combined with chemometrics to identify the geographical origin and processing technology of Moyao (Myrrh).
In our study, we combined NIR spectroscopy with chemometric methods and established three classification models [partial least squares discriminant analysis (PLS-DA), k-nearest neighbor (KNN), and support vector machine (SVM)] to distinguish raw Moyao (Myrrh) and two kinds of vinegar Moyao (Myrrh) from three countries, and then optimized the classification models. These models lay the foundation for identifying the source of Moyao (Myrrh) with a portable near-infrared spectrometer and provide a reference for later quality evaluations and clinical use of Moyao (Myrrh).
2. MATERIALS AND METHODS
2.1. Materials and reagents
The Moyao (Myrrh) samples were purchased from Jiangxi Guhan Refined Chinese herbal preparations Co., Ltd. The samples were stored in a laboratory at a room temperature of 25 ℃ and a humidity of 45%. Part of each Moyao (Myrrh) sample was crushed into granules with a grinder and passed through a 60-mesh sieve to obtain Moyao (Myrrh) powder (particle size = 250 μm). Each sample was divided into 3 portions. There were three types of Moyao (Myrrh) preparation, each from three geographical origins (Kenya, Ethiopia, and Somalia), giving a total of 270 samples (Table 1). From each of the three origins there were 30 batches of raw Moyao (Myrrh), 30 batches of sprayed vinegar Moyao (Myrrh) [raw Moyao (Myrrh) fried to a certain extent and then sprayed with rice vinegar], and 30 batches of soaked vinegar Moyao (Myrrh) [Moyao (Myrrh) mixed with vinegar, soaked for a couple of hours, and then fried to a certain extent].
Table 1.
Dataset of Moyao (Myrrh) samples, split into the training set and the prediction set
| Method | Group | Training set (n) | Prediction set (n) | Total (n) |
|---|---|---|---|---|
| Origin | Kenya | 71 | 19 | 90 |
| | Ethiopia | 62 | 28 | 90 |
| | Somalia | 69 | 21 | 90 |
| | Total | 202 | 68 | 270 |
| Processing technology | Raw Moyao (Myrrh) | 62 | 28 | 90 |
| | Sprayed vinegar Moyao (Myrrh) | 75 | 15 | 90 |
| | Soaked vinegar Moyao (Myrrh) | 65 | 25 | 90 |
| | Total | 202 | 68 | 270 |
2.2. Acquisition of the spectra
A JDSU MicroNIR 1700 spectrometer (JDSU, Milpitas, CA, USA) was used to collect the reflection spectra. The spectral measurement range was 908.1-1676.2 nm, the spectral interval was 6.2 nm, and the spectral resolution was 10 nm. Spectral data for each sample were measured in triplicate. The temperature in the laboratory was controlled to (25 ± 1) ℃, and the relative humidity was controlled to 46% ± 5%. About 1.0 g of Moyao (Myrrh) powder was transferred into a standard quartz bottle (inner diameter, ∼10 mm) for testing.
2.3. Data analysis
All analyses of the collected spectra were carried out in the Python programming language, and the PLS-DA, KNN, and SVM algorithms were implemented with the scikit-learn 1.0.2 software package. All computations were carried out on a desktop PC with an Intel Xeon (R) Platinum 8124M 3.00GHz processor, 64 GB RAM, and an NVIDIA GeForce RTX 3060.
Before modeling, the spectra were preprocessed with four methods: standard normal variate (SNV), multiplicative scatter correction (MSC), first-order derivative (FD), and second-order derivative (SD). Three classification models were then established on the preprocessed spectral data using the PLS-DA, KNN, and SVM techniques, and were used to classify the geographical origins and the processed products.
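The preprocessing steps named above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names, the synthetic spectra, and the use of simple finite differences for the derivatives are all assumptions (the paper does not say how FD and SD were computed).

```python
import numpy as np

def snv(spectra):
    """Standard normal variate: scale each spectrum (row) by its own mean and std."""
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

def msc(spectra, reference=None):
    """Multiplicative scatter correction against a reference (default: mean spectrum)."""
    ref = spectra.mean(axis=0) if reference is None else reference
    corrected = np.empty(spectra.shape, dtype=float)
    for i, row in enumerate(spectra):
        # Fit row as intercept + slope * ref, then undo the offset and gain.
        slope, intercept = np.polyfit(ref, row, deg=1)
        corrected[i] = (row - intercept) / slope
    return corrected

def derivative(spectra, order=1, step_nm=6.2):
    """Finite-difference derivative along the wavelength axis (assumed method)."""
    out = np.asarray(spectra, dtype=float)
    for _ in range(order):
        out = np.gradient(out, step_nm, axis=1)
    return out

# Synthetic spectra (hypothetical): one band shape with per-sample offset and gain.
wavelengths = np.linspace(908.1, 1676.2, 125)
band = np.exp(-((wavelengths - 1450.0) / 60.0) ** 2)
rng = np.random.default_rng(0)
offsets = rng.uniform(0.0, 0.5, size=(6, 1))
gains = rng.uniform(0.8, 1.2, size=(6, 1))
raw = offsets + gains * band

snv_spectra = snv(raw)
msc_spectra = msc(raw)
fd_spectra = derivative(raw, order=1)
sd_spectra = derivative(raw, order=2)
snv_fd_spectra = derivative(snv(raw), order=1)
```

On these synthetic spectra, which differ only by an additive offset and a multiplicative gain, both SNV and MSC collapse all rows onto a single curve, which is the scatter-removal behavior the two methods are intended to provide.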
2.4. Three classification models
2.4.1. PLS-DA classification analysis
The PLS-DA model is a supervised feature extraction and classification method with several advantages: it requires relatively few samples, reduces the dimensionality of the near-infrared spectral data, and limits the influence of multicollinearity among the wavelength variables. It is a widely used qualitative analysis method.30,31 The principle of the PLS-DA classifier is similar to that of PLS regression; however, unlike in PLS regression, the property matrix Y of the PLS-DA classifier does not contain measured experimental results but is composed of artificially assigned classification variables. The PLS-DA classifier is established from the matrix X of the collected near-infrared spectra and the artificially assigned classification variable matrix Y.
2.4.2. KNN classification analysis
With K = 1, the KNN decision rule assigns a sample x to the class of its closest training sample; with K > 1, the sample is assigned to the majority class among its K nearest training samples, so the number of neighbors (K) affects the classification result.32
The idea of the KNN algorithm is that, when the data and labels in the training set are known, the features of a test sample are compared with the corresponding features of the training samples to find the K most similar training samples. The test sample is then assigned to the category that occurs most often among these K neighbors. The algorithm is as follows: (a) Calculate the distance between the test sample and each training sample. (b) Sort the training samples in order of increasing distance. (c) Select the K nearest points. (d) Determine how often each category occurs among these K points. (e) Return the category with the highest frequency among the K points as the predicted category of the test sample.
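Steps (a) to (e) can be written out directly in plain Python. The sketch below assumes Euclidean distance, which the text does not specify, and uses a hypothetical helper name `knn_predict` with toy two-dimensional points in place of spectra.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    # (a) distance between the query and every training sample
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    # (b) sort by increasing distance and (c) keep the k nearest points
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # (d) count how often each class occurs among the k neighbors
    votes = Counter(label for _, label in nearest)
    # (e) return the most frequent class
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two well-separated groups.
train_X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
train_y = ["raw", "raw", "raw", "vinegar", "vinegar", "vinegar"]

label_near_origin = knn_predict(train_X, train_y, (0.2, 0.1), k=3)
label_far = knn_predict(train_X, train_y, (5.5, 5.2), k=3)
```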
2.4.3. SVM classification analyses
The SVM technique is a classification and regression data mining method based on the principle of structural risk minimization, which helps to limit overfitting. It is an effective data mining technology proposed at Bell Labs in 1995.33,34 It has distinct advantages for problems involving small samples, nonlinear data, and high-dimensional pattern recognition.
Selecting an appropriate kernel function is the most important step in establishing an adequately performing SVM model, and it usually involves two parts: selecting the type of kernel function, and then optimizing the important parameters of that kernel. Research has shown that models built with the Gaussian radial basis function (RBF) kernel have good learning ability. Therefore, this study used the RBF kernel to implement the SVM model. The two important parameters of an RBF-kernel SVM are the penalty parameter c and the kernel parameter g; both strongly influence the model, so they need to be optimized. This study used the grid search (GS) method to select the optimal parameters. Grid search is a traversal algorithm that tries all (c, g) parameter pairs on a predefined grid and uses cross-validation to find the pair with the highest accuracy, which is taken as the optimal parameter pair.
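In scikit-learn, the search over (c, g) pairs described above maps onto `GridSearchCV` wrapped around `SVC(kernel="rbf")`, where c corresponds to the `C` argument and g to `gamma`. The grid values, the 5-fold setting, and the synthetic data below are assumptions for illustration; the paper does not report the grid it used.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic three-class data (hypothetical): class means are shifted apart.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 30)
X = rng.normal(size=(90, 10)) + 4.0 * y[:, None]

# Candidate (c, g) pairs; every combination is evaluated by cross-validation.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

best_c = search.best_params_["C"]
best_g = search.best_params_["gamma"]
best_cv_accuracy = search.best_score_
```

After fitting, `search.best_estimator_` holds an SVM refitted on all of the supplied data with the selected pair, which can then be applied to the held-out prediction set.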
2.5. Parameters used for evaluating the classification models
Model evaluation was used to compare the performance of the different models across the parameter space and the feature extraction settings. In this experiment, accuracy, precision, recall, and the F1 score were used as indicators for evaluating the models.
A confusion matrix is a summary of the prediction results for a classification problem.35 Its key idea is to count the correct and incorrect predictions and break them down by category. The confusion matrix shows which classes the model confuses when making predictions; it reveals not only how many mistakes the classification model makes but, more importantly, which types of errors occur. In the following formulas, TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives.
From the confusion matrix, more advanced classification indicators can be obtained, such as the accuracy, precision, recall, and F1 score.
Accuracy is the proportion of correct predictions out of the total number of predictions:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Precision is the proportion of correctly predicted positives among all positive predictions:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

Recall is the proportion of correctly predicted positives among all actual positives:

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

The F1 score is the harmonic mean of precision and recall, and the larger the F1 score, the better. A high F1 score requires both the precision and the recall to be high; that is, F1 weights the precision and the recall equally:

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

Substituting the formulas for precision and recall leads to:

$$F1 = \frac{2TP}{2TP + FP + FN}$$
Accuracy is the simplest and most intuitive evaluation index used in classification problems. Precision reflects the model's ability to distinguish negative samples: the higher the precision, the fewer negative samples are wrongly predicted as positive. Recall reflects the model's ability to identify positive samples: the higher the recall, the more of the actual positive samples are found. The F1 score combines the two, and the higher the F1 score, the more robust the model is.
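The four indicators defined above can be computed from a multi-class confusion matrix by treating each class in turn as the positive class. The sketch below is illustrative only: the function name and the 3 x 3 matrix (68 samples with a single error) are hypothetical, and because the text does not state how per-class values were averaged, the function returns them per class.

```python
def per_class_metrics(cm):
    """cm[i][j] = number of samples of true class i predicted as class j."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total
    results = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                         # class k predicted as another class
        fp = sum(cm[i][k] for i in range(n)) - tp    # other classes predicted as k
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        results.append((precision, recall, f1))
    return accuracy, results

# Hypothetical confusion matrix: one sample of class 0 is predicted as class 1.
cm = [
    [18, 1, 0],
    [0, 28, 0],
    [0, 0, 21],
]
accuracy, per_class = per_class_metrics(cm)
```

With one error among 68 samples the accuracy is 67/68, which rounds to 0.9853.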
3. RESULTS
3.1. Data splitting methods for the classification methods
Dataset splitting methods include the Duplex method,35 the Kennard-Stone (KS) method,36 the random sampling (RS) method,37 and the SPXY method.38 Among them, the SPXY method could be effectively applied to our analysis of the spectral calibration model. It is a sample division method based on the KS method. Compared with the KS method, the SPXY method calculates the distance between samples by considering both the x variable and the y variable. The distance in x is calculated as in the KS method, as the Euclidean distance between the x vectors of each pair of samples (p, q):

$$d_x(p,q) = \sqrt{\sum_{j=1}^{J}\left[x_p(j) - x_q(j)\right]^2}, \quad p, q \in [1, N]$$

where $x_p(j)$ and $x_q(j)$ are the instrumental responses at the jth wavelength for samples p and q, respectively, J is the number of wavelengths, and N is the number of samples. The SPXY algorithm adds the distance between the samples in the y-direction, $d_y(p,q)$:

$$d_y(p,q) = \sqrt{(y_p - y_q)^2} = |y_p - y_q|, \quad p, q \in [1, N]$$

The stepwise selection process of the SPXY method is similar to that of the KS method, except that $d_x(p,q)$ is replaced by $d_{xy}(p,q)$. Here, $d_{xy}(p,q)$ is the normalized xy distance, chosen so that the samples have the same weight in x and y space:

$$d_{xy}(p,q) = \frac{d_x(p,q)}{\max_{p,q \in [1,N]} d_x(p,q)} + \frac{d_y(p,q)}{\max_{p,q \in [1,N]} d_y(p,q)}, \quad p, q \in [1, N]$$

The split of the dataset of Moyao (Myrrh) samples is shown in Table 1.
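The SPXY selection described in this section can be sketched as a Kennard-Stone style max-min procedure applied to the joint distance. This is a minimal NumPy illustration under stated assumptions: the function name `spxy_split` is hypothetical, y is assumed to be numerically encoded and not constant, and the synthetic data stand in for spectra and class codes.

```python
import numpy as np

def spxy_split(X, y, n_train):
    """Pick n_train calibration samples by max-min selection on the joint x-y distance."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).reshape(len(X), -1)
    dx = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    dy = np.linalg.norm(y[:, None, :] - y[None, :, :], axis=2)
    dxy = dx / dx.max() + dy / dy.max()   # equal weight for x and y space

    # Start with the two samples that are farthest apart.
    first, second = np.unravel_index(np.argmax(dxy), dxy.shape)
    selected = [int(first), int(second)]
    remaining = [i for i in range(len(X)) if i not in selected]
    while len(selected) < n_train:
        # Distance from each remaining sample to its nearest selected sample.
        min_dist = dxy[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(min_dist))]
        selected.append(pick)
        remaining.remove(pick)
    return selected, remaining

# Synthetic example (hypothetical): 20 samples, 5 variables, 4 class codes.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 5))
y_demo = np.repeat([0, 1, 2, 3], 5)
train_idx, test_idx = spxy_split(X_demo, y_demo, n_train=15)
```

Samples left in `test_idx` are those lying closest to already selected calibration samples, so the calibration set spans the joint x-y space as evenly as possible.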
Table 2.
Precision, accuracy, F1 score, and recall of the KNN, SVM, and PLS-DA classification methods for Moyao (Myrrh) from different geographical origins
| Method | Preprocessing | Precision | Accuracy (training set) | Accuracy (test set) | F1 score | Recall |
|---|---|---|---|---|---|---|
| KNN | Raw | 0.8889 | 1 | 0.8971 | 0.8987 | 0.8935 |
| | SNV | 0.9833 | 1 | 0.9853 | 0.9853 | 0.9841 |
| | MSC | 0.9833 | 1 | 0.9853 | 0.9853 | 0.9841 |
| | FD | 0.9683 | 1 | 0.9706 | 0.9706 | 0.9683 |
| | SD | 0.9683 | 1 | 0.9706 | 0.9706 | 0.9683 |
| | SNV+FD | 0.9666 | 1 | 0.9706 | 0.9706 | 0.9666 |
| SVM | Raw | 0.8929 | 0.9109 | 0.8676 | 0.8663 | 0.8611 |
| | SNV | 0.9545 | 0.9703 | 0.9559 | 0.9558 | 0.9524 |
| | MSC | 0.9545 | 0.9703 | 0.9559 | 0.9558 | 0.9524 |
| | FD | 0.9683 | 0.995 | 0.9706 | 0.9706 | 0.9683 |
| | SD | 0.9683 | 0.995 | 0.9706 | 0.9706 | 0.9683 |
| | SNV+FD | 0.9683 | 0.9901 | 0.9706 | 0.9706 | 0.9683 |
| PLS-DA | Raw | 0.9666 | 0.9208 | 0.9706 | 0.9706 | 0.9666 |
| | MSC | 0.9833 | 0.9752 | 0.9853 | 0.9853 | 0.9841 |
| | SNV | 0.971 | 0.9356 | 0.9706 | 0.9705 | 0.9666 |
| | FD | 0.9833 | 0.9653 | 0.9853 | 0.9853 | 0.9841 |
| | SD | 0.9833 | 0.9752 | 0.9853 | 0.9853 | 0.9841 |
| | SNV+FD | 0.9683 | 0.9604 | 0.9706 | 0.9706 | 0.9683 |
Notes: KNN: k-nearest neighbor; SVM: support vector machine; PLS-DA: partial least squares discriminant analysis; Raw: no preprocessing; SNV: standard normal variate; MSC: multiplicative scatter correction; FD: first-order derivative; SD: second-order derivative.
Table 3.
Precision, accuracy, F1 score, and recall of the KNN, SVM, and PLS-DA classification methods for distinguishing the different processing technologies
| Method | Preprocessing | Precision | Accuracy (training set) | Accuracy (test set) | F1 score | Recall |
|---|---|---|---|---|---|---|
| KNN | Raw | 0.9122 | 1.0000 | 0.9118 | 0.9089 | 0.9200 |
| | SNV | 0.8833 | 1.0000 | 0.8971 | 0.8987 | 0.8978 |
| | MSC | 0.9267 | 1.0000 | 0.9412 | 0.9418 | 0.9378 |
| | FD | 0.9753 | 1.0000 | 0.9706 | 0.9701 | 0.9556 |
| | SD | 0.9246 | 1.0000 | 0.9265 | 0.9245 | 0.8978 |
| | SNV+FD | 0.9444 | 1.0000 | 0.9559 | 0.9561 | 0.9511 |
| SVM | Raw | 0.6205 | 0.7723 | 0.6324 | 0.6250 | 0.6383 |
| | SNV | 0.9540 | 0.9406 | 0.9412 | 0.9388 | 0.9111 |
| | MSC | 0.9540 | 0.9307 | 0.9412 | 0.9388 | 0.9111 |
| | FD | 0.9505 | 0.9653 | 0.9559 | 0.9556 | 0.9422 |
| | SD | 0.9139 | 0.9802 | 0.9265 | 0.9259 | 0.9067 |
| | SNV+FD | 0.9139 | 0.9455 | 0.9265 | 0.9259 | 0.9067 |
| PLS-DA | Raw | 0.8500 | 0.9109 | 0.8676 | 0.8697 | 0.8622 |
| | SNV | 0.7491 | 0.8564 | 0.7647 | 0.7681 | 0.7643 |
| | MSC | 0.7750 | 0.8416 | 0.7941 | 0.7963 | 0.7895 |
| | FD | 0.7776 | 0.9010 | 0.7794 | 0.7722 | 0.8057 |
| | SD | 0.8136 | 0.8515 | 0.8088 | 0.8012 | 0.8192 |
| | SNV+FD | 0.8521 | 0.9257 | 0.8382 | 0.8382 | 0.8310 |
Notes: KNN: k-nearest neighbor; SVM: support vector machine; PLS-DA: partial least squares discriminant analysis; Raw: no preprocessing; SNV: standard normal variate; MSC: multiplicative scatter correction; FD: first-order derivative; SD: second-order derivative.
3.2. Spectra interpretation
The original NIR spectra of all Moyao (Myrrh) samples are shown in Figure 1A, with 126 features in the wavelength range of 900-1680 nm. The hydrogen-containing groups (such as C-H, O-H, and N-H) of the organic substances in Moyao (Myrrh) produce overtone and combination absorptions in the near-infrared region, arising mainly from the water in the Moyao (Myrrh) and from the related monoterpenes, sesquiterpenes, and triterpenes. Obvious absorption bands can be seen at 1150-1210 nm and 1400-1500 nm, which are related to the double-frequency (overtone) and second combined-frequency (combination) bands of O-H and C-H, respectively. The spectrum at 900-1000 nm resembles noise and contains very little spectral information on the active components. Because the original near-infrared spectra overlap severely, peak information such as peak height and peak intensity was similar across samples. Therefore, it was difficult to visually distinguish the geographic origins and processing techniques of Moyao (Myrrh). To establish a relatively stable near-infrared analysis model, it was necessary to preprocess the original near-infrared spectra. In this study, to improve the accuracy of the model, five spectral preprocessing methods were used: standard normal variate (SNV), multiplicative scatter correction (MSC), first derivative (FD), second derivative (SD), and the SNV+FD method. As shown in Figure 1, after spectral pretreatment the first peak can be seen more clearly at around 1100-1150 nm, corresponding to C-H groups, which were attributed to myrrh's lactone A, limonene compounds, and some sesquiterpenoids. The second prominent peak appeared at around 1350-1400 nm, corresponding to O-H groups, which were attributed to compounds such as bisabolol A, camphor, and alisol E in Moyao (Myrrh).
Figure 1. Spectra of all Moyao (Myrrh) samples in the wavelength range of 900-1600 nm.
A: original spectra; B: spectra after pretreatment by FD; C: spectra after pretreatment by SD; D: spectra after pretreatment by SNV; E: spectra after pretreatment by MSC; F: spectra after pretreatment by MSC + FD. FD: first derivative; SD: second derivative; SNV: standard normal variate; MSC: multiplicative scatter correction.
3.3. PCA
PCA is a chemometric tool for feature extraction and dimensionality reduction that is widely used in spectral data analysis. It simplifies the data and shows the repeatability of the samples and the differences among the groups. In this study, PCA was used to discriminate Moyao (Myrrh) with different origins and different processing techniques. Figure 2 is based on the average spectra of Moyao (Myrrh) pretreated by SNV + FD, and the first two principal components (PC1 and PC2) given by the PCA were used to draw score scatterplots, in which each point represents a sample. PC1 and PC2 explained 49.84% and 30.28% of the variance of the dataset, respectively. The samples of the three origins and of the three processing methods each tended to form three clusters. In Figure 2A, the samples can be roughly divided into three origins, although there is some overlap between the origins. In Figure 2B, samples obtained by different processing techniques overlap more, and the classification trend is not obvious. Judging from the PCA score plots, the NIR spectra carry information about the source of the sample, but it is difficult to tell whether the scores fully reflect the information of the original samples, which may be related to key information such as some of the chemical components in Moyao (Myrrh). PCA alone therefore has some difficulty in identifying the geographical source or processing technology of Moyao (Myrrh).
Figure 2. PCA scores plot in the spectral range of 908.1-1676.2 nm.
A: different geographical origins; B: different processing technologies. PCA: principal component analysis.
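A score plot of the kind shown in Figure 2 is obtained by projecting the mean-centered spectra onto the first two principal components. The sketch below computes the scores and explained-variance ratios with a singular value decomposition; the function name and the synthetic data are assumptions, and the numbers it produces have no relation to the values reported for Figure 2.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Return PCA scores and explained-variance ratios for the leading components."""
    Xc = X - X.mean(axis=0)                      # mean-center each variable
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = (S ** 2) / np.sum(S ** 2)        # share of total variance per component
    return scores, explained[:n_components]

# Synthetic data (hypothetical): 30 samples, 50 variables, two latent directions plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(30, 2)) * np.array([3.0, 1.5])
loadings = rng.normal(size=(2, 50))
X_demo = latent @ loadings + 0.1 * rng.normal(size=(30, 50))

scores, ratios = pca_scores(X_demo)
pc1, pc2 = scores[:, 0], scores[:, 1]
```

Plotting `pc1` against `pc2`, colored by origin or by processing technique, gives the two panels of such a figure.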
Therefore, this study needed to use other multivariate statistical analysis techniques for the classification of Moyao (Myrrh) with different origins and different processing techniques. In the current research, machine learning algorithms were an effective means to accurately select the spectral feature bands, making a classification analysis of the near-infrared spectral data feasible.
3.4. Building the classification model based on the near-infrared spectral data
As an analysis technique, the machine learning algorithm is mainly used to find the relationship between the input and output data, and is widely used in research into the traceability of agricultural products39 and food.40 Regarding the origin and processing technology of Moyao (Myrrh), classification analysis can be carried out on the basis of NIR spectroscopy, combined with two machine learning algorithms and one classical chemometric algorithm.
3.4.1. Classification of the different geographical origins of Moyao (Myrrh)
Before establishing the classification model, the 270 samples were first divided into a training set (202 samples) and a testing set (68 samples), as listed in Table 1. Combined with the different spectral preprocessing methods, three classification models (KNN, SVM, and PLS-DA) were established to analyze the origin of the Moyao (Myrrh). The corresponding modeling results are shown in Table 2. With preprocessed spectra, the test accuracy rates of the three classification models were all above 0.9000, showing good discrimination. When the best preprocessed models of the three classifiers were compared, the F1 scores were very similar, showing that both the machine learning algorithms and the classical chemometric algorithm obtained good results. After the spectra had been processed by SNV or MSC, the KNN model had high accuracy for both the calibration set and the prediction set (1.0000 and 0.9853, respectively). After the spectra had been processed by MSC, the PLS-DA model had a slightly lower calibration-set accuracy (0.9752) and the same prediction-set accuracy (0.9853). The classification model established using the SVM algorithm had poorer robustness. Therefore, the KNN-based model could be combined with SNV- or MSC-processed spectral data to classify Moyao (Myrrh) from the three different origins.
The classification results for Moyao (Myrrh) from different geographical origins are visualized as confusion matrices in Figure 3. Figures 3A and 3C are the confusion matrices of the KNN and PLS-DA models, respectively. They indicate that 1 sample from Kenya was misclassified as Ethiopian and that all other samples were assigned to their correct geographical origins. Figure 3B is the confusion matrix of the SVM model: 2 samples from Kenya were misclassified as Ethiopian, while the remaining 66 samples were discriminated correctly.
Figure 3. Confusion matrix of the geographic origin classification models.
A: confusion matrix of k-nearest neighbor; B: confusion matrix of support vector machine; C: confusion matrix of partial least squares discriminant analysis.
3.4.2. Classification of the different processing technologies
The KNN, SVM, and PLS-DA algorithms, combined with the NIR spectral information, were used to establish classification models for the different processing techniques of Moyao (Myrrh). The results are shown in Table 3. The precision obtained with first-order derivative preprocessing combined with the KNN and SVM models (0.9753 and 0.9505, respectively) was higher than that of the PLS-DA model without spectral preprocessing (0.8500). Comparing the accuracy, F1 score, and recall of the training set and the prediction set, the best result was obtained when the spectra were preprocessed by first-order derivative and combined with KNN. Thus, of the algorithms tested, the KNN method best captured the relationship between the samples obtained by different processing techniques and the spectral information.
The classification results for Moyao (Myrrh) processed by different technologies are visualized in Figure 4. Figure 4A is the confusion matrix of the KNN model; it indicates that 2 samples of Moyao (Myrrh) sprayed with vinegar after frying were misclassified as Moyao (Myrrh) soaked in vinegar then fried, while all other samples were assigned to their correct processing technologies. Figure 4B is the confusion matrix of the SVM model: 3 samples were misclassified, while the remaining 65 samples were discriminated correctly. Of the misclassified samples, 2 samples of Moyao (Myrrh) sprayed with vinegar after frying were misclassified as Moyao (Myrrh) soaked in vinegar then fried, and 1 sample of Moyao (Myrrh) soaked in vinegar then fried was misclassified as Moyao (Myrrh) sprayed with vinegar after frying. The confusion matrix of the PLS-DA model is shown in Figure 4C: 9 samples were misclassified, while the remaining 59 samples were discriminated correctly. Of the misclassified samples, 2 samples of Moyao (Myrrh) sprayed with vinegar after frying were misclassified as Moyao (Myrrh) soaked in vinegar then fried, and 7 samples of Moyao (Myrrh) soaked in vinegar then fried were misclassified as Moyao (Myrrh) sprayed with vinegar after frying. Therefore, the KNN model has the strongest generalization ability.
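The misclassification counts read from Figure 4 can be cross-checked against the test accuracies reported for the processing-technology models, since the prediction set contains 68 samples. The short check below uses only numbers stated in the text; the variable names are illustrative.

```python
# Prediction-set size and misclassified counts as stated for Figure 4.
n_test = 68
misclassified = {"KNN": 2, "SVM": 3, "PLS-DA": 9}

# Accuracy = correctly classified samples / all prediction-set samples.
accuracy = {
    model: round((n_test - errors) / n_test, 4)
    for model, errors in misclassified.items()
}
```

The resulting values (0.9706, 0.9559, and 0.8676) agree with the test accuracies listed for KNN with FD, SVM with FD, and PLS-DA on raw spectra.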
Figure 4. Confusion matrix of the processing technology classification models.
A: confusion matrix of k-nearest neighbor; B: confusion matrix of support vector machine; C: confusion matrix of partial least squares discriminant analysis.
4. DISCUSSION
Moyao (Myrrh) used in clinical Chinese medicine is obtained from different origins and processed by different techniques. The composition of its characteristic components can therefore differ, which directly affects its clinical efficacy. In this study, NIR spectroscopy, a rapid, accurate, and noninvasive detection technique, was combined with chemometrics, and we verified preliminarily that this combination can be used as a powerful tool to identify the origin and processing method of Moyao (Myrrh). PCA alone had some difficulty in identifying the geographical source or processing method of Moyao (Myrrh): the PCA score plots showed that the NIR spectra carry information about the source of the sample, but also that it is difficult to tell whether the scores fully reflect the original sample's information. After spectral preprocessing, models with better discrimination could be established with the machine learning algorithms (KNN, SVM) and the classical chemometric method (PLS-DA). For discriminating the origin, the best model was obtained when the NIR spectra were processed by SNV or MSC and combined with the KNN algorithm; its precision, training-set accuracy, test-set accuracy, F1 score, and recall were 0.9833, 1.0000, 0.9853, 0.9853, and 0.9841, respectively. For discriminating the processing technology, the best model was obtained when the NIR spectra were preprocessed by first-order derivative and combined with KNN; its precision, training-set accuracy, test-set accuracy, F1 score, and recall were 0.9753, 1.0000, 0.9706, 0.9701, and 0.9556, respectively. Lastly, for the identification of both the origin and the processing technology of Moyao (Myrrh), the KNN algorithm performed better, indicating that although the number of collected data points was small, the dimensionality was lower compared with the number of samples.
However, the results obtained with the different algorithms suggested that the dimensionality was already sufficient at this stage, so there was no need for the SVM algorithm to map the low-dimensional data into a high-dimensional space to achieve full recognition. In addition, the KNN algorithm outperformed the classical chemometric method (PLS-DA), suggesting that nonlinear relationships are present in the spectra of Moyao (Myrrh).
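The reasoning that a distance-based method outperforming a linear method points to nonlinear class structure can be illustrated with a toy example. The sketch below is an assumption-laden illustration only: the data are synthetic two-dimensional points (a disc surrounded by a ring), not NIR spectra, and a nearest-centroid rule is used as a simple stand-in for a linear classifier; it is not PLS-DA. All function names are hypothetical.

```python
import math
import random
from collections import Counter


def make_rings(n_per_class, rng):
    """Made-up data: class 'inner' is a disc, class 'outer' is a ring around it."""
    points, labels = [], []
    for label, (r_min, r_max) in {"inner": (0.0, 1.0), "outer": (3.0, 4.0)}.items():
        for _ in range(n_per_class):
            r, angle = rng.uniform(r_min, r_max), rng.uniform(0.0, 2.0 * math.pi)
            points.append((r * math.cos(angle), r * math.sin(angle)))
            labels.append(label)
    return points, labels


def knn_predict(train_x, train_y, x, k=3):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    nearest = sorted(zip(train_x, train_y), key=lambda pair: math.dist(pair[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def nearest_centroid_predict(train_x, train_y, x):
    """Linear stand-in: assign x to the class with the closest mean.

    For two classes this gives a straight-line decision boundary.
    """
    centroids = {}
    for label in sorted(set(train_y)):
        members = [p for p, y in zip(train_x, train_y) if y == label]
        centroids[label] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return min(centroids, key=lambda label: math.dist(centroids[label], x))


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def compare(seed=0):
    """Return test accuracy of KNN and of the nearest-centroid rule on the ring data."""
    rng = random.Random(seed)
    train_x, train_y = make_rings(60, rng)
    test_x, test_y = make_rings(30, rng)
    knn_pred = [knn_predict(train_x, train_y, x) for x in test_x]
    lin_pred = [nearest_centroid_predict(train_x, train_y, x) for x in test_x]
    return {"knn": accuracy(test_y, knn_pred),
            "nearest_centroid": accuracy(test_y, lin_pred)}


print(compare())
```

Because no straight line separates a disc from a ring that encloses it, the linear stand-in is expected to do poorly here while KNN follows the curved boundary. A gap between the two methods on real spectra is consistent with, but does not by itself prove, nonlinear structure.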
In conclusion, the combination of NIR spectroscopy and chemometrics provides a widely applicable alternative method for accurately distinguishing Moyao (Myrrh) of different origins and processing techniques, and it provides a basis for subsequent quality evaluation of Moyao (Myrrh). The method also helps safeguard clinical medication. In future work, the sample size will be expanded to include more Moyao (Myrrh) samples from different regions and processing techniques in order to validate the applicability and stability of the model. In addition, to further evaluate its effectiveness and feasibility in practice, the established model will be applied in real-world settings, providing a reference for the quality evaluation of Traditional Chinese Medicine.