PLOS One. 2023 Feb 28;18(2):e0282429. doi: 10.1371/journal.pone.0282429

Origin identification of Cornus officinalis based on PCA-SVM combined model

Yueqiang Jin 1,*, Bing Liu 1, Chaoning Li 2, Shasha Shi 3
Editor: Naji Arafat Mahat 4
PMCID: PMC9974136  PMID: 36854014

Abstract

Infrared spectroscopy can quickly and non-destructively extract analytical information from samples. It can be applied to the authenticity identification of various Chinese herbal medicines, the prediction of the mixing amount of defective products, and the analysis of the origin. In this paper, the spectral information of Cornus officinalis from 11 origins was used as the research object, and the origin identification model of Cornus officinalis based on mid-infrared spectroscopy was established. First, principal component analysis was used to extract the absorbance data of Cornus officinalis in the wavenumber range of 551~3998 cm–1. The extracted principal components contain more than 99.8% of the information of the original data. Second, the extracted principal component information was used as input, and the origin category was used as output, and the origin identification model was trained with the help of support vector machine. In this paper, this combined model is called PCA-SVM combined model. Finally, the generalization ability of the PCA-SVM model is evaluated through an external test set. The three indicators of Accuracy, F1-Score, and Kappa coefficient are used to compare this model with other commonly used classification models such as naive Bayes model, decision trees, linear discriminant analysis, radial basis function neural network and partial least square discriminant analysis. The results show that PCA-SVM model is superior to other commonly used models in accuracy, F1 score and Kappa coefficient. In addition, compared with the SVM model with full spectrum data, the PCA-SVM model not only reduces the redundant variables in the model, but also has higher accuracy. Using this model to identify the origin of Cornus officinalis, the accuracy rate is 84.8%.

1. Introduction

China has superior climatic and geographical conditions for cultivating herbal medicines. Many of the herbal medicines it cultivates are well known at home and abroad and are exported to more than 100 countries and regions [1, 2]. Compared with synthetic drugs, Chinese herbal medicines have many unique advantages such as natural raw materials, stable effects and fewer toxic side effects [3]. Because of these advantages, Chinese herbal medicine is receiving more and more attention from countries and regions [4–6]. However, with the gradual deterioration of the natural ecological environment, the growth environment of wild Chinese herbs has been destroyed, and some wild Chinese herbs are in short supply, leading to confusion in the Chinese herbal medicine sales market. In addition, the wide variety of Chinese herbal medicines and the different habits of use in different regions make mix-ups of Chinese herbal medicines common and their identification difficult [7, 8].

The commonly used methods for the identification of herbal medicines include character identification, microscopic identification and physicochemical identification [9, 10]. Character identification is mainly achieved by external characteristics such as appearance, color and odor of herbs, and if necessary, water and fire tests can also be performed. This identification method is easy to operate and can achieve rapid identification, but it requires extensive experience of workers engaged in Chinese medicine identification, and the accuracy of identification of herbs with close relatives and high biomorphological similarity needs to be improved [11, 12]. Microscopic identification is to observe the microscopic structure of herbal medicines through microscope. Each herb has its own special structure, and the microscopic structure of herbs can be observed through the microscope to identify the authenticity of herbs. This method is more commonly used when the shape of the herb is not easily identified, or when the herb is broken or in powder form. However, microscopic identification also has some shortcomings, such as the microscopic characteristics of some herbs are not easy to search, and the identification characteristics of some herbs do not conform to the pharmacopoeia. Physicochemical identification refers to the method of identifying the authenticity, purity and degree of quality merit of herbal medicines by using certain physical, chemical or instrumental analysis methods. Determination of physical constants, determination of swelling, colorimetric examination, foam index, chemical qualitative analysis, and chemical quantitative analysis are all common means of physical and chemical identification. With the development of chromatographic coupling technology, the use of modern chromatographic techniques for the examination of Chinese herbal medicines has also rapidly spread. 
Commonly used chromatographic techniques include thin layer chromatography [13, 14], gas chromatography [15, 16], high performance liquid chromatography [17–20] and capillary electrophoresis [21–23]. They can achieve very high accuracy in herb identification, but often require complex pretreatment and long analysis time, high analytical cost, and difficulty in non-destructive and rapid identification [24–27].

Infrared spectroscopy has been widely used in the structural analysis of organic compounds [28, 29]. Under infrared irradiation, the molecules of the substance under test only absorb infrared spectra that are consistent with their molecular vibration and rotation frequencies. Therefore, the infrared spectroscopy can be used for the qualitative analysis of the measured substance. As there are many groups in the compound molecule, each group will produce characteristic vibration after excitation, and its vibration frequency will be reflected in the infrared absorption spectrum. Therefore, it can be quantitatively analyzed according to the absorption vibration frequency of various groups in the compound. Chinese herbal medicine is a mixture system composed of many chemical substances [30]. As long as the chemical components contained in the complex system are the same, and the relative proportion between the components is certain, the infrared spectrum obtained is the superposition of the spectra of all the compounds in the system. This superimposed spectrum can be reproducible in a stable manner, just like the spectrum of a single compound. If there is any change in the composition or content of the sample, there will be obvious differences in the spectrum, which provides an objective and reliable basis for the identification and evaluation of the authenticity of the sample.

Infrared spectroscopy has the advantages of fast analysis, low cost, non-destructive and simple pre-treatment. In recent years, it has been widely used in the field of quality control of herbal medicines [31]. Li, W., et al. proposed a discriminant analysis technique for near-infrared spectral classification using wavelet transform and influence matrix analysis methods [32]. This discriminant analysis technique was found to achieve good classification results by testing on the near infrared spectroscopy dataset of 265 salviae miltiorrhizae radix samples from 9 different geographical origins. Lu, L., et al. used Fourier transform infrared spectra combined with pattern recognition technology for geographic identification of wild Gentiana rigescens [33]. The comparison result showed that the Partial Least Squares Discriminant Analysis (PLS-DA) method is more suitable for geographic origin classification of wild Gentiana rigescens than Principal Component Analysis (PCA).

Cornus officinalis is a commonly used herbal medicine mainly distributed in central and southern Europe, East Asia and eastern North America [34]. The dried and ripe flesh of Cornus officinalis has the ability to nourish the liver and kidneys and quench thirst with internal heat. It is a traditional medicine commonly used in Chinese medicine to treat diabetes. In this study, infrared spectral data of Cornus officinalis from a total of 11 origins in Shanxi, Jiangsu, Zhejiang, Anhui, Jiangxi, Shandong, Henan, Hunan, Sichuan, Shaanxi and Gansu (OP 1~OP 11) in China were collected. With the infrared spectral data of these samples, principal component analysis and support vector machine were used to develop a model for identifying the origin of Cornus officinalis. This origin identification model is called PCA-SVM combined model in this paper. Compared with other commonly used methods such as the naive Bayesian model, decision tree, LDA, Radial Basis Function (RBF) neural network and PLS-DA, PCA-SVM performs well on some common evaluation indicators [35, 36]. The model can not only provide a convenient and accurate method for the rapid identification of the origin of Cornus officinalis, but also has some reference significance for the identification of other herbal medicines.

2. Material and methods

2.1. Data source and preprocessing

Mid-infrared spectroscopy combined with chemometrics can be used for origin identification of Chinese Herbal Medicines. A set of Cornus officinalis spectral data measured by Chengdu University of Traditional Chinese Medicine is used to establish a classification and identification model of Cornus officinalis. In this experiment, samples of ripe fruit pulp from 11 origins of Cornus officinalis were dried at 40°C to constant weight, crushed by a micro plant pulverizer, and passed through a 200-mesh sieve for use. A mass of 2 mg of Cornus officinalis sample powder was uniformly mixed with dried KBr crystals at a mass ratio of about 1:60 and ground in an agate mortar. The thoroughly ground mixture was compressed into flakes with a tablet machine, and immediately placed in a Nicolet iS50 FT-IR Spectrometer for measurement. The scanning range was 4000~400 cm–1, the spectral resolution was 4 cm–1, and the number of scans was 64. The interference of water and carbon dioxide was excluded before the background and sample spectra were collected. The prepared sample slices were placed in the spectrometer for measurement and data collection. Because the environment and operation of the experimental instrument introduced noise into the original spectra, only the spectral data between 3998 and 551 cm–1 were retained.

The reproducibility of the test is evaluated by taking 5 consecutive measurements of a given sample and calculating the Relative Standard Deviation (RSD) of their maximum common peak wave number. Eq (1) is the expression of RSD, where S is the standard deviation and x̄ is the corresponding mean value. The RSD of the reproducibility test is determined to be less than 0.2%, indicating good reproducibility of the test. The repeatability of the test is evaluated by measuring the same sample once by 5 different experimenters, and calculating the RSD of the maximum common peak wave number of 5 measurements. The RSD of the repeatability test is determined to be less than 3%, indicating good repeatability of the test.

$$
\mathrm{RSD} = \frac{S}{\bar{x}} \times 100\% \tag{1}
$$

Data preprocessing is the first step in data modeling. We consider data that are less than 1/3 times the arithmetic mean of the nearest neighboring values on the left and right or greater than 3 times the arithmetic mean of the nearest neighboring values on the left and right as outliers. Outliers and missing values are interpolated by means of mean interpolation in this paper. After dealing with missing values and outliers, the absorbance data of 3448 corresponding bands under spectral illumination were analyzed and summarized. The range of absorbance after summary is -0.00675~1.48696 AU. There are some negative values of absorbance in the last 184 bands, and there are 626 groups of data with negative values in total. This is because the absorbance used in the data in this paper is the value corrected by the instrument. Since the absolute values of these negative values are small, and the total amount of negative values is less than 0.001% of the total amount of data, we keep them. The absorbance of some data is greater than 1 AU, but the maximum value of absorbance does not exceed 1.5 AU. We think that they do not deviate from the Lambert-Beer law, so no special treatment is required.
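
The outlier rule and the mean interpolation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "nearest neighboring values" are taken to be the two adjacent bands of the same spectrum, and "mean interpolation" is taken to mean replacing a value with the mean of its nearest valid neighbours. The function names and the toy spectrum are hypothetical and do not come from the original study.

```python
def flag_outliers(values, factor=3.0):
    """Indices of interior points outside [m / factor, m * factor], where m is
    the mean of the two adjacent values (assumed reading of the rule).
    Points that are missing or sit next to a missing value are skipped."""
    flagged = []
    for i in range(1, len(values) - 1):
        left, mid, right = values[i - 1], values[i], values[i + 1]
        if left is None or mid is None or right is None:
            continue
        m = (left + right) / 2
        low, high = sorted((m / factor, m * factor))  # sorted() copes with negative m
        if mid < low or mid > high:
            flagged.append(i)
    return flagged


def mean_interpolate(values, bad):
    """Replace every index in `bad` with the mean of its nearest valid neighbours."""
    bad = set(bad)
    cleaned = list(values)
    for i in sorted(bad):
        left = next((values[j] for j in range(i - 1, -1, -1) if j not in bad), None)
        right = next((values[j] for j in range(i + 1, len(values)) if j not in bad), None)
        neighbours = [v for v in (left, right) if v is not None]
        cleaned[i] = sum(neighbours) / len(neighbours)
    return cleaned


# Toy absorbance values (hypothetical): one spike and one missing value.
spectrum = [0.50, 0.52, 2.00, 0.55, None, 0.57]
missing = [i for i, v in enumerate(spectrum) if v is None]
outliers = flag_outliers(spectrum)
cleaned = mean_interpolate(spectrum, missing + outliers)
print(outliers, cleaned)
```

Note that, read literally, the rule compares a point with neighbours that may themselves be outliers, so a very large spike can also cause an adjacent point to be flagged; a production pipeline would need to decide how to handle that case.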

2.2. Principle of origin identification model

Principal component analysis was first introduced by Karl Pearson for non-random variables, and then Harold Hotelling extended this method to the case of random vectors [37]. Principal component analysis is performed with minimal loss of data information. It uses the method of mathematical transformation to convert the given multiple index factors into a few principal components, and then replaces the original multi-dimensional related variables with a few principal component factors [38]. Suppose the research on a certain problem involves p indicators, which are represented by X1, X2, ⋯, Xp respectively, and the p-dimensional random vector composed of these p indicators is X = (X1, X2, ⋯, Xp). Let the mean of the random vector X be μ and the covariance matrix be Σ. Linear transformation of X can form a new comprehensive variable, which is represented by Y, that is, the new variable can be linearly represented by the original variable (Eq (2)).

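$$
\begin{aligned}
Y_1 &= u_{11}X_1 + u_{21}X_2 + \cdots + u_{p1}X_p \\
Y_2 &= u_{12}X_1 + u_{22}X_2 + \cdots + u_{p2}X_p \\
&\;\;\vdots \\
Y_p &= u_{1p}X_1 + u_{2p}X_2 + \cdots + u_{pp}X_p
\end{aligned}
\tag{2}
$$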

The above linear transformation of the original variables can be chosen arbitrarily, and the statistical characteristics of the comprehensive variable Y obtained by different linear transformations differ. To achieve better results, we want the variance of Yi = ui'X to be as large as possible and the Yi to be independent of each other. Since var(Yi) = var(ui'X) = ui'Σui, and for any constant c, var(cui'X) = c²ui'Σui, var(Yi) can be increased arbitrarily when there is no restriction on ui, and the problem becomes meaningless. We therefore constrain the linear transformations by the following principles: (i) ui'ui = 1 (i = 1, 2, ⋯, p); (ii) Yi and Yj are independent of each other (i ≠ j; i, j = 1, 2, ⋯, p); (iii) Y1 is the one with the largest variance among all linear combinations of X1, X2, ⋯, Xp that satisfy principle (i); Y2 is the one with the largest variance among all the linear combinations of X1, X2, ⋯, Xp that are not related to Y1; ⋯; and Yp is the one with the largest variance among all linear combinations of X1, X2, ⋯, Xp that are not related to Y1, Y2, ⋯, Yp-1. The comprehensive variables Y1, Y2, ⋯, Yp determined based on the above three principles are called the first, second, ⋯, p-th principal components of the original variables. In actual research work, only the first few principal components with the largest variance are usually selected, so as to simplify the system structure and grasp the essence of the problem.
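
As a brief worked step, principles (i)–(iii) can be written as a constrained maximization whose solution is an eigenvalue problem. The eigenvalue symbol λ is introduced here for illustration and is not part of the original notation; this is the standard textbook derivation, not a description of the software actually used in the study.

```latex
\max_{u_1}\; u_1'\Sigma u_1 \quad \text{s.t.} \quad u_1'u_1 = 1
\;\;\Longrightarrow\;\;
\Sigma u_1 = \lambda_1 u_1, \qquad \operatorname{var}(Y_1) = u_1'\Sigma u_1 = \lambda_1 .

\text{In general:}\quad \Sigma u_i = \lambda_i u_i,\quad
\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0,\quad
\operatorname{var}(Y_i) = \lambda_i,\quad
\text{contribution rate of } Y_i = \frac{\lambda_i}{\sum_{j=1}^{p}\lambda_j}.
```

Under this reading, the eigenvalues and contribution rates reported for a principal component analysis are the λi and their shares of the total variance.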

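$$
\max_{w,b}\; \frac{2}{\|w\|} \quad \text{s.t.} \quad y_i\left(w^{T}x_i + b\right) \ge 1, \quad i = 1, 2, \cdots, m \tag{3}
$$

$$
\min_{w,b}\; \frac{1}{2}\|w\|^{2} \quad \text{s.t.} \quad 1 - y_i\left(w^{T}x_i + b\right) \le 0, \quad i = 1, 2, \cdots, m \tag{4}
$$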

The support vector machine proposed by Cortes and Vapnik is an algorithm that finds a classification plane or hyperplane separating the different types of data in a dataset as well as possible [39]. Fig 1 shows its architecture. SVM is grounded in the theory of structural risk minimization and remains robust on nonlinear and higher-dimensional datasets, so it is widely used for classification [40]. The essence of SVM is to obtain the optimal parameters w and b that determine an optimal hyperplane, so that as much data as possible falls on the correct side of this plane. Assuming that the training set is (xi, yi), i = 1, 2, ⋯, m, xi ∈ Rⁿ, yi ∈ {±1}, the linear equation wᵀx + b = 0 is used to divide it, where w = (w1; w2; ⋯; wn) is the normal vector and b is the bias term. For the hyperplane to have maximum margin (Eq (3)), it is only necessary to maximize ‖w‖⁻¹, which is equivalent to minimizing ‖w‖². Therefore, the problem of constructing the optimal hyperplane is transformed into Eq (4).

Fig 1. Support vector machine architecture.


Eqs (5)–(6) are obtained by introducing Lagrange multipliers, where α = (α1; α2; ⋯; αm). Eqs (7)–(8) are obtained by setting the partial derivatives of L(w, b, α) with respect to w and b to zero. Substituting Eq (7) into Eq (6) eliminates w and b from L(w, b, α); together with the constraint in Eq (8), this gives the dual problem of Eq (4) (Eq (9)).

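$$
\min_{w,b}\,\max_{\alpha}\; L(w, b, \alpha) \tag{5}
$$

$$
L(w, b, \alpha) = \frac{1}{2}\|w\|^{2} + \sum_{i=1}^{m} \alpha_i \left(1 - y_i\left(w^{T}x_i + b\right)\right) \tag{6}
$$

$$
w = \sum_{i=1}^{m} \alpha_i y_i x_i \tag{7}
$$

$$
0 = \sum_{i=1}^{m} \alpha_i y_i \tag{8}
$$

$$
\max_{\alpha}\; \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j x_i^{T} x_j \quad \text{s.t.} \quad \sum_{i=1}^{m} \alpha_i y_i = 0, \quad \alpha_i \ge 0, \quad i = 1, 2, \cdots, m \tag{9}
$$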
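
The substitution that leads from the Lagrangian to the dual objective is short enough to write out. The following is the standard textbook step, shown here only to make the elimination of w and b explicit; it uses the same symbols as Eqs (6)–(9).

```latex
\frac{1}{2}\|w\|^{2}
  = \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_j y_i y_j x_i^{T}x_j
  \qquad \text{(using } w = \textstyle\sum_{i}\alpha_i y_i x_i\text{)}

\sum_{i=1}^{m}\alpha_i\left(1 - y_i\left(w^{T}x_i + b\right)\right)
  = \sum_{i=1}^{m}\alpha_i
    - \sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_j y_i y_j x_i^{T}x_j
    - b\sum_{i=1}^{m}\alpha_i y_i

L = \sum_{i=1}^{m}\alpha_i
    - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_j y_i y_j x_i^{T}x_j
  \qquad \text{(the last term vanishes because } \textstyle\sum_{i}\alpha_i y_i = 0\text{)}
```

The result depends on the data only through the inner products xiᵀxj, which is what allows a kernel function to replace the inner product in nonlinear SVMs.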

3. Establishment of origin identification model

3.1. Data exploratory analysis

Exploratory analysis can give us a preliminary understanding of the spectral data of the sample. Fig 2 shows the mid-infrared spectra of 658 samples. It can be seen from the spectrum comparison that the mid-infrared spectrum of Cornus officinalis has obvious similarity, especially in the range of 1700~2500 cm–1. In the two band ranges of 1000~1700 cm–1 and 3000~3400 cm–1, there are mainly strong spectral peaks, and the peaks change drastically. This spectral region contains more chemical information. Five strong peaks appeared near the 1070, 1400, 1700, 2950 and 3300 cm–1 bands, with average absorbances of 0.734, 0.508, 0.781, 0.455 and 0.827 AU. The spectrum fluctuates greatly in the 1600~1700 cm–1 and 3250~3350 cm–1 bands, indicating that there are certain differences in the mid-infrared spectra of Cornus officinalis from different origins. The difference in absorbance of different origins can be used to identify the origin of Cornus officinalis [19, 22].

Fig 2. Mid-infrared spectra of 658 Cornus officinalis samples from 11 different places of origin.


Figs are generated using Matlab (Version R2021b, https://www.mathworks.com/) [Software].

In order to further compare the differences in mid-infrared spectra from different origins, we grouped the 658 samples by origin and averaged the absorbance in each wavenumber band. Fig 3 shows the average absorbance of Cornus officinalis from 11 origins in different wavenumber bands. It can be seen that the spectral averages of Cornus officinalis from the 11 origins are very similar, and the bands where the spectral peaks appear are basically the same. The C−O stretching vibration absorption peak is around 1000~1250 cm–1, the aromatic ring skeleton vibration absorption peak is around 1400~1600 cm–1, the carbonyl C=O stretching vibration absorption peak is around 1700 cm–1, the methylene C−H antisymmetric stretching vibration absorption peak is around 2950 cm–1, and the O−H stretching vibration absorption peak is around 3300 cm–1. In some wavenumber bands, the absorbance differs between origins, which is due to the different climate and geographical conditions of different origins.

Fig 3. The average mid-infrared spectra of Cornus officinalis samples by different places of origin.


3.2. Mid-infrared spectral feature extraction

Mid-infrared spectroscopy was used in this study to identify the origin of Cornus officinalis. Our collection of mid-infrared spectral data contains 3448 bands, and the absorbance data for each band are highly correlated. If all 3448 variables are introduced into the Cornus officinalis origin identification model, it will not only make the model training time very long, but also the introduction of highly correlated variables into the model will lead to poor stability and generalization ability. Therefore, it is necessary to use mathematical methods to extract features from the data. Principal component analysis is a common unsupervised analysis technique that is often used to extract features from complex data [32, 41].

Table 1 shows the results of principal component analysis of the mid-infrared spectral data of Cornus officinalis. It can be seen that the first principal component contains 80.8% of the information of the original data, and the first three principal components contain more than 95% of the information of the original data. Common methods for the selection of the number of principal components include the cumulative contribution rate criterion and the Kaiser criterion based on eigenvalues greater than 1 [42]. In order to introduce more variables into the model so that the model can be fully trained, this paper adopts the Kaiser criterion to select the number of principal components. The eigenvalues corresponding to the first 14 principal components are all greater than 1, and their cumulative contribution rate exceeds 99.8%, capturing most of the information of the original variables. According to the Kaiser criterion, the first 14 principal components are selected to replace the original spectral data to establish the origin identification model of Cornus officinalis.

Table 1. Principal component eigenvalues, contribution rate and cumulative contribution rate of the mid-infrared spectral data of Cornus officinalis.

| Serial number | Principal component | Eigenvalue | Contribution rate (fraction) | Cumulative contribution rate (fraction) |
|---|---|---|---|---|
| 1 | 1st principal component | 2785 | 0.808 | 0.808 |
| 2 | 2nd principal component | 314.31 | 0.091 | 0.899 |
| 3 | 3rd principal component | 174.91 | 0.051 | 0.950 |
| 4 | 4th principal component | 77.257 | 0.022 | 0.972 |
| 5 | 5th principal component | 29.467 | 0.009 | 0.981 |
| 6 | 6th principal component | 18.32 | 0.005 | 0.986 |
| 7 | 7th principal component | 15.433 | 0.004 | 0.990 |
| 8 | 8th principal component | 9.94 | 0.003 | 0.993 |
| 9 | 9th principal component | 7.689 | 0.002 | 0.995 |
| 10 | 10th principal component | 2.614 | 0.001 | 0.996 |
| 11 | 11th principal component | 2.174 | 0.001 | 0.997 |
| 12 | 12th principal component | 1.761 | 0.001 | 0.997 |
| 13 | 13th principal component | 1.446 | 0.000 | 0.998 |
| 14 | 14th principal component | 1.113 | 0.000 | 0.998 |
| 15 | 15th principal component | 0.928 | 0.000 | 0.998 |
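
The Kaiser selection can be checked directly from the eigenvalues in Table 1. The sketch below is illustrative only and rests on one assumption that the paper does not state explicitly: that the 3448 band variables were standardized before the analysis, so that all eigenvalues sum to 3448. Under that assumption the tabulated contribution rates are reproduced (for example 2785 / 3448 ≈ 0.808).

```python
# Eigenvalues of the first 15 principal components, copied from Table 1.
eigenvalues = [2785, 314.31, 174.91, 77.257, 29.467, 18.32, 15.433, 9.94,
               7.689, 2.614, 2.174, 1.761, 1.446, 1.113, 0.928]

# Assumption (not stated in the paper): standardized variables, so the
# eigenvalues of all 3448 components sum to the number of bands.
TOTAL_VARIANCE = 3448

# Contribution rate of each component and the running (cumulative) total.
contribution = [ev / TOTAL_VARIANCE for ev in eigenvalues]
cumulative = []
running = 0.0
for c in contribution:
    running += c
    cumulative.append(running)

# Kaiser criterion: keep the components whose eigenvalue exceeds 1.
n_components = sum(1 for ev in eigenvalues if ev > 1)

print(n_components)                              # number of retained components
print(round(contribution[0], 3))                 # contribution rate of PC1
print(round(cumulative[n_components - 1], 3))    # cumulative rate of retained PCs
```

With the tabulated values this retains 14 components, in line with the selection described above.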

3.3. PCA-SVM combined model construction

The 14 principal components after feature extraction are used to establish the origin identification model of Cornus officinalis. Since SVM can usually get better results than other algorithms such as naive Bayes, decision trees and linear discriminant analysis on a small sample training set and has better robustness, it is used to establish a mid-infrared spectroscopy-based identification model for the origin of Cornus officinalis [15, 43, 44].

The 658 samples are shuffled with the Matlab random permutation function and divided into two parts in a ratio of approximately 3:1. The first part contains 500 samples for training the model and the second part contains 158 samples for testing the model. Since the sample data come from 11 different origins, and the number of samples from each origin differs, it is necessary to check the balance of the samples. Fig 4 shows the number of samples in the training set, the test set, and the full set for each origin category. In the full set, OP 4 has the most samples (88, or 13.4% of the 658 samples) and OP 5 has the fewest (31, or 4.7%). In the training set, OP 6 has the most samples (72, or 14.4% of the 500 training samples) and OP 5 has the fewest (20, or 4%). In the test set, OP 4 has the most samples (24, or 15.2% of the 158 test samples) and OP 9 has the fewest (7, or 4.4%). In every set, the ratio of the largest to the smallest class size does not exceed 4:1, so it is considered that there is no sample imbalance in this study.

Fig 4. The number of samples in the training set, test set and all sets of Cornus officinalis from different origins.
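
The shuffle-and-split step and the balance check can be illustrated as follows. This is a hedged Python sketch rather than the Matlab procedure actually used: the seed, the function names and the use of Python's `random` module are assumptions, while the class counts are the largest and smallest per-origin counts quoted in the text.

```python
import random


def shuffle_split(n_samples, n_train, seed=0):
    """Shuffle the indices 0..n_samples-1 and split them into train and test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seed is arbitrary, for repeatability only
    return indices[:n_train], indices[n_train:]


def imbalance_ratio(largest, smallest):
    """Ratio of the largest to the smallest class size."""
    return largest / smallest


train_idx, test_idx = shuffle_split(658, 500)

# Largest and smallest per-origin counts quoted in the text for each set.
ratios = {
    "all": imbalance_ratio(88, 31),
    "train": imbalance_ratio(72, 20),
    "test": imbalance_ratio(24, 7),
}
balanced = all(r <= 4 for r in ratios.values())
print(len(train_idx), len(test_idx), ratios, balanced)
```

All three ratios stay below the 4:1 threshold used in the text.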


A model between the origin identification of Cornus officinalis and its discriminant index was established, and the training samples are grouped by K-fold Cross-Validation (KCV) method. K-fold cross-validation is a statistical analysis method used to verify the performance of a classifier [45]. Its basic idea is to group the original data, one part as the training set and the other part as the validation set. First train the classifier with the training set, and then use the validation set to test the trained model, which is used as the performance indicator for evaluating the classifier. KCV divides the original data into K groups, extracts a subset without repetition as a validation set, and combines the remaining K-1 sets of subset data as a training set, as shown in Fig 5. In this paper, the 10-fold cross-validation method is selected.

Fig 5. 10-fold cross-validation process description and implementation.


The samples are grouped by the K-fold cross-validation method, and then cross-trained by SVM to construct the SVM-based identification model of Cornus officinalis. Its specific process is shown in Fig 6. (i) Normalization of extracted principal components. (ii) The sample data is grouped for training by the K-fold cross-validation method, and K = 10 is selected. (iii) Take each subset (50 samples) data as a validation set, and the remaining 9 sets of subset (450 samples) data as a training set, so that 10 training model data will be obtained and brought into the SVM model for training. (iv) When the average accuracy rate of the model is greater than or equal to 80%, it is determined that the model can identify the origin of Cornus officinalis, and the optimal result of the training model is determined as the classification model. If the average accuracy rate is less than 80%, the samples will be re-sorted randomly, and return to step (ii) to perform K-fold cross-validation. (v) A K-fold cross-validation-based SVM identification model for the origin of Cornus officinalis is obtained. (vi) Input the test sample and get the classification result.

Fig 6. Flow chart of realization of SVM origin identification model of Cornus officinalis based on K-fold cross-validation.
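
Steps (ii)–(iv) of the procedure can be sketched in plain Python. This is a minimal illustration under loud assumptions: the classifier is a nearest-centroid placeholder, not an SVM, and the data are synthetic two-class points generated only so that the loop can run. All function names are hypothetical. What the sketch does reproduce is the fold structure (10 folds of 50 out of 500 samples) and the 80% acceptance threshold on the mean validation accuracy.

```python
import random


def k_fold_indices(n_samples, k):
    """Yield (train, validation) index lists; each sample is validated exactly once."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


def fit_centroids(X, y):
    """Placeholder classifier (NOT an SVM): one mean vector per class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for d, v in enumerate(row):
            acc[d] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def predict_centroid(model, row):
    """Return the label of the closest class centroid (squared Euclidean distance)."""
    return min(model, key=lambda lab: sum((a - b) ** 2 for a, b in zip(model[lab], row)))


def cross_validate(X, y, k=10):
    """Mean validation accuracy over the k folds."""
    accuracies = []
    for train, validation in k_fold_indices(len(X), k):
        model = fit_centroids([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict_centroid(model, X[i]) == y[i] for i in validation)
        accuracies.append(hits / len(validation))
    return sum(accuracies) / len(accuracies)


# Synthetic, well-separated two-class data (hypothetical), shuffled before splitting.
rng = random.Random(0)
X = [[rng.gauss(c * 5.0, 1.0), rng.gauss(c * 5.0, 1.0)] for c in (0, 1) for _ in range(250)]
y = [c for c in (0, 1) for _ in range(250)]
order = list(range(500))
rng.shuffle(order)
X, y = [X[i] for i in order], [y[i] for i in order]

mean_accuracy = cross_validate(X, y, k=10)
accepted = mean_accuracy >= 0.80  # threshold from step (iv)
print(round(mean_accuracy, 3), accepted)
```

In the workflow described above, a mean accuracy below the threshold would trigger a new random ordering of the samples and another round of cross-validation.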


The principal components extracted from the spectral data of Cornus officinalis are used as input variables, OP is used as the output variable, and the 500 training samples are used to train the SVM with K-fold cross-validation. For the kernel-based SVM, this study compares the linear, quadratic, cubic and Gaussian kernels. Eqs (10)–(12) are their expressions, where x is the vector drawn from the input space, xi is the support vector, γ is the coefficient of the kernel function, r is the constant term in the kernel function, and p is the degree of the polynomial kernel function (p = 2 for the quadratic kernel and p = 3 for the cubic kernel). The box constraint is left at its default value of 1. The smaller the box constraint, the larger the margin: more misclassified samples are allowed in training, there are more support vectors, and the generalization ability is stronger. The kernel scale mode is set to auto, which uses a heuristic procedure with subsampling to select the scale value. For the multiclass method, since there is no sample imbalance in this study, we choose one-vs-all, which is more efficient. The accuracy on the validation set is used to compare the different kernel functions. The support vector machine with the quadratic kernel function reaches an accuracy of 82.4% on the validation set, which is the highest among the given kernel functions and meets the accuracy requirement set in advance. With the help of the Matlab Classification Learner (Version R2021b, https://www.mathworks.com/), the PCA-SVM combined model for identifying the origin of Cornus officinalis based on mid-infrared spectroscopy is established. For the SVM model with full spectrum data, the same settings were used, resulting in a maximum accuracy of 57.1% on the validation set, which did not meet the preset accuracy requirement. This is mainly due to multicollinearity in the full spectrum data: the information input to the model overlaps, and the model is very unstable.

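$$
K(x, x_i) = x^{T}x_i \tag{10}
$$

$$
K(x, x_i) = \left(\gamma\, x^{T}x_i + r\right)^{p}, \quad \gamma > 0 \tag{11}
$$

$$
K(x, x_i) = \exp\left(-\gamma \|x - x_i\|^{2}\right), \quad \gamma > 0 \tag{12}
$$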
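
For concreteness, the three kernel expressions can be written as small Python functions. This is an illustrative sketch only: the default values γ = 1 and r = 1 and the two example vectors are assumptions chosen for readability, not the settings selected by the automatic kernel scale heuristic in the study.

```python
import math


def dot(x, z):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, z))


def linear_kernel(x, xi):
    """Eq (10): K(x, xi) = x^T xi."""
    return dot(x, xi)


def polynomial_kernel(x, xi, gamma=1.0, r=1.0, p=2):
    """Eq (11): K(x, xi) = (gamma * x^T xi + r)^p; p=2 quadratic, p=3 cubic."""
    return (gamma * dot(x, xi) + r) ** p


def gaussian_kernel(x, xi, gamma=1.0):
    """Eq (12): K(x, xi) = exp(-gamma * ||x - xi||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, xi))
    return math.exp(-gamma * squared_distance)


# Two hypothetical 2-D vectors with inner product 4.
x, xi = [1.0, 2.0], [3.0, 0.5]
print(linear_kernel(x, xi))            # 4.0
print(polynomial_kernel(x, xi, p=2))   # (4 + 1)^2 = 25.0
print(polynomial_kernel(x, xi, p=3))   # (4 + 1)^3 = 125.0
print(gaussian_kernel(x, x))           # identical vectors give 1.0
```

The Gaussian kernel always lies in (0, 1] and decays with distance, whereas the polynomial kernels grow with the inner product, which is one reason they can rank differently on the same validation set.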

In order to verify whether the model has good generalization ability, the 158 test samples of Cornus officinalis of known origin are input into the model for prediction, and the predicted results are compared with the known results. Fig 7 shows the confusion matrix for the test samples. Each row of the confusion matrix represents a predicted category, and the row total is the number of samples predicted as that category; each column represents a true category, and the column total is the number of samples that actually belong to that category. The Precision of each category is shown on the left side of the confusion matrix, the Recall (also called Sensitivity) of each category is shown on the lower side of the confusion matrix, and the Accuracy of the model is shown in the lower right corner of the confusion matrix. Eqs (13)–(15) are the equations of Precision, Recall and Accuracy, where TP is the number of samples whose true value is positive and which the model judges as positive; FN is the number of samples whose true value is positive and which the model judges as negative; FP is the number of samples whose true value is negative and which the model judges as positive; and TN is the number of samples whose true value is negative and which the model judges as negative [40, 46].

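$$
\mathrm{Precision}_i = \frac{TP_i}{TP_i + FP_i} \tag{13}
$$

$$
\mathrm{Recall}_i = \frac{TP_i}{TP_i + FN_i} \tag{14}
$$

$$
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{15}
$$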
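
The three indicators can be computed from a confusion matrix laid out as in this paper, with rows as predicted categories and columns as true categories. The sketch below uses a small hypothetical 3-class matrix, not the matrix in Fig 7, and it takes the multi-class Accuracy to be the share of samples on the diagonal, which is an assumption about how Eq (15) is applied to 11 classes.

```python
def per_class_metrics(cm):
    """cm[i][j] = number of samples predicted as class i whose true class is j."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    precision, recall = [], []
    for i in range(n):
        tp = cm[i][i]
        predicted_i = sum(cm[i])                   # row sum: TP + FP
        true_i = sum(cm[r][i] for r in range(n))   # column sum: TP + FN
        precision.append(tp / predicted_i)
        recall.append(tp / true_i)
    accuracy = sum(cm[i][i] for i in range(n)) / total
    return precision, recall, accuracy


# Hypothetical 3-class confusion matrix (rows: predicted, columns: true).
toy = [
    [8, 1, 0],
    [2, 6, 1],
    [0, 1, 9],
]
precision, recall, accuracy = per_class_metrics(toy)
print(precision, recall, accuracy)

# Overall accuracy from the counts reported in the text: 134 correct out of 158.
print(round(134 / 158, 3))  # 0.848
```

Because rows are predictions here, Precision is read along a row and Recall down a column; many software libraries use the transposed layout, so the orientation should be checked before reusing such code.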

Fig 7. Confusion matrix for Cornus officinalis test samples.


Each row of the confusion matrix represents the predicted category, and each column represents the true category.

It can be seen from the confusion matrix that 17 Cornus officinalis samples are predicted by the model as OP 10, and they are all from OP 10. This category has the highest Precision at 100%. There are 18 Cornus officinalis samples predicted by the model as OP 8, and 12 of them are from OP 8. This category has the lowest Precision at 66.7%. Here 4 samples of OP 7 are incorrectly predicted to be OP 8 by the model. In terms of Sensitivity, all 7 samples of OP 9 are recognized by the model, and the Recall is the highest at 100%. Only 9 of the 16 samples of OP 7 are identified, and the Recall is the lowest at 56.3%; 4 of the missed samples are misjudged as OP 8. In general, Cornus officinalis from OP 7 and OP 8 are easily confused. This is because the climate and geographical conditions of the two places are similar, resulting in similar chemical composition of Cornus officinalis. Among the total 158 samples of Cornus officinalis, the origins of 134 are correctly predicted, and the Accuracy is 84.8%, which is similar to the accuracy on the validation set, indicating that the model has a strong generalization ability.

4. Discussion

The spectral characteristics of different Chinese Herbal Medicines are quite different. Even the same Chinese Herbal Medicines from different origins will show different spectral characteristics under the irradiation of near-infrared and mid-infrared spectra due to the differences in the chemical composition of inorganic elements and organic matter. Therefore, these characteristics can be used to identify the species and origin of Chinese Herbal Medicines. Based on mid-infrared spectral data, naive Bayes, decision trees, LDA, RBF and PLS-DA can all identify the origin of Cornus officinalis [47].

Bayesian methods are based on Bayesian principles and use knowledge of probability statistics to classify sample data sets. The Bayesian approach is characterized by combining prior and posterior probabilities, i.e., it avoids the subjective bias of using only prior probabilities and the overfitting phenomenon of using sample information alone. The naive Bayesian method is a corresponding simplification based on the Bayesian algorithm, that is, it is assumed that the attributes are conditionally independent of each other when the target value is given. Although this simplification reduces the classification effectiveness of Bayesian classification algorithm to some extent, it greatly simplifies the complexity of Bayesian methods in practical application scenarios.

The decision tree is a basic classification and regression method. The model has a tree-like structure and, in a classification problem, represents the process of classifying instances based on their features. It can be thought of as a set of if-then rules, or as a conditional probability distribution defined on the feature space and the class space. Its main advantages are the readability of the model and the speed of classification. During learning, a decision tree model is built from the training data by minimizing a loss function. During prediction, the decision tree model is used to classify new data.
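The if-then view can be made concrete with a single split (a decision stump). The sketch below is illustrative only: it assumes Gini impurity as the loss to be minimized, which the text does not specify, and the function names and toy data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure or empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(X, y):
    """Return (feature index, threshold) minimizing the weighted Gini impurity."""
    best, best_score = None, float("inf")
    n = len(y)
    for f in range(len(X[0])):
        values = sorted(set(row[f] for row in X))
        # Candidate thresholds lie midway between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best_score, best = score, (f, t)
    return best


# Hypothetical toy data: feature 0 separates the classes, feature 1 does not.
X = [[0.2, 1.0], [0.3, 3.0], [0.25, 2.0], [0.8, 1.5], [0.9, 2.5], [0.85, 0.5]]
y = ["A", "A", "A", "B", "B", "B"]
feature, threshold = best_split(X, y)


def stump_predict(x):
    """The learned if-then rule for the toy data."""
    return "A" if x[feature] <= threshold else "B"
```

A full tree repeats this search recursively on the left and right subsets until a stopping criterion is met.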

Linear discriminant analysis is a classic linear learning method, first proposed by Fisher in 1936 for the binary classification problem. The idea is to project a given set of training samples onto a straight line so that the projections of samples from the same class are as close together as possible and the projections of samples from different classes are as far apart as possible. A new sample is classified by projecting it onto the same line and determining its class from the location of the projected point.
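For the two-class case, the projection direction has the closed form w = S_w^{-1}(m_a - m_b), where m_a and m_b are the class means and S_w is the within-class scatter matrix. The sketch below assumes two features, so that the 2x2 inverse can be written out by hand, and uses the midpoint of the projected means as the decision threshold. The function names and toy data are hypothetical, and the 11-class problem in this paper requires the multi-class extension.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def scatter_2d(rows, mean):
    """2x2 scatter matrix: sum of outer products of deviations from the mean."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - mean[0], r[1] - mean[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fisher_direction(class_a, class_b):
    """Return (w, threshold) with w = Sw^{-1} (m_a - m_b) for 2-D data."""
    ma, mb = mean_vector(class_a), mean_vector(class_b)
    sa, sb = scatter_2d(class_a, ma), scatter_2d(class_b, mb)
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [ma[0] - mb[0], ma[1] - mb[1]]
    w = [dot(inv[0], diff), dot(inv[1], diff)]
    # Midpoint between the two projected class means.
    threshold = 0.5 * (dot(w, ma) + dot(w, mb))
    return w, threshold


# Hypothetical toy data (not the paper's spectra).
class_a = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.4]]
class_b = [[4.0, 4.5], [4.5, 4.0], [5.0, 4.8], [4.2, 5.1]]
w, threshold = fisher_direction(class_a, class_b)


def lda_predict(x):
    """Project x onto w and compare with the threshold."""
    return "a" if dot(w, x) > threshold else "b"
```

With this choice of w, the projected mean of class a always lies above that of class b, so a projection above the threshold is assigned to class a.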

Broomhead and Lowe first used radial basis functions for neural network design in 1988 [48]. The radial basis function (RBF) neural network is a commonly used three-layer feedforward network that can be used for both function approximation and pattern classification. Compared with other types of artificial neural networks, RBF networks have a physiological basis, a simple structure, fast learning speed, and excellent approximation performance and generalization ability.
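The three-layer structure (input, Gaussian hidden units, linear output) can be sketched as a forward pass. The centres, width and output weights below are hand-set, hypothetical values chosen for illustration; in practice they would be learned from the training data, and the text does not state how the RBF network in this study was configured.

```python
import math


def rbf_forward(x, centers, width, weights, biases):
    """Forward pass of a three-layer RBF network.

    Hidden layer: one Gaussian unit per centre.
    Output layer: one linear combination of hidden activations per class.
    """
    hidden = [
        math.exp(-sum((a - b) ** 2 for a, b in zip(x, c)) / (2 * width ** 2))
        for c in centers
    ]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(weights, biases)]


# Hypothetical parameters: two centres, two output classes.
centers = [[0.0, 0.0], [1.0, 1.0]]
width = 0.5
weights = [[1.0, 0.0],   # class 0 listens to the unit centred at (0, 0)
           [0.0, 1.0]]   # class 1 listens to the unit centred at (1, 1)
biases = [0.0, 0.0]


def predict_class(x):
    """Index of the output unit with the largest score."""
    scores = rbf_forward(x, centers, width, weights, biases)
    return max(range(len(scores)), key=scores.__getitem__)
```

Each hidden unit responds strongly only to inputs near its centre, so the output layer can remain linear.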

Partial least squares (PLS) regression is a statistical method related to principal component regression. Instead of finding hyperplanes of maximum variance between the response and independent variables, it finds a linear regression model by projecting both the independent variables and the response variables into a new space. Because both predictor and response variables are projected into the new space, the methods in the PLS family are called bilinear factor models [49, 50]. When the response variable is categorical, the method is called partial least squares discriminant analysis (PLS-DA).

Table 2 shows the Precision and Recall of each model for the identification of Cornus officinalis from different origins [43, 51, 52]. Note that the division of samples into training and test sets and the method of model validation are the same for every model as for the PCA-SVM combined model. Each model shows some ability to identify the origin of Cornus officinalis. The decision tree has the highest precision in OP 4; LDA has the highest recall in OP 5; RBF has the highest precision in OP 8 and OP 11, and the highest recall in OP 11; PLS-DA has the highest precision in OP 5 and OP 9, and the highest recall in OP 8 and OP 10 (tied with PCA-SVM); the PCA-SVM combined model presented in this paper has the highest precision in all origins except OP 4, OP 5, OP 8, OP 9 and OP 11, and the highest recall in all origins except OP 5 and OP 11.

Table 2. Precision and recall (sensitivity) of each Chinese herbal medicine origin identification model in 11 different origins, where PPV stands for precision and TPR stands for sensitivity (values are measured in %).

OP      Naive Bayes     Decision Trees  LDA             RBF             PLS-DA          PCA-SVM
        PPV     TPR     PPV     TPR     PPV     TPR     PPV     TPR     PPV     TPR     PPV     TPR
OP 1    77.8    77.8    83.3    55.6    83.3    83.3    81.3    72.2    76.5    72.2    89.5    94.4
OP 2    35.7    50.0    23.5    40.0    31.8    70.0    30.0    60.0    55.6    50.0    80.0    80.0
OP 3    80.0    85.7    61.5    57.1    85.7    85.7    73.3    78.6    90.9    71.4    92.9    92.9
OP 4    77.8    87.5    100     75.0    91.3    87.5    86.4    79.2    77.8    87.5    84.6    91.7
OP 5    85.7    66.7    50.0    66.7    80.0    88.9    77.8    77.8    100     66.7    85.7    66.7
OP 6    41.2    46.7    50.0    73.3    80.0    53.3    52.6    66.7    58.8    66.7    92.9    86.7
OP 7    60.0    37.5    35.7    31.3    30.0    18.8    0.0     0.0     75.0    56.2    81.8    56.3
OP 8    76.9    62.5    69.2    56.3    76.9    62.5    92.3    66.7    60.0    75.0    66.7    75.0
OP 9    55.6    71.4    66.7    57.1    87.5    100     44.4    72.7    100     71.4    77.8    100
OP 10   92.3    66.7    61.9    72.2    82.4    77.8    52.9    56.2    73.9    94.4    100     94.4
OP 11   46.7    63.6    40.0    36.4    50.0    63.6    77.8    100     72.7    72.7    76.9    90.9

To compare the ability of each model to identify the origin of Cornus officinalis comprehensively, we evaluated each model using three indicators: Accuracy, F1-Score and the Kappa coefficient. The F1-Score combines Precision and Recall. Its value ranges from 0 to 1, where 1 represents the best output of the model and 0 the worst. To compute the F1-Score, Precision and Recall are first averaged over all n categories (Eqs (16)–(17)) and then combined using Eq (18).

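$$\mathrm{Precision} = \frac{\sum_{i=1}^{n} \mathrm{Precision}_i}{n} \tag{16}$$

$$\mathrm{Recall} = \frac{\sum_{i=1}^{n} \mathrm{Recall}_i}{n} \tag{17}$$

$$F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{18}$$

$$\mathrm{Kappa} = \frac{p_0 - p_e}{1 - p_e} \tag{19}$$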

The Kappa coefficient is an indicator used for consistency checks and can also be used to measure classification performance. It is calculated from the confusion matrix and takes values between -1 and 1, usually greater than 0. Eq (19) gives the Kappa coefficient, where $p_0 = \sum_i p_{ii}$ is the observed concordance rate, $p_{ii} = a_{ii}/N$, $a_{ii}$ is the observed number of concordant samples in the i-th category, and $N$ is the total number of samples. $p_e = \sum_i p_{i\cdot}\, p_{\cdot i}$ is the expected concordance rate, that is, the concordance rate of the two results due to chance, where $p_{i\cdot} = R_i/N$ and $p_{\cdot i} = C_i/N$, and $R_i$ and $C_i$ are the row total and the column total of the i-th category, respectively.
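The calculation in Eq (19) can be sketched directly from a confusion matrix. The function name and the 3-class matrix below are hypothetical and are not the paper's data; the observed rate uses the diagonal, and the expected rate uses the products of matching row and column totals.

```python
def cohen_kappa(cm):
    """Kappa = (p0 - pe) / (1 - pe) for a square confusion matrix cm."""
    k = len(cm)
    total = sum(sum(row) for row in cm)
    # Observed concordance rate: fraction of samples on the diagonal.
    p0 = sum(cm[i][i] for i in range(k)) / total
    row_totals = [sum(cm[i]) for i in range(k)]
    col_totals = [sum(cm[i][j] for i in range(k)) for j in range(k)]
    # Expected concordance rate: sum over classes of (R_i / N) * (C_i / N).
    pe = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (p0 - pe) / (1 - pe)


# Hypothetical 3-class confusion matrix, for illustration only.
toy_cm = [
    [5, 1, 0],
    [1, 4, 1],
    [0, 1, 7],
]
toy_kappa = cohen_kappa(toy_cm)  # p0 = 0.80, pe = 0.34
```

For this toy matrix, p0 = 16/20 = 0.80 and pe = (6·6 + 6·6 + 8·8)/400 = 0.34, so Kappa = 0.46/0.66 ≈ 0.697. The function does not guard against the degenerate case pe = 1.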

Table 3 compares the three indicators for each origin identification model. Decision trees, the naive Bayes model and RBF score lower than the other models on all three indicators and need improvement. LDA and PLS-DA perform well on the three indicators. Regardless of which evaluation indicator is used, the PCA-SVM combined model proposed in this paper performs the best among all models. Using this model to identify the origin of Cornus officinalis, the Accuracy is 84.8%.

Table 3. Comparison results of models for origin identification of Cornus officinalis based on mid-infrared spectroscopy.

Evaluation indicators  Naive Bayes  Decision Trees  LDA    RBF    PLS-DA  PCA-SVM
Accuracy (%)           66.5         58.2            70.9   64.6   73.4    84.8
F1-Score               0.657        0.574           0.715  0.635  0.738   0.844
Kappa                  0.628        0.538           0.678  0.609  0.703   0.831
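As a consistency check, the F1-Score of the PCA-SVM model in Table 3 can be reproduced from the PCA-SVM columns of Table 2 using Eqs (16)–(18). The sketch below copies those PPV and TPR values (in %); the variable and function names are illustrative.

```python
# PCA-SVM columns of Table 2 (values in %), OP 1 to OP 11.
pca_svm_ppv = [89.5, 80.0, 92.9, 84.6, 85.7, 92.9, 81.8, 66.7, 77.8, 100.0, 76.9]
pca_svm_tpr = [94.4, 80.0, 92.9, 91.7, 66.7, 86.7, 56.3, 75.0, 100.0, 94.4, 90.9]


def macro_f1(precisions, recalls):
    """Average Precision and Recall over classes (Eqs 16-17), then apply Eq 18."""
    mean_p = sum(precisions) / len(precisions)
    mean_r = sum(recalls) / len(recalls)
    return 2 * mean_p * mean_r / (mean_p + mean_r)


# Divide by 100 to move from percent to the 0-1 scale used for F1 in Table 3.
pca_svm_f1 = macro_f1(pca_svm_ppv, pca_svm_tpr) / 100
```

The averaged Precision and Recall are both about 84.4%, so `pca_svm_f1` rounds to 0.844, matching the Table 3 entry. Note that this F1 is computed from the averaged Precision and Recall, which is not the same as averaging the per-class F1 values.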

5. Conclusions

The origin of Chinese Herbal Medicines is an important aspect of their quality control, and it is also of great significance in the exploration and utilization of medicine sources [5, 13]. As a non-destructive analysis technique, mid-infrared spectroscopy offers short analysis time, simple operation, and low analysis cost, and in recent years it has received increasing attention in the identification of Chinese Herbal Medicines. In this study, a method for rapid origin identification of Cornus officinalis based on mid-infrared spectroscopy and chemometrics was established using the spectral data of Cornus officinalis. The results showed that although the mid-infrared spectra of Cornus officinalis samples are highly similar overall, they differ in certain regions. The spectral information was extracted by principal component analysis [53, 54], and the classification and identification model established with the support vector machine achieved high accuracy. The predictive ability of the model was evaluated on an external test set, and the results showed that the established model could classify and identify 158 Cornus officinalis samples from 11 different regions with an accuracy of 84.8%. The accuracies on the external test set and the validation set are similar, indicating that the model has strong generalization ability. Compared with the SVM model with full-spectrum data, the PCA-SVM model not only reduces the redundant variables in the model but also has higher accuracy. In addition, compared with other commonly used chemometric models such as naive Bayes, decision trees, LDA, RBF and PLS-DA, the PCA-SVM combined model performs best on all three indicators used in this paper for the origin identification of Cornus officinalis.
The method proposed in this paper can effectively reduce the time and cost of identifying medicinal materials while ensuring the reliability of the identification results. However, the scope of application of any model is limited by the sample space. Although the model established in this experiment shows good accuracy and robustness in both cross-validation and external tests, much work remains before it can be promoted as a practical technique. Future studies can collect different classes of Chinese Herbal Medicines to improve the generalizability of the model. In addition, mid-infrared spectroscopy provides limited information about the content of specific active constituents of the plant; if such information is required, a more sophisticated analysis is needed.

Supporting information

S1 File. Mid-infrared spectral dataset.

(XLSX)

Data Availability

All relevant data are within the paper and its Supporting Information files.

Funding Statement

This work was supported by the research project on philosophy and social science of universities in Jiangsu Province under Grant number 2022SJYB0562 (to Bing Liu) and the horizontal scientific research project of Nanjing Vocational University of Industry Technology under Grant number HK22-38-01 (to Bing Liu). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Jiao L, Bi L, Lu Y, Wang Q, Gong Y, Shi J, et al. Cancer chemoprevention and therapy using chinese herbal medicine. Biological procedures online. 2018; 20(1):1–14. doi: 10.1186/s12575-017-0066-1
  • 2. Ogawa-Ochiai K, Kawasaki K. Panax ginseng for frailty-related disorders: a review. Frontiers in Nutrition. 2019; 5:1–8. doi: 10.3389/fnut.2018.00140
  • 3. Lin CH, Hsieh CL. Chinese herbal medicine for treating epilepsy. Frontiers in Neuroscience. 2021; 15:1–13. doi: 10.3389/fnins.2021.682821
  • 4. Yu ZJ, Xu Y, Peng W, Liu YJ, Zhang JM, Li JS, et al. Calculus bovis: a review of the traditional usages, origin, chemistry, pharmacological activities and toxicology. Journal of Ethnopharmacology. 2020; 254:1–60. doi: 10.1016/j.jep.2020.112649
  • 5. Yang M, Jiang Z, Wen M, Wu Z, Zha M, Xu W, et al. Chemical variation of Chenpi (Citrus peels) and corresponding correlated bioactive compounds by LC-MS metabolomics and multibioassay analysis. Frontiers in Nutrition. 2022; 9:1–17. doi: 10.3389/fnut.2022.825381
  • 6. Yao R, Heinrich M, Zhao X, Wang Q, Wei J, Xiao P. What’s the choice for goji: Lycium barbarum L. or L. chinense Mill.? Journal of Ethnopharmacology. 2021; 276:1–8. doi: 10.1016/j.jep.2021.114185
  • 7. Baek SH, Lim H Bin, Chun HS. Detection of melamine in foods using terahertz time-domain spectroscopy. Journal of agricultural and food chemistry. 2014; 62(24):5403–7. doi: 10.1021/jf501170z
  • 8. Jiang Y, David B, Tu P, Barbin Y. Recent analytical approaches in quality control of traditional chinese medicines-A review. Analytica Chimica Acta. 2010; 657(1):9–18. doi: 10.1016/j.aca.2009.10.024
  • 9. Kim MK, Kim JH, Wang H, Lee HN, Yang DC. Discrimination of Korean ginseng (Panax ginseng Meyer) cultivar Chunpoong and American ginseng (Panax quinquefolius) using the auxin repressed protein gene. Journal of Ginseng Research. 2016; 40(4):395–9. doi: 10.1016/j.jgr.2015.12.002
  • 10. Tistaert C, Dejaegher B, Heyden YV. Chromatographic separation techniques and data handling methods for herbal fingerprints: A review. Analytica Chimica Acta. 2011; 690(2):148–61. doi: 10.1016/j.aca.2011.02.023
  • 11. Liu K, Zhang JW, Liu XG, Wu QW, Li XS, Gao W, et al. Correlation between macroscopic characteristics and tissue-specific chemical profiling of the root of salvia miltiorrhiz. Phytomedicine. 2018; 51:104–11. doi: 10.1016/j.phymed.2018.10.011
  • 12. Liu YY, Wei JH, Gao ZH, Zhan Z, Lyu JC. A review of quality assessment and grading for agarwood. Chinese Herbal Medicines. 2017; 9(1):22–30. doi: 10.1016/S1674-6384(17)60072-8
  • 13. Zhu QX, Cao YB, Cao YY, Lu F. Rapid detection of four antipertensive chemicals adulterated in traditional Chinese medicine for hypertension using TLC-SERS. Spectroscopy and Spectral Analysis. 2014; 34(4):990–3. doi: 10.1007/s00216-013-7605-7
  • 14. Pozzi F, Shibayama N, Leona M, Lombardi JR. TLC-SERS study of Syrian rue (Peganum harmala) and its main alkaloid constituents. Journal of Raman Spectroscopy. 2013; 44(1):102–7. doi: 10.1002/jrs.4140
  • 15. Cui Z, Ge N, Zhang A, Liu Y, Zhang J, Cao Y. Comprehensive determination of polycyclic aromatic hydrocarbons in Chinese herbal medicines by solid phase extraction and gas chromatography coupled to tandem mass spectrometry. Analytical and Bioanalytical Chemistry. 2015; 407(7):1989–97. doi: 10.1007/s00216-015-8463-2
  • 16. Cai H, Cao G, Zhang HY. Qualitative analysis of a sulfur-fumigated Chinese herbal medicine by comprehensive two-dimensional gas chromatography and high-resolution time of flight mass spectrometry using colorized fuzzy difference data processing. Chinese Journal of Integrative Medicine. 2017; 23(4):261–9. doi: 10.1007/s11655-015-1966-z
  • 17. Yang FQ, Wang YT, Li SP. Simultaneous determination of 11 characteristic components in three species of curcuma rhizomes using pressurized liquid extraction and high-performance liquid chromatography. Journal of Chromatography A. 2006; 1134(1):226–31. doi: 10.1016/j.chroma.2006.09.048
  • 18. Sun F, Yang XL, Liu F, Zhang Y, Wang SM, Cao H, et al. Quality assessment of different species and differently prepared slices of zedoray rhizome by high-performance liquid chromatography and colorimeter with the aid of chemometrics. Journal of Analytical Methods in Chemistry. 2020; 2020:1–10. doi: 10.1155/2020/8866250
  • 19. Obisesan KA, Jiménez-Carvelo AM, Cuadros-Rodriguez L, Ruisánchez I, Callao MP. HPLC-UV and HPLC-CAD chromatographic data fusion for the authentication of the geographical origin of palm oil. Talanta. 2017; 170:413–8. doi: 10.1016/j.talanta.2017.04.035
  • 20. Schmidt B, Jaroszewski JW, Bro R, Witt M. Combining PARAFAC analysis of HPLC-PDA profiles and structural characterization using HPLC-PDA-SPE-NMR-MS experiments: commercial preparations of St. John’s Wort. Analytical Chemistry. 2008; 80(6):1978–87. doi: 10.1021/ac702064p
  • 21. Zhao H, Chen Z. Screening of neuraminidase inhibitors from traditional Chinese medicines by integrating capillary electrophoresis with immobilized enzyme microreactor. Journal of Chromatography A. 2014; 1340:139–45. doi: 10.1016/j.chroma.2014.03.028
  • 22. Zha XQ, Luo JP, Wei P. Identification and classification of Dendrobium candidum species by fingerprint technology with capillary electrophoresis. South African Journal of Botany. 2009; 75(2):276–82. doi: 10.1016/j.sajb.2009.02.002
  • 23. Sun XH, Gao CL, Cao WD, Yang XR, Wang EK. Capillary electrophoresis with amperometric detection of curcumin in Chinese herbal medicine pretreated by solidphase extraction. Journal of Chromatography A. 2002; 962:117–25. doi: 10.1016/s0021-9673(02)00509-5
  • 24. Park SE, Seo SH, Lee KI, Na CS, Son HS. Metabolite profiling of fermented ginseng extracts by gas chromatography mass spectrometry. Journal of Ginseng Research. 2018; 42(1):57–67. doi: 10.1016/j.jgr.2016.12.010
  • 25. Sandasi M, Vermaak I, Chen W, Viljoen A. The application of vibrational spectroscopy techniques in the quality control of material traded as ginseng. Planta Medica. 2016; 82:472–89. doi: 10.3390/molecules21040472
  • 26. Yu C, Wang CZ, Zhou CJ, Wang B, Han L, Zhang CF, et al. Adulteration and cultivation region identification of American ginseng using HPLC coupled with multivariate analysis. Journal of Pharmaceutical and Biomedical Analysis. 2014; 99:8–15. doi: 10.1016/j.jpba.2014.06.031
  • 27. Nan T, Wu S, Zhao H, Tan W, Li Z, Zhang Q, et al. Development of a secondary antibody thio-functionalized microcantilever immunosensor and an ELISA for measuring ginsenoside Re content in the herb ginseng. Analytical Chemistry. 2012; 84(10):4327–33. doi: 10.1021/ac203414z
  • 28. Esteban-Díez I, González-Sáiz JM, Sáenz-González C, Pizarro C. Coffee varietal differentiation based on near infrared spectroscopy. Talanta. 2007; 71(1):221–9. doi: 10.1016/j.talanta.2006.03.052
  • 29. Krähmer A, Engel A, Kadow D, Ali N, Umaharan P, Kroh LW, et al. Fast and neat–determination of biochemical quality parameters in cocoa using near infrared spectroscopy. Food Chemistry. 2015; 181:152–9. doi: 10.1016/j.foodchem.2015.02.084
  • 30. Ren X, He T, Wang J, Wang L, Wang Y, Liu X, et al. Uv spectroscopy and hplc combined with chemometrics for rapid discrimination and quantification of curcumae rhizoma from three botanical origins. Journal of Pharmaceutical and Biomedical Analysis. 2021; 202:1–12. doi: 10.1016/j.jpba.2021.114145
  • 31. Tolessa K, Rademaker M, De Baets B, Boeckx P. Prediction of specialty coffee cup quality based on near infrared spectra of green coffee beans. Talanta. 2016; 150:367–74. doi: 10.1016/j.talanta.2015.12.039
  • 32. Li W, Qu H. Wavelet-based classification and influence matrix analysis method for the fast discrimination of Chinese herbal medicines according to the geographical origins with near infrared spectroscopy. Journal of Innovative Optical Health Sciences. 2014; 7(4):1–14. doi: 10.1142/S1793545813500612
  • 33. Lu L, Ztz B, Yzw B, Frx A. A fast multi-source information fusion strategy based on ftir spectroscopy for geographical authentication of wild gentiana rigescens. Microchemical Journal. 2020; 159:1–10. doi: 10.1016/j.microc.2020.105360
  • 34. Hou DY, Shi LC, Yang MM, Li J, Xu HW. De novo transcriptomic analysis of leaf and fruit tissue of Cornus officinalis using illumina platform. Plos One. 2018; 13(2):1–18. doi: 10.1371/journal.pone.0192610
  • 35. Qi LM, Ma YT, Zhong FR, Shen C. Comprehensive quality assessment for Rhizoma Coptidis based on quantitative and qualitative metabolic profiles using high performance liquid chromatography, Fourier transform near-infrared and Fourier transform mid-infrared combined with multivariate statistical analysis. Journal of Pharmaceutical and Biomedical Analysis. 2018; 161:436–43. doi: 10.1016/j.jpba.2018.09.012
  • 36. Ma YH, He HQ, Wu JZ, Wang CY, Chao KL, Huang Q. Assessment of polysaccharides from mycelia of genus ganoderma by mid-infrared and near-infrared spectroscopy. Scientific Reports. 2018; 8:1–10. doi: 10.1038/s41598-017-18422-7
  • 37. Hotelling H. Simplified calculation of principal components. Psychometrika. 1936; 1(1):27–35. doi: 10.1007/BF02287921
  • 38. Takane Y, Hunter MA. Constrained principal component analysis: a comprehensive theory. Applicable Algebra in Engineering Communication and Computing. 2001; 12(5):391–419. doi: 10.1007/s002000100081
  • 39. Cortes C, Vapnik V. Support-Vector Networks. Machine Learning. 1995; 20(3):273–97. doi: 10.1023/A:1022627411411
  • 40. Ding SF, Zhang N, Zhang XK, Wu FL. Twin support vector machine: theory, algorithm and applications. Neural Computing and Applications. 2017; 28(11):3119–30. doi: 10.1007/s00521-016-2245-4
  • 41. Chen C, Li H, Lv X, Tang J, Chen C, Zheng X, et al. Application of near infrared spectroscopy combined with SVR algorithm in rapid detection of cAMP content in red jujube. Optik. 2019; 194:163063. doi: 10.1016/j.ijleo.2019.163063
  • 42. Kaiser HF. The application of electronic computers to factor analysis. Educational and psychological measurement. 1960; 20(1):141–51. doi: 10.1177/001316446002000116
  • 43. Wang JM, Liao XY, Zheng PC, Xue SW, Peng R. Classification of Chinese herbal medicine by laser-induced breakdown spectroscopy with principal component analysis and artificial neural network. Analytical Letters. 2017; 51(4):575–86. doi: 10.1080/00032719.2017.1340949
  • 44. Liu J, Li Z, Hu F, Chen T, Du Y, Xin H. Identification of GMOs by terahertz spectroscopy and ALAP–SVM. Optical and Quantum Electronics. 2015; 47(3):685–95. doi: 10.1007/s11082-014-9944-9
  • 45. Wong TT, Yeh PY. Reliable accuracy estimates from k-fold cross validation. IEEE Transactions on Knowledge and Data Engineering. 2020; 32(8):1586–94. doi: 10.1109/TKDE.2019.2912815
  • 46. Liu W, Liu C, Hu X, Yang J, Zheng L. Application of terahertz spectroscopy imaging for discrimination of transgenic rice seeds with chemometrics. Food Chemistry. 2016; 210:415–21. doi: 10.1016/j.foodchem.2016.04.117
  • 47. Chen C, Yang L, Li H, Chen F, Chen C, Gao R, et al. Raman spectroscopy combined with multiple algorithms for analysis and rapid screening of chronic renal failure. Photodiagnosis and Photodynamic Therapy. 2020; 30:101792. doi: 10.1016/j.pdpdt.2020.101792
  • 48. Broomhead DS, Lowe D. Radial basis functions, multi-variable functional interpolation and adaptive networks. Royal Signals and Radar Establishment Malvern. 1988; 4148:1–34.
  • 49. Gao R, Chen C, Wang H, Chen C, Yan Z, Han H, et al. Classification of multicategory edible fungi based on the infrared spectra of caps and stalks. Plos One. 2020; 15(8):e0238149. doi: 10.1371/journal.pone.0238149
  • 50. Yang S, Li CX, Mei Y, Liu W, Liu R, Chen WL, et al. Determination of the geographical origin of coffee beans using terahertz spectroscopy combined with machine learning methods. Frontiers in Nutrition. 2021; 8:1–10. doi: 10.3389/fnut.2021.680627
  • 51. Zheng ZP, Qiu B, Luo AL, Li YB. Classification for unrecognized spectra in lamost dr6 using generalization of convolutional neural networks. Publications of the Astronomical Society of the Pacific. 2020; 132(1008):1–13. doi: 10.1088/1538-3873/ab5ed7
  • 52. Borsato D, Pina MVR, Spacino KR, Scholz MB dos S, Filho AA. Application of artificial neural networks in the geographical identification of coffee samples. European Food Research and Technology. 2011; 233(3):533–43. doi: 10.1007/s00217-011-1548-z
  • 53. Yang B, Chen C, Chen F, Chen C, Lv X. Identification of cumin and fennel from different regions based on generative adversarial networks and near infrared spectroscopy. Spectrochimica Acta Part A-Molecular and Biomolecular Spectroscopy. 2021; 260:119956. doi: 10.1016/j.saa.2021.119956
  • 54. Chen C, Chen F, Yang B, Zhang K, Lv X, Chen C. A novel diagnostic method: FT-IR, Raman and derivative spectroscopy fusion technology for the rapid diagnosis of renal cell carcinoma serum. Spectrochimica Acta Part A-Molecular and Biomolecular Spectroscopy. 2022; 269:120684. doi: 10.1016/j.saa.2021.120684

Decision Letter 0

Naji Arafat Mahat

1 Dec 2022

PONE-D-22-30208

Origin identification of Cornus officinalis based on PCA-SVM combined model

PLOS ONE

Dear Dr. Jin,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Jan 15 2023 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Naji Arafat Mahat, PhD

Academic Editor

PLOS ONE

Journal requirements:

When submitting your revision, we need you to address these additional requirements.

1.  Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf  and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Thank you for stating the following financial disclosure:

“This work was supported by the research project on philosophy and social science of universities in Jiangsu Province (No. 2022SJYB0562) and the horizontal scientific research project of Nanjing Vocational University of Industry Technology (No. HK22-38-01).”

Please state what role the funders took in the study.  If the funders had no role, please state: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript."

If this statement is not correct you must amend it as needed.

Please include this amended Role of Funder statement in your cover letter; we will change the online submission form on your behalf.

3. We note that you have stated that you will provide repository information for your data at acceptance. Should your manuscript be accepted for publication, we will hold it until you provide the relevant accession numbers or DOIs necessary to access your data. If you wish to make changes to your Data Availability statement, please describe these changes in your cover letter and we will update your Data Availability statement to reflect the information you provide.

4. PLOS requires an ORCID iD for the corresponding author in Editorial Manager on papers submitted after December 6th, 2016. Please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field. This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager. Please see the following video for instructions on linking an ORCID iD to your Editorial Manager account: https://www.youtube.com/watch?v=_xcclfuvtxQ

5. Please upload a new copy of the Figure, as the Figure file cannot be opened. Please follow the link for more information: https://blogs.plos.org/plos/2019/06/looking-good-tips-for-creating-your-plos-figures-graphics/

Additional Editor Comments:

Please refer to the commented manuscript uploaded by Reviewer #1.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: No

Reviewer #2: No

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: No

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The manuscript discusses the use of Infrared Spectroscopy (IR) combined with Principal Component Analysis (PCA) in tandem with Support Vector Machine (SVM) for origin identification of a Chinese herbal medicine, i.e. Cornus Officinalis, collected from eleven different origins/provinces in China. I personally found the manuscript interesting; however, major revision needs to be made to the manuscript prior to its publication in PLOS ONE.

My main concern pertaining the manuscript is that other classification models such as naive Bayes, Decision Tree, Linear Discriminant Analysis (LDA), Partial Least Square-Discriminant Analysis (PLS-DA) and a variant of artificial neural network i.e. Radial Basis Function (RBF) were reported but no works showing how these classification models were developed and tested in this study except Table 3 which briefly displays the precision and recall (sensitivity) of these classification models. Furthermore the reference i.e. reference 44 associated with these classification models was on rapid screening of chronic renal failure while reference 47 and 48 was study conducted to lamost dr6 and coffee samples respectively but not on the identification of Cornus Officinalis.

I was not able to locate all the figures mentioned in the manuscript, in other words figure were not included in the manuscript. This made the reviewing process quite difficult. Assuming that Figure 3 is available, the similarities and differences between the spectra of the Cornus Officinalis samples can be directly from the spectra therefore reporting correlation using Pearson Linear Correlation Coefficient in my opinion isn't necessary.

The explanation on data pre-processing was confusing. Authors mentioned that data pre-processing is the first step in modeling and also mentioned that the first step in constructing SVM based identification model is organize, collect sample followed by data normalization however since AU of the spectra did not deviate much from the Beer's Lambert Law, no special treatment was made to the data by the authors. Does this mean that authors used raw data for PCA? Since SVM was developed using principal components extracted from the spectral data, does this mean that authors normalized the principal components prior to SVM?

Authors should focus on reporting the outcomes acquired from PCA and SVM and disregard comparing with other classification models unless one of the objectives of the manuscript was to develop and compare different classification models for identification of Cornus Officinalis. Since authors mentioned the study compared different SVM kernel functions i.e. linear, quadratic, cubic and Gaussian, it would be interesting if the authors could report on these.

Other minor comments are attached in the manuscript.

Reviewer #2: 1. The authors presented work that combines two multivariate techniques for source determination of a herbal medicinal plant from numerous geographical origins. The work is interesting; however, the structure of the manuscript needs attention, and important information is missing or, where presented, not clearly described. Some basic practices for writing a scientific article were also not followed; thus improvements to the writing, contents and continuity of the article should be made if the authors intend to publish the manuscript.

Simple corrections that the authors can easily make include italicising the genus and species of herbal plants where applicable, in-text citations without initials, improper in-text citations, capital and small letters (typos perhaps), figures mentioned but not included, the make and model of the instrument, and font sizes. These are minor but vital components of a complete and well-prepared manuscript.

2. In the introduction section, the authors mention that microscopic analysis is able to determine the authenticity of herbs even when the herbs are highly processed. If this is the case, then what is the need for multivariate analysis? The authors also did not mention specifically which part of the plant was taken for analysis. It seems that the authors are generalising that all parts of the plant are the same. Is this true?

The data seem to be secondary data obtained from another organization, and this has been mentioned; however, the link given is confusing. The purpose of the link was not explained.

In terms of statistical analysis, data collection is not clearly explained. Perhaps geographical origin can be better explained with a sampling map, so that the reader has a better understanding of the locations where samples were collected, as this is discussed in depth by the authors.

In data source and preprocessing, since this study aims to determine the origin of the sample, variations within and between samples (reproducibility and repeatability issues) should perhaps be considered. The authors merely state their opinions on variations observed in the signal but did not perform any testing or measurement of their hypothesis. The authors can be more specific when discussing the characteristic parts of the FTIR spectrum responsible for the source determination.

The authors have included many statistical techniques tested on Chinese herbal medicine origin, but are the samples Cornus officinalis? It is true that FTIR spectra may show different spectral characteristics due to different chemical constituents within the sample; however, comparing different samples of Chinese herbal medicine (for example, as quoted in reference 44) to Cornus officinalis is unfounded. It is also unclear whether the authors applied naive Bayes, decision tree, LDA, RBF and PLS-DA to the same data set used for PCA-SVM, because this was not indicated in the methodology.

Based on these comments, it is highly suggested that the authors revise the manuscript, especially in terms of explaining the work to avoid confusion, and improve the continuity and arrangement of the contents.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

Attachment

Submitted filename: PONE-D-22-30208.pdf

PLoS One. 2023 Feb 28;18(2):e0282429. doi: 10.1371/journal.pone.0282429.r002

Author response to Decision Letter 0


12 Jan 2023

Dear Editor and Reviewers,

Thank you very much for your careful review and constructive suggestions regarding our manuscript "Origin identification of Cornus officinalis based on PCA-SVM combined model" (Submission ID PONE-D-22-30208). The comments are all valuable and very helpful for revising and improving our paper, and they provide important guidance for our research. We have tried our best to revise the manuscript according to the reviewers’ comments; the changes are marked in red in the revised version. We sincerely appreciate the work of the editor and reviewers, and hope that the corrections will meet with approval. The main corrections to our manuscript and the responses to the reviewers' comments are presented as follows.

Reviewer 1

The manuscript discusses the use of Infrared Spectroscopy (IR) combined with Principal Component Analysis (PCA) in tandem with Support Vector Machine (SVM) for origin identification of a Chinese herbal medicine, i.e. Cornus officinalis, collected from eleven different origins/provinces in China. I personally found the manuscript interesting; however, major revisions need to be made prior to its publication in PLOS ONE.

1. My main concern pertaining to the manuscript is that other classification models such as naive Bayes, Decision Tree, Linear Discriminant Analysis (LDA), Partial Least Squares Discriminant Analysis (PLS-DA) and a variant of artificial neural network, i.e. Radial Basis Function (RBF), were reported, but no work is shown on how these classification models were developed and tested in this study, except Table 3, which briefly displays the precision and recall (sensitivity) of these classification models. Furthermore, reference 44, associated with these classification models, was on rapid screening of chronic renal failure, while references 47 and 48 were studies conducted on LAMOST DR6 and coffee samples respectively, not on the identification of Cornus officinalis.

Response: Thank you very much for your comment. As you mentioned, the main work of this study is to discuss the application of Infrared Spectroscopy (IR) with Principal Component Analysis (PCA) and Support Vector Machine (SVM) in the origin identification of Cornus officinalis. After completing the PCA-SVM model, this model was then compared with other commonly used origin identification models, and it was found that the PCA-SVM model performed better in both the training set and the test set.

The other classification models are briefly described in Section 4 of the manuscript. In model development and testing, the division of samples into training and test sets and the model validation were done for each model in the same way as for the PCA-SVM combined model. Because all models are compared on the same data and the same divisions, the superior performance of the PCA-SVM model can be demonstrated. The development and testing of the other models are described, and marked in red, at the beginning of Section 4, paragraph 7 of our manuscript.

References [44], [47] and [48] did not identify the origin of Cornus officinalis. As you suggested, it would have been more convincing if the manuscript had cited other models for the identification of Cornus officinalis origin using infrared spectroscopy. However, we encountered difficulties in finding such references. Very little literature both uses infrared spectroscopy to identify the origin of Cornus officinalis and applies these commonly used classification models. The references cited in the manuscript did not identify the origin of Cornus officinalis, but they used the common classification models discussed in the manuscript, so we considered them appropriate to cite. In addition, the scarcity of literature identifying the origin of Cornus officinalis reflects the significance of publishing our manuscript.

2. I was not able to locate any of the figures mentioned in the manuscript; in other words, the figures were not included in the manuscript. This made the reviewing process quite difficult. Assuming that Figure 3 is available, the similarities and differences between the spectra of the Cornus officinalis samples can be seen directly from the spectra; therefore, reporting correlation using the Pearson linear correlation coefficient is, in my opinion, not necessary.

Response: Thank you very much for your suggestion. We checked the manuscript, and it is true that it contains no figures. This may have been an error in our submission. We apologize for the difficulties this caused you in reviewing it. Figure 3 provides a visual representation of the differences between the sample spectra. Our purpose in using the Pearson linear correlation coefficient was to quantify the differences between the sample spectra. As you said, this may make the manuscript somewhat repetitive and not concise enough. We have removed the Pearson linear correlation coefficient section as you suggested, so that the manuscript is more concise and clear.

3. The explanation of data pre-processing was confusing. The authors mentioned that data pre-processing is the first step in modeling, and also that the first step in constructing the SVM-based identification model is to organize and collect samples, followed by data normalization; however, since the absorbance (AU) of the spectra did not deviate much from the Beer-Lambert law, no special treatment was applied to the data. Does this mean that the authors used raw data for PCA? Since the SVM was developed using principal components extracted from the spectral data, does this mean that the authors normalized the principal components prior to SVM?

Response: Thank you very much for your comment. We apologize for any confusion caused by the description of data pre-processing. In this paper, the data processing is carried out in the following steps. First, the collected data are processed for outliers and missing values. We consider a data point an outlier if it is less than 1/3 of, or greater than 3 times, the arithmetic mean of its nearest neighboring values on the left and right. Outliers and missing values are replaced by mean interpolation.
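The outlier rule described above can be sketched in a few lines. This is only an illustration and not the authors' code: the function name `clean_spectrum`, the toy spectrum, the decision to leave the two endpoints untouched, and the restriction of the ratio rule to positive neighbor means are all assumptions made for the example.

```python
import numpy as np

def clean_spectrum(absorbance):
    """Replace missing values and ratio-rule outliers by the mean of the
    left and right neighbors (interior points only)."""
    x = np.asarray(absorbance, dtype=float)
    cleaned = x.copy()
    for i in range(1, len(x) - 1):
        left, right = x[i - 1], x[i + 1]
        if np.isnan(left) or np.isnan(right):
            continue  # assumption: skip points whose neighbors are missing
        neighbor_mean = (left + right) / 2.0
        is_missing = bool(np.isnan(x[i]))
        # Ratio rule from the response; assumed to apply only when the
        # neighbor mean is positive, since the rule is ill-defined otherwise.
        is_outlier = (
            not is_missing
            and neighbor_mean > 0
            and (x[i] < neighbor_mean / 3.0 or x[i] > 3.0 * neighbor_mean)
        )
        if is_missing or is_outlier:
            cleaned[i] = neighbor_mean
    return cleaned

# Hypothetical absorbance values with one spike, one gap and one dropout.
spectrum = [0.50, 0.52, 2.00, 0.56, np.nan, 0.60, 0.10, 0.64]
cleaned = clean_spectrum(spectrum)
print(cleaned)
```

In this toy input the spike (2.00), the missing value and the dropout (0.10) are each replaced by the mean of their two neighbors, while the remaining points are left as measured.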

After dealing with missing values and outliers, the absorbance data of the 3448 corresponding bands under spectral illumination were analyzed and summarized. The absorbance ranges from -0.00675 to 1.48696 AU.

Since the absorbance values used in this paper are instrument-corrected, some negative values appear. However, the absolute values of these negative values are small, and they account for less than 0.001% of the total amount of data, so we keep them. The absorbance of some data is greater than 1 AU, but the maximum absorbance does not exceed 1.5 AU. We consider that they do not deviate from the Lambert-Beer law, so no special treatment is required.

Our collection of mid-infrared spectral data contains 3448 bands, and the absorbance data of the bands are highly correlated. If all 3448 variables were introduced into the Cornus officinalis origin identification model, the training time would be very long, and the highly correlated variables would lead to poor stability and generalization ability. Therefore, it is necessary to use mathematical methods to extract features from the data. Principal component analysis was used in this study to extract the features of the spectra. According to the Kaiser criterion, the first 14 principal components are selected to replace the original variables to establish the origin identification model of Cornus officinalis.
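As a rough illustration of component selection by the Kaiser criterion, the sketch below keeps the principal components whose eigenvalue exceeds 1. It assumes PCA on column-standardized data (so that the eigenvalues are those of the correlation matrix), which the response does not state explicitly; the function name `pca_kaiser` and the synthetic data are hypothetical and stand in for the real spectra.

```python
import numpy as np

def pca_kaiser(X):
    """PCA on standardized columns, keeping components with eigenvalue > 1."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # SVD of the standardized data; squared singular values / (n - 1)
    # are the eigenvalues of the sample correlation matrix.
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    eigenvalues = s**2 / (Z.shape[0] - 1)
    k = int(np.sum(eigenvalues > 1.0))          # Kaiser criterion
    scores = Z @ Vt[:k].T                       # component scores
    explained = eigenvalues[:k].sum() / eigenvalues.sum()
    return scores, eigenvalues, k, explained

# Synthetic stand-in: 60 "spectra" with 50 highly correlated "bands"
# driven by 3 latent factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(60, 3))
loadings = rng.normal(size=(3, 50))
X = latent @ loadings + 0.1 * rng.normal(size=(60, 50))

scores, eigenvalues, k, explained = pca_kaiser(X)
print(k, round(explained, 4))
```

With real spectra, `X` would hold one row per sample and one column per band, and `explained` would play the role of the share of information retained by the selected components.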

The 14 principal components obtained by feature extraction were used to build the origin identification model of Cornus officinalis with the help of a support vector machine. We call this model the PCA-SVM combined model. In the process of building the SVM classification model, as you mentioned, the extracted components need to be normalized. We have added a description of this in paragraph 4 of Section 3.3 of the manuscript and marked it in red.
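The response does not say which normalization is applied to the components. One common choice is z-score scaling fitted on the training set only and then reused for the test set, sketched below. The helper names `fit_zscore` and `apply_zscore` and the random scores are assumptions for illustration.

```python
import numpy as np

def fit_zscore(train_scores):
    """Column means and standard deviations estimated on training data only."""
    mean = train_scores.mean(axis=0)
    std = train_scores.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # guard against constant columns
    return mean, std

def apply_zscore(scores, mean, std):
    """Scale any score matrix with previously fitted statistics."""
    return (scores - mean) / std

# Hypothetical scores for 14 components whose spreads differ widely,
# as is typical of principal components.
rng = np.random.default_rng(0)
spread = np.linspace(30.0, 0.5, 14)
train_scores = rng.normal(size=(80, 14)) * spread
test_scores = rng.normal(size=(20, 14)) * spread

mean, std = fit_zscore(train_scores)
train_norm = apply_zscore(train_scores, mean, std)
test_norm = apply_zscore(test_scores, mean, std)
```

Fitting the statistics on the training scores alone keeps information from the external test set out of the model, and bringing the components to a common scale matters for distance-based kernels such as the Gaussian kernel.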

4. Authors should focus on reporting the outcomes acquired from PCA and SVM and disregard comparing with other classification models unless one of the objectives of the manuscript was to develop and compare different classification models for identification of Cornus officinalis. Since authors mentioned the study compared different SVM kernel functions i.e. linear, quadratic, cubic and Gaussian, it would be interesting if the authors could report on these.

Response: Thank you very much for your comment. As you said, one of the objectives of this paper is to develop and compare different classification models to identify the origin of Cornus officinalis. Therefore, we have compared the results with those of other commonly used models. We strongly agree with your suggestion to report on the different support vector machine kernel functions, which makes our manuscript more complete. We have added these common kernel functions in Section 3.3, paragraph 5 of the manuscript and marked them in red.
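A kernel comparison of the kind the reviewer asked about could look like the following sketch, written with scikit-learn. It is not the authors' procedure: the synthetic three-class data, the 5-fold cross-validation, and the mapping of "quadratic" and "cubic" to polynomial kernels of degree 2 and 3 with `coef0=1.0` are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for 14 principal-component scores of three origins.
rng = np.random.default_rng(0)
n_per_class, n_features = 30, 14
centers = rng.normal(0.0, 3.0, size=(3, n_features))
X = np.vstack([c + rng.normal(size=(n_per_class, n_features)) for c in centers])
y = np.repeat(np.arange(3), n_per_class)

kernels = {
    "linear": SVC(kernel="linear"),
    "quadratic": SVC(kernel="poly", degree=2, coef0=1.0),
    "cubic": SVC(kernel="poly", degree=3, coef0=1.0),
    "gaussian": SVC(kernel="rbf"),
}

cv_accuracy = {}
for name, svc in kernels.items():
    # Scaling sits inside the pipeline so each fold is scaled on its own
    # training part only.
    model = make_pipeline(StandardScaler(), svc)
    cv_accuracy[name] = cross_val_score(model, X, y, cv=5).mean()

for name, acc in cv_accuracy.items():
    print(f"{name:>9}: {acc:.3f}")
```

On real data the same loop would be run on the training set, and the kernel with the best cross-validated accuracy would then be checked once on the external test set.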

5. Other minor comments: In section 3.3, paragraph 2: No matter which set, the ratio of the maximum number and the minimum number of samples does not exceed 4:1, so it is considered that there is no sample imbalance in this study.

Comment: Was this based on previous study?

Response: Thank you very much for your comment. Sample imbalance refers to the situation where the number of training samples of different classes in a classification task varies significantly. In general, an imbalance ratio (majority class vs. minority class) significantly greater than 1:1 (such as 10:1) can be classified as a problem of sample imbalance. An unbalanced sample biases predictions toward the majority class, resulting in better accuracy for the majority class and worse accuracy for the minority class. Sample imbalance can generally be handled by resampling the dataset, generating artificial data samples, trying different classification algorithms, and penalizing the model.

Considering that the ratio of the maximum to the minimum sample size of Cornus officinalis did not exceed 4:1 and the number of origin categories was relatively high, based on our team's experience, we concluded that there was no sample imbalance in this study.
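The imbalance check described here amounts to comparing the largest and smallest class sizes. A minimal sketch follows; the class counts are invented for illustration, and the 4:1 threshold is taken from the response as the authors' working rule rather than a general standard.

```python
from collections import Counter

def imbalance_ratio(labels):
    """Ratio of the largest class size to the smallest class size."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Hypothetical origin labels: 12, 30 and 40 samples for three origins.
labels = ["OP1"] * 12 + ["OP2"] * 30 + ["OP3"] * 40
ratio = imbalance_ratio(labels)
is_imbalanced = ratio > 4.0  # threshold used in the response
print(f"max:min = {ratio:.2f}:1, imbalanced = {is_imbalanced}")
```

The same check would be run separately on the training set and the test set, since the response states that the ratio holds "no matter which set" is considered.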

Reviewer 2

1. The authors presented work that combines two multivariate techniques for source determination of a herbal medicinal plant from numerous geographical origins. The work is interesting; however, the structure of the manuscript needs attention, and important information is missing or, where presented, not clearly described. Some basic practices for writing a scientific article were also not followed; thus improvements to the writing, contents and continuity of the article should be made if the authors intend to publish the manuscript.

Simple corrections that the authors can easily make include italicising the genus and species of herbal plants where applicable, in-text citations without initials, improper in-text citations, capital and small letters (typos perhaps), figures mentioned but not included, the make and model of the instrument, and font sizes. These are minor but vital components of a complete and well-prepared manuscript.

Response: Thank you very much for your comment. We strongly agree with your suggestions for the structure of the manuscript to be continuous, the important information of the paper to be kept intact, and the writing of the paper to be standardized. A complete and standardized manuscript is more conducive to the publication and promotion of its contents. We have added important information to the manuscript as you suggested, and have corrected incorrectly quoted text and highlighted it in red.

2. In the introduction section, the authors mention that microscopic analysis is able to determine the authenticity of herbs even when the herbs are highly processed. If this is the case, then what is the need for multivariate analysis? The authors also did not mention specifically which part of the plant was taken for analysis. It seems that the authors are generalising that all parts of the plant are the same. Is this true?

Response: Thank you very much for your comment. Microscopic identification is a term in Chinese medicine published in 2004. It is a method that uses a microscope to observe the internal tissue structure, cells and the morphology of cellular contents of drugs, describe microscopic features, and develop a microscopic identification basis to identify the authenticity of drugs. This identification method is fast, sensitive and simple, and has some practical significance. However, microscopic identification also has some shortcomings: for example, the microscopic characteristics of some herbs are not easy to find, and the identification characteristics of some herbs do not conform to the pharmacopoeia. With the development of chromatographic coupling technology, the use of modern chromatographic techniques for the examination of Chinese herbal medicines has also been rapidly popularized. Infrared spectrometry has the advantages of fast analysis, low cost, non-destructive measurement and simple pre-treatment. In recent years, it has been widely used in the field of quality control of Chinese herbal medicines. We have added a note in part 1, paragraph 2 of the manuscript and marked it in red.

As you mentioned, we did not state in the manuscript the specific part of the plant that was analyzed. The spectral data in the manuscript were collected from the dried ripe pulp of Cornus officinalis, and we have added a note in the first paragraph of Section 2.1 of the manuscript and marked it in red. We apologize for any trouble caused by our oversight.

3. The data seem to be secondary data obtained from another organization, and this has been mentioned; however, the link given is confusing. The purpose of the link was not explained.

Response: Thank you very much for your comment. As you said, the data used in this study are secondary data obtained from other organizations. We have provided the relevant data in the relevant document, so giving the link loses its relevance. We have removed the link from the manuscript as you suggested.

4. In terms of statistical analysis, data collection is not clearly explained. Perhaps geographical origin can be better explained with a sampling map, so that the reader has a better understanding of the locations where samples were collected, as this is discussed in depth by the authors.

Response: Thank you very much for your comment. According to your comment, we checked the manuscript, which states 11 origins, but does not clearly give the correspondence between the origins and OP1~OP11. In fact, they correspond in order, and we have added a note in section 1, paragraph 5 of the manuscript and marked it in red.

5. In data source and preprocessing, since this study aims to determine the origin of the sample, variations within and between samples (reproducibility and repeatability issues) should perhaps be considered. The authors merely state their opinions on variations observed in the signal but did not perform any testing or measurement of their hypothesis. The authors can be more specific when discussing the characteristic parts of the FTIR spectrum responsible for the source determination.

Response: Thank you very much for your suggestion. We strongly agree with you on the issue of test reproducibility and repeatability: if a test is not reproducible and repeatable, then the test is meaningless.

The reproducibility of the test is evaluated by taking 5 consecutive measurements of a given sample and calculating the Relative Standard Deviation (RSD) of their maximum common peak wave number. The RSD of the reproducibility test is determined to be less than 0.2%, indicating good reproducibility of the test. The repeatability of the test is evaluated by measuring the same sample once by 5 different experimenters, and calculating the RSD of the maximum common peak wave number of 5 measurements. The RSD of the repeatability test is determined to be less than 3%, indicating good repeatability of the test. We have added a note in Section 2.1, paragraph 2 of the manuscript and marked it in red.
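The relative standard deviation used in both checks is the sample standard deviation divided by the mean, expressed as a percentage. The short sketch below shows the calculation; the five wavenumber readings are made up for illustration and are not the authors' measurements.

```python
import statistics

def relative_standard_deviation(values):
    """RSD in percent: sample standard deviation divided by the mean."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)

# Hypothetical wavenumbers (cm^-1) of the maximum common peak from five
# consecutive measurements of one sample.
peak_wavenumbers = [1032.1, 1031.8, 1032.4, 1032.0, 1031.9]
rsd = relative_standard_deviation(peak_wavenumbers)
print(f"RSD = {rsd:.4f}%")
```

The same function applies to the five measurements taken by different experimenters; only the input list changes.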

6. The authors have included many statistical techniques tested on Chinese herbal medicine origin, but are the samples Cornus officinalis? It is true that FTIR spectra may show different spectral characteristics due to different chemical constituents within the sample; however, comparing different samples of Chinese herbal medicine (for example, as quoted in reference 44) to Cornus officinalis is unfounded. It is also unclear whether the authors applied naive Bayes, decision tree, LDA, RBF and PLS-DA to the same data set used for PCA-SVM, because this was not indicated in the methodology.

Response: Thank you very much for your comment. We have included in the manuscript many references to statistical techniques for the origin of herbs, many of which do not directly identify the origin of Cornus officinalis. As you suggested, it would have been more convincing if the manuscript had introduced other models for the identification of Cornus officinalis origin using infrared spectroscopy when citing references. However, there is very little literature using these statistical techniques and using infrared spectroscopy to directly identify the origin of Cornus officinalis. Although the references in the manuscript did not directly identify the origin of Cornus officinalis, they used these statistical techniques to identify the origin of other medicinal herbs. Therefore, in the context of the more difficult search for literature identifying the origin of Cornus officinalis, we introduced them into the manuscript for comparison with the combined PCA-SVM model proposed in this paper. The comparison results showed that the PCA-SVM combined model proposed in this paper performed well in each of the given evaluation metrics. In addition, precisely because there is little literature identifying the origin of Cornus officinalis, this reflects the significance of our manuscript publication.

For the development and testing of the naive Bayes, decision tree, LDA, RBF and PLS-DA models, the division of samples into training and test sets and the validation of the models were performed in the same way as for the PCA-SVM combined model. Because all models are compared on the same data and the same divisions, the superior performance of the PCA-SVM model can be demonstrated. The development and testing of the other models are described, and marked in red, at the beginning of Section 4, paragraph 7 of our manuscript.

Finally, thanks again to the editor for giving us the opportunity to revise the paper. Your serious and responsible attitude deserves our admiration. Thanks to reviewer 1 for the comments. Your encouragement provides the motivation for our team to move forward. Thanks to reviewer 2 for the suggestions. Your rigorous academic attitude will be of great help in our future writing.
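When several models are scored on the same test split, the three indicators named in the paper (Accuracy, F1-Score and Kappa coefficient) can all be derived from each model's confusion matrix. The sketch below assumes macro-averaged F1 and Cohen's kappa, which the response does not specify; the function names and the toy labels are hypothetical.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def summarize(cm):
    """Return accuracy, macro-averaged F1 and Cohen's kappa."""
    total = cm.sum()
    accuracy = np.trace(cm) / total
    tp = np.diag(cm).astype(float)
    predicted = cm.sum(axis=0)
    actual = cm.sum(axis=1)
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    # Agreement expected by chance, from the row and column totals.
    expected = (predicted * actual).sum() / total**2
    kappa = (accuracy - expected) / (1 - expected)
    return accuracy, f1.mean(), kappa

# Toy test-set labels for three origins and one model's predictions.
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 1, 2, 2, 2, 0]
accuracy, macro_f1, kappa = summarize(confusion_matrix(y_true, y_pred, 3))
print(f"accuracy={accuracy:.3f}, macro F1={macro_f1:.3f}, kappa={kappa:.3f}")
```

Calling `summarize` once per model on predictions for the same `y_true` gives directly comparable rows for a results table.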

Sincerely,

Best regards,

Yueqiang Jin

Attachment

Submitted filename: Response to Reviewers.docx

Decision Letter 1

Naji Arafat Mahat

15 Feb 2023

Origin identification of Cornus officinalis based on PCA-SVM combined model

PONE-D-22-30208R1

Dear Dr. Jin,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Naji Arafat Mahat, PhD

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: (No Response)

Reviewer #2: Corrections and explanations have been made by the authors. I am satisfied with the manuscript and hope that the authors will keep producing good research and sharing their knowledge with the rest of the scientific community.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

Acceptance letter

Naji Arafat Mahat

17 Feb 2023

PONE-D-22-30208R1

Origin identification of Cornus officinalis based on PCA-SVM combined model

Dear Dr. Jin:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Naji Arafat Mahat

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 File. Mid-infrared spectral dataset.

    (XLSX)

    Data Availability Statement

    All relevant data are within the paper and its Supporting Information files.


    Articles from PLOS ONE are provided here courtesy of PLOS
