Scientific Reports. 2018 Jan 8;8:89. doi: 10.1038/s41598-017-18458-9

Application of variable selection in the origin discrimination of Wolfiporia cocos (F.A. Wolf) Ryvarden & Gilb. based on near infrared spectroscopy

Tianjun Yuan 1,2, Yanli Zhao 1, Ji Zhang 1, Yuanzhong Wang 1
PMCID: PMC5758700  PMID: 29311739

Abstract

Dried sclerotium of Wolfiporia cocos (F.A. Wolf) Ryvarden & Gilb. is a traditional Chinese medicine. Its chemical components differ among geographical origins, which makes it difficult to keep therapeutic potency consistent. Identifying the geographical origin of W. cocos is therefore a fundamental prerequisite for its worldwide recognition and acceptance. Four variable selection methods were applied to near infrared (NIR) spectra, and the characteristic variables were screened to establish Fisher function models for identifying the origin of W. cocos from Yunnan, China. Because poriae cutis (fu-ling-pi in Chinese, or FLP) and the inner part (bai-fu-ling in Chinese, or BFL) of the sclerotia of W. cocos differed obviously in the pattern space of principal component analysis (PCA), we established discriminant models for FLP and BFL separately. Variable selection significantly improved the models and also simplified them, since only a small part of the variables was used. The characteristic variables (13 for BFL and 10 for FLP) were screened to build Fisher discriminant function models, and the validation results showed that the models were reliable and effective. Additionally, the characteristic variables were interpreted.

Introduction

The dried sclerotium of Wolfiporia cocos (F.A. Wolf) Ryvarden & Gilb., a fungal species that parasitizes the roots of pine trees, is a well-known traditional Chinese medicine1. Traditionally, it is used in many prescriptions for inducing diuresis, invigorating the spleen, excreting dampness and tranquilizing the mind. However, poriae cutis (fu-ling-pi in Chinese, or FLP) and the inner part (bai-fu-ling in Chinese, or BFL) of the sclerotia of W. cocos have different therapeutic efficacy. FLP is reported to have only diuretic activity, while BFL has an invigorating activity in addition to diuretic and sedative effects2. Modern phytochemical and pharmacological investigations have shown that triterpenes and polysaccharides are the two main kinds of secondary metabolites in W. cocos and are responsible for its anti-tumor, anti-oxidant, anti-rejection, antibacterial, anti-inflammatory, anti-hyperglycemic and nematicidal activities, among others3. Previous studies found that the triterpenoid and polysaccharide contents of W. cocos differed among origins4,5. These differences in chemical components among geographical origins make it difficult to keep therapeutic potency consistent. Identifying the geographical origin of W. cocos is therefore a fundamental prerequisite for its worldwide recognition and acceptance.

In China, the poria produced in Yunnan is known as Yunnan poria (Yun-ling in Chinese) and is reputed for its geoherbalism. Yunnan is located in southwest China and has a low-latitude plateau, mountain monsoon climate6. There are seven climatic zones in Yunnan, from the north temperate zone to the north tropical zone, and they are distributed according to elevation7. These complex climatic conditions influence the quality of W. cocos. It was reported that the infrared spectra of W. cocos peels from different producing areas (Hubei, Anhui and Yunnan provinces) revealed obvious regional differences and that, because of the large geographical span, the component contents of samples from Yunnan also differed to a certain extent8. Based on ultra performance liquid chromatography-ultraviolet-mass spectrometry (UPLC-UV-MS) fingerprints, the effect of habitat on the quality of peeled and sliced poria was also obvious9.

Near-infrared spectroscopy (NIR), as a fast and non-destructive technology, has been widely used to identify traditional Chinese medicinal materials10–14. The NIR spectrum reflects the absorption of overtones and combinations of the fundamental mid-IR bands of functional groups such as C-H, O-H and N-H. The NIR region (780 to 2500 nm, or about 12800 to 4000 cm−1) is wide and its absorption bands overlap heavily, which makes the analysis of NIR spectra extremely difficult with conventional methods15,16. Variable selection is therefore a critical step in the analysis of NIR datasets with thousands of variables17. In recent years, several variable selection methods for NIR have been developed, such as interval partial least-squares (iPLS)18,19, backward interval partial least-squares (biPLS)20, moving window partial least-squares regression (MWPLSR)21, genetic algorithm (GA)22–24, simulated annealing algorithm (SAA)25, competitive adaptive reweighted sampling (CARS)26–28, Monte Carlo uninformative variable elimination (MC-UVE)29–33, subwindow permutation analysis (SPA)34,35 and latent projective graph (LPG)36,37.
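The wavelength and wavenumber limits of the NIR region are related by ν (cm−1) = 10^7 / λ (nm), so 2500 nm corresponds to 4000 cm−1 and 780 nm to about 12821 cm−1. A minimal Python sketch of this conversion is given for illustration only; the function names are hypothetical and are not part of the original analysis.

```python
def nm_to_wavenumber(wavelength_nm: float) -> float:
    """Convert a wavelength in nm to a wavenumber in cm^-1."""
    return 1e7 / wavelength_nm


def wavenumber_to_nm(wavenumber_cm1: float) -> float:
    """Convert a wavenumber in cm^-1 to a wavelength in nm."""
    return 1e7 / wavenumber_cm1


if __name__ == "__main__":
    # Limits of the NIR region expressed in wavenumbers
    for nm in (780.0, 2500.0):
        print(f"{nm:.0f} nm = {nm_to_wavenumber(nm):.1f} cm^-1")
    # Limits of the band used for modeling in this study, expressed in nm
    for wn in (7501.74, 4088.35):
        print(f"{wn} cm^-1 = {wavenumber_to_nm(wn):.1f} nm")
```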

Previously, we used the MC-UVE method to screen the NIR spectral information of W. cocos38. On this basis, four variable selection methods (CARS, MC-UVE, SPA and LPG) were employed and compared for NIR variable selection in this study. The common variables were taken from the selection results of the four methods. The characteristic variables were then screened from the common variables to establish Fisher function models for identifying the origin of W. cocos from Yunnan, China. Additionally, the characteristic variables were interpreted.

Results and Discussion

Stability of NIR

The resulting NIR .spc files were converted to .csv data files with the multivariate statistical analysis software SIMCA-P 11.0. The stability of 25 parallel collections of one sample was assessed with Hotelling's T2. The results showed that the parallel spectrum acquisitions possessed satisfactory stability, with coefficients of 4.26 and 7.82 at the 95% and 99% levels, respectively. This indicated that NIR was a reliable method for discriminant analysis.

Principal Component Analysis

To remove the redundant information produced by high-frequency line noise and retain the useful information in the low-frequency region, we applied the spectrum standard deviation (SDD) method to filter the original spectra with TQ 9.2 software39. The wave band 7501.74 cm−1–4088.35 cm−1 (886 wavelength points) was preliminarily selected (as shown in Fig. 1). We then analyzed W. cocos by principal component analysis (PCA). As shown in Fig. 2, BFL and FLP were completely separated in the pattern space of PCA, which indicated that the chemical compositions of the two parts were different. In view of this, we established the discriminant models of BFL and FLP separately.

Figure 1.


The original spectra of BFL and FLP. The red lines represent BFL samples, while the lines in other colors represent FLP samples.

Figure 2.


Principal component scores of BFL and FLP. The black triangles represent BFL samples, while the red squares correspond to FLP samples.

BFL and FLP were then analyzed by PCA separately, and the results are shown in Supplementary Table S1. According to the Kaiser criterion, only factors with eigenvalues greater than or equal to one are accepted as possible sources of variance in the data40. The first five factors, which cumulatively accounted for 97.858% of the spectral variance of BFL and 97.203% of that of FLP, were selected for the subsequent analysis.
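The factor selection described above can be sketched as follows. This is a minimal illustration, assuming that PCA is run on autoscaled data (so that the eigenvalues are those of the correlation matrix and the Kaiser threshold of one is meaningful); the function name `kaiser_pca` is hypothetical and the sketch does not reproduce the software used in the study.

```python
import numpy as np


def kaiser_pca(X):
    """Return eigenvalues (descending), cumulative explained-variance ratio,
    and the number of factors with eigenvalue >= 1 (Kaiser criterion)."""
    X = np.asarray(X, dtype=float)
    # Autoscale: zero mean and unit variance for every variable
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Covariance of autoscaled data equals the correlation matrix of X
    corr = np.cov(Z, rowvar=False)
    eigvals = np.linalg.eigvalsh(corr)[::-1]   # eigvalsh returns ascending order
    eigvals = np.clip(eigvals, 0.0, None)      # remove tiny negative round-off
    cum_ratio = np.cumsum(eigvals) / eigvals.sum()
    n_keep = int(np.sum(eigvals >= 1.0))
    return eigvals, cum_ratio, n_keep
```

With a spectral matrix `X` of shape (samples, wavenumbers), `cum_ratio[k - 1]` gives the cumulative variance explained by the first `k` factors, which is the quantity reported as 97.858% (BFL) and 97.203% (FLP) for five factors.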

Abnormal Samples Diagnosis

During the collection of spectrum information (X) and the measurement of the index (Y), the data (X or Y) might deviate because of abnormal fluctuations of the instrument, and such outlier samples could seriously interfere with the discrimination model. BFL and FLP were therefore analyzed with the modular group iterative singular sample diagnosis method in Matlab R2010a. To establish a steady discriminant model, the abnormal spectra, namely sample 43 of BFL and samples 3, 33 and 35 of FLP, were removed (see Supplementary Fig. S1).

Classification of Training Set and Validation Set

According to the K-S method41,42, the BFL and FLP samples were each divided into training and validation sets in a proportion of 2:1. The training and validation sets of BFL contained 40 and 19 samples, and those of FLP contained 39 and 18 samples, respectively. Each set included samples from all five regions. The training set was used for variable selection and modeling, and the independent validation set was used to validate the model.
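The split can be illustrated with a short sketch, assuming that "K-S" refers to the Kennard-Stone algorithm as it is commonly used in chemometrics: the two most distant samples are chosen first, and then the sample farthest from the already selected set is added repeatedly. The function name and the use of Euclidean distance on the raw feature vectors are assumptions made for illustration.

```python
from math import dist


def kennard_stone(samples, n_select):
    """Return (selected, remaining) sample indices.

    samples  : sequence of equal-length numeric feature vectors
    n_select : number of samples to place in the training set
    """
    n = len(samples)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of samples")
    # Pairwise Euclidean distances
    d = [[dist(a, b) for b in samples] for a in samples]
    # Start with the two samples that are farthest apart
    i0, j0 = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda ij: d[ij[0]][ij[1]],
    )
    selected = [i0, j0]
    remaining = set(range(n)) - {i0, j0}
    # Repeatedly add the sample whose nearest selected neighbour is farthest away
    while len(selected) < n_select:
        nxt = max(remaining, key=lambda k: min(d[k][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected, sorted(remaining)
```

For a 2:1 split of the 59 retained BFL samples, `n_select` would be 40, leaving 19 samples for validation.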

Variable Selection based on CARS

The preliminarily selected dataset 7501.74 cm−1–4088.35 cm−1 (886 wavelength points) was used to investigate the ability of CARS to select key variables by eliminating redundant information. One hundred replicate runs of CARS were executed and the root mean square error of cross validation (RMSECV) values were recorded.

By 10-fold cross validation, the optimal number of principal components was five. The frequency with which each wavenumber was selected was counted. The number of Monte Carlo iterations was set to 50. In each iteration, 80% of the samples in the training set were randomly chosen to build a PLS-DA model. The optimal number of variables was the one with the lowest RMSECV value. Only a small part of the wavelengths was selected by CARS: according to the lowest RMSECV values, twenty key variables were screened for FLP (RMSECV = 1.6202) and forty significant variables were selected for BFL (RMSECV = 1.6767). Compared with the preliminarily selected variables (886 wavelength points), the number of variables was reduced significantly by CARS (see Supplementary Fig. S2).

Variable Selection based on MC-UVE

Five hundred replicate runs of MC-UVE were executed and the RMSECV values were recorded. Ten-fold cross validation and five principal factors of the PLS-DA model were used to explore its prediction performance. The reliability index (RI), defined as the ratio of the mean to the standard deviation of the distribution of each variable's regression coefficient over the Monte Carlo runs, was used to assess the reliability of each variable. All variables were ranked by this index. The variables were then added sequentially to build a PLS-DA model whose performance was assessed by cross validation. The RI of the variable whose addition resulted in the minimum RMSECV value was chosen as the threshold, and the variables with an RI lower than the threshold were removed35.
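The reliability index can be sketched in a few lines. This is only an illustration under stated assumptions: ridge regression is substituted for the PLS model so that the example needs nothing beyond NumPy, and the sampling ratio of 0.8, the ridge penalty and the function name are hypothetical choices that are not taken from the study.

```python
import numpy as np


def reliability_index(X, y, n_runs=500, ratio=0.8, ridge=1e-3, seed=0):
    """Mean / standard deviation of each variable's regression coefficient
    over Monte Carlo subsamples (ridge regression stands in for PLS here)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    n_sub = max(2, int(round(ratio * n)))
    coefs = np.empty((n_runs, p))
    for r in range(n_runs):
        # Draw a random subset of samples without replacement
        idx = rng.choice(n, size=n_sub, replace=False)
        Xs = X[idx] - X[idx].mean(axis=0)
        ys = y[idx] - y[idx].mean()
        # Ridge solution: (X'X + lambda I)^-1 X'y
        coefs[r] = np.linalg.solve(Xs.T @ Xs + ridge * np.eye(p), Xs.T @ ys)
    return coefs.mean(axis=0) / coefs.std(axis=0, ddof=1)
```

Variables whose coefficient is large and stable across subsamples receive a large absolute RI, whereas uninformative variables have coefficients that fluctuate around zero and receive a small one.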

For BFL, the variables with RI values greater than 2.5107 were selected using ten-fold cross validation, giving 95 variables at the minimum ten-fold RMSECV of 1.5601. For FLP, 35 variables with RI values greater than 2.1589 were selected at the minimum ten-fold RMSECV of 1.5852 (see Supplementary Fig. S3).

Variable Selection based on SPA

The three parameters of SPA were set to N = 1000 (the number of Monte Carlo simulations), R = 0.8 (the ratio of samples selected in each Monte Carlo sampling) and Q = 10 (the number of variables sampled in each Monte Carlo simulation). Ten-fold cross validation and five principal components were used to explore its prediction performance. The importance of each variable was assessed by its conditional synergetic score (COSS = −log10(P)): the more significant a variable was, the higher its score. In particular, the variables with COSS values greater than 2 were selected. RMSECV values were recorded, and the number of variables corresponding to the minimum RMSECV value was taken as optimal. At the minimum RMSECV value of 1.6235, 90 informative variables of BFL were selected for further analysis. For FLP, 30 informative variables were selected at the minimum RMSECV value of 1.6428 (see Supplementary Fig. S4).
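Because COSS is the negative decadic logarithm of P, the COSS threshold of 2 corresponds directly to a threshold on P:

```latex
\mathrm{COSS} = -\log_{10}(P) > 2
\;\Longleftrightarrow\;
P < 10^{-2} = 0.01
```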

Variable Selection based on LPG

LPG36 was adopted for wavelength selection in the NIR spectral analysis. The method computes an LPG (score plot) by performing PCA on the NIR spectral data matrix (7501.74 cm−1–4088.35 cm−1) and then detects the non-collinear variables from the LPG. According to the PCA results in Supplementary Table S1, the first two principal components were used for LPG. In the end, 129 variables were selected by LPG for both BFL and FLP (see Supplementary Figs S5 and S6).

Evaluation of the Selected Variables

To further analyze the reliability of the CARS, MC-UVE, SPA and LPG methods, PLS-DA models of BFL and FLP were established with SIMCA-P 11.0 software. The performance of the models was assessed by the determination coefficient (R2), RMSECV and the root mean square error of prediction (RMSEP). Generally, a good model should have a high value of R2 and a low value of RMSECV43. According to the Galtier discriminant criterion, the classification ability was assessed on the prediction sets by examining the predicted values and their deviations (Ypre and Ydev). When Ypre > 0.5 and Ydev < 0.5, the prediction samples belonged to a certain kind of training set; when Ypre < 0.5 and Ydev < 0.5, the prediction samples did not belong to a certain kind of training set; when Ydev > 0.5 and 0.45 < Ydev < 0.5, the prediction samples were suspicious, because they were very close to the threshold 0.5. The 0.45 and 0.55 limits were chosen because they express 10% of error in the results44,45.

Tables 1 and 2 summarize the prediction results of the PLS-DA models built on the NIR variables extracted by the different variable selection methods. Compared with the preliminary variables (7501.74 cm−1–4088.35 cm−1, 886 variables), each variable selection method (CARS, MC-UVE, SPA and LPG) reduced the number of variables. At the same time, the parameters for assessing the PLS-DA models improved: accuracy was maintained or increased, R2 increased, and RMSECV and RMSEP decreased.

Table 1.

Prediction results of PLS-DA models of BFL built by different variable selection methods.

Variable set: 886 spectral variables | 40 spectral variables by CARS | 95 spectral variables by MC-UVE | 90 spectral variables by SPA | 129 spectral variables by LPG
Primary ID | AC CC Ypre Ydev | AC CC Ypre Ydev | AC CC Ypre Ydev | AC CC Ypre Ydev | AC CC Ypre Ydev
BFL-01 1 1 0.836 0.116 1 1 1.001 0.001 1 1 0.654 0.245 1 1 1.023 0.016 1 1 0.805 0.138
BFL-05 1 1 1.629 0.445 1 1 0.774 0.160 1 1 0.644 0.252 1 1 1.23 0.163 1 1 0.548 0.320
BFL-20 1 1 1.536 0.379 1 1 0.834 0.117 1 1 0.541 0.324 1 1 0.805 0.138 1 1 0.749 0.178
BFL-34 1 1 1.618 0.437 1 1 0.719 0.199 1 1 0.621 0.268 1 1 1.205 0.145 1 1 0.762 0.168
BFL-40 1 1 0.835 0.117 1 1 1.168 0.119 1 1 0.711 0.204 1 1 1.577 0.408 1 1 0.822 0.126
BFL-48 1 SU 1.782 0.553 1 1 1.119 0.084 1 1 0.685 0.223 1 1 1.400 0.283 1 1 0.725 0.194
BFL-33 2 2 2.004 0.003 2 2 2.084 0.059 2 2 1.756 0.173 2 2 1.849 0.107 2 2 1.758 0.171
BFL-42 2 SU 2.635 0.450 2 2 1.554 0.315 2 2 1.539 0.326 2 2 1.87 0.092 2 2 1.682 0.225
BFL-49 2 2 1.928 0.051 2 2 1.593 0.288 2 2 2.192 0.136 2 2 1.963 0.026 2 2 1.920 0.057
BFL-54 2 2 1.845 0.110 2 2 1.967 0.023 2 2 1.963 0.026 2 2 1.72 0.198 2 2 1.821 0.126
BFL-55 2 2 1.751 0.176 2 2 1.942 0.041 2 2 1.882 0.083 2 2 2.034 0.024 2 2 1.716 0.201
BFL-12 3 3 2.802 0.140 3 3 3.019 0.013 3 3 2.562 0.310 3 3 2.88 0.085 3 3 2.575 0.300
BFL-15 4 4 3.844 0.110 4 4 4.593 0.419 4 4 3.744 0.181 4 4 3.876 0.088 4 4 3.582 0.296
BFL-37 4 4 3.948 0.037 4 4 3.712 0.204 4 4 3.720 0.198 4 4 3.9 0.071 4 4 3.807 0.137
BFL-47 4 4 3.893 0.076 4 4 3.873 0.090 4 4 3.861 0.098 4 4 3.657 0.243 4 4 3.817 0.129
BFL-04 5 SU 5.653 0.462 5 5 4.901 0.070 5 5 4.782 0.154 5 5 4.565 0.308 5 5 4.576 0.300
BFL-13 5 5 4.813 0.132 5 5 5.126 0.089 5 5 5.117 0.083 5 5 4.685 0.223 5 5 4.790 0.149
BFL-16 5 5 4.904 0.068 5 5 5.047 0.033 5 5 4.751 0.176 5 5 4.818 0.129 5 5 5.013 0.009
BFL-25 5 5 4.965 0.025 5 5 4.829 0.121 5 5 4.995 0.004 5 5 4.762 0.168 5 5 4.829 0.121
Accuracy (%) 84.21 100 100 100 100
R2 0.940 0.977 0.966 0.972 0.970
RMSECV 0.290 0.181 0.219 0.197 0.208
RMSEP 0.382 0.239 0.289 0.260 0.274

Note: AC (Actual class), CC (Calculated class), Ypre (Predicted value), Ydev (Deviation), SU (Suspicious).
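The formulas behind RMSEP and Ydev are not given in the text. The sketch below assumes the usual root-mean-square definition for RMSEP and, for Ydev, the standard deviation of the pair (Ypre, AC), that is |Ypre − AC|/√2. With the Ypre values of the "886 spectral variables" columns of Table 1, these assumptions reproduce the tabulated RMSEP of 0.382 and, to within rounding, the tabulated Ydev values; they are inferred from the table rather than stated in the text, and the function names are hypothetical.

```python
from math import sqrt


def deviation(y_pred, actual_class):
    """Assumed Ydev: standard deviation of the two values (Ypre, AC)."""
    return abs(y_pred - actual_class) / sqrt(2)


def rmse(y_pred, y_true):
    """Root mean square error between predicted values and actual classes."""
    if len(y_pred) != len(y_true) or not y_pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return sqrt(sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true))


# AC and Ypre of the 19 BFL validation samples, 886-variable model (Table 1)
AC = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 4, 4, 4, 5, 5, 5, 5]
YPRE_886 = [0.836, 1.629, 1.536, 1.618, 0.835, 1.782, 2.004, 2.635, 1.928,
            1.845, 1.751, 2.802, 3.844, 3.948, 3.893, 5.653, 4.813, 4.904,
            4.965]

if __name__ == "__main__":
    print(f"RMSEP = {rmse(YPRE_886, AC):.3f}")
    print(f"Ydev of BFL-01 = {deviation(YPRE_886[0], AC[0]):.3f}")
```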

Table 2.

Prediction results of PLS-DA models of FLP built by different variable selection methods.

Variable set: 886 spectral variables | 20 spectral variables by CARS | 35 spectral variables by MC-UVE | 30 spectral variables by SPA | 129 spectral variables by LPG
Primary ID | AC CC Ypre Ydev | AC CC Ypre Ydev | AC CC Ypre Ydev | AC CC Ypre Ydev | AC CC Ypre Ydev
FLP-01 1 1 0.940 0.042 1 1 0.551 0.317 1 1 1.078 0.055 1 1 0.816 0.130 1 1 0.852 0.105
FLP-05 1 1 1.622 0.440 1 1 0.488 0.362 1 1 0.598 0.285 1 1 0.652 0.246 1 1 0.434 0.400
FLP-32 1 1 1.015 0.011 1 1 1.114 0.081 1 1 0.593 0.288 1 1 0.627 0.264 1 1 1.057 0.040
FLP-34 1 1 1.608 0.430 1 UN 0.380 0.438 1 1 1.290 0.205 1 1 1.172 0.122 1 1 0.974 0.018
FLP-40 1 1 0.799 0.142 1 1 1.133 0.094 1 1 0.609 0.277 1 1 1.013 0.009 1 1 0.906 0.066
FLP-50 1 1 0.810 0.134 1 1 1.138 0.098 1 1 0.866 0.095 1 1 1.436 0.308 1 1 0.870 0.092
FLP-59 1 1 0.617 0.271 1 1 0.971 0.021 1 1 0.579 0.298 1 1 1.321 0.227 1 1 0.647 0.249
FLP-30 2 2 1.895 0.074 2 2 1.474 0.372 2 2 2.465 0.329 2 2 1.802 0.140 2 2 1.737 0.186
FLP-46 2 2 2.681 0.482 2 2 1.502 0.352 2 2 1.701 0.212 2 2 2.47 0.332 2 2 1.604 0.280
FLP-49 2 2 2.587 0.415 2 2 2.053 0.037 2 2 1.509 0.347 2 2 1.866 0.095 2 2 1.521 0.339
FLP-54 2 2 1.855 0.102 2 2 1.566 0.307 2 2 1.663 0.238 2 2 1.657 0.243 2 2 1.962 0.027
FLP-08 3 3 2.989 0.008 3 3 3.368 0.260 3 3 3.039 0.028 3 3 3.469 0.332 3 3 2.899 0.072
FLP-26 4 4 3.616 0.271 4 4 3.900 0.071 4 4 4.463 0.327 4 4 3.779 0.156 4 4 3.828 0.121
FLP-45 4 4 4.412 0.291 4 4 4.350 0.247 4 4 3.723 0.196 4 SU 3.275 0.513 4 4 4.487 0.344
FLP-04 5 5 5.407 0.288 5 5 5.474 0.335 5 5 4.683 0.224 5 5 5.321 0.227 5 5 5.482 0.341
FLP-19 5 5 5.557 0.394 5 5 4.774 0.160 5 5 5.025 0.018 5 5 5.477 0.337 5 5 4.666 0.236
FLP-23 5 SU 5.727 0.514 5 5 4.785 0.152 5 5 5.335 0.237 5 5 4.589 0.291 5 5 4.760 0.170
FLP-25 5 5 4.777 0.158 5 5 4.599 0.284 5 5 4.789 0.149 5 5 4.720 0.198 5 5 4.809 0.135
Accuracy (%) 94.44 94.44 100 94.44 100
R2 0.932 0.950 0.958 0.949 0.964
RMSECV 0.311 0.268 0.245 0.269 0.225
RMSEP 0.410 0.353 0.323 0.354 0.296

Note: AC (Actual Class), CC (Calculated Class), Ypre (Predicted value), Ydev (Deviation), UN (uncredited), SU (suspicious).

For BFL, the prediction accuracy of the PLS-DA models built on the NIR variables extracted by each of the four methods reached 100%. The order of R2 was CARS > SPA > LPG > MC-UVE, while RMSECV and RMSEP followed exactly the opposite order, CARS < SPA < LPG < MC-UVE. All four methods showed satisfactory prediction performance for BFL.

For FLP, the prediction accuracy reached 100% for the PLS-DA models built on the variables extracted by the MC-UVE and LPG methods, and 94.44% for the CARS and SPA methods. The order of R2 was LPG > MC-UVE > CARS > SPA, and RMSECV and RMSEP followed the opposite order, LPG < MC-UVE < CARS < SPA. For FLP, the results of MC-UVE and LPG were therefore better than those of CARS and SPA.

The prediction results of the models were significantly improved by variable selection, and the models were also simplified because only a small part of the variables was used. These results demonstrated experimentally the necessity of performing variable selection before building a calibration model.

Common Variables Analysis

Based on the results of the four variable selection methods, the variables that were selected more than twice were chosen as the common variables for further analysis. In total, 56 common variables of BFL and 21 common variables of FLP were chosen.
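Counting how often each variable is selected across methods can be sketched as follows. The method names come from the text, but the wavenumber values in the usage example are invented placeholders, the function name is hypothetical, and the threshold is exposed as a parameter because "more than twice" read literally means at least three of the four methods.

```python
from collections import Counter


def common_variables(selections, min_methods=3):
    """Return the variables chosen by at least `min_methods` selection methods.

    selections : dict mapping a method name to an iterable of selected variables
    """
    # Count each variable once per method, even if a method lists it twice
    counts = Counter(v for selected in selections.values() for v in set(selected))
    return sorted(v for v, c in counts.items() if c >= min_methods)


if __name__ == "__main__":
    # Placeholder wavenumbers, for illustration only
    toy = {
        "CARS": [4100, 4200, 4300],
        "MC-UVE": [4100, 4200, 4400],
        "SPA": [4100, 4300, 4400],
        "LPG": [4100, 4500],
    }
    print(common_variables(toy, min_methods=3))
```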

Based on the PCA results, PLS-DA was performed on the 56 common variables of BFL. As shown in Fig. 3a, the first two principal components cumulatively accounted for 64.9% of the variation, and the BFL samples were visibly separated into five groups. The loading scatter plot (Fig. 3b) displays the contribution of each variable to the discrimination: the farther a variable lies from the origin of the X- and Y-axes, the more it contributes to the classification46. By visual analysis, the variables 4092.21, 4096.06, 4308.19, 4439.33, 4597.46, 5079.58 and 5866.40 cm−1 were preliminarily identified. The biplot (Fig. 3c) gives a better understanding of the relationships between samples and variables in one plot. It showed that the variable 5866.40 cm−1 was positively correlated with the samples in class 1 in the (+, −) quadrant. The variable 4597.46 cm−1 was positively correlated with the samples in classes 2, 3 and 4 in the (−, +) quadrant, and negatively correlated with those in class 1 in the (+, −) quadrant. The variables 4092.21, 4096.06, 4439.33 and 5079.58 cm−1 were positively correlated with the samples in classes 1, 2 and 3 in the (−, −) quadrant, and negatively correlated with those in class 5 in the (+, +) quadrant. The variable 4308.19 cm−1 was positively correlated with the samples in class 5 in the (+, +) quadrant. These variables were the most important markers for separating the BFL samples into the five classes.

Figure 3.


Chemometric analysis of common variables of BFL. (a) PLS-DA scores scatter plot. (b) PLS-DA loading scatter plot. (c) PLS-DA loadings biplot. (d) Fisher discriminant analysis scatter plot.

PLS-DA was likewise conducted on the 21 common variables of FLP. In Fig. 4a, the first two principal components cumulatively accounted for 68.0% of the variation: the first principal component explained 38.8% of the total variance and the second explained 29.2%. The FLP samples were distinctly separated into five groups. By visual analysis of the loading scatter plot (Fig. 4b), we found that the variables 4508.75, 4952.30, 5230.00, 5233.86, 5303.28, 5634.98, 5685.12, 5874.11 and 5928.11 cm−1 made a significant contribution to the discrimination. The biplot (Fig. 4c) showed that the variables 5230.00 and 5233.86 cm−1 were positively correlated with the samples in class 1 in the (+, −) quadrant. The variables 4508.75 and 5303.28 cm−1 were positively correlated with the samples in classes 3 and 4 in the (−, +) quadrant, and negatively correlated with those in class 1 in the (+, −) quadrant. The variables 5634.98 and 5685.12 cm−1 were positively correlated with the samples in class 2 in the (−, −) quadrant, and negatively correlated with those in classes 1 and 5 in the (+, +) quadrant. The variables 4952.30, 5874.11 and 5928.11 cm−1 were positively correlated with the samples in classes 1 and 5 in the (+, +) quadrant. These variables were the most important markers for separating the FLP samples into the five classes.

Figure 4.


Chemometric analysis of common variables of FLP. (a) PLS-DA scores scatter plot. (b) PLS-DA loading scatter plot. (c) PLS-DA loadings biplot. (d) Fisher discriminant analysis scatter plot.

Establishment of Discriminant Analysis Functions

To identify and analyze unknown samples, Fisher discriminant function models were established. Using the stepwise regression method, the common variables that made a greater contribution to classification were further screened. As a result, thirteen variables (4092.21, 4096.06, 4165.49, 4308.19, 4439.33, 4485.61, 4501.04, 4566.61, 4570.47, 4597.46, 4612.89, 5079.58 and 5866.40 cm−1) were selected for BFL, seven of which had been identified in the above discussion of PLS-DA. Ten variables (4123.06, 4508.75, 4952.30, 5230.00, 5233.86, 5303.28, 5634.98, 5685.12, 5874.11 and 5928.11 cm−1) were selected for FLP, nine of which had been recognized in the discussion of PLS-DA. The results of stepwise regression were thus in accordance with those of PLS-DA, which indicated that these variables could be regarded as the characteristic identification marks of W. cocos.

In the Fisher discriminant analysis, the thirteen variables of BFL and the ten variables of FLP were used as the respective discriminant variables, and the BFL and FLP samples were used to establish the Fisher discriminant functions. The function of BFL is shown below and its coefficients are listed in Table 3:

$$Y = A_0 + A_1X_1 - A_2X_2 - A_3X_3 + A_4X_4 + A_5X_5 + A_6X_6 - A_7X_7 + A_8X_8 - A_9X_9 - A_{10}X_{10} + A_{11}X_{11} + A_{12}X_{12} + A_{13}X_{13}$$

where Xi were the corresponding variables and Yi (Y1 to Y5 in Table 3) was the function of the corresponding class.

Table 3.

The coefficients of Fisher functions of BFL.

Coefficient Y1 Y2 Y3 Y4 Y5
A0 3698.54 4029.41 3761.75 3749.1 3788.01
A1 8.31E+07 8.18E+07 8.80E+07 7.77E+07 7.97E+07
A2 7.73E+07 7.58E+07 8.20E+07 7.10E+07 7.39E+07
A3 5.84E+07 6.18E+07 5.92E+07 6.04E+07 6.03E+07
A4 6.01E+06 3.62E+06 4.34E+06 5.36E+06 5.96E+06
A5 4.03E+06 5.67E+06 5.35E+06 5.74E+06 4.64E+06
A6 1.69E+08 1.77E+08 1.71E+08 1.71E+08 1.73E+08
A7 1.51E+08 1.59E+08 1.55E+08 1.53E+08 1.55E+08
A8 7.42E+08 7.79E+08 7.84E+08 7.48E+08 7.56E+08
A9 4.35E+08 4.65E+08 4.81E+08 4.52E+08 4.54E+08
A10 5.83E+08 5.79E+08 5.64E+08 5.47E+08 5.73E+08
A11 4.48E+08 4.40E+08 4.33E+08 4.20E+08 4.44E+08
A12 6.64E+06 8.38E+06 8.07E+06 7.87E+06 6.05E+06
A13 2.73E+07 2.27E+07 2.36E+07 2.37E+07 2.77E+07

The function of FLP is shown below and its coefficients are listed in Table 4:

$$Y = B_0 - B_1T_1 + B_2T_2 + B_3T_3 + B_4T_4 - B_5T_5 + B_6T_6 - B_7T_7 + B_8T_8 + B_9T_9 - B_{10}T_{10}$$

where Ti were the corresponding variables and Yi (Y1 to Y5 in Table 4) was the function of the corresponding class.

Table 4.

The coefficients of Fisher functions of FLP.

Coefficient Y1 Y2 Y3 Y4 Y5
B0 1075.23 1133.72 1171.49 1174.26 1126.33
B1 7.48E+06 7.47E+06 7.39E+06 7.23E+06 7.43E+06
B2 1.07E+06 6.71E+05 4.47E+05 3.29E+07 7.55E+05
B3 9.58E+06 9.37E+06 9.41E+06 9.30E+06 9.48E+06
B4 7.00E+06 8.03E+06 8.10E+06 8.57E+06 7.48E+06
B5 5.97E+06 7.05E+06 7.28E+06 7.98E+06 6.57E+06
B6 5.41E+06 5.90E+06 5.86E+06 5.86E+06 5.64E+06
B7 9.71E+07 9.82E+07 1.00E+08 9.89E+07 9.93E+07
B8 4.87E+07 4.91E+07 4.91E+07 4.98E+07 4.95E+07
B9 1.65E+07 1.68E+07 1.69E+07 1.63E+07 1.66E+07
B10 1.42E+07 1.53E+07 1.55E+07 1.47E+07 1.44E+07

The results of the Fisher discriminant analysis are shown in Figs 3d and 4d, and the discrimination models were evaluated by cross validation. As seen in the two figures, the ungrouped prediction samples fell into different classes, and the class of each ungrouped sample could be identified from its distances to the centroids of all classes. The validation results are shown in Tables 5 and 6. Of the original grouped samples, 97.50% for BFL and 97.43% for FLP were correctly classified. In the cross validation, the accuracy rates were 94.74% for BFL and 94.44% for FLP. In our previous study, in which the Fisher discriminant functions were built on the wavelengths selected by the MC-UVE method alone, 92.50% (BFL) and 92.86% (FLP) of the original grouped samples were correctly classified, and the cross-validation accuracy rates were 80.95% for BFL and 83.33% for FLP38. The correct classification rates were thus markedly improved in this study, both for the original grouped samples and for the cross validation sets. The validation results indicated that the Fisher discriminant function models built on the characteristic variables selected jointly by the four methods CARS, MC-UVE, SPA and LPG could be regarded as reliable and effective for discriminating the origins of BFL and FLP samples.

Table 5.

The validation results of the Fisher discriminant analysis of BFL.

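Validation | Statistics | Class | 1 | 2 | 3 | 4 | 5 | Total
Original (a) | Count | 1 | 13 | 0 | 0 | 0 | 0 | 13
Original (a) | Count | 2 | 0 | 7 | 0 | 0 | 0 | 7
Original (a) | Count | 3 | 0 | 0 | 4 | 0 | 0 | 4
Original (a) | Count | 4 | 0 | 0 | 0 | 6 | 0 | 6
Original (a) | Count | 5 | 1 | 0 | 0 | 0 | 9 | 10
Original (a) | Accuracy rate % | | 100 | 100 | 100 | 100 | 88.9 | 40
Cross validation (b) | Count | 1 | 6 | 0 | 0 | 0 | 0 | 6
Cross validation (b) | Count | 2 | 0 | 5 | 0 | 0 | 0 | 5
Cross validation (b) | Count | 3 | 0 | 0 | 1 | 0 | 0 | 1
Cross validation (b) | Count | 4 | 0 | 0 | 1 | 2 | 0 | 3
Cross validation (b) | Count | 5 | 0 | 0 | 0 | 0 | 4 | 4
Cross validation (b) | Accuracy rate % | | 100 | 100 | 100 | 66.7 | 100 | 19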

Note: (a) 97.50% of original grouped cases correctly classified. (b) Cross validation is done only for those cases in the analysis; in cross validation, each case is classified by the functions derived from all cases other than that case. 94.74% of the cross-validated grouped cases correctly classified.
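The overall rates quoted in the note follow directly from the counts in Table 5, as the following minimal sketch shows. Rows are actual classes and columns are predicted classes; the function and variable names are hypothetical.

```python
def overall_accuracy(confusion):
    """Percentage of cases on the diagonal of a square confusion matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return 100.0 * correct / total


# Counts from Table 5 (BFL): rows = actual class, columns = predicted class
BFL_ORIGINAL = [
    [13, 0, 0, 0, 0],
    [0, 7, 0, 0, 0],
    [0, 0, 4, 0, 0],
    [0, 0, 0, 6, 0],
    [1, 0, 0, 0, 9],
]
BFL_CROSS_VALIDATION = [
    [6, 0, 0, 0, 0],
    [0, 5, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 2, 0],
    [0, 0, 0, 0, 4],
]

if __name__ == "__main__":
    print(f"original: {overall_accuracy(BFL_ORIGINAL):.2f}%")            # 39 of 40
    print(f"cross validation: {overall_accuracy(BFL_CROSS_VALIDATION):.2f}%")  # 18 of 19
```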

Table 6.

The validation results of the Fisher discriminant analysis of FLP.

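Validation | Statistics | Class | 1 | 2 | 3 | 4 | 5 | Total
Original (a) | Count | 1 | 13 | 0 | 0 | 0 | 0 | 13
Original (a) | Count | 2 | 0 | 8 | 0 | 0 | 0 | 8
Original (a) | Count | 3 | 0 | 0 | 3 | 0 | 0 | 3
Original (a) | Count | 4 | 0 | 0 | 0 | 7 | 0 | 7
Original (a) | Count | 5 | 1 | 0 | 0 | 0 | 7 | 8
Original (a) | Accuracy rate % | | 100 | 100 | 100 | 100 | 87.5 | 39
Cross validation (b) | Count | 1 | 6 | 0 | 0 | 0 | 0 | 6
Cross validation (b) | Count | 2 | 0 | 4 | 0 | 0 | 0 | 4
Cross validation (b) | Count | 3 | 0 | 0 | 2 | 0 | 0 | 2
Cross validation (b) | Count | 4 | 0 | 0 | 0 | 2 | 0 | 2
Cross validation (b) | Count | 5 | 1 | 0 | 0 | 0 | 3 | 4
Cross validation (b) | Accuracy rate % | | 100 | 100 | 100 | 100 | 75 | 18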

Note: (a) 97.43% of original grouped cases correctly classified. (b) Cross validation is done only for those cases in the analysis; in cross validation, each case is classified by the functions derived from all cases other than that case. 94.44% of the cross-validated grouped cases correctly classified.

Interpretation of the Characteristic Variables

To further understand the significance of these characteristic variables, we interpreted their spectra-structure correlations. The wavenumbers 4092.21, 4096.06, 4123.06, 4165.49, 4566.61 and 4570.47 cm−1 are related to the aryl C-H vibration in the benzene band. The absorption band at 4308.19 cm−1 is the combination of C-H stretch and CH2 deformation in polysaccharides. The band at 4439.33 cm−1 is the combination of O-H and C-O stretch in glucose. The band at 4485.61 cm−1 is assigned to the second overtones of the symmetric and asymmetric bending vibrations of the CH2 of the uncoupled vinyl group. The absorbance peaks at 4501.04 and 4508.75 cm−1 are the combination of the asymmetric N-H stretch and NH2 rocking in urea (NH2-C=O-NH2). The absorbance peak at 4597.46 cm−1 is due to CONH2 as a combination of the amide B and amide II modes. The band at 4612.89 cm−1 is assigned to CONH2, specifically due to the α-helix peptide structure. The absorption band at 5079.58 cm−1 is the combination of the N-H stretching vibration and N-H bending in aromatic amine. The absorbance peak at 5866.40 cm−1 corresponds to the first overtone of the C-H stretch vibration in CH3. The absorption band at 4952.30 cm−1 is due to a combination of the O-H stretch and C-H bending. The bands at 5230.00, 5233.86 and 5303.28 cm−1 are hydroxyl bands. The peaks at 5634.98 and 5685.12 cm−1 are related to C-H in methylene. The band at 5874.11 cm−1 is assigned to C-H in methyl, while that at 5928.11 cm−1 is assigned to C-H in methyl associated with OH47. According to these absorption peaks, we could speculate that the chemical compositions of BFL and FLP were different, which provided a theoretical basis at the spectral level for the separate traditional usage of the cutis (FLP) and the inner part (BFL) of the sclerotia of W. cocos.

Conclusions

In this work, we systematically collected, for the first time, the near-infrared spectra of the cutis (FLP) and the inner part (BFL) of the sclerotia of W. cocos from different regions in Yunnan, China. Interestingly, we found obvious differences between FLP and BFL in the pattern space of PCA, and on this basis we established discriminant models for FLP and BFL separately. The common variables were selected with four variable selection methods, CARS, MC-UVE, SPA and LPG. The characteristic variables were then screened to build Fisher discriminant function models, and the validation results showed that the models were reliable and effective. Variable selection applied to NIR spectra provides a new approach to the origin identification of traditional Chinese medicines. The spectral difference between the cutis (FLP) and the inner part (BFL) of the sclerotia of W. cocos provides a theoretical basis at the spectral level for the separate traditional usage of FLP and BFL.

Methods

Materials

Sixty W. cocos samples were collected from five different areas of Yunnan Province, China, from July to August 2015: central Yunnan (19), western Yunnan (12), northwestern Yunnan (5), southwestern Yunnan (10) and southeastern Yunnan (14). They were identified and authenticated by Professor H. Jin, Yunnan Academy of Agricultural Sciences. The specimens were preserved in the Institute of Medicinal Plants, Yunnan Academy of Agricultural Sciences. The samples were separated into FLP and BFL. After drying at room temperature, the samples were ground to fine powder and stored in zip lock bags for further analysis. The detailed sample information is listed in Supplementary Table S2.

Instruments

An Antaris II Fourier transform near-infrared spectrometer (Thermo Fisher Scientific Inc., USA) equipped with a diffuse reflection module was used. The spectrum collection software Result™ 2.1 and the analysis software TQ 9.2 supplied with the instrument were employed. A DFT-100 traditional Chinese medicine grinder (Zhejiang Wenling Linda Machinery Co., Ltd.) and an 80-mesh stainless steel sieve tray (Tai’an of Chinese and western, Beijing) were used. The multivariate data analysis software packages were SIMCA-P 11.0 (Umetrics, Umeå, Sweden), SPSS 19.0 (SPSS Inc., Chicago, USA) and MATLAB R2010a, and the code was derived from http://www.mathworks.cn/.

Spectra Collection

The powder (20.0 g) was weighed and thoroughly mixed, then transferred to the NIR sample cup and compressed. The collection parameters were 64 scans, a resolution of 4 cm−1, a scanning range of 10000–4000 cm−1 and three parallel collections. The NIR spectra of W. cocos were preprocessed successively with Norris filtering, mean centering, standardization and the second derivative in TQ 9.2. After optimization, the range 7501.74–4088.35 cm−1 was selected according to the standard deviation of the spectra: the higher the standard deviation of the spectra, the greater the contribution to classification.
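Two of the ideas in this section, the second derivative and the use of the spectral standard deviation to pick informative variables, can be illustrated with a short sketch. It is not the TQ 9.2 procedure: the central finite difference stands in for the derivative step (it is not the Norris filter), and the threshold, function names and absorbance values are all hypothetical.

```python
from statistics import pstdev


def second_derivative(spectrum):
    """Central finite-difference second derivative of one spectrum.

    A simple stand-in for the derivative step; it is not the Norris filter
    and it drops the two end points of the spectrum.
    """
    return [spectrum[i - 1] - 2.0 * spectrum[i] + spectrum[i + 1]
            for i in range(1, len(spectrum) - 1)]


def select_by_std(spectra, threshold):
    """Return indices of variables whose standard deviation across samples
    exceeds the (hypothetical) threshold."""
    columns = zip(*spectra)  # one tuple of values per variable
    return [j for j, col in enumerate(columns) if pstdev(col) > threshold]


# Toy data: three samples by three variables (made-up absorbance values).
spectra = [
    [0.10, 0.50, 0.90],
    [0.10, 0.60, 0.30],
    [0.10, 0.40, 0.60],
]

print(select_by_std(spectra, threshold=0.05))
print(second_derivative([0.0, 1.0, 4.0, 9.0, 16.0]))
```

In the toy data the first variable is constant across samples, so it carries no information for classification and is dropped, which mirrors the reasoning behind choosing the range by standard deviation.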

Acknowledgements

This work was supported by the National Natural Science Foundation of China (31460538 and 81660638).

Author Contributions

T.J. Yuan and Y.Z. Wang planned the research and wrote the manuscript. Y.L. Zhao and J. Zhang performed all the experiments and analyses.

Competing Interests

The authors declare that they have no competing interests.

Footnotes

Electronic supplementary material

Supplementary information accompanies this paper at 10.1038/s41598-017-18458-9.

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Chen JB, Sun SQ, Ma F, Zhou Q. Vibrational microspectroscopic identification of powdered traditional medicines: Chemical micromorphology of Poria observed by infrared and Raman microspectroscopy. Spectrochim. Acta A. 2014;128:629–637. doi: 10.1016/j.saa.2014.03.010.
  • 2. Wang WH, et al. Comparative study of lanostane-type triterpene acids in different parts of Poria cocos (Schw.) Wolf by UHPLC-Fourier transform MS and UHPLC-triple quadruple MS. J. Pharm. Biomed. Anal. 2015;102:203–214. doi: 10.1016/j.jpba.2014.09.014.
  • 3. Wang YZ, et al. Mycology, cultivation, traditional uses, phytochemistry and pharmacology of Wolfiporia cocos (Schwein.) Ryvarden et Gilb: A review. J. Ethnopharmacol. 2013;147:265–276. doi: 10.1016/j.jep.2013.03.027.
  • 4. Zan JF, et al. Comparative study on the quality of Poria cocos from twenty different origin places. Chin. J. Infor. Tradit. Chin. Med. 2010;17:34–36.
  • 5. Song X, Xie ZM, Huang D, Zhong C, Zhou HY. Compariason of polysaccharide content in different medicinal part of Poria cocos from different origin. J. Shandong Univ. Tradit. Chin. Med. 2015;39:186–189.
  • 6. Zhang L, et al. Metabolic profiling of Chinese tobacco leaf of different geographical origins by GC-MS. J. Agric. Food Chem. 2013;61:2597–2605. doi: 10.1021/jf400428t.
  • 7. Cheng JG, Wang XF, Fan LZ, Yang XP, Yang PW. Variations of Yunnan climatic zones in recent 50 years. Prog. Geog. 2009;28:18–24.
  • 8. Ma F, et al. Analysis and identification of Poria cocos peels harvested from different producing areas by FTIR and 2D-IR correlation spectroscopy. Spectrosc. Spect. Anal. 2014;34:376–380.
  • 9. Li K, Zhang LQ, Nie J. Study on UPLC-UV-MS fingerprints of different medicinal parts of poria cocos. J. Chin. Med. Mater. 2013;36:382–387.
  • 10. Kudo M, Watt RA, Moffat AC. Rapid identification of Digitalis purpurea using near-infrared reflectance spectroscopy. J. Pharm. Pharmacol. 2000;52:1271–1277. doi: 10.1211/0022357001777252.
  • 11. Lu J, et al. Application of two-dimensional near-infrared correlation spectroscopy to the discrimination of Chinese herbal medicine of different geographic regions. Spectrochim. Acta A. 2008;69:580–586. doi: 10.1016/j.saa.2007.05.006.
  • 12. Duan XJ, Zhang DL, Nie L, Zang HC. Rapid discrimination of geographical origin and evaluation of antioxidant activity of Salvia miltiorrhiza var. alba by Fourier transform near infrared spectroscopy. Spectrochim. Acta A. 2014;122:751–757. doi: 10.1016/j.saa.2013.12.003.
  • 13. Zhao YL, et al. Discrimination of wild Paris based on near infrared spectroscopy and high performance liquid chromatography combined with multivariate analysis. PLoS ONE. 2014;9:e89100. doi: 10.1371/journal.pone.0089100.
  • 14. Wang P, Yu ZG. Species authentication and geographical origin discrimination of herbal medicines by near infrared spectroscopy: A review. J. Pharmaceut. Anal. 2015;5:277–284. doi: 10.1016/j.jpha.2015.04.001.
  • 15. Wu XH, Wu B, Sun J, Li M. Rapid discrimination of apple varieties via near-infrared reflectance spectroscopy and fast allied fuzzy C-means clustering. Int. J. Food Eng. 2015;11:23–30. doi: 10.1515/ijfe-2014-0117.
  • 16. Meng Y, Wang SS, Cai R, Jiang BH, Zhao WJ. Discrimination and content analysis of fritillaria using near-infrared spectroscopy. J. Anal. Methods Chem. 2015;2015:101–124. doi: 10.1155/2015/752162.
  • 17. Yun YH, et al. A strategy that iteratively retains informative variables for selecting optimal variable subset in multivariate calibration. Anal. Chim. Acta. 2014;807:36–43. doi: 10.1016/j.aca.2013.11.032.
  • 18. Nørgaard L, et al. Interval partial least-squares regression (iPLS): A comparative chemometric study with an example from near-infrared spectroscopy. Appl. Spectrosc. 2000;54:413–419. doi: 10.1366/0003702001949500.
  • 19. Rahman A, Kondo N, Ogawa Y, Suzuki T, Kanamori K. Determination of K value for fish flesh with ultraviolet–visible spectroscopy and interval partial least squares (iPLS) regression method. Biosyst. Eng. 2016;141:12–18. doi: 10.1016/j.biosystemseng.2015.10.004.
  • 20. Leardi R, Nørgaard L. Sequential application of backward interval partial least squares and genetic algorithms for the selection of relevant spectral regions. J. Chemometr. 2004;18:486–497. doi: 10.1002/cem.893.
  • 21. Jiang JH, Berry RJ, Siesler HW, Ozaki Y. Wavelength interval selection in multicomponent spectral analysis by moving window partial least-squares regression with applications to mid-infrared and near-infrared spectroscopic data. Anal. Chem. 2002;74:3555–3565. doi: 10.1021/ac011177u.
  • 22. Leardi R. Application of genetic algorithm-PLS for feature selection in spectral data sets. J. Chemometr. 2000;14:643–655. doi: 10.1002/1099-128X(200009/12)14:5/6<643::AID-CEM621>3.0.CO;2-E.
  • 23. Shinzawa H, Li B, Nakagawa T, Maruo K, Ozaki Y. Multi-objective genetic algorithm-based sample selection for partial least squares model building with applications to near-infrared spectroscopic data. Appl. Spectrosc. 2006;60:631–640. doi: 10.1366/000370206777670576.
  • 24. Koljonen J, Nordling TEM, Alander JT. A review of genetic algorithms in near infrared spectroscopy and chemometrics: past and future. J. Near Infrared Spectrosc. 2008;16:189–197. doi: 10.1255/jnirs.778.
  • 25. Brusco M. A comparison of simulated annealing algorithms for variable selection in principal component analysis and discriminant analysis. J. Comput. Stat. Data Anal. 2014;77:38–53. doi: 10.1016/j.csda.2014.03.001.
  • 26. Li HD, Liang YZ, Xu QS, Cao DS. Key wavelengths screening using competitive adaptive reweighted sampling method for multivariate calibration. Anal. Chim. Acta. 2009;648:77–84. doi: 10.1016/j.aca.2009.06.046.
  • 27. Zheng KY, et al. Stability competitive adaptive reweighted sampling (SCARS) and its applications to multivariate calibration of NIR spectra. Chemometr. Intell. Lab. Syst. 2012;112:48–54. doi: 10.1016/j.chemolab.2012.01.002.
  • 28. Fan W, et al. Application of competitive adaptive reweighted sampling method to determine effective wavelengths for prediction of total acid of vinegar. Food Anal. Methods. 2012;5:585–590.
  • 29. Cai WS, Li YK, Shao XG. A variable selection method based on uninformative variable elimination for multivariate calibration of near-infrared spectra. Chemometr. Intell. Lab. Syst. 2008;90:188–194. doi: 10.1016/j.chemolab.2007.10.001.
  • 30. Han QJ, Wu HL, Cai CB, Xu L, Yu RQ. An ensemble of Monte Carlo uninformative variable elimination for wavelength selection. Anal. Chim. Acta. 2008;612:121–125. doi: 10.1016/j.aca.2008.02.032.
  • 31. Zhang BH, et al. Hyperspectral imaging combined with multivariate analysis and band math for detection of common defects on peaches (Prunus persica). Comput. Electron. Agr. 2015;114:14–24. doi: 10.1016/j.compag.2015.03.015.
  • 32. Li JB, et al. Variable selection in visible and near-infrared spectral analysis for noninvasive determination of soluble solids content of ‘Ya’ pear. Food Anal. Methods. 2014;7:1891–1902. doi: 10.1007/s12161-014-9832-8.
  • 33. Li JB, Zhao CJ, Huang WQ, Zhang C, Peng YK. A combination algorithm for variable selection to determine soluble solid content and firmness of pears. Anal. Methods. 2014;6:2170–2180. doi: 10.1039/C3AY42165A.
  • 34. Wu T, et al. Application of metabolomics in traditional Chinese medicine differentiation of deficiency and excess syndromes in patients with diabetes mellitus. Evid-Based Compl. Alt. 2012;2012:968083–968093. doi: 10.1155/2012/968083.
  • 35. Li HD, Liang YZ, Xu QS, Cao DS. Model-population analysis and its applications in chemical and biological modeling. Trends Anal. Chem. 2012;38:154–162. doi: 10.1016/j.trac.2011.11.007.
  • 36. Shao XG, Du GR, Jing M, Cai WS. Application of latent projective graph in variable selection for near infrared spectral analysis. Chemometr. Intell. Lab. Syst. 2012;114:44–49. doi: 10.1016/j.chemolab.2012.03.003.
  • 37. Liang YZ, Kvalheim OM. Resolution of two-way data: theoretical background and practical problem-solving Part 1: theoretical background and methodology. Fresen. J. Anal. Chem. 2001;370:694–704. doi: 10.1007/s002160100909.
  • 38. Zhao YL, Zhang J, Wang YZ. Application of MC-UVE wavelength selection method in the identification of different producing areas of Wolfiporia cocos based on NIR spectroscopy. Mycosystema. 2017;36:112–125.
  • 39. Zhao YL, et al. Study on rapid identification of medicinal plants of Paris Ployphylla from different origin areas by NIR spectroscopy. Spectrosc. Spect. Anal. 2014;34:1831–1835.
  • 40. Kaiser HF. The application of electronic computers to factor analysis. Educ. Psychol. Meas. 1960;20:141–151. doi: 10.1177/001316446002000116.
  • 41. Swiderski B, Osowski S, Kruk M, Kurek J. Texture characterization based on the Kolmogorov–Smirnov distance. Expert Syst. Appl. 2015;42:503–509. doi: 10.1016/j.eswa.2014.08.021.
  • 42. Mora-López L, Mora J. An adaptive algorithm for clustering cumulative probability distribution functions using the Kolmogorov–Smirnov two-sample test. Expert Syst. Appl. 2015;42:4016–4021. doi: 10.1016/j.eswa.2014.12.027.
  • 43. Zhong JF, Qin XL. Rapid quantitative analysis of corn starch adulteration in konjac glucomannan by chemometrics-assisted FT-NIR spectroscopy. Food Anal. Methods. 2016;9:61–67. doi: 10.1007/s12161-015-0176-9.
  • 44. Galtier O, et al. Geographic origins and compositions of virgin olive oils determinated by chemometric analysis of NIR spectra. Anal. Chim. Acta. 2007;595:136–144. doi: 10.1016/j.aca.2007.02.033.
  • 45. Galtier O, et al. Lipid compositions and french registered designations of origins of virgin olive oils predicted by chemometric analysis of mid-infrared spectra. Appl. Spectrosc. 2008;62:583–590. doi: 10.1366/000370208784344479.
  • 46. Yao S, et al. Discriminatory components retracing strategy for monitoring the preparation procedure of Chinese patent medicines by fingerprint and chemometric analysis. PLoS ONE. 2015;10:e0121366. doi: 10.1371/journal.pone.0121366.
  • 47. Workman, J. & Weyer, L. Practical Guide to Interpretive Near-Infrared Spectroscopy 240–262 (CRC, 2007).

Articles from Scientific Reports are provided here courtesy of Nature Publishing Group