Journal of Pharmaceutical Analysis. 2019 Dec 12;10(4):356–364. doi: 10.1016/j.jpha.2019.12.004

An integrated spectroscopic strategy to trace the geographical origins of emblic medicines: Application for the quality assessment of natural medicines

Luming Qi a,b, Furong Zhong a,b, Yang Chen a,b, Shengnan Mao a,b, Zhuyun Yan a,b,∗, Yuntong Ma a,b,∗∗
PMCID: PMC7474118  PMID: 32923010

Abstract

Emblic medicine is a popular natural source in the world due to its outstanding healthcare and therapeutic functions. Our preliminary results indicated that the quality of emblic medicines might have an apparent regional variation. A rapid and effective geographical traceability system has not been designed yet. To trace the geographical origins so that their quality can be controlled, an integrated spectroscopic strategy including spectral pretreatment, outlier diagnosis, feature selection, data fusion, and machine learning algorithm was proposed. A featured data matrix (245 × 220) was successfully generated, and a carefully adjusted RF machine learning algorithm was utilized to develop the geographical traceability model. The results demonstrate that the proposed strategy is effective and can be generalized. Sensitivity (SEN), specificity (SPE) and accuracy (ACC) of 97.65%, 99.85% and 97.63% for the calibrated set, as well as 100.00% predictive efficiency, were obtained using this spectroscopic analysis strategy. Our study has created an integrated analysis process for multiple spectral data, which can achieve a rapid, nondestructive and green quality detection for emblic medicines originating from seventeen geographical origins.

Keywords: Emblic medicine, Quality assessment, Geographical traceability, Spectroscopic analysis process

Highlights

  • The quality variation of emblic medicines from seventeen origins was determined.

  • An integrated spectroscopic strategy was provided to trace the geographical origins of emblic medicines.

  • This complete strategy can be generalized to the quality assessment of other natural medicines.

  • Twelve filter, wrapper, and embedded models were applied for feature selection.

1. Introduction

The development of a generalized geographical traceability system for natural medicines remains a significant challenge because the growing environment has a noticeable influence on their quality [1]. This influence is multidimensional and unpredictable. Primary and secondary metabolites, which are mainly responsible for the healthcare and therapeutic functions of natural medicines, often vary significantly with geographical origin [2,3]. Effective analytical methods and instruments that give more insight into the metabolic characteristics and regional variation of natural medicines are essential because these variations affect both producers and consumers. A well-identified geographical origin is a prerequisite for the optimal application of a natural product.

At present, many strategies, such as molecular, chromatographic and spectroscopic methods, have been applied to identify the origins of natural products based on their respective advantages [4–8]. In particular, spectroscopic instruments have attracted increasing attention for characterizing natural products from different geographical origins. These techniques are worth recommending because they are rapid, simple and environmentally friendly, advantages that can further improve the efficiency and safety of the quality control process for natural medicines. However, natural products are complex mixtures with diverse metabolic ingredients. The descriptive information generated by different spectroscopic sensors is sizable and contains a large number of irrelevant and redundant attributes. Many data optimization algorithms have therefore been developed to enhance the usability of spectral data.

For example, feature selection is one of the requisite algorithms for a geographical traceability task. It can produce a clean and informative sub-dataset, which is necessary to improve the accuracy of analysis and to decrease the computation cost. Generally, feature selection algorithms can be classified into three types (filter, wrapper and embedded models), which select features with different efficiency according to their respective criteria [9]. Another effective strategy for geographical traceability is data fusion, which integrates multi-source descriptive information when two or more instruments are used simultaneously [10]. It provides a complementary approach to constructing a more effective geographical traceability model of the regional variation of natural products. These data optimization algorithms further broaden the application of spectroscopic techniques.

Generally, a complete spectral analysis process for a geographical traceability model should contain several key steps, including spectral pretreatment, outlier diagnosis, feature selection, and a machine learning algorithm. Each step needs to be strictly optimized. So far, many spectroscopic geographical traceability studies of natural medicines have been conducted [11–16]. However, only a limited number of these studies used a complete spectroscopic analysis process for geographical traceability, and the universality of the developed models is insufficient, especially for the quality assessment of natural medicines.

The fruit of emblic (Phyllanthus emblica L.), which belongs to Euphorbiaceae, is a popular natural medicine for treating cough and indigestion in China. It is recorded in the “Chinese Pharmacopoeia”. The World Health Organization has designated this species as a plant worthy of extensive cultivation in the world because of its outstanding healthcare and medicinal functions. Phytochemical and pharmacological research has demonstrated that this product has a broad range of metabolic ingredients such as phenolics, flavonoids and terpenoids. These compounds produce many biological benefits, including antidiabetic, antioxidant and anticancer effects [17–21]. This fruit is especially rich in vitamin C, containing more than 100 times the vitamin C of an apple. To the best of our knowledge, it is extensively distributed in many countries, including China, India and the American Continent. There is considerable variance in the quality of emblic medicines that come from these different regions. A rapid and effective spectroscopic quality assessment strategy concerning different geographical origins is still lacking. Such a quality control strategy is required for the consistent supply of top-quality original materials.

With these ideas in mind, this study aimed to design a rapid and effective spectroscopic geographical traceability model for natural emblic medicines. Our research team collected different emblic materials (cultivated and wild) from seventeen geographical origins in six provinces of China during 2017. The main bioactive compounds (gallic acid, corilagin, chebulagic acid, ellagic acid, quercetin, and vitamin C) were first determined using a high-performance liquid chromatography-ultraviolet detection (HPLC-UV) method. These ingredients largely determine the healthcare and medicinal properties of these materials, so the results can reveal how their quality varies with geographical origin. An integrated spectroscopic analysis process was proposed using two high-throughput spectroscopic techniques, Fourier transform near-infrared (FT-NIR) and Fourier transform mid-infrared (FT-MIR) spectroscopy. This workflow included spectral pretreatment, outlier diagnosis, feature selection, data fusion, and a machine learning algorithm. In particular, twelve feature selection models covering the filter, wrapper and embedded types were applied and compared to collect informative spectral variables. Data fusion theory was further used to combine the information learned from the two spectroscopic techniques. We hope this study can provide a universal geographical traceability strategy for emblic medicines and also promote the application of spectroscopic techniques for the quality assessment of multi-source natural medicines.

2. Materials and methods

2.1. Reagents

Methanol (chromatographic grade) was purchased from Thermo Fisher Scientific (Shanghai, China). Deionized water used for chromatographic analysis was produced using an ultrapure water system (Millipore, USA). Chemical standards of gallic acid, corilagin, chebulagic acid, ellagic acid, quercetin, and vitamin C were provided by Chroma-Biotechnology Co., Ltd. (Chengdu, China). Other analytical grade reagents were supplied by Chron Chemicals Co., Ltd. (Chengdu, China).

2.2. Sample preparation

The detailed information on the emblic materials collected from seventeen geographical origins in six provinces of China is shown in Table S1. Their fresh fruits and medicinal materials are shown in Fig. S1. The fruits of these plants were collected from September to December 2017. After the dirt was removed from the surface, the samples were put into a drying oven for 24 h at 60 °C. They were then labelled according to their geographical origins and ground using a powder machine. Powder filtered through an 80 mesh sieve was used for the final chromatographic and spectral analysis. Professor Yuntong Ma of Chengdu University of Traditional Chinese Medicine authenticated all the plants of P. emblica.

2.3. Chromatographic and spectral analysis

A Shimadzu system (Shimadzu, Japan) equipped with an LC-20AT quaternary pump, a SIL-20A XR autosampler, a CTO-20AC column oven, and an SPD-20A UV/Vis detector was utilized to determine bioactive compounds of emblic fruits. An Agilent ZORBAX Eclipse XDB-C18 (4.6 mm × 250 mm, 5 μm) column was applied to separate objective compounds.

For the determination of gallic acid, corilagin, chebulagic acid, ellagic acid and quercetin, each sample of 0.100 g was first weighed. The powder was ultrasonically extracted in 10 mL methanol solution for 60 min. Other HPLC-UV conditions are listed below: column temperature: 30 °C; mobile phase: methanol (A) and 0.1% phosphoric acid (B); flow rate: 1 mL/min; elution gradient: 0–15 min, 5%A; 15–35 min, 5%–37%A; 35–39 min, 37%–47%A; 39–60 min, 47%–60%A; injection volume: 5 μL; detection wavelength: 273 nm.
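The gradient program listed above can be written as a small lookup function, which is handy for checking the mobile phase composition at any run time. This is a minimal sketch that assumes linear ramps between the stated breakpoints; the names `GRADIENT`, `percent_a` and `percent_b` are illustrative and do not come from the original method.

```python
# Gradient breakpoints (time in min, % methanol A) taken from the text above.
GRADIENT = [(0.0, 5.0), (15.0, 5.0), (35.0, 37.0), (39.0, 47.0), (60.0, 60.0)]


def percent_a(t_min: float) -> float:
    """Return % mobile phase A at t_min, assuming linear ramps between breakpoints."""
    if not GRADIENT[0][0] <= t_min <= GRADIENT[-1][0]:
        raise ValueError("time outside the 0-60 min gradient program")
    for (t0, a0), (t1, a1) in zip(GRADIENT, GRADIENT[1:]):
        if t0 <= t_min <= t1:
            return a0 + (a1 - a0) * (t_min - t0) / (t1 - t0)
    raise AssertionError("unreachable: breakpoints cover the whole program")


def percent_b(t_min: float) -> float:
    """Return % mobile phase B (0.1% phosphoric acid) at t_min."""
    return 100.0 - percent_a(t_min)
```

For instance, `percent_a(25)` falls halfway along the 15–35 min ramp from 5% to 37% A.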

For the determination of vitamin C, each sample of 0.100 g was exactly weighed and then ultrasonically extracted in 10 mL of 0.5% oxalic acid for 30 min. Other HPLC-UV conditions are as follows: column temperature: 30 °C; mobile phase: 0.1% phosphoric acid; flow rate: 1 mL/min; isocratic elution: 15 min; injection volume: 10 μL; detection wavelength: 254 nm. All test solutions were filtered using a 0.45 μm membrane before HPLC-UV analysis.

Two spectroscopic sensors, FT-NIR and FT-MIR spectrometers (PerkinElmer, USA), were used to directly record the spectral signals of the sample powder without an extraction pretreatment. Their scan ranges were set to 10000–4000 cm⁻¹ and 4000–500 cm⁻¹, respectively. For both sensors, the number of accumulated scans was 64 and the resolution was 4 cm⁻¹. Before the sample introduction, a blank control was scanned in order to remove any air interference.

For each sample, approximately 0.5 g powder was weighed using an electronic balance (Sartorius, Germany) and put into a sample cell of FT-NIR and FT-MIR instruments. For FT-MIR, an additional attenuated total reflection accessory was connected to enable sample powder to be directly detected without complicated preparation. Each spectrum was scanned in triplicate, and the average spectrum was used for final analysis.

3. Geographical traceability strategy

3.1. Spectral pretreatments

The spectral quality is susceptible to environmental factors. Many interference factors, including baseline drift and light scattering, decrease the analytical accuracy. Several pretreatments were therefore conducted to optimize the spectral data. Baseline correction was applied to produce a stable spectral baseline, and a smoothing algorithm (15 points) was used to remove small noise signals that were not useful for the subsequent analysis. Multiplicative scatter correction eliminated the effect of light scattering caused by the particle size of the powder [22].
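Two of the pretreatments described above can be sketched in a few lines of plain Python. The text does not name the smoothing algorithm, so the centered moving average below is an assumption; multiplicative scatter correction is shown in its usual form, where each spectrum is regressed on the mean spectrum. Baseline correction is not covered, and the function names are illustrative.

```python
from statistics import fmean


def moving_average(spectrum, window=15):
    """Centered moving average; the window shrinks symmetrically near the edges."""
    half = window // 2
    n = len(spectrum)
    smoothed = []
    for i in range(n):
        h = min(half, i, n - 1 - i)  # keep the window centered and inside the data
        smoothed.append(fmean(spectrum[i - h:i + h + 1]))
    return smoothed


def msc(spectra):
    """Multiplicative scatter correction against the mean spectrum.

    Each spectrum x is fitted as x = a + b * m (m = mean spectrum) by least
    squares, and the corrected spectrum is (x - a) / b.
    """
    n_vars = len(spectra[0])
    mean_spec = [fmean(s[j] for s in spectra) for j in range(n_vars)]
    m_bar = fmean(mean_spec)
    sxx = sum((m - m_bar) ** 2 for m in mean_spec)
    corrected = []
    for x in spectra:
        x_bar = fmean(x)
        b = sum((m - m_bar) * (xi - x_bar) for m, xi in zip(mean_spec, x)) / sxx
        a = x_bar - b * m_bar
        corrected.append([(xi - a) / b for xi in x])
    return corrected
```

Spectra that differ only by an additive offset and a multiplicative factor collapse onto the same corrected spectrum, which is the behaviour expected when particle size is the only source of variation.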

3.2. Outlier diagnosis

Anomalous samples can negatively impact the accuracy of a geographical traceability model. Therefore, two outlier detection tools were used jointly to ensure that the analyzed samples were free from abnormal points.

The first such tool can be regarded as a conventional clustering method based on Hotelling’s T² distribution [23]. Based on principal component analysis, Hotelling’s T² defines a confidence ellipse at the 95% confidence limit. Samples outside this ellipse were regarded as outliers in our study.

The second method was an unsupervised algorithm called isolation forest (iForest), a state-of-the-art technique for handling high-dimensional data [24]. It is an ensemble method that combines many isolation trees. In brief, ψ points were randomly selected as a sub-sample for each tree, and these points were recursively partitioned on randomly chosen attributes. The process was complete when all the samples were isolated in their own subspaces. The average path length over the trees was then recorded as the iForest score of each sample, and a sample with a low iForest score was classified as an outlier. In this study, the sub-sampling size ψ and the number of trees were set to 256 and 100, respectively.
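The isolation idea can be illustrated with a compact, dependency-free sketch. It follows the general iForest recipe (random attribute, random split value, depth limit of ceil(log2 ψ), and the usual adjustment for leaves that still hold several points) and returns the average path length per sample, which is the score used in this section. The toy data and all function names are hypothetical; this is not the authors' implementation.

```python
import math
import random


def c_factor(n):
    """Average path length of an unsuccessful binary search tree lookup among n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n


def build_tree(data, rng, depth, max_depth):
    """Recursively split data on a random attribute at a random value."""
    if len(data) <= 1 or depth >= max_depth:
        return {"size": len(data)}
    attr = rng.randrange(len(data[0]))
    lo = min(row[attr] for row in data)
    hi = max(row[attr] for row in data)
    if lo == hi:  # nothing to split on this attribute
        return {"size": len(data)}
    split = rng.uniform(lo, hi)
    left = [row for row in data if row[attr] < split]
    right = [row for row in data if row[attr] >= split]
    return {"attr": attr, "split": split,
            "left": build_tree(left, rng, depth + 1, max_depth),
            "right": build_tree(right, rng, depth + 1, max_depth)}


def path_length(point, node, depth=0):
    """Depth at which point reaches a leaf, plus the adjustment for crowded leaves."""
    if "attr" not in node:
        return depth + c_factor(node["size"])
    branch = "left" if point[node["attr"]] < node["split"] else "right"
    return path_length(point, node[branch], depth + 1)


def average_path_lengths(data, n_trees=100, psi=256, seed=0):
    """Average path length of every sample over n_trees isolation trees."""
    rng = random.Random(seed)
    psi = min(psi, len(data))
    max_depth = math.ceil(math.log2(psi)) if psi > 1 else 0
    trees = [build_tree(rng.sample(data, psi), rng, 0, max_depth)
             for _ in range(n_trees)]
    return [sum(path_length(p, t) for t in trees) / n_trees for p in data]


# Hypothetical toy data: 30 clustered points plus one distant point (index 30).
_rng = random.Random(1)
toy = [[_rng.uniform(-1, 1), _rng.uniform(-1, 1)] for _ in range(30)] + [[10.0, 10.0]]
scores = average_path_lengths(toy)
```

The distant point is isolated after very few splits, so it receives the shortest average path length; a threshold on this score then separates outliers from ordinary samples.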

3.3. Feature selection

Different types of feature selection techniques differ in how efficiently they simplify the spectral data of natural medicines. Three types of feature selection (filter, wrapper, and embedded models) were compared in our study to pick out the informative spectral variables according to their importance.

Filter models evaluate each variable according to their own criteria instead of a specific machine learning classifier. Two unsupervised feature selection techniques, Laplacian Score (LS) [25] and Unsupervised Multi-Cluster Feature Selection (U-MCFS) [26], were first applied. The other two were supervised feature selection techniques, namely Supervised Multi-Cluster Feature Selection (S-MCFS) [26] and Infinite Latent Feature Selection (ILFS) [27].

Wrapper models select the feature variables depending on a mathematical model. A predefined RF algorithm (500 trees) was used to wrap these feature selection techniques. Recursive Feature Elimination (RFE) [28], Boruta [29], Simulated Annealing (SA) [30] and Genetic Algorithm (GA) [31] were applied to handle the spectral data, respectively. The last two algorithms were random search methods for global optimization, which were extensively applied for optimizing sizeable datasets.

Embedded models combine the strengths of filter and wrapper models and usually select features with high efficiency. Least Absolute Shrinkage and Selection Operator (LASSO) [32] and Variable Importance in Projection (VIP) [33] were used as two linear embedded models because they are embedded in linear classifiers. Additionally, Permutation importance (PIMP) [34] and Gini coefficient (Gini) [35], both based on decision tree theory, were included in the comparison to identify the best model.

3.4. Evaluation of feature selection model

The evaluation of feature selection models was an essential step in selecting the most useful spectral information to reflect the regional variation of emblic medicines. A repeated 10-fold cross-validation procedure [36] (three repeats) was used to evaluate the performance of each feature selection model. Because superabundant variables enlarge the search space and lead to an overfitted model, we only used the first 400 variables according to their score ranking. These variables were evaluated iteratively at intervals of 10. The best feature selection models were confirmed according to the cross-validation accuracy for the FT-NIR and FT-MIR datasets, respectively.
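The evaluation loop described above can be sketched as follows. It reads "an interval of 10" as testing the top 10, 20, ..., 400 ranked variables, which is an interpretation rather than a detail stated in the text. The `evaluate` callback is hypothetical: it stands for training and scoring a classifier on one split with a given feature subset. The folds here are not stratified, since the text does not say whether stratification was used.

```python
import random


def repeated_kfold(n_samples, n_folds=10, n_repeats=3, seed=0):
    """Yield (train_idx, test_idx) pairs for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        folds = [idx[f::n_folds] for f in range(n_folds)]
        for f in range(n_folds):
            test = folds[f]
            train = [i for g, fold in enumerate(folds) if g != f for i in fold]
            yield train, test


# Candidate subset sizes: top-10, top-20, ..., top-400 ranked variables.
FEATURE_COUNTS = list(range(10, 401, 10))


def best_feature_count(ranked_features, evaluate, n_samples):
    """Return (k, mean accuracy) for the best top-k subset.

    `evaluate(subset, train_idx, test_idx)` is a user-supplied (hypothetical)
    function that returns the accuracy of a model on one split.
    """
    splits = list(repeated_kfold(n_samples))
    best = None
    for k in FEATURE_COUNTS:
        subset = ranked_features[:k]
        acc = sum(evaluate(subset, tr, te) for tr, te in splits) / len(splits)
        if best is None or acc > best[1]:
            best = (k, acc)
    return best
```

With 10 folds and 3 repeats, every candidate subset is scored on 30 train/test splits before the mean accuracy is compared.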

3.5. Data fusion and RF model

Data fusion was conducted at the feature level because the feature selection models had already been applied to select the informative variables from each of the two spectral datasets. Based on data fusion theory, a combined data matrix related to the regional variation of emblic medicines was generated.
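Feature-level fusion amounts to keeping only the selected columns of each spectral block and joining them sample by sample. The sketch below assumes both blocks list the same samples in the same order; any block scaling before concatenation is left out because the text does not describe it. The function and argument names are illustrative.

```python
def fuse_features(nir_rows, mir_rows, nir_selected, mir_selected):
    """Concatenate the selected FT-NIR and FT-MIR columns for each sample."""
    if len(nir_rows) != len(mir_rows):
        raise ValueError("both blocks must describe the same samples in the same order")
    return [
        [nir[j] for j in nir_selected] + [mir[j] for j in mir_selected]
        for nir, mir in zip(nir_rows, mir_rows)
    ]


# Hypothetical toy blocks: 3 samples, 5 NIR variables and 4 MIR variables.
nir = [[0.1, 0.2, 0.3, 0.4, 0.5], [1.1, 1.2, 1.3, 1.4, 1.5], [2.1, 2.2, 2.3, 2.4, 2.5]]
mir = [[9.0, 9.1, 9.2, 9.3], [8.0, 8.1, 8.2, 8.3], [7.0, 7.1, 7.2, 7.3]]
fused = fuse_features(nir, mir, nir_selected=[0, 2], mir_selected=[1, 3])
```

Each fused row holds the selected NIR values followed by the selected MIR values, so the column count is the sum of the two selections.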

The RF algorithm is an ensemble learning algorithm that combines a certain number of tree classifiers (ntree), which are mutually independent. It performs well against overfitting and is resistant to noise because the training process is randomized. First, the bootstrap sampling method is used to select a random set of samples for each tree classifier. In addition, a random subspace of variables (mtry) is applied for each tree classifier. The results of all tree classifiers are collected, and a majority vote gives the final decision. ntree and mtry are determined in advance according to the out-of-bag (OOB) estimate [37].
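The random ingredients of RF named above (bootstrap sampling, a random variable subspace, majority voting and the OOB estimate) can be sketched without fitting any trees. The helper names are hypothetical. Note that in the standard RF formulation the mtry candidate variables are redrawn at every split rather than once per tree; the function below only shows a single draw.

```python
import random
from collections import Counter


def bootstrap_indices(n_samples, rng):
    """Draw n_samples indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    oob = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, oob


def random_subspace(n_features, mtry, rng):
    """Pick mtry distinct candidate variables."""
    return rng.sample(range(n_features), mtry)


def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


def oob_error(votes_per_sample, true_labels):
    """OOB error: each sample is judged only by trees that did not train on it.

    votes_per_sample[i] lists the predictions of the trees for which sample i
    was out of bag; samples that were never out of bag are skipped.
    """
    scored = [(majority_vote(v), y)
              for v, y in zip(votes_per_sample, true_labels) if v]
    return sum(pred != y for pred, y in scored) / len(scored)
```

Tuning ntree and mtry then means repeating the fit for each candidate value and keeping the setting with the lowest OOB error.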

Four parameters, kappa (KAP), accuracy (ACC), sensitivity (SEN) and specificity (SPE), were used together for a balanced evaluation of our geographical traceability model. Overall, the model has several primary advantages over previous studies: (1) interfering and redundant signals were removed as much as possible; (2) multi-source descriptive data were well utilized; (3) our proposed strategy is complete and can be effectively generalized. A simple data flow diagram for feature selection, data fusion, and the RF model is shown in Fig. 1.

Fig. 1. The data flow diagram for the geographical traceability model, including the steps of feature selection, data fusion and machine learning algorithm of the analysis process.
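The four evaluation parameters named in this section (KAP, ACC, SEN and SPE) can all be derived from a multi-class confusion matrix. The sketch below assumes that SEN and SPE are macro-averaged over one-vs-rest classes; the text does not state which averaging was used, so that choice is an assumption. It also assumes every class has at least one sample.

```python
def classification_metrics(cm):
    """Compute ACC, KAP, SEN and SPE from a square confusion matrix.

    cm[i][j] is the number of samples of true class i predicted as class j.
    SEN and SPE are macro-averaged over the classes (an assumption).
    """
    k = len(cm)
    total = sum(sum(row) for row in cm)
    row_tot = [sum(row) for row in cm]
    col_tot = [sum(cm[i][j] for i in range(k)) for j in range(k)]
    acc = sum(cm[i][i] for i in range(k)) / total
    # Cohen's kappa: agreement corrected for the agreement expected by chance.
    expected = sum(r * c for r, c in zip(row_tot, col_tot)) / total ** 2
    kappa = (acc - expected) / (1 - expected)
    sens, spec = [], []
    for i in range(k):
        tp = cm[i][i]
        fn = row_tot[i] - tp
        fp = col_tot[i] - tp
        tn = total - tp - fn - fp
        sens.append(tp / (tp + fn))
        spec.append(tn / (tn + fp))
    return {"ACC": acc, "KAP": kappa, "SEN": sum(sens) / k, "SPE": sum(spec) / k}


# Hypothetical three-class example with one misclassified sample.
metrics = classification_metrics([[9, 1, 0], [0, 10, 0], [0, 0, 10]])
```

Because specificity counts all samples of the other classes as negatives, it stays close to 100% in a many-class problem even when a few samples are misclassified, which is why it is reported alongside KAP and SEN.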

4. Results and discussion

4.1. Quality variation of emblic materials

Primary and secondary metabolites are the basis on which natural medicines exert their healthcare and medicinal functions. Some phenolics in emblic fruits are mainly responsible for their antioxidant activities, and the natural vitamin C they contain plays an important role in the prevention of cancers [38,39].

Nowadays, chromatographic analysis is the most fundamental technique for the quality assessment of medicinal plants because it can quantify multiple bioactive components simultaneously [40,41]. We first determined the six main metabolites (gallic acid, corilagin, chebulagic acid, ellagic acid, quercetin, and vitamin C) to investigate the quality variation of emblic materials originating from seventeen geographical origins. Chromatographic plots are exhibited in Fig. S2. Each calibration curve was established by plotting its peak area against the standard concentration (Table S2). Methodological examination, including precision, stability, repeatability and recovery, was conducted (Table S3). These results demonstrated that the HPLC-UV method could be applied to determine the quality variation of emblic medicines.

The levels of the determined active compounds are shown in Table 1. The concentrations of gallic acid, corilagin, chebulagic acid, ellagic acid, quercetin and vitamin C in emblic materials from different geographical origins are 4.48–61.00 mg/g, 0.77–9.82 mg/g, 4.82–32.31 mg/g, 0.90–13.00 mg/g, 0.34–3.60 mg/g and 0.47–14.56 mg/g, respectively. The gallic acid content of the product from the CX origin is almost 14 times that from the MY origin (61.00 vs. 4.48 mg/g), and the vitamin C concentration from the MY origin is more than 30 times that from the HZ origin. These results show an obvious quality variation of these fruits across growing environments.

Table 1. The levels of active compositions of emblic materials from different geographical origins.

| Geographical origin | Gallic acid (mg/g) | Corilagin (mg/g) | Chebulagic acid (mg/g) | Ellagic acid (mg/g) | Quercetin (mg/g) | Vitamin C (mg/g) |
| --- | --- | --- | --- | --- | --- | --- |
| Zhangzhou, Fujian (ZZ) | 18.06 ± 1.71 | 4.82 ± 0.38 | 9.89 ± 0.33 | 6.02 ± 0.30 | 0.42 ± 0.05 | 2.03 ± 0.57 |
| Quanzhou, Fujian (QZ) | 16.99 ± 1.99 | 4.1 ± 0.94 | 7.68 ± 1.42 | 5.19 ± 0.73 | 0.39 ± 0.02 | 1.21 ± 0.27 |
| Huzhou, Guangdong (HZ) | 8.85 ± 1.78 | 4.49 ± 0.97 | 13.22 ± 3.17 | 6.72 ± 1.66 | 0.34 ± 0.12 | 0.47 ± 0.11 |
| Shantou, Guangdong (ST) | 4.86 ± 0.40 | 9.82 ± 1.78 | 12.23 ± 2.32 | 8.84 ± 1.57 | 0.41 ± 0.02 | 0.56 ± 0.07 |
| Nanning, Guangxi (NN) | 16.03 ± 3.30 | 6.24 ± 1.87 | 12.39 ± 3.90 | 9.35 ± 3.09 | 0.96 ± 0.13 | 0.99 ± 0.23 |
| Anshun, Guizhou (AS) | 45.7 ± 2.87 | 7.82 ± 0.77 | 24.93 ± 2.68 | 13.00 ± 1.17 | 1.99 ± 0.16 | 3.05 ± 0.76 |
| Qianxinan, Guizhou (QXN) | 41.43 ± 4.38 | 7.23 ± 1.00 | 32.31 ± 4.85 | 12.11 ± 1.82 | 1.30 ± 0.14 | 3.10 ± 1.03 |
| Dechang, Sichuan (DC) | 38.06 ± 4.18 | 7.00 ± 0.60 | 16.91 ± 1.72 | 11.5 ± 0.89 | 1.42 ± 0.17 | 1.18 ± 0.38 |
| Huili, Sichuan (HL) | 48.52 ± 4.26 | 6.53 ± 0.70 | 16.00 ± 1.71 | 12.33 ± 1.24 | 1.85 ± 0.18 | 1.21 ± 0.33 |
| Miyi, Sichuan (MY) | 4.48 ± 1.77 | 0.77 ± 0.22 | 4.82 ± 1.38 | 0.90 ± 0.30 | 0.42 ± 0.13 | 14.56 ± 3.80 |
| Puwei, Sichuan (PW) | 50.47 ± 3.12 | 5.80 ± 0.37 | 15.06 ± 1.22 | 8.97 ± 0.45 | 2.47 ± 0.19 | 2.45 ± 0.62 |
| Datong, Sichuan (DT) | 50.31 ± 4.53 | 3.73 ± 0.19 | 12.53 ± 0.81 | 5.40 ± 0.36 | 2.58 ± 0.24 | 7.16 ± 2.55 |
| Jingxing, Sichuan (JX) | 46.28 ± 2.11 | 5.54 ± 0.30 | 16.31 ± 0.76 | 7.57 ± 0.49 | 2.51 ± 0.20 | 1.74 ± 0.30 |
| Panzhihua, Sichuan (PZH) | 30.30 ± 1.66 | 5.09 ± 0.36 | 13.87 ± 1.12 | 5.07 ± 0.36 | 1.48 ± 0.14 | 3.90 ± 0.67 |
| Yanyuan, Sichuan (YY) | 46.53 ± 5.36 | 5.28 ± 0.41 | 14.84 ± 0.90 | 8.43 ± 0.69 | 1.96 ± 0.17 | 3.61 ± 1.33 |
| Chuxiong, Yunnan (CX) | 61.00 ± 2.48 | 3.97 ± 0.32 | 14.55 ± 0.70 | 5.62 ± 0.22 | 2.21 ± 0.20 | 10.31 ± 2.56 |
| Dali, Yunnan (DL) | 52.72 ± 5.24 | 5.84 ± 0.81 | 15.08 ± 2.28 | 8.81 ± 1.24 | 3.60 ± 0.23 | 2.46 ± 0.63 |
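The fold differences between origins can be checked directly from the mean values in Table 1. The short sketch below transcribes the gallic acid and vitamin C means (standard deviations are ignored) and reports which origins hold the highest and lowest content; the dictionary and function names are illustrative.

```python
# Mean contents (mg/g) transcribed from Table 1; the ± values are not used.
gallic_acid = {
    "ZZ": 18.06, "QZ": 16.99, "HZ": 8.85, "ST": 4.86, "NN": 16.03, "AS": 45.7,
    "QXN": 41.43, "DC": 38.06, "HL": 48.52, "MY": 4.48, "PW": 50.47, "DT": 50.31,
    "JX": 46.28, "PZH": 30.30, "YY": 46.53, "CX": 61.00, "DL": 52.72,
}
vitamin_c = {
    "ZZ": 2.03, "QZ": 1.21, "HZ": 0.47, "ST": 0.56, "NN": 0.99, "AS": 3.05,
    "QXN": 3.10, "DC": 1.18, "HL": 1.21, "MY": 14.56, "PW": 2.45, "DT": 7.16,
    "JX": 1.74, "PZH": 3.90, "YY": 3.61, "CX": 10.31, "DL": 2.46,
}


def extremes(contents):
    """Return (highest origin, lowest origin, fold difference of their means)."""
    hi = max(contents, key=contents.get)
    lo = min(contents, key=contents.get)
    return hi, lo, contents[hi] / contents[lo]


gallic_hi, gallic_lo, gallic_fold = extremes(gallic_acid)
vitc_hi, vitc_lo, vitc_fold = extremes(vitamin_c)
cx_vs_st = gallic_acid["CX"] / gallic_acid["ST"]
```

On these means, the largest gallic acid ratio is CX over MY (about 13.6-fold), CX over ST is about 12.6-fold, and the vitamin C ratio of MY over HZ is about 31-fold.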

A PLS-DA model was developed to visualize their quality variation. The regional variation of these products is apparent because they are clearly divided into several groups (Fig. 2). According to the loading plot, these six compounds play an essential role in this classification model (Fig. S3). It can be concluded that geographical origin has a significant influence on the quality of emblic products. Because this species is extensively distributed in the world, an effective geographical traceability strategy is essential for its quality assessment. Chromatographic techniques are time-consuming, generate pollution and are not accurate enough to deal with this problem. Hence, two spectroscopic techniques were applied to build a better geographical traceability model of emblic medicines.

Fig. 2. The visualization of regional variation of emblic products constructed by the levels of six metabolites in these medicines.

4.2. Spectral pretreatment and outlier diagnosis

The raw FT-NIR and FT-MIR spectra of emblic products are visualized in Fig. S4. These original spectral signals are sensitive to the operating environment. The optimized spectra after baseline correction, smoothing and multiplicative scatter correction are displayed in Fig. 3. By comparison, these approaches are effective in improving the spectral quality, not only for the visualization of metabolic characteristics but also for the subsequent data analysis. Many typical absorption peaks were observed, indicating that the metabolic profiles of the samples are similar. Hence, the metabolic variation of emblic medicines from different geographical origins is mainly reflected in the levels of metabolic products, which can be partly explained by the chromatographic results.

Fig. 3. FT-NIR and FT-MIR spectra after spectral pretreatment optimization.

Two methods were used together for outlier diagnosis. The result of Hotelling’s T² distribution is shown in Fig. S5. Six observations of FT-NIR spectra and three observations of FT-MIR spectra fall outside the 95% confidence limit. The iForest result indicates that the scores of six FT-NIR observations are lower than 2.86 and those of three FT-MIR observations are lower than 2.65 (Table S4). Using 2.86 and 2.65 as the threshold scores for the FT-NIR and FT-MIR spectra, respectively, four outliers were additionally detected by this algorithm. In summary, a total of ten samples were identified as abnormal and were excluded from the subsequent analysis.

4.3. The results of feature selection

After spectral pretreatment and outlier diagnosis, two preliminary data matrices for FT-NIR (245 × 1556) and FT-MIR (245 × 1789) were produced. They were too large to analyze directly. Twelve feature selection models (filter, wrapper and embedded) were compared to simplify these data structures.

Fig. S6 shows the performance of the four filter models. The U-MCFS model achieves the highest accuracy for FT-NIR, with ACC and KAP of 92.68% and 92.19%, respectively, using the first 200 features. For the FT-MIR spectra, ACC and KAP of 95.03% and 94.68% are obtained using the S-MCFS model with the first 100 features.

For the wrapper models (Fig. S7), the first 120 FT-NIR features achieve the best accuracy using the Boruta model, with ACC and KAP of 96.10% and 95.82%, respectively. For the FT-MIR spectra, the intelligent optimization algorithm GA shows the best accuracy based on the first 40 features, with ACC and KAP of 94.82% and 94.46%, respectively.

The results of the embedded feature selection models are presented in Fig. S8. Compared with the two linear models, LASSO and VIP, the nonlinear models PIMP and Gini give better results. PIMP achieves 92.86% ACC and 92.36% KAP for the FT-MIR spectra using the first 120 features, and Gini achieves 95.84% ACC and 95.55% KAP for the FT-NIR spectra using the first 150 features.

4.4. The comparison of feature selection

We applied twelve different feature selection models, including filter, wrapper, and embedded models. A 10-fold cross-validation procedure repeated three times was performed to identify the best one for the optimization of sizable spectral datasets. The models were well evaluated because a total of 30 random samplings were performed. The comparison of their KAP values is displayed in Fig. 4.

Fig. 4. Comparison of twelve feature selection models based on the KAP coefficient displaying the different efficiency of various feature selection models.

Feature selection models perform differently on different datasets. For the FT-NIR dataset, filter models perform significantly worse than the other methods (P<0.05), whereas the difference between the wrapper and embedded models is not significant. Considering the number of features, validation accuracy and computation time together, we chose Boruta as the best method to simplify the FT-NIR dataset.

For the FT-MIR dataset, the difference among the types of feature selection is not significant. LS performs worst, differing significantly from the others (P<0.05). By comparison, the S-MCFS model was selected as the optimized method to simplify this dataset.

The first 100 important variables of the FT-NIR and FT-MIR spectra are visualized to further compare the performance of the feature selection models. As seen in Fig. 5, LS and ILFS mostly focus on a local region of the spectral data, which can explain their poor results. A local search strategy may be powerless when dealing with sizable spectral data. Conversely, SA and GA are random global optimization algorithms. Their performance was acceptable, but they require too much computation time. For the best-performing feature selection models, such as Boruta and S-MCFS, the selected features are mainly distributed in the informative spectral regions of 7000–4000 cm⁻¹ for the FT-NIR dataset and 2000–500 cm⁻¹ for the FT-MIR dataset.

Fig. 5. The first 100 important variables of FT-NIR and FT-MIR spectra, respectively, based on different models.

Different feature selection models had different efficiencies when they were used to simplify spectral data. Multiple models need to be applied together to identify the best one for the spectral data optimization of natural medicines. Eventually, 120 FT-NIR features and 100 FT-MIR features were selected using the Boruta and S-MCFS models, respectively. This is the first time that filter, wrapper, and embedded feature selection models have been used together for spectral datasets of natural medicines.

4.5. Development of the geographical traceability model

An optimized data matrix was successfully generated via spectral pretreatment, outlier diagnosis, feature selection, and data fusion in turn. It contained 245 rows and 220 columns, which was simple, representative and informative. Such a data matrix could contribute to constructing an accurate and robust geographical traceability model of emblic medicines.

The OOB estimate is based on a bootstrap sampling procedure, which is an unbiased measurement. This parameter is closely related to the degree of model fitting, and it can effectively enhance the generalization ability of the model, so it was used to adjust the parameters of the RF model. A forest of 94 trees had the best performance, with the lowest averaged error of 0.029 (Fig. S9). Then, candidate values from 1 to 100 were screened to select the best mtry. As seen in Fig. 6, mtry = 65 has the best performance, with an error of 0.023. Through this parameter adjustment process, the calibrated geographical traceability model was successfully developed, with the OOB error reduced from 0.036 to 0.023.

Fig. 6. mtry optimization process for an RF model according to the lowest OOB estimate.

A well-chosen external validation set based on Kennard-Stone sampling was imported into the calibrated model to evaluate its generalization performance [42]. The confusion matrices are given in Table S5 and Table S6. Four samples from the QXN, DL and CX groups are misclassified in the calibrated model. The SEN, SPE, and ACC are 97.65%, 99.85%, and 97.63%, respectively. All samples in the external validation set are correctly classified, with SEN, SPE, and ACC all equal to 100% (Table 2).

Table 2. The performance of the geographical traceability model of emblic fruits.

| Dataset | SEN (%) | SPE (%) | ACC (%) |
| --- | --- | --- | --- |
| Calibration set | 97.65 | 99.85 | 97.63 |
| Validation set | 100.00 | 100.00 | 100.00 |

Note: SEN: sensitivity; SPE: specificity; ACC: accuracy.
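The Kennard-Stone sampling mentioned in this section can be sketched as follows. The algorithm starts from the two most distant samples and then repeatedly adds the sample whose nearest already-selected neighbour is farthest away, so the selected subset spreads evenly over the data space. Euclidean distance is assumed, and which of the two resulting subsets the authors used for calibration is not stated here; the function name and toy points are hypothetical.

```python
import math


def kennard_stone(points, n_select):
    """Return the indices of n_select points chosen by Kennard-Stone sampling."""
    n = len(points)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of points")
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Start with the two mutually most distant points.
    i0, j0 = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                 key=lambda ij: dist[ij[0]][ij[1]])
    selected = [i0, j0]
    remaining = set(range(n)) - {i0, j0}
    while len(selected) < n_select:
        # Add the candidate whose closest selected point is farthest away.
        nxt = max(sorted(remaining),
                  key=lambda c: min(dist[c][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected


# Hypothetical one-dimensional example.
toy_points = [[0.0], [1.0], [2.0], [5.0], [10.0]]
chosen = kennard_stone(toy_points, 3)
```

In the toy example the two end points are picked first, followed by the point midway between them; the samples left unselected form the complementary set.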

5. Conclusion

Natural products are complex mixtures that consist of diverse chemical constituents. Their metabolic characteristics are difficult to describe completely. Spectroscopic techniques have many advantages because they enable rapid and green quality detection for natural medicines. Collecting the FT-NIR and FT-MIR spectra of an emblic sample takes less than 1 min and causes no sample loss.

However, spectral data need to be carefully optimized before their application. When the feature subset is well prepared, spectroscopic techniques show a huge potential for the quality assessment of natural medicines on both qualitative and quantitative levels. These techniques should play a more important role in the field of quality assessment for Chinese medicine.

In this study, we presented an integrated analysis process for two spectral datasets to develop an effective geographical traceability model for emblic medicines. This model achieved 100.00% prediction accuracy for these medicines originating from seventeen geographical origins. The optimization steps included spectral pretreatment, outlier diagnosis, feature selection, data fusion, and a machine learning algorithm. This analysis strategy can also be used for quantitative purposes and is worth generalizing to the quality assessment of other multi-source natural medicines.

Conflicts of interest

The authors declare that there are no conflicts of interest.

Acknowledgments

This work is financially supported by the National Wild Plant Germplasm Resources Infrastructure which is the follow-up work of a project called Standardization and Community for the Collection and Preservation of Important Wild Plant Germplasm Resources (2005DKA21006).

Footnotes

Peer review under responsibility of Xi'an Jiaotong University.

Appendix A

Supplementary data to this article can be found online at https://doi.org/10.1016/j.jpha.2019.12.004.

Contributor Information

Zhuyun Yan, Email: cdtcmyan@126.com.

Yuntong Ma, Email: mayuntong@cdutcm.edu.cn.

References

  • 1. Klein K., Stolk P. Challenges and opportunities for the traceability of (Biological) medicinal products. Drug Saf. 2018;41:911–918. doi: 10.1007/s40264-018-0678-7.
  • 2. Liu C., Guo D.A., Liu L. Quality transitivity and traceability system of herbal medicine products based on quality markers. Phytomedicine. 2018;44:247–257. doi: 10.1016/j.phymed.2018.03.006.
  • 3. Gad H.A., El-Ahmady S.H., Abou-Shoer M.I. Application of chemometrics in authentication of herbal medicines: a review. Phytochem. Anal. 2013;24:1–24. doi: 10.1002/pca.2378.
  • 4. El Sheikha A.F. Molecular Techniques in Food Biology: Safety, Biotechnology, Authenticity and Traceability. John Wiley & Sons Ltd.; New Jersey: 2018. How to determine the geographical origin of food by molecular techniques; pp. 3–26.
  • 5. Zhao L., Yu X., Shen J. Identification of three kinds of Plumeria flowers by DNA barcoding and HPLC specific chromatogram. J. Pharm. Anal. 2018;8:176–180. doi: 10.1016/j.jpha.2018.02.002.
  • 6. Kamal M., Karoui R. Analytical methods coupled with chemometric tools for determining the authenticity and detecting the adulteration of dairy products: a review. Trends Food Sci. Technol. 2015;46:27–48.
  • 7. El Sheikha A.F., Condur A., Metayer I. Determination of fruit origin by using 26S rDNA fingerprinting of yeast communities by PCR-DGGE: preliminary application to Physalis fruits from Egypt. Yeast. 2009;26:567–573. doi: 10.1002/yea.1707.
  • 8. Kharbach M., Marmouzi I., El Jemli M. Recent advances in untargeted and targeted approaches applied in herbal-extracts and essential-oils fingerprinting-A review. J. Pharm. Biomed. Anal. 2020;177:112849. doi: 10.1016/j.jpba.2019.112849.
  • 9. Saeys Y., Inza I., Larranaga P. A review of feature selection techniques in bioinformatics. Bioinformatics. 2007;23:2507–2517. doi: 10.1093/bioinformatics/btm344.
  • 10. Borras E., Ferre J., Boque R. Data fusion methodologies for food and beverage authentication and quality assessment-a review. Anal. Chim. Acta. 2015;891:1–14. doi: 10.1016/j.aca.2015.04.042.
  • 11. Li Y., Wang Y. Differentiation and comparison of Wolfiporia cocos raw materials based on multi-spectral information fusion and chemometric methods. Sci. Rep. 2018;8:13043. doi: 10.1038/s41598-018-31264-1.
  • 12. Mandrile L., Barbosa-Pereira L., Sorensen K.M. Authentication of cocoa bean shells by near- and mid-infrared spectroscopy and inductively coupled plasma-optical emission spectroscopy. Food Chem. 2019;292:47–57. doi: 10.1016/j.foodchem.2019.04.008.
  • 13. Li J., Zhang J., Zhao Y.L. Comprehensive quality assessment based specific chemical profiles for geographic and tissue variation in Gentiana rigescens using HPLC and FTIR method combined with principal component analysis. Front Chem. 2017;5:125. doi: 10.3389/fchem.2017.00125.
  • 14. Wang Y., Zuo Z.T., Huang H.Y. Original plant traceability of Dendrobium species using multi-spectroscopy fusion and mathematical models. R. Soc. Open Sci. 2019;6:190399. doi: 10.1098/rsos.190399.
  • 15. Li Y., Zhang J.Y., Wang Y.Z. FT-MIR and NIR spectral data fusion: a synergetic strategy for the geographical traceability of Panax notoginseng. Anal. Bioanal. Chem. 2018;410:91–103. doi: 10.1007/s00216-017-0692-0.
  • 16. Yao S., Li T., Liu H. Traceability of Boletaceae mushrooms using data fusion of UV-visible and FTIR combined with chemometrics methods. J. Sci. Food Agric. 2018;98:2215–2222. doi: 10.1002/jsfa.8707.
  • 17. Chaphalkar R., Apte K.G., Talekar Y. Antioxidants of Phyllanthus emblica L. bark extract provide hepatoprotection against ethanol-induced hepatic damage: a comparison with silymarin. Oxid. Med. Cell. Longev. 2017:3876040. doi: 10.1155/2017/3876040.
  • 18. Huang C.Z., Tung Y.T., Hsia S.M. The hepatoprotective effect of Phyllanthus emblica L. fruit on high fat diet-induced non-alcoholic fatty liver disease (NAFLD) in SD rats. Food Funct. 2017;8:842–850. doi: 10.1039/c6fo01585a.
  • 19. Kumar A., Kumar S., Bains S. De novo transcriptome analysis revealed genes involved in flavonoid and vitamin C biosynthesis in Phyllanthus emblica (L.) Front. Plant Sci. 2016;7:1610. doi: 10.3389/fpls.2016.01610.
  • 20. Zhang J., Miao D., Zhu W.F. Biological activities of phenolics from the fruits of Phyllanthus emblica L. (Euphorbiaceae) Chem. Biodivers. 2017;14:e1700404. doi: 10.1002/cbdv.201700404.
  • 21. Zheng X.H., Yang J., Lv J.J. Four new cleistanthane diterpenoids from Phyllanthus acidus (L.) Skeels. Fitoterapia. 2018;125:89–93. doi: 10.1016/j.fitote.2017.12.005.
  • 22. Dhanoa M.S., Lister S.J., Sanderson R. The link between multiplicative scatter correction (MSC) and standard normal variate (SNV) transformations of NIR spectra. J. Near Infrared Spectrosc. 1994;2:43–47.
  • 23. Qi L., Li J., Liu H. An additional data fusion strategy for the discrimination of porcini mushrooms from different species and origins in combination with four mathematical algorithms. Food Funct. 2018;9:5903–5911. doi: 10.1039/c8fo01376d.
  • 24. Liu F.T., Ting K.M., Zhou Z. 2008 Eighth IEEE International Conference on Data Mining. IEEE; New Jersey: 2008. Isolation forest; pp. 413–422.
  • 25. He X., Cai D., Niyogi P. Proceeding NIPS’05 Proceedings of the 18th International Conference on Neural Information Processing Systems. MIT Press; Cambridge: 2006. Laplacian score for feature selection; pp. 507–514.
  • 26. Cai D., Zhang C., He X. KDD ’10 Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM; New York: 2010. Unsupervised feature selection for multi-cluster data; pp. 333–342.
  • 27. Roffo G., Melzi S., Castellani U. Proceedings of the IEEE International Conference on Computer Vision. IEEE; New Jersey: 2017. Infinite latent feature selection: a probabilistic latent graph-based ranking approach; pp. 1398–1406.
  • 28. Granitto P.M., Furlanello C., Biasioli F. Recursive feature elimination with random forest for PTR-MS analysis of agroindustrial products. Chemometr. Intell. Lab. 2006;83:83–90.
  • 29. Kursa M.B., Rudnicki W.R. Feature selection with the Boruta package. J. Stat. Softw. 2010;36:1–13.
  • 30. Kirkpatrick S., Gelatt C.D., Vecchi M.P. Optimization by simulated annealing. Science. 1983;220:671–680. doi: 10.1126/science.220.4598.671.
  • 31. Zou W., Tolstikov V.V. Probing genetic algorithms for feature selection in comprehensive metabolic profiling approach. Rapid Commun. Mass Spectrom. 2008;22:1312–1324. doi: 10.1002/rcm.3507.
  • 32. Yan Z.B., Yao Y. Variable selection method for fault isolation using least absolute shrinkage and selection operator (LASSO) Chemometr. Intell. Lab. 2015;146:136–146.
  • 33. Mehmood T., Liland K.H., Snipen L. A review of variable selection methods in partial least squares regression. Chemometr. Intell. Lab. 2012;118:62–69.
  • 34. Altmann A., Tolosi L., Sander O. Permutation importance: a corrected feature importance measure. Bioinformatics. 2010;26:1340–1347. doi: 10.1093/bioinformatics/btq134.
  • 35. Singh S.R., Murthy H.A., Gonsalves T.A. Feature selection for text classification based on Gini coefficient of inequality. Fsdm. 2010;10:76–85.
  • 36. Wong T.T. Performance evaluation of classification algorithms by k-fold and leave-one-out cross validation. Pattern Recognit. 2015;48:2839–2846.
  • 37. Liaw A., Wiener M. Classification and regression by randomForest. R News. 2002;2:18–22.
  • 38. Gillberg L., Orskov A.D., Liu M. Vitamin C-A new player in regulation of the cancer epigenome. Semin. Cancer Biol. 2018;51:59–67. doi: 10.1016/j.semcancer.2017.11.001.
  • 39. Liu X., Cui C., Zhao M. Identification of phenolics in the fruit of emblica (Phyllanthus emblica L.) and their antioxidant activities. Food Chem. 2008;109:909–915. doi: 10.1016/j.foodchem.2008.01.071.
  • 40. Feng J.F., Ren H.Z., Gou Q.F. Comparative analysis of the major constituents in three related polygonaceous medicinal plants using pressurized liquid extraction and HPLC-ESI/MS. Anal. Methods. 2016;8:1557–1564.
  • 41. Yi T., Fan L.L., Chen H.L. Comparative analysis of diosgenin in Dioscorea species and related medicinal plants by UPLC-DAD-MS. BMC Biochem. 2014;15:19. doi: 10.1186/1471-2091-15-19.
  • 42. Kennard R.W., Stone L.A. Computer aided design of experiments. Technometrics. 1969;11:137–148.
