Interval-based sparse ensemble multi-class classification algorithm for terahertz data

Chengyong Zheng; Xiaowen Zha; Shengjie Cai; Jing Cui; Qian Li; Zhijing Ye

Heliyon. 2024 Mar 12;10(6):e27743. doi: 10.1016/j.heliyon.2024.e27743

PMCID: PMC10950663 PMID: 38509892

Abstract

Terahertz time-domain spectroscopy (THz-TDS) has been widely used for food and drug identification. The classification information of a THz spectrum usually does not exist in the whole spectral band but only in one or several small intervals. Therefore, feature selection is indispensable in THz-based substance identification. However, most THz-based identification methods empirically intercept the low-frequency band of the THz absorption coefficients for analysis. In order to adaptively find the important intervals of the THz spectra, an interval-based sparse ensemble multi-class classifier (ISEMCC) for THz spectral data classification is proposed. In ISEMCC, the THz spectra are first divided into several small intervals through window sliding. Then the data of the training samples in each interval are extracted to train base classifiers. Finally, a robust final classifier is obtained through a nonnegative sparse combination of these trained base classifiers. With the $l_{1}$-norm, two objective functions based on Mean Square Error (MSE) and Cross Entropy (CE) are established. For these two objective functions, two iterative algorithms based on the Alternating Direction Method of Multipliers (ADMM) and Gradient Descent (GD) are built, respectively. ISEMCC transforms the problem of interval feature selection and decision-level fusion into a nonnegative sparse optimization problem. The sparse constraint ensures that only a few important spectral segments are selected. In order to verify the performance of the proposed algorithm, comparative experiments on identifying the origin of Bupleurum and the harvesting year of Tangerine peel are carried out. The base classifiers used by ISEMCC are Support Vector Machine (SVM) and Decision Tree (DT). The experimental results demonstrate that the proposed algorithm outperforms six typical classifiers, including Random Forest (RF), AdaBoost, RUSBoost, ExtraTree, and the two base classifiers, in terms of classification accuracy.

Keywords: Terahertz spectrum, Classification, Sparse ensemble, Interval, Cross entropy

1. Introduction

Electromagnetic waves in the frequency range of 0.1-10 terahertz (THz) are referred to as terahertz waves [1]. Owing to its penetration, fingerprint, and non-ionizing characteristics, THz spectroscopy has demonstrated great potential in non-destructive detection, especially in the fields of food, medicine, environmental protection, and some industrial fields [2], [3], [4], [5]. With THz spectra, Friska et al. [6] utilized fast independent component analysis (ICA) and random forest (RF) algorithms to identify adulterated rice. Pan et al. [7] applied the support vector machine (SVM) with an enhanced cuckoo search algorithm to classify ginseng of different growth ages. Zhu et al. [8] claimed that employing a noise reduction technique and a reconstruction approach can successfully tackle the problem of low spectral signal-to-noise ratio produced by different components of the biological mixture, including water. Liu et al. [9] proposed to use principal component analysis (PCA), locality preserving projection (LPP), and Isomap methods to reduce the dimensionality of the THz spectral data, and then use the probabilistic neural network (PNN) and SVM to identify liver tumors. Huang et al. [10] used nonnegative matrix factorization (NMF) to decompose THz spectral data into the product of a weight matrix and a characteristic matrix, and processed the weight matrix by K-means clustering to classify biological macromolecules. Liu et al. [11] proposed a novel method combining THz time-domain spectroscopy (THz-TDS) with chemometrics for the classification of sand samples. They employed Savitzky-Golay (SG) smoothing and orthogonal signal correction (OSC) to pretreat spectra, and applied PCA, partial least squares (PLS) discriminant analysis and SVM to establish classification models for distinguishing sand samples from various deserts and grain sizes. Zhu et al. [12] used logistic regression (LR), SVM, and RF to classify the oxidation degree of coal, and provided a coal spontaneous combustion monitoring technology combining the THz permittivity spectrum with machine learning algorithms. Zhang et al. [13] used PCA to reduce the dimensionality of the original THz spectral information, and then employed SVM, decision tree (DT), and RF to discriminate herbal medicines. Sarjaš et al. [14] proposed a classification method for inorganic pigments in plastic based on THz spectroscopy and convolutional neural networks (CNN). Huang et al. [15] used THz spectroscopy to inspect mouse liver injury. They utilized the maximal information coefficient to select crucial features from the absorption coefficients and the refractive index spectra, and applied RF and AdaBoost to recognize different levels of liver injury.

In summary, the feasibility of THz technology has been demonstrated in several areas. However, most studies have focused on dimension reduction or feature extraction, which are widely used in machine learning. Some studies have even simply cut out the low-frequency segments directly. Because THz data are highly sensitive and noisy, classification information is often concealed within one or several characteristic peaks that account for only a small portion of the entire spectral data. The remaining data typically contain limited useful information. Applying dimension reduction (e.g., PCA) or feature extraction (e.g., NMF) to the whole spectrum will therefore inevitably degrade the classification performance.

In our previous work [16], a collaborative classification algorithm with multiple THz spectra (MVTHzCC) was put forward. Its feature selection is based on an optimal interval search followed by complementary feature seeking. A decision-level weighted fusion is utilized to make full use of the information provided by various THz spectra. MVTHzCC is effective at finding the valid classification features in the THz spectra. However, its feature search process is time-consuming.

In order to efficiently search out the valid classification features in the THz spectral data while taking the curve characteristics of THz spectra into account, an interval-based sparse ensemble multi-class classifier (ISEMCC) for THz spectral data classification is proposed. ISEMCC converts the problem of characteristic peak search and selection in THz spectral data classification into the problem of finding an optimal sparse combination of classifiers trained on THz interval data. ISEMCC has the following advantages:

• Through sparse optimization, the selection of optimal interval features and optimal decision level fusion is realized at the same time.
• The selection of the base classifier is flexible and has wide applicability.
• Compared with the base classifier, the proposed algorithm can significantly improve the classification accuracy.

The rest of this paper is organized as follows. Section 2 presents the proposed method. Section 3 reports experimental results. Section 4 concludes this paper.

2. Interval-based sparse ensemble multi-class classifier

In THz-based substance identification, the identification information of substances is usually hidden in one or more small intervals of THz spectral data. How to find these small intervals is the key to THz-based classification and recognition. Hence, we provide an interval-based sparse ensemble multi-class classifier (ISEMCC). First, ISEMCC divides the THz spectral data into many small intervals. Then it uses the data of training samples in each interval to train some base classifiers. Finally, the final strong classifier is achieved by a nonnegative sparse combination of the trained base classifiers. Thus, the optimal interval searching and the optimal decision level fusion of interval classifiers are acquired at the same time.

2.1. Symbols

Let $X_{trn} = \{(x_{1}, y_{1}), \dots, (x_{n_{trn}}, y_{n_{trn}})\}$ be the training THz spectral data, where $n_{trn}$ is the number of training samples and $y_{i} \in R^{C}$ is the one-hot coding vector of the label of the i-th training sample, i.e., all entries are zero except the c-th if $x_{i}$ belongs to the c-th class; C is the total number of classes. Let f be a base classifier. To gather small intervals from the THz spectral data, a sliding window and its sliding step must be specified. Assuming that a total of $n_{b}$ intervals are obtained by sliding a window of width w with sliding step h, a trained base classifier $f_{j}$ ($j = 1, 2, \dots, n_{b}$) is obtained by training f on the part of the training spectra that falls in the j-th interval. Let $f_{j}(x_{i}) = p_{j}^{i} \in R^{C}$ be the output of $f_{j}$ on $x_{i}$, and let $Y_{i} = [p_{1}^{i}, p_{2}^{i}, \dots, p_{n_{b}}^{i}] \in R^{C \times n_{b}}$ collect the outputs of the base classifiers on all $n_{b}$ intervals of $x_{i}$. Note that $p_{j}^{i}$ can be either a one-hot encoded vector or a probability prediction vector, depending on whether the base classifier makes a probability prediction. Subsection 3.2 shows that probability prediction is not necessary; a plain class prediction is sufficient.
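The window sliding described above can be illustrated with a minimal Python sketch. The function names `sliding_intervals` and `slice_spectrum` are hypothetical, and the choice to keep only full windows (dropping a trailing partial window) is an assumption, since the text does not state how the spectrum edge is handled.

```python
def sliding_intervals(n_points, width, step):
    """Return (start, stop) index pairs of windows of `width` points moved by `step`.

    Assumption: only full windows are kept; a trailing partial window is dropped.
    """
    if width <= 0 or step <= 0:
        raise ValueError("width and step must be positive")
    return [(s, s + width) for s in range(0, n_points - width + 1, step)]


def slice_spectrum(spectrum, intervals):
    """Cut one spectrum (a list of absorption values) into its interval segments."""
    return [spectrum[a:b] for a, b in intervals]


# Example: a spectrum sampled at 200 frequency points (illustrative size),
# window width w = 71 and sliding step h = 20.
intervals = sliding_intervals(n_points=200, width=71, step=20)
print(len(intervals))               # number of intervals n_b, one base classifier each
print(intervals[0], intervals[-1])  # first and last index ranges
```

Each segment returned by `slice_spectrum` would be the input of one base classifier $f_{j}$, so the window width and step directly fix $n_{b}$.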

2.2. Two objective functions

How can a few intervals be selected from all $n_{b}$ intervals so that the decision-level fusion of the classifiers trained on these interval segments is optimal? This problem is equivalent to finding a sparse nonnegative vector $α \in R^{n_{b}}$ such that $Y_{i} α$ and $y_{i}$ ( $i = 1, \dots, n_{trn}$ ) are as consistent as possible.

In order to measure the consistency of $Y_{i} α$ and $y_{i}$ , the Mean squared error (MSE) method and Cross entropy (CE) method are discussed in this paper.

For the MSE method, $‖ Y_{i} α - y_{i} ‖$ is adopted to measure the consistency of $Y_{i} α$ and $y_{i}$ , and the optimization objective function is defined as follows

$$L_{MSE}(α) = \frac{1}{2 n_{trn}} \sum_{i=1}^{n_{trn}} ‖ Y_{i} α - y_{i} ‖^{2} + λ_{mse} ‖ α ‖_{1}, \quad α \geq 0, \tag{1}$$

where $λ_{mse} > 0$ is a regularization parameter.
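As a small illustration of objective (1), the following Python sketch evaluates it on a toy problem with two classes and three base classifiers whose outputs are one-hot vectors. The data and the function names `matvec` and `mse_objective` are made up for illustration and are not taken from the experiments.

```python
def matvec(Y, alpha):
    """Compute Y @ alpha for a C x n_b matrix Y stored as a list of rows."""
    return [sum(y * a for y, a in zip(row, alpha)) for row in Y]


def mse_objective(Ys, ys, alpha, lam):
    """Objective (1): mean squared fitting error plus an l1 penalty on alpha."""
    n = len(Ys)
    fit = 0.0
    for Y, y in zip(Ys, ys):
        r = matvec(Y, alpha)
        fit += sum((ri - yi) ** 2 for ri, yi in zip(r, y))
    return fit / (2 * n) + lam * sum(abs(a) for a in alpha)


# Toy data: column j of each Y_i is the one-hot output of base classifier j.
# Classifier 1 is right on both samples; classifiers 2 and 3 are each wrong once.
Ys = [[[1, 1, 0], [0, 0, 1]],   # sample 1, true class 1
      [[0, 1, 0], [1, 0, 1]]]   # sample 2, true class 2
ys = [[1, 0], [0, 1]]

print(mse_objective(Ys, ys, [1.0, 0.0, 0.0], lam=0.01))  # weight only on classifier 1
print(mse_objective(Ys, ys, [0.0, 1.0, 0.0], lam=0.01))  # weight only on classifier 2
```

Putting all weight on the classifier that is always correct leaves only the penalty term, whereas weighting a classifier that errs on one sample adds a fitting error, which is the behaviour the sparse combination exploits.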

For the CE method, the CE loss function widely used in neural network training is adopted [17]. In fact, $Y_{i} α$ can be regarded as the output of the last layer of a neural network, and the SoftMax function can be used to obtain the probability output of $Y_{i} α$ . Given a vector $x \in R^{n}$ , the SoftMax function is defined as follows

$$\mathrm{softmax}(x)_{i} = \frac{e^{x_{i}}}{\sum_{j=1}^{n} e^{x_{j}}} .$$

With $z_{i} = \mathrm{softmax}(Y_{i} α)$ , the optimization objective function of the CE method is defined as follows

$$L_{CE}(α) = - \sum_{i=1}^{n_{trn}} \sum_{k=1}^{C} y_{i}(k) \log [z_{i}(k)] + λ_{ce} ‖ α ‖_{1}, \quad α \geq 0, \tag{2}$$

where $λ_{ce} > 0$ is a regularization parameter, and $z_{i}(k)$ is the k-th entry of the vector $z_{i}$ .
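The SoftMax function and objective (2) can be sketched in a few lines of Python. The toy data are illustrative, `softmax` and `ce_objective` are hypothetical names, and subtracting the maximum before exponentiating is a standard numerical-stability step that is not part of the definition above.

```python
import math


def softmax(x):
    """SoftMax of a list of scores; the max is subtracted for numerical stability."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def ce_objective(Ys, ys, alpha, lam):
    """Objective (2): summed cross entropy of softmax(Y_i alpha) plus an l1 penalty."""
    loss = 0.0
    for Y, y in zip(Ys, ys):
        z = softmax([sum(r * a for r, a in zip(row, alpha)) for row in Y])
        # Only the entry of the true class contributes because y is one-hot.
        loss -= sum(yk * math.log(zk) for yk, zk in zip(y, z) if yk > 0)
    return loss + lam * sum(abs(a) for a in alpha)


# Toy data: two samples, two classes, three base classifiers (one-hot outputs).
Ys = [[[1, 1, 0], [0, 0, 1]],
      [[0, 1, 0], [1, 0, 1]]]
ys = [[1, 0], [0, 1]]

print(ce_objective(Ys, ys, [0.0, 0.0, 0.0], lam=0.01))  # uninformative weights
print(ce_objective(Ys, ys, [2.0, 0.0, 0.0], lam=0.01))  # weight on the reliable classifier
```

Note that, as written in (2), the cross-entropy term is a sum over the training samples rather than a mean, so the effective strength of $λ_{ce}$ depends on the training set size.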

2.3. Solutions to two objective functions

This subsection provides solutions to the two optimization problems $\min_{α} L_{M S E} (α)$ and $\min_{α} L_{C E} (α)$ .

2.3.1. MSE method

A closed-form solution of $\min_{α} L_{MSE}(α)$ cannot be obtained directly. In order to solve $\min_{α} L_{MSE}(α)$ , the alternating direction method of multipliers (ADMM) [18], [19] is applied.

First, an auxiliary variable v is introduced, and the optimization problem (1) is transformed to

$$\min_{α} \frac{1}{2 n_{trn}} \sum_{i=1}^{n_{trn}} ‖ Y_{i} α - y_{i} ‖^{2} + λ_{mse} ‖ v ‖_{1}, \quad \text{s.t.} \; v = α, \; v \geq 0 .$$

The augmented Lagrangian function (in scaled form) is defined as follows:

$$L(α, u, v) = \frac{1}{2 n_{trn}} \sum_{i=1}^{n_{trn}} ‖ Y_{i} α - y_{i} ‖^{2} + λ_{mse} ‖ v ‖_{1} + \frac{μ}{2} ‖ v - α + u ‖^{2} + l^{+}(v),$$

where u is the (scaled) Lagrange multiplier, $μ > 0$ is the penalty parameter, and $l^{+}(v)$ is the indicator function of the nonnegative orthant: $l^{+}(v) = 0$ when $v \geq 0$ , and $l^{+}(v) = + \infty$ otherwise.

By setting

$$\frac{\partial L}{\partial α} = \frac{1}{n_{trn}} \sum_{i=1}^{n_{trn}} Y_{i}^{T} (Y_{i} α - y_{i}) + μ (α - v - u) = 0,$$

we get

$$\left( \sum_{i=1}^{n_{trn}} Y_{i}^{T} Y_{i} + n_{trn} μ I \right) α = \sum_{i=1}^{n_{trn}} Y_{i}^{T} y_{i} + n_{trn} μ (v + u),$$

where I is the identity matrix. The above formula gives

$$α = \left( \sum_{i=1}^{n_{trn}} Y_{i}^{T} Y_{i} + n_{trn} μ I \right)^{-1} \left[ \sum_{i=1}^{n_{trn}} Y_{i}^{T} y_{i} + n_{trn} μ (v + u) \right] .$$

Similarly, by setting the first-order partial derivative of $L (α, u, v)$ with respect to v to be 0, we get

$$v = S_{\frac{λ_{mse}}{μ}}^{+}(α - u) = \begin{cases} α - u - \frac{λ_{mse}}{μ}, & α - u > \frac{λ_{mse}}{μ}, \\ 0, & \text{else}, \end{cases}$$

where the operator $S^{+}$ is applied entry-wise.

According to the ADMM algorithm [18], [19], Algorithm 1 can be obtained.
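Since the listing of Algorithm 1 is not reproduced here, the following pure-Python sketch shows how the two update formulas above could be assembled into an ADMM loop. It is an illustration under stated assumptions, not the authors' implementation: the multiplier update `u = u + v - alpha` is the standard scaled-form ADMM step matching the penalty term $\frac{μ}{2} ‖ v - α + u ‖^{2}$, and the zero initialization, the stopping test on the change of v, and the helper names `solve` and `admm_mse` are all hypothetical.

```python
def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented copy of the inputs
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def admm_mse(Ys, ys, lam=0.01, mu=10.0, n_iter=200, eps=1e-4):
    """ADMM sketch for objective (1); returns the nonnegative sparse copy v of alpha."""
    n, nb = len(Ys), len(Ys[0][0])
    # G = sum_i Y_i^T Y_i + n*mu*I and q = sum_i Y_i^T y_i do not change over iterations.
    G = [[n * mu * (j == k) for k in range(nb)] for j in range(nb)]
    q = [0.0] * nb
    for Y, y in zip(Ys, ys):
        for row, yc in zip(Y, y):
            for j in range(nb):
                q[j] += row[j] * yc
                for k in range(nb):
                    G[j][k] += row[j] * row[k]
    v = [0.0] * nb  # assumption: zero initialization
    u = [0.0] * nb
    for _ in range(n_iter):
        # alpha-update: solve the linear system derived from dL/d(alpha) = 0.
        alpha = solve(G, [q[j] + n * mu * (v[j] + u[j]) for j in range(nb)])
        # v-update: nonnegative soft-thresholding S+ with threshold lam/mu.
        v_new = [max(a - w - lam / mu, 0.0) for a, w in zip(alpha, u)]
        # u-update: standard scaled-form multiplier step (assumption).
        u = [w + vn - a for w, vn, a in zip(u, v_new, alpha)]
        change = max(abs(a - b) for a, b in zip(v_new, v))
        v = v_new
        if change < eps:  # assumed stopping rule
            break
    return v


# Toy data: classifier 1 is always right, classifiers 2 and 3 are each wrong once.
Ys = [[[1, 1, 0], [0, 0, 1]], [[0, 1, 0], [1, 0, 1]]]
ys = [[1, 0], [0, 1]]
print(admm_mse(Ys, ys))
```

On this toy problem the minimizer of (1) can be checked by hand to be α = (0.99, 0, 0), so the weight vector concentrates on the single reliable base classifier.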

2.3.2. CE method

A closed-form solution of $\min_{α} L_{CE}(α)$ cannot be obtained directly either, so a gradient descent algorithm is adopted.

It can be inferred that the gradient of $L_{C E} (α)$ is [20]

$$\nabla L_{CE} = \sum_{i=1}^{n_{trn}} Y_{i}^{T} (\mathrm{softmax}(Y_{i} α) - y_{i}) + λ_{ce} \, \mathrm{sign}(α),$$

where $sign(\cdot)$ is the sign function.

Initializing the weight α and the learning rate τ, according to the gradient descent algorithm, an iterative algorithm for solving $\min_{α} L_{C E} (α)$ can be described as Algorithm 2.
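Because the listing of Algorithm 2 is not reproduced here, the sketch below shows one plausible way to turn the gradient formula into an iteration. Several details are assumptions rather than the authors' choices: α is initialized to zero, `sign(0)` is taken as 0, the constraint $α \geq 0$ is enforced by clipping negative entries to zero after each step (a projection), and the loop stops when the largest change in α falls below ε. The names `ce_gradient` and `gd_ce` are hypothetical.

```python
import math


def softmax(x):
    """SoftMax with the maximum subtracted for numerical stability."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def ce_gradient(Ys, ys, alpha, lam):
    """Gradient formula above: sum_i Y_i^T (softmax(Y_i alpha) - y_i) + lam * sign(alpha)."""
    grad = [lam * ((a > 0) - (a < 0)) for a in alpha]  # sign(0) is taken as 0
    for Y, y in zip(Ys, ys):
        z = softmax([sum(r * a for r, a in zip(row, alpha)) for row in Y])
        for row, zk, yk in zip(Y, z, y):
            for j, r in enumerate(row):
                grad[j] += r * (zk - yk)
    return grad


def gd_ce(Ys, ys, lam=0.01, tau=0.1, n_iter=200, eps=1e-4):
    """Projected gradient descent sketch for objective (2)."""
    nb = len(Ys[0][0])
    alpha = [0.0] * nb  # assumption: zero initialization
    for _ in range(n_iter):
        g = ce_gradient(Ys, ys, alpha, lam)
        # Gradient step followed by clipping to keep alpha nonnegative (assumption).
        new = [max(a - tau * gj, 0.0) for a, gj in zip(alpha, g)]
        change = max(abs(a - b) for a, b in zip(new, alpha))
        alpha = new
        if change < eps:  # assumed stopping rule
            break
    return alpha


# Toy data: classifier 1 is always right, classifiers 2 and 3 are each wrong once.
Ys = [[[1, 1, 0], [0, 0, 1]], [[0, 1, 0], [1, 0, 1]]]
ys = [[1, 0], [0, 1]]
print(gd_ce(Ys, ys))
```

With these choices, the weight of the reliable classifier grows steadily while the weights of the other two stay at (or numerically very close to) zero.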

2.4. Strong classifier construction

Using the α calculated by Algorithm 1 or Algorithm 2, a strong classifier F can be constructed as follows

$$F = \sum_{j=1}^{n_{b}} α_{j} f_{j} .$$

Given a test sample x, we get $F(x) = \sum_{j=1}^{n_{b}} α_{j} f_{j}(x) \in R^{C}$ , and the final output of F is the index of the largest entry of $F(x)$ , i.e., $\arg \max_{c} F(x)_{c}$ .
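The decision rule of the strong classifier can be written compactly. In the Python sketch below, `strong_classifier` is a hypothetical name, `base_outputs` holds the vectors $f_{j}(x)$ for one test sample, and the weights are illustrative values, not results from the experiments.

```python
def strong_classifier(alpha, base_outputs):
    """Return (scores, class index) for F(x) = sum_j alpha_j * f_j(x)."""
    n_classes = len(base_outputs[0])
    scores = [0.0] * n_classes
    for a, p in zip(alpha, base_outputs):
        for c in range(n_classes):
            scores[c] += a * p[c]
    # arg max over the C class scores (0-based index; ties go to the lowest index).
    return scores, max(range(n_classes), key=scores.__getitem__)


# Three interval classifiers vote for three different classes with one-hot outputs;
# the sparse weights decide the outcome (the second classifier has been pruned).
alpha = [0.6, 0.0, 0.3]
base_outputs = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
scores, label = strong_classifier(alpha, base_outputs)
print(scores, label)
```

Base classifiers with $α_{j} = 0$ contribute nothing, so their intervals need not be evaluated at test time.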

3. Experiments and analysis

In order to verify the effectiveness of the proposed algorithm, experiments are carried out in this section to identify the origin of Bupleurum and the harvesting year of Tangerine peel based on THz spectral data. SVM (linear kernel) and DT are used as the base classifiers for ISEMCC. With the MSE method, the proposed algorithms using SVM and DT as the base classifiers are denoted by ISEMCC(MSE+SVM) and ISEMCC(MSE+DT), respectively. Similarly, ISEMCC(CE+SVM) and ISEMCC(CE+DT) denote the proposed algorithms using SVM and DT as base classifiers with the CE method, respectively.

3.1. Data preparation and experimental setup

The acquisition of experimental data is divided into three main steps. First, Bupleurum samples of different origins and Tangerine peel samples of different harvesting years are gathered. Second, for Bupleurum and Tangerine peel, 10 and 6 parallel circular thin flakes per batch of samples are prepared, respectively. To produce a circular thin flake, an appropriate amount of sample is first crushed, ground, and passed through a 100-mesh sieve; about 0.1 g of the sieved powder is then weighed, placed in a tableting mold, and pressed at 24 MPa for 30 s to produce a circular thin flake with a diameter of about 13 mm and a thickness of about 1 mm. Lastly, a transmission mode THz-TDS commercial device (LZ9000 Terahertz Technology Application (Guangdong) Co., Ltd, Guangzhou, China) is used to gather the THz spectral data from these circular thin flakes. To collect THz spectral data, a circular thin flake is first placed in a mold, and the mold is then placed in the sample chamber of the THz-TDS device. Nitrogen is continuously blown into the sample chamber to keep the acquisition environment dry.

Fig. 1 shows the spectral curves of Bupleurum and Tangerine peel. The subgraphs in the left and right figures show the average THz spectral curves of Bupleurum and Tangerine Peel, respectively.

Fig. 1. THz spectral curves of Bupleurum (left) and Tangerine peel (right); each panel contains a subfigure showing the average THz spectral curves.

Table 1 shows the number of batches and samples of each category of the collected THz spectral data of Bupleurum and Tangerine peel. The absorption coefficients in the frequency range of 0.1-1.8 THz are used in the experiments.

Table 1.

Classes, sample numbers, and batch numbers of two THz spectral datasets.

Data Name	Bupleurum			Tangerine peel
Label	North	Tibetan	Cone-leaf	2014	2017	2020
Batch Number	3	5	5	12	18	18
Sample Number	30	50	50	72	108	108


To evaluate the performance of a model, the cross-validation method named leave-one-batch-out (LOBO) [16] is used. LOBO uses the following approach:

1. Split the dataset into a training set and a testing set, using all but one batch of observations as the training set.
2. Build a model using only data from the training set.
3. Use the model to predict the response values of the one batch of observations left out of the model, and calculate the accuracy.
4. Repeat this process until each batch of observations has been tested once.
5. Calculate the overall accuracy (OA) using the predicted response values of all observations.
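The five LOBO steps can be sketched in Python as follows. The function `lobo_overall_accuracy` and the nearest-centroid toy classifier (`fit_centroids`, `predict_centroid`) are hypothetical stand-ins used only to make the example runnable; the toy one-dimensional data are not from the experiments.

```python
def lobo_overall_accuracy(samples, labels, batches, fit, predict):
    """Leave-one-batch-out: hold out each batch once and pool all predictions into the OA."""
    correct = 0
    for held_out in sorted(set(batches)):
        train = [i for i, b in enumerate(batches) if b != held_out]
        test = [i for i, b in enumerate(batches) if b == held_out]
        model = fit([samples[i] for i in train], [labels[i] for i in train])
        correct += sum(predict(model, samples[i]) == labels[i] for i in test)
    return correct / len(samples)


def fit_centroids(X, y):
    """Toy stand-in for a real classifier: store the mean feature value of each class."""
    sums = {}
    for x, c in zip(X, y):
        s, n = sums.get(c, (0.0, 0))
        sums[c] = (s + x, n + 1)
    return {c: s / n for c, (s, n) in sums.items()}


def predict_centroid(model, x):
    """Assign x to the class whose stored mean is closest."""
    return min(model, key=lambda c: abs(model[c] - x))


# Four batches of two samples each; samples within one batch are near-duplicates.
samples = [0.10, 0.20, 0.15, 0.25, 0.90, 1.00, 0.95, 1.05]
labels = ["A", "A", "A", "A", "B", "B", "B", "B"]
batches = [1, 1, 2, 2, 3, 3, 4, 4]
print(lobo_overall_accuracy(samples, labels, batches, fit_centroids, predict_centroid))
```

Because all samples of the held-out batch are removed from training together, near-duplicate flakes from one batch can never appear on both sides of the split.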

Conventional random sample division is not used because it causes sample leakage, which leads to falsely high experimental accuracies [16]. The THz spectral data collected from multiple flakes of the same batch (which all belong to one class) are very similar. Random division places some of these very similar samples in the training set and the rest in the test set, so the test samples are no longer independent of the training samples. In an actual classification task, a batch of samples is never split between the training set and the test set. Therefore, the leave-one-batch-out method is more consistent with practical use.

All experimental code is written in MATLAB. For the base classifier SVM (linear kernel), the box constraint is set to 100, and a one-versus-one coding design is adopted to handle the multiclass problem. All parameters of DT are set to their default values.

The hyperparameters λ, μ, τ, $n_{iter}$ and ε of ISEMCC are set to 0.01, 10, 0.1, 200 and $10^{-4}$ , respectively. The window width and sliding step are selected from {31, 41, 51, 61, 71, 81, 91, 101, 111, 121, 131, 141, 151, 161, 171, 181, 191, 201, 211} and {20, 50, 80}, respectively.

3.2. Experimental results

In this subsection, we first compare the classification results of ISEMCC when the base classifier outputs class predictions and probabilistic predictions, then demonstrate the relationship between ISEMCC's training accuracy and the α value on each interval, and then report test results for the window width and sliding step size. Finally, we show comparative experimental results.

3.2.1. Comparison between class prediction and probabilistic prediction

In order to implement ISEMCC, the first question to decide is whether to choose class prediction (“one-hot” scheme) or probability prediction (“probability” scheme) as the output of the base classifier. Therefore, taking ISEMCC(CE+SVM) as an example, we compared the “one-hot” scheme and the “probability” scheme on the Bupleurum dataset. Fig. 2 shows the results, from which we can see that neither scheme shows a significant advantage, while the highest accuracy of the “one-hot” scheme is higher than that of the “probability” scheme. Based on this observation, the “one-hot” scheme is adopted in the subsequent experiments unless otherwise stated.

Fig. 2. Results of the comparative experiment between the “one-hot” scheme (left) and the “probability” scheme (right) on the Bupleurum dataset.

3.2.2. Relationship between ISEMCC's training accuracy and the α value on each interval

The vector α in formulas (1) and (2) determines the weight of each base classifier, each of which is trained on the data belonging to one interval. Fig. 3 presents the training accuracies and the values of α obtained by ISEMCC(CE+SVM) and ISEMCC(CE+DT) on the Bupleurum and Tangerine peel datasets. The window width and sliding step are set to 71 and 20, respectively.

Fig. 3. Training accuracies and values of α in each interval obtained by ISEMCC(CE+SVM) and ISEMCC(CE+DT) on the Bupleurum and Tangerine peel datasets.

Fig. 3 suggests that THz data in different intervals have different separability, and that the value of α in each interval is roughly positively correlated with the training accuracy of the base classifier and exhibits a certain sparsity (some values of α are zero). In addition, the base classifiers SVM and DT demonstrate different classification performance in each interval. For example, for Tangerine peel, the classification performance of SVM is good at 0.6 and 0.7 THz but decreases significantly after 1.2 THz, while DT demonstrates good performance at 0.2 and 1.5 THz and deteriorates at 0.5 and 0.7 THz. These properties make the proposed ISEMCC an interpretable classification method.

3.2.3. Window width and sliding step testing

The window width and sliding step are two important hyperparameters of the proposed algorithm. They are applied to partition the data into small interval segments, and control the input data dimension of each base classifier and the amount of sample information obtained by each base classifier. Furthermore, they also directly determine the number of base classifiers.

In order to test the sensitivity of the proposed algorithm to window width and sliding step, extensive testing on the data sets of Bupleurum and Tangerine peel has been conducted. Fig. 4 displays the changes in overall accuracy (OA) for ISEMCC(CE+SVM), ISEMCC(MSE+SVM), ISEMCC(CE+DT), and ISEMCC(MSE+DT) with varying window widths and three representative sliding steps. The baseline in each subfigure shows the OA of the corresponding base classifier.

Fig. 4. Classification accuracies of ISEMCC with different window widths under three step sizes on two THz spectral data sets; the baseline in each subfigure shows the OA of the corresponding base classifier.

Fig. 4 suggests that the OA fluctuation of ISEMCC under the three sliding steps is relatively consistent, indicating that it is not sensitive to the sliding step. However, it is somewhat sensitive to the window width: in general, as the window width increases, the OA first increases and then tends to stabilize or decrease. The reason may be that when the window width is small, each base classifier obtains too little sample information, making it difficult to classify samples correctly. As the window width increases, the classification ability of the base classifier improves. As the window width increases further, the model exhibits saturation or supersaturation. Fig. 4 also shows that, except for ISEMCC(CE+DT) and ISEMCC(MSE+DT) on Tangerine peel, the OAs of the proposed ISEMCC with different window widths and sliding steps are significantly higher than that of the corresponding base classifier. The results of ISEMCC(CE+DT) and ISEMCC(MSE+DT) on Tangerine peel are somewhat abnormal, and the reasons still need to be analyzed further.

According to the experimental results shown in Fig. 4, the minimum OA, maximum OA, optimal window widths and sliding step sizes of the proposed algorithm are shown in Table 2.

Table 2.

Minimum OA, maximum OA, Optimal window width and sliding step settings of ISEMCC.

Data set	Algorithm	Optimal (width, step)	Minimum OA	Maximum OA
Bupleurum	CE+SVM	(141, 80), (151, 80), (161, 20)	90.70	96.90
	MSE+SVM	(141, 80), (151, 80)	82.95	96.90
	CE+DT	(91, 20)	81.40	91.47
	MSE+DT	(161, 20), (181, 20)	77.52	90.70

Tangerine peel	CE+SVM	(71, 20)	79.17	88.19
	MSE+SVM	(41, 20)	76.74	85.76
	CE+DT	(61, 80)	45.83	57.29
	MSE+DT	(171, 20), (211,20)	47.22	56.60


3.2.4. Comparative experiments

In order to verify the classification performance of the proposed algorithm, six algorithms, namely SVM (linear kernel and RBF kernel), DT, RF, AdaBoost, RUSBoost [21] and ExtraTree (footnote 1), are used for comparison. The THz spectral data are still Bupleurum and Tangerine peel. The parameters of all six comparison algorithms have been optimized using Bayesian optimization with 100 objective function evaluations. Table 3 shows the experimental results.

Table 3.

Experimental comparison between ISEMCC and six typical classification algorithms.

Algorithm	Bupleurum	Tangerine peel
SVM(linear)	83.72%	82.64%
SVM(RBF)	84.50%	83.68%
DT	76.74%	56.94%
RF	91.47%	58.68%
AdaBoost	92.25%	65.63%
RUSBoost	79.07%	65.28%
ExtraTree	81.40%	63.19%
ISEMCC(MSE+DT)	90.70%	56.60%
ISEMCC(CE+DT)	91.47%	57.29%
ISEMCC(MSE+SVM)	96.90%	85.76%
ISEMCC(CE+SVM)	96.90%	88.19%


As can be seen from Table 3:

• Apart from ISEMCC with DT on the Tangerine peel data set, the proposed algorithm outperforms its base classifiers in terms of classification accuracy.
• SVM-based ISEMCC is better than DT-based ISEMCC.
• CE-based ISEMCC is slightly better than MSE-based ISEMCC.
• Overall, the best configuration, ISEMCC(CE+SVM), achieves a higher classification accuracy than all comparison algorithms on both data sets: 96.90% versus at most 92.25% (AdaBoost) on Bupleurum, and 88.19% versus at most 83.68% (SVM with RBF kernel) on Tangerine peel.

4. Conclusion

In the classification of THz spectral data, it is difficult to obtain ideal results by using conventional classification models directly, and there are still few classification methods that fit the characteristics of THz spectral data. According to the characteristics that the identification information of THz spectral data usually exists in one or more small characteristic peaks, the proposed algorithm transforms the search and selection problem of THz characteristic peaks into a sparse optimization problem by dividing the THz spectral data into intervals, which greatly improves the classification accuracy and makes the proposed algorithm explainable. Admittedly, the way in which the intervals are divided has an obvious impact on the performance of the proposed algorithm, and there seems to be no other efficient method of searching for interval-dividing parameters at present, except for cross-validation methods.

Whether the proposed method is also applicable to the classification of hyperspectral, near-infrared spectroscopy and other data remains to be further studied. The proposed algorithm can also be used for regression problems after slight modification, which will be further investigated in detail.

CRediT authorship contribution statement

Chengyong Zheng: Writing – review & editing, Writing – original draft, Visualization, Validation, Software, Project administration, Methodology, Formal analysis, Conceptualization. Xiaowen Zha: Writing – original draft, Validation, Software. Shengjie Cai: Visualization, Validation, Software, Data curation. Jing Cui: Writing – review & editing, Validation, Investigation. Qian Li: Supervision, Resources, Funding acquisition, Data curation. Zhijing Ye: Writing – original draft, Validation, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work was supported by the Wuyi University Hong Kong-Macau joint Research and Development Fund under Grant 2022WGALH16, supported in part by the National Natural Science Foundation of China under Grant 62001337, the Faculty Research Grants of the Macau University of Science and Technology under Grant FRG-22-101-FIE, and the Special Foundation in Key Fields for Universities of Guangdong Province under Grant 2023ZDZX4060.

Footnotes

1. https://github.com/rtaormina/MATLAB_ExtraTrees

Contributor Information

Jing Cui, Email: 12810626@qq.com.

Zhijing Ye, Email: xkinghust@163.com.

Data availability

The data that support the findings of this study are available from Chengyong Zheng (zcy@wyu.edu.cn), upon reasonable request.

References

1. Fedotov V. Phase control of terahertz waves moves on chip. Nat. Photonics. 2021;15(10):715–716. doi: 10.1038/s41566-021-00887-8.
2. Peng Y., Shi C., Zhu Y., Gu M., Zhuang S. Terahertz spectroscopy in biomedical field: a review on signal-to-noise ratio improvement. PhotoniX. 2020;1(1):12. doi: 10.1186/s43074-020-00011-z.
3. Bauer M., Friederich F. Terahertz and millimeter wave sensing and applications. Sensors (Basel, Switzerland) 2022;22(24):9693. doi: 10.3390/s22249693.
4. Fu X., Liu Y., Chen Q., Fu Y., Cui T.J. Applications of terahertz spectroscopy in the detection and recognition of substances. Front. Phys. 2022;10 doi: 10.3389/fphy.2022.869537.
5. Huang S., Deng H., Wei X., Zhang J. Progress in application of terahertz time-domain spectroscopy for pharmaceutical analyses. Front. Bioeng. Biotechnol. 2023;11 doi: 10.3389/fbioe.2023.1219042.
6. Friska J., Navaneetha Velammal M., Rajeshwari A., Hannah P. Blessy, Random Forest (RF) based identification of rice powder mixture using terahertz spectroscopy. J. Phys. Conf. Ser. 2021;1979(1) doi: 10.1088/1742-6596/1979/1/012056.
7. Pan S., Zhang H., Li Z., Chen T. Classification of Ginseng with different growth ages based on terahertz spectroscopy and machine learning algorithm. Optik. 2021;236 doi: 10.1016/j.ijleo.2021.166322.
8. Zhu Y., Shi C., Wu X., Peng Y. Terahertz spectroscopy algorithms for biomedical detection. Acta Opt. Sin. 2020;41(1) doi: 10.3788/AOS202141.0130001.
9. Liu H., Zhang Z., Zhang X., Yang Y., Zhang Z., Liu X., Wang F., Han Y., Zhang C. Dimensionality reduction for identification of hepatic tumor samples based on terahertz time-domain spectroscopy. IEEE Trans. Terahertz Sci. Technol. 2018;8(3):271–277. doi: 10.1109/TTHZ.2018.2813085.
10. Huang J., Liu J., Wang K., Yang Z., Liu X. Classification and identification of molecules through factor analysis method based on terahertz spectroscopy. Spectrochim. Acta, Part A, Mol. Biomol. Spectrosc. 2018;198:198–203. doi: 10.1016/j.saa.2018.03.017.
11. Liu P., Zhang X., Pan B., Wei M., Zhang Z., Harrington P.B. Classification of sand grains by terahertz time-domain spectroscopy and chemometrics. Int. J. Environ. Res. 2019;13(1):143–160. doi: 10.1007/s41742-018-0159-y.
12. Zhu H., Wang H., Liu J., Wang W., Gao R., Zhang Y. Application of terahertz dielectric constant spectroscopy for discrimination of oxidized coal and unoxidized coal by machine learning algorithms. Fuel. 2021;293 doi: 10.1016/j.fuel.2021.120470.
13. Zhang H., Li Z., Chen T., Liu J. Discrimination of traditional herbal medicines based on terahertz spectroscopy. Optik. 2017;138:95–102. doi: 10.1016/j.ijleo.2017.03.037.
14. Sarjaš A., Pongrac B., Gleich D. Automated inorganic pigment classification in plastic material using terahertz spectroscopy. Sensors. 2021;21(14):4709. doi: 10.3390/s21144709.
15. Huang P., Cao Y., Chen J., Ge W., Hou D., Zhang G. Analysis and inspection techniques for mouse liver injury based on terahertz spectroscopy. Opt. Express. 2019;27(18) doi: 10.1364/OE.27.026014.
16. Zheng C., Cai S., Li Q., Li C., Li X. A collaborative classification algorithm with multi-view terahertz spectra. Results Phys. 2022;42 doi: 10.1016/j.rinp.2022.106023.
17. Zhou Y., Wang X., Zhang M., Zhu J., Zheng R., Wu Q. Mpce: a maximum probability based cross entropy loss function for neural network classification. IEEE Access. 2019;7:146331–146341. doi: 10.1109/ACCESS.2019.2946264.
18. Boyd S., Parikh N., Chu E., Peleato B., Eckstein J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 2011;3(1):1–122. doi: 10.1561/2200000016.
19. Zheng C.Y., Li H., Wang Q., Philip Chen C. Reweighted sparse regression for hyperspectral unmixing. IEEE Trans. Geosci. Remote Sens. 2016;54(1):479–488. doi: 10.1109/TGRS.2015.2459763.
20. Li L., Doroslovački M., Loew M.H. Approximating the gradient of cross-entropy loss function. IEEE Access. 2020;8:111626–111635. doi: 10.1109/ACCESS.2020.3001531.
21. Seiffert C., Khoshgoftaar T.M., Van Hulse J., Napolitano A. Rusboost: a hybrid approach to alleviating class imbalance. IEEE Trans. Syst. Man Cybern., Part A, Syst. Hum. 2010;40(1):185–197. doi: 10.1109/TSMCA.2009.2029559.

Data Availability Statement

The data that support the findings of this study are available from Chengyong Zheng (zcy@wyu.edu.cn), upon reasonable request.
