SSC: The novel self-stack ensemble model for thyroid disease prediction

Shengjun Ji

doi:10.1371/journal.pone.0295501

. 2024 Jan 3;19(1):e0295501. doi: 10.1371/journal.pone.0295501

SSC: The novel self-stack ensemble model for thyroid disease prediction

Shengjun Ji ^1,^*

Editor: Maciej Huk²

PMCID: PMC10763970 PMID: 38170718

Abstract

Thyroid disease presents a significant health risk, lowering the quality of life and increasing treatment costs. The diagnosis of thyroid disease can be challenging, especially for inexperienced practitioners. Machine learning has been established as one of the methods for disease diagnosis based on previous studies. This research introduces a novel and more effective technique for predicting thyroid disease by utilizing machine learning methodologies, surpassing the performance of previous studies in this field. This study utilizes the UCI thyroid disease dataset, which consists of 9172 samples and 30 features, and exhibits a highly imbalanced target class distribution. However, machine learning algorithms trained on imbalanced thyroid disease data face challenges in reliably detecting minority data and disease. To address this issue, re-sampling is employed, which modifies the ratio between target classes to balance the data. In this study, the down-sampling approach is utilized to achieve a balanced distribution of target classes. A novel RF-based self-stacking classifier is presented in this research for efficient thyroid disease detection. The proposed approach demonstrates the ability to diagnose primary hypothyroidism, increased binding protein, compensated hypothyroidism, and concurrent non-thyroidal illness with an accuracy of 99.5%. The recommended model exhibits state-of-the-art performance, achieving 100% macro precision, 100% macro recall, and 100% macro F1-score. A thorough comparative assessment is conducted to demonstrate the viability of the proposed approach, including several machine learning classifiers, deep neural networks, and ensemble voting classifiers. The results of K-fold cross-validation provide further support for the efficacy of the proposed self-stacking classifier.

1 Introduction

Thyroid disease diagnosis is a challenging and time-consuming procedure that requires substantial knowledge and experience [1]. Two common methods for diagnosis are doctors’ examinations and a multitude of blood tests. Early detection of thyroid disease is highly desirable due to its wide-ranging symptoms, including difficulties in losing weight, obesity, constipation, muscle pain, hypersensitivity to colds, fatigue, and exhaustion. Moreover, thyroid disease affects a significant portion of the global population, with 12% of the US population experiencing the disease at some point in their lives according to the American Thyroid Association (ATA) [2]. Therefore, an accurate and prompt disease identification system is essential to avoid major risks to the patient.

Machine learning has completely transformed thyroid disease detection and identification by making optimal use of time and decreasing misdiagnoses brought on by human mistakes. With the use of machine learning algorithms, efficient techniques for knowledge discovery and classification in thyroid disease datasets can be established [3]. For instance, Raisinghani et al. [4] proposed a machine learning-based prediction model for thyroid disease detection using Decision Trees (DT). Similarly, Tyagi et al. [5] estimated the probable risk of thyroid disease using conventional machine learning classifiers. Researchers in [6] evaluated the potential of machine learning classifiers such as K-Nearest Neighbours (KNN), Support Vector Machine (SVM), and Random Forest (RF) in the prediction of thyroid disease. Several other researchers collaborated to develop an effective machine learning-based approach for detecting thyroid disease [7–10]. However, the proposed systems incorporated machine learning algorithms that operated the detection of disease in an individual manner. An ensemble model, on the other end, has been established to be a promising strategy to increase the stability and accuracy of a classifier [11]. Numerous studies have been conducted utilizing various techniques to create efficient ensembles. Stacking is one of the most effective approaches for integrating classifiers and enhancing prediction accuracy [12, 13].

This research suggests a systematic approach for the reliable and efficient diagnosis of thyroid disease so that medical professionals can benefit from the advancements in computer science research. We focus on developing an efficient stacked ensemble algorithm that can provide highly accurate and reliable predictions for thyroid disease detection. Traditional stacking algorithms employ a collection of heterogeneous base learners and a single meta-learner. Base learners are machine learning algorithms that produce results after being trained on dataset attributes. The meta-learner is a machine learning classifier that determines the best approach to integrate those output predictions to generate the final outcome. Since the performance of different machine learning algorithms varies, integrating the prediction of different algorithms optimizes classification performance. In contrast to heterogeneous base learners, this study uses a self-stacking ensemble in which the output predictions of four distinct RF variants as base learners are combined and delivered to another RF variant functioning as a meta-learner. RF is useful for high-dimensional problems with highly correlated attributes and has a high prediction accuracy, especially in the context of medical diagnosis that frequently arises in bio-informatics [14–16].

Another problem associated with disease diagnosis is the imbalanced distribution of disease samples in the dataset, which could produce misleading and unreliable results. This is because machine learning models learn the decision boundary for the majority class more efficiently than for the minority class. Therefore, one of the most well-known approaches to classifying an imbalanced dataset is to modify the dataset composition [17]. To achieve this, we down-sample the dataset to generate an equal number of samples for each target variable. This will produce ample data for each target variable needed for algorithm training.

Our study’s uniqueness can be summed up as follows:

The resampling technique was applied to address data discrepancies. Down-sampling was used in the majority of classes, resulting in a balanced distribution of samples across each target variable.
The behavior of machine learning models in the field of thyroid disease data was investigated, which is a significant new area to consider in terms of imbalanced data.
A randomized tree-based self-stacking ensemble model was developed for highly accurate detection of thyroid disease.
The effectiveness of the proposed approach was demonstrated by comparing its performance with that of other machine learning, ensemble, and deep learning models.
While previous studies mainly focused on binary classification for thyroid disease detection [18–21], our objective was to devise a multi-class classification system for detecting thyroid disease.

The remainder of the article is structured as follows: After a brief introduction to the underlying research, Section 2 discusses the background of thyroid disease and its previously presented detection methods. The description of the methodology used for this study’s experimental approach is provided in Section 3. In Section 4, results from extensive experiments are reported and discussed. A comparative analysis of the results is presented in Section 5. Finally, Section 6 draws the research to a conclusion.

2 Background

The primary cause of thyroid disease is an imbalance in the hormones generated by the thyroid gland. Euthyroidism, hyperthyroidism, and hypothyroidism, which pertain to normal, excessive, or deficient thresholds of thyroid hormones, are the rationale for diagnosing thyroid disease. Euthyroidism describes the thyroid gland’s normal thyroid hormone production and cellular thyroid hormone levels [22]. The clinical manifestation of hyperthyroidism is increased circulatory and intracellular thyroid hormones [23]. Hypothyroidism is caused mostly by a deficiency of thyroid hormone production and inadequate alternative therapy [24]. Albeit, thyroid disease detection appears to be a basic recurring activity for medical experts, there is an unmet need to facilitate the reader in diagnosing thyroid disease with greater accuracy and consistency [25]. Moreover, the treatment of disease is a continuing problem for medical professionals, and accurate and timely diagnosis is crucial.

Recently, machine learning and deep learning demonstrated notable advances in the diagnosis of thyroid disease. Sidiq et al. [26], for example, used different machine learning models, including K-Nearest Neighbors (KNN), Decision Tree (DT), Naive Bayes (NB), and Support Vector Machine (SVM), to diagnose thyroid disease from a clinical dataset collected from a medical facility in Kashmir. The dataset included 553 data samples from healthy individuals, 218 samples from hypothyroid patients, and 36 samples from hyperthyroid patients. The authors conducted classification and reported the highest accuracy score of 98.89% yielded by DT. Another study [27], evaluated the potential of three machine learning models including KNN, DT, and Logistic Regression (LR) for the diagnosis of thyroid disease. The authors carried out their experiments using the “new thyroid” dataset from the UCI thyroid repository. Two attributes including total serum triiodothyronine (T3) and total serum thyroxin (T4) from a total of 5 attributes were used for model training. The experimental results showed that KNN outperformed other models with the highest accuracy of 96.875%. In [28], Jha et al. applied data augmentation and reduction techniques to improve the efficacy of diagnosing thyroid disease. For dimensionality reduction, the authors used Principal Component Analysis (PCA), Singular Value Decomposition (SVD), and DT; for data augmentation, they used Gaussian Distribution (GD). The dataset included 3152 samples, of which 286 samples corresponded to people with thyroid disease and 2864 samples to healthy people. The experiments demonstrated that Deep Neural Network (DNN) trained using augmented data procured a maximum accuracy of 99.95%. Yadav et al. [7] worked on the UCI repository dataset for predicting thyroid disease. Using the bagging technique, the authors proposed an ensemble of DT, Random Forest (RF), and Extra Tree (ET). The study demonstrated that the ensemble technique produced reliable results with 100% accuracy. In [8], machine learning models—SVM, Multiple Linear Regression (MLR), and DT—were used to perform comparative thyroid disease detection, with DT achieving the highest accuracy of 97.97%.

Abbad Ur Rehman et al. [9] implemented L₁ and L₂ feature selection techniques to get effective results. The authors collected 309 patient samples from District Headquarter (DHQ) Teaching Hospital, Dera Ghazi Khan, Pakistan. The dataset comprised 10 attributes and one target column with the imbalanced distribution of three classes. Five machine learning models including KNN, SVM, DT, LR, and Naive Bayes (NB) were implemented for thyroid disease prediction. Experimental results showed that the L₁-based feature selection technique enhanced the performance of models. Similarly, Salman and Sonuc [29] implemented different machine learning models to determine their applicability in the detection of thyroid disease. The study included 1250 data samples from Iraqis that were collected from various hospitals and labs. The experimental results indicated that when only 13 out of the total 16 features were used for training, RF reported a maximum accuracy of 98.92%. Likewise, Shivastuti et al. [10] compared the effectiveness of SVM and RF for diagnosing thyroid disease. The research utilized 7200 data samples from a dataset available in the UCI repository. SVM surpassed RF in terms of accuracy by 1%, according to the experimental results. The experimental findings in [30] revealed that bagging ensemble integrated with three feature selection techniques including Recursive Feature Elimination (RFE), Select K-Best (SKB), and Select From Model (SFM) showed robust results in thyroid diagnosis. Similarly, Akhtar et al. [31] selected attributes from the “thyroid 0387” dataset using RFE, SKB, and SFM. The authors developed an effective unified ensemble of ensembles for improved thyroid disease diagnosis.

The aforementioned literature review is summarized in Table 1. It is worth highlighting that the majority of systems for diagnosing thyroid disease relied on attribute selection whereas, model training was carried out using an imbalanced dataset. Numerous research demonstrated that skewed results are produced by imbalanced data [32, 33]. Nevertheless, since they lack sufficient prior knowledge, they may even provide overfitted or under-fitted predictions [34]. In [35], Chaganti et al. randomly selected 400 samples of normal class to balance the dataset comprised of 233, 346, 359, and 436 samples of primary hypothyroid, increased binding protein, compensated hypothyroid, and concurrent non-thyroidal illness classes respectively. The authors also performed attribute selection using backward, forward, and Bi-Directional elimination techniques (BFE, FFE, and BiDFE, respectively). Machine learning-based feature selection (MLFS) using ET was also performed. Several machine learning and deep learning models including RF, Gradient Boosting Machine (GBM), LR, SVM, AdaBoost, Long Short Term Memory (LSTM), Convolutional Neural Network (CNN), and CNN-LSTM were used to perform thyroid disease diagnosis. The results reported the highest accuracy of 99% by RF when integrated with attributes selected by the MLFS feature selection technique. This study serves as the basis of comparison for all of our subsequent experiments. Conversely, to balance the dataset and provide more reliable results, we adopt a downsampling approach.

Table 1. A summary of recent publications for diagnosing thyroid disease using machine learning and deep learning approaches.

Ref	Year	Dataset	Sample Size	Target Distribution	Models	Accuracy Score
[26]	2019	Medical Facility, Kashmir	807	Imbalanced	KNN, DT, SVM, and NB	KNN: 91.82%, SVM: 96.52%, NB: 91.57&, and DT: 98.9%
[8]	2019	UCI	50	N/A	DT, SVM, and MLR	DT: 97.97%
[7]	2020	UCI	3710	Imbalanced	DT, RF, ET, and Ensemble	DT: 98%, RF: 99%, ET: 93%, and Ensemble: 100%
[27]	2021	UCI	215	Imbalanced	KNN, DT, and LR	KNN: 96.875%, DT: 87.5%, and LR: 81.25%
[9]	2021	DHQ, DG Khan, Pakistan	309	Imbalanced	SVM, KNN, DT, LR, and NB	KNN-L₁: 97.84%, KNN-L₂: 96.77%, DT-L₁: 75.34%, DT-L₂: 76.92%, NB-L₁: 100%, NB-L₂: 100%, SVM-L₁: 86.02%, SVM-L₂: 86.02%, LR-L₁: 100%, and LR-L₂: 98.82%
[29]	2021	Labs and Hospitals in Iraq	1250	N/A	DT, SVM, RF, NB, LR, KNN, LDA Linear Discriminant Analysis, and MLP Multi-layer Perceptron	DT: 98.4%, SVM: 92.27%, RF: 98.93%, NB: 81.33%, LR: 91.47%, LDA:83.2%, KNN:90.93%, and MLP: 97.6%
[10]	2021	UCI	7200	N/A	SVM and RF	SVM: 93% and RF: 92%
[30]	2021	DHQ, DG Khan, Pakistan	309	Imbalanced	RF, Base Meta Estimator (BME), AdaBoost, and XGBoost	Accuracy = 100%
[28]	2022	UCI	3152	Imbalanced	KNN, DNN	KNN-PCA: 94.92%, KNN-SVD: 95.72%, KNN-DT: 97.94%, DNN-PCA: 96.04%, DNN-SVD: 96.67%, DNN-DT: 98.70%, and DNN-GD: 99.95%
[31]	2022	UCI Thyroid 0387	7200	N/A	RF, BME, AdaBoost, and XGBoost	LR-RFE: 99.27%
[35]	2022	UCI	1774	Balanced	RF, GBM, LR, AdaBoost, SVM, LSTM, CNN, CNN-LSTM	RF-MLFS: 99%

Open in a new tab

3 Methodology

This section discusses the dataset and research methodology employed in this study for thyroid disease prediction. Initially, we acquired a dataset for thyroid disease from the UCI machine learning repository [36], which contains the data of a number of thyroid disease-related cases. Multiple target classes and samples related to thyroid disease are included in the dataset. The used dataset was highly imbalanced as the target distribution was unequal in the dataset. Our approach begins by balancing the target class distribution to produce reliable outcomes for thyroid disease detection. We used only five class samples for the experiments in which we deployed a down-sampling approach on specific data. This down-sampling approach makes the dataset balanced as discussed in Section 3.1. After data balancing, we split it into train and test sets with an 80:20 ratio. 80% of the total data was used for training of learning models, and the remaining 20% for testing purposes of the trained models. The training set was then passed to the proposed self-stacking classifier (SSC) (discussed in Section 3.2) and all other used models, which were trained for thyroid disease prediction. After training on learning models, we passed test data to perform the evaluation. For the evaluation of all models, four evaluation scores, namely accuracy, precision, recall, and F1 score were measured after receiving the test set. We also deployed 10-fold cross-validation with both the original dataset and down-sampled dataset approach and reported mean accuracy and standard deviation scores. In the end, we performed statistical analysis on the results of all approaches. For all experiments, this study used several machine learning classifiers, ensemble learners, and deep learning classifiers, followed by a comparative analysis. Fig 1 showcases the experimental workflow of this study.

All experiments were performed using a Corei7, a 12th-generation Dell machine with a Windows operating system. We used Python language to implement the proposed approach in Jupyter Notebook. Several libraries such as sci-kit learn, TensorFlow, and Keras were used for experimental purposes.

3.1 Data description & balancing

The data used in this study comes from the thyroid disease dataset obtained from the UCI machine learning repository [36]. The original dataset contains 31 features and 9172 sample records [35]. Table 2 presents the feature description of the underlying dataset. The target variable includes classes of thyroid disease-related health issues and diagnoses presented in Table 3. The distribution of classes is substantially imbalanced. Uneven distribution of a target variable in a dataset causes machine learning models to perform erroneously [37]. Moreover, models that have been trained on imbalanced data are ineffective at detecting minority data, and thus disease samples [38]. To address this challenge, we down-sampled the dataset and utilized the first 230 samples of each of the five diagnostic classes including primary hypothyroid (F), increased binding protein (I), compensated hypothyroid (G), concurrent non-thyroidal illness (K), and no condition (-). These target variables were selected based on the minimum sample size of 200. Table 4 displays the sample distribution of the selected target variables for the present research.

Table 2. Feature description of thyroid disease dataset.

Sr.	Feature	Description	Sr.	Feature	Description
1.	age	patients’ age	17.	TSH_measured	either patient’s blood has TSH
2.	sec	gender of patient	18.	TSH	TSH level in patients’ blood
3.	on_thyroxine	either the person is taking thyroxine	19.	T3_measured	either patient’s blood has T3
4.	query on thyroxine	either the person is taking thyroxine	20.	T3	T3 level in patients’ blood
5.	on antithyroid meds	either the patient is taking antithyroid meds	21.	TT4_measured	either patient’s blood has TT4
6.	sick	either patient is sick or not	22.	TT4	TT4 level in patients’ blood
7.	pregnant	pregnancy status of patient	23.	T4U_measured	either patient’s blood has T4U
8.	thyroid_surgery	either the patient has had thyroid surgery	24.	T4U	T4U level in patients’ blood
9.	I131_treatment	whether patient is undergoing I131 treatment	25.	FTI_measured	either patient’s blood has FTI
10.	query_hypo-thyroid	if the patient believes they have hypothyroidism	26.	FTI	FTI level in patients’ blood
11.	query_hyperthyroid	if the patient believes they have hyperthyroidism	27.	TGB_measured	either patient’s blood has TGB
12.	lithium	either the patient has lithium	28.	TGB	TGB level in patients’ blood
13.	goitre	either the patient has goitre	29.	referral_source
14.	tumor	either the patient has tumor	30.	target	hyperthyroidism medical diagnosis
15.	hypopituitary	either the patient has hyperpituitary gland	31.	patient_id	identification number of patient
16.	psych	whether patient * psych

Open in a new tab

Table 3. Count for target variables in thyroid disease dataset.

Health Issues	Diagnosis	Count
hyperthyroid	hyperthyroid (A)	147
	T3 toxic (B)	21
	toxic goiter (C)	6
	secondary toxic (D)	8
hypothyroid	hypothyroid (E)	1
	primary hypothyroid (F)	233
	compensated hypothyroid (G)	359
	secondary hypothyroid (H)	8
binding protein	increased binding protein (I)	346
binding protein	decreased binding protein (J)	30
general health	concurrent non-thyroidal illness (K)	436
replacement therapy	under replaced (M)	111
	consistent with replacement therapy (L)	115
	over replaced (N)	110
antithyroid treatment	antithyroid drugs (O)	14
	I131 treatment (P)	5
	surgery (Q)	14
miscellaneous	discordant assay results (R)	196
	elevated TBG (S)	85
	elevated thyroid hormones (T)	0
no condition	sample record of healthy person (-)	6771

Open in a new tab

Table 4. Down-sampled data for efficient thyroid disease detection.

Target Variable	Original Sample Count	Down-sampled Count
primary hypothyroid (F)	233	230
increased binding protein (I)	346	230
compensated hypothyroid (G)	359	230
concurrent non-thyroidal illness (K)	436	230
no condition (-)	6771	230

Open in a new tab

Figs 2 & 3, the feature importance scores are displayed for both the original dataset and the down-sampled dataset. These scores were obtained using the Extra Trees Classifier (ETC), which takes all features and the target into account for both scenarios. The ETC is a tree-based classifier that determines the importance of each feature by calculating its entropy criterion. The feature importance scores for both scenarios are highlighted in Figs 2 and 3. It is observed that there is only a minor difference between the feature importance scores for both scenarios and the range of feature importance scores has increased after balancing the data which helps to improve the performance of learning models in this study.

3.2 Proposed Self Stacking Classifier (SSC)

This study proposes a novel stacked ensemble, called a self-stacking classifier (SSC). The traditional stacking model combines heterogeneous estimators (base learners) to minimize their errors and then provide them as inputs to the final estimator (meta-learner). The stacked classifier leverages the power of multiple base learners on a classification problem to create predictions. These output predictions act as features for the training of meta-classifiers. Finally, the meta-stacking technique is used to find the optimum way to integrate predictions from several underlying base learners to give the final output. Fig 4 shows the architecture of SSC.

Contrary to heterogeneous base learners and meta-learners, this study adopts a self-stacking ensemble where the same machine learning model i.e. RF, serves as a base learner. We selected RF—a combination of various de-correlated decision trees (DTs)—because of its high efficacy in high-dimensional classification problems [39]. Consider a training set T = {X_i, Y_i} where i = 1, 2, 3, …, n is the number of vectors, X ∈ S where S is a set of sample data and Y ∈ C where C is a set of target variables. When applied to the underlying problem, the model maps the input sample data to the target variable S → C. Each DT in RF classifies a new input vector yielding a certain output prediction. RF builds DTs by incorporating training bootstrapped samples from T and selecting a subset of S at each node. This method of training is used for each $R F_{b a s e_l e a r n e r_{m}}$ where m = 1, …, 4, functioning as a base learner. Moreover, the stacking model integrates k-fold cross-validation for the training of each $R F_{b a s e_l e a r n e r_{m}}$ in the proposed SSC model. The final predictions FP_k made by each fold of the k-fold act as features for the training of meta-classifier (RF_{meta_learner}). Finally, RF_{meta_learner} determines the optimal combination of FP_m to finally classify the underlying data sample and detect the underlying thyroid disease. Algorithm 1 presents the pseudocode of the proposed SSC model.

Algorithm 1 Pseudocode of proposed Self-Stacking Classifier

Input: Training Set, T = {S_i, C_i} where, S_i=input features and C_i=target variable, i = 1 to 4 is the number of base learners.

Output: Self-Stacking Classifier X_SSC

1: Step 1: Learn about $R F_{b a s e_l e a r n e r s_{i}}$ where, i = 1 to 4

2: for t ← 1 to 4 do

3: Learn $R F_{b a s e_l e a r n e r s_{t}}$ based on T

4: end for

5: Step 2: Construct new training dataset T_s from T

6: for i ← 1 to m do

7: $T_{s} = {S_{i}^{'}, C_{i}}$ where, $S_{i}^{'} = {R F_{b a s e_l e a r n e r s_{1}} (S_{i}), . . ., R F_{b a s e_l e a r n e r s_{n}} (S_{i})}$

8: end for

9: Step 3: Learn the meta classifier RF_{meta_learner}

10: Train RF_{meta_learner} on newly constructed dataset T_s

11: return $X_{S S C} (S) = R F_{m e t a_l e a r n e r} (R F_{b a s e_l e a r n e r_{1}} (S), . . . R F_{b a s e_l e a r n e r_{4}} (S))$

3.3 Algorithms tested

In this study, we tested several machine learning algorithms which have significant applications in different domains, such as the health care [40], Internet of Things (IoT) [41], machine vision [42], edge computing [43], education [44, 45], and many others. In order to conduct a fair comparative evaluation of our proposed SSC model for the detection of thyroid disease, we chose the following machine learning classifiers: RF due to its effectiveness, interpretability, non-parametric nature, and high accuracy rate across a range of data types; GBM, which has various benefits including adaptability, robust tolerance to anomalous inputs, and high accuracy; AdaBoost, since it is less susceptible to overfitting; LR, because its training and implementation processes are simple; and Support Vector Classifier (SVC), that has advantages including efficiently handling high dimensional data [46]. For these algorithms to perform at their maximum, we optimized their hyperparameters. We select the best hyperparameter setting by tuning the models on specific value ranges. Table 5 presents the hyperparameter settings for each algorithm.

Table 5. Hyperparameter settings for machine learning algorithms.

Algorithm	Hyperparameter	Tuning Range
RF	n_estimators = 200, max_depth = 50	n_estimators={20 to 200}, max_depth={5 to 100}
GBM	max_depth = 200, learning_rate = 0.2, n_estimators = 50, random_state = 52	n_estimators={20 to 200}, max_depth={5 to 100}, random_state={2 to 100 }, learning_rate= {0.1 to 0.9 }
AdaBoost	n_estimators = 300, random_state = 5, learning_rate = 0.8	n_estimators={20 to 200}, random_state={2 to 100 }, learning_rate= {0.1 to 0.9 }
LR	solver=‘saga’, multi_class=‘multinomial’, C = 3.0	solver= [‘saga’,‘sag’,‘liblinear’], multi_class=‘multinomial’, C={1.0 to 5.0}
SVC	kernel=‘linear’, C = 1.0	kernel=[‘linear’, ‘poly’, ‘sigmoid’], C={1.0 to 5.0}

Open in a new tab

Likewise, we utilized self voting ensemble classifier (SVEC) to further emphasize the significance and usefulness of the proposed classifier. The traditional voting classifier is a meta-classifier for classification based on “hard” or “soft” voting criteria that integrate similar or distinct machine learning models. Contrary to this, we adopted a self-voting mechanism in which the output of the same model is combined to produce the final prediction. We employed RF as the base learner in SVEC and used both hard and soft voting criteria to classify the thyroid disease dataset. Hard voting adheres to the basic majority voting criteria. Soft voting classifies target values by aggregating the projected probabilities of the base classifiers and selecting the target variable with the highest probability. Algorithms 2 and 3 demonstrate the pseudocodes for hard voting (SVEC-H) and soft voting (SVEC-S) respectively. The mode in SVC-H Algorithm 2, represents that the target class with the highest vote will be the final prediction, and argmax in SVC-S Algorithms 3, represent that the class with the highest average probability will be the final prediction.

Algorithm 2 Pseudocode of Hard Self-Voting Ensemble Classifier

Input

T = Training set representing A target variables.

Classifier_base = Base Classifiers i.e. RF_base1, RF_base2, and RF_base3

L = Labels of training data.

D = Number of base classifiers.

Output: Diagnostic Class

1: procedure:

2: do n = 1 to D

3: Call Classifier_base with T_n

4: Receive the classifier $C l a s s i f i e r_{b a s e_{n}}$

5: Compare L_n with A_n generated by $C l a s s i f i e r_{b a s e_{n}}$

6: Update vote V

7: Final Prediction← mode{RF_base1(V), RF_base2(V), RF_base3(V)}

Algorithm 3 Pseudocode of Soft Self-Voting Ensemble Classifier

Input:

T = Training set

D = Testing Set

Classifier_base = Base Classifiers i.e. RF_base1, RF_base2, and RF_base3

Output: Diagnostic Class

1: Step 1: Training base classifiers

2: R₁ = RF_base1(T)

3: R₂ = RF_base2(T)

4: R₃ = RF_base3(T)

5: Step 2: Testing base classifiers

6: P(R₁) ← D

P(R₁) = {a₁, a₂, …, a_n}

7: P(R₂) ← D

P(R₂) = {b₁, b₂, …, b_n}

8: P(R₃) ← D

P(R₃) = {c₁, c₂, …, c_n}

9: Final Prediction ← $a r g m a x (\sum_{i = 0}^{n} {P (R_{1}) + P (R_{2}) + P (R_{3})})$

To further substantiate the performance of the SSC classifier, deep neural networks were also taken into consideration. A deep neural network consists of artificial neurons that have been connected and exchanged their output with neighboring neurons. They incorporate an optimization or loss function to maximize the output in conjunction with input, hidden, and output layers. For the intended result, the weights are increased in every subsequent epoch. We employed CNN and LSTM, two well-known deep neural networks. The hyperparameter setting for deep learning models is according to the literature. We study the researchers who worked on the same kinds of datasets and we used the same kind of state-of-the-art architectures to achieve significant accuracy [35, 44, 47]. Additionally, we integrated these neural networks and used a CNN-LSTM hybrid to detect thyroid disease. Table 6 provides the layered architecture of the deep neural networks used in this study.

Table 6. Layered architecture of deep neural networks.

LSTM	CNN	CNN-LSTM
Embedding(5000,200)	Embedding(5000,200)	Embedding(5000,200)
Dropout(0.5)	Conv1D(128,5,activation=‘relu’)	Conv1D(128,5,activation=‘relu’)
LSTM(128)	MaxPooling1D(pool_size = 5)	MaxPooling1D(pool_size = 2)
Dropout(0.5)	Activation(‘relu’)	LSTM(64,return_sequences = True)
LSTM(64)	Dropout(rate = 0.5)	Conv1D(128,5,activation=‘relu’)
Dropout(0.5)	Flatten()	MaxPooling1D(pool_size = 2)
Dense(32)	Dense(32)	Flatten()
Dense(5)	Dense(5)	Dense(5)
loss =‘categorical_crossentropy’, optimizer =‘adam’, epochs = 100, batch_size = 128, activation=‘softmax’

Open in a new tab

3.4 Performance estimation evaluation estimators

The performance of classifiers used in this study is assessed using accuracy and macro estimators (macro precision, macro recall, and macro F1-score). The rationale for using macro estimators is that the micro estimators aggregate the average parameter and produce results that are biased toward classes with many test instances. Likewise, weighted estimators consider the sample size for each target variable when computing the average, resulting in skewed results in the case of imbalanced data.

The very unbalanced nature of the dataset utilized in this work necessitated the adoption of macro estimators, which allowed us to fairly compare classifier performance on unbalanced data with that on balanced data. The corresponding formulas to compute accuracy, precision, recall, and F1-score are listed below. Albeit the weighted, micro, and macro estimators are all calculated in a different manner, however, these estimators compute precision, recall, and F1-score using the same formulas.

\begin{matrix} A c c u r a c y = \frac{T P + T N}{T P + F P + T N + F N} \end{matrix}

(1)

\begin{matrix} P r e c i s i o n = \frac{T P}{T P + F P} \end{matrix}

(2)

\begin{matrix} R e c a l l = \frac{T P}{T P + F N} \end{matrix}

(3)

\begin{matrix} F 1 - s c o r e = 2 \times \frac{p r e c i s i o n \times r e c a l l}{p r e c i s i o n + r e c a l l} \end{matrix}

(4)

where TP (True Positive) and TN (True Negative) are the numbers of correctly predicted positive and negative instances respectively. FP (False Positive) is the count of negative instances that were predicted as positive instances. Similarly, FN (False Negative) is the number of positive instances incorrectly predicted as negative instances. All evaluation metrics are standardized with values ranging from 0 to 1, where 0 represents the lowest score and 1 represents the highest score.

4 Results

We performed two distinct sets of experiments. We were initially, using the original distribution of data samples across each target variable, and later on adopted down-sampled data for model training. The training and test sample distribution for each case is shown in Fig 5.

Fig 5 — Distribution of original and down-sampled train and test samples across each target class (a) Training set of original data, (b) Test set of original data, (c) Training set of down-sampled data, and (d) Test set of down-sampled data.

4.1 Experimental results of machine learning classifiers

First, experiments were carried out on original and down-sampled datasets using individual machine-learning classifiers. In Table 7, we can observe that the accuracy (the degree to which the classifier delivered correct predictions) of RF is the highest among all classifiers in both settings for thyroid detection using machine learning classifiers. The model yielded 98% and 97% accuracy when experimented with original and down-sampled datasets respectively. However, it is important to draw attention to the significant difference in the classifiers’ accuracy, precision, recall, and F1 scores when trained and tested using the original distribution. Nevertheless, with a balanced distribution of data, RF produced more stable results despite achieving higher accuracy on the original dataset. The remaining classifiers, including GBM, AdaBoost, and SVC, obtained accuracy above 92% for balanced data. Hence, it is viable to discern that classifiers perform better overall when the dataset has been balanced. The linear model LR, on the contrary, presented poor performance.

Table 7. Experimental results of machine learning classifiers.

Classifier	Original Data				Down-Sampled Data
Classifier	Accuracy	Precision	Recall	F1	Accuracy	Precision	Recall	F1
RF	0.98	0.97	0.94	0.95	0.97	0.98	0.97	0.97
GBM	0.97	0.92	0.94	0.93	0.95	0.95	0.95	0.95
AdaBoost	0.88	0.72	0.69	0.66	0.94	0.94	0.94	0.94
LR	0.86	0.52	0.40	0.43	0.89	0.88	0.88	0.88
SVC	0.83	0.17	0.20	0.18	0.93	0.93	0.92	0.92

Open in a new tab

Table 8 shows the target class-wise results of machine learning classifiers. It is evident that the classifiers performed poorly for minority classes, such as classes with few training instances, when the target classes’ original distribution, or imbalanced data, was used. In comparison, more consistent results can be seen when the dataset is distributed evenly, or when the data are down-sampled. Because classifiers were able to learn and interpret the features of each class in a balanced manner, the results were more reliable.

Table 8. Target class-wise results of machine learning classifiers.

Classifier	Target Class	Original Data			Down-Sampled Data
Classifier	Target Class	Precision	Recall	F1	Precision	Recall	F1
RF	No Condition	0.98	0.99	0.99	0.91	1.00	0.95
	Hypothyroid	0.98	0.95	0.97	0.98	1.00	0.99
	Increased binding protein	0.96	0.99	0.97	1.00	1.00	1.00
	Compensated hypothyroid	0.96	0.76	0.85	1.00	0.86	0.92
	Concurrent non-thyroidal illness	0.98	0.99	0.98	1.00	0.98	0.99
GBM	No Condition	0.99	0.98	0.98	0.89	0.96	0.92
	Hypothyroid	0.98	0.91	0.94	1.00	1.00	1.00
	Increased binding protein	0.92	0.99	0.95	0.98	1.00	0.99
	Compensated hypothyroid	0.78	0.82	0.80	0.91	0.83	0.87
	Concurrent non-thyroidal illness	0.95	0.99	0.97	0.98	0.94	0.96
AdaBoost	No Condition	0.92	0.94	0.93	0.90	0.92	0.91
	Hypothyroid	1.00	0.61	0.76	1.00	1.00	1.00
	Increased binding protein	0.93	0.99	0.96	0.98	1.00	0.99
	Compensated hypothyroid	0.43	0.85	0.57	0.79	0.89	0.84
	Concurrent non-thyroidal illness	0.33	0.05	0.08	1.00	0.88	0.93
LR	No Condition	0.86	0.99	0.92	0.87	0.92	0.90
	Hypothyroid	0.87	0.77	0.82	0.90	0.93	0.92
	Increased binding protein	0.00	0.00	0.00	0.94	0.85	0.89
	Compensated hypothyroid	0.86	0.25	0.39	0.82	0.80	0.81
	Concurrent non-thyroidal illness	0.00	0.00	0.00	0.88	0.92	0.90
SVC	No Condition	0.83	0.99	0.91	0.85	1.00	0.92
	Hypothyroid	0.00	0.00	0.00	0.91	0.95	0.93
	Increased binding protein	0.00	0.00	0.00	0.96	0.87	0.91
	Compensated hypothyroid	0.00	0.00	0.00	0.94	0.86	0.90
	Concurrent non-thyroidal illness	0.00	0.00	0.00	1.00	0.94	0.97

Open in a new tab

4.2 Experimental results of self-voting classifiers

In this study, we also implement and assess the performance of ensemble classification under self-voting criteria. A voting classifier, as described in Section 3.3, is a collection of various machine learning classifiers that perform prediction using either soft or hard procedures. We ran experiments using the developed self-voting SVEC-H and SVEC-S ensembles of three RF classifiers under hard and soft voting criteria, respectively, to investigate the applicability of self-voting classifiers on the underlying dataset. The experimental results obtained by self-voting classifiers are shown in Table 9. Both self-voting classifiers achieved accuracy values greater than 97%, as shown. The uneven distribution of target classes, however, leads to inconsistent results. Additionally, similar percentages of performance estimators could be seen for both hard and soft voting criteria. Albeit the RF classifier individually achieved 97% accuracy with down-sampled data, it is clear that the self-voting classifier achieves a comparable 97% accuracy score on down-sampled data. Moreover, it is apparent from Table 10 that the classification results of the majority voting (hard) and weighted probability voting (soft) criterion remained similar for each target class.

Table 9. Experimental results of ensemble voting classifiers.

Classifier	Original Data				Down-Sampled Data
Classifier	Accuracy	Precision	Recall	F1	Accuracy	Precision	Recall	F1
SVEC-H	0.98	0.96	0.94	0.95	0.97	0.98	0.96	0.96
SVEC-S	0.98	0.96	0.94	0.95	0.97	0.98	0.96	0.96

Open in a new tab

Table 10. Target class-wise classification results of self-voting ensemble classifiers.

Classifier	Target Class	Original Data			Down-Sampled Data
Classifier	Target Class	Precision	Recall	F1	Precision	Recall	F1
SVEC-H	No Condition	0.99	0.99	0.99	0.89	1.00	0.94
	Hypothyroid	0.98	1.00	0.99	1.00	1.00	1.00
	Increased binding protein	0.96	1.00	0.98	0.98	1.00	0.99
	Compensated hypothyroid	0.96	0.71	0.82	1.00	0.80	0.89
	Concurrent non-thyroidal illness	0.92	0.98	0.95	1.00	1.00	1.00
SVEC-S	No Condition	0.99	0.99	0.99	0.89	1.00	0.94
	Hypothyroid	0.98	1.00	0.99	1.00	1.00	1.00
	Increased binding protein	0.96	1.00	0.98	0.98	1.00	0.99
	Compensated hypothyroid	0.96	0.71	0.82	1.00	0.80	0.89
	Concurrent non-thyroidal illness	0.92	0.98	0.95	1.00	1.00	1.00

Open in a new tab

4.3 Experimental results of deep neural networks

We also evaluated the performance of deep neural networks for thyroid disease detection using the underlying original and balanced distribution of the dataset. Table 11 presents the classification results of deep neural networks. CNN yielded the highest accuracy score of 95% when trained and tested using the original distribution of the dataset. A significant drop in the evaluation estimators can be observed with a balanced dataset distribution. Another pertinent point is the inconsistency between the evaluation estimators such as accuracy precision, recall, and F1-score when the deep neural networks dealt with imbalanced data.

Table 11. Deep learning models results for thyroid disease classification.

Classifier	Original Data				Down-Sampled Data
Classifier	Accuracy	Precision	Recall	F1	Accuracy	Precision	Recall	F1
LSTM	0.92	0.76	0.73	0.73	0.84	0.83	0.84	0.83
CNN	0.95	0.89	0.81	0.84	0.90	0.91	0.90	0.90
CNN-LSTM	0.93	0.84	0.76	0.79	0.79	0.79	0.79	0.79

Open in a new tab

On the contrary, the models yielded lower but consistent and reliable results with balanced data. Fig 6 showcases the performance estimation of deep neural networks with original and downs-sampled data across each epoch. A significant decline in the performance of CNN-LSTM can be observed. Furthermore, CNN demonstrated rather more stability and comparatively improved performance in the classification of target values with each successive epoch. The lower performance results of evaluation estimators are because the deep learning models have the tendency to work better with larger datasets. As the number of data instances reduced with the down-sampling technique, the performance of deep neural networks also degraded. The terms ‘Accuracy’ and ‘Val_Accuracy’ in Figs 6 and 7 correspond to the accuracy metrics computed on the training and validation datasets, respectively. This same concept extends to other evaluation metrics, where ‘val_precision’ represents validation precision, ‘val_recall’ stands for validation recall, and ‘val_F1_Score’ stands for the validation F1 score.

Fig 7 displays the loss of information across each epoch. When balanced data—that is, down-sampled data—was given to deep neural networks, the validation loss decreased overall, as can be shown. The CNN-LSTM loss graph, however, continued to be the highest, demonstrating the model’s poor performance in detecting thyroid disease.

4.4 Experimental results of proposed self-stacking classifier

After experimental verification, it was shown that RF was the superlative classifier for the thyroid disease classification dataset. However, it was believed that more enhanced and reliable results were required. Besides that, as described in Section 4.2, the self-voting technique was found to be comparatively inefficient for identifying thyroid disease. As a result, we explored further into the performance of stacking, another ensemble technique. To do so, we suggested and put into practice a self-stacking classifier (SSC) based on RF. Even though it performed relatively poorly on the original dataset, the proposed SSC model displayed state-of-the-art performance when supplied with balanced data, as demonstrated in Table 12. In terms of diagnosing thyroid disease, the model scored a state-of-the-art 99.5% accuracy score. Other evaluation estimators including precision, recall, and F1-score all showed a perfect score of 100% demonstrating the reliability and effectiveness of the proposed model. Additionally, the SSC model successfully classified each class with the highest degrees of precision, recall, and F1 scores. This also contributes to the robustness of the proposed model.

Table 12. Classification results of proposed SSC model.

Data Distribution	Accuracy	Target	Precision	Recall	F1
Original Data	0.98	No Condition	0.99	0.99	0.99
		Hypothyroid	0.98	0.93	0.95
		Increased binding protein	0.96	0.99	0.97
		Compensated hypothyroid	0.90	0.85	0.87
		Concurrent non-thyroidal illness	0.96	0.99	0.97
		Macro Average	0.95	0.95	0.95
Down-Sampled Data	0.995	No Condition	1.00	0.98	0.99
		Hypothyroid	1.00	1.00	1.00
		Increased binding protein	1.00	1.00	1.00
		Compensated hypothyroid	1.00	1.00	1.00
		Concurrent non-thyroidal illness	0.98	1.00	0.99
		Macro Average	1.00	1.00	1.00

Open in a new tab

5 Comparative analysis of experimental results

In this section, we compare and discuss the extensive experimentation results to further demonstrate the robustness of the proposed SSC model. As evaluation criteria for the comparative study, we used confusion matrices. The main diagonal of the confusion matrix displays the number of correctly predicted instances in correspondence to each associated class. In the confusion matrix the labels “0,1,2,3, and 4” indicate “no condition”, hypothyroid, increased binding protein, compensated hypothyroid, and concurrent non-thyroidal illness classes respectively. Figs 8 and 9 show the confusion matrix for both machine learning and deep learning models on the same size of test data. We have done separate data splitting (Training and Testing sets) for machine learning and deep learning models. When splitting the dataset, we used the shuffle parameter which changes the number of instances for each target class for both machine and deep learning models but in both cases, the total number of test samples is the same which is 230. We did splitting separately for deep learning models because the last layer of deep learning has 5 neurons. For that, we have to feed target data after converting it into five variables and we split data again after the conversion. We use shuffle parameters in the train test split to change the count for each category in the test dataset.

Fig 8 — (a) RF, (b) GBM, (c) AdaBoost, (d) LR, (e) SVC, (f) SVEC-H, (g) SVEC-S, (h) CNN, (i) LSTM, (j) CNN-LSTM, and (k) SSC.

Fig 9 — (a) RF, (b) GBM, (c) AdaBoost, (d) LR, (e) SVC, (f) SVEC-H, (g) SVEC-S, (h) CNN, (i) LSTM, (j) CNN-LSTM, and (k) SSC.

The “no condition” target class makes up over 80% of the original dataset; in contrast, thyroid disease classes make up only 20% of the remaining dataset, making them the minority class. The models have the tendency to over-fit on the majority class consequently, performing poorly on minority classes. The visual representation of the classifier’s performance on imbalanced data is shown in Fig 8 through their respective confusion matrix. In contrast to the minority classes, it is evident that the class with the majority of the training samples was classified correctly. The model’s RF and SSC produced the lowest percentage of test instances that were erroneously predicted, with RF maintaining the highest percentage in the case of unbalanced data with only 29 inaccurate predictions. However, SSC delivered 33 incorrect predictions. Despite this, each classifier experienced a decline in the number of incorrect predictions with the balanced dataset, with the proposed SSC coming out on top with only one incorrect prediction. Moreover, classifiers recognized the minority samples more correctly. This shows the efficacy of the proposed model. Fig 10 presents a bar chart comparing the classifier’s performance in identifying thyroid disease from an imbalanced and balanced dataset. This also displays the superior performance of the proposed SSC when provided with balanced data.

We employed various methods such as 10-fold cross-validation, standard deviation (SD), and time measurement to further compare, validate, and generalize the performance of the proposed classifier. The results of all machine learning classifiers with original and down-sampled data are presented in Table 13. The proposed SSC model outperformed other models with an accuracy of 0.99 and an SD of 0.01. Although SSC took longer to execute than some classifiers such as RF, GBM, AdaBoost, LR, and SVEC-S, the proposed model’s accuracy has significantly improved. Therefore, despite the time difference, we can compromise it because the model generates highly accurate results, which is essential for medical diagnosis.

Table 13. 10-fold cross-validation results of classifiers.

Classifier	Original Data			Down-Sampled Data
Classifier	Accuracy	SD	Time (sec)	Accuracy	SD	Time (sec)
RF	0.97	+/-0.01	0.945	0.96	+/-0.06	0.369
GBM	0.97	+/-0.01	1.445	0.93	+/-0.11	0.439
AdaBoost	0.58	+/-0.28	1.873	0.92	+/-0.12	0.612
LR	0.85	+/-0.01	0.598	0.88	+/-0.07	0.084
SVC	0.86	+/-0.01	11.423	0.91	+/-0.11	10.922
SVEC-H	0.98	+/-0.01	2.732	0.96	+/-0.08	1.567
SVEC-S	0.98	+/-0.01	3.181	0.96	+/-0.08	0.972
Proposed SSC	0.98	+/-0.01	3.012	0.99	+/-0.01	3.705

Open in a new tab

Figs 11 & 12, illustrates SD values for different classifiers and provides insights into the variability of accuracy scores when applied to both original and down-sampled data. It also shows the error rate lower and upper bounds. These values indicate the stability of classifier performance across multiple runs or datasets, with lower values suggesting more consistent results.

Finally, by comparing our results with those of other studies, we can see that, for the cited [35], the optimum results were attained by implementing machine learning-based feature selection and RF-based classification. However, in our research, balancing the dataset and employing SSC for classification contributed to the maximum performance. We also deployed the designed approaches cited in [19, 28, 48, 49] on our dataset to signify the performance of the proposed approach. Table 14 provides a detailed comparison of the results that illustrates the significance of the proposed SSC model. Given that we employed all features for the model training, SSC took longer to execute. Despite this, there is a significant difference in the cross-validation results of SSC and MLFS + RF proposed by Chaganti et al. [35]. This demonstrates better generalizability of the proposed re-sampling technique with SSC.

Table 14. Performance comparison with previous study.

Study	Approach	Train-Test Method				10-Fold
Study	Approach	Accuracy	Precision	Recall	F1	Accuracy	SD
[35]	MLFS + RF	0.99	0.99	0.99	0.99	0.94	± 1.68
[19]	RF	0.98	N/A	N/A	0.98	N/A	N/A
[49]	CNN + SGLV	0.96	N/A	N/A	0.96	N/A	N/A
[28]	DT	0.98	N/A	N/A	0.97	N/A	N/A
[48]	DT	0.93	N/A	N/A	0.93	N/A	N/A
Our	Down-sampling + SSC	0.995	1.00	1.00	1.00	0.99	±0.01

Open in a new tab

5.1 Statistical analysis

In this research, statistical analysis was performed using the T-test to compare the results of the proposed approach with other techniques. The T-test is a method to determine the statistical difference between two populations [50]. The aim was to investigate the statistical significance of the proposed approach in comparison to others. The null hypothesis is rejected by the T-test if the proposed approach is statistically significant. We set a p-value of 0.05 for all scenarios. The degree of freedom in each case is 7 and the results showed that the null hypothesis was rejected in all cases, as the P-value was less than the T scores. Therefore, the proposed approach was found to be statistically significant compared to others, as presented in Table 15.

Table 15. Statistical T-test results.

Approach	T-Statistics
RF Vs SSC	1.896
GBM Vs SSC	5.230
ADA Vs SSC	3.860
LR Vs SSC	3.666
SVC Vs SSC	2.759
LSTM Vs SSC	9.431
CNN Vs SSC	7.768
CNN-LSTM Vs SSC	8.038

Open in a new tab

To determine the statistical power of our test, we utilized an online power calculator [51]. Our sample size is 16, Cohen’s effect size is 0.8, and the significance level is set at 0.05. The sample population consists of a normal distribution. The test results provide a statistical power of 0.169, with critical values of -1.96 and 1.96 based on the standard normal distribution, where the standard distribution shift is 2.263.

6 Conclusions

Medical expert systems for thyroid disease diagnosis have seen tremendous growth, and the systems currently available are sophisticated enough to be employed in practice, enhancing and improving patient care. The thyroid is particularly difficult to diagnose as its symptoms are often mistaken for those of other conditions. Early diagnosis of thyroid disease can prevent mishaps.

Machine learning models produce skewed results when provided with imbalanced datasets. Thyroid disease diagnosis requires an effective, accurate, and reliable system, and current diagnostic tools focus on reducing dataset dimensionality, ignoring the distribution of target classes. This study proposed a system that down-samples the dataset to obtain a balanced distribution of samples, classified using an RF-based self-stacking classifier (SSC). Experimental results show the proposed model’s superior performance, achieving 99.5% accuracy, 100% macro precision, recall, and F1-score. The proposed classifier’s performance was compared to several other machine learning, deep learning, and self-voting classifiers, demonstrating the reliability of the proposed approach through 10-fold cross-validation. The performance of SSC was compared to the existing medical diagnosis system, which corroborated the significance and superiority of this study. The following conclusions were drawn from the research:

Balancing the dataset enables the classifiers to perform well with minority classes.
Down-sampling substantially increases recall, resulting in a decline in the FP to TP ratio, indicating successful identification of minority data.
For highly imbalanced datasets, RF can be used for best performance.
RF-based self-stacking ensemble outperformed RF-based self-voting ensemble classifiers regarding the balanced dataset.

A limitation of this study is that each target class was down-sampled to 230 data instances, which we aim to address in ongoing research by experimenting with a more extensive distribution of data samples across each target variable.

Data Availability

https://archive.ics.uci.edu/ml/datasets/thyroid+disease.

Funding Statement

First-class 431 Course Construction Based on Online and Offline Xi´an University of Finance and Economics (Object-Oriented Technology and Programming) [2020].

References

1. Ioniţă I, Ioniţă L. Prediction of thyroid disease using data mining techniques. BRAIN Broad Research in Artificial Intelligence and Neuroscience. 2016;7(3):115–124. [Google Scholar]
2.Cornell W. Understanding thyroid problems … disease; 2022. https://weillcornell.org/news/understanding-thyroid-problems-disease.
3. Monsalve-Torra A, Ruiz-Fernandez D, Marin-Alonso O, Soriano-Payá A, Camacho-Mackenzie J, Carreño-Jaimes M. Using machine learning methods for predicting inhospital mortality in patients undergoing open repair of abdominal aortic aneurysm. Journal of biomedical informatics. 2016;62:195–201. doi: 10.1016/j.jbi.2016.07.007 [DOI] [PubMed] [Google Scholar]
4.Raisinghani S, Shamdasani R, Motwani M, Bahreja A, Raghavan Nair Lalitha P. Thyroid prediction using machine learning techniques. In: International Conference on Advances in Computing and Data Sciences. Springer; 2019. p. 140–150.
5.Tyagi A, Mehra R, Saxena A. Interactive thyroid disease prediction system using machine learning technique. In: 2018 Fifth international conference on parallel, distributed and grid computing (PDGC). IEEE; 2018. p. 689–693.
6.Shahid AH, Singh MP, Raj RK, Suman R, Jawaid D, Alam M. A Study on Label TSH, T3, T4U, TT4, FTI in Hyperthyroidism and Hypothyroidism using Machine Learning Techniques. In: 2019 International Conference on Communication and Electronics Systems (ICCES). IEEE; 2019. p. 930–933.
7. Yadav DC, Pal S. Prediction of thyroid disease using decision tree ensemble method. Human-Intelligent Systems Integration. 2020;2(1):89–95. doi: 10.1007/s42454-020-00006-y [DOI] [Google Scholar]
8. Raghuraman M, Sailatha E, Gunasekaran S. Efficient thyroid disease prediction and comparative study using machine learning algorithms. International Journal of Information and Computing Science. 2019;6(6):617–624. [Google Scholar]
9. Abbad Ur Rehman H, Lin CY, Mushtaq Z, Su SF. Performance analysis of machine learning algorithms for thyroid disease. Arabian Journal for Science and Engineering. 2021;46(10):9437–9449. doi: 10.1007/s13369-020-05206-x [DOI] [Google Scholar]
10. Shivastuti HK, Manhas J, Sharma V. Performance Evaluation of SVM and Random Forest for the Diagnosis of Thyroid Disorder. Int J Res Appl Sci Eng Technol. 2021;9:945–947. doi: 10.22214/ijraset.2021.34381 [DOI] [Google Scholar]
11. Iglesias JA, Ledezma A, Sanchis A. An ensemble method based on evolving classifiers: eStacking. In: 2014 IEEE Symposium on Evolving and Autonomous Learning Systems (EALS). IEEE; 2014. p. 124–131. [Google Scholar]
12. Jiang W, Chen Z, Xiang Y, Shao D, Ma L, Zhang J. SSEM: A novel self-adaptive stacking ensemble model for classification. IEEE Access. 2019;7:120337–120349. doi: 10.1109/ACCESS.2019.2933262 [DOI] [Google Scholar]
13. Chen C, Zhang Q, Yu B, Yu Z, Lawrence PJ, Ma Q, et al. Improving protein-protein interactions prediction accuracy using XGBoost feature selection and stacked ensemble classifier. Computers in Biology and Medicine. 2020;123:103899. doi: 10.1016/j.compbiomed.2020.103899 [DOI] [PubMed] [Google Scholar]
14. Alam MZ, Rahman MS, Rahman MS. A Random Forest based predictor for medical data classification using feature ranking. Informatics in Medicine Unlocked. 2019;15:100180. doi: 10.1016/j.imu.2019.100180 [DOI] [Google Scholar]
15. Yang F, Wang Hz, Mi H, Lin Cd, Cai Ww. Using random forest for reliable classification and cost-sensitive learning for medical diagnosis. BMC bioinformatics. 2009;10(1):1–14. doi: 10.1186/1471-2105-10-S1-S22 [DOI] [PMC free article] [PubMed] [Google Scholar]
16.Shaik AB, Srinivasan S. A brief survey on random forest ensembles in classification model. In: International Conference on Innovative Computing and Communications. Springer; 2019. p. 253–260.
17. Saad E, Sadiq S, Jamil R, Rustam F, Mehmood A, Choi GS, et al. Novel extreme regression-voting classifier to predict death risk in vaccinated people using VAERS data. Plos one. 2022;17(6):e0270327. doi: 10.1371/journal.pone.0270327 [DOI] [PMC free article] [PubMed] [Google Scholar]
18. Mishra S, Tadesse Y, Dash A, Jena L, Ranjan P. Thyroid disorder analysis using random forest classifier. In: Intelligent and cloud computing. Springer; 2021. p. 385–390. [Google Scholar]
19. Alyas T, Hamid M, Alissa K, Faiz T, Tabassum N, Ahmad A. Empirical Method for Thyroid Disease Classification Using a Machine Learning Approach. BioMed Research International. 2022;2022. doi: 10.1155/2022/9809932 [DOI] [PMC free article] [PubMed] [Google Scholar] [Retracted]
20. Razia S, Prathyusha PS, Krishna NV, Sumana NS. A Comparative study of machine learning algorithms on thyroid disease prediction. Int J Eng Technol. 2018;7:315–319. doi: 10.14419/ijet.v7i2.8.10432 [DOI] [Google Scholar]
21. Garcia de Lomana M, Weber AG, Birk B, Landsiedel R, Achenbach J, Schleifer KJ, et al. In silico models to predict the perturbation of molecular initiating events related to thyroid hormone homeostasis. Chemical research in toxicology. 2020;34(2):396–411. doi: 10.1021/acs.chemrestox.0c00304 [DOI] [PMC free article] [PubMed] [Google Scholar]
22. Li Y, Song J. The difference of the quantitative parameters using kinetic compartment model in Salivary gland scintigraphy in two kinds of people: hyperthyroidism and euthyroidism; 2018. [Google Scholar]
23. Rayman MP. Multiple nutritional factors and thyroid disease, with particular reference to autoimmune thyroid disease. Proceedings of the Nutrition Society. 2019;78(1):34–44. doi: 10.1017/S0029665118001192 [DOI] [PubMed] [Google Scholar]
24. Biondi B, Cooper DS. Thyroid hormone therapy for hypothyroidism. Endocrine. 2019;66(1):18–26. doi: 10.1007/s12020-019-02023-7 [DOI] [PubMed] [Google Scholar]
25. Yang P, Pi Y, He T, Sun J, Wei J, Xiang Y, et al. Automatic differentiation of thyroid scintigram by deep convolutional neural network: a dual center study. BMC Medical Imaging. 2021;21(1):1–9. doi: 10.1186/s12880-021-00710-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
26. Umar Sidiq D, Aaqib SM, Khan RA. Diagnosis of various thyroid ailments using data mining classification techniques. Int J Sci Res Coput Sci Inf Technol. 2019;5:131–6. [Google Scholar]
27. Chaubey G, Bisen D, Arjaria S, Yadav V. Thyroid disease prediction using machine learning approaches. National Academy Science Letters. 2021;44(3):233–238. doi: 10.1007/s40009-020-00979-z [DOI] [Google Scholar]
28. Jha R, Bhattacharjee V, Mustafi A. Increasing the Prediction Accuracy for Thyroid Disease: A Step Towards Better Health for Society. Wireless Personal Communications. 2022;122(2):1921–1938. doi: 10.1007/s11277-021-08974-3 [DOI] [Google Scholar]
29.Sonuç E, et al. Thyroid Disease Classification Using Machine Learning Algorithms. In: Journal of Physics: Conference Series. 1. IOP Publishing; 2021. p. 012140.
30. Akhtar T, Gilani SO, Mushtaq Z, Arif S, Jamil M, Ayaz Y, et al. Effective Voting Ensemble of Homogenous Ensembling with Multiple Attribute-Selection Approaches for Improved Identification of Thyroid Disorder. Electronics. 2021;10(23):3026. doi: 10.3390/electronics10233026 [DOI] [Google Scholar]
31.Akhtar T, Arif S, Mushtaq Z, Gilani SO, Jamil M, Ayaz Y, et al. Ensemble-based Effective Diagnosis of Thyroid Disorder with Various Feature Selection Techniques. In: 2022 2nd International Conference of Smart Systems and Emerging Technologies (SMARTTECH). IEEE; 2022. p. 14–19.
32. Li Q, Zhao C, He X, Chen K, Wang R. The Impact of Partial Balance of Imbalanced Dataset on Classification Performance. Electronics. 2022;11(9):1322. doi: 10.3390/electronics11091322 [DOI] [Google Scholar]
33. Hossain T, Mauni HZ, Rab R. Reducing the Effect of Imbalance in Text Classification Using SVD and GloVe with Ensemble and Deep Learning. Computing and Informatics. 2022;41(1):98–115. doi: 10.31577/cai_2022_1_98 [DOI] [Google Scholar]
34.Li X, Wen L, Deng Y, Feng F, Hu X, Wang L, et al. Graph neural network with curriculum learning for imbalanced node classification. arXiv preprint arXiv:220202529. 2022;.
35. Chaganti R, Rustam F, De La Torre Díez I, Mazón J, Rodríguez C, Ashraf I. Thyroid Disease Prediction Using Selective Features and Machine Learning Techniques. Cancers 2022, 14, 3914; 2022. doi: 10.3390/cancers14163914 [DOI] [PMC free article] [PubMed] [Google Scholar]
36.Quinlan R. UCI Machine Learning Repository: Thyroid Disease Data Set; 2022. https://archive.ics.uci.edu/ml/datasets/thyroid+disease.
37. Saad E, Sadiq S, Jamil R, Rustam F, Mehmood A, Choi GS, et al. Predicting death risk analysis in fully vaccinated people using novel extreme regression-voting classifier. Digital Health. 2022;8:20552076221109530. doi: 10.1177/20552076221109530 [DOI] [PMC free article] [PubMed] [Google Scholar]
38. Bagui S, Li K. Resampling imbalanced data for network intrusion detection datasets. Journal of Big Data. 2021;8(1):1–41. doi: 10.1186/s40537-020-00390-x33425651 [DOI] [Google Scholar]
39. Azar AT, Elshazly HI, Hassanien AE, Elkorany AM. A random forest classifier for lymph diseases. Computer methods and programs in biomedicine. 2014;113(2):465–473. doi: 10.1016/j.cmpb.2013.11.004 [DOI] [PubMed] [Google Scholar]
40. Rupapara V, Rustam F, Aljedaani W, Shahzad HF, Lee E, Ashraf I. Blood cancer prediction using leukemia microarray gene data and hybrid logistic vector trees model. Scientific Reports. 2022;12(1):1–15. doi: 10.1038/s41598-022-04835-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
41. Siddiqui HUR, Saleem AA, Brown R, Bademci B, Lee E, Rustam F, et al. Non-invasive driver drowsiness detection system. Sensors. 2021;21(14):4833. doi: 10.3390/s21144833 [DOI] [PMC free article] [PubMed] [Google Scholar]
42. George A, Ravindran A, Mendieta M, Tabkhi H. Mez: An adaptive messaging system for latency-sensitive multi-camera machine vision at the IoT edge. IEEE Access. 2021;9:21457–21473. doi: 10.1109/ACCESS.2021.3055775 [DOI] [Google Scholar]
43. George A. Distributed messaging system for the IoT edge. The University of North Carolina at Charlotte; 2020. [Google Scholar]
44. Daghriri T, Rustam F, Aljedaani W, Bashiri AH, Ashraf I. Electroencephalogram Signals for Detecting Confused Students in Online Education Platforms with Probability-Based Features. Electronics. 2022;11(18):2855. doi: 10.3390/electronics11182855 [DOI] [Google Scholar]
45. Rafique A, Khan MS, Jamal MH, Tasadduq M, Rustam F, Lee E, et al. Integrating learning analytics and collaborative learning for improving student’s academic performance. IEEE Access. 2021;9:167812–167826. doi: 10.1109/ACCESS.2021.3135309 [DOI] [Google Scholar]
46. Shafique R, Rustam F, Choi GS, Díez IdlT, Mahmood A, Lipari V, et al. Breast Cancer Prediction Using Fine Needle Aspiration Features and Upsampling with Supervised Machine Learning. Cancers. 2023;15(3):681. doi: 10.3390/cancers15030681 [DOI] [PMC free article] [PubMed] [Google Scholar]
47. Siddiqui HUR, Saleem AA, Bashir I, Zafar K, Rustam F, Diez IdlT, et al. Respiration-Based COPD Detection Using UWB Radar Incorporation with Machine Learning. Electronics. 2022;11(18):2875. doi: 10.3390/electronics11182875 [DOI] [Google Scholar]
48. Sankar S, Potti A, Chandrika GN, Ramasubbareddy S. Thyroid Disease Prediction Using XGBoost Algorithms. Journal of Mobile Multimedia. 2022;18(3):1–18. [Google Scholar]
49. Rustam F, Ishaq A, Munir K, Almutairi M, Aslam N, Ashraf I. Incorporating CNN Features for Optimizing Performance of Ensemble Classifier for Cardiovascular Disease Prediction. Diagnostics. 2022;12(6):1474. doi: 10.3390/diagnostics12061474 [DOI] [PMC free article] [PubMed] [Google Scholar]
50. Rustam F, Ashraf I, Mehmood A, Ullah S, Choi GS. Tweets classification on the base of sentiments for US airline companies. Entropy. 2019;21(11):1078. doi: 10.3390/e21111078 [DOI] [Google Scholar]
51.Kingdom S. Statistical Power Calculator; 2023. https://www.statskingdom.com/32test_power_t_z.html.

PLoS One. doi: 10.1371/journal.pone.0295501.r001

Decision Letter 0

Maciej Huk

8 Feb 2023

PONE-D-22-32955SSC: The Novel Self-Stack Ensemble Model for Thyroid Disease PredictionPLOS ONE

Dear Dr. Ji,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. Please submit your revised manuscript by Mar 25 2023 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Maciej Huk, Ph.D.

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Thank you for stating the following financial disclosure:

"No"

At this time, please address the following queries:

a) Please clarify the sources of funding (financial or material support) for your study. List the grants or organizations that supported your study, including funding received from your institution.

b) State what role the funders took in the study. If the funders had no role in your study, please state: “The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.”

c) If any authors received a salary from any of your funders, please state which authors and which funders.

d) If you did not receive any funding for this study, please state: “The authors received no specific funding for this work.”

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

3. Thank you for stating the following in your Competing Interests section:

"NO authors have competing interests"

Please complete your Competing Interests on the online submission form to state any Competing Interests. If you have no competing interests, please state "The authors have declared that no competing interests exist.", as detailed online in our guide for authors at http://journals.plos.org/plosone/s/submit-now

This information should be included in your cover letter; we will change the online submission form on your behalf.

4. In your Data Availability statement, you have not specified where the minimal data set underlying the results described in your manuscript can be found. PLOS defines a study's minimal data set as the underlying data used to reach the conclusions drawn in the manuscript and any additional data required to replicate the reported study findings in their entirety. All PLOS journals require that the minimal data set be made fully available. For more information about our data policy, please see http://journals.plos.org/plosone/s/data-availability.

"Upon re-submitting your revised manuscript, please upload your study’s minimal underlying data set as either Supporting Information files or to a stable, public repository and include the relevant URLs, DOIs, or accession numbers within your revised cover letter. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories. Any potentially identifying patient information must be fully anonymized.

Important: If there are ethical or legal restrictions to sharing your data publicly, please explain these restrictions in detail. Please see our guidelines for more information on what we consider unacceptable restrictions to publicly sharing data: http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions. Note that it is not acceptable for the authors to be the sole named individuals responsible for ensuring data access.

We will update your Data Availability statement to reflect the information you provide in your cover letter.

5. PLOS requires an ORCID iD for the corresponding author in Editorial Manager on papers submitted after December 6th, 2016. Please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field. This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager. Please see the following video for instructions on linking an ORCID iD to your Editorial Manager account: https://www.youtube.com/watch?v=_xcclfuvtxQ

6. Please ensure that you refer to Figure 3 and 9 in your text as, if accepted, production will need this reference to link the reader to the figure.

7. We note you have included a table to which you do not refer in the text of your manuscript. Please ensure that you refer to Table 11 in your text; if accepted, production will need this reference to link the reader to the Table.

8. Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Additional Editor Comments:In particular three minor corrections are needed:

It should be cleared why in figure 9, the sum of the numbers of each confusion matrix is different. For example, the sums of the first lines in the subgraphs (a)(b)(c) is 51, while the first lines of (e)(f) and (j) add up to only 35. This should not occur if the same test set was used in each case.
It should be discussed if RF, SVC and LR models were trained with optimal values of parameters.
Algorithms should be presented with proper indentation.

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The author uses the SSC algorithm to train and test on the open data set, and compares it with a variety of machine learning models. Please check the following questions:

1. In figure 9, why the sum of the numbers of each confusion matrix is different, for example, the sum of the first line of RF in the subgraph (a) is 51, while the first line of (j) CNN-LSTM adds up to only 35. This problem should not occur in the same test suite.

2. When training models such as 2.RF, SVC and LR, are the optimal parameters selected? if not, what are the considerations? Similarly, what hyper-parameters CNN and LSTM choose and what models they choose need to be supplemented and clarified.

Reviewer #2: Machine learning algorithms cannot properly identify and recognize minority data, and thus disease, because they were trained on unbalanced thyroid disease data. One approach to solving this issue is resampling, which adjusts the ratio between the target classes to fairly balance the data.

This paper proposes a novel RF-based self-stacking classifier for effective thyroid illness diagnosis is also presented. They use a downsampling strategy to attain a balanced distribution of target classes.

Their study uses the UCI thyroid illness dataset, which has 9172 samples, 30 characteristics, and a very unbalanced target class distribution.

They achieve 99.5% accuracy, and their suggested method may identify primary hypothyroidism, elevated binding protein, compensated hypothyroidism, and concurrent non-thyroidal disease. The proposed model produced 100% macro precision and 100% macro recall, demonstrating cutting-edge performance. The suggested model demonstrated state-of-the-art performance by producing 100% macro precision, 100% macro recall, and 100% macro f1-score. A full comparative analysis is carried out to show the practicality of the suggested approach. Numerous deep neural networks, ensemble voting classifiers, and machine learning classifiers were used to achieve this. K-fold cross-validation results demonstrate the effectiveness of the suggested self-stacking classifier.

Overall this is an interesting paper: well-written and well-motivated to an important problem of applied ML. The author compares their methods with several other standard ML methods. The authors should indent the presentation of the Algorithms properly and add the descriptions of S and C in the caption.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2024 Jan 3;19(1):e0295501. doi: 10.1371/journal.pone.0295501.r002

Author response to Decision Letter 0

16 Feb 2023

Editor Comments:

• Concern 1: It should be cleared why in figure 9, the sum of the numbers of each confusion matrix is different. For example, the sums of the first lines in the subgraphs (a)(b)(c) is 51, while the first lines of (e)(f) and (j) add up to only 35. This should not occur if the same test set was used in each case.

Response: We would like to thank the editor for the valuable comments to improve the quality of the manuscript. You rise a good point and help us to mention it in the manuscript for clarity. We have done separate data splitting (Training and Testing sets) for machine learning deep learning models. When splitting the dataset, we used the shuffle parameter which changes the number of instances for each target class but in both cases, the total numbers of test samples are the same which is 230. We did splitting separately for deep learning models because you can see the last layer of deep learning have 5 neurons so we have to feed target data after converting it into five variables so we did splitting again after the conversion of data for deep learning models and shuffle parameters change the count of for each category in the test dataset.

• Concern 2: It should be discussed if RF, SVC and LR models were trained with optimal values of parameters.

Response: We would like to thank the reviewer for the valuable comments to improve the quality of the manuscript. We select the best hyperparameter setting by tuning the models on specific value ranges. We added the tuning range of the model's hyperparameters value to discuss the selection of optimal values of parameters for models.

• Concern 3: Algorithms should be presented with proper indentation.

Response: We make changes in the manuscript according to the reviewer's suggestion. We make corrections in algorithms indentation.

Reviewer #1:

Concern 1: In figure 9, why the sum of the numbers of each confusion matrix is different, for example, the sum of the first line of RF in the subgraph (a) is 51, while the first line of (j) CNN-LSTM adds up to only 35. This problem should not occur in the same test suite.

Response: We would like to thank the reviewer for the valuable comments to improve the quality of the manuscript. You rise a good point and help us to mention it in the manuscript for clarity. We have done separate splitting for machine learning deep learning models. When splitting the dataset we used the shuffle parameter which changes the number of instances for target classes but in both cases, the total numbers of test samples are the same which is 230. We did splitting separately for deep learning models because you can see the last layer of deep learning have 5 neurons so we have to feed target data after converting it into five variables so we did splitting again after the conversion of data for deep learning models and shuffle parameters change the count of for each category in the test dataset.

Concern 2: When training models such as 2.RF, SVC and LR, are the optimal parameters selected? if not, what are the considerations? Similarly, what hyper-parameters CNN and LSTM choose and what models they choose need to be supplemented and clarified.

While for LSTM and CNN models architecture and hyperparameters setting we added in Table 6. This hyperparameter setting is according to the literature. We study the researchers who worked on the same kinds of datasets and we used the same kind of state-of-the-art architectures to achieve significant accuracy.

Reviewer #2:

Concern 1: Overall this is an interesting paper: well-written and well-motivated to an important problem of applied ML. The author compares their methods with several other standard ML methods. The authors should indent the presentation of the Algorithms properly and add the descriptions of S and C in the caption.

Response: We would like to thank the reviewer for the valuable comments to improve the quality of the manuscript. We add the proper indent in the presentation of the Algorithms and also describe the variables in the algorithms.

Attachment

Submitted filename: SSC paper response.docx

Click here for additional data file.^{(145.5KB, docx)}

PLoS One. doi: 10.1371/journal.pone.0295501.r003

Decision Letter 1

Maciej Huk

11 Apr 2023

PONE-D-22-32955R1SSC: The Novel Self-Stack Ensemble Model for Thyroid Disease PredictionPLOS ONE

Dear Dr. Ji,

Thank you for submitting your manuscript to PLOS ONE. It was analyzed by three Reviewers including me as an Academic Editor (Reviewer #4). After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. In particular:

statistical analysis of results should be presented,
language and presentation problems should be removed,
experimental setup should be better justified or modified.

Please submit your revised manuscript by May 26 2023 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

We look forward to receiving your revised manuscript.

Kind regards,

Maciej Huk, Ph.D.

Academic Editor

PLOS ONE

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #3: (No Response)

Reviewer #4: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: Yes

Reviewer #3: Yes

Reviewer #4: Partly

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #3: Yes

Reviewer #4: No

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #1: Yes

Reviewer #3: Yes

Reviewer #4: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes

Reviewer #3: Yes

Reviewer #4: Yes

**********

6. Review Comments to the Author

Reviewer #1: The performance of these models looks good, but an important point, how can it be explained? That is, the order of importance of each feature, such as shap. I strongly suggest adding, thank you.

Reviewer #3: The proposed Self Stacking classifier must be clarified with a figure. Also, there are others authors whose have used the same approach https://link.springer.com/article/10.1007/s12065-023-00824-4 using breast cancer. I think the model is not novel but the thyroid disease prediction model application is. Thus, the authors should redefine the manuscript accordingly with these ideas.

Reviewer #4: >>> 1. Language problems:

1.1 "i.e RF" => "i.e. RF" (x2)

1.2 "f1-score" => "F1-score"

1.3 "Where TP" => "where TP " (after eq. (4))

1.4 "Competing Intereset" => "Competing Interests"

1.5 "at the iot edge" => "at the IoT edge" (ref. 42)

>>> 2. Presentation problems:

2.1 Fig 1. Within stacked ensemble model - Model1 includes Model1?

2.2 Fig 1. "Trained Model" block is connected with "Model" block in "Test set" part.

The meaning of this connection is unclear. Is stacked ensemble model influencing the "Model"?

Maybe the "Trained Model" block should be connected to "Model Evaluation" block?

2.3 Table 3. Title is too general. Please at least specify dataset.

2.4 While Table 4 is presented then are Authors sure that Fig. 2 is needed or useful?

Both present the same very simple relation and fact.

2.5 Table 11. Title is too general. Please specify what neural networks are characterized within the table. Maybe Authors want to write "Classification results of example considered deep neural networks"

2.6 Fig. 2.5 and Fig. 2.6: Title includes a lot of repeated fragments what makes it very long - please reorganize to remove repeated fragments.

2.7 Table 13: units are not presented in the header of the table (time in seconds?)

2.8 Error bars on charts are not shown on charts. Table 13: time is given without confidence intervals (or standard dev.).

>>> 3. Other problems:

3.1 Why the plain model used for comparison is built with only 20% of data? This is suggested by Fig. 1 and seems to be unfair - ensemble and non-ensemble models before comparison should be built in the same experimental environment.

3.3 Sums of the same rows (the same classes) are not equal between matrices in Fig 7. The same problem with Fig 6.

3.4 It is hard to understand the meaning of "10-fold" column in Table 14. In fact, all considered measures should be calculated with use of the k-fold cross validation.

3.5 Statistical analysis of results is not presented.

3.6 It would be good to make software prepared for experimentation fully available for the Readers.

>>> Summary:

Manuscript includes a lot of presentation and other problems. Especially serious are those with experimental setup design (remark 3.1) and statistical analysis of presented results (remarks 3.4 and 3.5). Those errors should be fixed before publication.

Recommendation: major rework

===EOT===

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #3: Yes: Marlon Santiago Viñán Ludeña

Reviewer #4: No

**********

PLoS One. 2024 Jan 3;19(1):e0295501. doi: 10.1371/journal.pone.0295501.r004

Author response to Decision Letter 1

18 Apr 2023

Editor Comments:

In particular:

• statistical analysis of results should be presented,

• language and presentation problems should be removed,

• experimental setup should be better justified or modified.

Response: We would like to thank the reviewer for giving us a chance to work on the weakness in the manuscript. We work on all comments such as we added the statistical analysis, and we have done work on language and presentation of data in the manuscript. We also explain the experimental setup in more detail in the updated manuscript.

Statistical Analysis: In this research, statistical analysis was performed using the T-test to compare the results of the proposed approach with other techniques. The T-test is a method to determine the statistical difference between two populations [51]. The aim was to investigate the statistical significance of the proposed approach in comparison to others. The null hypothesis is rejected by the T-test if the proposed approach is statistically significant. The rejection is based on the P-value and alpha value, which in this study was set at 0.5 for all scenarios. The results showed that the null hypothesis was rejected in all cases, as the P-value was less than the alpha value. Therefore, the proposed approach was found to be statistically significant compared to others, as presented in Table 15.

The remaining changes related to language and presentation issues can be found in the updated manuscript.

Reviewer #1:

Concern 1: The performance of these models looks good, but an important point, how can it be explained? That is, the order of importance of each feature, such as shap. I strongly suggest adding, thank you.

Response: We would like to thank the reviewer for the valuable comment to improve the quality of the manuscript. We added the feature importance score in the paper which show why the machine learning model's performance is good after downsampling.

In Figure 2, the feature importance scores are displayed for both the original dataset and the down-sampled dataset. These scores were obtained using the Extra Trees Classifier (ETC), which takes all features and the target into account for both scenarios. The ETC is a tree-based classifier that determines the importance of each feature by calculating its entropy criterion. The feature importance scores for both scenarios are highlighted in Figures 2a and 2b. It is observed that there is only a minor difference between the feature importance scores for both scenarios and the range of feature importance scores has increased after balancing the data which helps to improve the performance of learning models in this study.

Reviewer #3:

Concern 1: The proposed Self Stacking classifier must be clarified with a figure. Also, there are others authors whose have used the same approach https://link.springer.com/article/10.1007/s12065-023-00824-4 using breast cancer. I think the model is not novel but the thyroid disease prediction model application is. Thus, the authors should redefine the manuscript accordingly with these ideas.

Response: We would like to thank the reviewer for raising this concern and giving us a chance to elaborate on the difference. The mentioned study worked on the staked classifier but there are all based learners are different models while we give it the name self-learner because all base models are RF and even the meta-model is also the same model RF which means RF is learning from itself through base models which make it self learner. While the mentioned study called used the SELF keyword because they used staked ensemble learning framework (SELF). So there is clear a difference between our approach and their approach.

Reviewer #4:

Concern 1: >>> 1. Language problems:

1.1 "i.e RF" => "i.e. RF" (x2)

1.2 "f1-score" => "F1-score"

1.3 "Where TP" => "where TP " (after eq. (4))

1.4 "Competing Intereset" => "Competing Interests"

1.5 "at the iot edge" => "at the IoT edge" (ref. 42)

Response: We would like to thank the reviewer for highlighting the issues in the manuscript, we update the manuscript by removing all typos and mistakes.

Concern 2: >>> 2. Presentation problems:

2.1 Fig 1. Within stacked ensemble model - Model1 includes Model1?

2.2 Fig 1. "Trained Model" block is connected with "Model" block in "Test set" part.

The meaning of this connection is unclear. Is stacked ensemble model influencing the "Model"?

Maybe the "Trained Model" block should be connected to "Model Evaluation" block?

Response: We make changes according to reviewer concerns. Model1 replace with word meta models and the evaluation block is attached to the trained model.

2.3 Table 3. Title is too general. Please at least specify dataset.

Response: We update the caption for Table 3 according to the reviewer suggestion.

2.4 While Table 4 is presented then are Authors sure that Fig. 2 is needed or useful?

Both present the same very simple relation and fact.

Response: Reviewer's suggestion is valuable and we remove Figure 2 from the paper to avoid data redundancies in the manuscript.

Response: We update the caption for Table 11 according to the reviewer suggestion.

2.6 Fig. 2.5 and Fig. 2.6: Title includes a lot of repeated fragments what makes it very long - please reorganize to remove repeated fragments.

Response: We update the caption for Figures according to the reviewer suggestion.

2.7 Table 13: units are not presented in the header of the table (time in seconds?)

Response: We measure the computation cost of machine learning models in terms of time (second). We update the manuscript according to the reviewer suggestion.

2.8 Error bars on charts are not shown on charts. Table 13: time is given without confidence intervals (or standard dev.).

Response: We update the manuscript according to the reviewer suggestion.

Concern 3: >>> 3. Other problems:

Response: We would like to thank the reviewer for the valuable comments. We deploy all models in the same environment and give the same size of data to all models. We update the Fig 1 to avoid confusion. All models are trained on 80% training data and then evaluation is done using 20% test data.

Response: We would like to thank the reviewer for the valueable comments. The order is the same for all confusion matrices. The difference which you can see is because of separate data splitting for machine learning and deep learning models. When splitting the dataset we used the shuffle parameter which changes the number of instances for target classes but in both cases, the total numbers of test samples are the same which is 230. We did splitting separately for deep learning models because you can see the last layer of deep learning have 5 neurons so we have to feed target data after converting it into five variables we did splitting again after the conversion of data for deep learning models and shuffled parameters change the count of for each category in the test dataset.

3.3 Sums of the same rows (the same classes) are not equal between matrices in Fig 7. The same problem with Fig 6.

Response: We would like to thank the reviewer for the valuable comments to improve the quality of the manuscript. You rise a good point and help us to mention it in the manuscript for clarity. We have done separate data splitting (Training and Testing sets) for machine learning deep learning models. When splitting the dataset, we used the shuffle parameter which changes the number of instances for each target class but in both cases, the total numbers of test samples are the same which is 230. We did splitting separately for deep learning models because you can see the last layer of deep learning have 5 neurons so we have to feed target data after converting it into five variables we did splitting again after the conversion of data for deep learning models and shuffled parameters change the count of for each category in the test dataset.

3.4 It is hard to understand the meaning of "10-fold" column in Table 14. In fact, all considered measures should be calculated with use of the k-fold cross validation.

Response: Table 14 contain accuracy, precision, recall, and F1 score which we find using test data, while 10-fold accuracy and SD using the 10-fold cross-validation approach. We update the table structure for clear understanding.

3.5 Statistical analysis of results is not presented.

Response: We would like to thank the reviewer for the valuable comment to improve the quality of the manuscript. We added the statistical analysis in the updated manuscript.

In this research, statistical analysis was performed using the T-test to compare the results of the proposed approach with other techniques. The T-test is a method to determine the statistical difference between two populations [51]. The aim was to investigate the statistical significance of the proposed approach in comparison to others. The null hypothesis is rejected by the T-test if the proposed approach is statistically significant. The rejection is based on the P-value and alpha value, which in this study was set at 0.5 for all scenarios. The results showed that the null hypothesis was rejected in all cases, as the P-value was less than the alpha value. Therefore, the proposed approach was found to be statistically significant compared to others, as presented in Table 15.

3.6 It would be good to make software prepared for experimentation fully available for the Readers.

Response: We would like to thank the reviewer for the suggestion. We will make the experiment's code public and will add a link to the code repository in the paper after the paper's acceptance.

Concern 4: Manuscript includes a lot of presentation and other problems. Especially serious are those with experimental setup design (remark 3.1) and statistical analysis of presented results (remarks 3.4 and 3.5). Those errors should be fixed before publication.

Response: Reviewer suggestions to improve the manuscript are highly appreciated. We resolve are comments which mention above such as adding experimental details more clearly, adding statistical Test results, and checking Grammarly thoroughly.

PLoS One. doi: 10.1371/journal.pone.0295501.r005

Decision Letter 2

Maciej Huk

29 May 2023

PONE-D-22-32955R2SSC: The Novel Self-Stack Ensemble Model for Thyroid Disease PredictionPLOS ONE

Dear Dr. Ji,

Thank you for submitting your manuscript to PLOS ONE. It war reviewed by three Reviewers including me as an Academic Editor (Reviewer #4). After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. In particular:

statistical analysis and reporting of results should be improved,
presentation of results should be fixed (including results and legends in figures),
measurement setup should be described clearly (including measurements of time),
pseudocode of algorithm 2 should be made clear.

Please submit your revised manuscript by Jul 13 2023 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

We look forward to receiving your revised manuscript.

Kind regards,

Maciej Huk, Ph.D.

Academic Editor

PLOS ONE

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

Reviewer #1: All comments have been addressed

Reviewer #3: All comments have been addressed

Reviewer #4: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: Yes

Reviewer #3: Yes

Reviewer #4: Partly

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #3: Yes

Reviewer #4: No

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #1: Yes

Reviewer #3: Yes

Reviewer #4: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes

Reviewer #3: Yes

Reviewer #4: Yes

**********

6. Review Comments to the Author

Reviewer #1: The author replied to my question and gave a better explanation. Based on the revised manuscript, I think it meets the relevant requirements for publication.

Reviewer #3: Specify what acronyms such as RF stand for. Lectors must knowing what it means when acronym appear at the first time

Reviewer #4: >>> 1. Language problems:

1.1 "Classifier_base = Base Classifiers i.e RF_base1, RF_base2, and RF_base3" (in Algorithm 2)

"i.e" => "i.e."

>>> 2. Presentation problems:

2.1 Table 11, Table 13, Table 14: units are not presented in the header of the table (accuracy in [%] or [1]). SD for accuracy and other measured values is not given.

2.2 Fig. 2a and Fig. 2b: most of the names of features near horizontal axis are presented improperly. Those names are cut in a way that makes some of those features indistinguishable (e.g. 6x"sured", 2x"oxine", 2x"yroid". This should not happen.

2.3 Fig. 9, Fig. 5, Fig 6, Fig 2: Results should be presented with error bars.

2.4 Table 15: There is no point presenting columns with the same value in all rows of the table. Such information can be given in the title of the table or below the table.

2.5 Algorithm 2:

"D = Number of Classifier_base used." - it is not clear

"D = Number of base classifiers."

2.6 Fig. 5: what is the meaning of "Val_Accuracy", "Val_Precision", "Val_Recall" and 'Val_F1" within legends of included charts? What is the difference e.g. between "Accuracy' and ""Val_Accuracy"? This should be explained within the title of the figure.

2.7 Fig. 6 - analogous problem as the one indicated in point 2.6. What is e.g. Loss and how it is different from Val_Loss?

>>> 3. Other problems:

3.1 Why the time in Table 13 is given with varying precision for different models (please compare results for SVC v.s. GBM v.s. LR)? Were different methods for time measurement used in those cases? It should be specified within the text how the time was measured and what was the precision of this time measurement setup. SD or confdence intervals should be given with measured value.

3.2 Algorithm 2: It is unclear what it the meaning of "mode{}". Maybe Authors meant "model" ?

In both cases it should be clarified how the prediction is generated in the last step of Algorithm 2.

3.3 Fig. 8 (h), (i), (j) - rows of confusion matrices are ordered in different way than in the rest of models. This should be clarified and organized in unified way (the same classes should have the same row number designations in case of all models/matrices. E.g. Fig. 8 (h), (i), (j) row 0 include 35 hits while other confusion matrices in Fig 8 include 35 hits in row number 3. Such changes in class identifiers make this figure hard to understand.

This problem should be removed.

3.4 Sums of the same rows (the same classes) are not equal between matrices in Fig 7.

If this is not an error then it should be explained in more clear way.

3.5 Authors write: "All experiments were performed using a Corei7, a 12th generation Dell machine with a Windows operating system."

Are the names of the hardware producer, processor and operating system important for analysis of the results? If yes, this should be clearly discussed within teh text. If not, then not needed information should be removed.

3.6 Authors write: "The rejection is based on the P-value and alpha value, which in this study was set at 0.5 for all scenarios."

Alpha value equal 0.5 represents very weak test. Maybe Authors meant 0.05?

It should be given what was the number of degrees of freedom during calculation of t-statistics? 229?

Why the p-value in case of all comparisons is the same? Is this an effect of imprecision of IEEE754 format used during calculations?

>>> Recommendation: major rework

===EOT===

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #3: No

Reviewer #4: No

**********

PLoS One. 2024 Jan 3;19(1):e0295501. doi: 10.1371/journal.pone.0295501.r006

Author response to Decision Letter 2

26 Jun 2023

RESPONSE TO REVIEWERS

Manuscript ID: PONE-D-22-32955

Article Title: “The Novel Self-Stack Ensemble Model for Thyroid Disease Prediction”

Author: Ji Sheng-jun

Dear Editor,

First, we are extremely grateful to the editor and all the reviewers who conducted a rigorous and constructive review of our article number PONE-D-22-32955. Certainly, all the review points were very well directed, and we managed to improve our manuscript a great deal and had to work a lot to address all the changes suggested by the reviewers. We have incorporated suggestions and guidance provided by the reviewers and made a careful revision of the manuscript. We have tried our level best to improve this manuscript by organizing and improving our figures to present the material more constructively and effectively. This has certainly, raised the quality of this research work and will make it a valuable contribution to the community. We hope that we have managed to fulfil the required changes suggested by the valuable reviewers.

Please see our responses (in blue color) in line with the reviewers’ comments in black color.

Best regards,

Ji Sheng-jun

Reviewer #1:

Concern: The author replied to my question and gave a better explanation. Based on the revised manuscript, I think it meets the relevant requirements for publication.

Response: We would like to thank the reviewer for value comments to improve the quality of manuscript and giving positive feedback.

Reviewer #3:

Concern: Specify what acronyms such as RF stand for. Lectors must knowing what it means when acronym appear at the first time.

Response: We would like to thank the reviewer for highlighting an important point and we update manuscript according to reviewer concern. We added the acronyms for all abbreviation.

Reviewer #4:

Concern 1: >>> 1. Language problems:

1.1 "Classifier_base = Base Classifiers i.e RF_base1, RF_base2, and RF_base3" (in Algorithm 2)

"i.e" => "i.e."

We would like to thank the reviewer for highlighting an important point and we update manuscript according to reviewer concern.

Concern 2: >>> 2. Presentation problems:

2.1 Table 11, Table 13, Table 14: units are not presented in the header of the table (accuracy in [%] or [1]). SD for accuracy and other measured values is not given.

Response: We would like to thank the reviewer for valuable comments. In Table 11, 13, and 14 and all others Tables we preset results in standard format such as accuracy and other evaluation are between 0 and 1. Our results in Tables are not in %. Second the standard deviation (SD) is output of K-fold cross validation and is present in corresponding Table 13.

Concern 3: 2.2 Fig. 2a and Fig. 2b: most of the names of features near horizontal axis are presented improperly. Those names are cut in a way that makes some of those features indistinguishable (e.g. 6x"sured", 2x"oxine", 2x"yroid". This should not happen.

Response: We would like to thank the reviewer for highlighting an important point and we update manuscript according to reviewer concern. We update the manuscript by adding the accurate images.

2.3 Fig. 9, Fig. 5, Fig 6, Fig 2: Results should be presented with error bars.

Response: We would like to thank the reviewer for valuable comments.

• Fig 2: its feature importance figure so their cant be any error bar.

• Fig 5 and Fig 6: These two figures consist of accuracy and error graphs, in Fig 5 we illustrate the accuracies graph while in Fig 6 there are error graphs.

• Fig 7 & 8: Both figures consist of confusion matrix which values represent the error rates and accuracies both.

• Fig 9: we added error rate in fig 9 and update the manuscript.

2.4 Table 15: There is no point presenting columns with the same value in all rows of the table. Such information can be given in the title of the table or below the table.

Response: We would like to thank the reviewer for valuable comments to improve the quality of manuscript. The P-value column value can be different in each row but in our case it constant. We have T-statistics with P-values to evaluate the performance. So according to situate its better to have value in each row but still if reviewer will suggest we will remove this.

2.5 Algorithm 2:

"D = Number of Classifier_base used." - it is not clear

"D = Number of base classifiers."

Response: We would like to thank the reviewer for valuable comments to improve the quality of manuscript. We update manuscript according to reviewer suggestion.

Response: We would like to thank the reviewer for the valuable comments to improve the quality of the manuscript. The ‘Accuracy’ represent the accuracy on the training set and ‘Val_Accuracy’ represent the accuracy on the validation set, it is also similar to other evaluation matrixes. We added the description of it in the updated manuscript.

The meanings of 'Accuracy' and 'Val\\_Accuracy' in Figures 5 and 6 refer to the accuracy values obtained on the training and validation sets, respectively. The same principle applies to other evaluation measures as well.

2.7 Fig. 6 - analogous problem as the one indicated in point 2.6. What is e.g. Loss and how it is different from Val_Loss?

Response: Its very similar to as we mention in response of 2.6, The ‘Loss represent the accuracy on the training set and ‘Val_Loss’ represent the accuracy on the validation set, it is also similar to other evaluation matrixes.

Concern: 3 >>> 3. Other problems:

Response: We would like to thank the reviewer for the valuable comments to improve the quality of the manuscript. These all used models are different in their approaches so their computation time can vary on different run. We added the specification of system used for experiments.

SD for each model is present in Table 13 which show the confidence interval of model.

3.2 Algorithm 2: It is unclear what it the meaning of "mode{}". Maybe Authors meant "model" ?

In both cases it should be clarified how the prediction is generated in the last step of Algorithm 2.

Response: We would like to thank the reviewer for the valuable comments to improve the quality of the manuscript. Mode in SVC-H algorithm, represent that the target class with highest vote will be the final prediction and argmax in SVC-S, represent that the class with highest average probability will be final prediction.

This problem should be removed.

Response: We would like to thank the reviewer for the valuable comments. The order is the same for all confusion matrices. The difference which you can see is because of separate data splitting for machine learning and deep learning models. When splitting the dataset we used the shuffle parameter which changes the number of instances for target classes but in both cases, the total numbers of test samples are the same which is 230. We did splitting separately for deep learning models because you can see the last layer of deep learning have 5 neurons so we have to feed target data after converting it into five variables we did splitting again after the conversion of data for deep learning models and shuffled parameters change the count of for each category in the test dataset.

3.4 Sums of the same rows (the same classes) are not equal between matrices in Fig 7.

If this is not an error then it should be explained in more clear way.

Response: We would like to thank the reviewer for the valuable comments. We did the separate splitting for both machine learning and deep learning models, because of this there is a difference which we clearly mention in the manuscript and make proposed justification in the 3.2 response.

3.5 Authors write: "All experiments were performed using a Corei7, a 12th generation Dell machine with a Windows operating system."

Response: We express our gratitude to the reviewer for their valuable feedback. We have included the details of our experimental setup. These details are crucial for the reproducibility of our results as we not only measured the model's performance in terms of accuracy but also in terms of time.

3.6 Authors write: "The rejection is based on the P-value and alpha value, which in this study was set at 0.5 for all scenarios."

Alpha value equal 0.5 represents very weak test. Maybe Authors meant 0.05?

It should be given what was the number of degrees of freedom during calculation of t-statistics? 229?

Why the p-value in case of all comparisons is the same? Is this an effect of imprecision of IEEE754 format used during calculations?

Response: We would like to thank the reviewer for the valuable comments. According to reviewer suggestion we did experiment with p value 0.05 and also add the degree of freedom for each comparison. Still with this 0.05 p value our proposed technique reject null hypothesis which show that our proposed approach is statistically significant in comparison with other approaches for disease diagnostics.

Attachment

Submitted filename: SSC R3.docx

Click here for additional data file.^{(97.2KB, docx)}

PLoS One. doi: 10.1371/journal.pone.0295501.r007

Decision Letter 3

Maciej Huk

4 Aug 2023

PONE-D-22-32955R3SSC: The Novel Self-Stack Ensemble Model for Thyroid Disease PredictionPLOS ONE

Dear Dr. Ji,

Thank you for submitting your manuscript to PLOS ONE. It was reviewed by the Academic Editor (Reviewer #4). After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. In particular:

reporting of results within tables should be improved,
presentation of results within figures should be fixed (including results and legends),
precision of performed time measurements should be presented.

Please submit your revised manuscript by Sep 18 2023 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

We look forward to receiving your revised manuscript.

Kind regards,

Maciej Huk, Ph.D.

Academic Editor

PLOS ONE

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

Reviewer #4: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #4: Partly

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #4: I Don't Know

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #4: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #4: Yes

**********

6. Review Comments to the Author

Reviewer #4: >>> 1. Language problems: not detected

>>> 2. Presentation problems:

2.1 Table 11, Table 13, Table 14: units are not presented in the header of the table (accuracy in [%] or [1]).

This was indicated previously but not fixed. ACC=0.97 can mean e.g. 0.97% or 97%

2.2 Fig. 9, Fig. 5, Fig 6, Fig 2: Results should be presented with error bars.

Fig. 2, Fig. 5, Fig. 6: Not fixed.

Fig. 9 - modified but not fixed:

Classification Error (1 - Classification Accuracy) was added/presented by the Authors within Fig. 9 but this is not what was meant by the Reviewer. Measurement errors need to be presented instead, e.g. for p-value=0.95.

If Authors want to use bar chart then bar chart with error bars should be presented.

Please see the exmple discussions of the measurement error/measurement uncertainty and related confidence intervals:

https://blogs.sas.com/content/iml/2019/10/09/statistic-error-bars-mean.html

https://www.r-bloggers.com/2021/06/error-bar-plot-in-r-adding-error-bars-quick-guide/

2.3 Table 15: There is no point presenting columns with the same value in all rows of the table. Such information can be given in the title of the table or below the table.

Not fixed. After modifications Table 15 still includes three columns (df, P-Value, Null Hypothesis) which include the same values in each row.

2.4 Algorithm 2:

"D = Number of base classifier." => "D = Number of base classifiers."

Are Authors sure that plural version would not be more accurate?

2.5 Fig. 5, Fig. 6: what is the meaning of "Val_Accuracy", "Val_Precision", "Val_Recall", "Val_F1", "Val_Loss", "Val_MSE" within legends of included charts?

What is the difference e.g. between "Accuracy' and ""Val_Accuracy"? This should be explained within the title of the figure.

Authors write in their response:"‘Val_Accuracy’ represent the accuracy on the validation set, it is also similar to other

evaluation matrixes. We added the description of it in the updated manuscript."

The information was added within the text. It would be better to add relevant information within titles of Fig 5 and Fig 6.

The Reader can have a problem to find this information while analysing those figures.

>>> 3. Other problems:

3.1 Why the time in Table 13 is given with varying precision for different models?

What was the precision of time measurement setup? SD or confdence intervals should be given with measured value.

3.2 Fig. 8 (h), (i), (j) - rows of confusion matrices are ordered in different way than in the rest of models. This should be clarified and organized in unified way (the same classes should have the same row number designations in case of all models/matrices. E.g. Fig. 8 (h), (i), (j) row 0 include 35 hits while other confusion matrices in Fig 8 include 35 hits in row number 3. Such changes in class identifiers make this figure hard to understand.

This problem should be removed.

Authors gave the explanation within their response to the Reviewer.

The explanation should be added within the text of the manuscript it must be available to the Reader.

3.3 Sums of the same rows (the same classes) are not equal between matrices in Fig 7.

If this is not an error then it should be explained in more clear way.

Authors gave the explanation within their response to the Reviewer.

The explanation should be added within the text of the manuscript it must be available to the Reader.

>>> Recommendation: major rework

===EOT===

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #4: No

**********

PLoS One. 2024 Jan 3;19(1):e0295501. doi: 10.1371/journal.pone.0295501.r008

Author response to Decision Letter 3

9 Sep 2023

Reviewer #4:

Concern 1: >>> 2. Presentation problems: 2.1 Table 11, Table 13, Table 14: units are not presented in the header of the table (accuracy in [%] or [1]).

This was indicated previously but not fixed. ACC=0.97 can mean e.g. 0.97% or 97%

Response: We would like to thank the reviewer for raising concerns. Here, we added clear details about score units, as these accuracy, precision, recall, and F1 scores are standard parameters, so their values are always between 0 and 1. Here, 0 is the lowest score and 1 is the highest score. There will be no unit with them.

We added this explanation in the manuscript for readers in the evaluation matrix section.

Concern 2: 2.2 Fig. 9, Fig. 5, Fig 6, Fig 2: Results should be presented with error bars.

Response: We would like to thank the reviewer for the valuable comments. We already responded to these comments in our previous response. Again we clarify this comment point by point. We also added one more figure (Figure 10) to illustrate error bars.

• Fig 2: This figure is already a bar chart but we can’t convert it into an error bar. Because of its feature importance score always one value corresponds to one feature.

• Fig 5 and Fig 6: We convert these figures into bar charts as shown in Figure 9. These two figures consist of accuracy and error graphs so we convert them into bar charts and illustrate them in one graph.

• Fig 9: We added the error rate in Fig 9 and error bars in Fig 10 and updated the manuscript.

Concern 3: Fig. 2, Fig. 5, Fig. 6: Not fixed.

Response: We fixed these figures and converted them according to the reviewer concern.

Concern 4: Fig. 9 - modified but not fixed: Classification Error (1 - Classification Accuracy) was added/presented by the Authors within Fig. 9 but this is not what was meant by the Reviewer. Measurement errors need to be presented instead, e.g. for p-value=0.95.

If Authors want to use bar chart then bar chart with error bars should be presented.

Please see the exmple discussions of the measurement error/measurement uncertainty and related confidence intervals:

https://blogs.sas.com/content/iml/2019/10/09/statistic-error-bars-mean.html

https://www.r-bloggers.com/2021/06/error-bar-plot-in-r-adding-error-bars-quick-guide/

Response: Figure 10, illustrates SD values for different classifiers and provides insights into the variability of accuracy scores when applied to both original and down-sampled data. It also shows the error rate lower and upper bounds. These values indicate the stability of classifier performance across multiple runs or datasets, with lower values suggesting more consistent results.

Concern 5: 2.3 Table 15: There is no point presenting columns with the same value in all rows of the table. Such information can be given in the title of the table or below the table.

Not fixed. After modifications Table 15 still includes three columns (df, P-Value, Null Hypothesis) which include the same values in each row.

Response: We remove the values and mention the values in the text.

Concern 6:2.4 Algorithm 2:

"D = Number of base classifier." => "D = Number of base classifiers."

Are Authors sure that plural version would not be more accurate?

Response: Yes, the number of base classifiers is correct and we update it accordingly.

Concern 7: 2.5 Fig. 5, Fig. 6: what is the meaning of "Val_Accuracy", "Val_Precision", "Val_Recall", "Val_F1", "Val_Loss", "Val_MSE" within legends of included charts?

What is the difference e.g. between "Accuracy' and ""Val_Accuracy"? This should be explained within the title of the figure.

Authors write in their response:"‘Val_Accuracy’ represent the accuracy on the validation set, it is also similar to other

evaluation matrixes. We added the description of it in the updated manuscript."

The information was added within the text. It would be better to add relevant information within titles of Fig 5 and Fig 6.

The Reader can have a problem to find this information while analysing those figures.

Response: We updated the manuscript according to the reviewer's suggestion and added details in the text and captions.

The terms 'Accuracy' and 'Val_Accuracy' in Figures 5 and 6 correspond to the accuracy metrics computed on the training and validation datasets, respectively. This same concept extends to other evaluation metrics, where 'val_precision' represents validation precision, 'val_recall' stands for validation recall, and 'val_F1_Score' stands for the validation F1 score.

Concern 8: 3.1 Why the time in Table 13 is given with varying precision for different models?

What was the precision of time measurement setup? SD or confdence intervals should be given with measured value.

Response: Reviewers' concern is valid, and we would like to address it. The variation in processing time can be attributed to differences in model architectures. For instance, linear models like Support Vector Classifier (SVC) demand more computational resources and time to find an optimal hyperplane, whereas models like Random Forest (RF), Logistic Regression (LR), and Gradient Boosting Machine (GBM) have less complex architectures compared to SVC. Therefore, the time precision is entirely dependent on the specific model's architecture.

Concern 9: 3.2 Fig. 8 (h), (i), (j) - rows of confusion matrices are ordered in different way than in the rest of models. This should be clarified and organized in unified way (the same classes should have the same row number designations in case of all models/matrices. E.g. Fig. 8 (h), (i), (j) row 0 include 35 hits while other confusion matrices in Fig 8 include 35 hits in row number 3. Such changes in class identifiers make this figure hard to understand.

This problem should be removed.

Authors gave the explanation within their response to the Reviewer.

The explanation should be added within the text of the manuscript it must be available to the Reader.

Response: We added these lines in the manuscript according to the reviewer suggestion.

Figures 7 and 8 show the confusion matrix for both machine learning and deep learning models on the same size of test data. We have done separate data splitting (Training and Testing sets) for machine learning and deep learning models. When splitting the dataset, we used the shuffle parameter which changes the number of instances for each target class for both machine and deep learning models but in both cases, the total number of test samples is the same which is 230. We did splitting separately for deep learning models because you can see the last layer of deep learning has 5 neurons so we have to feed target data after converting it into five variables we did splitting again after the conversion of data for deep learning models and shuffle parameters change the count of for each category in the test dataset.

Concern 10: 3.3 Sums of the same rows (the same classes) are not equal between matrices in Fig 7. If this is not an error then it should be explained in more clear way.

Authors gave the explanation within their response to the Reviewer.

The explanation should be added within the text of the manuscript it must be available to the Reader.

Response: This is the same concern as above and we explain it in the text. This difference in the number of instances in each row is because of reshuffle parameters in the train test split. This shuffle makes validation more strong when you have an imbalance or a small dataset.

We added an explanation for this in the manuscript as above for readers.

Attachment

Submitted filename: SSC R4.docx

Click here for additional data file.^{(185.6KB, docx)}

PLoS One. doi: 10.1371/journal.pone.0295501.r009

Decision Letter 4

Maciej Huk

27 Sep 2023

PONE-D-22-32955R4SSC: The Novel Self-Stack Ensemble Model for Thyroid Disease PredictionPLOS ONE

Dear Dr. Ji,

Thank you for submitting your manuscript to PLOS ONE. It was analyzed by me as an Academic Editor (Reviewer #4). After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. In particular:

reporting of statistical results should be improved,
selection of statistical tests should be backed up by the presented data,
indicated presentation and language problems should be removed.

Please submit your revised manuscript by Nov 11 2023 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

We look forward to receiving your revised manuscript.

Kind regards,

Maciej Huk, Ph.D.

Academic Editor

PLOS ONE

Journal Requirements:

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

Reviewer #4: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #4: Partly

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #4: I Don't Know

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #4: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #4: Yes

**********

6. Review Comments to the Author

Reviewer #4: >>> 1. Language problems:

1.1 Authors write: "We did splitting separately for deep learning models because you can see the last

layer of deep learning has 5 neurons so we have to feed target data after converting it

into five variables we did splitting again after the conversion of data for deep learning

models and shuffle parameters change the count of for each category in the test dataset."

Authors should not mix plain language with scientific language ("you can see").

Additionally - this sentence should be split into shorter and more clear ones. The current form is hard to understand.

>>> 2. Presentation problems:

Fig 10. (a) vertical axis title is improper (names of classifiers are under horizontal axis)

Fig 10. (b) vertical axis title is not as title of the figure suggests.

>>> 3. Other problems:

3.1 Reporting of statistical analysis is not complete:

- what is the sample size? (How many times each measurement was repeated? How many times 10-fold CV was repeated?)

- Authors report statistical importance, but what is the size of the observed effect (e.g. Cohens delta?) ?

- What is the estimated power of the performed test? (knowing the Cohens delta and number of samples you can use e.g. "Statistical Nomograph for Sample Size Estimation by Richard N. MacLennan"). Power of the test should be not lower than 0.7. This will allow to check if the number of measurments (sample size) is big enough.

3.2 Selection of the statistical test is not backed up by the presented data and Authors do not present information if the analysed data meet assumptions needed to use this test (normal distribution, homogeneity of variance). What statistical tests of normality and variance homogeneity were used to validate this?

3.3 Construction of the statistical test seems to be not valid. Authors are using solution which is typically used as a post-hoc test. Maybe Authors could use Friedmans ANOVA as a main test?

Please see e.g. [A] and [B] for details.

[A] Janez Demsar, Statistical Comparisons of Classifiers over Multiple Data Sets, Journal of Machine Learning Research 7, 2006

[B] Dietterich T.G., Approximate statistical tests for comparing supervised classification learning algorithms, Neural Comput 10(7), 1998

>>> Recommendation: minor rework.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #4: No

**********

PLoS One. 2024 Jan 3;19(1):e0295501. doi: 10.1371/journal.pone.0295501.r010

Author response to Decision Letter 4

11 Oct 2023

Editor Comments:

In particular:

• reporting of statistical results should be improved,

We have updated the manuscript by incorporating the reviewer's suggested details on the power of the test.

• selection of statistical tests should be backed up by the presented data,

The data used for the test is already in the manuscript, as we utilized the model evaluation scores in the statistical analysis. So every model evaluation is present in the results section.

• indicated presentation and language problems should be removed.

We thoroughly reviewed the manuscript and corrected issues according to reviewer suggestions.

Reviewer #4:

>>> 1. Language problems:

1.1 Authors write: "We did splitting separately for deep learning models because you can see the last

layer of deep learning has 5 neurons so we have to feed target data after converting it

into five variables we did splitting again after the conversion of data for deep learning

models and shuffle parameters change the count of for each category in the test dataset."

Authors should not mix plain language with scientific language ("you can see").

Additionally - this sentence should be split into shorter and more clear ones. The current form is hard to understand.

Response: We would like to thank the reviewer for their valuable comments and suggestions. The manuscript has been updated in accordance with these suggestions, and the changes can be found in the updated version

>>> 2. Presentation problems:

Fig 10. (a) vertical axis title is improper (names of classifiers are under horizontal axis)

Fig 10. (b) vertical axis title is not as title of the figure suggests.

Response: The reviewer raised valuable points in the manuscript, and we have updated the manuscript to address their concerns. Updated figures can be found in the revised manuscript.

>>> 3. Other problems:

3.1 Reporting of statistical analysis is not complete:

- what is the sample size? (How many times each measurement was repeated? How many times 10-fold CV was repeated?)

- Authors report statistical importance, but what is the size of the observed effect (e.g. Cohens delta?) ?

Response: We would like to express our gratitude to the reviewer for their valuable comments. To determine the statistical power of our test, we utilized the suggested methods and an online power calculator [1]. Our sample size is 16, Cohen's effect size is 0.8, and the significance level is set at 0.05. The sample population consists of a normal distribution. The test results provide a statistical power of 0.169, with critical values of -1.96 and 1.96 based on the standard normal distribution, where the standard distribution shift is 2.263.

1- https://www.statskingdom.com/32test_power_t_z.html

Response: We used a normal distribution for the test, and all other parameters in the test were included in the updated manuscript. However, the reviewer raised concerns regarding the quality of the data, as we utilized model results for performing a statistical T-test. This means that the results from the models, including accuracy, per-class (precision, recall, F1 score), and their average, were subjected to a T-test to demonstrate their significance. We compared the results of the SSC model with those of all other models, as displayed in Table 15.

3.3 Construction of the statistical test seems to be not valid. Authors are using solution which is typically used as a post-hoc test. Maybe Authors could use Friedmans ANOVA as a main test?

Please see e.g. [A] and [B] for details.

[A] Janez Demsar, Statistical Comparisons of Classifiers over Multiple Data Sets, Journal of Machine Learning Research 7, 2006

[B] Dietterich T.G., Approximate statistical tests for comparing supervised classification learning algorithms, Neural Comput 10(7), 1998

Response: We would like to express our gratitude to the reviewer for their valuable comments. Indeed, the reviewer's point is accurate, as we deployed a T-test on the model results to assess the statistical differences in model performance. The study [B] recommended applying these tests to model predictions, which can help obtain a larger sample for statistical analysis. We conducted statistical tests on the evaluation scores to emphasize the significance of the models over others based on their performance scores. In this study, our main contribution is classification and proposing a model. Our aim is to highlight the significance of the proposed models and we adopt a very simple approach to apply statistical Tests rather than adopting complex approach.

Attachment

Submitted filename: SSC5.docx

Click here for additional data file.^{(19.6KB, docx)}

PLoS One. doi: 10.1371/journal.pone.0295501.r011

Decision Letter 5

Maciej Huk

23 Nov 2023

SSC: The Novel Self-Stack Ensemble Model for Thyroid Disease Prediction

PONE-D-22-32955R5

Dear Dr. Ji,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Maciej Huk, Ph.D.

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

Reviewer #4: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #4: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #4: I Don't Know

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #4: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #4: Yes

**********

6. Review Comments to the Author

Reviewer #4: >>> 1. Language problems: not detected

>>> 2. Presentation problems:

2.1 Fig. 9, Fig. 10 a/b - please add units to description of vertical axis. (e.g. accuracy equal 0.9 can mean 90% or 0.9%). Do Authors mean "Accuracy [1]" ?

2.2 Fig. 6(f) - description of vertical axis is not visible

>>> 3. Other problems:

3.1 ref. [2] (https://weillcornell.org/news/understanding-thyroid-problems-disease) Such source is not verified and not stable - not a proper reference within scientific manuscript. Authors should use reference to related, high quality book or scientific article.

3.2 ref. [52] - please see comment 3.1 above.

>>> Recommendation: Accept after fixing abovementioned minor problems

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #4: No

**********

PLoS One. doi: 10.1371/journal.pone.0295501.r012

Acceptance letter

Maciej Huk

20 Dec 2023

PONE-D-22-32955R5

PLOS ONE

Dear Dr. Ji,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission,

* There are no issues that prevent the paper from being properly typeset

If revisions are needed, the production department will contact you directly to resolve them. If no revisions are needed, you will receive an email when the publication date has been set. At this time, we do not offer pre-publication proofs to authors during production of the accepted work. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few weeks to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Maciej Huk

Academic Editor

PLOS ONE

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Attachment

Submitted filename: SSC paper response.docx

Click here for additional data file.^{(145.5KB, docx)}

Attachment

Submitted filename: SSC R3.docx

Click here for additional data file.^{(97.2KB, docx)}

Attachment

Submitted filename: SSC R4.docx

Click here for additional data file.^{(185.6KB, docx)}

Attachment

Submitted filename: SSC5.docx

Click here for additional data file.^{(19.6KB, docx)}

Data Availability Statement

https://archive.ics.uci.edu/ml/datasets/thyroid+disease.

[pone.0295501.ref001] 1. Ioniţă I, Ioniţă L. Prediction of thyroid disease using data mining techniques. BRAIN Broad Research in Artificial Intelligence and Neuroscience. 2016;7(3):115–124. [Google Scholar]

[pone.0295501.ref002] 2.Cornell W. Understanding thyroid problems … disease; 2022. https://weillcornell.org/news/understanding-thyroid-problems-disease.

[pone.0295501.ref003] 3. Monsalve-Torra A, Ruiz-Fernandez D, Marin-Alonso O, Soriano-Payá A, Camacho-Mackenzie J, Carreño-Jaimes M. Using machine learning methods for predicting inhospital mortality in patients undergoing open repair of abdominal aortic aneurysm. Journal of biomedical informatics. 2016;62:195–201. doi: 10.1016/j.jbi.2016.07.007 [DOI] [PubMed] [Google Scholar]

[pone.0295501.ref004] 4.Raisinghani S, Shamdasani R, Motwani M, Bahreja A, Raghavan Nair Lalitha P. Thyroid prediction using machine learning techniques. In: International Conference on Advances in Computing and Data Sciences. Springer; 2019. p. 140–150.

[pone.0295501.ref005] 5.Tyagi A, Mehra R, Saxena A. Interactive thyroid disease prediction system using machine learning technique. In: 2018 Fifth international conference on parallel, distributed and grid computing (PDGC). IEEE; 2018. p. 689–693.

[pone.0295501.ref006] 6.Shahid AH, Singh MP, Raj RK, Suman R, Jawaid D, Alam M. A Study on Label TSH, T3, T4U, TT4, FTI in Hyperthyroidism and Hypothyroidism using Machine Learning Techniques. In: 2019 International Conference on Communication and Electronics Systems (ICCES). IEEE; 2019. p. 930–933.

[pone.0295501.ref007] 7. Yadav DC, Pal S. Prediction of thyroid disease using decision tree ensemble method. Human-Intelligent Systems Integration. 2020;2(1):89–95. doi: 10.1007/s42454-020-00006-y [DOI] [Google Scholar]

[pone.0295501.ref008] 8. Raghuraman M, Sailatha E, Gunasekaran S. Efficient thyroid disease prediction and comparative study using machine learning algorithms. International Journal of Information and Computing Science. 2019;6(6):617–624. [Google Scholar]

[pone.0295501.ref009] 9. Abbad Ur Rehman H, Lin CY, Mushtaq Z, Su SF. Performance analysis of machine learning algorithms for thyroid disease. Arabian Journal for Science and Engineering. 2021;46(10):9437–9449. doi: 10.1007/s13369-020-05206-x [DOI] [Google Scholar]

[pone.0295501.ref010] 10. Shivastuti HK, Manhas J, Sharma V. Performance Evaluation of SVM and Random Forest for the Diagnosis of Thyroid Disorder. Int J Res Appl Sci Eng Technol. 2021;9:945–947. doi: 10.22214/ijraset.2021.34381 [DOI] [Google Scholar]

[pone.0295501.ref011] 11. Iglesias JA, Ledezma A, Sanchis A. An ensemble method based on evolving classifiers: eStacking. In: 2014 IEEE Symposium on Evolving and Autonomous Learning Systems (EALS). IEEE; 2014. p. 124–131. [Google Scholar]

[pone.0295501.ref012] 12. Jiang W, Chen Z, Xiang Y, Shao D, Ma L, Zhang J. SSEM: A novel self-adaptive stacking ensemble model for classification. IEEE Access. 2019;7:120337–120349. doi: 10.1109/ACCESS.2019.2933262 [DOI] [Google Scholar]

[pone.0295501.ref013] 13. Chen C, Zhang Q, Yu B, Yu Z, Lawrence PJ, Ma Q, et al. Improving protein-protein interactions prediction accuracy using XGBoost feature selection and stacked ensemble classifier. Computers in Biology and Medicine. 2020;123:103899. doi: 10.1016/j.compbiomed.2020.103899 [DOI] [PubMed] [Google Scholar]

[pone.0295501.ref014] 14. Alam MZ, Rahman MS, Rahman MS. A Random Forest based predictor for medical data classification using feature ranking. Informatics in Medicine Unlocked. 2019;15:100180. doi: 10.1016/j.imu.2019.100180 [DOI] [Google Scholar]

[pone.0295501.ref015] 15. Yang F, Wang Hz, Mi H, Lin Cd, Cai Ww. Using random forest for reliable classification and cost-sensitive learning for medical diagnosis. BMC bioinformatics. 2009;10(1):1–14. doi: 10.1186/1471-2105-10-S1-S22 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0295501.ref016] 16.Shaik AB, Srinivasan S. A brief survey on random forest ensembles in classification model. In: International Conference on Innovative Computing and Communications. Springer; 2019. p. 253–260.

[pone.0295501.ref017] 17. Saad E, Sadiq S, Jamil R, Rustam F, Mehmood A, Choi GS, et al. Novel extreme regression-voting classifier to predict death risk in vaccinated people using VAERS data. Plos one. 2022;17(6):e0270327. doi: 10.1371/journal.pone.0270327 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0295501.ref018] 18. Mishra S, Tadesse Y, Dash A, Jena L, Ranjan P. Thyroid disorder analysis using random forest classifier. In: Intelligent and cloud computing. Springer; 2021. p. 385–390. [Google Scholar]

[pone.0295501.ref019] 19. Alyas T, Hamid M, Alissa K, Faiz T, Tabassum N, Ahmad A. Empirical Method for Thyroid Disease Classification Using a Machine Learning Approach. BioMed Research International. 2022;2022. doi: 10.1155/2022/9809932 [DOI] [PMC free article] [PubMed] [Google Scholar] [Retracted]

[pone.0295501.ref020] 20. Razia S, Prathyusha PS, Krishna NV, Sumana NS. A Comparative study of machine learning algorithms on thyroid disease prediction. Int J Eng Technol. 2018;7:315–319. doi: 10.14419/ijet.v7i2.8.10432 [DOI] [Google Scholar]

[pone.0295501.ref021] 21. Garcia de Lomana M, Weber AG, Birk B, Landsiedel R, Achenbach J, Schleifer KJ, et al. In silico models to predict the perturbation of molecular initiating events related to thyroid hormone homeostasis. Chemical research in toxicology. 2020;34(2):396–411. doi: 10.1021/acs.chemrestox.0c00304 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0295501.ref022] 22. Li Y, Song J. The difference of the quantitative parameters using kinetic compartment model in Salivary gland scintigraphy in two kinds of people: hyperthyroidism and euthyroidism; 2018. [Google Scholar]

[pone.0295501.ref023] 23. Rayman MP. Multiple nutritional factors and thyroid disease, with particular reference to autoimmune thyroid disease. Proceedings of the Nutrition Society. 2019;78(1):34–44. doi: 10.1017/S0029665118001192 [DOI] [PubMed] [Google Scholar]

[pone.0295501.ref024] 24. Biondi B, Cooper DS. Thyroid hormone therapy for hypothyroidism. Endocrine. 2019;66(1):18–26. doi: 10.1007/s12020-019-02023-7 [DOI] [PubMed] [Google Scholar]

[pone.0295501.ref025] 25. Yang P, Pi Y, He T, Sun J, Wei J, Xiang Y, et al. Automatic differentiation of thyroid scintigram by deep convolutional neural network: a dual center study. BMC Medical Imaging. 2021;21(1):1–9. doi: 10.1186/s12880-021-00710-4 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0295501.ref026] 26. Umar Sidiq D, Aaqib SM, Khan RA. Diagnosis of various thyroid ailments using data mining classification techniques. Int J Sci Res Coput Sci Inf Technol. 2019;5:131–6. [Google Scholar]

[pone.0295501.ref027] 27. Chaubey G, Bisen D, Arjaria S, Yadav V. Thyroid disease prediction using machine learning approaches. National Academy Science Letters. 2021;44(3):233–238. doi: 10.1007/s40009-020-00979-z [DOI] [Google Scholar]

[pone.0295501.ref028] 28. Jha R, Bhattacharjee V, Mustafi A. Increasing the Prediction Accuracy for Thyroid Disease: A Step Towards Better Health for Society. Wireless Personal Communications. 2022;122(2):1921–1938. doi: 10.1007/s11277-021-08974-3 [DOI] [Google Scholar]

[pone.0295501.ref029] 29.Sonuç E, et al. Thyroid Disease Classification Using Machine Learning Algorithms. In: Journal of Physics: Conference Series. 1. IOP Publishing; 2021. p. 012140.

[pone.0295501.ref030] 30. Akhtar T, Gilani SO, Mushtaq Z, Arif S, Jamil M, Ayaz Y, et al. Effective Voting Ensemble of Homogenous Ensembling with Multiple Attribute-Selection Approaches for Improved Identification of Thyroid Disorder. Electronics. 2021;10(23):3026. doi: 10.3390/electronics10233026 [DOI] [Google Scholar]

[pone.0295501.ref031] 31.Akhtar T, Arif S, Mushtaq Z, Gilani SO, Jamil M, Ayaz Y, et al. Ensemble-based Effective Diagnosis of Thyroid Disorder with Various Feature Selection Techniques. In: 2022 2nd International Conference of Smart Systems and Emerging Technologies (SMARTTECH). IEEE; 2022. p. 14–19.

[pone.0295501.ref032] 32. Li Q, Zhao C, He X, Chen K, Wang R. The Impact of Partial Balance of Imbalanced Dataset on Classification Performance. Electronics. 2022;11(9):1322. doi: 10.3390/electronics11091322 [DOI] [Google Scholar]

[pone.0295501.ref033] 33. Hossain T, Mauni HZ, Rab R. Reducing the Effect of Imbalance in Text Classification Using SVD and GloVe with Ensemble and Deep Learning. Computing and Informatics. 2022;41(1):98–115. doi: 10.31577/cai_2022_1_98 [DOI] [Google Scholar]

[pone.0295501.ref034] 34.Li X, Wen L, Deng Y, Feng F, Hu X, Wang L, et al. Graph neural network with curriculum learning for imbalanced node classification. arXiv preprint arXiv:220202529. 2022;.

[pone.0295501.ref035] 35. Chaganti R, Rustam F, De La Torre Díez I, Mazón J, Rodríguez C, Ashraf I. Thyroid Disease Prediction Using Selective Features and Machine Learning Techniques. Cancers 2022, 14, 3914; 2022. doi: 10.3390/cancers14163914 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0295501.ref036] 36.Quinlan R. UCI Machine Learning Repository: Thyroid Disease Data Set; 2022. https://archive.ics.uci.edu/ml/datasets/thyroid+disease.

[pone.0295501.ref037] 37. Saad E, Sadiq S, Jamil R, Rustam F, Mehmood A, Choi GS, et al. Predicting death risk analysis in fully vaccinated people using novel extreme regression-voting classifier. Digital Health. 2022;8:20552076221109530. doi: 10.1177/20552076221109530 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0295501.ref038] 38. Bagui S, Li K. Resampling imbalanced data for network intrusion detection datasets. Journal of Big Data. 2021;8(1):1–41. doi: 10.1186/s40537-020-00390-x33425651 [DOI] [Google Scholar]

[pone.0295501.ref039] 39. Azar AT, Elshazly HI, Hassanien AE, Elkorany AM. A random forest classifier for lymph diseases. Computer methods and programs in biomedicine. 2014;113(2):465–473. doi: 10.1016/j.cmpb.2013.11.004 [DOI] [PubMed] [Google Scholar]

[pone.0295501.ref040] 40. Rupapara V, Rustam F, Aljedaani W, Shahzad HF, Lee E, Ashraf I. Blood cancer prediction using leukemia microarray gene data and hybrid logistic vector trees model. Scientific Reports. 2022;12(1):1–15. doi: 10.1038/s41598-022-04835-6 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0295501.ref041] 41. Siddiqui HUR, Saleem AA, Brown R, Bademci B, Lee E, Rustam F, et al. Non-invasive driver drowsiness detection system. Sensors. 2021;21(14):4833. doi: 10.3390/s21144833 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0295501.ref042] 42. George A, Ravindran A, Mendieta M, Tabkhi H. Mez: An adaptive messaging system for latency-sensitive multi-camera machine vision at the IoT edge. IEEE Access. 2021;9:21457–21473. doi: 10.1109/ACCESS.2021.3055775 [DOI] [Google Scholar]

[pone.0295501.ref043] 43. George A. Distributed messaging system for the IoT edge. The University of North Carolina at Charlotte; 2020. [Google Scholar]

[pone.0295501.ref044] 44. Daghriri T, Rustam F, Aljedaani W, Bashiri AH, Ashraf I. Electroencephalogram Signals for Detecting Confused Students in Online Education Platforms with Probability-Based Features. Electronics. 2022;11(18):2855. doi: 10.3390/electronics11182855 [DOI] [Google Scholar]

[pone.0295501.ref045] 45. Rafique A, Khan MS, Jamal MH, Tasadduq M, Rustam F, Lee E, et al. Integrating learning analytics and collaborative learning for improving student’s academic performance. IEEE Access. 2021;9:167812–167826. doi: 10.1109/ACCESS.2021.3135309 [DOI] [Google Scholar]

[pone.0295501.ref046] 46. Shafique R, Rustam F, Choi GS, Díez IdlT, Mahmood A, Lipari V, et al. Breast Cancer Prediction Using Fine Needle Aspiration Features and Upsampling with Supervised Machine Learning. Cancers. 2023;15(3):681. doi: 10.3390/cancers15030681 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0295501.ref047] 47. Siddiqui HUR, Saleem AA, Bashir I, Zafar K, Rustam F, Diez IdlT, et al. Respiration-Based COPD Detection Using UWB Radar Incorporation with Machine Learning. Electronics. 2022;11(18):2875. doi: 10.3390/electronics11182875 [DOI] [Google Scholar]

[pone.0295501.ref048] 48. Sankar S, Potti A, Chandrika GN, Ramasubbareddy S. Thyroid Disease Prediction Using XGBoost Algorithms. Journal of Mobile Multimedia. 2022;18(3):1–18. [Google Scholar]

[pone.0295501.ref049] 49. Rustam F, Ishaq A, Munir K, Almutairi M, Aslam N, Ashraf I. Incorporating CNN Features for Optimizing Performance of Ensemble Classifier for Cardiovascular Disease Prediction. Diagnostics. 2022;12(6):1474. doi: 10.3390/diagnostics12061474 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0295501.ref050] 50. Rustam F, Ashraf I, Mehmood A, Ullah S, Choi GS. Tweets classification on the base of sentiments for US airline companies. Entropy. 2019;21(11):1078. doi: 10.3390/e21111078 [DOI] [Google Scholar]

[pone.0295501.ref051] 51.Kingdom S. Statistical Power Calculator; 2023. https://www.statskingdom.com/32test_power_t_z.html.

PERMALINK

SSC: The novel self-stack ensemble model for thyroid disease prediction

Shengjun Ji

Roles

Abstract

1 Introduction

2 Background

Table 1. A summary of recent publications for diagnosing thyroid disease using machine learning and deep learning approaches.

3 Methodology

Fig 1. Graphical abstract of this study.

3.1 Data description & balancing

Table 2. Feature description of thyroid disease dataset.

Table 3. Count for target variables in thyroid disease dataset.

Table 4. Down-sampled data for efficient thyroid disease detection.

Fig 2. Features importance scores using original dataset.

Fig 3. Features importance scores using down-sampled dataset.

3.2 Proposed Self Stacking Classifier (SSC)

Fig 4. Architecture of proposed Self Stacking Classifier comprised of RF as a base and meta learner.

3.3 Algorithms tested

Table 5. Hyperparameter settings for machine learning algorithms.

Table 6. Layered architecture of deep neural networks.

3.4 Performance estimation evaluation estimators

4 Results

Fig 5.

4.1 Experimental results of machine learning classifiers

Table 7. Experimental results of machine learning classifiers.

Table 8. Target class-wise results of machine learning classifiers.

4.2 Experimental results of self-voting classifiers

Table 9. Experimental results of ensemble voting classifiers.

Table 10. Target class-wise classification results of self-voting ensemble classifiers.

4.3 Experimental results of deep neural networks

Table 11. Deep learning models results for thyroid disease classification.

Fig 6. Per epochs evaluation score for deep learning models.

Fig 7. Per epochs loss estimation for deep learning models.

4.4 Experimental results of proposed self-stacking classifier

Table 12. Classification results of proposed SSC model.

5 Comparative analysis of experimental results

Fig 8. Confusion matrix of classifiers when supplied with original data.

Fig 9. Confusion matrix of classifiers when supplied with down-sampled data.

Fig 10. Performance analysis of classifiers in terms of accuracy and error rate.

Table 13. 10-fold cross-validation results of classifiers.

Fig 11. Performance of models in terms of mean accuracy and error deviation.

Fig 12. Performance of models in terms of error rate.

Table 14. Performance comparison with previous study.

5.1 Statistical analysis

Table 15. Statistical T-test results.

6 Conclusions

Data Availability

Funding Statement

References

Decision Letter 0

Maciej Huk

Roles

Author response to Decision Letter 0

Decision Letter 1

Maciej Huk

Roles

Author response to Decision Letter 1

Decision Letter 2

Maciej Huk

Roles

Author response to Decision Letter 2

Decision Letter 3

Maciej Huk

Roles

Author response to Decision Letter 3

Decision Letter 4

Maciej Huk

Roles

Author response to Decision Letter 4

Decision Letter 5

Maciej Huk

Roles

Acceptance letter

Maciej Huk

Roles

Associated Data

Supplementary Materials

Data Availability Statement

ACTIONS