Simple Summary
This study investigates lung cancer detection by combining metabolomics and advanced machine learning to identify small cell lung cancer (SCLC) with high accuracy. We analyzed 461 serum samples from publicly available data to create a stacking-based ensemble model that can distinguish between SCLC, non-small cell lung cancer (NSCLC), and healthy controls. The model achieves 85.03% accuracy in multi-class classification and 88.19% accuracy in binary classification (SCLC vs. NSCLC). This approach relies on sophisticated feature selection techniques to identify significant metabolites, particularly those detected in positive ion mode. SHAP analysis identifies key predictors such as benzoic acid, DL-lactate, and L-arginine, shedding new light on cancer metabolism. This non-invasive approach presents a promising alternative to traditional diagnostic methods, with the potential to transform early lung cancer detection. By combining metabolomics and machine learning, the study paves the way for faster, more accurate, and patient-friendly cancer diagnostics, potentially improving treatment outcomes and survival rates.
Keywords: SCLC, NSCLC, serum metabolomics, machine learning, stacking ensemble model
Abstract
Background: Small cell lung cancer (SCLC) is an extremely aggressive form of lung cancer, characterized by rapid progression and poor survival rates. Despite the importance of early diagnosis, current diagnostic techniques are invasive and limited. Methods: This study presents a novel stacking-based ensemble machine learning approach for classifying small cell lung cancer (SCLC) and non-small cell lung cancer (NSCLC) using metabolomics data. The analysis included 191 SCLC cases, 173 NSCLC cases, and 97 healthy controls. Feature selection techniques identified significant metabolites, with positive-ion metabolites proving more relevant. Results: For multi-class classification (control, SCLC, NSCLC), the stacking ensemble achieved 85.03% accuracy and an AUC of 92.47 with a Support Vector Machine (SVM) meta-learner. Binary classification (SCLC vs. NSCLC) further improved performance, with an ExtraTreesClassifier meta-learner reaching 88.19% accuracy and an AUC of 92.65. SHapley Additive exPlanations (SHAP) analysis revealed key metabolites such as benzoic acid, DL-lactate, and L-arginine as significant predictors. Conclusions: The stacking ensemble approach effectively leverages multiple classifiers to enhance overall predictive performance. The proposed model captures the complementary strengths of different classifiers, enhancing the detection of SCLC and NSCLC. This work accentuates the potential of combining metabolomics with advanced machine learning for non-invasive early lung cancer subtype detection, offering an alternative to conventional biopsy methods.
1. Introduction
Lung cancer remains a leading cause of cancer-related mortality worldwide, with approximately 1.8 million deaths and 2.5 million new cases reported in 2022 [1,2]. Lung cancer is the most common cancer in men and the second most common cancer in women. Non-small cell lung cancer (NSCLC) and small cell lung cancer (SCLC) are the two major lung carcinoma subtypes [3]. SCLC grows more rapidly than NSCLC and is considered more aggressive [4]. SCLC accounts for about 10–15% of all lung cancers and is known for its strong tendency to metastasize, frequently resulting in widespread disease at the time of diagnosis [5,6]. Biomarker-guided treatment is currently the best method for diagnosing NSCLC and SCLC; however, it requires a biopsy [7]. According to recent literature, the low detection rates for lung cancer at earlier stages are the major contributing factor to its alarming severity [8]. Diagnosis depends on distinct clinical symptoms and pathogenesis, but at these early stages the symptoms are mostly absent and the pathogenesis is highly complex.
Previous research demonstrated that the metabolism of cancer cells is different from that of normal cells [9]. Metabolites are the byproducts of biological activity, and the change in metabolism, inevitably, causes metabolite concentrations to shift. This changes the unique metabolite profile of the individual [10]. Researchers are exploring metabolites to transform the personalized diagnosis of lung cancer from invasive to non-invasive approaches. Over 150 metabolites have been found to have a connection with lung cancer that reflects its distinct metabolic adaptations [11,12]. Amino acids like serine, glycine, and glutamate are frequently increased, supporting the heightened demands of cancer cell metabolism and proliferation. Alterations in lipid profiles, particularly phosphatidylcholines, sphingomyelins, and fatty acids, have been implicated in lung cancer progression, with lysophosphatidylcholines (LPC) showing notably higher levels in cancerous tissues [13]. Organic acids such as lactic acid and pyruvic acid are elevated due to the Warburg effect, wherein cancer cells favor glycolysis over oxidative phosphorylation, even in oxygen-rich conditions [14]. Increased levels of nucleotides, including uridine and pseudouridine, suggest changes in RNA metabolism associated with lung cancer [15]. Shifts in carnitine levels, especially of long-chain acylcarnitines, point to disruptions in fatty acid oxidation [16]. Additionally, elevated polyamines like spermine and spermidine correlate with increased cell proliferation [17]. Changes in energy metabolites, such as glucose, citrate, and other TCA cycle intermediates, further highlight the altered energy metabolism characteristic of cancer cells [18].
Meaningful statistical interpretation of the measured variables depends on the data points being representative of a broader population. Conventional statistical approaches used in metabolomics concentrate on establishing correlations among dependent and independent variables [19]. Machine learning (ML) techniques rely on ad hoc computing algorithms that are either optimized or trained without the need for explicit statistical assumptions [20].
Metabolomics holds the potential to identify the biomarkers of NSCLC and SCLC development. Researchers have developed a prognostic model to predict lung cancer occurrence and patient survival using metabolomics based on nuclear magnetic resonance (NMR) spectroscopy [21]. In another study, a panel of six metabolites achieved an area under the curve (AUC) value of 0.99 for identifying NSCLC. The six metabolites included are acyl-carnitine C10:1, inosine, L-tryptophan, hypoxanthine, indoleacrylic acid, and LPC(18:2) [22]. Diacetylspermine is known as a unique pre-diagnostic serum biomarker for NSCLC [23]. However, this research only investigated NSCLC detection, and the biomarker of pemetrexed efficacy remains unexplored.
ProGRP and neuron-specific enolase (NSE) are well-recognized biomarkers for SCLC diagnosis; however, they have low sensitivity and specificity [24,25,26,27,28,29,30,31,32]. For the diagnosis of SCLC, ProGRP had a pooled sensitivity of 72% (95% CI: 47–86%) and a pooled specificity of 93% (95% CI: 95–100%) [33]. Additionally, it exhibited higher predictive accuracy, with a specificity of 72% to 99%, compared with NSE [33]. Other biomarkers mentioned in previous studies for SCLC diagnosis are lactate dehydrogenase (LDH) and cytokeratin 19 fragment (CYFRA 21-1), with AUC values of 0.616 and 0.732, respectively [34]. While these are indeed used, they are not specific to SCLC and are more general markers of tumor burden. NSE remains the most widely used serum biomarker for SCLC. In addition, extracellular vesicles (EVs), circulating tumor cells (CTCs), circulating tumor DNA (ctDNA), and exosomes are blood-based tumor components that could serve as alternative ways to evaluate biomarkers for the identification of SCLC [28,35,36,37].
Each tumor marker has its limitations and is generally insufficient as a standalone criterion for histological diagnosis or screening [34]. For example, the analysis of cell-free DNA (cfDNA) has emerged as a promising non-invasive method for tumor detection, as its methylation profiles can be identified at early stages [4]. However, the results may be biased by several factors: the methylation levels of CpG sites influence cfDNA detection, leading to poor sensitivity; cfDNA is inherently unstable and prone to degradation; and low tumor shedding rates result in very low levels of cfDNA [38].
Machine learning has significant potential in identifying biomarkers for SCLC using metabolomics data. However, due to the limited availability of metabolomics datasets in this field, a considerable research gap still exists. Recently, Shang et al. introduced a large-scale metabolomics dataset specifically for SCLC [4], addressing this gap. Our study utilizes this dataset to conduct a comprehensive machine learning experiment, exploring its potential in biomarker discovery.
This study advances lung cancer detection by combining metabolomics with machine learning, presenting a stacking-based ensemble model that effectively distinguishes between SCLC and NSCLC while revealing key metabolic signatures through SHAP analysis.
2. Methods and Materials
2.1. Study Cohort, Sample Collection, and Metabolomics Dataset
We utilized a publicly available LC-MS/MS-based metabolomics dataset as described by Shang et al. [4]. In brief, a total of 501 serum samples were collected from the Shandong Cancer Hospital and Institute between March and November 2020. After excluding samples with incomplete prescription data and suboptimal serum quality, the final dataset included 461 individuals. This cohort consisted of 191 cases of SCLC, 173 cases of NSCLC, and 97 healthy controls. The majority of patients with SCLC were below 65 years of age, with 70.7% of the population being male. Among the NSCLC cases, 91 were diagnosed with lung adenocarcinoma and 82 with squamous cell carcinoma [4]. The comprehensive clinical characteristics are presented in Supplementary Table S1. Histopathological analysis was performed to confirm the cancer status of all participants, and blood samples were collected before the initiation of any antitumor therapies. Shang and co-authors obtained serum samples by centrifuging blood at 3000× g for 10 min, within 6 h of collection [4]. In our study, we utilized both positive and negative ion mode datasets for metabolomics analysis. Positive ion mode generates positively charged ions, typically through protonation of the analyte. This mode is well suited to detecting metabolites that readily form positive ions, such as amino acids, lipids, and small organic molecules like sugars and nucleotides, and is commonly favored for identifying metabolites such as amino acids, neurotransmitters, and lipids. Negative ion mode, in contrast, generates negatively charged ions, typically through deprotonation, and is more effective for detecting molecules that readily form negative ions, such as organic acids, sulfates, and phosphates. Negative ion mode is frequently used for detecting metabolites such as organic acids, fatty acids, and nucleotides. By using both ionization modes, we achieved a more complete representation of the metabolite profiles associated with a wider range of biochemical pathways. The positive mode covers pathways associated with amino acid metabolism, neurotransmitter synthesis, and lipid metabolism, while the negative mode provides insights into pathways involving organic acids, oxidative stress, and phosphate-containing compounds. This dual-mode approach is not only essential to ensure comprehensive metabolic profiling but also allows us to compare our classified metabolites as features in our ensemble model.
2.2. Dataset Preprocessing
Dataset preprocessing is the process of converting unprocessed data into a format that ML models can use efficiently and effectively [39]. Preprocessing is essential for addressing issues such as noise, missing values, and unbalanced data, ensuring data consistency, and improving model performance. This study focused exclusively on the annotated features from the negative and positive ion metabolomics datasets. The metabolomics analysis was conducted according to the methodology outlined by Shang et al. [4], employing LC-MS/MS in both positive and negative ionization modes to thoroughly identify metabolites. The negative ion dataset contained 158 annotated known features, whereas the positive ion dataset included 152 annotated known features, demonstrating extensive metabolome coverage. Only these annotated features were used in subsequent analyses and model training. For the ML models to achieve optimal training results on our dataset, it was imperative to normalize the input data. To accomplish this, we employed the StandardScaler method, which enabled robust training of the ML models and promoted generalized performance [40].
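As a minimal sketch of this normalization step (assuming the annotated features sit in a table with one row per sample and a class-label column; the file and column names below are hypothetical), the scaling could look as follows. In a cross-validated setting the scaler would normally be fit on the training folds only, for example inside a scikit-learn Pipeline.

```python
# Minimal sketch of z-score normalization of the annotated metabolite features.
# File name and column layout are illustrative assumptions, not the study's artifacts.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("annotated_metabolites.csv")   # hypothetical: samples x annotated features + label
X = df.drop(columns=["label"])                  # metabolite intensities
y = df["label"]                                 # Control / SCLC / NSCLC

scaler = StandardScaler()                       # zero mean, unit variance per feature
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index=X.index)
```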
2.3. Evaluation Metrics
This study considered a variety of performance evaluation metrics, as relying solely on accuracy is inadequate [41,42]. The performance of the classifiers was assessed using Area Under the Receiver Operating Characteristic Curve (AUC-ROC), in addition to metrics such as Precision, Sensitivity, Specificity, Accuracy, and F1 Score. We employed weighted metrics for each class along with overall precision to account for the differences in the number of instances across classes. Additionally, the AUC score was used as a performance metric. The equations for the evaluation metrics are provided in Supplementary Section S1.
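For illustration, the sketch below shows one way these weighted metrics can be computed with scikit-learn, including a support-weighted specificity derived from the confusion matrix; the class names and implementation details are assumptions rather than the study's own code.

```python
# Weighted evaluation metrics for the multi-class task (sketch).
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

def evaluate(y_true, y_pred, y_proba, labels=("Control", "NSCLC", "SCLC")):
    """y_proba columns must follow sorted class-label order."""
    results = {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="weighted"),
        "recall":    recall_score(y_true, y_pred, average="weighted"),   # sensitivity
        "f1":        f1_score(y_true, y_pred, average="weighted"),
        "auc":       roc_auc_score(y_true, y_proba, multi_class="ovr", average="weighted"),
    }
    # Specificity per class = TN / (TN + FP), averaged using class support as weights
    cm = confusion_matrix(y_true, y_pred, labels=list(labels))
    specificities, support = [], cm.sum(axis=1)
    for i in range(len(labels)):
        tn = cm.sum() - cm[i, :].sum() - cm[:, i].sum() + cm[i, i]
        fp = cm[:, i].sum() - cm[i, i]
        specificities.append(tn / (tn + fp))
    results["specificity"] = float(np.average(specificities, weights=support))
    return results
```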
2.4. Development of Machine Learning and Stacking Ensemble Models
We initially trained more than 10 machine learning models, including tree-based models (CatBoost, RandomForest, ExtraTrees, GradientBoosting, XGBClassifier), ensemble methods (LGBM), linear models (LogisticRegression, ElasticNet), and neural networks (MLPClassifier). This diverse selection was curated to examine a variety of algorithmic approaches, each offering unique strengths for managing complex, high-dimensional data, such as our metabolomics dataset. We focused on tree-based models due to their ability to handle non-linear relationships and interactions between features, which is particularly relevant in metabolomics data where complex biochemical pathways are involved. Ensemble methods were included for their capacity to reduce overfitting and improve generalization. Linear models were selected for their interpretability and efficiency with large feature sets, while neural networks were included for their ability to uncover intricate patterns within the data. After conducting initial evaluations, we proposed a stacking-based ensemble model design. This approach enables us to capture diverse aspects of the metabolomic data that single models may overlook. The final selection of models for the stacking ensemble was guided by their performance metrics and their capacity to provide unique insights to the ensemble. This thoughtful process of model selection and ensemble construction aims to maximize our predictive power while ensuring interpretability, which is essential in a clinical context. The stacking-based approach involves two levels of learners: base learners and meta-learners. The top three performing machine learning classifiers were selected as base learners in the stacking model, while a classifier was employed as the meta-learner in the second stage. This structure ultimately produced the final prediction, aiming to enhance the overall performance of the model.
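As a rough sketch of this screening stage (reusing the scaled feature matrix and labels from the preprocessing sketch; hyperparameters are left at their defaults, and CatBoost, XGBoost, and LightGBM would be added from their own packages), the candidate models can be compared under five-fold cross-validation:

```python
# Sketch of the initial model-screening stage with 5-fold cross-validation.
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

candidates = {
    "RandomForest":       RandomForestClassifier(random_state=0),
    "ExtraTrees":         ExtraTreesClassifier(random_state=0),
    "GradientBoosting":   GradientBoostingClassifier(random_state=0),
    "LogisticRegression": LogisticRegression(max_iter=5000),
    "MLPClassifier":      MLPClassifier(max_iter=2000, random_state=0),
    "SVM":                SVC(probability=True, random_state=0),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, model in candidates.items():
    acc = cross_val_score(model, X_scaled, y, cv=cv, scoring="accuracy")
    print(f"{name}: {acc.mean():.4f} +/- {acc.std():.4f}")
```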
The proposed stacking ensemble architecture is illustrated in Figure 1.
Figure 1.
Proposed Stacking Ensemble Model.
The essence of this approach involves training a meta-model that adeptly learns to combine the predictions from the base models, thereby enhancing the overall predictive accuracy. Given an input x and the predictions from a set of base-level classifiers denoted as M, a probability distribution is generated:
$P^{M}(c_1 \mid x),\; P^{M}(c_2 \mid x),\; \ldots,\; P^{M}(c_m \mid x)$  (1)
where $(c_1, c_2, \ldots, c_m)$ represents the possible class values, and $P^{M}(c_i \mid x)$ signifies the probability that the instance $x$ belongs to the class $c_i$, as predicted by classifier $M$ [43,44]. In our study, we conducted a multi-class classification analysis for distinguishing between Control (healthy), SCLC, and NSCLC. The CatBoost, RandomForestClassifier, and ExtraTrees models emerged as the top three performers. We utilized their predicted probabilities to train stacking-based meta-models, employing five-fold cross-validation in both the initial and stacking phases. For the binary-class classification (SCLC vs. NSCLC), the MLPClassifier, SVM, and ElasticNet models were identified as the best performers, and their predicted probabilities were likewise used to train the stacking-based meta-models, with five-fold cross-validation applied in both phases. An overview of the algorithm is presented in Supplementary Section S2.
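A minimal sketch of this two-level design for the multi-class task is given below, using scikit-learn's StackingClassifier so that the three best base learners pass their class probabilities to an SVM meta-learner, with five-fold cross-validation at both levels; hyperparameters are defaults here, not the study's exact settings, and CatBoost comes from its own package.

```python
# Sketch of the stacking ensemble: three base learners feed predicted class
# probabilities to a meta-learner, with 5-fold CV at both levels.
from catboost import CatBoostClassifier
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              StackingClassifier)
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold, cross_val_predict

base_learners = [
    ("catboost",      CatBoostClassifier(verbose=0, random_state=0)),
    ("random_forest", RandomForestClassifier(random_state=0)),
    ("extra_trees",   ExtraTreesClassifier(random_state=0)),
]

stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=SVC(probability=True, random_state=0),         # meta-learner
    stack_method="predict_proba",                                   # pass class probabilities upward
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),   # inner 5-fold CV for base learners
)

# Outer 5-fold CV to obtain out-of-fold predictions for the whole stack
y_pred = cross_val_predict(stack, X_scaled, y, cv=5)
```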
An overview of the methodology employed in this study is illustrated in Figure 2. This analysis utilized a publicly accessible LC-MS/MS metabolomics dataset that encompassed both positive and negative ion features. These ion features are amalgamated to generate a comprehensive feature set. To extract exploratory insights, unsupervised techniques such as hierarchical clustering and t-SNE were applied, along with PCA. Afterward, the dataset is preprocessed and subjected to feature evaluation to guarantee the quality and relevance of the data. The robustness of the model is improved by employing a five-fold cross-validation strategy. In the subsequent phase, the top three base models are selected after the parallel training of multiple base models. Utilizing the prediction probabilities of these best models, a stacking meta-classifier is trained to consolidate their respective strengths using five-fold cross-validation. For both multi-class and binary classification tasks, this stacking-based ensemble model generates the final predictions. Ultimately, SHAP analysis is conducted to evaluate the model’s overall impact, analyze the significance of each metabolomics feature, and interpret the model’s predictions, thereby providing information that is pertinent to SCLC.
Figure 2.
Overview of the methodology employed in this study.
The study was conducted on a local machine using an Intel(R) Core(TM) i9-10980HK CPU running at 2.40 GHz (up to 3.10 GHz), with 32 GB of RAM and an 8 GB GPU. The Scikit-Learn package, alongside Python 3.10, was used for model training. All models were trained following the specified parameters on this hardware setup to ensure efficient performance. For the MLPClassifier, scikit-learn applies a softmax output activation for multiclass classification problems and a logistic (sigmoid) output activation for binary classification tasks.
A more detailed overview of this study is illustrated in Supplementary Figure S1. The t-SNE visualizations for negative and positive ion data are provided in Supplementary Figures S2 and S3, and the PCA component plots for negative and positive ion data are shown in Supplementary Figures S4 and S5, respectively [45]. Additionally, a hierarchical clustering [46,47] heatmap, presented in Supplementary Figure S6, utilizes metabolomics data to identify key metabolites distinguishing SCLC from NSCLC, underscoring significant metabolic pathways. The parameters of the MLPClassifier for binary classification are shown in Supplementary Table S2, and for multiclass classification, the parameters of the MLPClassifier are shown in Supplementary Table S3.
3. Results
3.1. Feature Ranking
Feature ranking is essential in machine learning, as it enhances model performance by recognizing and selecting the most pertinent features, thereby eliminating irrelevant or redundant data and improving predictive accuracy. Prevalent methodologies include embedded approaches such as XGBoost [48], Random Forest [49], and Extra Trees [50], wherein feature relevance is determined during training by evaluating the decrease in loss or impurity. These strategies facilitate the identification of the most meaningful features, enhancing model accuracy and interpretability. This investigation utilizes three advanced machine learning feature selection models: XGBoost, Random Forest, and Extra Trees. In the preliminary analysis, XGBoost demonstrated superior performance for multi-class classification, while Extra Trees performed best for two-class classification.
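The sketch below illustrates this embedded-importance ranking with XGBoost for the multi-class task (the Extra Trees estimator would be swapped in for the binary task); the retained feature count follows the results reported next, and the remaining settings are assumptions.

```python
# Sketch of embedded feature ranking with XGBoost on the combined
# positive- and negative-ion features.
import pandas as pd
from xgboost import XGBClassifier
from sklearn.preprocessing import LabelEncoder

y_encoded = LabelEncoder().fit_transform(y)     # XGBoost expects integer class codes

ranker = XGBClassifier(random_state=0)
ranker.fit(X_scaled, y_encoded)

importances = pd.Series(ranker.feature_importances_, index=X_scaled.columns)
top_features = importances.sort_values(ascending=False).head(47).index
X_top = X_scaled[top_features]                  # reduced feature matrix for model training
```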
For the multi-class classification, optimal results were attained utilizing the top 47 features identified by XGBoost based on feature importance. As illustrated in Figure 3, the most meaningful features comprise 1,2-Benzenedicarboxylic acid, Phe-Ser, and Phe-Phe, among others. These features significantly enhanced the model’s prediction performance. The importance of each feature is presented, with 1,2-Benzenedicarboxylic acid exhibiting the highest importance, followed by other essential metabolites. From Figure 3, we find that positive ion metabolites are more meaningful than negative ion metabolites. Among the top 10 ranked features, seven are derived from the positive ion mode, indicating their higher contribution to the model’s performance in distinguishing between the classes. This highlights the greater importance of positive ion metabolites in the classification task.
Figure 3.
Features ranked using the XGBoost feature selection algorithm for multi-class classification.
Figure 4 displays the top 48 features ranked by their importance in distinguishing between the binary classes: SCLC and NSCLC. The Extra Trees model was utilized to determine feature importance, with each bar representing the relative contribution of a feature to the model’s predictive performance. Key metabolites such as Phthalic acid Mono-2-ethylhexyl Ester, Dioctyl phthalate, and 3-Phosphoserine are among the most influential, with Phthalic acid Mono-2-ethylhexyl Ester showing the highest relative importance. Other meaningful features include Leu-Phe, p-Cresol, and 1,2-Benzenedicarboxylic acid, which also contribute substantially to the classification task.
Figure 4.
Features ranked using the Extra Trees feature selection algorithm for binary classification.
3.2. Multi-Class Classification
Table 1 presents the detailed results for individual models and stacking models using five-fold cross-validation. The models were trained on a reduced set of the top 47 features, which were selected using the XGBoost feature selection method. The feature selection process was crucial for identifying the most informative features contributing to the ability of the model to distinguish between the three classes. Among the individual models, CatBoost exhibited the highest performance across multiple metrics, achieving an accuracy of 84.81%, precision of 84.86%, recall of 84.81%, specificity of 90.08%, and an AUC of 94.82. In the stacking stage, the strongest meta-learners were SVM (accuracy: 85.03%, precision: 85.05%, recall: 85.03%, specificity: 90.25%, AUC: 92.47) and MLPClassifier (accuracy: 84.82%, precision: 84.87%, recall: 84.81%, specificity: 90.14%, AUC: 94.17). These results reflect the models’ robust classification capabilities, with CatBoost and the stacked SVM demonstrating consistently high performance across all key evaluation metrics. Conversely, models like ExtraTrees (accuracy: 80.69%, AUC: 93.45), GradientBoosting (accuracy: 80.47%, AUC: 92.01), and XGB (accuracy: 79.82%, AUC: 92.65) showed somewhat lower accuracy and AUC scores in the initial stage but still offered valuable insight into feature importance and model behavior.
Table 1.
Top-Performing ML Models and Stacked Ensemble ML Models for Multi-Class Using Five-Fold Cross-Validation.
Initial Results | Stacking Ensemble Results
---|---|---|---|---|---|---|---|---|---|---|---|---|---
Models | A | P | R | S | F1 | AUC | Models | A | P | R | S | F1 | AUC
A = Accuracy; P = Precision; R = Recall (Sensitivity); S = Specificity; F1 = F1 Score; AUC = Area Under the ROC Curve. All values are on a 0–100 scale. The Models column under Stacking Ensemble Results lists the meta-learner used.
CatBoost | 84.81 | 84.86 | 84.81 | 90.08 | 84.82 | 94.82 | SVM | 85.03 | 85.05 | 85.03 | 90.25 | 85.04 | 92.47 |
RandomForest | 84.59 | 84.55 | 84.59 | 90.04 | 84.57 | 94.64 | MLPClassifier | 84.82 | 84.87 | 84.81 | 90.14 | 84.82 | 94.17 |
ExtraTrees | 80.69 | 80.71 | 80.69 | 87.44 | 80.64 | 93.45 | LDA | 84.82 | 84.84 | 84.81 | 90.11 | 84.82 | 94.06 |
GradientBoosting | 80.47 | 80.76 | 80.47 | 87.23 | 80.54 | 92.01 | LogisticRegression | 84.6 | 84.6 | 84.6 | 89.94 | 84.6 | 94.29 |
XGB | 79.82 | 79.57 | 79.82 | 87.67 | 79.66 | 92.65 | ExtraTrees | 84.6 | 84.6 | 84.6 | 89.94 | 84.6 | 94.25 |
LGBM | 79.17 | 79.34 | 79.17 | 86.4 | 79.24 | 92.37 | ElasticNet | 84.6 | 84.6 | 84.6 | 89.94 | 84.6 | 94.29 |
LogisticRegression | 78.09 | 78.22 | 78.09 | 85.68 | 78.1 | 91.58 | XGBClassifier | 84.16 | 84.17 | 84.17 | 89.65 | 84.17 | 93.43 |
SVM | 77 | 77.35 | 77 | 84.97 | 77.08 | 89.26 | LGBM | 83.51 | 83.51 | 83.51 | 89.17 | 83.5 | 93.8 |
MLPClassifier | 78.74 | 78.83 | 78.74 | 86.20 | 78.76 | 91.37 | RandomForest | 83.08 | 83.11 | 83.08 | 88.98 | 83.08 | 93.4
ElasticNet | 76.13 | 76.06 | 76.13 | 84.67 | 76.06 | 91.18 | CatBoost | 81.34 | 81.37 | 81.34 | 87.84 | 81.35 | 93.67 |
Bold values indicate the best performance across all models.
We employed a stacked ensemble approach to further enhance the classification performance, utilizing the prediction probabilities from the top three models. This approach was implemented using five-fold cross-validation to ensure robust evaluation and reliable results. The stacking ensemble method noticeably improved performance compared to the individual models. For instance, SVM as the meta-learner in the stacking ensemble yielded an accuracy of 85.03%, precision of 85.05%, recall of 85.03%, specificity of 90.25%, F1 score of 85.04%, and an AUC of 92.47, a substantial improvement over its performance in the initial training (77.00% accuracy). Similarly, the MLPClassifier in the ensemble achieved accuracy, precision, and recall all surpassing 84.8%, and an AUC of 94.17. These improvements underscore the effectiveness of the ensemble strategy in capturing complementary strengths from each model.
Figure 5 shows that accuracy stabilizes around 80% as the number of top features increases from 5 to 50, with the highest accuracy achieved using the top 47 features, indicating that this subset may offer the optimal balance for model performance. The confusion matrix is included in Supplementary Figure S7. The confusion matrix shows the performance of a stacking-based ensemble model, with SVM as one of the components, in predicting three classes: Control, SCLC, and NSCLC. The model correctly classified 97 Control cases (top-left), 155 SCLC cases (middle-center), and 140 NSCLC cases (bottom-right). However, it misclassified 36 SCLC cases as NSCLC and 33 NSCLC cases as SCLC, while no Control cases were misclassified as either SCLC or NSCLC. Overall, the matrix demonstrates strong classification performance, particularly in distinguishing Control and SCLC, with some confusion between SCLC and NSCLC.
Figure 5.
Accuracy of top features for multiclass classification.
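A sketch of the feature-count sweep behind Figure 5, reusing the XGBoost ranking and stacked model from the earlier sketches (and therefore only indicative of the procedure, not the exact implementation), could look as follows:

```python
# Sweep the number of top-ranked features and record cross-validated accuracy.
from sklearn.model_selection import cross_val_score

ranked = importances.sort_values(ascending=False).index   # XGBoost ranking from the earlier sketch
for k in range(5, 51):                                     # 5 ... 50 top-ranked features (slow but exhaustive)
    acc = cross_val_score(stack, X_scaled[ranked[:k]], y, cv=5, scoring="accuracy").mean()
    print(f"top {k:2d} features: accuracy = {acc:.4f}")
```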
3.3. Binary Classification
In our study of multi-class classification, we observed that the stacking-based SVM model performed effectively for the control group but struggled to separate SCLC and NSCLC. Consequently, we transitioned to a binary classification approach, focusing specifically on the SCLC and NSCLC groups. Within this framework, we utilized the Extra Trees feature selection method, which demonstrated superior performance in identifying the most relevant features. Our analysis revealed that the model achieved optimal results using the top 48 features, as illustrated in Figure 4. Table 2 presents the performance metric results for binary classification.
Table 2.
Top-Performing ML Models and Stacked Ensemble ML Models for Binary Class Using Five-Fold Cross-Validation.
Initial Results | Stacking Ensemble Results
---|---|---|---|---|---|---|---|---|---|---|---|---|---
Models | A | P | R | S | F1 | AUC | Models | A | P | R | S | F1 | AUC
A = Accuracy; P = Precision; R = Recall (Sensitivity); S = Specificity; F1 = F1 Score; AUC = Area Under the ROC Curve. All values are on a 0–100 scale. The Models column under Stacking Ensemble Results lists the meta-learner used.
MLPClassifier | 85.43 | 85.43 | 85.43 | 85.43 | 85.43 | 92.04 | ExtraTreesClassifier | 88.19 | 88.27 | 88.19 | 88.32 | 88.19 | 92.65 |
SVM | 84.61 | 84.67 | 84.61 | 84.61 | 84.62 | 93.06 | XGBClassifier | 87.36 | 87.4 | 87.36 | 87.41 | 87.37 | 92.22 |
ElasticNet | 84.61 | 84.62 | 84.61 | 84.61 | 84.61 | 93.19 | LogisticRegression | 86.81 | 86.82 | 86.81 | 86.81 | 86.82 | 92.65 |
RandomForest | 83.51 | 83.61 | 83.51 | 83.51 | 83.52 | 89.71 | CatBoost | 86.81 | 86.88 | 86.81 | 86.91 | 86.82 | 92.76 |
LogisticRegression | 83.51 | 83.52 | 83.51 | 83.51 | 83.52 | 92.49 | ElasticNet | 86.81 | 86.82 | 86.81 | 86.81 | 86.82 | 92.72 |
LinearDiscriminantAnalysis | 83.24 | 83.36 | 83.24 | 83.24 | 83.25 | 91.47 | LinearDiscriminantAnalysis | 86.54 | 86.58 | 86.53 | 86.61 | 86.54 | 92.3 |
CatBoost | 82.69 | 82.86 | 82.69 | 82.69 | 82.7 | 90.05 | SVM | 86.54 | 86.56 | 86.54 | 86.56 | 86.54 | 92.27 |
ExtraTrees | 81.86 | 81.86 | 81.86 | 81.86 | 81.85 | 90.14 | LGBM | 86.54 | 86.63 | 86.54 | 86.17 | 86.5 | 92 |
AdaBoostClassifier | 79.94 | 80.02 | 79.94 | 79.94 | 79.95 | 85.92 | MLPClassifier | 85.99 | 86.16 | 85.99 | 86.22 | 85.99 | 92.25 |
LGBM | 76.64 | 76.86 | 76.64 | 76.64 | 76.65 | 82.72 | RandomForest | 85.99 | 86.03 | 85.99 | 86.06 | 85.99 | 91.62 |
Bold values indicate the best performance across all models.
Table 2 highlights the performance metrics for individual ML models in the context of binary classification between SCLC and NSCLC, as well as the performance of stacking models, all evaluated using five-fold cross-validation. These results demonstrate that binary classification yielded improved outcomes compared to the multi-class evaluation. Among the individual models, MLPClassifier, SVM, and ElasticNet performed notably well, with their respective accuracy, precision, recall, specificity, F1 scores, and AUC values listed. Leveraging these three models, we implemented a stacking ensemble approach, training a meta-model on their combined probabilities. Five-fold cross-validation was also used to ensure robust training and evaluation. This stacking model resulted in further performance improvements, with ExtraTreesClassifier emerging as the top-performing meta-learner, surpassing all other models in terms of classification accuracy and overall evaluation metrics. The binary classification between SCLC and NSCLC showed substantial improvement, as evidenced by higher scores across various metrics, particularly in AUC. This indicates that focusing specifically on these two lung cancer subtypes allowed for more refined and accurate predictions.
Figure 6 illustrates the accuracies achieved for the top features, selected by the ExtraTrees feature selection model. The graph highlights the performance improvements observed with increasing feature numbers. The analysis suggests that feature selection plays a critical role in improving classification results, and the ExtraTrees model has been particularly effective in identifying these key features for optimal accuracy.
Figure 6.
Top Feature Accuracy for Binary Classifications.
The confusion matrix for the stacking-based ExtraTrees classifier is presented in Supplementary Figure S8. The confusion matrices illustrate a clear improvement in classification performance from the multiclass to the binary scenario. In the earlier multiclass model, which classified samples into Control, SCLC, and NSCLC groups, the model performed well for the Control group but exhibited significant confusion between the disease groups, misclassifying 36 SCLC samples as NSCLC and 33 NSCLC samples as SCLC. In the binary scenario (SCLC vs. NSCLC), the accuracy of the model improved, with only 25 SCLC samples misclassified as NSCLC and 21 NSCLC samples misclassified as SCLC. By removing the Control group, the model was able to focus on distinguishing between the two lung cancer types, resulting in fewer misclassifications and better overall performance.
3.4. AUC-ROC Analysis
The AUC-ROC is an essential metric in machine learning for assessing the efficacy of binary classification algorithms [51]. It measures a model’s capacity to differentiate between positive and negative classes at different threshold levels. An AUC of 1.0 denotes a perfect classifier, whereas an AUC of 0.5 reflects performance akin to random chance.
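The fold-wise computation behind such curves can be sketched as below for the binary SCLC vs. NSCLC task; the binary feature matrix, label vector, and stacked model follow the earlier sketches and are assumptions rather than the study's exact code.

```python
# Per-fold ROC/AUC for the binary task (sketch).
# X_bin: matrix of the top Extra Trees-selected features for SCLC/NSCLC samples (assumed)
# y_bin: corresponding SCLC/NSCLC labels (assumed); stack: stacked classifier as defined earlier
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_curve, auc

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_aucs = []
for train_idx, test_idx in cv.split(X_bin, y_bin):
    model = clone(stack)                                   # fresh copy per fold
    model.fit(X_bin.iloc[train_idx], y_bin.iloc[train_idx])
    proba = model.predict_proba(X_bin.iloc[test_idx])[:, 1]
    fpr, tpr, _ = roc_curve(y_bin.iloc[test_idx], proba, pos_label=model.classes_[1])
    fold_aucs.append(auc(fpr, tpr))

print("fold AUCs:", np.round(fold_aucs, 2), "mean:", round(float(np.mean(fold_aucs)), 2))
```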
Figure 7 illustrates the AUC-ROC curve for the stacking-based SVM classifier in the multi-class classification task. The SVM classifier, recognized as the most effective model in Table 1, underwent evaluation using five-fold cross-validation. The three classes included are Control, SCLC, and NSCLC. The ROC curve effectively illustrates the classifier’s capacity to differentiate among the three classes. The weighted average AUC value of 0.92 signifies robust model performance, demonstrating that the stacking-based SVM classifier offers superior discrimination among the three classes. Similarly, the AUC-ROC curve for the stacking-based ExtraTrees classifier is depicted in Figure 8, demonstrating the best binary classification performance, as shown in Table 2. SCLC and NSCLC were distinguished in the binary classification task. The AUC-ROC curve illustrates the trade-off between the True Positive Rate (TPR) and the False Positive Rate (FPR) at differing thresholds for each of the five-fold cross-validation splits. The weighted average AUC of 0.92 is derived from the fold-specific AUC values of 0.84, 0.98, 0.93, 0.94, and 0.92. This consistently high AUC value confirms the ExtraTrees classifier’s robustness and reliability for binary classification between SCLC and NSCLC, indicative of outstanding model discrimination.
Figure 7.
AUC-ROC curve for the stacking-based SVM classifier in multi-class classification.
Figure 8.
AUC-ROC curve for the stacking-based ExtraTrees classifier in binary classification.
3.5. SHAP Analysis
SHAP is a versatile tool used to interpret machine learning models by assigning SHAP values to each feature, which helps measure its contribution to the prediction of the model [52]. It provides insights at both the global and local levels, revealing the overall importance of features and their specific influence on individual predictions based on Shapley values from game theory. Positive SHAP values indicate that a feature pushes the prediction higher, while negative values suggest a lowering effect. The ability of the SHAP analysis to work with any machine learning model and its equitable allocation of feature contributions make it a powerful tool for understanding model behavior. We utilized SHAP analysis for both multi-class and binary classification tasks. Figure 9 presents the SHAP summary plot for the multi-class classification model. We employed SHAP analysis to interpret the stacking-based SVM model for a multi-class classification task, including Control, SCLC, and NSCLC. The SHAP summary plot visualizes how different features influence the predictions of the model across these classes. Each feature, such as “3-Phosphoserine” and “Cholesteryl sulfate”, impacts the model output, with SHAP values indicating whether a feature pushes the prediction towards a specific class. The color gradient from blue to red represents feature values, with blue indicating low values and red indicating high values. This color scale further illustrates how varying feature values influence classification decisions for each instance, providing insights into feature importance for distinguishing between Control, SCLC, and NSCLC classes. Figure 10 represents the SHAP summary plot for the binary classification between SCLC and NSCLC. It demonstrates the impact of various metabolic features on the prediction of the model. Each feature listed on the y-axis (e.g., Benzoic acid, L-Arginine, DL-lactate) is associated with SHAP values on the x-axis, indicating how much that feature contributes to pushing the output of the model towards one class or the other. This allows for a quick interpretation of whether a high or low value of a particular feature has a positive or negative effect on the prediction. Features such as Benzoic acid, 1-Oleoyl-sn-glycero-3-phosphocholine, and DL-lactate exhibit considerable spread in their SHAP values, indicating their important role in differentiating between SCLC and NSCLC. The distribution of points for each feature reflects both the range of SHAP values and the variability in feature importance across different samples. From the SHAP summary plot, we can identify the top three features with the most substantial impact on the classification of SCLC: Benzoic acid, DL-lactate, and L-Arginine. These features demonstrate a broad distribution of SHAP values, highlighting their strong influence on the predictions of the model. In particular, high values of Benzoic acid (depicted in red) tend to drive the output of the model positively toward the SCLC class, while lower values of this feature (depicted in blue) have a negative impact. Conversely, features such as Taurine and 1-Oleoyl-sn-glycero-3-phosphocholine have a more pronounced effect on the classification of NSCLC. In the SHAP summary plot, these features show higher SHAP values on the positive side of the axis, suggesting that lower concentrations of L-Lysine and 1-Myristoyl-sn-glycero-3-phosphocholine are more influential in guiding the model toward an NSCLC prediction.
Figure 9.
SHAP summary plot for the multi-class classification model.
Figure 10.
SHAP summary plot for the binary classification model.
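A sketch of how such summary (beeswarm) plots can be generated with the shap package is shown below, using a model-agnostic KernelExplainer on the fitted stacked model from the earlier sketches; the background-sample size, evaluated subset, and class index are illustrative, and newer shap releases may return a single three-dimensional array rather than one array per class.

```python
# SHAP summary plot for the stacked model (sketch); assumes `stack` is already fitted on X_top.
import shap

background = shap.sample(X_top, 100, random_state=0)                 # small background set for tractability
explainer = shap.KernelExplainer(stack.predict_proba, background)    # model-agnostic explainer

# SHAP values on a subset of samples; one array per output class in older shap versions
shap_values = explainer.shap_values(X_top.iloc[:200])

# Beeswarm summary plot for one class (index assumed to correspond to SCLC)
shap.summary_plot(shap_values[1], X_top.iloc[:200], feature_names=list(X_top.columns))
```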
SHAP provides multiple techniques for creating local explanations, which are specific to individual samples, including visualizations such as waterfall plots and force plots. Figure 11A presents a force plot, whereas Figure 11B depicts a waterfall plot. In the waterfall plot (Figure 11B), the x-axis represents the probability of a sample being classified as SCLC, while the y-axis displays the metabolomic features and their corresponding values for that sample. The plot begins with the expected value of the model on the x-axis, noted as E[f(X)] = 0.8. This ‘base’ value of 0.8 indicates the average prediction probability across the test set. Subsequently, the plot illustrates how metabolomic features impact the output of the model. Positive contributions (depicted in red) and negative contributions (depicted in blue) modify the expected value to reach the final model output of f(x) = 0.017. Positive SHAP values enhance the probability of the sample being classified as SCLC, while negative SHAP values reduce this likelihood. Additionally, the force plot employs an additive force layout to display the SHAP values of each feature for a specific sample.
Figure 11.
Local explanations of a representative sample are shown in two forms: (A) a force plot illustrating an SCLC prediction and (B) a waterfall plot displaying the same prediction.
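The local explanations in Figure 11 can be sketched in the same way, reusing the explainer and SHAP values from the previous sketch; the sample index and class column below are illustrative assumptions.

```python
# Local (per-sample) SHAP explanations: waterfall and force plots (sketch).
import shap

i = 0            # sample to explain (illustrative)
cls = 1          # predict_proba column assumed to correspond to SCLC

# Waterfall plot: wrap one sample's SHAP values in an Explanation object
expl = shap.Explanation(
    values=shap_values[cls][i],
    base_values=explainer.expected_value[cls],
    data=X_top.iloc[i].values,
    feature_names=list(X_top.columns),
)
shap.plots.waterfall(expl)

# Force plot (additive force layout) for the same sample
shap.force_plot(explainer.expected_value[cls], shap_values[cls][i],
                X_top.iloc[i], matplotlib=True)
```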
4. Discussion
SCLC is a fast-growing cancer linked to smoking that often spreads before clinical diagnosis. This cancer has high relapse rates despite chemotherapy response, lowering long-term survival [53]. However, NSCLC progresses slowly and responds better to early intervention [54]. SCLC is aggressive because of its neuroendocrine markers like synaptophysin, chromogranin A, and CD56, which promote cell growth and spread [55]. Our study illustrates the potential of combining metabolomics data with sophisticated machine learning techniques to improve the classification of SCLC and NSCLC. The research employed a publicly accessible LC-MS/MS-based metabolomics dataset that included 191 SCLC cases, 173 NSCLC cases, and 97 healthy controls.
A stacking-based ensemble machine learning approach was employed, which effectively leveraged the strengths of individual models, resulting in improved accuracy, precision, and overall classification performance. For multi-class classification, the SVM within the ensemble achieved an accuracy of 85.03% and an AUC of 92.47. To further enhance performance and reduce ambiguity between SCLC and NSCLC, a binary classification was subsequently conducted, focusing specifically on distinguishing between these two classes. This refinement simplified the classification task, strengthened the model’s robustness, and enabled a more targeted differentiation between SCLC and NSCLC in later stages. The ExtraTreesClassifier within the ensemble attained an accuracy of 88.19% and an AUC of 92.65. The study identified key metabolites such as benzoic acid and DL-lactate as crucial in distinguishing between SCLC and NSCLC. This approach shows promise for developing non-invasive diagnostic tools that could lead to earlier detection and improved outcomes for lung cancer patients, addressing the limitations of conventional biopsy methods.
The analysis revealed that positive ion metabolites contributed significantly more to the model than negative ion metabolites, particularly in multi-class classification. This observation could be attributed to a variety of biological factors. Modifications to amino acids and their derivatives are critical targets for altered cancer metabolism [13]. Furthermore, the synthesis of lipids such as phosphatidylcholines and sphingomyelins is frequently upregulated to support rapid cell division and membrane production [56]. This mode may also preferentially ionize metabolites involved in energy production, such as carnitine and acylcarnitine, which are frequently altered in cancer cells due to shifts in energy metabolism [16]. These findings may guide future metabolomic studies in lung cancer to prioritize positive ion mode analyses, thereby improving the sensitivity and specificity of metabolomic biomarkers for SCLC.
Furthermore, we employed hierarchical clustering to identify co-regulated metabolites that may share biological pathways, aiding in identifying potential biomarkers and understanding disease mechanisms. SHAP analysis provided deep insights into the prediction of the model, identifying key metabolites like benzoic acid, DL-lactate, and L-arginine as significant discriminators between SCLC and NSCLC, which could guide the development of targeted therapies or diagnostic tools. These findings align with and expand upon established hallmarks of cancer metabolism. The identification of DL-lactate as a significant metabolite supports the Warburg effect, a well-known feature of cancer metabolism that is characterized by increased glycolysis and lactate production, even in the presence of oxygen [13,14]. Additionally, the identification of L-arginine as a significant metabolite in our study aligns with its known role in cancer progression. L-arginine contributes to tumor growth through two main pathways: polyamine production and nitric oxide synthesis [17]. The identification of 12-Benzenedicarboxylic acid as a significant feature in our study points to potential alterations in lipid metabolism. Recent research has shown that cancer cells often exhibit reprogrammed lipid metabolism to support rapid proliferation [56,57]. Our findings suggest that this reprogramming may differ between SCLC and NSCLC. The presence of benzoic acid in our significant metabolites list may indicate alterations in cellular redox balance [58]. Cancer cells often have heightened oxidative stress and altered antioxidant mechanisms [58,59]. The differential levels of benzoic acid between SCLC and NSCLC could reflect varying strategies for managing oxidative stress in these cancer types. Consequently, the discoveries concerning these metabolites have the potential to function as biomarkers and provide a deeper understanding of the metabolic reprogramming that is responsible for various lung cancer subtypes.
Although the data used for model development were derived from a publicly available case–control study with approximately balanced classes, it is important to consider the potential impact of class imbalance when applying the model to populations with naturally occurring class distributions. In real-world scenarios, where the prevalence of certain classes may vary significantly, techniques such as the Synthetic Minority Over-sampling Technique (SMOTE) [60] can be employed to address this challenge and improve model performance. The dataset in our study comprises tabular 1D metabolomics data, in which each row corresponds to a sample and the columns represent specific metabolite features. The classical machine learning models [61] were chosen due to the structured and tabular nature of the data, as they are well-suited for feature-driven datasets. Particularly when employing limited sample sizes, these models provide interpretability, effectiveness, and computational efficiency. Although deep learning models are typically more effective with high-dimensional data and larger sample sizes, recent developments, such as Transformer-based deep learning models [62,63], have shown potential applications for tabular data. Our objective in future research is to investigate these sophisticated deep learning methodologies in order to improve the predictive and analytical capabilities of metabolomics data.
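As an illustration, SMOTE from the imbalanced-learn package can be embedded in a cross-validated pipeline so that synthetic minority samples are generated only within the training folds; the classifier, scoring choice, and data names here are placeholders rather than the study's configuration.

```python
# Sketch of oversampling with SMOTE inside a cross-validated pipeline.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),              # oversample the minority class in each training fold
    ("clf",   ExtraTreesClassifier(random_state=0)),
])
# X_bin, y_bin: feature matrix and labels for a (potentially imbalanced) population (assumed)
scores = cross_val_score(pipe, X_bin, y_bin, cv=5, scoring="accuracy")
print(f"cross-validated accuracy with SMOTE: {scores.mean():.4f}")
```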
Despite the promising results, the study has several limitations. First, the small dataset may limit the generalizability of the model to larger and more diverse populations. Our study develops a metabolomics-based diagnostic model using 461 people (191 with SCLC, 173 with NSCLC, and 97 controls). To contextualize this sample size, it is important to compare it with other lung cancer metabolomics studies. For instance, Shang et al. employed the same publicly available LC-MS/MS-based metabolomics dataset, splitting their data into a training cohort of 323 and a testing cohort of 138 to validate a deep learning model [4]. In contrast, our study employed a training cohort of 369 samples and a testing cohort of 92 samples to evaluate our ensemble approach using five-fold cross-validation. Similarly, Wikoff et al. [23] examined a different dataset, which included 108 NSCLC cases and 216 controls. Although our sample sizes are comparable to those of these studies, particularly in the case of SCLC, they remain small compared with larger-scale NSCLC studies, such as that conducted by Mathe et al. (2014), which included 1005 patients [64]. The small control group and the focus on serum samples, which might not fully capture the tumor microenvironment’s metabolic complexity, further limit generalizability. Additionally, our metabolomics dataset is significantly smaller than those in lung cancer genomic studies, which employ thousands of samples. Campbell et al. (2016) analyzed 1144 lung cancer whole-genome sequencing data points [65]. This comparison demonstrates that larger metabolomics datasets are needed to approach the statistical power achieved in genomic studies and to strengthen biomarker discovery in lung cancer research. Second, serum samples were used for metabolomics, which may not capture the full complexity of tumor microenvironment metabolic changes. Serum analysis illuminates systemic metabolism, but it often misses tissue-specific metabolic processes crucial to tumor progression. Third, the stacking ensemble model enhanced performance at the cost of computational complexity, which may restrict its use for clinical decision-making in fast-paced environments. Finally, although the SHAP analysis improved the interpretability of the ML model, further clinical validation is necessary to confirm the biomarker relevance of the identified metabolites.
Moreover, to validate and potentially refine the model in future research, it is recommended that larger multi-institutional datasets be employed. Incorporating additional omics data, such as genomics, lipidomics, transcriptomics, and proteomics, could achieve a more holistic view of the molecular landscape in lung cancer [66,67]. Combining our metabolomics data with this multi-omics approach could elucidate synergistic effects and regulatory mechanisms that are not apparent from metabolomics alone, thereby facilitating a more comprehensive understanding of cancer pathogenesis. A recent proteomic profiling systematic review identified several proteins that could potentially play a role in SCLC pathogenesis [6]. Therefore, further investigation of the biological mechanisms and interplay amongst different omics data could guide the development of targeted therapies or diagnostic tools for SCLC.
Our findings emphasize the potential for developing metabolomics-based diagnostic tools for early SCLC detection and personalized treatment strategies. Integrating machine learning algorithms, such as the stacking ensemble model used in this study, could improve diagnostic accuracy and guide treatment decisions. It will be critical to work with clinical experts to establish appropriate thresholds and incorporate predictive models into clinical workflows. Furthermore, our proposed model could be a valuable component of these diagnostic tools, which should also include human expert feedback, additional biomarkers, and complementary predictive models. Future research should investigate the potential of a higher-level ensemble model that combines our metabolomics-based machine learning approach with existing lung cancer screening protocols, such as low-dose CT scans. This integrated system could further automate and enhance the medical decision-making process by providing an additional layer of metabolic information, potentially reducing false positives and unnecessary invasive procedures.
5. Conclusions
In conclusion, this study highlights the significant potential of integrating metabolomics data with advanced machine-learning techniques to improve the classification of SCLC and NSCLC. By employing a stacking-based ensemble approach, we achieved enhanced accuracy and precision in distinguishing between cancer types. SHAP analysis revealed key metabolites such as benzoic acid and DL-lactate as critical for differentiation. The clinical implications of the model are noteworthy; it could aid in risk stratification, allowing for personalized management strategies based on metabolic profiles, and potentially informing treatment decisions by identifying potential drug targets. Additionally, its integration into existing screening processes could reduce healthcare costs by minimizing unnecessary invasive procedures. Future research should focus on longitudinal studies to track metabolic profile changes over time and further validate the effectiveness of the model in clinical settings. Overall, our findings accentuate the promise of combining metabolomics with machine learning to develop non-invasive diagnostic tools that could lead to earlier detection and improved outcomes for lung cancer patients.
Supplementary Materials
The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/cancers16244225/s1. Figure S1. Methodology Overview of the Study. Figure S2. t-SNE visualization for the negative ion data. Figure S3. t-SNE visualization for the positive ion data. Figure S4. PCA component plots for the negative ion data. Figure S5. PCA component plots for the Positive ion data. Figure S6. Cluster Heatmap Depicting the Distinction Between SCLC and NSCLC Classes. Figure S7. Confusion matrix for the stacking-based SVM classifier in multi-class classification. Figure S8. Confusion matrix for the stacking-based ExtraTrees classifier in binary classification. Table S1. Comprehensive clinical characteristics. Table S2. Parameters of the MLPClassifier. Table S3. Parameters of the MLPClassifier for Multi-Class Classification. Section S1: Evaluation Metrics. Section S2: Model Development.
Author Contributions
Conceptualization, S.P. and M.E.H.C.; Formal analysis, M.S.I.S.; Funding acquisition, S.P.; Investigation, M.S.I.S. and M.M.; Methodology, M.S.I.S. and M.M.; Project administration, S.P.; Resources, S.P.; Software, M.S.I.S.; Supervision, S.P. and M.E.H.C.; Validation, M.S.I.S. and M.E.H.C.; Visualization, M.S.I.S. and N.A.; Writing—original draft, M.S.I.S., M.M. and S.P.; Writing—review and editing, M.S.I.S., N.A., M.M., H.K., S.V., M.N.A., S.P. and M.E.H.C.; Funding, S.P. All authors have read and agreed to the published version of the manuscript.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Shang et al. [4] have provided the dataset that this investigation employs. Therefore, the authors of this article were not involved in the data collection procedure.
Conflicts of Interest
The authors declare that they have no competing interests.
Funding Statement
This open-access paper is funded by QUCG-CMED-24/25-367.
Footnotes
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
References
- 1.Bray F., Laversanne M., Sung H., Ferlay J., Siegel R.L., Soerjomataram I., Jemal A. Global cancer statistics 2022: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA A Cancer J. Clin. 2024;74:229–263. doi: 10.3322/caac.21834. [DOI] [PubMed] [Google Scholar]
- 2.Li C., Lei S., Ding L., Xu Y., Wu X., Wang H., Zhang Z., Gao T., Zhang Y., Li L. Global burden and trends of lung cancer incidence and mortality. Chin. Med. J. 2023;136:1583–1590. doi: 10.1097/CM9.0000000000002529. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Barta J.A., Powell C.A., Wisnivesky J.P. Global epidemiology of lung cancer. Ann. Glob. Health. 2019;85:8. doi: 10.5334/aogh.2419. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Shang X., Zhang C., Kong R., Zhao C., Wang H. Construction of a Diagnostic Model for Small Cell Lung Cancer Combining Metabolomics and Integrated Machine Learning. Oncologist. 2024;29:e392–e401. doi: 10.1093/oncolo/oyad261. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Ayoub M., AbuHaweeleh M.N., Mahmood N., Clelland C., Ayoub M.M., Saman H. Small cell lung cancer associated small bowel obstruction, a diagnostic conundrum: A case report. Clin. Case Rep. 2024;12:e9262. doi: 10.1002/ccr3.9262. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Elshoeibi A.M., Elsayed B., Kaleem M.Z., Elhadary M.R., Abu-Haweeleh M.N., Haithm Y., Krzyslak H., Vranic S., Pedersen S. Proteomic Profiling of Small-Cell Lung Cancer: A Systematic Review—PubMed. Cancers. 2023;15:5005. doi: 10.3390/cancers15205005. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Lee G., Lee H.Y., Park H., Schiebler M.L., van Beek E.J., Ohno Y., Seo J.B., Leung A. Radiomics and its emerging role in lung cancer research, imaging biomarkers and clinical management: State of the art. Eur. J. Radiol. 2017;86:297–307. doi: 10.1016/j.ejrad.2016.09.005. [DOI] [PubMed] [Google Scholar]
- 8.Shestakova K.M., Moskaleva N.E., Boldin A.A., Rezvanov P.M., Shestopalov A.V., Rumyantsev S.A., Zlatnik E.Y., Novikova I.A., Sagakyants A.B., Timofeeva S.V. Targeted metabolomic profiling as a tool for diagnostics of patients with non-small-cell lung cancer. Sci. Rep. 2023;13:11072. doi: 10.1038/s41598-023-38140-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Amoêdo N.D., Valencia J.P., Rodrigues M.F., Galina A., Rumjanek F.D. How does the metabolism of tumour cells differ from that of normal cells. Biosci. Rep. 2013;33:e00080. doi: 10.1042/BSR20130066. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Holmes E., Wilson I.D., Nicholson J.K. Metabolic phenotyping in health and disease. Cell. 2008;134:714–717. doi: 10.1016/j.cell.2008.08.026. [DOI] [PubMed] [Google Scholar]
- 11.Mariën H., Derveaux E., Vanhove K., Adriaensens P., Thomeer M., Mesotten L. Changes in Metabolism as a Diagnostic Tool for Lung Cancer: Systematic Review. Metabolites. 2022;12:545. doi: 10.3390/metabo12060545. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Noreldeen H.A., Liu X., Xu G. Metabolomics of lung cancer: Analytical platforms and their applications. J. Sep. Sci. 2020;43:120–133. doi: 10.1002/jssc.201900736. [DOI] [PubMed] [Google Scholar]
- 13.Wei Z., Liu X., Cheng C., Yu W., Yi P. Metabolism of Amino Acids in Cancer. Front. Cell Dev. Biol. 2021;8:603837. doi: 10.3389/fcell.2020.603837. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Liberti M.V., Locasale J.W. The Warburg Effect: How Does it Benefit Cancer Cells? Trends Biochem. Sci. 2016;41:211–218. doi: 10.1016/j.tibs.2015.12.001. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Valles I., Pajares M.J., Segura V., Guruceaga E., Gomez-Roman J., Blanco D., Tamura A., Montuenga L.M., Pio R. Identification of Novel Deregulated RNA Metabolism-Related Genes in Non-Small Cell Lung Cancer. PLoS ONE. 2012;7:e42086. doi: 10.1371/journal.pone.0042086. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16. Tufail M., Jiang C., Li N. Altered metabolism in cancer: Insights into energy pathways and therapeutic targets. Mol. Cancer. 2024;23:203. doi: 10.1186/s12943-024-02119-3.
- 17. Albaugh V.L., Pinzon-Guzman C., Barbul A. Arginine metabolism and cancer. J. Surg. Oncol. 2017;115:273–280. doi: 10.1002/jso.24490.
- 18. Zhang Y., Yang J.-M. Altered energy metabolism in cancer: A unique opportunity for therapeutic intervention. Cancer Biol. Ther. 2013;14:81–89. doi: 10.4161/cbt.22958.
- 19. Mendez K.M., Reinke S.N., Broadhurst D.I. A comparative evaluation of the generalised predictive ability of eight machine learning algorithms across ten clinical metabolomics data sets for binary classification. Metabolomics. 2019;15:150. doi: 10.1007/s11306-019-1612-4.
- 20. Bishop C.M. Neural Networks for Pattern Recognition. Oxford University Press; Oxford, UK: 1995.
- 21. Schult T.A., Lauer M.J., Berker Y., Cardoso M.R., Vandergrift L.A., Habbel P., Nowak J., Taupitz M., Aryee M., Mino-Kenudson M.A. Screening human lung cancer with predictive models of serum magnetic resonance spectroscopy metabolomics. Proc. Natl. Acad. Sci. USA. 2021;118:e2110633118. doi: 10.1073/pnas.2110633118.
- 22. Chen R., Li Z., Yuan Y., Zhu Z., Zhang J., Tian X., Zhang X. A comprehensive analysis of metabolomics and transcriptomics in non-small cell lung cancer. PLoS ONE. 2020;15:e0232272. doi: 10.1371/journal.pone.0232272.
- 23. Wikoff W.R., Hanash S., DeFelice B., Miyamoto S., Barnett M., Zhao Y., Goodman G., Feng Z., Gandara D., Fiehn O. Diacetylspermine is a novel prediagnostic serum biomarker for non–small-cell lung cancer and has additive performance with pro-surfactant protein B. J. Clin. Oncol. 2015;33:3880–3886. doi: 10.1200/JCO.2015.61.7779.
- 24. Du J., Li Y., Wang L., Zhou Y., Shen Y., Xu F., Chen Y. Selective application of neuroendocrine markers in the diagnosis and treatment of small cell lung cancer. Clin. Chim. Acta. 2020;509:295–303. doi: 10.1016/j.cca.2020.06.037.
- 25. Song B., Shi P., Xiao J., Song Y., Zeng M., Cao Y., Zhu X. Utility of red cell distribution width as a diagnostic and prognostic marker in non-small cell lung cancer. Sci. Rep. 2020;10:15717. doi: 10.1038/s41598-020-72585-4.
- 26. Schneider J., Philipp M., Velcovsky H.-G., Morr H., Katz N. Pro-gastrin-releasing peptide (ProGRP), neuron specific enolase (NSE), carcinoembryonic antigen (CEA) and cytokeratin 19-fragments (CYFRA 21-1) in patients with lung cancer in comparison to other lung diseases. Anticancer Res. 2003;23:885–893.
- 27. Yu Z., Lu H., Si H., Liu S., Li X., Gao C., Cui L., Li C., Yang X., Yao X. A highly efficient gene expression programming (GEP) model for auxiliary diagnosis of small cell lung cancer. PLoS ONE. 2015;10:e0125517. doi: 10.1371/journal.pone.0125517.
- 28. Barchiesi V., Simeon V., Sandomenico C., Cantile M., Cerasuolo D., Chiodini P., Morabito A., Cavalcanti E. Circulating progastrin-releasing peptide in the diagnosis of Small Cell Lung Cancer (SCLC) and in therapeutic monitoring. J. Circ. Biomark. 2021;10:9. doi: 10.33393/jcb.2021.2212.
- 29. Wang J., Jia G., Jie H. Diagnostic value of ProGRP and NSE for small cell lung cancer: A meta-analysis. Zhongguo Fei Ai Za Zhi. 2010;13:1094–1100. doi: 10.3779/j.issn.1009-3419.2010.12.03.
- 30. Shibayama T., Ueoka H., Nishii K., Kiura K., Tabata M., Miyatake K., Kitajima T., Harada M. Complementary roles of pro-gastrin-releasing peptide (ProGRP) and neuron specific enolase (NSE) in diagnosis and prognosis of small-cell lung cancer (SCLC). Lung Cancer. 2001;32:61–69. doi: 10.1016/S0169-5002(00)00205-1.
- 31. Wen Z., Huang Y., Ling Z., Chen J., Wei X., Su R., Tang Z., Wen Z., Deng Y., Hu Z. Lack of efficacy of combined carbohydrate antigen markers for lung cancer diagnosis. Dis. Markers. 2020;2020:4716793. doi: 10.1155/2020/4716793.
- 32. Oremek G., Sauer-Eppel H., Bruzdziak T. Value of tumour and inflammatory markers in lung cancer. Anticancer Res. 2007;27:1911–1915.
- 33. Yang H.-J., Gu Y., Chen C., Xu C., Bao Y.-X. Diagnostic value of pro-gastrin-releasing peptide for small cell lung cancer: A meta-analysis. Clin. Chem. Lab. Med. 2011;49:1039–1046. doi: 10.1515/CCLM.2011.161.
- 34. Harmsma M., Schutte B., Ramaekers F.C. Serum markers in small cell lung cancer: Opportunities for improvement. Biochim. Biophys. Acta (BBA)-Rev. Cancer. 2013;1836:255–272. doi: 10.1016/j.bbcan.2013.06.002.
- 35. Sidaway P. cfDNA monitoring is feasible in SCLC. Nat. Rev. Clin. Oncol. 2020;17:7. doi: 10.1038/s41571-019-0300-7.
- 36. Mondelo-Macía P., García-González J., León-Mateos L., Castillo-García A., López-López R., Muinelo-Romay L., Díaz-Peña R. Current status and future perspectives of liquid biopsy in small cell lung cancer. Biomedicines. 2021;9:48. doi: 10.3390/biomedicines9010048.
- 37. Sandfeld-Paulsen B., Jakobsen K.R., Bæk R., Folkersen B.H., Rasmussen T.R., Meldgaard P., Varming K., Jørgensen M.M., Sorensen B.S. Exosomal proteins as diagnostic biomarkers in lung cancer. J. Thorac. Oncol. 2016;11:1701–1710. doi: 10.1016/j.jtho.2016.05.034.
- 38. Ma M., Zhu H., Zhang C., Sun X., Gao X., Chen G. “Liquid biopsy”—ctDNA detection with great potential and challenges. Ann. Transl. Med. 2015;3:12–235. doi: 10.3978/j.issn.2305-5839.2015.09.29.
- 39. Mallikharjuna Rao K., Saikrishna G., Supriya K. Data preprocessing techniques: Emergence and selection towards machine learning models-a practical review using HPA dataset. Multimed. Tools Appl. 2023;82:37177–37196. doi: 10.1007/s11042-023-15087-5.
- 40. Singh D., Singh B. Investigating the impact of data normalization on classification performance. Appl. Soft Comput. 2020;97:105524. doi: 10.1016/j.asoc.2019.105524.
- 41. Yusuf M., Atal I., Li J., Smith P., Ravaud P., Fergie M., Callaghan M., Selfe J. Reporting quality of studies using machine learning models for medical diagnosis: A systematic review. BMJ Open. 2020;10:e034568. doi: 10.1136/bmjopen-2019-034568.
- 42. Liu Y., Chen P.-H.C., Krause J., Peng L. How to Read Articles That Use Machine Learning: Users’ Guides to the Medical Literature. JAMA. 2019;322:1806–1816. doi: 10.1001/jama.2019.16489.
- 43. Sumon M.S.I., Hossain M.S.A., Al-Sulaiti H., Yassine H.M., Chowdhury M.E. Enhancing Influenza Detection through Integrative Machine Learning and Nasopharyngeal Metabolomic Profiling: A Comprehensive Study. Diagnostics. 2024;14:2214. doi: 10.3390/diagnostics14192214.
- 44. Meyer D., Nagler T., Hogan R.J. Copula-based synthetic data augmentation for machine-learning emulators. Geosci. Model Dev. 2021;14:5205–5215. doi: 10.5194/gmd-14-5205-2021.
- 45. Van der Maaten L., Hinton G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008;9:2579–2605.
- 46. Caesar L.K., Kvalheim O.M., Cech N.B. Hierarchical cluster analysis of technical replicates to identify interferents in untargeted mass spectrometry metabolomics. Anal. Chim. Acta. 2018;1021:69–77. doi: 10.1016/j.aca.2018.03.013.
- 47. Nielsen F. Introduction to HPC with MPI for Data Science. Springer; Berlin/Heidelberg, Germany: 2016. Hierarchical clustering; pp. 195–211.
- 48. Chen T., Guestrin C. XGBoost: A scalable tree boosting system; Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; San Francisco, CA, USA. 13–17 August 2016; pp. 785–794.
- 49. Breiman L. Random forests. Mach. Learn. 2001;45:5–32. doi: 10.1023/A:1010933404324.
- 50. Geurts P., Ernst D., Wehenkel L. Extremely randomized trees. Mach. Learn. 2006;63:3–42. doi: 10.1007/s10994-006-6226-1.
- 51. Vanderlooy S., Hüllermeier E. A critical analysis of variants of the AUC. Mach. Learn. 2008;72:247–262. doi: 10.1007/s10994-008-5070-x.
- 52. Mangalathu S., Hwang S.-H., Jeon J.-S. Failure mode and effects analysis of RC members based on machine-learning-based SHapley Additive exPlanations (SHAP) approach. Eng. Struct. 2020;219:110927. doi: 10.1016/j.engstruct.2020.110927.
- 53. Horn L., Mansfield A.S., Szczęsna A., Havel L., Krzakowski M., Hochmair M.J., Huemer F., Losonczy G., Johnson M.L., Nishio M., et al. First-Line Atezolizumab plus Chemotherapy in Extensive-Stage Small-Cell Lung Cancer. N. Engl. J. Med. 2018;379:2220–2229. doi: 10.1056/NEJMoa1809064.
- 54. Rotow J., Bivona T.G. Understanding and targeting resistance mechanisms in NSCLC. Nat. Rev. Cancer. 2017;17:637–658. doi: 10.1038/nrc.2017.84.
- 55. George J., Lim J.S., Jang S.J., Cun Y., Ozretić L., Kong G., Leenders F., Lu X., Fernández-Cuesta L., Bosco G., et al. Comprehensive genomic profiles of small cell lung cancer. Nature. 2015;524:7563. doi: 10.1038/nature14664.
- 56. Butler L.M., Perone Y., Dehairs J., Lupien L.E., de Laat V., Talebi A., Loda M., Kinlaw W.B., Swinnen J.V. Lipids and cancer: Emerging roles in pathogenesis, diagnosis and therapeutic intervention. Adv. Drug Deliv. Rev. 2020;159:245–293. doi: 10.1016/j.addr.2020.07.013.
- 57. Munir R., Lisec J., Swinnen J.V., Zaidi N. Lipid metabolism in cancer cells under metabolic stress. Br. J. Cancer. 2019;120:12. doi: 10.1038/s41416-019-0451-4.
- 58. Hayes J.D., Dinkova-Kostova A.T., Tew K.D. Oxidative Stress in Cancer. Cancer Cell. 2020;38:167–197. doi: 10.1016/j.ccell.2020.06.001.
- 59. Kuo C.-L., Ponneri Babuharisankar A., Lin Y.-C., Lien H.-W., Lo Y.K., Chou H.-Y., Tangeda V., Cheng L.-C., Cheng A.N., Lee A.Y.-L., et al. Mitochondrial oxidative stress in the tumor microenvironment and cancer immunoescape: Foe or friend? J. Biomed. Sci. 2022;29:74. doi: 10.1186/s12929-022-00859-2.
- 60. Chawla N.V., Bowyer K.W., Hall L.O., Kegelmeyer W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002;16:321–357. doi: 10.1613/jair.953.
- 61. Mukhamediev R.I., Symagulov A., Kuchin Y., Yakunin K., Yelis M. From classical machine learning to deep neural networks: A simplified scientometric review. Appl. Sci. 2021;11:5541. doi: 10.3390/app11125541.
- 62. Huang X., Khetan A., Cvitkovic M., Karnin Z. TabTransformer: Tabular data modeling using contextual embeddings. arXiv. 2020. arXiv:2012.06678.
- 63. Shwartz-Ziv R., Armon A. Tabular data: Deep learning is not all you need. Inf. Fusion. 2022;81:84–90. doi: 10.1016/j.inffus.2021.11.011.
- 64. Mathé E.A., Patterson A.D., Haznadar M., Manna S.K., Krausz K.W., Bowman E.D., Shields P.G., Idle J.R., Smith P.B., Anami K., et al. Noninvasive Urinary Metabolomic Profiling Identifies Diagnostic and Prognostic Markers in Lung Cancer. Cancer Res. 2014;74:3259–3270. doi: 10.1158/0008-5472.CAN-14-0109.
- 65. Campbell J.D., Alexandrov A., Kim J., Wala J., Berger A.H., Pedamallu C.S., Shukla S.A., Guo G., Brooks A.N., Murray B.A., et al. Distinct patterns of somatic genome alterations in lung adenocarcinomas and squamous cell carcinomas. Nat. Genet. 2016;48:607–616. doi: 10.1038/ng.3564.
- 66. Gao M., Zhao L., Zhang Z., Wang J., Wang C. Using a stacked ensemble learning framework to predict modulators of protein-protein interactions. Comput. Biol. Med. 2023;161:107032. doi: 10.1016/j.compbiomed.2023.107032.
- 67. Liang M., Chang T., An B., Duan X., Du L., Wang X., Miao J., Xu L., Gao X., Zhang L., et al. A Stacking Ensemble Learning Framework for Genomic Prediction. Front. Genet. 2021;12:600040. doi: 10.3389/fgene.2021.600040.
Data Availability Statement
The dataset analyzed in this study is publicly available and was originally reported by Shang et al. [4]; the authors of this article were not involved in data collection.