2024 Nov 25;18(1):e202400427. doi: 10.1002/jbio.202400427

Subtype‐Specific Detection in Stage Ia Breast Cancer: Integrating Raman Spectroscopy, Machine Learning, and Liquid Biopsy for Personalised Diagnostics

Kevin Saruni Tipatet 1,2, Katie Hanna 3, Liam Davison‐Gates 1, Mario Kerst 4, Andrew Downes 1
PMCID: PMC11700701  PMID: 39587849

ABSTRACT

This study explores the integration of Raman spectroscopy (RS) with machine learning for the early detection and subtyping of breast cancer using blood plasma samples. We performed detailed spectral analyses, identifying significant spectral patterns associated with cancer biomarkers. Our findings demonstrate the potential for classifying the four major subtypes of breast cancer at stage Ia with an average sensitivity and specificity of 90% and 95%, respectively, and a cross‐validated macro‐averaged area under the curve (AUC) of 0.98. This research highlights efforts to integrate vibrational spectroscopy with machine learning, enhancing cancer diagnostics through a non‐invasive, personalised approach for early detection and monitoring disease progression. This study is the first of its kind to utilise RS and machine learning to classify the four major breast cancer subtypes at stage Ia.

Keywords: breast cancer subtyping, early detection, liquid biopsy, machine learning, personalised cancer care, Raman spectroscopy


This study demonstrates the power of combining Raman spectroscopy and machine learning to detect and classify breast cancer subtypes using blood plasma. With an impressive 90% sensitivity and 95% specificity, our method offers a non‐invasive approach for early cancer detection, accurately identifying the four main subtypes at stage Ia. This innovative integration of vibrational spectroscopy with AI‐based analysis holds promise for personalised cancer diagnostics, potentially transforming how we detect and monitor cancer progression.


1. Introduction

Breast cancer (BC) in women is the most frequently diagnosed malignancy worldwide and ranks among the leading causes of cancer‐related fatalities [1]. Despite significant advancements in the prevention, diagnosis, and treatment of BC over recent decades [2], it remains a formidable global health challenge. The 5‐year survival rate for stage I breast cancer is approximately 99%, compared with 86% for stage II, 57% for stage III, and just 27% for stage IV, where the cancer has spread to distant organs [3, 4]. Often, tumours are detected only once symptoms manifest, typically at an advanced stage of the disease [5]. Distant (metastatic) recurrence is a critical clinical concern, accounting for the majority of BC‐related deaths [6]. Therefore, early detection and diagnosis while the tumour is still localised, and when treatments are most effective, are crucial for improving treatment outcomes [7].

Liquid biopsy enables the non‐invasive investigation and real‐time monitoring of a tumour's dynamic progression and response to therapy [8, 9, 10]. Liquid biopsy tests possess the potential to detect a broad spectrum of biomolecules, including circulating tumour cells (CTCs), cell‐free DNA (cfDNA), RNA, extracellular vesicles, proteins, and methylation markers, thereby providing invaluable insights into disease status. However, many current liquid biopsies aimed at early cancer detection lack the necessary sensitivity to effectively identify cancers at their earliest stages [11]. For instance, tumour‐derived genetic markers may not always be released into the bloodstream at the early stages of cancer, and when they are, their presence is often at very low levels [12]. Furthermore, cancer‐associated biomarkers routinely examined in bodily fluids may not exhibit abnormal levels even in patients with advanced cancer stages [13]. These markers also often lack specificity, as elevated levels can occur in individuals without cancer [14, 15]. Enhancing the efficacy of early cancer detection necessitates the integration of both non‐tumour derived data and direct tumour signals into the diagnostic technology [16].

Spectroscopic techniques are emerging as powerful tools in biomedical research due to their non‐invasive nature and high real‐time spatial resolution [17]. Among these techniques, Raman spectroscopy (RS) is particularly attractive because it has shown the potential to perform rapid, non‐destructive, reagent‐free, real‐time molecular analysis [18] and to provide objective, clinically relevant diagnostic information in various biomedical applications using different biofluids [19, 20, 21]. RS, a type of laser spectroscopy, excites vibrational modes within molecules at characteristic frequencies [18]. During this process, a photon loses energy when it excites a vibration; by detecting the scattered photons with a spectrometer, we can deduce the energy loss and identify the type of vibration within the molecule. This technique reveals subtle differences in the chemical composition of cells and biological tissues, highlighting changes caused by diseases [22]. In cancer research, RS has been used in vitro to distinguish primary tumour cells from secondary tumour cells [23] and to differentiate radioresistant cells from controls [24]. It has also been applied to biopsies [25] with high accuracy and more recently extended to biofluids such as urine [26], tears [27], saliva [28], and blood plasma, achieving 95% accuracy for detecting stage II breast cancer [29].

The development of high‐throughput Raman instrumentation for diagnosing prostate cancer in plasma samples has been reported, demonstrating remarkable sensitivity and specificity (96.5% and 95%, respectively) in a limited patient cohort [30]. Wang et al. [31] applied spontaneous RS to analyse stage I to IV non‐small cell lung cancer (NSCLC) and healthy samples using dried blood serum, achieving an accuracy of 65% in distinguishing stage I cancer from the other groups. Another study conducted measurements on whole blood samples from a group of healthy volunteers and breast cancer patients, achieving classification accuracies above 94% in discriminating between control and cancer samples. However, this study focused on cancer grades 1–3 and did not address cancer stages, which provide a more comprehensive understanding of tumour progression and its impact on treatment outcomes [32]. Nargis et al. [29] employed PCA‐Factorial Discriminant Analysis (PCA‐FDA) and achieved remarkably high accuracy, with classification rates exceeding 99% for distinguishing between breast cancer patients and healthy controls. The study identified specific Raman spectral features associated with DNA and proteins that were exclusively observed in the blood plasma samples of breast cancer patients. However, it is important to note that this study was limited to patients with stages II to IV breast cancer, thereby excluding earlier‐stage cases from its analysis. In a subsequent study, Nargis et al. [33] conducted a comparative analysis of Surface‐Enhanced Raman Scattering (SERS) and spontaneous RS for the examination of blood serum from breast cancer patients and healthy controls using Partial Least Squares Discriminant Analysis (PLS‐DA). The results demonstrated sensitivity, specificity, and AUROC values of 90%, 98%, and 94% for SERS, and 88%, 98%, and 83% for spontaneous RS, respectively. This study was also confined to stages II to IV of breast cancer. Pereira de Souza et al. [34] applied a rapid and low‐cost ATR‐FTIR spectroscopy technique to analyse the molecular subtypes of breast cancer using blood plasma samples from patients across stages I to IV, as well as from healthy individuals. Utilising integrated Principal Component Analysis (iPCA) and Orthogonal Projections to Latent Structures Discriminant Analysis (OPLS‐DA), they achieved 100% accuracy in distinguishing the four main breast cancer subtypes. Subtype‐specific diagnosis is a major milestone towards personalised medicine [35, 36, 37, 38, 39]. However, a notable limitation of this study is that all cancer samples were pooled together, rather than being analysed according to their individual cancer stage. This pooling could mask stage‐specific spectral variations that are critical for understanding the progression and treatment of breast cancer.

We conducted a small pilot study on blood plasma from stage Ia breast cancer patients and healthy control samples, showing high sensitivity and specificity (macro‐average AUC = 0.98) in detecting different breast cancer subtypes. To the best of our knowledge, this study is the first of its kind to utilise RS and machine learning to classify the four major breast cancer subtypes: Luminal A, Luminal B, HER2‐enriched, and Triple Negative Breast Cancer (TNBC) at stage Ia. Using a custom‐built pattern recognition program, we were able to identify spectral features associated with cancer biomarkers.

2. Materials and Methods

2.1. Blood Samples

The blood plasma samples utilised in this study, 12 from breast cancer patients and 12 from healthy volunteers, were generously provided by the Northern Ireland Biobank (Ref.: c‐13290114) and Breast Cancer Now Tissue Bank (Ref.: c‐12283903). Whole blood samples were collected at diagnosis and before commencement of treatment. Upon clinical histopathological assessment, all cancer samples were identified as stage Ia and classified according to their respective tumour subtype based on the hormone‐receptor (HR) and human epidermal growth factor receptor 2 (HER2) status of the tissue biopsy, as shown in Table S1.

2.2. Sample Preparation

Blood plasma samples were double‐sealed to prevent contamination and placed in a water bath set at 37°C for 5 min for thawing. The anonymised samples were then randomly placed into a sample rack; random selection reduces bias that could occur when samples of the same group are measured consecutively. Moreover, to minimise metabolic activity and enzymatic degradation that could occur at room temperature, samples were kept in a cooling box with ice during the waiting times before measurements. Fifty‐microlitre droplets were individually pipetted onto a gold‐coated mirror and subsequently dried by drop‐casting.

Raman spectra were collected with a Renishaw InVia Raman spectrometer equipped with a 20× (0.4 NA) objective lens (Leica, Germany) and a 600 lines/mm diffraction grating. Using an excitation wavelength of 785 nm and a laser power of approximately 60 mW at the sample, dried blood plasma samples were illuminated for 20 s per acquisition at five different regions per sample over a wavenumber range of 500–1600 cm−1. An excitation wavelength in the NIR region was selected for this analysis to generate high‐quality and reproducible spectra [29, 30, 33, 40].

2.3. Spectral Preprocessing

The preprocessing of the spectral data involved several systematic steps to ensure suitability for analysis, carried out using custom‐built Python programs.

Initially, a dataframe containing the raw spectral data was created to form an array of the raw spectra. A thresholding algorithm based on Mahalanobis distance and principal component (PC) analysis was applied to the raw data [41]. After mapping the original data onto the compressed domain, the score values relative to PC1 and PC2 (which represent new axes in this domain) are assessed. If the score of a spectrum on PC1 or PC2 deviates from the respective mean by more than 2.58 standard deviations of the scores on that component (corresponding to a 99% confidence level), that spectrum is flagged as an outlier.
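The thresholding rule described above can be sketched in a few lines of Python. This is an illustrative reconstruction rather than the authors' program: the function name `flag_outliers`, the use of an SVD to obtain the principal axes, and the synthetic spectra are all assumptions made for the example.

```python
import numpy as np

def flag_outliers(spectra, n_sigma=2.58):
    """Flag spectra whose PC1 or PC2 score lies more than n_sigma standard
    deviations from the mean score on that component (hypothetical helper)."""
    X = np.asarray(spectra, dtype=float)
    Xc = X - X.mean(axis=0)                     # centre each wavenumber channel
    # Rows of Vt are the principal axes, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:2].T                      # scores on PC1 and PC2
    z = np.abs(scores - scores.mean(axis=0)) / scores.std(axis=0, ddof=1)
    return (z > n_sigma).any(axis=1)

# Synthetic demonstration data (not from the study).
rng = np.random.default_rng(0)
spectra = rng.normal(0.0, 1.0, size=(60, 200))
spectra[7] += 25.0                              # one grossly offset spectrum
mask = flag_outliers(spectra)
print("flagged indices:", np.flatnonzero(mask))
```

With a 2.58 sigma threshold roughly 1% of well-behaved scores per component are expected to be flagged by chance, so flagged spectra would normally be inspected before being discarded.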

Baseline correction was then performed on the raw array using the Rolling Ball technique [42], with parameter 7 and a ball height of 21 and ball width of 63. Both the corrected array and the baseline were retained for further processing.

Next, the data were mean‐centred by subtracting the mean spectrum from each individual spectrum. This was followed by aligning the spectra to the x‐axis to maintain consistent dimensions. Subsequently, the mean‐centred data were standardised by dividing each spectrum by the standard deviation of the mean‐centred data.
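A minimal sketch of the mean‐centring and standardisation step follows. The text does not state whether the standard deviation is taken per wavenumber channel or over the whole dataset; the per‐channel choice below, and the helper name `mean_centre_and_scale`, are assumptions for illustration only.

```python
import numpy as np

def mean_centre_and_scale(spectra):
    """Subtract the mean spectrum, then divide by the per-channel standard
    deviation of the centred data (assumed interpretation)."""
    X = np.asarray(spectra, dtype=float)
    mean_spectrum = X.mean(axis=0)
    centred = X - mean_spectrum
    std = centred.std(axis=0, ddof=1)
    std[std == 0] = 1.0                 # guard against constant channels
    return centred / std, mean_spectrum, std

# Synthetic demonstration data (not from the study).
rng = np.random.default_rng(0)
raw = rng.normal(5.0, 2.0, size=(30, 50))
scaled, mean_spectrum, std = mean_centre_and_scale(raw)
```

Returning the mean spectrum and scale factors allows the same transformation to be applied later to held‐out spectra without recomputing them.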

To reduce noise, despiking was performed on the normalised spectra using the ‘removeCRSFast’ function with threshold parameters set to 4. Finally, the spectra were smoothed using the Whittaker method [43], with a lambda value of 10 000.
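For readers unfamiliar with the Whittaker method, the smoother finds the series z that minimises the sum of squared residuals plus lambda times the sum of squared second differences of z, which leads to the linear system (I + lambda DᵀD) z = y. The sketch below is a dense NumPy illustration under that standard formulation, not the authors' implementation; the function name and the synthetic peak are assumptions.

```python
import numpy as np

def whittaker_smooth(y, lam=10_000.0, order=2):
    """Whittaker smoother with a difference penalty of the given order."""
    y = np.asarray(y, dtype=float)
    n = y.size
    # Difference matrix: for order=2 each row is [1, -2, 1] shifted along the diagonal.
    D = np.diff(np.eye(n), n=order, axis=0)
    return np.linalg.solve(np.eye(n) + lam * (D.T @ D), y)

# Synthetic demonstration: a Gaussian band with added noise.
rng = np.random.default_rng(0)
x = np.linspace(500, 1600, 300)
noisy = np.exp(-((x - 1000.0) / 40.0) ** 2) + rng.normal(0, 0.05, x.size)
smooth = whittaker_smooth(noisy, lam=10_000.0)

def roughness(v):
    return float(np.sum(np.diff(v, n=2) ** 2))
```

The dense solve is adequate for a few hundred points; for longer spectra a sparse banded solver would be the usual choice. Larger lambda values give smoother output at the cost of broadening narrow bands.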

2.4. Analysis Using Machine Learning

Following spectral preprocessing, the data were divided into training and validation sets with a 70:30 ratio. PCA was fitted on the training dataset, followed by an LDA transformation. Leave‐One‐Out Cross‐Validation (LOOCV) was applied to the LDA‐transformed training data. In this process, a model is trained on the entire training set except for one observation, which is held out as the test data. The performance of the model, specifically its accuracy and F1 score, was evaluated in each iteration, and the model was saved. The validation set was projected using the eigenvectors derived from the training dataset rather than refitting PCA on the validation data, with the aim of reducing the risk of overfitting. Each leave‐one‐out iteration within the validation set was assessed against every model saved during the training phase. Performance metrics for each model were evaluated at every iteration, and the variability in these metrics across the models was quantified using the standard deviation.
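The workflow in this section can be outlined with scikit‐learn as follows. This is a sketch under stated assumptions, not the authors' code: the data are synthetic, the number of classes and components is arbitrary, and the LDA classifier is refitted inside each leave‐one‐out fold, which may differ in detail from the procedure used in the study.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, train_test_split

# Synthetic stand-in data: 3 classes x 20 spectra x 100 channels.
rng = np.random.default_rng(0)
centres = rng.normal(0.0, 3.0, size=(3, 100))
X = np.vstack([c + rng.normal(0.0, 1.0, size=(20, 100)) for c in centres])
y = np.repeat(np.arange(3), 20)

# 70:30 split into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)

# PCA eigenvectors come from the training set only.
pca = PCA(n_components=5).fit(X_train)
Z_train = pca.transform(X_train)
Z_val = pca.transform(X_val)        # projected with training eigenvectors, not refitted

# Leave-one-out cross-validation on the training scores; keep every fold's model.
models, loo_hits = [], []
for fit_idx, test_idx in LeaveOneOut().split(Z_train):
    lda = LinearDiscriminantAnalysis().fit(Z_train[fit_idx], y_train[fit_idx])
    loo_hits.append(lda.predict(Z_train[test_idx])[0] == y_train[test_idx][0])
    models.append(lda)

loo_accuracy = float(np.mean(loo_hits))
# Score every saved model on the untouched validation set and summarise the spread.
val_acc = np.array([m.score(Z_val, y_val) for m in models])
print(f"LOOCV accuracy {loo_accuracy:.2f}; "
      f"validation accuracy {val_acc.mean():.2f} +/- {val_acc.std():.2f}")
```

Keeping the PCA fit separate from the validation data is what prevents information from the held‐out spectra leaking into the feature space.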

2.5. Analysis of Spectral Features

An algorithm was developed to analyse differences in band intensities between diseased and control Raman spectra systematically. Initially, the mean spectrum for the control group was subtracted from each individual disease spectrum. This subtraction process highlighted the spectral features that differed between the two spectra, isolating potential disease‐associated changes.

To confirm the statistical significance of the annotated spectral differences, the Mann–Whitney–Wilcoxon test was applied. This non‐parametric test is suitable for comparing differences between two independent groups without assuming a normal distribution of data. The p‐values were calculated and displayed for each significant spectral difference as shown in Figures S3–S6.

The developed algorithm applied this method to each spectral feature (wavenumber) to test for statistical significance, allowing for the identification of specific spectral features with significant differences between the disease and control groups. This process provided insights into the biochemical alterations associated with the disease.
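The per‐wavenumber procedure can be illustrated with SciPy's `mannwhitneyu`. The helper name `significant_channels`, the 0.05 threshold, and the synthetic data are assumptions for this sketch; the authors' algorithm and any thresholds it used are not reproduced here.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def significant_channels(disease, control, alpha=0.05):
    """Return difference spectra, per-channel p-values and a significance mask."""
    disease = np.asarray(disease, dtype=float)
    control = np.asarray(control, dtype=float)
    # Subtract the mean control spectrum from each individual disease spectrum.
    difference = disease - control.mean(axis=0)
    # Two-sided Mann-Whitney U test at every wavenumber channel.
    p_values = np.array([
        mannwhitneyu(disease[:, j], control[:, j], alternative="two-sided").pvalue
        for j in range(disease.shape[1])
    ])
    return difference, p_values, p_values < alpha

# Synthetic demonstration: channel 2 is shifted in the disease group.
rng = np.random.default_rng(0)
control = rng.normal(0.0, 1.0, size=(15, 6))
disease = rng.normal(0.0, 1.0, size=(15, 6))
disease[:, 2] += 10.0
difference, p_values, significant = significant_channels(disease, control)
```

Testing every wavenumber separately involves many comparisons, so a multiple‐testing correction may be appropriate; the text does not state whether one was applied.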

2.6. Hierarchical Clustering Analysis

Hierarchical clustering was performed on the LDA‐transformed data using the complete linkage method [44]. This method measures the distance between clusters as the maximum distance between points in the two clusters. The complete linkage method was chosen due to its tendency to create more compact clusters. The hierarchical clustering results were visualised using a dendrogram. The dendrogram displays the arrangement of the clusters formed at various levels of similarity.
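As an illustration of complete linkage, the sketch below clusters two synthetic groups of points with SciPy and checks that the height of the final merge equals the largest distance between any pair of points drawn from the two clusters. The data and variable names are assumptions for the example and stand in for the LDA‐transformed spectra.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Two well-separated synthetic groups in a 2-D score space.
rng = np.random.default_rng(1)
group_a = rng.normal(0.0, 0.3, size=(6, 2))
group_b = rng.normal(5.0, 0.3, size=(6, 2))
points = np.vstack([group_a, group_b])

# Complete linkage: inter-cluster distance is the maximum pairwise distance.
Z = linkage(points, method="complete", metric="euclidean")
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into two clusters

# Largest distance between a point in group_a and a point in group_b.
cross = np.linalg.norm(group_a[:, None, :] - group_b[None, :, :], axis=2)
final_merge_height = Z[-1, 2]
print(labels, final_merge_height, cross.max())
```

The linkage matrix `Z` can be passed to `scipy.cluster.hierarchy.dendrogram` to draw the tree.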

3. Results

Figure S1 presents the difference spectra for control versus the four breast cancer subtypes Luminal A (HR+/HER2−), Luminal B (HR+/HER2+), HER2‐enriched (HR−/HER2+), and TNBC (HR−/HER2−), highlighting key spectral changes associated with each disease subtype. The observed spectral differences at multiple wavenumbers suggest differences in lipid, protein, and amino acid composition, which are essential for identifying disease‐specific biomarkers.

PCA was performed on the preprocessed Raman spectral data. To ensure a robust yet efficient analysis, we selected five principal components (PCs) that together capture approximately 50% of the total variance, as illustrated in Figure S2. This threshold was strategically chosen: it is sufficiently high to encompass the significant spectral features primarily represented in the lower PCs, while also being low enough to mitigate the risk of overfitting by excluding the less relevant features often found in higher PCs.
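The relationship between the number of retained components and the variance they capture can be computed directly from the singular values of the centred data. The sketch below uses a small hand‐constructed matrix so the numbers are easy to check; the function names are hypothetical.

```python
import numpy as np

def cumulative_explained_variance(spectra):
    """Cumulative fraction of total variance captured by PC1, PC1+PC2, ..."""
    X = np.asarray(spectra, dtype=float)
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)   # singular values, descending
    var = s ** 2                              # proportional to each PC's variance
    return np.cumsum(var) / var.sum()

def n_components_for(spectra, target=0.50):
    """Smallest number of PCs whose cumulative variance reaches the target."""
    cum = cumulative_explained_variance(spectra)
    return int(np.searchsorted(cum, target) + 1)

# Toy matrix with column sums of squares 18, 8 and 2 (total 28).
toy = np.array([[3, 0, 0], [-3, 0, 0],
                [0, 2, 0], [0, -2, 0],
                [0, 0, 1], [0, 0, -1]], dtype=float)
cum = cumulative_explained_variance(toy)
```

For the toy matrix the first component alone carries 18/28 of the variance, so a 50% target is met with one component and a 90% target with two.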

The statistical significance of the annotated spectral differences was validated using the Mann–Whitney–Wilcoxon test, as depicted in Figures S3–S6. Statistically significant spectral features exhibited notable differences at specific wavenumbers.

In HER2+ subtypes, as illustrated in Tables S3 and S5, common decreases were observed at 1158 cm−1 (C—C/C—N stretching in proteins), 1448 cm−1 (CH2 bending in lipids and proteins), and 1518 cm−1 (aromatic ring stretching in proteins). Conversely, in HER2− subtypes, common significant increases were identified at 1518 cm−1 (aromatic ring stretching in proteins), with decreases at 941 cm−1 (C—C stretching in proteins) and 1448 cm−1 (CH2 bending in lipids and proteins).

For HR+ subtypes, there were no significant common increases, but common decreases were noted at 941 cm−1 (C—C stretching in proteins) and 1448 cm−1 (CH2 bending in lipids and proteins). Detailed descriptions of which peaks increase or decrease, indicating a respective increase or reduction in the concentration of specific chemical species, can be found in Tables S2–S5.

The performance of the multiclass classification model was assessed using Receiver Operating Characteristic (ROC) analysis. Figure 1A extends the ROC curve to a one‐vs‐rest multiclass analysis, showing the micro‐average and macro‐average ROC curves, which yield Areas Under the Curve (AUCs) of 0.97 and 0.98, respectively. Micro‐averaging aggregates the contributions of all classes into a single ROC curve, so it tends to perform well for balanced datasets but is often dominated by the majority class. In contrast, macro‐averaging gives equal weight to each class, making it more suitable for imbalanced datasets where some classes may be underrepresented. This method ensures that the performance across minority classes is adequately captured and evaluated. In our study, the control class outnumbers each of the four disease classes. Thus, the use of macro‐averaging is critical for properly assessing performance across all classes, especially the smaller ones.
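The difference between the two averaging schemes can be made concrete with a small worked example. The sketch below computes one‐vs‐rest AUCs from the rank‐based definition (the probability that a positive sample scores higher than a negative one) on invented scores for three classes; none of the numbers come from the study.

```python
import numpy as np

def binary_auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    diff = scores[y_true][:, None] - scores[~y_true][None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def macro_micro_auc(labels, score_matrix):
    labels = np.asarray(labels)
    score_matrix = np.asarray(score_matrix, dtype=float)
    n_classes = score_matrix.shape[1]
    onehot = labels[:, None] == np.arange(n_classes)[None, :]
    per_class = [binary_auc(onehot[:, k], score_matrix[:, k]) for k in range(n_classes)]
    macro = float(np.mean(per_class))                       # every class weighted equally
    micro = binary_auc(onehot.ravel(), score_matrix.ravel())  # all decisions pooled
    return per_class, macro, micro

# Invented example: class 0 is the majority and is perfectly separated.
labels = [0, 0, 0, 0, 1, 2]
scores = [[0.90, 0.05, 0.05],
          [0.80, 0.10, 0.10],
          [0.70, 0.20, 0.10],
          [0.60, 0.30, 0.10],
          [0.20, 0.25, 0.55],   # true class 1, ranked poorly
          [0.10, 0.10, 0.80]]
per_class, macro, micro = macro_micro_auc(labels, scores)
print(per_class, round(macro, 3), round(micro, 3))
```

In this toy case the per‐class AUCs are 1.0, 0.8 and 1.0, giving a macro average of about 0.933, while the pooled micro average is about 0.972 because the well‐classified majority class contributes most of the pairs.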

FIGURE 1.


(A) Extension of the ROC analysis for control and the four breast cancer subgroups. (B) Performance of each cross‐validation model on the validation data.

Figure 1A also presents the ROC curves for control and the breast cancer subtypes (HR+HER2−, HR−HER2+, HR−HER2−, and HR+HER2+), which achieved AUCs of 1.00, 0.95, 1.00, 0.95, and 1.00, respectively. This indicates a high separation accuracy. Moreover, at a threshold chosen to give a relatively high true positive rate and a low false positive rate, this model achieved 100%, 90%, 100%, 90%, and 100% sensitivity and 100%, 85%, 100%, 85%, and 100% specificity for control, HR+HER2−, HR−HER2+, HR−HER2−, and HR+HER2+, respectively.
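For a multiclass model, per‐class sensitivity and specificity are obtained by treating one class as positive and all others as negative. The following sketch shows the calculation on invented labels; the function name and the example predictions are assumptions and do not reproduce the study's results.

```python
import numpy as np

def sensitivity_specificity(y_true, y_pred, positive):
    """One-vs-rest sensitivity (TP rate) and specificity (TN rate) for one class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    is_pos = y_true == positive
    pred_pos = y_pred == positive
    tp = np.sum(is_pos & pred_pos)
    fn = np.sum(is_pos & ~pred_pos)
    tn = np.sum(~is_pos & ~pred_pos)
    fp = np.sum(~is_pos & pred_pos)
    return float(tp / (tp + fn)), float(tn / (tn + fp))

# Invented example: one class "A" sample is misclassified as "B".
y_true = ["control"] * 4 + ["A"] * 2 + ["B"] * 2
y_pred = ["control"] * 4 + ["A", "B"] + ["B"] * 2
results = {c: sensitivity_specificity(y_true, y_pred, c) for c in ("control", "A", "B")}
```

A single misclassification lowers the sensitivity of the true class ("A") and the specificity of the predicted class ("B"), which is why the two metrics are reported separately for every class.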

Figure 1B illustrates the validation performance of each cross‐validation model on the validation data. The micro‐average and macro‐average ROC curves for the validation data also demonstrate high AUC values, indicating that the model maintains superior performance even on unseen validation data, thereby underscoring its robustness and generalisability.

To elucidate the spectral features among the different breast cancer subtypes and understand their interrelationships, a hierarchical cluster analysis was conducted, as depicted in Figure 2. The dendrogram illustrates distinct clusters corresponding to the different conditions, demonstrating the model's capacity to differentiate based on spectral features.

FIGURE 2.


Hierarchical clustering dendrogram illustrating the relationships among blood plasma of different breast cancer subtypes and control samples of trial 1 based on their spectral features.

The clustering clearly shows that samples from the same condition tend to group together, forming distinct clusters. HER2− subtypes appear to form a distinct cluster separate from the control and HER2+ subtypes, while the HER2+ subtypes also cluster separately but closer to the control samples.

To validate these findings, a second trial was conducted on a different day, including additional samples from different patients. The dendrogram from the hierarchical clustering of the second trial, shown in Figure 3, indicates a similar clustering pattern as observed in the first trial.

FIGURE 3.


Hierarchical clustering dendrogram illustrating the relationships among blood plasma of different breast cancer subtypes and control samples of trial 2 based on their spectral features.

This consistency reinforces the reliability of the initial results and demonstrates the reproducibility of the hierarchical clustering analysis.

4. Discussion

Our study is the first to utilise RS and machine learning to classify healthy volunteer samples alongside the four major breast cancer subtypes—Luminal A (HR+HER2−), Luminal B (HR+HER2+), HER2‐enriched (HR−HER2+), and TNBC (HR−HER2−)—at Stage Ia. The results demonstrate percentage sensitivities/specificities of 100/100 for the control group, 90/85 for Luminal A, 100/100 for Luminal B, 90/85 for TNBC, and 98/85 for HER2‐enriched. We identified key spectral features linked to cancer biomarkers, which were corroborated by multiple peer‐reviewed studies, affirming the potential of RS and machine learning in early‐stage breast cancer subtyping.

The observed intensity changes of Raman bands in this study offer significant insights into the biochemical alterations associated with cancer, supporting findings from various published studies. For instance, changes in the 878 cm−1 band, associated with proline and hydroxyproline in collagen, highlight alterations in extracellular matrix components, corroborating findings by Movasaghi, Rehman, and Rehman [22].

Two comprehensive studies by Talari et al. [45] and Movasaghi, Rehman, and Rehman [22] prepared a database of molecular fingerprints of biological tissues from various studies, identifying significant spectral features linked to amide I (collagen), alanine, glycine, proline, tyrosine, amide II, amide III, and phenylalanine. These align with our observations, where significant spectral changes were identified at 643 cm−1 (tyrosine and phenylalanine), 861 cm−1 (proline and tyrosine in collagen), and 1449 cm−1 (amide II). These changes correspond to specific biomolecules, crucial for the structural integrity and biochemical processes within cancerous tissues.

The spectral changes observed at 861 cm−1, related to proline and tyrosine in collagen, indicate extracellular matrix modifications, a common feature in tumour progression, supported by studies on collagen changes in breast cancer [22]. Further support comes from two studies by Nargis et al. [29, 33], which utilised a 785 nm laser to analyse blood serum from breast cancer patients and healthy volunteers. They observed significant spectral differences at wavenumbers similar to those in our study, associated with the C—S stretching and C—C twisting of proteins/tyrosine (640 cm−1), collagen type I (857 cm−1), and lipids (1449 cm−1).

Notable changes in the intensity of Raman peaks when comparing breast cancer samples to normal tissue include the peak at approximately 1158 cm−1, linked to the C—C stretching mode in β‐carotene, the peak around 1448 cm−1 associated with CH2 bending modes in lipids and proteins, the peak at approximately 1518 cm−1 corresponding to the C=C stretching mode in β‐carotene, and peaks associated with the Amide III band in proteins. This is corroborated by Pichardo‐Molina et al. [46], who conducted a study using RS and multivariate analysis on serum samples from 11 breast cancer patients (stages II to IV) and 12 healthy controls, noting significant spectral features at 1158 cm−1 (β‐carotene, C—C skeletal stretch), 1448 cm−1 (β sheet and phospholipids), and around 1518 cm−1 (β‐carotene). Medipally et al. [30], using a 785 nm laser, observed significant spectral differences at wavenumbers similar to ours, which were also attributed to β‐carotene. These bands were notably more prominent in the blood plasma of healthy volunteers compared with that of prostate cancer patients, indicating a higher expression of β‐carotene in healthy individuals. This underscores the importance of spectral features associated with β‐carotene in cancer diagnosis using RS and machine learning. Similarly, Li et al. [47] examined 1022 serum blood samples from various cancer patients (stomach, lung, liver, rectum, and oesophagus), concluding that β‐carotene concentration was lower in the serum of cancer patients compared with controls. A similar observation was made in plasma in the current study, where significant differences in β‐carotene‐related peaks between control and breast cancer groups were found at 1158 and 1518 cm−1.

Cameron et al. identified the Amide II band as the most crucial wavenumber region for distinguishing between cancerous and non‐cancerous conditions in blood serum. This prominent peak captures overlapping bands associated with protein secondary structures, including α‐helices and β‐sheets. Variations in this region, as well as the Amide I region, are indicative of disease states. The Amide II band specifically reflects N—H bending and C—N stretching vibrations within protein molecules. This underscores the significance of our findings at 1448 cm−1, indicating that alterations in protein secondary structures are crucial for understanding the biochemical environment of cancer [48, 49]. These changes in the Amide II band are pivotal in cancer diagnostics, confirming their importance in identifying cancer‐related biochemical alterations [50].

The peak at 643 cm−1 is associated with tyrosine and phenylalanine, key components in protein structures, indicative of cancerous changes [22, 51]. The band at 757 cm−1, related to tryptophan, can provide information related to the involvement of various metabolic processes within cancer cells. At 861 cm−1, changes in proline and tyrosine within collagen highlight modifications in the extracellular matrix, typical of cancer [52].

Minor variations in observed wavenumbers between studies are common in RS and can arise from slight differences in instrument calibration, laser excitation wavelength, and data processing methods [53, 54]. Additionally, variations in sample conditions can slightly alter molecular vibrations, resulting in minor shifts in peak positions [54]. These variations underscore the importance of standardising protocols in RS to minimise discrepancies and improve the reproducibility and comparability of results across studies [54, 55].

Overall, the spectral differences between breast cancer and healthy control samples observed in our study are consistent with those reported in the literature, reinforcing the potential of RS in identifying and monitoring cancer biomarkers. The alignment of our findings with established studies validates the use of Raman spectral analysis combined with advanced machine learning techniques for understanding biochemical changes in cancer, enhancing diagnostic capabilities for early detection and monitoring of the disease.

The observed distinctive clustering and clear separation of the subtypes highlight the importance of subtype‐specific disease classification. This finding aligns with the highly impactful work of Perou et al. [56], who developed the PAM50 gene expression assay, which measures the expression levels of 50 genes (the PAM50) to classify breast cancers into one of five intrinsic molecular subtypes: Luminal A, Luminal B, HER2‐enriched, Basal‐like, and Normal‐like. This classification provides valuable prognostic information and helps guide treatment decisions by identifying the specific molecular characteristics of each tumour, offering a more precise, personalised classification system compared with traditional histopathological methods. Moreover, it has been widely adopted in clinical practice and research, significantly contributing to our understanding of breast cancer heterogeneity [57]. The ability to identify specific molecular subtypes has led to better‐targeted therapies and improved patient outcomes [58, 59]. Despite this groundbreaking work by Perou et al., applying the PAM50 assay to blood samples presents several challenges. First, the assay was originally developed for tissue samples, where tumour‐specific gene expression is abundant and distinct. In blood samples, the concentration of CTCs or cell‐free tumour DNA is much lower, making it difficult to detect the specific gene expression signatures required for PAM50 classification. Additionally, the presence of a high background of normal blood cells can obscure the detection of cancer‐specific signals, necessitating highly sensitive and specific techniques to isolate and amplify the relevant genetic material. These technical difficulties complicate the direct application of PAM50 in liquid biopsy settings such as blood samples. It is, therefore, essential to continue developing fast and affordable techniques that can reliably measure a wider range of disease‐associated biochemical changes in biofluids.

The high AUC values emphasise the versatility and robustness of our model, suggesting its potential applicability across different sample types and conditions. Moreover, the consistency in cross‐validation and validation results highlights the reliability of our approach, reinforcing its suitability for deployment in diagnostic workflows.

Significant efforts were undertaken to mitigate overfitting through several layers of customised data processing methods. These included optimising data processing parameters using a LOOCV approach, conducting principal component analysis (PCA) with the original eigenvectors from the training dataset rather than retraining with the validation dataset, and dividing the data into independent training, testing, and validation datasets. Machine learning models were evaluated using a comprehensive range of performance metrics at each iteration (fold). Recognising data bias as a major limiting factor in most published vibrational spectroscopy and RS studies incorporating machine learning, this study aimed to address the issue through tailored overfitting prevention and detection techniques. However, future research will require a larger sample size for statistical validation of the results. Furthermore, acquiring samples from different cancer types and organs beyond those investigated in this study is planned to further test and validate our approach.

Overall, the application of RS with advanced machine learning techniques, as demonstrated in this study, holds significant promise for enhancing cancer diagnosis. The ability to accurately classify subtypes of a disease with high precision and reliability represents a substantial advancement in the field, paving the way for a more personalised, effective, and timely diagnosis of cancer.

5. Conclusion

The integration of RS, a form of vibrational spectroscopy, with AI‐based analysis in liquid biopsy presents significant potential for improving cancer detection and subtyping. This approach offers a valuable complement to traditional diagnostic methods, potentially leading to more efficient, cost‐effective, and personalised cancer treatment.

Our study adds to the expanding body of research demonstrating that RS, when combined with machine learning, can accurately identify subtype‐specific molecular signatures of early (stage Ia) breast cancer. This is achieved through a carefully automated analysis of Raman spectral differences between healthy control and cancer samples, offering a promising approach for future cancer diagnostics and enhancing our understanding of the biochemical changes associated with the early development of cancer.

Conflicts of Interest

The authors declare no conflicts of interest.

Supporting information

Data S1.

Acknowledgments

We extend our sincere gratitude to Northern Ireland Biobank and Breast Cancer Now Tissue Bank for providing the invaluable samples that were essential for our study. Their contributions have significantly facilitated our research, enabling us to achieve meaningful insights and advancements in the field. Without their support, this work would not have been possible.

Data Availability Statement

The data that support the findings of this study are available on request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions.

References

  • 1. Sung H., Ferlay J., Siegel R. L., et al., “Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries,” CA: A Cancer Journal for Clinicians 71, no. 3 (2021): 209–249, 10.3322/caac.21660. [DOI] [PubMed] [Google Scholar]
  • 2. Harbeck N., Penault‐Llorca F., Cortes J., et al., “Breast Cancer,” Nature Reviews Disease Primers 5, no. 1 (2019): 66, 10.1038/s41572-019-0111-2. [DOI] [PubMed] [Google Scholar]
  • 3. Siegel R. L., Miller K. D., Fuchs H. E., and Jemal A., “Cancer Statistics, 2021,” CA: A Cancer Journal for Clinicians 71, no. 1 (2021): 7–33, 10.3322/caac.21654. [DOI] [PubMed] [Google Scholar]
  • 4. Giaquinto A. N., Sung H., Miller K. D., et al., “Breast Cancer Statistics, 2022,” CA: A Cancer Journal for Clinicians 72, no. 6 (2022): 524–541. [DOI] [PubMed] [Google Scholar]
  • 5. CRUK , “Cancer Statistics for the UK,” 2018, https://www.cancerresearchuk.org/health‐professional/cancer‐statistics‐for‐the‐uk.
  • 6. Dillekås H., Rogers M. S., and Straume O., “Are 90% of Deaths From Cancer Caused by Metastases?,” Cancer Medicine 8, no. 12 (2019): 5574–5576, 10.1002/cam4.2474. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7. Milosevic M., Jankovic D., Milenkovic A., and Stojanov D., “Early Diagnosis and Detection of Breast Cancer,” Technology and Health Care 26, no. 4 (2018): 729–759, 10.3233/thc-181277. [DOI] [PubMed] [Google Scholar]
  • 8. Connal S., Cameron J. M., Sala A., et al., “Liquid Biopsies: The Future of Cancer Early Detection,” Journal of Translational Medicine 21, no. 1 (2023): 118, 10.1186/s12967-023-03960-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9. Pantel K. and Alix‐Panabières C., “Liquid Biopsy and Minimal Residual Disease—Latest Advances and Implications for Cure,” Nature Reviews Clinical Oncology 16, no. 7 (2019): 409–424. [DOI] [PubMed] [Google Scholar]
  • 10. Wu T.‐M., Liu J. B., Liu Y., et al., “Power and Promise of Next‐Generation Sequencing in Liquid Biopsies and Cancer Control,” Cancer Control 27, no. 3 (2020): 1073274820934805, 10.1177/1073274820934805. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11. Klein E. A., Richards D., Cohn A., et al., “Clinical Validation of a Targeted Methylation‐Based Multi‐Cancer Early Detection Test Using an Independent Validation Set,” Annals of Oncology 32, no. 9 (2021): 1167–1177, 10.1016/j.annonc.2021.05.806. [DOI] [PubMed] [Google Scholar]
  • 12. Campos‐Carrillo A., Weitzel J. N., Sahoo P., et al., “Circulating Tumor DNA as an Early Cancer Detection Tool,” Pharmacology & Therapeutics 207 (2020): 107458, 10.1016/j.pharmthera.2019.107458. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13. Bettegowda C., Sausen M., Leary R. J., et al., “Detection of Circulating Tumor DNA in Early‐ and Late‐Stage Human Malignancies,” Science Translational Medicine 6, no. 224 (2014): 224ra24, 10.1126/scitranslmed.3007094. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14. Sikaris K. A., “CA125—A Test With a Change of Heart,” Heart, Lung & Circulation 20, no. 10 (2011): 634–640, 10.1016/j.hlc.2010.08.001. [DOI] [PubMed] [Google Scholar]
  • 15. Adhyam M. and Gupta A. K., “A Review on the Clinical Utility of PSA in Cancer Prostate,” Indian Journal of Surgical Oncology 3, no. 2 (2012): 120–129, 10.1007/s13193-012-0142-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16. Crosby D., Bhatia S., Brindle K. M., et al., “Early Detection of Cancer,” Science 375, no. 6586 (2022): eaay9040, 10.1126/science.aay9040. [DOI] [PubMed] [Google Scholar]
  • 17. Pahlow S., Weber K., Popp J., et al., “Application of Vibrational Spectroscopy and Imaging to Point‐of‐Care Medicine: A Review,” Applied Spectroscopy 72, no. S1 (2018): 52–84. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18. Ferraro J. R., Nakamoto K., and Brown C. W., “Chapter 1—Basic Theory,” in Introductory Raman Spectroscopy, 2nd ed., eds. Ferraro J. R., Nakamoto K., and Brown C. W. (San Diego: Academic Press, 2003), 1–94. [Google Scholar]
  • 19. Atkins C. G., Buckley K., Blades M. W., and Turner R. F. B., “Raman Spectroscopy of Blood and Blood Components,” Applied Spectroscopy 71, no. 5 (2017): 767–793, 10.1177/0003702816686593. [DOI] [PubMed] [Google Scholar]
  • 20. Auner G. W., Koya S. K., Huang C., et al., “Applications of Raman Spectroscopy in Cancer Diagnosis,” Cancer and Metastasis Reviews 37, no. 4 (2018): 691–717, 10.1007/s10555-018-9770-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21. Austin L. A., Osseiran S., and Evans C. L., “Raman Technologies in Cancer Diagnostics,” Analyst 141, no. 2 (2016): 476–503, 10.1039/C5AN01786F. [DOI] [PubMed] [Google Scholar]
  • 22. Movasaghi Z., Rehman S., and Rehman I. U., “Raman Spectroscopy of Biological Tissues,” Applied Spectroscopy Reviews 42, no. 5 (2007): 493–541. [Google Scholar]
  • 23. Tsikritsis D., Richmond S., Stewart P., Elfick A., and Downes A., “Label‐Free Identification and Characterization of Living Human Primary and Secondary Tumour Cells,” Analyst 140, no. 15 (2015): 5162–5168, 10.1039/C5AN00851D. [DOI] [PubMed] [Google Scholar]
  • 24. Tipatet K. S., Davison‐Gates L., Tewes T. J., et al., “Detection of Acquired Radioresistance in Breast Cancer Cell Lines Using Raman Spectroscopy and Machine Learning,” Analyst 146 (2021): 3709–3716, 10.1039/D1AN00387A. [DOI] [PubMed] [Google Scholar]
  • 25. Kallaway C., Almond L. M., Barr H., et al., “Advances in the Clinical Application of Raman Spectroscopy for Cancer Diagnostics,” Photodiagnosis and Photodynamic Therapy 10, no. 3 (2013): 207–219. [DOI] [PubMed] [Google Scholar]
  • 26. Chen S., Zhang H., Yang X., et al., “Raman Spectroscopy Reveals Abnormal Changes in the Urine Composition of Prostate Cancer: An Application of an Intelligent Diagnostic Model With a Deep Learning Algorithm,” Advanced Intelligent Systems 3, no. 4 (2021): 2000090. [Google Scholar]
  • 27. Kim S., Kim T. G., Lee S. H., et al., “Label‐Free Surface‐Enhanced Raman Spectroscopy Biosensor for On‐Site Breast Cancer Detection Using Human Tears,” ACS Applied Materials & Interfaces 12, no. 7 (2020): 7897–7904, 10.1021/acsami.9b19421. [DOI] [PubMed] [Google Scholar]
  • 28. Calado G., Behl I., Daniel A., Byrne H. J., and Lyng F. M., “Raman Spectroscopic Analysis of Saliva for the Diagnosis of Oral Cancer: A Systematic Review,” Translational Biophotonics 1, no. 1–2 (2019): e201900001. [Google Scholar]
  • 29. Nargis H. F., Nawaz H., Ditta A., et al., “Raman Spectroscopy of Blood Plasma Samples From Breast Cancer Patients at Different Stages,” Spectrochimica Acta Part A, Molecular and Biomolecular Spectroscopy 222 (2019): 117210, 10.1016/j.saa.2019.117210. [DOI] [PubMed] [Google Scholar]
  • 30. Medipally D. K., Maguire A., Bryant J., et al., “Development of a High Throughput (HT) Raman Spectroscopy Method for Rapid Screening of Liquid Blood Plasma From Prostate Cancer Patients,” Analyst 142, no. 8 (2017): 1216–1226, 10.1039/c6an02100j. [DOI] [PubMed] [Google Scholar]
  • 31. Wang H., Zhang S., Wan L., Sun H., Tan J., and Su Q., “Screening and Staging for Non‐Small Cell Lung Cancer by Serum Laser Raman Spectroscopy,” Spectrochimica Acta Part A: Molecular and Biomolecular Spectroscopy 201 (2018): 34–38, 10.1016/j.saa.2018.04.002. [DOI] [PubMed] [Google Scholar]
  • 32. Githaiga J. I., Angeyo H. K., Kaduki K. A., Bulimo W. D., and Ojuka D. K., “Quantitative Raman Spectroscopy of Breast Cancer Malignancy Utilizing Higher‐Order Principal Components: A Preliminary Study,” Scientific African 14 (2021): e01035, 10.1016/j.sciaf.2021.e01035. [DOI] [Google Scholar]
  • 33. Nargis H. F., Nawaz H., Bhatti H. N., Jilani K., and Saleem M., “Comparison of Surface Enhanced Raman Spectroscopy and Raman Spectroscopy for the Detection of Breast Cancer Based on Serum Samples,” Spectrochimica Acta Part A: Molecular and Biomolecular Spectroscopy 246 (2021): 119034, 10.1016/j.saa.2020.119034. [DOI] [PubMed] [Google Scholar]
  • 34. Pereira de Souza N. M., Machado B. H., Padoin L. V., et al., “Rapid and Low‐Cost Liquid Biopsy With ATR‐FTIR Spectroscopy to Discriminate the Molecular Subtypes of Breast Cancer,” Talanta 254 (2023): 123858, 10.1016/j.talanta.2022.123858. [DOI] [PubMed] [Google Scholar]
  • 35. Goldhirsch A., Winer E. P., Coates A. S., et al., “Personalizing the Treatment of Women With Early Breast Cancer: Highlights of the St Gallen International Expert Consensus on the Primary Therapy of Early Breast Cancer 2013,” Annals of Oncology 24, no. 9 (2013): 2206–2223, 10.1093/annonc/mdt303. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36. Millar E. K., Graham P. H., O'Toole S. A., et al., “Prediction of Local Recurrence, Distant Metastases, and Death After Breast‐Conserving Therapy in Early‐Stage Invasive Breast Cancer Using a Five‐Biomarker Panel,” Journal of Clinical Oncology 27, no. 28 (2009): 4701–4708, 10.1200/jco.2008.21.7075. [DOI] [PubMed] [Google Scholar]
  • 37. Voduc K. D., Cheang M. C., Tyldesley S., Gelmon K., Nielsen T. O., and Kennecke H., “Breast Cancer Subtypes and the Risk of Local and Regional Relapse,” Journal of Clinical Oncology 28, no. 10 (2010): 1684–1691, 10.1200/jco.2009.24.9284. [DOI] [PubMed] [Google Scholar]
  • 38. Meksiarun P., Aoki P. H. B., van Nest S. J., et al., “Breast Cancer Subtype Specific Biochemical Responses to Radiation,” Analyst 143, no. 16 (2018): 3850–3858, 10.1039/c8an00345a. [DOI] [PubMed] [Google Scholar]
  • 39. Cyr A. E. and Margenthaler J. A., “Molecular Profiling of Breast Cancer,” Surgical Oncology Clinics of North America 23, no. 3 (2014): 451–462, 10.1016/j.soc.2014.03.004. [DOI] [PubMed] [Google Scholar]
  • 40. Știufiuc G. F., Toma V., Buse M., et al., “Solid Plasmonic Substrates for Breast Cancer Detection by Means of SERS Analysis of Blood Plasma,” Nanomaterials 10, no. 6 (2020): 1212, 10.3390/nano10061212. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41. De Maesschalck R., Jouan‐Rimbaud D., and Massart D. L., “The Mahalanobis Distance,” Chemometrics and Intelligent Laboratory Systems 50, no. 1 (2000): 1–18. [Google Scholar]
  • 42. Kneen M. A. and Annegarn H. J., “Algorithm for Fitting XRF, SEM and PIXE X‐Ray Spectra Backgrounds,” Nuclear Instruments and Methods in Physics Research Section B: Beam Interactions with Materials and Atoms 109 (1996): 209–213. [Google Scholar]
  • 43. Eilers P. H. C., “A Perfect Smoother,” Analytical Chemistry 75, no. 14 (2003): 3631–3636. [DOI] [PubMed] [Google Scholar]
  • 44. Nielsen F., “Hierarchical Clustering,” in Introduction to HPC With MPI for Data Science (Cham: Springer, 2016), 195–211. [Google Scholar]
  • 45. Talari A. C. S., Movasaghi Z., Rehman S., and Rehman I. u., “Raman Spectroscopy of Biological Tissues,” Applied Spectroscopy Reviews 50, no. 1 (2015): 46–111, 10.1080/05704928.2014.923902. [DOI] [Google Scholar]
  • 46. Pichardo‐Molina J. L., Frausto‐Reyes C., Barbosa‐García O., et al., “Raman Spectroscopy and Multivariate Analysis of Serum Samples From Breast Cancer Patients,” Lasers in Medical Science 22, no. 4 (2007): 229–236, 10.1007/s10103-006-0432-8. [DOI] [PubMed] [Google Scholar]
  • 47. Li X. Z., Bai J., Lin J., Liu H., and Ding J., Study of Serum Fluorescence and Raman Spectra for Diagnosis of Cancer, vol. 4432 (Washington, DC: Optica Publishing Group, 2001), 124–130, 10.1364/ECBO.2001.4432_124. [DOI] [Google Scholar]
  • 48. Frank C. J., Redd D. C. B., Gansler T. S., and McCreery R. L., “Characterization of Human Breast Biopsy Specimens With Near‐IR Raman Spectroscopy,” Analytical Chemistry 66, no. 3 (1994): 319–326. [DOI] [PubMed] [Google Scholar]
  • 49. Manoharan R., Shafer K., Perelman L., et al., “Raman Spectroscopy and Fluorescence Photon Migration for Breast Cancer Diagnosis and Imaging,” Photochemistry and Photobiology 67, no. 1 (1998): 15–22. [PubMed] [Google Scholar]
  • 50. Cameron J. M., Sala A., Antoniou G., et al., “A Spectroscopic Liquid Biopsy for the Earlier Detection of Multiple Cancer Types,” British Journal of Cancer 129, no. 10 (2023): 1658–1666, 10.1038/s41416-023-02423-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51. Talari A. C. S., Rehman S., and Rehman I. U., “Advancing Cancer Diagnostics With Artificial Intelligence and Spectroscopy: Identifying Chemical Changes Associated With Breast Cancer,” Expert Review of Molecular Diagnostics 19, no. 10 (2019): 929–940. [DOI] [PubMed] [Google Scholar]
  • 52. Rehman S., Movasaghi Z., Tucker A. T., et al., “Raman Spectroscopic Analysis of Breast Cancer Tissues: Identifying Differences Between Normal, Invasive Ductal Carcinoma and Ductal Carcinoma In Situ of the Breast Tissue,” Journal of Raman Spectroscopy 38, no. 10 (2007): 1345–1351. [Google Scholar]
  • 53. Smith E. and Dent G., “Introduction, Basic Theory and Principles,” in Modern Raman Spectroscopy—A Practical Approach (Wiley, 2005), 1–21, 10.1002/0470011831. [DOI] [Google Scholar]
  • 54. Butler H. J., Ashton L., Bird B., et al., “Using Raman Spectroscopy to Characterize Biological Materials,” Nature Protocols 11, no. 4 (2016): 664–687, 10.1038/nprot.2016.036. [DOI] [PubMed] [Google Scholar]
  • 55. Byrne H. J., Baranska M., Puppels G. J., et al., “Spectropathology for the Next Generation: Quo Vadis?,” Analyst 140, no. 7 (2015): 2066–2073, 10.1039/c4an02036g. [DOI] [PubMed] [Google Scholar]
  • 56. Perou C. M., Sørlie T., Eisen M. B., et al., “Molecular Portraits of Human Breast Tumours,” Nature 406, no. 6797 (2000): 747–752, 10.1038/35021093. [DOI] [PubMed] [Google Scholar]
  • 57. Parker J. S., Mullins M., Cheang M. C. U., et al., “Supervised Risk Predictor of Breast Cancer Based on Intrinsic Subtypes,” Journal of Clinical Oncology 27, no. 8 (2009): 1160–1167. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58. Nielsen T. O., Parker J. S., Leung S., et al., “A Comparison of PAM50 Intrinsic Subtyping With Immunohistochemistry and Clinical Prognostic Factors in Tamoxifen‐Treated Estrogen Receptor–Positive Breast Cancer,” Clinical Cancer Research 16, no. 21 (2010): 5222–5232. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59. Prat A., Parker J. S., Karginova O., et al., “Phenotypic and Molecular Characterization of the Claudin‐Low Intrinsic Subtype of Breast Cancer,” Breast Cancer Research 12, no. 5 (2010): R68, 10.1186/bcr2635. [DOI] [PMC free article] [PubMed] [Google Scholar]

Articles from Journal of Biophotonics are provided here courtesy of Wiley
