PNAS Nexus. 2024 Jul 3;3(8):pgae268. doi: 10.1093/pnasnexus/pgae268

Raman spectroscopic deep learning with signal aggregated representations for enhanced cell phenotype and signature identification

Songlin Lu, Yuanfang Huang, Wan Xiang Shen, Yu Lin Cao, Mengna Cai, Yan Chen, Ying Tan, Yu Yang Jiang, Yu Zong Chen
Editor: Yannis Yortsos
PMCID: PMC11348106  PMID: 39192845

Abstract

Feature representation is critical for data learning, particularly for learning spectroscopic data. Machine learning (ML) and deep learning (DL) models learn Raman spectra for rapid, nondestructive, and label-free cell phenotype identification, which facilitates diagnostic, therapeutic, forensic, and microbiological applications. But these models are challenged by high-dimensional, unordered, and low-sample spectroscopic data. Here, we introduced novel 2D image-like dual signal and component aggregated representations by restructuring Raman spectra and principal components, which enable spectroscopic DL for enhanced cell phenotype and signature identification. New ConvNet models, DSCARNets, significantly outperformed the state-of-the-art (SOTA) ML and DL models on six benchmark datasets, mostly with >2% improvement over the SOTA performances of 85–97% accuracies. DSCARNets also performed well on four additional datasets against SOTA models of extremely high performances (>98%) and on two datasets without a published supervised phenotype classification model. Explainable DSCARNets identified Raman signatures consistent with experimental indications.

Keywords: Raman spectra, cell phenotyping, feature representation, Raman signature identification, deep learning


Significance Statement.

Raman spectroscopic deep learning (DL) is useful for cell phenotype identification, with broad applications in diagnostic, therapeutic, forensic, and microbiological tasks, and with additional applications in serum-based diagnostics, food assessment, and substance detection. But these tasks are challenged by high-dimensional and unordered spectroscopic data, and new representations are needed to overcome these problems. Here, a novel spectroscopic 2D representation was introduced. New DL models with these representations significantly outperformed the state-of-the-art models on benchmark tasks and additional datasets, and also identified Raman signatures consistent with experimental indications. These results indicate the potential of our new method in spectroscopic DL.

Introduction

Successful deep learning (DL) critically depends on data representations. This is particularly important for spectroscopic learning in broad applications, including Raman spectroscopic cell phenotype (RSCP) identification (1–6), disease diagnosis (7), food quality assessment (8, 9), and laser-induced breakdown spectroscopy (LIBS) identification of geological materials (10). In particular, RSCP identification facilitates rapid, nondestructive, and label-free disease diagnosis (1, 2), bio-product quality control (3), therapeutics evaluation (4–6), forensic assessment (11), and microbe sorting (12, 13). Machine learning (ML) (3, 14–19) and DL (20, 21) models have been developed by learning spectral features or their principal components (PCs) for RSCP identification. But these tasks are challenged by the high-dimensional, unordered, and low-sample data (HULD) of spectroscopic learning: the dimensionality of RSCP data spans ∼333–2,644 signals, Raman signals (or Raman shifts) are highly variable across a diverse wavenumber range of 320–3,400 cm−1, and most literature-reported RSCP data fall in the low-sample range of 150–680 samples (Table 1). Improved spectroscopic representations are needed for enhanced learning of spectroscopic data.

Table 1.

Benchmark and additional datasets of Raman spectroscopic cell phenotyping tasks.

| Task type | Dataset | No. of samples (spectra) | Range of Raman shift in cm−1 (no. of features) | SOTA ML or DL method | SOTA performance |
|---|---|---|---|---|---|
| **Benchmark** | | | | | |
| Cancer type differentiation | MG: melanoma SK-MEL-28, SK-MEL-2, and MW-266-4 cells of genotypes BRAF V600E, BRAF V600D, and NRAS Q61R (14) | 150 | 600–1,800 (1,362) | ANN of PCA (first 9 PCs) (14) | 96.7% (14) |
| Cancer type differentiation | ACS: five cancer cell lines (HL60, HT29, HCT116, SW620, and SW480) (15) | 680 | 730–1,750; 2,800–3,100 (2,644) | LDA of PCA (first 25 PCs) (15) | 92.4% (15) |
| Stem cell stage differentiation | SCS1, SCS2: human induced pluripotent stem cell (iPSC), neural stem cell (NSC), and neuron cell (3) | 8,774 | SCS1: 320–1,800 (443); SCS2: 320–3,400 (1,019) | SVM, RBF-SVM (3) | SCS1: 92.8% (3); SCS2: not applicable (NA) |
| Inflammatory/immune response cell state | MCS: macrophage cells stimulated by lipopolysaccharide vs. naive state (19) | 3,140 | 528–1,800; 2,700–3,076 (644) | Logistic regression (19) | 85.6% (19) |
| Bacteria species differentiation | PB2: methicillin-resistant and -susceptible isolates of S. aureus (binary class) (20) | 11,000 | 382–1,792 (1,000) | U-Net with SENets (22) | 93.0% (22) |
| Bacteria species differentiation | PB30: 30 bacterial and yeast isolates (30 classes) (20) | 66,000 | 382–1,792 (1,000) | U-Net with SENets (22) | 85.1% (22) |
| **Additional datasets** | | | | | |
| Cancer type differentiation | ACM: adenocarcinoma non-metastatic vs metastatic type (SW480 vs SW620 cells) (15) | 330 | 730–1,750; 2,800–3,100 (2,644) | LDA of PCA (first 25 PCs) (15) | 98.7% (15) |
| Stem cell type differentiation | SCCA3: three types of stem and embryonic cells (hMSCs, hiPSCs, and MEFs) in adherent culture (17) | 200 | 600–2,980 (1,590) | LDA of PCA (first 5 PCs) (17) | 100% (17) |
| Stem cell type differentiation | SCCS4: four stem, embryonic, and leukemia cell types (hMSC, hiPSC, MEF, and Jurkat) in suspension culture (17) | 300 | 600–2,980 (1,590) | LDA of PCA (first 5 PCs) (17) | 100% (17) |
| Inflammatory/immune response cell state | T-ALLS: T-cell naive vs antigen-stimulated state (Jurkat cells) (17) | 600 | 600–2,980 (1,590) | MLP of PCA (first 5 PCs) (17) | AUC 0.98 (17) |
| Cancer cell state differentiation | MMC: metastatic melanoma cells of de-differentiation states melanocytic, transitory, neural-crest-like, and mesenchymal (M262, M229, M397, M409, and M381 cells) (18) | 250 | 690–3,141 (1,570) | NA (18) | NA |
| Bacterial growth stage differentiation | ECOP: E. coli lag, log, and early stationary phases (three replicates each) (16) | 536 | 600–1,800 (333) | Unsupervised hierarchical clustering and PhenoGraph (16) | NA |

Benchmark datasets have been tested by the SOTA ML or DL models with good phenotypic performances (ACC 85–97%). Additional datasets are either tested by the SOTA models with extremely high (ACC 98–100%) performances or without a published phenotype classification model. Data split method and performance metric of each SOTA model are given in SI Appendix, Table S1.

SOTA, state of the art; ACC, overall accuracy; ANN, artificial neural network; LDA, linear discriminant analysis; SVM, support vector machines; RBF-SVM, radial basis function kernel SVM; PCA, principal component analysis; PC, principal component; MLP, multilayer perceptron.

Enhanced protein structure and pharmaceutical property predictions have been achieved by DL of appropriate molecular representations. For instance, AlphaFold learns protein sequence evolutionary and inter-residue spatial representations for high-performance protein structure prediction (23). Pharmaceutical DL learns molecular graph representations (24–26) or physicochemical and substructural representations for predicting diverse and complex pharmacological properties (27). The omics and pharmaceutical data are also HULD. In some studies, these data have been restructured into image-like 2D representations based on neighborhood affiliation (28), functional relationship (29, 30), and manifold-guided feature aggregation (27, 31). These 2D representations enable efficient DL of the omics and pharmaceutical HULD with deep ConvNet architectures, leading to state-of-the-art (SOTA) performances on omics and pharmaceutical benchmark datasets (27, 28, 31). This raises the question of whether and how DL of RSCP data may be enhanced by such data restructuring strategies.

Restructuring spectroscopic data may be more challenging than omics and pharmaceutical data. For instance, transcriptomic signatures are composed of functionally related genes that exhibit similar cross-sample profiles (32), which can be aggregated together into neighborhoods in the 2D representations by the established data restructuring methods (27–31). RSCP signatures typically cover variable signals of complex biological components [e.g. amino acids, proteins, lipids, and nucleic acids (14, 15)]. The numerous varieties of correlated spectral signals are more difficult to aggregate into limited neighborhood space in the 2D representations. New methods are needed to supplement established methods for optimal aggregation of RSCP features.

So far, ML has been the primary tool in the reported RSCP studies, particularly for low-sample (150–680 samples) and intermediate-sample (3,140–8,774 samples) tasks, while DL has only been explored in higher-sample tasks (11,000–66,000 samples) (Table 1). Most (80%) low-sample SOTA RSCP ML models exploit multiple PCs of Raman spectra as input features (14, 15, 17) (Table 1). Each PC of RSCP data combines a high number of signals of amino acids, proteins, lipids, and nucleic acids (14, 15). Therefore, spectroscopic PCs are useful features for RSCP tasks. We further argue that dual restructuring of spectra and PCs enables the aggregation of more varieties of RSCP features into the neighborhood spaces of 2D representations, which facilitates enhanced DL of RSCP data. In this work, we introduced a new method for restructuring spectroscopic signals and their PCs into novel 2D signal aggregated representations (2D-SARs) and 2D component aggregated representations (2D-CARs), respectively. These representations can be jointly used as dual signal–component aggregated representations (2D-DSCARs) for learning RSCP data.

We constructed new dual-path ConvNet models, DSCARNets, for the RSCP tasks with 2D-DSCARs as input features. Each individual path of a DSCARNet can also be used independently as a single-path ConvNet with 2D-SARs or 2D-CARs as input features. While the SOTA ML models have scored well in various RSCP tasks, their performance is weaker in certain tasks, with ∼86–93% phenotyping performances in two of the five cell state and stage tasks (Table 1). Established 1D deep ConvNets have been employed for the two higher-sample RSCP tasks, achieving SOTA performances of ∼85–93% accuracies (22). DSCARNets may complement these established ML and DL methods for low-to-high-sample RSCP tasks. In this work, DSCARNets were tested on six benchmark datasets where the SOTA ML and DL models have produced good performances of 85–97% accuracies (Table 1). DSCARNets were also evaluated on four datasets where the SOTA models have achieved extremely high performances (>98%) and on two datasets without a published supervised phenotyping model. The performances of DSCARNets were compared to the SOTA ML and DL models using the same dataset, data split method, and evaluation metric.
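The dual-path idea described above can be sketched in code. The following is an illustrative PyTorch toy, not the authors' published DSCARNet architecture: the layer sizes, the 32×32 input resolution, and the five-class output are all assumptions. One path consumes the 2D-SAR image, the other the 2D-CAR image, and their pooled features are concatenated before classification.

```python
# A minimal dual-path ConvNet sketch: hypothetical layer sizes,
# not the published DSCARNet architecture.
import torch
import torch.nn as nn

class DualPathNet(nn.Module):
    def __init__(self, n_classes: int):
        super().__init__()
        def path():
            return nn.Sequential(
                nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )
        self.sar_path = path()   # processes 2D-SARs
        self.car_path = path()   # processes 2D-CARs
        self.head = nn.Linear(32 + 32, n_classes)

    def forward(self, sar, car):
        # Concatenate the pooled features of the two paths.
        return self.head(torch.cat([self.sar_path(sar), self.car_path(car)], dim=1))

model = DualPathNet(n_classes=5)
logits = model(torch.randn(4, 1, 32, 32), torch.randn(4, 1, 32, 32))
print(logits.shape)  # torch.Size([4, 5])
```

Either path can be used alone (as the single-path 2D-SAR or 2D-CAR models compared later) by attaching a classification head directly to its 32-dimensional output.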

Distinctive Raman peaks (signatures) serve as biomarkers and mechanistic clues (1, 3). Raman signatures have been identified by searching for phenotype-differentiating spectra (1, 3, 14–19), an approach that tends to select top differential spectra and neglects the groups of signals that collectively indicate phenotype states. Explainable DL algorithms identify signatures with broad coverage of features beyond top differential features (33), but are challenged by the HULD problem (33). Notably, object-discriminative pixels (signatures) of natural images are locally clustered (34), and the omics biomarkers identified by an explainable DL algorithm are also locally clustered in the 2D representations (31). This similarity between localized patterns of natural image pixels and DL-identified signatures suggests the usefulness of explainable DL for signature identification, particularly the DSCARNet explainable algorithm for identifying biologically relevant Raman signatures from RSCP data.

Spectroscopic data priors and manifold-guided data restructuring into 2D spectra and component aggregated representations

Biological samples present subtle Raman spectroscopic variations that are masked by artifacts such as instrumental drift and measurement error (35), which complicates data restructuring tasks. Nevertheless, Raman signals reveal intrinsic molecular vibrational fingerprints of biological samples, and multiple signals display correlated cross-phenotype and cross-sample variational patterns (3). These patterns or fingerprints are valuable priors for appropriate representations in spectroscopic learning. Successful spectroscopic DL hinges on the ability to leverage these priors. Data priors reflect the intrinsic statistical properties of the data; examples are the stationarity, compositionality, and correlation properties of natural images and speech (36, 37). The local statistics of data priors can be efficiently captured by ConvNet architectures, where the hierarchical layers of convolutional filters mimic the effects of receptive fields and exploit the spatial correlations within the data (37).

Raman signals between 700 and 1,800 cm−1 contain rich molecular fingerprints (38), reflecting the underlying molecular vibrations (35), internal states (19), and variational patterns (3) of biological samples. For instance, several Raman signals of saccharides at 400 cm−1, glycogen at 480 cm−1, and cytochrome C at 746 and 1,126 cm−1 exhibit correlated variational patterns across different stages of human induced pluripotent stem cells (iPSCs) and iPSC-derived neural cells (3). Moreover, cellular Raman spectra exhibit strong CH2 and CH3 bond signals in the 2,800–3,000 cm−1 high-wavenumber bands (38). These spectroscopic priors in the spectral space differ from the priors of natural images in the pixel space. Some spectroscopic signals may be correlated across samples in the sample space, but not necessarily close to one another in the spectral space (3). Due to their different vibrational modes, these correlated signals typically spread across a wide wavenumber range in the spectral space (3). Without data restructuring, these spread data priors may not be fully captured in the spectral space by ConvNet filters designed for the local statistics of natural images.

To exploit the efficient learning capability of ConvNets, Raman spectra need to be restructured into more suitable representations, i.e. from the low-to-high-wavenumber layout in the spectral space into a spatially correlated layout in a 2D representation space, where each pixel represents a Raman signal and correlated signals are clustered together in the local 2D neighborhood space. Each signal across multiple samples forms a vector in the sample space, where its components are the signal intensities of individual samples. The relationships of signals are manifested by the variational profiles of the vectors in the sample space. These vectors in the high-dimensional sample space can be projected into 2D-SARs by a manifold learning algorithm such as uniform manifold approximation and projection (UMAP), where the pixel layout in 2D-SARs has a similar topological distribution to that in the sample space (39).

UMAP exploits local manifold approximations and local fuzzy simplicial set representations to construct topological representations of the high-dimensional datapoints, where the layout of lower-dimensional datapoints (e.g. 2D-SARs) is determined by minimizing the cross-entropy between the two topological representations in high and lower dimensions (e.g. 2D) (39). UMAP involves two steps. The first step constructs a graph in high dimensions, where the high-dimensional topology is represented by combinations of basic building blocks called simplices. These simplices form a Čech complex, where the connectedness of the datapoints is fuzzy, i.e. closer datapoints are more likely to be connected and farther datapoints are less likely to be connected, under the constraint that every datapoint must be connected to at least its closest neighbor. The second step finds the most similar graph in lower dimensions.
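The restructuring described above (signals as points in sample space, embedded to 2D, then laid out as pixels) can be sketched with NumPy only. A toy SVD embedding stands in for UMAP, the paper's choice, to keep the sketch free of non-standard dependencies, and the grid-assignment scheme (nearest cell with linear probing on collision) is an illustrative simplification, not the authors' exact layout algorithm.

```python
# Sketch: each Raman signal becomes one pixel of an image-like 2D
# representation (the 2D-SAR idea), with nearby signals in the
# embedding landing in nearby cells.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 64))          # 60 samples x 64 Raman signals

# Each signal is a vector in sample space -> embed the *transpose*.
# Toy SVD embedding; the paper uses UMAP(n_components=2) here.
U, s, Vt = np.linalg.svd(X.T - X.T.mean(0), full_matrices=False)
coords = U[:, :2] * s[:2]

def assign_to_grid(coords, k):
    """Map each 2D point to a distinct cell of a k x k grid."""
    lo, hi = coords.min(0), coords.max(0)
    scaled = (coords - lo) / (hi - lo + 1e-12) * (k - 1)
    grid = -np.ones((k, k), dtype=int)
    for i, (x, y) in enumerate(np.rint(scaled).astype(int)):
        r, c = y, x
        while grid[r, c] != -1:        # linear probe to next free cell
            c = (c + 1) % k
            r = (r + (c == 0)) % k
        grid[r, c] = i
    return grid

grid = assign_to_grid(coords, k=8)     # 64 signals -> 8 x 8 image
# One sample's 2D-SAR: pixel intensity = that sample's signal value.
sar = X[0][grid]
print(sar.shape)                       # (8, 8)
```

In practice the layout is computed once from the training spectra, and every sample's spectrum is then rendered through the same signal-to-pixel map, so all 2D-SARs share one geometry.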

In some cases, multiple signals are correlational across samples (3). A high number of correlational signals may not be fully clustered together in 2D-SARs due to limited local space and limitations of a manifold learning algorithm. For instance, there are four cytochrome C, two saccharide, and two glycogen peaks in the Raman spectra of iPSC-derived neural stem cell (NSC) (3) (SI Appendix, Fig. S1A, S1B). In the 2D-SAR of these spectra, the two saccharide peaks and two glycogen peaks were, respectively, clustered together, but only two of the four cytochrome C peaks were clustered together, while the other two peaks were placed further apart from the local cluster. Hence, not all relevant signals can be clustered together in the local space of 2D-SAR.

Additional 2D representations are needed for optimal signal clustering. Notably, most SOTA spectroscopic ML models use PCs as input features (Table 1). Each PC combines multiple signals that collectively separate samples and reveal variational patterns (40). PC selection has the effect of maximizing the correlation between signals and representations (41). Moreover, individual PC pairs reveal various sample separation patterns (42). Hence, we argue that PC pairs may serve as new 2D representations for the RSCP tasks. We also noticed that up to 30–100 PCs are needed for covering rare variants in statistical genetics (43). Therefore, we tentatively used 25–75 PCs for constructing PC pairs, while the actual number of PCs was adjusted with respect to the RSCP performances. We employed UMAP for restructuring the PC pairs into 2D-CARs.
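A minimal sketch of extracting the PC features that feed the 2D-CAR construction, assuming plain SVD-based PCA on centered spectra; the 25 components mirror the lower end of the 25–75 PC range mentioned above, and the subsequent UMAP restructuring of PC pairs is omitted. The spectra matrix here is random illustrative data.

```python
# Per-sample principal-component scores via SVD-based PCA.
import numpy as np

rng = np.random.default_rng(1)
spectra = rng.normal(size=(300, 1000))      # 300 spectra x 1,000 shifts

Xc = spectra - spectra.mean(axis=0)         # center before PCA
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
n_pc = 25
scores = U[:, :n_pc] * s[:n_pc]             # per-sample PC scores
explained = (s[:n_pc] ** 2) / (s ** 2).sum()  # variance ratio per PC

print(scores.shape)                         # (300, 25)
print(float(explained.sum()))               # fraction of variance kept
```

With real spectra, `explained` indicates how many PCs are worth pairing; on this random data it stays low because there is no dominant structure.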

Results and discussion

Manifold-guided restructuring of Raman spectra into 2D representations and the clustering of spectral signatures

A new UMAP-based algorithm was developed for manifold-guided restructuring of Raman spectra into 2D-SARs and 2D-CARs. Figure 1A–C and SI Appendix, Fig. S2 present the flowcharts of 2D-SAR and 2D-CAR construction, the DSCARNet architecture, and the learning procedure. Figure 2 presents the 2D-SARs of four Raman spectra from iPSC (two spectra), iPSC-derived NSC, and neuron cells (3). These 2D-SARs contain phenotype-discriminative zones that vary across cell types and between spectra of the same cell type. Zone-A is composed of pink, purple, green, and yellow pixels, where the pink pixel is lighter (lower abundance) in the two iPSC spectra, the yellow pixels are darker (higher abundance) in the NSC spectrum and lighter in the neuron cell spectrum, and the green pixels are of roughly similar brightness across all cell types.

Fig. 1.

Flowchart of spectroscopic data restructuring and DL architecture of Raman spectra. A) Manifold-guided Raman spectra restructuring into 2D spectra aggregated representation (2D-SARs); B) restructuring of the PCs of Raman spectra into component aggregated representations (2D-CARs); C) CNN-based DL architecture in this work.

Fig. 2.

2D-SARs of human iPSC (two spectra: iPSC_1 and iPSC_2), NSC, and neuron cells. Examples of phenotype-discriminative zones are highlighted in the dashed boxes of zone-A, zone-B, and zone-C. Examples of the literature-reported markers are indicated by the pixels at the heads of the arrows. The original Raman spectra of these cells are provided below the 2D-SARs. A)–D) 2D-SARs of four cells; E) original Raman spectra of the same four cells.

In zone-B, the upper pixels of the NSC spectra are slightly darker, and most pixels of the neuron cell are slightly lighter, than in the two iPSC spectra. In zone-C, the lower pixels of one iPSC cell are slightly darker than those of the other cells, and the middle pixels of the NSC and neuron cells are slightly darker. These distinctive spectral patterns facilitate phenotypic discrimination. These zones contain some of the literature-reported Raman markers for discriminating iPSC and iPSC-derived neural cells (3). Due to experimental and processing artifacts, the processed Raman signals of the same molecular modes may deviate from one another by margins of ∼0.1–1.67 cm−1 (44). Hence, Raman signals within ±2 cm−1 of the reported biomarkers were tentatively considered the same. The reported markers of the saturated lipid band and protein amide-I band around 1,295 and 1,660 cm−1 (3) were found in zone-A (the pixels at the tips of the right and left arrows, respectively). Two cytochrome C bands around 1,580 cm−1 (3) were found in zone-B. The saccharide and glycogen bands around 400, 417, and 480 cm−1 (3) were found in zone-C (saccharides: lower right and left; glycogen: upper right and left). The clustering of the reported markers in these phenotype-discriminative zones suggests that 2D-SARs are useful for phenotype and signature identification in RSCP tasks. Moreover, in 2D-SARs, local bands of Raman signals along the local wavenumber zones of the spectral space tend to be localized in small batches of pixels, while some bands of signals remote in the spectral space are brought into neighboring blocks (SI Appendix, Fig. S3). Therefore, 2D-SARs are useful for revealing both the local patterns of Raman bands and the intrinsic correlations among remote signals.
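The ±2 cm−1 tolerance used above for matching literature-reported markers can be expressed as a nearest-shift lookup. The marker positions below are the ones quoted in the text; the measured wavenumber grid is an illustrative assumption.

```python
# Match reported marker wavenumbers to measured Raman shifts
# within a +/-2 cm^-1 tolerance window.
import numpy as np

shifts = np.arange(320.0, 3400.0, 3.2)   # illustrative wavenumber grid
markers = [400.0, 417.0, 480.0, 1295.0, 1580.0, 1660.0]

def match_marker(shifts, marker, tol=2.0):
    """Index of the measured shift closest to `marker`, or None if
    nothing lies within the tolerance window."""
    i = int(np.argmin(np.abs(shifts - marker)))
    return i if abs(shifts[i] - marker) <= tol else None

matched = {m: match_marker(shifts, m) for m in markers}
print(sum(v is not None for v in matched.values()))  # 6
```

On this 3.2 cm−1 grid every marker lands within tolerance; with a sparser grid or a tighter tolerance, `match_marker` returns None and the marker is reported as absent.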

Spectral aggregated representations and component aggregated representations complement each other in presenting phenotype-distinguishing patterns

Figure 3 shows the comparison of the 2D-SARs and 2D-CARs of the Raman spectra of four microbes: Escherichia coli, Staphylococcus aureus, Salmonella enterica, and Streptococcus pneumoniae (20). The original average Raman spectra of these microbes are provided in SI Appendix, Fig. S4. As expected, both the 2D-SARs and 2D-CARs present phenotype-discriminative patterns, such as zone-A, zone-B, and zone-C of the 2D-SARs and zone-1 to zone-4 of the 2D-CARs. These zones present two types of phenotype-discriminative patterns. Type-1 patterns are blotches of distinctive shapes, found in zone-B of the 2D-SARs of E. coli, zone-3 of the 2D-CARs of S. aureus, zone-B of the 2D-SARs of S. enterica, and zone-4 of the 2D-CARs of S. pneumoniae. Type-1 blotches provide distinctive edges of different orientations for discriminative features and object recognition. Notably, the 2D-SARs and 2D-CARs of each microbe complement each other in jointly presenting type-1 blotches; exploring both representations is therefore more advantageous for learning distinguishing features in RSCP tasks. Type-2 patterns are blotches of different color brightness, found in all the remaining zones of the 2D-SARs and 2D-CARs that do not contain type-1 blotches. Similar blotch patterns can be found in the 2D-SARs and 2D-CARs of the 26 other microbes in the PB30 dataset (SI Appendix, Fig. S5). These blotches collectively provide more varieties and combinations of discriminative features for RSCP tasks.

Fig. 3.

2D-SARs and 2D-CARs of four microbes (average spectra). Examples of microbe-discriminative zones of 2D-SARs are highlighted in the dashed box of zone-A, zone-B, and zone-C, and those of 2D-CARs are in the dashed box of zone-1, zone-2, zone-3, and zone-4. A) 2D-SARs of four microbes; B) 2D-CARs of four microbes.

DSCARNet DL significantly outperformed the SOTA ML and DL models in benchmark tests

We collected from the literature six RSCP benchmark datasets with available spectra and literature-reported ML or DL models, all with phenotype classification accuracies (ACC, overall accuracy) between 85 and 97% (Table 1). DSCARNet models were developed with respect to these SOTA ML and DL models using the same dataset, data split method, evaluation metric, and classification setup (SI Appendix, Table S1). Specifically, the Raman spectra and PCs of each dataset were restructured into 2D-SARs and 2D-CARs, followed by the training of a DSCARNet with 2D-DSCARs (dual 2D-SARs and 2D-CARs) as inputs (Fig. 1A, B). The performance of each DSCARNet model was compared with the corresponding SOTA ML or DL model (Fig. 4A). DSCARNet models outperformed the SOTA models by 2.0–4.7% margins on four datasets and 0.5–1.2% margins on two datasets, which consistently demonstrates that DL of 2D-DSCARs enables significantly enhanced cell phenotyping capabilities on various RSCP tasks.

Fig. 4.

Performance of A) DSCARNet models with respect to the SOTA ML and DL models on the benchmark datasets and B) DSCARNet models with different representations (single-path 2D-SARs, single-path 2D-CARs, and dual-path 2D-DSCARs) on 12 datasets. All models were trained using the same datasets and data split methods as the corresponding SOTA models reported in the literature.

There are four datasets, ACS, SCS1, MG, and PB2, on which the SOTA models scored fairly good ACCs of 92.4–96.7% (3, 14, 15, 22), and DSCARNet further improved the performances by significant margins of 3.1, 2.2, 2.0, and 1.2%, respectively (Fig. 4A, SI Appendix, Tables S2–S4). ACS includes 680 Raman spectra of five cancer cell lines (HL60, HT29, HCT116, SW620, and SW480) (15). SCS1 contains 8,774 Raman spectra of human iPSCs, NSCs, and neuron cells (3). MG is composed of 150 Raman spectra of melanoma SK-MEL-28, SK-MEL-2, and MW-266-4 cells of genotypes BRAF V600E, BRAF V600D, and NRAS Q61R, respectively (14). PB2 contains 11,000 Raman spectra of methicillin-resistant and methicillin-susceptible isolates of S. aureus (22). These datasets span different cell types, from stem cells to cancer cells and bacterial cells. The consistent performance improvement indicates the usefulness of DSCARNet across RSCP tasks of different cell types.

Here, the DSCARNet models for the four datasets were tested by various data split methods and evaluation metrics in accordance with those of the SOTA models. For a more thorough test of our method, we further developed and evaluated DSCARNet models by the 5-fold cross-validation (5-FCV) method. In 5-FCV, a dataset is randomly divided into five sets; each set is used once for testing with the other four sets used for training, and the average accuracy over the five test sets is used as the performance measure. As shown in SI Appendix, Tables S5–S8, the 5-FCV ACCs of the DSCARNet models for the four datasets are similar to those described above (i.e. based on the various evaluation methods of the SOTA models), and the ACCs are consistent across the five folds of each 5-FCV test.
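The 5-FCV protocol described above can be sketched as follows; `evaluate` is a hypothetical stand-in for training a DSCARNet on one fold split and returning its test accuracy.

```python
# 5-fold cross-validation: shuffle, split into five folds, hold each
# fold out once, and average the per-fold test accuracies.
import numpy as np

def five_fold_indices(n_samples, seed=0):
    """Shuffle indices and split them into five near-equal folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    return np.array_split(idx, 5)

def cross_validate(n_samples, evaluate):
    folds = five_fold_indices(n_samples)
    accs = []
    for k in range(5):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(5) if j != k])
        accs.append(evaluate(train_idx, test_idx))
    return float(np.mean(accs))       # average ACC over the five folds

# Dummy evaluator: every fold reports 0.9, so the mean is 0.9.
print(round(cross_validate(100, lambda tr, te: 0.9), 3))  # 0.9
```

For class-imbalanced RSCP datasets, a stratified variant that preserves class proportions per fold would be the usual refinement.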

There are two datasets, MCS and PB30, on which the SOTA models scored moderate accuracies of 85.6 and 85.1% (19, 22), representing difficult benchmark tasks. DSCARNet significantly boosted the performances of these tasks to higher accuracies of 90.3 and 85.7%, respectively (Fig. 4A, 5-FCV results in SI Appendix, Tables S9 and S10). MCS contains 3,140 Raman spectra of lipopolysaccharide-stimulated and naive-state macrophage cells (19). These cell states are difficult to classify partly because the spectroscopic differences among the macrophage cell states are small (SI Appendix, Fig. S6). The MCS data come from studies on five different days with significant day-to-day cell state variations (19). The SOTA MCS ML model was trained on one of the four possible training sets, each comprising the data of three unspecified days, and evaluated on the data of the remaining day (19). DSCARNet models were trained on all four possible training sets and evaluated on the data of the remaining day. The ACCs of the four DSCARNet models are 90.3, 88.4, 87.4, and 87.0%, respectively. Because the training set of the MCS SOTA model was not disclosed by the authors (19), the best DSCARNet model (90.3%) was tentatively compared with the SOTA ML model (85.6%), a 4.7% improvement. Notably, even the lowest-performing DSCARNet model (87.0%) scored a 1.4% improvement over the SOTA model.
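The day-wise evaluation described above amounts to leave-one-group-out splitting by acquisition day: train on the spectra from three days, test on the held-out day. The day labels below are illustrative, since the actual day assignment was not disclosed in the source study.

```python
# Leave-one-day-out splitting: one (train, test) pair per held-out day.
import numpy as np

days = np.array([1, 1, 2, 2, 3, 3, 4, 4])   # acquisition day per spectrum

def leave_one_day_out(days):
    """Yield (day, train_indices, test_indices) per held-out day."""
    for d in np.unique(days):
        test = np.flatnonzero(days == d)
        train = np.flatnonzero(days != d)
        yield d, train, test

splits = list(leave_one_day_out(days))
print(len(splits))   # 4 splits, one per held-out day
```

Unlike random k-fold splitting, this grouping keeps all spectra of a day on one side of the split, so the reported accuracy reflects generalization across day-to-day variation rather than memorization of per-day artifacts.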

PB30 consists of 66,000 Raman spectra of 30 bacterial and yeast isolate classes without data augmentation (20), which have been used for training deep ConvNet models of the ResNet (26 layers) and U-Net (21 layers) architectures; the best U-Net model produced the SOTA ACC of 85.1% (20, 22). The SOTA model was developed as follows: first, PB30 was split into training, fine-tuning, and test sets of 60,000, 3,000, and 3,000 samples of 30 classes, respectively; second, ConvNet models were trained with the training set; third, the trained models were fine-tuned with the fine-tuning set; and last, the fine-tuned models were evaluated with the test set. DSCARNets were trained and tested using the same data split and evaluation metrics as the SOTA model (22) and outperformed it with an accuracy of 85.7%.

Notably, 17 of the 30 classes of the PB30 dataset can be grouped into SS-sets (same-strain sets): 12 classes form six SS-sets of 2 subclasses each, and 5 classes form one SS-set of 5 subclasses, where each SS-set comprises subclasses of the same microbial strain (e.g. MSSA1 and MSSA2). There is little spectroscopic difference across the subclasses within an SS-set (SI Appendix, Fig. S7), which likely increases the chance of overfitting in training the SOTA model (22) and the DSCARNet model. Indeed, the training ACCs of the DSCARNet model for the subclasses within the SS-sets (97.8–100%) are significantly higher than the test ACCs (53.0–100%) (SI Appendix, Figs. S8–S9, Table S11). We developed a new DSCARNet model with all subclasses of each SS-set combined into one class. The new DSCARNet model scored 89.3% (SI Appendix, Fig. S10) vs 85.7% by the original DSCARNet model (Fig. 4A), which further confirmed the difficulty of distinguishing subclasses within individual SS-sets.

DSCARNet performed well on additional datasets with extremely high SOTA performances or without published supervised models

There are four RSCP datasets, T-ALLS, ACM, SCCA3, and SCCS4, on which the SOTA models scored extremely high ACCs (∼98–100%) (15, 17) (Table 1). T-ALLS contains 600 Raman spectra of T cells (naive vs Jurkat in antigen-stimulated state) (17). ACM includes 330 Raman spectra of adenocarcinoma non-metastatic SW480 vs metastatic SW620 cells (15). SCCA3 is composed of 200 Raman spectra of three types of stem and embryonic cells, human mesenchymal stem cells (hMSCs), hiPSCs, and mouse embryonic fibroblasts (MEFs), in adherent culture (17). SCCS4 comprises 300 Raman spectra of four stem, embryonic, and leukemia cell types, hMSC, hiPSC, MEF, and Jurkat, in suspension culture (17). DSCARNet further improved the performances of the T-ALLS and ACM tasks by 1 and 0.1% margins to 0.99 [area under curve (AUC)] and 98.8%, respectively (SI Appendix, Fig. S11, Tables S12–S13), and matched the 100% ACCs of the SOTA models on the SCCA3 and SCCS4 tasks (5-FCV results in SI Appendix, Tables S14–S17).

We found in the literature two additional datasets, MMC and ECOP, which have been used for unsupervised analysis of cell phenotypic differences and phenotypic separation (clustering), respectively (16, 18) (Table 1). To the best of our knowledge, no supervised phenotypic models have been published for these two datasets. MMC contains 250 spectra of metastatic melanoma cells of de-differentiation states, i.e. melanocytic, transitory, neural-crest-like, and mesenchymal (M262, M229, M397, M409, and M381 cells) (18). ECOP includes 536 spectra of E. coli lag, log, and early stationary phases (three replicates each) (16). Although no supervised ML or DL model has been reported for classifying the phenotypic states of these datasets, they are nonetheless useful for further evaluation of DSCARNet. We therefore developed and tested DSCARNet models on these datasets by 5-FCV tests, which scored very good accuracies of 90.0 and 98.3%, respectively (SI Appendix, Tables S18–S19).

Comparative performances of DSCARNet models trained with different spectra aggregated representations

DSCARNet models of RSCP tasks can be trained with three representations: dual-path DSCARs, single-path 2D-SARs, and single-path 2D-CARs. To assess the performances of DSCARNet models trained with different representations, we further developed DSCARNet models on the six benchmark and six additional datasets using 2D-SARs and 2D-CARs, respectively. These models were developed with the same dataset, data split method, evaluation metric, and classification setup as the corresponding SOTA models (Table 1, SI Appendix, Table S1), and their performances were compared with those of the SOTA models and the dual-path DSCARNet models (Fig. 4, SI Appendix, Fig. S11). Across the 12 datasets, both the single-path 2D-SARs and the single-path 2D-CARs DSCARNet models mostly underperformed the SOTA models. In contrast, the dual-path 2D-DSCARs models significantly outperformed the SOTA models, showing the superior capability of the dual-path 2D-DSCARs over the single-path 2D-SARs and 2D-CARs for the DL of RSCP data.

Between the single-path 2D-SARs and 2D-CARs models, the 2D-CARs models showed slightly better overall performances on 12 datasets (Fig. 4B). The 2D-CARs models performed better on four datasets (ACS, SCS1, PB30, and MMC) with significant margins (1.1–22.8%). The 2D-SARs models performed better on three datasets (ACM, MCS, and PB2), with two datasets having minimal margins (0.1–0.3%). The 2D-CARs and 2D-SARs models produced equal performances on the remaining five datasets (MG, SCCA3, SCCS4, T-ALLS, and ECOP). Notably, there are eight multiclass datasets, and the 2D-CARs models outperformed the 2D-SARs models on four datasets (ACS, SCS1, PB30, and MMC) by large margins (3.2, 2, 1.1, and 22.8%) and produced equal performances on four datasets (MG, SCCA3, SCCS4, and ECOP) (Fig. 4B). Hence, with broader coverage of correlational signals, 2D-CARs representations appear more capable than the 2D-SARs for multiclass DL of RSCP data.

DSCARNet explainable DL of 2D spectra aggregated representations identified biologically relevant Raman signatures

Based on an established perturbation-based explainable DL algorithm (31), we introduced a DSCARNet explainable DL algorithm (Materials and methods section) and tested it on the identification of phenotypic Raman signatures of the SCS1 dataset (Table 1), i.e. the important signals for distinguishing iPSC cells from NSC and neuron cells (iPSC vs other cells) (3). This perturbation-based procedure was applied to the single-path 2D-SARs DSCARNet model for two reasons. Firstly, 2D-SAR pixels are individual Raman signals and thus straightforward to interpret, whereas 2D-CAR pixels are signal combinations that are difficult to interpret directly. Secondly, perturbation of 2D-SAR signals usually leads to only small variations of the PC explained ratio (0–1.4%). We therefore focused the perturbation evaluation on a single-path 2D-SARs DSCARNet model. The top-20 signatures, their importance score S values, and the corresponding literature-reported Raman biomarkers are provided in SI Appendix, Table S20, and the saliency map of these signatures is shown in SI Appendix, Fig. S12. The top-20 signatures are at 1,121; 1,124; 1,127; 857; 867; 1,273; 864; 1,117; 881; 877; 826; 833; 871; 836; 853; 860; 1,200; 895; 725; and 1,269 cm−1, respectively.

Raman spectroscopy provides rich information about cells, particularly the intrinsic molecular vibrations of nucleic acids, proteins, carbohydrates, and lipids that reflect cellular genotypic, phenotypic, and physiological states (45). Therefore, DSCARNet-identified signatures are expected to indicate the intrinsic molecular modes of cells. Indeed, all of the top-20 signatures are known Raman signals of biomolecular modes reported in the literature (SI Appendix, Table S20). These signatures include nine amino acid signals of four molecular mode groups. Group-1 includes signals at 1,273, 1,200, and 1,269 cm−1. As discussed earlier, Raman signals within ±2 cm−1 of the reported biomarkers were tentatively considered the same. Hence, these group-1 signals were considered to correspond to the literature-reported amide-III bands at 1,272 (46), 1,202 (47, 48), and 1,270 cm−1 (46). Group-2 contains aromatic amino acid signals at 877 and 853 cm−1, which correspond to the literature-reported hydrogen-bond mode of tryptophan at 876 cm−1 (46) and the ring breathing mode of tyrosine at 854 cm−1 (49), respectively. Group-3 has two signals at 881 and 871 cm−1, which correspond to the C–C stretching and CH2 rocking in methionine at 882 cm−1 (48) and the single bond stretching vibrations of proline at 870 cm−1 (50), respectively. Group-4 contains two signals at 1,124 and 1,127 (or 1,128) cm−1, both of the cytochrome C mode (3, 15).

There are eight nucleic acid signatures. Two signals at 826 and 836 cm−1 are associated with O–P–O phosphodiester stretching in DNA at 825 and 836 cm−1 (48, 51). A signal at 867 cm−1 is a ribose mode (50), and a signal at 833 cm−1 is a mode of B-conformation DNA (52). The fifth signal at 895 (or 896) cm−1 is a C–O–C mode of DNA phosphodiester and deoxyribose (53). The sixth signal at 725 cm−1 is a mode of adenine of DNA/RNA bases (54). Moreover, a signal at 860 (or 861) cm−1 is a mode of the nucleic acid phosphate group (55), and a signal at 1,121 cm−1 is a DNA/RNA mode (47). There are two carbohydrate signatures. A signal at 857 (or 856) cm−1 is the C–C skeletal mode of glucose (56), and another signal at 1,117 cm−1 has also been considered a Raman fingerprint of glucose (57). Cell phenotypic homeostasis is strongly influenced by amino acid metabolism (58) and by the contents of lipids (58), nucleic acids, and proteins (59). Amide bonds influence biomolecular conformation and function (59). These findings indicate that the DSCARNet-identified signatures are relevant to RSCP tasks.

The performance improvement of DSCARNet was enabled by the learning of the aggregated Raman spectra but was hindered by the difficulty in learning multiclass data

It is of interest to evaluate what contributes to the improved performance of DSCARNet models over the SOTA models. A unique feature of DSCARNet is the learning of spectral aggregated representations and component aggregated representations; we therefore examined the contribution of this feature to the performance improvement. In contrast to the learning of aggregated representations, SOTA models learn Raman spectra in a spread lineup of low-to-high wavenumbers, or the PCs in a spread lineup of PC1, PC2, etc. Hence, we constructed unaggregated 2D representations of the spectra and components with the same spread lineup as those of the SOTA models, named 2D-SURs and 2D-CURs, respectively. Using the MCS dataset (19) for training and testing, we developed DSCARNet models based on the aggregated representations (dual-path DSCARs, single-path 2D-SARs, and single-path 2D-CARs), DSCURNet models based on the unaggregated representations (dual-path DSCURs, single-path 2D-SURs, and single-path 2D-CURs), and a DSCAURNet model based on a dual path of mixed representations (2D-SURs and 2D-CARs). As shown in SI Appendix, Table S21, the DSCARNet models significantly outperformed the DSCURNet models (dual-path 90.3 vs 82.9%, single-path 86.6–86.9 vs 77.8–84.9%). DSCARNet also significantly outperformed the DSCAURNet model, while the latter was slightly better than DSCURNet (dual-path 90.3 vs 85.7 vs 82.9%). These results suggest that the performance improvement of the DSCARNet models is due in part to the enhanced learning of the spectra and component aggregated representations.

There are two pairs of datasets with very similar SOTA ACCs but different DSCARNet improvements. The SOTA ACCs are 85.6 and 85.1% for the first pair, MCS and PB30 (19, 22), and 92.4 and 92.8% for the second pair, ACS and SCS1 (3, 15), respectively, while DSCARNet produced substantially varied improvements of 4.7 and 0.6% for the first pair and relatively comparable improvements of 3.1 and 2.2% for the second pair. The varied improvement for the MCS and PB30 datasets is likely due to the well-known difficulty of multiclass DL tasks. Published investigations have consistently shown that the image recognition performances of established ConvNet (e.g. ResNet-152, EfficientNet) and image transformer (e.g. iGPT, DeiT, CaiT) DL models decrease with the number of image classes (60, 61), with SOTA ACCs in the decreasing ranges of 94–99.4%, 78–93.1%, 77.7–81.8%, and 73.2–78% for 10, 100, 1,010, and 8,142 classes, respectively (SI Appendix, Table S22). MCS is a 2-class dataset, while PB30 is a 30-class dataset (30 subclasses of bacteria). As expected, PB30 has a lower SOTA ACC (85.1%) and a low DSCARNet improvement (0.6%).

To further study the multiclass difficulty in the PB30 task, we evaluated DSCARNet subclass ACCs for three subclass groups (SG1–SG3) of PB30. The evaluation results are presented in SI Appendix, Table S23; SG1, SG2, and SG3 are composed of the data of two, two, and four bacterial strains, respectively. The 30-class PB30 DSCARNet model produced lower subclass ACCs of 58 and 82%, 54 and 77%, and 77, 66, 33, and 88%, respectively. We then developed 2-class, 2-class, and 4-class DSCARNet models for SG1, SG2, and SG3, respectively. As expected, these few-class DSCARNet models produced significantly improved subclass ACCs of 61 and 91%, 67 and 97%, and 80, 75, 37, and 94% for SG1, SG2, and SG3, respectively. Therefore, the lower performance improvement for PB30 primarily arises from the well-known multiclass difficulty in DL.

Influence of Raman spectral range on DSCARNet phenotypic performance

The available Raman datasets vary in spectral range, from selected spectra of 333 signals to broader spectra of 2,644 signals spanning wide wavenumber bands from 320 to 3,400 cm−1 (Table 1). The selected spectra typically come from comparative spectroscopic analysis across phenotypes and are highly useful for RSCP tasks (3). Nonetheless, the broader spectra contain a greater variety of intrinsic molecular modes. One may ask whether the richer features of broader spectra can be captured by DL for enhanced RSCP tasks. To probe this question, we developed a DSCARNet DL model using the broader spectra of SCS2 in addition to the selected spectra of SCS1 used by the original ML model (3). SCS1 contains the RSCP data of iPSC, NSC, and neuron cells (3); its selected spectra include 443 processed signals (3) and were used for developing and testing the DSCARNet model (Table 1). SCS2 comprises 1,019 unprocessed signals, which were released as the original Raman data of the published SCS1 ML model (3). SCS2 was used here for evaluating the influence of spectral range on DSCARNet performance.

SCS2 was preprocessed by the same preprocessing method as that of the published SCS1 ML model (3). A SCS2 DSCARNet model was developed using these data under the same sample-split and validation method as that of the SCS1 ML model (3). The performance of the SCS2 DSCARNet model is 95.7%, compared with 95.0% for the SCS1 DSCARNet model (Fig. 4A). We further developed additional SCS1 and SCS2 DSCARNet models under the 5-FCV scheme, which produced 94.5 and 96.2% accuracies, respectively (SI Appendix, Table S6). These results consistently indicate that the selected spectra are suitable for developing DL phenotypic models, and that exploiting broader spectra may further improve the phenotypic performance of DL models.

DSCARNet phenotypic performance on a dataset with explicit labels of biological variability

The RSCP datasets of this and existing works contain duplicates of cells or microbial strains without explicit labels of biological variability. This duplication leads to some degree of information leakage from the training to the testing dataset and possibly inflated classification performance. We further evaluated DSCARNet performance on a dataset with explicit labels of biological variability, composed of Raman spectroscopic data of saliva samples of 30 COVID-19-positive, 37 COVID-19-negative, and 34 age- and sex-matched healthy individuals (62). Each sample has 25 measured duplicate Raman spectra. COVID-19 DL and ML classification models have been developed on this dataset and evaluated by the leave-one(sample)-out cross-validation (LOOCV) method (62). In each LOOCV round, the 25 spectra of one sample were left out as test data, a DL or ML model was trained on all remaining data, and the individual-sample prediction was determined by the majority outcome over the 25 duplicates of the tested sample. The ACC is the fraction of correct individual-sample predictions over the 101 LOOCV rounds. A published 1D ConvNet model achieved the SOTA performance of 89% ACC without excluding samples of severe comorbidities (62). Based on the same dataset and the same LOOCV method, the DSCARNet model produced 91.1% ACC (SI Appendix, Table S24), outperforming the SOTA model. This suggests that DSCARNet, like the SOTA DL method, is able to produce stable performance on data with biological variability.
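The leave-one-sample-out scheme with majority voting over spectral duplicates can be sketched as follows. This is a minimal illustration on synthetic data with a logistic-regression stand-in classifier, not the authors' DSCARNet or the published 1D ConvNet:

```python
import numpy as np
from collections import Counter
from sklearn.linear_model import LogisticRegression

def loocv_sample_accuracy(X, y, sample_ids, make_model):
    """Leave-one-(sample)-out CV: all spectral duplicates of one sample are
    held out per round; the sample-level prediction is the majority vote
    over the predictions for its duplicates."""
    correct = 0
    samples = np.unique(sample_ids)
    for s in samples:
        test = sample_ids == s
        model = make_model()
        model.fit(X[~test], y[~test])          # train on all remaining spectra
        pred = model.predict(X[test])          # predict the held-out duplicates
        majority = Counter(pred).most_common(1)[0][0]
        correct += int(majority == y[test][0])
    return correct / len(samples)              # fraction of correctly called samples

# toy data: 6 samples x 5 duplicate spectra each, 2 classes, separable features
rng = np.random.default_rng(0)
n_samples, n_dup, n_feat = 6, 5, 20
sample_ids = np.repeat(np.arange(n_samples), n_dup)
labels = np.repeat(np.arange(n_samples) % 2, n_dup)
X = rng.normal(size=(n_samples * n_dup, n_feat)) + 2.0 * labels[:, None]
acc = loocv_sample_accuracy(X, labels, sample_ids,
                            lambda: LogisticRegression(max_iter=1000))
```

Because all duplicates of a sample leave the training set together, this scheme avoids the duplicate-driven leakage discussed above.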

Conclusion

Our study demonstrated that the capability of spectroscopic DL on RSCP tasks can be significantly improved by learning the signal and component aggregated 2D representations of the spectra, which provides a useful DL strategy for broader spectroscopic learning tasks beyond RSCP (1–6, 11–13). These include additional Raman spectroscopic tasks such as disease diagnosis from patient samples (e.g. urine and saliva) (7, 63), food quality and source identification (8, 9, 64), and chemical substance detection (65). There are also tasks of other spectroscopic types, including the identification of geological materials from LIBS spectra (10), radionuclides from gamma-ray spectra (66), and the microstructure of materials from IR spectra (67). These spectroscopic learning tasks are challenged by the highly variable spectroscopic signals of complex biological components (14, 15), heterogeneous materials (10, 67), or signal scattering effects (66). These challenges can be met by strategies such as learning appropriate data representations (e.g. data aggregated representations), expanding training data or applying data augmentation (22), and adopting algorithms that can efficiently learn from large datasets (68).

Materials and methods

Data collection and processing

The 12 RSCP datasets (Table 1) were collected by the following procedure: a PubMed (69) literature search was conducted with combinations of the keywords “Raman,” “Raman-based,” “cell,” “cells,” “single-cell,” “cell-derived,” “microbe,” “microbial,” “bacteria,” and “bacterial.” The collected publications were manually evaluated for Raman spectroscopic data of over 50 samples (spectra) with cell/microbe phenotype information. Phenotypes include cell/microbe type (e.g. stem cell, melanoma genotype), stage (e.g. cells of different disease stages), and state (e.g. stimulated cells). For unprocessed Raman data, an established baseline and normalization algorithm (3) was used for preprocessing.
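As a rough illustration of this kind of preprocessing (baseline removal followed by normalization), the sketch below uses a simple low-order polynomial baseline, which is an assumption for the example; the paper applies the established algorithm of ref. 3:

```python
import numpy as np

def preprocess_spectrum(wavenumbers, intensities, degree=3):
    """Illustrative baseline removal and normalization (polynomial baseline
    is a stand-in assumption, not the algorithm of ref. 3):
    1) fit a low-order polynomial as a baseline estimate,
    2) subtract it, 3) L2-normalize the corrected spectrum."""
    w = wavenumbers - wavenumbers.mean()        # center x for a stable fit
    coeffs = np.polyfit(w, intensities, degree)
    baseline = np.polyval(coeffs, w)
    corrected = intensities - baseline
    norm = np.linalg.norm(corrected)
    return corrected / norm if norm > 0 else corrected

# synthetic spectrum: sloping background plus one Raman-like band at 1,120 cm-1
wn = np.linspace(600, 1800, 443)
raw = 0.001 * wn + np.exp(-((wn - 1120.0) ** 2) / 200.0)
spec = preprocess_spectrum(wn, raw)
```

After correction, the sloping background is largely removed while the band near 1,120 cm−1 remains the dominant feature.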

Manifold-guided restructuring of Raman spectra into 2D spectra aggregated representations

For a dataset of N samples each with M spectral signals, the data can be analyzed in both the M-dimensional spectral space and the N-dimensional sample space. In the spectral space, there are M signals for every sample. In the sample space, there are N sample values for every signal (i.e. a signal is an N-dimensional vector in the sample space). Data restructuring (Fig. 1A) involves arranging the M signals of each sample onto a 2D space, where the signal of a specific wavenumber is placed at a particular location for every sample; the location is determined such that correlated signal vectors in the sample space are drawn together in the 2D space. Following the successful application of manifold learning to omics data (70), the correlational relationship of the signal vectors was measured by Euclidean distance, and the signal vectors were projected onto a 2D space by the UMAP method (71). The UMAP-based projection procedure consists of four steps:

(1) The spectral value at each wavenumber (e.g. the ith value of the kth sample, V_ik) of the Raman spectra across n samples was regarded as an n-dimensional signal vector S_i = (V_{i1}, …, V_{in}), and the Euclidean distance d_ij between the ith and jth vectors S_i and S_j was computed as:

d_{ij} = \sqrt{\sum_{k=1}^{n} \left(V_{ik} - V_{jk}\right)^{2}} \quad (1)

(2) A weighted topological k-neighbor graph D was built from an exponential probability distribution based on d_ij, where \mu_{i|j} defines the adjacency matrix of graph D:

\mu_{i|j} = \exp\left(-\left(d_{ij} - \rho_i\right)/\sigma_i\right) \quad (2)

where \rho_i is the distance from the ith wavenumber to its first nearest neighbor and \sigma_i is the normalization factor. The adjacency matrix satisfies the symmetry condition:

\mu_{ij} = \mu_{i|j} + \mu_{j|i} - \mu_{i|j}\,\mu_{j|i} \quad (3)

(3) A corresponding graph F in low dimension (e.g. 2D space) was built using the spectral layout of UMAP (71), where v_ij is the weight matrix of the graph F:

v_{ij} = \left(1 + a\,\lVert y_i - y_j \rVert^{2b}\right)^{-1} \quad (4)

Here, y_i and y_j are the initial embedding coordinates, and a and b are hyperparameters.

(4) The cross-entropy was used as the cost function to minimize the error between the two topological representations D and F. Specifically, CE(X, Y) is the total cross-entropy loss over all the edge existence probabilities between the weighted graphs D and F:

CE(X, Y) = \sum_{i,j}\left[ D(X)\log\!\left(\frac{D(X)}{F(Y)}\right) + \left(1 - D(X)\right)\log\!\left(\frac{1 - D(X)}{1 - F(Y)}\right) \right] \quad (5)

After these four steps, the n-dimensional signal vectors of the Raman spectra were projected into the 2D space. The UMAP-projected signals in the 2D space were further aggregated onto regular 2D grids to form a 2D-SAR by the grid linear assignment Jonker–Volgenant algorithm (70). Raman spectral data are highly variable and confounded by artifacts (35). A useful strategy for learning such challenging data is multichannel/multicapsule DL, which has produced SOTA performances on challenging tasks (72). To facilitate multichannel DL, the pixels of the 2D-SARs were further clustered into multiple groups by the agglomerative hierarchical clustering package Fastcluster (73), leading to multichannel 2D-SARs.
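The projection-then-grid-assignment pipeline can be sketched as follows. For portability, PCA stands in for UMAP in this sketch (an assumption; the paper uses UMAP), and scipy's linear_sum_assignment, a Jonker–Volgenant-style solver, plays the role of the grid linear assignment step:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.decomposition import PCA

def signals_to_grid(X, side):
    """Sketch of the 2D-SAR restructuring for an (N samples x M signals)
    matrix. Each signal (wavenumber) is an N-dimensional vector in the
    sample space; correlated signals land near each other in the 2D
    embedding and are then assigned to distinct grid cells. Returns the
    flattened grid-cell index assigned to each signal."""
    M = X.shape[1]
    assert M == side * side, "one grid cell per signal in this sketch"
    emb = PCA(n_components=2).fit_transform(X.T)            # (M, 2) embedding
    emb = (emb - emb.min(axis=0)) / (np.ptp(emb, axis=0) + 1e-12)
    gx, gy = np.meshgrid(np.linspace(0, 1, side), np.linspace(0, 1, side))
    grid = np.c_[gx.ravel(), gy.ravel()]                    # (M, 2) grid targets
    cost = ((emb[:, None, :] - grid[None, :, :]) ** 2).sum(axis=-1)
    _, cols = linear_sum_assignment(cost)                   # signal -> grid cell
    return cols

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 100))       # 60 samples, 100 spectral signals
idx = signals_to_grid(X, side=10)
sar0 = np.zeros(100)                 # 2D-SAR image of the first sample
sar0[idx] = X[0]
sar0 = sar0.reshape(10, 10)
```

The assignment guarantees every signal occupies a distinct cell, so each sample's M signal values can be written into a fixed image layout shared by the whole dataset.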

PCs of Raman spectra and restructuring into 2D component aggregated representations

For a dataset of N samples each with M spectral signals, the 2D-CARs were constructed (Fig. 1B) by the following procedure: PCs were computed on the M signal vectors in the sample space. The top-K PCs were selected to form PC pairs Pij = Pi + Pj, where Pi and Pj are the ith and jth PCs, respectively (i, j = 1, 2, …, K). Standard scaling was performed for each Pij. Based on the literature-reported finding that up to 30–100 PCs are needed to cover rare variants in statistical genetics (43), K was set such that the number of pairs approximates M, i.e. 0.5 × K × (K − 1) + K ≈ M, and then adjusted for optimal RSCP performance. In this work, K = 25–75 for the 12 datasets, which is roughly in line with the 30–100 PCs reported in the literature (43). This gives 325–2,850 independent Pij pairs, which were restructured into 2D-CARs by UMAP (71) and the Jonker–Volgenant algorithm (70). The pixels of the 2D-CARs were further clustered into multiple groups by Fastcluster (73), leading to multichannel 2D-CARs. A 2D-SAR and a 2D-CAR were jointly used as a novel dual 2D representation, 2D-DSCAR, for dual-channel CNN learning of the RSCP data.
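The PC-pair construction above can be sketched as follows (a minimal illustration of the pairing and scaling steps only; the subsequent UMAP/grid restructuring is omitted here):

```python
import numpy as np
from itertools import combinations_with_replacement
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def component_pairs(X, K):
    """Sketch of the 2D-CAR feature construction: compute the top-K PCs,
    form the pairwise sums P_ij = P_i + P_j (i <= j, so the diagonal
    P_ii = 2 P_i is included), and standard-scale each pair, yielding
    0.5*K*(K-1) + K aggregated component features per sample."""
    scores = PCA(n_components=K).fit_transform(X)       # (N, K) PC scores
    pairs = [scores[:, i] + scores[:, j]
             for i, j in combinations_with_replacement(range(K), 2)]
    P = np.column_stack(pairs)                          # (N, K*(K+1)//2)
    return StandardScaler().fit_transform(P)            # per-pair standard scaling

rng = np.random.default_rng(2)
X = rng.normal(size=(120, 443))      # 120 samples, 443 spectral signals
P = component_pairs(X, K=25)         # 0.5*25*24 + 25 = 325 pair features
```

With K = 25 this produces 325 pair features, matching the lower end of the 325–2,850 range given in the text.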

DSCARNet DL architecture, hyperparameters, training, and evaluation metrics

An efficient dual-path CNN architecture, DSCARNet (Fig. 1C), was developed for learning the 2D-DSCARs, 2D-SARs, or 2D-CARs of spectroscopic data. Dual-path DL enables separate feature learning of 2D-SARs and 2D-CARs before concatenation into fully connected layers (FCs). In the first Conv layer, filters of larger kernel sizes were used for more expressive power and global perception (74). 2D-SARs and 2D-CARs mostly exhibit lower-resolution patterns than natural images (Figs. 2 and 3); hence, a larger default kernel size was introduced for optimally capturing useful and nonredundant information (SI Appendix, Table S25). Two further Conv layers follow for more local information and size reduction. Thereafter, a GoogLeNet inception layer (75) was adopted for enhanced local perception. Different stride values were used according to the sizes of the 2D-SARs or 2D-CARs. A global max pooling layer was used before the FCs to reduce parameters. Since multichannel DL generally outperforms single-channel DL in image recognition (72), the number of channels was scanned from 1 to 11 for optimal RSCP performance. Moreover, the cross-entropy loss was used, and the patience, learning rate, and batch size were adjusted for enhanced classification performance. Early stopping, regularization (in the inception layer), and batch normalization (optional) were used for avoiding overfitting during the nested validations. The optimized hyperparameters for every DSCARNet model of each dataset are given in SI Appendix, Table S25. The DSCARNet models were developed using the same data split and evaluation metric as those of the SOTA ML or DL models for these RSCP tasks (SI Appendix, Table S1). Moreover, additional sets of DSCARNet models were trained with 5-FCV for convenient comparison across datasets (SI Appendix, Tables S5–S10 and S14–S19).
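A minimal Keras sketch of the dual-path idea is shown below. This is NOT the authors' exact DSCARNet: the input shapes, kernel sizes, channel counts, and the omission of the inception block are all simplifying assumptions made for illustration; only the overall structure (two conv branches, global max pooling, concatenation, FC classification) follows the description above:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_dual_path(sar_shape=(24, 24, 3), car_shape=(26, 26, 3), n_classes=3):
    """Simplified dual-path CNN: separate conv stacks for 2D-SARs and
    2D-CARs (a large first kernel, then smaller kernels), global max
    pooling, concatenation, and fully connected classification."""
    def branch(shape):
        inp = layers.Input(shape=shape)
        x = layers.Conv2D(32, 7, padding="same", activation="relu")(inp)  # large first kernel
        x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
        x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
        return inp, layers.GlobalMaxPooling2D()(x)
    sar_in, sar_feat = branch(sar_shape)
    car_in, car_feat = branch(car_shape)
    z = layers.concatenate([sar_feat, car_feat])   # join the two paths
    z = layers.Dense(64, activation="relu")(z)
    out = layers.Dense(n_classes, activation="softmax")(z)
    return Model(inputs=[sar_in, car_in], outputs=out)

model = build_dual_path()
```

In the single-path variants, one branch is simply dropped and the remaining pooled features feed the FCs directly.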

DSCARNet explainable DL and signal importance saliency map

Explanation of the phenotypic decision of a DL model facilitates informed decisions, biomarker discovery, mechanism understanding, and model assessment (76, 77). The perturbation-based post hoc feature attribution method is useful for explaining black-box models (78, 79): it explains learning decisions by perturbing individual input signals and assessing their influence on the decision. The perturbation method consists of three steps: (i) given a 2D-SAR matrix X, train a model f with a label vector y and an error measure L(y, f(X)) using the log loss (cross-entropy), (ii) estimate the original model error E_orig = L(y, f(X)), and (iii) for Raman spectra of n feature values (S_1, …, S_n), conduct the following operations for each feature value S_i (i = 1, …, n):

  1. Generate a perturbed 2D-SAR matrix X_pert by replacing the feature value V_ik with the background value, which breaks the association between V_ik and the true outcome y.

  2. Estimate the error of the prediction on the perturbed matrix: E_pert = L(y, f(X_pert)).

  3. Calculate the perturbation feature importance score: S_I = E_pert − E_orig.

S_I is an n×n matrix whose element (i, j) indicates the importance of the corresponding element of the 2D-SAR matrix X. In this work, this perturbation method was employed for global explanation tasks (i.e. signal importance with respect to all samples) (80). The derived important signals were presented in a saliency map M (81). Specifically, the importance score can be approximated by S_I(X) = w^T X + b, where w is the derivative of S_I with respect to X. The saliency map M is obtained by replacing the elements of the 2D-SAR matrix X with the elements of w. We used the Seaborn package to generate the global explanation saliency map, in which the top-20 score values were selected and highlighted.

This perturbation-based procedure was applied to the 2D-SARs path, while the 2D-CARs path was kept unperturbed. The whole procedure involves four steps: (i) an original single-path DSCARNet model was developed on the unperturbed 2D-SARs of the original dataset, (ii) each pixel of the 2D-SARs (i.e. each signal) was replaced by the background value (e.g. the lowest pixel value in the training set) without retraining the model, (iii) the perturbed 2D-SARs were input to the original DSCARNet model to make predictions, and (iv) the error and the feature importance score S were computed for both the perturbed and the original 2D-SARs. The error was calculated as the log loss (after standard scaling) of the predicted values versus the true labels over all samples. The S value for each perturbed pixel of the 2D-SARs is the error of the perturbed minus the error of the original 2D-SARs. To obtain more robust global explanation saliency maps, the original models were trained by 5-FCV, and steps (ii)–(iv) were conducted under the 5-FCV. These procedures were developed and performed in Python 3.7 based on TensorFlow 2.6.
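The perturb-predict-score loop can be sketched as follows. A logistic-regression model on flattened features stands in for the DSCARNet model, and the synthetic data are an assumption for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def perturbation_importance(model, X, y, background=None):
    """Sketch of the perturbation-based global importance score: replace
    one input feature (one 2D-SAR pixel, flattened here) with a background
    value, re-predict WITHOUT retraining, and score
    S = log-loss(perturbed) - log-loss(original)."""
    if background is None:
        background = X.min()                 # e.g. the lowest value in the training set
    e_orig = log_loss(y, model.predict_proba(X))
    scores = np.empty(X.shape[1])
    for i in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, i] = background                # break the feature-outcome association
        scores[i] = log_loss(y, model.predict_proba(Xp)) - e_orig
    return scores

rng = np.random.default_rng(3)
y = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 10))
X[:, 0] += 3.0 * y                           # only feature 0 carries class signal
clf = LogisticRegression(max_iter=1000).fit(X, y)
S = perturbation_importance(clf, X, y)       # feature 0 should score highest
```

Features whose perturbation inflates the loss the most receive the largest S values, which is what the saliency map then visualizes.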

Supplementary Material

pgae268_Supplementary_Data

Acknowledgments

The authors appreciate the financial support from the National Key R&D Program of China, Synthetic Biology Research (2019YFA0905900), and the Startup Fund from Shenzhen Bay Laboratory (Grant No. 21310091).

Contributor Information

Songlin Lu, The State Key Laboratory of Chemical Oncogenomics, Key Laboratory of Chemical Biology, Tsinghua Shenzhen International Graduate School, Tsinghua University, 2279 Lishui Road, Nanshan District, Shenzhen 518055, Guangdong, P. R. China; Institute of Biomedical Health Technology and Engineering, Shenzhen Bay Laboratory, 9 Kexue Avenue, Guangming District, Shenzhen 518132, Guangdong, P. R. China.

Yuanfang Huang, The State Key Laboratory of Chemical Oncogenomics, Key Laboratory of Chemical Biology, Tsinghua Shenzhen International Graduate School, Tsinghua University, 2279 Lishui Road, Nanshan District, Shenzhen 518055, Guangdong, P. R. China.

Wan Xiang Shen, Bioinformatics and Drug Design Group, Department of Pharmacy, National University of Singapore, 18 Science Drive 4, Singapore 117543, Singapore.

Yu Lin Cao, Tangyi and Tsinghua Shenzhen International Graduate School Collaborative Program, Tsinghua University, 2279 Lishui Road, Nanshan District, Shenzhen 518055, Guangdong, P. R. China.

Mengna Cai, Tangyi and Tsinghua Shenzhen International Graduate School Collaborative Program, Tsinghua University, 2279 Lishui Road, Nanshan District, Shenzhen 518055, Guangdong, P. R. China.

Yan Chen, The State Key Laboratory of Chemical Oncogenomics, Key Laboratory of Chemical Biology, Tsinghua Shenzhen International Graduate School, Tsinghua University, 2279 Lishui Road, Nanshan District, Shenzhen 518055, Guangdong, P. R. China; Shenzhen Kivita Innovative Drug Discovery Institute, Shenzhen 518057, Guangdong, P. R. China.

Ying Tan, The State Key Laboratory of Chemical Oncogenomics, Key Laboratory of Chemical Biology, Tsinghua Shenzhen International Graduate School, Tsinghua University, 2279 Lishui Road, Nanshan District, Shenzhen 518055, Guangdong, P. R. China; Institute of Drug Discovery Technology, Ningbo University, 818 Fenghua Road, Ningbo 315211, Zhejiang, P. R. China.

Yu Yang Jiang, School of Pharmaceutical Sciences, Tsinghua University, 30 Shuangqing Road, Haidian District, Beijing 100084, P. R. China.

Yu Zong Chen, The State Key Laboratory of Chemical Oncogenomics, Key Laboratory of Chemical Biology, Tsinghua Shenzhen International Graduate School, Tsinghua University, 2279 Lishui Road, Nanshan District, Shenzhen 518055, Guangdong, P. R. China; Institute of Biomedical Health Technology and Engineering, Shenzhen Bay Laboratory, 9 Kexue Avenue, Guangming District, Shenzhen 518132, Guangdong, P. R. China.

Supplementary Material

Supplementary material is available at PNAS Nexus online.

Funding

This work was supported by the National Key R&D Program of China, Synthetic Biology Research (2019YFA0905900); The Startup Fund from Shenzhen Bay Laboratory (Grant No. 21310091); and the Ningbo Top Talent Project (No. 215-432094250).

Author Contributions

S.L. and Y.Z.C. conceptualized this work, developed computational methods, and wrote the manuscript. S.L. and Y.H. conducted data collection, model training, and result analysis. W.X.S. contributed to algorithm development. Y.T. contributed to data collection. Y.L.C., M.C., and Y.C. contributed to biomarker analysis. Y.Y.J. and Y.Z.C. provided suggestions and facilities. All authors read and approved the manuscript.

Data Availability

All datasets are available in the GitHub DSCAR repository (https://github.com/songlinlu/DSCAR). The source codes of DSCAR, DSCARNet, and the DSCARNet explainable module are freely available at the same repository.

Attribution

The four icons in SI Appendix, Fig. S1 were designed by Freepik and are used in accordance with the Freepik license.

References

  • 1. Jermyn M, et al. 2015. Intraoperative brain cancer detection with Raman spectroscopy in humans. Sci Transl Med. 7:274ra219. [DOI] [PubMed] [Google Scholar]
  • 2. Traynor D, et al. 2021. Raman spectral cytopathology for cancer diagnostic applications. Nat Protoc. 16:3716–3735. [DOI] [PubMed] [Google Scholar]
  • 3. Hsu C-C, et al. 2020. A single-cell Raman-based platform to identify developmental stages of human pluripotent stem cell-derived neurons. Proc Natl Acad Sci U S A. 117:18412–18423. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4. Lin H-H, et al. 2012. Single nuclei Raman spectroscopy for drug evaluation. Anal Chem. 84:113–120. [DOI] [PubMed] [Google Scholar]
  • 5. Fu D, et al. 2014. Imaging the intracellular distribution of tyrosine kinase inhibitors in living cells with quantitative hyperspectral stimulated Raman scattering. Nat Chem. 6:614–622. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6. Ali A, et al. 2019. Single-cell screening of tamoxifen abundance and effect using mass spectrometry and Raman-spectroscopy. Anal Chem. 91:2710–2718. [DOI] [PubMed] [Google Scholar]
  • 7. Weng S, et al. 2020. Deep learning networks for the recognition and quantitation of surface-enhanced Raman spectroscopy. Analyst. 145:4827–4835. [DOI] [PubMed] [Google Scholar]
  • 8. Berghian-Grosan C, Magdas DA. 2020. Application of Raman spectroscopy and machine learning algorithms for fruit distillates discrimination. Sci Rep. 10:21152. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9. Leong YX, et al. 2021. Surface-enhanced Raman scattering (SERS) taster: a machine-learning-driven multireceptor platform for multiplex profiling of wine flavors. Nano Lett. 21:2642–2649.
  • 10. Kepes E, Vrabel J, Stritezska S, Porizka P, Kaiser J. 2020. Benchmark classification dataset for laser-induced breakdown spectroscopy. Sci Data. 7:53.
  • 11. Muro CK, Lednev IK. 2017. Identification of individual red blood cells by Raman microspectroscopy for forensic purposes: in search of a limit of detection. Anal Bioanal Chem. 409:287–293.
  • 12. Lee KS, et al. 2019. An automated Raman-based platform for the sorting of live cells by functional properties. Nat Microbiol. 4:1035–1048.
  • 13. Lee KS, et al. 2021. Raman microspectroscopy for microbiology. Nat Rev Methods Primers. 1:80. 10.1038/s43586-021-00082-7
  • 14. Baria E, et al. 2021. Supervised learning methods for the recognition of melanoma cell lines through the analysis of their Raman spectra. J Biophotonics. 14:e202000365.
  • 15. Gala de Pablo J, et al. 2018. Biochemical fingerprint of colorectal cancer cell lines using label-free live single-cell Raman spectroscopy. J Raman Spectrosc. 49:1323–1332.
  • 16. Garcia-Timermans C, et al. 2020. Discriminating bacterial phenotypes at the population and single-cell level: a comparison of flow cytometry and Raman spectroscopy fingerprinting. Cytometry A. 97:713–726.
  • 17. Akagi Y, Mori N, Kawamura T, Takayama Y, Kida YS. 2021. Non-invasive cell classification using the Paint Raman express spectroscopy system (PRESS). Sci Rep. 11:8818.
  • 18. Du J, et al. 2020. Raman-guided subcellular pharmaco-metabolomics for metastatic melanoma cells. Nat Commun. 11:4830.
  • 19. Pavillon N, Hobro AJ, Akira S, Smith NI. 2018. Noninvasive detection of macrophage activation with single-cell resolution through machine learning. Proc Natl Acad Sci U S A. 115:E2676–E2685.
  • 20. Ho C-S, et al. 2019. Rapid identification of pathogenic bacteria using Raman spectroscopy and deep learning. Nat Commun. 10:4927.
  • 21. Shin H, et al. 2020. Early-stage lung cancer diagnosis by deep learning-based spectroscopic analysis of circulating exosomes. ACS Nano. 14:5435–5444.
  • 22. Al-Shaebi Z, Uysal Ciloglu F, Nasser M, Aydin O. 2022. Highly accurate identification of bacteria's antibiotic resistance based on Raman spectroscopy and U-net deep learning algorithms. ACS Omega. 7:29443–29451.
  • 23. Jumper J, et al. 2021. Highly accurate protein structure prediction with AlphaFold. Nature. 596:583–589.
  • 24. Wu Z, et al. 2018. MoleculeNet: a benchmark for molecular machine learning. Chem Sci. 9:513–530.
  • 25. Xiong Z, et al. 2020. Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism. J Med Chem. 63:8749–8760.
  • 26. Yang K, et al. 2019. Analyzing learned molecular representations for property prediction. J Chem Inf Model. 59:3370–3388.
  • 27. Shen WX, et al. 2021. Out-of-the-box deep learning prediction of pharmaceutical properties by broadly learned knowledge-based molecular representations. Nat Mach Intell. 3:334–343.
  • 28. Bazgir O, et al. 2020. Representation of features as images with neighborhood dependencies for compatibility with convolutional neural networks. Nat Commun. 11:4391.
  • 29. Lyu B, Haque A. 2018. Deep learning based tumor type classification using gene expression data. In: Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics; Washington, DC. New York (NY): ACM. p. 89–96.
  • 30. Chen X, et al. 2021. Artificial image objects for classification of schizophrenia with GWAS-selected SNVs and convolutional neural network. Patterns (NY). 2:100303.
  • 31. Shen WX, et al. 2022. AggMapNet: enhanced and explainable low-sample omics deep learning with feature-aggregated multi-channel networks. Nucleic Acids Res. 50:e45.
  • 32. Hawrylycz MJ, et al. 2012. An anatomically comprehensive atlas of the adult human brain transcriptome. Nature. 489:391–399.
  • 33. Lu YY, Fan Y, Lv J, Noble WS. 2018. DeepPINK: reproducible feature selection in deep neural networks. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems. Montréal, Canada: Curran Associates Inc. p. 8690–8700.
  • 34. Zhou B, Khosla A, Lapedriza A, Oliva A, Torralba A. 2016. Learning deep features for discriminative localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Las Vegas, NV. IEEE. p. 2921–2929.
  • 35. Guo S, Popp J, Bocklitz T. 2021. Chemometric analysis in Raman spectroscopy from experimental design to machine learning-based modeling. Nat Protoc. 16:5426–5459.
  • 36. Simoncelli EP, Olshausen BA. 2001. Natural image statistics and neural representation. Annu Rev Neurosci. 24:1193–1216.
  • 37. Bronstein MM, Bruna J, LeCun Y, Szlam A, Vandergheynst P. 2017. Geometric deep learning. IEEE Signal Process Mag. 34:18–42.
  • 38. Butler HJ, et al. 2016. Using Raman spectroscopy to characterize biological materials. Nat Protoc. 11:664–687.
  • 39. McInnes L, Healy J, Melville J. 2018. UMAP: uniform manifold approximation and projection for dimension reduction. arXiv:1802.03426, preprint: not peer reviewed. 10.48550/arXiv.1802.03426
  • 40. Ringner M. 2008. What is principal component analysis? Nat Biotechnol. 26:303–304.
  • 41. Lever J, Krzywinski M, Altman N. 2017. Points of significance: principal component analysis. Nat Methods. 14:641–642.
  • 42. Smallman L, Underwood W, Artemiou A. 2019. Simple Poisson PCA: an algorithm for (sparse) feature extraction with simultaneous dimension determination. Comput Stat. 35:559–577.
  • 43. Abegaz F, et al. 2019. Principals about principal components in statistical genetics. Brief Bioinform. 20:2200–2216.
  • 44. Bian H, Gao J. 2018. Error analysis of the spectral shift for partial least squares models in Raman spectroscopy. Opt Express. 26:8016–8027.
  • 45. Li M, Xu J, Romero-Gonzalez M, Banwart SA, Huang WE. 2012. Single cell Raman spectroscopy for cell sorting and imaging. Curr Opin Biotechnol. 23:56–63.
  • 46. Kuhar N, Sil S, Umapathy S. 2021. Potential of Raman spectroscopic techniques to study proteins. Spectrochim Acta A Mol Biomol Spectrosc. 258:119712–119720.
  • 47. Medipally DKR, et al. 2020. Vibrational spectroscopy of liquid biopsies for prostate cancer diagnosis. Ther Adv Med Oncol. 12:1758835920918499.
  • 48. Pezzotti G. 2021. Raman spectroscopy in cell biology and microbiology. J Raman Spectrosc. 52:2348–2443.
  • 49. Notingher I. 2007. Raman spectroscopy cell-based biosensors. Sensors. 7:1343–1358.
  • 50. Movasaghi Z, Rehman S, Rehman IU. 2007. Raman spectroscopy of biological tissues. Appl Spectrosc Rev. 42:493–541.
  • 51. Li SS, et al. 2017. Revealing chemical processes and kinetics of drug action within single living cells via plasmonic Raman probes. Sci Rep. 7:2296.
  • 52. Wang H, et al. 2022. Investigating the cellular responses of osteosarcoma to cisplatin by confocal Raman microspectroscopy. J Photochem Photobiol B. 226:112366.
  • 53. Du S, et al. 2022. Micro-Raman analysis of sperm cells on glass slide: potential label-free assessment of sperm DNA toward clinical applications. Biosensors (Basel). 12:1051–1065.
  • 54. Ribeiro ARB, et al. 2022. Application of Raman spectroscopy for characterization of the functional polarization of macrophages into M1 and M2 cells. Spectrochim Acta A Mol Biomol Spectrosc. 265:120328.
  • 55. Silva-López MS, et al. 2022. Raman spectroscopy of individual cervical exfoliated cells in premalignant and malignant lesions. Appl Sci. 12:2419–2430.
  • 56. Flores-Morales A, Jimenez-Estrada M, Mora-Escobedo R. 2012. Determination of the structural changes by FT-IR, Raman, and CP/MAS (13)C NMR spectroscopy on retrograded starch of maize tortillas. Carbohydr Polym. 87:61–68.
  • 57. Krafft C, Neudert L, Simat T, Salzer R. 2005. Near infrared Raman spectra of human brain lipids. Spectrochim Acta A Mol Biomol Spectrosc. 61:1529–1535.
  • 58. Li C, et al. 2022. Amino acid catabolism regulates hematopoietic stem cell proteostasis via a GCN2-eIF2alpha axis. Cell Stem Cell. 29:1119–1134.e1117.
  • 59. Mahesh S, Tang KC, Raj M. 2018. Amide bond activation of biological molecules. Molecules. 23:2615.
  • 60. Chen M, et al. 2020. Generative pretraining from pixels. In: Proceedings of the 37th International Conference on Machine Learning (ICML); Virtual Event. PMLR. p. 1691–1703.
  • 61. Touvron H, Cord M, Sablayrolles A, Synnaeve G, Jégou H. 2021. Going deeper with image transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision; Montreal, QC, Canada. IEEE. p. 32–42.
  • 62. Bertazioli D, et al. 2024. An integrated computational pipeline for machine learning-driven diagnosis based on Raman spectra of saliva samples. Comput Biol Med. 171:108028.
  • 63. Carlomagno C, et al. 2021. COVID-19 salivary Raman fingerprint: innovative approach for the detection of current and past SARS-CoV-2 infections. Sci Rep. 11:4943.
  • 64. Pan Y, et al. 2014. Determination of tert-butylhydroquinone in vegetable oils using surface-enhanced Raman spectroscopy. J Food Sci. 79:T1225–T1230.
  • 65. Liu J, et al. 2017. Deep convolutional neural networks for Raman spectrum recognition: a unified solution. Analyst. 142:4067–4074.
  • 66. Daniel G, Ceraudo F, Limousin O, Maier D, Meuris A. 2020. Automatic and real-time identification of radionuclides in gamma-ray spectra: a new method based on convolutional neural network trained with synthetic data set. IEEE Trans Nucl Sci. 67:644–653.
  • 67. Lansford JL, Vlachos DG. 2020. Infrared spectroscopy data- and physics-driven machine learning for characterizing surface microstructure of complex materials. Nat Commun. 11:1513.
  • 68. Zhu X, Vondrick C, Fowlkes CC, Ramanan D. 2016. Do we need more training data? Int J Comput Vis. 119:76–92.
  • 69. Sayers EW, et al. 2023. Database resources of the National Center for Biotechnology Information in 2023. Nucleic Acids Res. 51:D29–D38.
  • 70. Jonker R, Volgenant T. 1988. A shortest augmenting path algorithm for dense and sparse linear assignment problems. Berlin (Heidelberg): Springer. p. 622.
  • 71. Becht E, et al. 2018. Dimensionality reduction for visualizing single-cell data using UMAP. Nat Biotechnol. 37:38–44.
  • 72. Cheng D, Gong Y, Zhou S, Wang J, Zheng N. 2016. Person re-identification by multichannel parts-based CNN with improved triplet loss function. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Las Vegas, NV. IEEE. p. 1335–1344.
  • 73. Müllner D. 2013. Fastcluster: fast hierarchical, agglomerative clustering routines for R and Python. J Stat Softw. 53:1–18.
  • 74. Peng C, Zhang X, Yu G, Luo G, Sun J. 2017. Large kernel matters—improve semantic segmentation by global convolutional network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Honolulu, HI. IEEE. p. 1743–1751.
  • 75. Szegedy C, et al. 2015. Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Boston, MA. IEEE. p. 1–9.
  • 76. Lee H, et al. 2019. An explainable deep-learning algorithm for the detection of acute intracranial haemorrhage from small datasets. Nat Biomed Eng. 3:173–182.
  • 77. Samek W, Wiegand T, Müller K-R. 2017. Explainable artificial intelligence: understanding, visualizing and interpreting deep learning models. ITU J: ICT Discoveries - Spec Issue. 1:39–48.
  • 78. Jiménez-Luna J, Grisoni F, Schneider G. 2020. Drug discovery with explainable artificial intelligence. Nat Mach Intell. 2:573–584.
  • 79. Ribeiro MT, Singh S, Guestrin C. 2016. “Why should I trust you?” Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; August 13–17, 2016, San Francisco, CA. New York (NY): ACM. p. 1135–1144.
  • 80. Lundberg SM, et al. 2020. From local explanations to global understanding with explainable AI for trees. Nat Mach Intell. 2:56–67.
  • 81. Simonyan K, Vedaldi A, Zisserman A. 2014. Deep inside convolutional networks: visualising image classification models and saliency maps. In: Proceedings of the International Conference on Learning Representations (ICLR); April 14–16, 2014, Banff, AB, Canada. p. 1–8.

Associated Data


Supplementary Materials

pgae268_Supplementary_Data

Data Availability Statement

All datasets are available in the GitHub DSCAR-data repository. The source codes of DSCAR, DSCARNet, and the DSCARNet explainable module are freely available in the GitHub DSCAR repository (https://github.com/songlinlu/DSCAR).


Articles from PNAS Nexus are provided here courtesy of Oxford University Press
