PLOS Computational Biology. 2023 May 30;19(5):e1011197. doi: 10.1371/journal.pcbi.1011197

Identifying prognostic subgroups of luminal-A breast cancer using deep autoencoders and gene expressions

Seunghyun Wang 1, Doheon Lee 1,*
Editor: Mark Alber2
PMCID: PMC10256220  PMID: 37253056

Abstract

Luminal-A breast cancer is the most frequently occurring subtype and is characterized by high expression levels of hormone receptors. However, some luminal-A breast cancer patients suffer from intrinsic and/or acquired resistance to endocrine therapies, which are considered first-line treatments for luminal-A breast cancer. This heterogeneity within luminal-A breast cancer calls for a more precise stratification method. Hence, our study aims to identify prognostic subgroups of luminal-A breast cancer. In this study, we discovered two prognostic subgroups of luminal-A breast cancer (BPS-LumA and WPS-LumA) using deep autoencoders and gene expression profiles. The deep autoencoders were trained using gene expression profiles of 679 luminal-A breast cancer samples in the METABRIC dataset. Then, the latent features of each sample generated by the deep autoencoders were used for K-means clustering to divide the samples into two subgroups, and Kaplan-Meier survival analysis was performed to compare prognosis (recurrence-free survival) between them. As a result, the prognosis of the two subgroups was significantly different (p-value = 5.82E-05; log-rank test). This prognostic difference was validated using gene expression profiles of 415 luminal-A breast cancer samples in the TCGA BRCA dataset (p-value = 0.004; log-rank test). Notably, the latent features were superior to the gene expression profiles and to a traditional dimensionality reduction method in terms of discovering the prognostic subgroups. Lastly, using differentially expressed genes and co-expression network analysis, we found that ribosome-related biological functions could be associated with the prognostic difference between the subgroups. Our stratification method can contribute to understanding the complexity of luminal-A breast cancer and to providing personalized medicine.

Author summary

Luminal-A breast cancer is the most frequently occurring breast cancer subtype. However, it shows high variability in prognosis, and more precise stratification is needed. In this paper, we identified two prognostic subgroups of luminal-A breast cancer, BPS-LumA and WPS-LumA. To this end, we used deep autoencoders, which automatically generate informative latent features that represent essential properties of gene expression profiles. We found that the two subgroups clustered using the latent features differ significantly in prognosis. This prognostic difference was validated with an external luminal-A breast cancer cohort. We showed that, in contrast to the gene expression profiles themselves, only the latent features were able to reveal the prognostic subgroups. In addition, we compared our results with two previous luminal-A breast cancer stratification methods and found that they are complementary to ours. Finally, we suggested that the biological functions associated with the differentially expressed genes between the two subgroups are potential molecular mechanisms underlying the difference in prognosis. We expect that our method could be used for the personalized medicine of luminal-A breast cancer.

Introduction

Personalized medicine is the ultimate goal of modern medicine [1]. The need for personalized medicine arises from the fact that millions of people take medications that will not help them. It is reported that the top ten highest-grossing drugs in the United States help only between 1 in 25 and 1 in 4 of the people who take them [2]. This results from a limitation of large-scale clinical trials, which cannot consider the individual characteristics of each patient [3]. Fortunately, owing to the recent development of high-throughput technologies, there are attempts to infer the individual characteristics of each patient from various omics data and to reflect them in the treatment and management of diseases. In this context, breast cancer is one of the notable diseases for which personalized medicine has been realized in the clinic.

Breast cancer is one of the leading causes of death among females [4], and it has been discovered that breast cancer is a heterogeneous disease whose subtypes have different molecular mechanisms and require different therapeutic strategies [5]. Traditionally, immunohistochemical markers (e.g., estrogen receptor (ER), progesterone receptor (PR), and HER2) have been used to stratify breast cancer patients [6]. More recently, PAM50 has become the most popular subtyping method; it classifies breast cancer into several intrinsic subtypes (e.g., luminal-A, luminal-B, HER2-enriched, and basal-like) based on the expression levels of 50 genes [7]. Moreover, it is well known that there is significant concordance between the immunohistochemistry-based stratification and PAM50 [7].

In particular, luminal-A breast cancer is the most frequently occurring subtype, accounting for about 60 to 70% of all breast cancers, and it is characterized as hormone receptor-positive (ER, PR) and HER2-negative [8]. Hence, endocrine therapies have been considered first-line treatments for luminal-A breast cancer. For example, aromatase inhibitors (e.g., anastrozole, letrozole, and exemestane) interrupt estrogen production by inhibiting the aromatization of androgens to estrogens [9]. In contrast, selective estrogen receptor modulators (SERMs), such as tamoxifen, block the binding of estrogen to the estrogen receptor, and selective estrogen receptor degraders (SERDs), such as elacestrant and fulvestrant, inhibit the translocation of the estrogen receptor to the nucleus and degrade it [9]. However, even within the luminal-A subtype, some patients show intrinsic and/or acquired resistance to these endocrine therapies [10], and the prognosis of luminal-A breast cancer is more variable than that of the other breast cancer subtypes [11].

Consequently, this heterogeneity within luminal-A breast cancer calls for a more precise stratification method that makes it possible to predict prognosis and to provide personalized diagnosis and treatment. Several previous studies have proposed prognostic subgroups of luminal-A breast cancer using machine learning and gene expression data. For example, Netanely et al. clustered luminal-A breast cancer samples into two prognostic subgroups (LumA-R1 and LumA-R2) using the genes with the most variable expression [12]. Later, Poudel et al. classified luminal-A breast cancer into five subgroups based on the expression levels of several marker genes that represent five different cell types (Enterocytes, Inflammatory, Stem-like, Goblet-like, and TA) [13], and they revealed that four of the five subgroups were significantly different in prognosis [14].

Even though the previous studies successfully identified prognostic subgroups of luminal-A breast cancer, they share a limitation: they required human engineering to select the features (genes) used for stratification. For example, Netanely et al. took the top 2,000 genes showing the highest variability in expression, and Poudel et al. selected the marker genes according to their significance in a microarray analysis [15].

From the perspective of feature engineering, deep learning, which takes advantage of data-driven automatic feature learning, can be a promising solution [16]. In particular, an autoencoder is an artificial neural network designed for dimensionality reduction and feature extraction [17]. The autoencoder is trained to generate latent features in its hidden layers from which the input features can be reconstructed in the output layer. As a consequence, the autoencoder automatically extracts and compresses essential properties of the input features and generates informative latent features in the hidden layers. Recently, Tan et al. trained an autoencoder using gene expression data of breast cancer and showed that the latent features were able to discriminate the intrinsic subtypes of breast cancer [18]. Dwivedi et al. showed that disease modules could be discovered through an autoencoder trained with large gene expression datasets [19].

In this study, we identified two prognostic subgroups of luminal-A breast cancer. To this end, we trained deep autoencoders using gene expression profiles of luminal-A breast cancer to generate informative latent features of each sample, and we discovered the subgroups by applying unsupervised learning to the latent features. Additionally, we showed that our method makes important contributions toward realizing precision medicine for luminal-A breast cancer. First, we demonstrated that our method is feasible in an independent test set, which is the most important step in translating deep learning approaches into clinical practice. Furthermore, we showed that the latent features are more useful than the gene expression profiles and the features generated by a traditional dimensionality reduction method (i.e., PCA) in terms of discovering the prognostic subgroups. In addition, we suggested potential molecular mechanisms that determine the prognostic difference, using the differentially expressed genes between the subgroups and weighted gene co-expression network analysis. Lastly, we compared our stratification with two previous luminal-A breast cancer stratification methods.

Results

The latent features generated from the deep autoencoders successfully identify the prognostic subgroups of luminal-A breast cancer

First, because we aimed to obtain informative latent features for identifying prognostic subgroups of luminal-A breast cancer, we trained deep autoencoders, which can automatically extract and compress important properties of gene expression profiles without additional human engineering. To this end, we obtained gene expression profiles of 679 and 415 luminal-A breast cancer samples from the METABRIC and TCGA datasets and used them as the training set and the validation set, respectively. Before training the deep autoencoders, we chose the top 5,000 genes showing the highest variability across the samples, based on the median absolute deviation (MAD) in the METABRIC dataset (S1 Table). Then, we renormalized the gene expression profiles in both datasets using min-max scaling and used them as the input features of the deep autoencoders (Fig 1A and Methods). We trained eight deep autoencoders with different hidden layer sizes (16, 32, 64, 128, 256, 512, 1024, and 2048). Within each autoencoder, all three hidden layers were set to the same size, to see whether the hidden layer size affects the performance of the deep autoencoders [19]. The performance of each deep autoencoder was evaluated by the mean squared error (MSE) between the renormalized gene expression profiles of the input layer and the reconstructed gene expression profiles of the output layer.
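The MAD-based gene selection described above can be sketched as follows. This is a minimal, dependency-free illustration rather than the authors' code: the function names, the dictionary layout, and the toy gene names are assumptions made for the example, and the toy data selects 2 of 3 genes instead of 5,000 of 17,202.

```python
from statistics import median


def median_absolute_deviation(values):
    """MAD = median(|x - median(x)|) over the samples of one gene."""
    center = median(values)
    return median(abs(v - center) for v in values)


def top_variable_genes(expression, n_top):
    """Return the n_top gene names with the largest MAD.

    expression: dict mapping gene name -> list of expression values,
    one value per sample (hypothetical layout for this sketch).
    """
    ranked = sorted(
        expression,
        key=lambda gene: median_absolute_deviation(expression[gene]),
        reverse=True,
    )
    return ranked[:n_top]


# Toy data with placeholder gene names.
toy = {
    "GENE_A": [0.0, 0.1, -0.1, 0.0],
    "GENE_B": [2.0, -2.0, 1.5, -1.5],
    "GENE_C": [0.5, -0.5, 0.4, -0.4],
}
selected = top_variable_genes(toy, n_top=2)
print(selected)
```

Because the MAD is computed from medians, a few extreme samples influence the ranking less than they would with the variance.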

Fig 1. Overall pipelines.


(A) X1 denotes the renormalized gene expression profiles and X5 denotes the gene expression profiles reconstructed from the latent features. N is the number of genes, and M is the number of samples. The deep autoencoders were trained using the renormalized gene expression profiles of luminal-A breast cancer in METABRIC, and the samples of TCGA BRCA were used as the validation set. (B) The latent features of each of the 679 METABRIC samples were generated in the second hidden layer of the deep autoencoders, and (C) the samples were divided into subgroups using the latent features as the input features of unsupervised learning. (D) Kaplan-Meier analysis was performed to compare the prognosis (recurrence-free survival rate) of the subgroups, and (E) the prognostic differences were validated using the recurrence-free survival data of 415 TCGA samples.

As expected, the MSEs decreased continuously as the size of the hidden layers increased in the training set (S2 Table), but the differences between the deep autoencoders were small (MSE = 0.012±0.008). In the validation set, the differences between the deep autoencoders were much smaller than in the training set (MSE = 0.075±0.003). More importantly, the validation MSEs decreased until the hidden layer size reached 128 and started to increase again for hidden layer sizes larger than 128. This indicates that the larger models might be overfitted to the training set. Therefore, we decided to use only the latent features obtained from the models whose hidden layer sizes are 16, 32, 64, and 128 in the following survival analysis.

The ultimate goal of our study is to identify distinct prognostic subgroups of luminal-A breast cancer. Hence, we clustered 679 METABRIC samples using the latent features generated from the deep autoencoders and unsupervised learning. We used K-means clustering for unsupervised learning. Then, we performed the Kaplan-Meier survival analysis to compare prognosis between the subgroups and the significance was evaluated by log-rank test.
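For readers unfamiliar with the log-rank test, the sketch below implements the standard two-group statistic on toy data. It is an illustration under stated assumptions, not the authors' analysis code (this section does not name the software used): the function name, argument layout, and toy follow-up times are hypothetical, and the p-value uses the chi-square distribution with one degree of freedom.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value).

    times_*: follow-up times (e.g. recurrence-free survival months).
    events_*: 1 if recurrence was observed, 0 if the sample was censored.
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e == 1})

    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        # Samples still under observation just before time t.
        n = sum(1 for u, _, _ in data if u >= t)
        n_a = sum(1 for u, _, g in data if u >= t and g == 0)
        # Events occurring exactly at time t (all groups, then group A).
        d = sum(1 for u, e, _ in data if u == t and e == 1)
        d_a = sum(1 for u, e, g in data if u == t and e == 1 and g == 0)
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)

    if variance == 0:
        return 0.0, 1.0  # degenerate input: no information to compare groups
    chi2 = observed_minus_expected ** 2 / variance
    # Survival function of the chi-square distribution with 1 d.o.f.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Toy example: group A recurs early, group B recurs late.
chi2, p = logrank_test([1, 2, 3, 4, 5], [1, 1, 1, 1, 1],
                       [6, 7, 8, 9, 10], [1, 1, 1, 1, 1])
print(chi2, p)
```

When more than two clusters are compared, the same function can be applied to each pair of subgroups, which corresponds to a pairwise log-rank test.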

As a result, the prognostic differences between the subgroups were consistently significant (p-value < 0.01) when the samples were divided into two subgroups, regardless of the dimensionality of the latent features (S1 Fig), and the difference was most significant when the dimensionality was 64 (p-value = 5.82E-05, Fig 2). However, when the samples were clustered into more than two subgroups using the 64-dimensional latent features, we found that some subgroup pairs did not show significant prognostic differences in the pairwise log-rank test. For example, when the samples were clustered into three subgroups, the samples belonging to the third subgroup (cluster3) did not show significant prognostic differences from the samples belonging to the other subgroups (cluster1 and cluster2) (S3 Table and S2 Fig). We observed similar tendencies when we divided the samples into four and five subgroups (S3 Table and S2 Fig).

Fig 2. The t-SNE (t-distributed Stochastic Neighbor Embedding) plot of 64-dimensional latent features and the Kaplan-Meier survival curve of 679 METABRIC luminal-A breast cancer samples.


(A) The t-SNE plot (dimension size = 2) of the latent features generated by the deep autoencoders for the 679 METABRIC luminal-A breast cancer samples. The samples assigned to BPS-LumA (the better prognostic subgroup) and WPS-LumA (the worse prognostic subgroup) are colored green and orange, respectively. (B) The green and orange curves indicate BPS-LumA and WPS-LumA, respectively. The x-axis refers to recurrence-free survival months and the y-axis refers to survival probability.

Given these results, we concluded that the prognostic differences were most distinct when the 679 METABRIC samples were divided into two subgroups using the 64-dimensional latent features (Fig 2A). In the Kaplan-Meier survival curve (Fig 2B), the first subgroup (n = 336) and the second subgroup (n = 343) show worse and better prognosis, respectively (S4 Table). In the following sections of this study, we refer to the better prognostic subgroup as ‘BPS-LumA’ and to the worse prognostic subgroup as ‘WPS-LumA’.

The prognostic difference between BPS-LumA and WPS-LumA was validated in the independent dataset

To validate the prognostic difference between BPS-LumA and WPS-LumA, we applied our stratification method to an independent dataset, the 415 TCGA luminal-A breast cancer samples. We generated the 64-dimensional latent features of each TCGA sample using the deep autoencoder trained with the METABRIC dataset in the previous section, and each sample was assigned to the closer subgroup based on the distance between the sample and the centroids of BPS-LumA and WPS-LumA in the latent space (Fig 3A). Among the 415 samples, 191 and 224 samples were assigned to BPS-LumA and WPS-LumA, respectively (S5 Table). The samples assigned to BPS-LumA showed significantly better prognosis than the samples assigned to WPS-LumA (p-value = 0.004; Fig 3B).
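The centroid-based assignment of new samples can be sketched as below. This is a minimal illustration, not the authors' implementation: the text only says "distance", so the Euclidean distance used here is an assumption, and the helper names and the 2-dimensional toy latent features (instead of 64 dimensions) are hypothetical.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length feature vectors."""
    dim = len(points[0])
    return [sum(p[k] for p in points) / len(points) for k in range(dim)]


def assign_to_nearest(sample, centroids):
    """Return the label whose centroid is closest to the sample.

    Assumption: Euclidean distance in the latent space.
    """
    return min(centroids, key=lambda label: math.dist(sample, centroids[label]))


# Toy 2-D latent features of training samples, grouped by cluster label.
train_bps = [[0.0, 0.0], [0.2, 0.1]]
train_wps = [[1.0, 1.0], [0.8, 0.9]]
centroids = {
    "BPS-LumA": centroid(train_bps),
    "WPS-LumA": centroid(train_wps),
}

# Latent features of two new (validation) samples.
new_samples = [[0.1, 0.2], [0.9, 0.7]]
new_labels = [assign_to_nearest(s, centroids) for s in new_samples]
print(new_labels)
```

The centroids are computed once from the training cohort, so validation samples never influence the subgroup definitions.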

Fig 3. The t-SNE plot of the 64-dimensional latent features of all samples in the METABRIC and TCGA BRCA datasets and the Kaplan-Meier survival curve of the 415 TCGA BRCA luminal-A breast cancer samples.


(A) The t-SNE plot of all samples in the METABRIC and TCGA BRCA datasets. The samples of METABRIC and TCGA BRCA are denoted by circles and squares, respectively. The samples assigned to BPS-LumA and WPS-LumA are colored green and orange, respectively. (B) The Kaplan-Meier survival curve of the 415 TCGA samples, which were assigned to the closer prognostic subgroup (BPS-LumA or WPS-LumA) in the latent space. (C) The mean log2-transformed expression levels and (D) the log-transformed median absolute deviations of individual genes (N = 17,202) in the METABRIC (microarray, x-axis) and TCGA (RNA-seq, y-axis) datasets, shown as scatter plots. Each dot in (C) and (D) represents an individual gene.

Interestingly, even though the two datasets used different expression profiling platforms (microarray for METABRIC and RNA-seq for TCGA), we showed that our method is applicable to both datasets. To further explore this result, we measured the Spearman’s rank correlation coefficient (SRCC) between the mean expression levels of individual genes in the two datasets and observed that they are highly correlated (SRCC = 0.771, Fig 3C). Similarly, we measured the SRCC between the median absolute deviations of individual genes in the two datasets and confirmed that they are also significantly correlated (SRCC = 0.595, Fig 3D). In addition to these correlations between the two expression profiling platforms, the additional preprocessing applied to reduce batch effects (e.g., min-max scaling) allows our method to predict the prognosis of the samples regardless of the expression profiling platform.
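As an illustration of the rank correlation used above, the following dependency-free sketch computes Spearman's coefficient as the Pearson correlation of average ranks. The function names and the toy per-gene values are assumptions for the example; the paper does not state which implementation was used.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy per-gene means on two platforms: monotone but non-linear relation.
platform_1 = [1.0, 2.0, 3.0, 4.0]
platform_2 = [1.0, 4.0, 9.0, 100.0]
rho = spearman(platform_1, platform_2)
print(rho)
```

Because only ranks enter the calculation, the coefficient is insensitive to the different measurement scales of microarray and RNA-seq data.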

Only the latent features of deep autoencoders successfully identified the prognostic subgroups

Next, we compared our method with the gene expression profiles themselves and with a traditional dimensionality reduction method, to show the usefulness of the deep autoencoders for generating informative latent features for prognostic subgroup identification. Hence, in the same way as with the latent features, we divided the 679 METABRIC samples into two subgroups using either the expression profiles of all 17,202 genes or those of the top 5,000 genes with the highest variability as the input features of K-means clustering. We found that the prognostic differences between the subgroups identified using all 17,202 genes (p-value = 0.566; Fig 4A) and using the top 5,000 most variable genes (p-value = 0.183; Fig 4B) were not significant.

Fig 4. The Kaplan-Meier survival curves obtained when the samples are clustered using the gene expression profiles and the low-dimensional features generated using PCA.


The Kaplan-Meier survival curves obtained when the 679 METABRIC samples were divided into two clusters using (A) all 17,202 genes; (B) the top 5,000 most variable genes; the 64-dimensional PCA features of (C) all 17,202 genes and (D) the top 5,000 most variable genes; and the 2-dimensional PCA features of (E) all 17,202 genes and (F) the top 5,000 most variable genes.

In addition, we compared our method with a traditional dimensionality reduction method. To this end, we projected the expression profiles of all 17,202 genes and of the top 5,000 most variable genes into a 64-dimensional space (the same dimensionality as the latent features that we used) and a 2-dimensional space (the most informative principal components) using PCA (Principal Component Analysis). As a result, we observed that none of these four feature sets revealed distinct prognostic subgroups (Fig 4C–4F).

From these results, we confirmed that only the latent features were able to identify the prognostic subgroups (p-value = 5.82E-05; Fig 2A). This indicates that the deep autoencoders extract the properties of the complex gene expression profiles that determine the prognosis of luminal-A breast cancer more effectively than the traditional dimensionality reduction method does, which helps to discover distinct prognostic subgroups for precision medicine.

Ribosome-related biological functions could be potentially associated with the prognostic difference between BPS-LumA and WPS-LumA

Next, we investigated which biological functions could underlie the prognostic difference between BPS-LumA and WPS-LumA. To this end, among the top 5,000 genes with the highest variability, we identified 548 genes that were differentially expressed (DEGs) between BPS-LumA and WPS-LumA in both the METABRIC and TCGA datasets using limma [20] (adjusted p-value < 0.01, S6 and S7 Tables). Then, using 459 gene expression profiles of non-diseased breast tissue obtained from GTEx, we constructed a weighted co-expression network of breast tissue with WGCNA [21, 22]; the network consists of 23 co-expressed modules (Fig 5A and S8 and S9 Tables). We used the co-expression network to analyze the DEGs because the malfunction of individual genes does not result in the dysfunction of biological systems, owing to the robustness of biological systems, so the impact of the DEGs has to be analyzed at the system level rather than at the single-gene level [23–25]. Lastly, we measured the proportion (%) of genes overlapping with the DEGs in each co-expressed module.

Fig 5. The module size and the proportion of genes overlapping with the DEGs in each co-expressed module.


(A) The bar graph of the module sizes (the number of genes in each module). (B) The bar graph of the percentage of genes overlapping with the DEGs in each module. The lightcyan module is highlighted in yellow. (C) The network plot of the lightcyan module. The DEGs are colored yellow, and the node size indicates the connectivity of each node.

As a result, we observed that the “lightcyan” co-expressed module includes a significantly large proportion of DEGs (11.6%, Fig 5B and S10 Table). We explored the biological functions related to the lightcyan module through gene-set enrichment analysis [26] and observed that the 95 genes included in the lightcyan module were associated with ribosome-related terms such as “rRNA metabolic process (GO:0016072)”, “Ribosome (MAP 03010)”, and “Cytoplasmic Ribosomal Proteins (WP477)” (S11 Table). Among the 95 genes in the lightcyan module, 11 genes were DEGs between BPS-LumA and WPS-LumA (Fig 5C).
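The proportion reported for the lightcyan module follows directly from the counts above: 11 DEGs among 95 module genes gives 11/95 ≈ 11.6%. The sketch below computes this overlap percentage for every module; the function name and the gene identifiers are placeholders, since the actual gene lists are given in the supplementary tables.

```python
def deg_overlap_percent(modules, degs):
    """Percentage of genes in each module that are also DEGs.

    modules: dict mapping module name -> list of gene identifiers.
    degs: iterable of DEG identifiers.
    """
    deg_set = set(degs)
    return {
        name: 100.0 * len(deg_set & set(genes)) / len(genes)
        for name, genes in modules.items()
    }


# Placeholder identifiers: a 95-gene module containing 11 DEGs,
# plus a second hypothetical module without any DEG.
modules = {
    "lightcyan": [f"gene_{i}" for i in range(95)],
    "toy_module": [f"other_{i}" for i in range(10)],
}
degs = [f"gene_{i}" for i in range(11)] + ["unrelated_gene"]

percent = deg_overlap_percent(modules, degs)
print(round(percent["lightcyan"], 1))
```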

There is substantial evidence in the literature for associations between ribosomal proteins (RPs) and cancer [27]. For example, the RP-MDM2-p53 signaling pathway is the most well-studied pathway defining a role of ribosomal proteins in the activation of the tumor suppressor gene p53 [28]. Recently, it was revealed that deregulation of some ribosomal proteins can promote breast cancer metastasis [29] and that ribosome biogenesis could be a potential therapeutic target to combat tamoxifen resistance [30]. Based on these results, the terms significantly associated with the DEGs, and their activities in each subgroup, could be considered potential biological mechanisms that drive the prognostic difference between BPS-LumA and WPS-LumA.

BPS-LumA and WPS-LumA are complementary to the previous stratification methods of luminal-A breast cancer

Additionally, we examined how well our method agrees with two previous luminal-A breast cancer stratification methods: Netanely’s method [12] and Poudel’s method [14]. Netanely’s method suggested two prognostic subgroups: LumA-R1 (poor prognosis) and LumA-R2 (good prognosis). Poudel’s method proposed five heterocellular subgroups: stem-like (good prognosis), inflammatory (good prognosis), goblet-like (good/intermediate prognosis), TA (poor prognosis), and enterocyte. As a result, we confirmed that our method, which used the latent features generated automatically by the deep autoencoders, tends to be consistent with the two previous studies.

For example, our method was significantly concordant with Netanely’s method (p-value = 6.70E-05; Fig 6A and S12 Table). LumA-R1 was composed mostly of WPS-LumA samples (64.3%), whereas in LumA-R2 the proportion of BPS-LumA samples was larger than that of WPS-LumA samples (55.9%). Similarly, our method was also concordant with Poudel’s method (p-value = 0.004; Fig 6B and S12 Table). We observed that the proportion of BPS-LumA samples decreased in line with the prognosis of four subgroups in Poudel’s method: the stem-like, inflammatory, goblet-like, and TA subgroups. In the stem-like subgroup, 52.22% of the samples belonged to BPS-LumA, and the proportion decreased to 48%, 41.18%, and 18.18% in the inflammatory, goblet-like, and TA subgroups, respectively. In addition, Poudel’s method did not provide prognosis information for the enterocyte subgroup, but interestingly this subgroup was composed of a large proportion of BPS-LumA samples (70.68%).

Fig 6. The comparison with previous luminal-A breast cancer stratification methods.


(A) Netanely’s method, (B) Poudel’s method. The x-axis refers to the subgroups identified in the previous studies, and the y-axis indicates the percentage of BPS-LumA and WPS-LumA samples belonging to each of them. BPS-LumA and WPS-LumA are colored green and orange, respectively.

Discussion

In this study, we identified two prognostic subgroups of luminal-A breast cancer, BPS-LumA and WPS-LumA, using the latent features generated by deep autoencoders trained with gene expression profiles. We validated the prognostic difference between the two subgroups using an independent dataset. We also showed that the latent features generated by the deep autoencoders are more useful for identifying the prognostic subgroups than the gene expression profiles and the features generated by a traditional dimensionality reduction method (i.e., PCA).

Consequently, the remaining challenge is to develop more effective therapeutic strategies for each subgroup, especially for the patients who belong to the worse prognostic subgroup. A stratification method reaches its full potential only when there is an appropriate treatment for each subgroup. For example, patients with hormone receptor-positive breast cancer (e.g., luminal-A and luminal-B) receive endocrine therapy as a first-line treatment. In contrast, dual HER2 blockade (e.g., trastuzumab and pertuzumab) is recommended for hormone receptor-negative and human epidermal growth factor receptor 2-positive breast cancer (e.g., HER2-enriched) [8]. Likewise, developing appropriate therapeutic strategies for each subgroup identified in our study is our top priority for the near future, and the enriched pathways that we found using the differentially expressed genes between the two subgroups could be a starting point. We observed that ribosome-related terms were significantly associated with the differentially expressed genes between BPS-LumA and WPS-LumA, and evidence in the literature indicates that proteins involved in these biological functions could be considered potential drug targets [28–32] for the patients who belong to the worse prognostic subgroup and are resistant to endocrine therapy.

In addition to the DEG analysis, there have recently been many efforts to develop explainable and interpretable deep learning models to overcome the “black-box” problem [33, 34], especially in the biomedical and healthcare domains [35, 36]. These efforts could help to interpret the biological meaning of the latent features generated by our method. Network-based and pathway-based methods offer an alternative. They have limitations: our knowledge of the human interactome is still incomplete [37, 38], and the interaction information derived from experimental assays (e.g., yeast two-hybrid) and computational inference usually does not fully consider context specificity [39, 40]. Nevertheless, they could provide more solid evidence for interpreting deep learning approaches when such prior knowledge is combined with the deep learning models (e.g., visible neural networks [41]). Approaches that increase the interpretability of deep learning models could thus accelerate the development of novel therapeutic options for patients with poor prognosis.

Lastly, we showed that our method and the previous methods are significantly concordant, and further study could show how they complement each other for a more precise stratification. For example, in Poudel’s method, the samples belonging to the goblet-like subgroup are regarded as having intermediate prognosis, which is more ambiguous than the prognosis of the other subgroups. Interestingly, we found that 41.18% and 58.82% of them belong to BPS-LumA and WPS-LumA in our method, respectively. This result indicates that even though these samples show similar goblet cell-like signatures, they could show different characteristics in other biological pathways, such as the ribosome-related pathways that we discovered with the DEG analysis. From this perspective, comprehensively considering the results of our method and the previous methods can help to define the disease states of patients more precisely.

To sum up, we developed a precise stratification method for luminal-A breast cancer that is able to predict prognosis. Given that luminal-A breast cancer is the most frequently occurring breast cancer subtype and that its prognosis varies from patient to patient because of endocrine resistance, our method could help to stratify patients and to prepare alternative treatment options according to the predicted prognosis.

Materials and methods

Collecting gene expression profiles and recurrence-free survival data of METABRIC and TCGA BRCA

Two breast cancer datasets, METABRIC (the Molecular Taxonomy of Breast Cancer International Consortium) [42, 43] and TCGA (The Cancer Genome Atlas) BRCA [44], were collected. We downloaded the gene expression data, recurrence-free survival data, and PAM50 subtype data of both datasets from the cBio Cancer Genomics Portal [45], except for the PAM50 subtype data of the TCGA dataset, which we obtained from the supplemental information of the original publication [44]. For the gene expression profiles, we downloaded the normalized gene expression profiles (median Z-scores) of both datasets instead of the raw gene expression profiles. The recurrence-free survival data include the recurrence-free survival status ("Recurred" or "Not recurred") and the recurrence-free survival months.

Renormalizing gene expression profiles

Among the 2,509 samples in the METABRIC dataset and the 817 samples in the TCGA dataset, we selected 679 and 415 luminal-A breast cancer samples, respectively, based on the PAM50 subtype data. We also selected the 17,202 genes whose expression values are available in all 679 METABRIC samples and all 415 TCGA samples. Then, we chose the top 5,000 genes with the highest variability based on the median absolute deviation of each gene. Next, we renormalized the normalized gene expression profiles (median Z-scores) of the 679 METABRIC samples so that the expression value of the ith gene in the jth sample, $e_{i}^{\mathrm{META}(j)}$, lies in the range between zero and one:

$$X_{1,i}^{\mathrm{META}(j)} = \frac{e_{i}^{\mathrm{META}(j)} - \min\left(e_{i}^{\mathrm{META}}\right)}{\max\left(e_{i}^{\mathrm{META}}\right) - \min\left(e_{i}^{\mathrm{META}}\right)} \tag{1}$$

where $\min(e_{i}^{\mathrm{META}})$ and $\max(e_{i}^{\mathrm{META}})$ are the minimum and maximum expression values of the ith gene across the 679 METABRIC samples.

Similarly, we renormalized the normalized gene expression profiles of the 415 TCGA samples using the minimum and maximum expression values of each gene in the METABRIC dataset:

$$X_{1,i}^{\mathrm{TCGA}(j)} = \frac{e_{i}^{\mathrm{TCGA}(j)} - \min\left(e_{i}^{\mathrm{META}}\right)}{\max\left(e_{i}^{\mathrm{META}}\right) - \min\left(e_{i}^{\mathrm{META}}\right)} \tag{2}$$

We used the Python machine learning library Scikit-learn (version 0.23.2) to renormalize the gene expression profiles.
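The renormalization in Eqs (1) and (2) can be illustrated with a small dependency-free sketch in which the per-gene minimum and maximum are learned from the training cohort only and then reused for the validation cohort. The function names and toy values are assumptions for this example; with Scikit-learn, the same behaviour would presumably be obtained by fitting `MinMaxScaler` on the METABRIC matrix and calling `transform` on the TCGA matrix.

```python
def fit_min_max(train_rows):
    """Per-gene minimum and maximum over the training samples.

    train_rows: list of samples, each a list of gene expression values.
    """
    columns = list(zip(*train_rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_min_max(rows, mins, maxs):
    """Scale each gene with the given (training) minimum and maximum."""
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]


# Toy cohorts: 3 training samples and 1 validation sample, 2 genes each.
train = [[-1.0, 0.0], [1.0, 2.0], [0.0, 4.0]]
valid = [[2.0, 1.0]]

mins, maxs = fit_min_max(train)
train_scaled = apply_min_max(train, mins, maxs)   # Eq (1)
valid_scaled = apply_min_max(valid, mins, maxs)   # Eq (2)
print(train_scaled, valid_scaled)
```

Note that validation values can fall outside the range between zero and one when a gene's expression in the validation cohort exceeds the range seen in the training cohort, as the first gene of the toy validation sample does here.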

Training deep autoencoders

We constructed deep autoencoders composed of five layers: an input layer, three hidden layers, and an output layer. As shown in Fig 1A, the renormalized gene expression profile of each sample, $X^{1}_{j}$, was used as the input, so the sizes of the input and output layers were set to the number of genes (N = 5,000). We gave all three hidden layers the same number of nodes and trained eight deep autoencoders with different hidden layer sizes (16, 32, 64, 128, 256, 512, 1024, and 2048) to see whether the size of the hidden layers affects the performance of the deep autoencoders [19]. We used the ReLU function as the activation function in the hidden layers and the sigmoid function in the output layer. The latent features of the jth sample generated in the kth layer of the deep autoencoder whose hidden layer size is dim are calculated as:

$$X^{k,dim}_{j} = f_{k}\left(W^{k,dim} X^{k-1,dim}_{j} + b^{k,dim}\right), \quad k \in \{3, 4, 5\} \tag{3}$$

In Eq (3), $f_{k}$ is the activation function of the kth layer, and $W^{k,dim}$ and $b^{k,dim}$ are the weight matrix and the bias of the kth layer of the deep autoencoder whose hidden layer size is dim. For example, $X^{3,64}_{\mathrm{META}(j)}$ denotes the latent features of the jth sample in the METABRIC dataset, generated in the second hidden layer of the deep autoencoder whose hidden layer size is 64.

We used the 679 METABRIC samples as the training set and the 415 TCGA samples as the validation set. We evaluated the performance of the deep autoencoders using the mean squared error (MSE), which measures the difference between the renormalized gene expression profiles at the input layer ($X^{1}$) and the reconstructed gene expression profiles at the output layer ($X^{5}$). The deep autoencoders were trained to minimize the MSE:

$$\mathrm{MSE} = \frac{1}{NM} \sum_{j=1}^{M} \sum_{i=1}^{N} \left(X^{5}_{i,j} - X^{1}_{i,j}\right)^{2} \tag{4}$$

where N is the number of genes and M is the number of samples. We used Adam as the optimizer [46] and set the batch size and the number of epochs to 16 and 100, respectively. All procedures related to the construction and training of the deep autoencoders were performed with the Python machine learning library TensorFlow (version 2.3.0).
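To make the layer indexing of Eq (3) and the loss of Eq (4) concrete, the following NumPy sketch runs one forward pass through a five-layer autoencoder. It is an illustration only: the weights are random and untrained, the toy sizes (30 genes, hidden size 4) and all function names are assumptions, and the Adam-based training performed in TensorFlow is not reproduced here.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_autoencoder(n_genes, dim, rng):
    """Random (weight, bias) pairs for layers 2..5: three hidden layers, then the output."""
    sizes = [n_genes, dim, dim, dim, n_genes]
    return [(rng.normal(scale=0.1, size=(n_out, n_in)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(x1, params):
    """Return the activations [X1, X2, X3, X4, X5] for a batch of samples (rows)."""
    activations = [x1]
    for idx, (w, b) in enumerate(params):
        act = sigmoid if idx == len(params) - 1 else relu   # sigmoid only at the output
        activations.append(act(activations[-1] @ w.T + b))
    return activations

def mse(x1, x5):
    """Eq (4): squared reconstruction error averaged over N genes and M samples."""
    return float(np.mean((x5 - x1) ** 2))

rng = np.random.default_rng(0)
x1 = rng.uniform(size=(8, 30))            # toy batch: 8 samples x 30 genes in [0, 1]
params = init_autoencoder(n_genes=30, dim=4, rng=rng)
layers = forward(x1, params)
latent = layers[2]                        # X3: output of the second hidden layer
loss = mse(x1, layers[-1])
```

The sigmoid output keeps every reconstructed value between zero and one, which matches the range of the renormalized inputs.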

Dividing the samples into subgroups

As shown in Fig 1B, we generated the latent features of each of the 679 METABRIC samples in the second hidden layer of each of four deep autoencoders (dim ∈ {16, 32, 64, 128}). They were calculated as:

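$$X^{3,dim}_{\mathrm{META}(j)} = f_{3}\left(W^{3,dim} X^{2,dim}_{\mathrm{META}(j)} + b^{3,dim}\right) \quad \text{where} \quad X^{2,dim}_{\mathrm{META}(j)} = f_{2}\left(W^{2,dim} X^{1,dim}_{\mathrm{META}(j)} + b^{2,dim}\right) \tag{5}$$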

Consequently, each sample had four sets of latent features whose dimensions matched the sizes of the hidden layers. We then divided the samples into subgroups using each set of latent features as the input features for K-Means clustering, with the number of clusters set from two to five. The Python machine learning library Scikit-learn (version 0.23.2) was used for the clustering.
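The clustering step can be sketched with a plain implementation of Lloyd's algorithm. This is a simplified stand-in for the Scikit-learn K-Means used in the study: the initialization (random samples as starting centroids), the toy latent features, and the function name `kmeans` are assumptions for illustration.

```python
import numpy as np

def kmeans(x, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Euclidean distance from every sample to every centroid.
        dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids; an empty cluster keeps its previous centroid.
        new_centroids = np.array([
            x[labels == c].mean(axis=0) if np.any(labels == c) else centroids[c]
            for c in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(1)
# Toy latent features: two well-separated groups of 64-dimensional vectors.
latent = np.vstack([rng.normal(0.0, 0.1, size=(15, 64)),
                    rng.normal(3.0, 0.1, size=(15, 64))])
results = {k: kmeans(latent, k) for k in range(2, 6)}   # two to five clusters
labels2, centroids2 = results[2]
```

Keeping the centroids alongside the labels matters for the later validation step, where new samples are assigned to the closest centroid.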

Comparing recurrence-free survival rate between subgroups

We performed Kaplan-Meier survival analysis [47] to compare prognosis between the subgroups using the recurrence-free survival status and months. The significance of the prognostic difference was estimated with the log-rank test [48]. Specifically, we conducted both multivariate and pairwise log-rank tests to find the number of subgroups that shows the most distinct prognostic differences. In this step, we also chose the hidden layer size that shows the most distinct prognostic difference between the subgroups for validation in the following step. The Kaplan-Meier survival analysis was implemented with the Python survival analysis library Lifelines (version 0.24.1).
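For readers unfamiliar with these two procedures, the sketch below implements the Kaplan-Meier estimator and a two-group log-rank test using only the Python standard library. It is not the Lifelines code used in the study; the function names and the toy follow-up times are assumptions, and the multivariate test for more than two subgroups is not covered.

```python
import math

def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time; events are 1 (recurred) or 0 (censored)."""
    curve, surv = [], 1.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        d = sum(1 for u, e in zip(times, events) if u == t and e)
        surv *= 1.0 - d / at_risk
        curve.append((t, surv))
    return curve

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-squared statistic, p-value with 1 df)."""
    times = list(times_a) + list(times_b)
    events = list(events_a) + list(events_b)
    obs_a = exp_a = var = 0.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        n_a = sum(1 for u in times_a if u >= t)          # at risk in group A
        n = sum(1 for u in times if u >= t)              # at risk overall
        d_a = sum(1 for u, e in zip(times_a, events_a) if u == t and e)
        d = sum(1 for u, e in zip(times, events) if u == t and e)
        obs_a += d_a
        exp_a += d * n_a / n
        if n > 1:
            var += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    stat = (obs_a - exp_a) ** 2 / var    # assumes var > 0, true for the toy data here
    # Survival function of a chi-squared variable with one degree of freedom.
    return stat, math.erfc(math.sqrt(stat / 2.0))

# Toy data: months of follow-up and recurrence indicators.
km = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
stat, p_value = logrank_test([1, 2, 3, 4, 5], [1] * 5,
                             [10, 11, 12, 13, 14], [1] * 5)
```

In the toy comparison every recurrence in the first group precedes every recurrence in the second, so the test statistic is large and the p-value is small.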

Validating the prognostic difference between BPS-LumA and WPS-LumA using an independent dataset

Using the renormalized gene expression profiles of the 415 TCGA samples, $X^{1}_{\mathrm{TCGA}(j)}$, we generated their latent features in the second hidden layer of the deep autoencoder whose hidden layer size is 64. They were calculated as:

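$$X^{3,64}_{\mathrm{TCGA}(j)} = f_{3}\left(W^{3,64} X^{2,64}_{\mathrm{TCGA}(j)} + b^{3,64}\right) \quad \text{where} \quad X^{2,64}_{\mathrm{TCGA}(j)} = f_{2}\left(W^{2,64} X^{1,64}_{\mathrm{TCGA}(j)} + b^{2,64}\right) \tag{6}$$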

Then, we assigned each TCGA sample to the closest subgroup based on its distance in the latent space from the centroid of each subgroup, where the centroids were identified using the samples in the METABRIC dataset. Kaplan-Meier survival analysis was performed to compare prognosis between the samples belonging to each subgroup.
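The assignment of validation samples to the training centroids can be sketched as below. The text does not name the distance measure, so Euclidean distance is an assumption here, as are the function name and the three-dimensional toy vectors (the study uses 64-dimensional latent features).

```python
import numpy as np

def assign_to_nearest_centroid(latent, centroids):
    """Index of the closest centroid (Euclidean distance, an assumption) for each row."""
    dists = np.linalg.norm(latent[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

# Hypothetical centroids of two subgroups learned on the training cohort.
centroids = np.array([[0.0, 0.0, 0.0],
                      [1.0, 1.0, 1.0]])
# Hypothetical latent features of three validation samples.
new_latent = np.array([[0.1, 0.0, 0.2],
                       [0.9, 1.1, 0.8],
                       [0.4, 0.4, 0.4]])
subgroup = assign_to_nearest_centroid(new_latent, centroids)
```

No clustering is rerun on the validation cohort in this scheme; the subgroup definitions are fixed by the training data.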

Comparing stratifications with the previous studies

We compared our stratification with those of two previous studies: Netanely's method [12] and Poudel's method [14]. We used the TCGA dataset for the comparison because both studies provided subgroup data for the TCGA samples, which we obtained from the supplemental information of the original publications [12, 14]. We used a chi-squared test, implemented in the Python scientific computing library SciPy (version 1.4.1), to examine how strongly each previous stratification is associated with ours.
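The comparison amounts to a chi-squared test of independence on a contingency table of subgroup memberships. The sketch below computes only the Pearson statistic and the degrees of freedom in plain Python; the counts are hypothetical, the function name is an assumption, and the p-value step that SciPy provides is left out.

```python
def chi_squared_statistic(table):
    """Pearson chi-squared statistic and degrees of freedom for an r x c contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Hypothetical counts: rows are the two subgroups of one method,
# columns are the subgroups of another method for the same samples.
table = [[30, 10],
         [10, 30]]
stat, dof = chi_squared_statistic(table)
```

Note that `scipy.stats.chi2_contingency` applies Yates' continuity correction to 2 x 2 tables by default, so its statistic can be smaller than the uncorrected value computed here.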

Finding differentially expressed genes between BPS-LumA and WPS-LumA

We used the normalized gene expression profiles of the METABRIC dataset and the R package limma [20] to find differentially expressed genes (DEGs) between the subgroups (p-value < 0.01). We performed the same procedure using the normalized gene expression profiles of the TCGA dataset. Then, among the top 5,000 genes with the highest variability that were used to train the deep autoencoders, we selected the genes that were differentially expressed in both datasets.
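The final selection is a three-way intersection of gene lists, sketched below in Python. The differential expression itself was computed with limma in R and is not reproduced; the gene identifiers and the function name are placeholders invented for this example.

```python
def shared_degs(degs_metabric, degs_tcga, top_variable_genes):
    """Genes differentially expressed in both cohorts and present among the top-variable genes."""
    return sorted(set(degs_metabric) & set(degs_tcga) & set(top_variable_genes))

# Placeholder gene identifiers, used only to show the set logic.
degs_metabric = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
degs_tcga = ["GENE_B", "GENE_A", "GENE_E"]
top_variable_genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_E", "GENE_F"]
selected = shared_degs(degs_metabric, degs_tcga, top_variable_genes)
```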

Constructing weighted co-expression network of breast tissue

We constructed a weighted co-expression network representing normal breast tissue and detected co-expressed modules using gene expression profiles obtained from GTEx (the Genotype-Tissue Expression project) [49] and the R package WGCNA (weighted gene co-expression network analysis) [21, 22]. We downloaded the gene read counts and TPM (transcripts per million) profiles of 459 breast tissue samples obtained from non-diseased tissue sites. We only considered the genes whose median read count was larger than ten. We divided the 459 samples into a training set (80%) and a test set (20%). We used the training set to determine the WGCNA parameters and construct the weighted co-expression network, and we used the test set to test the significance of co-expressed module preservation. As a result, we constructed a weighted co-expression network of breast tissue consisting of 23 modules and 8,383 genes that are strongly preserved in the test set. In addition, because WGCNA provides a weight for every gene pair (a fully connected network), we only used the top 10% of edges with the highest weights, and their nodes, in each module.
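The edge-pruning step at the end of this procedure can be sketched as follows. The network construction was done with WGCNA in R and is not reproduced; the sketch assumes a symmetric weight matrix for one module, and the random weights, gene names, and function name are illustrative only.

```python
import numpy as np

def top_edges(weights, genes, fraction=0.10):
    """Keep the highest-weight fraction of gene pairs in a symmetric weight matrix."""
    rows, cols = np.triu_indices(len(genes), k=1)       # each undirected pair once
    pair_weights = weights[rows, cols]
    n_keep = max(1, int(round(fraction * len(pair_weights))))
    order = np.argsort(pair_weights)[::-1][:n_keep]     # indices of the largest weights
    edges = [(genes[rows[i]], genes[cols[i]], float(pair_weights[i])) for i in order]
    nodes = sorted({g for a, b, _ in edges for g in (a, b)})
    return edges, nodes

rng = np.random.default_rng(0)
genes = [f"gene_{i}" for i in range(5)]                 # hypothetical module of 5 genes
w = rng.uniform(size=(5, 5))
w = (w + w.T) / 2                                       # symmetric toy weights
edges, nodes = top_edges(w, genes)
```

With five genes there are ten possible pairs, so a 10% cut keeps a single edge and its two endpoint genes.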

Performing gene-set enrichment analysis

We used Enrichr [26] to find the biological functions associated with each co-expressed module through gene-set enrichment analysis [50]. The analysis was performed using the genes in each co-expressed module and the gene sets from Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways [51], WikiPathways [52], and Gene Ontology [53], and we only considered the terms whose adjusted p-values were lower than 0.0001. The Python gene set enrichment analysis library GSEApy (version 0.10.4) was used.
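The statistics behind this kind of over-representation analysis can be sketched in plain Python, assuming a one-sided hypergeometric (Fisher exact) test per gene set followed by Benjamini-Hochberg adjustment. This is not the Enrichr or GSEApy implementation; the function names, the background size, and the toy numbers are assumptions for illustration.

```python
from math import comb

def enrichment_p_value(n_overlap, n_module, n_gene_set, n_background):
    """One-sided hypergeometric p-value: P(overlap >= n_overlap)."""
    upper = min(n_module, n_gene_set)
    total = comb(n_background, n_module)
    tail = sum(comb(n_gene_set, k) * comb(n_background - n_gene_set, n_module - k)
               for k in range(n_overlap, upper + 1))
    return tail / total

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):                 # from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy case: a 5-gene module fully contained in a 5-gene set, background of 20 genes.
p_full = enrichment_p_value(n_overlap=5, n_module=5, n_gene_set=5, n_background=20)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03])
significant = [q < 0.0001 for q in adjusted]     # threshold used in the text
```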

Supporting information

S1 Fig. The Kaplan-Meier survival analysis according to the dimensional size of latent features in the METABRIC dataset (the number of subgroups = 2).

(DOCX)

S2 Fig. The Kaplan-Meier survival analysis according to the number of clusters in the METABRIC dataset (64-dimensional latent features).

(DOCX)

S1 Table. The list of top 5,000 genes with the highest variability in the METABRIC dataset.

(XLSX)

S2 Table. The mean squared error (MSE) according to the size of hidden layers.

(XLSX)

S3 Table. The p-value of pairwise log-rank test according to the number of clusters (64-dimensional latent features).

(XLSX)

S4 Table. The subgroup data (BPS-LumA and WPS-LumA) of the 679 METABRIC luminal-A breast cancer samples.

(XLSX)

S5 Table. The subgroup data (BPS-LumA and WPS-LumA) of the 415 TCGA luminal-A breast cancer samples.

(XLSX)

S6 Table. The list of differentially expressed genes between the BPS-LumA and WPS-LumA in the METABRIC dataset.

(XLSX)

S7 Table. The list of differentially expressed genes between the BPS-LumA and WPS-LumA in the TCGA dataset.

(XLSX)

S8 Table. The results of co-expressed module preservation tests.

(XLSX)

S9 Table. The list of genes belonging to each co-expressed module.

(XLSX)

S10 Table. The proportion of genes overlapping with DEGs in each co-expressed module.

(XLSX)

S11 Table. The results of gene-set enrichment analysis of the co-expressed modules.

(XLSX)

S12 Table. The comparison with the previous luminal-A breast cancer stratification methods.

(XLSX)

Data Availability

There are no primary data in the paper; all data are available from the original publications of the databases (https://pubmed.ncbi.nlm.nih.gov/22522925/, https://pubmed.ncbi.nlm.nih.gov/23000897/), cBioPortal (https://www.cbioportal.org/), and GTEx (https://gtexportal.org/home). We have archived our code on GitHub at https://github.com/BISLshwang/ISLA.

Funding Statement

SW and DL are supported by the Ministry of Science and ICT through the National Research Foundation (NRF-2022M3A9B6017511). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Topol EJ. Individualized medicine from prewomb to tomb. Cell. 2014;157(1):241–53. doi: 10.1016/j.cell.2014.02.012; PubMed Central PMCID: PMC3995127.
  • 2. Schork NJ. Personalized medicine: Time for one-person trials. Nature. 2015;520(7549):609–11. doi: 10.1038/520609a.
  • 3. Sheridan DJ, Julian DG. Achievements and Limitations of Evidence-Based Medicine. J Am Coll Cardiol. 2016;68(2):204–13. doi: 10.1016/j.jacc.2016.03.600.
  • 4. Sung H, Ferlay J, Siegel RL, Laversanne M, Soerjomataram I, Jemal A, et al. Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries. CA Cancer J Clin. 2021;71(3):209–49. Epub 20210204. doi: 10.3322/caac.21660.
  • 5. Perou CM, Sorlie T, Eisen MB, van de Rijn M, Jeffrey SS, Rees CA, et al. Molecular portraits of human breast tumours. Nature. 2000;406(6797):747–52. doi: 10.1038/35021093.
  • 6. Zaha DC. Significance of immunohistochemistry in breast cancer. World J Clin Oncol. 2014;5(3):382–92. doi: 10.5306/wjco.v5.i3.382; PubMed Central PMCID: PMC4127609.
  • 7. Parker JS, Mullins M, Cheang MC, Leung S, Voduc D, Vickery T, et al. Supervised risk predictor of breast cancer based on intrinsic subtypes. J Clin Oncol. 2009;27(8):1160–7. Epub 20090209. doi: 10.1200/JCO.2008.18.1370; PubMed Central PMCID: PMC2667820.
  • 8. Harbeck N, Penault-Llorca F, Cortes J, Gnant M, Houssami N, Poortmans P, et al. Breast cancer. Nat Rev Dis Primers. 2019;5(1):66. Epub 20190923. doi: 10.1038/s41572-019-0111-2.
  • 9. Hernando C, Ortega-Morillo B, Tapia M, Moragon S, Martinez MT, Eroles P, et al. Oral Selective Estrogen Receptor Degraders (SERDs) as a Novel Breast Cancer Therapy: Present and Future from a Clinical Perspective. Int J Mol Sci. 2021;22(15). Epub 20210722. doi: 10.3390/ijms22157812; PubMed Central PMCID: PMC8345926.
  • 10. Higgins MJ, Stearns V. Understanding resistance to tamoxifen in hormone receptor-positive breast cancer. Clin Chem. 2009;55(8):1453–5. Epub 20090618. doi: 10.1373/clinchem.2009.125377.
  • 11. Ciriello G, Sinha R, Hoadley KA, Jacobsen AS, Reva B, Perou CM, et al. The molecular diversity of Luminal A breast tumors. Breast Cancer Res Treat. 2013;141(3):409–20. Epub 20131006. doi: 10.1007/s10549-013-2699-3; PubMed Central PMCID: PMC3824397.
  • 12. Netanely D, Avraham A, Ben-Baruch A, Evron E, Shamir R. Expression and methylation patterns partition luminal-A breast tumors into distinct prognostic subgroups. Breast Cancer Res. 2016;18(1):74. Epub 20160707. doi: 10.1186/s13058-016-0724-2; PubMed Central PMCID: PMC4936004.
  • 13. Sadanandam A, Lyssiotis CA, Homicsko K, Collisson EA, Gibb WJ, Wullschleger S, et al. A colorectal cancer classification system that associates cellular phenotype and responses to therapy. Nat Med. 2013;19(5):619–25. Epub 20130414. doi: 10.1038/nm.3175; PubMed Central PMCID: PMC3774607.
  • 14. Poudel P, Nyamundanda G, Patil Y, Cheang MCU, Sadanandam A. Heterocellular gene signatures reveal luminal-A breast cancer heterogeneity and differential therapeutic responses. NPJ Breast Cancer. 2019;5:21. Epub 20190802. doi: 10.1038/s41523-019-0116-8; PubMed Central PMCID: PMC6677833.
  • 15. Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A. 2001;98(9):5116–21. Epub 20010417. doi: 10.1073/pnas.091062498; PubMed Central PMCID: PMC33173.
  • 16. Esteva A, Robicquet A, Ramsundar B, Kuleshov V, DePristo M, Chou K, et al. A guide to deep learning in healthcare. Nat Med. 2019;25(1):24–9. Epub 20190107. doi: 10.1038/s41591-018-0316-z.
  • 17. Vincent P, Larochelle H, Bengio Y, Manzagol P-A, editors. Extracting and composing robust features with denoising autoencoders. Proceedings of the 25th international conference on Machine learning; 2008.
  • 18. Tan J, Ung M, Cheng C, Greene CS. Unsupervised feature construction and knowledge extraction from genome-wide assays of breast cancer with denoising autoencoders. Pac Symp Biocomput. 2015:132–43. PubMed Central PMCID: PMC4299935.
  • 19. Dwivedi SK, Tjarnberg A, Tegner J, Gustafsson M. Deriving disease modules from the compressed transcriptional space embedded in a deep autoencoder. Nat Commun. 2020;11(1):856. Epub 20200212. doi: 10.1038/s41467-020-14666-6; PubMed Central PMCID: PMC7016183.
  • 20. Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic acids research. 2015;43(7):e47. doi: 10.1093/nar/gkv007
  • 21. Zhang B, Horvath S. A general framework for weighted gene co-expression network analysis. Statistical applications in genetics and molecular biology. 2005;4(1). doi: 10.2202/1544-6115.1128
  • 22. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC bioinformatics. 2008;9(1):1–13.
  • 23. Jeong H, Mason SP, Barabási A-L, Oltvai ZN. Lethality and centrality in protein networks. Nature. 2001;411(6833):41–2. doi: 10.1038/35075138
  • 24. Smart AG, Amaral LA, Ottino JM. Cascading failure and robustness in metabolic networks. Proceedings of the National Academy of Sciences. 2008;105(36):13223–8. doi: 10.1073/pnas.0803571105
  • 25. Marbach D, Lamparter D, Quon G, Kellis M, Kutalik Z, Bergmann S. Tissue-specific regulatory circuits reveal variable modular perturbations across complex diseases. Nature methods. 2016;13(4):366–70. doi: 10.1038/nmeth.3799
  • 26. Kuleshov MV, Jones MR, Rouillard AD, Fernandez NF, Duan Q, Wang Z, et al. Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. Nucleic acids research. 2016;44(W1):W90–W7. doi: 10.1093/nar/gkw377
  • 27. Goudarzi KM, Lindström MS. Role of ribosomal protein mutations in tumor development. International journal of oncology. 2016;48(4):1313–24.
  • 28. Macias E, Jin A, Deisenroth C, Bhat K, Mao H, Lindström MS, et al. An ARF-independent c-MYC-activated tumor suppression pathway mediated by ribosomal protein-Mdm2 Interaction. Cancer cell. 2010;18(3):231–43. doi: 10.1016/j.ccr.2010.08.007
  • 29. Ebright RY, Lee S, Wittner BS, Niederhoffer KL, Nicholson BT, Bardia A, et al. Deregulation of ribosomal protein expression and translation promotes breast cancer metastasis. Science. 2020;367(6485):1468–73. doi: 10.1126/science.aay0939
  • 30. Tsoi H, You C-P, Leung M-H, Man EP, Khoo U-S. Targeting Ribosome Biogenesis to Combat Tamoxifen Resistance in ER+ve Breast Cancer. Cancers. 2022;14(5):1251. doi: 10.3390/cancers14051251
  • 31. Sotgia F, Fiorillo M, Lisanti MP. Mitochondrial markers predict recurrence, metastasis and tamoxifen-resistance in breast cancer patients: Early detection of treatment failure with companion diagnostics. Oncotarget. 2017;8(40):68730. doi: 10.18632/oncotarget.19612
  • 32. Fiorillo M, Sotgia F, Sisci D, Cappello AR, Lisanti MP. Mitochondrial “power” drives tamoxifen resistance: NQO1 and GCLC are new therapeutic targets in breast cancer. Oncotarget. 2017;8(12):20309. doi: 10.18632/oncotarget.15852
  • 33. Adadi A, Berrada M. Peeking inside the black-box: a survey on explainable artificial intelligence (XAI). IEEE access. 2018;6:52138–60.
  • 34. Linardatos P, Papastefanopoulos V, Kotsiantis S. Explainable ai: A review of machine learning interpretability methods. Entropy. 2020;23(1):18. doi: 10.3390/e23010018
  • 35. Saraswat D, Bhattacharya P, Verma A, Prasad VK, Tanwar S, Sharma G, et al. Explainable AI for healthcare 5.0: opportunities and challenges. IEEE Access. 2022.
  • 36. Loh HW, Ooi CP, Seoni S, Barua PD, Molinari F, Acharya UR. Application of explainable artificial intelligence for healthcare: A systematic review of the last decade (2011–2022). Computer Methods and Programs in Biomedicine. 2022:107161. doi: 10.1016/j.cmpb.2022.107161
  • 37. Menche J, Sharma A, Kitsak M, Ghiassian SD, Vidal M, Loscalzo J, et al. Uncovering disease-disease relationships through the incomplete interactome. Science. 2015;347(6224):1257601.
  • 38. Ghiassian SD, Menche J, Barabási A-L. A DIseAse MOdule Detection (DIAMOnD) algorithm derived from a systematic analysis of connectivity patterns of disease proteins in the human interactome. PLoS computational biology. 2015;11(4):e1004120. doi: 10.1371/journal.pcbi.1004120
  • 39. Prahallad A, Sun C, Huang S, Di Nicolantonio F, Salazar R, Zecchin D, et al. Unresponsiveness of colon cancer to BRAF (V600E) inhibition through feedback activation of EGFR. Nature. 2012;483(7387):100–3. doi: 10.1038/nature10868
  • 40. Broyde J, Simpson DR, Murray D, Paull EO, Chu BW, Tagore S, et al. Oncoprotein-specific molecular interaction maps (SigMaps) for cancer network analyses. Nature biotechnology. 2021;39(2):215–24. doi: 10.1038/s41587-020-0652-7
  • 41. Ma J, Yu MK, Fong S, Ono K, Sage E, Demchak B, et al. Using deep learning to model the hierarchical structure and function of a cell. Nature methods. 2018;15(4):290–8. doi: 10.1038/nmeth.4627
  • 42. Curtis C, Shah SP, Chin S-F, Turashvili G, Rueda OM, Dunning MJ, et al. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature. 2012;486(7403):346–52. doi: 10.1038/nature10983
  • 43. Pereira B, Chin S-F, Rueda OM, Vollan H-KM, Provenzano E, Bardwell HA, et al. The somatic mutation profiles of 2,433 breast cancers refine their genomic and transcriptomic landscapes. Nature communications. 2016;7(1):1–16.
  • 44. Cancer Genome Atlas Network. Comprehensive molecular portraits of human breast tumours. Nature. 2012;490(7418):61–70.
  • 45. Cerami E, Gao J, Dogrusoz U, Gross BE, Sumer SO, Aksoy BA, et al. The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data. Cancer discovery. 2012;2(5):401–4.
  • 46. Kingma DP, Ba J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. 2014.
  • 47. Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. Journal of the American statistical association. 1958;53(282):457–81.
  • 48. Bland JM, Altman DG. The logrank test. Bmj. 2004;328(7447):1073. doi: 10.1136/bmj.328.7447.1073
  • 49. Consortium G. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science. 2020;369(6509):1318–30. doi: 10.1126/science.aaz1776
  • 50. Chen EY, Tan CM, Kou Y, Duan Q, Wang Z, Meirelles GV, et al. Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool. BMC bioinformatics. 2013;14(1):1–14. doi: 10.1186/1471-2105-14-128
  • 51. Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic acids research. 2000;28(1):27–30. doi: 10.1093/nar/28.1.27
  • 52. Kelder T, Van Iersel MP, Hanspers K, Kutmon M, Conklin BR, Evelo CT, et al. WikiPathways: building research communities on biological pathways. Nucleic acids research. 2012;40(D1):D1301–D7. doi: 10.1093/nar/gkr1074
  • 53. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene ontology: tool for the unification of biology. Nature genetics. 2000;25(1):25–9.
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1011197.r001

Decision Letter 0

Mark Alber

28 Feb 2023

Dear Dr. Lee,

Thank you very much for submitting your manuscript "Identifying prognostic subgroups of luminal-A breast cancer using deep autoencoders and gene expressions" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

Both reviewers have serious concerns about the methods and results described in the paper.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Mark Alber, Ph.D.

Section Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors suggest an autoencoder model that stratifies subgroups of breast cancers that display different prognostic outcomes. They built an autoencoder model that takes gene expression data of breast cancer samples for feature extraction, then performed k-means clustering to identify subgroups, which indeed show different survival profiles in independent TCGA data as well as the METABRIC training data. They also compared the stratification performance against conventional models that directly use gene expression profiles as input with a varying number of genes. Further, they sought to explain the prognostic outcomes by overlaying differentially expressed genes between groups onto the co-expression network constructed from independent gene expression data from healthy samples. While it is a novel approach to delineating complicated biological samples, this work requires technical clarification and analysis.

Autoencoder is indeed widely used for feature extraction and is proven to be effective in many studies. However, the actual application here is with some concern. First, the authors used all detected genes as input. Authors argue that it minimizes manual filtering and intervention. But obviously, there should be many genes that display minimal variance, hence are uninformative. These genes will be more of noise rather than useful features. This issue is related to the way of re-normalization by min/max scaling. Even when a gene displays very minimal noisy variation, those variances will be amplified via min/max scaling and ultimately will have an equal contribution to the model as much as highly variable and informative genes, which will obscure successful learning of the model. Also related to this in Fig4, the p-value vividly decreases as the number of input genes decreases, which suggests a preliminary gene filtering can be beneficial. Therefore, I’d like to recommend trying a combined approach of preliminary gene filtering and autoencoder, i.e. low variance gene filtering followed by autoencoder training. Here, I’d recommend selecting highly variable genes after variance stabilization to avoid the preference of highly expressed genes. If successful, it will help us to achieve a more compact model.

On a second note, it is somewhat surprising to see that an autoencoder model trained using array data successfully classifies RNA-seq data. Is it because of the min-max normalization and highly correlated array vs RNA-seq data? Please discuss.

Also, the authors build a WGCNA model using seemingly all detected genes. Given the nature of WGCNA relying on correlation coefficient, it conveys a similar risk when involving low variance genes. How would the author justify this? What if similar preliminary gene filtering is applied?

In Fig 6

- Panel B and C are not quite informative. Rather I’d recommend putting 1) a bar plot module size, and 2) % of genes in each module overlapping with DEG

- Panels seem to be mislabeled in the legend for B and C. A appears twice, B once, and no C.

In line 413, the authors described that the hidden layer size is 2048. However, they tried all different numbers of hidden layers in the supplementary figures. Please correct if necessary.

Reviewer #2: In this paper, the authors proposed a data-driven method for identifying prognostic markers that can differentiate between different groups of breast cancer patients with different prognosis.

The main idea is to first train auto-encoders that can extract low-dimensional latent features that encode the gene expression data from breast cancer patients and then can be used to faithfully reconstruct the gene expression data.

The extracted features are then used to cluster breast cancer patients using k-means algorithm, and the authors demonstrate that the identified subgroups show very different prognosis.

While the proposed method and the presented results are reasonable, there are several major concerns in the manuscript that need to be addressed.

These concerns are outlined below.

1. While the use of autoencoder to extract potential latent features for stratifying breast cancer patients is reasonable, the proposed method lacks novelty.

The authors mainly train a standard autoencoder with different latent dimensions using breast cancer gene expression data, which is technically not novel.

Except for varying the dimension size, no other architecture/hyperparameter optimization is performed to customize the model for the task and enhance the stratification results.

2. The authors show that using the latent dimension of 2048 led to the best (most significant results).

But considering that this latent dimension size was the largest among the test dimension sizes, and as this dimension size is comparable to the dimension of the original gene space, it is highly likely that the model is overfitted to the data (especially, since the sample size of the gene expression data was relatively small compared to the latent dimension).

3. The latent features do not seem to have any biological significance. For example, while there has been recent work on constraining the latent space based on pathway info - in which case the latent features may potentially reflect underlying pathway activities, in this current work it is difficult to see whether the 2,048 latent features meaningfully reflect the underlying biological processes/pathways that may be associated with different prognosis of subgroups.

Although the authors perform some analysis by analyzing the differentially expressed genes between the subgroups that are identified by the proposed method, such analysis defeats the purpose of using an auto-encoder for latent feature extraction in the first place.

For example, we could have simply started with DEG analysis without using autoencoders to detect gene markers that can be used to stratify subgroups and then analyze the identified DEGs to understand what might be leading to this difference.

Why should one first use autoencoders to extract latent features (that cannot be easily interpreted), and then resort to DEG analysis for interpretation?

4. Figure 2 and Figure 3 show PCA plots for the latent feature vectors, but how does it compare to using PCA directly to gene expression data?

Why not use PCA to extract important principal components that reflect the main gene expression patterns, use these principal components for clustering patients and stratifying them?

Does the use of autoencoder significantly improve the results?

Such comparison would be critical in demonstrating the effectiveness of the proposed scheme.

5. Furthermore, there have been pathway based methods and network analysis based methods for extracting gene modules that may be used as diagnostic and/or prognostic markers.

How does the autoencoder-based, data-driven scheme for extracting latent features compare with more traditional methods that define/predict modular markers based on pathway or network information?

Is there clear evidence that, even without using such additional information (i.e., pathway or network information), the autoencoder can extract better prognostic signatures that are predictive of patients' survival?

6. The authors claim that "our method and previous methods can be complementary to each other for a more precise stratification for luminal-A breast cancer". However, this conclusion is drawn simply from the fact that there are differences/discrepancies between the methods.

To justify such a statement, the authors should actually show "how" the methods could be combined to improve the stratification results.

7. (minor comment) The literary presentation of the paper should be improved, as the current manuscript includes a large number of grammatical and typographical errors.

8. (minor comment) In line 198, the authors mention that "the two datasets used the different sequencing platform", but note that a microarray is not a "sequencing platform".

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, log in and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1011197.r003

Decision Letter 1

Mark Alber

30 Apr 2023

Dear Dr. Lee,

Thank you very much for submitting your manuscript "Identifying prognostic subgroups of luminal-A breast cancer using deep autoencoders and gene expressions" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Mark Alber, Ph.D.

Section Editor

PLOS Computational Biology

***********************

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately:

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Authors addressed all the concerns.

One minor comment:

The authors changed the presentation of the dimension reduction in the main figures. What was the basis of this change?

Reviewer #2: I would like to thank the authors for their careful revision.

Most of my concerns regarding the initial version of this manuscript have been sufficiently addressed in the revised version, and I believe the main advantages of the proposed approach are conveyed in a clearer and a more convincing manner.

For example, the updated analysis based on a lower-dimensional latent representation addresses the concern regarding overfitting, and the updated analysis results and discussions are therefore more convincing.

Direct comparison between the proposed autoencoder-based scheme and the traditional PCA based dimensionality reduction is also very helpful, and it clearly demonstrates how using deep network models for dimensionality reduction can identify useful latent features that can more accurately cluster and stratify different patient groups with different prognosis, which may be difficult based on traditional schemes like the PCA.

Regarding my previous comment concerning the comparison against existing methods that use prior knowledge (e.g., pathways or PPI networks): while I understand the authors' position, I nevertheless think that it would be meaningful to include at least some discussion comparing the pros and cons of the respective approaches.

Finally, the authors have revised a number of places in the manuscript to convey the novel contributions of this work more clearly. These revisions are spread over the paper, and it may be a good idea to summarize the main novel contributions at the outset (e.g., in the last paragraph of the introduction section).

For example, the authors may want to state clearly that, while they are not proposing a novel methodology, the application of an autoencoder to luminal-A breast cancer to extract biologically meaningful low-dimensional latent features is novel and may detect features that cannot be detected using other traditional techniques.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

Reviewer #1: Yes

Reviewer #2: Yes

**********

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

References:

Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript.

If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1011197.r005

Decision Letter 2

Mark Alber

18 May 2023

Dear Dr. Lee,

We are pleased to inform you that your manuscript 'Identifying prognostic subgroups of luminal-A breast cancer using deep autoencoders and gene expressions' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Mark Alber, Ph.D.

Section Editor

PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Acceptable for publication.

Reviewer #2: Thank you for the revision.

The previous revision was fairly comprehensive and it already addressed all major concerns I had regarding the original manuscript.

The remaining concerns were mostly minor, and the current revision has addressed them in a satisfactory manner.

I do not have any further suggestions.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

Reviewer #1: Yes

Reviewer #2: None

**********

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1011197.r006

Acceptance letter

Mark Alber

25 May 2023

PCOMPBIOL-D-23-00002R2

Identifying prognostic subgroups of luminal-A breast cancer using deep autoencoders and gene expressions

Dear Dr Lee,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Anita Estes

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. The Kaplan-Meier survival analysis according to the dimensional size of latent features in the METABRIC dataset (the number of subgroups = 2).

    (DOCX)

    S2 Fig. The Kaplan-Meier survival analysis according to the number of clusters in the METABRIC dataset (64-dimensional latent features).

    (DOCX)

    S1 Table. The list of top 5,000 genes with the highest variability in the METABRIC dataset.

    (XLSX)

    S2 Table. The mean squared error (MSE) according to the size of hidden layers.

    (XLSX)

    S3 Table. The p-value of pairwise log-rank test according to the number of clusters (64-dimensional latent features).

    (XLSX)

    S4 Table. The subgroup data (BPS-LumA and WPS-LumA) of the 679 METABRIC luminal-A breast cancer samples.

    (XLSX)

    S5 Table. The subgroup data (BPS-LumA and WPS-LumA) of the 415 TCGA luminal-A breast cancer samples.

    (XLSX)

    S6 Table. The list of differentially expressed genes between the BPS-LumA and WPS-LumA in the METABRIC dataset.

    (XLSX)

    S7 Table. The list of differentially expressed genes between the BPS-LumA and WPS-LumA in the TCGA dataset.

    (XLSX)

    S8 Table. The results of co-expressed module preservation tests.

    (XLSX)

    S9 Table. The list of genes belonging to each co-expressed module.

    (XLSX)

    S10 Table. The proportion of genes overlapping with DEGs in each co-expressed module.

    (XLSX)

    S11 Table. The results of gene-set enrichment analysis of the co-expressed modules.

    (XLSX)

    S12 Table. The comparison with the previous luminal-A breast cancer stratification methods.

    (XLSX)

    Attachment

    Submitted filename: response_letter.pdf

    Attachment

    Submitted filename: response_letter.docx

    Data Availability Statement

    There are no primary data in the paper; all data are available on the original publications of databases (https://pubmed.ncbi.nlm.nih.gov/22522925/, https://pubmed.ncbi.nlm.nih.gov/23000897/), cBioPortal (https://www.cbioportal.org/), and GTEx (https://gtexportal.org/home). We have archived our code on GitHub at https://github.com/BISLshwang/ISLA.


    Articles from PLOS Computational Biology are provided here courtesy of PLOS
