Summary
Here, we present a protocol for analyzing the global metabolic landscape in breast tumors for the purpose of metabolism-based patient stratification. We describe steps for analyzing 1,454 metabolic genes representing 90 metabolic pathways and subjecting them to an algorithm that calculates the deregulation score of 90 pathways in each tumor sample, thus converting gene-level information into pathway-level information. We then detail procedures for performing clustering analysis to identify metabolic subtypes and using machine learning to develop a signature representing each subtype.
For complete details on the use and execution of this protocol, please refer to Iqbal et al.1
Subject areas: cancer, metabolism, systems biology
Highlights
• Steps for converting gene-level information to pathway-level information
• Instructions for performing a 2-fold clustering analysis for subtype identification
• Steps for identifying subtype gene signatures using machine learning
• Steps for testing the prediction power of signatures in independent datasets
Publisher’s note: Undertaking any experimental protocol requires adherence to local institutional guidelines for laboratory safety and ethics.
Before you begin
Tumor heterogeneity is a major challenge in cancer therapeutics.2,3 It is therefore crucial to systematically stratify patients based on biological or clinical parameters (metabolism, in our case) to characterize the heterogeneity that exists in the context of the chosen parameter. The overall goal is to identify clinically and therapeutically relevant subtypes for improved precision medicine.
We used metabolism as the biological criterion to classify human breast tumors into subtypes. A total of 1,454 genes were selected to represent metabolism.4 These genes in turn represent 90 metabolic pathways.5 Using the Pathifier algorithm,6 we converted gene-level information into pathway-level information and used the 90 pathways to classify breast tumors (Figure 1). Pathifier assigns a deregulation score to each pathway based on how far a tumor sample deviates from its normal counterpart. Deregulation scores of the 90 pathways are calculated for each sample, and the resulting matrix (all samples) is used for tumor classification with NbClust7 and consensus clustering, which together determine the number of stable clusters and the samples belonging to each cluster (Figure 2).
Figure 1.
Conversion of gene level information to pathway level information
Gene expression and pathway gene sets are fed into Pathifier for calculation of pathway deregulation scores (PDS).
Figure 2.
Pipeline for identification of metabolic subtypes
PDS scores were used to perform two-fold clustering analysis using consensus clustering and NbClust to identify a robust number of metabolic clusters.
Machine learning (ML) is used to identify gene signatures associated with each cluster and their robustness is tested in independent cohorts. These ML-generated gene signatures were then used to predict metabolic subtypes of patient samples in an independent cohort and to identify representative cell lines of metabolic subtypes.
The pipeline described here can be used to stratify tumors based on any biological process. The details of required software and input files to successfully execute the pipeline are provided below.
Input files
Timing: 30 min
1. Prepare the following input files:
   a. Pre-processed, normalized gene expression data of breast tumor samples from METABRIC, split into discovery and validation datasets.
   b. Ninety genesets representing 90 metabolic pathways. Genesets are available for download from this link.
   c. A file containing expression values of all genes in the 90 genesets.
CRITICAL: Expression data from normal (non-tumor) samples is essential for Pathifier.
Pathifier
Timing: 20 min
Clustering analysis
Timing: 10 min
5. The Pathifier output is subjected to NbClust to predict the number of clusters. The NbClust R package is available from CRAN (see key resources table).
6. The Pathifier output is also subjected to consensus clustering, using the GenePattern online tool, to determine the number of clusters.
Machine learning
Timing: 10 min
7. Machine learning is implemented in Python in the Google Colab environment. Using a terminal or command line, clone the project’s GitHub repository.
> cd ./project_root_folder
> git clone https://github.com/kirksmi/BreastCancerClustering.git
8. We recommend moving the downloaded files to Google Drive, so that Google Colab can be used to run the Python notebook that contains the machine-learning code. Otherwise, the code can be run locally using Jupyter Notebook.
Key resources table
| REAGENT or RESOURCE | SOURCE | IDENTIFIER |
|---|---|---|
| Deposited data | ||
| METABRIC microarray data | European Genome-phenome Archive | https://ega-archive.org accession numbers: EGAD00010000210, EGAD00010000211, and EGAD00010000212 |
| TCGA mRNA-seq and clinical data | UCSC Xena | http://xena.ucsc.edu |
| Code availability | This protocol | https://github.com/kirksmi/BreastCancerClustering |
| Breast cancer cell line expression data | ArrayExpress E-MTAB-181 | https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-181 |
| Software and algorithms | ||
| R software | R project | https://www.r-project.org/ |
| Consensus clustering | GenePattern version 2.0, GSEA | http://www.broadinstitute.org/gsea/ |
| Morpheus | Broad Institute | https://software.broadinstitute.org/morpheus/ |
| Pathifier | Bioconductor package | https://www.bioconductor.org/packages/release/bioc/html/pathifier.html |
| NbClust | CRAN | https://cran.r-project.org/web/packages/NbClust/index.html |
| Python version | Python Software Foundation | https://www.python.org |
| SHAP | Python | https://shap.readthedocs.io/en/latest/ |
| PAMR | CRAN | https://cran.r-project.org/web//packages/pamr/pamr.pdf |
Step-by-step method details
Running Pathifier
Timing: 3 h
This step is required to convert gene-level information (expression values) to pathway level information (scores). Essentially, this step is about executing Pathifier to calculate pathway deregulation scores (PDS) of 90 metabolic pathways, represented by 90 genesets (see Figure 1).
1. Load the following input files:
   a. Preprocessed, normalized expression data file (metabolic genes in the current example).
   b. Genesets file indicating which genes belong to which pathways (90 genesets for 90 pathways).
   c. Load these files into R using RStudio and run the code provided below.
Note: Remove genes with missing expression values.
> if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
> BiocManager::install("pathifier")
> library(pathifier)
> exp.matrix <- read.delim(file = file.choose(), as.is = T, row.names = 1)
2. Load pathway genesets.
> gene_sets <- as.matrix(read.delim(file = file.choose(), header = F, sep = "\t", as.is = T))
3. Generate a list of genes contained in genesets.
> gs <- list()
> for (i in 1:nrow(gene_sets)){
    a <- as.vector(gene_sets[i,3:ncol(gene_sets)])
    a <- na.omit(a)
    a <- a[a != ""]
    a <- matrix(a, ncol = 1)
    gs[[length(gs)+1]] <- a
    rm(a,i)
  }
4. Generate a list that contains the names of the genesets used.
> pathwaynames <- as.list(gene_sets[,1])
5. Generate a list that contains gene sets and their names.
> PATHWAYS <- list();
> PATHWAYS$gs <- gs;
> PATHWAYS$pathwaynames <- pathwaynames
6. Extract the binary phenotype information from the first row of the expression file, in which 1 = Normal and 0 = Tumor.
> normals <- as.vector(as.logical(exp.matrix[1,]))
> exp.matrix <- as.matrix(exp.matrix[-1, ])
7. Calculate minimum standard deviation.
> N.exp.matrix <- exp.matrix[,as.logical(normals)]
> rsd <- apply(N.exp.matrix, 1, sd)
> min_std <- quantile(rsd, 0.25)
8. Calculate minimum expression.
> min_exp <- quantile(as.vector(exp.matrix), 0.1)
9. Filter low-value genes: keep genes for which more than 10% of samples have values over min_exp, then set any remaining values below min_exp to min_exp.
> over <- apply(exp.matrix, 1, function(x) x > min_exp)
> G.over <- apply(over, 2, mean)
> G.over <- names(G.over)[G.over > 0.1]
> exp.matrix <- exp.matrix[G.over,]
> exp.matrix[exp.matrix < min_exp] <- min_exp
10. Set N to the number of genes in your corresponding expression file (N must be defined before running this code). Genes are ranked by variance and the top N are kept.
> V <- names(sort(apply(exp.matrix, 1, var), decreasing = T))[1:N]
> V <- V[!is.na(V)]
> exp.matrix <- exp.matrix[V,]
> genes <- rownames(exp.matrix) # Checking genes
> allgenes <- as.vector(rownames(exp.matrix))
11. Generate a list that contains: gene expression, normal status, and name of genes.
> DATASET <- list();
> DATASET$allgenes <- allgenes;
> DATASET$normals <- normals
> DATASET$data <- exp.matrix
12. Run Pathifier and save the PDS scores.
> PDS <- quantify_pathways_deregulation(DATASET$data,
    DATASET$allgenes, PATHWAYS$gs,
    PATHWAYS$pathwaynames,
    DATASET$normals, min_std = min_std,
    min_exp = min_exp)
> write.table(PDS$scores, file = "PDS_Result.txt", quote = FALSE, sep = "\t",
    row.names = FALSE, col.names = TRUE)
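The filtering logic in steps 7 to 10 can be sketched in plain Python to make the thresholds explicit. This is a minimal illustration on toy data and is not part of the Pathifier package: the helper names `quantile` and `filter_expression`, the dictionary layout, and the toy values are assumptions made for this example. The quantile helper uses linear interpolation, which matches the default behaviour of R's `quantile()`.

```python
from statistics import stdev


def quantile(values, q):
    """Quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def filter_expression(expr, normals, min_fraction=0.1):
    """expr: dict gene -> list of expression values (one per sample).
    normals: list of booleans, True where the sample is normal tissue."""
    # min_std: lower quartile of the per-gene standard deviation in normal samples
    normal_sd = [
        stdev([v for v, is_normal in zip(vals, normals) if is_normal])
        for vals in expr.values()
    ]
    min_std = quantile(normal_sd, 0.25)

    # min_exp: 10th percentile of all expression values
    all_values = [v for vals in expr.values() for v in vals]
    min_exp = quantile(all_values, 0.1)

    # keep genes with more than min_fraction of samples above min_exp,
    # and raise any remaining value below min_exp up to min_exp
    filtered = {}
    for gene, vals in expr.items():
        fraction_over = sum(v > min_exp for v in vals) / len(vals)
        if fraction_over > min_fraction:
            filtered[gene] = [max(v, min_exp) for v in vals]
    return filtered, min_std, min_exp


# hypothetical toy data: 3 genes x 4 samples, first two samples are normal
toy_expr = {
    "GENE_A": [5.0, 6.0, 7.0, 8.0],
    "GENE_B": [0.0, 0.0, 0.0, 0.0],
    "GENE_C": [1.0, 9.0, 2.0, 8.0],
}
toy_normals = [True, True, False, False]
filtered, min_std, min_exp = filter_expression(toy_expr, toy_normals)
print(sorted(filtered), min_std, min_exp)
```

In this toy case the gene that is never expressed above `min_exp` is dropped, while the two informative genes are retained.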
Clustering analysis: NbClust, consensus and k-means clustering
Timing: 2 h
This step employs the NbClust algorithm in R to calculate the number of clusters. NbClust uses 30 different indices to assess the number of clusters. Usually, the cluster number supported by the largest number of indices is selected (see Figure 2).
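The selection rule described above amounts to a majority vote across indices. The sketch below illustrates that rule in Python, assuming the best cluster number proposed by each index has already been collected into a dictionary. The function name `majority_rule`, the tie-breaking choice (smallest cluster number wins), and the vote values are hypothetical and are not NbClust output.

```python
from collections import Counter


def majority_rule(best_k_per_index):
    """Return the cluster number proposed by the most indices, plus the vote counts."""
    votes = Counter(best_k_per_index.values())
    top = max(votes.values())
    # assumption: ties are resolved in favour of the smallest cluster number
    best_k = min(k for k, n in votes.items() if n == top)
    return best_k, dict(votes)


# hypothetical proposals from five indices
toy_votes = {"silhouette": 3, "ch": 3, "db": 2, "dunn": 3, "gap": 4}
best_k, vote_counts = majority_rule(toy_votes)
print(best_k, vote_counts)
```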
13. Perform t-distributed stochastic neighbor embedding (t-SNE) using the Broad Morpheus online software to reduce the data to two dimensions. This yields two vectors, T1 and T2, for each sample.
   a. Upload the calculated PDS scores to Broad Morpheus and execute the t-SNE function (available under the Tools section).
   b. Use the calculated T1 and T2 values as input for the NbClust program. The NbClust output indicates the best number of clusters.
14. Load the input file with T1 and T2 values and run the following code:
> library(NbClust)
> D <- read.delim(file = "ALL_06 FEB.txt")
> D1 <- data.matrix(D)
> res <- NbClust(D1, diss = NULL, distance = "euclidean",
    min.nc = 2, max.nc = 10, method = "kmeans",
    index = "all")
15. Perform consensus clustering using the GenePattern online portal. The purpose of this analysis is to confirm the robustness of the number of clusters identified using NbClust.
   a. Prepare input files according to the instructions provided on GenePattern. The input file of PDS scores is used in this case.
   b. Clustering parameters may be selected as per the user’s choice. We set kmax to 10, resampling iterations to 1000, the clustering algorithm to k-means, the distance measure to Euclidean, and create heat map to yes. All other parameters were left at their default values.
   c. Select a stable cluster number based on the heatmap output.
   d. Details of the samples belonging to each cluster are provided in the .clu file in the generated output folder.
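For readers unfamiliar with consensus clustering, the idea behind the heatmap can be sketched in a few lines of Python: the samples are clustered repeatedly on resampled subsets, and for each pair of samples the fraction of runs in which they share a cluster is recorded. The sketch below assumes the per-run cluster assignments are already available; the function name `consensus_matrix` and the toy runs are hypothetical, and GenePattern's own implementation may differ in its details.

```python
def consensus_matrix(n_samples, runs):
    """runs: list of dicts mapping sample index -> cluster label,
    one dict per resampling run (only the samples drawn in that run)."""
    together = [[0] * n_samples for _ in range(n_samples)]
    sampled = [[0] * n_samples for _ in range(n_samples)]
    for labels in runs:
        for i, label_i in labels.items():
            for j, label_j in labels.items():
                sampled[i][j] += 1          # pair was drawn in the same run
                if label_i == label_j:
                    together[i][j] += 1     # pair also shared a cluster
    return [
        [together[i][j] / sampled[i][j] if sampled[i][j] else 0.0
         for j in range(n_samples)]
        for i in range(n_samples)
    ]


# hypothetical assignments from three resampling runs over four samples
toy_runs = [
    {0: "a", 1: "a", 2: "b"},
    {0: "x", 1: "x", 3: "y"},
    {1: "p", 2: "q", 3: "q"},
]
consensus = consensus_matrix(4, toy_runs)
for row in consensus:
    print(row)
```

Values close to 1 or 0 for most pairs indicate a stable partition, which is what a clean block structure in the consensus heatmap reflects.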
Machine learning in Python
Timing: 1 h
This section details the training and testing of various supervised machine-learning models. In addition to demonstrating that we can accurately classify breast cancer samples into the appropriate metabolic cluster using gene expression data, we also investigate the most important features of these models.
16. Load the data and import the required Python libraries by running the first few code blocks of the machine_learning_notebook.ipynb file in Google Colab.
   a. After uploading the project files from GitHub to Google Drive, open the machine_learning_notebook.ipynb file in Google Colab.
   b. Run the first code block, which mounts Google Drive and imports the necessary Python libraries.
# mount Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
# install shap (not included in Google Colab)
!pip install shap==0.37.0
# change working directory to project folder
import sys
import os
root_dir = "/content/drive/MyDrive/"
project_folder = "my_project_folder"
os.chdir(root_dir + project_folder)
# import libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn import metrics
import openpyxl
from sklearn import preprocessing
import os
import shap
import seaborn as sns
from collections import Counter
from sklearn.preprocessing import StandardScaler
import importlib
import pickle
import breastCancerFunctions
np.random.seed(123)
Note: the folder names will have to be adjusted to match the Google Drive location where your files were uploaded.
17. Load and format the breast cancer gene expression datasets.
df_discovery = pd.read_csv("./data/discovery_geneExpr_metabolicOnly.csv")
df_validation = pd.read_csv("./data/validation_geneExpr_metabolicOnly.csv")
my_genes = df_discovery.select_dtypes(include='number').columns.tolist()
class_names = ["M1","M2","M3"]
num_class = len(class_names)
# numerically encode cluster labels
le = preprocessing.LabelEncoder()
y_disc = pd.Series(le.fit_transform(df_discovery['Metabolic cluster']))
# create folder to save figures
figure_path = "./my_figures/"
try:
    os.mkdir(figure_path)
except OSError:
    print("Creation of the directory %s failed" % figure_path)
else:
    print("Successfully created the directory %s " % figure_path)
Note: Update the figure path variable to the folder where you would like output figures from the notebook to be saved.
18. Test various machine-learning algorithms on the METABRIC Discovery dataset to determine which performs best at classifying breast cancer samples by metabolic cluster. The testAlgorgithms function runs cross-validation on the dataset, which is helpful for determining a model’s robustness.
test_scores = breastCancerFunctions.testAlgorgithms(X = df_discovery[my_genes], y = y_disc, condition = "tutorial", path = figure_path)
   a. The cross-validation scores are returned by the function. In addition, a figure comparing the F1 score for each algorithm is saved in the figure path directory (Figure 3). The F1 score is a useful scoring parameter for classification problems that measures a balance between recall and precision.
Note: For the data used in the main paper, ridge regression was the best performing algorithm, so it was used for the ensuing analyses. Ridge regression is a form of logistic regression which introduces a regularization term to the cost function, thereby pushing non-contributing coefficients towards zero and preventing overfitting.
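To make the F1 score concrete, the sketch below computes it per class and then averages across classes (macro averaging) for a three-class problem. The function names, the toy labels, and the choice of macro averaging are assumptions for illustration; the averaging scheme used inside the project's notebook may differ.

```python
def f1_per_class(y_true, y_pred, labels):
    """F1 = 2 * precision * recall / (precision + recall), computed one class at a time."""
    scores = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        scores[label] = 2 * precision * recall / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores."""
    scores = f1_per_class(y_true, y_pred, labels)
    return sum(scores.values()) / len(scores)


# hypothetical true and predicted cluster labels for six samples
toy_true = ["M1", "M1", "M2", "M2", "M3", "M3"]
toy_pred = ["M1", "M2", "M2", "M2", "M3", "M1"]
per_class = f1_per_class(toy_true, toy_pred, ["M1", "M2", "M3"])
overall = macro_f1(toy_true, toy_pred, ["M1", "M2", "M3"])
print(per_class, overall)
```

Because each class contributes equally to the macro average, a poorly predicted small cluster lowers the score as much as a poorly predicted large one.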
19. Train a final model on the Discovery dataset using the tuneModels function. Unlike in the previous step, we will now tune the model’s hyperparameters, which for ridge regression include a penalization term, alpha.
Note: The output of this step is a trained machine-learning model, which uses gene expression data to predict the metabolic cluster (M1, M2, M3) for each sample. Additionally, a confusion matrix and bar plot figures will be generated to summarize the model’s cross-validation performance (Figure 4). For a detailed explanation of the function’s inputs, run help(breastCancerFunctions.tuneModels).
ridge_mdl = breastCancerFunctions.tuneModels(
X = df_discovery[my_genes],
y = df_discovery['Metabolic cluster'],
model_name = "Ridge",
path=tune_mdls_path,
class_names = class_names,
scale = "none",
tune=True,
weight=True)
20. Validate the tuned model on external datasets using the modelValidation function. Model outputs include a confusion matrix showing the model’s performance on the validation dataset(s), as well as an ROC plot (Figure 5).
Note: The function allows the user to input a list of dataframes for X_vals and y_vals. This allows for testing the model on multiple datasets at once. The datasets should be named using the val_IDs input.
# specify the data used for the model's training
X_train = df_discovery[my_genes]
y_train = pd.Series(le.fit_transform(df_discovery['Metabolic cluster']))
# run modelValidation
ridge_validate = breastCancerFunctions.modelValidation(ridge_mdl,
X_vals=df_validation[my_genes],
y_vals=df_validation['Metabolic cluster'],
val_IDs="Validation",
condition="Ridge",
class_names=class_names,
path=path+"/my_validation",
scale="none",
X_train = X_train, y_train = y_train)
21. Using the SHAP package, explain the model’s predictions on the validation dataset by generating Shapley values.8
from copy import deepcopy  # needed for deepcopy() below
X_train = df_discovery[my_genes]
X_test = df_validation[my_genes]
clf = deepcopy(ridge_mdl)
explainer = shap.LinearExplainer(clf, X_train)
shap_explainer = explainer(X_test)
shap_values = explainer.shap_values(X_test)
22. Now plot the Shapley values using the shap_plot function. The required inputs include the Shapley values, the test dataset, the names of the classes (e.g., M1, M2, M3), and the path to save the Shapley plots (Figure 6).
Note: Because our classification problem is multi-class (more than two possible classes), we have to generate a separate plot for each class. For example, the “M1” plot will show how features pushed the model towards a prediction of the M1 class (positive SHAP value) or towards a prediction of another class (negative SHAP value).
breastCancerFunctions.shap_plot(shap_values=shap_values,
X_test=X_test,
class_names=class_names,
shap_path="my_shap_figures")
23. Lastly, make predictions on an external dataset, such as breast cancer cell lines. The first step is to preprocess the data.
# load the cell line gene expression dataset and transpose so columns are genes
df_cl = pd.read_csv("../data/cellLine_geneExpression.csv")
df_CL = df_cl.set_index('Cell line').T
# filter out genes not in our model
df_CL = df_CL.loc[:,my_genes]
24.
Use the ML model to predict cluster labels for each cell line.
y_pred = model.predict(X_test)
cluster_labels = np.where(y_pred==0,"M1",
np.where(y_pred==1, "M2", "M3"))
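The nested np.where call in step 24 hard-codes the mapping from integer codes to the three cluster names. A more general way to express the same decoding is sketched below in plain Python. It relies on the fact that scikit-learn's LabelEncoder assigns integer codes in sorted order of the class labels; the function name `decode_labels` and the toy predictions are hypothetical.

```python
from collections import Counter


def decode_labels(encoded, class_names):
    """Map integer class codes back to their names.
    Assumes codes were assigned in sorted label order, as LabelEncoder does."""
    ordered = sorted(class_names)
    return [ordered[int(code)] for code in encoded]


# hypothetical encoded predictions for four cell lines
toy_pred = [0, 2, 1, 0]
toy_labels = decode_labels(toy_pred, ["M1", "M2", "M3"])
subtype_counts = Counter(toy_labels)  # number of cell lines assigned to each subtype
print(toy_labels, dict(subtype_counts))
```

When the fitted encoder is still in memory, calling its `inverse_transform` method on the predictions achieves the same result.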
Figure 3.
The breastCancerFunctions.testAlgorithms function tests a variety of machine-learning classification algorithms
The user can use these results to guide their decision on which algorithm to move forward with. Cross-validation is used to train and test the models, while providing an estimate of the model’s classification capabilities.
Figure 4.
The breastCancerFunctions.tuneModels function introduces hyperparameter tuning and its intended purpose is to train the final classification model
The outputs include a confusion matrix (A) and bar chart summarizing the cross-validation results from the model’s training (B).
Figure 5.
The breastCancerFunctions.modelValidation function makes predictions on test data, in which the true class labels are known
The purpose of this function is to test the model’s performance on unseen data. Outputs include a confusion matrix (A) summarizing the model’s performance, as well as an ROC curve plot (B).
Figure 6.
The breastCancerFunctions.shap_plot function creates a Shapley summary plot for each class
(A) The breastCancerFunctions.shap_plot function creates a Shapley summary plot for each class
(B) Each dot on the Shapley plot is a feature value for a given sample. The color of the dot represents the relative value (blue = low, red = high), and the Shapley value indicates whether the feature pushed the model’s prediction towards that class (positive Shapley value) or towards another class (negative Shapley value).
Machine-learning by PAMR algorithm
Timing: 1 h
25. Prepare two input files for PAMR:
   a. Training expression data.
   b. Label file as a vector.
26. Load both files and train the model to obtain the threshold value at the minimum cross-validation error.
library(pamr)
data1 <- as.matrix(read.table("TCGA (SCALED & CENTER = TRUE).txt",
    sep = "\t"))
vecto <- scan("TCGA_LABEL.txt", character(), quote = "")
mydata <- list(x = data1, y = vecto)
mytrain <- pamr.train(mydata)
mycv <- pamr.cv(mytrain, mydata)
pamr.plotcv(mycv)
pamr.plotcen(mytrain, mydata, threshold = 0.69)
27. Predict labels using PAMR. Test expression data is required for this step; cell line expression data is used as the test data here.
newx1 <- as.matrix(read.table("Cell Line (SCALED & CENTER = TRUE).txt", sep = "\t"))
prediction_result <- pamr.predict(mytrain, newx1, threshold = 1.6)
write.table(prediction_result, file = "RESULT.txt", quote = FALSE,
    sep = "\t", row.names = FALSE, col.names = TRUE)
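Step 26 selects the shrinkage threshold at the minimum cross-validation error, which is typically read from the pamr.plotcv plot. The selection rule can be sketched as follows, assuming the thresholds and their cross-validated error rates have been copied into two Python lists. The function name `pick_threshold`, the tie-breaking rule (prefer the largest threshold, which corresponds to the most shrinkage), and the numbers are illustrative assumptions and are not output from pamr.

```python
def pick_threshold(thresholds, errors):
    """Return the threshold with the lowest cross-validation error.
    Assumption: among tied thresholds, the largest one is preferred."""
    if len(thresholds) != len(errors):
        raise ValueError("thresholds and errors must have the same length")
    best_error = min(errors)
    return max(t for t, e in zip(thresholds, errors) if e == best_error)


# hypothetical cross-validation results
toy_thresholds = [0.0, 0.5, 1.0, 1.5, 2.0]
toy_errors = [0.12, 0.10, 0.10, 0.15, 0.30]
chosen = pick_threshold(toy_thresholds, toy_errors)
print(chosen)
```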
Expected outcomes
This analysis pipeline enables the user to characterize the metabolic landscape of human breast tumors. Furthermore, the clustering analysis provides the stratification of breast tumors based on their metabolism, thus leading to identification of metabolic subtypes. The machine learning analysis demonstrated how to identify signatures associated with each metabolic subtype and test its robustness in independent datasets. The identification of metabolic signature is important for assigning a random breast tumor sample to one of the metabolic subtype categories. By doing so, one can predict the characteristics of an unknown sample based on its assignment to a subtype. Besides, ML analysis helps in identification of representative cell lines, and this helps in choosing correct cell lines to study a particular subtype.
Importantly, this pipeline can be applied to study the behavior of any biological process in tumors beyond metabolism, like apoptosis, cell cycle, signaling, or immunity. This is important from the systems perspective as this pipeline enables users to get an overall picture of any biological process in tumors and identify if any heterogeneity exists in tumors.
Overall, the described protocol is important from the standpoint of precision medicine as it enables identification and characterization of tumor subtypes, predicting subtype-representative cell lines, and designing validation experiments to study a particular subtype in appropriate cell lines.
Limitations
One limitation of this pipeline is that gene-expression data exhibits platform variation, for example, microarray vs. RNA-seq.9 This could affect reproducibility when extending the analysis across datasets. Batch-effect variation in some datasets, such as TCGA, may also hamper reproducibility. Further, a metabolomics dataset comparable to METABRIC should be investigated for validation of subtypes.
Regarding the ML portion of the protocol, the potential limitations to be aware of are sample size and class balance. The ML model may not perform well in terms of predicting the cluster labels if the training data is not adequate in size. Similarly, if a metabolic cluster is identified by PAMR but for a very small number of tumor samples, it’s possible the model will have difficulty predicting instances of that cluster on an external dataset.
Troubleshooting
Problem 1
Should the Pathifier output be normalized in Step 12?
Potential solution
The normalization is already done within Pathifier. Normalization is based on the standard deviation across the normal samples only. If you have very few normal samples (say, 3 or fewer), you might need to increase their number or define more samples as 'normal'. However, note that Pathifier does not perform gene filtering; this should be done by the user (e.g., by gene variance).
Problem 2
When running quantify_pathways_deregulation, the smooth.spline function called by the princurve package commonly returns the following error.
> quantify_pathways_deregulation function: Error smooth.spline(lambda, xj, df = df, keep.data = FALSE).
Potential solution
If you have fewer than 50 samples or fewer than 3 normal samples, smoothing can fail and produce such errors. It may also help to omit pathways that are too small (e.g., < 5 genes) or too big (e.g., # genes >> # samples).
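The pathway-size filter suggested above can be sketched in Python as follows. The function name `filter_gene_sets`, the default minimum of 5 genes, the optional `max_genes` cap, and the toy gene sets are assumptions for illustration; a suitable upper bound depends on the number of samples in your dataset.

```python
def filter_gene_sets(gene_sets, measured_genes, min_genes=5, max_genes=None):
    """Keep gene sets whose number of measured genes lies within [min_genes, max_genes].
    gene_sets: dict pathway name -> list of gene symbols."""
    measured = set(measured_genes)
    kept = {}
    for name, genes in gene_sets.items():
        present = [g for g in genes if g in measured]  # genes found in the expression data
        if len(present) < min_genes:
            continue  # pathway too small
        if max_genes is not None and len(present) > max_genes:
            continue  # pathway too large relative to the chosen cap
        kept[name] = present
    return kept


# hypothetical gene sets and measured genes
toy_sets = {
    "PATHWAY_OK": ["G1", "G2", "G3", "G4", "G5", "G6"],
    "PATHWAY_SMALL": ["G1", "G2", "G99"],
    "PATHWAY_LARGE": [f"G{i}" for i in range(1, 13)],
}
toy_measured = [f"G{i}" for i in range(1, 13)]
kept_sets = filter_gene_sets(toy_sets, toy_measured, min_genes=5, max_genes=10)
print(list(kept_sets))
```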
Problem 3
When using the ML model to make predictions on a new dataset in Python, you receive an error message like the one below.
> X has # features per sample; expecting #
This error means that not all the genes used in training your machine-learning model are present in the test dataset.
Potential solution
The preferred option is to retrain your ML model using only genes that you know to be present in any datasets that you wish to test on.
> overlap_genes = df_test.iloc[:, df_test.columns.isin(df_train.columns)].columns.tolist()
> df_train_new = df_train.loc[:, overlap_genes]
An alternative method that is not covered in this protocol would be to impute the values for the missing genes. Imputation of gene expression values may introduce further uncertainty in the predicted cluster labels for the test set.
Resource availability
Lead contact
Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Mohammad Askandar Iqbal (dr.askandar@gmu.ac.ae).
Technical contact
Information about the technical specifics of the protocol should be directed to and will be fulfilled by the technical contacts, Mohammad Askandar Iqbal (dr.askandar@gmu.ac.ae), Kirk Smith (kirksmi@umich.edu), Prithvi Singh (prithvi.mastermind@gmail.com).
Materials availability
No materials were generated in this study.
Data and code availability
• All data have been deposited on GitHub and are publicly available as of the date of publication at https://github.com/kirksmi/BreastCancerClustering. DOIs are listed in the key resources table.
• Archive of the GitHub repo on Zenodo available at https://doi.org/10.5281/zenodo.11647064.
• All original code has been deposited at GitHub and is publicly available as of the date of publication. DOIs are listed in the key resources table.
• Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.
Acknowledgments
This work was funded by an internal research grant from Gulf Medical University (M.A.I.) and startup funds from the University of Michigan (S.C.).
Author contributions
M.A.I. conceived and designed the study. M.A.I., S.S., K.S., P.S., and S.C. were involved in analysis pipeline development. M.A.I. and K.S. wrote the protocol.
Declaration of interests
The authors declare no competing interests.
Contributor Information
Mohammad Askandar Iqbal, Email: dr.askandar@gmu.ac.ae.
Sriram Chandrasekaran, Email: csriram@umich.edu.
References
1. Iqbal M.A., Siddiqui S., Smith K., Singh P., Kumar B., Chouaib S., Chandrasekaran S. Metabolic stratification of human breast tumors reveal subtypes of clinical and therapeutic relevance. iScience. 2023;26 doi: 10.1016/j.isci.2023.108059.
2. Fisher R., Pusztai L., Swanton C. Cancer heterogeneity: implications for targeted therapeutics. Br. J. Cancer. 2013;108:479–485. doi: 10.1038/bjc.2012.581.
3. Pasha N., Turner N.C. Understanding and overcoming tumor heterogeneity in metastatic breast cancer treatment. Nat. Cancer. 2021;2:680–692. doi: 10.1038/s43018-021-00229-1.
4. Duarte N.C., Becker S.A., Jamshidi N., Thiele I., Mo M.L., Vo T.D., Srivas R., Palsson B.Ø. Global reconstruction of the human metabolic network based on genomic and bibliomic data. Proc. Natl. Acad. Sci. USA. 2007;104:1777–1782. doi: 10.1073/pnas.0610772104.
5. Gaude E., Frezza C. Tissue-specific and convergent metabolic transformation of cancer correlates with metastatic potential and patient survival. Nat. Commun. 2016;7 doi: 10.1038/ncomms13041.
6. Drier Y., Sheffer M., Domany E. Pathway-based personalized analysis of cancer. Proc. Natl. Acad. Sci. USA. 2013;110:6388–6393. doi: 10.1073/pnas.1219651110.
7. Charrad M., Ghazzali N., Boiteau V., Niknafs A. NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set. J. Stat. Softw. 2014;61:1–36. doi: 10.18637/jss.v061.i06.
8. Lundberg S.M., Nair B., Vavilala M.S., Horibe M., Eisses M.J., Adams T., Liston D.E., Low D.K.W., Newman S.F., Kim J., Lee S.I. Explainable machine-learning predictions for the prevention of hypoxaemia during surgery. Nat. Biomed. Eng. 2018;2:749–760. doi: 10.1038/s41551-018-0304-0.
9. van der Kloet F.M., Buurmans J., Jonker M.J., Smilde A.K., Westerhuis J.A. Increased comparability between RNA-Seq and microarray data by utilization of gene sets. PLoS Comput. Biol. 2020;16 doi: 10.1371/journal.pcbi.1008295.