Abstract
Background
Unsupervised learning methods, such as Hierarchical Cluster Analysis, are commonly used for the analysis of genomic platform data. Unfortunately, such approaches ignore the well-documented heterogeneous composition of prostate cancer samples. Our aim is to use more sophisticated analytical approaches to deconvolute the structure of prostate cancer transcriptome data, providing novel clinically actionable information for this disease.
Methods
We apply an unsupervised model called Latent Process Decomposition (LPD), which can handle heterogeneity within individual cancer samples, to genome-wide expression data from eight prostate cancer clinical series, including 1,785 malignant samples with the clinical endpoints of PSA failure and metastasis.
Results
We show that PSA failure is correlated with the level of an expression signature called DESNT (HR = 1.52, 95% CI = [1.36, 1.7], P = 9.0 × 10−14, Cox model), and that patients with a majority DESNT signature have an increased metastatic risk (X2 test, P = 0.0017, and P = 0.0019). In addition, we develop a stratification framework that incorporates DESNT and identifies three novel molecular subtypes of prostate cancer.
Conclusions
These results highlight the importance of using more complex approaches for the analysis of genomic data, may assist drug targeting, and have allowed the construction of a nomogram combining DESNT with other clinical factors for use in clinical management.
Subject terms: Prostate cancer, Molecular medicine, Computer science
Background
Driven by technological advances and decreased costs, a plethora of genomic datasets now exists. This is illustrated by the availability of expression data from over 1.3 million samples from the Gene Expression Omnibus,1 and DNA sequence data on 25,000 cases from the International Cancer Genome Consortium.2 Such datasets have been used as the raw material for the discovery of disease subclasses, using a variety of mathematical approaches. Hierarchical clustering, k-means clustering, and self-organising maps have been applied to expression datasets, leading, for example, to the discovery of five molecular breast cancer types (Basal, Luminal A, Luminal B, ERBB2-overexpressing and Normal-like).3 The inherent shortcoming of this type of approach is the implicit assumption of sample assignment to a particular cluster or group. Such analyses are in complete contrast to the well-documented heterogeneous composition of most individual cancer samples.4,5
Unsupervised analysis of prostate cancer transcriptome profiles using the above approaches has failed to identify robust disease categories that have distinct clinical outcomes.6,7 Noting that prostate cancer samples used in genome-wide studies frequently harbour multiple cancer lineages, and can have intra-tumour variations in genetic compositions,8–10 we applied an unsupervised learning method called latent process decomposition (LPD)11 that can take into account the issue of heterogeneity of composition within individual cancer samples. By heterogeneity, we mean that an individual cancer sample can be made up of several different components that each has distinct properties. We had previously used Latent Process Decomposition: (i) to confirm the presence of the basal and ERBB2-overexpressing subtypes in breast cancer transcriptome datasets;12 (ii) to demonstrate that data from the MammaPrint breast cancer recurrence assay would be optimally analysed using four separate prognostic categories;12 and (iii) to show that patients with advanced prostate cancer can be stratified into two clinically distinct categories based on expression profiles in blood.13 LPD (closely related to Latent Dirichlet Allocation) is a mixed membership model in which the expression profile for a single cancer is represented as a combination of the underlying latent (hidden) signatures. Each latent signature has a representative gene expression pattern. A given sample can be represented over a number of these underlying functional states, or just one such state. The appropriate number of signatures to use is determined using the LPD algorithm by maximising the probability of the model, given the data.
The application of LPD to prostate cancer transcriptome datasets led to the discovery of an expression pattern, called DESNT, that was observed in all datasets examined.14 Cancer samples were assigned as DESNT when this pattern was more common than any other signature, and this designation was associated with poor outcomes independently of other clinical parameters, including Gleason, clinical stage and PSA. In this paper, we test whether the presence of even a small proportion of the DESNT cancer signature confers poor outcome, and uses LPD to develop a new prostate cancer stratification framework.
Methods
Transcriptome datasets
Eight publicly available transcriptome microarray datasets derived from prostatectomy samples from men with prostate cancer were used, and are referred to as Memorial Sloan Kettering Cancer Centre (MSKCC),7 CancerMap,14 CamCap,6 Stephenson,15 TCGA,16 Klein,17 Erho18 and Karnes.19 There were 1785 samples from primary malignant tissue, and 173 from normal tissue (Table 1). MSKCC also had data from 19 metastatic cancer samples. The CamCap dataset was produced by combining two Illumina HumanHT-12 V4.0 expression beadchip datasets (GEO: GSE70768 and GSE70769) obtained from two prostatectomy series (Cambridge and Stockholm).6 The original CamCap6 and CancerMap14 datasets have 40 patients in common, and thus 20 of the common samples were excluded at random from each dataset. Each Affymetrix Exon microarray dataset was normalised using the RMA algorithm,20 implemented in the Affymetrix Expression Console software. For CamCap and Stephenson, previous normalised values were used. For the TCGA dataset, the counts per gene, previously calculated, were used16 and transformed, using the variance-stabilising transformation implemented in the DESeq2 package.21 For the CamCap and CancerMap datasets, the ERG gene alterations had been scored by fluorescence in situ hybridisation.6,14 Only probes corresponding to genes measured by all platforms were retained. The ComBat algorithm from the sva R package and quantile transformation, was used to mitigate series-specific effects. Flow diagrams presenting each of the analyses performed in this study, with the datasets used, are shown in the Supplementary Materials. The ethical approvals obtained for each dataset are listed in the original publications.
Table 1.
Dataset | Primary | Normal | Type | Platform | Citation |
---|---|---|---|---|---|
MSKCC7 | 131 | 29 | FF | Affymetrix Exon 1.0 ST v2 | Taylor et al.7 |
CancerMap14 | 137 | 17 | FF | Affymetrix Exon 1.0 ST v2 | Luca et al. 2017 |
Stephenson15 | 78 | 11 | FF | Affymetrix U133A | Stephenson et al.15 |
Klein17 | 182 | 0 | FFPE | Affymetrix Exon 1.0 ST v2 | Klein et al.17 |
CamCap6 | 147 | 73 | FF | Illumina HT12 v4.0 BeadChip | Ross-Adams et al.6 |
TCGA16 | 333 | 43 | FF | Illumina HiSeq 2000 RNA-Seq v2 | TCGA network 2015 |
Erho18 | 545 | 0 | FFPE | Affymetrix Exon 1.0 ST v2 | Erho et al.18 |
Karnes19 | 232 | 0 | FFPE | Affymetrix Exon 1.0 ST v2 | Karnes et al.19 |
The MSKCC study additionally reported expression profiles from 19 metastatic cancers. The ethical approvals obtained for each dataset are listed in the original publications.
Latent process decomposition
LPD11,12 is an unsupervised Bayesian approach that breaks down (decomposes) each sample into component sub-elements (signatures). Each signature is a representative gene expression pattern. LPD is able to classify complex data based on the relative representation of these signatures in each sample. LPD can objectively assess the most likely number of signatures. We assessed the hold-out validation log-likelihood of the data computed at various numbers of signatures, and used a combination of both the uniform (equivalent to a maximum likelihood approach) and non-uniform (missed approach point) priors to choose the number of signatures. For input, each dataset was reduced to probes that detect the 500 genes with the greatest variance across the MSKCC dataset. For robustness, LPD is run 100 times with different seeds, for each dataset. Out of the 100 runs, we selected the run with the survival log-rank p-value closest to the mode as a representative run that was used for subsequent analysis.
OAS-LPD
The OAS-LPD (one added sample-LPD) algorithm is a modified version of the LPD algorithm in which new sample(s) are decomposed into LPD signatures, without retraining the model (i.e. without re-estimating the model parameters µgk, σ2gk and α in Rogers et al.11). Only the variational parameters Qkga and γak, corresponding to the new sample(s), are iteratively updated until convergence, according to Eq. (6) and (7) from Rogers et al.11 LPD as presented by Rogers et al.11 was first applied to the MSKCC dataset of 131 cancer and 29 normal samples, as described above. The model parameters µgk, σ2gk and α, corresponding to the representative LPD run, were then used to classify additional expression profiles from all datasets, one sample at a time. A detailed description is provided in the Supplementary Methods.
Statistical tests
All statistical tests were performed in R version 3.3.1. For characterisation of signatures, each sample was assigned to the signature that had the largest gamma (γ) value for that sample.
Correlations
Pearson correlations between the expression profiles between the MSKCC and CancerMap were calculated for each of the eight signatures: (i) for each gene, we select one corresponding probe at random; (ii) for each probe, we transformed its distribution across all samples to a standard normal distribution; (iii) the mean expression for each gene across the samples assigned to signature j (gene subgroup mean) in each dataset was determined; (iv) the Pearson’s correlation between the gene subgroup mean expression profile in MSKCC vs the gene subgroup mean expression profile in CancerMap is calculated for each signature.
Differentially expressed and methylated features
Differentially expressed probe sets were identified for each signature by using a moderated t test implemented in the limma R package (Benjamin–Hochberg false discovery rate <0.05, differentially expressed in at least 50/100 runs; samples assigned to the signature vs the rest).
Differential methylation was assigned at the probe level. Hypo- and hypermethylated genes that are predictive of transcription were identified using the methylMix R package (functionally differentially methylated in at least 50/100 runs), using genes that are found to be differentially expressed in that signature as input. Datasets where there were <10 samples assigned to a signature were removed from the identification of intersection genes for that signature.
Survival analyses and nomogram
Survival analyses were performed using Cox proportional hazard models, the log-rank test and Kaplan–Meier estimator, with biochemical recurrence after prostatectomy as the end point. For nomogram construction, the Cox proportional hazard model was fitted on the meta-dataset obtained by combining MSKCC, CancerMap and Stephenson datasets, and validated on CamCap, using the rms R package. The Gleason grade was divided into <7, 3 + 4, 4 + 3 and >7, the pathological stage in T1–T2 vs T3–T4, while DESNT percentage and PSA were considered continuous covariates. The missing values for the predictors were imputed using the flexible additive models with predictive mean matching, implemented in the Hmisc R package. The linearity of the continuous covariates was assessed using the Martingale residuals.22 The lack of collinearity between covariates was determined by calculating the variance inflation factors (VIF) (VIF values between 1.04 and 3.01).23 All covariates met the Cox proportional hazard assumption, as determined by the Schoenfeld residuals. The internal validation and calibration of the Cox model were performed by bootstrapping the training dataset 1000 times. The calibration of the model was estimated by comparing the predicted and observed survival probabilities at 5 years. For comparing the discrimination accuracy of two non-nested Cox models, the U-statistic calculated by the Hmisc rcorrp.cens function was used.
Detecting over-representation of genomic features
Mutated cancer genes identified by the Cancer Genome Atlas Research Network (2015)16 were examined at the sample level. The under-/over-representation of these features in samples assigned to a particular LPD signature was determined using the χ2 independence test.
Pathway over-representation analysis and signature correlation analysis
The GO biological process annotations were tested for over-representation (or under-representation) in the lists of differentially expressed genes in each signature, using clusterProfiler version 3.4.4. For a given pathway and a given sample, the pathway activation score was calculated as indicated in Levine et al.24 Using the complete combined dataset of all eight datasets, Z scores were calculated for each sample for each of the 17,697 MSigDB v6.0 gene sets. These were correlated with DESNT γ values, and the top 20 sets with the highest absolute Pearson’s correlation were selected. The resulting p values from pathway over-representation analysis were adjusted for multiple testing using the false discovery rate.
Results
Presence of DESNT signature as a continuous variable is associated with poor clinical outcome
In our previous studies, LPD was detected between three and eight underlying signatures (also called processes) in expression microarray datasets collected from prostate cancer samples after prostatectomy.14 Decomposition of the MSKCC dataset7 gave eight signatures.14 Figure 1a illustrates the proportion of the DESNT expression signature identified in each MSKCC sample, with individual cancer samples being assigned as a “DESNT cancer” when the DESNT signature was the most abundant as shown in Fig. 1a and Fig. 1c. Based on PSA failure, patients with DESNT cancer always exhibited poorer outcome relative to other cancer samples in the same dataset.14 The implication is that it is the presence of regions of cancer containing the DESNT signature that conferred poor outcome. If this idea is correct, we would predict that cancer samples containing a smaller contribution of DESNT signature, such as those shown in Fig. 1b for the MSKCC dataset, should also exhibit poorer outcome.
To increase the power to test this prediction, we combined transcriptome data from the MSKCC,7 CancerMap,14 Stephenson15 and CamCap6 studies (n = 503). There was a significant association with PSA recurrence when the proportion of expression assigned to the DESNT signature was treated as a continuous variable (HR = 1.52, 95% CI = [1.36, 1.7], P = 9.0 × 10−14, Cox proportional hazard regression model). The outcome became worse as the proportion of DESNT signature increased. For illustrative purposes, cancer samples were divided into four groups based on the proportion of DESNT, with 47.4% of cancer samples containing at least some DESNT cancer (proportion > 0.001, Fig. 2a). PSA failure-free survival at 60 months is 82.5%, 67.4%, 59.5% and 44.9% for the proportion of DESNT signature being: <0.001; 0.001 to 0.3; 0.3–0.6; and >0.6, respectively (Fig. 2b).
Nomogram for DESNT predicting PSA failure
The proportion of DESNT cancer was combined with other clinical variables (Gleason grade, PSA levels, pathological stage and the surgical margin status) in a Cox proportional hazard model, and fitted to a combined dataset of 318 cancer samples (MSKCC, CancerMap and Stephenson); CamCap cancer samples (n = 185) were used for external validation. The proportion of DESNT was an independent predictor of worse clinical outcome (HR = 1.33, 95% CI = [1.14, 1.56], P = 3.0 × 10−4), along with Gleason grade=4 + 3 (HR = 2.43, 95% CI = [1.10, 5.37], P = 2.7 × 10−2), Gleason grade>7 (HR = 5.05, 95% CI = [2.35, 10.89], P < 1 × 10−4) and positive surgical margins (HR = 1.65, 95% CI = [1.07, 2.56], P = 2.2 × 10−2) (Fig. S1: Supplementary Fig. 1). PSA level and pathological stage were below the threshold of statistical significance (P = 0.09, HR = 1.14, 95% CI = [0.97, 1.34]) and (P = 0.055, HR = 1.51, 95% CI = [0.99, 2.31]), respectively. At internal validation, the Cox model obtained a 1000 bootstrap-corrected C index of 0.747, and at external validation a C index of 0.795. Using this model, a nomogram was constructed for use of DESNT cancer information in conjunction with clinical variables to predict the risk of biochemical recurrence at 1, 3, 5 and 7 years following prostatectomy (Fig. 2c, Fig. S1).
LPD algorithm for detecting the presence of DESNT cancer in individual samples
The ability of LPD to detect structure is likely to be dependent on sample size, cohort composition, disease severity range and data quality. We observed optimal decompositions varying between three and eight underlying signatures in different datasets.14 When we examined the two datasets that had an optimal eight underlying signatures (MSKCC and CancerMap), we noted a striking relationship: based on correlations of expression profiles, all eight of the LPD signatures appeared to be common (Fig. S2; R2 > 0.5). To provide a more consistent classification framework where the number of classes did not vary between datasets, we therefore used the MSKCC dataset and its decomposition into eight distinct signatures as a reference for identifying categories of prostate cancer type.
LPD is a computer-intensive procedure, and analyses can take days to run on a high-performance computing cluster. This would restrict ease of DESNT detection for clinical implementation. We therefore developed a variant of LPD called OAS-LPD, where data from a single additional cancer sample could be decomposed into signatures, following normalisation, without repeating the entire LPD procedure. LPD model parameters11 were first derived by decomposition of the MSKCC dataset into eight signatures. These signature parameters were then used as a framework for decomposition of additional data from single samples, selected in this case from a dataset, or in the future from a patient undergoing assessment in the clinic. To test this procedure, we applied OAS-LPD individually to cancer samples from MSKCC, CancerMap, Stephenson and CamCap (Fig. S3), and repeated Cox regression analysis and nomogram construction. The proportion of DESNT (P = 0.0011, HR = 1.53, 95% CI = [1.19, 1.98]), Gleason = 4 + 3 (P = 0.0061, HR = 2.83, 95% CI = [1.35, 5.96]), Gleason>7 (P < 1 × 10−4, HR = 5.39, 95% CI = [2.54, 11.44]) and surgical margin status (P = 0.0015, HR = 2.00, 95% CI = [1.30, 3.07]) remained independent predictors of clinical outcome (Fig. S4). Notably, the performance of the Cox model (internal validation C index = 0.742; external validation C index = 0.786) was not significantly different to that of the original separate dataset Cox model (train dataset Z = −0.65, two-tailed P = 0.52; validation dataset Z = 0.89, two-tailed P = 0.38; U-statistic), and the nomogram (Fig. S5) had almost an identical presentation of parameters to that shown in Fig. 2c. This observation is consistent with the high degree of correlation between LPD and OAS-LPD DESNT gamma values across the MSKCC, CancerMap, Stephenson and CamCap datasets (P = 2.39 × 10−110).
New categories of prostate cancer
We wished to determine whether LPD signatures were characterised by particular clinical or molecular features, indicating that they represented distinct categories of prostate cancer. OAS-LPD using the MSKCC-derived model of gene signatures was applied to all datasets (n = 1958, Table 1), and each sample was assigned to the signature that was the most abundant. Samples from non-cancerous (benign) prostate tissue were more frequently assigned to LPD2, LPD4 and LPD8 than to the other groups (P < 0.05, χ2 test, Fig. S3, Table S1). When datasets with linked clinical data were combined (MSKCC, CancerMap, Stephenson and CamCap, Fig. 3a–c), primary cancers assigned to DESNT had worse outcome (P = 3.4 × 10−14, log-rank test, DESNT-assigned samples vs the rest), while those assigned to LPD4 had improved outcome (P = 0.0081, log-rank test, LPD4-assigned samples vs the rest) as judged by PSA failure. Cancer samples with ERG alterations assigned to signature LPD3 also exhibited better outcome (P < 0.05, log-rank test, comparison to all other ETS-positive cancer samples) in all three datasets where ERG status was available (Fig. 4b–d).
To gain information about the new LPD categories, we examined the distribution of genetic alterations in the decomposition of the TGCA dataset16 (Fig. 4a). LPD3 cancer samples had over-representation of ETS and PTEN gene alterations, and under-representation of CDH1 and SPOP gene alterations (P < 0.05, χ2 test, Table 2). LPD5 cancer samples exhibited exactly the reverse pattern of genetic alteration: there was under-repression of ETS and PTEN gene alterations, and over-representation of SPOP and CHD1 alterations (Table 2). The statistically different distribution of ETS-gene alterations in samples assigned to LPD3 and LPD5, observed in the TGCA dataset, was confirmed in the CamCap and CancerMap dataset (Table 2). In summary, we have identified three additional prostate cancer categories that have altered genetic and/or clinical associations: LPD3, LPD4 and LPD5 (Fig. 5), and that may be relevant for drug targeting.
Table 2.
TCGA | CancerMap | CamCap | |||||||
---|---|---|---|---|---|---|---|---|---|
ETS– | ETS+ | χ2 P-val | ERG– | ERG+ | χ2 P-val | ERG– | ERG+ | χ2 P-val | |
LPD1 | 8 | 3 | 0.0588 | 13 | 4 | 0.0851 | 0 | 3 | 0.235 |
LPD2 | 4 | 8 | 0.827 | 3 | 3 | 1 | 0 | 2 | 0.467 |
LPD3 | 9 | 67 | 1.45 × 10−08 | 5 | 15 | 0.00977 | 4 | 17 | 0.00299 |
LPD4 | 14 | 21 | 1 | 14 | 15 | 0.619 | 1 | 2 | 0.987 |
LPD5 | 65 | 5 | 2.20 × 10−16 | 19 | 1 | 0.000180 | 34 | 0 | 1.15 × 10−11 |
LPD6 | 13 | 22 | 0.802 | 5 | 5 | 1 | 2 | 4 | 0.657 |
DESNT | 13 | 66 | 1.17 × 10−06 | 6 | 15 | 0.0207 | 9 | 24 | 0.00274 |
LPD8 | 9 | 6 | 0.193 | 8 | 4 | 0.540 | 4 | 1 | 0.371 |
PTEN | SPOP | CHD1 | |||||||
---|---|---|---|---|---|---|---|---|---|
Non-homdel | Homdel | χ2 P-val | Non-mut | Mut | χ2 P-val | Non-homdel | Homdel | χ2 P-val | |
LPD1 | 10 | 1 | 0.896 | 8 | 3 | 0.213 | 9 | 2 | 0.309 |
LPD2 | 12 | 0 | 0.284 | 12 | 0 | 0.436 | 12 | 0 | 0.756 |
LPD3 | 55 | 21 | 0.000894 | 73 | 3 | 0.0400 | 76 | 0 | 0.0211 |
LPD4 | 35 | 0 | 0.0174 | 31 | 4 | 1 | 34 | 1 | 0.603 |
LPD5 | 67 | 3 | 0.00830 | 51 | 19 | 4.46 × 10−06 | 57 | 13 | 7.69 × 10−06 |
LPD6 | 29 | 6 | 0.903 | 32 | 3 | 0.825 | 34 | 1 | 0.603 |
DESNT | 60 | 19 | 0.0167 | 75 | 4 | 0.0795 | 76 | 3 | 0.432 |
LPD8 | 15 | 0 | 0.195 | 14 | 1 | 0.889 | 14 | 1 | 1 |
Statistically significant differences are italicised.
Altered patterns of gene expression and DNA methylation
We examined samples assigned to each OAS-LPD signature for genes with significantly altered expression levels in all eight datasets (P < 0.05 after FDR correction, samples in the LPD group vs all other LPD categories from the same dataset, Supplementary data 1). LPD3 cancer samples exhibited seven commonly overexpressed genes, including ERG, GHR and HDAC1. Pathway analysis suggested the involvement of Stat3 gene signalling (Fig. S6a, Supplementary data 2). LPD5 exhibited 47 significantly overexpressed genes and 13 under-expressed genes. Many of the genes had established roles in fatty acid metabolism and the control of secretion (Fig. S6b). LPD6- and LPD8 cancers had failed to exhibit statistically significant changes in genetic alteration or clinical outcome in this study, but did have characteristic altered patterns of gene expression (Fig. S6c, e). The five genes commonly overexpressed in LPD6 cancers suggested involvement in metal ion homoeostasis. In total, 30 genes were overexpressed, and 36 genes under-expressed in in LPD8 cancers, including several genes involved in extracellular matrix organisation. Cross- referencing differential methylation data available for the TCGA dataset with genes associated with each LPD group indicated that many expression changes may be explained, at least in part, by changes in DNA methylation (Fig. 5, Fig. S7, Supplementary data 3).
DESNT as a signature of metastasis
The MSKCC study includes data from 19 metastatic cancer samples. For each metastatic sample, DESNT was the most abundant signature when OAS-LPD was applied (Fig. 3d). Two of the studied datasets (MSKCC and Erho) had publicly available annotations, indicating that the patients, from which primary cancer expression profiles were examined, had progressed to develop metastasis after prostatectomy (Fig. S3). From nine cancer patients developing metastasis in the MSKCC dataset, five occurred from samples in which the DESNT signature is most common (X2 test, P = 0.0017), and of 212 cancer patients developing metastases in the Erho dataset, 50 were from DESNT cancers (X2 test, P = 0.0019) (Fig. S8). From these studies, we concluded that DESNT cancers have an increased risk of developing metastasis, consistent with the higher risk of PSA failure. For the Erho dataset, membership of LPD1 was associated with lower risk of metastasis (X2 test, P = 0.026, Fig. S8).
To further investigate the underlying nature of DESNT cancer, we used the transcriptome profile for each primary prostate cancer sample to investigate associations with the 17,697 signatures and pathways annotated in the MSigDB database. The top 20 signatures, where expression was associated with the proportion of DESNT, are shown in Table S2. The third most significant correlation was to genes downregulated in metastatic prostate cancer. This resulting data give additional clues to the underlying biology of DESNT cancer, including associations with genes altered in ductal breast cancer, in stem cells and during FGFR1 signalling.
Discussion
We have confirmed a key prediction of the DESNT cancer model by demonstrating that the presence of a small proportion of the DESNT cancer signature confers poorer outcome. The proportion of DESNT signature can be considered a continuous variable, such that as DESNT cancer content increases, the outcome became worse. This observation led to the development of nomograms for estimating PSA failure at 3, 5 and 7 years following prostatectomy. The result provides an extension of previous studies in which nomograms incorporating Gleason score, stage and PSA value have been used to predict outcome following surgery.25
The match between the eight underlying signatures detected for the MSKCC and CancerMap datasets was used as the basis for developing a novel classification framework for prostate cancer. A new algorithm called OAS-LPD was developed to allow rapid assessment of the presence of the signatures in individual cancer samples. In total, four clinically or genetically distinct subgroups were identified (DESNT, LPD3, LPD4 and LPD5, Fig. 5). The functional significance of the new disease groupings, for example, in determining drug sensitivity, remains to be established. However, with the use of OAS-LPD, it will be possible to undertake assessments of the response of patients in each of the groups DESNT, LPD3, LPD and LPD5 to drug treatments. There is limited overlap between the new classification and previously proposed subgroups based on genetic alterations.16,26–29
Multiplatform data (expression, mutation and methylation data from each cancer sample) are available for many cancer types, for example from The Cancer Genome Atlas. This has prompted the development of additional methods for sub-class discovery that can combine information from different platforms, including the copula-mixed model,30 Bayesian consensus clustering,31 and the iCluster model.16 Such approaches can suffer from the problem of sample assignment to a particular cluster or group, and the failure to take into consideration the heterogeneous composition of individual cancer samples. These observations highlight the need to develop methods similar to LPD that can be applied to multiplatform data.
An important issue for patients diagnosed with prostate cancer is that the clinical outcome is highly variable, and precise prediction of the course of disease progression at the time of diagnosis is not possible.32 In some studies, the use of population PSA screening can reduce mortality from prostate cancer by up to 21%.33 However many, if not most, prostate cancers that are currently detected by PSA screening are clinically insignificant.34 Overdiagnosis of clinically insignificant prostate cancer is a major issue, and is set to increase still further.35 There is therefore an urgent need for the identification of cancer categories that are associated with clinically aggressive or indolent disease to allow the targeting of radical therapies to men that need them. For breast cancer, unsupervised hierarchical clustering of transcriptome data resulted in a classification system that is routinely used to guide the management and treatment of this disease. Here we established a novel classification framework for the analysis of prostate cancer that has its origins in unsupervised analyses of transcriptome data. In future studies, we plan to analyse the utility of DESNT and other LPD processes (particularly LPD3, LPD4 and LPD5) in managing prostate cancer patients, including predicting the response to drug treatment. This will be performed through the assessment of LPD status in the contexts of established clinical trials. For evaluation, we would plan to use each LPD assignment (e.g. DESNT, LPD3, LPD4 and LPD5) as a continuous variable, as illustrated here by the development of a nomogram for the use of DESNT in predicting PSA failure. In conclusion, our results highlight the importance of devising and using more sophisticated approaches for the analysis of genomic datasets from all biological systems.
Supplementary information
Acknowledgements
The research presented in this paper was carried out on the High Performance Computing Cluster supported by the Research and Specialist Computing Support service at the University of East Anglia. We thank Shea Connell for useful comments and suggestions on the paper.
Author contributions
C.S.C., D.S.B. and V.M. were involved in funding acquisition and supervised the project. C.S.C., D.S.B., B-A.L. and V.M. were involved in conceptualisation and planned the data analysis. B-A.L., C.E. and D.S.B. performed the majority of the analyses and investigations, with additional analysis and insight provided by V.M., D.R.E., C.C., R.A.C., J.C. and C.S.C. B-A.L., C.E., C.C., D.S.B. and C.S.C. were involved in developing the methodology for the project. C.S.C., D.S.B. and B-A.L. wrote the original draft of the paper. All authors reviewed and edited the paper. All authors read and approved the final paper.
Ethics approval and consent to participate
All data were from other publications. The ethical approvals obtained for each dataset are listed in the original publications.
Data availability
The datasets analysed during this study are available (Table 1). The majority are available from the Gene Expression Omnibus repository:
MSKCC:7 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE21034
CancerMap:14 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE94767
Klein:17 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE62667
CamCap:6 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE70768 and https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE70769
Erho:18 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE46691
Karnes:19 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE62116
Stephenson:15 data available from the corresponding author of this paper.
TCGA:16 data available from the TCGA Data Portal https://portal.gdc.cancer.gov/projects/TCGA-PRAD.
Competing interests
C.S.C., D.S.B., B-A L. and V.M. are co-inventors on a patent application from the University of East Anglia on the detection of DESNT prostate cancer.
Funding information
This work was funded by the Bob Champion Cancer Trust, The Masonic Charitable Foundation successor to The Grand Charity, The King Family, The Hargrave Foundation and The University of East Anglia. We acknowledge support from Movember, from Prostate Cancer UK, The Big C Cancer Charity, Callum Barton and from The Andy Ripley Memorial Fund.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
These authors Contributed equally: Bogdan-Alexandru Luca, Daniel S. Brewer
These authors jointly supervised this work: Vincent Moulton, Daniel S. Brewer, Colin S. Cooper
Supplementary information
Supplementary information is available for this paper at 10.1038/s41416-020-0799-5.
References
- 1.Edgar R, Domrachev M, Lash AE. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 2002;30:207–210. doi: 10.1093/nar/30.1.207. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Consortium ICG, Anderson W, Artez A, Bell C, Bernabé RR, Bhan MK, et al. International network of cancer genome projects. Nature. 2010;464:993–998. doi: 10.1038/nature08987. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Sorlie T, Tibshirani R, Parker J, Hastie T, Marron JS, Nobel A, et al. Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc. Natl Acad. Sci. USA. 2003;100:8418–8423. doi: 10.1073/pnas.0932692100. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Blanco-Calvo M, Concha Á, Figueroa A, Garrido F, Valladares-Ayerbes M. Colorectal cancer classification and cell heterogeneITY: A SYSTEMs oncology approach. Int J. Mol. Sci. 2015;16:13610–13632. doi: 10.3390/ijms160613610. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Polyak K. Heterogeneity in breast cancer review series introduction heterogeneity in breast cancer. J. Clin. Invest. 2011;121:3786. doi: 10.1172/JCI60534. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Ross-Adams H, Lamb ADD, Dunning MJJ, Halim S, Lindberg J, Massie CMM, et al. Integration of copy number and transcriptomics provides risk stratification in prostate cancer: a discovery and validation cohort study. EBioMedicine. 2015;2:1133–1144. doi: 10.1016/j.ebiom.2015.07.017. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Taylor BS, Schultz N, Hieronymus H, Gopalan A, Xiao Y, Carver BS, et al. Integrative genomic profiling of human prostate cancer. Cancer Cell. 2010;18:11–22. doi: 10.1016/j.ccr.2010.05.026. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Cooper CS, Eeles R, Wedge DC, Van Loo P, Gundem G, Alexandrov LB, et al. Analysis of the genetic phylogeny of multifocal prostate cancer identifies multiple independent clonal expansions in neoplastic and morphologically normal prostate tissue. Nat. Genet. 2015;47:367–372. doi: 10.1038/ng.3221. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Boutros PC, Fraser M, Harding NJ, de Borja R, Trudel D, Lalonde E, et al. Spatial genomic heterogeneity within localized, multifocal prostate cancer. Nat. Genet. 2015;47:736–745. doi: 10.1038/ng.3315. [DOI] [PubMed] [Google Scholar]
- 10.Tsourlakis M-C, Stender A, Quaas A, Kluth M, Wittmer C, Haese A, et al. Heterogeneity of ERG expression in prostate cancer: a large section mapping study of entire prostatectomy specimens from 125 patients. BMC Cancer. 2016;16:641. doi: 10.1186/s12885-016-2674-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Rogers S, Girolami M, Campbell C, Breitling R. The latent process decomposition of cDNA microarray data sets. IEEE/ACM Trans. Comput. Biol. Bioinforma. 2005;2:143–156. doi: 10.1109/TCBB.2005.29. [DOI] [PubMed] [Google Scholar]
- 12.Carrivick L, Rogers S, Clark J, Campbell C, Girolami M, Cooper C. Identification of prognostic signatures in breast cancer microarray data using Bayesian techniques. J. R. Soc. Interface. 2006;3:367–381. doi: 10.1098/rsif.2005.0093. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Olmos D, Brewer D, Clark J, Danila DC, Parker C, Attard G, et al. Prognostic value of blood mRNA expression signatures in castration-resistant prostate cancer: a prospective, two-stage study. Lancet Oncol. 2012;2045:1–11. doi: 10.1016/S1470-2045(12)70372-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Luca B, Brewer DS, Edwards DR, Edwards S, Whitaker HC, Merson S, et al. DESNT: a poor prognosis category of human prostate cancer. Eur. Urol. Focus. 2018;4:842–850. doi: 10.1016/j.euf.2017.01.016. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Stephenson AJ, Smith A, Kattan MW, Satagopan J, Reuter VE, Scardino PT, et al. Integration of gene expression profiling and clinical variables to predict prostate carcinoma recurrence after radical prostatectomy. Cancer. 2005;104:290–298. doi: 10.1002/cncr.21157. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Network CGAR, Cancer Genome Atlas Research Network. The molecular taxonomy of primary prostate cancer. Cell. 2015;163:1011–1025. doi: 10.1016/j.cell.2015.10.025. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Klein EA, Yousefi K, Haddad Z, Choeurng V, Buerki C, Stephenson AJ, et al. A genomic classifier improves prediction of metastatic disease within 5 years after surgery in node-negative high-risk prostate cancer patients managed by radical prostatectomy without adjuvant therapy. Eur. Urol. 2015;67:778–786. doi: 10.1016/j.eururo.2014.10.036. [DOI] [PubMed] [Google Scholar]
- 18.Erho N, Crisan A, Vergara IA, Mitra AP, Ghadessi M, Buerki C, et al. Discovery and validation of a prostate cancer genomic classifier that predicts early metastasis following radical prostatectomy. PLoS ONE. 2013;8:e66855. doi: 10.1371/journal.pone.0066855. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Karnes RJ, Bergstralh EJ, Davicioni E, Ghadessi M, Buerki C, Mitra AP, et al. Validation of a genomic classifier that predicts metastasis following radical prostatectomy in an at risk patient population. J. Urol. 2013;190:2047–2053. doi: 10.1016/j.juro.2013.06.017. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Irizarry RA, Hobbs B, Collin F, Beazer‐Barclay YD, Antonellis KJ, Scherf U, et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003;4:249–264. doi: 10.1093/biostatistics/4.2.249. [DOI] [PubMed] [Google Scholar]
- 21.Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15:550. doi: 10.1186/s13059-014-0550-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Therneau TM, GRAMBSCH PM, Fleming TR. Martingale-based residuals for survival models. Biometrika. 1990;77:147–160. doi: 10.1093/biomet/77.1.147. [DOI] [Google Scholar]
- 23.Hair J. F., Black W. C., Babin B. J., Anderson R. E. & Tatham R. L. Multivariate data analysis. (Pearson Education Limited, Essex, UK, 1998).
- 24.Levine DM, Haynor DR, Castle JC, Stepaniants SB, Pellegrini M, Mao M, et al. Pathway and gene-set activation measurement from mRNA expression data: the tissue distribution of human pathways. Genome Biol. 2006;7:R93. doi: 10.1186/gb-2006-7-10-r93. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Shariat SF, Kattan MW, Vickers AJ, Karakiewicz PI, Scardino PT. Critical review of prostate cancer predictive tools. Future Oncol. 2009;5:1555–1584. doi: 10.2217/fon.09.121. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Attard G, Clark J, Ambroisine L, Fisher G, Kovacs G, Flohr P, et al. Duplication of the fusion of TMPRSS2 to ERG sequences identifies fatal human prostate cancer. Oncogene. 2008;27:253–263. doi: 10.1038/sj.onc.1210640. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Reid AHM, Attard G, Ambroisine L, Fisher G, Kovacs G, Brewer D, et al. Molecular characterisation of ERG, ETV1 and PTEN gene loci identifies patients at low and high risk of death from prostate cancer. Br. J. Cancer. 2010;102:678–684. doi: 10.1038/sj.bjc.6605554. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Mosquera JM, Beltran H, Park K, MacDonald TY, Robinson BD, Tagawa ST, et al. Concurrent AURKA and MYCN gene amplifications are harbingers of lethal treatmentrelated neuroendocrine prostate cancer. Neoplasia. 2013;15:1–IN4. doi: 10.1593/neo.121550. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Rodrigues LU, Rider L, Nieto C, Romero L, Karimpour-Fard A, Loda M, et al. Coordinate loss of MAP3K7 and CHD1 promotes aggressive prostate cancer. Cancer Res. 2015;75(Mar):1021–1034. doi: 10.1158/0008-5472.CAN-14-1596. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Rey M., Roth V. Copula Mixture Model for Dependency-seeking Clustering. In: Proceedings of the 29th International Conference on Machine Learning (Edinburgh, Scotland, UK, 2012).
- 31.Lock EF, Dunson DB. Bayesian consensus clustering. Bioinformatics. 2013;29:2610–2616. doi: 10.1093/bioinformatics/btt425. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Buyyounouski MK, Pickles T, Kestin LL, Allison R, Williams SG. Validating the interval to biochemical failure for the identification of potentially lethal prostate cancer. J. Clin. Oncol. 2016;30:1857–1863. doi: 10.1200/JCO.2011.35.1924. [DOI] [PubMed] [Google Scholar]
- 33.Schröder FH, Hugosson J, Roobol MJ, Tammela TLJ, Zappa M, Nelen V, et al. Screening and prostate cancer mortality: results of the European Randomised Study of Screening for Prostate Cancer (ERSPC) at 13 years of follow-up. Lancet. 2014;384:2027–2035. doi: 10.1016/S0140-6736(14)60525-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Etzioni R, Gulati R, Mallinger L, Mandelblatt J. Influence of study features and methods on overdiagnosis estimates in breast and prostate cancer screening. Ann. Intern Med. 2013;158(Jun):831–838. doi: 10.7326/0003-4819-158-11-201306040-00008. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Parker C, Emberton M. Screening for prostate cancer appears to work, but at what cost? BJU Int. 2009;104:290–292. doi: 10.1111/j.1464-410X.2009.08689.x. [DOI] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
The datasets analysed during this study are available (Table 1). The majority are available from the Gene Expression Omnibus repository:
MSKCC:7 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE21034
CancerMap:14 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE94767
Klein:17 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE62667
CamCap:6 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE70768 and https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE70769
Erho:18 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE46691
Karnes:19 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE62116
Stephenson:15 data available from the corresponding author of this paper.
TCGA:16 data available from the TCGA Data Portal https://portal.gdc.cancer.gov/projects/TCGA-PRAD.