iScience. 2023 Feb 18;26(3):106238. doi: 10.1016/j.isci.2023.106238

A multimodal analysis of genomic and RNA splicing features in myeloid malignancies

Arda Durmaz 1,2,10, Carmelo Gurnari 1,3,10, Courtney E Hershberger 4, Simona Pagliuca 1,5, Noah Daniels 6, Hassan Awada 7, Hussein Awada 1, Vera Adema 8, Minako Mori 1, Ben Ponvilawan 1, Yasuo Kubota 1, Tariq Kewan 1, Waled S Bahaj 1, John Barnard 4, Jacob Scott 1,2, Richard A Padgett 6, Torsten Haferlach 9, Jaroslaw P Maciejewski 1,10, Valeria Visconte 1,10,11
PMCID: PMC10011742  PMID: 36926651

Summary

RNA splicing dysfunctions are more widespread than is suggested by estimating only the effects of splicing factor mutations (SFMT) in myeloid neoplasia (MN). The genetic complexity of MN is amenable to machine learning (ML) strategies. We applied an integrative ML approach to identify co-varying features by combining genomic lesions (mutations, deletions, and copy number), exon-inclusion ratio as a measure of RNA splicing (percent spliced in, PSI), and gene expression (GE) of 1,258 patients with MN and 63 normal controls. We identified 15 clusters based on mutations, GE, and PSI. Different PSI levels were present to various extents regardless of SFMT, suggesting that changes in RNA splicing were not strictly related to SFMT. The combination of PSI and GE further distinguished the features and identified PSI similarities and differences, common pathways, and expression signatures across clusters. Thus, multimodal features can resolve the complex architecture of MN and help identify convergent molecular and transcriptomic pathways amenable to therapies.

Subject areas: Bioinformatics, Cancer, Omics

Highlights

  • Clusters of myeloid neoplasms result from mutations, gene expression, and RNA splicing

  • Multimodal features unveiled combinatorial splicing patterns in myeloid neoplasms

  • Geno-transcriptomic interactions can define therapeutic vulnerabilities



Introduction

The frequent occurrence of splicing factor mutations (SFMT) has suggested that RNA splicing dysfunction plays an important pathogenic role in myeloid neoplasia (MN).1,2 A common theory posits that SFMT result in mis-splicing of specific driver genes in MN3,4,5,6 and that diverse mutations underpin the heterogeneity of disease phenotypes through the establishment of particular splicing patterns. In vitro and in vivo studies of MN with SFMT have led to the identification of a variety of putative downstream targets, but a specific and unique mis-splicing event responsible for the malignant phenotype has not been identified for any of the SFMT.1,7,8,9 Nevertheless, transcriptome-wide splicing analysis suggested that MN is associated with a broad spectrum of splicing alterations also occurring in cases without SFMT.9 The notion of global changes in splicing patterns in MN suggests that the presence of these mutations represents only the most extreme pole in the highly diverse mis-splicing spectrum which includes quantitative degrees of pathogenic changes and resultant phenotypes.10,11,12

Perturbations in RNA splicing include alterations of isoform proportions, as in the case of skewed alternative splicing (AS) (quantitative changes), or the occurrence of new splicing forms not present under normal circumstances (qualitative changes, so-called neo-AS). The overall landscape is complex, as isoforms and mis-spliced variants may be expressed at various ratios and expression levels. Whether small changes in highly expressed genes or large changes in less abundant transcripts are more consequential may be a feature germane to MN subtypes. If these alterations are indeed ubiquitous across the entire MN entity, the term “spliceopathy disorder” may be warranted. To date, the impact of spliceopathic changes on leukemogenesis remains to be fully elucidated, but it is clear from genotype-phenotype association studies that some mis-splicing events can produce profound morphologic changes that are likely to be functional.3,13 Possibly, similar patterns of mis-splicing could define subsets of patients with MN. Such subsets, which may connect seemingly distinct genotypes and phenotypes into functionally related groups sharing common clinical features, drug sensitivity/resistance profiles, and prognostic features, are not resolvable by DNA sequencing or gene expression (GE) analysis alone.

Due to the tremendous complexity of splicing changes, traditional approaches for the identification of splicing patterns have proven insufficient to dissect such pathophysiologic characteristics. Therefore, we propose that machine learning (ML) strategies may help to identify combinatorial splicing patterns and select the most informative variables to reduce molecular complexity.

Rather than generating prediction algorithms to identify similar clinical outcomes, which may result from distinct disease mechanisms, we use ML tools to explore the global RNA splicing patterns of patients with MN in relation to GE levels, mutations, and copy number variant (CNV) signatures. To that end, we investigate the convergence of molecular and transcriptomic pathways and define spliceosomal changes irrespective of specific SFMT.

Results

Identification of distinct splicing patterns by ML-based clustering

We theorized that the presence of SFMT underestimates the changes in RNA splicing and that abnormal splicing patterns may also be recognized in patients without SFMT. A cohort of patients with MN (n = 1,258) with available RNA sequencing and mutation data was analyzed using an ML method based on variational autoencoders (VAE) (see method details). Patterns of similarities and/or differences across patients were also compared to a cohort of age-matched healthy controls (HC, n = 63) (Figure 1). We first used RNA splicing variations (measured as PSI), GE levels, or mutations as a single dimension to identify patterns classifiable as discrete patient clusters (Figures S1A–S1F). We then applied an integrative VAE method combining all the genomic features (GE, mutations, CNV, and PSI) and clustered the patients in an unsupervised fashion (Figures 2A and 2B). By applying embedding procedures (transforming the high-dimensional multimodal data into a unified low-dimensional representation) to the entire cohort (patients with MN and HC), we were able to find latent (i.e., hidden) similarities across patients and assign them to 15 clusters (C-0 through C-14) based on the presence/absence of any mutation (and/or deletion), GE, and PSI.
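The variational step mentioned above can be illustrated with a minimal sketch. This is not the model used in the study (the architecture is described in the method details); it only shows, in plain Python, the two ingredients that distinguish a variational autoencoder from a plain one: reparameterized sampling of the latent vector and the Kullback-Leibler penalty that pulls each latent dimension toward a standard normal prior. The function names, the latent size, and the numeric values are assumptions made for illustration.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))


rng = random.Random(0)
mu = [0.5, -1.0, 0.0]        # hypothetical encoder means for one sample
log_var = [0.0, -2.0, 0.0]   # hypothetical encoder log-variances
z = sample_latent(mu, log_var, rng)
penalty = kl_to_standard_normal(mu, log_var)
```

In a multimodal setting, a penalty of this kind would be added to the reconstruction errors of each data type (GE, PSI, mutations); how those terms are weighted is a modeling choice not specified here.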

Figure 1.

Study design and machine learning rationale

Figure showing a combination of gene expression (GE) and RNA splicing changes expressed by percent spliced in (PSI) derived from RNA-seq of 1,258 patients with myeloid neoplasia and 63 healthy controls. Mutational data and copy number variants were obtained from whole-genome sequencing performed on the same cohort. CIBERSORT analysis was performed to quantify the cellular composition of the samples. In total, 54% of the patients carried at least one mutation in a common splicing factor (SF) gene. A total of 17,565 alternative splicing events (ASE) were called based on PSI with a variation threshold p ≥ 0.01. Machine learning algorithms were employed to create a combined signature of the above-mentioned molecular and transcriptomic features to identify distinct clusters (C-0 through C-14). An internal model validation was also done.

Figure 2.

Identification of clusters of patients with myeloid neoplasia by combining gene mutations, spliceome, and gene expression levels

(A) A low dimensional embedding map was generated using variational autoencoders based on gene expression, percent spliced in, and mutations of the samples. Embedding indicates the transformed coordinates for the shared layer.

(B–E) (B) Fifteen clusters (C-0 through C-14) were identified using a Gaussian mixture model (GMM). 2D UMAP plots of (C) all patients (disease: orange) and healthy controls (black); (D) patients with splicing factor mutations (SF, turquoise), patients without splicing factor mutations (No SF, pink), and healthy controls (black); and (E) the entire cohort according to disease type within the myeloid neoplasia spectrum: healthy controls, acute myeloid leukemia (AML), myelodysplastic syndrome (MDS), chronic myelomonocytic leukemia (CMML), MDS/myeloproliferative neoplasms-unclassified (MDS/MPN-U), MDS/MPN-ringed sideroblasts and thrombocytosis (MDS/MPN-RS-T), and secondary AML.

To estimate the contribution of latent features to each omic, we quantified the variation explained by the latent layers for PSI and expression, and the pseudo r-squared for mutation data. We also quantified the “importance” of individual features by permutation. The shared layers explained 30% and 50% of the variation for expression and PSI, respectively. The full model explained 60% and 55% of the variation, suggesting that the majority of the variation in PSI can be associated with GE differences. Similarly, pseudo r-squared values for mutation data showed increases of 18% and 38% when using only the shared layer and the full model, respectively, compared to a null model (Figures S2A and S2B). The investigation of feature “importance” showed that the majority of latent features produced a similar increase in error when permuted (Figure S3).

The identified groups differed markedly in the distribution of MN subtypes. HC cases were mainly distributed in three clusters: C-13 (41% of cases), C-2 (40%), and C-7 (16%). C-4 contained only 3% of HC. Myelodysplastic syndrome (MDS) and acute myeloid leukemia (AML) cases populated distinct clusters, with AML mostly enriching six clusters: C-10 (100%), C-9 (96%), C-5 (91%), C-0 (88%), C-6 (80%), and C-4 (78%), and patients with MDS grouping in three: C-12 (88%), C-1 (81%), and C-3 (76%) (Figures 2A and 2B). Of note is that C-9 and C-10 were “pure” AML clusters, as they consisted of >95% patients with AML. CIBERSORT analysis confirmed a similar content of myeloid cells (median of 75%) in all tested clusters (Figure S4A), suggesting that the identified clusters were not due to different cellular composition. To understand the dominant leukemia hierarchies represented in our patient cohort, we performed CIBERSORT analysis applying the malignant signature panel described by Zeng et al.14 Of the seven cell type configurations, a GMP-like composition predominated in all clusters (Figure S4B).
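The two model diagnostics described above, explained variation and permutation-based feature importance, can be sketched as follows. This is a simplified illustration under stated assumptions, not the study's implementation: `decode` is a hypothetical stand-in for the trained decoder, importance is expressed as the drop in explained variation after shuffling one latent feature (equivalent in spirit to an increase in error), and the toy data are invented.

```python
import random


def explained_variation(observed, predicted):
    """Fraction of variance explained: 1 - SS_residual / SS_total."""
    mean = sum(observed) / len(observed)
    ss_total = sum((y - mean) ** 2 for y in observed)
    ss_resid = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1.0 - ss_resid / ss_total


def permutation_importance(decode, latent_rows, observed, feature, rng):
    """Drop in explained variation after shuffling one latent feature."""
    baseline = explained_variation(observed, [decode(r) for r in latent_rows])
    shuffled = [row[feature] for row in latent_rows]
    rng.shuffle(shuffled)
    permuted_rows = [row[:feature] + [v] + row[feature + 1:]
                     for row, v in zip(latent_rows, shuffled)]
    permuted = explained_variation(observed, [decode(r) for r in permuted_rows])
    return baseline - permuted


# Toy decoder (hypothetical): the output depends on latent feature 0 only.
def decode(row):
    return 2.0 * row[0]


rng = random.Random(1)
latent_rows = [[float(i), float(i % 3)] for i in range(20)]
observed = [decode(r) for r in latent_rows]

drop_used = permutation_importance(decode, latent_rows, observed, 0, rng)
drop_unused = permutation_importance(decode, latent_rows, observed, 1, rng)
```

In this toy setting, shuffling the feature the decoder relies on lowers the explained variation, whereas shuffling the unused feature leaves it unchanged. A profile in which most latent features produce a similar drop, as reported above, would indicate that no single latent dimension dominates the reconstruction.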

Only a small fraction of cases showed patterns similar to HC (Figure 2C); among these, we identified carriers of SRSF2 and SF3B1 mutations (common SFMT) with low variant allele frequency (VAF) (Figure 2D), suggesting that mechanisms other than the presence of SFMT may drive splicing changes. In line with our initial theory, SFMT cases did not group entirely into pure mutant clusters but showed variable patterns similar to those of patients without mutations. Indeed, although SFMT represent the most extreme pole of the mis-splicing spectrum, we also found that patients lacking obvious SFMT were characterized by distinct patterns as well, albeit with a lesser degree of change in PSI and GE levels. Moreover, SFMT were more uniformly enriched in MDS clusters than expected, probably as a result of the high frequency of some of these mutations in MDS as opposed to AML (Figure 2E).

Splicing clusters and mutational distribution

When we compared the distribution of somatic mutations, other than SFMT, across the identified clusters, multimodal molecular signatures were identified in each cluster (Figure 3). About 90% of patients carried at least one mutation in a panel of myeloid-associated genes (Table S1). C-1 and C-3, and mostly C-7 and C-12 were predominantly composed of SF3B1 mutants. In particular, these patients also had two distinctive patterns of PSI: patients in C-1 and C-12 were characterized by significantly decreased splicing efficiency of distinct genes (e.g., DDX20 and KDM3B), which clearly differed from C-7. Indeed, C-7 showed increased PSI levels of the most altered AS events in the KDM3B, PI4KB, and AHNAK genes. A similar distinction was obtained in clusters with SRSF2 mutations showing differential splicing patterns in KDM3B and CYBB (Figures S5–S7). In general, SF3B1 mutations were enriched in C-1, C-3, C-7, and C-12, whereas SRSF2 mutations dominated C-14 and C-8 together with TET2. However, non-SFMT were equally distributed in different clusters. For example, carriers of DNMT3A, TET2, and FLT3 hits clustered in C-0, C-5, and C-6, while DNMT3A and FLT3 mutants were equally distributed in C-10. C-2, C-7, and C-13 showed PSI features similar to HC. C-2 and C-13 contained mainly TET2 and ASXL1 mutations, whereas C-7 had SF3B1 and TET2 lesions. For some transcript-defining clusters, similar PSI values were observed in SFMT and non-mutant cases, suggesting that some splicing changes occur without clonal SFMT.

Figure 3.

Mutational signature of clusters

Bar graphs depicting the distribution of lesions across clusters. Lesions include gene mutations or deletions (for genes on chromosomes 5, 7, 17, and X). Pie charts summarize the distribution (percentage) of disease types per cluster. The color code of the clusters is the same as in Figure 2E.

PSI similarities and differences across clusters

It has been reported that SRSF2 mutations promote mis-splicing and degradation of EZH2 via the inclusion of a conserved “poison” cassette exon, which leads to a premature termination codon resulting in nonsense-mediated mRNA decay.8 As shown in Figure S8, we found that this mechanism can be associated also with other SFMT. In fact, patients with SRSF2, LUC7L2, and PRPF8 mutations were characterized by a high correlation of PSI values for EZH2, thereby manifesting similar splicing changes for this gene. Deregulation of genes involved in iron metabolism and trafficking, such as the iron importer SLC25A37, has been implicated in the pathogenesis of SF3B1 mutant MDS, as shown by our group in an independent cohort of patients4 and by others.3,15 Here, while confirming that patients with SF3B1 mutations shared similar PSI values for SLC25A37 gene, we also found that these changes are common to other SFMT (e.g., LUC7L2, PRPF8, and ZRSR2) (Figure S8). Specifically, for PRPF8, this finding might be related to the occurrence of mutations in this gene in patients with MDS with ringed sideroblasts (RS), a morphologic feature associated primarily with SF3B1 mutations.16

Next, we explored the differences between clusters based on splicing changes represented as ΔPSI, which was defined as the difference in PSI values across clusters. We noticed that C-10 (pure AML cluster) showed much less PSI variation compared to other AML clusters, segregated more closely to HC (C-2, C-7, and C-13), and was distant from other AML clusters (C-0, C-5, and C-9) (Figures S9A and S9B). We then investigated the latent feature correlations of mutations and GE (Figure S10). One sub-cluster was characterized by the presence of SF3B1 mutations and by PSI features analogous to those of non-SF mutations (e.g., BCORL1, DNMT3A, GNAS, and TET2), possibly suggesting a process of phenocopying. Similar PSI association values were also seen for specific genes in DDX41, LUC7L2, PRPF8, and SF3B1 mutations, implying common downstream targets (Figure S8). A sub-cluster of patients with TP53 mutations shared PSI values similar to DDX41, LUC7L2, and PRPF8. The association of TP53 and PRPF8 is of particular interest considering their genomic proximity (both mapping to 17p) and the appearance of RS in cases without SF3B1 mutations and with high-risk features.16 When we performed a sub-analysis of RNA splicing and GE of patients with single PRPF8 (n = 34), SF3B1 (n = 281), and TP53 (n = 52) mutations, we found high covariance features for the alpha-globin (HBA2) gene, which was previously shown to correlate with impaired erythrocyte maturation in SF3B1 mutant MDS.17 Furthermore, by applying gene set overrepresentation analysis to PRPF8 and TP53 mutants, we clustered the top five categories (FDR < 0.001) in erythrocyte differentiation (KLF1 and EPB42), homeostasis, and development (ALAS2, GATA1, RHAG, and SLC4A1), suggesting that these genes might inversely correlate with HBA2 expression and be responsible for the formation of RS in SF3B1 wild-type cases (data not shown).
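For readers less familiar with the metric, PSI and ΔPSI can be illustrated with a short sketch. It assumes the simplest read-count definition of PSI (inclusion reads divided by inclusion plus exclusion reads) and takes ΔPSI as the difference between cluster means; the study's own pipeline may normalize reads or summarize clusters differently, and the read counts below are hypothetical.

```python
def psi(inclusion_reads, exclusion_reads):
    """Percent spliced in: share of reads supporting exon inclusion."""
    total = inclusion_reads + exclusion_reads
    if total == 0:
        return None  # event not covered in this sample
    return inclusion_reads / total


def mean_psi(samples):
    """Mean PSI over covered samples; samples are (inclusion, exclusion) pairs."""
    values = [v for v in (psi(i, e) for i, e in samples) if v is not None]
    return sum(values) / len(values)


def delta_psi(cluster_a, cluster_b):
    """Difference in mean PSI between two clusters for one splicing event."""
    return mean_psi(cluster_a) - mean_psi(cluster_b)


# Hypothetical read counts for one exon-skipping event in two clusters.
cluster_a = [(80, 20), (90, 10), (0, 0)]   # third sample has no coverage
cluster_b = [(30, 70), (50, 50)]
d = delta_psi(cluster_a, cluster_b)
```

Here the event is included in about 85% of transcripts in the first cluster and 40% in the second, giving a ΔPSI of roughly 0.45. Samples without coverage are skipped rather than counted as zero, since an uncovered event carries no information about inclusion.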

Common features across clusters

To define the most important AS changes informative for the clusters, we first filtered for genes with high mean GE and high correlation with the latent space across the dataset. To further characterize our clusters (C-0 through C-14, termed observation-clusters hereinafter) and AS events, we extracted the top 15 AS events with high correlation with the latent dimensions and grouped them into 11 clusters (F-1 to F-11 termed feature-clusters hereinafter) using Ward’s criterion-based hierarchical clustering (Figure S11). As expected, we observed two major groups of PSI patterns separating observation-clusters in group 1 (C-2, C-4, C-10, C-7, C-8, C-13, and C-6) and group 2 (C-14, C-3, C-0, C-11, C-12, C-1, C-9, and C-5). Of note is that AS events in feature-clusters F-8, F-10, and F-11 exhibited characteristics differentiating observation-clusters in the first major group. For instance, observation-clusters C-6 and C-13 separated from the rest by increased PSI levels for AS events in feature-cluster F-8, which included SUZ12, RBM5, and USP24 as well as decreased PSI levels for AS events in feature-cluster F-11, which included instead MAPK14, NDUFV2, and STK4. Furthermore, AS changes in feature-cluster F-10 differentiated observation-clusters C-6 and C-13, with observation-cluster C-6 showing increased PSI levels for AS events mapping to SF3B3, KDM6A (UTX), and ATP5F (refer to Table S2 Feature-clusters of top alternative splicing events in observation-clusters for a list of AS changes). Investigating the second major observation group, feature-cluster F-9 showed similar patterns to feature-cluster F-11, including AS events for USP12, CUL3, and IKZF1. 
In addition, observation-clusters C-1 and C-5 differentiated from the rest by increased PSI levels in feature-clusters F-4 and F-8, whereas feature-cluster F-5 separated observation-clusters C-1 and C-5 and included RSRC2 and ATP2B4 (Figure S11 and Table S2 Feature-clusters of top alternative splicing events in observation-clusters).

Expression signatures across clusters

We then compared transcriptomic data using Limma for differential GE analysis (supplemental information) of two major groups of clusters. Of note is that the top 100 differentially expressed genes in our analysis included those previously associated with MN (Figure 4A, upper and lower panels and Table S3 List of differential expressed genes). Gene set enrichment analysis showed several significantly dysregulated pathways, ranging from ribosome biogenesis/mRNA translation and energy metabolism to immunity (Figure S12 and Table S4 List of genes obtained through gene set enrichment analysis). Focusing solely on the expression levels of the seven SF genes, we observed a degree of separation across the clusters populated by SF3B1 and SRSF2 mutations in addition to HC (Figure S13). We also investigated whether the same genes are both differentially expressed and differentially spliced. We extracted AS events differentially regulated between the two major clusters and highly correlated with differentially expressed genes based on latent features (Pearson’s r ≥ 0.75). We observed 35 common genes that were both differentially spliced and differentially expressed (Figure S14). Furthermore, based on differential expression, a significant enrichment of ribosomal genes was detected, with some of them being downregulated in specific clusters. Of note is that RPS14 was downregulated in seven clusters (C-2, C-4, C-6, C-7, C-8, C-10, and C-13), some of which included patients with deletion of the long arm of chromosome 5 (5q) (Figure 4B, upper and lower panels).
RPS14 haploinsufficiency is known to be a hallmark of MDS with deletion 5q and is also involved in the observed exquisite sensitivity to lenalidomide of this MDS subgroup.18,19 Metabolic reprogramming has been associated with MDS carrying SF3B1 mutations.20 We found altered energy metabolism with decreased expression levels in complexes of the mitochondrial respiratory chain (ubiquinol-cytochrome C reductases and oxidases, NADH dehydrogenase) in SF3B1 signature clusters (C-2 and C-7) (Figure 4C, upper and lower panels). Of particular interest are the findings of altered adaptive immunity and antigen processing genes in our cohort of patients. Leukemic cells can escape T cell surveillance as a consequence of downregulation of human leukocyte antigens (HLA).21,22 Indeed, we found differential expression of various HLA genes (HLA-B, HLA-E, HLA-DRB1, and HLA-DRB5). Of note is that C-10 (pure AML cluster) was characterized by a general downregulation of all the above-mentioned HLA genes accompanied by downregulation of proteasome complex genes crucial in antigen presentation (PSMA5, PSMB4, PSMC3, and PSMD13) (Figure 4D, upper and lower panels).
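The correlation filter used above to link splicing events with differentially expressed genes (Pearson's r ≥ 0.75) can be sketched in a few lines. This is an illustrative simplification: the study computed the correlations on the basis of latent features, whereas the sketch correlates per-sample values directly, and the event and gene names are placeholders rather than study data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlated_pairs(psi_by_event, expr_by_gene, threshold=0.75):
    """Return (event, gene, r) for every pair with r >= threshold."""
    hits = []
    for event, psi_values in psi_by_event.items():
        for gene, expr_values in expr_by_gene.items():
            r = pearson_r(psi_values, expr_values)
            if r >= threshold:
                hits.append((event, gene, r))
    return hits


# Hypothetical per-sample values; the names are placeholders, not study data.
psi_by_event = {"event_1": [0.1, 0.3, 0.5, 0.9]}
expr_by_gene = {"gene_A": [1.0, 2.1, 2.9, 5.2], "gene_B": [4.0, 1.0, 3.5, 2.0]}
hits = correlated_pairs(psi_by_event, expr_by_gene)
```

Only the pair whose PSI and expression rise together passes the threshold in this toy example. Whether strong negative correlations should also be retained (for instance by thresholding the absolute value of r) is a design choice the text does not specify.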

Figure 4.

Differential gene expression analysis of clusters

(A, upper panel) Limma analysis of two groups with differential gene expression levels. (A, lower panel) Volcano plot showing statistical significance (log10 FDR-adjusted p value) versus fold change of down- or upregulated expression (log2) comparing the two groups.

(B–D) Screenshots of heatmaps visualizing gene expression and enrichment score curves for three specific pathways.

These results confirm that our multimodal analysis recognized functional clusters of patients beyond the fraction possibly distinguishable when applying single-data input (mutations, GE, or PSI only), providing a tool to identify convergent geno-transcriptomic interactions explaining therapeutic vulnerabilities for clinical interventions.

To further capture the multimodal nature of the analysis, we created a global map by the combination of all the dimensions for top features (Figure S15).

Exploration of the impact of the clusters on survival outcomes

Next, we investigated the prognostic significance of the identified patient clusters. Uniform manifold approximation and projection (UMAP) visualization showed a map of similarities across the observation clusters populated by similar disease types (MDS: C-1, C-12, and C-3) but also indicated that some clusters enriched in specific mutations were rather similar (SF3B1, C-12 and SRSF2, C-14) (Figure S16). Therefore, we analyzed whether these similarities and/or differences were reflected in clinical outcomes. The relevance of the clusters was reflected in different overall survival (OS) in the majority of pairwise comparisons (Figures 5A–5C). For instance, C-10 showed significant survival differences when compared to C-0, C-5, and C-9, which have a similar AML fraction (Figure S17A). On the other hand, C-12 (characterized by SF3B1 mutations) showed better survival than the other MDS clusters (Figure S17B).

Figure 5.

Survival differences of clusters

(A–C) (A) Comparison of survival between clusters based on the Kaplan-Meier estimator and the log-rank test. Log-transformed p values for each pairwise comparison are shown. (B and C) Comparison of individual clusters assigned to IPSS-R high-risk: survival functions were assessed using the Kaplan-Meier estimator and compared by log-rank test for C-12 and C-11 (B) and for C-12 and C-7 (C). N = number of observations.
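The Kaplan-Meier estimator underlying these comparisons can be written compactly. The sketch below is a minimal plain-Python version for a single group and is offered only as an illustration: it does not include the log-rank test, and the follow-up times and event indicators are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times  : follow-up time per patient
    events : 1 if death was observed at that time, 0 if censored
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        leaving = sum(1 for ti in times if ti == t)
        if deaths:
            # Multiply by the conditional probability of surviving past t.
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # deaths and censored patients both leave the risk set
    return curve


# Hypothetical follow-up (months) for one cluster; 0 marks a censored patient.
times = [5, 8, 8, 12, 20]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
```

With these toy data the estimated survival steps down to about 0.8, 0.6, and 0.3 at months 5, 8, and 12. Censored patients contribute to the number at risk until their last follow-up but never trigger a step, which is what lets the estimator use incomplete follow-up.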

Thus, OS might be the result of a diverse combination of features, including our newly identified splicing patterns. To highlight this, we combined the latent features resulting from our multidimensional model with survival outcomes in a continuous fashion by means of survival forests and a permutation method, and identified the components with the highest associations with OS. For instance, latent feature/dimension 21 showed higher cumulative OS with increasing feature levels, as opposed to latent feature/dimension 24 (Figures S17C and S17D). Examining the observation-clusters in relation to latent features 21 and 24, we noticed a relationship across clusters, further emphasizing the complex genomic and transcriptomic makeup of MN (Figure S11). In addition, to capture the impact of modifiers such as disease type, age, gender, and treatment on OS, we fitted a Cox proportional hazards (CoxPH) regression model integrating relevant clinical variables. Stratifying by time to account for the proportionality assumption in relevant variables, we identified significant associations between latent features and OS, suggesting their clinical utility (Figures S18A and S18B).

Finally, in order to evaluate the extent of stochasticity in the identified clusters, we applied the proposed framework over 100 runs and quantified the overlap by the adjusted Rand index (ARI). ARI showed relatively high stochasticity of the clusters, with a mean ARI of 0.55 (Figure S19A). To investigate whether a model without variational inference could reduce the stochasticity, we used an autoencoder with the same architecture and included Gaussian noise in the latent layer. Indeed, the mean ARI increased to 0.65 (Figure S19B), suggesting that the variational inference was restrictive in identifying features and resulted in possible underfitting.
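As a reminder of what the ARI measures, the sketch below computes it from two label vectors using the standard pair-counting formula. It is a generic illustration rather than the code used in the study, and the two four-sample "runs" are hypothetical. An ARI of 1 means the two runs produce identical partitions (up to relabeling), while values near 0 indicate agreement no better than chance.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two clusterings of the same samples."""
    n = len(labels_a)
    pairs = Counter(zip(labels_a, labels_b))   # contingency table cells
    rows = Counter(labels_a)                   # cluster sizes, run A
    cols = Counter(labels_b)                   # cluster sizes, run B

    sum_cells = sum(comb(c, 2) for c in pairs.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / comb(n, 2)
    maximum = 0.5 * (sum_rows + sum_cols)
    if maximum == expected:   # degenerate case, e.g. a single cluster in both
        return 1.0
    return (sum_cells - expected) / (maximum - expected)


run_1 = [0, 0, 1, 1]
run_2 = [0, 0, 1, 2]   # second run splits one cluster in two
ari = adjusted_rand_index(run_1, run_2)
```

In this toy case the ARI is 4/7, or about 0.57. Averaging such pairwise values over many training runs gives a stability summary of the kind reported above.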

Discussion

Recently, ML-based tools have been increasingly deployed in medicine in an attempt to comprehensively characterize different disease populations. Particularly in hematology, ML has been adapted to analyze relevant morphologic or genetic patterns shedding light onto specific patient groups and pathologic sub-classifications, particularly in MN.23,24,25 Nonetheless, despite the huge efforts, all these studies have taken into account genomic information in a unimodal fashion, chiefly focusing on mutational status as a binary variable. The recent discoveries in MN showed that mutations account only for a fraction of both initiating and disease modifier events. For instance, we demonstrated that TET2 dysfunction26 is only marginally explained by the mutational events of this gene, emphasizing how focusing only on mutations may not be sufficient when examining multifaceted diseases such as myeloid disorders, well known to be characterized also by transcriptional modifications and AS in a non-negligible fraction of patients.

In recent years, the availability of large omics datasets along with ML-based methods has made it possible to aid diagnosis and to inform on the prognosis and therapeutic vulnerabilities of patients with cancer, opening a new era of personalized medicine.27 However, multiple omics datasets are generally combined in a post hoc fashion, whereas more formal approaches such as canonical correlation analysis can benefit from increasingly large datasets and may be more suitable for diseases showing complex interactions among different layers of geno-transcriptomic information. This is also the case for deep learning algorithms, which have been applied to high-dimensional data to dissect clinical associations from morphologic features in MDS28,29 or to update prognostic assessment and sub-classification using geno-clinical information in both AML and MDS.23,24,25,30 The importance of such models is also demonstrated by the efforts of the GENOMED4ALL project (NCT04889729), which is currently recruiting patient cohorts from multiple institutions with the aim of developing artificial intelligence-based tools to improve MDS classification and prognostication.

In an attempt to overcome the limitations of studies taking into consideration only one aspect of the complex geno-transcriptomic landscape of MN, we took advantage of ML algorithms to combine the characteristic features of MN, which, if considered alone, only account for part of its pathobiological dysregulation. Using autoencoders with variational inference, we combined traits of genomic lesions (mutations and deletions), PSI values, and GE particularly considering that transcriptomic changes are often the mirror of PSI values (assuming that these values identify AS events). These methods were followed by unsupervised clustering to recognize discrete patient groups. To our knowledge, this study represents the first comprehensive attempt to incorporate all of the above-mentioned MN features in one multilayered model.

Facing the limitations in assigning distinct clusters by using a single layer of data, we combined all features in one unsupervised clustering defining 15 groups of patients (C-0 through C-14). Notably, all clusters showed some extent of RNA splicing variation (assessed as different PSI values), a finding that goes beyond the mere presence of mis-splicing dictated by SFMT. We were able to discern two main groups, AML (C-9 and C-10) and MDS, with typical SFMT (e.g., SF3B1 and SRSF2) fitting in distinct clusters (SF3B1: C-1, C-7, and C-12; SRSF2: C-8 and C-14). Among the potential therapeutic implications of our multimodal analysis, RPS14 offers the clearest example. Its downregulation was a feature common to seven of our clusters, but only a fraction of the cases carried deletion 5q.31,32 In such patients, lenalidomide may represent a potential treatment strategy, given the similarities shared at an even higher level of complexity than mere GE, as now resolved by our multimodal analysis. Another example is represented by PRPF8 mutations, which we found enriched in patients with deletions of the short arm of chromosome 17 (17p).16 Indeed, we observed a sub-cluster of patients with TP53 mutations sharing similar PSI values with PRPF8, possibly due to the shared 17p genomic location.
The similarities between the two groups are also evident when considering overlapping morphologic features such as the acquired pseudo-Pelger-Huet anomaly, previously described in PRPF8-mutant patients with deletions encompassing the TP53 locus.16 These likely phenocopying mechanisms may open therapeutic opportunities (e.g., luspatercept) in PRPF8 or TP53 mutant MDS cases lacking SF3B1 mutations and presenting with RS and transfusion-dependent anemia.33 In addition, the identification of immunoedited clusters with general downregulation of HLA genes and of the antigen presentation machinery may identify patients with an increased likelihood of response to immune-checkpoint inhibitors or immunomodulatory strategies, an unmet clinical need in MN.34 Finally, the creation of a unified space allowed us to test whether interdependencies among all features could translate into survival differences across the patient clusters, assuming that the levels of splicing changes would be combined with other crucial factors.

We also quantified the stochasticity of the proposed workflow in order to assess the stability of the clusters. We generated multiple training runs of the proposed model and iterations of samples from each of the training runs. Using the adjusted Rand index, we quantified the overlap of the identified clusters. The choice of a stochastic model was driven by the rationale of performing robust tests to study genetic variations in a disease that reflects complex genetics and biology. Our goal was to identify ubiquitous splicing patterns relevant for disease biology. The heterogeneity of disease phenotypes and their clinical outcomes would be better captured if the model worked in a restricted fashion. To assess alternative models, we tested a vanilla autoencoder with Gaussian noise injected at the latent layer and found an increased sensitivity (as shown by a mean ARI of 0.65). This suggests that it is feasible to apply ML tools to study the genetics and biology of MN and that there is considerable room for improvement in future studies.

In sum, this is the first integrative analysis to examine the complex nature of MN, further reinforcing the urgent need for analytic strategies that unify different levels of information, an area where ML approaches can excel. However, although ML approaches can be powerful, translating their results into diagnostic, prognostic, or therapeutic applications is not trivial, owing both to the inherent black-box nature of such methods and to legal and ethical frameworks that have not yet been updated.35 Nevertheless, our study opens a new avenue of research on personalized, multilayer-based patient evaluation. It represents a first step toward the application in MN of “digital twins”: virtual phenocopies of a patient, constructed by integrating multimodal geno-transcriptomic information, that can be safely, quickly, and economically challenged in silico to explore therapeutic vulnerabilities.36

Limitations of the study

We acknowledge several limitations of the study. MN are very heterogeneous diseases in which a number of factors affect clinical outcomes. Consequently, OS is agnostic to functional similarity and to common pathogenetic mechanisms, because survival does not necessarily correlate with mechanism. For instance, patients with similar survival outcomes may have distinct pathogenesis. Therefore, although our study explores the impact of cluster features on OS, it does not assess whether the multimodal measurements perform better than standard clinical prognostic algorithms. Further studies are needed to better elaborate the association of global variations in omics signatures with time-to-event data.

Furthermore, our study focused on the “shared/co-varying” features across multiple-omics to delineate possible clusters, which constituted an intrinsic limitation.

In regard to incorporation of results of CNV, CNV analysis was restricted to the genomic location of the four splicing genes DDX41, LUC7L2, PRPF8, and ZRSR2 mapping on chromosomes 5, 7, 17, and X, respectively.

Further studies on purified cell populations are warranted, although they are still hampered by the challenges of splicing analysis at the single-cell level.

In terms of validation of the prospective use of the clusters, an external validation cohort of similar magnitude and disease representation is lacking. With regard to internal validation, while the model offers a proof-of-concept application of multimodal analysis in MN, it must be acknowledged that future studies have room for improvement, for example by using more flexible distributions and/or a higher number of latent layers. Indeed, although it supported our hypothesis that seemingly ubiquitous splicing patterns relevant for disease sub-classification are present, our model showed a mean ARI of 0.55. This might be due to the relatively restrictive isotropic Gaussian prior and to the shallow architecture of the network; more flexible distributions to approximate the posterior, or a deeper architecture, could help further characterize the heterogeneity. Indeed, using an autoencoder without variational inference improved the mean ARI across multiple runs. Furthermore, since the clustering is restricted to the shared layer of the autoencoder, patterns relevant to single data types are disregarded by default and only co-varying features are extracted.

STAR★Methods

Key resources table

Resource availability

Lead contact

Further information and requests should be directed to the lead contact, Valeria Visconte, Ph.D. (visconv@ccf.org).

Materials availability

This study did not generate any new reagents.

Method details

Patient specimens

Bone marrow (BM) specimens of 1,258 patients with MN, classified according to WHO 2016 guidelines, and 63 healthy subjects (Table S5) were collected after written informed consent was obtained, in accordance with the Declaration of Helsinki and with approval from the Institutional Review Boards. The study design is shown in Figure 1.

DNA and RNA sequencing studies

Samples for sequencing studies were collected at diagnosis to reduce the effects of disease duration on latent clusters. For whole genome sequencing (WGS), 1 μg of DNA was used. Libraries were prepared with the TruSeq PCR free library prep kit (Illumina, San Diego, CA) and 150bp paired-end sequences were generated on a NovaSeq 6000 or HiSeqX instrument (Illumina) with a median of 100x coverage. Reads were aligned to the human genome assembly (GRCh37 Ensembl annotation, hg19) using the Isaac3 aligner (version 03.16.02.19).37 To identify somatic mutations, the following filters for variant calling were applied: a) only variants with a minimum depth of 10 reads and at least 4 reads supporting the alternate allele were retained; b) synonymous variants, polymorphisms with a population frequency >1%, and potential germline and benign variants were not included; c) nonsynonymous variants (missense, stop codon, frameshift, indels) were included and filtered according to sequenced control databases [gnomAD (https://gnomad.broadinstitute.org/) (≤0.1%), dbSNP13850, 1000Genomes] and databases of mutations including COSMIC, ClinVar, and HePPy (Hematological Predictor of Pathogenicity) for pathogenicity confirmation. Due to the lack of paired normal tissues, a mixture of genomic DNA from multiple unidentified healthy subjects was used as control. SNVs and indels were called using Strelka2 software (version 2.4.7; Illumina, San Diego, CA), followed by Ensembl Variant Effect Predictor annotation. CNV imputation was performed using GATK4 (Broad Institute, Cambridge, MA). CNV and karyotype of chromosomes 5, 7, 17, and X were used to assess deletions encompassing DDX41, LUC7L2, PRPF8, and ZRSR2. To capture robustness of our features in relation to high mutational burden, we filtered out mutations with a variant allele frequency (VAF) < 15% in downstream analysis. Germline mutations were not the purpose of this study and were not retained. Results from a panel of myeloid-related genes (Table S1) were used for our analyses.
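As an illustration only, the read-depth, population-frequency, and VAF thresholds described above can be sketched in Python as follows. The `Variant` record, its field names, and the consequence labels are hypothetical and chosen for this example; the sketch is not the authors' pipeline, which relied on Strelka2 calls and Ensembl VEP annotation, and it omits the database-based pathogenicity checks.

```python
from dataclasses import dataclass

@dataclass
class Variant:
    gene: str
    depth: int            # total reads covering the site
    alt_reads: int        # reads supporting the alternate allele
    population_af: float  # population allele frequency as a fraction (0-1)
    consequence: str      # hypothetical label, e.g. "missense" or "synonymous"

# Hypothetical consequence labels standing in for the nonsynonymous classes
# named in the text (missense, stop codon, frameshift, indels).
NONSYNONYMOUS = {"missense", "stop_gained", "frameshift", "indel"}

def vaf(v: Variant) -> float:
    """Variant allele frequency: alternate reads over total depth."""
    return v.alt_reads / v.depth if v.depth else 0.0

def passes_filters(v: Variant, min_depth=10, min_alt=4,
                   max_pop_af=0.001, min_vaf=0.15) -> bool:
    """Apply the thresholds quoted in the text: depth >= 10, >= 4 alternate
    reads, population frequency <= 0.1%, and VAF >= 15%."""
    return (v.depth >= min_depth
            and v.alt_reads >= min_alt
            and v.consequence in NONSYNONYMOUS
            and v.population_af <= max_pop_af
            and vaf(v) >= min_vaf)

# Toy input, invented for this example.
variants = [
    Variant("SF3B1", 120, 48, 0.0, "missense"),    # kept
    Variant("TET2", 8, 5, 0.0, "frameshift"),      # depth too low
    Variant("ASXL1", 200, 12, 0.0, "frameshift"),  # VAF 6% < 15%
    Variant("DNMT3A", 90, 40, 0.02, "missense"),   # common polymorphism
]
kept = [v.gene for v in variants if passes_filters(v)]
print(kept)
```

Keeping the thresholds as keyword arguments makes it easy to check how sensitive the retained mutation set is to each cutoff.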

RNA-Seq was performed as previously described by our group.9 Total RNA (250 ng) from BM specimens was prepared using the Illumina TruSeq Total Stranded RNA library preparation kit. Paired-end reads of 100bp were sequenced with a median depth of 50 million reads/sample. Sequences were aligned to the human reference genome hg19 using STAR 2.5.0, and gene counts were estimated using Cufflinks (v2.2.1).38 Total gene-level expression was expressed in transcripts per million (TPM). To quantify splice junctions, for each patient we calculated PSI for alternatively spliced/skipped exons, alternatively spliced/retained introns, and alternative 5′ or 3′ splice site usage.
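For readers unfamiliar with PSI, a minimal sketch of a length-normalized exon-inclusion ratio is given below. It assumes the commonly used definition in which inclusion and skipping read counts are each divided by the effective length of the corresponding isoform before forming the ratio; the function and argument names are ours and purely illustrative.

```python
def psi(inc_reads, skip_reads, inc_len, skip_len):
    """Length-normalized percent spliced in, returned as a fraction (0-1).

    inc_reads / skip_reads: reads supporting the inclusion / skipping isoform.
    inc_len / skip_len: effective lengths used to normalize each count.
    Returns None when no reads support either isoform.
    """
    inc = inc_reads / inc_len
    skip = skip_reads / skip_len
    total = inc + skip
    if total == 0:
        return None
    return inc / total

# 30 inclusion reads over an effective length of 2 and 10 skipping reads
# over an effective length of 1 give 15 / (15 + 10) = 0.6.
print(psi(30, 10, 2, 1))
```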

Analysis of alternative splicing

Analysis of alternative splicing (AS) was conducted as previously described.9 rMATS (python rMATS.4.0.1) was used to annotate junction counts for AS events. We removed AS events in which the total skipped and included read counts were <10. Constitutive splice events were defined as events for which no sample had a percent spliced in (PSI) between 10% and 90%; such events were excluded from analysis. Skipped exon (SE) junctions labeled as alternative 5′ splice site (A5SS) or alternative 3′ splice site (A3SS), as well as unannotated/novel SEs, A3SS, and A5SS containing a UCSC-annotated intron, were also removed. The length-normalized PSI values were retained only for events in which the total count of reads mapping to the splicing event was greater than 10 in 95% of cases. PSI values with variance <0.01 were removed to keep the more informative events, resulting in 17k AS events as input. Somatic mutations for a panel of genes (Table S1) with VAF smaller than 15% were filtered out to keep only the most informative/possible driver mutations for analysis. Similarly, filtering transcripts per million (TPM) gene expression values for low variation resulted in 13k features as input to the model.
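The two PSI-based filters described above (removal of constitutive events and of low-variance events) can be sketched as follows. This is an illustrative reading, not the authors' code: it assumes PSI is stored as a fraction between 0 and 1, that the 0.01 variance cutoff applies on that scale, and that population variance is used. The event names and values are invented.

```python
from statistics import pvariance

def is_constitutive(psi_values, low=0.10, high=0.90):
    """An event is constitutive if no sample has PSI between 10% and 90%."""
    return not any(low <= p <= high for p in psi_values)

def keep_event(psi_values, min_variance=0.01):
    """Keep an event only if it is not constitutive and varies across samples."""
    observed = [p for p in psi_values if p is not None]
    if len(observed) < 2:
        return False
    if is_constitutive(observed):
        return False
    return pvariance(observed) >= min_variance

# Toy PSI matrix: event name -> PSI per sample (fractions).
events = {
    "always_included": [0.97, 0.99, 0.95, 0.98],  # constitutive
    "variable":        [0.20, 0.80, 0.50, 0.40],  # informative
    "flat_mid":        [0.50, 0.52, 0.49, 0.51],  # variance too low
}
kept = [name for name, values in events.items() if keep_event(values)]
print(kept)
```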

Machine learning approaches

Variational Autoencoders Network. Autoencoders are neural networks structured to learn a low-dimensional representation of input data. More recently, variational inference has been integrated to build robust stochastic encoder and decoder models,39,40 which have been widely applied, including in single-cell RNASeq41,42,43 and image generation.44,45 Simply, the variational autoencoder (VAE) model comprises an encoder/inference model, which approximates the intractable posterior distribution of latent variables p(z|x) with a tractable alternative q(z|x), usually an isotropic Gaussian distribution, and a decoder/generative model, which learns the joint distribution of latent variables and observations p(x,z). Here, we used a vanilla version of VAEs trained with cyclical KL annealing to generate latent embeddings capturing co-varying features of GE, PSI, and variants (mutations plus deletions).46 The network is structured to include a single shared layer and three non-shared layers for the individual data types, similar to canonical correlation analysis (CCA), allowing us to capture the co-varying features in the shared layer. A representation of the network for the shared layer is given in Figure S2. The number of latent features/dimensions was selected via cross-validation. Both the number of training iterations and the number of latent features were chosen at the point where the error for each data type starts to increase. We chose 32 latent features and 150 training epochs. Furthermore, variational inference was performed with cyclical Kullback-Leibler (KL) annealing, in which the KL divergence weight is increased from 0.0 to 1.0 in steps of 0.1 every 5 epochs to prevent posterior collapse. Training was performed using the Adam optimizer with the learning rate set to 1e-3 and the batch size set to 64 in the TensorFlow framework.47
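The annealing schedule described above can be sketched as a small function of the epoch index. The step size (0.1) and step interval (5 epochs) come from the text; the cycle length is not stated, so `cycle_length=75` (two cycles over 150 epochs) is a hypothetical value chosen only for illustration.

```python
def kl_weight(epoch, step=0.1, epochs_per_step=5, cycle_length=75):
    """KL divergence weight at a given (0-based) epoch.

    Within each cycle the weight rises from 0.0 to 1.0 in increments of
    `step` every `epochs_per_step` epochs, then stays at 1.0 until the
    cycle restarts. `cycle_length` is an assumption, not a reported value.
    """
    position = epoch % cycle_length
    weight = step * (position // epochs_per_step)
    return min(1.0, round(weight, 10))  # round away float noise such as 0.30000000000000004

# Weight per epoch for a 150-epoch run.
schedule = [kl_weight(e) for e in range(150)]
print(schedule[0], schedule[5], schedule[50], schedule[75])
```

In a training loop the returned value would multiply the KL term of the loss, so that early epochs in each cycle emphasize reconstruction.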

To determine the number of latent dimensions, we used cross-validation and selected the minimum point where none of the different data types showed an increase in error. We used 1) mean squared error for PSI and GE and 2) log-loss for mutations. Inputs of the model were somatic mutations (presence/absence), CNV in cases with SFMT mapping on chromosomes 5, 7, 17, and X, GE, and PSI. PSI values were expressed as percentages. GE was considered as a continuous variable. VAF was not accounted for in GE analysis. After pre-processing, the total numbers of gene and PSI features were 13k and 17k, respectively.
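To make the per-modality errors concrete, the sketch below combines mean squared error (PSI and GE), log-loss (mutations), and a weighted KL term into a single training objective. It is a plain-Python illustration under stated assumptions: the three reconstruction terms are summed with equal weight (the text does not give the weighting), the KL term uses the standard closed form for a diagonal Gaussian against a standard normal prior, and the dictionary keys are hypothetical names.

```python
import math

def mse(x, x_hat):
    """Mean squared error, used here for the continuous PSI and GE inputs."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def log_loss(y, p, eps=1e-7):
    """Binary cross-entropy for presence/absence mutation calls."""
    total = 0.0
    for yi, pi in zip(y, p):
        pi = min(max(pi, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
    return total / len(y)

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))

def vae_loss(x, x_hat, mu, log_var, beta):
    """Reconstruction error over three data types plus beta-weighted KL.

    x and x_hat are dicts with the hypothetical keys 'psi', 'ge', 'mut';
    equal weighting of the three terms is an assumption.
    """
    recon = (mse(x["psi"], x_hat["psi"])
             + mse(x["ge"], x_hat["ge"])
             + log_loss(x["mut"], x_hat["mut"]))
    return recon + beta * kl_to_standard_normal(mu, log_var)

# Toy single-sample example with one feature per data type.
x = {"psi": [0.5], "ge": [1.0], "mut": [1]}
x_hat = {"psi": [0.5], "ge": [3.0], "mut": [0.5]}
loss = vae_loss(x, x_hat, mu=[1.0], log_var=[0.0], beta=0.5)
print(loss)
```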

Gaussian Mixture Model. To cluster samples embedded on the shared latent layer, we used a Gaussian mixture model (GMM). GMMs are widely used probabilistic models that can capture latent clusters, and they allow us to formalize the underlying generative model, improving its interpretability. We used the scikit-learn python package to fit GMMs using Expectation Maximization.48,49 The Bayesian Information Criterion (BIC) was then used to select the optimal number of components/clusters by comparing multiple schemes with full covariance.
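The BIC-based choice of the number of components can be illustrated without fitting a model. The sketch below counts the free parameters of a full-covariance GMM and picks the component number with the lowest BIC, given already-computed total log-likelihoods. The log-likelihood values are toy numbers invented for the example, not results from the study; in practice they would come from the fitted mixtures (scikit-learn exposes the same criterion as `GaussianMixture.bic`).

```python
import math

def gmm_n_parameters(n_components, n_dims):
    """Free parameters of a full-covariance Gaussian mixture."""
    weights = n_components - 1                                # mixing proportions
    means = n_components * n_dims                             # one mean vector each
    covariances = n_components * n_dims * (n_dims + 1) // 2   # symmetric matrices
    return weights + means + covariances

def bic(log_likelihood, n_components, n_dims, n_samples):
    """Bayesian Information Criterion; lower is better."""
    k = gmm_n_parameters(n_components, n_dims)
    return k * math.log(n_samples) - 2.0 * log_likelihood

def select_n_components(log_likelihoods, n_dims, n_samples):
    """Return the component count with the lowest BIC and all scores."""
    scores = {k: bic(ll, k, n_dims, n_samples)
              for k, ll in log_likelihoods.items()}
    return min(scores, key=scores.get), scores

# Toy total log-likelihoods for 1, 2, and 3 components (invented values).
toy_log_likelihoods = {1: -500.0, 2: -420.0, 3: -415.0}
best_k, scores = select_n_components(toy_log_likelihoods, n_dims=2, n_samples=200)
print(best_k)
```

The parameter count grows quadratically with the latent dimension, which is why the penalty term matters when comparing full-covariance schemes on a 32-dimensional embedding.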

Model validation: IPSS-R score and stochasticity of clusters

The clinical utility of the proposed clusters was assessed using the Revised International Prognostic Scoring System (IPSS-R), a standard that categorizes patients from very low to very high risk based on the risk of transformation to AML and of mortality. The score incorporates multiple statistically weighted clinical factors (Greenberg et al. 2012). After filtering out clusters with a low number of samples and focusing only on the IPSS-R low-risk group (scores ≤3.5) (Pfeilstocker et al. 2016), traditionally regarded as the group with a higher degree of heterogeneity of outcomes, we investigated survival differences using the log-rank test.

We then evaluated the overlap of the identified clusters across multiple runs using the adjusted Rand index (ARI), and found the overlap to be relatively low. To test an alternative model, we recalculated the ARI distribution over 100 runs without the KL term in training, which improved the ARI.
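For reference, the adjusted Rand index between two cluster assignments, and its mean over all pairs of runs, can be computed as in the following sketch. It uses only the Python standard library; the toy label vectors are invented, and how the authors paired runs is not specified, so averaging over all pairs is an assumption.

```python
from collections import Counter
from itertools import combinations
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same samples."""
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0
    # Pairs of samples placed together in both labelings.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / total_pairs
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0
    return (together_both - expected) / (maximum - expected)

def mean_pairwise_ari(runs):
    """Mean ARI over all pairs of clustering runs."""
    pairs = list(combinations(runs, 2))
    return sum(adjusted_rand_index(a, b) for a, b in pairs) / len(pairs)

# Toy runs: the second is a relabeling of the first, the third disagrees.
runs = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
]
print(mean_pairwise_ari(runs))
```

Because the index is invariant to label permutation, two runs that find the same partition under different cluster numbers still score 1.0.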

Feature similarity and bioinformatic analysis

Pairwise feature similarities for gene expression, mutations, and PSI were quantified based on cosine similarity of the decoder network weights. Since we utilized a single layer of linear mapping from latent features to observed features, cosine similarity of the network weights was used as a direct proxy for feature correlations. Differential gene expression was analyzed using the limma package:50 we normalized log2-scaled expression values via cyclic loess normalization and estimated FDR-adjusted p values across the 2 major groups via linear models with empirical Bayes shrinkage. Gene Set Enrichment Analysis was done using the fgsea R package with default parameters and the Reactome pathway gene sets obtained from the MSigDB database.51,52 Visualization using UMAP was done via the umap-learn python package with parameters set to default values.53
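The weight-based similarity can be sketched as follows, assuming each observed feature is represented by its vector of decoder weights across the latent dimensions. The feature names and weight values are invented for illustration and use three latent dimensions rather than the 32 of the actual model.

```python
import math
from itertools import combinations

def cosine_similarity(u, v):
    """Cosine of the angle between two weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical decoder weights: feature -> weights over 3 latent dimensions.
decoder_weights = {
    "GE:geneA":      [0.8, 0.1, 0.0],
    "PSI:eventB":    [1.6, 0.2, 0.0],   # same direction as geneA
    "MUT:mutationC": [0.0, 0.0, 0.5],   # orthogonal to both
}

similarities = {
    (f, g): cosine_similarity(decoder_weights[f], decoder_weights[g])
    for f, g in combinations(decoder_weights, 2)
}
for pair, value in similarities.items():
    print(pair, round(value, 3))
```

Because the measure depends only on direction, features from different data types with very different weight magnitudes can still be compared on the same scale.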

Survival analysis

Survival was estimated using the Kaplan-Meier estimator for each cluster with the R package survival, and pairwise comparisons were performed using the log-rank test.54 No corrections were applied for multiple testing in pairwise comparisons. A Cox proportional-hazards model was applied by integrating the latent dimensions with clinical factors (age, gender) and treatment information. Treatments with disease-modifying agents such as hypomethylating agents, supportive therapy, and induction chemotherapy or chemotherapy were included. OS was defined as the time from diagnosis of myeloid neoplasia to death or last follow-up. For latent features with significant deviation from the proportionality assumption, we corrected by adding a time-dependent interaction term with the time cutoff selected at the median follow-up of 35 months. To evaluate the effects of the latent features on survival in a non-parametric fashion, we used random survival forests via the R package randomForestSRC55 with default parameters to build a predictive model using the latent features. Furthermore, to quantify the importance of each latent feature for the survival predictions, we generated a permutation-based importance measure based on the concordance index.
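As a reminder of what the Kaplan-Meier estimator computes for each cluster, a minimal product-limit sketch is shown below. The analysis in the paper used the R package survival; this Python version is only an illustration, and the follow-up times and event indicators are invented.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times: follow-up time per patient (e.g. months from diagnosis).
    events: 1 if death was observed at that time, 0 if censored.
    Returns a list of (time, survival probability) at each death time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # deaths and censored patients leave the risk set
    return curve

# Toy cluster of five patients; two are censored (event = 0).
times = [5, 8, 8, 12, 20]
events = [1, 1, 0, 1, 0]
for t, s in kaplan_meier(times, events):
    print(t, round(s, 3))
```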

Acknowledgments

This work was supported by the Henry and Marilyn Taub Foundation, grants R01HL118281 (to J.P.M.), R01HL123904 (to J.P.M. and R.A.P.), R01HL132071 (to J.P.M. and R.A.P.), R35HL135795 (to J.P.M.), The Leukemia & Lymphoma Society TRP Award 6645-22 (to J.P.M.), AA&MDSIF (to V.V., S.P., and J.P.M.), VeloSano 9 Pilot Award, Vera and Joseph Dresner Foundation–MDS (to V.V.), Vera and Joseph Dresner Foundation–MDS (to R.A.P.), Fondation ARC pour la Recherche sur le Cancer and Italian Society of Hematology (to S.P.), Tito Bastianello MDS Young Investigator Awards (to C.G., S.P., and V.A.) and F31HL131140 (to C.E.H.). This work was also supported by a grant from the Edward P. Evans Foundation (to C.G.). This work made use of the High Performance Computing Resource in the Core Facility for Advanced Research Computing at Case Western Reserve University. We thank the Torsten Haferlach Leukämiediagnostik Stiftung for supporting this work.

Author contributions

A.D. conceptualized methodologies, data curation and analysis, trained the machine learning models, software, and wrote the manuscript. C.G. conceptualized data analysis and wrote the manuscript. C.E.H. analyzed raw data of splicing. S.P. edited the manuscript. N.D. helped in data collection. Hassan A. edited the manuscript. Hussein A. edited the manuscript. V.A. helped in data collection and interpretation of sequencing data. M.M. helped in data collection and edited the manuscript. B.P. edited the manuscript. Y.K. edited the manuscript. T.K. helped in data collection and editing. W.S.B. edited the manuscript. J.B. helped in data interpretation and editing. J.S. helped in data analysis and interpretation and editing. R.A.P. helped in data gathering, interpretation, and writing. T.H. helped in data collection, resources, and funding acquisition. J.P.M. contributed with resources, conceptualization, funding acquisition, and editing. V.V. conceptualized and supervised the study, made figures, helped in data interpretation, and wrote the manuscript.

Declaration of interests

The authors declare no competing interests.

Published: February 18, 2023

Footnotes

Supplemental information can be found online at https://doi.org/10.1016/j.isci.2023.106238.

Supplemental information

Document S1. Figures S1–S19 and Tables S1 and S5
mmc1.pdf (2.8MB, pdf)
Table S2. Feature-clusters of top alternative splicing events in observation-clusters, related to STAR Methods

Top 15 alternative splicing events per latent dimension with high correlation were grouped in 11 feature-clusters using hierarchical clustering based on Ward’s criterion. These feature-clusters were investigated per each observation-cluster to characterize informative changes across observation-clusters.

mmc2.xlsx (39.1KB, xlsx)
Table S3. List of differentially expressed genes, related to Figure 4

Differentially expressed genes between the two major myeloid neoplasia groups are provided. Data are expressed as log fold change and average expression.

mmc3.xlsx (1.3MB, xlsx)
Table S4. List of genes obtained through Gene Set Enrichment Analysis, related to Figure 4

A complete list of pathways clustered using R package fgsea is provided.

mmc4.xlsx (34KB, xlsx)

Data and code availability

References

  • 1.Pellagatti A., Armstrong R.N., Steeples V., Sharma E., Repapi E., Singh S., Sanchi A., Radujkovic A., Horn P., Dolatshad H., et al. Impact of spliceosome mutations on RNA splicing in myelodysplasia: dysregulated genes/pathways and clinical associations. Blood. 2018;132:1225–1240. doi: 10.1182/blood-2018-04-843771. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Shiozawa Y., Malcovati L., Gallì A., Sato-Otsubo A., Kataoka K., Sato Y., Watatani Y., Suzuki H., Yoshizato T., Yoshida K., et al. Aberrant splicing and defective mRNA production induced by somatic spliceosome mutations in myelodysplasia. Nat. Commun. 2018;9:3649. doi: 10.1038/s41467-018-06063-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Dolatshad H., Pellagatti A., Fernandez-Mercado M., Yip B.H., Malcovati L., Attwood M., Przychodzen B., Sahgal N., Kanapin A.A., Lockstone H., et al. Disruption of SF3B1 results in deregulated expression and splicing of key genes and pathways in myelodysplastic syndrome hematopoietic stem and progenitor cells. Leukemia. 2015;29:1798. doi: 10.1038/leu.2015.178. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Visconte V., Avishai N., Mahfouz R., Tabarroki A., Cowen J., Sharghi-Moshtaghin R., Hitomi M., Rogers H.J., Hasrouni E., Phillips J., et al. Distinct iron architecture in SF3B1-mutant myelodysplastic syndrome patients is linked to an SLC25A37 splice variant with a retained intron. Leukemia. 2015;29:188–195. doi: 10.1038/leu.2014.170. [DOI] [PubMed] [Google Scholar]
  • 5.Dolatshad H., Pellagatti A., Liberante F.G., Llorian M., Repapi E., Steeples V., Roy S., Scifo L., Armstrong R.N., Shaw J., et al. Cryptic splicing events in the iron transporter ABCB7 and other key target genes in SF3B1-mutant myelodysplastic syndromes. Leukemia. 2016;30:2322–2331. doi: 10.1038/leu.2016.149. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Park S.M., Ou J., Chamberlain L., Simone T.M., Yang H., Virbasius C.M., Ali A.M., Zhu L.J., Mukherjee S., Raza A., Green M.R. U2AF35(S34F) promotes transformation by directing aberrant ATG7 pre-mRNA 3' end formation. Mol. Cell. 2016;62:479–490. doi: 10.1016/j.molcel.2016.04.011. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Shirai C.L., Ley J.N., White B.S., Kim S., Tibbitts J., Shao J., Ndonwi M., Wadugu B., Duncavage E.J., Okeyo-Owuor T., et al. Mutant U2AF1 expression alters hematopoiesis and pre-mRNA splicing in vivo. Cancer Cell. 2015;27:631–643. doi: 10.1016/j.ccell.2015.04.008. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Kim E., Ilagan J.O., Liang Y., Daubner G.M., Lee S.C.W., Ramakrishnan A., Li Y., Chung Y.R., Micol J.B., Murphy M.E., et al. SRSF2 mutations contribute to myelodysplasia by mutant-specific effects on exon recognition. Cancer Cell. 2015;27:617–630. doi: 10.1016/j.ccell.2015.04.006. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Hershberger C.E., Moyer D.C., Adema V., Kerr C.M., Walter W., Hutter S., Meggendorfer M., Baer C., Kern W., Nadarajah N., et al. Complex landscape of alternative splicing in myeloid neoplasms. Leukemia. 2021;35:1108–1120. doi: 10.1038/s41375-020-1002-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Alsafadi S., Houy A., Battistella A., Popova T., Wassef M., Henry E., Tirode F., Constantinou A., Piperno-Neumann S., Roman-Roman S., et al. Cancer-associated SF3B1 mutations affect alternative splicing by promoting alternative branchpoint usage. Nat. Commun. 2016;7:10615. doi: 10.1038/ncomms10615. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Liu Z., Yoshimi A., Wang J., Cho H., Chun-Wei Lee S., Ki M., Bitner L., Chu T., Shah H., Liu B., et al. Mutations in the RNA splicing factor SF3B1 promote tumorigenesis through MYC stabilization. Cancer Discov. 2020;10:806–821. doi: 10.1158/2159-8290.CD-19-1330. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Canbezdi C., Tarin M., Houy A., Bellanger D., Popova T., Stern M.H., Roman-Roman S., Alsafadi S. Functional and conformational impact of cancer-associated SF3B1 mutations depends on the position and the charge of amino acid substitution. Comput. Struct. Biotechnol. J. 2021;19:1361–1370. doi: 10.1016/j.csbj.2021.02.012. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Yip B.H., Steeples V., Repapi E., Armstrong R.N., Llorian M., Roy S., Shaw J., Dolatshad H., Taylor S., Verma A., et al. The U2AF1S34F mutation induces lineage-specific splicing alterations in myelodysplastic syndromes. J. Clin. Invest. 2017;127:2206–2221. doi: 10.1172/JCI91363. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Zeng A.G.X., Bansal S., Jin L., Mitchell A., Chen W.C., Abbas H.A., Chan-Seng-Yue M., Voisin V., van Galen P., Tierens A., et al. A cellular hierarchy framework for understanding heterogeneity and predicting drug response in acute myeloid leukemia. Nat. Med. 2022;28:1212–1223. doi: 10.1038/s41591-022-01819-x. [DOI] [PubMed] [Google Scholar]
  • 15.del Rey M., Benito R., Fontanillo C., Campos-Laborie F.J., Janusz K., Velasco-Hernández T., Abáigar M., Hernández M., Cuello R., Borrego D., et al. Deregulation of genes related to iron and mitochondrial metabolism in refractory anemia with ring sideroblasts. PLoS One. 2015;10:e0126555. doi: 10.1371/journal.pone.0126555. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Kurtovic-Kozaric A., Przychodzen B., Singh J., Konarska M.M., Clemente M.J., Otrock Z.K., Nakashima M., Hsi E.D., Yoshida K., Shiraishi Y., et al. PRPF8 defects cause missplicing in myeloid malignancies. Leukemia. 2015;29:126–136. doi: 10.1038/leu.2014.144. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Conte S., Katayama S., Vesterlund L., Karimi M., Dimitriou M., Jansson M., Mortera-Blanco T., Unneberg P., Papaemmanuil E., Sander B., et al. Aberrant splicing of genes involved in haemoglobin synthesis and impaired terminal erythroid maturation in SF3B1 mutated refractory anaemia with ring sideroblasts. Br. J. Haematol. 2015;171:478–490. doi: 10.1111/bjh.13610. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Ebert B.L., Pretz J., Bosco J., Chang C.Y., Tamayo P., Galili N., Raza A., Root D.E., Attar E., Ellis S.R., Golub T.R. Identification of RPS14 as a 5q- syndrome gene by RNA interference screen. Nature. 2008;451:335–339. doi: 10.1038/nature06494. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.List A., Dewald G., Bennett J., Giagounidis A., Raza A., Feldman E., Powell B., Greenberg P., Thomas D., Stone R., et al. Lenalidomide in the myelodysplastic syndrome with chromosome 5q deletion. N. Engl. J. Med. 2006;355:1456–1465. doi: 10.1056/NEJMoa061292. [DOI] [PubMed] [Google Scholar]
  • 20.Dalton W.B., Helmenstine E., Walsh N., Gondek L.P., Kelkar D.S., Read A., Natrajan R., Christenson E.S., Roman B., Das S., et al. Hotspot SF3B1 mutations induce metabolic reprogramming and vulnerability to serine deprivation. J. Clin. Invest. 2019;129:4708–4723. doi: 10.1172/JCI125022. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Christopher M.J., Petti A.A., Rettig M.P., Miller C.A., Chendamarai E., Duncavage E.J., Klco J.M., Helton N.M., O'Laughlin M., Fronick C.C., et al. Immune escape of relapsed AML cells after allogeneic transplantation. N. Engl. J. Med. 2018;379:2330–2341. doi: 10.1056/NEJMoa1808777. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Pagliuca S., Gurnari C., Rubio M.T., Visconte V., Lenz T.L. Individual HLA heterogeneity and its implications for cellular immune evasion in cancer and beyond. Front. Immunol. 2022;13:944872. doi: 10.3389/fimmu.2022.944872. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Awada H., Durmaz A., Gurnari C., Kishtagari A., Meggendorfer M., Kerr C.M., Kuzmanovic T., Durrani J., Shreve J., Nagata Y., et al. Machine learning integrates genomic signatures for subclassification beyond primary and secondary acute myeloid leukemia. Blood. 2021;138:1885–1895. doi: 10.1182/blood.2020010603. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Bersanelli M., Travaglino E., Meggendorfer M., Matteuzzi T., Sala C., Mosca E., Chiereghin C., Di Nanni N., Gnocchi M., Zampini M., et al. Classification and personalized prognostic assessment on the basis of clinical and genomic features in myelodysplastic syndromes. J. Clin. Oncol. 2021;39:1223–1233. doi: 10.1200/JCO.20.01659. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Radakovich N., Meggendorfer M., Malcovati L., Hilton C.B., Sekeres M.A., Shreve J., Rouphail Y., Walter W., Hutter S., Galli A., et al. A geno-clinical decision model for the diagnosis of myelodysplastic syndromes. Blood Adv. 2021;5:4361–4369. doi: 10.1182/bloodadvances.2021004755. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Gurnari C., Pagliuca S., Guan Y., Adema V., Hershberger C.E., Ni Y., Awada H., Kongkiatkamon S., Zawit M., Coutinho D.F., et al. TET2 mutations as a part of DNA dioxygenase deficiency in myelodysplastic syndromes. Blood Adv. 2022;6:100–107. doi: 10.1182/bloodadvances.2021005418. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Voso M.T., Gurnari C. Have we reached a molecular era in myelodysplastic syndromes? Hematology. Am. Soc. Hematol. Educ. Program. 2021;2021:418–427. doi: 10.1182/hematology.2021000276. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Brück O.E., Lallukka-Brück S.E., Hohtari H.R., Ianevski A., Ebeling F.T., Kovanen P.E., Kytölä S.I., Aittokallio T.A., Ramos P.M., Porkka K.V., Mustjoki S.M. Machine learning of bone marrow histopathology identifies genetic and clinical determinants in patients with MDS. Blood Cancer Discov. 2021;2:238–249. doi: 10.1158/2643-3230.BCD-20-0162. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Passamonti F., Corrao G., Castellani G., Mora B., Maggioni G., Gale R.P., Della Porta M.G. The future of research in hematology: integration of conventional studies with real-world data and artificial intelligence. Blood Rev. 2022;54:100914. doi: 10.1016/j.blre.2021.100914. [DOI] [PubMed] [Google Scholar]
  • 30.Nagata Y., Zhao R., Awada H., Kerr C.M., Mirzaev I., Kongkiatkamon S., Nazha A., Makishima H., Radivoyevitch T., Scott J.G., et al. Machine learning demonstrates that somatic mutations imprint invariant morphologic features in myelodysplastic syndromes. Blood. 2020;136:2249–2262. doi: 10.1182/blood.2020005488. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Czibere A., Bruns I., Junge B., Singh R., Kobbe G., Haas R., Germing U. Low RPS14 expression is common in myelodysplastic syndromes without 5q- aberration and defines a subgroup of patients with prolonged survival. Haematologica. 2009;94:1453–1455. doi: 10.3324/haematol.2009.008508. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Oliva E.N., Cuzzola M., Nobile F., Ronco F., D'Errigo M.G., Laganà C., Morabito F., Galimberti S., Cortelezzi A., Aloe Spiriti M.A., et al. Changes in RPS14 expression levels during lenalidomide treatment in Low- and Intermediate-1-risk myelodysplastic syndromes with chromosome 5q deletion. Eur. J. Haematol. 2010;85:231–235. doi: 10.1111/j.1600-0609.2010.01473.x. [DOI] [PubMed] [Google Scholar]
  • 33.Fenaux P., Platzbecker U., Mufti G.J., Garcia-Manero G., Buckstein R., Santini V., Díez-Campelo M., Finelli C., Cazzola M., Ilhan O., et al. Luspatercept in patients with lower-risk myelodysplastic syndromes. N. Engl. J. Med. 2020;382:140–151. doi: 10.1056/NEJMoa1908892. [DOI] [PubMed] [Google Scholar]
  • 34.Winter S., Shoaie S., Kordasti S., Platzbecker U. Integrating the "immunome" in the stratification of myelodysplastic syndromes and future clinical trial design. J. Clin. Oncol. 2020;38:1723–1735. doi: 10.1200/JCO.19.01823. [DOI] [PubMed] [Google Scholar]
  • 35.Walter W., Haferlach C., Nadarajah N., Schmidts I., Kühn C., Kern W., Haferlach T. How artificial intelligence might disrupt diagnostics in hematology in the near future. Oncogene. 2021;40:4271–4280. doi: 10.1038/s41388-021-01861-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Björnsson B., Borrebaeck C., Elander N., Gasslander T., Gawel D.R., Gustafsson M., Jörnsten R., Lee E.J., Li X., Lilja S., et al. Digital twins to personalize medicine. Genome Med. 2019;12:4. doi: 10.1186/s13073-019-0701-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Adema V., Palomo L., Walter W., Mallo M., Hutter S., La Framboise T., Arenillas L., Meggendorfer M., Radivoyevitch T., Xicoy B., et al. Pathophysiologic and clinical implications of molecular profiles resultant from deletion 5q. EBioMedicine. 2022;80:104059. doi: 10.1016/j.ebiom.2022.104059. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Trapnell C., Williams B.A., Pertea G., Mortazavi A., Kwan G., van Baren M.J., Salzberg S.L., Wold B.J., Pachter L. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 2010;28:511–515. doi: 10.1038/nbt.1621. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Kingma D.P., Welling M. Auto-encoding variational Bayes. Preprint at arXiv:1312.6114v10. doi: 10.48550/arXiv.1312.6114.
  • 40.Kingma D.P., Salimans T., Jozefowicz R., Chen X., Sutskever I., Welling M. Improving variational inference with inverse autoregressive flow. Preprint at arXiv:1606.04934. doi: 10.48550/arXiv.1606.04934.
  • 41.Tran D., Nguyen H., Tran B., La Vecchia C., Luu H.N., Nguyen T. Fast and precise single-cell data analysis using a hierarchical autoencoder. Nat. Commun. 2021;12:1029. doi: 10.1038/s41467-021-21312-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Svensson V., Gayoso A., Yosef N., Pachter L. Interpretable factor models of single-cell RNA-seq via variational autoencoders. Bioinformatics. 2020;36:3418–3421. doi: 10.1093/bioinformatics/btaa169. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Lin E., Mukherjee S., Kannan S. A deep adversarial variational autoencoder model for dimensionality reduction in single-cell RNA sequencing analysis. BMC Bioinf. 2020;21:64. doi: 10.1186/s12859-020-3401-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Vahdat A., Kautz J. NVAE: a deep hierarchical variational autoencoder. Preprint at arXiv:2007.03898v3. doi: 10.48550/arXiv.2007.03898.
  • 45.Chira D., Haralampiev I., Winther O., Dittadi A., Liévin V. Image super-resolution with deep variational autoencoders. Preprint at arXiv:2203.09445v2. doi: 10.48550/arXiv.2203.09445.
  • 46.Fu H., Li C., Liu X., Gao J., Celikyilmaz A., Carin L. Cyclical annealing schedule: a simple approach to mitigating KL vanishing. Preprint at arXiv:1903.10145. doi: 10.48550/arXiv.1903.10145.
  • 47.Abadi M., Agarwal A., Barham P., Brevdo E., Chen Z., Citro C., Corrado G., Davis A., Dean J., Devin M., et al. TensorFlow: large-scale machine learning on heterogeneous distributed systems. Preprint at arXiv:1603.04467. doi: 10.48550/arXiv.1603.04467.
  • 48.Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 2011;12:2825–2830.
  • 49.Bouveyron C., Brunet-Saumard C. Model-based clustering of high-dimensional data: a review. Comput. Stat. Data Anal. 2014;71:52–78. doi: 10.1016/j.csda.2012.12.008.
  • 50.Ritchie M.E., Phipson B., Wu D., Hu Y., Law C.W., Shi W., Smyth G.K. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015;43:e47. doi: 10.1093/nar/gkv007.
  • 51.Korotkevich G, Sukhov V, Budin N, Shpak B, Artyomov MN, Sergushichev A. Fast gene set enrichment analysis. Preprint at bioRxiv 2021; 060012. doi: 10.1101/060012.
  • 52.Liberzon A., Subramanian A., Pinchback R., Thorvaldsdóttir H., Tamayo P., Mesirov J.P. Molecular signatures database (MSigDB) 3.0. Bioinformatics. 2011;27:1739–1740. doi: 10.1093/bioinformatics/btr260.
  • 53.McInnes L, Healy J, Melville J. Uniform manifold approximation and projection for dimension reduction. Preprint at arXiv:1802.03426. doi: 10.48550/arXiv.1802.03426.
  • 54.Therneau T. A Package for Survival Analysis in R. R package version 3.1-12; 2020.
  • 55.Ishwaran H. Fast Unified Random Forests for Survival, Regression, and Classification (RF-SRC). R package version 21; 2019.

Associated Data


Supplementary Materials

Document S1. Figures S1–S19 and Tables S1 and S5
mmc1.pdf (2.8MB, pdf)
Table S2. Feature-clusters of top alternative splicing events in observation-clusters, related to STAR Methods

The top 15 alternative splicing events per latent dimension with high correlation were grouped into 11 feature-clusters using hierarchical clustering based on Ward’s criterion. These feature-clusters were then examined within each observation-cluster to characterize informative changes across observation-clusters.

mmc2.xlsx (39.1KB, xlsx)
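
The grouping described for Table S2 can be illustrated with a minimal sketch. It is not the authors' code: the function names, the use of absolute Pearson correlation between PSI values and latent dimensions, the Euclidean distance underlying Ward's criterion, and the synthetic input are all assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster


def select_top_events(psi, latent, top_k=15):
    """Pick, for every latent dimension, the top_k splicing events ranked by
    absolute Pearson correlation (hypothetical helper, assumed ranking).

    psi:    samples x events matrix of PSI values
    latent: samples x dimensions matrix of latent coordinates
    """
    psi_z = (psi - psi.mean(axis=0)) / psi.std(axis=0)
    lat_z = (latent - latent.mean(axis=0)) / latent.std(axis=0)
    # Pearson correlation of every event with every latent dimension.
    corr = psi_z.T @ lat_z / psi.shape[0]            # events x dimensions
    top = np.argsort(-np.abs(corr), axis=0)[:top_k]  # top_k x dimensions
    return np.unique(top)                            # de-duplicated event indices


def ward_feature_clusters(psi, event_idx, n_clusters=11):
    """Group the selected events into at most n_clusters feature-clusters
    using agglomerative clustering with Ward's criterion."""
    profiles = psi[:, event_idx].T                   # one row per event
    tree = linkage(profiles, method="ward")          # Euclidean distance
    return fcluster(tree, t=n_clusters, criterion="maxclust")


# Synthetic stand-in data: 200 samples, 300 events, 8 latent dimensions.
rng = np.random.default_rng(0)
psi = rng.uniform(size=(200, 300))
latent = rng.normal(size=(200, 8))

events = select_top_events(psi, latent, top_k=15)
labels = ward_feature_clusters(psi, events, n_clusters=11)
print(len(events), "events in", len(np.unique(labels)), "feature-clusters")
```

Cutting the tree with `criterion="maxclust"` fixes the number of feature-clusters in advance, which matches the fixed count of 11 reported for the table; a distance threshold would be an alternative if the number of groups were not known beforehand.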
Table S3. List of differentially expressed genes, related to Figure 4

Differentially expressed genes between the two major myeloid neoplasia groups are provided. Data are expressed as log fold change and average expression.

mmc3.xlsx (1.3MB, xlsx)
Table S4. List of genes obtained through Gene Set Enrichment Analysis, related to Figure 4

A complete list of pathways clustered using R package fgsea is provided.

mmc4.xlsx (34KB, xlsx)


Articles from iScience are provided here courtesy of Elsevier
