Author manuscript; available in PMC: 2026 Jan 1.
Published in final edited form as: Nat Cancer. 2025 Jun 27;6(7):1242–1262. doi: 10.1038/s43018-025-00987-2

MULTIPOTENT LINEAGE POTENTIAL IN B-CELL ACUTE LYMPHOBLASTIC LEUKEMIA IS ASSOCIATED WITH DISTINCT CELLULAR ORIGINS AND CLINICAL FEATURES

Ilaria Iacobucci 1,*, Andy GX Zeng 2,3,*, Qingsong Gao 1,*, Laura Garcia-Prat 2,*, Pradyumna Baviskar 1, Sayyam Shah 2, Alex Murison 2, Veronique Voisin 2, Michelle Chan-Seng-Yue 2, Cheng Cheng 4, Chunxu Qu 1, Colin Bailey 1, Matthew Lear 1, Matthew T Witkowski 5, Xin Zhou 6, Airen Zaldivar Peraza 6, Karishma Gangwani 6, Anjali S Advani 7, Selina M Luger 8, Mark R Litzow 9, Jacob M Rowe 10,11, Elisabeth M Paietta 12, Wendy Stock 13, John E Dick 2,3,**, Charles G Mullighan 1,14,**
PMCID: PMC12377259  NIHMSID: NIHMS2100933  PMID: 40579588

Abstract

Developmental origins and their associations with lineage plasticity and treatment response in B-progenitor acute lymphoblastic leukemia (B-ALL) are mostly unexplored. Here, we integrated single cell transcriptome sequencing (scRNA-seq) of 89 B-ALL samples with a single cell atlas of normal human B-cell development incorporating functional and molecular assays. We observed subtype- and sample-dependent correlation with normal developmental stage, with intra-subtype and intra-patient heterogeneity. We show that subtypes prone to shift from the B-lineage (e.g. BCR::ABL1, KMT2A-r, and DUX4-r B-ALL) are enriched for multipotent progenitors and show this developmental stage exhibits CEBPA activation and retains myeloid potential, providing a mechanistic explanation for this clinical observation. We developed a “Multipotency Score” most enriched in subtypes exhibiting lineage plasticity that was independently associated with inferior survival. Thus, multipotent B-ALL states reflect the early progenitor origins of a subset of B-ALL patients and may be relevant for understanding lineage shifting following conventional chemotherapy or immunotherapies.

INTRODUCTION

B-cell acute lymphoblastic leukemia (B-ALL) is characterized by abnormal proliferation of immature lymphoid cells and includes up to 26 subtypes based on genomic and transcriptomic characteristics1,2. Classification and risk stratification primarily rely on integrating genomic and clinical factors3,4. While advancements in risk stratification and the use of risk-adapted therapies have improved cure rates of pediatric B-ALL, children classified as high-risk still face poor outcomes5,6. Moreover, across all age groups, disease relapse still represents a major cause of death.

Although relapse is often caused by the dynamic and complex genetic diversity of leukemic cells, leading to the proliferation of resistant clones7,8, there is evidence that variation in leukemia developmental stage may influence treatment vulnerability9,10. Indeed, certain subtypes of B-ALL (e.g. BCR::ABL1, DUX4-, KMT2A- and ZNF384-rearranged) exhibit immunophenotypic shifts, including lymphoid to myeloid lineage switching following either CD19-directed immunotherapies or conventional chemotherapy11–13, suggesting plasticity or heterogeneity in B-ALL cell states within these patients. These findings highlight the importance of understanding leukemia cell state heterogeneity in B-ALL to better understand disease evolution and predict therapeutic responses.

B lymphoid development involves dynamic transcriptional programs that transition as developing B-cells mature through distinct stages. Alterations disrupting B cell development can promote the development of lymphoid neoplasms, including B-ALL. Although it is widely assumed that B-ALL most commonly originates from a pro-B stage lymphoid progenitor, a window of vulnerability during B cell receptor rearrangement, there is also evidence that origins may be heterogeneous14–18, subtype- and driver-dependent, and may sometimes arise from more primitive hematopoietic stem and progenitor cells (HSPC)19, or at more mature stages during immunoglobulin light chain rearrangement20. This heterogeneity challenges conventional diagnosis and impacts outcome. Single-cell RNA-sequencing (scRNA-seq) provides the resolution to define precise cell states within B-ALL, and the opportunity to identify the distinct windows of B-cell development susceptible to transformation throughout the human lifespan. However, highly resolved maps of normal B cell development are lacking and differences between mouse and human hematopoietic ontogeny preclude reliance on refined murine transcriptional maps. Furthermore, a comprehensive view of B-ALL cell state heterogeneity may also help to improve classification and risk-stratification systems for B-ALL.

Thus, we performed scRNA-seq on 89 samples from 84 B-ALL patients. We compared these B-ALL transcriptomes to normal hematopoietic states by first developing a comprehensive single-cell reference atlas of normal human B-cell development spanning 130,085 cells from various tissue sources. Projection of B-ALL samples onto this map revealed distinct developmental states associated with genomic alterations and clinical outcomes that collectively improve our current understanding of B-ALL biology and refine risk stratification.

RESULTS

Single-cell RNA sequencing characterization of B-ALL

We generated scRNA-seq data on 89 samples from 84 B-ALL patients with different molecular subtypes (Supplementary Tables 1–3 and Fig. 1a). Cells were annotated into six broad hematopoietic cell types and classified as leukemic cells or non-malignant cells (Supplementary Table 4) using the Inferred Copy Number Variation (inferCNV) algorithm21, revealing CNVs in blasts consistent with karyotype and whole genome sequencing (WGS) data (Supplementary Table 1 and ED Fig. 1a). Aggregated pseudobulk scRNA-seq data showed clustering of individual samples with corresponding leukemia subtypes from bulk RNA-seq (N=2,046) B-ALL datasets5,22,23 (Supplementary Table 5 and Fig. 1b). Normal cells from different samples and subtypes clustered together, while leukemic blasts formed distinct gene expression clusters, suggesting that inter-patient heterogeneity exists even within subtypes, consistent with data from solid tumors24,25 (Fig. 1c,d and ED Fig. 1b).
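The pseudobulk aggregation mentioned above can be illustrated with a minimal sketch. It assumes that per-cell counts are simply summed within each sample and then scaled to counts per million; the input layout, function names, and sample labels are hypothetical and are not taken from the study's pipeline.

```python
from collections import defaultdict

def pseudobulk(cells):
    """Sum per-cell gene counts into one profile per sample.

    `cells` is an iterable of (sample_id, {gene: count}) pairs. This layout
    is a hypothetical choice made for illustration only.
    """
    profiles = defaultdict(lambda: defaultdict(int))
    for sample_id, counts in cells:
        for gene, n in counts.items():
            profiles[sample_id][gene] += n
    return {s: dict(g) for s, g in profiles.items()}

def counts_per_million(profile):
    """Scale one pseudobulk profile so that its counts sum to one million."""
    total = sum(profile.values())
    return {g: 1e6 * n / total for g, n in profile.items()} if total else {}

# Toy input: three cells from two hypothetical samples.
toy_cells = [
    ("sample_A", {"CD19": 3, "CD34": 1}),
    ("sample_A", {"CD19": 2, "MME": 4}),
    ("sample_B", {"CD34": 5}),
]
bulk = pseudobulk(toy_cells)
```

Summing raw counts before normalization keeps each pseudobulk profile comparable to a bulk library from the same sample, which is what allows single-cell and bulk samples to be embedded together.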

Figure 1. Single-cell RNA sequencing landscape of B-ALL.

a) Schematic workflow for scRNA-seq of B-ALL samples. The UMAP is from sample SJMLL009_D1 and is shown only as a representative example. The figure was created with Biorender.com. b) t-distributed stochastic neighbor embedding (tSNE) plot showing gene expression profiling of 1,794 B-ALL bulk RNA-seq samples5,22,23 and 89 scRNA-seq pseudobulk samples from this study. Each dot represents a sample, and samples are colored by the leukemia subtype. In parentheses are the number of scRNA-seq pseudobulk samples. These samples clustered with bulk samples according to their specific subtype. From the 2,046 B-ALL samples with bulk RNA-seq (Supplementary Table 5), for this analysis we included only subtypes that were represented in the scRNA-seq sample cohort, with the addition of ETV6::RUNX1 samples to show co-clustering with ETV6::RUNX1-like (total in tSNE = 1,794). c,d) UMAP representation of cells (N=489,019) from the 89 single-cell RNA-seq samples. Clusters of cells are colored by B-ALL genomic subtype (c) and cell type (leukemic blasts vs non-malignant cells) (d). Abbreviations: BM, bone marrow; inferCNV, inferred copy number variation; NK, natural killer; DC, dendritic cells; HSPCs, hematopoietic stem and progenitor cells. UMAPs can be interactively visualized at https://proteinpaint.stjude.org/BALLscrna.

Intra sample genetic and transcriptomic heterogeneity of leukemic cells

Using inferCNV to examine intra-tumoral genetic heterogeneity, we observed multiclonality in many cases, with 43 (48.3%) samples exhibiting multiple CNV clones, including 16 samples with minor clones (< 3% of all cells) undetectable by bulk sequencing (Fig. 2a and Supplementary Table 6). Notably, in near haploid and hyperdiploid B-ALL, chromosomal losses or gains, respectively, had the same pattern in all cells in most samples, consistent with early acquisition of aneuploidy in leukemogenesis5 (ED Fig. 1c,d). A minority of samples showed sequential evolution of aneuploidy with progressive losses/gains of chromosomes (Fig. 2b,c and ED Fig. 1e,f).
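As an illustration of how minor clones could be flagged from per-cell clone assignments, the following is a minimal sketch. It assumes each cell already carries a clone label (for example, from inferCNV groupings) and applies the "< 3% of all cells" threshold stated above; the function names and toy labels are hypothetical.

```python
from collections import Counter

MINOR_CLONE_THRESHOLD = 0.03  # "< 3% of all cells", as stated in the text

def clone_fractions(clone_labels):
    """Return {clone: fraction of cells} for one sample's per-cell clone labels."""
    counts = Counter(clone_labels)
    total = sum(counts.values())
    return {clone: n / total for clone, n in counts.items()}

def minor_clones(clone_labels, threshold=MINOR_CLONE_THRESHOLD):
    """List clones whose cell fraction falls below the threshold."""
    fractions = clone_fractions(clone_labels)
    return sorted(c for c, f in fractions.items() if f < threshold)

# Toy sample of 1,000 cells with hypothetical clone labels.
labels = ["clone1"] * 900 + ["clone2"] * 80 + ["clone3"] * 20
flagged = minor_clones(labels)
```

In this toy sample only the clone present in 2% of cells is flagged, which is the kind of population that bulk sequencing would be expected to miss.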

Figure 2. Single-cell genomic and transcriptomic heterogeneity.

a) Bar plot showing copy number clonal heterogeneity. Number of clones ranged from 1 (N=46) to 8 (N=1, SJHYPO120). The inferCNV heatmap from each B-ALL sample can be visualized at https://proteinpaint.stjude.org/BALLscrna. b) InferCNV heatmap demonstrating total or partial chromosome losses in the hypodiploid case SJHYPO117. c) UMAP representation of cells from SJHYPO117. Clusters of cells are colored by cell type (upper panel) and inferCNV group (lower panel). Cells from cluster 3 without losses of chromosomes 8, 14p and 18 formed a distinct cluster. d) InferCNV heatmap from SJE2A063 (TCF3::PBX1) showing progressive gain of chromosome 1q. e) UMAP representation of cells from SJE2A063. Clusters of cells are colored by cell type (left panel), inferCNV group (middle panel) and detection of TCF3::PBX1 fusion (right panel). f) Scatterplot showing copy numbers of chromosome (chr) 1p, chr 1q proximal (from centromere to PBX1; chr1: 145002063–164688691), chr 1q distal (from PBX1 to telomere, chr1: 164688692–248956422), chr 9p (chr9: 1–39100000), chr 9q (chr9: 68400001–138394717) and chr 19p (chr19: 1–1617901). Data are from scWGS in SJE2A063 performed on flow-sorted blast (hCD45dim, CD19+) and normal (hCD45bright and CD3+) cells. Single cells were assigned to different clusters based on their copy number patterns. Dotted line indicates diploid DNA content. g) Cartoon created with Biorender.com depicting the model of progressive gain of chromosome 1q and aberrations of chromosome 9 in SJE2A063, which occur in mutually exclusive clones. h) Heatmap showing the pairwise correlations of the sample level meta-programs derived from 10 DUX4-r patients using NMF: A - Cell Cycle (S), B - Cell Cycle (G2/M), C - Metabolism, D - Differentiation, and E - Inflammation. i) Bubble plot showing gene set enrichment analysis of the five transcriptional programs in different subtypes. Genes are listed based on their contribution to each program in each subtype.
j) Pairwise correlation of expression scores of transcriptional metaprograms from NMF defined separately in each subtype and applied across cells from all samples showing that cell cycle and metabolism expression programs are common across subtypes while those for cell differentiation and inflammation are more subtype-specific.

Interestingly, intra-tumoral genetic variegation was also associated with subtype-defining gene fusions, such as TCF3::PBX1, encoded by the t(1;19)(q23;p13) translocation. This rearrangement is commonly unbalanced, with duplication of 1q distal to PBX1, yet the mechanisms and patterns of associated copy number changes remain poorly understood. Integrated scRNA-seq and single cell whole genome sequencing (scWGS) revealed a complex mechanism with gradual gain of chromosome 1 from PBX1 to the telomere and amplification of the entire chromosome 1q, encompassing PBX1 (Fig. 2d–g, ED Fig. 1g–k and Supplementary Tables 7,8), likely occurring on the chromosome 1 not involved in the translocation. Supporting this mechanism, we noted amplification of the same region in a sample (SJE2A067) with a balanced t(1;19) translocation. Single cell sequencing further unveiled clonal compositions not evident on bulk sequencing, revealing minor clones harboring additional chromosomal aberrations (ED Fig. 1k,l), supporting a model where TCF3::PBX1 arises early, followed by der(19) and amplification of the telomeric region distal to PBX1, with further copy number changes in subclones. Thus, integrated single cell gene expression and inferCNV analysis reveals previously unrecognized clonal heterogeneity in B-ALL with a dynamic accumulation of genetic alterations, including those further influencing fusion oncoprotein expression.

To explore intra-sample transcriptional heterogeneity of leukemic cells, we first performed non-negative matrix factorization (NMF)26 of single cell gene expression data. Hierarchical clustering on NMF programs from samples within the same subtype revealed five consensus transcriptional metaprograms in all subtypes, independently of the genetic driver: cell cycle (S), cell cycle (G2/M), metabolism, cell differentiation, and inflammation (Fig. 2h,i, ED Fig. 1m and Supplementary Table 9).
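For readers unfamiliar with NMF, the sketch below factors a small non-negative genes-by-cells matrix into gene programs and per-cell program usages. It uses the generic multiplicative-update rule for the squared-error objective, written in plain Python; it is not the implementation used in this study, and the toy matrix, the number of programs, and all function names are assumptions for illustration.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(v, w, h):
    """Squared reconstruction error between v and the product w @ h."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, k, n_iter=200, seed=0, eps=1e-9):
    """Factor non-negative v (genes x cells) as w (genes x k) times h (k x cells)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        # Update program usages h, holding gene weights w fixed.
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # Update gene weights w, holding usages h fixed.
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h

# Toy matrix: two genes active in cells 1-2, two other genes in cells 3-4.
toy = [[5, 5, 0, 0], [4, 4, 0, 0], [0, 0, 3, 3], [0, 0, 2, 2]]
w, h = nmf(toy, k=2)
```

Each column of `w` can be read as a gene program and each row of `h` as its usage across cells; the multiplicative updates keep both factors non-negative, which is what makes the resulting programs additive and interpretable.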

To explore transcriptional heterogeneity across different leukemia subtypes, we next analyzed how the expression scores of genes contributing to each metaprogram varied across different subtypes by pairwise correlation (Fig. 2j). Transcriptional programs related to cell cycle (S or G2/M phase) and metabolism were very similar across all subtypes despite the differences in their genomic drivers. In contrast, cell differentiation and inflammation programs showed intermediate correlation across subtypes, suggesting that genes contributing to these two programs may be subtype-specific (ED Fig. 1n). Overall, this analysis revealed that variation in programs related to cellular differentiation and inflammation helps to explain inter-patient transcriptomic heterogeneity in B-ALL.

Characterization of human B cell development

Given our identification of cellular differentiation programs as a driver of transcriptional heterogeneity in B-ALL, we sought to examine the ontogeny of B-ALL data through the lens of normal human B cell development. B cell development has been extensively studied in mice27, but an in-depth analysis in humans is lacking; we therefore created a detailed transcriptomic map of normal human B cell development. We performed stringent sorting of immunophenotypically-defined cell populations of human B cell developmental stages from umbilical cord blood (CB; Supplementary Table 10) followed by bulk RNA-seq and ATAC-seq. Clustering of sorted fractions from early B cell development confirmed their order along B cell differentiation (Fig. 3a).

Figure 3. Characterization of normal human B cell development.

a) Clustering of bulk RNA-seq profiles from sorted cell populations spanning B cell development in umbilical cord blood. Samples were clustered based on the expression of known human transcription factors (TFs). Abbreviations: LT-HSC, long term hematopoietic stem cells; ST-HSC, short term hematopoietic stem cell. b) A single cell atlas of human B cell development, comprised of 130,085 single cell transcriptomes integrated from eight studies spanning fetal bone marrow, fetal liver, umbilical cord blood, pediatric bone marrow, and adult bone marrow tissues. Cell clustering and annotations were guided by bulk RNA-seq reference profiles from (a). c) cNMF signature analysis in normal B-lymphoid development. Signature strength of cNMF programs corresponding to HSC/MPP, Early Lymphoid, Pro-B, and Pre-B stages of B cell development are shown, together with an expression heatmap of 10 genes with the highest weights within each signature along B cell development. d) CEBPA regulon activity inferred through pySCENIC38 persists into the CLP stage of B cell development. e) Chromatin accessibility of an enhancer element +42kb from the CEBPA locus from sorted populations. f,g) In vitro differentiation assays of human lymphoid-primed multipotent progenitor (LMPP), multilineage progenitor (MLP), CLP, Pre-Pro-B, and Pro-B cells in Lympho-Myeloid (Ly-My) or Lymphoid promoting (LyP) media (Supplementary Table 14). Two hundred cells from the indicated cell populations were sorted and cultured on MS-5 stroma cells for 3 (ED Fig. 3a), 7 or 14 days in Ly-My or LyP media and resulting colonies were analyzed by flow cytometry. The proportion of single cell-derived colonies matching different phenotypes after the differentiation period is represented in the pie charts (g). h,i) In vivo differentiation assays of human LMPP, MLP, CLP, Pre-Pro-B, and Pro-B cells. 
NSG mice were subjected to intra-femoral injection of sorted populations and engraftment levels were analyzed by flow cytometry two weeks later. Three independent experiments with a total number of mice for each group as follows: LMPP (N=24), MLP (N=20), CLP (N=13 mice), Pre-ProB (N=3) and Pro-B (N=10). Representative FACS plots and engraftment results are shown in panel i. The cartoons in (f) and (h) were created with Biorender.com.

We next utilized this resource to develop and annotate a single cell reference gene expression atlas of human B cell development onto which B-ALL blasts could be projected. We compiled scRNA-seq data from 8 normal cell datasets spanning fetal liver and bone marrow, cord blood, and pediatric and adult bone marrow, focusing on cell states along B cell development and branch points to proximal lineages. Collectively, 130,085 cells were integrated across 90 donors, 5 tissues, and 3 technologies (Supplementary Table 11 and Fig. 3b)28–34. Focused clustering and annotation guided by reference transcriptomes from our sorted cell fractions allowed us to identify 17 cell states spanning human B cell development while projection of surface protein levels from AbSeq data35 enabled immunophenotypic validation of our cell populations (ED Fig. 2a–c).

Prior studies have noted the existence of a fetal-specific population along B-cell development, namely a CD10-CD19- Early Lymphoid Progenitor (ELP)18–20. Notably, we found that fetal and post-natal tissues shared common transcriptional states spanning B cell development; however, we did observe some variation in surface marker expression by ontogeny. In post-natal tissue, CD10 upregulation occurs at the MLP stage while in fetal tissue CD10 upregulation is delayed until the Pro-B stage (ED Fig. 2d,e). Differential expression analysis confirmed transcriptional upregulation of the CD10-encoding transcript MME within post-natal CLP. Thus, despite differences in surface marker expression, pre-natal “ELPs” represent an equivalent cellular state to post-natal CLPs and both will hereafter be referred to as CLPs.

To identify gene expression programs that underlie human B cell development, we applied NMF, identifying two cell cycle programs (S-phase and G2/M-phase) and seven stage-specific programs along B cell development (Fig. 3c, ED Fig. 2f,g and Supplementary Tables 12,13). This included an “Early Lymphoid” program enriched in multi-lymphoid progenitors (MLPs) and common lymphoid progenitors (CLPs) which was characterized by high expression of IL7-R and CD7 that encode known surface markers of common lymphoid progenitors in mice and humans (ED Fig. 2h)36,37. Flow cytometry analysis validated the stage-specific expression of surface IL7-R and CD7 protein levels within MLPs and CLPs (ED Fig. 2i,j). Interestingly, expression of RAG1 and RAG2 occurred at the MLP stage of fetal hematopoiesis and the CLP stage in adult hematopoiesis, suggesting a physiological window of RAG expression that precedes the Pro-B stage where heavy chain re-arrangement occurs (ED Fig. 2k).

We next evaluated transcription factor (TF) activity, captured by enrichment of their target regulons, across lymphoid lineage commitment through SCENIC regulon analysis38. Unexpectedly, we found that the activity of the myeloid-associated TF CEBPA persisted into the CLP stage (Fig. 3d), despite previous characterizations of CLP as functionally restricted to B/NK lineages39. This was supported by bulk ATAC-seq on the sorted populations which revealed persisting accessibility of a CEBPA-regulating +42kb enhancer into the CLP stage40 (Fig. 3e). The retention of functional myeloid potential within CLPs but not Pro-B cells was confirmed through differentiation assays on MS-5 stromal cells with either Lympho-Myeloid (Ly-My) or Lymphoid promoting (LyP) media (Supplementary Table 14, Fig. 3f,g and ED Fig. 3a–c) as well as xenotransplantation into NSG mice (Fig. 3h,i and ED Fig. 3d–f). In further support of these findings, myeloid surface marker CD33 levels were higher in the CLP fraction compared to downstream B progenitor populations (ED Fig. 3g), and in vitro differentiation assays of CD33 positive and negative subsets of Early Lymphoid progenitors confirmed this association between CD33 and functional myeloid potential within the CLP fraction (ED Fig. 3h,i and Supplementary Table 15). Together, this shows that myeloid lineage potential is retained at the CLP stage. Altogether, we have developed a comprehensive atlas of cell states spanning human B cell differentiation from pre-natal to post-natal tissues. B-development states in our map were validated using scRNA-seq data from prenatal human lungs41 (ED Fig. 4a,b and Supplementary Table 16). Furthermore, integration of molecular profiles with functional data from purified populations has revealed retention of myeloid lineage potential further into B-lymphoid development than previously expected, which may explain lymphoid-to-myeloid lineage switching in some B-ALL samples at disease relapse.

Developmental states of B-ALL at single cell resolution

This atlas of human B cell development enabled precise mapping of B-ALL cells along the continuum of human B cell development. Projection of B-ALL cells onto this normal reference revealed precise stages of human B cell development implicated in each patient’s disease (Fig. 4a,b, ED Fig. 4a–c). The “Pro-B VDJ” state, wherein heavy chain rearrangement occurs, was most frequently observed, and comprised at least 20% of a patient’s leukemic cells in 76 of 89 samples (Supplementary Table 17). By contrast, tumor samples varied in representation of other populations along B cell development. To simplify the many precise cellular states, we condensed co-occurring cellular states into broader categories, hereafter referred to as “developmental states” (ED Fig. 4d,e and Supplementary Table 18).
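The composition analysis described here can be sketched as follows, assuming each leukemic cell has already been assigned to its most similar normal state. The sketch computes per-sample state fractions and applies a centered log ratio (CLR) transform, which the Figure 4 legend mentions in connection with sample clustering. The pseudocount, the state list, and the toy cell counts are assumptions for illustration.

```python
import math
from collections import Counter

def state_fractions(assigned_states, all_states):
    """Fraction of a sample's cells assigned to each state (zero if absent)."""
    counts = Counter(assigned_states)
    total = len(assigned_states)
    return {s: counts.get(s, 0) / total for s in all_states}

def clr(fractions, pseudocount=1e-6):
    """Centered log ratio: log of each part minus the mean log across parts.

    The pseudocount is an assumed guard against log(0) for absent states.
    """
    logs = {s: math.log(f + pseudocount) for s, f in fractions.items()}
    mean_log = sum(logs.values()) / len(logs)
    return {s: v - mean_log for s, v in logs.items()}

# Toy sample of 100 leukemic cells with hypothetical state assignments.
states = ["HSC/MPP", "Early Lymphoid", "Pro-B VDJ", "Pre-B"]
cells = ["Pro-B VDJ"] * 60 + ["Pre-B"] * 30 + ["Early Lymphoid"] * 10
fractions = state_fractions(cells, states)
transformed = clr(fractions)
```

Because fractions within a sample sum to one, they are not independent of each other; the CLR transform re-expresses them relative to the sample's geometric mean so that distance-based clustering is less distorted by this constraint.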

Figure 4. Heterogeneity in B cell developmental states within B-ALL.

a) Schematic outlining composition analysis of B-ALL samples along B cell development. B-ALL cells were projected onto the single cell atlas of human B-cell development (from Fig. 3b) and cell states were assigned based on the most transcriptionally similar normal population. Samples were clustered based on the relative abundance of each cell state along normal B cell development. b) Schematic analysis workflow and UMAP of 89 B-ALL samples grouped into six patient clusters based on cell state composition. CLR, centered log ratio. c) Projection results of B-ALL cells from representative patients within cell state composition-based groups. d) Cell state composition-based clustering of 89 B-ALL samples. Each column represents a single sample, grouped into six subgroups based on cell composition. Heatmap depicts normalized abundance of each mapped cell population for each sample, and annotations pertaining to white blood cell count (WBC), genomic subtype, clinical risk group, and 5-year survival are also provided for each sample. Statistical significance was determined by two-sided Fisher’s exact tests between each annotation and the composition-based clusters. e-h) UMAP clustering based on cell state composition from 89 B-ALL samples depicting abundance of broad developmental states spanning HSC/MPP/LMPP (e), Early Lymphoid (f), Pro-B (g), and Pre-B (h). Key genomic subtypes enriched for each developmental state are also depicted with their position along the UMAP embedding.

Clustering samples by cellular composition revealed seven patient subgroups (from A to G) differing in enrichment of various B cell developmental states. Each subgroup was associated with distinct underlying genomic characteristics and clinical outcomes (Fig. 4c,d and ED Fig. 4f,g). A subgroup defined by high HSC/MPP abundance (subgroup A) was enriched for samples with ZNF384 rearrangements (ZNF384-R; Fig. 4e), which encode fusion oncoproteins enriched in lineage-ambiguous leukemia42,43, and a subset of DUX4-R ALL. Subgroup B was highly enriched with Early Lymphoid populations (MLP, CLP, Pre-Pro-B) and comprised a subset of both BCR::ABL1 and KMT2A-R cases (Fig. 4f). A majority of samples including hyperdiploid and ETV6::RUNX1-like subtypes were highly enriched with Pro-B populations (subgroups C, D, E; Fig. 4g). Notably, analysis of an independent cohort of ETV6::RUNX1 B-ALL samples44 showed similar enrichment of pro-B cell signatures as for ETV6::RUNX1-like from this study (ED Fig. 4h,i). Last, a subset of samples (subgroups F, G) with high abundance of Pre-B cell state was enriched for TCF3::PBX1 and MEF2D-R subtypes (Fig. 4h). We confirmed that B-ALL cells belonging to each developmental state retained stage-specific expression of cell identity programs from normal B cell development (Fig. 5a–c).

Figure 5. B-ALL developmental state abundance and Multipotency Score.

a) Broad developmental states implicated in B-ALL. b) Gene set variation analysis (GSVA) enrichment scores for gene expression programs defined from normal B-cell developmental states. Cells belonging to each developmental state from each B-ALL sample were pooled into pseudobulk profiles, and developmental states from the same patient sample are connected by a line. c) Heatmap depicting expression of top 100 marker genes for each B-ALL developmental state. Each row corresponds to a unique marker gene while each column corresponds to a pseudobulk profile of each developmental state from each B-ALL sample. d) Feature selection approach prior to model training. Differentially expressed genes at FDR<0.01 across both normal as well as B-ALL developmental states were retained, yielding 2,655 genes. e) LASSO regression was performed to predict developmental state abundance of 89 scRNA-seq samples. For each developmental state n=50, 5-fold cross validation, repeated 10 times. Box plots indicate the range of the central 50% of the data with the central line marking the median. Whiskers extend from each box to 1.5x the interquartile range. f) Validation of final developmental state models using matched bulk RNA-seq profiles available for 85 scRNA-seq samples. For each developmental state, the predicted developmental state abundance from matched bulk RNA-seq data is shown on the x-axis, and the actual developmental state abundance from scRNA-seq composition data is shown on the y-axis. For each association, the linear regression line shaded with the 95% CI, as well as r and P values from Pearson correlation, are shown. g) Principal component analysis of B-ALL developmental states (N = 2,046 samples). Feature loadings for principal component 1 (Multipotency) are depicted. h,i) Ridge plots comparing the Multipotency Score (h) and the inferred abundance (i) of Early Lymphoid, Pro-B, and Pre-B states across B-ALL subtypes.
j) Association between B-ALL subtypes and inferred abundance of B-ALL developmental states (N=2,046 samples). Associations were determined by comparing each genomic subtype against the “B-Other” subtype. The magnitude of each association, quantified as the −log10 (p value), is depicted through the size and color intensity of each dot. Only associations with an FDR corrected P<0.05 are depicted.

Thus, projection of B-ALL cells across human B cell development reveals a new dimension of heterogeneity in B-ALL that complements but also refines genomic stratification in B-ALL.

Genomic determinants associated with multipotency in B-ALL

Given that our single cell analysis identified B-ALL subgroups enriched for multipotent developmental states (HSC/MPP and Early Lymphoid) associated with specific genomic alterations, we sought to augment our analysis using bulk transcriptomes from large B-ALL patient cohorts5,22,23. We developed an approach to quantify B-ALL developmental state abundance in bulk RNA-seq using the expression of 145 developmental state-related genes (Fig. 5d–f and Supplementary Tables 5,19), and found that this approach outperformed bulk deconvolution tools including CIBERSORTx45 and BayesPrism46 by benchmarking on matched single-cell and bulk transcriptomes from 85 B-ALL patients (ED Fig. 5a–d).

Quantification of developmental state abundance on 2,046 B-ALL transcriptomes revealed variation in developmental state abundance in concordance with findings from our scRNA-seq cohort (Supplementary Table 20). Importantly, through principal component analysis we identified an axis of variation which was positively associated with abundance of multipotent states (HSC/MPP and Early Lymphoid) and negatively associated with abundance of committed states (Pro-B and Pre-B) (Fig. 5g). Starting from our 145 developmental state genes, we developed a 99-gene score to capture this multipotent versus committed axis, which will hereafter be referred to as the “Multipotency Score” (Fig. 5g, ED Fig. 6a and Supplementary Table 21). In normal cells, the Multipotency Score was positively associated with stem and progenitor abundance and negatively associated with committed B-lymphoid precursor abundance (ED Fig. 6b).
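To make the idea of a multipotent-versus-committed axis concrete, the sketch below extracts the first principal component from a small table of developmental state abundances using power iteration on the covariance matrix. The abundance values are invented toy numbers, the function name is hypothetical, and the sketch does not reproduce how the 99-gene score itself was derived.

```python
def pc1_loadings(rows, n_iter=100):
    """First principal component of `rows` (samples x features) by power iteration."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    # Start from a unit vector on the first feature; a vector of all ones would
    # be annihilated here because compositional rows sum to a constant.
    vec = [1.0] + [0.0] * (p - 1)
    for _ in range(n_iter):
        vec = [sum(cov[a][b] * vec[b] for b in range(p)) for a in range(p)]
        norm = sum(x * x for x in vec) ** 0.5
        vec = [x / norm for x in vec]
    # PCA signs are arbitrary; orient so that the first feature loads positively.
    return vec if vec[0] >= 0 else [-x for x in vec]

# Columns: HSC/MPP, Early Lymphoid, Pro-B, Pre-B (toy abundances per sample).
abundance = [
    [0.30, 0.40, 0.20, 0.10],
    [0.25, 0.35, 0.25, 0.15],
    [0.05, 0.10, 0.50, 0.35],
    [0.02, 0.08, 0.55, 0.35],
    [0.10, 0.20, 0.40, 0.30],
]
loadings = pc1_loadings(abundance)
```

In this toy table the two multipotent states load with one sign and the two committed states with the opposite sign, mirroring the pattern described for principal component 1.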

Association analysis between developmental state abundance or Multipotency Score and genomic subtype identified B-ALL subtypes associated with multipotency including ZNF384-R, BCR::ABL1, KMT2A-R, and DUX4-R (Fig. 5h). Moreover, in line with our scRNA-seq analyses, we identified Pre-B enrichment in MEF2D-R and TCF3::PBX1; and Pro-B enrichment with hyperdiploid and ETV6::RUNX1/-like ALL (Fig. 5i,j and ED Fig. 6c). Developmental state abundance and Multipotency Score were also associated with specific driver mutations and gene fusions (ED Fig. 6d,e). Notably, after controlling for genomic subtype, alterations in EBF1, FLT3, and NRAS were positively associated with multipotency while alterations involving PAX5 and CDKN2A/B were associated with B-lymphoid commitment. The association of the Multipotency Score with distinct genomic alterations in B-ALL may suggest disparate cellular origins among a subset of B-ALL patients.

Multipotency Score explains transcriptional heterogeneity

We next asked whether developmental state information, including the Multipotency Score, could explain heterogeneity within genomic subtypes. Recent bulk gene expression analyses by our group and others have identified transcriptional subgroups within DUX4-R5, KMT2A-R5 and BCR::ABL1 B-ALL5,14. Strikingly, the Multipotency Score enabled near-perfect separation of transcriptional subgroups within DUX4-R (n = 112, AUC = 0.980, Fig. 6a,b and ED Fig. 6f) and KMT2A-R patients (n = 142, AUC = 0.949, Fig. 6c,d and ED Fig. 6g). Thus, we refer to these transcriptional subgroups within DUX4-R and KMT2A-R as “Early/Multipotent” and “Committed”. Within BCR::ABL1 samples, three transcriptional subgroups were identified (Early/Multipotent, Inter-Pro, and Committed)14. We observed a step-wise decrease across these subgroups with near-perfect separation between “Early/Multipotent” and “Committed” samples by the Multipotency Score (AUC = 0.978) (Fig. 6e,f and ED Fig. 6h). This was validated through analysis of an independent BCR::ABL1 cohort (n = 54, AUC = 0.955) (ED Fig. 6i)14. Thus, features of multipotency can explain transcriptional heterogeneity within B-ALL subtypes. Within each subtype, transcriptional subgroups and their differences in multipotency and developmental state abundance were associated with specific genetic alterations (ED Fig. 7a–c).
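The AUC values reported above have a direct interpretation: the AUC equals the probability that a randomly chosen sample from one subgroup has a higher score than a randomly chosen sample from the other, with ties counted as one half. A minimal sketch of that calculation follows; the scores are hypothetical toy values, not data from this study.

```python
def auc(positive_scores, negative_scores):
    """Probability that a random positive outscores a random negative.

    Ties contribute 0.5, which makes this equal to the area under the ROC curve.
    """
    wins = 0.0
    for p in positive_scores:
        for q in negative_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

# Hypothetical Multipotency Scores for two transcriptional subgroups.
early_multipotent = [2.1, 1.8, 1.5, 0.9]
committed = [-1.2, -0.7, 0.2, 1.0]
separation = auc(early_multipotent, committed)  # 15 of 16 pairs ordered correctly
```

An AUC of 1.0 would mean every Early/Multipotent sample scores above every Committed sample, while 0.5 would mean the score carries no discriminatory information.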

Figure 6. B-ALL developmental states refine existing genomic subtypes.

a-f) Differences in developmental state abundance between transcriptional subtypes of DUX4-R, KMT2A-R and BCR::ABL1 B-ALL. For each developmental state, P values from a two-tailed Wilcoxon rank-sum test as well as discriminatory power between transcriptional subtypes, represented by the area under the receiver operating characteristic (ROC) curve (AUC), are reported. Box plots indicate the range of the central 50% of the data with the central line marking the median. Whiskers extend from each both to 1.5x the interquartile range. a) Developmental state abundance between transcriptional subtypes of DUX4-r B-ALL (total n=112; DUX4-r Early/Multipotent = 70; DUX4-r Committed n =42). b) Projection results from representative samples. c) Developmental state abundance between transcriptional subtypes of KMT2A-r B-ALL (total n=144; KMT2A-r Early/Multipotent = 125; KMT2A-r Committed = 19). d) Projection results from representative samples. e) Developmental state abundance between transcriptional subtypes of BCR::ABL1 B-ALL (total n=127; BCR::ABL1 Early/Multipotent = 32; BCR::ABL1 Inter = 26; BCR::ABL1 Committed = 69). f) Projection results from representative samples. g) GSEA from bulk RNA-seq comparison of Early/Multipotent vs Committed subgroups of DUX4-r, KMT2A-r and BCR::ABL1 (Early/Multipotent vs Committed). h) rlog normalized CEBPA expression across transcriptional and developmental subgroups of BCR::ABL1 (n=127), KMT2A-r (n=144), and DUX4-r BALL (n=112). The number for each subgroup is the same as in panels a, c and e. For each comparison, P values from a two-tailed Wilcoxon rank-sum test are reported. Box plots indicate the range of the central 50% of the data with the central line marking the median. Whiskers extend from each both to 1.5x the interquartile range. i) B Development Multipotency Score comparison between uniformly processed MPAL (n = 66) and B-ALL (n = 1,153) patient samples. 
Specifically, Early/Multipotent and Committed subgroups of BCR::ABL1 (n=50 and n=57, respectively), KMT2A-r (n=69 and n=10, respectively) and ZNF384-r (n=53) with B-ALL and in “Other B-ALL” (n=793, non-BCR::ABL1, non-KMT2A-r and non-ZNF384-r) were compared against MPAL patients with the same genetic drivers: BCR::ABL1 (n=4), KMT2A-r (n=13) and ZNF384-r (n=13) or “Other” (n=36). For each comparison, P values from a two-tailed Wilcoxon rank-sum test are reported. Box plots indicate the range of the central 50% of the data with the central line marking the median. Whiskers extend from each box to 1.5x the interquartile range. j) Projection results from SJBALL182 with ZNF384-r at diagnosis (B-ALL) and at relapse (MPAL).

DUX4-R subgroups differed in their co-lesions, wherein IKZF1 and NRAS alterations were associated with multipotency while ERG and TBL1XR1 alterations were associated with B-lymphoid commitment (ED Fig. 7a). Differences in developmental state abundance between Early/Multipotent and Committed subgroups were also complemented by our prior NMF analysis. Notably, cell differentiation and inflammation transcriptional programs also discriminated between the two subgroups (ED Fig. 7d,e). In line with this, Early/Multipotent DUX4-R samples expressed stem and progenitor genes (e.g. FLT3, CD44) alongside genes within the inflammation pathway (e.g. TGFB), while Committed DUX4-R samples expressed B-cell commitment genes including EBF1, CD79B, MME (encoding CD10) and RAG1 (ED Fig. 7f). Flow cytometric analysis confirmed high expression of CD44 in Early/Multipotent DUX4-R and CD10 in Committed DUX4-R (ED Fig. 7g).

In KMT2A-R B-ALL, we did not identify co-lesions that were significantly associated with multipotency, though PAX5 alterations were weakly associated with lower Early Lymphoid abundance (P = 0.005, FDR=0.068) (ED Fig. 7b). Instead, KMT2A rearrangements are known to involve a variety of partner genes47, and we identified associations between fusion partner involvement and Multipotency Score or development state abundance. Notably, KMT2A::AFF1 was associated with higher Multipotency Scores and higher Early Lymphoid abundance while KMT2A::MLLT3 and KMT2A::MLLT10 were associated with lower Multipotency Scores and higher Pre-B abundance (ED Fig. 7h,i).

BCR::ABL1 subgroups also differed in their co-lesions, with deletions in CDKN2A/B, PAX5 and SLX4IP being enriched among the more committed Inter-Pro and Committed subgroups. Interestingly, there was no difference in Multipotency Score or developmental state between the two common BCR::ABL1 transcripts encoding p190 or p210 (ED Fig. 7j).

Pathway analysis comparing Early/Multipotent and Committed subgroups revealed minimal concordance across these three genomic subtypes, with no mutually upregulated or downregulated hallmark pathways. Rather, we found inflammatory and growth signaling to be upregulated within Early/Multipotent subgroups in DUX4-R and KMT2A-R B-ALL yet downregulated within the Early-Pro subgroup in BCR::ABL1 B-ALL. Despite the discordance in pathway enrichment, Early/Multipotent subgroups in all genomic subtypes converged on cell state, exhibiting upregulation of myeloid progenitor programs and downregulation of B lineage restriction programs (Fig. 6g). In line with this, CEBPA was among the top genes enriched in all Early/Multipotent subgroups compared to their Committed counterparts (Fig. 6h).

These data suggest that, in line with their counterparts from normal hematopoiesis, a subset of leukemic cells from B-ALL samples belonging to these Early/Multipotent subgroups may retain latent myeloid potential. Furthermore, differences in co-lesions or gene fusion partners raise the possibility of distinct cellular origins for Early/Multipotent B-ALLs compared to Committed B-ALLs, possibly involving Early Lymphoid progenitors including CLPs that continue to retain functional myeloid lineage potential.

Common cellular origin of Early/Multipotent B-ALL and MPAL

To further investigate the cellular origins of Early/Multipotent subgroups in B-ALL, we performed a comparative analysis of B-ALL transcriptomes with transcriptomes from B/Myeloid mixed phenotype acute leukemia (MPAL), which is characterized by expression of myeloid and lymphoid immunophenotypic features42,48,49. Founding genetic alterations do not explain immunophenotypic variegation, rather they are most likely acquired in primitive hematopoietic progenitors that retain multilineage potential42. First, we confirmed that Multipotency Score levels in MPAL were significantly higher than in B-ALL (Fig. 6i, ED Fig. 8). Next, we evaluated the Multipotency Score in the context of BCR::ABL1, KMT2A-R, and ZNF384-R genomic subtypes, which span both MPAL and B-ALL. Indeed, within these genomic subtypes classification of B-ALL versus MPAL is based solely on immunophenotypic criteria49. Interestingly, Multipotency Scores within Early/Multipotent subgroups of BCR::ABL1 and KMT2A-R B-ALL were comparable to those of MPALs with the same genomic alterations. This was also the case with ZNF384-R B-ALL, the B-ALL subtype with the highest overall Multipotency Score enrichment (Fig. 6i). This suggests that ZNF384-R B-ALL and Early/Multipotent subgroups of BCR::ABL1 and KMT2A-R B-ALL may share similar cellular origins with MPAL samples containing the same genomic alterations, and further supports the notion of latent myeloid potential within these B-ALL subsets. Indeed, one case of ZNF384-R B-ALL (SJBALL182) with matched diagnosis and relapse samples exhibited high HSC/MPP involvement at diagnosis, and both lymphoid and myeloid blasts at relapse. Specifically, in the relapse sample, we observed the persistence of a primitive population alongside the expansion of downstream myeloid progenitors (Fig. 6j).

B-ALL multipotency is associated with age

We sought to evaluate the association of age at diagnosis with developmental state abundance and Multipotency Score (Fig. 7a and ED Fig. 9a–d). Notably, infant B-ALL exhibited a high Multipotency Score and high Early Lymphoid abundance, in line with reports of its distinct cellular origins18. In contrast, B-ALL presenting in childhood had the highest enrichment for the Pro-B state. Furthermore, within childhood, high-risk disease was associated with high Early Lymphoid abundance and low Pro-B and Pre-B abundance compared to standard-risk disease. From adolescence to adulthood, the relative abundance of multipotent states including Early Lymphoid and HSC/MPP increased with age, suggesting that multipotent states may be more likely to be involved in adult B-ALL oncogenesis (Fig. 7b and ED Fig. 9e). Thus, higher-risk disease and older age are associated with Early/Multipotent stages of B cell development, likely reflecting age-dependent differences in susceptibility to leukemic transformation along human B cell development.

Figure 7. Clinical associations with B-ALL developmental states.

a) Multipotency Score (upper panel) and inferred abundance of Early Lymphoid, Pro-B and Pre-B states (lower panel) with age at B-ALL diagnosis. For each association, the LOESS regression line shaded with the 95% CI is shown for 2,019 B-ALL patients with available age information. b) Association between Multipotency Score (left panel) and inferred abundance of Early Lymphoid and Pro-B states (right panel) with B-ALL clinical risk groups for 2,022 B-ALL patients annotated for clinical risk (Childhood HR, n = 680; Childhood SR, n = 527; AYA, n = 430; Adult, n = 385). c) Association between Multipotency Score and inferred abundance of Early Lymphoid, Pro-B, and Pre-B states at B-ALL diagnosis with measurable residual disease (MRD) levels at day 29 of induction chemotherapy administration for 794 pediatric B-ALL patients with available information on MRD (< 0.01%, n = 448; 0.01 – 0.1%, n = 148; 0.1 – 1%, n = 104; 1 – 10%, n = 54). For each comparison from (b) and (c), P values from a two-tailed Wilcoxon rank-sum test are reported. Box plots indicate the range of the central 50% of the data with the central line marking the median. Whiskers extend from each box to 1.5x the interquartile range. d) Association between Multipotency Score as well as HSC/MPP, Early Lymphoid, Pro-B, and Pre-B state abundance with ex vivo drug sensitivity from 595 B-ALL samples profiled by bulk RNA-seq and ex vivo drug sensitivity screening for 18 therapeutic agents from Lee et al.50 Only agents with a significant P value < 0.05 are shown and those with the “*” symbol have FDR < 0.05. e) For each developmental state and Multipotency Score, association with outcome was assessed within 1,010 pediatric B-ALL patients with available outcome data. For each variable of interest, multivariable Cox models were regressed on overall survival and event-free survival, with age, sex, WBC, clinical risk group, and genomic subtype as independent covariates. 
The adjusted hazard ratios (HR) from multivariable analysis are reported for each standard deviation increase in inferred abundance. Error bars depict the 95% confidence interval for each variable of interest, presented as the adjusted hazard ratio +/− 1.96 standard errors. f) Schematic model created with Biorender.com depicting B-cell development and association of specific stages of development with genetic drivers of B-ALL.

B-ALL multipotency confers chemoresistance and poor outcome

We next sought to understand whether the Multipotency Score was associated with chemoresistance in pediatric B-ALL. Among a subset of pediatric patients (N=794) with measurable residual disease (MRD) information at day 29 of induction therapy, a high Multipotency Score and high Early Lymphoid state abundance, together with low Pro-B and Pre-B state abundance, were observed in MRD positive patients compared to MRD negative patients (Fig. 7c and ED Fig. 9f). These associations also extended to the level of residual disease among patients who were MRD positive (ED Fig. 9g,h), suggesting that the multipotent versus committed axis may be associated with response to chemotherapy in B-ALL. Association of ex vivo drug sensitivity of 18 therapeutic agents with developmental state abundance from 595 B-ALL samples profiled by bulk RNA-seq50 added further support for this notion, showing that the Multipotency Score was associated with resistance to common chemotherapeutic agents (Fig. 7d).

Given this association with chemoresistance, we next evaluated whether the Multipotency Score was associated with survival outcomes. Multivariable analysis controlling for age, sex, white blood cell (WBC) count at diagnosis, clinical risk group and genomic subtype within 1,010 pediatric B-ALL patients revealed that a higher Multipotency Score was independently associated with adverse overall survival (OS) and event-free survival (EFS) outcomes (Fig. 7e, ED Fig. 9i–k and Supplementary Tables 22,23). Notably, the Multipotency Score was prognostic within favorable risk and intermediate risk genomic subtypes as well as within high-risk childhood disease (ED Fig. 10a–d). This remained significant in a multivariable analysis within a subset of 649 patients who were MRD negative at day 29 of induction therapy (ED Fig. 9j and ED Fig. 10e,f). These survival associations were more nuanced within adult B-ALL. Among adult B-ALL patients, a higher Multipotency Score was associated with adverse OS and relapse-free survival (RFS) by univariable analysis (n=324), but these associations were not significant in multivariable analysis (n=312) (ED Fig. 9k). However, within a cohort of adult BCR::ABL1 B-ALL patients (n=41), a higher Multipotency Score was associated with adverse OS and EFS outcomes in contrast to pediatric cases (ED Fig. 10g,h). Given its clinical implications, the association between developmental state and leukemia subtype is summarized in Fig. 7f.
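The adjusted hazard ratios in Fig. 7e are reported per standard deviation increase of each variable, with error bars spanning 1.96 standard errors. The sketch below shows one conventional way to obtain such values from a fitted Cox coefficient. It assumes that the interval is formed on the log-hazard scale and then exponentiated; the function name and the numeric inputs are hypothetical placeholders, not values from this study.

```python
import math

def hazard_ratio_per_sd(beta_per_unit, se_per_unit, sd):
    """Rescale a Cox log-hazard coefficient to 'per one SD increase'.

    beta_per_unit, se_per_unit: coefficient and standard error per raw unit
    sd: standard deviation of the variable in the cohort
    Returns (hazard ratio, lower 95% bound, upper 95% bound).
    Assumption: the CI is built on the log scale, then exponentiated.
    """
    beta_sd = beta_per_unit * sd      # log hazard ratio per SD
    se_sd = se_per_unit * sd          # the standard error scales the same way
    hr = math.exp(beta_sd)
    lower = math.exp(beta_sd - 1.96 * se_sd)
    upper = math.exp(beta_sd + 1.96 * se_sd)
    return hr, lower, upper

# Illustrative inputs only (not results from the paper).
hr, lower, upper = hazard_ratio_per_sd(beta_per_unit=2.0, se_per_unit=0.5, sd=0.2)
print(f"HR per SD = {hr:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

An interval that excludes 1 corresponds to a nominally significant association at the 5% level.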

Composition of non-malignant cells in B-ALL

As the presence of normal hematopoietic cells offers the opportunity to explore cell microenvironment composition and its correlation with molecular subtypes, we analyzed the type and proportion of non-blast cells (Fig. 8a and Supplementary Table 4) and observed enrichment of specific normal cell types according to the B-ALL molecular subtype. For example, ETV6::RUNX1-like and TCF3::PBX1 samples had a higher proportion of non-classical monocytes (CD14dim, CD16+) (Fig. 8b), confirming previous scRNA-seq data from ETV6::RUNX1 samples51. These cells are characterized by overexpression of genes involved in WNT (LBH), PI3K/AKT (AKT3) or interleukin signaling (CSF1R) and downregulation of ribosomal proteins (Fig. 8c). Interestingly, dendritic cells (overexpression of IL3RA and CST3) were instead enriched in BCR::ABL1 ALL; specifically, they were found only in samples with an early pre-B differentiation stage (BCR::ABL1 Early) (Fig. 8d,e). Overall, these findings suggest a dynamic crosstalk between leukemia cells and their microenvironment.

Figure 8. Single-cell analysis of non-malignant cells in B-ALL samples.

a) Schematic workflow of the analysis of normal cells, which include monocytes (Mono) and dendritic cells (DC), T cells and natural killer (NK) cells, and normal B-cells, in the 89 B-ALL samples analyzed by scRNA-seq. b) Percentage of CD16+ cells (non-classical monocytes) among all monocytes across different subtypes. ETV6::RUNX1-like samples had the highest proportion (adjusted P values are from one-way ANOVA; only significant P values are shown). Box plots indicate the range of the central 50% of the data with the central line marking the median. Whiskers extend from the smallest to the largest value. Dot plots represent individual data points from KMT2A-R (N=8); hyperdiploid (N=2); BCR::ABL1 (N=11); BCR::ABL1-like (N=8); DUX4-R (N=7); ZNF384-R (N=4); near haploid (N=2); low-hypodiploid (N=1); MEF2D-R (N=1); iAMP21 (N=2); PAX5alt (N=2); TCF3::PBX1 (N=4); ETV6::RUNX1-like (N=5). Only samples with at least 10 monocytes detected are shown. c) GSEA showing enrichment of ribosomal proteins in DC from ETV6::RUNX1-like samples. d) Histograms showing the percentage of DC in non-malignant cells in BCR::ABL1 positive ALL patients. e) Violin plot showing the expression of DC-related genes (IL3RA and CST3) in BCR::ABL1 ALL samples.

DISCUSSION

Here we provide multiple insights into the functional and cellular heterogeneity that exists in B-ALL by examining intratumor transcriptional heterogeneity and B-cell development state. First, our approach allowed us to refine the temporal and quantitative patterns of CNV evolution in B-ALL, such as the progressive chromosome 1q gain in TCF3::PBX1 and the evolutionary nature of aneuploidies in B-ALL, improving our understanding of the genetic intra-tumor heterogeneity of B-ALL and demonstrating the power of inferCNV from scRNA-seq to detect and chart the evolution of copy number changes.

Second, this study provides a framework for understanding normal and malignant B-cell development through integration of multiomic and functional characterization of stringently purified cell populations spanning human B-cell development. Our finding that functional myeloid lineage potential persists into early lymphoid populations including human CLPs challenges prior models pertaining to lymphoid lineage restriction timing and has important implications for understanding lineage promiscuity in B-ALL. Notably, we identified distinct developmental states within B-ALL, ranging from stem cells to more committed stages, which are associated with distinct genomic alterations and potentially reflect discrete windows of susceptibility to B-ALL transformation along differentiation (Fig. 7f). Further, the enrichment of multipotent developmental states in infant and adult B-ALL suggests that these windows of susceptibility across human B-cell development shift throughout the human lifespan.

Third, we developed a transcriptional Multipotency Score which was positively associated with multipotent cell populations including HSC/MPP and Early Lymphoid and negatively associated with committed B-cell states. This score alone was able to discriminate between transcriptional subgroups within KMT2A-R, DUX4-R, and BCR::ABL1 B-ALL with high accuracy, highlighting that transcriptional heterogeneity within these genomic subtypes can be explained by involvement of either Early/Multipotent or Committed developmental states. In the context of BCR::ABL1 ALL, wherein two or three subgroups have been identified based on targeted BCR::ABL1 detection within lymphoid and myeloid cells1 or whole transcriptome sequencing5,14,52, our analysis provides strong evidence for three transcriptional and cellular subgroups of BCR::ABL1: Early/Multipotent, Inter-pro and Committed. Notably, the Inter-pro subgroup shares a high abundance of HSC with the Early/Multipotent subgroup and a high abundance of Pre-B cells with the Committed subgroup, and has homozygous deletions of IKZF1 as a hallmark.

Importantly, our finding that B-ALL subsets with the highest Multipotency scores (i.e. ZNF384-R B-ALL and Early/Multipotent subgroups of BCR::ABL1 and KMT2A-R B-ALL) had comparable multipotency scores to MPALs with the same genetic driver alterations suggests shared stem cell origins of these B-ALL subsets with their MPAL counterparts. These findings have important therapeutic implications. Although these stem cells appear restricted to lymphoid differentiation at diagnosis, our data suggest that they retain a latent myeloid potential, which may manifest at relapse following lymphoid-targeted therapies. This may explain reports of lineage switch from ALL to AML, especially under pressure of B-cell–specific immunotherapy, as well as further reports of monocytic lineage switch in ZNF384-R, DUX4-R, and PAX5 P80R B-ALL12. This change in lineage phenotype can result in discordant MRD levels determined by flow cytometry of lymphoid markers and molecular PCR-based assays of immunoglobulin gene rearrangements and affects the availability of CD19 as a therapeutic target.

Finally, the Multipotency Score was associated with clinical outcomes following conventional B-ALL chemotherapy. Its association with ex vivo chemo-resistance and higher levels of residual disease suggests inherent differences in chemo-sensitivity between B-ALL developmental states. Further, its association with inferior overall survival as well as event-free survival in pediatric B-ALL, independent of genomic subtype and other key clinical features, positions the Multipotency Score as a powerful biomarker for pediatric B-ALL.

Collectively, elucidation of distinct developmental states refines our understanding of inter-patient and intra-patient heterogeneity in B-ALL and reveals patient subsets with distinct cellular origins and clinical characteristics. Mapping the intersection between genetic driver alterations and specific cellular states in B-ALL may ultimately improve risk stratification and therapy development.

METHODS

Ethical regulation

This study complies with all relevant ethical regulations and was approved by the Institutional Review Board of St. Jude Children’s Research Hospital (Institutional Review Board #00000029; FWA00004775), protocol “ALLGENOME” (protocol number Pro00007034). All mouse experiments were approved by the Animal Care Committee of the University Health Network (UHN) and were performed in accordance with all the relevant regulatory and ethical standards (animal protocol AUP #1117.54). Written informed consent/assent was obtained from all study participants and/or their legally authorized representative in accordance with the Declaration of Helsinki. Samples were coded and assigned a unique study identifier. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Cohort description

We analyzed children, adolescents and young adults (AYA) with newly diagnosed B-ALL (N=84) and matched relapsed B-ALL (N=5). All patients were enrolled on St. Jude Children’s Research Hospital sponsored trials, including Total Therapy XV (ClinicalTrials.gov Identifier NCT00137111)53, Total Therapy 16 (NCT00549848)54 and Total Therapy 17 (NCT03117751)55. Detailed clinical information for each case is provided in Supplementary Table 1.

Reagents

The description with catalogue numbers of reagents used is provided in Supplementary Table 24.

Single-cell RNA sequencing

Frozen mononuclear cells (MNC) from diagnosis and relapse bone marrow samples were thawed and counted, and up to 2 million cells were subjected to dead cell removal (Miltenyi Biotec) following the manufacturer’s instructions. Isolated live cells were washed three times with 1X Phosphate-Buffered Saline (PBS, calcium and magnesium free) containing 0.04% weight/volume BSA (Thermo Fisher Scientific) and counted using the Countess 3 Automated Cell Counter (Thermo Fisher Scientific). For each sample, we calculated the volume of cells needed to recover 8,000–10,000 cells. Cells were processed following the standard manufacturer’s protocols for the Chromium Next GEM Single Cell 5’ v2 (Dual Index) and as previously described56. Briefly, barcoded cDNA from polyadenylated mRNA was amplified, purified, and quantified after breaking the GEMs and recovering pooled fractions. The cDNA was further processed through enzymatic fragmentation, size selection, and library construction. Illumina-ready dual index libraries were sequenced at the recommended depth at the Hartwell Center at SJCRH on the Illumina NovaSeq according to the manufacturer’s recommendations.

Single-cell RNA sequencing analysis

Single-cell RNA-seq data were aligned and quantified using the Cell Ranger (v5.0.1) pipeline against the human genome GRCh38 (refdata-gex-GRCh38-2020-A). Cells with mitochondrial content >8%, fewer than 500 detected genes, or fewer than 2,500 detected molecules were removed using Seurat (4.0.6, https://satijalab.org/seurat)57. The remaining cells were visualized using Uniform Manifold Approximation and Projection (UMAP) reduction and characterized based on gene expression of major haemopoietic cell types. Doublets and multiplets were filtered out using DoubletFinder with the core statistical parameters (nExp and pK) determined automatically using recommended settings58. Ambient cell-free mRNA contamination was estimated using SoupX59. Copy number variations were inferred using inferCNV with parameters HMM=TRUE, analysis_mode = “subclusters”, k_obs_groups = 8, denoise=TRUE (v.1.11.1, https://github.com/broadinstitute/inferCNV). The groups with similar copy number patterns were then merged. This inferCNV analysis was initially performed on all the cells within each sample as one of the four methodologies employed to distinguish between leukemic and non-malignant cells, and then run on leukemic cells only to identify copy number subclones. Differential expression analysis between different clusters of cells was performed using the FindMarkers function from the Seurat package with default parameters (logfc.threshold = 0.25, test.use = “wilcox”, min.pct=0.1).
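The cell-level quality filters described above can be expressed as a simple predicate. The following is a minimal Python sketch of that rule only; the analysis itself was carried out in Seurat, and the function name, field names and example cells here are hypothetical.

```python
def passes_qc(pct_mito, n_genes, n_umis,
              max_pct_mito=8.0, min_genes=500, min_umis=2500):
    """Return True if a cell survives the three filters stated in the text.

    A cell is removed when mitochondrial content is >8%, when fewer than
    500 genes are detected, or when fewer than 2,500 molecules are detected.
    """
    return (pct_mito <= max_pct_mito
            and n_genes >= min_genes
            and n_umis >= min_umis)

# Hypothetical per-cell QC metrics for illustration.
cells = [
    {"barcode": "cell_1", "pct_mito": 3.2, "n_genes": 1800, "n_umis": 6000},
    {"barcode": "cell_2", "pct_mito": 12.5, "n_genes": 2100, "n_umis": 7000},
    {"barcode": "cell_3", "pct_mito": 2.0, "n_genes": 450, "n_umis": 3000},
    {"barcode": "cell_4", "pct_mito": 4.1, "n_genes": 900, "n_umis": 2400},
]

kept = [c["barcode"] for c in cells
        if passes_qc(c["pct_mito"], c["n_genes"], c["n_umis"])]
print(kept)
```

Only the first example cell passes; each of the others fails exactly one of the three criteria.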

Single-cell whole genome sequencing (scWGS)

scWGS was performed as previously described56. Briefly, frozen bone marrow MNC from SJE2A063, SJE2A066 and SJE2A067 were thawed and stained with Alexa Fluor® 700 Mouse Anti-Human CD45 (BD Biosciences, #560566), BV605 Mouse Anti-Human CD19 (BD Biosciences #562653) and PE Mouse Anti-Human CD3 (BD Biosciences, #561809), and single cells were sorted by fluorescence-activated cell sorting (FACS) into 96-well plates as CD19+CD45dim (leukemic blasts) and CD3+CD45bright (normal T cells). Sorted single cells were processed with the PicoPLEX Gold Single Cell DNA-Seq Kit (Takara Bio USA), according to the manufacturer’s instructions.

scWGS analysis

scWGS data were mapped to the human reference genome GRCh38 by BWA (version 0.7.12). Samtools (version 1.5) was used to estimate the sequencing depth for each segment inferred from bulk WGS data. To generate the coverage normalization index, the segment coverages of each normal cell were first divided by the median segment coverage across the genome, and then the median value of each genomic segment among all the normal cells was used as the normalization index. To estimate the copy number for individual tumor cells, segment coverages were first divided by the coverage normalization index, then by the median segment coverage across the genome, and multiplied by two due to diploidy.
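The normalization described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the pipeline used in the study: function names and coverage values are hypothetical, and the genome-wide median for tumor cells is assumed to be taken after dividing by the normalization index, which the text does not specify.

```python
from statistics import median

def normalization_index(normal_cells):
    """Per-segment index from normal cells.

    normal_cells: list of per-cell lists of segment coverages, all in the
    same segment order. Each cell is scaled by its own genome-wide median,
    then the median across cells is taken for every segment.
    """
    scaled = []
    for coverages in normal_cells:
        cell_median = median(coverages)
        scaled.append([c / cell_median for c in coverages])
    n_segments = len(normal_cells[0])
    return [median(cell[i] for cell in scaled) for i in range(n_segments)]

def estimate_copy_number(tumor_coverages, index):
    """Copy number per segment for one tumor cell (diploid baseline of 2)."""
    corrected = [c / w for c, w in zip(tumor_coverages, index)]
    # Assumption: the genome-wide median is computed on corrected coverages.
    cell_median = median(corrected)
    return [2 * c / cell_median for c in corrected]

# Hypothetical coverages; the last segment is systematically under-covered.
normals = [[100, 100, 100, 50], [200, 200, 200, 100]]
index = normalization_index(normals)
copy_number = estimate_copy_number([100, 150, 100, 50], index)
print(index, copy_number)
```

In this toy case the index absorbs the technical under-coverage of the last segment, so the tumor cell is called diploid there and shows a single-copy gain only in the second segment.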

Non-negative matrix factorization (NMF)

To determine the transcriptional programs within leukemic cells, we performed NMF using cNMF (v1.1)26 for the largest inferred copy number clone of each sample (clone C1). For this analysis we used the top 2,000 overdispersed genes found in the upper 50% of malignant cells exhibiting high total RNA counts. To identify the most stable and accurate number of factors (k) for each sample, 100 iterations of NMF with different random seeds were run for values of k from 5 to 10. For each value of k, the Silhouette score, measuring the stability of the components, and the Frobenius error were computed, and the k maximizing the Silhouette score and minimizing the Frobenius error was selected for each sample, as implemented in cNMF. For each of the resulting factors, we considered the 30 genes with the highest NMF scores to be characteristic of that factor. For each subtype, hierarchical clustering of the scores for each program was used to identify main correlated sets of programs. The 30 genes with the highest average NMF score within each correlated program set (excluding ribosomal protein genes) were then used to define subgroup-specific metaprograms. Each program was annotated using pathway enrichment analysis.
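The gene selection step for each factor and metaprogram amounts to ranking genes by score and keeping the top 30. A minimal sketch follows; the scores are invented for illustration, the function name is hypothetical, and identifying ribosomal protein genes by the RPL/RPS symbol prefix is a simplifying assumption not stated in the text.

```python
def top_genes(scores, n=30, exclude_ribosomal=False):
    """Return the n genes with the highest scores.

    scores: dict mapping gene symbol -> NMF score (or average score
    across a correlated set of programs).
    exclude_ribosomal: drop genes whose symbol starts with RPL or RPS
    (assumed proxy for ribosomal protein genes).
    """
    items = list(scores.items())
    if exclude_ribosomal:
        items = [(g, s) for g, s in items
                 if not g.startswith(("RPL", "RPS"))]
    ranked = sorted(items, key=lambda pair: pair[1], reverse=True)
    return [gene for gene, _ in ranked[:n]]

# Invented scores; n=3 keeps the example short.
example_scores = {"CD34": 0.9, "RPL13": 0.8, "FLT3": 0.7,
                  "CD44": 0.6, "RPS6": 0.5, "EBF1": 0.1}
factor_genes = top_genes(example_scores, n=3)
metaprogram_genes = top_genes(example_scores, n=3, exclude_ribosomal=True)
print(factor_genes, metaprogram_genes)
```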

Human cord blood (CB) samples

Human CB samples were obtained with informed consent from Trillium Health, Credit Valley and William Osler Hospitals (Ontario, Canada) according to procedures approved by the University Health Network (UHN) Research Ethics Board. CB were pooled and processed together to have enough CD34+ cells for experiments. The number of CB units in each CB pool was variable depending on the number of newborns, ranging from 2 to 10. MNC from pools of male and female CB units (~4–15 units) were obtained by centrifugation on Lymphoprep medium (Stem Cell Technologies), and after ammonium chloride lysis MNC were enriched for CD34+ cells by positive selection with the CD34 Microbead kit (Miltenyi Biotec) and LS column purification with MACS magnet technology (Miltenyi Biotec). Resulting CD34+ CB cells were viably stored in 50% PBS, 40% fetal bovine serum (FBS) and 10% DMSO at −80°C or −150°C.

Flow Cytometric Analysis and Sorting

For cell sorting, CD34+ and CD34− human CB cells were thawed via slow dropwise addition of X-VIVO 10 medium with 50% fetal bovine serum (FBS) and DNaseI (200 μg/ml). Cells were spun at 350g for 10 minutes (min) at 4°C and then resuspended in PBS with 2.5% FBS. For all in vitro and in vivo experiments, the full stem and progenitor hierarchy sort was performed as described in Notta et al.60 Briefly, this sort scheme is an 11-parameter design which includes CD49f to distinguish HSCs from MPPs and CD19 to differentiate B-lymphoid–committed progenitors from myeloid progenitors like GMPs. Cells were resuspended in 100 μl per 1×10^6 cells and stained in two subsequent rounds for 15 min at room temperature each. See Supplementary Table 25 for the full list and details of the antibodies used.

Cell Culture

For in vitro experiments, sorted cells were cultured in round-bottom 96-well plates with the indicated cell media: Lympho-Myeloid media (Ly-My) or Lymphoid promoting media (LyP) (Supplementary Table 14). Single cell in vitro assays were set up as described previously61 with low passage mouse MS-5 stromal cells (kindly provided by Dr. K. Itoh, Japan) seeded at a density of 1,500 cells per well of a 96-well plate and grown for 2–4 days in H5100 media (Stem Cell Technologies). Briefly, one day prior to coculture initiation, the H5100 media was removed and replaced with 100 μl erythro-myeloid differentiation media (Supplementary Table 14). Sorted single cells were deposited into the MS-5 seeded 96-well plates (80 wells/96-well plate) using the FACSAria II (BD). Colonies were scored after 15–17 days under the microscope and every individual well containing a visible colony was stained with antibodies listed in Supplementary Table 25 and analyzed by flow cytometry using a BD FACSCelesta instrument equipped with a high throughput sampler (HTS).

Mouse experiments

Mouse experiments were done in accordance with institutional guidelines approved by the University Health Network (Toronto, Canada) Institutional Animal Care and Use Committee. All in vivo experiments were done with 8- to 12-week-old female or male NOD.Cg-PrkdcscidIl2rgtm1Wjl/SzJ (NSG) mice (JAX) that were irradiated the day before intrafemoral cell injection. All mice were housed at the animal facility at Princess Margaret Cancer Centre in a room designated only for immunocompromised mice with individually ventilated racks equipped with complete sterile micro-isolator caging (IVC), on corn-cob bedding and supplied with environmental enrichment in the form of a red house/tube and a cotton nestlet. Cages are changed every <7 days under a biological safety cabinet. Health status is monitored using a combination of soiled bedding sentinels and environmental monitoring. No mice were excluded from the analysis. No malignant cells were injected into immune-deficient mice. Only normal human populations were transplanted. In all cases these were short term transplants resulting in very low levels of human engraftment and no animal reached humane endpoints.

Xenotransplantation

One day after transduction, the progeny of 10,000 transduced CD34+CD38− cells were injected intrafemorally into age- and gender-matched 8–12-week-old male and female NSG recipient mice. The mice were housed on ventilated racks supplied with autoclaved micro isolator cages. Reverse osmosis water was supplied via an automatic watering system. The light cycle was maintained with lights off at 6:00 pm and on at 6:00 am. The room temperature was kept between 22–23 degrees and humidity ranged from 40% to 60%. At 1–2 weeks post-transplant39,60,62, mice were euthanized, the injected femur and other long bones (non-injected femur, tibiae) were flushed separately with Iscove’s modified Dulbecco’s medium (IMDM) (Thermo Fisher) + 5% FBS, and 5% of cells were analyzed for human chimerism using the antibodies listed in Supplementary Table 25. Sick and mis-injected mice were excluded from analysis.

RNA-seq processing and analysis

Freshly sorted populations from 3–5 independent CB pools were frozen (−80°C) in PicoPure RNA Extraction Buffer, and RNA was isolated using the PicoPure RNA Isolation Kit (ThermoFisher). Samples with RIN>8 and sufficient concentration (Bioanalyzer pico chip) were processed at the Center for Applied Genomics, Sick Kids Hospital. cDNA was generated with SMART-Seq v4, and libraries were prepared using Nextera XT (Illumina). Libraries were pooled (4 per lane) and sequenced on an Illumina HiSeq 2500, generating 125-bp paired-end reads (55–75M reads/sample). Reads were aligned to hg38 with STAR v2.5.2b, annotated with Ensembl v90, and quantified using HTSeq v0.7.2. Differential expression analysis was performed with edgeR_3.24.3.

SCENIC regulon analysis

For transcription factor inference, the 130,085 single cells were aggregated into metacells using the Metacell2 package. The highly variable genes used to construct the reference map were used as input into the algorithm guiding feature gene selection, thus optimizing clustering for metacell partitioning. This generated 6,394 metacells with a target size of 75,000 UMI. Gene regulatory network inference was performed using pySCENIC38 on the metacell raw counts to reduce runtime. Candidate regulons were pruned using the annotations of transcription factor motifs, ‘motifs-v9-nr.hgnc-m0.001-o0.0.tbl’. Subsequently, cisTarget was employed using the ‘mc9nr’ databases, which include known human TF motifs annotated at: a) 500 bp upstream and 100 bp downstream of the transcriptional start site (TSS) and b) 10 kb centered around the TSS. Log-transformed counts of the metacells were used as input for cisTarget. Transcription factor regulons were then scored using AUCell in the single-cell dataset.

ATAC-seq processing and analysis

Library preparation for ATAC-seq was performed on 1,000–5,000 cells with the Nextera DNA Sample Preparation kit (Illumina), according to a previously reported protocol63. Four ATAC-seq libraries were sequenced per lane on a HiSeq 2500 System (Illumina) to generate paired-end 50-bp reads. Reads were mapped to hg38 using BWA (0.7.15) with default parameters. Duplicate reads and reads mapped to mitochondrial DNA, an ENCODE blacklisted region, or an unspecified contig were removed (Encode Project Consortium, 2012). MACS (2.2.5) was used to call peaks in mapped reads. A catalogue of all peaks was obtained by concatenating all peaks and merging any overlapping peaks.
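
The final step, building a peak catalogue by concatenating peaks and merging overlaps, can be illustrated with a minimal sketch. This is not the authors' code: the function name `merge_peaks`, the tuple layout, and the half-open coordinate convention (under which touching peaks are not merged) are assumptions made for illustration.

```python
from collections import defaultdict

def merge_peaks(peaks):
    """Merge overlapping peaks per chromosome.

    peaks: iterable of (chrom, start, end) tuples pooled from all samples.
    Coordinates are assumed half-open, so peaks that merely touch stay separate.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks:
        by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):
        intervals = sorted(by_chrom[chrom])
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start < cur_end:
                # Overlap with the current merged peak: extend it.
                cur_end = max(cur_end, end)
            else:
                merged.append((chrom, cur_start, cur_end))
                cur_start, cur_end = start, end
        merged.append((chrom, cur_start, cur_end))
    return merged

# Hypothetical peaks concatenated from two libraries.
catalogue = merge_peaks([
    ("chr1", 100, 200), ("chr1", 150, 300),
    ("chr1", 400, 500), ("chr2", 10, 20),
])
```

In practice this operation is usually done with a dedicated interval tool; the sketch only shows the logic of the merge.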

Cross-ontogeny reference map of B cell development

To increase the resolution of our normal B-lymphoid reference and to include fetal tissues, which are hypothesized to be a potential origin of B-ALL, we expanded our reference map by incorporating scRNA-seq data from additional studies covering fetal liver31, fetal bone marrow32, umbilical cord blood (Human Cell Atlas CB)34, and a fourth study incorporating progenitors across ontogeny36.

Briefly, scRNA-seq data from each study was mapped to the bone marrow reference map and any cells positioned within the B-lymphoid lineage (HSC, MPP, LMPP, MLP, CLP (Early Pro-B), Pre-Pro-B, Pro-B, Pre-B, Immature B, Mature B) were retained as well as cells mapping to proximal states branching off of B-lymphoid development (early GMP, Pre-pDC, Pre-pDC Cycling, pDC).

Single-cell transcriptomes positioned within B cell development and proximal branch points (GMP, pDC), spanning fetal liver (FL), fetal BM (FBM), CB, pediatric BM (PBM), and adult BM (ABM) tissues profiled across eight studies, were integrated to develop a comprehensive map of B-cell development comprising a total of 130,085 single-cell transcriptomes. This included single-cell transcriptomes from the following studies: Human Cell Atlas, n = 15,689; Oetjen et al 2018 (ref28), n = 6,593; Ainciburu et al 2023 (ref29), n = 23,278; Setty et al 2019 (ref30), n = 11,071; Popescu et al 2019 (ref31), n = 5,471; Jardine et al 2021 (ref32), n = 42,877; Roy et al 2021 (ref33), n = 25,106. Transcriptomes from Granja et al 2019 (ref64) and Mende et al 2022 (ref65) were excluded due to high rates of gene dropout, particularly involving B cell receptor genes. This atlas spans 90 donors across five tissues (FL, n = 20,944; FBM, n = 39,680; CB, n = 5,680; PBM, n = 5,870; ABM, n = 57,911), and three technologies (10× 3’ V2, n = 74,990; 10× 3’ V3, n = 41,576; 10× 5’, n = 13,519). The full list of studies, tissues and technologies represented in the B-cell developmental map is provided in Supplementary Table 11.

Highly variable gene selection

Proper highly variable gene (HVG) selection was critical for minimizing batch effects from technology and donor. Our final HVG selection approach divided the cohort by sequencing technology and used the sc.pp.highly_variable_genes function within scanpy (v1.9.1), adjusting for donor ID among cells sequenced with each technology66. HVGs exceeding a normalized dispersion threshold of 1 within each technology (10× 3’ V2, n=425 genes; 10× 3’ V3, n=606 genes; 10× 5’, n=581 genes) were retained, and the union of these genes across all three technologies was used. This resulted in 950 highly variable genes across B cell development, which were used for dimensionality reduction.
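
The per-technology thresholding and union step can be sketched as follows. The per-technology counts above sum to 1,612 genes, so a union of 950 implies that many genes were selected in more than one technology. The sketch assumes that normalized dispersions have already been computed per technology (for example by scanpy); the function name, data layout, and toy values are hypothetical.

```python
def union_of_hvgs(dispersion_by_tech, threshold=1.0):
    """Select genes above a normalized-dispersion threshold within each
    technology, then take the union across technologies.

    dispersion_by_tech: {technology: {gene: normalized dispersion}}
    Returns (per-technology selections, union of all selections).
    """
    selected = {
        tech: {gene for gene, disp in dispersions.items() if disp > threshold}
        for tech, dispersions in dispersion_by_tech.items()
    }
    union = set().union(*selected.values())
    return selected, union

# Toy dispersions (illustrative values only, not from the study).
toy = {
    "10x_3prime_v2": {"RAG1": 2.1, "DNTT": 1.4, "ACTB": 0.2},
    "10x_3prime_v3": {"RAG1": 1.8, "VPREB1": 1.2, "ACTB": 0.1},
    "10x_5prime": {"DNTT": 0.9, "IGLL1": 1.6, "ACTB": 0.3},
}
per_tech, hvg_union = union_of_hvgs(toy)
```

Taking the union rather than the intersection keeps genes that are informative in only one chemistry, at the cost of a larger feature set.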

Dimensionality reduction and clustering

Dimensionality reduction and cell type annotation were performed in line with BoneMarrowMap (https://github.com/andygxzeng/BoneMarrowMap), wherein an iterative grid search was performed across multiple batch correction parameters and multiple dimensionality reduction parameters to identify a UMAP embedding that exhibited a continuous hematopoietic differentiation manifold while retaining adequate distance between non-cycling CD34+ Pro-B and CD34− Pre-B cells. Parameters for the final embedding included 950 HVGs as described above and batch correction with harmony (v0.1.1) across 90 individual donors with parameters theta = 1, max.iter.cluster = 150, and max.iter.harmony = 20. A neighbourhood similarity matrix was constructed for the top 100 neighbours of each cell by cosine distance across the top 30 harmony components, and UMAP reduction was performed with min.dist = 0.29 and spread = 0.9. To identify cell states within the data, Leiden clustering was performed using scanpy (v1.9.1) at resolutions ranging from 0.5 to 20. Known cell type signatures derived from ground-truth bulk RNA-seq profiles from sorted CB fractions spanning human B cell development, as well as bulk RNA-seq from sorted FL fractions67, were scored in the single cell atlas and used to identify optimal transition points between clusters. Briefly, broad classifications from low-resolution Leiden clustering were refined into specific cell states with higher-resolution Leiden clustering to align cell state assignments with marker expression and known cell type signatures. Any remaining unlabeled cells were labeled through KNN classification based on harmony components. These assignments were subsequently validated by marker gene expression, enrichment of known bulk RNA-seq cell type signatures, BoneMarrowMap assignments, and projection of Abseq data35.
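
The KNN labeling of remaining cells can be sketched as a majority vote among the nearest labeled cells in the batch-corrected embedding. The text does not state the value of k or the distance metric used for this step, so k = 3 and cosine distance are assumptions here, as are the function names and the two-dimensional toy embedding.

```python
import math
from collections import Counter

def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def knn_label(query, labeled, k=3):
    """Assign the majority label among the k nearest labeled cells.

    query: embedding vector of an unlabeled cell.
    labeled: list of (embedding vector, label) pairs.
    """
    nearest = sorted(labeled, key=lambda item: cosine_distance(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D embedding standing in for harmony components.
reference = [
    ((1.0, 0.0), "Pro-B"), ((0.9, 0.1), "Pro-B"),
    ((0.0, 1.0), "Pre-B"), ((0.1, 0.9), "Pre-B"), ((0.2, 0.8), "Pre-B"),
]
label_a = knn_label((0.95, 0.05), reference)
label_b = knn_label((0.05, 0.95), reference)
```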

B-cell consensus non-negative matrix factorization

Consensus NMF26 was run on the single cell atlas to identify gene expression programs in an unsupervised manner. Raw counts, as well as SCTransform V2 (ref68) normalized counts, were used for running cNMF with default parameters, and the final number of gene expression programs (cNMF components) was selected by evaluating the silhouette score (stability) and the Frobenius reconstruction error.

Composition analysis of B-ALL data

B-ALL scRNA-seq Mapping

For mapping of B-ALL scRNA-seq samples, filtered counts from 10x were loaded into Seurat (v4.3.0) and subjected to further QC with the following filters: nFeature_RNA >500, nCount_RNA >2500, and pct.mito <8. Filtered single-cell transcriptomes were first projected onto BoneMarrowMap (v0.1.0, https://github.com/andygxzeng/BoneMarrowMap) based on 2,386 highly variable genes spanning human bone marrow cells. Cells that were assigned to the following cell states by BoneMarrowMap (HSC, MPP-MyLy, LMPP, Early GMP, MLP, MLP-II, Pre-pDC, Pre-pDC Cycling, pDC, CLP, Pre-ProB, Pro-B VDJ, Pro-B Cycling, Large Pre-B, Small Pre-B, Immature B, Mature B) were retained. These cells were subsequently mapped onto our B cell development map using symphony (v0.1.1), based on 950 highly variable genes spanning human B cell development. The scripts used for B-cell mapping are publicly available at https://github.com/andygxzeng/b_development_map.
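
The two filtering steps described above (per-cell QC thresholds, then retention of B-lineage and proximal states) can be expressed compactly. The published pipeline ran in Seurat and BoneMarrowMap; this Python sketch only restates the logic, and the dictionary keys `pct_mito` and `state` are hypothetical stand-ins for the corresponding metadata columns.

```python
# Cell states retained after BoneMarrowMap projection, as listed in the text.
RETAINED_STATES = {
    "HSC", "MPP-MyLy", "LMPP", "Early GMP", "MLP", "MLP-II", "Pre-pDC",
    "Pre-pDC Cycling", "pDC", "CLP", "Pre-ProB", "Pro-B VDJ",
    "Pro-B Cycling", "Large Pre-B", "Small Pre-B", "Immature B", "Mature B",
}

def passes_qc(cell):
    """QC thresholds from the text: >500 genes, >2500 UMIs, <8% mitochondrial."""
    return (
        cell["nFeature_RNA"] > 500
        and cell["nCount_RNA"] > 2500
        and cell["pct_mito"] < 8
    )

def select_cells(cells):
    """Keep cells that pass QC and map to a retained cell state."""
    return [c for c in cells if passes_qc(c) and c["state"] in RETAINED_STATES]

# Toy cells (illustrative values only).
cells = [
    {"id": "c1", "nFeature_RNA": 1200, "nCount_RNA": 4000, "pct_mito": 3.0, "state": "Pro-B VDJ"},
    {"id": "c2", "nFeature_RNA": 450, "nCount_RNA": 4000, "pct_mito": 3.0, "state": "Pro-B VDJ"},
    {"id": "c3", "nFeature_RNA": 1200, "nCount_RNA": 4000, "pct_mito": 9.5, "state": "HSC"},
    {"id": "c4", "nFeature_RNA": 1200, "nCount_RNA": 4000, "pct_mito": 3.0, "state": "Erythroblast"},
]
kept_ids = [c["id"] for c in select_cells(cells)]
```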

B-ALL scRNA-seq Composition Analysis

For composition analysis of B-ALL, the number of mapped single-cell transcriptomes that passed a mapping QC filter was counted for each cell state within each sample. Cell states with fewer than 100 cells across all samples were filtered out. The resulting per-sample cell state frequencies were used for downstream composition analysis.

To cluster patient samples by stages of B cell development, 21 cell states spanning B cell development were retained. Cell counts across these 21 cell states for each patient were subjected to CLR normalization with multiplicative replacement using the package scikit-bio (v0.5.6). Dimensionality reduction and clustering were performed using scanpy (v1.9.1). Briefly, a neighbourhood graph was constructed based on cosine similarity across the n = 6 top PCs, incorporating n = 10 nearest neighbours. UMAP reduction was performed at min.dist = 0.2 and Leiden clustering at a resolution = 1.2. This resulted in seven patient subgroups on the basis of cell composition across B cell development.
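
CLR normalization with multiplicative replacement can be written out for a single sample as follows. This is a standard-library sketch of the general technique, not a reproduction of scikit-bio internals; the default replacement value `delta = (1/D)^2` for D cell states is an assumption, and the toy counts are hypothetical.

```python
import math

def multiplicative_replacement(counts, delta=None):
    """Close counts to proportions and replace zeros with a small value,
    shrinking the non-zero parts so the composition still sums to 1."""
    total = sum(counts)
    props = [c / total for c in counts]
    if delta is None:
        delta = (1.0 / len(props)) ** 2  # assumed default
    n_zero = sum(1 for p in props if p == 0)
    scale = 1.0 - n_zero * delta
    return [delta if p == 0 else p * scale for p in props]

def clr(props):
    """Centered log ratio: log of each part minus the mean log across parts."""
    logs = [math.log(p) for p in props]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

# Toy cell counts for one patient across four cell states.
counts = [10, 0, 30, 60]
replaced = multiplicative_replacement(counts)
clr_values = clr(replaced)
```

Zero replacement is needed because the logarithm of a zero count is undefined; the CLR values for each sample sum to zero by construction.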

NMF Developmental State Analysis

To simplify downstream analyses, we sought to bundle specific cell states into broader categories based on correlated abundance across patient samples. To do so, cell counts across 55 cell states for each patient were subjected to centered log ratio (CLR) normalization with multiplicative replacement using the package skbio, and NMF analysis was performed on this normalized composition data. This produced ten NMF components, each corresponding to a group of cell states whose abundance co-varies across samples. Of these ten components, six correspond to discrete stages of B cell development, hereafter termed Developmental States. These comprise Pro-B (NMF1), Early Lymphoid (NMF2), Pre-B (NMF4), Myeloid Progenitor (NMF6), Pre-pDC (NMF7), and HSC/MPP/LMPP (NMF8).

Quantification in B-ALL RNA-seq

The relative abundance of each developmental state is represented by the score for each corresponding NMF component within each patient. To estimate the abundance of each developmental state in bulk RNA-seq data, we utilized sample-level pseudo-bulk profiles from our 89 scRNA-seq patient samples and performed LASSO regression to predict developmental state abundance based on bulk gene expression features. Crucially, we utilized a three-pronged feature selection approach prior to LASSO regression by focusing on biologically relevant genes meeting the following criteria: (1) differentially expressed across developmental states in B-ALL patients, (2) differentially expressed across normal B cell development, and (3) expression in pseudo-bulk transcriptomes is correlated with developmental state abundance across the 89 patient samples. This quantification approach was validated on matched bulk RNA-seq profiles for 85 out of 89 patient samples, and benchmarked against other deconvolution approaches as well as other feature selection approaches. Full details pertaining to quantification and benchmarking are outlined in protocols.io69. Results from this benchmarking analysis are shown in ED Figure 5.

Derivation of the B-ALL Multipotency Score

Starting with the relative abundance values of four key B-ALL developmental states spanning B-cell development (HSC/MPP, Early Lymphoid, Pro-B, Pre-B) within the 2,046 sample cohort, we performed principal component analysis and identified principal component 1 (PC1) as capturing the multipotent (HSC/MPP and Early Lymphoid) vs committed (Pro-B and Pre-B) axis. To quantify this with a gene expression score, we started with the 115 genes used to quantify these four developmental states and performed LASSO regression on PC1 across the 2,046 bulk transcriptomes, resulting in a 99-gene linear equation which we termed the “Multipotency Score”. We validated the accuracy of this score through 10 iterations of 10-fold cross validation, shuffling samples between each iteration, and validated the biological relevance through score calculation on pseudobulk profiles (pooled within each cell type from each donor from each study) from our single-cell atlas of normal B cell development.
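
The validation scheme (10 iterations of 10-fold cross validation, with samples reshuffled between iterations) can be sketched as an index generator. Model fitting is omitted; the function name, the seed, and the way folds are cut from the shuffled order are assumptions for illustration.

```python
import random

def repeated_kfold(n_samples, n_folds=10, n_repeats=10, seed=0):
    """Yield (repeat, fold, train_indices, test_indices).

    Samples are reshuffled before each repeat, so fold membership
    differs between repeats.
    """
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        # Deal the shuffled indices into n_folds near-equal folds.
        folds = [order[i::n_folds] for i in range(n_folds)]
        for k, test_idx in enumerate(folds):
            train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield repeat, k, train_idx, test_idx

# 2,046 samples give 100 train/test splits in total.
splits = list(repeated_kfold(2046))
```

Within each repeat every sample is held out exactly once, so each repeat produces one out-of-fold prediction per sample that can be correlated with PC1.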

Statistical Analysis in B-ALL Cohorts

Inferring Developmental State Abundance in bulk RNA-seq

Briefly, the abundance of each developmental state is represented by a gene expression score, comprising a linear equation with coefficients (weights) assigned by LASSO for each constituent gene. These equations are then applied to normalized bulk RNA-seq data from B-ALL patients: the normalized expression of each gene is multiplied by its coefficient, and the sum of these products is standardized across patients. Each standardized score represents the relative abundance of a developmental state within an individual patient. Gene expression scores were calculated on vst-normalized70 data from bulk RNA-seq of 2,046 B-ALL samples.
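
The score calculation described here is a weighted sum followed by standardization across patients, which can be sketched as below. The gene names, weights, and expression values are toy placeholders rather than the published coefficients, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev

def expression_scores(expression, weights):
    """Weighted-sum gene expression score, standardized across patients.

    expression: {patient: {gene: normalized expression}}
    weights: {gene: LASSO coefficient}
    """
    raw = {
        patient: sum(coef * genes[gene] for gene, coef in weights.items())
        for patient, genes in expression.items()
    }
    mu = mean(raw.values())
    sd = pstdev(raw.values())
    return {patient: (score - mu) / sd for patient, score in raw.items()}

# Toy weights and expression values (hypothetical, for illustration only).
weights = {"CD34": 0.5, "MME": -0.25}
expression = {
    "P1": {"CD34": 8.0, "MME": 4.0},
    "P2": {"CD34": 6.0, "MME": 8.0},
    "P3": {"CD34": 4.0, "MME": 12.0},
}
scores = expression_scores(expression, weights)
```

Because the scores are standardized within a cohort, they are relative quantities and are comparable only among patients normalized together.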

Association with genomic characteristics of B-ALL

Gene expression scores were calculated on vst-normalized70 data from bulk RNA-seq of 2,046 B-ALL samples. Associations between developmental state and genomic characteristics were evaluated through a generalized linear model using glmnet (v4.1.3), wherein the inferred abundance of each developmental state was used as the dependent variable and genomic category was used as the independent variable, stratifying on originating institute. For associations between mutations and developmental state, genomic subtype was used as a covariate. The resulting t-statistics and FDR-corrected significance were visualized using the package corrplot (v0.84). Unless otherwise stated, statistical significance between groups was evaluated through a non-parametric two-tailed Wilcoxon rank sum test.

Association with patient survival outcomes

Survival associations were determined through Cox proportional hazards regression using the package survival (v2.44.1.1), with either overall or event-free survival outcomes as the dependent variable and inferred developmental state abundance or Multipotency Score as the independent variable. In pediatric B-ALL, survival analysis was performed across subsets of patients with available RNA-seq data from the St Jude’s and Children’s Oncology Group (COG) cohorts. In adult B-ALL, survival analysis was performed within a subset of patients with available RNA-seq data from the Eastern Cooperative Oncology Group (ECOG) cohort. Multivariable analysis in pediatric and adult patients was performed with Clinical Risk Group (Childhood SR, Childhood HR, AYA, Adult), Genomic Subtype Risk Group (Favorable, Intermediate, Unfavorable, and Unclassified), age, sex and WBC as covariates. Hazard ratios are reported for each standard deviation increase in Multipotency Score or inferred abundance of a developmental state. Significance was calculated through a nested likelihood ratio test (LRT), wherein the performance of a model with Clinical Risk Group + Genomic Subtype Risk Group + Age + Sex + WBC + Multipotency Score / Developmental State Abundance is compared to the performance of a baseline model with only Clinical Risk Group + Genomic Subtype Risk Group + Age + Sex + WBC information. This analysis was repeated independently for the Multipotency Score and for each developmental state.
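
The nested likelihood ratio test compares the log-likelihoods of the baseline and extended Cox models. When a single continuous score is added, the test statistic has one degree of freedom, and its chi-square tail probability has a closed form. The sketch below assumes the two log partial likelihoods have already been obtained from fitted models; the function name and the example values are hypothetical.

```python
import math

def lrt_pvalue_df1(loglik_baseline, loglik_full):
    """Likelihood ratio test for nested models differing by one parameter.

    The statistic 2 * (loglik_full - loglik_baseline) is compared to a
    chi-square distribution with 1 degree of freedom, whose survival
    function equals erfc(sqrt(x / 2)).
    """
    stat = max(2.0 * (loglik_full - loglik_baseline), 0.0)
    return stat, math.erfc(math.sqrt(stat / 2.0))

# Hypothetical log partial likelihoods for baseline and baseline + score models.
stat, p_value = lrt_pvalue_df1(-100.0, -98.07927058965294)
```

A small p value indicates that adding the score improves model fit beyond the clinical and genomic covariates already in the baseline model.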

To determine the independent prognostic value of the Multipotency Score in pediatric B-ALL, the Multipotency Score was binarized into high and low categories by median split. This was used in multivariable analysis for direct comparison against existing clinical stratifications including Genomic Subtype Risk Group, Clinical Risk Group, and Residual Disease. Hazard ratios from these multivariable analyses were visualized using forest plots. Kaplan-Meier plots were generated for prognostic evaluation of the multipotency score within each category of these clinical stratifications.
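
Binarizing the score by median split is a one-line operation, shown here for completeness. How samples falling exactly on the median were assigned is not stated in the text; this sketch assigns them to the low group, which is an assumption, and the toy scores are hypothetical.

```python
from statistics import median

def median_split(scores):
    """Label each patient 'high' if the score exceeds the cohort median,
    otherwise 'low' (ties go to 'low' by assumption)."""
    cutoff = median(scores.values())
    return {patient: ("high" if s > cutoff else "low") for patient, s in scores.items()}

groups = median_split({"A": -1.2, "B": 0.1, "C": 0.8, "D": 2.0})
```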

Genomic Subtype Risk Group assignments were determined based on the genomic subtype and differed for adult and pediatric B-ALL (Supplementary Table 26).

Association with ex vivo drug response

Gene expression profiles of 595 B-ALL samples from Lee et al50 were used to link developmental state abundance with ex vivo sensitivity to 18 therapeutic agents. Gene expression scores were calculated using logTPM-normalized gene expression data obtained from the publication to infer abundance of each developmental state. Pearson correlation and FDR-corrected significance were calculated for the association between abundance of each developmental state and the area under the dose-response curve (AUC), wherein lower AUC values denote higher sensitivity. Pearson values are multiplied by −1 such that positive values indicate that higher abundance of a developmental state is associated with higher ex vivo sensitivity to a drug.
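
The sign convention and the multiple-testing correction can be sketched as follows. The text specifies FDR correction without naming the procedure, so the Benjamini-Hochberg method used here is an assumption, as are the function names and toy values.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def sensitivity_association(abundance, auc):
    """Flip the sign so positive values mean higher abundance goes with
    lower AUC, i.e. higher ex vivo sensitivity."""
    return -pearson_r(abundance, auc)

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from largest to smallest p value
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy data: abundance rises as AUC falls, so the flipped value is positive.
assoc = sensitivity_association([1.0, 2.0, 3.0, 4.0], [0.9, 0.7, 0.5, 0.3])
fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```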

Comparison between B-ALL and MPAL

Gene expression profiles of 1,153 B-ALL samples and 66 MPAL samples, for which sequencing data were uniformly pre-processed, were obtained from Montefiori et al48. Gene expression scores were calculated using rlog-normalized gene expression data obtained from the publication to infer abundance of each developmental state. For comparisons of developmental state abundance, T/Myeloid lineage MPALs were removed from the analysis to focus on B/Myeloid lineage MPALs and Acute Undifferentiated Leukemias (AULs).

Re-analysis of BCR::ABL1 transcriptional subgroups

Gene expression profiles of 57 BCR::ABL1 lymphoblastic leukemia samples were vst-normalized and gene expression scores were calculated to infer abundance of each developmental state. Transcriptional subgroup labels of Early-Pro, Inter-Pro, and Late-Pro were obtained from the original study14. Survival analyses were performed on 41 adult samples with outcomes data (overall survival and relapse-free survival) using Cox proportional hazards regression with inferred developmental state abundance as the independent variable. Hazard ratios are reported for each standard deviation increase in inferred abundance of a developmental state. This analysis was repeated independently for each developmental state.

Statistics and Reproducibility

No statistical methods were used to predetermine sample sizes. The cohort size was selected based on the availability of existing genomic sequencing data for B-ALL, or the presence of B-ALL tissue on which single cell sequencing could be performed. Unless otherwise stated, P values for comparisons between specific groups were derived from a non-parametric Wilcoxon rank-sum test. For functional studies, the minimum sample size for each experimental group was three. For computational analyses of single-cell and bulk patient samples, no statistical method was used to predetermine sample size. No data were excluded from the analyses. The experiments were not randomized and the investigators were not blinded to allocation during experiments and outcome assessment. Data distribution was assumed to be normal but this was not formally tested.

Extended Data

Extended Data Fig. 1. Genetic heterogeneity elucidated by inferCNV.

a) Copy number alterations from whole genome sequencing or SNP array in B-ALL samples analyzed by scRNA-seq. Copy number changes are depicted in red for copy number gains and in blue for copy number losses. Only samples with genomic data available are included. b) Pairwise PCA plots of the top three PCs for cells (N=489,019) from the 89 scRNA-seq samples, colored by cell type. c,d) Bar plots showing inferCNVs in near haploid (c) and hyperdiploid (d) B-ALL. InferCNV heatmap (e) of chromosome gains and UMAP (f) of SJBALL030285_D1. g,h) InferCNV heatmap from SJALL040066 (g) and SJALL040070 (h) with TCF3::PBX1 derived from the unbalanced translocation der(19)t(1;19)(q23;p13.3). SJALL040066 (g) has three CNV clusters sharing gain from PBX1 to 1q telomere, and chr7p loss/7q gain with additional alterations undetectable by bulk sequencing: 9p- and 22+ in clone 2 (C2); 5+ 8+ in clone 3 (C3). SJALL040070 (h) has one CNV clone with gain of 1q region from PBX1 to telomere. i) InferCNV heatmap from SJE2A067 with TCF3::PBX1 fusion derived from balanced translocation t(1;19)(q23;p13.3), showing 3 clusters having distinct gene expression profiles: clone 2 (C2) is the founder clone; clone 1 (C1) is characterized by a gain of a region near the PBX1 gene (chr1: 145002063–206421902); clone 3 (C3) shows loss of chr 13. A subset of cells in each sample in e, g, h and i show coordinate increased expression of genes on 6p reflecting cell-cycle associated expression of histone genes rather than copy number variation. j) UMAP representation of cells from SJE2A067 colored by cell type (upper left panel), inferCNV group (upper right panel) and detection of TCF3::PBX1 fusion (lower left panel). k) Scatterplot from scWGS in SJE2A067 showing copy numbers of chr 1p, 1q regions upstream and downstream of PBX1, 11p, 13q and 19p. Dotted line indicates diploid DNA content. This analysis revealed multiple distinct clones demonstrating the complex patterns of genetic evolution in samples with TCF3::PBX1. l) Screenshot from UCSC Genome Browser showing genomic coordinates (hg38) of alterations on chr1 in SJE2A063 with amplification of regions both upstream and downstream of PBX1 as derived from two independent events and in SJE2A067 with a single alteration encompassing the PBX1 gene. m) Heatmap showing the pairwise correlations of the sample level meta-programs using consensus non-negative matrix factorization (cNMF). Each panel includes samples from the same subtype. Clustering identified five coherent malignant gene expression signatures, including A - Cell Cycle (S), B - Cell Cycle (G2/M), C - Metabolism, D - Differentiation, and E - Inflammation. n) Distribution of the number of subtypes expressing the top 30 signature genes in the five metaprograms of different subtypes indicating that most of the signature genes are shared for A - Cell Cycle (S), B - Cell Cycle (G2/M), C - Metabolism, but they are subtype-specific for D - Cell Differentiation, and E - Inflammation.

Extended Data Fig. 2. Transcriptional and functional characterization of human B cell development.

a-c) Centered log ratio (CLR)-normalized protein levels of key cell surface markers across B cell development from combined scRNA-seq and protein expression (AbSeq) data. a) 5,102 adult bone marrow cells profiled through BD Rhapsody from Triana et al35. These cell states were classified through reference map projection onto our B cell development atlas. b) 4,189 fetal liver cells and c) 7,782 fetal bone marrow cells profiled through CITE-seq from Jardine et al32. d,e) Differential expression (DE) results between pre-natal and post-natal samples across human B-cell development. Single cells belonging to each cell state from each donor from our B cell development atlas were pooled into pseudobulk profiles prior to DE analysis using DESeq2 with P values determined by a Wald test and subject to Benjamini-Hochberg correction to obtain FDR values. d) The number of DE genes at FDR < 0.01 specific to either pre-natal or post-natal samples is shown. e) DE results in CLPs from pre-natal compared to post-natal cells. Genes that are significantly DE at log2 fold change > 1 and FDR < 0.01 are depicted in red. MME, which encodes CD10, is significantly upregulated within post-natal CLPs and shown within a box for emphasis. f,g) cNMF signature analysis in normal B-lymphoid development including the S and G2/M-phases of cell cycle (f) and the Early Myeloid, Plasmacytoid DC, and Mature B programs (g). h) CD7 and IL7R expression by cell state across the B cell development atlas. i) Analysis of CD7 protein expression by flow cytometry in the indicated cell populations (for each population n=2). The bar plots represent mean values. j) Analysis of IL7R protein expression by flow cytometry in the indicated cell populations. k) RAG1/RAG2 gene expression in B cell developmental cell types from bone marrow, fetal bone marrow and fetal liver. 
Single-cell transcriptomes (n = 90 donors) were pooled together based on tissue source, study, and sequencing technology into pseudo-bulk profiles for visualization of vst-normalized RAG1 and RAG2 expression patterns. Box plots indicate the range of the central 50% of the data with the central line marking the median. Whiskers extend from each box to 1.5x the interquartile range.

Extended Data Fig. 3. Functional modeling of normal flow-sorted populations.

a) Two hundred cells from the indicated populations were sorted and cultured on MS-5 stroma cells for 3, 7 or 14 days (see Fig. 3g for 7 and 14 days) in Ly-My or LyP promoting media and resulting colonies analyzed by flow cytometry. The proportion of each population is represented in the pie charts. b) Representative FACS plots from individual wells summarized in (a) at day 14 of culture. c) Total number of cells from (a) at day 14 of culture. N=2 cord blood pools for all groups (a, b, c). The dots in the plots represent technical replicates (Ly-My: LMPP, n=4; MLP, n=4; CLP, n=4; Pre-Pro-B, n=3; Pro-B, n=8. LyP: LMPP, n=4; MLP, n=4; CLP, n=4; Pre-Pro-B, n=3; Pro-B, n=3). d) Engraftment analysis of indicated cell populations 2 weeks after intrafemoral injection into NSG mice. Injected cells were (number/mouse): LMPP 3,000; MLP: 4,000; CLP: 6,000; Pre-Pro-B: 6,000; Pro-B: 10,000. The percentage of human CD45 positive (CD45+) cells and relative engraftment normalized by the number of cells injected in the right femur (injected bone) are shown. The dots in the plots represent different injected mice from 3 human cord blood donors (n=23 for LMPP and MLP; n=14 for CLP; n=3 for Pre-Pro-B; n= 9 for Pro-B). Two-tailed P values are from unpaired t test. e) Engraftment levels in the left femur. The dots in the plot represent different injected mice from 3 human cord blood donors (n=13 for LMPP and MLP; n=9 for CLP; n=3 for Pre-Pro-B; n= 9 for Pro-B). Two-tailed P values are from unpaired t test. f) Percentage of CD33+, CD19+, CD56+ and CD1a+ cells in the right femur from panel (d). The dots in the plots represent different injected mice from 3 human cord blood donors (n=23 for LMPP and MLP; n=14 for CLP; n=3 for Pre-Pro-B; n= 9 for Pro-B). g) Human CB (n=5), adult bone marrow (BM)(n=8) and mobilized peripheral blood (MPB)(n=7) were stained and analyzed by flow cytometry with the indicated cell surface markers. 
The percentage of CD33+ and CD33- cells in each population is represented. Each histogram bar represents an individual donor. h) Single cells from indicated populations were sorted into MS-5 stroma cells and cultured for 16–17 days with Ly-My or LyP media (Supplementary Table 14). Colonies were scored under the microscope and analyzed by flow cytometry for differentiation markers. The percentage of each colony type resulting from each starting population as labels on the x axis is shown. i) From (h) the clonogenic potential (wells with colonies/seeded wells multiplied by 100; bar plots represent the mean with SD) and the number of B, Myeloid (M) or NK cells/colony are shown for both media (LMPP CD33+, n=4 cord blood pools; MLP CD33+, n=4 cord blood pools; MLP CD33-, n=3 cord blood pools; CLP CD33+, n=5 cord blood pools; CLP CD33-, n=5 cord blood pools; Pre-ProB CD33+, n=2 cord blood pools; Pre-ProB CD33-, n=3 cord blood pools; ProB, n=3 cord blood pools).

Extended Data Fig. 4. B-ALL developmental states refine existing genomic subtypes.

a,b) Validation through projection of human fetal lung data. a) 9,491 single cell transcriptomes mapped to human B cell development from fetal lung tissue profiled in Barnes et al41. Projected cells are colored by cell type labels assigned in the original publication. Reference single-cell transcriptomes from our B cell development atlas are depicted in grey. b) Density plots depicting projected single-cell transcriptomes split by each author-assigned cell type within the query fetal lung scRNA-seq data. Darker shades of red depict higher density of query cells within a given region along the reference atlas. c) UMAP embedding based on cell state composition from 89 B-ALL samples depicting centered log ratio (CLR) normalized abundance of each mapped cell population for each sample. d) Consolidation of similar B-ALL cell states into broader developmental states. To identify B-ALL states with correlated abundance across patient samples, NMF was performed on the normalized abundance scores of all mapped cell states from B-ALL bone marrow samples and NMF modules were interpreted based on the weights assigned to the cell states that comprise each module. e) For NMF modules that correspond to a distinct stage along B-lymphoid development, termed “Developmental State”, feature weights are visualized in this heatmap by constituent cell type. f) Abundance of broad developmental states defined by NMF spanning Early Myeloid progenitors and plasmacytoid DC cells. g) Remaining genomic subtypes (DUX4-R, BCR::ABL1-like (CRLF2-R), BCR::ABL1-like (non-CRLF2-R), iAMP21, PAX5alt and near haploid) are overlaid based on the cell composition of their constituent patient samples. h,i) Developmental State in ETV6::RUNX1 B-ALL. h) Cell state composition-based clustering of 8 ETV6::RUNX1 B-ALL samples from Mehtonen et al44 and 7 ETV6::RUNX1-like B-ALL samples from this study. Each column represents a scRNA-seq sample. i) Projection results on the B cell development map of ETV6::RUNX1 B-ALL samples from Mehtonen et al44.

Extended Data Fig. 5. Benchmarking gene expression deconvolution approaches in B-ALL.

a) Overview of benchmarking inference of developmental state abundance in B-ALL across various gene expression deconvolution approaches. This spans the stages of 1) feature selection and signature matrix generation, 2) application of gene expression deconvolution algorithm, and 3) benchmarking deconvolution accuracy using 85 patient samples with matched scRNA-seq and bulk RNA-seq. b,c) Association between predicted abundance (bulk RNA-seq deconvolution) and observed abundance (scRNA-seq) of B-ALL developmental states across 85 matched patient samples, shown for each deconvolution method and split by signature matrix. Associations shown include the Pearson correlation (b) as well as the coefficient of determination denoted as R-squared (c). Lines between points in the boxplots link the performance for each B-ALL developmental state (HSC/MPP, Myeloid Progenitor, Pre-pDC, Early Lymphoid, Pro-B, Pre-B, Mature B) across quantification methods. P-values denote results from two-tailed paired t-tests comparing our LASSO regression approach with CIBERSORTx deconvolution. Box plots indicate the range of the central 50% of the data with the central line marking the median. Whiskers extend from each box to 1.5x the interquartile range. d) Scatterplots of predicted abundance (bulk RNA-seq) and observed abundance (scRNA-seq) for each quantification method, with each dot representing one of the 85 patients with matched scRNA-seq and bulk RNA-seq. For each method, the B-ALL biologically relevant genes signature matrix was used. For each association, the linear regression line shaded with the 95% CI, as well as r and P values from Pearson correlation, are shown.

Extended Data Fig. 6. Quantification of B-ALL developmental states and Multipotency Score.

a) Development of a 99-gene regression model to quantify PC1, hereafter named the “Multipotency Score”. Correlation results for 10-fold cross validation with 10 repeats are shown alongside correlation for the final model trained on all 2,046 samples. r and P values from Pearson correlation are shown. b) B-ALL Multipotency Score validation on donor-level pseudobulk profiles from normal B cell development atlas. Single-cell transcriptomes spanning n = 90 donors were pooled together based on tissue source, study, and sequencing technology into pseudo-bulk profiles for visualization of Multipotency Score enrichment. For the specific number of pseudobulk profiles as well as the number of cells for each cell state along B cell development, refer to Supplementary Table 11. Box plots indicate the range of the central 50% of the data with the central line marking the median. Whiskers extend from each both to 1.5x the interquartile range. c) Ridge plot comparing the inferred abundance of HSC/MPP and Pre-pDC states by genomic subtype across 2,046 B-ALL patients. d,e) Association plots between Multipotency Score or developmental state abundance and driver alterations (d) or gene fusions (e). The magnitude of each association, quantified as the −log10 (P value), is depicted through the size and color intensity of each dot. The direction of the association is depicted through the color, wherein higher abundance is green and lower abundance is purple. Only associations with an FDR corrected P value < 0.05 are depicted. For driver alterations (d), abundances of samples with each alteration were compared to all other samples, and genomic subtype was adjusted for as a covariate. For gene fusions (e), abundances of samples with each fusion were compared to abundances from samples with no gene fusions. 
f-h) Differences in developmental state abundance (Myeloid Progenitors and Pre-pDCs) between transcriptional subtypes of DUX4-R (f; total n=112; DUX4-r Early/Multipotent n=70; DUX4-r Committed n=42), KMT2A-R (g; total n=144; KMT2A-r Early/Multipotent n=125; KMT2A-r Committed n=19) and BCR::ABL1 (h; total n=127; BCR::ABL1 Early/Multipotent n=32; BCR::ABL1 Inter n=26; BCR::ABL1 Committed n=69). i) Multipotency Score and developmental state abundance explain differences between Early-Pro (Early/Multipotent, n=24), Inter-Pro (n=8), and Late-Pro (Committed, n=22) transcriptional subgroups of BCR::ABL1 (N=54) from Kim et al14. For each comparison, P values from a two-tailed Wilcoxon rank-sum test are reported. Box plots indicate the range of the central 50% of the data with the central line marking the median. Whiskers extend from each box to 1.5x the interquartile range.
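The "10-fold cross validation with 10 repeats" described in panel a can be sketched as a splitting scheme. The Python below is a minimal illustration under stated assumptions: the function name, the seed, and the unstratified random shuffling are hypothetical, and the regression model itself is omitted because its fitting details are not given in this legend.

```python
import random

def repeated_kfold(n_samples, k=10, repeats=10, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold cross validation."""
    rng = random.Random(seed)
    for rep in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)                       # a fresh shuffle for every repeat
        folds = [order[i::k] for i in range(k)]  # k folds of near-equal size
        for f, test_idx in enumerate(folds):
            held_out = set(test_idx)
            train_idx = [i for i in order if i not in held_out]
            yield rep, f, train_idx, test_idx

# 2,046 samples, as stated for the final model in panel a.
splits = list(repeated_kfold(2046))
```

Each sample is held out exactly once per repeat, so a model fitted on each training set yields one out-of-fold prediction per sample per repeat, which can then be correlated with the observed values.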

Extended Data Fig. 7. B-ALL developmental states refine existing genomic subtypes.

a-c) Association between genetic alterations and Multipotency Score or inferred abundance of B-ALL developmental states from patients with DUX4-R (a), KMT2A-R (b) and BCR::ABL1 (c) B-ALL. The magnitude of each association, quantified as the −log10 (P value), is depicted through the size and color intensity of each dot. The direction of the association is depicted through the color, wherein higher abundance is green and lower abundance is purple. Only associations with an unadjusted P < 0.05 are depicted, with a star denoting FDR < 0.05. Within each subtype, samples with each alteration were compared against all other samples. BCR::ABL1 transcript type (p190, p210) is not shown in (c) due to a lack of significant associations at P < 0.05. d) Violin plot of the AUCell enrichment scores for the four indicated DUX4-R subtype-level cNMF metaprograms. e) GSEA on the genes from (d) ranked by their contribution to Differentiation and Inflammation programs in DUX4-R transcriptional subgroups. f) Violin plots of gene expression patterns for twelve selected genes in 10 DUX4-R patients. g) Scatter dot plots of CD44 (left) and CD10 (right) protein expression by flow cytometric analysis in DUX4-R Early/Multipotent (n=6) and DUX4-R Committed (n=4) B-ALL samples. Two-tailed P value is from an unpaired t test. Each point in the scatter plot represents an individual sample. The barplot represents the mean with error bars indicating the standard deviation. h) KMT2A gene fusion partners in Early/Multipotent and Committed KMT2A-R subgroups. i) Association between KMT2A fusion partner and Multipotency Score or developmental state abundance. KMT2A fusion partners (AFF1, n=93; EPS15, n=4; MLLT1, n=19; MLLT3, n=11; MLLT10, n=10) harbored by patients within each KMT2A-R subgroup are shown alongside the inferred abundance of Early Lymphoid, Myeloid Progenitor and Pre-B states. For each comparison, P values from a two-tailed Wilcoxon rank-sum test are reported. Box plots indicate the range of the central 50% of the data with the central line marking the median. Whiskers extend from each box to 1.5x the interquartile range.
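The display rule in panels a-c (a dot for unadjusted P < 0.05, a star for FDR < 0.05, dot size scaled by −log10 P) can be sketched as follows. This is a minimal Python illustration with made-up P values; the legend does not name the FDR procedure, so the Benjamini-Hochberg adjustment used here is an assumption, and the function name is hypothetical.

```python
import math

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order (assumed FDR method)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

pvals = [0.001, 0.008, 0.04, 0.045, 0.6]   # made-up unadjusted P values
fdr = benjamini_hochberg(pvals)

shown = [p < 0.05 for p in pvals]           # dot is drawn
starred = [q < 0.05 for q in fdr]           # dot additionally gets a star
dot_size = [-math.log10(p) for p in pvals]  # magnitude encoded by the dot
```

With these toy values, four associations pass the unadjusted threshold but only two keep a star after adjustment, which mirrors the two tiers of evidence shown in the panels.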

Extended Data Fig. 8. Abundance of cell types in B-ALL and MPAL.

Comparison of B cell Developmental State abundance between uniformly processed MPAL (n = 66) and B-ALL (n = 1,153) patient samples. Abundance of HSC/MPP/LMPP, Early Lymphoid, Myeloid Progenitor, Pre-pDC, Pro-B and Pre-B in Early/Multipotent and Committed subgroups of BCR::ABL1 (n=50 and n=57, respectively), KMT2A-r (n=69 and n=10, respectively) and ZNF384-r (n=53) with B-ALL and in “Other B-ALL” (n=793, non-BCR::ABL1, non-KMT2A-r and non-ZNF384-r), and in MPAL patients with the same genetic drivers: BCR::ABL1 (n=4), KMT2A-r (n=13) and ZNF384-r (n=13) or “Other” (n=36). For each comparison, P values from a two-tailed Wilcoxon rank-sum test are reported. Box plots indicate the range of the central 50% of the data with the central line marking the median. Whiskers extend from each box to 1.5x the interquartile range.
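The box plot convention used throughout these legends (box spanning the central 50% of the data, median line, whiskers reaching 1.5x the interquartile range) can be made concrete with a short Python sketch. The quartile interpolation method and the choice to end each whisker at the most extreme data point inside the 1.5x IQR fence are assumptions, since plotting libraries differ on both; the function name and values are hypothetical.

```python
import statistics

def boxplot_stats(values, whisker=1.5):
    """Quartiles, whisker ends, and outliers under the 1.5x IQR convention."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - whisker * iqr, q3 + whisker * iqr
    inside = [v for v in data if lo_fence <= v <= hi_fence]
    return {
        "q1": q1, "median": median, "q3": q3,
        "whisker_low": inside[0],    # most extreme point within the lower fence
        "whisker_high": inside[-1],  # most extreme point within the upper fence
        "outliers": [v for v in data if v < lo_fence or v > hi_fence],
    }

stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 100])  # made-up values
```

Here the box runs from 3 to 7 with the median at 5, the whiskers stop at 1 and 8, and the value 100 falls outside the upper fence and would be drawn as a separate point.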

Extended Data Fig. 9. B-ALL developmental states are associated with clinical outcomes.

a-d) Association between age at diagnosis and inferred abundance of HSC/MPP (a), Myeloid Progenitor (b), Pre-pDC (c) and Pre-B developmental states (d). For each association, the LOESS regression line shaded with the 95% CI is shown for 2,019 B-ALL with available age information. e-h) Association between developmental state abundance and clinical characteristics in B-ALL. For each comparison, P values from a two-tailed Wilcoxon rank-sum test are reported. Box plots indicate the range of the central 50% of the data with the central line marking the median. Whiskers extend from each box to 1.5x the interquartile range. e) Association between B-ALL clinical risk groups and inferred abundance of HSC/MPP, Myeloid Progenitor, and Pre-pDC developmental states among 2,022 B-ALL patients annotated for clinical risk (Childhood HR, n = 680; Childhood SR, n = 527; AYA, n = 430; Adult, n = 385). f,g) Association between measurable residual disease (MRD) status at day 29 of induction and inferred abundance of indicated developmental states (f) as well as of Early Lymphoid, Pro-B and Pre-B developmental states (g) at diagnosis (N=1,197 pediatric patients with MRD status: Negative < 0.01%, n = 787; Positive > 0.01%, n = 410). h) Association between MRD levels at day 29 and inferred developmental state abundance at diagnosis (N=794 pediatric B-ALL patients with MRD levels available: < 0.01%, n = 448; 0.01 – 0.1%, n = 148; 0.1 – 1%, n = 104; 1 – 10%, n = 54). i-k) Association of B-ALL developmental states with survival outcomes across different subsets of B-ALL patients. Hazard ratios from univariate Cox regression, or multivariable Cox regression when indicated, related to OS and EFS are reported for each standard deviation increase in inferred abundance. Error bars depict the 95% confidence interval for each variable of interest, presented as the hazard ratio +/− 1.96 standard errors.
i,j) For each developmental state and Multipotency Score, association with survival outcomes is performed within 1,039 pediatric B-ALL patients (i) and within 649 pediatric B-ALL patients with negative MRD (j). k) Association of B-ALL developmental states with survival outcomes within genetically diverse adult B-ALL patients. Hazard ratios for overall survival and event-free survival from univariate Cox regression (n=324 for each developmental state/Multipotency Score, left panel) or adjusted hazard ratios (n=312 for each developmental state/Multipotency Score, right panel) from multivariable Cox regression accounting for age, sex, WBC, genomic and clinical risk group as independent covariates are reported for each standard deviation increase in inferred abundance.
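The error bars described above can be illustrated with a short Python sketch. It assumes the common convention that the interval is built on the log-hazard scale, that is exp(beta ± 1.96 × SE) for a Cox coefficient beta scaled per standard deviation of abundance; the legend does not say on which scale the 1.96 standard errors were applied, and the coefficient, standard error, and function name below are hypothetical.

```python
import math

def hazard_ratio_ci(beta, se, z=1.96):
    """Hazard ratio and 95% CI from a Cox coefficient and its standard error (log-scale interval assumed)."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

# Hypothetical coefficient per one standard deviation increase in abundance.
hr, lower, upper = hazard_ratio_ci(beta=0.30, se=0.10)
```

Under this convention the interval is symmetric around the hazard ratio on a log axis rather than on a linear one, and an interval whose lower bound stays above 1 corresponds to a nominally significant increase in hazard.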

Extended Data Fig. 10. Multipotency Score and outcome in B-ALL.

a-f) OS and EFS Kaplan-Meier plots evaluating the Multipotency Score within genomic risk group categories (a,b), clinical risk groups (c,d) and MRD categories (e,f). Patients were assigned to high (red) and low (blue) Multipotency Score categories based on a median split. Hazard ratios and 95% confidence intervals for high vs low Multipotency Score are shown within each category. P values were derived from a Wald test. Sample sizes for genomic risk: Favorable (Low=283, High=258), Intermediate (Low=156, High=117), Unfavorable (Low=66, High=108), Unclassified (Low=15, High=36). Sample sizes for clinical risk: Childhood Standard Risk (Low=230, High=122), Childhood High Risk (Low=250, High=273), Adolescent/Young Adult (Low=31, High=109). Sample sizes for MRD: Negative (Low=373, High=277), Positive (Low=90, High=175). g,h) Association of B-ALL developmental states and of Multipotency Score with OS and EFS outcomes within 47 pediatric BCR::ABL1 patients (g) and 41 adult BCR::ABL1 patients from Kim et al14 (h). For each developmental state (n=47 in panel g, n=41 in panel h), hazard ratios from multivariable Cox regression, adjusting for age, sex, WBC, clinical risk group and genomic subtype, for overall survival and event-free survival are reported for each standard deviation increase in inferred abundance. Error bars depict the 95% confidence interval of the hazard ratio for each variable of interest.
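The median split in panels a-f can be sketched in a few lines of Python. One reading of the unequal Low and High counts within each category is that the cutoff was computed once across the cohort and the labels were then tabulated per risk group; that reading, the tie rule (scores equal to the median are labeled Low), the function name, and all values below are assumptions made for illustration.

```python
import statistics

def median_split(scores):
    """Label each score 'High' if above the cohort median, otherwise 'Low'."""
    cutoff = statistics.median(scores)
    return ["High" if s > cutoff else "Low" for s in scores]

# Made-up Multipotency Scores and risk groups for six patients.
scores = [0.1, 0.9, 0.4, 0.7, 0.2, 0.8]
risk = ["Favorable", "Unfavorable", "Favorable", "Unfavorable", "Favorable", "Favorable"]
labels = median_split(scores)

# Tabulate Low/High counts within each risk group.
counts = {}
for group, label in zip(risk, labels):
    counts.setdefault(group, {"Low": 0, "High": 0})[label] += 1
```

The cohort is split evenly overall, yet the counts within each toy risk group are unbalanced, in the same way the legend reports different Low and High sample sizes per category.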

Supplementary Material

Supplementary Tables
Supplementary Figures

ACKNOWLEDGMENTS

This study was supported by the American Lebanese Syrian Associated Charities of St. Jude Children’s Research Hospital; the Alex’s Lemonade Stand Foundation for Childhood Cancer (C.G.M.); National Cancer Institute grants P30 CA021765 (C.G.M.), R35 CA197695 (C.G.M.); U10 CA180888 (Southwest Oncology Group National Clinical Trials Network grant), U10 CA180820 and UG1 CA189859; St. Baldrick’s Foundation Robert J. Arceci Innovation Award (C.G.M.); the Henry Schueler 41&9 Foundation (C.G.M.); University of Toronto MD/PhD studentship award (A.G.X.Z.); Princess Margaret Cancer Foundation (J.E.D.); Ontario Institute for Cancer Research through funding provided by the Government of Ontario (J.E.D.); Canadian Institutes for Health Research RN380110–409786 (J.E.D.); International Development Research Centre Ottawa Canada (J.E.D.); Canadian Cancer Society 703212 (J.E.D.); Terry Fox New Frontiers Program project grant 1106 (J.E.D.); University of Toronto’s Medicine by Design initiative with funding from the Canada First Research Excellence Fund (J.E.D.); The Ontario Ministry of Health (J.E.D.); Canada Research Chair (J.E.D.). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. We thank Dr K Itoh (Department of Biology, Faculty of Science, Niigata University, Japan) for providing MS-5 stromal cells.

Footnotes

COMPETING INTERESTS

Ilaria Iacobucci reported consultation honoraria from Arima and travel expenses reimbursed by Mission Bio and Takara for invited talks. Charles G. Mullighan received research funding from AbbVie and Pfizer, honoraria from Amgen and Illumina, and royalty payments from Cyrus. He is on an advisory board for Illumina. John E. Dick received research funding from BMS/Celgene and IP licenses from Pfizer/Trillium Therapeutics. Anjali S. Advani reported advisory board membership for Nkarta, Pfizer, Novartis and Jazz Pharmaceuticals; honoraria from KSA, PER, MD Education, ALF Medscape/ Global Activity, Web med, Onc live AML Geronimo; steering committee membership for Glycomimetics; and royalty payments from Springer. The other authors indicated no financial relationships.

DATA AVAILABILITY

Single-cell RNA and DNA sequences have been deposited in the European Genome-phenome Archive (EGA) under accession number EGAS00001007512 and at Alex’s Lemonade Stand Foundation for Childhood Cancer single cell portal (https://scpca.alexslemonade.org/projects/SCPCP000008). Access to the EGA dataset is currently available and can be requested by submitting an application to the Data Access Committee (EGAC00001002449) which is chaired by C.G.M. All requests from investigators seeking to use the data to examine scientific questions are approved and data released according to the terms outlined in the Data Access Agreements. Feature-barcode matrix, feature and barcode sequences from cellranger output of single-cell RNA-seq data have also been deposited to GEO under accession number GSE241405. RNA-seq and ATAC-seq data from immunophenotypically sorted populations have been deposited to GSE125345 and GSE285437, respectively. Single-cell RNA-seq UMAPs can be interactively visualized at https://proteinpaint.stjude.org/BALLscrna.

CODE AVAILABILITY

The scripts for the Bone Marrow map are available at https://github.com/andygxzeng/BoneMarrowMap while those for the B-cell developmental map can be found at https://github.com/andygxzeng/b_development_map. The scripts used to generate the Figures are available at https://github.com/gaoqs313/BALL-Single-Cell-Landscape.

REFERENCES

  • 1. Arber DA, et al. International Consensus Classification of Myeloid Neoplasms and Acute Leukemias: integrating morphologic, clinical, and genomic data. Blood 140, 1200–1228 (2022).
  • 2. Alaggio R, et al. The 5th edition of the World Health Organization Classification of Haematolymphoid Tumours: Lymphoid Neoplasms. Leukemia 36, 1720–1748 (2022).
  • 3. Duncavage EJ, et al. Genomic profiling for clinical decision making in myeloid neoplasms and acute leukemia. Blood 140, 2228–2247 (2022).
  • 4. Iacobucci I, Kimura S & Mullighan CG. Biologic and Therapeutic Implications of Genomic Alterations in Acute Lymphoblastic Leukemia. J Clin Med 10 (2021).
  • 5. Brady SW, et al. The genomic landscape of pediatric acute lymphoblastic leukemia. Nat Genet 54, 1376–1389 (2022).
  • 6. Jeha S, et al. Clinical significance of novel subtypes of acute lymphoblastic leukemia in the context of minimal residual disease-directed therapy. Blood Cancer Discov 2, 326–337 (2021).
  • 7. Waanders E, et al. Mutational landscape and patterns of clonal evolution in relapsed pediatric acute lymphoblastic leukemia. Blood Cancer Discov 1, 96–111 (2020).
  • 8. Dobson SM, et al. Relapse-Fated Latent Diagnosis Subclones in Acute B Lineage Leukemia Are Drug Tolerant and Possess Distinct Metabolic Programs. Cancer Discov 10, 568–587 (2020).
  • 9. Zeng AGX, et al. A cellular hierarchy framework for understanding heterogeneity and predicting drug response in acute myeloid leukemia. Nat Med 28, 1212–1223 (2022).
  • 10. van Galen P, et al. Single-Cell RNA-Seq Reveals AML Hierarchies Relevant to Disease Progression and Immunity. Cell 176, 1265–1281 e1224 (2019).
  • 11. Tirtakusuma R, et al. Epigenetic regulator genes direct lineage switching in MLL/AF4 leukemia. Blood 140, 1875–1890 (2022).
  • 12. Novakova M, et al. DUX4r, ZNF384r and PAX5-P80R mutated B-cell precursor acute lymphoblastic leukemia frequently undergo monocytic switch. Haematologica 106, 2066–2075 (2021).
  • 13. Iacobucci I & Mullighan CG. KMT2A-rearranged leukemia: the shapeshifter. Blood 140, 1833–1835 (2022).
  • 14. Kim JC, et al. Transcriptomic classes of BCR-ABL1 lymphoblastic leukemia. Nat Genet 55, 1186–1197 (2023).
  • 15. Caron M, et al. Single-cell analysis of childhood leukemia reveals a link between developmental states and ribosomal protein expression as a source of intra-individual heterogeneity. Sci Rep 10, 8079 (2020).
  • 16. Turati VA, et al. Chemotherapy induces canalization of cell state in childhood B-cell precursor acute lymphoblastic leukemia. Nat Cancer 2, 835–852 (2021).
  • 17. Wu L, et al. Single-Cell Transcriptome Analysis Identifies Ligand-Receptor Pairs Associated With BCP-ALL Prognosis. Front Oncol 11, 639013 (2021).
  • 18. Khabirova E, et al. Single-cell transcriptomics reveals a distinct developmental state of KMT2A-rearranged infant B-cell acute lymphoblastic leukemia. Nat Med 28, 743–751 (2022).
  • 19. O’Byrne S, et al. Discovery of a CD10-negative B-progenitor in human fetal life identifies unique ontogeny-related developmental programs. Blood 134, 1059–1071 (2019).
  • 20. Jackson TR, Ling RE & Roy A. The Origin of B-cells: Human Fetal B Cell Development and Implications for the Pathogenesis of Childhood Acute Lymphoblastic Leukemia. Front Immunol 12, 637975 (2021).
  • 21. Patel AP, et al. Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science 344, 1396–1401 (2014).
  • 22. Gu Z, et al. PAX5-driven subtypes of B-progenitor acute lymphoblastic leukemia. Nat Genet 51, 296–307 (2019).
  • 23. Kimura S, et al. Enhancer retargeting of CDX2 and UBTF::ATXN7L3 define a subtype of high-risk B-progenitor acute lymphoblastic leukemia. Blood 139, 3519–3531 (2022).
  • 24. Puram SV, et al. Single-Cell Transcriptomic Analysis of Primary and Metastatic Tumor Ecosystems in Head and Neck Cancer. Cell 171, 1611–1624 e1624 (2017).
  • 25. Hovestadt V, et al. Resolving medulloblastoma cellular architecture by single-cell genomics. Nature 572, 74–79 (2019).
  • 26. Kotliar D, et al. Identifying gene expression programs of cell-type identity and cellular activity with single-cell RNA-Seq. Elife 8 (2019).
  • 27. Lee RD, et al. Single-cell analysis identifies dynamic gene expression networks that govern B cell development and transformation. Nat Commun 12, 6843 (2021).
  • 28. Oetjen KA, et al. Human bone marrow assessment by single-cell RNA sequencing, mass cytometry, and flow cytometry. JCI Insight 3 (2018).
  • 29. Ainciburu M, et al. Uncovering perturbations in human hematopoiesis associated with healthy aging and myeloid malignancies at single-cell resolution. Elife 12 (2023).
  • 30. Setty M, et al. Characterization of cell fate probabilities in single-cell data with Palantir. Nat Biotechnol 37, 451–460 (2019).
  • 31. Popescu DM, et al. Decoding human fetal liver haematopoiesis. Nature 574, 365–371 (2019).
  • 32. Jardine L, et al. Blood and immune development in human fetal bone marrow and Down syndrome. Nature 598, 327–331 (2021).
  • 33. Roy A, et al. Transitions in lineage specification and gene regulatory networks in hematopoietic stem/progenitor cells over human development. Cell Rep 36, 109698 (2021).
  • 34. Li B, et al. HCA Data Portal: census of immune cells (Human Cell Atlas, 2019).
  • 35. Triana S, et al. Single-cell proteo-genomic reference maps of the hematopoietic system enable the purification and massive profiling of precisely defined cell states. Nat Immunol 22, 1577–1589 (2021).
  • 36. Kondo M, Weissman IL & Akashi K. Identification of clonogenic common lymphoid progenitors in mouse bone marrow. Cell 91, 661–672 (1997).
  • 37. Hao QL, et al. Identification of a novel, human multilymphoid progenitor in cord blood. Blood 97, 3683–3690 (2001).
  • 38. Van de Sande B, et al. A scalable SCENIC workflow for single-cell gene regulatory network analysis. Nat Protoc 15, 2247–2276 (2020).
  • 39. Doulatov S, et al. Revised map of the human progenitor hierarchy shows the origin of macrophages and dendritic cells in early lymphoid development. Nat Immunol 11, 585–593 (2010).
  • 40. Avellino R, et al. An autonomous CEBPA enhancer specific for myeloid-lineage priming and neutrophilic differentiation. Blood 127, 2991–3003 (2016).
  • 41. Barnes JL, et al. Early human lung immune cell development and its role in epithelial cell fate. Sci Immunol 8, eadf9988 (2023).
  • 42. Alexander TB, et al. The genetic basis and cell of origin of mixed phenotype acute leukaemia. Nature 562, 373–379 (2018).
  • 43. Dickerson KM, et al. ZNF384 Fusion Oncoproteins Drive Lineage Aberrancy in Acute Leukemia. Blood Cancer Discov 3, 240–263 (2022).
  • 44. Mehtonen J, et al. Single cell characterization of B-lymphoid differentiation and leukemic cell states during chemotherapy in ETV6-RUNX1-positive pediatric leukemia identifies drug-targetable transcription factor activities. Genome Med 12, 99 (2020).
  • 45. Newman AM, et al. Determining cell type abundance and expression from bulk tissues with digital cytometry. Nat Biotechnol 37, 773–782 (2019).
  • 46. Chu T, Wang Z, Pe’er D & Danko CG. Cell type and gene expression deconvolution with BayesPrism enables Bayesian integrative analysis across bulk and single-cell RNA sequencing in oncology. Nat Cancer 3, 505–517 (2022).
  • 47. Meyer C, et al. The KMT2A recombinome of acute leukemias in 2023. Leukemia 37, 988–1005 (2023).
  • 48. Montefiori LE, et al. Enhancer Hijacking Drives Oncogenic BCL11B Expression in Lineage-Ambiguous Stem Cell Leukemia. Cancer Discov 11, 2846–2867 (2021).
  • 49. Weinberg OK, et al. The International Consensus Classification of acute leukemias of ambiguous lineage. Blood 141, 2275–2277 (2023).
  • 50. Lee SHR, et al. Pharmacotypes across the genomic landscape of pediatric acute lymphoblastic leukemia and impact on treatment response. Nat Med 29, 170–179 (2023).
  • 51. Witkowski MT, et al. Extensive Remodeling of the Immune Microenvironment in B Cell Acute Lymphoblastic Leukemia. Cancer Cell 37, 867–882 e812 (2020).
  • 52. Bastian L, et al. Developmental trajectories and cooperating genomic events define molecular subtypes of BCR::ABL1-positive ALL. Blood 143, 1391–1398 (2024).
  • 53. Pui CH, et al. Treating childhood acute lymphoblastic leukemia without cranial irradiation. N Engl J Med 360, 2730–2741 (2009).
  • 54. Jeha S, et al. Improved CNS Control of Childhood Acute Lymphoblastic Leukemia Without Cranial Irradiation: St Jude Total Therapy Study 16. J Clin Oncol 37, 3377–3391 (2019).
  • 55. Inaba H, Azzato EM & Mullighan CG. Integration of Next-Generation Sequencing to Treat Acute Lymphoblastic Leukemia with Targetable Lesions: The St. Jude Children’s Research Hospital Approach. Front Pediatr 5, 258 (2017).
  • 56. Gao Q, et al. The genomic landscape of acute lymphoblastic leukemia with intrachromosomal amplification of chromosome 21. Blood 142, 711–723 (2023).
  • 57. Hao Y, et al. Integrated analysis of multimodal single-cell data. Cell 184, 3573–3587 e3529 (2021).
  • 58. McGinnis CS, Murrow LM & Gartner ZJ. DoubletFinder: Doublet Detection in Single-Cell RNA Sequencing Data Using Artificial Nearest Neighbors. Cell Syst 8, 329–337 e324 (2019).
  • 59. Young MD & Behjati S. SoupX removes ambient RNA contamination from droplet-based single-cell RNA sequencing data. Gigascience 9 (2020).
  • 60. Notta F, et al. Distinct routes of lineage development reshape the human blood hierarchy across ontogeny. Science 351, aab2116 (2016).
  • 61. Wagenblast E, et al. Functional profiling of single CRISPR/Cas9-edited human long-term hematopoietic stem cells. Nat Commun 10, 4730 (2019).
  • 62. Karamitros D, et al. Single-cell analysis reveals the continuum of human lympho-myeloid progenitor cells. Nat Immunol 19, 85–97 (2018).
  • 63. Buenrostro JD, Giresi PG, Zaba LC, Chang HY & Greenleaf WJ. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat Methods 10, 1213–1218 (2013).
  • 64. Granja JM, et al. Single-cell multiomic analysis identifies regulatory programs in mixed-phenotype acute leukemia. Nat Biotechnol 37, 1458–1465 (2019).
  • 65. Mende N, et al. Unique molecular and functional features of extramedullary hematopoietic stem and progenitor cell reservoirs in humans. Blood 139, 3387–3401 (2022).
  • 66. Wolf FA, Angerer P & Theis FJ. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol 19, 15 (2018).
  • 67. Bueno C, et al. CD34+CD19-CD22+ B-cell progenitors may underlie phenotypic escape in patients treated with CD19-directed therapies. Blood 140, 38–44 (2022).
  • 68. Choudhary S & Satija R. Comparison and evaluation of statistical error models for scRNA-seq. Genome Biol 23, 27 (2022).
  • 69. Zeng AGX. 10.17504/protocols.io.dm6gpde9jgzp/v1. (2025).
  • 70. Love MI, Huber W & Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 15, 550 (2014).
