Summary
Single-cell DNA sequencing (scDNA-seq) methods are powerful tools for profiling mutations in cancer cells; however, most genomic regions sequenced in single cells are non-informative. To overcome this issue, we developed a multi-patient-targeted (MPT) scDNA-seq method. MPT involves first performing bulk exome sequencing across a cohort of cancer patients to identify somatic mutations, which are then pooled together to develop a single custom targeted panel for high-throughput scDNA-seq using a microfluidics platform. We applied MPT to profile 330 mutations across 23,500 cells from 5 patients with triple-negative breast cancer (TNBC), which showed that 3 tumors were monoclonal and 2 tumors were polyclonal. From these data, we reconstructed mutational lineages and identified early mutational and copy-number events, including early TP53 mutations that occurred in all five patients. Collectively, our data suggest that MPT can overcome a major technical obstacle for studying tumor evolution using scDNA-seq by profiling information-rich mutation sites.
Keywords: single-cell genomics, triple-negative breast cancer, intratumor heterogeneity, mutational evolution, breast cancer
Highlights
• Developed a custom targeted panel to profile information-rich mutations via scDNA-seq
• Pooled mutations profiled by bulk exome-seq from 5 triple-negative patients
• Applied MPT to 330 mutations across 23,500 cells using microfluidics-based scDNA-seq
• Reconstructed mutational lineages and CNA events underlying breast cancer evolution
Leighton et al. developed a multi-patient-targeted (MPT) panel to profile information-rich mutation sites in single cancer cells by synthesizing a custom targeted panel using pooled mutations from patients with cancer who were first profiled by bulk exome sequencing. Along with high-throughput scDNA-seq on a microfluidics platform, MPT was applied to profile 330 mutations across 23,500 cells from 5 patients with triple-negative breast cancer. This approach reconstructed the mutational lineages and copy-number events underlying breast cancer evolution, including early TP53 mutations observed in all five patients.
Introduction
Triple-negative breast cancer (TNBC) is an aggressive subtype that is characterized by a lack of estrogen receptor (ER), progesterone receptor (PR), and Her2 receptor expression. Patients with TNBC frequently develop resistance to chemotherapy (∼50%) and progress to metastatic disease, often leading to poor survival.1,2,3 In contrast to ER-positive breast cancer, patients with TNBC display extensive copy-number alterations (CNAs) and driver mutations in TP53 in 83% of patients.4 Furthermore, patients with TNBC have been shown by single-cell DNA sequencing (scDNA-seq), multi-region sequencing, and whole-genome sequencing methods to display extensive intratumor heterogeneity (ITH).5,6,7 This ITH may explain why patients with TNBC frequently develop resistance to chemotherapy and often progress to metastatic disease.8,9 Previous studies have shown that TNBCs evolve through punctuated copy-number evolution (PCNE), in which large numbers of CNA events are acquired in short bursts of evolution at the earliest stages of progression.3,5,10,11 However, the dynamics and timing of mutations during TNBC progression represent a fundamental gap in knowledge. Additionally, improved knowledge of ITH is important for the clinical diagnosis and treatment of patients with TNBC, particularly when selecting targeted therapies. One approach for resolving ITH that is gaining broader use in both research and clinical studies is scDNA-seq.12
The field of single-cell genomics has shown remarkable progress in the development of both transcriptomic and genomic profiling methods over the last 10 years.13 Initial scDNA-seq methods for profiling the genomes and exomes of single cells were low throughput and had extensive technical errors.14,15,16 These technologies have been improved through developments in whole-genome amplification (WGA) chemistries15,17,18,19 and the implementation of nanowell and microfluidic platforms.20,21,22,23,24 Specifically, single-cell genomic copy-number profiling technologies have undergone substantial improvements with the use of tagmentation chemistries and microfluidic or nanowell platforms (10X Genomics CNV, ACT, DLP+).11,25 Similarly, scDNA-seq platforms for mutational profiling have been developed using microfluidic platforms (Mission Bio, Tapestri), which have enabled the profiling of tens of thousands of cells in parallel.26,27,28,29,30,31 However, in contrast to genomic copy-number profiling, which involves unbiased whole-genome sequencing (WGS) at sparse depth, mutational profiling requires high-coverage data. Consequently, it is necessary to target genomic regions for sequencing (e.g., exome, cancer gene panels) in order to make these assays economically feasible. This presents a major challenge since the informative mutation sites are not usually known a priori, and thus most of the genomic regions profiled in single cells contain only reference DNA sequences or germline SNPs.
To address this challenge, we developed a multi-patient-targeted (MPT) scDNA-seq approach that combines bulk DNA sequencing, single-cell droplet-based microfluidics, and a custom targeted panel. We applied MPT to 5 invasive TNBC tumor samples, which resolved the clonal substructure of the tumors and enabled the reconstruction of clonal lineages during tumor progression. Our data show that TP53 mutations and other early mutational events in TNBC progression are accompanied by CNA events, leading to the expansion of the primary tumor mass. Additionally, our data show that most mutations are early events in the progression of TNBC tumors and that subclones undergo stable clonal expansions, with only a limited number of intermediate mutations acquired during tumor growth.
Results
MPT panel sequencing
A major challenge with targeted scDNA-seq of mutations in tumors is that generally only a few somatic mutations (e.g., 1–10) can be profiled from each patient using an unbiased panel that covers a small region (e.g., 5 Mb) of the human genome, while most genomic regions sequenced contain only the reference genome bases and SNPs. This problem results in high costs for scDNA-seq of regions with limited useful mutational information and prevents large numbers of cells from being profiled in parallel. To address this problem, we developed a method called MPT sequencing (Figure 1). In this approach, we first profile a set of patients (e.g., 5–20) with bulk deep-exome sequencing methods (∼200× tumor, ∼60× normal) to identify somatic mutations and then pool all mutations together from the patient cohort (e.g., ∼400 mutations) to develop a custom targeted panel for scDNA-seq that targets all of the mutation sites across all of the patients (Figure 1A). In future developments of the microdroplet platform (Mission Bio), larger amplicon panels (e.g., thousands of targets) will allow more patients to be profiled or a larger number of mutation sites to be profiled per patient.
Figure 1.
Overview of multi-patient-targeted single-cell sequencing workflow
(A) Bulk exome sequencing is performed on a cohort of human tumors.
(B) Mutations from all patients are pooled together and used to synthesize a multi-patient-targeted (MPT) custom panel.
(C) The MPT panel is used to perform single-cell DNA sequencing (scDNA-seq) on each patient individually to profile thousands of cells in parallel using a microfluidics platform.
The custom panel is synthesized commercially (V1 custom targeted panel, Mission Bio) for the purpose of downstream scDNA-seq using a microfluidics platform (Tapestri, Mission Bio) (Figure 1B). We then perform scDNA-seq (Mission Bio) using the same MPT panel on each of the tumors that were previously analyzed by bulk exome sequencing to profile ∼5,000 cells for each tumor (Figure 1C). In the scDNA-seq experiments that are performed for each tumor, the mutations are profiled for each patient (e.g., ∼50 mutations), as well as the reference sites for the mutations in the other patients (which are not utilized in the downstream analysis). The major advantage of the MPT approach is that it greatly reduces the sequencing space spent on non-informative genomic regions that would normally be profiled using an unbiased panel. Notably, when the same mutation occurs in different patients, the shared site saves additional space on the custom MPT panel. By focusing the scDNA-seq on the mutation sites of interest, the sequencing costs are greatly decreased, enabling very high-throughput analysis of thousands of single cells in parallel at low costs for each patient.
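The pooling step described above can be illustrated with a minimal sketch. It assumes each patient's somatic calls are available as (chromosome, position, reference, alternate) tuples; the function names, input format, and toy coordinates are hypothetical and are not taken from the study's pipeline.

```python
from collections import defaultdict


def build_mpt_panel(patient_mutations):
    """Pool per-patient somatic mutations into one deduplicated target list.

    patient_mutations: dict mapping a patient ID to an iterable of
    (chrom, pos, ref, alt) tuples (hypothetical input format).
    Returns a dict mapping each unique site to the sorted list of
    patients in which it was called.
    """
    panel = defaultdict(set)
    for patient, mutations in patient_mutations.items():
        for site in mutations:
            panel[site].add(patient)
    # A site shared by several patients occupies a single panel slot.
    return {site: sorted(patients) for site, patients in sorted(panel.items())}


def informative_sites(panel, patient):
    """Sites on the pooled panel that are informative for one patient."""
    return [site for site, patients in panel.items() if patient in patients]


# Toy calls with made-up coordinates, for illustration only.
calls = {
    "P1": [("chr17", 100, "C", "T"), ("chr1", 200, "G", "A")],
    "P2": [("chr17", 100, "C", "T"), ("chr9", 300, "A", "G")],
}
panel = build_mpt_panel(calls)
p1_sites = informative_sites(panel, "P1")
```

In this toy case the two patients share one site, so four calls collapse to three panel targets, and each patient's downstream analysis uses only its own informative sites.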
Bulk exome sequencing of TNBCs
We selected frozen invasive tumor samples (ductal carcinoma in situ [DCIS]/invasive ductal carcinoma [IDC]) from 5 untreated patients with TNBC that were negative for ER, PR, and Her2 receptor expression (evaluated by immunohistochemistry) and Her2 amplification (evaluated by cytogenetic analysis) for bulk exome sequencing followed by MPT profiling using a microfluidics system (Tapestri, Mission Bio) (Table S1). To enrich the bulk tumor cell populations, we first generated single nucleus suspensions and stained the nuclei with DAPI to perform fluorescence-activated cell sorting (FACS) (Figure 2A). Using FACS, we gated distributions of cells from 2.65N to 3.7N ploidy to enrich aneuploid tumor cell fractions. Using the aneuploid cells, we generated sequencing libraries and performed both WGS at sparse depth to estimate genomic copy number and exome capture (Roche, Nimblegen v.2) to identify point mutations (Table S2). The genomic copy-number profiles showed extensive aneuploidy and copy-number aberrations (CNAs) in all of the tumors, including amplifications of oncogenes such as MYC and deletions of tumor suppressors such as TP53 (Figure 2B). Exome sequencing of the aneuploid-sorted nuclei and matched normal tissues identified a range of somatic mutations (mean = 66) in the tumors, in which TN4 and TN5 showed a highly elevated mutation burden compared with TN1–TN3, suggesting that they are hypermutators. Within the different classes of exonic mutations, missense mutations were the most prevalent (>70%) in all 5 samples, with smaller numbers of nonsense and splicing mutations identified (Figure 2C).
Figure 2.
Bulk exome sequencing and mutational analysis of TNBC tumors
(A) Frozen tumor tissues were dissociated into nuclear suspensions and stained with DAPI for FACS to isolate aneuploid single cells by differences in DNA ploidy, where P indicates the mean tumor ploidy and D and A indicate diploid or aneuploid distributions, respectively.
(B) Pseudo-bulk copy-number ratio plots generated from single-cell WGS data from each tumor sample with integer copy-number states calculated.
(C) Classes of exonic mutations for each tumor identified by Annovar.
(D) Mutational signatures calculated from the exonic mutations.
(E) Allelic frequencies for all of the somatic mutations identified in each TNBC tumor that were used to design the MPT panel.
Analysis of the mutational signatures showed that several common breast cancer signatures were prevalent, including signatures 1a/b, 3, 13, and 15 (Figure 2D). Signatures 1a/b reflect aging (deamination of methylcytosines), signature 3 reflects homologous recombination repair deficiency, signature 13 reflects the activity of APOBEC cytidine deaminases, and signature 15 reflects defective DNA mismatch repair. Interestingly, the highest mutation burden in the cohort, observed in TN4, coincided with the AID/APOBEC signature, suggesting a possible mechanism for the acquisition of the mutations. Finally, we investigated the distribution of somatic mutation frequencies, which ranged from ∼0.1% to ∼99% for each patient, and showed that TP53 was one of the highest-frequency mutations identified in all 5 patients (Figure 2E). We combined all of the somatic mutations identified across the 5 patients with TNBC to design a custom MPT panel (Mission Bio) for scDNA-seq analysis.
Mutational substructure of TNBC tumors
To resolve the clonal substructure of each TNBC tumor, we applied the MPT panel to perform scDNA-seq of a total of 23,526 cells (range: 4,002–5,941) from the 5 tumors using the microdroplet (Mission Bio) scDNA-seq platform (Table S3). In contrast to bulk sequencing, the Mission Bio scDNA-seq experiments were all performed on unsorted cells (not subjected to FACS). The resulting data showed that 330 targeted sites (STAR Methods; Table S3) had 164× coverage depth across the 5 tumors. The somatic mutations were detected in each single cell by de novo variant calling using GATK, independently of the bulk exome data, for which GATK was also used for variant calling (STAR Methods).
In the polyclonal tumor TN4, clustering of 4,141 single cells across 69 genes identified 4 major clusters in high-dimensional space using uniform manifold approximation and projection (UMAP) that corresponded to three tumor subclones (c1–c3) and one population of diploid cells (c4) (Figure 3A). A clustered heatmap of the mutations further revealed the somatic mutations in each subclone, including shared mutations that were present in all three tumor clusters (Figure 3B). This analysis identified 15 homozygous mutations that were shared among all three subclones, including TP53, which is the most commonly mutated gene in TNBC.4 The copy number of each gene on the targeted panel was estimated from the read depth, which identified chromosomal gains and losses (STAR Methods). These data showed that GRIN3A, DMKN, and IGSF10 were amplified, while NOTCH3 and 15 other genes with homozygous mutations showed copy-number losses. Computational prediction of the functional impact scores using CADD identified KALRN, NOTCH3, and PTPN4 as having the highest functional impact scores, among other genes (Figure 3C).32
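As a rough illustration of read-depth-based copy-number estimation on a targeted panel, the sketch below normalizes each cell's amplicon counts by its total reads and scales the tumor cluster's median against the diploid cluster. This is an assumed, simplified scheme and not the procedure from the STAR Methods; the function names and toy counts are hypothetical. Because it is anchored to a diploid baseline and to per-cell totals, it yields relative estimates only.

```python
from statistics import median


def normalize_cell(reads):
    """Scale one cell's amplicon read counts to fractions of its total reads.

    Assumes the cell has at least one read.
    """
    total = sum(reads)
    return [r / total for r in reads]


def cluster_copy_number(tumor_cells, diploid_cells, baseline=2.0):
    """Per-amplicon copy number of a tumor cluster relative to diploid cells.

    Each argument is a list of cells; each cell is a list of read counts
    with one entry per amplicon, in the same amplicon order.
    """
    tumor = [normalize_cell(c) for c in tumor_cells]
    diploid = [normalize_cell(c) for c in diploid_cells]
    estimates = []
    for j in range(len(tumor[0])):
        ref = median(c[j] for c in diploid)   # diploid reference depth
        obs = median(c[j] for c in tumor)     # tumor cluster depth
        estimates.append(baseline * obs / ref if ref > 0 else float("nan"))
    return estimates


# Toy read counts for four amplicons (made up for illustration).
diploid_cells = [[100, 100, 100, 100], [50, 50, 50, 50]]
tumor_cells = [[150, 100, 100, 50], [300, 200, 200, 100]]
copy_numbers = cluster_copy_number(tumor_cells, diploid_cells)
```

Using the cluster median rather than single-cell values reduces the effect of amplification noise from any one cell, which matters when each mutation is covered by only one amplicon.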
Figure 3.
Clonal substructure and phylogenetic reconstruction of two polyclonal tumors
(A and E) High-dimensional clustering of somatic mutations using UMAP of 4,141 and 5,172 (TN4, TN2) single cells to identify major clusters of subclones.
(B and F) Heatmaps of hierarchical clustered mutations in TN4 and TN2 showing the clonal substructure and heterogeneity within the tumor subpopulations, with cluster-level copy-number estimations shown in the header tracks above.
(C and G) CADD scores used to rank the predicted deleteriousness of mutations in TN4 and TN2 (CADD >15, top 5%).
(D and H) Phylogenetic reconstruction of the mutation lineages in TN4 and TN2, chronology, and timing of copy-number aberrations relative to the MRCA using a neighbor-joining tree. The common ancestors (A1, A2) are annotated on the tree, as well as the whole-genome doubling (WGD) events.
To infer the chronology and order of single-nucleotide variants (SNVs) and CNAs during tumor evolution, a neighbor-joining (NJ) tree was constructed from the mutation matrix, after which the mutations and CNA events were annotated (STAR Methods). The NJ tree showed three major branching points: the most recent common ancestor (MRCA), the first subclonal ancestor (A1), and the second subclonal ancestor (A2), which resulted in the divergence of the three tumor subclones (c1–c3). Truncal events that occurred in all tumor subclones included copy-number losses of TP53 and NOTCH3 and gains of DMKN and GRIN3A, while early mutations included TP53, GRIN3A, PTPN4, and AKAP6. The c1 subclone was closest to the MRCA, suggesting that it was one of the earliest subclones that diverged, consistent with the clustering results, which showed that it harbored the lowest number of mutations (n = 33). The c2 clone diverged after the A1 ancestor by acquiring an additional set of mutations (n = 8). Finally, the c3 clone diverged from the A2 ancestor via the acquisition of a large set of somatic mutations (n = 28) that included a mutation in NOTCH3 (Figure 3D). Interestingly, a lack of gradual intermediate mutations was identified between the 3 major subclones in this analysis, and most mutations were highly clonal within the three tumor subpopulations, which shared a common evolutionary lineage.
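A neighbor-joining tree takes a pairwise distance matrix as input. The sketch below shows one simple way such a matrix could be derived from binary clone-level mutation profiles, using the Hamming distance. The choice of metric, the function names, and the toy genotypes are assumptions for illustration and are not taken from the STAR Methods.

```python
def hamming(a, b):
    """Number of mutation sites at which two genotype profiles differ."""
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(profiles):
    """Pairwise Hamming distances between named genotype profiles.

    profiles: dict mapping a clone name to a list of 0/1 mutation calls,
    all in the same site order. Returns (names, matrix).
    """
    names = sorted(profiles)
    matrix = [[hamming(profiles[p], profiles[q]) for q in names] for p in names]
    return names, matrix


# Toy profiles: a diploid root and three nested subclones (made up).
profiles = {
    "diploid": [0, 0, 0, 0, 0],
    "c1": [1, 1, 0, 0, 0],
    "c2": [1, 1, 1, 0, 0],
    "c3": [1, 1, 1, 1, 1],
}
names, dist = distance_matrix(profiles)
```

In this toy example c1 has the smallest distance to the diploid profile, which is the pattern expected for the subclone closest to the root of the tree.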
In the polyclonal tumor TN2, a total of 5,172 single cells were sequenced, and 19 somatic mutations were identified. Clustering of these data in high-dimensional space identified 3 major clusters, including two major tumor subclones (c1–c2) and one diploid cell cluster (c3) (Figure 3E). Hierarchical clustering of the somatic mutations identified 13 mutations that were shared between the two tumor clusters, including 3 homozygous mutations, of which one was a TP53 mutation (Figure 3F). The clustered heatmap also showed 6 mutations that were exclusive to the c2 subclone, suggesting that this subpopulation continued to acquire additional mutations after a shared common ancestor. Applying targeted copy-number inference from the read depth data, we identified amplifications in ZFHX4[G/A], USP21, and TMTC1, while TP53, ZMAT1, and COL5A3 showed copy-number losses and had homozygous mutations (Figure 3F). CADD predictions of the functional impact scores showed that ZFHX4[G/A], TP53, SULT1E1, and DICER1 had the highest scores in TN2 (Figure 3G). An NJ tree was constructed and showed two major branching points in evolution: the MRCA and A1. Early truncal events that occurred prior to the MRCA included copy-number losses of TP53, ZMAT1, and COL5A3, while gains included USP21, TMTC1, and ZFHX4[G/A] and mutations included TP53, ZMAT1, and SULT1E1 (Figure 3H). The c2 clone diverged from c1 by acquiring 6 additional mutations, including DCC, DICER1, and RUVBL2, which had high functional impact scores. Consistent with TN4, the mutational substructure of TN2 showed a lack of gradual mutations detected between the two subclones and the diploid cells.
Clustering of TN1, TN3, and TN5 revealed monoclonal tumors that were composed of a single population of tumor cells and a single population of diploid cells (Figure 4). In TN1, 16 mutations were detected across 4,270 single cells. In TN3, the data identified 11 mutations across 5,941 cells, while in TN5, 47 mutations were detected across 4,002 cells (Figures 4A and 4B). In TN1, the mutations with the highest impact scores included homozygous mutations in TP53, CDS1, RASAL2, and ARFGAP1, while in TN3, these mutations included TP53, PLEKHA6, STXBP2, and MED24. In TN5, significant mutations were identified in TP53, RGS7, and NDST4, among other genes (Figure 4C). The CADD analysis showed that homozygous mutations in TP53 had the most significant functional impact scores in all three tumors (Figure 4C). While most mutations in the three monoclonal tumors were detected in all tumor cells, a few mutations occurred at lower frequencies, including MED24 (4.8% of cells) in TN3 and OTUD5 (3.2%), PRDM5 (2%), and SENP6, MARCH6, PODXL2, and ADCY8 (all at 1.8%) in TN5, suggesting that these were later lineage events that emerged during tumor progression.
Figure 4.
Clonal diversity and mutational lineages of three monoclonal tumors
(A) High-dimensional clustering using UMAP of 4,270 (TN1), 5,941 (TN3), and 4,002 (TN5) single cells, respectively, identified one major tumor cluster and one cluster of normal cells in each sample.
(B) Hierarchical clustering heatmaps of the mutations in TN1, TN3, and TN5 showing a monoclonal population of tumor cells, with cluster-level copy-number estimations for each mutation shown in the header bar.
(C) CADD scores used to rank the predicted deleteriousness of mutations (CADD >15, top 5%) in TN1, TN3, and TN5.
(D) Phylogenetic reconstruction of the TN1, TN3, and TN5 mutation lineages, chronology, and timing of copy-number aberrations relative to MRCA using a neighbor-joining tree with the MRCA annotated.
Targeted copy-number inference showed that the genes with homozygous mutations (TP53, CDS1, AP1M1, ZBTB38, PABPC5, and OSM) were deleted in TN1, while PLEKHA6 was amplified in TN3 and TBX4, DLG2, and SPTA1 were amplified in TN5 (Figure 4B). NJ trees were constructed to infer the chronology and order of SNVs and CNAs, which showed that most events were truncal and occurred prior to the MRCA, after which the tumor cells expanded to form the tumor mass in each patient (Figure 4D).
Comparison of clonal substructure from single-cell and bulk data
To understand the advantages of the MPT method for resolving clonal substructure, we directly compared the merged single-cell MPT data with bulk and “pseudo-bulk” reference data from the 5 patients. To assess the accuracy of the copy-number data, we first calculated a consensus of the single-cell MPT data and compared it with the pseudo-bulk single-cell WGS data, which we considered the “gold standard” reference for each patient (Figure 5A). Calculation of the Pearson correlation coefficients showed high correlation values (mean = 0.871) across the patients, suggesting that the single-cell MPT read count data, when merged together across cells, accurately reflected the pseudo-bulk WGS copy-number profiles. Notably, all high-level amplifications were detected in the single-cell data from TN1 (RASAL2, NAT9), TN2 (USP21), TN3 (RIMS2), TN4 (GRIN3A), and TN5 (IQGAP3, SPTA1, TBX4, PIN4). Next, we compared the variant allele frequencies (VAFs) of the bulk exome data versus the combined single-cell mutation frequencies (Figure 5B). Calculation of the Pearson correlation coefficients of the VAFs between the two datasets showed a high correlation (mean = 0.876) across the 5 patients, suggesting that they were highly concordant.
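The VAF comparison above can be sketched as follows, assuming per-cell alternate and total read counts are available for each mutation. The helper names and toy values are hypothetical. Averaging per-cell allele frequencies follows the wording of the Figure 5B legend, but the exact handling of low-coverage cells in the study is not specified here, so skipping uncovered cells is an assumption.

```python
from math import isnan, sqrt


def mean_cell_vaf(alt_reads, total_reads):
    """Average allele frequency of one mutation across single cells.

    Cells with no coverage at the site are skipped (an assumption).
    """
    vafs = [a / t for a, t in zip(alt_reads, total_reads) if t > 0]
    return sum(vafs) / len(vafs) if vafs else float("nan")


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes neither sequence is constant.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Toy data: three mutations, three cells each (made-up counts).
single_cell = [
    ([10, 0, 5], [20, 20, 10]),
    ([18, 20, 19], [20, 20, 20]),
    ([1, 0, 0], [20, 20, 20]),
]
bulk_vaf = [0.35, 0.90, 0.05]  # made-up bulk exome VAFs
merged_vaf = [mean_cell_vaf(alt, tot) for alt, tot in single_cell]
r = pearson(bulk_vaf, merged_vaf)
```

A high coefficient indicates only that the two datasets rank and scale mutations similarly; it does not by itself show that the underlying clonal structure agrees.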
Figure 5.
Comparison of MPT single-cell versus bulk exome sequencing data
(A) Copy-number states for each gene were estimated across all tumor cells from each patient using the Mission Bio data and compared with a pseudo-bulk single-cell consensus copy-number reference from WGS data (Pearson correlation).
(B) Average allele frequencies for each mutation calculated across all single cells from each patient in the MPT data are compared with the bulk exome reference (Pearson correlation).
(C) PyClone2 subclone clustering of mutation frequencies from bulk exome data compared with subclone clusters detected from MPT scDNA-seq profiling in each patient.
Finally, we investigated whether the clonal composition of the bulk exome data could be estimated by clustering VAF distributions using PyClone2 (Figures 5C and S4).33 These data showed that PyClone2 often overestimated the number of subclones present in each tumor: two of the three monoclonal TNBC tumors were estimated to have 3 subclones by PyClone2, while the remaining monoclonal tumor was estimated to have 1 subclone. For the two polyclonal tumors, PyClone2 estimated a higher number of subclones in the exome data of TN4 (3 by single cell, 5 by exome) and of TN2 (2 by single cell, 3 by exome). These data highlight the challenge of estimating the number of subclones from bulk exome data based on VAFs, which single-cell mutation data can accurately resolve.
MPT benchmarking metrics
To benchmark the sensitivity and accuracy of the MPT approach, we first investigated the variance of coverage by calculating the Gini index (GI) and coefficient of variation (CV) from the coverage read depths across all amplicons and cells (Figures S1A, S1B, and S2B). The average GI for amplicons was 0.32, while the average GI for cells was higher at 0.51. Similarly, the average CV for amplicons was 65.6, and the average CV for cells was 125.3. Additionally, we calculated the GI for each respective amplicon to understand the target-specific variation in coverage depth (Figure S2A). These data show that the coverage depth was higher than 100× for most targeted sites and showed substantial variation that corresponded to the targeted genomic region, reflecting differences in amplification efficiencies across the genome (Figure S2B). Next, we calculated the average allelic dropout percentage (∼9%), the average doublet frequency (∼8%) across all samples, and the percentage of successful amplicons (∼85%), which showed good performance metrics for the single-cell datasets (Figures S1C–S1E and S5).
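For reference, the two coverage-uniformity metrics above can be computed from a vector of read depths as in the sketch below. It uses the standard rank-based Gini formula and expresses the CV as a percentage of the mean; treating the reported CV values as percentages, and using the population standard deviation, are assumptions. The function names are illustrative.

```python
from statistics import mean, pstdev


def gini_index(depths):
    """Gini index of read depths: 0 = perfectly even, near 1 = highly uneven."""
    values = sorted(depths)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return float("nan")
    # Rank-weighted sum over the ascending-sorted values (ranks start at 1).
    weighted = sum(rank * v for rank, v in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def coefficient_of_variation(depths):
    """Population standard deviation as a percentage of the mean depth."""
    return 100 * pstdev(depths) / mean(depths)


# Toy depth vectors (made up for illustration).
even = gini_index([5, 5, 5, 5])
skewed = gini_index([0, 0, 0, 10])
cv = coefficient_of_variation([2, 4, 4, 4, 5, 5, 7, 9])
```

Applying the same functions along the amplicon axis and along the cell axis of a depth matrix gives the per-amplicon and per-cell versions of each metric.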
Next, we tested the robustness of the MPT approach by downsampling the FASTQ files of each tumor sample from 100× to 10× coverage (Figures S1F and S1G). This analysis showed that as the coverage depth decreased, most mutations were still detected down to ∼60× coverage depth, below which a large number of mutations were no longer detected in TN4 and TN5. Similarly, downsampling the coverage below 100× led to fewer cells being detected, which became problematic below 60× coverage depth (Figure S1G). Conversely, the average GI and allelic dropout of all samples showed little change as the coverage decreased (Figures S1H and S1I). Finally, we investigated the ability to detect subclonal populations as the number of mutations and cells were downsampled (Figures S1J and S1K). These data showed that for the polyclonal samples, 38 mutations were required to detect all 4 clusters in TN4 (15 mutations for 3 clusters), while 14 mutations were needed to detect all 3 clusters in TN2. In contrast, in the monoclonal tumors, both clusters (normal and tumor) could be detected in all samples with small numbers of mutations (>3 mutations). We then downsampled the number of cells, which showed that 1,350 cells were needed to detect 4 clusters for TN4 (280 cells for 3 clusters) and that 300 cells were needed to detect 3 clusters for TN2 (>50 cells to determine 2 clusters in all samples). Overall, these data suggest that a minimum of 38 mutations and 1,350 cells were required to resolve the clonal substructure in the polyclonal TNBC tumors.
Discussion
Here, we report a new approach (MPT) that involves first performing deep-exome sequencing across a cohort of patients (e.g., 5–10) and identifying mutations that can be pooled together to develop a custom capture platform that targets only these genomic regions for sequencing in scDNA-seq experiments. Our study utilized a microdroplet-based scDNA-seq platform (Mission Bio) to perform high-throughput analysis of thousands of cells in parallel from 5 patients with TNBC. By targeting regions with somatic mutations (identified a priori by bulk exome sequencing), we were able to profile 330 mutations in thousands of single cells using MPT. In contrast, previous studies that have used the same microdroplet platform (Mission Bio) to profile patients with acute myeloid leukemia (AML) with pre-defined tumor panels typically only measure a very small set of mutations (∼1–8) in each patient across many cells.27 From these data, we obtained sufficient genomic markers in each patient to resolve clonal substructure, reconstruct clonal lineages, and identify putative driver mutations that have a potential functional impact during breast tumor progression.
Our study showed that 3 tumors were monoclonal (TN1, TN3, TN5), while 2 tumors (TN2, TN4) were polyclonal, out of 5 patients analyzed. The samples showed a high degree of interpatient heterogeneity, in which single cells were well separated by patient clusters in high-dimensional space (Figure S3). We detected TP53 mutations in all 5 patients with TNBC, which we inferred to be one of the earliest truncal mutations acquired during tumor evolution. Additionally, our FACS data show that genome doubling occurred in all 5 patients with TNBC, as evidenced by increased DNA ploidy (>2.65N), which is consistent with other studies in TNBC.11 However, in addition to TP53, several additional early mutations and CNAs were also detected in the two polyclonal tumors. In TN4, early truncal homozygous mutations (GRIN3A, PTPN4, AKAP6) along with copy-number losses (TP53, NOTCH3) and gains (DMKN, GRIN3A) co-occurred with TP53 in the MRCA at the earliest stages of tumor progression. In TN2, early truncal homozygous mutations (ZMAT1, SULT1E1) and copy-number losses (TP53, ZMAT1, COL5A3) and gains (USP21, TMTC1, ZFHX4[G/A]) co-occurred with TP53 mutations in the MRCA at the earliest stage of progression. These data are consistent with a PCNE model in which short bursts of genome instability give rise to multiple clones that stably expand to form the tumor mass, as previously reported in TNBC.5,11 However, our data further suggest the possibility that mutations may also co-occur in short bursts of evolution, followed by stable clonal expansions that form the tumor mass in patients with TNBC.
Our findings that very few gradual intermediate mutations occur between the lineages of subclones in the polyclonal tumors and that few mutations occur outside of the dominant clones in the monoclonal tumors are unexpected. While TNBC tumors were shown to evolve through PCNE, resulting in a limited number of subclones with highly clonal CNAs, the diversity of somatic point mutations has not been investigated widely through scDNA-seq methods.5,11,16 Previous bulk DNA-seq studies of TNBC tumors have reported high mutation burden, but they have not reported mutation diversity with high granularity beyond clustering of the bulk VAFs.4,7 Our study shows that the TNBC tumors profiled with MPT had limited clonal diversity, consistent with the scDNA-seq copy-number studies.5,11,16 These data suggest that either (1) the mutations were acquired simultaneously in short bursts of evolution and the mutation rate is very low outside of these events or (2) there are large selective sweeps leading to dominant clones that outcompete the other clones that harbored the gradual intermediate mutations. From our current data, which are derived from a single time point, we cannot accurately distinguish between these two scenarios. Additionally, based on our downsampling calculations, MPT may underestimate subclonal diversity in tumor samples that harbor few exonic mutations (N < 14), making it more challenging to delineate intratumor heterogeneity and clonality in these tumors (Figures S1J and S1K). Another possible limitation is that the MPT approach profiled only targeted mutation sites in exons that were identified by bulk exome sequencing and did not perform unbiased profiling of mutation sites across the whole genome (e.g., single-cell exome sequencing or WGS), potentially missing gradual intermediate mutations that may have occurred in single cells in non-coding and intergenic regions.
While we primarily focused on exonic mutations in this study, due to the limited size of the custom amplicon panel (∼380 amplicons), future studies may include different classes of mutations (e.g., intronic, intergenic, insertions or deletions [indels]). This will become more feasible with future technical advances in expanding the Mission Bio targeted panels.29,31
To better understand the concordance of MPT with bulk exome sequencing or WGS, we directly compared these independent datasets from the same tumors in 5 patients. Comparison of VAFs between the bulk and MPT-merged single-cell data showed a very high concordance (average [avg.] Pearson correlation: 0.876), supporting the accuracy of MPT. Copy-number estimation was performed on consensus clusters of the single-cell MPT data for each mutation and compared with a pseudo-bulk-derived reference. Even though the MPT estimation was limited by coverage of only one amplicon per SNV, the copy-number estimations also showed a high concordance (avg. Pearson correlation: 0.871) when performed on groups of cells from each subclone cluster. These data allowed us to obtain both mutation information and copy-number states for each subclone in the TNBC tumors for reconstructing their order in the clonal lineages during tumor evolution. Our data also suggest that PyClone2 generally overclustered the bulk exome data and estimated a higher number of subclones in each tumor compared with the scDNA-seq experiments; PyClone2 has been shown to have difficulty in assessing single-sample clonal structure with small numbers of clones (2–3 clusters).34 In our data, the bulk-exome approach (DNA-seq) alone was unable to provide an accurate estimate of the clonal substructure of the tumors because it could not correctly determine the co-existence of mutations in each subclone. This analysis highlights the importance of using single-cell sequencing approaches to accurately resolve clonal substructure in solid tumors.
Important future directions will include applying MPT to study early-stage breast cancers such as atypical ductal hyperplasia (ADH) or DCIS to better understand mutation bursts and early events that initiate breast cancers in larger cohorts of patients. Additionally, it will be important to link clinical data on drug responses, survival, and progression to the amount of intratumor heterogeneity in invasive breast cancers and other cancer types. Overall, the MPT approach provides a powerful new tool to target only information-rich genomic regions for scDNA-seq analysis to study clonal substructure and lineages across diverse solid cancer types.
Limitations of this study
This study has several notable limitations. First, since this project was primarily for the development of the MPT approach, only 5 patients with TNBC were analyzed. Thus, our biological conclusions regarding the evolutionary history of the tumors may not be generalizable across all TNBCs. Additionally, the custom amplicon panel included a limited number of targeted mutation sites (N = 330); however, future developments of the microdroplet technology (Mission Bio) will likely expand the number of amplicons to thousands of sites. Moreover, while starting with mutations detected by bulk sequencing effectively limits the sequencing space in a cost-effective manner, WGS could potentially detect a greater number of mutations, due to increased sensitivity, providing a higher resolution of clonal diversity. Another limitation was that the copy-number measurements were not based on whole-genome measurements but were instead restricted to the targeted genomic regions covered by the mutations, thus limiting the overall genomic resolution of detection. While the MPT Mission Bio copy-number estimations generally showed high correlation to the pseudo-bulk reference, they are not as accurate as the copy numbers profiled using WGS scDNA-seq methods, which amplify DNA evenly across the whole genome. A further limitation is that indels and intronic and intergenic regions were not profiled in this study, although these events could potentially be used to further resolve ITH and clonal diversity, as other studies have shown.29,31 Finally, it is important to note that the bulk and single-cell data were acquired from different spatial regions in the tumors, which could potentially impact the conclusions regarding clonality.
STAR★Methods
Key resources table
| REAGENT or RESOURCE | SOURCE | IDENTIFIER |
|---|---|---|
| Biological samples | ||
| DCIS/IDC Breast Tumor Samples | University of Texas MD Anderson Cancer Center Breast Cancer Tissue Bank | https://www.mdanderson.org/research/research-resources/core-facilities/institutional-tissue-bank.html |
| Chemicals, peptides, and recombinant proteins | ||
| Tagmentation reagents (Tn5 transposase, ATM) | Illumina | Cat# FC-131-1096 |
| Ampure XP Beads | Beckman Coulter | Cat# A63880 |
| Critical commercial assays | ||
| Zymo DNA Clean & Concentrator Column Kit | Zymo | Cat# D4004 |
| 2X HiFi HotStart Ready Mix | Roche | Cat# KK2602 |
| NEBNext end repair | NEB | Cat# E6050L |
| Quick Ligation | NEB | Cat# E6056L |
| dA-tailing | NEB | Cat# E6053L |
| NEBNext HiFi2x PCRmix | NEB | Cat# M0541L |
| Nimblegen SeqCap EZ Exome V2 kit | Roche | Cat# 05860482001 |
| Mission Bio Custom Panel | Mission Bio | Cat# 145936 |
| Mission Bio Cartridge | Mission Bio | Cat# 046459 |
| Mission Bio Reagent Kit | Mission Bio | Cat# 165919 |
| Deposited data | ||
| Human reference genome NCBI build 37, GRCh37 | Genome Reference Consortium | http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/ |
| FASTQ Files Bulk Exome Sequencing | This paper | SRA: PRJNA763862 |
| FASTQ Files Mission Bio Single Cell Sequencing | This paper | SRA: PRJNA763862 |
| FASTQ Files Single Cell Copy Number Sequencing | Minussi | SRA: PRJNA629885 |
| Oligonucleotides | ||
| 5′-CAAGCAGAAGACGGCATACGAGATXXXXXXXXGTCTCGTGGGCTCGG-3′ | Illumina | Cat# FC-131-1096 |
| 5′-AATGATACGGCGACCACCGAGATCTACACXXXXXXXXTCGTCGGCAGCGTC-3′ | Illumina | Cat# FC-131-1096 |
| Software and algorithms | ||
| Bowtie2(v2.2.6) | Langmead et al., 201235 | https://github.com/BenLangmead/bowtie2 |
| SAMtools(v1.2) | Li et al., 200936 | https://github.com/samtools/; RRID: SCR_002105 |
| DNACopy(v1.70) | Seshan et al., 202237 | https://bioconductor.org/packages/release/bioc/html/DNAcopy.html; RRID:SCR_012560 |
| Dbscan(v1.1-5) | Hahsler et al., 201938 | https://github.com/mhahsler/dbscan |
| Picard(2.27.4) | Broad Institute, 201939 | https://broadinstitute.github.io/picard; RRID: SCR_006525 |
| BEDTools(v2.28) | Quinlan et al., 201040 | https://github.com/arq5x/bedtools2; RRID: SCR_006646 |
| GATK(4.1.9) | McKenna et al., 201041 | https://github.com/broadinstitute/gatk/releases; RRID: SCR_001876 |
| ANNOVAR(v2019Oct24) | Wang et al., 201042 | https://github.com/WGLab/doc-ANNOVAR |
| MutSignatures(v2.1.1) | Fantini et al., 202043 | https://github.com/dami82/mutSignatures |
| Cutadapt(v3.1) | Martin et al., 201144 | https://github.com/marcelm/cutadapt; RRID: SCR_011841 |
| BWA-MEM(v0.7.17) | Langmead et al., 200945 | https://github.com/lh3/bwa; RRID: SCR_022192 |
| Mission Bio Tapestri Insights(v2.0) | Mission Bio, 202046 | https://portal.missionbio.com/ |
| UMAP(v0.2.4) | McInnes et al., 201847 | https://github.com/tkonopka/umap; RRID: SCR_018217 |
| stats(v3.6.2) | R Core Team, 201348 | https://www.r-project.org |
| ComplexHeatMap(v2.2.0) | Gu et al., 201649 | https://github.com/jokergoo/ComplexHeatmap |
| PolyPhen(v2.2.3) | Adzhubei et al., 201050 | http://genetics.bwh.harvard.edu/pph2/ |
| SIFT(6.2.1) | Vaser et al., 201651 | https://sift.bii.a-star.edu.sg; RRID:SCR_012813 |
| CADD(v1.4) | Kircher et al., 201432 | https://cadd.gs.washington.edu/; RRID:SCR_018393 |
| Ape(v5.3) | Paradis et al., 200452 | https://github.com/craN/Ape; RRID:SCR_017343 |
| Tapestri.cnv(v1.0) | MissionBio, 202053 | https://portal.missionbio.com/ |
| Pyclone2(v2.91) | Roth et al., 201433 | https://github.com/Roth-Lab/pyclone; RRID:SCR_016873 |
| Gini(v0.1.0) | Brown et al., 199454 | https://github.com/PABalland/EconGeo/blob/master/R/Gini.r |
| Seqtk(v1.3) | Li, 201355 | https://github.com/lh3/seqtk; RRID:SCR_018927 |
| Other | ||
| FACS Aria II | BD Biosciences | https://www.bdbiosciences.com/en-us/products/instruments/flow-cytometers/research-cell-sorters/bd-facsaria-iii |
| BD FACSMelody | BD Biosciences | https://www.bdbiosciences.com/en-us/products/instruments/flow-cytometers/research-cell-sorters/bd-facsmelody |
| Echo 525 | Labcyte | https://www.beckman.com/liquid-handlers/echo-525 |
| HiSeq 4000 | Illumina | https://www.illumina.com/systems/sequencing-platforms/hiseq-3000-4000.html |
| Mission Bio Tapestri | Mission Bio | https://missionbio.com/products/platform/ |
| TapeStation | Agilent | https://www.agilent.com/en/product/automated-electrophoresis/tapestation-systems/tapestation-instruments/4200-tapestation-system-228263 |
Resource availability
Lead contact
Further information and requests should be directed to the lead contact, Nicholas E. Navin (nnavin@mdanderson.org).
Materials availability
This study did not generate new unique reagents.
Experimental model and subject details
Triple negative breast cancer tumor samples
We selected five snap-frozen tissue samples from women with untreated intraductal carcinoma breast cancer, including paired normal adjacent breast tissues, that were collected from the University of Texas MD Anderson Cancer Center Breast Tissue Bank. The patients were chosen based on their tumor grade (stage 3), age (37–71), histopathology, lack of treatment, tumor ploidy, and triple-negative (ER, PR, Her2) receptor status (Table S1). IHC confirmed the ER and PR status of each tumor to be <1%, while Her2 negative status was determined by IHC or FISH cytogenetic analysis. All of the tissues were collected under informed consent, and the IRB for this study was approved by the MD Anderson Cancer Center Review Board.
Method details
Generation of nuclear suspensions from frozen tissues
For each bulk (tumor and normal) and single cell experiment, nuclear suspensions were generated from frozen tissue using an NST/DAPI buffer (800 mL of NST [146 mM NaCl, 10 mM Tris base at pH 7.8, 1 mM CaCl2, 0.05% BSA, 0.2% Nonidet P-40, and 21 mM MgCl2], 200 mL of 106 mM MgCl2, and 10 mg DAPI).16 To prepare the nuclear suspensions, sections of the tumors were minced with a surgical blade in petri dishes containing NST/DAPI, and then nuclei were filtered through a 40 µm mesh into 1.5 mL Eppendorf tubes. Cell counts were obtained from DAPI utilizing the Countess (Invitrogen).
FACS enrichment of aneuploid nuclei
To enrich tumor nuclei from the tissue sections, millions of DAPI-stained nuclei were flow-sorted on the FACS Aria II (BD Biosciences). Nuclei were selected from the aneuploid distributions based on their DAPI intensity (>2N) relative to the diploid peaks of nuclei (2N) and deposited in 1.5 mL Eppendorf tubes. Notably, the FACS-enriched bulk aneuploid tumor nuclear suspensions were used for downstream bulk exome and pseudo-bulk copy number profiling, while the scDNA Mission Bio experiments were all conducted with unsorted cells (not enriched by FACS) to provide for copy number estimation.
Pseudo-bulk copy number sequencing
Low pass single cell whole genome sequencing was performed on the FACS enriched aneuploid nuclear suspensions of the tumor samples following the Acoustic Tagmentation protocol (ACT). The FACS sorted aneuploid nuclei were distributed into 384-well plates, where the Echo 525 platform (Labcyte) was utilized to dispense lysis buffer [Protease (1.36 AU/mL) diluted 1:9 in 5% Tween 20, 0.5% Triton X-100 and 30 mM Tris pH 8.0], tagmentation reagents (TD:ATM 2:1, 384PP-Plus_GPSA), and barcoding reagents [N7XX (5′-CAAGCAGAAGACGGCATACGAGATXXXXXXXXGTCTCGTGGGCTCGG-3′) and S5XX (5′-AATGATACGGCGACCACCGAGATCTACACXXXXXXXXTCGTCGGCAGCGTC-3′) primers (384PP_AQBP)] in 2X HiFi HotStart Ready Mix (6RES_GPSA) in subsequent controlled steps at the nanoliter scale. The final libraries were purified with 1.8X Ampure XP beads and sequenced for 76 single-read cycles on the Illumina HiSeq 4000 sequencer. A comprehensive outline of each specific step of the ACT method can be found in a previous publication.11
Pseudo-bulk copy number detection pipeline
The resulting sequencing reads were demultiplexed into single cell FASTQ files, aligned to hg19 using Bowtie 2, and converted from SAM to BAM files using SAMtools.35,36 Aligned reads were counted in variable bins (220 kb) and normalized for GC content, bin-wide ratios were calculated, and segmentation was applied with circular binary segmentation (CBS). Adjacent segments were merged using the R Bioconductor DNACopy package, and outliers were removed by dbscan.37,38 Integer copy number was calculated from the DAPI signal in the FACS data for each tumor sample, where the first “D” peak is considered to be the reference (2N) and the ratio of the “A” peak to the “D” peak gives the average ploidy. The inferred segment ratios in each sample are then multiplied by the FACS-derived ploidy to estimate the final integer copy number. Finally, the consensus (pseudo-bulk) copy number of each sample was calculated by taking the median of the ith segment across all single cell inferred integer copy number profiles for each respective sample. The full copy number pipeline and analysis (including scripts) can be found in more detail in a previous publication.11
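The last two steps of this pipeline can be illustrated with a minimal Python sketch. It is not the published pipeline code: the function names, the example peak intensities, and the reading of the FACS-derived ploidy as 2 × (A peak / D peak) are assumptions made for illustration.

```python
from statistics import median


def integer_copy_number(segment_ratios, ploidy):
    """Scale each segment ratio by the sample ploidy and round to an integer state."""
    return [max(0, round(ratio * ploidy)) for ratio in segment_ratios]


def consensus_profile(cell_profiles):
    """Pseudo-bulk profile: median of the i-th segment across all single cells."""
    return [median(segment) for segment in zip(*cell_profiles)]


# Hypothetical DAPI peak intensities (assumption): diploid "D" and aneuploid "A" peaks.
d_peak, a_peak = 200.0, 330.0
ploidy = 2 * a_peak / d_peak  # assumed interpretation: D peak = 2N reference

# Hypothetical segment ratios for three cells (one list per cell).
cells = [[1.0, 0.6, 1.5], [1.0, 0.6, 1.2], [0.9, 0.3, 1.5]]
integer_profiles = [integer_copy_number(cell, ploidy) for cell in cells]
print(consensus_profile(integer_profiles))
```

Taking the median per segment keeps the consensus robust to a minority of cells with noisy or divergent segments.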
Bulk exome sequencing
After using FACS to enrich aneuploid nuclear suspensions or matched normal tissues, the nuclei were prepared for sequencing by first fragmenting the DNA to 250 bp using a Covaris Sonicator and subsequently purifying with the Zymo DNA Clean & Concentrator Column kit (Zymo). Next, the fragmented DNA was barcoded using the NEBNext end repair module (NEB), dA-tailing (NEB), and quick ligation (NEB), and then library amplification was performed by PCR using NEBNext HiFi2x PCRmix. An exome capture was then applied using the Nimblegen SeqCap EZ Exome V2 kit (Roche), and both normal/tumor matched samples were sequenced paired-end at 100 bp on the Illumina HiSeq 4000.
Detection of mutations in bulk DNA samples
The resulting FASTQ files from the HiSeq 4000 sequencing run of normal/tumor matched samples were then demultiplexed using our custom software code (deplexer.pl). Bowtie 2 was then applied to align each FASTQ to the human genome reference (hg19), and the alignments were converted to individual BAM files by SAMtools.35,36 Picard was used to remove PCR duplicates from the resulting BAM files.39 Indel regions were then realigned using the Genome Analysis Toolkit (GATK), and sequencing reads were filtered at a standard mapping quality score of 40.41 BEDTools was utilized to calculate coverage depth and breadth.40 The full in-house pipeline and scripts outlining each step can be downloaded from Nature Protocols, and the full bulk sequencing metrics can be found in Table S2.56 GATK was then applied with default parameters to detect variants and recalibrate quality scores, resulting in a final bulk VCF file summarizing the results. Mutations were filtered out based on two distinct criteria: consensus filtering (mutations must be detected in at least 3 reads) and clustered regions (multiple mutations detected in a 10-bp sliding window). The resulting variants were then annotated by applying ANNOVAR to the VCF files using default parameters.16,42 Sites having low coverage (<10X) were annotated as missing values (NA), and nonvariant sites were labeled as germline reference. Classes of exonic mutations were annotated by ANNOVAR, and mutational signature analysis was performed using the MutSignatures R package.43
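The two filtering criteria can be sketched in a few lines of Python. This is an illustrative reading rather than the in-house pipeline: the record layout (`chrom`, `pos`, `alt_reads`), the function name, and the choice to apply the read-support filter before the clustered-region filter are assumptions.

```python
def filter_variants(variants, min_alt_reads=3, window=10):
    """Keep variants with enough read support that do not sit in a mutation cluster.

    variants: list of dicts with 'chrom', 'pos', and 'alt_reads' keys (assumed layout).
    """
    # Consensus filter: the mutation must be seen in at least `min_alt_reads` reads.
    supported = [v for v in variants if v["alt_reads"] >= min_alt_reads]
    supported.sort(key=lambda v: (v["chrom"], v["pos"]))

    kept = []
    for i, v in enumerate(supported):
        # Clustered-region filter: drop a variant if another supported variant
        # falls within the same `window`-bp stretch on the same chromosome.
        clustered = any(
            j != i
            and u["chrom"] == v["chrom"]
            and abs(u["pos"] - v["pos"]) < window
            for j, u in enumerate(supported)
        )
        if not clustered:
            kept.append(v)
    return kept


calls = [
    {"chrom": "chr1", "pos": 100, "alt_reads": 5},
    {"chrom": "chr1", "pos": 105, "alt_reads": 4},   # within 10 bp of the call above
    {"chrom": "chr1", "pos": 500, "alt_reads": 2},   # too few supporting reads
    {"chrom": "chr1", "pos": 900, "alt_reads": 3},
    {"chrom": "chr2", "pos": 100, "alt_reads": 10},
]
print([(v["chrom"], v["pos"]) for v in filter_variants(calls)])
```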
Mission Bio general custom panel
Mission Bio’s custom-designed targeted panels were used to choose single nucleotide variants to be profiled over specific amplicon regions for high-throughput single cell DNA sequencing. Amplicon-based targeted sequencing was then applied to profile SNVs and estimate copy number variation (CNV) in up to 10,000 cells per sample. Mission Bio custom panels generally consist of short amplicons ranging in size from 175 to 275 bp with primers of 18–35 bp in length. The original custom panels that we utilized contained up to 380 amplicons, while future custom panels are expected to have 1,000 or more amplicons. Although most genes can successfully be profiled by the custom panels, several technical limitations can exclude some gene target areas from the final panel. Most targets are missed because of GC content: only genes in regions with 27–70% GC content and primers with 27–62% GC can be included in the panel. Amplicons also must be non-overlapping to be included; however, amplicons can be tiled end-to-end to address this issue, and tiling can also be employed to handle genes longer than 250 bp (targets >250 bp can have center gap regions with low coverage). Mission Bio custom panels also cannot contain genes located in masked regions of the genome (e.g., centromere regions, telomere regions). Highly repetitive regions (e.g., CAG repeats) or regions with high homology (e.g., LINEs, SINEs, homologous genes) are very difficult to profile and require advanced approaches (250 PE sequencing, custom reference genome) on a case-by-case basis.
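As a rough illustration of the GC-content constraint, the following Python sketch checks a target region and its primers against the ranges quoted above. It is not Mission Bio's panel designer: the function names and example sequences are hypothetical, and only the GC limits are modeled (not overlap, length, or masked regions).

```python
def gc_percent(seq):
    """Percentage of G and C bases in a DNA sequence."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def passes_gc_limits(target_seq, primers, target_range=(27, 70), primer_range=(27, 62)):
    """True if the target region and every primer fall inside the GC ranges."""
    lo, hi = target_range
    if not lo <= gc_percent(target_seq) <= hi:
        return False
    p_lo, p_hi = primer_range
    return all(p_lo <= gc_percent(p) <= p_hi for p in primers)


# Hypothetical sequences for illustration only.
target = "ATGCATGCAT"             # 40% GC
good_primer = "ATGCATGCATGCATGCAT"  # 18 nt, about 44% GC
rich_primer = "GCGCGCGCGCGCGCGCAT"  # 18 nt, about 89% GC

print(passes_gc_limits(target, [good_primer]))
print(passes_gc_limits(target, [good_primer, rich_primer]))
```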
Mission Bio custom MPT panel
To capture the most variance and heterogeneity in our tumor samples, we selected variants with allele frequencies ranging from 0.1% to 100% to identify the heterogeneous subclonal events to build the MPT panel. Out of the 379 genes we submitted for the custom MPT panel, 330 (87%) genes (TN1 - 92%, TN2 - 89%, TN3 - 82%, TN4 - 85%, TN5 - 87%) were selected to create the panel, where the designed amplicon range was 125–190 bp (43 amplicons ≤140 bp, 275 amplicons at 140–175 bp, 12 amplicons >175 bp). Most of the excluded amplicons were dropped from our panel due to high GC% (>70%), overlapping amplicons, or amplicon length (>275 bp). Notably, 9 low frequency mutations (MED24, BICD1, TBX4, OTUD5, PRDM5, SENP2, MARCH6, PODXL2, ADCY8) in TN3 and TN5 (9/330) are possible false positives (FPs), since they were found at a relatively low frequency in a small number of cells; however, they were retained in the custom panel due to a high concordance of allele frequencies with the bulk data. The final custom panel of 330 amplicons was then used on the Mission Bio Tapestri platform to profile SNVs utilizing targeted high-throughput DNA sequencing.
Mission Bio single cell DNA sequencing
The MPT sequencing method utilizes the Mission Bio Tapestri microdroplet platform to perform targeted high-throughput single cell DNA sequencing, where we followed the basic Mission Bio Tapestri default guidelines (Mission Bio User Guide). After isolating single cells in NST/DAPI buffer from the frozen tissue, we input 2–4,000 cells/µL into the Mission Bio Tapestri cartridge, where single cells are individually partitioned into nano-droplets with lysis buffer (Mission Bio) and incubated at 50°C for 60 min on the PCR block (Bio Rad) to release the DNA. The DNA is then re-loaded into the Mission Bio cartridge, where barcoding beads and PCR reagents are combined in a second merged encapsulation. UV light (Analytik Jena XX-15L) is applied for 8 min to release the barcoded DNA from the beads, and the DNA is amplified via multiplexed PCR (Bio Rad) within the droplets (Mission Bio). The droplet emulsions are then broken, and the DNA is extracted and purified using Ampure XP Beads (0.72X). Qubit Fluorescence Quantification (Invitrogen) was used to quantify the concentration of DNA at an expected range of 0.2–4 ng/µL. For library construction, the i5 and i7 indexes are added and library amplification is performed (Mission Bio) on the PCR block (Bio Rad). The library is then purified using Ampure XP beads (0.69X) for an expected on-target size range of 350–550 bp at a concentration range of 2–20 ng/µL. Qubit is used as a preliminary raw quantification of DNA concentration, followed by the TapeStation (Agilent) for a more detailed view of the fragment size distribution in addition to the targeted concentration. The library was then diluted to 5 nM (0.9–1.3 ng/µL) and sequenced on the Illumina HiSeq 4000 at 150 bp paired-end.
Quantification and statistical analysis
Mission Bio mutation detection pipeline
The resulting single cell demultiplexed FASTQ files were input into the Tapestri Sequencing Pipeline (Mission Bio). Adapter sequences are first trimmed from the raw sequencing reads using Cutadapt.44,57 Short reads less than 30 nt are discarded and barcodes are extracted from the reads. Error correction is used (Hamming distance or Levenshtein distance) to correct barcodes with a partial match to increase yields. A cell calling algorithm is then applied to select only cells that have at least 80% amplicon read completeness and pass a total reads cutoff. The reads are then mapped to the reference genome (hg19) using the BWA-MEM algorithm with default parameters, discarding any unmapped reads.45,58 The cells are genotyped using GATK with a joint calling approach that follows GATK Best Practices.59,60 Finally, joint genotyping is performed for all cells using GATK’s GenotypeGVCFs tool, VCF and HDF5 files are generated, and the genotypes and cell matrix are converted into an open-source loom format for further analysis.61 Full MPT single cell sequencing metrics can be found in Table S3. Loom files were input into Mission Bio Tapestri Insights, where default parameters were used to filter the data by cell quality (<30), read depth (<10), alternate allele frequency (<20%), variants genotyped/cell (<50%), percent genotypes present (<50%), and percent cells mutated (<1%).46 Heterozygous germline mutations present in more than 95% of cells were also removed, and amplicons and/or cells that had NA values in over 50% of cells were excluded. Multiple mutations occurring in the same gene in one patient are indicated by a base pair change annotation next to the mutation gene name. For example, XIRP1[C/T] refers to a mutation with a base pair change of C>T that has occurred in this gene.
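The barcode error-correction step can be illustrated with a small Python sketch based on Hamming distance. The Tapestri pipeline's actual rules are not described here, so the maximum distance of 1, the requirement for a unique match, the function names, and the example barcodes are all assumptions.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def correct_barcode(observed, whitelist, max_dist=1):
    """Return the whitelist barcode matching `observed`, or None if ambiguous/absent.

    An exact match is returned as-is; otherwise the read is rescued only when
    exactly one whitelist barcode lies within `max_dist` mismatches.
    """
    if observed in whitelist:
        return observed
    hits = [
        w for w in whitelist
        if len(w) == len(observed) and hamming(observed, w) <= max_dist
    ]
    return hits[0] if len(hits) == 1 else None


# Hypothetical barcode whitelist for illustration.
whitelist = ["ACGTACGT", "TTTTCCCC"]
print(correct_barcode("ACGTACGA", whitelist))  # one mismatch, rescued
print(correct_barcode("GGGGGGGG", whitelist))  # no close match, discarded
```

Requiring a unique match avoids assigning a read to the wrong cell when two whitelist barcodes are equally close.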
Clustering
To investigate the clonality of the tumor samples, a multi-step clustering process was applied to the remaining cells, where PCA (prcomp()) was initially employed to reduce the high dimensional space.48,62 UMAP (umap(), n_neighbors = 15, a = 1, b = 1) was further applied to the PCA selected features to partition the cells into distinct clusters, where additional outliers (e.g., low quality, depth, NAs) were further excluded.47 Technical noise (over-sensitivity) is observed as excess dispersion in the UMAP clustering due to the highly clonal nature of the samples (Figures 3, 4 and Table S3). Unsupervised hierarchical clustering via ComplexHeatMap (Heatmap()) was next applied to the distinct clusters to provide annotated single cell granularity per cluster across all amplicons and visualize the final subclones based on the determined features (linkage = complete, distance = Euclidean), which were then overlaid back on the UMAP projection for annotation (Figures 3, 4 and Table S3).49 Inter-patient heterogeneity was characterized by combining the tumor cells and mutations from all 5 patients and performing clustering with UMAP (n = 13,200, umap(), n_neighbors = 15, a = 1.2, b = 1.2) to define clusters (excluding outliers as above) for each tumor sample (Figure S3).47
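The hierarchical clustering settings named above (complete linkage, Euclidean distance) can be illustrated with a naive pure-Python sketch. This is not the ComplexHeatmap implementation used in the study: the function name, the cut into a fixed number of clusters, and the 0/1/2 genotype coding (reference/heterozygous/homozygous) are assumptions for illustration.

```python
import math


def complete_linkage(points, n_clusters):
    """Agglomerative clustering with complete linkage and Euclidean distance.

    points: list of equal-length numeric vectors (one per cell).
    Returns clusters as sorted lists of point indices.
    """
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Complete linkage: distance between the two farthest members.
        return max(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        _, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return sorted(sorted(c) for c in clusters)


# Hypothetical genotype vectors (rows = cells, columns = mutations; 0/1/2 coding assumed).
cells = [(0, 0, 1), (0, 0, 1), (1, 2, 0), (1, 2, 0), (1, 2, 1)]
print(complete_linkage(cells, n_clusters=2))
```

The pairwise search is cubic in the number of cells, so this form is only suitable for small examples.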
Estimation of mutation impact
To predict the functional impact of the detected variants, Combined Annotation-Dependent Depletion (CADD) was coupled with PolyPhen2 and SIFT to prioritize causal variants based on over 60 combined genomic features.32,50,51 Most annotations tend to exploit a single information type and/or are restricted in scope, so a broadly applicable metric that objectively weights and integrates diverse information was needed. Thus, CADD integrates multiple annotations into one metric by contrasting variants that survived natural selection with simulated mutations. For each sample, the mutations were first filtered by a PolyPhen or (1-SIFT) score over 0.8, then the top 30 deleterious genes (highest potential functional impact) were selected from this filtered pool based on the CADD functional impact scores for each patient scaled by CADD >10 = top 10%, CADD >20 = top 1%, and CADD >30 = top 0.1% (Figures 3 and 4).
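The selection logic can be sketched in Python as follows. This is a hedged illustration, not the study's code: the record layout, gene labels, and function names are hypothetical. The helper `cadd_top_fraction` simply restates the PHRED-like scaling quoted above (a scaled score of 10, 20, or 30 corresponds to the top 10%, 1%, or 0.1%).

```python
def cadd_top_fraction(scaled_score):
    """Fraction of variants ranked at or above a PHRED-like scaled CADD score."""
    return 10 ** (-scaled_score / 10)


def select_deleterious(mutations, score_cutoff=0.8, top_n=30):
    """Filter by PolyPhen or (1 - SIFT) above the cutoff, then rank by CADD."""
    passed = [
        m for m in mutations
        if m["polyphen"] > score_cutoff or (1 - m["sift"]) > score_cutoff
    ]
    return sorted(passed, key=lambda m: m["cadd"], reverse=True)[:top_n]


# Hypothetical annotation records for illustration.
mutations = [
    {"gene": "GENE_A", "polyphen": 0.95, "sift": 0.30, "cadd": 25.0},
    {"gene": "GENE_B", "polyphen": 0.20, "sift": 0.05, "cadd": 32.0},
    {"gene": "GENE_C", "polyphen": 0.50, "sift": 0.50, "cadd": 35.0},  # fails both filters
]
print([m["gene"] for m in select_deleterious(mutations, top_n=2)])
print(cadd_top_fraction(20))
```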
Integrated phylogenetic tree
Neighbor Joining (NJ) trees were constructed from the clustered SNV data using Ape (dist(), nj(), Table S3), and the cluster annotations defined through UMAP and hierarchical clustering were overlaid onto the resulting NJ trees.52 The respective mutations for each sample were annotated on the NJ tree based on the clusters in which they were identified. Estimated copy number aberrations for deleterious genes showing significant CNV per cluster were annotated on the trees based on the clusters in which they were identified (Figures 3 and 4).
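For readers unfamiliar with neighbor joining, the following Python sketch shows the core joining rule on a small distance matrix. It is a generic, topology-only illustration (branch lengths are omitted) and is not the Ape `nj()` implementation used in the study; the function name and the toy distance matrix are assumptions.

```python
def neighbor_joining(labels, matrix):
    """Return an unrooted NJ topology as nested tuples of labels.

    labels: list of taxon names; matrix: symmetric distance matrix (list of lists).
    """
    nodes = list(labels)
    d = {a: {b: matrix[i][j] for j, b in enumerate(labels)}
         for i, a in enumerate(labels)}

    while len(nodes) > 2:
        n = len(nodes)
        total = {a: sum(d[a][b] for b in nodes) for a in nodes}
        # Choose the pair minimising Q(a, b) = (n - 2) d(a, b) - total(a) - total(b).
        best = None
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                q = (n - 2) * d[a][b] - total[a] - total[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best

        # Replace the joined pair with a new internal node and update distances.
        new = (a, b)
        d[new] = {new: 0.0}
        for c in nodes:
            if c != a and c != b:
                dist = 0.5 * (d[a][c] + d[b][c] - d[a][b])
                d[new][c] = dist
                d[c][new] = dist
        nodes = [c for c in nodes if c != a and c != b] + [new]
    return tuple(nodes)


# Toy additive distances for a tree ((A,B),(C,D)) with unit leaf branches
# and an internal branch of length 2 (illustrative values).
labels = ["A", "B", "C", "D"]
matrix = [
    [0, 2, 4, 4],
    [2, 0, 4, 4],
    [4, 4, 0, 2],
    [4, 4, 2, 0],
]
print(neighbor_joining(labels, matrix))
```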
Mission Bio copy number estimation
Tapestri.cnv is an R package provided by Mission Bio for estimating copy number from single cell sequencing read depth.53 A higher degree of variance is commonly observed in read depth coverage with amplicon-based sequencing, which can be observed in the calculated GI and CV across all cells and amplicons (Figures S1A and S1B). This can potentially confound copy number estimation, because some amplicons or cells can have significantly higher or lower coverage depth than others, making normalization imperative. By normalizing the read count matrix for each tumor sample, we can mitigate the effects of this non-uniformity in coverage depth, thus allowing for a more accurate copy number inference. To normalize for cells, the read depth for each mutation in a respective cell is divided by the total number of reads for the entire cell across all amplicons. To both normalize for amplicon read depth and estimate the copy number of each cell, we leverage the relationship between each mutation’s tumor profile and its associated normal (reference) population. Specifically, the median coverage read depth was first calculated for all normal (reference) single cells in each respective amplicon. Next, the value for each tumor cell in a specific amplicon is divided by the median of its respective normal (reference) population (e.g., the normalized read count for the TP53 mutation in each tumor cell in sample TN4 is divided by the median of the normalized read count for all TN4 TP53 normal (reference) cells). This both normalizes the amplicon read depth to attenuate the high variance between amplicons and effectively imparts a normalized copy number ratio value for each cell for each specific mutation. This effective copy ratio value then serves as the estimated copy number, which is simply the normalized read count of each tumor cell mutation divided by its respective median normal reference.
While this provides single cell copy number estimations per cell, the data has significant technical noise at single cell resolution. To correct for this, we calculated the median estimated copy number for each mutation per cluster, which was used to annotate hierarchical clustering heat maps and integrate NJ trees (Figures 3 and 4). To compare the accuracy of the single cell copy number estimations from the MPT panel, whole-genome sparse depth sequencing was performed on each sample, which served as a pseudo-bulk gold standard comparison (Figure 2B).
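The normalization described in this section can be sketched in Python as follows. This is an illustration of the described arithmetic, not the tapestri.cnv code: the data layout (nested dictionaries of read counts), the function names, and the amplicon and cell labels are assumptions, and the sketch assumes every cell has at least one read.

```python
from statistics import median


def normalize_by_cell(reads):
    """Divide each amplicon's read count by the cell's total reads."""
    normalized = {}
    for cell, counts in reads.items():
        total = sum(counts.values())
        normalized[cell] = {amp: n / total for amp, n in counts.items()}
    return normalized


def copy_ratio(tumor_reads, normal_reads):
    """Per-cell, per-amplicon ratio to the median of the normal reference cells."""
    tumor = normalize_by_cell(tumor_reads)
    normal = normalize_by_cell(normal_reads)
    amplicons = list(next(iter(normal.values())))
    reference = {amp: median(cell[amp] for cell in normal.values()) for amp in amplicons}
    return {
        cell: {amp: value / reference[amp] for amp, value in values.items()}
        for cell, values in tumor.items()
    }


def cluster_median(ratios, clusters):
    """Median copy ratio per amplicon within each cluster of cells."""
    return {
        name: {
            amp: median(ratios[cell][amp] for cell in cells)
            for amp in ratios[cells[0]]
        }
        for name, cells in clusters.items()
    }


# Hypothetical read counts (cell -> amplicon -> reads).
normal = {"n1": {"TP53": 50, "AMP_B": 50}, "n2": {"TP53": 50, "AMP_B": 50}}
tumor = {"t1": {"TP53": 25, "AMP_B": 75}, "t2": {"TP53": 25, "AMP_B": 75}}
ratios = copy_ratio(tumor, normal)
print(cluster_median(ratios, {"clone_1": ["t1", "t2"]}))
```

Summarizing by the cluster median, rather than per cell, is what dampens the single-cell noise mentioned above.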
Single cell vs bulk exome comparisons
The median copy number ratio values estimated from the Mission Bio data per gene were compared against a ‘gold standard reference’ that was computed by combining consensus single cell copy number data (TN1 - 1,378, TN2 - 1,224, TN3 - 1,307, TN4 - 1,101, TN5 - 1,238) from each tumor to generate a pseudo-bulk profile (Figure 2B). Correlation was calculated (cor.test()) for all filtered mutations using Pearson (t = t-statistic, df = degrees of freedom [df = (n-1)(k-1); k = 2], CI = 95% confidence interval) [TN1 - t = 19.383, df = 15, CI = (0.7508439, 0.8171596), TN2 - t = 10.537, df = 18, CI = (0.8363862, 0.9707934), TN3 - t = 6.4446, df = 10, CI = (0.7260698, 0.9522813), TN4 - t = 6.0815, df = 68, CI = (0.6427014, 0.9731556), TN5 - t = 11.286, df = 46, CI = (0.8363862, 0.9207934)] (Figure 5A).48 Next, the allele frequencies of all variants in the Mission Bio and bulk data were compared, and the relationship was defined by Pearson’s correlation (cor.test()) [TN1 - t = 7.1999, df = 24, CI = (0.6758093, 0.9236741), TN2 - t = 11.971, df = 29, CI = (0.7887095, 0.9219351), TN3 - t = 18.8136, df = 17, CI = (0.8569058, 0.9207677), TN4 - t = 28.602, df = 74, CI = (0.8579727, 0.8957873), TN5 - t = 26.02, df = 54, CI = (0.8971459, 0.9383571)] (Figure 5B).48 Pyclone2 was applied to the bulk exome data using default parameters to estimate the total number and size of the clusters.33 The resulting Pyclone2 cluster assignments were then compared to the clusters defined by the single cell analysis (Figures 5C and S4).
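For orientation, the quantities reported by a Pearson correlation test can be computed with the Python standard library as sketched below. The sketch mirrors what R's `cor.test()` reports for Pearson's method (a t statistic on n − 2 degrees of freedom and a confidence interval from the Fisher z-transform); the function name and the example values are illustrative assumptions, not data from this study.

```python
import math
from statistics import NormalDist, mean


def pearson_test(x, y, conf_level=0.95):
    """Pearson r with its t statistic, degrees of freedom, and Fisher-z CI.

    Requires at least 4 paired observations and |r| < 1.
    """
    n = len(x)
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)

    df = n - 2
    t = r * math.sqrt(df / (1 - r * r))

    # Fisher z-transform confidence interval.
    z = math.atanh(r)
    half_width = NormalDist().inv_cdf(0.5 + conf_level / 2) / math.sqrt(n - 3)
    ci = (math.tanh(z - half_width), math.tanh(z + half_width))
    return r, t, df, ci


# Illustrative paired values (e.g., bulk VAF vs. single-cell VAF for five sites).
bulk = [1, 2, 3, 4, 5]
single_cell = [2, 1, 4, 3, 5]
r, t, df, ci = pearson_test(bulk, single_cell)
print(round(r, 3), round(t, 3), df, tuple(round(v, 3) for v in ci))
```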
Doublet identification
To identify doublets in the dataset, we leveraged the normalized coverage read count matrix created using tapestri.cnv, using the normalized copy number ratio values (normalized read depth) to determine outliers. Specifically, we sought outlier clusters with significantly higher average normalized ratio values (effectively higher normalized coverage depth) to mark and remove cells as doublets. Unsupervised hierarchical clustering via ComplexHeatMap (Heatmap()) was applied to the matrix of normalized ratio values for each tumor cell across all mutations to define cluster differences per sample.49 For each defined cluster, the log2 of the average ratio value of all cells in each respective cluster was calculated across all mutations per sample. Thus, each cluster was assigned a respective metric (log2(Ratio Values)) quantifying its average normalized ratio value for comparison. In each tumor sample, a single outlier cluster, defined as the [Doublet] cluster, was observed that had a significantly higher average normalized ratio value (higher read depth) when compared to the values of the other defined tumor clusters. To quantify this difference, we applied a paired t-test (t.test(paired = TRUE)) to compare the distribution of the ratio values of the [Doublet] cluster with that of the [Tumor] cluster (the distribution of the remaining cluster average ratio values) in each sample (e.g., TN1 outlier cluster [Doublet] vs. combined average ratio values of remaining TN1 tumor clusters [Tumor]). Clusters containing cells with significantly higher normalized ratio values (akin to higher read depth) were marked as doublets/outliers and subsequently removed from the downstream analysis (mod = mean of the difference) [TN1 - t = −5.3145, df = 24, mod = −0.2915397, CI = (−0.4047607, −0.1783187), TN2 - t = −3.3362, df = 29, mod = −0.1337342, CI = (−0.26956914, −0.04789932), TN3 - t = −2.9942, df = 17, mod = −0.167538, CI = (−0.28559127, −0.04948482), TN4 - t = −3.2603, df = 74, mod = −0.1641321, CI = (−0.26458814, −0.06367613), TN5 - t = −5.766, df = 54, mod = −0.325912, CI = (−0.439234, −0.212590)] (Figures S1D and S5).48
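A minimal Python sketch of this comparison is shown below. It is not the analysis code from the study: the function names, the cluster labels, and the example ratio values are hypothetical, and the order of subtraction (tumor minus doublet) is an assumption chosen so that a higher-depth doublet cluster gives a negative mean difference.

```python
import math
from statistics import mean, stdev


def cluster_log2_ratio(ratios_by_cluster):
    """log2 of the average normalized ratio value for each cluster."""
    return {name: math.log2(mean(values)) for name, values in ratios_by_cluster.items()}


def flag_doublet_cluster(ratios_by_cluster):
    """Name of the cluster with the highest average normalized ratio."""
    scores = cluster_log2_ratio(ratios_by_cluster)
    return max(scores, key=scores.get)


def paired_t(x, y):
    """Paired t statistic, degrees of freedom, and mean of the differences (x - y)."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    mean_diff = mean(diffs)
    t = mean_diff / (stdev(diffs) / math.sqrt(n))
    return t, n - 1, mean_diff


# Hypothetical per-mutation average ratio values for tumor and candidate doublet clusters.
tumor = [1.0, 1.1, 0.9, 1.2]
doublet = [2.0, 2.3, 1.8, 2.5]
print(flag_doublet_cluster({"tumor": tumor, "doublet": doublet}))
print(paired_t(tumor, doublet))
```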
MPT benchmarking metrics
To assess the efficiency, sensitivity, and limitations of the MPT approach, we calculated a variety of benchmarking metrics across several downsampled conditions. Amplicon sequencing can have a relatively non-uniform distribution of read depth, so we examined the coverage read depth distributions of each tumor sample to assess the extent of this variance (Figure S2B). Summary Gini index (GI) (gini()) and coefficient of variation (CV) (SD/mean) were calculated for both the cells and amplicons by taking the average of both metrics for each sample (average of all cells/amplicons per sample) (Figures S1A and S1B, Table S3).54 To obtain a higher resolution, we next calculated the GI of each respective amplicon per sample to better understand the individual contribution of each amplicon to the variance (Figure S2A). For additional summary metrics, we calculated the average allelic dropout per sample (based on heterozygous SNPs - Mission Bio Pipeline), the percentage of doublets per tumor sample (STAR Methods), and the percentage of successful amplicons per sample (amplicons with mean reads above 0.2 × the mean reads per cell per amplicon - Mission Bio Pipeline) (Figures S1C–S1E).
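The two dispersion metrics can be sketched in Python as follows. The Gini formula used here is a common rank-based form and may differ in small details (such as bias correction) from the `gini()` function cited above; the function names and read-depth values are illustrative assumptions.

```python
from statistics import mean, stdev


def gini(values):
    """Gini index of non-negative values (0 = perfectly even, near 1 = highly uneven)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    rank_weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * rank_weighted / (n * total) - (n + 1) / n


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (SD/mean)."""
    return stdev(values) / mean(values)


# Hypothetical per-amplicon read depths for one cell.
even_depth = [5, 5, 5, 5]
skewed_depth = [0, 0, 0, 1]
print(gini(even_depth), gini(skewed_depth))
print(coefficient_of_variation([2, 4, 4, 4, 5, 5, 7, 9]))
```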
Seqtk was employed to downsample the coverage depth of each sample through random sampling of the FASTQ file reads for each depth (100X->10X).55 After re-running the Mission Bio pipeline with each of the downsampled FASTQs, the data were re-processed with the same analysis pipeline to yield the number of amplicons and the number of cells that could be successfully profiled by MPT at each coverage depth (Figures S1F and S1G). Summary GI and allelic dropout were also calculated for each downsampled coverage condition per sample (Figures S1H and S1I). Finally, to investigate the sensitivity of MPT to accurately determine subclusters, we downsampled the number of mutations and cells, respectively, at the original coverage depths of each sample. Specifically, we successively sampled lower numbers of mutations and cells at random from each tumor sample mutation matrix. For each downsampled condition, we then re-analyzed the data to determine the number of clusters that could still be successfully delineated (Figures S1J and S1K).
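The matrix downsampling step can be illustrated with a short Python sketch. It is a generic random subsampling of rows (cells) and columns (mutations) with a fixed seed for reproducibility; the function name, the seed handling, and the toy matrix are assumptions and not the code used in the study.

```python
import random


def downsample_matrix(matrix, n_cells, n_mutations, seed=0):
    """Randomly keep `n_cells` rows and `n_mutations` columns of a cell x mutation matrix."""
    rng = random.Random(seed)
    rows = sorted(rng.sample(range(len(matrix)), n_cells))
    cols = sorted(rng.sample(range(len(matrix[0])), n_mutations))
    return [[matrix[r][c] for c in cols] for r in rows]


# Toy 4-cell x 5-mutation matrix with distinguishable entries.
matrix = [[10 * r + c for c in range(5)] for r in range(4)]
for row in downsample_matrix(matrix, n_cells=2, n_mutations=3, seed=1):
    print(row)
```

Each downsampled matrix would then be passed through the same clustering workflow to count how many clusters remain distinguishable.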
Acknowledgments
This study was supported by grants to N.E.N. from the American Cancer Society (129098-RSG-16-092-01-TBG), the NIH National Cancer Institute (RO1CA240526 and RO1CA236864), the Emerson Collective Cancer Research Fund, and the CPRIT Single Cell Genomics Center (RP180684). N.E.N. is an AAAS Wachtel Scholar, Andrew Sabin Family Fellow, Jack & Beverly Randall Innovator, and AAAS Fellow. This study was supported by the MD Anderson Sequencing Core Facility Grant (CA016672) and MD Anderson T32 Translational Genomics and Precision Medicine Fellowship (CA217789). We thank Hongli Tang, Louis Ramagli, and Erika Thompson for their help with next-generation sequencing. We are grateful to Alexander Davis for providing guidance on the mutation trees and Darlan Minussi, Naveen Ramesh, Tapsi Kumar, and Aislyn Schlack for data analysis support. Finally, we thank Robert Durruthy, Kelly Kaihara, and Anjali Pradhan from Mission Bio for their computational and technical support for this project.
Author contributions
J.L. performed experiments and data analysis and prepared the manuscript. E.S. performed FACS and exome experiments. M.H. performed data analysis. F.M.-B. obtained clinical tissues and clinical data. N.E.N. managed the project and wrote the manuscript.
Declarations of interests
N.E.N. is a member of the Scientific Advisory Board (SAB) for Mission Bio.
Published: November 9, 2022
Footnotes
Supplemental information can be found online at https://doi.org/10.1016/j.xgen.2022.100215.
Data and code availability
The data have been deposited in the NCBI Sequence Read Archive (SRA) under the accessions SRA: PRJNA763862 and PRJNA629885. Code is available upon request.
References
- 1. Foulkes W.D., Smith I.E., Reis-Filho J.S. Triple-negative breast cancer. N. Engl. J. Med. 2010;363:1938–1948. doi: 10.1056/NEJMra1001389.
- 2. Liedtke C., Mazouni C., Hess K.R., André F., Tordai A., Mejia J.A., Symmans W.F., Gonzalez-Angulo A.M., Hennessy B., Green M., et al. Response to neoadjuvant therapy and long-term survival in patients with triple-negative breast cancer. J. Clin. Oncol. 2008;26:1275–1281. doi: 10.1200/JCO.2007.14.4147.
- 3. Kim C., Gao R., Sei E., Brandt R., Hartman J., Hatschek T., Crosetto N., Foukakis T., Navin N.E. Chemoresistance evolution in triple-negative breast cancer delineated by single-cell sequencing. Cell. 2018;173:879–893.e13. doi: 10.1016/j.cell.2018.03.041.
- 4. Cancer Genome Atlas Network. Comprehensive molecular portraits of human breast tumours. Nature. 2012;490:61–70. doi: 10.1038/nature11412.
- 5. Gao R., Davis A., McDonald T.O., Sei E., Shi X., Wang Y., Tsai P.C., Casasent A., Waters J., Zhang H., et al. Punctuated copy number evolution and clonal stasis in triple-negative breast cancer. Nat. Genet. 2016;48:1119–1130. doi: 10.1038/ng.3641.
- 6. Yates L.R., Gerstung M., Knappskog S., Desmedt C., Gundem G., Van Loo P., Aas T., Alexandrov L.B., Larsimont D., Davies H., et al. Subclonal diversification of primary breast cancer revealed by multiregion sequencing. Nat. Med. 2015;21:751–759. doi: 10.1038/nm.3886.
- 7. Shah S.P., Roth A., Goya R., Oloumi A., Ha G., Zhao Y., Turashvili G., Ding J., Tse K., Haffari G., et al. The clonal and mutational evolution spectrum of primary triple-negative breast cancers. Nature. 2012;486:395–399. doi: 10.1038/nature10933.
- 8. Nedeljković M., Damjanović A. Mechanisms of chemotherapy resistance in triple-negative breast cancer-how we can rise to the challenge. Cells. 2019;8:957. doi: 10.3390/cells8090957.
- 9. Marra A., Trapani D., Viale G., Criscitiello C., Curigliano G. Practical classification of triple-negative breast cancer: intratumoral heterogeneity, mechanisms of drug resistance, and novel therapies. NPJ Breast Cancer. 2020;6:54. doi: 10.1038/s41523-020-00197-2.
- 10. Casasent A.K., Schalck A., Gao R., Sei E., Long A., Pangburn W., Casasent T., Meric-Bernstam F., Edgerton M.E., Navin N.E. Multiclonal invasion in breast tumors identified by topographic single cell sequencing. Cell. 2018;172:205–217.e12. doi: 10.1016/j.cell.2017.12.007.
- 11. Minussi D.C., Nicholson M.D., Ye H., Davis A., Wang K., Baker T., Tarabichi M., Sei E., Du H., Rabbani M., et al. Breast tumours maintain a reservoir of subclonal diversity during expansion. Nature. 2021;592:302–308. doi: 10.1038/s41586-021-03357-x.
- 12. Lim B., Lin Y., Navin N. Advancing cancer research and medicine with single-cell genomics. Cancer Cell. 2020;37:456–470. doi: 10.1016/j.ccell.2020.03.008.
- 13. Navin N.E. Cancer genomics: one cell at a time. Genome Biol. 2014;15:452. doi: 10.1186/s13059-014-0452-9.
- 14. Navin N., Kendall J., Troge J., Andrews P., Rodgers L., McIndoo J., Cook K., Stepansky A., Levy D., Esposito D., et al. Tumour evolution inferred by single-cell sequencing. Nature. 2011;472:90–94. doi: 10.1038/nature09807.
- 15. Zong C., Lu S., Chapman A.R., Xie X.S. Genome-wide detection of single-nucleotide and copy-number variations of a single human cell. Science. 2012;338:1622–1626. doi: 10.1126/science.1229164.
- 16. Wang Y., Waters J., Leung M.L., Unruh A., Roh W., Shi X., Chen K., Scheet P., Vattathil S., Liang H., et al. Clonal evolution in breast cancer revealed by single nucleus genome sequencing. Nature. 2014;512:155–160. doi: 10.1038/nature13600.
- 17. Telenius H., Carter N.P., Bebb C.E., Nordenskjöld M., Ponder B.A., Tunnacliffe A. Degenerate oligonucleotide-primed PCR: general amplification of target DNA by a single degenerate primer. Genomics. 1992;13:718–725. doi: 10.1016/0888-7543(92)90147-k.
- 18. Dean F.B., Hosono S., Fang L., Wu X., Faruqi A.F., Bray-Ward P., Sun Z., Zong Q., Du Y., Du J., et al. Comprehensive human genome amplification using multiple displacement amplification. Proc. Natl. Acad. Sci. USA. 2002;99:5261–5266. doi: 10.1073/pnas.082089499.
- 19. Kamberov E., Sun T., Bruening E., Pinter J.H., Sleptsova I. National Center for Biotechnology Information; 2012. Amplification and Analysis of Whole Genome and Whole Transcriptome Libraries Generated by a DNA Polymerization Process. US-8206913-B1.
- 20. Gierahn T.M., Wadsworth M.H., Hughes T.K., Bryson B.D., Butler A., Satija R., Fortune S., Love J.C., Shalek A.K. Seq-Well: portable, low-cost RNA sequencing of single cells at high throughput. Nat. Methods. 2017;14:395–398. doi: 10.1038/nmeth.4179.
- 21. Prakadan S.M., Shalek A.K., Weitz D.A. Scaling by shrinking: empowering single-cell 'omics' with microfluidic devices. Nat. Rev. Genet. 2017;18:345–361. doi: 10.1038/nrg.2017.15.
- 22. Macosko E.Z., Basu A., Satija R., Nemesh J., Shekhar K., Goldman M., Tirosh I., Bialas A.R., Kamitaki N., Martersteck E.M., et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell. 2015;161:1202–1214. doi: 10.1016/j.cell.2015.05.002.
- 23. Nagasawa S., Kuze Y., Maeda I., Kojima Y., Motoyoshi A., Onishi T., Iwatani T., Yokoe T., Koike J., Chosokabe M., et al. Genomic profiling reveals heterogeneous populations of ductal carcinoma in situ of the breast. Commun. Biol. 2021;4:438. doi: 10.1038/s42003-021-01959-9.
- 24. Gawad C., Koh W., Quake S.R. Dissecting the clonal origins of childhood acute lymphoblastic leukemia by single-cell genomics. Proc. Natl. Acad. Sci. USA. 2014;111:17947–17952. doi: 10.1073/pnas.1420822111.
- 25. Laks E., McPherson A., Zahn H., Lai D., Steif A., Brimhall J., Biele J., Wang B., Masud T., Ting J., et al. Clonal decomposition and DNA replication states defined by scaled single-cell genome sequencing. Cell. 2019;179:1207–1221.e22. doi: 10.1016/j.cell.2019.10.026.
- 26. Lan F., Demaree B., Ahmed N., Abate A.R. Single-cell genome sequencing at ultra-high-throughput with microfluidic droplet barcoding. Nat. Biotechnol. 2017;35:640–646. doi: 10.1038/nbt.3880.
- 27. Pellegrino M., Sciambi A., Treusch S., Durruthy-Durruthy R., Gokhale K., Jacob J., Chen T.X., Geis J.A., Oldham W., Matthews J., et al. High-throughput single-cell DNA sequencing of acute myeloid leukemia tumors with droplet microfluidics. Genome Res. 2018;28:1345–1352. doi: 10.1101/gr.232272.117.
- 28. Morita K., Wang F., Jahn K., Hu T., Tanaka T., Sasaki Y., Kuipers J., Loghavi S., Wang S.A., Yan Y., et al. Clonal evolution of acute myeloid leukemia revealed by high-throughput single-cell genomics. Nat. Commun. 2020;11:5327. doi: 10.1038/s41467-020-19119-8.
- 29. Miles L.A., Bowman R.L., Merlinsky T.R., Csete I.S., Ooi A.T., Durruthy-Durruthy R., Bowman M., Famulare C., Patel M.A., Mendez P., et al. Single-cell mutation analysis of clonal evolution in myeloid malignancies. Nature. 2020;587:477–482. doi: 10.1038/s41586-020-2864-x.
- 30. Wang F., Morita K., DiNardo C.D., Furudate K., Tanaka T., Yan Y., Patel K.P., MacBeth K.J., Wu B., Liu G., et al. Leukemia stemness and co-occurring mutations drive resistance to IDH inhibitors in acute myeloid leukemia. Nat. Commun. 2021;12:2607. doi: 10.1038/s41467-021-22874-x.
- 31. Zhang H., Karnoub E.R., Umeda S., Chaligné R., Masilionis I., McIntyre C.A., Hayashi A., Sashittal P., Zucker A., Mullen K., et al. Application of high-throughput, high-depth, targeted single-nucleus DNA sequencing in pancreatic cancer. Preprint at bioRxiv. 2022. doi: 10.1101/2022.03.06.483206.
- 32. Kircher M., Witten D.M., Jain P., O'Roak B.J., Cooper G.M., Shendure J. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 2014;46:310–315. doi: 10.1038/ng.2892.
- 33. Roth A., Khattra J., Yap D., Wan A., Laks E., Biele J., Ha G., Aparicio S., Bouchard-Côté A., Shah S.P. PyClone: statistical inference of clonal population structure in cancer. Nat. Methods. 2014;11:396–398. doi: 10.1038/nmeth.2883.
- 34. Ahmadinejad N., Troftgruben S., Wang J., Chandrashekar P.B., Dinu V., Maley C., Liu L. Accurate identification of subclones in tumor genomes. Mol. Biol. Evol. 2022;39:msac136. doi: 10.1093/molbev/msac136.
- 35. Langmead B., Salzberg S.L. Fast gapped-read alignment with Bowtie 2. Nat. Methods. 2012;9:357–359. doi: 10.1038/nmeth.1923.
- 36. Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R., 1000 Genome Project Data Processing Subgroup. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25:2078–2079. doi: 10.1093/bioinformatics/btp352.
- 37. Seshan V.E., Olshen A. 2022. DNAcopy: DNA copy number data analysis. https://github.com/veseshan/DNAcopy/ Bioconductor R package version 1.70.0.
- 38. Hahsler M., Piekenbrock M., Doran D. Dbscan: fast density-based clustering with R. J. Stat. Softw. 2019;91:1–30. doi: 10.18637/jss.v091.i01.
- 39. Broad Institute. 2019. Picard Toolkit. https://broadinstitute.github.io/picard/
- 40. Quinlan A.R., Hall I.M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26:841–842. doi: 10.1093/bioinformatics/btq033.
- 41. McKenna A., Hanna M., Banks E., Sivachenko A., Cibulskis K., Kernytsky A., Garimella K., Altshuler D., Gabriel S., Daly M., DePristo M.A. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20:1297–1303. doi: 10.1101/gr.107524.110.
- 42. Wang K., Li M., Hakonarson H. ANNOVAR: functional annotation of genetic variants from next-generation sequencing data. Nucleic Acids Res. 2010;38:e164. doi: 10.1093/nar/gkq603.
- 43. Fantini D., Vidimar V., Yu Y., Condello S., Meeks J.J. MutSignatures: an R package for extraction and analysis of cancer mutational signatures. Sci. Rep. 2020;10. doi: 10.1038/s41598-020-75062-0.
- 44. Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet. J. 2011;17:10–12. doi: 10.14806/ej.17.1.200.
- 45. Langmead B., Trapnell C., Pop M., Salzberg S.L. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10:R25. doi: 10.1186/gb-2009-10-3-r25.
- 46. Mission Bio Tapestri Insights. Mission Bio; 2020. Computational Tool to Filter and Select Variants for Downstream Processing. https://portal.missionbio.com
- 47. McInnes L., Healy J., Saul N., Großberger L. UMAP: uniform manifold approximation and projection. J. Open Source Softw. 2018;3:861. doi: 10.21105/joss.00861.
- 48. R Core Team. R Foundation for Statistical Computing; 2012. Stats Package - R: A Language and Environment for Statistical Computing. http://www.R-project.org/
- 49. Gu Z., Eils R., Schlesner M. Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics. 2016;32:2847–2849. doi: 10.1093/bioinformatics/btw313.
- 50. Adzhubei I.A., Schmidt S., Peshkin L., Ramensky V.E., Gerasimova A., Bork P., Kondrashov A.S., Sunyaev S.R. A method and server for predicting damaging missense mutations. Nat. Methods. 2010;7:248–249. doi: 10.1038/nmeth0410-248.
- 51. Vaser R., Adusumalli S., Leng S.N., Sikic M., Ng P.C. SIFT missense predictions for genomes. Nat. Protoc. 2016;11:1–9. doi: 10.1038/nprot.2015.123.
- 52. Paradis E., Claude J., Strimmer K. APE: analyses of phylogenetics and evolution in R language. Bioinformatics. 2004;20:289–290. doi: 10.1093/bioinformatics/btg412.
- 53. Mission Bio. 2020. Tapestri.cnv - R package to infer copy number variation from Mission Bio single cell mutation data. https://portal.missionbio.com
- 54. Brown M.C. Using gini-style indices to evaluate the spatial patterns of health practitioners: theoretical considerations and an application based on Alberta data. Soc. Sci. Med. 1994;38:1243–1256. doi: 10.1016/0277-9536(94)90189-9.
- 55. Li H. 2013. Seqtk - Toolkit for processing sequences in FASTA/Q formats. https://github.com/lh3/seqtk
- 56. Leung M.L., Davis A., Gao R., Casasent A., Wang Y., Sei E., Vilar E., Maru D., Kopetz S., Navin N.E. Single-cell DNA sequencing reveals a late-dissemination model in metastatic colorectal cancer. Genome Res. 2017;27:1287–1299. doi: 10.1101/gr.209973.116.
- 57. Bolger A.M., Lohse M., Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30:2114–2120. doi: 10.1093/bioinformatics/btu170.
- 58. Kim D., Pertea G., Trapnell C., Pimentel H., Kelley R., Salzberg S.L. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 2013;14:R36. doi: 10.1186/gb-2013-14-4-r36.
- 59. DePristo M.A., Banks E., Poplin R., Garimella K.V., Maguire J.R., Hartl C., Philippakis A.A., del Angel G., Rivas M.A., Hanna M., et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 2011;43:491–498. doi: 10.1038/ng.806.
- 60. Van der Auwera G.A., Carneiro M.O., Hartl C., Poplin R., Del Angel G., Levy-Moonshine A., Jordan T., Shakir K., Roazen D., Thibault J., et al. From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr. Protoc. Bioinformatics. 2013;43:11.10.1–11.10.33. doi: 10.1002/0471250953.bi1110s43.
- 61. Zeisel A., Hochgerner H., Lönnerberg P., Johnsson A., Memic F., van der Zwan J., Häring M., Braun E., Borm L.E., La Manno G., et al. Molecular architecture of the mouse nervous system. Cell. 2018;174:999–1014.e22. doi: 10.1016/j.cell.2018.06.021.
- 62. Pearson K. On lines and planes of closest fit to systems of points in space. London Edinburgh Philos. Mag. J. Sci. 1901;2:559–572. doi: 10.1080/14786440109462720.





