Summary
Normal tissues adjacent to the tumor (NATs) may harbor early breast carcinogenesis events driven by field cancerization. Although previous studies have characterized copy-number (CN) and transcriptomic alterations, the evolutionary history of NATs in breast cancer (BC) remains poorly characterized. Utilizing whole-genome sequencing (WGS), methylation profiling, and RNA sequencing (RNA-seq), we analyzed paired germline, NAT, and tumor samples from 43 individuals with BC in Hong Kong (HK). We found that single-nucleotide variants (SNVs) were common in NATs, with one-third of NAT samples exhibiting SNVs in driver genes, many of which were present in paired tumor samples. The most frequently mutated genes in both tumor and NAT samples were PIK3CA, TP53, GATA3, and AKT1. In contrast, large-scale aberrations such as somatic CN alterations (SCNAs) and structural variants (SVs) were rarely detected in NAT samples. We generated phylogenetic trees to investigate the evolutionary history of paired NAT and tumor samples. The pairs could be categorized into tumor-only, shared, and multiple-tree groups, the last of which is concordant with non-genetic field cancerization. These groups exhibited distinct genomic and epigenomic characteristics in both NAT and tumor samples. Specifically, NAT samples in the shared-tree group showed a higher number of mutations, while NAT samples belonging to the multiple-tree group showed a less inflammatory tumor microenvironment (TME), characterized by a higher proportion of regulatory T cells (Tregs) and lower presence of CD14 cell populations. In summary, our findings highlight the diverse evolutionary history in BC NAT/tumor pairs and the impact of field cancerization and TME in shaping the genomic evolutionary history of tumors.
Keywords: Chinese, breast cancer, cancer genomics, whole-genome sequencing, omics analyses, clonal evolution, normal tissues adjacent to the tumor
Graphical abstract

Analysis of normal tissues adjacent to breast tumors (NATs) revealed widespread somatic point mutations, including in cancer driver genes, but few large-scale chromosomal aberrations. Further analysis uncovered different cancer evolution patterns influenced by the tumor microenvironment (TME). This study highlights the roles of field cancerization and the TME in breast cancer (BC) evolution.
Introduction
Breast cancer (BC) is a complex and heterogeneous disease, with intrinsic subtypes that have distinct clinical and genomic features1,2 and varying levels of recurrence risk.3 While the evaluation of tumor histopathology and molecular profiles is critical for assessing recurrence risk and guiding clinical management, recent research suggests that normal tissues adjacent to the tumor (NATs) may also provide valuable insights into cancer relapse and survival.4,5,6 Studies have shown that NATs harbor genetic, epigenetic, and transcriptomic alterations6,7,8 that likely represent early events in breast carcinogenesis and are consistent with the field cancerization hypothesis. Field cancerization is characterized by morphologically “normal” areas that contain pre-cancerous cells that have undergone molecular changes.9 A comprehensive analysis of NATs across multiple sites suggested that NAT samples occupy an intermediate state between healthy and tumor tissues.6
It has been suggested that molecular alterations observed in NAT samples are more likely due to field cancerization than to the presence of tumor cells or paracrine influence from the tumor.10 Field cancerization has important clinical implications in local cancer recurrence, as the field of molecularly altered cells, in combination with tumor microenvironment (TME) factors, may create a tumor-prone environment.11 The concept of field cancerization and its implications for local recurrence and second primary tumors was first proposed in 1953 for individuals with squamous cell carcinoma of the head and neck by Slaughter et al.12 Subsequent studies by Tabor et al. observed that a subset of individuals with head and neck cancer with genetically altered fields in surgical margins developed local recurrences.13 Additionally, field cancerization has been associated with a higher risk of recurrence in other cancers, such as bladder cancer.14 Therefore, individuals with field cancerization may require more intensive surveillance and follow-up to detect and treat recurrent tumors at an early stage. Studies have demonstrated that molecular field cancerization may have prognostic and predictive value,15,16,17,18 suggesting that using molecular signatures in high-risk peri-tumoral tissues alongside morphological examinations could enhance the definition of surgical margins in breast conservation therapy. This approach can also help identify individuals who require increased surveillance or more aggressive treatment at the time of surgery.
Variations in cancer initiation, driven by mutations, and selection processes influenced by interactions with the TME suggest that the field cancerization process may differ between individual cancer subjects. Therefore, in addition to its potential utility as a biomarker for recurrence prediction, detailed characterization of field cancerization may offer insights into early events and tumor evolution, as illustrated in a small-scale sequencing analysis of paired tumor and normal tissue from individuals with prostate cancer.19
Most previous studies characterizing molecular alterations in morphologically normal BC tissues have been based on copy-number (CN) and transcriptomic analyses,16,20,21 while the clonal evolutionary trajectories of NATs are unclear. Thus, a comprehensive genomic and epigenomic analysis of paired germline, NAT (with the annotation of distance from the tumor), and tumor samples is needed. To address this gap, we conducted whole-genome sequencing (WGS), methylation profiling, and RNA sequencing (RNA-seq) analysis of paired germline, NAT, and tumor samples collected from individuals with BC in Hong Kong (HK). Our goal was to provide a detailed analysis of clonal expansion and evolutionary history in paired tumor and NAT tissue samples.
Subjects, material, and methods
HKBC study participants and biospecimens
The information on the study design and biospecimen collection included in this work has been previously described.22 Briefly, data and biospecimens were collected from individuals with BC diagnosed and treated in 2 HK hospitals between 2013 and 2016. Participants were included based on the following criteria: (1) female, (2) between 20 and 84 years old, (3) diagnosed with BC no more than 3 months prior to the recruitment interview and histologically confirmed (International Classification of Disease, Tenth Revision, code 50), and (4) of Chinese ethnicity and a resident of HK for at least 5 years. Individuals with pre-surgery treatment were excluded from the study. The study protocol was approved by the ethics committees of the Joint Chinese University of Hong Kong-New Territories East Cluster, the Kowloon West Cluster, and the National Cancer Institute (NCI). All subjects provided written informed consent prior to the surgery.
Paired fresh-frozen tumors, matched NATs, and blood/saliva samples were collected from these individuals. For each subject, we collected 1 piece of the tumor and 2 pieces of NAT, one peri-tumoral and one distant (>2 cm) from the tumor mass. To reduce the impact of tumor contamination, we used the distant NAT samples for molecular analyses. Frozen tumor and NAT samples were processed for pathology review at the Biospecimen Core Resource (BCR), Nationwide Children’s Hospital, using modified The Cancer Genome Atlas (TCGA) criteria.23 Specifically, only tumors with >50% tumor cells and NAT tissue with no detected tumor cells were included for dual DNA/RNA extraction. Germline DNA was extracted from matched blood or saliva samples as a reference for somatic mutation calling in both WGS and targeted panel sequencing analyses.
WGS and single-nucleotide variant calling
WGS was performed on 45 paired tumor, NAT, and germline DNA samples at the Broad Institute (https://www.broadinstitute.org) using the Illumina HiSeq X system following Illumina-provided protocols for 2 × 150-bp paired-end sequencing. The average sequencing depth was 103.5× for tumors and 36.6× for the paired germline and NAT samples. Somatic mutations were called separately for tumor vs. germline and NAT vs. germline using 4 different algorithms (MuTect,24 MuTect2 [GATK tool], Strelka,25 and TNScope by Sentieon) as previously described26 by Zhang et al. To increase the sensitivity of somatic mutation calling in NAT samples, we used an approach described in Stachler et al.27 to salvage somatic variants that were detected in tumor samples and were missed in the matched NAT samples due to lower sequencing depth. Two NAT samples did not pass the quality control (QC) and were therefore excluded from further analyses.
Targeted panel sequencing and single-nucleotide variant calling
A total of 514 samples from 216 HKBC individuals, including 70 trios (germline from blood/saliva, fresh-frozen NAT, and paired tumor) and 146 paired tumor and NATs without germline, were sequenced using a multiplex PCR primer panel targeting PIK3CA and TP53 designed using the custom Ion AmpliSeq Designer (ThermoFisher Scientific, USA). The primer panel covered a region totaling 5 kb, covered by 49 primer pairs. Sample DNA (30 ng) was amplified using this custom AmpliSeq panel, and libraries were prepared following the manufacturer’s Ion AmpliSeq Library Preparation protocol (ThermoFisher Scientific). Individual samples were barcoded during library preparation and subsequently pooled. Automated template preparation and chip loading were performed with the Ion Chef System and sequenced on the S5 Sequencer per manufacturer’s instructions. Raw-sequencing reads generated by the Ion Torrent sequencer were quality checked and adaptor trimmed by Ion Torrent Suite and then aligned to the hg19 reference sequence by Torrent Mapping Alignment Program (TMAP, https://github.com/iontorrent/TS/tree/master/Analysis/TMAP) using default parameters. The resulting binary alignment map (BAM) files were left-aligned using the Genome Analysis Toolkit (GATK) LeftAlignIndels module.
Germline variants were called by Variant Caller 5.0.2 and GATK (http://www.nature.com/ng/journal/v43/n5/full/ng.806.html, UnifiedGenotyper v.3.1). Tumor/NAT vs. germline paired calls were made using the Torrent Variant Caller (TVC) algorithm implemented in Ion Reporter (v.5.0.9). Variants were filtered out using the following criteria: p value > 5.0 × 10−6, flagged by the Confident Somatic Variants filter, allele frequency >0.001 in the gnomAD database, variant allele fraction <5% in tumor/NAT samples, and <3 reads for either reference or variant allele in any sample. Tumor/NAT only (without germline as reference) somatic variants were called using Torrent Variant Caller 5.0.2 and filtered by minimum allele fraction ≥0.02, minimum coverage ≥100, and a minimum variant score ≥6. Variants were annotated using snpEff28 and Annovar.29
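The exclusion criteria for tumor/NAT vs. germline paired calls can be sketched as a single predicate. This is a minimal illustration only, not the Ion Reporter implementation: the record type and all field names are hypothetical, and the read-count fields are assumed to hold the minimum across all samples of the pair.

```python
from dataclasses import dataclass


@dataclass
class PairedCall:
    """Hypothetical record for one tumor/NAT vs. germline variant call."""
    p_value: float
    flagged_by_confident_somatic_filter: bool
    gnomad_af: float        # population allele frequency in gnomAD
    vaf: float              # variant allele fraction in the tumor/NAT sample
    min_ref_reads: int      # fewest reference-allele reads in any sample
    min_alt_reads: int      # fewest variant-allele reads in any sample


def is_filtered_out(call: PairedCall) -> bool:
    """Return True if the call meets any exclusion criterion listed in the text."""
    return (
        call.p_value > 5.0e-6
        or call.flagged_by_confident_somatic_filter
        or call.gnomad_af > 0.001
        or call.vaf < 0.05
        or call.min_ref_reads < 3
        or call.min_alt_reads < 3
    )


kept = PairedCall(1e-9, False, 0.0, 0.12, 40, 9)
low_vaf = PairedCall(1e-9, False, 0.0, 0.03, 40, 9)
common = PairedCall(1e-9, False, 0.01, 0.12, 40, 9)
```

Here `kept` survives the filter, while `low_vaf` and `common` are removed for a low variant allele fraction and a high gnomAD frequency, respectively.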
Methylation profiling
Methylation profiling was performed on paired tumor and NAT samples from 196 individuals (including 42 individuals with WGS data included in this analysis) using the Infinium MethylationEPIC BeadChip (Illumina, San Diego, CA) at the Cancer Genomics Research Laboratory (CGR), NCI. Basic intensity QC was performed using the R package minfi.30 Raw methylated and unmethylated intensities were background corrected and dye-bias equalized to correct for technical variation in signal between arrays.
RNA-seq and PAM50 subtypes
RNA-seq data were generated in 239 tumors and 92 NAT samples using Illumina HiSeq4000. Gene expression was quantified as transcript per million (TPM), and log2TPM was used for statistical analyses. PAM50 subtype (luminal A, luminal B, HER2 enriched, basal like, and normal like) was defined by the absolute intrinsic molecular subtyping (AIMS) method31 using RNA-seq data. We compared the classification between PAM50 and clinical markers (ER, PR, HER2) and found 57.1%, 100.0%, 100.0%, and 50.0% concordance with luminal A, luminal B, HER2-enriched, and basal subtypes, respectively.
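For readers unfamiliar with the TPM scale, the following is a minimal sketch of the standard TPM calculation and a log2 transform. The quantification tool used in the study is not stated here, so this is not the study's pipeline; the function names are hypothetical, and the pseudocount of 1 in the log transform is an assumption made only to avoid taking the logarithm of zero.

```python
import math


def tpm(counts, lengths_bp):
    """Standard TPM: length-normalize read counts, then scale so values sum to 1e6."""
    rates = [c / l for c, l in zip(counts, lengths_bp)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]


def log2_tpm(values, pseudocount=1.0):
    """log2 transform; the pseudocount is an assumption, not taken from the text."""
    return [math.log2(v + pseudocount) for v in values]


# Two hypothetical genes: twice the reads on a transcript twice as long
# gives the same per-base rate, hence equal TPM.
example = tpm([10, 20], [1000, 2000])
```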
Detection of somatic structural variants
Somatic structural variants (SVs) were identified in paired samples (tumor vs. germline and NAT vs. germline) obtained from the same subject using a snakemake workflow called “MineSV” (https://github.com/NCI-CGR/MineSV). MineSV utilized 3 SV callers, namely GRIDSS32 (v.2.9.4), Manta33 (v.1.4.0), and SvABA34 (v.1.0.0), for SV prediction. Additionally, Meerkat35 (v.0.189) was employed to call somatic SVs, and its results were integrated with the outputs of the other SV callers within MineSV. We selected SV calls that were supported by at least 3 SV callers. In MineSV, SVs were categorized into 2 types based on whether both breakpoints of the SV resided on the same chromosome: intra-chromosomal SVs (intraSVs) and inter-chromosomal SVs (interSVs). After applying SV callers to call SVs in each sample pair, MineSV compared the SV calling from the 4 callers and generated 2 comparison output files for each sample pair: 1 for intraSVs and another for interSVs. IntraSVs exhibiting a reciprocal overlap of over 70% were considered “matched.” For interSVs, which have 2 breakpoints located on different chromosomes, MineSV added a 50-bp padding on both sides of each breakpoint. A padded breakpoint with at least 1-bp overlap was considered matched. InterSVs with both breakpoints matched were classified as matched.
To identify matched SVs of tumors and NATs, we employed the same approach as used in MineSV. For the ensemble intraSVs, we utilized the findOverlaps function from the R package GenomicRanges to identify matched SVs with at least 70% reciprocal overlaps. As for the ensemble interSVs, we matched the padded breakpoint pairs using the findOverlaps function from the R package InteractionSet, requiring a minimum of 1 bp overlap.
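The two matching rules described above (reciprocal overlap for intraSVs, padded breakpoints for interSVs) can be illustrated with a short sketch. This is not the MineSV or GenomicRanges/InteractionSet code: the types and function names are hypothetical, coordinates are assumed to be 1-based and inclusive, the 70% boundary is treated as inclusive, and the two breakpoints of each interSV are assumed to be listed in the same order in both calls.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive (assumed convention)
    end: int    # 1-based, inclusive


def overlap_bp(a: Interval, b: Interval) -> int:
    """Number of bases shared by two intervals (0 if on different chromosomes)."""
    if a.chrom != b.chrom:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def intra_sv_matched(a: Interval, b: Interval, min_frac: float = 0.7) -> bool:
    """Reciprocal overlap: the shared span must cover min_frac of BOTH intervals."""
    shared = overlap_bp(a, b)
    len_a = a.end - a.start + 1
    len_b = b.end - b.start + 1
    return shared >= min_frac * len_a and shared >= min_frac * len_b


def pad(chrom: str, pos: int, padding: int = 50) -> Interval:
    """Extend a breakpoint by `padding` bp on both sides."""
    return Interval(chrom, pos - padding, pos + padding)


def inter_sv_matched(sv_a, sv_b, padding: int = 50) -> bool:
    """Each SV is ((chrom1, pos1), (chrom2, pos2)); both padded breakpoints must overlap by >= 1 bp."""
    return all(
        overlap_bp(pad(*bp_a, padding), pad(*bp_b, padding)) >= 1
        for bp_a, bp_b in zip(sv_a, sv_b)
    )
```

For example, a 1,000-bp deletion and a 900-bp deletion sharing 800 bp are matched (80% and 89% of their lengths), whereas two 1,000-bp events sharing only 400 bp are not.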
Battenberg CN aberration profiles and subclonal clustering QC
Evaluation of Battenberg CN aberration (CNA) profiles and DPclust clonal/subclonal clustering were assessed using the copyNumberAndSubclonalClusteringQC tool (https://github.com/Wedge-lab/copyNumberAndSubclonalClusteringQC). The evaluation of Battenberg CNAs and DPclust calls was based on several factors, including purity, ploidy, whole-genome duplication, loss of heterozygosity (LOH), clonal/subclonal cluster cancer cell fraction (CCF) value, and the number of mutations within each cluster. For samples that did not meet the QC criteria, up to 3 attempts were made to refit them using the suggested purity and ploidy values provided by the copyNumberAndSubclonalClusteringQC tool.
Signature extraction
Mutational signature analysis was conducted using the SigProfiler suite of tools. De novo extraction of single-base substitution (SBS), doublet-base substitution (DBS), and insertion-deletion (ID) mutational signatures was performed using SigProfilerExtractor (v.0.0.32).36 All signature extraction runs were performed with the following parameters: random nmf_init, 100 nmf_replicates, 10,000 min_nmf_iterations, and 1,000,000 max_nmf_iterations. For all signature types, we considered the presence of 1–20 signatures. The SBS, DBS, and ID de novo-extracted mutational signatures were then decomposed into COSMIC reference signatures (v.3.4).37 The default SigProfilerExtractor parameters were used for assigning de novo or COSMIC signatures to a sample and are as follows: 0.02 de_novo_fit_penalty, which sets a weak threshold cutoff for assigning de novo signatures to a sample; 0.05 nnls_add_penalty, which establishes a strong threshold for assigning COSMIC signatures; 0.01 nnls_remove_penalty, defining a weak threshold cutoff for assigning COSMIC signatures; and 0.05 initial_remove_penalty, which sets an initial weak threshold cutoff for assigning COSMIC signatures.
Due to limited or absent CNs and SVs, data size did not permit de novo signature extraction for CNs and SVs. Consequently, CN and SV calls for the tumor samples were directly assigned to the COSMIC reference signatures (v.3.4). A comprehensive list of all the extracted signatures and the proposed signature etiologies can be found in Table S1.
Varlap (https://github.com/bjpop/varlap) was employed to conduct QC on all indel variant reads used in extracting the ID signatures. The mean Varlap QC values of mutations within each ID signature were calculated to ascertain whether the novel ID83D signature was the product of a genuine biological process or of sequencing artifacts. The Varlap values for the signatures were closely correlated, and no signature diverged from the others.
Phylogenetic tree construction
Clonal and subclonal cluster evolution trees were generated following the methodology described previously.38 Using the Battenberg CN calls and the single-nucleotide variant (SNV) calls from both NAT and tumor samples, mutations were clustered into groups with the multiDPClust algorithm. Clusters that contained fewer than 1% of the mutations identified in the NAT and tumor sample pair or had a CCF value smaller than 0.0055 were excluded from the analysis. The remaining clusters were organized into a tree structure based on their CCF values. Clonal clusters, characterized by a CCF value of approximately 1, were positioned at the top of the tree, followed by subclonal clusters in descending order of CCF values. If the sum of the CCF values of subclonal clusters was equal to or lower than the CCF value of the immediate ancestral node, the subclones were added to the tree as branching nodes; otherwise, the subclones were placed as linear branches on the tree.
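The cluster filter and the branching-versus-linear rule can be sketched for a single sample as follows. This is a simplified illustration, not the published tree-building method: the names are hypothetical, the total mutation count is assumed to be the sum over the supplied clusters, and a small numerical tolerance is added for floating-point sums.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    name: str
    ccf: float          # cancer cell fraction of the cluster
    n_mutations: int    # mutations assigned to the cluster


def filter_clusters(clusters, min_fraction=0.01, min_ccf=0.0055):
    """Drop clusters with <1% of all mutations or CCF < 0.0055; sort by descending CCF."""
    total = sum(c.n_mutations for c in clusters)
    kept = [
        c for c in clusters
        if c.n_mutations >= min_fraction * total and c.ccf >= min_ccf
    ]
    return sorted(kept, key=lambda c: c.ccf, reverse=True)


def placement(parent_ccf, child_ccfs, tol=1e-9):
    """Children whose CCFs fit within the parent can branch; otherwise they are placed linearly."""
    return "branching" if sum(child_ccfs) <= parent_ccf + tol else "linear"


clusters = [
    Cluster("A", 1.0, 500),
    Cluster("B", 0.4, 300),
    Cluster("C", 0.2, 5),     # fewer than 1% of the 810 mutations
    Cluster("D", 0.004, 5),   # CCF below 0.0055
]
kept_names = [c.name for c in filter_clusters(clusters)]
```

With these inputs only clusters A and B are retained. Two subclones with CCFs of 0.6 and 0.3 can sit side by side under a clonal parent (sum 0.9), whereas subclones of 0.7 and 0.5 cannot (sum 1.2) and are placed linearly.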
Telomere length estimate
Telomere length (TL) was estimated by TelSeq.39 For a read to be considered telomeric, a default threshold of 7 telomere TTAGGG/CCCTAA repeats was required. TL was determined by the total number of telomere repeats in each tumor or NAT sample.
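The read-level threshold can be illustrated with a few lines of code. This sketch only shows the repeat-count criterion; it is not TelSeq and omits its normalization and length-scaling steps. The function names are hypothetical, and counting each motif separately and taking the larger count is an assumption.

```python
TELOMERE_MOTIFS = ("TTAGGG", "CCCTAA")


def telomere_repeat_count(read: str) -> int:
    """Largest number of non-overlapping occurrences of either telomere motif in the read."""
    seq = read.upper()
    return max(seq.count(motif) for motif in TELOMERE_MOTIFS)


def is_telomeric(read: str, min_repeats: int = 7) -> bool:
    """Apply the threshold of 7 repeats described in the text."""
    return telomere_repeat_count(read) >= min_repeats
```

For instance, a read consisting of seven TTAGGG units passes the threshold, while a read with six units followed by unrelated sequence does not.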
Immune-cell deconvolution based on DNA methylation data
For cell deconvolution, we used EpiDISH40 and MethylCIBERSORT41 (downloaded from https://www.bioconductor.org/packages/release/bioc/html/EpiDISH.html and https://zenodo.org/record/1284582#.ZFkCB3bMKUm, respectively). While MethylCIBERSORT primarily focuses on stromal and immune cell subpopulations, EpiDISH provides estimates for a broader range of cell types within a bulk tissue sample, encompassing both immune and non-immune cells. The breast tissue-specific methylation signature matrices available in these packages were used. For MethylCIBERSORT, we additionally ran CIBERSORTx (https://cibersortx.stanford.edu/) in relative mode on our DNA methylation data with 1,000 permutations and without quantile normalization.
Analysis of differentially methylated regions
DNA methylation beta values for each probe, ordered by chromosomal position, were used to detect differentially methylated regions (DMRs). We detected DMRs that were differentially methylated in samples from 1 tree group compared to samples from the other 2 groups using the bumphunter function implemented in the Bioconductor minfi package30 (v.1.44.0). Bumphunter generates regions by combining all probes within a pairwise distance threshold, so that probes close to each other are assigned to the same region. The default cutoff of 500 bp was used in this study. Bumphunter fits a linear model for each probe to predict methylation beta values from tree categories (1 versus the other 2), adjusting for age and batch effect, and it smooths the effect-size estimates of all probes within a region. Candidate DMRs are defined as the collection of smoothed effect sizes for any region with an effect size above a user-defined threshold. We used a threshold of 0.1, calling a DMR if the adjusted mean beta value difference between tree groups (1 versus the other 2) in a region was greater than 0.1. We performed 100 bootstrap replications to evaluate the statistical significance of each candidate DMR and adjusted for multiple testing by comparing the effect size for the observed DMR to those from all the bootstrap DMRs.
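The region-building idea can be sketched in two steps: group probes whose neighbors lie within 500 bp, then report runs of probes whose effect size exceeds the 0.1 cutoff in a consistent direction. This is a simplified illustration and not the bumphunter implementation: it takes per-probe effect sizes as given (no regression, smoothing, or bootstrap), handles a single chromosome with sorted positions, and uses hypothetical function names.

```python
def cluster_probes(positions, max_gap=500):
    """Group sorted probe positions; a gap larger than max_gap starts a new cluster.

    Returns a list of clusters, each a list of indices into `positions`.
    """
    clusters, current = [], []
    for i, pos in enumerate(positions):
        if current and pos - positions[current[-1]] > max_gap:
            clusters.append(current)
            current = []
        current.append(i)
    if current:
        clusters.append(current)
    return clusters


def candidate_regions(positions, effects, cutoff=0.1, max_gap=500):
    """Return (start, end) of runs of probes beyond +/-cutoff, same sign, within one cluster."""
    regions = []
    for cluster in cluster_probes(positions, max_gap):
        run, sign = [], 0
        for i in cluster:
            s = 1 if effects[i] > cutoff else -1 if effects[i] < -cutoff else 0
            if s != sign and run:
                # Direction changed or dropped below the cutoff: close the current run.
                regions.append((positions[run[0]], positions[run[-1]]))
                run = []
            if s != 0:
                run.append(i)
            sign = s
        if run:
            regions.append((positions[run[0]], positions[run[-1]]))
    return regions


probe_pos = [100, 400, 900, 2000, 2300]
probe_effect = [0.15, 0.20, 0.05, -0.30, -0.12]
```

With these hypothetical probes, the first three positions form one cluster and the last two another; the candidate regions are 100–400 (hypermethylated) and 2000–2300 (hypomethylated).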
Statistical analyses
The Wilcoxon signed-rank test was used to evaluate differences in the median of each genomic feature across phylogenetic tree groups. Fisher’s exact test was used to compare categorical genomic and clinical characteristics between the 3 tree groups. Multivariable regression analysis was used to assess the differences in these genomic features between the 3 phylogenetic tree groups, with adjustment for age at diagnosis. All statistical tests in the present study were two-sided and performed using SAS v.9.4 (SAS Institute, Cary, NC, USA) or R v.3.6.3 (R Foundation for Statistical Computing, Vienna, Austria).
Results
Subjects and samples
This study included 43 treatment-naive individuals with BC from HK, with WGS data available for matched trios of samples comprising germline DNA (derived from saliva or blood), fresh-frozen, macroscopically normal NATs, and tumor samples for each individual (Figure 1A). Notably, most NAT samples were taken >2 cm away from the tumor and were examined by 2 independent pathologists and confirmed to be histologically normal. This analysis focused on somatic mutations that were present in NAT or tumor samples but absent in blood/saliva; consequently, germline variants are not described here. RNA-seq and methylation profiling data are available for these individuals (Figure 1B). The mean age at diagnosis was 59.2 years (SD = 10.8, range = [31.0, 78.0]) and the mean body mass index (BMI) was 24.2 kg/m2 (SD = 4.3, range = [14.5, 39.0]). The distribution of molecular subtypes was 43.9% luminal A, 26.8% luminal B, 14.6% HER2 enriched, 12.2% basal like, and 2.5% normal like based on PAM50. The distribution of other patient characteristics is shown in Table S2.
Figure 1.
Genomic landscape of breast tumor and paired normal adjacent to tumor samples from 43 individuals with breast cancer in the Hong Kong Breast Cancer Study
(A) Study design overview. This figure illustrates the study’s workflow, where paired germline, NAT, and tumor samples were collected for comprehensive multi-omics profiling. This included whole-genome sequencing (WGS), RNA sequencing (RNA-seq), and methylation profiling. Additionally, tissue scans underwent review by pathologists to validate the classification of tumor and normal tissues.
(B) Number of samples included for different omics platforms.
(C) Stacked bar plot displays the frequencies of mutated breast cancer driver genes in tumor and NAT samples among 43 individuals, with colors representing the functional impact of the mutations.
(D) The oncoplot visualizes the prevalence of point mutations within breast cancer driver genes across tumor and NAT samples. The top details mutation frequencies across the samples. The bottom bars categorize the samples according to PAM50 tumor subtypes and phylogenetic tree groups, respectively. Colored blocks overlaid on the plot denote cases where a sample exhibits two different mutations within the same gene.
Comparison of genomic and epigenomic characteristics between paired NAT and tumor samples
We compared the genomic characteristics between paired tumors and NAT samples. First, we investigated the driver gene profiles in tumor and NAT samples separately (Figure 1C) using the driver gene list described by Martínez-Jiménez et al.42 We found that 93% and 73% of the nonsynonymous mutations in these driver genes of NAT and tumor samples, respectively, were predicted to be oncogenic by Cancer Genome Interpreter.43 As expected, tumor samples had more mutations in a much higher number of driver genes, indicating a higher level of genetic alteration (Figures 1C and 1D). The top 4 recurrently mutated genes in tumor samples were PIK3CA (MIM: 171834), TP53 (MIM: 191170), GATA3 (MIM: 131320), and AKT1 (MIM: 164730), which were also observed in NAT samples. Interestingly, we found SNVs in driver genes in about one-third of NAT samples (14/43 = 32.6%); 12 of these SNVs were also observed in paired tumor samples, with all but one predicted to be oncogenic.43 Two samples showed a mutation of c.1633G>A (GenBank: NM_006218.4 [p.Glu545Lys]) and c.3140A>G (GenBank: NM_006218.4 [p.His1047Arg]), respectively, of PIK3CA exclusively in the NAT tissue (Figure 1D), both of which were predicted to be oncogenic.43 Indeed, the most recurrently mutated gene in NAT samples was PIK3CA, with mutations observed in 6 samples, including the 2 hotspot mutations (c.3140A>G and c.1633G>A, each observed in 3 NAT samples; Figure S1). To follow up on these findings, we conducted deep targeted sequencing of TP53 and PIK3CA on paired NAT and tumor frozen tissues in a larger set of HKBC individuals (n = 214, including 23 individuals with WGS data). We identified nonsynonymous mutations of TP53 and PIK3CA in NAT samples of 6.5% and 12.1% of individuals, respectively (Figure S2). While 80% of TP53 mutations found in NAT samples were also observed in the paired tumor samples, 2 of 6 mutations in PIK3CA detected in NAT samples were not present in paired tumor samples with targeted sequencing. 
For example, the c.3140A>G mutation in PIK3CA was found in 14 (6.5%) NAT and 35 (16.4%) tumor samples. In 8 of the 14 individuals whose NAT sample carried c.3140A>G, the mutation was not detected in the paired tumor sample. At the same time, the occurrence of additional driver mutations in the majority of PIK3CA mutant tumor samples implies that PIK3CA mutations may not be sufficient to drive tumorigenesis on their own.
As expected, we observed that the median tumor mutational burden (TMB) was significantly higher in tumor (TMB = 1.40 mutations/Mb) than in NAT samples (TMB = 0.16 mutations/Mb, Wilcoxon test: p = 4.5 × 10−13; Figure 2A). We estimated TL and found that tumor samples on average had shorter TLs compared to NAT samples (tumor TL = 3.20 kb, NAT TL = 3.94 kb, Wilcoxon test: p = 8.1 × 10−6, Figure 2B), with a few exceptions. Three individuals (HKB1005, HKB1305, HKB2067) showed markedly longer TLs in tumors than in paired NAT samples (Figure 2B). Two of these tumor samples with elongated TLs had higher expression of TERT (MIM: 187270), which encodes the catalytic subunit of telomerase and plays a key role in maintaining TLs, than the great majority of examined tumor samples (Figure S3). This may explain the longer TLs in these tumor samples. Notably, one of them also had the lowest methylation level in the TERT promoter region among all tumor samples (Figure S3).
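As a worked illustration of the TMB unit used here (SNVs per megabase of genome), the calculation reduces to a single division. The function name is hypothetical, and the genome size of 3,000 Mb in the example is an illustrative assumption, since the denominator used in the study is not stated in the text.

```python
def tumor_mutational_burden(n_snvs: int, genome_size_mb: float) -> float:
    """TMB in mutations/Mb: synonymous plus nonsynonymous SNVs divided by genome size."""
    if genome_size_mb <= 0:
        raise ValueError("genome_size_mb must be positive")
    return n_snvs / genome_size_mb


# Hypothetical sample: 4,200 SNVs over an assumed 3,000-Mb genome.
example_tmb = tumor_mutational_burden(4200, 3000.0)
```

Under that assumed genome size, a TMB of 1.40 mutations/Mb corresponds to roughly 4,200 SNVs genome-wide.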
Figure 2.
Genomic and epigenomic comparison between paired breast tumor and normal adjacent to tumor samples in 43 Hong Kong individuals with breast cancer
(A) Tumor mutational burden (TMB), illustrated as the number of mutations per megabase (Mb) on a logarithmic scale. TMB is calculated as the ratio of the number of synonymous and nonsynonymous single-nucleotide variants detected by whole-genome sequencing (WGS) to the size of the whole genome.
(B) Telomere length (TL) estimated from the WGS data and measured in kilobases (kb).
(C) Proportions of COSMIC single-base substitution (SBS) signatures in tumor and NAT samples.
(D) Proportions of COSMIC doublet-base substitution (DBS) signatures in tumor samples.
(E) Proportions of COSMIC insertion-deletion (ID) signatures in tumor samples.
(F) Proportions of relative scores of immune cells estimated by MethylCIBERSORT. Significance levels are denoted as follows: ∗∗∗p < 0.001, ∗∗p < 0.01, ∗p < 0.05.
We analyzed SBS mutational signatures in both NAT and tumor samples (Figure 2C). Signature SBS5 manifested in 95% of the samples, emerging as the predominant signature in 69% and 28% of the NAT and tumor samples, respectively. SBS40a stood out as the predominant signature in all 7 NAT samples in which it was identified and was the predominant signature in 17 out of the 30 tumor samples it was identified in. SBS3 emerged as the predominant signature in all 4 NAT samples containing this signature and in 6 out of the 9 tumor samples containing this signature. Apolipoprotein B mRNA-editing enzyme catalytic polypeptide (APOBEC) signatures SBS2 and SBS13 were identified in a number of tumor samples but were rarely seen in the NAT samples (SBS2: n = 1 in NAT, n = 16 in tumor, and dominant in 25% of these samples; SBS13: n = 2 in NAT, n = 16 in tumor, and dominant in 25% of these samples; Figure S4A).
Most NAT samples showed no doublet-base substitution (DBS) mutations, with only 15 displaying any DBS activity (Figure S4B). In contrast, tumor samples exhibited a wider range of DBS activity, showcasing a variety of DBS signatures (Figure 2D). DBS4 and DBS2 were identified in 25 and 24 samples, respectively, followed by signature DBS13, which was detected in 14 samples. However, DBS13 was the dominant signature in most of the samples in which it was detected (93%), while DBS4 was dominant in 56% and DBS2 in 25% of the samples in which they were detected.
The majority of NAT samples displayed very low ID mutation activity, with 8 samples lacking any ID mutations (Figure S4C). In contrast, tumor samples exhibited a range of ID mutational signatures including ID1, ID2, ID6, ID8, ID9, ID23, and a novel signature ID83D (Figure 2E). ID23 showed the highest dominance rate at 63% in 16 samples, with ID1 dominating in 35% of the 37 samples in which it was detected and ID2 being the dominant signature in 31% of the 33 samples where it was identified. The ID83D signature is characterized by 1-bp insertions (Figure S4D). We employed the Varlap tool to evaluate the reads that exhibited indel mutations associated with the ID83D signature and compared them to mutations linked to other known signatures (Table S3). Our findings indicate that mutations associated with ID83D in the samples exhibiting the signature were of similar quality to mutations associated with the rest of the signatures. This suggests that the novel signature likely represents a genuine biological signal rather than an artifact of mutation calling. The distinct mutational profiles of NAT and tumor samples, across all mutational signature groups, highlight the unique characteristics and diversity within each sample group.
Deconvolution of methylation data can offer insights into the cellular composition of tumors and NAT tissues, providing additional information about how the TME shapes tumor evolution. We used MethylCIBERSORT analysis41 to estimate cellular composition in tumor and NAT samples separately. When comparing stromal cell populations that were differentially represented between tumor and NAT samples, we found that fibroblasts (Wilcoxon test, p = 1.4 × 10−12) and endothelial cells (p = 0.0002) were more abundant in NAT than in tumor samples, while CD8+ lymphocytes (p = 7.4 × 10−7), Tregs (p = 4.5 × 10−13), and eosinophils (p = 0.0004) were more abundant in tumor samples (Figure 2F). NAT samples further away from tumors were more likely to show increased fibroblast proportions than those closer to tumors (p = 0.007), while scores for other cell populations in NAT samples did not vary significantly according to their distances from the paired tumor samples (Figure S5). These results underscore the immune-evasion hallmark of cancer and highlight the role of stroma-enriched NATs in fostering a tumor-supportive and tumor-promoting microenvironment.
We then used Battenberg to call somatic CNAs (SCNAs) in tumor and NAT samples. In contrast to tumor tissue samples, most NAT samples had very few SCNAs. Specifically, tumor and NAT samples were clearly clustered separately based on unsupervised clustering of SCNA patterns (Figure 3A). Only 4 NAT samples clustered together with tumor samples, with 2 (HKB2065 and HKB2068) displaying numerous large or small SCNAs across the genome and another 2 having fewer SCNAs, many of which are known to be recurrent (e.g., 1q gain) in BC. The extreme contrast in SCNA prevalence in NAT and tumor samples suggests that SCNAs do not benefit normal cells (and may even be selected against) but are a necessary prerequisite for tumorigenesis and represent a hallmark of progression from normal to tumor biology.
Figure 3.
Analysis of somatic copy-number alterations and structural variants in paired breast tumor and normal adjacent to tumor samples in 43 Hong Kong individuals with breast cancer
(A) The somatic copy-number alteration (SCNA) profiles for both tumor and NAT samples along chromosomes. The relative copy-number profile is determined by subtracting the ploidy (nWGD = 2, WGD = 4) from the total copy number. Annotations along the rows include tumor purity, WGD status, PAM50 classification, and sample type, detailed on the right side. The top highlights SCNA frequencies, categorizing them into amplifications, deletions, and copy-neutral loss of heterozygosity (LOH), represented by a black line. WGD indicates samples with whole-genome doubling, while nWGD denotes those without.
(B) The frequency of SCNA signatures in tumor samples, with corresponding frequencies in NAT samples detailed in Figure S6.
(C) Number and type of SV events observed in NAT and tumor samples, respectively.
(D) Circos plot of SV events in NAT samples.
(E) Frequency and type of SV signatures in tumor samples.
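The relative copy-number used in Figure 3A (total copy number minus a ploidy baseline of 2 for nWGD or 4 for WGD samples) can be sketched as follows. The function names and the segment classification rule are assumptions for illustration only; in particular, treating any positive or negative relative value as an amplification or deletion is a simplification and may not match the thresholds used for the figure.

```python
def relative_copy_number(total_cn, wgd):
    """Total copy number minus the ploidy baseline (2 if nWGD, 4 if WGD)."""
    baseline = 4 if wgd else 2
    return total_cn - baseline

def classify_segment(total_cn, minor_cn, wgd):
    """Assumed labelling of a segment from its relative copy number.

    `minor_cn` is the copy number of the less abundant allele; a value of 0
    at baseline total copy number is labelled copy-neutral LOH.
    """
    rel = relative_copy_number(total_cn, wgd)
    if rel > 0:
        return "amplification"
    if rel < 0:
        return "deletion"
    if minor_cn == 0:
        return "copy-neutral LOH"
    return "neutral"
```

For example, a segment with 3 total copies is labelled as an amplification in a sample without whole-genome doubling but as a deletion in a sample with whole-genome doubling, because the baseline shifts from 2 to 4.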
We further examined the CN signatures using SigProfiler36 in both NAT and tumor samples. As anticipated, the majority of NAT samples displayed the SCNA signature CN1, which is indicative of a diploid genome (Figure S6A). However, 3 NAT samples (HKB2065, HKB1154, and HKB1305) displayed CN2, which is indicative of a tetraploid genome. HKB2065 also had CN7, a signature associated with amplified chromothripsis (Figure S6B). Another 3 NAT samples (HKB2068, HKB2042, and HKB2067) showed CN13 (Figure S6A), which is indicative of chromosomal-scale LOH. In contrast, tumor samples exhibited a larger variety of signatures (Figure 3B), many of which were not present in NAT samples, such as CN6, indicative of chromothripsis/amplicon events followed by genome doubling, and CN17, which is associated with homologous recombination deficiency (HRD).
Similarly, SVs were predominantly observed in tumors and were much less frequent in NATs (Figure 3C). Only 7 of the 43 NAT samples harbored SV events, with up to 50 events observed per sample (Figures 3C and 3D). Consistently, the 6 NAT samples showing CN2, CN7, and CN13 signatures also displayed SVs (Figure 3C). We then used SigProfiler to deconvolve SV signatures, as described previously.36 No reliable SV signatures were extracted in NATs due to very few SV events, while signatures of clustered rearrangements (SV6) and clustered translocations (SV4) were prevalent in tumor samples (Figure 3E). These findings suggest that SVs are a hallmark of tumor progression and not a feature of NATs.
Phylogenetic trees
Both cancers and non-tumor lesions result from the processes of selection and drift, acting on cells that have acquired somatic mutations through endogenous and exogenous mutational processes. The evolutionary journey from normal to cancer cells may be represented using phylogenetic trees, in which ancestral cell populations occur on the trunk of a tree and descendant cell populations occur on branches or leaves. Notably, if tumor and NAT samples share a common genetic ancestor, they will be represented by a single tree, whereas if tumor and NAT samples have independent genetic origins, it is necessary to represent them as 2 separate trees.
We used multi-dpclust44 to identify clonal and subclonal clusters of SNVs in paired tumor and NAT samples of each individual. The clone/subclone clusters’ CCF values obtained from each individual were used to generate phylogenetic trees using the sum and crossing rules method.45 We identified a total of 8 phylogenetic tree patterns based on tree structures in paired tumor and NAT samples (Figure 4), which fell into 3 broad groups. The first phylogenetic tree structure (tumor only, N = 7 individuals) was only identified in tumor samples, with no clonal or subclonal clusters identified in NAT samples. The second group (shared tree, N = 16 individuals) included 3 individual tree patterns that were characterized by tumor and NAT samples sharing at least part of a single-tree structure. The last group (multiple trees, N = 20 individuals) encompassed 4 individual tree patterns characterized by distinct tree structures in tumor and NAT samples, with varying tree profile structures such as (1) individual tree structures in tumor and NAT samples; (2) tumor and NAT samples sharing 1 tree structure and the second tree structure found in either tumor or NAT samples; and (3) 2 independent trees, both found in NAT and tumor samples. NAT samples in the shared-tree group were more likely to have been sampled further away from the tumor (> 4 cm) than NAT samples from the other tree groups, indicating that the genetic relationship between tumor and NAT samples cannot be explained simply by physical distance.
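The two constraints named above can be illustrated with a minimal sketch. It assumes the usual formulation of the rules: a descendant cluster cannot have a higher CCF than its ancestor in any sample (so clusters whose CCF ordering "crosses" between samples must sit on separate branches), and the CCFs of sibling clusters cannot sum to more than their parent's CCF in any sample. The function names, the presence threshold `PRESENT`, and the grouping heuristic in `tree_group` are hypothetical simplifications; this is not the multi-dpclust implementation or the authors' tree-building code.

```python
PRESENT = 0.05  # hypothetical CCF threshold for calling a cluster present

def may_descend(child, parent, ccf):
    """Crossing rule: child CCF must not exceed parent CCF in any sample."""
    return all(c <= p for c, p in zip(ccf[child], ccf[parent]))

def satisfies_sum_rule(parent, children, ccf):
    """Sum rule: sibling CCFs must not sum above the parent CCF in any sample."""
    n_samples = len(ccf[parent])
    return all(
        sum(ccf[child][i] for child in children) <= ccf[parent][i]
        for i in range(n_samples)
    )

def tree_group(ccf):
    """Assign a broad tree group from {cluster: (ccf_in_nat, ccf_in_tumor)}.

    Heuristic: with no cluster present in NAT the pair is "tumor only";
    if one cluster could be the ancestor of all others, a single shared tree
    is possible; otherwise at least two independent trees are required.
    """
    if all(nat < PRESENT for nat, _ in ccf.values()):
        return "tumor only"
    trunks = [
        a for a in ccf
        if all(may_descend(b, a, ccf) for b in ccf if b != a)
    ]
    return "shared tree" if trunks else "multiple trees"

# Hypothetical CCF values (NAT, tumor) for three individuals.
shared = {"A": (0.3, 1.0), "B": (0.0, 0.4)}
multiple = {"T": (0.0, 1.0), "N": (0.2, 0.0)}
tumor_only = {"T": (0.0, 1.0), "S": (0.0, 0.3)}
```

In the `multiple` example, cluster T dominates in the tumor while cluster N dominates in the NAT sample, so neither can be an ancestor of the other and two separate trees are needed. A full reconstruction would also apply the sum rule when placing sibling subclones, which the grouping heuristic here omits.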
Figure 4.
Representative phylogenetic structures identified in NAT-tumor pair samples
Different colored lines are used to signify the presence of clones or subclones in paired NAT and tumor samples. The color scheme includes green for shared clones, orange for NAT-exclusive clones, and blue for tumor-exclusive clone or subclone membership. Circular plots below each tree visually represent the clonal and subclonal structures, along with the cancer cell fraction (CCF) values within each cluster. The first row within each oval plot provides an overview of the clone and subclone clusters’ overall structure. Subclonal clusters superimposed on one another indicate clonality membership, with the larger clusters representing subclones enveloping their nested counterparts. Subclones that are adjacent but not superimposed signify divergent subclones. The size of the clone or subclone ovals reflects the CCF value for each corresponding cluster.
Membership of different tree groups has interesting implications for tumorigenesis and early tumor growth. In the shared-tree tumors, either a clonal expansion has occurred in morphologically normal tissue, followed by transformation to cancer of a cell, or cluster of cells, within the clonal expansion, or tumor cells have migrated from the tumor site to surrounding normal tissue. In the multiple-tree group, we observed multiple independent clonal expansions, consistent with the presence of non-genetic field cancerization. In the majority of these cases, separate clonal expansions were observed in tumor and normal tissue regions, indicating that, in these individuals, not all clonal expansions lead to cancer.
Genomic, epigenomic, and histologic factors associated with phylogenetic tree groups
We then investigated the associations between phylogenetic tree groups and genomic/epigenomic features in NAT and tumor tissue samples. Because our goal was to understand mutations in NAT samples in relation to tumor evolution, we focused on the comparisons between the 2 broad tree groups that had mutations in both tumors and NAT samples, i.e., shared and multiple-tree groups.
In NAT samples, we found that the shared-tree group showed, on average, a higher number of mutations than the multiple-tree group (Figure 5A), which is expected given that the NAT samples in the shared-tree group shared mutations with tumor samples. On the other hand, after adjustment for age, NAT samples in the multiple-tree group were more likely to have shorter TLs (p = 0.037, Figure 5B), a typical characteristic of cancer, than those in the shared-tree group. Because NAT samples in the multiple-tree group shared fewer mutations with the matched tumor samples than those in the shared-tree group, this observation is unlikely to be explained by the presence of tumor cells within the NAT samples. Furthermore, the multiple-tree group had a lower frequency of NAT samples that were close to the tumor (< 2 cm) (Figure S7). This is anticipated, as the migration of tumor cells to distant normal tissues is inherently challenging. Using the cellular composition determined by EpiDish,40 which was designed to estimate the proportions of various cell types based on DNA methylation data within bulk tissue samples, we found that NAT samples in the multiple-tree group had significantly higher epithelial content and lower fat content compared to the other 2 tree groups (Figure S8). In contrast, cell composition did not significantly differ by tree group in the tumor samples.
Figure 5.
Association of genomic and epigenomic characteristics with broad phylogenetic tree groups in NAT samples
(A) Tumor mutation burden.
(B) Telomere length (TL), measured in kilobases (kb).
(C) Frequency of CD14+ cell fractions estimated by MethylCIBERSORT.
(D) Promoter methylation of probes in REC8.
(E) REC8 expression levels in NAT samples. Significance levels are denoted as follows: ∗∗∗p < 0.001, ∗∗p < 0.01, ∗p < 0.05.
We assembled H&E images of NAT samples that were used for DNA/RNA extraction to examine whether there were morphological differences across different tree groups (Figure S9). Dense fibrous stroma was observed in a subset of NAT samples across all tree groups. Fibrocystic changes seemed to be more common in the shared-tree group. Interestingly, ductal hyperplasia was exclusively found in NAT samples from the multiple-tree group, which aligns with the observed higher epithelial content and lower fat content in this group.
We then used MethylCIBERSORT41 to further investigate stromal and immune cell subpopulations in relation to tree groups. MethylCIBERSORT results showed a lower proportion of the CD14 cell population (p = 0.0003) and a higher proportion of Tregs (p = 0.033) in NAT samples in the multiple-tree group than in the shared-tree group (Figure 5C). CD14 is a surface marker of monocytes and macrophages that induces pro-inflammatory responses to invading pathogens via the Toll-like receptor 4 signaling pathway.46 Consistent with the literature,47 we found that CD14 expression showed high correlations with inflammatory mediators (such as IL-8, IL-6, and CCL3) and macrophage markers (CD68 and CD163) in both tumor and NAT samples (Figure S10). Further, in a subset of NAT samples (n = 16) with RNA-seq and WGS data available, we found that NAT samples in the multiple-tree group tended to have lower expression levels of CD163, IL-8, IL-6, and CCL3 than those in the shared-tree group, although the differences were not statistically significant, possibly due to the small sample size (Figure S11). These findings suggest that NAT samples in the multiple-tree group may be associated with a less inflammatory TME compared to those in the shared-tree group. Other cell populations in NAT samples did not vary by tree group (Figure S12). Similarly, none of the cell population scores varied by tree group in tumor samples (Figure S13).
We then conducted an agnostic analysis to identify DMRs across the 3 tree groups in NAT samples. After adjusting for age and batch effect, we found 3 significant DMRs (false discovery rate [FDR] < 0.05) covering the genes REC8 (MIM: 608193), DNM3 (MIM: 611445), and HPN (MIM: 142440) (Table S4). In particular, the top DMR contained probes in REC8, in which all 10 probes in the promoter region showed significantly increased methylation in the multiple-tree group compared to the other 2 tree groups (Figure 5D). A similar pattern was also observed in tumor samples, but the differences across tree groups were not statistically significant after adjusting for multiple comparisons (Figure S14A). In a subset of NAT samples with RNA-seq data available, we observed lower expression of REC8 in the multiple-tree group than in the shared-tree group, albeit without reaching statistical significance (p = 0.28, Figure 5E; Table S5). REC8 belongs to the cohesin protein complex, which is essential for correct chromosome disjunction and homologous recombination in cell-cycle regulation and has been shown to be epigenetically regulated in cancer.48,49 Consistent with its role as a potential tumor suppressor gene, we observed lower methylation levels for all 10 probes in the promoter region of REC8 in NAT than in tumor samples (Figure S14B). Similarly, DNM3 is also a tumor suppressor gene, and probes in the DMR region of this gene showed increased methylation in the multiple-tree group as compared to the shared-tree group in NAT samples (Figure S15A). On the other hand, HPN, which encodes Hepsin, is a serine protease that shows oncogenic activity and is widely overexpressed in a variety of cancers including BC. Methylation of DMR probes in HPN was strongly correlated with increased expression levels of the gene in the NAT samples (Figure S15B), suggesting a higher activity of this gene in the multiple-tree group than in the shared-tree group.
Together with the observation that the multiple-tree group had shorter TLs, these findings suggest that NAT samples in the multiple-tree group may harbor genomic and epigenomic changes that are typical characteristics of tumors.
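The FDR threshold applied to the DMRs can be illustrated with a short sketch of p value adjustment. The text does not state which FDR procedure was used, so the Benjamini-Hochberg method shown here is an assumption, and the function name and example p values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p value, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical region-level p values (not from the paper).
region_p = [0.01, 0.04, 0.03, 0.20]
region_fdr = benjamini_hochberg(region_p)
significant = [q < 0.05 for q in region_fdr]
```

Only regions whose adjusted value falls below 0.05 would be reported as significant under this assumed procedure.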
Our analysis of genomic features in tumor samples revealed several key differences between the multiple- and shared-tree groups (Figure 6A). Specifically, the multiple-tree group had higher TMB (p = 0.025), percent genome with SCNAs (%PGA, p = 0.016), and SV counts (p = 0.11) compared with the shared-tree group. In addition, although not significant, the multiple-tree group also appeared to have a higher percentage of tumors classified as luminal B (30.0%) and triple-negative BC (TNBC, 15.0%), 2 aggressive subtypes associated with a poorer prognosis, as compared to the shared-tree group (23.1% luminal B and 7.7% TNBC). Notably, while TMB, %PGA, SV counts, and percentage of molecular subtypes differed between the multiple- and shared-tree groups, these factors were similar between the tumor-only and multiple-tree groups.
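As an illustration of the %PGA metric, the sketch below computes the percentage of the profiled genome covered by segments whose total copy number differs from the ploidy baseline. The exact definition used in the study is not given in the text, so the baseline choice (2 for nWGD, 4 for WGD), the half-open segment representation, and the function name are assumptions.

```python
def percent_genome_altered(segments, wgd):
    """Percent of profiled bases whose total copy number differs from baseline.

    `segments` is a list of (start, end, total_cn) tuples with half-open
    coordinates; `wgd` selects a baseline of 4 (WGD) or 2 (nWGD).
    """
    baseline = 4 if wgd else 2
    total_bases = sum(end - start for start, end, _ in segments)
    altered_bases = sum(
        end - start for start, end, total_cn in segments if total_cn != baseline
    )
    return 100.0 * altered_bases / total_bases

# Hypothetical segments on a 200-base toy genome.
toy_segments = [(0, 100, 2), (100, 150, 3), (150, 200, 1)]
pga_nwgd = percent_genome_altered(toy_segments, wgd=False)
```

Note that this simple version does not count copy-neutral LOH as altered; whether such segments contribute to %PGA depends on the definition adopted.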
Figure 6.
Association of genomic and epigenomic characteristics with broad phylogenetic tree groups in tumor samples
(A) Boxplots of tumor mutation burden (TMB), percent genome affected by somatic copy number alterations (%PGA), structural variant counts, and the stacked bar plot of PAM50 tumor subtypes. Significance levels are denoted as follows: ∗∗∗p < 0.001, ∗∗p < 0.01, ∗p < 0.05.
(B) Heatmap of driver genes visually representing mutated genes identified within samples in each tree group. The numbers within each cell denote the frequency of a specific gene mutation in each tree group, while the cell colors signify the proportion of that gene’s mutation relative to all the mutated genes identified within that tree group.
(C) Bubble heatmaps highlight the presence of signatures, including single-base substitutions (SBSs), doublet-base substitutions (DBSs), small insertions and deletions (IDs), structural variations (SVs), and copy-number (CN) variations, in each group. The size of each bubble in the plots represents the proportion of samples exhibiting a particular signature, while the color signifies the log10 value of the mean activity +1 of that signature across all samples within a tree group.
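The two quantities encoded by each bubble in Figure 6C can be sketched as follows. The sketch assumes that the color value is log10(mean activity + 1), which is one reading of the caption, and that a sample "exhibits" a signature when its activity is greater than 0; the function name and inputs are hypothetical.

```python
import math

def bubble_values(activities):
    """Return (proportion of samples with the signature, log10(mean activity + 1)).

    `activities` lists one signature's activity in every sample of a tree group,
    with 0 for samples in which the signature was not detected.
    """
    n = len(activities)
    proportion = sum(1 for a in activities if a > 0) / n
    mean_activity = sum(activities) / n
    return proportion, math.log10(mean_activity + 1)

# Hypothetical activities for one signature in a group of four samples.
size, color = bubble_values([0, 0, 198, 198])
```

Adding 1 before taking the logarithm keeps the color value defined (and equal to 0) for signatures with no activity in a group.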
Interestingly, mutations in several known BC driver genes appeared to show different prevalence across tree groups (Figure 6B). For example, mutations in AKT1 (5 in shared tree vs. 1 in multiple tree, p = 0.064; 4 of the mutations are the hotspot mutation E17K) and KMT2D (MIM: 602113; 3 vs. 0, p = 0.069) appeared to be enriched in the shared-tree group, although not reaching significance. In addition, among recurrent mutations that were present in multiple samples, those in CBFB (MIM: 121360), CTCF (MIM: 604167), KMT2D, and NF1 (MIM: 613113) were found only in the shared-tree group, whereas mutations in FBXW7 (MIM: 606278), NCOR1 (MIM: 600849), and ATM (MIM: 607585) were found only in the multiple-tree group. On the other hand, PIK3CA and TP53 mutations were identified across tree groups, although they appeared to be more prevalent in the multiple-tree group. As these possible associations lack statistical significance, they are interesting subjects for future study with a larger cohort.
Similarly, we observed that some mutational and SCNA/SV signatures demonstrated distinct patterns in specific tree structure groups (Figures 6C and S16). For example, we observed much lower activities of ID1 (Wilcoxon p = 0.030) and ID2 (p = 0.0014) but a higher activity of ID83D (p = 0.13) in tumors of the shared-tree group as compared to those in the multiple-tree group (Figure 6C). SBS1 mutations seemed to be more abundant among tumors in the multiple-tree group than in the shared-tree group (Wilcoxon p = 0.026). CN2 (p = 0.059), CN17 (p = 0.043), CN20 (p = 0.013), and SV3 (p = 0.30) were more abundant and/or showed higher activity in tumor samples in the multiple-tree group, while the diploid signature CN1 (p = 0.010) had higher prevalence/activity in tumor samples of the shared-tree group (Figure 6C). CN2 is a signature indicative of tetraploid genomes, whereas CN17 is a signature of HRD that is often associated with TP53 mutations. This connection may explain the higher TMB, PGA, and SV count, as well as increased presence of samples with luminal B and basal tumor subtype, in the multiple-tree group. Higher TMB in the multiple-tree tumors may also indicate that these tumors have been present for longer than tumors in the shared-tree group, concordant with multiple-tree groups being slower growing. However, future analysis of longitudinal samples or single-cell sequencing is needed to confirm this hypothesis.
Discussion
In this study, we characterized the genomic and epigenomic landscape of NATs in individuals with BC, unraveled the heterogeneous evolutionary histories of NAT and tumor tissues, and identified factors associated with this heterogeneity. WGS revealed the widespread occurrence of SNVs in NATs, including those in cancer driver genes. This was confirmed by targeted sequencing in a larger cohort. The presence of mutations in known cancer driver genes in normal healthy tissue has been documented in various tissues, such as the skin, esophagus, blood, and breast.50,51,52,53 Our findings that NATs contain driver somatic mutations are consistent with these observations, aligning with the concept of field cancerization in pre-cancerous tissues. Interestingly, in comparison to paired tumor tissue, most NAT tissues exhibited fewer SNVs and lacked SCNAs or SVs. This is consistent with findings from our earlier investigation of the mutational landscape in morphologically normal prostate tissue,54 suggesting that SNVs in NATs alone may be insufficient for tumor transformation. This implies a requirement for additional driver SNVs and/or large-scale somatic alterations, such as SCNAs or SVs, in the process of tumorigenesis. In other words, histologically normal adjacent tissue has initiated but not completed the tumorigenesis process.
The construction of phylogenies from WGS of 43 BCs, each with matched normal adjacent tissue, has allowed the separation of cancers that share a common somatic ancestor with cells in normal adjacent tissue (shared tree) from those in which cells in normal adjacent tissue have an independent genetic origin (multiple trees). The observation of multiple independent trees is concordant with non-genetic field cancerization (Figure 7), wherein factors in the TME enable the clonal expansion of cells within normal breast tissue. Clonal expansions in different geographical locations may lead to multiple affected regions, increasing the likelihood that one of the regions will transform to cancer, as illustrated in Figure 7. Previous work from our group and others indicates that field cancerization is widespread in individuals with prostate cancer, whereas the results of this study suggest that field cancerization may only occur in around half of women with BC. Like previous studies, our power to detect independent cell populations in NATs is limited by sequencing just 1 NAT from each individual. It may be that this phenomenon will appear more common upon sequencing multiple normal regions from each donor.
Figure 7.
Field cancerization in breast cancer
This figure illustrates the concept of field cancerization in breast cancer, showcasing the emergence of molecular alterations across various cell groups within breast tissue, which lead to multiple or shared-tree groups. The normal epithelium (top) consists of normal cells (depicted in light pink color) without detectable molecular changes. Due to mutational processes or changes in the cell microenvironment, morphologically normal cells with molecular alterations emerge, forming heterogeneous subclones (illustrated by dark brown and light brown cells in the middle left) or a homogeneous subclone (illustrated by light brown cells in the middle right). Some subclones (illustrated by light brown cells) continue to develop molecular alterations, leading to malignant cells (illustrated by purple cells). In the multiple-tree group (bottom left), subclones in NAT (e.g., dark brown cells) and tumors (e.g., light brown and purple cells) do not share genetic alterations. Conversely, in shared-tree groups (bottom right), subclones in NAT (e.g., light brown cells) and tumors (e.g., light brown and purple cells) share genetic alterations. The characteristics of NAT and tumor samples are listed for multiple and shared-tree groups, respectively.
NAT samples in the multiple-tree group had a significantly less inflammatory and more suppressive microenvironment than those in the shared-tree group, suggesting that a more permissive TME may play a role in generating the field effect. This is supported by the observation of higher levels of genomic instability in tumors in the multiple-tree group, indicated by higher TMB, PGA, and signatures of large-scale genomic rearrangements and HRD in this set of cancers. Interestingly, in contrast to tumor samples, NAT samples in the shared-tree group had higher mutation rates than those in the multiple-tree group. Furthermore, the 2 NAT samples with the highest levels of genomic rearrangements also belonged to the shared-tree group. It is possible that high levels of genomic alterations lead to immune activation, which in turn suppresses the expansion of other clones in NAT samples of the shared-tree group while promoting the growth of tumor cells by instigating an inflamed TME, as indicated by higher monocyte and macrophage populations. On the other hand, although carrying fewer mutations, NAT samples in the multiple-tree group displayed distinct genomic and epigenetic features such as shorter telomeres, downregulation of tumor suppressor genes such as REC8 and DNM3 through epigenetic regulation, and distinct immune composition. Together, these alterations may provide a nurturing environment for the expansion of clones that eventually progress to tumors with more aggressive behavior. Our findings suggest that both genomic mutations/rearrangements and TME alterations occur at the early stage of BC progression, representing the cancer field effect, and their interplay may drive patient-specific tumor evolution processes (Figure 7). This may explain why tumors with identical driver mutations may behave quite differently.
These data also highlight the challenge of developing accurate genetic markers to predict future recurrence, particularly among those that are in the multiple-tree group, where the field and tumor are not genetically connected. Therefore, comprehensive characterizations of field cancerization are needed to better understand the process and its contribution to recurrence and the development of multiple independent tumors. This understanding is crucial for preventing the progression of precancerous cells.
Interestingly, we observed that ductal hyperplasia appeared to be enriched in the NAT samples of the multiple-tree group, potentially explaining the higher epithelial content in this group compared to the other tree groups. This observation may challenge the conventional hypothesis that cancers progress from hyperplastic lesions. However, further investigations using single-cell sequencing approaches will be necessary to confirm these findings.
The strengths of our study include the utilization of multi-omics sequencing platforms, comprehensive sample collection with well-annotated clinical and epidemiologic data, and a distinctive study population. In contrast to previous research limited to CN and transcriptomic analyses, our investigation employs WGS, methylation profiling, and RNA-seq to comprehensively characterize the genomic and epigenomic features of NAT and paired tumor samples, to infer phylogenetic trees, and to associate genomic and epigenomic factors with phylogenetic tree groups. Moreover, our study uniquely includes paired germline, NAT, and tumor samples, facilitating the simultaneous analysis of evolutionary trajectories for NAT and tumor samples. Finally, our research focuses on a population from HK, representing an East Asian demographic that has been insufficiently studied for BC.
The major limitation of this study is the sequencing of only 1 NAT sample from each individual, which may have restricted the detection of heterogeneous cell populations within NATs. Sequencing multiple normal regions from each donor would provide a more comprehensive understanding of the heterogeneity in normal tissues and further elucidate the complexities of field cancerization in BC. While the study of an East Asian cohort rarely included in molecular studies of BC is a strength of this study, further research in diverse populations is warranted to validate the generalizability of the findings.
In summary, this study investigates the genomic landscape and evolutionary trajectories of NATs in BC. The analysis reveals at least 2 distinct clonal evolution groups for NAT samples, characterized by shared or independent evolutionary trajectories as the result of interactions of tumors with the surrounding TME. These findings underscore early genomic alterations in BC and provide further insights into field cancerization in BC.
Data and code availability
The sequencing and array data reported in this paper are available in the dbGaP database under accession number phs001870.v1.p1, which can be accessed via this web link: https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001870.v1.p1.
Acknowledgments
This research was supported by the Intramural Research Program of the National Institutes of Health, National Cancer Institute, Division of Cancer Epidemiology and Genetics, by contract no. 75N91019D00024, and by Research Grants Council (grant no. 474811 to Dr. Tse), Hong Kong SAR. D.C.W. is part funded by Cancer Research UK RadNet Manchester (C1994/A28701) and is supported by the NIHR Manchester Biomedical Research Centre (NIHR203308). D.L. is supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT, no. RS-2023-00213625). This study utilized the high-performance computational capabilities of the Biowulf Linux cluster at the National Institutes of Health, Bethesda, MD: https://biowulf.nih.gov.
Author contributions
B.Z., M.G.-C., S.C., L.A.T., D.C.W., and X.R.Y. contributed to the study concept and design and obtained the funding. B.Z., A.T., H.K., T.Z., W. Zhu, X.W., A.K., D.L., M.H., C.A.H., W. Zhou, D.C.W., and W.L. conducted bioinformatic and biostatistical analyses. P.M.Y.L., G.M.T., K.-h.T., C.W., and L.A.T. contributed to specimen and data acquisition; K.J., A.H., P.L., and B.H. contributed to the assay and tissue processing support. B.Z., A.T., H.K., T.Z., D.C.W., and X.R.Y. performed data interpretation. B.Z., A.T., H.K., D.C.W., and X.R.Y. drafted the initial manuscript. All authors contributed to the critical revision of the manuscript for important intellectual content and approved the final manuscript.
Declaration of interests
The authors have no competing interests to disclose.
Published: November 3, 2024
Footnotes
Supplemental information can be found online at https://doi.org/10.1016/j.ajhg.2024.10.005.
Contributor Information
David C. Wedge, Email: david.wedge@manchester.ac.uk.
Xiaohong R. Yang, Email: royang@mail.nih.gov.
References
- 1.Perou C.M., Sørlie T., Eisen M.B., van de Rijn M., Jeffrey S.S., Rees C.A., Pollack J.R., Ross D.T., Johnsen H., Akslen L.A., et al. Molecular portraits of human breast tumours. Nature. 2000;406:747–752. doi: 10.1038/35021093. [DOI] [PubMed] [Google Scholar]
- 2.Curtis C., Shah S.P., Chin S.F., Turashvili G., Rueda O.M., Dunning M.J., Speed D., Lynch A.G., Samarajiwa S., Yuan Y., et al. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature. 2012;486:346–352. doi: 10.1038/nature10983. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Ribelles N., Perez-Villa L., Jerez J.M., Pajares B., Vicioso L., Jimenez B., de Luque V., Franco L., Gallego E., Marquez A., et al. Pattern of recurrence of early breast cancer is different according to intrinsic subtype and proliferation index. Breast Cancer Res. 2013;15 doi: 10.1186/bcr3559. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Huang X., Stern D.F., Zhao H. Transcriptional Profiles from Paired Normal Samples Offer Complementary Information on Cancer Patient Survival--Evidence from TCGA Pan-Cancer Data. Sci. Rep. 2016;6 doi: 10.1038/srep20567. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Gadaleta E., Fourgoux P., Pirró S., Thorn G.J., Nelan R., Ironside A., Rajeeve V., Cutillas P.R., Lobley A.E., Wang J., et al. Characterization of four subtypes in morphologically normal tissue excised proximal and distal to breast cancer. NPJ Breast Cancer. 2020;6:38. doi: 10.1038/s41523-020-00182-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Aran D., Camarda R., Odegaard J., Paik H., Oskotsky B., Krings G., Goga A., Sirota M., Butte A.J. Comprehensive analysis of normal adjacent to tumor transcriptomes. Nat. Commun. 2017;8:1077. doi: 10.1038/s41467-017-01027-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Hattori N., Ushijima T. Epigenetic impact of infection on carcinogenesis: mechanisms and applications. Genome Med. 2016;8:10. doi: 10.1186/s13073-016-0267-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Takeshima H., Ushijima T. Accumulation of genetic and epigenetic alterations in normal cells and cancer risk. npj Precis. Oncol. 2019;3:7. doi: 10.1038/s41698-019-0079-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Slaughter D.P., Southwick H.W., Smejkal W. Field cancerization in oral stratified squamous epithelium; clinical implications of multicentric origin. Cancer. 1953;6:963–968. doi: 10.1002/1097-0142(195309)6:5<963::aid-cncr2820060515>3.0.co;2-q. [DOI] [PubMed] [Google Scholar]
- 10.Abdalla M., Tran-Thanh D., Moreno J., Iakovlev V., Nair R., Kanwar N., Abdalla M., Lee J.P.Y., Kwan J.Y.Y., Cawthorn T.R., et al. Mapping genomic and transcriptomic alterations spatially in epithelial cells adjacent to human breast carcinoma. Nat. Commun. 2017;8:1245. doi: 10.1038/s41467-017-01357-y.
- 11.Gadaleta E., Thorn G.J., Ross-Adams H., Jones L.J., Chelala C. Field cancerization in breast cancer. J. Pathol. 2022;257:561–574. doi: 10.1002/path.5902.
- 12.Braakhuis B.J.M., Tabor M.P., Kummer J.A., Leemans C.R., Brakenhoff R.H. A genetic explanation of Slaughter's concept of field cancerization: evidence and clinical implications. Cancer Res. 2003;63:1727–1730.
- 13.Tabor M.P., Brakenhoff R.H., van Houten V.M., Kummer J.A., Snel M.H., Snijders P.J., Snow G.B., Leemans C.R., Braakhuis B.J. Persistence of genetically altered fields in head and neck cancer patients: biological and clinical implications. Clin. Cancer Res. 2001;7:1523–1532.
- 14.Strandgaard T., Nordentoft I., Birkenkamp-Demtröder K., Salminen L., Prip F., Rasmussen J., Andreasen T.G., Lindskrog S.V., Christensen E., Lamy P., et al. Field Cancerization Is Associated with Tumor Development, T-cell Exhaustion, and Clinical Outcomes in Bladder Cancer. Eur. Urol. 2024;85:82–92. doi: 10.1016/j.eururo.2023.07.014.
- 15.Bartlett T.E., Evans I., Jones A., Barrett J.E., Haran S., Reisel D., Papaikonomou K., Jones L., Herzog C., Pashayan N., et al. Antiprogestins reduce epigenetic field cancerization in breast tissue of young healthy women. Genome Med. 2022;14:64. doi: 10.1186/s13073-022-01063-5.
- 16.Forsberg L.A., Rasi C., Pekar G., Davies H., Piotrowski A., Absher D., Razzaghian H.R., Ambicka A., Halaszka K., Przewoźnik M., et al. Signatures of post-zygotic structural genetic aberrations in the cells of histologically normal breast tissue that can predispose to sporadic breast cancer. Genome Res. 2015;25:1521–1535. doi: 10.1101/gr.187823.114.
- 17.Yates J., Schaufelberger H., Steinacher R., Schär P., Truninger K., Boeva V. DNA-methylation variability in normal mucosa: a field cancerization marker in patients with adenomatous polyps. J. Natl. Cancer Inst. 2024;116:974–982. doi: 10.1093/jnci/djae016.
- 18.Silvestri G.A., Vachani A., Whitney D., Elashoff M., Porta Smith K., Ferguson J.S., Parsons E., Mitra N., Brody J., Lenburg M.E., et al. A Bronchial Genomic Classifier for the Diagnostic Evaluation of Lung Cancer. N. Engl. J. Med. 2015;373:243–251. doi: 10.1056/NEJMoa1504601.
- 19.Cooper C.S., Eeles R., Wedge D.C., Van Loo P., Gundem G., Alexandrov L.B., Kremeyer B., Butler A., Lynch A.G., Camacho N., et al. Analysis of the genetic phylogeny of multifocal prostate cancer identifies multiple independent clonal expansions in neoplastic and morphologically normal prostate tissue. Nat. Genet. 2015;47:367–372. doi: 10.1038/ng.3221.
- 20.Ronowicz A., Janaszak-Jasiecka A., Skokowski J., Madanecki P., Bartoszewski R., Bałut M., Seroczyńska B., Kochan K., Bogdan A., Butkus M., et al. Concurrent DNA Copy-Number Alterations and Mutations in Genes Related to Maintenance of Genome Stability in Uninvolved Mammary Glandular Tissue from Breast Cancer Patients. Hum. Mutat. 2015;36:1088–1099. doi: 10.1002/humu.22845.
- 21.Jakubek Y.A., Chang K., Sivakumar S., Yu Y., Giordano M.R., Fowler J., Huff C.D., Kadara H., Vilar E., Scheet P. Large-scale analysis of acquired chromosomal alterations in non-tumor samples from patients with cancer. Nat. Biotechnol. 2020;38:90–96. doi: 10.1038/s41587-019-0297-6.
- 22.Zhu B., Joo L., Zhang T., Koka H., Lee D., Shi J., Lee P., Wang D., Wang F., Chan W.C., et al. Comparison of somatic mutation landscapes in Chinese versus European breast cancer patients. HGG Adv. 2022;3 doi: 10.1016/j.xhgg.2021.100076.
- 23.The Cancer Genome Atlas Network. Comprehensive molecular portraits of human breast tumours. Nature. 2012;490:61–70. doi: 10.1038/nature11412.
- 24.Cibulskis K., Lawrence M.S., Carter S.L., Sivachenko A., Jaffe D., Sougnez C., Gabriel S., Meyerson M., Lander E.S., Getz G. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat. Biotechnol. 2013;31:213–219. doi: 10.1038/nbt.2514.
- 25.Saunders C.T., Wong W.S.W., Swamy S., Becq J., Murray L.J., Cheetham R.K. Strelka: accurate somatic small-variant calling from sequenced tumor-normal sample pairs. Bioinformatics. 2012;28:1811–1817. doi: 10.1093/bioinformatics/bts271.
- 26.Zhang T., Joubert P., Ansari-Pour N., Zhao W., Hoang P.H., Lokanga R., Moye A.L., Rosenbaum J., Gonzalez-Perez A., Martínez-Jiménez F., et al. Genomic and evolutionary classification of lung cancer in never smokers. Nat. Genet. 2021;53:1348–1359. doi: 10.1038/s41588-021-00920-0.
- 27.Stachler M.D., Taylor-Weiner A., Peng S., McKenna A., Agoston A.T., Odze R.D., Davison J.M., Nason K.S., Loda M., Leshchiner I., et al. Paired exome analysis of Barrett's esophagus and adenocarcinoma. Nat. Genet. 2015;47:1047–1055. doi: 10.1038/ng.3343.
- 28.Cingolani P., Platts A., Wang L.L., Coon M., Nguyen T., Wang L., Land S.J., Lu X., Ruden D.M. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin). 2012;6:80–92. doi: 10.4161/fly.19695.
- 29.Wang K., Li M., Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 2010;38 doi: 10.1093/nar/gkq603.
- 30.Aryee M.J., Jaffe A.E., Corrada-Bravo H., Ladd-Acosta C., Feinberg A.P., Hansen K.D., Irizarry R.A. Minfi: a flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays. Bioinformatics. 2014;30:1363–1369. doi: 10.1093/bioinformatics/btu049.
- 31.Paquet E.R., Hallett M.T. Absolute assignment of breast cancer intrinsic molecular subtype. J. Natl. Cancer Inst. 2015;107:357. doi: 10.1093/jnci/dju357.
- 32.Cameron D.L., Schröder J., Penington J.S., Do H., Molania R., Dobrovic A., Speed T.P., Papenfuss A.T. GRIDSS: sensitive and specific genomic rearrangement detection using positional de Bruijn graph assembly. Genome Res. 2017;27:2050–2060. doi: 10.1101/gr.222109.117.
- 33.Chen X., Schulz-Trieglaff O., Shaw R., Barnes B., Schlesinger F., Källberg M., Cox A.J., Kruglyak S., Saunders C.T. Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications. Bioinformatics. 2016;32:1220–1222. doi: 10.1093/bioinformatics/btv710.
- 34.Wala J.A., Bandopadhayay P., Greenwald N.F., O'Rourke R., Sharpe T., Stewart C., Schumacher S., Li Y., Weischenfeldt J., Yao X., et al. SvABA: genome-wide detection of structural variants and indels by local assembly. Genome Res. 2018;28:581–591. doi: 10.1101/gr.221028.117.
- 35.Yang L., Luquette L.J., Gehlenborg N., Xi R., Haseley P.S., Hsieh C.H., Zhang C., Ren X., Protopopov A., Chin L., et al. Diverse mechanisms of somatic structural variations in human cancer genomes. Cell. 2013;153:919–929. doi: 10.1016/j.cell.2013.04.010.
- 36.Islam S.M.A., Diaz-Gay M., Wu Y., Barnes M., Vangara R., Bergstrom E.N., He Y., Vella M., Wang J., Teague J.W., et al. Uncovering novel mutational signatures by de novo extraction with SigProfilerExtractor. Cell Genom. 2022;2:100179. doi: 10.1016/j.xgen.2022.100179.
- 37.Everall A., Tapinos A., Hawari A., Cornish A., Sud A., Chubb D., Kinnersley B., Frangou A., Barquin M., Jung J., et al. Comprehensive repertoire of the chromosomal alteration and mutational signatures across 16 cancer types from 10,983 cancer patients. Preprint at medRxiv. 2023. doi: 10.1101/2023.06.07.23290970.
- 38.Dentro S.C., Wedge D.C., Van Loo P. Principles of Reconstructing the Subclonal Architecture of Cancers. Cold Spring Harb. Perspect. Med. 2017;7 doi: 10.1101/cshperspect.a026625.
- 39.Ding Z., Mangino M., Aviv A., Spector T., Durbin R., UK10K Consortium. Estimating telomere length from whole genome sequence data. Nucleic Acids Res. 2014;42 doi: 10.1093/nar/gku181.
- 40.Teschendorff A.E., Breeze C.E., Zheng S.C., Beck S. A comparison of reference-based algorithms for correcting cell-type heterogeneity in Epigenome-Wide Association Studies. BMC Bioinf. 2017;18:105. doi: 10.1186/s12859-017-1511-5.
- 41.Chakravarthy A., Furness A., Joshi K., Ghorani E., Ford K., Ward M.J., King E.V., Lechner M., Marafioti T., Quezada S.A., et al. Pan-cancer deconvolution of tumour composition using DNA methylation. Nat. Commun. 2018;9:3220. doi: 10.1038/s41467-018-05570-1.
- 42.Martinez-Jimenez F., Muinos F., Sentis I., Deu-Pons J., Reyes-Salazar I., Arnedo-Pac C., Mularoni L., Pich O., Bonet J., Kranas H., et al. A compendium of mutational cancer driver genes. Nat. Rev. Cancer. 2020;20:555–572. doi: 10.1038/s41568-020-0290-x.
- 43.Tamborero D., Rubio-Perez C., Deu-Pons J., Schroeder M.P., Vivancos A., Rovira A., Tusquets I., Albanell J., Rodon J., Tabernero J., et al. Cancer Genome Interpreter annotates the biological and clinical relevance of tumor alterations. Genome Med. 2018;10:25. doi: 10.1186/s13073-018-0531-8.
- 44.Bolli N., Avet-Loiseau H., Wedge D.C., Van Loo P., Alexandrov L.B., Martincorena I., Dawson K.J., Iorio F., Nik-Zainal S., Bignell G.R., et al. Heterogeneity of genomic evolution and mutational profiles in multiple myeloma. Nat. Commun. 2014;5:2997. doi: 10.1038/ncomms3997.
- 45.Jiao W., Vembu S., Deshwar A.G., Stein L., Morris Q. Inferring clonal evolution of tumors from single nucleotide somatic mutations. BMC Bioinf. 2014;15:35. doi: 10.1186/1471-2105-15-35.
- 46.Sharygin D., Koniaris L.G., Wells C., Zimmers T.A., Hamidi T. Role of CD14 in human disease. Immunology. 2023;169:260–270. doi: 10.1111/imm.13634.
- 47.Cheah M.T., Chen J.Y., Sahoo D., Contreras-Trujillo H., Volkmer A.K., Scheeren F.A., Volkmer J.P., Weissman I.L. CD14-expressing cancer cells establish the inflammatory and proliferative tumor microenvironment in bladder cancer. Proc. Natl. Acad. Sci. USA. 2015;112:4725–4730. doi: 10.1073/pnas.1424795112.
- 48.Yu J., Liang Q., Wang J., Wang K., Gao J., Zhang J., Zeng Y., Chiu P.W.Y., Ng E.K.W., Sung J.J.Y. REC8 functions as a tumor suppressor and is epigenetically downregulated in gastric cancer, especially in EBV-positive subtype. Oncogene. 2017;36:182–193. doi: 10.1038/onc.2016.187.
- 49.Liu D., Shen X., Zhu G., Xing M. REC8 is a novel tumor suppressor gene epigenetically robustly targeted by the PI3K pathway in thyroid cancer. Oncotarget. 2015;6:39211–39224. doi: 10.18632/oncotarget.5391.
- 50.Cereser B., Yiu A., Tabassum N., Del Bel Belluz L., Zagorac S., Ancheta K.R.Z., Zhong R., Miere C., Jeffries-Jones A.R., Moderau N., et al. The mutational landscape of the adult healthy parous and nulliparous human breast. Nat. Commun. 2023;14:5136. doi: 10.1038/s41467-023-40608-z.
- 51.Kakiuchi N., Ogawa S. Clonal expansion in non-cancer tissues. Nat. Rev. Cancer. 2021;21:239–256. doi: 10.1038/s41568-021-00335-3.
- 52.Martincorena I. Somatic mutation and clonal expansions in human tissues. Genome Med. 2019;11:35. doi: 10.1186/s13073-019-0648-4.
- 53.Martincorena I., Campbell P.J. Somatic mutation in cancer and normal cells. Science. 2015;349:1483–1489. doi: 10.1126/science.aab4082.
- 54.Buhigas C., Warren A.Y., Leung W.K., Whitaker H.C., Luxton H.J., Hawkins S., Kay J., Butler A., Xu Y., Woodcock D.J., et al. The architecture of clonal expansions in morphologically normal tissue from cancerous and non-cancerous prostates. Mol. Cancer. 2022;21:183. doi: 10.1186/s12943-022-01644-3.
Data Availability Statement
The sequencing and array data reported in this paper are available in the dbGaP database under accession number phs001870.v1.p1, accessible at https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001870.v1.p1.







