NAR Cancer. 2026 May 4;8(2):zcag011. doi: 10.1093/narcan/zcag011

Sensitive detection of somatic mutations in GC-rich cancer gene promoters

Meifang Qi 1,2,3,✉,#, Preshita Sanjay Dave 4,5,6,#, Nicole Francis 7,8, Matthew DeFelice 9,10, Sam Pollock 11,12, Holly Corbitt 13, Irene Rin Mitsiades 14,15,16, Gengchao Wang 17,18,19, Paul Frere 20, Alan Fox 21, Gregory Khitrov 22, Caroline Petersen 23,24, Katie Larkin 25,26, Susanna Hamilton 27,28, Leif W Ellisen 29,30,31, Doga C Gulhan 32,33, Alfredo Hidalgo-Miranda 34, Niall Lennon 35,36, Carrie Cibulskis 37,38, Esther Rheinbay 39,40,41,42
PMCID: PMC13136893  PMID: 42088607

Abstract

Somatic mutations in protein-coding genes and noncoding regulatory regions are the major drivers of cancer. Only a relatively small number of somatic noncoding mutations that are likely drivers have been described to date, including those in the promoters of the TERT, FOXA1, and TP53 genes. The impact of these alterations can be profound by initiating, increasing, or abolishing gene expression. Promoter mutations in particular have been difficult to identify even from whole tumor genomes due to their high content of G and C nucleotides, which leads to loss of sequencing coverage in these regions. Therefore, the landscape of somatic drivers in gene promoters remains incomplete. Here, we present a hybrid capture assay optimized for >3000 promoters of cancer genes. We show that this assay allows for deep sequencing of challenging GC-rich promoter regions, enabling discovery of reliable point mutations, short insertions and deletions, copy number variants, and mutational signatures in cell line models as well as formalin-fixed, paraffin-embedded archival tissue samples. Our assay nominated candidate noncoding driver mutations in CDK4, SMAD3, and GATA3 in breast cancer for future functional follow-up.


Introduction

Cancer-driving somatic mutations have predominantly been described in protein-coding genes, where they can alter function, localization, and abundance of an oncogene or tumor suppressor. In addition, alterations in the regulatory, noncoding part of the genome have been found in the promoters of genes (e.g. TERT [1, 2], TP53 [3]), untranslated regions (e.g. NFKBIZ [4], FOXA1 [3, 5], SFTPB [6]), and in distal regulatory elements (enhancers; e.g. TAL1, GFI1) [7, 8]. While variants in protein-coding genes can be readily and cost-effectively assessed with whole-exome sequencing (WES), this strategy does not assess regulatory regions. Whole-genome sequencing (WGS) has enabled the discovery of several mutations in regulatory regions [3]. However, specific sequence properties, including nucleotide composition and repetitiveness of the genome, have left “blind spots” in the landscape of somatic mutations in certain classes of regulatory elements. This includes around 70% of human gene promoters that contain “CpG islands” [9, 10], sequences strongly enriched in stretches of cytosine (C) and guanine (G) nucleotides compared to the rest of the genome. It is well known that GC-rich DNA is less efficiently amplified in polymerase chain reaction (PCR) experiments, leading to dropout specifically of GC-rich sequences. As a consequence, sequencing read coverage in next-generation sequencing profiles based on library preparation methods with DNA amplification steps can be highly biased against GC-rich DNA. Uneven and sometimes complete lack of sequencing read coverage in specific genomic regions such as GC-rich DNA can lead to “missed” mutation calls of true somatic events, with implications for identifying critical driver events in a given tumor and case cohort, estimating mutation density, tumor classification, and clinical decision making.

We have previously shown that gene promoters, due to their high GC content, are particularly challenging to sequence and are systematically undersampled in published whole cancer genomes [3, 11, 12]. With known promoter driver mutations in TERT [1, 2], FOXA1 [11], TP53 [3], and BAP1 [13] occurring in narrowly defined regions, severe lack of coverage at individual positions close to the transcription start site (TSS) results in many mutations being missed in a given tumor genome cohort [11]. This problem is further exacerbated in tumors infiltrated with normal stromal or immune cells and those with subclonal mutations, making detection of potential driver mutations particularly challenging.

Here, we present a novel hybrid capture method specifically for cancer gene promoters that enables sensitive detection of potential driver mutations in these regions.

Materials and methods

Coverage analyses of whole genomes

The MuTect somatic mutation caller calculates a binary flag indicating whether a given genomic base is sufficiently covered with sequencing reads to robustly call a somatic mutation [14]. We used binary coverage tracks from 1111 whole cancer genomes from the PCAWG project [12] to calculate the fraction of covered samples for a given genomic position. The list of cancer genes was obtained from PCAWG [12]. Promoters were defined as regions of −100 to +100 bp around the annotated TSS obtained from RefGene. For exon power estimates, exons not overlapping regions baited in exome capture panels were not included, and exons on chromosomes X and Y were excluded due to expected lower coverage in male samples.
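The per-position calculation described above can be illustrated with a minimal sketch. It assumes each sample's coverage track has already been loaded as a list of 0/1 flags over the same coordinates. The function names, the closed-interval window convention, and the toy data are illustrative assumptions and are not taken from the PCAWG or MuTect code.

```python
from typing import List, Sequence, Tuple


def covered_fraction(tracks: Sequence[Sequence[int]]) -> List[float]:
    """Per-position fraction of samples whose binary coverage flag is 1."""
    if not tracks:
        raise ValueError("need at least one sample track")
    length = len(tracks[0])
    if any(len(t) != length for t in tracks):
        raise ValueError("all tracks must have the same length")
    n_samples = len(tracks)
    return [sum(t[i] for t in tracks) / n_samples for i in range(length)]


def promoter_window(tss: int, flank: int = 100) -> Tuple[int, int]:
    """Closed interval [tss - flank, tss + flank], clipped at 0 (assumed convention)."""
    return max(0, tss - flank), tss + flank


def mean_fraction_in_window(fractions: Sequence[float], start: int, end: int) -> float:
    """Average covered fraction over the closed interval [start, end]."""
    window = fractions[start:end + 1]
    if not window:
        raise ValueError("empty window")
    return sum(window) / len(window)


# Toy data: 4 samples, 7 positions, TSS at index 3, flank shortened to 2 bases.
toy_tracks = [
    [1, 1, 1, 0, 0, 1, 1],
    [1, 1, 0, 0, 0, 1, 1],
    [1, 1, 1, 1, 0, 0, 1],
    [1, 0, 0, 0, 0, 1, 1],
]
fractions = covered_fraction(toy_tracks)
start, end = promoter_window(tss=3, flank=2)
print(fractions)
print(mean_fraction_in_window(fractions, start, end))
```

In this toy case coverage drops to zero just downstream of the TSS, which is the kind of local dip that a promoter-wide average can hide.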

Selection of cancer gene targets for promoter assay design

Cancer genes for panel design were collected from the Catalogue Of Somatic Mutations In Cancer (COSMIC v80 and v90) [15], The Network of Cancer Genes (NCG 6.0) [16], and an internal list of cancer genes (“BroadPanCan2019”).

Definition of functional promoter regions with chromatin signatures

We defined the likely functional promoter regions of cancer genes as follows (Supplementary Fig. S1A):

  1. Defining TSSs: TSSs of cancer genes were obtained from the “BestRefSeq” transcripts from RefGene (hg38; https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.ncbiRefSeq.gtf.gz). For genes without a “BestRefSeq” transcript, gene-level annotation information was used to define the TSS instead of relying on individual transcripts.

  2. Identifying high-confidence active regulatory regions: We analyzed open chromatin signatures from 132 DNase-seq and 64 Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq) experiments across 112 cell lines and primary tissues (Supplementary Table S2). Signal peak files for these samples were downloaded via the ENCODE portal (https://www.encodeproject.org/) (hg38 human genome assembly). For samples missing signal peak files, we retrieved alignment files (BAM) from ENCODE and performed peak calling using MACS2 [17] with default settings. If multiple replicates were available for a given sample, we merged peaks found in at least 50% of the replicates. Finally, we extracted and combined nucleotide positions covered by three or more distinct chromatin peaks across the full dataset, classifying these as highly reliable regulatory regions.

  3. Definition of promoter regions: Each transcript was associated with a single candidate promoter region. We chose a regulatory region present in at least 20 samples located within 1 kb either upstream or downstream of the annotated TSS as potential promoter region. If no region was called in 20 samples (such as for highly tissue-specific genes), we chose the regulatory region—whether upstream or downstream—that had the highest support number as the candidate promoter region. To control the total territory of the hybrid capture assay design and concentrate on core functional elements, we implemented several refinement steps for each transcript:

    1. Remove low coverage subpeaks and boundaries: Using the “findpeaks” function from the R package “pracma,” TSS-associated peaks were segmented into multiple subpeaks based on signal intensities from matching BigWig files. To minimize computational time, we selected 10 DNase-seq and 10 ATAC-seq samples whose peak sets covered at least 50% of the candidate peak regions. Subpeaks exhibiting minimum signals greater than half of the original peak’s maximum value were merged again. Subpeaks with maximum values at or below one-third of the original peak’s maximum were discarded. Additionally, subpeak boundaries displaying signal levels at or below the background signal were trimmed. The background signal was defined as the minimum of two values: twice the average of the lowest decile of peak signals and 20% of the peak’s maximum value for a given transcript.

    2. Exclude overlapping high-coverage exon regions, assuming that these would be adequately covered in exome sequencing: We obtained the PMC42-ET and PMC42-LA datasets (SRP216795) from the NCBI Sequence Read Archive (SRA) and aligned the reads to the hg38 human reference genome using bwa, followed by duplicate removal. Coverage for each exon was calculated using “samtools depth” and averaged; exons exhibiting coverage of ≥30X in both samples were identified as highly covered. Since these regions are generally sufficiently captured for mutation calling in exome sequencing, we excluded them from the candidate promoter regions.

    3. Add adjacent regions with extremely low coverage in WGS: To increase mutation calling sensitivity in promoter-proximal regions not covered by our chromatin-seq based approach (which is also subject to sequencing bias), we identified regions generally not well covered in WGS. We first identified all bases of low coverage, defined as ≥65% of 1111 samples without coverage from the PCAWG WGS dataset (pancan.bigwig value ≤722.15). Individual bases were merged into larger “low-coverage” regions such that the minimum fraction of covered samples was ≤30% and mean fraction of covered samples of ≤55%, with a minimum merged region size of 20 bp (smaller regions were not considered). Low-coverage regions located within 100 bp of candidate promoter regions were consolidated into candidate promoter regions.

    4. Merge promoter regions within 50 bp: Promoter regions located within 50 bp of each other were consolidated into single regions.
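Three of the rules above can be sketched in code: the three-peak support requirement, the background signal definition, and the 50 bp merge. This is a minimal illustration, not the authors' pipeline. The half-open interval convention, the function names, and the reading of "within 50 bp" as a gap of at most 50 bases are assumptions.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), an assumed convention


def regions_with_support(peaks: List[Interval], min_support: int = 3) -> List[Interval]:
    """Maximal intervals covered by at least `min_support` peaks (sweep line)."""
    events = []
    for start, end in peaks:
        events.append((start, 1))
        events.append((end, -1))
    # Process starts before ends at the same coordinate so abutting peaks
    # do not split a region.
    events.sort(key=lambda e: (e[0], -e[1]))
    regions, depth, region_start = [], 0, None
    for pos, delta in events:
        prev, depth = depth, depth + delta
        if prev < min_support <= depth:
            region_start = pos
        elif depth < min_support <= prev:
            if pos > region_start:
                regions.append((region_start, pos))
            region_start = None
    return regions


def background_signal(peak_signals: List[float]) -> float:
    """min(2 x mean of the lowest decile of signals, 20% of the peak maximum)."""
    ordered = sorted(peak_signals)
    n_low = max(1, len(ordered) // 10)
    low_mean = sum(ordered[:n_low]) / n_low
    return min(2 * low_mean, 0.2 * ordered[-1])


def merge_within(regions: List[Interval], max_gap: int = 50) -> List[Interval]:
    """Merge regions separated by a gap of at most `max_gap` bases."""
    merged: List[Interval] = []
    for start, end in sorted(regions):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


print(regions_with_support([(0, 10), (5, 15), (8, 20), (9, 12)]))
print(background_signal([float(x) for x in range(1, 21)]))
print(merge_within([(100, 200), (240, 300), (400, 500)]))
```

The sweep-line form avoids per-base counting, which matters when the input is peak sets from many samples.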

Due to the requirements of the probe design pipeline, the designed promoter regions were then converted from hg38 to hg19 using UCSC liftOver [18].

Sample preparation

DNA for the lymphoblastic cell line GM12878 was obtained from the Broad Institute Genomics Platform. DNA for eight cancer cell lines was obtained from commercial providers. FFPE breast cancer lymph node metastases and matched normal lymph nodes were obtained from the Instituto de Enfermedades de la Mama, FUCAM A.C., Mexico City, after informed consent as previously described [11].

This project was approved by the Broad Institute Office for Research Subjects Protection (DFCI protocol 15-370, Broad IRB-8627, Broad NHSR-6955).

Library construction

An aliquot of genomic DNA (50–200 ng in 50 µl) from the GM12878 cell line was used as the input into DNA fragmentation. Shearing was performed acoustically using a Covaris focused-ultrasonicator, targeting 150 bp fragments. Library preparation was performed using a commercially available kit provided by KAPA Biosystems, NEB Q5 High Fidelity, or Invitrogen SuperFi. For all library kits evaluated, IDT’s duplex UMI Stubby-Y adapters were used. Unique 8-base dual index sequences embedded within the P5 and P7 primers (IDT) were added during PCR. Bead-based purification clean-ups were performed using Beckman Coulter AMPure XP beads with elution volumes reduced to 30 µl to maximize library concentration. Library quantification was performed using the Invitrogen Quant-It broad-range dsDNA quantification assay kit. Following quantification, each library was normalized to a concentration of 35 ng/µl.

Hybrid selection

Hybridization and capture were performed using the relevant components of IDT’s XGen hybridization and wash kit following the manufacturer’s suggested protocol, with minor exceptions. A set of 12-plex prehybridization pools was created by equivolume pooling of the normalized libraries, Human Cot-1, and IDT XGen blocking oligos. The prehybridization pools underwent lyophilization using the Biotage SPE-DRY. Post lyophilization, custom promoter panel bait (Twist Bioscience) along with hybridization master mix was added to the lyophilized pool prior to resuspension. Samples were hybridized with bait probes overnight at a range of temperatures as indicated for optimization testing. PCR was performed post capture to amplify the enriched library. Library pools were quantified using qPCR and normalized to 2 nM.

Sequencing

Cluster amplification of library pools was performed according to the manufacturer’s protocol (Illumina) using Exclusion Amplification cluster chemistry and HiSeq X flowcells. Flowcells were sequenced on v2 Sequencing-by-Synthesis chemistry for HiSeq X flowcells. The flowcells were then analyzed using RTA (v.2.7.3 or later). Each pool of libraries was run on paired 151 bp runs, reading the dual-indexed sequences to identify molecular indices, and was sequenced across the number of lanes needed to meet coverage for all libraries in the pool. Paired-end reads were aligned to the human reference genome (hg19) with bwa (v0.7.15). Sequencing statistics were obtained with Picard Alignment and CollectHsMetrics.

Calculation of uncovered bases

To calculate uncovered bases, we ran samtools depth on alignments from the optimization stage, counting only bases with quality scores of 30 or higher. A base was called uncovered if it was covered by ≤30 reads. KDM5D, the only target on the Y chromosome, was excluded from the calculation due to use of a female cell line (GM12878) for assay optimization.
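As a minimal sketch of this tally, the function below reads the three-column, tab-separated output of samtools depth (chromosome, position, depth) and counts positions at or below the 30-read threshold. It assumes the depth file was produced with the base-quality filter already applied and with zero-depth target positions reported (for example with `samtools depth -a -q 30 -b targets.bed`, where `targets.bed` is a hypothetical file name). Skipping all Y-chromosome rows stands in for excluding KDM5D, and the function name is illustrative.

```python
from typing import Iterable, Tuple


def count_uncovered(
    depth_lines: Iterable[str],
    max_depth: int = 30,
    excluded_chroms: frozenset = frozenset({"Y", "chrY"}),
) -> Tuple[int, int]:
    """Return (uncovered, total) positions from samtools depth text output.

    A position is uncovered when its depth is <= max_depth. Rows on
    excluded chromosomes are skipped entirely.
    """
    uncovered = 0
    total = 0
    for line in depth_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        chrom, _pos, depth = line.split("\t")[:3]
        if chrom in excluded_chroms:
            continue
        total += 1
        if int(depth) <= max_depth:
            uncovered += 1
    return uncovered, total


# Toy rows in samtools depth format; real input would be read from a file.
toy_rows = ["1\t100\t30", "1\t101\t31", "1\t102\t0", "Y\t5\t0"]
print(count_uncovered(toy_rows))
```

Without the all-positions option, bases with zero depth are absent from the output and would be silently left out of the count.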

Quality control

Sequencing metrics were obtained from Picard in the standard Broad Institute processing pipeline (https://broadinstitute.github.io/picard/). GC content and mean coverage for the panel were obtained using Picard, with the CollectHsMetrics tool employed to calculate hybrid selection metrics. ContEst [19] was used to calculate contamination with foreign DNA, and tumor-in-normal content was quantified with deTiN [20]. Samples with contamination fraction >0.02 and tumor-in-normal fraction >0.3 were removed from the analysis set.

Point mutation calling

Single-nucleotide variants (SNVs) were called with MuTect and Strelka2 with default parameters [14, 21]. Strelka2 somatic calls were retained if supported by at least two variant reads. Insertions and deletions (indels) were called with Strelka2 with default parameters. Somatic indels labeled “PASS” were used for downstream analysis.

The union of SNVs from both callers was passed through the following filtering steps: filtering for oxoG [22] and FFPE artifacts (https://app.terra.bio/#workflows/getzlab/OrientationBiasFilter/1), removal of variants whose reads align to other genomic sites (“BLAT” filter), and removal of variants present in normal panels (using a custom panel of normals). SNVs were further selected for depth of alternative alleles >5, minimum variant allele fraction (VAF) >0.02, and VAF (tumor)/VAF (normal) >5. SNVs within germline indels (called by Strelka2 with default parameters) were removed as likely alignment artifacts. For the remaining SNVs located within 50 bp of any germline or somatic indel, manual review was applied to confirm real mutations. Finally, only SNVs and indels in targeted promoter regions were considered for analysis.
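The three numeric thresholds in this paragraph can be written as a small predicate. This is a sketch only, and the record layout and names are hypothetical. The text does not say how a normal VAF of zero is handled, so the ratio test is written here as a multiplication, which lets such variants pass. That choice is an assumption.

```python
from dataclasses import dataclass


@dataclass
class SNV:
    """Read counts for one candidate SNV (hypothetical record layout)."""
    tumor_alt: int
    tumor_depth: int
    normal_alt: int
    normal_depth: int


def vaf(alt: int, depth: int) -> float:
    """Variant allele fraction; 0.0 when there is no coverage."""
    return alt / depth if depth > 0 else 0.0


def passes_thresholds(
    v: SNV, min_alt: int = 5, min_vaf: float = 0.02, min_ratio: float = 5.0
) -> bool:
    """Alt depth > 5, tumor VAF > 0.02, and tumor VAF / normal VAF > 5."""
    tumor_vaf = vaf(v.tumor_alt, v.tumor_depth)
    normal_vaf = vaf(v.normal_alt, v.normal_depth)
    if v.tumor_alt <= min_alt or tumor_vaf <= min_vaf:
        return False
    # Equivalent to tumor_vaf / normal_vaf > min_ratio when normal_vaf > 0;
    # a normal VAF of 0 passes (assumption, avoids division by zero).
    return tumor_vaf > min_ratio * normal_vaf


candidates = [
    SNV(tumor_alt=12, tumor_depth=400, normal_alt=0, normal_depth=350),
    SNV(tumor_alt=5, tumor_depth=100, normal_alt=0, normal_depth=100),
    SNV(tumor_alt=50, tumor_depth=500, normal_alt=10, normal_depth=400),
]
print([passes_thresholds(c) for c in candidates])
```

The artifact, BLAT, and panel-of-normals filters depend on external tools and data and are not modeled here.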

Copy number calling

Copy number variants (CNVs) in cell lines were identified using CNVkit with default parameters [23]. For cell lines without matched normal samples, CNVs were called by comparing tumor samples to a reference panel constructed from blood-derived normal cell lines HCC1143BL and HCC1187BL. For FFPE tumor samples and cell lines with matched normals, CNVs were called using FACETS [24] with default parameters.

Motif analysis

To assess the impact of noncoding mutations on transcription factor (TF) binding, we compared predicted TF binding motifs in reference and mutant genomic contexts. First, we extracted ±20 bp flanking sequences surrounding each mutation from the hg19 reference genome using BEDTools [25]. Mutant sequences were generated by substituting the reference allele with the alternate allele. We then predicted TF binding motifs in both reference and mutant sequences using FIMO (MEME Suite v5.5.7) [26] with the JASPAR 2024 core vertebrate nonredundant motif database [27]. Binding sites were identified using a P-value threshold of 10⁻⁴. Gains and losses of TF motifs were defined as motifs uniquely present in either the mutant or reference sequence, respectively, based on set comparisons of FIMO outputs. Motifs with altered binding affinity (present in both but with significantly shifted P-values) were recorded but not used for categorical classification in this study.
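The allele substitution and the set comparison can be sketched as follows. The code assumes the ±20 bp reference window has already been extracted and that motif hits passing the P-value threshold are available as sets of motif names. It does not call FIMO or BEDTools, and the function names and the motif names in the example are illustrative only.

```python
from typing import Dict, Set


def mutate_window(ref_window: str, flank: int, ref: str, alt: str) -> str:
    """Substitute `ref` with `alt` at offset `flank` in the reference window."""
    observed = ref_window[flank:flank + len(ref)]
    if observed.upper() != ref.upper():
        raise ValueError(f"reference allele mismatch: expected {ref}, found {observed}")
    return ref_window[:flank] + alt + ref_window[flank + len(ref):]


def motif_changes(ref_hits: Set[str], mut_hits: Set[str]) -> Dict[str, Set[str]]:
    """Classify motifs as gained, lost, or shared between the two sequences."""
    return {
        "gained": mut_hits - ref_hits,  # present only in the mutant sequence
        "lost": ref_hits - mut_hits,    # present only in the reference sequence
        "shared": ref_hits & mut_hits,  # present in both; affinity may still differ
    }


# Toy window: 20 flanking bases on each side of a C>T substitution.
window = "A" * 20 + "C" + "G" * 20
mutant = mutate_window(window, flank=20, ref="C", alt="T")
print(mutant)

# Hypothetical hit sets, as if parsed from two motif-scan result tables.
print(motif_changes({"ETS1", "SP1"}, {"SP1", "GABPA"}))
```

Checking the reference allele before substitution catches coordinate or strand mismatches early, which would otherwise produce spurious motif gains and losses.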

Microsatellite instability analysis

Microsatellite sites in targeted promoter regions were called by MSIsensor-pro [28]. For MSI status in cell lines, the “tumor-only” mode was run with a reference baseline constructed from four normal cell line profiles (two different library preparations of GM12878, HCC1143BL replicate 1, HCC1187BL). MSI detection in patient samples and three paired cell lines (including two replicates of HCC1143) was performed in paired tumor-normal mode.

Mutational signatures

We used SigMA (v2.0; code available at https://github.com/parklab/SigMA) to detect COSMIC SBS3 (homologous recombination deficiency), APOBEC, and clock-like signatures from panel data. To optimize SBS3 detection, we followed the strategy described in [29]. Specifically, we simulated targeted panels by downsampling mutations from 550 whole-genome sequenced breast cancers from the International Cancer Genome Consortium (ICGC) to match the promoter target regions. These reference matrices were added to the SigMA package as a new parameter set. We then followed the workflow in the example tuning script test_tune_example.R, which involves:

  1. Trinucleotide frequency adjustment: Correcting for differences between panel target regions and the whole genome, where COSMIC mutational signatures are defined. This step is critical for promoter panels, as their GC-rich regions strongly influence mutation distributions.

  2. Mutation count optimization: Calibrating panel simulations to account for differences in mutation burden. Additional mutations from WES exonic regions of the same ICGC samples were incorporated to reflect the increased sensitivity of targeted panels (owing to higher sequencing depth) compared to WGS.

  3. Classifier training: Training a gradient boosting classifier on the simulated samples. WGS SBS3 labels were used as truth, while features included likelihoods of WGS cluster membership, cosine similarity, and exposures estimated via non-negative least squares. From the simulations, we estimated sensitivity and false positive rates (FPR), and defined high-confidence and low-confidence thresholds as SBS3 score cutoffs corresponding to 5% and 10% FPR, respectively. Applying these thresholds to promoter panel data, we identified seven high-confidence and one low-confidence SBS3-positive samples.
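The threshold step in item 3 can be illustrated independently of SigMA. Given classifier scores for simulated samples known to be SBS3-negative, a cutoff for a target FPR is the score that at most that fraction of negatives exceed. The sketch below is generic and is not SigMA's implementation. The function names and the strict "score > cutoff" calling rule are assumptions.

```python
from typing import Sequence


def cutoff_at_fpr(negative_scores: Sequence[float], target_fpr: float) -> float:
    """Score cutoff such that at most `target_fpr` of negatives score above it."""
    if not negative_scores:
        raise ValueError("need at least one negative score")
    ordered = sorted(negative_scores, reverse=True)
    # Number of negatives allowed to exceed the cutoff (small epsilon guards
    # against floating-point error in the product).
    allowed = int(target_fpr * len(ordered) + 1e-9)
    return ordered[min(allowed, len(ordered) - 1)]


def confidence_label(score: float, high_cutoff: float, low_cutoff: float) -> str:
    """Assign a call using strict 'score > cutoff' comparisons."""
    if score > high_cutoff:
        return "high-confidence"
    if score > low_cutoff:
        return "low-confidence"
    return "negative"


# Toy negatives: 100 simulated scores 0..99.
negatives = [float(i) for i in range(100)]
high = cutoff_at_fpr(negatives, 0.05)
low = cutoff_at_fpr(negatives, 0.10)
print(high, low)
print([confidence_label(s, high, low) for s in (96.0, 92.0, 50.0)])
```

Deriving both cutoffs from the same simulated negatives keeps the two confidence tiers nested, so every high-confidence call also clears the low-confidence threshold.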

Additionally, we assessed APOBEC activity (SBS2 and SBS13) by selecting samples with >0.15 probability of belonging to one of the two APOBEC-high WGS clusters (out of 12 breast cancer clusters). For the remaining SBS3- and APOBEC-negative samples, we evaluated other breast cancer signatures (SBS8, SBS17, SBS18), none of which were detected. These samples were most compatible with clock-like signatures, based on their highest-likelihood WGS cluster matches.

Results

Systematic sequencing coverage depletion at gene promoters in WGS

We have previously shown that the PCAWG dataset suffers from loss of mutation calling power in gene promoters [3] due to low sequencing depth (coverage). To assess promoters of genes with known roles in cancer, we measured the fraction of samples in this dataset with sufficient coverage for mutation calling (see the “Materials and methods” section). We found that among 602 cancer genes [3], 37% of promoters were on average not sufficiently covered (at 80% power) to sensitively detect somatic mutations, compared to only 3% of protein-coding exons (Fig. 1A). Importantly, sequence coverage patterns vary strongly around the TSS, where the DNA is rich in recognition motifs for transcriptional regulators. Indeed, hundreds of cancer gene promoters exhibit uneven and sometimes complete lack of local coverage around the TSS (Fig. 1B). For instance, the promoters of KMT2C (MLL3) and AKT1 exhibit strong coverage loss directly upstream or downstream of the TSS (Fig. 1C). For seven promoters (SSX2, SHTN1, NUTM2A, NSD2, SSX4, BRD3, and NUTM2B), the region proximal to the TSS contains nearly no information (covered in <20% of samples; Supplementary Table S1). To address these systematic challenges, we designed and evaluated a hybrid capture assay to specifically capture and deeply sequence promoter regions of cancer genes.

Figure 1.

Low coverage in promoter regions in whole-genome cancer sequences from PCAWG. (A) Fraction of 1111 PCAWG tumor samples sufficiently covered for mutation calling as a function of the percentage of genomic elements in each category. Horizontal dashed line indicates 80% mutation calling power. Vertical dashed lines define the percentage of genomic elements with <80% power as “not covered” [3% of coding exons (also highlighted with red arrow), 37% of promoters]. (B) Number of PCAWG samples sufficiently covered for powered mutation calling in promoter regions of cancer genes, centered around the TSS. White indicates few samples with coverage, dark blue indicates full coverage. (C) Number of PCAWG covered samples at the promoter regions of KMT2C and AKT1. Gray shading indicates regions of particularly low coverage in each gene promoter.

Design of a hybrid capture assay for human cancer gene promoters

A challenge for targeted sequencing of promoters is that they are not well defined by sequence alone, with critical regulatory elements positioned upstream as well as downstream of the TSS. Prior definitions of promoters in studies searching for somatic mutations have used fixed distances around the TSS, potentially missing important regulatory sites while decreasing power by including additional, nonregulatory sequence in the statistical evaluation [3, 11]. When a promoter is active, its chromatin is opened to allow TFs and the initiation complex to bind. Open chromatin can be readily measured genome-wide with several strategies, including DNase-seq and ATAC-seq, and profiles for many cell lines and primary tissues are available today [30, 31]. Our hybrid capture design is based on these open chromatin features, optimizing capture of the most likely functional regions for each cancer gene (Fig. 2A and Supplementary Fig. S1A).

Figure 2.

Design of a targeted sequencing panel for promoter regions. (A) Schematic of the design process (also see Supplementary Fig. S1A). (B) Representative cancer genes showing enrichment of DNase-seq and ATAC-seq data at promoter assay regions in multiple cell and tissue contexts. Black bars indicate the target regions on the hybrid capture assay. (C) Length distribution of all promoter regions included in the promoter panel. (D) Distribution of the number of promoter regions included per gene. Multiple promoters represent alternative start site isoforms. (E) Distribution of GC content in promoters included in the hybrid capture design and protein-coding exons from the same cancer genes.

To define likely promoter regions, we first collated genes with previously described roles in cancer from different sources and annotated their exact TSS (see the “Materials and methods” section). We then inferred the regulatory sequence for each TSS using published open chromatin profiles from 112 normal and malignant cell lines and primary tumors (Supplementary Fig. S1A, Supplementary Table S2, and the “Materials and methods” section). A regulatory sequence was defined as a region that was marked with open chromatin in at least three samples (excluding technical or biological replicates) and located within 1000 bp of the annotated TSS of a cancer gene. To keep the total territory on the assay limited, we trimmed enrichment peak regions that extended into protein-coding exons that were covered (>30×) in exome sequencing datasets (see the “Materials and methods” section), reasoning that these regions can be adequately assessed with current exome sequencing technologies. Finally, we extended open chromatin peak regions into promoter-adjacent UTR and protein-coding regions with extremely low coverage (≤30% of samples covered in the PCAWG study [12]). Examples for KMT2A, MTOR, MYC, and TP53 illustrate the location and extent of promoter regions inferred through this process (Fig. 2B). In total, we derived 3167 promoter regions for 2343 cancer genes (Supplementary Table S3) and a total combined genomic territory of ~2 Mb. The size range for these promoter regions was 121 to 6297 bp with a mean of 637 bp and median of 516 bp (Fig. 2C). Most genes (1783, 76%) had one assigned promoter, while 560 (24%) had two or more, representing isoforms with differing transcription initiation sites (Fig. 2D). About 8.5% of assay sequence overlapped with annotated GC-rich protein-coding sequence not otherwise well covered in exomes [32]. By design, the GC content of the panel (median value = 0.6371, mean value = 0.6188) was significantly higher than that of exons (median value = 0.4935, mean value = 0.4928) from promoter assay target genes (Fig. 2E; t-test P < 2.2e-16).
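For reference, GC content of a region is the fraction of G and C among its called bases. The sketch below shows that computation and the median and mean summaries over a set of regions. Ignoring ambiguous bases such as N is an assumption, and the sequences are toy examples rather than panel targets.

```python
from statistics import mean, median
from typing import Sequence


def gc_content(seq: str) -> float:
    """Fraction of G/C among A, C, G, T bases; other symbols (e.g. N) are ignored."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        raise ValueError("sequence contains no called bases")
    return (seq.count("G") + seq.count("C")) / called


def summarize(seqs: Sequence[str]) -> dict:
    """Median and mean GC content across a collection of sequences."""
    values = [gc_content(s) for s in seqs]
    return {"median": median(values), "mean": mean(values)}


toy_promoters = ["GGCGCCAT", "CCGGGCGC", "GCGCATGC"]
toy_exons = ["ATGCATTA", "GATTACAG", "ATATGCGC"]
print(summarize(toy_promoters))
print(summarize(toy_exons))
```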

Optimization of assay conditions

Given the challenge of the elevated proportion of GC-rich regions contained in the promoter panel regions of interest, two rounds of probe design were required before considering any assay laboratory parameter tuning. Once adequate baits were obtained, we systematically optimized and evaluated the coverage performance of the promoter hybrid capture design on DNA from the EBV-transformed human B-lymphocyte cell line GM12878. To establish baseline performance, the default conditions and laboratory parameters typical for other targeted capture panels were deployed. As expected, some regions showed limited coverage, warranting assay optimization.

Our improvement efforts focused on two key assay components: library amplification and hybridization temperature. Because GC-rich regions are often underrepresented due to PCR amplification bias [33, 34], we attempted promoter target enriched sequencing using three high-fidelity polymerase enzymes: Kapa HiFi, Thermo Fisher Platinum SuperFi, and New England Biolabs (NEB) Q5 High-Fidelity. Only two of the three PCR enzymes (Kapa HiFi, NEB Q5) produced adequate library quantities to proceed into hybridization and sequencing. To compare the enzymes for suitability in the promoter assay, we measured the estimated library size (a measure of library complexity) and estimated GC dropout using Picard CollectHsMetrics. In both measures, the Kapa HiFi HotStart PCR kit outperformed the other enzyme tested (Fig. 3A and B). Specifically, with the same amount of input DNA and sequencing capacity attempted per sample, Kapa HiFi showed higher estimated library size and lower GC dropout rate than the NEB Q5 High-Fidelity Kit and notably also produced higher yields of library for capture input (Fig. 3A and B, and Supplementary Table S4).

Figure 3.

Performance evaluation of the promoter assay in cell line GM12878. (A) Estimated library size (calculated by Picard) for promoter assay targets for two different enzymes (Kapa HiFi, NEB Q5). (B) GC dropout (calculated by Picard) for promoter assay targets. (C) Hybridization temperature for Kapa HiFi library preparations and resulting number of uncovered bases on the capture panel. (D) Estimated library size as a function of hybridization temperature (Kapa HiFi). (E) Fraction of selected bases as a function of hybridization temperature (Kapa HiFi). (F) Correlation between GC content and mean coverage per promoter region for promoter targets.

To further optimize hybridization conditions, we also systematically tested a range of overnight hybridization temperatures. We aimed to improve the coverage of challenging targets while maintaining library complexity and on-target enrichment. We compared the performance of NEB Q5 and Kapa HiFi library prep kits across several hybridization temperatures ranging from 57.9°C to 66.9°C. We observed that temperatures below 61°C yielded improved coverage with fewer uncovered bases (Fig. 3C; Supplementary Table S4), while still supporting high library complexity (Fig. 3D) and stable on-target rates (Fig. 3E). Based on these results, the cancer gene promoter assay was launched using the Kapa HiFi library preparation strategy and a final overnight hybridization temperature of 58.6°C.

These final promoter assay conditions were demonstrated to achieve 99.1% of target bases with at least 100-fold read depth (Supplementary Table S4). Moreover, the promoter assay performance did not show the characteristic coverage decrease with increased GC content (Fig. 3F); indeed, targets with higher GC content tended to have higher mean coverage, reflecting the benefits of the optimized protocol.

Power to detect variants from profiling promoter regions

We next sequenced the promoter regions of eight cancer cell lines from different tumor types to assess different cancer-cell-specific metrics and ability to detect known mutations. The median coverage across all targets was >1400× with a GC-dropout rate below 5% for most samples (Fig. 4A, Supplementary Fig. S2A, and Supplementary Table S5). An average of 99% of targets were covered at least 100-fold (Supplementary Fig. S2B and Supplementary Table S5), ensuring sufficient power for identifying somatic mutations in virtually all promoter regions. Notably, the promoters of genes with known recurrent mutations, including FOXA1, TERT, TP53, and BCL2, showed robust coverage patterns [1, 2, 3, 11, 35, 36] (Fig. 4B). Two of the profiled cell lines derived from gliomas, A172 and HS 683, harbor known heterozygous mutations in the promoter of the telomerase gene (TERT) at two canonical hotspot sites. These two hotspot mutations, c.-124C>T (C228T [hg19]) in A172 and c.-146C>T (C250T [hg19]) in HS 683, were detected with extremely high coverage (Fig. 4C). To evaluate the unbiased detection of somatic variants in promoter regions, we sequenced two triple-negative breast cancer (TNBC) cell lines, HCC1143 (in duplicate) and HCC1187, for which matched normal blood cell lines are available (HCC1143BL and HCC1187BL, respectively). This strategy allows for the detection of tumor-specific mutations and copy number alterations (see the “Materials and methods” section). Coverage, copy number of promoters, as well as point mutations were highly concordant between the two HCC1143 replicates (Fig. 4D–F and Supplementary Fig. S2C). One discordant mutation had a very low VAF of 2%, likely evading detection in the second, lower coverage replicate where individual variant reads were also detected upon manual inspection. Similarly, comparison of the HCC1187 cell line with WGS from the Cancer Cell Line Encyclopedia [37] revealed concordant copy number profiles (Supplementary Fig. S2D) and complete overlap of called variants [10 SNVs, 2 insertions and 2 deletions (indels); Fig. 4G], with a single low VAF (3%) mutation in the ANK2 promoter exclusively in the promoter assay. These findings suggest that promoter mutation patterns captured by deep targeted sequencing are highly reproducible and that despite the small genomic territory (2 Mb) covered by promoters, copy number patterns mirror those obtained by whole exome and WGS remarkably well, presumably because selection of copy number changes is driven by cancer genes together with their regulatory promoters.

Figure 4.

Promoter mutations in cancer cell lines. (A) Mean target coverage for promoter targets in eight cancer and two matched normal cell lines. (B) Whole-genome and promoter assay sequencing read coverage at several cancer gene promoters. (C) Sequencing read coverage and detection of two known hotspot mutations, c.-124 C > T (C228T; hg19) and c.-146 C > T (C250T; hg19) in the TERT promoter in the glioblastoma cell lines A172 and HS 683. (D) Per-target coverage depth comparison between two separate replicates of the HCC1143 breast cancer cell line. Red dots indicate individual targets. R-value and P-value were calculated using the Pearson correlation test. (E) Correlation of log2-normalized copy number ratios for cancer gene promoters between two independent sequencing runs for HCC1143. Red dots indicate clustered segments. R-value and P-value were calculated using the Pearson correlation test. (F) Comparison of SNVs and indels identified between two hybridization and sequencing experiments for the HCC1143 cell line, called against the matched normal (HCC1143BL). (G) Comparison of SNV and indel mutations between promoter assay and promoter regions obtained from WGS (obtained from DepMap) for the HCC1187 cell line, called against the matched normal (HCC1187BL). (H) Total number of SNV and indel variants identified in promoter regions in HCC1143 (biological replicate 1) and HCC1187 cell lines. Somatic mutations were called against their matched normal blood cell lines (HCC1143BL and HCC1187BL, respectively). (I) Subsampling of sequencing reads at the CDK4 promoter locus shows a somatic mutation ~200 bp downstream of the TSS (arrow) in the HCC1143 cell line. Coverage depth is indicated for each sample. (J) Subsampling of sequencing reads at the SMAD3 promoter locus shows a somatic mutation ~200 bp upstream of the TSS (arrow) in the HCC1143 cell line. Coverage depth is indicated for each sample. Gene expression for SMAD3 obtained from the DepMap portal demonstrates extreme expression of this gene in HCC1143 compared to other breast cancer cell lines.

We then investigated specific promoter events in the two TNBC models. In total, we detected 24 single-nucleotide variants, 2 insertions, and 3 deletions in 28 gene promoters (Fig. 4H and Supplementary Table S6). Promoter mutations included variants in BRD4 (VAF = 0.038) and CDK4 (VAF = 0.163) in the HCC1143 cell line; both of these genes are amplified in this cell line, thus the relatively low VAFs suggest that the promoter mutations are not located on the amplified allele, potentially creating a “two-hit” situation. The G > C mutation in the CDK4 promoter destroyed an RFX7 binding motif (Fig. 4I). CDK4 amplification and protein-coding mutations have been associated with resistance to CDK4 inhibitors, and the HCC1143 cell line is known to be resistant to both abemaciclib and palbociclib [38, 39]. To our knowledge, this is the first report of a CDK4 promoter mutation in a CDK4 inhibitor-resistant cell line.
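The allele-fraction argument made above for BRD4 and CDK4 can be illustrated with a short worked sketch. It assumes a clonal mutation and a simple model in which the expected VAF is the number of mutated copies divided by the total number of copies at the locus, adjusted for tumor purity (1.0 for a cell line). The function name and the copy numbers are hypothetical; they are not values measured in this study.

```python
def expected_vaf(mutated_copies: int, total_copies: int,
                 purity: float = 1.0, normal_copies: int = 2) -> float:
    """Expected variant allele fraction of a clonal mutation.

    purity is the fraction of tumor cells in the sample; the remaining cells
    are assumed to carry `normal_copies` unmutated copies of the locus.
    """
    if total_copies <= 0 or not 0 < purity <= 1:
        raise ValueError("total_copies must be > 0 and purity in (0, 1]")
    if not 0 <= mutated_copies <= total_copies:
        raise ValueError("mutated_copies must be between 0 and total_copies")
    tumor_alleles = purity * total_copies
    normal_alleles = (1 - purity) * normal_copies
    return purity * mutated_copies / (tumor_alleles + normal_alleles)


# Hypothetical locus amplified to six copies in a pure cell line.
on_single_copy = expected_vaf(1, 6)  # mutation on the non-amplified allele
on_amplified = expected_vaf(5, 6)    # mutation carried on the amplified allele
print(f"{on_single_copy:.3f} {on_amplified:.3f}")  # 0.167 0.833
```

Under this model, a mutation residing on the amplified allele would be expected at a high VAF, whereas a low VAF at an amplified locus is more compatible with a mutation on the non-amplified allele or with a subclonal event.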

We further detected a near-heterozygous C > T mutation in the SMAD3 promoter (replicate 1 VAF = 0.395; replicate 2 VAF = 0.44; Fig. 4J). This mutation creates a motif for ZNF343. Although the role of ZNF343 in cancer remains unclear, KRAB-domain-containing TFs, including ZNF343, are often implicated as either oncogenes or tumor suppressors in different cancer types [40, 41]. We observed extremely high expression of SMAD3 in HCC1143 compared to other breast cancer cell lines (Fig. 4J). Importantly, we did not detect focal or broad amplification of SMAD3 that could otherwise explain increased expression (Supplementary Fig. S2C, blue label). SMAD3 has a known role as a key driver of cancer metastasis [42, 43]. It forms an oligomeric complex with SMAD4 to mediate TGF-β signaling, a pathway that regulates cell proliferation, tumor microenvironment, and tumor metastasis [44]. In TNBC cell lines, knockdown of SMAD3 has been shown to suppress epithelial–mesenchymal transition signaling and significantly reduce cell proliferation and invasive capacity [45]. Further functional studies will be needed to elucidate the connections between the promoter mutation, ZNF343 binding (or a related zinc finger), and SMAD3 expression.

In the HCC1187 cell line, our assay detected a mutation in a TP53-associated promoter region. As this mutation also causes a protein change (p.G108del; Supplementary Table S6), additional studies are needed to determine whether the effect of this mutation is on the protein or DNA regulatory level, or both. These findings demonstrate that deep promoter sequencing can nominate novel driver candidates associated with known biology in breast cancer cell lines.

Promoter profiling of breast cancer lymph node metastases

Finally, we sequenced the promoters of 49 formalin-fixed and paraffin-embedded (FFPE) breast cancer lymph node metastases with matched pathologically normal lymph nodes. Key sequencing metrics were excellent, with mean target coverage of 338× (range 149–477×) in normal and 345× (range 231–496×) in tumor samples (Fig. 5A and Supplementary Table S7). On average, 97% of promoter targets were covered at least 100× in both tumor and normal samples (range 68%–99%) (Fig. 5B and Supplementary Fig. S3A). No notable differences between normal and tumor samples were detected across several additional sequencing metrics, including GC dropout (Fig. 5C and Supplementary Fig. S3B). To identify copy number alterations and point mutations (SNVs and indels) affecting cancer gene promoters, we deployed our custom variant calling and filtering pipeline (the “Materials and methods” section).
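The two coverage metrics quoted above (mean target coverage and the fraction of targets reaching a minimum depth) can be computed from per-target depths as in the sketch below. This is an illustration only, not the metrics tool used in this study; the function name and the depth values are hypothetical.

```python
from statistics import mean
from typing import Iterable


def coverage_summary(target_depths: Iterable[float], min_depth: float = 100) -> dict:
    """Summarize per-target mean depths for one sample."""
    depths = list(target_depths)
    if not depths:
        raise ValueError("at least one target depth is required")
    covered = sum(1 for d in depths if d >= min_depth)
    return {
        "mean_target_coverage": mean(depths),
        "fraction_at_min_depth": covered / len(depths),
    }


# Hypothetical depths for four targets; one falls below the 100x threshold.
summary = coverage_summary([50, 150, 250, 350])
print(summary["mean_target_coverage"], summary["fraction_at_min_depth"])  # 200 0.75
```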

Figure 5.

Promoter mutations in FFPE samples of metastatic breast cancer lymph nodes. (A) Comparison of mean target coverage distribution between metastatic (red) and normal lymph nodes (blue) reveals no difference. (B) Fraction of covered targets (y-axis) at indicated level (x-axis) for tumor samples. Smaller dots indicate individual samples. Large dots indicate mean values for each category. (C) Comparison of GC dropout distributions between metastatic (red) and normal (blue) samples reveals no difference. (D) Relative copy number profiles for tumors compared to normals for all samples. Red indicates gain/amplification, blue shades indicate losses and deletions. Common breast cancer alterations (gain of 1q, 8q) are evident. (E) Fraction of amplified (red, left) and deleted (blue, right) known breast cancer genes in the metastatic lymph node dataset. (F) Mutational burden, including SNVs and indels, for all cases. (G) SNV mutational spectrum across the FFPE lymph node cohort. (H) Subsampling of sequencing reads at the GATA3 promoter locus shows a somatic mutation 521 bp upstream of the TSS (arrow) in case FUC_268. Coverage depth is indicated for each sample. Comparison of the wildtype (WT) and mutated DNA sequence (arrow) shows disruption of a canonical PGR motif (top) and creation of a SOX family motif (bottom). (I) Comparison of allele frequencies between promoter assay sequencing of lymph node metastases and prior “ExomePlus” sequencing of primary breast cancers [11] across seven samples included in both datasets. Blue line indicates fit, with shading demarcating the confidence interval. (J) VAF comparison between mutations discovered in both promoter and ExomePlus assays (left; “in both sets”) and those detected only in the deeply sequenced promoter assay (right). P-value calculated with the nonparametric Mann–Whitney U-test.

Copy number patterns derived from the panel recapitulated known broad patterns of breast tumors, including near ubiquitous gains of 1q and 8q and losses of chromosome 13 and 16 (Fig. 5D). Genes known to be amplified in breast cancer were also detected in this dataset, with high-level gains of the MYC promoter present in 12 cases, ERBB2 in 5 cases, FOXA1 in 2 cases, and CCND1 in 1 case. Copy losses in known breast cancer tumor suppressors were seen in TP53 and MAP2K4 (both on 17p, 22 cases), CDH1 (20 cases), RB1 (18 cases), and CDKN2A (4 cases; Fig. 5E). These findings demonstrate that copy number can be robustly detected in fixed tissue samples from a limited subset of cancer gene promoters.

Among the 49 metastatic lymph nodes, we detected 430 high-confidence SNVs and 45 indels (Fig. 5F). The mean point mutation count was 9 events per sample (range 0–57), corresponding to a mean mutation density of ~4.5 point mutations/Mb. This value is higher than what has been reported for primary breast cancer [46], but concordant with our previous estimate of promoter mutation densities [11]. As expected, most mutations were C > T transitions, with fewer C > A and C > G mutations (Fig. 5G). We have previously described recurrent hotspot mutations in the promoters of primary breast cancers [11]. Of these mutations, we detected TBC1D12 mutations (G > A; both at position −1 bp to TSS) at the canonical hotspot sites in two tumors and a CTNNB1 mutation in one of the previously reported genomic positions. The TBC1D12 mutation occurs at the exact position of the TSS of this gene, and we have previously suggested that it leads to an alternative start site upstream of the annotated gene [11]. Subsequent work has shown that this TBC1D12 mutation site is a substrate for APOBEC enzymes, which are frequently active in breast cancer [47]. Indeed, we found evidence of general APOBEC activity in this tumor (described below).
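The mutation density quoted above follows from dividing the mean per-sample mutation count by the size of the sequenced territory. The short sketch below reproduces that arithmetic, taking the promoter territory to be approximately 2 Mb as stated earlier in the Results; the function name is hypothetical.

```python
def mutation_density_per_mb(mutations_per_sample: float, territory_bp: float) -> float:
    """Mutations per megabase for a given sequenced territory in base pairs."""
    if territory_bp <= 0:
        raise ValueError("territory_bp must be positive")
    return mutations_per_sample / (territory_bp / 1e6)


mean_point_mutations = 9   # mean events per sample, as reported above
territory_bp = 2_000_000   # approximate promoter territory (2 Mb)
print(mutation_density_per_mb(mean_point_mutations, territory_bp))  # 4.5
```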

In addition, we discovered several novel singleton variants, including an SNV in the FOXA1 promoter/5′UTR (VAF = 0.36), an SNV upstream of the KDM5C TSS (VAF = 0.17), and a deletion in the GATA3 promoter. This deletion of two T nucleotides located 521 bases upstream of the breast cancer oncogene GATA3 was detected in sample FUC_268 with 71% of sequencing reads carrying the deletion variant (Fig. 5H). In this ER+/PR− tumor, the variant disrupts a canonical recognition site for the progesterone receptor (PGR) while creating a new binding motif for SOX family TFs (Fig. 5H). SOX family members play key roles in development and regulation of stemness phenotypes, and several SOX genes have been associated with altered expression patterns in breast cancer [48]. Activation of PGRs has been shown to downregulate GATA3 in hormone-receptor positive breast cancer cell lines [49]. The change in TF binding site introduced by the two-base deletion suggests a shift of GATA3 regulation away from PGR-mediated signaling toward SOX dependence, potentially increasing the expression of GATA3. Thus, promoter mutation is a potential additional activation mechanism of the GATA3 oncogene in breast cancer.

We further systematically compared promoter mutations for seven lymph node metastases (profiled with targeted promoter sequencing) with their primary tumors that were collected at the same time of surgery and sequenced with ExomePlus technology [11]. Reflecting the inherent biological and technical differences between the data sets, 12 mutations (12.5%) were shared between primary and metastatic cases with high concordance of their VAFs (Fig. 5I). In addition, deep promoter sequencing alone identified 96 mutations not seen in the primary cohort sequenced with ExomePlus technology. To assess whether the locations of these mutations had sufficient read coverage and were at least theoretically amenable to mutation calling, we measured the fraction of samples with coverage in the ExomePlus assay. Of the private mutations, 52% had no coverage on the ExomePlus capture assay. For the remaining mutations, the median VAF was significantly lower than for mutations detected on both assays (0.06 versus 0.27; P = 0.0008; Fig. 5J), suggesting that either only the deeper coverage of the promoter assay enabled the detection of these events, or that these mutations emerged or were enriched in the tumor cells that left the primary location and migrated to a lymph node.
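The VAF comparison above relies on the Mann–Whitney U-test. The sketch below shows one way such a test can be computed with the Python standard library only, using average ranks for ties and the normal approximation without continuity correction. It is an illustration under those assumptions, not the implementation used for Fig. 5J, and the VAF values are invented for demonstration.

```python
import math
from typing import Sequence, Tuple


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns the U statistic for sample x and the two-sided P-value.
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n = n1 + n2
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tied block
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_x, 1.0
    z = (u_x - mean_u) / math.sqrt(var_u)
    return u_x, math.erfc(abs(z) / math.sqrt(2))


# Invented VAFs: mutations private to the deep assay vs. found by both assays.
private_vafs = [0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.10]
shared_vafs = [0.18, 0.22, 0.25, 0.27, 0.31, 0.35]
u_stat, p_value = mann_whitney_u(private_vafs, shared_vafs)
print(u_stat, p_value < 0.05)
```

For small samples, an exact test is generally preferable to the normal approximation used here.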

Mutational processes identified from promoter sequencing

In addition to detecting point mutations and CNVs of cancer gene promoters, we investigated the possibility of inferring genome-wide processes from sequencing reads. Microsatellite instability (MSI) and homologous recombination repair deficiency (HRD) are examples of such processes caused by defects in DNA repair. MSI is characterized by small indels in specific genomic regions caused by deficient DNA mismatch repair and this process has been shown to predict response to inhibition of the immune checkpoint [50, 51]. We applied an MSI identification strategy previously developed for panel sequencing [28] to promoter sequencing of cell lines. In the analysis of 27 183 microsatellite sites within promoter assay regions, we confirmed MSI status in the RKO colon cancer cell line, a known MSI model, with all other tested lines being classified as microsatellite stable (MSS) (Fig. 6A).

Figure 6.

Mutational signatures obtained from promoter sequencing. (A) MSI scores for the eight cancer cell lines. Known patterns of MSI in RKO and MSS in other cell lines are detectable from promoter sequencing alone. (B) Signature inference from panel-derived SNVs for tumors and three cell lines with matched normals. Heatmap shows the likelihood for clock-like, HRD/SBS3, and APOBEC signatures in each sample. “Dominant signature” refers to the signature classification of each sample. H.c., high confidence; l.c., low confidence.

HRD is common in TNBCs and renders tumors susceptible to therapy with inhibitors of poly (ADP-ribose) polymerase (PARP) [52]. HRD leaves a characteristic mutation footprint in the cancer genome, and inference methods can identify HRD from somatic mutation patterns alone (known as single-base substitution [SBS] signature 3) [53, 54]. However, the extremely low SNV mutation count (mean of 9) from our small panel makes it difficult to identify mutational processes. We therefore used SigMA, a method optimized for targeted panel sequencing, to infer HRD and other mutational signatures [29]. Among 28 tumors and the three paired cell lines with sufficient mutations to call signatures (≥5), the SBS3 signature was detected in 7 tumors and the HCC1187 cell line as the dominant signature, with additional cases exhibiting moderate levels (Fig. 6B). In seven tumors and the two replicates for the HCC1143 cell line, the APOBEC signature was dominant, and one tumor (FUC_100) exhibited a combined SBS3 and APOBEC signature (Fig. 6B). The APOBEC-driven tumors included the case described above with APOBEC-associated TBC1D12 translation start site mutation. The remaining tumors for which signatures could be inferred were dominated by the clock-like “aging” signature. HRD status is clinically relevant for identifying optimal treatments, and APOBEC enzyme activity has recently been linked to the emergence of therapy-resistance mutations. The ability to identify activity of these mutational signatures highlights the potential clinical utility of limited sequencing of cancer gene promoter sequences.
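Signature analyses start from counts of substitution types, conventionally reported with the pyrimidine (C or T) as the reference base, and the analysis above was restricted to samples with at least five mutations. The sketch below illustrates only this bookkeeping with a simplified six-class spectrum. It is not SigMA and does not reproduce its input format or model; the function names and the example SNVs are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
MIN_SNVS_FOR_SIGNATURES = 5  # threshold stated in the text (>=5 mutations)


def substitution_class(ref: str, alt: str) -> str:
    """Collapse an SNV to one of six pyrimidine-centered classes, e.g. G>A -> C>T."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in COMPLEMENT or alt not in COMPLEMENT or ref == alt:
        raise ValueError(f"not a valid SNV: {ref}>{alt}")
    if ref in ("A", "G"):  # purine reference: report the opposite strand
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"


def six_class_spectrum(snvs: Iterable[Tuple[str, str]]) -> Counter:
    """Count SNVs per pyrimidine-centered substitution class."""
    return Counter(substitution_class(ref, alt) for ref, alt in snvs)


def enough_for_signatures(snvs: Iterable[Tuple[str, str]]) -> bool:
    """Apply the minimum mutation count used to decide whether to call signatures."""
    return len(list(snvs)) >= MIN_SNVS_FOR_SIGNATURES


# Hypothetical SNVs given as (reference base, alternate base).
sample_snvs = [("C", "T"), ("G", "A"), ("G", "C"), ("C", "A"), ("A", "G")]
print(six_class_spectrum(sample_snvs).most_common(1), enough_for_signatures(sample_snvs))
```

In this convention, a G > A change such as the TBC1D12 hotspot mutation is counted as C > T.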

Discussion

Genomic alterations in cancer encompass protein-coding mutations, noncoding mutations, copy number variations, and structural variants. However, most known variant interpretations to date have focused primarily on protein-coding mutations, due to their direct and often deleterious effects on gene function. Nonetheless, mutations in noncoding regulatory elements—particularly promoter regions—can profoundly influence the expression and regulation of cancer-associated genes. The example of the TERT hotspot promoter mutation has demonstrated the importance of nonprotein coding mutations, yet the number of such bona fide driver mutations remains low. This is in part due to lack of sequencing data for promoter regions, including in whole genome sequences afflicted with GC-bias and subsequent lack of read coverage in these regions. Our approach to deeply profile >3000 cancer gene promoter regions with an optimized, promoter-specific targeted assay overcomes these limitations.

We show that the assay achieves deep coverage even of challenging GC-rich regions, enabling somatic mutation calling even of subclonal mutations and those in heterogeneous tumor samples. This is particularly important for profiling metastatic and heavily infiltrated tumor specimens. Proof-of-principle sequencing of cancer cell lines shows that promoter profiling yields both known and novel promoter point mutations, can recover copy number profiles and provides information on mutational processes. Although we did not observe recurrence of promoter variants in our relatively small cohort, we identified several novel and potentially relevant driver mutation candidates in the promoters of breast cancer genes. These include an SNV mutation in the CDK4 promoter in the HCC1143 cell line, which is known to be resistant to CDK inhibition; a promoter mutation in SMAD3 in the same cell line, which has extreme outlier expression of this gene; and a deletion in the GATA3 promoter in an HR-positive breast cancer that disrupts a canonical PGR binding site while creating a SOX family motif. We hypothesize that the absence of recurrent promoter mutations in certain datasets may be due to the indirect and multifactorial nature of gene regulation.

In recent years, it has become clear that noncoding mutations are far less frequent than protein-coding mutations in cancer, even when accounting for our current technical limitations to completely chart them [3, 55]. Of the few recurrent non-coding mutations that have been discovered so far, many can be explained by passenger mutational processes, suggesting they are not tumor drivers. Biological reasons and the selective advantages of high-impact protein-coding mutations and copy number variations likely underlie the observed paucity of non-coding compared to coding driver mutations. Copy number variations, especially in the form of highly abundant extrachromosomal DNA, provide a much stronger gene expression boost and may be more readily selected during tumor evolution than single-nucleotide promoter mutations. This selective advantage could reduce the observed frequency of promoter mutations. These challenges highlight the importance of jointly analyzing promoter mutations and CNVs to better elucidate their complementary roles in gene regulation and cancer development.

Despite this, promoters are the most likely functional targets for regulatory mutations because they directly affect the expression of the regulated gene and might provide a higher degree of control than whole gene amplification or deletion. Owing to the redundancy and robustness of regulatory sequence and often concerted interaction with distal elements to fine-tune expression, truly disruptive, tumor suppressive mutations that abolish gene expression are likely to be larger deletions of promoter sequence (such as TP53 [3] and BAP1 [13]). Point mutations, such as those in the TERT hotspots, would be unlikely to disrupt the sequence sufficiently to decrease gene expression, which is consistent with the fact that the characterized driver point mutations are oncogenic and activating (TERT, FOXA1).

It is important to note that the hybrid capture assay presented here is designed to identify mutations only in the promoter regions, and it does not have the capability to detect mutations in enhancer, insulator or other regulatory elements. As with all next-generation sequencing assays, careful deliberation around the trade-off between the number of samples that are sequenced versus the desired coverage depth is necessary. Tumor content and heterogeneity will also influence this decision. Another major challenge lies in interpreting the functional impact of these mutations—specifically, whether they influence target gene expression. Unlike protein-coding mutations, promoter mutations typically occur at low frequencies and do not directly disrupt protein structure, rendering traditional recurrence-based or protein-structure interpretation methods inadequate. To overcome this limitation, integrating additional layers of gene regulation—such as copy number alterations—can enhance the detection of regulatory effects. Furthermore, machine learning frameworks that incorporate TF binding motifs, chromatin accessibility, and transcriptomic data may offer a more effective strategy for predicting the functional consequences of individual promoter mutations.

Ultimately, these predictions must be validated through comprehensive experimental approaches that introduce promoter alterations via reporter assays or genome editing into adequate in vitro and in vivo models. Careful measurements of gene expression changes of the target gene, binding of general and specific TFs and chromatin regulators and downstream phenotypic consequences—including transformative potential—will be essential for fully understanding which non-coding regulatory mutations are major drivers, collaborating actors or passenger alterations in cancer.

Supplementary Material

zcag011_Supplemental_Files

Acknowledgements

Author contributions: Meifang Qi (Formal analysis [lead], Investigation [lead], Methodology [lead], Software [lead], Writing – original draft [lead]), Preshita Dave (Data curation [equal], Formal analysis [equal], Visualization [equal]), Nicole Francis (Investigation [equal], Methodology [equal]), Matthew DeFelice (Investigation [equal], Methodology [equal]), Sam Pollock (Investigation [equal], Methodology [equal]), Holly Corbitt (Methodology [equal]), Irene Rin Mitsiades (Formal analysis [equal]), Gengchao Wang (Formal analysis [equal]), Paul Frere (Methodology [equal]), Alan Fox (Methodology [equal]), Gregory Khitrov (Methodology [equal]), Caroline Petersen (Methodology [equal]), Katie Larkin (Methodology [equal]), Susanna Hamilton (Methodology [equal]), Leif Ellisen (Funding acquisition [equal]), Doga C. Gulhan (Formal analysis [equal], Visualization [equal]), Alfredo Hidalgo-Miranda (Data curation [equal], Resources [equal]), Niall Lennon (Methodology [equal], Project administration [equal]), and Carrie Cibulskis (Conceptualization [equal], Funding acquisition [equal], Investigation [equal], Supervision [equal], Writing – review & editing [equal])

Notes

Present address: Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, 200031 Shanghai, China

Contributor Information

Meifang Qi, Krantz Family Center for Cancer Research, Massachusetts General Hospital, Charlestown, MA 02129 Boston, MA, United States; Harvard Medical School Department of Medicine, Boston, MA 02115, United States; Broad Institute of MIT and Harvard, Cambridge, MA 02142, United States.

Preshita Sanjay Dave, Krantz Family Center for Cancer Research, Massachusetts General Hospital, Charlestown, MA 02129 Boston, MA, United States; Harvard Medical School Department of Medicine, Boston, MA 02115, United States; Broad Institute of MIT and Harvard, Cambridge, MA 02142, United States.

Nicole Francis, Broad Institute of MIT and Harvard, Cambridge, MA 02142, United States; Broad Clinical Laboratories, LLC, Burlington, MA 01803, United States.

Matthew DeFelice, Broad Institute of MIT and Harvard, Cambridge, MA 02142, United States; Broad Clinical Laboratories, LLC, Burlington, MA 01803, United States.

Sam Pollock, Broad Institute of MIT and Harvard, Cambridge, MA 02142, United States; Broad Clinical Laboratories, LLC, Burlington, MA 01803, United States.

Holly Corbitt, Twist Bioscience South San Francisco, California, CA 94080, United States.

Irene Rin Mitsiades, Krantz Family Center for Cancer Research, Massachusetts General Hospital, Charlestown, MA 02129 Boston, MA, United States; Harvard Medical School Department of Medicine, Boston, MA 02115, United States; Broad Institute of MIT and Harvard, Cambridge, MA 02142, United States.

Gengchao Wang, Krantz Family Center for Cancer Research, Massachusetts General Hospital, Charlestown, MA 02129 Boston, MA, United States; Harvard Medical School Department of Medicine, Boston, MA 02115, United States; Broad Institute of MIT and Harvard, Cambridge, MA 02142, United States.

Paul Frere, Twist Bioscience South San Francisco, California, CA 94080, United States.

Alan Fox, Twist Bioscience South San Francisco, California, CA 94080, United States.

Gregory Khitrov, Twist Bioscience South San Francisco, California, CA 94080, United States.

Caroline Petersen, Broad Institute of MIT and Harvard, Cambridge, MA 02142, United States; Broad Clinical Laboratories, LLC, Burlington, MA 01803, United States.

Katie Larkin, Broad Institute of MIT and Harvard, Cambridge, MA 02142, United States; Broad Clinical Laboratories, LLC, Burlington, MA 01803, United States.

Susanna Hamilton, Broad Institute of MIT and Harvard, Cambridge, MA 02142, United States; Broad Clinical Laboratories, LLC, Burlington, MA 01803, United States.

Leif W Ellisen, Krantz Family Center for Cancer Research, Massachusetts General Hospital, Charlestown, MA 02129 Boston, MA, United States; Harvard Medical School Department of Medicine, Boston, MA 02115, United States; Ludwig Center at Harvard, Boston, MA 02115, United States.

Doga C Gulhan, Krantz Family Center for Cancer Research, Massachusetts General Hospital, Charlestown, MA 02129 Boston, MA, United States; Harvard Medical School Department of Medicine, Boston, MA 02115, United States.

Alfredo Hidalgo-Miranda, Laboratorio de Genomica del Cancer, Instituto Nacional de Medicina Genomica, Periferico Sur 4809, Arenal Tepepan, Mexico City 14610, Mexico.

Niall Lennon, Broad Institute of MIT and Harvard, Cambridge, MA 02142, United States; Broad Clinical Laboratories, LLC, Burlington, MA 01803, United States.

Carrie Cibulskis, Broad Institute of MIT and Harvard, Cambridge, MA 02142, United States; Broad Clinical Laboratories, LLC, Burlington, MA 01803, United States.

Esther Rheinbay, Krantz Family Center for Cancer Research, Massachusetts General Hospital, Charlestown, MA 02129 Boston, MA, United States; Harvard Medical School Department of Medicine, Boston, MA 02115, United States; Broad Institute of MIT and Harvard, Cambridge, MA 02142, United States; Massachusetts General Hospital Department of Pathology, Boston, MA 02114, United States.

Supplementary data

Supplementary data is available at NAR Cancer online.

Conflict of interest

E.R. receives research funding from Inocras, Inc not related to this project. L.W.E. is a consultant for Astra Zeneca, Gilead, Atavistik, Kisoji and receives research funding from Sanofi and Esai not related to this project. All other authors declare no competing interests.

Funding

This project was funded in part by a Broad Institute SPARC grant to E.R. and C.C., R21CA284602 to E.R. and L.W.E., and a Quadrangle Fund for Advancing and Seeding Translational Research (Q-FASTR) at Harvard Medical School to D.C.G. Patient sample collection was conducted as part of the Slim Initiative in Genomic Medicine for the Americas (SIGMA), a project funded by the Carlos Slim Foundation in Mexico. The graphical abstract was created in BioRender: Qi, M. & Rheinbay, E. (2026) https://BioRender.com/oqfkneo.

Data availability

Sequencing data and somatic mutations are available from dbGaP (phs001250.v2.p1) and SRA under accession number PRJNA1348700 (https://www.ncbi.nlm.nih.gov/sra/PRJNA1348700).

References

  • 1. Huang FW, Hodis E, Xu MJ et al. Highly recurrent TERT promoter mutations in human melanoma. Science. 2013;339:957–9. 10.1126/science.1229259
  • 2. Horn S, Figl A, Rachakonda PS et al. TERT promoter mutations in familial and sporadic melanoma. Science. 2013;339:959–61. 10.1126/science.1230062
  • 3. Rheinbay E, Nielsen MM, Abascal F et al. Analyses of non-coding somatic drivers in 2,658 cancer whole genomes. Nature. 2020;578:102–11. 10.1038/s41586-020-1965-x
  • 4. Arthur SE, Jiang A, Grande BM et al. Genome-wide discovery of somatic regulatory variants in diffuse large B-cell lymphoma. Nat Commun. 2018;9:4001. 10.1038/s41467-018-06354-3
  • 5. Annala M, Taavitsainen S, Vandekerkhove G et al. Frequent mutation of the FOXA1 untranslated region in prostate cancer. Commun Biol. 2018;1:122. 10.1038/s42003-018-0128-1
  • 6. Imielinski M, Guo G, Meyerson M. Insertions and deletions target lineage-defining genes in human cancers. Cell. 2017;168:460–72. 10.1016/j.cell.2016.12.025
  • 7. Wilson NK, Timms RT, Kinston SJ et al. Gfi1 expression is controlled by five distinct regulatory regions spread over 100 kilobases, with Scl/Tal1, Gata2, PU.1, Erg, Meis1, and Runx1 acting as upstream regulators in early hematopoietic cells. Mol Cell Biol. 2010;30:3853–63. 10.1128/MCB.00032-10
  • 8. Mansour MR, Abraham BJ, Anders L et al. Oncogene regulation. An oncogenic super-enhancer formed through somatic mutation of a noncoding intergenic element. Science. 2014;346:1373–7. 10.1126/science.1259037
  • 9. Deaton AM, Bird A. CpG islands and the regulation of transcription. Genes Dev. 2011;25:1010–22. 10.1101/gad.2037511
  • 10. Saxonov S, Berg P, Brutlag DL. A genome-wide analysis of CpG dinucleotides in the human genome distinguishes two distinct classes of promoters. Proc Natl Acad Sci USA. 2006;103:1412–7. 10.1073/pnas.0510310103
  • 11. Rheinbay E, Parasuraman P, Grimsby J et al. Recurrent and functional regulatory mutations in breast cancer. Nature. 2017;547:55–60. 10.1038/nature22992
  • 12. PCAWG Consortium. Pan-cancer analysis of whole genomes. Nature. 2020;578:82–93.
  • 13. Gentien D, Saberi-Ansari E, Servant N et al. Multi-omics comparison of malignant and normal uveal melanocytes reveals molecular features of uveal melanoma. Cell Rep. 2023;42:113132. 10.1016/j.celrep.2023.113132
  • 14. Cibulskis K, Lawrence MS, Carter SL et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat Biotechnol. 2013;31:213–9. 10.1038/nbt.2514
  • 15. Sondka Z, Bamford S, Cole CG et al. The COSMIC Cancer Gene Census: describing genetic dysfunction across all human cancers. Nat Rev Cancer. 2018;18:696–705. 10.1038/s41568-018-0060-1
  • 16. Repana D, Nulsen J, Dressler L et al. The Network of Cancer Genes (NCG): a comprehensive catalogue of known and candidate cancer genes from cancer sequencing screens. Genome Biol. 2019;20:1. 10.1186/s13059-018-1612-0
  • 17. Zhang Y, Liu T, Meyer CA et al. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2008;9:R137. 10.1186/gb-2008-9-9-r137
  • 18. Haeussler M, Zweig AS, Tyner C et al. The UCSC Genome Browser database: 2019 update. Nucleic Acids Res. 2019;47:D853–8. 10.1093/nar/gky1095
  • 19. Cibulskis K, McKenna A, Fennell T et al. ContEst: estimating cross-contamination of human samples in next-generation sequencing data. Bioinformatics. 2011;27:2601–2. 10.1093/bioinformatics/btr446
  • 20. Taylor-Weiner A, Stewart C, Giordano T et al. DeTiN: overcoming tumor-in-normal contamination. Nat Methods. 2018;15:531–4. 10.1038/s41592-018-0036-9
  • 21. Saunders CT, Wong WS, Swamy S et al. Strelka: accurate somatic small-variant calling from sequenced tumor-normal sample pairs. Bioinformatics. 2012;28:1811–7. 10.1093/bioinformatics/bts271
  • 22. Costello M, Pugh TJ, Fennell TJ et al. Discovery and characterization of artifactual mutations in deep coverage targeted capture sequencing data due to oxidative DNA damage during sample preparation. Nucleic Acids Res. 2013;41:e67. 10.1093/nar/gks1443
  • 23. Talevich E, Shain AH, Botton T et al. CNVkit: genome-wide copy number detection and visualization from targeted DNA sequencing. PLoS Comput Biol. 2016;12:e1004873. 10.1371/journal.pcbi.1004873
  • 24. Shen R, Seshan VE. FACETS: allele-specific copy number and clonal heterogeneity analysis tool for high-throughput DNA sequencing. Nucleic Acids Res. 2016;44:e131. 10.1093/nar/gkw520
  • 25. Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26:841–2. 10.1093/bioinformatics/btq033
  • 26. Bailey TL, Johnson J, Grant CE et al. The MEME suite. Nucleic Acids Res. 2015;43:W39–49. 10.1093/nar/gkv416
  • 27. Rauluseviciute I, Riudavets-Puig R, Blanc-Mathieu R et al. JASPAR 2024: 20th anniversary of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 2024;52:D174–82. 10.1093/nar/gkad1059
  • 28. Jia P, Yang X, Guo L et al. MSIsensor-pro: fast, accurate, and matched-normal-sample-free detection of microsatellite instability. Genomics Proteomics Bioinformatics. 2020;18:65–71. 10.1016/j.gpb.2020.02.001
  • 29. Gulhan  DC, Lee  JJ, Melloni  GEM  et al.  Detecting the mutational signature of homologous recombination deficiency in clinical samples. Nat Genet. 2019;51:912–9. 10.1038/s41588-019-0390-2 [DOI] [PubMed] [Google Scholar]
  • 30. Buenrostro  JD, Wu  B, Chang  HY  et al.  ATAC-seq: a method for assaying chromatin accessibility genome-wide. CP Molecular Biology. 2015;109:21–9. 10.1002/0471142727.mb2129s109 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31. Song  L, Crawford  GE. DNase-seq: a high-resolution technique for mapping active gene regulatory elements across the genome from mammalian cells. Cold Spring Harb Protoc. 2010;2010:pdb.prot5384. 10.1101/pdb.prot5384 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32. Sanghvi  RV, Buhay  CJ, Powell  BC  et al.  Characterizing reduced coverage regions through comparison of exome and genome sequencing data across 10 centers. Genet Med. 2018;20:855–66. 10.1038/gim.2017.192 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33. Ross  MG, Russ  C, Costello  M  et al.  Characterizing and measuring bias in sequence data. Genome Biol. 2013;14:R51. 10.1186/gb-2013-14-5-r51 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34. Aird  D, Ross  MG, Chen  WS  et al.  Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biol. 2011;12:R18. 10.1186/gb-2011-12-2-r18 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35. Seto  M, Jaeger  U, Hockett  RD  et al.  Alternative promoters and exons, somatic mutation and deregulation of the Bcl-2-Ig fusion gene in lymphoma. EMBO J. 1988;7:123–31. 10.1002/j.1460-2075.1988.tb02791.x [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36. Zhou  S, Hawley  JR, Soares  F  et al.  Noncoding mutations target cis-regulatory elements of the FOXA1 plexus in prostate cancer. Nat Commun. 2020;11:441. 10.1038/s41467-020-14318-9 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37. Ghandi  M, Huang  FW, Jané-Valbuena  J  et al.  Next-generation characterization of the Cancer Cell Line Encyclopedia. Nature. 2019;569:503–8. 10.1038/s41586-019-1186-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38. Corsello  SM, Nagari  RT, Spangler  RD  et al.  Discovering the anti-cancer potential of non-oncology drugs by systematic viability profiling. Nat Cancer. 2020;1:235–48. 10.1038/s43018-019-0018-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39. Asghar  US, Barr  AR, Cutts  R  et al.  Single-cell dynamics determines response to CDK4/6 inhibition in triple-negative breast cancer. Clin Cancer Res. 2017; 23:5561–72. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40. Sun  M, Ju  J, Ding  Y  et al.  The signaling pathways regulated by KRAB zinc-finger proteins in cancer. Biochim Biophys Acta (BBA). 2022;1877:188731. 10.1016/j.bbcan.2022.188731 [DOI] [PubMed] [Google Scholar]
  • 41. Sobocinska  J, Molenda  S, Machnik  M  et al.  KRAB-ZFP transcriptional regulators acting as oncogenes and tumor suppressors: an overview. Int J Mol Sci. 2021;22:2212. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42. Yu  B, Luo  F, Sun  B  et al.  KAT6A acetylation of SMAD3 regulates myeloid-derived suppressor cell recruitment, metastasis, and immunotherapy in triple-negative breast cancer. Adv Sci. 2022;9:e2105793. 10.1002/advs.202105793 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43. Zhou  T, Zhang  DD, Jin  J  et al.  Multiomic characterization, immunological and prognostic potential of SMAD3 in pan-cancer and validation in LIHC. Sci Rep. 2025;15:657. 10.1038/s41598-024-84553-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44. Feng  XH, Derynck  R. Specificity and versatility in tgf-beta signaling through Smads. Annu. Rev. Cell Dev. Biol.  2005;21:659–93. 10.1146/annurev.cellbio.21.022404.142018 [DOI] [PubMed] [Google Scholar]
  • 45. Singha  PK, Pandeswara  S, Geng  H  et al.  Increased Smad3 and reduced Smad2 levels mediate the functional switch of TGF-beta from growth suppressor to growth and metastasis promoter through TMEPAI/PMEPA1 in triple negative breast cancer. Genes Cancer. 2019;10:134–49. 10.18632/genesandcancer.194 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46. Lawrence  MS, Stojanov  P, Polak  P  et al.  Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature. 2013;499:214–8. 10.1038/nature12213 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47. Buisson  R, Langenbucher  A, Bowen  D  et al.  Passenger hotspot mutations in cancer driven by APOBEC3A and mesoscale genomic features. Science. 2019;364:eaaw2872. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48. Mehta  GA, Khanna  P, Gatza  ML. Emerging role of SOX proteins in breast cancer development and maintenance. J Mammary Gland Biol Neoplasia. 2019;24:213–30. 10.1007/s10911-019-09430-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49. Izzo  F, Mercogliano  F, Venturutti  L  et al.  Progesterone receptor activation downregulates GATA3 by transcriptional repression and increased protein turnover promoting breast tumor growth. Breast Cancer Res. 2014;16:491. 10.1186/s13058-014-0491-x [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50. Eso  Y, Shimizu  T, Takeda  H  et al.  Microsatellite instability and immune checkpoint inhibitors: toward precision medicine against gastrointestinal and hepatobiliary cancers. J Gastroenterol. 2020;55:15–26. 10.1007/s00535-019-01620-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51. Petrelli  F, Ghidini  M, Ghidini  A  et al.  Outcomes following immune checkpoint inhibitor treatment of patients with microsatellite instability-high cancers: a systematic review and meta-analysis. JAMA Oncol. 2020;6:1068–71. 10.1001/jamaoncol.2020.1046 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52. Comen  EA, Robson  M. Poly(ADP-ribose) polymerase inhibitors in triple-negative breast cancer. Cancer J. 2010;16:48–52. 10.1097/PPO.0b013e3181cf01eb [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53. Alexandrov  LB, Kim  J, Haradhvala  NJ  et al.  The repertoire of mutational signatures in human cancer. Nature. 2020;578:94–101. 10.1038/s41586-020-1943-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54. Alexandrov  LB, Nik-Zainal  S, Wedge  DC  et al.  Signatures of mutational processes in human cancer. Nature. 2013;500:415–21. 10.1038/nature12477 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55. Elliott  K, Larsson  E. Non-coding driver mutations in human cancer. Nat Rev Cancer. 2021;21:500–9. 10.1038/s41568-021-00371-z [DOI] [PubMed] [Google Scholar]

Associated Data


Supplementary Materials

zcag011_Supplemental_Files

Data Availability Statement

Sequencing data and somatic mutations are available from dbGaP (phs001250.v2.p1) and SRA under accession number PRJNA1348700 (https://www.ncbi.nlm.nih.gov/sra/PRJNA1348700).


Articles from NAR Cancer are provided here courtesy of Oxford University Press
