Author manuscript; available in PMC: 2023 May 22.
Published in final edited form as: Nature. 2022 Nov 30;612(7940):495–502. doi: 10.1038/s41586-022-05253-4

Genomic signature of Fanconi anemia DNA repair pathway deficiency in cancer

Andrew LH Webster 1,*, Mathijs A Sanders 2,3,*, Krupa Patel 1, Ralf Dietrich 4, Raymond J Noonan 1, Francis P Lach 1, Ryan R White 1, Audrey Goldfarb 1, Kevin Hadi 5, Matthew M Edwards 6, Frank X Donovan 7, Remco M Hoogenboezem 3, Moonjung Jung 1, Sunandini Sridhar 1, Tom F Wiley 1, Olivier Fedrigo 8, Huasong Tian 5, Joel Rosiene 5, Thomas Heineman 1, Jennifer A Kennedy 1,9, Lorenzo Bean 1, Rasim O Rosti 1, Rebecca Tryon 10, Ashlyn-Maree Gonzalez 1, Allana Rosenberg 1, Ji-Dung Luo 11, Thomas Carrol 11, Sanjana Shroff 12, Michael Beaumont 12, Eunike Velleuer 13, Jeff C Rastatter 14, Susanne I Wells 15, Jordi Surrallés 16, Grover Bagby 17, Margaret L MacMillan 10, John E Wagner 10, Maria Cancio 18, Farid Boulad 18, Theresa Scognamiglio 19, Roger Vaughan 20, Kristin G Beaumont 12, Amnon Koren 6, Marcin Imielinski 5, Settara C Chandrasekharappa 7, Arleen D Auerbach 21, Bhuvanesh Singh 22, David I Kutler 23, Peter J Campbell 2, Agata Smogorzewska 1
PMCID: PMC10202100  NIHMSID: NIHMS1884089  PMID: 36450981

Abstract

Fanconi anemia (FA), a model syndrome of genome instability, is caused by a deficiency in DNA interstrand crosslink (ICL) repair resulting in chromosome breakage1–3. The FA repair pathway protects against carcinogenic endogenous and exogenous aldehydes4–7. Individuals with FA are hundreds- to thousands-fold more likely to develop head and neck (HNSCC), esophageal and anogenital squamous cell carcinomas (SCCs)8. Molecular studies of SCCs from individuals with FA (FA SCCs) are limited, and it is unclear how FA SCCs relate to sporadic HNSCCs primarily driven by tobacco and alcohol exposure or human papillomavirus (HPV) infection9. Here, by sequencing FA SCCs, we demonstrate that the primary genomic signature of FA deficiency is the presence of high numbers of structural variants (SVs). SVs are enriched for small deletions, unbalanced translocations, and fold-back inversions, and are often connected, thereby forming complex rearrangements. They arise in the context of TP53 loss, but not HPV infection, and lead to somatic copy number alterations of HNSCC driver genes. We further show that FA pathway deficiency may lead to epithelial-to-mesenchymal transition and enhanced keratinocyte-intrinsic inflammatory signaling, which would contribute to the aggressive nature of FA SCCs. We propose that genomic instability in sporadic HPV-negative HNSCC may arise consequent to the FA repair pathway being overwhelmed by ICL damage caused by alcohol and tobacco-derived aldehydes, making FA SCC a powerful model to study tumorigenesis resulting from DNA crosslinking damage.

Keywords: Fanconi anemia, head and neck squamous cell carcinoma, anogenital squamous cell carcinoma, HPV, structural variants, fold-back inversions, genomic rearrangements, genome instability, hereditary cancers, FANCA, BRCA2, BRCA1, p53, epithelial-to-mesenchymal transition, EMT


DNA interstrand crosslinks (ICLs) are lesions that covalently link the Watson and Crick DNA strands, impeding proper replication and transcription. They are excised by the FA repair pathway, which comprises at least 22 FANC proteins, including BRCA1 and BRCA2 (refs. 10–12). Lack of the FA pathway may lead to developmental abnormalities, bone marrow failure and cancer predisposition, primarily acute myeloid leukemia and squamous cell carcinoma of the aerodigestive tract8.

To characterize the mechanism of tumor formation in the setting of DNA ICL repair deficiency, derive biomarkers for early detection, and identify potential therapeutic opportunities for individuals with FA, we sequenced 55 independent FA SCCs and three adenocarcinomas from 50 individuals. Clinical data, which were available for 41 individuals from this cohort, revealed that they developed early-onset SCCs at a median age of 31 years and had a median cancer-specific survival of only 17 months, much shorter than that of patients with sporadic HPV-positive or HPV-negative HNSCCs13 (Fig. 1a, Extended Data Fig. 1a). Characteristics of the FA individuals and sequenced tumors are described in Extended Data Fig. 1b–d and Supplementary Table 1.

Fig. 1. Comparison of the mutational landscapes of Fanconi anemia (FA) squamous cell carcinomas (SCCs) and sporadic head and neck SCCs (HNSCCs).

a Cancer-specific survival curves for n=41 individuals with FA SCC and complete clinical history, n=69 HPV-positive sporadic HNSCC cases, and n=394 HPV-negative sporadic HNSCC cases from TCGA with disease-specific survival data. The p-value was determined by a Mantel-Cox log-rank test. b Number of paired-end reads aligning without clipping to non-repetitive regions of any HPV genome in FA SCCs (n=20 WGS and n=40 WES) and sporadic HNSCCs from TCGA (n=42 WGS and n=513 WES). c Comparison of gene alteration frequencies between independent FA SCCs (n=55) and HPV-negative sporadic HNSCCs (n=415), with focal somatic copy number alteration (sCNA) peaks defined by GISTIC2 [read-depth change of log2(sCNA)≥0.9 (amplification) or log2(sCNA)≤−0.9 (deletion) relative to binned region coverage in pool of normals] and with normalization for tumor purity in both cohorts. SNV is single nucleotide variation. TP53 and PDL1 sCNA frequencies were determined manually for FA SCC. * and # indicate genes contained within the same focal sCNA peak. GISTIC2 FDR q-values for the FA SCC cohort are listed for each applicable gene affected by a copy-number alteration. ** indicates a selection of genes not captured by GISTIC2, with sCNA-frequencies being extracted from cBioPortal – Pan-Cancer Atlas (HPV-Negative HNSCC).40 In all cases, n refers to independent biological samples or individuals.

HPV and TP53 status of FA SCCs

HPV infection is a well-characterized driver of sporadic HNSCC, particularly for younger patients9,13. Although there is evidence that the FA pathway protects against HPV replication, the importance of HPV in FA SCC etiology has been debated14–18. To identify HPV sequences integrated within the FA SCC genomes, whole genome sequences from 20 FA and 42 sporadic HNSCCs (The Cancer Genome Atlas [TCGA]) were aligned against all genotyped HPV strains. No HPV sequences were identified in the FA HNSCCs, while we detected such sequences in 18 HPV-positive sporadic HNSCCs (Fig. 1b). Using whole exome sequencing of 40 FA SCCs across multiple tissue sites, we found only two HPV-positive cases, both of gynecologic origin (Fig. 1b, Supplementary Table 2). Consistent with the low HPV infection incidence, TP53 mutations were identified in 89% of the FA SCCs (Fig. 1c). Correcting for tumor purity, the median TP53mut variant allele frequency was 85%, suggesting it is an early transforming event (Extended Data Fig. 1e).

With the notable exception of TP53, we found a reduced rate of single nucleotide variants (SNVs) and small insertion/deletion events (indels) perturbing known SCC driver genes in FA SCCs compared to HPV-negative sporadic HNSCC (Fig. 1c, Extended Data Fig. 1f). For example, only a single FA SCC harbored a PIK3CA missense mutation (1.8% of FA SCCs), as compared to 17.5% of the HPV-negative sporadic HNSCCs (TCGA). A similar lack of point mutations was seen for CDKN2A, FAT1, NOTCH1, NSD1, and several other tumor suppressors. A global analysis of somatic SNV and indel events uncovered a reduced exonic mutation rate relative to sporadic HNSCC (Extended Data Fig. 2a, Supplementary Table 3), most likely due to an earlier age of cancer onset in FA patients. Deconstruction of FA SCC SNV profiles into Catalogue Of Somatic Mutations In Cancer (COSMIC) mutational signatures revealed the presence of signature profiles SBS1 & SBS5 (5-methylcytosine deamination and unknown etiology), SBS2 & SBS13 (APOBEC), and SBS18 (reactive oxygen species) (Extended Data Fig. 2b,c). We found no evidence of the general homology-directed repair deficiency-associated signature SBS3, or the SBS4 signature associated with cigarette smoke exposure. Analysis of indel mutation profiles from FA SCCs revealed the presence of signatures ID1 and ID2 (replication slippage), ID4 (small deletions at tandem repeats), ID6 (non-homologous end joining (NHEJ)/microhomology-mediated end joining (MMEJ)), ID8 (NHEJ), ID10 (large insertions at tandem repeats), as well as ID9 and ID14 (unknown etiology) (Extended Data Fig. 2d). ID10, ID4, ID6, and ID9 were not enriched in sporadic HNSCC.

In place of a high SNV/indel load, we find that chromosomal instability, a hallmark of FA cells, is the major mutational force driving FA SCCs13. Analysis of focal somatic copy-number alteration (sCNA) in 55 independent FA SCCs revealed an elevated sCNA frequency relative to HPV-negative sporadic HNSCCs (Fig. 1c, Extended Data Fig. 1f, Extended Data Fig. 3). Among the amplified loci are those harboring PIK3CA (amplified in 78% of samples), MYC (71%), TAZ (65%), CCND1 (62%), PLEC (53%), YAP1/BIRC2/BIRC3 (38%), CCND2 (36%), KDM2A (33%), and EGFR (31%). Among deleted loci are PTPRD (deleted in 62% of samples), CSMD1 (60%), CDKN2A (55%), MXD4 (49%), KMT2C (42%), NSD1/FAT2 (36%), FAT1 (35%), and NOTCH1 (31%). Overall, sCNAs perturbing oncogene and tumor suppressor loci were more frequent in FA SCCs than in sporadic HNSCCs. Strikingly, each FA SCC carried a multitude of amplifications and deletions across numerous loci (Extended Data Fig. 1f, Supplementary Data File 8, Supplementary Table 4). The most frequent co-occurrence was between PIK3CA and MYC amplifications, which was present in 56% of FA SCCs.
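The focal sCNA frequencies above rest on the read-depth thresholds stated in the Fig. 1c legend (log2(sCNA)≥0.9 for amplification, log2(sCNA)≤−0.9 for deletion). As a minimal sketch of that thresholding step only, assuming purity-normalized log2 ratios are already available for one locus across samples, the per-locus frequency could be tallied as follows. The function names are hypothetical, and this is not a reimplementation of GISTIC2, which additionally models peak boundaries and significance.

```python
def call_focal_scna(log2_ratio, threshold=0.9):
    """Label one locus in one sample using the +/-0.9 log2 cutoffs from the Fig. 1c legend."""
    if log2_ratio >= threshold:
        return "amplification"
    if log2_ratio <= -threshold:
        return "deletion"
    return "neutral"


def alteration_frequency(log2_ratios, kind, threshold=0.9):
    """Percentage of samples whose call at a locus equals `kind`.

    `log2_ratios` is assumed to hold one purity-normalized value per sample
    (hypothetical input format, not the authors' pipeline).
    """
    if not log2_ratios:
        raise ValueError("at least one sample is required")
    hits = sum(call_focal_scna(r, threshold) == kind for r in log2_ratios)
    return 100.0 * hits / len(log2_ratios)
```

For instance, `alteration_frequency(values, "amplification")` would give the kind of percentage quoted for PIK3CA or MYC, provided `values` contains one entry per tumor.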

High structural variation burden in FA SCCs

Using whole genome sequencing (WGS) datasets, we next compared the somatic structural variant (SV) landscape of FA tumors (20 SCCs and 2 adenocarcinomas) to that of sporadic HPV-negative (n=23) and HPV-positive (n=18) HNSCCs, as well as breast, ovarian, and other cancers driven by BRCA2 (BRCA2mut, n=41) or BRCA1 (BRCA1mut, n=24) mutations (Supplementary Table 5). FA tumors displayed a median 2-fold increase in the total SV burden compared to HPV-negative HNSCCs (Fig. 2a). This difference remained unchanged after exclusion of adenocarcinoma samples. FA tumors had, on average, 45% more SVs than BRCA2mut carcinomas, but 30% fewer SVs than BRCA1mut carcinomas, which are highly enriched for tandem duplications (TDs) (Fig. 2a)19–21. FA SCCs exhibited a significantly increased SV burden across all SV classes with a 2-fold increase in deletions, 3-fold increase in translocations, 1.7-fold increase in inversions, and 1.5-fold increase in TDs when compared to HPV-negative sporadic HNSCC, with deletions representing the most abundant SV class in FA SCCs (Fig. 2b). As a proportion of all SVs occurring within each tumor, FA SCCs were mildly enriched for deletions and moderately depleted for tandem duplications (Fig. 2c). While SV breakpoints in FA SCCs clustered at oncogenic and tumor suppressor loci, breakpoints were also distributed at low frequencies throughout the genome (Extended Data Fig. 4a), suggestive of a model whereby continuous DSB generation due to abnormal DNA repair provides the substrate of genetic variation within the pool of epithelial cells for selection to act upon. Consistent with this idea, clonal analysis revealed the coexistence of 2–4 subclones in addition to a dominant parental clone (Extended Data Fig. 4b,c).

Fig. 2. The structural variant landscape of FA SCC.

a Comparison of somatic structural variant (SV) numbers in the whole genomes of FA-associated cancers (20 SCCs and 2 adenocarcinomas), HPV-positive sporadic HNSCC, HPV-negative sporadic HNSCC, BRCA2-deficient (BRCA2mut), or BRCA1-deficient (BRCA1mut) tumors. b SV counts in FA SCC (n=20) and HPV-negative sporadic HNSCC (n=23) cohorts categorized by SV class: deletion (DEL), translocation (TRA), inversion (INV), and tandem duplication (TD). INV include reciprocal inversions, fold-back inversions and complex intrachromosomal rearrangements with inverted orientation. c The proportion of all SVs attributed to each SV class in samples shown in panel b. d SV class size distribution in the indicated tumor samples. Size (in base pairs; bp) is defined by intrachromosomal distance between the left and right SV breakpoints. x indicates the median size. e Replication timing and common fragile site localization of SV breakpoints, stratified by both SV class and tumor cohort. Color scale indicates correlation strength. Detailed data are shown in Extended Data Fig. 5. f Mechanism of breakpoint resolution in FA SCC (n=20) and HPV-negative sporadic HNSCC (n=23) cohorts, categorized by the double-strand DNA break repair pathways: non-homologous end joining (NHEJ), microhomology-mediated end joining (MMEJ), and single strand annealing (SSA). Indicated is the proportion (%) of BRASS re-assembled breakpoints predicted to have been repaired by each pathway, based on previously established homology parameters. Two-tailed Mann-Whitney U test p-values are indicated, with median and IQR shown (a-d). In all cases, n refers to independent biological samples or individuals.

SV size (length of DNA between two intrachromosomal breakpoints) predominantly ranged between 1–100kb in FA SCCs (Fig. 2d). HPV-negative sporadic HNSCCs displayed a clustering of mid-size deletions with a median of 61kb (interquartile range [IQR] of 7–304kb), while FA SCCs exhibited a clustering of small deletions with a median size of 9kb (IQR 4–47kb). Sizes of deletions were similar between FA SCC and BRCA2mut samples. FA SCC inversions were also small and clustered with a median size of 7kb (IQR 2–89kb). This differed from the more evenly spread inversions of HPV-negative HNSCCs with a median of 16kb (IQR 2kb–1.3Mb), the bimodal clustering of small and large inversions in BRCA2mut, and the large inversion cluster of BRCA1mut samples. Like deletions and inversions, TDs in FA SCCs were also small, with a median size of 24kb (IQR 7–140kb), in comparison to HPV-negative HNSCC and BRCA2mut samples, with median sizes of 61kb (IQR 15–724kb) and 354kb (IQR 27kb–10Mb), respectively. FA SCCs did not share the characteristic TD enrichment observed in BRCA1mut tumors; however, the TD size distribution was similar between these two cohorts.

Genome replication timing has been demonstrated to have a strong association with SV formation. Deletions are enriched in late-replicating regions, while TDs and unbalanced translocations occur preferentially in early-replicating regions across the Pan-Cancer Analysis of Whole Genomes22. We assessed replication timing and SV localization in FA SCCs in comparison to sporadic HNSCC, BRCA2mut, and BRCA1mut cancers. Compared to reference timing profiles23, we found that TD breakpoints in FA SCC, BRCA1mut, and BRCA2mut tumors, but not sporadic HNSCCs, are highly associated with regions of early genomic replication (Fig. 2e, Extended Data Fig. 5a). Deletions, on the other hand, were found in late-replicating regions only in sporadic HNSCCs. Consistent with the FA pathway being important at common fragile sites24, we found that deletion, inversion, and TD breakpoints were greatly enriched at these loci in FA SCCs. Only inversions, and to a lesser extent TDs, were enriched at common fragile sites in sporadic HNSCCs (Extended Data Fig. 5b). TD breakpoints in FA SCCs and BRCA1mut, and translocations in BRCA1mut, correlated with early replication fragile sites (Extended Data Fig. 5c).

FA pathway-deficient cells have been demonstrated to harbor rearrangements driven by MMEJ4. We assessed the frequency of predicted NHEJ, MMEJ, and single-strand annealing (SSA), defined as 0–1bp, 2–9bp, and more than 10bp breakpoint homology respectively, and we found that FA pathway status did not alter the mechanism of DNA double strand break resolution (Fig. 2f).
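The homology-based pathway assignment described above can be illustrated with a short sketch. It assumes that each re-assembled breakpoint has already been reduced to a single integer, the length of homology at the junction; the function names are hypothetical, and the handling of exactly 10bp (grouped here with SSA) is an assumption, since the text defines SSA as "more than 10bp".

```python
from collections import Counter

PATHWAYS = ("NHEJ", "MMEJ", "SSA")


def classify_repair_pathway(homology_bp):
    """Predict the repair pathway from breakpoint homology length in bp."""
    if homology_bp < 0:
        raise ValueError("homology length cannot be negative")
    if homology_bp <= 1:
        return "NHEJ"  # 0-1 bp of homology
    if homology_bp <= 9:
        return "MMEJ"  # 2-9 bp of homology
    # Assumption: a junction with exactly 10 bp is grouped with SSA.
    return "SSA"


def pathway_proportions(homology_lengths):
    """Percentage of breakpoints assigned to each pathway in one sample."""
    counts = Counter(classify_repair_pathway(h) for h in homology_lengths)
    total = sum(counts.values())
    if total == 0:
        return {p: 0.0 for p in PATHWAYS}
    return {p: 100.0 * counts.get(p, 0) / total for p in PATHWAYS}
```

Per-sample proportions computed this way are the kind of values that would then be compared between cohorts, as in Fig. 2f.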

Complex SVs cause oncogene amplification

FA pathway-deficient cells are known to contain complex radial chromosomes that can be visualized on metaphase spreads1,2. We hypothesized that cycles of their formation and breakage during tumorigenesis lead to the development of rearrangement chains. To assess their presence, we performed 10x linked-read WGS on four FA SCC samples, which allowed for phasing of structural events across long genomic segments. We observed that SVs frequently occurred in long chains forming complex chromosomal rearrangements (Fig. 3a,b). Samples had between 23 and 32 unique chains with a mean of 4.6 SVs per chain (Extended Data Fig. 6a,b). The largest observed chain contained 43 SVs. These chains were enriched for duplications, translocations, and deletions, and frequently localized to oncogene-containing regions of chromosomes 3, 7, 8, 9, and 11 (Extended Data Fig. 6c,d).
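One way to picture how SVs are grouped into chains is as connected components of a graph in which two SVs are joined whenever their breakpoints are barcode-linked. The sketch below is an illustration under that assumption, not the authors' pipeline: the input format (SV identifiers plus a list of linked pairs) and the function names are hypothetical, and only the minimum chain size of two SVs is taken from the Fig. 3a legend.

```python
from collections import defaultdict


def group_sv_chains(sv_ids, linked_pairs, min_svs=2):
    """Group SVs into chains with a union-find over barcode-linked SV pairs.

    sv_ids: iterable of SV identifiers (hypothetical input).
    linked_pairs: iterable of (sv_a, sv_b) pairs judged to be linked;
        every identifier must also appear in sv_ids.
    Returns a list of chains, each a sorted list with at least min_svs SVs.
    """
    sv_ids = list(sv_ids)
    parent = {sv: sv for sv in sv_ids}

    def find(x):
        # Walk to the root, compressing the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in linked_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(list)
    for sv in sv_ids:
        groups[find(sv)].append(sv)
    return [sorted(g) for g in groups.values() if len(g) >= min_svs]


def mean_svs_per_chain(chains):
    """Average number of SVs per chain (0.0 when there are no chains)."""
    return sum(len(c) for c in chains) / len(chains) if chains else 0.0
```

Unlinked SVs form singleton groups and are dropped, so the number of returned chains and `mean_svs_per_chain` correspond to the per-sample summary statistics reported above.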

Fig. 3. Complex FA SCC SVs identified by 10x linked-read, PacBio long read, and Illumina WGS.

a Circos plot of somatic structural variants (SVs) larger than 30kb detected using 10x linked-read WGS in FA SCC (sample F17P1). Eight selected multi-SV chains are highlighted using distinct colors, with the outer ring segmented by chromosome number. A chain is defined as a minimum of 4 barcode-linked breakpoints (≥ 2 SVs). b Illustration of SV chain #1 from panel a. Color legend is shared with panel c. Arrows indicate orientation of each segment relative to the hg19 reference genome. c Illustration of a somatic SV chain containing unbalanced translocations, fold-back inversions, and templated insertion chains present in FA SCC sample F45P1. d Deduced amplification mechanism at select oncogenes in FA SCC as assessed by PacBio sequencing. e Proportion of translocation events that are unbalanced (non-reciprocal and copy-number altering) among FA SCCs (n=20), HPV-negative sporadic HNSCCs (n=19), BRCA2mut carcinomas (n=40), and BRCA1mut carcinomas (n=24). 5 sporadic HNSCC samples and 1 BRCA2mut sample with ≤3 translocation events were excluded. f Number of fold-back inversion events in the same cohorts: FA SCCs (n=20), HPV-negative sporadic HNSCCs (n=23), BRCA2mut carcinomas (n=40), and BRCA1mut carcinomas (n=24). g Percentage of samples in each cohort with 0, 1, 2, 3, or more than 3 unique FBI-TIC chains. h Comparison of expected (hg19 reference) vs. observed percentage of somatic SV breakpoints localizing to indicated repeat class. Breakpoints from n=9 FA SCC PacBio samples are shown. Two-tailed Mann-Whitney U test p-values are indicated, with median and IQR shown (e-f). Unpaired two-tailed Student’s t-test p-values are indicated, with median and IQR shown (SINE: t=5.627, df=8, Tandem: t=4.786, df=8) (h). In all cases, n refers to independent biological samples or individuals.

To deconvolve complex SV patterns, we assessed 9 FA SCCs and four paired normal samples using PacBio long-read sequencing. We implemented a noise-tolerant germline filtering algorithm (described in the Methods) to robustly call somatic SVs. Data obtained from long- and short-read sequencing showed strong overlap of somatic SVs over 1kb, with 84% of PacBio calls observed in Illumina and 78% of Illumina calls observed in PacBio (Extended Data Fig. 6e–g). Long-read data captured most short deletions between 150–999bp found by Illumina indel calling, while discovering additional deletions that localized to short-read inaccessible regions (Extended Data Fig. 6g,h, Supplementary Table 6). The abundance of short deletions observable in both platforms, particularly in the 150–600bp range, reveals the role of the FA pathway in preventing deletions shorter than 1kb.

PacBio sequencing uncovered a high frequency of unbalanced translocations (UBT) and fold-back inversions (FBI) that were frequently connected, forming complex rearrangements involving multiple oncogenes (Fig. 3c). UBT and FBI events drove sharp copy number amplification at key oncogenes including CCND1, PIK3CA, MYC, KDM2A, YAP1-BIRC2/3, and EGFR (Fig. 3d, Extended Data Fig. 6i). UBT and FBI segments were often bridged by one or more templated insertions of less than 1 kb, copied from distant loci (Fig. 3c). When strung together, these short insertions formed templated insertion chains (TIC)22, which connected multiple intrachromosomal and interchromosomal loci. The analysis of PacBio data prompted us to algorithmically quantify the frequency of unbalanced translocations (Fig. 3e, Extended Data Fig. 6j), FBIs (Fig. 3f), and FBI-TICs (Fig. 3g) present in FA SCC, HPV-negative HNSCC, BRCA2mut, and BRCA1mut tumor sequences from Illumina WGS (Supplementary Table 7). We found that all three SV classes were enriched in FA SCCs relative to the other samples, demonstrating that the FA repair pathway protects against complex rearrangement events that drive oncogenic amplification.

Repeat-rich loci are susceptible to higher levels of replication stress25. Using PacBio data, which can more accurately align to these loci, we found that somatic SV breakpoints in FA SCCs preferentially localized to repeat regions relative to the expected hg19 reference genome background (78% observed vs 51% expected) (Extended Data Fig. 6k). Breakpoints were 3.6-fold enriched in simple/tandem repeat regions (regions of 1–5bp repeat patterns), and 2-fold enriched at SINE elements, but were not enriched in LINE elements or LTRs (Fig. 3h). Regions of PacBio SV breakpoints (+/−100 bp around the breakpoint) were mildly enriched in GC-content with a median +2.6% shift (IQR −4.9% to +13.6%) relative to the hg19 GC-background (Extended Data Fig. 6l), suggesting a bias towards GC-rich coding regions. While detection capacity is more limited with Illumina short-read data in repetitive regions, particularly at simple/tandem elements, we found global enrichment of somatic SV breakpoints at repeat elements when comparing FA SCC to sporadic HNSCC (68% vs. 64% median localization, Extended Data Fig. 6m). We also found that mobile retro-transposon insertion events were not statistically increased in FA SCCs relative to HPV-negative HNSCCs (Extended Data Fig. 6n).
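The enrichment figures in this paragraph are ratios of observed to expected breakpoint fractions, and the GC shift is a difference between the GC fraction of a breakpoint window and a genome-wide background. A minimal sketch of both calculations follows, assuming the window sequence (for example, the ±100 bp around a breakpoint) and the background GC fraction are supplied by the caller; the function names are hypothetical.

```python
def fold_enrichment(observed_fraction, expected_fraction):
    """Ratio of the observed to the expected fraction of breakpoints in a region class."""
    if expected_fraction <= 0:
        raise ValueError("expected fraction must be positive")
    return observed_fraction / expected_fraction


def gc_fraction(sequence):
    """GC fraction of a DNA sequence, ignoring any base that is not A, C, G or T."""
    called = [base for base in sequence.upper() if base in "ACGT"]
    if not called:
        raise ValueError("sequence contains no called bases")
    return sum(base in "GC" for base in called) / len(called)


def gc_shift(window_sequence, background_gc):
    """Difference between the window GC fraction and a background GC fraction."""
    return gc_fraction(window_sequence) - background_gc
```

With the repeat-region values quoted above, `fold_enrichment(0.78, 0.51)` is roughly 1.5, which is the overall repeat enrichment implied by the 78% observed versus 51% expected comparison.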

To correlate the increased genomic instability of FA SCCs with transcriptional output, we compared the transcriptomic landscape of FA SCCs against sporadic HNSCCs (Supplementary Table 8). We performed RNAseq on six FA SCC tumors and compared expression data with that of sporadic HNSCC. Although data were limited due to the number of available FA SCC samples, the transcriptional expression mirrored DNA sequencing findings, with many of the amplified genes being expressed at higher levels and deleted genes being expressed at lower levels in FA SCCs relative to sporadic HNSCCs (Extended Data Fig. 6o,p). We additionally found that the global DNA damage response was upregulated in FA SCCs and aldehyde detoxification enzymes were downregulated compared to sporadic HNSCCs (Extended Data Fig. 6q). Gene expression of ALDH/ADH genes associated with acetaldehyde, retinaldehyde, and lipid aldehyde processing was lower in FA SCCs (Extended Data Fig. 6r). We also observed transcriptional downregulation of the Major Histocompatibility Complex (MHC) Class I antigen processing and presentation pathway (Extended Data Fig. 6s). Comparison of methylation patterns between FA SCCs and sporadic HNSCC using EPIC 850K methylation arrays revealed no significant differences between the two cohorts (Supplementary Table 9).

Sporadic HNSCCs and FA pathway function

Somatic mutation and hypermethylation of FA pathway genes have been proposed to occur in a subset of sporadic HNSCCs26–28, although the functional consequences of the identified genomic changes have not been assessed. Re-analysis of the TCGA HNSCC data revealed that somatic deletions of MAD2L2 (FANCV), ALDH2, RAD51 (FANCR), and XRCC2 (FANCU) correlated with a highly copy-unstable subset of tumors, representing close to 13% of HPV-negative sporadic HNSCCs (Extended Data Fig. 7a). This subset displayed significantly increased sCNA frequencies at HNSCC driver loci relative to the complete HPV-negative HNSCC cohort (Extended Data Fig. 7b). We note, however, that the deletion of these genes may not have driven the observed instability, but may simply represent passenger mutations occurring in tumors with already high levels of genomic instability.

Recent work indicates that formaldehyde can drive hematopoietic stem cell failure and carcinogenesis even with a functional FA pathway1820. This led us to hypothesize that the increased acquisition of copy number alterations in a subset of sporadic HNSCCs may alternatively be explained by the functional overload of a genetically unaltered FA repair pathway by endogenous and exogenous aldehydes. Most notable of these are acetaldehyde, a byproduct of ethanol metabolism, as well as formaldehyde and acrolein present in tobacco smoke. To determine whether exposure to exogenous DNA crosslinking agents correlated with sCNA levels, we stratified sporadic HNSCC tumors by pack-year smoking history (Extended Data Fig. 7c). We found that the median number of focal sCNAs was elevated in smokers relative to non-smokers, and that this difference widened as the number of pack-years increased.

Subsequently, we ranked sporadic HPV-negative HNSCC tumors by level of focal copy-number instability and found that the most unstable quartile demonstrated a strong enrichment for acetaldehyde (DBS2), smoking carcinogen (SBS4 and ID3), and NHEJ (ID8) mutational signatures relative to the most copy-stable tumor quartile. We further found that tumors from smokers in the most unstable quartile had the highest exposure to tobacco when compared with those in the most stable quartile (Extended Data Fig. 7d–g). We propose that high aldehyde exposure in this subset of tumors during the early stages of tumorigenesis generates extensive ICL damage, which may overwhelm a genetically unaltered FA DNA repair pathway, particularly in the setting of mutated TP53. In turn, this may lead to a high sCNA rate that further fuels carcinogenesis.

Keratinocyte characteristics in FA SCC

To test if the combination of FA pathway and TP53 tumor suppressor deficiency may recapitulate the increased SV load and hasten SCC development, as seen in human FA tumors, we created a mouse serial allograft model of FA SCC. Fanca−/− Trp53−/− and Fanca+/+ Trp53−/− neonatal keratinocytes were immortalized and transformed by overexpression of Ccnd1 and HRASG12V, respectively. These keratinocytes were intradermally engrafted for 11 cycles, each with four independent replicates. We quantified tumor growth and performed RNAseq, Illumina WGS, histopathology, and protein expression analysis (Fig. 4a) and found that Fanca−/− keratinocytes exhibited dramatically accelerated tumor growth compared to their Fanca+/+ counterparts. The difference became apparent at the first engraftment cycle (Fig. 4b,c), despite pre-engraftment Fanca−/− and Fanca+/+ keratinocytes having identical growth in vitro (Extended Data Fig. 8a). Growth acceleration was observed through multiple subsequent engraftments, but the difference between Fanca−/− and Fanca+/+ dissipated by the time of the 11th engraftment (Extended Data Fig. 8b).

Fig. 4. Characterization of a murine FA SCC model, single-cell and spatial transcriptomics of human FA SCCs.

a Schematic of the serial engraftment of murine keratinocytes. b Representative micrographs of H&E-stained tumors derived from Fanca+/+ and Fanca−/− keratinocytes at the first engraftment cycle. c Tumor volumes during first engraftment of Fanca+/+ and Fanca−/− keratinocytes. Each point represents the mean volume of four tumors from one mouse with the standard error shown. Four replicates, each comprising four tumors intradermally engrafted within a single mouse, are indicated by separate curves. d Somatic SV counts in n=3 Fanca−/− and n=4 Fanca+/+ tumors at the 6th engraftment cycle. Bars indicate median and IQR. Two-tailed, unpaired t-test p-value (t=2.574, df=5) is shown. e and f Protein levels of epithelial and mesenchymal (e) and inflammatory (f) markers measured by western blotting of different engraftments. g UMAP embedding of single-cell transcriptomics data of FA SCC sample F44P1 (k=634 cells). h Spatial transcriptomic clusters identified in FA SCC sample F38P1 (left) and UMAP embedding of spot clusters with annotated identity (right). i KRT14 expression, spatially-mapped scTSK and p-EMT sensor score in F38P1 Visium sample. j The FA pathway prevents SV formation by repairing DNA interstrand crosslinks created by endogenous and exogenous aldehydes. k The constitutive FA repair deficiency leads to copy number alterations of oncogenes and tumor suppressors driving SCC development. Innate inflammatory keratinocyte response and the EMT in FA SCC may contribute to their aggressive nature. We propose that the functional overload of a genetically unaltered FA pathway by exogenous aldehydes in tobacco and alcohol leads to sporadic HNSCCs. It remains to be determined whether and how DNA damage contributes to EMT and more aggressive behavior in sporadic HNSCCs. In all cases, n refers to independent biological samples. Extended discussion is in the Supplementary data.

WGS of the mouse tumors revealed a median two-fold increase of somatic SVs in the tumors derived from Fanca−/− relative to Fanca+/+ keratinocytes at the 6th cycle (Fig. 4d), with the largest increase resulting from inversions (Extended Data Fig. 8c,d). It is unclear whether this is due to species-specific differences in DNA repair, the genetic background of the keratinocytes, or the influence of HRASG12V. While pre-engraftment Fanca−/− and Fanca+/+ keratinocytes exhibited similar gene expression profiles, the first engraftment induced rapid transcriptomic changes exclusively in Fanca−/− tumors. Fanca−/− keratinocytes invoked expression programs characteristic of epithelial-to-mesenchymal transition (EMT), intracellular inflammatory signaling, TGFβ pathway activation, cancer stem cell transition, and cellular metastasis, among others (Extended Data Fig. 8e,f, Supplementary Table 10). EMT marker induction at the first engraftment cycle was confirmed at the protein level, with a strong rise in SNAIL and ZEB1 levels accompanying the appearance of Vimentin. Subsequent disappearance of epithelial markers including E-cadherin, Claudin, and β-catenin occurred at the second engraftment (Fig. 4e). Fanca+/+ keratinocytes eventually underwent EMT by the 11th engraftment cycle, and this correlated with a narrowing of the growth rate difference between Fanca+/+ and Fanca−/− cells (Extended Data Fig. 8b). Protein expression studies also revealed strong activation of intracellular inflammatory pathways, including canonical (RELA, TBK1, IRF3, IKBA) and non-canonical (RELB, NFKB2) NF-κB pathways, already present in the pre-engraftment Fanca−/−, but not Fanca+/+ keratinocytes (Fig. 4f). Over the course of multiple engraftment cycles, we observed significant upregulation of the dsDNA-sensor STING in Fanca−/− tumors (Fig. 4f, Extended Data Fig. 8f). STING is activated by cGAS stimulation at micronuclei formed in the setting of genome instability, including in primary cells from FA patients29–32.

To determine if similar EMT phenotypes were present in human FA SCCs, we performed 10× 3’ single cell or nuclei transcriptomics on three primary FA patient tumors, and Visium spatial transcriptomics on a primary FA patient tumor. Unsupervised clustering of the single-cell sample (F44P1) (Fig. 4g, Extended Data Fig. 9a,b) revealed that 34% of KRT5/14+ tumor keratinocytes exhibited markers characteristic of partial EMT (p-EMT), as indicated by positive scTSK and p-EMT scores, which have previously identified p-EMT in sporadic SCCs33,34 (Extended Data Fig. 9c–f). Differential expression between p-EMT and non-EMT tumor keratinocytes demonstrated EMT-related GO gene set enrichment (Extended Data Fig. 9g, Supplementary Table 11). Integration of the single-cell tumor sample with the two single nuclei tumor samples and embedding into a single UMAP projection (Extended Data Fig. 9h, Supplementary Fig. 1a) demonstrated the presence of a KRT5/14+ tumor keratinocyte population with a high p-EMT sensor score in all three tumors (Extended Data Fig. 9i, Supplementary Fig. 1b).

Spatial transcriptomics performed on sample F38P1 revealed highly proliferative, CCND1 and EGFR-positive KRT5/14+ keratinocytes characterized by a strong p-EMT signal (Fig. 4h,i, Extended Data Fig. 10a–d). We further observed EMT-promoting signaling between p-EMT KRT14/5+ tumor keratinocytes and fibroblasts (Extended Data Fig. 10d,e). While these results suggest that p-EMT is a common feature of Fanconi patient SCCs, future studies will be necessary to further define how this phenotype develops.

Discussion

Sporadic HNSCCs and esophageal SCCs are rich in sCNAs affecting multiple tumorigenesis-associated genes13,35,36. Data presented here demonstrate that loss of the FA repair pathway increases sCNA frequency in tumors through induction of simple and complex structural rearrangements (Fig. 4j). This phenotype is consistent with a lack of proper DNA repair during replication, leading to the formation of multiple DNA double strand breaks that are repaired by NHEJ and MMEJ pathways. The SV breakpoints in FA SCCs were enriched at common fragile sites and other difficult-to-replicate regions including repeats, consistent with known functions of the FA pathway at those sites. High numbers of fold-back inversions, unbalanced translocations, templated insertions, and deletions present in FA SCCs point to mechanisms of inappropriate processing of stalled replication forks in the setting of FA repair pathway deficiency, leading to a multitude of genomic copy number alterations. Enrichment of small deletions among the structural variants in FA SCCs is of particular interest. We speculate that these occur when DNA breaks at two replication forks cannot be rescued by an intervening origin of replication. Despite previous hints that proteins involved in ICL repair may suppress chromothripsis37, we have not observed its presence. Analysis of sporadic HNSCCs performed in this study suggests that the most copy-number unstable tumors were enriched for acetaldehyde and smoking mutagen exposure signatures. We further found that cigarette smoke exposure was correlated with increased copy-number instability. Thus, we speculate that the high genome instability characterized by frequent amplifications and deletions seen in these sporadic HNSCCs may be due to an FA pathway that is functionally overwhelmed by the elevated aldehyde load derived from tobacco smoke inhalation, alcohol consumption, and environmental pollutants (Fig. 4k). The resultant genomic instability shaped by SVs consequent to an overburdened FA pathway is superimposed over known SNV and indel-inducing mutagenesis driven by tobacco use and alcohol consumption.

Our mouse model and human single-cell transcriptomics point to other SCC hallmarks potentially driven by DNA damage when the FA pathway is absent or overwhelmed. Most notably, we observed p-EMT and increased keratinocyte innate inflammatory signaling in a substantial fraction of tumor keratinocytes. The exact mechanism of these transitions remains to be determined but may involve triggering of STING-mediated NF-κB pathway activation by DNA damage. In particular, the non-canonical NF-κB pathway has been demonstrated to drive EMT in chromosomally unstable cancers38. We propose that DNA damage may contribute to the p-EMT observed in sporadic HNSCCs39 (Fig. 4k), but further study of the pathway and cell interactions, including those between the tumor and the immune system, is required to understand disease pathogenesis and improve the prevention and treatment of sporadic and FA SCCs. Taken together, our data indicate that studying the genomic and phenotypic characteristics of FA SCCs enriches observations made from model systems and SCCs in the general human population.

Online Content

Any methods, additional references, Nature Research reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https…

Methods

Human Samples

The International Fanconi Anemia Registry (IFAR) was initiated in 1982 as a prospectively collected database of clinical and genetic information for FA patients41. Subjects entered the registry at any point in the disease process, with the majority entering at onset of bone marrow failure or cancer occurrence. Informed consent to participate and to publish results of the study was given by participants, an authorized legal guardian, or next-of-kin (in cases of a deceased participant). Available medical, surgical, pathology, radiation and chemotherapy records were collected from medical centers. Tumor samples were obtained from pathology departments, clinical collaborators and NDRI with proper consents. Most of the normal samples were already available in the IFAR due to prospective collection. The Institutional Review Board of the Rockefeller University, New York, NY approved this study.

Illumina whole genome sequencing (WGS) was performed on 22 tumor samples – 19 fresh-frozen (FF) tumors and three tumor-derived primary cell lines. Whole exome sequencing (WES) was completed on 41 Formalin-Fixed Paraffin Embedded (FFPE) tumors. Of the 63 total tumor samples sequenced, 60 were SCCs and 3 were adenocarcinomas. 6 SCC samples were WGS/WES pairs (3 tumors sequenced by both WGS and WES) and 4 SCC samples were primary/metastasis pairs (2 primary SCC and 2 associated metastases). Excluding metastasis pairs and samples sequenced using multiple techniques, this yielded 55 independent SCC tumors. Of these independent FA SCCs, 44 were HNSCC, six anogenital SCCs, four esophageal SCCs, and one lung SCC. The adenocarcinomas were from bladder, cervix, and the ampulla. Normal tissue was available for 55 of the 63 total samples, predominantly from peripheral blood obtained before HSCT or primary fibroblasts. Non-tumor tissue collected during surgical resection of the tumor was used as control if other tissue sources were not available. See Supplementary Table 1 for details about all sequenced samples.

Formalin-Fixed Paraffin Embedded Human Tumor Whole-Exome Sequencing

FFPE tumor blocks were sectioned, slide-mounted, and H&E-stained to mark tumor boundaries. Adjacent normal tissue was removed by scalpel. Isolated tumor sections were placed in Qiagen deparaffinization buffer at 56°C for 5 minutes. Samples were incubated in Proteinase K and crosslink reversal buffer (Qiagen Buffer FTB) for 1 hour at 56°C for protein digestion, followed by 90°C for 1 hour to reverse-crosslink DNA. 35 μl of uracil-N-glycosylase (UNG) included with the kit was added to each sample and incubated at 50°C for 1 hour to correct fixation-induced cytosine deamination. Samples were treated with RNase A, followed by complete lysis (Qiagen Buffer AL & ethanol). DNA was captured, washed, and eluted on Qiagen gDNA capture columns. DNA was subsequently treated with NEBNext FFPE Repair Mix and purified by Ampure XP beads. Agilent Tapestation was used to determine DNA size and DNA was submitted for whole-exome sequencing. All FFPE tumor samples were paired with patient-matched normal DNA. 25 normal samples were pre-HSCT peripheral blood, 10 were primary fibroblasts, and 6 were normal tissue FFPE samples. Novogene performed exome capture (Agilent SureSelect XT V6 & IDT xGen Exome) and library preparation. FFPE tumor samples were 2×150bp sequenced to 12Gb of raw output, and normal samples were sequenced to 6Gb of raw output.

Fresh-Frozen Human Tumor Illumina Whole-Genome Sequencing

DNA was extracted from 19 fresh-frozen tumor biopsies and 3 primary tumor lines using the Qiagen Blood and Cell Culture extraction kit. Tissue was lysed in Qiagen Buffer GL containing Proteinase K, Qiagen Protease, and RNase A. DNA was captured on the Qiagen Genomic Tip column, washed, and eluted by gravity-flow. Agilent Fragment Analyzer was used to check suitability for short-read Illumina WGS and long-read PacBio sequencing. 9 samples with DNA >40kb were concurrently sequenced using PacBio and Illumina and the remaining 13 tumor samples were sequenced by Illumina WGS only. Patient-matched normal DNA for 15 tumors was extracted from primary fibroblasts or pre-HSCT peripheral blood. Illumina WGS sequencing was performed at New York Genome Center (NYGC) and National Institutes of Health (NIH) Intramural Sequencing Center using PCR-free TruSeq library prep. Tumor samples were sequenced to 60x genome coverage and normal samples were sequenced to 30x genome coverage.

Human Tumor PacBio Long-Read Whole-Genome Sequencing

6 fresh-frozen tumors and 3 primary tumor cell lines were PacBio sequenced at the Rockefeller University Reference Genome Resource Center. PacBio libraries were prepared using the SMRTbell Express Template v2 library prep kit, using 26G needle-sheared high-molecular weight (HMW) DNA as input. Agarose-plug DNA extraction was performed for the 3 primary tumor lines. 2 samples were sequenced on the PacBio Sequel 1, each with 3 × 1M SMRTcells for 10x average genome coverage. 7 samples were sequenced on the PacBio Sequel 2, each with an 8M SMRTcell for 30x average genome coverage. 1 sample in this latter cohort yielded lower-than-expected output (~10x genome coverage).

Human Tumor 10x Linked-Read Whole-Genome Sequencing

4 tumor samples (3 primary tumor lines and 1 fresh-frozen tumor) had sufficiently sized ultra-HMW DNA (median 100kb+) for 10x linked-read WGS. Tumor DNA was extracted using the Circulomics Big DNA bead kit, using the ultra-HMW elution option. Cell line DNA was extracted using agarose-plug lysis. DNA was run on Chromium WGS partitioning chips for GEM encapsulation and DNA fragment barcoding. Each 10x WGS library was sequenced to an estimated 60x coverage at the New York Genome Center.

Human Tumor 10x Single-Cell/Nuclei 3’ RNA Sequencing

10× 3’ RNA single-cell sequencing was performed on three human FA SCCs obtained from patients who had previously undergone HSCT. One tumor (F44P1) was single-cell dissociated prior to viable freezing, while the remaining two tumors (F38P1 and F46P1) were flash-frozen at the time of surgery. For the viably frozen (single-cell) sample, cells were thawed and assessed as having more than 80% viability by Trypan-Blue. The single-cell suspension was loaded onto a Chromium 3’ GEX V3 chip kit with an estimated input of 10,000 cells. Barcoded cDNA was extracted from produced GEMs by RT-cleanup and subsequently amplified for 12 cycles. Amplified cDNA was fragmented and subjected to end-repair, A-tailing, adapter ligation, and 10x-specific sample indexing following the manufacturer’s protocol. Libraries were quantified using Agilent Bioanalyzer and Thermofisher Qubit platforms. Libraries were sequenced using a 2×100 paired-end configuration on an Illumina NovaSeq 6000, targeting a depth of 50,000–100,000 reads per cell. For flash-frozen tumor samples, single nuclei were extracted using chilled 0.1% Nonidet P40 lysis buffer and mechanical homogenization, followed by several wash steps. For each single-nuclei sample, a target GEX input of 10,000 nuclei was used. Post-GEX library processing steps were identical to the single-cell tumor sample.

Human Tumor 10x Visium Spatial transcriptomics

10x Visium spatial transcriptomics was performed on one FA SCC tumor (F38P1) for which matched 3’ single-cell/nuclei RNAseq was also performed. The flash-frozen tumor biopsy was OCT-embedded and cryostat-sectioned to a 10μm slice. After initial permeabilization-time optimization with the Visium test slide kit, the tumor slice was mounted on a Visium assay slide (5,000 × 55μm capture spots), H&E stained and imaged, and subsequently permeabilized for 24 minutes for in-situ RNA capture. Captured mRNA was reverse-transcribed and cleaved from the slide, followed by cDNA amplification, fragmentation, end-repair, A-tailing, adapter ligation, and 10x-specific sample indexing. The library was quantified using Agilent Bioanalyzer and Thermofisher Qubit platforms. The library was sequenced in paired-end mode on an Illumina NovaSeq 6000 targeting 150,000 reads per capture spot.

Human Tumor DNA Methylation Analysis

6 FA SCCs and matched normal tissue with sufficient DNA yield were selected for DNA methylation analysis on the Illumina EPIC Methylation Array platform. Library preparation and methylation calls were performed at the NIH Genomics Core.

Human Tumor Bulk RNAseq

RNAseq was performed on 6 FA SCCs with sufficient tissue material. Total RNA was extracted using Trizol methodology. Purified RNA was processed at New York Genome Center with polyA-mRNA capture prior to cDNA synthesis and 2×150bp PE sequencing to >30 million reads/sample.

Mouse Keratinocyte Line Generation

All mouse studies were approved by the Rockefeller University Institutional Animal Care and Use committee (IACUC). Fanca+/− 129S mice (gift from Markus Grompe) and Trp53+/− B6/129S mice (B6.129S2-Trp53tm1Tyj/J, Jackson Labs) were crossbred to produce Fanca+/+ × Trp53−/− and Fanca−/− × Trp53−/− offspring. At post-natal day 4, a neonate from both genotypes was humanely euthanized and keratinocytes were harvested from back-skin tissue via dispase separation42. Fanca−/− keratinocytes were male, Fanca+/+ keratinocytes were female. Primary keratinocytes were plated on 3T3-J2 feeder cells (Kerafast, EF300) until colony formation. On day 5 post-plating, keratinocytes of both genotypes were immortalized with pZIP-mCMV-Ccnd1-IRES-mCherry lentivirus, produced in 293T cells (ATCC CRL-3216). Cells were subsequently FACS-sorted for integrin a6hi × integrin b1hi × mCherry+ markers, producing basal43 (a6hi × b1hi) Ccnd1OE immortalized keratinocyte lines (Supplementary Figure 4). To allow for tumor generation upon engraftment, HRASG12V-puro retrovirus was then transduced into the keratinocyte lines and infected cells were puromycin selected. pBabe-puro Ras V12 was a gift from Bob Weinberg (Addgene plasmid # 1768; http://n2t.net/addgene:1768; RRID:Addgene_1768). All cell lines specifically derived for this study were authenticated by PCR and by sequencing. 3T3-J2 and 293T cells were used soon after purchasing from the suppliers and were not authenticated. All cell lines were tested for mycoplasma using LookOut® Mycoplasma PCR Detection Kit (Sigma-Aldrich) and were negative.

Mouse Tumor Serial Allograft Study

Fanca+/+;Trp53−/−;Ccnd1OE;HRASG12V and Fanca−/−;Trp53−/−;Ccnd1OE;HRASG12V mouse keratinocytes were serially intradermally allografted into backs of 7-week-old female Nude/J mice (Jackson Labs) over 11 engraftment cycles (approx. 17 weeks). Four replicates were created for both Fanca+/+ and Fanca−/− genotypes, with each replicate carried forward independently through the engraftment cycles. Each replicate consisted of 4 intradermal engraftment sites on the same mouse, which were combined after harvesting at the end of each cycle. The starting sample size was set at the outset with n=4 independent biological replicates for the intended purpose of comparative SV analysis and RNAseq, but tumor growth analysis was also simultaneously performed as a secondary analysis. At the first engraftment cycle, each engraftment site received 150,000 cells suspended in a 1:1 ratio of DPBS & Matrigel (Corning). Resulting tumors were grown for a maximum of two weeks. As per IACUC protocol, euthanasia was performed when tumor size exceeded 2 cm. Host mice were humanely euthanized, and tumors were dissected, minced, dissociated, and strained to a single-cell mixture using collagenase and Trypsin. Resulting tumor cells were plated at 5% CO2 / 3% O2 at 37°C, and allowed to recover for 48 hours in low-calcium keratinocyte media (E-Low Media)42. This ensured consistent cell viability between replicates and permitted in vitro analysis of post-transplant cells. The number of engrafted cells was lowered to 100K for both genotypes for engraftments two to five, further lowered to 70K cells for cycles six to nine and reduced to 35K for the 10–11th cycles. One Fanca−/− replicate was lost at the fifth cycle due to host death, which resulted in removal of this replicate from the sixth engraftment cycle onwards. At the first, second, sixth, and 11th cycles, mice were anaesthetized, and tumors were measured in three dimensions with ISO-calibrated digital calipers (VWR) every two days. One repeat first-engraftment cycle experiment for both genotypes was set up to collect tissue for H&E histology of resulting tumors. No randomization or blinding was employed in this study.

Mouse Tumor Illumina WGS

At the sixth engraftment cycle, Fanca−/− and Fanca+/+ tumor cells from each genotype replicate were harvested. To remove residual host cells, cells were FACS-sorted for mCherry+ directly into cytoplasmic lysis buffer (Qiagen Buffer C1), after which the standard Qiagen Blood and Cell Culture DNA kit workflow was carried out for DNA extraction. Normal DNA was simultaneously extracted from pre-engraftment Fanca+/+;Trp53−/−;Ccnd1OE;HRASG12V and Fanca−/−;Trp53−/−;Ccnd1OE;HRASG12V keratinocyte lines. Illumina WGS libraries were prepared at NIH using the PCR-free TruSeq protocol and sequenced to 60x genome depth for tumors and 30x genome depth for pre-engraftment normals.

Mouse Tumor RNAseq

At the first, second, sixth, and 11th engraftment cycles, Fanca+/+;Trp53−/−;Ccnd1OE;HRASG12V and Fanca−/−;Trp53−/−;Ccnd1OE;HRASG12V tumor cells from each replicate were harvested. Cells were FACS-sorted for mCherry+ cells directly into RNA-stabilizing cell lysis buffer (Qiagen Buffer RLT). Homogenization was performed on QIAshredder columns, followed by RNA isolation using the Qiagen RNeasy Plus kit. RNA was simultaneously extracted from pre-engraftment keratinocytes. Total RNA was processed by Novogene for polyA+ capture, cDNA synthesis, and 2×150bp PE Illumina sequencing to >30 million reads/sample.

Histology

Mouse tissue processing, H&E staining and slide scanning were performed at Histowiz.

Western Blotting of Mouse Tumor Cells

Fanca+/+;Trp53−/−;Ccnd1OE;HRASG12V and Fanca−/−;Trp53−/−;Ccnd1OE;HRASG12V mouse tumor cells from the first, second, sixth, and 11th engraftment cycles were scraped from tissue culture plates into DPBS. Cells were lysed in Laemmli buffer containing benzonase and phosphatase inhibitor. Supernatant was quantified by Bradford and DC protein assay (BioRad). 25μg of sample was loaded per well on 8–12% Bis-Tris gels (BioRad) and run for 2.5h/100V. Protein was transferred overnight onto PVDF at 35V/4°C and membranes were blocked in 5% milk. They were incubated with primary antibody overnight at 4°C. After washing, membranes were incubated with secondary HRP-conjugated antibody at RT for 2 hours. They were developed with Western Lightning ECL Plus (Perkin Elmer). Blots were imaged on an Azure c300 chemiluminescent imager. Full blots are shown in Supplementary Figures 2 and 3.

Antibodies

Those obtained from Cell Signaling Technology included CCND1 (E3P5S XP), HRAS (D2H12), VIM (D21H3 XP), CLDN1 (D5H1D XP), β-CATENIN (D10A8 XP), SNAIL (C15D3), CDH1 (24E10), ZEB1 (E2G6Y XP), RELA (p65) (D14E12 XP), STING (D1V5L), TBK1 (D1B4), IRF3 (D83B9), RELB (C1E4), IKBA (44D4), NF-KB2 (p100/p52) (4882), H3 (D1H2 XP). Other antibodies included HEY1 (ProteinTech 19929–1-AP), ITGA6 (BD Horizon GoH3, BV650), ITGB1 (BioLegend HMB1–1 APC/Cy7).

Data Analysis

Somatic SNV, Indel and Structural Variant Calling

Human WES & WGS sequencing data were processed via the Wellcome Sanger Institute (WSI) Cancer Genome Project pipeline44. Sequencing data were aligned to NCBI human reference genome GRCh37 using BWA-mem45. Duplicate reads were marked by Picard MarkDuplicates46. Somatic single-nucleotide variants (SNVs) were called with CaVEMan47 and somatic indels were called by Pindel48. Post variant calling filters were used to remove artefacts, alignment errors and low-quality variants as described in detail previously49. The following filters were applied: (1) common single nucleotide polymorphisms and artefacts were filtered by their presence in a panel of 75 unmatched normal samples50; (2) low-quality variants were filtered by setting the median alignment score threshold (ASMD ≥ 140) and excluding variants for which the majority of reads supporting the variant are clipped (CLPM = 0); and (3) variants were filtered by the number of paired-end reads supporting them. Software implementing these filtering steps can be found at https://github.com/MathijsSanders/SangerLCMFiltering. Since many tumor samples suffered from HSCT-donor blood contamination, a strict SNV/indel-filtering algorithm was employed next. Variants were deemed somatic if they were either: 1) not present in the GNOMAD51 germline variant database, or 2) positively reported in the COSMIC52 database.
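The final somatic-status rule above (absent from the germline database, or reported in COSMIC) can be illustrated with a minimal Python sketch. This is not the pipeline's code: the `Variant` type, the `is_somatic` function, and the toy variant sets are hypothetical stand-ins for real GNOMAD and COSMIC lookups.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical minimal variant key: chromosome, position, ref and alt alleles."""
    chrom: str
    pos: int
    ref: str
    alt: str


def is_somatic(variant, germline_db, cosmic_db):
    """Keep a variant if it is absent from the germline set OR reported in COSMIC."""
    return variant not in germline_db or variant in cosmic_db


# Toy data (invented for illustration only).
germline_db = {Variant("1", 1000, "A", "G"), Variant("2", 2000, "C", "T")}
cosmic_db = {Variant("2", 2000, "C", "T")}

calls = [
    Variant("1", 1000, "A", "G"),  # germline only -> removed
    Variant("2", 2000, "C", "T"),  # germline but also in COSMIC -> kept
    Variant("3", 3000, "G", "A"),  # in neither database -> kept
]
kept = [v for v in calls if is_somatic(v, germline_db, cosmic_db)]
```

Under this rule, a known cancer mutation that also appears in a germline database is retained, at the cost of occasionally rescuing a true germline variant.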

Structural variants (SVs) were called with BRASS53. SVs were further annotated by AnnotateBRASS (https://github.com/MathijsSanders/AnnotateBRASS) as described in detail previously49. In brief, AnnotateBRASS determines per SV: the number of supporting read pairs; the variance in alignment position of read pairs; whether read pairs are clipped, carry an excess of variants not reported in SNP databases, or are in the correct orientation; and whether SV-supporting read pairs lie in regions marked by high proportions of other read pairs aligning to different parts of the genome (high homology). The post-annotation filtering strategy was described in detail previously49. For 14 tumor samples with paired normal WGS, germline SVs were removed, followed by a second-pass filtering against a normal in-house population database. For 8 tumor samples without paired normal WGS, SV filtration was performed using the population database removal method. Breakpoint microhomology and unbalanced/balanced translocation status were determined by BRASS. PCAWG-HNSCC, PCAWG-BRCA1mut, and PCAWG-BRCA2mut WGS cohorts36 were analyzed using identical methodology to the Fanconi tumor WGS cohort.

SNV and indel signature analysis was performed using two independent methodologies: Bayesian decomposition through Markov-Chain-Monte-Carlo sampling by sigfit (v2.2.0)54, and statistical bootstrapping by Sigflow (v1.5)55. For the SNV analysis, we used 13 HSCT-negative whole-exome samples that each had ≥100 SNV mutations, and 4 HSCT-negative whole-genome samples. All samples had matched-normal controls. We performed two separate SNV analyses: one restricted to the exome (Extended Data Fig. 2b), comprising 13 WES samples and 4 WGS samples with SNV calls restricted to exonic regions; and the other surveying the whole genome and comprising 4 WGS samples (Extended Data Fig. 2c). For indel analysis, we limited our sampling to the 4 HSCT-negative whole-genome samples, as FA SCC whole-exome samples did not provide sufficient indel counts for reliable signature analysis.

Mouse tumor and pre-engraftment WGS samples were processed through the same bioinformatic workflow as human samples. Samples were aligned against the mm10 reference genome and somatic SV calls were made by filtration against pre-engraftment controls.

Human Tumor HPV Detection

Reads from Fanconi SCC WGS and WES samples, PCAWG-HNSCC WGS, and TCGA HNSCC-WES samples not mapping to the human genome were aligned against 218 known HPV strain genomes (NIH-PAVE database)56 with BWA-mem45. A sample was considered positive for HPV when a total of ≥ 1 unique paired-end reads aligned without clipping to a non-repetitive region of any HPV genome.

Human Tumor Copy Number Alteration and Focal Peak Calling

Following quality-control filtering, including removal of PCR-duplicated reads and flagging of secondary (ambiguous) and supplementary (chimeric) read alignments, Fanconi WES and WGS tumor samples were separately processed by the CNVkit (v0.9.7)57 pipeline to generate segmented copy-number alteration (sCNA) profiles. Initial tumor sCNA-ratio calls were made against a pooled-normal reference panel, comprised of either normal-WES samples (n=43) or normal-WGS samples (n=14) from our sequencing cohort. Circular binary segmentation was subsequently used to generate segmented sCNA calls. Raw segmented sCNA profiles from WGS and WES tumor cohorts were then combined for batch input into GISTIC2 (v2.0.23)58. GISTIC2 called recurrent focal amplification and deletion peaks in the FA SCC cohort, including defining those genes contained within the boundaries of focal peaks. Variable peak calling parameters (conf: 0.99, armpeel: true, brlength-cutoff: 0.7, js: 4, qvt: 0.25) were set identical to the TCGA-HNSCC (Genome-Wide SNP6 Copy Number Array) GISTIC2 run (GDAC Firehose HNSCC run v2016_01_28).59 To account for variations in tumor purity affecting called sCNA-depth, TCGA-HNSCC tumor purities (Absolute v1.0.6)60,61 and Fanconi SCC tumor purities (Theta2 v0.7)62 were used as inputs to normalize sCNA signal amplitudes. CNVkit ‘call’ tumor purity compensation function was employed for this transformation in both cohorts. Using the normalized sCNA calls, amplification (log2[sCNA]≥0.9) and deletion (log2[sCNA]≤−0.9) thresholds were set for reporting significantly CN-altered genes at the defined focal peaks.
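As a small illustration of the reporting thresholds stated above, the following Python sketch classifies a purity-normalized log2 copy-number ratio. The function name and labels are hypothetical; only the ±0.9 cut-offs come from the text, and the purity normalization itself (done with CNVkit in the study) is not reproduced here.

```python
AMP_THRESHOLD = 0.9    # log2[sCNA] >= 0.9 reported as amplification (from the text)
DEL_THRESHOLD = -0.9   # log2[sCNA] <= -0.9 reported as deletion (from the text)


def classify_scna(log2_ratio, amp=AMP_THRESHOLD, dele=DEL_THRESHOLD):
    """Label a purity-normalized log2 copy-number ratio using the reporting thresholds."""
    if log2_ratio >= amp:
        return "amplification"
    if log2_ratio <= dele:
        return "deletion"
    return "not reported"


# Toy gene-level values (invented for illustration only).
example = {"geneA": 1.4, "geneB": -1.2, "geneC": 0.3}
labels = {gene: classify_scna(value) for gene, value in example.items()}
```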

For Extended Data Fig. 3d, comparing global focal copy-number alteration rates, we performed a comparative GISTIC2 analysis on FA SCC, HPV+ HNSCC, HPV− HNSCC, BRCA2mut and BRCA1mut cohorts. To eliminate any platform-specific bias, we ran this analysis on both WGS and genome-wide CNV array data (CGH) from TCGA for sporadic HNSCC, as well as whole-exome and whole-genome sequencing data from FA SCC. All cohorts were processed using identical GISTIC2 peak calling parameters (conf: 0.99, armpeel: true, brlength-cutoff: 0.7, js: 4, qvt: 0.25), were corrected for tumor purity, and report-gated to log2(CN)>0.9 or log2(CN)<−0.9. To reduce any potential noise related to segmentation between the cohorts, we merged adjacent focal CNAs occurring in the same cytoband for these global counts, and removed all known false-positive peaks.

Clonal composition

For 4 HSCT-negative FA tumor WGS samples, we determined the allele-specific subclonal copy number structure63. Somatic mutations for these cases were called as described above. Input files were generated with dpclust3p (v2.28)64. The generated output files were subsequently used as input for dpclust (v2.2.7)65, which uses a Dirichlet Process Bayesian framework to cluster somatic mutations into clones based on the cancer-cell fraction (CCF) after tumor purity, ploidy and subclonal CNV correction, with default settings. Only clones comprising at least 100 somatic mutations were considered. All samples were manually reviewed to determine whether the information provided by Battenberg was correct, which would otherwise result in clones with high CCF point estimates (>> CCF point estimate 1) or non-dominant clones (no clone with CCF point estimate close to 1).

Replication timing

DNA replication timing data were obtained from Koren et al. (2012)23. The replication timing at each breakpoint of each sCNA was calculated by linear interpolation of the replication timing values on the respective chromosomes. The distribution of replication timing of all breakpoints was then compared to the whole genome. p-values were calculated using the two-sample Kolmogorov-Smirnov test.
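The two steps described above, interpolating timing at a breakpoint and comparing two distributions, can be sketched in plain Python. This is an illustrative assumption-laden sketch, not the study's code: the function names and toy values are hypothetical, positions outside the profiled range are clamped to the nearest value (an assumption), and only the Kolmogorov-Smirnov D statistic is computed. In practice a statistics library such as SciPy (`scipy.stats.ks_2samp`) would also supply the p-value.

```python
from bisect import bisect_right


def interpolate_timing(pos, positions, values):
    """Linearly interpolate replication timing at `pos`.

    `positions` must be sorted ascending and paired with `values`.
    Positions outside the profiled range are clamped (an assumption of this sketch).
    """
    if pos <= positions[0]:
        return values[0]
    if pos >= positions[-1]:
        return values[-1]
    i = bisect_right(positions, pos)
    x0, x1 = positions[i - 1], positions[i]
    y0, y1 = values[i - 1], values[i]
    return y0 + (y1 - y0) * (pos - x0) / (x1 - x0)


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov D: largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Toy timing profile on one chromosome (invented values).
profile_pos = [100, 200, 300]
profile_val = [1.0, 0.0, -1.0]
timing_at_breakpoint = interpolate_timing(150, profile_pos, profile_val)
```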

Overlap test

A permutation-based overlap test was performed to assess whether FA SCC and PCAWG breakpoint sites were enriched for early replicating fragile sites66, common and rare fragile sites67,68, and the aphidicolin breakome69. Breakpoints were stratified by type (deletion, inversion, translocation, or tandem duplication). Analysis was performed on each set of stratified regions compared to each set of fragile sites. Fragile site lists that were not originally in the hg19 genome were converted to hg19 using the UCSC LiftOver tool. In the case of early replicating fragile sites, gene locations were used for analysis. The fragile site lists were each permuted 1000 times by generating random genomic windows in the hg19 genome, with sizes equal to the original regions. These permuted windows were not permitted to overlap gaps in the reference genome, overlap each other, or overlap regions in the original gains/losses list. The occurrence of FA and PCAWG SV breakpoints in both the permuted regions and the true fragile sites was determined. Then, a z-score was calculated for the number of breakpoints in true fragile sites compared to permutations, and a two-tailed p-value for observing a z-score as or more extreme was calculated.
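The z-score and two-tailed p-value at the end of this procedure can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the function names are hypothetical, regions are treated as half-open intervals, the sample standard deviation is used, and the p-value assumes the z-score is compared against a standard normal distribution. Generation of the permuted windows themselves is not shown.

```python
import math
from statistics import mean, stdev


def count_in_regions(breakpoints, regions):
    """Count breakpoints (chrom, pos) falling in any region (chrom, start, end), end exclusive."""
    return sum(
        any(chrom == r_chrom and start <= pos < end for r_chrom, start, end in regions)
        for chrom, pos in breakpoints
    )


def permutation_z_test(observed, permuted_counts):
    """Return (z, two-tailed p) for an observed count against permuted counts.

    Assumes a normal approximation: p = erfc(|z| / sqrt(2)).
    """
    mu = mean(permuted_counts)
    sd = stdev(permuted_counts)
    if sd == 0:
        raise ValueError("permuted counts have zero variance")
    z = (observed - mu) / sd
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Toy example (invented numbers): observed overlap versus five permutations.
breakpoints = [("1", 150), ("1", 500), ("2", 150)]
fragile_sites = [("1", 100, 200)]
observed = count_in_regions(breakpoints, fragile_sites)
z_null, p_null = permutation_z_test(10, [8, 10, 12, 10, 10])
z_high, p_high = permutation_z_test(20, [8, 10, 12, 10, 10])
```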

Human Tumor PacBio Sequencing Analysis

PacBio movies were converted to FASTQ (PacBio bam2fastq) and aligned to hg19 using NGMLR (v0.2.7)70. SVs were called with Sniffles (v1.0.12)70. Sniffles detects deletions, tandem duplications, inverted duplications (BFBCs), inversions, translocations, and large insertions. The number of zero-mode waveguides (ZMWs) required to call an SV was set to three for 8 of 9 SCC samples and all four paired normal samples, ensuring that at least three native DNA molecules sharing the same breakpoint were present. One SCC sample (F45P1) had below-expected coverage, which required a ZMW threshold of 2 to prevent initial over-rejection for this sample. Raw calls were initially filtered using AnnotSV (v1.0)71 to remove germline SVs present in population databases, including GNOMAD and the 1000 Genomes Database72, together representing approximately 15,000 human WGS samples. We subsequently designed and implemented a breakpoint alignment-noise tolerant intra-cohort filtering algorithm to stringently remove germline SV calls also present in any sequenced normal sample or other SCC sample, termed AnnotateSniffles (v1.0) (https://github.com/MathijsSanders/annotateSniffles). The algorithm comprises two separate filtering strategies: (I) detecting the queried SV of interest in the AnnotSV-annotated Sniffles output of matched normal, unmatched normals, and all other tumor samples, and (II) detecting reads in the NGMLR-aligned PacBio BAM file of matched normal, unmatched normals, and all other tumor samples supporting the presence of the queried SV of interest.

In the first step, AnnotateSniffles takes the PacBio BAM file and AnnotSV-annotated Sniffles output from the sample of interest as input. AnnotSV-annotated Sniffles output for matched normal, unmatched normals, and other tumor samples (independent non-normal samples unlikely to carry the same somatic SV) is then provided for filtering (referred to as the ‘control panel’). As the alignment of SV breakpoints detected from long-read sequencing is on occasion imprecise (the predicted exact base-pair of the breakpoint exhibits a small degree of variability), AnnotateSniffles utilizes a filtering strategy that depends on SV size. For non-translocation SVs, a window is centered on each estimated breakpoint locus of the SV of interest; by default, the window size is 5% of the SV size or 20 nucleotides, whichever is larger. For SVs ≥ 400kb, a maximum window of 20,000 nucleotides is set. If an SV whose breakpoints fall within the respective filtering windows exists in the AnnotSV-annotated Sniffles output of the control panel, the SV of interest is considered detected in that control. For translocations, a window size of 2.5kb is used by default. The algorithm adds two separate columns to the AnnotSV-annotated Sniffles input file, indicating how many normal or tumor samples carry the same SV.
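The size-dependent window rule can be written out as a short Python sketch. It is an illustration of the stated defaults only, not the AnnotateSniffles source: the function names are hypothetical, and treating a control breakpoint as matching when it lies within half a window on either side of the query breakpoint is an assumption about how "centered" is applied.

```python
MIN_WINDOW = 20               # nucleotides (default minimum from the text)
MAX_WINDOW = 20_000           # nucleotides (cap reached by SVs >= 400 kb)
TRANSLOCATION_WINDOW = 2_500  # nucleotides (default for translocations)


def filter_window_size(sv_size=None, is_translocation=False):
    """Window size for breakpoint matching: 5% of SV size, bounded to [20, 20000]."""
    if is_translocation:
        return TRANSLOCATION_WINDOW
    five_percent = sv_size * 5 // 100  # integer arithmetic avoids float rounding
    return min(MAX_WINDOW, max(MIN_WINDOW, five_percent))


def same_sv(query_breakpoints, control_breakpoints, window):
    """True if both control breakpoints fall inside windows centered on the query breakpoints."""
    half = window / 2
    return all(
        abs(q - c) <= half for q, c in zip(query_breakpoints, control_breakpoints)
    )


# Toy example: a 1 kb deletion gets a 50 nt window (25 nt either side).
window_1kb = filter_window_size(1_000)
matched = same_sv((1_000, 2_000), (1_005, 1_995), window_1kb)
unmatched = same_sv((1_000, 2_000), (1_030, 2_000), window_1kb)
```

Note that 5% of 400 kb is exactly 20,000 nucleotides, so the stated cap and the proportional rule meet at the same value.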

On occasion, SVs were absent from the AnnotSV-filtered Sniffles output despite reads supporting the SV of interest being present at low levels in the control panel. The most prominent reason for such an absence was low coverage at the locus of interest, with too few reads supporting the SV to pass Sniffles’ detection threshold. In the second step of AnnotateSniffles, the NGMLR-aligned PacBio BAM files of all samples are taken as input to detect any read that supports the queried tumor SV. By default, the algorithm extracts reads from each control sample in a window of 2.5kb centered on each SV breakpoint. Windows are merged if the distance between the SV breakpoints is < 2.5kb. If the SV is of sufficient size, long reads are soft-clipped on or near an SV breakpoint, and the soft-clipped portion of the read is aligned to the genome as a supplementary alignment. AnnotateSniffles extracts all supplementary alignment information from the read and determines whether the start or end of the supplementary alignment is sufficiently close to the other SV breakpoint (i.e. the breakpoint other than the one at which the primary alignment is positioned). The SV is assumed present in the control if a read meeting this criterion exists at either of the 2 SV breakpoints. For small SVs, primarily deletions ≤ 1.5kb, there are no supplementary alignments, and the mutation event is instead encoded in the alignment itself. Reads are extracted using the merged window method described above, and their CIGARs, which encode the exact alignment of the read including small and large deletion and insertion events, are extracted. Using the CIGAR, we determine whether there is an encoded deletion or insertion event whose boundaries are sufficiently close (based on the window size formula described above) to the SV breakpoints detected in the sample of interest. The SV is assumed present in the control panel if any such read exists. The algorithm again adds 2 columns describing separately how many normal or other tumor samples carry the same SV. In this manuscript, we filtered out all SVs detectable in either the AnnotSV-annotated Sniffles output or the NGMLR-aligned PacBio BAM file of any normal or other tumor sample.
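The CIGAR-based check for small events can be sketched as follows. This is a minimal illustration rather than the AnnotateSniffles code: the function names are hypothetical, coordinates are assumed to be 0-based with the read's reference start taken from the alignment record, and the window is assumed to be a total width centered on each reported boundary.

```python
import re

# SAM CIGAR operations: a length followed by one operator character.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operators that advance the position on the reference.
REF_CONSUMING = set("MDN=X")


def cigar_indels(ref_start, cigar):
    """Return (op, ref_start, ref_end) for each insertion or deletion."""
    pos = ref_start
    events = []
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "D":
            events.append(("D", pos, pos + length))
        elif op == "I":
            # An insertion occupies no reference bases.
            events.append(("I", pos, pos))
        if op in REF_CONSUMING:
            pos += length
    return events


def read_supports_deletion(ref_start, cigar, del_start, del_end, window):
    """True if the read encodes a deletion whose two boundaries each fall
    within half a window of the reported deletion boundaries."""
    half = window / 2
    return any(
        op == "D" and abs(s - del_start) <= half and abs(e - del_end) <= half
        for op, s, e in cigar_indels(ref_start, cigar)
    )
```

For example, a read starting at reference position 1000 with CIGAR `100M50D100M` encodes a deletion spanning 1100 to 1150, which would support a reported deletion at 1105 to 1148 when a 20-base window is used.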

In a final filtering step for small somatic deletions <1kb called in PacBio data and reported in this paper, we manually cross-checked all AnnotateSniffles-passing deletion calls in this size range for 1) presence in any other PacBio or Illumina WGS sample BAM, and 2) deletion breakpoint consistency. This caught a small number of false positive calls, most often where the average breakpoint standard deviation / called deletion size ratio was > 0.15.

SV breakpoint co-localization to known hg19 repeat regions was performed with AnnotSV, which subclassified and annotated repeats by RepeatMasker73 family grouping.

Assessing the SV overlap (>1kb) detected in matched Illumina and PacBio WGS data

SVs were detected in Illumina WGS data and filtered using the BRASS and AnnotateBRASS combination described above. SVs were detected in NGMLR-aligned PacBio data and filtered using the Sniffles and AnnotateSniffles combination described above.

The overlap of SVs detected from Illumina WGS data in matched PacBio WGS data was done by ValidateStructuralVariants (v1.0) (https://github.com/MathijsSanders/validateStructuralVariants). In short, reads were extracted in windows from the PacBio data, centered on the BRASS Illumina SV breakpoints with a window size of 5kb. Windows are merged if the SV breakpoint distance is ≤ 5kb. For large SVs or translocations, long reads are soft-clipped and the clipped sequence aligned separately against the genome, termed ‘supplementary alignment’. Using the supplementary alignment information, we determined whether the supplementary alignment start or end position was sufficiently close to the other SV breakpoint (i.e. the SV breakpoint different from the proximity of the primary alignment), using at default a window size of 250 bases centered on the SV breakpoints. The SV was deemed detected if the alignment start or end position of the primary alignment and the alignment start or end position of the supplementary alignment were sufficiently proximal to each SV breakpoint. Small deletion or insertion events are often encoded in the CIGAR (i.e. exact definition of the read alignment) of long read alignments. CIGARs were evaluated for small events if not deemed present by the first approach. Using the alignment start position and the CIGAR, the algorithm determined whether the CIGAR encodes for a deletion or insertion event whose boundaries fall within each window centered on SV breakpoints with a window size of 50 bases. If true, the SV is deemed detected. The algorithm adds 2 columns to the BRASS BED file detailing how many reads and Zero-Mode Waveguide-associated reads support the SV. An SV was deemed detected if any read in the PacBio WGS data supported its presence.
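The read-extraction windows described above can be sketched in Python. This is an illustration under stated assumptions, not the ValidateStructuralVariants code: the function name is hypothetical, the 5kb window is treated as a total width centered on each breakpoint, and merging is applied only when both breakpoints lie on the same chromosome.

```python
def extraction_windows(chrom1, pos1, chrom2, pos2, window=5000):
    """Return the genomic regions from which reads would be extracted.

    One merged region is returned when both breakpoints lie on the same
    chromosome no more than `window` bases apart; otherwise one region
    per breakpoint. Region starts are clamped at 0.
    """
    half = window // 2
    if chrom1 == chrom2 and abs(pos2 - pos1) <= window:
        lo, hi = min(pos1, pos2), max(pos1, pos2)
        return [(chrom1, max(0, lo - half), hi + half)]
    return [
        (chrom1, max(0, pos1 - half), pos1 + half),
        (chrom2, max(0, pos2 - half), pos2 + half),
    ]
```

Merging avoids extracting the same reads twice for short SVs, while translocations and large SVs are always queried as two separate regions.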

The overlap of SVs detected from filtered PacBio Sniffles output in matched Illumina WGS data was done by ValidateSVIllumina (v1.0) (https://github.com/MathijsSanders/validateSVIllumina). This algorithm focuses on SVs of size ≥ 1kb, as this is the lower detection limit of BRASS. In short, the algorithm takes the AnnotateSniffles-filtered Sniffles file and Illumina WGS BAM file as input. Reads are extracted from the Illumina WGS BAM file using windows centered on the Sniffles SV breakpoints with, at default, a window size of 500 bases. The same algorithmic procedure as AnnotateBRASS was used to determine whether a read-pair supports the SV of interest as detailed before. The algorithm adds 2 columns to the Sniffles input file, detailing whether the SV was detected in the matched Illumina WGS data and how many read-pairs support it.

Detecting small deletions (<1kb) from PacBio WGS data in matched Illumina WGS data

Small deletions < 1kb were detected in PacBio WGS data by Sniffles, annotated with AnnotSV combined with AnnotateSniffles filtering, and finally manually checked as described above. Deletions of this size are often not called by SV callers for Illumina WGS data (e.g. BRASS), due to the convergence of SV size and mate-paired insert size. We used cgpPindel to detect deletions from Illumina WGS data as previously described in detail.64 Small deletions detected by Sniffles and passing all filtering criteria were searched for in the Pindel VCF file of the matched Illumina WGS sample by OverlapSnifflesPindel (v1.0) (https://github.com/MathijsSanders/overlapSnifflesPindel). A fraction of the deletions detected by Sniffles have imprecise boundaries, i.e. the exact start and end genomic positions are uncertain. To accommodate this uncertainty, we allowed some flexibility in matching the boundaries of the Sniffles and Pindel deletion calls. Windows with a size of 5% of the deletion size (as in AnnotateSniffles), with a minimum size of 20 bases, are positioned at both deletion boundaries as reported by Sniffles. The deletion call is considered present in the Illumina WGS data if the two Pindel deletion boundaries fall within their respective windows. For a minor fraction of PacBio deletion events, the Pindel call was absent, yet reads supporting the deletion were present in the Illumina WGS data. OverlapSnifflesPindel extracts the aligned reads at the reported deletion locus to detect supplementary alignments that support the deletion. Supplementary alignments from clipped read portions at the PacBio-reported deletion boundaries were extracted, and the algorithm determined whether each supplementary alignment was positioned at the other deletion boundary. If so, allowing for a minor degree of variation by using the same window technique, the deletion event is considered detected in the Illumina WGS data. Both approaches together are used to determine the fraction of PacBio-detected small deletions present in the Illumina WGS data.
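The boundary-matching step and the resulting overlap fraction can be sketched as follows. This is only an illustration of the first (Pindel call) approach, not the OverlapSnifflesPindel code: the function names and the `(chrom, start, end)` tuple layout are hypothetical, and each window is assumed to be a total width centered on the Sniffles-reported boundary.

```python
def boundary_window(deletion_size):
    """Window width: 5% of the deletion size, with a 20-base minimum."""
    return max(20, deletion_size * 5 // 100)


def deletions_match(sniffles_call, pindel_call):
    """True if both Pindel boundaries fall in the windows centered on the
    corresponding Sniffles boundaries. Calls are (chrom, start, end)."""
    s_chrom, s_start, s_end = sniffles_call
    p_chrom, p_start, p_end = pindel_call
    if s_chrom != p_chrom:
        return False
    half = boundary_window(s_end - s_start) / 2
    return abs(p_start - s_start) <= half and abs(p_end - s_end) <= half


def fraction_recovered(sniffles_calls, pindel_calls):
    """Fraction of Sniffles deletions with a matching Pindel call."""
    if not sniffles_calls:
        return 0.0
    hits = sum(
        any(deletions_match(s, p) for p in pindel_calls)
        for s in sniffles_calls
    )
    return hits / len(sniffles_calls)
```

In the full procedure, deletions missed by this comparison would additionally be checked for supporting supplementary alignments before the final fraction is computed.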

Human Tumor 10x Linked-Read WGS Sequencing Analysis

Four FA tumors were processed through the 10x Long Ranger (v2.2.2) pipeline74. Barcoded clouds of linked-reads were aligned using Lariat (v2.2.2)75, followed by SNV/indel calling and haplotype-phasing with freebayes76, packaged with Long Ranger. SVs were called using barcode-linkage, read-linkage, split-read evidence, and paired-end read support. Called somatic SVs have a minimum size threshold of 30kb. SVs were filtered by polymorphic genomic region blacklisting in addition to calling confidence score. Long Ranger was run with the somatic flag enabled to increase the sensitivity of sub-haplotype event detection. Clusters of breakpoint-adjacent SVs sharing barcode overlap were outputted by the pipeline. These clusters were subsequently catalogued, validated, and manually reconstructed to define consensus linear SV chains.

Algorithmic Detection and Chaining of Fold-Back Inversions, Templated Insertion Chains, and Retrotransposon Element Insertions in Illumina WGS

Each structural variant (SV) is characterized by two independent breakpoints and strand specificities by BRASS. Independent SVs are tested for being part of a composite genetic lesion whenever their breakpoints are proximal (distance ≤ 2kb). Canonical inversions are recognized by two SVs with near-matching breakpoint positions and opposing strand specificities. Retrotransposon element (RTE) insertions are recognized by two independent SVs, most often translocations, where the majority of soft-clipped reads spanning the breakpoint in either region is characterized by long poly-A/poly-T tracts in the soft-clipped sequence. Only soft-clipped sequences of a minimal length of 20 bases were considered. Fold-back inversions (FBIs) were recognized as a single short-distance SV (≤ 2kb) with copy number alterations (sCNA) emanating from the two breakpoint loci in a stepwise manner. FBIs with small templated insertions in a chain (TIC) were recognized as two independent proximal translocations, with local CNAs starting at the breakpoint loci, which upon tracing reveals a chain of small templated insertions (<= 1kb). Tracing was achieved by considering the supplementary alignment positions of the longest soft-clipped reads as the next step in the chain. There are two classes of FBI-TICs: (I) Cyclical FBI-TICs: tracing the chain from one breakpoint of the first proximal SV leads to the breakpoint of the other proximal SV after a few jumps in the chain (complete cycle) and (II) Broken FBI-TICs: tracing the chain from one breakpoint of the first proximal SV, or the second proximal SV, leads after a few jumps of the chain to the decoy contig, a region of extreme genomic homology (non-unique alignment) or the inability to recover the next step in the chain – the number of possible steps in the broken chain starting from the first and second proximal SVs are reported.
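The soft-clip criterion used to recognize RTE insertions can be illustrated with a short Python sketch. The function names are hypothetical, and the minimum tract length of 10 bases is an assumption made purely for illustration, because the text does not state how long a poly-A/poly-T tract must be; only the 20-base minimum clip length and the majority rule come from the description above.

```python
import re


def has_poly_a_or_t(clip_seq, min_tract=10):
    """True if the sequence contains a run of at least `min_tract` A's or
    T's (min_tract is a hypothetical threshold, not from the text)."""
    pattern = rf"A{{{min_tract},}}|T{{{min_tract},}}"
    return re.search(pattern, clip_seq.upper()) is not None


def looks_like_rte_breakpoint(clip_seqs, min_clip_len=20, min_tract=10):
    """True if the majority of sufficiently long soft-clipped sequences at
    a breakpoint carry a poly-A/poly-T tract."""
    usable = [s for s in clip_seqs if len(s) >= min_clip_len]
    if not usable:
        return False
    n_poly = sum(has_poly_a_or_t(s, min_tract) for s in usable)
    return n_poly > len(usable) / 2
```

Soft-clipped sequences shorter than 20 bases are ignored before the majority is computed, mirroring the minimum clip length stated above.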

Mouse and Human Tumor RNAseq Analysis

Full genome sequence and transcript coordinates for both the hg19 and mm10 UCSC genomes and gene models were retrieved from the Bioconductor packages BSgenome.Mmusculus.UCSC.mm10 (v1.4.0) and TxDb.Mmusculus.UCSC.mm10.knownGene (v3.4.0) for mm10 and BSgenome.Hsapiens.UCSC.hg19 (v1.4.0) and TxDb.Hsapiens.UCSC.hg19.knownGene (v3.2.2) for hg19. Transcript sequences for building a transcriptome index were extracted using GenomicFeatures (v1.34.4) and indexed with Salmon (v0.8.2)77. Transcripts were quantified from raw FASTQ files using Salmon in quant mode. Gene expression levels as TPMs and counts were retrieved using tximport (v1.8.0)78. Normalization and transformation of counts were performed using DESeq2 (v1.20.0)79. For comparison against TCGA-HNSCC RNAseq, raw FASTQ data were imported and run through the same pipeline. Enrichment analysis was performed using GSVA (v1.34.0)80 and limma (v3.42.2)81. Heatmaps were generated with pheatmap (v1.0.10)82.

Human Tumor 10x Single-Cell/Nuclei RNAseq Analysis

10x single-cell barcoded FASTQ files were processed using the 10x Cell Ranger (v6.0.1) pipeline to generate single-cell expression matrices. Samples were subsequently imported into Seurat (v4.1.1)83 for downstream analysis. Initial filtering steps were applied to remove high-feature doublet cells (>8000 gene features), low-feature cells (<200 gene features), and apoptotic cells (>20% mitochondrial gene content). Per-cell counts were normalized with SCTransform84 using the Gamma-Poisson generalized linear model to accelerate convergence. Principal component (PC) analysis was performed on the 3000 most variable genes determined by SCTransform. The cells were projected into 2 dimensions by UMAP85 using the first 30 PCs that capture the most variance. Clustering was performed using FindClusters with the resolution set at 1.0. Differential marker expression for each cluster was then calculated using the FindAllMarkers function with a minimum reportable log2 fold-change threshold of 0.2. Using the differential expression output, clusters were manually annotated into major cell types. These included, but were not limited to: macrophages (CD163, CD86, CD14, CD68, MRC1), CD4+ T-cells and Tregs (CD4, CD3D, CD28, BATF, CTLA4, TNFRSF18, FOXP3), cytotoxic CD8+ T-cells and NK cells (CD8A, CD8B, CD96, GZMK, GZMA, KLRC2, KLRC1), KRT14/5+ tumor keratinocytes (KRT14,KRT5), neutrophils (HCAR2, HCAR3), fibroblasts (COL11A1, COL8A1, COL7A1, TWIST1, TWIST2, FN1, LUM), mast cells (CPA3, TPSAB1, MS4A2, HDC, TPSB2), Langerhans dendritic cells (CD207, CD1E, CD1A, CD1C, CD86), p-EMT tumor keratinocytes (KRT14, KRT5, LAMA3, LAMC2, LAMB3, COL17A1, ITGA6, SERPINE2, TGFBI), vascular smooth muscle cells (ACTA2, CALD1, CNN1, TRPC6, TRPC4, CD248), differentiating keratinocytes (SPRR2E, SPRR2D, KRTDAP), endothelial cells (VWF, CDH5, PECAM1, MAGI1), KRT16/17+ keratinocytes (KRT16, KRT17), KRT10+ keratinocytes (KRT10, SPINK5, DMKN), KRT7/19+ keratinocytes (KRT7, KRT19).
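The initial per-cell quality filter can be expressed compactly. The analysis itself was run in Seurat; the Python sketch below only illustrates the thresholds stated above, using a hypothetical function name and hypothetical input fields (number of detected gene features and percentage of mitochondrial gene content per cell).

```python
def passes_qc(n_features, pct_mito,
              min_features=200, max_features=8000, max_pct_mito=20.0):
    """True if a cell survives the three filters described in the text:
    low-feature cells (<200), high-feature doublets (>8000), and
    apoptotic cells (>20% mitochondrial content) are removed."""
    return (min_features <= n_features <= max_features
            and pct_mito <= max_pct_mito)


# Hypothetical example cells: (barcode, n_features, pct_mito).
cells = [
    ("cell_1", 3000, 5.0),    # kept
    ("cell_2", 150, 5.0),     # removed: too few features
    ("cell_3", 9000, 5.0),    # removed: likely doublet
    ("cell_4", 3000, 25.0),   # removed: high mitochondrial content
]
kept = [barcode for barcode, n, mito in cells if passes_qc(n, mito)]
```

Cells lying exactly on a threshold are retained here, because the text describes removal only for values strictly above or below the cutoffs.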

Single-cell CNAs were determined by inferCNV (v1.3.3)86 in F44P1, separating cells by their annotated cell identity and using CD8+ T-cells as reference control. KRT14/5+ p-EMT tumor keratinocytes, non-EMT tumor keratinocytes, and differentiating tumor keratinocytes demonstrated near-identical predicted major CNAs, suggesting that they were likely derived from the same somatic clonal lineage. Ligand-receptor signaling prediction between cancer-associated fibroblasts and p-EMT tumor keratinocytes vs non-EMT tumor keratinocytes was performed using CellPhoneDB (v3)87, using a curated selection of EMT-related ligand-receptor interactions. GO analysis was performed using the output from FindMarkers (DESeq2) differential expression analysis of p-EMT vs non-EMT KRT14/5+ tumor keratinocyte clusters in F44P1. For differential expression, a cutoff of log2 fold-change threshold ≥ 0.2 and FDR-adjusted p-value ≤0.05 was used as input for GO enrichment analysis.

EMT sensor scoring was performed using the AddModuleScore Seurat function using a curated SCC EMT gene set (referred to as ‘scTSK’) described in Ji et al34 and HNSCC partial-EMT gene set (referred to as ‘p-EMT’) described in Puram et al33. The latter study reported eight separate p-EMT gene signatures attained by using non-negative matrix factorization on the cellular gene expression profiles on a per-donor basis. We considered genes found in at least three out of the eight p-EMT gene signatures (i.e. found in at least three of the eight donors) to be indicative of an association with the p-EMT process (genes: n=26, Supplementary Table 11).
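The recurrence rule used to derive the p-EMT gene set (genes present in at least three of the eight per-donor signatures) amounts to a simple count. The sketch below uses a hypothetical function name and made-up placeholder gene labels, not the published signatures.

```python
from collections import Counter


def recurrent_genes(signatures, min_signatures=3):
    """Return genes that occur in at least `min_signatures` signatures.

    Each signature is an iterable of gene names; a gene is counted once
    per signature even if it is listed more than once.
    """
    counts = Counter(gene for sig in signatures for gene in set(sig))
    return sorted(g for g, n in counts.items() if n >= min_signatures)


# Placeholder data standing in for eight per-donor signatures.
toy_signatures = [
    ["GENE_A", "GENE_B"],
    ["GENE_A", "GENE_C"],
    ["GENE_A", "GENE_B", "GENE_B"],
    ["GENE_B"],
    ["GENE_C"],
    ["GENE_D"],
    [],
    ["GENE_D"],
]
consensus = recurrent_genes(toy_signatures)
```

With the placeholder input, only `GENE_A` and `GENE_B` reach the three-signature threshold; the resulting list is the kind of gene set that would then be passed to a module-scoring function.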

For integration of the single-cell tumor sample (F44P1) with the two single-nuclei samples (F38P1 and F46P1), per-cell counts were normalized with SCTransform84 using the Gamma-Poisson generalized linear model to accelerate convergence. Specific to SCTransform-normalized data we picked 3,000 features for integration using the SelectIntegrationFeatures function as implemented in Seurat. Integration anchors were determined using the FindIntegrationAnchors function using the features previously selected. The data of the three FA SCCs were merged by integration88. Principal component (PC) analysis was performed on the 3,000 most variable genes determined by SCTransform. The cells were projected into 2 dimensions by UMAP85 using the first 30 PCs that capture the most variance. A shared nearest-neighbor (SNN) graph was constructed based on the first 30 PCs capturing the most variance. Modularity optimization-based clustering from the constructed SNN graph was performed using a resolution set at 1 and all other parameters at default values89. Differential marker expression for each cluster was then calculated using the FindAllMarkers function with a minimum reportable log2 fold-change threshold of 0.2. Using the differential expression output, clusters were manually annotated into major cell types. These included, but were not limited to: macrophages (CD163, HLA-DRA), T-cells (CD3G, CD3D, CD6), KRT14/5/17+ tumor keratinocytes (KRT14, KRT5, KRT17), KRT16/17+ tumor keratinocytes (KRT16, KRT17), fibroblasts (COL1A2, COL3A1, COL1A1, FN1, LUM), endothelial cells (PECAM1, MAGI1, VWF), KRT10+ keratinocytes (KRT10), KRT7/19+ keratinocytes (KRT7, KRT19), myofibroblasts (ACTA2, VIM, COL1A1, COL1A2), and myocytes (MEF2C, TTN, TNNI2,PDLIM3).

Human Tumor 10x Visium Spatial Transcriptomics

10x Visium spot-barcoded FASTQs, incorporating information from the H&E-stained tumor slide images, were aligned against the default GRCh38 reference transcriptome provided by Visium, using 10x Space Ranger (v1.3.0). The same pipeline was subsequently used to generate counts. The generated counts and H&E-stained tumor slide images were imported into Seurat (v4.1.1)83 for further analysis. Per-spot counts were normalized with SCTransform90 using default parameters. Principal component analysis was run using default parameters. A SNN graph was constructed based on the first 30 PCs capturing the greatest variance. Modularity optimization-based clustering from the constructed SNN graph was performed using the default values for the resolution and other parameters89. Spots were projected into two dimensions using UMAP, taking the first 30 PCs capturing the highest variance85. The FindMarkers function, with the minimal proportion detected in any group set at 0.2, was used to determine genes differentially expressed between groups of interest. Only genes with an FDR-corrected p-value ≤ 0.05 were considered differentially expressed. Strong and spatially-widespread KRT14/5+ expression, in combination with tumor InferCNV (v1.3.3) single cell CNAs matching the ASCAT (v2.5.2) predictions from F38P1 WGS, were used to delineate the tumor clusters (0,2,3,4,7,8,11) from the normal stroma and mucosa clusters (1,5,6,9,10). For each Visium spot, we extracted the normalized gene expression levels and used these as input for AddModuleScore with default parameters as implemented in Seurat. The scTSK and p-EMT module scores were visualized with SpatialFeaturePlot, and revealed that p-EMT was strongly co-localized with cluster 6 (labeled p-EMT tumor keratinocyte cluster).

Gene set enrichment analysis (GSEA) was further performed with GSEA software (Broad Institute, v4.2.3), comparing p-EMT cluster expression against the remaining tumor clusters in the Visium sample, using the predefined gene set collections H, C2 and C5 from the Molecular Signatures Database (MSigDB 7.5.1) as input91. FindMarkers, with logfc.threshold set at 0 and min.pct set at 0.05, was used to determine genes differentially expressed between the groups of interest. The gene metric used for ranking was calculated as follows:

$$g_i = \operatorname{sign}\left(\log_2 \mathrm{FC}\right) \times \log_{10}\left(p\text{-value}\right)$$

Genes were sorted on this metric and used as input for preranked GSEA using the weighted scoring scheme (p=1) and 10,000 permutations.
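The ranking step can be sketched in Python. The sketch follows the metric as printed above, i.e. the sign of the log2 fold change multiplied by log10 of the p-value; the function names and the floor applied to p-values of zero are assumptions added for illustration. Because log10(p) is at most 0 for p ≤ 1, the metric as printed assigns negative scores to genes with a positive fold change.

```python
import math


def rank_metric(log2_fc, p_value, min_p=1e-300):
    """sign(log2FC) * log10(p-value), as the formula is printed.

    `min_p` is a hypothetical floor that keeps log10 defined when a
    p-value underflows to zero.
    """
    sign = (log2_fc > 0) - (log2_fc < 0)
    return sign * math.log10(max(p_value, min_p))


def preranked(genes):
    """Score (name, log2_fc, p_value) records and sort by the metric in
    descending order, as input for a preranked enrichment analysis."""
    scored = [(name, rank_metric(fc, p)) for name, fc, p in genes]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

The sorted list of gene names and scores corresponds to the ranked-list input expected by preranked GSEA.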

Neighborhood analysis spatial transcriptomics

10x Genomics Visium spatial transcriptomics data was analyzed with STUtility (v0.1.0)92. In brief, the spatial transcriptomics data and high-resolution H&E tissue image was loaded using the ‘InputFromTable’ function with default parameters. Next, the transcriptomics data was normalized using SCTransform90 with the same settings previously used in Seurat. Principal component analysis, UMAP embedding, and cluster identification was performed using the same Seurat functions and settings as previously utilized. For case F38P1 we identified a tumor cluster as being enriched for the partial epithelial-to-mesenchymal signature (p-EMT) and being positioned at the tumor-normal boundary. Using the RegionNeighbours function we identified the spots lining the outer ring of this p-EMT signature enriched cluster and all abutting tumor or normal spots outside of this cluster. For Extended Data Fig. 10d, we compared the spots lining the outer ring of this cluster to abutting normal spots. These spots belong to cluster 1, labelled as normal mixed fibroblast tissue. The barcodes of the identified normal and tumor spots were used as ident.1 and ident.2 respectively in the FindMarkers function from Seurat with the min.pct set at 0.2. The positive log2-fold changes, i.e., increased expression in the abutting normal spots, of the top 20 most differentially expressed genes based on the FDR-corrected p-value were used for illustration.

Human Methylation Array Analysis

Illumina EPIC 850K Methylation array beta-methylation signal scores were normalized and compared to TCGA-HNSCC 450K EPIC Methylation beta-methylation array signal scores at overlapping probed loci. We identified enriched/depleted genomic regions of methylation that were proximal or overlapping with known cancer-associated genes.

Additional Software Used

Graphpad PRISM (v8) was used to generate scatter, bar, and pie charts, along with survival curves. Maftools (v3.1.2)93 was used to generate the FA SCC integrative oncoplot, FA SCC/TCGA-HNSCC mutation comparison graph, and comparative SNV/indel TMB plot. Circa (v1.2.2)94 was used to produce genomic Circos plots. SplitThreader (v1.0)95 was used for supplementary visualization of PacBio SV events. ASCAT (v2.5.2)96 was used for generation of allele-specific copy number plots for select HSCT-negative samples. IGV (v2.8)97 was used for manual assessment of structural variants.

Statistics and reproducibility

The statistical test used for each experiment is indicated in the figure legends. Graphpad PRISM (v8) was used to calculate statistical significance of differences. For Fig. 4b, micrographs are representative of three tumors assessed by H&E. For the Extended Data Fig. 8, all replicates (three or four depending on the genotype) are shown.

Extended Data

Extended Data Fig. 1. Clinical characteristics of the FA SCC cohort and its mutational landscape.

a Age at diagnosis for the FA SCC (n=41), HPV+ (n=71) and HPV- (n=415) sporadic HNSCC cohorts, for which full clinical data were available. Clinical data for sporadic HNSCC cohorts were obtained from the TCGA database. b and c Characteristics of the 41 FA patients with complete clinical information available. Some individuals had multiple cancers. For these cases, survival was calculated from the first cancer sequenced in this study. * numbers are based on 41 individuals with complete history. ** based on the first sample sequenced if multiple tumors were sequenced. d Type and tissue site of the sequenced tumors. * two were in pyriform sinus, one in oropharynx; ** cell lines are from the tongue, pharynx, and oral cavity. *** one of these samples is a metastasis to a lymph node of another tumor in this set. e TP53 variant allele frequency (%) spread for n=43 biologically independent FA SCCs with a TP53 SNV or indel mutation. Mutant allele frequency was corrected for individual tumor purity as calculated by Theta2. Median and IQR are indicated. f Oncoplot of the FA SCC cohort indicating the variant type by color; the gene affected is listed on the left. Recurrent focal CNAs were defined by GISTIC2. Amplifications were classified as log2(sCNA)≥0.9 and focal deletions as log2(sCNA)≤−0.9 after normalizing for tumor purity. Samples are stratified by SCC tissue subtype. One adenocarcinoma sample (cervical adenocarcinoma) is shown, while the bladder and intestinal adenocarcinomas are not displayed. The y-axis of the top graph indicates the number of total somatic gene alterations from the GISTIC2 and SNV/indel analysis. In all cases, n refers to independent biological samples or individuals.

Extended Data Fig. 2. Assessing the SNV burden and COSMIC SNV mutational signatures of FA SCC.

a Comparison of tumor mutation burden between TCGA cohorts and the FA SCC cohort (n=55 independent SCCs). Each dot represents the number of exonic SNV and indel mutations detected per sample, with median mutation burden indicated by a black horizontal line. FA SCC samples are colored red and TCGA-HNSCC samples are colored blue. b sigfit (Bayesian procedure) and Sigflow (bootstrapping procedure) extraction of COSMIC single-base substitution (SBS) signatures from n=13 HSCT-negative FA SCC whole-exome samples (each with >100 SNVs) and n=4 HSCT-negative FA SCC whole-genome samples (with SNV calls restricted to the exome). c sigfit and Sigflow extraction of SBS signatures from n=4 HSCT-negative FA SCC whole-genome samples, surveying genome-wide SNVs. d sigfit and Sigflow extraction of COSMIC indel (ID) signatures from n=4 HSCT-negative FA SCC whole-genome samples with matched normal controls. Mutation fraction indicates the fraction of tumor mutations that can be explained by the particular signature. Signature exposure (sig exposure) is the number of mutations that contributed to the particular signature. For sigfit, error bars indicate the 95% highest posterior density (HPD) intervals. Grey bars indicate non-significant signature exposures, defined as exposures for which the lowest HPD limit is less than 0.01. For Sigflow, boxplots indicate the median signature exposure value and interquartile range (IQR). Whisker ends are positioned at Q1 (first quartile) - 1.5xIQR, or at the minimum value when larger than this lower range value, and Q3 (third quartile) + 1.5xIQR, or at the maximum value when smaller than this upper range value. In all cases, n refers to independent biological samples.

Extended Data Fig. 3. Copy number instability in FA SCC.

a Plot displaying chromosomal locations of recurrent focal amplification peaks detected by GISTIC2 in all FA SCCs (n=60 samples, including 55 independent SCCs, 2 SCC metastases, and 3 SCC samples sequenced by both WGS/WES) and one cervical adenocarcinoma. GISTIC2 q-value is shown below, with default minimum calling threshold displayed as a green line. b A plot displaying chromosome location of recurrent focal deletion peaks detected by GISTIC2 in all FA SCCs (n=60). GISTIC2 q-value is shown below, with default minimum calling threshold displayed as a green line. c Copy-number alteration heatmap displaying detected sCNAs for all FA SCCs (n=60) and one FA-associated cervical adenocarcinoma, colored by amplitude intensity and normalized for individual tumor purity. Each row is a tumor sample. d Comparison of focal sCNA numbers between FA SCC, HPV+ sporadic HNSCC, HPV- sporadic HNSCC, BRCA2mut carcinomas, and BRCA1mut carcinomas. For FA SCC, n=20 whole-genome & n=40 whole-exome samples are displayed and colored by sample type. For HPV+ sporadic HNSCC, n=18 whole-genome samples and n=71 genome-wide CNV array (CGH) samples are shown separately. For HPV- sporadic HNSCC, n=24 whole-genome samples and n=415 CGH samples are displayed separately. For BRCA2mut carcinomas, n=41 whole-genome samples are shown. For BRCA1mut carcinomas, n=24 whole-genome samples are displayed. Focal copy number alterations are defined by GISTIC2, and gated at log2(sCNA)≥0.9 or log2(sCNA)≤−0.9 after correcting for tumor purity. Two-tailed Mann-Whitney U test p-values are indicated, with median and IQR shown. e ASCAT plot of a WGS FA SCC (F17P1). Total copy number is represented by the purple line. Minor allele is represented by the blue line. Indicated are notable oncogenes and tumor suppressors localizing to focal sCNA regions. f Genomic Circos plot displaying all somatic SV events detected by Illumina WGS of sample F17P1 depicted in panel e. g ASCAT plots displaying allele-specific CNAs in select WES-sequenced FA SCC tumors with little to no detectable HSCT-donor SNP contamination. Upper left (F32P1), upper right (F16P1-Vulv), bottom left (F4P1), bottom right (F25P1). F32P1 is HPV+, but harbors somatic deletions of TP53 and CDKN2A. In all cases, n refers to independent biological samples.

Extended Data Fig. 4. SV breakpoint landscape and subclonal structure of FA SCC.

a Scatter plot displaying localization of 8,896 SV breakpoints (from 4,448 SVs) in FA SCC by chromosome and genomic position. Relative breakpoint density is indicated by height from the baseline. Annotated are curated oncogenes and tumor suppressors localizing to breakpoint hotspots. b Hatchet subclonal absolute copy-number prediction of a low-HSCT+ FA SCC (F44P1). A copy number of 2 is considered copy-neutral. Individual predicted subclones (n=3) are displayed as distinct colored lines. c Battenberg-(DPClust) decomposition of 4 HSCT-negative FA tumor whole-genome samples with matched normal controls. Each annotated peak is a detected clone within the SCC, with peak area indicating fractional composition of tumor cells.

Extended Data Fig. 5. SV breakpoint localization of FA SCC, sporadic HNSCC, BRCA2mut, and BRCA1mut tumors relative to genome replication timing and fragile sites.

a Replication timing of the SV breakpoint loci. Plotted in black is the expected replication timing distribution. Plotted in blue is the observed SV breakpoint localization to early, mid, or late replicating genomic regions. Vertical axis indicates relative abundance and horizontal axis indicates standard deviation from mean replication timing. Kolmogorov-Smirnov (KS) p-values are indicated. n corresponds to the number of breakpoints included in the sample for each analysis. b Binned SV breakpoint counts from the indicated cohorts and SV class, localizing to common and rare fragile sites. SV class highlighted in red indicates a significant association, as determined by indicated p-value of two-tailed z-score test compared to 1000 permutations of fragile site locations. c Binned SV breakpoint counts localizing to “early-replicating fragile sites”. SV class highlighted in red indicates a significant association, as determined by indicated p-value of two-tailed z-score test compared to 1000 permutations of fragile site locations.

Extended Data Fig. 6. Complex SVs in FA SCC and the transcriptional landscapes of FA SCC and sporadic HNSCC.

a Number of somatic SV chains detected in 10x-sequenced FA SCCs (n=4), where a chain is defined as ≥ 2 discrete SVs (≥ 4 unique breakpoints). Median and IQR are indicated. b Number of SVs present in 108 SV chains in 10x-sequenced FA SCCs. Mean (4.6 SVs) and IQR are indicated. c Number of SVs of indicated class present in 108 SV chains from 10x-sequenced FA SCCs. Means and IQRs are indicated. d SV breakpoint distribution from 108 SV chains stratified by human chromosome number. e Somatic SV burden of n=9 PacBio-sequenced FA SCCs. 3 samples (indicated) were sequenced to 10x average coverage, and 6 samples were sequenced to 30x average coverage. f Somatic SV class proportions in n=9 PacBio-sequenced FA SCCs. Medians and IQRs are indicated. g Illumina & PacBio % SV call overlap for SVs > 1kb and deletions < 1kb for n=9 FA SCCs sequenced on both platforms. Shown are % of PacBio SV calls > 1kb present in Illumina BRASS output, % of PacBio deletion calls <1kb present in Illumina indel calls, and % of Illumina SV calls >1kb present in PacBio BAMs. Median and IQR are indicated. h Comparison of deletion sizes (<1kb) detected by SV calling in n=9 PacBio FA SCCs and by indel calling in the same 9 FA SCCs sequenced by Illumina WGS. Median and IQR are shown. i Examples of fold-back inversions (FBI) driving sharp copy-number change at key oncogenic loci identified in FA SCCs (PacBio data). j Comparison of the raw number of unbalanced translocation events in FA SCC (n=20), HPV-negative sporadic HNSCC (n=23), BRCA2mut (n=41), and BRCA1mut (n=24) cohorts. Two-tailed Mann-Whitney U test p-values are indicated, with median and IQR shown. k Comparison of hg19 expected vs. observed percentage of somatic SV breakpoints in 9 PacBio-sequenced FA SCCs that localize to repeat regions. Unpaired two-tailed t-test p-value is indicated (t=7.371, df=8), with median and IQR shown. 
l Breakpoint density graph displaying GC% sequence composition within +/− 100bp from SV breakpoints identified in PacBio sequencing data, calculated relative to hg19 global GC% frequency (40.9%) (notated as “expected”). Median and IQR are displayed. m Comparison of hg19 expected vs. observed percentage of somatic SV breakpoints from FA SCCs (n=20) and HPV-negative sporadic HNSCC cohorts (n=23) that localize to repeat regions and to the indicated repeat class (Illumina WGS). Two-tailed Mann-Whitney U test p-values are indicated, with median and IQR shown. n Comparison of the number of retrotransposon element (RTE) insertions in FA SCC (n=20), HPV-negative sporadic HNSCC (n=23), BRCA2mut (n=41), and BRCA1mut (n=24) cohorts. Two-tailed Mann-Whitney U test p-values are indicated, with median and IQR shown. o Cancer-relevant genes differentially expressed between FA SCC (n=6) and sporadic HNSCC (n=520) as assessed by RNAseq, including genes displayed in Fig. 1c. Differential expression is gated at log2(FC)>1 or log2(FC)<−1 with DESeq2 FDR-adjusted p-value < 0.05. DESeq2 implementation of Wald test with FDR-adjusted p-value is indicated. Genes whose relative expression are impacted by a sCNA are colored orange. Genes whose relative expression is discordant with sCNA frequency are colored blue. Genes not identified in focal sCNA peaks are colored white. GAPDH and PGK1 are indicated in black and added as housekeeping controls. p Quality-control distribution graph showing log2(FC) values of all genome-wide transcripts comparing FA SCC (n=6) vs sporadic HNSCC (n=520). Median and IQR is displayed. q DNA repair genes differentially expressed in FA SCC (n=6) versus sporadic HNSCC (n=520) by RNAseq. Differential expression is gated at log2(FC)>1 or log2(FC)<−1 with DESeq2 FDR-adjusted p-value <0.05. DESeq2 implementation of Wald test with FDR-adjusted p-value is indicated. 
r Aldehyde dehydrogenase (Aldh) and alcohol dehydrogenase (Adh) genes differentially expressed between FA SCC (n=6) and sporadic HNSCC (n=520). Differential expression is gated at log2(FC)>1 or log2(FC)<−1 with DESeq2 FDR-adjusted p-value <0.05. DESeq2 implementation of Wald test with FDR-adjusted p-value is indicated. s Gene-set enrichment/depletion (GO) analysis of genes differentially expressed between FA SCC and sporadic HNSCC. Genes entered into analysis were gated at log2(FC)>1 or log2(FC)<−1 with DESeq2 FDR-adjusted p-value < 10⁻⁵. Gene sets were gated at >2-fold enrichment over expected background with GO Fisher’s exact test FDR-adjusted p-value <0.01 to be reported in the figure. In all cases, n refers to independent biological samples.
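
The gating rule used in panels o, q, and r (|log2(FC)| > 1 together with an FDR-adjusted p-value < 0.05) can be illustrated with a minimal sketch. It assumes the Benjamini-Hochberg adjustment, which is the DESeq2 default, and omits DESeq2's independent filtering step. The gene names, fold changes, and p-values are hypothetical placeholders, not values from this study.

```python
from typing import List, NamedTuple


class GeneResult(NamedTuple):
    gene: str        # hypothetical gene label
    log2_fc: float   # log2 fold change (FA SCC vs sporadic HNSCC)
    p_value: float   # raw Wald-test p-value


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def gate(results: List[GeneResult],
         lfc_cutoff: float = 1.0,
         fdr_cutoff: float = 0.05) -> List[str]:
    """Keep genes with |log2(FC)| above the cutoff and adjusted p below the FDR cutoff."""
    padj = benjamini_hochberg([r.p_value for r in results])
    return [r.gene for r, q in zip(results, padj)
            if abs(r.log2_fc) > lfc_cutoff and q < fdr_cutoff]


# Hypothetical input, for illustration only.
example = [
    GeneResult("GENE_A", 2.3, 0.0001),
    GeneResult("GENE_B", -1.8, 0.004),
    GeneResult("GENE_C", 0.4, 0.0002),   # significant but small effect
    GeneResult("GENE_D", 1.5, 0.2),      # large effect but not significant
    GeneResult("GENE_E", -0.1, 0.9),
]
print(gate(example))  # ['GENE_A', 'GENE_B']
```

Passing fdr_cutoff=1e-5 to the same function gives the stricter gate described for panel s.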

Extended Data Fig. 7. Copy-number instability in sporadic HNSCC coupled to FA pathway deficiency.

a Oncoplot of 415 HPV-negative sporadic HNSCCs, displaying somatic copy-number alteration (sCNA) or SNV/indel-alteration of FA pathway genes or ALDH2. Mutation type is indicated in the legend. Top bar graph indicates the relative copy-number instability of each sample. Blue indicates deletions, magenta indicates amplifications. GISTIC2 q-value (FDR) values: XRCC2 (1×10⁻²²), MAD2L2 (1×10⁻⁷), RAD51 (1×10⁻¹), ALDH2 (2×10⁻¹). b Mutational frequency of key HNSCC driver genes in HPV-negative sporadic HNSCC samples with MAD2L2, ALDH2, RAD51, or XRCC2 deletions (n=52) versus entire HPV-negative TCGA-HNSCC cohort (n=415). GISTIC q-value (FDR) values: CDKN2A (4×10⁻²⁶⁴), PTPRD (7×10⁻⁴⁰), KMT2C (1×10⁻²²), PIK3CA (5×10⁻⁵⁷), NSD1 (1×10⁻⁴), CSMD1 (9×10⁻¹⁰¹), LATS2 (2×10⁻²³), MXD4 (5×10⁻⁴), CCND1 (8×10⁻²⁵²), FAT1 (9×10⁻³⁶), SDHB (7×10⁻⁷), NOTCH2 (8×10⁻¹⁹), MYC (6×10⁻²²), NOTCH1 (2×10⁻³), DIP2C (3×10⁻²), NCOR2 (2×10⁻¹), TGIF (4×10⁻⁶), PTEN (4×10⁻¹¹), EGFR (1×10⁻⁵²). c Number of focal copy-number alterations in sporadic HNSCC tumors (n=321 samples with data on smoking history), stratified by number of cigarette pack-years associated with each sample. Shown are cases with zero pack-years (no recorded smoking), cases with more than one (>1) pack-year, and cases with more than two (>2), more than three (>3), more than four (>4) and more than eight (>8) pack-years. Two-tailed Mann-Whitney U test p-values are indicated, with median and IQR shown. d HPV-negative sporadic HNSCC samples (n=415) ranked by number of focal somatic copy-number alteration (sCNA) peaks as defined by GISTIC2. Annotated are the top and bottom sCNA quartiles, with the top quartile being most unstable and the bottom quartile being most stable. Median and IQR displayed. e Comparison of the number of cigarette pack-years for smokers in top (n=104) and bottom (n=104) copy-number quartiles. Two-tailed Mann-Whitney U test p-value is indicated, with median and IQR shown. 
f Bar chart indicating the proportion (%) of samples within top and bottom sCNA quartiles exhibiting each respective COSMIC signature ID3, ID8, SBS4, or DBS2. Annotated are fold-differences in these proportions. g Comparison of the total number of ID3, ID8, SBS4, and DBS2 signature events between top (n=104) and bottom (n=104) sCNA quartiles. Indicated in brackets is the proportion (%) of total SBS, DBS, or ID events represented by the respective signature in each sCNA quartile. In all cases, n refers to independent biological samples.
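
The quartile stratification in panels d to g can be sketched as follows. This is a minimal illustration, not the authors' script. It assumes that samples are ranked by their number of focal sCNA peaks and that each quartile holds the sample count divided by four, rounded to the nearest integer (415 samples give 104 per quartile, matching the legend). The sample identifiers and counts are hypothetical.

```python
from statistics import quantiles
from typing import Dict, List, Sequence, Tuple


def split_quartiles(scna_counts: Dict[str, int]) -> Tuple[List[str], List[str]]:
    """Rank samples by focal sCNA count; return (top, bottom) quartile sample IDs."""
    ranked = sorted(scna_counts, key=scna_counts.get, reverse=True)
    k = round(len(ranked) / 4)  # assumed rounding rule: 415 samples -> 104
    if k == 0:
        raise ValueError("need at least two samples to form quartiles")
    return ranked[:k], ranked[-k:]


def median_iqr(values: Sequence[float]) -> Tuple[float, Tuple[float, float]]:
    """Return the median and the (Q1, Q3) bounds of the interquartile range."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return q2, (q1, q3)


# Hypothetical cohort of 415 samples with made-up sCNA counts.
cohort = {f"S{i}": i for i in range(415)}
top, bottom = split_quartiles(cohort)
print(len(top), len(bottom))  # 104 104
```

A per-quartile summary such as median_iqr applied to the pack-years of each group would then feed the Mann-Whitney U comparison reported in panel e.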

Extended Data Fig. 8. Characterization of a murine FA SCC model.

a In vitro cell growth curve of pre-engraftment Fanca+/+ and Fanca−/− keratinocytes, measured by cell count over six days with three independent experimental replicates per genotype. Data points indicate the mean cell count and bars indicate standard deviation. b Mean replicate tumor volumes measured at multiple time points during the 2nd, 6th, and 11th engraftment cycles of Fanca+/+ and Fanca−/− keratinocytes. Each genotype has 4 independent replicates, each of which in turn is comprised of 4 co-engrafted tumor sites on a single mouse (for a total of 16 tumors per genotype). Each data point represents one replicate as the mean volume of its 4 constituent tumors at the specified time point, with standard error bars indicated. 100×10³, 70×10³, and 35×10³ cells were engrafted at 2nd, 6th, and 11th engraftment respectively. 1st engraftment data is shown in Fig. 4c. Fanca−/− was reduced to 3 replicates at the 6th and 11th cycles due to recurrent loss through host death. c Number of tumor SVs categorized by class: inversion (INV), deletion (DEL), tandem duplication (TD), translocation (TRA) in n=4 Fanca+/+ and n=3 Fanca−/− replicates from 6th engraftment cycle. Two-tailed unpaired t-test p-values displayed (inversions: t=2.934, df=5), with medians and IQRs indicated. d Proportion (%) of SVs represented by each class in n=4 Fanca+/+ and n=3 Fanca−/− replicates at 6th engraftment cycle. Two-tailed unpaired t-test p-values displayed (inversions: t=2.666, df=5), with medians and IQRs indicated. e Unsupervised-clustering heatmap displaying differential transcriptomic gene-set enrichment across all replicates at pre-engraftment and 1st, 2nd, 6th, & 11th engraftment cycles for Fanca+/+ and Fanca−/− genotypes (32 samples). Relative gene set enrichment or depletion is indicated by color scale at each time point (ANOVA test). Gene sets displayed have an FDR-adjusted p-value < 10⁻⁷. Pre indicates pre-engraftment, E indicates engraftment, R indicates replicate. 
f RNAseq differential expression heatmap across all replicates displaying time-course expression changes in genes associated with keratinocyte identity, EMT transition, and inflammation/immune cell activation. Heatmap color indicates -scaled log2-normalized expression (32 samples). Pre indicates pre-engraftment, E indicates engraftment, R indicates replicate. In all cases, n refers to independent biological samples.
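
The degrees of freedom reported in panels c and d (df=5) are consistent with an equal-variance (Student's) unpaired t-test on 4 and 3 replicates, since 4 + 3 − 2 = 5. Whether the authors used this exact form is an assumption. The sketch below computes the t statistic and degrees of freedom for that form; the replicate values are hypothetical, not data from this study.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence, Tuple


def students_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Equal-variance unpaired t-test: return (t statistic, degrees of freedom)."""
    n1, n2 = len(a), len(b)
    df = n1 + n2 - 2
    # Pooled sample variance, weighted by each group's degrees of freedom.
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / df
    standard_error = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(a) - mean(b)) / standard_error, df


# Hypothetical SV counts for 4 wild-type and 3 knockout replicates.
wild_type = [10, 12, 11, 13]
knockout = [20, 22, 21]
t_stat, dof = students_t(wild_type, knockout)
print(dof)  # 5
```

A Welch (unequal-variance) test on the same data would usually give a non-integer number of degrees of freedom, which is why the integer df=5 points toward the pooled form.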

Extended Data Fig. 9. Single cell transcriptomics of case F44P1 and integration with other single-nuclei FA SCC samples.

a Feature plots superimposed on a UMAP embedding displaying cell type identity markers corresponding to the annotated clusters in Fig. 4g. Macrophage (CD163), CD4+ T-cells (CD4), CD8+ T cells (CD8A), KRT14/5+ tumor keratinocytes (KRT14), neutrophils (HCAR2), fibroblasts (COL11A1), mast cells (TPSAB1), Langerhans dendritic cells (CD207), p-EMT tumor keratinocytes (LAMA3), myofibroblasts (ACTA2), differentiated tumor keratinocytes (SPRR2E), endothelial cells (VWF). See methods for additional markers used for identification. b ASCAT plot of WGS sample F44P1 (top), inferred single-cell copy-number analysis displaying distinct amplifications in tumor keratinocyte, p-EMT tumor keratinocyte, and differentiated tumor keratinocyte clusters (bottom) c Feature plot displaying the scTSK sensor score for case F44P1. d Feature plots displaying a selection of scTSK markers. e Feature plot displaying p-EMT sensor score for case F44P1. f Feature plots displaying a selection of p-EMT markers. g Fold-enrichment in gene expression between p-EMT vs non-EMT tumor keratinocytes in F44P1 (DESeq2 log2(x) > 0.2, Wald test with FDR-adjusted p-value < 0.05) shown by GO term. GO enrichment Fisher’s exact test FDR-adjusted p-value displayed. h UMAP embedding displays the integrated clustering of F44P1 (single-cell), F46P1 (single-nuclei), and F38P1 (single-nuclei) samples after quality control (k=1,986 cells). Cell type identities are indicated in the legend. i scTSK and p-EMT sensor scores of integrated samples, split by constituent tumor sample. Also see Supplementary Fig. 1 for examples of cellular markers used in h and i.

Extended Data Fig. 10. Spatial transcriptomics of FA SCC and fibroblast-tumor keratinocyte interactions.

a left to right: H&E-stained scan of sample F38P1 showing a scale bar, spatial feature plots of CCND1, EGFR, SNAI2, LAMC2, TGFBI expression and imputed G1/G2-M/S cell-cycle stage. b GSEA EMT hallmark enrichment plot, assessed using a pre-ranked gene list determined from differential expression analysis between the p-EMT tumor cluster 6 against the remaining tumor clusters. EMT hallmark enrichment and normalized enrichment scores were 0.64722323 and 2.0873358, respectively, with the nominal p-value = 0 and the adjusted FDR value = 0. c ASCAT plot of the F38P1 WGS sample (top). Inferred single-spot copy-number analysis displaying distinct amplifications in tumor versus normal tissue (bottom). d Location of tumor keratinocytes and adjacent non-tumor stroma (top) used for spatial neighborhood analysis. Differential expression between tumor keratinocyte spots and directly adjacent stromal spots (bottom). e Ligand-receptor interaction analysis between tumor-associated fibroblasts and p-EMT tumor keratinocytes vs. non-EMT tumor keratinocytes in F44P1 single-cell sample.

Supplementary Material

Supplementary material
Supplementary Data File 8_Integrated FA SCC & Sporadic SCC Mutation Database
Supp Table 4
Supp table 7
Supp table 1
Supp table 8
Supp table 2
Supp table 11
Supp table 10
Supp table 6
Supp table 5
Source data figure 10d
Source data figure 4
Source data figure 8
Source data figure 11
Supp table 3
Supp table 9

Acknowledgements:

We are grateful to participants and their families who donated their tissues to the International Fanconi Anemia Registry (IFAR). We thank the physicians who provided research samples and clinical information and the staff of Fanconi Anemia Research Fund, especially Suzanne Planck for referrals to IFAR. National Disease Research Interchange (NDRI) is acknowledged for providing samples. We thank Markus Grompe for providing Fanca mutant animals. We appreciate advice from all members of the Laboratory of Genome Maintenance. We thank Natalie Papazian and Tom Cupedo for assistance with sample processing, Eric Bindels for assistance with single cell analysis. Staff of the Genomic, Reference Genome, Bioinformatics, and Flow Cytometry resource centers at the Rockefeller University are acknowledged for their expert advice and contribution. We thank Elaine Fuchs and her laboratory for advice on keratinocyte growth conditions. Genomic data from non-FA SCCs are in part based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga. This study was supported by Pershing Square Sohn Prize for Young Investigators in Cancer Research (AS), Fanconi Anemia Research Fund (RD, EV, AS), V Foundation grant T2019-013 (AS), National Institutes of Health (NIH) National Heart Lung and Blood Institute (R01 HL120922) (AS), National Cancer Institute (R01 CA204127) (AS), National Center for Advancing Translational Sciences (UL1 TR001866) (RV and AS), NIH award 1DP2-GM123495 (AK). SCC acknowledges support from the Intramural Research Program of the NIH National Human Genome Research Institute. MAS is supported by a Rubicon fellowship from NWO (019.153LW.038) and a KWF Kankerbestrijding Young Investigator Grant (12797/2019-2, Bas Mulder Award; Dutch Cancer Foundation). AS is a Howard Hughes Faculty Scholar.

Footnotes

Competing interests: Rocket Pharmaceuticals provided research funding and partial salary support to A.S. for an unrelated project. P.J.C. is a founder, director and consultant for Mu Genomics Ltd. B.S. is a co-inventor of intellectual property related to DCN1 small molecule inhibitors licensed by MSK to Cinsanso. He has rights to receive royalty income as a result of this arrangement. MSK has financial interests related to this intellectual property and Cinsanso as a result of this arrangement. Other authors declare no competing interests.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.

Additional Information

Supplementary information: The online version contains supplementary material available at….

Peer review information Nature thanks the anonymous reviewers for their contribution to the peer review of this work. Peer review reports are available.

Reprints and permissions information is available at http://www.nature.com/reprints.

Data availability

Outputs from the data analysis and source data are included in 12 Supplementary Tables listed in the SI guide. An integrated and queryable SNV-Indel/CNV Maftools mutation database for FA SCCs & sporadic HNSCCs is included as Supplementary Data File 8 in the online version of this manuscript, along with scripts for figure generation. Human data including 34 Illumina whole-genome samples (22 tumor and 12 matched-normal), 70 Illumina whole-exome samples (37 tumor and 33 matched-normal), 13 PacBio whole-genome samples (9 tumor and 4 matched-normal), four 10x linked-read whole-genome tumor samples, six bulk RNAseq tumor samples, three 10x single-cell/nuclei RNAseq tumor samples, one 10x Visium spatial RNAseq tumor sample, and six EPIC 850K methylation array tumor samples generated in this study are available in dbGAP (phs002652.v1.p1). The data are available through controlled access. Samples from four individuals (F10P1, F15P1, F28P1, F34P1) did not have consent for release of raw sequencing data to a repository. These four tumor/normal whole-exome sample pairs are available from the Smogorzewska laboratory after fulfilling requirements to participate in the IRB protocol #AAU-0112 at the Rockefeller University. Mouse sequencing data, including 32 bulk RNAseq samples & 10 Illumina whole-genome samples, are available on SRA (PRJNA753831) and GEO (GSE195811). Sporadic human tumor data is available from TCGA (https://portal.gdc.cancer.gov/projects/TCGA-HNSC), PCAWG (see Supplementary Table 2c for sample accession code list), and cBioPortal (https://www.cbioportal.org/study/summary?id=hnsc_tcga_pan_can_atlas_2018). HPV genome data is available from NIH-PAVE database (https://pave.niaid.nih.gov/). hg19 (https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.13/) and mm10 (https://www.ncbi.nlm.nih.gov/assembly/GCF_000001635.20/) genomes were used for sequencing alignment. 
Population variant databases GNOMAD (https://gnomad.broadinstitute.org/), db1000 Genomes (https://www.internationalgenome.org/), and COSMIC (https://cancer.sanger.ac.uk/cosmic) were used for additional germline variant filtering.

Data Deposition

Tumor/normal sequencing data from Fanconi anemia individuals (including 34 Illumina whole-genome samples, 70 Illumina whole-exome samples, 13 PacBio whole-genome samples, four 10x linked-read whole-genome samples, six bulk RNAseq samples, three 10x single-cell/nuclei RNAseq samples, one 10x Visium spatial RNAseq sample), and six EPIC 850K methylation array samples generated in this study will be available in dbGAP (phs002652.v1.p1). Mouse sequencing data, including 32 bulk RNAseq samples & 10 Illumina whole-genome samples, will be available in SRA (PRJNA753831) and GEO (GSE195811).

Availability of materials

LCL and fibroblast cell lines from Fanconi anemia patients are available from the International Fanconi Anemia Registry (IFAR) under the Fanconi Anemia Tissue and Cell Repository Usage Agreement. Tumor tissues described in the study are a non-renewable resource collected for this study. All available non-human materials will be shared with proper MTAs in place.

References:

  • 1. Auerbach AD & Wolman SR. Susceptibility of Fanconi’s anaemia fibroblasts to chromosome damage by carcinogens. Nature 261, 494–496, doi: 10.1038/261494a0 (1976).
  • 2. Sasaki MS & Tonomura A. A high susceptibility of Fanconi’s anemia to chromosome breakage by DNA cross-linking agents. Cancer Res 33, 1829–1836 (1973).
  • 3. Taylor AMR et al. Chromosome instability syndromes. Nat Rev Dis Primers 5, 64, doi: 10.1038/s41572-019-0113-0 (2019).
  • 4. Garaycoechea JI et al. Alcohol and endogenous aldehydes damage chromosomes and mutate stem cells. Nature 553, 171–177, doi: 10.1038/nature25154 (2018).
  • 5. Langevin F, Crossan GP, Rosado IV, Arends MJ & Patel KJ. Fancd2 counteracts the toxic effects of naturally produced aldehydes in mice. Nature 475, 53–58, doi: 10.1038/nature10192 (2011).
  • 6. Pontel LB et al. Endogenous Formaldehyde Is a Hematopoietic Stem Cell Genotoxin and Metabolic Carcinogen. Mol Cell 60, 177–188, doi: 10.1016/j.molcel.2015.08.020 (2015).
  • 7. Rycenga HB & Long DT. The evolving role of DNA inter-strand crosslinks in chemotherapy. Curr Opin Pharmacol 41, 20–26, doi: 10.1016/j.coph.2018.04.004 (2018).
  • 8. Alter BP, Giri N, Savage SA & Rosenberg PS. Cancer in the National Cancer Institute inherited bone marrow failure syndrome cohort after fifteen years of follow-up. Haematologica 103, 30–39, doi: 10.3324/haematol.2017.178111 (2018).
  • 9. Johnson DE et al. Head and neck squamous cell carcinoma. Nat Rev Dis Primers 6, 92, doi: 10.1038/s41572-020-00224-3 (2020).
  • 10. Ceccaldi R, Sarangi P & D’Andrea AD. The Fanconi anaemia pathway: new players and new functions. Nat Rev Mol Cell Biol 17, 337–349, doi: 10.1038/nrm.2016.48 (2016).
  • 11. Kottemann MC & Smogorzewska A. Fanconi anaemia and the repair of Watson and Crick DNA crosslinks. Nature 493, 356–363, doi: 10.1038/nature11863 (2013).
  • 12. Wang AT & Smogorzewska A. SnapShot: Fanconi anemia and associated proteins. Cell 160, 354–354 e351, doi: 10.1016/j.cell.2014.12.031 (2015).
  • 13. Cancer Genome Atlas Network. Comprehensive genomic characterization of head and neck squamous cell carcinomas. Nature 517, 576–582, doi: 10.1038/nature14129 (2015).
  • 14. Alter BP et al. Squamous cell carcinomas in patients with Fanconi anemia and dyskeratosis congenita: a search for human papillomavirus. Int J Cancer 133, 1513–1515, doi: 10.1002/ijc.28157 (2013).
  • 15. Hoskins EE et al. The fanconi anemia pathway limits human papillomavirus replication. J Virol 86, 8131–8138, doi: 10.1128/JVI.00408-12 (2012).
  • 16. Kutler DI et al. Human papillomavirus DNA and p53 polymorphisms in squamous cell carcinomas from Fanconi anemia patients. J Natl Cancer Inst 95, 1718–1721, doi: 10.1093/jnci/djg091 (2003).
  • 17. Sauter SL et al. Oral human papillomavirus is common in individuals with Fanconi anemia. Cancer Epidemiol Biomarkers Prev 24, 864–872, doi: 10.1158/1055-9965.EPI-15-0097-T (2015).
  • 18. van Zeeburg HJ et al. Clinical and molecular characteristics of squamous cell carcinomas from Fanconi anemia patients. J Natl Cancer Inst 100, 1649–1653, doi: 10.1093/jnci/djn366 (2008).
  • 19. Menghi F et al. The tandem duplicator phenotype as a distinct genomic configuration in cancer. Proc Natl Acad Sci U S A 113, E2373–2382, doi: 10.1073/pnas.1520010113 (2016).
  • 20. Nik-Zainal S et al. Landscape of somatic mutations in 560 breast cancer whole-genome sequences. Nature 534, 47–54, doi: 10.1038/nature17676 (2016).
  • 21. Willis NA et al. Mechanism of tandem duplication formation in BRCA1-mutant cells. Nature 551, 590–595, doi: 10.1038/nature24477 (2017).
  • 22. Li Y et al. Patterns of somatic structural variation in human cancer genomes. Nature 578, 112–121, doi: 10.1038/s41586-019-1913-9 (2020).
  • 23. Koren A et al. Differential relationship of DNA replication timing to different forms of human mutation and variation. Am J Hum Genet 91, 1033–1040, doi: 10.1016/j.ajhg.2012.10.018 (2012).
  • 24. Howlett NG, Taniguchi T, Durkin SG, D’Andrea AD & Glover TW. The Fanconi anemia pathway is required for the DNA replication stress response and for the regulation of common fragile site stability. Hum Mol Genet 14, 693–701, doi: 10.1093/hmg/ddi065 (2005).
  • 25. Zeman MK & Cimprich KA. Causes and consequences of replication stress. Nat Cell Biol 16, 2–9, doi: 10.1038/ncb2897 (2014).
  • 26. Campbell JD et al. Genomic, Pathway Network, and Immunologic Features Distinguishing Squamous Carcinomas. Cell Rep 23, 194–212 e196, doi: 10.1016/j.celrep.2018.03.063 (2018).
  • 27. Marsit CJ et al. Inactivation of the Fanconi anemia/BRCA pathway in lung and oral cancers: implications for treatment and survival. Oncogene 23, 1000–1004, doi: 10.1038/sj.onc.1207256 (2004).
  • 28. Wreesmann VB, Estilo C, Eisele DW, Singh B & Wang SJ. Downregulation of Fanconi Anemia Genes in Sporadic Head and Neck Squamous Cell Carcinoma. ORL 69, 218–225, doi: 10.1159/000101542 (2007).
  • 29. Harding SM et al. Mitotic progression following DNA damage enables pattern recognition within micronuclei. Nature 548, 466–470, doi: 10.1038/nature23470 (2017).
  • 30. Heddle JA, Lue CB, Saunders EF & Benz RD. Sensitivity to five mutagens in Fanconi’s anemia as measured by the micronucleus method. Cancer Res 38, 2983–2988 (1978).
  • 31. Mackenzie KJ et al. cGAS surveillance of micronuclei links genome instability to innate immunity. Nature 548, 461–465, doi: 10.1038/nature23449 (2017).
  • 32. Velleuer E et al. Diagnostic accuracy of brush biopsy-based cytology for the early detection of oral cancer and precursors in Fanconi anemia. Cancer Cytopathol 128, 403–413, doi: 10.1002/cncy.22249 (2020).
  • 33. Puram SV et al. Single-Cell Transcriptomic Analysis of Primary and Metastatic Tumor Ecosystems in Head and Neck Cancer. Cell 171, 1611–1624.e1624, doi: 10.1016/j.cell.2017.10.044 (2017).
  • 34. Ji AL et al. Multimodal Analysis of Composition and Spatial Architecture in Human Squamous Cell Carcinoma. Cell 182, 497–514.e422, doi: 10.1016/j.cell.2020.05.039 (2020).
  • 35. Cancer Genome Atlas Research Network et al. Integrated genomic characterization of oesophageal carcinoma. Nature 541, 169–175, doi: 10.1038/nature20805 (2017).
  • 36. ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium. Pan-cancer analysis of whole genomes. Nature 578, 82–93, doi: 10.1038/s41586-020-1969-6 (2020).
  • 37. Molenaar JJ et al. Sequencing of neuroblastoma identifies chromothripsis and defects in neuritogenesis genes. Nature 483, 589–593, doi: 10.1038/nature10910 (2012).
  • 38. Bakhoum SF et al. Chromosomal instability drives metastasis through a cytosolic DNA response. Nature 553, 467–472, doi: 10.1038/nature25432 (2018).
  • 39. Puram SV et al. Single-Cell Transcriptomic Analysis of Primary and Metastatic Tumor Ecosystems in Head and Neck Cancer. Cell 171, 1611–1624 e1624, doi: 10.1016/j.cell.2017.10.044 (2017).
  • 40. Cerami E et al. The cBio Cancer Genomics Portal: An Open Platform for Exploring Multidimensional Cancer Genomics Data. Cancer Discovery 2, 401–404, doi: 10.1158/2159-8290.Cd-12-0095 (2012).
