Abstract
Genetic changes in repetitive sequences are a hallmark of cancer and other diseases, but characterizing these has been challenging using standard sequencing approaches. We developed a de novo kmer finding approach, called ARTEMIS (Analysis of RepeaT EleMents in dISease), to identify repeat elements from whole-genome sequencing. Using this method, we analyzed 1.2 billion kmers in 2837 tissue and plasma samples from 1975 patients, including those with lung, breast, colorectal, ovarian, liver, gastric, head and neck, bladder, cervical, thyroid, or prostate cancer. We identified tumor-specific changes in these patients in 1280 repeat element types from the LINE, SINE, LTR, transposable element, and human satellite families. These included changes to known repeats and 820 elements that were not previously known to be altered in human cancer. Repeat elements were enriched in regions of driver genes, and their representation was altered by structural changes and epigenetic states. Machine learning analyses of genome-wide repeat landscapes and fragmentation profiles in cfDNA detected patients with early-stage lung or liver cancer in cross-validated and externally validated cohorts. In addition, these repeat landscapes could be used to noninvasively identify the tissue of origin of tumors. These analyses reveal widespread changes in repeat landscapes of human cancers and provide an approach for their detection and characterization that could benefit early detection and disease monitoring of patients with cancer.
INTRODUCTION
Genomic repeats comprise more than half the human genome and include a diverse set of elements that vary widely between individuals and exert key influences on genome structure and function (1, 2). Because of technical limitations of short-read alignment and a reliance on incomplete genome assemblies, repeats have historically been neglected (3). Repetitive sequences are largely composed of tandem repeats and retrotransposons. Tandem repeats, such as human satellites, are usually concentrated in centromeres, telomeres, and the short arms of acrocentric chromosomes. Retrotransposons include diverse families of genome-wide repeats including long interspersed nuclear elements (LINEs), short interspersed nuclear elements (SINEs), long terminal repeats (LTRs), and other transposable elements (3). The recent completion of a telomere-to-telomere (T2T) genome has added nearly 200 Mb to the previous reference genome, revealed the genomic and epigenomic states of repeats, and revitalized study of these integral genomic regions (4–6).
Changes to repeat sequences have long been implicated in the development of cancer. Transposable elements are thought to modulate gene expression, and loss of their silencing by global hypomethylation in cancer may drive their movement (7), resulting in oncogene activation and genomic instability (8). Repeat types show differential enrichment in structural breakpoints. For example, tandem repeats are enriched at breakpoints of regions of copy number variation, whereas Alu repeats are enriched in deletion and duplication breakpoints (9). The above changes to repeat elements are broadly characteristic of cancer genomes, but changes to individual element types have been observed in different cancer types (10, 11). Transposable elements serve as active enhancers for tissue-specific transcription factors dysregulated in cancers, and canonical tandem repeat expansions are associated with gene regulation and vary substantially between tumor sites of origin (12, 13). In addition, instability and expansion of repeats in pericentromeric and centromeric regions in patients with cancer may drive chromosomal missegregation and other structural changes (14–18) that have been associated with lower overall survival (19).
With the development of liquid biopsies for detection and genome-wide characterization of human cancer, analyses of repeat sequences have begun to be performed in cell-free DNA (cfDNA). Initial genome-wide analyses of cfDNA (20) did not specifically assess repeat elements. More recently, retrotransposable elements and nontelomeric satellite DNA have been shown to be highly represented in cfDNA (21) and have been used to evaluate overall cfDNA amounts or to assess aneuploidy (22–24). Despite these advances, no systematic analysis of the compendium of repeat sequences has been performed in tissue or cfDNA of any human cancer, largely because of the inability to identify and quantify repeat sequences in a genome-wide fashion.
To address these challenges, we developed ARTEMIS (Analysis of RepeaT EleMents in dISease) as an alignment-free, genome-wide approach for analyzing repeat landscapes in short-read sequencing. ARTEMIS assesses more than 1 billion short kmer sequences from 1280 individual repeat types that occur genome-wide and span 57 subfamilies comprising six families (satellites, RNA elements, transposable elements, LINEs, SINEs, and LTRs). In this study, we used ARTEMIS to show that repeat landscapes were broadly altered in human cancers, including in repeat elements not previously implicated in tumorigenesis. Repeat elements were enriched in regions of genes commonly altered in cancer, and tumor-specific changes in repeats reflected a combination of structural, copy number, and focal repeat alterations in the cancer genome. Changes in repeat landscapes were detectable in cfDNA and could be used for detection and monitoring of cancer and to identify the tissue of origin of tumors.
RESULTS
De novo search of kmers in genome-wide repeat elements
To develop ARTEMIS, we conducted a de novo search of short sequences (kmers) because we hypothesized that these would have enough complexity to identify the different types of repeat elements in the genome (Fig. 1, fig. S1, and table S1). For example, a 24–base pair (bp) kmer sequence can theoretically distinguish between 281 trillion (4²⁴) sequences. Using the recently obtained T2T reference genome (chm13) (5, 25) assembled from long-read sequencing, we found that 4.73 billion 24-bp kmer sequences were present in the genome and 4.18 billion 24-bp kmers were unique to repeat elements overall. Because related repeat elements have diverged in their sequence composition over time, we identified 1.2 billion 24-bp kmers that uniquely defined each of 1266 recently identified repeat types. To be included in this set, a kmer could neither occur in nonrepeat regions of the genome nor occur in multiple repeat types. Each of the 1266 repeat types analyzed was defined by a median of 43,297 24-bp kmers spanning an average of 2.6 Mb of genome sequence (fig. S2). We further included 58,000 24-bp kmers from enhanced annotations of 14 human satellite subtypes (26). These 1.2 billion kmers representing 1280 repeat types were found on all chromosomes, and 98% of used kmers were only observed once in the T2T reference genome (fig. S3, A to C). These kmers also represented regions of the genome, such as human satellites, that could not be aligned with high quality in typical short-read next-generation sequencing. This allowed ARTEMIS to consider the entirety of the genome, rather than only that from the ~60 to 85% of reads from next-generation sequencing that can be aligned with high quality (27, 28).
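The inclusion rule described above (a kmer is kept only if it occurs in a single repeat type and never in nonrepeat sequence) can be illustrated with a minimal Python sketch. This is not the ARTEMIS implementation: the function names `kmers` and `type_specific_kmers`, the toy inputs, and the decision to ignore reverse complements are assumptions made for illustration, and a genome-scale search would need a dedicated kmer counter rather than in-memory sets.

```python
from collections import defaultdict

K = 24  # kmer length used in the text

def kmers(seq, k):
    """Yield every overlapping substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def type_specific_kmers(repeat_seqs, nonrepeat_seqs, k=K):
    """Return {repeat_type: set of kmers}, keeping only kmers that occur in
    exactly one repeat type and never in nonrepeat sequence.

    repeat_seqs: {repeat_type: [sequence, ...]} (hypothetical input layout)
    nonrepeat_seqs: [sequence, ...]
    """
    owners = defaultdict(set)  # kmer -> repeat types that contain it
    for rtype, seqs in repeat_seqs.items():
        for seq in seqs:
            for km in kmers(seq, k):
                owners[km].add(rtype)
    background = {km for seq in nonrepeat_seqs for km in kmers(seq, k)}
    selected = defaultdict(set)
    for km, types in owners.items():
        if len(types) == 1 and km not in background:
            selected[next(iter(types))].add(km)
    return dict(selected)

# Number of distinct 24-mers over the alphabet {A, C, G, T}: 4**24
n_possible = 4 ** K
```

With toy sequences and `k=3`, a kmer shared by two repeat types or present in the nonrepeat background is dropped, and `n_possible` evaluates to 281,474,976,710,656, the "281 trillion" quoted in the text.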
To verify that these repeat landscape kmers would not be confounded by human-associated microbial genomes (29, 30), we examined 1545 reference genomes representing common microbes and found a median of 100 ARTEMIS kmers per microbial genome (range, 0 to 1350), in all cases comprising <0.0002% of the 1.2 billion possible kmers counted in ARTEMIS. For analyses of an individual sample, we defined the kmer repeat landscape as the count of all kmers in a sequenced sample that matched each of the 1280 repeat types divided by the number of aligned sequence reads. Because changes in repeat sequences may occur during initiation of cancer and other diseases, this comprehensive compendium of repeat features can be used to distinguish genomes from normal and disease states.
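The landscape definition above (kmer matches per repeat type, divided by the number of aligned reads) can be sketched as follows. The function name `repeat_landscape`, the `kmer_to_type` lookup table, and the toy reads are hypothetical; the sketch assumes every occurrence of a kmer in a read is counted and does not handle reverse complements or sequencing errors.

```python
from collections import Counter

def repeat_landscape(reads, kmer_to_type, n_aligned_reads, k=24):
    """Count kmers in `reads` that match each repeat type, then normalize.

    reads: iterable of read sequences (strings)
    kmer_to_type: {kmer: repeat_type} for the type-specific kmer set
    n_aligned_reads: normalization constant described in the text
    Returns {repeat_type: matched kmer count / n_aligned_reads}.
    """
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            rtype = kmer_to_type.get(read[i:i + k])
            if rtype is not None:
                counts[rtype] += 1
    # Repeat types with no matching kmers are reported as 0.0
    return {rtype: counts[rtype] / n_aligned_reads
            for rtype in set(kmer_to_type.values())}
```

The result is one number per repeat type for each sample, so a cohort becomes a samples-by-1280 matrix that can be passed to downstream comparisons or models.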
Fig. 1. Overview of ARTEMIS method.
De novo identification of kmers revealed ~1.2 billion unique kmers spanning 1280 distinct repeat elements. These elements represent six families: transposable elements, SINEs, satellites, LTRs, LINEs, and RNA elements. In an individual sample, the kmer repeat landscape is defined as the sum of the counts of all kmers comprising each repeat type identified in all sequence reads, normalized by coverage. These landscapes are used in machine learning to generate an ARTEMIS score for disease characterization and prediction.
Genome-wide enrichment of repeat kmers in cancer-related genes and pathways
We first examined the genome-wide distribution of the 1.2 billion kmers defining unique repeat types and found that repeat elements were enriched in genes commonly altered in human cancer (table S2). Of the 736 genes in the COSMIC cancer driver gene census (31), we found that 487 of these had a higher than expected number of repeat kmer sequences within their exonic or intronic sequences (normalized enrichment score = 9.12 and false discovery rate q = 0.00; fig. S3D), including those in genes amplified, deleted, and rearranged in cancer (normalized enrichment scores = 1.86, 4.04, and 6.71 and false discovery rate q = 0.01, 0.00, and 0.00, respectively; table S3). This enrichment remained significant even after correcting for the size of these genes (fig. S3D and table S3) and reflected an average 15-fold increase in repeat kmers in these regions (P < 2.2 × 10⁻¹⁶, Wilcoxon signed-rank test). In contrast, an analysis of the same number of randomly chosen genes in the genome did not show an enrichment of repeat kmer sequences (normalized enrichment score = −1.05 and false discovery rate q = 0.77). Repeat kmer sequences were also significantly increased in pathways commonly dysregulated in cancer including in cell adhesion, growth, and signaling, as well as cancer type–specific gene sets (false discovery rate q < 0.05; fig. S4). Together, these observations of repeat kmer localization suggest that alterations in key genes affecting oncogenic pathways in human cancer may be selected for during tumorigenesis through repeat-related genomic changes.
Kmer repeat landscapes are altered in cancer genomes
Given the broad number of genomic changes that occur during tumorigenesis, we evaluated whether kmer repeat landscapes were altered in cancers using short-read next-generation sequencing technologies. Given the challenges of distinguishing highly related repeat sequences, we simulated short-read whole-genome sequence data incorporating typical sequence error rates and analyzed these in an alignment-free fashion. We found that despite potential sequencing errors, the high complexity of kmer sequences allowed them to remain specific for their defined repeat family (98% of kmers counted were found in reads originating from their true repeat type) (figs. S5 and S6).
We analyzed matched tumor and normal tissues of 525 patients (table S4) from the Pan-Cancer Analysis of Whole Genomes (PCAWG) (32), including those with breast (n = 91), lung (n = 86), colorectal (n = 60), liver (n = 54), thyroid (n = 48), head and neck squamous cell (n = 44), ovarian (n = 42), gastric (n = 38), bladder (n = 23), cervical (n = 20), or prostate (n = 19) cancer and determined whether genome-wide kmer counts for specific repeat element types were altered in the tumors. An average of 22.4 billion total kmers were identified in each sample sequenced at 30 to 60× coverage, representing 1280 repeat elements. A median of 807 repeat elements (range, 246 to 1280) had increased or decreased kmer counts in tumors compared with their matched normal tissues (Fig. 2A and tables S5 and S6). Nearly two-thirds of altered elements (820 of 1280) had not been previously observed as being altered in human cancer (Fig. 2A and Table 1). Elements from satellites, LINEs, and SINEs were altered at the highest rates, although changes were also frequently observed in elements within LTRs, transposable elements, and RNA elements. Nearly a quarter of the elements studied came from the largest repeat subfamily of LTRs, ERV1s (endogenous retrovirus 1) (Table 1), which are hypothesized to aberrantly activate transcription in cancer cells by onco-exaptation, the process by which reactivated transposable elements can drive oncogene expression (33). On average, more than 40% of the 300 LTR ERV1 elements were altered in all 12 types of tumors studied, although the individual altered elements varied across tissue types. Although changes to 21 ERV1s have been described previously (34), we observed changes in an additional 279 ERV1s across the cancer types analyzed (tables S7 and S8). Similar to other large-scale changes in cancer genomes (35, 36), changes in kmer repeat landscapes were highly complex, with no two patients studied having the same set of alterations.
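The tumor versus matched-normal comparison can be expressed as a per-element ratio of the two landscapes. The sketch below is illustrative only: `tumor_normal_ratios` and `direction` are hypothetical names, the input values are made up, and the criterion the study used to call an element significantly altered is not reproduced here.

```python
def tumor_normal_ratios(tumor, normal):
    """Return {repeat_type: tumor value / matched normal value}.

    tumor, normal: {repeat_type: coverage-normalized kmer count}
    Elements absent from the normal sample or with a zero normal value are
    skipped (an assumption; the ratio is undefined for them).
    """
    return {rtype: tumor[rtype] / normal[rtype]
            for rtype in tumor
            if normal.get(rtype, 0) > 0}

def direction(ratio):
    """Label a ratio as an increase (>1), a decrease (<1), or unchanged."""
    if ratio > 1:
        return "increase"
    if ratio < 1:
        return "decrease"
    return "unchanged"

# Toy landscapes for one tumor/normal pair (values are made up)
ratios = tumor_normal_ratios(
    {"LINE_L1": 1.2, "SINE_Alu": 0.5, "Satellite": 1.0},
    {"LINE_L1": 1.0, "SINE_Alu": 1.0, "Satellite": 0.0},
)
labels = {rtype: direction(r) for rtype, r in ratios.items()}
```

Ratios above one correspond to elements gained in the tumor and ratios below one to elements lost, which matches the red and blue encoding described for the heatmap in Fig. 2A.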
Fig. 2. Kmer repeat landscapes across human cancers reveal widespread differences from normal tissues.
(A) The heatmap shows the ratio of kmer repeat landscapes for each PCAWG tumor as compared with its matched normal, revealing high numbers of tumor-specific changes that can be correlated to genomic instability metrics (n = 469 tumor/normal pairs representing all PCAWG samples with genomic instability metrics available). Each PCAWG tumor is listed along the y axis, and each individual repeat element type is along the x axis. Ratios greater than one (red) indicate an increase in the element in the tumor, whereas ratios less than one (blue) indicate a decrease in the tumor. Most of these identified changes are in elements (820 of 1280) with no prior evidence for changes in cancer, as shown in yellow in the evidence bar along the x axis. (B) The plot shows elements from all six repeat element families ordered by the Benjamini-Hochberg–corrected P value of the Wilcoxon signed-rank test comparing the overlap of repeat elements with tumor-specific structural breakpoints versus their overlap with randomly selected genomic regions. Filled circles indicate elements newly implicated in cancer through this study, whereas open circles indicate elements with prior evidence for involvement in cancer. Red circles indicate elements depleted of breakpoints, and yellow circles indicate elements enriched for breakpoints. TEs, transposable elements. (C) Box plots show the distribution of tumor:normal kmer count ratios for repeat element types overlapping the ERBB2 region (1 Mb) in PCAWG breast cancers (n = 91 tumor/normal pairs). Ratios for each patient for all elements with >0.5% of kmers found in the region are shown (left), and the Benjamini-Hochberg–corrected P value from the Wilcoxon signed-rank test is plotted for each comparison, with points in red indicating P < 0.05 (right). Element names in bold indicate those newly implicated in cancer through this study. 
(D) Box plots show the ratios of tumor:normal kmer counts for kmers occurring within LINE-1–mediated deletions in PCAWG lung tumors containing at least one LINE-1–mediated deletion (n = 5; data file S1). (E) Kaplan-Meier plots of overall survival and progression-free survival of PCAWG tumors of AJCC (American Joint Committee on Cancer) stage III or IV (n = 167) stratified into two groups based on predicted ARTEMIS scores. The group shown in blue had ARTEMIS scores below the median value, and the group shown in red had ARTEMIS scores above the median value.
Table 1. Repeat element subfamilies identified by ARTEMIS to be altered in cancer.
TCGA, The Cancer Genome Atlas.
| | | | Average percentage of elements altered in an individual tumor | | Average percentage of elements significantly altered in an individual tumor, by type | | | | | | | | | | | | Previous evidence of involvement in cancer (number of elements previously reported as changed in PCAWG) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Family | Subfamily | Number of elements | Significant alterations | Alterations greater than expected by chromosomal arm gains/losses alone | Bladder | Breast | Cervical | Colorectal | Head and neck squamous | Liver | Lung adeno | Lung squamous | Ovarian | Prostate | Gastric | Thyroid | |
| Transposable elements | DNA | 40 | 42% | 36% | 51% | 47% | 42% | 48% | 33% | 37% | 52% | 50% | 45% | 34% | 34% | 25% | Subfamily evidence |
| | DNA_Crypton | 7 | 28% | 26% | 37% | 35% | 30% | 34% | 22% | 19% | 38% | 30% | 36% | 23% | 20% | 12% | Newly implicated |
| | DNA_Crypton-A | 2 | 28% | 26% | 48% | 35% | 25% | 29% | 18% | 10% | 45% | 35% | 36% | 18% | 28% | 10% | Newly implicated |
| | DNA_hAT | 10 | 42% | 35% | 49% | 43% | 39% | 45% | 37% | 40% | 51% | 50% | 43% | 38% | 36% | 27% | Newly implicated |
| | DNA_hAT-Ac | 5 | 53% | 48% | 50% | 58% | 52% | 58% | 50% | 46% | 61% | 57% | 55% | 48% | 53% | 40% | Newly implicated |
| | DNA_hAT-Blackjack | 10 | 48% | 41% | 47% | 54% | 45% | 49% | 44% | 42% | 54% | 54% | 48% | 45% | 43% | 37% | Prior demonstration in TCGA/PCAWG (1 of 10) |
| | DNA_hAT-Charlie | 82 | 67% | 58% | 65% | 70% | 71% | 70% | 64% | 63% | 74% | 72% | 61% | 67% | 67% | 58% | Prior demonstration in TCGA/PCAWG (15 of 82) |
| | DNA_hAT-hAT19 | 1 | 19% | 18% | 22% | 24% | 45% | 23% | 9% | 17% | 18% | 29% | 24% | 5% | 13% | 2% | Newly implicated |
| | DNA_hAT-Tag1 | 2 | 83% | 72% | 70% | 91% | 85% | 79% | 77% | 86% | 86% | 83% | 85% | 84% | 80% | 79% | Newly implicated |
| | DNA_hAT-Tip100 | 54 | 59% | 52% | 58% | 62% | 61% | 61% | 57% | 56% | 68% | 66% | 56% | 58% | 60% | 46% | Prior demonstration in TCGA/PCAWG (5 of 54) |
| | DNA_hAT-Tip100? | 1 | 50% | 46% | 43% | 54% | 65% | 70% | 34% | 61% | 63% | 42% | 55% | 53% | 29% | 25% | Newly implicated |
| | DNA_hAT? | 2 | 45% | 36% | 59% | 47% | 53% | 57% | 40% | 43% | 59% | 45% | 56% | 18% | 43% | 20% | Newly implicated |
| | DNA_Kolobok | 2 | 52% | 46% | 59% | 55% | 55% | 63% | 51% | 40% | 61% | 61% | 52% | 53% | 41% | 35% | Newly implicated |
| | DNA_Merlin | 3 | 35% | 31% | 43% | 45% | 45% | 43% | 22% | 30% | 61% | 27% | 43% | 23% | 25% | 15% | Newly implicated |
| | DNA_MULE-MuDR | 7 | 55% | 46% | 60% | 61% | 51% | 55% | 52% | 50% | 62% | 60% | 65% | 49% | 47% | 36% | Newly implicated |
| | DNA_PIF-Harbinger | 6 | 42% | 37% | 49% | 47% | 46% | 50% | 37% | 40% | 54% | 50% | 43% | 32% | 29% | 26% | Newly implicated |
| | DNA_PiggyBac | 6 | 44% | 39% | 49% | 45% | 43% | 48% | 39% | 38% | 56% | 49% | 47% | 33% | 40% | 35% | Newly implicated |
| | DNA_TcMar | 5 | 27% | 24% | 43% | 31% | 31% | 26% | 22% | 17% | 34% | 30% | 47% | 14% | 21% | 11% | Newly implicated |
| | DNA_TcMar-Mariner | 6 | 61% | 52% | 64% | 64% | 58% | 70% | 55% | 55% | 64% | 73% | 56% | 60% | 55% | 54% | Prior demonstration in TCGA/PCAWG (1 of 6) |
| | DNA_TcMar-Pogo | 1 | 29% | 26% | 39% | 30% | 40% | 38% | 27% | 28% | 34% | 33% | 36% | 11% | 26% | 6% | Newly implicated |
| | DNA_TcMar-Tc1 | 5 | 39% | 36% | 42% | 44% | 46% | 52% | 33% | 32% | 49% | 45% | 40% | 27% | 32% | 20% | Newly implicated |
| | DNA_TcMar-Tc2 | 8 | 62% | 53% | 62% | 60% | 58% | 74% | 54% | 60% | 74% | 66% | 57% | 60% | 57% | 54% | Newly implicated |
| | DNA_TcMar-Tigger | 81 | 61% | 52% | 62% | 62% | 59% | 67% | 55% | 58% | 69% | 66% | 57% | 58% | 59% | 51% | Prior demonstration in TCGA/PCAWG (9 of 81) |
| | DNA? | 5 | 26% | 18% | 30% | 37% | 34% | 24% | 13% | 29% | 34% | 30% | 33% | 8% | 18% | 6% | Subfamily evidence |
| | DNA?_hAT-Tip100? | 1 | 37% | 33% | 74% | 36% | 35% | 38% | 43% | 33% | 37% | 40% | 50% | 26% | 37% | 13% | Newly implicated |
| | RC_Helitron | 5 | 56% | 51% | 56% | 57% | 51% | 68% | 53% | 47% | 71% | 62% | 50% | 58% | 55% | 45% | Subfamily evidence |
| | Retroposon_SVA | 6 | 71% | 64% | 72% | 75% | 78% | 66% | 69% | 62% | 80% | 83% | 73% | 70% | 79% | 53% | Prior demonstration in TCGA/PCAWG (4 of 6) |
| LINE | LINE_CR1 | 25 | 57% | 51% | 63% | 60% | 57% | 65% | 51% | 49% | 69% | 64% | 56% | 55% | 53% | 45% | Prior demonstration in TCGA/PCAWG (4 of 25) |
| | LINE_Dong-R4 | 1 | 74% | 62% | 70% | 86% | 65% | 77% | 77% | 61% | 82% | 79% | 64% | 74% | 76% | 67% | Newly implicated |
| | LINE_I-Jockey | 1 | 37% | 34% | 39% | 37% | 40% | 22% | 41% | 39% | 50% | 44% | 48% | 26% | 37% | 21% | Newly implicated |
| | LINE_L1 | 132 | 86% | 80% | 87% | 85% | 86% | 90% | 83% | 89% | 90% | 88% | 84% | 86% | 87% | 82% | Prior demonstration in TCGA/PCAWG (132 of 132) |
| | LINE_L1-Tx1 | 1 | 56% | 51% | 74% | 63% | 40% | 53% | 50% | 39% | 55% | 77% | 67% | 37% | 68% | 38% | Newly implicated |
| | LINE_L2 | 14 | 69% | 65% | 74% | 74% | 70% | 73% | 67% | 63% | 75% | 70% | 65% | 72% | 67% | 60% | Prior demonstration in TCGA/PCAWG (4 of 14) |
| | LINE_Penelope | 1 | 68% | 59% | 70% | 63% | 65% | 78% | 68% | 61% | 84% | 75% | 48% | 84% | 66% | 63% | Newly implicated |
| | LINE_RTE-BovB | 2 | 69% | 61% | 78% | 68% | 78% | 64% | 68% | 65% | 66% | 83% | 73% | 61% | 72% | 56% | Newly implicated |
| | LINE_RTE-X | 4 | 85% | 78% | 85% | 84% | 85% | 88% | 80% | 82% | 86% | 90% | 85% | 92% | 91% | 83% | Newly implicated |
| LTR | LTR | 17 | 59% | 51% | 66% | 63% | 60% | 63% | 56% | 50% | 66% | 68% | 58% | 57% | 57% | 46% | Subfamily evidence |
| | LTR_ERV1 | 300 | 61% | 53% | 65% | 66% | 65% | 62% | 61% | 58% | 70% | 69% | 65% | 56% | 59% | 41% | Prior demonstration in TCGA/PCAWG (21 of 300) |
| | LTR_ERVK | 42 | 59% | 50% | 68% | 68% | 68% | 58% | 56% | 55% | 67% | 67% | 67% | 49% | 58% | 30% | Prior demonstration in TCGA/PCAWG (2 of 42) |
| | LTR_ERVL | 124 | 62% | 55% | 65% | 66% | 63% | 64% | 60% | 58% | 72% | 68% | 63% | 60% | 61% | 47% | Prior demonstration in TCGA/PCAWG (14 of 124) |
| | LTR_ERVL-MaLR | 49 | 76% | 69% | 76% | 78% | 76% | 77% | 76% | 71% | 83% | 80% | 69% | 77% | 77% | 67% | Prior demonstration in TCGA/PCAWG (21 of 49) |
| | LTR_Gypsy | 25 | 65% | 58% | 67% | 67% | 61% | 67% | 64% | 58% | 74% | 72% | 58% | 71% | 68% | 56% | Prior demonstration in TCGA/PCAWG (2 of 25) |
| RNA elements | rRNA | 3 | 69% | 65% | 74% | 73% | 75% | 62% | 72% | 64% | 80% | 71% | 73% | 65% | 72% | 54% | Subfamily evidence |
| | scRNA | 5 | 33% | 31% | 43% | 40% | 42% | 32% | 30% | 27% | 38% | 38% | 30% | 29% | 31% | 17% | Subfamily evidence |
| | snRNA | 13 | 28% | 24% | 31% | 35% | 35% | 28% | 27% | 22% | 33% | 34% | 24% | 23% | 27% | 14% | Subfamily evidence |
| | srpRNA | 1 | 75% | 70% | 70% | 81% | 70% | 68% | 75% | 74% | 79% | 79% | 81% | 89% | 82% | 56% | Subfamily evidence |
| | tRNA | 50 | 17% | 16% | 24% | 22% | 22% | 16% | 13% | 16% | 23% | 19% | 23% | 10% | 13% | 6% | Subfamily evidence |
| Satellite | Satellite | 39 | 85% | 82% | 86% | 87% | 86% | 87% | 86% | 86% | 89% | 87% | 88% | 77% | 85% | 73% | Subfamily evidence |
| | Satellite_acro | 1 | 62% | 57% | 70% | 69% | 70% | 70% | 64% | 50% | 79% | 69% | 81% | 42% | 42% | 25% | Newly implicated |
| | Satellite_centr | 6 | 91% | 87% | 91% | 95% | 89% | 94% | 89% | 90% | 93% | 92% | 93% | 82% | 90% | 78% | Subfamily evidence |
| | Satellite_subtelo | 1 | 73% | 70% | 83% | 73% | 80% | 87% | 70% | 85% | 84% | 60% | 76% | 63% | 74% | 46% | Subfamily evidence |
| SINE | SINE_5S-Deu-L2 | 1 | 78% | 67% | 74% | 78% | 75% | 80% | 91% | 67% | 87% | 83% | 62% | 89% | 87% | 71% | Prior demonstration in TCGA/PCAWG (1 of 1) |
| | SINE_Alu | 51 | 69% | 64% | 73% | 73% | 75% | 65% | 65% | 60% | 77% | 74% | 65% | 65% | 72% | 62% | Prior demonstration in TCGA/PCAWG (34 of 51) |
| | SINE_MIR | 5 | 94% | 91% | 86% | 97% | 98% | 94% | 96% | 93% | 97% | 95% | 89% | 98% | 95% | 94% | Prior demonstration in TCGA/PCAWG (4 of 5) |
| | SINE_tRNA | 1 | 74% | 69% | 78% | 77% | 55% | 77% | 70% | 59% | 95% | 81% | 62% | 89% | 87% | 65% | Newly implicated |
| | SINE_tRNA-Deu | 1 | 40% | 37% | 43% | 41% | 40% | 30% | 48% | 28% | 63% | 46% | 26% | 42% | 53% | 35% | Newly implicated |
| | SINE_tRNA-RTE | 1 | 77% | 66% | 74% | 80% | 85% | 72% | 77% | 74% | 76% | 81% | 74% | 89% | 82% | 73% | Prior demonstration in TCGA/PCAWG (1 of 1) |
We hypothesized that changes in kmer repeat landscapes would in part be related to structural changes that arise during tumorigenesis such as chromosomal copy number changes, rearrangements, or focal amplifications or deletions. Accordingly, we found that kmer counts reflected chromosomal arm gains or losses genome-wide in a representative set of analyzed tumors (r = 0.81, P < 2.2 × 10⁻¹⁶, Spearman’s correlation) (fig. S7). In addition, tumors with more changes in kmer repeat landscapes had higher chromosomal instability as reflected through overall genomic entropy, loss of heterozygosity, nonmodal ploidy fraction of the genome, and other measures of genome-wide structural changes (r = 0.34, P = 2.1 × 10⁻¹⁴; r = 0.29, P = 2.7 × 10⁻¹⁰; r = 0.33, P = 1.2 × 10⁻¹⁴, respectively, Spearman’s correlation) (Fig. 2A). In contrast, tumor mutation burden, a measure of single-base sequence changes in an individual cancer, was only weakly correlated with genome-wide kmer repeat landscape changes (r = 0.15, P = 8.3 × 10⁻⁴, Spearman’s correlation).
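The correlations reported above are Spearman rank correlations, that is, Pearson correlations computed on ranks. A minimal standard-library sketch is shown below; the function names `average_ranks` and `spearman` and the example values are illustrative and are not data from the study, and P values are not computed.

```python
import math

def average_ranks(values):
    """Return 1-based ranks of `values`, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)

# Toy example: number of altered repeat elements vs. an instability metric
n_altered = [250, 400, 610, 800, 1200]
instability = [0.20, 0.10, 0.45, 0.40, 0.90]
rho = spearman(n_altered, instability)
```

Because only ranks enter the calculation, the statistic is insensitive to the scale of the instability metric, which makes it a reasonable choice for comparing kmer changes against heterogeneous measures such as entropy and ploidy fraction.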
Rearrangements, resulting from copy-neutral translocations, as well as inversions, duplications, or deletions, may be facilitated by crossing over of homologous sequence (37). An analysis of the locations of repeat elements and tumor-specific sequence breakpoints in the 525 samples analyzed identified an enrichment of 215 elements at breakpoint locations, comprising LINEs, SINEs, LTRs, transposable elements, and RNA elements and including 128 elements that had never been previously implicated as being altered in cancer, suggesting that these elements may play a role in facilitating these structural changes (Fig. 2B). Analysis of focal amplifications of five or more copies in a subset of analyzed tumors revealed that repeat element content across all subfamilies correlated with an increase in amplicon copy number (r = 0.91, P < 2.2 × 10⁻¹⁶, Spearman’s correlation). As an example, analysis of the 1-Mb region surrounding ERBB2 in breast tumors with known gains at this region revealed significant increases in 14 repeat elements, including in eight elements with no previously documented changes in cancer (P < 0.05; Fig. 2C and fig. S8A). Similarly, gains of the ~30-Mb region on chromosome 3q containing driver genes PIK3CA and SOX2 in squamous cell lung cancer (38, 39) revealed increases in kmers for repeat elements overlapping these regions, including in nine elements not previously known to be altered in cancer (fig. S8, B and C).
Changes in the content of repeat landscapes were not fully explained by chromosomal or focal copy number changes or genomic rearrangements. After comparing changes in repeat elements to the segmented copy number alterations observed across the cancer genomes analyzed, we determined that 89% of the repeat changes (median, 693 changes per tumor; range, 232 to 1280) were larger in magnitude than would be expected because of copy number gains and losses alone (fig. S9 and table S9). A set of 236 elements exhibited changes not explained by copy number changes in at least 75% of tumors studied. These types of changes included reduction of kmer elements through LINE-1–mediated deletions in squamous cell lung cancers and lower-than-expected repeat content in regions of copy number gain, consistent with the concept that these repeat sequences may undergo deletion because they facilitate gains in nearby genomic content (Fig. 2D and figs. S7 and S10, A and B) (37, 40). Overall, these analyses highlight the ability of kmer repeat landscapes to detect and characterize a broad variety of structural changes in human cancer, including large chromosomal changes, commonly amplified or deleted driver gene regions, and alterations that directly target repeat sequences.
We next used a machine learning model to generate for each sample an ARTEMIS score, a single number that provides a quantitative summary of genome-wide repeat element changes predictive of disease state. Despite germline variability of repeat elements among different individuals (fig. S11), cross-validated ARTEMIS scores distinguished the 525 PCAWG tumors from normal tissues with high performance across all cancer types analyzed, regardless of the race of patients [overall area under the curve (AUC) = 0.96] (fig. S12).
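The AUC quoted above summarizes how well a score separates tumors from normal tissues. The passage does not specify the model, so the sketch below only shows how an AUC can be computed from per-sample scores: it is the fraction of tumor/normal pairs in which the tumor scores higher, with ties counted as half. The function name `auc` and the example scores are hypothetical.

```python
def auc(tumor_scores, normal_scores):
    """Area under the ROC curve from two lists of scores.

    Computed as the fraction of (tumor, normal) pairs in which the tumor
    score is higher; tied pairs contribute one half.
    """
    wins = 0.0
    for t in tumor_scores:
        for n in normal_scores:
            if t > n:
                wins += 1.0
            elif t == n:
                wins += 0.5
    return wins / (len(tumor_scores) * len(normal_scores))

# Made-up scores for three tumors and three normal samples
example_auc = auc([0.9, 0.8, 0.4], [0.5, 0.3, 0.2])
```

An AUC of 0.5 corresponds to a score that carries no information about disease state and 1.0 to perfect separation; in the toy example eight of the nine pairs are ordered correctly.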
To evaluate the potential clinical implications of changes in repeat elements of cancer genomes, we examined whether ARTEMIS scores for each tumor were associated with changes in overall survival or progression-free survival for patients with advanced cancers (stage III or IV, n = 167) in the PCAWG dataset. We found that increased ARTEMIS scores were associated with longer overall (P < 0.001) and progression-free (P < 0.001) survival (Fig. 2E) and remained significant for progression-free survival even after adjusting for tumor type (P < 0.001; fig. S13). This change in patient outcomes was not observed for other nonrepeat genome-wide metrics, including genomic entropy, loss of heterozygosity, or nonmodal ploidy fraction (fig. S14, A and B). Given that the ARTEMIS score captures genome-wide changes to repeat landscapes, our observations are consistent with previous analyses indicating that reactivation and increase in repeat elements in cancer genomes may lead to increased immune responses (41–43) or genomic instability (44), both mechanisms that could reduce tumor cell fitness and lead to improved patient outcomes.
Kmer repeat landscapes in cfDNA
We sought to determine whether our approach to characterizing the repeat landscape could be used for evaluating circulating cfDNA. Detection of repeats using low-coverage whole-genome sequencing would theoretically be achievable because ARTEMIS aggregates a large number of kmer-defined repeat element instances throughout the genome while maintaining sufficient granularity to identify disease-specific genomic features. As a first step in this analysis, we determined that repeat landscapes in PCAWG were highly consistent even if these were subsampled to different sequencing depths ranging from >60× to 1× coverage (figs. S15 and S16). We further found that kmer repeat landscapes in cfDNA were consistent across different sequencing platforms and experimental batches (fig. S17).
To determine whether repeat landscapes could be quantified in the plasma using low-coverage sequencing of cfDNA, we first examined satellite families with known distributions on the Y chromosome (chrY) in a collection of male and female individuals (n = 158). In the plasma of males (n = 87), kmer counts for human satellite types known to be found exclusively or predominantly on chrY were substantially higher than in females (n = 71) (P < 2.2 × 10⁻¹⁶ for all types) (Fig. 3A), whereas satellites not found on chrY showed no significant difference between males and females (P > 0.1 for all types).
Fig. 3. Kmer repeat landscapes capture tumor-specific changes in the plasma.
(A) Top: Each bar plot shows for a given human satellite 2 or 3 element type the percentage of its kmer occurrences found on chrY (dark blue) and on all other chromosomes (light blue) in the chm13 reference. Bottom: In individuals without cancer (n = 158), the distribution of coverage-normalized kmer counts in cfDNA for these satellite types in males (n = 87) and females (n = 71). P values for the Wilcoxon signed-rank test are shown at the top of each plot. (B) Kmer counts for PCAWG tissue (top; n = 54 liver, n = 48 lung squamous, and n = 38 lung adenocarcinoma tumor/normal pairs) and plasma cfDNA (bottom; n = 75 patients with liver cancer and n = 133 patients without cancer; n = 29 patients with lung squamous cell cancer and n = 158 patients without cancer; n = 62 patients with lung adenocarcinoma and n = 158 patients without cancer). The top five features with significant differences in both tissue and plasma, and at least 1000 expected kmers per million aligned reads are shown for each cancer type as separate plots. P values are shown at the top of each plot and were calculated by the Wilcoxon signed-rank test.
We then identified repeat elements with the largest changes across the PCAWG tumors and evaluated the occurrences of these elements in cfDNA of individuals in prospectively collected diagnostic cohorts for patients at risk for lung or liver cancer (n = 287 for lung cancer cohort, table S10; n = 208 for liver cancer cohort, table S11) which had been previously sequenced (27, 45). Across cohorts, many of the repeat kmer increases or decreases observed in tumors were evident in the plasma of patients with lung cancer of squamous or adenocarcinoma subtypes or liver cancer as compared to plasma from individuals without cancer (Fig. 3B). These included changes not only in elements with previously documented roles in cancer such as LINE L1 elements but also newly identified elements that have now been revealed to have alterations in cancer, including from subfamilies such as DNA-hAT-Charlie and LTR ERV1, ERVL-MaLR, and ERVL.
We hypothesized that repeat landscapes in cfDNA could be different from the expected repeat content in genomic DNA due to genome-wide chromatin and epigenetic changes that may alter the representation of cfDNA fragments in the circulation (27, 28, 45–49). We have previously shown that cfDNA fragmentation profiles reflect open and closed chromatin states genome-wide (28, 45). Here, we analyzed cfDNA from 158 individuals without cancer and showed that regions with different histone marks had differential density of repeat element types (Fig. 4, A and B) and that individual cfDNA fragments derived from regions with actively transcribed chromatin or activated histone marks had shorter lengths and exhibited lower coverage in the plasma (Fig. 4, C and D). Overall, repeat landscape kmer counts in cfDNA for regions with high density of activating chromatin histone marks were lower than for regions with low density of these marks, whereas the reverse was observed for repressive histone marks (Fig. 4E). Genome-wide simulations suggest that repeat landscapes in cfDNA may be influenced by both tumor-specific epigenomic and genomic changes (fig. S18).
Fig. 4. Impact of epigenetic state on repeat element representation in cfDNA.
(A) A summary of peaks per megabase of each chromatin state for each histone type is indicated at the top of each plot. The peak density is scaled within each chromatin immunoprecipitation sequencing experiment to account for different numbers of peaks in each experiment. PC, polycomb. (B) Box plots show the proportion of histone peaks of each type (columns) in each of 1280 repeat elements organized into six families. (C) In plasma from patients without cancer (n = 158) in the LUCAS cohort, the distributions of aligned fragment sizes for fragments overlapping each histone mark and all fragments are plotted. The line is the median, and the shading indicates ±1 SD, plotted as a difference in distributions. (D) In plasma from patients without cancer (n = 158) in the LUCAS cohort, plots of coverage genome-wide versus within regions of each histone mark are shown. The x axis represents the log average coverage, and the y axis represents the log difference in count. (E) In plasma from patients without cancer (n = 158) in the LUCAS cohort, box plots show the ratios of average observed to expected kmer counts for the features in the top and bottom deciles of histone mark density. P values for the Wilcoxon signed-rank test are shown above each plot.
ARTEMIS kmer repeat landscape analyses for cancer detection and monitoring
Given the ability to identify repeat landscape changes in cfDNA, we evaluated the potential of the ARTEMIS method for noninvasive detection of cancer (fig. S19). We previously described use of a sensitive and accessible whole-genome cfDNA fragmentation test (DELFI, DNA evaluation of fragments for early interception) for lung and liver cancer screening in high-risk populations (27, 45). Here, we used the kmer repeat landscapes and epigenetic profiles in cfDNA regions with high density of histone marks that differentially affect repeat representation (fig. S20 and table S12) as features in machine learning models to detect lung cancer in the Danish Lung Cancer Screening Study (LUCAS) prospectively collected diagnostic cohort (n = 287) and liver cancer in a high-risk population (n = 208) (27, 45). ARTEMIS classified patients with lung cancer with an AUC of 0.82 [95% confidence interval (CI), 0.78 to 0.87], and when ensembled with the DELFI genome-wide fragmentation features (28), a joint ARTEMIS-DELFI model classified patients with lung cancer with an AUC of 0.91 (95% CI, 0.88 to 0.94) (Fig. 5, A and B, and fig. S21). Similar performance was observed in the cohort of individuals at risk for liver cancer, where ARTEMIS detected individuals with liver cancer among patients with cirrhosis or viral hepatitis with an AUC of 0.87 (95% CI, 0.82 to 0.93), and when combined with DELFI, the AUC improved to 0.90 (95% CI, 0.86 to 0.94) (fig. S22). We validated the locked ARTEMIS and ARTEMIS-DELFI models in an external cohort (table S13) composed of noncancer individuals at high and average risk of lung cancer (n = 400) and patients with all stages of lung cancer (n = 88) and observed similar performance to that in the cross-validated training cohort (Fig. 5C and fig. S23).
Analysis of a separate held-out set of patients from the LUCAS cohort with a prior history of cancer (n = 25; table S14) using the locked ARTEMIS and ARTEMIS-DELFI models revealed higher scores in patients who experienced cancer recurrence compared with those who did not (fig. S24). We further applied these models to an independent cohort of patients with late-stage lung cancer (n = 19; table S15) receiving tyrosine kinase inhibitor therapy (50) and demonstrated that the ARTEMIS and joint ARTEMIS-DELFI scores were correlated with circulating tumor DNA mutant allele fractions observed during therapy (r = 0.70, P = 2.67 × 10^−12 for ARTEMIS and r = 0.80, P < 2.2 × 10^−16 for ARTEMIS-DELFI, Spearman’s correlation). Analysis of ARTEMIS-DELFI scores in patients at the first time point after initiation of treatment (median, 6 days) showed that those with scores above or below the pretreatment median had shorter or longer progression-free survival, respectively (median, 1.4 months for patients in the high-score group versus 8.9 months for the low-score group; P < 0.001, log-rank test, two-sided) (fig. S25).
Fig. 5. ARTEMIS and ARTEMIS-DELFI for detection of lung cancer using cfDNA.
(A) Distributions of ARTEMIS and joint ARTEMIS-DELFI scores for patients with (n = 129) and without (n = 158) cancer in the cross-validated LUCAS cohort separated by biopsy status (individuals without cancer), cancer stage, and histology. SCLC, small cell lung cancer. (B) ROC analyses of ARTEMIS and ARTEMIS-DELFI scores classifying individuals with and without lung cancer in the full LUCAS cohort and in subgroups by cancer stage. (C) The sensitivity and specificity achieved by ARTEMIS and ARTEMIS-DELFI in the external validation cohort at locked score thresholds that achieved 50 to 80% specificity in the cross-validated cohort.
Lastly, given the observation of tumor-specific changes in repeat landscapes, we evaluated whether ARTEMIS could aid tissue of origin determination in tumor or cfDNA samples of patients with cancer. We first examined whether kmer repeat landscapes could capture a tissue-specific signal. We trained a machine learning model on the PCAWG cohort using kmer repeat landscapes to differentiate between tissue types and found that it classified the tumors by tissue of origin with an average of 78% accuracy among 12 tumor types studied despite relying only on genomic features, which are typically thought to show fewer tissue-specific differences than transcriptomic and epigenetic features (table S16). This is consistent with the observations that although changes in repeat landscapes are a pan-cancer feature of cancer genomes, the specific repeat elements altered vary between tumor types (Fig. 2A and tables S7 to S9). We then extended this approach to cfDNA, cross-validating ARTEMIS-DELFI within a multicancer cohort including 226 individuals with breast, ovarian, lung, colorectal, bile duct, gastric, and pancreatic tumors (table S17) (28). Despite the small number of samples available for training, we found that ARTEMIS-DELFI correctly categorized detected patients among the different cancer types with an average of 68 or 83% accuracy, for the highest or top two predictions, respectively (Table 2 and table S18).
Table 2. Tissue of origin classification for cfDNA samples by ARTEMIS-DELFI.
Patients detected are based on ARTEMIS-DELFI detection at 90% specificity. Lung patients include additional patients with lung cancer with prior therapy. Values with an asterisk in the confusion matrix indicate individuals classified correctly, and values outside the confusion matrix highlight key performance summary statistics.
| True cancer type | Patients detected | Predicted: Breast | Predicted: Bile duct | Predicted: Colorectal | Predicted: Gastric | Predicted: Lung | Predicted: Ovarian | Predicted: Pancreatic | ARTEMIS-DELFI top prediction accuracy (95% CI) | ARTEMIS-DELFI top two prediction accuracy (95% CI) | Random assignment accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Breast | 52 | 41* | 1 | 2 | 1 | 3 | 2 | 2 | 79% (65–89%) | 94% (84–99%) | 27% |
| Bile duct | 24 | 2 | 18* | 0 | 1 | 2 | 0 | 1 | 75% (53–90%) | 79% (58–93%) | 4% |
| Colorectal | 26 | 2 | 1 | 17* | 0 | 3 | 2 | 1 | 65% (44–83%) | 81% (61–93%) | 4% |
| Gastric | 24 | 0 | 2 | 1 | 18* | 2 | 0 | 1 | 75% (53–90%) | 79% (58–93%) | 8% |
| Lung | 31 | 7 | 4 | 1 | 0 | 13* | 2 | 4 | 42% (25–61%) | 61% (42–78%) | 13% |
| Ovarian | 27 | 5 | 0 | 2 | 0 | 1 | 19* | 0 | 70% (50–86%) | 89% (71–98%) | 22% |
| Pancreatic | 27 | 4 | 3 | 0 | 1 | 1 | 0 | 18* | 67% (46–84%) | 89% (71–98%) | 22% |
| Total | 211 | | | | | | | | 68% | 83% | 16% |
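The top prediction accuracies in Table 2 follow directly from the confusion matrix counts. As a minimal arithmetic check, the Python sketch below recomputes the per-type and overall values from the counts shown in the table. The variable and function names are illustrative and are not taken from the ARTEMIS code base; CIs and top two accuracies are not recomputed because they require information not shown in the matrix.

```python
# Confusion matrix from Table 2: rows are true cancer types, columns are
# predicted types, both in the order given by `types`.
types = ["Breast", "Bile duct", "Colorectal", "Gastric",
         "Lung", "Ovarian", "Pancreatic"]
confusion = [
    [41, 1, 2, 1, 3, 2, 2],
    [2, 18, 0, 1, 2, 0, 1],
    [2, 1, 17, 0, 3, 2, 1],
    [0, 2, 1, 18, 2, 0, 1],
    [7, 4, 1, 0, 13, 2, 4],
    [5, 0, 2, 0, 1, 19, 0],
    [4, 3, 0, 1, 1, 0, 18],
]


def top1_accuracy(row, true_index):
    """Fraction of detected patients in a row assigned to their true type."""
    return row[true_index] / sum(row)


per_type = {t: top1_accuracy(confusion[i], i) for i, t in enumerate(types)}
n_detected = sum(sum(row) for row in confusion)
n_correct = sum(confusion[i][i] for i in range(len(types)))
overall = n_correct / n_detected

for t in types:
    print(f"{t}: {per_type[t]:.0%}")
print(f"Total: {n_correct}/{n_detected} = {overall:.0%}")
```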
DISCUSSION
In this study, we show that ARTEMIS can reconstruct genome-wide repeat landscapes that reflect underlying changes in human cancer. The alterations reflect structural changes in the cancer genome and direct changes in repeat elements. Through these analyses, we found that repeat elements were enriched in the genome in genes commonly altered in human cancer, including at specific tumor-derived rearrangement breakpoints. Cancer-specific changes of the repeat landscape were observed genome-wide, including in elements not previously known to be altered in human cancer. These elements may provide an underlying basis for structural alterations and the genomic instability of genes, pathways, and chromosomes widely altered in human cancers. In addition, the expansion or contraction of repeat elements that can now be comprehensively identified provides a new way to detect and examine mechanisms affecting cancer and other diseases.
We found that changes in repeat landscapes were detectable in the circulation and that the signal in plasma was further altered by epigenetic changes to repeat elements that influence their susceptibility to fragmentation. We and others have previously shown that changes in chromatin accessibility, transcription factor binding, and methylation can alter the representation of cfDNA in the blood (28, 45–48). In this study, we show that epigenetic states affected by histone acetylation and methylation, leading to altered gene expression, have a profound impact on the size and coverage of cfDNA at distinct regions genome-wide, including in repeat regions. These analyses suggest that kmer repeat landscapes in plasma can reveal both structural and epigenetic changes in the genome.
Our study has some limitations. Although we externally validated ARTEMIS for cancer detection in four cohorts consisting of 532 patients with and without lung cancer, it will be critical to validate this approach in larger screening populations and for other applications. Another limitation of ARTEMIS is that it relies on evaluation of changes in repeat landscapes that are inherently variable among the germlines of individuals (3, 14, 51–53). In addition, certain repeat regions of the genome, including low-complexity repeats and highly polymorphic regions, would not be fully analyzed through this approach. Although ARTEMIS reveals genome-wide changes in repeat landscapes, the specific location of changes in repeat elements may not be directly identified through this approach. In the future, it will be valuable to characterize kmer repeat landscapes across diverse individuals because the current chm13 reference genome is from a single individual, and comparisons to a representative panel of healthy genotypes of different germline backgrounds could improve performance. Moreover, the functional impacts of changes in repeat families remain poorly understood and could be clarified through further analyses in cancer and other disease states.
Repeat landscape analyses for cfDNA-based detection of lung, liver, and other cancers suggest that ARTEMIS alone or in combination with other genome-wide features may provide an avenue for noninvasive detection, monitoring, and tissue of origin determination of cancer. ARTEMIS may improve early-stage diagnosis by identifying genome-wide changes that would perhaps not be evident in other liquid biopsy approaches when tumor features such as mutations or chromosomal arm changes are not detected. Only 44% of the genome-wide occurrences of kmers used in the ARTEMIS method are within known genes, and many of the repeat types in our landscapes have not been studied in human cancers. Given the size, diversity, and potential clinical relevance of these regions of the genome, our study offers unique insights into the cancer genome and provides a proof of concept for the utility of genome-wide kmer repeat landscapes as tissue- and blood-based biomarkers for cancer detection, characterization, and monitoring.
MATERIALS AND METHODS
Study design
Our objective was to characterize repeat landscapes in whole-genome sequencing of tissue and cfDNA from individuals with and without cancer. We used these to characterize cancer-related changes in repeat landscapes and to develop liquid biopsy approaches for cancer detection, monitoring, and tissue of origin classification. We obtained matched tumor and normal BAM files for the 539 patients with lung, liver, ovarian, colorectal, breast, thyroid, head and neck squamous, prostate, cervical, bladder, and gastric cancers in PCAWG that were available in the Protected Data Cloud (32) and excluded 14 patients for which either the tumor or normal was on the PCAWG blacklist (table S4).
We analyzed whole-genome sequencing (1 to 2× coverage) data from cfDNA from 819 individuals with and without lung cancer from four cohorts; 208 individuals with and without liver cancer; and 423 individuals from a multicancer cohort of patients without cancer and with breast, ovarian, lung, bile duct, colorectal, gastric, duodenal, and pancreatic tumors described in our previous publications (27, 28, 45, 50) (fig. S19; tables S10, S11, S13, S14, S15, and S17; and Supplementary Materials and Methods).
Collection of patient samples used in this study conformed to all relevant ethical regulations. All patients provided written informed consent, and the studies were performed according to the Declaration of Helsinki. Because all samples analyzed were from previously published cohorts, no study size calculations, randomization, or blinding was performed in the present study.
De novo kmer finding
We first extracted all repeat sequences and coordinates for known repeat element types from the RepeatMasker track in chm13 (T2T-CHM13v2.0). We excluded repeats from the families low-complexity, unknown, and simple repeats, leaving 1287 types of repeats across 57 subfamilies comprising 13 families. For simplicity, we aggregated all elements in the gene and pseudogene families transfer RNA (tRNA), signal recognition particle RNA (srpRNA), small nuclear RNA (snRNA), small cytoplasmic RNA (scRNA), and ribosomal RNA (rRNA) as RNA elements and the families DNA, DNA?, retroposon, and RC (Rolling Circle) as transposable elements, leaving six overall families (LINE, SINE, LTR, satellites, transposable elements, and RNA elements).
We then performed a de novo kmer finding procedure inspired by Altemose et al. (26) using Jellyfish (54) to count all unique 24-mers occurring in each of the 1287 types of repeats, as well as those occurring in the portions of the genome excluding all repeat regions. We then selected all kmers that occurred only in a single repeat type and that were not present in the nonrepeat regions of the genome. The kmers for a sequence and its reverse complement were counted together as the reference genome represents one strand, but we expect that half of the paired-end reads were derived from the reverse complement strand. We identified at least one unique kmer in 1266 of the 1287 repeat types. We additionally included 58,426 kmers from 14 HSATII (Human Satellite II) and HSATIII (Human Satellite III) subfamilies (26) to supplement the RepeatMasker Satellite annotations. These kmers overlapped with broader satellite types in the RepeatMasker track, but we allowed these kmers to be counted in multiple repeat types for consistency with the previous publication (fig. S6). In total, we identified 1,206,871,310 distinct kmers defining 1280 repeat element types (see Supplementary Materials and Methods). To verify that these kmers had low cooccurrence in common human-associated microbial genomes, we counted kmers in 1545 microbial genomes from the Human Microbiome Project available for download on National Center for Biotechnology Information (NCBI) Entrez (29, 30). We analyzed the colocalization of these kmers within cancer driver genes using a gene set enrichment analysis and verified that these kmers could be identified in simulations of short-read sequencing incorporating a realistic error rate (see Supplementary Materials and Methods).
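The selection rule described above (kmers found in exactly one repeat type and absent from nonrepeat sequence, with a kmer and its reverse complement counted together) can be illustrated with a small in-memory Python sketch. This is not the Jellyfish-based pipeline used in the study; the input layout (`repeat_seqs` as a dictionary of repeat type to sequences) and all function names are assumptions made for illustration, and a toy implementation like this would not scale to a whole genome.

```python
from collections import defaultdict

K = 24  # kmer length stated in the text

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a kmer and its reverse
    complement, so both strands map to the same key."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmers(seq, k=K):
    """Yield canonical kmers of length k, skipping windows with non-ACGT bases."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if set(window) <= set("ACGT"):
            yield canonical(window)


def select_type_specific_kmers(repeat_seqs, nonrepeat_seqs, k=K):
    """Keep kmers seen in exactly one repeat type and never in nonrepeat sequence.

    repeat_seqs: dict mapping repeat type -> list of sequences (hypothetical layout).
    nonrepeat_seqs: list of sequences from the nonrepeat portion of the genome.
    Returns a dict mapping repeat type -> set of selected canonical kmers.
    """
    owners = defaultdict(set)
    for repeat_type, seqs in repeat_seqs.items():
        for seq in seqs:
            for kmer in kmers(seq, k):
                owners[kmer].add(repeat_type)
    background = {kmer for seq in nonrepeat_seqs for kmer in kmers(seq, k)}
    selected = defaultdict(set)
    for kmer, repeat_types in owners.items():
        if len(repeat_types) == 1 and kmer not in background:
            selected[next(iter(repeat_types))].add(kmer)
    return dict(selected)
```

Using the canonical form as the dictionary key is one common way to count a kmer and its reverse complement together; the supplementary HSATII and HSATIII kmers, which the text allows to count toward multiple repeat types, are not modeled here.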
Generation of kmer repeat landscapes
We obtained all sequencing reads for each sample, counted each unique kmer and its reverse complement, and aggregated the kmer counts for each repeat type. We normalized the aggregated counts to the number of reads that were aligned with Mapping Quality (MAPQ) ≥ 30 to the hg19 genome (samtools view -c -q 30 -F 3844). Our approach considered all reads, including those from portions of the genome not provided in hg19 and/or repeat types that were not aligned.
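As a rough illustration of the counting and normalization steps, the Python sketch below counts type-specific kmers in a set of reads, aggregates them by repeat type, and scales by the number of aligned reads. Several details are assumptions: the scaling to counts per million aligned reads, the `kmer_to_type` lookup (which assigns each kmer to a single repeat type), and the function names. The value of `n_aligned_reads` is assumed to come from an external count such as the samtools command quoted above.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Lexicographically smaller of a kmer and its reverse complement."""
    return min(kmer, kmer.translate(_COMPLEMENT)[::-1])


def repeat_landscape(reads, kmer_to_type, n_aligned_reads, k=24):
    """Aggregate kmer counts per repeat type, per million aligned reads.

    reads: iterable of read sequences (aligned or not; all reads are scanned).
    kmer_to_type: dict mapping canonical kmer -> repeat type (hypothetical lookup).
    n_aligned_reads: number of reads passing the alignment filter.
    """
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            repeat_type = kmer_to_type.get(canonical(read[i:i + k]))
            if repeat_type is not None:
                counts[repeat_type] += 1
    scale = 1e6 / n_aligned_reads
    return {repeat_type: n * scale for repeat_type, n in counts.items()}
```

Because every read is scanned regardless of alignment status, kmers from sequence missing in hg19 still contribute to the numerator, whereas only the denominator depends on alignment.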
We analyzed these kmer repeat landscapes in PCAWG tissue samples, identified alterations in the repeat landscape indicative of structural changes, and determined that many of these changes were in elements not previously implicated in oncogenesis (see Supplementary Materials and Methods). We then analyzed kmer repeat landscapes in cfDNA samples from patients with and without cancer and analyzed structural and epigenetic influences on repeat element representation in the circulation (see Supplementary Materials and Methods).
ARTEMIS machine learning models in tissue
We centered and scaled the coverage normalized counts of the kmer repeat landscape for each tumor and normal tissue sample and trained a penalized logistic regression (PLR) model to generate a cross-validated ARTEMIS score (for each sample, the ARTEMIS score was calculated as the mean across 10 repeats of fivefold cross-validation) for distinguishing tumor from normal tissue samples. We further used kmer repeat landscapes to train a multiclass gradient-boosted model (GBM) to generate cross-validated (fivefold cross-validation) predictions of tumor tissues of origin (for each sample, the model generated a vector of multinomial probabilities, where each element corresponded to a possible tumor tissue of origin and the predicted class was chosen on the basis of the element with maximum value).
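The averaging of out-of-fold predictions across repeated cross-validation can be sketched in a few lines of Python. The callables `fit` and `predict` are hypothetical placeholders for training and applying the PLR; centering, scaling, and any stratification of folds are omitted, so this shows only the bookkeeping that yields one score per sample.

```python
import random
from statistics import mean


def repeated_kfold_scores(samples, fit, predict, n_splits=5, n_repeats=10, seed=0):
    """Return one cross-validated score per sample, averaged over repeats.

    fit(train_samples) -> model and predict(model, sample) -> float are
    hypothetical callables standing in for the penalized logistic regression.
    """
    rng = random.Random(seed)
    n = len(samples)
    per_sample = [[] for _ in range(n)]
    for _ in range(n_repeats):
        order = list(range(n))
        rng.shuffle(order)
        # Split the shuffled indices into n_splits disjoint folds
        folds = [order[i::n_splits] for i in range(n_splits)]
        for held_out in folds:
            held = set(held_out)
            model = fit([samples[i] for i in range(n) if i not in held])
            for i in held_out:
                # Each sample is scored only by a model that never saw it
                per_sample[i].append(predict(model, samples[i]))
    return [mean(scores) for scores in per_sample]
```

Each sample is held out exactly once per repeat, so with 10 repeats its final score is the mean of 10 out-of-fold predictions.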
ARTEMIS for early detection, tissue of origin, and monitoring of cancer in cfDNA
We obtained a kmer repeat landscape for each sample using the 786 features with more than 1000 kmers per million aligned reads expected. This filtering was used because, at low coverage, features with low abundance have greater technical variation (fig. S17). To accommodate ensembling of diverse feature classes, we used nested cross-validation to generate the ARTEMIS score. The inner cross-validation loop trained six PLR models (Lasso regression, α = 1, penalty chosen in the range of 0.00001 to 0.1 by resampling within each cross-validation fold) with repeat landscapes as features (a PLR for each of five repeat families and a PLR for the epigenetic profile; see Supplementary Materials and Methods). The outer cross-validation loop was trained with a leave-one-individual-out architecture; we ensembled the six scores available for each of the N − 1 individuals using a PLR model. The score obtained by applying this PLR model to the six scores for the held-out patient’s features is the ARTEMIS score for cancer detection in cfDNA.
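The leave-one-individual-out ensembling can be outlined as follows. This Python sketch is deliberately simplified and should not be read as the study's implementation: the base and ensemble learners are passed in as hypothetical callables (`base_fits`, `meta_fit`) rather than Lasso models, and the base scores for the N − 1 training individuals are computed in-sample instead of through the inner cross-validation loop described above.

```python
def leave_one_out_ensemble(samples, labels, base_fits, meta_fit):
    """Return one held-out ensemble score per individual.

    samples: list of dicts mapping a feature-group name to a feature vector.
    labels: list of class labels, one per individual.
    base_fits: dict mapping group name -> function(train_X, train_y) that
        returns a scoring function for a single feature vector (hypothetical).
    meta_fit: function(train_score_rows, train_y) that returns a scoring
        function for one row of base scores (hypothetical).
    """
    groups = sorted(base_fits)
    scores = []
    for held in range(len(samples)):
        train_idx = [i for i in range(len(samples)) if i != held]
        train_y = [labels[i] for i in train_idx]
        # One base model per feature group, trained without the held-out individual
        base_models = {
            g: base_fits[g]([samples[i][g] for i in train_idx], train_y)
            for g in groups
        }
        # Base scores of the N - 1 training individuals are the ensemble features
        # (simplification: computed in-sample, with no inner cross-validation)
        train_rows = [[base_models[g](samples[i][g]) for g in groups]
                      for i in train_idx]
        meta = meta_fit(train_rows, train_y)
        held_row = [base_models[g](samples[held][g]) for g in groups]
        scores.append(meta(held_row))
    return scores
```

The key property preserved by the sketch is that neither the base models nor the ensemble model ever sees the held-out individual before scoring that individual.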
To incorporate DELFI fragmentation profiles, we ensembled the ARTEMIS score with three additional models: a PLR model using principal components analysis of the ratio of short to long fragments in 5-Mb bins genome-wide (27), a PLR model on 39 chromosomal arm z-scores for aneuploidy (27), and a GBM on coverage in 5-Mb bins genome-wide (28). This combined ensemble produced a joint ARTEMIS-DELFI score. We retrained the lung cancer model on the full LUCAS cohort and then applied the locked models to four external validation cohorts: the Johns Hopkins University validation set (n = 431) from Mathios et al. (27); a subset of patients with prior cancers, with and without cancer recurrence in the LUCAS cohort (n = 25); the validation set from the AHN/DECAMP cohort in Bruhm et al. (55); and the lung cancer monitoring cohort from Phallen et al. (n = 19) (50).
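Applying a locked model to a validation cohort also involves a locked score threshold. A minimal Python sketch of this step is given below, assuming that the threshold is taken from the noncancer scores of the training cohort at a target specificity and that a sample is called positive when its score exceeds the threshold. The function names and the tie-handling convention are assumptions, not details from the study.

```python
def threshold_at_specificity(noncancer_scores, specificity_percent):
    """Smallest observed cutoff such that at least `specificity_percent` of
    the noncancer scores are at or below it (positive call: score > cutoff)."""
    ranked = sorted(noncancer_scores)
    # Integer ceiling division avoids floating-point rounding surprises
    k = max(1, (specificity_percent * len(ranked) + 99) // 100)
    return ranked[k - 1]


def sensitivity_specificity(cancer_scores, noncancer_scores, cutoff):
    """Evaluate a fixed (locked) cutoff on an independent cohort."""
    sensitivity = sum(s > cutoff for s in cancer_scores) / len(cancer_scores)
    specificity = sum(s <= cutoff for s in noncancer_scores) / len(noncancer_scores)
    return sensitivity, specificity
```

The cutoff is computed once on the training cohort and then reused unchanged, so the specificity observed in the validation cohort may differ from the target.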
Last, we trained multiclass GBMs using the features described above to generate an ARTEMIS and ARTEMIS-DELFI score for tissue of origin classification. The final ARTEMIS model for tissue of origin used the ensemble components described above as features, and the ARTEMIS-DELFI model used these components and additional fragmentation features (the 39 chromosomal arm z-scores for aneuploidy and the ratio of short to long fragments in 5-Mb bins genome-wide). Each ensemble component produced a vector of multinomial probabilities, with one for each possible tumor site. We determined classification on the basis of the element of the vector with the maximum value. When ensembling multiple GBM classifiers, all elements of the vector were used as feature inputs to the ensemble model. The models were trained using the nested cross-validation procedure described above on all cancer samples from Cristiano et al. (n = 423), and performance was reported for all cancers detected at the 90% specificity threshold by the ARTEMIS and ARTEMIS-DELFI detection models when trained on the full cohort including patients without cancer. Consistent with that previous publication, for the tissue of origin analyses, we included the baseline time points from the lung cancer monitoring cohort above to increase the number of lung cancers available for classification analyses.
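The classification rule (choose the tumor site with the maximum multinomial probability) and the top one and top two accuracies can be written compactly. In the Python sketch below, each probability vector is represented as a dictionary from tumor site to probability; this representation and the function names are illustrative assumptions.

```python
def top_n_classes(probabilities, n=1):
    """Return the n tumor sites with the highest predicted probability.

    probabilities: dict mapping tumor site -> multinomial probability.
    """
    return sorted(probabilities, key=probabilities.get, reverse=True)[:n]


def top_n_accuracy(prob_vectors, true_sites, n=1):
    """Fraction of samples whose true site is among the top n predictions."""
    hits = sum(true in top_n_classes(p, n)
               for p, true in zip(prob_vectors, true_sites))
    return hits / len(true_sites)


# Toy example with two samples and three possible sites
vectors = [
    {"lung": 0.6, "breast": 0.3, "ovarian": 0.1},
    {"lung": 0.5, "breast": 0.4, "ovarian": 0.1},
]
truth = ["lung", "breast"]
top1 = top_n_accuracy(vectors, truth, n=1)
top2 = top_n_accuracy(vectors, truth, n=2)
```

In the toy example, the second sample is misclassified by the top prediction but recovered when the top two predictions are considered.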
Statistical analysis
Computer code, software versions, processed data in tabular format used for making figures in the manuscript, and the computing environment for running the ARTEMIS pipeline and generating figures in this study are available at https://github.com/cancer-genomics/artemis2024. This GitHub repository has also been archived on Zenodo at 10.5281/zenodo.10627372. P values for two-group comparisons were calculated using the Wilcoxon rank sum test. Correlation of continuous variables was assessed using Spearman’s rank correlation coefficient. Receiver operating characteristic (ROC) curves were compared using DeLong’s test. The 95% CIs for area under the ROC curve were based on DeLong’s method.
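For readers less familiar with the area under the ROC curve, its point estimate equals the fraction of case and control pairs in which the case has the higher score, with ties counted as one half; this is the same quantity as the normalized Wilcoxon rank sum (Mann-Whitney U) statistic. The short Python sketch below computes this estimate directly. It does not implement DeLong's test or CIs, and the function name is illustrative.

```python
def auc_from_scores(case_scores, control_scores):
    """Empirical AUC: proportion of case/control pairs in which the case
    scores higher than the control, counting ties as half a win."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))
```

The pairwise loop is quadratic in cohort size, which is acceptable for cohorts of a few hundred individuals; a rank-based formulation gives the same value more efficiently.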
Acknowledgments:
We thank members of our laboratories for critical review of the manuscript. The results shown here are in part based on data generated by the TCGA Research Network (http://cancer.gov/tcga), the ENCODE Consortium (http://encodeproject.org), and the T2T Consortium (https://sites.google.com/ucsc.edu/t2tworkinggroup/).
Funding:
This work was supported in part by the Dr. Miriam and Sheldon G. Adelson Medical Research Foundation (to V.E.V., J.P., and R.B.S.); SU2C in-Time Lung Cancer Interception Dream Team Grant (to V.E.V. and J.P.); Stand Up to Cancer-Dutch Cancer Society International Translational Cancer Research Dream Team Grant (SU2C-AACR-DT1415) (to V.E.V.); the Gray Foundation (to V.E.V. and J.P.); the Honorable Tina Brozman Foundation (to V.E.V. and J.P.); the Commonwealth Foundation (to V.E.V., V.A., and R.B.S.); the Mark Foundation for Cancer Research (to D.M.); the Cole Foundation (to V.E.V.); a research grant from Delfi Diagnostics (to V.E.V. and R.B.S.); and US National Institutes of Health grants CA121113 (to V.E.V.), CA006973 (to V.E.V.), CA233259 (to V.E.V.), CA062924 (to V.E.V. and R.B.S.), CA271896 (to V.E.V.) and 1T32GM136577 (to A.V.A.). Stand Up To Cancer is a program of the Entertainment Industry Foundation administered by the American Association for Cancer Research. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Footnotes
Competing interests: A.V.A., R.B.S., and V.E.V. are inventors on patent applications submitted by Johns Hopkins University related to genome-wide repeat landscapes in cancer and cfDNA (US Patent application number 63/532,642). A.V.A., D.C.B., V.A., D.M., Z.H.F., J.P., and R.B.S. are inventors on patent applications submitted by Johns Hopkins University related to cell-free DNA for cancer detection that have been licensed to Delfi Diagnostics. J.R.W. is the founder and owner of Resphera Biosciences LLC and serves as a consultant to Personal Genome Diagnostics Inc. and Delfi Diagnostics Inc. C.C. is the founder and owner of CMCC Consulting. J.P., V.A., and R.B.S. are founders of Delfi Diagnostics, and V.A. and R.B.S are consultants for this organization. V.E.V. is a founder of Delfi Diagnostics, serves on the board of directors and as an officer for this organization, and owns Delfi Diagnostics stock, which is subject to certain restrictions under university policy. In addition, Johns Hopkins University owns equity in Delfi Diagnostics. V.E.V. divested his equity in Personal Genome Diagnostics (PGDx) to LabCorp in February 2022. V.E.V. is an inventor on patent applications submitted by Johns Hopkins University related to cancer genomic analyses and cell-free DNA for cancer detection that have been licensed to one or more entities, including Delfi Diagnostics, LabCorp, QIAGEN, Sysmex, Agios, Genzyme, Esoterix, Ventana, and ManaT Bio. Under the terms of these license agreements, the university and inventors are entitled to fees and royalty distributions. V.E.V. is an advisor to Viron Therapeutics and Epitope. These arrangements have been reviewed and approved by the Johns Hopkins University in accordance with its conflict-of-interest policies. The remaining authors declare that they have no competing interests.
Data and materials availability:
All data associated with this study are in the paper and Supplementary Materials or deposited in publicly available repositories. The code to run the ARTEMIS pipeline and reproduce manuscript figures is publicly available at https://github.com/cancer-genomics/artemis2024. Code needed to generate DELFI features for fragmentation-based analysis may be found at https://github.com/cancer-genomics/reproduce_lucas_wflow. These GitHub repositories have also been archived on Zenodo at 10.5281/zenodo.10627372. Sequence data generated for cfDNA samples (27, 28, 45, and 56) have been deposited at the database of the European Genome-Phenome Archive (EGA) under accession codes EGAS00001005340, EGAS00001007248, EGAS00001007249, and EGAS00001003611 and may be obtained at https://ega-archive.org/. PCAWG BAM files were downloaded from the Bionimbus Protected Data Cloud (https://bionimbus.opensciencedatacloud.org/). Chromatin immunoprecipitation sequencing data were downloaded from the ENCODE portal (accession codes ENCFF001SUG, ENCFF001SUI, ENCFF001SUJ, ENCFF001SUE, ENCFF001SUL, ENCFF001SUF, ENCFF001SUN, ENCFF001SUO, ENCFF001SUP, and ENCFF001SUQ).
REFERENCES AND NOTES
- 1.Vollger MR, Guitart X, Dishuck PC, Mercuri L, Harvey WT, Gershman A, Diekhans M, Sulovari A, Munson KM, Lewis AP, Hoekzema K, Porubsky D, Li R, Nurk S, Koren S, Miga KH, Phillippy AM, Timp W, Ventura M, Eichler EE, Segmental duplications and their variation in a complete human genome. Science 376, eabj6965 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Aganezov S, Yan SM, Soto DC, Kirsche M, Zarate S, Avdeyev P, Taylor DJ, Shafin K, Shumate A, Xiao C, Wagner J, McDaniel J, Olson ND, Sauria MEG, Vollger MR, Rhie A, Meredith M, Martin S, Lee J, Koren S, Rosenfeld JA, Paten B, Layer R, Chin C-S, Sedlazeck FJ, Hansen NF, Miller DE, Phillippy AM, Miga KH, McCoy RC, Dennis MY, Zook JM, Schatz MC, A complete reference genome improves analysis of human genetic variation. Science 376, eabl3533 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Hoyt SJ, Storer JM, Hartley GA, Grady PGS, Gershman A, de Lima LG, Limouse C, Halabian R, Wojenski L, Rodriguez M, Altemose N, Rhie A, Core LJ, Gerton JL, Makalowski W, Olson D, Rosen J, Smit AFA, Straight AF, Vollger MR, Wheeler TJ, Schatz MC, Eichler EE, Phillippy AM, Timp W, Miga KH, O’Neill RJ, From telomere to telomere: The transcriptional and epigenetic state of human repeat elements. Science 376, eabk3112 (2022).
- 4.Gershman A, Sauria MEG, Guitart X, Vollger MR, Hook PW, Hoyt SJ, Jain M, Shumate A, Razaghi R, Koren S, Altemose N, Caldas GV, Logsdon GA, Rhie A, Eichler EE, Schatz MC, O’Neill RJ, Phillippy AM, Miga KH, Timp W, Epigenetic patterns in a complete human genome. Science 376, eabj5089 (2022).
- 5.Nurk S, Koren S, Rhie A, Rautiainen M, Bzikadze AV, Mikheenko A, Vollger MR, Altemose N, Uralsky L, Gershman A, Aganezov S, Hoyt SJ, Diekhans M, Logsdon GA, Alonge M, Antonarakis SE, Borchers M, Bouffard GG, Brooks SY, Caldas GV, Chen N-C, Cheng H, Chin C-S, Chow W, de Lima LG, Dishuck PC, Durbin R, Dvorkina T, Fiddes IT, Formenti G, Fulton RS, Fungtammasan A, Garrison E, Grady PGS, Graves-Lindsay TA, Hall IM, Hansen NF, Hartley GA, Haukness M, Howe K, Hunkapiller MW, Jain C, Jain M, Jarvis ED, Kerpedjiev P, Kirsche M, Kolmogorov M, Korlach J, Kremitzki M, Li H, Maduro VV, Marschall T, McCartney AM, McDaniel J, Miller DE, Mullikin JC, Myers EW, Olson ND, Paten B, Peluso P, Pevzner PA, Porubsky D, Potapova T, Rogaev EI, Rosenfeld JA, Salzberg SL, Schneider VA, Sedlazeck FJ, Shafin K, Shew CJ, Shumate A, Sims Y, Smit AFA, Soto DC, Sović I, Storer JM, Streets A, Sullivan BA, Thibaud-Nissen F, Torrance J, Wagner J, Walenz BP, Wenger A, Wood JMD, Xiao C, Yan SM, Young AC, Zarate S, Surti U, McCoy RC, Dennis MY, Alexandrov IA, Gerton JL, O’Neill RJ, Timp W, Zook JM, Schatz MC, Eichler EE, Miga KH, Phillippy AM, The complete sequence of a human genome. Science 376, 44–53 (2022).
- 6.Altemose N, Logsdon GA, Bzikadze AV, Sidhwani P, Langley SA, Caldas GV, Hoyt SJ, Uralsky L, Ryabov FD, Shew CJ, Sauria MEG, Borchers M, Gershman A, Mikheenko A, Shepelev VA, Dvorkina T, Kunyavskaya O, Vollger MR, Rhie A, McCartney AM, Asri M, Lorig-Roach R, Shafin K, Lucas JK, Aganezov S, Olson D, de Lima LG, Potapova T, Hartley GA, Haukness M, Kerpedjiev P, Gusev F, Tigyi K, Brooks S, Young A, Nurk S, Koren S, Salama SR, Paten B, Rogaev EI, Streets A, Karpen GH, Dernburg AF, Sullivan BA, Straight AF, Wheeler TJ, Gerton JL, Eichler EE, Phillippy AM, Timp W, Dennis MY, O’Neill RJ, Zook JM, Schatz MC, Pevzner PA, Diekhans M, Langley CH, Alexandrov IA, Miga KH, Complete genomic and epigenetic maps of human centromeres. Science 376, eabl4178 (2022).
- 7.Grégoire L, Haudry A, Lerat E, The transposable element environment of human genes is associated with histone and expression changes in cancer. BMC Genomics 17, 588 (2016).
- 8.Anwar SL, Wulaningsih W, Lehmann U, Transposable elements in human cancer: Causes and consequences of deregulation. Int. J. Mol. Sci. 18, 974 (2017).
- 9.Bose P, Hermetz KE, Conneely KN, Rudd MK, Tandem repeats and G-rich sequences are enriched at human CNV breakpoints. PLOS ONE 9, e101607 (2014).
- 10.Sato S, Gillette M, de Santiago PR, Kuhn E, Burgess M, Doucette K, Feng Y, Mendez-Dorantes C, Ippoliti PJ, Hobday S, Mitchell MA, Doberstein K, Gysler SM, Hirsch MS, Schwartz L, Birrer MJ, Skates SJ, Burns KH, Carr SA, Drapkin R, LINE-1 ORF1p as a candidate biomarker in high grade serous ovarian carcinoma. Sci. Rep. 13, 1537 (2023).
- 11.Martínez JG, Pérez-Escuredo J, Castro-Santos P, Marcos CÁ, Pendás JLL, Fraga MF, Hermsen MA, Hypomethylation of LINE-1, and not centromeric SAT-α, is associated with centromeric instability in head and neck squamous cell carcinoma. Cell. Oncol. 35, 259–267 (2012).
- 12.Karttunen K, Patel D, Xia J, Fei L, Palin K, Aaltonen L, Sahu B, Transposable elements as tissue-specific enhancers in cancers of endodermal lineage. Nat. Commun. 14, 5313 (2023).
- 13.Erwin GS, Gürsoy G, Al-Abri R, Suriyaprakash A, Dolzhenko E, Zhu K, Hoerner CR, White SM, Ramirez L, Vadlakonda A, Vadlakonda A, von Kraut K, Park J, Brannon CM, Sumano DA, Kirtikar RA, Erwin AA, Metzner TJ, Yuen RKC, Fan AC, Leppert JT, Eberle MA, Gerstein M, Snyder MP, Recurrent repeat expansions in human cancer genomes. Nature 613, 96–102 (2023).
- 14.de Lima LG, Howe E, Singh VP, Potapova T, Li H, Xu B, Castle J, Crozier S, Harrison CJ, Clifford SC, Miga KH, Ryan SL, Gerton JL, PCR amplicons identify widespread copy number variation in human centromeric arrays and instability in cancer. Cell Genom. 1, 100064 (2021).
- 15.Saha AK, Mourad M, Kaplan MH, Chefetz I, Malek SN, Buckanovich R, Markovitz DM, Contreras-Galindo R, The genomic landscape of centromeres in cancers. Sci. Rep. 9, 11259 (2019).
- 16.Decombe S, Loll F, Caccianini L, Affannoukoué K, Izeddin I, Mozziconacci J, Escudé C, Lopes J, Epigenetic rewriting at centromeric DNA repeats leads to increased chromatin accessibility and chromosomal instability. Epigenetics Chromatin 14, 35 (2021).
- 17.Ly P, Brunner SF, Shoshani O, Kim DH, Lan W, Pyntikova T, Flanagan AM, Behjati S, Page DC, Campbell PJ, Cleveland DW, Chromosome segregation errors generate a diverse spectrum of simple and complex genomic rearrangements. Nat. Genet. 51, 705–715 (2019).
- 18.Ichida K, Suzuki K, Fukui T, Takayama Y, Kakizawa N, Watanabe F, Ishikawa H, Muto Y, Kato T, Saito M, Futsuhara K, Miyakura Y, Noda H, Ohmori T, Konishi F, Rikiyama T, Overexpression of satellite alpha transcripts leads to chromosomal instability via segregation errors at specific chromosomes. Int. J. Oncol. 52, 1685–1693 (2018).
- 19.Bersani F, Lee E, Kharchenko PV, Xu AW, Liu M, Xega K, MacKenzie OC, Brannigan BW, Wittner BS, Jung H, Ramaswamy S, Park PJ, Maheswaran S, Ting DT, Haber DA, Pericentromeric satellite repeat expansions through RNA-derived DNA intermediates in cancer. Proc. Natl. Acad. Sci. U.S.A. 112, 15148–15153 (2015).
- 20.Leary RJ, Sausen M, Kinde I, Papadopoulos N, Carpten JD, Craig D, O’Shaughnessy J, Kinzler KW, Parmigiani G, Vogelstein B, Diaz LA Jr, Velculescu VE, Detection of chromosomal alterations in the circulation of cancer patients with whole-genome sequencing. Sci. Transl. Med. 4, 162ra154 (2012).
- 21.Grabuschnig S, Soh J, Heidinger P, Bachler T, Hirschböck E, Rodriguez IR, Schwendenwein D, Sensen CW, Circulating cell-free DNA is predominantly composed of retrotransposable elements and non-telomeric satellite DNA. J. Biotechnol. 313, 48–56 (2020).
- 22.Gezer U, Bronkhorst AJ, Holdenrieder S, The utility of repetitive cell-free DNA in cancer liquid biopsies. Diagnostics 12, 1363 (2022).
- 23.Douville C, Springer S, Kinde I, Cohen JD, Hruban RH, Lennon AM, Papadopoulos N, Kinzler KW, Vogelstein B, Karchin R, Detection of aneuploidy in patients with cancer through amplification of long interspersed nucleotide elements (LINEs). Proc. Natl. Acad. Sci. U.S.A. 115, 1871–1876 (2018).
- 24.Rago C, Huso DL, Diehl F, Karim B, Liu G, Papadopoulos N, Samuels Y, Velculescu VE, Vogelstein B, Kinzler KW, Diaz LA, Serial assessment of human tumor burdens in mice by the analysis of circulating DNA. Cancer Res. 67, 9364–9370 (2007).
- 25.Rhie A, Nurk S, Cechova M, Hoyt SJ, Taylor DJ, Altemose N, Hook PW, Koren S, Rautiainen M, Alexandrov IA, Allen J, Asri M, Bzikadze AV, Chen N-C, Chin C-S, Diekhans M, Flicek P, Formenti G, Fungtammasan A, Giron CG, Garrison E, Gershman A, Gerton JL, Grady PGS, Guarracino A, Haggerty L, Halabian R, Hansen NF, Harris R, Hartley GA, Harvey WT, Haukness M, Heinz J, Hourlier T, Hubley RM, Hunt SE, Hwang S, Jain M, Kesharwani RK, Lewis AP, Li H, Logsdon GA, Lucas JK, Makalowski W, Markovic C, Martin FJ, Cartney AMM, McCoy RC, McDaniel J, McNulty BM, Medvedev P, Mikheenko A, Munson KM, Murphy TD, Olsen HE, Olson ND, Paulin LF, Porubsky D, Potapova T, Ryabov F, Salzberg SL, Sauria MEG, Sedlazeck FJ, Shafin K, Shepelev VA, Shumate A, Storer JM, Surapaneni L, Oill AMT, Thibaud-Nissen F, Timp W, Tomaszkiewicz M, Vollger MR, Walenz BP, Watwood AC, Weissensteiner MH, Wenger AM, Wilson MA, Zarate S, Zhu Y, Zook JM, Eichler EE, O’Neill RJ, Schatz MC, Miga KH, Makova KD, Phillippy AM, The complete sequence of a human Y chromosome. Nature 621, 344–354 (2023).
- 26.Altemose N, Miga KH, Maggioni M, Willard HF, Genomic characterization of large heterochromatic gaps in the human genome assembly. PLoS Comput. Biol. 10, e1003628 (2014).
- 27.Mathios D, Johansen JS, Cristiano S, Medina JE, Phallen J, Larsen KR, Bruhm DC, Niknafs N, Ferreira L, Adleff V, Chiao JY, Leal A, Noe M, White JR, Arun AS, Hruban C, Annapragada AV, Jensen SØ, Ørntoft M-BW, Madsen AH, Carvalho B, de Wit M, Carey J, Dracopoli NC, Maddala T, Fang KC, Hartman A-R, Forde PM, Anagnostou V, Brahmer JR, Fijneman RJA, Nielsen HJ, Meijer GA, Andersen CL, Mellemgaard A, Bojesen SE, Scharpf RB, Velculescu VE, Detection and characterization of lung cancer using cell-free DNA fragmentomes. Nat. Commun. 12, 5060 (2021).
- 28.Cristiano S, Leal A, Phallen J, Fiksel J, Adleff V, Bruhm DC, Jensen SØ, Medina JE, Hruban C, White JR, Palsgrove DN, Niknafs N, Anagnostou V, Forde P, Naidoo J, Marrone K, Brahmer J, Woodward BD, Husain H, van Rooijen KL, Ørntoft M-BW, Madsen AH, van de Velde CJH, Verheij M, Cats A, Punt CJA, Vink GR, van Grieken NCT, Koopman M, Fijneman RJA, Johansen JS, Nielsen HJ, Meijer GA, Andersen CL, Scharpf RB, Velculescu VE, Genome-wide cell-free DNA fragmentation in patients with cancer. Nature 570, 385–389 (2019).
- 29.The Human Microbiome Project Consortium, A framework for human microbiome research. Nature 486, 215–221 (2012).
- 30.The Human Microbiome Project Consortium, Structure, function and diversity of the healthy human microbiome. Nature 486, 207–214 (2012).
- 31.Tate JG, Bamford S, Jubb HC, Sondka Z, Beare DM, Bindal N, Boutselakis H, Cole CG, Creatore C, Dawson E, Fish P, Harsha B, Hathaway C, Jupe SC, Kok CY, Noble K, Ponting L, Ramshaw CC, Rye CE, Speedy HE, Stefancsik R, Thompson SL, Wang S, Ward S, Campbell PJ, Forbes SA, COSMIC: The catalogue of somatic mutations in cancer. Nucleic Acids Res. 47, D941–D947 (2019).
- 32.The ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium, Pan-cancer analysis of whole genomes. Nature 578, 82–93 (2020).
- 33.Babaian A, Mager DL, Endogenous retroviral promoter exaptation in human cancer. Mob. DNA 7, 24 (2016).
- 34.Jang HS, Shah NM, Du AY, Dailey ZZ, Pehrsson EC, Godoy PM, Zhang D, Li D, Xing X, Kim S, O’Donnell D, Gordon JI, Wang T, Transposable elements drive widespread expression of oncogenes in human cancers. Nat. Genet. 51, 611–617 (2019).
- 35.Sjöblom T, Jones S, Wood LD, Parsons DW, Lin J, Barber TD, Mandelker D, Leary RJ, Ptak J, Silliman N, Szabo S, Buckhaults P, Farrell C, Meeh P, Markowitz SD, Willis J, Dawson D, Willson JKV, Gazdar AF, Hartigan J, Wu L, Liu C, Parmigiani G, Park BH, Bachman KE, Papadopoulos N, Vogelstein B, Kinzler KW, Velculescu VE, The consensus coding sequences of human breast and colorectal cancers. Science 314, 268–274 (2006).
- 36.Wood LD, Parsons DW, Jones S, Lin J, Sjöblom T, Leary RJ, Shen D, Boca SM, Barber T, Ptak J, Silliman N, Szabo S, Dezso Z, Ustyanksky V, Nikolskaya T, Nikolsky Y, Karchin R, Wilson PA, Kaminker JS, Zhang Z, Croshaw R, Willis J, Dawson D, Shipitsin M, Willson JKV, Sukumar S, Polyak K, Park BH, Pethiyagoda CL, Pant PVK, Ballinger DG, Sparks AB, Hartigan J, Smith DR, Suh E, Papadopoulos N, Buckhaults P, Markowitz SD, Parmigiani G, Kinzler KW, Velculescu VE, Vogelstein B, The genomic landscapes of human breast and colorectal cancers. Science 318, 1108–1113 (2007).
- 37.Burssed B, Zamariolli M, Bellucco FT, Melaragno MI, Mechanisms of structural chromosomal rearrangement formation. Mol. Cytogenet. 15, 23 (2022).
- 38.Qian J, Massion PP, Role of chromosome 3q amplification in lung cancer. J. Thorac. Oncol. 3, 212–215 (2008).
- 39.Hussenet T, Dali S, Exinger J, Monga B, Jost B, Dembelé D, Martinet N, Thibault C, Huelsken J, Brambilla E, du Manoir S, SOX2 is an oncogene activated by recurrent 3q26.3 amplifications in human lung squamous cell carcinomas. PLOS ONE 5, e8960 (2010).
- 40.Rheinbay E, Nielsen MM, Abascal F, Wala JA, Shapira O, Tiao G, Hornshøj H, Hess JM, Juul RI, Lin Z, Feuerbach L, Sabarinathan R, Madsen T, Kim J, Mularoni L, Shuai S, Lanzós A, Herrmann C, Maruvka YE, Shen C, Amin SB, Bandopadhayay P, Bertl J, Boroevich KA, Busanovich J, Carlevaro-Fita J, Chakravarty D, Chan CWY, Craft D, Dhingra P, Diamanti K, Fonseca NA, Gonzalez-Perez A, Guo Q, Hamilton MP, Haradhvala NJ, Hong C, Isaev K, Johnson TA, Juul M, Kahles A, Kahraman A, Kim Y, Komorowski J, Kumar K, Kumar S, Lee D, Lehmann K-V, Li Y, Liu EM, Lochovsky L, Park K, Pich O, Roberts ND, Saksena G, Schumacher SE, Sidiropoulos N, Sieverling L, Sinnott-Armstrong N, Stewart C, Tamborero D, Tubio JMC, Umer HM, Uusküla-Reimand L, Wadelius C, Wadi L, Yao X, Zhang C-Z, Zhang J, Haber JE, Hobolth A, Imielinski M, Kellis M, Lawrence MS, von Mering C, Nakagawa H, Raphael BJ, Rubin MA, Sander C, Stein LD, Stuart JM, Tsunoda T, Wheeler DA, Johnson R, Reimand J, Gerstein M, Khurana E, Campbell PJ, López-Bigas N; PCAWG Drivers and Functional Interpretation Working Group; PCAWG Structural Variation Working Group, Weischenfeldt J, Beroukhim R, Martincorena I, Pedersen JS, Getz G; PCAWG Consortium, Analyses of non-coding somatic drivers in 2,658 cancer whole genomes. Nature 578, 102–111 (2020).
- 41.Chiappinelli KB, Strissel PL, Desrichard A, Li H, Henke C, Akman B, Hein A, Rote NS, Cope LM, Snyder A, Makarov V, Budhu S, Slamon DJ, Wolchok JD, Pardoll DM, Beckmann MW, Zahnow CA, Merghoub T, Chan TA, Baylin SB, Strick R, Inhibiting DNA methylation causes an interferon response in cancer via dsRNA including endogenous retroviruses. Cell 162, 974–986 (2015).
- 42.Russo M, Morelli S, Capranico G, Expression of down-regulated ERV LTR elements associates with immune activation in human small-cell lung cancers. Mob. DNA 14, 2 (2023).
- 43.Onishi-Seebacher M, Erikson G, Sawitzki Z, Ryan D, Greve G, Lübbert M, Jenuwein T, Repeat to gene expression ratios in leukemic blast cells can stratify risk prediction in acute myeloid leukemia. BMC Med. Genomics 14, 166 (2021).
- 44.Kemp JR, Longworth MS, Crossing the LINE toward genomic instability: LINE-1 retrotransposition in cancer. Front. Chem. 3, 68 (2015).
- 45.Foda ZH, Annapragada AV, Boyapati K, Bruhm DC, Vulpescu NA, Medina JE, Mathios D, Cristiano S, Niknafs N, Luu HT, Goggins MG, Anders RA, Sun J, Meta SH, Thomas DL, Kirk GD, Adleff V, Phallen J, Scharpf RB, Kim AK, Velculescu VE, Detecting liver cancer using cell-free DNA fragmentomes. Cancer Discov. 13, 616–631 (2023).
- 46.Snyder MW, Kircher M, Hill AJ, Daza RM, Shendure J, Cell-free DNA comprises an in vivo nucleosome footprint that informs its tissues-of-origin. Cell 164, 57–68 (2016).
- 47.Ulz P, Thallinger GG, Auer M, Graf R, Kashofer K, Jahn SW, Abete L, Pristauz G, Petru E, Geigl JB, Heitzer E, Speicher MR, Inferring expressed genes by whole-genome sequencing of plasma DNA. Nat. Genet. 48, 1273–1278 (2016).
- 48.Zhou Q, Kang G, Jiang P, Qiao R, Lam WKJ, Yu SCY, Ma M-JL, Ji L, Cheng SH, Gai W, Peng W, Shang H, Chan RWY, Chan SL, Wong GLH, Hiraki LT, Volpi S, Wong VWS, Wong J, Chiu RWK, Chan KCA, Lo YMD, Epigenetic analysis of cell-free DNA by fragmentomic profiling. Proc. Natl. Acad. Sci. U.S.A. 119, e2209852119 (2022).
- 49.Shen SY, Singhania R, Fehringer G, Chakravarthy A, Roehrl MHA, Chadwick D, Zuzarte PC, Borgida A, Wang TT, Li T, Kis O, Zhao Z, Spreafico A, da Silva Medina T, Wang Y, Roulois D, Ettayebi I, Chen Z, Chow S, Murphy T, Arruda A, O’Kane GM, Liu J, Mansour M, McPherson JD, O’Brien C, Leighl N, Bedard PL, Fleshner N, Liu G, Minden MD, Gallinger S, Goldenberg A, Pugh TJ, Hoffman MM, Bratman SV, Hung RJ, De Carvalho DD, Sensitive tumour detection and classification using plasma cell-free DNA methylomes. Nature 563, 579–583 (2018).
- 50.Phallen J, Leal A, Woodward BD, Forde PM, Naidoo J, Marrone KA, Brahmer JR, Fiksel J, Medina JE, Cristiano S, Palsgrove DN, Gocke CD, Bruhm DC, Keshavarzian P, Adleff V, Weihe E, Anagnostou V, Scharpf RB, Velculescu VE, Husain H, Early noninvasive detection of response to targeted therapy in non–small cell lung cancer. Cancer Res. 79, 1204–1213 (2019).
- 51.Altemose N, A classical revival: Human satellite DNAs enter the genomics era. Semin. Cell Dev. Biol. 128, 2–14 (2022).
- 52.Iben JR, Maraia RJ, tRNA gene copy number variation in humans. Gene 536, 376–384 (2014).
- 53.Liao W-W, Asri M, Ebler J, Doerr D, Haukness M, Hickey G, Lu S, Lucas JK, Monlong J, Abel HJ, Buonaiuto S, Chang XH, Cheng H, Chu J, Colonna V, Eizenga JM, Feng X, Fischer C, Fulton RS, Garg S, Groza C, Guarracino A, Harvey WT, Heumos S, Howe K, Jain M, Lu T-Y, Markello C, Martin FJ, Mitchell MW, Munson KM, Mwaniki MN, Novak AM, Olsen HE, Pesout T, Porubsky D, Prins P, Sibbesen JA, Sirén J, Tomlinson C, Villani F, Vollger MR, Antonacci-Fulton LL, Baid G, Baker CA, Belyaeva A, Billis K, Carroll A, Chang P-C, Cody S, Cook DE, Cook-Deegan RM, Cornejo OE, Diekhans M, Ebert P, Fairley S, Fedrigo O, Felsenfeld AL, Formenti G, Frankish A, Gao Y, Garrison NA, Giron CG, Green RE, Haggerty L, Hoekzema K, Hourlier T, Ji HP, Kenny EE, Koenig BA, Kolesnikov A, Korbel JO, Kordosky J, Koren S, Lee H, Lewis AP, Magalhães H, Marco-Sola S, Marijon P, McCartney A, McDaniel J, Mountcastle J, Nattestad M, Nurk S, Olson ND, Popejoy AB, Puiu D, Rautiainen M, Regier AA, Rhie A, Sacco S, Sanders AD, Schneider VA, Schultz BI, Shafin K, Smith MW, Sofia HJ, Tayoun ANA, Thibaud-Nissen F, Tricomi FF, Wagner J, Walenz B, Wood JMD, Zimin AV, Bourque G, Chaisson MJP, Flicek P, Phillippy AM, Zook JM, Eichler EE, Haussler D, Wang T, Jarvis ED, Miga KH, Garrison E, Marschall T, Hall IM, Li H, Paten B, A draft human pangenome reference. Nature 617, 312–324 (2023).
- 54.Marçais G, Kingsford C, A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27, 764–770 (2011).
- 55.Bruhm DC, Mathios D, Foda ZH, Annapragada AV, Medina JE, Adleff V, Chiao EJ, Ferreira L, Cristiano S, White JR, Mazzilli SA, Billatos E, Spira A, Zaidi AH, Mueller J, Kim AK, Anagnostou V, Phallen J, Scharpf RB, Velculescu VE, Single-molecule genome-wide mutation profiles of cell-free DNA for non-invasive detection of cancer. Nat. Genet. 55, 1301–1310 (2023).
- 56.Stoop EM, de Haan MC, de Wijkerslooth TR, Bossuyt PM, van Ballegooijen M, Nio CY, van de Vijver MJ, Biermann K, Thomeer M, van Leerdam ME, Fockens P, Stoker J, Kuipers EJ, Dekker E, Participation and yield of colonoscopy versus non-cathartic CT colonography in population-based screening for colorectal cancer: A randomised controlled trial. Lancet Oncol. 13, 55–64 (2012).
- 57.Billatos E, Duan F, Moses E, Marques H, Mahon I, Dymond L, Apgar C, Aberle D, Washko G, Spira A, DECAMP Investigators, Detection of early lung cancer among military personnel (DECAMP) consortium: Study protocols. BMC Pulm. Med. 19, 59 (2019).
- 58.Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP, Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U.S.A. 102, 15545–15550 (2005).
- 59.Mootha VK, Lindgren CM, Eriksson K-F, Subramanian A, Sihag S, Lehar J, Puigserver P, Carlsson E, Ridderstråle M, Laurila E, Houstis N, Daly MJ, Patterson N, Mesirov JP, Golub TR, Tamayo P, Spiegelman B, Lander ES, Hirschhorn JN, Altshuler D, Groop LC, PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat. Genet. 34, 267–273 (2003).
- 60.Anagnostou V, Niknafs N, Marrone K, Bruhm DC, White JR, Naidoo J, Hummelink K, Monkhorst K, Lalezari F, Lanis M, Rosner S, Reuss JE, Smith KN, Adleff V, Rodgers K, Belcaid Z, Rhymee L, Levy B, Feliciano J, Hann CL, Ettinger DS, Georgiades C, Verde F, Illei P, Li QK, Baras AS, Gabrielson E, Brock MV, Karchin R, Pardoll DM, Baylin SB, Brahmer JR, Scharpf RB, Forde PM, Velculescu VE, Multimodal genomic features predict outcome of immune checkpoint blockade in non-small-cell lung cancer. Nat. Cancer 1, 99–111 (2020).
- 61.Rodriguez-Martin B, Alvarez EG, Baez-Ortega A, Zamora J, Supek F, Demeulemeester J, Santamarina M, Ju YS, Temes J, Garcia-Souto D, Detering H, Li Y, Rodriguez-Castro J, Dueso-Barroso A, Bruzos AL, Dentro SC, Blanco MG, Contino G, Ardeljan D, Tojo M, Roberts ND, Zumalave S, Edwards PA, Weischenfeldt J, Puiggròs M, Chong Z, Chen K, Lee EA, Wala JA, Raine KM, Butler A, Waszak SM, Navarro FCP, Schumacher SE, Monlong J, Maura F, Bolli N, Bourque G, Gerstein M, Park PJ, Wedge DC, Beroukhim R, Torrents D, Korbel JO, Martincorena I, Fitzgerald RC, Van Loo P, Kazazian HH, Burns KH, PCAWG Structural Variation Working Group, Campbell PJ, Tubio JMC, PCAWG Consortium, Pan-cancer analysis of whole genomes identifies driver rearrangements promoted by LINE-1 retrotransposition. Nat. Genet. 52, 306–319 (2020).
- 62.Yandım C, Karakülah G, Dysregulated expression of repetitive DNA in ER+/HER2− breast cancer. Cancer Genet. 239, 36–45 (2019).
- 63.Zillner K, Komatsu J, Filarsky K, Kalepu R, Bensimon A, Németh A, Active human nucleolar organizer regions are interspersed with inactive rDNA repeats in normal and tumor cells. Epigenomics 7, 363–378 (2015).
- 64.Uemura M, Zheng Q, Koh CM, Nelson WG, Yegnasubramanian S, De Marzo AM, Overexpression of ribosomal RNA in prostate cancer is common but not linked to rDNA promoter hypomethylation. Oncogene 31, 1254–1263 (2012).
- 65.Valori V, Tus K, Laukaitis C, Harris DT, LeBeau L, Maggert KA, Human rDNA copy number is unstable in metastatic breast cancers. Epigenetics 15, 85–106 (2020).
- 66.Shuai S, Suzuki H, Diaz-Navarro A, Nadeu F, Kumar SA, Gutierrez-Fernandez A, Delgado J, Pinyol M, López-Otín C, Puente XS, Taylor MD, Campo E, Stein LD, The U1 spliceosomal RNA is recurrently mutated in multiple cancers. Nature 574, 712–716 (2019).
- 67.Suzuki H, Kumar SA, Shuai S, Diaz-Navarro A, Gutierrez-Fernandez A, Antonellis PD, Cavalli FMG, Juraschka K, Farooq H, Shibahara I, Vladoiu MC, Zhang J, Abeysundara N, Przelicki D, Skowron P, Gauer N, Luu B, Daniels C, Wu X, Forget A, Momin A, Wang J, Dong W, Kim S-K, Grajkowska WA, Jouvet A, Fèvre-Montange M, Garrè ML, Rao AAN, Giannini C, Kros JM, French PJ, Jabado N, Ng H-K, Poon WS, Eberhart CG, Pollack IF, Olson JM, Weiss WA, Kumabe T, López-Aguilar E, Lach B, Massimino M, Meir EGV, Rubin JB, Vibhakar R, Chambless LB, Kijima N, Klekner A, Bognár L, Chan JA, Faria CC, Ragoussis J, Pfister SM, Goldenberg A, Wechsler-Reya RJ, Bailey SD, Garzia L, Morrissy AS, Marra MA, Huang X, Malkin D, Ayrault O, Ramaswamy V, Puente XS, Calarco JA, Stein L, Taylor MD, Recurrent noncoding U1 snRNA mutations drive cryptic splicing in SHH medulloblastoma. Nature 574, 707–711 (2019).
- 68.Tan C, Cao J, Chen L, Xi X, Wang S, Zhu Y, Yang L, Ma L, Wang D, Yin J, Zhang T, Lu ZJ, Noncoding RNAs serve as diagnosis and prognosis biomarkers for hepatocellular carcinoma. Clin. Chem. 65, 905–915 (2019).
- 69.Ganesan S, Breaking satellite silence: Human satellite II RNA expression in ovarian cancer. J. Clin. Investig. 132, e161981 (2022).
- 70.Ting DT, Lipson D, Paul S, Brannigan BW, Akhavanfard S, Coffman EJ, Contino G, Deshpande V, Iafrate AJ, Letovsky S, Rivera MN, Bardeesy N, Maheswaran S, Haber DA, Aberrant overexpression of satellite repeats in pancreatic and other epithelial cancers. Science 331, 593–596 (2011).
- 71.Arancio W, Coronnello C, Repetitive sequence transcription in breast cancer. Cells 11, 2522 (2022).
- 72.Tsumagari K, Qi L, Jackson K, Shao C, Lacey M, Sowden J, Tawil R, Vedanarayanan V, Ehrlich M, Epigenetics of a tandem DNA repeat: Chromatin DNaseI sensitivity and opposite methylation changes in cancers. Nucleic Acids Res. 36, 2196–2207 (2008).
- 73.Herrington CS, Worsham M, Southern SA, Mackowiak P, Wolman SR, Loss of sequences on the short arm of chromosome 17 is a late event in squamous carcinoma of the cervix. Mol. Pathol. 54, 160–164 (2001).
- 74.The ENCODE Project Consortium, An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
- 75.Ho JWK, Jung YL, Liu T, Alver BH, Lee S, Ikegami K, Sohn K-A, Minoda A, Tolstorukov MY, Appert A, Parker SCJ, Gu T, Kundaje A, Riddle NC, Bishop E, Egelhofer TA, Hu SS, Alekseyenko AA, Rechtsteiner A, Asker D, Belsky JA, Bowman SK, Chen QB, Chen RA-J, Day DS, Dong Y, Dose AC, Duan X, Epstein CB, Ercan S, Feingold EA, Ferrari F, Garrigues JM, Gehlenborg N, Good PJ, Haseley P, He D, Herrmann M, Hoffman MM, Jeffers TE, Kharchenko PV, Kolasinska-Zwierz P, Kotwaliwale CV, Kumar N, Langley SA, Larschan EN, Latorre I, Libbrecht MW, Lin X, Park R, Pazin MJ, Pham HN, Plachetka A, Qin B, Schwartz YB, Shoresh N, Stempor P, Vielle A, Wang C, Whittle CM, Xue H, Kingston RE, Kim JH, Bernstein BE, Dernburg AF, Pirrotta V, Kuroda MI, Noble WS, Tullius TD, Kellis M, MacAlpine DM, Strome S, Elgin SCR, Liu XS, Lieb JD, Ahringer J, Karpen GH, Park PJ, Comparative analysis of metazoan chromatin organization. Nature 512, 449–452 (2014).
- 76.Feinberg AP, Gehrke CW, Kuo KC, Ehrlich M, Reduced genomic 5-methylcytosine content in human colonic neoplasia. Cancer Res. 48, 1159–1161 (1988).
- 77.Ehrlich M, DNA hypomethylation in cancer cells. Epigenomics 1, 239–259 (2009).
Data Availability Statement
All data associated with this study are in the paper and Supplementary Materials or deposited in publicly available repositories. The code to run the ARTEMIS pipeline and reproduce manuscript figures is publicly available at https://github.com/cancer-genomics/artemis2024. Code needed to generate DELFI features for fragmentation-based analysis may be found at https://github.com/cancer-genomics/reproduce_lucas_wflow. These GitHub repositories have also been archived on Zenodo at 10.5281/zenodo.10627372. Sequence data generated for cfDNA samples (27, 28, 45, and 56) have been deposited at the database of the European Genome-Phenome Archive (EGA) under accession codes EGAS00001005340, EGAS00001007248, EGAS00001007249, and EGAS00001003611 and may be obtained at https://ega-archive.org/. PCAWG BAM files were downloaded from the Bionimbus Protected Data Cloud (https://bionimbus.opensciencedatacloud.org/). Chromatin immunoprecipitation sequencing data were downloaded from the ENCODE portal (accession codes ENCFF001SUG, ENCFF001SUI, ENCFF001SUJ, ENCFF001SUE, ENCFF001SUL, ENCFF001SUF, ENCFF001SUN, ENCFF001SUO, ENCFF001SUP, and ENCFF001SUQ).





