Systematic biases in reference-based plasma cell-free DNA fragmentomic profiling

Xiaoyi Liu; Mengqi Yang; Dingxue Hu; Yunyun An; Wanqiu Wang; Huizhen Lin; Yuqi Pan; Jia Ju; Kun Sun

doi:10.1016/j.crmeth.2024.100793

2024 Jun 11;4(6):100793. doi: 10.1016/j.crmeth.2024.100793

PMCID: PMC11228372 PMID: 38866008

Summary

Plasma cell-free DNA (cfDNA) fragmentation patterns are emerging directions in cancer liquid biopsy with high translational significance. Conventionally, the cfDNA sequencing reads are aligned to a reference genome to extract their fragmentomic features. In this study, through cfDNA fragmentomics profiling using different reference genomes on the same datasets in parallel, we report systematic biases in such conventional reference-based approaches. The biases in cfDNA fragmentomic features vary among races in a sample-dependent manner and therefore might adversely affect the performances of cancer diagnosis assays across multiple clinical centers. In addition, to circumvent the analytical biases, we develop Freefly, a reference-free approach for cfDNA fragmentomics profiling. Freefly runs ∼60-fold faster than the conventional reference-based approach while generating highly consistent results. Moreover, cfDNA fragmentomic features reported by Freefly can be directly used for cancer diagnosis. Hence, Freefly possesses translational merit toward the rapid and unbiased measurement of cfDNA fragmentomics.

Keywords: liquid biopsy, cancer diagnosis, end motif, size profile

Highlights

• We characterize systematic biases in reference-based cell-free DNA (cfDNA) fragmentomic analysis
• We develop Freefly, a reference-free algorithm for cfDNA fragmentomic profiling
• Freefly runs faster than the conventional reference-based approach and produces reliable results
• Freefly enables cfDNA fragmentomic-based cancer diagnosis

Motivation

Plasma cell-free DNA (cfDNA) is one of the most widely used analytes in cancer liquid biopsy. Recent studies have revealed that cfDNA generation is not random, and its fragmentation patterns are highly informative in diagnosis and tumor-origin tracing. However, current implementations need to map the cfDNA data to a reference human genome, which is time consuming and could introduce race- and gender-mediated biases. We sought to evaluate such systematic biases and develop a rapid and reference-free approach to profile cfDNA fragmentomic features for cancer diagnosis.

Liu et al. characterize the systematic biases in conventional reference-based profiling of plasma cell-free DNA (cfDNA) fragmentomic features and develop a reference-free algorithm, Freefly, with accelerated speed and consistent results for unbiased cfDNA fragmentomic profiling and cancer diagnosis.

Introduction

Circulating cell-free DNA (cfDNA) in the plasma of peripheral blood has proven to be a valuable analyte in cancer liquid biopsy, including non-invasive diagnosis and precision medicine, as well as relapse surveillance.¹ In-depth investigations of cfDNA from patients with cancer reveal that tumor-derived cfDNA molecules differ markedly from the background ones (which mainly originate from the hematopoietic system²) in features such as relative shortness, tissue-specific nucleosome footprints, and aberrant preferences on ending sites.³,⁴,⁵,⁶,⁷,⁸,⁹,¹⁰ Numerous studies have demonstrated that the fragmentation characteristics (also known as “fragmentomics”) of cfDNA molecules in patients with cancer bear high translational significance in early diagnosis, tumor origin tracing and subtyping, and tissue damage surveillance.¹,⁹,¹¹,¹² Among the various types of cfDNA fragmentomic features, size profile and end motif patterns are two of the most widely studied and comprehensively validated diagnostic biomarkers. For instance, Cristiano and colleagues utilized the size characteristics to develop the “DNA evaluation of fragments for early interception” algorithm for accurate pan-cancer diagnosis¹³; we and others developed various diagnostic models based on cfDNA end motif frequencies and diversities,¹⁴ whose performance has been validated in large cohorts of patients with hepatocellular carcinoma (HCC)¹⁵ and lung cancer.¹⁶ To obtain the size and end motif profiles of cfDNA molecules, DNA sequencing technologies have been widely utilized; during downstream data analysis, one essential step in the current implementations is to align the cfDNA reads to a reference genome,³,⁶,⁸,¹⁰,¹³,¹⁴,¹⁵,¹⁶ which is time consuming and prolongs the turnaround time of clinical diagnoses.
Moreover, whether the reference genome fitted the samples (in terms of genetic background) was usually not comprehensively evaluated,¹⁷ and the impact and potential biases caused by any genetic discrepancies have not been rigorously characterized. Hence, in this work, we investigated the analytical biases related to reference genomes and explored novel computational algorithms toward unbiased cfDNA fragmentomics profiling.

Results

Systematic biases in conventional reference-based approach

The schematic workflow of cfDNA fragmentomics profiling is illustrated in Figure 1. To explore the influences of reference genomes on the conventional approach, three datasets were collected from the literature and analyzed: dataset 1 was retrieved from our previous study that sequenced 24 healthy subjects,¹⁰ dataset 2 contains control subjects and patients with HCC and lung cancer,¹⁸ and dataset 3 includes 24 patients with breast cancer.¹⁹ Notably, all the samples in datasets 1 and 2 were collected from Chinese patients. Hence, besides the commonly used NCBI GRCh38 (also known as hg38) reference genome, we reanalyzed the data against Han1, a telomere-to-telomere and reference-quality genome from a Southern Han Chinese individual²⁰ with a genetic background more similar to that of the samples in these two datasets. As a result, in these two datasets, we obtained significantly elevated mapping rates along with reduced duplication rates using Han1 compared to GRCh38 as the reference genome, leading to a median increase of ∼1.7% in usable data (all p < 10⁻⁵, paired t tests; Figures 2A and S1), suggesting that the Han1 genome could be more appropriate for these datasets. Interestingly, although samples in dataset 3 were not Chinese, an increase in usable data was also observed, which was even higher (∼2.2%) than in the other two datasets. The cfDNA fragmentomic profiles using Han1 were overall similar to those from GRCh38 (Figure S2); however, systematic discrepancies were observed (Figures 2B–2F and S1): in datasets 1 and 2, results using Han1 showed significantly higher proportions of short fragments (derived from the size profiles), while no significant difference was found in dataset 3; in dataset 1, CCCA end motif usage was significantly higher using Han1, while it was markedly lower in datasets 2 and 3; and for the motif diversity score, significant elevations using Han1 were observed in all three datasets. 
Moreover, in dataset 1, using Han1 as the reference genome tended to generate even more short fragments in samples that already had high proportions of short fragments (Figure 2D), and this bias was also observed in the HCC samples of dataset 2 (Figure S1C), while an opposite trend was found in dataset 3 (Figure 2F). In addition, the Han1-versus-GRCh38 differences in the proportion of short fragments varied significantly among patient types in dataset 2 (p = 0.0026, Kruskal-Wallis test; Figure 2E). These results suggested that the reference genomes did introduce systematic biases to the conventional methods, and such biases could be different among races in a sample-dependent manner.

Schematic workflow of Freefly versus the conventional approach in cfDNA fragmentomic profiling

For sequenced plasma cell-free DNA (cfDNA) reads, conventional approaches align them to a reference genome and then extract the outermost ends to calculate size and end motifs; in contrast, Freefly stitches the paired-end reads into a single-end fragment, extracts the size distribution of the stitched reads, and uses the sequences at the 5′ end to calculate motif patterns.

Systematic biases associated with the conventional reference-based approach

The results were generated using GRCh38 versus Han1 as the reference genomes.

(A) Mapping rate, duplication rate, and number of usable reads in dataset 1.

(B and C) Differences in proportion of short fragments, CCCA end motif usage, and motif diversity score in datasets (B) 1 and (C) 3.

(D) Relationship between the Han1-versus-GRCh38 difference in the proportion of short fragments and the proportion of short fragments itself in dataset 1.

(E) Differences in proportion of short fragments among different patient groups in dataset 2.

(F) Relationship between the Han1-versus-GRCh38 difference in the proportion of short fragments and the proportion of short fragments itself in dataset 3.

In (A)–(C), p values were calculated using paired t tests; in (E), the p value was calculated using the Kruskal-Wallis test.

Performance evaluation of Freefly

To circumvent the analytical biases introduced by the reference genomes, as a proof of concept, we developed the Freefly algorithm, which does not depend on references to profile cfDNA fragmentomics. We first benchmarked Freefly against the conventional approach (using GRCh38 as the reference genome) to evaluate its performance and accuracy. On dataset 1, the conventional approach completed the analysis in ∼2 h; in contrast, Freefly finished in just ∼2 min, a median speed boost of ∼61.8-fold (range: 61.0–63.1; Figure 3A). At the same time, Freefly showed almost identical size profiles and highly consistent end motif patterns (including the most widely studied CCCA end motif usage and motif diversity score; Figures 3B–3E) compared to the conventional approach. The results obtained from dataset 2 were highly comparable (Figure S3). In dataset 3, the results were also very similar to those of datasets 1 and 2: Freefly displayed a median speed boost of ∼68.2-fold (range: 56.2–72.9; Figure 3F) and high consistency with the conventional approach (Figures 3G–3J). Together, the results demonstrated the speed advantage as well as the high accuracy of Freefly.

Performance evaluations of Freefly

(A–E) and (F–J) are of datasets 1 and 3, respectively: (A and F) read number (bars, values shown in millions on the right axis), running time of the conventional approach and Freefly (blue and red dots, respectively, values shown on the left axis), and the speed boost of Freefly for each sample; (B and G) typical size profiles reported by the conventional approach and Freefly from the same sample; and correlation of (C and H) proportion of short fragments (i.e., ≤150 bp), (D and I) CCCA end motif usage, and (E and J) motif diversity scores between the conventional approach and Freefly. For (B), as the data were generated in paired-end 100 bp mode, Freefly only reported the size distribution of cfDNA molecules below 190 bp.

On the other hand, besides the high consistency between Freefly and the conventional approach, the differences in fragmentomic parameters reported by the conventional approach and Freefly were significantly correlated with the parameters themselves in all three investigated datasets (Figure S4). For example, Freefly and the conventional approach reported comparable results on CCCA end motif usage for samples with low CCCA end motif usage (e.g., patients with cancer), while differences between these two approaches were relatively large for samples with high CCCA end motif usage (e.g., control subjects). Together with the observations in Figures 2D–2F, the results suggested that the analytical biases might affect the performance of cancer diagnosis assays in a sample-dependent manner.

Freefly for cancer diagnosis

To further demonstrate the usability of Freefly, we first employed its results to perform cancer diagnoses in dataset 2. As shown in Figures 4A–4C, significant differences in cfDNA fragmentomics between controls and patients with cancer were observed. Moreover, the proportion of short cfDNA molecules could readily differentiate patients with HCC from controls with a performance comparable to that of the conventional approach (Figure 4D); the CCCA end motif usage and motif diversity score could efficiently differentiate patients with lung cancer from controls (Figure S5). In addition, we further collected and investigated dataset 4, which contains ∼600 colorectal cancer samples and non-cancerous control subjects.²¹ As expected, the size distribution and motif patterns calculated by Freefly showed significant differences between cancer samples and controls in this large cohort (Figures 4E and S6), and the performance of utilizing these fragmentomic features for cancer diagnosis was also comparable to the conventional alignment-based approach, with slightly higher area under the curve values (Figures 4F and S6). Notably, at 90% sensitivity, cancer diagnosis using the CCCA end motif usage calculated by Freefly resulted in significantly fewer false positives than the conventional approach (p = 0.016, chi-squared test).
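The false-positive comparison at a fixed sensitivity can be illustrated with a minimal sketch. It assumes that lower CCCA end motif usage is the cancer-like direction (consistent with the observation that patients with cancer tend to show low CCCA usage) and that the cutoff is chosen as the smallest value capturing the requested fraction of cases; the function name, the cutoff rule, and the toy scores are hypothetical and are not taken from the study.

```python
import math
from typing import Sequence


def false_positives_at_sensitivity(
    case_scores: Sequence[float],
    control_scores: Sequence[float],
    sensitivity: float = 0.90,
) -> int:
    """Count controls called positive when the cutoff is set so that at least
    `sensitivity` of the cases are called positive.

    Assumption: a lower score (e.g., lower CCCA usage) is more cancer-like,
    so a sample is called positive when its score is <= cutoff.
    """
    # Number of cases that must be captured; the small epsilon guards against
    # floating-point noise pushing an exact product just above an integer.
    needed = max(1, math.ceil(sensitivity * len(case_scores) - 1e-9))
    cutoff = sorted(case_scores)[needed - 1]
    return sum(score <= cutoff for score in control_scores)


# Toy scores for illustration only (not data from the study).
cases = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
controls = [5.0, 9.5, 12.0, 8.0]
print(false_positives_at_sensitivity(cases, controls))  # 2
```

Applying the same function to the scores from two pipelines on the same cohort would give two false-positive counts that could then be compared with a suitable test.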

Diagnostic performance of the features calculated by Freefly

(A–D) In dataset 2, (A) proportions of short fragments (i.e., ≤150 bp), (B) CCCA end motif usage, and (C) motif diversity scores among different groups and (D) ROC (receiver operating characteristic) curves differentiating patients with HCC from controls using the proportion of short fragments.

(E and F) In dataset 4, (E) proportions of short fragments (i.e., ≤150 bp) between controls and patients with colorectal cancer and (F) ROC curves differentiating patients with colorectal cancer from controls using the CCCA end motif usage.

For (A)–(C) and (E), p values were calculated using Mann-Whitney rank-sum tests. In (D) and (F), AUC stands for area under the curve, and the corresponding p values were calculated using Z-tests; p values comparing ROC curves (shown in the top middle of the images) were calculated using DeLong tests.

Discussion

In the past decade, many research groups (including ours) and companies have developed various cfDNA fragmentomics-based approaches for cancer diagnosis⁹; however, the performances of these assays varied across datasets, where analytical biases could be one latent factor. In fact, as much as two-thirds of the human genome is composed of repetitive sequences²² that are highly error prone for short-read alignment.²³ Genetic differences (e.g., single-nucleotide polymorphisms, gender, etc.) among human subjects also add difficulties to this task¹⁷; in particular, reads mapped to the Y chromosome are commonly observed in female subjects (e.g., accounting for ∼0.085% in dataset 3), which are definitely erroneous results. In this study, we provided evidence of and characterized the biases in cfDNA fragmentomics related to reference genomes, which could serve as guidance for optimizing current approaches to minimize the adverse effect of analytical biases.

As a proof of concept, we further developed Freefly for unbiased cfDNA fragmentomics profiling. With a reference-free design, Freefly is inherently exempt from the analytical biases in reference-based read alignment. We showed that Freefly reported consistent results compared to the conventional approach and enabled cancer diagnosis (Figures 3 and 4). In addition to human subjects, Freefly may also be helpful in some other complex scenarios. For instance, a handful of mouse strains are commonly used as biological models (e.g., xenografts), while their genomes are not identical; any genetic insertions/deletions in the model mouse compared to the reference mouse genome could affect the accurate measurement of cfDNA fragmentomics, which makes the analysis very complex and error prone.²⁴ Besides its high accuracy, Freefly was also very fast, as demonstrated by the benchmark evaluations (Figure 3): Freefly could reduce the analysis time from hours to minutes, suggesting that cfDNA fragmentomics-based cancer diagnostic results could be obtained almost instantly after sequencing, which would be favorable for large-scale clinical applications.

In conclusion, we reported systematic biases linked to the conventional reference-based approaches and developed Freefly toward the rapid and unbiased assessment of cfDNA fragmentomics.

Limitations of the study

There are certain limitations in reference-free approaches. Previous studies have demonstrated that the spatial information of cfDNA fragmentomics, e.g., variations in size distributions across the genome, could add value in cancer diagnosis¹³,²⁵; it is very difficult to utilize the spatial information in the genome without an alignment procedure. In addition, reference-free approaches cannot filter out non-human sequences, e.g., contamination from microbiota,²⁶ and they also cannot perfectly handle cfDNA fragments that are longer than twice the sequencing cycles, although such molecules might be valuable in cancer diagnosis.²⁷ Considering such drawbacks, it might be more appropriate to integrate the results from reference-free and reference-based approaches toward generalizable and high-performance cancer diagnostic assays. Hence, evaluations using more diversified datasets are needed in future studies to validate the performance and limitations of Freefly and to take advantage of reference-free cfDNA fragmentomics profiles.

STAR★Methods

Key resources table

REAGENT or RESOURCE	SOURCE	IDENTIFIER
Deposited data

CfDNA dataset 1	An et al.¹⁰	GSA-Human: HRA002250
CfDNA dataset 2	Liang et al.¹⁸	CNGBdb-CNSA: CNP0000680
CfDNA dataset 3	Zivanovic et al.¹⁹	NCBI SRA: PRJNA578569
CfDNA dataset 4	Walker et al.²¹	NCBI SRA: PRJNA755688

Software and algorithms

Ktrim	Sun et al.²⁸	https://github.com/hellosunking/Ktrim/
Bowtie2	Langmead et al.²⁹	https://github.com/BenLangmead/bowtie2
FLASH	Magoc et al.³⁰	https://ccb.jhu.edu/software/FLASH/
SAMtools	Li et al.³¹	https://github.com/samtools/samtools
Freefly	This study	https://github.com/hellosunking/Freefly https://zenodo.org/doi/10.5281/zenodo.11206499

Resource availability

Lead contact

Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Kun Sun (sunkun@szbl.ac.cn).

Materials availability

This study did not generate new unique reagents.

Data and code availability

• This paper analyzes existing, publicly available data. The accession numbers for the datasets are listed in the key resources table.
• The implementation of the Freefly method (as well as the conventional approach) is publicly available at https://github.com/hellosunking/Freefly, free for academic and personal usage.
• Any additional information required to reanalyze the data reported in this paper is available from the corresponding author upon request.

Method details

Ethics approval

This study was approved by the Ethics Committee of Shenzhen Bay Laboratory.

The conventional reference-based approach

The conventional approach was implemented as in previous studies (Figure 1).⁶,¹⁰,¹⁴ Briefly, raw sequencing reads were first preprocessed to remove adapters and low-quality cycles using Ktrim software²⁸ and then mapped to a reference human genome (either NCBI GRCh38 or Han1) using Bowtie2 software²⁹; PCR duplicates (i.e., reads with identical ends) were identified and removed using in-house programs,¹⁰ and the remaining reads with mapping scores of 30 or higher (corresponding to a 0.1% error rate) were selected for fragmentomics profiling: the sizes of cfDNA molecules were determined as the distance between the genomic coordinates of the two outermost ends, and the sequences of k nucleotides from the leftmost ends in the reference genome were extracted to calculate k-mer motif frequencies and diversity score.¹⁴ In dataset 3, samples sequenced in paired-end 75 bp mode (N = 3) were omitted from the analyses. In dataset 4, only the samples from Caucasians with mappabilities >90% were kept for downstream analyses (N = 274 and 343 for non-cancer controls and colorectal cancer samples, respectively).
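Two quantities in this paragraph can be illustrated with a short sketch: the nominal error probability implied by a Phred-scaled mapping score, and a fragment size derived from the outermost aligned coordinates. The function names and the 1-based, inclusive coordinate convention are assumptions made for illustration; this is not the in-house implementation.

```python
def mapq_to_error_probability(mapq: int) -> float:
    """A Phred-scaled mapping quality Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-mapq / 10)


def fragment_size(leftmost: int, rightmost: int) -> int:
    """Size of a fragment whose outermost ends sit at the given coordinates.

    Assumption: coordinates are 1-based and inclusive, so the size is the
    coordinate distance plus one.
    """
    if rightmost < leftmost:
        raise ValueError("rightmost must not be smaller than leftmost")
    return rightmost - leftmost + 1


print(mapq_to_error_probability(30))   # nominal error probability at a mapping score of 30
print(fragment_size(10_001, 10_167))   # 167
```

Under this convention, a mapping score of 30 corresponds to a nominal error probability of about 0.001, which matches the 0.1% error rate quoted above.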

The Freefly algorithm

CfDNA molecules are relatively short (mostly below 200 bp), and the majority of them are shorter than twice the sequencing cycles of current mainstream sequencers (e.g., 150 bp). As a result, most cfDNA fragments would be read through during paired-end sequencing, which allowed us to stitch them into full-length fragments and precisely measure their size profiles. In addition, to avoid the blunt end and single-stranded overhang issues of cfDNA termini,³² we only extracted the sequences from read 1 to profile the end motif patterns. We named this algorithm Freefly (Figure 1). In terms of implementation, the preprocessing procedure was the same as in the conventional approach, and the preprocessed reads were subjected to PCR duplicate removal using an in-house program³³; the reads were then stitched using the FLASH program.³⁰ During stitching, a minimum overlap of 10 bp between reads 1 and 2 was required, which meant that fragments shorter than or equal to 2 × read cycles - 10 bp (e.g., 190 bp for paired-end 100 bp reads, 290 bp for paired-end 150 bp reads) could be stitched. The sizes of stitchable reads were determined as the length of the stitched fragment, and those that could not be stitched were assigned a label of “long”; sequences from the first k cycles in read 1 were extracted to calculate frequencies of k-mer motifs and diversity score. As a result, Freefly reported both cfDNA size profiles and end motif patterns without reference genomes and alignment.
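The profiling step described above can be sketched as follows, assuming the stitching itself has already been done (for example by FLASH) and each read pair is presented as the read 1 sequence plus either the stitched sequence or `None`. The function names, the input layout, and the choice to skip motifs containing ambiguous bases are hypothetical; this is an illustration, not the Freefly source code.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def max_stitchable_length(read_cycles: int, min_overlap: int = 10) -> int:
    """Longest fragment that still leaves `min_overlap` bases shared by reads 1 and 2."""
    return 2 * read_cycles - min_overlap


def profile_fragments(
    pairs: Iterable[Tuple[str, Optional[str]]], k: int = 4
) -> Tuple[Counter, Counter]:
    """Tally fragment sizes and 5' end motifs without any alignment.

    Each item is (read1_sequence, stitched_sequence_or_None); None marks a
    pair that could not be stitched, which is counted under the label "long".
    """
    sizes: Counter = Counter()
    motifs: Counter = Counter()
    for read1, stitched in pairs:
        sizes[len(stitched) if stitched is not None else "long"] += 1
        motif = read1[:k].upper()
        # Assumption: skip reads that are too short or contain ambiguous bases.
        if len(motif) == k and set(motif) <= set("ACGT"):
            motifs[motif] += 1
    return sizes, motifs


# Toy read pairs for illustration only.
pairs = [
    ("CCCAGT", "CCCAGTTTGA"),  # stitched, 10 bp
    ("CCCATG", None),          # not stitchable, labeled "long"
    ("NGCATT", "NGCATTAC"),    # stitched, but motif skipped because of N
]
sizes, motifs = profile_fragments(pairs)
print(max_stitchable_length(100), max_stitchable_length(150))  # 190 290
print(sizes["long"], motifs["CCCA"])  # 1 2
```

Taking the motif from read 1 even when stitching fails mirrors the description that end motifs come from the first k cycles of read 1, independent of fragment length.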

Benchmark evaluations

Benchmark evaluations were conducted on a computing machine running a standard 64-bit Linux operating system (CentOS v7.5.1804, kernel v3.10.0) equipped with an Intel Xeon Gold 6242 CPU and 192 GB memory. Other dependent software (and versions) was as follows: Ktrim v1.5.0, FLASH v1.2.11, Bowtie2 v2.3.5.1, and SAMtools³¹ v1.12. During benchmark evaluations, 8 threads were used for both Freefly and the conventional approach; each sample was analyzed 5 times using both Freefly and the conventional approach, and the average running time was reported. For both the conventional approach and Freefly, cfDNA molecules shorter than or equal to 150 bp (i.e., ≤150 bp) were considered “short fragments”; for motif analysis, we investigated the 4-mer motif with a focus on CCCA, as it was the most widely studied motif with proven value in cancer diagnosis¹⁴,¹⁵,¹⁶; motif diversity score was measured using the same definition as in our previous study.¹⁴
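The three summary features named above can be computed from size and motif tallies as in the sketch below. The motif diversity score is written here as the Shannon entropy of the motif frequencies normalized by log(4^k), which is an assumption about the cited definition rather than something stated in this section; counting unstitched "long" fragments in the denominator of the short-fragment proportion is likewise an assumption, and all function names are hypothetical.

```python
import math
from itertools import product
from typing import Mapping, Union

SizeKey = Union[int, str]


def short_fragment_proportion(sizes: Mapping[SizeKey, int], cutoff: int = 150) -> float:
    """Fraction of fragments with size <= cutoff.

    Assumption: non-integer labels such as "long" count toward the total
    but never as short.
    """
    total = sum(sizes.values())
    short = sum(n for size, n in sizes.items() if isinstance(size, int) and size <= cutoff)
    return short / total


def motif_usage(motifs: Mapping[str, int], motif: str = "CCCA") -> float:
    """Frequency of one end motif among all counted end motifs."""
    return motifs.get(motif, 0) / sum(motifs.values())


def motif_diversity_score(motifs: Mapping[str, int], k: int = 4) -> float:
    """Assumed definition: Shannon entropy of motif frequencies divided by log(4**k)."""
    total = sum(motifs.values())
    entropy = -sum((n / total) * math.log(n / total) for n in motifs.values() if n > 0)
    return entropy / math.log(4 ** k)


# Toy tallies for illustration only.
sizes = {120: 2, 167: 5, "long": 3}
uniform = {"".join(p): 1 for p in product("ACGT", repeat=4)}  # all 256 4-mers equally used
print(short_fragment_proportion(sizes))  # 0.2
print(motif_usage({"CCCA": 1, "ACGT": 3}))  # 0.25
print(round(motif_diversity_score(uniform), 6))  # 1.0
```

With this normalization the score ranges from 0 (a single motif dominates completely) to 1 (all 256 4-mer motifs are equally frequent).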

Quantification and statistical analysis

Paired t-tests were used for comparing the results generated using GRCh38 versus Han1 as reference genomes. Kruskal-Wallis test was used for comparing differences in proportion of short fragments among different patient groups. Mann-Whitney U tests were used for comparing fragmentomic features reported by Freefly among different patient groups. Z-tests were used in analyzing ROC curves for differentiating cancer patients from controls. DeLong tests were used for comparing ROC curves.

Acknowledgments

This work was supported by the National Key R&D Program of China (2022YFA0912700), the Guangdong Basic and Applied Basic Research Foundation (2023B1515120073), the National Natural Science Foundation of China (82101763), and the Major Program of Shenzhen Bay Laboratory. We would like to thank Ms. Qi Wang for technical assistance and the Shenzhen Bay Laboratory Supercomputing Center for computational support.

Author contributions

K.S. conceived the study and implemented the software; all authors analyzed data; and X.L., M.Y., and K.S. wrote the manuscript.

Declaration of interests

The authors declare no competing interests.

Published: June 11, 2024

Footnotes

Supplemental information can be found online at https://doi.org/10.1016/j.crmeth.2024.100793.

Supplemental information

Document S1. Figures S1–S6

mmc1.pdf (2.2 MB, pdf)

Document S2. Article plus supplemental information

mmc2.pdf (6.1 MB, pdf)

References

1. van der Pol Y., Mouliere F. Toward the Early Detection of Cancer by Decoding the Epigenetic and Environmental Fingerprints of Cell-Free DNA. Cancer Cell. 2019;36:350–368. doi: 10.1016/j.ccell.2019.09.003.
2. Sun K., Jiang P., Chan K.C.A., Wong J., Cheng Y.K.Y., Liang R.H.S., Chan W.K., Ma E.S.K., Chan S.L., Cheng S.H., et al. Plasma DNA tissue mapping by genome-wide methylation sequencing for noninvasive prenatal, cancer, and transplantation assessments. Proc. Natl. Acad. Sci. USA. 2015;112:E5503–E5512. doi: 10.1073/pnas.1508736112.
3. Snyder M.W., Kircher M., Hill A.J., Daza R.M., Shendure J. Cell-free DNA comprises an in vivo nucleosome footprint that informs its tissues-of-origin. Cell. 2016;164:57–68. doi: 10.1016/j.cell.2015.11.050.
4. Underhill H.R., Kitzman J.O., Hellwig S., Welker N.C., Daza R., Baker D.N., Gligorich K.M., Rostomily R.C., Bronner M.P., Shendure J. Fragment length of circulating tumor DNA. PLoS Genet. 2016;12 doi: 10.1371/journal.pgen.1006162.
5. Ulz P., Thallinger G.G., Auer M., Graf R., Kashofer K., Jahn S.W., Abete L., Pristauz G., Petru E., Geigl J.B., et al. Inferring expressed genes by whole-genome sequencing of plasma DNA. Nat. Genet. 2016;48:1273–1278. doi: 10.1038/ng.3648.
6. Jiang P., Sun K., Tong Y.K., Cheng S.H., Cheng T.H.T., Heung M.M.S., Wong J., Wong V.W.S., Chan H.L.Y., Chan K.C.A., et al. Preferred end coordinates and somatic variants as signatures of circulating tumor DNA associated with hepatocellular carcinoma. Proc. Natl. Acad. Sci. USA. 2018;115:E10925–E10933. doi: 10.1073/pnas.1814616115.
7. Sun K., Jiang P., Wong A.I.C., Cheng Y.K.Y., Cheng S.H., Zhang H., Chan K.C.A., Leung T.Y., Chiu R.W.K., Lo Y.M.D. Size-tagged preferred ends in maternal plasma DNA shed light on the production mechanism and show utility in noninvasive prenatal testing. Proc. Natl. Acad. Sci. USA. 2018;115:E5106–E5114. doi: 10.1073/pnas.1804134115.
8. Sun K., Jiang P., Cheng S.H., Cheng T.H.T., Wong J., Wong V.W.S., Ng S.S.M., Ma B.B.Y., Leung T.Y., Chan S.L., et al. Orientation-aware plasma cell-free DNA fragmentation analysis in open chromatin regions informs tissue of origin. Genome Res. 2019;29:418–427. doi: 10.1101/gr.242719.118.
9. Lo Y.M.D., Han D.S.C., Jiang P., Chiu R.W.K. Epigenetics, fragmentomics, and topology of cell-free DNA in liquid biopsies. Science. 2021;372 doi: 10.1126/science.aaw3616.
10. An Y., Zhao X., Zhang Z., Xia Z., Yang M., Ma L., Zhao Y., Xu G., Du S., Wu X., et al. DNA methylation analysis explores the molecular basis of plasma cell-free DNA fragmentation. Nat. Commun. 2023;14:287. doi: 10.1038/s41467-023-35959-6.
11. Gai W., Sun K. Epigenetic Biomarkers in Cell-Free DNA and Applications in Liquid Biopsy. Genes. 2019;10:32. doi: 10.3390/genes10010032.
12. Jin X., Wang Y., Xu J., Li Y., Cheng F., Luo Y., Zhou H., Lin S., Xiao F., Zhang L., et al. Plasma cell-free DNA promise monitoring and tissue injury assessment of COVID-19. Mol. Genet. Genomics. 2023;298:823–836. doi: 10.1007/s00438-023-02014-4.
13. Cristiano S., Leal A., Phallen J., Fiksel J., Adleff V., Bruhm D.C., Jensen S.Ø., Medina J.E., Hruban C., White J.R., et al. Genome-wide cell-free DNA fragmentation in patients with cancer. Nature. 2019;570:385–389. doi: 10.1038/s41586-019-1272-6.
14. Jiang P., Sun K., Peng W., Cheng S.H., Ni M., Yeung P.C., Heung M.M.S., Xie T., Shang H., Zhou Z., et al. Plasma DNA End-Motif Profiling as a Fragmentomic Marker in Cancer, Pregnancy, and Transplantation. Cancer Discov. 2020;10:664–673. doi: 10.1158/2159-8290.CD-19-0622.
15. Chen L., Abou-Alfa G.K., Zheng B., Liu J.F., Bai J., Du L.T., Qian Y.S., Fan R., Liu X.L., Wu L., et al. Genome-scale profiling of circulating cell-free DNA signatures for early detection of hepatocellular carcinoma in cirrhotic patients. Cell Res. 2021;31:589–592. doi: 10.1038/s41422-020-00457-7.
16. Guo W., Chen X., Liu R., Liang N., Ma Q., Bao H., Xu X., Wu X., Yang S., Shao Y., et al. Sensitive detection of stage I lung adenocarcinoma using plasma cell-free DNA breakpoint motif profiling. EBioMedicine. 2022;81 doi: 10.1016/j.ebiom.2022.104131.
17. Martschenko D.O., Wand H., Young J.L., Wojcik G.L. Including multiracial individuals is crucial for race, ethnicity and ancestry frameworks in genetics and genomics. Nat. Genet. 2023;55:895–900. doi: 10.1038/s41588-023-01394-y.
18. Liang H., Li F., Qiao S., Zhou X., Xie G., Zhao X., Zhang Y., Wu K. Whole-genome sequencing of cell-free DNA yields genome-wide read distribution patterns to track tissue of origin in cancer patients. Clin. Transl. Med. 2020;10 doi: 10.1002/ctm2.177.
19. Zivanovic Bujak A., Weng C.F., Silva M.J., Yeung M., Lo L., Ftouni S., Litchfield C., Ko Y.A., Kuykhoven K., Van Geelen C., et al. Circulating tumour DNA in metastatic breast cancer to guide clinical trial enrolment and precision oncology: A cohort study. PLoS Med. 2020;17 doi: 10.1371/journal.pmed.1003363.
20. Chao K.H., Zimin A.V., Pertea M., Salzberg S.L. The first gapless, reference-quality, fully annotated genome from a Southern Han Chinese individual. G3 (Bethesda) 2023;13 doi: 10.1093/g3journal/jkac321.
21. Walker N.J., Rashid M., Yu S., Bignell H., Lumby C.K., Livi C.M., Howell K., Morley D.J., Morganella S., Barrell D., et al. Hydroxymethylation profile of cell-free DNA is a biomarker for early colorectal cancer. Sci. Rep. 2022;12 doi: 10.1038/s41598-022-20975-1.
22. de Koning A.P.J., Gu W., Castoe T.A., Batzer M.A., Pollock D.D. Repetitive elements may comprise over two-thirds of the human genome. PLoS Genet. 2011;7 doi: 10.1371/journal.pgen.1002384.
23. Treangen T.J., Salzberg S.L. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nat. Rev. Genet. 2011;13:36–46. doi: 10.1038/nrg3117.
24. Serpas L., Chan R.W.Y., Jiang P., Ni M., Sun K., Rashidfarrokhi A., Soni C., Sisirak V., Lee W.S., Cheng S.H., et al. Dnase1l3 deletion causes aberrations in length and end-motif frequencies in plasma DNA. Proc. Natl. Acad. Sci. USA. 2019;116:641–649. doi: 10.1073/pnas.1815031116.
25. Foda Z.H., Annapragada A.V., Boyapati K., Bruhm D.C., Vulpescu N.A., Medina J.E., Mathios D., Cristiano S., Niknafs N., Luu H.T., et al. Detecting Liver Cancer Using Cell-Free DNA Fragmentomes. Cancer Discov. 2023;13:616–631. doi: 10.1158/2159-8290.CD-22-0659.
26. Burnham P., Gomez-Lopez N., Heyang M., Cheng A.P., Lenz J.S., Dadhania D.M., Lee J.R., Suthanthiran M., Romero R., De Vlaminck I. Separating the signal from the noise in metagenomic cell-free DNA sequencing. Microbiome. 2020;8:18. doi: 10.1186/s40168-020-0793-4.
27. Choy L.Y.L., Peng W., Jiang P., Cheng S.H., Yu S.C.Y., Shang H., Olivia Tse O.Y., Wong J., Wong V.W.S., Wong G.L.H., et al. Single-Molecule Sequencing Enables Long Cell-Free DNA Detection and Direct Methylation Analysis for Cancer Patients. Clin. Chem. 2022;68:1151–1163. doi: 10.1093/clinchem/hvac086.
28. Sun K. Ktrim: an extra-fast and accurate adapter- and quality-trimmer for sequencing data. Bioinformatics. 2020;36:3561–3562. doi: 10.1093/bioinformatics/btaa171.
29. Langmead B., Salzberg S.L. Fast gapped-read alignment with Bowtie 2. Nat Meth. 2012;9:357–359. doi: 10.1038/nmeth.1923.
30. Magoc T., Salzberg S.L. FLASH: fast length adjustment of short reads to improve genome assemblies. Bioinformatics. 2011;27:2957–2963. doi: 10.1093/bioinformatics/btr507.
31. Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R., 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25:2078–2079. doi: 10.1093/bioinformatics/btp352.
32. Harkins K.M., Schaefer N.K., Troll C.J., Rao V., Kapp J., Naughton C., Shapiro B., Green R.E. A novel NGS library preparation method to characterize native termini of fragmented DNA. Nucleic Acids Res. 2020;48:e47. doi: 10.1093/nar/gkaa128.
33. Peng Q., Huang Z., Sun K., Liu Y., Yoon C.W., Harrison R.E.S., Schmitt D.L., Zhu L., Wu Y., Tasan I., et al. Engineering inducible biomolecular assemblies for genome imaging and manipulation in living cells. Nat. Commun. 2022;13:7933. doi: 10.1038/s41467-022-35504-x.

Associated Data


Supplementary Materials

Document S1. Figures S1–S6 (mmc1.pdf, 2.2 MB)

Document S2. Article plus supplemental information (mmc2.pdf, 6.1 MB)

Data Availability Statement

• This paper analyzes existing, publicly available data. The accession numbers for the datasets are listed in the key resources table.
• An implementation of the Freefly method (as well as the conventional approach) is publicly available at https://github.com/hellosunking/Freefly and is free for academic and personal use.
• Any additional information required to reanalyze the data reported in this paper is available from the corresponding author upon request.

