Abstract
Analysis of DNA methylation in cell-free DNA reveals clinically relevant biomarkers but requires specialized protocols such as whole-genome bisulfite sequencing. Meanwhile, millions of cell-free DNA samples are being profiled by whole-genome sequencing. Here, we develop FinaleMe, a non-homogeneous Hidden Markov Model, to predict DNA methylation of cell-free DNA and, therefore, tissues-of-origin, directly from plasma whole-genome sequencing. We validate the performance with 80 pairs of deep and shallow-coverage whole-genome sequencing and whole-genome bisulfite sequencing data.
Subject terms: Genome informatics, Next-generation sequencing, Software, Epigenomics, Cancer genomics
DNA methylation from cell-free DNA (cfDNA) can be profiled using whole-genome bisulfite sequencing (WGBS). Here, the authors develop a computational method, FinaleMe, that predicts DNA methylation and tissues-of-origin in cfDNA and validate its performance using paired deep and shallow-coverage whole-genome sequencing (WGS) and WGBS data.
Introduction
DNA methylation plays an instrumental role in gene regulation during disease progression and embryonic development1,2. Genome-wide DNA methylation level in cell-free DNA (cfDNA) has been extensively studied for disease diagnosis and prognosis3–7. The current gold standard to measure DNA methylation from cfDNA molecules is bisulfite sequencing8. However, sodium bisulfite treatment causes non-uniform sequence-dependent degradation of most DNA fragments9,10. The substantial loss of input DNA during the bisulfite treatment limits the sensitivity of diagnostic tests and analyses11. Recent advances in enzymatic conversion and long-read sequencing approaches have partly mitigated these issues but require specialized protocols12–16.
Unlike genomic DNA (gDNA), cfDNA is not randomly fragmented and its fragmentation pattern is highly associated with the local epigenetic background17,18. Several recent studies have identified significantly different DNA fragmentation patterns between methylated and unmethylated cfDNA molecules7,19,20. These findings suggest the possibility of computationally inferring DNA methylation levels from cfDNA fragmentation patterns. One recent study provided a proof-of-concept solution to predict the binary status of DNA methylation in high-coverage whole-genome bisulfite sequencing (WGBS) through a deep-learning model19. However, the ability to predict methylation status from cfDNA whole-genome sequencing (WGS) remains unexplored. The 2020 American College of Obstetricians and Gynecologists (ACOG) guidelines recommend non-invasive prenatal testing (NIPT) for all pregnancies regardless of risk, which will eventually result in millions of shallow-coverage (~0.1X-1X) cfDNA WGS every year in the US. In addition, hundreds of thousands of cfDNA WGS samples have already been sequenced for cancer early detection and other purposes worldwide by academic communities and commercial entities21.
Here, to leverage cfDNA WGS datasets and advance understanding of gene regulation and human health22, we develop a computational method, named FinaleMe (FragmentatIoN AnaLysis of cEll-free DNA Methylation), to predict the DNA methylation status of each CpG in each cfDNA fragment and obtain the continuous DNA methylation level at CpG sites; the predictions are most accurate in CpG-rich regions. We further predict the associated tissues-of-origin status from the inferred methylation patterns. We validate the predictions of both methylation level and tissues-of-origin status using paired WGS and WGBS of plasma cfDNA from the same tube of blood across different physiological conditions at deep (~16–39X) and shallow (~0.1X) WGS coverage.
Results
Since DNA methylation has been tightly correlated with nucleosome occupancy23,24, we hypothesized that if the boundaries of cfDNA fragments are biased by their association with nucleosomes, then the fragmentation pattern observed in each cfDNA molecule should indicate its associated DNA methylation pattern and thus its tissue-of-origin. To evaluate this hypothesis, we first studied the correlation between fragment size and mean methylation level of DNA fragments from publicly available WGBS of cfDNA and gDNA of buffy coat samples from two healthy individuals7 (Fig. 1). Replicate samples of cfDNA showed waved methylation patterns at mono-nucleosomal lengths that were not present in the gDNA samples. This observation supported the hypothesis that the fragmentation pattern of cfDNA can provide information related to the DNA methylation level.
Next, we built a non-homogeneous Hidden Markov Model (HMM), named FinaleMe, to predict the methylation status in cfDNA (details in Methods and Supplementary Methods, Fig. 2). Since CpGs are not evenly distributed in the human genome, we incorporated the distance between CpG sites into the model and utilized the following three features: fragment length, normalized coverage, and the distance of each CpG to the center of the DNA fragment (Fig. 1b). We first evaluated the model using high-coverage WGBS of cfDNA (from non-pregnant healthy individuals), masking the methylation status, and then benchmarked the model performance using the ground truth DNA methylation states at each CpG in each DNA fragment. After sampling an equal number of the methylated and unmethylated CpGs, we observed high performance in predicting the methylation status at each single CpG from each DNA fragment based on the area under the receiver operating characteristic curve (auROC) within CpG-rich regions (auROC=0.91, for CpGs at fragments with ≥5 CpGs, Fig. 1c).
To further benchmark the model performance in cfDNA WGS, we generated our own matched high-coverage WGS (~16–39X) and WGBS (~10–15X) data from plasma cfDNA samples from the same tube of blood in healthy individuals and a prostate cancer patient (Fig. 3a, Supplementary Data 1–3). Without using cfDNA WGBS data, we trained the HMM model and predicted the methylation level from the same cfDNA WGS dataset. By comparing the results with the methylation level at CpG sites in the reference genome from matched WGBS, we achieved a high correlation at single CpGs and 1 kb windows in CpG-rich regions (CpG island and CpG island shore regions, Fig. 3b, c). At differentially methylated regions (DMRs) detected in the cfDNA WGBS between cancer and healthy individuals at CpG-rich regions, we also observed consistent methylation changes in the predicted methylation levels from matched cfDNA WGS (Fig. 3d). To check for potential overfitting of the model, we further trained and decoded the model on gDNA WGS from cancer and normal blood cells, in which the fragments are sonicated and do not correlate with the epigenetic status. The predicted results for gDNA WGS did not show any methylation differences between cancer and normal cells in the DMRs detected in the matched gDNA WGBS datasets (Supplementary Fig. 1a). This result suggested that the differential methylation we predicted in cfDNA WGS was driven not by the methylation prior we used but by the fragmentation features. However, we noticed that, in CpG-poor regions, FinaleMe did not work as well as in CpG-rich regions (Supplementary Fig. 1b). We further assessed the methylation level at important regulatory elements, such as CpG island (CGI) promoters (Fig. 3e), 5′ exon boundaries, and CTCF motifs (Supplementary Fig. 2). These results showed a high correlation between the ground truth (WGBS) and the prediction (WGS) in cfDNA from both healthy individuals and the cancer patient (Fig. 3e, Supplementary Figs. 2, 3), but not in the gDNA dataset (Supplementary Fig. 4).
Since DNA methylation in CGI and CGI shore regions is often cell-type-specific, we further estimated the tissue-of-origin in cfDNA by using DNA methylation levels that were measured or predicted using WGBS and WGS, respectively. We found similar tissue-of-origin profiles between predicted and measured methylation levels for each of the individuals in both cancer and healthy conditions (Fig. 3f), which was also largely consistent with previous tissues-of-origin studies based on cfDNA WGBS3,6.
Deep coverage WGBS and WGS remain costly for routine clinical application. Many publicly available cfDNA WGS datasets are sequenced with shallow coverage (0.1–1X). We sought to determine whether we could predict DNA methylation levels using ultra-low-pass whole-genome sequencing (~0.1X, ULP-WGS). We generated matched ULP-WGS and ultra-low-pass WGBS (~0.1X, ULP-WGBS) of cfDNA from 77 individuals, including healthy donors, breast, and prostate cancer patients (Supplementary Data 1–3). We examined the methylation level globally and at important regulatory elements, such as CGI promoters, and observed similar average methylation profiles in predicted and measured methylation levels from ULP-WGS and WGBS, respectively (Fig. 4a, b). We also observed the differential methylation level in ULP-WGS at differentially methylated regions detected in ULP-WGBS (Supplementary Fig. 5). Next, we assessed whether methylation levels from ultra-low-pass sequencing could be utilized for the estimation of tissues-of-origin. We downsampled the deep coverage sequencing results and found largely consistent tissue-of-origin estimates with ultra-low-pass sequencing (Supplementary Fig. 6). Finally, we estimated the tissue-of-origin in both ULP-WGS and ULP-WGBS. We found consistent results between the two assays. The fractions of prostate or breast-originated cell types are low in healthy individuals and showed a high correlation with tumor fraction as estimated by copy number variations (ichorCNA) across all samples in both assays (Fig. 4c). These results suggested that the application of FinaleMe to ULP-WGS is consistent with ULP-WGBS for both DNA methylation and tissues-of-origin predictions.
Discussion
Our study demonstrates the ability to infer cfDNA methylation level and tissues-of-origin status directly from deep and shallow-coverage cfDNA WGS. This overcomes a major hurdle associated with bisulfite conversion of limited amounts of cfDNA and, more importantly, enables the usage of a large number of existing, publicly available cfDNA genomic datasets for epigenetic analysis. Our predictions are most accurate in CpG-rich regions of the genome but not in CpG-poor regions. Further work is required to improve the predictions in CpG-poor regions for the detection of other disease-related methylation features, such as the partially methylated domains in cancers. Moreover, the Bayesian prior we utilized from genomic DNA methylome may cause overfitting problems and the false positive call of DMRs in cancer WGS. Previous studies have suggested that analysis of tissue-of-origin is possible based on analysis of nucleosome spacing in WGS of cfDNA17. However, only the relative rank of most related cell types is estimated in deep WGS. The tissues-of-origin estimation from inferred DNA methylation here can provide the estimation of absolute fractions in each cell type and utilize the rich reference methylome resources. Although we do not expect to replace bisulfite sequencing for direct measurement of methylation levels, we provide a generalizable method that could enable the methylation analysis of cfDNA samples with limited material or samples that would otherwise only undergo genomic profiling.
Methods
Ethics approval and consent to participate
This research study was approved by the Broad Institute Institutional Review Board in accordance with the Declaration of Helsinki. De-identified plasma sample collection was approved by the Dana-Farber Cancer Institute and Broad Institute Institutional Review Boards. All participants provided written informed consent to participate.
Clinical samples
Cancer patient blood samples were obtained from appropriately consented patients as described in Adalsteinsson et al.25. Healthy donor blood samples were obtained from appropriately consented individuals from Research Blood Components (http://researchbloodcomponents.com/services.html). Samples were collected and fractionated as described in Adalsteinsson et al.25.
Whole-genome bisulfite sequencing of cfDNA
Library construction was performed on 25 ng of cfDNA using the Hyper Prep Kit (Kapa Biosystems) with NEXTFlex Bisulfite-Seq Barcodes (Bioo Scientific) and methylated adapters (IDT) along with HiFi Uracil+ polymerase (Kapa Biosystems) for library amplification. NEXTFlex Bisulfite-Seq Barcodes were used at a final concentration of 7.5 μM and the EZ-96 DNA Methylation-Lightning MagPrep kit (Zymo Research) was used for bisulfite conversion of the adapter-ligated cfDNA prior to library amplification. Libraries were sequenced with paired-end 100 bp reads on the HiSeq2500 platform (Illumina) with a 20% PhiX spike-in.
Whole-genome sequencing of cfDNA
Library construction was performed on 5–20 ng of cfDNA using the Hyper Prep Kit (Kapa Biosystems) and custom sequencing adapters (IDT) on a Hamilton STAR-line liquid handling system. Libraries were sequenced with paired-end 100 bp reads on the HiSeq2500 platform (Illumina).
Model development and training
Data preprocessing
For WGS data, reads were aligned to the human genome (GRCh37) using BWA-MEM (v0.7.15)26 with default parameters. Each fragment containing CpGs in the autosomes of the reference genome was used for the analysis. Fragments longer than 500 bp or shorter than 30 bp were discarded. Regions with coverage of more than 250× or ENCODE blacklist regions (merged wgEncodeDukeMapabilityRegionsExcludable and wgEncodeDacMapabilityConsensusExcludable) were also discarded. Only high-quality reads were considered in the following analysis (high quality: uniquely mapped, no PCR duplicates, both ends mapped with mapping qualities of more than 30, and properly paired). To calculate the methylation status for each CpG in each fragment, only bases with a base quality of more than 5 were used.
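The fragment-level filters above can be summarized in a short sketch. The `Fragment` record and its field names are hypothetical and chosen only for illustration; the thresholds are the ones stated in the text. This is not the authors' Java implementation, and it assumes that alignment flags and local coverage have already been extracted from the bam file.

```python
from dataclasses import dataclass

@dataclass
class Fragment:
    """Hypothetical per-fragment summary (field names are illustrative)."""
    length: int             # fragment length in bp
    mapq_r1: int            # mapping quality of read 1
    mapq_r2: int            # mapping quality of read 2
    properly_paired: bool
    is_duplicate: bool
    uniquely_mapped: bool
    region_coverage: float  # local coverage at the fragment's position
    in_blacklist: bool      # overlaps an ENCODE blacklist region

def passes_filters(f: Fragment) -> bool:
    """Apply the length, coverage, blacklist, and read-quality thresholds from the text."""
    if f.length > 500 or f.length < 30:
        return False
    if f.region_coverage > 250 or f.in_blacklist:
        return False
    if f.is_duplicate or not f.uniquely_mapped or not f.properly_paired:
        return False
    # Both ends must have mapping quality strictly greater than 30.
    return f.mapq_r1 > 30 and f.mapq_r2 > 30
```

For example, a 167 bp properly paired fragment with mapping qualities of 60 on both ends passes, whereas the same fragment with one end at mapping quality 30 does not.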
For cfDNA WGBS data, a recent study demonstrated that jagged ends at the ends of cfDNA fragments affect the accuracy of DNA methylation estimates27. We first generated the M-bias plot by using Bismark28 to map the reads without trimming (see Supplementary Fig. 7). To avoid the artifact potentially introduced by the jagged ends for Fig. 1a, we trimmed 40 bp from the 5′ end and 10 bp from the 3′ end of the R2 reads. The 3′ end of the R1 reads does not appear to be affected by the jagged-end problem. However, in CpG islands (often open chromatin regions), cfDNA fragments are usually very small. To avoid potential bias at these small fragments, we also trimmed 40 bp from the 3′ end of the R1 reads, and the results were still largely the same. After trimming, reads were aligned to the human genome (GRCh37) using Bismark (v0.22.3) with bowtie2 (v2.3.5)29. The methylation status of CpGs was counted from the first converted cytosine in each of the fragments as described in Bis-SNP30. Fragment coverage at each CpG site was first normalized by dividing by the total number of high-quality reads in the bam file. Further, the three features (fragment length, normalized coverage, and distance to the center of the fragment) were transformed into Z-scores using the mean and standard deviation of the features within the same bam file and used as the input for the HMM model (Fig. 2). All details are implemented in CpgMultiMetricsStats.java (with the parameter -stringentPaired for only high-quality fragments and the parameter -wgsMode for WGS data). The methylation level from WGBS was called by Bis-SNP (v0.90)30.
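The feature construction and Z-score transformation can be illustrated as follows. This is a minimal sketch rather than the code in CpgMultiMetricsStats.java: the function names are hypothetical, the distance to the fragment center is assumed to be an absolute value, and the population standard deviation is assumed for the normalization.

```python
from statistics import mean, pstdev

def cpg_features(frag_start, frag_end, cpg_pos, coverage, total_reads):
    """Return (fragment length, normalized coverage, distance of the CpG to the fragment center)."""
    length = frag_end - frag_start
    center = frag_start + length / 2
    # Assumption: unsigned distance; the text does not state whether it is signed.
    return (length, coverage / total_reads, abs(cpg_pos - center))

def zscore_columns(rows):
    """Z-score each feature column using its mean and SD over all rows from one bam file."""
    cols = list(zip(*rows))
    col_stats = [(mean(c), pstdev(c)) for c in cols]
    return [
        tuple((v - m) / s if s > 0 else 0.0 for v, (m, s) in zip(row, col_stats))
        for row in rows
    ]
```

Each row corresponds to one CpG in one fragment, so a fragment with several CpGs contributes several rows that share the same fragment length.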
Non-homogeneous Hidden Markov Model
The initiation matrix was summarized based on the methylation states of the first CpG in each DNA fragment separately (Fig. 2). A nonparametric model was used to calculate the initiation and transition matrices by considering the distance to adjacent CpG sites. A Gaussian mixture model was applied to model the emission likelihood of each of the three fragmentation features (fragment length, coverage, and distance to the center of the fragment). A weighted DNA methylation prior, estimated from the methylation level of genomic DNA (buffy coat) in healthy individuals, was utilized to calculate the posterior emission probability of the hidden states only in the decoding (i.e., prediction) step; this prior models the baseline DNA methylation differences between genomic contexts. For example, the probability of observing a methylated event e_m at a CpG site with methylation prior k is:
(Equation 1)
A two-state Hidden Markov Model (HMM) was implemented as described in Rabiner31 within the Jahmm framework, with some adaptations to our problem. The Baum-Welch algorithm was used to estimate the parameters with a maximum of 50 iterations. The model was trained on all cfDNA fragments with at least 7 CpGs within the same fragment. The number of CpGs was not limited at the decoding step. For low-coverage data, we utilized an HMM trained on a high-coverage sample (HD_45, a healthy individual) to estimate the model parameters and applied it directly to each ULP-WGS dataset for decoding. All details are implemented in FinaleMe.java (with parameters -miniDataPoints 7 -gmm -covOutlier 3 for the training step and the parameter -decodeModeOnly for the decoding step).
Gaussian Mixture Model (GMM) initialization for HMM model
A GMM was used to estimate the initial state of each CpG in each fragment from the three fragmentation features, with a maximum of 10,000 iterations. After GMM initialization, in WGBS data, the methylated and unmethylated states were identified by the mean methylation level of each state. In WGS data, the state with the larger distance to the fragment center was defined as the methylated state. The initial parameters of the HMM were then estimated from the GMM initialization.
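The labeling rule for WGS data can be illustrated with a deliberately simplified sketch. It fits a two-component mixture by expectation-maximization to a single feature (the Z-scored distance to the fragment center) instead of the three-feature model described here, and all function names are hypothetical. The component with the larger mean distance is then called the methylated state.

```python
import math

def normal_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def fit_gmm_1d(xs, n_iter=200, min_sd=1e-3):
    """EM for a two-component 1-D Gaussian mixture; returns (weights, means, sds)."""
    srt = sorted(xs)
    half = len(srt) // 2
    # Initialize the two means from the lower and upper halves of the data.
    mu = [sum(srt[:half]) / half, sum(srt[half:]) / (len(srt) - half)]
    sd = [1.0, 1.0]
    w = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            p = [w[k] * normal_pdf(x, mu[k], sd[k]) for k in (0, 1)]
            total = sum(p)
            resp.append([pk / total for pk in p] if total > 0 else [0.5, 0.5])
        # M-step: re-estimate weights, means, and standard deviations.
        for k in (0, 1):
            nk = max(sum(r[k] for r in resp), 1e-12)
            w[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk
            sd[k] = max(math.sqrt(var), min_sd)
    return w, mu, sd

def label_methylated(xs):
    """True where a point is assigned to the component with the larger mean (the 'm' state)."""
    w, mu, sd = fit_gmm_1d(xs)
    m = 0 if mu[0] > mu[1] else 1
    u = 1 - m
    return [w[m] * normal_pdf(x, mu[m], sd[m]) > w[u] * normal_pdf(x, mu[u], sd[u]) for x in xs]
```

These hard labels only provide starting parameters; in the model described here they are subsequently refined by Baum-Welch training.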
Initiation and transition probability
The initiation probability of each state was calculated separately for each range of offsets from the start of the fragment, by averaging the states of the first CpGs within the same offset range across all high-quality fragments. The transition probability matrix between states was also calculated separately for each of the possible distance ranges to the previous CpG.
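A distance-dependent transition matrix can be sketched by counting state changes between adjacent CpGs within each distance range. The bin edges below are hypothetical, since the actual ranges are not given in the text, and the sketch counts known states (as would be available from WGBS); during Baum-Welch training the corresponding quantities would be expected counts rather than observed ones.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical bin edges in bp: [0, 10), [10, 50), [50, 100), [100, inf).
BIN_EDGES = [10, 50, 100]

def distance_bin(d):
    """Index of the distance range that contains d."""
    return bisect_right(BIN_EDGES, d)

def transition_matrices(fragments):
    """fragments: one list of (position, state) per fragment, with state 0 = u and 1 = m.
    Returns {bin index: 2x2 row-normalized transition matrix}."""
    counts = defaultdict(lambda: [[0, 0], [0, 0]])
    for cpgs in fragments:
        for (p0, s0), (p1, s1) in zip(cpgs, cpgs[1:]):
            counts[distance_bin(p1 - p0)][s0][s1] += 1
    matrices = {}
    for b, c in counts.items():
        # Rows without observations fall back to a uniform distribution.
        matrices[b] = [[v / sum(row) if sum(row) else 0.5 for v in row] for row in c]
    return matrices
```

Keeping one matrix per distance range is what makes the chain non-homogeneous: closely spaced CpGs can be given a stronger tendency to share a state than distant ones.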
Emission distributions
The three features were modeled by a multivariate Gaussian mixture distribution. A two-component Gaussian mixture was used to model each of the features separately.
(Equation 2)
In the Viterbi decoding step, the methylation prior estimated from genomic DNA in buffy coat samples from healthy individuals7 was used only to calculate the emission probability for each CpG.
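To illustrate how a site-specific prior can enter the decoding step, the sketch below runs Viterbi decoding in log space with a separate transition matrix for every step. The way the prior is combined with the emission likelihood (multiplying by k for the methylated state and by 1 − k for the unmethylated state, raised to a `weight`) is an assumption made for illustration and is not taken from the equations of this paper; all names are hypothetical.

```python
import math

def viterbi(emis, trans, init, priors, weight=1.0):
    """emis[t] = (L_u, L_m): emission likelihoods at CpG t.
    trans[t]: 2x2 matrix for the step from CpG t-1 to t (trans[0] is unused).
    init = (P_u, P_m); priors[t]: prior methylation level k at CpG t.
    Returns the most likely state path with 0 = u and 1 = m."""
    eps = 1e-12

    def log_emit(t, s):
        k = min(max(priors[t], eps), 1 - eps)
        prior = k if s == 1 else 1 - k
        # Assumed combination: likelihood times prior**weight.
        return math.log(emis[t][s] + eps) + weight * math.log(prior)

    score = [[math.log(init[s] + eps) + log_emit(0, s) for s in (0, 1)]]
    back = []
    for t in range(1, len(emis)):
        row, ptr = [], []
        for s in (0, 1):
            cands = [score[-1][r] + math.log(trans[t][r][s] + eps) for r in (0, 1)]
            best = 0 if cands[0] >= cands[1] else 1
            row.append(cands[best] + log_emit(t, s))
            ptr.append(best)
        score.append(row)
        back.append(ptr)
    # Trace back from the best final state.
    state = 0 if score[-1][0] >= score[-1][1] else 1
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]
```

In this toy setting, the same fragmentation evidence can be decoded as unmethylated under a flat prior and as methylated under a prior of 0.9, which shows why the influence of the prior on downstream differential methylation calls needs to be checked.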
KL divergence
The Kullback-Leibler distance was used to estimate the divergence of the new HMM during Baum-Welch re-estimation. Since the methylation prior was used for the decoding step and differs between CpG sites, 10,000 random fragments with a minimum of 5 CpGs were selected to calculate the Kullback-Leibler distance. If the distance between the new and old HMM was less than 1e−4 or the change in distance was less than 1%, the model was considered converged.
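The stopping rule can be written as a small helper. This sketch assumes that the 1% criterion refers to the relative change in distance between two successive iterations, which is one plausible reading of the text; the function name and arguments are hypothetical.

```python
def has_converged(kl_history, abs_tol=1e-4, rel_tol=0.01):
    """kl_history: Kullback-Leibler distances between successive models, one per iteration."""
    if not kl_history:
        return False
    current = kl_history[-1]
    # Criterion 1: the new and old models are nearly identical.
    if current < abs_tol:
        return True
    # Criterion 2 (assumed reading): the distance changed by less than 1% since the last iteration.
    if len(kl_history) >= 2:
        previous = kl_history[-2]
        if previous > 0 and abs(current - previous) / previous < rel_tol:
            return True
    return False
```

In a training loop, this check would be evaluated after every re-estimation and combined with the cap of 50 iterations stated for the Baum-Welch step.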
Summary of the model
In cfDNA WGS (Fig. 2), our HMM model infers the model parameters directly from WGS data without using cfDNA WGBS data. The principle of the model is as follows. We assume that each CpG in each cfDNA fragment is in one of two states (u or m). These states are not observable in WGS (thus hidden). We assume that the states are associated with three fragmentation features. At each CpG in each fragment in the bam file (CpG point), we can obtain three features: the fragment’s length, the CpG’s distance to the center of that fragment, and the fragment coverage at that particular CpG position in the reference genome. We also assume that, given the state of each CpG in each fragment, these three features follow a multivariate Gaussian distribution.
Step 1, we utilized a Gaussian mixture model to classify all the CpG points in WGS into two groups (u or m) to initiate the HMM model (the initial parameters). Given the hypothesis in Fig. 1b, we always assume that the “m” group has a larger average distance to the center of fragments.
Step 2, we applied the initiated parameters to the HMM model and built a Markov chain for each single cfDNA fragment. Due to the Markov process, the status of each CpG point is affected by its adjacent CpG in the same fragment. Then, the Baum-Welch algorithm was used to estimate the maximum likelihood parameters in the WGS dataset. Different from the traditional HMM model that assumes equal transition probability between CpGs, we utilized a non-homogeneous model to estimate different transition probability matrices given different distances between CpGs. The Kullback-Leibler distance was utilized to estimate whether or not the model converged during the iteration.
Step 3, after the estimation of parameters in step 2 (training), we utilized the Viterbi algorithm to estimate the best state (u or m) of each CpG in each fragment. Different from the traditional HMM model, we add a methylation prior from WGBS of healthy buffy coat to calculate the posterior probability.
Step 4, after the prediction in step 3, we aggregated the methylation status across fragments at each CpG site in the reference genome and calculated the continuous methylation level (0-100%).
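The aggregation in step 4 amounts to a per-site average of the binary calls. A minimal sketch, assuming the decoded calls are available as (chromosome, position, state) tuples, which is a hypothetical representation:

```python
from collections import defaultdict

def methylation_levels(calls):
    """calls: iterable of (chrom, position, state), one entry per CpG per fragment,
    with state 1 = methylated and 0 = unmethylated.
    Returns {(chrom, position): percent of covering fragments called methylated}."""
    meth = defaultdict(int)
    total = defaultdict(int)
    for chrom, pos, state in calls:
        total[(chrom, pos)] += 1
        meth[(chrom, pos)] += state
    return {site: 100.0 * meth[site] / total[site] for site in total}
```

With four fragments covering a CpG, three of them called methylated, the site receives a level of 75%. At ~0.1X coverage most sites are covered by at most one fragment, which is why levels are summarized over windows or regions in the low-coverage analyses.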
Performance evaluation
Comparison of the binary methylation status of each CpG in each fragment (WGBS)
An equal number of methylated and unmethylated CpGs were randomly sampled at the evaluation step. Prediction results were compared with the ground truth binary methylation states at each CpG in each cfDNA fragment of WGBS. The threshold used to call the methylated status at the Viterbi decoding step was varied in order to calculate the ROC curve.
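The balanced evaluation can be sketched as follows, assuming that each CpG in each fragment carries a continuous score for the methylated state (the exact score that was thresholded is not specified here, so this is an assumption). The area under the ROC curve obtained by sweeping a threshold equals the probability that a randomly chosen methylated CpG scores higher than a randomly chosen unmethylated one, which is what `auroc` computes. Function names are hypothetical, and the pairwise loop is quadratic, so it is suitable only for small illustrations.

```python
import random

def balanced_sample(scores, labels, seed=0):
    """Randomly draw equal numbers of methylated (1) and unmethylated (0) CpG scores."""
    rng = random.Random(seed)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    n = min(len(pos), len(neg))
    return rng.sample(pos, n), rng.sample(neg, n)

def auroc(pos, neg):
    """Fraction of (methylated, unmethylated) pairs ranked correctly; ties count as 0.5."""
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

Balancing the classes before evaluation keeps the genome-wide excess of methylated CpGs from dominating the comparison.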
Comparison of the continuous methylation level at each CpG or windows in the reference genome (paired WGBS and WGS)
FinaleMe was trained and decoded on WGS data only. The methylation level was calculated by aggregating the binary methylation status across fragments at each CpG in the reference genome. Finally, the continuous methylation level at each CpG or window was compared with the methylation level obtained from matched WGBS from the same blood draw.
Comparison of methylation profiles at important regulatory elements (paired WGBS and WGS)
FinaleMe was trained and decoded on WGS data. The predicted methylation level was calculated as described above (section Non-homogeneous Hidden Markov Model). The average methylation levels around CpG island promoters, 5′ ends of exons, and CTCF motifs were calculated by Bis-Tools as described in Lay & Liu et al.32. CpG island definitions were downloaded from the UCSC genome browser33. CpG island shores were defined as the regions within 2 kb of a CGI.
Benchmark of the speed
We downsampled the high-coverage cfDNA WGS data and calculated the time cost with different numbers of fragments in the bam files (Supplementary Fig. 8). Benchmark was performed at a single CPU in the computational cluster (Intel(R) Xeon(R) Gold 6338 CPU @ 2.0 GHz).
Tissue-of-origin deconvolution
To infer tissue of origin from measured or inferred DNA methylation data, we modeled patient methylation data as a linear combination of reference methylomes. We constrain the weights to sum up to one so that the weights can be interpreted as tissue contribution to cfDNA. Quadratic programming was utilized to solve the constrained optimization problem. This method and approach closely follow the tissue deconvolution algorithm described in Sun et al. PNAS6. To reduce the noise, we utilized the methylation density at 1 kb non-overlapped windows within the CpG island and CpG island shore regions at autosomes and binarized the methylation level (window with methylation density <0.1 was defined as 0, otherwise 1) in both reference methylomes and cfDNA data. Only windows with at least 10 Cs or Ts across all the reference methylomes were utilized for the analysis. Only windows that were highly variable across reference methylomes (top 1% most variable regions in the reference methylomes) were further utilized for the deconvolution.
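The constrained linear model can be illustrated with a small solver. The analysis here used quadratic programming; the sketch below substitutes projected gradient descent onto the probability simplex, which solves the same least-squares problem for small inputs without an external solver. Non-negativity of the weights is assumed in addition to the sum-to-one constraint, the function names are hypothetical, and the window selection and binarization steps are assumed to have been applied beforehand.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        cumulative += ui
        t = (cumulative - 1.0) / i
        if ui - t > 0:
            theta = t
    return [max(x - theta, 0.0) for x in v]

def deconvolve(reference, sample, n_iter=5000):
    """reference[i][j]: methylation of window i in reference tissue j.
    sample[i]: cfDNA methylation of window i.
    Returns tissue weights minimizing the squared error under the simplex constraint."""
    n_win, n_tis = len(reference), len(reference[0])
    # Conservative step size: the sum of squared entries bounds the largest eigenvalue of R^T R.
    step = 1.0 / max(sum(x * x for row in reference for x in row), 1e-12)
    w = [1.0 / n_tis] * n_tis
    for _ in range(n_iter):
        resid = [sum(r * wj for r, wj in zip(row, w)) - y for row, y in zip(reference, sample)]
        grad = [sum(reference[i][j] * resid[i] for i in range(n_win)) for j in range(n_tis)]
        w = project_to_simplex([wj - step * g for wj, g in zip(w, grad)])
    return w
```

On a noiseless toy input built from three reference profiles mixed at 0.7, 0.2, and 0.1, the solver recovers these weights; real cfDNA data are noisy, so the fitted fractions should be read as estimates.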
We incorporated WGBS from the major immune cell types (Neutrophil, B cell, T cell, Macrophage, Erythroblast cells), blood vessel endothelial cells, and liver hepatocyte cells, as suggested by Moss 2018 Nature Communications3. We also incorporated methylomes from mammary epithelial cells (HMEC) and prostate epithelial cells (PrEC) since they are related to the cancer types we analyzed.
In the low pass data, we further relaxed our criteria about the coverage to keep more windows. The top 25% of most variable regions in the reference methylomes were utilized for deconvolution. Windows with less than 5 Cs or Ts in either reference methylome or cfDNA data were marked as NA. Samples or windows with more than 80% NA were filtered. We further imputed the missing data of the windows by K-nearest neighbor (k = 5 and maxp = “p” in impute.knn function at impute package, R 4.2.1) and finally binarized the methylation level within the window as that in high-coverage data.
ichorCNA analysis
Estimation of tumor fraction was performed using ichorCNA as described previously in Adalsteinsson et al.25. Specifically, we utilized readCounter with parameters: --window 1000000 --quality 20 --chromosome “1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y” to generate the wig files. Then we utilized runIchorCNA.R with parameters: --normal “c(0.75)” --scStates “c(1,3)” --ploidy “c(2)” --maxCN 5 together with gc_hg19_1000kb.wig, map_hg19_1000kb.wig, GRCh37.p13_centromere_UCSC-gapTable.txt, and the HD_ULP_PoN_1Mb_median_normAutosome_mapScoreFiltered_median.rds panel provided by ichorCNA to calculate the tumor fraction for each sample.
Differential methylation analysis
Differentially methylated regions (predefined non-overlapping 1 kb windows in autosomes) in high-coverage WGBS were identified by metilene (v0.2-8)34 with a q value < 0.05. Data in ULP-WGBS are very sparse and noisy. Therefore, we utilized two-sided Wilcoxon rank-sum tests to identify the windows that differed between cancers and healthy controls with a p value cut-off of 0.01.
Statistics and reproducibility
No statistical method was used to predetermine sample size. No data were excluded from the analyses. The experiments were randomized to generate cfDNA sequencing libraries. The investigators were not blinded to allocation during experiments and outcome assessment.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Acknowledgements
This work was supported by the computational resources from the Broad Institute of MIT and Harvard, the Biomedical Informatics (BMI) high-performance computing cluster in CCHMC, and QUEST computational cluster in Northwestern University. This work also used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by the National Science Foundation grant number ACI-1548562. This work used the XSEDE at the Pittsburgh Supercomputing Center (PSC) through allocation MCB190124P and MCB190006P. Y.L. is supported by the Broad Next10 grant from the Broad Institute of MIT and Harvard, trustee award from Cincinnati Children’s Hospital Medical Center, the startup grant to Y.L. from Cincinnati Children’s Hospital Medical Center, Northwestern University, Robert H. Lurie Comprehensive Cancer Center of Northwestern University, and NHGRI (R56HG012360 to Y.L.). The authors acknowledge the generous support of the Gerstner Family Foundation to V.A.A., the Wong Family Foundation and DFCI Medical Oncology grant to A.D.C.
Author contributions
Y.L., V.A.A. and M.K. conceived the study. Y.L. implemented the computational method. S.R. performed the library constructions. Y.L., C.L., D.W.K., R.B., G.H., G.G., J.R., D.R., H.Z., H.F. and S.F. performed the data analysis with input from A.D.C., H.A.P., D.G.S., V.A.A. and M.K. A.D.C., H.A.P. and D.G.S. provided the clinical samples and guidance related to the clinical applications. Y.L. and V.A.A. wrote the manuscript together. All authors read and approved the final manuscript.
Peer review
Peer review information
Nature Communications thanks the anonymous reviewers for their contribution to the peer review of this work. A peer review file is available.
Data availability
The publicly available cfDNA WGBS data used in this study are available in the dbGaP database under accession code [https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000846.v1.p1]7. The publicly available ULP-WGS data used in this study are available in the dbGaP database under accession code [https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001417.v1.p1]25. The raw sequencing data for the deep WGS, WGBS, and ULP-WGBS data generated in this study have been deposited in the Sequence Read Archive with controlled access from dbGaP under accession code phs003287.v1.p1 [https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs003287.v1.p1]. These data are available under restricted access due to individual privacy concerns. Permanent employees of an institution at a level equivalent to a tenure-track professor or senior scientist with laboratory administration and oversight responsibilities may request access through dbGaP. The requests, which are managed by NHGRI’s Data Access Committee, take less than one month for approval, and access is permitted for 12 months. The processed and de-identified data are available in zenodo.org (10.5281/zenodo.7779198)35. The remaining data are available within the Article, Supplementary Information, and Source Data file. Source data are provided with this paper.
Code availability
Code for FinaleMe and associated scripts is publicly available on GitHub under the MIT license for academic researchers: https://github.com/epifluidlab/FinaleMe.git (ref. 36). The zipped code is also available in zenodo.org (10.5281/zenodo.7779198)35.
Competing interests
Y.L., V.A.A. and M.K. have an approved patent covering FinaleMe (“Methods for genome characterization”, US Patent US11788135B2, date of patent, Oct 17, 2023, filed by MIT and Broad Institute of MIT and Harvard). Y.L. owns stocks from Freenome Inc. V.A.A., G.H. and S.F. are inventors on an approved patent covering ichorCNA (US20190078232A1, “Methods for genome characterization”, date of patent, Mar 14, 2019, filed by Harvard College, Dana Farber Cancer Institute Inc, Broad Institute of MIT and Harvard) on methods for estimating tumor fraction in cfDNA. V.A.A. is a co-inventor on a patent application covering MAESTRO (US 2023/0203568, “Minor allele enrichment sequencing through recognition oligonucleotides”, pending, filed by Broad Institute of MIT and Harvard), which has been licensed to Exact Sciences, receives sponsored research funding from Exact Sciences, and is a cofounder and advisor to Amplifyer. The remaining authors declare no competing interests. H.Z. is currently an employee at Regeneron Pharmaceuticals Inc. and contributed to this article as an employee of Cincinnati Children’s Hospital Medical Center, and the views expressed do not necessarily represent the views of Regeneron Pharmaceuticals Inc.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Yaping Liu, Email: lyping1986@gmail.com.
Viktor A. Adalsteinsson, Email: viktor@broadinstitute.org.
Manolis Kellis, Email: manoli@mit.edu.
Supplementary information
The online version contains supplementary material available at 10.1038/s41467-024-47196-6.
References
- 1. Xie W, et al. Epigenomic analysis of multilineage differentiation of human embryonic stem cells. Cell. 2013;153:1134–1148. doi: 10.1016/j.cell.2013.04.022.
- 2. Jones PA. Functions of DNA methylation: islands, start sites, gene bodies and beyond. Nat. Rev. Genet. 2012;13:484–492. doi: 10.1038/nrg3230.
- 3. Moss J, et al. Comprehensive human cell-type methylation atlas reveals origins of circulating cell-free DNA in health and disease. Nat. Commun. 2018;9:5068. doi: 10.1038/s41467-018-07466-6.
- 4. Liu MC, et al. Sensitive and specific multi-cancer detection and localization using methylation signatures in cell-free DNA. Ann. Oncol. 2020;31:745–759. doi: 10.1016/j.annonc.2020.02.011.
- 5. Shen SY, et al. Sensitive tumour detection and classification using plasma cell-free DNA methylomes. Nature. 2018;563:579–583. doi: 10.1038/s41586-018-0703-0.
- 6. Sun K, et al. Plasma DNA tissue mapping by genome-wide methylation sequencing for noninvasive prenatal, cancer, and transplantation assessments. Proc. Natl Acad. Sci. USA. 2015;112:E5503–E5512. doi: 10.1073/pnas.1508736112.
- 7. Jensen TJ, et al. Whole genome bisulfite sequencing of cell-free DNA and its cellular contributors uncovers placenta hypomethylated domains. Genome Biol. 2015;16:78. doi: 10.1186/s13059-015-0645-x.
- 8. Olova N, et al. Comparison of whole-genome bisulfite sequencing library preparation strategies identifies sources of biases affecting DNA methylation data. Genome Biol. 2018;19:33. doi: 10.1186/s13059-018-1408-2.
- 9. Tanaka K, Okamoto A. Degradation of DNA by bisulfite treatment. Bioorg. Med. Chem. Lett. 2007;17:1912–1915. doi: 10.1016/j.bmcl.2007.01.040.
- 10. Yi S, Long F, Cheng J, Huang D. An optimized rapid bisulfite conversion method with high recovery of cell-free DNA. BMC Mol. Biol. 2017;18:1–8. doi: 10.1186/s12867-017-0101-4.
- 11. Sun K, et al. Orientation-aware plasma cell-free DNA fragmentation analysis in open chromatin regions informs tissue of origin. Genome Res. 2019;29:418–427. doi: 10.1101/gr.242719.118.
- 12. Erger F, et al. cfNOMe - a single assay for comprehensive epigenetic analyses of cell-free DNA. Genome Med. 2020;12:54. doi: 10.1186/s13073-020-00750-5.
- 13. Vaisvila R, et al. Enzymatic methyl sequencing detects DNA methylation at single-base resolution from picograms of DNA. Genome Res. 2021. doi: 10.1101/gr.266551.120.
- 14. Liu Y, et al. Bisulfite-free direct detection of 5-methylcytosine and 5-hydroxymethylcytosine at base resolution. Nat. Biotechnol. 2019;37:424–429. doi: 10.1038/s41587-019-0041-2.
- 15. Yu SCY, et al. Single-molecule sequencing reveals a large population of long cell-free DNA molecules in maternal plasma. Proc. Natl Acad. Sci. USA. 2021;118:e2114937118. doi: 10.1073/pnas.2114937118.
- 16. Choy LYL, et al. Single-molecule sequencing enables long cell-free DNA detection and direct methylation analysis for cancer patients. Clin. Chem. 2022;68:1151–1163. doi: 10.1093/clinchem/hvac086.
- 17. Snyder MW, Kircher M, Hill AJ, Daza RM, Shendure J. Cell-free DNA comprises an in vivo nucleosome footprint that informs its tissues-of-origin. Cell. 2016;164:57–68. doi: 10.1016/j.cell.2015.11.050.
- 18. Ivanov M, Baranova A, Butler T, Spellman P, Mileyko V. Non-random fragmentation patterns in circulating cell-free DNA reflect epigenetic regulation. BMC Genomics. 2015;16:S1. doi: 10.1186/1471-2164-16-S13-S1.
- 19. Zhou Q, et al. Epigenetic analysis of cell-free DNA by fragmentomic profiling. Proc. Natl Acad. Sci. USA. 2022;119:e2209852119. doi: 10.1073/pnas.2209852119.
- 20. An Y, et al. DNA methylation analysis explores the molecular basis of plasma cell-free DNA fragmentation. Nat. Commun. 2023;14:287. doi: 10.1038/s41467-023-35959-6.
- 21. Liu S, et al. Genomic analyses from non-invasive prenatal testing reveal genetic associations, patterns of viral infections, and Chinese population history. Cell. 2018;175:347–359.e14. doi: 10.1016/j.cell.2018.08.016.
- 22. Liu Y. At the dawn: cell-free DNA fragmentomics and gene regulation. Br. J. Cancer. 2022;126:379–390. doi: 10.1038/s41416-021-01635-z.
- 23. Kelly TK, et al. Genome-wide mapping of nucleosome positioning and DNA methylation within individual DNA molecules. Genome Res. 2012;22:2497–2506. doi: 10.1101/gr.143008.112.
- 24. Collings CK, Waddell PJ, Anderson JN. Effects of DNA methylation on nucleosome stability. Nucleic Acids Res. 2013;41:2918–2931. doi: 10.1093/nar/gks893.
- 25. Adalsteinsson VA, et al. Scalable whole-exome sequencing of cell-free DNA reveals high concordance with metastatic tumors. Nat. Commun. 2017;8:1324. doi: 10.1038/s41467-017-00965-y.
- 26. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25:1754–1760. doi: 10.1093/bioinformatics/btp324.
- 27. Jiang P, et al. Detection and characterization of jagged ends of double-stranded DNA in plasma. Genome Res. 2020;30:1144–1153. doi: 10.1101/gr.261396.120.
- 28. Krueger F, Andrews SR. Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications. Bioinformatics. 2011;27:1571–1572. doi: 10.1093/bioinformatics/btr167.
- 29. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat. Methods. 2012;9:357–359. doi: 10.1038/nmeth.1923.
- 30. Liu Y, Siegmund KD, Laird PW, Berman BP. Bis-SNP: combined DNA methylation and SNP calling for Bisulfite-seq data. Genome Biol. 2012;13:R61. doi: 10.1186/gb-2012-13-7-r61.
- 31. Rabiner LR. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE. 1989;77:257–286. doi: 10.1109/5.18626.
- 32. Lay FD, et al. The role of DNA methylation in directing the functional organization of the cancer epigenome. Genome Res. 2015;25:467–477. doi: 10.1101/gr.183368.114.
- 33. Gardiner-Garden M, Frommer M. CpG islands in vertebrate genomes. J. Mol. Biol. 1987;196:261–282. doi: 10.1016/0022-2836(87)90689-9.
- 34. Jühling F, et al. metilene: fast and sensitive calling of differentially methylated regions from bisulfite sequencing data. Genome Res. 2016;26:256–262. doi: 10.1101/gr.196394.115.
- 35. Liu Y. Code and de-identified datasets for FinaleMe. Zenodo. 2024. doi: 10.5281/zenodo.7779198.
- 36. Liu Y. FinaleMe: Predicting DNA methylation by the fragmentation patterns of plasma cell-free DNA. GitHub. 2024. https://github.com/epifluidlab/FinaleMe. doi: 10.5281/zenodo.7779198.
Associated Data
Data Availability Statement
The publicly available cfDNA WGBS data used in this study are available in the dbGaP database under accession code phs000846.v1.p1 [https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000846.v1.p1]7. The publicly available ULP-WGS data used in this study are available in the dbGaP database under accession code phs001417.v1.p1 [https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001417.v1.p1]25. The raw sequencing data for the deep WGS, WGBS, and ULP-WGBS data generated in this study have been deposited in the Sequence Read Archive with controlled access from dbGaP under accession code phs003287.v1.p1 [https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs003287.v1.p1]. These data are available under restricted access due to individual privacy concerns. Permanent employees of an institution at a level equivalent to a tenure-track professor or senior scientist with laboratory administration and oversight responsibilities may request access through dbGaP. Requests are managed by NHGRI’s Data Access Committee and take less than one month for approval; access is granted for 12 months. The processed and de-identified data are available at zenodo.org (10.5281/zenodo.7779198)35. The remaining data are available within the Article, Supplementary Information, and Source Data file. Source data are provided with this paper.
Code for FinaleMe and associated scripts are publicly available on GitHub under the MIT license for academic researchers (https://github.com/epifluidlab/FinaleMe.git)36. The zipped code is also available at zenodo.org (10.5281/zenodo.7779198)35.