Abstract
Deeper and broader sequencing of circulating tumor DNA (ctDNA) has identified a wealth of cancer markers in the circulation, resulting in a paradigm shift towards data science-driven liquid biopsies in oncology. Although panel sequencing for actionable mutations in plasma is moving towards the clinic, the next generation of liquid biopsies is increasingly shifting from analyzing digital mutation signals towards analog signals, requiring a greater role for machine learning. Concomitantly, there is an increasing acceptance that these cancer signals do not have to arise from the tumor itself. In this Opinion, we discuss the opportunities and challenges arising from increasingly complex cancer liquid biopsy data.
Liquid Biopsies in Oncology
Liquid biopsies increasingly offer a minimally invasive alternative to tissue biopsies in both research and clinical settings [1]. Within oncology, the concept of liquid biopsy has become synonymous with the analysis of ctDNA, which constitutes a fraction of cell-free DNA (cfDNA) in plasma spanning at least six orders of magnitude, from >80% of circulating DNA fragments [2,3] to individual parts per million [4]. ctDNA assays may have utility for a number of applications in oncology throughout the patient journey, including the early detection of cancer [5], prognostication [6], detection of residual disease [7], treatment selection based on actionable drivers, and monitoring of disease burden [3].
The history of the ctDNA field is outlined in Box 1. Early ctDNA assays involved PCR of plasma cfDNA amplifying driver mutations, which therefore leveraged signals that were highly specific to the tumor. However, at the lowest fractions of ctDNA, such as in the settings of cancer screening or detection of relapse, the sensitivity of such approaches becomes limited by the lack of sufficient markers to detect the presence of the tumor and sampling error for rare mutant fragments [1,8,9]. Therefore, other features of ctDNA are increasingly being interrogated for their utility, such as DNA fragment sizes [10-12], epigenetic signatures [13-15], RNA signals [16,17], and microbial signatures [18]. These recent developments suggest that the sensitivity of liquid biopsies might be boosted by deeper and/or broader sequencing of plasma samples, paving the way for the next generation of liquid biopsies, which transition away from well-established driver mutations and instead achieve high sensitivity and specificity by data mining (see Glossary) alternative features of ctDNA in sequencing data.
Box 1. Brief History of ctDNA.
cfDNA in plasma was first described by Mandel and Metais in 1948 [55]. In cancer patients, a fraction of cfDNA appears to be released by the tumor itself – ctDNA – and can be distinguished from normal, nontumor cfDNA by virtue of the presence of driver gene mutations: Sorenson et al. were the first to detect such mutations in plasma using PCR, namely KRAS mutations in the circulation of patients with pancreatic cancer that matched KRAS mutations in their tumor [20]. This study could be considered the earliest plasma-based first-generation liquid biopsy, and arguably the first demonstration of patient-specific ctDNA detection. Subsequently, the advent of digital PCR, flow cytometry, and next-generation sequencing enabled the accurate quantification of mutant fragments, and exploitation of the predictive and prognostic potential of ctDNA in settings such as minimal residual disease, disease recurrence, prognostication, treatment selection, and potentially, the early detection of cancer [1,8,56].
ctDNA molecules are released by tumors directly into the bloodstream through apoptosis or other modes of cell death and make up a proportion of the overall cfDNA in human plasma [9]. The half-life of cfDNA is between 16 min and 2.5 h, after which it may be cleared by multiple mechanisms including nuclease action, macrophage-mediated degradation, and urinary excretion [1]. ctDNA can now be distinguished from nontumor-derived circulating DNA through multiple approaches: via genetic and epigenetic signatures retained from the tumor cell of origin or through distinct fragmentation patterns (Box 2).
As a specific and quantitative tumor marker, ctDNA has become synonymous with cancer liquid biopsies over the past decade. The field has evolved following advances in PCR and sequencing technology, now enabling the study of tumor-derived but nonmutant cfDNA fragments and also even nontumor-derived cfDNA (see Figure 3 in main text).
In this Opinion, we classify the recent progress in the field based on the digital or analog nature of the signal examined (Figure 1). Greater use of machine learning is enabling the interrogation of analog signals in plasma, which may arise from either cancer or noncancer cells. We discuss the emergence of data science-driven liquid biopsies and the opportunities and challenges that the next generation of cancer diagnostics may face.
From Digital to Analog Cancer Signal in Plasma
First-Generation Liquid Biopsy: Targeting Individual Digital Signals
Given the causative role of mutations in tumorigenesis [19], initial ctDNA studies aimed to identify mutant DNA carrying well-established driver gene mutations using PCR on plasma cfDNA from patients with cancer [8,20]. This approach was subsequently extended to performing personalized PCR assays that targeted individual, patient-specific point mutations, which were selected after sequencing matched tumor samples [3,7]. PCR-based assays generate a single binary, or digital, signal based on the presence or absence of a sequence alteration, thus conferring high specificity for the cfDNA molecule being cancer derived. Multiple PCRs can be run in parallel to boost sensitivity further, such as in digital PCR [21], which utilizes 10³–10⁴ partitions.
Second-Generation Liquid Biopsy: Targeting Multiple Digital Signals
With the increasing utilization of next-generation sequencing (NGS), it has become possible to interrogate plasma cfDNA for multiple digital sequence alterations in parallel. Hence, the second generation of liquid biopsy involves targeted sequencing of regions that are recurrently mutated in the cancer type of interest [2,3,22-24]. This is particularly useful for cancers with recurrently mutated genes [22], or when tracking well-established resistance genes for alterations arising during and after treatment [25]. Although targeted sequencing approaches provide a broader and more sensitive approach for any given patient sample, the second generation of liquid biopsy is – similar to PCR assays – ultimately constrained by insufficient mutant molecules being sampled.
In the setting of a low-volume tumor, few mutant molecules (or none at all) may be found in a sample using digital PCR [4]. Increasing the number of assays used mitigates sampling error and boosts sensitivity (Figure 2A). However, at low ctDNA fractions, the lack of point-mutated fragments limits the sensitivity of both first- and second-generation liquid biopsies, which rely on sampling precisely such mutations. It remains to be seen whether personalized approaches will always be more sensitive than their nonpersonalized counterparts, which depend on the abundance of nondigital signals in cfDNA sequencing data.
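The sampling limit described above can be illustrated with a minimal sketch. It assumes a simple binomial model in which every sampled molecule at every tracked locus is independently mutant with probability equal to the ctDNA fraction. The function name, its parameters, and the example numbers are illustrative assumptions and are not taken from the cited studies.

```python
def detection_probability(genome_copies: int, tumor_fraction: float, n_loci: int = 1) -> float:
    """Probability of sampling at least one mutant fragment.

    Assumed model (hypothetical, for illustration only): each of the
    genome_copies * n_loci informative molecules is independently mutant
    with probability tumor_fraction.
    """
    if genome_copies < 0 or n_loci < 0:
        raise ValueError("genome_copies and n_loci must be non-negative")
    if not 0.0 <= tumor_fraction <= 1.0:
        raise ValueError("tumor_fraction must lie in [0, 1]")
    informative_molecules = genome_copies * n_loci
    # P(at least one mutant) = 1 - P(no mutant among all informative molecules)
    return 1.0 - (1.0 - tumor_fraction) ** informative_molecules


if __name__ == "__main__":
    # Illustrative values: 3000 genome copies in the sample, ctDNA fraction of 1e-5.
    for loci in (1, 10, 100, 1000):
        p = detection_probability(3000, 1e-5, loci)
        print(f"{loci:>5} loci -> P(detect >= 1 mutant fragment) = {p:.3f}")
```

Under these assumptions, a single locus is rarely sampled at low ctDNA fractions, whereas tracking more loci raises the chance of observing at least one mutant fragment. Real assays also have to contend with background error, which this sketch ignores.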
Second-Generation Liquid Biopsies as a Gateway to Analog Signal Detection
As recent second-generation liquid biopsies have targeted a larger number of mutations, their individual digital signals have, to an extent, become analog. When considering signals in aggregate, algorithms must discriminate between background mutation signals and true cancer signals [4,26]. In Figure 1, we call such signals pseudoanalog. As both biological confounding factors – such as clonal hematopoiesis [27] – and technical noise increase with targeted sequencing, the previously high specificity of individual cancer mutations no longer holds strictly true outside of actionable mutations. Some studies have acknowledged the limited sensitivity of analyzing mutations alone, and thus combine targeted sequencing assays with protein biomarker assays for cancer detection [28], which achieves a highly sensitive and scalable approach [5,28]. Other studies seek to maximize the useful information from the same data, exploiting fragmentation patterns [4] or combining copy number alterations [6] with point mutations to boost signals (Figure 2B). The interrogation of analog signals requires both an understanding of the biology underpinning those signals, and machine learning to extract them effectively. These realizations, coupled with routine application of deeper sequencing, have enabled a shift toward a third generation of liquid biopsy that is made possible by data science.
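To make the idea of aggregating many weak mutation signals against a background concrete, the following is a minimal sketch. It is not the method of any cited study: it assumes that mutant and total read counts can simply be pooled across loci and compared with a single, known background error rate using a one-sided binomial tail. All function names and numbers are hypothetical.

```python
import math


def pool_loci(loci):
    """Sum (mutant_reads, total_reads) pairs across loci into one pooled pair."""
    mutant = sum(m for m, _ in loci)
    total = sum(t for _, t in loci)
    return mutant, total


def pooled_binomial_pvalue(mutant_reads, total_reads, background_error_rate):
    """One-sided P(X >= mutant_reads) for X ~ Binomial(total_reads, background_error_rate).

    Computed as 1 - P(X < mutant_reads) with log-space terms, so it is only
    practical for modest mutant read counts and loses precision for
    extremely small p-values.
    """
    if not 0.0 < background_error_rate < 1.0:
        raise ValueError("background_error_rate must be strictly between 0 and 1")
    if not 0 <= mutant_reads <= total_reads:
        raise ValueError("mutant_reads must lie between 0 and total_reads")
    log_e = math.log(background_error_rate)
    log_not_e = math.log1p(-background_error_rate)
    below = 0.0
    for i in range(mutant_reads):
        log_pmf = (
            math.lgamma(total_reads + 1)
            - math.lgamma(i + 1)
            - math.lgamma(total_reads - i + 1)
            + i * log_e
            + (total_reads - i) * log_not_e
        )
        below += math.exp(log_pmf)
    return min(1.0, max(0.0, 1.0 - below))


if __name__ == "__main__":
    # Hypothetical sample: five loci, each with a handful of mutant reads at most.
    loci = [(1, 20000), (0, 20000), (2, 20000), (1, 20000), (1, 20000)]
    mutant, total = pool_loci(loci)
    print(mutant, total, pooled_binomial_pvalue(mutant, total, 1e-5))
```

No single locus in this example would be convincing on its own; pooling is what allows the aggregate to be weighed against the assumed background rate. In practice, error rates differ by locus and mutation context, so a single shared rate is a strong simplification.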
Third-Generation Liquid Biopsy: Embracing Analog Signals
Although whole-genome sequencing (WGS) may still allow the extraction of point mutations if performed sufficiently deeply, recent studies have shown that cancer detection and monitoring can be performed using analog signals [10,13,17,18]. Analog signals include the fragmentation pattern of circulating DNA [10,11], epigenetic changes [13-15], and genomic and transcriptomic changes in nontumor cells [16-18], which are discussed in the following text. By leveraging signals from all cfDNA fragments sequenced rather than focusing on individual mutation loci, sequencing data may be more efficiently used (Figure 3A). The resulting data are, naturally, more challenging to interpret than high-specificity digital signals. Unlike digital signals, which have high signal-to-noise ratios, the signal-to-noise ratio of analog signals can be low, as they are based on small variations from the physiological state (Figure 3B). Yet, when these data are deconvoluted using machine learning models, accurate discrimination between cancer and control may be possible – even at the lowest levels of DNA, at which first- and second-generation approaches would falter due to sampling error limitations [11,15].
Tumor-Derived, Third-Generation Liquid Biopsies
Fragmentomics is emerging as a field in its own right and focuses on understanding and leveraging differences in fragmentation patterns between tumor-derived and normal cfDNA (Box 2). cfDNA is believed to be fragmented by enzymatic cleavage during cell death, giving a characteristic modal size of 166 bp, corresponding to the length of cfDNA wrapped around a nucleosome plus a 20-bp length of linker DNA [29]. Studies conducted early in the last decade highlighted the relative abundance of short ctDNA fragments in patients with cancer [9,30], and thus, size selection appeared to be a logical next step to enrich for tumor signals [10,31]. Although fragment size patterns may be summarized as simple ratios of short:long fragments [32], there is greater complexity present in fragment size data than can be summarized as a ratio. Therefore, recent studies have sought to minimize information loss by applying machine-learning models to an increasing number of features of cfDNA [10,11]. Other than fragment lengths themselves, fragment end motifs [32] and start–stop positions [33] may also have utility for enhancing cancer detection. Deep convolutional neural networks have previously been fed point mutation and sequence context data, resulting in high sensitivity [34], highlighting a trend towards utilizing ever more raw sequencing data in machine learning models. Fragmentomics will likely follow the same trend by integration of an increasingly broad set of fragmentation features.
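As an illustration of how a short:long ratio can be turned into a set of positional features rather than one aggregate number, the sketch below counts short and long fragments per genomic bin. The size ranges, the bin size, and the input format (chromosome, start position, fragment length) are assumptions chosen for the example and are not specified in this article.

```python
from collections import defaultdict

# Assumed, illustrative thresholds in base pairs (not taken from the text).
SHORT_RANGE = (100, 150)
LONG_RANGE = (151, 220)
BIN_SIZE = 5_000_000


def short_long_ratios(fragments, bin_size=BIN_SIZE):
    """Return {(chrom, bin_index): short/long ratio} for an iterable of
    (chrom, start, length) tuples. Bins without long fragments map to None."""
    counts = defaultdict(lambda: [0, 0])  # [short_count, long_count] per bin
    for chrom, start, length in fragments:
        key = (chrom, start // bin_size)
        if SHORT_RANGE[0] <= length <= SHORT_RANGE[1]:
            counts[key][0] += 1
        elif LONG_RANGE[0] <= length <= LONG_RANGE[1]:
            counts[key][1] += 1
        # Fragments outside both ranges are ignored in this sketch.
    return {
        key: (short / long_ if long_ else None)
        for key, (short, long_) in counts.items()
    }


if __name__ == "__main__":
    toy_fragments = [
        ("chr1", 100, 120),
        ("chr1", 200, 166),
        ("chr1", 300, 140),
        ("chr1", 6_000_000, 170),
        ("chr1", 10, 400),
    ]
    for key, ratio in sorted(short_long_ratios(toy_fragments).items()):
        print(key, ratio)
```

The resulting per-bin ratios could serve as one feature vector per sample for a downstream classifier, which is how positional information is retained instead of being collapsed into a single genome-wide ratio.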
Box 2. cfDNA Fragmentomics: from Biology to Application.
Early evidence for the biological relevance of cfDNA fragment size was generated two decades ago, linking tumor cell death via apoptosis and necrosis with characteristic fragmentation patterns observed in the plasma of cancer patients [57,58].
Studies conducted over the past decade have highlighted the occurrence of short cfDNA molecules in cancer patients [9,30]. Thus, size-selection for fragments was a logical next step in the field – initially for short DNA fragments below 150 bp, and later of more specific size ranges, in order to discard most of the nontumor cfDNA fragments, which have a modal peak around 166 bp [10,31]. In healthy individuals, cfDNA fragmentation predominantly reflects DNA processed and released by lymphoid and myeloid cells [11,14,27]. Differences in size profile of tumor-derived and nontumor-derived cfDNA fragments have enabled fragment size selection to enrich for ctDNA in both conventional volumes [10] and from small volumes of blood, enabling ctDNA detection from dried blood spots [12]. Single-stranded DNA library preparation has revealed a different fragmentation pattern for ctDNA and a predominance of shorter fragments specifically below 100 bp [59], though whether this ultrashort population might be enriched for tumor-derived fragments is not yet clear.
Although fragment size patterns can be summarized as simple ratios of short:long fragments [32], there is greater complexity present than can be summarized in a ratio. For instance, Mouliere et al. utilized a random forest algorithm for classification of patient samples based on multiple fragmentation ratios of cfDNA size and were able to distinguish between patients with cancer and healthy individuals with high accuracy [10]. Similarly, Cristiano et al. trained a gradient-boosting, machine-learning model on the short:long ratio across the whole genome, thus recovering positional information that was discarded by an aggregated size ratio [11]. cfDNA fragmentation was shown to correspond to chromatin structure [11], highlighting the inter-relatedness of fragmentomics and epigenetics. This approach classified patients with cancer versus healthy individuals with an area under the curve of 0.94 and, to a lesser degree, the tissue of origin. The ability to classify the tissue of origin indicates tissue-specific differences in fragmentation, potentially due to differences in nucleases and epigenetic profile.
Other than fragment size, fragment-end motifs [32] and start–stop positions [33] may also have utility for enhancing cancer detection either by themselves, or in addition to other markers – potentially in the form of ensemble learning. Such approaches may efficiently utilize a greater proportion of the signal present in data and may also be informative with regards to the relative importance of each feature.
Given differences in methylation profile between tumor and normal tissues [35], plasma methylomics is based on the representation of differential methylation patterns in the circulation. Classification of samples using methylation profiles in plasma is not only highly sensitive in both early and advanced stage disease [13,15,36], but also resolves the contributions of different tissues to the circulating DNA pool [13,36], and the expression of cancer genes may be interrogated [36]. At present, targeted deep sequencing is used for accurately assessing thousands of differentially methylated regions in plasma [13], in which signals are aggregated across multiple loci either using machine learning [15] or probabilistic models [13], analogous to second-generation assays that target large numbers of point mutations [4,22]. The sensitivity of such approaches can be improved by increasing the number of reporters further through a shift towards WGS, as costs decline.
Nucleosome positioning and occupancy may be determined based on sequencing depths across the genome leading to the emerging field of nucleosomics. Patterns of nucleosome occupancy differ between nonmalignant and cancer cells and have been associated with distinct gene expression patterns [14,37]. Nucleosome position maps have also been used to deconvolve the tissue sources of cfDNA in plasma [14], similar to methylation maps [36]. Moderate-depth WGS is required for accurate nucleosome mapping to minimize biases in coverage introduced by current targeted sequencing methods. In the absence of inexpensive sequencing, wet-lab and dry-lab workarounds for the above biases might provide insights into this field. The sensitivity of nucleosomics is currently limited due to the limited signal-to-noise ratio of any given nucleosome locus. However, we speculate that with greater sequencing data, nucleosome maps can be made with more granularity, enabling small deviations from the normal profile to accurately classify samples. With greater data exploration, individual informative nucleosome loci or signatures of nucleosome positions might provide useful signal. Given that these data are obtainable from WGS data, nucleosome position information may be used as an ensemble approach with other cancer features in plasma.
Nontumor-Derived, Third-Generation Liquid Biopsies
The term ctDNA implies that the circulating DNA fragment is tumor-derived: the tumor releases its mutated [20], or short [30], or differentially methylated fragments [36] into the bloodstream. However, untargeted sequencing approaches such as WGS also contain signals from the microenvironment and the host, which we consider nontumor-derived cancer signal. Thus, nontumor-derived liquid biopsies may be possible: leveraging signals derived from physiological or pathological processes that are associated with the cancer, rather than being derived directly from the cancer itself. On the one hand, nontumor-derived signals may be less specific and present a significant overlap between cancer and normal individuals. On the other hand, they may provide the advantage of being more abundant than directly tumor-derived signals in absolute terms when tumor volumes are minimal (Figure 3), such as in screening or the residual disease setting. The association between these signals may be causal, due to interactions between the tumor and host or microenvironment in either direction: tumors may cause local microenvironmental or systemic changes that may be present in the bloodstream; alternatively, the microenvironment or the host is substantially different in cancer patients to begin with. Regardless of whether the relationship is causal or merely associative, it can be leveraged by aligning unmapped reads to microbial genomes [18] and by studying platelet RNA profiles [16,36].
Similar to how the nascent fragmentomics field started with observations of short vs. long fragment ratios, initial studies of plasma microbiomics began by monitoring known causes of infection in immunocompromised patients, where the clinical correlates were clear [36,38]. Cancer detection has rapidly progressed towards leveraging machine learning in the microbiomics field (Box 3): as most cancers do not have a sole microbial etiology, finding an individual specific microbial sequence in the bloodstream is unlikely. Thus, a recent study utilized machine learning to classify individuals as healthy or as having cancer, or to identify patients’ cancer type, from microbial read counts [18]. This approach was subsequently shown to be underpinned by specific microbial profiles of tumors through a comprehensive analysis of 1526 tumor microbiomes [39].
Box 3. Nontumor-derived Liquid Biopsies: Targeting the Circulating Microbiome.
Characterization of the tumor microbiome may unravel insights into the effects of microbial species on different cancer hallmarks [39] and provide novel opportunities in cancer diagnostics. Recently, a comprehensive microbiomic analysis of 1526 tumors and their adjacent normal tissues across seven tumor types found that each tumor type exhibited a distinct microbiomic composition, mostly consisting of intracellular bacteria present within cancer cells or immune cells [39]. Metabolic functions encoded by the intratumor bacterial species were associated with clinical features of tumor types and subtypes, suggesting that high levels of tissue-specific metabolites in tumors may create a preferred niche for bacteria [39]. Although many cancers do not have a sole microbial etiology, evidence supporting an active role of tumor-associated bacteria in tumorigenesis is accumulating, especially for the link between microbiome and colorectal cancer [60]. We suggest that microbial signals in plasma could increasingly provide useful information for third-generation liquid biopsy assays in oncology, and possibly even in other areas of clinical medicine, as studies have shown associations between the microbiome and neurological [61] and psychiatric disorders [62].
Aside from circulating microbial sequences, cfDNA from intracellular organelles may provide useful signals. Mitochondrial cfDNA is more abundant in the plasma than nuclear ctDNA due to there being 10²–10⁵ copies of mitochondrial genome per human cancer cell, and may be used for cancer detection [63]. Although abundant, mitochondrial sequences are limited by their poor mapping quality, which is due to their similarity to regions in the human genome; however, future advances in bioinformatics or mitochondrial DNA sequencing may yet unlock this potential resource.
Transcriptomics may be an important new frontier for the field of liquid biopsy due to its high complexity, although progress is hampered by the current preanalytical limitations of preserving and preparing circulating free RNA (cfRNA). In contrast, intracellular RNA is relatively protected, allowing the study of platelet RNA profiles. Despite the role of platelets in tumorigenesis being unclear, RNA splicing events observed in platelets indicate that transcriptomic analysis of tumor-educated platelets may serve as a cancer diagnostic marker [16,40,41]. Machine learning applied to platelet mRNA data showed accurate classification of patients with cancer versus healthy controls [16,17], suggesting that tumors shape platelet RNA profiles either through direct education or by other, indirect means.
Evidence from the field of prenatal diagnostics, where the fetal fraction is relatively high compared to ctDNA fractions, showed that the expected date of delivery may be predicted accurately from cfRNA [42], suggesting that there is complex and useful information present in cfRNA which could be translated to cancer research [43]. Whether such cfRNA signals are sufficiently abundant in cancer patients for cancer monitoring remains to be shown, although historically, innovations in the noninvasive prenatal testing field have often fed into the cancer liquid biopsy space. Overall, the emergent trends in nontumor-derived signals offer exciting opportunities for the next generation of liquid biopsy and suggest that there are likely many more unknown circulating markers to be discovered and applied in oncology.
Concluding Remarks and Future Directions
The field of liquid biopsy in oncology has broadened rapidly, from the detection of highly specific digital point mutation signals in plasma towards considering a multitude of analog signals, which, despite presenting different levels of specificity by themselves, show high performance when integrated. There has been concomitant progress in leveraging sensitive and specific nontumor-derived cancer signal, which may provide greater absolute signals, potentially overcoming hard limits on what the tumor itself can release into the bloodstream. These trends lead us to suggest that the next generation of liquid biopsies is beginning to emerge. These assays are not confined to sequence alterations, but exploit the tumor, its microenvironment and systems dynamics using broad and deep sequencing combined with machine learning – shifting the field of liquid biopsy towards a data science-driven discipline.
An increasing focus on data science in the liquid biopsy field has practical implications for the scientists and clinicians involved. For any machine-learning study using large amounts of raw data, such as those used by third-generation liquid biopsies interrogating omics, batch effects must be addressed. Sensitive algorithms built to detect true biological signals will also be sensitive to technical artifacts, unless these are pre-empted and controlled for. As a result, overfitting is an ever-present risk, hampering generalization of algorithms and thus severely limiting their utility. Aside from preprocessing samples in the same batch [18], k-fold cross-validation may be used to mitigate such biases [44]. New methods in machine learning will, in time, feed into ctDNA studies, as novel algorithms and methods for controlling batch effects emerge.
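The k-fold cross-validation procedure mentioned above (and defined in the Glossary) can be sketched in a few lines. This is a minimal, standard-library illustration of the splitting step only; the function name, the seed parameter, and the round-robin fold assignment are assumptions made for the example, and no model training is shown.

```python
import random


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Observations are shuffled once, dealt into k folds, and each fold is
    used in turn as the test set while the remaining folds form the
    training set.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # round-robin assignment
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


if __name__ == "__main__":
    for train, test in k_fold_splits(n_samples=10, k=5):
        print("train:", sorted(train), "test:", sorted(test))
```

Each observation appears in exactly one test fold, so performance is always estimated on samples the model did not see during training in that round.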
As machine-learning algorithms become more complex, such methods become increasingly opaque, or ‘black box’. Therefore, there are many potential sources of confounding error if such opaque machine-learning approaches are used without first characterizing the data. However, in the end, if the model performs well – does the underlying basis for it matter? Inexplicability is a ubiquitous problem with machine learning, and the black box nature of circulating DNA studies may already be occurring as we move away from the easily comprehensible analyses of first- and second-generation liquid biopsies to more complex data. Interpretability measures for deep learning are increasingly being utilized to mitigate the black box nature of some machine-learning approaches [45]. Determining feature importance from machine-learning models remains important from a biological interpretability perspective, or feature engineering may be performed in advance to restrict the feature set [10]. Future studies using many features across the genome should endeavor to analyze feature importance, although we appreciate that these signatures or principal components may be complex.
Lastly, as we shift towards analyzing more abundant circulating analog signals for cancer, novel algorithms need to retain high specificity. It is known that cfDNA reflects biological processes, such as clonal hematopoiesis [27] or exercise [46]. How, then, could future deeper and broader sequencing of cfDNA distinguish cancer mutations from those in healthy epithelial cells [47], or inflammatory [48], or precancerous states [49]? In addition, as cancer treatments are almost always associated with significant tissue injury, which causes cfDNA release [50,51], analog cancer signals may be diluted in the plasma by other sources of cfDNA. Future longitudinal studies assessing the extent of chemo-, radio- and immunotherapy on analog tumor and nontumor signals are needed to characterize such background signals. If the physiological responses to treatment can be summarized as a discrete signature, then it may be subtracted from subsequent data. Novel cancer diagnostics must introduce measures to control for confounders. Firstly, patient characteristics other than genomic data could be integrated together with cfDNA data [52] to give a more accurate classification of cancer. Secondly, the field might benefit from studies using algorithms that incorporate internal controls, including other tissue or noninvasive samples from the same patient’s body. Finally, future studies should ideally be large-scale, stratified, randomized, and controlling for comorbidities.
In summary, the field of ctDNA has advanced on all fronts: first- and second-generation assays targeting digital mutation signals are moving towards implementation and, in parallel, multiple novel omics subfields have emerged, offering numerous approaches for cancer detection. Current clinical trials are based on the advances in targeted sequencing over recent years, such as PlasmaMATCH [53]. Future prospective studies are required to test the real-world performance of next-generation liquid biopsy assays that utilize deep sequencing and machine learning, ensuring robustness by applying the new assays to a large population with well-controlled preanalytical procedures. While performance of such algorithms may be high, investigators have a duty to explain why such relationships should exist, learning from principles outlined in epidemiology such as the Bradford Hill criteria for causation [54]. Key questions about the biology of circulating DNA (summarized in Outstanding Questions) remain a priority in the field, since such insights serve to enable novel diagnostic approaches. Clinicians and scientists must be mindful of the rapid shift towards data science-driven cancer diagnostics, so that such approaches may be translated into the clinic and further benefit patient care.
Highlights.
In oncology, liquid biopsies offer a minimally invasive alternative to traditional cancer biopsies.
The sensitivity of point-mutation-based liquid biopsy approaches may reach a plateau, as there is a simple limit on how many mutant ctDNA copies a tumor can release into the bloodstream.
Next-generation liquid biopsy approaches are instead applying deeper and broader sequencing to leverage both tumor-derived and nontumor-derived cancer signals, which are increasingly analog signals.
Complex omics data require machine learning tools for their interpretation, which poses novel challenges to the field and is driving a paradigm shift in liquid biopsies research – with implications for future preclinical and clinical translational studies.
Outstanding Questions.
What are the tumor and nontumor contributors to the circulating DNA pool?
How is ctDNA released into the bloodstream, and how does tumor biology relate to ctDNA levels?
What are the mechanisms underpinning novel and highly accurate nontumor-derived signal-based classification?
Given the recent advances of cell-free RNA within the field of prenatal diagnostics, what can the field of cancer liquid biopsies learn?
Will ensemble machine learning models, which combine several tumor-derived and nontumor-derived features, provide even greater sensitivity?
How should we translate this growing body of preclinical data into the clinic in an impactful, safe and cost-effective way?
During the implementation phase of next-generation liquid biopsies, how will clinicians respond to the increasing complexity of black-box machine learning algorithms?
Glossary
- Analog
signal that is represented by a continuous variable
- Data mining
the process of discovering biological signals in large datasets, using statistical tools and machine learning
- Digital
signal that can take on only one of a fixed number of values, and is discrete in nature
- Ensemble learning
a model that combines several machine learning techniques into one optimal model, in order to improve prediction and decrease variance and bias
- Fragmentomics
the study of fragmentation patterns in the genome based on biological differences in cfDNA fragment size across different disease states
- Gradient boosting model
a decision-tree-based, machine-learning model that builds the model in a stage-wise fashion
- K-fold cross-validation
a method that estimates how well a machine-learning model trained on a certain dataset would perform on an unseen, different dataset. It works by shuffling all observations and dividing them into ‘k’ groups (or folds); each fold is then used in turn as the test set, while the model is trained on all remaining folds (the training folds)
- Machine learning
the use of algorithms that can learn to detect patterns in datasets without being explicitly programmed; often described as a subset of artificial intelligence
- Methylomics
the study of methylation patterns in the genome
- Microbiomics
the study of microbial genomes
- Nucleosomics
the study of nucleosome positioning across the genome
- Omics
the study of a group of molecules (DNA, RNA, proteins, metabolites, etc.) in a global or comprehensive way
- Overfitting (in machine learning)
occurs when an algorithm learns irrelevant features of the dataset, picking up not only the ‘real’ signal (e.g., signal specific to a cancer type) but also noise (e.g., any features specific to the patient populations studied, but irrelevant to the cancer)
- Random forest algorithm
an ensemble learning method which uses many decision trees together, classifying data by majority vote (or, for regression, using the mean value) across the trees (or forest)
- Transcriptomics
the study of the transcriptome, the complete set of RNA transcripts that are produced by the genome, under specific circumstances or in a specific cell
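The K-fold cross-validation entry above can be made concrete with a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the data are synthetic (a made-up "fraction of short cfDNA fragments" per sample), the classifier is a toy threshold rule rather than any model used in the studies cited here, and the function names `k_fold_indices`, `fit_threshold`, and `cross_validate` are hypothetical.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and deal them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_threshold(xs, ys):
    """Toy classifier: the midpoint between the two class means.

    Assumes cancer samples (label 1) tend to have the higher value.
    """
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (mean(pos) + mean(neg)) / 2


def cross_validate(xs, ys, k=5, seed=0):
    """Return the accuracy obtained on each held-out fold."""
    folds = k_fold_indices(len(xs), k, seed)
    accuracies = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        # Fit on the training folds only, never on the held-out fold.
        threshold = fit_threshold([xs[i] for i in train_idx],
                                  [ys[i] for i in train_idx])
        # Score on the held-out fold.
        correct = sum((xs[i] > threshold) == bool(ys[i]) for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return accuracies


# Synthetic data (illustrative only): 50 controls and 50 cancer samples,
# each described by a hypothetical fraction of short cfDNA fragments.
rng = random.Random(1)
ys = [0] * 50 + [1] * 50
xs = ([rng.gauss(0.40, 0.05) for _ in range(50)]
      + [rng.gauss(0.50, 0.05) for _ in range(50)])

scores = cross_validate(xs, ys, k=5)
print("per-fold accuracy:", scores)
print("mean accuracy:", mean(scores))
```

Because every observation is held out exactly once, the mean of the per-fold accuracies estimates performance on unseen data from the same population. It does not protect against overfitting to features that are specific to the cohort itself, which is why validation on an independent dataset remains necessary.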
References
- 1. Wan JCM et al. (2017) Liquid biopsies come of age: towards implementation of circulating tumour DNA. Nat. Rev. Cancer 17, 223–238
- 2. Forshew T et al. (2012) Noninvasive identification and monitoring of cancer mutations by targeted deep sequencing of plasma DNA. Sci. Transl. Med. 4, 136ra68
- 3. Dawson SJ et al. (2013) Analysis of circulating tumor DNA to monitor metastatic breast cancer. N. Engl. J. Med. 368, 1199–1209
- 4. Wan JCM et al. (2020) ctDNA monitoring using patient-specific sequencing and integration of variant reads. Sci. Transl. Med. 12, eaaz8084
- 5. Lennon AM et al. (2020) Feasibility of blood testing combined with PET-CT to screen for cancer and guide intervention. Science 369, eabb9601
- 6. Chabon JJ et al. (2020) Integrating genomic features for noninvasive early lung cancer detection. Nature 580, 245–251
- 7. Garcia-Murillas I et al. (2015) Mutation tracking in circulating tumor DNA predicts relapse in early breast cancer. Sci. Transl. Med. 7, 302ra133
- 8. Diehl F et al. (2005) Detection and quantification of mutations in the plasma of patients with colorectal tumors. Proc. Natl. Acad. Sci. U. S. A. 102, 16368–16373
- 9. Thierry AR et al. (2010) Origin and quantification of circulating DNA in mice with human colorectal cancer xenografts. Nucleic Acids Res. 38, 6159–6175
- 10. Mouliere F et al. (2018) Enhanced detection of circulating tumor DNA by fragment size analysis. Sci. Transl. Med. 10, eaat4921
- 11. Cristiano S et al. (2019) Genome-wide cell-free DNA fragmentation in patients with cancer. Nature 570, 385–389
- 12. Heider K et al. (2020) Detection of ctDNA from dried blood spots after DNA size selection. Clin. Chem. 66, 697–705
- 13. Liu MC et al. (2020) Sensitive and specific multi-cancer detection and localization using methylation signatures in cell-free DNA. Ann. Oncol. 31, 745–759
- 14. Snyder MW et al. (2016) Cell-free DNA comprises an in vivo nucleosome footprint that informs its tissues-of-origin. Cell 164, 57–68
- 15. Shen SY et al. (2018) Sensitive tumour detection and classification using plasma cell-free DNA methylomes. Nature 563, 579–583
- 16. Best MG et al. (2015) RNA-Seq of tumor-educated platelets enables blood-based pan-cancer, multiclass, and molecular pathway cancer diagnostics. Cancer Cell 28, 666–676
- 17. Best MG et al. (2017) Swarm intelligence-enhanced detection of non-small-cell lung cancer using tumor-educated platelets. Cancer Cell 32, 238–252.e9
- 18. Poore GD et al. (2020) Microbiome analyses of blood and tissues suggest cancer diagnostic approach. Nature 579, 567–574
- 19. Vogelstein B et al. (2013) Cancer genome landscapes. Science 339, 1546–1558
- 20. Sorenson GD et al. (1994) Soluble normal and mutated DNA sequences from single-copy genes in human blood. Cancer Epidemiol. Biomark. Prev. 3, 67–71
- 21. Tie J et al. (2016) Circulating tumor DNA analysis detects minimal residual disease and predicts recurrence in patients with stage II colon cancer. Sci. Transl. Med. 8, 346ra92
- 22. Newman AM et al. (2014) An ultrasensitive method for quantitating circulating tumor DNA with broad patient coverage. Nat. Med. 20, 548–554
- 23. Abbosh C et al. (2017) Phylogenetic ctDNA analysis depicts early stage lung cancer evolution. Nature 545, 446–451
- 24. Phallen J et al. (2017) Direct detection of early-stage cancers using circulating tumor DNA. Sci. Transl. Med. 9, eaan2415
- 25. Murtaza M et al. (2013) Non-invasive analysis of acquired resistance to cancer therapy by sequencing of plasma DNA. Nature 497, 108–112
- 26. Newman AM et al. (2016) Integrated digital error suppression for improved detection of circulating tumor DNA. Nat. Biotechnol. 34, 547–555
- 27. Razavi P et al. (2019) High-intensity sequencing reveals the sources of plasma circulating cell-free DNA variants. Nat. Med. 25, 1928–1937
- 28. Cohen JD et al. (2018) Detection and localization of surgically resectable cancers with a multi-analyte blood test. Science 359, 926–930
- 29. Jiang P and Lo YMD (2016) The long and short of circulating cell-free DNA and the ins and outs of molecular diagnostics. Trends Genet. 32, 360–371
- 30. Mouliere F et al. (2011) High fragmentation characterizes tumour-derived circulating DNA. PLoS One 6, e23418
- 31. Underhill HR et al. (2016) Fragment length of circulating tumor DNA. PLoS Genet. 12, e1006162
- 32. Jiang P et al. (2020) Plasma DNA end motif profiling as a fragmentomic marker in cancer, pregnancy and transplantation. Cancer Discov. 10, 664–673
- 33. Sun K et al. (2018) Size-tagged preferred ends in maternal plasma DNA shed light on the production mechanism and show utility in noninvasive prenatal testing. Proc. Natl. Acad. Sci. U. S. A. 115, E5106–E5114
- 34. Zviran A et al. (2020) Genome-wide cell-free DNA mutational integration enables ultra-sensitive cancer monitoring. Nat. Med. 26, 1114–1124
- 35. Feinberg AP and Vogelstein B (1983) Hypomethylation distinguishes genes of some human cancers from their normal counterparts. Nature 301, 89–92
- 36. Sun K et al. (2015) Plasma DNA tissue mapping by genomewide methylation sequencing for noninvasive prenatal, cancer, and transplantation assessments. Proc. Natl. Acad. Sci. U. S. A. 112, E5503–E5512
- 37. Liu Y et al. (2019) Abstract 5177: Spatial co-fragmentation pattern of cell-free DNA recapitulates in vivo chromatin organization and identifies tissue-of-origin. 10.1158/1538-7445.sabcs18-5177
- 38. Burnham P et al. (2018) Urinary cell-free DNA is a versatile analyte for monitoring infections of the urinary tract. Nat. Commun. 9, 1–10
- 39. Nejman D et al. (2020) The human tumor microbiome is composed of tumor type-specific intra-cellular bacteria. Science 368, 973–980
- 40. Calverley DC et al. (2010) Significant downregulation of platelet gene expression in metastatic lung cancer. Clin. Transl. Sci. 3, 227–232
- 41. Nilsson RJA et al. (2011) Blood platelets contain tumor-derived RNA biomarkers. Blood 118, 3680–3683
- 42. Ngo TTM et al. (2018) Noninvasive blood tests for fetal development predict gestational age and preterm delivery. Science 360, 1133–1136
- 43. Larson MH et al. (2019) Detection of tissue-specific RNA in the plasma of cancer patients: a Circulating Cell-Free Genome Atlas (CCGA) substudy. In Cold Spring Harbor Laboratory Meeting: The Biology of Genomes
- 44. Wan N et al. (2019) Machine learning enables detection of early-stage colorectal cancer by whole-genome sequencing of plasma cell-free DNA. BMC Cancer 19, 1–10
- 45. Shrikumar A et al. (2017) Learning important features through propagating activation differences. arXiv, 1704.02685v2
- 46. Haller N et al. (2018) Circulating, cell-free DNA as a marker for exercise load in intermittent sports. PLoS One 13, e0191915
- 47. Lee-Six H et al. (2019) The landscape of somatic mutation in normal colorectal epithelial cells. Nature 574, 532–537
- 48. Jaiswal S and Libby P (2020) Clonal haematopoiesis: connecting ageing and inflammation in cardiovascular disease. Nat. Rev. Cardiol. 17, 137–144
- 49. Contino G et al. (2017) The evolving genomic landscape of Barrett’s esophagus and esophageal adenocarcinoma. Gastroenterology 153, 657–673.e1
- 50. Swystun LL et al. (2011) Breast cancer chemotherapy induces the release of cell-free DNA, a novel procoagulant stimulus. J. Thromb. Haemost. 9, 2313–2321
- 51. Henriksen TV et al. (2020) The effect of surgical trauma on circulating free DNA levels in cancer patients – implications for studies of circulating tumor DNA. Mol. Oncol. 14, 1670–1679
- 52. Wan JCM et al. (2019) “Hey CIRI, what’s my prognosis?”. Cell 178, 518–520
- 53. Turner NC et al. (2020) Circulating tumour DNA analysis to direct therapy in advanced breast cancer (plasmaMATCH): a multicentre, multicohort, phase 2a, platform trial. Lancet Oncol. 21, 1296–1308
- 54. Hill AB (1965) The environment and disease: association or causation? Proc. R. Soc. Med. 58, 295–300
- 55. Mandel P and Métais P (1948) Les acides nucléiques du plasma sanguin chez l’homme. C. R. Seances Soc. Biol. Fil. 142, 241–243
- 56. Diehl F et al. (2008) Circulating mutant DNA to assess tumor dynamics. Nat. Med. 14, 985–990
- 57. Jahr S et al. (2001) DNA fragments in the blood plasma of cancer patients: quantitations and evidence for their origin from apoptotic and necrotic cells. Cancer Res. 61, 1659–1665
- 58. Wang BG et al. (2003) Increased plasma DNA integrity in cancer patients. Cancer Res. 63, 3966–3968
- 59. Burnham P et al. (2016) Single-stranded DNA library preparation uncovers the origin and diversity of ultrashort cell-free DNA in plasma. Sci. Rep. 6, 27859
- 60. Pleguezuelos-Manzano C et al. (2020) Mutational signature in colorectal cancer caused by genotoxic pks+ E. coli. Nature 580, 269–273
- 61. McMurran CE et al. (2019) The microbiota regulates murine inflammatory responses to toxin-induced CNS demyelination but has minimal impact on remyelination. Proc. Natl. Acad. Sci. U. S. A. 116, 25311–25321
- 62. MacQueen G et al. (2017) The gut microbiota and psychiatric illness. J. Psychiatry Neurosci. 42, 75–77
- 63. Mair R et al. (2019) Measurement of plasma cell-free mitochondrial tumor DNA improves detection of glioblastoma in patient-derived orthotopic xenograft models. Cancer Res. 79, 220–230