Skip to main content

This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

bioRxiv logoLink to bioRxiv
[Preprint]. 2026 Feb 12:2026.02.10.705188. [Version 1] doi: 10.64898/2026.02.10.705188

Deep learning-based non-invasive profiling of tumor transcriptomes from cell-free DNA for precision oncology

Robert D Patton 1,2,*, Alexander Netzley 1,2, Thomas W Persse 1,2, Akira Nair 1,2, Patricia C Galipeau 1,2, Ilsa M Coleman 2, Pushpa Itagi 1,2, Pooja Chandra 1,2, Mohamed Adil 1,2, Manasvita Vashisth 1,2, Erolcan Sayar 2, Joseph B Hiatt 2, Ruth Dumpit 2, Lori Kollath 5, Ridvan Arda Demirci 6, Alireza Ghodsi 6, Hung-Ming Lam 5, Colm Morrissey 5, Amir Iravani 3,6, Delphine L Chen 3,6, Andrew C Hsieh 2,9, David MacPherson 1,2, Michael C Haffner 2,3,4, Peter S Nelson 2,3,4,5,7,8,9,*, Gavin Ha 1,2,7,9,10,*
PMCID: PMC12918969  PMID: 41726945

Abstract

Circulating tumor DNA (ctDNA) profiling from liquid biopsies is increasingly adopted as a minimally invasive solution for clinical cancer diagnostic applications. Current methods for inferring gene expression from ctDNA require specialized assays or ultra-deep, targeted sequencing, which preclude transcriptome-wide profiling at single-gene resolution. Herein we jointly introduce Triton, a tool for comprehensive fragmentomic and nucleosome profiling of cell-free DNA (cfDNA), and Proteus, a multi-modal deep learning framework for predicting single gene expression, using standard depth (~30–120x) whole genome sequencing of cfDNA. By synthesizing fragmentation and inferred nucleosome positioning patterns in the promoter and gene body from Triton, Proteus reproduced expression profiles using pure ctDNA from patient-derived xenografts (PDX) with an accuracy similar to RNA-Seq technical replicates. Applying Proteus to cfDNA from four patient cohorts with matched tumor RNA-Seq, we show that the model accurately predicted the expression of specific prognostic and phenotype markers and therapeutic targets. As an analog to RNA-Seq, we further confirmed the immediate applicability of Proteus to existing tools through accurate prediction of gene pathway enrichment scores. Our results demonstrate the potential clinical utility of Triton and Proteus as non-invasive tools for precision oncology applications such as cancer monitoring and therapeutic guidance.

Subjects: Circulating tumor DNA, liquid biopsies, patient-derived xenografts, whole genome sequencing, deep learning, convolutional neural network, gene expression


Tissue biopsies represent the current clinical standard for genomic and epigenomic characterization of tumors used both in determining cancer treatment and in monitoring response to therapies. However, traditional biopsies are invasive procedures associated with high costs and co-morbidity, and late-stage cancers often include metastases which are inaccessible to biopsy or challenging to acquire safely. To address these shortcomings, methods utilizing circulating tumor DNA (ctDNA) released from tumor cells into the blood as cell-free DNA (cfDNA) are emerging as minimally invasive alternatives capable of interrogating the global molecular profiles of patient tumors. These “liquid biopsy” assays have been developed to tackle diverse clinical needs including early cancer detection16, assessment of minimum residual disease7,8, treatment response determinants, and mid-treatment genotype and phenotype monitoring914. However, liquid biopsy-based inference of transcriptional regulation is critical for determining prognosis, detecting lineage transitions, and monitoring therapeutic targets but remains challenging.

RNA and protein-based assessments of tumor tissue biopsies are standardized methods which can reveal molecular phenotypes, aid in achieving accurate prognoses, and identify potential therapeutic targets. Analyses of cell-free RNAs (cfRNAs) obtained through a blood sample represents an alternative approach to tissue biopsy, but cfRNA is typically less stable with high biological variability, requiring specialized methodologies which are neither standardized nor broadly adopted1518. In response to these limitations, emerging methods take advantage of unique aspects of cfDNA biology, going beyond traditional genomics to infer tumor-specific epigenetic regulation from coverage and fragmentation patterns which reflect underlying chromatin states9,1928. Other assays analyze DNA methylation and histone modifications in cfDNA, allowing for the direct measurement of cancer-associated biomarkers2931. Despite these considerable advances, a method with the resolution to predict individual gene expression levels from cfDNA in a manner that is directly analogous to tissue RNA-Seq remains elusive.

Current approaches for cfDNA-based epigenetic inference show robust correlation primarily in aggregate for similarly expressed gene sets, and require either complex, multi-modal assays29 or specialized ultra-deep targeted sequencing panels19, which tend to preclude evaluation of the entire transcriptome. Whole genome sequencing (WGS) of cfDNA can overcome these limitations but requires specialized computational methods to extract meaningful information. Therefore, we first developed the tool Triton to comprehensively characterize cfDNA in regions of interest, outputting both well-established fragmentation and novel nucleosome positioning profiles at base pair (bp)-resolution. We hypothesized that even at standard sequencing depths (~30–100x) these features, if integrated together and treated as a pattern recognition task, would enable robust expression inference. To this end we developed Proteus, the first deep learning framework to infer expression from cfDNA. Proteus utilized castration-resistant prostate cancer (CRPC) and small-cell lung cancer (SCLC) patient derived xenograft (PDX) models along with healthy donor (HD) whole blood as a ground-truth resource for initial training on “pure” ctDNA and cfDNA with matched tumor- or whole blood-derived RNA-Seq. By using in silico mixtures of these model systems along with unlabeled patient cfDNA in a semi-supervised training scheme, we developed a model capable of directly partitioning tumor and non-tumor background expression contributions. We then performed fine-tuning of Proteus using matched cfDNA and RNA-Seq of one or more tumor biopsies from patients with CRPC and SCLC to aid the model in generalizing to real-world examples from diverse sources. The model was run on unseen patient cfDNA with matched tumor biopsies from multiple cancer types, achieving high-accuracy predictions of phenotype-specific markers, therapeutic targets, and gene set enrichment scores. Finally, we applied the model to two additional, unseen CRPC patient cohorts lacking matched RNA-Seq but with known histology or clinical progression, from which we identified differentially expressed phenotype-defining genes and prognostic pathway enrichment scores de novo from cfDNA.

RESULTS

Triton: comprehensive characterization of cfDNA features at base-pair resolution

To investigate cfDNA fragmentation and nucleosome positioning for inferring transcriptional activity,5,9,1927 we first developed Triton, a signal processing tool to comprehensively extract features from cfDNA sequencing data. Triton generates bp-resolution coverage and fragmentation profiles as well as region-level summary statistics which characterize underlying nucleosome positioning and transcription factor (TF) binding events in both the promoter and body of genes (Fig. 1a, Methods). For actively transcribed genes, the transcription start site (TSS) exhibits a nucleosome depleted region (NDR) and a stably located +1 nucleosome, while both the promoter and full gene body display overall disordered nucleosome positioning and increased cfDNA fragment length heterogeneity which reflects active remodeling of chromatin5,9,1927,32 (Fig. 1b). By contrast, genes with inactive or repressed transcription will lack a NDR, have more regular nucleosome positioning and spacing, and show less fragment length diversity in both the promoter and gene body reflecting inaccessible chromatin (Fig. 1c).

Figure 1. Triton and Proteus workflow for cfDNA fragmentomic and nucleosome profiling and single gene expression prediction.

Figure 1.

a) Diagram of gene-level regions interrogated by Triton for fragmentomic and nucleosome profiling for use in the Proteus framework. Triton extracts “region-level” features in both a 2000 bp window centered on the TSS (left) and in the full gene body (right), as two distinct regions. Triton further produces continuous, bp-resolution “spatial” signals in the TSS window, as illustrated on the left side of b) and c).

b) Canonical nucleosome coverage and cfDNA fragmentation signals in the TSS region (left) and body (right) of a gene undergoing active translation. In the promoter, binding of TFs at the TSS leads to a nucleosome depleted region (NDR), or dip in coverage, while the regulatory role of the +1 nucleosome leads to more stable positioning, seen as a peak in coverage. TF-protected fragments of diverse length in the promoter lead to peaks in fragmentation diversity that inversely mirror nucleosomal binding, with higher diversity expected at the NDR. In the body of the gene, active translation is associated with less consistent positioning, reflecting transient chromatin remodeling which also leads to greater heterogeneity in fragmentation patterns.

c) In the case of an inactive gene, no NDR or highly phased +1 peak is expected at the TSS, and regular nucleosome positioning extends into the gene body reflecting inactive chromatin. More consistent fragmentation patterns, typical of nucleosome protection, are found in both regions

d) Heatmap showing the empirically derived weighting given to fragments of different lengths, which is used by Triton to localize nucleosome centers (Methods). Shorter fragments at the top illustrate mononucleosomal binding with a central peak intensity width roughly equivalent in size to the canonical linker length. When Triton produces a PN signal, these fragments contribute mass to that signal coinciding with their central overlap. As fragment length increases the most likely position splits, corresponding to asymmetric mononucleosomal binding with a canonical linker length on one side and a trailing strand on the other. After 314 bp consistent separation is regained echoing true dinucleosomal protection; these fragments therefore contribute mass to the PN signal localized near their end points, as opposed to their centers. Finally, the intensity lines spread once again, corresponding to dinucleosomes with asymmetric trailing DNA.

e) From the phased-nucleosome (PN) signal produced by Triton three additional region-level metrics capturing chromatin structure are extracted: spacing, which reveals the mean inferred distance between nucleosomes; amplitude, which measures the strength of well-phased nucleosome signal components; and phasing score, which reports the relative contribution of canonically spaced nucleosomes within the signal.

f) Workflow diagram of the Triton/Proteus methodology. Triton nucleosome and fragmentation spatial signal profiling from cfDNA WGS at individual gene TSS regions, as well as region-level features extracted from both TSS and gene bodies (ae) are combined with matched meta data (transcript length, coding region length, number of exons, TSS region reference sequence, and estimated tumor fraction) as inputs to Proteus. During base-model training Proteus uses matched RNA-Seq of tumors from PDX models and whole blood from healthy donors, while during fine tuning it is trained on patient cfDNA with RNA-Seq of tumor biopsies. As a final output Proteus predicts individual gene expression values, both tumor and background, which can be used to directly monitor targets of interest or as inputs to standard RNA-Seq analysis pipelines.

g) Structural diagram of the Proteus architecture. Seven input Triton signals spanning the TSS +/−1000 bp (i) containing raw coverage, inferred nucleosome positioning, and fragmentation profile signals along with the one-hot encoded reference sequence undergo independent convolutions (ii) before being concatenated back together, at which point parallel convolutional layers extract small and large-scale relational (intra-channel) information while reducing the signals’ dimensionality (iii). These signals are then flattened and undergo further dimensionality reduction (iv) before Triton region-level TSS and GB scalar features are introduced (v). This intermediate representation then enters parallel β-VAEs which disentangle tumor and background-specific latent spaces (vi). Disentangled latent spaces are then combined additively using the estimated tumor fraction (vii) before being decoded back into the reduced signal representation and raw region-level features (viii) as an internal training process encouraging meaningful disentangling. Separately, each latent space is first used to perform soft classification of gene activity, on which the latent spaces are conditioned before continuous tumor and background expression predictions are made (ix).

Triton builds on our previous work9 by producing a signal that localizes phased nucleosome (PN) centers based on the lengths of individual fragments (Fig. 1d, Methods). “Spatial” signals derived by Triton are extracted at bp-resolution in the region around the TSS (±1000 bp) and include the GC-corrected raw coverage, PN profile, and nucleosome peak locations as well as fragmentomic features, such as fragment length short-long ratio, diversity index, and Shannon entropy (Supplementary Fig. 1, Methods)12,19. “Region-level” features are scalar values summarized across both the TSS region and full gene body, and therefore reflect larger trends in nucleosome positioning or cfDNA fragmentation. PN profiles are also used to derive a phasing score (PN score) that describes nucleosome periodicity, the mean phased-nucleosome peak height (PN amplitude), and the mean inter-nucleosomal distance (PN spacing) (Fig. 1e). Other region-level features include summary statistics based on both fragment coverage and length (Supplementary Fig. 1, Methods). Triton thus represents a complete cfDNA coverage profiling and feature extraction tool that provides high resolution spatial information, along with both new and established nucleosome positioning and fragmentation features within regions of interest.

Proteus: a deep learning framework for gene expression inference from cfDNA

We hypothesized that an approach which synthesizes spatial signal profiles and region-level features can be used to predict the expression of individual genes from standard-depth WGS of cfDNA. To this end we developed Proteus, a deep learning framework designed to predict gene expression levels using these input features, along with the promoter DNA reference sequence, gene-level meta data, and estimated tumor fraction (Fig. 1f, Methods). The Proteus deep learning architecture was designed with a biologically informed multi-modal approach (Fig. 1g). First, a series of convolutional neural network (CNN) layers are used to extract interdependent relationships in the promoter between the DNA reference sequence, bp-level coverage, PN signals, and fragmentomic profiles (Input A). Following convolutions, the reduced promoter feature representation is combined with region-level features and gene-specific annotations (exon count, mRNA length, transcript length) (Input B). These inputs are subsequently fed into parallel β-Variational Autoencoders (β-VAEs), a distinctive hallmark of Proteus that enables direct disentanglement and prediction of tumor- and background-derived expression levels33. For each input gene, Proteus outputs these predicted values in units of counts per million (log2(CPM+1)), analogous to standard analyses of tissue RNA-Seq34,35 (Methods).

A pivotal design feature in the model training and prediction strategy is the treatment of each gene as an individual input without labeling gene names and tissue types. As a result, the model avoids overfitting to the effects of “guilt by association”36 since no co-expression information is able to be learned. Proteus may therefore be more robust when predicting the expression of genes involved in irregular co-expression patterns while also ensuring that predictions are directly supported by input data features. Modeling at the per-gene level also effectively increases the number of training examples by a factor of ~18,000, enabling the use of deep learning which would otherwise be infeasible due to the limited size of cfDNA cohorts with matched tissue RNA-Seq available at this scale.

Proteus learns patterns in cfDNA nucleosome and fragmentation profiles that inform transcriptional activity

To form an initial, base-model training set with known tumor and background proportions and gene expression levels, we used 150 simulated and original ctDNA samples from PDX and patient plasma (Fig. 2a, Supplementary Table 1). First, we generated 80 simulated samples from in silico admixtures of ctDNA sequencing data from the plasma of 40 PDX lines (20 CRPC, 20 SCLC) with matched tumor RNA-Seq data and 10 healthy donor (HD-L) cfDNA samples with matched RNA-Seq data from buffy coat (Methods). Along with both the original samples and admixtures, an additional 20 cfDNA samples from patients without RNA-Seq were included for unsupervised learning. These data were used in 5-fold cross-validation (CV) training, representing ~2.7 million independent gene examples (Fig. 2a, Methods, Supplementary Table 2). Model validation was performed using 40 unseen, held-out samples comprising 10 PDX lines (5 CRPC, 5 SCLC) with matched tumor RNA-Seq, 10 unseen HD cfDNA samples without matched RNA-Seq (HD-U), and admixtures generated with these samples. Following initial training, fine-tuning of the model to learn real-world patient variability was performed using 11 (18%) of 62 total patient samples with matched tumor biopsy RNA-Seq labels and 14 in silico admixtures. For the final model evaluation, we used 51 unseen, held-out patient samples with matched RNA-Seq, along with two distinct external cohorts comprising 102 CRPC patient samples with clinical data (Fig. 2a).

Figure 2. Proteus model interpretations illuminate the effects of cfDNA fragmentation and inferred chromatin dynamics on gene expression.

Figure 2.

a) Illustration of the Proteus training scheme and data usage. An initial “base-model” was trained using PDX ctDNA with matched tumor RNA-Seq (n = 40), Healthy donor cfDNA with matched whole blood RNA-Seq (HD-L, n = 10), in-silico mixtures (n = 80), and patient cfDNA without matched RNA-Seq (n = 20). 17% of patient data (n = 11) with matched RNA-Seq and estimated tumor fractions were then used along with in-silico mixtures (n = 14) to fine-tune the deeper layers of the model to the variability seen in the patient cohorts, resulting in a final model for use on the held-out patient samples (n = 51 with matched RNA-Seq, n = 102 without matched RNA-Seq). Inner pie-charts indicate the proportion of training examples coming from each data type, while outer rings indicate presence and type of RNA-Seq labels.

b) SHAP values for Triton region-level features used by Proteus, corresponding to impact on predicted output by individual feature. Red or blue dots indicate a single gene with higher or lower feature value relative to the sample-mean, while position along the x-axis indicates absolute impact on the expression prediction in log2(CPM+1) units. The left and right columns show identical features extracted from the TSS +/−1000 bp region (left) and full gene body (right). FL, fragment length; PN, Phased-Nucleosome. All features are standard scaled within samples except for mean fragment depth, which is log-scaled.

c) Global SHAP polarity-impact scores for TSS input signals to Proteus showing the spatial attention paid to each input signal illustrated in d), along with peak calls and the reference sequence nucleotide frequencies. Red regions indicate a positive correlation between signal level and expression, while blue indicate a negative correlation. In all non-sequence signals the model has learned to focus on the regions where nucleosome depletion (−150 to 0 bp) and or +1 localization (50 to 150 bp) is expected during active translation.

d) Example of six out of seven Triton bp-resolution TSS signals (excluding peak call locations) used as inputs to Proteus for the AR promoter region, illustrating active and inactive signatures. Lines and shading represent the mean and 95% confidence interval of ctDNA signals for ARPC (active AR; red; n=17) vs NEPC (inactive AR; blue; n=6) LuCaP PDX lines. Yellow shading highlights the TSS NDR and +1 nucleosome regions that are given dominant spatial attention by Proteus, corresponding to the SHAP polarity-impact values in c).

e) Information content of sequence 6-mer kernels extracted from the model’s initial CNN layer, representing sequence patterns focused on by the model. An increase in GC-rich 6-mers consistent with the GC box and zinc finger binding can be seen in kernels 8, 9, 10, 11, and 14. Kernels 1–4, 5 and 15 have increased activation scanning over AT-rich sequence segments indicative of TATA-box localization.

Once the model architecture and base-model training were finalized, we performed Shapley Additive exPlanations (SHAP) analysis to elucidate the contributions of input features and their association with expression learned by the model (Methods). We found that an increase in region-level features measuring fragment length diversity in both the promoter and gene body was associated with higher expression, consistent with previous observations19,37 (Fig. 2b). Interestingly, we observed inverse relationships between the promoter and gene body for mean coverage and short-long fragment length ratio. We reasoned that shorter fragments in the promoter were attributable to protection by TF binding and reduced nucleosome occupancy, while longer fragment lengths in the gene body reflected larger nucleosome spacing during active transcription20,23,31. Region-level PN features in the promoter had little impact on gene expression; however, the increase in PN spacing within the gene body drove increased expression, reflecting more sparse positioning of stable nucleosomes in actively transcribed genes3840. Both PN score and amplitude were decreased in the gene body for actively expressed genes, further reflecting ongoing chromatin remodeling and accessibility41. We note that region-level features in the gene body played a significant role in determining expression, which has been overlooked in previous studies that largely focused on the promoter and TSS19,4245.

We then looked at SHAP polarity-impact scores to understand the attention paid to the spatial signals at the promoter (Methods). Across signal profiles, the model learned to give more attention to the NDR upstream of the TSS and the +1 nucleosome as expected, while giving reduced attention to other nucleosomes downstream (Fig. 2c). Fragment end coverage and fragment diversity index signals reflected a lack of nucleosome protection and potential TF binding, showing an inverted pattern compared to depth, fragment short:long ratio, and Shannon entropy. Interestingly, while we initially expected the PN signal to generally follow the trend of raw coverage, we found that the model associated very narrow peaks within the NDR with increased activity, while ignoring signals at the +1 nucleosome position. This observation suggests that the PN signal may allow Proteus to discern the presence of variably positioned “transient” nucleosomes linked to active transcription46. Attention paid to the reference DNA sequence was more dispersed throughout the promoter region. Together these results align with observed trends, which is exemplified in the Triton ctDNA spatial feature profiles for the promoter region of the androgen receptor (AR) in PDX subtypes where AR is active and inactive (Fig. 2d).

Finally, we examined the input-level sequence kernels of the trained model which showed high activation for GC and AT-rich regions (Fig. 2e). Analysis of the cross-signal kernels that capture spatial relationships showed that the model frequently leverages the reference DNA sequence to localize GC- and AT-rich contributions in fragment end-motif coverage, features known to reflect the origins of cfDNA47 (Supplementary Fig. 2). Overall, the attention paid by Proteus suggests that cfDNA spatial profiles around the TSS and region-level features in both the promoter and gene body contribute to expression prediction, in alignment with known chromatin dynamics.

Robust prediction of tumor gene expression levels using ctDNA from PDX plasma

To benchmark the upper bound predictive performance of Proteus, we used unseen data from pure human ctDNA (100% tumor fraction, TFx) isolated from the plasma of mice implanted with human PDX tumor lines9,48 (Methods). In CRPC (n = 5, 2 shown) and SCLC (n = 5, 2 shown) PDX lines representing different subtypes, the predicted expression of the whole transcriptome (18,488 genes) from ctDNA was strongly correlated with matched tumor RNA-Seq (r = 0.91–0.95, p < 10−10) (Fig. 3a). This correlation was similar to that observed when comparing between technical replicates of direct tumor RNA-Seq49. Next, we looked at individual CRPC phenotype-defining genes with known differential expression in the LuCaP PDX validation and hold out samples along with the healthy donor validation cohort10. We observed distinct patterns of predicted expression between AR-driven adenocarcinoma (ARPC) and neuroendocrine prostate cancer (NEPC) subtype-specific gene sets that mirrored the tumor RNA-Seq data (r = 0.90, p < 10−10) (Fig. 3b). As DNA sequence alone is a major determinant of baseline gene expression levels5052, we then sought to assess its role in the ability of Proteus to infer the expression of individual genes. Using a sequence-agnostic version of model, we were still able to differentiate active/inactive genes in the CRPC gene set but with reduced correlation (r = 0.65, p < 10−10) (Supplementary Fig. 3, Methods). This indicates that Proteus uses the reference sequence to ground baseline expression levels, while leveraging cfDNA-derived features to distinguish active and inactive states which cannot be captured by DNA sequence-only models.

Figure 3. Proteus base-model expression prediction results for PDX and healthy donor cohorts.

Figure 3.

a) Comparison of transcriptome level (n = 18,488 genes) expression values between tumor RNA-Seq vs. Proteus predictions for two LuCaP CRPC (left) and two SCLC (right) PDX samples held out from training. A selection of phenotype-defining genes colored by subtype association are highlighted to illustrate gene-specific expression between different subtypes.

b) Heatmap comparing tumor RNA-Seq (left) and Proteus-predicted (right) expression values for both training (5-fold cross-validation [CV]) and validation (hold-out [HO]) LuCaP PDX and healthy donor (HD-L) samples across three phenotype-defining gene sets: NEURO I, NEURO II, and AR. CRPC subtype of each PDX line is indicated by the colored bar on top.

c-d) Tumor RNA-Seq vs. Proteus-predicted expression values for selected genes across 5-fold CV and holdout samples from of healthy donor (HD-L, n = 10) and c) LuCaP PDX lines (nCV = 20, nHO = 5) or d) SCLC PDX lines (nCV = 20, nHO = 5). AR and ASCL1 are established phenotype defining genes in PC, while SYP and DLL3 are both expressed in NEPC. NEUROD1 and ASCL1 are both molecular phenotype-defining transcription factors associated with neuroendocrine differentiation. POU2F3 is expressed in non-neuroendocrine-like SCLCs which present unique treatment needs, and DLL3 is expressed in neuroendocrine-like SCLC. Diagonal line with shaded area denotes the linear fit and 95% confidence interval.

Next, we examined the performance of Proteus in predicting the expression of individual genes that are known markers of tumor subtypes or therapeutic targets. We found significant correlations between gene expression predicted by Proteus and matched direct tumor RNA-Seq in the PDX models of CRPC, including the expression of AR, ASCL1, and SYP (r = 0.74–0.93, Fig. 3c). These genes are important markers used to determine both clinical phenotypes and trans-differentiation from ARPC to NEPC during resistance to AR signaling inhibitor therapies5355. Similarly, we observed significant correlations of gene expression in PDX of SCLC for NEUROD1, ASCL1, and POU2F3 which define SCLC transcriptional subtypes5658 (r = 0.88–0.95, Fig. 3d). The predicted expression of DLL3, a clinically investigated target in neuroendocrine cancers5961, also showed significant correlation with the tumor RNA-Seq in both cancer types (r = 0.87–88). These results demonstrate the capability of Proteus to predict expression transcriptome-wide, including for genes relevant to tumor subtyping and targeted therapies.

Proteus captures expression levels of individual genes in multiple cancer patient cohorts

We evaluated the performance of Proteus for quantitating gene expression across four clinical cohorts of cancer patients with matched plasma cfDNA and tumor tissue RNA-Seq datasets: two patient cohorts with blood samples and tumor biopsies collected post-mortem from bladder cancer (BLCA, BLCA-RA, n = 8, TFx 8–85%, mean coverage 73x) and CRPC (CRPC-RA, n = 25, TFx 8–85%, mean coverage 61x), a SCLC cohort48 (n = 16, TFx 18–77%, mean coverage 67x), and an external CRPC cohort (CRPC-WCDT62, n = 13, TFx 25–66%, mean coverage 120x) (Methods, Supplementary Table 1). Proteus predictions were significantly correlated with matched tissue-derived gene expression measures across samples in each cohort, for the whole transcriptome (cohort median r = 0.88 in CRPC-WCDT, 0.89 in CRPC-RA, 0.83 in BLCA-RA, and 0.83 in SCLC) and for the Druggable Genome tier 1 target genes with potential clinical relevance (DGT163, n = 1,423, cohort median r = 0.85 in CRPC-WCDT, 0.85 in CRPC-RA, 0.80 in BLCA-RA, and 0.82 in SCLC) (Fig. 4a, Supplementary Table 3). We further observed strong concordance for established phenotype-defining genes in the CRPC-RA cohort (n = 22, cohort median r = 0.82, Fig. 4b), similar to what was observed in the CRPC PDX ctDNA. The BLCA-RA cohort was entirely held out from training as an unseen cancer type for performance evaluation. The predicted expression for this cohort was significantly correlated with matched tissue for genes that classify consensus molecular subtypes of muscle-invasive BLCA64 (n = 828, cohort median r = 0.56), even for a sample with 8% TFx (Patient 19–022, r = 0.55, p < 10−10) (Supplementary Table 3).

Figure 4. Proteus accurately predicts expression of individual targets and gene set enrichment scores in patients with variable tumor fraction.

Figure 4.

a) Comparison of transcriptome-level (n = 18,488 genes) expression values between tumor RNA-Seq vs. Proteus predictions for a sample from each of the four patient cohorts with matched RNA-Seq, representing four unique subtypes. Individual genes defining the cancer subtypes are labelled with colors representing their association with the subtypes.

b) Heatmap comparing tumor RNA-Seq (left) and Proteus-predicted (right) expression values for CRPC-RA patient samples across phenotype-defining gene sets: NEURO I, NEURO II, and AR. CRPC tumor subtypes, including amphicrine (AMPC; AR+/NE+), are indicated at the top; multiple histologies are show with split boxes.

c) HeaTMap of Spearman correlation coefficient (ρ) comparing between tumor RNA-Seq and cfDNA Proteus-predicted expression, tumor RNA-Seq and tumor IHC, and cfDNA Proteus-predicted expression and tumor IHC in the CRPC-RA and BLCA-RA cohorts. Tumor RNA-Seq and IHC values are the mean across sampled biopsies.

d) Receiver Operating Characteristic (ROC) curves with inset area under the ROC curve (AUC) and 95% CI within each patient cohort for determining gene activity based on a tumor RNA-Seq threshold of log2(CPM+1) >= 2.0. Top: all genes, bottom: Druggable Genome Tier 1.

e) Comparison of tumor RNA-Seq vs Proteus-predicted expression for four example therapeutic targets across four cohorts. Dot color indicates patient cohort while dot size indicates estimated tumor fraction. Diagonal line with shaded area denotes the linear fit and its 95% confidence interval.

f) Gene Set Variation Analysis (GSVA) scores for the neuroendocrine gene set (10 genes, NE.10) and the computed PAM50 basal score. Scores for tumor RNA-Seq were computed using the mean expression across tumor metastases (x-axes). One score is shown for each patient from Proteus predictions (y-axes). Diagonal line with shaded area denotes the linear fit and its 95% confidence interval.

We next explored whether the gene expression values predicted by Proteus can be used to inform the activity of known markers of tumor subtypes or therapeutic targets in patient samples. In the CRPC-RA and BLCA-RA cohorts, we found that predicted expression was strongly correlated with RNA-Seq levels, but both had variable correlation with immunohistochemistry (IHC) scores quantitating protein expression (Fig. 4c, Methods, Supplementary Table 4). This observation is consistent with the fact that mRNA abundance and protein expression levels may have a non-linear relationship and that IHC and RNA-Seq estimates will have technical variability. These results also highlight the need for careful selection of targets for which chromatin state and mRNA or protein abundance are concordant when quantitation of protein abundance is relevant. We further evaluated Proteus for classifying whether a gene is expressed (“on”; CPM > 1) or not expressed (“off”) as a dichotomous prediction and observed high performance across cohorts for all genes in both the transcriptome and in the DGT1 set (AUC ≥ 0.90) (Fig. 4d, Methods).

In the context of a practical application of Proteus, we evaluated the expression of specific drug targets, such as STEAP1, which is overexpressed in CRPC and is the target of Antibody-Drug Conjugate (ADC) and CAR-T therapies in several clinical trials.65,66 We observed significant correlation between ctDNA-predicted and tumor tissue RNA-Seq expression of STEAP1 in the CRPC cohorts (CRPC-RA and CRPC-WCDT; r = 0.65, p = 2.2×10−7) (Fig. 4e). There was also significant correlation between ctDNA predictions and the tumor RNA-Seq for SSTR2 (r = 0.74, p = 5.1×10−10), which is amplified in NE-like SCLC and is being investigated for both ADC and CAR-T therapies6769. Notably, specifically in the BLCA-RA cohort, ctDNA analyses with Proteus identified a strong correlation between ctDNA and tumor expression of NECTIN4 (r = 0.77, p < 10−10), a validated target for Enfortumab vedotin, an FDA-approved ADC to treat locally advanced and metastatic urothelial cancers70. DLL3 is expressed by neuroendocrine tumors, and is being investigated as a promising target of bispecific T-cell engager (BiTE), CAR-T, and radionuclide therapies in SCLC60,71,72. Proteus predicted the upregulation of DLL3 in patients with SCLC, as well as for patients in CRPC and BLCA cohorts with neuroendocrine (NE) differentiation (r = 0.77, p < 10−10).

To determine if the transcriptome-wide predictions by Proteus can be used in similar analyses typically performed for tumor RNA-Seq data, we assessed gene signatures and pathways using gene set enrichment analysis and tumor subtype classification, for example the PAM50 classifier of breast cancer that has also been applied to partition prostate cancer subtypes73 (Methods). For 50 hallmark gene sets74, the CCP31 gene set75, and CRPC molecular phenotype-defining gene sets76, we used gene set variation analysis (GSVA) to compute pathway enrichment scores and found that 42% of gene sets had at least moderate correlation between Proteus-derived ctDNA and tumor tissue RNA-Seq scores across the four cohorts (ρ > 0.3, Supplementary Fig. 4). Among CRPC molecular subtype-defining gene sets, the highest correlated was the NE-associated gene signature (NE.10, r = 0.77, 2.2×10−9), which is used to classify the NE phenotype in CRPC and useful for indicating NE histology in both SCLC and BLCA (Fig. 4f). We also applied the PAM50 classifier to the predicted expression profiles to identify basal and luminal subtypes and observed that the basal subtype score was significantly correlated with tumor RNA-Seq (r = 0.53, p = 3.2×10−4) (Fig. 4f). Altogether, these results showcase the broad utility of Proteus for clinically relevant applications to profile the expression of single targets and gene signatures that identify aggressive phenotypes, inform prognosis, and aid in treatment selection.

De novo gene expression analysis from proteus identifies differentially expressed genes and biologically relevant pathways in CRPC

To further assess the potential clinical utility of cfDNA-based gene expression inference, we employed Proteus to identify potential targets or pathway signatures associated with subtypes of tumors seen in clinical practice where classification is relevant. Using plasma samples from a cohort of CRPC patients collected at the University of Washington (UW)9, we performed differential gene expression analysis on Proteus predictions between patients with dominant ARPC (n = 14) and NEPC (n = 3) histology determined by tumor IHC (Fig. 5a). We identified genes differentially up-regulated in CRPCs classified as ARPC (e.g. AR, FOLH1) and NEPC (e.g. CHGA, SRRM4) that were consistent with lineage-defining gene signatures. Because matched tumor RNA-Seq data was not available, we compared the identified genes to those differentially expressed in CRPC PDX tumor RNA-Seq data (ARPC: n = 17, NEPC: n = 6) and confirmed the concordance of 86% and 50% of significantly differentially upregulated genes in NEPC and ARPC, respectively (Fig. 5a, Supplementary Table 5, Methods). Gene set enrichment analysis of the hallmark cancer pathways using the predicted expression further revealed concordant, significant enrichment scores with PDX tumor RNA-Seq data, including androgen response in ARPC (Fig. 5b). Several immune-related pathways were also enriched in ctDNA from ARPC patients, as predicted by Proteus, but were not detected in PDX tumors. This is consistent with both decreased immune activity in NE-like tumors7779 and the immunodeficiency of the mice, and suggests predictions are able to capture gene expression of patient-specific immune responses.

Figure 5. De novo identification of differentially expressed genes and prognostic pathway enrichment in CRPC cohorts using Proteus expression prediction.

Figure 5.

a) Volcano plot showing genes identified as differentially up- and down-regulated using limma in patients with ARPC (n = 14) vs. NEPC (n = 3) based on predicated expression from Proteus in the UW CRPC cohort. Labels and highlights (green circles) indicate that the same gene was significantly up- or down-regulated in the same direction in the same analysis performed using RNA-Seq from LuCaP PDX tumors (ARPC: n = 17, NEPC: n = 6).

b) Significantly enriched Hallmark pathways between ARPC and NEPC samples identified using GSVA from either LuCaP PDX tumor RNA-Seq (left) or Proteus-predicted expression in the UW cohort (right). Enrichment scores (NEPC:ARPC) and -log(P-value) from GSVA.

c) Cox forest plot showing the univariate hazard ratios per standard deviation change of all available baseline clinical markers (top), including estimated tumor fraction (TFx), and enriched gene set pathways (bottom, showing the 5 lowest and 5 highest HR pathways) for the Pluvicto cohort based on completed cycles before exit from the clinical trial.

d) Kaplan-Meier curves for the clinical imagining variable MEAN_SUV_PSMA and predicted expression-based pathway score ALLOGRAFT_REJECTION representing the clinical and Proteus-derived biomarkers with the lowest hazard ratios from c), using the median value in each variable as the group threshold.

e) Kaplan-Meier curves for the clinical imagining variable Volume_FDG and predicted expression-based pathway score NE.10 representing the clinical and Proteus-derived biomarkers with the highest hazard ratios from c), using the median value in each variable as the group threshold.

To evaluate whether Proteus can predict molecular signatures associated with prognosis and treatment outcomes, we analyzed ctDNA samples from 55 CRPC patients at baseline prior to receiving the FDA-approved radioligand therapy Pluvicto (177Lu-PSMA-617)80 (Supplementary Table 1). Due to limited outcome data measures, we used the number of treatment cycles as a proxy for treatment durability and inferred progression (Methods). Further, as we confirmed that tumor fraction is a major prognostic biomarker as shown previously8184 (HR 1.79, 95% CI 1.30–2.46, p = 3×10−4, Fig. 5c, Supplementary Table 6), we explored other potential pathways or signatures predicted by Proteus which may provide biological insights. Both neuroendocrine (NE.10) and cell-cycle progression (CCP.31) signatures were associated with increased risk (HR > 1.67) for progression (fewer treatment cycles), while immune activation signatures (allograft rejection, inflammatory response) were associated with reduced risk of progression (more treatment cycles) (Fig. 5c, Supplementary Table 6). These pathway risk scores from Proteus predictions had significance comparable to, or greater than available clinical measures, including serum PSA levels, volume by FDG-PET, and standardized uptake value (SUV) by PSMA-PET (Fig. 5c). The predicted expression of pathways with the highest (NE.10) and lowest (allograft rejection) HR was able to stratify patients into low- and high-risk progression groups (log-ranked p < 2.12 ×10−5) which were comparable to those informed by clinical measures (volume FDG and PSMA SUV, log-ranked p < 0.0137, Fig. 5d). These results suggest that the detection of an NE-like signature, which indicates possible NE differentiation, higher proliferation rates, and general lack of PSMA expression prior to Pluvicto treatment85 may be related to therapeutic resistance, and highlight the potential prognostic value of expression signatures predicted by Proteus.

DISCUSSION

Herein we have introduced two new methods for advancing the field of cfDNA analysis: Triton, a comprehensive cfDNA profiling tool which characterizes the DNA fragmentomic and chromatin landscapes at both the region and bp-level; and Proteus, a biologically informed deep learning framework with the highest accuracy to-date for inferring expression of individual genes using only standard depth WGS20,42,86. These methods work in concert to interrogate fragmentomic and chromatin signatures in both the promoter and full gene body, discovering bp-resolution patterns associated with gene activity. By employing deep learning to uncover fundamental relationships across the spectrum of cfDNA feature modalities, Proteus can analyze readily accessible, standardized cfDNA sequencing data without the need for specialized assays that are required for previously developed approaches12,1519,28. We further determined that Proteus is robust to experimental heterogeneity, demonstrating strong performance in unseen cohorts from other institutions, with varying assay preparations and sequencing depths. Proteus is also the first method to output cfDNA predictions in widely accepted expression units35, which allows it to be used with existing RNA-Seq-based analysis tools and methods.

Proteus outlines a new foundational cfDNA processing framework, capable of synthesizing local sequence context with distinct data types, both spatially and at the regional level. Nucleosome profiling and fragmentation signals can be easily exchanged or integrated with other feature modalities such as histone modifications, transcription factor binding activity, and DNA methylation. Transfer learning can also be used to customize the desired output metric, allowing for transferability to any use-case where the separation of background cfDNA and ctDNA states is needed. While we have demonstrated favorable performance for BLCA samples that were unseen by the model in a tissue-agnostic manner, Proteus may be tailored via fine-tuning to further improve performance for analyzing specific cancer and assay types. Because the model is gene-agnostic and trained to predict expression for any single gene, it is immediately extensible to targeted capture panels with custom gene sets. Furthermore, gene-network-aware post processing at the sample level may allow for improved applications in fields like early cancer detection and assessment of minimum residual disease. Proteus therefore represents a blueprint for advancement on which novel cfDNA applications can be built.

Through the use of Proteus to infer gene expression information conveyed by ctDNA, we have shown diverse applications with potential clinical utility including molecular subtyping, the identification of specific therapeutic targets, and the prediction of enriched cancer pathways. These applications can allow for minimally invasive multi-timepoint monitoring of specific gene targets and signatures while also providing the expression of the entire transcriptome. Overall, these capabilities support precision oncology and translational applications, such as active monitoring of drug targets and pharmacodynamics, determining response and resistance to treatment, and identifying tumor plasticity in real time. As a tool for discovery of new cfDNA biomarkers, Proteus further enables the rapid identification of potential therapeutic targets or expression signatures, only requiring standardized data that is accessible, cost-effective, and readily scalable.

Proteus represents a first step in using deep learning for transcriptional inference from cfDNA. Despite its strong performance, the final model leveraged individual gene expression profiles from only 11 patients and two tissue types for training. Deep learning models are inherently restricted by the variability in observed data; therefore, with larger and more diverse training data, the accuracy and generalizability of Proteus will likely improve. Furthermore, only a small cohort of healthy donors with matched cfDNA and RNA-Seq was used to train the base-model to predict non-tumor background expression. While this learning framework enables the model to robustly disentangle expression profiles of the tumor, additional training data is needed for Proteus to accurately discern the expression of immune components present in the blood of cancer patients. We have showcased the trained model in advanced cancers, including an unseen BLCA sample with as low as 8% TFx; however, additional optimization and training will be required to apply Proteus in the settings of minimal residual disease and early cancer detection. Finally, the model does not account for post-transcriptional or post-translational regulation, which may distance predictions from mRNA or protein levels for certain genes. However, the Proteus framework has been designed with the flexibility to incorporate future extensions to support the prediction of protein activity.

In summary, Proteus directly addresses a current challenge in predicting transcriptome-wide expression from cfDNA. Proteus represents a new framework for cfDNA deep learning applications and provides an opportunity for non-invasively profiling tumors at an unprecedented resolution and scale. Proteus has the potential to advance precision oncology through minimally invasive and time-sensitive expression profiling useful for diagnostic, prognostic, and therapeutic purposes.

ONLINE METHODS

PDX mouse models

LuCaP CRPC PDX Models

25 LuCaP PDX models with line-matched cfDNA sequencing and tumor tissue RNA-Seq were previously described and characterized9,87. Briefly, PDXs were propagated in vivo in male NOD/SCID IL2R-gamma-null (NSG) mice (cat. #005557), with tumor collection for initial establishment approved by the UW Human Subjects Division Institutional Review Board (IRB #2341). Phenotypes were determined through histopathology by at least two expert pathologists and validated using transcriptome-derived marker expression scores76,88,89 (ARPC, NEPC, and ARlow). Blood was collected via cardiac puncture from animals carrying PDX tumors measuring 300–1,400 mm³. Each sample was collected into Sarstedt K3-EDTA microtubes and processed within 4 h, undergoing sequential centrifugation at 2,500 × g for 10 min and 16,000 × g for 10 min (room temperature). Plasma from 4–8 mice was pooled per PDX line, transferred to screw-cap cryovials, and stored at −80 °C for subsequent cfDNA isolation. All animal procedures were conducted in accordance with NIH guidelines and approved by the Fred Hutchinson Cancer Center IACUC (protocol #1618).

Sequencing libraries were built from 50 ng of input cfDNA with the KAPA HyperPrep kit, using nine PCR cycles and a clean-up with in-house standard SPRI beads. KAPA UDI dual-indexed adapters were ligated, after which libraries were quantified, equalized, pooled for multiplexing, and sequenced on two instruments: an Illumina HiSeq 2500 (200-cycle kit) at the Fred Hutchinson Genomics Shared Resources and an Illumina NovaSeq employing S4 flow cells (300-cycle kit) through the Broad Institute Genomics Platform Walkup-Seq Services. To keep read lengths consistent, the NovaSeq output was trimmed to 200 cycles, producing 100 bp paired-end FASTQ files that match the HiSeq data.

Published data of bulk RNA-Seq from the LuCaP PDX tumors were sequenced and aligned as described previously73. In brief, sequencing reads were mapped to the hg38 human genome using STAR v2.7.3a. PDX data were also aligned to the mm10 mouse genome. All subsequent analyses were performed in R. PDX sequencing reads deriving from mouse were removed using XenofilteR. Gene level abundance was quantitated using GenomicAlignments.

SCLC PDX Models

25 SCLC and NSCLC PDX models with line-matched cfDNA sequencing and tumor tissue RNA-Seq were previously described and characterized48. Briefly, PDXs were generated from circulating tumor cells or tumor tissues as described previously9093 and propagated in vivo in male NOD/SCID IL2R-gamma-null (NSG) mice (cat. #005557) according to institutional protocols. Blood was collected via cardiac puncture from animals bearing flank tumors, with fresh blood from up to five mice of the same model pooled when available. Samples were processed within 4 h of collection, undergoing centrifugation at 2,500 × g for 10 min at room temperature to separate plasma, followed by a second centrifugation of plasma at 13,200 × g for 10 min to remove residual debris. Plasma and, when collected, buffy coat fractions were stored at −80 °C for subsequent DNA isolation. All animal procedures were performed in accordance with institutional guidelines and approved protocols.

Genomic DNA was extracted from buffy coat using the QIAamp DNA Blood Mini Kit (QIAGEN, 51104), while cfDNA was isolated from 0.5–4 ml of plasma using either the QIAamp MinElute ccfDNA Mini Kit (QIAGEN, 55204) or QIAamp Circulating Nucleic Acid Kit (QIAGEN, 55114), following manufacturer instructions. DNA concentration was measured using the Qubit dsDNA HS Assay Kit (Thermo Fisher Scientific, Q32851), and size distribution assessed on a Bioanalyzer using the DNA 1000 or HS DNA 1000 Kit (Agilent, 5067–1504 or 5067–4626). Plasma cfDNA libraries were prepared with either the NEBNext Ultra II DNA Library Prep Kit for Illumina (New England Biolabs, E7645 and E7335/E6609) or the xGen cfDNA & FFPE DNA Library Prep MC Kit with unique dual index primers (Integrated DNA Technologies, 10006203 and 10005922). For buffy coat genomic DNA, samples were mechanically sheared to ~250 bp modal size using a Bioruptor (Diagenode; low power, 30 s on / 90 s off, 4 °C, 30 min) and further fragmented as needed, followed by library construction using the xGen cfDNA & FFPE DNA Library Prep MC Kit. Libraries were quantified, assessed for fragment size distribution, pooled, and sequenced on Illumina platforms.

RNA was extracted from PDX tumor tissue or cell pellets using TRIzol reagent, and RNA-seq libraries were prepared from 500 ng total RNA with the NEBNext Ultra RNA Library Prep Kit for Illumina (New England Biolabs, E7530L). Sequencing was performed on an Illumina HiSeq 2500 instrument with 50 bp single-end reads. Reads were aligned using STAR v2.7.1a in two-pass mode to both the human genome (hg38, GENCODE v31) and mouse genome (mm10, GENCODE vM23), followed by removal of mouse-derived reads with XenofilteR (parameters per authors’ recommendations). QC metrics including fragment size, read quality, duplication rates, gene body coverage, and genomic region distribution were generated with FastQC v0.11.9 and RSeQC v4.0.0. Gene-level counts were obtained in an unstranded manner using FeatureCounts (Subread v1.6.5).

cfDNA sequence alignment and mouse subtraction

All cfDNA reads were re-aligned to the Broad GRCh38/hg38 human reference genome (retrieved from https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta) with BWA-MEM v0.7.1794; the complete Snakemake workflow and configuration files are available at https://github.com/GavinHaLab/fastq_to_bam_paired_snakemake. Alignment summaries, including read counts and the percentage of properly paired reads, were computed with Picard CollectAlignmentSummaryMetrics, and genome-wide coverage metrics were obtained with CollectWgsMetrics (GATK/Picard v4.1.8.1). For PDX ctDNA WGS datasets, mouse-derived reads were removed as previously described9,95,96.

Human cohorts

HD (Healthy Donor) cohorts

For the HD-L cohort with matched cfDNA and RNAseq from blood (N=10), blood was obtained from Bloodworks Bio (Seattle, WA; item 4500–01, 2 10ml ETDA whole blood tubes per donor). Blood samples were immediately placed on ice, delivered by courier, and processed within two hours from time of donor blood draw. Blood tubes were spun at 500g x 15 min at 4C. Plasma from both blood tubes was transferred without disturbing buffy coat layer into a clean 15cc tube and spun at 15,000g x 10 min at 4C. Double-spun plasma was aliquoted 1 ml each into 2 ml Sarstedt tubes and frozen at −80C until cfDNA extraction. cfDNA was manually extracted from up to 4 ml plasma using the QIAamp Circulating Nucleic Acit Kit (QIAGEN, Hilden Germany; Cat. No. 55114) following manufactures instructions. From the corresponding buffy coat, PBMCs were counted and divided for RNA extraction and PBMC isolation. PBMC cells for RNA extraction in phosphate-buffered saline (PBS) were spun down, supernatant was removed, and 2 ml PBS was added and transfered to a 1.5ml Eppendorf tube. Cells were centrifuged at 3,000 rpm for 5 min. Supernatant was removed, avoiding disruption of the pellet, and 0.35μl prepared Qiagen RLT + β-Mercaptoethanol (lysis solution) was added. Sample was vortexed to ensure cell lysis. Lysed cells were frozen at −80 C until RNA extraction. RNA was extracted with the QIAGEN RNeasy kit with on-column DNase digestion (Cat. No. 79254) following manufacture instructions.

For the HD-U cohort (N=10), plasma was purchased from Research Blood Components, LLC (Waterton, MA, Cat No 1–009, Fresh Plasma Single Donor 10ml). Plasma was separated by centrifugation of whole blood at 1600 g for 10 minutes at room temperature. Upper plasma layer was removed, leaving buffy coat undisturbed, and transferred to Cell-free DNA BCT Streck tubes. Plasma was shipped on dry ice and stored at −800C until cfDNA extraction. cfDNA was extracted using the QIAamp® Circulating Nucleic Acid kit (Qiagen, Hilden Germany). After thawing 4 mL of plasma on ice, samples were subject to a second spin at 15,000 g for 10 min at room temperature. Double-spun supernatant was removed, avoiding disrupting any remaining pellet, and processed according to manufacture instructions, eluting cfDNA with 55 μl Buffer AVE. cfDNA concentration was quantitated via Qubit 1X dsDNA High Sensitivity Assay Kit (Thermo Fisher Scientific, Cat No Q33230) on a Qubit 4 Fluorometer. cfDNA quality and fragment-size distribution was assessed from 3μl (range 100–4,000 pg/μl) cfDNA run with Agilent Cell-free DNA ScreenTape Assay (Agilent Technologies, Santa Clara, CA USA) on the Agilent 4200 TapeStation system following the manufacturer’s instructions.

CRPC-RA and BLCA-RA cohorts

All samples for both the CRPC-RA and BCLA-RA cohorts were obtained from participants who were provided written informed consent under the Genitourinary Cancer Biorepository at the University of Washington (UW IRB 2341), including specific consent for rapid autopsy. Normal and tumors specimens were collected through this rapid autopsy program, with tissues procured within 15 hours of death (median 5 hours, CRPC-RA cohort), and within 11.5 hours of death (median 3.4 hours, BLCA-RA cohort). Cardiac or peripheral venous blood was collected at the start of the rapid autopsy, using a 16–18-gauge needle and 20–50 mL syringe into Tiger Top (serum) or Purple Top (plasma) tubes. Samples were centrifuged at 1,600 g for 15 min, and the plasma or serum supernatant was aliquoted (560–600 μL per tube) using a Serum Bank pipette and stored at −80 °C until cfDNA extraction. 0.5 – 2 ml plasma or serum were used for cfDNA extraction as described for HD-U cohort samples.

For the CRPC-RA cohort, 35 total patients contributed blood for cfDNA, with 25 of the patients contributing 49 metastatic flash-frozen tumors for RNA-Seq as previously described10. Briefly, total RNA was isolated from tissue samples of CRPC metastases, which had been frozen in OCT (Tissue-Tek) with RNA STAT-60 (Tel-Test). Using an H&E-stained slide for each sample for orientation, 1-mm core punches of tumor were obtained. Alternatively, multiple sections enriched for tumor were cut using a Leica CM3050S cryostat.

In the BLCA-RA cohort, eight total patients contributed specimens comprising primary tumors (n = 7; 4 FFPE, 3 flash-frozen, 19–037 has no RNA-Seq primary), 2–5 metastases per patient (n = 28; all flash-frozen), and blood for cfDNA. Visceral metastases and matched primary tumors were embedded in OCT or formalin-fixed and paraffin-embedded (FFPE). Hematoxylin–eosin staining was performed on 5 μm FFPE sections as previously described97,98, and histology was defined based on FFPE pathology review.

SCLC cohort

The SCLC cohort was previously described96. Briefly, peripheral blood from patients was collected in KEDTA tubes, centrifuged at 2,500 × g for 10 min at room temperature, and plasma was separated by pipetting; for select samples, the buffy coat was also aspirated and stored at −80 °C. Plasma was further centrifuged at 13,200 × g for 10 min to remove residual debris, and the resulting supernatant was collected. Both plasma and buffy coat were stored at −80 °C until DNA isolation, with genomic DNA extracted from buffy coat using the QIAamp DNA Blood Mini Kit (QIAGEN, 51104) and cfDNA extracted from 0.5–4 ml plasma using either the QIAamp MinElute ccfDNA Mini Kit (QIAGEN, 55204) or the QIAamp Circulating Nucleic Acid Kit (QIAGEN, 55114). DNA concentration was measured with the Qubit dsDNA HS Kit (Thermo Fisher Scientific, Q32851), and fragment size distribution was assessed using a Bioanalyzer with DNA 1000 or HS DNA 1000 kits (Agilent, 5067–1504 or 5067–4626).

CRPC-WCDT (West Coast Dream Team) cohort

A subset of WCDT patients with matching cfDNA and tumor tissue RNA-Seq were previously described62,62 and were obtained with approval directly from the authors.

UW cohort

The UW WGS cohort was previously described9. Briefly, 31 men with mCRPC enrolled under UW IRB CC6932 (2014–2021) provided 61 plasma samples; all donors provided written informed consent. After ULP-WGS screening, 47 samples from 27 patients with tumor fraction > 3% (ichorCNA, GRCh37) and three additional AR-amplified samples (FH0243_E_1_A, FH0345_E_1_A, FH0482_E_1_A) were advanced to high-depth WGS. Only the high-depth samples were used in this study, along with their clinical histology.

Pluvicto cohort

Peripheral blood from men with mCRPC was collected at the UW into Streck Cell-Free DNA BCT (Streck, part #230470) or BD Vacutainer K2-EDTA (BD, part#366643) blood collection tubes. All donors provided written informed consent for research participation under UW IRB CC6932. Samples were kept at room temperature and processed within 4 hours (K2-EDTA) or 36 hours (Streck DNA) from collection. Each sample was transferred to a 15 mL conical tube and centrifuged at 2500 x g for 10 minutes. The plasma layer was transferred to a new 15 mL conical and centrifuged at 3000 x g for 15 minutes to pellet any remaining debris. The double spun plasma was portioned into 1 mL aliquots in 1.5 mL screwcap tubes (Celltreat, part#230820) and stored at −80C. The buffy coat ring and 1 mL of packed red blood cells was aliquoted into 1.5mL screwcap tubes (Celltreat, part#230820) and stored at −80C.2 ml of plasma per study subject was submitted for the Broad Clinical Blood Biopsy protocol (Broad Clinical Labs, LLC (Cambridge, MA). cfDNA was purified from plasma using the QIAsymphony DSP Circulating DNA kit (QIAGEN, Cat. No. 937556) on the QIAsymphony SP instrument. cfDNA was eluted TE buffer and quantified via picogreen.

cfDNA library preparation and sequencing

For HD-U, BLCA-RA, CRPC-RA, and Pluvicto cohort cfDNA, library preparation and sequencing were performed by Broad Clinical Labs, LLC (Cambridge, MA). Initial cfDNA input was normalized to be within 25–52ng in 50μl TE buffer (10mM Tris HJCL 12mM EDTA, pH 8.0) according to PicoGreen quantification, with no shearing of cfDNA prior to library construction. Library preparation was performed using a commercially available kit provided by KAPA Biosystems (KAPA HyperPrep Kit with Library Amplification product KK8504) and IDT xGen UDI-UMI duplex adapters (Integrated DNA Technologies, Coralville, IA, USA). Unique 8-base dual index sequences embedded within the p5 and p7 primers (IDT) are added during PCR. Enzymatic clean-up was performed using Beckman Coulter AMPure XP SPRI beads (Beckman Coulter, Brea, CA; Cat No A63880) with elution volumes reduced to 30μl to maximize library concentration. Library quantification was performed using the Invitrogen Quant-It broad range dsDNA quantification assay kit (Thermo Fisher Scientific, Cat No Q33130) with a 1:200 PicoGreen dilution. Following quantification, each library was normalized to a concentration of 35 ng/μL, using Tris-HCl, 10mM, pH 8.0. In preparation for ULP libraries, approximately 4 μL of the normalized library is transferred into a new receptacle and further normalized to a concentration of 2ng/μL using Tris-HCl, 10mM, pH 8.0. Following normalization, up to 95 ultra-low pass WGS samples are pooled together using equivolume pooling. The pool is quantified via qPCR and normalized to the appropriate concentration to proceed to sequencing. Cluster amplification of library pools was performed according to the manufacturer’s protocol (Illumina) using Exclusion Amplification cluster chemistry and NovaSeq SP flowcells. Flowcells were sequenced on a NovaSeq 6000 Sequencer using the XP workflow and a v1.5 300 cycle NovaSeq SP kit. Each pool of ultra-low pass whole genome libraries is run on one lane using paired 151bp runs. Following initial ULP sequencing, selected ULP libraries underwent further sequencing. Libraries were initially normalized to 2 ng/μL, pooled, then quantified via qPCR using a KAPA Biosystems kit that employs probes specific to the ends of the adapters. Based on the qPCR results, the libraries were adjusted to 2.2 nM before proceeding to the next sequencing stage using an automated Agilent Bravo liquid handling platform. Pools were denatured with Sodium Hydroxide, diluted using an Illumina-provided Pre Load Buffer, and transferred to a uniquely-barcoded 8 lane strip tube with a Hamilton Starlet liquid handler. Strip tubes were loaded into a 300 cycle NovaSeq X 25B kit and the run was initiated with a 151 base paired end, dual-indexed read structure. Based on pool size, the number of lanes were calculated to ensure samples reach desired mean coverage.

For HD-L cohort, cfDNA was quantified using Life Technologies’ Invitrogen Qubit 2.0 Fluorometer (Thermo Fisher, Waltham, MA) and sequencing libraries were prepared using 100 ng cfDNA using the IDT xGen cfDNA & FFPE DNA Library Prep v2 MC and xGen Indexing Primers (Integrated DNA Technologies, Inc., Coralville, IA). Library quantification was performed using Life Technologies’ Invitrogen Qubit® 2.0 Fluorometer and size distribution validated using an Agilent 4200 TapeStation (Agilent Technologies, Santa Clara, CA). Individual libraries were pooled 10-plex at equimolar concentrations and sequenced on an Illumina NovaSeq X Plus (Illumina, Inc, San Diego, CA) over four lanes of a 10B-300 flow cell employing a paired-end, 150-base read length sequencing configuration. Average per-library sequencing output was 509.4M read pairs (range 327.2M-642.3M).

ctDNA tumor–normal admixtures

Admixtures for model training and benchmarking (PHT and PHH sets) were generated using all available PDX (LuCaP CRPC n = 25, SCLC = 25) and HD (HD-L (with RNA-Seq) n = 10, HD-U (no RNA-Seq) n = 10) samples. All mixtures comprised a single PDX and single HD sample. For training sets, no down-sampling was performed in order to replicate the variable depth and TFx seen in patients (n = 100 total mixtures). Complete mixture descriptions are available in Supplementary Table 1. BAM files were merged with SAMtools; the full Snakemake workflow for merging and down-sampling is available at https://github.com/GavinHaLab/Admixtures_snakemake.

RNA-Seq processing

Published data of bulk tumor RNA-Seq for LuCaP PDX models, SCLC PDX models, the SCLC patient cohort, and the CRPC-WCDT cohort were previously described62,73,96. Bulk tumor RNA-Seq for the CRPC-RA cohort was sequenced and aligned as described previously73. In brief, sequencing reads were mapped to the hg38 human genome using STAR v2.7.3a. All subsequent analyses were performed in R, with gene level abundance quantitated using GenomicAlignments. For HD-L cohort, RNA was quantified using a Lunatic spectrophotometer (Unchained Labs, Pleasanton, CA) and RNA quality was confirmed using an Agilent 4200 TapeStation (Agilent Technologies, Santa Clara, CA). Sequencing libraries were prepared using the TruSeq Stranded mRNA kit (Illumina, Inc., San Diego, CA) using 250 ng total RNA input. Library fragment size distributions were confirmed using Agilent 4200 TapeStation and library yields were determined using Life Technologies’ Invitrogen Qubit 2.0 Fluorometer (Thermo Fisher, Waltham, MA). Sequencing libraries were pooled in equimolar ratios and sequenced on an Illumina NovaSeq X Plus (Illumina, Inc, San Diego, CA) over one lane of a 10B-300 flow cell employing a paired-end, 50-base read length sequencing configuration. Average per-library sequencing output was 111.2M read pairs (range 101.2M-122.4M).

For the BLCA-RA cohort, RNA was isolated from 42 flash-frozen autopsy specimens embedded in OCT using a combined protocol of RNA STAT-60 (Tel-Test, Friendswood, TX) and the RNeasy Mini Kit with in-solution DNase digestion prior to column purification (Qiagen, Germantown, MD). For eight FFPE surgical specimens, RNA was extracted with the AllPrep DNA/RNA FFPE Kit (Qiagen, Germantown, MD). RNA quality was assessed by RNA integrity number (RIN) on an Agilent Bioanalyzer (Agilent Technologies, Santa Clara, CA). For RNA sequencing, libraries were prepared from 300 ng total RNA using the TruSeq RNA Exome Sample Preparation Kit following the manufacturer’s instructions (Illumina, San Diego, CA). Barcoded libraries were pooled and sequenced on an Illumina HiSeq 2500 to generate 50 bp paired-end reads.

Both internal and external RNA-Seq cohort gene-level counts were aligned to the MANE Select99 v1.3 primary transcripts based on either ENTREZ IDs or gene symbols, as available. Sample-level raw counts for all cohorts underwent Trimmed Mean of M-values (TMM) re-scaling as implemented in edgeR35 to normalize differences in library sizes. Normalized raw-counts were then converted into Counts Per Million (CPM) and subsequently into log2(CPM+1) for use as training labels and in all downstream RNA-Seq analyses. For samples with RNA-Seq from multiple tissue biopsies, log2(CPM+1) values were averaged across biopsies to obtain a single representative transcriptome per sample.

Patient tumor fraction estimation

For externa data sets (SCLC, UW and CRPC-WCDT cohorts) the provided tumor fraction estimates were used. In the BLCA-RA and CRPC-RA cohorts, tumor fraction was estimated for cfDNA WGS samples using the TITAN pipeline (v1.15.0; [https://github.com/GavinHaLab/TitanCNA_SV_WGS], commit ID: [bedbd76]). For four samples in the BLCA-RA cohort and all samples in the CRPC-RA cohort with matched tissue normal WGS, standard tumor–normal paired analysis was performed. For the remaining four BLCA-RA samples without matched WGS normals, a tumor-only pipeline was applied. TITAN-derived results were manually curated to determine the optimal solution for each sample, from which tumor fraction estimates were obtained. In the Pluvicto cohort, tumor content in patient plasma cfDNA was quantified with ichorCNA100 using default settings and 1 Mb bins. ichorCNA’s default tumor-fraction output was manually reviewed and optimal solutions were selected based on expert curation.

Immunohistochemistry (IHC)

Tumor tissues from bladder and prostate rapid autopsy cases (CRPC-RA and BLCA-RA cohorts) were evaluated for target protein expression using validated immunohistochemical assays. Details regarding the sample cohorts, immunohistochemistry protocols, and expert pathologist assessment have been previously described98,101103.

PET imagining metrics

Whole-body tumor segmentation was conducted using a semi-automated workflow to derive quantitative metrics (MIM Software Inc.). For PSMA-PET, an absolute 3 SUV threshold was utilized, while for 18F-FDG-PET, the threshold was based on normal liver background (liver SUVmean × 1.5 + 2 × standard deviation). After segmenting the entire tumor volume, images were visually inspected, and sites of physiological activity were removed. Total tumor volume (TTV), SUVmax, and SUVmean were extracted from the segmented tumor burden.

Triton nucleosome profiling and fragmentomic analysis

Overview

Triton converts mapped cell-free DNA (cfDNA) reads into nucleotide-resolution “spatial” signal profiles and matched region-level biomarkers. The user supplies a BED style coordinate set as either (i) single genomic regions, for example full length gene bodies (“region” mode), (ii) fixed-width windows centered on an index such as transcription-start sites (“window” mode, default ±1 kb), or (iii) collections of windows that are first “piled-up” across many instances of the same regulatory element (“composite-window” mode). For every site/region or composite, Triton extracts properly-paired fragments (default 15–500 bp) from the BAM/CRAM, applies per-fragment corrections (see below), and finally reports (i) bp-resolution tracks (depth, fragment end coverage, phased-nucleosome (PN) signal, fragment-length metrics, and reference-sequence context) and (ii) region-level fragmentation and nucleosome-phasing metrics, summarized into a tab-separated feature table. Complete outputs of Triton include (see below for specific details):

“Spatial” bp-resolutions signals (n = 11)

  • GC-corrected coverage / fragment pileup

  • Fragment end coverage

  • PN signal

  • Fragment length short:long ratio

  • Fragment length diversity index

  • Fragment length Shannon entropy

  • Called peaks

  • A (Adenine) frequency (based on the provided reference genome)

  • C (Cytosine) frequency (based on the provided reference genome)

  • G (Guanine) frequency (based on the provided reference genome)

  • T (Thymine) frequency (based on the provided reference genome)

“Region-level” scalar features (n = 12)

  • Fragment length mean

  • Fragment length standard deviation

  • Fragment length median

  • Fragment length median absolute deviation (MAD)

  • Fragment length short:long ratio

  • Fragment length diversity index

  • Fragment length Shannon entropy

  • PN amplitude

  • PN score

  • PN spacing

  • Mean coverage depth

  • Ratio of variation

Fragment filtering and GC correction

After removing PCR duplicates, reads with mapping quality < 20, and fragments with length outside 15–500 bp, GC correction is performed. Triton adopts the fragment-level bias tables produced by Griffin21 (fragment length × GC%) but will accept any bias-correction file in the same format. Briefly, for every fragment-length/GC-content combination, Griffin counts (i) aligned reads in the BAM file and (ii) genomic sites with that same length and GC composition, then calculates bias as the read-to-site ratio. Bias values across all GC bins for a given fragment length are rescaled to a mean of 1 and subsequently smoothed with k-nearest-neighbor averaging across adjacent length/GC bins to yield final GC-correction factors. During iteration over every fragment within Triton processing of an individual site, the bias factor is looked up (fragment_bias) and the fragment is accepted only if 0.05 (0.10 for region mode) < bias < 10. Its contribution to depth is then rescaled by 1 / fragment_bias. This single-step normalization removes both GC- and length-dependent coverage artifacts before any downstream signal processing. This bias correction is not applied when calculating fragment length distributions or end coverage, wherein the 5’ and 3’ ends of all fragments passing filtering are piled up.

Fragment-length re-weighting for nucleosome-center localization

To produce the phased-nucleosome (PN) signal and related features, fragments ≥146 bp (the canonical length of a fully wrapped nucleosome DNA footprint) are re-weighted to reflect the empirical probability that a given base along each fragment coincides with a histone dyad. This is achieved with a pre-computed weight matrix (NCDict.pkl), where each row encodes the displacement density for one fragment length, meaning longer fragments contribute less sharply than core-sized fragments and reflect underlying uncertainty in wrapping symmetry. The weight matrix was formed by running “nc_dist.py” on healthy donor cfDNA samples. This is a simplified version of Triton which runs in “composite-window” mode and returns arrays with counts of displacements between fragment centers and the central site, for each observed fragment length. A set of 186 sites representing high-confidence nucleosome centers was used; iNPS104 peak BED files across multiple tissues were downloaded from NucMap (https://download.cncb.ac.cn/nucmap/organisms/latest/Homo_sapiens/data_type/iNPS_peaks/, retrieved all BED files as of 2023–07-20) and sites with “MainPeak” identity intersected using BEDTools v2.30.0. Only sites with >50% overlap across all experiments were kept, resulting in 186 sites with nucleosome peak locations consistent across human tissues and conditions. Raw displacement profiles generated by Triton (nc_dist.py) for those sites and samples for each possible fragment length then underwent Gaussian smoothing. Finally, displacement profiles were fit to a central Gaussian and/or symmetric flanking Gaussian profile using scipy.optimize curve_fit (v1.8.0) to create the final, smoothed fragment weight matrix (visualized in Figure 1d). At run time Triton retrieves nc_density, the smoothed profile from the weight matrix, for each given fragment length and uses that value (divided by the fragment’s GC bias) to create the nucleosome-center signal (nc_signal) used downstream in PN signal generation.

Fourier filtering to obtain the phased-nucleosome (PN) signal

To generate the PN signal, the nc_signal derived from GC corrected and fragment-length re-weighted fragments is then processed with a low-pass filter by performing a fast Fourier transform (FFT; scipy.fft v1.8.0) on the signal plus 500 bp flanking regions used as padding. This process removes high-frequency components corresponding to periods shorter than 146 bp before the signal is reconstructed (via inverse FFT). This cutoff ensures that the signal used for downstream analyses reflects the minimum possible internucleosomal spacing, and excludes peak pileups unrelated to dominant nucleosome phasing.

PN-signal spatial profiles and region-level features

In addition to reporting the bp-resolution GC-corrected coverage profile and PN signal, Triton also performs peak calling of maxima and minima on the PN signal to localize peaks and troughs in the nucleosome signal (encoded as +1 and −1) as well as the minus- and plus-one nucleosome positions and NDR location (encoded as −2, +2, and 3 respectively; only reported in “window” mode). From the PN signal and reported peaks the following region-level features are derived: PN amplitude which measures signal to noise ratio in nucleosome positioning (mean PN signal amplitude, based on called peaks, normalized to the mean PN signal), PN score which measures strength of phasing in two linker regimes (the ratio of mean amplitudes in the 180–210 bp periodicity to 150–180 bp periodicity ranges, based on FFT), PN spacing (mean internucleosomal distance, based on called peaks), and ratio of variation (fraction of the PN signal max height spanned by the signal).

Fragmentomic spatial profiles and region-level features

For each site Triton records a histogram of observed fragment lengths (default 5–500 bp) which overlap the region at any point (“region-level”) as well as histograms for every point in the region, which include all fragments overlapping each single bp locus (“spatial”). Calculation of fragment length-derived features is performed identically based on both region-level and spatial histograms. Fragmentomic features measuring heterogeneity in the fragment length distribution are calculated at both the spatial (histograms at every point in the region) and region-level: short:long ratio, the ratio of short (<= 150 bp) to long (>150 bp) fragments; diversity index, the proportion of unique fragment lengths to total fragments; and Shannon entropy, the normalized information entropy. To calculate Shannon entropy E, let c=c1,,cn be the counts of fragments of each length; then pi=cij=1ncj and E=1log2ni=1npilog2pi. Remaining fragmentomic features were found to be less informative at bp-resolution, and are calculated only at the region-level: fragment length mean, standard deviation, median, and median absolute deviation MAD;medianXimedian(X).

Proteus deep learning method for gene expression prediction

Overview

Proteus is a multi-input deep neural network that predicts single gene expression from cfDNA nucleosome positioning and fragmentomic profiles and features produced by Triton. For all samples used in both training and prediction, Triton was first run on MANE Select99 v1.3 full transcripts (gene bodies, using region mode) and TSS +/−1000 bp windows (promoter regions, using window mode) to generate training and testing data. The model was further designed to (i) disentangle tumor-derived signal from background hematopoietic signal and (ii) use tumor fraction as a prior when recombining these components. Concretely, Proteus encodes each gene’s input features into two parallel probabilistic latent spaces, one intended to capture tumor-originating information, and one intended to capture background/whole-blood information, using β-VAEs. A contrastive loss term encourages separation between these latents, while a learned, tumor fraction-conditioned recombination of the latents must reconstruct the original scalar features, providing an internal self-consistency constraint. Expression is then predicted from each latent, with predictions conditioned on a soft “on/off” activity classification score also derived from the latents. Proteus contains 2,006,380 total parameters (of which 2,001,564 are trainable), is implemented in TensorFlow v2.15.0105, and was trained and run using individual NVIDIA RTX2080ti GPUs on the Fred Hutch computing core.

Inputs, outputs, and pre-processing

Triton spatial signals:

For each gene, Triton exports an 11-channel, 2,000 bp, TSS-centered profile (TSS_images, shape 2000×11). Seven channels correspond to fragmentomic/coverage-derived signals, and four channels are the one-hot encoded reference nucleotide sequence (A/C/G/T). The first six coverage and fragmentomic channels are mean-normalized at the gene level (peak-location and nucleotide channels are left as-is).

Triton region-level features:

Two 12-feature vectors summarize promoter and gene body features: TSS_scalars and GB_scalars. All features are standard-scaled at the sample level except for mean-depth which undergoes a natural log transformation.

Reference features:

A 3-feature gene-level “reference” head (ref_scalars, length 3) encodes sample-agnostic transcript meta data which is log-transformed before undergoing standard scaling at the genome-level (NumExons, mRNA_size, and transcript_length retrieved directly from MANE Select reference annotations).

Tumor fraction:

A per-sample TFx_label provides the estimated tumor fraction used during recombination of latents and is concatenated to the scalar feature vectors.

Targets and derived labels:

For each gene Proteus was trained on two regression targets: tumor-originating expression and background/whole-blood expression, both provided as TMM-scaled log2(CPM+1). For auxiliary supervision we also formed soft class activity labels (sigmoids of the targets) to indicate “on/off” expression: yclass=σy2.00.5 with center 2.0 (3 CPM) and scale 0.5.

Output.

Proteus outputs two continuous predicted values per gene, tumor and background components, each in TMM-scaled log2(CPM+1) as well as corresponding soft “on/off” ([0, 1]) probabilities.

Architecture

Signal smoothing block:

The 2000×11 TSS tensor is split into seven Triton signal channels and a 2000×4 nucleotide block. Each non-nucleotide channel is passed through its own 1D convolutional block to perform pre-processing and smoothing. The 4-channel nucleotide block is processed with a Conv2D kernel spanning the four bases (kernel = (4, 6)) to capture short 6-mer context across positions. These eight outputs are then concatenated along the channel axis.

Higher order refinement and spatial reduction:

To capture both fine and long-range structure across channels, two parallel stacks of Conv2D layers with different spatial dilations, interleaved with 1D average pooling, are used to reduce the dimensionality of the 2,000 bp axis. A “small-receptive-field” stream (dilation = 1, k = 6) captures local structure, while a “large-receptive-field” stream (dilation = 11, k = 15) captures periodicity/phasing and broader patterns. Outputs are concatenated and flattened, then aggressively reduced with Dense(1024) → Dropout → Dense(256) → Dropout → Dense(64) to form a 64-dimensional signal embedding.

Scalar addition:

The signal embedding is concatenated with TSS_scalars (12), GB_scalars (12), ref_scalars (3), and TFx_label (1), producing a 92-dimensional fused vector y.

Dual β-VAEs:

Two identical MLP encoders transform y into tumor and background latent Gaussians (means and log-variances). Each encoder is a halving funnel starting at 128 units which reduces to a latent size of 32 (default) with BatchNorm and Dropout between Dense layers. The tumor and background latents are sampled via reparameterization during training, while the means are used at inference.

Latent recombination and reconstruction:

Tumor and background latents are then combined with a small gating MLP that is re-conditioned on the supplied TFx, and which outputs Softmax weights over {tumor, background}. Means and (log) variances are merged analytically (mixture of Gaussians; implemented in the MergeLatents layer) to yield a recombined latent, which is sampled and decoded to reconstruct the fused input vector y. This internal reconstruction forces the two latents to carry complementary, physically meaningful information without trying to reconstruct the entire 2,000 bp signal.

Uncertainty-aware gating and prediction heads:

For each branch Proteus then computes a per-latent confidence from the negative log-variance and generates element-wise gates (Dense → sigmoid). The gated latents for both tumor and background are then passed to two decoders:

  • Classification head: gated latent → MLP (64→16) → Dense(1) → sigmoid to output the soft “on/off” probability.

  • Regression head: the latent is concatenated with its (pre-sigmoid) classification logit and passed through another MLP (64→16) → Dense(1) with a ClippedReLU activation capped at 19.93 (=log2(10^6+1)), guaranteeing outputs remain in valid log2(CPM+1) bounds.

Ablation models that omit the sequence channels or use only sequence share the same backbone with corresponding inputs disabled.

Loss function

Training minimizes the sum of “external” supervised losses and an “internal” disentangling loss:

External losses (per-gene):
  • Tumor / background regression: masked focal MSE on log2(CPM+1) with focusing parameter γ (γ=0 reduces to masked MSE).

  • Tumor / background classification: masked focal loss on the soft “on/off” labels with parameters (α,γ).

All regression/classification losses are masked to ignore invalid/NaN labels that occur during semi-supervised training.

Internal loss (applied with custom layers):
  • β-VAE Kullback-Leibler (KL) divergence on each latent z:
    βVAEKLqztumorN(0,1)+KLqzbackgroundN(0,1)
  • Contrastive separation between latents with margin m:
    βcontrastivemax0,mztumorzbackground22
  • Reconstruction of the fused vector y from the recombined latent:
    βreconstructionyy^22

The internal loss is further multiplied by an overall weight to scale it relative to the external loss (internal_weight, default 2.5).

Optimization and schedule:

AdamW with cosine decay was used for all training. For base training, the initial learning rate (LR) and decay horizon were set via the number of steps per epoch and an epoch-level decay (see below). For transfer learning we lowered the starting LR and decayed faster.

Training sets

We first generated 80 in silico admixed samples using ctDNA from 40 PDX (20 CRPC, 20 SCLC) samples with line-matched tumor RNA-Seq and 10 newly sequenced HD cfDNA samples with matched whole-blood RNA-Seq. These were combined with the pure input samples and 20 cfDNA samples from patients without RNA-Seq (for unsupervised domain learning) which we split without overlap into a 5-fold cross-validation (CV) training set representing ~2.7 million individual gene examples (“BTS” set, Supplementary Table 2). Validation on unseen samples was then performed with 10 additional PDX models (5 PC, 5 SCLC) and mixtures generated with those same PDX and 10 unseen HD cfDNA samples without matched RNA-Seq (“BTS_HO” set, Supplementary Table 2). For fine-tuning of the model, we used 11 patient samples with one or more matched tumor biopsy RNA-Seq labels which were doped with 14 of the base-model in-silico mixtures (“TL-NoBC” set, Supplementary Table 2).

Hyperparameter tuning (base model and fine-tuning)

We tuned iteratively in three rounds (architecture → learning schedule → loss/regularization) using Bayesian optimization across 5-fold CV, conducted using either the BTS (30 samples/~540,000 examples per fold) or patient fine-tuning set (TL-NoBC, 3 samples/~54,000 examples per fold). Representative hyperparameters include:

Architecture:

number of signal filters; nucleotide-interaction kernel size, number of reduction stages, final/reduced number of signal channels, dilation size for the large receptive field stream, final signal dimensions, encoder/decoder depth, latent representation size, and prediction head sizes.

Training:

initial LR (cosine decay), total decay epochs, cosine floor α, and dropout rate.

Loss/regularization:

all loss component weights βclassification,βVAE,βcontrastive,βreconstruction,βbackground, focal (α,γ) for classification; focal-MSE γ for regression, and global weight decay strength.

Transfer-learning–specific:

freeze level (unfreezing progressively less of the backbone), transfer-learning initial LR choices, decay epochs, focal (α,γ), and global weight decay.

Training paradigm

Data pipeline:

Per-sample HDF5s are written using Triton outputs and then converted to TFRecords (one gene per record). We excluded genes with any NaNs across required modalities. TFRecords were merged by preassigned CV folds for hyper parameter tuning. Datasets were built with tf.data (interleave TFRecordDataset across files), parsed to typed tensors, batched (32), cached on disc, and prefetched.

Cross-validation training.

For each of 5 folds, we trained on 4 folds and validated on the held-out fold. Unless hyper-parameter tuning was active, we trained for a fixed number of epochs determined through tuning (default 10; 5 when warm-starting from base weights for transfer-learning), saved fold weights, and computed validation predictions for reporting.

Finalized training with early stopping.

For the definitive PDX-mixture base model and for patient fine-tuning we trained on the combined training folds with a 10% internal validation split (last 10% of the already shuffled stream), using early stopping on val_tumor_exp_masked_mse (patience 5, restore best). Base training used a higher initial LR, whereas transfer learning used a lower initial LR, fewer decay epochs, and froze the model backbone up to the encoders, unfreezing only from the latent layers onward.

Model Interpretability

SHAP Analysis

To better characterize feature importance of the input signals and scalars, we performed SHAP (SHapley Additive exPlanations) analysis (shap v0.46.0). SHAP is an interpretability technique founded in cooperative game theory that attempts to assign feature attributions by computing the weighted average difference in output for all permutations of the input features with and without the given input. We executed the SHAP analysis using the Gradient Explainer method. To generate the SHAP values, we used a random set of 50,000 training samples as the background dataset, and a random set of 5,000 training samples as the test set. To further generate SHAP “polarity-impact” scores we calculated bp-level polarity=PosUp+NegDownPosDownNegUp/Ntotal where Pos/Neg indicate points with SHAP >/< 0 and Up/Down indicate input signals with points above/below the mean. Impact refers to the median absolute SHAP value across all tested samples. Polarity-impact is the product of these two spatial scores, where high values indicate that higher signal consistently drives higher output and low values indicate the reverse relationship is true, with values near 0 indicating low impact or inconsistent polarity.

Kernel Visualization

In addition to the SHAP analysis, we examined model interpretability through kernel visualization, inspecting the learned convolutional kernels after the model was trained. For the first convolutional layer, visualizing the learned weights can help to give a sense of how the model is processing the raw input, as the learned kernels act directly upon the input sequence. The first layer of the signal-specific convolutional network consists of 16 4×6 (4 nucleotides, over a length of 6 bp) convolutional kernels which act directly on the nucleotide sequences. By visualizing the learned 4×6 kernels, we can get a sense of potential nucleotide motifs that the model may be identifying as important to highlight and promote in this initial processing step. To enhance interpretability, we transformed the raw 4×6 kernels to position-weight-matrices (PWMs) which were exported in MEME Motif Format. The sequence logos derived from these PWMs were generated using the python package Logomaker v0.8.6. We also directly visualized the 64 15×11 kernels associated with the large context signal interaction layer (15 width kernel with dilation rate of 10 for an effective window of 150 bp, over 11 TSS region input signals), along with their absolute mean signals (AMS). These kernels give insights into the attention paid by the model to multiple signals at once while reducing the promoter representation, with large, parallel weights on one or more bands indicating the model has learned a meaningful coupling between them.

Gene set enrichment and PAM50 scoring

Predicted and RNA-Seq derived expression values underwent identical pre-filtering: Genes with >50% missing values or with CPM < 1 in 2 or more samples in either group (RNA-Seq and predicted, cohort-level) were dropped, KNN imputation with k = 3 was used to impute any remaining missing values, and the common set of final genes were used in both groups for final analysis. The 31-gene Cell Cycle Progression (CCP3175), Hallmark (v2024.1)74, and CRPC molecular subtype-defining76 signatures were derived by running GSVA106 with minSize = 2, kcdf = ”Gaussian” on genome-wide log(CPM+1) RNA-Seq or predicted expression level data, yielding single-sample enrichment scores.

Samples were assigned to PAM50 categories using the classification method described previously107. We restricted the classification to LumA, LumB and Basal, removing Her2 and Normal samples from the training set and centroid scores before classification.

Differential gene expression and pathway enrichment

Gene Set Enrichment Analysis (GSEA) was performed to investigate transcriptomic differences between Neuroendocrine Prostate Cancer (NEPC) and Androgen Receptor Prostate Cancer (ARPC) subtypes across two independent cohorts: a clinical cohort from the University of Washington (UW) and a patient-derived xenograft cohort (LuCaP-PDX). Samples with tumor fraction below 0.15 were excluded, and only pure ARPC and NEPC samples were retained within the UW cohort to ensure clean representation. Genes with ≥50% missing values were removed from analysis. Differential gene expression analysis was conducted using limma-trend, appropriate for log2(CPM+1) transformed RNA-seq data, with design matrices constructed to contrast NEPC versus ARPC subtypes. The Benjamini-Hochberg procedure was applied to control for multiple testing, with adjusted p-values <0.05 considered significant. GSEA was performed using the fgsea package with the Hallmark gene sets (MSigDB v2024.1), filtering for pathways with 15–500 genes. Transcriptomic signatures were visualized using volcano plots highlighting top differentially expressed genes (ranked by fold change) and pathway enrichment dot plots. Comparative analyses between cohorts included calculation of overlap statistics for differentially expressed genes and visualization of shared significant pathways. All analyses were implemented in R (v4.0) using the limma, fgsea, and GSEABase packages, with visualization through ggplot2.

Statistical analyses

Pearson correlation coefficients (“r”) and Spearman’s rank correlation coefficients (“ρ”) along with their associated p-values were calculated using SciPy108 v1.15.1. Receiver Operating Characteristic (ROC) curves and associated Area Under the Curves (AUC) were derived using scikit-learn109 v1.6.1. Univariate hazard ratios and Kaplan Meier survival curves delineated by median score along with associated p-values were computed using lifelines110 v0.30.0.

Supplementary Material

Supplement 1
media-1.pdf (896.4KB, pdf)

ACKNOWLEDGEMENTS

We thank the many patients and their families for their altruistic contributions to this study. We thank members of the Ha and Nelson research groups for helpful discussions. We thank Dr. Robert. B Montgomery and the Molecular Correlates Biospecimen Blood Collection and Sample Delivery Clinical Research Support Team, and the Biospecimen Processing & Biorepository Shared Resources of Fred Hutch for assistance with blood processing. We also thank the Genomics & Bioinformatics Shared Resource (RRID:SCR_022606) of the Fred Hutch/University of Washington/Seattle Children’s Cancer Consortium (P30CA015704), the Stuart and Molly Sloan Precision Oncology Institute, and the Institute for Prostate Cancer Research clinicians and staff who support the University of Washington rapid autopsy program and the LuCAP PDX program. This work was supported by the National Institutes of Health (DP2 CA280624, R01CA280056, P50CA097186, P30CA015704, P01CA163227, R50CA27433, R01CA281801), CDMRP awards (W81XWH-21–1-0513, PC230582), a Prostate Cancer Foundation Young Investigator Award (R.D.P., generously funded by the Milken family), and the FHCC Scientific Computing Infrastructure award (ORIP Grant S10OD028685).

Footnotes

COMPETING INTERESTS

A patent application has been filed on methodologies developed from this work.

G.H. Receives research support from Pfizer; Consulting with Quest Diagnostics; all activities unrelated to this work.

A.I. received fees for advisory works from Curium, ITM, Novartis, Lantheus, Bayer, Boston Scientific and Ambrx/J&J (through institution). AI receives research funding from Novartis, SNMMI and NCI/NIH.

All other authors declare no competing interests.

Data availability

Software and pipeline configurations and custom code written for this study will be accessible upon publication at https://github.com/GavinHaLab/Proteus (Proteus) and https://github.com/GavinHaLab/Triton (Triton).

Prostate cancer: Raw whole genome sequencing data for PDX cfDNA will be accessible upon publication at NCBI BioProject PRJNA900550. Raw sequencing data for patient plasma cfDNA are not publicly available because patients did not consent to genomic data sharing but are available upon reasonable request from the corresponding authors. Raw PDX RNA-Seq data is available at GEO accession GSE199596. Raw CRPC-RA RNA-Seq data is available at GEO accessions GSE147250 and GSE228283.

Lung cancer: Raw whole genome sequencing data for PDX and patient cfDNA will be accessible upon publication at dbGaP accession phs003570.

Bladder cancer: Raw whole genome sequencing data for patient cfDNA will be accessible upon publication at dbGaP accession phs001797. Raw RNA-Seq data will be accessible at GEO accession GSE302273.

Processed patient-plasma RNA-Seq expression data and cfDNA-based predictions, as well as the custom analysis pipelines and configurations used in this study are available at https://github.com/GavinHaLab/ProteusManuscript. All other processed data necessary to reproduce the results are found in the Supplementary Tables and at https://github.com/GavinHaLab/ProteusManuscript.

Code availability

Triton and Proteus software for cfDNA characterization and gene expression inference is available on GitHub and can be accessed upon publication at https://github.com/GavinHaLab/Triton and https://github.com/GavinHaLab/Proteus. The source code is provided in the submission files for manuscript reviewers.

REFERENCES

  • 1.Bao H. et al. Early detection of multiple cancer types using multidimensional cell-free DNA fragmentomics. Nat. Med. 1–9 (2025) doi: 10.1038/s41591-025-03735-2. [DOI] [PubMed] [Google Scholar]
  • 2.Batool S. M. et al. The Liquid Biopsy Consortium: Challenges and opportunities for early cancer detection and monitoring. Cell Rep. Med. 4, 101198 (2023). [Google Scholar]
  • 3.Bruhm D. C. et al. Genomic and fragmentomic landscapes of cell-free DNA for early cancer detection. Nat. Rev. Cancer 1–18 (2025) doi: 10.1038/s41568-025-00795-x. [DOI] [PubMed] [Google Scholar]
  • 4.Kim S. Y. et al. Cancer signature ensemble integrating cfDNA methylation, copy number, and fragmentation facilitates multi-cancer early detection. Exp. Mol. Med. 1–16 (2023) doi: 10.1038/s12276-023-01119-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Ulz P. et al. Inference of transcription factor binding from cell-free DNA enables tumor subtype prediction and early detection. Nat. Commun. 10, 4666 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Liu M. C., Oxnard G. R., Klein E. A., Swanton C. & Seiden M. V. Sensitive and specific multi-cancer detection and localization using methylation signatures in cell-free DNA. Ann. Oncol. Off. J. Eur. Soc. Med. Oncol. 31, 745–759 (2020). [Google Scholar]
  • 7.Chauhan P. S. et al. Urine cell-free DNA multi-omics to detect MRD and predict survival in bladder cancer patients. Npj Precis. Oncol. 7, 1–6 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Thakral D., Das N., Basnal A. & Gupta R. Cell-free DNA for genomic profiling and minimal residual disease monitoring in Myeloma- are we there yet? Am. J. Blood Res. 10, 26–45 (2020). [PMC free article] [PubMed] [Google Scholar]
  • 9.De Sarkar N. et al. Nucleosome Patterns in Circulating Tumor DNA Reveal Transcriptional Regulation of Advanced Prostate Cancer Phenotypes. Cancer Discov. 13, 632–653 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Labrecque M. P. et al. Molecular profiling stratifies diverse phenotypes of treatment-refractory metastatic castration-resistant prostate cancer. J. Clin. Invest. 129, 4492–4505 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Berchuck J. E. et al. Detecting Neuroendocrine Prostate Cancer Through Tissue-Informed Cell-Free DNA Methylation Analysis. Clin. Cancer Res. 28, 928–938 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Helzer K. T. et al. Fragmentomic analysis of circulating tumor DNA-targeted cancer panels. Ann. Oncol. 34, 813–825 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Woodhouse R. et al. Clinical and analytical validation of FoundationOne Liquid CDx, a novel 324-Gene cfDNA-based comprehensive genomic profiling assay for cancers of solid tumor origin. PLOS ONE 15, e0237802 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Chung D. C. et al. A Cell-free DNA Blood-Based Test for Colorectal Cancer Screening. N. Engl. J. Med. 390, 973–983 (2024). [DOI] [PubMed] [Google Scholar]
  • 15.Wang J. et al. Terminal modifications independent cell-free RNA sequencing enables sensitive early cancer detection and classification. Nat. Commun. 15, 156 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Shetti D. et al. Emerging role of circulating cell-free RNA as a non-invasive biomarker for hepatocellular carcinoma. Crit. Rev. Oncol. Hematol. 200, 104391 (2024). [Google Scholar]
  • 17.Cabús L., Lagarde J., Curado J., Lizano E. & Pérez-Boza J. Current challenges and best practices for cell-free long RNA biomarker discovery. Biomark. Res. 10, 62 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Nesselbush M. C. et al. An ultrasensitive method for detection of cell-free RNA. Nature 641, 759–768 (2025). [DOI] [PubMed] [Google Scholar]
  • 19.Esfahani M. S. et al. Inferring gene expression from cell-free DNA fragmentation profiles. Nat. Biotechnol. 40, 585–597 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Ulz P. et al. Inferring expressed genes by whole-genome sequencing of plasma DNA. Nat. Genet. 48, 1273–1278 (2016). [DOI] [PubMed] [Google Scholar]
  • 21.Doebley A.-L. et al. A framework for clinical cancer subtyping from nucleosome profiling of cell-free DNA. Nat. Commun. 13, 7475 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Zhu G. et al. Tissue-specific cell-free DNA degradation quantifies circulating tumor DNA burden. Nat. Commun. 12, 2229 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Snyder M. W., Kircher M., Hill A. J., Daza R. M. & Shendure J. Cell-free DNA comprises an in vivo nucleosome footprint that informs its tissues-of-origin. Cell 164, 57–68 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Cristiano S. et al. Genome-wide cell-free DNA fragmentation in patients with cancer. Nature 570, 385–389 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Mathios D. et al. Detection and characterization of lung cancer using cell-free DNA fragmentomes. Nat. Commun. 12, 5060 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Peneder P. et al. Multimodal analysis of cell-free DNA whole-genome sequencing for pediatric cancers with low mutational burden. Nat. Commun. 12, 3230 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Rao S. et al. Transcription factor-nucleosome dynamics from plasma cfDNA identifies ER-driven states in breast cancer. Sci. Adv. 8, eabm4358 (2022). [Google Scholar]
  • 28.Zhou Q. et al. Epigenetic analysis of cell-free DNA by fragmentomic profiling. Proc. Natl. Acad. Sci. 119, e2209852119 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Baca S. C. et al. Liquid biopsy epigenomic profiling for cancer subtyping. Nat. Med. 1–5 (2023) doi: 10.1038/s41591-023-02605-z. [DOI] [PubMed] [Google Scholar]
  • 30.Bai J. et al. Histone modifications of circulating nucleosomes are associated with changes in cell-free DNA fragmentation patterns. Proc. Natl. Acad. Sci. 121, e2404058121 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Noë M. et al. DNA methylation and gene expression as determinants of genome-wide cell-free DNA fragmentation. Nat. Commun. 15, 6690 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Flores O., Deniz Ö., Soler-López M. & Orozco M. Fuzziness and noise in nucleosomal architecture. Nucleic Acids Res. 42, 4934–4946 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Mattox A. K. et al. The Origin of Highly Elevated Cell-Free DNA in Healthy Individuals and Patients with Pancreatic, Colorectal, Lung, or Ovarian Cancer. Cancer Discov. 13, 2166– 2179 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Corchete L. A. et al. Systematic comparison and assessment of RNA-seq procedures for gene expression quantitative analysis. Sci. Rep. 10, 19737 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Chen Y., Chen L., Lun A. T. L., Baldoni P. L. & Smyth G. K. edgeR v4: powerful differential analysis of sequencing data with expanded functionality and improved support for small counts and larger datasets. Nucleic Acids Res. 53, gkaf018 (2025). [Google Scholar]
  • 36.Gillis J. & Pavlidis P. ‘Guilt by association’ is the exception rather than the rule in gene networks. PLoS Comput. Biol. 8, e1002444 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Qi T. et al. Cell-Free DNA Fragmentomics: The Novel Promising Biomarker. Int. J. Mol. Sci. 24, 1503 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Klemm S. L., Shipony Z. & Greenleaf W. J. Chromatin accessibility and the regulatory epigenome. Nat. Rev. Genet. 20, 207–220 (2019). [DOI] [PubMed] [Google Scholar]
  • 39.Saxton D. S. & Rine J. Nucleosome Positioning Regulates the Establishment, Stability, and Inheritance of Heterochromatin in Saccharomyces cerevisiae. Proc. Natl. Acad. Sci. U. S. A. 117, 27493–27501 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Valouev A. et al. Determinants of nucleosome organization in primary human cells. Nature 474, 516–520 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Deal R. B., Henikoff J. G. & Henikoff S. Genome-wide kinetics of nucleosome turnover determined by metabolic labeling of histones. Science 328, 1161–1164 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Zhang M. et al. Gene expression profiles based on maternal plasma cfDNA nucleosome footprints indicate fetal development and maternal immunity changes during pregnancy progress. Placenta 159, 84–92 (2025). [DOI] [PubMed] [Google Scholar]
  • 43.Li K. et al. Maternal plasma cell-free DNA nucleosome footprints can reveal changes in gene expression profiles during pregnancy and pre-eclampsia. BMC Pregnancy Childbirth 25, 304 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Ivanov M., Baranova A., Butler T., Spellman P. & Mileyko V. Non-random fragmentation patterns in circulating cell-free DNA reflect epigenetic regulation. BMC Genomics 16, S1 (2015). [Google Scholar]
  • 45.Chen X. et al. Transcriptional Start Site Coverage Analysis in Plasma Cell-Free DNA Reveals Disease Severity and Tissue Specificity of COVID-19 Patients. Front. Genet. 12, 663098 (2021). [Google Scholar]
  • 46.Levitsky V. G., Zykova T. Yu., Moshkin, Y. M. & Zhimulev, I. F. Nucleosome Positioning around Transcription Start Site Correlates with Gene Expression Only for Active Chromatin State in Drosophila Interphase Chromosomes. Int. J. Mol. Sci. 21, 9282 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Zhou Z. et al. Fragmentation landscape of cell-free DNA revealed by deconvolutional analysis of end motifs. Proc. Natl. Acad. Sci. U. S. A. 120, e2220982120 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Hiatt J. B. et al. Molecular phenotyping of small cell lung cancer using targeted cfDNA profiling of transcriptional regulatory regions. Sci. Adv. 10, eadk2082 (2024). [Google Scholar]
  • 49.McIntyre L. M. et al. RNA-seq: technical variability and sampling. BMC Genomics 12, 293 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Kelley D. R. et al. Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res. 28, 739–750 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51.Avsec Ž. et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat. Methods 18, 1196–1203 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52.Linder J., Srivastava D., Yuan H., Agarwal V. & Kelley D. R. Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation. Nat. Genet. 57, 949–961 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53.Tang F. et al. Chromatin profiles classify castration-resistant prostate cancers suggesting therapeutic targets. Science 376, eabe1505 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54.Nouruzi S. et al. ASCL1 activates neuronal stem cell-like lineage programming through remodeling of the chromatin landscape in prostate cancer. Nat. Commun. 13, 2282 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55.Romero R. et al. The neuroendocrine transition in prostate cancer is dynamic and dependent on ASCL1. Nat. Cancer 5, 1641–1659 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56.Qu S. et al. Molecular Subtypes of Primary SCLC Tumors and Their Associations With Neuroendocrine and Therapeutic Markers. J. Thorac. Oncol. 17, 141–153 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57.Qi J., Zhang J., Liu N., Zhao L. & Xu B. Prognostic Implications of Molecular Subtypes in Primary Small Cell Lung Cancer and Their Correlation With Cancer Immunity. Front. Oncol. 12, 779276 (2022). [Google Scholar]
  • 58.Baine M. K. et al. SCLC Subtypes Defined by ASCL1, NEUROD1, POU2F3, and YAP1: A Comprehensive Immunohistochemical and Histopathologic Characterization. J. Thorac. Oncol. 15, 1823–1835 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59.Chou J. et al. Immunotherapeutic Targeting and PET Imaging of DLL3 in Small-Cell Neuroendocrine Prostate Cancer. Cancer Res. 83, 301–315 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60.Puca L. et al. Delta-like protein 3 expression and therapeutic targeting in neuroendocrine prostate cancer. Sci. Transl. Med. 11, (2019). [Google Scholar]
  • 61.Wermke M. et al. Phase I trial of the DLL3/CD3 bispecific T-cell engager BI 764532 in DLL3-positive small-cell lung cancer and neuroendocrine carcinomas. Future Oncol. 18, 2639–2649 (2022). [DOI] [PubMed] [Google Scholar]
  • 62.Herberts C. et al. Deep whole-genome ctDNA chronology of treatment-resistant prostate cancer. Nature 608, 199–208 (2022). [DOI] [PubMed] [Google Scholar]
  • 63.Finan C. et al. The druggable genome and support for target identification and validation in drug development. Sci. Transl. Med. 9, eaag1166 (2017). [Google Scholar]
  • 64.Kamoun A. et al. A Consensus Molecular Classification of Muscle-invasive Bladder Cancer. Eur. Urol. 77, 420–433 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 65.Bhatia V. et al. Targeting advanced prostate cancer with STEAP1 chimeric antigen receptor T cell and tumor-localized IL-12 immunotherapy. Nat. Commun. 14, 2041 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 66.Danila D. C. et al. STEAP1 as a predictive biomarker for antibody-drug conjugate (ADC) activity in metastatic castration resistant prostate cancer (mCRPC). J. Clin. Oncol. 33, 5029–5029 (2015). [Google Scholar]
  • 67.Mandriani B. et al. Development of anti-somatostatin receptors CAR T cells for treatment of neuroendocrine tumors. J. Immunother. Cancer 10, e004854 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 68.Şen F. et al. In-Vivo Somatostatin-Receptor Expression in Small Cell Lung Cancer as a Prognostic Image Biomarker and Therapeutic Target. Cancers 15, 3595 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 69.Si Y. et al. Anti-SSTR2 antibody-drug conjugate for neuroendocrine tumor therapy. Cancer Gene Ther. 28, 799–812 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 70.Klümper N. et al. NECTIN4 Amplification Is Frequent in Solid Tumors and Predicts Enfortumab Vedotin Response in Metastatic Urothelial Cancer. J. Clin. Oncol. 42, 2446–2455 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 71.Su P.-L. et al. DLL3-guided therapies in small-cell lung cancer: from antibody-drug conjugate to precision immunotherapy and radioimmunotherapy. Mol. Cancer 23, 97 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 72.Serrano A. G. et al. Delta-like ligand 3 (DLL3) landscape in pulmonary and extra-pulmonary neuroendocrine neoplasms. Npj Precis. Oncol. 8, 1–13 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 73.Coleman I. M. et al. Therapeutic Implications for Intrinsic Phenotype Classification of Metastatic Castration-Resistant Prostate Cancer. Clin. Cancer Res. Off. J. Am. Assoc. Cancer Res. 28, 3127–3140 (2022). [Google Scholar]
  • 74.Liberzon A. et al. The Molecular Signatures Database Hallmark Gene Set Collection. Cell Syst. 1, 417–425 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 75.Cuzick J. et al. Prognostic value of an RNA expression signature derived from cell cycle proliferation genes in patients with prostate cancer: a retrospective study. Lancet Oncol. 12, 245–255 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 76.Bluemn E. G. et al. Androgen Receptor Pathway-Independent Prostate Cancer Is Sustained through FGF Signaling. Cancer Cell 32, 474–489.e6 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 77.Watanabe R. et al. Spatial Gene Expression Analysis Reveals Characteristic Gene Expression Patterns of De Novo Neuroendocrine Prostate Cancer Coexisting with Androgen Receptor Pathway Prostate Cancer. Int. J. Mol. Sci. 24, 8955 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 78.Li T. & Giaccone G. Advances in biology and novel treatments of SCLC. Semin. Cancer Biol. 96, 1–2 (2023). [DOI] [PubMed] [Google Scholar]
  • 79.Bhinder B. et al. Immunogenomic Landscape of Neuroendocrine Prostate Cancer. Clin. Cancer Res. Off. J. Am. Assoc. Cancer Res. 29, 2933–2943 (2023). [Google Scholar]
  • 80.Sartor O. et al. Lutetium-177–PSMA-617 for Metastatic Castration-Resistant Prostate Cancer. N. Engl. J. Med. 385, 1091–1103 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 81.Kwan E. M. et al. Lutetium-177–PSMA-617 or cabazitaxel in metastatic prostate cancer: circulating tumor DNA analysis of the randomized phase 2 TheraP trial. Nat. Med. 1–15 (2025) doi: 10.1038/s41591-025-03704-9. [DOI] [PubMed] [Google Scholar]
  • 82.Fonseca N. M. et al. Prediction of plasma ctDNA fraction and prognostic implications of liquid biopsy in advanced prostate cancer. Nat. Commun. 15, 1828 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 83.Reichert Z. R. et al. Prognostic value of plasma circulating tumor DNA fraction across four common cancer types: a real-world outcomes study. Ann. Oncol. Off. J. Eur. Soc. Med. Oncol. 34, 111–120 (2023). [Google Scholar]
  • 84.Beltran H. et al. Circulating tumor DNA profile recognizes transformation to castration-resistant neuroendocrine prostate cancer. J. Clin. Invest. 130, 1653–1668 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 85.Bakht M. K. & Beltran H. Biological determinants of PSMA expression, regulation and heterogeneity in prostate cancer. Nat. Rev. Urol. 22, 26–45 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 86.Markus H. et al. Analysis of recurrently protected genomic regions in cell-free DNA found in urine. Sci. Transl. Med. 13, eaaz3088 (2021). [Google Scholar]
  • 87.Lam H.-M., Nguyen H. M. & Corey E. Generation of Prostate Cancer Patient-Derived Xenografts to Investigate Mechanisms of Novel Treatments and Treatment Resistance. in Prostate Cancer: Methods and Protocols (ed. Culig Z.) 1–27 (Springer, New York, NY, 2018). doi: 10.1007/978-1-4939-7845-8_1. [DOI] [Google Scholar]
  • 88.Beltran H. et al. Divergent clonal evolution of castration resistant neuroendocrine prostate cancer. Nat. Med. 22, 298–305 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 89.Nyquist M. D. et al. Combined TP53 and RB1 Loss Promotes Prostate Cancer Resistance to a Spectrum of Therapeutics and Confers Vulnerability to Replication Stress. Cell Rep. 31, 107669 (2020). [Google Scholar]
  • 90.Daniel V. C. et al. A Primary Xenograft Model of Small-Cell Lung Cancer Reveals Irreversible Changes in Gene Expression Imposed by Culture In vitro. Cancer Res. 69, 3364–3373 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 91.Augert A. et al. Targeting NOTCH activation in small cell lung cancer through LSD1 inhibition. Sci. Signal. 12, eaau2922 (2019). [Google Scholar]
  • 92.Caeser R. et al. Genomic and transcriptomic analysis of a library of small cell lung cancer patient-derived xenografts. Nat. Commun. 13, 2144 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 93.Simpson K. L. et al. A biobank of small cell lung cancer CDX models elucidates inter- and intratumoral phenotypic heterogeneity. Nat. Cancer 1, 437–451 (2020). [DOI] [PubMed] [Google Scholar]
  • 94.Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at 10.48550/arXiv.1303.3997 (2013). [DOI] [Google Scholar]
  • 95.Jo S.-Y., Kim E. & Kim S. Impact of mouse contamination in genomic profiling of patient-derived models and best practice for robust analysis. Genome Biol. 20, 231 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 96.Hiatt J. B. et al. Molecular phenotyping of small cell lung cancer using targeted cfDNA profiling of transcriptional regulatory regions. Sci. Adv. 10, eadk2082 (2024). [Google Scholar]
  • 97.Jana S. et al. Transcriptional-translational conflict is a barrier to cellular transformation and cancer progression. Cancer Cell 41, 853–870.e13 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 98.Ghali F. et al. Metastatic Bladder Cancer Expression and Subcellular Localization of Nectin-4 and Trop-2 in Variant Histology: A Rapid Autopsy Study. Clin. Genitourin. Cancer 21, 669–678 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 99.Morales J. et al. A joint NCBI and EMBL-EBI transcript set for clinical genomics and research. Nature 604, 310–315 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 100.Adalsteinsson V. A. et al. Scalable whole-exome sequencing of cell-free DNA reveals high concordance with metastatic tumors. Nat. Commun. 8, 1324 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 101.Roudier M. P. et al. Patterns of intra- and intertumor phenotypic heterogeneity in lethal prostate cancer. J. Clin. Invest. 135, e186599 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 102.Lee H. J. et al. Patterns of HER2 Expression in Metastatic Prostate and Urothelial Cancers: Implications for HER2-Targeted Therapies. Cancer Res. Commun. (2025) doi: 10.1158/2767-9764.CRC-25-0069. [DOI] [Google Scholar]
  • 103.Ajkunic A. et al. Assessment of TROP2, CEACAM5 and DLL3 in metastatic prostate cancer: Expression landscape and molecular correlates. NPJ Precis. Oncol. 8, 104 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 104.Chen W. et al. Improved nucleosome-positioning algorithm iNPS for accurate nucleosome positioning from sequencing data. Nat. Commun. 5, 4909 (2014). [DOI] [PubMed] [Google Scholar]
  • 105.Developers T. TensorFlow. Zenodo 10.5281/zenodo.16547183 (2025). [DOI] [Google Scholar]
  • 106.Hänzelmann S., Castelo R. & Guinney J. GSVA: gene set variation analysis for microarray and RNA-Seq data. BMC Bioinformatics 14, 7 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 107.Parker J. S. et al. Supervised Risk Predictor of Breast Cancer Based on Intrinsic Subtypes. J. Clin. Oncol. 27, 1160–1167 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 108.Virtanen P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 109.Pedregosa F. et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011). [Google Scholar]
  • 110.Davidson-Pilon C. lifelines: survival analysis in Python. J. Open Source Softw. 4, 1317 (2019). [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplement 1
media-1.pdf (896.4KB, pdf)

Data Availability Statement

Software and pipeline configurations and custom code written for this study will be accessible upon publication at https://github.com/GavinHaLab/Proteus (Proteus) and https://github.com/GavinHaLab/Triton (Triton).

Prostate cancer: Raw whole genome sequencing data for PDX cfDNA will be accessible upon publication at NCBI BioProject PRJNA900550. Raw sequencing data for patient plasma cfDNA are not publicly available because patients did not consent to genomic data sharing but are available upon reasonable request from the corresponding authors. Raw PDX RNA-Seq data is available at GEO accession GSE199596. Raw CRPC-RA RNA-Seq data is available at GEO accessions GSE147250 and GSE228283.

Lung cancer: Raw whole genome sequencing data for PDX and patient cfDNA will be accessible upon publication at dbGaP accession phs003570.

Bladder cancer: Raw whole genome sequencing data for patient cfDNA will be accessible upon publication at dbGaP accession phs001797. Raw RNA-Seq data will be accessible at GEO accession GSE302273.

Processed patient-plasma RNA-Seq expression data and cfDNA-based predictions, as well as the custom analysis pipelines and configurations used in this study are available at https://github.com/GavinHaLab/ProteusManuscript. All other processed data necessary to reproduce the results are found in the Supplementary Tables and at https://github.com/GavinHaLab/ProteusManuscript.


Articles from bioRxiv are provided here courtesy of Cold Spring Harbor Laboratory Preprints

RESOURCES