Profiling of repetitive RNA sequences in the blood plasma of patients with cancer

Roman E Reggiardo; Sreelakshmi Velandi Maroli; Vikas Peddu; Andrew E Davidson; Alexander Hill; Erin LaMontagne; Yassmin Al Aaraj; Miten Jain; Stephen Y Chan; Daniel H Kim

doi:10.1038/s41551-023-01081-7

. 2023 Aug 31;7(12):1627–1635. doi: 10.1038/s41551-023-01081-7

Profiling of repetitive RNA sequences in the blood plasma of patients with cancer

Roman E Reggiardo ¹, Sreelakshmi Velandi Maroli ², Vikas Peddu ¹, Andrew E Davidson ¹, Alexander Hill ¹, Erin LaMontagne ¹, Yassmin Al Aaraj ³, Miten Jain ^1,^8,⁹, Stephen Y Chan ³, Daniel H Kim ^1,^4,^5,^6,^7,^✉

PMCID: PMC10727983 PMID: 37652985

Abstract

Liquid biopsies provide a means for the profiling of cell-free RNAs secreted by cells throughout the body. Although well-annotated coding and non-coding transcripts in blood are readily detectable and can serve as biomarkers of disease, the overall diagnostic utility of the cell-free transcriptome remains unclear. Here we show that RNAs derived from transposable elements and other repeat elements are enriched in the cell-free transcriptome of patients with cancer, and that they serve as signatures for the accurate classification of the disease. We used repeat-element-aware liquid-biopsy technology and single-molecule nanopore sequencing to profile the cell-free transcriptome in plasma from patients with cancer and to examine millions of genomic features comprising all annotated genes and repeat elements throughout the genome. By aggregating individual repeat elements to the subfamily level, we found that samples with pancreatic cancer are enriched with specific Alu subfamilies, whereas other cancers have their own characteristic cell-free RNA profile. Our findings show that repetitive RNA sequences are abundant in blood and can be used as disease-specific diagnostic biomarkers.

Subject terms: Transcriptomics, Non-coding RNAs, Computational biology and bioinformatics, Biomarkers, Tumour biomarkers

RNAs derived from transposable elements and other repeat elements are enriched in the cell-free transcriptome of patients with cancer and can serve as diagnostic signatures.

Main

Of the 3 billion base pairs in the human genome, approximately 75% are transcribed into RNA¹. The vast majority of these RNAs are not translated into proteins and are thus considered non-coding RNAs. Although non-coding RNAs such as microRNAs^2,3 and long non-coding RNAs^4,5 (lncRNAs) are well annotated, many other non-coding RNAs are generated throughout the genome, including RNAs transcribed from repeat elements such as transposable elements (TEs)⁶. There are over 5 million repeat element insertions in the human genome, with repeat sequences comprising roughly half the genomic sequence content⁷. TE RNAs in particular are aberrantly expressed in diseases such as cancer^8–13, highlighting their potential as abundant and specific biomarkers of disease^14,15.

Cell-free RNAs¹⁶ are released from cells that comprise the various tissues and organ systems throughout the human body^17,18. The diagnostic and prognostic potential of cell-free RNA is evidenced by the prediction of pre-eclampsia in pregnancy^19–21, and cell-free RNAs serve as biomarkers of diseases such as cancer^22–26 and Alzheimer’s disease^27–29. Cell-free RNAs have been profiled predominantly via whole-exome RNA sequencing (RNA-seq), which precludes the detection of repeat-derived and other non-coding RNA, or ribosomal-RNA-depleted total RNA-seq³⁰. Studies using total RNA-seq for cell-free RNA have identified many well-annotated non-coding RNAs in human plasma, and a small fraction of repeat-derived cell-free RNAs (1–2%) in healthy individuals³¹. However, the diagnostic potential of the repeat-derived cell-free RNA transcriptome in the context of disease remains unknown.

Here we report that repeat-aware profiling of the cell-free RNA transcriptome (COMPLETE-seq) enables the in-depth characterization of disease-specific, repeat-derived cell-free RNAs, and the accurate classification of patients with cancer by leveraging the rich feature space of the repeat-derived cell-free RNA transcriptome. In marked contrast to the cell-free RNA transcriptomes of healthy individuals, patients with cancer show strong enrichment of TE- and other repeat-derived cell-free RNAs in their blood, with patients with pancreatic cancer showing high levels of short interspersed nuclear element (SINE)-derived cell-free RNAs from various Alu subfamily elements. We further show the generalizability of COMPLETE-seq by showing that repeat-aware classification of liver, lung, oesophageal, colorectal and stomach cancer cell-free RNA data³² shows improved performance compared with repeat-naive classifiers. Taken together, our results show that repeat-aware COMPLETE-seq profiling of the cell-free RNA transcriptome identifies a robust and dynamic repeat-element-derived RNA signature for the diagnosis of diseases such as cancer.

Results

COMPLETE-seq enables repeat-aware profiling of the cell-free RNA transcriptome

We developed the COMPLETE-seq technology to enable repeat-aware characterization of the cell-free RNA transcriptome. To generate cell-free RNA-seq data from human plasma, we leveraged a highly sensitive RNA-seq protocol that robustly detects both coding and non-coding RNAs⁵. Given that the human genome contains millions of repeat element insertions that have not been examined in the context of cell-free RNA, we created a custom transcriptome annotation for cell-free RNA quantification that incorporates both well-annotated coding and non-coding RNAs (that is, GENCODE) and over 5 million repeat element insertions found in the human genome (that is, RepeatMasker). We then aggregated the RNA signal from individual repeat element insertions to the subfamily element level³³, reducing the number of repeat features from over 5 million to approximately 15,000 repeat features for disease classification and other downstream analyses (Fig. 1a).

Fig. 1 — a, Diagram of COMPLETE-seq RNA liquid-biopsy technology, highlighting the use of repeat-derived cell-free RNAs aggregated into a tractable feature set to enable diagnostic modelling. Created with BioRender.com. b, Comparison of mapping rates between use of a repeat-naive (GENCODE v.39) reference annotation (**P = 0.0039) and repeat-aware reference annotation (Wilcoxon, paired, two-sided). c, Comparison of gene detection distributions for each cohort across coding genes (GENCODE_coding; *P = 0.043), lncRNAs (GENCODE_lncRNA; *P = 0.035) and TE subfamilies (Wilcoxon, two-sided). For the box plots, the centre line represents the median, the box limits are upper and lower quartiles and whiskers represent 1.5× interquartile range. NS, not significant; panc., pancreatic cancer.

Compared with using only well-annotated GENCODE coding and non-coding genes (repeat naive) for cell-free RNA quantification, the application of our COMPLETE-seq technology significantly enhanced the percentage of mapped reads in our cell-free RNA data from patients with pancreatic cancer (Fig. 1b and Extended Data Fig. 1). For healthy control cell-free RNA, however, the mapping rate difference between repeat-naive versus our repeat-aware approach was negligible (Fig. 1b). Notably, there were no significant differences between the total number of repeat subfamilies that were represented in the cell-free RNA of patients with pancreatic cancer and healthy individuals (Fig. 1c).

Extended Data Fig. 1 — a, Comparison of age distributions between cohorts (Wilcoxon, two-sided, ns: p > 0.05) enter line, median; box limits, upper and lower quartiles; whiskers, 1.5x interquartile range. b, Number of samples, stratified by gender, in each cohort. c, Heatmap (K-means) of Pearson correlation between panc samples using Repeat-naïve quantification. d, Heatmap (K-means) of Pearson correlation between panc samples using Repeat-aware quantification. e, PCA dimensions 1 & 2 calculated using variance-stabilized, Repeat-naive quantifications for normal and panc samples. f, PCA dimensions 1 & 2 calculated using variance-stabilized, Repeat-aware quantifications for normal and panc samples. g, MA plot of log2FoldChange between panc and normal samples compared to log-scale baseMean derived from DESeq2. Significantly DE genes/subfamilies are full opacity and colour.

Both repeat-naive and repeat-aware quantification of the cell-free RNA transcriptomes of patients with pancreatic cancer and healthy individuals enabled robust, unsupervised disease identification in low-dimensional space via principal component analysis (Extended Data Fig. 1e,f). Compared with using only well-annotated GENCODE coding and non-coding genes for cell-free RNA quantification, however, the application of our repeat-aware technology increased the sample-to-sample correlation (Pearson) of cell-free RNA data from patients with pancreatic cancer (Extended Data Fig. 1c,d), indicating greater overall similarity when examining a more robust annotation of their cell-free RNA transcriptomes.

TE RNAs and other repeat element RNAs are enriched in pancreatic cancer cell-free RNA

To determine the repeat composition of cell-free RNA, we first examined repeat content at the superfamily level. In healthy individuals, less than 10% of the DESeq2-normalized cell-free RNA counts consistently corresponded to repeats. However, we found substantially larger fractions of repeat-derived cell-free RNAs across almost all patients with pancreatic cancer, with most of these cell-free RNAs being derived from SINE elements (Fig. 2a). We also found significant differences in the information content, as quantified by Shannon entropy³⁴, of protein-coding RNAs, lncRNAs, and long terminal repeat (LTR), SINE and simple repeat superfamilies (Fig. 2b). These differences in biotype or repeat superfamily transcriptome diversity suggest dynamic changes in both the abundance and identity of cell-free RNAs in the context of diseases such as cancer.

Fig. 2 — a, Distribution of biotype representation (by DESeq2-normalized count) in cell-free RNA-seq quantifications for samples from each cohort, coloured by GENCODE biotype or repeat subfamily, and facetted by stage (NA for healthy samples). b, Comparison of significantly different (Wilcoxon, two-sided) Shannon entropy distributions for GENCODE biotype (****P = 9.6 × 10⁻⁵, ***P = 0.00019) and repeat subfamilies (*P = 0.014, ****P = 3.1 × 10⁻⁸). c, Volcano plot of significantly (P < 0.01) differentially expressed genes or repeat subfamilies derived from repeat-aware quantification, with horizontal and vertical lines drawn at −log₁₀(0.01) and 0, respectively. d, Heat map (K means) of Z scores calculated from DESeq2-normalized counts of SINEs and simple repeats, with an average of at least five counts per sample across the dataset. For the box plots, the centre line represents the median, the box limits are upper and lower quartiles and whiskers represent 1.5× interquartile range. NA, not applicable.

To further characterize the features of repeat-derived cell-free RNAs, we used full-length complementary DNA to perform single-molecule sequencing using nanopore technology³⁵ on cell-free RNA samples from patients with pancreatic cancer that were also sequenced using Illumina technology. We examined the size distributions of protein-coding RNA, lncRNA, and long interspersed nuclear element (LINE)-, SINE- and LTR-derived cell-free RNAs from patients with pancreatic cancer, which revealed cell-free RNA transcripts up to 1,337 nt in length for protein-coding RNA (median 456 nt), 970 nt for lncRNA (median 303 nt), 1,002 nt for LINE (median 258 nt), 368 nt for SINE (median 185 nt) and 477 nt for LTR (median 167 nt) (Extended Data Fig. 2a). For SINE-derived cell-free RNA, we observed a bimodal length distribution, reflecting both full-length, ~300-nt-long Alu-derived RNA, along with a shorter species of Alu-derived RNA (Extended Data Fig. 2a). We then compared the expected length of SINE-derived RNA based on genomic alignment with the observed length of cell-free SINE RNA and found both full-length transcripts of expected size, and half-length cell-free SINE RNAs (Extended Data Fig. 2b,c). In addition, we compared the cell-free RNA abundances of all COMPLETE-seq-annotated genes and repeat subfamilies in matched nanopore and Illumina libraries, which revealed strong concordance between well-annotated coding and non-coding genes, with a bias towards Illumina for TE RNAs and towards nanopore for some simple repeat RNAs (Extended Data Fig. 3).

Extended Data Fig. 2 — a, Distribution of cell-free RNA lengths in base pairs (bp) for GENCODE biotypes or - repeat superfamily elements in pancreatic (panc 6, panc 7) cancer patients. b, Density plots depicting the relationship between expected (genomic SINE locus length) and observed SINE cell-free RNA length in pancreatic (panc 6, panc 7) cancer patients. c, Cumulative distribution function plot of SINE cell-free RNA length empirically calculated in pancreatic (panc 6, panc 7) cancer patients.

Extended Data Fig. 3 — a, Scatter plot depicting transcripts-per-million abundance for transcripts detected in matched nanopore and Illumina libraries from sample panc 7. Linear fit described. b, Scatter plot depicting transcripts-per-million abundance for transcripts detected in matched nanopore and Illumina libraries from sample panc 6. Linear fit described.

We next performed differential expression analysis and found that Alu subfamily elements were the most enriched TE signal in cell-free RNA from patients with pancreatic cancer, with AluY, AluSc, AluSg7, AluSc8, AluSx3 and AluSg subfamily elements among the most significantly enriched (P < 0.01) in patients with pancreatic cancer compared with healthy individuals (Fig. 2c). Further analysis of the significantly enriched repeat subfamilies showed that the upregulated Alu elements in pancreatic cancer cell-free RNA were highly abundant (Extended Data Fig. 1g), despite the lack of increase in overall SINE complexity in the pancreatic cancer cell-free RNA transcriptome (Fig. 2b). Capturing both the robust enrichment of Alu elements and the significant increase in simple repeat entropy in the pancreatic cancer cell-free RNA transcriptome, hierarchical clustering achieved perfect clustering of patients with pancreatic cancer (Fig. 2d). These repeat-aware analyses further contextualized the differences in the respective repeat superfamilies, where SINE-derived cell-free RNAs were uniform in their enrichment in our cohort of patients with pancreatic cancer, whereas simple repeat-derived cell-free RNAs were far more divergent, with some simple repeat-derived cell-free RNAs being enriched in healthy individuals (Fig. 2d).

COMPLETE-seq reveals cancer-specific repeat element RNA signatures in cell-free RNA

To show the generalizability and applicability of COMPLETE-seq technology in enabling RNA liquid biopsy for cancer diagnosis, we used COMPLETE-seq quantification to analyse lung, liver, oesophageal, colorectal and stomach cancer cell-free RNA-seq data, along with their corresponding healthy controls³². We observed repeat superfamily variability in both healthy and cancer cell-free RNA transcriptomes (Extended Data Fig. 4 and Extended Data Fig. 5) and a significant (P < 0.05) increase in mapping rate by using our repeat-aware COMPLETE-seq annotation for analysing oesophageal, liver and stomach cancer cell-free RNA transcriptomes (Extended Data Fig. 4b).

Extended Data Fig. 4 — a, Distribution of biotype representation (by DESeq2 normalized count) in cell-free RNA-seq quantifications for each cancer type, coloured by GENCODE biotype or repeat subfamily, and facetted by stage. b, Comparison of mapping rates between use of a Repeat-naïve (GENCODE v39) reference annotation and Repeat-aware reference annotation (Wilcoxon, paired, two-sided). center line, median; box limits, upper and lower quartiles; whiskers, 1.5x interquartile range. ns: p > 0.05, *: p <= 0.05, **: p <= 0.01, ***: p <= 0.001.

Extended Data Fig. 5 — a, Distribution of repeat representation (by DESeq2 normalized count) in cell-free RNA-seq quantifications for pancreatic cancer, coloured by repeat subfamily, and facetted by stage. b, Distribution of repeat representation (by DESeq2 normalized count) in cell-free RNA-seq quantifications for each cancer type, colored by repeat subfamily.

Performing pairwise comparisons between five different cancers and healthy individuals captured robust and significant (P < 0.01) differential expression of repeat-derived cell-free RNAs that were characteristic to each cancer type (Fig. 3a–e). Additional analyses of the significantly differentially expressed repeat subfamilies showed that these repeat-derived cell-free RNAs were also highly abundant (Extended Data Fig. 6), with significant changes to biotype or repeat superfamily entropy (P < 0.05) (Extended Data Fig. 6f). By comparing the significantly differentially expressed repeat RNA signal across cancer types, we identified cancer-specific TE- and other repeat-derived cell-free RNA enrichment or depletion across all repeat superfamilies (Fig. 3f,g).

Fig. 3 — a–e, Volcano plots of differentially expressed genes and repeat subfamilies derived from repeat-aware quantification of cell-free RNA-seq data for liver (a), lung (b), oesophagus (c), colorectal (d) and stomach (e) cancer. Horizontal and vertical lines drawn at −log₁₀(0.01) and 0, respectively. f,g, UpSet plots showing the number of shared and unique upregulated (f) or downregulated (g) TE subfamilies across the different cancer types.

Extended Data Fig. 6 — **a-e**, MA plots of log2FoldChange between labeled cancer type and healthy donor samples compared to log-scale baseMean derived from DESeq2. Significantly DE genes/repeat subfamilies are full opacity and color. f, Comparison of significantly different (Wilcoxon, two-sided) Shannon Entropy distributions for GENCODE biotypes and repeat subfamilies. center line, median; box limits, upper and lower quartiles; whiskers, 1.5x interquartile range. ns: p > 0.05, *: p <= 0.05, **: p <= 0.01, ***: p <= 0.001, ****: p <= 0.0001.

COMPLETE-seq features improve classification performance of diagnostic models

To show proof of concept for diagnostic modelling using repeat-aware COMPLETE-seq analysis of cell-free RNA-seq data³², we trained logistic regression classifiers with tenfold cross-validation on training sets created for each cancer and healthy comparison (Fig. 4). To determine the utility of repeat-aware COMPLETE-seq features for disease classification, we trained models on repeat-naive and repeat-aware feature sets comprising DESeq2-normalized counts or biotype/repeat superfamily entropy for each cancer type. This resulted in eight feature sets comprising repeat-aware and repeat-naive counts, entropy and counts filtered by training set differential expression. Optimized repeat-aware models were compared with their repeat-naive counterparts, revealing repeat-driven increases in both area under the curve (AUC) (Fig. 4a,e,i,m,q) and training sensitivity for liver cancer (86% sensitivity) (Fig. 4b,c), oesophageal cancer (56% sensitivity) (Fig. 4f,g), colorectal cancer (91% sensitivity) (Fig. 4j,k), stomach cancer (86% sensitivity) (Fig. 4n,o) and lung cancer (93% sensitivity) (Fig. 4r,s) at 90% specificity.

Fig. 4 — a,e,i,m,q, Receiver operating characteristic curves for the best repeat-aware model and the equivalent repeat-naive model for liver (a), oesophagus (e), colorectal (i), stomach (m), and lung (q) cancer. Diagonal lines represent a random classifier. AUC estimates are shown with the improved, repeat-aware AUC compared with the repeat-naive equivalent. b,f,j,n,r, Training sensitivity (sens.) at 90% specificity (spec.) for repeat-naive and repeat-aware models (95% confidence interval, binomial), for liver (b), oesophagus (f), colorectal (j), stomach (n), and lung (r) cancer, with values shown on the plot. c,g,k,o,s, Testing sensitivity calculated with the 90% specificity probability threshold identified in training (95% confidence interval, binomial), for liver (c), oesophagus (g), colorectal (k), stomach (o), and lung (s) cancer, with values shown on the plot. d,h,l,p,t, Comparison of model coefficient (β) to DESeq2 log₂fold change for non-zero repeat features used in the repeat-aware model characterized in the respective row for liver (d), oesophagus (h), colorectal (l), stomach (p), and lung (t) cancer, with the total number of features shown. Horizontal and vertical lines drawn at 0.

Liver cancer, which showed a large repeat fraction (Extended Data Fig. 2a), had a corresponding dependence on repeat-aware features for classification, which resulted in a significant (P < 0.05) improvement in training sensitivity (Fig. 4a). Classification performance in the respective testing cohorts largely reflected the improvements seen in training, suggesting that our models have the potential to generalize to unseen data (Fig. 4c,g,k,o,s). Notably, we observed cancer-specific differences in repeat-aware feature dependence for disease classification. Stomach (Fig. 4p) and colorectal (Fig. 4l) cancer models each used one repeat-aware feature, liver (Fig. 4d) and oesophageal (Fig. 4h) cancer models used five and ten repeat-aware features, respectively, and the lung (Fig. 4t) cancer model used many repeat-aware features and the most overall features. For all five cancer models, repeat-aware features enhanced disease classification, highlighting the potential of COMPLETE-seq for highly sensitive and specific disease diagnosis.

Discussion

Our study reveals the value and utility of broadly characterizing the cell-free RNA transcriptome using our repeat-aware COMPLETE-seq technology for RNA liquid biopsies. Although other studies have provided valuable insights into protein-coding cell-free RNA dynamics, we find that the vast non-coding and repeat-derived cell-free RNA transcriptome is a rich source of abundant and disease-specific RNA biomarkers. We show that repeat-derived cell-free RNAs, including simple repeat RNAs and TE RNAs transcribed from LINE, SINE and LTR elements, are cancer-specific RNAs that are normally present at low or undetectable levels in healthy individuals. By creating a custom, repeat-aware transcriptome annotation for cell-free RNA quantification that incorporates over 5 million repeat element insertions throughout the human genome, we show that repeat-derived cell-free RNAs are highly enriched in the plasma of patients with cancer, with each cancer type showing its own characteristic repeat-derived cell-free RNA signature. COMPLETE-seq also greatly reduces the number of features used for downstream analysis and disease classification from over 5 million repeat element insertions to ~15,000 aggregated repeat elements at the subfamily level. Notably, our repeat-aware approach achieves highly accurate disease classification by incorporating both protein-coding RNAs and non-coding RNAs, such as lncRNAs and repeat-derived RNAs.

Although cell-free RNA studies so far have focused on short-read RNA-seq technologies, we show that long-read RNA-seq technologies, such as single-molecule nanopore sequencing, provide additional information regarding the full length of cell-free RNAs. We see differences in cell-free RNA length (that is, bimodal SINE-derived cell-free RNAs) that may serve as additional disease-specific features to further improve disease classification via RNA liquid biopsy (that is, RNA fragmentomics). Moreover, we also show that COMPLETE-seq robustly characterizes repeat-derived cell-free RNAs in both poly(A)-selected and total RNA library preparation protocols. In both cell-free RNA-seq contexts, COMPLETE-seq analysis increases mapping rate significantly and provides a richer feature space that leverages highly abundant and disease-specific repeat-derived cell-free RNAs to improve classification performance.

COMPLETE-seq also provides systemic insights into disease pathogenesis, and opportunities to discover drug targets for diseases such as cancer. In addition, our RNA liquid-biopsy technology enables non-invasive, systemic monitoring of protein-coding and repeat-derived cell-free RNA responses to targeted therapies, such as KRAS inhibitors³⁶, which induce treated cancer cells to preferentially release TE-derived cell-free RNAs in extracellular vesicles⁹. Given the preferential upregulation and secretion of TE-derived cell-free RNAs in response to mutant KRAS^8,9, companion diagnostic tests developed using repeat-aware RNA liquid biopsy would enable the robust detection of repeat-derived cell-free RNA signatures of oncogenic RAS signalling.

To move towards clinical implementation of COMPLETE-seq, future studies will require the generation of larger and more diverse cell-free RNA transcriptomic datasets across additional early-stage cancer types to further improve diagnostic performance and to accurately deconvolve the cell-free RNA transcriptome to determine cancer tissue of origin. Moreover, multi-cancer early detection using COMPLETE-seq will also require larger prospective studies to evaluate repeat-aware classification performance in an asymptomatic population. These studies may enable the application of repeat-aware RNA liquid-biopsy technology for the early detection of cancer and, more generally, for precision health.

Methods

Cell-free RNA isolation from blood plasma

The ExoRNeasy kit (Qiagen) was used to isolate cell-free RNA from blood plasma of de-identified healthy controls (blood collected in K2EDTA tubes; Discovery Life Sciences) and patients with pancreatic cancer (blood collected in K2EDTA tubes; BioIVT). Samples were initially filtered through a 0.8 µm filter to remove any contaminants, such as buffy coat. Filtered plasma was then processed using the ExoRNeasy kit to isolate cell-free RNA according to manufacturer instructions.

Library preparation for cell-free RNA-seq

Full-length cDNA was synthesized from cell-free RNA from pancreatic cancer and healthy control samples (Takara SMART-Seq HT kit). Size distribution of cell-free RNA and resulting cDNA were evaluated using an Agilent bioanalyser. Final libraries were made using the Illumina Nextera XT DNA Prep kit. These libraries were then sequenced (PE150) on an Illumina NextSeq 500.

Nanopore full-length cDNA libraries were prepared as described above, followed by manufacturer instructions for the Oxford Nanopore ligation kit LSK109. Sequencing for each library was performed on an Oxford Nanopore MinION device with R9.4 flow cells.

RNA-seq quality control, alignment and quantification

RNA-seq reads (FASTQ) were trimmed with Trimmomatic³⁷ (v.0.39), quality assessed using FastQC³⁸ (v.0.11.9) and visualized using MultiQC³⁹ (v.1.11). Quantification was performed using Salmon⁴⁰ (v.1.9.0) with two separate transcriptome annotations:

The GENCODE consortium Hg38 reference annotation (v.39): 61,488 genes
A concatenation of the above reference and the Hg38 repeat element track available at the University of California Santa Cruz genome browser mySQL server genome-mysql.cse.ucsc.edu: ~5 million insertions

The following optional arguments were used:

–validateMappings, –gcBias, –seqBias, –recoverOrphans, –rangeFactorizationBins 4

to enable selective alignment, reduce sequence biases, rescue reads with unmapped pairs and improved quantification, respectively.

The repeat-aware annotation aggregation was performed much as the transcript-to-gene aggregation is performed for alignment-free quantification approaches such as Salmon. Briefly, we assembled a reference transcriptome that includes all annotated repeats in Hg38 and associated each individual repeat instance ‘transcript’ (n ≈ 5 million) with its subfamily ‘gene’ (n ≈ 15,000). Repeat instances were summated to the subfamily level.

Nanopore signals were base-called with guppy (v.5.0.7) and alignments were quantified using StringTie2 (ref. ⁴¹) with the ‘-L’ argument for long reads using the repeat-aware reference described above (number 2).

Differential expression analysis

Salmon quantifications were loaded into R using tximeta (v.1.12.4)⁴² and converted to DESeq (v.1.34.0)^43,44 objects where counts were aggregated to the gene level from transcripts. Count normalization was performed with DESeq2 and pairwise comparisons were calculated with the following models:

Internal cohort: ~age + gender + input_volume + condition

External cohort: ~age + gender + condition

Significant differential expression was considered at adjusted P values (P_adj) of <0.01.

Unsupervised analysis

Using the gene-level, variance-stabilized counts computed above, principal component analysis was performed with the prcomp from the R stats (v.4.1.3) package on a count matrix filtered to include only genes with non-zero s.d. across the samples using the following optional arguments: centre = T scale. = T rank. = 50 to calculate the 50 principal components from centred, scaled counts.

Pearson correlation was calculated using the cor function from the R stats (v.4.1.3) package. K-means clustering was calculated and plotted with the ComplexHeatmap package in R using, where appropriate, Pearson correlation values (described above) or expression Z scores calculated with the scale function in R with ‘centre = T’ to centre our scaled values at zero.

Cell-free RNA length analysis

lncRNA, protein-coding and TE reference sequences were retrieved as described above. Sequences were extracted from Hg38 to create the biotype reference genomes used in the length analyses. Nanopore reads were aligned to Hg38 using Minimap2. To determine alignments in genomic regions with overlapping annotations, the length of the aligned fragment was compared with the lengths of the overlapping repeats. The annotation with the closest length to that of the fragment was chosen as the correct alignment. Fragment length was extracted using the PySam (v.0.15.4) template length.

Modelling and statistical analysis

Feature engineering

DESeq2-normalized counts of features with a non-zero s.d. were used directly in all cases except where they were used to calculate Shannon entropy. Entropy (H) was calculated on a per-sample basis for each biotype or subfamily as:

H_{biotype} = - \sum_{i}^{n} p_{i} \log_{2} (p_{i})

where p_i represents the fractional contribution of a given feature i (total of n) belonging to the biotype of interest to the total biotype abundance.

For classification as described below, eight feature sets belonging to three categories were used as input matrices for model training:

Total: repeat-naive, repeat-aware and repeat-alone features
Differential expression: repeat-naive, repeat-aware and repeat-alone differentially expressed features (calculated on training set, excluding test set)
Entropy: TE clade entropy and TE clade entropy plus repeat-naive features

Classification and performance evaluation

Each disease cohort was paired with healthy samples and split into stratified training (80%) and testing (20%) subsets. Training splits were used to optimize logistic regression models by performing tenfold cross-validated classification using elastic net penalty values (ɑ) from zero (lasso) to one (ridge) to optimize feature selection via the regularization parameter (λ), producing a final model trained on the entire training split with selected features. Top-performing models were identified by training sensitivity at 90% specificity, which was determined by calculating the probability threshold that achieved ~90% specificity and AUC.

When feature sets including differentially expressed genes were used, differential expression was calculated using only the training split and excluding testing samples. Model performance was finally evaluated on held-out test data by generating prediction probabilities on the test split samples and classifying based on the 90% specificity probability threshold defined in training. Features with non-zero coefficients (β) in the final models were identified to determine total feature size. Confidence intervals for sensitivity were estimated as binomial confidence intervals based on the successful observations and the total training/testing cohort. All modelling was performed with the cv.glmnet function from the glmnet package and custom code written in R.

Statistical analysis

Unless otherwise stated, comparison of means was performed with a two-sided, unpaired Wilcoxon rank-sum test. Where paired tests are used, lines are drawn to connect the dependent observations. When represented using symbolic ranks (*), statistical significance is defined as follows: non-significant P > 0.05, *P ≤ 0.05, **P ≤ 0.01, ***P ≤ 0.001 and ****P ≤ 0.0001.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Supplementary information

Reporting Summary^{(64.5KB, pdf)}

Supplementary Information^{(10.9KB, xlsx)}

Metadata describing critical clinical features of the 32-member cohort generated for this study, and available at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE138651.

Acknowledgements

We thank members of the Kim Lab for helpful discussions. This work was supported by funds from the American Cancer Society (10.53354/pc.gr.158353; D.H.K.) and the Baskin School of Engineering at UC Santa Cruz (D.H.K.). R.E.R. was supported by the National Institutes of Health (DK131504), S.V.M. was supported by the California Institute for Regenerative Medicine (EDUC4-12759), V.P. was supported by the Tobacco-Related Disease Research Program (T32DT4904), S.Y.C. was supported by the American Heart Association Established Investigator Award (18EIA33900027) and the National Institutes of Health (HL122596 and HL124021), and D.H.K. was supported by a Research Scholar Grant (RSG-22-099-01-CDP) from the American Cancer Society.

Extended data

Author contributions

D.H.K. conceived of and designed the study. R.E.R. performed experiments, devised and performed computational analyses, and generated figures. S.V.M. and E.L. performed Illumina sequencing experiments. A.E.D. and A.H. performed RNA-seq analyses. V.P. and M.J. performed nanopore-sequencing experiments and generated data. Y.A.A. and S.Y.C. provided materials. D.H.K. supervised research. D.H.K. and R.E.R. wrote the article with input from all co-authors.

Peer review

Peer review information

Nature Biomedical Engineering thanks the anonymous reviewer(s) for their contribution to the peer review of this work.

Data availability

The main data supporting the results in this study are available within the article and its Supplementary Information. RNA-seq data are available at the NCBI Gene Expression Omnibus repository, under accession number GSE136651. Publicly available data used in this study are available at the NCBI Gene Expression Omnibus repository, under accession number GSE174302. All data generated in this study, including source data for the figures, are available from the corresponding author on reasonable request.

Code availability

Custom code used in this study is available on GitHub at https://github.com/rreggiar/exRNA_disease_biomarkers.

Competing interests

D.H.K. and R.E.R. are inventors on patent applications covering the methods and compositions to detect cancer using cell-free RNA submitted by the Regents of the University of California. D.H.K. and R.E.R. are founders and shareholders and D.H.K. is a board member of LincRNA Bio. S.Y.C. has served as a consultant to United Therapeutics and Acceleron Pharma. S.Y.C. has held research grants from Actelion, Bayer and Pfizer. S.Y.C. is a director, officer and shareholder of Synhale Therapeutics. S.Y.C. has submitted patent applications regarding metabolism in pulmonary hypertension.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

is available for this paper at 10.1038/s41551-023-01081-7.

Supplementary information

The online version contains supplementary material available at 10.1038/s41551-023-01081-7.

References

1.Djebali S, et al. Landscape of transcription in human cells. Nature. 2012;489:101–108. doi: 10.1038/nature11233. [DOI] [PMC free article] [PubMed] [Google Scholar]
2.Bartel DP. MicroRNAs: target recognition and regulatory functions. Cell. 2009;136:215–233. doi: 10.1016/j.cell.2009.01.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
3.Kim DH, Saetrom P, Snove O, Jr, Rossi JJ. MicroRNA-directed transcriptional gene silencing in mammalian cells. Proc. Natl Acad. Sci. USA. 2008;105:16230–16235. doi: 10.1073/pnas.0808830105. [DOI] [PMC free article] [PubMed] [Google Scholar]
4.Rinn JL, Chang HY. Genome regulation by long noncoding RNAs. Annu. Rev. Biochem. 2012;81:145–166. doi: 10.1146/annurev-biochem-051410-092902. [DOI] [PMC free article] [PubMed] [Google Scholar]
5.Kim DH, et al. Single-cell transcriptome analysis reveals dynamic changes in lncRNA expression during reprogramming. Cell Stem Cell. 2015;16:88–101. doi: 10.1016/j.stem.2014.11.005. [DOI] [PMC free article] [PubMed] [Google Scholar]
6.Elbarbary RA, Lucas BA, Maquat LE. Retrotransposons as regulators of gene expression. Science. 2016;351:aac7247. doi: 10.1126/science.aac7247. [DOI] [PMC free article] [PubMed] [Google Scholar]
7.Fernandes JD, et al. The UCSC repeat browser allows discovery and visualization of evolutionary conflict across repeat families. Mob. DNA. 2020;11:13. doi: 10.1186/s13100-020-00208-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
8.Reggiardo, R. E. et al. Epigenomic reprogramming of repetitive noncoding RNAs and IFN-stimulated genes by mutant KRAS. Preprint at bioRxiv10.1101/2020.11.04.367771 (2020).
9.Khojah, R. et al. Extracellular RNA signatures of mutant KRAS(G12C) lung adenocarcinoma cells. Preprint at bioRxiv10.1101/2022.02.23.481574 (2022).
10.Burns KH. Transposable elements in cancer. Nat. Rev. Cancer. 2017;17:415–424. doi: 10.1038/nrc.2017.35. [DOI] [PubMed] [Google Scholar]
11.Kong Y, et al. Transposable element expression in tumors is associated with immune infiltration and increased antigenicity. Nat. Commun. 2019;10:5228. doi: 10.1038/s41467-019-13035-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
12.Reggiardo RE, et al. Mutant KRAS regulates transposable element RNA and innate immunity via KRAB zinc-finger genes. Cell Rep. 2022;40:111104. doi: 10.1016/j.celrep.2022.111104. [DOI] [PMC free article] [PubMed] [Google Scholar]
13.Carrillo, D. et al. Transposable element RNA dysregulation in mutant KRAS(G12C) 3D lung cancer spheroids. Preprint at bioRxiv10.1101/2023.02.27.530369 (2023).
14.Reggiardo RE, Maroli SV, Kim DH. lncRNA biomarkers of inflammation and cancer. Adv. Exp. Med. Biol. 2022;1363:121–145. doi: 10.1007/978-3-030-92034-0_7. [DOI] [PubMed] [Google Scholar]
15.Wang, J., Ma, P., Kim, D. H., Liu, B. F. & Demirci, U. Towards microfluidic-based exosome isolation and detection for tumor therapy. Nano Today10.1016/j.nantod.2020.101066 (2021). [DOI] [PMC free article] [PubMed]
16.Lo YM, Chiu RW. The biology and diagnostic applications of plasma RNA. Ann. N. Y. Acad. Sci. 2004;1022:135–139. doi: 10.1196/annals.1318.022. [DOI] [PubMed] [Google Scholar]
17.Vorperian SK, Moufarrej MN, Quake SR. Cell types of origin of the cell-free transcriptome. Nat. Biotechnol. 2022;40:855–861. doi: 10.1038/s41587-021-01188-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
18.Anfossi S, Babayan A, Pantel K, Calin GA. Clinical utility of circulating non-coding RNAs—an update. Nat. Rev. Clin. Oncol. 2018;15:541–563. doi: 10.1038/s41571-018-0035-x. [DOI] [PubMed] [Google Scholar]
19.Moufarrej MN, et al. Early prediction of preeclampsia in pregnancy with cell-free RNA. Nature. 2022;602:689–694. doi: 10.1038/s41586-022-04410-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
20.Rasmussen M, et al. RNA profiles reveal signatures of future health and disease in pregnancy. Nature. 2022;601:422–427. doi: 10.1038/s41586-021-04249-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
21.Ng EK, et al. The concentration of circulating corticotropin-releasing hormone mRNA in maternal plasma is increased in preeclampsia. Clin. Chem. 2003;49:727–731. doi: 10.1373/49.5.727. [DOI] [PubMed] [Google Scholar]
22.Larson MH, et al. A comprehensive characterization of the cell-free transcriptome reveals tissue- and subtype-specific biomarkers for cancer detection. Nat. Commun. 2021;12:2357. doi: 10.1038/s41467-021-22444-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
23.Hulstaert E, et al. Charting extracellular transcriptomes in the Human Biofluid RNA Atlas. Cell Rep. 2020;33:108552. doi: 10.1016/j.celrep.2020.108552. [DOI] [PubMed] [Google Scholar]
24.Roskams-Hieter B, et al. Plasma cell-free RNA profiling distinguishes cancers from pre-malignant conditions in solid and hematologic malignancies. npj Precis. Oncol. 2022;6:28. doi: 10.1038/s41698-022-00270-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
25.Lo K-W, et al. Analysis of cell-free Epstein–Barr virus-associated RNA in the plasma of patients with nasopharyngeal carcinoma. Clin. Chem. 1999;45:1292–1294. doi: 10.1093/clinchem/45.8.1292. [DOI] [PubMed] [Google Scholar]
26.Kopreski MS, Benko FA, Kwak LW, Gocke CD. Detection of tumor messenger RNA in the serum of patients with malignant melanoma. Clin. Cancer Res. 1999;5:1961–1965. [PubMed] [Google Scholar]
27.Koh W, et al. Noninvasive in vivo monitoring of tissue-specific global gene expression in humans. Proc. Natl Acad. Sci. USA. 2014;111:7361–7366. doi: 10.1073/pnas.1405528111. [DOI] [PMC free article] [PubMed] [Google Scholar]
28.Toden, S. et al. Noninvasive characterization of Alzheimer’s disease by circulating, cell-free messenger RNA next-generation sequencing. Sci. Adv.10.1126/sciadv.abb1654 (2020). [DOI] [PMC free article] [PubMed]
29.Yan Z, et al. Presymptomatic increase of an extracellular RNA in blood plasma associates with the development of Alzheimer’s disease. Curr. Biol. 2020;30:1771–1782. doi: 10.1016/j.cub.2020.02.084. [DOI] [PubMed] [Google Scholar]
30.Moufarrej MN, Wong RJ, Shaw GM, Stevenson DK, Quake SR. Investigating pregnancy and its complications using circulating cell-free RNA in women’s blood during gestation. Front. Pediatr. 2020;8:605219. doi: 10.3389/fped.2020.605219. [DOI] [PMC free article] [PubMed] [Google Scholar]
31.Yao J, Wu DC, Nottingham RM, Lambowitz AM. Identification of protein-protected mRNA fragments and structured excised intron RNAs in human plasma by TGIRT-seq peak calling. eLife. 2020;9:e60743. doi: 10.7554/eLife.60743. [DOI] [PMC free article] [PubMed] [Google Scholar]
32.Chen, S. et al. Cancer type classification using plasma cell-free RNAs derived from human and microbes. eLife10.7554/eLife.75181 (2022). [DOI] [PMC free article] [PubMed]
33.Bourque G, et al. Ten things you should know about transposable elements. Genome Biol. 2018;19:199. doi: 10.1186/s13059-018-1577-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
34.Garcia-Nieto PE, Wang B, Fraser HB. Transcriptome diversity is a systematic source of variation in RNA-sequencing data. PLoS Comput. Biol. 2022;18:e1009939. doi: 10.1371/journal.pcbi.1009939. [DOI] [PMC free article] [PubMed] [Google Scholar]
35.Deamer D, Akeson M, Branton D. Three decades of nanopore sequencing. Nat. Biotechnol. 2016;34:518–524. doi: 10.1038/nbt.3423. [DOI] [PMC free article] [PubMed] [Google Scholar]
36.Moore AR, Rosenberg SC, McCormick F, Malek S. RAS-targeted therapies: is the undruggable drugged? Nat. Rev. Drug Discov. 2020;19:533–552. doi: 10.1038/s41573-020-0068-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
37.Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30:2114–2120. doi: 10.1093/bioinformatics/btu170. [DOI] [PMC free article] [PubMed] [Google Scholar]
38.Andrews, S. FastQC v.0.11.9 (Babraham Bioinformatics, Babraham Institute, 2010).
39.Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016;32:3047–3048. doi: 10.1093/bioinformatics/btw354. [DOI] [PMC free article] [PubMed] [Google Scholar]
40.Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C. Salmon provides fast and bias-aware quantification of transcript expression. Nat. Methods. 2017;14:417–419. doi: 10.1038/nmeth.4197. [DOI] [PMC free article] [PubMed] [Google Scholar]
41.Kovaka S, et al. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biol. 2019;20:278. doi: 10.1186/s13059-019-1910-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
42.Love MI, et al. Tximeta: reference sequence checksums for provenance identification in RNA-seq. PLoS Comput. Biol. 2020;16:e1007664. doi: 10.1371/journal.pcbi.1007664. [DOI] [PMC free article] [PubMed] [Google Scholar]
43.Zhu A, Ibrahim JG, Love MI. Heavy-tailed prior distributions for sequence count data: removing the noise and preserving large differences. Bioinformatics. 2019;35:2084–2092. doi: 10.1093/bioinformatics/bty895. [DOI] [PMC free article] [PubMed] [Google Scholar]
44.Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15:550. doi: 10.1186/s13059-014-0550-8. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Reporting Summary^{(64.5KB, pdf)}

Supplementary Information^{(10.9KB, xlsx)}

Metadata describing critical clinical features of the 32-member cohort generated for this study, and available at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE138651.

Data Availability Statement

Custom code used in this study is available on GitHub at https://github.com/rreggiar/exRNA_disease_biomarkers.

[CR1] 1.Djebali S, et al. Landscape of transcription in human cells. Nature. 2012;489:101–108. doi: 10.1038/nature11233. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR2] 2.Bartel DP. MicroRNAs: target recognition and regulatory functions. Cell. 2009;136:215–233. doi: 10.1016/j.cell.2009.01.002. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR3] 3.Kim DH, Saetrom P, Snove O, Jr, Rossi JJ. MicroRNA-directed transcriptional gene silencing in mammalian cells. Proc. Natl Acad. Sci. USA. 2008;105:16230–16235. doi: 10.1073/pnas.0808830105. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR4] 4.Rinn JL, Chang HY. Genome regulation by long noncoding RNAs. Annu. Rev. Biochem. 2012;81:145–166. doi: 10.1146/annurev-biochem-051410-092902. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR5] 5.Kim DH, et al. Single-cell transcriptome analysis reveals dynamic changes in lncRNA expression during reprogramming. Cell Stem Cell. 2015;16:88–101. doi: 10.1016/j.stem.2014.11.005. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR6] 6.Elbarbary RA, Lucas BA, Maquat LE. Retrotransposons as regulators of gene expression. Science. 2016;351:aac7247. doi: 10.1126/science.aac7247. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR7] 7.Fernandes JD, et al. The UCSC repeat browser allows discovery and visualization of evolutionary conflict across repeat families. Mob. DNA. 2020;11:13. doi: 10.1186/s13100-020-00208-w. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR8] 8.Reggiardo, R. E. et al. Epigenomic reprogramming of repetitive noncoding RNAs and IFN-stimulated genes by mutant KRAS. Preprint at bioRxiv10.1101/2020.11.04.367771 (2020).

[CR9] 9.Khojah, R. et al. Extracellular RNA signatures of mutant KRAS(G12C) lung adenocarcinoma cells. Preprint at bioRxiv10.1101/2022.02.23.481574 (2022).

[CR10] 10.Burns KH. Transposable elements in cancer. Nat. Rev. Cancer. 2017;17:415–424. doi: 10.1038/nrc.2017.35. [DOI] [PubMed] [Google Scholar]

[CR11] 11.Kong Y, et al. Transposable element expression in tumors is associated with immune infiltration and increased antigenicity. Nat. Commun. 2019;10:5228. doi: 10.1038/s41467-019-13035-2. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR12] 12.Reggiardo RE, et al. Mutant KRAS regulates transposable element RNA and innate immunity via KRAB zinc-finger genes. Cell Rep. 2022;40:111104. doi: 10.1016/j.celrep.2022.111104. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR13] 13.Carrillo, D. et al. Transposable element RNA dysregulation in mutant KRAS(G12C) 3D lung cancer spheroids. Preprint at bioRxiv10.1101/2023.02.27.530369 (2023).

[CR14] 14.Reggiardo RE, Maroli SV, Kim DH. lncRNA biomarkers of inflammation and cancer. Adv. Exp. Med. Biol. 2022;1363:121–145. doi: 10.1007/978-3-030-92034-0_7. [DOI] [PubMed] [Google Scholar]

[CR15] 15.Wang, J., Ma, P., Kim, D. H., Liu, B. F. & Demirci, U. Towards microfluidic-based exosome isolation and detection for tumor therapy. Nano Today10.1016/j.nantod.2020.101066 (2021). [DOI] [PMC free article] [PubMed]

[CR16] 16.Lo YM, Chiu RW. The biology and diagnostic applications of plasma RNA. Ann. N. Y. Acad. Sci. 2004;1022:135–139. doi: 10.1196/annals.1318.022. [DOI] [PubMed] [Google Scholar]

[CR17] 17.Vorperian SK, Moufarrej MN, Quake SR. Cell types of origin of the cell-free transcriptome. Nat. Biotechnol. 2022;40:855–861. doi: 10.1038/s41587-021-01188-9. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR18] 18.Anfossi S, Babayan A, Pantel K, Calin GA. Clinical utility of circulating non-coding RNAs—an update. Nat. Rev. Clin. Oncol. 2018;15:541–563. doi: 10.1038/s41571-018-0035-x. [DOI] [PubMed] [Google Scholar]

[CR19] 19.Moufarrej MN, et al. Early prediction of preeclampsia in pregnancy with cell-free RNA. Nature. 2022;602:689–694. doi: 10.1038/s41586-022-04410-z. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR20] 20.Rasmussen M, et al. RNA profiles reveal signatures of future health and disease in pregnancy. Nature. 2022;601:422–427. doi: 10.1038/s41586-021-04249-w. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR21] 21.Ng EK, et al. The concentration of circulating corticotropin-releasing hormone mRNA in maternal plasma is increased in preeclampsia. Clin. Chem. 2003;49:727–731. doi: 10.1373/49.5.727. [DOI] [PubMed] [Google Scholar]

[CR22] 22.Larson MH, et al. A comprehensive characterization of the cell-free transcriptome reveals tissue- and subtype-specific biomarkers for cancer detection. Nat. Commun. 2021;12:2357. doi: 10.1038/s41467-021-22444-1. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR23] 23.Hulstaert E, et al. Charting extracellular transcriptomes in the Human Biofluid RNA Atlas. Cell Rep. 2020;33:108552. doi: 10.1016/j.celrep.2020.108552. [DOI] [PubMed] [Google Scholar]

[CR24] 24.Roskams-Hieter B, et al. Plasma cell-free RNA profiling distinguishes cancers from pre-malignant conditions in solid and hematologic malignancies. npj Precis. Oncol. 2022;6:28. doi: 10.1038/s41698-022-00270-y. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR25] 25.Lo K-W, et al. Analysis of cell-free Epstein–Barr virus-associated RNA in the plasma of patients with nasopharyngeal carcinoma. Clin. Chem. 1999;45:1292–1294. doi: 10.1093/clinchem/45.8.1292. [DOI] [PubMed] [Google Scholar]

[CR26] 26.Kopreski MS, Benko FA, Kwak LW, Gocke CD. Detection of tumor messenger RNA in the serum of patients with malignant melanoma. Clin. Cancer Res. 1999;5:1961–1965. [PubMed] [Google Scholar]

[CR27] 27.Koh W, et al. Noninvasive in vivo monitoring of tissue-specific global gene expression in humans. Proc. Natl Acad. Sci. USA. 2014;111:7361–7366. doi: 10.1073/pnas.1405528111. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR28] 28.Toden, S. et al. Noninvasive characterization of Alzheimer’s disease by circulating, cell-free messenger RNA next-generation sequencing. Sci. Adv.10.1126/sciadv.abb1654 (2020). [DOI] [PMC free article] [PubMed]

[CR29] 29.Yan Z, et al. Presymptomatic increase of an extracellular RNA in blood plasma associates with the development of Alzheimer’s disease. Curr. Biol. 2020;30:1771–1782. doi: 10.1016/j.cub.2020.02.084. [DOI] [PubMed] [Google Scholar]

[CR30] 30.Moufarrej MN, Wong RJ, Shaw GM, Stevenson DK, Quake SR. Investigating pregnancy and its complications using circulating cell-free RNA in women’s blood during gestation. Front. Pediatr. 2020;8:605219. doi: 10.3389/fped.2020.605219. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR31] 31.Yao J, Wu DC, Nottingham RM, Lambowitz AM. Identification of protein-protected mRNA fragments and structured excised intron RNAs in human plasma by TGIRT-seq peak calling. eLife. 2020;9:e60743. doi: 10.7554/eLife.60743. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR32] 32.Chen, S. et al. Cancer type classification using plasma cell-free RNAs derived from human and microbes. eLife10.7554/eLife.75181 (2022). [DOI] [PMC free article] [PubMed]

[CR33] 33.Bourque G, et al. Ten things you should know about transposable elements. Genome Biol. 2018;19:199. doi: 10.1186/s13059-018-1577-z. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR34] 34.Garcia-Nieto PE, Wang B, Fraser HB. Transcriptome diversity is a systematic source of variation in RNA-sequencing data. PLoS Comput. Biol. 2022;18:e1009939. doi: 10.1371/journal.pcbi.1009939. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR35] 35.Deamer D, Akeson M, Branton D. Three decades of nanopore sequencing. Nat. Biotechnol. 2016;34:518–524. doi: 10.1038/nbt.3423. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR36] 36.Moore AR, Rosenberg SC, McCormick F, Malek S. RAS-targeted therapies: is the undruggable drugged? Nat. Rev. Drug Discov. 2020;19:533–552. doi: 10.1038/s41573-020-0068-6. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR37] 37.Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30:2114–2120. doi: 10.1093/bioinformatics/btu170. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR38] 38.Andrews, S. FastQC v.0.11.9 (Babraham Bioinformatics, Babraham Institute, 2010).

[CR39] 39.Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016;32:3047–3048. doi: 10.1093/bioinformatics/btw354. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR40] 40.Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C. Salmon provides fast and bias-aware quantification of transcript expression. Nat. Methods. 2017;14:417–419. doi: 10.1038/nmeth.4197. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR41] 41.Kovaka S, et al. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biol. 2019;20:278. doi: 10.1186/s13059-019-1910-1. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR42] 42.Love MI, et al. Tximeta: reference sequence checksums for provenance identification in RNA-seq. PLoS Comput. Biol. 2020;16:e1007664. doi: 10.1371/journal.pcbi.1007664. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR43] 43.Zhu A, Ibrahim JG, Love MI. Heavy-tailed prior distributions for sequence count data: removing the noise and preserving large differences. Bioinformatics. 2019;35:2084–2092. doi: 10.1093/bioinformatics/bty895. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR44] 44.Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15:550. doi: 10.1186/s13059-014-0550-8. [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

Profiling of repetitive RNA sequences in the blood plasma of patients with cancer

Roman E Reggiardo

Sreelakshmi Velandi Maroli

Vikas Peddu

Andrew E Davidson

Alexander Hill

Erin LaMontagne

Yassmin Al Aaraj

Miten Jain

Stephen Y Chan

Daniel H Kim

Abstract

Main

Results

COMPLETE-seq enables repeat-aware profiling of the cell-free RNA transcriptome

Fig. 1. Cell-free RNA transcriptome profiling using repeat-aware COMPLETE-seq.

Extended Data Fig. 1. Performance overview of COMPLETE-seq on the internal cohort.

TE RNAs and other repeat element RNAs are enriched in pancreatic cancer cell-free RNA

Fig. 2. Disease-specific enrichment of repeat-derived cell-free RNA.

Extended Data Fig. 2. Nanopore sequencing of cell-free RNA reveals biotype-specific fragment-size patterns.

Extended Data Fig. 3. Nanopore and Illumina show agreement in the quantification of most GENCODE-annotated genes.

COMPLETE-seq reveals cancer-specific repeat element RNA signatures in cell-free RNA

Extended Data Fig. 4. Repeat-aware analysis of cell-free RNA from 5 different cancers.

Extended Data Fig. 5. Repeat-specific diversity across internal and external cohorts.

Fig. 3. Disease-specific repeat-derived cell-free RNA signatures.

Extended Data Fig. 6. Differential expression and variability of repeat-elements in 5 different cancers.

COMPLETE-seq features improve classification performance of diagnostic models

Fig. 4. COMPLETE-seq features improve performance of diagnostic models.

Discussion

Methods

Cell-free RNA isolation from blood plasma

Library preparation for cell-free RNA-seq

RNA-seq quality control, alignment and quantification

Differential expression analysis

Unsupervised analysis

Cell-free RNA length analysis

Modelling and statistical analysis

Feature engineering

Classification and performance evaluation

Statistical analysis

Reporting summary

Supplementary information

Acknowledgements

Extended data

Author contributions

Peer review

Peer review information

Data availability

Code availability

Competing interests

Footnotes

Extended data

Supplementary information

References

Associated Data

Supplementary Materials

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases