This is a preprint. It has not yet been peer reviewed by a journal.

bioRxiv [Preprint]. 2023 Jan 16:2023.01.13.523899. [Version 1] doi: 10.1101/2023.01.13.523899

RNA-to-image multi-cancer synthesis using cascaded diffusion models

Francisco Carrillo-Perez, Marija Pizurica, Yuanning Zheng, Jeanne Shen, Olivier Gevaert
PMCID: PMC9882105  PMID: 36711711

Abstract

Synthetic data generation offers a solution to the data scarcity problem in biomedicine, where data are often expensive or difficult to obtain. By increasing the dataset size, more powerful and generalizable machine learning models can be trained, improving their performance in clinical decision support systems. The generation of synthetic data for cancer diagnosis has been explored in the literature, but typically in the single-modality setting (e.g., whole-slide image tiles or RNA-Seq data). Given the success of text-to-image synthesis models for natural images, where one modality is used to generate a related one, we propose RNA-to-image synthesis (RNA-CDM) in a multi-cancer context. First, we trained a variational auto-encoder to reduce the dimensionality of the patient’s gene expression profile, showing that the resulting latent representation can accurately differentiate between cancer types. Then, we trained a cascaded diffusion model to synthesize realistic whole-slide image tiles conditioned on the latent representation of the patient’s RNA-Seq data. We show that the generated tiles preserved the cell-type distribution found in real-world data, with important cell types detectable by a state-of-the-art cell identification model in the synthetic samples. Next, we successfully used this synthetic data to pretrain a multi-cancer classification model, observing an improvement in performance over training from scratch across 5-fold cross-validation. Our results demonstrate the potential utility of synthetic data for developing multi-modal machine learning models in data-scarce settings, as well as the possibility of imputing missing data modalities by leveraging the information present in the available ones.
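The two-step pipeline the abstract describes — encode a high-dimensional RNA-Seq profile into a compact latent code, then condition a base image generator and a super-resolution stage on that code — can be sketched as a minimal numpy toy. All names, dimensions, and the placeholder "generation" steps below are illustrative assumptions, not the paper's actual VAE or diffusion models:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a ~20k-gene expression vector and a 200-d latent.
N_GENES, LATENT_DIM = 20000, 200

def encode_rna(expr, W, b):
    """Toy linear stand-in for the VAE encoder: expression -> latent."""
    return expr @ W + b

def base_stage(latent, size=64):
    """Placeholder base 'diffusion' stage: produces a low-res RGB tile.

    Here generation is faked by seeding noise deterministically from the
    latent, so the output depends on the RNA-derived conditioning code.
    """
    seed = abs(int(latent.sum() * 1e6)) % (2**32)
    g = np.random.default_rng(seed)
    return g.random((size, size, 3))

def sr_stage(img, factor=4):
    """Placeholder super-resolution stage: nearest-neighbour upsampling."""
    return img.repeat(factor, axis=0).repeat(factor, axis=1)

# One synthetic "patient": random expression, random encoder weights.
expr = rng.random(N_GENES)
W = rng.standard_normal((N_GENES, LATENT_DIM)) * 0.01
b = np.zeros(LATENT_DIM)

z = encode_rna(expr, W, b)       # latent code, shape (200,)
tile = sr_stage(base_stage(z))   # synthetic tile, shape (256, 256, 3)
```

The cascaded structure (base stage plus upsampling stage, both driven by the same conditioning vector) mirrors how cascaded diffusion models are typically organized, with each real stage being a learned denoising network rather than the placeholders used here.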

Full Text Availability

The license terms selected by the author(s) for this preprint version do not permit archiving in PMC. The full text is available from the preprint server.


Articles from bioRxiv are provided here courtesy of Cold Spring Harbor Laboratory Preprints