This is a preprint. It has not yet been peer reviewed by a journal.

bioRxiv [Preprint]. 2023 Jan 16:2023.01.13.523899. [Version 1] doi: 10.1101/2023.01.13.523899

RNA-to-image multi-cancer synthesis using cascaded diffusion models

Francisco Carrillo-Perez, Marija Pizurica, Yuanning Zheng, Jeanne Shen, Olivier Gevaert
PMCID: PMC9882105  PMID: 36711711

Abstract

Synthetic data generation offers a solution to the data scarcity problem in biomedicine, where data are often expensive or difficult to obtain. By increasing the dataset size, more powerful and generalizable machine learning models can be trained, improving their performance in clinical decision support systems. The generation of synthetic data for cancer diagnosis has been explored in the literature, but typically in the single-modality setting (e.g., whole-slide image tiles or RNA-Seq data). Given the success of text-to-image synthesis models for natural images, where one modality is used to generate a related one, we propose RNA-to-image synthesis (RNA-CDM) in a multi-cancer context. First, we trained a variational auto-encoder to reduce the dimensionality of the patient’s gene expression profile, showing that the resulting latent representation can accurately differentiate between cancer types. Then, we trained a cascaded diffusion model to synthesize realistic whole-slide image tiles conditioned on the latent representation of the patient’s RNA-Seq data. We show that the generated tiles preserved the cell-type distribution found in real-world data, with important cell types detectable by a state-of-the-art cell identification model in the synthetic samples. Next, we successfully used this synthetic data to pretrain a multi-cancer classification model, observing an improvement in performance over training from scratch across 5-fold cross-validation. Our results demonstrate the potential utility of synthetic data for developing multi-modal machine learning models in data-scarce settings, as well as the possibility of imputing missing data modalities by leveraging the information present in the available ones.
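The two-step pipeline the abstract describes — encode a high-dimensional RNA-Seq profile into a compact latent code, then condition a base image generator and a super-resolution stage on that code — can be sketched as a minimal numpy toy. All names, dimensions, and the placeholder "generation" steps below are illustrative assumptions, not the paper's actual VAE or diffusion models:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a ~20k-gene expression vector and a 200-d latent.
N_GENES, LATENT_DIM = 20000, 200

def encode_rna(expr, W, b):
    """Toy linear stand-in for the VAE encoder: expression -> latent."""
    return expr @ W + b

def base_stage(latent, size=64):
    """Placeholder base 'diffusion' stage: produces a low-res RGB tile.

    Here generation is faked by seeding noise deterministically from the
    latent, so the output depends on the RNA-derived conditioning code.
    """
    seed = abs(int(latent.sum() * 1e6)) % (2**32)
    g = np.random.default_rng(seed)
    return g.random((size, size, 3))

def sr_stage(img, factor=4):
    """Placeholder super-resolution stage: nearest-neighbour upsampling."""
    return img.repeat(factor, axis=0).repeat(factor, axis=1)

# One synthetic "patient": random expression, random encoder weights.
expr = rng.random(N_GENES)
W = rng.standard_normal((N_GENES, LATENT_DIM)) * 0.01
b = np.zeros(LATENT_DIM)

z = encode_rna(expr, W, b)       # latent code, shape (200,)
tile = sr_stage(base_stage(z))   # synthetic tile, shape (256, 256, 3)
```

The cascaded structure (base stage plus upsampling stage, both driven by the same conditioning vector) mirrors how cascaded diffusion models are typically organized, with each real stage being a learned denoising network rather than the placeholders used here.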

Full Text Availability

The license terms selected by the author(s) for this preprint version do not permit archiving in PMC. The full text is available from the preprint server.


Articles from bioRxiv are provided here courtesy of Cold Spring Harbor Laboratory Preprints