Journal of Advanced Research. 2024 Oct 31;75:189–198. doi: 10.1016/j.jare.2024.10.035

scPerb: Predict single-cell perturbation via style transfer-based variational autoencoder

Zijia Tang a, Minghao Zhou b, Kai Zhang c, Qianqian Song b
PMCID: PMC12536660  PMID: 39486785

Keywords: Single-cell RNA sequencing, Perturbation, Style transfer, Variational auto-encoder

Highlights

  • We introduced scPerb, a framework to predict cellular responses to perturbations, addressing limitations of costly, labor-intensive methods.

  • scPerb uses a style transfer strategy, isolating perturbation-related variance from unperturbed to perturbed cells.

  • scPerb outperforms current methods in post-perturbation gene expression prediction, achieving R² values of 0.98, 0.98, and 0.96 on benchmarks.

  • Its robust performance across datasets shows scPerb's potential in precision medicine for cost-effective predictions.

Abstract

Introduction

Traditional methods for obtaining cellular responses after perturbation are usually labor-intensive and costly, especially when working with multiple different experimental conditions. Therefore, accurate prediction of cellular responses to perturbations is of great importance in computational biology. Existing methodologies, such as graph-based approaches, vector arithmetic, and neural networks, either mix perturbation-related variances with cell-type-specific patterns or implicitly distinguish them within black-box models.

Objectives

This study aims to introduce and demonstrate a novel framework, scPerb, which explicitly extracts perturbation-related variances and transfers them from unperturbed to perturbed cells to accurately predict the effect of perturbation at the single-cell level.

Methods

scPerb utilizes a style transfer strategy by incorporating a style encoder into the architecture of a variational autoencoder. The style encoder captures the differences in latent representations between unperturbed and perturbed cells, enabling accurate prediction of post-perturbation gene expression data.

Results

Comprehensive comparisons with existing methods demonstrate that scPerb delivers improved performance and higher accuracy in predicting cellular responses to perturbations. Notably, scPerb outperforms other methods across multiple datasets, achieving superior R² values of 0.98, 0.98, and 0.96 on three benchmarking datasets.

Conclusion

scPerb offers a significant advancement in predicting cellular responses by effectively separating and transferring perturbation-related variances. This framework not only enhances prediction accuracy but also provides a robust tool for computational biology, addressing the limitations of current methodologies.

Introduction

Single-cell RNA sequencing (scRNA-seq) is a revolutionary technology to profile gene expression of cells in heterogeneous tissue samples [1], [2], [3]. It can measure transcripts in thousands of single cells from multiple biological samples under different conditions [4], [5], [6], [7], [8]. Such breakthrough technology has inspired the development of tailored computational tools such as cell type annotations [9], [10], [11], [12], identification of pseudo-time trajectories [13], [14], and rare cell type detection [15], [16], facilitating the biological insights into single-cell data [17], [18]. Although scRNA-seq technologies have led to a remarkable growth of single-cell data, it is still challenging to collect the matched pairs of control and perturbed samples for a particular perturbation. As current databases comprise a wide variety of single-cell data collected from samples at normal conditions, there is a critical need to leverage the existing data at normal conditions to generate and predict the single-cell data after a certain perturbation. To achieve this, an accurate and robust method is needed, with generalized capabilities in revealing gene expression patterns across different tissues, different platforms, and limited data size.

Recent efforts to address the challenges of perturbation prediction use generative models such as Generative Adversarial Networks (GANs) [19] and Variational Autoencoders (VAEs) [20]. GAN-based models use a generator to simulate perturbed data and an adversarial discriminator to assess how closely the predicted data match the ground truth. While this adversarial setup is designed to produce robust predictions, GAN models often suffer from training instability. The generator may collapse, particularly when balancing the adversarial process becomes difficult, which is often the case with noisy or sparse single-cell data. This instability can result in poor generalization to new datasets or perturbations, limiting the model's reliability for broader biological applications. To address the challenge of predicting single-cell perturbations, sc-WGAN [21] applies the more stable Wasserstein GAN (WGAN), while stGAN [22] introduces style transfer learning by incorporating multiple styles into the generator. However, both models still suffer from significant limitations. Because GANs are inherently difficult to train, owing to the challenge of balancing the generator and discriminator, training often produces unstable gradients. This instability makes sc-WGAN and stGAN prone to model collapse, where they fail to decrease the loss effectively during training. As a result, these models struggle to generalize to new datasets and have significantly lower accuracy in predicting perturbations, especially in different biological scenarios. On the other hand, the VAE-based model scGen [23] samples gene expression profiles from a multivariate Gaussian distribution through variational inference. It relies on the assumption of a fixed linear relationship between control (unperturbed) and perturbed cells. This oversimplified assumption is not sufficient for capturing the changes in complex biological data after perturbation.
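To make the fixed linear assumption concrete, the following minimal sketch applies one constant offset in latent space, in the spirit of the latent vector arithmetic used by scGen. It is an illustration under stated assumptions rather than scGen's actual code: the function name, array shapes, and toy data are hypothetical, and the latent vectors are taken as already encoded.

```python
import numpy as np

def fixed_offset_prediction(z_ctrl_train, z_perb_train, z_ctrl_new):
    """Shift new control latents by one constant vector (hypothetical helper)."""
    # A single offset is estimated from the training latents ...
    delta = z_perb_train.mean(axis=0) - z_ctrl_train.mean(axis=0)
    # ... and added unchanged to every held-out control cell.
    return z_ctrl_new + delta

rng = np.random.default_rng(0)
z_ctrl = rng.normal(size=(200, 8))   # toy latents of control cells
z_perb = z_ctrl + 1.5                # toy perturbation: a uniform shift
z_new = rng.normal(size=(5, 8))      # latents of held-out control cells
z_pred = fixed_offset_prediction(z_ctrl, z_perb, z_new)
```

Because the offset is identical for every cell, any cell-specific or non-linear response is lost, which is the limitation a learnable style transformation is meant to address.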

Therefore, we introduce scPerb, a novel tool designed to predict single-cell gene expression under perturbations such as drug dosage, treatment, or gene modification. Unlike previous models, scPerb decouples perturbation-specific features using learnable parameters, overcoming the limitations of fixed vectors—such as those in scGen—that struggle to capture non-linear perturbation features. Additionally, it avoids the instability commonly observed in GAN-based models, which often results in poor accuracy due to challenges in learning features under varying conditions. This adaptive approach enables scPerb to effectively model complex perturbation patterns, leading to significantly more accurate and robust predictions across diverse datasets. scPerb adopts a novel strategy to decouple gene expression data into perturbation-independent contents and perturbation-specific styles. Here the ‘content’ represents the perturbation-irrelevant information, while ‘style’ refers to the perturbation-specific information. To learn the contents and styles, scPerb takes gene expression data input from both control (unperturbed) and perturbed cells, projecting each cell's gene expression into a latent space, with a tailor-designed loss. By transferring this perturbation style from control to perturbed cells, scPerb predicts the gene expressions of perturbed cells. In comprehensive benchmarks, scPerb outperforms other modeling approaches such as scGen, CVAE, stGAN, and sc-WGAN. These benchmarking results provide a valuable resource for the community, highlighting both the potential and limitations of these models when applied to scRNA-seq data. scPerb is implemented as an integrated workflow in Python and is available at https://github.com/QSong-github/scPerb.

Results

In this section, we demonstrate that scPerb accurately predicts perturbed single-cell gene expression data, outperforming several benchmark models including scGen, CVAE, stGAN, and sc-WGAN across multiple datasets. Additionally, scPerb consistently achieves superior performance when extending to smaller datasets, such as the Hpoly dataset. This underscores the versatility and reliability of scPerb in predicting gene expression changes under different perturbations.

Overview of scPerb framework

In this work, we present scPerb, a novel tool to predict single-cell gene expression under specific conditions such as a dose [24], a treatment [25], [26], or a modification of genes [27], [28], [29] (Fig. 1). We hypothesized that the observations $X_{ctrl}$ and $X_{perb}$ from the control and perturbed datasets had two independent latent features: a cell type-related latent feature, denoted as “content” $c$, and a dataset-specific feature, denoted as “style” $s$. scPerb learned the contents $Z^c_{ctrl}$ and $Z^c_{perb}$ of the cell types from both the control and perturbed datasets and transferred the style $Z^s_{ctrl}$ of the control dataset to the style $Z^s_{perb}$ of the perturbed dataset, where the superscript $c$ denotes the content features of the cell types and the superscript $s$ denotes the dataset styles. scPerb solved the perturbation task by learning the latent features of cell types and the condition-specific style vector. Specifically, scPerb estimated the multivariate normal distribution of the cell type feature $c$ and used a neural network to learn the style transformation matrix from the datasets. Unlike previous methods that adopt a constant vector to transfer the latent features from cells of the control condition to those of the perturbed condition, scPerb introduces learnable parameters and allows the neural network to learn both cell type and condition differences between the control and perturbed datasets. In comprehensive evaluations, scPerb produced more accurate predictions than the other approaches. Details are given in Materials and Methods.

Fig. 1.

scPerb predicts gene expressions of perturbed cells. scPerb was designed to predict gene expressions in perturbed cells and combines the principles of both style transfer and VAE. With the perturbed and control datasets as inputs, the content encoder projected the data into latent space. Differences between the latent representations of the perturbed dataset and the control dataset were captured by a style vector (s), which enabled transferring from the perturbed style to the control style. This style vector was initialized as a random vector and updated via a style encoder, which learned the style of the perturbed dataset and transferred it to the control dataset by adding it to the latent representation of the control dataset. By minimizing the differences in both latent representations and gene expressions between predicted perturbed data and real perturbed data, scPerb transferred the control style to the perturbed style and predicted the gene expression of perturbed cells.

scPerb outperforms other benchmarking methods

To demonstrate the performance of scPerb, we compared it with existing methods, including scGen [23], CVAE [30], stGAN [22], and sc-WGAN [21]. Three datasets were used for benchmarking: two published human peripheral blood mononuclear cell (PBMC) datasets, i.e., the PBMC-Kang [24] and PBMC-Zheng [31] datasets, which were perturbed with interferon-β (IFN-β), and a dataset of intestinal epithelial cells infected by the parasitic helminth H.poly [25], i.e., the H.poly dataset.

Based on those three datasets, each method’s performance was evaluated using the R² between predictions and real perturbed data. Specifically, we randomly selected a cell type whose gene expression after perturbation was to be predicted and used the remaining cell types for model training. We repeated this process across all cell types and present the average R² in Fig. 2a. In the PBMC-Zheng dataset [31], scPerb achieved an average R² of 0.98, better than scGen (average R² = 0.94), CVAE (average R² = 0.93), stGAN (average R² = 0.39), and sc-WGAN (average R² = 0.10). Surprisingly, the GAN-based methods performed much worse: neither reached an R² above 0.5. In the PBMC-Kang dataset, scPerb achieved the highest average R² of 0.98, while the second-best and third-best approaches, scGen and CVAE, reached 0.96 and 0.91, respectively. Similarly, stGAN and sc-WGAN only had average R² scores of 0.42 and 0.12, respectively, in this dataset. Finally, we applied scPerb to the H.poly dataset and obtained an average R² of 0.96, followed by scGen, CVAE, stGAN, and sc-WGAN with average R² scores of 0.95, 0.93, 0.58, and 0.14, respectively. When comparing results within a specific cell type, scPerb consistently outperformed the other benchmarking methods (Fig. 2b). For example, in CD4-T cells, one of the most numerous cell types in the PBMC-Zheng dataset, scPerb achieved an R² of 0.99, much better than scGen, CVAE, stGAN, and sc-WGAN (R²: 0.96, 0.95, 0.16, and 0.09, respectively).
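The hold-out protocol described above can be sketched as a set of boolean masks over cells. This is a minimal illustration, not the authors' code: the function name and the labels "control" and "perturbed" are assumptions, and the sketch follows the text literally by excluding the held-out cell type from training altogether.

```python
import numpy as np

def leave_one_cell_type_out(cell_types, conditions, held_out,
                            perturbed_label="perturbed"):
    """Return (train, test_input, test_truth) boolean masks (hypothetical helper)."""
    cell_types = np.asarray(cell_types)
    conditions = np.asarray(conditions)
    is_held = cell_types == held_out
    is_perb = conditions == perturbed_label
    train = ~is_held                 # all remaining cell types, both conditions
    test_input = is_held & ~is_perb  # control cells of the held-out type
    test_truth = is_held & is_perb   # real perturbed cells used as ground truth
    return train, test_input, test_truth

# Toy annotations: three cell types, one control and one perturbed cell each.
cell_types = ["B", "B", "NK", "NK", "CD4-T", "CD4-T"]
conditions = ["control", "perturbed"] * 3

# Repeat the split once per cell type; the per-type scores would then be averaged.
splits = {ct: leave_one_cell_type_out(cell_types, conditions, ct)
          for ct in sorted(set(cell_types))}
```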

Fig. 2.

Results of scPerb in general. a: Comparison of R² values across all benchmarking methods; b: Bar plots showing the R² values of all methods in the PBMC-Zheng dataset [31]; c: Scatter plots showing the correlation between real and predicted gene expression of 7000 genes by scPerb and three other benchmarking methods in CD4-T cells; the five red dots represent the top five DEGs IFIT1, IFIT3, IFI6, ISG15, and ISG20. The values on the x and y axes represent log2 of the mean gene count for each gene across all cells; d: The distributions of the control dataset, the perturbed dataset, and the predictions of all methods for one of the least differentially expressed genes (FTL) and one of the top DEGs (IFIT2). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

In addition, we evaluated the performance of scPerb and the other benchmarking methods across genes. Fig. 2c illustrates the predictions of scPerb and three other benchmarking methods in CD4-T cells from the PBMC-Zheng dataset. The scatter plot showed that scPerb obtained an average R² of 0.9905 when all genes in this cell type were used, rising to 0.9935 when only the top 100 DEGs were considered. Under the same setting, scGen achieved an average R² of 0.9605 over all genes and 0.9963 on the top 100 DEGs. scPerb outperformed CVAE (all genes: 0.9472; top 100 DEGs: 0.9578) and sc-WGAN (all genes: 0.0924; top 100 DEGs: 0.7195) on both evaluation criteria. Among individual genes, the DEGs IFIT1, IFIT3, IFI6, ISG20, and ISG15 showed the best performance.

In Fig. 2d, the distribution of IFIT2 in the control dataset differed largely from its distribution in the perturbed dataset. Notably, the mean of scPerb’s prediction was close to the mean of the perturbed dataset. The distributions of scGen’s and sc-WGAN’s predictions were comparable to the ground truth but had means much lower than that of the ground truth. The predictions of CVAE fell between the control data and the perturbed data, meaning that CVAE could not clearly learn the style difference between them. Although the prediction of stGAN seemed to resemble the mean of the ground truth, the Wilcoxon test [32] yielded a P value below 0.05, indicating a significant difference between stGAN’s prediction distribution and the ground truth. For the other gene, FTL, also shown in Fig. 2d, the distribution in the control dataset resembled the distribution in the real perturbed dataset. In this scenario, most of scPerb’s predictions were close to the mean of the perturbed data, whereas the predictions from scGen and CVAE exhibited a much lower mean than the ground truth. Both GAN-based methods, stGAN and sc-WGAN, produced many outliers that deviated from the perturbed data. To further compare our results with the benchmarks, we applied the Wilcoxon test to these predictions. Only scPerb yielded adjusted P values larger than 0.05 for both genes (0.176 for FTL and 0.074 for IFIT2), indicating that its predictions did not differ significantly from the ground truth. In contrast, all benchmarking methods yielded P values below 0.05, indicating significant differences from the ground truth. Specifically, for FTL and IFIT2 respectively, scGen scored 6.3×10⁻¹⁵ and 0.0033, CVAE scored 0.0307 and 1.63×10⁻⁹, stGAN scored 4.81×10⁻¹⁰⁹ and 3.14×10⁻¹⁰³, and sc-WGAN scored 2.01×10⁻³¹ and 2.41×10⁻¹⁰. Therefore, scPerb demonstrated better performance than the other benchmarking methods.

scPerb predicts single-cell perturbation response accurately

In this section, we aimed to show that scPerb could accurately predict single-cell perturbation responses for other cell types. Fig. 3a summarizes the performance of scPerb over different cell types. In CD4-T, CD14 Mono, and FCGR3A Mono cells, scPerb achieved an average R² of 0.99 for both the top 100 DEGs and all genes. In Dendritic cells, the average R² was 0.98 in both cases. In B cells and NK cells, performance on the top 100 DEGs was slightly better than on all genes (0.99 vs. 0.98 and 0.98 vs. 0.97, respectively). In CD8-T cells, the R² on the top 100 DEGs was 0.94, slightly lower than that on all genes (average R² = 0.96). In Fig. 3b, the dot plot shows the correlation of representative genes among different cell types. For half of the selected genes, the dot plot showed a strong difference between the control gene expression and the real perturbed gene expression; for the other half, the gene patterns were similar in the control and perturbed datasets. In the green dashed rectangle, we highlighted the mean expression in the control, predicted, and real perturbed datasets. Fig. 3b indicated that the mean gene expression of B cells, CD8-T cells, and Dendritic cells predicted by scPerb agreed with the mean gene expression in the real perturbed dataset. The UMAP in Fig. 3c showed that the gene expression predicted by scPerb in CD4-T cells was correlated with the real perturbed gene expression in the latent space. A consistent pattern was also observed for the specific gene IFI6 (Fig. 3d).

Fig. 3.

Result of scPerb in PBMC-Zheng dataset. a: Grouped boxplot showing the R² values of scPerb for all genes and the top 100 DEGs in every cell type in the PBMC-Zheng dataset; b: Dot plot illustrating the mean gene expression in each cell type and condition; c–d: UMAP [33] visualizations depicting the condition distribution of the overall CD4-T cell type in the PBMC-Zheng dataset and the expression pattern of IFI6, one of the top DEGs in the CD4-T cells.

scPerb accurately predicts the perturbation of cells in multiple PBMC datasets

scPerb produced robust predictions of perturbed gene expression in multiple datasets. In the PBMC-Kang dataset [24], scPerb outperformed all other methods (Fig. 4a), achieving a mean R² of 0.98 across all cell types, followed by scGen (R² = 0.96), CVAE (R² = 0.91), stGAN (R² = 0.42), and sc-WGAN (R² = 0.12). Specifically, scPerb predicted the perturbed gene expression in FCGR3A Mono cells with exceptional accuracy, achieving R² scores of 0.995 for all genes and 0.998 for the top 100 DEGs. In contrast, scGen produced R² values of 0.962 and 0.954, while sc-WGAN and stGAN yielded significantly lower R² scores (Fig. 4b).

Fig. 4.

Result of scPerb in PBMC-Kang dataset. a: Bar plot comparing the R² values of all the methods within the PBMC-Kang dataset; central values represent the mean R² values across all 7 cell types in the dataset; b–c: Comparison of the distributions of all the methods for the MT2A gene in CD4-T cells in the PBMC-Kang dataset. Center values in Fig. 4c are the adjusted P values comparing the prediction of each method to the ground truth using the Wilcoxon test; d: Dot plot comparing the mean gene expression of all 7 cell types and all 3 conditions in the PBMC-Kang dataset; e: The correlation of the mean expression of all 6998 genes in FCGR3A Mono cells, comparing predictions from three of the best benchmark methods and scPerb against the ground truth, with shaded lines representing the 95% confidence interval of the regression estimate.

For the MT2A gene, one of the top DEGs in FCGR3A Mono cells, scPerb provided predictions closely aligned with the ground truth, outperforming all other methods. The Wilcoxon test [34] further validated scPerb’s accuracy, with a P-value of 0.878, indicating no statistically significant difference between scPerb’s predictions and the real perturbed data. In contrast, scGen, CVAE, and both GAN-based methods resulted in P-values far below 0.0001, highlighting significant discrepancies in their predictions (Fig. 4c).

Moreover, scPerb provided robust predictions across various gene expression scenarios, whether the control gene expression was lower (e.g., IFIT1), comparable (e.g., RPL13A), or higher (e.g., FTH1) than the real perturbed gene expression (Fig. 4d). Notably, scPerb’s predictions correlated closely with the real data for the top 5 DEGs, as shown by the red dots in Fig. 4e. Overall, scPerb achieved higher R² values (0.995 for all genes and 0.996 for the top 100 DEGs) compared to all other benchmark methods, including scGen, CVAE, and sc-WGAN.

scPerb has robust results across different datasets

In the H.poly dataset [25], scPerb demonstrated superior performance with robust predictive accuracy. Across all cell types, scPerb achieved an average R² of 0.96, outperforming scGen (R² = 0.95) and CVAE (R² = 0.93), as well as the GAN-based methods stGAN (R² = 0.38) and sc-WGAN (R² = 0.14). The line plot in Fig. 5a highlights scPerb’s notable performance, especially in Tuft cells, where it attained an R² of 0.94. In contrast, other VAE-based methods performed worse, with scGen at 0.91 and CVAE at 0.84. As shown in Fig. 5a, all VAE-based methods (scPerb, scGen, CVAE) consistently outperformed GAN-based models (stGAN, sc-WGAN) across most cell types.

Fig. 5.

The result of scPerb in the H.poly dataset. a: Line plot using R² to compare the outcomes of all the methods; b–f: UMAP visualizations of the control, perturbed, and predicted cells. The cell type plotted is the Enterocyte Progenitor cell type in the H.poly dataset, which contains 1131 cells in total, including 586 control cells and 545 perturbed cells.

scPerb also excelled in predicting gene expression in Enterocyte Progenitor cells. As illustrated in Fig. 5b, scPerb’s predictions (green dot) closely matched the real perturbed data (orange dot) compared to the unperturbed dataset (blue dot). In contrast, the predictions from other methods (Fig. 5c–f) were indistinguishable from either the unperturbed or real perturbed data, further emphasizing scPerb’s superior predictive capacity.

To further demonstrate scPerb's ability to predict perturbations, we added two large new datasets (GSE161195 and GSE161801) and a cross-study evaluation to assess the reproducibility of scPerb's effectiveness. scPerb outperformed scGen on both datasets (Supplement Fig. 1, Supplement Fig. 2) and in the cross-study setting (Supplement Fig. 3).

Materials and methods

Here we present scPerb, a generative model to predict gene expression data after perturbation. We hypothesized that the observations $X_{ctrl}$ and $X_{perb}$ from the control and perturbed datasets had two independent latent features: a cell type-related latent feature, denoted as “content” $c$, and a dataset-specific feature, denoted as “style” $s$. scPerb learned the contents $Z^c_{ctrl}$ and $Z^c_{perb}$ of the cell types from both the control and perturbed datasets and transferred the style $Z^s_{ctrl}$ of the control dataset to the style $Z^s_{perb}$ of the perturbed dataset, where the superscript $c$ denotes the content features of the cell types and the superscript $s$ denotes the dataset styles (Fig. 1).

scPerb first translated the input data into a probability distribution in the latent space using an encoder. Specifically, it mapped the input data to a mean ($\mu$) and a variance ($\sigma$) for each latent variable. We then projected the style vector $s$ into the latent space and learned the transformation from the control dataset $X_{ctrl}$ to the perturbed dataset $X_{perb}$; the learned difference between $X_{ctrl}$ and $X_{perb}$ is denoted as $\sigma_s$. Furthermore, we denoted $E^c_\theta(\cdot)$ as the content encoder acquiring the cell type-aware features, $E^s_\phi(\cdot)$ as the style encoder projecting the random style vectors to the latent space, $E^c_\mu(\cdot)$ and $E^c_\sigma(\cdot)$ as the $\mu$ and $\sigma$ estimators for the probability distribution generated by the encoders, and $D_\phi(\cdot)$ as the decoder generating the perturbed data using the latent variables $c$ and $s$. In the inference stage, given a specific cell type from the control dataset $X_{ctrl}$, scPerb would extract the cell type-related features $Z^c_{ctrl}$, generate the “fake” perturbed cell type $\hat{X}_{perb}$ based on $Z^c_{ctrl}$ and $\sigma_s$, and minimize the differences between $Z^s_{ctrl}$ and $Z^s_{perb}$.

Encoders

To extract common cell type content features, we projected both inputs $X_{ctrl}$ and $X_{perb}$ into the latent space. Following the VAE setting, we assumed that the content features followed multivariate normal distributions $N(\mu, \sigma)$, where $\mu$ and $\sigma$ represented the mean and variance of the multivariate normal distribution. The latent representation $Z^c_{ctrl}$ of input data $X_{ctrl}$ was obtained from the learned distribution $N(\mu_{ctrl}, \sigma_{ctrl})$:

$$Z^c_{ctrl} \sim N(\mu_{ctrl}, \sigma_{ctrl})$$

where $\mu_{ctrl} = E^c_\mu(E^c_\theta(X_{ctrl}))$ and $\sigma_{ctrl} = E^c_\sigma(E^c_\theta(X_{ctrl}))$.

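Since the projection weights were shared between the two input datasets $X_{ctrl}$ and $X_{perb}$, the latent representation $Z^c_{perb}$ of input data $X_{perb}$ was obtained from $Z^c_{perb} \sim N(\mu_{perb}, \sigma_{perb})$, where $\mu_{perb} = E^c_\mu(E^c_\theta(X_{perb}))$ and $\sigma_{perb} = E^c_\sigma(E^c_\theta(X_{perb}))$. Following the VAE setting, we used KL losses to estimate $\mu_{ctrl}$, $\sigma_{ctrl}$, $\mu_{perb}$, and $\sigma_{perb}$:

$$KLLoss_{ctrl} = KL(N(\mu_{ctrl}, \sigma_{ctrl}), N(0, I))$$

$$KLLoss_{perb} = KL(N(\mu_{perb}, \sigma_{perb}), N(0, I))$$

where the KL divergence was calculated by:

$$KL(P, Q) = \sum_{x \in X} P(x) \log \frac{P(x)}{Q(x)}$$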
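
For a Gaussian posterior and a standard normal prior, the KL terms above have a well-known closed form, which is how VAE implementations typically evaluate them. The expression below is a standard result rather than a formula taken from the paper; it assumes a diagonal covariance, writes $\sigma_j^2$ for the variance of latent dimension $j$, and uses $d$ for the number of latent dimensions:

$$KL(N(\mu, \mathrm{diag}(\sigma^2)), N(0, I)) = \frac{1}{2} \sum_{j=1}^{d} \left( \mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1 \right)$$

Each summand is zero exactly when $\mu_j = 0$ and $\sigma_j^2 = 1$, so these terms pull both the control and the perturbed posteriors toward the prior.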

In this work, our task was to generate the “fake” perturbed cell types from the same cell types in the control dataset. Therefore, instead of learning the dataset styles explicitly, we applied a lightweight network to learn the transformation $\sigma_s$ in the latent space. Our idea was inspired by style transfer learning [22], in which a randomly sampled style vector $s$ is projected into the latent space as the style. In scPerb, we applied a style encoder $E^s_\phi(\cdot)$, which projected $s$ into the latent space as the transformation variable that converts $Z^c_{ctrl}$ to $Z^c_{perb}$:

$$\sigma_s = E^s_\phi(s)$$

$$\hat{Z}^c_{perb} = Z^c_{ctrl} + \sigma_s$$

Therefore, we had the following StyleLoss:

$$StyleLoss = SmoothL1Loss(Z^c_{perb}, \hat{Z}^c_{perb})$$

where the SmoothL1Loss was defined as:

$$SmoothL1Loss(x, y) = \begin{cases} \dfrac{(x - y)^2}{2\beta} & \text{if } |x - y| < \beta \\ |x - y| - 0.5\beta & \text{otherwise} \end{cases}$$
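
The style shift and the smooth L1 loss defined above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the released scPerb code: the one-layer `style_shift` stands in for the style encoder, the mean reduction over elements and `beta=1.0` are assumptions, and all shapes and data are toy values.

```python
import numpy as np

def smooth_l1(x, y, beta=1.0):
    """Smooth L1 loss averaged over all elements (mean reduction is an assumption)."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    # Quadratic branch for small errors, linear branch for large ones.
    per_element = np.where(diff < beta, 0.5 * diff**2 / beta, diff - 0.5 * beta)
    return float(per_element.mean())

def style_shift(s, weight, bias):
    """Hypothetical one-layer style encoder mapping the style vector s to sigma_s."""
    return s @ weight + bias

rng = np.random.default_rng(0)
latent_dim, style_dim = 4, 3
z_ctrl = rng.normal(size=(10, latent_dim))         # content latents of control cells
z_perb = z_ctrl + 0.8                              # toy perturbed latents
s = rng.normal(size=style_dim)                     # randomly sampled style vector
weight = rng.normal(size=(style_dim, latent_dim))  # untrained toy parameters
bias = np.zeros(latent_dim)

sigma_s = style_shift(s, weight, bias)   # shift in latent space
z_perb_hat = z_ctrl + sigma_s            # the same shift is broadcast to every cell
style_loss = smooth_l1(z_perb, z_perb_hat)
```

In training, the parameters of the style encoder would be updated to reduce `style_loss`; here they are left random, so the value only illustrates how the loss is computed.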

Decoder

In the decoder part, scPerb reparametrized the latent variables from the estimated posterior distributions $Z^c_{ctrl} \sim N(\mu_{ctrl}, \sigma_{ctrl})$ and $Z^c_{perb} \sim N(\mu_{perb}, \sigma_{perb})$. Unlike the standard VAE, which directly reconstructs the output $\hat{X}_{perb}$ from the latent variables $Z^c_{ctrl}$ and $Z^c_{perb}$, scPerb converted the representation of the control data $Z^c_{ctrl}$ to the latent representation $\hat{Z}^c_{perb}$ and generated the predicted perturbed data with the decoder $D_\phi$:

$$\hat{X}_{perb} = D_\phi(\hat{Z}^c_{perb})$$

Because our task was to predict the perturbation of the cell types using the control dataset, rather than to generate samples from $Z^c_{perb}$ and $Z^c_{ctrl}$ as in the original VAE, we only used $\hat{Z}^c_{perb}$ to generate $\hat{X}_{perb}$. Therefore, our GeneratedLoss was:

$$GeneratedLoss = SmoothL1Loss(X_{perb}, \hat{X}_{perb})$$

Loss function

The final objective function consisted of the GeneratedLoss, the StyleLoss, and the KL regularization terms:

$$Loss = w_1 \, StyleLoss + w_2 \, KLLoss_{ctrl} + w_3 \, KLLoss_{perb} + w_4 \, GeneratedLoss$$
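
As a rough sketch of how the four terms combine, the snippet below evaluates the Gaussian KL terms in closed form and forms the weighted sum. The function names, the diagonal-covariance assumption, and the example weights are hypothetical; the weights used by the authors are not stated in this section.

```python
import numpy as np

def gaussian_kl(mu, var):
    """KL(N(mu, diag(var)) || N(0, I)), averaged over cells (diagonal covariance assumed)."""
    mu = np.asarray(mu, dtype=float)
    var = np.asarray(var, dtype=float)
    per_cell = 0.5 * np.sum(mu**2 + var - np.log(var) - 1.0, axis=-1)
    return float(per_cell.mean())

def total_loss(style_loss, kl_ctrl, kl_perb, generated_loss,
               weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four terms; the default weights are placeholders."""
    w1, w2, w3, w4 = weights
    return w1 * style_loss + w2 * kl_ctrl + w3 * kl_perb + w4 * generated_loss

# Toy posterior that already equals the prior, so both KL terms vanish.
mu = np.zeros((5, 4))
var = np.ones((5, 4))
kl = gaussian_kl(mu, var)
loss = total_loss(0.5, kl, kl, 1.0, weights=(1.0, 0.5, 0.5, 2.0))
```

Raising the weight on the generated loss favors reconstruction accuracy in gene expression space, while raising the KL weights favors a latent space closer to the prior; how to balance them is a tuning decision.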

Datasets and preprocess

The PBMC-Zheng dataset was obtained from a study by Zheng et al. [31] that involved massively parallel digital transcriptional profiling of single cells using single-cell RNA sequencing (scRNA-seq). This dataset includes 18,868 peripheral blood mononuclear cells (PBMCs), consisting of 9925 perturbed cells treated with IFN-β and 8943 control cells. To ensure data quality, we first removed megakaryocyte cells, which had uncertain or ambiguous label assignments due to their small sample size and difficulty in classification. We then log-transformed the gene expression levels to stabilize the variance and make the training process smoother. For our analysis, we focused on the average gene expression profiles of the top 20 gene clusters, which contain 7000 genes. This dataset is split into training and testing sets. The training set can be obtained from https://www.dropbox.com/s/wk5zewf2g1oat69/train_pbmc.h5ad?dl=1 and the testing set can be obtained from https://www.dropbox.com/S/Nqi971n0tk4nbfj/valid_pbmc.h5ad?dl=1.

Kang et al. published a dataset of PBMCs [24], including both control and perturbed cells (also treated with IFN-β). We applied the same preprocessing as for the PBMC-Zheng dataset: removing megakaryocyte cells, performing log transformation, and filtering the top 20 gene clusters (6998 genes in total). Each of these two preprocessed PBMC datasets contains seven cell types: B cells, CD4-T cells, CD8-T cells, CD14 Mono cells, Dendritic cells, FCGR3A Mono cells, and NK cells. This dataset can be obtained under the accession number GSE96583.

Haber et al. presented a dataset of the responses of epithelial cells infected by Salmonella and H.poly [25]. In this dataset, there were 3240 control cells, 2711 H.poly-infected cells, and 1770 Salmonella-infected cells. As with the PBMC datasets, we normalized and log-transformed the data and selected the top 7000 highly variable genes to allow a side-by-side comparison. This dataset can be obtained under the accession number GSE92332.

In our model, we performed further data preprocessing to ensure consistency between control and perturbed cells within each cell type. Specifically, we randomly selected an equal number of control cells and perturbed cells for each cell type in order to balance the dataset. This step produced a more robust and unbiased dataset, enabling accurate comparisons within each cell type. It also guaranteed that each pair of $X_{ctrl}$ and $X_{perb}$ had the same cell type, so that the subsequent style transfer process would be valid.
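
The balancing step can be sketched as per-cell-type random subsampling. This is a minimal illustration, not the authors' preprocessing code: the function name and the "control" and "perturbed" labels are hypothetical, and downsampling the larger group to the size of the smaller one is an assumption about how the equal numbers were obtained.

```python
import numpy as np

def balance_conditions(cell_types, conditions, seed=42):
    """Return sorted indices with equal control/perturbed counts per cell type."""
    cell_types = np.asarray(cell_types)
    conditions = np.asarray(conditions)
    rng = np.random.default_rng(seed)
    keep = []
    for ct in np.unique(cell_types):
        ctrl_idx = np.flatnonzero((cell_types == ct) & (conditions == "control"))
        perb_idx = np.flatnonzero((cell_types == ct) & (conditions == "perturbed"))
        n = min(len(ctrl_idx), len(perb_idx))
        if n == 0:
            continue  # a cell type missing one condition cannot be paired
        # Sample without replacement so that no cell is duplicated.
        keep.extend(rng.choice(ctrl_idx, size=n, replace=False))
        keep.extend(rng.choice(perb_idx, size=n, replace=False))
    return np.sort(np.asarray(keep, dtype=int))

# Toy annotations: B cells (3 control, 2 perturbed), NK cells (1 control, 3 perturbed).
cell_types = ["B"] * 5 + ["NK"] * 4
conditions = (["control"] * 3 + ["perturbed"] * 2
              + ["control"] * 1 + ["perturbed"] * 3)
kept = balance_conditions(cell_types, conditions)
```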

Statistics and reproducibility

In scPerb, we evaluated the performance of our model under a fixed seed of 42 using the squared R value (R²), calculated with the scipy.stats.linregress function [35]. This metric measured the degree to which the predicted perturbed data and the real perturbed data were correlated. We computed the R² values for the mean and variance of all genes and of the top 100 differentially expressed genes (DEGs). To inspect the model’s results visually, we created scatter plots comparing the predicted perturbed data with the corresponding ground truth data. These plots allowed us to observe how well the model’s predictions aligned with the actual values.
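
A minimal sketch of the R² metric on per-gene means is given below. It uses the squared Pearson correlation from NumPy, which equals the squared `rvalue` reported by scipy.stats.linregress for the same two vectors; the function name, the optional gene index argument, and the toy data are assumptions, and only the mean-based variant is shown.

```python
import numpy as np

def r_squared_of_means(x_pred, x_real, gene_idx=None):
    """Squared Pearson r between per-gene mean expression of two cell-by-gene matrices."""
    pred_mean = np.asarray(x_pred, dtype=float).mean(axis=0)
    real_mean = np.asarray(x_real, dtype=float).mean(axis=0)
    if gene_idx is not None:
        # Restrict to a gene subset, for example the top 100 DEGs.
        pred_mean = pred_mean[gene_idx]
        real_mean = real_mean[gene_idx]
    r = np.corrcoef(pred_mean, real_mean)[0, 1]
    return float(r**2)

rng = np.random.default_rng(42)
x_real = rng.gamma(shape=2.0, scale=1.0, size=(50, 20))  # toy cells x genes matrix
x_pred = 2.0 * x_real + 1.0                              # perfectly linear toy prediction
r2_all = r_squared_of_means(x_pred, x_real)
r2_subset = r_squared_of_means(x_pred, x_real, gene_idx=np.arange(5))
```

Note that this metric measures linear correlation, so a prediction that is systematically scaled or shifted, as in the toy example, still scores close to 1.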

Additionally, we used violin plots to examine the discrepancies between the predicted perturbed data and the real perturbed data for the top DEGs. DEGs are genes that exhibit statistically significant differences in expression levels between two or more conditions. In our case, the top DEGs are the genes with the greatest statistical differences between control and perturbed conditions, ranked using the Wilcoxon rank-sum test [34] in the scanpy.tl.rank_genes_groups function. Through these analyses, we aimed to assess the accuracy and performance of our scPerb model based on the input gene expression data. The R² values and the scatter and violin plots provided insights into the model’s capabilities and highlighted discrepancies between the predicted and real perturbed data for further investigation.

Discussion

scPerb is a novel generative model that predicts gene expression after perturbation. The encoder of scPerb projects the gene expression of both control and perturbed data into a high-dimensional latent space. scPerb then aggregates this latent representation with dataset-specific styles to generate a high-quality representation of the perturbed dataset. From this representation, the decoder of scPerb reconstructs the gene expression of the perturbed data. The experiments demonstrate that scPerb captures latent content features and generates dataset-specific styles across different cell types and conditions. Moreover, the quantitative evaluation indicates that scPerb outperforms four existing methods in nearly all cell types across three different datasets.

Compared with previous work [21], [22], [23], [30], scPerb is a data-driven algorithm that fully explores the gene expression in the raw dataset and does not rely on strong domain priors. In contrast, previous work extracts principal components and builds a graph-based model on a low-dimensional manifold. Such methods rely heavily on expert domain knowledge and lack generalization capability. Compared with other data-driven algorithms, scPerb combines the stability of the VAE setting with the ability of GANs to generate high-quality samples.

However, minor problems remain. For Endocrine cells, one of the cell types with the fewest cells in the H.poly dataset (163 of 5059), scPerb predicts slightly worse than scGen [23]: scGen reaches an R² of 0.89, whereas scPerb reaches only 0.87. Because scGen only computes a fixed linear vector while scPerb uses style transfer, scPerb is prone to overfitting in this case. Such cases are rare, however, and scPerb still outperforms other methods such as scGen on other small cell populations. For Tuft cells, another of the cell types with the fewest cells in the H.poly dataset (248 of 5059), scPerb achieves an R² of 0.94, whereas scGen reaches only 0.91.
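To make the contrast concrete, the following sketch shows what a fixed linear vector in latent space amounts to, in the spirit of the vector-arithmetic approach of scGen: a single shift vector is estimated as the difference between the mean latent codes of perturbed and control cells and is then added to every control cell. The latent codes here are random stand-ins for encoder outputs, and the function names are illustrative; this is not the style encoder used by scPerb.

```python
import numpy as np

def latent_shift(z_ctrl, z_perb):
    """One fixed shift vector: mean perturbed code minus mean control code."""
    return np.mean(z_perb, axis=0) - np.mean(z_ctrl, axis=0)

def predict_latent(z_ctrl_new, delta):
    """Apply the same shift to every held-out control cell."""
    return np.asarray(z_ctrl_new, dtype=float) + delta

# Stand-in latent codes (cells x latent dimensions); real codes would come from an encoder.
rng = np.random.default_rng(42)
z_ctrl = rng.normal(size=(300, 8))
z_perb = rng.normal(size=(300, 8)) + 1.5   # perturbation shifts every dimension
z_holdout = rng.normal(size=(50, 8))       # control cells of a held-out cell type

delta = latent_shift(z_ctrl, z_perb)
z_pred = predict_latent(z_holdout, delta)
```

Because the shift has only as many parameters as the latent space has dimensions, it has little capacity to overfit a cell type with very few cells, which is consistent with the behaviour discussed above for Endocrine cells.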

Recent advances in droplet microfluidics and microfluidic impedance cytometry [36], [37] provide data resources for perturbation studies. As these platforms produce more data, scPerb and other models can be evaluated for robustness and accuracy across diverse perturbation scenarios. This will not only enhance the reliability of scPerb’s predictions but also expand its applicability to a wider range of biological contexts.

Code availability

scPerb is provided as a Python package available at https://github.com/QSong-github/scPerb, with detailed functions for implementation.

Compliance with Ethics Requirements

This article does not contain any studies with human or animal subjects.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

Q.S. is supported by the National Institute of General Medical Sciences of the National Institutes of Health (R35GM151089).

Footnotes

Appendix A

Supplementary data to this article can be found online at https://doi.org/10.1016/j.jare.2024.10.035.

Appendix A. Supplementary data

The following are the Supplementary data to this article:

Supplementary Data 1
mmc1.docx (2.3MB, docx)

References

  • 1. Baron M., et al. A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. Cell Syst. 2016;3:346–360. doi: 10.1016/j.cels.2016.08.011.
  • 2. Puram S.V., et al. Single-cell transcriptomic analysis of primary and metastatic tumor ecosystems in head and neck cancer. Cell. 2017;171:1611–1624. doi: 10.1016/j.cell.2017.10.044.
  • 3. Athanasiadis E.I., et al. Single-cell RNA-sequencing uncovers transcriptional states and fate decisions in haematopoiesis. Nat Commun. 2017;8:2045. doi: 10.1038/s41467-017-02305-6.
  • 4. Azizi E., et al. Single-cell map of diverse immune phenotypes in the breast tumor microenvironment. Cell. 2018;174:1293–1308. doi: 10.1016/j.cell.2018.05.060.
  • 5. Cusanovich D.A., et al. A single-cell atlas of in vivo mammalian chromatin accessibility. Cell. 2018;174:1309–1324. doi: 10.1016/j.cell.2018.06.052.
  • 6. Muraro M.J., et al. A single-cell transcriptome atlas of the human pancreas. Cell Syst. 2016;3:385–394. doi: 10.1016/j.cels.2016.09.002.
  • 7. Iram T., Consortium T.M. Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris. Nature. 2018;562:367–372. doi: 10.1038/s41586-018-0590-4.
  • 8. Buenrostro J.D., et al. Integrated single-cell analysis maps the continuous regulatory landscape of human hematopoietic differentiation. Cell. 2018;173:1535–1548. doi: 10.1016/j.cell.2018.03.074.
  • 9. Jagadeesh K.A., et al. Identifying disease-critical cell types and cellular processes by integrating single-cell RNA-sequencing and human genetics. Nat Genet. 2022;54:1479–1492. doi: 10.1038/s41588-022-01187-9.
  • 10. Shao X., et al. scCATCH: automatic annotation on cell types of clusters from single-cell RNA sequencing data. iScience. 2020;23. doi: 10.1016/j.isci.2020.100882.
  • 11. Crow M., Paul A., Ballouz S., Huang Z.J., Gillis J. Characterizing the replicability of cell types defined by single cell RNA-sequencing data using MetaNeighbor. Nat Commun. 2018;9:884. doi: 10.1038/s41467-018-03282-0.
  • 12. Wei J.-R., et al. Identification of visual cortex cell types and species differences using single-cell RNA sequencing. Nat Commun. 2022;13:6902. doi: 10.1038/s41467-022-34590-1.
  • 13. Tasaki S., et al. Inferring protein expression changes from mRNA in Alzheimer’s dementia using deep neural networks. Nat Commun. 2022;13:655. doi: 10.1038/s41467-022-28280-1.
  • 14. Denyer T., et al. Spatiotemporal developmental trajectories in the Arabidopsis root revealed using high-throughput single-cell RNA sequencing. Dev Cell. 2019;48:840–852. doi: 10.1016/j.devcel.2019.02.022.
  • 15. Torre E., et al. Rare cell detection by single-cell RNA sequencing as guided by single-molecule RNA FISH. Cell Syst. 2018;6:171–179. doi: 10.1016/j.cels.2018.01.014.
  • 16. Wu H., Kirita Y., Donnelly E.L., Humphreys B.D. Advantages of single-nucleus over single-cell RNA sequencing of adult kidney: rare cell types and novel cell states revealed in fibrosis. J Am Soc Nephrol. 2019;30:23. doi: 10.1681/ASN.2018090912.
  • 17. Andrews T.S., Kiselev V.Y., McCarthy D., Hemberg M. Tutorial: guidelines for the computational analysis of single-cell RNA sequencing data. Nat Protoc. 2021;16:1–9. doi: 10.1038/s41596-020-00409-w.
  • 18. Chen G., Ning B., Shi T. Single-cell RNA-seq technologies and related computational data analysis. Front Genet. 2019;10:317. doi: 10.3389/fgene.2019.00317.
  • 19. Goodfellow I., et al. Generative adversarial nets. Adv Neural Inf Proces Syst. 2014;27.
  • 20. Kingma D.P., Welling M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114; 2013.
  • 21. Ghahramani A., Watt F.M., Luscombe N.M. Generative adversarial networks uncover epidermal regulators and predict single cell perturbations. bioRxiv. 2018.
  • 22. Karras T., Laine S., Aila T. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2019:4401–10.
  • 23. Lotfollahi M., Wolf F.A., Theis F.J. scGen predicts single-cell perturbation responses. Nat Methods. 2019;16:715–721. doi: 10.1038/s41592-019-0494-8.
  • 24. Kang H.M., et al. Multiplexed droplet single-cell RNA-sequencing using natural genetic variation. Nat Biotechnol. 2018;36:89–94. doi: 10.1038/nbt.4042.
  • 25. Haber A.L., et al. A single-cell survey of the small intestinal epithelium. Nature. 2017;551:333–339. doi: 10.1038/nature24489.
  • 26. Hagai T., et al. Gene expression variability across cells and species shapes innate immunity. Nature. 2018;563:197–202. doi: 10.1038/s41586-018-0657-2.
  • 27. Dixit A., et al. Perturb-Seq: dissecting molecular circuits with scalable single-cell RNA profiling of pooled genetic screens. Cell. 2016;167:1853–1866. doi: 10.1016/j.cell.2016.11.038.
  • 28. Adamson B., et al. A multiplexed single-cell CRISPR screening platform enables systematic dissection of the unfolded protein response. Cell. 2016;167:1867–1882. doi: 10.1016/j.cell.2016.11.048.
  • 29. Datlinger P., et al. Pooled CRISPR screening with single-cell transcriptome readout. Nat Methods. 2017;14:297–301. doi: 10.1038/nmeth.4177.
  • 30. Cortes C., Lawrence N., Lee D., Sugiyama M., Garnett R. In: Proceedings of the 29th annual conference on neural information processing systems; 2015.
  • 31. Zheng G.X., et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun. 2017;8:14049. doi: 10.1038/ncomms14049.
  • 32. Cuzick J. A Wilcoxon-type test for trend. Stat Med. 1985;4:87–90. doi: 10.1002/sim.4780040112.
  • 33. McInnes L., Healy J., Melville J. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426; 2018.
  • 34. Wolf F.A., Angerer P., Theis F.J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 2018;19:1–5. doi: 10.1186/s13059-017-1382-0.
  • 35. Virtanen P., et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods. 2020;17:261–272. doi: 10.1038/s41592-019-0686-2.
  • 36. Jiang Z., Shi H., Tang X., Qin J. Recent advances in droplet microfluidics for single-cell analysis. TrAC Trends Anal Chem. 2023;159.
  • 37. Zhu J., et al. Microfluidic impedance cytometry enabled one-step sample preparation for efficient single-cell mass spectrometry. Small. 2024;20. doi: 10.1002/smll.202310700.
