Nucleic Acids Research. 2023 Jan 11;51(4):e20. doi: 10.1093/nar/gkac1212

Single-cell gene regulatory network prediction by explainable AI

Philipp Keyl 1, Philip Bischoff 2,3,4, Gabriel Dernbach 5,6, Michael Bockmayr 7,8,9, Rebecca Fritz 10, David Horst 11,12, Nils Blüthgen 13,14, Grégoire Montavon 15,16, Klaus-Robert Müller 17,18,19,20, Frederick Klauschen 21,22,23,24,25
PMCID: PMC9976884  PMID: 36629274

Abstract

The molecular heterogeneity of cancer cells contributes to the often partial response to targeted therapies and relapse of disease due to the escape of resistant cell populations. While single-cell sequencing has started to improve our understanding of this heterogeneity, it offers a mostly descriptive view on cellular types and states. To obtain more functional insights, we propose scGeneRAI, an explainable deep learning approach that uses layer-wise relevance propagation (LRP) to infer gene regulatory networks from static single-cell RNA sequencing data for individual cells. We benchmark our method with synthetic data and apply it to single-cell RNA sequencing data of a cohort of human lung cancers. From the predicted single-cell networks our approach reveals characteristic network patterns for tumor cells and normal epithelial cells and identifies subnetworks that are observed only in (subgroups of) tumor cells of certain patients. While current state-of-the-art methods are limited by their ability to only predict average networks for cell populations, our approach facilitates the reconstruction of networks down to the level of single cells which can be utilized to characterize the heterogeneity of gene regulation within and across tumors.

INTRODUCTION

Therapeutic decisions in the battle against cancer increasingly rely on molecular tumor characteristics, and molecular profiling of cancer tissue is becoming an integral part of routine diagnostics (1–3). Still, in many cases therapy outcome can only insufficiently be predicted based on molecular properties, suggesting a discrepancy between current markers and their functional implications for tumorigenesis or therapy resistance. The investigation of gene regulatory networks (GRN) inferred from transcriptomic profiling aims at bringing out these functional aspects of cancer genomics. Many methods have been developed that infer network information from gene expression profiling, but these methods mostly infer average gene regulatory networks for a cohort of tumor samples and are therefore limited in that they cannot be used to identify patient-specific differences (4,5). To infer gene regulatory networks for individual patients, these methods would therefore require multiple samples from the same patient. This has become possible by the development of single-cell RNA sequencing methods that can provide thousands of transcriptomic samples from the same patient. While these approaches contribute to better understanding the major oncogenic mechanisms in a patient’s cancer, they cannot be used to analyze intra-tumoral heterogeneity with respect to gene regulation (4,6–10).

This limitation is a severe shortcoming of these approaches, since the progression of only a few tumor cell clones resistant to the (targeted) therapy may lead, due to evolutionary pressure, to a limited response and the development of therapy resistance. It is therefore of clinical relevance to keep these ‘therapeutic gaps’ to a minimum. For this reason, methods capable of inferring single-cell GRNs are needed.

Here, we introduce the method scGeneRAI, which employs the explainable artificial intelligence method layer-wise relevance propagation (LRP) (see e.g. (11–18)) to infer gene regulatory networks of individual cells from single-cell RNA sequencing data.

To predict single-cell gene regulatory networks, scGeneRAI trains a deep neural network to predict the abundance of one gene based on arbitrary sets of other genes. Subsequently, LRP is applied to estimate the relevance of every gene for this prediction. This extends our previous explainable AI approach to predict protein networks for individual patients from bulk proteomic profiling data (19). We further develop the approach here for scRNA-seq data analysis, which poses additional challenges due to the often small transcript counts per cell and the frequent occurrence of dropouts, and we examine its performance on synthetic data (6).

Our cell-specific predictions are supported by a global ML model, which unlike a local statistical analysis (e.g. (20)), is supported by all data points and can better deal with sampling heterogeneities caused by the presence of cells with unique molecular properties that are rare in the population. Specifically, our approach reflects the global structure, and can capture more robustly complex global correlations, which are inherent to biological networks (21,22).

Using single-cell sequencing data from 10 NSCLC patients (23), we apply scGeneRAI to predict gene regulatory networks for >15 000 single normal and lung cancer cells. We report known as well as novel network structures, of which some are observed across tumors whereas others are specific to certain patients and tumor cell subclones.

MATERIALS AND METHODS

Inference of gene regulatory networks with scGeneRAI

We first explain the general procedure for predicting GRNs of individual cells from single-cell RNA sequencing data, which consists of neural network training and LRP computation. The following sections describe these two steps in more detail and provide sufficient information to replicate the experiments. scGeneRAI is applied to a data set with M samples (cells) and N features (genes). Ten percent of the data set is selected uniformly at random and held out as a test set to prevent the neural network from overfitting. Network training and the prediction of single-cell GRNs are conducted on the remaining 90% of the data.

(i) First, a neural network is trained to predict, for every sample, arbitrary genes based on an arbitrary set of other genes. (ii) Given the trained neural network and one sample, i.e. the transcriptome data of one cell, a gene G* is predicted based on K randomly sampled sets of other genes from this cell. After each prediction, layer-wise relevance propagation (LRP) is used to estimate the relevance each predicting gene has for the prediction of G*. Afterwards, these relevances are averaged over all K repetitions, generating raw LRP values (LRPr) between G* and the predicting genes. This procedure is repeated for every possible G*, leaving us with a full matrix of LRPr scores with dimension N × N. In order to obtain an undirected (but more robust) measure of the interaction score between two genes, LRPau is computed as the absolute undirected average of the two LRPr scores between two genes.

Neural network training

We used a simple neural network architecture with hyperparameter heuristics based on our previous study (19): the NN was constructed with two hidden layers of width 10 · N in order to vary the NN capacity according to the data set. The NN was tasked to perform imputation for a set of N genes, P of which are missing. For the input vector, every gene with abundance a was encoded as the tuple (a, 1 − a). This leaves the vector (0, 0) free to encode the P genes that are hidden by choice. Thus, the gene part of the NN input was constructed as a vector of size 2 · N and maps to the imputed abundance vector of size N. Additionally, we included information about the cell type (‘ct’) and the patient ID (‘pid’) by concatenating two one-hot encoded vectors, representing the cell type and the patient ID respectively, to the vector of genes given as input. This was done so that the relevance that one gene has for the prediction of another gene would not be influenced by group effects. Training was performed with stochastic gradient descent with a learning rate of 0.02, the PyTorch learning rate scheduler with a weak (gamma = 0.995) exponential learning rate decay to ensure network convergence, a batch size of 5 and a momentum of 0.9. The log hyperbolic cosine loss was applied as training loss. We trained for a maximum of 1500 epochs and used early stopping to select the neural network that performed best.
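The input encoding described above can be illustrated with a minimal sketch. It assumes that abundances are scaled to the interval [0, 1] and that the gene pairs come first, followed by the cell type and patient ID one-hot vectors; the function name `encode_input` and this ordering are hypothetical and not taken from the scGeneRAI code.

```python
from typing import List, Sequence


def encode_input(abundances: Sequence[float], masked: Sequence[bool],
                 cell_type: int, n_cell_types: int,
                 patient: int, n_patients: int) -> List[float]:
    """Encode one cell as NN input (illustrative layout, see assumptions).

    Visible genes become (a, 1 - a); masked genes become (0, 0).
    One-hot vectors for cell type and patient ID are appended.
    """
    vec: List[float] = []
    for a, hidden in zip(abundances, masked):
        vec.extend((0.0, 0.0) if hidden else (a, 1.0 - a))
    ct = [0.0] * n_cell_types
    ct[cell_type] = 1.0
    pid = [0.0] * n_patients
    pid[patient] = 1.0
    return vec + ct + pid


# Three genes, the second one hidden; cell type 1 of 2, patient 0 of 3.
x = encode_input([0.2, 0.7, 0.5], [False, True, False],
                 cell_type=1, n_cell_types=2, patient=0, n_patients=3)
```

Because the two entries of a visible gene always sum to 1, the pair (0, 0) cannot be confused with any observed abundance, including a = 0, which is encoded as (0, 1).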

Layer-wise relevance propagation (LRP)

Based on the trained NN model, we build a gene regulatory network by identifying, for each predicted gene at the output of the NN, the genes at the input that contribute to the predicted value. Let F denote the function implemented by our NN model, and (y1, …, yN) = F(x1, …, xN, xct, xpid) the computation of the N output genes from the N input genes plus the ‘ct’ and ‘pid’ variables. We would like to identify for each data point a matrix of ‘relevance scores’ $(R_{il})_{i,l}$ of size N × N containing the contribution of each input gene i to each output gene l.

The problem of computing these scores for some function F evaluated at some data point x is known as attribution. Many approaches have been proposed for attribution, e.g. (11,24,25). Here, we use the Layer-wise Relevance Propagation (LRP) (11) approach for its robustness and advantageous computational properties, as it lets us extract for each output yl the collection of scores $(R_{il})_{i=1}^{N}$ in the order of a single forward/backward pass. The LRP procedure starts at the output of the NN with a particular predicted gene value yl and redistributes this score to the input of the NN in an iterative layer-wise manner. Let j and k be indices for neurons at two adjacent layers, and let aj and ak denote their respective activations. Activations at these two layers are related via the neuron equation:

$$a_k = \rho\Big(\sum_{0,j} a_j w_{jk}\Big)$$

where $w_{jk}$, $b_k$ are the neuron parameters learned from the data, where ρ is a ReLU or linear activation function, and where ∑0, j sums over all input neurons j plus a bias (represented as a constant activation a0 = 1 and weight $w_{0k} = b_k$). Denote by Rk the relevance score that has been attributed to neuron k by propagation of yl from the top layer back to the layer of neuron k. To propagate relevance scores one layer below (i.e. onto the layer of neuron j), we use the propagation rule:

graphic file with name M0005.gif

where (·)+ and (·)− are shortcut notations for max(0, ·) and min(0, ·). This rule is known as ‘generalized LRP-γ’ and is used in (19,26). The parameter γ is a hyperparameter that needs to be selected to maximize explanation quality. When reaching the input layer, we get 2 · N explanation scores representing gene contributions (each gene being represented as a pair of two values at the input of the NN), plus a few more scores associated with the ‘ct’ and ‘pid’ features. We reach the desired N explanation scores by ignoring the ‘ct’ and ‘pid’ scores, and then reducing (i.e. summing) the remaining 2 · N scores into an N-dimensional vector representing the contribution of each gene.
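The final reduction step can be sketched as follows. The sketch assumes that the two input entries of each gene are adjacent and precede the ‘ct’ and ‘pid’ entries; this layout and the function name `reduce_input_relevance` are illustrative assumptions.

```python
from typing import List, Sequence


def reduce_input_relevance(scores: Sequence[float], n_genes: int) -> List[float]:
    """Sum the two input-level relevance scores of each gene.

    `scores` holds 2 * n_genes gene entries followed by any number of
    'ct' / 'pid' entries, which are ignored (assumed layout).
    """
    if len(scores) < 2 * n_genes:
        raise ValueError("expected at least 2 * n_genes relevance scores")
    return [scores[2 * i] + scores[2 * i + 1] for i in range(n_genes)]


# Two genes followed by two covariate scores that are dropped.
per_gene = reduce_input_relevance([0.25, 0.5, -0.5, 0.0, 9.0, 9.0], n_genes=2)
# per_gene == [0.75, -0.5]
```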

The LRP procedure is repeated K = 100 times for random sets of predicting genes and the ‘raw’ LRP score (which we denote by LRPr) is then defined as the average over these random sets of genes. LRPr scores are then computed for all predicted genes, which leaves us with an N × N matrix of LRP scores representing gene-to-gene interactions. Subsequently, self-loops are excluded and the absolute undirected LRPau scores for every pair of genes A and B are calculated as $\mathrm{LRP}_{au}(A, B) = \tfrac{1}{2}\big(|\mathrm{LRP}_r(A, B)| + |\mathrm{LRP}_r(B, A)|\big)$.
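The aggregation from per-mask relevances to LRPau can be sketched in a few lines. This is a minimal illustration, assuming that LRPau is the mean of the absolute values of the two directed LRPr scores (one reading of ‘absolute undirected average’) and that the average over masks is a plain element-wise mean; the helper names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def mean_over_masks(per_mask: List[Matrix]) -> Matrix:
    """Element-wise mean of K relevance matrices (one per random mask)."""
    k = len(per_mask)
    n = len(per_mask[0])
    return [[sum(m[a][b] for m in per_mask) / k for b in range(n)]
            for a in range(n)]


def lrp_au(lrp_r: Matrix) -> Matrix:
    """Symmetrise LRPr: mean of |A->B| and |B->A|, self-loops set to 0."""
    n = len(lrp_r)
    return [[0.0 if a == b else (abs(lrp_r[a][b]) + abs(lrp_r[b][a])) / 2
             for b in range(n)] for a in range(n)]


# Two masks (K = 2) for three genes; K = 100 in the text.
m1 = [[0.0, 0.5, -0.25], [0.0, 0.0, 0.25], [0.75, -0.25, 0.0]]
m2 = [[0.0, 0.5, -0.25], [0.0, 0.0, 0.0], [0.75, -0.5, 0.0]]
au = lrp_au(mean_over_masks([m1, m2]))
# au[0][1] == 0.25, au[0][2] == 0.5, au[1][2] == 0.25
```

The resulting matrix is symmetric, so only the N(N − 1)/2 entries above the diagonal need to be stored per cell.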

Data

Synthetic data

In order to validate our method, we predicted sample-wise networks for samples for which the ground truth network was already known. We used the method BoolODE from the Beeline framework (6) and chose three networks curated from real-world biological systems (GSD: gonadal sex determination, HSC: hematopoietic stem cell differentiation, VSC: ventral spinal cord development), from which 5000 synthetic gene sets were sampled in each case. Dropout was induced by setting the parameters drop_cutoff to 0.5 and drop_prob to 0.5 which are the parameters used in their minimal configuration example.

In order to construct a dataset with ‘heterogeneous interactions’, i.e. samples that have different underlying interaction networks, the three datasets GSD, HSC and VSC were combined. Since GSD was the largest network (19 genes), samples that came from HSC and VSC were padded with additional genes that did not interact with other genes (i.e. network nodes with degree 0). This padding was not used when the average network behind GSD, HSC and VSC was predicted based on the homogeneous datasets.

scRNA-seq data

The single-cell sequencing data coming from 10 patients with non-small cell lung cancer was acquired from Bischoff et al. (23). Only data from tumor cells and other epithelial cells were selected. Single-cell transcriptome data was filtered for the 800 protein-coding genes with the highest average expression values and consisted of data from 13439 Tumor cells, 875 Club Cells, 1665 Ciliated Cells, 1547 AT2 cells and 511 AT1 cells. Due to their small number (77), neuroendocrine cells were excluded from the experiment. Due to the large amount of data and the fact that the computation of gene interactions scales quadratically with the number of genes, computations were performed using the BIH High Performance Compute (HPC) Cluster.

Statistical analysis

All statistical tests were two-sided and mean differences were regarded as significant if P < 0.05. Wilcoxon rank tests, Pearson’s r and Spearman’s rho were computed in R (27) using Hmisc (28). The AUC for the prediction of network interactions was computed using the package pROC (29).

Figures

Parts of Figure 1A were created with BioRender.com. Violin plots and boxplots were made with ggplot2 (30). Network visualization was performed with the R package igraph (31). Heatmaps were created using ComplexHeatmap (32). For visualization of network similarity between individual tumor cells (Figure 5), GRN100 interactions were encoded as a binary vector (interaction present / not present) and filtered for interactions present in more than one cell before calculating a UMAP. UMAPs and t-SNE were computed with umap (33) and Rtsne (34).

Figure 1.

Reconstruction of synthetic single-cell networks by scGeneRAI using XAI. (A) Workflow for the inference of single-cell GRNs by scGeneRAI. A neural network is trained on scRNA-seq data to predict each gene’s expression based on arbitrary sets of other genes. Following training, a single-cell GRN is predicted in three steps: (1) A target gene is predicted based on a set of other genes. (2) LRP is used to infer the relevance of every gene for this prediction. (3) The LRP scores subsequently serve as measure of interaction strength between the target gene and all predicting genes. This procedure is repeated for 100 masks and for all genes as target gene. (B) Ground truth for three different networks provided by the beeline framework. Self-loops are ignored for the evaluation of network reconstruction performance by scGeneRAI or LIONESS. Abbreviations (GSD, HSC, VSC) refer to the biological origin of these network models (gonadal sex determination, hematopoietic stem cell differentiation, ventral spinal cord development). (C) Comparison of Area under the ROC curve (AUC) values for the network prediction of single cells. Given a dataset of synthetic scRNA-seq data for which every single-cell transcriptome is generated by one of three different networks (visualized in B), scGeneRAI has to reverse-engineer each cell’s underlying network. The reconstruction performance is measured as AUC for every individual cell. Each violin plot therefore visualizes AUC scores for the reconstruction of 4500 (size of the training set) single cells, either using the method scGeneRAI or LIONESS. scGeneRAI is able to predict the network topologies of individual cells and significantly outperforms the method LIONESS.

Figure 5.

Inter- and intra-tumoral distribution of tumor-specific network activity (dots represent tumor cells). (A) UMAP embedding of tumor cells based on predicted single-cell gene-regulatory networks (patients are colour-coded). (B) Activity of tumor-specific subnetworks T1-T6 across tumor cells of different patients. RIA (relative interaction abundance) is visualized in red. Based on the UMAP analysis from (A), the activity of tumor-specific subnetworks is visualized across patients and cells. While certain patients show different active networks in the same cells (e.g. T1 and T6 in patient p032), some network modules are distinctly active only in a minority of tumor cells (e.g. T2 in patient p024), indicating a functional heterogeneity.

Baseline models

The package lionessR (35) was used as a baseline model in order to infer sample-wise networks. All default settings were used. GENIE3 (5) and GRNBoost2 (4) were used with default settings in order to predict ‘average’ networks for multiple samples.

Network analysis

In order to compare gene interactions of different cells, the highest 100 LRPau scores for each cell were examined. In order to find general network communities, predicted interactions that were present in <1% of all cells were discarded. The remaining interactions were used to construct an unweighted undirected graph, and the Louvain algorithm from the igraph package (31) was applied to identify the network communities C1–C9. For the discovery of tumor-specific network features, a graph was constructed that consisted only of interactions for which the log ratio between tumor cells and other cells was greater than 4. Additionally, interactions that were present in <4% of all cells were discarded, because we regarded a high ratio as not meaningful for very rare interactions. The Louvain algorithm identified the tumor-specific networks that were labeled as T1–T6. An overrepresentation analysis of network genes was performed using the REACTOME database (36).

Average shared interactions

The ‘interpatient’ average number of shared interactions was computed for every pair of two patients and every cell type using the following procedure: Let Mi be a matrix with dimensions G × Ci, where G is the number of possible gene interactions and Ci is the number of cells of patient i for the specified cell type. The entries $(M_i)_{g,c}$ of Mi are defined as 1 if the gene interaction g is in the GRN100 of cell c in patient i, and 0 otherwise. The ‘interpatient’ average number of shared interactions for two patients i1 and i2 is then defined as the average over the entries of $M_{i_1}^{\top} M_{i_2}$, excluding the diagonal. The ‘intrapatient’ average of shared interactions is calculated for every patient i by computing the average of $M_i^{\top} M_i$. This again is done for every cell type separately.
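The matrix products above amount to counting, for every pair of cells, how many GRN100 interactions they have in common. The following sketch represents each cell by the set of its GRN100 interaction identifiers instead of a binary column, so that each entry of the product becomes a set intersection size. The function name and the `exclude_diagonal` flag are illustrative; the example applies the flag to a within-patient comparison, where the diagonal pairs a cell with itself, which is an interpretation rather than something the text spells out.

```python
from typing import List, Set


def mean_shared_interactions(cells_a: List[Set[int]], cells_b: List[Set[int]],
                             exclude_diagonal: bool = False) -> float:
    """Average number of shared interactions over all pairs of cells.

    Equivalent to averaging the entries of M_a^T M_b for binary matrices
    whose columns are the cells' interaction indicator vectors.
    """
    total, count = 0, 0
    for r, a in enumerate(cells_a):
        for c, b in enumerate(cells_b):
            if exclude_diagonal and r == c:
                continue
            total += len(a & b)
            count += 1
    if count == 0:
        raise ValueError("no cell pairs to average over")
    return total / count


# Toy GRNs with three interactions per cell (100 in the text).
patient_1 = [{1, 2, 3}, {2, 3, 4}]
patient_2 = [{3, 4, 5}]
inter = mean_shared_interactions(patient_1, patient_2)
intra = mean_shared_interactions(patient_1, patient_1, exclude_diagonal=True)
# inter == 1.5, intra == 2.0
```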

RESULTS

Reconstruction of synthetic GRNs

We first validated scGeneRAI on artificial RNA data generated with the Beeline framework (6). To this end, we constructed a synthetic data set in which each single cell’s transcriptome is simulated based on one of three different gene regulatory networks (Beeline IDs: ‘GSD’, ‘HSC’, ‘VSC’, Figure 1B). We sampled 5000 synthetic transcriptomes for each network and combined these data into one mixed data set containing 15 000 single-cell transcriptomes.

Within the scGeneRAI framework, a neural network model is trained on 90 % of the data to impute the masked gene expression of single cells and the neural network’s performance is tested on the held out data to avoid overfitting. After training, LRP is applied on the training data to highlight the interactions the network has learned to use. This way, scGeneRAI can recover the GRN of each single cell of the training data without knowledge about its group membership (GSD, HSC or VSC). To evaluate the reconstruction, the predicted GRN of every cell was compared to the ground truth GRN of this particular cell by computing the area under the ROC curve (AUC). AUCs for all predicted networks (n = 13 500) are visualized in Figure 1C.
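The per-cell evaluation can be illustrated with a rank-based AUC, i.e. the probability that a true edge receives a higher score than a non-edge. The paper reports using pROC in R; the pure-Python function below is only a small stand-in for the same statistic, and its name `edge_auc` is hypothetical. It assumes that the predicted scores and the ground truth labels are given as flat lists over the same gene pairs, with self-loops already removed.

```python
from typing import Sequence


def edge_auc(scores: Sequence[float], truth: Sequence[int]) -> float:
    """AUC as the fraction of (edge, non-edge) pairs ranked correctly.

    Ties between an edge and a non-edge count as half a correct pair.
    """
    pos = [s for s, t in zip(scores, truth) if t]
    neg = [s for s, t in zip(scores, truth) if not t]
    if not pos or not neg:
        raise ValueError("need at least one edge and one non-edge")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Four candidate gene pairs of one cell: predicted scores and ground truth.
auc = edge_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
# auc == 0.75: three of the four (edge, non-edge) pairs are ordered correctly.
```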

We compared the performance of scGeneRAI against the recently developed network reconstruction method LIONESS (37) which re-engineers the interaction network of an individual sample by applying an arbitrary network inference algorithm on two data sets, one containing, and one missing the sample in question, and then interpolating between the two reconstructed networks.

scGeneRAI achieved a median AUC of 0.75 (LIONESS: 0.53, P < 10⁻¹⁶, paired Wilcoxon test). For scGeneRAI, reconstruction accuracy depended on the network complexity, and networks with lower complexity were reconstructed with higher precision (VSC: median = 0.88, IQR = 0.07; HSC: median = 0.78, IQR = 0.09; GSD: median = 0.57, IQR = 0.14) (Figure 1C, comparison with LIONESS in Supplementary Table S1).

AUC scores for GSD exhibit two peaks, which we trace back to the specific data distribution, with data points assembled around two different centers with different levels of predictability (cf. Supplementary Figure S2).

We find that scGeneRAI yields higher AUC scores than LIONESS on all networks, which equates to better agreement with the ground truth. For large, potentially imbalanced datasets, it has been shown that the AUC score can be less informative, because it focuses on parts of the ROC curve that are less relevant (see e.g. (38,39)). For that reason, we also computed the full ROC curves for the predicted networks averaged over all samples for GSD, HSC and VSC (Supplementary Figure S3), which show that scGeneRAI systematically outperforms LIONESS. We additionally calculated the correlation between the predicted and ground truth networks as another measure of prediction accuracy, which supports these results (Supplementary Figure S4).

Comparison with state-of-the-art average network prediction algorithms

While the main aim of our study is the prediction of single-cell GRNs, we additionally benchmarked our explainable AI approach against popular ‘average’ network prediction algorithms. We applied the methods GENIE3 and GRNBoost2 (4,5) to three data sets, each of which contained 5000 single-cell transcriptomes generated by one network (VSC, HSC or GSD). For the comparison with our XAI method, we let scGeneRAI predict an ‘average’ network by computing the mean over all 4500 (again using 500 samples as a test set) reconstructed single-cell networks.

Repeating the experiment five times for each data set and comparing LRP results against the two other methods with the Wilcoxon rank test, scGeneRAI-based reconstruction showed better or similar performance, with a median AUC of 0.73 for GSD (GENIE3: AUC = 0.64, GRNBoost2: AUC = 0.59), 0.79 for HSC (GENIE3: AUC = 0.76, GRNBoost2: AUC = 0.71) and 0.80 for VSC (GENIE3: AUC = 0.79, GRNBoost2: AUC = 0.82) (Table 1).

Table 1.

Comparison of network reconstruction performance between the baseline methods GENIE3 and GRNBoost2 and scGeneRAI

Network   Method      Median AUC   Range       P
GSD       GENIE3      0.64         0.64–0.64   0.008
GSD       GRNBoost2   0.59         0.57–0.60   0.008
GSD       LRP         0.73         0.71–0.75
HSC       GENIE3      0.76         0.76–0.76   0.042
HSC       GRNBoost2   0.71         0.71–0.72   0.006
HSC       LRP         0.79         0.76–0.80
VSC       GENIE3      0.79         0.78–0.79   0.093
VSC       GRNBoost2   0.82         0.82–0.82   0.094
VSC       LRP         0.80         0.79–0.82

We report median AUC and the range of AUCs for five repeated reconstructions. Since the baseline methods are capable only of predicting average networks over many cells, we used the average over all single-cell networks predicted by scGeneRAI for an adequate comparison. scGeneRAI has a substantial advantage for the prediction of the more complex network GSD.

Reconstruction of GRNs from scRNA-seq data

Following the validation with synthetic data, we applied scGeneRAI to scRNA-seq data from 10 patients with non-small cell lung cancer. The data comprised the expression values of 800 highly expressed genes for 13439 Tumor cells, 875 Club Cells, 1665 Ciliated Cells, 1547 AT2 cells and 511 AT1 cells. We held out a test set to monitor the performance of the neural network and then used our XAI approach to predict a GRN for each individual cell of the training set. We computed the 319600 pairwise interaction scores for the 800 most highly expressed genes and focus on the 100 strongest predicted interactions for every cell which we call GRN100.
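The number of pairwise scores follows from N(N − 1)/2 with N = 800, and the GRN100 is the set of the 100 highest-scoring gene pairs of a cell. A minimal sketch, assuming a symmetric per-cell score matrix given as nested lists; the function names are illustrative only.

```python
import heapq
from itertools import combinations
from typing import List, Sequence, Tuple


def n_pairs(n_genes: int) -> int:
    """Number of unordered gene pairs without self-loops."""
    return n_genes * (n_genes - 1) // 2


def top_interactions(scores: Sequence[Sequence[float]],
                     k: int = 100) -> List[Tuple[int, int]]:
    """Return the k gene pairs (i < j) with the highest scores."""
    pairs = combinations(range(len(scores)), 2)
    return heapq.nlargest(k, pairs, key=lambda p: scores[p[0]][p[1]])


assert n_pairs(800) == 319600  # matches the count given in the text

# Toy example with four genes and k = 2 instead of 100.
toy = [[0.0, 0.2, 0.4, 0.1],
       [0.2, 0.0, 0.2, 0.9],
       [0.4, 0.2, 0.0, 0.3],
       [0.1, 0.9, 0.3, 0.0]]
grn2 = top_interactions(toy, k=2)
# grn2 == [(1, 3), (0, 2)]
```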

Network similarity between single cells

Based on these GRN100, we analysed inter-cell similarity of gene regulatory networks. This comparison was done for cells of the same tumor for the quantification of intratumoral network similarity, as well as for cells of different patients’ tumors for the investigation of regulatory differences of different tumors (‘intertumoral network similarity’) (Figure 2).

Figure 2.

Comparison of network similarity of single-cell GRNs across different patients (‘interpatient’) and within each patient (‘intrapatient’). Each analysis is done separately for different tissues. Left: Network similarity between cells of different patients (‘interpatient’). Each data point represents the average number of gene interactions that overlap between the GRN100 of cells from two different patients. Thus, for every combination of two patients the boxplot contains one data point (45 per boxplot). AT1 cells, AT2 cells, Ciliated cells and Club cells show a similar average number of shared interactions between cells from two patients, but Tumor cells differ much more strongly in their gene regulatory patterns between patients. Right: Network similarity between cells coming from the same patient (‘intrapatient’). Each data point corresponds to the average number of gene interactions that overlap between the GRN100 of two cells coming from the same patient. Thus, for each boxplot the number of data points is equal to the number of patients (i.e. 10). The average number of shared (GRN100) interactions between cells from the same patient is higher than between cells from different patients. However, even in the intrapatient analysis, tumor cell GRNs show higher variability (median network similarity = 9.78) compared to normal epithelial GRNs (median network similarity = 14.15, Wilcoxon test: P = 0.002).

In order to describe the intertumoral network similarity of single-cell gene regulatory networks, we computed, for each pair of patients, the average number of interactions that two cells (one from each patient) have in common (i.e. interactions that are part of both cells’ GRN100). Thus, a network similarity of 100 indicates that the GRN100 of the examined cells are identical. AT1 (median network similarity = 10.77, IQR = 3.9), AT2 (median = 9.8, IQR = 3.67), ciliated (median = 10.89, IQR = 0.6) and club cells (median = 11.03, IQR = 2.19) on average shared more interactions between two patients than tumor cells (median = 3.43, IQR = 1.64, Wilcoxon test: P < 2.2 × 10⁻¹⁶). This lower number of shared interactions between cells of different tumors suggests that tumors from different patients vastly differ in their regulatory mechanisms (high intertumoral heterogeneity).

We subsequently investigated the intratumoral network similarity of single-cell GRNs by computing the average number of interactions that are shared by two cells’ GRN100. As expected, a higher number of interactions is shared between the single-cell GRN100 of cells within a patient than across patients (AT1: median network similarity = 14.74, IQR = 3.03; AT2: median = 15.08, IQR = 4.13; ciliated: median = 12.52, IQR = 3.47; club: median = 13.87, IQR = 1.37). However, even within patients the GRN100 of tumor cells shared fewer interactions than the GRN100 of normal epithelial cells (median = 9.78, IQR = 2.68, P = 0.002), suggesting high intratumoral heterogeneity.

Characterisation of gene regulatory network communities

We next investigated the most frequent network patterns, compared them between different tissue types and analysed their functional relevance in the cell by an overrepresentation analysis in REACTOME (36).

To provide a quantitative measure of the number of cells in which a certain interaction is found, we introduce the measure ‘Relative Interaction Abundance’ (RIA), defined as the proportion of cells for which a given interaction is part of the GRN100. Consequently, the RIA can have values between 0 and 1. We extend this score to quantify how often a subnetwork (i.e. a set of multiple interactions) is part of the GRN100. In this case, we count the number of the network’s edges that can be found in each cell on average and normalize by the network’s edge count. For networks, the RIA is therefore ambiguous in that it cannot distinguish between a smaller part of the network existing in many cells and a larger part of the network existing in a smaller proportion of cells.
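Both variants of the RIA can be written down in a few lines. The sketch below assumes that each cell’s GRN100 is available as a set of gene pairs stored in a fixed order (e.g. alphabetically sorted tuples); the function names are illustrative.

```python
from typing import List, Set, Tuple

Edge = Tuple[str, str]


def ria(interaction: Edge, grns: List[Set[Edge]]) -> float:
    """Proportion of cells whose GRN contains the given interaction."""
    return sum(interaction in g for g in grns) / len(grns)


def subnetwork_ria(edges: Set[Edge], grns: List[Set[Edge]]) -> float:
    """Average number of subnetwork edges per cell, divided by edge count."""
    mean_hits = sum(len(edges & g) for g in grns) / len(grns)
    return mean_hits / len(edges)


# Four toy cells; gene pairs are stored as sorted tuples.
cells = [{("A", "B"), ("B", "C")}, {("A", "B")}, {("C", "D")}, set()]
single = ria(("A", "B"), cells)
network = subnetwork_ria({("A", "B"), ("B", "C")}, cells)
# single == 0.5, network == 0.375
```

The subnetwork value of 0.375 would equally result from three of eight cells carrying both edges, which is the ambiguity noted above.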

Figure 3 illustrates the most frequent gene interactions for 5 selected patients stratified by cell type. All other networks can be found in the Supplementary Figure S1.

Figure 3.

(A) Illustration of the most frequent interactions in the GRN100 stratified by patient and cell type. For better visibility, only networks for five selected patients are shown. The networks of the remaining patients can be examined in Supplementary Figure S1. The edge width between two genes corresponds to the frequency (RIA) of this interaction within the respective patient and cell type. Networks that are based on fewer than 5 cells (e.g. AT1 cells, Ciliated cells and Club cells of patient p023) are omitted. Several gene network communities are common to all cell types, such as the communities C6 (e.g. HSP90B1, HSPA5, CALR), C4 (e.g. RPL10, RPL5, RPS27, RPL22, EEF1A1), as well as C3 (e.g. JUN, JUNB, JUND, FOS, FOSB, EGR1, SOCS3 and genes encoding heat shock proteins). Interactions that appear specifically in tumor cells are colored white and have a 50% higher edge width. (B) Heatmap visualizing the RIA of 9 frequent network communities (C1–C9). The subplots ‘Mutation’ and ‘Histology’ indicate the presence of a tumor mutation with a black tile and the histology of the tumor sample in color, respectively. RIA scores are scaled between 0 and 1 for each cluster to visualize the contrast between patients. Overall, hierarchical clustering of the (scaled) RIA values of these communities groups tumors according to their histological subtype. (C) Heatmap of 6 tumor-specific network communities (T1–T6). The heatmap shows scaled RIA; mutation status is color coded (black = mutated, white = wildtype); for the histology color code see legend.

It can be seen that the interaction networks form distinct network communities, some of which are conserved across patients and cell types. By applying the Louvain algorithm (40) we identified nine recurring network communities.

  1. Community 1 (C1) showed interactions between 53 different genes and was present mostly in AT1 cells (AT1: RIA = 0.19, AT2: RIA = 0.04, Ciliated: RIA = 0.01, Club: RIA = 0.04, Tumor: RIA = 0.04). AGER, a cell marker of AT1 (41), had a central position in this network and the most frequent interactions in AT1 cells were AGER-ACTB (RIA = 0.47), AGER-CAV1 (RIA = 0.42) and AGER-S100A10 (RIA = 0.39).

  2. C2 consists of 17 genes among which are BRD2, FUS, HEXIM1, HNRNPA2B1, HNRNPA3 and the genes ID1, ID2, ID3 and ID4. An overrepresentation analysis in REACTOME showed that these genes play a role in ‘NGF-stimulated transcription’ (P = 4.9 × 10⁻¹²), ‘Nuclear kinase and transcription factor activation’ (P = 4.1 × 10⁻¹²), ‘Signaling by NTRKs’ (P = 6.3 × 10⁻⁹) and ‘Signaling by Receptor Tyrosine Kinases’ (P = 1.0 × 10⁻⁵). The network was most frequent in Tumor cells (RIA = 0.10) followed by AT2 cells (RIA = 0.09), Ciliated cells (RIA = 0.05), Club cells (RIA = 0.05) and AT1 cells (RIA = 0.02). The most frequent interactions in this network were, averaged over all cell types, INTS6-HEXIM1 (RIA = 0.14), ID1-ID3 (RIA = 0.12), ID2-INTS6 (RIA = 0.11) and ID2-GADD45G (RIA = 0.11).

  3. C3 contained 31 genes, among which were JUN, JUNB, JUND, FOS, FOSB, EGR1, HSPA8, HSPB1, HSPD1, HSPH1, HSP90AA1 and HSP90AB1. According to REACTOME, genes of C3 are important within ‘Interleukin-4/13 signaling’ (P = 3.1 × 10⁻¹³), ‘attenuation phase’ (P = 5.6 × 10⁻⁸) and ‘Cellular response to heat stress’ (P = 1.3 × 10⁻⁷). It can be seen (Figures 3 and 4A) that C3 itself consists of two parts, the smaller part consisting of the heat shock protein genes HSPB1, HSPA8, HSP90AA1, HSP90AB1, HSPH1 and HSPD1. This HSP subnetwork is connected to the other genes by DNAJB1 (HSP40). A close relationship of DNAJB1 with FOS, JUN and EGR1 has already been well established, e.g. coexpression of these genes in the context of heat stress (42). C3 was most frequent in Ciliated cells (RIA = 0.16) and least frequent in AT1 cells (RIA = 0.09). Within C3, the most frequent interactions were HSPH1-DNAJB1 (RIA = 0.46), HSPB1-DNAJB1 (RIA = 0.36), HSP90AA1-DNAJB1 (RIA = 0.35), FOSB-FOS (RIA = 0.34) and JUN-FOS (RIA = 0.33).

  4. C4 contained 23 genes (e.g. RPL and RPS genes, as well as EEF1A1, PTMA and EIF3E) that are associated with translation, e.g. ‘Formation of 40S subunits’ (P = 2.1 × 10⁻¹⁵), ‘GTP hydrolysis and joining of the 60S ribosomal subunit’ (P = 2.1 × 10⁻¹⁵) and ‘Peptide chain elongation’ (P = 2.1 × 10⁻¹⁵). The most frequent interactions were RPS27-RPL26 (RIA = 0.31) and RPL5-RPL26 (RIA = 0.29).

  5. C5 was more dominant in Club cells (RIA = 0.11) and AT2 cells (RIA = 0.11) and less frequent in Tumor (RIA = 0.03), Ciliated (RIA = 0.02) and AT1 cells (RIA = 0.02). In AT2 and Club cells, the most frequent interaction within this network was NFKBIA-CXCL2 (AT2: RIA = 0.72, Club: RIA = 0.56). In Club cells, CXCL2-CXCL1 (RIA = 0.43) was frequent as well. Genes from this network play a role in ‘IL-10-signaling’ (P = 5.6 × 10⁻⁵) and ‘Cytokine signaling in Immune system’ (P = 5.6 × 10⁻⁵).

  6. The most important interactions in C6 were HSP90B1-CALR (RIA = 0.50), HSPA5-HSP90B1 (RIA = 0.42) and HSPA5-CALR (RIA = 0.39). All three of these genes are regulated by ATF6 (see REACTOME, also (43)). In Ciliated cells, several genes also showed frequent interactions with RSPH1, such as RSPH1-RUVBL1 (RIA = 0.22) and RSPH1-IGFBP7 (RIA = 0.18). Notably, a defect of RSPH1 is known to lead to primary ciliary dyskinesia (44).

  7. C7 comprised the genes CTSB, SDCBP, TPM3, GRN, PSAP and CLIC1, which are associated with ‘Surfactant metabolism’ (P = 4.6 × 10⁻⁶). It was most frequent in Tumor cells (RIA = 0.07) and least frequent in Club cells (RIA = 0.04).

  8. SYNE1, SYNE2, PCM1, SETD2, MACF1 and AKAP9 interacted with each other in C8 and are associated with ‘Meiotic synapsis’ (P = 8.0 × 10⁻³) and ‘Cell cycle’ (P = 1.1 × 10⁻³). This community can be found in ciliated cells (RIA = 0.14), but it is almost never present in Tumor cells (RIA = 0.002), possibly suggesting a dysregulation of the cell cycle in these cells.

  9. C9 was also most frequent in ciliated cells (ciliated: RIA = 0.38, tumor: RIA = 0.05, AT2: RIA = 0.04, club: RIA = 0.02, AT1: RIA = 0.002). The participating genes PRDX1, NQO1, TALDO1, TXN, GDF15, MDM2, COX6C, AKR1C3 and CDKN1A are associated with ‘Transcriptional Regulation by TP53’ (P = 4.2 × 10−76). The most important interactions were GDF15-CDKN1A (RIA = 0.15) and COX6C-TXN (RIA = 0.06).
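The RIA values quoted in the list above can be read as the fraction of cells in a group whose GRN100 contains a given interaction. The following is a minimal sketch of that bookkeeping, assuming each cell's GRN100 is available as a collection of gene pairs. The function name `interaction_frequency` and the toy input are illustrative only and are not taken from the scGeneRAI code or data.

```python
from collections import Counter


def interaction_frequency(cell_networks):
    """Fraction of cells whose top-interaction set contains each gene pair.

    cell_networks: list with one iterable of (gene_a, gene_b) pairs per cell.
    Pairs are treated as undirected, so (A, B) and (B, A) are the same edge.
    """
    if not cell_networks:
        return {}
    counts = Counter()
    for edges in cell_networks:
        # Using a set ensures an edge is counted at most once per cell.
        unique_edges = {tuple(sorted(pair)) for pair in edges}
        counts.update(unique_edges)
    n_cells = len(cell_networks)
    return {edge: count / n_cells for edge, count in counts.items()}


# Toy input with four hypothetical cells (not data from the study).
cells = [
    [("AGER", "ACTB"), ("AGER", "CAV1")],
    [("ACTB", "AGER")],
    [("AGER", "CAV1")],
    [],
]
freq = interaction_frequency(cells)
print(freq[("ACTB", "AGER")])  # 0.5
print(freq[("AGER", "CAV1")])  # 0.5
```

Stratifying by patient and cell type, as in Figure 3, would amount to calling the function once per group of cells.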

Figure 4.

Visualization of frequent network patterns. (A) Visualization of network communities C1–C9. Edge width indicates the RIA score for the respective interaction over all patients, but stratified by tissue type (AT1: green, AT2: steelblue, Club: pink, Ciliated: yellow, Tumor: red). It can be seen that most network structures are common to many cell types (see also Figure 3). On the other hand, several network structures are specific to one cell type, e.g. C1 is most frequent in AT1 cells and C8 and C9 are particularly frequent in Ciliated cells. (B) Visualization of tumor-specific network communities T1–T6. The networks T2 and T3 are associated with mitosis and cell death, respectively, and can be found in tumor cells of different patients. All other networks are more specific to one patient. They are associated with transcriptional, translational or signaling processes.

When we compared how the proportion of interactions in the GRN100 of tumors correlated with the respective proportions in other cell types, the highest correlation was found for AT2 cells (Spearman’s ρ = 0.37), followed by club cells (ρ = 0.35) and ciliated cells (ρ = 0.34). AT1 cells had less GRN overlap with tumor cells (ρ = 0.26). This is consistent with the hypothesis that NSCLC cells originate from AT2 cells or secretory (e.g. club) cells (45–47).
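The rank correlation used in this comparison can be sketched in a few lines. The example below implements Spearman's ρ as the Pearson correlation of average ranks, so that ties are handled. The input vectors are hypothetical per-interaction proportions chosen for illustration, and the helper names are assumptions rather than part of the published analysis code.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical proportions of five interactions in two cell types.
tumor = [0.10, 0.07, 0.03, 0.002, 0.05]
other = [0.09, 0.04, 0.11, 0.04, 0.05]
print(round(spearman(tumor, other), 2))
```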

When we averaged LRP values over all cells, 48 out of the highest 100 predicted interactions could be found in the database REACTOME (χ²: P < 2.2 × 10⁻¹⁶), which compares to 33/100 when Pearson’s r was used as the metric for interaction strength (χ²: P < 2.2 × 10⁻¹⁶). When comparing Pearson’s r of all gene pairs to the respective averaged LRP values, a rank comparison showed that both metrics of interaction strength had relevant differences (Spearman’s ρ: 0.38).
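The database comparison described here reduces to averaging per-cell interaction scores, taking the top-ranked pairs and counting how many occur in a reference set. The sketch below illustrates this under two assumptions: per-cell scores are given as dictionaries keyed by gene pair, and pairs missing from a cell count as zero. All names and values are hypothetical, and the χ² test is not reproduced because its setup is not specified in this section.

```python
def average_scores(per_cell_scores):
    """Average interaction scores over cells (missing pairs count as zero)."""
    totals = {}
    for cell in per_cell_scores:
        for pair, value in cell.items():
            key = tuple(sorted(pair))  # treat pairs as undirected
            totals[key] = totals.get(key, 0.0) + value
    n_cells = len(per_cell_scores)
    return {pair: total / n_cells for pair, total in totals.items()}


def top_k_overlap(scores, reference_pairs, k=100):
    """Return (hits, number of pairs inspected) for the k top-scoring pairs."""
    reference = {tuple(sorted(pair)) for pair in reference_pairs}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    top = [pair for pair, _ in ranked[:k]]
    hits = sum(1 for pair in top if pair in reference)
    return hits, len(top)


# Two hypothetical cells and a hypothetical reference database.
per_cell = [
    {("JUN", "FOS"): 1.0, ("HSPA5", "CALR"): 0.6, ("GENE_X", "GENE_Y"): 0.2},
    {("FOS", "JUN"): 0.8, ("HSPA5", "CALR"): 0.2, ("GENE_X", "GENE_Y"): 0.4},
]
reference_db = [("FOS", "JUN"), ("GENE_X", "GENE_Y")]

mean_scores = average_scores(per_cell)
hits, inspected = top_k_overlap(mean_scores, reference_db, k=2)
print(hits, inspected)  # 1 2
```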

Tumor specific gene regulatory networks

Single-cell analyses in precision oncology research aim to uncover tumor-specific mechanisms and their heterogeneous distribution among tumor cells. We therefore selected all interactions that were more frequent in tumor cells compared to normal cells (log ratio > 4) and used the Louvain algorithm to identify six network communities in this cancer-specific network (Figures 3 and 4B).

  1. T1 existed almost exclusively in patient p032 (RIA = 0.25, Figure 3B) and consisted of the genes ST13, PACSIN2, XRCC6, EP300, RBX1 and CDC42EP1. The most frequent interactions in T1 were PACSIN2-ST13, which existed in 42.2%, and PACSIN2-XRCC6, which existed in 41.3% of all tumor cells in patient p032 (the only lepidic tumor in the dataset). This network is associated with NOTCH1 regulation of transcription (P = 1.2 × 10⁻²).

  2. T2 consisted of the genes CKS1B, CKS2, RAD21 and STMN1, which are associated with ‘Nuclear Signaling by ERBB4’ (P = 3.6 × 10⁻³) and ‘Cell Cycle, Mitotic’ (P = 3.2 × 10⁻²). There are further connections to the ‘translation’ network of C4, especially to PTMA (encircled in Figure 3A). We found interactions of these networks in a high proportion of tumor cells in patients p027 (RIA = 0.65) and p023 (RIA = 0.60) and in a lower number of cells in patient p034 (RIA = 0.15), but T2 had an RIA below 0.07 for all other patients. All four of these genes are part of diagnostic tools for multiple myeloma (48). To our knowledge, these genes have not previously been associated with lung cancer cells, which may well be due to the fact that this network only seems to appear in a few patients, but is completely missing in others.

  3. T3 consisted of interactions between the genes VEGFA, DDIT4, NDRG1, BNIP3 and BNIP3L and was most prominent in patients p031 (RIA = 0.21) and p034 (RIA = 0.17). Genes from this network showed strong overrepresentation in ‘TP53 Regulates Transcription of Cell Death Genes’ (P = 8.5 × 10⁻⁹).

  4. T4 consisted of only two interactions, NNMT-MSC (RIA = 0.05) and NNMT-AKR1C3 (RIA = 0.05). It differed widely between patients: in patient p031, NNMT-MSC occurred in 37% of cells and NNMT-AKR1C3 in 33% of cells, but neither interaction could be found in more than 3% of cells in any other patient. The network was associated with ‘Gene and protein expression by JAK-STAT signaling after Interleukin-12 stimulation’ (P = 2.6 × 10⁻³).

  5. T5 consisted of 8 genes with PTGS2 in a central position (Figure 3B). It existed mostly in cells of patient p018 (RIA = 0.18), e.g. PTGS2-EPS8 (RIA in p018 = 0.31) and PTGS2-CITED2 (RIA in p018 = 0.23), and is associated with ‘FOXO-mediated transcription’ (P = 7.5 × 10⁻⁵).

  6. T6 consisted of WIF1, CLU and PTK7, which are associated with ‘Negative regulation of TCF-dependent signaling by WNT ligand antagonists’ (P = 2.7 × 10⁻²). This network pattern could only be found in patient p032 (RIA = 0.34).
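The selection step that precedes the community detection (interactions with a tumor-to-normal log ratio above 4) can be illustrated as follows. The text does not state the base of the logarithm or how interactions absent from normal cells are handled, so the base-2 logarithm and the small pseudocount used here are assumptions made only for this sketch. Gene names and frequencies are placeholders.

```python
import math


def tumor_specific_edges(freq_tumor, freq_normal, threshold=4.0,
                         pseudocount=1e-3, base=2):
    """Select edges whose tumor/normal frequency log ratio exceeds a threshold.

    The pseudocount (an assumption of this sketch) avoids division by zero
    for edges that never occur in normal cells.
    """
    selected = {}
    for edge, f_tumor in freq_tumor.items():
        f_normal = freq_normal.get(edge, 0.0)
        log_ratio = math.log((f_tumor + pseudocount) / (f_normal + pseudocount),
                             base)
        if log_ratio > threshold:
            selected[edge] = log_ratio
    return selected


# Hypothetical interaction frequencies in tumor and normal cells.
freq_tumor = {("GENE_A", "GENE_B"): 0.40, ("GENE_C", "GENE_D"): 0.30}
freq_normal = {("GENE_A", "GENE_B"): 0.001, ("GENE_C", "GENE_D"): 0.29}

selected = tumor_specific_edges(freq_tumor, freq_normal)
print(sorted(selected))  # [('GENE_A', 'GENE_B')]
```

The selected edges would then form the graph that is passed to a community detection routine such as the Louvain algorithm.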

While T2 and T3 are common to several tumors, it becomes clear that most network communities are specific to tumor cells of one patient. T1, T4 and T6 are each present in over 25% of tumor cells of one patient, but they are almost completely missing in cells of almost all other patients and cell types (Figures 3 and 5). When cells are visualized using a UMAP embedding based on the GRN100 (Figure 5A), cells from the same patient cluster together, underlining the high inter-patient heterogeneity of tumor gene-regulatory networks. Figure 5B illustrates the heterogeneity of tumor-specific network modules T1–T6 even between tumor cells from the same patient. The distribution of the network modules C1–C9 in cancer cells can be found in Supplementary Figure S5.

DISCUSSION

Intratumoral heterogeneity is a common cause for the development of resistance against targeted therapies. Single-cell sequencing methods aim to characterize this heterogeneity, but their clinical impact is still limited, which is partially due to a lack of computational methods capable of analysing the functional implications of the complex data.

In this study, we demonstrated that scGeneRAI can predict gene regulatory networks for single cells, providing functional insight that could ultimately help select patient-specific therapeutic targets. Existing approaches are limited in that they perform a local statistical analysis (20) or that they try to reconstruct the gene regulatory network of a single cell as the difference between a reference network with and without the sample in question (37,49). However, these reconstruction methods strongly depend on the data distribution. Consequently, scGeneRAI clearly outperformed LIONESS in predicting single-cell GRNs. Applying scGeneRAI to a single-cell RNA sequencing data set of 10 non-small cell lung cancer patients, we found that gene regulatory networks of cancer cells exhibit a consistently higher inter-cell variability than GRNs of normal epithelial cells. The differences between single-cell GRNs were even more pronounced when we compared cells derived from different tumors. While the inter- and intra-patient heterogeneity of cancer cells is, in principle, well known for the genomic, transcriptomic and proteomic profiles (50–53), the prediction of gene regulatory networks may now be used to interpret this heterogeneity with respect to functional differences.
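The difference-based reconstruction used by LIONESS-type methods (37) can be made concrete with a short sketch. It follows the published LIONESS equation, e(q) = N · (e(all) − e(all without q)) + e(all without q), and uses Pearson correlation as the aggregate network, which is an assumption chosen for illustration. Function names and expression values are hypothetical, and scGeneRAI itself does not rely on this step.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def lioness_edge(expr_a, expr_b, q):
    """Single-sample edge weight for sample q between two genes.

    The aggregate edge is computed on all samples and on all samples
    except q; the scaled difference is attributed to sample q.
    """
    n_samples = len(expr_a)
    e_all = pearson(expr_a, expr_b)
    e_without = pearson(expr_a[:q] + expr_a[q + 1:],
                        expr_b[:q] + expr_b[q + 1:])
    return n_samples * (e_all - e_without) + e_without


# Hypothetical expression of two genes across five cells.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [1.2, 1.9, 3.2, 3.8, 9.0]
edges = [lioness_edge(gene_a, gene_b, q) for q in range(len(gene_a))]
print([round(e, 3) for e in edges])
```

Because every single-sample edge is derived from aggregate statistics of the whole data set, the estimates depend on how the remaining samples are distributed, which is the limitation noted above.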

Our analysis showed both highly conserved network properties across patients as well as a pronounced inter- and intra-tumoral heterogeneity. The tumor-specific network modules T2 and T3 reconstructed with our approach, for instance, were shared by several tumors. Genes of T2 (CKS1B, CKS2, RAD21 and STMN1) have not been previously reported to be associated with lung cancer. However, all genes were shown to be of prognostic value in multiple myeloma (48,54). Furthermore, pathway enrichment showed an association with ERBB4 signaling, which is highly relevant in lung cancer. Also, activating driver mutations of the ERBB4 gene occur in lung cancer (55,56), but are not captured by the targeted sequencing panel of the national Network Genomic Medicine (nNGM), which was used here.

The highest proportion of tumor cells with T2 network activity was found in the two patients who had a TP53 mutation (p023 and p027). Mutation of TP53 might therefore be associated with a high activity of this network that is associated to cell cycle activity. The tumor cells of p023 and p027 also featured a solid histology, but this might not be an independent factor since solid tumors are poorly differentiated which may be a consequence of TP53 mutations. However, due to the low number of patients with TP53 mutation, these hypotheses must be further investigated in studies including more patients with different mutation status.

Genes of T3 are well known to be associated with TP53. Although most tumors in which these networks were frequently active were not TP53-mutated in our cohort, it may be hypothesized that this network activity contributes to functional changes in TP53 interactions that are necessary for tumorigenesis and are not induced by mutations.

While subnetworks T1, T4, T5 and T6 were observed exclusively in tumor cells, they were also highly variable among patients. In one patient, these networks appeared in a significant proportion of cells, which could hint at the potential relevance for personalized therapy approaches. However, similar to passenger mutations, it is certainly possible that these dysregulated networks are not drivers of tumorigenesis, but that they are part of a general change in gene regulatory network activity. To further evaluate the relevance of these networks, it is certainly necessary to conduct further studies with a larger number of patients and available clinical outcome data.

We also found 9 gene communities that were present in many cells over all cell types. The three communities C3, C4 and C6 were particularly consistent among patients and tissue types (Figures 3 and 4). The underlying networks appear to be responsible for basic cell function, such as mRNA translation or cytokine signaling.

The investigation of XAI has been associated with a growing number of applications in biological research, such as the prediction of proteomic networks for individual patients (19) and the identification of molecular network modules associated with specific disease phenotypes (17,18,57). The use of XAI for the prediction of single-cell gene regulatory networks offers promising opportunities, but several difficulties still need to be considered. First, the performance of network prediction for synthetic data depended on the network complexity: more complex networks (many interactions, high node degree) showed decreased reconstruction accuracy. However, the dependence of performance on network complexity is a general feature that can be observed not only for our single-cell reconstruction approach, but also for average network reconstruction approaches, as shown for GENIE3 and GRNBoost2 in this study.

Second, the frequent occurrence of technical dropouts poses a problem when working with scRNA-seq data in general and, in particular, for machine learning approaches. Here we optimized neural network training by using low batch sizes and the log hyperbolic cosine loss to reduce the effects of dropouts. We could show on synthetic data with technical dropout that scGeneRAI is able to infer networks accurately despite these artifacts.
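The log hyperbolic cosine loss mentioned here behaves quadratically for small residuals and approximately linearly for large ones, which makes it less sensitive to isolated large errors than the squared error. The following is a minimal, numerically stable sketch of the loss itself. It is not the training code of scGeneRAI, and the function names and example values are assumptions for illustration.

```python
import math


def log_cosh(residual):
    """Numerically stable log(cosh(r)) for a single residual.

    Uses the identity log(cosh(r)) = |r| + log(1 + exp(-2|r|)) - log(2),
    which avoids overflow of cosh for large |r|.
    """
    a = abs(residual)
    return a + math.log1p(math.exp(-2.0 * a)) - math.log(2.0)


def log_cosh_loss(predictions, targets):
    """Mean log-cosh loss over paired predictions and targets."""
    residuals = [p - t for p, t in zip(predictions, targets)]
    return sum(log_cosh(r) for r in residuals) / len(residuals)


# Hypothetical predicted and observed expression values; the last target
# mimics a dropout (observed zero although the gene is expressed).
predicted = [2.1, 0.4, 3.0, 2.5]
observed = [2.0, 0.5, 3.2, 0.0]
print(round(log_cosh_loss(predicted, observed), 4))
```

For a residual of 2.5, the squared error contributes 6.25, whereas the log-cosh term stays close to 2.5 − log 2, so a single dropout dominates the batch loss far less.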

Post-processing techniques that have been used to reduce dropout in scRNA-seq data are likely to be of restricted benefit for the prediction of single-cell GRNs. For example, pooling the expression data of several cells has been used to reduce noise in scRNA-seq data (58). However, since scGeneRAI uses information about how genes covary in individual cells to predict interaction strength, the reduction of this variability is not desirable in this task. Therefore, better data with less technical dropout would certainly further improve the prediction accuracy.

Third, the computation of single-cell gene regulatory networks generates a large amount of data. Since the number of interactions scales quadratically with the number of genes, analysing a high number of genes is computationally costly, and statistical tests for millions of different gene pairs have low statistical power when multiple-test correction is applied. In this study, we therefore examined the GRN100, the 100 strongest interactions of each cell. We consider this information as highly relevant and informative for many aspects of carcinogenesis and as a good proof of concept for the prediction of single-cell GRNs. However, this approach may result in overlooking relevant parts of the GRN that might hold information about important cancer characteristics. For diagnostic purposes, it may in the future be necessary to design panels of relevant network patterns, thus saving computational cost and retaining statistical power.
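The scaling argument can be made explicit with a short calculation. The sketch below counts undirected gene pairs, shows how a Bonferroni-corrected significance threshold shrinks with the number of tests, and selects the strongest interactions of a cell. Bonferroni is used here only as one familiar example of a multiple-test correction, and the gene counts, scores and function names are hypothetical.

```python
import heapq


def n_gene_pairs(n_genes):
    """Number of undirected gene pairs among n_genes genes."""
    return n_genes * (n_genes - 1) // 2


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


def top_interactions(scores, k=100):
    """Return the k highest-scoring (pair, score) items of one cell."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


for n_genes in (100, 1000, 5000):
    pairs = n_gene_pairs(n_genes)
    print(n_genes, pairs, bonferroni_threshold(0.05, pairs))
# pairs: 4950, 499500 and 12497500, respectively

# Hypothetical interaction scores of a single cell.
cell_scores = {("G1", "G2"): 0.9, ("G1", "G3"): 0.1, ("G2", "G3"): 0.5}
print(top_interactions(cell_scores, k=2))
# [(('G1', 'G2'), 0.9), (('G2', 'G3'), 0.5)]
```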

The information of single-cell gene regulatory networks may in the future not only suggest targets for personalized medicine, but also predict the effectiveness of treatment and emerging resistance mechanisms that are related to the heterogeneity of regulatory mechanisms across tumor cells in individual patients. This concept extends precision oncology from patient-focused medicine to cell-based medicine (59). Our method ‘scGeneRAI’ presented here can support such an approach and help unfold the full potential of single-cell approaches for precision oncology.

DATA AVAILABILITY

All data used in this article are available at https://osf.io/nfdtk/ (DOI: 10.17605/OSF.IO/NFDTK).

CODE AVAILABILITY

All computer code used in this article is available at https://github.com/PhGK/scGRN. Source code can be found at https://figshare.com/articles/software/scGeneRAI/21550905. We provide a user-friendly version of scGeneRAI at https://github.com/PhGK/scGeneRAI.

ACKNOWLEDGEMENTS

Computation has been performed on the HPC for Research cluster of the Berlin Institute of Health.

Author contributions: Conceptualization: P.K., P.B., G.M., K.R.M., F.K. Methodology: P.K., G.M., K.R.M., F.K. Formal Analysis: P.K. Investigation: All authors. Resources: M.B., K.R.M., F.K. Data Curation: P.K., P.B. Writing – Original Draft: P.K. Writing – Review & Editing: All authors. Visualization: P.K. Supervision: G.M., K.R.M., F.K. Funding: K.R.M., F.K.

Contributor Information

Philipp Keyl, Institute of Pathology, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität Berlin, Charitéplatz 1, 10117 Berlin, Germany.

Philip Bischoff, Institute of Pathology, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität Berlin, Charitéplatz 1, 10117 Berlin, Germany; Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Anna-Louisa-Karsch-Straße 2, 10178 Berlin, Germany; German Cancer Consortium (DKTK), German Cancer Research Center (DKFZ), Berlin partner site, Germany.

Gabriel Dernbach, Institute of Pathology, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität Berlin, Charitéplatz 1, 10117 Berlin, Germany; BIFOLD – Berlin Institute for the Foundations of Learning and Data, Berlin, Germany.

Michael Bockmayr, Institute of Pathology, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität Berlin, Charitéplatz 1, 10117 Berlin, Germany; Department of Pediatric Hematology and Oncology, University Medical Center Hamburg-Eppendorf, Martinistr. 52, 20246 Hamburg, Germany; Mildred Scheel Cancer Career Center HaTriCS4, University Medical Center Hamburg-Eppendorf, Martinistr. 52, 20246 Hamburg, Germany.

Rebecca Fritz, Institute of Pathology, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität Berlin, Charitéplatz 1, 10117 Berlin, Germany.

David Horst, Institute of Pathology, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität Berlin, Charitéplatz 1, 10117 Berlin, Germany; German Cancer Consortium (DKTK), German Cancer Research Center (DKFZ), Berlin partner site, Germany.

Nils Blüthgen, Institute of Pathology, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität Berlin, Charitéplatz 1, 10117 Berlin, Germany; Institut für Biologie, Humboldt University, Free University of Berlin, Unter den Linden 6, 10099 Berlin, Germany.

Grégoire Montavon, BIFOLD – Berlin Institute for the Foundations of Learning and Data, Berlin, Germany; Machine Learning Group, Technical University of Berlin, Marchstr. 23, 10587 Berlin, Germany.

Klaus-Robert Müller, BIFOLD – Berlin Institute for the Foundations of Learning and Data, Berlin, Germany; Machine Learning Group, Technical University of Berlin, Marchstr. 23, 10587 Berlin, Germany; Department of Artificial Intelligence, Korea University, Seoul 136-713, South Korea; Max-Planck-Institute for Informatics, Stuhlsatzenhausweg 4, 66123 Saarbrücken, Germany.

Frederick Klauschen, Institute of Pathology, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität Berlin, Charitéplatz 1, 10117 Berlin, Germany; German Cancer Consortium (DKTK), German Cancer Research Center (DKFZ), Berlin partner site, Germany; BIFOLD – Berlin Institute for the Foundations of Learning and Data, Berlin, Germany; Institute of Pathology, Ludwig-Maximilians-University Munich, Thalkirchner Str. 36, 80337 München, Germany; German Cancer Consortium (DKTK), German Cancer Research Center (DKFZ), Munich partner site, Germany.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

P.B. is a participant in the BIH Charité Clinician Scientist Program funded by the Charité – Universitätsmedizin Berlin; Berlin Institute of Health at Charité (BIH); K.R.M. was partly supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grants funded by the Korea government (MSIT) [2019-0-00079, Artificial Intelligence Graduate School Program, Korea University and No. 2022-0-00984, Development of Artificial Intelligence Technology for Personalized Plug-and-Play Explanation and Verification of Explanation]; German Ministry for Education and Research (BMBF) [01IS14013A-E, 01GQ1115, 01GQ0850, 01IS18025A, 01IS18037A, MSTARS/MSCORESYS]. Funding for open access charge: Institute of Pathology, Munich.

Conflict of interest statement. None declared.

REFERENCES

  • 1. Bockmayr M., Klauschen F., Györffy B., Denkert C., Budczies J. New network topology approaches reveal differential correlation patterns in breast cancer. BMC Syst. Biol. 2013; 7:78.
  • 2. Slamon D.J., Leyland-Jones B., Shak S., Fuchs H., Paton V., Bajamonde A., Fleming T., Eiermann W., Wolter J., Pegram M. et al. Use of chemotherapy plus a monoclonal antibody against HER2 for metastatic breast cancer that overexpresses HER2. N. Engl. J. Med. 2001; 344:783–792.
  • 3. Fenaux P., Mufti G.J., Hellstrom-Lindberg E., Santini V., Finelli C., Giagounidis A., Schoch R., Gattermann N., Sanz G., List A. et al. Efficacy of azacitidine compared with that of conventional care regimens in the treatment of higher-risk myelodysplastic syndromes: a randomised, open-label, phase III study. Lancet Oncol. 2009; 10:223–232.
  • 4. Moerman T., Aibar Santos S., Bravo González-Blas C., Simm J., Moreau Y., Aerts J., Aerts S. GRNBoost2 and Arboreto: efficient and scalable inference of gene regulatory networks. Bioinformatics. 2019; 35:2159–2161.
  • 5. Huynh-Thu V.A., Irrthum A., Wehenkel L., Geurts P. Inferring regulatory networks from expression data using tree-based methods. PLoS One. 2010; 5:e12776.
  • 6. Pratapa A., Jalihal A.P., Law J.N., Bharadwaj A., Murali T.M. Benchmarking algorithms for gene regulatory network inference from single-cell transcriptomic data. Nat. Methods. 2020; 17:147–154.
  • 7. Luo Q., Yu Y., Lan X. SIGNET: single-cell RNA-seq-based gene regulatory network prediction using multiple-layer perceptron bagging. Brief. Bioinform. 2022; 23:bbab547.
  • 8. Pectasides E., Stachler M.D., Derks S., Liu Y., Maron S., Islam M., Alpert L., Kwak H., Kindler H., Polite B. et al. Genomic heterogeneity as a barrier to precision medicine in gastroesophageal adenocarcinoma. Cancer Discov. 2018; 8:37–48.
  • 9. Nakamura K., Aimono E., Tanishima S., Imai M., Nagatsuma A.K., Hayashi H., Yoshimura Y., Nakayama K., Kyo S., Nishihara H. Intratumoral genomic heterogeneity may hinder precision medicine strategies in patients with serous ovarian carcinoma. Diagnostics (Basel). 2020; 10:200.
  • 10. Heinrich S., Craig A.J., Ma L., Heinrich B., Greten T.F., Wang X.W. Understanding tumour cell heterogeneity and its implication for immunotherapy in liver cancer using single-cell analysis. J. Hepatol. 2021; 74:700–715.
  • 11. Bach S., Binder A., Montavon G., Klauschen F., Müller K.-R., Samek W. On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation. PLoS One. 2015; 10:e0130140.
  • 12. Montavon G., Samek W., Müller K.-R. Methods for interpreting and understanding deep neural networks. Digital Signal Process. 2018; 73:1–15.
  • 13. Lapuschkin S., Wäldchen S., Binder A., Montavon G., Samek W., Müller K.-R. Unmasking Clever Hans predictors and assessing what machines really learn. Nat. Commun. 2019; 10:1096.
  • 14. Samek W., Montavon G., Lapuschkin S., Anders C.J., Müller K.-R. Explaining deep neural networks and beyond: a review of methods and applications. Proc. IEEE. 2021; 109:247–278.
  • 15. Binder A., Bockmayr M., Hägele M., Wienert S., Heim D., Hellweg K., Ishii M., Stenzinger A., Hocke A., Denkert C. et al. Morphological and molecular breast cancer profiling through explainable machine learning. Nat. Mach. Intell. 2021; 3:355–366.
  • 16. Schulte-Sasse R., Budach S., Hnisz D., Marsico A. Graph Convolutional networks improve the prediction of cancer driver genes. International Conference on Artificial Neural Networks. 2019; Springer; 658–668.
  • 17. Chereda H., Bleckmann A., Menck K., Perera-Bel J., Stegmaier P., Auer F., Kramer F., Leha A., Beißbarth T. Explaining decisions of graph convolutional neural networks: patient-specific molecular subnetworks responsible for metastasis prediction in breast cancer. Genome Med. 2021; 13:42.
  • 18. Schnake T., Eberle O., Lederer J., Nakajima S., Schütt K.T., Müller K.-R., Montavon G. Higher-order explanations of graph neural networks via relevant walks. IEEE Trans. Pattern Anal. Mach. Intell. 2022; 44:7581–7596.
  • 19. Keyl P., Bockmayr M., Heim D., Dernbach G., Montavon G., Müller K.R., Klauschen F. Patient-level proteomic network prediction by explainable artificial intelligence. NPJ Precis. Oncol. 2022; 6:35.
  • 20. Dai H., Li L., Zeng T., Chen L. Cell-specific network constructed by single-cell RNA sequencing data. Nucleic Acids Res. 2019; 47:e62.
  • 21. Barabási A.L., Oltvai Z.N. Network biology: understanding the cell’s functional organization. Nat. Rev. Genet. 2004; 5:101–113.
  • 22. Hernández-Lemus E., Reyes-Gopar H., Espinal-Enríquez J., Ochoa S. The many faces of gene regulation in cancer: a computational oncogenomics outlook. Genes (Basel). 2019; 10:865.
  • 23. Bischoff P., Trinks A., Obermayer B., Pett J.P., Wiederspahn J., Uhlitz F., Liang X., Lehmann A., Jurmeister P., Elsner A. et al. Single-cell RNA sequencing reveals distinct tumor microenvironmental patterns in lung adenocarcinoma. Oncogene. 2021; 40:6748–6758.
  • 24. Strumbelj E., Kononenko I. An efficient explanation of individual classifications using game theory. J. Mach. Learn. Res. 2010; 11:1–18.
  • 25. Sundararajan M., Taly A., Yan Q. Axiomatic attribution for deep networks. ICML PMLR Vol.70 of Proceedings of Machine Learning Research. 2017; 3319–3328.
  • 26. Andéol L., Kawakami Y., Wada Y., Kanamori T., Müller K.R., Montavon G. Learning domain invariant representations by joint Wasserstein distance minimization. 2021; arXiv doi: https://arxiv.org/abs/2106.04923, 09 June 2021, preprint: not peer reviewed.
  • 27. R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing. 2022; Vienna, Austria: https://www.R-project.org/.
  • 28. Harrell F.E. Jr. Hmisc: Harrell Miscellaneous. 2022; R package version 4.7-2, https://CRAN.R-project.org/package=Hmisc.
  • 29. Robin X., Turck N., Hainard A., Tiberti N., Lisacek F., Sanchez J.-C., Müller M. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinform. 2011; 12:77.
  • 30. Wickham H. ggplot2: Elegant graphics for data analysis. 2016; NY: Springer-Verlag.
  • 31. Csardi G., Nepusz T. The igraph software package for complex network research. InterJournal. 2006; Complex Systems:1695.
  • 32. Gu Z., Eils R., Schlesner M. Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics. 2016; 32:2847–2849.
  • 33. Konopka T., Konopka M.T. 2018; R-package: umap. Unif. Manifold Approx. Project.
  • 34. van der Maaten L., Hinton G. Visualizing high-dimensional data using t-SNE. J. Mach. Learn. Res. 2008; 9:2579–2605.
  • 35. Kuijjer M.L., Hsieh P.H., Quackenbush J., Glass K. lionessR: single sample network inference in R. BMC Cancer. 2019; 19:1003.
  • 36. Jassal B., Matthews L., Viteri G., Gong C., Lorente P., Fabregat A., Sidiropoulos K., Cook J., Gillespie M., Haw R. et al. The reactome pathway knowledgebase. Nucleic Acids Res. 2020; 48:D498–D503.
  • 37. Kuijjer M.L., Hsieh P.H., Quackenbush J., Glass K. lionessR: single sample network inference in R. BMC Cancer. 2019; 19:1003.
  • 38. Carrington A.M., Fieguth P.W., Qazi H., Holzinger A., Chen H.H., Mayr F., Manuel D.G. A new concordant partial AUC and partial c statistic for imbalanced data in the evaluation of machine learning algorithms. BMC Med. Inform. Decis. Mak. 2020; 20:4.
  • 39. Jiang Y., Metz C.E., Nishikawa R.M. A receiver operating characteristic partial area index for highly sensitive diagnostic tests. Radiology. 1996; 201:745–750.
  • 40. Blondel V.D., Guillaume J.-L., Lambiotte R., Lefebvre E. Fast unfolding of communities in large networks. J. Stat. Mech. Theor. Exp. 2008; 2008:P10008.
  • 41. Chung M.I., Hogan B.L.M. A new genetic tool for studying lung alveolar development, homeostasis, and repair. Am. J. Respir. Cell. Mol. Biol. 2018; 59:706–712.
  • 42. Kato N., Kobayashi T., Honda H. Screening of stress enhancer based on analysis of gene expression profiles: enhancement of hyperthermia-induced tumor necrosis by an MMP-3 inhibitor. Cancer Sci. 2003; 94:644–649.
  • 43. Lu L., Zhao H., Liu J., Zhang Y., Wang X. miRNA-mRNA regulatory network reveals miRNAs in HCT116 in response to folic acid deficiency via regulating vital genes of endoplasmic reticulum stress pathway. Biomed. Res. Int. 2021; 2021:6650181.
  • 44. Knowles M.R., Ostrowski L.E., Leigh M.W., Sears P.R., Davis S.D., Wolf W.E., Hazucha M.J., Carson J.L., Olivier K.N., Sagel S.D. et al. Mutations in RSPH1 cause primary ciliary dyskinesia with a unique clinical and ciliary phenotype. Am. J. Respir. Crit. Care Med. 2014; 189:707–717.
  • 45. Rosigkeit S., Kruchem M., Thies D., Kreft A., Eichler E., Boegel S., Jansky S., Siegl D., Kaps L., Pickert G. et al. Definitive evidence for Club cells as progenitors for mutant Kras/Trp53-deficient lung cancer. Int. J. Cancer. 2021; 149:1670–1682.
  • 46. Sainz de Aja J., Dost A.F.M., Kim C.F. Alveolar progenitor cells and the origin of lung cancer. J. Intern. Med. 2021; 289:629–635.
  • 47. Wang Z., Li Z., Zhou K., Wang C., Jiang L., Zhang L., Yang Y., Luo W., Qiao W., Wang G. et al. Deciphering cell lineage specification of human lung adenocarcinoma with single-cell RNA sequencing. Nat. Commun. 2021; 12:6500.
  • 48. Puła A., Robak P., Mikulski D., Robak T. The significance of mRNA in the biology of multiple myeloma and its clinical implications. Int. J. Mol. Sci. 2021; 22:12070.
  • 49. Azim R., Wang S. Cell-specific gene association network construction from single-cell RNA sequence. Cell Cycle. 2021; 20:2248–2263.
  • 50. Treue D., Bockmayr M., Stenzinger A., Heim D., Hester S., Klauschen F. Proteogenomic systems analysis identifies targeted therapy resistance mechanisms in EGFR-mutated lung cancer. Int. J. Cancer. 2019; 144:545–557.
  • 51. Heim D., Budczies J., Stenzinger A., Treue D., Hufnagl P., Denkert C., Dietel M., Klauschen F. Cancer beyond organ and tissue specificity: next-generation-sequencing gene mutation data reveal complex genetic similarities across major cancers. Int. J. Cancer. 2014; 135:2362–2369.
  • 52. Cai L., Friedman N., Xie X.S. Stochastic protein expression in individual cells at the single molecule level. Nature. 2006; 440:358–362.
  • 53. Klein C.A., Blankenstein T.J., Schmidt-Kittler O., Petronio M., Polzer B., Stoecklein N.H., Riethmüller G. Genetic heterogeneity of single disseminated tumour cells in minimal residual cancer. Lancet. 2002; 360:683–689.
  • 54. Fonseca R., Van Wier S.A., Chng W.J., Ketterling R., Lacy M.Q., Dispenzieri A., Bergsagel P.L., Rajkumar S.V., Greipp P.R., Litzow M.R.et al.. Prognostic value of chromosome 1q21 gain by fluorescent in situ hybridization and increase CKS1B expression in myeloma. Leukemia. 2006; 20:2034–2040. [DOI] [PubMed] [Google Scholar]
  • 55. Kurppa K.J., Denessiouk K., Johnson M.S., Elenius K.. Activating ERBB4 mutations in non-small cell lung cancer. Oncogene. 2016; 35:1283–1291. [DOI] [PubMed] [Google Scholar]
  • 56. Starr A., Greif J., Vexler A., Ashkenazy-Voghera M., Gladesh V., Rubin C., Kerber G., Marmor S., Lev-Ari S., Inbar M.et al.. ErbB4 increases the proliferation potential of human lung cancer cells and its blockage can be used as a target for anti-cancer therapy. Int. J. Cancer. 2006; 119:269–274. [DOI] [PubMed] [Google Scholar]
  • 57. Pfeifer B., Saranti A., Holzinger A.. GNN-SubNet: disease subnetwork detection with explainable graph neural networks. Bioinformatics. 2022; 38:ii120–ii126. [DOI] [PubMed] [Google Scholar]
  • 58. Lun A.T., Bach K., Marioni J.C.. Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 2016; 17:75. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59. Rajewsky N., Almouzni G., Gorski S.A., Aerts S., Amit I., Bertero M.G., Bock C., Bredenoord A.L., Cavalli G., Chiocca S.et al.. Publisher Correction: LifeTime and improving European healthcare through cell-based interceptive medicine. Nature. 2021; 592:E8. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

Supplementary Materials

gkac1212_Supplemental_File

Data Availability Statement

All data used in this article are available at https://osf.io/nfdtk/ (DOI: 10.17605/OSF.IO/NFDTK).


Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press
