MTM: a multi-task learning framework to predict individualized tissue gene expression profiles

Guangyi He; Maiyue Chen; Yingnan Bian; Ence Yang

doi:10.1093/bioinformatics/btad363

2023 Jun 5;39(6):btad363.


Editor: Inanc Birol

PMCID: PMC10278940 PMID: 37279739

Abstract

Motivation

Transcriptional profiles of diverse tissues provide significant insights into both fundamental and translational research, but transcriptome information is not always available for tissues that require invasive biopsies. Alternatively, predicting tissue expression profiles from more accessible “surrogate” samples, especially the blood transcriptome, has become a promising strategy when invasive procedures are not practical. However, existing approaches ignore the intrinsic relevance shared across tissues, which inevitably limits predictive performance.

Results

We propose a unified deep learning-based multi-task learning framework, multi-tissue transcriptome mapping (MTM), enabling the prediction of individualized expression profiles from any available tissue of an individual. By jointly leveraging individualized cross-tissue information from reference samples through multi-task learning, MTM achieves superior sample-level and gene-level performance on unseen individuals. With the high prediction accuracy and the ability to preserve individualized biological variations, MTM could facilitate both fundamental and clinical biomedical research.

Availability and implementation

MTM’s code and documentation are available upon publication on GitHub (https://github.com/yangence/MTM).

1 Introduction

Gene expression profiles in diverse tissues act as molecular mediators between genotypes and phenotypes and provide a snapshot of the systemic physiological and pathological status of an individual (Collins and Varmus 2015). Deciphering tissue-derived gene expression not only provides insights into the fundamental molecular mechanisms of biological processes (Ji et al. 2004, Lage et al. 2008, Kutsenko et al. 2014), but also aids in clinical diagnosis, subtyping, and management (Golub et al. 1999, Irgon et al. 2010, Hoadley et al. 2014). Tissue transcriptome-based biological investigations and clinical evaluations depend heavily on tissue biopsy samples. However, biopsy procedures are often costly, invasive, and even infeasible in some cases. More importantly, invasive biopsy is also associated with increased infections and complications (Lundström et al. 2014, Di Meo et al. 2017), and thus presents greater risks, especially for pediatric and geriatric populations (Hassan et al. 2012, Greenhalgh et al. 2014, García-Albéniz et al. 2017, Pocienė et al. 2022). Thus, mapping biological information from more accessible “surrogate” samples to tissue transcriptomic profiles has become an emerging solution attracting extensive research interest (Gamazon et al. 2015, Halloran et al. 2015, Gusev et al. 2016, Wang et al. 2016, Barbeira et al. 2018, 2019, Hu et al. 2019, Wainberg et al. 2019, Xu et al. 2020, Zhou et al. 2020, Basu et al. 2021), providing wide prospects for characterization, screening, long-term monitoring, and management of diseases (San Lucas et al. 2016, Di Meo et al. 2017, Watts 2018).

Various methods have been developed to predict tissue-specific expression for specific individuals. Early attempts utilized genotypes with the rationale that individual variance in genetically regulated expression (GReX) components of tissue-specific gene expression could be explained by genotypes to some extent, i.e. expression quantitative trait loci (eQTL) (Gamazon et al. 2015, Gusev et al. 2016, Barbeira et al. 2018, Wainberg et al. 2019). A representative work demonstrating this strategy is PrediXcan, which takes advantage of eQTL single-nucleotide polymorphisms (eSNPs) to construct gene-based tests to impute GReX (Gamazon et al. 2015). However, these eQTL-based methods are limited to small proportions of genes with significant tissue-specific eSNPs and neglect non-GReX components in expression (Gamazon et al. 2015). In contrast, since blood is involved in the circulation and bidirectional exchange of substances with tissues and organs, it is considered to reflect the real-time functional status of tissues underlying gene expression, including non-GReX components from feedback of traits, and environmental and other factors (Liew et al. 2006, Mohr and Liew 2007). Furthermore, recent studies have supported the advantages of predicting tissue expression profiles based on blood expression (Halloran et al. 2015, Wang et al. 2016, Xu et al. 2020). Nevertheless, the methods based on genotype and transcriptome usually share two limitations: first, they are typically based on linear models or traditional machine learning methods, inevitably limiting their capability to capture complex nonlinear expression relationships in biological organisms (Von Bertalanffy 1957, Hasty et al. 2001); second, they build prediction models independently for a single gene in each target tissue, failing to exploit the intrinsic cross-tissue relationships among the transcriptomes of multiple tissues from the same individual, a limitation that could be mitigated by borrowing information from similar tissues (Barbeira et al. 2019, Hu et al. 2019, Zhou et al. 2020).

Deep learning has shown its powerful capability for handling high-dimensional, complex, and nonlinear biological data (Lopez et al. 2018, Maj et al. 2019, Marouf et al. 2020, Park et al. 2020, Azevedo et al. 2021, Elmarakeby et al. 2021, Viñas et al. 2021, 2022). Multi-task learning, which imposes regularity or shared representation by learning tasks simultaneously, is ideal for utilizing the intrinsic shared features of different domains (e.g. different tissues from the same individual) (Caruana 1997, Choi et al. 2017, Ruder 2017, Yang et al. 2019).

Here, we developed multi-tissue transcriptome mapping (MTM), a deep learning-based multi-task learning framework that predicts individualized tissue gene expression profiles from any available tissue of a specific person. By jointly leveraging individualized cross-tissue information from multi-tissue reference samples through multi-task learning, MTM achieves superior sample-level and gene-level accuracy (1.63 and 2.29 times higher, respectively), and larger proportions of predictable genes (1.87 times more) than existing methods on unseen individuals. Meanwhile, fine-grained, individualized physiological and pathological variations are well preserved, with the correlation of age-related expression changes between predicted and observed data reaching a median of 0.81 across tissues. In addition, a single unified model supports predicting any target tissue from any input tissue, which adds flexibility and avoids training an independent model for every prediction direction (the number of which grows quadratically with the number of tissues). With its outstanding predictive power and the ability to capture individualized physiological and pathological variations, MTM demonstrates its potential to accelerate transcriptome-based research and clinical applications.

2 Methods

2.1 Data preparation for MTM

The gene expression data from human tissues were downloaded from the v8 release of the Genotype-Tissue Expression (GTEx) portal (https://www.gtexportal.org/) (GTEx Consortium 2017). The metadata, including the subject-level and sample-level information, were obtained through the database of Genotypes and Phenotypes (dbGaP) (https://www.ncbi.nlm.nih.gov/gap/). Gene expression levels (transcripts per million, TPM) of 19 291 protein-coding genes across 49 tissues (sample sizes greater than 50) from 948 individuals (individuals with at least two samples, resulting in 17 329 samples in total) were used for model development. The tissue expression profiles were split into the training set and the validation set by individual, so that all samples from one individual fell into the same set (individuals were randomly split into 80% for training and 20% for validation). Z-score standardization was conducted within each tissue for the training data, and the recorded scaling factors were then applied to the validation data to prevent information leakage.
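The split and standardization described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the MTM implementation: the function names (`split_individuals`, `fit_scaler`, `apply_scaler`), the toy data, the random seed, and the use of the population standard deviation are all hypothetical choices.

```python
import random
from statistics import mean, pstdev

def split_individuals(individual_ids, train_fraction=0.8, seed=0):
    """Randomly assign whole individuals to the training or validation set."""
    ids = sorted(set(individual_ids))
    random.Random(seed).shuffle(ids)
    n_train = int(round(train_fraction * len(ids)))
    return set(ids[:n_train]), set(ids[n_train:])

def fit_scaler(train_values):
    """Record the scaling factors (mean, standard deviation) from training data only."""
    mu = mean(train_values)
    sd = pstdev(train_values) or 1.0  # assumption: constant genes keep a unit scale
    return mu, sd

def apply_scaler(values, scaler):
    """Z-score values with previously recorded scaling factors."""
    mu, sd = scaler
    return [(v - mu) / sd for v in values]

# Toy example: one gene in one tissue, ten hypothetical individuals.
expression = {f"ind{i}": float(i) for i in range(10)}
train_ids, valid_ids = split_individuals(expression)
scaler = fit_scaler([expression[i] for i in sorted(train_ids)])
train_z = apply_scaler([expression[i] for i in sorted(train_ids)], scaler)
# Validation data reuse the training scaling factors, so no validation
# statistics leak into the preprocessing.
valid_z = apply_scaler([expression[i] for i in sorted(valid_ids)], scaler)
```

Splitting by individual rather than by sample keeps every tissue of a validation individual unseen during training.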

2.2 Description of MTM

Framework overview. To achieve individualized gene expression prediction, we proposed a deep learning framework, MTM, a unified multi-task learning approach capable of predicting tissue-specific gene expression profiles using any available tissues from an individual (Fig. 1). The expression levels of input tissue samples across different tissues were processed by a unified encoder network equipped with the tissue conditioning module (TCM), to encourage personalized representations in a shared latent space. Then, the latent codes were mapped to multiple tissue-specific expression spaces through a prediction task conditioning generator to achieve personalized prediction of tissue expression profiles. Two auxiliary networks, a mapping network and a discriminator network, were employed to smooth the latent space and to refine the prediction results, respectively. With the multi-task learning architecture, MTM is not limited to predicting the expressions of a single target tissue based on a single input tissue (single mapping direction), but rather supports arbitrary input tissues and arbitrary target tissues (arbitrary mapping directions) in a unified model, which introduces more flexibility along with potentially improved performance.

Figure 1. — Overview of MTM. MTM is based on a multi-task learning architecture and consists of four artificial neural networks, a unified encoder (E), a unified generator (G), a unified discriminator (D), and a mapping network (M). The gene expression profiles of different input tissues are embedded into a shared latent space by E to form individualized codes, which are then mapped to multiple tissue-specific expression spaces by G to realize personalized prediction of the expression profiles of different target tissues.

Key elements. The MTM architecture consists of an encoder, a generator, a discriminator, and a mapping network (Fig. 1). The encoder (E) embeds expression profiles into a shared latent space, resulting in personalized representations, i.e. latent codes. The generator (G) is symmetrical in structure to the encoder and maps the latent codes into tissue-specific expression spaces in the form of individualized tissue expression profiles. The discriminator (D) identifies mismatches between the predicted and the actual distributions and sends feedback signals to the generator to improve the prediction in an adversarial manner (Goodfellow et al. 2020). The auxiliary mapping network (M) assists E in learning a smooth latent space (Karras et al. 2018). The E, G, and D networks are all multi-task networks equipped with TCMs, which are designed to route and transform the data through tissue-specific or shared connections. The TCM consists of three parts: a shared fully connected layer, a learnable instance-level affine transformation conditioned on tissue labels, and a leaky rectified linear unit (Leaky ReLU) that introduces nonlinearity. The detailed structures of MTM are described in Supplementary Figs S1–S5.
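To make the three parts of the TCM concrete, the following sketch runs one forward pass in plain Python. It is an illustration under assumptions rather than the published architecture: the affine transformation is modeled as a per-feature scale and shift looked up by tissue label, and the function name `tcm_forward`, the negative slope of 0.2, and all weights are hypothetical.

```python
def leaky_relu(x, negative_slope=0.2):
    """Leaky ReLU; the slope value here is an assumption."""
    return x if x > 0 else negative_slope * x

def tcm_forward(x, weight, bias, tissue_scale, tissue_shift, tissue):
    """Shared linear layer -> tissue-conditioned affine transform -> Leaky ReLU."""
    # Shared fully connected layer: the same weights serve every tissue.
    hidden = [sum(w * xi for w, xi in zip(row, x)) + b
              for row, b in zip(weight, bias)]
    # Affine transform whose scale and shift are selected by the tissue label.
    scale, shift = tissue_scale[tissue], tissue_shift[tissue]
    conditioned = [s * h + t for s, h, t in zip(scale, hidden, shift)]
    return [leaky_relu(v) for v in conditioned]

# Hypothetical parameters for a two-feature layer and two tissues.
weight = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.0, 0.0]
tissue_scale = {"blood": [1.0, 1.0], "lung": [2.0, -1.0]}
tissue_shift = {"blood": [0.0, 0.0], "lung": [0.5, 0.0]}

x = [1.0, 2.0]
blood_out = tcm_forward(x, weight, bias, tissue_scale, tissue_shift, "blood")
lung_out = tcm_forward(x, weight, bias, tissue_scale, tissue_shift, "lung")
```

The same input yields different activations for different tissue labels even though the linear weights are shared, which is the routing behavior the text attributes to the TCM.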

Objectives. The generator $G$ translates latent codes from a shared latent space ($c \in W$) into predicted expression profiles conditioned on the target tissue $t$: $G(c \mid t) \to x_{t}$. The latent codes can either be reference-guided, produced by $E$ ($E(x \mid s) \to c$), or transformed from random Gaussian noise $z \sim N(0, 1)$ by $M$ ($M(z) \to \tilde{c}$). To force $G$ to use the latent code $\tilde{c}$ when predicting the expression profile $G(\tilde{c} \mid t)$, we employed the L1 reconstruction loss:

$$L_{R} = \mathbb{E}\left[\lVert E(G(\tilde{c} \mid t) \mid t) - \tilde{c} \rVert_{1}\right] \tag{1}$$

To minimize the divergence between the distributions of the actual expression profiles and of the predicted ones ($G(\cdot)$), a training strategy from generative adversarial networks (GANs) is used, where D tries to distinguish the generated samples from real samples, while G tries to fool D. Instead of the original GAN objective, which minimizes the Jensen–Shannon divergence (Goodfellow et al. 2020), we used an adversarial objective function in the form of the hinge loss for robust training (Lim and Ye 2017, Gui et al. 2023):

$$L_{adv}(D) = \mathbb{E}_{x \sim p_{data}(x)}\left[\max(0, 1 - D(x \mid t))\right] + \mathbb{E}_{z \sim p_{z}(z)}\left[\max(0, 1 + D(G(z \mid t)))\right] \tag{2}$$

$$L_{adv}(G) = - \mathbb{E}_{z \sim p_{z}(z)}\left[D(G(z \mid t))\right] \tag{3}$$
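The hinge-form adversarial objectives of Equations (2) and (3) can be evaluated on a batch of discriminator scores as follows. This is a small sketch with hypothetical function names and made-up scores, where batch averages stand in for the expectations.

```python
def hinge_loss_discriminator(real_scores, fake_scores):
    """Eq. (2): penalize real scores below +1 and generated scores above -1."""
    real_term = sum(max(0.0, 1.0 - s) for s in real_scores) / len(real_scores)
    fake_term = sum(max(0.0, 1.0 + s) for s in fake_scores) / len(fake_scores)
    return real_term + fake_term

def hinge_loss_generator(fake_scores):
    """Eq. (3): the generator is rewarded when D scores its outputs highly."""
    return -sum(fake_scores) / len(fake_scores)

# Hypothetical discriminator outputs D(x | t) and D(G(z | t)).
real_scores = [2.0, 0.5]
fake_scores = [-2.0, 0.0]

d_loss = hinge_loss_discriminator(real_scores, fake_scores)  # 0.25 + 0.5
g_loss = hinge_loss_generator(fake_scores)                   # -(-2.0 + 0.0) / 2
```

Scores that already lie beyond the margin (real above 1, generated below -1) contribute nothing to the discriminator loss, which is one reason the hinge form tends to train stably.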

To achieve individualized tissue expression prediction, given the expression profile of the source tissue $x_{s}$ from a specific individual, we use the L1 loss to penalize the difference between the predicted expression profiles $G (E (x_{s} ∣ s) ∣ t)$ and the actual expression profile $x_{t}$ of the target tissue t:

$$L_{I} = \mathbb{E}\left[\lVert G(E(x_{s} \mid s) \mid t) - x_{t} \rVert_{1}\right] \tag{4}$$

To ensure that the predicted expression profile properly preserves the individual invariant characteristics, a cycle consistency loss is employed:

$$L_{C} = \mathbb{E}\left[\lVert G(E(G(E(x_{s} \mid s) \mid t) \mid t) \mid s) - x_{s} \rVert_{1}\right] \tag{5}$$

The final objective function is as follows:

$$L = L_{adv} + \lambda_{R} L_{R} + \lambda_{I} L_{I} + \lambda_{C} L_{C} \tag{6}$$
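As a worked illustration of Equations (4) to (6), the sketch below uses toy stand-ins for $E$ and $G$ in which each tissue simply adds a fixed offset to a shared individual code. Everything here is hypothetical: the offsets, the expression values, the adversarial and reconstruction loss values, and the unit loss weights are assumptions, and the L1 terms are averaged over genes.

```python
def l1(a, b):
    """Mean absolute difference, a gene-averaged stand-in for the L1 norm."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Toy model: a tissue profile is a shared individual code plus a tissue offset.
OFFSET = {"blood": 1.0, "lung": -2.0}

def encode(x, tissue):
    """Toy E(x | s): remove the tissue offset to recover the shared code."""
    return [v - OFFSET[tissue] for v in x]

def generate(c, tissue):
    """Toy G(c | t): add the target tissue offset to the shared code."""
    return [v + OFFSET[tissue] for v in c]

def total_loss(l_adv, l_r, l_i, l_c, lambda_r=1.0, lambda_i=1.0, lambda_c=1.0):
    """Eq. (6); the lambda values are placeholders, not the published settings."""
    return l_adv + lambda_r * l_r + lambda_i * l_i + lambda_c * l_c

x_blood = [0.5, 2.0, 3.5]
x_lung = [-2.5, -1.0, 1.0]  # observed target; the last gene deviates from the toy model

pred_lung = generate(encode(x_blood, "blood"), "lung")
l_i = l1(pred_lung, x_lung)                             # Eq. (4)
cycled = generate(encode(pred_lung, "lung"), "blood")
l_c = l1(cycled, x_blood)                               # Eq. (5)
total = total_loss(l_adv=1.0, l_r=0.5, l_i=l_i, l_c=l_c)
```

In this toy setting the cycle returns exactly to the source profile, so $L_{C}$ is zero while $L_{I}$ still registers the gene whose observed value departs from the prediction.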

Training process. In the training process, data are organized in the form of tissue pairs within the same individuals, where one is used as input and the other as prediction target. In this way, MTM is trained to learn the mapping between any two types of tissues, hence improving the prediction performance of each direction. Each fold of the 5-fold cross-validation (CV) process requires an average of 7.97 h, 20.8 GB RAM, and 4.8 GB GPU RAM during training. All deep learning-based models were trained using the PyTorch framework (version 1.10.2) (Paszke et al. 2019) on Tesla V100S graphics processing units (NVIDIA).
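The pairing of tissues within an individual can be sketched as follows. This assumes that every ordered (input, target) pair of distinct tissues from the same individual is a valid training example; the function name `tissue_pairs` and the toy sample inventory are hypothetical.

```python
from itertools import permutations

def tissue_pairs(samples_by_individual):
    """List (individual, input tissue, target tissue) for all ordered pairs."""
    pairs = []
    for individual, tissues in samples_by_individual.items():
        # Pairs never cross individuals: input and target share one donor.
        for source, target in permutations(sorted(tissues), 2):
            pairs.append((individual, source, target))
    return pairs

# Hypothetical inventory of which tissues were sampled for each individual.
samples = {
    "ind1": {"blood", "lung", "liver"},
    "ind2": {"blood", "lung"},
}
pairs = tissue_pairs(samples)
```

An individual with k sampled tissues contributes k(k - 1) ordered pairs. With 49 tissues there are 49 × 48 = 2352 possible mapping directions, which a single multi-task model covers without a separate network per direction.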

Training hyperparameters. The data were split into mini-batches of 256 samples for training. The Adam optimizer (Kingma and Ba 2014) was employed to optimize the network parameters using the hyperparameters recommended by implementations from Warde-Farley and Bengio (2017), Miyato et al. (2018), Lewkowycz et al. (2020), and Beugnot et al. (2022): learning rate $= 0.0005$, $\beta_{1} = 0.5$, and $\beta_{2} = 0.9$. The maximum number of training epochs was set to 200 to avoid overfitting.

2.3 Description of compared single-tissue methods for ablative experiments

B-GEX (Method S1) and TEEBoT (Method S2) use linear regression models to predict tissue gene expression from blood sample gene expression features (Xu et al. 2020, Basu et al. 2021). Each method creates a linear regression model for each gene in each target tissue. S1 employs a Bayesian ridge regression model with the most relevant subset of features identified through feature selection for each target gene, while S2 builds a least absolute shrinkage and selection operator (LASSO) model with principal components of the blood expression data as features. Method S3 consisted of a set of nonlinear neural networks, one for each tissue, based on simple multilayer perceptrons. For each tissue, a fully connected neural network with Leaky ReLU activations was built to predict the expression profiles of the target tissue from the expression of a specific source (blood). The maximum number of training epochs was set to 1000. Other training hyperparameters were kept the same as those of MTM.

2.4 Performance evaluation

In the evaluation process, only gene expression profiles of blood samples are used as input and other tissues as output. The prediction quality was indicated by the Pearson correlation coefficient (Pearson’s $ρ$ ) between the predicted and observed expression data. The sample-wise accuracies across all genes were calculated on Z-score standardized data to highlight the individual variance in each tissue. The gene-wise accuracies refer to Pearson’s $ρ$ of each gene across individuals for each tissue. Predictable genes (pGenes) were defined as genes with gene-wise $ρ$ > 0.3 between predicted and observed expression in each tissue. Accuracy indicators in this work were reported in a 5-fold CV manner, where predicted data were concatenated for the following evaluation. For fair comparison, the performance of all models was evaluated in a CV manner using the same inputs as were used with MTM (blood expression data from the same training and validation samples).
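The gene-wise accuracy and the pGene definition above can be written down directly. The sketch below is illustrative only: the helper names, the toy expression values, and the convention of returning 0 for a constant vector are assumptions.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # assumption: a constant vector is treated as uncorrelated
    return cov / sqrt(vx * vy)

def predictable_genes(predicted, observed, threshold=0.3):
    """Genes whose gene-wise correlation across individuals exceeds the threshold."""
    return [gene for gene in observed
            if pearson(predicted[gene], observed[gene]) > threshold]

# Toy data for one tissue: each list holds one gene's values across four individuals.
observed = {"geneA": [1.0, 2.0, 3.0, 4.0], "geneB": [1.0, 2.0, 3.0, 4.0]}
predicted = {"geneA": [1.1, 1.9, 3.2, 3.8], "geneB": [4.0, 1.0, 3.0, 2.0]}

pgenes = predictable_genes(predicted, observed)
```

Here `geneA` is predicted closely and counts as a pGene, while `geneB` has a gene-wise correlation of -0.4 and does not. Sample-wise accuracy uses the same `pearson` function applied across genes within one sample instead of across individuals within one gene.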

2.5 Exploration of the characteristics of predictable genes

Tissue similarity. Similarities between tissues were measured with Pearson’s $ρ$ across genes between the mean expression levels averaged over individuals of each tissue ( ${log}_{2}$ -scaled). When comparing certain pairwise statistics (x1) of tissues with the tissue similarities (x2), element-wise values were flattened to two arrays (X1 and X2), and then Spearman’s correlation coefficient (Spearman’s $ρ$ ) between the two was calculated.
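The flatten-then-correlate comparison can be illustrated as follows. This is a sketch under assumptions: it flattens only the upper triangle of each symmetric tissue-by-tissue matrix (the text does not specify which elements are flattened), computes Spearman's coefficient as Pearson's coefficient on average ranks, and uses invented matrices and helper names.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's coefficient as Pearson's coefficient on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def upper_triangle(matrix):
    """Flatten the off-diagonal upper triangle of a square matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]

# Hypothetical 3-tissue matrices: a pairwise statistic (x1) and tissue similarity (x2).
pairwise_statistic = [[1.0, 0.7, 0.1], [0.7, 1.0, 0.3], [0.1, 0.3, 1.0]]
tissue_similarity = [[1.0, 0.9, 0.2], [0.9, 1.0, 0.4], [0.2, 0.4, 1.0]]

rho = spearman(upper_triangle(pairwise_statistic), upper_triangle(tissue_similarity))
```

In this toy case the two flattened arrays have the same rank order, so the coefficient equals 1.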

Expression levels. In each tissue, the Wilcoxon rank sum test was used to compare the mean expression levels of the top 25 percentiles of predictable genes (highly predictable genes) with those of the control groups, with the control groups defined as the bottom 25 percentiles of predictable genes (unpredictable genes).

Conservation. The conservation distributions of genes in terms of PhastCons scores were extracted with DeepTools (Ramírez et al. 2016) from the 100-way vertebrate species alignment downloaded from the UCSC (University of California Santa Cruz) genome browser (Siepel et al. 2005, Haeussler et al. 2019). The conservation levels of the highly predictable genes were compared with those of the unpredictable genes in each tissue (Wilcoxon rank sum test).

Connectivity. The degree of each gene in the protein interaction network (PIN) was extracted from HIPPIE v2.3 (Alanis-Lobato et al. 2017). The average degrees of the highly predictable genes were then compared with those of the unpredictable genes in each tissue (Wilcoxon rank sum test).

2.6 Exploration of the characteristics of the intermediate features of MTM

Personalized representations from the latent space of the MTM encoder were extracted across different individuals and tissues and then used to calculate pairwise similarities (Pearson’s $ρ$ ). Intraindividual codes refer to latent codes of different tissues from the same individual, while interindividual codes refer to latent codes from different individuals. Decoding paths of the data flow from latent codes to tissue expression spaces refer to the activations of different layers of the MTM generator. The decoding paths of each layer from an individual’s blood-derived code to different tissues were used to calculate pairwise similarities (Pearson’s $ρ$ ). Spearman’s $ρ$ between the pairwise similarities of the decoding paths and the pairwise tissue similarities was calculated to determine the extent to which the decoding rules reflected biological patterns.

2.7 Phenotype-related analysis

For traits including age, gender, and BMI, Pearson’s $ρ$ was used to measure the associations of each gene with traits in each tissue. Then, the Pearson’s $ρ$ between the predicted and actual (GTEx) associations with traits was calculated to determine the extent to which the predicted tissue expression preserved the trait-related expression changes.

For disease-associated dysregulations, the sign consistency and Pearson’s $ρ$ between ${log}_{2}$ fold changes across genes in predicted and actual data were used to measure the overall concordance of the prediction. Differential expression analysis (DE analysis) was conducted using the nonparametric Wilcoxon rank sum test. When comparison of differentially expressed genes (DEGs, under FDR < 0.05) between predicted and actual data was applicable, hypergeometric tests were conducted to examine the overlap significance, and the F1 scores, which convey the balance between precision and recall, were used to measure the performance of identifying DEGs.
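Two of the concordance measures above, sign consistency and the F1 score for DEG recovery, are simple enough to sketch. The function names and toy values are hypothetical, and treating a fold change of exactly zero as non-positive is an assumption.

```python
def sign_consistency(predicted_lfc, observed_lfc):
    """Fraction of genes whose predicted and observed log2 fold changes share a sign."""
    same = sum(1 for p, o in zip(predicted_lfc, observed_lfc) if (p > 0) == (o > 0))
    return same / len(observed_lfc)

def f1_score(predicted_degs, observed_degs):
    """Harmonic mean of precision and recall for recovering observed DEGs."""
    predicted_degs, observed_degs = set(predicted_degs), set(observed_degs)
    true_positives = len(predicted_degs & observed_degs)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted_degs)
    recall = true_positives / len(observed_degs)
    return 2 * precision * recall / (precision + recall)

# Hypothetical log2 fold changes for four genes in one tissue-disease pair.
predicted_lfc = [0.5, -1.0, 0.2, -0.3]
observed_lfc = [1.0, -0.5, -0.1, -0.2]
consistency = sign_consistency(predicted_lfc, observed_lfc)

# Hypothetical DEG sets called from predicted and observed data.
f1 = f1_score({"g1", "g2", "g3"}, {"g2", "g3", "g4", "g5"})
```

Three of the four toy genes agree in direction, and the DEG example has precision 2/3 and recall 1/2, giving an F1 score of 4/7.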

For disease status prediction, we built LASSO models with scikit-learn (version 0.24.2) (Pedregosa et al. 2011) for each disease–tissue pair in the predicted expression, the actual expression, and the input blood expression. The values of the area under the ROC curve (auROC) were used to indicate performance in a 5-fold CV manner with 50 independent repetitions.

2.8 Exploration of MTM applications on external datasets

The raw sequencing data (fastq files) of blood expression profiles from GSE184050 (Chen et al. 2022) were downloaded, and the gene expression values of the 19 291 protein-coding genes were quantified according to the pipeline provided by the GTEx Consortium (2017). To reduce batch effects, we aligned the mean and standard deviation of each gene in each tissue to those of the GTEx data. For the external dataset, MTM was trained in a hold-out manner, with 80% of the data used for training and the remaining 20% for testing.
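The mean and standard deviation alignment can be sketched for a single gene as follows. The function name `align_to_reference`, the toy values, the use of the population standard deviation, and the handling of constant genes are assumptions made for illustration.

```python
from statistics import mean, pstdev

def align_to_reference(values, ref_mean, ref_sd):
    """Rescale one gene so its mean and standard deviation match a reference."""
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        # Assumption: a constant gene is mapped onto the reference mean.
        return [ref_mean for _ in values]
    return [(v - mu) / sd * ref_sd + ref_mean for v in values]

# Hypothetical values of one gene in external blood samples, aligned to
# hypothetical GTEx statistics for the same gene and tissue.
external = [10.0, 12.0, 14.0]
aligned = align_to_reference(external, ref_mean=5.0, ref_sd=2.0)
```

This transformation shifts and rescales each gene without changing the rank order of samples, so it addresses only location and scale differences between datasets.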

The standardized blood expression profiles of GSE184050 (type 2 diabetes, T2D) were input into MTM to predict the tissue expression profiles of corresponding individuals. Then, Pearson’s $ρ$ between ${log}_{2}$ fold changes across genes in predicted data and reference data (case and control subjects for T2D in GTEx) was used to measure the overall concordance of the predicted disease-related dysregulations.

3 Results

3.1 MTM outperforms the baseline methods

Gene expression levels (TPM) of 19 291 protein-coding genes across 49 tissues (sample sizes greater than 50) from 948 individuals (17 329 samples in total) in the GTEx (GTEx Consortium 2017) datasets were used to develop our model. The prediction quality of MTM at the sample level and the gene level was evaluated on the task of predicting individualized gene expression profiles of the other 48 tissues within the development set (GTEx) based on the expression profile of whole blood. Two representative single-tissue approaches, B-GEX (Method S1) (Xu et al. 2020) and TEEBoT (Method S2) (Basu et al. 2021), were used as baselines in the evaluation for comparison. At the sample level, the sample-wise prediction accuracy of MTM, measured by Pearson’s $ρ$ across all genes, was 0.21 $\pm$ 0.07 among the 48 tissues (ranging from 0.04 to 0.34). This was significantly higher (Wilcoxon rank sum test, P-value < .05) than the accuracy of S1 (0.00 $\pm$ 0.02) and S2 (0.08 $\pm$ 0.09) in all 48 tissues. At the gene level, MTM (0.23 $\pm$ 0.07, ranging from 0.04 to 0.34) also demonstrated significantly higher gene-wise accuracy (Wilcoxon rank sum test, P-value < .05) than S1 (0.00 $\pm$ 0.03) and S2 (0.07 $\pm$ 0.11) in 48 and 47 of the 48 tissues, respectively (Fig. 2a and Supplementary Table S1).

Figure 2. — Comparison of the performance for predicting tissue-specific gene expression. (a) The sample-wise and gene-wise accuracy (Pearson’s correlation coefficient, Pearson’s $ρ$ ) of MTM against single-tissue approaches, including linear (S1 and S2) and nonlinear (S3) models. The values represent the mean performance of all individuals (sample-level) or of all genes (gene-level) in the target tissue. (b) The number of predictable genes (pGenes, defined as gene-wise Pearson’s $ρ$ > 0.3) of different models in different target tissues. (c) Venn diagram showing the overlap of the pGenes from S2, S3, and MTM in the esophagus muscularis. Approximately 99% and 98% of pGenes from S2 and S3 were included in the MTM-derived pGenes, respectively. (d, e) The relationship between the fold changes in the number of pGenes and the sample sizes of different tissues for MTM to S2 (d) and MTM to S3 (e). Each scatter point color represents a specific tissue as defined by the GTEx Consortium (GTEx Consortium 2017).

In terms of predictable genes (pGenes; defined as genes with Pearson’s $ρ$ > 0.3 between predicted and observed expression), MTM on average identified 6597 pGenes across the 48 tissues (ranging from 21 to 12 025), which was substantially higher than the 218 (ranging from 0 to 1048) of S1 and the 2301 (ranging from 1 to 8031) of S2 (Fig. 2b and Supplementary Table S2). For the S1- and S2-derived pGenes, a median of 94.2% (7.7%–100.0%) and 94.5% (0.0%–99.8%) was identified by MTM, respectively. In contrast, a median of 2.2% (0.0%–15.9%) and 32.6% (0.0%–73.1%) of MTM-derived pGenes was captured by S1 and S2, respectively. When comparing the fold changes in the number of pGenes of MTM to S2 (ranging from 1.22- to 91.37-fold, mean = 11.39-fold, median = 2.91-fold), we found that the performance gain was negatively correlated with sample size (Spearman’s $ρ = -0.48$, P-value $= 5.26 \times 10^{-4}$, Fig. 2d), implying that the missing information caused by small sample sizes could be partially recovered from other tissues with the application of MTM.

3.2 Both the nonlinearity and multi-task framework contribute to performance

To investigate the independent contribution of the nonlinear and multi-task components, we performed ablative experiments and assessed their influence on the overall performance. We constructed an intermediate model (S3) for comparison, which consisted of a set of nonlinear neural networks for each tissue based on simple multilayer perceptrons.

For the nonlinearity component, we compared the performance of the linear models (S2) with the constructed nonlinear models (S3) for each tissue. As expected, S3 was more accurate than S2 at both the sample and gene levels (0.17 $\pm$ 0.07 and 0.17 $\pm$ 0.07, respectively, Supplementary Fig. S6). The sample-wise accuracies and gene-wise accuracies of S3 were significantly (P-value < .05) higher than those of S2 in 46 tissues and 43 tissues, respectively. Compared with S2, the number of pGenes resulting from S3 (on average 4057 pGenes, ranging from 106 to 10 110, Supplementary Table S2) was substantially higher, with fold changes ranging from 0.57- to 106.00-fold (mean = 7.11-fold, median = 1.93-fold).

For the multi-task learning component, MTM was compared with the constructed nonlinear single-tissue models (S3). At the sample level, MTM was more accurate than S3 in 41 tissues, with the difference reaching statistical significance (P-value < .05) in 24 tissues. In addition, at the gene level, MTM significantly (P-value < .05) outperformed S3 in 42 tissues (Fig. 2a and Supplementary Table S1). Furthermore, the average number of pGenes resulting from MTM (6597) was substantially higher than that of S3 (4057) (Fig. 2b and Supplementary Table S2), with fold changes ranging from 0.20- to 13.38-fold (mean = 2.46-fold, median = 1.65-fold). Approximately 92.4% of the S3-derived pGenes were still captured by MTM (Fig. 2c and Supplementary Fig. S7). The above results suggested that, apart from the advantages introduced by nonlinear neural networks, jointly leveraging the individualized cross-tissue information provides an additional, independent improvement in prediction.

3.3 Intermediate features and outputs of MTM provided insights into the performance basis

To determine how MTM integrated different tissues from the same individual as an entirety, we next explored the characteristics of the individualized representations learned by MTM. Intraindividual latent codes derived from the same individual showed significantly (P-value $< 2.2 \times 10^{-16}$) higher similarities than interindividual latent codes did (Supplementary Fig. S8), indicating that individualized properties of different tissues from the same individual were captured well by MTM. Then, we explored the characteristics of the decoding paths from the same individual representation to the tissue expression profiles. We found that the pairwise similarities between the decoding paths of the data flow (i.e. activations of different layers) were highly consistent (Spearman’s P-value < .05) with those between tissue expression profiles (Supplementary Fig. S8), suggesting that the biological similarities between tissues were implicitly encoded in the decoding process. Together, the integration of multiple tissues into a single model by our multi-task learning architecture resulted in individualized representations and decoding rules with biological patterns.

Given that MTM is able to identify more pGenes, distinctive features of the pGenes may also aid in the interpretation of the overall prediction strength of MTM from a different perspective. We compared the pairwise similarities of the predictability of genes of different tissues (Supplementary Fig. S9), and found that tissues with similar expression profiles tended to share similar pGenes (Spearman’s $ρ$ = 0.56, P-value $= 1.40 \times 10^{-92}$), which was consistent with previous discoveries. The pGenes tended to have significantly (Wilcoxon rank sum test, P-value < .05) higher expression levels than the unpredictable genes in all 48 tissues (Supplementary Fig. S10). The pGenes showed significantly (Wilcoxon rank sum test, P-value < .05) higher conservation than the unpredictable genes in 9 tissues but significantly lower conservation in 35 out of the 48 tissues (Supplementary Fig. S11), suggesting that there was no obvious preference for conservation in pGenes. Notably, there were significantly higher levels of connectivity (average degrees) in the PIN for pGenes than for unpredictable genes in a large portion of tissues (higher in 47 tissues, with 28 tissues showing statistical significance, while significantly lower in one tissue, Wilcoxon rank sum test) (Supplementary Fig. S8). By comparison, the S2-derived pGenes showed significantly lower levels of connectivity in 11 tissues, while exhibiting significantly higher levels in 26 tissues. The difference between MTM and S2 suggested that MTM might be more capable of capturing the complex interactions in the PIN.

3.4 MTM could capture personalized biological variations

As MTM resulted in highly personalized representations of individuals, we assessed the extent to which the predicted tissue expression profiles could reveal trait-associated variations. Correlation analysis of gene expression levels of pGenes was conducted against individuals’ traits (including age, gender, and BMI) in the predicted data (from blood expression) and compared with the actual data (i.e. observed data). The age-specific gene expression associations were well preserved (Pearson’s $ρ$ > 0.3, P-value < .05) in the predicted expression data in 46 out of the 48 tissues, with a $ρ_{median}$ of 0.81. Similarly, the test statistics of gender ( $ρ_{median}$ = 0.76) and BMI ( $ρ_{median}$ = 0.56) of the predicted data showed significant positive (Pearson’s $ρ$ > 0.3, P-value < .05) correlations with those of the actual data in all 43 gender-independent tissues and 39 out of the 48 tissues, respectively (Supplementary Figs S12–S14). The above results suggested that the individualized tissue expression profiles predicted by MTM preserved the trait-related expression changes well.

Apart from trait-associated changes, we investigated the consistency between tissue-specific disease-related dysregulations in the predicted data and those of the actual data (GTEx). DE analysis in pGenes was conducted for the combination of each tissue and each disease with at least 50 cases and 50 controls annotated in GTEx, resulting in 117 disease-related tissue–disease pairs across 35 tissues and 9 diseases (with at least 10 DEGs, FDR < 0.05). In these 117 disease-related pairs, 112 pairs exhibited both high sign consistency (sign consistency > 0.7) and high correlation (Pearson’s $ρ$ > 0.3) of ${log}_{2}$ fold changes of pGenes, among which DEGs showed even higher concordance (Supplementary Table S3), suggesting that MTM captured both the direction and the level of the disease-related expression changes. Among 109 pairs with at least 10 observed upregulated DEGs, the predicted DEGs significantly overlapped (hypergeometric test, P-value < .05) with the reference DEGs in 93 pairs ( ${F1}_{median}$ = 0.65, F1 score > 0.7 in 44 pairs). For downregulated DEGs, the predicted DEGs significantly overlapped with reference DEGs in 94 pairs ( ${F1}_{median}$ = 0.60, F1 score > 0.7 in 41 pairs) among 112 disease-related pairs (with at least 10 observed downregulated DEGs, Fig. 3a and b).

Figure 3. Predicted data preserve personalized biological variations. (a) Comparison of results in DE analysis between the predicted expression and the observed tibial artery expression with heart attack. The position of each point represents the observed results, while the color indicates the predicted $\log_2$ fold change. (b) Performance of predicting DEGs of different diseases in different tissues using predicted data. The number of stars in the grids indicates the overlap significance level of up- or downregulated genes between observed and predicted data (* = significant at .05, ** = significant at $1 \times 10^{-10}$ and *** = significant at $1 \times 10^{-100}$). The color represents the level of the F1 score for predicting DE genes. (c) A specific case of receiver operating characteristic (ROC) curves for predicting chronic obstructive pulmonary disease (COPD) status in the observed lung expression, the predicted lung expression, and the input blood expression. (d) Performance of predicting disease status (auROC) of different cases using observed tissue data, predicted tissue data, and input blood data.

To assess whether the predicted tissue expression profiles could be used to indicate disease status, we next compared the prediction performance of the predicted tissue expression, the actual tissue expression, and the input blood expression profiles. Among the 117 disease-related pairs mentioned above, we focused on the 53 pairs in which the actual tissue expression profiles were more informative than those of blood for disease status prediction, with auROCs of at least 0.6. Although less predictive (Wilcoxon paired test, P-value = $1.82 \times 10^{-5}$) than the actual tissue expression ( ${auROC}_{mean}$ = 0.66), the predicted tissue expression of pGenes across the 53 pairs ( ${auROC}_{mean}$ = 0.62) yielded significantly (Wilcoxon paired test, P-value = $4.03 \times 10^{-4}$) higher auROCs than the original blood expression ( ${auROC}_{mean}$ = 0.59), suggesting that predicted tissue expression profiles were informative for disease status (Fig. 3c and d and Supplementary Fig. S15).
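As a rough illustration of the comparison described above, the sketch below computes an auROC from classifier scores and then compares paired auROCs with an exact Wilcoxon signed-rank test. It is a sketch under stated assumptions, not the authors' procedure: the classifier behind the scores is left unspecified, the helper names are hypothetical, all numbers are invented toy values, and the exact test assumes no zero differences and no tied absolute differences.

```python
from itertools import product

def auroc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney formulation (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def signed_rank_pvalue(x, y):
    """Two-sided exact Wilcoxon signed-rank P-value for paired samples.

    Assumes no zero differences and no ties among |differences|.
    Enumerates all 2**n sign patterns, so it is only suitable for small n.
    """
    diffs = [a - b for a, b in zip(x, y)]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0] * len(diffs)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_obs = min(w_plus, total - w_plus)
    extreme = 0
    for signs in product((0, 1), repeat=len(diffs)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= w_obs:
            extreme += 1
    return extreme / 2 ** len(diffs)

# Hypothetical disease labels (1 = case) and classifier scores for one tissue-disease pair.
labels = [1, 1, 1, 0, 0, 0]
auc_observed = auroc([0.9, 0.8, 0.4, 0.5, 0.3, 0.2], labels)   # observed tissue
auc_predicted = auroc([0.8, 0.5, 0.3, 0.6, 0.2, 0.1], labels)  # predicted tissue
auc_blood = auroc([0.6, 0.3, 0.4, 0.5, 0.7, 0.2], labels)      # input blood

# Hypothetical auROCs across six tissue-disease pairs, compared pairwise.
predicted_aucs = [0.62, 0.66, 0.58, 0.71, 0.64, 0.60]
blood_aucs = [0.59, 0.60, 0.57, 0.63, 0.59, 0.62]
p_paired = signed_rank_pvalue(predicted_aucs, blood_aucs)
print(auc_observed, auc_predicted, auc_blood, p_paired)
```

For realistic sample sizes, library routines such as `sklearn.metrics.roc_auc_score` and `scipy.stats.wilcoxon` would be the usual choice; the enumeration here only serves to make the test statistic transparent.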

3.5 MTM facilitates the identification of tissue-specific disease-associated dysregulations

Since MTM could capture individualized biological variations from any input tissue expression profile, we further explored potential application scenarios for MTM. Specifically, we investigated whether MTM could predict disease-related dysregulations of tissues from blood expression in an external dataset comprising 25 T2D patients and 33 normal subjects (Chen et al. 2022). The external blood expression data were input into MTM to predict the expression profiles of other tissues, which were then used to perform DE analysis, and the results were compared with the GTEx reference (with at least 50 cases and 50 controls). Among 16 T2D-related tissues (with at least 10 observed up- or downregulated DEGs under FDR < 0.05 within GTEx), the predicted T2D-related expression changes (fold changes) exhibited high concordance with the reference data (Pearson’s $ρ$ ranged from −0.49 to 0.76, median = 0.52, Fig. 4), which was significantly higher (Wilcoxon rank sum test, P-value = .033) than that of 11 non-T2D-related tissues (Pearson’s $ρ$ ranged from −0.54 to 0.65, median = 0.32).
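The group comparison above relies on a Wilcoxon rank sum test. A minimal sketch of an exact, one-sided version is given below; the sidedness is an assumption (the text does not state it), the function name is hypothetical, and the per-tissue correlation values are invented toy numbers rather than results from the study.

```python
from itertools import combinations

def rank_sum_greater_pvalue(group_a, group_b):
    """Exact one-sided rank sum P-value for 'group_a tends to be larger than group_b'.

    Enumerates every way of relabeling the pooled values, so it is only
    practical for small groups.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    def u_stat(a_idx):
        # Mann-Whitney U for the values at a_idx versus all remaining values.
        a_set = set(a_idx)
        a = [pooled[i] for i in a_idx]
        b = [pooled[i] for i in range(len(pooled)) if i not in a_set]
        return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)

    observed = u_stat(range(n_a))
    extreme = total = 0
    for idx in combinations(range(len(pooled)), n_a):
        total += 1
        if u_stat(idx) >= observed:
            extreme += 1
    return extreme / total

# Hypothetical per-tissue Pearson correlations between predicted and reference fold changes.
related_tissues = [0.6, 0.5, 0.7, 0.4]       # disease-related tissues
unrelated_tissues = [0.3, 0.2, 0.45, 0.1]    # tissues without disease-related DEGs
p_rank_sum = rank_sum_greater_pvalue(related_tissues, unrelated_tissues)
print(p_rank_sum)
```

With group sizes such as the 16 and 11 tissues reported above, full enumeration becomes expensive, and a library routine such as `scipy.stats.mannwhitneyu` would be a more practical choice.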

Figure 4. Potential utilities in analyzing disease-related gene signatures in external data. Hexagonal heatmaps show the relations between the $\log_2$ fold changes of the predicted tissue expression using blood from external T2D patients and the reference $\log_2$ fold changes (GTEx T2D subsets). Hexagonal cells are colored according to numbers of genes (count).

4 Discussion

Deciphering tissue-specific gene expression of individuals is able to provide vital clues and insights into complex traits and diseases and thus aid fundamental and clinical studies. Predicting tissue transcriptome profiles from biological information of peripheral “surrogate” samples has emerged as a promising alternative with potential utility in areas of biomedical science, especially for circumstances when tissue samples are unavailable. This work finds that, through fully leveraging information from all tissues of the same individual and the nonlinear representation capability of neural networks via the multi-task learning architecture, MTM substantially enhances performance on both overall sample levels and fine-grained gene levels. In particular, MTM identifies more predictable genes with well-preserved personalized physiological and pathological variations, providing a novel and valuable tool for tissue transcriptome-based biomedical investigations from a more comprehensive and systemic perspective. It is important to note that the medical records from GTEx are medical history records, not necessarily reflecting the disease status of the donors at the time of tissue collection, especially for acute illnesses, such as acute pneumonia. Consequently, DEGs identified using those records should be carefully considered or interpreted as DEGs related to disease history, such as adaptive changes like remodeling, fibrosis, or other long-term alterations (Garza 2015, Quinton et al. 2018, Jensen et al. 2021). However, the utilization of MTM with data from acute diseases has the potential to aid in the identification of DEGs associated with disease onset, which has potential clinical implications.

The substantial improvement in performance for MTM may be attributed to several factors. First, it is commonly recognized that an individual’s gene expression levels in different tissues are inherently linked by shared biological foundations, which can be decomposed into a static part that reflects the genetic material shared across tissues (i.e. GReX) and a dynamic part that is related to the temporal and spatial factors of the individual, including physiological states as well as environmental and other factors. The multi-task learning design of MTM encourages the exploitation of both the static and dynamic intrinsic relevance in all tissues from the same individual, enabling the model to extract and refine decoding rules with biological patterns and to produce better individualized representations rather than merely transforming the data, thus leading to improved predictive performance, especially in tissues with smaller sample sizes. Interestingly, MTM performed worst in the two cell lines, where the lowest numbers of pGenes were captured, indicating that the integrity of internal information for the individual was lost in cell lines that were cultured or transformed in vitro. Second, MTM may be more capable of capturing complex gene–gene interactions through its nonlinearity and multi-task components. Last but not least, predictive information, including subtle signals, is largely preserved in MTM, since dimension reduction is not required for transcriptomic data in large-scale neural networks driven by the backpropagation algorithm (LeCun et al. 2015). With these properties, the framework of MTM may serve as an effective paradigm in similar biomedical scenarios where complex intrinsic relationships underlie multiple entities, such as other large-scale omics data of different tissues or cells from the same individuals.

The high accuracy of MTM for cross-tissue expression prediction with well-preserved individualized biological variations supports its potential for biomedical applications, especially in tissue-specific biomarker discovery from readily available tissues (such as blood). However, gaps remain to be filled. A major challenge is removing the heterogeneity between real-world data and the fixed development sets of deployed machine learning models, which is difficult for existing data integration methods because they cannot keep the development sets intact. Another challenge is that the multi-task learning architecture of MTM depends heavily on large-scale computational resources, which limits the incorporation of additional omics data. With future advances in computational resources and data integration algorithms, as well as new incoming data to support continuous learning, MTM is expected to promote real-world translational applications, such as decoding molecular mechanisms and mining clinical biomarkers.

5 Conclusion

In this study, we propose a deep learning-based multi-task learning framework, MTM, to predict individualized tissue gene expression profiles from any available gene expression of the same individual using a single unified model. Compared with two representative approaches, MTM shows superior performance at both the sample level and the gene level, with a larger proportion of predictable genes. An ablation study confirms that jointly leveraging cross-tissue information provides improvements beyond the modeling power of nonlinear neural networks, which might be achieved by exploiting individualized representations through our multi-task learning framework. Phenotype association analysis suggests that the predicted expression profiles preserve individualized biological variations well, including trait-related expression changes and disease-related dysregulations. In summary, our work shows multi-task learning to be an effective strategy for utilizing tissue-shared intrinsic biological relevance in the prediction of cross-tissue gene expression profiles.

Acknowledgements

The GTEx Project is supported by the Common Fund of the Office of the Director of the National Institutes of Health (additional funds are provided by the NCI, NHGRI, NHLBI, NIDA, NIMH, and NINDS). We also thank the reviewers for their valuable comments and suggestions.

Contributor Information

Guangyi He, Department of Medical Bioinformatics, School of Basic Medical Sciences, Peking University, Beijing 100191, China.

Maiyue Chen, School of Artificial Intelligence, Peking University, Beijing 100191, China.

Yingnan Bian, Enlight Medical Technologies (Shanghai) Co., Ltd, Shanghai 201318, China.

Ence Yang, Department of Medical Bioinformatics, School of Basic Medical Sciences, Peking University, Beijing 100191, China; Chinese Institute for Brain Research, Beijing 102206, China.

Supplementary data

Supplementary data are available at Bioinformatics online.

Conflict of interest

None declared.

Data availability

All the raw data before preprocessing are available as follows: (1) GTEx cohort gene expression data are available in the GTEx portal (https://www.gtexportal.org/) and dbGaP (https://www.ncbi.nlm.nih.gov/gap/), and can be accessed with [dbGaP Accession phs000424.v8.p2]; (2) data of patient and control groups for Type 2 Diabetes, along with blood expression profiles, are available in the GEO database (https://www.ncbi.nlm.nih.gov/geo/) and can be accessed with [GSE184050].

Funding

This work was supported by the Beijing Municipal Science and Technology Commission of China [7212065], Ministry of Science and Technology of China [Grant No. 2021ZD0203203], and the Collaborative Research Fund of Chinese Institute for Brain Research, Beijing [2020-NKX-XM-01].

References

Alanis-Lobato G, Andrade-Navarro MA, Schaefer MH. et al. HIPPIE v2.0: enhancing meaningfulness and reliability of protein–protein interaction networks. Nucleic Acids Res 2017;45:D408–14.
Azevedo T, Dimitri GM, Lió P. et al. Multilayer modelling of the human transcriptome and biological mechanisms of complex diseases and traits. NPJ Syst Biol Appl 2021;7:24.
Barbeira AN, Dickinson SP, Bonazzola R. et al.; GTEx Consortium. Exploring the phenotypic consequences of tissue specific gene expression variation inferred from GWAS summary statistics. Nat Commun 2018;9:1825.
Barbeira AN, Pividori M, Zheng J. et al. Integrating predicted transcriptome from multiple tissues improves association detection. PLoS Genet 2019;15:e1007889.
Basu M, Wang K, Ruppin E. et al. Predicting tissue-specific gene expression from whole blood transcriptome. Sci Adv 2021;7:eabd6991.
Beugnot G, Mairal J, Rudi A. 2022. On the benefits of large learning rates for kernel methods. Technical report, arXiv preprint.
Caruana R. Multitask learning. Mach Learn 1997;28:41–75.
Chen H-H, Petty LE, North KE. et al. Novel diabetes gene discovery through comprehensive characterization and integrative analysis of longitudinal gene expression changes. Hum Mol Genet 2022;31:3191–205.
Choi Y, Choi M, Kim M. et al. 2017. StarGAN: unified generative adversarial networks for multi-domain image-to-image translation. Technical report, arXiv preprint.
Collins FS, Varmus H. A new initiative on precision medicine. N Engl J Med 2015;372:793–5.
Di Meo A, Bartlett J, Cheng Y. et al. Liquid biopsy: a step forward towards precision medicine in urologic malignancies. Mol Cancer 2017;16:80.
Elmarakeby HA, Hwang J, Arafeh R. et al. Biologically informed deep neural network for prostate cancer discovery. Nature 2021;598:348–52.
Gamazon ER, Wheeler HE, Shah KP. et al.; GTEx Consortium. A gene-based association method for mapping traits using reference transcriptome data. Nat Genet 2015;47:1091–8.
García-Albéniz X, Hsu J, Bretthauer M. et al. Effectiveness of screening colonoscopy to prevent colorectal cancer among medicare beneficiaries aged 70 to 79 years: a prospective observational study. Ann Intern Med 2017;166:18–26.
Garza MA, Wason EA, Zhang JQ. Cardiac remodeling and physical training post myocardial infarction. World J Cardiol 2015;7:52–64.
Golub TR, Slonim DK, Tamayo P. et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999;286:531–7.
Goodfellow I, Pouget-Abadie J, Mirza M. et al. Generative adversarial networks. Commun ACM 2020;63:139–44.
Greenhalgh RM, Yardley IE, Child F. et al. Lung biopsy for chronic pulmonary disease in children. J Pediatr Surg 2014;49:1075–7.
GTEx Consortium. Genetic effects on gene expression across human tissues. Nature 2017;550:204–13.
Gui J, Sun Z, Wen Y. et al. A review on generative adversarial networks: algorithms, theory, and applications. IEEE Trans Knowl Data Eng 2023;35:3313–32.
Gusev A, Ko A, Shi H. et al. Integrative approaches for large-scale transcriptome-wide association studies. Nat Genet 2016;48:245–52.
Haeussler M, Zweig AS, Tyner C. et al. The UCSC genome browser database: 2019 update. Nucleic Acids Res 2019;47:D853–8.
Halloran JW, Zhu D, Qian DC. et al. Prediction of the gene expression in normal lung tissue by the gene expression in blood. BMC Med Genomics 2015;8:77.
Hassan SF, Mathur S, Magliaro TJ. et al. Needle core vs open biopsy for diagnosis of intermediate- and high-risk neuroblastoma in children. J Pediatr Surg 2012;47:1261–6.
Hasty J, McMillen D, Isaacs F. et al. Computational studies of gene regulatory networks: in numero molecular biology. Nat Rev Genet 2001;2:268–79.
Hoadley KA, Yau C, Wolf DM. et al.; Cancer Genome Atlas Research Network. Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin. Cell 2014;158:929–44.
Hu Y, Li M, Lu Q. et al.; Alzheimer’s Disease Genetics Consortium. A statistical framework for cross-tissue transcriptome-wide association analysis. Nat Genet 2019;51:568–76.
Irgon J, Huang CC, Zhang Y. et al. Robust multi-tissue gene panel for cancer detection. BMC Cancer 2010;10:319.
Jensen IJ, Li X, McGonagill PW. et al. Sepsis leads to lasting changes in phenotype and function of memory CD8 T cells. eLife 2021;10:e70989.
Ji W, Zhou W, Gregg K. et al. A method for cross-species gene expression analysis with high-density oligonucleotide arrays. Nucleic Acids Res 2004;32:e93.
Karras T, Laine S, Aila T. 2018. A style-based generator architecture for generative adversarial networks. Technical report, arXiv preprint.
Kingma DP, Ba J. 2014. Adam: a method for stochastic optimization. Technical report, arXiv preprint.
Kutsenko A, Svensson T, Nystedt B. et al. The Chironomus tentans genome sequence and the organization of the Balbiani ring genes. BMC Genomics 2014;15:819.
Lage K, Hansen NT, Karlberg EO. et al. A large-scale analysis of tissue-specific pathology and gene expression of human disease genes and complexes. Proc Natl Acad Sci USA 2008;105:20870–5.
LeCun Y, Bengio Y, Hinton G. et al. Deep learning. Nature 2015;521:436–44.
Lewkowycz A, Bahri Y, Dyer E. et al. 2020. The large learning rate phase of deep learning: the catapult mechanism. Technical report, arXiv preprint.
Liew C-C, Ma J, Tang H-C. et al. The peripheral blood transcriptome dynamically reflects system wide biology: a potential diagnostic tool. J Lab Clin Med 2006;147:126–32.
Lim JH, Ye JC. 2017. Geometric GAN. Technical report, arXiv preprint.
Lopez R, Regier J, Cole MB. et al. Deep generative modeling for single-cell transcriptomics. Nat Methods 2018;15:1053–8.
Lundström K-J, Drevin L, Carlsson S. et al. Nationwide population based study of infections after transrectal ultrasound guided prostate biopsy. J Urol 2014;192:1116–22.
Maj C, Azevedo T, Giansanti V. et al.; Alzheimer’s Disease Neuroimaging Initiative. Integration of machine learning methods to dissect genetically imputed transcriptomic profiles in Alzheimer’s disease. Front Genet 2019;10:726.
Marouf M, Machart P, Bansal V. et al. Realistic in silico generation and augmentation of single-cell RNA-seq data using generative adversarial networks. Nat Commun 2020;11:166.
Miyato T, Kataoka T, Koyama M. et al. 2018. Spectral normalization for generative adversarial networks. Technical report, arXiv preprint.
Mohr S, Liew CC. The peripheral-blood transcriptome: new insights into disease and risk assessment. Trends Mol Med 2007;13:422–32.
Park J, Kim H, Kim J. et al. A practical application of generative adversarial networks for RNA-seq analysis to predict the molecular progress of Alzheimer’s disease. PLoS Comput Biol 2020;16:e1008099.
Paszke A, Gross S, Massa F. et al. 2019. Pytorch: an imperative style, high-performance deep learning library. Technical report, arXiv preprint.
Pedregosa F, Varoquaux G, Gramfort A. et al. Scikit-learn: machine learning in python. J Mach Learn Res 2011;12:2825–30.
Pocienė I, Gauronskaitė R, Galkauskas D. et al. Age as a risk factor in the occurrence of complications during or after bronchoscopic lung biopsy. Geriatrics 2022;7:34.
Quinton LJ, Walkey AJ, Mizgerd JP. et al. Integrative physiology of pneumonia. Physiol Rev 2018;98:1417–64.
Ramírez F, Ryan DP, Grüning B. et al. DeepTools2: a next generation web server for deep-sequencing data analysis. Nucleic Acids Res 2016;44:W160–5.
Ruder S. 2017. An overview of multi-task learning in deep neural networks. Technical report, arXiv preprint.
San Lucas FA, Allenson K, Bernard V. et al. Minimally invasive genomic and transcriptomic profiling of visceral cancers by next-generation sequencing of circulating exosomes. Ann Oncol 2016;27:635–41.
Siepel A, Bejerano G, Pedersen JS. et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 2005;15:1034–50.
Viñas R, Andrés-Terré H, Liò P. et al. Adversarial generation of gene expression data. Bioinformatics 2022;38:730–7.
Viñas R, Azevedo T, Gamazon ER. et al. Deep learning enables fast and accurate imputation of gene expression. Front Genet 2021;12:624128.
Von Bertalanffy L. Quantitative laws in metabolism and growth. Q Rev Biol 1957;32:217–31.
Wainberg M, Sinnott-Armstrong N, Mancuso N. et al. Opportunities and challenges for transcriptome-wide association studies. Nat Genet 2019;51:592–9.
Wang J, Gamazon ER, Pierce BL. et al. Imputing gene expression in uncollected tissues within and beyond GTEx. Am J Hum Genet 2016;98:697–708.
Warde-Farley D, Bengio Y. 2017. Improving generative adversarial networks with denoising feature matching. International Conference on Learning Representations. ICLR 2017.
Watts G. Liquid biopsy: still early days for early detection. Lancet 2018;391:2593–4.
Xu W, Liu X, Leng F. et al. Blood-based multi-tissue gene expression inference with Bayesian ridge regression. Bioinformatics 2020;36:3788–94.
Yang M, Zhao W, Xu W. et al. Multitask learning for cross-domain image captioning. IEEE Trans Multimedia 2019;21:1047–61.
Zhou D, Jiang Y, Zhong X. et al. A unified framework for joint-tissue transcriptome-wide association and Mendelian randomization analysis. Nat Genet 2020;52:1239–46.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

btad363_Supplementary_Data

Click here for additional data file.^{(4.3MB, zip)}

Data Availability Statement

[btad363-B1] Alanis-Lobato G, Andrade-Navarro MA, Schaefer MH. et al. HIPPIE v2.0: enhancing meaningfulness and reliability of protein–protein interaction networks. Nucleic Acids Res 2017;45:D408–14. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad363-B2] Azevedo T, Dimitri GM, Lió P. et al. Multilayer modelling of the human transcriptome and biological mechanisms of complex diseases and traits. NPJ Syst Biol Appl 2021;7:24. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad363-B3] Barbeira AN, Dickinson SP, Bonazzola R. et al. ; GTEx Consortium. Exploring the phenotypic consequences of tissue specific gene expression variation inferred from GWAS summary statistics. Nat Commun 2018;9:1825. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad363-B4] Barbeira AN, Pividori M, Zheng J. et al. Integrating predicted transcriptome from multiple tissues improves association detection. PLoS Genet 2019;15:e1007889. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad363-B5] Basu M, Wang K, Ruppin E. et al. Predicting tissue-specific gene expression from whole blood transcriptome. Sci Adv 2021;7:eabd6991. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad363-B6] Beugnot G, Mairal J, Rudi A.. 2022. On the benefits of large learning rates for kernel methods. Technical report, arXiv preprint.

[btad363-B7] Caruana R. Multitask learning. Mach Learn 1997;28:41–75. [Google Scholar]

[btad363-B8] Chen H-H, Petty LE, North KE. et al. Novel diabetes gene discovery through comprehensive characterization and integrative analysis of longitudinal gene expression changes. Hum Mol Genet 2022;31:3191–205. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad363-B9] Choi Y, Choi M, Kim M. et al. 2017. StarGAN: unified generative adversarial networks for multi-domain image-to-image translation. Technical report, arXiv preprint.

[btad363-B10] Collins FS, Varmus H.. A new initiative on precision medicine. N Engl J Med 2015;372:793–5. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad363-B11] Di Meo A, Bartlett J, Cheng Y. et al. Liquid biopsy: a step forward towards precision medicine in urologic malignancies. Mol Cancer 2017;16:80. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad363-B12] Elmarakeby HA, Hwang J, Arafeh R. et al. Biologically informed deep neural network for prostate cancer discovery. Nature 2021;598:348–52. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad363-B13] Gamazon ER, Wheeler HE, Shah KP. et al. ; GTEx Consortium. A gene-based association method for mapping traits using reference transcriptome data. Nat Genet 2015;47:1091–8. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad363-B14] García-Albéniz X, Hsu J, Bretthauer M. et al. Effectiveness of screening colonoscopy to prevent colorectal cancer among medicare beneficiaries aged 70 to 79 years: a prospective observational study. Ann Intern Med 2017;166:18–26. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad363-B15] Garza MA, Wason EA, Zhang JQ.. Cardiac remodeling and physical training post myocardial infarction. World J Cardiol 2015;7:52–64. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad363-B16] Golub TR, Slonim DK, Tamayo P. et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999;286:531–7. [DOI] [PubMed] [Google Scholar]

[btad363-B17] Goodfellow I, Pouget-Abadie J, Mirza M. et al. Generative adversarial networks. Commun ACM 2020;63:139–44. [Google Scholar]

[btad363-B18] Greenhalgh RM, Yardley IE, Child F. et al. Lung biopsy for chronic pulmonary disease in children. J Pediatr Surg 2014;49:1075–7. [DOI] [PubMed] [Google Scholar]

[btad363-B19] GTEx Consortium. Genetic effects on gene expression across human tissues. Nature 2017;550:204–13. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad363-B20] Gui J, Sun Z, Wen Y. et al. A review on generative adversarial networks: algorithms, theory, and applications. IEEE Trans Knowl Data Eng 2023;35:3313–32. [Google Scholar]

[btad363-B21] Gusev A, Ko A, Shi H. et al. Integrative approaches for large-scale transcriptome-wide association studies. Nat Genet 2016;48:245–52. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad363-B22] Haeussler M, Zweig AS, Tyner C. et al. The UCSC genome browser database: 2019 update. Nucleic Acids Res 2019;47:D853–8. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad363-B23] Halloran JW, Zhu D, Qian DC. et al. Prediction of the gene expression in normal lung tissue by the gene expression in blood. BMC Med Genomics 2015;8:77. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad363-B24] Hassan SF, Mathur S, Magliaro TJ. et al. Needle core vs open biopsy for diagnosis of intermediate- and high-risk neuroblastoma in children. J Pediatr Surg 2012;47:1261–6. [DOI] [PubMed] [Google Scholar]

[btad363-B25] Hasty J, McMillen D, Isaacs F. et al. Computational studies of gene regulatory networks: in numero molecular biology. Nat Rev Genet 2001;2:268–79. [DOI] [PubMed] [Google Scholar]

[btad363-B26] Hoadley KA, Yau C, Wolf DM. et al. ; Cancer Genome Atlas Research Network. Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin. Cell 2014;158:929–44. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad363-B27] Hu Y, Li M, Lu Q. et al. ; Alzheimer’s Disease Genetics Consortium. A statistical framework for cross-tissue transcriptome-wide association analysis. Nat Genet 2019;51:568–76. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad363-B28] Irgon J, Huang CC, Zhang Y. et al. Robust multi-tissue gene panel for cancer detection. BMC Cancer 2010;10:319. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad363-B29] Jensen IJ, Li X, McGonagill PW. et al. Sepsis leads to lasting changes in phenotype and function of memory CD8 T cells. eLife 2021;10:e70989. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad363-B30] Ji W, Zhou W, Gregg K. et al. A method for cross-species gene expression analysis with high-density oligonucleotide arrays. Nucleic Acids Res 2004;32:e93. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad363-B31] Karras T, Laine S, Aila T.. 2018. A style-based generator architecture for generative adversarial networks. Technical report, arXiv preprint. [DOI] [PubMed]

[btad363-B32] Kingma DP, Ba J.. 2014. Adam: a method for stochastic optimization. Technical report, arXiv preprint.

[btad363-B33] Kutsenko A, Svensson T, Nystedt B. et al. The Chironomus tentans genome sequence and the organization of the Balbiani ring genes. BMC Genomics 2014;15:819. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad363-B34] Lage K, Hansen NT, Karlberg EO. et al. A large-scale analysis of tissue-specific pathology and gene expression of human disease genes and complexes. Proc Natl Acad Sci USA 2008;105:20870–5. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad363-B35] LeCun Y, Bengio Y, Hinton G. et al. Deep learning. Nature 2015;521:436–44. [DOI] [PubMed] [Google Scholar]

[btad363-B36] Lewkowycz A, Bahri Y, Dyer E. et al. 2020. The large learning rate phase of deep learning: the catapult mechanism. Technical report, arXiv preprint.

[btad363-B37] Liew C-C, Ma J, Tang H-C. et al. The peripheral blood transcriptome dynamically reflects system wide biology: a potential diagnostic tool. J Lab Clin Med 2006;147:126–32. [DOI] [PubMed] [Google Scholar]

[btad363-B38] Lim JH, Ye JC.. 2017. Geometric GAN. Technical report, arXiv preprint.

[btad363-B39] Lopez R, Regier J, Cole MB. et al. Deep generative modeling for single-cell transcriptomics. Nat Methods 2018;15:1053–8. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad363-B40] Lundström K-J, Drevin L, Carlsson S. et al. Nationwide population based study of infections after transrectal ultrasound guided prostate biopsy. J Urol 2014;192:1116–22. [DOI] [PubMed] [Google Scholar]

[btad363-B41] Maj C, Azevedo T, Giansanti V. et al. ; Alzheimer’s Disease Neuroimaging Initiative. Integration of machine learning methods to dissect genetically imputed transcriptomic profiles in Alzheimer’s disease. Front Genet 2019;10:726. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad363-B42] Marouf M, Machart P, Bansal V. et al. Realistic in silico generation and augmentation of single-cell RNA-seq data using generative adversarial networks. Nat Commun 2020;11:166. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad363-B43] Miyato T, Kataoka T, Koyama M. et al. 2018. Spectral normalization for generative adversarial networks. Technical report, arXiv preprint.

[btad363-B44] Mohr S, Liew CC.. The peripheral-blood transcriptome: new insights into disease and risk assessment. Trends Mol Med 2007;13:422–32. [DOI] [PubMed] [Google Scholar]

[btad363-B45] Park J, Kim H, Kim J. et al. A practical application of generative adversarial networks for RNA-seq analysis to predict the molecular progress of Alzheimer’s disease. PLoS Comput Biol 2020;16:e1008099. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btad363-B46] Paszke A, Gross S, Massa F. et al. 2019. PyTorch: an imperative style, high-performance deep learning library. Technical report, arXiv preprint.

[btad363-B47] Pedregosa F, Varoquaux G, Gramfort A. et al. Scikit-learn: machine learning in Python. J Mach Learn Res 2011;12:2825–30.

[btad363-B48] Pocienė I, Gauronskaitė R, Galkauskas D. et al. Age as a risk factor in the occurrence of complications during or after bronchoscopic lung biopsy. Geriatrics 2022;7:34.

[btad363-B49] Quinton LJ, Walkey AJ, Mizgerd JP. et al. Integrative physiology of pneumonia. Physiol Rev 2018;98:1417–64.

[btad363-B50] Ramírez F, Ryan DP, Grüning B. et al. DeepTools2: a next generation web server for deep-sequencing data analysis. Nucleic Acids Res 2016;44:W160–5.

[btad363-B51] Ruder S. 2017. An overview of multi-task learning in deep neural networks. Technical report, arXiv preprint.

[btad363-B52] San Lucas FA, Allenson K, Bernard V. et al. Minimally invasive genomic and transcriptomic profiling of visceral cancers by next-generation sequencing of circulating exosomes. Ann Oncol 2016;27:635–41.

[btad363-B53] Siepel A, Bejerano G, Pedersen JS. et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 2005;15:1034–50.

[btad363-B54] Viñas R, Andrés-Terré H, Liò P. et al. Adversarial generation of gene expression data. Bioinformatics 2022;38:730–7.

[btad363-B55] Viñas R, Azevedo T, Gamazon ER. et al. Deep learning enables fast and accurate imputation of gene expression. Front Genet 2021;12:624128.

[btad363-B56] Von Bertalanffy L. Quantitative laws in metabolism and growth. Q Rev Biol 1957;32:217–31.

[btad363-B57] Wainberg M, Sinnott-Armstrong N, Mancuso N. et al. Opportunities and challenges for transcriptome-wide association studies. Nat Genet 2019;51:592–9.

[btad363-B58] Wang J, Gamazon ER, Pierce BL. et al. Imputing gene expression in uncollected tissues within and beyond GTEx. Am J Hum Genet 2016;98:697–708.

[btad363-B59] Warde-Farley D, Bengio Y. 2017. Improving generative adversarial networks with denoising feature matching. International Conference on Learning Representations (ICLR 2017).

[btad363-B60] Watts G. Liquid biopsy: still early days for early detection. Lancet 2018;391:2593–4.

[btad363-B61] Xu W, Liu X, Leng F. et al. Blood-based multi-tissue gene expression inference with Bayesian ridge regression. Bioinformatics 2020;36:3788–94.

[btad363-B62] Yang M, Zhao W, Xu W. et al. Multitask learning for cross-domain image captioning. IEEE Trans Multimedia 2019;21:1047–61.

[btad363-B63] Zhou D, Jiang Y, Zhong X. et al. A unified framework for joint-tissue transcriptome-wide association and Mendelian randomization analysis. Nat Genet 2020;52:1239–46.
