Abstract
Therapeutic target identification is challenging in drug discovery, particularly for rare and orphan diseases. Here, we propose a disease signature, TRESOR, which characterizes the functional mechanisms of each disease through genome-wide association study (GWAS) and transcriptome-wide association study (TWAS) data, and develop machine learning methods for predicting inhibitory and activatory therapeutic targets for various diseases from target perturbation signatures (i.e., gene knockdown and overexpression). TRESOR enables highly accurate identification of target candidate proteins that counteract disease-specific transcriptome patterns, and Bayesian optimization with omics-based disease similarities improves performance for diseases with few or no known targets. We make comprehensive predictions for 284 diseases with 4345 inhibitory target candidates and 151 diseases with 4040 activatory target candidates, and elaborate on the promising targets using several independent cohorts. The methods are expected to be useful for understanding disease–disease relationships and identifying therapeutic targets for rare and orphan diseases.
Subject terms: Bioinformatics, Target identification, Machine learning, Drug development, Genome-wide analysis of gene expression
Identifying therapeutic targets is challenging, especially for orphan diseases. Here, the authors integrate GWAS and TWAS with machine learning methods to predict therapeutic targets for various diseases and demonstrate the usefulness in practice.
Introduction
Identifying biomolecules that can lead to therapeutic effects when regulated by drugs is important to drug development1. Therapeutic targets can be distinct from disease-causing genes or biomarkers. A poor choice of therapeutic target reduces the success rate in human clinical trials2–4. The depletion of therapeutic targets is particularly serious for intractable diseases with unexplained pathological mechanisms and for orphan diseases5,6. Recently, machine learning methods have been successfully adopted for pharmaceutical applications, such as compound–protein interaction prediction7, drug efficacy prediction8, compound virtual screening9, and drug structure optimization10; however, there are few computational approaches for therapeutic target prediction.
The most common approaches for therapeutic target prediction include the use of single nucleotide polymorphisms (SNPs) and transcriptome data, wherein protein-coding genes with SNPs associated with disease11,12 or differentially expressed genes/proteins in disease patients13,14 are chosen as therapeutic targets. The use of target gene perturbation signatures (TGPs) following target gene perturbation (e.g., gene knockdown and overexpression) has been proposed15, where proteins whose gene expression patterns counteract disease-specific gene expression patterns are predicted as therapeutic targets. Nevertheless, the accuracy of these unsupervised methods is low.
Vast amounts of genome-wide association study (GWAS) data have been accumulated for various diseases16. GWAS is useful for detecting the relationship between genomic locations and diseases; however, GWAS has a limitation in understanding disease mechanisms. Recently, transcriptome-wide association study (TWAS) has been proposed to estimate disease-specific gene-expression patterns from GWAS data17. Linking genomic loci to gene-expression patterns that are closer to functional information enhances the interpretability of pathological mechanisms18. Because GWAS is performed on tens of thousands to millions of individuals, TWAS data obtained from GWAS summaries should capture more robust disease features than transcriptome data obtained directly from only a few dozen patients. Hence, TWAS data are expected to be a useful resource for identifying therapeutic targets for diseases including rare and orphan diseases.
In this study, we introduce the TWAS-relevant signature for orphan diseases, TRESOR, a disease signature that reflects functional dysregulation of genes based on GWAS and TWAS data. Using TRESOR and TGPs, we develop machine learning methods for predicting inhibitory and activatory therapeutic targets, and demonstrate the usefulness of the methods for various diseases including rare diseases that have known therapeutic targets and orphan diseases that have no known therapeutic targets.
Results
Overview of the proposed method for therapeutic target prediction
Here, we present an overview of our proposed method for predicting therapeutic targets for a wide range of diseases, integrating GWAS, TWAS, and TGPs (Fig. 1).
Fig. 1. Overview of the proposed method for predicting therapeutic targets integrating GWAS and TWAS summary data.
A Construction of our proposed disease signature, TWAS-relevant signature for orphan diseases (TRESOR). SNP β-values from the GWAS summary data for the disease, SNP linkage disequilibrium (LD) from the reference data, and gene weights from gene-expression models from PredictDB database49 were used to estimate gene-expression scores. These estimated gene-expression scores were used in the construction of TRESOR. B Inverse signature method with TRESOR and target gene perturbation signatures (TGPs). Correlation coefficients for inhibitory or activatory target–disease pairs were calculated using TRESOR and TGPs with gene knockdown or using TRESOR and TGPs with gene overexpression. C Multitask learning method with disease similarities. Target gene knockdown and target gene overexpression signatures were used as inputs for predictive models of individual diseases. The predictive models are simultaneously learned through sharing disease similarities from various disease features, such as causal mutations. D Bayesian integrative method. Using Bayesian optimization, the inverse signature and multitask learning methods are integrated.
First, we characterize each disease by constructing TRESOR signatures from SNP β-values obtained from disease GWAS summary data, SNP linkage disequilibrium (LD) reference data, and gene weights from TWAS models (Fig. 1A and Methods).
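As a rough illustration of how the three inputs in Fig. 1A can be combined, the sketch below follows the published S-PrediXcan approximation, in which a gene-level z-score is a weighted sum of SNP z-scores, each scaled by the ratio of the SNP standard deviation to the standard deviation of predicted expression derived from the LD covariance. The function name and the toy numbers are hypothetical, and the step that assembles such scores into TRESOR is not shown; the actual procedure is described in Methods.

```python
import numpy as np

def gene_zscore(weights, beta, se, ld_cov):
    """Approximate gene-level association z-score (S-PrediXcan-style).

    weights : prediction-model weight of each SNP for one gene
    beta, se: GWAS effect sizes and their standard errors, per SNP
    ld_cov  : SNP-SNP covariance matrix from an LD reference panel
    """
    weights = np.asarray(weights, dtype=float)
    ld_cov = np.asarray(ld_cov, dtype=float)
    snp_z = np.asarray(beta, dtype=float) / np.asarray(se, dtype=float)
    snp_sd = np.sqrt(np.diag(ld_cov))
    # Standard deviation of predicted expression: sqrt(w' * Gamma * w)
    gene_sd = np.sqrt(weights @ ld_cov @ weights)
    return float(np.sum(weights * snp_sd / gene_sd * snp_z))

# Toy example (hypothetical values): two SNPs in partial LD.
ld_cov = np.array([[0.25, 0.05],
                   [0.05, 0.16]])
z = gene_zscore(weights=[0.8, -0.3], beta=[0.10, -0.04],
                se=[0.02, 0.02], ld_cov=ld_cov)
```

A positive score here would suggest that genetically predicted expression of the gene is higher in the disease, and a negative score that it is lower.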
Then, we predict therapeutic targets for each disease by leveraging the inverse signature method based on inverse correlations between the disease-specific gene-expression signatures and TGPs (Fig. 1B). TGPs were assumed to reflect the functions of drugs targeting the perturbed genes. Because disease states are characterized by impaired expression patterns, the inverse signature method identifies candidate therapeutic targets by predicting proteins whose gene-expression patterns counteract disease-specific gene-expression patterns. To compensate for the lack of robustness of gene-expression patterns in diseases, we use TRESOR as a disease signature in the inverse signature method.
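The idea of the inverse signature method can be sketched in a few lines. The example assumes Pearson correlation over a shared gene set and uses made-up signatures; the correlation measure, gene selection, and preprocessing actually used are given in Methods, and the function name is hypothetical.

```python
import numpy as np

def inverse_signature_scores(disease_sig, tgp_matrix):
    """Score each candidate target by the negated Pearson correlation
    between its perturbation signature and the disease signature.

    disease_sig : (n_genes,) disease signature (e.g., TRESOR)
    tgp_matrix  : (n_targets, n_genes) perturbation signatures
    Returns (n_targets,) scores; higher means a stronger inverse correlation.
    """
    d = np.asarray(disease_sig, dtype=float)
    t = np.asarray(tgp_matrix, dtype=float)
    d_c = d - d.mean()
    t_c = t - t.mean(axis=1, keepdims=True)
    corr = (t_c @ d_c) / (np.linalg.norm(t_c, axis=1) * np.linalg.norm(d_c))
    return -corr

# Toy data (hypothetical): one disease signature, three candidate targets.
disease = [2.0, 1.0, -1.0, -2.0]
tgps = [[-2.0, -1.0, 1.0, 2.0],   # exactly counteracts the disease pattern
        [2.0, 1.0, -1.0, -2.0],   # mimics the disease pattern
        [1.0, -1.0, 1.0, -1.0]]   # weakly related to the disease pattern
scores = inverse_signature_scores(disease, tgps)
ranking = np.argsort(-scores)     # indices of candidates, best first
```

Knockdown signatures would be scored this way for inhibitory targets and overexpression signatures for activatory targets.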
Next, we predict therapeutic targets for each disease by leveraging the multitask learning method with disease similarities, where six types of disease features for gene–disease associations (GDAs) and three types of disease features for variant–disease associations (VDAs) are considered (Fig. 1C and Methods). Different diseases may have common molecular mechanisms, and the same therapeutic targets can be used for multiple diseases. Hence, we formulate the therapeutic target prediction problem in a supervised multitask learning framework.
Finally, we predict more reliable therapeutic targets with the Bayesian integrative method, through a Bayesian optimization-based combination of the inverse signature method and multitask learning method (Fig. 1D).
TRESOR adequately reflects disease-specific characteristics
We analyzed commonalities between diseases using TRESOR. We compared TRESOR with three baseline signatures: SNP signatures with p-values (SNP-PV)11,12, SNP signatures with expression quantitative trait loci (eQTL) (SNP-eQTL)15, and direct transcriptome signatures measured from patients (DT)15.
We compared disease signatures between SNP-PV, SNP-eQTL, DT, and TRESOR for 24 diseases represented by the four signatures (Fig. 2A, Supplementary Figs. S1 and S2). Figure 2A shows scatterplots of disease–disease relationships with disease class from the International Statistical Classification of Diseases and Related Health Problems 11th version (ICD-11)19. Diseases in the same class were closer to each other using TRESOR than SNP-PV, SNP-eQTL, and DT. Alzheimer’s disease, Parkinson’s disease, and amyotrophic lateral sclerosis, categorized as “VIII. Nervous system diseases,” were close together using TRESOR but scattered with the other signatures. Asthma and idiopathic pulmonary fibrosis (IPF), categorized as “XII. respiratory system diseases,” were also close to each other using TRESOR but distant from each other using the other signatures. These results suggest that TRESOR reflects disease-specific features more appropriately than other signatures.
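The two-dimensional projection used for this kind of comparison can be sketched as follows. The random matrix is only a placeholder for a diseases × genes signature matrix, and the function name is hypothetical; no scaling or gene filtering from the actual analysis is reproduced here.

```python
import numpy as np

def pca_2d(signatures):
    """Project rows (diseases) onto the top two principal components.

    Returns the 2-D coordinates and the proportion of variance
    explained by each of the two components.
    """
    x = np.asarray(signatures, dtype=float)
    x = x - x.mean(axis=0)                      # center each gene
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    coords = u[:, :2] * s[:2]
    explained = (s ** 2) / np.sum(s ** 2)
    return coords, explained[:2]

rng = np.random.default_rng(0)
sigs = rng.normal(size=(24, 100))   # placeholder: 24 diseases x 100 genes
coords, ratio = pca_2d(sigs)
```

Plotting `coords` colored by ICD-11 class would give a scatterplot of the same form as the panels in Fig. 2A.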
Fig. 2. Visualization of disease–disease relationships based on various disease signatures and performance comparison in predicting therapeutic targets.
A Scatterplots of diseases were obtained after applying principal component analysis (PCA) to the SNP-PV, SNP-eQTL, DT, and TRESOR signatures for 24 diseases. SNP-PV, SNP-eQTL and DT are baseline signatures, and TRESOR is our proposed signature. The proportion of variance explained by the top two PCs is shown on each axis. The diseases are labeled with different colors according to the disease classification of the Eleventh Edition of the International Statistical Classification of Diseases and Related Health Problems (ICD-11)19. B Comparison of the performance of the proposed method and baseline methods for identifying inhibitory targets for 23 diseases; the proposed method corresponds to the inverse signature method with TRESOR. The baseline methods correspond to SNP profiling methods with SNP-PV and SNP-eQTL and the inverse signature method with DT. Each box represents the distribution of AUC scores for diseases. In the box plots: center line, median; box, interquartile range; whiskers, 1.5 × interquartile range; and point, the AUC score for each disease. The horizontal dotted line represents AUC = 0.5. The asterisks represent significance based on one-sided p-values after Benjamini–Hochberg (BH) corrections: ; ; ; . The two groups were compared by one-sided Wilcoxon signed-rank test. Corrections for multiple testing were made, adjusting significance values for three tests per analysis stream. The p-values between TRESOR and SNP-PV, SNP-eQTL, or DT are , , and . C Same as (B) but for activatory target predictions for 13 diseases. The p-values between TRESOR and SNP-PV, SNP-eQTL, or DT are , , and . Source data for Fig. 2A–C is provided in the Supplementary Data S19.
Disease name abbreviations: AD, Alzheimer’s disease; ALS, amyotrophic lateral sclerosis; AtD, atopic dermatitis; ATL, adult T-cell lymphoma/leukemia; CC, colorectal carcinoma; CD, Crohn’s disease; CLL, chronic lymphocytic leukemia; IDDM, insulin-dependent diabetes mellitus; EC, endometrial carcinoma; IPF, idiopathic pulmonary fibrosis; MM, multiple myeloma; MNT, malignant neoplasm of the testis; MNB, malignant neoplasm of the breast; MNO, malignant neoplasm of the ovaries; PC, pancreatic carcinoma; PD, Parkinson’s disease; RA, rheumatoid arthritis; RCC, renal cell carcinoma; SCLS, small cell carcinoma of the lung; SLE, systemic lupus erythematosus; UC, ulcerative colitis; UCN, uterine cervical neoplasm.
TRESOR-based inverse signature method outperforms baseline methods in accuracy and applicability
We evaluated the performance of the proposed inverse signature method, which involves calculating the inverse correlation between the TGP and TRESOR signatures (Supplementary Fig. S3).
We compared the performance between the proposed and baseline methods for therapeutic target prediction. Baseline methods correspond to the inverse signature method with DT15 and the SNP profiling methods with SNP-PV and SNP-eQTL11,12,15. We evaluated the performance using gold standard data consisting of 1921 inhibitory target–disease associations (408 targets, 284 diseases) and 274 activatory target–disease associations (80 targets, 151 diseases) based on three accuracy measures: AUC, AUPR and BED AUC.
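For reference, the per-disease AUC can be computed directly from its rank interpretation: the probability that a known target is scored above a non-target, with ties counted as one half. The sketch below uses hypothetical scores and labels, and does not cover AUPR or BED AUC.

```python
def auc_score(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    scores : prediction scores for candidate targets of one disease
    labels : 1 for known therapeutic targets, 0 otherwise
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example (hypothetical): four candidate targets, two of them known.
auc = auc_score([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

An AUC of 0.5 corresponds to random ranking, which is the reference line drawn in Fig. 2B and C.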
Figures 2B and 2C show the AUC scores for the SNP profiling and the inverse signature methods in predicting inhibitory and activatory targets, respectively. The inverse signature method with TRESOR performed better than the SNP profiling method with SNP-PV ( for inhibitory target and for activatory target, Wilcoxon signed-rank test), where indicates adjusted p-value. Additionally, the inverse signature method with TRESOR performed better than the SNP profiling method with SNP-eQTL ( for inhibitory target and for activatory target). Moreover, the inverse signature method with TRESOR was more accurate than that with DT ( for inhibitory target and for activatory target). These results indicate that the proposed inverse signature method with TRESOR is more accurate than the baseline methods. A comparison with another baseline method that relies solely on GWAS signals is shown in Supplementary Results.
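The Benjamini–Hochberg adjustment applied to the three pairwise comparisons can be sketched as below. The input p-values are hypothetical placeholders, not the values obtained in this study, and the function name is an assumption.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the original order.

    Each p-value is scaled by m / rank (rank 1 = smallest), and a running
    minimum from the largest p-value downward enforces monotonicity.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # walk from largest to smallest p
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values for three tests in one analysis stream.
adj = benjamini_hochberg([0.01, 0.04, 0.03])
```

With three tests per analysis stream, as in Fig. 2B and C, the smallest raw p-value is multiplied by three and the largest is left unchanged.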
We also examined the influence of disease degree (number of known therapeutic targets) on the prediction accuracy of the inverse signature methods (Supplementary Figs. S5 and S6). DT was less accurate for diseases with lower degrees, whereas TRESOR was more accurate than DT at every degree. These results indicate that TRESOR works for diseases with both high and very low degrees. We also examined the influence of cell/tissue operation in TRESOR construction and GWAS sample sizes on the prediction accuracy (Supplementary Results).
Bayesian optimization-based combination of the inverse signature and multitask learning methods enhances prediction performance
We evaluated the performance of our proposed Bayesian integrative method for therapeutic target prediction. This method combines the inverse signature and multitask learning methods through Bayesian optimization, incorporating the advantages of both and compensating for their shortcomings (see Methods). We performed five-fold cross-validation experiments for the Bayesian integrative method, using the disease similarities based on VDAs on Cm. Details of the performance evaluation of the multitask learning method alone are shown in Fig. 3A–G and Supplementary Results.
Fig. 3. Performance evaluation of the multitask learning method based on various types of disease similarities and performance evaluation of the Bayesian integrative method.
A Performance evaluation of the multitask learning method for inhibitory target predictions. Multitask learning methods were compared across nine types of disease similarities, comprising gene–disease associations (GDAs) and variant–disease associations (VDAs). GDAs consisted of all possible features (All), Altered expression (Ae), Biomarker (Bm), Causal mutation (Cm), Genetic variation (Gv), and Posttranslational modification (Pm); VDAs consisted of all possible features (All), Causal mutation (Cm), and Genetic variation (Gv). The bars in the panel represent the AUC, AUPR, BED AUC scores and disease degrees (the number of known therapeutic targets), from the top to the bottom panels. Blue bars indicate GDAs, and orange bars indicate VDAs. The horizontal axis represents the same diseases shown in Fig. 2. Supplementary Fig. S14 shows the results for all diseases. The horizontal dotted line in the top panel represents AUC = 0.5. Source data is provided in the Supplementary Data S13–S15. B Relevant disease similarities for inhibitory target prediction. The number of diseases with max AUC, AUPR, and BED AUC was counted for each disease feature from the left to right panels. Diseases and colors are the same as in (A). Source data is provided in the Supplementary Data S13–S15. C Part of the disease similarity network based on VDAs on Cm near IDDM. Orange nodes denote diseases. Node sizes reflect random walk with restart (RWR) from IDDM. Edge width reflects disease similarity for VDAs on Cm. Source data is provided in the Supplementary Data S19. D Same as (C) but for SLE. E Same as (A) but for Activatory target predictions. Source data is provided in the Supplementary Data S16–S18. F Relevant disease similarities for activatory target prediction, as in (B). Source data is provided in the Supplementary Data S16–S18. G Same as (C) but for melanoma. Source data is provided in the Supplementary Data S19. 
H Performance evaluation of the inverse signature, multitask learning, and Bayesian integrative methods for each disease degree for inhibitory target predictions for 113 diseases. The violin plots represent the distributions of AUC, AUPR, and BED AUC scores from the top to bottom panels. In the violin plots: center white point, median; box, interquartile range; whiskers, 1.5 × interquartile range; and point, AUC, AUPR, or BED AUC score for the disease. Colors represent prediction methods; pink, inverse signature method with TRESOR; orange, multitask learning method; and blue, Bayesian integrative method. The horizontal axis represents the degree of the disease (the number of known therapeutic targets). The horizontal dotted line in the top panel represents AUC = 0.5. Source data is provided in the Supplementary Data S1. I Same as (H) but for activatory target predictions for 61 diseases. Source data is provided in the Supplementary Data S2. Disease name abbreviations: AA, aplastic anemia; AML, acute myeloid leukemia; BCC, basal cell carcinoma; CC, colorectal carcinoma; CML, chronic myeloid leukemia; EC, endometrial carcinoma; GIST, gastrointestinal stromal tumors; IDDM, insulin-dependent diabetes mellitus; LC, liver carcinoma; LMS, leiomyosarcoma; MNP, malignant neoplasm of prostate; MNT, malignant neoplasm of testis; MNB, malignant neoplasm of breast; MT, mammary neoplasms; NET, neuroendocrine tumors; NIDDM, non-insulin-dependent diabetes mellitus; NSCLS, non-small cell carcinoma of lung; OS, osteosarcoma; PH, pulmonary hypertension; TC, papillary thyroid carcinoma; SLE, systemic lupus erythematosus.
We examined the influence of degree of disease on the prediction accuracy of the inverse signature, multitask learning, and Bayesian integrative methods (Fig. 3H, I). The AUC, AUPR, and BED AUC scores of the Bayesian integrative method were higher than those of the inverse signature method (inhibitory target: AUC, ; AUPR, ; and BED AUC, ; activatory target: AUC, ; AUPR, ; and BED AUC, ) and multitask learning method (inhibitory target: AUC, ; AUPR, ; and BED AUC, ; activatory target: AUC, ; AUPR, ; and BED AUC, ).
For inhibitory target prediction of diseases with a degree of only 1, the performance of the inverse signature method depended on the disease [standard deviation (SD) of BED AUCs = 0.302], whereas the Bayesian integrative method showed a more stable prediction accuracy (SD of BED AUCs = 0.259). For activatory target predictions, the Bayesian integrative method (median: AUPR = 1, BED AUC = 1) worked significantly better than the inverse signature method (median: AUPR = 0.25, BED AUC = 0.78). These results indicate that the Bayesian integrative method can accurately predict therapeutic targets for diseases with both high and low degrees, and that it is applicable to orphan diseases for which the inverse signature or multitask learning methods fail to make correct predictions. Supplementary Data S1 and S2 list the prediction accuracy for all diseases (284 and 151 diseases for inhibitory and activatory targets, respectively). We also examined the influence of disease category on the accuracy according to ICD-11 disease classes (Supplementary Results).
Bayesian integrative method predicts promising inhibitory targets for rare diseases with few known targets
Using the Bayesian integrative method, we comprehensively predicted potential inhibitory targets for 284 diseases, including rare diseases (Supplementary Data S3). TGPs with gene knockdowns for 4345 proteins were used.
Figure 4A shows a small part of the predicted inhibitory target–disease association network. We focused on rare diseases with few known inhibitory targets, as it is challenging to identify targets for these diseases using existing methods. For example, multiple endocrine neoplasia (MEN), a syndrome of multiple malignant tumors in endocrine and nonendocrine organs, is a rare disease. RET is its only known inhibitory target, and RASSF5, BECN1, FHL2, ZFAND6, SKP1, RNGTT, TARBP1, MCM8, and CNOT7 are new candidate inhibitory targets predicted by the Bayesian integrative method. The validity of the predicted targets was verified using independent resources absent from the learning data: BECN1 is upregulated in pancreatic cancer and promotes tumorigenesis20. FHL2 is involved in cell survival and proliferation, and suppression of FHL2 promotes apoptosis21. CNOT7, ZFAND6, SKP1, RNGTT, TARBP1, and MCM8 are also associated with MEN (see Supplementary Results). These results suggest that the newly predicted candidates may be involved in regulating MEN.
Fig. 4. Newly predicted therapeutic targets and their modes of action for rare diseases using the Bayesian integrative method.
A Parts of the newly predicted inhibitory target–disease association networks. Blue circles and yellow diamonds denote inhibitory targets and diseases, respectively. Gray lines represent known associations, and blue lines show predicted associations. The square represents the first node for multiple endocrine neoplasia (MEN). Source data is provided in the Supplementary Data S3. B Heatmap of GO enrichment analysis for known and predicted inhibitory targets for MEN. The horizontal axis represents known and predicted targets. The vertical axis represents some of more significantly enriched GO terms, and all enriched terms can be found in the source data (Supplementary Data S5). Horizontal and vertical color bars give the GO categories and therapeutic target types, respectively. The color and asterisks for each square reflect p values and significance (*), respectively. Enrichment analysis was performed by Fisher’s exact test. Corrections for multiple testing were applied, adjusting significance values based on the number of GO terms. C Scatterplot of TGP with gene knockdown of inhibitory target FHL2 and TRESOR for MEN. The vertical and horizontal axes represent the gene-expression scores for TRESOR and the TGP for the predicted targets, respectively. Each point denotes differentially expressed genes in TRESOR and the TGP. The blue lines represent regression lines, and light blue regions represent the upper and lower limits of 95% confidence intervals for the regression estimate. The color of each point reflects the TWAS p-value, indicating the association level between genes and diseases. TWAS p-value calculation and multiple testing corrections were performed using the S-PrediXcan formula17. Source data is provided in the Supplementary Data S19. D Part of the disease similarity network in the vicinity of MEN. Blue nodes denote diseases. The sizes of the nodes reflect random walk with restart (RWR) from MEN. Edge width reflects disease similarity for VDAs on Cm. 
Nodes with yellow edge color apart from MEN represent major lesions of MEN. Source data is provided in the Supplementary Data S19. E Same as (A) but for the newly predicted activatory target–disease association network. The squares represent the first node for idiopathic pulmonary fibrosis (IPF). Source data is provided in the Supplementary Data S4. F Same as (B) but for activatory targets for IPF. All enriched terms can be found in the source data (Supplementary Data S6). G Same as (C) but for TGP with gene overexpression for activatory target STIMATE/TMEM110 and TRESOR for IPF. Source data is provided in the Supplementary Data S19. H Same as (D) but for IPF. Source data is provided in the Supplementary Data S19. Disease name abbreviations: AHF, acute heart failure; ALS, amyotrophic lateral sclerosis; AML, acute myeloid leukemia; ATC, anaplastic thyroid carcinoma; BCC, basal cell carcinoma; CC, colorectal carcinoma; CML, chronic myeloid leukemia; DM, diabetes mellitus; CNDI, congenital nephrogenic diabetes insipidus; COPD, chronic obstructive airway disease; EC, endometrial carcinoma; GIST, gastrointestinal stromal tumors; HSA, hemangiosarcoma; HT, hypertensive disease; ICT, islet cell tumor; IPF, idiopathic pulmonary fibrosis; IPH, idiopathic pulmonary hypertension; LC, liver carcinoma; LMS, leiomyosarcoma; MEN, multiple endocrine neoplasia; MM, multiple myeloma; MS, motion sickness; MT, mammary neoplasms; MNP, malignant neoplasm of prostate; MNT, malignant neoplasm of testis; MNB, malignant neoplasm of breast; MNO, malignant neoplasm of ovary; MNUB, malignant neoplasm of urinary bladder; MPA, male pattern alopecia; MTC, medullary thyroid carcinoma; NET, neuroendocrine tumors; NIDDM, non-insulin-dependent diabetes mellitus; NSCLS, non-small cell carcinoma of lung; OS, osteosarcoma; PC, pancreatic carcinoma; PHEO, pheochromocytoma; PPI, paralytic ileus; SLS, Sjogren-Larsson syndrome; SS, systemic scleroderma; TC, papillary thyroid carcinoma; SCLS, small cell carcinoma of lung.
We examined the validity of the predicted inhibitory targets in terms of their mechanisms of action by performing Gene Ontology (GO) biological process22 and KEGG23 pathway analyses using differentially expressed genes in TGPs with gene knockdowns. Figure 4B and Supplementary Fig. S17A show heatmaps for significantly enriched GO terms and KEGG pathways, respectively. Focusing on the predicted targets enriched for the same GOs as the known inhibitory target (RET), FHL2 and CNOT7 were significantly enriched for “platelet aggregation”; ZFAND6, SKP1, MCM8, and CNOT7 were enriched for “chaperone-mediated autophagy”; and RASSF5, BECN1, and ZFAND6 were enriched for “positive regulation of apoptosis.” These results indicate that these predicted inhibitory targets may regulate some of the same processes as RET.
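The enrichment test behind each heatmap cell can be illustrated with an over-representation form of Fisher's exact test, i.e., the upper tail of the hypergeometric distribution. The sketch assumes a one-sided test and uses hypothetical gene counts; the gene universe, term definitions, and the multiple-testing correction are not reproduced here.

```python
from math import comb

def enrichment_p_value(n_universe, n_term, n_selected, n_overlap):
    """One-sided Fisher's exact test for over-representation.

    n_universe : number of genes considered in total
    n_term     : genes annotated with the GO term or pathway
    n_selected : differentially expressed genes in the signature
    n_overlap  : selected genes that carry the annotation
    Returns P(X >= n_overlap) under the hypergeometric distribution.
    """
    total = comb(n_universe, n_selected)
    upper = min(n_term, n_selected)
    tail = sum(comb(n_term, k) * comb(n_universe - n_term, n_selected - k)
               for k in range(n_overlap, upper + 1))
    return tail / total

# Toy example (hypothetical counts): 3 of 4 selected genes hit a 5-gene term.
p = enrichment_p_value(n_universe=20, n_term=5, n_selected=4, n_overlap=3)
```

In practice the resulting p-values would then be corrected for the number of GO terms tested, as stated in the legend of Fig. 4B.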
Next, we focused on predicted targets that could regulate different processes than the known therapeutic target RET. BECN1 was enriched for “NF-κB signaling”; the activation of NF-κB signaling contributes to the tumor suppressor function of MEN1, which is a causal gene for MEN24. In the pathway analysis, ZFAND6, SKP1, MCM8, and RNGTT were enriched for immune system pathways, such as “virus infection” and “T cell receptor signaling” (Supplementary Fig. S17A). The tumor microenvironment is associated with the MEN oncoprocess, and the immune response is a promising therapeutic target for MEN25. Thus, these inhibitory targets that regulate different processes than the known target may exhibit new therapeutic effects. These results suggest that the Bayesian integrative method predicts therapeutic targets that could potentially lead to therapeutic effects for diseases based on the unique features of both the inverse signature and multitask learning methods.
To better understand the prediction process of the Bayesian integrative method, the genes and diseases that contributed to the predictions were investigated. From the viewpoint of the inverse signature method, we examined how the regulation of inhibitory targets acts on differentially expressed genes in the disease. Figure 4C shows a scatterplot for TRESOR and FHL2-knockdown signatures. Genes in the second and fourth quadrants indicate that differential expression patterns in the disease may be inverted by FHL2 inhibition. RNH1, DNAJA3, KDELR2, and SPAG7 were upregulated in the MEN signature and downregulated in the FHL2-knockdown signature. Conversely, PDGFA, ABHD4, and GATA3 were downregulated in the MEN signature and upregulated in the FHL2-knockdown signature. GATA3 is associated with endocrine gland phenotypes26,27, DNAJA3 plays a critical role in tumor suppression28,29, and SPAG7 is a prognostic marker of various cancers. These results suggest that inhibiting FHL2 may counteract disease-specific gene-expression patterns. The results for the other targets are shown in Supplementary Fig. S18. We also investigated the diseases that contributed to the predictions from the viewpoint of multitask learning method (Fig. 4D and Supplementary Results).
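The random walk with restart (RWR) used to size nodes in the disease similarity networks (Fig. 4D) can be sketched as a simple fixed-point iteration. The similarity matrix, the restart probability of 0.3, and the function name are assumptions for illustration only.

```python
import numpy as np

def random_walk_with_restart(similarity, seed, restart=0.3,
                             tol=1e-10, max_iter=1000):
    """Steady-state visiting probabilities of a walker that follows
    similarity-weighted edges and jumps back to `seed` with
    probability `restart` at every step."""
    w = np.asarray(similarity, dtype=float)
    col_sums = w.sum(axis=0)
    col_sums[col_sums == 0] = 1.0        # avoid division by zero
    w = w / col_sums                     # column-normalize transitions
    e = np.zeros(w.shape[0])
    e[seed] = 1.0
    p = e.copy()
    for _ in range(max_iter):
        p_next = (1 - restart) * (w @ p) + restart * e
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p

# Toy network (hypothetical): disease 0 is the seed, e.g., the query disease.
sim = np.array([[0.0, 0.9, 0.1, 0.0],
                [0.9, 0.0, 0.2, 0.0],
                [0.1, 0.2, 0.0, 0.5],
                [0.0, 0.0, 0.5, 0.0]])
rwr = random_walk_with_restart(sim, seed=0)
```

Diseases that are strongly connected to the seed receive larger scores, which is what the node sizes in the network panels convey.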
Finally, we validated the predicted inhibitory targets using three independent resources: an adrenocortical carcinoma cohort (35 samples), a pancreatic adenocarcinoma cohort (72 samples), and a thyroid carcinoma cohort (231 samples) from The Cancer Genome Atlas (TCGA)30 (Supplementary Methods). Because these carcinomas are major lesions in MEN, the associations between their survival rates and the gene-expression patterns of predicted inhibitory targets were investigated. Figure 5A–C compare the survival rates between donors with upregulated and downregulated target gene expression. We focused on FHL2 as a predicted inhibitory target. Donors with downregulated FHL2 exhibited higher survival rates than donors with upregulated FHL2 ( in adrenocortical carcinoma; in pancreatic adenocarcinoma; in thyroid carcinoma; Log-rank test), indicating that FHL2 is a promising inhibitory target. We also examined other predicted therapeutic targets (Supplementary Figs. S20–S22).
Fig. 5. Validation of predicted therapeutic targets for rare and orphan diseases using independent cohorts.
A Comparison of survival rates in The Cancer Genome Atlas (TCGA)30 adrenocortical carcinoma cohort between patients with a low and high expression of FHL2, a predicted inhibitory target for MEN. The two groups were compared by two-sided Log-rank test. B Same as (A) but for a TCGA30 pancreatic adenocarcinoma cohort. C Same as (A) but for a TCGA30 thyroid carcinoma cohort. Adrenocortical carcinoma, pancreatic adenocarcinoma, and thyroid carcinoma are major lesions of MEN. The horizontal and vertical axes represent survival time (years) and overall survival rate, respectively. The pink and blue lines represent patients with FHL2 low expression and patients with FHL2 high expression, respectively. D Comparison of STIMATE/TMEM110 gene-expression scores between healthy controls (n = 18) and IPF donors (n = 19) in the Lung Tissue Research Consortium37. The asterisk represents significance (). The two groups were compared using a one-sided Wilcoxon rank-sum test. Corrections for multiple testing were applied using the false discovery rate (FDR) with BH corrections, adjusting significance values for nine tests per analysis stream (together with Supplementary Fig. S23). The p-value is . E Comparison of gene-expression scores between a biomarker and a predicted activatory target for IPF: SFTPA1 and STIMATE/TMEM110. The horizontal and vertical axes represent gene-expression level for STIMATE/TMEM110 and SFTPA1, a disease activity marker of IPF, respectively. Each point denotes an IPF patient (n = 19). The black lines represent regression lines, and the light gray regions represent the upper and lower limits of 95% confidence intervals for the regression estimate. F Comparison of gene-expression scores between a biomarker and a predicted activatory target for IPF: SFTPD and STIMATE/TMEM110. SFTPD is a disease activity marker of IPF. This panel follows the same format as (E). 
G Comparison of p-tau in the frontal white matter between donors with upregulated and downregulated RAB1B (upregulated, n = 6; downregulated, n = 19), a predicted inhibitory target for tauopathies. The boxes represent the distribution of p-tau. In the box plots: center line, median; box, interquartile range; whiskers, 1.5 × interquartile range; and points, tauopathy patients. Asterisks represent significance (*p < 0.05; **p < 0.01). The two groups were compared by one-sided Wilcoxon rank-sum test. We made corrections for multiple testing using the FDR by BH corrections, adjusting significance values for four tests per analysis stream. The p-value is . H Same as (G) but in the hippocampus (upregulated, n = 14; downregulated, n = 14). The p-value is . I Same as (G) but in the parietal cortex (upregulated, n = 14; downregulated, n = 12). The p-value is . J Same as (G) but in the temporal cortex (upregulated, n = 13; downregulated, n = 15). The p-value is . Source data for Fig. 5A–J is provided in the Supplementary Data S19. Disease name abbreviations: IPF, idiopathic pulmonary fibrosis; MEN, multiple endocrine neoplasia.
Bayesian integrative method predicts promising activatory targets for intractable diseases with few known targets
Using the Bayesian integrative method, we comprehensively predicted potential activatory targets of 151 diseases, including rare diseases (Supplementary Data S4). The gene overexpression signatures of 4040 proteins were used.
Figure 4E shows a small part of the predicted activatory target–disease association network. We focused on intractable diseases with few known activatory targets. For example, IPF is an intractable lung disease that has an unknown cause and an irreversible progression. FFAR1/GPR40 is the only known activatory target for IPF with an ongoing clinical trial31, and APAF1, STIMATE/TMEM110, ZNF22, DTX3L, UTP14A, PA2G4, TNFSF13, SMARCE1, and ZKSCAN2 are new candidate activatory targets. The validity of several predicted targets was confirmed using independent resources absent from the learning data: APAF1 may alleviate fibrosis by inducing apoptosis of myofibroblasts32,33; a recent report suggested that supplementation with inhaled STIMATE/TMEM110-positive type II alveolar epithelial cell-derived exosomes lessened early acute injury, prevented advanced fibrosis, alleviated ventilatory impairment, and reduced mortality in a bleomycin-induced mouse fibrosis model34. ZNF22, DTX3L, PA2G4, and SMARCE1 are also associated with IPF (see Supplementary Results). These results suggest that regulating these newly predicted candidates may have therapeutic effects for IPF.
We examined the validity of the predicted activatory targets in terms of their mechanisms of action by performing GO and KEGG pathway analyses using differentially expressed genes in TGPs with gene overexpression. Figure 4F and Supplementary Fig. S17B show heatmaps of significantly enriched GO terms and pathways. Focusing on the predicted targets that enrich the same GOs as the known activatory target (FFAR1/GPR40), ZNF22 and ZKSCAN2 were enriched for “mitotic sister chromatid cohesion.” Thus, these predicted activatory targets may regulate some of the same processes as the known therapeutic target.
Next, we focused on activatory targets that may regulate different processes than the known therapeutic target FFAR1/GPR40. STIMATE/TMEM110 was significantly enriched for “fibroblast growth signaling” (Fig. 4F), suggesting that activating STIMATE/TMEM110 would lead to therapeutic effects for IPF. STIMATE/TMEM110 was also significantly enriched for “T-cell receptor signaling,” suggesting that STIMATE/TMEM110 regulates chronic inflammation and scar formation. PA2G4, STIMATE/TMEM110, UTP14A, and ZNF22 were enriched for the cell cycle, apoptosis, and proliferation, which may lead to therapeutic effects due to the importance of the balance between epithelial and mesenchymal cells in IPF treatment35. Thus, these activatory targets regulate different processes from the known therapeutic target and may elicit new therapeutic effects. These results suggest that the Bayesian integrative method can predict therapeutic targets leading to potential therapeutic effects for diseases owing to unique features of both the inverse signature and multitask learning methods.
To better understand the prediction process of the Bayesian integrative method, we analyzed the genes and diseases that contributed to the predictions. Using the inverse signature method, we investigated how the regulation of activatory targets acts on differentially expressed genes in disease. Figure 4G shows a scatterplot for the TRESOR and STIMATE/TMEM110 overexpression signatures. Genes in the second and fourth quadrants indicate that differential expression patterns in the disease may be inverted by STIMATE/TMEM110 activation. SH3BP5, PSMF1, TSPAN4, and FASTKD5 were upregulated in IPF and downregulated in the STIMATE/TMEM110 overexpression signature. Conversely, PDGFA, MCUR1, and GLOD4, downregulated in IPF, were upregulated in the STIMATE/TMEM110 overexpression signature. PDGFA was strongly associated with IPF in the TWAS analysis , and PDGFA upregulation promotes alveolar epithelial cell differentiation and repair36. These results suggest that activating STIMATE/TMEM110 may counteract disease-specific gene-expression patterns. The results of the other targets are shown in Supplementary Fig. S19. We also analyzed the diseases that contributed to the predictions from the viewpoint of the multitask learning method (Fig. 4H and Supplementary Results).
Finally, we validated the predicted activatory targets using an independent resource, an IPF cohort (19 IPF and 18 healthy donors) from the Lung Tissue Research Consortium37 (Supplementary Methods). Figure 5D compares the gene-expression scores of STIMATE/TMEM110 between healthy controls and IPF donors. IPF donors exhibited lower levels of gene expression than that in healthy donors (; Wilcoxon rank-sum test). We then examined the associations of STIMATE/TMEM110 with the disease activity markers SFTPA1 and SFTPD38 (Fig. 5E, F, respectively). STIMATE/TMEM110 expression was inversely correlated with SFTPA1 and SFTPD ( for SFTPA1; for SFTPD; Pearson’s correlation coefficient), indicating that the activation of STIMATE/TMEM110 may have a therapeutic effect on IPF. Some other predicted targets also showed associations with IPF (Supplementary Fig. S23). The above analyses demonstrate the capability of the Bayesian integrative method to predict therapeutic targets for diseases with small numbers of known targets.
Bayesian integrative method predicts promising therapeutic targets for orphan diseases without known targets
Finally, via the Bayesian integrative method, we comprehensively predicted potential therapeutic targets for orphan diseases that have no known therapeutic targets. We performed predictions for tauopathies and retinal dystrophy. Tauopathy is a general term for neurodegenerative disorders characterized by phosphorylated tau (p-tau) neurofibrillary tangles, a neuropathological hallmark. We constructed TRESOR signatures of progressive supranuclear palsy, corticobasal degeneration, and frontotemporal dementia. Retinal dystrophy causes vision loss due to retinal damage (Supplementary Results). These are intractable orphan diseases that have no fundamental treatments.
Table 1 lists the top ten inhibitory and activatory targets for tauopathies from 4345 and 4040 candidate proteins, respectively. The validity of these predicted candidate therapeutic targets was confirmed using independent resources. Among the inhibitory targets, RAB1B is associated with neurofibrillary tangles in tauopathies, and drugs targeting Rab family proteins are being developed39. For activatory targets, fragments of SNRNP70 are associated with phosphorylated tau aggregates, and normal SNRNP70 activation may lead to therapeutic effects for tauopathies40. Predicted targets in bold in Table 1 are also associated with tauopathies (Supplementary Results). We also performed GO and KEGG pathway analyses, and GO terms and pathways associated with tauopathies were significantly enriched (Supplementary Results and Supplementary Figs. S24 and S25). The genes that contributed to the predictions were also investigated (Supplementary Results). These results indicate the potential of the predicted targets for tauopathies.
Table 1.
Newly predicted inhibitory and activatory targets for tauopathies by the Bayesian integrative method
| (A) Newly predicted inhibitory targets | | | (B) Newly predicted activatory targets | | |
|---|---|---|---|---|---|
| Rank | Inhibitory targets | Prediction scores | Rank | Activatory targets | Prediction scores |
| 1 | RAB1B | 1.000 | 1 | APOC3 | 1.000 |
| 2 | ATF3 | 0.946 | 2 | CTSK | 0.986 |
| 3 | WNK4 | 0.928 | 3 | PPFIBP2 | 0.953 |
| 4 | IKBKB | 0.913 | 4 | METTL14 | 0.945 |
| 5 | CHAC1 | 0.885 | 5 | COPS5 | 0.945 |
| 6 | SIRT7 | 0.880 | 6 | LAGE3 | 0.928 |
| 7 | DCUN1D4 | 0.873 | 7 | SNRNP70 | 0.901 |
| 8 | FADS1 | 0.870 | 8 | PROC | 0.901 |
| 9 | A2M | 0.870 | 9 | KYNU | 0.889 |
| 10 | P4HA1 | 0.862 | 10 | SLC7A11 | 0.883 |
Note. The top 10 inhibitory and activatory targets are listed. Prediction scores represent the therapeutic targetability for tauopathies. Bold type denotes therapeutic targets whose validity is suggested in the literature (see Supplementary Results).
Finally, we validated the predicted therapeutic targets using an independent resource: a traumatic brain injury (TBI) exposure cohort (107 donors) from the “Aging, Dementia, and TBI Study”41 (Supplementary Methods). Because TBI is a risk factor for tauopathies, the associations between p-tau and gene-expression patterns of predicted therapeutic targets were examined. Figure 5G–J compare p-tau between donors with upregulated and downregulated target genes in four brain regions: frontal white matter (FWM), hippocampus, parietal cortex (PCx), and temporal cortex (TCx). We focused on RAB1B as a predicted inhibitory target. Donors with downregulated RAB1B exhibited relatively lower p-tau levels in FWM (; Wilcoxon rank-sum test) and PCx and significantly lower levels in TCx than donors with upregulated RAB1B, indicating that RAB1B is a promising inhibitory target. We then focused on SNRNP70 as a predicted activatory target. Donors with upregulated SNRNP70 exhibited lower p-tau levels than donors with downregulated SNRNP70 (, Supplementary Fig. S29), suggesting that SNRNP70 is a promising activatory target. Some other targets were associated with p-tau and their expression patterns (Supplementary Fig. S29). These results suggest the usefulness of the Bayesian integrative method for orphan diseases without known therapeutic targets.
Discussion
In this study, we proposed a disease signature, TRESOR, which reflects functional mechanisms integrating GWAS and TWAS. Using TRESOR, we also developed inverse signature, multitask learning, and Bayesian integrative methods to predict therapeutic targets for rare and orphan diseases. In the inverse signature method, using TRESOR enabled more accurate identification of target candidate proteins that counteract disease-specific gene-expression patterns. In the multitask learning method, we performed disease similarity-guided prediction to enhance the accuracy for rare and orphan diseases, and considering multiple disease features on GDAs and VDAs enabled us to capture disease similarities more precisely. The Bayesian integrative method, a hybrid model consisting of the inverse signature method (unsupervised learning) and multitask learning method (supervised learning), predicted therapeutic targets for rare and orphan diseases with few or no known therapeutic targets. Thus, our proposed approach is expected to facilitate drug development for rare and orphan diseases.
Conventional SNP profiling methods based on genetic mutations11,12 were not always able to detect therapeutic targets correctly (Fig. 2). In monogenic disorders, a causal gene or biomarker with a genetic mutation may be a therapeutic target42; however, in other diseases, susceptibility genes themselves may not be therapeutic targets, and proteins encoded by genes without disease-associated SNPs can be therapeutic targets instead, based on shared molecular mechanisms with susceptibility genes1,43. In this study, the inverse signature method with TRESOR considers disease-specific SNP information and disease-specific gene-expression patterns to reflect the molecular mechanisms of diseases. Thus, our proposed method can work for both monogenic and multigenic disorders.
The conventional inverse signature method using disease-specific transcriptome signatures from patients15 was less accurate than the inverse signature method with TRESOR (Fig. 2). The TRESOR constructed from GWAS data on many samples may reflect core gene-expression patterns and have less noise than transcriptome signatures derived from disease patients. Nevertheless, TRESOR has a limitation: the TWAS analysis cannot be performed for diseases with scarce GWAS data. Because new GWAS data are continually being published, the accumulation of additional GWAS data could resolve this problem over time.
A previous related study described a supervised learning approach using XGBoost44. This method predicts whether protein–disease pairs for various diseases are in therapeutic target–disease associations. It utilizes more information from diseases with many known therapeutic targets than from diseases with few therapeutic targets, which makes prediction for diseases with few or no known therapeutic targets difficult. In our study, the Bayesian integrative method addresses this problem by building a prediction model for each disease and integrating inverse signature and multitask learning methods. Because disease similarities are taken into account, diseases with few therapeutic targets can benefit from diseases that have many therapeutic targets. Additionally, although the previous method lacks interpretability, our proposed method can be interpreted based on inverse correlations (Fig. 4C, G) and disease similarities (Fig. 4D, H). Furthermore, the previous method will likely predict therapeutic targets with similar therapeutic effects to known targets. However, through its incorporation of unsupervised and supervised learning methods, our proposed method can predict new therapeutic targets with different therapeutic effects than known targets (Fig. 4B, F). Further discussion is provided in the Supplementary Discussion owing to space limitations.
Methods
Target gene perturbation profiles
Target gene perturbation profiles arising from either gene knockdown or overexpression experiments were obtained from the L1000 database45. This database provided 978 landmark genes, called “L1000 genes.” We used “level 5” data, including profiles generated by collapsing several replicates. We incorporated 36,720 gene knockdown profiles (denoted as “trt_sh.cgs”) and 34,171 gene overexpression profiles (denoted as “trt_oe”). Gene knockdown and gene overexpression profiles were individualized by averaging biological replicates. We constructed 4345 gene knockdown profiles for 17 cell lines and 4040 gene overexpression profiles for 20 cell lines. Supplementary Table S2 and Data S7 show the cell lines and perturbed genes, respectively.
Target gene perturbation profiles were used as feature vectors for therapeutic target candidate proteins. Transcriptomic profiles following gene knockdown and gene overexpression, referred to as “gene knockdown signatures” and “gene overexpression signatures,” respectively, were constructed. Together, these signatures are referred to as “target gene perturbation signatures” (TGPs). Each gene knockdown signature and each gene overexpression signature was represented as a p-dimensional feature vector, where p is the number of genes. Each element in the signature was defined as the ratio between the gene-expression value measured after the gene perturbation and that measured in the corresponding controls. Because TGPs have many missing entries, we imputed these values using a tensor decomposition algorithm46.
TWAS-relevant signature for orphan diseases (TRESOR)
Disease-specific transcriptomic profiles were constructed from GWAS data using TWAS analyses to robustly capture disease-specific gene-expression patterns. The resulting TWAS-relevant signatures for orphan diseases are referred to as “TRESOR signatures.” GWAS data were obtained from the NHGRI-EBI GWAS Catalog database16. Unified Medical Language System disease IDs47 were linked to the “DISEASE TRAIT” in the GWAS data. GWAS summary data for 241 diseases and 175,818 SNP–disease associations were used in this study. When multiple GWASs existed for the same disease trait, we used all of the GWAS data.
The TRESOR signatures were constructed from GWAS summary data using TWAS analyses. TWAS accuracy is reduced when only a few common variants exist between the GWAS summary data and the TWAS training data. Thus, the “summary-gwas-imputation” tool in MetaXcan was used to supplement missing genotypes in the GWAS summary data48. Using the GWAS summary data as input, we estimated z-scores representing associations between regulation of gene expression and diseases for each gene by S-PrediXcan [Eq. (1) in ref. 17] as follows:

$$Z_g \approx \sum_{l \in \mathrm{Model}_g} w_{lg} \, \frac{\hat{\sigma}_l}{\hat{\sigma}_g} \, \frac{\hat{\beta}_l}{\mathrm{se}(\hat{\beta}_l)} \qquad (1)$$

where $w_{lg}$ is the weight of SNP $l$ in predicting the expression of gene $g$, $\hat{\beta}_l$ is the GWAS regression coefficient for SNP $l$, $\mathrm{se}(\hat{\beta}_l)$ is the standard error of $\hat{\beta}_l$, $\hat{\sigma}_l$ is the estimated SD of SNP $l$, and $\hat{\sigma}_g$ is the estimated SD of the predicted expression of gene $g$. We used previously computed values from the PredictDB database49 for $w_{lg}$, $\hat{\sigma}_l$, and $\hat{\sigma}_g$ and beta values from the GWAS summary data for $\hat{\beta}_l$. We constructed the TRESOR signature for each disease as the vector of the $Z_g$ values obtained from the TWAS analysis, with one element per gene. TRESOR signatures were constructed for 49 tissues per disease. We then selected the tissues most closely related to the corresponding disease. These selected tissues are listed in Supplementary Data S8. All 49 tissues are shown in Supplementary Table S3.
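As an illustration of the z-score computation, the following minimal Python sketch evaluates the S-PrediXcan sum for a single gene: each SNP contributes its prediction weight, multiplied by the ratio of the SNP SD to the SD of the predicted expression, multiplied by the GWAS effect size divided by its standard error. The function name, the dictionary-based inputs, the SNP identifiers, and all numeric values are hypothetical. Skipping SNPs that are absent from the GWAS summary data is an assumption of this sketch and not a description of the MetaXcan implementation.

```python
def spredixcan_z(weights, sigma_snp, sigma_gene, beta, se):
    """Approximate gene-level z-score from GWAS summary statistics.

    weights    : {snp: weight of the SNP in the expression model of the gene}
    sigma_snp  : {snp: estimated SD of the SNP}
    sigma_gene : estimated SD of the predicted expression of the gene
    beta, se   : {snp: GWAS effect size} and {snp: its standard error}
    """
    z = 0.0
    for snp, w in weights.items():
        if snp not in beta:
            # Assumption of this sketch: SNPs without GWAS statistics are skipped.
            continue
        z += w * (sigma_snp[snp] / sigma_gene) * (beta[snp] / se[snp])
    return z


# Toy inputs (hypothetical SNP identifiers and values).
weights = {"snp_a": 0.5, "snp_b": -0.2}
sigma_snp = {"snp_a": 0.4, "snp_b": 0.5}
beta = {"snp_a": 0.1, "snp_b": 0.05}
se = {"snp_a": 0.05, "snp_b": 0.025}
z_gene = spredixcan_z(weights, sigma_snp, 0.2, beta, se)
```

Collecting one such value per gene would give a signature vector of the kind described above for a single disease and tissue.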
Gene–disease and variant–disease association data
GDAs and VDAs include various types of associations, such as biomarkers and causal genes, and each type of association can capture a range of different disease characteristics. GDAs and VDAs were downloaded from DisGeNET50 (v7.0) using the DisGeNET REST API, and they consisted of 1,134,942 GDAs and 369,553 VDAs. The GDAs incorporated 15 association types, of which five were used: “Altered Expression,” “Biomarker,” “Causal Mutation,” “Genetic Variation,” and “Posttranslational Modification.” The VDAs consisted of three association types, of which two were used: “Causal Mutation” and “Genetic Variation.” We defined the merged data of all association types as “All.” We used GDAs and VDAs for 349 diseases.
Similarities between diseases were calculated based on GDAs and VDAs to capture the relationships between diseases. For GDAs, the gene sets for diseases 1 and 2 were termed $G_1$ and $G_2$, respectively, and the disease similarity was defined by the Jaccard Index (JI) as follows:

$$\mathrm{JI}(G_1, G_2) = \frac{|G_1 \cap G_2|}{|G_1 \cup G_2|} \qquad (2)$$
For VDAs, was also defined using the variant sets. The diseases analyzed in this study had relatively small JI values, and was used to construct a disease similarity matrix.
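The Jaccard-based disease similarity can be sketched in a few lines of Python. The sketch below computes the raw index (size of the intersection divided by size of the union) for every pair of diseases; the disease names and gene sets are hypothetical, and any rescaling of the index applied before building the final similarity matrix is not reproduced here.

```python
def jaccard(set_a, set_b):
    """Jaccard index of two collections; defined as 0.0 when both are empty."""
    a, b = set(set_a), set(set_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_matrix(associations):
    """Pairwise Jaccard indices for {disease: associated genes or variants}."""
    names = sorted(associations)
    matrix = [[jaccard(associations[x], associations[y]) for y in names]
              for x in names]
    return names, matrix


# Toy gene-disease associations (hypothetical).
gda = {
    "disease_1": {"GENE_A", "GENE_B", "GENE_C"},
    "disease_2": {"GENE_B", "GENE_C", "GENE_D"},
    "disease_3": {"GENE_E"},
}
names, sim = similarity_matrix(gda)
```

The same function applies unchanged to variant sets, so one matrix can be built per association type.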
Direct transcriptomic data measured from disease patients
We constructed disease-specific transcriptome profiles using transcriptomic data directly measured from disease patients. The transcriptome profiles of patients with various diseases were obtained from the Crowd Extracted Expression of Differential Signatures database51, which draws on the characteristic direction method52 to compare gene expression measured in diseased tissue with that measured in control tissue. We averaged multiple patient-specific profiles for the same disease and constructed a disease transcriptome profile for each of the 36 diseases. The resulting signature is referred to as “direct transcriptome signature measured from patients” (DT). The DT signature of each disease was represented by a q-dimensional feature vector, where q is the number of genes.
Therapeutic target data
The therapeutic target information was manually curated from medical monographs53 and the KEGG DISEASE database54. The inhibitory target data consisted of 1,921 target−disease associations involving 408 inhibitory target proteins and 284 diseases. The activatory target data included 274 target−disease associations, involving 80 activatory target proteins and 151 diseases. Supplementary Data S9 and S10 show the diseases that had at least one inhibitory and activatory target protein, respectively.
Inverse signature method for therapeutic target prediction
The inverse signature method uses disease-specific gene-expression signatures (disease signatures) and TGPs for therapeutic target predictions15. The potential inhibitory and activatory target–disease associations were predicted based on inverse correlations between TGP with gene knockdown or gene overexpression and the disease signatures.
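We used TRESOR instead of DT for the disease signatures. The correlation coefficients between the gene knockdown signatures and disease signatures were calculated for each ith inhibitory target and mth disease, and those between the gene overexpression signatures and disease signatures were calculated for each ith activatory target and mth disease. Pearson’s correlation coefficient, $r_{im}$, was calculated as follows:

$$r_{im} = \frac{\sum_{k=1}^{n} (x_{ik} - \bar{x}_i)(y_{mk} - \bar{y}_m)}{\sqrt{\sum_{k=1}^{n} (x_{ik} - \bar{x}_i)^2} \, \sqrt{\sum_{k=1}^{n} (y_{mk} - \bar{y}_m)^2}} \qquad (3)$$

where $x_{ik}$ and $y_{mk}$ are the values of the kth common gene in the TGP of the ith target and in the signature of the mth disease, respectively, $\bar{x}_i$ and $\bar{y}_m$ were described as $\bar{x}_i = \frac{1}{n}\sum_{k=1}^{n} x_{ik}$ and $\bar{y}_m = \frac{1}{n}\sum_{k=1}^{n} y_{mk}$, and $n$ is the number of genes that were common to TGPs and disease signatures. The number of genes used for calculating the inverse correlations between TRESORs and TGPs varies across diseases because only genes common to both TRESORs and TGPs can be used. Due to space limitations, the details on the number of genes for each disease are shown in Supplementary Data S11 and S12.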
Let be the class label for the mth disease assigned to the ith target, where means that the ith target is used for the mth disease, and means that the ith target is not used for the mth disease. We then represent the unsupervised loss component as follows:
| 4 |
where is the number of candidate target proteins.
The inverse correlation was used to exhibit the final predictive score, calculated as . Target–disease pairs that had high inverse correlations were considered as candidate therapeutic targets. We selected the most closely related cell lines and averaged all of the cell lines of TGPs, referred to as “cell-specified operation” and “cell-averaged operation,” respectively. We also selected the most related tissues and averaged all tissues of disease signatures, referred to as “tissue-specified operation” and “tissue-averaged operation,” respectively. These cell lines and tissues are listed in Supplementary Data S8. Additional details can be seen in the Supplementary Methods.
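The inverse signature scoring can be illustrated with a short Python sketch. It restricts a target perturbation signature and a disease signature to their common genes, computes Pearson’s correlation coefficient, and returns its negative so that strongly inverse patterns receive high scores. Taking the negated correlation as the final score is an assumption of this sketch, as are the function names and the toy gene identifiers and values.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def inverse_signature_score(tgp, disease_signature):
    """Negated correlation over genes present in both signatures.

    Both arguments are {gene: value} dictionaries; genes found in only
    one of them are ignored.
    """
    common = sorted(set(tgp) & set(disease_signature))
    r = pearson([tgp[g] for g in common],
                [disease_signature[g] for g in common])
    return -r  # assumption: a high score means a strong inverse correlation


# Toy signatures (hypothetical): the perturbation exactly reverses the disease.
tgp = {"G1": 1.0, "G2": -1.0, "G3": 0.5, "G9": 3.0}
disease = {"G1": -2.0, "G2": 2.0, "G3": -1.0, "G8": 1.0}
score = inverse_signature_score(tgp, disease)
```

Ranking all candidate proteins by this score for one disease yields an unsupervised candidate list that needs no known targets.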
Multitask learning method for the prediction of the therapeutic target
The multitask learning method uses TGPs and disease similarities to predict therapeutic targets15. Different diseases may have common molecular mechanisms, and the same therapeutic targets can be used for multiple diseases. Thus, we approached the problem of therapeutic target prediction in the framework of supervised multiple-label predictions55. Additional details can be found in the Supplementary Methods.
We postulated that diseases that have GDAs or VDAs in common may also share therapeutic targets. We defined disease similarities using nine types of GDAs and VDAs. Different disease similarities are expected to exhibit properties that are different between diseases, leading to the identification of new therapeutic targets that conventional similarity in gene-expression patterns cannot predict.
Suppose diseases and candidate target proteins. We consider how to predict which diseases are treated by a candidate target protein, the ith candidate target protein (). Each candidate target protein is represented by a d-dimensional feature vector as , where is a TGP. We constructed a learning set from therapeutic target–disease pairs of known therapeutic target–disease associations. There are candidates for diseases, and each candidate target protein in the learning set is assigned a binary class label that represents the mth disease (). Let be the class label for the mth disease assigned to the ith candidate target protein, where means that the ith candidate target protein is used for the mth disease, and means that the ith candidate target protein is not used for the mth disease.
We construct the predictive model , where is a d-dimensional weight vector for the mth disease. We estimate all of the weight vectors jointly by minimizing the supervised loss component as follows:
| 5 |
where is a weight matrix defined as We estimate the weight matrix by minimizing an objective function using the gradient descent method. Additional details can be found in the Supplementary Methods.
For the prediction for orphan diseases, the weighted averages of the hyperparameters for diseases with high similarity to the orphan disease were used as the parameters, with disease similarity as the weight.
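The transfer step for orphan diseases can be sketched as a similarity-weighted average. In the sketch below, the per-disease parameter vectors (the text above calls them hyperparameters) of the most similar diseases are averaged with disease similarity as the weight, and the result is used in a linear scoring function. The choice of the top two neighbors, the linear form of the score, and all names and values are assumptions made only for illustration.

```python
def orphan_parameters(similarity, params, top_k=2):
    """Similarity-weighted average of the parameter vectors of similar diseases.

    similarity : {disease: similarity to the orphan disease}
    params     : {disease: list of parameters learned for that disease}
    top_k      : number of most similar diseases to use (assumed value)
    """
    ranked = sorted(similarity, key=similarity.get, reverse=True)[:top_k]
    total = sum(similarity[d] for d in ranked)
    if total == 0:
        raise ValueError("no similar disease with non-zero similarity")
    dim = len(params[ranked[0]])
    return [sum(similarity[d] * params[d][j] for d in ranked) / total
            for j in range(dim)]


def linear_score(w, x):
    """Inner product of a weight vector and a target feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x))


# Toy example (hypothetical diseases, similarities, and parameters).
similarity = {"disease_A": 0.6, "disease_B": 0.2, "disease_C": 0.1}
params = {"disease_A": [1.0, 0.0], "disease_B": [0.0, 1.0], "disease_C": [5.0, 5.0]}
w_orphan = orphan_parameters(similarity, params, top_k=2)
score = linear_score(w_orphan, [2.0, 4.0])
```

Because the weights are normalized by their sum, the transferred vector stays on the same scale as the vectors it is derived from.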
Bayesian integrative method for therapeutic target prediction
The inverse signature and multitask learning methods were integrated by incorporating the advantages of the two methods and compensating for their shortcomings. The inverse signature method, an unsupervised learning approach, is applicable to orphan diseases with no known therapeutic targets, but its accuracy is unstable. The multitask learning method, a supervised learning approach, learns high-accuracy models for diseases with many known therapeutic targets, but it is not applicable to orphan diseases with no known therapeutic targets. Thus, we combined the prediction results for both methods with Bayesian optimization, which could provide more reliable therapeutic targets for rare and orphan diseases. We formulated the therapeutic target prediction problem in the semi-supervised learning label prediction framework.
Suppose that there are diseases, and candidate target proteins. We predict which candidate target proteins would be therapeutic for the mth disease . For this, we evaluate the therapeutic targetability for each candidate target protein adopting the inverse signature and multitask learning methods, resulting in prediction scores , respectively. The loss function has two components. The first component, , penalizes different predictions for the candidate target protein by taking the sum of the squared error between the prediction score and the ideal correlation scores [see Eq. (4)]. The second component is the standard logistic loss [see Eq. (5)]. To combine the supervised and unsupervised loss terms, we scaled both terms by the weights and , as follows:
| 6 |
where and are normalized parameters for the unsupervised and supervised loss functions. We also integrated the normalized prediction scores of the inverse signature and multitask learning methods, and , respectively. The weighted sum of the prediction scores for each candidate target protein can be expressed as , where () is the weight vector, is the vector of prediction scores of the ith protein for the mth disease, and is set to 10. The symbol denotes the inner product. For optimal integration, we estimated the weight vector using Bayesian optimization to maximize the prediction accuracy measure, the BED AUC56.
Bayesian optimization refers to a method that efficiently estimates parameters in a function whose functional form is unknown. Here, we assume that the function g can be drawn from a Gaussian process prior to the observations , where (l = 1, 2,…, K) induces a multivariate Gaussian distribution on , and is the averaged BED AUC values from five-fold cross-validation. The distribution is determined by a mean function and a positive definite covariance function. We can use the prior and the observed values to induce a posterior over function called the acquisition function. We used the Upper Confidence Bound (UCB)57 for the acquisition function. Following the Gaussian process prior, the function can be defined as follows:
| 7 |
where is a hyperparameter in the Gaussian process, is an exploration weight controlling the balance between exploration and exploitation (default: ), is a predictive mean function, and is a predictive variance function. We estimated the , which achieved the highest BED AUC value and obtained the optimized weight vector as follows:
| 8 |
The obtained weight vector was then used to calculate the prediction score of the ith candidate target protein for the mth disease as follows:
| 9 |
Proteins with high prediction scores are candidate targets for treating diseases. For the prediction of orphan diseases, the weight vector of a disease similar to the orphan disease was used because the BED AUC could not be calculated due to the lack of known therapeutic targets.
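The integration step can be sketched with a small, dependency-free Python example. It assumes that the two normalized scores are combined with a single weight w in [0, 1] (w for the inverse signature score and 1 − w for the multitask score), fits a Gaussian process with a squared-exponential kernel to the objective values observed so far, and picks the next weight by maximizing an upper confidence bound of the form mean + kappa × standard deviation. The one-dimensional weight, the kernel, its length scale, the value of kappa, the grid of candidate weights, and the toy objective are all assumptions; in the study, the objective is the cross-validated BED AUC.

```python
import math


def rbf(a, b, length=0.2):
    """Squared-exponential kernel between two scalar inputs."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def gp_posterior(xs, ys, x_new, noise=1e-4):
    """Posterior mean and variance of a zero-mean Gaussian process at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    mean = sum(k * a for k, a in zip(k_star, solve(K, ys)))
    var = rbf(x_new, x_new) - sum(k * v for k, v in zip(k_star, solve(K, k_star)))
    return mean, max(var, 0.0)


def ucb(mean, var, kappa=2.0):
    """Upper confidence bound: exploitation (mean) plus exploration (kappa * SD)."""
    return mean + kappa * math.sqrt(var)


def integrate(w, s_inverse, s_multitask):
    """Weighted sum of the two normalized prediction scores."""
    return w * s_inverse + (1.0 - w) * s_multitask


def optimize_weight(objective, n_iter=10, grid_size=101):
    """Sequentially evaluate the objective where the UCB is highest."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    xs = [0.0, 1.0]
    ys = [objective(x) for x in xs]
    for _ in range(n_iter):
        mu0 = sum(ys) / len(ys)
        centered = [y - mu0 for y in ys]  # center so the zero-mean prior is reasonable
        untried = [x for x in grid if x not in xs]
        x_next = max(untried, key=lambda x: ucb(*gp_posterior(xs, centered, x)))
        xs.append(x_next)
        ys.append(objective(x_next))
    best = max(range(len(xs)), key=lambda i: ys[i])
    return xs[best], ys[best]


def toy_objective(w):
    """Stand-in for a cross-validated accuracy measure (hypothetical)."""
    return 1.0 - (w - 0.3) ** 2


best_w, best_value = optimize_weight(toy_objective, n_iter=5)
final_score = integrate(best_w, 0.8, 0.4)
```

With this exploration weight the early evaluations are driven mainly by the predictive variance, so the search first spreads across the interval before concentrating near promising weights.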
SNP profiling method for therapeutic target prediction
Information on disease-associated SNPs is typically utilized to identify therapeutic targets11,12. The assumption underlying this approach is that diseases are caused by functional changes to the proteins encoded by genes that contain SNPs within their coding regions; thus, these genes are regarded as potential therapeutic targets. We used this SNP-based approach as the baseline method, termed the “SNP profiling method.” We constructed two types of disease-specific SNP profiles using GWAS p-values and eQTLs, referred to as “SNP profile with p-values (SNP-PV)” and “SNP profile with eQTLs (SNP-eQTL),” respectively.
In the SNP-PV15, when a gene had multiple SNPs or was reported by multiple GWASs, we averaged the p-values for the gene. We used values to provide the predictive scores. Genes with SNPs that were strongly associated with a disease were considered to be the candidate therapeutic targets. Because this method depends on the presence or absence of SNPs in gene-coding regions, it cannot be employed to predict whether a therapeutic target is inhibitory or activatory.
In the SNP-eQTL15, when a gene had multiple eQTLs, we summed the eQTL values for that gene. eQTL data were obtained from Genotype-Tissue Expression58 (GTEx; v8). Genes with highly positive eQTL values were considered to be candidate inhibitory targets, and genes with highly negative eQTL values were considered to be candidate activatory targets. We constructed both types of SNP profiles using gold standard data and assigned the value 0 to genes without SNP data.
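The two baseline profiles can be sketched as simple per-gene aggregations. In the sketch below, the SNP-PV profile averages the p-values reported for each gene and the SNP-eQTL profile sums the eQTL values for each gene; genes without SNP data receive 0 in both. Converting the averaged p-value to a score with a negative base-10 logarithm is an assumption of this sketch, and the gene names and values are hypothetical.

```python
import math
from collections import defaultdict


def snp_pv_profile(records, genes):
    """Score genes by -log10 of their mean GWAS p-value (assumed transform)."""
    by_gene = defaultdict(list)
    for gene, p_value in records:
        by_gene[gene].append(p_value)
    profile = {}
    for gene in genes:
        if gene in by_gene:
            mean_p = sum(by_gene[gene]) / len(by_gene[gene])
            profile[gene] = -math.log10(mean_p)
        else:
            profile[gene] = 0.0  # genes without SNP data receive 0
    return profile


def snp_eqtl_profile(records, genes):
    """Sum eQTL values per gene; the sign suggests the direction of targeting."""
    totals = defaultdict(float)
    for gene, effect in records:
        totals[gene] += effect
    return {gene: totals.get(gene, 0.0) for gene in genes}


# Toy inputs (hypothetical).
genes = ["G1", "G2", "G3"]
pv = snp_pv_profile([("G1", 1e-8), ("G1", 1e-8), ("G2", 0.01)], genes)
eqtl = snp_eqtl_profile([("G1", 0.5), ("G1", 0.25), ("G2", -0.4)], genes)
```

In this toy case, G1 would rank first as an inhibitory candidate in the eQTL profile and G2 first as an activatory candidate.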
Performance evaluation procedure
We used three accuracy measures: the area under the receiver operating characteristic (ROC) curve (AUC), the area under the precision–recall (PR) curve (AUPR), and the Boltzmann-enhanced discrimination AUC (BED AUC)56. ROC curves and PR curves for the performance of classifiers over all possible cutoffs were generated by plotting the true positive rates (TPRs) against the false positive rates (FPRs) and precision against recall, respectively. AUC scores range from 0 to 1.0, where 1.0 indicates perfect inference (100% TPR, 0% FPR), and 0.5 represents random inference. AUPR scores range from 0 to 1.0, with 1.0 indicating perfect inference (100% precision, 100% recall). BED AUC scores range from 0 to 1.0, with 1.0 indicating perfect inference in early retrieval performance.
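Two of these measures can be computed directly from ranked predictions, as in the following Python sketch. The AUC is obtained as the fraction of positive and negative pairs in which the positive item has the higher score (ties count as one half), and the AUPR is estimated by average precision, which is one common estimator of the area under the PR curve and not necessarily the one used in the study. The BED AUC is not implemented here. Labels and scores are toy values, and both functions assume at least one positive (and, for the AUC, one negative) label.

```python
def roc_auc(labels, scores):
    """Probability that a random positive is ranked above a random negative."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive item."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Toy predictions (hypothetical): 1 marks a known target, 0 any other protein.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
auc = roc_auc(labels, scores)
aupr = average_precision(labels, scores)
```

In the toy example, three of the four positive and negative pairs are ordered correctly, which gives an AUC of 0.75.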
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Acknowledgements
This work was supported by JSPS KAKENHI (20H05797 and 21H04915, Y.Y.).
Author contributions
S.N. performed experiments, analyzed data, and prepared the manuscript. M.I., S.-I.N., and N.Y.O. provided technical support and wrote the manuscript. Y.Y. contributed to supervision of the study and writing the manuscript. All authors discussed the results and commented on the article at all stages.
Peer review
Peer review information
Nature Communications thanks the anonymous reviewers for their contribution to the peer review of this work. A peer review file is available.
Data availability
Target gene perturbation profiles are available in the Gene Expression Omnibus (GEO) database under accession codes GSE70138 and GSE92742. Genome-wide association studies (GWAS) data (v1.2) are available at the NHGRI-EBI GWAS Catalog database16 [https://www.ebi.ac.uk/gwas/]. SNP linkage disequilibrium from the reference data and gene weights from gene-expression models are available at the PredictDB database49 [https://predictdb.org/]. GDAs and VDAs are available at DisGeNET50 (v7.0) [https://www.disgenet.org/]. Disease patient transcriptomic data are available at the CRowd Extracted Expression of Differential Signatures database51 [https://maayanlab.cloud/CREEDS/]. eQTL data are available at Genotype-Tissue Expression58 (GTEx; v8) [https://gtexportal.org/home/]. The adrenocortical carcinoma, pancreatic adenocarcinoma, and thyroid carcinoma cohorts from The Cancer Genome Atlas (TCGA) are available at the UCSC Xena database [https://xenabrowser.net/datapages/]. Gene expression data and clinical information for the IPF cohort from the Lung Tissue Research Consortium37 are available at Fibromine [http://www.fibromine.com/Fibromine/] and GEO under accession code GSE92592, respectively. The TBI exposure cohort is available at the “Aging, Dementia, and TBI Study” under accession code SCR_014554 [https://aging.brain-map.org/download/index]. The prediction and analysis results from this study are provided in the Supplementary Information and Supplementary Data.
Code availability
The code supporting the current study is available at [https://github.com/YamanishiLab/TRESOR].
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
The online version contains supplementary material available at 10.1038/s41467-025-58464-4.
References
- 1. Santos, R. et al. A comprehensive map of molecular drug targets. Nat. Rev. Drug Discov. 16, 19–34 (2017).
- 2. Arrowsmith, J. & Miller, P. Trial Watch: Phase II and Phase III attrition rates 2011-2012. Nat. Rev. Drug Discov. 12, 569 (2013).
- 3. He, H., Liu, L., Morin, E. E., Liu, M. & Schwendeman, A. Survey of clinical translation of cancer nanomedicines - lessons learned from successes and failures. Acc. Chem. Res. 52, 2673–2683 (2019).
- 4. Plenge, R. M. Disciplined approach to drug discovery and early development. Sci. Transl. Med. 8, 349ps15 (2016).
- 5. Griebel, G. & Holmes, A. 50 years of hurdles and hope in anxiolytic drug discovery. Nat. Rev. Drug Discov. 12, 667–687 (2013).
- 6. Dawkins, H. J. S. et al. Progress in Rare Diseases Research 2010–2016: An IRDiRC Perspective. Clin. Transl. Sci. 11, 11 (2018).
- 7. Luo, Y. et al. A network integration approach for drug-target interaction prediction and computational drug repositioning from heterogeneous information. Nat. Commun. 8, 1–13 (2017).
- 8. Gerdes, H. et al. Drug ranking using machine learning systematically predicts the efficacy of anti-cancer drugs. Nat. Commun. 12, 1–15 (2021).
- 9. Gentile, F. et al. Artificial intelligence–enabled virtual screening of ultra-large chemical libraries with deep docking. Nat. Protoc. 17, 672–697 (2022).
- 10. Zhou, Z., Kearnes, S., Li, L., Zare, R. N. & Riley, P. Optimization of Molecules via Deep Reinforcement Learning. Sci. Rep. 9, 1–10 (2019).
- 11. Sabik, O. L. & Farber, C. R. Using GWAS to identify novel therapeutic targets for osteoporosis. Transl. Res. 181, 15–26 (2017).
- 12. Shastry, B. S. SNPs in disease gene mapping, medicinal drug development and evolution. J. Hum. Genet. 52, 871–880 (2007).
- 13. De Vos, J. et al. Comparison of gene expression profiling between malignant and normal plasma cells with oligonucleotide arrays. Oncogene 21, 6848–6857 (2002).
- 14. Ruiz-Garcia, E. et al. Gene expression profiling identifies Fibronectin 1 and CXCL9 as candidate biomarkers for breast cancer screening. Br. J. Cancer 102, 462–468 (2010).
- 15. Namba, S., Iwata, M. & Yamanishi, Y. From drug repositioning to target repositioning: prediction of therapeutic targets using genetically perturbed transcriptomic signatures. Bioinformatics 38, I68–I76 (2022).
- 16. Buniello, A. et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 47, D1005–D1012 (2019).
- 17. Barbeira, A. N. et al. Exploring the phenotypic consequences of tissue specific gene expression variation inferred from GWAS summary statistics. Nat. Commun. 9, 1825 (2018).
- 18. Wu, P. et al. Integrating gene expression and clinical data to identify drug repurposing candidates for hyperlipidemia and hypertension. Nat. Commun. 13, 1–12 (2022).
- 19. Reed, G. M. et al. Innovations and changes in the ICD-11 classification of mental, behavioural and neurodevelopmental disorders. World Psychiatry 18, 3–19 (2019).
- 20. Yang, M. C. et al. Blockade of autophagy reduces pancreatic cancer stem cell activity and potentiates the tumoricidal effect of gemcitabine. Mol. Cancer 14, 1–17 (2015).
- 21. Zienert, E., Eke, I., Aust, D. & Cordes, N. LIM-only protein FHL2 critically determines survival and radioresistance of pancreatic cancer cells. Cancer Lett. 364, 17–24 (2015).
- 22. Ashburner, M. et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25, 25–29 (2000).
- 23. Kanehisa, M., Goto, S., Kawashima, S. & Nakaya, A. The KEGG databases at GenomeNet. Nucleic Acids Res. 30, 42–46 (2002).
- 24. Heppner, C. et al. The tumor suppressor protein menin interacts with NF-κB proteins and inhibits NF-κB-mediated transactivation. Oncogene 20, 4917–4925 (2001).
- 25. Castellone, M. D. & Melillo, R. M. RET-mediated modulation of tumor microenvironment and immune response in multiple endocrine neoplasia type 2 (MEN2). Endocr. Relat. Cancer 25, T105–T119 (2018).
- 26. Kamezaki, M. et al. Unusual Proliferative Glomerulonephritis in a Patient Diagnosed to Have Hypoparathyroidism, Sensorineural Deafness, and Renal Dysplasia (HDR) Syndrome with a Novel Mutation in the GATA3 Gene. Intern. Med. 56, 1393–1397 (2017).
- 27. Okawa, T. et al. A novel loss-of-function mutation of GATA3 (p.R299Q) in a Japanese family with Hypoparathyroidism, Deafness, and Renal Dysplasia (HDR) syndrome. BMC Endocr. Disord. 15, 66 (2015).
- 28. Trinh, D. L. N., Elwi, A. N. & Kim, S. W. Direct interaction between p53 and Tid1 proteins affects p53 mitochondrial localization and apoptosis. Oncotarget 1, 396–404 (2010).
- 29. Niu, G. et al. Tid1, the Mammalian Homologue of Drosophila Tumor Suppressor Tid56, Mediates Macroautophagy by Interacting with Beclin1-containing Autophagy Protein Complex. J. Biol. Chem. 290, 18102–18110 (2015).
- 30. Weinstein, J. N. et al. The Cancer Genome Atlas Pan-Cancer Analysis Project. Nat. Genet. 45, 1113 (2013).
- 31. Khalil, N. et al. Phase 2 clinical trial of PBI-4050 in patients with idiopathic pulmonary fibrosis. Eur. Respir. J. 53, 1800663 (2019).
- 32. Hohmann, M. S., Habiel, D. M., Coelho, A. L., Verri, W. A. & Hogaboam, C. M. Quercetin Enhances Ligand-induced Apoptosis in Senescent Idiopathic Pulmonary Fibrosis Fibroblasts and Reduces Lung Fibrosis In Vivo. Am. J. Respir. Cell Mol. Biol. 60, 28–40 (2019).
- 33. Yu, C. et al. Orai3 mediates Orai channel remodelling to activate fibroblast in pulmonary fibrosis. J. Cell. Mol. Med. 26, 4974 (2022).
- 34. Feng, Z. et al. Exosomal STIMATE derived from type II alveolar epithelial cells controls metabolic reprogramming of tissue-resident alveolar macrophages. Theranostics 13, 991–1009 (2023).
- 35. Hill, C., Jones, M. G., Davies, D. E. & Wang, Y. Epithelial-mesenchymal transition contributes to pulmonary fibrosis via aberrant epithelial/fibroblastic cross-talk. J. Lung Health Dis. 3, 31 (2019).
- 36. Gokey, J. J. et al. Pretreatment of aged mice with retinoic acid supports alveolar regeneration via upregulation of reciprocal PDGFA signalling. Thorax 76, 456–467 (2021).
- 37. Schafer, M. J. et al. Cellular senescence mediates fibrotic pulmonary disease. Nat. Commun. 8, 1–11 (2017).
- 38. Zhang, Y. & Kaminski, N. Biomarkers in idiopathic pulmonary fibrosis. Curr. Opin. Pulm. Med. 18, 441 (2012).
- 39. Jordan, K. L., Koss, D. J., Outeiro, T. F. & Giorgini, F. Therapeutic Targeting of Rab GTPases: Relevance for Alzheimer’s Disease. Biomedicines 10, 1141 (2022).
- 40. Sinsky, J., Pichlerova, K. & Hanes, J. Tau Protein Interaction Partners and Their Roles in Alzheimer’s Disease and Other Tauopathies. Int. J. Mol. Sci. 22, 9207 (2021).
- 41. Miller, J. A. et al. Neuropathological and transcriptomic characteristics of the aged brain. Elife 6, e31126 (2017).
- 42. Tomlinson, B., Patil, N. G., Fok, M. & Kei Lam, C. W. Role of PCSK9 Inhibitors in Patients with Familial Hypercholesterolemia. Endocrinol. Metab. 36, 279 (2021).
- 43. King, E. A., Wade Davis, J. & Degner, J. F. Are drug targets with genetic support twice as likely to be approved? Revised estimates of the impact of genetic support for drug mechanisms on the probability of drug approval. PLOS Genet. 15, e1008489 (2019).
- 44. Han, Y., Klinger, K., Rajpal, D. K., Zhu, C. & Teeple, E. Empowering the discovery of novel target-disease associations via machine learning approaches in the open targets platform. BMC Bioinforma. 23, 1–19 (2022).
- 45. Subramanian, A. et al. A Next Generation Connectivity Map: L1000 platform and the first 1,000,000 profiles. Cell 171, 1437–1452.e17 (2017).
- 46. Iwata, M. et al. Predicting drug-induced transcriptome responses of a wide range of human cell lines by a novel tensor-train decomposition algorithm. Bioinformatics 35, i191–i199 (2019).
- 47. Bodenreider, O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res. 32, D267 (2004).
- 48. Barbeira, A. N., Bonazzola, R., Gamazon, E. R. & Liang, Y. Exploiting the GTEx resources to decipher the mechanisms at GWAS loci. Genome Biol. 22, 49 (2021).
- 49. Wheeler, H. E. et al. Survey of the Heritability and Sparse Architecture of Gene Expression Traits across Human Tissues. PLOS Genet. 12, e1006423 (2016).
- 50. Piñero, J. et al. DisGeNET: a comprehensive platform integrating information on human disease-associated genes and variants. Nucleic Acids Res. 45, D833–D839 (2017).
- 51. Wang, Z. et al. Extraction and analysis of signatures from the Gene Expression Omnibus by the crowd. Nat. Commun. 7, 12846 (2016).
- 52. Clark, N. R. et al. The characteristic direction: A geometrical approach to identify differentially expressed genes. BMC Bioinformatics 15, 79 (2014).
- 53. Papadakis, M. A., McPhee, S. J. & Rabow, M. W. Current Medical Diagnosis and Treatment 2014. (McGraw Hill Medical, 2014).
- 54. Kanehisa, M., Goto, S., Furumichi, M., Tanabe, M. & Hirakawa, M. KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res. 38, D355–D360 (2009).
- 55. Bickel, S., Bogojeska, J., Lengauer, T. & Scheffer, T. Multi-task learning for HIV therapy screening. in Proceedings of the 25th International Conference on Machine Learning 56–63 (Association for Computing Machinery (ACM), 2008). 10.1145/1390156.1390164.
- 56. Truchon, J. F. & Bayly, C. I. Evaluating virtual screening methods: Good and bad metrics for the ‘early recognition’ problem. J. Chem. Inf. Model. 47, 488–508 (2007).
- 57. Auer, P. Using confidence bounds for exploitation-exploration trade-offs. J. Mach. Learn. Res. 3, 397–422 (2003).
- 58. Lonsdale, J. et al. The Genotype-Tissue Expression (GTEx) project. Nat. Genet. 45, 580–585 (2013).





