Abstract
Protein phosphorylation plays a crucial role in regulating diverse biological processes. Perturbations in protein phosphorylation are closely associated with downstream pathway dysfunctions, whereas alterations in protein expression could serve as sensitive indicators of pathological status. However, there are currently few methods that can accurately identify the regulatory links between protein phosphorylation and expression, given issues like reverse causation and confounders. Here, we present Phoslink, a causal inference model to infer causal effects between protein phosphorylation and expression, integrating prior evidence and multiomics data. We demonstrated the feasibility and advantages of our method under various simulation scenarios. Phoslink exhibited more robust estimates and lower false discovery rate than commonly used Pearson and Spearman correlations, with better performance than canonical instrumental variable selection methods for Mendelian randomization. Applying this approach, we identified 345 causal links involving 109 phosphosites and 310 proteins in 79 lung adenocarcinoma (LUAD) samples. Based on these links, we constructed a causal regulatory network and identified 26 key regulatory phosphosites as regulators strongly associated with LUAD. Notably, 16 of these regulators were exclusively identified through phosphosite–protein causal regulatory relationships, highlighting the significance of causal inference. We explored potentially druggable phosphoproteins and provided critical clues for drug repurposing in LUAD. We also identified significant mediation between protein phosphorylation and LUAD through protein expression. In summary, our study introduces a new approach for causal inference in phosphoproteomics studies. Phoslink demonstrates its utility in potential drug target identification, thereby accelerating the clinical translation of cancer proteomics and phosphoproteomic data.
Keywords: phosphoproteomics, multiomics, causal inference, network, cancer proteomics
Highlights
• Phoslink is a computational approach for causal inference in phosphoproteomics.
• Phoslink outperforms canonical Mendelian randomization and correlation methods.
• Phoslink uncovers novel regulatory phosphosites and their network.
• Phoslink is available as an R package.
In Brief
Despite mass spectrometry detecting numerous phosphosites, over 95% remain uncharacterized. Our study introduces Phoslink, which uses protein expression as a sensitive indicator of cellular states, identifying causally regulated phosphosites that affect protein expression with greater precision and robustness. Applied to CNHPP-LUAD, Phoslink uncovers novel regulatory phosphosites with critical implications for LUAD and drug repurposing. Phoslink is a powerful tool for causal inference in phosphoproteomics, offering new opportunities to advance our understanding of phosphosites' functions and accelerate clinical translation.
Phosphorylation is the most common post-translational modification, with estimates suggesting that over two-thirds of human proteins may be phosphorylated (1). This reversible modification acts as a dynamic molecular switch that modulates diverse protein functions by inducing conformational changes, which in turn activate or inactivate downstream signal transduction cascades. Protein phosphorylation is involved in regulating nearly all cellular processes, and aberrant phosphorylation is intimately associated with multiple pathological conditions, particularly cancer (2). For example, Rb phosphorylation drives colon cancer proliferation and could serve as a promising therapeutic target (3).
Increasingly sensitive mass spectrometry (MS)–based technologies have enabled the detection of phosphosites at a large scale within a single cohort dataset, as seen in studies such as the Chinese Human Proteome Project (CNHPP) and Clinical Proteomic Tumor Analysis Consortium (CPTAC). However, the functions and potential disease risks of a majority of these phosphosites remain unknown, as less than 5% have been annotated with specific functions (4). An imbalance in protein phosphorylation can disrupt downstream biological pathways, and alterations in protein expression could serve as an indicator for the status of various cellular processes (5), such as cancerous transformation of cells. Therefore, it is critical to identify and understand the regulatory relationships between phosphorylation and protein expression in cancer research.
In cancer multiomics research, the relationships among biomolecules are typically characterized through correlation analyses such as Pearson and Spearman correlation (6, 7). In our previous work, we characterized the large-scale proteomic landscape of lung adenocarcinoma (LUAD) in Chinese patients (CNHPP-LUAD cohort), which deepened the understanding of molecular characteristics and provided clues for novel diagnostic and therapeutic strategies. Nevertheless, the regulation of phosphorylation remains unresolved. Similar to previous studies in tumor proteomics, we established links between phosphorylation and protein expression based on Spearman correlation analysis (8). However, an observational correlation between phosphorylation level and protein expression does not necessarily imply a causal relationship, and reliance on noncausal associations bears major responsibility for the failure of certain kinase-targeted inhibitors to achieve desired outcomes in clinical trials (9). Several existing methods investigate causal relationships among molecules, such as CausalPath (10) and CARNIVAL (11), which infer causal associations based on known pathways and changes in gene or protein expression profiles. Because they depend on prior pathway knowledge, these methods cannot infer novel connections that have not been reported in existing pathways. In addition, establishing causality within observational data is challenging because of issues like reverse causation and confounders. Mendelian randomization (MR) has emerged as a promising approach to obtaining valid causal inferences from observational data. MR is grounded in Mendel's laws of inheritance and utilizes germline genetic instrumental variables (IVs), usually SNPs, to assess causal effects in observational datasets. These SNPs are randomly assorted during gamete formation and conception, providing a naturally randomized comparison (Supplemental Fig. S1). Their fixed nature at conception prevents modification by subsequent factors (12, 13).
Consequently, MR is robust against biases like reverse causation and confounders in observational studies, enabling valid causal inference from such data. While most MR investigations utilize summarized data from consortia to leverage their large sample sizes, such as genome-wide association studies (GWAS) (14), the application of MR in cancer research is limited by the relatively smaller sample sizes of multiomics cancer cohorts compared with GWAS (15). This limitation poses a challenge in identifying reliable IVs. Consequently, the development of an alternative IV selection method tailored for multiomics cancer studies is critically needed.
In this study, we present a causal inference model called “Phoslink” for inferring causal effects between protein phosphorylation and downstream protein expression. To select reliable IVs for MR in the small-sample cancer proteomics dataset, our approach incorporates the prior evidence from GWAS and PhosSNP (16). Extensive simulations with different causal effect sizes, sample sizes, and heterogeneity of data were conducted to evaluate the feasibility and performance of Phoslink. Phoslink was demonstrably superior in controlling the risk of false positives. Moreover, Phoslink could yield more robust estimates and lower false discovery rates (FDRs) than commonly used correlation-based methods. Phoslink could serve as an effective tool for causal inference of key phosphosites in human proteomics studies and facilitate drug target discovery.
Experimental Procedures
Simulation Study Design
To assess the performance and reliability of Phoslink, we conducted simulation studies using a series of scenarios with different parameter settings. For each individual $i$, let $X_i$ denote the exposure (representing phosphorylation of a phosphosite in the real data analysis), $Y_i$ denote the outcome (representing protein expression in the real data analysis), and $U_i$ represent a potential confounder of the exposure–outcome relationship. Let $G_{ij}$ be the genotype of SNP $j$. Following most MR data-generating models (17, 18, 19), the simulations were conducted based on the following model:
$$U_i = \sum_j \phi_j G_{ij} + \varepsilon_{U,i}$$
$$X_i = \sum_j \gamma_j G_{ij} + \beta_{UX} U_i + \varepsilon_{X,i}$$
$$Y_i = \sum_j \alpha_j G_{ij} + \theta X_i + \beta_{UY} U_i + \varepsilon_{Y,i}$$
$$G_{ij} \sim \mathrm{Binomial}(2, 0.3) \text{ independently}$$
$$\varepsilon_{U,i}, \varepsilon_{X,i}, \varepsilon_{Y,i} \sim N(0, 1) \text{ independently}$$
where $\phi_j$ represents the effect of the genetic variant on the confounder U, $\gamma_j$ represents the genetic effect of $G_j$ on X, $\alpha_j$ represents the direct effect of the genetic variant on the outcome Y, and θ is the causal effect of X on Y. The effects of the confounder U on X and Y are denoted by $\beta_{UX}$ and $\beta_{UY}$, respectively. The genetic effects of SNPs on the exposure were drawn from a uniform distribution U(0.08, 0.10). We set a total of 50 SNPs, among which 30 were designated as IVs. We randomly selected 0%, 30%, and 50% of the IVs to be invalid IVs. SNPs with direct effects on the outcome, which violate the exclusion restriction assumption, are considered invalid IVs. For invalid IVs, the parameter $\alpha_j$ was generated from a normal distribution N(0, 0.15) and $\phi_j$ = 0; for valid IVs, $\alpha_j$ = $\phi_j$ = 0. We fixed $\beta_{UX}$ and $\beta_{UY}$ at 0.75.
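The data-generating process described above can be sketched as follows. This is a minimal illustration in Python, not the authors' R implementation. The function name `simulate_cohort` and its arguments are hypothetical. Two points are assumptions rather than statements from the text: that the 20 non-IV SNPs have no effect on X, and that 0.15 in N(0, 0.15) is a standard deviation.

```python
import random

def simulate_cohort(n, theta, frac_invalid=0.0, n_snps=50, n_ivs=30,
                    beta_ux=0.75, beta_uy=0.75, seed=0):
    """Simulate genotypes G, exposure X, and outcome Y for n individuals."""
    rng = random.Random(seed)
    # SNP-exposure effects drawn from U(0.08, 0.10) for the designated IVs.
    # Assumption: the remaining SNPs have no effect on X.
    gamma = [rng.uniform(0.08, 0.10) if j < n_ivs else 0.0
             for j in range(n_snps)]
    # Direct SNP-outcome effects are nonzero only for invalid IVs.
    # Assumption: 0.15 is treated as the standard deviation.
    alpha = [0.0] * n_snps
    for j in rng.sample(range(n_ivs), round(frac_invalid * n_ivs)):
        alpha[j] = rng.gauss(0.0, 0.15)

    G, X, Y = [], [], []
    for _ in range(n):
        # Genotype = number of minor alleles, Binomial(2, 0.3).
        g = [(rng.random() < 0.3) + (rng.random() < 0.3) for _ in range(n_snps)]
        # Confounder: the SNP effect on U is zero in every setting described,
        # so U reduces to standard normal noise.
        u = rng.gauss(0.0, 1.0)
        x = sum(a * b for a, b in zip(gamma, g)) + beta_ux * u + rng.gauss(0.0, 1.0)
        y = (sum(a * b for a, b in zip(alpha, g)) + theta * x
             + beta_uy * u + rng.gauss(0.0, 1.0))
        G.append(g)
        X.append(x)
        Y.append(y)
    return G, X, Y, alpha
```

For example, `simulate_cohort(100, theta=0.12, frac_invalid=0.3)` would produce one small-sample dataset with 9 of the 30 IVs made invalid.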
The simulation data comprised two parts. In the first part, we simulated a GWAS dataset by generating exposure X and SNP data for 100,000 samples. The second part involved simulating a multiomics cancer dataset (X′, SNP', and Y′). Considering population heterogeneity, which refers to the variability between the simulated GWAS and small-sample datasets, we adopted three approaches to simulate the multiomics cancer dataset:
1. Homogeneous multiomics cancer dataset: We randomly extracted a specific sample size from the 100,000 samples to obtain the data as X′, SNP'. Subsequently, Y′ was generated based on X′, SNP', and a predefined θ.
2. Low-heterogeneity multiomics cancer dataset: We maintained the same parameter settings as the simulation of the GWAS dataset but regenerated the multiomics cancer dataset with a certain sample size.
3. High-heterogeneity multiomics cancer dataset: We altered the association (γ) between SNP and X from a uniform distribution ranging from 0.08 to 0.10 to a truncated normal distribution (truncation at 0.08 and 0.10). The dataset was then regenerated with a certain sample size. This change increased variability in the strength of associations, thereby enhancing the population heterogeneity between the small-sample multiomics cancer dataset and GWAS.
Considering the sample sizes commonly used in multiomics cancer research, we used simulation sample sizes of 50, 100, 200, and 300. The true causal coefficient (θ) between the exposure X and the outcome Y was set under two scenarios: (1) when there was a causal relationship between X and Y, θ was set to 0.12 or 0.60, and (2) when there was no causal relationship between X and Y, θ was set to 0.
After generating individual-level data, we performed univariate linear regression analyses in which each SNP served as the independent variable and the exposure or outcome as the dependent variable. This approach allowed us to derive summarized data, including regression coefficients and corresponding standard errors for the SNP–exposure and SNP–outcome associations. SNPs that exhibited a significant association with the exposure, as supported by both the multiomics cancer simulation dataset (p < 0.05) and external GWAS data (Bonferroni-adjusted p < 0.05, consistent with GWAS findings), were selected as IVs. Leveraging the summary statistics derived from the multiomics cancer simulation dataset, we then implemented the inverse variance weighted (IVW) method to assess the causal effect of X on Y. The simulation study was conducted under a one-sample MR setting, and each setting was replicated 2000 times. All simulations were performed in R, and the code can be found at https://github.com/Li-Lab-SJTU/Phoslink/tree/main/Simulations.
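The summary-statistic pipeline in this paragraph (per-SNP regression, dual-criterion IV selection, IVW estimation) can be sketched as below. This is an illustrative Python sketch, not the repository's R code. All function names are hypothetical. The p value uses a normal approximation to the t distribution, which is an assumption made to keep the example dependency-free. The IVW estimator shown is the standard fixed-effect form with first-order weights.

```python
import math
from statistics import NormalDist

def slope_and_se(g, v):
    """Univariate least-squares slope of v on genotype g, with its standard error.
    Assumes g is not constant (a monomorphic SNP would divide by zero)."""
    n = len(g)
    mg, mv = sum(g) / n, sum(v) / n
    sxx = sum((a - mg) ** 2 for a in g)
    sxy = sum((a - mg) * (b - mv) for a, b in zip(g, v))
    beta = sxy / sxx
    rss = sum((b - mv - beta * (a - mg)) ** 2 for a, b in zip(g, v))
    return beta, math.sqrt(rss / (n - 2) / sxx)

def two_sided_p(beta, se):
    """Two-sided p value (normal approximation, an assumption of this sketch)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(beta / se)))

def select_ivs(p_internal, p_gwas, alpha=0.05):
    """Keep SNPs significant internally (p < alpha) and in the external GWAS
    after Bonferroni adjustment over all candidate SNPs."""
    m = len(p_gwas)
    return [j for j, (pi, pg) in enumerate(zip(p_internal, p_gwas))
            if pi < alpha and min(1.0, pg * m) < alpha]

def ivw(bx, by, se_y):
    """Fixed-effect inverse variance weighted estimate and its standard error."""
    num = sum(x * y / (s * s) for x, y, s in zip(bx, by, se_y))
    den = sum(x * x / (s * s) for x, s in zip(bx, se_y))
    return num / den, math.sqrt(1.0 / den)
```

In use, `slope_and_se` would be applied once per SNP to the exposure and once to the outcome, and the selected SNPs' coefficients passed to `ivw`.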
Proteomic and Phosphoproteomic Data From CNHPP-LUAD
A total of 79 cases of LUAD tumors and their paired noncancerous adjacent tissues (NATs) from treatment-naive Chinese patients were included in this study. The collection of proteomic and phosphoproteomic data followed the MS-based label-free quantification strategy of CNHPP (20). LC–MS/MS-generated MS raw files were searched against the UniProt human proteome database via MaxQuant software (version 1.6.5.0; The Max Planck Institute of Biochemistry) equipped with the Andromeda search engine. Intensity-based absolute quantification was utilized for quantifying both proteins and phosphosites. Intensity-based absolute quantification values calculated by MaxQuant were quantile normalized and log2 transformed if necessary (21). Detailed data preprocessing steps were outlined in a previous publication by our research group (8).
Identification and Annotation of Germline SNPs
The detection of germline SNPs from the WES clean data of 79 NATs followed the GATK Best Practice Guidelines (version 4.2.0.0) (https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant-discovery-SNPs-Indels). Initially, the clean reads were aligned to the human genome reference (GRCh38 assembly) using BWA MEM (version 0.7.17-r1188) with default parameters. The resulting alignment files were then sorted and indexed using Samtools (version 1.12), and PCR duplicates were marked using GATK MarkDuplicates. To ensure data quality, base quality recalibration was performed using the GATK BaseRecalibrator and ApplyBQSR modules. Variants were called from the recalibrated bam files using the GATK HaplotypeCaller module, and the resulting genomic variant call format (GVCF) files from the 79 sequenced individuals were combined using CombineGVCFs. Genotyping of the combined file was accomplished using GenotypeGVCFs. High-quality variant calls were retained through the VariantRecalibrator and ApplyRecalibration steps. During the VariantRecalibrator process in SNP mode, the model was trained using four standard SNP sets: HapMap3.3 SNPs, dbSNP build 146 SNPs, 1000 Genomes Project SNPs from Omni 2.5 chip, and 1000 G phase1 high-confidence SNPs. Variants were filtered using a sensitivity setting of 99.0% (--ts_filter_level 0.99) for SNPs. Subsequently, additional filtering was applied to the variants using vcftools (version 0.1.13), including criteria biallelic SNPs, autosomal SNPs, genotype calling rate >90%, and minor allele frequency >0.01 (22). Variants located in sex chromosomes were routinely excluded because of elevated noise in X-linked data (23) and low marker coverage in Y-specific regions (24). To enhance the accuracy of primary genotypes obtained using GATK, haplotype phasing and imputation were performed using Beagle (version 5.3; Brian Browning) (25) to impute sporadically missing sites. 
Finally, variants were annotated with gene context information using SnpEff (version 4.3; Pablo Cingolani) (26), and genotypes were transformed into 0 (homozygous wildtype), 1 (heterozygous), and 2 (homozygous mutant).
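The 0/1/2 genotype coding and the call-rate and minor allele frequency filters described above can be illustrated with a short Python sketch. It is not the vcftools implementation; the helper names `encode_genotype` and `passes_filters` are hypothetical, and biallelic diploid genotypes in VCF `GT` notation are assumed.

```python
def encode_genotype(gt):
    """Map a biallelic VCF GT string ('0/1', '1|1', ...) to the count of
    alternate alleles (0, 1, or 2); return None for missing or other calls."""
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) != 2 or any(a not in ("0", "1") for a in alleles):
        return None
    return int(alleles[0]) + int(alleles[1])

def passes_filters(dosages, min_call_rate=0.90, min_maf=0.01):
    """Apply the genotype calling rate (>90%) and minor allele frequency
    (>0.01) thresholds to one SNP's dosages across samples."""
    called = [d for d in dosages if d is not None]
    if not called:
        return False
    call_rate = len(called) / len(dosages)
    alt_freq = sum(called) / (2 * len(called))
    maf = min(alt_freq, 1.0 - alt_freq)
    return call_rate > min_call_rate and maf > min_maf
```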
Phoslink Screening for Causal Associations Between Phosphorylation and Protein Abundance
Phosphorylation can be influenced by DNA variants, either local or distant to their corresponding genes, known as cis- and trans-SNPs, respectively. Cis-SNPs are often considered ideal IVs for MR because of their larger effect sizes and highly plausible biological relationships with phosphorylation levels, as cis-acting DNA variants influence signaling regulation via protein phosphorylation, which in turn affects downstream changes like protein translation (27, 28). The IVs employed in the analyses were cis-SNPs (up to 1 Mb upstream or downstream), which minimized the risk of bias from horizontal pleiotropy. We then further filtered the IVs using the prior phosphorylation-related SNPs from the PhosSNP 1.0 database (16). Using our CNHPP-LUAD proteomics dataset, we examined the associations between these cis-acting phosphorylation-related SNPs and phosphosites using a linear regression model adjusted for age, sex, and smoking status. SNPs that demonstrated a significant association with the phosphosite (p < 0.05) were ultimately selected as IVs. When SNPs were in linkage disequilibrium, a reference panel from the 1000 Genomes Project (29) was used to remove linked SNPs at r² > 0.2, retaining the most significant SNP (with the smallest p value) and ensuring independence between SNPs (30, 31, 32). Both the proteomic and phosphoproteomic datasets were logarithmically transformed before linear regression analysis (21). The beta coefficient (β) and its standard error (ϵ) from the linear regression were later used as inputs to MR. Using the summary statistics derived from the multiomics cancer dataset, we applied the IVW method in the MendelianRandomization R package (33) to evaluate the causal effect of phosphorylation on protein expression.
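The linear regression model took the following form:
$$XY = \beta_0 + \beta G + \beta_1 \cdot \mathrm{age} + \beta_2 \cdot \mathrm{sex} + \beta_3 \cdot \mathrm{smoking} + \varepsilon$$
where XY denotes either the exposure X (phosphorylation) or the outcome Y (protein expression), β0 represents the intercept, β is the regression coefficient for the genotype G, β1, β2, and β3 are the regression coefficients for the covariates age, sex, and smoking status, respectively, and ε is the error term.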
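The linkage disequilibrium pruning step described above (keep the SNP with the smallest p value, drop SNPs linked to it at r² > 0.2) can be sketched as a greedy procedure. This Python sketch is illustrative only and is not the tool the authors used. The names `r_squared` and `clump` are hypothetical, and `ref_genotypes` stands in for dosages taken from a reference panel.

```python
def r_squared(a, b):
    """Squared Pearson correlation between two genotype dosage vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    if saa == 0 or sbb == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return sab * sab / (saa * sbb)

def clump(pvals, ref_genotypes, r2_max=0.2):
    """Visit SNPs from most to least significant; keep a SNP only if its
    r^2 with every SNP already kept does not exceed r2_max."""
    kept = []
    for snp in sorted(pvals, key=pvals.get):
        if all(r_squared(ref_genotypes[snp], ref_genotypes[k]) <= r2_max
               for k in kept):
            kept.append(snp)
    return kept
```

Greedy ordering by p value guarantees that, within each linked group, the retained SNP is the most significant one.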
To ensure statistical power during MR analysis, we first removed phosphosites and proteins with a missing value frequency of more than 90% and excluded SNPs with variation rates (proportion of nonhomozygous wildtype genotypes) below 10%. To reduce the risk of false positives, only phosphosite–protein pairs that showed a statistically significant correlation were retained (p < 0.05, Pearson’s correlation test).
Association Between Biomolecule and Clinical Features
The Wilcoxon signed-rank test was utilized to examine whether biomolecules (phosphosites or proteins) were differentially expressed between the tumors and matched NATs. To identify biomolecules exhibiting differential expression across different stages, the Kruskal–Wallis test was performed. To explore biomolecules associated with overall survival (OS) or disease-free survival, Kaplan–Meier survival analysis was conducted, along with the log-rank test and Cox proportional hazards regression. Before the log-rank test, the optimal cutpoint for stratifying samples was determined by maximally selected rank statistics (maxstat) using the survminer R package (version 0.4.9). To account for multiple comparisons, p values were adjusted using the Benjamini–Hochberg (BH) method. Both the proteomic and phosphoproteomic datasets underwent logarithmic transformation before analysis.
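For readers unfamiliar with the BH adjustment mentioned above, the following Python sketch shows the standard step-up procedure. It is illustrative rather than the authors' code, and the function name `bh_adjust` is hypothetical.

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A biomolecule would then be called significant when its adjusted p value falls below the chosen threshold.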
Phosphoregulatory Network Analysis
Phoslink conducted a screening for causal associations between phosphosites (exposure) and proteins (outcome) using the PhosSNP and CNHPP-LUAD dataset. Subsequently, a phosphoregulatory network was constructed based on these causal phosphosite–protein links. The functionality of phosphoregulators was assessed by quantifying the similarity between their downstream proteins and established cancer hallmarks (34) through GOSemSim (version 2.26.0) (35). Pairwise semantic similarities were computed using the graph-based similarity measure algorithm proposed by Wang et al. (36), which determines the semantic similarity of two GO (Gene Ontology) terms based on their locations in the GO graph and their relations with ancestor terms. This method encodes the semantics (biological meanings) of GO terms into a numeric value by aggregating the contributions of their ancestor terms. Semantic comparisons of GO annotations thus provide a quantitative measure of functional similarity between gene groups. Each phosphoregulator was annotated to the hallmark exhibiting the highest semantic similarity score, as computed with the clusterSim function.
Druggability Assessment
We first collected information on drug targets from Drug–Gene Interaction Database 4.0 (37), DrugBank (38), and Therapeutic Target Database (39) to explore their potential as known drug targets for drug repurposing in LUAD treatment. To assess the druggability of phosphoregulators, we investigated their presence within functional domains annotated in the InterPro database (40), utilizing InterProScan with default settings to search protein sequences for potential matches against InterPro protein signature databases. In addition, given the pivotal role of kinases as promising therapeutic targets in cancer (3), we gathered experimentally validated human kinase–substrate relationships from the Phospho.ELM (41), PhosphoSitePlus (42), and PhosphoNetworks (43) databases.
Estimation of Causal Mediation Effects
We conducted mediation analyses to assess the proportional contribution of regulatory phosphosites to LUAD clinical features using the R mediation package (version 4.5.0) (44). The analysis consisted of two steps. In the first step, we developed two statistical models, a mediator model for the conditional distribution of the mediator (protein) given the treatment (phosphosite) and an outcome model for the conditional distribution of the LUAD clinical phenotype (survival data). We fitted these models separately and used the resulting fitted objects as inputs for the mediate function, which calculated the effect mediated by the protein, effect unmediated by the protein, and total effect. We fitted the mediator and outcome models to the observed data using the lm and survreg functions from the R stats and survival packages, respectively. These models included age, sex, and smoking status as covariates. The mediation analysis model was executed using 1000 simulations with a quasi-Bayesian approach to estimate confidence intervals.
Results
An Overview of Phoslink
We present a novel approach, named Phoslink, to identify phosphoregulatory links through causal inference. Because of the dynamic nature of protein expression and phosphorylation status, as well as the notable heterogeneity across different populations and cellular states, one-sample MR is more appropriate for inferring causal relationships between protein expression and phosphorylation. Extensive proteogenomic data in CPTAC and CNHPP provide opportunities for Phoslink to infer the causal effects of phosphorylation on protein expression in a one-sample MR setting. This approach could provide a more reliable and robust causal inference, considering the changes in protein expression and phosphorylation and the genetic variation among individuals within the same population. Because existing IV selection methods are primarily used for studies with large sample sizes, their criteria are effectively more stringent in the context of small-sample multiomics cancer data. It is important to note that when sample sizes are small, substantial effects may yield seemingly insignificant p values (45, 46). Such stringent thresholds may result in only a small subset of X being eligible as IVs for causal estimation analysis. Phoslink adopts a more relaxed threshold based on the small-sample phosphoproteomics dataset and incorporates external prior evidence as auxiliary screening: (1) internal data support, indicating that the SNP–exposure relationship is statistically significant in the small-sample dataset (p < 0.05 from a linear regression model adjusted for age, sex, and smoking status), and (2) external prior evidence, where prior studies provide support for the SNP–exposure relationship (Fig. 1). The PhosSNP 1.0 database (16) serves as the source of phosphorylation-related SNPs for external prior evidence in IV selection (47).
SNPs that meet both criteria are selected as IVs, thus accounting for the heterogeneity of cancer data and effectively reducing false-positive rates. Next, we estimated the associations of each IV with the exposure and outcome using beta-coefficients and their standard errors from linear regression analyses. For the primary MR analysis, we applied the IVW regression method in the MendelianRandomization R package (version 0.6.0) (33). We used the IVW method, a two-sample MR method, because it offers robust causal estimates by combining the ratio estimates of each variant in a fixed-effects meta-analysis model (48) and is regarded as a safe option for one-sample MR analyses (49, 50). Phoslink is freely available at https://github.com/Li-Lab-SJTU/Phoslink.
Fig. 1.
The analytic framework of the study. To characterize the functionality of identified phosphosites in multiomics cancer research, we introduce a causal inference model termed “Phoslink,” which integrates prior evidence and a multiomics cancer dataset to infer causal regulatory links between protein phosphorylation and expression. Applied to the CNHPP-LUAD dataset, Phoslink uncovered regulatory links and identified key phosphosites, including MAP4 pS941, located within a therapeutic target domain in lung carcinoma. A detailed mediation analysis further assesses the impact of phospho-based regulation on survival, distinguishing between effects mediated by proteins and those that are not. CNHPP, Chinese Human Proteome Project; LUAD, lung adenocarcinoma.
Feasibility and Performance of Phoslink
To assess the reliability of Phoslink, we compared it against three IV selection strategies that are frequently used in MR analyses of studies with large sample sizes: (1) FDR_MR, which selects all SNPs significantly associated with X (BH-adjusted p < 0.05) based on the simulated small-sample multiomics cancer dataset; (2) Min_MR, which selects the SNP with the minimum p value among the SNPs significantly associated with X (BH-adjusted p < 0.05) based on the simulated small-sample multiomics cancer dataset; and (3) GWAS_MR, which selects SNPs significantly associated with X as supported by the simulated external GWAS study (Bonferroni-adjusted p < 0.05). We compared the power (with a true causal effect, θ = 0.60) and FDR (assuming a null causal effect, θ = 0) across the four methods. In simulations of a homogeneous multiomics cancer dataset, Phoslink displayed a lower FDR while maintaining competitive power compared with the other methods (Fig. 2). These results remained robust in sensitivity analyses with 30% and 50% invalid IVs (Supplemental Fig. S2, A and B). In addition, in the two simulated datasets with different degrees of heterogeneity, Phoslink performed similarly, exhibiting comparable power while effectively reducing the FDR (Supplemental Fig. S2, C and D). To assess the robustness of the Phoslink method, we performed simulations in which SNP identities were randomized across the small-sample dataset while its structure was maintained, repeating this process 2000 times. After randomization, Phoslink’s accuracy decreased markedly, with power around 0.5 regardless of sample size (Supplemental Fig. S3).
Fig. 2.
Performance evaluation of Phoslink. FDR (true causal effect θ = 0) and Power (true causal effect θ = 0.6) for Phoslink and other methods in simulations at different sample sizes. The x-axis shows the sample size, whereas the y-axis displays the FDR (purple) or power (blue) for each method. FDR, false discovery rate.
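Both metrics in this evaluation are proportions of replicates in which the causal test rejects the null: under θ = 0 the proportion is what the text reports as FDR, and under θ = 0.60 it is the power. A minimal Python sketch follows. The function name `rejection_rate` is hypothetical, and the 0.05 significance level is an assumption, as the text does not state the level used for the MR test.

```python
def rejection_rate(p_values, alpha=0.05):
    """Fraction of simulation replicates whose MR p value is below alpha.
    Assumption: alpha = 0.05 (not stated explicitly in the text)."""
    return sum(p < alpha for p in p_values) / len(p_values)
```

With 2000 replicates per setting, a reported rate of 0.0095 corresponds to 19 rejections.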
Next, we compared the results from Phoslink with those from conventional correlation analyses, namely Pearson and Spearman correlation, which are commonly used to elucidate molecular associations in cancer research. Under the assumption of a causal effect of 0.12 and the genetic effects of SNPs on the exposure drawn from a uniform distribution U(1.0, 1.5), Phoslink exhibited lower variability in causal estimation and yielded a more robust and precise estimate of effect size compared with Pearson and Spearman (Fig. 3 and Supplemental Fig. S4). Moreover, when the effect size was set to 0, Phoslink showed a lower FDR, and this advantage became more pronounced as the sample size increased (Table 1). These findings collectively support the validity and reliability of Phoslink as a valuable tool for detecting causal relationships among multidimensional biomolecules.
Fig. 3.
Comparison of accuracy and consistency between Phoslink and correlation analyses. Density plots of effect estimates from Phoslink, Pearson, and Spearman analyses with 0% invalid IV scenario. The x-axis represents the estimates from the three methods. IV, instrumental variable.
Table 1.
FDR (true causal effect θ = 0) for the three methods at different sample sizes
| Sample size | Phoslink | Pearson | Spearman |
|---|---|---|---|
| 0% invalid IV | |||
| 50 | 0.0095 | 0.0015 | 0.0005 |
| 100 | 0.0030 | 0.0030 | 0.0040 |
| 200 | 0 | 0.0370 | 0.0260 |
| 300 | 0 | 0.1450 | 0.1130 |
| 30% invalid IV | |||
| 50 | 0.0125 | 0.0005 | 0.0005 |
| 100 | 0.0085 | 0.0065 | 0.0080 |
| 200 | 0.0015 | 0.1130 | 0.0885 |
| 300 | 0 | 0.2675 | 0.2110 |
| 50% invalid IV | |||
| 50 | 0.0105 | 0.0005 | 0.0005 |
| 100 | 0.0195 | 0.0110 | 0.0100 |
| 200 | 0.0140 | 0.1155 | 0.0880 |
| 300 | 0.0045 | 0.2550 | 0.2240 |
Characterization of Germline SNPs in LUAD
WES data from the CNHPP-LUAD cohort yielded a total of 310,576 autosomal germline SNPs, with C-to-T transitions as the predominant substitution type (Supplemental Fig. S5A), and chromosome 1 exhibited the highest number of SNPs (Supplemental Fig. S5B). The germline SNPs were not evenly distributed among chromosomes or within chromosomes (Fig. 4A), depending on the guanine–cytosine content of the chromosomes (51). The observed Ti/Tv ratios in our dataset were in agreement with the expected ratio, with an overall Ti/Tv ratio of 2.34 across all samples. Among the 20 well-defined driver genes of LUAD (52), SMARCA4 and EGFR displayed the highest SNP counts, with 50 and 36 SNPs, respectively (Fig. 4B and Supplemental Fig. S5C). In contrast, RBM10 and U2AF1 showed no SNPs within their sequences.
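The Ti/Tv ratio used as a quality check above is the number of transitions (purine to purine or pyrimidine to pyrimidine changes) divided by the number of transversions. The Python sketch below is illustrative only. The function name `ti_tv_ratio` is hypothetical, and the input is assumed to be a list of (reference, alternate) base pairs for biallelic SNVs.

```python
# Transitions: A<->G (purines) and C<->T (pyrimidines).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def ti_tv_ratio(snvs):
    """Transition/transversion ratio for (ref, alt) pairs with ref != alt."""
    ti = sum((ref, alt) in TRANSITIONS for ref, alt in snvs)
    tv = len(snvs) - ti
    if tv == 0:
        return float("inf")  # no transversions observed
    return ti / tv
```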
Fig. 4.
Overview of detected germline SNPs in LUAD. A, distribution of germline SNPs on autosomal chromosomes within a 1 Mb window size. The light color represents a low content, and the dark color represents a high content of germline SNPs. B, profiling of germline SNPs in the well-defined driver genes of LUAD. Rows correspond to germline SNPs in LUAD driver genes, and columns represent the 79 samples, showing the mutation status: 0 (homozygous wildtype), 1 (heterozygous genotype), and 2 (homozygous mutant). C, Manhattan plot for all phosphosites showing p values from univariable linear regression adjusted for age, sex, and smoking status and germline SNP positions across 22 autosomal chromosomes. The horizontal lines indicate the genome-wide cutoff of 5 × 10⁻⁸ (gray) and 0.05 (red), respectively. The y-axis shows the −log10 of the p values for the associations of genetic variants with phosphorylation levels. LUAD, lung adenocarcinoma.
The collection of SNPs, proteomics, and phosphoproteomic data in CNHPP-LUAD cases enables Phoslink to infer the causal effects of phosphorylation on protein expression. Initially, we investigated the association of each phosphosite (exposure) and cis-acting phosphorylation-related SNPs (IV candidates) in 79 LUAD tumors using a linear regression model adjusting for age, sex, and smoking status. In total, there were 1,240,601 pairs of SNP–phosphosite associations involving 7910 phosphosites and their corresponding 220,923 cis-SNPs. However, only 11 associations reached genome-wide significance (p < 5 × 10⁻⁸) (Fig. 4C). A substantial number of SNPs did not reach this stringent significance level but still deserved further investigation, as large effects may produce unimpressive p values if the sample size is small (45, 46). For instance, the rs13407823 genotype was significantly associated with IWS1 phosphorylation level (Kruskal–Wallis test, p < 0.001) (Supplemental Fig. S5D). Individuals homozygous for the minor allele (G/G) at rs13407823 had a decreased level of phosphorylation compared with individuals homozygous for the reference allele (A/A). Similarly, the phosphorylation level of SRRM1 was significantly lower in patients carrying the T allele for rs55691364 (Kruskal–Wallis test, p < 0.001) (Supplemental Fig. S5E). The genome-wide threshold for IV selection is overly stringent in the context of small-sample multiomics cancer data, dramatically narrowing the range of potential causal relationships that can be evaluated. FDR_MR and Min_MR, which use more lenient thresholds, showed that only 51% (879/1727) and 66% (638/966) of their identified associations, respectively, were significantly correlated, suggesting that a substantial proportion of these associations may not represent true biological relationships.
In addition, correlation-based methods like Pearson and Spearman identified a large number of phosphosite–protein links, but these were prone to high false-positive rates because of confounders.
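As an illustration of the covariate-adjusted association test described above, the following minimal Python sketch regresses a phosphorylation level on SNP genotype (coded 0, 1, 2) together with age, sex, and smoking status, and returns the genotype coefficient. The function names, the toy data, and the hand-written solver are illustrative assumptions and are not part of Phoslink; a real analysis would also report standard errors and p values from a statistics library.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Move the row with the largest entry in this column to the pivot position.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def ols(design, y):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def snp_effect(genotype, phospho, age, sex, smoking):
    """Genotype coefficient from phospho ~ 1 + genotype + age + sex + smoking."""
    design = [[1.0, g, a, s, sm]
              for g, a, s, sm in zip(genotype, age, sex, smoking)]
    return ols(design, phospho)[1]


# Hypothetical toy cohort (not data from the study).
genotype = [0, 1, 2, 0, 1, 2, 0, 1]
age = [50, 62, 58, 71, 45, 66, 54, 60]
sex = [0, 1, 0, 1, 1, 0, 0, 1]
smoking = [0, 0, 1, 1, 0, 1, 1, 0]
# By construction, each copy of the minor allele lowers phosphorylation by 0.5.
phospho = [1.0 - 0.5 * g + 0.01 * a + 0.2 * s + 0.1 * sm
           for g, a, s, sm in zip(genotype, age, sex, smoking)]

effect = snp_effect(genotype, phospho, age, sex, smoking)
print(effect)
```

Because the toy phosphorylation values are generated without noise, the fitted genotype coefficient recovers the simulated per-allele effect; with real data the estimate would carry sampling error that grows as the sample size shrinks.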
Causal Inference of Phosphoregulatory Network
To identify phosphorylation-related SNPs among millions of SNPs in GWAS datasets, we downloaded a list of such SNPs from the PhosSNP 1.0 database (16), containing 64,035 entries specific to human phosphorylation events. Phoslink screened for causal associations between phosphosite (exposure) and protein (outcome) based on the PhosSNP and CNHPP-LUAD datasets. A total of 345 significant phosphosite–protein links were identified in the CNHPP-LUAD dataset. To assess specificity, we introduced noise by randomizing SNP identities within the dataset. Only one relationship identified in the perturbed data overlapped with those identified in the original dataset, demonstrating the specificity of our method. The 345 causal links encompass 109 regulatory phosphosites (phosphoregulators) and 310 proteins (BH-adjusted p < 0.05). We observed a strong correlation between the magnitudes of the causal estimates and the observational correlation coefficients (r = 0.71 with Pearson, r = 0.66 with Spearman), and the signs of the MR estimates generally agreed with those of the correlation coefficients, with a 100% match for Pearson and a 98.84% match for Spearman (1.16% discrepancy, 4/345). Subsequently, we constructed a phosphoregulatory network based on the 345 causal links (Fig. 5). We assessed the functionality of phosphoregulators by quantifying the similarity between their downstream proteins and established cancer hallmarks (34) using GOSemSim (35). Functional roles of these phosphoregulators were inferred based on the hallmark exhibiting the highest semantic similarity score.
Fig. 5.
Phosphoregulatory network in LUAD. The network indicates the regulatory relationship between phosphoregulators and proteins. Node colors correspond to the most closely related cancer hallmarks, determined by the similarity of the protein sets regulated by phosphoregulators (limited to those regulating six or more proteins), and node size is proportional to node degree. LUAD, lung adenocarcinoma.
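The causal links above were retained at a Benjamini–Hochberg (BH) adjusted p < 0.05. The following minimal Python sketch shows how BH adjustment works on a list of raw p values; the function name and the example p values are illustrative assumptions, not output from Phoslink.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p values in the original order."""
    m = len(pvalues)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p values for four candidate phosphosite-protein links.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = bh_adjust(raw)
significant = [p < 0.05 for p in adjusted]
print(adjusted, significant)
```

Each raw p value is scaled by the number of tests divided by its rank, and the running minimum keeps the adjusted values monotone, so a link is declared significant only if its adjusted value falls below the chosen false discovery rate.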
To explore the impact of phosphoregulators on clinical features, we first connected their phosphorylation signals with four LUAD clinical features: tumor stage, OS, disease-free survival, and differential phosphopeptide abundances between tumor and normal samples (differential analysis). Among the 109 phosphoregulators (see Supplemental Table S1 for the full list), 30.28% (33/109) demonstrated a significant association with LUAD clinical features (Fig. 6A). In further analysis, we checked the associations between the downstream proteins regulated by phosphoregulators and these features. We identified 26 phosphosites, involving 22 proteins, as key phosphoregulators, each linked to at least one downstream protein significantly associated with LUAD clinical features, including survival and differential analysis (Fig. 6B). Pathway enrichment analysis identified several well-recognized oncogenic pathways among the top pathways enriched by these downstream proteins (Fig. 6C), such as cellular senescence, the mTOR signaling pathway, and various metabolic pathways. Moreover, several immune-related pathways were observed, such as PD-L1 expression and PD-1 checkpoint pathway in cancer. Notably, 16 of these key phosphoregulators were exclusively identified through our Phoslink model. Among them, AKAP12 undergoes hyperphosphorylation on serine residues during the G1/S phase. This serine phosphorylation, regulated by PKC and other kinases, alters the interaction of AKAP12 with the cytoskeleton matrix, thereby controlling critical processes, such as cytokinesis, cell proliferation, and cell migration (53, 54). In the CNHPP-LUAD cohort, elevated phosphorylation levels of AKAP12 at S696 and S598 showed significant associations with poor survival (Fig. 6D). Several additional key phosphoregulators have also been reported to play significant functional roles, including RREB1 pS161 (55, 56) and MAP4 pS941 (57).
In addition, although the specific sites identified here currently lack functional validation for certain proteins, phosphorylation of these proteins has been reported to be functionally important, as for ADD1 (58, 59), PCNT (60), RAP1GAP (61), and TNS1 (62). This emphasizes the capability of Phoslink to uncover novel key regulatory phosphosites, adding to the current biomolecular understanding of LUAD.
Fig. 6.
Exploring the significance of regulators in LUAD. A, distribution of the regulators associated with different clinical features in LUAD (FDR <0.05). B, stacked histogram showing the distribution of proteins associated with different clinical features regulated by 26 key regulators. The red-labeled key regulators are significantly associated with clinical features in two independent LUAD datasets. C, the dot plot describes the KEGG pathway enrichment for all key regulators, with dot size scaled by the GeneRatio, and color denoting the significance of association. D, Kaplan–Meier curves of overall survival, categorized by low versus high AKAP12 at S696 and S598 phosphorylation levels. FDR, false discovery rate; KEGG, Kyoto Encyclopedia of Genes and Genomes; LUAD, lung adenocarcinoma.
Next, we assessed the clinical relevance of the 26 phosphoregulators in two independent LUAD proteome cohorts: one from the United States (CPTAC-LUAD) (47) and the other from East Asian populations (Taiwan-LUAD) (63). Our results showed that, of the 26 key phosphoregulators identified in the CNHPP-LUAD cohort, nine were detected in the CPTAC-LUAD cohort. Relatively few phosphosites were shared across cohorts because of technical and population-related factors (64). Eight of these nine were consistently associated with clinical phenotypes, of which three exhibited statistical significance in both differential analyses between cancer and adjacent tissues and survival analyses (FAM83E pS351, RREB1 pS161, and TNS2 pS830). In the Taiwan-LUAD cohort, 16 of the 26 phosphoregulators were detected (missing value rate <0.9), with 13 showing statistically significant differences between cancer and adjacent tissues (one-sample Wilcoxon signed-rank test based on Supplemental Table S1G_PhosSiteLog2TN in the study, p < 0.05). Among these 13, TASOR2 pS1025 and WDR75 pS782 also exhibited significant differences between tumor stages (stage 1 versus others).
Functional Roles of Key Phosphoregulators in LUAD
Integrating phosphoregulators with data from drug databases could help identify potential drug candidates for repurposing in LUAD treatment. Previous studies have demonstrated that compounds targeting proteins supported by genetic evidence are more likely to work than those without such support (65). Among the 22 proteins harboring 26 key phosphoregulators, four have already been identified as drug targets, including IGF2R, VCAN, MAP4, and ADD1 according to the Drug–Gene Interaction Database (37), DrugBank (38) and Therapeutic Target Database (39).
To better understand the structural properties of the prioritized phosphoregulators, we first investigated whether these phosphosites reside within functional domains annotated in the InterPro resource (40). Notably, 88.46% (23/26) of the phosphoregulators were located within known InterPro domains. For instance, MAP4 pS941 is situated within the tubulin-binding domain, as visualized in UCSF ChimeraX (66) (Fig. 7A). It is noteworthy that paclitaxel, a tubulin-binding agent, is a commonly used first-line therapy for non–small-cell lung carcinoma (67). We calculated the site-specific disorder score using IUPred3 (68) and classified sites with scores exceeding 0.5 as disordered. The regulators are predominantly located in intrinsically disordered regions (Supplemental Fig. S6A). To further explore the possible functionalities of the phosphoregulators, we assigned a functional score to each, ranging from 0 to 1, based on the methodology of Ochoa et al. (69). A higher score indicates a higher likelihood of relevance for cell fitness. The phosphosites identified by our algorithm have significantly higher functional scores (Supplemental Fig. S6B), regardless of whether the functional scores were directly obtained from Supplemental Table S3 of Ochoa et al. (69) or recalculated using their tool, funscoR. Several phosphoregulators obtained a functional score higher than 0.4 (Supplemental Fig. S6C), such as RREB1 pS161, TBC1D5 pS539, and GORASP2 pS451, further emphasizing their functional significance. These findings support their candidacy as both biomarkers and drug targets. Considering the crucial role of kinases as attractive therapeutic targets for cancer (3), we collected experimentally verified kinase–substrate relationships from Phospho.ELM (41), PhosphoSitePlus (42), and PhosphoNetworks (43). A total of 28 kinases were identified for the nine proteins harboring key phosphoregulators.
Among them, ADD1 is targeted by the highest number of kinases, with a total of nine (Fig. 7B). As expected, all these kinases are targeted by Food and Drug Administration–approved drugs, making them worthy of further investigation into their potential therapeutic implications in LUAD.
Fig. 7.
Evaluating the potential of regulators for drug development. A, the 3D structure of MAP4 with pS941 marked in red using the UCSF ChimeraX visualization tool, where blue indicates the tubulin-binding domain. B, network diagrams depict the upstream kinases of phosphoregulators, and the outermost layer presents the group information of kinases. C, protein expression of RANGAP1 in tumors versus NATs, with the Wilcoxon signed-rank test p value indicated on top. D, Kaplan–Meier plots illustrating overall and disease-free survival in samples from the LUAD cohort, categorized by low versus high RANGAP1 protein expression. High expression of RANGAP1 is significantly associated with worse outcomes. E, mediation analysis quantified the effect sizes of the RANGAP1 mediator model, with GORASP2 pS451 as the exposure and LUAD survival as the outcome. LUAD, lung adenocarcinoma; NAT, noncancerous adjacent tissue.
To further elucidate how protein phosphorylation affects LUAD survival, we conducted mediation analyses. These analyses allowed us to decompose the total effect of phosphoregulators on survival into two components: the effect unmediated by the protein and the effect mediated by the protein. RANGAP1, a downstream protein of the key phosphoregulator GORASP2 pS451, was significantly associated with prognostic outcomes and showed elevated expression in tumors compared with NATs (Fig. 7, C and D). Mediation analysis revealed that GORASP2 pS451 exerted a significant mediation effect on survival through RANGAP1 protein expression (p = 0.018) (Fig. 7E). Interestingly, the protein-mediated and unmediated effects of GORASP2 pS451 on OS had opposite signs, which attenuated the total effect. This complexity in regulatory interactions may explain the difficulty in their identification using conventional methods.
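The decomposition described above can be sketched with the product-of-coefficients approach for linear models, in which the total effect equals the unmediated (direct) effect plus the mediated (indirect) effect. This is a simplified illustration under the assumption of a continuous outcome and linear relationships; it is not the survival-based analysis performed in this study, and all names and values are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def centered(values):
    m = mean(values)
    return [v - m for v in values]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def slope(x, y):
    """Slope from the simple regression y ~ x."""
    xc, yc = centered(x), centered(y)
    return dot(xc, yc) / dot(xc, xc)


def two_predictor_slopes(x, m, y):
    """Slopes on x and m from the regression y ~ x + m (2x2 normal equations)."""
    xc, mc, yc = centered(x), centered(m), centered(y)
    sxx, smm, sxm = dot(xc, xc), dot(mc, mc), dot(xc, mc)
    sxy, smy = dot(xc, yc), dot(mc, yc)
    det = sxx * smm - sxm ** 2
    return (smm * sxy - sxm * smy) / det, (sxx * smy - sxm * sxy) / det


def decompose(exposure, mediator, outcome):
    """Split the total effect into direct and mediated (indirect) components."""
    a = slope(exposure, mediator)                      # exposure -> mediator
    direct, b = two_predictor_slopes(exposure, mediator, outcome)
    total = slope(exposure, outcome)                   # exposure -> outcome
    return {"direct": direct, "indirect": a * b, "total": total}


# Hypothetical toy data: phosphorylation raises the mediator protein (a = 0.8),
# the mediator raises the outcome (b = 0.5), and the direct effect is negative.
phospho = [-1.0, 1.0, -1.0, 1.0]
other = [-1.0, -1.0, 1.0, 1.0]  # mediator variation unrelated to phosphorylation
protein = [0.8 * p + e for p, e in zip(phospho, other)]
outcome = [-0.5 * p + 0.5 * m for p, m in zip(phospho, protein)]

effects = decompose(phospho, protein, outcome)
print(effects)
```

In this toy example the mediated effect is positive and the direct effect is negative, so the total effect is much closer to zero than either component, which mirrors how opposing pathways can hide a regulatory relationship from a simple association test.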
Discussion
The functional importance of protein phosphorylation has been established, yet the underlying mechanisms are still not well understood. Given the conventional stringent threshold for IV selection within multiomics cancer studies characterized by relatively small sample sizes, the majority of exposures remain unavailable for assessing potential causal relationships because of the lack of IVs. In this study, we introduced Phoslink, an MR-centered strategy designed for small-sample cancer datasets, achieving causal inference of phosphosite–protein links based on multiple omics data. Phoslink integrates both internal and external evidence to reduce the winner’s curse and allow the inclusion of weak instruments in the MR analysis for small-sample multiomics cancer data. We demonstrate the effectiveness of Phoslink across different simulation scenarios, showing its ability to reduce the FDR while maintaining comparable power to conventional IV selection methods like FDR_MR, Min_MR, and GWAS_MR, even in datasets with pronounced heterogeneity. Phoslink outperforms correlation-based methods, including Pearson and Spearman, by effectively reducing the FDR and providing more stable estimates for phosphoproteomic regulation within cancer research. In addition, the moderate correlation between the absolute magnitudes of the causal estimates and the observational correlation coefficients supports the previous report that even in estimating associations between “omic” variables, the use of causal inference methods is crucial for inferring causal effect sizes over observational associations (70). Phoslink is freely available on GitHub, ensuring accessibility to the wider scientific community and facilitating collaboration for further advancements in this field.
In our application of Phoslink to 79 LUAD samples, we identified over 300 significant phosphosite–protein pairs and constructed a comprehensive phosphoregulatory network. Analysis of the regulatory network revealed 26 phosphoregulators that hold promising potential as biomarkers associated with LUAD. These regulators exhibited significant enrichment within proteins related to LUAD clinical features and were involved in well-known oncogenic pathways. Moreover, several immune-related pathways were identified, including PD-L1 expression and PD-1 checkpoint pathway in cancer, which may contribute to immune evasion of tumor cells (71). Importantly, 16 of the 26 phosphoregulators were exclusively identified through phosphosite–protein causal relationships, reinforcing the value of causal inference in biomarker discovery. The novel regulators identified through our research not only provide fresh insights into the phosphosite–protein interactions in LUAD but also open new avenues for drug repurposing in LUAD treatment. MAP4 is a target of paclitaxel, docetaxel, and artenimol. Paclitaxel and docetaxel are established treatments in clinical practice for LUAD. Artenimol, an artemisinin derivative, is currently used against Plasmodium falciparum infection. However, numerous studies have indicated a close association between LUAD and the gut-lung microbiota (72), and the anticancer efficacy of artemisinin derivatives has been confirmed (73). Artenimol thus holds promising potential as a therapeutic option for LUAD. Importantly, MAP4 pS941, a key phosphoregulator we identified, resides within the tubulin-binding domain, and paclitaxel is itself a tubulin-binding agent. Several regulators showed functional significance.
Mediation analysis revealed that the protein-mediated and unmediated effects of protein phosphorylation on OS can act in opposite directions. Because these opposing effects attenuate the total effect, they may account for the challenges in identifying these interactions using conventional methods.
While Phoslink demonstrates significant advancements in causal inference and annotation of phosphoproteomic data, it is important to acknowledge its limitations. First, MR analyses rely on three core IV assumptions for testing the causal effects of exposure on the outcome. These assumptions encompass the association of IVs with the exposure (relevance), their independence from potential confounders (independence), and their exclusive influence on the outcome only through the exposure and not through alternative pathways (exclusion restriction). From a biological standpoint, DNA variants are generally not influenced by confounders, and they influence signaling regulation via protein phosphorylation, which in turn affects downstream changes like protein translation (27). Consequently, we assumed that the IV is associated with the exposure but not with any confounders, and that it affects the outcome only through the exposure. However, in practice, fully ensuring the fulfillment of all these assumptions can be challenging, especially considering the possibility of pleiotropy in the biological context. This can potentially introduce bias in causal estimates. Second, Phoslink depends on the assumption of linear relationships among genetic variations, proteins, and phosphosites. Although this assumption simplifies analysis and interpretation, it may not always be accurate, given the nonlinear and dynamic nature of biological systems. Future research using nonlinear models may provide a more comprehensive and accurate understanding of molecular interactions. In summary, Phoslink emerges as a promising method for identifying causal phosphosite–protein links, thereby accelerating the clinical translation of cancer proteomics and phosphoproteomic data.
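To make the role of the three IV assumptions concrete, the following minimal Python sketch compares a naive regression slope with the Wald ratio, a standard single-instrument MR estimator, on toy data in which an unmeasured confounder affects both exposure and outcome. The Wald ratio is used here purely for illustration and is not claimed to be the estimator implemented in Phoslink; all names and values are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def covariance(u, v):
    """Sample covariance of two equally long sequences."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)


def wald_ratio(genotype, exposure, outcome):
    """MR estimate: SNP-outcome association divided by SNP-exposure association."""
    return covariance(genotype, outcome) / covariance(genotype, exposure)


def naive_slope(exposure, outcome):
    """Observational slope from regressing the outcome on the exposure."""
    return covariance(exposure, outcome) / covariance(exposure, exposure)


TRUE_EFFECT = 0.3  # assumed causal effect of phosphorylation on protein level

# Toy data built so that the instrument satisfies the three assumptions:
# relevance (genotype shifts the exposure), independence (genotype is
# uncorrelated with the confounder), and exclusion restriction (genotype
# reaches the outcome only through the exposure).
genotype = [0, 2, 0, 2]
confounder = [-1.0, -1.0, 1.0, 1.0]
exposure = [0.5 * g + u for g, u in zip(genotype, confounder)]
outcome = [TRUE_EFFECT * x + 1.0 * u for x, u in zip(exposure, confounder)]

mr_estimate = wald_ratio(genotype, exposure, outcome)
observational = naive_slope(exposure, outcome)
print(mr_estimate, observational)
```

In this construction the naive slope is inflated by the confounder, whereas the Wald ratio recovers the assumed causal effect. If the genotype also affected the outcome through another pathway (pleiotropy), the Wald ratio would be biased as well, which is the limitation noted above.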
Data Availability
The processed data are available at https://www.cell.com/cell/fulltext/S0092-8674(20)30676-0. Source data are provided in this article. We have implemented Phoslink in a computationally efficient R package that can be accessed at https://github.com/Li-Lab-SJTU/Phoslink. R code for reproducing the simulation data is available at https://github.com/Li-Lab-SJTU/Phoslink/tree/main/Simulations.
Supplemental data
This article contains supplemental data.
Conflict of interest
The authors declare no competing interests.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (grant nos.: 32170664 and 31871329), the Key Project for Computational Biology of Shanghai (grant no.: 23JS1400800), and the Fundamental Research Funds for the Central Universities (grant nos.: YG2023ZD11 and YG2022QN065). The computations in this article were run on the Siyuan-1 cluster supported by the Center for High Performance Computing at Shanghai Jiao Tong University.
Author contributions
J. L. conceptualization; Q. D., Y. Zhou, Y. Zhang, and J. L. methodology; Q. D. formal analysis; Q. D. investigation; M. T. data curation; Q. D. writing–original draft; M. T., Y. Zhou, Y. Zhang, and J. L. writing–review & editing; Y. Zhang and J. L. supervision; Q. D. visualization; J. L. funding acquisition.
Contributor Information
Yue Zhang, Email: yue.zhang@sjtu.edu.cn.
Jing Li, Email: jing.li@sjtu.edu.cn.
References
- 1. Ardito F., Giuliani M., Perrone D., Troiano G., Lo Muzio L. The crucial role of protein phosphorylation in cell signaling and its use as targeted therapy (Review) Int. J. Mol. Med. 2017;40:271–280. doi: 10.3892/ijmm.2017.3036.
- 2. Singh V., Ram M., Kumar R., Prasad R., Roy B.K., Singh K.K. Phosphorylation: implications in cancer. Protein J. 2017;36:1–6. doi: 10.1007/s10930-017-9696-z.
- 3. Vasaikar S., Huang C., Wang X., Petyuk V.A., Savage S.R., Wen B., et al. Proteogenomic analysis of human colon cancer reveals new therapeutic opportunities. Cell. 2019;177:1035–1049.e1019. doi: 10.1016/j.cell.2019.03.030.
- 4. Floyd B.M., Drew K., Marcotte E.M. Systematic identification of protein phosphorylation-mediated interactions. J. Proteome Res. 2021;20:1359–1370. doi: 10.1021/acs.jproteome.0c00750.
- 5. Muller R., Meacham Z.A., Ferguson L., Ingolia N.T. CiBER-seq dissects genetic networks by quantitative CRISPRi profiling of expression phenotypes. Science. 2020;370 doi: 10.1126/science.abb9662.
- 6. Liang X., Yao J., Cui D., Zheng W., Liu Y., Lou G., et al. The TRAF2-p62 axis promotes proliferation and survival of liver cancer by activating mTORC1 pathway. Cell Death Differ. 2023;30:1550–1562. doi: 10.1038/s41418-023-01164-7.
- 7. Lih T.M., Cho K.C., Schnaubelt M., Hu Y., Zhang H. Integrated glycoproteomic characterization of clear cell renal cell carcinoma. Cell Rep. 2023;42 doi: 10.1016/j.celrep.2023.112409.
- 8. Xu J.Y., Zhang C., Wang X., Zhai L., Ma Y., Mao Y., et al. Integrative proteomic characterization of human lung adenocarcinoma. Cell. 2020;182:245–261.e217. doi: 10.1016/j.cell.2020.05.043.
- 9. Floris M., Olla S., Schlessinger D., Cucca F. Genetic-driven druggable target identification and validation. Trends Genet. 2018;34:558–570. doi: 10.1016/j.tig.2018.04.004.
- 10. Babur O., Luna A., Korkut A., Durupinar F., Siper M.C., Dogrusoz U., et al. Causal interactions from proteomic profiles: molecular data meet pathway knowledge. Patterns (N Y) 2021;2 doi: 10.1016/j.patter.2021.100257.
- 11. Liu A., Trairatphisan P., Gjerga E., Didangelos A., Barratt J., Saez-Rodriguez J. From expression footprints to causal pathways: contextualizing large signaling networks with CARNIVAL. NPJ Syst. Biol. Appl. 2019;5:40. doi: 10.1038/s41540-019-0118-z.
- 12. Zheng J., Baird D., Borges M.C., Bowden J., Hemani G., Haycock P., et al. Recent developments in mendelian randomization studies. Curr. Epidemiol. Rep. 2017;4:330–345. doi: 10.1007/s40471-017-0128-6.
- 13. Howell A.E., Zheng J., Haycock P.C., McAleenan A., Relton C., Martin R.M., et al. Use of mendelian randomization for identifying risk factors for brain tumors. Front. Genet. 2018;9:525. doi: 10.3389/fgene.2018.00525.
- 14. Burgess S., Scott R.A., Timpson N.J., Davey Smith G., Thompson S.G., Consortium E.-I. Using published data in Mendelian randomization: a blueprint for efficient identification of causal risk factors. Eur. J. Epidemiol. 2015;30:543–552. doi: 10.1007/s10654-015-0011-z.
- 15. Plomin R., von Stumm S. The new genetics of intelligence. Nat. Rev. Genet. 2018;19:148–159. doi: 10.1038/nrg.2017.104.
- 16. Ren J., Jiang C., Gao X., Liu Z., Yuan Z., Jin C., et al. PhosSNP for systematic analysis of genetic polymorphisms that influence protein phosphorylation. Mol. Cell Proteomics. 2010;9:623–634. doi: 10.1074/mcp.M900273-MCP200.
- 17. Slob E.A.W., Burgess S. A comparison of robust Mendelian randomization methods using summary data. Genet. Epidemiol. 2020;44:313–329. doi: 10.1002/gepi.22295.
- 18. Bowden J., Davey Smith G., Burgess S. Mendelian randomization with invalid instruments: effect estimation and bias detection through Egger regression. Int. J. Epidemiol. 2015;44:512–525. doi: 10.1093/ije/dyv080.
- 19. Burgess S., Zuber V., Gkatzionis A., Foley C.N. Modal-based estimation via heterogeneity-penalized weighting: model averaging for consistent and efficient estimation in Mendelian randomization when a plurality of candidate instruments are valid. Int. J. Epidemiol. 2018;47:1242–1254. doi: 10.1093/ije/dyy080.
- 20. Ge S., Xia X., Ding C., Zhen B., Zhou Q., Feng J., et al. A proteomic landscape of diffuse-type gastric cancer. Nat. Commun. 2018;9:1012. doi: 10.1038/s41467-018-03121-2.
- 21. Satpathy S., Krug K., Jean Beltran P.M., Savage S.R., Petralia F., Kumar-Sinha C., et al. A proteogenomic portrait of lung squamous cell carcinoma. Cell. 2021;184:4348–4371.e4340. doi: 10.1016/j.cell.2021.07.016.
- 22. Bao E.L., Nandakumar S.K., Liao X., Bick A.G., Karjalainen J., Tabaka M., et al. Inherited myeloproliferative neoplasm risk affects haematopoietic stem cells. Nature. 2020;586:769–775. doi: 10.1038/s41586-020-2786-7.
- 23. Tian G.G., Hou C., Li J., Wu J. Three-dimensional genome structure shapes the recombination landscape of chromatin features during female germline stem cell development. Clin. Transl. Med. 2022;12:e927. doi: 10.1002/ctm2.927.
- 24. Leite Filho H.P., Pinto I.P., Oliveira L.G., Costa E.O.A., da Cruz A.S., DM E.S., et al. Deviation from Mendelian transmission of autosomal SNPs can be used to estimate germline mutations in humans exposed to ionizing radiation. PLoS One. 2020;15 doi: 10.1371/journal.pone.0233941.
- 25. Browning B.L., Tian X., Zhou Y., Browning S.R. Fast two-stage phasing of large-scale sequence data. Am. J. Hum. Genet. 2021;108:1880–1890. doi: 10.1016/j.ajhg.2021.08.005.
- 26. Cingolani P., Platts A., Wang le L., Coon M., Nguyen T., Wang L., et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin) 2012;6:80–92. doi: 10.4161/fly.19695.
- 27. Wang I.X., Ramrattan G., Cheung V.G. Genetic variation in insulin-induced kinase signaling. Mol. Syst. Biol. 2015;11:820. doi: 10.15252/msb.20156250.
- 28. Zheng J., Haberland V., Baird D., Walker V., Haycock P.C., Hurle M.R., et al. Phenome-wide Mendelian randomization mapping the influence of the plasma proteome on complex diseases. Nat. Genet. 2020;52:1122–1131. doi: 10.1038/s41588-020-0682-6.
- 29. Genomes Project C., Abecasis G.R., Auton A., Brooks L.D., DePristo M.A., Durbin R.M., et al. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491:56–65. doi: 10.1038/nature11632.
- 30. Cao Y., Yang Y., Hu Q., Wei G. Identification of potential drug targets for rheumatoid arthritis from genetic insights: a Mendelian randomization study. J. Transl. Med. 2023;21:616. doi: 10.1186/s12967-023-04474-z.
- 31. Garfield V., Farmaki A.E., Fatemifar G., Eastwood S.V., Mathur R., Rentsch C.T., et al. Relationship between Glycemia and cognitive function, structural brain outcomes, and dementia: a mendelian randomization study in the UK biobank. Diabetes. 2021;70:2313–2321. doi: 10.2337/db20-0895.
- 32. Manousaki D., Paternoster L., Standl M., Moffatt M.F., Farrall M., Bouzigon E., et al. Vitamin D levels and susceptibility to asthma, elevated immunoglobulin E levels, and atopic dermatitis: a Mendelian randomization study. PLoS Med. 2017;14 doi: 10.1371/journal.pmed.1002294.
- 33. Yavorska O.O., Burgess S. MendelianRandomization: an R package for performing Mendelian randomization analyses using summarized data. Int. J. Epidemiol. 2017;46:1734–1739. doi: 10.1093/ije/dyx034.
- 34. Plaisier C.L., Pan M., Baliga N.S. A miRNA-regulatory network explains how dysregulated miRNAs perturb oncogenic processes across diverse cancers. Genome Res. 2012;22:2302–2314. doi: 10.1101/gr.133991.111.
- 35. Yu G., Li F., Qin Y., Bo X., Wu Y., Wang S. GOSemSim: an R package for measuring semantic similarity among GO terms and gene products. Bioinformatics. 2010;26:976–978. doi: 10.1093/bioinformatics/btq064.
- 36. Wang J.Z., Du Z., Payattakool R., Yu P.S., Chen C.F. A new method to measure the semantic similarity of GO terms. Bioinformatics. 2007;23:1274–1281. doi: 10.1093/bioinformatics/btm087.
- 37. Freshour S.L., Kiwala S., Cotto K.C., Coffman A.C., McMichael J.F., Song J.J., et al. Integration of the drug-gene interaction database (DGIdb 4.0) with open crowdsource efforts. Nucleic Acids Res. 2021;49:D1144–D1151. doi: 10.1093/nar/gkaa1084.
- 38. Wishart D.S., Feunang Y.D., Guo A.C., Lo E.J., Marcu A., Grant J.R., et al. DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res. 2018;46:D1074–D1082. doi: 10.1093/nar/gkx1037.
- 39. Zhou Y., Zhang Y., Lian X., Li F., Wang C., Zhu F., et al. Therapeutic target database update 2022: facilitating drug discovery with enriched comparative data of targeted agents. Nucleic Acids Res. 2022;50:D1398–D1407. doi: 10.1093/nar/gkab953.
- 40. Paysan-Lafosse T., Blum M., Chuguransky S., Grego T., Pinto B.L., Salazar G.A., et al. InterPro in 2022. Nucleic Acids Res. 2023;51:D418–D427. doi: 10.1093/nar/gkac993.
- 41. Dinkel H., Chica C., Via A., Gould C.M., Jensen L.J., Gibson T.J., et al. Phospho.ELM: a database of phosphorylation sites--update 2011. Nucleic Acids Res. 2011;39:D261–D267. doi: 10.1093/nar/gkq1104.
- 42. Hornbeck P.V., Kornhauser J.M., Tkachev S., Zhang B., Skrzypek E., Murray B., et al. PhosphoSitePlus: a comprehensive resource for investigating the structure and function of experimentally determined post-translational modifications in man and mouse. Nucleic Acids Res. 2012;40:D261–D270. doi: 10.1093/nar/gkr1122.
- 43. Hu J., Rho H.S., Newman R.H., Zhang J., Zhu H., Qian J. PhosphoNetworks: a database for human phosphorylation networks. Bioinformatics. 2014;30:141–142. doi: 10.1093/bioinformatics/btt627.
- 44. Tingley D., Yamamoto T., Hirose K., Keele L., Imai K. Mediation: R package for causal mediation analysis. J. Stat. Softw. 2014;59:1–38.
- 45. Faber J., Fonseca L.M. How sample size influences research outcomes. Dental Press J. Orthod. 2014;19:27–29. doi: 10.1590/2176-9451.19.4.027-029.ebo.
- 46. Peterson S.J., Foley S. Clinician's guide to understanding effect size, Alpha level, power, and sample size. Nutr. Clin. Pract. 2021;36:598–605. doi: 10.1002/ncp.10674.
- 47. Gillette M.A., Satpathy S., Cao S., Dhanasekaran S.M., Vasaikar S.V., Krug K., et al. Proteogenomic characterization reveals therapeutic vulnerabilities in lung adenocarcinoma. Cell. 2020;182:200–225.e235. doi: 10.1016/j.cell.2020.06.013.
- 48. Yan B., Yang J., Zhao B., Wu Y., Bai L., Ma X. Causal effect of visceral adipose tissue accumulation on the human longevity: a mendelian randomization study. Front. Endocrinol. (Lausanne) 2021;12 doi: 10.3389/fendo.2021.722187.
- 49. Minelli C., Del Greco M.F., van der Plaat D.A., Bowden J., Sheehan N.A., Thompson J. The use of two-sample methods for Mendelian randomization analyses on single large datasets. Int. J. Epidemiol. 2021;50:1651–1659. doi: 10.1093/ije/dyab084.
- 50. Chen J., Zhang J., So H.C., Ai S., Wang N., Tan X., et al. Association of sleep traits and heel bone mineral density: observational and mendelian randomization studies. J. Bone Miner Res. 2021;36:2184–2192. doi: 10.1002/jbmr.4406.
- 51. Zhao Z., Fu Y.X., Hewett-Emmett D., Boerwinkle E. Investigating single nucleotide polymorphism (SNP) density in the human genome and its implications for molecular evolution. Gene. 2003;312:207–213. doi: 10.1016/s0378-1119(03)00670-x.
- 52. Bailey M.H., Tokheim C., Porta-Pardo E., Sengupta S., Bertrand D., Weerasinghe A., et al. Comprehensive characterization of cancer driver genes and mutations. Cell. 2018;173:371–385.e318. doi: 10.1016/j.cell.2018.02.060.
- 53. Akakura S., Gelman I.H. Pivotal role of AKAP12 in the regulation of cellular adhesion dynamics: control of cytoskeletal architecture, cell migration, and mitogenic signaling. J. Signal. Transduct. 2012;2012 doi: 10.1155/2012/529179.
- 54. Ramani K., Tomasi M.L., Berlind J., Mavila N., Sun Z. Role of A-kinase anchoring protein phosphorylation in alcohol-induced liver injury and hepatic stellate cell activation. Am. J. Pathol. 2018;188:640–655. doi: 10.1016/j.ajpath.2017.11.017.
- 55. Fattet L., Yang J. RREB1 integrates TGF-beta and RAS signals to drive EMT. Dev. Cell. 2020;52:259–260. doi: 10.1016/j.devcel.2020.01.020.
- 56. Su J., Morgani S.M., David C.J., Wang Q., Er E.E., Huang Y.H., et al. TGF-beta orchestrates fibrogenic and developmental EMTs via the RAS effector RREB1. Nature. 2020;577:566–571. doi: 10.1038/s41586-019-1897-5.
- 57. Fassett J.T., Hu X., Xu X., Lu Z., Zhang P., Chen Y., et al. AMPK attenuates microtubule proliferation in cardiac hypertrophy. Am. J. Physiol. Heart Circ. Physiol. 2013;304:H749–H758. doi: 10.1152/ajpheart.00935.2011.
- 58. Chan P.C., Hsu R.Y., Liu C.W., Lai C.C., Chen H.C. Adducin-1 is essential for mitotic spindle assembly through its interaction with myosin-X. J. Cell Biol. 2014;204:19–28. doi: 10.1083/jcb.201306083.
- 59. Zhang Y., Fan B., Li X., Tang Y., Shao J., Liu L., et al. Phosphorylation of adducin-1 by TPX2 promotes interpolar microtubule homeostasis and precise chromosome segregation in mouse oocytes. Cell Biosci. 2022;12:205. doi: 10.1186/s13578-022-00943-y.
- 60. Lee K., Rhee K. PLK1 phosphorylation of pericentrin initiates centrosome maturation at the onset of mitosis. J. Cell Biol. 2011;195:1093–1101. doi: 10.1083/jcb.201106093.
- 61.McAvoy T., Zhou M.M., Greengard P., Nairn A.C. Phosphorylation of Rap1GAP, a striatally enriched protein, by protein kinase A controls Rap1 activity and dendritic spine morphology. Proc. Natl. Acad. Sci. U. S. A. 2009;106:3531–3536. doi: 10.1073/pnas.0813263106. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 62.Hall E.H., Balsbaugh J.L., Rose K.L., Shabanowitz J., Hunt D.F., Brautigan D.L. Comprehensive analysis of phosphorylation sites in Tensin1 reveals regulation by p38MAPK. Mol. Cell Proteomics. 2010;9:2853–2863. doi: 10.1074/mcp.M110.003665. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 63.Chen Y.J., Roumeliotis T.I., Chang Y.H., Chen C.T., Han C.L., Lin M.H., et al. Proteogenomics of non-smoking lung cancer in East asia delineates molecular signatures of pathogenesis and progression. Cell. 2020;182:226–244.e217. doi: 10.1016/j.cell.2020.06.012. [DOI] [PubMed] [Google Scholar]
- 64.Geffen Y., Anand S., Akiyama Y., Yaron T.M., Song Y., Johnson J.L., et al. Pan-cancer analysis of post-translational modifications reveals shared patterns of protein regulation. Cell. 2023;186:3945–3967.e3926. doi: 10.1016/j.cell.2023.07.013. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 65.Nelson M.R., Tipney H., Painter J.L., Shen J., Nicoletti P., Shen Y., et al. The support of human genetic evidence for approved drug indications. Nat. Genet. 2015;47:856–860. doi: 10.1038/ng.3314. [DOI] [PubMed] [Google Scholar]
- 66.Goddard T.D., Huang C.C., Meng E.C., Pettersen E.F., Couch G.S., Morris J.H., et al. UCSF ChimeraX: meeting modern challenges in visualization and analysis. Protein Sci. 2018;27:14–25. doi: 10.1002/pro.3235. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 67.Hardin C., Shum E., Singh A.P., Perez-Soler R., Cheng H. Emerging treatment using tubulin inhibitors in advanced non-small cell lung cancer. Expert Opin. Pharmacother. 2017;18:701–716. doi: 10.1080/14656566.2017.1316374. [DOI] [PubMed] [Google Scholar]
- 68.Erdos G., Pajkos M., Dosztanyi Z. IUPred3: prediction of protein disorder enhanced with unambiguous experimental annotation and visualization of evolutionary conservation. Nucleic Acids Res. 2021;49:W297–W303. doi: 10.1093/nar/gkab408. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 69.Ochoa D., Jarnuczak A.F., Vieitez C., Gehre M., Soucheray M., Mateus A., et al. The functional landscape of the human phosphoproteome. Nat. Biotechnol. 2020;38:365–373. doi: 10.1038/s41587-019-0344-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 70.Hemani G., Tilling K., Davey Smith G. Correction: orienting the causal relationship between imprecisely measured traits using GWAS summary data. PLoS Genet. 2017;13 doi: 10.1371/journal.pgen.1007081. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 71.Cha J.H., Chan L.C., Li C.W., Hsu J.L., Hung M.C. Mechanisms controlling PD-L1 expression in cancer. Mol. Cell. 2019;76:359–370. doi: 10.1016/j.molcel.2019.09.030. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 72.Wong L.M., Shende N., Li W.T., Castaneda G., Apostol L., Chang E.Y., et al. Comparative analysis of age- and gender-associated microbiome in lung adenocarcinoma and lung squamous cell carcinoma. Cancers (Basel) 2020;12:1447. doi: 10.3390/cancers12061447. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 73.Zhang Q., Yi H., Yao H., Lu L., He G., Wu M., et al. Artemisinin derivatives inhibit non-small cell lung cancer cells through induction of ROS-dependent apoptosis/ferroptosis. J. Cancer. 2021;12:4075–4085. doi: 10.7150/jca.57054. [DOI] [PMC free article] [PubMed] [Google Scholar]
Data Availability Statement
The processed data are available at https://www.cell.com/cell/fulltext/S0092-8674(20)30676-0. Source data are provided in this article. Phoslink is implemented as a computationally efficient R package, available at https://github.com/Li-Lab-SJTU/Phoslink. R code for reproducing the simulation data is available at https://github.com/Li-Lab-SJTU/Phoslink/tree/main/Simulations.







