TransGCN: a semi-supervised graph convolution network–based framework to infer protein translocations in spatio-temporal proteomics

Bing Wang; Xiangzheng Zhang; Xudong Han; Bingjie Hao; Yan Li; Xuejiang Guo

doi:10.1093/bib/bbae055

2024 Feb 28;25(2):bbae055

PMCID: PMC10939423 PMID: 38426320

Abstract

Protein subcellular localization (PSL) is key to understanding protein function, and the movement of proteins between subcellular niches plays fundamental roles in regulating biological processes. Mass spectrometry–based spatio-temporal proteomics technologies can provide new insights into protein translocation, but identifying reliable translocation events remains challenging because of noise interference and insufficient data mining. We propose a semi-supervised graph convolution network (GCN)–based framework termed TransGCN that infers protein translocation events from spatio-temporal proteomics. Based on expanded multiple distance features and joint graph representations of proteins, TransGCN utilizes the semi-supervised GCN to enable effective knowledge transfer from proteins with known PSLs for predicting protein localization and translocation. Our results demonstrate that TransGCN outperforms current state-of-the-art methods in identifying protein translocations, especially in coping with batch effects. It also exhibits excellent predictive accuracy in PSL prediction. TransGCN is freely available on GitHub at https://github.com/XuejiangGuo/TransGCN.

Keywords: graph convolution network, semi-supervised learning, protein translocation, spatio-temporal proteomics, protein subcellular localization, mass spectrometry

INTRODUCTION

Eukaryotic cells are compartmentalized into organelles and subcellular niches with distinct functions and morphological structures [1], and proteins localized in different subcellular compartments may exert diversified functions [2]. The subcellular translocation of a protein is a dynamic temporal regulatory event from one cellular state to another; spatio-temporal proteomics, which incorporates the temporal dimension, can help reveal protein translocation in cells at different states [3]. For example, EGF induces protein translocation from the cytoplasm to the nucleus [4]. The dynamic translocations of proteins between niches are prevalent in cellular processes, and a number of diseases involving subcellular dysfunction are closely related to the mislocalization of proteins [5], such as cancers [6], neurodegenerative diseases [7] and Alzheimer's disease [8]. Therefore, systematically identifying protein translocation events after cellular perturbation by spatio-temporal proteomics is critical for understanding the functions and mechanisms of the associated cellular processes [9] and is also valuable for early diagnosis and the development of drug therapies for complex diseases [10].

Image-based methods [11] such as immunohistochemistry and immunofluorescence can be employed to investigate protein subcellular localization (PSL) and translocation. However, these methods are time-consuming, require specific antibodies and can be costly when studying thousands of protein translocation events in different cell states [12]. Overexpression of tag-fused proteins avoids the use of protein-specific antibodies; however, certain cell lines are particularly difficult to transfect, such as the human monocytic leukemia cell line (THP-1) [2]. Fortunately, mass spectrometry–based spatio-temporal proteomics [3] provides a systematic and high-throughput approach to evaluate protein translocation under different conditions. The main experimental workflows involve subcellular fractionation and mass spectrometry–based protein quantification to capture the dynamics of relative protein occupancy profiles in these subcellular fractions [13]. There have been remarkable advancements in relevant research fields over recent years, especially in studies of diseases and molecular mechanisms [14, 15]. Jean Beltran et al. [16] characterized protein localization dynamics during human cytomegalovirus (HCMV) infection and revealed that MYO18A, an unconventional myosin that translocates from the plasma membrane to the viral assembly complex, is essential for efficient HCMV replication. Hirst et al. [17] utilized CRISPR-Cas9 to knock out (KO) the AP-5 ζ subunit gene (AP5Z1) and showed its involvement in protein retrieval, emphasizing a late-acting pathway vital for lysosomal homeostasis. Overall, the rapid growth in mass spectrometry–based spatio-temporal proteomics holds promising prospects for advancing biomedical applications.

With high-throughput quantitative proteomics data on PSL dynamics, it is important to reliably infer protein translocation events from experimental data. Methods vary from traditional statistical approaches to sophisticated machine learning algorithms. Magnitude–reproducibility (MR) [4, 18], a traditional statistical approach, relies on the combination of a multivariate outlier test magnitude (M) score and a reproducibility (R) score. However, the threshold values of the M and R scores are difficult to determine, and robust estimation of the R score requires repeated experiments, which increases experimental cost. The mobility score (MS) [3] method detects translocated proteins based on the absolute change of protein levels in subcellular fractions between the control and treated groups. Both the MR and MS methods assume that the quantitative changes are unbiased and the datasets error-free, which is too idealistic to be achieved. Inevitable experimental variations such as random noise, batch effects or reproducibility issues lead to poor interpretability and robustness of these models. The machine learning–based method TRANSPIRE [19] leverages synthetic translocation profiles and a stochastic variational Gaussian process (SVGP) [20] classifier to predict PSLs for protein translocation identification. BANDLE [9], a principled Bayesian approach, uses Gaussian processes to model the mass-spectrometry profiles of the subcellular niches and then computes, by Bayesian inference, the probability of differential localization between two conditions for each protein. Although these sophisticated machine learning approaches have achieved good results, they rely excessively on the algorithms themselves and ignore the inherent properties of the data. For example, they frequently employ Gaussian processes to model the distribution of proteins in subcellular compartments, neglecting valuable information about the intricate network of protein relationships. Moreover, many statistical indicators for differentially localized proteins are not effectively utilized in these machine learning–based methods.

Here, we present a novel semi-supervised graph convolution network (GCN)–based [21, 22] framework, TransGCN, to infer protein translocation events in spatio-temporal proteomics. TransGCN considers different distance features of proteins in different cell states. It propagates protein localization and translocation labels from proteins with known PSLs by semi-supervised GCN, which enables effective knowledge transfer. Using a wide range of simulated and experimental mass spectrometry–based spatio-temporal proteomics datasets, we demonstrate that TransGCN outperforms other methods in accuracy and robustness. It can better handle spatio-temporal proteomic datasets with complex noise and identify more reliable protein translocation events.

MATERIALS AND METHODS

Dataset description

Mass spectrometry–based spatial proteomics mainly consists of subcellular fractionation of cells to derive multiple separated fractions, followed by mass spectrometry to quantify the levels of proteins in each fraction [13]. Spatio-temporal proteomics includes at least one pair of control and treated experiments. It is often necessary to collect marker proteins with known PSLs and without translocation events, as well as proteins that form similar distributions among fractions [12]. Notably, proteins with more than one PSL are not considered as marker proteins, similar to the study of Mulvey et al. [2]. For each replicate in the experimental data, the data are normalized by obtaining the proportion of protein levels in each fraction [12]. We benchmark the performance of protein translocation methods using five simulated and six experimental spatio-temporal proteomics datasets (Table 1).

Table 1.

Summary of mass spectrometry–based spatio-temporal proteomics datasets used in this study

Dataset	Cell line	Condition	No. of proteins (known/unknown PSLs)	No. of replicates	No. of fractions (per replicate)	No. of subcellular localizations
E14TG2aR	Mouse E14TG2a embryonic stem cells	Cluster-specific noise distribution; random batch effects; systematic batch effects; fraction swapping; batch effects and fraction swapping	2031 (368/1663)	3	48 (8)	10
beltran2016	Primary human fibroblasts	HCMV infection	1808 (325/1483)	1	12 (6)	9
davies2018	HeLa cells	AP4B1 KO	3926 (1022/2904)	2	20 (5)	12
orre2019	Human lung cancer cells	EGFR-TKI induced	10,016 (3350/6666)	2	20 (5)	15
mulvey2021	THP-1 human leukemia cells	LPS-stimulated	3727 (785/2942)	2	80 (20)	11
valerio2022	Human keratinocytes	UVA light	1650 (431/1219)	5	90 (9)	4
kretz2022	Two antibody-secreting cell lines	CHO-K1/MPC-11	2222 (428/1794)	3	60 (10)	10

Simulated spatio-temporal proteomics datasets

We utilized the BANDLE package [9] to simulate five dynamic spatial proteomics datasets with different types of errors or noises. The basic dataset is the mouse E14TG2a embryonic stem cell line (E14TG2aR) dataset [23], which contains one replicate with eight subcellular fractions and identified 2031 proteins, including 368 marker proteins with known PSLs assigned to 10 subcellular organelles. In simulation of each spatio-temporal proteomics dataset, the BANDLE package used bootstrapping approaches and specific conditions described below to generate datasets of two conditions with 100 translocated proteins in each of three replicates. The datasets are simulated using one of the five conditions: (i) cluster specific noise distribution, (ii) random batch effects, (iii) systematic batch effects, (iv) fraction swapping and (v) batch effects and fraction swapping [9].

Experimental spatio-temporal proteomics datasets

We also collected six spatio-temporal proteomics datasets with each containing paired control and treated experiments. Here are the detailed descriptions of each dataset:

The beltran2016 [16] dataset involves the primary human fibroblasts during HCMV infection and contains uninfected (control) and HCMV-infected (treated) cells at 24 h postinfection (hpi).
The davies2018 [7] dataset involves HeLa cell lines with KO of AP4B1, a subunit of one type of adaptor protein complex, and contains wild-type (WT) (control) and AP4B1 KO (treated) HeLa cells.
The orre2019 [14] dataset involves the human lung cancer cells with EGFR-TKI induced re-localization and contains untreated (control) and gefitinib-treated (treated) cells.
The mulvey2021 [2] dataset involves the THP-1 human leukemia cell line stimulated with lipopolysaccharide (LPS) and contains unstimulated (control) and LPS-stimulated (treated) THP-1 cells.
The valerio2022 [24] dataset involves the human keratinocytes exposed to ultraviolet-A (UVA) light and contains HaCaT skin cells in response to dark (control) and UVA light (treated) environmental conditions.
The kretz2022 [15] dataset involves the two antibody-secreting cell lines: Chinese hamster ovary (CHO-K1) (control) and murine plasma–derived (MPC-11) (treated) cells.

Table 1 shows the detailed dataset information. For example, the kretz2022 dataset has three paired replicates, each containing 10 subcellular fractions (a total of 60 fractions in all experiments), and includes 2222 proteins, of which 428 marker proteins have known PSLs assigned to 10 subcellular organelles (cytosol, endoplasmic reticulum, Golgi apparatus, lysosome, mitochondrion, nucleus, peroxisome, plasma membrane, proteasome and ribosome).

The overview of TransGCN framework

In this study, we propose TransGCN, a protein translocation inference method for spatio-temporal proteomics based on semi-supervised GCN learning. TransGCN mainly consists of three steps (Figure 1): (i) constructing a synthetic dataset of proteins with known PSLs and translocations by differential matching, which generates a large, high-quality synthetic dataset to address the issue of limited training data; (ii) computing a hybrid graph, distance features and expected probabilities to address insufficient data mining; and (iii) identifying protein localization and translocation by semi-supervised GCN, which enables effective knowledge transfer from the synthetic dataset.

Data synthesis

Using Z-score test to select proteins with high-quality PSLs

To generate a synthetic dataset of high quality, it is first necessary to select marker proteins whose PSLs are of high confidence. A Z-score test [25] is a valuable tool to test the probability of an element belonging to a particular distribution. Based on m proteins with a known PSL label, the values of the i-th subcellular fraction [f_i¹, f_i², ⋯, f_i^m] can be used to calculate the background distribution. This calculation provides the statistical probability for each protein in the i-th fraction. By applying this procedure to all fractions of a protein, we can determine their respective probabilities. A protein is deemed to have a PSL label of high confidence only if these probabilities in all fractions fall within 95% of the background distribution (Figure 1). Through this systematic approach, proteins with high-quality PSL labels can be screened for each PSL.
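The screening rule can be illustrated with a short sketch. It assumes a two-sided test in which "within 95% of the background distribution" means |z| ≤ 1.96 in every fraction, with the marker proteins of a single PSL class serving as the background. The function name, the cutoff argument and the toy profiles are hypothetical and are not taken from the TransGCN code.

```python
import math

def high_confidence_markers(profiles, z_cutoff=1.96):
    """Return indices of marker proteins whose every fraction lies within
    the central ~95% of the per-fraction background distribution.

    profiles: list of per-protein fraction vectors for ONE PSL class.
    z_cutoff: assumed two-sided 95% cutoff (an illustrative choice).
    """
    m = len(profiles)
    n_frac = len(profiles[0])
    keep = set(range(m))
    for i in range(n_frac):
        column = [p[i] for p in profiles]            # [f_i^1, ..., f_i^m]
        mean = sum(column) / m
        var = sum((v - mean) ** 2 for v in column) / (m - 1)  # sample variance
        sd = math.sqrt(var)
        if sd == 0:
            continue  # all values identical: no protein can be an outlier here
        for j, v in enumerate(column):
            if abs((v - mean) / sd) > z_cutoff:
                keep.discard(j)                      # fails in at least one fraction
    return sorted(keep)

# Nine consistent marker profiles and one atypical profile (index 9).
toy_profiles = [[0.5, 0.3, 0.2]] * 9 + [[0.1, 0.1, 0.8]]
selected = high_confidence_markers(toy_profiles)
```

In this toy example the atypical profile is rejected because its Z-score exceeds the cutoff in every fraction, while the nine consistent profiles are retained.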

Using differential matching to synthesize dataset

In general, only a few hundred proteins have known PSLs supported by the literature, which leads to a limited size of the training set. Following TRANSPIRE [19], we use differential matching to synthesize the dataset. For example, to generate a synthetic protein with a translocation such as EndosomeToGolgi, we merge protein P_A (localized in the endosome) in the type 1 condition and protein P_B (localized in the Golgi apparatus) in the type 2 condition into a new translocated protein P_AB that is localized in the endosome in the type 1 condition and in the Golgi apparatus in the type 2 condition (Figure 1). Based on the proteins with high-quality PSL labels, we can generate massive synthetic datasets of synthetic proteins with both PSL labels and translocation labels. The synthetic protein P_AB has the PSL label EndosomeToGolgi and the translocation label YES. If another synthetic protein has the PSL label GolgiToGolgi, its translocation label will be NO. Therefore, synthetic proteins can be divided into translocated (e.g. EndosomeToGolgi) or non-translocated (e.g. GolgiToGolgi), and only two translocation labels (YES or NO) are used to describe whether a protein is translocated.
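A minimal sketch of differential matching is given below. It assumes that "merging" two proteins means concatenating the type 1 profile of P_A with the type 2 profile of P_B, and that all cross-condition pairs are enumerated; the function name, the dictionary layout and the toy profiles are hypothetical.

```python
from itertools import product

def differential_matching(cond1, cond2):
    """Pair every marker profile from condition 1 with every marker profile
    from condition 2 to create labeled synthetic proteins.

    cond1, cond2: dicts mapping a PSL name to a list of fraction profiles.
    Returns a list of (features, psl_label, translocation_label) tuples.
    """
    synthetic = []
    for (loc_a, profs_a), (loc_b, profs_b) in product(cond1.items(), cond2.items()):
        psl_label = f"{loc_a}To{loc_b}"
        moved = "YES" if loc_a != loc_b else "NO"
        for p_a, p_b in product(profs_a, profs_b):
            # Assumed merge rule: concatenate the two condition-specific profiles.
            synthetic.append((p_a + p_b, psl_label, moved))
    return synthetic

type1 = {"Endosome": [[0.7, 0.3]], "Golgi": [[0.2, 0.8]]}
type2 = {"Golgi": [[0.1, 0.9]]}
synthetic_proteins = differential_matching(type1, type2)
```

Because every marker in one condition can be paired with every marker in the other, a few hundred markers can yield a far larger labeled synthetic set, which is the motivation stated above.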

Hybrid graph construction

We select a subset of the synthetic dataset (500 proteins per PSL class screened) and the real dataset (experimental dataset) to generate the hybrid graph. The mutual nearest neighbor (MNN) [26] approach is used to determine whether two samples are mutual nearest neighbors. In the context of given datasets H and G, the primary steps of MNN are as follows:

(i) Compute distances: For each sample in dataset H, calculate its distances to every other sample in dataset G using a commonly employed distance metric, such as Euclidean distance.
(ii) Identify nearest neighbors: Record the nearest neighbor in dataset G for each sample in dataset H, and, simultaneously, document the nearest neighbor in dataset H for each sample in dataset G.
(iii) Determine mutual nearest neighbors: Analyze the recorded nearest neighbor information from step (ii) to identify paired samples in datasets H and G that are mutual nearest neighbors of each other.
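Steps (i) to (iii) can be sketched as follows, assuming Euclidean distance and a single nearest neighbor per sample (k = 1). The function name and the toy coordinates are hypothetical; a practical implementation would typically use a vectorized or tree-based neighbor search.

```python
import math

def mutual_nearest_neighbors(H, G):
    """Return pairs (i, j) such that G[j] is the nearest neighbor of H[i]
    and H[i] is the nearest neighbor of G[j] (Euclidean distance)."""
    def nearest(x, pool):
        # Step (i): distances from x to every sample in the other dataset.
        return min(range(len(pool)), key=lambda k: math.dist(x, pool[k]))

    nn_h = [nearest(h, G) for h in H]   # step (ii): nearest neighbor in G for each H sample
    nn_g = [nearest(g, H) for g in G]   # step (ii): nearest neighbor in H for each G sample
    # Step (iii): keep only the pairs that choose each other.
    return [(i, j) for i, j in enumerate(nn_h) if nn_g[j] == i]

H = [[0.0, 0.0], [5.0, 5.0]]
G = [[0.0, 1.0], [5.0, 6.0], [0.2, 1.5]]
pairs = mutual_nearest_neighbors(H, G)
```

Here the third sample of G is closest to the first sample of H, but the reverse does not hold, so it is not part of any mutual pair.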

We utilize the MNN approach to construct a hybrid graph joining protein associations within and between the synthetic and real datasets (Figure 1). Within this graph, proteins are depicted as nodes, and the associations between proteins are represented as edges. First, a synthesis-to-reality graph A^sr ∈ R^{s × r} (where s and r are the numbers of proteins in the synthetic and real datasets, respectively) is constructed between the two datasets. For a protein i in the synthetic dataset and a protein j in the real dataset, A_ij^sr = 1 only if protein i and protein j are each other's nearest neighbors; otherwise A_ij^sr = 0. Then, an internal reality graph A^rr ∈ R^{r × r} is constructed within the real dataset. The final hybrid graph A ∈ R^{(s + r) × (s + r)} is then constructed by integrating the synthesis-to-reality graph (A^sr) and the internal reality graph (A^rr) [27, 28].
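One way to assemble the hybrid graph is sketched below. It assumes a symmetric block layout in which synthetic proteins occupy the first s indices and real proteins the last r indices, with no synthetic-to-synthetic edges; this layout, the function name and the edge lists are assumptions made for illustration, since the text only states that A^sr and A^rr are integrated.

```python
def build_hybrid_graph(s, r, sr_pairs, rr_pairs):
    """Assemble an (s + r) x (s + r) symmetric 0/1 adjacency matrix.

    sr_pairs: (i, j) edges between synthetic protein i and real protein j.
    rr_pairs: (j, k) edges between real proteins j and k.
    """
    n = s + r
    A = [[0] * n for _ in range(n)]
    for i, j in sr_pairs:                 # synthesis-to-reality block A^sr
        A[i][s + j] = A[s + j][i] = 1
    for j, k in rr_pairs:                 # internal reality block A^rr
        A[s + j][s + k] = A[s + k][s + j] = 1
    return A

# Two synthetic and two real proteins; one edge of each kind.
A = build_hybrid_graph(2, 2, sr_pairs=[(0, 1)], rr_pairs=[(0, 1)])
```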

Calculation of distance features

Analyzing the differences in protein subcellular fractions between paired control and treated experiments is valuable for data mining and establishes a robust foundation for more reliable protein translocation identification. For a protein with the fraction expression p = [p₁, p₂, …, p_n] (where n is the total number of fractions in one replicate) in the type 1 condition and q = [q₁, q₂, …, q_n] in the type 2 condition, we calculate the 20 distance features described below by comparing p and q, including direct, distribution and ranking distance features. Consequently, each paired experiment produces 20 distance features.

Direct distance

The direct distance measures the relationship between variables p and q based on their expression levels. Nine direct distance features are calculated, including Manhattan distance (D₁) [29], Chebyshev distance (D₂) [30], Canberra distance (D₃) [31], Euclidean distance (D₄) [32], cosine distance (D₅), Pearson's correlation coefficient (D₆) [33], Mahalanobis distance (D₇) [34], sum of absolute log-odds-ratio (D₈) and max of absolute log-odds-ratio (D₉) [35]. The formulae of these distances are defined by Equation (1):

(1)

where cov(p, q) is the covariance of p and q, σ_p and σ_q are the standard deviations for p and q, respectively. Σ is the covariance matrix of proteins in type 1 and 2 conditions.
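As an illustration, six of the nine direct distance features can be computed from their textbook definitions. The sketch assumes that cosine distance is defined as one minus cosine similarity; the Mahalanobis distance and the two log-odds-ratio features are omitted because they depend on definitions not reproduced here. The function name and dictionary keys are hypothetical.

```python
import math

def direct_distances(p, q):
    """Textbook versions of D1-D6 for two equal-length profiles p and q."""
    n = len(p)
    diff = [a - b for a, b in zip(p, q)]
    manhattan = sum(abs(d) for d in diff)
    chebyshev = max(abs(d) for d in diff)
    canberra = sum(abs(a - b) / (abs(a) + abs(b))
                   for a, b in zip(p, q) if abs(a) + abs(b) > 0)
    euclidean = math.sqrt(sum(d * d for d in diff))
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    cosine = 1.0 - dot / (norm_p * norm_q)          # assumed: 1 - cosine similarity
    mean_p, mean_q = sum(p) / n, sum(q) / n
    cov = sum((a - mean_p) * (b - mean_q) for a, b in zip(p, q))
    ss_p = math.sqrt(sum((a - mean_p) ** 2 for a in p))
    ss_q = math.sqrt(sum((b - mean_q) ** 2 for b in q))
    pearson = cov / (ss_p * ss_q)                   # cov(p, q) / (sigma_p * sigma_q)
    return {"manhattan": manhattan, "chebyshev": chebyshev, "canberra": canberra,
            "euclidean": euclidean, "cosine": cosine, "pearson": pearson}

d_direct = direct_distances([0.5, 0.3, 0.2], [0.2, 0.3, 0.5])
```

For this toy pair, in which occupancy moves from the first to the last fraction, the Manhattan distance is 0.6 and the Pearson correlation is strongly negative.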

Distribution distance

The distribution distance compares variables p and q based on their probability distribution vectors. Five distribution distance features are computed, including Bhattacharyya distance (D₁₀) [36], Hellinger distance (D₁₁) [37], cross entropy (D₁₂) [38], Kullback–Leibler divergence (D₁₃) [39] and Jensen–Shannon divergence (D₁₄) [40]. The formulae of these distances are defined by Equation (2):

(2)
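The five distribution distances can be sketched from their standard definitions, assuming that p and q each sum to 1 (which holds after the per-fraction proportion normalization) and that natural logarithms are used. The small constant eps, the function name and the dictionary keys are assumptions introduced only to keep the example numerically safe and readable.

```python
import math

def distribution_distances(p, q, eps=1e-12):
    """Standard definitions of D10-D14 for two probability vectors."""
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))   # Bhattacharyya coefficient
    bhattacharyya = -math.log(max(bc, eps))
    hellinger = math.sqrt(max(0.0, 1.0 - bc))          # H^2 = 1 - BC for distributions
    cross_entropy = -sum(a * math.log(b + eps) for a, b in zip(p, q))

    def kl(u, v):
        return sum(a * math.log((a + eps) / (b + eps))
                   for a, b in zip(u, v) if a > 0)

    mid = [(a + b) / 2 for a, b in zip(p, q)]          # mixture used by Jensen-Shannon
    js = 0.5 * kl(p, mid) + 0.5 * kl(q, mid)
    return {"bhattacharyya": bhattacharyya, "hellinger": hellinger,
            "cross_entropy": cross_entropy, "kl": kl(p, q), "js": js}

d_dist = distribution_distances([0.5, 0.5], [0.9, 0.1])
d_same = distribution_distances([0.5, 0.5], [0.5, 0.5])
```

Note that the Kullback–Leibler divergence is asymmetric in p and q, whereas the Jensen–Shannon divergence is symmetric and bounded by ln 2.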

Ranking distance

The ranking distance compares variables p and q based on their expression rankings. Six ranking distance features are calculated, including the p-value from the Wilcoxon rank sum test (D₁₅) [41], Spearman's rank correlation coefficient (D₁₆) [41], Kendall's tau coefficient (D₁₇) [42], Hamming distance (D₁₈) [43], sum of rank distance (D₁₉) and max of rank distance (D₂₀). The formulae of these distances are defined by Equation (3):

(3)

where Wilcoxon(p, q) is the function that calculates the p-value of the Wilcoxon rank sum test and rank(p_i) is the function that returns the ranking (from smallest to largest) of p_i in p. P is the number of concordant pairs, Q is the number of discordant pairs, T is the number of ties only in p and J is the number of ties only in q [42].
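Four of the ranking features can be sketched directly from per-fraction ranks. The sketch assumes that the Hamming distance counts positions whose ranks differ, that the sum and max of rank distance are taken over |rank(p_i) − rank(q_i)|, and that no ties occur (so the closed-form Spearman coefficient applies); these readings, like the function names, are assumptions. The Wilcoxon p-value and Kendall's tau are omitted because they are usually taken from a statistics library.

```python
def ranks(x):
    """1-based ranks from smallest to largest (ties broken by position)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0] * len(x)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def ranking_distances(p, q):
    rp, rq = ranks(p), ranks(q)
    d = [abs(a - b) for a, b in zip(rp, rq)]
    n = len(p)
    # Closed-form Spearman coefficient, valid when there are no ties.
    spearman = 1 - 6 * sum(v * v for v in d) / (n * (n * n - 1))
    return {"spearman": spearman,
            "hamming": sum(1 for v in d if v != 0),
            "rank_sum": sum(d),
            "rank_max": max(d)}

d_rank = ranking_distances([0.5, 0.3, 0.2], [0.2, 0.3, 0.5])
```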

Expected probability estimation

To optimize the loss function of TransGCN [44], it is imperative to estimate the expected probabilities of PSL labels and translocation labels of each protein on the real dataset. The random forest (RF) [45], a bagging ensemble learning algorithm, trains multiple decision tree classifiers and uses averaging to improve prediction accuracy and control over-fitting. Based on the synthetic dataset, the RF is employed to predict the probabilities of PSL labels and translocation labels of quantified proteins on real dataset, respectively, serving as the expected probabilities (Figure 1).

Model training

SENet operation

First, we concatenate the fraction expression features (f in number) and distance features (d in number) from both the subset of the synthetic dataset and the real dataset and then apply z-score normalization to yield X ∈ R^{(s + r) × (f + d)} as the input feature matrix. X is transformed to X₁ for preliminary processing as shown in Equation (4):

(4)

where X_l is the feature matrix and W_l is the weight matrix of the l-th layer, ReLU (rectified linear unit) is the activation function, MLP (multi-layer perceptron) is a type of artificial neural network. Notably, the feature dimension of X₁ remains the same as that of X.

Because features may differ in their contributions, a simplified Squeeze-and-Excitation network (SENet) [46] is used as an attention mechanism for feature optimization. SENet allows the model to focus more on important features. It contains two fully connected layers to obtain the feature weight matrix W_c: the first reduces the number of features, and thus the computational complexity, and the second restores the dimensions to the input dimensions [47]. Subsequently, W_c is multiplied with X₁ to obtain the optimized feature matrix X₂ as shown in Equation (5):

(5)

where ⊗ denotes element-wise multiplication and sigmoid is the activation function. SENet operation uses the MLP layer with the reduction ratio of 4.
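The description above is consistent with the following form, given here as a sketch rather than as the exact published equations. It assumes that each MLP is a single linear layer without bias, that W_0, W_a and W_b (names introduced here for illustration) are the corresponding weight matrices, and that the reduction ratio of 4 sets the hidden width to (f + d)/4.

```latex
\begin{aligned}
X_1 &= \mathrm{ReLU}\left(X W_0\right), & W_0 &\in \mathbb{R}^{(f+d)\times(f+d)},\\
W_c &= \mathrm{sigmoid}\left(\mathrm{ReLU}\left(X_1 W_a\right) W_b\right), & W_a &\in \mathbb{R}^{(f+d)\times\frac{f+d}{4}},\quad W_b \in \mathbb{R}^{\frac{f+d}{4}\times(f+d)},\\
X_2 &= W_c \otimes X_1 .
\end{aligned}
```

Under these assumptions W_c has the same shape as X₁, so each entry of X₁ is rescaled by a gate between 0 and 1.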

Semi-supervised GCN

To predict the labels of the PSLs and translocations for each protein, two semi-supervised GCN models are trained separately. In addition, to prevent the neighbor information of a certain protein from excessively affecting the protein characteristics [28], the hybrid graph A needs to be modified to the normalized adjacency matrix Ã by Equation (6):

(6)

where I∈R^{(s + r) × (s + r)} is the identity matrix and V is the diagonal degree matrix of A^*.
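Given the definitions of I and V, the normalization is consistent with the widely used renormalization A^* = A + I followed by Ã = V^{-1/2} A^* V^{-1/2}. The sketch below implements this form under that assumption; the function name is hypothetical, and plain Python lists are used only to keep the example self-contained.

```python
import math

def normalize_adjacency(A):
    """Symmetric normalization of an adjacency matrix with self-loops:
    A* = A + I, V = diag(row sums of A*), A_norm = V^(-1/2) A* V^(-1/2)."""
    n = len(A)
    A_star = [[A[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_star]                 # diagonal of V (always >= 1)
    return [[A_star[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

# Path graph on three nodes: 0 - 1 - 2.
A_norm = normalize_adjacency([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
```

The self-loops guarantee that every node has a nonzero degree, and dividing by the square roots of both endpoint degrees prevents highly connected proteins from dominating their neighbors' aggregated features.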

GCN is one of the most prominent graph deep learning models and is able to capture the structural relations between nodes [48]. GCN combines the node features and the normalized graph to uncover valuable latent feature information by convolutional operations, as achieved by Equation (7):

(7)
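A single graph convolution step can be illustrated as follows, assuming the standard propagation rule X_{l+1} = ReLU(Ã X_l W_l), which is the usual GCN formulation and is consistent with the symbols defined in this section. The helper names and toy matrices are hypothetical; in practice a deep learning framework would be used.

```python
def matmul(A, B):
    """Dense matrix product for lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def gcn_layer(A_norm, X, W):
    """One graph convolution: aggregate neighbor features, project, apply ReLU."""
    H = matmul(matmul(A_norm, X), W)
    return [[max(0.0, v) for v in row] for row in H]

A_norm = [[0.5, 0.5], [0.5, 0.5]]          # two connected nodes with self-loops
X = [[1.0, -2.0], [3.0, 4.0]]              # node feature matrix
W = [[1.0, 0.0], [0.0, -1.0]]              # layer weight matrix
X_next = gcn_layer(A_norm, X, W)
```

Both nodes receive the same output here because, after normalization, each aggregates the same average of the two feature vectors.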

Semi-supervised learning is a machine learning technique that trains a predictive model on both a labeled dataset (the synthetic dataset) and an unlabeled dataset (the real dataset), using the unlabeled data to improve the robustness and performance of the model. Semi-supervised GCN achieves effective knowledge transfer by exploiting the synthetic dataset, and it takes the optimized feature matrix X₂ and the adjacency matrix Ã as inputs in TransGCN (Figure 1), as achieved by Equation (8):

(8)

where softmax is the activation function and the output is the probability matrix of the predicted labels. The semi-supervised GCN contains two GCN layers with 256 and 128 neurons, respectively, and one MLP layer whose number of neurons equals the number of label types. Dropout layers with a dropout rate of 0.3 are added to prevent overfitting.

Loss functions

When utilizing the semi-supervised GCN to predict protein translocation events, both the supervised loss (Loss(S)) and the unsupervised loss (Loss(U)) are employed as the semi-supervised loss in TransGCN to improve predictive performance. The protein translocation labels in the synthetic dataset are used as the target for the supervised loss, whereas the expected probabilities of protein translocation labels in the real dataset are used as the target for the unsupervised loss. To alleviate the effect of sample imbalance, the weighted cross entropy (WCE) loss function based on the number of samples in different classes [12] is used as the semi-supervised loss function, which is structured as Equation (9):

(9)

where s and r are the numbers of proteins in the synthetic dataset and the real dataset, respectively; k is the number of label classes; and b is the vector that counts the number of proteins in each label class in the synthetic dataset. Loss(S) compares the true label vector with the predicted label probability vector of each synthetic protein j in the synthetic dataset, and Loss(U) compares the expected label probability vector with the predicted label probability vector of each real protein j in the real dataset. λ is a vector of sample weight factors, and α is a weight set to 2 to balance Loss(S) and Loss(U). To predict the PSL labels of proteins, a similar semi-supervised GCN model is trained by changing the target and employing a comparable semi-supervised loss function.
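The structure of the semi-supervised loss can be sketched as below. Two points are assumptions rather than statements of the published formula: the class weights λ are taken to be inverse class frequencies derived from b (normalized to a mean-one scale for balanced classes), and α is taken to multiply the unsupervised term, giving Loss = Loss(S) + α · Loss(U). All function names are hypothetical.

```python
import math

def class_weights(b):
    """Assumed inverse-frequency weights from the class counts b."""
    total, k = sum(b), len(b)
    return [total / (k * count) for count in b]

def weighted_cross_entropy(targets, preds, weights, eps=1e-12):
    """Mean over samples of -sum_c w_c * t_c * log(p_c)."""
    loss = 0.0
    for t, p in zip(targets, preds):
        loss -= sum(w * tc * math.log(pc + eps) for w, tc, pc in zip(weights, t, p))
    return loss / len(targets)

def semi_supervised_loss(y_syn, p_syn, e_real, p_real, b, alpha=2.0):
    """Loss(S) on synthetic labels plus alpha * Loss(U) on expected probabilities.
    Placing alpha on the unsupervised term is an assumption of this sketch."""
    lam = class_weights(b)
    loss_s = weighted_cross_entropy(y_syn, p_syn, lam)    # true labels vs predictions
    loss_u = weighted_cross_entropy(e_real, p_real, lam)  # expected probs vs predictions
    return loss_s + alpha * loss_u

loss = semi_supervised_loss(
    y_syn=[[1, 0]], p_syn=[[0.5, 0.5]],
    e_real=[[0.5, 0.5]], p_real=[[0.5, 0.5]],
    b=[1, 1],
)
```

The unsupervised term uses soft targets (the expected probabilities) in place of one-hot labels, which is how the unlabeled real proteins contribute to training.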

When applying TransGCN, the subset of the synthetic dataset is randomly split into a training dataset (80%) and a validation dataset (20%), and the real dataset is unlabeled and is predicted by the semi-supervised GCN models. The subset of the synthetic dataset and the real dataset are trained together, and the semi-supervised loss function is used to guide the model training. All the semi-supervised GCN models in TransGCN are trained for up to 5000 epochs using the Adam optimizer [49] with a learning rate of 0.001. To prevent overfitting, early stopping [50] halts training when the semi-supervised loss (on the validation dataset and the real dataset) has not decreased for 500 training epochs.
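The stopping rule can be sketched as a simple patience loop. The callable `step` is a hypothetical stand-in for one training epoch that returns the monitored semi-supervised loss; the exact bookkeeping (for example, whether ties reset the counter) is an assumption.

```python
def train_with_early_stopping(step, max_epochs=5000, patience=500):
    """Run step(epoch) until max_epochs, or until the monitored loss has not
    improved for `patience` consecutive epochs. Returns (best_loss, last_epoch)."""
    best, best_epoch = float("inf"), -1
    epoch = -1
    for epoch in range(max_epochs):
        loss = step(epoch)                 # hypothetical: train one epoch, return loss
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break                          # no improvement for `patience` epochs
    return best, epoch

# Toy loss curve: decreases until epoch 9, then plateaus at 1.
best_loss, last_epoch = train_with_early_stopping(
    lambda e: max(10 - e, 1), max_epochs=100, patience=3)
```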

False discovery rate computation

To evaluate the reliability of predicted protein translocation labels, the false discovery rate (FDR) is computed to control the confidence of proteins identified as translocated. All the proteins, including the marker proteins with known PSLs but no translocation in the real dataset, are ranked in descending order of their predicted translocation probabilities [51]. The FDR is then computed by Equation (10):

(10)

where N is the total number of marker proteins in the real dataset and the rank function gives the rank of marker protein i, based on its predicted translocation probability, amongst all marker proteins in the real dataset. When controlling the FDR, proteins whose predicted probabilities exceed the probability threshold corresponding to the chosen FDR are determined to have undergone translocation. In addition, if the predicted PSL labels of a protein show no change between conditions (e.g. GolgiToGolgi), the protein is considered to have no translocation.
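The thresholding idea can be sketched as follows. The sketch assumes that the FDR at marker protein i is estimated as rank(i)/N when markers are ranked in descending order of predicted probability, so that any marker called translocated counts as a false discovery; the function names, the tolerance constant and the toy probabilities are hypothetical.

```python
def fdr_cutoff(marker_probs, target_fdr=0.05):
    """Return a probability cutoff such that at most target_fdr of the marker
    proteins (known not to translocate) have a probability above it."""
    ranked = sorted(marker_probs, reverse=True)
    n = len(ranked)
    allowed = int(target_fdr * n + 1e-9)   # number of markers allowed above the cutoff
    if allowed >= n:
        return 0.0
    return ranked[allowed]                 # calls must be strictly greater than this

def call_translocated(probs, cutoff):
    """Flag proteins whose predicted translocation probability exceeds the cutoff."""
    return [p > cutoff for p in probs]

markers = [0.9, 0.8, 0.1, 0.2, 0.3, 0.05, 0.15, 0.25, 0.12, 0.4]
cutoff = fdr_cutoff(markers, target_fdr=0.2)
calls = call_translocated([0.95, 0.4, 0.41], cutoff)
```

In this toy example two of the ten markers exceed the cutoff, matching the requested rate of 0.2.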

RESULTS

Comparison of TransGCN in identifying protein translocations with other methods

To evaluate the performance of TransGCN in the prediction of protein translocations, several state-of-the-art protein translocation identification methods, including MR2016 [4], MR2017 [18], MS [3], TRANSPIRE [19], BANDLE(Dirichlet) and BANDLE(Pólya-Gamma) [9] and scGCN [27] (Supplementary Material S1), were applied for comparison. The simulated spatio-temporal proteomics datasets, each comprising 100 protein translocations, were used as benchmark datasets. In addition, the subsets of synthetic datasets derived from the experimental spatio-temporal proteomics datasets were randomly divided into training datasets (60%), validation datasets (20%) and test datasets (20%) as additional benchmark datasets, with the test datasets treated as the real dataset. Notably, the test datasets were deliberately omitted from the training phase to ensure their independence in calculating the expected probabilities. The evaluation metrics are described in Supplementary Material S1.

On the simulated spatio-temporal proteomics datasets, BANDLE(Dirichlet) and TransGCN achieved better predictive performance in protein translocation identification according to the area under the receiver operating characteristic (ROC) curve (AUC) [52] (Figure S1), while TransGCN achieved the highest area under the precision-recall (PR) curve (AUPR) on most simulated datasets, but not on the fraction-swapping dataset (Figure 2A). For example, on the simulated dataset with random batch effects, TransGCN (AUPR = 0.8287) had the highest AUPR compared to MR2016 (AUPR = 0.4067), MR2017 (AUPR = 0.4079), MS (AUPR = 0.3351), TRANSPIRE (AUPR = 0.2368), BANDLE(Dirichlet) (AUPR = 0.7046), BANDLE(Pólya-Gamma) (AUPR = 0.6858) and scGCN (AUPR = 0.5605). We also plotted FDR curves to analyze the proportion of mispredicted translocated proteins in the top 100 proteins (Figure 2B); the results showed better performance of TransGCN on the datasets with batch effects and fraction swapping, random batch effects and systematic batch effects. The results also demonstrated the effectiveness of the semi-supervised integrative functional mixture model BANDLE in identifying differentially localized proteins. In contrast, the TRANSPIRE, MS and scGCN methods displayed the worst performance in inferring protein translocations, suggesting that they could not cope with datasets containing different types of noise. Notably, TransGCN showed outstanding capability in handling datasets afflicted by batch effects, an urgent problem in the field of spatio-temporal proteomics.
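The top-100 FDR analysis amounts to counting, among the highest-scoring proteins, those that are not truly translocated. A minimal sketch is shown below; the function name and the toy scores are hypothetical, and the simulated ground truth is assumed to be available as a list of booleans.

```python
def top_k_misprediction_rate(scores, is_translocated, k=100):
    """Proportion of the k highest-scoring proteins that are not truly translocated."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    false_hits = sum(1 for i in order if not is_translocated[i])
    return false_hits / len(order)

rate = top_k_misprediction_rate(
    scores=[0.9, 0.8, 0.7, 0.1],
    is_translocated=[True, False, True, False],
    k=3,
)
```

With 100 simulated translocations per dataset, a method that ranks all true translocations first would reach a rate of 0 at k = 100.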

Comparison of TransGCN with other state-of-the-art methods in protein translocation identification on simulated spatio-temporal proteomics datasets. (A) PR curves and AUPR values of different methods. (B) FDR curves and FDR values of different methods for the proportion of mispredicted translocated proteins within the top 100 proteins.

On the experimental spatio-temporal proteomics datasets, TransGCN significantly outperformed other state-of-the-art methods in protein translocation prediction, as evident from the comparison metrics of both AUC (Figure 3) and AUPR (Figure S2). On the orre2019 dataset, TransGCN (AUC = 0.9907) exhibited the best predictive capability compared to MR2016 (AUC = 0.5649), MR2017 (AUC = 0.5506), MS (AUC = 0.9276), TRANSPIRE (AUC = 0.9264), BANDLE(Dirichlet) (AUC = 0.9064), BANDLE(Pólya-Gamma) (AUC = 0.9022) and scGCN (AUC = 0.9744). The MR method showed poor performance on most datasets, suggesting the limitations of traditional statistical methods. Overall, TransGCN showed superior robustness and reliability in inferring protein translocations.

ROC curves and AUC values of different methods in predicting protein translocation on experimental spatio-temporal proteomics datasets.

Comparison results of TransGCN in predicting protein subcellular localizations with other methods

To evaluate the performance of TransGCN in PSL prediction, we compared its results with those of various methods in predicting PSL labels on the experimental spatio-temporal proteomics datasets. The results showed superior performance of TransGCN in PSL prediction compared to other methods (Figure 4, Table S1). Across various metrics, the scores were consistently close to 1, except for the orre2019 dataset, highlighting the robustness of TransGCN. Even on the orre2019 dataset, TRANSPIRE (MCC = 0.5525), BANDLE(Dirichlet) (MCC = 0.4445), BANDLE(Pólya-Gamma) (MCC = 0.4535) and scGCN (MCC = 0.7202) performed less effectively in predicting PSL labels than TransGCN (MCC = 0.7852). TRANSPIRE exhibited poor performance on the mulvey2021 and kretz2022 datasets, whereas BANDLE(Dirichlet) and BANDLE(Pólya-Gamma) did not perform well on the davies2018 and valerio2022 datasets. scGCN performed better than the other methods but not as well as TransGCN. Overall, the results demonstrated that TransGCN had superior predictive capability and robustness for PSL prediction compared to the current state-of-the-art methods in the field of mass spectrometry–based spatio-temporal proteomics.

Accuracy, precision, recall, F1 score and MCC of different methods in PSL label prediction on experimental spatio-temporal proteomics datasets.

Ablation experiments of TransGCN in identifying protein translocations

To investigate the contribution of distance features and unsupervised loss [53] in TransGCN, the ablation experiments were designed to analyze the performance of TransGCN. The combination of expression (E) and distance (D) features as well as supervised (S) and unsupervised (U) losses resulted in three other variant models that were compared with TransGCN (loss(S + U) + feature(E + D)), including loss(S) + feature(E + D), loss(S + U) + feature(E) and loss(S) + feature(D) models. The four models were applied and compared in the identification of protein translocations based on the simulated and experimental spatio-temporal proteomics datasets.

On the simulated spatio-temporal proteomics datasets, TransGCN generally outperformed the other models based on AUC (Figure S3), AUPR (Figure 5A) and FDR (Figure 5B), with loss(S + U) + feature(E) being the next best performer and loss(S) + feature(E + D) the worst. Without the unsupervised loss, the AUPRs decreased by 0.2211, 0.2266, 0.1791, 0.1888 and 0.2035 on the datasets with cluster-specific noise distribution, random batch effects, systematic batch effects, fraction swapping and both batch effects and fraction swapping, respectively; a similar decrease was seen on the experimental spatio-temporal proteomics datasets (Figure S4). Furthermore, adding distance features to TransGCN improved the AUPRs on the datasets with different noise errors (Figure 5A). Based on FDR curve analysis, TransGCN (FDR = 0.31) performed better than loss(S) + feature(E + D) (FDR = 0.57), loss(S + U) + feature(E) (FDR = 0.34) and loss(S) + feature(D) (FDR = 0.36) on the dataset with systematic batch effects. In general, introducing distance features and the unsupervised loss into TransGCN improved predictive performance in the identification of protein translocations.

Performance of protein translocation identification in the ablation study on simulated spatio-temporal proteomics datasets. (A) PR curves and AUPR values of TransGCN-based variant models. (B) FDR curves and FDR values of TransGCN-based variant models for the proportion of mis-predicted translocated proteins within the top 100 proteins.
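The FDR used here is the proportion of falsely identified proteins among the top-ranked candidates. A minimal sketch of that calculation is given below, assuming each protein has a translocation score and a ground-truth flag (as is available for simulated data); the function name and the toy values are hypothetical.

```python
def top_k_fdr(scores, is_translocated, k=100):
    """Fraction of the k highest-scoring proteins that are not truly translocated."""
    if k <= 0 or not scores:
        raise ValueError("k must be positive and scores must be non-empty")
    ranked = sorted(zip(scores, is_translocated), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]                      # may hold fewer than k proteins
    false_hits = sum(1 for _, truth in top if not truth)
    return false_hits / len(top)


# Toy example (invented values): five proteins, evaluated at several cut-offs.
scores = [0.95, 0.90, 0.60, 0.50, 0.10]
truth = [True, False, True, True, False]
fdr_top4 = top_k_fdr(scores, truth, k=4)
# An FDR curve is the same quantity evaluated at every cut-off from 1 to N.
fdr_curve = [top_k_fdr(scores, truth, k=k) for k in range(1, len(scores) + 1)]
```

With k = 100, an FDR of 0.31 corresponds to 31 of the top 100 candidates being false identifications.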

We further calculated the mean value of the feature weight matrix W_C from SENet to explore the importance of different distance features (Figure 6); a feature is considered more important if its mean weight is greater. The results showed that the importance of distance features varies with the type of noise in the data. For the datasets with cluster-specific noise distribution, Canberra distance (D₃), sum of absolute log-odds-ratio (D₈) and cross entropy (D₁₂) are more important. For the datasets with random batch effects, Pearson's correlation coefficient (D₆), Bhattacharyya distance (D₁₀) and Spearman's rank correlation coefficient (D₁₆) are more important. For the datasets with systematic batch effects, Chebyshev distance (D₂), Bhattacharyya distance (D₁₀) and Spearman's rank correlation coefficient (D₁₆) are more important. For the datasets with fraction swapping, Canberra distance (D₃), Pearson's correlation coefficient (D₆) and Spearman's rank correlation coefficient (D₁₆) are more important. For the datasets with both batch effects and fraction swapping, Pearson's correlation coefficient (D₆), cross entropy (D₁₂) and the P-value from the Wilcoxon rank sum test (D₁₅) are more important. Overall, Canberra distance (D₃), Pearson's correlation coefficient (D₆), cross entropy (D₁₂) and Spearman's rank correlation coefficient (D₁₆) play important roles across most of the datasets with different types of noise.
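The importance ranking described above reduces to averaging a weight matrix over proteins and sorting the features by that average. The sketch below assumes W_C is laid out with one row per protein and one column per distance feature; this layout, the function names and the numbers are assumptions made for illustration, not details confirmed by the text.

```python
def mean_feature_weights(weight_matrix, feature_names):
    """Average each feature's attention weight over all rows (proteins)."""
    if not weight_matrix:
        raise ValueError("weight matrix is empty")
    totals = [0.0] * len(feature_names)
    for row in weight_matrix:
        if len(row) != len(feature_names):
            raise ValueError("row length does not match the number of features")
        for j, w in enumerate(row):
            totals[j] += w
    return {name: total / len(weight_matrix) for name, total in zip(feature_names, totals)}


def top_features(mean_weights, n=3):
    """Names of the n features with the largest mean weight."""
    return sorted(mean_weights, key=mean_weights.get, reverse=True)[:n]


# Toy weights (hypothetical): two proteins, four distance features.
names = ["Canberra", "Pearson", "cross_entropy", "Spearman"]
w_c = [
    [0.9, 0.2, 0.5, 0.7],
    [0.7, 0.4, 0.5, 0.5],
]
means = mean_feature_weights(w_c, names)
ranking = top_features(means, n=2)
```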

Distance feature importance distribution based on the mean value of the feature weight matrix.

In addition, because sample imbalance might affect model performance, a sample weight factor was used in this study to alleviate the problem. Compared with the model without this optimization, the model optimized with sample weight factors achieved better prediction performance on most of the simulated datasets. For example, the AUPR improved by 2.58%, 1.77%, 2.23% and 2.59% on the datasets with cluster-specific noise distribution, systematic batch effects, fraction swapping and both batch effects and fraction swapping, respectively. Thus, accounting for sample imbalance improved model performance.
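The exact form of the sample weight factor is not given in this section. As one common way to realize the idea, the sketch below uses inverse-frequency class weights inside a weighted cross-entropy loss; both the weighting scheme and the function names are assumptions chosen for illustration and may differ from what TransGCN implements.

```python
from collections import Counter
from math import log


def inverse_frequency_weights(labels):
    """Weight each class by n / (n_classes * class_count), so rare classes count more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * m) for cls, m in counts.items()}


def weighted_cross_entropy(probs, labels, class_weights):
    """Weighted mean of -log p(true class); probs is a list of {class: probability}."""
    total_loss, total_weight = 0.0, 0.0
    for p, y in zip(probs, labels):
        w = class_weights[y]
        total_loss += -w * log(max(p[y], 1e-12))   # clip to avoid log(0)
        total_weight += w
    return total_loss / total_weight


# Toy data (hypothetical): 8 non-translocated proteins versus 2 translocated ones.
labels = ["no"] * 8 + ["yes"] * 2
weights = inverse_frequency_weights(labels)
probs = [{"no": 0.9, "yes": 0.1}] * 8 + [{"no": 0.6, "yes": 0.4}] * 2
loss = weighted_cross_entropy(probs, labels, weights)
```

In this toy case the minority class receives a weight four times that of the majority class, so errors on translocated proteins dominate the loss.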

Ablation experiments of TransGCN in predicting protein subcellular localizations

To assess the roles of distance features and unsupervised loss in TransGCN for PSL prediction, we also compared the performance of TransGCN-based variant models in predicting PSL labels on the experimental spatio-temporal proteomics datasets. The results showed that the loss(S) + feature(E + D) and loss(S + U) + feature(D) models performed significantly worse than the loss(S + U) + feature(E) and loss(S + U) + feature(E + D) models in predicting PSL labels (Figure 7). Adding distance features yielded only a small performance improvement on the davies2018 and orre2019 datasets. These results indicate that the unsupervised loss in TransGCN substantially facilitates PSL prediction, whereas the distance features have a limited effect.

Accuracy, precision, recall, F1 score and MCC of TransGCN-based variant models for PSL label prediction on experimental spatio-temporal proteomics datasets.

Application of TransGCN in identifying protein translocations

On the simulated spatio-temporal proteomics dataset with batch effects and fraction swapping, TransGCN (FDR = 0.22) exhibited the lowest FDR compared to MR2016 (FDR = 0.36), MR2017 (FDR = 0.35), MS (FDR = 0.64), TRANSPIRE (FDR = 0.86), BANDLE(Dirichlet) (FDR = 0.45), BANDLE(Pólya-Gamma) (FDR = 0.49) and scGCN (FDR = 0.8) (Figure 2B). This indicated that only 22 of the top 100 proteins were mis-identified as translocated proteins by TransGCN. Among the top 100 potential translocated proteins identified by each method, TransGCN identified an additional five proteins that were not detected by other methods (Figure 8A, Figure S5A). Principal component analysis (PCA) [54] (Figure S5B, Figure 8B) revealed that proteins localized within the same subcellular compartments clustered together after TransGCN application. Even the proteins predicted to localize in the 40S ribosome and 60S ribosome formed two distinct clusters. Notably, the additional five translocated proteins can be clearly observed, such as P58252 from 40S ribosome to lysosome and Q9EPR4 from endoplasmic reticulum to nucleus–nucleolus. We also applied TransGCN to the davies2018 dataset and successfully identified the translocated proteins SERINC1 and SERINC3 with a translocation FDR below 0.05 (Figure S6), which was consistent with the findings of Davies et al. [7]. In summary, compared to existing state-of-the-art methods, TransGCN can not only better process spatio-temporal proteomics datasets with complex noise but also identify more reliable protein translocation events.

Predicted results on the simulated spatio-temporal proteomics dataset with batch effects and fraction swapping. (A) The top 100 true translocated proteins annotated with the matched top 100 potential translocated proteins (Rank (≤100)) identified by different methods. (B) PCA plots of proteins annotated with known PSLs and predicted PSLs by TransGCN in type 1 and 2 conditions.

DISCUSSION

The development of mass spectrometry–based spatio-temporal proteomics methods provides new insights into biological processes of protein localizations and translocations between intracellular compartments [3, 17]. In this paper, we constructed a semi-supervised GCN-based framework named TransGCN to identify protein translocations. Notably, TransGCN adapts graph network models [48] to facilitate efficient knowledge transfer [27] while implementing data mining from both feature and sample perspectives.

The procedures of subcellular fractionation and mass spectrometry–based protein quantitation inevitably introduce data errors, particularly batch effects arising from different experiments [9]. Simple statistical methods, such as MR [4, 18] and MS [3], do not effectively address these challenges when identifying translocated proteins (Figure 2). In TransGCN, we computed 20 distance features, including direct, distribution and ranking distance features, to analyze the protein location differences after cellular perturbation, whereas scGCN [27], TRANSPIRE [19] and BANDLE [9] underutilize these important distance features. The diverse perspectives offered by distance features effectively mitigate a variety of data noise and errors; for instance, ranking distance features aptly handle batch effects [55]. In addition, the SENet-based attention mechanism [46] allows the model to adaptively focus on important distance features (Figure 6). Nonetheless, these distance features perform suboptimally on datasets involving fraction swapping. It is worth noting that such data errors are infrequent and primarily stem from inaccuracies in data recording. Furthermore, our integration of distance features significantly enhanced the accuracy in identifying protein translocations, but brought limited improvement to PSL prediction (Figure 7). This limitation is primarily attributed to the large number of PSL label types and to the tendency of translocated proteins to exhibit similar distance feature values.
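To illustrate the three families of distance features mentioned above, the following sketch computes one or two representatives of each between a protein's fraction profiles before and after perturbation: Chebyshev and Canberra (direct), Bhattacharyya (distribution) and Spearman's correlation (ranking). These are standard textbook definitions; the exact formulations, normalizations and feature numbering used by TransGCN are not assumed here, and the profiles are toy values.

```python
from math import log, sqrt


def chebyshev(x, y):
    """Direct distance: largest absolute difference across fractions."""
    return max(abs(a - b) for a, b in zip(x, y))


def canberra(x, y):
    """Direct distance: sum of |a - b| / (|a| + |b|), skipping 0/0 terms."""
    return sum(abs(a - b) / (abs(a) + abs(b)) for a, b in zip(x, y) if abs(a) + abs(b) > 0)


def bhattacharyya(p, q):
    """Distribution distance between two profiles that each sum to 1."""
    bc = sum(sqrt(a * b) for a, b in zip(p, q))
    return float("inf") if bc == 0 else -log(bc)


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Ranking feature: Pearson correlation of the rank-transformed profiles."""
    return pearson(ranks(x), ranks(y))


# Toy profiles (hypothetical): abundance across four fractions, each summing to 1.
before = [0.6, 0.3, 0.1, 0.0]
after = [0.1, 0.2, 0.3, 0.4]
rho = spearman(before, after)
# A monotone distortion, a crude stand-in for a batch effect, leaves ranks unchanged.
rho_shifted = spearman(before, [2 * v + 1 for v in before])
```

The last line hints at why ranking features are robust to batch effects: any monotone rescaling of a profile preserves its ranks, so the rank correlation is unaffected.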

Notably, the sample imbalance problem is often more complex in real-world data. A common challenge in model training is that large differences in the amount of data between classes tend to cause the model to under-learn the minority classes [56]. In this study, we addressed sample imbalance in the classification problem by introducing a sample weight factor to optimize the model [12]. By adjusting the sample weight factor, we obtained improved prediction accuracy and generalization performance. This suggests that it is important to consider the sample imbalance problem and adopt an effective weight adjustment strategy in practical applications.

With the synthetic dataset as references and the real dataset as queries [27], we constructed joint graph representations of proteins in TransGCN to facilitate knowledge transfer. This approach performs well in single-cell RNA-seq data annotation [27, 28], but has not been well evaluated in spatio-temporal proteomics. The hybrid graph is derived from the synthesis-to-reality graph (based on inter-dataset) and the internal reality graph (based on intra-dataset), which can help explore the potential non-linear relationship between different proteins from synthetic and real datasets [28]. Then the semi-supervised GCN model [21] in TransGCN can make good use of the hybrid graph to allow for reliable and reproducible protein label transfer from a synthetic dataset to a real dataset. Furthermore, the incorporation of semi-supervised loss [53], as represented by the unsupervised loss component in TransGCN, serves as a crucial element in refining the predictive performance of the models (Figure 5). Our strategy, built upon a semi-supervised GCN model with semi-supervised loss, presents a potent paradigm for enhancement of knowledge transfer and predictive accuracy in the burgeoning field of spatio-temporal proteomics.
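The semi-supervised GCN of Kipf and Welling [21] propagates features over a graph with the rule H' = ReLU(Â H W), where Â is the symmetrically normalized adjacency matrix with self-loops. The sketch below implements one such layer in plain Python on a three-node toy graph. It illustrates only the generic propagation step; how TransGCN builds its hybrid graph and combines its losses is not reproduced here, and all names and values are illustrative.

```python
from math import sqrt


def normalized_adjacency(adj):
    """Compute D^(-1/2) (A + I) D^(-1/2) for a symmetric adjacency matrix."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def gcn_layer(adj_norm, features, weights):
    """One graph convolution: ReLU(adj_norm @ features @ weights)."""
    z = matmul(matmul(adj_norm, features), weights)
    return [[max(0.0, v) for v in row] for row in z]


# Toy graph (hypothetical): protein 1 is linked to proteins 0 and 2.
adj = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]   # two features per protein
weights = [[1.0, 0.0], [0.0, 1.0]]                # identity, to expose the smoothing
a_norm = normalized_adjacency(adj)
hidden = gcn_layer(a_norm, features, weights)
```

After one layer, each node's representation mixes in its neighbors' features, which is the mechanism that lets labels from annotated proteins inform unlabeled ones connected to them in the graph.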

In conclusion, our study demonstrates that TransGCN is a powerful tool for predicting protein localization and translocation, showcasing its superiority over existing state-of-the-art methods on simulated and experimental spatio-temporal proteomics datasets. By contrast, TRANSPIRE [19] predominantly emphasizes the prediction of PSL labels and overlooks direct evaluation of the confidence of protein translocation (Figures 2 and 3), and BANDLE [9] ignores higher-order relations between proteins during label transfer. TransGCN is robust in the presence of noise and errors such as batch effects. It is expected to facilitate better use of spatio-temporal proteomics and help further studies of the functional roles of protein translocation between intracellular compartments.

Key Points

We propose TransGCN, a semi-supervised GCN-based method for predicting protein localization and translocation in mass spectrometry–based spatio-temporal proteomics.
TransGCN constructs direct, distribution and ranking distance features to effectively analyze the protein localization changes after cellular perturbation.
TransGCN utilizes the semi-supervised GCN to facilitate effective transfer of protein localization and translocation labels from proteins with known localizations.
TransGCN outperforms other state-of-the-art methods in robustness and accuracy of protein localization and translocation prediction.

FUNDING

This work was supported by the National Key R&D Program of China (2021YFC2700200) and the Chinese National Natural Science Foundation (Grants No. 82221005, 82371606, 82371623, 82001611).

DATA AVAILABILITY

The simulated spatio-temporal proteomics datasets were generated by the sim_dynamic function in the BANDLE package (https://bioconductor.org/packages/release/bioc/html/bandle.html). The experimental spatio-temporal proteomics datasets are available from the corresponding literature in the appendix or from the pRolocdata package (https://bioconductor.org/packages/release/data/experiment/html/pRolocdata.html or https://github.com/lgatto/pRolocdata). TransGCN is freely available on GitHub at https://github.com/XuejiangGuo/TransGCN.

Supplementary Material

Supplementary_bbae055

supplementary_bbae055.docx (22.4 MB, docx)

Author Biographies

Bing Wang is a PhD candidate at the Department of Histology and Embryology, State Key Laboratory of Reproductive Medicine and Offspring Health, Nanjing Medical University and School of Medicine, Southeast University. His research interests are bioinformatics and artificial intelligence.

Xiangzheng Zhang is a PhD candidate at the Department of Histology and Embryology, State Key Laboratory of Reproductive Medicine and Offspring Health, Nanjing Medical University. His research interests are proteomics and reproductive biology.

Xudong Han is a PhD candidate at the Department of Histology and Embryology, State Key Laboratory of Reproductive Medicine and Offspring Health, Nanjing Medical University and School of Medicine, Southeast University. His research interests are bioinformatics and artificial intelligence.

Bingjie Hao is a Master's candidate at the Department of Histology and Embryology, State Key Laboratory of Reproductive Medicine, Nanjing Medical University. Her research interests are spermatogenesis and molecular biology.

Yan Li is the director of Department of Clinical Laboratory, Sir Run Run Hospital, Nanjing Medical University. Her research interests are reproductive biology and bioinformatics.

Xuejiang Guo is a professor at the Department of Histology and Embryology, State Key Laboratory of Reproductive Medicine and Offspring Health, Nanjing Medical University. His research interests are reproductive biology, proteomics and bioinformatics.

Contributor Information

Bing Wang, Department of Histology and Embryology, State Key Laboratory of Reproductive Medicine and Offspring Health, Nanjing Medical University, Nanjing 211166, China; School of Medicine, Southeast University, Nanjing 210009, China.

Xiangzheng Zhang, Department of Histology and Embryology, State Key Laboratory of Reproductive Medicine and Offspring Health, Nanjing Medical University, Nanjing 211166, China.

Xudong Han, Department of Histology and Embryology, State Key Laboratory of Reproductive Medicine and Offspring Health, Nanjing Medical University, Nanjing 211166, China; School of Medicine, Southeast University, Nanjing 210009, China.

Bingjie Hao, Department of Histology and Embryology, State Key Laboratory of Reproductive Medicine and Offspring Health, Nanjing Medical University, Nanjing 211166, China.

Yan Li, Department of Clinical Laboratory, Sir Run Run Hospital, Nanjing Medical University, Nanjing 211100, China.

Xuejiang Guo, Department of Histology and Embryology, State Key Laboratory of Reproductive Medicine and Offspring Health, Nanjing Medical University, Nanjing 211166, China.

References

1. Dreger M. Subcellular proteomics. Mass Spectrom Rev 2003;22:27–56.
2. Mulvey CM, Breckels LM, Crook OM, et al. Spatiotemporal proteomic profiling of the pro-inflammatory response to lipopolysaccharide in the THP-1 human leukaemia cell line. Nat Commun 2021;12:5773.
3. Martinez-Val A, Bekker-Jensen DB, Steigerwald S, et al. Spatial-proteomics reveals phospho-signaling dynamics at subcellular resolution. Nat Commun 2021;12:7113.
4. Itzhak DN, Tyanova S, Cox J, Borner GHH. Global, quantitative and dynamic mapping of protein subcellular localization. Elife 2016;5:e16950.
5. Wang RH, Luo T, Guo YP, et al. dbMisLoc: a manually curated database of conditional protein Mis-localization events. Interdiscip Sci 2023;15:433–8.
6. Feigin ME, Akshinthala SD, Araki K, et al. Mislocalization of the cell polarity protein scribble promotes mammary tumorigenesis and is associated with basal breast cancer. Cancer Res 2014;74:3180–94.
7. Davies AK, Itzhak DN, Edgar JR, et al. AP-4 vesicles contribute to spatial control of autophagy via RUSC-dependent peripheral delivery of ATG9A. Nat Commun 2018;9:3958.
8. Eftekharzadeh B, Daigle JG, Kapinos LE, et al. Tau protein disrupts nucleocytoplasmic transport in Alzheimer's disease. Neuron 2019;101:349.
9. Crook OM, Davies CTR, Breckels LM, et al. Inferring differential subcellular localisation in comparative spatial proteomics using BANDLE. Nat Commun 2022;13:5948.
10. Hung MC, Link W. Protein localization in disease and therapy. J Cell Sci 2011;124:3381–92.
11. Wang G, Xue MQ, Shen HB, Xu YY. Learning protein subcellular localization multi-view patterns from heterogeneous data of imaging, sequence and networks. Brief Bioinform 2022;23:bbab539.
12. Wang B, Zhang X, Xu C, et al. DeepSP: a deep learning framework for spatial proteomics. J Proteome Res 2023;22:2186–98.
13. Gatto L, Breckels LM, Lilley KS. Assessing sub-cellular resolution in spatial proteomics experiments. Curr Opin Chem Biol 2019;48:123–49.
14. Orre LM, Vesterlund M, Pan Y, et al. SubCellBarCode: proteome-wide mapping of protein localization and relocalization. Mol Cell 2019;73:166–182.e7.
15. Kretz R, Walter L, Raab N, et al. Spatial proteomics reveals differences in the cellular architecture of antibody-producing CHO and plasma cell-derived cells. Mol Cell Proteomics 2022;21:100278.
16. Jean Beltran PM, Mathias RA, Cristea IM. A portrait of the human organelle proteome in space and time during cytomegalovirus infection. Cell Syst 2016;3:361–373.e6.
17. Hirst J, Itzhak DN, Antrobus R, et al. Role of the AP-5 adaptor protein complex in late endosome-to-Golgi retrieval. PLoS Biol 2018;16:e2004411.
18. Itzhak DN, Davies C, Tyanova S, et al. A mass spectrometry-based approach for mapping protein subcellular localization reveals the spatial proteome of mouse primary neurons. Cell Rep 2017;20:2706–18.
19. Kennedy MA, Hofstadter WA, Cristea IM. TRANSPIRE: a computational pipeline to elucidate intracellular protein movements from spatial proteomics data sets. J Am Soc Mass Spectrom 2020;31:1422–39.
20. Wang J, Hertzmann A, Fleet DJ. Gaussian process dynamical models. In: Weiss Y, (ed). Adv Neural Inf Process Syst. MIT Press. Vancouver, British Columbia, Canada. 2005;18:1441–1448.
21. Kipf TN, Welling M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 2016.
22. Han X, Wang B, Situ C, et al. scapGNN: a graph neural network-based framework for active pathway and gene module inference from single-cell multi-omics data. PLoS Biol 2023;21:e3002369.
23. Breckels LM, Holden SB, Wojnar D, et al. Learning from heterogeneous data sources: an application in spatial proteomics. PLoS Comput Biol 2016;12:e1004920.
24. Valerio HP, Ravagnani FG, Yaya Candela AP, et al. Spatial proteomics reveals subcellular reorganization in human keratinocytes exposed to UVA light. iScience 2022;25:104093.
25. Cheadle C, Vawter MP, Freed WJ, Becker KG. Analysis of microarray data using Z score transformation. J Mol Diagn 2003;5:73–81.
26. Haghverdi L, Lun ATL, Morgan MD, Marioni JC. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat Biotechnol 2018;36:421–7.
27. Song Q, Su J, Zhang W. scGCN is a graph convolutional networks algorithm for knowledge transfer in single cell omics. Nat Commun 2021;12:3826.
28. Gao H, Zhang B, Liu L, et al. A universal framework for single-cell multi-omics data integration with graph convolutional networks. Brief Bioinform 2023;24:bbad081.
29. Chiu W-Y, Yen GG, Juan T-K. Minimum Manhattan distance approach to multiple criteria decision making in multiobjective optimization problems. IEEE Trans Evolut Comput 2016;20:972–85.
30. Coghetto R. Chebyshev distance. Formalized Mathematics 2016;24:121–41.
31. Jurman G, Riccadonna S, Visintainer R, et al. Canberra distance on ranked lists. In: Proceedings of Advances in Ranking NIPS 09 Workshop. 2009:22–27.
32. Danielsson P-E. Euclidean distance mapping. Comput Graph Image Process 1980;14:227–48.
33. Sedgwick P. Pearson’s correlation coefficient. BMJ 2012;345:345.
34. McLachlan GJ. Mahalanobis distance. Resonance 1999;4:20–6.
35. Benej M, Klikovits T, Krajc T, et al. Lymph node log-odds ratio accurately defines prognosis in Resectable non-small cell lung cancer. Cancer 2023;15:15.
36. Mohammadi A, Plataniotis KN. Improper complex-valued Bhattacharyya distance. IEEE Trans Neural Netw Learn Syst 2016;27:1049–64.
37. Beran R. Minimum Hellinger distance estimates for parametric models. Ann Stat 1977;5:445–63.
38. De Boer P-T, Kroese DP, Mannor S, et al. A tutorial on the cross-entropy method. Ann Oper Res 2005;134:19–67.
39. Van Erven T, Harremos P. Rényi divergence and Kullback-Leibler divergence. IEEE Trans Inform Theory 2014;60:3797–820.
40. Menéndez M, Pardo J, Pardo L, Pardo MC. The Jensen-Shannon divergence. J Franklin Inst 1997;334:307–18.
41. Barros RSM, Hidalgo JIG, Lima Cabral DR. Wilcoxon rank sum test drift detector. Neurocomputing 2018;275:1954–63.
42. Kendall MG. The treatment of ties in ranking problems. Biometrika 1945;33:239–51.
43. Norouzi M, Fleet DJ, Salakhutdinov RR. Hamming distance metric learning. Adv Neural Inf Process Syst 2012;25:1061–1069.
44. Van Engelen JE, Hoos HH. A survey on semi-supervised learning. Mach Learn 2020;109:373–440.
45. Breiman L. Random forests. Mach Learn 2001;45:5–32.
46. Hu J, Shen L, Albanie S. Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507 2017.
47. Guo L, Wang Y, Xu X, et al. DeepPSP: a global-local information-based deep neural network for the prediction of protein phosphorylation sites. J Proteome Res 2021;20:346–56.
48. Zhou J, Cui G, Hu S, et al. Graph neural networks: a review of methods and applications. AI Open 2020;1:57–81.
49. Kingma DP, Ba J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 2014.
50. Prechelt L. Early stopping - but when? In: Orr GB, (ed). Neural Networks: Tricks of the Trade, 2nd edn. Springer Berlin Heidelberg. Berlin, Germany. 2002:55–69.
51. Wang B, Wang Y, Chen Y, et al. DeepSCP: utilizing deep learning to boost single-cell proteome coverage. Brief Bioinform 2022;23:bbac214.
52. Li X, Liao M, Wang B, et al. A drug repurposing method based on inhibition effect on gene regulatory network. Comput Struct Biotechnol J 2023;21:4446–55.
53. Azadifar S, Ahmadi A. A novel candidate disease gene prioritization method using deep graph convolutional networks and semi-supervised learning. BMC Bioinformatics 2022;23:422.
54. Abdi H, Williams LJ. Principal component analysis. Wires Comput Stat 2010;2:433–59.
55. Tang K, Ji X, Zhou M, et al. Rank-in: enabling integrative analysis across microarray and RNA-seq for cancer. Nucleic Acids Res 2021;49:e99.
56. Longadge R, Dongre S. Class imbalance problem in data mining review. arXiv preprint arXiv:1305.1707 2013.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary_bbae055

supplementary_bbae055.docx^{(22.4MB, docx)}

Data Availability Statement

[ref1] 1. Dreger M. Subcellular proteomics. Mass Spectrom Rev 2003;22:27–56. [DOI] [PubMed] [Google Scholar]

[ref2] 2. Mulvey CM, Breckels LM, Crook OM, et al. Spatiotemporal proteomic profiling of the pro-inflammatory response to lipopolysaccharide in the THP-1 human leukaemia cell line. Nat Commun 2021;12:5773. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref3] 3. Martinez-Val A, Bekker-Jensen DB, Steigerwald S, et al. Spatial-proteomics reveals phospho-signaling dynamics at subcellular resolution. Nat Commun 2021;12:7113. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref4] 4. Itzhak DN, Tyanova S, Cox J, Borner GHH. Global, quantitative and dynamic mapping of protein subcellular localization. Elife 2016;5:e16950. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref5] 5. Wang RH, Luo T, Guo YP, et al. dbMisLoc: a manually curated database of conditional protein Mis-localization events. Interdiscip Sci 2023;15:433–8. [DOI] [PubMed] [Google Scholar]

[ref6] 6. Feigin ME, Akshinthala SD, Araki K, et al. Mislocalization of the cell polarity protein scribble promotes mammary tumorigenesis and is associated with basal breast cancer. Cancer Res 2014;74:3180–94. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref7] 7. Davies AK, Itzhak DN, Edgar JR, et al. AP-4 vesicles contribute to spatial control of autophagy via RUSC-dependent peripheral delivery of ATG9A. Nat Commun 2018;9:3958. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref8] 8. Eftekharzadeh B, Daigle JG, Kapinos LE, et al. Tau protein disrupts nucleocytoplasmic transport in Alzheimer's disease. Neuron 2019;101:349. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref9] 9. Crook OM, Davies CTR, Breckels LM, et al. Inferring differential subcellular localisation in comparative spatial proteomics using BANDLE. Nat Commun 2022;13:5948. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref10] 10. Hung MC, Link W. Protein localization in disease and therapy. J Cell Sci 2011;124:3381–92. [DOI] [PubMed] [Google Scholar]

[ref11] 11. Wang G, Xue MQ, Shen HB, Xu YY. Learning protein subcellular localization multi-view patterns from heterogeneous data of imaging, sequence and networks. Brief Bioinform 2022;23:bbab539. [DOI] [PubMed] [Google Scholar]

[ref12] 12. Wang B, Zhang X, Xu C, et al. DeepSP: a deep learning framework for spatial proteomics. J Proteome Res 2023;22:2186–98. [DOI] [PubMed] [Google Scholar]

[ref13] 13. Gatto L, Breckels LM, Lilley KS. Assessing sub-cellular resolution in spatial proteomics experiments. Curr Opin Chem Biol 2019;48:123–49. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref14] 14. Orre LM, Vesterlund M, Pan Y, et al. SubCellBarCode: proteome-wide mapping of protein localization and relocalization. Mol Cell 2019;73:166–182.e7. [DOI] [PubMed] [Google Scholar]

[ref15] 15. Kretz R, Walter L, Raab N, et al. Spatial proteomics reveals differences in the cellular architecture of antibody-producing CHO and plasma cell-derived cells. Mol Cell Proteomics 2022;21:100278. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref16] 16. Jean Beltran PM, Mathias RA, Cristea IM. A portrait of the human organelle proteome in space and time during cytomegalovirus infection. Cell Syst 2016;3:361–373.e6. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref17] 17. Hirst J, Itzhak DN, Antrobus R, et al. Role of the AP-5 adaptor protein complex in late endosome-to-Golgi retrieval. PLoS Biol 2018;16:e2004411. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref18] 18. Itzhak DN, Davies C, Tyanova S, et al. A mass spectrometry-based approach for mapping protein subcellular localization reveals the spatial proteome of mouse primary neurons. Cell Rep 2017;20:2706–18. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref19] 19. Kennedy MA, Hofstadter WA, Cristea IM. TRANSPIRE: a computational pipeline to elucidate intracellular protein movements from spatial proteomics data sets. J Am Soc Mass Spectrom 2020;31:1422–39. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref20] 20. Wang J, Hertzmann A, Fleet DJ. Gaussian process dynamical models. In: Weiss Y, (ed). Adv Neural Inf Process Syst. MIT Press. Vancouver, British Columbia, Canada. 2005;18:1441–1448. [Google Scholar]

[ref21] 21. Kipf TN, Welling M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 2016.

[ref22] 22. Han X, Wang B, Situ C, et al. scapGNN: a graph neural network-based framework for active pathway and gene module inference from single-cell multi-omics data. PLoS Biol 2023;21:e3002369. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref23] 23. Breckels LM, Holden SB, Wojnar D, et al. Learning from heterogeneous data sources: an application in spatial proteomics. PLoS Comput Biol 2016;12:e1004920. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref24] 24. Valerio HP, Ravagnani FG, Yaya Candela AP, et al. Spatial proteomics reveals subcellular reorganization in human keratinocytes exposed to UVA light. iScience 2022;25:104093. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref25] 25. Cheadle C, Vawter MP, Freed WJ, Becker KG. Analysis of microarray data using Z score transformation. J Mol Diagn 2003;5:73–81. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref26] 26. Haghverdi L, Lun ATL, Morgan MD, Marioni JC. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat Biotechnol 2018;36:421–7. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref27] 27. Song Q, Su J, Zhang W. scGCN is a graph convolutional networks algorithm for knowledge transfer in single cell omics. Nat Commun 2021;12:3826.

[ref28] 28. Gao H, Zhang B, Liu L, et al. A universal framework for single-cell multi-omics data integration with graph convolutional networks. Brief Bioinform 2023;24:bbad081.

[ref29] 29. Chiu W-Y, Yen GG, Juan T-K. Minimum Manhattan distance approach to multiple criteria decision making in multiobjective optimization problems. IEEE Trans Evolut Comput 2016;20:972–85.

[ref30] 30. Coghetto R. Chebyshev distance. Formalized Mathematics 2016;24:121–41.

[ref31] 31. Jurman G, Riccadonna S, Visintainer R, et al. Canberra distance on ranked lists. In: Proceedings of Advances in Ranking NIPS 09 Workshop. 2009:22–7.

[ref32] 32. Danielsson P-E. Euclidean distance mapping. Comput Graph Image Process 1980;14:227–48.

[ref33] 33. Sedgwick P. Pearson’s correlation coefficient. BMJ 2012;345:345.

[ref34] 34. McLachlan GJ. Mahalanobis distance. Resonance 1999;4:20–6.

[ref35] 35. Benej M, Klikovits T, Krajc T, et al. Lymph node log-odds ratio accurately defines prognosis in resectable non-small cell lung cancer. Cancer 2023;15:15.

[ref36] 36. Mohammadi A, Plataniotis KN. Improper complex-valued Bhattacharyya distance. IEEE Trans Neural Netw Learn Syst 2016;27:1049–64.

[ref37] 37. Beran R. Minimum Hellinger distance estimates for parametric models. Ann Stat 1977;5:445–63.

[ref38] 38. De Boer P-T, Kroese DP, Mannor S, et al. A tutorial on the cross-entropy method. Ann Oper Res 2005;134:19–67.

[ref39] 39. Van Erven T, Harremos P. Rényi divergence and Kullback-Leibler divergence. IEEE Trans Inform Theory 2014;60:3797–820.

[ref40] 40. Menéndez M, Pardo J, Pardo L, Pardo MC. The Jensen-Shannon divergence. J Franklin Inst 1997;334:307–18.

[ref41] 41. Barros RSM, Hidalgo JIG, Lima Cabral DR. Wilcoxon rank sum test drift detector. Neurocomputing 2018;275:1954–63.

[ref42] 42. Kendall MG. The treatment of ties in ranking problems. Biometrika 1945;33:239–51.

[ref43] 43. Norouzi M, Fleet DJ, Salakhutdinov RR. Hamming distance metric learning. Adv Neural Inf Process Syst 2012;25:1061–9.

[ref44] 44. Van Engelen JE, Hoos HH. A survey on semi-supervised learning. Mach Learn 2020;109:373–440.

[ref45] 45. Breiman L. Random forests. Mach Learn 2001;45:5–32.

[ref46] 46. Hu J, Shen L, Albanie S. Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507 2017.

[ref47] 47. Guo L, Wang Y, Xu X, et al. DeepPSP: a global-local information-based deep neural network for the prediction of protein phosphorylation sites. J Proteome Res 2021;20:346–56.

[ref48] 48. Zhou J, Cui G, Hu S, et al. Graph neural networks: a review of methods and applications. AI Open 2020;1:57–81.

[ref49] 49. Kingma DP, Ba J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 2014.

[ref50] 50. Prechelt L. Early stopping - but when? In: Orr GB (ed). Neural Networks: Tricks of the Trade, 2nd edn. Berlin, Germany: Springer Berlin Heidelberg, 2002:55–69.

[ref51] 51. Wang B, Wang Y, Chen Y, et al. DeepSCP: utilizing deep learning to boost single-cell proteome coverage. Brief Bioinform 2022;23:bbac214.

[ref52] 52. Li X, Liao M, Wang B, et al. A drug repurposing method based on inhibition effect on gene regulatory network. Comput Struct Biotechnol J 2023;21:4446–55.

[ref53] 53. Azadifar S, Ahmadi A. A novel candidate disease gene prioritization method using deep graph convolutional networks and semi-supervised learning. BMC Bioinformatics 2022;23:422.

[ref54] 54. Abdi H, Williams LJ. Principal component analysis. WIREs Comput Stat 2010;2:433–59.

[ref55] 55. Tang K, Ji X, Zhou M, et al. Rank-in: enabling integrative analysis across microarray and RNA-seq for cancer. Nucleic Acids Res 2021;49:e99.

[ref56] 56. Longadge R, Dongre S. Class imbalance problem in data mining review. arXiv preprint arXiv:1305.1707 2013.
