Detection of Protein Complexes Based on Penalized Matrix Decomposition in a Sparse Protein–Protein Interaction Network

Buwen Cao; Shuguang Deng; Hua Qin; Pingjian Ding; Shaopeng Chen; Guanghui Li

2018 Jun 15;23(6):1460. doi: 10.3390/molecules23061460

Editors: Xiangxiang Zeng, Alfonso Rodríguez-Patón, Quan Zou

PMCID: PMC6100434 PMID: 29914123

Abstract

High-throughput technology has generated large-scale protein interaction data, which are crucial to our understanding of biological organisms. Many algorithms have been developed to identify protein complexes. However, most of these methods are suitable only for dense protein interaction networks, because their performance degrades rapidly when they are applied to sparse protein–protein interaction (PPI) networks. In this study, a novel method based on penalized matrix decomposition (PMD), named PMD_pc, was developed to detect protein complexes in the human protein interaction network. The method consists of three steps. First, the adjacency matrix of the protein interaction network is normalized. Second, the normalized matrix is decomposed into three factor matrices; by imposing appropriate constraints on the factor matrices, PMD_pc can detect protein complexes in sparse PPI networks. Finally, the results of our method are compared with those of other methods on a human PPI network. Experimental results show that our method not only outperforms classical algorithms, such as CFinder, ClusterONE, RRW, HC-PIN, and PCE-FR, but also achieves the best overall performance in terms of a composite score consisting of F-measure, accuracy (ACC), and the maximum matching ratio (MMR).

Keywords: protein–protein interaction (PPI), clustering, protein complex, penalized matrix decomposition

1. Introduction

The identification of protein complexes helps researchers understand biological processes and determine the inherent organizational structures within cells [1]. The rapid development of computational methods has produced many protein complex identification algorithms for protein–protein interaction (PPI) networks, which are generally organized into three categories. The first category comprises clustering methods, which are further divided into three subcategories. First, local search approaches based on density identify densely connected subgraphs in PPI networks and regard subgraphs with a density above a pre-defined threshold as protein complexes. Examples include MCODE (Molecular Complex Detection) [2], CFinder (a software tool for network cluster detection) [3], DPClus (a Density-Periphery based graph CLustering software) [4], and ICPM (Iterative Clique Percolation Method) [5]. However, these approaches tend to neglect peripheral proteins that are connected to the kernel clusters by sparse links, which can represent experimentally validated true interactions [6]. Another kind of method uses classical hierarchical clustering techniques, which mainly depend on the distance between proteins to detect meaningful groups [6]; examples are HC-PIN (fast Hierarchical Clustering algorithm for Protein Interaction Network, an agglomerative method) [7] and the G-N algorithm (a divisive method) [8]. With the further development of clustering technology, many hierarchical clustering methods employ similarities among proteins that are calculated on the basis of network topology characteristics or biological meaning. Such approaches mainly include NEMO (NEtwork MOdule identification) [9], ClusterONE (Clustering algorithm with Overlapping Neighborhood Expansion) [10], RFC (Rough Fuzzy Clustering) [11], MINE (Module Identification in Networks) [12], PageRankNibble [13], SPICi (Speed and Performance In Clustering) [14], PCE-FR (Pseudo-Clique Extension based on Fuzzy Relation) [15], MTGO (Module detection via Topological information and GO knowledge) [16], WCOACH (Weighted COACH) [17], DCAFP (Density-based Clustering Approach for identifying overlapping protein complexes with Functional Preferences) [18], and cwMINE (Combined Weight of Module Identification in Networks) [19]. Experimental results show that these newer methods greatly outperform classical hierarchical clustering approaches. In addition to the aforementioned clustering approaches, many other protein complex detection algorithms, such as RNSC (Restricted Neighborhood Search Clustering) [20], MCL [1], RRW (Repeated Random Walks algorithm) [21], CMC (Clustering-based on Maximal Cliques) [22], COACH [23], and AP (Affinity Propagation) with its variant [24], have achieved satisfactory results.

Another type of method employs intelligent optimization algorithms, which seek the optimal solution for a PPI network based on heuristic concepts [25]. For large databases, however, the complexity of intelligent optimization algorithms is too high to obtain accurate results. The major weakness of the aforementioned methods is that their performance deteriorates when they are applied to sparse PPI networks [19,26]. Matrix decomposition has been proposed to address this problem. A co-clustering algorithm based on the adjacency matrix of PPI networks was proposed in [6] and successfully obtained both overlapping and non-overlapping protein complexes. The results show that the method achieved a remarkable balance between network coverage and accuracy (ACC) and outperformed classical methods. Matrix factorization methods can be organized into two main classes. The first is non-negative matrix factorization (NMF), which integrates gene ontology (GO), gene expression data, and the PPI network to form the corresponding adjacency matrix and then decomposes it with common factors to obtain overlapping functional modules with high ACC [27]. Zhang et al. [28] proposed sparse network-regularized multiple NMF (SNMNMF) to identify microRNA regulatory modules and demonstrated the good performance of the proposed method on an ovarian cancer dataset. The second is penalized matrix decomposition (PMD), which has been widely applied to various datasets, such as microarray data [29], including gene expression data, and proteomic datasets [30].

Inspired by Ref. [24], we propose PMD_pc, an approach for identifying protein complexes in protein interaction networks. First, the adjacency matrix of the protein interaction network is normalized. Second, the normalized matrix is decomposed into three factor matrices. Finally, the PMD_pc algorithm and several classical algorithms are run on the well-investigated human PPI network. The experimental results show that our approach achieves satisfactory performance in terms of F-measure, ACC, and maximum matching ratio (MMR).

2. Results and Discussion

When PMD_pc is applied to identify protein complexes in a PPI network, the parameters $c_{1}$, $c_{2}$, and $k$ are crucial for the decomposition of the network. Considering that $u$ should be sparse, we take $c_{1} = 0.25 \times \sqrt{n}$ and $c_{2} = 0.25 \times \sqrt{p}$ [31].

To study the effect of the parameter $k$ on the experimental results, we ran the algorithm repeatedly, varying $k$ over $(0, 2500]$ in increments of 100, and examined how the F-measure behaves. The detailed experimental results for different values of $k$ are presented in Figure 1. Figure 1 clearly shows that the results are unsatisfactory when $k$ is less than 1000.

Figure 1. Values of F-measure for different values of $k \in (0, 2500]$, in increments of 100, on the HPRD dataset.

The F-measure increases gradually with $k$ until $k = 1600$, where it reaches its maximum value of 0.398, and then remains steady as $k$ changes from 1600 to 2000. When $k$ is greater than 2000, the F-measure shows a downward trend. Therefore, $k$ is set to 2000.

Five classical protein complex detection algorithms, namely, CFinder [3], ClusterONE [10], RRW [21], HC-PIN [7], and PCE-FR [15], are applied to the human PPI network of the Human Protein Reference Database (HPRD) for comparison with $PMD_{pc}$. Complexes with sizes less than 2 that are produced by the aforementioned algorithms are filtered out in our work. Moreover, the parameters of each compared method are set to the default values recommended by its authors. The experimental results are shown in Table 1.

Table 1. Results of six protein complex detection algorithms on the HPRD dataset.

Algorithm	Number of complexes	Precision	Recall	F-measure	ACC	Sep	MMR	MCC
CFinder	49	0.959	0.143	0.249	0.184	0.165	0.017	0.327
ClusterONE	755	0.295	0.186	0.229	0.333	0.209	0.084	0.391
RRW	167	0.671	0.190	0.296	0.236	0.231	0.034	0.209
HC-PIN	99	0.646	0.140	0.230	0.256	0.233	0.024	0.196
PCE-FR	274	0.534	0.178	0.267	0.279	0.169	0.029	0.035
$PMD_{pc}$	118	0.451	0.356	0.398	0.362	0.777	0.010	0.343

Table 1 shows that $PMD_{pc}$ achieves satisfactory performance on the human PPI network. In particular, $PMD_{pc}$ obtains the highest values of recall, F-measure, ACC, and Sep, which are 0.356, 0.398, 0.362, and 0.777, respectively, and which exceed those of the five other algorithms. CFinder achieves the highest precision (0.959) but a low MMR (0.017). ClusterONE identifies 755 protein complexes and achieves the highest MMR (0.084), whereas $PMD_{pc}$ has the lowest MMR (0.010). Overall, these values indicate that our approach performs well in identifying protein complexes from sparse PPI networks.

Table 1 also shows that our method obtains the second highest value of MCC (0.343), which is 12.28% lower than that of ClusterONE (0.391). This suggests that our method performs satisfactorily when dealing with imbalanced data.

To avoid bias toward any single evaluation metric, the composite score [24] is employed to summarize the overall performance. The composite score of our method, which combines F-measure, ACC, and MMR, is higher than that of every other method. Figure 2 presents the comparison results of the six algorithms on the HPRD dataset. The composite score of $PMD_{pc}$ is 0.770, which is 19.20% higher than the highest value among the five other methods. This further demonstrates the effectiveness of our method.
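The composite score can be checked directly against Table 1. The short Python sketch below assumes that the composite score is the plain sum of F-measure, ACC, and MMR (consistent with the stacked columns of Figure 2); the names `table1` and `composite_score` are illustrative and not part of the original implementation.

```python
# F-measure, ACC, and MMR copied from Table 1 (HPRD dataset).
table1 = {
    "CFinder":    (0.249, 0.184, 0.017),
    "ClusterONE": (0.229, 0.333, 0.084),
    "RRW":        (0.296, 0.236, 0.034),
    "HC-PIN":     (0.230, 0.256, 0.024),
    "PCE-FR":     (0.267, 0.279, 0.029),
    "PMD_pc":     (0.398, 0.362, 0.010),
}


def composite_score(f_measure, acc, mmr):
    """Assumed definition: the sum of the three component scores."""
    return f_measure + acc + mmr


scores = {name: composite_score(*values) for name, values in table1.items()}

# Best composite score among the five competing methods.
best_other = max(score for name, score in scores.items() if name != "PMD_pc")
relative_gain = (scores["PMD_pc"] - best_other) / best_other

for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name:<11s} {score:.3f}")
print(f"gain over best competitor: {relative_gain:.2%}")
```

Under this assumption the sum for $PMD_{pc}$ is 0.770 and the best competitor is ClusterONE with 0.646, which reproduces the reported relative improvement of about 19.2%.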

Figure 2. Comparison of the results of the six algorithms on the HPRD dataset using the CHPC2012 gold standard dataset. Columns correspond to the following algorithms: CFinder, ClusterONE, HC-PIN, PCE-FR, and $PMD_{pc}$, from left to right. The colors within a column denote the individual components of the composite score of the algorithm (cyan = F-measure, blue = ACC, and purple = MMR). The total height of each column is the value of the composite score for a given algorithm on a given dataset. A larger score indicates a better clustering result.

3. Materials and Methods

3.1. Materials and Datasets

Our method is applied to detect protein complexes in the human PPI dataset downloaded from Ref. [24], which includes 9459 proteins and 36,935 interactions and has a density of 0.0008. The gold standard dataset used to evaluate the protein complexes identified in sparse PPI networks is CHPC2012 [32], which integrates three databases, namely, CORUM [33], HPRD [34], and PINdb [35], and includes 1389 complexes and 3065 proteins.
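The reported density can be reproduced from the node and edge counts. The sketch below assumes the usual definition of density for a simple undirected graph, $2|E| / (|V|(|V| - 1))$; the function name `network_density` is illustrative.

```python
def network_density(num_nodes: int, num_edges: int) -> float:
    """Density of a simple undirected graph: 2|E| / (|V| * (|V| - 1))."""
    if num_nodes < 2:
        return 0.0
    return 2.0 * num_edges / (num_nodes * (num_nodes - 1))


# Counts reported for the human PPI dataset.
density = network_density(9459, 36935)
print(f"{density:.4f}")
```

With 9459 proteins and 36,935 interactions this gives roughly 0.0008, i.e., fewer than one in a thousand possible protein pairs is connected.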

3.2. Methods

Consider a dataset that consists of $p$ features measured on $n$ samples, which is described by a matrix $X$ of size $n \times p$ [30]. Without loss of generality, we assume that the column and row means of $X$ are zero. The singular value decomposition of matrix $X$ can be written as follows:

X = U Δ V^{T}, U^{T} U = I_{n}, V^{T} V = I_{p}

(1)

A sparse decomposition is obtained by imposing additional constraints on $U$ and $V$. The single-factor PMD can be obtained by optimizing the following objective function [30]:

\begin{array}{l} \arg \min_{δ, u, v} \frac{1}{2} {‖ η - δ u v^{T} ‖}_{F}^{2}, \\ \text{s.t. } {‖ u ‖}_{2}^{2} = 1, {‖ v ‖}_{2}^{2} = 1, \\ P_{1} (u) \leq c_{1}, P_{2} (v) \leq c_{2}, δ \geq 0 . \end{array}

(2)

in which $η$ denotes the data matrix $X$, $u$ is a column of $U$, $v$ is a column of $V$, $δ$ is a diagonal element of $Δ$, ${‖ \cdot ‖}_{F}$ is the Frobenius norm, and $P_{1}$ and $P_{2}$ are penalty functions that can take a variety of forms [30].

Let $U$ and $V$ be $n \times R$ and $p \times R$ orthogonal matrices, respectively, and $Δ$ a diagonal matrix with diagonal elements $δ_{r}$. Then [30]

\frac{1}{2} {‖ η - U Δ V^{T} ‖}_{F}^{2} = \frac{1}{2} {‖ η ‖}_{F}^{2} - \sum_{r = 1}^{R} u_{r}^{T} η v_{r} δ_{r} + \frac{1}{2} \sum_{r = 1}^{R} δ_{r}^{2}

(3)

Therefore, when $R = 1$, the vectors $u$ and $v$ that solve Equation (2) also solve the following problem:

\begin{array}{l} \arg \max_{u, v} u^{T} η v \\ \text{s.t. } {‖ u ‖}_{2}^{2} = 1, {‖ v ‖}_{2}^{2} = 1, P_{1} (u) \leq c_{1}, P_{2} (v) \leq c_{2} \end{array}

(4)

Moreover, the value of $δ$ that solves Equation (2) is $δ = u^{T} η v$.
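The step from Equation (3) to Equation (4) can be made explicit. The following short derivation is a sketch for the single-factor case; it assumes that the unit-norm constraints on $u$ and $v$ hold and treats $u$ and $v$ as fixed while $δ$ is optimized.

```latex
\begin{aligned}
\frac{1}{2} \| \eta - \delta u v^{T} \|_{F}^{2}
  &= \frac{1}{2} \| \eta \|_{F}^{2} - \delta \, u^{T} \eta v + \frac{1}{2} \delta^{2}
  && \text{(Equation (3) with } R = 1 \text{)} \\
\frac{\partial}{\partial \delta} \left( - \delta \, u^{T} \eta v + \frac{1}{2} \delta^{2} \right) = 0
  &\;\Longrightarrow\; \delta = u^{T} \eta v \\
\min_{\delta} \frac{1}{2} \| \eta - \delta u v^{T} \|_{F}^{2}
  &= \frac{1}{2} \| \eta \|_{F}^{2} - \frac{1}{2} \left( u^{T} \eta v \right)^{2}
\end{aligned}
```

Because the first term on the last line does not depend on $u$ or $v$ and $δ \geq 0$ is required, minimizing the objective over $u$ and $v$ amounts to maximizing $u^{T} η v$ under the same constraints.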

The optimization problem in Equation (4) can be relaxed to the following biconvex optimization problem [30]:

\begin{array}{l} \arg \max_{u, v} u^{T} η v \\ \text{s.t. } {‖ u ‖}_{2}^{2} \leq 1, {‖ v ‖}_{2}^{2} \leq 1, P_{1} (u) \leq c_{1}, P_{2} (v) \leq c_{2} \end{array}

(5)

For appropriately chosen values of $c_{1}$ and $c_{2}$, the solution of Equation (5) also satisfies Equation (4) [30]. Equation (5) is called the single-factor PMD, and the iterative algorithm used to optimize it is described in Algorithm 1:

Algorithm 1. Calculating the single factor of PMD.

Step 1. Initialize $v$ to have unit $L_{2}$-norm.
Step 2. Iterate until convergence:
(i) $u \leftarrow \arg \max_{u} u^{T} η v$, s.t. ${‖ u ‖}_{2}^{2} \leq 1, P_{1} (u) \leq c_{1}$
(ii) $v \leftarrow \arg \max_{v} u^{T} η v$, s.t. ${‖ v ‖}_{2}^{2} \leq 1, P_{2} (v) \leq c_{2}$
Step 3. $δ \leftarrow u^{T} η v$

The single-factor PMD is applied repeatedly to obtain further PMD factors. The corresponding algorithm is described in Algorithm 2.

Algorithm 2. Calculating $k$ factors of PMD.

Step 1. $η^{1} \leftarrow η$;
Step 2. For $r \in \{1, 2, \dots, R\}$, where $R = k$ is the number of factors:
(i) The single-factor PMD (Algorithm 1) is executed on the matrix $η^{r}$ to compute $u_{r}$, $v_{r}$, and $δ_{r}$;
(ii) $η^{r + 1} \leftarrow η^{r} - δ_{r} u_{r} v_{r}^{T}$

An $L_{1}$-norm constraint is imposed on $u$ and $v$, i.e., ${‖ u ‖}_{1} \leq c_{1}, {‖ v ‖}_{1} \leq c_{2}$. By selecting the parameters $c_{1}$ and $c_{2}$ appropriately, PMD makes the factors $u$ and $v$ sparse. Generally, $c_{1}$ and $c_{2}$ should be restricted to the ranges $1 \leq c_{1} \leq \sqrt{n}$ and $1 \leq c_{2} \leq \sqrt{p}$. Thus, the PMD method takes the form $PMD (L_{1}, L_{2})$, which is described as follows:

\begin{array}{l} \arg \max_{u, v} u^{T} η v \\ \text{s.t. } {‖ u ‖}_{2}^{2} \leq 1, {‖ v ‖}_{2}^{2} \leq 1, {‖ u ‖}_{1} \leq c_{1}, {‖ v ‖}_{1} \leq c_{2} \end{array}

(6)

Let $S$ denote the soft-thresholding operator, i.e., $S (a, c) = sgn (a) {(| a | - c)}_{+}$, in which $c > 0$ and $x_{+} = \begin{cases} x & x > 0 \\ 0 & x \leq 0 \end{cases}$. The corresponding theorem is as follows:

Theorem 1.

Consider the optimization problem

$\begin{array}{l} \arg \max_{u} u^{T} a \\ \text{s.t. } {‖ u ‖}_{2}^{2} \leq 1, {‖ u ‖}_{1} \leq c . \end{array}$

(7)

The solution is $u = \frac{S (a, Δ)}{{‖ S (a, Δ) ‖}_{2}}$, in which the scalar threshold $Δ$ (distinct from the diagonal matrix $Δ$ of Equation (1)) is $Δ = 0$ if this results in ${‖ u ‖}_{1} \leq c$; otherwise, $Δ > 0$ is chosen such that ${‖ u ‖}_{1} = c$. The detailed proof of the theorem can be found in Ref. [30]. This analysis provides the updates needed to solve Equation (6) with Algorithm 1. According to Theorem 1, the single-factor PMD can be optimized as shown in Algorithm 3:

Algorithm 3. The optimization process of the single-factor PMD.

Step 1. Initialize $v$ to have unit $L_{2}$-norm.
Step 2. Iterate until convergence:
(i) $u \leftarrow \frac{S (X v, Δ_{1})}{{‖ S (X v, Δ_{1}) ‖}_{2}}$, in which $Δ_{1} = 0$ if this results in ${‖ u ‖}_{1} \leq c_{1}$; otherwise, $Δ_{1} > 0$ is chosen such that ${‖ u ‖}_{1} = c_{1}$
(ii) $v \leftarrow \frac{S (X^{T} u, Δ_{2})}{{‖ S (X^{T} u, Δ_{2}) ‖}_{2}}$, in which $Δ_{2} = 0$ if this results in ${‖ v ‖}_{1} \leq c_{2}$; otherwise, $Δ_{2} > 0$ is chosen such that ${‖ v ‖}_{1} = c_{2}$
Step 3. $δ \leftarrow u^{T} X v$

To obtain sparse factors $u$ and $v$, we let $c_{1} = c \sqrt{n}$ and $c_{2} = c \sqrt{p}$, and the values of $Δ_{1}$ and $Δ_{2}$ are selected by binary search.
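The iteration in Algorithm 3, together with the deflation step of Algorithm 2, can be sketched in a few lines of standard-library Python. This is a minimal illustration rather than the authors' Java implementation: the function names, the initialization of $v$ from the row of largest norm, the convergence tolerance, and the toy matrix are all assumptions made for the example.

```python
import math


def l1_norm(a):
    return sum(abs(x) for x in a)


def l2_norm(a):
    return math.sqrt(sum(x * x for x in a))


def soft_threshold(a, delta):
    """S(a, delta) = sgn(a) * max(|a| - delta, 0), applied element-wise."""
    return [math.copysign(max(abs(x) - delta, 0.0), x) for x in a]


def sparse_unit_vector(a, c, iterations=60):
    """Theorem 1 update: S(a, delta) / ||S(a, delta)||_2 with L1 norm <= c."""
    norm = l2_norm(a)
    if norm == 0.0:
        return [0.0] * len(a)
    u = [x / norm for x in a]
    if l1_norm(u) <= c:               # constraint inactive, so delta = 0
        return u
    lo, hi = 0.0, max(abs(x) for x in a)
    for _ in range(iterations):       # binary search for the threshold delta
        mid = 0.5 * (lo + hi)
        s = soft_threshold(a, mid)
        n = l2_norm(s)
        if n == 0.0 or l1_norm(s) / n < c:
            hi = mid                  # shrunk too much
        else:
            lo = mid                  # still too dense
    s = soft_threshold(a, lo)         # lo always yields a non-zero vector
    n = l2_norm(s)
    return [x / n for x in s]


def mat_vec(X, v):
    return [sum(x * y for x, y in zip(row, v)) for row in X]


def mat_t_vec(X, u):
    return [sum(row[j] * ui for row, ui in zip(X, u)) for j in range(len(X[0]))]


def pmd_single_factor(X, c1, c2, max_iter=100, tol=1e-10):
    """Algorithm 3 for one factor; returns (u, v, delta)."""
    start = max(X, key=l2_norm)       # assumed initialization: largest row
    n0 = l2_norm(start)
    p = len(start)
    v = [x / n0 for x in start] if n0 > 0 else [1.0 / math.sqrt(p)] * p
    for _ in range(max_iter):
        u = sparse_unit_vector(mat_vec(X, v), c1)
        v_new = sparse_unit_vector(mat_t_vec(X, u), c2)
        change = l2_norm([a - b for a, b in zip(v_new, v)])
        v = v_new
        if change < tol:
            break
    delta = sum(ui * xi for ui, xi in zip(u, mat_vec(X, v)))
    return u, v, delta


def pmd(X, k, c1, c2):
    """Algorithm 2: extract k factors by repeated deflation."""
    residual = [row[:] for row in X]
    factors = []
    for _ in range(k):
        u, v, delta = pmd_single_factor(residual, c1, c2)
        factors.append((u, v, delta))
        residual = [[residual[i][j] - delta * u[i] * v[j]
                     for j in range(len(v))] for i in range(len(u))]
    return factors


# Toy matrix with two blocks (hypothetical data, not from the paper).
X = [[2.0, 2.0, 0.0, 0.0],
     [2.0, 2.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 1.0],
     [0.0, 0.0, 1.0, 1.0]]
for u, v, delta in pmd(X, k=2, c1=1.5, c2=1.5):
    members = [i for i, x in enumerate(u) if abs(x) > 1e-6]
    print(round(delta, 3), members)
```

On this toy input the non-zero entries of each $u_{r}$ pick out one block of rows, which is the sense in which sparse factors can be read as clusters. The bounds $c_{1} = c_{2} = 1.5$ are chosen only to fit the $4 \times 4$ example; the experiments in this paper use $c = 0.25$.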

To evaluate the results, the discovered protein complexes are matched against the gold standard dataset. The following evaluation measures are employed in this study.

F-measure. Let $p$ and $g$ be two protein complexes taken from the predicted protein complex set and the gold standard set, respectively. The overlapping score $os (p, g)$ quantifies the closeness between the two complexes and is defined as follows [24]:

os (p, g) = \frac{{| C_{p} \cap C_{g} |}^{2}}{| C_{p} | \cdot | C_{g} |}

(8)

in which $C_{p}$ and $C_{g}$ denote the protein sets of complexes $p$ and $g$, respectively. If $os (p, g) \geq θ$, then the two complexes are matched, in which $θ$ is the threshold. $θ$ is set to 0.2, which is consistent with many experiments on protein complex identification [24]. Let $P$ and $G$ represent the detected protein complex set and the gold standard set, respectively; $N_{c p}$ is the number of identified protein complexes that match at least one gold standard complex, i.e., $N_{c p} = | \{ p | p \in P, \exists g \in G, os (p, g) \geq θ \} |$; and $N_{c g}$ is the number of gold standard protein complexes that match at least one identified complex, i.e., $N_{c g} = | \{ g | g \in G, \exists p \in P, os (p, g) \geq θ \} |$. F-measure is mathematically defined as [24]

\text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}

(9)

in which $\text{Precision} = N_{c p} / | P |$ and $\text{Recall} = N_{c g} / | G |$. The F-measure is the harmonic mean of precision and recall and evaluates the overall performance of the detection methods.
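The matching procedure behind precision, recall, and F-measure can be illustrated as follows. The sketch assumes the squared-intersection form of the overlapping score, $|C_{p} \cap C_{g}|^{2} / (|C_{p}| \cdot |C_{g}|)$, and represents each complex as a Python set of protein identifiers; the function names and the toy complexes are hypothetical.

```python
def overlap_score(p, g):
    """Overlapping score |p & g|^2 / (|p| * |g|) between two protein sets."""
    if not p or not g:
        return 0.0
    common = len(p & g)
    return common * common / (len(p) * len(g))


def precision_recall_f(predicted, gold, theta=0.2):
    """Precision, recall, and F-measure with matching threshold theta."""
    # N_cp: predicted complexes that match at least one gold complex.
    n_cp = sum(1 for p in predicted
               if any(overlap_score(p, g) >= theta for g in gold))
    # N_cg: gold complexes that match at least one predicted complex.
    n_cg = sum(1 for g in gold
               if any(overlap_score(p, g) >= theta for p in predicted))
    precision = n_cp / len(predicted) if predicted else 0.0
    recall = n_cg / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy complexes (hypothetical protein identifiers).
predicted = [{"A", "B", "C"}, {"D", "E"}, {"X", "Y"}]
gold = [{"A", "B", "C", "D"}, {"E", "F", "G"}, {"H", "I"}, {"J", "K"}]

precision, recall, f_measure = precision_recall_f(predicted, gold)
print(round(precision, 3), round(recall, 3), round(f_measure, 3))
```

In this toy case only the first predicted complex reaches the threshold (its score against the first gold complex is 9/12 = 0.75), so precision is 1/3, recall is 1/4, and the F-measure is 2/7.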

ACC (Accuracy). ACC quantifies the quality of the detected protein complexes and is the geometric mean of the sensitivity ($S_{n}$) and the positive predictive value (PPV). The corresponding formulas are as follows [24]:

ACC = \sqrt{S_{n} \times PPV}

(10)

in which $S_{n} = \frac{\sum_{i = 1}^{n} \max_{j = 1}^{m} t_{i j}}{\sum_{i = 1}^{n} n_{i}}$ , $P P V = \frac{\sum_{j = 1}^{m} \max_{i = 1}^{n} t_{i j}}{\sum_{j = 1}^{m} \sum_{i = 1}^{n} t_{i j}}$ .
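A small worked example may help in reading the two ratios. In the sketch below, $t_{ij}$ is taken to be the number of proteins shared by the $i$th gold standard complex and the $j$th detected complex and $n_{i}$ the size of the $i$th gold standard complex; the function name `accuracy` and the toy complexes are hypothetical.

```python
import math


def accuracy(gold, predicted):
    """Return (Sn, PPV, ACC) for lists of gold and predicted protein sets."""
    # t[i][j]: overlap between gold complex i and predicted complex j.
    t = [[len(g & p) for p in predicted] for g in gold]
    total_gold = sum(len(g) for g in gold)
    total_overlap = sum(sum(row) for row in t)
    if not predicted or total_gold == 0 or total_overlap == 0:
        return 0.0, 0.0, 0.0
    # Sensitivity: best overlap per gold complex over total gold size.
    sn = sum(max(row) for row in t) / total_gold
    # PPV: best overlap per predicted complex over the sum of all overlaps.
    column_max = [max(t[i][j] for i in range(len(gold)))
                  for j in range(len(predicted))]
    ppv = sum(column_max) / total_overlap
    return sn, ppv, math.sqrt(sn * ppv)


# Toy complexes (hypothetical protein identifiers).
gold = [{"A", "B", "C", "D"}, {"E", "F"}]
predicted = [{"A", "B", "C"}, {"D", "E", "F"}]

sn, ppv, acc = accuracy(gold, predicted)
print(round(sn, 4), round(ppv, 4), round(acc, 4))
```

Here the overlap matrix is [[3, 1], [0, 2]], so both $S_{n}$ and PPV equal 5/6, and ACC, their geometric mean, is also 5/6.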

Sep (Separation). To avoid the case wherein the proteins of a gold standard complex are matched with several identified protein complexes, Sep is used to measure the one-to-one correspondence between the identified protein complexes and the gold standard protein complexes. The formula is as follows [24]:

{Sep}_{g} = \frac{\sum_{i = 1}^{n} \sum_{j = 1}^{m} {Sep}_{i j}}{n}, {Sep}_{p} = \frac{\sum_{j = 1}^{m} \sum_{i = 1}^{n} {Sep}_{i j}}{m}, Sep = \sqrt{{Sep}_{g} \times {Sep}_{p}},

(11)

in which ${Sep}_{i j} = \frac{{(t_{i j})}^{2}}{\sum_{i = 1}^{n} t_{i j} \times \sum_{j = 1}^{m} t_{i j}}$. In Formulas (10) and (11), $n$ is the number of protein complexes in the gold standard dataset, $m$ is the number of identified protein complexes, $t_{i j}$ denotes the size of the intersection between the $i$th gold standard complex and the $j$th detected complex, and $n_{i}$ denotes the number of proteins included in the $i$th gold standard complex.
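Formula (11) can be traced on a small overlap matrix. The following sketch computes ${Sep}_{ij}$ from row and column sums of $t_{ij}$ and skips cells whose row or column sum is zero, which is an assumption made here to avoid division by zero; the function name `separation` and the toy complexes are hypothetical.

```python
import math


def separation(gold, predicted):
    """Return (Sep_g, Sep_p, Sep) for lists of gold and predicted protein sets."""
    n, m = len(gold), len(predicted)
    if n == 0 or m == 0:
        return 0.0, 0.0, 0.0
    # t[i][j]: overlap between gold complex i and predicted complex j.
    t = [[len(g & p) for p in predicted] for g in gold]
    row_sum = [sum(row) for row in t]
    col_sum = [sum(t[i][j] for i in range(n)) for j in range(m)]
    total = 0.0
    for i in range(n):
        for j in range(m):
            denominator = row_sum[i] * col_sum[j]
            if denominator > 0:        # assumed: empty rows/columns add 0
                total += t[i][j] ** 2 / denominator
    sep_g = total / n
    sep_p = total / m
    return sep_g, sep_p, math.sqrt(sep_g * sep_p)


# Toy complexes (hypothetical protein identifiers).
gold = [{"A", "B", "C", "D"}, {"E", "F"}]
predicted = [{"A", "B", "C"}, {"D", "E", "F"}]

sep_g, sep_p, sep = separation(gold, predicted)
print(round(sep_g, 4), round(sep_p, 4), round(sep, 4))
```

For the overlap matrix [[3, 1], [0, 2]] the four cell values are 9/12, 1/12, 0, and 4/6, which sum to 1.5, so ${Sep}_{g} = {Sep}_{p} = Sep = 0.75$. A perfect one-to-one correspondence would give a value of 1.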

MMR (Maximum Matching Ratio). MMR describes the maximum one-to-one matching between the identified and gold standard protein complexes and is defined as follows [24]:

MMR (g, p) = \frac{\sum_{i = 1}^{n} \max_{j = 1}^{m} os (g_{i}, p_{j})}{N_{i}}

(12)

in which $os$ represents the overlapping score between two protein complexes, $g_{i}$ is the $i$th gold standard complex, and $p_{j}$ represents the $j$th identified protein complex.
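The sum in Equation (12) can be evaluated directly. The sketch below follows the formula as written, taking the best overlapping score for each gold standard complex, and makes two assumptions that are not stated in the text: the denominator is taken to be the number of gold standard complexes $n$, and the overlapping score uses the squared-intersection form. Function names and toy complexes are hypothetical.

```python
def overlap_score(a, b):
    """Overlapping score |a & b|^2 / (|a| * |b|) between two protein sets."""
    if not a or not b:
        return 0.0
    common = len(a & b)
    return common * common / (len(a) * len(b))


def maximum_matching_ratio(gold, predicted):
    """Average, over gold complexes, of the best overlapping score."""
    if not gold or not predicted:
        return 0.0
    best = [max(overlap_score(g, p) for p in predicted) for g in gold]
    return sum(best) / len(gold)       # assumed denominator: n gold complexes


# Toy complexes (hypothetical protein identifiers).
gold = [{"A", "B", "C", "D"}, {"E", "F"}]
predicted = [{"A", "B", "C"}, {"D", "E", "F"}]

mmr = maximum_matching_ratio(gold, predicted)
print(round(mmr, 4))
```

In the toy case the best scores are 9/12 for the first gold complex and 4/6 for the second, giving an MMR of 17/24. Note that a strict one-to-one matching would require a maximum-weight bipartite matching rather than an independent maximum per gold complex; this sketch does not enforce that.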

MCC (Matthews Correlation Coefficient). MCC is widely used in bioinformatics as a performance metric that can handle imbalanced data. The formula is described as follows [24]:

MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN) (TP + FP) (TN + FP) (TN + FN)}}

(13)

in which $TP$, $TN$, $FP$, and $FN$ denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.
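Equation (13) translates into a few lines of code. The sketch below takes the four counts as inputs and returns 0 when the denominator vanishes, which is a common convention adopted here as an assumption; how the counts are derived from the predicted complexes is not specified in this section and is therefore left out.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from the four confusion counts."""
    denominator = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    if denominator == 0.0:
        return 0.0                     # assumed convention for empty margins
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts for an imbalanced problem (few positives).
value = matthews_corrcoef(tp=40, tn=900, fp=10, fn=50)
print(round(value, 3))
```

The coefficient ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction), and unlike plain accuracy it is not dominated by the large number of true negatives in the example.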

3.3. Detection of Protein Complexes Using $PMD_{pc}$

A PPI network is usually modeled as an undirected weighted graph $G = (V, E, ω)$, in which $V$ represents the set of nodes (proteins), $E$ is the set of edges (interacting protein pairs), and $ω$ is the set of similarity values of the protein pairs. The similarity of GO (Gene Ontology) terms is mathematically expressed as follows [36]:

Sim (i, j) = \frac{| N (i) \cap N (j) |}{\min (| N (i) |, | N (j) |)}

(14)

in which $Sim (i, j)$ indicates the GO similarity of the protein pair $(i, j)$ and $N (i)$ denotes the set of GO terms that annotate protein $i$. The PPI network is stored as a matrix $X$ of size $n \times n$, which is transformed by principal component analysis (PCA) into the vertex–PCA matrix $X$ of size $n \times p$, in which each row of $X$ represents a protein in all $n$ samples (protein complexes), and each column of $X$ represents the expression level of a sample in all $p$ proteins.
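Equation (14) is straightforward to compute once each protein is associated with its set of GO terms. The sketch below reads $N(i)$ as that set and uses set sizes in the denominator; the function name `go_similarity` and the placeholder term labels are hypothetical and do not correspond to real annotations.

```python
def go_similarity(terms_i, terms_j):
    """Shared GO terms divided by the size of the smaller annotation set."""
    smaller = min(len(terms_i), len(terms_j))
    if smaller == 0:
        return 0.0                     # unannotated protein: similarity 0
    return len(terms_i & terms_j) / smaller


# Placeholder annotation sets (labels are illustrative only).
annotations = {
    "protein_1": {"t1", "t2", "t3"},
    "protein_2": {"t1", "t2"},
    "protein_3": {"t3", "t4"},
}

print(go_similarity(annotations["protein_1"], annotations["protein_2"]))
print(go_similarity(annotations["protein_2"], annotations["protein_3"]))
```

Because the denominator is the smaller of the two set sizes, a protein whose annotations are a subset of another's receives the maximum similarity of 1, while proteins with no shared terms receive 0.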

According to Section 3.2, the matrix $X$ is decomposed by PMD into three matrices, namely, $U$, $V$, and $Δ$. The graphical description of $PMD_{pc}$ is shown in Figure 3, in which $u_{k}$ is the $k$th principal component, $v_{k}$ is the $k$th expression model of the principal component, and $u_{i k}$ indicates the projection of the $i$th protein on the $k$th protein complex. Therefore, matrix $U$ is partitioned into several clusters (protein complexes) as a result of the matrix decomposition.

Figure 3. Graphical description of $PMD_{pc}$. Matrix $X$ is decomposed into two base matrices, $U$ and $V$, and a diagonal matrix $Δ$.

$PMD_{pc}$ is implemented in Java, and all experiments were performed on an Intel(R) Core(TM) i7-5557U CPU at 2.2 GHz with 8 GB RAM running Windows 7. The elapsed time was 9533 s.

4. Conclusions

The identification of protein complexes helps us to discover and understand the cellular organization and biological functions in PPI networks. Previous computational approaches mainly identified protein complexes in dense PPI networks and performed poorly in sparse PPI networks. In this work, $PMD_{pc}$ is proposed on the basis of penalized matrix decomposition to detect protein complexes in the human protein interaction network, which has a density of 0.0008.

To validate our method, the performance of $PMD_{pc}$ is compared with the performances of CFinder, ClusterONE, RRW, HC-PIN, and PCE-FR on the human PPI dataset derived from HPRD. The experimental results show that our proposed algorithm is better than the five classical approaches in terms of the composite score of F-measure, ACC, and MMR. However, only the human PPI network was taken as the experimental dataset. The new method should also be suitable for substructure detection in other sparse networks. Therefore, our algorithm will be used in the future to investigate other kinds of complex networks, such as gene regulatory and disease networks.


Author Contributions

Conceptualization: B.C. and S.D.; data curation: S.C.; formal analysis: S.D.; investigation: S.C.; methodology: B.C.; project administration: S.D.; resources: P.D.; software: S.C.; supervision: S.D.; validation: B.C., H.Q., and G.L.; writing-original draft: B.C.; writing-review & editing: P.D.

Funding

This research was funded by the National Natural Science Foundation of China grant numbers [61572180, 61472467, 61471164, 61672011, and 61602164], the Hunan Provincial Natural Science Foundation of China grant numbers [2016JJ2012 and 2018JJ2024], the Key Project of the Education Department of Hunan Province grant number [17A037], and the Scientific and Technological Research Project of Education Department in Jiangxi Province grant number [GJJ170383].

Conflicts of Interest

The authors declare no conflict of interest.

Footnotes

Sample Availability: Samples of the compounds are not available from the authors.

References

1. Enright A.J., Van Dongen S., Ouzounis C.A. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 2002;30:1575–1584. doi: 10.1093/nar/30.7.1575.
2. Bader G.D., Hogue C.W. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinform. 2003;4:2. doi: 10.1186/1471-2105-4-2.
3. Adamcsek B., Palla G., Farkas I.J., Derenyi I., Vicsek T. Cfinder: Locating cliques and overlapping modules in biological networks. Bioinformatics. 2006;22:1021–1023. doi: 10.1093/bioinformatics/btl039.
4. Altaf-Ul-Amin M., Shinbo Y., Mihara K., Kurokawa K., Kanaya S. Development and implementation of an algorithm for detection of protein complexes in large interaction networks. BMC Bioinform. 2006;7:1–13. doi: 10.1186/1471-2105-7-207.
5. Gao L., Sun P.G., Song J. Clustering algorithm for detecting functional modules in protein interaction networks. J. Bioinform. Comput. Biol. 2011;7:217–242. doi: 10.1142/S0219720009004023.
6. Pizzuti C., Rombo S.E. A coclustering approach for mining large protein-protein interaction networks. IEEE ACM Trans. Comput. Biol. 2012;9:717–730. doi: 10.1109/TCBB.2011.158.
7. Wang J.X., Li M., Chen J.E., Pan Y. A fast hierarchical clustering algorithm for functional modules discovery in protein interaction networks. IEEE ACM Trans. Comput. Biol. 2011;8:607–620. doi: 10.1109/TCBB.2010.75.
8. Girvan M., Newman M.E.J. Community structure in social and biological networks. Proc. Natl. Acad. Sci. USA. 2002;99:7821–7826. doi: 10.1073/pnas.122653799.
9. Rivera C.G., Vakil R., Bader J.S. Nemo: Network module identification in cytoscape. BMC Bioinform. 2010;11(Suppl. 1):S61. doi: 10.1186/1471-2105-11-S1-S61.
10. Nepusz T., Yu H.Y., Paccanaro A. Detecting overlapping protein complexes in protein-protein interaction networks. Nat. Methods. 2012;9:471–472. doi: 10.1038/nmeth.1938.
11. Wu H., Gao L., Dong J.H., Yang X.F. Detecting overlapping protein complexes by rough-fuzzy clustering in protein-protein interaction networks. PLoS ONE. 2014;9:1856. doi: 10.1371/journal.pone.0091856.
12. Rhrissorrakrai K., Gunsalus K.C. Mine: Module identification in networks. BMC Bioinform. 2011;12:192. doi: 10.1186/1471-2105-12-192.
13. Voevodski K., Teng S.H., Xia Y. Finding local communities in protein networks. BMC Bioinform. 2009;10:297. doi: 10.1186/1471-2105-10-297.
14. Jiang P., Singh M. Spici: A fast clustering algorithm for large biological networks. Bioinformatics. 2010;26:1105–1111. doi: 10.1093/bioinformatics/btq078.
15. Cao B.W., Luo J.W., Liang C., Wang S.L., Ding P.J. Pce-fr: A novel method for identifying overlapping protein complexes in weighted protein-protein interaction networks using pseudo-clique extension based on fuzzy relation. IEEE Trans. Nanobiosci. 2016;15:728–738. doi: 10.1109/TNB.2016.2611683.
16. Vella D., Marini S., Vitali F., di Silvestre D., Mauri G., Bellazzi R. Mtgo: Ppi network analysis via topological and functional module identification. Sci. Rep. 2018;8:5499. doi: 10.1038/s41598-018-23672-0.
17. Kouhsar M., Zare-Mirakabad F., Jamali Y. Wcoach: Protein complex prediction in weighted ppi networks. Genes Genet. Syst. 2015;90:317–324. doi: 10.1266/ggs.15-00032.
18. Hu L., Chan K.C.C. A density-based clustering approach for identifying overlapping protein complexes with functional preferences. BMC Bioinform. 2015;16:174. doi: 10.1186/s12859-015-0583-3.
19. Cao B., Luo J., Liang C., Wang S. Identifying protein complexes by combining network topology and biological characteristics. J. Comput. Theor. Nanosci. 2016;13:1546–1955. doi: 10.1166/jctn.2016.6084.
20. King A.D., Przulj N., Jurisica I. Protein complex prediction via cost-based clustering. Bioinformatics. 2004;20:3013–3020. doi: 10.1093/bioinformatics/bth351.
21. Macropol K., Can T., Singh A.K. Rrw: Repeated random walks on genome-scale protein networks for local cluster discovery. BMC Bioinform. 2009;10:283. doi: 10.1186/1471-2105-10-283.
22. Liu G.M., Wong L., Chua H.N. Complex discovery from weighted ppi networks. Bioinformatics. 2009;25:1891–1897. doi: 10.1093/bioinformatics/btp311.
23. Wu M., Li X.L., Kwoh C.K., Ng S.K. A core-attachment based method to detect protein complexes in ppi networks. BMC Bioinform. 2009;10:169. doi: 10.1186/1471-2105-10-169.
24. Maulik U., Mukhopadhyay A., Bhattacharyya M., Kaderali L., Brors B., Bandyopadhyay S., Eils R. Mining quasi-bicliques from hiv-1-human protein interaction network: A multiobjective biclustering approach. IEEE ACM Trans. Comput. Biol. 2013;10:423–435. doi: 10.1109/TCBB.2012.139.
25. Cao B., Luo J., Liang C., Wang S., Song D. Moepga: A novel method to detect protein complexes in yeast protein-protein interaction networks based on multiobjective evolutionary programming genetic algorithm. Comput. Biol. Chem. 2015;58:173–181. doi: 10.1016/j.compbiolchem.2015.06.006.
26. Zhu L., Deng S.-P., You Z.-H., Huang D.-S. Identifying spurious interactions in the protein-protein interaction networks using local similarity preserving embedding. IEEE/ACM Trans. Comput. Biol. Bioinform. 2017;14:345–352. doi: 10.1109/TCBB.2015.2407393.
27. Zhang Y., Du N., Ge L. A collective nmf method for detecting protein functional module from multiple data sources; Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine; Orlando, FL, USA. 7–10 October 2012; pp. 655–660.
28. Zhang S.H., Li Q.J., Liu J., Zhou X.J. A novel computational framework for simultaneous integration of multiple types of genomic data to identify microrna-gene regulatory modules. Bioinformatics. 2011;27:I401–I409. doi: 10.1093/bioinformatics/btr206.
29. Zheng C.H., Zhang L., Ng V.T.Y., Shiu S.C.K., Huang D.S. Molecular pattern discovery based on penalized matrix decomposition. IEEE ACM Trans. Comput. Biol. 2011;8:1592–1603. doi: 10.1109/TCBB.2011.79.
30. Witten D.M., Tibshirani R., Hastie T. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics. 2009;10:515–534. doi: 10.1093/biostatistics/kxp008.
31. Liu J.-X., Liu J., Gao Y.-L., Mi J.-X., Ma C.-X., Wang D. A class-information-based penalized matrix decomposition for identifying plants core genes responding to abiotic stresses. PLoS ONE. 2014;9:e106097. doi: 10.1371/journal.pone.0106097.
32. Wu M., Yu Q., Li X.L., Zheng J., Huang J.F., Kwoh C.K. Benchmarking human protein complexes to investigate drug-related systems and evaluate predicted protein complexes. PLoS ONE. 2013;8 doi: 10.1371/journal.pone.0053197.
33. Yang P., Li X., Wu M., Kwoh C.K., Ng S.K. Inferring gene-phenotype associations via global protein complex network propagation. PLoS ONE. 2011;6:e21502. doi: 10.1371/journal.pone.0021502.
34. Peri S., Navarro J.D., Kristiansen T.Z., Amanchy R., Surendranath V., Muthusamy B., Gandhi T.K., Chandrika K.N., Deshpande N., Suresh S., et al. Human protein reference database as a discovery resource for proteomics. Nucleic Acids Res. 2004;32:D497. doi: 10.1093/nar/gkh070.
35. Luc P.V., Tempst P. Pindb: A database of nuclear protein complexes from human and yeast. Bioinformatics. 2004;20:1413–1415. doi: 10.1093/bioinformatics/bth114.
36. Shalgi R., Lieber D., Oren M., Pilpel Y. Global and local architecture of the mammalian microrna-transcription factor regulatory network. PLoS Comput. Biol. 2007;3:e131. doi: 10.1371/journal.pcbi.0030131.
