Abstract
Background
A large number of long intergenic non-coding RNAs (lincRNAs) are linked to a broad spectrum of human diseases, but the disease associations of many other lincRNAs remain a puzzle. Validating such links through biological experiments is expensive. However, a plethora of lincRNA data is now available, thanks to High Throughput Sequencing (HTS) platforms, Genome Wide Association Studies (GWAS), etc., which opens the opportunity for machine learning and data mining approaches to extract meaningful relationships between lincRNAs and diseases. Still, only a few in silico lincRNA-disease association inference tools are available to date, and none of them utilizes side information of both entities simultaneously in a single framework.
Methods
The recently developed Inductive Matrix Completion (IMC) technique provides a recommendation platform between two entities that considers side information about each of them. However, the formulation of IMC cannot handle noise and outliers that may be present in the datasets, and it does not account for data sparsity. Thus, a robust version of IMC is needed that addresses both issues. As a remedy, in this paper we propose Stable Robust Inductive Matrix Completion (SRIMC), which uses ℓ2,1 norm based regularization to optimize the objective function with a 2-step stable solution approach.
Results
We applied SRIMC to the available association data between human lincRNAs and OMIM disease phenotypes as well as a diverse set of side information about the lincRNAs and the diseases. The method performs better than the state-of-the-art methods in terms of precision@k and recall@k for the top-k disease prioritization of the subject lincRNAs. We also demonstrate that SRIMC is equally effective for querying novel lincRNAs, as well as for predicting the rank of a newly known disease for a set of well-characterized lincRNAs.
Conclusions
With the experimental results and computational evaluation, we show that SRIMC is robust in handling datasets with noise and outliers as well as dealing with novel lincRNAs and disease phenotypes.
Keywords: Matrix completion, Inductive learning, Long noncoding RNA, Human disease phenotypes, Association inference
Background
LincRNA-disease association inference problem
It is a surprising fact that only 2% of the entire human genome codes for proteins [1]. In recent years, it has become evident that the non-protein-coding portion of the genome is of critical functional importance, especially the long intergenic non-coding RNAs (lincRNAs), which are longer than 200 bases and do not overlap any annotated protein-coding region. These lincRNAs act through diverse molecular mechanisms and are implicated in various human diseases [2]. With the advent of high-throughput genomic technologies, a large number of lincRNAs have been cataloged [3]. However, fully annotating the functions of the lincRNAs and their involvement in human diseases still remains a challenge. A machine learning algorithm that ranks disease implications for a given lincRNA based on prior knowledge would help the community tackle this challenge.
Limitations of existing algorithms
Several long non-coding RNA (lncRNA)-disease association inference tools have been developed in the past few years, but only a small number of them actually address the lincRNA-disease inference problem. Due to the complexities of the relationship and of the available datasets, only a small number of experimentally validated associations have been reported in the LncRNADisease database [4]. Therefore, using multiple complementary data sources in the algorithm is important for predicting potential lincRNA-disease associations. For example, LRLSLDA [5], K-RWRH [6], and TslncRNA-disease [7] belong to a family of network-based association identification methods. Each of these algorithms uses biological networks, such as a lincRNA similarity network and a disease similarity network, to build its prediction model. The model then infers lincRNA-disease connections either by running a random walk on a derived biological network or by computing a similarity measure between nodes with known disease implications. The association inference problem can also be tackled with matrix completion based algorithms; Non-negative Matrix Factorization (NMF) belongs to this family of solution strategies. However, it suffers from the cold-start problem: it cannot predict diseases for novel lincRNAs, or vice versa. Furthermore, these methods were evaluated on a very small set of associations and developed without considering scalability (e.g., around 200 lncRNAs were used, whereas the more than 8000 lincRNAs cataloged by [3] remain overlooked). Moreover, the methods that use lincRNA expression profiles to build similarity networks deal with only a small number of disease classes, so they fall short in generalizing their predictions to novel disease-lincRNA connections.
A plethora of side information about the lincRNAs and the disease phenotypes is available, and the data grows every day. Inductive Matrix Completion (IMC) based algorithms utilize side information about both the lincRNAs and the diseases, along with the known association evidence, to predict missing associations [8]. However, the standard IMC uses the least-squares error function, which is known to be unstable in the presence of noise and outliers [9]. A stable and robust version of IMC is thus needed for this problem.
Outline of our proposed approach
We propose a novel stable robust formulation of IMC using an ℓ2,1 norm based error function as well as an ℓ2,1 norm based regularizer. We call the proposed method “robust” because it handles noise better than the standard IMC, and “stable” because it uses a 2-step strategy in which each step has a stable solution.
Summary of contributions
We first describe a Robust IMC (RIMC) approach that introduces ℓ2,1 norms in both its loss function and its regularization. We then propose Stable Robust IMC (SRIMC), which can handle outliers and noise in the dataset as well as joint sparsity. The solution strategy breaks the problem into two separate and independent sub-problems, each of which has a stable solution and is easy to compute. Hence, in terms of computational complexity and reliability, SRIMC should be the better option.
We apply our RIMC and SRIMC methods to the lincRNA-disease association inference problem. We show that RIMC and SRIMC can perform induction, that is, infer associations involving a novel disease or a novel lincRNA that was not provided during the learning phase, based on the available side information about them. This is unlike the traditional matrix factorization methods and network-based inference methods discussed earlier, which are transductive in nature.
We demonstrate that integrating diverse features of the lincRNAs and the diseases from publicly available data servers can overcome the poor predictive performance that inference tools suffer because of the extreme sparsity inherent to the lincRNA-disease association dataset.
We present a comparison of our proposed RIMC and SRIMC with the standard IMC as well as the state-of-the-art lincRNA-disease association methods.
The rest of the paper is organized as follows. In the “Methods” section we propose the robust IMC formulation using the ℓ2,1 norm and underline the advantages of the proposed algorithm compared with the standard IMC and standard NMF approaches. There we also show the correctness of the proposed algorithm. In the “Results” section we present the experimental setup and the dataset. In the “Discussions” section we present the results of the experiments and a comparative study of the proposed algorithm against the baseline methods. Finally, in the “Conclusions” section we conclude the paper. A preliminary version of this work has been reported in [10].
Methods
Stable robust inductive Matrix completion (SRIMC) strategy
In this section we first review the standard Inductive Matrix Completion method and then present our robust IMC (RIMC) formulation. Finally, we present the Stable Robust IMC (SRIMC) algorithm. We also provide computational algorithms for the proposed methods along with proofs of their correctness.
Review on standard IMC
The Inductive Matrix Completion approach [8] includes side information of both the row and column entities. The formulation solves the “cold-start” problem that arises in a transductive setup (e.g., standard NMF). Therefore, we can predict associations between new entities that are not included in the data matrix available at training time. Let us consider an association matrix $A \in \mathbb{R}^{M \times N}$ denoting the links between M row entities and N column entities. We also have side information about both the row and column entities, encapsulated in two matrices $X \in \mathbb{R}^{M \times m}$ and $Y \in \mathbb{R}^{N \times n}$ containing m features of the M row entities and n features of the N column entities, respectively. Equation 1 defines the objective function of the standard IMC.
$$\min_{W,H} \sum_{(i,j) \in \Omega} \left(A_{ij} - x_i^T W H^T y_j\right)^2 + \lambda_1 \|W\|_F^2 + \lambda_2 \|H\|_F^2 \tag{1}$$
where Ω is the set of observed entries of A, $x_i^T$ and $y_j^T$ are the i-th row of X and the j-th row of Y, and λ 1,λ 2 are the regularization parameters that weigh the accrued loss on the observed entries against the trace norm regularization constraints. Here, an entry A ij is modeled as $x_i^T Z y_j$, where $Z \in \mathbb{R}^{m \times n}$ is a low-rank matrix to be recovered by solving Eq. 1. It is solved in a way that Z becomes the product of two factor matrices W and H, that is, $Z = W H^T$, where $W \in \mathbb{R}^{m \times r}$ and $H \in \mathbb{R}^{n \times r}$. Equation 1 can be solved using Algorithm 1.
After Algorithm 1 returns, we get the two factor matrices W and H. These two matrices can be used to compute missing association scores between the row and the column entities. They can also provide a prediction score for an association between a known row entity and a new column entity, a known column entity and a new row entity, or a new row entity and a new column entity.
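The scoring step described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names `imc_scores` and `imc_score_pair` are hypothetical, the factors W and H are random stand-ins for the output of Algorithm 1, and the matrix sizes are toy values rather than those of the study.

```python
import numpy as np

def imc_scores(X, Y, W, H):
    """Score all (row, column) pairs: S[i, j] = x_i^T W H^T y_j."""
    return X @ W @ H.T @ Y.T

def imc_score_pair(x, y, W, H):
    """Score one pair from feature vectors alone, e.g. for unseen entities."""
    return float(x @ W @ H.T @ y)

rng = np.random.default_rng(0)
M, N, m, n, r = 6, 5, 4, 3, 2                   # toy sizes, not the paper's
X, Y = rng.random((M, m)), rng.random((N, n))   # side-information matrices
W, H = rng.random((m, r)), rng.random((n, r))   # stand-ins for learned factors

S = imc_scores(X, Y, W, H)                      # M x N scores for known entities
x_new, y_new = rng.random(m), rng.random(n)     # features of unseen entities
s_new = imc_score_pair(x_new, y_new, W, H)      # score for a new-new pair
```

Because the score depends on an entity only through its feature vector, the same call covers a known row with a new column, a new row with a known column, or two new entities.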
Robust IMC (RIMC) formulation
One limitation of the standard IMC is that it is sensitive to outliers in the given dataset. Given the data matrices A, X and Y, the loss function of the standard IMC in matrix form is:
$$\|A - XWH^TY^T\|_F^2 = \sum_{i=1}^{M} \left\|(A - XWH^TY^T)_i\right\|_2^2 \tag{4}$$
Here, $(\cdot)_i$ denotes the i-th row of a matrix. Because the residual of every row is squared before it is accumulated, only a few outliers may dominate the total error. Another shortcoming of the standard IMC is that it cannot handle joint sparsity across the feature data matrices X and Y. Therefore, a solution to each of these limitations is needed. The initial hypothesis of RIMC was presented in [10]. Instead of the ℓ2 norm based loss function, the robust IMC uses the ℓ2,1 norm to define the loss function:
$$\|A - XWH^TY^T\|_{2,1} = \sum_{i=1}^{M} \left\|(A - XWH^TY^T)_i\right\|_2 \tag{5}$$
Because the row errors are not squared, this approach handles outliers better than the standard IMC based approaches. The generalized objective function of the RIMC can be stated as:
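$$\min_{W,H} \|A - XWH^TY^T\|_{2,1} + \lambda_1 R(W) + \lambda_2 R(H) \tag{6}$$
Here, we have several options for the regularization function R(·), such as $R_1(\cdot)=\|\cdot\|_F^2$, $R_2(\cdot)=\|\cdot\|_1$, $R_3(\cdot)=\|\cdot\|_0$, and $R_4(\cdot)=\|\cdot\|_{2,1}$. Here, R 1(·) is the ridge regularization and is adopted in the standard IMC formulation; R 2(·) is the LASSO regularization, which is convex but non-smooth; R 3(·) involves the ℓ 0 norm and is the most desirable for sparsity [11], but it is non-convex and difficult to optimize; and R 4(·) employs the ℓ 2,1 norm. R 4(·) was chosen because the function is convex and an objective function involving this kind of regularizer can be optimized easily [12].
Thus, given the data matrices A,X,Y, in this paper we optimize the following robust IMC formulation:
$$\min_{W \ge 0,\, H \ge 0} \|A - XWH^TY^T\|_{2,1} + \lambda_1 \|W\|_{2,1} + \lambda_2 \|H\|_{2,1} \tag{7}$$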
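To make the effect of the ℓ2,1 norm concrete, the following minimal sketch evaluates a robust objective of the kind described above and contrasts it with the squared loss on a residual that contains one outlying row. The helper names `l21_norm` and `rimc_objective`, the weights `lam1` and `lam2`, and the exact scaling of the regularization terms are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def l21_norm(E):
    """Sum of the Euclidean norms of the rows of E."""
    return float(np.sqrt((E ** 2).sum(axis=1)).sum())

def rimc_objective(A, X, Y, W, H, lam1, lam2):
    """l2,1 loss on the residual plus l2,1 regularizers (assumed scaling)."""
    E = A - X @ W @ H.T @ Y.T
    return l21_norm(E) + lam1 * l21_norm(W) + lam2 * l21_norm(H)

# One outlying residual row with Euclidean norm 5; all other rows are zero.
E = np.zeros((4, 3))
E[0] = [3.0, 4.0, 0.0]
squared = float((E ** 2).sum())   # squared (Frobenius) loss: 25.0
robust = l21_norm(E)              # l2,1 loss: 5.0

# Toy objective value: with zero factors the objective is the l2,1 norm of A.
A = np.ones((2, 2))
obj = rimc_objective(A, np.eye(2), np.eye(2),
                     np.zeros((2, 1)), np.zeros((2, 1)), lam1=0.1, lam2=0.1)
```

The outlying row contributes its norm (5) to the ℓ2,1 loss but its squared norm (25) to the squared loss, which is why a handful of corrupted rows weigh far less on the robust objective.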
Algorithm for RIMC (version 1)
Correctness of the RIMC Algorithm (version 1)
Theorem 1
At convergence, the converged solution W ∗ of the updating rule in Algorithm 2 satisfies the KKT condition.
Proof
The KKT condition for W with constraints W ik≥0, with i=1⋯m,k=1⋯r is:
8 |
Now, the partial derivative is
9 |
where e is a vector with all 1s. Also, D and P are the two diagonal matrices with the diagonal elements given by:
10 |
11 |
Now, let us continue from Eq. 9:
12 |
Thus, the KKT condition for W is:
13 |
But, once W converges (according to Algorithm 2), the converged solution (W ∗) satisfies
which can be written as
This is identical to Eq. 13. Thus, the converged solution W ∗ satisfies the KKT condition. □
Theorem 2
At convergence, the converged solution H ∗ of the updating rule in Algorithm 2 satisfies the KKT condition.
Proof
The KKT condition for H with constraints H jk≥0, with j=1⋯n,k=1⋯r is:
14 |
Now, the partial derivative is
15 |
where D is already defined in Eq. 10, and Q is a diagonal matrix with the diagonal elements given by:
16 |
Now, let us continue from Eq. 15:
17 |
Thus, the KKT condition for H is:
18 |
But, once H converges (according to Algorithm 2), the converged solution (H ∗) satisfies
which can be written as
19 |
This is identical to Eq. 18. Thus, the converged solution H ∗ satisfies the KKT condition. □
Algorithm for RIMC (version 2)
We can also solve the robust IMC optimization problem (Eq. 7) without the use of the e vectors. It is demonstrated in Algorithm 3.
Convergence of the RIMC Algorithm (version 2)
Here, we present the proof of the convergence of Algorithm 3.
Theorem 3
Algorithm 3 will monotonically decrease the objective function of the problem (Eq. 7) in each iteration and converge to the global optimum of the problem.
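The theorem can be rephrased as the following two statements: (A) with W fixed, updating H by the rule in Algorithm 3 monotonically decreases the objective function of Eq. 7; (B) with H fixed, updating W by the rule in Algorithm 3 monotonically decreases the objective function of Eq. 7.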
Proof
We prove Theorem 3 (A, B) separately in the following two sections. □
Proof of Theorem 3(A): Updating of H
Proof
We now focus on proving Theorem 3(A). The proof requires the following two lemmas: (Lemma 4 and 5). □
Lemma 4
Let H (t) be the value of H at the t th iteration, and let H (t+1) be the value obtained at the next iteration. Then, under the H update rule in Algorithm 3, the following inequality holds.
20 |
where, , and
The proof of Lemma 4 is given in section Proof of Lemma 4.
Lemma 5
Under the H update rule in Algorithm 3, the following inequality holds:
21 |
where D,P,Q matrices are defined earlier.
The proof of Lemma 5 is given in section Proof of Lemma 5.
Now, if we take a look at the right hand side of the inequality in Eq. 21, the value is negative or zero according to Lemma 4. This completes the proof that the objective function of Eq. 7 decreases monotonically.
Proof of Theorem 3(B): updating of W
Proof
We now focus on proving Theorem 3(B). The proof requires the following two lemmas: (Lemma 6 and 7). □
Lemma 6
Let W (t) be the value of W at the t th iteration, and let W (t+1) be the value obtained at the next iteration. Then, under the W update rule in Algorithm 3, the following inequality holds.
22 |
where, D,P,Q are defined earlier.
Proof of Lemma 6 is provided in section Proof of Lemma 6.
Lemma 7
Under the W update rule in Algorithm 3, the following inequality holds:
23 |
where D,P,Q matrices are defined earlier.
Proof of Lemma 7 is provided in section Proof of Lemma 7.
Now, if we take a look at the right hand side of the inequality in Eq. 23, the value is negative or zero according to Lemma 6. This completes the proof that the objective function of Eq. 7 decreases monotonically.
Proof of Lemma 4
Proof
We can re-write Eq. 20 as follows:
24 |
where
25 |
And, according to the statement of Lemma 4, under the H update rule in Algorithm 3, J(H) monotonically decreases. In order to prove the statement, we follow the approaches utilizing auxiliary functions [13, 14]. □
Definition 1
G(H,H ′) is an auxiliary function for the function J(H) if G(H,H ′)≥J(H) for all H ′ and G(H,H)=J(H).
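Now, we define:
$$H^{(t+1)} = \arg\min_{H} G(H, H^{(t)})$$
So, we have
$$J(H^{(t+1)}) \le G(H^{(t+1)}, H^{(t)}) \le G(H^{(t)}, H^{(t)}) = J(H^{(t)})$$
This proves that J(H (t)) is monotonically non-increasing in t.
Now the important steps in the remainder of the proof are: (a) determine a proper auxiliary function, and (b) find the global minimum of the auxiliary function.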
Lemma 8
The function
26 |
is an auxiliary function for J.
Proof
Now J(H) of Eq. 25 can be re-written as:
27 |
Now we apply the following matrix inequality from [14, 15]:
$$\sum_{i,k} \frac{(\Lambda H' B)_{ik}\, H_{ik}^2}{H'_{ik}} \ge \mathrm{Tr}\left(H^T \Lambda H B\right) \tag{28}$$
where Λ,B,H,H ′ are non-negative matrices, and Λ,B are symmetric matrices. The equality holds in Eq. 28 when H=H ′.
In Eq. 28, if we substitute Λ=Y T Y and B=W T X T D X W, we see that the fifth term of Eq. 27 is no larger than the fifth term of Eq. 26, with equality when H=H ′. Thus G(H,H ′) in Eq. 26 is an auxiliary function of J(H). □
Now, we need to find the global minimum of Eq. 26. Let f(H)=G(H,H ′). The gradient of f(H) is
29 |
However, the second order derivative (i.e., the Hessian matrix) would be
30 |
The Hessian matrix (Eq. 30) is positive semi-definite, implying that f(H)=G(H,H ′) is a convex function. Thus, any stationary point of f(H) is a global minimum. The global minimum can be obtained by setting the gradient of f(H) to zero and solving for H. Thus from Eq. 29 we get
31 |
By replacing H (t+1)=H and H (t)=H ′, we obtain the H update rule in Algorithm 3. Therefore, under this rule, the objective function J(H) of Eq. 25 decreases monotonically, which completes the proof.
Proof of Lemma 5
Proof
We know that,
Similarly, we can see that
Then, the right-hand side (r.h.s) of Eq. 21 becomes
And, the left-hand side (l.h.s) of Eq. 21 becomes
Now, we compute the difference between the l.h.s and r.h.s,
The above inequality holds because, D,Q are non-negative matrices, and the sum of non-positive numbers is always non-positive. This completes the proof. □
Proof of Lemma 6
Proof
We can re-write Eq. 22 as follows:
32 |
where
33 |
And, according to the statement of Lemma 6, under the W update rule in Algorithm 3, J(W) monotonically decreases. In order to prove the statement, we follow the approaches utilizing auxiliary functions [13, 14]. □
Definition 2
G(W,W ′) is an auxiliary function for the function J(W) if G(W,W ′)≥J(W) for all W ′ and G(W,W)=J(W).
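Now, we define:
$$W^{(t+1)} = \arg\min_{W} G(W, W^{(t)})$$
So, we have
$$J(W^{(t+1)}) \le G(W^{(t+1)}, W^{(t)}) \le G(W^{(t)}, W^{(t)}) = J(W^{(t)})$$
This proves that J(W (t)) is monotonically non-increasing in t.
Now the important steps in the remainder of the proof are: (a) determine a proper auxiliary function, and (b) find the global minimum of the auxiliary function.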
Lemma 9
The function
34 |
is an auxiliary function for J.
Proof
Now J(W) of Eq. 33 can be re-written as:
35 |
Now we apply the following matrix inequality from [14, 15]:
$$\sum_{i,k} \frac{(\Lambda W' B)_{ik}\, W_{ik}^2}{W'_{ik}} \ge \mathrm{Tr}\left(W^T \Lambda W B\right) \tag{36}$$
where Λ,B,W,W ′ are non-negative matrices, and Λ,B are symmetric matrices. The equality holds in Eq. 36 when W=W ′.
In Eq. 36, if we substitute Λ=X T D X and B=H T Y T Y H, we see that the fifth term of Eq. 35 is no larger than the fifth term of Eq. 34, with equality when W=W ′. Thus G(W,W ′) in Eq. 34 is an auxiliary function of J(W). □
Now, we need to find the global minimum of Eq. 34. Let f(W)=G(W,W ′). The gradient of f(W) is
37 |
However, the second order derivative (i.e., the Hessian matrix) would be
38 |
The Hessian matrix (Eq. 38) is positive semi-definite, implying that f(W)=G(W,W ′) is a convex function. Thus, any stationary point of f(W) is a global minimum. The global minimum can be obtained by setting the gradient of f(W) to zero and solving for W. Thus from Eq. 37 we get
39 |
By replacing W (t+1)=W and W (t)=W ′, we obtain the W update rule in Algorithm 3. Therefore, under this rule, the objective function J(W) of Eq. 33 decreases monotonically, which completes the proof.
Proof of Lemma 7
Proof
We know that,
Similarly, we can see that
Then, the right-hand side (r.h.s) of Eq. 23 becomes
And, the left-hand side (l.h.s) of Eq. 23 becomes
Now, we compute the difference between the l.h.s and r.h.s,
The above inequality holds because, D,P are non-negative matrices, and the sum of non-positive numbers is always non-positive. This completes the proof. □
Correctness of the RIMC Algorithm (version 2)
In this section we are going to prove that the converged solution presented in Algorithm 3 is the correct optimal solution. In fact, we will show that the converged solution satisfies the Karush-Kuhn-Tucker (KKT) condition of the constrained optimization theory. At first, we have Theorem 10 to prove the correctness of the algorithm with respect to W. Theorem 11 will prove the correctness of the algorithm with respect to H.
Theorem 10
At convergence, the converged solution W ∗ of the updating rule in Algorithm 3 satisfies the KKT condition.
Proof
The KKT condition for W with constraints W αβ≥0, with α=1,⋯,m;β=1,⋯,r is:
40 |
Similar to Eq. 25, the J(W) can be written as:
41 |
Now, the partial derivative of J(W) can be expressed as:
42 |
Thus, the KKT condition for W is:
43 |
But, once W converges (according to Algorithm 3), the converged solution W ∗ satisfies the following:
which can be written as
44 |
This is identical to Eq. 43. Thus, the converged solution W ∗ satisfies the KKT condition. □
Theorem 11
At convergence, the converged solution H ∗ of the updating rule in Algorithm 3 satisfies the KKT condition.
Proof
The KKT condition for H with constraints H γψ≥0, with γ=1,⋯,n,ψ=1,⋯,r is:
45 |
Now, the partial derivative of J(H) from Eq. 25 is
46 |
Thus, the KKT condition for H is:
47 |
But, once H converges (according to Algorithm 3), the converged solution, H ∗ satisfies the following:
which can be written as
This is identical to Eq. 47. Thus, the converged solution H ∗ satisfies the KKT condition. □
Stable robust IMC (SRIMC) formulation
Instead of solving the RIMC objective function (Eq. 7) directly, here we propose a two-step solution strategy to the RIMC formulation, and we call this new algorithm SRIMC.
Step 1: solving matrix Z from a matrix equation
In this step, we consider the following matrix equation
$$X Z Y^T = A \tag{48}$$
where Z is an m×n matrix of unknowns, X is the M×m feature matrix of the row entities, Y is the N×n feature matrix of the column entities, and A is the M×N binary association matrix between the row and column entities.
Now, in Eq. 48, if we left multiply by X T and right multiply by Y, we get the following equation
$$X^T X Z Y^T Y = X^T A Y \tag{49}$$
If both X and Y have full column rank, then both X T X and Y T Y are invertible. Therefore, we can solve for Z.
$$Z = (X^T X)^{-1} X^T A Y (Y^T Y)^{-1} \tag{50}$$
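Step 1 can be sketched numerically as follows, assuming the closed form Z = (XᵀX)⁻¹ XᵀAY (YᵀY)⁻¹ that results from left-multiplying the matrix equation by Xᵀ and right-multiplying it by Y. The function name `solve_z` and the toy dimensions are hypothetical; the sketch uses linear solves rather than explicit inverses, which is a common numerical choice and not something the text prescribes.

```python
import numpy as np

def solve_z(A, X, Y):
    """Return Z = (X^T X)^{-1} X^T A Y (Y^T Y)^{-1}.

    Assumes X (M x m) and Y (N x n) both have full column rank, so that
    X^T X and Y^T Y are invertible.
    """
    left = np.linalg.solve(X.T @ X, X.T @ A)        # (X^T X)^{-1} X^T A, m x N
    # Right-multiplying by (Y^T Y)^{-1} is done through a transposed solve,
    # which is valid because Y^T Y is symmetric.
    return np.linalg.solve(Y.T @ Y, (left @ Y).T).T  # m x n

rng = np.random.default_rng(0)
M, N, m, n = 8, 7, 3, 2                 # toy sizes with M > m and N > n
X, Y = rng.random((M, m)), rng.random((N, n))
Z_true = rng.random((m, n))
A = X @ Z_true @ Y.T                    # noise-free association scores

Z_hat = solve_z(A, X, Y)                # recovers Z_true up to rounding error
```

On noise-free data generated from a known Z the solve recovers it exactly; on a real binary association matrix the same formula yields the least-squares fit.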
Step 2: robust NMF on matrix Z
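In this step, we factorize the matrix Z obtained in Step 1 into two non-negative factor matrices W and H:
$$\min_{W \ge 0,\, H \ge 0} \|Z - WH^T\|_{2,1} + \lambda_1 \|W\|_{2,1} + \lambda_2 \|H\|_{2,1} \tag{51}$$
This is a modified non-negative matrix factorization (NMF) problem; the only difference is the use of ℓ 2,1 norms instead of ℓ 2 norms in the loss function and the regularizers.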
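As an illustration of what a robust NMF step on Z can look like, the sketch below uses multiplicative updates for an ℓ2,1 loss in the style of the published robust NMF literature. It is a swapped-in technique and not the paper's Algorithm 4: it omits the ℓ2,1 regularizers, it clips negative entries of Z to zero (an assumption needed for non-negative updates), and the names `robust_nmf` and `l21_loss` are hypothetical.

```python
import numpy as np

def l21_loss(Z, W, H):
    """Sum over rows of the Euclidean norm of the residual Z - W H^T."""
    return float(np.linalg.norm(Z - W @ H.T, axis=1).sum())

def robust_nmf(Z, r, n_iter=200, eps=1e-9, seed=0):
    """Multiplicative updates for min ||Z - W H^T||_{2,1} with W, H >= 0."""
    rng = np.random.default_rng(seed)
    Z = np.maximum(Z, 0.0)                 # assumption: clip negatives
    m, n = Z.shape
    W = rng.random((m, r)) + eps
    H = rng.random((n, r)) + eps
    for _ in range(n_iter):
        # Row weights d_i = 1 / ||z_i - w_i H^T||_2 down-weight outlying rows.
        d = 1.0 / np.maximum(np.linalg.norm(Z - W @ H.T, axis=1), eps)
        DZ, DW = d[:, None] * Z, d[:, None] * W
        H *= (DZ.T @ W) / (H @ (W.T @ DW) + eps)
        # In the W update the row weights cancel between numerator and denominator.
        W *= (Z @ H) / (W @ (H.T @ H) + eps)
    return W, H

rng = np.random.default_rng(1)
Z = rng.random((20, 2)) @ rng.random((2, 8))   # non-negative, rank-2 toy matrix
Z[0, :3] += 5.0                                 # corrupt one row
W, H = robust_nmf(Z, r=2)
final_loss = l21_loss(Z, W, H)
```

The row weights play the role of the diagonal reweighting matrix: rows with a large residual get a small weight, so a corrupted row has limited influence on the shared factor H.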
Algorithm for SRIMC
We can also solve the Stable Robust IMC optimization problem by solving the two problems mentioned above. It is demonstrated in Algorithm 4.
Results
Disease-LincRNA association datasets
We prepared a sparse association matrix, with sparsity index 0.22%, by extracting the lincRNA-disease association dataset from LncRNADisease [4]. The lincRNA expression dataset was obtained from the co-expression based association study [7]. In total, we cataloged 8194 lincRNAs and 2148 human disease phenotypes, and the resulting association matrix contains 46,934 associations between these two entities. We named the disease phenotypes by their OMIM identification numbers, extracting the top-5 OMIM phenotypes matching each human disease name using the OMIM API [16].
LincRNA feature datasets
The features of lincRNAs consist of four groups of information: (i) expression profiles, (ii) transcription factor binding sites (TFBS), (iii) functional annotations and (iv) single nucleotide polymorphism (SNP) information. The RNA-seq expression profiles of the 8194 lincRNAs in 22 human tissues were collected from the Human BodyMap Project 2.0 [3]. The expression scores were measured in FPKM (Fragments Per Kilobase of exon per Million fragments mapped). TFBS information for the lincRNAs in our study, covering 120 transcription factors, was obtained from the ChIP-base dataset [17]. Linc2GO is a public data repository containing functional annotations of lincRNAs [18]. Three types of functions are cataloged in the Linc2GO dataset: gene ontology biological process (GO BP), gene ontology molecular function (GO MF) and KEGG pathways. The functional annotations of the 8194 lincRNAs form a sparse matrix with sparsity index 0.11%. We performed singular value decomposition on this matrix and used the leading 100 singular vectors as part of the lincRNA features. We extracted links between 368,494 SNPs and the lincRNAs in our study from the lncRNASNP dataset [19]. Again, the SNP-lincRNA association matrix turned out to be sparse, with sparsity index 0.0077%. Therefore, we performed singular value decomposition on this matrix as well and used the leading 100 singular vectors. Finally, we filtered the lincRNAs on all four groups of features. We found that 6540 of the initial 8194 lincRNAs have data in all four feature groups. Therefore, our final lincRNA feature matrix (X in our study) has 6540 rows (lincRNAs) and 342 columns (features).
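The dimensionality reduction and feature assembly described above can be sketched as follows. This is a minimal illustration with toy sizes: the function name `leading_singular_vectors` is hypothetical, and taking the left singular vectors without scaling them by the singular values is an assumption, since the text does not say which convention was used.

```python
import numpy as np

def leading_singular_vectors(M, k):
    """Return the k leading left singular vectors of M as a (rows x k) block."""
    U, _, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :k]

rng = np.random.default_rng(0)
# Toy stand-ins: 30 lincRNAs, a sparse binary annotation matrix and a dense
# expression block (the study uses 100 singular vectors and 22 tissues).
annot = (rng.random((30, 50)) < 0.1).astype(float)
expr = rng.random((30, 4))

F = leading_singular_vectors(annot, k=5)   # compressed annotation features
X = np.hstack([expr, F])                   # one row of features per lincRNA
```

With the group sizes reported in the text, the same concatenation gives 22 + 120 + 100 + 100 = 342 feature columns.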
Disease feature datasets
The disease feature dataset consists of two groups of information: (i) term frequency-inverse document frequency (TF-IDF) scores and (ii) phenotype similarity scores. The TF-IDF scores were prepared by mining the OMIM text corpus on the 2661 OMIM phenotypes, resulting in 20491 term scores for each of the 2148 phenotypes in our study. We took the leading 100 singular vectors as part of the disease features. The phenotype-phenotype similarity scores were retrieved from the study in [20]. The similarity profiles, encapsulated in a square matrix of dimension 2148 by 2148, went through a singular value decomposition to extract the leading 100 singular vectors, which constitute the other part of the disease feature matrix. Finally, our disease feature matrix contains 200 features for each of the 2148 diseases.
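For readers unfamiliar with TF-IDF scoring, the sketch below computes one common variant (relative term frequency times the natural logarithm of the inverse document frequency) on tokenized toy documents. The weighting variant, the function name `tfidf` and the example tokens are assumptions; the text does not specify which TF-IDF formula was applied to the OMIM corpus.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) where rows[d][t] is the TF-IDF of term t in doc d.

    Each document is a non-empty list of tokens.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(df)
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append([(counts[t] / len(doc)) * math.log(n_docs / df[t])
                     for t in vocab])
    return vocab, rows

# Two toy phenotype descriptions, already tokenized.
docs = [["cardiac", "defect", "cardiac"], ["renal", "defect"]]
vocab, rows = tfidf(docs)
```

A term that occurs in every document, such as "defect" here, receives a score of zero, so only terms that discriminate between phenotypes contribute to the feature vectors.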
Baseline algorithms
We conducted a comparative study of our proposed algorithms against five baseline methods: (i) NMF [13], (ii) LRLSLDA [5], (iii) TsLincRNA-Disease [7], (iv) K-RWRH [6] and (v) standard IMC [21]. The NMF based approach finds the two factors W and H by working on the lincRNA-disease association matrix A alone. LRLSLDA ranks the lincRNAs for a disease using a classifier trained on two similarity feature matrices. The method has eight parameters that must be tuned before it gives good prediction results. TsLincRNA-Disease applies a series of statistical significance tests to a co-expression network obtained from tissue-specific and non-tissue-specific lincRNA expression information. Apart from the expression data, this method does not integrate other types of information available about the lincRNAs and the diseases. K-RWRH is a stochastic algorithm built on a random walk over three heterogeneous networks. The method is complex, and it is hard to obtain a steady-state distribution for the dataset of our study.
Evaluation metrics
We define two metrics for evaluating our proposed algorithm and the baseline algorithms. These metrics are popular for evaluating recommender-style systems, as in [22].
precision@k: The ratio of the number of recovered disease phenotypes to the k recommended phenotypes for a target lincRNA. We take the average of the ratios over the lincRNAs of our study. The metric is defined as follows:
$$\mathrm{precision@}k = \frac{1}{N_l} \sum_{l} \frac{|P_l(k) \cap D_l|}{k} \tag{52}$$
where P l(k) is the set of top-k ranked diseases for a lincRNA l, D l is the set of diseases related to the lincRNA l that were deleted during the training phase, and N l is the total number of lincRNAs in the test set.
recall@k: The ratio of recovered disease phenotypes to the set of hidden phenotypes in the test dataset. Again, we take the average of the ratios over the lincRNAs in the study. The metric is defined as follows:
$$\mathrm{recall@}k = \frac{1}{N_l} \sum_{l} \frac{|P_l(k) \cap D_l|}{|D_l|} \tag{53}$$
We repeated the experiments for various values of k, from 5 to 100. We conducted 10-fold cross-validation in each of the experiments listed in the following sections.
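The two metrics can be computed as in the following sketch, which averages the per-lincRNA ratios over the test lincRNAs. The function name `precision_recall_at_k`, the dictionary-based inputs and the toy identifiers are hypothetical, and the sketch assumes every test lincRNA has at least one hidden disease.

```python
def precision_recall_at_k(ranked, hidden, k):
    """Average precision@k and recall@k over the test lincRNAs.

    ranked: dict mapping a lincRNA to its diseases sorted by predicted score.
    hidden: dict mapping a lincRNA to the non-empty set of held-out diseases.
    """
    precisions, recalls = [], []
    for lincrna, truth in hidden.items():
        hits = len(set(ranked[lincrna][:k]) & truth)   # recovered diseases
        precisions.append(hits / k)
        recalls.append(hits / len(truth))
    n = len(hidden)
    return sum(precisions) / n, sum(recalls) / n

# Toy example with two test lincRNAs and four candidate diseases.
ranked = {"linc1": ["d1", "d2", "d3", "d4"],
          "linc2": ["d3", "d1", "d4", "d2"]}
hidden = {"linc1": {"d1", "d4"}, "linc2": {"d2"}}

p2, r2 = precision_recall_at_k(ranked, hidden, k=2)   # (0.25, 0.25)
p4, r4 = precision_recall_at_k(ranked, hidden, k=4)   # (0.375, 1.0)
```

As k grows, recall can only increase while precision is divided by a larger k, which is why the two curves are reported side by side.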
Discussions
True LincRNA-disease association retrieval
Figure 1 shows the performance of RIMC along with the other baseline algorithms in predicting true lincRNA-disease associations. A 10-fold cross-validation was conducted on the 2148 OMIM phenotypes. We find that our RIMC method identifies more true associations than all the baseline algorithms for all values of k. The NMF based algorithm is better than the three other baseline algorithms. LRLSLDA’s association retrieval was the worst because it relies only on the known association matrix and the expression profiles of the lincRNAs, which seem insufficient to build a predictive model.
Induction on new associations
Here we conducted a thorough comparative study of three algorithms, including two of ours (RIMC and SRIMC), in predicting associations between novel lincRNAs and/or diseases. We assume that all the features of the novel lincRNAs and/or diseases that we bring into our prediction framework are available or can be computed. Note that all baseline algorithms except the standard inductive matrix completion based approach (standard IMC) are omitted from the experiments in this section, because none of them is capable of induction on novel associations.
Induction experiments on new LincRNAs
From the dataset in our study we randomly selected 10% of the lincRNAs and deleted all the entries of these lincRNAs from the three training matrices A,X and Y. The deleted entries serve as the test set during evaluation. Then, RIMC, SRIMC and the standard IMC were trained on the modified training matrices. Once training was done on the reduced dataset, each of the three resulting models was evaluated on the test set extracted at the beginning of this step. We repeated the entire training and test procedure 10 times and report the average performance score of all three methods. Figure 2 illustrates the performance comparison of the three methods for predicting associations between a new lincRNA and an existing set of diseases. RIMC and SRIMC show better precision@k than the standard IMC based approach for predicting up to the top-50 disease associations with the new lincRNAs. For higher values of k, RIMC and the standard IMC show similar performance, although numerically the precision of RIMC remains higher than that of the standard IMC. In terms of recall@k, both SRIMC and RIMC are superior to the standard IMC method.
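The hold-out protocol for novel lincRNAs can be sketched as below. The function name `holdout_rows` and the toy sizes are hypothetical, and the sketch makes one explicit assumption: only A and X are split, because the held-out lincRNAs appear as rows of those two matrices, while the disease feature matrix Y is left unchanged.

```python
import numpy as np

def holdout_rows(A, X, frac=0.10, seed=0):
    """Hold out a random fraction of lincRNAs (rows of A and X) as a test set."""
    rng = np.random.default_rng(seed)
    n_rows = A.shape[0]
    n_test = max(1, int(round(frac * n_rows)))
    test_idx = np.sort(rng.choice(n_rows, size=n_test, replace=False))
    train_mask = np.ones(n_rows, dtype=bool)
    train_mask[test_idx] = False
    train = (A[train_mask], X[train_mask])   # used to fit the model
    test = (A[test_idx], X[test_idx])        # hidden associations and features
    return train, test, test_idx

rng = np.random.default_rng(0)
A = (rng.random((20, 6)) < 0.2).astype(float)   # toy association matrix
X = rng.random((20, 4))                         # toy lincRNA features

(A_train, X_train), (A_test, X_test), test_idx = holdout_rows(A, X)
```

Repeating the split with ten different seeds and averaging the scores mirrors the repeated evaluation described in the text; the disease hold-out is the same operation applied to the columns of A and the rows of Y.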
Induction experiments on new diseases
Similar to the approach in the previous section, we randomly selected 10% of the disease phenotypes from the dataset of the study and deleted all the entries related to these diseases. The deleted entries form the test set, and the reduced dataset serves as the training dataset. RIMC, SRIMC and the standard IMC were trained on the reduced training dataset and evaluated against the test set. The entire training and evaluation procedure was repeated 10 times and the average performance scores are reported. Figure 3 illustrates the performance comparison of the three methods in predicting associations between a known list of lincRNAs and a novel disease. Here, both RIMC and SRIMC demonstrate better induction performance in terms of the precision@k and recall@k values.
Induction experiments on both new LincRNAs and new diseases
Finally, in this batch of induction experiments, we randomly picked 5% of the disease entries and 5% of the lincRNA entries and deleted the respective connections between the two entities from the three data matrices A,X and Y. The deleted connections and feature sets are treated as the test set, while the reduced data matrices are used to train the three algorithms. We repeated the above steps 10 times and computed the average performance scores. Figure 4 illustrates the performance comparison of our proposed RIMC and SRIMC with the only applicable baseline, the standard IMC, in predicting an association between a new lincRNA and a new disease, using a model trained on a limited set of lincRNAs and disease phenotypes that includes neither of them. The precision@k plots for RIMC and SRIMC show better performance than the standard IMC based approach for both lower and higher values of k in the top-k association ranking with the novel diseases. From the recall@k curves, we can see that RIMC and the standard IMC perform similarly in the top-k association prediction problem, whereas SRIMC is superior to both.
Conclusions
In this article, we propose the theoretical foundations of a robust inductive matrix completion method using the ℓ2,1 norm. We provided three algorithms to solve our robust inductive matrix completion objective function. The first two algorithms are equivalent, whereas the third, which we call Stable Robust Inductive Matrix Completion (SRIMC), breaks the problem into two sub-problems and turns out to be a simple, stable and better solution strategy. We applied the proposed methods to identify missing links between putative lincRNAs and human disease phenotypes. All three variants of robust inductive matrix completion are well suited to noisy datasets. Besides the standard IMC formulation, our proposed method also outperformed four other lincRNA-disease association solutions. The proposed methods are applicable to predicting associations between well-studied lincRNAs and novel diseases, between novel lincRNAs and well-studied diseases, or between a set of novel lincRNAs and novel diseases.
Acknowledgements
We express our deepest gratitude to Dr. Kytai Nguyen in the Bioengineering Department at the University of Texas at Arlington for the comments and feedback on our lincRNA discovery results.
Funding
Not applicable.
Availability of data and materials
The ChIP-base dataset is available at https://omictools.com/chipbase-tool. The Linc2GO dataset is available at: https://omictools.com/linc2go-tool. The SNP-lincRNA data can be found at: http://bioinfo.life.hust.edu.cn/lncRNASNP/.
About this supplement
This article has been published as part of BMC Medical Genomics Volume 10 Supplement 5, 2017: Selected articles from the IEEE BIBM International Conference on Bioinformatics & Biomedicine (BIBM) 2016: medical genomics. The full contents of the supplement are available online at https://bmcmedgenomics.biomedcentral.com/articles/supplements/volume-10-supplement-5.
Authors’ contributions
AKB conceived the package and wrote the manuscript. DK and MK contributed to data analysis and programming for the experiment. CD contributed to the two-step algorithm to solve the proposed RIMC algorithm. JG provided overall supervision. All authors reviewed, edited and approved the final manuscript.
Ethics approval and consent to participate
All the datasets used in this study are from publicly available data repositories. No patient samples were used or collected in this study.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Abbreviations
- FPKM
Fragments per kilobase of exons per million fragments
- GWAS
Genome wide association studies
- HTS
High throughput sequencing
- IMC
Inductive matrix completion
- lincRNAs
Long intergenic non-coding RNAs
- lncRNAs
Long non-coding RNAs
- NMF
Non-negative matrix factorization
- RIMC
Robust inductive matrix completion
- SRIMC
Stable robust inductive matrix completion
- SNP
Single nucleotide polymorphism
- TFBS
Transcription factor binding sites
- TF-IDF
Term frequency inverse document frequency
Contributor Information
Ashis Kumer Biswas, Email: ashis.biswas@ucdenver.edu.
Dongchul Kim, Email: dongchul.kim@utrgv.edu.
Mingon Kang, Email: mkang9@kennesaw.edu.
Chris Ding, Email: chqding@uta.edu.
Jean X. Gao, Email: gao@uta.edu.
References
- 1. Alexander RP, Fang G, Rozowsky J, Snyder M, Gerstein MB. Annotating non-coding regions of the genome. Nat Rev Genet. 2010;11(8):559–71. doi: 10.1038/nrg2814.
- 2. Esteller M. Non-coding RNAs in human disease. Nat Rev Genet. 2011;12(12):861–74. doi: 10.1038/nrg3074.
- 3. Cabili MN, Trapnell C, Goff L, Koziol M, Tazon-Vega B, Regev A, Rinn JL. Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses. Genes Dev. 2011;25(18):1915–27. doi: 10.1101/gad.17446611.
- 4. Chen G, Wang Z, Wang D, Qiu C, Liu M, Chen X, Zhang Q, Yan G, Cui Q. LncRNADisease: a database for long-non-coding RNA-associated diseases. Nucleic Acids Res. 2013;41(D1):983–6. doi: 10.1093/nar/gks1099.
- 5. Chen X, Yan GY. Novel human lncRNA–disease association inference based on lncRNA expression profiles. Bioinformatics. 2013;29(20):2617–24. doi: 10.1093/bioinformatics/btt426.
- 6. Ganegoda GU, Li M, Wang W, Feng Q. Heterogeneous network model to infer human disease-long intergenic non-coding RNA associations. NanoBioscience IEEE Trans. 2015;14(2):175–83. doi: 10.1109/TNB.2015.2391133.
- 7. Liu MX, Chen X, Chen G, Cui QH, Yan GY. A computational framework to infer human disease-associated long noncoding RNAs. PloS ONE. 2014;9(1):84408. doi: 10.1371/journal.pone.0084408.
- 8. Jain P, Dhillon IS. Provable inductive matrix completion. arXiv preprint arXiv:1306.0626. 2013. https://arxiv.org/abs/1306.0626.
- 9. Liu W, Zheng N, You Q. Nonnegative matrix factorization and its applications in pattern recognition. Chin Sci Bull. 2006;51(1):7–18. doi: 10.1007/s11434-005-1109-6.
- 10. Biswas AK, Kim DC, Kang M, Gao JX. Robust inductive matrix completion strategy to explore associations between lincRNAs and human disease phenotypes. In: Bioinformatics and Biomedicine (BIBM), 2016 IEEE International Conference On. Shenzhen: IEEE; 2016.
- 11. Luo D, Ding C, Huang H. Towards structural sparsity: an explicit l2/l0 approach. In: 2010 IEEE International Conference on Data Mining. Sydney; 2010. p. 344–53. doi: 10.1109/ICDM.2010.155.
- 12. Nie F, Huang H, Cai X, Ding CH. Efficient and robust feature selection via joint ℓ2,1-norms minimization. In: Advances in Neural Information Processing Systems. Vancouver; 2010. p. 1813–21.
- 13. Lee DD, Seung HS. Learning the parts of objects by non-negative matrix factorization. Nature. 1999;401(6755):788–91. doi: 10.1038/44565.
- 14. Kong D, Ding C, Huang H. Robust nonnegative matrix factorization using l2,1-norm. In: Proceedings of the 20th ACM International Conference on Information and Knowledge Management. Glasgow: ACM; 2011.
- 15. Ding CH, Li T, Jordan MI. Convex and semi-nonnegative matrix factorizations. IEEE Trans Pattern Anal Mach Intell. 2010;32(1):45–55. doi: 10.1109/TPAMI.2008.277.
- 16. Amberger JS, Bocchini CA, Schiettecatte F, Scott AF, Hamosh A. OMIM.org: Online Mendelian Inheritance in Man (OMIM®), an online catalog of human genes and genetic disorders. Nucleic Acids Res. 2015;43(D1):789–98. doi: 10.1093/nar/gku1205.
- 17. Yang JH, Li JH, Jiang S, Zhou H, Qu LH. ChIPBase: a database for decoding the transcriptional regulation of long non-coding RNA and microRNA genes from ChIP-Seq data. Nucleic Acids Res. 2013;41(D1):177–87. doi: 10.1093/nar/gks1060.
- 18. Liu K, Yan Z, Li Y, Sun Z. Linc2GO: a human LincRNA function annotation resource based on ceRNA hypothesis. Bioinformatics. 2013;29(17):2221–2. doi: 10.1093/bioinformatics/btt361.
- 19. Gong J, Liu W, Zhang J, Miao X, Guo AY. lncRNASNP: a database of SNPs in lncRNAs and their potential functions in human and mouse. Nucleic Acids Res. 2015;43(D1):181–6. doi: 10.1093/nar/gku1000.
- 20. Caniza H, Romero AE, Paccanaro A. A network medicine approach to quantify distance between hereditary disease modules on the interactome. Sci Rep. 2015;5:17658. doi: 10.1038/srep17658.
- 21. Natarajan N, Dhillon IS. Inductive matrix completion for predicting gene–disease associations. Bioinformatics. 2014;30(12):60–8. doi: 10.1093/bioinformatics/btu269.
- 22. Lian D, Zhao C, Xie X, Sun G, Chen E, Rui Y. GeoMF: Joint geographical modeling and matrix factorization for point-of-interest recommendation. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York City: ACM; 2014.