Skip to main content
Bioinformatics logoLink to Bioinformatics
. 2013 Jun 19;29(13):i89–i97. doi: 10.1093/bioinformatics/btt231

Integrating sequence, expression and interaction data to determine condition-specific miRNA regulation

Hai-Son Le 1, Ziv Bar-Joseph 1,2,*
PMCID: PMC3694655  PMID: 23813013

Abstract

Motivation: MicroRNAs (miRNAs) are small non-coding RNAs that regulate gene expression post-transcriptionally. MiRNAs were shown to play an important role in development and disease, and accurately determining the networks regulated by these miRNAs in a specific condition is of great interest. Early work on miRNA target prediction has focused on using static sequence information. More recently, researchers have combined sequence and expression data to identify such targets in various conditions.

Results: We developed the Protein Interaction-based MicroRNA Modules (PIMiM), a regression-based probabilistic method that integrates sequence, expression and interaction data to identify modules of mRNAs controlled by small sets of miRNAs. We formulate an optimization problem and develop a learning framework to determine the module regulation and membership. Applying PIMiM to cancer data, we show that by adding protein interaction data and modeling cooperative regulation of mRNAs by a small number of miRNAs, PIMiM can accurately identify both miRNA and their targets improving on previous methods. We next used PIMiM to jointly analyze a number of different types of cancers and identified both common and cancer-type-specific miRNA regulators.

Contact: zivbj@cs.cmu.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

1 BACKGROUND

MicroRNAs (miRNAs) are a family of small non-coding RNA molecules that regulate gene expression post-transcriptionally. These single-stranded RNAs, 19–25 nt long, are initially transcribed as longer independent genes or from introns of protein-coding genes. MiRNAs are now known to play a major role in development (Bartel, 2009), various brain functions (Shao et al., 2010) and diseases (Meola et al., 2009). Since their discovery, several 100 miRNAs were identified in each of several different species, including mammals, worms, flies and plants (He and Hannon, 2004). Initial discovery of large sets of miRNAs relied heavily on sequence and conservation analysis (Bartel, 2009), although recent advances in sequencing capacity are now allowing researchers to validate and identify additional miRNAs experimentally (Motameny, 2010). Most miRNAs target the genes they regulate by binding to the 3′-untranslated region of the target mRNAs (using complementary base pairing) and recruiting additional machinery to either degrade these mRNAs or prevent them from being translated. The miRNA regulation is ubiquitous, and a single miRNA can target 100s and even 1000s of genes. As the effect of each miRNA on any single target is often limited, they often work cooperatively with multiple miRNAs targeting the same mRNA in a specific condition (Krek et al., 2005; Krol et al., 2010).

Although the set of active miRNAs can often be determined experimentally (by measuring their expression levels), identifying their targets is much more challenging. Determining such target set is important for fully understanding the role of various miRNAs and to model the networks they regulate in a condition of interest. Initially, computational methods developed to predict such targets primarily relied on sequence information, in some cases, also using conservation information and/or secondary structure predictions. These methods search for base pair complementarity between the mature miRNA and 3′-untranslated regions of all mRNAs, allowing for some mismatches (the penalty for mismatches differs between the methods). Popular methods include TargetScan (Lewis et al., 2005), miRBase (now called MicroCosm) (Griffiths-Jones et al., 2006), miRanda (John et al., 2004) and PicTar (Krek et al., 2005).

Although these predictions are useful, because of the short length of miRNAs, they lead to many false positives and some false negatives (Betel et al., 2010). Conservation analysis has proven especially problematic in this domain, as several real targets are not well conserved and would be ignored if conservation is a requirement (Barakat et al., 2007). In addition, sequence data are static and do not change in different conditions or at different times. Thus, based on sequence data alone, it is impossible to map the set of targets for specific miRNA in a condition of interest (as most genes are not expressed in any specific condition or tissue). Finally, miRNAs often work cooperatively in small groups. As miRNA activation is condition specific, using this cooperative regulation property requires the use of condition-specific data, which of course cannot be inferred from sequence information alone.

Transcription factors (TFs) also play a major role in regulating gene expression, and they have been shown to work combinatorially with miRNAs (Sun et al., 2012). However, a pre-requisite for such combinatorial analysis is a list of targets for individual miRNAs. Unlike TFs, which can serve as activators or repressors and are often post-transcriptionally regulated, miRNAs are only transcriptionally regulated and inhibit their direct targets. This has led to several studies that isolated the miRNA target prediction task by integrating sequence, mRNA and miRNA expression data (Cheng and Li, 2008; Huang et al., 2007a; Joung et al., 2007; Ooi et al., 2011). Unlike sequence data, expression data are dynamic and condition-specific and thus provide useful clues about the set of active miRNAs and mRNAs. A number of methods, mostly based on (anti) correlation or regression analysis using the expression levels of miRNAs and predicted mRNA targets, were suggested for this task (Huang et al., 2011; Wang and Li, 2009). A representative example for this group is GenMiR++ (Huang et al., 2007a), one of the first methods to integrate miRNA and mRNA expression profiles in a unified probabilistic model. Given an expression dataset for both miRNAs and mRNAs and a set of putative miRNA–mRNA interactions (inferred from sequence data), GenMiR++ uses a generative probabilistic regression model to assign targets to miRNAs. It was successfully applied to identify targets of let-7b in retinoblastoma. Another approach is to project mRNA expression data on pathway databases and compute the correlation between miRNAs and average pathway expression levels to identify likely regulators of signaling pathways (Ooi et al., 2011). Although this method does not identify specific targets, it can be used to infer the function of specific miRNAs based on the pathways they regulate. A number of other methods for integrating miRNA and mRNA expression data have been proposed, see (Muniategui et al. 2012) for a recent review.

Finally, there is growing evidence that interacting proteins are more likely to be co-regulated by the same miRNAs (Hsu et al., 2008; Liang and Li, 2007). It has also been shown that some miRNAs coordinately target protein complexes (Sass et al., 2011). Although such complementary information may be important, few previous works have taken advantage of it to predict condition-specific interactions. An exception is a recent work by Zhang et al. (2011), which developed SNMNMF to integrate protein interactions with miRNA and mRNA expression data. The method is based on a non-negative matrix factorization analysis, which factorizes the two expression data matrices such that the two share one common factor, which is assumed to be the module basis matrix Inline graphic. Note, however, that although this method was successfully applied to analyze Ovarian cancer data, it does not use a regression model to explain mRNA expression levels, or requires that miRNAs and mRNAs in the same module be anti-correlated; therefore, the resulting modules do not fully use current knowledge regarding the inhibitory role of miRNAs, which may lead to missing important interactions.

The methods discussed earlier in the text successfully integrated expression and sequence data. However, a major point that is often ignored by these prediction methods is the combinatorial aspect of miRNA regulation. Several studies have shown that individual miRNAs have only limited impact on their targets (Malumbres, 2012) and multiple (different) miRNAs are needed to drastically reduce transcription levels of targets. To allow the use of such group- or module-based regulatory model, we have recently developed GroupMiR (Le and Bar-Joseph, 2011), which uses a non-parametric Bayesian prior based on the Indian Buffet Process (IBP; Griffiths and Ghahramani, 2006) to identify modules of co-regulated miRNAs and their target mRNAs. As we have shown, by using a module-based approach, we can improve on methods that treat miRNAs or mRNAs individually improving the set of correctly recovered miRNA–mRNA interactions (Le and Bar-Joseph, 2011).

Here, we present the Protein Interaction based MicroRNA Modules (PIMiM) method, which extends the regression framework of GroupMiR by using an additional type of data: protein interactions (Fig. 1). As we show, by defining a new target function that encourages interacting proteins to belong to the same module, we can use such data and integrate it with expression and sequence-based data in a probabilistic model. We develop an iterative learning procedure to learn the parameters of our model and show that it converges to a local minima. Comparison of PIMiM with previous methods indicates that by combining a module-based approach with protein interaction data, we can improve on both methods that only rely on modules (GroupMiR) and methods that rely on protein interaction (SNMNMF). We used PIMiM to study miRNA in several types of cancers, allowing us to identify novel regulators that either span multiple cancer types or are unique to specific cancers.

Fig. 1.

Fig. 1.

Data used as input for PIMiM. In addition the miRNA and mRNA expression data, PIMiM uses sequence-based predictions of miRNA–mRNA interactions and protein–protein interactions. These datasets are integrated as discussed in Section 2

2 METHODS

2.1 Overview

We developed PIMiM, a module-based method that predicts targets for miRNAs by assigning them, together with the mRNAs they regulate, to one of K modules. Modules may contain several miRNAs and many mRNAs, and both miRNA and mRNAs can be assigned to 0, 1 or multiple modules, and thus modules may overlap.

The input to PIMiM is condition-specific miRNA and mRNA expression data (usually multiple measurements from patients or different time points). In addition, we use sequence-based predictions of miRNA–mRNA interactions (any probabilistic predictions can be used) and static protein interaction data. Using these datasets we learn a regularized probabilistic regression model in which mRNA data are regressed to the expression data of miRNAs assigned to modules regulating it. The downregulation effect of an miRNA on the expression of its target mRNA is aggregated across all modules, allowing information to be shared between modules in the learning process. Our probabilistic model rewards the assignments of predicted miRNA–mRNA pairs to the same module and also rewards assignment of mRNAs of interacting proteins to the same module. Combined, the modules explain the observed mRNA expression data as a function of their regulating miRNAs and the set of proteins they interact with.

2.2 Notations

We use the following notation in the rest of the article. We assume there are M miRNAs and N mRNAs in each sample. We denote expression profiles of miRNAs by Inline graphic and of mRNAs by Inline graphic, where Inline graphic and Inline graphic are vectors with the expression levels of miRNA i and mRNA j, respectively, in all samples. Both matrices have P columns corresponding to the P-matched samples. In addition, let Inline graphic (sparse Inline graphic matrix) be the weighted adjacency matrix of the protein interactions [obtained from databases, such as BioGRID (Stark et al., 2011) or TRANSFAC (Wingender et al., 2000)] and Inline graphic (sparse Inline graphic matrix) be the list of predicted interactions of miRNAs and mRNAs from sequence data (obtained from prediction databases, such as MicroCosm; Griffiths-Jones et al., 2006). We also define Inline graphic and Inline graphic as binary matrices indicating whether an entry of Inline graphic and Inline graphic, respectively, is non-zero.

For learning K modules, our goal is to determine (learn) the values of the membership parameters uik and vjk, which represent the propensity that miRNA i or mRNA j belong to module k. Naturally, we restrict these parameters to be non-negative: Inline graphic and Inline graphic, where we interpret that an miRNA or an mRNA is not assigned to a module if the corresponding parameter is zero. We use matrices Inline graphic and Inline graphic to represent this complete set of membership parameters. Finally, we use the following subscript such as Inline graphic or Inline graphic to denote the kth column of the matrices.

graphic file with name btt231um1.jpg

2.3 Probabilistic regression model

Following previous works (Huang et al., 2007b; Le and Bar-Joseph, 2011), we use a regression-based method to link the expression profiles of miRNAs and mRNAs. Expression values of mRNAs are assumed to be downregulated from a baseline expression level by a linear combination of expression profiles of all their predicted miRNA regulators. For example, mRNA j’s expression values are distributed as: Inline graphic, where Inline graphic is the baseline expression level, Inline graphic are weights associated with miRNAs (which previous methods learn individually for each mRNA) and Inline graphic is the set of predicted miRNA regulators of mRNA j.

We depart from these previous models in how we specify miRNA regulators and how we learn the weights Inline graphic. First, each mRNA is assumed to be a target of all miRNAs assigned to the modules it belongs to as long as they are predicted to regulate it (Inline graphic). Formally, mRNA j is the target of the set of miRNAs Inline graphic. Second, the downregulation weights are aggregated across all modules, such as Inline graphic.

Given these assumptions, the likelihood of the observed expression values is

graphic file with name btt231m1.jpg (1)

where Inline graphic is the per-sample variance terms.

2.4 Using protein interactions

So far PIMiM only uses expression values in a regression setting (although we constrain the regulators to come from the sequence-based predicted set, the regression model itself does not directly encourage the assignment of miRNA and predicted mRNA targets to the same module).

To incorporate the input interaction data (predicted miRNA–mRNA pairs Inline graphic and protein interactions Inline graphic), we use a function that rewards assignments to the same module based on the strength of the predicted edge as follows:

graphic file with name btt231m2.jpg (2)

Where α and β are positive tuning parameters that are used to adjust the contributions of the two types of interaction data in our model and Inline graphic is the logistic-sigmoid function. If available (as is the case for the miRNA–mRNA interaction data), we use probabilities for Inline graphic and Inline graphic derived directly from the prediction or experimental databases (see Section 4). We deliberately do not include penalty terms for zero entries of Inline graphic because this interaction matrix is extremely sparse (the number of known protein–protein interactions is small compared with the total number of possible interactions). Penalizing zero entries when using such a sparse matrix would lead to small modules and may be less biologically accurate, as not all co-targets of a miRNA interact.

These terms indicates that the higher the probability of interaction (both miRNA–mRNA and protein–protein) the more likely it is that the interacting entities would be assigned to the same set of modules. This is done globally across all modules. For instance, if Inline graphic is positive, we have previous knowledge that miRNA i and mRNA j interact. To maximize the likelihood Inline graphic, we would need to learn parameters that lead to large values of Inline graphic, which means that the method is more likely to place them in the same module.

2.5 Overall log-likelihood

To summarize, our target is to minimize the following negative log-likelihood:

graphic file with name btt231m3.jpg (3)

The first term evaluates how well the miRNA expression explains the observed mRNA expression, whereas the second and third terms are rewards for assigning predicted miRNA–mRNA pairs and protein interaction pairs to the same module, respectively. This function is non-convex and thus can have multiple local minima solutions. To constrain the set of solutions, we add a number of regularization terms. First, we add two sets of Inline graphic norm constraints for the vectors Inline graphic and Inline graphic. Inline graphic norm contraints encourage sparsity leading to smaller and tighter modules. As our goal is to reduce false positives, such constraints are useful, as they reduce the set of predicted miRNA–mRNA pairs. Specifically, we require that

graphic file with name btt231um2.jpg

We are using two different regularization parameters C1 and C2. This is because the number of miRNAs and mRNAs are different; therefore, a single number does not yield good solutions. Moreover, we choose to use these constraints explicitly instead of adding them to the objective function (using Lagrangian multipliers), as this formulation is simpler to solve in our optimization procedure.

Together, our learning phase solves the following optimization:

graphic file with name btt231m4.jpg (4)

2.6 Learning the parameters of our model

In this section, we discuss how to solve the optimization problem from (4) to determine values for the parameters of our model. As aforementioned, this problem is non-convex, and we cannot analytically compute general solutions. However, we notice that by holding Inline graphic and Inline graphic fixed, we can solve for Inline graphic and Inline graphic in a closed form using standard linear regression:

graphic file with name btt231um3.jpg

where Inline graphic for Inline graphic and Inline graphic.

To solve for Inline graphic and Inline graphic for given values of Inline graphic and Inline graphic, we use a projected quasi-Newton (PQN) method (Schmidt et al., 2009). Quasi-Newton methods construct an approximation to the Hessian by using the observed gradients at successive iterations. We use the MATLAB implementation min_PQN (http://www.di.ens.fr/mschmidt/Software/PQN.html). There are several reasons why we chose this method instead of directly working with the Hessian. First, our set of constraints is convex, and the projection on this set can be done analytically. Second, although we can compute both the gradients and Hessian of Inline graphic, the memory required to store the Hessian is often too large given the dimensions of the expression data (Inline graphic). Moreover, because of interactions between miRNAs and mRNAs, the Hessian is not necessary sparse even if both Inline graphic and Inline graphic are. During the projection step, to speed-up the convergence of the algorithm, we set the entries of Inline graphic, which do not have predicted interactions to zero.

Using the updated values for Inline graphic and Inline graphic, we once again solve for Inline graphic and Inline graphic and so on. These two steps lead to an iterative procedure to solve (4) along the lines of coordinate-descent methods. This procedure converges to the local minima because of the fact that the objective function is bounded below, and the sequence of function values is monotonically decreasing, and the gradients at the convergence are zeros. As the problem is non-convex, we perform the learning process several times, randomly initializing the parameters each time. After repeating this process several times (10 iterations in our experiments), we select the parameters from the result that leads to the lowest value for our objective function.

Finally, the regularization and data-type–weighting parameters Inline graphic and C2 are chosen based on an external evaluation discussed in Section 4.

3 CONSTRAINT MODULE LEARNING FOR MULTIPLE CONDITION ANALYSIS

So far we have discussed our approach for identifying miRNA-regulated modules using a condition-specific expression dataset. Although the optimization problem in Equation (4) can be used with expression data from multiple conditions (e.g. different types of cancer), the output is one set of modules for all conditions. In some cases, directly identifying similar and divergent modules across conditions is an important goal. Consider, for example, joint analysis of multiple types of cancers. Although some researchers may be interested in regulatory modules that are activated in all different cancer types, others may be interested in unique aspects, or modules, of a specific cancer type when compared with other types of cancer.

In our problem, we would like to learn a set of modules for T different conditions. The interaction input matrices Inline graphic and Inline graphic are fixed, whereas for each condition t, we have a set of expression measurements Inline graphic and Inline graphic. Given this input, we jointly learn T sets of modules Inline graphic. The number of modules is also fixed for all conditions.

This type of learning is called multi-task learning (Caruana, 1997) in the machine-learning community, where many related models are learned simultaneously using the same internal representation. Such learning allows different models (or cancer types) to share some parameters, which improves learning while at the same time it can also identify unique parameters for specific types. In several cases, such framework was shown to lead to better solutions (Caruana, 1997). Many existing methods proposed for multi-task learning focus on multi-output regression problems, where it is often desirable to obtain sparse solutions by performing covariate selection. They rely on regularization technique to jointly select a set of covariates that are relevant to many tasks. One can apply Inline graphic penalty of group lasso to select covariates relevant to all tasks (Obozinski et al., 2010).

Here, we adopt the Inline graphic penalty of group lasso to regularize the modules over T conditions with the following penalty:

graphic file with name btt231um4.jpg

This penalty encourages entries Inline graphic and Inline graphic to be selected together, which means that miRNAs and mRNAs are assigned to the same modules across conditions. As the penalty is not differentiable at 0, we reformulate the optimization problem by moving the non-differentiable part to the constraints as suggested in (Liu et al., 2009):

graphic file with name btt231m5.jpg (5)

Here, we have introduced new variables Inline graphic and Inline graphic into the problem. We update the projection step in Section 2.6 with the projection on the new Inline graphic norm balls in the constraint set as shown in Liu et al. (2009) (Theorem 4).

4 RESULTS

4.1 MiRNA regulation in ovarian cancer

To test PIMiM and to compare it with previous methods for determining condition-specific miRNA regulation (SNMNMF and GroupMiR), we use the ovarian cancer dataset from Zhang et al. (2011). This dataset contains 385 samples from cancer patients, each measuring the expression of 559 miRNAs and 12 456 mRNAs and was downloaded from the Cancer Genome Atlas data portal (TCGA) (https://tcga-data.nci.nih.gov/tcga/). In addition to expression data, the sequence-based prediction of miRNA–mRNA interactions was downloaded from MicroCosm (Griffiths-Jones et al., 2006), and protein interaction data were downloaded from TRANSFAC (Wingender et al., 2000). We only use MicroCosm here to allow a fair comparison with SNMNMF, which only uses these data. In subsequent analysis, we use other sequence-based prediction methods as well. To evaluate the accuracy of each method, we used a set of 115 cancer miRNAs that were determined to participate in ovarian cancer in a recent review article (Koturbash et al., 2011; Tables 1 and 2). Using this set we compute the precision, recall and F1 score (the harmonic mean of precision and recall) of the set of miRNAs identified by each method.

Table 1.

Evaluation of all methods on the ovarian cancer dataset

F1 score Cancer miRNAs
Expression correlation Number of genes/module
F1 Precision Recall
PIMiM 0.3768 0.3230 0.4522 −0.0131 67.80
SNMNMF 0.3588 0.3197 0.4087 0.0745 79.26
GroupMiR 0.1227 0.2083 0.0870 0.0408 54.82

Note: The expression correlation values and number of genes are averaged across modules. Expression correlation: the correlation of expression values of miRNAs and mRNAs. Bold values are the best values for the column (highest or lowest depending on the context).

Table 2.

miRNAs specifically identified for a cancer type

MiRNAs Predicted type BRCA GBM AML
hsa-miR-663 BRCA Khoshnaw et al. (2009)
hsa-miR-433 GBM Hua et al. (2012)
hsa-miR-99b AML Garzon et al. (2007)

The number of modules K was set to 50 for the non-negative matrix factorization method (SNMNMF) as suggested in Zhang et al. (2011). PIMiM also requires setting regularization and weight parameters Inline graphic and C2. To set these, we performed an iterative line search (holding three of the four parameters fixed and adjusting the value of the fourth until convergence) to determine the values of these parameters using the F1 score as the target function to optimize. Based on this analysis, we selected K = 40 for PIMiM (see Supplementary Fig. S3 for details). SNMNMF was also run with the optimized set of parameters and input data described in Zhang et al. (2011). Unlike PIMiM and SNMNMF, GroupMiR uses a non-parametric Bayesian prior for the number of modules; therefore, this number cannot be fixed in advance. Thus, for GroupMiR, we report modules and interactions with posterior probability at least 0.3 to get a set of comparable size with other methods. Previously, GroupMiR was shown to outperform several other methods (Le and Bar-Joseph, 2011) including GenMiR++ (Huang et al., 2007b); therefore, we omitted comparison with these methods here. Figure 2 presents a graphical view of the modules identified by PIMiM and SNMNMF. We color interaction edges between genes using different colors for each module. The modules identified by PIMiM are more dense and, hence, are in better agreement with previous findings regarding the regulation of interacting proteins by miRNAs.

Fig. 2.

Fig. 2.

Interactions between genes of the modules. We show an edge between two genes if they are members of a module and their interaction exists in the database. Each color corresponds to one module. Genes with no edges are omitted to improve visualization

4.1.1 Evaluation: identifying cancer miRNAs

We first looked at the set of miRNAs identified by each method (those belonging to the modules returned by each of the methods). The results in Table 1 demonstrate that using the protein interaction data greatly increases precision, recall and the F1 score. Both methods that use these data (PIMiM and SNMMNF) clearly outperform GroupMiR on this set. In addition, using a regression model also helps as indicated by the increase in F1 score PIMiM obtains over SNMNMF.

4.1.2 Expression coherence

In addition to analyzing the set of identified miRNAs, we also computed the average anti-correlation between miRNAs and mRNAs in the modules identified by each of the methods (Table 1). In this analysis, GroupMiR achieves the highest anti-correlation between miRNAs and the mRNAs they regulate in a module. This is the result of a much smaller module size identified by GroupMiR. As protein interactions are not used, mRNAs in these modules are selected because they are strongly anti-correlated with the miRNAs predicted to regulate the modules. This requirement leads to smaller modules and a better (anti) correlation between miRNAs and mRNAs. Still, PIMiM improves on SNMNMF in identifying anti-correlated miRNA–mRNA pairs. SNMNMF’s objective function does not explicitly include a component for expression anti-correlation between miRNAs and mRNAs, which may explain why it does not capture the inhibitory role of miRNAs. Thus, PIMiM provides a useful compromise between relying strongly on protein interactions, which improves accuracy and using the observed expression values in a regression setting.

4.1.3 MSigDB enrichment analysis

To test the biological function of the modules, we used 880 gene sets of canonical pathways (C2-CP, v.3.0) from MSigDB (Subramanian et al., 2005). We used the hypergeometric distribution to compute enrichment P-values for each of the modules with each of the MSigDB gene sets. To correct for the multiple hypothesis testings, we used the Benjamini–Hochberg procedure implemented in the R function p.adjust, which computes a q-value for each intersection. The results are presented in Figure 3, which depicts the number of modules with at least one enriched set in the MSigDB enrichment analysis and the total number of unique enriched gene sets. PIMiM outperforms SNMNMF, achieving both better enrichment for individual modules and better coverage of different MSigDB sets. MSigDB pathways are biased toward cancer pathways and so may be more relevant for the data we are analyzing here than Gene Ontology analysis. In addition to cancer hits, top hits for MSigDB include signatures for β cells that have been linked to cancer (Pelengaris and Khan, 2001) and several translation-related categories.

Fig. 3.

Fig. 3.

MSigDB enrichment analysis: pathway enrichment analysis was done using 880 gene sets of canonical pathways (C2-CP) from MSigDB (Subramanian et al., 2005). P-values were computed using hypergeometric test (with 10 000 random permutations) on the intersection of the set of genes in each module with MSigDB gene sets. Benjamini–Hochberg procedure was used to control the false discovery rate. Top: Number of modules significantly enriched for at least one MSigDB category for different significance cut-offs. Bottom: Number of MSigDB categories identified as in enriched in at least one of the modules for different significance cut-off

4.1.4 The effect of β on the performance of PIMiM

To test the effects of using the protein interaction data in PIMiM, we re-run PIMiM with different β values. The results are presented in Figure 4. As the figure shows, when decreasing the value of β, the performance of PIMiM on all evaluation metrics decreases indicating the protein protein interactions (PPI) data are useful for identifying coherent modules. On the other hand, increasing β too much leads to high weight for PPI data at the expense of the expression information, which also negatively affects the performance of PIMiM. Thus, balancing the two data types, which is done by setting an intermediate value for β is key to the success of PIMiM.

Fig. 4.

Fig. 4.

The effect of protein interaction data to the result. We varied the value of β and tested the different metrics discussed in Section 4. As can be seen, both high and low values lead to reduced performance

4.2 Integrating data from multiple types of cancers

To further investigate miRNA control of different cancers, we applied PIMiM to a dataset of three cancer types using the multi-task learning framework described in Section 3. We learn three sets of modules for three types of cancer: breast invasive carcinoma (BRCA), Glioblastoma multiforme (GBM) and acute myeloid leukemia (AML). The miRNA and gene expression profiles of 89 BRCA, 498 GBM and 173 AML patients were downloaded from the TCGA. This set has 285 miRNAs and 10 922 mRNAs in common. Here, we combine the miRNA–mRNA predicted interactions from three public databases [MicroCosm (Griffiths-Jones et al., 2006), miRanda (John et al., 2004) and TargetScan (Lewis et al., 2005)] and protein interaction data from TRANSFAC (Wingender et al., 2000). For each cancer type, PIMiM learns 1 set of 50 modules. The parameters were set by optimizing for the F1 score of identifying miRNAs relevant to this dataset based on the set of cancer-related miRNAs from Koturbash et al. (2011). Figure 5 displays the miRNA-regulating modules in all three cancer types.

Fig. 5.

Fig. 5.

Inferred miRNA modules of the three cancer types (BRCA, GBM and AML). The x-axis shows the Inline graphic modules learned for the three cancer types (each x-axis bar is subdivided into three with the color corresponding to the cancer type). The y-axis shows miRNAs ordered by hierarchical clustering of their module membership vector. In several cases, the same miRNAs are predicted for all or two of the three cancer types

4.2.1 Analysis of identified miRNAs

Several of the modules identified by PIMiM are regulated by known cancer miRNAs. The overall F1 score for cancer miRNAs for the joint analysis was high for all three cancer types: BRCA (0.6167), GBM (0.5789) and AML (0.6111). Well-known cancer miRNAs reported by PIMiM include the let-7b/c/d/e (active in BRCA: Yu et al., 2007, GBM: Lee et al., 2011 and AML: Jongen-Lavrencic et al., 2008), mirR-302a/b/c/d cluster [suppression of the CDK2 and CDK4/6 cell cycle pathways (Lin et al., 2010)] and miR-96 (active in BRCA: Guttilla and White, 2009, AML: Zhao et al., 2010, miR-34a (active in BRCA: O’Day and Lal, 2010, GBM: Li et al., 2009, AML: Zenz et al., 2009, miR-15a/b (active in AML: Calin et al., 2008). Some members of the miR-17-92 cluster (miR-18b, miR-19a, miR-20a/b and miR-93) are also identified by PIMiM (active in BRCA: Mendell, 2008, GBM: Ernst et al., 2010, AML: Mi et al., 2010). Note that some well-known cancer miRNAs, including miR-17 and miR-92, are missing from the modules because their expression is not available for enough of the samples. Several other subsets of miRNAs were assigned to cooperatively regulate modules in multiple types of cancer as shown in Figure 5.

4.2.2 Cancer-specific miRNAs

In addition to finding common cancer regulators, PIMiM can be used to identify cancer-type–specific regulators. These can either be used as biomarkers for a sub-type or can be studied to determine the unique properties of each cancer type. Although it is hard to obtain negative information (i.e. an article that mentions that a certain miRNA does not regulate a specific cancer type) several of the predictions made by PIMiM agree with current literature that, at least so far, only mentions their role in the cancer they were assigned to by PIMiM. Table 2 lists a few of these miRNAs and the cancer type they were predicted to regulate.

4.2.3 Analyzing the miRNAs and mRNAs in an identified module

In addition to identifying important miRNAs for this particular study, PIMiM returns a set of modules providing predictions of cooperative regulation of miRNAs and their mRNAs targets. To demonstrate the informative power of this modular structure, we analyze in more details one of these modules (see also Supplementary Results for detailed discussion of other modules). Figure 6 depicts a network of miRNAs and mRNAs identified as part of Module 11. Across all cancer types, PIMiM identified a set of 14 strongly connected proteins. MiR-200a/b/c, miR-141 and miR-429 are predicted to regulate this set of mRNAs in all types of cancer. These miRNAs have previously been reported to play a role in cancer and cell proliferation (Korpal et al., 2008; Peter, 2009). Interestingly, the miR-200 family is located in two chromosomal regions on 1p36.33 (200b, 200a and 429) and 12p13.31 (200c and 141), respectively (Uhlmann et al., 2010), which may support our prediction of their cooperative regulation. Applying Gene Ontology analysis [using FuncAssociate (Berriz et al., 2009)] and MSigDB enrichment analysis to the set of 14 mRNAs in this module indicates that this set is enriched with members of transcription factor TFTC/STAGA and TFFIID complexes. Recent findings support the link between these complexes and cancer (Kurabe et al., 2007). This module also includes a tumor suppressor gene MSH2 (Wada-Hiraike et al., 2005) and a famous breast cancer susceptibility gene BRCA1 (Miki et al., 1994).

Fig. 6.

Fig. 6.

miRNAs and mRNAs assigned to Module 11 in all three cancer types. Color indicates the specific cancer type for which the mRNA or miRNA was selected as part of the module

5 CONCLUSIONS

We presented PIMiM, a new method for inferring condition-specific regulation of miRNAs and for identifying their targets. PIMiM combines sequence, expression and interaction data to discover miRNA-regulated modules of mRNAs. We use a probabilistic model that combines regression with network information to discover these modules. We developed an iterative learning procedure to learn the parameters of our model and a multi-task learning method for combining data from multiple conditions.

We tested PIMiM on ovarian cancer expression data and have shown that it correctly identifies miRNAs regulating this cancer type, and that it is able to group relevant genes together. Comparison with other methods indicates that by using protein interaction data, we can improve accuracy while at the same time PIMiM also maintains expression coherence among mRNAs and anti-correlation between miRNAs and the mRNAs they are predicted to regulate improving on previous methods that have also used protein interaction data. Application of the method to compare and contrast three types of cancer identified both common and unique regulators, which can allow researchers to determine the core cancer regulatory network and the differences in regulation among the various cancers we studied.

Although we believe PIMiM can already be of use to researchers that collect mRNA and miRNA expression data, there are a number of extensions that can further improve it. As aforementioned, we follow several other articles in isolating the miRNA target prediction task from the combinatorial analysis of miRNA–TF regulation. Although such an approach leads to good results as discussed earlier in the text, our longer term goal is to develop a method that can incorporate both types of regulation in a single-modeling framework. For this, we would need to determine the role a specific TF plays (activator or repressor) and its activity level [either based on its expression levels or on the set of its targets (Shi et al., 2009)]. With this information, we can incorporate TFs into our regression model to account for their part in regulating expression, which will hopefully lead to better results regarding the role played by specific miRNAs. The regression component that we considered in PIMiM uses a simple linear model to explain the regulation effect of multiple miRNAs. We could also extend this to incorporate other complex combinatorial analysis (for example, AND, OR logic). However, this requires an extension of the methods derived in the article. We thus leave this for future work. In addition, we would like to incorporate additional types of high-throughput data, for example, epigenetic data to our analysis framework.

Funding: National Institutes of Health (U01 HL108642) and National Science Foundation (DBI-0965316) award (to Z.B.J.).

Conflict of Interest: none declared.

Supplementary Material

Supplementary Data

REFERENCES

  1. Barakat A, et al. Conservation and divergence of microRNAs in populus. BMC Genomics. 2007;8:481. doi: 10.1186/1471-2164-8-481. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Bartel DP. MicroRNAs: target recognition and regulatory functions. Cell. 2009;136:215–233. doi: 10.1016/j.cell.2009.01.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Berriz GF, et al. Next generation software for functional trend analysis. Bioinformatics. 2009;25:3043–3044. doi: 10.1093/bioinformatics/btp498. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Betel D, et al. Comprehensive modeling of microRNA targets predicts functional non-conserved and non-canonical sites. Genome Biol. 2010;11:R90. doi: 10.1186/gb-2010-11-8-r90. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Calin GA, et al. miR-15a and miR-16-1 cluster functions in human leukemia. Proc. Natl Acad. Sci. USA. 2008;105:5166–5171. doi: 10.1073/pnas.0800121105. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Caruana R. Multitask learning. Mach. Learn. 1997;28:41–75. [Google Scholar]
  7. Cheng C, Li LM. Inferring microRNA activities by combining gene expression with microRNA target prediction. PLoS One. 2008;3:1989. doi: 10.1371/journal.pone.0001989. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Ernst A, et al. De-repression of CTGF via the miR-17-92 cluster upon differentiation of human glioblastoma spheroid cultures. Oncogene. 2010;29:3411–3422. doi: 10.1038/onc.2010.83. [DOI] [PubMed] [Google Scholar]
  9. Garzon R, et al. MicroRNA gene expression during retinoic acid-induced differentiation of human acute promyelocytic leukemia. Oncogene. 2007;26:4148–4157. doi: 10.1038/sj.onc.1210186. [DOI] [PubMed] [Google Scholar]
  10. Griffiths T, Ghahramani Z. Infinite latent feature models and the Indian buffet process. Adv. Neural Inform. Process. Syst. 2006;18:475. [Google Scholar]
  11. Griffiths-Jones S, et al. miRBase: microRNA sequences, targets and gene nomenclature. Nucleic Acids Res. 2006;34(Suppl. 1):D140–D144. doi: 10.1093/nar/gkj112. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Guttilla IK, White BA. Coordinate regulation of foxo1 by miR-27a, miR-96, and miR-182 in breast cancer cells. J. Biol. Chem. 2009;284:23204–23216. doi: 10.1074/jbc.M109.031427. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. He L, Hannon GJ. MicroRNAs: small RNAs with a big role in gene regulation. Nat. Rev. Genet. 2004;5:522–531. doi: 10.1038/nrg1379. [DOI] [PubMed] [Google Scholar]
  14. Hsu CW, et al. Characterization of microRNA-regulated protein-protein interaction network. Proteomics. 2008;8:1975–1979. doi: 10.1002/pmic.200701004. [DOI] [PubMed] [Google Scholar]
  15. Hua D, et al. A catalogue of glioblastoma and brain microRNAs identified by deep sequencing. OMICS. 2012;16:690–699. doi: 10.1089/omi.2012.0069. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Huang JC, et al. Using expression profiling data to identify human microRNA targets. Nat. Methods. 2007a;4:1045–1049. doi: 10.1038/nmeth1130. [DOI] [PubMed] [Google Scholar]
  17. Huang JC, et al. Bayesian inference of microRNA targets from sequence and expression data. J. Comput. Biol. 2007b;14:550–563. doi: 10.1089/cmb.2007.R002. [DOI] [PubMed] [Google Scholar]
  18. Huang GT, et al. mirConnX: condition-specific mRNA-microRNA network integrator. Nucleic Acids Res. 2011;39(Suppl. 2):W416–W423. doi: 10.1093/nar/gkr276. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. John B, et al. Human microRNA targets. PLoS Biol. 2004;2:e363. doi: 10.1371/journal.pbio.0020363. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Jongen-Lavrencic M, et al. MicroRNA expression profiling in relation to the genetic heterogeneity of acute myeloid leukemia. Blood. 2008;111:5078–5085. doi: 10.1182/blood-2008-01-133355. [DOI] [PubMed] [Google Scholar]
  21. Joung JG, et al. Discovery of microRNA–mRNA modules via population-based probabilistic learning. Bioinformatics. 2007;23:1141. doi: 10.1093/bioinformatics/btm045. [DOI] [PubMed] [Google Scholar]
  22. Khoshnaw SM, et al. MicroRNA involvement in the pathogenesis and management of breast cancer. J. Clin. Pathol. 2009;62:422–428. doi: 10.1136/jcp.2008.060681. [DOI] [PubMed] [Google Scholar]
  23. Korpal M, et al. The mir-200 family inhibits epithelial-mesenchymal transition and cancer cell migration by direct targeting of E-cadherin transcriptional repressors ZEB1 and ZEB2. J. Biol. Chem. 2008;283:14910–14914. doi: 10.1074/jbc.C800074200. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Koturbash I, et al. Small molecules with big effects: the role of the microRNAome in cancer and carcinogenesis. Mutat. Res. 2011;722:94–105. doi: 10.1016/j.mrgentox.2010.05.006. [DOI] [PubMed] [Google Scholar]
  25. Krek A, et al. Combinatorial microRNA target predictions. Nat. Genet. 2005;37:495–500. doi: 10.1038/ng1536. [DOI] [PubMed] [Google Scholar]
  26. Krol J, et al. The widespread regulation of microRNA biogenesis, function and decay. Nat. Rev. Genet. 2010;11:597–610. doi: 10.1038/nrg2843. [DOI] [PubMed] [Google Scholar]
  27. Kurabe N, et al. Deregulated expression of a novel component of TFTC/STAGA histone acetyltransferase complexes, rat SGF29, in hepatocellular carcinoma: possible implication for the oncogenic potential of c-Myc. Oncogene. 2007;26:5626–5634. doi: 10.1038/sj.onc.1210349. [DOI] [PubMed] [Google Scholar]
  28. Le H-SP, Bar-Joseph Z. (2011) Inferring interaction networks using the IBP applied to microRNA target prediction. In: Shawe-Taylor,J. et al., (eds.), Advances in Neural Information Processing Systems 24, Curran Associates, Inc. pp. 235–243. [Google Scholar]
  29. Lee ST, et al. Let-7 microRNA inhibits the proliferation of human glioblastoma cells. J. Neurooncol. 2011;102:19–24. doi: 10.1007/s11060-010-0286-6. [DOI] [PubMed] [Google Scholar]
  30. Lewis BP, et al. Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell. 2005;120:15–20. doi: 10.1016/j.cell.2004.12.035. [DOI] [PubMed] [Google Scholar]
  31. Li Y, et al. Microrna-34a inhibits glioblastoma growth by targeting multiple oncogenes. Cancer Res. 2009;69:7569–7576. doi: 10.1158/0008-5472.CAN-09-0529. [DOI] [PMC free article] [PubMed] [Google Scholar]
  32. Liang H, Li WH. MicroRNA regulation of human protein–protein interaction network. RNA. 2007;13:1402. doi: 10.1261/rna.634607. [DOI] [PMC free article] [PubMed] [Google Scholar]
  33. Lin SL, et al. MicroRNA miR-302 inhibits the tumorigenecity of human pluripotent stem cells by coordinate suppression of the CDK2 and CDK4/6 cell cycle pathways. Cancer Res. 2010;70:9473–9482. doi: 10.1158/0008-5472.CAN-10-2746. [DOI] [PubMed] [Google Scholar]
  34. Liu J, et al. 2009. Multi-task feature learning via efficient l 2, 1-norm minimization. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, AUAI Press, pp. 339–348. [Google Scholar]
  35. Malumbres M. miRNAs versus oncogenes: the power of social networking. Mol. Syst. Biol. 2012;8:569. doi: 10.1038/msb.2012.2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  36. Mendell JT. miRiad roles for the mir-17-92 cluster in development and disease. Cell. 2008;133:217–222. doi: 10.1016/j.cell.2008.04.001. [DOI] [PMC free article] [PubMed] [Google Scholar]
  37. Meola N, et al. microRNAs and genetic diseases. Pathogenetics. 2009;2:7. doi: 10.1186/1755-8417-2-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  38. Mi S, et al. Aberrant overexpression and function of the miR-17-92 cluster in MLL-rearranged acute leukemia. Proc. Natl Acad. Sci. USA. 2010;107:3710–3715. doi: 10.1073/pnas.0914900107. [DOI] [PMC free article] [PubMed] [Google Scholar]
  39. Miki Y, et al. A strong candidate for the breast and ovarian cancer susceptibility gene BRCA1. Science. 1994;266:7. doi: 10.1126/science.7545954. [DOI] [PubMed] [Google Scholar]
  40. Motameny S, et al. Next generation sequencing of miRNAs–strategies, resources and methods. Genes. 2010;1:70–84. doi: 10.3390/genes1010070. [DOI] [PMC free article] [PubMed] [Google Scholar]
  41. Muniategui A, et al. Joint analysis of miRNA and mRNA expression data. Brief. Bioinformatics. 2012 doi: 10.1093/bib/bbs028. [Epub ahead of print, doi: 10.1093/bib/bbs028, June 12, 2012] [DOI] [PubMed] [Google Scholar]
  42. Obozinski G, et al. Joint covariate selection and joint subspace selection for multiple classification problems. Stat. Comput. 2010;20:231–252. [Google Scholar]
  43. O’Day E, Lal A. MicroRNAs and their target gene networks in breast cancer. Breast Cancer Res. 2010;12:201. doi: 10.1186/bcr2484. [DOI] [PMC free article] [PubMed] [Google Scholar]
  44. Ooi CH, et al. A densely interconnected genome-wide network of microRNAs and oncogenic pathways revealed using gene expression signatures. PLoS Genet. 2011;7:e1002415. doi: 10.1371/journal.pgen.1002415. [DOI] [PMC free article] [PubMed] [Google Scholar]
  45. Pelengaris S, Khan M. Oncogenic co-operation in beta-cell tumorigenesis. Endocr. Relat. Cancer. 2001;8:307–314. doi: 10.1677/erc.0.0080307. [DOI] [PubMed] [Google Scholar]
  46. Peter ME. Let-7 and mir-200 microRNAs: guardians against pluripotency and cancer progression. Cell Cycle. 2009;8:843–852. doi: 10.4161/cc.8.6.7907. [DOI] [PMC free article] [PubMed] [Google Scholar]
  47. Sass S, et al. MicroRNAs coordinately regulate protein complexes. BMC Syst. Biol. 2011;5:136. doi: 10.1186/1752-0509-5-136. [DOI] [PMC free article] [PubMed] [Google Scholar]
  48. Schmidt M, et al. 2009. Optimizing costly functions with simple constraints: a limited-memory projected quasi-Newton algorithm. In Proceedings of the Conference on Artificial Intelligence and Statistics, MIT press, pp. 456–463. [Google Scholar]
  49. Shao NY, et al. Comprehensive survey of human brain microRNA by deep sequencing. BMC Genomics. 2010;11:409. doi: 10.1186/1471-2164-11-409. [DOI] [PMC free article] [PubMed] [Google Scholar]
  50. Shi Y, et al. A combined expression-interaction model for inferring the temporal activity of transcription factors. J. Comput. Biol. 2009;16:1035–1049. doi: 10.1089/cmb.2009.0024. [DOI] [PMC free article] [PubMed] [Google Scholar]
  51. Stark C, et al. The BioGRID interaction database: 2011 update. Nucleic Acids Res. 2011;39(Suppl. 1):D698–D704. doi: 10.1093/nar/gkq1116. [DOI] [PMC free article] [PubMed] [Google Scholar]
  52. Subramanian A, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA. 2005;102:15545–15550. doi: 10.1073/pnas.0506580102. [DOI] [PMC free article] [PubMed] [Google Scholar]
  53. Sun J, et al. Uncovering microRNA and transcription factor mediated regulatory networks in glioblastoma. PLoS Comput. Biol. 2012;8:e1002488. doi: 10.1371/journal.pcbi.1002488. [DOI] [PMC free article] [PubMed] [Google Scholar]
  54. Uhlmann S, et al. miR-200bc/429 cluster targets plcγ1 and differentially regulates proliferation and EGF-driven invasion than miR-200a/141 in breast cancer. Oncogene. 2010;29:4297–4306. doi: 10.1038/onc.2010.201. [DOI] [PubMed] [Google Scholar]
  55. Wada-Hiraike O, et al. The DNA mismatch repair gene hmsh2 is a potent coactivator of oestrogen receptor α. Br. J. Cancer. 2005;92:2286–2291. doi: 10.1038/sj.bjc.6602614. [DOI] [PMC free article] [PubMed] [Google Scholar]
  56. Wang H, Li WH. Increasing MicroRNA target prediction confidence by the relative R2 method. J. Theor. Biol. 2009;259:793–798. doi: 10.1016/j.jtbi.2009.05.007. [DOI] [PMC free article] [PubMed] [Google Scholar]
  57. Wingender E, et al. Transfac: an integrated system for gene expression regulation. Nucleic Acids Res. 2000;28:316–319. doi: 10.1093/nar/28.1.316. [DOI] [PMC free article] [PubMed] [Google Scholar]
  58. Yu F, et al. let-7 regulates self renewal and tumorigenicity of breast cancer cells. Cell. 2007;131:1109–1123. doi: 10.1016/j.cell.2007.10.054. [DOI] [PubMed] [Google Scholar]
  59. Zenz T, et al. miR-34a as part of the resistance network in chronic lymphocytic leukemia. Blood. 2009;113:3801–3808. doi: 10.1182/blood-2008-08-172254. [DOI] [PubMed] [Google Scholar]
  60. Zhang S, et al. A novel computational framework for simultaneous integration of multiple types of genomic data to identify microRNA-gene regulatory modules. Bioinformatics. 2011;27:i401. doi: 10.1093/bioinformatics/btr206. [DOI] [PMC free article] [PubMed] [Google Scholar]
  61. Zhao H, et al. MicroRNA and leukemia: tiny molecule, great function. Crit. Rev. Oncol Hematol. 2010;74:149–155. doi: 10.1016/j.critrevonc.2009.05.001. [DOI] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary Data

Articles from Bioinformatics are provided here courtesy of Oxford University Press

RESOURCES