Analyzing Large Gene Expression and Methylation Data Profiles Using StatBicRM: Statistical Biclustering-Based Rule Mining

Ujjwal Maulik; Saurav Mallik; Anirban Mukhopadhyay; Sanghamitra Bandyopadhyay

doi:10.1371/journal.pone.0119448

2015 Apr 1;10(4):e0119448. doi: 10.1371/journal.pone.0119448

Editor: Xiaofeng Wang⁴

PMCID: PMC4382191 PMID: 25830807

Abstract

Microarray and beadchip are two of the most efficient techniques for measuring gene expression and methylation data in bioinformatics. Biclustering deals with the simultaneous clustering of genes and samples. In this article, we propose a computational rule mining framework, StatBicRM (i.e., statistical biclustering-based rule mining), to identify a special type of rules and potential biomarkers from biological datasets by integrating statistical and binary inclusion-maximal biclustering techniques. First, a novel statistical strategy is used to eliminate the insignificant/low-significant/redundant genes in such a way that the significance level satisfies the data distribution property (viz., either normal or non-normal distribution). The data is then discretized and post-discretized, consecutively. Thereafter, the biclustering technique is applied to identify maximal frequent closed homogeneous itemsets. The corresponding special type of rules are then extracted from the selected itemsets. Our proposed rule mining method performs better than the other rule mining algorithms because it generates maximal frequent closed homogeneous itemsets instead of frequent itemsets. Thus, it reduces elapsed time and can work on big datasets. Pathway and Gene Ontology analyses are conducted on the genes of the evolved rules using the DAVID database. Frequency analysis of the genes appearing in the evolved rules is performed to determine potential biomarkers. Furthermore, we classify the data to assess how accurately the evolved rules describe the remaining test (unknown) data. Subsequently, we compare the average classification accuracy and other related factors with those of other rule-based classifiers. Statistical significance tests are also performed to verify the statistical relevance of the comparative results. Here, each of the other rule mining methods or rule-based classifiers also starts with the same post-discretized data matrix. Finally, we include an integrated analysis of gene expression and methylation to determine the epigenetic effect (viz., the effect of methylation) on the gene expression level.

Introduction

The microarray technique is a useful tool for measuring gene expression data across different experimental and control samples. Similarly, beadchip is another efficient technique for genome-wide DNA methylation profiling on the Infinium II platform. DNA methylation is an important epigenetic factor that refers to the addition of a methyl group (-CH3) to position 5 of the cytosine pyrimidine ring or the number 6 nitrogen of the adenine purine ring in genomic DNA. It modifies, and in general decreases, the expression levels of genes. Both the expression and methylation data matrices [1], [2], [3], [4] are initially organized so that rows and columns indicate genes and samples (conditions), respectively. Statistical analysis [5], [6], [7] is an important tool for identifying differentially expressed/methylated (i.e., DE/DM) genes across different types of samples.

Association rule mining (ARM) [8], [9] is another useful tool for determining interesting (expression/methylation) relationships among items (genes) under different conditions (samples). In this article, we propose a computational rule mining framework, StatBicRM (i.e., statistical biclustering-based rule mining) to identify special rules of genes and potential biomarkers from the large gene expression and/or methylation data by integrating a novel statistical technique and binary inclusion-maximal biclustering technique, consecutively.

Traditional association rule mining algorithms produce a huge number of rules. It is therefore difficult to run them on medium or large datasets in which the number of genes is approximately 250 or more. To solve this problem, our proposed method uses the binary inclusion-maximal biclustering (i.e., BiMax) technique [10] to mine non-redundant significant itemsets and the corresponding special rules. However, the biclustering technique can only work on datasets with approximately 10,000 genes or fewer; if the number is greater than 10,000, it fails to work. Thus, for such large datasets, we apply a statistical strategy before the biclustering technique to eliminate the redundant/insignificant/low-significant genes in such a way that the significance level relies on the data distribution property (viz., either normal or non-normal distribution). Therefore, the whole dataset is first passed through fundamental statistical preprocessing steps (viz., removal of genes having low variance, followed by normalization).

If a dataset has a large number of samples, there is no need to run a normality test before applying a statistical test, since all statistical tests perform more or less well for large samples. But if a dataset has a small number of samples, different statistical tests have been observed to perform differently [11]. The standard t-test, Welch’s t-test, Bayes t-test and Pearson’s correlation test are all parametric statistical tests, whereas Limma, significance analysis of microarrays (SAM), Wilcoxon’s ranksum test and the permuted t-test are considered non-parametric tests. Some non-parametric statistical tests (like Limma and SAM) perform well on both normally and non-normally distributed data in all conditions, especially for small sample sizes. However, the performance of SAM is inconsistent: sometimes it performs well, while at other times it fails to work properly for small sample sizes. The performance of the permuted t-test is satisfactory for non-normal distributions at all sample sizes, but for normal distributions it works poorly, especially for small sample sizes. For normally distributed data, Wilcoxon’s ranksum test performs much worse than the standard t-test for small samples. On the other hand, for small sample sizes, the standard t-test performs poorly on non-normally distributed data and better on normally distributed data. Welch’s t-test performs poorly for both types of distribution when the number of samples is small. To summarize, for a small number of samples it is better to test the data distribution in advance [11]; otherwise, p-values may be misleading because an incorrect distribution is assumed.
Therefore, we first use a well-known normality test (i.e., the Jarque-Bera test [12]) to determine whether each data item is normally distributed or not. Depending on the result, the dataset is partitioned into two sub-datasets: one contains all normally distributed data, and the other contains all non-normally distributed data. On average, parametric tests perform better on normally distributed data than on non-normally distributed data, while non-parametric tests perform better on non-normally distributed data than on normally distributed data [11]. Therefore, after testing for normality, we run multiple parametric statistical tests (viz., t-test [11], Welch’s t-test [11], the modified Bayes’ t-test by Fox and Dimmic [13], and Pearson’s correlation test (Corr) [11]) on the normally distributed data to identify differentially expressed/methylated genes, and take the intersection of their results so that the identified genes are reported as differentially expressed/methylated by every test. Similarly, we apply multiple non-parametric tests (viz., Limma [11], significance analysis of microarrays (SAM) [11], Wilcoxon’s ranksum test (Wcox) [11], and permuted t-test (Perm) [11]) to the non-normally distributed data and again take the intersection of their results. A list is then prepared containing the intersected genes from both the normally distributed and the non-normally distributed sub-datasets. These statistical methods are used to determine a significant, non-redundant subset of the differentially expressed/methylated genes from the original large dataset. Thereafter, discretization and post-discretization are applied consecutively to this subset of the data to convert it into the corresponding boolean matrix.

Our next major goal is rule mining. For this purpose, the biclustering technique is applied directly to the post-discretized data matrix to determine maximal homogeneous biclusters of genes as maximal frequent closed homogeneous itemsets (viz., MFCHOIs) at a minimum support value. Here, an MFCHOI is a maximal bicluster whose samples all share the same class label. The rules are then extracted from the MFCHOIs. Each evolved rule is of a special type, i.e., the consequent of the rule consists of its class label only. Therefore, each MFCHOI produces a single special rule. Our proposed methodology performs better than state-of-the-art rule mining algorithms because it generates MFCHOIs instead of frequent itemsets. Each of the other rule mining methods also starts with the same post-discretized data matrix. Another advantage is that, because these rules are classification rules, we do not need to calculate any rule-interestingness measure (e.g., confidence) other than support. Therefore, the method reduces elapsed time and can work on big data in which the number of genes is high. Pathway and Gene Ontology (GO) analyses are conducted on the genes of the evolved rules using the DAVID database. Furthermore, frequency analysis of the genes appearing in the evolved rules is performed to determine potential biomarkers.

Furthermore, we need to know how accurately the evolved rules describe the remaining test (unknown) data. For this, we perform cross-validation and classification, consecutively, to compute the average accuracy of the proposed method. The post-discretized data matrix mentioned earlier is divided into training and test sets using 4-fold cross-validation (CV). Thereafter, the biclustering technique is applied to the training part of the dataset to determine MFCHOIs at a minimum support value. The special rules are then extracted from the MFCHOIs; each MFCHOI generates a single rule. We estimate 23 rule-interestingness measures [14], [15] for the evolved rules and add one new measure (viz., the number of satisfiable conditions/samples of the corresponding bicluster for each evolved rule). We rank each rule according to each of the 24 measures individually using fractional ranking [16–18]. The final rank of each evolved rule is the average of its fractional ranks. All rules are then arranged from best to worst. We assign weights to the final list of rules in such a way that the topmost rule gets the highest weight, the second rule gets the second highest weight, and so on, with the same weight interval between any two consecutively ranked rules. Each test data point is classified by majority voting using a weighted-sum method. A comparative performance study with existing popular rule-based classifiers is conducted based on the average classification accuracy, MCC and related factors. Our classification method performs better than the existing popular rule-based classifiers. Here, each of the other rule-based classifiers also starts with the same post-discretized data matrix. Statistical significance tests (viz., one-way ANOVA) [19] are also performed to verify the statistical relevance of the comparative results.
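The ranking and weighted voting described above can be illustrated with a small sketch. It is not the authors' implementation: the function names, the toy measure values, the assumption that a higher measure value is better, and the choice of weights n, n − 1, …, 1 (one possible equally spaced scheme) are all hypothetical.

```python
def fractional_ranks(scores):
    """Rank 1 = best (highest score); tied scores share the average of their positions."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rule_weights(measure_table):
    """measure_table[r][m] is the value of measure m for rule r (higher = better, assumed)."""
    n_rules, n_meas = len(measure_table), len(measure_table[0])
    per_measure = [fractional_ranks([row[m] for row in measure_table])
                   for m in range(n_meas)]
    avg_rank = [sum(per_measure[m][r] for m in range(n_meas)) / n_meas
                for r in range(n_rules)]
    order = sorted(range(n_rules), key=lambda r: avg_rank[r])
    weights = [0] * n_rules
    for pos, r in enumerate(order):
        weights[r] = n_rules - pos  # equally spaced weights: n, n-1, ..., 1
    return weights


def classify(sample_items, rules, weights):
    """Weighted vote over all rules whose antecedent is satisfied by the sample."""
    votes = {}
    for (antecedent, label), w in zip(rules, weights):
        if antecedent <= sample_items:
            votes[label] = votes.get(label, 0) + w
    return max(votes, key=votes.get) if votes else None


# Toy example: three rules, two interestingness measures per rule.
rules = [({"g1+"}, "tr"), ({"g2-"}, "nr"), ({"g3+"}, "nr")]
weights = rule_weights([[0.9, 5], [0.9, 3], [0.2, 1]])
print(weights)
print(classify({"g1+", "g3+"}, rules, weights))
```

In this toy setting, the first rule is ranked best on average and therefore carries the largest vote.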

As mentioned earlier, DNA methylation is one of the important epigenetic factors that can change (generally decrease) the expression levels of genes. We have therefore also performed an integrative analysis of a gene expression dataset and a methylation dataset from a combined dataset. Since gene expression is generally inversely related to methylation, inversely correlated genes highlight the epigenetic effect (e.g., methylation) on the expression level. Therefore, we have identified the genes that show an inverse relationship between their methylation and expression levels.

The rest of the article is organized as follows. Section Materials and Methods presents the literature review and elaborates our proposed methodology. Section Results and Discussion presents the sources and a brief description of the real datasets, followed by the experimental results and discussion. Finally, Section Conclusion concludes the article.

Materials and Methods

Literature Review

Association rule mining (ARM) [8], [9] is a useful tool for determining interesting (expression/methylation) relationships among items (genes) under different conditions (samples). It provides association rules based on frequent itemsets. A rule (R) can be described as A ⇒ C, where A, C ⊆ IM and A⋂C = ϕ. Here, A and C are called the antecedent (i.e., the set of items on the LHS of a rule) and the consequent (i.e., the set of items on the RHS of a rule), respectively. The support of the itemset (IM) is defined as the number of transactions in which all of its items appear together. IM is frequent when its support is greater than a given threshold value (i.e., the minimum support). The confidence of the rule is defined as the ratio of the support of IM to the support of A. A frequent closed itemset (FCI) is a condensed form of frequent itemsets and is used to avoid redundancy.
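To make the definitions of support and confidence concrete, the following minimal sketch computes both on a toy set of transactions. The transactions, item names and function names are hypothetical and serve only as an illustration.

```python
# Each transaction (sample) is the set of items (genes and class label) it contains.
transactions = [
    {"g1+", "g2-", "tumor"},
    {"g1+", "g2-", "tumor"},
    {"g1+", "g3+", "normal"},
    {"g2-", "g3+", "tumor"},
]


def support(itemset, transactions):
    """Number of transactions in which all items of `itemset` appear together."""
    return sum(1 for t in transactions if itemset <= t)


def confidence(antecedent, consequent, transactions):
    """support(A union C) / support(A) for the rule A => C."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))


A = {"g1+", "g2-"}   # antecedent
C = {"tumor"}        # consequent
print(support(A | C, transactions))    # 2
print(confidence(A, C, transactions))  # 1.0
```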

In past decades, the traditional Apriori algorithm [20] was the most fundamental association rule mining technique. Apriori uses a bottom-up approach in which frequent subsets are extended one item at a time to form candidate itemsets, and groups of candidates are then tested against the data. The method terminates when no further successful extension is found. Apriori follows a breadth-first search for counting the candidate itemsets: it generates candidate itemsets of length k from the itemsets of length k − 1 and discards infrequent candidates. The set of candidate itemsets contains all frequent itemsets. After all frequent itemsets have been extracted, the corresponding set of rules is mined from each frequent itemset; the result of Apriori is thus a set of rules describing the occurrence of items in the dataset. Because Apriori generates all frequent itemsets rather than a condensed subset of them, a huge number of rules is produced. Therefore, Apriori cannot run on medium or large datasets. It can hardly work with up to approximately 100 genes; with more than 100 genes, it either takes a long time or fails to run.

Further investigation has identified several shortcomings of the traditional Apriori algorithm, such as the production of a high number of frequent itemsets, high running time, and the need for multiple scans of the dataset. Many other ARM techniques have been proposed (e.g., AprioriTid [21], Eclat [22], Tao et al. [23], H-mine [24] etc.) to reduce these shortcomings. However, for medium or large datasets (i.e., those with more than approximately 250 genes), these methods either fail to work or take a long time (viz., approximately 5 hours or more).

To overcome the above limitation, in this article we use the BiMax biclustering technique [10] to extract maximal frequent closed homogeneous itemsets (MFCHOIs) and the corresponding special rules. BiMax identifies groups of all-1 biclusters from a boolean data matrix under certain conditions. The aim of the biclustering is to discover groups of genes (i.e., all-1 biclusters) that behave similarly under a subset of conditions (samples). The biclustering technique extracts the MFCHOIs, which form a proper subset of the frequent itemsets (FIs); i.e., MFCHOI ⊂ FI. Thus, our proposed method produces far fewer significant non-redundant itemsets than the other rule mining algorithms. However, the biclustering technique can only work on datasets with approximately 10,000 genes or fewer; if the number is greater than 10,000, it cannot work on the dataset. Therefore, for a large dataset, we need to apply a statistical strategy before the biclustering technique to eliminate the redundant/insignificant/low-significant genes in such a way that the significance level satisfies the data distribution property (viz., either normal or non-normal distribution). Hence, we propose a computational rule mining framework, StatBicRM, for producing special rules of genes and potential biomarkers from large gene expression and/or methylation datasets by applying a novel statistical technique and the biclustering technique, consecutively.

Proposed Method

Our proposed technique, StatBicRM, is a computational framework for rule mining in which statistical and binary inclusion-maximal biclustering techniques are applied in an integrated manner to a gene expression or methylation dataset (see Fig. 1). In addition, we perform classification using the proposed method to assess how accurately the evolved rules describe the remaining test (unknown) data (see Fig. 2).

Fig 1 — Here, the terms *TOTALDESET* _N, *TOTALDESET* _NN, *TOTALDESET* _N+NN are described in last paragraph of subsection *“Identification of differentially expressed/methylated genes using Statistical tests”*. For methylation dataset, the above terms are replaced by *TOTALDMSET* _N, *TOTALDMSET* _NN, *TOTALDMSET* _N+NN, respectively.

Fig 2 — Here, the terms *TOTALDESET* _N, *TOTALDESET* _NN, *TOTALDESET* _N+NN are described in last paragraph of subsection. For methylation dataset, the above terms are replaced by *TOTALDMSET* _N, *TOTALDMSET* _NN, *TOTALDMSET* _N+NN, respectively.

The steps of StatBicRM are described briefly below:

Identification of differentially expressed/methylated genes using Statistical tests

Our proposed method depends fundamentally on statistical analysis. A big gene expression/methylation dataset may contain 10,000 or more genes. Most of these genes are non-differentially expressed/methylated (i.e., nDE/nDM), and only some of them are differentially expressed/methylated (i.e., DE/DM). When a rule is generated, these two types of genes may occur together in the rule. Biologically, only DE/DM genes are meaningful in a rule relating to a specific disease, whereas the other type of genes is irrelevant to the disease. Therefore, we first apply a novel statistical strategy to the dataset to identify the set of statistically significant non-redundant DE/DM genes in such a way that the significance level relies on the data distribution property (viz., either normal or non-normal distribution). To do this, the genes that have low variance are first eliminated from the gene expression/methylation dataset. Thereafter, we apply zero-mean normalization to the data of the remaining genes to bring values measured on different scales to a common scale. The zero-mean normalization can be stated as:

\begin{matrix} v_{i j}^{'} = \frac{v_{i j} - μ}{σ}, \end{matrix}

(1)

where μ and σ denote mean and standard deviation of the expression/methylation data of a gene i before normalization respectively; and v _ij and $v_{i j}^{'}$ refer to the value of i-th gene at j-th condition before and after normalization, respectively.
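As a minimal sketch of Equation 1, the function below normalizes the values of one gene across its samples. The function name is hypothetical, and the use of the sample standard deviation (n − 1 denominator) is an assumption, since the text does not say which estimator is used.

```python
import statistics


def zero_mean_normalize(gene_values):
    """Apply v' = (v - mu) / sigma to the values of a single gene (one row)."""
    mu = statistics.mean(gene_values)
    sigma = statistics.stdev(gene_values)  # assumption: sample standard deviation
    return [(v - mu) / sigma for v in gene_values]


print(zero_mean_normalize([2.0, 4.0, 6.0]))  # [-1.0, 0.0, 1.0]
```

Because genes with low variance are removed beforehand, σ is not expected to be zero at this stage.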

It is well known that the parametric statistical tests [25] are appropriate for normally distributed data, and non-parametric statistical tests [25] are appropriate for non-normally distributed data, respectively. Therefore, Jarque-Bera normality test [12], [26] is utilized on the normalized data to determine the pattern of distribution of the data whether it is normally distributed or non-normally distributed. The Jarque-Bera normality test is defined as follows:

\begin{matrix} JB = \frac{d}{6} \left( S^{2} + \frac{1}{4} (K - 3)^{2} \right), \end{matrix}

(2)

where d denotes the degree of freedom, S is the skewness of the sample, and K refers to the kurtosis of the sample. Hence, depending on the resulting distribution patterns, the whole normalized dataset is partitioned into two sub-datasets, where one sub-dataset has all normally distributed data, and remaining one contains all non-normally distributed data.
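The statistic in Equation 2 can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: d is taken to be the number of observations, S and K are computed from the biased (1/n) central moments, and the p-value uses the asymptotic chi-square distribution with 2 degrees of freedom, which is only approximate for small samples. The function name is hypothetical.

```python
import math


def jarque_bera(values):
    """Return (JB, p) for one gene; p uses the asymptotic chi-square(2) law."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    S = m3 / m2 ** 1.5   # skewness
    K = m4 / m2 ** 2     # kurtosis
    jb = n / 6.0 * (S ** 2 + 0.25 * (K - 3.0) ** 2)
    p = math.exp(-jb / 2.0)  # survival function of chi-square with 2 d.o.f.
    return jb, p


jb, p = jarque_bera([-2.0, -1.0, 0.0, 1.0, 2.0])
print(jb, p)
```

A gene would then be assigned to the normally distributed sub-dataset when p is above the chosen significance level, and to the non-normally distributed sub-dataset otherwise.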

Thereafter, we apply four parametric statistical tests (viz., t-test [11], Welch’s t-test [11], the modified Bayes’ t-test proposed by Fox and Dimmic in 2006 [13], and Pearson’s correlation test (Corr) [11]) to the normally distributed data to obtain the differentially expressed genes of the normally distributed sub-dataset. Similarly, four non-parametric tests (viz., Limma [11], significance analysis of microarrays (SAM) [11], Wilcoxon ranksum test (Wcox) [11] and permuted t-test (Perm) [11]) are applied to the non-normally distributed data to obtain the differentially expressed genes of the non-normally distributed sub-dataset.

Before proceeding further, we briefly discuss some of the statistical tests mentioned above.

The “2-sample t-test” compares the means of two groups relative to the variation in the data. From the test statistic, we compute a p-value, which is the probability of observing a t-value as large as or larger than the actually observed t-value given that the null hypothesis is true. By convention, if the p-value of a gene (item) is less than 5%, the gene is called statistically differentially expressed/methylated. Now suppose that, for each gene g, group 1 consists of n₁ treated samples with mean ${\overline{x}}_{1 g}$ and standard deviation $s_{1 g}$, and group 2 consists of n₂ control samples with mean ${\overline{x}}_{2 g}$ and standard deviation $s_{2 g}$. The t-statistic is then:

\begin{matrix} t = \frac{({\bar{x}}_{1 g} - {\bar{x}}_{2 g})}{se_{g}} . \end{matrix}

(3)

Here, $se_{g}$ denotes the standard error of the difference between the group means, thus,

\begin{matrix} se_{g} = \mathrm{sPooled} \cdot \sqrt{\frac{1}{n_{1}} + \frac{1}{n_{2}}}, \end{matrix}

(4)

where sPooled is the pooled estimate of the population standard deviation; i.e.,

\begin{matrix} \mathrm{sPooled} = \sqrt{\frac{(n_{1} - 1) \cdot s_{1 g}^{2} + (n_{2} - 1) \cdot s_{2 g}^{2}}{df}} . \end{matrix}

(5)

Here, df is the degrees of freedom of the test, given by df = (n₁ + n₂ − 2). This strategy assumes that the variances of the two groups are equal.
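Equations 3 to 5 can be followed step by step in the short sketch below. The function name and the toy data are hypothetical; the p-value lookup from the t-distribution is omitted to keep the example within the standard library.

```python
import math
import statistics


def pooled_t_statistic(group1, group2):
    """Two-sample t-statistic with a pooled standard deviation (Equations 3-5)."""
    n1, n2 = len(group1), len(group2)
    x1, x2 = statistics.mean(group1), statistics.mean(group2)
    var1, var2 = statistics.variance(group1), statistics.variance(group2)  # s^2
    df = n1 + n2 - 2
    s_pooled = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / df)  # Eq. 5
    se = s_pooled * math.sqrt(1 / n1 + 1 / n2)                      # Eq. 4
    return (x1 - x2) / se, df                                       # Eq. 3


t, df = pooled_t_statistic([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(t, df)
```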

For Welch’s t-test, the variances of the two groups are first checked for equality. If they are equal, the t-statistic in Equation 3 is used; otherwise, the following t-statistic is used:

\begin{matrix} t = \frac{({\bar{x}}_{1 g} - {\bar{x}}_{2 g})}{\sqrt{\frac{s_{1 g}^{2}}{n_{1}} + \frac{s_{2 g}^{2}}{n_{2}}}} . \end{matrix}

(6)

Here we use unpooled estimates of the population standard deviations.
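A minimal sketch of Equation 6 is given below. The function name and data are hypothetical. The Welch-Satterthwaite approximation of the degrees of freedom is the standard companion of this statistic, but it is not stated in the text and is included here as an assumption.

```python
import math
import statistics


def welch_t_statistic(group1, group2):
    """Unequal-variance t-statistic (Equation 6) with Welch-Satterthwaite d.o.f."""
    n1, n2 = len(group1), len(group2)
    a = statistics.variance(group1) / n1  # s_1g^2 / n_1
    b = statistics.variance(group2) / n2  # s_2g^2 / n_2
    t = (statistics.mean(group1) - statistics.mean(group2)) / math.sqrt(a + b)
    # Assumption: degrees of freedom by the Welch-Satterthwaite approximation.
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return t, df


t, df = welch_t_statistic([1.0, 2.0, 3.0], [2.0, 6.0, 10.0])
print(t, df)
```

When the two groups have equal sizes and equal variances, the statistic coincides with the pooled one in Equation 3.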

Pearson’s correlation coefficient (commonly denoted as ρ) between two variables is defined as the covariance of the two variables divided by the product of their standard deviations, i.e.,

\begin{matrix} ρ = \frac{\mathrm{cov}(x, y)}{s_{x} s_{y}}, \end{matrix}

(7)

where

\begin{matrix} \mathrm{cov}(x, y) = \frac{1}{n_{1} - 1} \sum_{i = 1}^{n_{1}} (x_{i} - \bar{x}) (y_{i} - \bar{y}), \end{matrix}

(8)

where the sample sizes of the two groups are n1 and n2, respectively (here, n1 = n2). This test can indicate whether two variables are related or not.
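The coefficient in Equation 7 can be computed as in the sketch below, which assumes the sample covariance and sample standard deviations (both normalized by n − 1) so that the result lies in [−1, 1]. The function name and data are hypothetical.

```python
import statistics


def pearson_correlation(x, y):
    """Sample covariance of x and y divided by the product of their standard deviations."""
    n = len(x)
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    return cov / (statistics.stdev(x) * statistics.stdev(y))


print(pearson_correlation([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]))
print(pearson_correlation([1.0, 2.0, 3.0, 4.0], [8.0, 6.0, 4.0, 2.0]))
```

The first pair of vectors is perfectly positively related and the second perfectly negatively related, so the two values are close to 1 and −1, respectively.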

The moderated t-statistic in Limma [27] can be demonstrated as:

\begin{matrix} {\tilde{t}}_{g} = \frac{1}{\sqrt{\frac{1}{n_{1}} + \frac{1}{n_{2}}}} \frac{{\hat{β}}_{g}}{{\tilde{s}}_{g}}, \end{matrix}

(9)

where the total sample size is n = n₁ + n₂, and ${\hat{β}}_{g}$ and ${\tilde{s}}_{g}^{2}$ denote the contrast estimator and the posterior sample variance for gene g, respectively. The contrast estimator for gene g is distributed as:

\begin{matrix} {\hat{β}}_{g} | σ_{g}^{2} \sim N (β_{g}, σ_{g}^{2}), \end{matrix}

(10)

where N denotes the normal distribution. The posterior sample variance for gene g is estimated as:

\begin{matrix} {\tilde{s}}_{g}^{2} = \frac{d_{0} s_{0}^{2} + d_{g} s_{g}^{2}}{d_{0} + d_{g}} . \end{matrix}

(11)

Here, d₀ (< ∞) and $s_{0}^{2}$ refer to the prior degrees of freedom and prior variance, respectively, and $d_{g}$ (> 0) and $s_{g}^{2}$ denote the experimental degrees of freedom and the sample variance of a particular gene g, respectively.

SAM adds a small positive constant s₀ (called the “fudge factor”) to the denominator to address the small-variance problem. The SAM statistic by Tusher et al. (2001) is:

\begin{matrix} t_{sam} = \frac{({\bar{x}}_{1 g} - {\bar{x}}_{2 g})}{se_{g} + s_{0}}, \end{matrix}

(12)

where $se_{g}$ is the standard error defined in Equation 4, sPooled is the pooled estimate of the population standard deviation (see Equation 5), and df = (n₁ + n₂ − 2) is the degrees of freedom of the test.

In the Wilcoxon ranksum test, the gene expression values of each gene are ranked in ascending order and the ranks are summed for each group; the test then checks the two ranked samples for equality of means. The z-statistic of the test is:

\begin{matrix} z = \frac{(| T - \mathrm{mean}_{w1} | - 0.5)}{\sqrt{\mathrm{var}_{w1}}}, \end{matrix}

(13)

where

\begin{matrix} T = \min \left( \sum \mathrm{ranks}_{group1}, \sum \mathrm{ranks}_{group2} \right), \end{matrix}

(14)

\begin{matrix} \mathrm{mean}_{w1} = n_{1} \cdot (n_{1} + n_{2} + 1) / 2, \end{matrix}

(15)

and

\begin{matrix} \mathrm{var}_{w1} = n_{2} \cdot \mathrm{mean}_{w1} / 6 . \end{matrix}

(16)

(16)
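Equations 13 to 16 can be traced with the following sketch. The function names are hypothetical, the ranks are assumed to be computed over the pooled values of both groups with tied values sharing their average rank, and the example uses equal group sizes, for which comparing the smaller rank sum with mean_w1 is unambiguous.

```python
import math


def average_ranks(values):
    """1-based ascending ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def wilcoxon_z(group1, group2):
    """z-statistic of the ranksum test, following Equations 13-16."""
    n1, n2 = len(group1), len(group2)
    ranks = average_ranks(list(group1) + list(group2))  # assumption: pooled ranking
    T = min(sum(ranks[:n1]), sum(ranks[n1:]))            # Eq. 14
    mean_w1 = n1 * (n1 + n2 + 1) / 2.0                   # Eq. 15
    var_w1 = n2 * mean_w1 / 6.0                          # Eq. 16
    return (abs(T - mean_w1) - 0.5) / math.sqrt(var_w1)  # Eq. 13


print(wilcoxon_z([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
```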

A permuted t-test is a kind of t-test in which the labels of the observed data points of each gene (item) are rearranged (permuted).
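One common way to realize such a test is sketched below: the group labels are shuffled repeatedly, the t-statistic is recomputed each time, and the p-value is the fraction of permutations whose statistic is at least as extreme as the observed one. The function names, the number of permutations, the fixed seed and the add-one correction are assumptions made for this illustration, not details taken from the text.

```python
import math
import random
import statistics


def t_stat(g1, g2):
    """Pooled two-sample t-statistic (as in Equation 3)."""
    n1, n2 = len(g1), len(g2)
    pooled_var = ((n1 - 1) * statistics.variance(g1)
                  + (n2 - 1) * statistics.variance(g2)) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    diff = statistics.mean(g1) - statistics.mean(g2)
    if se == 0:  # both groups constant: avoid division by zero
        return math.copysign(math.inf, diff) if diff else 0.0
    return diff / se


def permutation_p_value(g1, g2, n_perm=1000, seed=0):
    """Two-sided p-value obtained by shuffling the group labels n_perm times."""
    rng = random.Random(seed)
    observed = abs(t_stat(g1, g2))
    pooled = list(g1) + list(g2)
    n1 = len(g1)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(t_stat(pooled[:n1], pooled[n1:])) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction keeps p above zero


p = permutation_p_value([1.0, 1.2, 0.9, 1.1], [3.0, 3.2, 2.9, 3.1])
print(p)
```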

As stated earlier, the four parametric statistical tests are applied to the normally distributed sub-dataset, and each test yields a different number of up-regulated and down-regulated genes. We therefore take the intersection of the up-regulated gene sets to identify the set of common up-regulated genes (denoted by UPDESET_N) for the normally distributed sub-dataset. Similarly, we obtain the set of common down-regulated genes (denoted by DOWNDESET_N). We then make a list (denoted by TOTALDESET_N) containing all the common up-regulated genes and all the common down-regulated genes; i.e., TOTALDESET_N = UPDESET_N + DOWNDESET_N. In the same way, the four non-parametric statistical tests are applied to the non-normally distributed sub-dataset, and each test again yields a different number of up-regulated and down-regulated genes. We take the intersection of the up-regulated gene sets to identify the set of common up-regulated genes (denoted by UPDESET_NN) for the non-normally distributed sub-dataset, and similarly obtain the set of common down-regulated genes (denoted by DOWNDESET_NN). We then make another list (denoted by TOTALDESET_NN) containing all the common up-regulated genes and all the common down-regulated genes; i.e., TOTALDESET_NN = UPDESET_NN + DOWNDESET_NN.

Finally, we produce a final list (denoted by TOTALDESET_(N+NN)) containing all the common up-regulated and down-regulated genes from the normally distributed and non-normally distributed sub-datasets; i.e., TOTALDESET_(N+NN) = TOTALDESET_N + TOTALDESET_NN. This final list of genes is used in the next step. For methylation data, the same steps are performed to obtain HYPERDMSET_N, HYPODMSET_N, TOTALDMSET_N, HYPERDMSET_NN, HYPODMSET_NN, TOTALDMSET_NN and TOTALDMSET_(N+NN) instead of UPDESET_N, DOWNDESET_N, TOTALDESET_N, UPDESET_NN, DOWNDESET_NN, TOTALDESET_NN and TOTALDESET_(N+NN), respectively.
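The set operations behind these lists can be sketched as follows, with intersection across the tests and union across the resulting lists. The gene identifiers and the helper name are hypothetical; in practice each input set would hold the genes reported by one statistical test.

```python
def common_genes(sets_from_tests):
    """Genes reported by every one of the given tests (set intersection)."""
    return set.intersection(*map(set, sets_from_tests))


# Toy results of the four parametric tests on the normally distributed sub-dataset.
up_by_test_N = [{"g1", "g2", "g5"}, {"g1", "g2"}, {"g1", "g2", "g7"}, {"g1", "g2", "g5"}]
down_by_test_N = [{"g3", "g4"}, {"g3"}, {"g3", "g9"}, {"g3", "g4"}]
UPDESET_N = common_genes(up_by_test_N)
DOWNDESET_N = common_genes(down_by_test_N)
TOTALDESET_N = UPDESET_N | DOWNDESET_N

# Toy results of the four non-parametric tests on the non-normal sub-dataset.
UPDESET_NN = common_genes([{"g10", "g11"}, {"g10"}, {"g10", "g12"}, {"g10"}])
DOWNDESET_NN = common_genes([{"g13"}, {"g13", "g14"}, {"g13"}, {"g13"}])
TOTALDESET_NN = UPDESET_NN | DOWNDESET_NN

TOTALDESET_N_NN = TOTALDESET_N | TOTALDESET_NN
print(sorted(TOTALDESET_N_NN))
```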

Discretization and Post-discretization

Suppose the data matrix of the resulting list of genes, TOTALDESET_(N+NN), is denoted by I. First, I, whose rows denote genes and columns denote samples, is transposed; let PIT be the transposed matrix. As the PIT matrix is already normalized by zero-mean normalization, the following step is used for binary discretization of the matrix:

\begin{matrix} IT_{i j} = \begin{cases} 1, & \text{if } PIT_{i j} > 0, \\ 0, & \text{if } PIT_{i j} < 0; \end{cases} \end{matrix}

(17)

where PIT_ij denotes the expression/methylation value in the i-th row and j-th column (1 ≤ i ≤ m, 1 ≤ j ≤ n), m and n are the number of rows (samples) and the number of columns (genes) of the PIT matrix, respectively, and IT is the resulting discretized matrix. Let an up-regulated gene and a down-regulated gene in the discretized boolean matrix be denoted by DE_up and DE_down, respectively. In the matrix IT, ‘1’ and ‘0’ refer to the presence of an up-regulated gene (DE_up) and the presence of a down-regulated gene (DE_down), respectively (see part (b) of Fig. 3). After discretization, we apply Bimax biclustering to find all-1 biclusters. As Bimax biclustering considers only ‘1’ entries, not ‘0’ entries, we need a post-discretization step in which ‘1’ can represent both the DE_up and DE_down properties. Therefore, in post-discretization, the number of columns is doubled: the first half is the domain of the DE_up property, and the remaining half is the domain of the DE_down property (see part (c) of Fig. 3). E.g., the column denoted by g1 in part (b) of Fig. 3 is divided into the two columns denoted by g1+ and g1− in part (c) of Fig. 3, where for the g1+ column, ‘1’ denotes the presence of an up-regulated gene (DE_up) and ‘0’ denotes its absence (∼DE_up), and for the g1− column, ‘1’ denotes the presence of a down-regulated gene (DE_down) and ‘0’ denotes its absence (∼DE_down). For methylation data, DM_hyper and DM_hypo are used instead of DE_up and DE_down, respectively. Note that in this paper, ‘+’ and ‘-’ denote up-regulation/hyper-methylation and down-regulation/hypo-methylation, respectively.
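A small sketch of the two steps is given below, assuming rows are samples and columns are genes. The function names are hypothetical, and mapping a value of exactly 0 to ‘0’ is an assumption, because Equation 17 does not define that case.

```python
def discretize(pit):
    """Equation 17: 1 for a positive normalized value, 0 otherwise."""
    # Assumption: a value of exactly 0 is mapped to 0 (not covered by Eq. 17).
    return [[1 if v > 0 else 0 for v in row] for row in pit]


def post_discretize(it, gene_names):
    """Double the columns: first half marks '+' (up), second half marks '-' (down)."""
    names = [g + "+" for g in gene_names] + [g + "-" for g in gene_names]
    itb = [row + [1 - b for b in row] for row in it]
    return names, itb


pit = [[0.5, -1.2],    # sample 1: g1 up, g2 down
       [-0.3, 0.8]]    # sample 2: g1 down, g2 up
it = discretize(pit)
names, itb = post_discretize(it, ["g1", "g2"])
print(names)  # ['g1+', 'g2+', 'g1-', 'g2-']
print(itb)    # [[1, 0, 0, 1], [0, 1, 1, 0]]
```

After this step every ‘1’ states a property that is present (either up or down), which is what an all-1 bicluster search requires.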

Fig 3 — Here, up-regulation (i.e., ‘+’) and down-regulation (‘-’) are denoted by ‘1’ and ‘0’ in (b), and red and green colors in (c), respectively. Here, s _tr and s _nr denote experimental/diseased/treated and control/normal samples respectively.

Dividing whole data into training and test sets

Let the post-discretized matrix be denoted by ITb. To classify ITb, we apply 4-fold cross-validation (CV) to divide it into test and training data: one fold of ITb is used as the test set and the remaining three folds form the training set. This procedure is repeated four times, so that each fold serves once as the test set.
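The splitting scheme can be sketched as follows. The function name, the shuffling of sample indices and the fixed seed are assumptions; the text does not say whether the folds are shuffled or stratified by class.

```python
import random


def k_fold_splits(n_samples, k=4, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # assumption: random assignment
    folds = [indices[i::k] for i in range(k)]     # k nearly equal folds
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(j for f in folds[:i] + folds[i + 1:] for j in f)
        yield train, test


for train, test in k_fold_splits(10):
    print(len(train), len(test))
```

Each sample index appears in exactly one test set, and the rows of ITb selected by the training indices would be passed on to the biclustering step.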

Finding maximal biclusters and extracting special rules

We transpose the training boolean dataset and apply Bimax biclustering to identify maximal frequent closed homogeneous itemsets (MFCHOIs). Before proceeding further, the fundamental method of BiMax biclustering is briefly discussed.

Suppose a boolean matrix e has size n × m, where n is the number of genes and m is the number of samples. A cell e_ij is 1 if gene i is differentially expressed in sample/condition j; otherwise, e_ij is 0. A bicluster (G, S) is a subset of genes G ⊆ {1, 2, …, n} that are jointly differentially expressed under a subset of samples S ⊆ {1, 2, …, m}; i.e., the pair (G, S) refers to a submatrix of e whose elements are all 1. Only the inclusion-maximal biclusters (i.e., the biclusters that are not entirely contained in any other bicluster) are of interest. The pair (G, S) ∈ 2^{1,2,…,n} × 2^{1,2,…,m} is an inclusion-maximal bicluster [10] if and only if (i) e_ij = 1, ∀ i ∈ G, j ∈ S, and (ii) ∄ (G′, S′) ∈ 2^{1,2,…,n} × 2^{1,2,…,m} with (a) e_i′j′ = 1, ∀ i′ ∈ G′, j′ ∈ S′ and (b) G ⊆ G′ ∧ S ⊆ S′ ∧ (G′, S′) ≠ (G, S). An itemset is called a closed itemset when it has no proper superset with the same support value. Finding the set of frequent itemsets is equivalent to finding the set of all-1 biclusters that each contain at least a minimum number of conditions/samples (i.e., that satisfy the minimum support).

In our experiment, we set a fixed minimum number of items/genes per bicluster (viz., 2) and varied the minimum number of samples/conditions in order to determine itemsets at different minimum support values. Bimax biclustering can generate all maximal biclusters. The items/genes of a maximal bicluster represent a (maximal) closed itemset. Thus, all extracted biclusters that satisfy the minimum support condition yield the set of (maximal) frequent closed itemsets together with their class-labels (i.e., the conditions/types of their samples). We then filter the biclusters by their conditions, selecting only those (maximal) biclusters whose conditions are homogeneous (non-contradictory). In other words, we identify maximal frequent closed homogeneous itemsets (MFCHOIs). E.g., Bicluster 1 is an MFCHOI with three genes g1+, g2− and g3+, and two homogeneous conditions/samples s_tr1 and s_tr3 (presented in part (d) of Fig. 3). Similarly, Bicluster 2 is another MFCHOI with three genes g1+, g2+ and g3−, and two homogeneous conditions/samples s_nr1 and s_nr3 (presented in part (e) of Fig. 3). Biclusters whose conditions are heterogeneous (contradictory) are omitted. E.g., Bicluster 3 is such a heterogeneous (contradictory) bicluster, with three genes g1−, g2+ and g3+, and two heterogeneous conditions s_tr2 and s_nr2 (presented in part (f) of Fig. 3).
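To make the notions of inclusion-maximal bicluster and homogeneity concrete, here is a small brute-force sketch. It is not the Bimax divide-and-conquer algorithm; it simply enumerates sample subsets and closes them, which is only feasible for toy inputs. The toy data are hypothetical, chosen to be consistent with Biclusters 1 to 3 as described in the text (they are not read from Fig. 3), and all function names are illustrative.

```python
from itertools import combinations


def maximal_biclusters(items_of, min_genes=2, min_samples=2):
    """Brute-force search for inclusion-maximal all-1 biclusters (toy sizes only)."""
    samples = sorted(items_of)
    found = set()
    for size in range(min_samples, len(samples) + 1):
        for subset in combinations(samples, size):
            # Genes that are '1' in every sample of the subset.
            genes = set.intersection(*(items_of[s] for s in subset))
            if len(genes) < min_genes:
                continue
            # Closure: every sample that contains all of these genes.
            closed = frozenset(s for s in samples if genes <= items_of[s])
            found.add((frozenset(genes), closed))
    return found


def is_homogeneous(samples, label_of):
    """True if all samples of a bicluster share one class-label."""
    return len({label_of[s] for s in samples}) == 1


# Hypothetical toy data: sample -> set of post-discretized items that are '1'.
items_of = {
    "s_tr1": {"g1+", "g2-", "g3+"},
    "s_tr3": {"g1+", "g2-", "g3+"},
    "s_nr1": {"g1+", "g2+", "g3-"},
    "s_nr3": {"g1+", "g2+", "g3-"},
    "s_tr2": {"g1-", "g2+", "g3+"},
    "s_nr2": {"g1-", "g2+", "g3+"},
}
label_of = {s: ("disease" if s.startswith("s_tr") else "normal") for s in items_of}

biclusters = maximal_biclusters(items_of)
mfchois = [(g, s) for g, s in biclusters if is_homogeneous(s, label_of)]
for genes, samples in mfchois:
    print(sorted(genes), sorted(samples))
```

On this toy input the search finds three maximal biclusters with at least two genes and two samples; the one spanning s_tr2 and s_nr2 is heterogeneous and is dropped by the filter.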

From each selected bicluster of genes, we extract an association rule. Each resulting rule is of a special type, i.e., the consequent of the rule consists only of its class-label (either the treated/diseased/experimental class-label or the normal/control class-label). E.g., from Bicluster 1 (depicted in part (d) of Fig. 3), rule id 1 (i.e., {g1+, g2−, g3+} ⇒ disease) is produced. It states that if gene1 and gene3 are both up-regulated/hyper-methylated and gene2 is simultaneously down-regulated/hypo-methylated, then ‘disease’ occurs (see part (d) and part (g).(i) of Fig. 3). Similarly, rule id 2 (i.e., {g1+, g2+, g3−} ⇒ normal) is generated from Bicluster 2 (see part (e) and part (g).(ii) of Fig. 3).
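The mapping from a homogeneous bicluster to a special rule can be sketched as below. The tuple layout (antecedent, class-label, number of supporting samples) and the function name are assumptions made for illustration only.

```python
def rule_from_bicluster(genes, samples, label_of):
    """Return (antecedent, class_label, n_samples), or None if heterogeneous."""
    labels = {label_of[s] for s in samples}
    if len(labels) != 1:
        return None  # contradictory conditions: no rule is extracted
    return (tuple(sorted(genes)), labels.pop(), len(samples))


# Hypothetical class-labels of the samples used in the running example.
label_of = {"s_tr1": "disease", "s_tr2": "disease", "s_tr3": "disease",
            "s_nr1": "normal", "s_nr2": "normal", "s_nr3": "normal"}

rule1 = rule_from_bicluster({"g1+", "g2-", "g3+"}, {"s_tr1", "s_tr3"}, label_of)
rule2 = rule_from_bicluster({"g1+", "g2+", "g3-"}, {"s_nr1", "s_nr3"}, label_of)
rule3 = rule_from_bicluster({"g1-", "g2+", "g3+"}, {"s_tr2", "s_nr2"}, label_of)
print(rule1, rule2, rule3)
```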

Ranking of rules

We evaluate each evolved rule using 24 rule-interestingness measures. Twenty-three of them are existing measures: support, confidence, coverage, prevalence, sensitivity (or recall), specificity, accuracy, lift (or interest), leverage, added value, relative risk, Jaccard, Yule’s Q, Klosgen, Laplace correction, Gini index, two-way support, linear correlation coefficient (or ϕ-coefficient), cosine, least contradiction, Zhang, leverage2 (or Piatetsky-Shapiro) and kappa [14, 15] (see S1 Text). The last, novel measure is the number of conditions/samples that satisfy each evolved rule. E.g., according to part (g).(i) of Fig. 3, the value of this measure for rule id 1 (i.e., {g1+, g2−, g3+} ⇒ disease) is 2, as its corresponding bicluster (in part (d) of Fig. 3) has two conditions (s_tr1 and s_tr3). Similarly, according to part (g).(ii) of Fig. 3, the value of this measure for rule id 2 (i.e., {g1+, g2+, g3−} ⇒ normal) is 2. A rule with a higher value of this measure receives a better rank than a rule with a lower value.

Thereafter, the evolved rules are ranked according to each of the 24 rule-interestingness measures individually using fractional ranking [16–18]. In fractional ranking, items that compare equal receive the same rank, which is the mean of the ranks they would receive under ordinal ranking. E.g., consider the data set {1 2 2}. If the two values 2 and 2 were different numbers, they would hold ranks 2 and 3, respectively. As they are equal, each receives the average of these ranks, (2+3)/2 = 2.5; therefore, the fractional ranks are: 1 2.5 2.5.

The final rank of each rule is then determined by averaging its fractional ranks over all measures. The rules are rearranged in ascending order of final rank (i.e., from best to worst).
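The fractional ranking and the rank averaging described above can be sketched as follows. The sketch assumes, for simplicity, that a higher value is better for every measure; the two measures and their values are hypothetical, and the function names are illustrative.

```python
def fractional_ranks(values, descending=False):
    """Fractional ranking: tied values share the mean of their ordinal ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=descending)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run of tied values.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = ((pos + 1) + (end + 1)) / 2
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def final_ranking(measure_table):
    """Average the per-measure fractional ranks; return (avg_ranks, rule_order)."""
    n_rules = len(measure_table[0])
    per_measure = [fractional_ranks(m, descending=True) for m in measure_table]
    avg = [sum(r[i] for r in per_measure) / len(per_measure) for i in range(n_rules)]
    order = sorted(range(n_rules), key=lambda i: avg[i])  # best rule first
    return avg, order


example = fractional_ranks([1, 2, 2])   # the {1 2 2} example from the text

# Hypothetical values of two measures for three rules (higher = better).
support = [0.5, 0.3, 0.3]
confidence = [0.9, 1.0, 0.8]
avg_ranks, rule_order = final_ranking([support, confidence])
print(example, avg_ranks, rule_order)
```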

Assigning weights to the rules w.r.t. their final ranking

For classification, we apply a majority voting technique to each test data point to identify its class-label through a weighted-sum method. We therefore first assign a weight to each rule in the final list such that the topmost rule gets the highest weight, the second-ranked rule gets the second-highest weight, and so on. The weight of the first-ranked rule is always 1, and all weights lie between 0 and 1. The weight of each rule (denoted by w_j, 1 ≤ j ≤ p) is computed from its final rank (denoted by r_j) and the total number of rules (viz., p) as follows:

\begin{matrix} w_{j} = \frac{1}{p} \left( p - (r_{j} - 1) \right) . \end{matrix}

(18)

Here, the weight interval between any two consecutively ranked rules is the same. The calculated weights of the rules are then normalized using zero-mean normalization (see Equation 1).
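As a small worked example of Equation 18, the following sketch computes the weights for p = 4 rules with final ranks 1 to 4. The function name is illustrative, and the subsequent zero-mean normalization (Equation 1) is deliberately left out because that equation is defined elsewhere in the paper.

```python
def rule_weights(final_ranks):
    """Equation 18: w_j = (p - (r_j - 1)) / p, with p the number of rules."""
    p = len(final_ranks)
    return [(p - (r - 1)) / p for r in final_ranks]


weights = rule_weights([1, 2, 3, 4])
print(weights)  # equally spaced weights, the top rule gets 1.0
```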

Majority voting and classification

Consider one test data point (ts). To determine its predicted class-label, we apply the ‘majority voting’ technique. First, we identify the rules whose antecedent items (genes) are all present in ts. The weights of those rules that have the class-label ‘disease’ in their consequents (trR_ts rules in total) are summed (viz., Ws_tr_ts). A similar sum (viz., Ws_nr_ts) is computed for the rules that have the class-label ‘normal’ in their consequents (nrR_ts rules in total). The two weighted sums are then compared, and the class-label with the higher weighted sum becomes the predicted class-label of ts (viz., PredCls_ts). If the two weighted sums are equal, the class-label of the top-ranked rule that satisfies ts (i.e., ClsTopR_ts) becomes the predicted class-label. If no rule satisfies ts, the class-label of the rule that satisfies the maximum number of test points (i.e., Cls_R′) becomes the predicted class-label. An example of the majority voting technique is presented in Fig. 4. This step is repeated for the other test points, and the whole process is repeated 4 times, once for each of the 4 test sub-matrices of the 4-fold CV. Using this technique, we calculate the true positives (TP), true negatives (TN), false positives (FP), false negatives (FN), sensitivity [7], specificity [7], accuracy [7] and Matthews correlation coefficient (MCC) [7] of the proposed classification. Sensitivity, specificity, accuracy and MCC are defined as follows, respectively:

\begin{matrix} \mathrm{sensitivity} = \frac{TP}{TP + FN}, \end{matrix}

(19)

\begin{matrix} \mathrm{specificity} = \frac{TN}{FP + TN}, \end{matrix}

(20)

\begin{matrix} \mathrm{accuracy} = \frac{TP + TN}{TP + FP + TN + FN}, \end{matrix}

(21)

\begin{matrix} \mathrm{MCC} = \frac{(TP \cdot TN) - (FP \cdot FN)}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} . \end{matrix}

(22)
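Equations 19 to 22 translate directly into code. The sketch below uses hypothetical confusion-matrix counts, and returning 0.0 for MCC when the denominator is zero is a common convention assumed here, not something stated in the text.

```python
from math import sqrt


def classification_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy and MCC (Equations 19-22)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (fp + tn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: MCC is reported as 0.0 when the denominator vanishes.
    mcc = ((tp * tn) - (fp * fn)) / denom if denom else 0.0
    return sensitivity, specificity, accuracy, mcc


# Hypothetical counts for illustration only.
sens, spec, acc, mcc = classification_metrics(tp=38, tn=15, fp=3, fn=2)
print(round(sens, 4), round(spec, 4), round(acc, 4), round(mcc, 4))
```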

We repeat the 4-fold CV 10 times and calculate the average sensitivity, average specificity, average accuracy and average MCC, together with their standard deviations, from the cross-validation results. Thereafter, we conduct a comparative performance analysis between our proposed method (i.e., Prop) and other popular rule-based classifiers (i.e., ConjunctiveRule (CJR) [28], DecisionTable (DT) [28], JRip [28], OneR [28], PART [28] and Ridor [28], as implemented in Weka 3.6) based on these four averages. Note that the other rule-based classifiers also start from the same post-discretized matrix (i.e., ITb). We also perform a significance test (viz., one-way ANOVA) on the accuracies of the classifiers, pairwise, to obtain the level of significance (i.e., p-value) of each pairwise comparison.

Fig 4. An example of classification of evolved rules by the majority voting using weighted-sum. Here, ‘r’ and ‘w’ denote the rank and weight of the rule (computed by Equation 18), respectively. A tick mark/cross mark in the ‘Q’ column indicates that the test point (ts) is satisfied/not satisfied by the corresponding rule.

The above steps describe the proposed methodology for classification.
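The weighted majority voting, including its two fallback cases, can be sketched as follows. The rules, weights and test points are hypothetical, the function names are illustrative, and resolving a tie in the "most covering rule" fallback in favour of the better-ranked rule is an assumption.

```python
def predict(ts, rules, fallback_label):
    """rules: list of (antecedent_set, class_label, weight), best rank first."""
    matched = [r for r in rules if r[0] <= ts]   # antecedent fully present in ts
    if not matched:
        return fallback_label                    # no rule satisfies ts
    sums = {}
    for _, label, weight in matched:
        sums[label] = sums.get(label, 0.0) + weight
    best = max(sums.values())
    winners = [label for label, total in sums.items() if total == best]
    if len(winners) == 1:
        return winners[0]
    return matched[0][1]                         # tie: top-ranked matching rule


def most_covering_label(rules, test_points):
    """Class-label of the rule that satisfies the most test points."""
    covered = [sum(1 for ts in test_points if r[0] <= ts) for r in rules]
    # Assumption: on equal coverage the better-ranked rule is used.
    return rules[covered.index(max(covered))][1]


# Hypothetical ranked rules with weights from Equation 18 (p = 4).
rules = [
    ({"g1+", "g2-"}, "disease", 1.0),
    ({"g1+", "g3-"}, "normal", 0.75),
    ({"g2-", "g3+"}, "disease", 0.5),
    ({"g2+", "g3-"}, "normal", 0.25),
]
test_points = [
    {"g1+", "g2-", "g3+"},
    {"g1+", "g2+", "g3-"},
    {"g1-", "g2+", "g3+"},   # satisfied by no rule
    {"g1+", "g2-", "g3+"},
]
fallback = most_covering_label(rules, test_points)
predictions = [predict(ts, rules, fallback) for ts in test_points]
print(predictions)
```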

Performance comparison with other rule mining algorithms

For the purpose of rule mining only, the whole post-discretized data matrix (i.e., ITb) is used directly as the input of Bimax biclustering. In this case, we do not perform any cross-validation, since no classification is needed. We compare our proposed rule mining algorithm with other popular existing rule mining algorithms (i.e., AprioriTid [21], Eclat [22], Tao et al. [23] and H-mine [24]). Note that the same input binary matrix (i.e., ITb) is used for the other rule mining algorithms.

Biological significance of evolved rules, and Biomarker identification

As all the real datasets are microarray/beadchip (biological) datasets, the evolved top rules should have biological significance. Information about the relation between the genes and a disease can be obtained from pathway and Gene Ontology (GO) analyses. If all the genes of a rule (i.e., its antecedent items, excluding the class-label) occur together in a pathway/GO-term and the co-occurrence is statistically significant (i.e., the p-value is less than 0.05), then the rule is considered biologically significant. If the pathway/GO-term relates to the corresponding disease, then the rule becomes important for diagnosing the disease. Therefore, KEGG pathway and GO analyses are performed on the genes of the evolved rules using the DAVID database to identify the top significant rules together with their KEGG pathways or GO-terms. The top rules that are involved in a significant number of pathways/GO-terms are obtained. Frequency analysis of the genes occurring in the evolved rules of the experimental/treated class is performed to identify potential biomarkers.
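The frequency analysis mentioned above amounts to counting how often each gene item appears in the antecedents of the rules of one class. A minimal sketch, using four Dataset 1 rules that appear in the paper's tables as input and an illustrative function name, could look like this:

```python
from collections import Counter


def top_frequent_genes(rules, class_label, k=10):
    """Count gene items over the antecedents of all rules of one class."""
    counts = Counter(
        gene
        for antecedent, label in rules
        if label == class_label
        for gene in antecedent
    )
    return counts.most_common(k)


# A few rules as (antecedent, class-label) pairs.
rules = [
    (("KIF11-", "BUB1B-"), "AC"),
    (("KIF11-", "TTK-"), "AC"),
    (("KIF11-", "TIMELESS-"), "AC"),
    (("NCAPH+", "AURKB+", "KIF15+"), "SCC"),
]
top_ac = top_frequent_genes(rules, "AC", k=1)
print(top_ac)
```

Genes that appear in many rules of the experimental/treated class are the candidates the text refers to as potential biomarkers.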

Results and Discussion

In this section, we first describe the real datasets that are used to verify the performance of our proposed method (i.e., StatBicRM). Thereafter, we perform experiments on the real datasets as well as on some artificial datasets, which are generated by taking random boolean values. Finally, some related discussions are included at the end of this section.

Real Datasets

We have used three real datasets. The datasets are described in Table 1.

Table 1. Information of used Real Datasets (DS).

DS id	Dataset information
DS1	Expression dataset (NCBI ref. id: GSE10245) of lung cancer subtypes [31], comprising 40 adenocarcinoma (AC) samples and 18 squamous cell carcinoma (SCC) samples.
DS2	Expression dataset (NCBI ref. id: GSE31699) of uterine leiomyoma [37], comprising 16 uterine leiomyoma tumor (UL) samples and 16 normal myometrial (MM) samples.
DS3	Methylation dataset (NCBI ref. id: GSE31699) of uterine leiomyoma, comprising 18 UL samples and 18 MM samples.

	parametric tests at 0.0001 p-value cutoff				non-parametric tests at 0.0001 p-value cutoff
	t-test	Welch’s t-test	Bayes’ t-test	common	Limma	SAM	Wilcoxon	permute	common
#G _up	616	586	376	344	115	136	188	176	93
#G _dw	619	642	481	403	325	387	615	582	320

	parametric tests at 0.05 p-value cutoff					non-parametric tests at 0.05 p-value cutoff
	t-test	Welch’s t-test	Bayes t-test	Pearson’s correlation	common	Limma	SAM	Wilcoxon	permute	common
#G _up	391	391	329	62	54	86	86	86	86	86
#G _dw	576	576	491	97	82	70	70	70	70	70

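Table 6. Comparative performance analysis of the rule-based classifiers on Dataset 1 (4-fold CV repeated 10 times).

Rule-based classifier	Average sensitivity [%] (s.d.)	Average specificity [%] (s.d.)	Average accuracy [%] (s.d.)	Average MCC (s.d.)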
Proposed	99.25 (1.21)	85.55 (2.87)	95 (1.51)	0.88 (0.035)
ConjunctiveRule	88.25 (6.46)	78.33 (15.81)	85.18 (4.00)	0.67 (0.086)
DecisionTable	94.75 (2.19)	77.22 (12.13)	89.31 (2.27)	0.75 (0.057)
JRip	94.25 (1.21)	79.45 (6.95)	89.66 (1.41)	0.75 (0.037)
OneR	92.5 (2.04)	78.33 (8.05)	88.11 (1.51)	0.72 (0.039)
PART	92 (2.58)	83.89 (4.86)	89.48 (3.19)	0.76 (0.074)
Ridor	91.75 (5.90)	79.44 (17.18)	87.93 (2.82)	0.73 (0.072)

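Table 7. Comparative performance analysis of the rule-based classifiers on Dataset 2 (4-fold CV repeated 10 times).

Rule-based classifier	Average sensitivity [%] (s.d.)	Average specificity [%] (s.d.)	Average accuracy [%] (s.d.)	Average MCC (s.d.)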
Proposed	98.13 (3.02)	53.75 (3.23)	75.94 (1.51)	0.58 (0.037)
ConjunctiveRule	71.88 (8.46)	63.75 (8.23)	67.81 (3.91)	0.36 (0.081)
DecisionTable	76.88 (3.02)	62.5 (5.10)	69.69 (1.51)	0.40 (0.027)
JRip	70.63 (3.02)	62.5 (5.10)	66.56 (3.92)	0.33 (0.080)
OneR	73.13 (3.02)	58.75 (10.70)	65.93 (4.53)	0.33 (0.087)
PART	66.25 (6.04)	64.38 (7.83)	65.31 (2.74)	0.31 (0.057)
Ridor	76.88 (3.02)	58.75 (3.23)	67.81 (1.51)	0.37 (0.031)

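Table 8. Comparative performance analysis of the rule-based classifiers on Dataset 3 (4-fold CV repeated 10 times).

Rule-based classifier	Average sensitivity [%] (s.d.)	Average specificity [%] (s.d.)	Average accuracy [%] (s.d.)	Average MCC (s.d.)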
Proposed	90.56 (2.68)	86.67 (2.87)	88.61 (2.43)	0.77 (0.048)
ConjunctiveRule	70.56 (2.68)	90.56 (6.95)	80.56 (2.27)	0.63 (0.062)
DecisionTable	84.44 (3.51)	82.78 (4.86)	83.61 (0.88)	0.68 (0.019)
JRip	75.56 (2.87)	92.78 (2.68)	84.16 (1.34)	0.69 (0.027)
OneR	76.67 (7.31)	88.33 (4.86)	82.50 (1.33)	0.66 (0.014)
PART	76.11 (13.11)	94.44 (0.00)^*	85.28 (6.55)	0.72 (0.113)
Ridor	83.33 (4.54)	80.00 (9.51)	81.67 (2.68)	0.64 (0.048)

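Table 9. Comparative performance analysis of the rule-based classifiers on Dataset 4 (4-fold CV repeated 10 times).

Rule-based classifier	Average sensitivity [%] (s.d.)	Average specificity [%] (s.d.)	Average accuracy [%] (s.d.)	Average MCC (s.d.)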
Proposed	84.13 (2.37)	83.62 (0.37)	83.77 (0.67)	0.64 (0.02)
ConjunctiveRule	83.47 (2.32)	82.12 (0.98)	82.56 (1.22)	0.62 (0.04)
DecisionTable	84.06 (3.82)	81.32 (0.78)	82.78 (0.87)	0.63 (0.02)
JRip	79.12 (2.89)	83.37 (0.97)	81.55 (1.37)	0.59 (0.04)
OneR	79.63 (3.48)	81.75 (1.48)	81.10 (1.78)	0.57 (0.06)
PART	80.95 (2.96)	83.97 (0.93)	81.86 (1.27)	0.60 (0.04)
Ridor	84.07 (2.57)	82.56 (1.05)	83.17 (1.42)	0.63 (0.05)

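Table 10. p-values of one-way ANOVA between the average accuracies of the proposed and other classifiers (pairwise) in DS1, DS2, DS3 and DS4, where ‘S’ and ‘NS’ denote significant (p-value ≤ 0.05) and non-significant (p-value > 0.05) results, respectively.

Group	p-value in DS1	p-value in DS2	p-value in DS3	p-value in DS4
Proposed vs ConjunctiveRule	2.41e-06 (S)	8.68e-06 (S)	4.53e-07 (S)	0.0139 (S)
Proposed vs DecisionTable	3.40e-06 (S)	2.88e-08 (S)	8.93e-06 (S)	0.0106 (S)
Proposed vs JRip	1.78e-07 (S)	1.36e-06 (S)	8.16e-05 (S)	0.0002 (S)
Proposed vs OneR	6.53e-09 (S)	3.22e-06 (S)	1.68e-06 (S)	0.0003 (S)
Proposed vs PART	0.0001 (S)	2.89e-09 (S)	0.1491 (NS)	0.0005 (S)
Proposed vs Ridor	1.57e-06 (S)	4.81e-10 (S)	9.83e-06 (S)	0.2497 (NS)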

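Table 11. Top 10 frequent genes in evolved rules of the two class-labels for DS1, DS2 and DS3, respectively.

	DS1	DS2	DS3
For Rule_experimental	CENPA-	MCM4+	GZMH-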
	TTK-	PRL+	TRPM2-
	CENPN-	FBXO33-	LHCGR+
	KIF2C-	NUAK1-	IQCF2-
	EZH2-	JAG1-	BSG+
	CA12-	EGFL6+	SCN4B+
	RGN+	CDC34+	HCG9+
	NCAPG-	TRPC6+	C1orf158-
	RNASEH2A-	IGF2+	PLP1-
	RPL13P5+	PSCD1-	SMPD2-
For Rule_control	SHROOM3-	MEIS3-	PRSS8-
	CMTM8-	AOX1+	NAV1+
	ZNF226-	ZNF217+	LYZL2-
	UGT1A8+/UGT1A9+	GFOD1+	FYB+
	MRAP2+	PRRG1+	EML4-
	SOX2OT+	MTMR4-	LHPP+
	CXXC5-	SERPINB1+	CCDC13-
	XKR8-	FHL5+	TEK-
	C10orf99+	ACSL5+	INHBE-
	FAM83C+	LIFR+	S100A16+

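Table 12. KEGG pathway, GO:BP, GO:CC and GO:MF analysis of corresponding genes of the evolved rules from the three datasets.

DS		Pathway/GO-BP/GO-CC/GO-MF	p-value	#Gene	Genes	#SRule	SRule ids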
DS1	Path	hsa04120:Ubiquitin mediated proteolysis	0.0386	7	MGRN1, FBXO2, KLHL13, DDB2, RHOBTB2, MID1, UBE2S	1	rule id 5233
	GO:BP	GO:0022402 cell cycle process	8.69E-10	34	PRC1, BLM, TTK, PKMYT1, CEP55, AURKB, RHOU, GTSE1, SPC24, KIF2C, CDCA8, NCAPH, NCAPG, CENPA, SKA1, ZWILCH, TXNL4B, CDK1 etc.	21	rule id 327, 2231, 2232, 2914, 7360 etc.
		GO:0000278 mitotic cell cycle	1.02E-11	30	PRC1, BLM, TTK, PKMYT1, SPC24, KIF15, BIRC5, CENPE, NDC80, SMC2, CDK2, MAD2L1, TIMELESS, PLK1, BUB1B, SETD8 etc.	19	rule id 327, 2231, 2232, 2914, 7360 etc.
		GO:0022403 cell cycle phase	6.04E-12	32	PRC1, BLM, TTK, PKMYT1, CEP55, AURKB, RHOU, GTSE1, BIRC5, CENPE, NDC80, SMC2, CDK2, MAD2L1, TIMELESS, PLK1, BUB1B, RAD54B etc.	18	rule id 327, 2231, 2232, 2914, 7360 etc.
		GO:0007017 microtubule-based process	4.86E-04	14	KIF11, PRC1, KIF15, KIF18B, TTK, NDC80, CENPE, MID1, MARK1, GTSE1, KIF2C, CENPA, BUB1B, KIF13B	17	rule id 327, 2232, 7360 etc.
	GO:CC	GO:0043228 non-membrane-bounded organelle	5.43E-05	70	MTSS1, FOSL2, PRC1, CEP78, TTK, AURKB, SENP5, RHOU, GTSE1, SLC1A4, KIF2C, CDCA8, FRMD6, PBXIP1, FANCI, SNTB1, KIF13B, CDK1, MYO6, KIF11 etc.	85	rule id 151, 253, 298, 327, 415, 888, 1261, 1462, 1970, 2232 etc.
		GO:0043232 intracellular non-membrane-bounded organelle	5.43E-05	70	MTSS1, FOSL2, PRC1, CEP78, TTK, AURKB, SENP5, RHOU, GTSE1, SLC1A4, KIF2C, CDCA8, FRMD6 etc.	85	rule id 151, 253, 298, 327, 415, 888, 1261, 1462, 1970, 2232 etc.
		GO:0044459 plasma membrane part	0.0128	52	DLC1, IL27RA, TSPAN4, RHOU, SLC1A4, FRMD6, CD44, LTB4R, SNTB1, CEACAM6, SLC22A3, RAB27A, ARHGEF4, ICAM1, PLD1, MYO6, LIFR etc.	36	rule id 212, 625, 1876, 6051 etc.
	GO:MF	GO:0000166 nucleotide binding	0.0015	56	ACOX2, CTPS, PKMYT1, TTK, AURKB, RHOU, KIF2C, MCM8, LTB4R, ACAD8, RAB27B, ACAD9, RAB27A, KIF13B, NMNAT3, CDK1, MYO6, KIF11, LIMK2, KIF15, MCM4, MBD1, MCM5, CDK2, etc.	51	rule id 254, 327, 339, 344, 494, 639, 643 etc.
		GO:0001883 purine nucleoside binding	4.95E-04	45	ACOX2, FGFR2, BLM, CTPS, TTK, PKMYT1, AURKB, ADA, KIF2C, IGF1R, MCM8, STK32A, ACAD8, ACAD9, KIF13B, MYO5C, NMNAT3, CDK1, MYO6, KIF11, MKI67, LIMK2, KIF15, ATP11B etc.	39	rule id 254, 327, 339, 344, 494, 639, 643 etc.
		GO:0001882 nucleoside binding	5.72E-04	45	ACOX2, FGFR2, BLM, CTPS, TTK, PKMYT1, AURKB, ADA, KIF2C, IGF1R, MCM8, STK32A, ACAD8, ACAD9, KIF13B, MYO5C, NMNAT3, CDK1, MYO6, KIF11, MKI67, LIMK2, IPPK, UBE2S, ABCC5 etc.	39	rule id 254, 327, 339, 344, 494, 639, 643 etc.
DS2	Path	hsa00982:Drug metabolism	9.79E-04	5	GSTA4, FMO2, AOX1, GSTO2, MGST1	1	rule id 12
	GO:BP	GO:0042127 regulation of cell proliferation	0.0341	9	CEBPA, TNFSF4, BAX, SERPINE1, LIFR, IGF2, JAG1, CD24, PRL	4	rule id 78, 95, 145, 390
		GO:0032583 regulation of gene-specific transcription	0.0039	5	CEBPA, TNFSF4, SMARCB1, PSRC1, IGF2	3	rule id 52, 78, 225
	GO:CC	GO:0005576 extracellular region	0.0015	20	TNFSF4, EGFL6, MMP9, APOC1, LIFR, GGH, IGF2, JAG1, MMP2, CHRDL1, PRRG1, PTGDS, C1QTNF4, SERPINE1, PECAM1, SERPINA3, C1QL1, GDF15, GFOD1, PRL	6	rule id 50, 81, 82, 87, 95, 246
DS3	Path	hsa04060:Cytokine-cytokine receptor interaction	1.69E-04	12	EGFR, IFNA21, CCR1, TNFSF12, IFNA1, IL23A, IL20RA, CCL3L1, INHBE, TNFRSF18, TNFSF12-TNFSF13, IFNGR2, IFNA17	1	rule id 177
	GO:BP	GO:0006952 defense response	1.88E-09	29	IFNA21, S100A8, CCR1, BNIP3, HTN3, CD74, CFHR1, APOA4, REG3A, IFNA1, IL23A, SAA2, CCL3L1, SAA1, REG3G, CFHR5, IL1RL1, DEFB103A, SCUBE1, RNASE6 etc.	12	rule id 7, 144, 613, 617, 653, 654, 784, 822, 1067, 1182, 2293, 2342
		GO:0006955 immune response	8.71E-04	20	FYB, IL1RL1, SLA2, CCR1, IGJ, CD300E, BNIP3, TNFSF12, C4BPA, CD74, CLEC4M, APOA4, CFHR1, CYBA, IL23A, CCL3L1, LYST, DEFA1, TNFSF12-TNFSF13, TREM1, CFHR5	12	rule id 7, 349, 350, 351, 387, 654, 784, 1182, 1674, 2293, 2342, 2361
		GO:0003012 muscle system process	0.0048	8	CYBA, CALD1, MYH3, SLMAP, MYH4, ACTN2, SCN5A, CASQ2	4	rule id 47, 138, 333, 2296
		GO:0006936 muscle contraction	0.0118	7	CALD1, MYH3, SLMAP, MYH4, ACTN2, SCN5A, CASQ2	4	rule id 47, 138, 333, 2296
	GO:CC	GO:0005886 plasma membrane	0.0090	67	TEX101, STEAP4, NEURL, LHCGR, F2RL1, FCRL2, TNFSF12, KCNIP4, CALB2, FCRL3, APOB, SLMAP, ERAS, CALCRL, IFNGR2, EGFR, BSG, SLA2, SCUBE1, ACTN2, CACNG3, OR1D2, FLNA, TRPM2 etc.	91	rule id 126, 144, 155, 272, 321, 338, 339, 351, 385, 416 etc.
		GO:0044459 plasma membrane part	0.0112	43	PKHD1, CCR1, LHCGR, F2RL1, TRHR, PANX3, CLDN11, TNFSF12, CD74, CALB2, SORBS3, SLMAP, TEK, ERAS, CALCRL, IFNGR2, SCN5A, EGFR, TRPM2, KCNK3, CLEC4M etc.	43	rule id 27, 28, 126, 144, 301, 339, 351, 385, 513, 514 etc.
		GO:0005576 extracellular region	4.55E-09	59	IFNA21, LHCGR, MMP27, TNFSF12, HTN3, APOA4, CFHR1, CFHR2, REG3A, APOB, OLFML3, SAA2, SERPINE2, SAA1, CCL3L1, CREG1, ANGPT1, REG3G, CFHR5, EGFR, NODAL, DEFA1 etc.	40	rule id 151, 180, 191, 346, 349, 350, 351, 486, 515, 517 etc.
	GO:MF	GO:0046983 protein dimerization activity	0.0029	16	EGFR, S100A16, SCUBE1, TRHR, LHCGR, NFS1, BNIP3, DSCAML1, ACTN2, FLNA, APOA4, CYBA, APOB, BOK, TFAP2E, CRYBB2	3	rule id 1040, 1176, 2358
		GO:0019955 cytokine binding	0.0112	6	IL1RL1, IL20RA, CCR1, TNFRSF18, IFNGR2, CD74	2	rule id 177, 2342

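Table 13. Some top important rules w.r.t. their existing KEGG pathways/GO:BPs/GO:CCs/GO:MFs in Dataset 1.

Rule	#Pathway	Pathways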
{FBXO2+, DDB2-⇒ class = AC}	1	hsa04120:Ubiquitin mediated proteolysis
Rule	#GO:BP	GO:BPs
{KIF11-, BUB1B- ⇒ class = AC }	14	GO:0000279 M phase, GO:0000280 nuclear division, GO:0007067 mitosis, GO:0022403 cell cycle phase, GO:0000087 M phase of mitotic cell cycle, GO:0007049 cell cycle, GO:0000278 mitotic cell cycle, GO:0048285 organelle fission, GO:0051301 cell division, GO:0022402 cell cycle process, GO:0007010 cytoskeleton organization, GO:0007017 microtubule-based process, GO:0000226 microtubule cytoskeleton organization, GO:0007051 spindle organization
{KIF11-, TTK- ⇒ class = AC }	10	GO:0000279 M phase, GO:0022403 cell cycle phase, GO:0007049 cell cycle, GO:0000278 mitotic cell cycle, GO:0022402 cell cycle process, GO:0007010 cytoskeleton organization, GO:0007017 microtubule-based process, GO:0007052 mitotic spindle organization, GO:0000226 microtubule cytoskeleton organization, GO:0007051 spindle organization
{KIF11-, TIMELESS- ⇒ class = AC }	10	GO:0000279 M phase, GO:0000280 nuclear division, GO:0007067 mitosis, GO:0022403 cell cycle phase, GO:0000087 M phase of mitotic cell cycle, GO:0007049 cell cycle, GO:0000278 mitotic cell cycle, GO:0048285 organelle fission, GO:0051301 cell division, GO:0022402 cell cycle process
{NCAPH+, AURKB+, KIF15+ ⇒ class = SCC }	9	GO:0000279 M phase, GO:0000280 nuclear division, GO:0007067 mitosis, GO:0022403 cell cycle phase, GO:0000087 M phase of mitotic cell cycle, GO:0007049 cell cycle, GO:0000278 mitotic cell cycle, GO:0048285 organelle fission, GO:0022402 cell cycle process
Rule	#GO:CC	GO:CCs
{CENPN-, ZWILCH- ⇒ class = AC}	9	GO:0000793 condensed chromosome, GO:0000779 condensed chromosome and centromeric region, GO:0000775 chromosome and centromeric region, GO:0000777 condensed chromosome kinetochore, GO:0000776 kinetochore, GO:0044427 chromosomal part, GO:0005694 chromosome, GO:0043228 non-membrane-bounded organelle, GO:0043232 intracellular non-membrane-bounded organelle
{CENPN-, CENPA- ⇒ class = AC}	9	GO:0000793 condensed chromosome, GO:0000779 condensed chromosome and centromeric region, GO:0000775 chromosome and centromeric region, GO:0000777 condensed chromosome kinetochore, GO:0000776 kinetochore, GO:0044427 chromosomal part, GO:0005694 chromosome, GO:0043228 non-membrane-bounded organelle, GO:0043232 intracellular non-membrane-bounded organelle
{CENPN-, CENPM- ⇒ class = AC}	9	GO:0000793 condensed chromosome, GO:0000779 condensed chromosome and centromeric region, GO:0000775 chromosome and centromeric region, GO:0000777 condensed chromosome kinetochore, GO:0000776 kinetochore, GO:0044427 chromosomal part, GO:0005694 chromosome, GO:0043228 non-membrane-bounded organelle, GO:0043232 intracellular non-membrane-bounded organelle
Rule	#GO:MF	GO:MFs
{SMC2-, TTK- ⇒ class = AC}	9	GO:0001883 purine nucleoside binding, GO:0001882 nucleoside binding, GO:0030554 adenyl nucleotide binding, GO:0000166 nucleotide binding, GO:0017076 purine nucleotide binding, GO:0005524 ATP binding, GO:0032559 adenyl ribonucleotide binding, GO:0032555 purine ribonucleotide binding, GO:0032553 ribonucleotide binding
{TTK-, KIF2C- ⇒ class = AC}	9	GO:0001883 purine nucleoside binding, GO:0001882 nucleoside binding, GO:0030554 adenyl nucleotide binding, GO:0000166 nucleotide binding, GO:0017076 purine nucleotide binding, GO:0005524 ATP binding, GO:0032559 adenyl ribonucleotide binding, GO:0032555 purine ribonucleotide binding, GO:0032553 ribonucleotide binding
{KIF2C-, IGF1R- ⇒ class = AC}	9	GO:0001883 purine nucleoside binding, GO:0001882 nucleoside binding, GO:0030554 adenyl nucleotide binding, GO:0000166 nucleotide binding, GO:0017076 purine nucleotide binding, GO:0005524 ATP binding, GO:0032559 adenyl ribonucleotide binding, GO:0032555 purine ribonucleotide binding, GO:0032553 ribonucleotide binding
{SMC2-, TTK-, KIF2C- ⇒ class = AC}	9	GO:0001883 purine nucleoside binding, GO:0001882 nucleoside binding, GO:0030554 adenyl nucleotide binding, GO:0000166 nucleotide binding, GO:0017076 purine nucleotide binding, GO:0005524 ATP binding, GO:0032559 adenyl ribonucleotide binding, GO:0032555 purine ribonucleotide binding, GO:0032553 ribonucleotide binding
{TTK-, SMC2-, CTPS- ⇒ class = AC}	9	GO:0001883 purine nucleoside binding, GO:0001882 nucleoside binding, GO:0030554 adenyl nucleotide binding, GO:0000166 nucleotide binding, GO:0017076 purine nucleotide binding, GO:0005524 ATP binding, GO:0032559 adenyl ribonucleotide binding, GO:0032555 purine ribonucleotide binding, GO:0032553 ribonucleotide binding

Table 14. Some top important rules w.r.t. their existing KEGG pathways/GO:BPs/GO:CCs in Dataset 2.

Rule	#Pathway	Pathways
{AOX1+, GSTA4- ⇒ class = normal}	1	hsa00982:Drug metabolism
Rule	#GO:BP	GO:BPs
{AOX1+, GSTA4- ⇒ class = normal}	2	GO:0032583 regulation of gene-specific transcription, GO:0042127 regulation of cell proliferation
{IGF2+, PRL+ ⇒ class = tumor}	1	GO:0042127 regulation of cell proliferation
{IGF2+, PRL+ ⇒ class = tumor}	1	GO:0042127 regulation of cell proliferation
{IGF2+, PRL+ ⇒ class = tumor}	1	GO:0032583 regulation of gene-specific transcription
{IGF2+, PRL+ ⇒ class = tumor}	1	GO:0042127 regulation of cell proliferation
{IGF2+, PRL+ ⇒ class = tumor}	1	GO:0032583 regulation of gene-specific transcription
Rule	#GO:CC	GO:CCs
{IGF2+, PTGDS- ⇒ class = tumor}	3	GO:0005576 extracellular region, GO:0031090 organelle membrane, GO:0005783 endoplasmic reticulum
{IGF2+, EGFL6+ ⇒ class = tumor}	2	GO:0005576 extracellular region, GO:0005615 extracellular space
{PRRG1+, SERPINE1+ ⇒ class = normal}	1	GO:0005576 extracellular region
{CHRDL1+, JAG1- ⇒ class = tumor}	1	GO:0005576 extracellular region
{IGF2+, PRL+ ⇒ class = tumor}	1	GO:0005576 extracellular region
{SERPINE1+, GFOD1+ ⇒ class = normal}	1	GO:0005576 extracellular region
{JAG1-, PECAM1- ⇒ class = tumor}	1	GO:0005576 extracellular region

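Table 15. Top important rules w.r.t. their existing KEGG pathways/GO:BPs/GO:CCs/GO:MFs in Dataset 3.

Rule	#Pathway	Pathways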
{IL20RA+, CCR1+ ⇒ class = tumor}	1	hsa04060:Cytokine-cytokine receptor interaction
Rule	#GO:BP	GO:BPs
{CYBA+, C4BPA- ⇒ class = tumor}	5	GO:0006952 defense response, GO:0006954 inflammatory response, GO:0009611 response to wounding, GO:0006955 immune response, GO:0045087 innate immune response
{LYST+, BNIP3+ ⇒ class = tumor}	4	GO:0006952 defense response, GO:0009615 response to virus, GO:0006955 immune response, GO:0002252 immune effector process
{CFHR5-, REG3A- ⇒ class = tumor}	4	GO:0006952 defense response, GO:0006954 inflammatory response, GO:0009611 response to wounding, GO:0002526 acute inflammatory response
{CCR1+, CFHR5- ⇒ class = tumor}	4	GO:0006952 defense response, GO:0006954 inflammatory response, GO:0009611 response to wounding, GO:0006955 immune response
Rule	#GO:CC	GO:CCs
{MST1R+, TNFSF12/TNFSF13+ ⇒ class = tumor}	4	GO:0031226 intrinsic to plasma membrane, GO:0005886 plasma membrane, GO:0005887 integral to plasma membrane, GO:0044459 plasma membrane part
{MST1R+, CCR1+, TNFSF12/TNFSF13+ ⇒ class = tumor}	4	GO:0031226 intrinsic to plasma membrane, GO:0005886 plasma membrane, GO:0005887 integral to plasma membrane, GO:0044459 plasma membrane part
{LHCGR+, SLMAP+ ⇒ class = tumor}	4	GO:0031226 intrinsic to plasma membrane, GO:0005886 plasma membrane, GO:0005887 integral to plasma membrane, GO:0044459 plasma membrane part
{CYBA+, MST1R+ ⇒ class = tumor}	4	GO:0031226 intrinsic to plasma membrane, GO:0005886 plasma membrane, GO:0005887 integral to plasma membrane, GO:0044459 plasma membrane part
{TRPM2-, SMPD2- ⇒ class = tumor}	4	GO:0031226 intrinsic to plasma membrane, GO:0005886 plasma membrane, GO:0005887 integral to plasma membrane, GO:0044459 plasma membrane part
{SCN4B+, TRPM2- ⇒ class = tumor}	4	GO:0031226 intrinsic to plasma membrane, GO:0005886 plasma membrane, GO:0005887 integral to plasma membrane, GO:0044459 plasma membrane part
{MST1R+, CALCRL+ ⇒ class = tumor}	4	GO:0031226 intrinsic to plasma membrane, GO:0005886 plasma membrane, GO:0005887 integral to plasma membrane, GO:0044459 plasma membrane part
{S100A16+, MTNR1A-, NODAL- ⇒ class = normal}	4	GO:0031226 intrinsic to plasma membrane, GO:0005886 plasma membrane, GO:0005887 integral to plasma membrane, GO:0044459 plasma membrane part
{TRPM2-, SMPD2-, UGT1A10-⇒ class = tumor}	4	GO:0031226 intrinsic to plasma membrane, GO:0005886 plasma membrane, GO:0005887 integral to plasma membrane, GO:0044459 plasma membrane part
{SLMAP+, CCR1+ ⇒ class = tumor}	4	GO:0031226 intrinsic to plasma membrane, GO:0005886 plasma membrane, GO:0005887 integral to plasma membrane, GO:0044459 plasma membrane part
Rule	#GO:MF	GO:MFs
{BSG+, CLEC4M- ⇒ class = tumor}	2	GO:0005529 sugar binding, GO:0030246 carbohydrate binding

Fig 1. Flowchart of the proposed methodology (StatBicRM) for the rule mining.

Fig 2. Flowchart of the proposed methodology (StatBicRM) for the classification.

Fig 3. An example of generating special rules from data matrix of the differentially expressed genes.

Fig 4. An example of classification of evolved rules by the majority voting using weighted-sum.

Table 3. Number of differentially expressed genes by different statistical tests for Dataset 2, where #G up, #G dw denote up and down-regulated genes, respectively.

Table 4. Number of differentially methylated genes by different statistical tests for Dataset 3, where #G hyper and #G hypo refer to hyper and hypo-methylated genes, respectively.

Fig 5. The clustergram of the common differentially expressed genes (by different statistical tests) for DS1.

Fig 6. Volcanoplot for identifying differential up and down-regulated genes from Dataset 1 by SAM.

Table 5. Number of differentially expressed genes by different statistical tests for the artificial Dataset 4, where #G up, #G down denote up-regulated and down-regulated genes, respectively.

Fig 7. A graphical representation of the gene expression of a maximal homogeneous bicluster (i.e., a MFCHOI) over different samples.

Table 6. Comparative performance analysis of the rule-based classifiers on Dataset 1 (4-fold cross-validation repeated 10 times); bold font denotes the highest value in each column.

Table 7. Comparative performance analysis of the rule-based classifiers on Dataset 2 (4-fold cross-validation repeated 10 times); bold font denotes the highest value in each column.

Table 8. Comparative performance analysis of the rule-based classifiers on Dataset 3 (4-fold cross-validation repeated 10 times); bold font denotes the highest value in each column.

Table 9. Comparative performance analysis of the rule-based classifiers on Dataset 4 (4-fold cross-validation repeated 10 times); bold font denotes the highest value in each column.

Fig 8. Bar charts: (a) comparison of dataset-wise average accuracies, and (b) comparison of dataset-wise average MCCs, between the proposed and other existing rule-based classifiers for the four datasets.

Table 10. p-values of one-way ANOVA (Anova 1) between the average accuracies of the proposed and each of the other classifiers (pairwise) in DS1, DS2, DS3 and DS4, where ‘S’ and ‘NS’ denote significant (p-value ≤ 0.05) and non-significant (p-value > 0.05) results, respectively.

Table 11. Top 10 frequent genes in evolved rules of the two class-labels for DS1, DS2 and DS3, respectively.

Fig 10. Two examples of how significant biomarkers are identified from the maximal homogeneous biclusters (i.e., MFCHOI) for each class-label for each dataset.

Table 12. KEGG pathway, GO:BP, GO:CC and GO:MF analysis of corresponding genes of the evolved rules from the three datasets.

Table 13. Some top important rules w.r.t. their existing KEGG pathways/GO:BPs/GO:CCs/GO:MFs in Dataset 1.

Table 14. Some top important rules w.r.t. their existing KEGG pathways/GO:BPs/GO:CCs in Dataset 2.

Table 15. Top important rules w.r.t. their existing KEGG pathways/GO:BPs/GO:CCs/GO:MFs in Dataset 3.

Fig 11. Comparison of the number of significant itemsets between StatBicRM and other existing ARM methods at different minimum support values for the two artificial datasets (viz., ArDS5 and ArDS6).

Integrative analysis of Gene Expression dataset and Methylation dataset

Conclusion

Supporting Information

Acknowledgments

Data Availability

Funding Statement

References

