PLOS ONE. 2011 Jun 1;6(6):e20074. doi: 10.1371/journal.pone.0020074

Constructing Biological Pathways by a Two-Step Counting Approach

Hsiuying Wang 1,*, Henry Horng-Shing Lu 1, Tung-Hung Chueh 1
Editor: Miguel A Blazquez 2
PMCID: PMC3105984  PMID: 21673799

Abstract

Networks are widely used in biology to represent the relationships between genes and gene functions. In Boolean biological models, it is mainly assumed that there are two states to represent a gene: on-state and off-state. It is typically assumed that the relationship between two genes can be characterized by two kinds of pairwise relationships: similarity and prerequisite. Many approaches have been proposed in the literature to reconstruct biological relationships. In this article, we propose a two-step method to reconstruct the biological pathway when the binary array data have measurement error. For a pair of genes in a sample, the first step of this approach is to assign counting numbers for every relationship and select the relationship with counting number greater than a threshold. The second step is to calculate the asymptotic p-values for hypotheses of possible relationships and select relationships with a large p-value. This new method has the advantages of easy calculation for the counting numbers and simple closed forms for the p-value. The simulation study and real data example show that the two-step counting method can accurately reconstruct the biological pathway and outperform the existing methods. Compared with the other existing methods, this two-step method can provide a more accurate and efficient alternative approach for reconstructing the biological network.

Introduction

One great challenge of postgenomic research is to explore complex biological pathways from genomic data such as DNA sequences, protein sequences, and gene expression profiles. The network building method is widely used throughout biology to reconstruct complex biological pathways.

We take the MAPK pathway as an example. The MAPK/ERK pathway is a signal transduction pathway that couples intracellular responses to the binding of growth factors to cell surface receptors. Robert et al. [1] and related studies [2]–[5] based on biological experiments provide the MAPK pathway (Figure 1).

Figure 1. The PKC pathway in yeast.

This figure is redrawn from Figure 1A in [1].

It would be interesting if the pathway in Figure 1 could be reconstructed from the expression profiles of Wsc1/2/3, Mid2, and so on. To reduce the cost of experiments, one possibility is to predict the activation status of these genes from their microarray expression data and use it to infer the pathway.

Several methods have been proposed in the literature for reconstructing genetic regulatory networks from microarray data. For instance, the Bayesian network model is an important technique that has been studied over the last two decades [6]–[8]. In addition, Wei and Li [9] proposed a hidden spatial-temporal Markov random field model to identify genes that are related to a biological pathway. Allocco et al. [10] provided a variety of methods to find gene pairs with a similarity relationship. Moreover, other algorithms using linear models [11], [12], differential equations [11], [13], neural networks [14] and structural equation modeling [15] have been proposed to explore gene regulatory networks based on genome-wide data. However, most of these methods have limitations in dealing with large-scale gene regulatory networks because of their complex model structures. Also, careful discretization can be used to denoise high-throughput data. One such example can be found in Xing and Karp [16].

To overcome the disadvantages of the methods mentioned above, we consider a simple model based on the Boolean network to reconstruct a large-scale gene network in this study. Boolean networks have long been proposed and investigated in the literature. Kauffman [17], [18] considered a dynamic version of Boolean networks. Liang et al. [19] proposed the algorithm REVEAL to infer gene regulatory networks by calculating the Shannon entropy. Akutsu and Miyano [20] proposed an identification algorithm to reconstruct the Boolean network by comparing the collected data with all possible Boolean functions and input datasets. In order to make the Boolean network more comprehensive, Shmulevich et al. [21] proposed the probabilistic Boolean network (PBN) model. Moreover, for large-scale gene regulatory networks, Kim et al. [22] used a Boolean network with a chi-square test on the yeast cell cycle microarray gene expression datasets. Markowetz et al. [23] proposed the nested effects model to infer the genetic network. Li et al. [24] compared the probabilistic Boolean network and dynamic Bayesian network approaches. For more recent developments, see Ay, Xu and Kahveci [25] and Davidich and Bornholdt [26].

In this article, we consider the directed acyclic Boolean (DAB) network as a tool for exploring biological pathways. Our goal is to construct a DAB network from noisy array data. Since the data are noisy, the reconstruction of the pathways cannot employ deterministic inference. Instead, we need to establish a statistical model to capture the random characteristics of the data. A DAB network is characterized by two kinds of pairwise relationships: similarity and prerequisite. The former represents a pair of elements with coherent on-off states. The latter is a partial order relationship, namely, the on-status of one element is a prerequisite for the on-status of another element. More specifically, if one element is a prerequisite of another element, the off-status of the former forces the off-status of the latter. A DAB network is uniquely determined by its state space: all possible on-off states subject to the pairwise relationships.

Recently, a Boolean implication network was proposed that is similar in spirit to the DAB network; it investigates all Boolean implications between pairs of genes for large-scale genome microarray datasets [27]. For any pair of elements, two statistics are used to test whether there is any specific relationship between the pair. However, this method is better suited to datasets with a very large number of samples.

An approach for building a DAB network that uses the expectation-maximization (EM) algorithm to derive the maximum likelihood estimator [28] of a statistical model was established by Li and Lu [29]. Their strategy is to build a statistical model with measurement error, assign scores to the possible relationships between two genes, and then use the scores to select the true relationship. This method involves more computation and cannot provide a simple closed-form statistic to recover the true relationships between genes.

In this study, we propose a simple method to estimate pairwise relationships between elements from noisy array data. The approach is based on two steps: the first is to count the numbers of different pairwise relationships in a sample, and the second is to test the relationship hypotheses according to their asymptotic p-values. Compared with the Li and Lu [29] method, this new approach has a simple closed form and is not time-consuming. In addition, the proposed counting approach shows substantial improvement compared to the Sahoo et al. [27] method. We conduct a simulation study on an example used in Li and Lu [29]. It is shown that the proposed method can recover all of the true relationships. A simulation study for a larger scale network is given in the supplementary material. In addition, the proposed method is applied to the MAPK pathway example. It recovers 6 true relationships among seven relationships, whereas Li and Lu's method recovers only one true relationship in this example. In this real data example, the new method shows a significant improvement in adopting a DAB network for exploring the pathway.

Methods

To describe the model and notation, we adopt a simple example used in Li and Lu [29]. Figure 2 shows the relationships of the seven elements in this example derived from the 13 states of Table 1. In the diagram, the notation Inline graphic denotes that A is a prerequisite of B and the notation Inline graphic denotes that B and E are similar. Note that Inline graphic in Figure 2 are called elements. The prerequisite and similar relationships for any two elements Inline graphic and Inline graphic are defined as follows.

Figure 2. Diagram of a directed acyclic Boolean network with seven elements and twelve pair relationships.

Only arrows between covering pairs are shown.

Table 1. The table of states for directed acyclic Boolean network shown in Figure 2.

case 1 2 3 4 5 6 7 8 9 10 11 12 13
A 0 1 1 1 1 1 1 1 1 1 1 1 1
B 0 0 1 1 1 1 1 1 0 0 0 0 0
C 0 0 0 1 1 1 1 1 1 1 1 1 1
D 1 1 1 1 0 1 0 0 1 0 1 0 0
E 0 0 1 1 1 1 1 1 0 0 0 0 0
F 0 0 0 0 0 1 1 1 0 0 1 1 1
G 0 0 0 0 0 0 0 1 0 0 0 0 1

Assume that an element has only two levels, on or off. We use “0” and “1” to represent the “off” and “on” states respectively. For two elements Inline graphic and Inline graphic, Inline graphic is a prerequisite for Inline graphic if the on-state of Inline graphic is necessary for the on-state of Inline graphic, and we denote it by Inline graphic. When Inline graphic and Inline graphic are on and off simultaneously, the relationship between Inline graphic and Inline graphic is called similar and is denoted by Inline graphic. We define Inline graphic to be the dual state of Inline graphic. It means that Inline graphic when Inline graphic.

There are 4 possible situations for the prerequisite relationship and 2 possible situations for the similar relationship; see Table 2. In total, there are 6 possible relationships for any two genes. The prerequisite relationship is a partial order. It is transitive on the ground-set, namely, Inline graphic and Inline graphic implies Inline graphic. The notations “Inline graphic” and “Inline graphic” in Table 2 denote the possible states of Inline graphic and Inline graphic and the impossible states of Inline graphic and Inline graphic under the relationship, respectively.

Table 2. Patterns for the six pairwise relationships assuming exhaustive sampling and no measurement error.

Inline graphic Inline graphic Inline graphic, Inline graphic
Inline graphic/Inline graphic 0 1 Inline graphic/Inline graphic 0 1 Inline graphic/Inline graphic 0 1
0 + 0 + 0 +
1 + 1 + 1 + +

Let Inline graphic and Inline graphic denote the counts and probabilities corresponding to states Inline graphic and Inline graphic without measurement error. From the possible relationships shown in Table 2, we can propose a hypothesis corresponding to each relationship. For example, the first similar relationship in Table 2 is Inline graphic, which means that the two situations Inline graphic and Inline graphic hold. In this case, the probability of the two situations, Inline graphic and Inline graphic, should be zero. Thus, its corresponding hypothesis is Inline graphic. Other situations follow a similar argument. The hypotheses for the 6 relationships are presented in Table 3. Under the measurement error model assumption, let Inline graphic and Inline graphic denote the counts and probabilities corresponding to states Inline graphic and Inline graphic with misclassification probability Inline graphic.

Table 3. The six pairwise relationships between the two elements Inline graphic and Inline graphic.

Relationship Hypothesis
diagonal Inline graphic Inline graphic
similarity Inline graphic Inline graphic
Inline graphic Inline graphic
triangular Inline graphic Inline graphic
prerequisite Inline graphic Inline graphic
Inline graphic Inline graphic

Because of the misclassification error, Inline graphic may be split up into four categories. We use the notations Inline graphic and Inline graphic to represent the counts of four cells split from Inline graphic. Analogous notations are defined for Inline graphic and Inline graphic. Consequently, their generating probabilities are calculated as follows: Inline graphic. Here, we adopt the notation Inline graphic analogous to Inline graphic. The splitting counts and probabilities implied by misclassification error are given in Tables 4 and 5.

Table 4. Splitting counts caused by misclassification error.

A/B 0 1
0 Inline graphic Inline graphic Inline graphic Inline graphic
Inline graphic Inline graphic Inline graphic Inline graphic
1 Inline graphic Inline graphic Inline graphic Inline graphic
Inline graphic Inline graphic Inline graphic Inline graphic

Table 5. Splitting probabilities caused by misclassification error.

A/B 0 1
0 Inline graphic Inline graphic Inline graphic Inline graphic
Inline graphic Inline graphic Inline graphic Inline graphic
1 Inline graphic Inline graphic Inline graphic Inline graphic
Inline graphic Inline graphic Inline graphic Inline graphic

Now we return to the example of Figure 2, which includes 7 elements. There are a total of 2^7 = 128 states for a seven-element network. Only the thirteen states in Table 1 are compatible with the twelve pairwise relationships in the above example. From Figure 2, there are 12 true relationships between the elements, which are

graphic file with name pone.0020074.e124.jpg (1)

Under the measurement error model assumption, we do not directly observe the 13 states but observe states with measurement error. We aim to reconstruct the true pathway. The proposed method is described below.
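The count of compatible states can be checked by brute force. The following Python sketch is an illustration, not part of the original analysis. Because list (1) is not reproduced in this text, the twelve relationships are encoded as forbidden on-off combinations read off from Table 1 (for each related pair, the combinations that never occur among the 13 states). This encoding and all names in the code are assumptions made for illustration.

```python
from itertools import product

ELEMENTS = "ABCDEFG"

# Rows of Table 1, one string per element (cases 1..13).
TABLE1 = {
    "A": "0111111111111",
    "B": "0011111100000",
    "C": "0001111111111",
    "D": "1111010010100",
    "E": "0011111100000",
    "F": "0000011100111",
    "G": "0000000100001",
}

# Assumed encoding of the twelve relationships:
# (X, Y, set of (x, y) combinations that the relationship forbids).
CONSTRAINTS = [
    ("A", "B", {(0, 1)}), ("A", "C", {(0, 1)}), ("A", "E", {(0, 1)}),
    ("A", "F", {(0, 1)}), ("A", "G", {(0, 1)}),
    ("C", "F", {(0, 1)}), ("C", "G", {(0, 1)}), ("F", "G", {(0, 1)}),
    ("B", "E", {(0, 1), (1, 0)}),          # B and E are similar
    ("A", "D", {(0, 0)}), ("C", "D", {(0, 0)}),
    ("D", "G", {(1, 1)}),
]


def compatible_states():
    """Enumerate all 2^7 on-off states and keep those violating no constraint."""
    kept = []
    for bits in product((0, 1), repeat=len(ELEMENTS)):
        state = dict(zip(ELEMENTS, bits))
        if all((state[x], state[y]) not in bad for x, y, bad in CONSTRAINTS):
            kept.append(bits)
    return kept


table1_states = {tuple(int(TABLE1[e][i]) for e in ELEMENTS) for i in range(13)}
found = compatible_states()
print(len(found), set(found) == table1_states)  # 13 True
```

Under this encoding, the enumeration keeps exactly the 13 columns of Table 1 out of the 128 candidate states.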

The two-step counting method

Suppose we have a sample Inline graphic of size Inline graphic for Inline graphic genes where Inline graphic Inline graphic or 1. For example, in Table 1, there is a sample of size 13 for seven genes. We propose a two-step approach to recover their relationships.

The first step: counting

For a pair of genes, say Inline graphic and Inline graphic, we can count the numbers for 6 relationships in Table 2 for the Inline graphic states. The relationships with a counting number greater than a given threshold are regarded as potential relationships.

If there are no measurement errors, it is reasonable to expect that the counting number of two elements, say Inline graphic and Inline graphic, satisfying the true relationship is exactly equal to Inline graphic. However, since the data involve measurement errors, the counting number with respect to the true relationship may not be exactly equal to Inline graphic. For each pair of elements, we count the numbers satisfying the 6 relationships respectively, say Inline graphic. Since we expect that the misclassification probability is low, the counting number (Inline graphic) corresponding to the true relationship should be close to Inline graphic. Thus, we can select the relationships with a counting number greater than a threshold. The threshold selection is suggested as follows.

Threshold Selection

The suggested thresholds for the similar and prerequisite relationships are

graphic file with name pone.0020074.e140.jpg

and

graphic file with name pone.0020074.e141.jpg (2)

respectively, where Inline graphic. Here, the misclassification error probability Inline graphic can be assumed to be known from empirical experience. If Inline graphic is unknown, the maximum likelihood approach for estimating Inline graphic is given in Appendix D in the materials section.

The argument for the threshold selection is given in Appendix A in the materials section. It is based on a confidence bound approach associated with the counting number formulas. The approach is to derive the formulas for the two kinds of relationships, and then use a confidence interval approach to obtain a lower bound for the counting number formulas. The forms Inline graphic and Inline graphic are inferred from the counting number formulas with misclassification probability Inline graphic, where the Inline graphic value is derived by a confidence bound approach.

The second step: asymptotic p-value

Besides directly counting the relationships' numbers, the second step is to test the relationships in Table 3 using an asymptotic p-value. Then we combine both steps to estimate the true relationship between two elements.

The following simulation study shows that both steps are essential for selecting the true relationship. If either step is used alone, the simulation shows that the true relationships are not selected accurately.

The p-values derived for the 6 hypotheses with misclassification probability Inline graphic corresponding to the 6 relationships are listed as follows. The derivations are given in Appendix B in the materials section.

For testing Inline graphic the asymptotic p-value for large sample size Inline graphic is

graphic file with name pone.0020074.e153.jpg

where Inline graphic is the cumulative distribution function of the standard normal distribution. The asymptotic p-values for testing Inline graphic or Inline graphic are given by (10), (13), (14), (15) and (16), respectively, in Appendix B in the materials section.

The extremeness of the observed value for the test statistic under the null hypothesis leads to a small p-value, which would imply rejection of the null hypothesis. Thus, if the null hypothesis is the true relationship, we expect to obtain a higher p-value. In the second step, we also set a threshold for the asymptotic p-value such that the relationships with asymptotic p-value greater than or equal to the threshold are selected.

A large p-value indicates a larger possibility that the null hypothesis holds. Note that the p-value is less than or equal to 1. In this study, we use the threshold 1 for the p-value criterion in the examples because the largest p-value for each relationship is one. From the simulation study and the real data example discussed in this study, setting 1 to be a threshold for p-value criterion can lead to very accurate results. Note that for other examples, it is possible that the largest p-value is not 1. In this case, we need to observe the p-values to select a suitable threshold.

It is worth noting that the hypothesis testing procedure corresponds to a confidence interval approach [30]. From the confidence interval viewpoint, when the p-value is large enough (close to 1) or small enough (close to 0), we have confidence to accept or reject the null hypothesis. Therefore, in this study, when p-value is 1, we have confidence to accept the null hypothesis.

The two-step method is described as follows.

Procedure for selecting the true relationship of Inline graphic elements

Step 1

For a sample of size Inline graphic for Inline graphic elements, calculate the counting numbers for the 6 relationships of each pair of the elements. Set a threshold for the counting numbers. Select the relationships with a counting number greater than the threshold.

Step 2

For each pair of elements, derive the asymptotic p-values for each relationship and set a threshold for the p-value. Select the relationships with an asymptotic p-value greater than or equal to the threshold.

Step 3

For each pair of elements, select the relationships satisfying both criteria of Step 1 and Step 2. This relationship is the estimated relationship for the two elements.

Note that it is possible for more than one relationship to satisfy both criteria for two elements. However, the simulation result and the real data application show that in most situations only one relationship satisfies both criteria.

The asymptotic p-value has a closed form which can be easily calculated and the counting number can also be easily calculated. This shows that this method can provide a convenient way to recover the biological pathway.

An Example

We revisit the example of Figure 2 to illustrate the counting step. Assume that we only have a sample of the states for the 7 elements and we want to recover the 12 true relationships. Note that there are a total of 7 × 6 / 2 = 21 pairs of the 7 elements and only 12 pairs have relationships in this example. When considering the case without measurement error, we can reconstruct the pathway from a sample using the counting number method if the sample size is large enough. We can construct the Boolean network for the example by identifying prerequisite or similar relationships. From Table 1, we list the relationship corresponding to the highest counting number for each pair as follows:

graphic file with name pone.0020074.e161.jpg
graphic file with name pone.0020074.e162.jpg
graphic file with name pone.0020074.e163.jpg
graphic file with name pone.0020074.e164.jpg
graphic file with name pone.0020074.e165.jpg
graphic file with name pone.0020074.e166.jpg

where Inline graphic denotes the counting number corresponding to the indicating relationship Inline graphic.

If there is only one relationship corresponding to the highest counting number, we list that one, such as Inline graphic; if more than one relationship corresponds to the highest counting number, we list all of them, such as Inline graphic. Among the 21 pairs, the relationships with the highest counting number, 13, are the 12 true relationships, and the relationships with a counting number less than 13 are not true relationships.
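The counting step on Table 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' code. Each of the six patterns is described by the on-off combinations it allows, and the counting number is the number of states falling in allowed combinations. The pattern labels (named after the combination each prerequisite pattern forbids) are hypothetical and do not follow the paper's notation.

```python
from itertools import combinations

# Rows of Table 1, one string per element (cases 1..13).
TABLE1 = {
    "A": "0111111111111",
    "B": "0011111100000",
    "C": "0001111111111",
    "D": "1111010010100",
    "E": "0011111100000",
    "F": "0000011100111",
    "G": "0000000100001",
}

ALL_CELLS = {(0, 0), (0, 1), (1, 0), (1, 1)}

# Allowed (x, y) combinations for each pattern; labels are illustrative only.
PATTERNS = {
    "similar": {(0, 0), (1, 1)},
    "complementary": {(0, 1), (1, 0)},
    "forbid(0,1)": ALL_CELLS - {(0, 1)},
    "forbid(1,0)": ALL_CELLS - {(1, 0)},
    "forbid(0,0)": ALL_CELLS - {(0, 0)},
    "forbid(1,1)": ALL_CELLS - {(1, 1)},
}


def counting_numbers(x, y):
    """Count the samples compatible with each of the six patterns."""
    cells = list(zip(x, y))
    return {name: sum(c in allowed for c in cells)
            for name, allowed in PATTERNS.items()}


n = 13
perfect = {}  # pairs having at least one pattern with counting number n
for a, b in combinations(TABLE1, 2):
    xa = [int(ch) for ch in TABLE1[a]]
    xb = [int(ch) for ch in TABLE1[b]]
    hits = [name for name, c in counting_numbers(xa, xb).items() if c == n]
    if hits:
        perfect[(a, b)] = hits

for pair, hits in perfect.items():
    print(pair, hits)
```

In this sketch, twelve of the 21 pairs reach the counting number 13. For the pair (B, E), whose rows in Table 1 are identical, the similar pattern and both one-sided prerequisite patterns all reach 13, which is the situation in which several relationships are listed for one pair.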

Comparison

We consider two existing methods for detecting the pairwise relationships between any two elements. A simulation study is conducted to compare the proposed method with the existing methods for the measurement error case.

Existing methods

Li and Lu [29] proposed the directed acyclic Boolean network to recover the genetic network. For any pair of elements, they use the EM algorithm to calculate the maximum likelihood estimator of the misclassification rate Inline graphic under the multinomial distribution model structure and adopt a criterion that requires a true relationship to correspond to a small estimator of Inline graphic in order to select a relationship. Besides the disadvantage that the EM algorithm is time-consuming, this method is also shown to be less accurate than the counting method in a simulation study.

Another method for inferring the relationship of any two elements is proposed by Sahoo et al. [27]. For any two genes Inline graphic and Inline graphic, let Inline graphic, Inline graphic, Inline graphic and Inline graphic denote the numbers of the four states Inline graphic, Inline graphic, Inline graphic and Inline graphic of Inline graphic, respectively from a sample. For example, to infer whether the relationship Inline graphic is true, they use the following two statistics to test if the relation Inline graphic is true:

graphic file with name pone.0020074.e186.jpg
graphic file with name pone.0020074.e187.jpg

where “Inline graphic” and “Inline graphic” denote the values of Inline graphic and Inline graphic, respectively.

The relationship Inline graphic in the Sahoo et al. method is regarded as true when the “Inline graphic Inline graphic” value is less than 0.1 and the “Inline graphic” value is greater than 3 [27]. However, from our calculation, the method may lead to inaccurate results when the sample size is not large. For instance, suppose the number of experiments we observed is 91 and the numbers of states corresponding to (0,1), (0,0), (1,0) and (1,1) are 1, 30, 30 and 30 respectively, resulting in a small “Inline graphic” value of 2.94. Note that the state Inline graphic indicates that the relationship Inline graphic does not hold. However, since the state Inline graphic only occurs once, it may be due to a measurement error. In this case, the method does not select the relationship Inline graphic. This shows that the criterion is too conservative to select a potential relationship when the sample size is not large enough.
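The arithmetic of this example can be checked with a short Python sketch. The two formulas are not reproduced in this text, so the forms used below are an assumption: the statistic is taken as (expected − observed) / sqrt(expected) for the sparse cell, with the expected count computed from the cell's row and column totals, and the error rate as the average of the cell's share of its row and of its column. Function and variable names are hypothetical.

```python
from math import sqrt


def sparse_cell_test(n00, n01, n10, n11):
    """Statistic and error rate for the hypothesis that cell (0, 1) is empty.

    Assumed form of the two quantities; see the lead-in text.
    """
    total = n00 + n01 + n10 + n11
    row0 = n00 + n01   # samples with the first gene off
    col1 = n01 + n11   # samples with the second gene on
    expected = row0 * col1 / total
    statistic = (expected - n01) / sqrt(expected)
    error_rate = 0.5 * (n01 / row0 + n01 / col1)
    return statistic, error_rate


stat, err = sparse_cell_test(n00=30, n01=1, n10=30, n11=30)
print(round(stat, 2), round(err, 3))  # 2.94 0.032
```

Under these assumed formulas the error rate is well below 0.1, but the statistic of 2.94 falls just short of 3, so the relationship is not selected, in agreement with the value quoted above.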

Simulation

We conduct a simulation study using the example of Figure 2 with 13 compatible states (Table 1) in order to compare the proposed method with the two existing methods. With a misclassification probability 0.05, we generate 100 states for the simulation comparison. Tables 6 and 7 show the counting numbers and p-values for different relationships with a sample size of 100, respectively. Note that the notations Inline graphic in Tables 6 and 7 denote the relationships in order in Table 3.
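A noisy sample of this kind could be generated along the following lines. This Python sketch rests on assumptions not stated in the text: the true states are drawn uniformly from the 13 columns of Table 1, and each element is misclassified independently with probability 0.05. It will not reproduce the particular sample behind Tables 6 and 7; the names and the seed are arbitrary.

```python
import random

# Rows of Table 1 (cases 1..13), one string per element A..G.
TABLE1 = [
    "0111111111111",  # A
    "0011111100000",  # B
    "0001111111111",  # C
    "1111010010100",  # D
    "0011111100000",  # E
    "0000011100111",  # F
    "0000000100001",  # G
]
STATES = [tuple(int(row[i]) for row in TABLE1) for i in range(13)]


def simulate(states, n_samples, p, seed=0):
    """Draw true states at random, then flip each bit with probability p."""
    rng = random.Random(seed)
    sample = []
    for _ in range(n_samples):
        true_state = rng.choice(states)  # assumption: states equally likely
        noisy = tuple(bit ^ int(rng.random() < p) for bit in true_state)
        sample.append(noisy)
    return sample


sample = simulate(STATES, n_samples=100, p=0.05)
print(len(sample), sample[0])
```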

Table 6. The counting numbers for the 21 pairs in the 100 states under each relationship.

hypothesis Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic
(A, B) 55 45 98 57 90 55
(A, C) 72 28 95 77 93 35
(A, D) 51 49 93 58 95 54
(A, E) 56 44 98 58 90 54
(A, F) 54 46 98 56 90 56
(A, G) 26 74 99 27 89 85
(B, C) 51 49 64 87 83 66
(B, D) 46 54 70 76 77 77
(B, E) 91 9 95 96 52 57
(B, F) 51 49 76 75 71 78
(B, G) 53 47 92 61 55 92
(C, D) 35 65 76 59 94 71
(C, E) 50 50 86 64 84 66
(C, F) 72 28 98 74 72 56
(C, G) 42 58 98 44 72 86
(D, E) 43 57 74 69 79 78
(D, F) 37 63 72 65 81 82
(D, G) 37 63 87 50 66 97
(E, F) 50 50 76 74 72 78
(E, G) 52 48 92 60 56 92
(F, G) 66 34 98 68 48 86

Table 7. The p-values for the 21 pairs in the 100 states under each relationship.

hypothesis Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic
(A, B) 0.8207 0.0065 1 0 0.1358 0
(A, C) 1 0 1 0.0095 0.0680 0
(A, D) 0.1330 0.1511 0.7931 0 1 0
(A, E) 0.9228 0.0046 1 0 0.1201 0
(A, F) 0.7200 0.0092 1 0 0.1527 0
(A, G) 0 0.6534 1 1 0.7599 0.3630
(B, C) 0.9149 0.4654 0.0060 1 0.7636 0.0152
(B, D) 0.4237 1 0.6569 0.6504 0.9855 0.9855
(B, E) 1 0 0.8094 1 0 0
(B, F) 0.9994 0.8352 0.9990 0.9065 0.9118 0.9189
(B, G) 0.4376 0.1168 1 0 0 0.8556
(C, D) 0.0004 1 0.0268 0 1 0.0025
(C, E) 0.7452 0.6280 1 0.0058 0.9122 0.0165
(C, F) 1 0 1 0.0005 0.0014 0
(C, G) 0.0724 0.1689 1 0 0.0070 0.3746
(D, E) 0.1613 1 0.4171 0.4297 1 0.9031
(D, F) 0.0093 1 0.1226 0.1554 0.8975 1
(D, G) 0.0003 1 0.0377 0 0.0001 1
(E, F) 0.9367 0.9716 0.9702 0.7834 0.9965 0.9993
(E, G) 0.3662 0.1523 1 0 0 0.9043
(F, G) 1 0 1 0.0001 0 0.0110

In this case, the maximum value for the counting number is 100 because the sample size is 100. As discussed above, we can set a threshold (2) for the counting number. In this case, the thresholds for the similar and prerequisite relationships are 86 and 93, respectively. We set the threshold for the p-value to be 1 because the highest p-value for each pair is 1 in this case. The relationships corresponding to the hypotheses with a p-value of 1 are the candidates for the true relationship.

Each pair of the 7 elements has 6 possible relationships. Since there are 21 pairs for the 7 elements, there are 126 possible relationships in total. Using the two thresholds set above, there are only 12 relationships satisfying the conditions, which are exactly the true relationships (1). Many of the 126 relationships satisfy only one condition, but not both. For example, the relationships Inline graphic in Inline graphic and Inline graphic in Inline graphic satisfy the counting number condition, but not the p-value condition; the relationships Inline graphic and Inline graphic in Inline graphic satisfy the p-value condition, but not the counting number condition. This shows that each of the two steps is an important condition for identifying the true relationships. In this case, we can recover all true relationships using the proposed method.
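The selection rule can be applied mechanically to Tables 6 and 7. The Python sketch below is an illustration under stated assumptions: the first two columns are treated as similar relationships (threshold 86) and the last four as prerequisite relationships (threshold 93), following the order of Table 3; a relationship passes when its counting number is strictly greater than the threshold and its p-value is at least 1. Columns are referred to by position (1 to 6) because the hypothesis symbols are not reproduced here; the function name is hypothetical.

```python
# Counting numbers (Table 6) and p-values (Table 7), columns in table order.
COUNTS = {
    "AB": (55, 45, 98, 57, 90, 55), "AC": (72, 28, 95, 77, 93, 35),
    "AD": (51, 49, 93, 58, 95, 54), "AE": (56, 44, 98, 58, 90, 54),
    "AF": (54, 46, 98, 56, 90, 56), "AG": (26, 74, 99, 27, 89, 85),
    "BC": (51, 49, 64, 87, 83, 66), "BD": (46, 54, 70, 76, 77, 77),
    "BE": (91, 9, 95, 96, 52, 57), "BF": (51, 49, 76, 75, 71, 78),
    "BG": (53, 47, 92, 61, 55, 92), "CD": (35, 65, 76, 59, 94, 71),
    "CE": (50, 50, 86, 64, 84, 66), "CF": (72, 28, 98, 74, 72, 56),
    "CG": (42, 58, 98, 44, 72, 86), "DE": (43, 57, 74, 69, 79, 78),
    "DF": (37, 63, 72, 65, 81, 82), "DG": (37, 63, 87, 50, 66, 97),
    "EF": (50, 50, 76, 74, 72, 78), "EG": (52, 48, 92, 60, 56, 92),
    "FG": (66, 34, 98, 68, 48, 86),
}
PVALUES = {
    "AB": (0.8207, 0.0065, 1, 0, 0.1358, 0),
    "AC": (1, 0, 1, 0.0095, 0.0680, 0),
    "AD": (0.1330, 0.1511, 0.7931, 0, 1, 0),
    "AE": (0.9228, 0.0046, 1, 0, 0.1201, 0),
    "AF": (0.7200, 0.0092, 1, 0, 0.1527, 0),
    "AG": (0, 0.6534, 1, 1, 0.7599, 0.3630),
    "BC": (0.9149, 0.4654, 0.0060, 1, 0.7636, 0.0152),
    "BD": (0.4237, 1, 0.6569, 0.6504, 0.9855, 0.9855),
    "BE": (1, 0, 0.8094, 1, 0, 0),
    "BF": (0.9994, 0.8352, 0.9990, 0.9065, 0.9118, 0.9189),
    "BG": (0.4376, 0.1168, 1, 0, 0, 0.8556),
    "CD": (0.0004, 1, 0.0268, 0, 1, 0.0025),
    "CE": (0.7452, 0.6280, 1, 0.0058, 0.9122, 0.0165),
    "CF": (1, 0, 1, 0.0005, 0.0014, 0),
    "CG": (0.0724, 0.1689, 1, 0, 0.0070, 0.3746),
    "DE": (0.1613, 1, 0.4171, 0.4297, 1, 0.9031),
    "DF": (0.0093, 1, 0.1226, 0.1554, 0.8975, 1),
    "DG": (0.0003, 1, 0.0377, 0, 0.0001, 1),
    "EF": (0.9367, 0.9716, 0.9702, 0.7834, 0.9965, 0.9993),
    "EG": (0.3662, 0.1523, 1, 0, 0, 0.9043),
    "FG": (1, 0, 1, 0.0001, 0, 0.0110),
}


def two_step_select(counts, pvalues, t_similar=86, t_prereq=93, p_min=1.0):
    """Return the column positions (1..6) passing both criteria."""
    chosen = []
    for k, (c, p) in enumerate(zip(counts, pvalues), start=1):
        threshold = t_similar if k <= 2 else t_prereq
        if c > threshold and p >= p_min:
            chosen.append(k)
    return chosen


selected = {}
for pair in COUNTS:
    cols = two_step_select(COUNTS[pair], PVALUES[pair])
    if cols:
        selected[pair] = cols

for pair, cols in selected.items():
    print(pair, cols)
```

Applied this way, the rule selects at least one column for twelve of the 21 pairs. For the pair (B, E), both the first and the fourth column pass, which illustrates that more than one relationship can satisfy both criteria for a single pair.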

Next, we apply the algorithm of Li and Lu [29] to the simulated data. Since this algorithm does not provide a specific threshold selection method, we try different thresholds and find that the best case recovers 11 relationships. In this case, one relationship Inline graphic is misjudged to be Inline graphic.

In order to compare the counting method with Sahoo et al. [27], we apply their method to this simulated data. Only two relationships, Inline graphic and Inline graphic, are recovered by their method, with “Inline graphic” values 4.05 and 3.18, respectively. The “Inline graphic” values for the other pairs of elements are smaller than 3, so the other true relationships are not identified. This simulation study shows that, for the case with measurement error, the method of [27] is less efficient and accurate in recovering the true relationships than the counting method.

Besides the example with 7 elements, a more comprehensive example with a larger network (Figure S1) that shows the superiority of the proposed method is given in the supplementary material.

Yeast expression data

We revisit the MAPK pathway example from the Introduction. The datasets used in analyzing the MAPK pathway are drawn from 57 experiments of Spellman et al. [31] and 26 experiments of Zhu et al. [32]; excluding two experiments with missing values leaves 81. The datasets from Spellman et al. [31] include 18 from the alpha factor experiments, 14 from the elutriation experiments and 24 from the cdc15 experiments. The datasets from Zhu et al. [32] include 25 from the Forkhead experiments. The raw data can be downloaded from the Stanford Microarray Database [33]. We adopt the values in the Log(base2) column of the raw dataset, which are log ratios of red to green signal, to reconstruct the MAPK pathway.

A gene state is regarded as on state or off state when the log ratio value of red to green signal is greater than or less than 0, respectively. The gene expression data for the 81 experimental data (Table S1) are given in the supplementary material.
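The binarization rule can be sketched in Python as follows. The input values are made up for illustration, and two details are assumptions rather than statements from the text: a log ratio of exactly 0 is mapped to the off state, and an experiment is dropped when any of its values is missing.

```python
def binarize(experiments):
    """Map log2(red/green) ratios to on (1) / off (0) states.

    `experiments` is a list of rows, one per experiment, each holding one
    log ratio per gene (or None for a missing value).
    """
    states = []
    for row in experiments:
        if any(v is None for v in row):
            continue  # assumption: drop experiments with missing values
        # assumption: a ratio of exactly 0 is treated as off
        states.append([1 if v > 0 else 0 for v in row])
    return states


# Hypothetical log ratios for two genes in three experiments.
example = [[0.5, -1.2], [None, 0.3], [0.0, 2.0]]
print(binarize(example))  # [[1, 0], [0, 1]]
```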

In this study, we apply the two-step approach to explore the expression profiles, and show exploratory results on the pathway. The results are also compared with the method of Li and Lu [29] and the method of Sahoo et al. [27].

We apply the proposed method to the yeast cell cycle data [31], [32]. In our analysis, we assume that the level Inline graphic. According to the threshold selection formulas (2), the thresholds for the similar and prerequisite relationships are 61 and 69, respectively. The threshold we selected for the asymptotic p-value is 1.

According to the network structure reconstructed using our proposed approach, we can see that Wsc2p and Mid2p activate Rho1p, Pkc1p and Bck1p, which results in activation of the downstream part of the MAPK cascade, Mkk1p and Mlp1p. Activated Wsc2p also interacts with Mid2p. The functions of the genes Swi4p, Swi6p and Rlm1p in the downstream part of the network are not significant in our approach.

The reconstruction results of the DAB network using the two-step approach and the method of Li and Lu [29] are illustrated in Figure 3(a) and Figure 3(b), respectively. In addition, we also applied the method of Sahoo et al. [27] to the real yeast data. The results show that no pair relationships are detected by the method of Sahoo et al. [27], because all “Inline graphic” values are smaller than 3 for any two elements. Therefore, compared with the methods in Li and Lu [29] and Sahoo et al. [27], our proposed method is more useful for finding the cascade relationship.

Figure 3. Some pairwise relationships identified by the two-step counting approach (a), and the Li and Lu method (b) using the expression data of yeast Saccharomyces cerevisiae.

Discussion

For the implementation of the network reconstruction algorithm, the greatest complexity lies in the computation of the p-value for every pair of elements. The number of pairs is Inline graphic where Inline graphic is the number of elements. Therefore, the time complexity for the proposed approach is Inline graphic, showing that the proposed method is capable of handling thousands of genes simultaneously.

This study mainly focuses on reconstructing pathways from gene expression. Although pathway reconstruction methods based on gene expression have been widely discussed in the literature, these methods have a limitation. A biological pathway comprises more than genetic interactions alone. Long chains of events may happen at the protein level (e.g. (de)activation by phosphorylation), which are not necessarily regulated via gene expression. Therefore, these gene expression methods can be expected to reconstruct pathways that are regulated via gene expression, but not other biological interactions.

In summary, we propose a two-step approach to test the biological pathways from noisy array data. This new method has the advantages of easy calculation for the counting numbers and simple closed forms of the p-value. From the simulation results, we can see that this method can precisely estimate the true relationships for most of the situations. Compared with the other existing methods, it can provide a more accurate and efficient alternative approach for reconstructing the biological network.

Materials and Methods

Appendix A: Threshold Selection

(i) Suppose the misclassification probability is Inline graphic. For a similar relationship such as the case Inline graphic, we have

graphic file with name pone.0020074.e234.jpg (3)

With misclassification error, the counting number corresponding to the relationship is

graphic file with name pone.0020074.e235.jpg

By (3), the last equation is equal to Inline graphic, which is the mean of the counting number if this similar relationship holds. Since we cannot expect that the counting number is always equal to the mean, we look for a lower bound of the counting number as a threshold. From the viewpoint of constructing a confidence interval, if Inline graphic is unknown, a Inline graphic upper bound of Inline graphic is Inline graphic, where Inline graphic is an estimator of Inline graphic and Inline graphic is the upper Inline graphic quantile of the standard normal distribution. The bound Inline graphic is an upper bound of Inline graphic. Then Inline graphic is a lower bound of Inline graphic. Here we replace Inline graphic by Inline graphic in the upper bound and suggest Inline graphic, where Inline graphic, as a threshold. We expect that the counting number is greater than the threshold if the similar relationship holds. Besides using the conventional confidence interval, we can also consider some improved intervals discussed in the literature [34]–[37].
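The one-sided upper confidence bound described in this paragraph can be illustrated numerically. The exact expression used in the paper is not reproduced in this copy, so the sketch below assumes the conventional Wald form: the estimate plus the upper normal quantile times the estimated standard error. The function name `wald_upper_bound` and the inputs (an estimated misclassification probability of 0.1 from 100 samples) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wald_upper_bound(p_hat, n, alpha=0.05):
    """Conventional (Wald) one-sided upper bound for a proportion.

    Assumption: p_hat + z_alpha * sqrt(p_hat * (1 - p_hat) / n), where z_alpha
    is the upper-alpha quantile of the standard normal distribution.
    This may differ from the exact expression in the paper.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    return p_hat + z_alpha * sqrt(p_hat * (1 - p_hat) / n)


# Hypothetical inputs: estimated misclassification probability 0.1, 100 samples.
bound = wald_upper_bound(0.1, 100, alpha=0.05)
```

Substituting such an upper bound for the unknown probability lowers the expected counting number, which is how a conservative threshold arises. The improved intervals cited in [34]–[37] could be swapped in for the Wald form.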

(ii) Assume for two elements Inline graphic and Inline graphic, a prerequisite relationship holds. In this case, we have

graphic file with name pone.0020074.e255.jpg (4)

With misclassification errors, the counting number corresponding to the relationship is

graphic file with name pone.0020074.e256.jpg (5)

By (4), (5) is equal to

graphic file with name pone.0020074.e257.jpg (6)

By a similar argument as in (i), we suggest Inline graphic as a threshold for the prerequisite relationship.

Appendix B: Computational details

The methods for testing the 6 hypotheses in Table 3 are listed as follows.

(i) For deriving the p-value of the test:

graphic file with name pone.0020074.e259.jpg

we can consider the following two different situations. Note that the condition Inline graphic in hypothesis Inline graphic is equivalent to Inline graphic because Inline graphic and Inline graphic.

(I) The misclassification probability Inline graphic is zero.

The statistic

graphic file with name pone.0020074.e266.jpg (7)

has an asymptotic standard normal distribution under the null hypothesis Inline graphic.

(II) The misclassification probability Inline graphic is greater than zero.

In this case, the mean and the variance of the random variable Inline graphic are

graphic file with name pone.0020074.e270.jpg (8)

and

graphic file with name pone.0020074.e271.jpg (9)

under the null hypothesis.

Consequently, the asymptotic p-value is

graphic file with name pone.0020074.e272.jpg

which can be rewritten as (3).

(ii) For deriving the p-value of the test:

graphic file with name pone.0020074.e273.jpg

by an argument similar to (i), for Inline graphic, the asymptotic p-value is

graphic file with name pone.0020074.e275.jpg (10)

(iii) For deriving the p-value of the test:

graphic file with name pone.0020074.e276.jpg

consider the case of Inline graphic.

Under the null hypothesis, the mean and variance of the statistic Inline graphic are

graphic file with name pone.0020074.e279.jpg (11)

and

graphic file with name pone.0020074.e280.jpg (12)

The asymptotic p-value is

graphic file with name pone.0020074.e281.jpg (13)

(iv) For deriving the p-value of the test:

graphic file with name pone.0020074.e282.jpg

by an argument similar to (iii), the asymptotic p-value is

graphic file with name pone.0020074.e283.jpg (14)

(v) For deriving the p-value of the test:

graphic file with name pone.0020074.e284.jpg

by an argument similar to (iii), the asymptotic p-value is

graphic file with name pone.0020074.e285.jpg (15)

(vi) For deriving the p-value of the test:

graphic file with name pone.0020074.e286.jpg

by an argument similar to (iii), the asymptotic p-value is

graphic file with name pone.0020074.e287.jpg (16)

The estimators Inline graphic of Inline graphic in the above formulas of asymptotic p-values are given in Appendix C.
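Each of the tests above refers a statistic, centered and scaled by its null mean and variance, to the standard normal distribution. The following sketch shows only that generic step. The closed-form means and variances in (8) to (16) are not reproduced here, and the tail that defines each p-value depends on the hypothesis, so both tails are returned. The function name and the numeric inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def asymptotic_p_values(observed, null_mean, null_var):
    """Standardize a statistic and return both one-sided normal tail areas.

    Assumption: the statistic is asymptotically normal under the null
    hypothesis with the supplied mean and variance. Which tail is the
    relevant p-value depends on the hypothesis being tested.
    """
    z = (observed - null_mean) / sqrt(null_var)
    lower_tail = NormalDist().cdf(z)
    return {"z": z, "lower": lower_tail, "upper": 1 - lower_tail}


# Hypothetical values: observed count 12, null mean 10, null variance 4.
result = asymptotic_p_values(observed=12, null_mean=10.0, null_var=4.0)
```

In the two-step procedure, a relationship is retained when the p-value of its hypothesis is large, so this step serves as a filter rather than a conventional rejection test.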

Appendix C: Frequency estimation

If the misclassification probability Inline graphic is known, the methods for estimating the probabilities Inline graphic, Inline graphic, Inline graphic and Inline graphic are listed as follows.

According to Table 5, we have

graphic file with name pone.0020074.e295.jpg
graphic file with name pone.0020074.e296.jpg
graphic file with name pone.0020074.e297.jpg
graphic file with name pone.0020074.e298.jpg

Note that Inline graphic and Inline graphic. By solving the above equations, we have

graphic file with name pone.0020074.e301.jpg
graphic file with name pone.0020074.e302.jpg
graphic file with name pone.0020074.e303.jpg
graphic file with name pone.0020074.e304.jpg

The derived values are used as estimators for Inline graphic.
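The idea of solving for the true cell probabilities from the observed ones can be sketched under an explicit assumption, since Table 5 is not reproduced in this section: each of the two binary states is assumed to be misclassified independently with the same known probability p. Under that assumption the observed 2x2 probabilities are a linear transformation of the true ones, and the transformation can be inverted whenever p differs from 0.5. All names below are hypothetical.

```python
def observed_from_true(true_probs, p):
    """Forward model (assumed): each state flips independently with probability p."""
    flip = [[1 - p, p], [p, 1 - p]]  # flip[true_state][observed_state]
    obs = [[0.0, 0.0], [0.0, 0.0]]
    for a in (0, 1):
        for b in (0, 1):
            for x in (0, 1):
                for y in (0, 1):
                    obs[x][y] += true_probs[a][b] * flip[a][x] * flip[b][y]
    return obs


def true_from_observed(obs_probs, p):
    """Invert the forward model to recover the true 2x2 cell probabilities."""
    if p == 0.5:
        raise ValueError("p = 0.5 makes the flip matrix singular")
    d = 1 - 2 * p
    # Inverse of [[1 - p, p], [p, 1 - p]].
    inv = [[(1 - p) / d, -p / d], [-p / d, (1 - p) / d]]
    true = [[0.0, 0.0], [0.0, 0.0]]
    for a in (0, 1):
        for b in (0, 1):
            for x in (0, 1):
                for y in (0, 1):
                    true[a][b] += obs_probs[x][y] * inv[x][a] * inv[y][b]
    return true


# Round trip on hypothetical true probabilities with p = 0.1.
true_probs = [[0.4, 0.1], [0.2, 0.3]]
obs_probs = observed_from_true(true_probs, 0.1)
recovered = true_from_observed(obs_probs, 0.1)
```

With observed sample frequencies in place of exact probabilities, the inverted values can fall slightly outside [0, 1], so an implementation would typically need to truncate or otherwise constrain them.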

Appendix D: Misclassification probability estimation

If Inline graphic is unknown, we can apply the maximum likelihood approach to estimate Inline graphic. By Table 5, we can rewrite the multinomial model for the observations Inline graphic in terms of Inline graphic and other parameters. The maximum likelihood approach for deriving the maximum likelihood estimator of Inline graphic is based on the likelihood function

graphic file with name pone.0020074.e311.jpg (17)

where Inline graphic. This involves Inline graphic and other parameters, Inline graphic, given in Appendix C. The maximum likelihood approach is to find the estimators of Inline graphic and Inline graphic that maximize the likelihood function (17) [29].

Supporting Information

Figure S1

An example of Boolean network with 10 elements.

(DOCX)

Table S1

The 81 experimental yeast expression data.

(DOCX)

Footnotes

Competing Interests: The authors have declared that no competing interests exist.

Funding: National Science Council (NSC 99-2118-M-009-001-MY2, NSC 98-2118-M-009-004-MY3), National Center for Theoretical Sciences, and Center of Mathematical Modeling and Scientific Computing. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Roberts CJ, Nelson B, Marton MJ, Stoughton R, Meyer MR, et al. Signaling and circuitry of multiple MAPK pathways revealed by a matrix of global gene expression profiles. Science. 2000;287:873–880. doi: 10.1126/science.287.5454.873.
  • 2. Buehrer BM, Errede B. Coordination of the mating and cell integrity mitogen-activated protein kinase pathways in Saccharomyces cerevisiae. Molecular and Cellular Biology. 1997;17:6517–6525. doi: 10.1128/mcb.17.11.6517.
  • 3. O'Rourke SM, Herskowitz I. The hog1 MAPK prevents cross talk between the hog and pheromone response MAPK pathways in Saccharomyces cerevisiae. Genes & Development. 1998;12:2874–2886. doi: 10.1101/gad.12.18.2874.
  • 4. Madhani HD, Fink GR. Combinatorial control required for the specificity of yeast MAPK signaling. Science. 1997;275:1314–1317. doi: 10.1126/science.275.5304.1314.
  • 5. Posas F, Saito H. Osmotic activation of the HOG MAPK pathway via ste11p MAPKKK: scaffold role of pbs2p MAPKK. Science. 1997;276:1702–1705. doi: 10.1126/science.276.5319.1702.
  • 6. Jensen FV. Bayesian Networks and Decision Graphs. New York: Springer; 2001.
  • 7. Jensen ST, Chen G, Stoeckert CJ Jr. Bayesian variable selection and data integration for biological regulatory networks. Annals of Applied Statistics. 2007;1:612–633.
  • 8. Pearl J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo: Morgan Kaufmann; 1988.
  • 9. Wei Z, Li H. A hidden spatial-temporal markov random field model for network-based analysis of time course gene expression data. Annals of Applied Statistics. 2008;2:408–429.
  • 10. Allocco DJ, Kohane IS, Butte AJ. Quantifying the relationship between co-expression, coregulation and gene function. BMC Bioinformatics. 2004;5:18. doi: 10.1186/1471-2105-5-18.
  • 11. D'haeseleer P, Wen X, Fuhrman S, Somogyi R. Linear modeling of mRNA expression levels during CNS development and injury. Pacific Symposium on Biocomputing. 1999;4:41–52. doi: 10.1142/9789814447300_0005.
  • 12. van Someren EP, Wessels LFA, Reinders MJT. Linear modeling of genetic networks from experimental data. In: Proceedings 8th International Conference on Intelligent Systems for Molecular Biology. volume 8, 2000;355–366.
  • 13. Chen T, He HL, Church GM. Modeling gene expression with differential equations. Pacific Symposium on Biocomputing. 1999;4:29–40.
  • 14. Weaver DC, Workman CT, Stormo GD. Modeling regulatory networks with weight matrices. Pacific Symposium on Biocomputing. 1999;4:112–123. doi: 10.1142/9789814447300_0011.
  • 15. Liu B, De La Fuente A, Hoeschele I. Gene network inference via structural equation modeling in genetical genomics experiments. Genetics. 2008;178:1763–1776. doi: 10.1534/genetics.107.080069.
  • 16. Xing EP, Karp RM. CLIFF: Clustering of high-dimensional microarray data via iterative feature filtering using normalized cuts. Bioinformatics. 2001;17:306–315. doi: 10.1093/bioinformatics/17.suppl_1.s306.
  • 17. Kauffman SA. Gene regulation networks: A theory for their global structure and behaviors. Current Topics in Developmental Biology. 1977;6:145–182. doi: 10.1016/s0070-2153(08)60640-7.
  • 18. Kauffman SA. Assessing the probable regulatory structures and dynamics of the metazoan genome. In: Thomas R, editor. Kinetic Logic: A Boolean Approach to the Analysis of Complex Regulatory Systems. Vol. 29. Berlin: Lecture Notes in Biomathematics. Springer-Verlag; 1979. pp. 30–60.
  • 19. Liang S, Fuhrman S, Somogyi R. REVEAL, a general reverse engineering algorithm for inference of genetic network architectures. Pacific Symposium on Biocomputing. 1998;3:18–29.
  • 20. Akutsu T, Miyano S. Identification of genetic networks from a small number of gene expression patterns under the boolean network model. Pacific Symposium on Biocomputing; 1999. pp. 17–28.
  • 21. Shmulevich I, Dougherty ER, Kim S, Zhang W. Probabilistic boolean networks: a rule-based uncertainty model for gene regulatory networks. Bioinformatics. 2002;18:261–274. doi: 10.1093/bioinformatics/18.2.261.
  • 22. Kim H, Lee JK, Park T. Boolean networks using the chi-square test for inferring large-scale gene regulatory networks. BMC Bioinformatics. 2007;8:37. doi: 10.1186/1471-2105-8-37.
  • 23. Markowetz F, Kostka D, Troyanskaya OG, Spang R. Nested effects models for high-dimensional phenotyping screens. Bioinformatics. 2007;23:i305–i312. doi: 10.1093/bioinformatics/btm178.
  • 24. Li P, Zhang C, Perkins E, Gong P, Deng Y. Comparison of probabilistic Boolean network and dynamic Bayesian network approaches for inferring gene regulatory networks. BMC Bioinformatics. 2007;8:S13. doi: 10.1186/1471-2105-8-S7-S13.
  • 25. Ay F, Xu F, Kahveci T. Scalable steady state analysis of Boolean biological regulatory networks. PLOS One. 2009;4:e7992. doi: 10.1371/journal.pone.0007992.
  • 26. Davidich MI, Bornholdt S. Boolean network model predicts cell cycle sequence of fission yeast. PLOS One. 2008;3:e1672. doi: 10.1371/journal.pone.0001672.
  • 27. Sahoo D, Dill D, Gentles A, Tibshirani R, Plevritis S. Boolean implication networks derived from large scale, whole genome microarray datasets. Genome Biology. 2008;9:R157. doi: 10.1186/gb-2008-9-10-r157.
  • 28. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society Series B (Methodological). 1977;39:1–38.
  • 29. Li LM, Lu HH-S. Explore biological pathways from noisy array data by directed acyclic boolean networks. Journal of Computational Biology. 2005;12:170–185. doi: 10.1089/cmb.2005.12.170.
  • 30. Bickel PJ, Doksum KA. Mathematical statistics: basic ideas and selected topics. Prentice Hall. 2000;1.
  • 31. Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, et al. Comprehensive identification of cell cycle-regulated genes of the yeast saccharomyces cerevisiae by microarray hybridization. Molecular Biology of Cell. 1998;9:3273–3297. doi: 10.1091/mbc.9.12.3273.
  • 32. Zhu G, Spellman PT, Volpe T, Brown PO, Botstein D, et al. Two yeast forkhead genes regulate the cell cycle and pseudohyphal growth. Nature. 2000;406:90–94. doi: 10.1038/35017581.
  • 33. Hubble J, Demeter J, Jin H, Mao M, Nitzberg M, et al. Implementation of GenePattern within the Stanford Microarray Database. 2009. [ http://smd.stanford.edu/]
  • 34. Agresti A, Coull BA. Approximate is better than ‘exact’ for interval estimation of binomial proportions. The American Statistician. 1998;52:119–126.
  • 35. Wang H. Exact confidence coefficients of confidence intervals for a binomial proportion. Statistica Sinica. 2007;17:361–368.
  • 36. Wang H. Exact confidence coefficients of simultaneous confidence intervals for multinomial proportions. Journal of Multivariate Analysis. 2008;99:896–911.
  • 37. Wang H. Exact average coverage probabilities and confidence coefficients of confidence intervals for discrete distributions. Statistics and Computing. 2009;19:139–148.


