Skip to main content
Briefings in Bioinformatics logoLink to Briefings in Bioinformatics
. 2024 Aug 12;25(5):bbae396. doi: 10.1093/bib/bbae396

Reconstruction of gene regulatory networks for Caenorhabditis elegans using tree-shaped gene expression data

Yida Wu 1, Da Zhou 2, Jie Hu 3,
PMCID: PMC11318059  PMID: 39133097

Abstract

Constructing gene regulatory networks is a widely adopted approach for investigating gene regulation, offering diverse applications in biology and medicine. A great deal of research focuses on using time series data or single-cell RNA-sequencing data to infer gene regulatory networks. However, such gene expression data lack either cellular or temporal information. Fortunately, the advent of time-lapse confocal laser microscopy enables biologists to obtain tree-shaped gene expression data of Caenorhabditis elegans, achieving both cellular and temporal resolution. Although such tree-shaped data provide abundant knowledge, they pose challenges like non-pairwise time series, laying the inaccuracy of downstream analysis. To address this issue, a comprehensive framework for data integration and a novel Bayesian approach based on Boolean network with time delay are proposed. The pre-screening process and Markov Chain Monte Carlo algorithm are applied to obtain the parameter estimates. Simulation studies show that our method outperforms existing Boolean network inference algorithms. Leveraging the proposed approach, gene regulatory networks for five subtrees are reconstructed based on the real tree-shaped datatsets of Caenorhabditis elegans, where some gene regulatory relationships confirmed in previous genetic studies are recovered. Also, heterogeneity of regulatory relationships in different cell lineage subtrees is detected. Furthermore, the exploration of potential gene regulatory relationships that bear importance in human diseases is undertaken. All source code is available at the GitHub repository https://github.com/edawu11/BBTD.git.

Keywords: gene regulatory network, tree-shaped gene expression data, data integration, Boolean network, Bayesian statistics

Introduction

Gene regulatory network (GRN) is a collection of molecular species and their interactions, together controlling gene-product (RNA and proteins) abundance [1]. Constructing GRNs enables biologists to shed light on the biological processes of an organism from a holistic perspective [2, 3]. There are two main types of gene expression data used for GRN construction. First, time series data, produced by DNA microarray or next-generation RNA-sequencing, provide an opportunity for scientists to investigate GRN by leveraging the temporal pattern [4]. However, such data obscure biological signals in the gene expression profiles on cellular levels, which is insufficient to explore gene regulation for specific cell types [5]. Second, single-cell RNA-sequencing data offer a profound insight into GRNs at cellular levels, but these data lack temporal information of genes, which is crucial to reflect the genes’ dynamic change within cells [5].

With the development of modern high throughput experiment techniques, there is a chance to construct GRNs by leveraging the gene expression data including both cellular and temporal information. [6] and [7] described an automated system for analyzing successive gene expression profiles in Caenorhabditis elegans (C. elegans) with cellular resolution from the zygote until adult using time-lapse confocal laser microscopy. Specifically, a C. elegans individual is measured once per 1.5 min to concurrently report the fluorescence intensity of labeled gene within each living cell. Thus, this technology enables the generation of temporal data for each cell within a single worm, tracking from its birth through to division or death, all recorded within a single data file. Consequently, each data file documents the quantified fluorescent intensity of one specific gene in a single C. elegans individual and different data files correspond to different worms, which are provided by [8]. Also, each data file can be depicted as a binary tree, as shown in Fig. 1, clearly displaying parent–child cell relationships. Hence, we refer to this kind of cellular-temporal data as tree-shaped gene expression data. Furthermore, due to the invariant cell lineage from zygote to adult of C. elegans [9], which means all C. elegans individuals follow exactly the same path to develop from embryo into adult worm, we can summarize each data file into specific cell lineage subtrees. Following the characterization by [9], there are mainly five distinct cell lineage subtrees of C. elegans, which include all descendants of respective founder cells: ‘AB’, ‘C’, ‘D’, ‘E’, and ‘MS’. These five subtrees are denoted with the same names as their founder cells. Due to the absence of differentiation in the early stages and the limited duration of observation, only a small fraction of the descendants in each subtree manifest cell fates. However, the cell fates across different subtrees are markedly distinct. For example, the ‘AB’ subtree predominantly differentiates into hypodermis and neuronal cells, while the ‘E’ subtree mainly gives rise to intestinal cells. Therefore, it is essential to reconstruct the GRN for each subtree independently, as similarly demonstrated in [10].

Figure 1.

Figure 1

Comparison of two cell lineage subtrees; the figure shows a part of two cell lineage subtrees, which are, respectively, derived from two real data files: CD20060629_pha4_b2.csv and CD20081014_ref-1_2_L1.csv; both subtrees start from ‘Ea’ cell at time zero; the length of each vertical line corresponds to the lifetime of a single cell; each horizontal line represents an event of cell division; the color of lines corresponds to the fluorescent intensity of two labeled gene: pha-4 and ref-1; the cell names are from [9], which are signed on the right side of each vertical line.

Although such tree-shaped data provide valuable insights about cell lineage, they pose a challenge to downstream analysis. Due to the limitation of tracking only one labeled gene per individual, the lifetimes of the same cell in different data files may not align [8]. For instance, as shown in Fig. 1, the lifetime of ‘Eal’ within the left subtree is longer than the right one, resulting in non-pairwise data points measured as the measurement intervals are consistently 1.5 min. Thus, it is impractical to directly align the fluorescence intensity from different data files according to the absolute time. To solve the problem of non-pairwise time series, [10] resorted to averaging the fluorescence intensity of each cell to one averaged value which can be applied to perform downstream analysis. However, since the fluorescent intensity may vary greatly within a cell, this method would result in the loss of temporal information at cellular levels. In order to avoid this drawback, a framework of data integration including interpolation and binarization is proposed, which enables each cell from different data files to have identical and matched time points.

With the availability of cellular and temporal information, several methods of GRN reconstruction for C. elegans were proposed previously. [11] developed a mathematical model on wild-type data to predict gene regulatory relationships for C. elegans, but only focused on a very small number of ‘C’ cells. [10] employed a probabilistic graphical model to reconstruct GRNs. While their analysis centered around five founder cells and their descendants, it is noteworthy that their approach was dependent on known protein–protein interaction, protein–DNA interaction, and gene knockout data. Hence, there is no research of GRN reconstruction focusing on distinct cell lineage subtrees for C. elegans by leveraging tree-shaped data exclusively. Due to the appealing characteristics of dynamic complexity and robustness to noisy data [2, 12, 13], employing Boolean networks for reconstruction of GRNs is a proper strategy to handle such high-resolution but rather noisy datasets. REVEAL (REV) [14] and Best-Fit Extension (BFE) [15] are two widely recognized algorithms for inferring Boolean networks. Despite that, the former method primarily employs a deterministic Boolean network model, which has limited capacity to accommodate biological uncertainties, and the latter focuses on local optimization, rather than optimizing the entire network jointly. Recently, ATEN [16] has been developed to infer Boolean network topology and dynamics from noisy time series data. However, its reliance on deterministic Boolean networks limits its robustness to uncertainty. [17] proposed a probabilistic Boolean network on the protein interaction network of yeast cell cycle. This stochastic model, which is stable and robust when considering a wide range of noise, can be used as a tool to study interactions among genes [18–22]. However, this work has some certain limitations. First, gene regulation was modeled by fixing the network’s time delay as one time unit, thereby oversimplifying the underlying dynamics of gene regulation [23, 24]. Second, they did not include the development of a parameter inference algorithm for the model. Therefore, it is necessary to address the problem within the stochastic model proposed by [17] and develop a new method to reconstruct GRNs.

In the following contents, a framework of data integration is proposed to transform tree-shaped data with non-pairwise time series into pairwise data for five distinct cell lineage subtrees. Next, a novel Bayesian approach based on Boolean network with time delay (BBTD) is developed. Especially, time delay is introduced as a discrete parameter in our model, allowing for a more realistic representation of gene regulatory processes [25–27]. Finally, BBTD is applied to reconstruct GRNs for each subtree based on both synthesized and real data.

Materials and methods

Data integration

The tree-shaped gene expression dataset for C. elegans provided by [8] is downloaded from http://epic.gs.washington.edu/. There are 184 files within the dataset, corresponding to 105 genes, and the repeated files of one specific gene are called copies. In order to reduce impacts of noise fluctuation, only the gene expression onset cells, which are provided by [28], with their descendants are considered in each file for data integration. Besides, since the fluorescent intensity associated with gene expression tends to remain stable or increase over time, rarely exhibiting a decrease [28], which can not reflect the real expression state of change, the first-order difference of fluorescence intensity is computed, defined as expression rate. As mentioned in Section 2, this tree-shaped dataset presents a primary challenge for application since the lifetimes of one cell may vary across embryos. Hence, it is not feasible to directly align the time series data from different files based on the absolute time. To address this, as depicted in Fig. 2, a data integration framework to align time series is proposed for each cell lineage subtree as follows:

Figure 2.

Figure 2

A four-step process of data integration for each cell lineage subtree; Step 1: normalize the lifetime of each cell to a standardized unit; Step 2: compute the first-order difference of the raw data and interpolate them for each cell to the same number; Step 3: discretize the expression rates into binary values; Step 4: merge the multiple copies of the same gene.

  • Step 1: normalize the lifetime of each cell to a standardized unit.

  • Step 2: compute expression rates and interpolate the same number of points to each cell using FMM spline method [29]. The number of interpolations for each subtree is the median of the number of points per cell, which are shown in Table 1. For convenience, all data points are referred to as interpolations after Step 2.

  • Step 3: discretize the expression rates into binary values, where Inline graphic indicates high expression rate and Inline graphic denotes low expression rate. The threshold is set as the median values of expression rates within each subtree.

  • Step 4: merge the multiple copies of the same gene. At each gene expression time point, if more than half of the gene copies exhibit State Inline graphic, the gene state is classified as Inline graphic. Conversely, the gene state is categorized as Inline graphic. If all copies of one gene are missing at some gene expression time points, then the gene states at that time are set to missing values.

Table 1.

Summary of five cell lineage subtree datasets; the ‘Gene’ column shows the number of candidate genes and the ‘Cell’ column shows the number of candidate cells; the ‘Inters’ column displays the number of interpolations per cell for analysis.

Subtree Gene Cell Inters
AB 27 112 24
C 30 51 20
D 20 12 28
E 40 26 16
MS 29 80 22

After the integration of raw tree-shaped data for each subtree, it is observed that missing values persist within each subtree dataset (see Supplementary Figure S1). To alleviate the impacts of missing values, candidate cells and genes are selected for the subsequent analysis, which are displayed in Supplementary Table S1 and Table S2. Then, the remaining missing values (Inline graphic in each subtree, see Supplementary Figure S2) are treated as low expression rates, i.e. State 0. The details of selection are shown in Supplementary I. Table 1 provides a summary of five subtree datasets used for network inference.

BBTD method

To effectively capture the dynamic nature of gene regulation from pairwise time series subtree dataset, a new method, i.e. BBTD, is proposed. This method comprises a probabilistic Boolean network model with time delay, a pre-screening process, and a Bayesian inference framework. The details of BBTD are discussed below.

Statistical modeling

To begin with, we consider a network evolving in the configuration space Inline graphic with Inline graphic being the number of genes, where Inline graphic denotes active state (high expression rate) and Inline graphic indicates inactive state (low expression rate). Then, an Inline graphic-dimensional square matrix Inline graphic is utilized to represent the GRN which includes Inline graphic genes, where Inline graphic. Specifically, Inline graphic means a positive regulation acting from regulator gene Inline graphic to regulated gene Inline graphic, which indicates that when gene Inline graphic reaches an active state, gene Inline graphic will receive a positive signal and is more likely to be activated after some time. Inline graphic means when gene Inline graphic reaches an active state, gene Inline graphic will receive a negative sign and is more likely to be suppressed after some time. Inline graphic indicates that there is no regulation emitted from gene Inline graphic to gene Inline graphic. Since the binarization ignores the continuous and often subtle variations in gene expression levels that are biologically important, especially in the context of self-regulation where feedback mechanisms can be sensitive to small changes in expression levels [30], self-regulation of genes is not considered in our study. Hence, as in [31–33], we do not consider self-regulation in this case and set Inline graphic. Figure 3a provides an illustrative example of five-gene GRN. According to the definition of matrix Inline graphic above, the GRN can be represented by a five-dimensional square matrix as shown in Fig. 3b.

Figure 3.

Figure 3

An example of the probabilistic Boolean network model with time delay; (a) depicts a 5-gene GRN; each node corresponds to a gene, and the regulatory relationships are illustrated by solid deep green arrows (indicating positive regulations) and dashed orange arrows (indicating negative regulations); (b) presents the matrix Inline graphic corresponding to the GRN representation in (a); a deep green grid signifies Inline graphic, an orange grid represents Inline graphic, and a white grid represents Inline graphic; (c) demonstrates the state transition process of the 5-gene GRN; the gray circle represents an active state of the gene, while the white circle represents an inactive state; specifically, (c) illustrates the regulatory mechanism governing the gene states at time Inline graphic with different time delay as an example.

Furthermore, it is crucial to account for time delay between the regulator gene and the regulated gene in our analysis. To address this problem, a time delay matrix is denoted as Inline graphic. When Inline graphic, Inline graphic represents the time delay when gene Inline graphic activates or suppresses gene Inline graphic, where Inline graphic denotes the time unit between two successive interpolations. This is exemplified in Fig. 3c, where gene 1 suppresses gene 3 with one unit of time delay, while gene 3 activates gene 4 with two units of time delay. Thus, the joint probability of state transition is defined as follows:

graphic file with name DmEquation1.gif (1)

where Inline graphic denotes the expression state of gene Inline graphic at time Inline graphic. Inline graphic denotes the number of root cells, i.e. cells that do not have a parent among the candidate cells. Inline graphic indicates the total number of interpolations for the Inline graphic-th root cell as well as its daughter cells. Since the max time delay as defined is Inline graphic, Inline graphic begins at Inline graphic for Inline graphic. The equality in Equation (1) is true for the reason of conditional independence, that is, when given all of the regulator genes’ states of the regulated gene Inline graphic, the state of gene Inline graphic is independent of those of the other genes. Then, we denote Inline graphic being the input of all signals received of gene Inline graphic at time Inline graphic and the conditional probability of Inline graphic is given as follows: If Inline graphic,

graphic file with name DmEquation2.gif (2)

and Inline graphic,

graphic file with name DmEquation3.gif (3)

Note that the positive parameter Inline graphic in Equation (2) controls the probability for gene Inline graphic to maintain its state when the input to gene Inline graphic is zero [17]. And the positive temperature-like parameter Inline graphic in Equation (3) represents noise in the system from the perspective of statistical physics [17, 34, 35].

Pre-screening process

In the proposed Boolean network model, the unknown parameters of Inline graphic is Inline graphic. As Inline graphic increases, the number of Inline graphic to be estimated increases at a polynomial level. To improve the efficiency of network inference, we design a pre-screening process based on Fisher’s exact test [36] to reduce the number of unknown parameters.

Fisher’s exact test is a hypothesis test for testing significant differences in contingency tables. As introduced in Section 2, expression state Inline graphic is a binary variable. Let Inline graphic be the vector of all available expression states of gene Inline graphic, where Inline graphic. Then, a contingency table is created to summarize the expression states of a regulated gene and a candidate regulator gene. Assuming that if the P -value of Fisher’s exact test is greater than or equal to a given threshold value Inline graphic, there is no existing regulatory relationship between these two genes. The scheme of the pre-screening process is displayed in Algorithm 1. After that, the Inline graphic whose estimated value is Inline graphic no longer participates in subsequent calculations, thereby reducing the number of estimated parameters.

graphic file with name bbae396fx1.jpg

Bayesian inference

After implementing the pre-screening process, the unknown parameters are estimated under the Bayesian framework. In order to obtain the posterior of these parameters, the priors need to be specified. Initially, it is assumed that the regulated cases of each gene are independent of each other, and that the corresponding time delay is meaningful only if the regulatory relationship exists. Thus, the joint prior Inline graphic is given as follows:

graphic file with name DmEquation5.gif (4)

where

graphic file with name DmEquation6.gif (5)
graphic file with name DmEquation7.gif (6)

In Equation (5), Inline graphic indicates the Inline graphic-th row of Inline graphic, and Inline graphic means the number of candidate regulatory relationships whose regulated gene is gene Inline graphic. According to [37], GRN is considered sparse. Then, two hyperparameters: Inline graphic and Inline graphic are introduced to control the sparsity. Inline graphic is the normalizing constant of Inline graphic. In Equation (6), when Inline graphic, Inline graphic is assigned a non-informative prior, i.e. uniform discrete distribution, defined as

graphic file with name DmEquation8.gif (7)

When Inline graphic, the corresponding Inline graphic is redundant and should be fixed to Inline graphic. However, fixing Inline graphic leads to a changing dimension problem, rendering the sampling algorithm unaccessible. To address this, a spike-and-slab prior is implemented for Inline graphic when Inline graphic, ensuring consistent dimensionality throughout iterations [38]. This prior is given as follows:

graphic file with name DmEquation9.gif (8)

Then, for two positive parameters Inline graphic and Inline graphic in the model as well as two hyperparameters Inline graphic and Inline graphic, their priors are supposed as follows:

graphic file with name DmEquation10.gif

Finally, the full joint posterior can be written as follows:

graphic file with name DmEquation11.gif (9)

where Inline graphic represents the binary pairwise time series subtree dataset.

After getting the posterior, Markov Chain Monte Carlo (MCMC) is performed to draw samples from Equation (9). Specifically, Gibbs Sampling [39] and Metropolis–Hastings (M-H) algorithm [40, 41] are utilized to draw samples of unknown parameters. For discrete parameters Inline graphic, their samples are drawn by Gibbs Sampling. The key step for Gibbs Sampling is to obtain the conditional posterior of the parameters. For each Inline graphic, their conditional posterior is given as follows:

graphic file with name DmEquation12.gif (10)

where ‘−’ means given all other variables. For continuous parameters Inline graphic, Inline graphic, Inline graphic, and Inline graphic, the M-H algorithm is used to draw samples. The complete scheme of MCMC algorithm is shown in Algorithm 2. When sampling is completed, a burn-in phase is conducted by discarding the first half of the samples to ensure convergence to the posterior. Finally, the Maximum-A-Posteriori estimate Inline graphic is calculated based on the remaining samples.

graphic file with name bbae396fx2.jpg

Results

Simulation studies

The synthesized tree-shaped datasets are generated from the probabilistic Boolean network model in Equations (2) to (3) the number of whose genes and cells, and the number of interpolations per cell are consistent with the real datasets. The ground truth values of parameters are drawn from their priors. For each subtree, 50 groups of synthesized datasets are generated and BBTD is utilized to infer their gene regulatory relationships.

To begin with, the pre-screening process (Algorithm 1) is applied to filter the discrete parameters Inline graphic. After the pre-screening process is completed, two indices are introduced to evaluate the performance of the pre-screening process, which are defined as follows:

  • Pre-screening Rate (PR) of Inline graphic,
    graphic file with name DmEquation14.gif (11)
  • False Negative Rate (FNR) of Inline graphic,
    graphic file with name DmEquation15.gif (12)

Next, MCMC algorithm (Algorithm 2) is performed with three independent MCMC chains for each group. Since the discrete parameters Inline graphic where we are interested are sparse, i.e. there are excessive zeros in Inline graphic and Inline graphic, five indices are used to evaluate the performance of simulation that are defined as bellow:

  • True Positive Rate (TPR) of Inline graphic,
    graphic file with name DmEquation16.gif (13)
  • Precise Positive Rate (PPR) of Inline graphic,
    graphic file with name DmEquation17.gif (14)
  • True Negative Rate (TNR) of Inline graphic,
    graphic file with name DmEquation18.gif (15)
  • Precise Negative Rate (PNR) of Inline graphic,
    graphic file with name DmEquation19.gif (16)
  • Accuracy (ACC) of Inline graphic,
    graphic file with name DmEquation20.gif (17)

Table 2 shows the performance of BBTD on five synthesized subtrees. Obviously, the pre-screening process can effectively screen around Inline graphic of Inline graphic for each subtree. Also, Inline graphic in all settings indicates that there is no loss of ground truth regulatory relationships after implementing the pre-screening process. This robustness is attributable to the utilization of Fisher’s exact test, a nonparametric hypothesis testing method, which is inherently robust against variations in data distribution. Moreover, despite the variability in data distributions, the threshold value Inline graphic within the pre-screening process can be consistently set without losing any ground truth regulatory relationships. This consistency in performance demonstrates that the pre-screening process is robust across different data distributions. Furthermore, the fourth to the eighth columns in Table 2 reveal only minor differences across different data distributions when the probabilistic Boolean network model is applied, in conjunction with the use of the MCMC algorithm to infer unknown parameters. Table 2 shows that most gene regulations (positive and negative) as well as time delay can be predicted accurately by the MCMC algorithm for distinct subtrees. In conclusion, our simulation results demonstrate that the BBTD method exhibits robustness across these data variations.

Table 2.

The simulation results for synthesized cell lineage subtrees; the second to the eighth columns summarize the denoted indices in Equation (11) to (17).

Subtree PR FNR TPR PPR TNR PNR ACC
AB 0.683 0 0.970 0.837 0.976 0.759 0.894
C 0.699 0 0.976 0.853 0.970 0.764 0.887
D 0.705 0 0.969 0.953 0.949 0.865 0.969
E 0.732 0 0.965 0.918 0.954 0.843 0.942
MS 0.696 0 0.973 0.823 0.970 0.724 0.878

In addition, the statistical model in BBTD is replaced with the probabilistic Boolean network model proposed by [17] for comparison which does not consider multi-time delay and this method is called BB. BBTD is also compared with other existing Boolean network inference algorithms mentioned in Section 2, i.e. REV, BFE, and ATEN. The details of algorithm settings are shown in Supplementary II. Since REV, BFE, and ATEN cannot distinguish positive and negative regulation relationships, for ease of comparison, we set Inline graphic if there is a regulation acting from regulator gene Inline graphic to regulated gene Inline graphic, otherwise Inline graphic. All experiments in this study are conducted on a server equipped with two CPUs (Inter (R) Xeon(R) Platinum 8368 with two threads * 38 cores, @ 2.40 GHz). The network inference results are displayed in Fig. 4. REV is not listed in this figure due to its poor capability to handle nondeterministic network models. Although BFE and ATEN have a better tolerance of noisy data compared with REV, they consider neither the parent–child cell relationships within the tree-shaped dataset nor the time delay of regulation, thus resulting in lower correct rates. Figure 4 shows that our method outperforms other algorithms in five synthesized subtrees. The information of calculation time and memory usage for our method, BBTD, alongside three other Boolean network methods: BB, BFE, and ATEN, is shown in Supplementary Table S3.

Figure 4.

Figure 4

Performance of synthesized data with difference methods: BBTD, BB, BFE, and ATEN; the figure shows the comparisons of average TPR and average PPR for five synthesized subtrees.

Real data analysis

In this section, BBTD is further conducted to reconstruct GRNs for five cell lineage subtrees of C. elegans based on the 4D confocal microscopy data. The MCMC algorithm is implemented with five independent MCMC chains for each subtree. The trace plot of the log-posterior probabilities for these five MCMC chains is displayed in Supplementary Figure S3. The inferred results are displayed in Fig. 5. Each sub-figure showcases the directed positive or negative regulatory relationships, along with their corresponding time delay, from regulator genes to regulated genes. Specifically, there are 31 pairs gene regulatory relationships in the ‘AB’ subtree, 34 pairs in the ‘C’ subtree, 8 pairs in the ‘D’ subtree, 12 pairs in the ‘E’ subtree, and 18 pairs in the ‘MS’ subtree.

Figure 5.

Figure 5

The inferred GRNs of the five cell lineage subtrees: ‘AB’, ‘C’, ‘D’, ‘E’, and ‘MS’; sub-figures (a) to (e) illustrate the GRNs of these subtrees in sequential order; each solid circle in the inferred GRN represents a gene; the presence of a solid (dashed) arrow indicates a positive (negative) regulatory relationship extending from regulator genes to regulated ones; the color of the arrow corresponds to the respective time delay associated with the regulatory relationship.

Identification of direct gene regulatory relationships

Our network structures reveal direct gene regulatory relationships within distinct subtrees, confirming several pairs of relationships established in prior genetic studies. Notably, in the ‘AB’ subtree, sma-9 activates F16B12.6, a relationship validated in [42]. Similarly, in the ‘MS’ subtree, glp-1 activates pha-4, as confirmed in [43]. Additionally, we identify that egl-27 directly activates tbx-8 in the ‘C’ subtree. This discovery is supported by the established interactions from egl-27 to tbx-9 and from tbx-9 to tbx-8, both documented in [42]. Moreover, 34 pairs of regulatory relationships across five subtrees whose regulator and regulated genes have same function annotations are detected using gene ontology enrichment analysis [44] (see Supplementary Table S4 and Table S5). These findings provide support for the gene regulatory relationships identified in our analysis.

In order to compare our method with other available methods on real data, confirmed gene regulatory relationships among the candidate genes of each subtree are retrieved from BioGRID Version 4.4.233 [45] and WormBase Version WS254 [46], which are displayed in Supplementary Table S6. According to the results depicted in Table 3, our method, BBTD, successfully detects confirmed gene regulatory relationships in four subtrees, outperforming the other methods: BFE identifies confirmed relationships in three subtrees; BB detects them in two subtrees; and ATEN does not discover any confirmed relationships. For the ‘E’ subtree, although BBTD identifies only one regulatory relationship, resulting in a lower TPR than that achieved by BFE, it exhibits a significantly higher PPR. This distinction suggests that BFE, with its tendency to overpredict, adopts a greedy approach that leads to the identification of an excessive number of regulatory relationships within each subtree, potentially compromising precision.

Table 3.

Comparison of gene regulatory relationships for five subtrees detected by BBTD and other three Boolean network inference methods: BB, BFE, and ATEN; the column labeled ‘Ground truth’ represents the number of confirmed gene regulatory relationships; each cell from the second to the fifth column contains three rows; the first row shows the number of predicted gene regulatory relationships; the value in parentheses indicates the overlap between the predicted relationships and the confirmed ones; the values in square brackets represent TPR and PPR, respectively, separated by a slash; the cells which are bold means there is at least one confirmed gene regulatory relationship.

Subtree BBTD BB BFE ATEN Ground truth
AB 31 (1) [0.333/0.032] 28 (1) [0.333/0.036] 142 (0) [0/0] 29 (0) [0/0] 3
C 34 (1) [0.167/0.029] 19 (0) [0/0] 171 (1) [0.167/0.006] 41 (0) [0/0] 6
D 8 (0) [0/0] 7 (0) [0/0] 117 (1) [0.25/0.009] 23 (0) [0/0] 4
E 12 (1) [0.111/0.083] 12 (1) [0.111/0.083] 290 (4) [0.444/0.014] 64 (0) [0/0] 9
MS 18 (1) [0.167/0.056] 19 (0) [0/0] 164 (0) [0/0] 33 (0) [0/0] 6

Heterogeneity of regulatory relationships in different cell lineage subtrees

Based on our findings, it is evident that diverse regulatory relationships characterize different subtrees, a diversity likely attributed to inherent heterogeneity among them. On the one hand, the genes involved in regulation vary across distinct subtrees, reflecting their association with specific cell fates. In essence, a regulated gene within a given subtree may engage distinct regulator genes for its regulation compared with other subtrees. Consider ref-1 as an illustrative example: in the ‘C’ subtree, it is activated by hsp-3, expressed in various structures including hypodermis, while activation in the ‘D’ subtree is mediated by pes-1, expressed in muscle. This variance may be linked to the differing cell fates of the subtrees. On the other hand, it is apparent that regulatory relationships involving the same regulator-regulated genes can manifest variations across different subtrees. For instance, although F09G2.9 positively activates F16B12.6 in both ‘C’ and ‘MS’ subtrees, the time delay is zero in the ‘C’ subtree, in contrast to two units in the ‘MS’ subtree. This discrepancy implies that the activation of F16B12.6 by F09G2.9 takes more time in the ‘MS’ subtree than in the ‘C’ subtree.

Potentially significant gene relationships associated with human diseases

In recent years, scientists have gained valuable insights into human disease genes through the study of C. elegans [47, 48]. This research discovers some regulatory relationships among genes whose human orthologs play a crucial role in human disease pathogenesis. For instance, hmg-11 corresponds to the human gene HMGA2, implicated in various diseases, including lipoma [49], leiomyoma [50], and ovarian cancer [51]. As depicted in Fig. 5a and b, hmg-11 is suppressed by W10D9.4 in the ‘AB’ subtree and activated by T23G5.6 with a two-unit time delay in the ‘C’ subtree. Consequently, potential interventions such as inducing the expression of the human ortholog of W10D9.4 or mitigating the regulatory effect of the human ortholog of T23G5.6 within the specified time frame could be considered to modulate the expression of HMGA2 in the treatment of related diseases in the future.

Discussion

In this study, a novel data integration framework is presented to effectively align the gene expression data for each cell lineage subtree, addressing the challenge of non-pairwise time series across various data files. Then, a new method BBTD is proposed to reconstruct GRNs from pairwise binarized data by establishing a probabilistic Boolean network model with time delay. Through extensive simulation studies, the performance of our method is demonstrated across varying the number of genes and cells, which outperforms other Boolean network inference methods. When applied to the real dataset, our method successfully identifies known regulatory relationships between regulator genes and regulated genes. Besides, the results of spatial-temporal regulation reveal the heterogeneity of regulatory relationships in different cell subtrees, providing critical insights into cellular functions and differentiation processes. Moreover, the identification of disease-related gene regulatory relationships provides opportunities for the development of novel diagnostics and precision medicine.

To address the issue of non-pairwise time series data, a framework of data integration including interpolation and binarization is proposed. Interpolation helps preserve more information; however, the absence of pairwise data points necessitates extensive interpolations, which can introduce considerable noise. This noise can severely compromise the effectiveness of more detailed analyses, such as applying Bayesian networks to the interpolated fluorescence intensities. Consequently, binarizing the data after interpolation and then employing Boolean networks presents a robust strategy for dealing with such high-resolution but inherently noisy datasets typical in this study. Moreover, our model excels in performance as it not only accounts for the influence of noise, enhancing its reliability, but also incorporates time delay as a discrete parameter, which closely reflects the dynamics of real biological systems. Furthermore, BBTD can infer the positive and negative regulation, while many other existing methods only determine the presence of a regulatory relationship between two genes.

However, our study is subject to several limitations. Due to the application of data integration, discretization of the gene expression state inevitably results in some degree of information loss, which hinders us from further investigating GRNs in more details. In addition, as binarization does not capture the continuous and often subtle variations in gene expression levels, which are crucial for detecting feedback mechanisms, self-regulation of genes is not considered in our study. This limitation restricts the model’s ability to fully capture the complexities of real biological systems. Last but not least, since a great number of parameters in the proposed model are discrete (e.g. Inline graphic), it is difficult to judge the convergence of MCMC chains in a more principled way. In further work, more specific gene regulatory models, which can also handle non-pairwise datasets, need to be proposed to delve deeper into the mechanisms of gene regulation.

Key Points

  • The research addresses the challenge of inferring GRNs using tree-shaped gene expression data from time-lapse confocal laser microscopy of C. elegans, offering a solution to handle such high-resolution but rather noisy data.

  • A comprehensive framework for data integration as well as a novel Bayesian approach based on Boolean network with time delay are proposed, enhancing the accuracy of GRNs construction.

  • Our method outperforms existing Boolean network algorithms in both simulation studies and real data analysis, validating known gene regulatory relationships and uncovering the heterogeneity of regulatory relationships across different subtrees.

  • All source code is freely available at the GitHub repository https://github.com/edawu11/BBTD.git for promoting reproducibility.

Supplementary Material

supplementary_bbae396

Contributor Information

Yida Wu, School of Mathematical Sciences, Xiamen University, Zengcuo'an West Road, Siming District, Xiamen 361000, China.

Da Zhou, School of Mathematical Sciences, Xiamen University, Zengcuo'an West Road, Siming District, Xiamen 361000, China.

Jie Hu, School of Mathematical Sciences, Xiamen University, Zengcuo'an West Road, Siming District, Xiamen 361000, China.

Funding

This work was supported by the National Natural Sciences Foundation of China [11971405], the Natural Science Foundation of Fujian Province of China [2023J01025], and the Fundamental Research Funds for the Central Universities in China [20720230024].

Data availability

The datasets are available via http://epic.gs.washington.edu/. The processed data is available via https://doi.org/10.5281/zenodo.11261241. All source code is available at https://github.com/edawu11/BBTD.git.

Author contributions

J.H. and D.Z. initiated and designed the study, Y.W. implemented the model with assistance from D.Z. and J.H., Y.W. developed the software tool and performed the simulation studies as well as the real data analysis, Y.W. and J.H. analyzed the results, Y.W. wrote the manuscript, and Y.W., D.Z., and J.H. edited and revised the manuscript. J.H. supervised the study.

Conflict of interest: The authors have declared no conflict of interest.

References

  • 1. Karlebach G, Shamir R. Modelling and analysis of gene regulatory networks. Nat Rev Mol Cell Biol 2008; 9:770–80. 10.1038/nrm2503. [DOI] [PubMed] [Google Scholar]
  • 2. Han S, Wong RKW, Lee TCM. et al. A full Bayesian approach for Boolean genetic network inference. PloS One 2014; 9:e115806. 10.1371/journal.pone.0115806. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3. Banf M, Rhee SY. Computational inference of gene regulatory networks: approaches, limitations and opportunities. Biochim Biophys Acta—Gene Regul Mech 2017; 1860:41–52. 10.1016/j.bbagrm.2016.09.003. [DOI] [PubMed] [Google Scholar]
  • 4. Chai LE, Loh SK, Low ST. et al. A review on the computational approaches for gene regulatory network construction. Comput Biol Med 2014; 48:55–65. 10.1016/j.compbiomed.2014.02.011. [DOI] [PubMed] [Google Scholar]
  • 5. Akers K, Murali T. Gene regulatory network inference in single-cell biology. Curr Opin Syst Biol 2021; 26:87–97. 10.1016/j.coisb.2021.04.007. [DOI] [Google Scholar]
  • 6. Bao Z, Murray JI, Boyle T. et al. Automated cell lineage tracing in Caenorhabditis elegans. Proc Natl Acad Sci US A2006; 103:2707–12. 10.1073/pnas.0511111103. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7. Murray JI, Bao Z, Boyle TJ. et al. Automated analysis of embryonic gene expression with cellular resolution in C. Elegans. Nat Methods 2008; 5:703–9. 10.1038/nmeth.1228. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8. Murray JI, Boyle TJ, Preston E. et al. Multidimensional regulation of gene expression in the C. elegans embryo. Genome Res2012; 22:1282–94. 10.1101/gr.131920.111. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9. Sulston J, Schierenberg E, White J. et al. The embryonic cell lineage of the nematode Caenorhabditis elegans. Dev Biol1983; 100:64–119. 10.1016/0012-1606(83)90201-4. [DOI] [PubMed] [Google Scholar]
  • 10. Huang XT, Zhu Y, Chan LHL. et al. Inference of cellular level signaling networks using single-cell gene expression data in Caenorhabditis elegans reveals mechanisms of cell fate specification. Bioinformatics 2017; 33:1528–35. 10.1093/bioinformatics/btw796. [DOI] [PubMed] [Google Scholar]
  • 11. Stigler B, Chamberlin HM. A regulatory network modeled from wild-type gene expression data guides functional predictions in Caenorhabditis elegans development. BMC Syst Biol 2012; 6:77. 10.1186/1752-0509-6-77. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Kaderali L, Radde N. Inferring gene regulatory networks from expression data. In: Kacprzyk J, Kelemen A, Abraham A, Chen Y (eds.) Computational Intelligence in Bioinformatics. Berlin Heidelberg, Berlin, Heidelberg: Springer,2008; 94:33–74, 10.1007/978-3-540-76803-6_2. [DOI] [Google Scholar]
  • 13. Huang X, Chen L, Chim H. et al. Boolean genetic network model for the control of C. elegans early embryonic cell cycles. Biomed Eng Online 2013; 12:S1. 10.1186/1475-925X-12-S1-S1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14. Liang S, Fuhrman S, Somogyi R. Reveal, a general reverse engineering algorithm for inference of genetic network architectures. Pac Symp Biocomput 1998;3:18–29. [PubMed] [Google Scholar]
  • 15. Lähdesmäki H, Shmulevich I, Yli-Harja O. On learning gene regulatory networks under the Boolean network model. Mach Learn 2003; 52:147–67. 10.1023/A:1023905711304. [DOI] [Google Scholar]
  • 16. Shi N, Zhu Z, Tang K. et al. ATEN: and/or tree ensemble for inferring accurate Boolean network topology and dynamics. Bioinformatics 2020; 36:578–85. 10.1093/bioinformatics/btz563. [DOI] [PubMed] [Google Scholar]
  • 17. Zhang Y, Qian M, Ouyang Q. et al. Stochastic model of yeast cell cycle network. Physica D 2006; 219:35–9. 10.1016/j.physd.2006.05.009. [DOI] [Google Scholar]
  • 18. Ao P, Galas D, Hood L. et al. Cancer as robust intrinsic state of endogenous molecular-cellular network shaped by evolution. Med Hypotheses 2008; 70:678–84. 10.1016/j.mehy.2007.03.043. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19. Todd RG, Helikar T. Ergodic sets as cell phenotype of budding yeast cell cycle. PloS One 2012; 7:e45780. 10.1371/journal.pone.0045780. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20. Chaouiya C, Ourrad O, Lima R. Majority rules with random tie-breaking in Boolean gene regulatory networks. PloS One 2013; 8:e69626. 10.1371/journal.pone.0069626. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21. Lopes FM, Martins DC, Barrera J. et al. A feature selection technique for inference of graphs from their known topological properties: revealing scale-free gene regulatory networks. Inf Sci 2014; 272:1–15. 10.1016/j.ins.2014.02.096. [DOI] [Google Scholar]
  • 22. Dehghannasiri R, Yoon BJ, Dougherty ER. Efficient experimental design for uncertainty reduction in gene regulatory networks. BMC Bioinformatics 2015; 16:S2. 10.1186/1471-2105-16-S13-S2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23. Li X, Rao S, Jiang W. et al. Discovery of time-delayed gene regulatory networks based on temporal gene expression profiling. BMC Bioinformatics 2006; 7:26. 10.1186/1471-2105-7-26. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Lozano AC, Abe N, Liu Y. et al. Grouped graphical granger modeling for gene expression regulatory networks discovery. Bioinformatics 2009; 25:i110–8. 10.1093/bioinformatics/btp199. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25. Lewis J. Autoinhibition with transcriptional delay. Curr Biol 2003; 13:1398–408. 10.1016/S0960-9822(03)00534-7. [DOI] [PubMed] [Google Scholar]
  • 26. Rateitschak K, Wolkenhauer O. Intracellular delay limits cyclic changes in gene expression. Math Biosci 2007; 205:163–79. 10.1016/j.mbs.2006.08.010. [DOI] [PubMed] [Google Scholar]
  • 27. Yalamanchili HK, Yan B, Li MJ. et al. DDGni: dynamic delay gene-network inference from high-temporal data using gapped local alignment. Bioinformatics 2014; 30:377–83. 10.1093/bioinformatics/btt692. [DOI] [PubMed] [Google Scholar]
  • 28. Hu J, Zhao Z, Yalamanchili HK. et al. Bayesian detection of embryonic gene expression onset in C. Elegans. Ann. Appl Stat 2015; 9:950–68. 10.1214/15-AOAS820. [DOI] [Google Scholar]
  • 29. Forsythe GE, Malcolm MA, Moler CB. et al. Computer methods for mathematical computations. Prentice-Hall Series in Automatic Computation. Prentice-Hall, Inc., ; Englewood Cliffs, NJ, 1977. [Google Scholar]
  • 30. Ray JCJ, Tabor JJ, Igoshin OA. Non-transcriptional regulatory processes shape transcriptional network dynamics. Nat Rev Microbiol 2011; 9:817–28. 10.1038/nrmicro2667. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31. Tan N, Ouyang Q. Design of a network with state stability. J Theor Biol 2006; 240:592–8. 10.1016/j.jtbi.2005.10.019. [DOI] [PubMed] [Google Scholar]
  • 32. Saithong T, Bumee S, Liamwirat C. et al. Analysis and practical guideline of constraint-based Boolean method in genetic network inference. PloS One 2012; 7:e30232. 10.1371/journal.pone.0030232. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33. Ventre E, Herbach U, Espinasse T. et al. One model fits all: combining inference and simulation of gene regulatory networks. PLoS Comput Biol 2023; 19:e1010962. 10.1371/journal.pcbi.1010962. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34. Albeverio S, Feng J, Qian M. Role of noises in neural networks. Phys Rev E 1995; 52:6593–606. 10.1103/PhysRevE.52.6593. [DOI] [PubMed] [Google Scholar]
  • 35. Ge H, Qian M. Boolean network approach to negative feedback loops of the p53 pathways: synchronized dynamics and stochastic limit cycles. J Comput Biol 2009; 16:119–32. 10.1089/cmb.2007.0181. [DOI] [PubMed] [Google Scholar]
  • 36. Fisher RA. Statistical methods for research workers. In: Kotz S, Johnson (eds.) NL, Breakthroughs in Statistics. New York, New York, NY: Springer; 1992; 66–70, 10.1007/978-1-4612-4380-9_6. [DOI] [Google Scholar]
  • 37. Cai X, Bazerque JA, Giannakis GB. Inference of gene regulatory networks with sparse structural equation models exploiting genetic perturbations. PLoS Comput Biol 2013; 9:e1003068. 10.1371/journal.pcbi.1003068. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38. Mitchell TJ, Beauchamp JJ. Bayesian variable selection in linear regression. J Am Stat Assoc 1988; 83:1023–32. 10.1080/01621459.1988.10478694. [DOI] [Google Scholar]
  • 39. Geman S, Geman D. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans Pattern Anal Mach Inte. 1984; PAMI-6:721–41. 10.1109/TPAMI.1984.4767596. [DOI] [PubMed] [Google Scholar]
  • 40. Metropolis N, Rosenbluth AW, Rosenbluth MN. et al. Equation of state calculations by fast computing machines. J Chem Phys 1953; 21:1087–92. 10.1063/1.1699114. [DOI] [Google Scholar]
  • 41. Hastings WK. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 1970; 57:97–109. 10.1093/biomet/57.1.97. [DOI] [Google Scholar]
  • 42. Reece-Hoyes JS, Pons C, Diallo A. et al. Extensive rewiring and complex evolutionary dynamics in a C. elegans multiparameter transcription factor network. Mol Cell 2013; 51:116–27. 10.1016/j.molcel.2013.05.018. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43. Byrne AB, Weirauch MT, Wong V. et al. A global analysis of genetic interactions in Caenorhabditis elegans. J Biol 2007; 6:8. 10.1186/jbiol58. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44. Wu T, Hu E, Xu S. et al. clusterProfiler 4.0: a universal enrichment tool for interpreting omics data. Innovation 2021; 2:100141. 10.1016/j.xinn.2021.100141. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45. Oughtred R, Rust J, Chang C. et al. The BioGRID database: a comprehensive biomedical resource of curated protein, genetic, and chemical interactions. Protein Sci 2021; 30:187–200. 10.1002/pro.3978. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46. Sternberg PW, Van Auken K, Wang Q. et al. WormBase 2024: status and transitioning to alliance infrastructure. Genetics 2024; 227:iyae050. 10.1093/genetics/iyae050. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47. Apfeld J, Alper S. What can we learn about human disease from the nematode C. elegans? In: DiStefano (eds.) JK, Disease Gene Identification. New York, New York, NY: Springer, ; 2018; 1706:53–75, 10.1007/978-1-4939-7471-9_4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48. Roussos A, Kitopoulou K, Borbolis F. et al. Caenorhabditis elegans as a model system to study human neurodegenerative disorders. Biomolecules 2023; 13:478. 10.3390/biom13030478. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49. Asher H, Schoenberg Fejzo M, Tkachenko A. et al. Disruption of the architectural factor HMGI-C: DNA-binding AT hook motifs fused in lipomas to distinct transcriptional regulatory domains. Cell 1995; 82:57–65. 10.1016/0092-8674(95)90052-7. [DOI] [PubMed] [Google Scholar]
  • 50. Kazmierczak B, Pohnke Y, Bullerdiek J. Fusion transcripts between the HMGIC gene and RTVL-H-related sequences in mesenchymal tumors without cytogenetic aberrations. Genomics 1996; 38:223–6. 10.1006/geno.1996.0619. [DOI] [PubMed] [Google Scholar]
  • 51. Malek A, Bakhidze E, Noske A. et al. HMGA2 gene is a promising target for ovarian cancer silencing therapy. Int J Cancer 2008; 123:348–56. 10.1002/ijc.23491. [DOI] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

supplementary_bbae396

Data Availability Statement

The datasets are available via http://epic.gs.washington.edu/. The processed data is available via https://doi.org/10.5281/zenodo.11261241. All source code is available at https://github.com/edawu11/BBTD.git.


Articles from Briefings in Bioinformatics are provided here courtesy of Oxford University Press

RESOURCES