Abstract
Zero-inflated count data arise in a wide range of scientific areas such as social science, biology, and genomics. Very few causal discovery approaches can adequately account for excessive zeros as well as various features of multivariate count data such as overdispersion. In this paper, we propose a new zero-inflated generalized hypergeometric directed acyclic graph (ZiG-DAG) model for inference of causal structure from purely observational zero-inflated count data. The proposed ZiG-DAGs exploit a broad family of generalized hypergeometric probability distributions and are useful for modeling various types of zero-inflated count data with great flexibility. In addition, ZiG-DAGs allow for both linear and nonlinear causal relationships. We prove that the causal structure is identifiable for the proposed ZiG-DAGs via a general proof technique for count data, which is applicable beyond the proposed model for investigating causal identifiability. Score-based algorithms are developed for causal structure learning. Extensive synthetic experiments as well as a real dataset with known ground truth demonstrate the superior performance of the proposed method against state-of-the-art alternative methods in discovering causal structure from observational zero-inflated count data. An application of reverse-engineering a gene regulatory network from a single-cell RNA-sequencing dataset illustrates the utility of ZiG-DAGs in practice.
Keywords: Bayesian network, Causal identifiability, Directed acyclic graph, Observational zero-inflated count data, Single-cell RNA-sequencing
1. Introduction
Discovering the causal structure of an unknown system is an important task in practically all areas of science. Knowing the causal structure is not only useful for predicting a system’s behavior under external interventions, but also has implications for machine learning tasks such as covariate shift and transfer learning (Schölkopf et al., 2012). The most effective and principled way for causal discovery is to conduct controlled experiments. However, such experiments are often expensive, unethical, or even impossible in certain fields such as genomics (Opgen-Rhein and Strimmer, 2007) and social sciences (Bollen, 1989). Hence, causal discovery approaches that can infer the unknown causal structures from purely observational data are often desired.
This paper considers causal discovery for purely observational zero-inflated count data. Observational zero-inflated count data are common across multiple disciplines, for instance, educational psychology (Fox, 2013), genomics (Kang et al., 2011), ecology (Barry and Welsh, 2002), behavior studies (Hua et al., 2014), and economics (Staub and Winkelmann, 2013). A specific application, by which we are motivated, is to reverse-engineer gene regulatory networks from single-cell RNA-sequencing (scRNA-seq) data. The scRNA-seq technology measures the abundance of mRNA within single cells, resulting in count data with excessive zeros because of technological limits in sequencing the low amounts of mRNA in individual cells. For causal structure learning from observational zero-inflated count data, we work under the framework of causal Bayesian networks (BNs), which have been widely used for representing causal relationships among variables via directed acyclic graphs (DAGs).
Learning the structure of BNs is not trivial because the size of the space of possible graph structures grows super-exponentially in the number of variables. Furthermore, BNs may not be distinguishable from each other with observational data. Multiple DAGs can encode the same conditional independence assertions and in general, DAGs are identifiable only up to the Markov equivalence class (MEC), in which all DAGs encode the same set of conditional independences (Heckerman et al., 1995). Therefore, in the past, many approaches have focused on identifying the MEC rather than individual DAGs (Spirtes et al., 2000; Chickering, 2002; Kalisch and Bühlmann, 2007; Castelletti et al., 2018). For example, the well-known PC algorithm infers a set of conditional independencies and recovers a MEC that is compatible with the inferred conditional independencies (Spirtes et al., 2000). The GES algorithm performs greedy search over the space of MECs and obtains the best-scored MEC (Chickering, 2002). However, DAGs within the same MEC may have drastically different causal interpretations.
Since 2006, it has been shown that for some classes of BNs, the exact graph structure, not just the MEC, may be identifiable from observational data alone. For continuous variables, BNs are often represented by sparse additive noise models. Under this formulation, the underlying DAG is identifiable if the functional form of the additive noise model is linear with non-Gaussian noises (Shimizu et al., 2006; Wang and Drton, 2020) or if the functional form is nonlinear with mild regularity assumptions on the function-noise pair (Hoyer et al., 2008; Peters et al., 2011, 2014). Peters and Bühlmann (2014) and Chen et al. (2019) have also shown that unique identification of the DAG structure is possible under linear additive noise models with Gaussian noises having equal variances.
The vast majority of the existing works that establish identifiability theorems for BNs have focused on continuous variables; identifiability issues of BNs for count data are less studied. Park and Raskutti (2015) proposed linear Poisson BNs for observational count data and investigated the overdispersion scores to prove that the unique identification of the underlying DAG is possible. However, the applicability of Poisson BNs may be limited due to the restrictive assumption of Poisson distribution that the variance is equal to the mean. Park and Park (2019) generalized the idea of Poisson BNs to a family of generalized hypergeometric distributions that includes the Poisson distribution, the hyper-Poisson distribution, the negative binomial distribution, and many more. An identifiability theorem for the generalized hypergeometric BNs was established using the moment ratio scores. Although the generalized hypergeometric BNs are a quite general class of count BNs, they tend not to adequately model count data with excessive zeros.
There have been a few recent BNs that are fully identifiable for observational zero-inflated data. Using Hurdle conditional distributions, Yu et al. (2020) proposed fully identifiable BNs for zero-inflated Gaussian data. Recently, we (Choi et al., 2020) developed zero-inflated Poisson BNs for observational zero-inflated count data. We have shown, theoretically and empirically, that the underlying causal DAG can be identified from observational data alone. However, the zero-inflated Poisson BNs have the same limitation as Poisson BNs, that is, Poisson distribution is a restrictive distribution. In particular, the Poisson-based BNs do not adequately account for overdispersion, a common feature of count data. Hence it is desirable to further develop a more general class of count BNs that can account for a broad range of multivariate count data with excessive zeros.
In this paper, we introduce a fairly general class of count BNs for observational zero-inflated count data, termed zero-inflated generalized hypergeometric DAGs (ZiG-DAGs). We extend the zero-inflated Poisson BNs (Choi et al., 2020) to zero-inflated generalized hypergeometric models, which include many common count distributions. Therefore, the proposed ZiG-DAGs are capable of modeling various types of zero-inflated count data, for example, overdispersed zero-inflated count data. In addition, we allow for both linear and nonlinear causal relationships in order to flexibly capture real causality in practice, whereas Choi et al. (2020) only consider linear causal relationships. Based on a new general proof technique, we prove that the proposed ZiG-DAG is uniquely identifiable, justifying its use for causal discovery. The general proof technique can potentially be used to check identifiability for other discrete BNs as well. The established identifiability theorems do not require the causal faithfulness assumption (Uhler et al., 2013) typically required by constraint-based algorithms. For the structure learning of ZiG-DAGs, we develop score-based algorithms: exhaustive search for small graphs and greedy search for moderate-to-large graphs. Specifically, we consider two different greedy search algorithms to deal with the local optima problem of greedy search. We empirically demonstrate that the proposed methods compare favorably against state-of-the-art alternatives. We also illustrate the utility of ZiG-DAGs in real-world problems using a scRNA-seq dataset.
The remainder of this paper is organized as follows. We set up necessary notations and definitions for BNs in Section 2.1 and we introduce the proposed ZiG-DAG models for observational zero-inflated count data in Section 2.2. Section 3 establishes identifiability theorems for the proposed ZiG-DAGs. In Section 4, we develop score-based algorithms for causal structure learning of ZiG-DAGs. We demonstrate the utility of our methods through synthetic data in Section 5 and real-world applications in Section 6. Section 7 provides our closing discussion.
2. Bayesian Networks for Observational Zero-inflated Count Data
2.1. Notation and Background
We start with some basic notation for DAGs and BNs. Let $X = (X_1, \dots, X_d)$ denote a set of $d$ random variables. A DAG $G = (V, E)$ consists of a set of nodes $V = \{1, \dots, d\}$ corresponding to the variables $X$ and a set of directed edges $E \subset V \times V$ representing the causal relationships between the nodes without cycles. If we have a directed edge $j \to k$ (or slightly abusing the notation, $X_j \to X_k$) for $j, k \in V$, node $j$ is called a parent of $k$ and node $k$ is called a child of $j$. We denote the set of parents of node $j$ in $G$ by $\mathrm{pa}_G(j)$ and the set of children of $j$ in $G$ by $\mathrm{ch}_G(j)$. Node $k$ is said to be a descendant of node $j$ if there exists a directed path $j \to \cdots \to k$ and otherwise is said to be a non-descendant of $j$. We use $\mathrm{nd}_G(j)$ to denote the set of non-descendants of $j$ (excluding $j$ itself). A BN for $X$ is a pair $(G, P)$ with the joint distribution $P$ factorizing over $G$ as follows:
$$P(X) = \prod_{j=1}^{d} P_j\big(X_j \mid X_{\mathrm{pa}_G(j)}\big), \qquad (1)$$
where $X_{\mathrm{pa}_G(j)} = \{X_k : k \in \mathrm{pa}_G(j)\}$ and $P_j$ is the conditional probability distribution of $X_j$ given its parents. We say a joint distribution $P$ is (local) Markov with respect to a DAG $G$ if each variable $X_j$ is independent of its non-descendants $X_{\mathrm{nd}_G(j)}$ given its parents $X_{\mathrm{pa}_G(j)}$. The factorization in (1) is equivalent to the Markov property of $P$ (Verma and Pearl, 1990). In this paper, we make the causal Markov assumption, namely that $P$ is Markov with respect to the causal DAG $G$, so that we can interpret $G$ causally; in other words, each node is assumed to be independent of all its non-effects conditional on all its direct causes.
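The graph vocabulary above (parents, children, descendants, non-descendants) can be made concrete with a small sketch. The edge-set representation, the function names, and the four-node example graph are illustrative assumptions, not part of the paper.

```python
from collections import deque

def parents(edges, j):
    """Nodes with a directed edge into j."""
    return {a for (a, b) in edges if b == j}

def children(edges, j):
    """Nodes reached by a directed edge out of j."""
    return {b for (a, b) in edges if a == j}

def descendants(edges, j):
    """All nodes reachable from j along directed paths (breadth-first search)."""
    seen, queue = set(), deque([j])
    while queue:
        u = queue.popleft()
        for c in children(edges, u):
            if c not in seen:
                seen.add(c)
                queue.append(c)
    return seen

def non_descendants(nodes, edges, j):
    """Nodes that are neither j itself nor descendants of j."""
    return set(nodes) - descendants(edges, j) - {j}

# Hypothetical example DAG: 1 -> 2, 1 -> 3, 2 -> 4, 3 -> 4.
nodes = [1, 2, 3, 4]
edges = {(1, 2), (1, 3), (2, 4), (3, 4)}
```

In this example the local Markov property for node 2 reads: node 2 is independent of its non-descendants {1, 3} given its parent {1}, i.e., node 2 is independent of node 3 given node 1.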
In general, the DAG $G$ of a BN is not identifiable from the joint distribution $P$. Indeed, the joint distribution is Markov with respect to many different DAGs including all fully connected DAGs. Therefore, we have many possible BNs with different graph structures for the same joint distribution. To overcome this indeterminacy, one can make additional assumptions and obtain a restricted model for which the graph is identifiable from the joint distribution. A common assumption in the literature for learning BNs is faithfulness. A joint distribution $P$ is faithful with respect to a DAG $G$ if the graph encodes all the conditional independence constraints in the joint distribution $P$. If faithfulness is assumed, DAGs are identifiable up to the MEC (Spirtes et al., 2000). Two DAGs $G_1$ and $G_2$ are Markov equivalent if they encode the same set of conditional independence constraints, and a MEC is a set of DAGs that are Markov equivalent. For example, despite the seemingly different graph structures, the DAGs in Figure 1(a)–(c) form a MEC, which encodes only the conditional independence of the two non-adjacent nodes given the third node, whereas the DAG in Figure 1(d) encodes only the marginal independence of its two non-adjacent nodes and forms another MEC. Since both the Markov property and faithfulness only constrain conditional independencies in the joint distribution, we cannot distinguish DAGs in the same MEC, which impose the same set of conditional independence assertions. For instance, the well-known PC algorithm (Spirtes et al., 2000) and the GES algorithm (Chickering, 2002), under the faithfulness assumption, aim to find the best MEC rather than the best individual DAG.
Figure 1: Examples of DAGs with three nodes. The DAGs in (a)–(c) are Markov equivalent and form a Markov equivalence class that encodes a single conditional independence. The DAG in (d) forms another Markov equivalence class that encodes a single marginal independence.
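Markov equivalence can be checked mechanically with the standard graphical characterization: two DAGs are Markov equivalent exactly when they share the same skeleton and the same v-structures. The sketch below applies it to four three-node DAGs. Since the figure itself is not reproduced here, the node labels and the specific chain, fork, and collider graphs are assumptions about what panels (a)–(d) show.

```python
from itertools import combinations

def skeleton(edges):
    """Undirected version of the edge set."""
    return {frozenset(e) for e in edges}

def v_structures(edges):
    """Triples (a, c, b) with a -> c <- b where a and b are not adjacent."""
    skel = skeleton(edges)
    nodes = {v for e in edges for v in e}
    found = set()
    for c in nodes:
        pa = sorted(a for (a, b) in edges if b == c)
        for a, b in combinations(pa, 2):
            if frozenset((a, b)) not in skel:
                found.add((a, c, b))
    return found

def markov_equivalent(e1, e2):
    """Same skeleton and same v-structures (graphs without isolated nodes)."""
    return skeleton(e1) == skeleton(e2) and v_structures(e1) == v_structures(e2)

# Assumed labeling of the four panels (hypothetical; the figure is not shown).
chain_a = {(1, 2), (2, 3)}    # 1 -> 2 -> 3
chain_b = {(3, 2), (2, 1)}    # 1 <- 2 <- 3
fork_c = {(2, 1), (2, 3)}     # 1 <- 2 -> 3
collider_d = {(1, 2), (3, 2)} # 1 -> 2 <- 3
```

Under this labeling the first three graphs fall into one class and the collider into a class of its own, which matches the grouping described in the caption.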
In many applications of BNs, a specific family of distributions is assumed for the conditional distribution of each node given its parents. For example, we will assume that the conditional probability for each node comes from a zero-inflated count model. Even with such distributional assumptions, the DAG may still be non-identifiable due to distribution equivalence. Two DAGs $G_1$ and $G_2$ are distribution equivalent if for every BN $(G_1, P_1)$ there exists a different BN $(G_2, P_2)$ such that the joint distributions are identical, i.e., $P_1 = P_2$. For example, for Gaussian BNs or multinomial BNs, two DAGs are distribution equivalent if and only if they are Markov equivalent. Hence, we can identify only the MEC from the joint distribution for Gaussian and multinomial BNs. Not surprisingly, if we assume that data are generated from one of the three DAGs in Figure 1(a)–(c), the best answer that we can achieve using Gaussian BNs or multinomial BNs is that one of them is the true model. This is unsatisfactory for many applications, and it has recently been shown that there exist certain cases where we can overcome the issue of distribution equivalence and the graph structure is fully identifiable. The existing works often represent continuous BNs as sparse additive noise models, and under this framework the underlying DAG is identifiable if the functional form of the additive noise model is linear and the noises are non-Gaussian (Shimizu et al., 2006), if nonlinear functions are considered with very mild additional conditions (Hoyer et al., 2008; Peters et al., 2011), or if the functions are linear and the noises are Gaussian with equal variance (Peters and Bühlmann, 2014). However, most existing approaches focus on BNs for continuous data, and the identifiability of BNs for count data is much less studied (Park and Raskutti, 2015; Park and Park, 2019; Choi et al., 2020).
2.2. Zero-Inflated Generalized Hypergeometric Directed Acyclic Graphs
We consider a broad family of discrete distributions for count data. Kemp (1968a, b) defines a family of generalized hypergeometric probability distributions (GHPDs), which includes many common probability distributions for count data and has many useful properties such as recurrence relationships for both their probabilities and their factorial moments. Let $(a)_k = a(a+1)\cdots(a+k-1)$ denote the ascending (rising) factorial with $(a)_0 = 1$. The generalized hypergeometric function is then defined as
$${}_pF_q[a_1, \dots, a_p; b_1, \dots, b_q; \theta] = \sum_{k=0}^{\infty} \frac{(a_1)_k \cdots (a_p)_k}{(b_1)_k \cdots (b_q)_k} \frac{\theta^k}{k!}.$$
Note that $a_1, \dots, a_p$ are exchangeable and so are $b_1, \dots, b_q$. A distribution is said to be a GHPD if its probability generating function can be written in the following form:
$$G(s; \theta) = \frac{{}_pF_q[a; b; \theta s]}{{}_pF_q[a; b; \theta]}, \qquad (2)$$
where $a = (a_1, \dots, a_p)$ and $b = (b_1, \dots, b_q)$. A large number of discrete distributions for count data belong to the class of GHPDs, for example, binomial, Poisson, negative binomial, hypergeometric, beta-binomial, and beta-negative binomial. Table 1 provides some examples of GHPDs with their probability generating functions (see also Kemp 1968a; Dacey 1972; Johnson et al. 2005).
Table 1:
Examples of GHPDs and their probability generating functions
| Distributions | Probability generating function | Parameters |
|---|---|---|
| Binomial | ||
| Poisson | ||
| Hyper-Poisson | ||
| Geometric | ||
| Negative Binomial | ||
| Hypergeometric | ||
| Beta-Negative Binomial | ||
| Extended Generalized Waring | ||
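To illustrate how members of the GHPD family arise from one template, the following sketch computes a GHPD probability mass function numerically. It assumes the probability generating function is the ratio of generalized hypergeometric series evaluated at $\theta s$ and at $\theta$, so that the probability of $x$ is proportional to the $x$-th series coefficient. The function name, the argument names `a`, `b`, `theta`, and the truncation point `x_max` are illustrative choices.

```python
import math

def ghpd_pmf(a, b, theta, x_max=200):
    """Truncated GHPD pmf on {0, ..., x_max}.

    Unnormalized weights follow the recurrence
        w[x+1] = w[x] * prod(a_i + x) / prod(b_i + x) * theta / (x + 1),
    i.e. w[x] = prod (a_i)_x / prod (b_i)_x * theta**x / x!.
    Truncation at x_max is an approximation that assumes a negligible tail.
    """
    w = [1.0]
    for x in range(x_max):
        ratio = theta / (x + 1)
        for ai in a:
            ratio *= ai + x
        for bi in b:
            ratio /= bi + x
        w.append(w[-1] * ratio)
    total = sum(w)
    return [v / total for v in w]

poisson = ghpd_pmf([], [], 2.0)               # no a or b parameters: Poisson with mean 2
neg_binom = ghpd_pmf([3.0], [], 0.4)          # a = (r,): negative binomial, r = 3
hyper_poisson = ghpd_pmf([1.0], [1.5], 2.0)   # a = (1,), b = (lambda,): hyper-Poisson
```

Using the recurrence rather than factorials keeps every intermediate quantity a moderate-sized float. With `b = [1.0]` the hyper-Poisson weights reduce to the Poisson ones, which is a quick sanity check on the parametrization assumed here.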
We define, by using the GHPDs, ZiG-DAGs for observational zero-inflated count data. In order to explicitly account for excessive zeros in count data, we adopt the zero-inflated model. We say a BN for random counts is a ZiG-DAG if for each node , the conditional distribution of the factorization (1) has a probability generating function of the following form,
| (3) |
where is a GHPD probability generating function defined by (2) with and . Here, and are functions/mappings from to , which connect the parents of node to its conditional distribution, where . For a ZiG-DAG, the probability mass function of each conditional distribution is given by
where is the probability mass function of a GHPD of which the probability generating function is given by . Particularly, is the probability that extra zeros occur in addition to the zeros that arise from the GHPD, and is the power parameter of GHPD that is closely related to its moments. For example, in the Poisson probability generating function represents the mean of the Poisson distribution. As another example, in the negative binomial probability generating function denotes the probability of “success”, which can be reparametrized in terms of the first and second moments of the negative binomial distribution.
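A short numerical check may help fix ideas about the zero-inflated construction. The sketch assumes the usual mixture form: with probability $\pi$ the count is an extra zero, and otherwise it is drawn from the base count distribution, so the probability generating function is $\pi + (1-\pi)$ times the base one. The Poisson base distribution, the function names, and the values $\pi = 0.3$, $\theta = 2$ are illustrative assumptions.

```python
import math

def poisson_pmf(x, theta):
    """Poisson pmf with mean theta (used here as the base count distribution)."""
    return math.exp(-theta) * theta ** x / math.factorial(x)

def zero_inflated_pmf(x, pi, base_pmf):
    """Mixture of a point mass at zero (weight pi) and a base count pmf."""
    return pi * (x == 0) + (1.0 - pi) * base_pmf(x)

def zip_pmf(x, pi=0.3, theta=2.0):
    """Zero-inflated Poisson pmf with hypothetical parameter values."""
    return zero_inflated_pmf(x, pi, lambda k: poisson_pmf(k, theta))

# Probability generating function evaluated numerically at s = 0.5.
s = 0.5
pgf_numeric = sum(zip_pmf(x) * s ** x for x in range(60))
pgf_closed = 0.3 + 0.7 * math.exp(2.0 * (s - 1.0))
```

The two values agree up to truncation error, and the probability of a zero is $\pi + (1-\pi)e^{-\theta}$, which exceeds the Poisson zero probability whenever $\pi > 0$.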
For and , we consider both linear and nonlinear functional forms. We use the logit function, , as link function for , where for . Let , denote any suitable link function for , which is assumed to be strictly increasing for invertibility. First, we define linear ZiG-DAGs by assuming logit and vary linearly with the parents of node .
Definition 1 (Linear ZiG-DAGs) We say a BN $(G, P)$ is a linear ZiG-DAG if the joint distribution $P$ factorizes with respect to the DAG $G$ as in (1) with each conditional distribution having a probability generating function (3), where and are given by
| (4) |
with some strictly increasing functions .
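The generative reading of a linear ZiG-DAG can be sketched by ancestral sampling: visit the nodes in a topological order and draw each count from its zero-inflated conditional distribution. The sketch below assumes a Poisson base distribution with a log link for the power parameter and a logit link for the zero-inflation probability. The names `alpha0`, `alpha`, `beta0`, `beta` for intercepts and coefficients, and all numeric values, are hypothetical.

```python
import math
import random

def sample_poisson(mean, rng):
    """Knuth's multiplication method; adequate for small means only."""
    limit, k, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def simulate_linear_zip_dag(order, parents, alpha0, alpha, beta0, beta, n, seed=0):
    """Draw n samples; `order` must be a topological order of the DAG."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = {}
        for j in order:
            # Linear predictors in the parents' counts (assumed form).
            eta_pi = alpha0[j] + sum(alpha[(j, k)] * x[k] for k in parents[j])
            eta_th = beta0[j] + sum(beta[(j, k)] * x[k] for k in parents[j])
            pi = 1.0 / (1.0 + math.exp(-eta_pi))   # inverse logit
            theta = math.exp(eta_th)               # inverse log link
            x[j] = 0 if rng.random() < pi else sample_poisson(theta, rng)
        data.append([x[j] for j in order])
    return data

# Hypothetical two-node example: 1 -> 2.
order = [1, 2]
parents = {1: [], 2: [1]}
alpha0, alpha = {1: -1.0, 2: -1.0}, {(2, 1): 0.5}
beta0, beta = {1: 1.0, 2: 1.0}, {(2, 1): -0.3}
data = simulate_linear_zip_dag(order, parents, alpha0, alpha, beta0, beta, n=2000)
```

A negative coefficient is used on the log scale so that the Poisson mean stays bounded; with positive coefficients and large parent counts the simple sampler above would need to be replaced by one suited to large means.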
The zero-inflated Poisson BN in our recent work (Choi et al., 2020) is a special case of the proposed linear ZiG-DAG. Furthermore, in order to allow more flexible causal relationships, we propose nonlinear ZiG-DAGs by adopting the additive model framework. Particularly, for each , we model and as the sum of nonlinear functions of .
Definition 2 (Nonlinear ZiG-DAGs) We say a BN $(G, P)$ is a nonlinear ZiG-DAG if the joint distribution $P$ factorizes with respect to the DAG $G$ as in (1) with each conditional distribution having a probability generating function (3), where and are given by
| (5) |
with some strictly increasing functions and nonlinear functions and .
Without loss of generality, we assume that because they can always otherwise be absorbed into the intercepts and . If zero-inflated random counts follow either a linear ZiG-DAG or a nonlinear ZiG-DAG, they satisfy, by definition, the Markov property (conditional independencies) encoded in the underlying DAG. As mentioned earlier, BNs may not be identifiable due to Markov and distribution equivalence. In the next section, we will show that under the proposed ZiG-DAG models, the causal graph structure is identifiable from observational data alone.
3. Identifiability Theory
Recently, much effort has been directed toward showing that some assumptions on the conditional distribution of each node can impose non-independence constraints on the joint distribution so that the DAG of a BN is identifiable (Shimizu et al., 2006; Hoyer et al., 2008; Peters et al., 2014; Peters and Bühlmann, 2014). However, the existing literature mostly addresses the identifiability issue of BNs for continuous data, and there are far fewer identifiability results on BNs for count data. Park and Raskutti (2015) and Park and Park (2019) developed BNs using the Poisson and generalized hypergeometric families, respectively, and showed that their causal orderings are identifiable. Our recent work (Choi et al., 2020) investigated the identifiability of the zero-inflated Poisson BNs. These three methods are special cases of the proposed ZiG-DAGs; however, none of their proof techniques is applicable in our setting. Therefore, before we state the main identifiability results for both linear and nonlinear ZiG-DAGs, we provide a general framework to check the identifiability of discrete BNs. We provide a sufficient condition under which two discrete BNs with different DAGs must have different joint distributions. The proofs of the identifiability theorems for the proposed ZiG-DAGs are based on this sufficient condition. Specifically, Proposition 4 formulates the sufficient condition in terms of probability generating functions for the conditional distribution of each node given its parents. As discrete distributions are often defined by the probability generating function, one can potentially use Proposition 4 to verify the identifiability of other discrete BNs. We first state two assumptions that our identifiability results require:
Condition 3 We assume (i) there exists no unmeasured confounder, and (ii) there is no selection bias.
No unmeasured confounder (also known as causal sufficiency) and no selection bias are commonly adopted in the literature for causal structure learning (Chickering, 2002; Shimizu et al., 2006; Peters and Bühlmann, 2014; Maathuis et al., 2018). The BN factorization (1) does not hold if either or both assumptions in Condition 3 are violated.
For two discrete BNs and , we denote , , , , and for the ease of notation. We let and denote the probability generating functions for the conditional distributions and of and , respectively. Note that by definition, where denotes the -th derivative of .
Proposition 4 Let and be any two discrete BNs, where and . Suppose that for every node for which
| (6) |
holds for all possible , it is also true that and hold. Then, if the joint distributions of and are equivalent, i.e., , we have .
All proofs can be found in the appendices. The main idea behind the proof is to show that if the observational joint distributions and are identical, then the proposition condition (6) necessarily implies that and have to be identical. Given any topological ordering of the graph , we first show that the parent sets, in and , of the last node in the ordering have to be identical if the joint distributions are the same. Then we use mathematical induction to show that this is also true for any node of and therefore and have to be identical.
Sometimes, it is also of interest to identify the model parameters. When the graph structure is identifiable, the parameter identifiability simplifies to a question of whether parameters associated with the conditional distribution of each node are identifiable. Since Proposition 4 implies that the graph structure is already identifiable, if a conditional distribution of each node is uniquely determined by the associated parameters, then we necessarily have a one-to-one correspondence between the joint distribution of the BN and the set of all associated parameters, and hence parameters are identifiable.
Corollary 5 Let and be sets of parameters that are associated with discrete BNs and , respectively, where and denote the sets of parameters associated with the conditional distribution of the node only. Suppose that for any , the assumption in Proposition 4 holds and, furthermore, whenever . Then, if the joint distributions of and are equivalent, i.e., , we have .
Using Proposition 4 and Corollary 5, we prove identifiability of the underlying DAG and the associated parameters for both the linear ZiG-DAG and the nonlinear ZiG-DAG in Theorems 6 and 7.
Theorem 6 Let $(G, P)$ be a linear ZiG-DAG. Assume Condition 3 holds. Then, if the variables are not binary, the graph is identifiable from the joint distribution . For given (which characterizes the generalized hypergeometric function ) and given (the link function for ), there is a unique set of parameters for the linear ZiG-DAG that induces the observed distribution .
Theorem 7 Let $(G, P)$ be a nonlinear ZiG-DAG. Assume Condition 3 holds. Then, if the variables are not binary, the graph is identifiable from the joint distribution . For given () (which characterizes the generalized hypergeometric function ) and given (the link function for ), there is a unique set of parameters for the nonlinear ZiG-DAG that induces the observed distribution .
In Theorems 6 and 7, the assumptions that are given, which we make for parameter identifiability, indicate that the conditional distribution of each node in ZiG-DAGs should take a specific GHPD model among the family of GHPDs along with a specific link function. One example of such a combination would be the Poisson distribution with the log link function. Furthermore, the assumption excludes limiting cases of a given GHPD. For instance, if we use the negative binomial distribution for a node, we do not allow it to degenerate to a Poisson distribution, since the negative binomial distribution has , while the Poisson distribution has . This assumption seems reasonable since we have to decide which GHPD and link function to use in practice. In Sections 5 and 6, for the proposed ZiG-DAG, we consider the Poisson distribution, the hyper-Poisson distribution, and the negative binomial distribution with the log link function. With such choices of GHPD and link function, Theorems 6 and 7 state that both the causal structure and the model parameters for the proposed ZiG-DAGs are fully identifiable from the joint distribution.
Theorems 6 and 7 do not require faithfulness to prove that the exact graph structure is identifiable under the proposed ZiG-DAG models. While continuous BNs such as linear Gaussian BNs may have accidental cancellation of positive and negative causal effects and hence may become unfaithful, the proposed ZiG-DAGs do not allow such cancellation due to the inherent asymmetry of count distributions. Faithfulness can be violated in an equilibrium-maintaining system such as a biological system (Andersen, 2013) and in datasets with limited sample size (Uhler et al., 2013). In such cases, therefore, causal discovery approaches that require the faithfulness assumption are not favorable. In our specific motivating application of reverse-engineering gene regulatory networks from scRNA-seq data, one should avoid the common practice of “Gaussianizing” raw scRNA-seq data because then one needs to additionally make the faithfulness assumption, which may not be suitable in gene regulatory systems; instead, directly working with raw zero-inflated count data with the proposed ZiG-DAG does not suffer from this limitation.
4. Algorithms
In this section, we discuss algorithms for learning the causal structures of both the linear ZiG-DAGs and the nonlinear ZiG-DAGs. We will consider score-based approaches, which complement the Bayesian inference procedure developed in our recent work (Choi et al., 2020).
4.1. Structure Learning for Linear ZiG-DAGs
Suppose that we are given zero-inflated count data that are independent realizations of from a linear ZiG-DAG model . For the linear ZiG-DAG, we denote the model parameters by with , and . We let denote the joint distribution of the linear ZiG-DAG given the model parameters and the DAG . We score each DAG by the Bayesian information criterion (BIC),
| (7) |
where denotes the maximum likelihood estimate of the model parameters and denotes the number of model parameters. As the individual DAG is identifiable for the proposed ZiG-DAGs, the consistency of the BIC ensures that the true DAG uniquely achieves the minimum BIC with probability converging to 1 as the sample size grows (Claeskens et al., 2008). We take two strategies to minimize the BIC given by (7) with respect to the DAG : (1) exhaustive search and (2) greedy search.
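For orientation, a BIC of the kind described can be written node by node. The display below is a sketch under assumptions: it takes (7) to be the usual "minus twice the maximized log-likelihood plus the number of parameters times the log sample size", and it assumes that each node has its own parameter block $\theta_j$ with no parameters shared across nodes. The symbols $n$, $x_{ij}$, $\hat{\theta}_j$, and $|\theta_j|$ are notation introduced here for illustration.

```latex
\mathrm{BIC}(G)
  = -2 \log P\big(x_{1:n} \mid \hat{\theta}, G\big) + |\theta| \log n
  = \sum_{j=1}^{d} \Big[ -2 \sum_{i=1}^{n}
      \log P_j\big(x_{ij} \mid x_{i,\mathrm{pa}_G(j)}; \hat{\theta}_j\big)
      + |\theta_j| \log n \Big]
```

The second equality follows from the factorization (1): the log-likelihood is a sum over nodes, so under the no-shared-parameters assumption each block can be maximized separately. A practical consequence is that adding or deleting an edge changes the term of one node only, and reversing an edge changes the terms of two nodes, so a search procedure need not refit the whole graph after each local move.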
Exhaustive Search
For small graphs where the number of nodes is small, the BIC can be minimized by computing the scores for all possible DAGs and finding the DAG with the lowest score. This approach is exact and is useful for small (say, ). As the number of nodes grows, however, this approach becomes computationally infeasible very quickly because the number of DAGs grows super-exponentially in .
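The super-exponential growth can be made concrete with Robinson's recurrence for the number of labeled DAGs, a standard combinatorial result that is not part of the paper. The function name is an illustrative choice.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def count_dags(n):
    """Number of labeled DAGs on n nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )

counts = [count_dags(n) for n in range(1, 6)]
# counts == [1, 3, 25, 543, 29281]
```

Already at ten nodes the count is on the order of $10^{18}$, which is why exhaustive scoring is limited to very small graphs.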
Greedy Search
For larger graphs, exhaustive search is infeasible; we will use greedy search instead. Greedy search algorithms in the context of BN learning consider local moves from the current graph and make the locally optimal choice at each iteration. We consider two strategies, the hill climbing (HC) and tabu search (TS) algorithms.
The HC algorithm explores the neighborhood of the current DAG in the space of all possible DAGs. The neighborhood is defined using local moves. At each iteration, the algorithm scores all the DAGs that can be reached from the current graph by an edge addition, deletion, or reversal. The current DAG is then replaced by the DAG that provides the largest improvement, i.e., the largest decrease in BIC in our case. We stop the algorithm when no further improvement is possible. We summarize the HC procedure in Algorithm 1. Although this algorithm finds a locally optimal graph, there is no guarantee that the graph obtained by HC is a global optimum.
In order to avoid being trapped in local optima, the TS algorithm allows additional local moves (edge addition, deletion, and reversal) when we reach a locally optimal graph for which the score cannot be improved. These additional steps explore new territories around the local optimum even if they do not improve the score and may find a new direction to arrive at a better structure. Note that the final solution should be the best DAG found anywhere during the search, not the DAG at which the algorithm stops. Furthermore, we keep a list (the tabu list) of all local moves that we have applied within the last iterations. During the search over the neighborhood of the current DAG, our TS algorithm does not consider local modifications that reverse the local moves in the tabu list. For example, if we add an edge , we cannot delete the edge in the next steps. This forces the search to explore new directions in the space of DAGs, instead of tweaking the same parts of the current solution. Our TS algorithm is summarized in Algorithm 2.
Algorithm 1.
Hill climbing
| 1: | Input: data , initial DAG . |
| 2: | Compute and set . |
| 3: | Set . |
| 4: | repeat |
| 5: | Initialize . |
| 6: | for all DAGs reachable from by an edge addition, deletion, or reversal do |
| 7: | Compute . |
| 8: | if then |
| 9: | Set and . |
| 10: | Set . |
| 11: | end if |
| 12: | end for |
| 13: | until is |
| 14: | Output: DAG . |
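A minimal Python sketch of hill climbing over DAGs follows. It is a steepest-descent variant that moves to the best-scoring neighbor at each iteration and stops when no neighbor improves the score. The function names and the edge-set representation are assumptions, and `score` is a placeholder for any score to be minimized (for instance a BIC); the toy score in the example is only there to exercise the search.

```python
from itertools import permutations

def is_acyclic(nodes, edges):
    """Kahn's algorithm: the graph is acyclic iff every node can be removed."""
    indeg = {v: 0 for v in nodes}
    for _, b in edges:
        indeg[b] += 1
    stack = [v for v in nodes if indeg[v] == 0]
    removed = 0
    while stack:
        u = stack.pop()
        removed += 1
        for a, b in edges:
            if a == u:
                indeg[b] -= 1
                if indeg[b] == 0:
                    stack.append(b)
    return removed == len(nodes)

def neighbors(nodes, edges):
    """DAGs reachable by one edge addition, deletion, or reversal."""
    for a, b in permutations(nodes, 2):
        if (a, b) in edges:
            yield edges - {(a, b)}                   # deletion
            rev = (edges - {(a, b)}) | {(b, a)}
            if is_acyclic(nodes, rev):
                yield rev                            # reversal
        elif (b, a) not in edges:
            add = edges | {(a, b)}
            if is_acyclic(nodes, add):
                yield add                            # addition

def hill_climb(nodes, score, start=frozenset()):
    """Greedy descent on `score`; returns a local optimum and its score."""
    current = frozenset(start)
    best = score(current)
    while True:
        cand = min(neighbors(nodes, current), key=score, default=None)
        if cand is None or score(cand) >= best:
            return current, best
        current, best = frozenset(cand), score(cand)

# Toy score (hypothetical): number of edge mismatches against a target DAG.
nodes = [1, 2, 3]
target = frozenset({(1, 2), (2, 3)})
result = hill_climb(nodes, lambda g: len(g ^ target))
```

With a decomposable score one would cache per-node terms instead of rescoring whole graphs; that optimization is omitted here for brevity.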
Algorithm 2.
Tabu search
| 1: | Input: data , initial DAG , number of additionally allowed steps , size of the tabu list . |
| 2: | Compute and set . |
| 3: | Set . |
| 4: | Initialize . |
| 5: | while do |
| 6: | Initialize . |
| 7: | for all DAGs reachable from by an edge addition, deletion, or reversal do |
| 8: | if does not reverse local moves in the tabu list, (i.e., in the last steps) then |
| 9: | Compute . |
| 10: | if then |
| 11: | Set and . |
| 12: | end if |
| 13: | end if |
| 14: | end for |
| 15: | if then |
| 16: | Set and . |
| 17: | Set . |
| 18: | else |
| 19: | Set . |
| 20: | end if |
| 21: | end while |
| 22: | Output: DAG . |
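The tabu mechanism can be sketched in Python as follows. This is an illustration under assumptions rather than a transcription of Algorithm 2: the move encoding, the names `extra_steps` and `tabu_size`, and the rule of resetting the stall counter whenever a new best graph is found are choices made here. `score` again stands for any score to be minimized.

```python
from collections import deque
from itertools import permutations

def is_acyclic(nodes, edges):
    """Kahn's algorithm: the graph is acyclic iff every node can be removed."""
    indeg = {v: 0 for v in nodes}
    for _, b in edges:
        indeg[b] += 1
    stack = [v for v in nodes if indeg[v] == 0]
    removed = 0
    while stack:
        u = stack.pop()
        removed += 1
        for a, b in edges:
            if a == u:
                indeg[b] -= 1
                if indeg[b] == 0:
                    stack.append(b)
    return removed == len(nodes)

def moves(nodes, edges):
    """Yield (move, new_graph) for every valid local move."""
    for a, b in permutations(nodes, 2):
        if (a, b) in edges:
            yield ("del", a, b), edges - {(a, b)}
            rev = (edges - {(a, b)}) | {(b, a)}
            if is_acyclic(nodes, rev):
                yield ("rev", a, b), rev
        elif (b, a) not in edges:
            add = edges | {(a, b)}
            if is_acyclic(nodes, add):
                yield ("add", a, b), add

def inverse(move):
    """The move that would undo `move`."""
    op, a, b = move
    return {"add": ("del", a, b), "del": ("add", a, b), "rev": ("rev", b, a)}[op]

def tabu_search(nodes, score, start=frozenset(), extra_steps=10, tabu_size=5):
    current = best = frozenset(start)
    best_score = score(current)
    tabu = deque(maxlen=tabu_size)   # most recent moves; old ones drop out
    stalled = 0                      # consecutive steps without a new best
    while stalled < extra_steps:
        allowed = [(score(g), m, g) for m, g in moves(nodes, current)
                   if inverse(m) not in tabu]
        if not allowed:
            break
        s, m, g = min(allowed, key=lambda t: t[0])
        current = g                  # move even if the score gets worse
        tabu.append(m)
        if s < best_score:
            best, best_score, stalled = g, s, 0
        else:
            stalled += 1
    return best, best_score          # best graph seen, not the last one

# Toy score (hypothetical): number of edge mismatches against a target DAG.
nodes = [1, 2, 3]
target = frozenset({(1, 2), (2, 3)})
result = tabu_search(nodes, lambda g: len(g ^ target))
```

The toy score has no spurious local optima, so the example only checks that the search returns the best graph visited; the benefit of the tabu list shows up on scores where plain hill climbing stalls.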
4.2. Structure Learning for Nonlinear ZiG-DAGs
While our identifiability theory for nonlinear ZiG-DAGs is general, for structure learning, we need to make a specific choice of the nonlinear functions and . For example, one can expand and with the Fourier bases if the functional relationship is expected to be periodic. Similarly, if we expect that the relationship might show a very localized behavior, wavelets can be a good choice. In this paper, we employ spline basis expansion for and . Splines are popular in semiparametric function estimation because of the ease of their construction, their flexibility and accuracy in approximating a smooth function, and their interpretability through the representation by a compact set of basis functions and coefficients. Particularly, and are modeled by cubic B-splines,
where and are cubic B-spline basis functions with some pre-specified knots. In summary, the nonlinear ZiG-DAG model is parameterized by spline coefficients and the other node-specific model parameters . The BIC for each DAG can be evaluated in the same way as in (7), and we can use either exhaustive search or greedy search as in Section 4.1 for estimating the underlying graph for nonlinear ZiG-DAGs. The R implementation of the proposed method is available in the R package ZiGDAG (https://github.com/junsoukchoi/ZiGDAG.git).
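To show what a cubic B-spline expansion looks like computationally, the sketch below evaluates all basis functions at a point with the Cox-de Boor recursion and forms a spline as a weighted sum of them. The clamped knot vector and the coefficient values are hypothetical; the paper's knot placement is not specified in this section.

```python
def bspline_basis(x, knots, degree=3):
    """Values of all B-spline basis functions of the given degree at x.

    Uses the Cox-de Boor recursion with half-open knot intervals, so the
    right end of the knot range itself is not covered.
    """
    n = len(knots) - 1
    # Degree 0: indicator functions of the knot intervals.
    basis = [1.0 if knots[i] <= x < knots[i + 1] else 0.0 for i in range(n)]
    for d in range(1, degree + 1):
        new = []
        for i in range(n - d):
            left = right = 0.0
            if knots[i + d] > knots[i]:
                left = (x - knots[i]) / (knots[i + d] - knots[i]) * basis[i]
            if knots[i + d + 1] > knots[i + 1]:
                right = ((knots[i + d + 1] - x)
                         / (knots[i + d + 1] - knots[i + 1]) * basis[i + 1])
            new.append(left + right)
        basis = new
    return basis

def spline_value(x, coef, knots, degree=3):
    """Spline as a linear combination of basis functions."""
    return sum(c * b for c, b in zip(coef, bspline_basis(x, knots, degree)))

# Hypothetical clamped knots on [0, 3): 10 knots give 10 - 3 - 1 = 6 cubic bases.
knots = [0, 0, 0, 0, 1, 2, 3, 3, 3, 3]
values = bspline_basis(1.5, knots)
```

Because the function is linear in the coefficients once the basis is fixed, fitting a nonlinear relationship reduces to estimating a coefficient vector, which is what makes the same BIC-based search applicable.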
5. Experiments
We empirically evaluate the causal discovery performance of both linear and nonlinear ZiG-DAG models with synthetic data. We compare the proposed method with state-of-the-art BN learning algorithms for count data: the overdispersion scoring (ODS) algorithm for Poisson BNs (Park and Raskutti, 2015) and the moments ratio scoring (MRS) algorithm for generalized hypergeometric BNs (Park and Park, 2019). We also consider the ZiDAG for zero-inflated Gaussian data (Yu et al., 2020) with the transformation of the synthetic count data.
5.1. Linear ZiG-DAG
We first consider a linear ZiG-DAG, where the conditional distribution of each node has a probability generating function given by (3) with . That is, the conditional distribution of each node follows a zero-inflated hyper-Poisson distribution, which is quite flexible because the hyper-Poisson distribution allows for both overdispersion and underdispersion in count data. We sample data from the linear ZiG-DAG with different sample sizes and different numbers of nodes . For each simulation setting, we set the causal DAG by randomly generating a sparse DAG with edges. Given the DAG, we generate coefficients () in (4) from independent uniform distributions: and for and . The intercepts and in (4) are chosen uniformly at random from (−1.5, 1) and (1, 1.5), respectively. The additional parameters for the GHPD (hyper-Poisson distribution) are sampled as . These ranges are chosen so that the resulting observations are neither all zeros nor extremely large. Each simulation setting is repeated 50 times, and the simulated datasets have approximately 50% zeros.
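As an illustration of the node-level distribution used in this simulation, the sketch below computes and samples from a zero-inflated hyper-Poisson distribution. It assumes the standard hyper-Poisson form, with probability mass proportional to λ^x / (γ)_x, where (γ)_x is the rising factorial, and a zero-inflation probability `pi`; the parameterization in (3) and the regression of the parameters on parent nodes in (4) are not reproduced here, and all function names are hypothetical.

```python
import math
import random

def hyper_poisson_pmf(lam, gamma, tol=1e-12, max_terms=10_000):
    """Truncated pmf [P(X=0), P(X=1), ...] with P(X=x) proportional to lam^x / (gamma)_x."""
    weights = [1.0]                                # w_0 = 1
    while len(weights) < max_terms:
        x = len(weights) - 1
        nxt = weights[-1] * lam / (gamma + x)      # w_{x+1} = w_x * lam / (gamma + x)
        weights.append(nxt)
        if nxt < tol and gamma + x > lam:          # terms are now decreasing and negligible
            break
    total = math.fsum(weights)                     # approximates 1F1(1; gamma; lam)
    return [w / total for w in weights]

def zi_hyper_poisson_pmf(pi, lam, gamma):
    """Mix a point mass at zero (weight pi) with the hyper-Poisson pmf."""
    pmf = [(1.0 - pi) * p for p in hyper_poisson_pmf(lam, gamma)]
    pmf[0] += pi
    return pmf

def mean_and_variance(pmf):
    """Mean and variance of a distribution on 0, 1, 2, ... given its pmf."""
    mean = math.fsum(x * p for x, p in enumerate(pmf))
    second = math.fsum(x * x * p for x, p in enumerate(pmf))
    return mean, second - mean * mean

def sample_counts(pmf, size, rng):
    """Draw `size` counts from a pmf using a supplied random.Random instance."""
    return rng.choices(range(len(pmf)), weights=pmf, k=size)
```

Under this form, `gamma = 1` recovers the Poisson distribution, `gamma > 1` gives overdispersion, and `gamma < 1` gives underdispersion, which can be checked numerically with `mean_and_variance`.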
For ZiG-DAG, we implement both HC and TS algorithms as introduced in Section 4. Since they are greedy, initial values can affect the outcome. We consider two ways of initialization: first, we start HC (HC0) and TS (TS0) at the empty graph; and second, we initialize HC (HC1) and TS (TS1) with the DAGs obtained by MRS, which are expected to be better than empty graphs. To assess the causal discovery performance of each method, we calculate the true positive rate (TPR), the false discovery rate (FDR), and the Matthews correlation coefficient (MCC) for selection of true directed edges. MCC is a balanced measure of binary classification that takes a value between −1 and 1, with 1 indicating perfect agreement between the true and estimated graphs (i.e., perfect selection), 0 indicating random guessing, and −1 indicating total disagreement.
We summarize in Tables 2–3 the operating characteristics of each method for different combinations of the sample size and the number of nodes . For every simulation setting, the proposed methods consistently outperform ODS, MRS, and ZiDAG. Specifically, as the sample size increases, our greedy search algorithms find the causal structure more accurately, as expected. Our approaches also show satisfactory performance for various graph sizes including moderately large graphs (). We make additional observations on the difference between the HC and TS algorithms. When the greedy search algorithms start at the empty graph, i.e., HC0 and TS0, the performance of TS is better than that of HC. However, if we consider HC1 and TS1, for which we provide a more informative initial DAG, there is no statistically significant difference between HC1 and TS1 in most cases. In subsequent simulations, for simplicity, we leave out HC1, TS0 and TS1, and only consider HC0 to learn the proposed ZiG-DAGs from data.
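The three measures can be computed from the true and estimated edge sets as in the following sketch. Two conventions are assumed here because the text does not spell them out: every ordered pair of distinct nodes is treated as one binary decision, and an edge estimated in the wrong direction counts as both a false positive and a false negative. The function name `edge_metrics` is hypothetical.

```python
import math

def edge_metrics(true_edges, est_edges, n_nodes):
    """Return (TPR, FDR, MCC) for the selection of directed edges."""
    true_edges, est_edges = set(true_edges), set(est_edges)
    total = n_nodes * (n_nodes - 1)          # all ordered pairs (i, j) with i != j
    tp = len(true_edges & est_edges)
    fp = len(est_edges - true_edges)
    fn = len(true_edges - est_edges)
    tn = total - tp - fp - fn
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fdr = fp / (tp + fp) if tp + fp else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return tpr, fdr, mcc
```

For example, with three nodes, true edges {(0, 1), (1, 2)}, and estimated edges {(0, 1), (2, 1)}, there is one true positive, one false positive, one false negative, and three true negatives, giving TPR = 0.5, FDR = 0.5, and MCC = 0.25.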
Table 2:
Linear ZiG-DAG. Average operating characteristics over 50 simulations for different sample sizes with . The standard error for each statistic is given within parentheses.
| Method | Measure | Sample size 250 | Sample size 500 | Sample size 1000 | Sample size 2000 |
|---|---|---|---|---|---|
| HC0 | TPR | 0.728 (0.009) | 0.844 (0.009) | 0.924 (0.008) | 0.946 (0.007) |
| HC0 | FDR | 0.387 (0.009) | 0.226 (0.009) | 0.125 (0.009) | 0.083 (0.009) |
| HC0 | MCC | 0.660 (0.008) | 0.804 (0.009) | 0.897 (0.009) | 0.930 (0.008) |
| HC1 | TPR | 0.802 (0.009) | 0.892 (0.007) | 0.958 (0.005) | 0.971 (0.004) |
| HC1 | FDR | 0.354 (0.008) | 0.203 (0.008) | 0.101 (0.008) | 0.074 (0.007) |
| HC1 | MCC | 0.713 (0.008) | 0.840 (0.007) | 0.926 (0.007) | 0.947 (0.005) |
| TS0 | TPR | 0.739 (0.009) | 0.862 (0.009) | 0.932 (0.007) | 0.956 (0.006) |
| TS0 | FDR | 0.391 (0.009) | 0.222 (0.009) | 0.121 (0.009) | 0.074 (0.008) |
| TS0 | MCC | 0.663 (0.009) | 0.815 (0.008) | 0.903 (0.008) | 0.939 (0.007) |
| TS1 | TPR | 0.798 (0.009) | 0.889 (0.008) | 0.953 (0.006) | 0.971 (0.004) |
| TS1 | FDR | 0.369 (0.009) | 0.214 (0.009) | 0.110 (0.008) | 0.073 (0.007) |
| TS1 | MCC | 0.702 (0.009) | 0.832 (0.008) | 0.919 (0.007) | 0.948 (0.005) |
| ODS | TPR | 0.418 (0.006) | 0.454 (0.004) | 0.474 (0.005) | 0.474 (0.003) |
| ODS | FDR | 0.710 (0.005) | 0.726 (0.004) | 0.753 (0.004) | 0.776 (0.003) |
| ODS | MCC | 0.331 (0.005) | 0.335 (0.004) | 0.323 (0.004) | 0.305 (0.003) |
| MRS | TPR | 0.662 (0.006) | 0.755 (0.004) | 0.809 (0.004) | 0.816 (0.003) |
| MRS | FDR | 0.467 (0.006) | 0.454 (0.005) | 0.464 (0.004) | 0.505 (0.003) |
| MRS | MCC | 0.585 (0.006) | 0.633 (0.004) | 0.650 (0.004) | 0.626 (0.003) |
| ZiDAG | TPR | 0.619 (0.010) | 0.710 (0.009) | 0.756 (0.007) | 0.778 (0.007) |
| ZiDAG | FDR | 0.291 (0.010) | 0.243 (0.009) | 0.243 (0.008) | 0.252 (0.008) |
| ZiDAG | MCC | 0.656 (0.010) | 0.727 (0.009) | 0.751 (0.008) | 0.758 (0.008) |
Table 3:
Linear ZiG-DAG. Average operating characteristics over 50 simulations for different numbers of nodes with . The standard error for each statistic is given within parentheses.
| Method | Measure | 10 nodes | 25 nodes | 50 nodes | 100 nodes |
|---|---|---|---|---|---|
| HC0 | TPR | 0.948 (0.012) | 0.891 (0.013) | 0.924 (0.008) | 0.864 (0.007) |
| HC0 | FDR | 0.067 (0.015) | 0.166 (0.019) | 0.125 (0.009) | 0.255 (0.008) |
| HC0 | MCC | 0.932 (0.015) | 0.855 (0.017) | 0.897 (0.009) | 0.800 (0.007) |
| HC1 | TPR | 0.912 (0.014) | 0.977 (0.005) | 0.958 (0.005) | 0.870 (0.006) |
| HC1 | FDR | 0.141 (0.021) | 0.060 (0.009) | 0.101 (0.008) | 0.269 (0.008) |
| HC1 | MCC | 0.869 (0.020) | 0.956 (0.007) | 0.926 (0.007) | 0.795 (0.007) |
| TS0 | TPR | 0.964 (0.010) | 0.925 (0.011) | 0.932 (0.007) | 0.869 (0.006) |
| TS0 | FDR | 0.051 (0.013) | 0.130 (0.017) | 0.121 (0.009) | 0.254 (0.008) |
| TS0 | MCC | 0.951 (0.013) | 0.892 (0.015) | 0.903 (0.008) | 0.803 (0.007) |
| TS1 | TPR | 0.932 (0.014) | 0.969 (0.005) | 0.953 (0.006) | 0.874 (0.007) |
| TS1 | FDR | 0.103 (0.020) | 0.081 (0.009) | 0.110 (0.008) | 0.267 (0.009) |
| TS1 | MCC | 0.902 (0.019) | 0.941 (0.007) | 0.919 (0.007) | 0.798 (0.008) |
| ODS | TPR | 0.386 (0.010) | 0.419 (0.008) | 0.474 (0.005) | 0.543 (0.004) |
| ODS | FDR | 0.677 (0.009) | 0.775 (0.006) | 0.753 (0.004) | 0.761 (0.003) |
| ODS | MCC | 0.262 (0.010) | 0.265 (0.007) | 0.323 (0.004) | 0.350 (0.003) |
| MRS | TPR | 0.742 (0.010) | 0.874 (0.009) | 0.809 (0.004) | 0.733 (0.004) |
| MRS | FDR | 0.331 (0.011) | 0.423 (0.009) | 0.464 (0.004) | 0.623 (0.002) |
| MRS | MCC | 0.664 (0.010) | 0.695 (0.009) | 0.650 (0.004) | 0.519 (0.003) |
| ZiDAG | TPR | 0.640 (0.017) | 0.700 (0.012) | 0.756 (0.007) | 0.794 (0.006) |
| ZiDAG | FDR | 0.295 (0.016) | 0.368 (0.016) | 0.243 (0.008) | 0.254 (0.007) |
| ZiDAG | MCC | 0.632 (0.018) | 0.649 (0.015) | 0.751 (0.008) | 0.768 (0.006) |
Model Misspecification
When the conditional distribution of each node in a ZiG-DAG model is misspecified, our identifiability theory does not guarantee that we can find the true DAG. Therefore, an important question is how well our algorithms recover the true graph when misspecified distributions are used. We investigate this empirically. We choose the simulation scenario with and , and apply two different linear ZiG-DAG models. The first one is a linear ZiG-DAG (ZiG-DAG-HP) using the zero-inflated hyper-Poisson distribution as above. The second one is another linear ZiG-DAG (ZiG-DAG-NB) where the conditional distribution of each node is assumed to follow a zero-inflated negative binomial distribution. In ZiG-DAG-NB, every conditional distribution is misspecified, as the true data-generating model is ZiG-DAG-HP. The simulation results are shown in Figure 2. Both ZiG-DAG-HP and ZiG-DAG-NB perform better than ODS, MRS, and ZiDAG. Although ZiG-DAG-NB is a misspecified model, its performance is still better than that of the alternative state-of-the-art approaches. This suggests that the proposed ZiG-DAG is useful for learning the true causal structure even if the conditional distributions are misspecified.
Figure 2:

Box plots of operating characteristics for ZiG-DAG-HP, ZiG-DAG-NB, ODS, MRS, and ZiDAG applied to synthetic datasets generated from a linear ZiG-DAG with and .
5.2. Nonlinear ZiG-DAG
We next assess the performance of the nonlinear ZiG-DAG models. We sample data from a nonlinear ZiG-DAG with and , where the conditional distribution of each node is again assumed to be a zero-inflated hyper-Poisson. We randomly choose the true nonlinear functions and in (5) from three candidates, respectively:
and
The intercepts in (5) and the additional parameters for the hyper-Poisson are generated as in Section 5.1: , , and . For learning the nonlinear ZiG-DAG, we use a spline basis with a single knot placed at the 50% quantile of the data. We also consider the linear ZiG-DAG for comparison. Additionally, since ZiDAG allows for both linear and nonlinear causal relationships, in this simulation study we use the nonlinear implementation of ZiDAG.
We report in Table 4 the simulation results based on 50 repetitions. Overall, the nonlinear ZiG-DAG outperforms the other approaches, including the linear ZiG-DAG, in terms of FDR and MCC, although the linear ZiG-DAG attains a slightly higher TPR. In particular, the nonlinear ZiG-DAG yields a substantially lower FDR than its competitors. Not surprisingly, in this nonlinear simulation setting, ZiDAG gives better FDR and MCC than the linear ZiG-DAG.
Table 4:
Nonlinear ZiG-DAG. Average operating characteristics over 50 simulations with and . The standard error for each statistic is given within parentheses.
| Measure | Nonlinear ZiG-DAG | Linear ZiG-DAG | ODS | MRS | ZiDAG |
|---|---|---|---|---|---|
| TPR | 0.622 (0.014) | 0.662 (0.018) | 0.614 (0.008) | 0.588 (0.020) | 0.568 (0.013) |
| FDR | 0.179 (0.018) | 0.377 (0.017) | 0.417 (0.010) | 0.405 (0.022) | 0.249 (0.014) |
| MCC | 0.684 (0.017) | 0.596 (0.020) | 0.546 (0.007) | 0.540 (0.023) | 0.616 (0.014) |
5.3. Non-zero-inflation
Although the proposed ZiG-DAG models are primarily developed to deal with excessive zeros in count data, they are also applicable and robust to count data generated from non-zero-inflated distributions. We perform additional simulations to support this claim. We generate data from a negative binomial BN, which does not include any zero-inflation components. The parameters for the negative binomial BN are sampled uniformly at random in a similar way to Section 5.1. The resulting data have approximately 26% zeros, which is much lower than in the zero-inflated case. As in Section 5.1, we consider ZiG-DAG-HP and ZiG-DAG-NB, which assume a zero-inflated hyper-Poisson distribution and a zero-inflated negative binomial distribution for the conditional distribution of each node, respectively. Furthermore, we consider two distinct MRS algorithms that learn DAGs for hyper-Poisson BNs (MRS-HP) and negative binomial BNs (MRS-NB). Since MRS-NB requires the dispersion parameter of the negative binomial distribution as an input, we provide it with the true dispersion parameter value.
The simulation results are shown in Table 5. Even though the data are not zero-inflated, our approaches, ZiG-DAG-HP and ZiG-DAG-NB, generally show better performance than the alternative methods (ODS, MRS-HP, MRS-NB, and ZiDAG). Even though MRS-NB uses the correct distributional model and the true dispersion parameter, it performs worse than our methods as well as ZiDAG with respect to FDR and MCC. This might be because the performance of the MRS algorithm relies heavily on the choice of external methods for the skeleton estimation. In our experiments, MRS utilizes the R package MXM to estimate the skeleton of the DAG, which might provide unreliable skeleton estimates in this simulation setting.
Table 5:
Non-zero-inflation. Average operating characteristics over 50 simulations for a negative binomial BN with and . The standard error for each statistic is given within parentheses.
| Measure | ZiG-DAG-HP | ZiG-DAG-NB | ODS | MRS-HP | MRS-NB | ZiDAG |
|---|---|---|---|---|---|---|
| TPR | 0.692 (0.009) | 0.774 (0.007) | 0.306 (0.004) | 0.637 (0.008) | 0.714 (0.008) | 0.582 (0.008) |
| FDR | 0.342 (0.010) | 0.334 (0.008) | 0.658 (0.005) | 0.514 (0.009) | 0.455 (0.009) | 0.267 (0.010) |
| MCC | 0.668 (0.009) | 0.712 (0.007) | 0.310 (0.004) | 0.546 (0.008) | 0.615 (0.008) | 0.647 (0.009) |
5.4. Latent Confounders
Recall that Theorems 6 and 7 in Section 3 assume causal sufficiency (Condition 3), that is, there exist no latent confounders. Although the causal sufficiency assumption is common in the causal literature, in real applications it is difficult to check whether an unmeasured latent confounder exists, and there is always a possibility that we do not observe some variables of interest. Therefore, we test how sensitive our method is to the existence of latent confounders. We consider two true causal DAGs in Figure 3 that have three nodes, . Given each causal graph, we generate zero-inflated count data from a linear ZiG-DAG and treat as an unmeasured confounder (i.e., hide it from the algorithms).
Figure 3:

Two different confounding scenarios with a confounder .
The graph in Figure 3(a) assumes a causal effect of on , which is confounded by . For the simulation truth corresponding to Figure 3(a), we assume that the conditional distribution of each node is a zero-inflated hyper-Poisson similarly to Section 5.1. We denote and , and set and . We consider different levels of confounding effects where , while fixing the causal effect . For each level of confounding effect, we simulate 50 datasets with sample size . Figure 4(a) plots the average accuracy (ACC) of ZiG-DAG, over 50 repeated simulations, for identifying the true causal direction . We also consider MRS and ZiDAG as benchmarks. Our approach finds the true causal direction quite well across the confounding levels, while ZiDAG becomes worse when the confounding effect is relatively large (). MRS does not work well in this case.
Figure 4:

Plots of average ACC of ZiG-DAG, MRS, and ZiDAG against different levels of confounding effect under the confounding scenarios of (a) Figure 3(a) and (b) Figure 3(b).
Next, we consider the DAG in Figure 3(b). There is no causal effect between and whereas the confounding effect by is still present. Therefore, we set ; otherwise the same simulation truth as for Figure 3(a) is used. We consider the same confounding effects as for Figure 3(a). Figure 4(b) displays the resulting ACCs of ZiG-DAG, MRS, and ZiDAG, again averaged over 50 repeated simulations. In the range of confounding levels considered, ZiG-DAG does not add any spurious causal relation between and . In summary, the empirical results in Figure 4(a)–(b) indicate that the proposed ZiG-DAG is relatively robust to the presence of hidden confounders.
6. Real Data Analyses
We illustrate the utility of the proposed ZiG-DAG by performing two analyses of a scRNA-seq dataset (Li et al., 2017) that consists of 561 cells from 11 primary colorectal cancer (CRC) tumors and matched normal mucosa.
6.1. Real-Data Validation with Known Causal Relationships
Using the real scRNA-seq data and known causal relationships in the biological literature, we validate the causal identifiability of ZiG-DAG and compare it to other state-of-the-art alternatives. First, from the TRRUST database (Han et al., 2018), we extract a list of literature-curated pairs of a transcription factor and its target. This list establishes a biological ground truth of cause-and-effect relationships, with the transcription factors being causes and the targets being effects. We extract from our scRNA-seq data the pairs of genes on the list for which the maximal information coefficient (Reshef et al., 2011), a measure of linear and nonlinear association between two variables, is greater than 0.5. This results in 47 pairs for validation.
We apply the proposed ZiG-DAG to each pair of genes. Specifically, we use a nonlinear ZiG-DAG where the conditional distribution of each node is a zero-inflated hyper-Poisson distribution. For comparison, we apply MRS and ZiDAG to the same dataset. The accuracies of ZiG-DAG, MRS, and ZiDAG in identifying the true causal relationships are 77%, 51%, and 53%, respectively. Out of a total of 47 pairs, the proposed ZiG-DAG correctly identifies 36 causal relationships. This indicates that the proposed method is capable of finding true causal relationships in real data: the p-value for a binomial test is 0.0002 when compared to random guesses. Furthermore, among the three methods, ZiG-DAG has the highest accuracy.
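The comparison against random guessing can be reproduced with an exact binomial tail probability, as sketched below. The text does not state which variant of the binomial test was used, so a one-sided test against a success probability of 0.5 is assumed here; the function name is hypothetical.

```python
from math import comb

def binomial_tail_pvalue(k, n, p=0.5):
    """One-sided exact p-value: probability of at least k successes in n trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Usage: binomial_tail_pvalue(k, 47), where k is the number of gene pairs
# whose causal direction is identified correctly out of the 47 pairs.
```

A small value means that the observed number of correct directions would be unlikely if each direction were chosen by a fair coin flip.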
6.2. Reverse Engineering of Gene Regulatory Network
In this section we aim to reconstruct a gene regulatory network for genes from the TGF-β signaling pathway, which was shown to be the most activated signaling pathway in the analysis of Li et al. (2017). Before reconstructing the gene regulatory network, we filter cell doublets and multiplets using an R package for single cell genomics, Seurat (Hao et al., 2021), and retain 472 cells, which contain approximately 40% zeros. To estimate the gene regulatory network, we use the HC algorithm for the nonlinear ZiG-DAG that assumes a zero-inflated hyper-Poisson distribution as the conditional distribution of each node. We initialize the algorithm with the DAG obtained by MRS, which shows promising performance in the experiments of Section 5.1.
Figure 5 displays the estimated gene regulatory network for genes of the TGF-β signaling pathway. In total, 26 directed edges are found by our nonlinear ZiG-DAG. Some of the estimated gene regulations are consistent with known regulatory relationships in the existing biological literature. For example, the proposed model finds gene regulations involving SMAD proteins, which are main signal transducers for receptors of the TGF-β superfamily. Specifically, SMAD2 affects mRNA profiles of ZFYVE9 (Runyan et al., 2009) and RBX1 regulates the SMAD4 protein stability (Inoue and Imamura, 2008). Moreover, the estimated network also confirms the fact that ROCK is a well-known downstream effector of RHOA.
Figure 5:

The estimated gene regulatory network for genes of the TGF-β signaling pathway using the nonlinear ZiG-DAGs.
Furthermore, we find two hub genes in the estimated network: RHOA and SKP1, with out-degrees of 7 and 5, respectively. Hub genes are of particular importance because they are often involved in essential regulatory relationships. In fact, the importance of our hub genes in TGF-β signaling has been supported by the existing literature. RHOA is a small GTPase of the RHO family, whose inactivation plays key roles in colorectal cancer progression/metastasis by interacting with many members of the TGF-β signaling pathway (Rodrigues et al., 2014; Dopeso et al., 2018). SKP1 belongs to the SCF complex, which is a RING-type E3 ubiquitin ligase that participates in the degradation of a wide variety of proteins that regulate TGF-β signaling (Inoue and Imamura, 2008).
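Hub genes of this kind can be read off an estimated network by counting outgoing edges, as in the short sketch below. The edge list and node labels are purely hypothetical placeholders and are not the edges of Figure 5; the out-degree threshold is likewise an assumption, since the text does not state the cutoff used to call a gene a hub.

```python
from collections import Counter

def out_degrees(edges):
    """Number of outgoing edges per node, given (parent, child) pairs."""
    return Counter(parent for parent, _child in edges)

def hub_nodes(edges, min_out_degree):
    """Nodes whose out-degree reaches the threshold, largest first."""
    deg = out_degrees(edges)
    hubs = [node for node, d in deg.items() if d >= min_out_degree]
    return sorted(hubs, key=lambda node: (-deg[node], node))

# Hypothetical toy network: G1 regulates three genes, G2 regulates two.
toy_edges = [("G1", "G2"), ("G1", "G3"), ("G1", "G4"), ("G2", "G3"), ("G2", "G5")]
print(hub_nodes(toy_edges, min_out_degree=2))
```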
7. Discussion
We have proposed a novel BN model, ZiG-DAG, to infer causal relationships in observational zero-inflated count data. ZiG-DAGs are built upon a fairly general class of count distributions, namely generalized hypergeometric probability distributions, and therefore can account for various types of zero-inflated count data including overdispersed or underdispersed zero-inflated count data. We have also considered not only linear causal relationships but also nonlinear relationships. The identifiability theory for the proposed ZiG-DAGs has been established using a general proof technique, which can potentially be used to show identifiability of other discrete BN models. The proposed ZiG-DAG models are paired with two structure learning procedures, exhaustive search and greedy search. Through extensive numerical experiments and real data analysis, we have empirically validated the identifiability theory for ZiG-DAGs and have shown its superior performance against state-of-the-art alternatives.
There are a few future research directions that can be taken. First, the proposed approach can be extended to model interventional zero-inflated count data. This may be done by modifying the likelihood according to the do-calculus framework of Pearl (2009). The second direction is to establish an identifiability theory in the presence of latent confounders. Although we have empirically shown in Section 5.4 that ZiG-DAG is relatively robust against confounding, we do not yet have theoretical support for it; the proofs of our identifiability theorems are not directly applicable as they rely on the factorization (1), which requires causal sufficiency. Third, the acyclicity of BNs may be restrictive in applications where the underlying systems have feedback loops, for example, genetic systems. We may relax this acyclicity restriction by using directed cyclic graphs.
Acknowledgments
Ni’s research is partially supported by NSF DMS-1918851, NSF DMS-2112943, and NIH 1R01GM148974-01.
Appendix A. Proof of Proposition 4
First, we state a lemma on the conditions under which conditional distributions of the same node are identical in two discrete BNs, which is needed for the proof of Proposition 4.
Lemma 8 Suppose that and are any two discrete BNs, and is the topological ordering of the DAG of . If for , we have , and
| (8) |
then the conditional distribution of node is the same in and , i.e., .
Proof Since is the topological ordering of and , node cannot be a parent of nodes in both and , and hence in (8), and for are functions of all the variables but . The only terms in (8) that depend on are and . Furthermore, both and are functions of and , where we denote . Let be fixed. Then, (8) is simplified as
| (9) |
where and are constants not depending on . If we sum up (9) over all possible , we have
Due to the fact that and , we obtain , and since are arbitrary, it follows that for any possible and .■
We now provide the proof of Proposition 4. We show that if the joint distributions and are equivalent, then the identity of the causal structures and automatically follows from the proposition assumption.
Proof We assume that the joint distributions of and are the same, i.e.,
| (10) |
for all possible values of . Without loss of generality, assume that is the topological ordering of the DAG of , i.e., the nodes are labeled such that there is no directed edge in from later nodes to earlier nodes. Such orderings must exist (although not necessarily unique) because of the acyclicity of DAGs. We then show by mathematical induction that for all (hence ), which contradicts our assumption that .
We begin with the last node that has no child in the graph . Taking the ratio of (10) at and , we obtain
for all . Note that this implicitly assumes that the conditional distributions and , are positive over their supports. The above ratio can be rewritten using probability generating functions as follows:
The proposition assumption then indicates and . Moreover, Lemma 8 implies that .
Now assume that for any , it holds that , and . We then show that it also holds for . First, observe that the ratio of (10) at and is given by
| (11) |
Note that the induction assumption implies and
for . Therefore, we can simplify (11) into a similar form of the case :
It again follows from the proposition assumption and Lemma 8 that , and , which completes the proof.■
Appendix B. Proof of Corollary 5
Proof Due to Proposition 4 and its assumption, the equivalence of the joint distributions of and (i.e., ) implies , which in turn implies for any . Then by the corollary assumption, for any . Hence, .■
Appendix C. Proof of Theorem 6
Proof We use Proposition 4 to prove that the DAG is identifiable for the linear ZiG-DAGs. Furthermore, we use Corollary 5 to show that the model parameters are also identifiable up to permutations of and , given that (), which characterize the generalized hypergeometric function, and the link function are fixed for each . Let and be two arbitrary linear ZiG-DAGs and we show that they satisfy the sufficient condition of Proposition 4. Let be a node such that the identity (6) holds. We show that and . We use the superscript * to indicate parameters that define the linear ZiG-DAG .
First, we show . Suppose by way of contradiction that , and let such that ; such always exists due to the acyclicity of . For (6), let while fixing for . Then, if , it is simplified to
and if , it takes the following form:
where , and
Here, are some constants not depending on . The above identities are well defined since each conditional distribution (or equivalently, the derivatives of the probability generating function) should be positive over the entire support in our definition of the linear ZiG-DAG. Since is not a binary variable (i.e., it takes integers beyond {0, 1}), taking the ratio of each of the above equations at and , we observe that if ,
and if ,
and
for any possible positive value of in the support of . Since these equations hold for all possible values of , in both cases, we have that and . This indicates , which contradicts the assumption that .
We now show . Given the above result, , we can simplify (6) as
| (12) |
for a positive integer . For , taking the ratio of (12) at and , we obtain
which leads to . Next, if , (6) is simplified as
| (13) |
Again, take the ratio of (13) at and for . We have
and therefore . The finding that for implies that . Similarly, we get , and hence .
Next, we show that the model parameters of the linear ZiG-DAGs are also identifiable (up to permutations of and ), under the assumption that , and are fixed (i.e., , and ). We already know and thus we denote . According to Corollary 5, it suffices to show that implies that for , , and and , as well as and , are equivalent up to a permutation. If we take the ratio of at and , we observe that
| (14) |
for , and
| (15) |
for . Since (14) and (15) hold for any possible value of for , we must have for , , and . It is also easy to see that and , as well as and , are equivalent up to a permutation.■
Appendix D. Proof of Theorem 7
Proof To show that the graph structure of the nonlinear ZiG-DAGs is identifiable, we show and , assuming the identity (6) holds for node for two arbitrary nonlinear ZiG-DAGs and . Additionally, in order to establish the parameter identifiability of the nonlinear ZiG-DAGs, we show that equivalence of the parameters follows , under the assumption that for each , and are fixed. We use the superscript * to indicate parameters that define the nonlinear ZiG-DAG .
We first show . Suppose on the contrary that . Consider such that , as in the proof of Theorem 6. Under the nonlinear ZiG-DAGs, letting while fixing for all , we can simplify (6) as
if , and
if , where are some constants, , and . The above identities are well defined, because in our definition of the nonlinear ZiG-DAG, each conditional distribution (or equivalently, the derivatives of the probability generating function) should be positive over the entire support.
Because is not binary (i.e., it takes integers beyond {0, 1}), if we take the ratio of the first equation above at and , we obtain that if ,
If , we take the ratio of the second equation and obtain
and
for any possible positive value of in the support of . Similar to the proof of Theorem 6, we necessarily have in both cases that as well as , . Now, we consider the ratio of (6) at and with arbitrary . Observe that
holds for any , indicating and for all possible . All together, we have and for any possible value of . Because we have assumed for all , where are count random variables, it implies that for any value of in its support, and hence . This contradicts the assumption that .
Next, we show . Let . Since , by taking the ratio of (6) at and , we have
for all and , where
It easily follows that and for all possible values of , and combining this again with the assumption that , we have . Similarly, we obtain , which implies .
Lastly, we show that if , and are fixed (i.e., , and ), we can deduce from that for , , and and are equivalent to and up to permutations. Here, we denote as in the proof of Theorem 6. Consider the ratio of at and . We observe that if ,
| (16) |
and otherwise,
| (17) |
Note that (16) and (17) hold for all possible values of for . If it is combined with , our identifiability condition for the nonlinear functions and , we get , and and for all possible values of for . Then, it is clear that and , respectively, are equivalent to a permutation of and , which completes the proof.■
Contributor Information
Junsouk Choi, Department of Statistics, Texas A&M University, College Station, TX 77843, USA.
Yang Ni, Department of Statistics, Texas A&M University, College Station, TX 77843, USA.
References
- Andersen Holly. When to expect violations of causal faithfulness and why it matters. Philosophy of Science, 80(5):672–683, 2013. [Google Scholar]
- Barry Simon C and Welsh Alan H. Generalized additive modelling and zero inflated count data. Ecological Modelling, 157(2–3):179–188, 2002. [Google Scholar]
- Bollen Kenneth A. Structural equations with latent variables, volume 210. John Wiley & Sons, 1989. [Google Scholar]
- Castelletti Federico, Consonni Guido, Della Vedova Marco L, and Peluso Stefano. Learning markov equivalence classes of directed acyclic graphs: an objective bayes approach. Bayesian Analysis, 13(4):1235–1260, 2018. [Google Scholar]
- Chen Wenyu, Drton Mathias, and Wang Y Samuel. On causal discovery with an equalvariance assumption. Biometrika, 106(4):973–980, 2019. [Google Scholar]
- Chickering David Maxwell. Optimal structure identification with greedy search. Journal of machine learning research, 3(Nov):507–554, 2002. [Google Scholar]
- Choi Junsouk, Chapkin Robert, and Ni Yang. Bayesian causal structural learning with zero-inflated poisson bayesian networks. Advances in Neural Information Processing Systems, 33, 2020. [Google Scholar]
- Claeskens Gerda, Hjort Nils Lid, et al. Model selection and model averaging. Cambridge Books, 2008. [Google Scholar]
- Dacey Michael F. A family of discrete probability distributions defined by the generalized hypergeometric series. Sankhyā: The Indian Journal of Statistics, Series B, pages 243–250, 1972. [Google Scholar]
- Dopeso Higinio, Rodrigues Paulo, Bilic Josipa, Bazzocco Sarah, Cartón-García Fernando, Macaya Irati, De Marcondes Priscila Guimarães, Anguita Estefanía, Masanas Marc, Jiménez-Flores Lizbeth M, et al. Mechanisms of inactivation of the tumour suppressor gene rhoa in colorectal cancer. British journal of cancer, 118(1):106–116, 2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fox Jean-Paul. Multivariate zero-inflated modeling with latent predictors: Modeling feedback behavior. Computational Statistics & Data Analysis, 68:361–374, 2013. [Google Scholar]
- Han Heonjong et al. TRRUST v2: an expanded reference database of human and mouse transcriptional regulatory interactions. Nucleic Acids Research, 46(D1):D380–D386, 2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hao Yuhan, Hao Stephanie, Andersen-Nissen Erica, Mauck William M III, Zheng Shiwei, Butler Andrew, Lee Maddie J, Wilk Aaron J, Darby Charlotte, Zager Michael, et al. Integrated analysis of multimodal single-cell data. Cell, 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Heckerman David, Geiger Dan, and Chickering David M. Learning bayesian networks: The combination of knowledge and statistical data. Machine learning, 20(3):197–243, 1995. [Google Scholar]
- Hoyer Patrik O, Janzing Dominik, Mooij Joris M, Peters Jonas, Schölkopf Bernhard, et al. Nonlinear causal discovery with additive noise models. In NIPS, volume 21, pages 689–696. Citeseer, 2008. [Google Scholar]
- He Hua, Tang Wan, Wang Wenjuan, and Crits-Christoph Paul. Structural zeroes and zero-inflated models. Shanghai Archives of Psychiatry, 26(4):236, 2014.
- Inoue Yasumichi and Imamura Takeshi. Regulation of TGF-β family signaling by E3 ubiquitin ligases. Cancer Science, 99(11):2107–2112, 2008.
- Johnson Norman L, Kemp Adrienne W, and Kotz Samuel. Univariate discrete distributions, volume 444. John Wiley & Sons, 2005.
- Kalisch Markus and Bühlmann Peter. Estimating high-dimensional directed acyclic graphs with the PC-algorithm. Journal of Machine Learning Research, 8(3), 2007.
- Kang Yun, Norris Michael H, Zarzycki-Siek Jan, Nierman William C, Donachie Stuart P, and Hoang Tung T. Transcript amplification from single bacterium for transcriptome analysis. Genome Research, 21(6):925–935, 2011.
- Kemp Adrienne W. A wide class of discrete distributions and the associated differential equations. Sankhyā: The Indian Journal of Statistics, Series A, pages 401–410, 1968a.
- Kemp Adrienne Winifred. Studies in Univariate Discrete Distribution Theory Based on the Generalized Hypergeometric Function and Associated Differential Equations. PhD thesis, Queen's University of Belfast, 1968b.
- Li Huipeng, Courtois Elise T, Sengupta Debarka, Tan Yuliana, Chen Kok Hao, Goh Jolene Jie Lin, Kong Say Li, Chua Clarinda, Hon Lim Kiat, Tan Wah Siew, et al. Reference component analysis of single-cell transcriptomes elucidates cellular heterogeneity in human colorectal tumors. Nature Genetics, 49(5):708–718, 2017.
- Maathuis Marloes, Drton Mathias, Lauritzen Steffen, and Wainwright Martin. Handbook of graphical models. CRC Press, 2018.
- Opgen-Rhein Rainer and Strimmer Korbinian. From correlation to causation networks: a simple approximate learning algorithm and its application to high-dimensional plant gene expression data. BMC Systems Biology, 1(1):1–10, 2007.
- Park Gunwoong and Park Hyewon. Identifiability of generalized hypergeometric distribution (GHD) directed acyclic graphical models. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 158–166. PMLR, 2019.
- Park Gunwoong and Raskutti Garvesh. Learning large-scale Poisson DAG models based on overdispersion scoring. Advances in Neural Information Processing Systems, 28:631–639, 2015.
- Pearl Judea. Causality: Models, reasoning and inference. Cambridge University Press, 2009.
- Peters Jonas and Bühlmann Peter. Identifiability of Gaussian structural equation models with equal error variances. Biometrika, 101(1):219–228, 2014.
- Peters Jonas, Mooij Joris, Janzing Dominik, and Schölkopf Bernhard. Identifiability of causal graphs using functional models. In 27th Conference on Uncertainty in Artificial Intelligence (UAI 2011), pages 589–598. AUAI Press, 2011.
- Peters Jonas, Mooij Joris M, Janzing Dominik, and Schölkopf Bernhard. Causal discovery with continuous additive noise models. Journal of Machine Learning Research, 15:2009–2053, 2014.
- Reshef David N, Reshef Yakir A, Finucane Hilary K, Grossman Sharon R, McVean Gilean, Turnbaugh Peter J, Lander Eric S, Mitzenmacher Michael, and Sabeti Pardis C. Detecting novel associations in large data sets. Science, 334(6062):1518–1524, 2011.
- Rodrigues Paulo, Macaya Irati, Bazzocco Sarah, Mazzolini Rocco, Andretta Elena, Dopeso Higinio, Mateo-Lozano Silvia, Bilić Josipa, Cartón-García Fernando, Nieto Rocio, et al. RhoA inactivation enhances Wnt signalling and promotes colorectal cancer. Nature Communications, 5(1):1–15, 2014.
- Runyan Constance E, Hayashida Tomoko, Hubchak Susan, Curley Jessica F, and Schnaper H William. Role of SARA (Smad anchor for receptor activation) in maintenance of epithelial cell phenotype. Journal of Biological Chemistry, 284(37):25181–25189, 2009.
- Schölkopf Bernhard, Janzing Dominik, Peters Jonas, Sgouritsa Eleni, Zhang Kun, and Mooij Joris. On causal and anticausal learning. arXiv preprint arXiv:1206.6471, 2012.
- Shimizu Shohei, Hoyer Patrik O, Hyvärinen Aapo, Kerminen Antti, and Jordan Michael. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7(10), 2006.
- Spirtes Peter, Glymour Clark N, Scheines Richard, and Heckerman David. Causation, prediction, and search. MIT Press, 2000.
- Staub Kevin E and Winkelmann Rainer. Consistent estimation of zero-inflated count models. Health Economics, 22(6):673–686, 2013.
- Uhler Caroline, Raskutti Garvesh, Bühlmann Peter, and Yu Bin. Geometry of the faithfulness assumption in causal inference. The Annals of Statistics, pages 436–463, 2013.
- Verma Thomas and Pearl Judea. Causal networks: Semantics and expressiveness. In Machine intelligence and pattern recognition, volume 9, pages 69–76. Elsevier, 1990.
- Wang Y Samuel and Drton Mathias. High-dimensional causal discovery under non-Gaussianity. Biometrika, 107(1):41–59, 2020.
- Yu Shiqing, Drton Mathias, and Shojaie Ali. Directed graphical models and causal discovery for zero-inflated data. arXiv preprint arXiv:2004.04150, 2020.
