Skip to main content
NIHPA Author Manuscripts logoLink to NIHPA Author Manuscripts
. Author manuscript; available in PMC: 2025 Aug 11.
Published in final edited form as: J Mach Learn Res. 2023;24:200.

Model-Based Causal Discovery for Zero-Inflated Count Data

Junsouk Choi 1, Yang Ni 2
PMCID: PMC12337821  NIHMSID: NIHMS2092542  PMID: 40791739

Abstract

Zero-inflated count data arise in a wide range of scientific areas such as social science, biology, and genomics. Very few causal discovery approaches can adequately account for excessive zeros as well as various features of multivariate count data such as overdispersion. In this paper, we propose a new zero-inflated generalized hypergeometric directed acyclic graph (ZiG-DAG) model for inference of causal structure from purely observational zero-inflated count data. The proposed ZiG-DAGs exploit a broad family of generalized hypergeometric probability distributions and are useful for modeling various types of zero-inflated count data with great flexibility. In addition, ZiG-DAGs allow for both linear and nonlinear causal relationships. We prove that the causal structure is identifiable for the proposed ZiG-DAGs via a general proof technique for count data, which is applicable beyond the proposed model for investigating causal identifiability. Score-based algorithms are developed for causal structure learning. Extensive synthetic experiments as well as a real dataset with known ground truth demonstrate the superior performance of the proposed method against state-of-the-art alternative methods in discovering causal structure from observational zero-inflated count data. An application of reverse-engineering a gene regulatory network from a single-cell RNA-sequencing dataset illustrates the utility of ZiG-DAGs in practice.

Keywords: Bayesian network, Causal identifiability, Directed acyclic graph, Observational zero-inflated count data, Single-cell RNA-sequencing

1. Introduction

Discovering causal structure of an unknown system is an important task in practically all areas of science. Knowing the causal structure is not only useful for predicting a system’s behavior under external interventions, but also has implications for machine learning tasks such as covariate shift and transfer learning (Schölkopf et al., 2012). The most effective and principled way for causal discovery is to conduct controlled experiments. However, it is often expensive, unethical, or even impossible in certain fields such as genomics (Opgen-Rhein and Strimmer, 2007) and social sciences (Bollen, 1989). Hence, causal discovery approaches that can infer the unknown causal structures from purely observational data are often desired.

This paper considers causal discovery for purely observational zero-inflated count data. Observational zero-inflated count data are common across multiple disciplines, for instance, educational psychology (Fox, 2013), genomics (Kang et al., 2011), ecology (Barry and Welsh, 2002), behavior studies (Hua et al., 2014), and economics (Staub and Winkelmann, 2013). A specific application, by which we are motivated, is to reverse-engineer gene regulatory networks from single-cell RNA-sequencing (scRNA-seq) data. The scRNA-seq technology measures the abundance of mRNA within single cells, resulting in count data with excessive zeros because of technological limits in sequencing the low amounts of mRNA in individual cells. For causal structure learning from observational zero-inflated count data, we work under the framework of causal Bayesian networks (BNs), which have been widely used for representing causal relationships among variables via directed acylic graphs (DAGs).

Learning the structure of BNs is not trivial because the size of the space of possible graph structures grows super-exponentially in the number of variables. Furthermore, BNs may not be distinguishable from each other with observational data. Multiple DAGs can encode the same conditional independence assertions and in general, DAGs are identifiable only up to Markov equivalence class (MEC) in which all DAGs encode the same set of conditional independences (Heckerman et al., 1995). Therefore, in the past, many approaches have focused on identifying the MEC rather than individual DAGs (Spirtes et al., 2000; Chickering, 2002; Kalisch and Bühlman, 2007; Castelletti et al., 2018). For example, the well-known PC algorithm infers a set of conditional independencies and recovers a MEC that is compatible with the inferred conditional independencies (Spirtes et al., 2000). The GES algorithm performs greedy search over the space of MECs and obtains the best-scored MEC (Chickering, 2002). However, DAGs within the same MEC may have drastically different causal interpretations.

Since 2006, it has been shown that for some classes of BNs, the exact graph structure, not just the MEC, may be identifiable from observational data alone. For continuous variables, BNs are often represented by sparse additive noise models. Under this formulation, the underlying DAG is identifiable if the functional form of the additive noise model is linear with non-Gaussian noises (Shimizu et al., 2006; Wang and Drton, 2020) and if the functional form is nonlinear with mild regularity assumptions on the function-noise pair (Hoyer et al., 2008; Peters et al., 2011, 2014). Peters and Bühlmann (2014); Chen et al. (2019) have also shown that unique identification of DAG structure is possible under linear additive noise models with Gaussian noises having equal variances.

The vast majority of the existing works that establish identifiability theorems for BNs have focused on continuous variables; identifiability issues of BNs for count data are less studied. Park and Raskutti (2015) proposed linear Poisson BNs for observational count data and investigated the overdispersion scores to prove that the unique identification of the underlying DAG is possible. However, the applicability of Poisson BNs may be limited due to the restrictive assumption of Poisson distribution that the variance is equal to the mean. Park and Park (2019) generalized the idea of Poisson BNs to a family of generalized hypergeometric distributions that includes the Poisson distribution, the hyper-Poisson distribution, the negative binomial distribution, and many more. An identifiability theorem for the generalized hypergeometric BNs was established using the moment ratio scores. Although the generalized hypergeometric BNs are a quite general class of count BNs, they tend not to adequately model count data with excessive zeros.

There have been a few recent BNs that are fully identifiable for observational zero-inflated data. Using Hurdle conditional distributions, Yu et al. (2020) proposed fully identifiable BNs for zero-inflated Gaussian data. Recently, we (Choi et al., 2020) developed zero-inflated Poisson BNs for observational zero-inflated count data. We have shown, theoretically and empirically, that the underlying causal DAG can be identified from observational data alone. However, the zero-inflated Poisson BNs have the same limitation as Poisson BNs, that is, Poisson distribution is a restrictive distribution. In particular, the Poisson-based BNs do not adequately account for overdispersion, a common feature of count data. Hence it is desirable to further develop a more general class of count BNs that can account for a broad range of multivariate count data with excessive zeros.

In this paper, we introduce a fairly general class of count BNs for observational zero-inflated count data, termed zero-inflated generalized hypergeometric DAGs (ZiG-DAGs). We extend the zero-inflated Poisson BNs (Choi et al., 2020) to zero-inflated generalized hypergeometric models, which include many common count distributions. Therefore, the proposed ZiG-DAGs are capable of modeling various types of zero-inflated count data, for example, overdispersed zero-inflated count data. In addition, we allow for both linear and nonlinear causal relationships in order to flexibly capture real causality in practice whereas Choi et al. (2020) only considers linear causal relationships. Based on a new general proof technique, we prove that the proposed ZiG-DAG is uniquely identifiable, justifying its use for casual discovery. The general proof technique can be potentially used to check identifiability for other discrete BNs as well. The established identifiability theorems do not require the causal faithfulness assumption (Uhler et al., 2013) typically required by constraint-based algorithms. For the structure learning of ZiG-DAGs, we develop score-based algorithms: exhaustive search for small graphs and greedy search for moderate-to-large graphs. Specifically, we consider two different greedy search algorithms to deal with the local optima problem of greedy search. We empirically demonstrate that the proposed methods compare favorably against state-of-the-art alternatives. We also illustrate the utility of ZiG-DAGs in real-world problems using a scRNA-seq dataset.

The remainder of this paper is organized as follows. We set up necessary notations and definitions for BNs in Section 2.1 and we introduce the proposed ZiG-DAG models for observational zero-inflated count data in Section 2.2. Section 3 establishes identifiability theorems for the proposed ZiG-DAGs. In Section 4, we develop score-based algorithms for causal structure learning of ZiG-DAGs. We demonstrate the utility of our methods through synthetic data in Section 5 and real-world applications in Section 6. Section 7 provides our closing discussion.

2. Bayesian Networks for Observational Zero-inflated Count Data

2.1. Notation and Background

We start with some basic notations for DAGs and BNs. Let X=X1,,Xd denote a set of d random variables. A DAG 𝒢=(V,E) consists of a set of nodes V={1,,d} corresponding to the variables X and a set of directed edges EV×V representing the causal relationships between the nodes V without cycles. If we have a directed edge (j,k)E (or slightly abusing the notation, jkE) for j,kV, node j is called a parent of k and node k is called a child of j. We denote the set of parents of node j in 𝒢 by pa𝒢(j) and the set of children of j in 𝒢 by ch𝒢(j). Node k is said to be a descendant of node j if there exists a directed path j=j0j1jl=k and otherwise is said to be a non-descendant of j. We use nd𝒢(j) to denote the set of non-descendants of j (excluding j itself). A BN for X is a pair =(𝒢,p) with the joint distribution p(·) factorizing over 𝒢 as follows:

pX1,,Xd=j=1dpXjXpa𝒢(j), (1)

where Xpa𝒢(j)=Xk:kpa𝒢(j) and pXjXpa𝒢(j) is the conditional probability distribution of Xj given its parents. We say a joint distribution p(·) is (local) Markov with respect to a DAG 𝒢 if each variable Xj is independent of its non-descendants Xnd𝒢(j)=Xk:knd𝒢(j) given its parents Xpa𝒢(j). The factorization in (1) is equivalent to the Markov property of p(·) (Verma and Pearl, 1990). In this paper, we make the causal Markov assumption -p(·) is Markov with respect to the causal DAG 𝒢 – so that we can interpret 𝒢 causally; in other words, each node is assumed to be independent of all its non-effects conditional on all its direct causes.

In general, the DAG 𝒢 of a BN=(𝒢,p) is not identifiable from the joint distribution p(·). Indeed, the joint distribution p(·) is Markov with respect to many different DAGs including all fully connected DAGs. Therefore, we have many possible BNs with different graph structures for the same joint distribution. To overcome this indeterminacy, one can make additional assumptions and obtain a restricted model for which the graph is identifiable from the joint distribution. A common assumption in the literature for learning BNs is faithfulness. A joint distribution p(·) is faithful with respect to a DAG 𝒢 if the graph 𝒢 encodes all the conditional independence constraints in the joint distribution p(·). If faithfulness is assumed, DAGs are identifiable up to MEC (Spirtes et al., 2000). Two DAGs 𝒢 and 𝒢 are Markov equivalent if the two DAGs encodes the same set of conditional independence constraints and a MEC is defined by a set of DAGs that are Markov equivalent. For example, despite the seemingly different graph structures, the DAGs in Figure 1(a)(c) forms a MEC, which encodes the only conditional independence X1X2X3, whereas the DAG in Figure 1(d) encodes the marginal independence of X1 and X2 only and forms another MEC. Since both the Markov property and faithfulness only constrain conditional independencies in the joint distribution, we cannot distinguish DAGs in the same MEC, which impose the same set of conditional independence assertions. For instance, the well-known PC algorithm (Spirtes et al., 2000) and the GES algorithm (Chickering, 2002), under the faithfulness assumption, aim to find the best MEC rather than the best individual DAG.

Figure 1:

Figure 1:

Examples of DAGs with three nodes. DAGs in (a)-(c) are Markov equivalent and form a Markov equivalence class that encodes X1X2X3. DAG in (d) forms another Markov equivalence class that encodes X1X2.

In many applications of BNs, a specific family of distributions is assumed for the conditional distribution of each node given its parents. For example, we will assume that the conditional probability for each node comes from a zero-inflated count model. Even with such distributional assumptions, the DAG may still be non-identifiable due to the distribution equivalence. Two DAGs 𝒢 and 𝒢 are distribution equivalent if for every BN =(𝒢,p) there exists a different BN =𝒢,p such that the joint distributions are identical, i.e., pX1,,Xd=pX1,,Xd. For example, for Gaussian BNs or multinomial BNs, they are distribution equivalent if and only if they are Markov equivalent. Hence, we can identify only the MEC from the joint distribution for Gaussian and multinomial BNs. Hence, not surprisingly, if we assume that data are generated from one of the three DAGs in Figure 1 (a)(c), the best answer that we can achieve using Gaussian BNs or multinomial BNs is that one of them is the true model. This is unsatisfactory for many applications and it has been recently shown that there exist certain cases where we can overcome the issue of distribution equivalence and the graph structure is fully identifiable. The existing works often represent continuous BNs as sparse additive noise models and under this framework, the underlying DAG is identifiable if the functional form of the additive noise model is linear and the noises are non-Gaussian (Shimizu et al., 2006), if nonlinear functions are considered with very mild additional conditions (Hoyer et al., 2008; Peters et al., 2011), or if the functions are linear and the noises are Gaussian with equal variance (Peters and Bühlmann, 2014). However, most existing approaches focus on BNs for continuous data, and the identifiability of BNs for count data are much less studied (Park and Raskutti, 2015; Park and Park, 2019; Choi et al., 2020).

2.2. Zero-Inflated Generalized Hypergeometric Directed Acyclic Graphs

We consider a broad family of discrete distributions for count data. Kemp (1968a, b) defines a family of generalized hypergeometric probability distributions (GHPDs), which includes a lot of common probability distributions for count data and has many useful properties such as recurrence relationships for both their probabilities and their factorial moments. Let (a)k=a(a+1)(a+k-1) denote the ascending (rising) factorial with (a)0=1. The generalized hypergeometric function is then defined as

Fpqa1,,ap;b1,,bq;λ=i0a1iapiλib1ibqii!.

Note that a1,,ap are exchangeable and so are b1,,bq. A distribution is said to be a GHPD if its probability generating function can be written in the following form:

G(s;a,b,λ)=Fpqa1,,ap;b1,,bq;λsFpqa1,,ap;b1,,bq;λ, (2)

where a=a1,,ap and b=b1,,bq. A large number of discrete distributions for count data belong to the class of GHPDs, for example, binomial, Poisson, negative binomial, hypergeometric, beta-binomial, and beta-negative binomial. Table 1 provides some examples of GHPDs with their probability generating functions (see also Kemp 1968a; Dacey 1972; Johnson et al. 2005).

Table 1:

Examples of GHPDs and their probability generating functions

Distributions Probability generating function Parameters
Binomial F10(-n;;-ps/(1-p))F10(-n;;-p/(1-p)) 0<p<1
Poisson F00(;;θs)F00(;;θ) θ>0
Hyper-Poisson F11(1;ψ;θs)F11(1;ψ;θ) ψ>0,θ>0
Geometric F10(1;;qs)F10(1;;q) 0<q<1
Negative Binomial F10(k;;qs)F10(k;;q) k>0, 0<q<1
Hypergeometric F21(-n,-Np;N-Np-n+1;s)F21(-n,-Np;N-Np-n+1;1) n,NN,0<p<1
Beta-Negative Binomial F21(k,;k++m;s)F21(k,;k++m;1) k,,m0
Extended Generalized Waring F21(k,;k++m;θs)F21(k,;k++m;θ) k,>0,mR,0<θ<1

We define, by using the GHPDs, ZiG-DAGs for observational zero-inflated count data. In order to explicitly account for excessive zeros in count data, we adopt the zero-inflated model. We say a BN =(𝒢,p) for random counts X is a ZiG-DAG if for each node jV, the conditional distribution pXjXpa𝒢(j) of the factorization (1) has a probability generating function of the following form,

Gjs;Xpa𝒢(j)=πjXpa𝒢(j)+1-πjXpa𝒢(j)Gs;aj,bj,λjXpa𝒢(j), (3)

where Gs;aj,bj,λj is a GHPD probability generating function defined by (2) with aj=aj1,,ajpj and bj=bj1,,bjqj. Here, πj(·) and λj(·) are functions/mappings from 𝒳pa𝒢(j) to R, which connect the parents Xpa𝒢(j) of node j to its conditional distribution, where 𝒳pa𝒢(j){0,1,2,}pa𝒢(j). For a ZiG-DAG, the probability mass function of each conditional distribution is given by

PrXj=xXpa𝒢(j)=πjXpa𝒢(j)+1-πjXpa𝒢(j)GHP0;aj,bj,λjXpa𝒢(j)ifx=0,1-πjXpa𝒢(j)GHPx;aj,bj,λjXpa𝒢(j)ifx=1,2,,

where GHPx;aj,bj,λj is the probability mass function of a GHPD of which the probability generating function is given by Gs;aj,bj,λj. Particularly, πj(·)(0,1) is the probability that extra zeros occur in addition to the zeros that arise from the GHPD, and λj(·) is the power parameter of GHPD that is closely related to its moments. For example, in the Poisson probability generating function Gs;aj,bj,λj=F00;;λjs/F00;;λj,λj represents the mean of the Poisson distribution. As another example, in the negative binomial probability generating function Gs;aj,bj,λj=F10k;;λjs/F10k;;λj,λj denotes the probability of “success”, which can be reparametrized in terms of the first and second moments of the negative binomial distribution.

For πjXpa𝒢(j) and λjXpa𝒢(j), we consider both linear and nonlinear functional forms. We use the logit function, logit(·), as link function for πj, where logit(ρ)=log(ρ/(1-ρ)) for 0<ρ<1. Let hj(·),j=1,,d, denote any suitable link function for λj, which is assumed to be strictly increasing for invertibility. First, we define linear ZiG-DAGs by assuming logit πjXpa𝒢(j) and hjλjXpa𝒢(j) vary linearly with the parents Xpa𝒢(j) of node jV.

Definition 1 (Linear ZiG-DAGs) We say a BN =(𝒢,p) is a linear ZiG-DAG if the joint distribution p(·) factorizes with respect to the DAG 𝒢 as in (1) with each conditional distribution pXjXpa𝒢(j) having a probability generating function (3), where πjXpa𝒢(j) and λjXpa𝒢(j) are given by

logitπjXpa𝒢(j)=kpa𝒢(j)αjkXk+δj,hjλjXpa𝒢(j)=kpa𝒢(j)βjkXk+γj, (4)

with some strictly increasing functions hj.

The zero-inflated Poisson BN in our recent work (Choi et al., 2020) is a special case of the proposed linear ZiG-DAG. Furthermore, in order to allow more flexible causal relationships, we propose nonlinear ZiG-DAGs by adopting the additive model framework. Particularly, for each Xj, we model logitπjXpa𝒢(j) and hλjXpa𝒢(j) as the sum of nonlinear functions of Xk,kpa𝒢(j).

Definition 2 (Nonlinear ZiG-DAGs) We say a BN =(𝒢,p) is a nonlinear ZiG-DAG if the joint distribution p factorizes with respect to the DAG 𝒢 as in (1) with each conditional distribution pXjXpa𝒢(j) having a probability generating function (3), where πjXpa𝒢(j) and λjXpa𝒢(j) are given by

logitπjXpa𝒢(j)=kpa𝒢(j)fjkXk+μj,hjλjXpa𝒢(j)=kpa𝒢(j)gjkXk+νj, (5)

with some strictly increasing functions hj and nonlinear functions fjk and gjk.

Without loss of generality, we assume that EfjkXk=EgjkXk=0,jV,kpa𝒢(j) because they can always otherwise be absorbed into the intercepts μj and νj. If zero-inflated random counts X follow either a linear ZiG-DAG or a nonlinear ZiG-DAG, they satisfy, by definition, the Markov property (conditional independencies) encoded in the underlying DAG. As mentioned earlier, BNs may not be identifiable due to Markov and distribution equivalence. In the next section, we will show that under the proposed ZiG-DAG models, the causal graph structure is identifiable from observational data alone.

3. Identifiability Theory

Recently, much effort has been directed to show that some assumptions on the conditional distribution of each node can impose non-independence constraints on the joint distribution so that the DAG of a BN is identifiable (Shimizu et al., 2006; Hoyer et al., 2008; Peters et al., 2014; Peters and Bühlmann, 2014). However, the existing literature mostly addresses the identifiability issue of BNs for continuous data, and there are much fewer identifiability results on BNs for count data. Park and Raskutti (2015); Park and Park (2019) developed BNs by using Poisson and the generalized hypergeometric family and showed that their causal orderings are identifiable. Our recent work (Choi et al., 2020) investigated the identifiability of the zero-inflated Poisson BNs. These three methods are special cases of the proposed ZiG-DAGs; however, none of their proof techniques is applicable in our setting. Therefore, before we state the main identifiability theories for both linear and nonlinear ZiG-DAGs, we provide a general framework to check the identifiability of discrete BNs. We provide a sufficient condition under which two discrete BNs with different DAGs must have different joint distributions. The proofs of the identifiability theorems for the proposed ZiG-DAGs are based on such a sufficient condition. Specifically, Proposition 4 formulates the sufficient condition in terms of probability generating functions for the conditional distribution of each node given its parents. As discrete distributions are often defined by the probability generating function, one can potentially use Proposition 4 to verify the identifiability of other discrete BNs. We first state two assumptions that our identifiability theories require:

Condition 3 We assume (i) there exists no unmeasured confounder, and (ii) there is no selection bias.

No unmeasured confounder (also known as causal sufficiency) and no selection bias are commonly adopted in the literature for causal structure learning (Chickering, 2002; Shimizu et al., 2006; Peters and Bühlmann, 2014; Maathuis et al., 2018). The BN factorization (1) does not hold if either or both assumptions in Condition 3 are violated.

For two discrete BNs =(𝒢,p) and *=𝒢*,p*, we denote pa(j)=pa𝒢(j), pa*(j)=pa𝒢*(j), ch(j)=ch𝒢(j), ch*(j)=ch𝒢*(j), and nd(j)=nd𝒢(j) for the ease of notation. We let Gjs;xpa(j) and Gj*s;xpa*(j) denote the probability generating functions for the conditional distributions pxjxpa(j) and p*xjxpa*(j) of and *, respectively. Note that by definition, pxjxpa(j)=Gjxj0;xpa(j) where Gjxj denotes the xj-th derivative of Gj.

Proposition 4 Let =(𝒢,p) and *=𝒢*,p* be any two discrete BNs, where 𝒢=(V,E) and 𝒢*=V,E*. Suppose that for every node jV for which

Gjxj+10;xpa(j)Gjxj0;xpa(j)=Gj*xj+10;xpa*(j)Gj*xj0;xpa*(j)kch*(j)nd(j)Gk*xk0;xpa*(k){j},xj+1Gk*xk0;xpa*(k){j},xj (6)

holds for all possible x1,,xd, it is also true that pa(j)=pa*(j) and ch*(j)nd(j)= holds. Then, if the joint distributions of and * are equivalent, i.e., p=p*, we have E=E*.

All proofs can be found in the appendices. The main idea behind the proof is to show that if the observational joint distributions p and p* are identical, then the proposition condition (6) necessarily implies that 𝒢 and 𝒢* have to be identical. Given any topological ordering of the graph 𝒢, we first show that the parent sets, in 𝒢 and 𝒢*, of the last node in the ordering have to be identical if the joint distributions are the same. Then we use mathematical induction to show that this is also true for any node of V and therefore 𝒢 and 𝒢* have to be identical.

Sometimes, it is also of interest to identify the model parameters. When the graph structure is identifiable, the parameter identifiability simplifies to a question of whether parameters associated with the conditional distribution of each node are identifiable. Since Proposition 4 implies that the graph structure is already identifiable, if a conditional distribution of each node is uniquely determined by the associated parameters, then we necessarily have a one-to-one correspondence between the joint distribution of the BN and the set of all associated parameters, and hence parameters are identifiable.

Corollary 5 Let Ξ=ΞjjV and Ξ*=Ξj*jV be sets of parameters that are associated with discrete BNs and *, respectively, where Ξj and Ξj* denote the sets of parameters associated with the conditional distribution of the node j only. Suppose that for any jV, the assumption in Proposition 4 holds and, furthermore, Ξj=Ξj* whenever pxjxpa(j)=p*xjxpa*(j). Then, if the joint distributions of and * are equivalent, i.e., p=p*, we have Ξ=Ξ*.

Using Proposition 4 and Corollary 5, we prove identifiability of the underlying DAG and the associated parameters for both the linear ZiG-DAG and the nonlinear ZiG-DAG in Theorems 6 and 7.

Theorem 6 Let =(𝒢,p) be a linear ZiG-DAG. Assume Condition 3 holds. Then, if the variables are not binary, the graph 𝒢 is identifiable from the joint distribution p(X). For given pj,qj (which characterizes the generalized hypergeometric function Fqjpj) and given hj (the link function for λj), there is a unique set of parameters for the linear ZiG-DAG that induces the observed distribution p(X).

Theorem 7 Let =(𝒢,p) be a nonlinear ZiG-DAG. Assume Condition 3 holds. Then, if the variables are not binary, the graph 𝒢 is identifiable from the joint distribution p(X). For given (pj,qj) (which characterizes the generalized hypergeometric function Fqjpj) and given hj (the link function for λj), there is a unique set of parameters for the nonlinear ZiG-DAG that induces the observed distribution p(X).

In Theorems 6 and 7, the assumptions that pj,qj,hj are given, which we make for parameter identifiability, indicates that the conditional distribution of each node in ZiG-DAGs should take a specific GHPD model among the family of GHPDs along with a specific link function. One example of such a combination would be the Poisson distribution with the log link function. Furthermore, the assumption excludes limiting cases of a given GHPD. For instance, if we use the negative binomial distribution for a node, we do not allow it to degenerate to a Poisson distribution, since the negative binomial distribution has pj=1,qj=0, while the Poisson distribution has pj=0,qj=0. This assumption seems reasonable since we have to decide which GHPD and link function to use in practice. In Sections 5 and 6, for the proposed ZiG-DAG, we consider the Poisson distribution, the hyper-Poisson distribution, and the negative binomial distribution with the log link function. With such choices of GHPD and link function, Theorems 6 and 7 state that both the causal structure and the model parameters for the proposed ZiG-DAGs are fully identifiable from the joint distribution.

Theorems 6 and 7 do not require faithfulness to prove that the exact graph structure is identifiable under the proposed ZiG-DAG models. While continuous BNs such as linear Gaussian BNs may have accidental cancellation of positive and negative causal effects and hence may become unfaithful, the proposed ZiG-DAGs do not allow such cancellation due to inherent asymmetry of count distributions. Faithfulness can be violated in an equilibriummaintaining system such as a biological system (Andersen, 2013) and in datasets with limited sample size (Uhler et al., 2013). In such cases, therefore, causal discovery approaches that require the faithfulness assumption are not favorable. In our specific motivating application of reverse-engineering gene regulatory networks from scRNA-seq data, one should avoid the common practice of “Gaussianizing” raw scRNA-seq data because then one needs to additionally make the faithfulness assumption that may not be suitable in gene regulatory systems; instead, directly working with raw zero-inflated count data with the proposed ZiG-DAG does not suffer from this limitation.

4. Algorithms

In this section, we discuss algorithms for learning the causal structures of both the linear ZiG-DAGs and the nonlinear ZiG-DAGs. We will consider score-based approaches, which complement the Bayesian inference procedure developed in our recent work (Choi et al., 2020).

4.1. Structure Learning for Linear ZiG-DAGs

Suppose that we are given zero-inflated count data x=x(1),,x(n) that are n independent realizations of X from a linear ZiG-DAG model =(𝒢,p). For the linear ZiG-DAG, we denote the model parameters by θ={α,β,δ,γ,a,b} with α=αjk(j,k)E,β=βjk(j,k)E,δ=δjjV,γ=γjjV,a=ajjV, and b=bjjV. We let p(·θ,𝒢) denote the joint distribution of the linear ZiG-DAG given the model parameters θ and the DAG 𝒢. We score each DAG by the Bayesian information criterion (BIC),

BIC𝒢x=-2i=1nlogpxiθˆ,𝒢+θlogn, (7)

where θˆ denotes the maximum likelihood estimate of the model parameters and |θ| denotes the number of model parameters. As the individual DAG is identifiable for the proposed ZiG-DAGs, the consistency of the BIC ensures that the true DAG uniquely achieves the minimum BIC with probability converging to 1 as n (Claeskens et al., 2008). We take two strategies to minimize the BIC given by (7) with respect to the DAG 𝒢: (1) exhaustive search and (2) greedy search.

Exhaustive Search

For small graphs where the number of nodes d is small, the BIC can be minimized by computing the scores for all possible DAGs and find the DAG with the lowest score. This approach is exact and is useful for small d (say, d4). As the number of nodes d grows, however, this approach becomes computationally infeasible very quickly because the number of DAGs grows super-exponentially in d.

Greedy Search

For larger graphs, exhaustive search is infeasible; we will use greedy search instead. Greedy search algorithms in the context of BN learning consider local moves from the current graph and makes the locally optimal choice at each iteration. We consider two strategies, hill climbing (HC) and tabu search (TS) algorithms.

The HC algorithm explores the neighborhood of the current DAG in the space of all possible DAGs. The neighborhood is defined using local moves. At each iteration, the algorithm scores all the DAGs that can be reached from the current graph by an edge addition, deletion, or reversal. The current DAG is then replaced by the DAG that provides the largest improvement, i.e., largest decrease in BIC in our case. We stop the algorithm if the improvement is no longer possible. We summarize the HC procedure in Algorithm 1. Although this algorithm finds a local optimal graph, there is no guarantee that the graph obtained by HC is a global optimum.

In order to avoid being trapped in local optima, the TS algorithm allows s additional local moves (edge addition, deletion, and reversal) when we reach a local optimal graph for which the score cannot be improved. These additional steps explore new territories around the local optimum even if they do not improve the score and may find new direction to arrive at a better structure. Note that the final solution should be the best DAG found anywhere during the search, not the DAG at which the algorithm stops. Furthermore, we keep a list (the tabu list) of all local moves that we have applied within the last t iterations. During the search over the neighborhood of the current DAG, our TS algorithm do not consider local modifications that reverse the local moves in the tabu list. For example, if we add an edge jk, we cannot delete the edge in the next t steps. This forces the search to explore new directions in the space of DAGs, instead of tweaking with the same parts of the current solution. Our TS algorithm is summarized in Algorithm 2.

Algorithm 1.

Hill climbing

1: Input: data x, initial DAG 𝒢0.
2: Compute BIC𝒢0x and set BICmax=BIC𝒢0x.
3: Set 𝒢max=𝒢0.
4: repeat
5:  Initialize Improvement=false.
6: for all DAGs 𝒢 reachable from 𝒢max by an edge addition, deletion, or reversal do
7:   Compute BIC𝒢x.
8:   if BIC𝒢x<BICmax then
9:    Set 𝒢max=𝒢 and BICmax=BIC𝒢x.
10:    Set Improvement=true.
11:   end if
12: end for
13: until Improvment is false
14: Output: DAG 𝒢max.
Algorithm 2.

Tabu search

1: Input: data x, initial DAG 𝒢0, number of additionally allowed steps s, size of the tabu list t.
2: Compute BIC𝒢0x and set BICmax=BIC𝒢0x.
3: Set 𝒢*=𝒢max=𝒢0.
4: Initialize LastImprovement=0.
5: while LastImprovement<s do
6:  Initialize BIC*=.
7: for all DAGs 𝒢 reachable from 𝒢* by an edge addition, deletion, or reversal do
8:   if 𝒢 does not reverse local moves in the tabu list, (i.e., in the last t steps) then
9:    Compute BIC𝒢x.
10:    if BIC𝒢x<BIC* then
11:     Set 𝒢*=𝒢 and BIC*=BIC𝒢x.
12:    end if
13:   end if
14: end for
15: if BIC*<BICmax then
16:   Set 𝒢max=𝒢* and BICmax=BIC*.
17:   Set LastImprovement=0.
18: else
19:   Set LastImprovement=LastImprovement+1.
20: end if
21: end while
22: Output: DAG 𝒢max.

4.2. Structure Learning for Nonlinear ZiG-DAGs

While our identifiability theory for nonlinear ZiG-DAGs is general, for structure learning, we need to make specific choice of the nonlinear functions fjk and gjk. For example, one can expand fjk and gjk with the Fourier bases if the functional relationship is expected to be periodic. Similarly, if we expect that the relationship might show a very localized behavior, wavelets can be a good choice. In this paper, we employ spline basis expansion for fjk and gjk. Splines are popular in semiparametric function estimation because of the ease of their construction, their flexibility and accuracy to approximate a smooth function, and their interpretability through the representation by a compact set of basis functions and coefficients. Particularly, fjk and gjk are modeled by cubic B-splines,

fjk(·)=l=1MfζjklBjkl(·)andgjk(·)=l=1MgηjklCjkl(·),

where Bjkl(·)l=1Mf and Cjkl(·)l=1Mg are cubic B-spline basis functions with some pre-specified knots. In summary, the nonlinear ZiG-DAG model is parameterized by spline coefficients ζjkl,ηjkl and the other node-specific model parameters μj,νj,aj,bj. The BIC for each DAG can be evaluated in the same way with (7) and we can use either exhaustive search or greedy search as in Section 4.1 for estimating the underlying graph for nonlinear ZiG-DAGs. The R implementation of the proposed method is available in the R package ZiGDAG (https://github.com/junsoukchoi/ZiGDAG.git).

5. Experiments

We empirically evaluate the causal discovery performance of both linear and nonlinear ZiG-DAG models with synthetic data. We compare the proposed method with state-of-the-art BN learning algorithms for count data: the overdispersion scoring (ODS) algorithm for Poisson BNs (Park and Raskutti, 2015) and the moments ratio scoring (MRS) algorithm for generalized hypergeometric BNs (Park and Park, 2019). We also consider the ZiDAG for zero-inflated Gaussian data (Yu et al., 2020) with the log(x+1) transformation of the synthetic count data.

5.1. Linear ZiG-DAG

We first consider a linear ZiG-DAG, where the conditional distribution of each node has a probability generating function given by (3) with Gs;aj,bj,λj=F111;ψj,λjs/F111;ψj,λj. That is, the conditional distribution of each node follows a zero-inflated hyper-Poisson, which is a quite flexible distribution as the hyper-Posson distribution allows for both overdispersion and underdisperion in count data. We sample data from the linear ZiG-DAG with different sample sizes n{250,500,1000,2000} and different numbers of nodes d{10,25,50,100}. For each simulation setting, we set the causal DAG 𝒢 by randomly generating a sparse DAG with d edges. Given the DAG, we generate coefficients (αjk,βjk) in (4) from independent uniform distributions: αjk~U(0.5,2) and βjk~U(-2,-0.5) for kpa𝒢(j) and jV. The intercepts δj and γj in (4) are chosen uniformly at random from (−1.5, 1) and (1, 1.5), respectively. The additional parameters ψj for the GHPD (hyper-Poisson distribution) are sampled as logψj~U(-2,2). These ranges are chosen so that the resulting observations are not all zeros or do not have extremely large values. Each simulation setting is repeated 50 times, and the simulated datasets have ~ 50% zeros.

For ZiG-DAG, we implement both HC and TS algorithms as introduced in Section 4. Since they are greedy, initial values can affect the outcome. We consider two ways of initialization: first, we start HC (HC0) and TS (TS0) at the empty graph; and second, we initialize HC (HC1) and TS (TS1) with the DAGs obtained by MRS, which are expected to be better than empty graphs. To assess the causal discovery performance of each method, we calculate the true positive rate (TPR), the false discovery rate (FDR), and the Mattews correlation coefficient (MCC) for selection of true directed edges. MCC is a balanced measure of binary classification that takes a value between −1 and 1 with 1 indicating perfect agreement between the true and estimated graphs (i.e., perfect selection), 0 indicating random guess, and −1 indicating total disagreement.

We summarize in Tables 23 the operating characteristics of each method for different combinations of the sample size n and the number of nodes d. For every simulation setting, the proposed methods consistently outperform ODS, MRS, and ZiDAG. Specifically, as the sample size increases, our greedy search algorithms find the causal structure more accurately as expected. Our approaches also show satisfactory performance for various graph sizes including moderately large graphs (d=100). We make additional observations for difference between the HC and TS algorithms. When the greedy search algorithms starts at the empty graph, i.e., HC0 and TS0, the performance of TS is better than that of HC. However, if we consider HC1 and TS1 for which we provide more informative initial DAG, there is no statistically significant difference between HC1 and TS1 in most cases. In subsequent simulations, for simplicity, we leave out HC1, TS0 and TS1, and only consider HC0 to learn the proposed ZiG-DAGs from data.

Table 2:

Linear ZiG-DAG. Average operating characteristics over 50 simulations for different sample sizes n{250,500,1000,2000} with d=50. The standard error for each statistic is given within parentheses.

Sample size, n
Method Measure 250 500 1000 2000
HC0 TPR 0.728 (0.009) 0.844 (0.009) 0.924 (0.008) 0.946 (0.007)
FDR 0.387 (0.009) 0.226 (0.009) 0.125 (0.009) 0.083 (0.009)
MCC 0.660 (0.008) 0.804 (0.009) 0.897 (0.009) 0.930 (0.008)
HC1 TPR 0.802 (0.009) 0.892 (0.007) 0.958 (0.005) 0.971 (0.004)
FDR 0.354 (0.008) 0.203 (0.008) 0.101 (0.008) 0.074 (0.007)
MCC 0.713 (0.008) 0.840 (0.007) 0.926 (0.007) 0.947 (0.005)
TS0 TPR 0.739 (0.009) 0.862 (0.009) 0.932 (0.007) 0.956 (0.006)
FDR 0.391 (0.009) 0.222 (0.009) 0.121 (0.009) 0.074 (0.008)
MCC 0.663 (0.009) 0.815 (0.008) 0.903 (0.008) 0.939 (0.007)
TS1 TPR 0.798 (0.009) 0.889 (0.008) 0.953 (0.006) 0.971 (0.004)
FDR 0.369 (0.009) 0.214 (0.009) 0.110 (0.008) 0.073 (0.007)
MCC 0.702 (0.009) 0.832 (0.008) 0.919 (0.007) 0.948 (0.005)
ODS TPR 0.418 (0.006) 0.454 (0.004) 0.474 (0.005) 0.474 (0.003)
FDR 0.710 (0.005) 0.726 (0.004) 0.753 (0.004) 0.776 (0.003)
MCC 0.331 (0.005) 0.335 (0.004) 0.323 (0.004) 0.305 (0.003)
MRS TPR 0.662 (0.006) 0.755 (0.004) 0.809 (0.004) 0.816 (0.003)
FDR 0.467 (0.006) 0.454 (0.005) 0.464 (0.004) 0.505 (0.003)
MCC 0.585 (0.006) 0.633 (0.004) 0.650 (0.004) 0.626 (0.003)
ZiDAG TPR 0.619 (0.010) 0.710 (0.009) 0.756 (0.007) 0.778 (0.007)
FDR 0.291 (0.010) 0.243 (0.009) 0.243 (0.008) 0.252 (0.008)
MCC 0.656 (0.010) 0.727 (0.009) 0.751 (0.008) 0.758 (0.008)

Table 3:

Linear ZiG-DAG. Average operating characteristics over 50 simulations for different numbers of nodes d{10,25,50,100} with n=1000. The standard error for each statistic is given within parentheses.

Number of nodes, d
Method Measure 10 25 50 100
HC0 TPR 0.948 (0.012) 0.891 (0.013) 0.924 (0.008) 0.864 (0.007)
FDR 0.067 (0.015) 0.166 (0.019) 0.125 (0.009) 0.255 (0.008)
MCC 0.932 (0.015) 0.855 (0.017) 0.897 (0.009) 0.800 (0.007)
HC1 TPR 0.912 (0.014) 0.977 (0.005) 0.958 (0.005) 0.870 (0.006)
FDR 0.141 (0.021) 0.060 (0.009) 0.101 (0.008) 0.269 (0.008)
MCC 0.869 (0.020) 0.956 (0.007) 0.926 (0.007) 0.795 (0.007)
TS0 TPR 0.964 (0.010) 0.925 (0.011) 0.932 (0.007) 0.869 (0.006)
FDR 0.051 (0.013) 0.130 (0.017) 0.121 (0.009) 0.254 (0.008)
MCC 0.951 (0.013) 0.892 (0.015) 0.903 (0.008) 0.803 (0.007)
TS1 TPR 0.932 (0.014) 0.969 (0.005) 0.953 (0.006) 0.874 (0.007)
FDR 0.103 (0.020) 0.081 (0.009) 0.110 (0.008) 0.267 (0.009)
MCC 0.902 (0.019) 0.941 (0.007) 0.919 (0.007) 0.798 (0.008)
ODS TPR 0.386 (0.010) 0.419 (0.008) 0.474 (0.005) 0.543 (0.004)
FDR 0.677 (0.009) 0.775 (0.006) 0.753 (0.004) 0.761 (0.003)
MCC 0.262 (0.010) 0.265 (0.007) 0.323 (0.004) 0.350 (0.003)
MRS TPR 0.742 (0.010) 0.874 (0.009) 0.809 (0.004) 0.733 (0.004)
FDR 0.331 (0.011) 0.423 (0.009) 0.464 (0.004) 0.623 (0.002)
MCC 0.664 (0.010) 0.695 (0.009) 0.650 (0.004) 0.519 (0.003)
ZiDAG TPR 0.640 (0.017) 0.700 (0.012) 0.756 (0.007) 0.794 (0.006)
FDR 0.295 (0.016) 0.368 (0.016) 0.243 (0.008) 0.254 (0.007)
MCC 0.632 (0.018) 0.649 (0.015) 0.751 (0.008) 0.768 (0.006)

Model Misspecification

When the conditional distribution of each node in a ZiG-DAG model is misspecified, our identifiability theories do not guarantee that we can find the true DAG. Therefore, an important question is how well our algorithms recovers the true graph when misspecified distributions are used. We investigate this empirically. We choose the simulation scenario with n=1000 and d=50, and apply two different linear ZiG-DAG models. The first one is a linear ZiG-DAG (ZiG-DAG-HP) using the zero-inflated hyper-Poisson distribution as above. The second one is another linear ZiG-DAG (ZiG-DAG-NB) where the conditional distribution of each node is assumed to follow a zero-inflated negative binomial distribution. In ZiG-DAG-NB, every conditional distribution is misspecified, as the true data-generating model is ZiG-DAG-HP. The simulation results are shown in Figure 2. Both ZiG-DAG-HP and ZiG-DAG-NB are better than ODS, MRS, and ZiDAG. Although ZiG-DAG-NB is a misspecified model, its performance is still better than the alternative state-of-the-art approaches. This shows that the proposed ZiG-DAG is useful for learning the true causal structure even if the true conditional distributions are misspecified.

Figure 2:

Figure 2:

Box plots of operating characteristics for ZiG-DAG-HP, ZiG-DAG-NB, ODS, MRS, and ZiDAG applied to synthetic datasets generated from a linear ZiG-DAG with n=1000 and d=50.

5.2. Nonlinear ZiG-DAG

We next assess the performance of the nonlinear ZiG-DAG models. We sample data from a nonlinear ZiG-DAG with n=500 and d=10, where the conditional distribution of each node is again assumed to be a zero-inflated hyper-Poisson. We randomly choose the true nonlinear functions fjk and gjk in (5) from three candidates, respectively:

f1z=12zz-3,f2z=sinz,f3z=exp12z-1,

and

g1z=-12z-322,g2z=cosz,g3z=-12logz+1.

The intercepts μj,νj in (5) and the additional parameters ψj for the hyper-Poisson are generated as in Section 5.1: μj~U(-1.5,-1), νj~U(1,1.5), and logψj~U(-2,2). For learning the nonlinear ZiG-DAG, we use Mf=Mg=4 spline basis with a knot being placed at the 50% quantile of the data. We also consider the linear ZiG-DAG for comparison. Additionally, since ZiDAG allows for both linear and nonlinear causal relationships, in this simulation study, we use ZiDAG with nonlinear implementation.

We report in Table 4 the simulation results based on 50 repetitions. Overall, the nonlinear ZiG-DAG outperforms the other approaches including the linear ZiG-DAG. Especially, the nonlinear ZiG-DAG results in extremely low FDR compared to the other competitors. Not surprisingly, in this nonlinear simulation setting, ZiDAG gives better results than the linear ZiG-DAG.

Table 4:

Nonlinear ZiG-DAG. Average operating characteristics over 50 simulations with n=500 and d=10. The standard error for each statistic is given within parentheses.

Nonlinear ZiG-DAG Linear ZiG-DAG ODS MRS ZiDAG
TPR 0.622 (0.014) 0.662 (0.018) 0.614 (0.008) 0.588 (0.020) 0.568 (0.013)
FDR 0.179 (0.018) 0.377 (0.017) 0.417 (0.010) 0.405 (0.022) 0.249 (0.014)
MCC 0.684 (0.017) 0.596 (0.020) 0.546 (0.007) 0.540 (0.023) 0.616 (0.014)

5.3. Non-zero-inflation

Although the proposed ZiG-DAG models are primarily developed to deal with excessive zeros in count data, they are also applicable and robust to count data generated from non-zero-inflated distributions. We perform additional simulations to support this claim. We generate data from a negative binomial BN, which does not include any zero-inflation components. The parameters for the negative binomial BN are sampled uniformly at random in a similar way to Section 5.1. The resulting data have ~ 26% zeros, which are much less than the zero-inflated case. As in Section 5.1, we consider ZiG-DAG-HP and ZiG-DAG-NB that assume a zero-inflated hyper-Poisson distribution and a zero-inflated negative binomial distribution for the conditional distribution of each node, respectively. Furthermore, we consider two distinctive MRS algorithms that learn DAGs for hyper-Poisson BNs (MRS-HP) and negative binomial BNs (MRS-NB). Since MRS-NB requires an input of the dispersion parameter of the negative binomial distribution, we provide it with the true dispersion parameter value.

The simulation results are shown in Table 5. Even though the data are not zero-inflated, our approaches, ZiG-DAG-HP and ZiG-DAG-NB, generally show better performance than the alternative methods (ODS, MRS-HP, MRS-NB, and ZiDAG). Even though MRS-NB uses the correct distributional model and the true dispersion parameter, it shows worse performance than our methods as well as ZiDAG with respect to FDR and MCC. This might be because the performance of the MRS algorithm highly relies on the choice of external methods for the skeleton estimation. In our experiments, MRS utilizes the R package MXM to estimate the skeleton of DAG, which might provide unreliable skeleton estimates in this simulation setting.

Table 5:

Non-zero-inflation. Average operating characteristics over 50 simulations for a negative binomial BN with n=500 and d=50. The standard error for each statistic is given within parentheses.

ZiG-DAG-HP ZiG-DAG-NB ODS MRS-HP MRS-NB ZiDAG
TPR 0.692 (0.009) 0.774 (0.007) 0.306 (0.004) 0.637 (0.008) 0.714 (0.008) 0.582 (0.008)
FDR 0.342 (0.010) 0.334 (0.008) 0.658 (0.005) 0.514 (0.009) 0.455 (0.009) 0.267 (0.010)
MCC 0.668 (0.009) 0.712 (0.007) 0.310 (0.004) 0.546 (0.008) 0.615 (0.008) 0.647 (0.009)

5.4. Latent Confounders

Recall that Theorems 6 and 7 in Section 3 assume causal sufficiency (Condition 3), that is, there exist no latent confounders. Although the causal sufficiency assumption is common in the causal literature, in real applications, it is difficult to check whether an unmeasured latent confounder exists, and there is always a possibility that we do not observe some variables of interest. Therefore, we test how sensitive our method is to the existence of latent confounders. We consider two true causal DAGs in Figure 3 that have three nodes, X1,X2,X3. Given each causal graph, we generate zero-inflated count data from a linear ZiG-DAGs and treat X3 as an unmeasured confounder (i.e., hide it from the algorithms).

Figure 3:

Figure 3:

Two different confounding scenarios with a confounder X3.

The graph in Figure 3(a) assumes a casual effect of X2 on X1, which is confounded by X3. For the simulation truth corresponding to Figure 3(a), we assume that the conditional distribution of each node is a zero-inflated hyper-Poisson similarly to Section 5.1. We denote c=α13,β13=α23,β23 and d=α12,β12, and set δj=-1,γj=0 and ψj=5. We consider different levels of confounding effects c=σ×(-0.8,0.8) where σ{0,0.1,0.2,,1}, while fixing the causal effect d=(0.8,-0.8). For each level of confounding effect, we simulate 50 datasets with sample size n=250. Figure 4(a) plots the average accuracy (ACC) over 50 repeat simulations of ZiG-DAG for identifying the true causal direction X2X1. We also consider MRS and ZiDAG as benchmarks. Our approach finds the true causal direction quite well across the confounding levels, while ZiDAG becomes worse when the confounding effect is relatively large (σ>0.5). MRS does not work well in this case.

Figure 4:

Figure 4:

Plots of average ACC of ZiG-DAG, MRS, and ZiDAG against different levels of confounding effect σ under the confounding scenarios of (a) Figure 3(a) and (b) Figure 3(b).

Next, we consider the DAG in Figure 3(b). There is no causal effect between X1 and X2 whereas the confounding effect by X3 is still present. Therefore, we set d=0; otherwise the same simulation truth with Figure 3(a) is used. We consider the same confounding effects with Figure 3(a). Figure 4(b) displays the resulting ACCs of ZiG-DAG, MRS, and ZiG-DAG, again averaged over 50 repeat simulations. In the range of the confounding level being considered, ZiG-DAG does not add any spurious causal relation between X1 and X2. In summary, the empirical results in Figure 4(a)(b) indicate that the proposed ZiG-DAG is relatively robust to the presence of hidden confounders.

6. Real Data Analyses

We illustrate the utility of the proposed ZiG-DAG by performing two analyses of a scRNA-seq dataset (Li et al., 2017) that consists of 561 cells from 11 primary colorectal cancer (CRC) tumors and matched normal mucosa.

6.1. Real-Data Validation with Known Causal Relationships

Using the real scRNA-seq data and known causal relationships in the biological literature, we validate the causal identifiability of ZiG-DAG and compare it to other state-of-the-art alternatives. First, from the TRRUST database (Han et al., 2018), we extract a list of literature-curated pairs of transcription factor and its target. This list establishes a biological ground truth of cause-and-effect relationships with the transcription factors being causes and the targets being effects. We extract from our scRNA-seq data the pairs of genes on the list for which the maximum information coefficient (Reshef et al., 2011), a measure of linear and nonlinear correlations between two variables, is greater than 0.5. This results in 47 pairs for validation.

We apply the proposed ZiG-DAG to each pair of genes. Specifically, we use a nonlinear ZiG-DAG where the conditional distribution of each node is a zero-inflated hyper-Poisson distribution. For comparison, we apply MRS and ZiDAG to the same dataset. We calculate the accuracy of identifying true causal relationships for ZiG-DAG, MRS, and ZiDAG, and the results are 60%, 51%, and 53%, respectively. Out of a total of 47 pairs, the proposed ZiG-DAG correctly identifies 28 causal relationships. This indicates that the proposed method is capable of finding true causal relationships in real data: the p-value for a binomial test is 0.0002 when compared to random guesses. Furthermore, among the three count BNs, ZiG-DAG has the highest accuracy.

6.2. Reverse Engineering of Gene Regulatory Network

In this section we aim to reconstruct a gene regulatory network for d=26 genes from the TGF-β signaling pathway, which has been shown as the most activated signaling pathway in the analysis of Li et al. (2017). Before reconstructing the gene regulatory network, we filter cell doublets and multiplets using an R package for single cell genomics, Seurat (Hao et al., 2021), and retain 472 cells, which contain ~ 40% zeros. To estimate the gene regulatory network, we use the HC algorithm for the nonlinear ZiG-DAG that assumes a zero-inflated hyper-Poisson distribution as the conditional distribution of each node. We initialize the algorithm with the DAG obtained by MRS, which shows promising performance in the experiments of Section 5.1.

Figure 5 displays the estimated gene regulatory network for d=26 genes of the TGF-β signaling pathway. In total, 26 directed edges are found by our nonlinear ZiG-DAG. Some of the estimated gene regulations are consistent with known regulatory relationships in the existing biological literature. For example, the proposed model finds gene regulations involving SMAD proteins, which are main signal transducers for receptors of the TGF-β superfamily. Specifically, SMAD2 affects mRNA profiles of ZFYVE9 (Runyan et al., 2009) and RBX1 regulates the SMAD4 protein stability (Inoue and Imamura, 2008). Moreover, the estimated network also confirms the fact that ROCK is a well-known downstream effector of RHOA.

Figure 5:

Figure 5:

The estimated gene regulatory network for d=26 genes of the TGF-β signaling pathway using the nonlinear ZiG-DAGs.

Furthermore, we can find 2 hub genes in the estimated network: RHOA and SKP1 with out-degrees of 7 and 5. Hub genes are of particular importance because they are often involved in essential regulatory relationships. In fact, the importance of our hub genes in TGF-β signaling has been supported by the existing literature. RHOA is a small GTPase of the RHO family, whose inactivation plays key roles in colorectal cancer progression/metastasis by interacting with many members of TGF-β signaling pathway (Rodrigues et al., 2014; Dopeso et al., 2018). SKP1 belongs to the SCF complex, which is a RING-type E3 ubiquitin ligase that participates in the degradation of a wide variety of proteins that regulates TGF-β signaling (Inoue and Imamura, 2008).

7. Discussion

We have proposed a novel BN model, ZiG-DAG, to infer causal relationships in observational zero-inflated count data. ZiG-DAGs are built upon a fairly general class of count distributions, namely generalized hypergeometric probability distributions, and therefore can account for various types of zero-inflated count data including overdispersed or underdispersed zero-inflated count data. We have also considered not only linear causal relationships but also nonlinear relationships. The identifiability theory for the proposed ZiG-DAGs has been established using a general proof technique, which can potentially be used to show identifiability of other discrete BN models. The proposed ZiG-DAG models are paired with two structure learning procedures, exhaustive search and greedy search. Through extensive numerical experiments and real data analysis, we have empirically validated the identifiability theory for ZiG-DAGs and have shown its superior performance against state-of-the-art alternatives.

There are a few future research directions that can be taken. First, the proposed approach can be extended for modeling interventional zero-inflated count data. This may be done by modifying the likelihood according to the do-calculus framework of Pearl (2009). The second direction is to establish an identifiability theory in the presence of latent confounders. Although we have empirically shown in Section 5.4 that ZiG-DAG is relatively robust against confounding, we do not yet have theoretical support of it; the proofs of our identifiability theorems are not directly applicable as they rely on the factorization (1), which requires causal sufficiency. Third, the acyclicity of BNs may be restrictive in applications where the underlying systems have feedback loops, for example, genetic systems. We may relax this acyclicity restriction by using directed cyclic graphs.

Acknowledgments

Ni’s research is partially supported by NSF DMS-1918851, NSF DMS-2112943, and NIH 1R01GM148974-01.

Appendix A. Proof of Propostion 4

First, we state a lemma on the conditions under which conditional distributions of the same node are identical in two discrete BNs, which is needed for the proof of Proposition 4.

Lemma 8 Suppose that =(𝒢,p) and *=𝒢*,p* are any two discrete BNs, and (1,2,,d) is the topological ordering of the DAG 𝒢 of . If for jV, we have paj=pa*j,ch*jndj=, and

j=1jpxjxpa(j)=j=1jp*xjxpa*(j), (8)

then the conditional distribution of node j is the same in and *, i.e., pxjxpaj=p*xjxpa*j.

Proof Since (1,,d) is the topological ordering of 𝒢 and ch*jndj=, node j cannot be a parent of nodes 1,2,,j-1 in both and *, and hence in (8), pxjxpa(j) and p*xjxpa*(j) for j=1,,j-1 are functions of all the variables x1,,xd but xj. The only terms in (8) that depends on xj are pxjxpaj and p*xjxpa*j. Furthermore, both pxjxpaj and p*xjxpa*j are functions of xj and xpaj, where we denote pa(j)=paj=pa*j. Let x1,,xj-1,xj+1,,xd be fixed. Then, (8) is simplified as

Cpxjxpaj=C*p*xjxpa*j (9)

where C=j=1j-1pxjxpa(j) and C*=j=1j-1p*xjxpa*(j) are constants not depending on xj. If we sum up (9) over all possible xj, we have

Cxjpxjxpaj=C*xjp*xjxpa*j.

Due to the fact that xjpxjxpaj=1 and xjp*xjxpa*j=1, we obtain C=C*, and since x1,,xd are arbitrary, it follows that pxjxpaj=p*xjxpa*j for any possible xj and xpaj.■

We now provide the proof of Proposition 4. We show that if the joint distributions p and p* are equivalent, then the identity of the causal structures 𝒢 and 𝒢* automatically follows from the proposition assumption.

Proof We assume that the joint distributions of and * are the same, i.e.,

j=1dpxjxpa(j)=j=1dp*xjxpa*(j) (10)

for all possible values of x1,,xd. Without loss of generality, assume that (1,,d) is the topological ordering of the DAG 𝒢 of , i.e., the nodes are labeled such that there is no directed edge in 𝒢 from later nodes to earlier nodes. Such orderings must exist (although not necessarily unique) because of the acyclicity of DAGs. We then show by mathematical induction that pa(j)=pa*(j) for all j=1,2,,d (hence E=E*), which contradicts our assumption that EE*.

We begin with the last node d that has no child in the graph 𝒢. Taking the ratio of (10) at xd+1 and xd, we obtain

pxd+1xpa(d)pxdxpa(d)=p*xd+1xpa*(d)p*xdxpa*(d)kch*(d)p*xkxpa*(k){d},xd+1p*xkxpa*(k){d},xd

for all x1,,xd. Note that this implicitly assumes that the conditional distributions pxjxpa(j) and p*xjxpa*(j),j=1,,p, are positive over their supports. The above ratio can be rewritten using probability generating functions as follows:

Gdxd+10;xpa(d)Gdxd0;xpa(d)=Gd*xd+10;xpa*(d)Gd*xd0;xpa*(d)kch*(d)Gk*xk0;xpa*(k){d},xd+1Gk*xk0;xpa*(k){d},xd.

The proposition assumption then indicates pa(d)=pa*(d) and ch*(d)=ch*(d)nd(d)=. Moreover, Lemma 8 implies that pxdxpa(d)=p*xdxpa*(d).

Now assume that for any j=j+1,,d, it holds that pa(j)=pa*(j), ch*(j)nd(j)= and pxjxpa(j)=p*xjxpa*(j). We then show that it also holds for j=j. First, observe that the ratio of (10) at xj+1 and xj is given by

pxj+1xpajpxjxpajkchjpxkxpa(k)j,xj+1pxkxpa(k)j,xj=p*xj+1xpa*jp*xjxpa*jkch*jp*xkxpa*(k)j,xj+1p*xkxpa*(k)j,xj. (11)

Note that the induction assumption implies chj=ch*jndj and

pxkxpa(k)j,xj+1pxkxpa(k)j,xj=p*xkxpa*(k)j,xj+1p*xkxpa*(k)j,xj

for kchj=ch*jndj. Therefore, we can simplify (11) into a similar form of the case j=d:

Gjxj+10;xpajGjxj0;xpaj=Gj*xj+10;xpa*jGj*xj0;xpa*jkch*jndjGk*xk0;xpa*(k)j,xj+1Gk*xk0;xpa*(k)j,xj.

It again follows from the proposition assumption and Lemma 8 that paj=pa*j,ch*jndj=, and pxjxpaj=p*xjxpaj, which completes the proof.■

Appendix B. Proof of Corollary 5

Proof Due to Proposition 4 and its assumption, the equivalence of the joint distributions of and * (i.e., p=p*) implies E=E*, which in turn implies pxjxpa(j)=p*xjxpa*(j) for any jV. Then by the corollary assumption, Ξj=Ξj* for any jV. Hence, Ξ=Ξ*.■

Appendix C. Proof of Theorem 6

Proof We use Proposition 4 to prove that the DAG is identifiable for the linear ZiG-DAGs. Furthermore, we use Corollary 5 to show that the model parameters are also identifiable up to permutations of aj and bj, given that (pj,qj), which characterize the generalized hypergeometric function, and the link function hj are fixed for each jV. Let and * be two arbitrary linear ZiG-DAGs and we show that they satisfy the sufficient condition of Proposition 4. Let jV be a node such that the identity (6) holds. We show that pa(j)=pa*(j) and ch*(j)nd(j)=. We use the superscript * to indicate parameters that define the linear ZiG-DAG *.

First, we show ch*(j)nd(j)=. Suppose by way of contradiction that h*(j)nd(j), and let kch*(j)nd(j) such that ch*kch*(j)nd(j)=; such k always exists due to the acyclicity of 𝒢*. For (6), let xj=0 while fixing xl for lVj,k. Then, if kpa(j), it is simplified to

C1=C21+expαkj*+δ˜k*pk*Fpk*ak*;bk*;Hk*βkj*+γ˜k*1+expδ˜k*pk*Fpk*ak*;bk*;Hk*γ˜k*forxk=0C2rxkforxk0

and if kpa(j), it takes the following form:

C3Hjβjkxk+γ˜j1+expαjkxk+δ˜jpjFqjaj;bj;Hjβjkxk+γ˜j=C21+expαkj*+δk*pk*Fpk*ak*;bk*;Hk*βkj*+γk*1+expδk*pk*Fpk*ak*;bk*;Hk*γk*forxk=0C2rxkforxk0,

where Hj=hj-1,Hk*=hk*-1,δ˜j=lpa(j)kαjlxl+δj,γ˜j=lpa(j)kβjlxl+γj,δ˜k*=lpa*k{j}αkl*xl+δk*,γ˜k*=lpa*k{j}βkl*xl+γk*, and

r=Hk*βkj*+γ˜k*Hk*γ˜k*.

Here, C1,C2,C3 are some constants not depending on xk. The above identities are well defined since each conditional distribution (or equivalently, the derivatives of the probability generating function) should be positive over the entire support in our definition of the linear ZiG-DAG. Since xk is not a binary variable (i.e., it takes integers beyond {0, 1}), taking the ratio of each of the above equations at xk+1 and xk, we observe that if kpa(j),

r=1+expαkj*+δ˜k*pk*Fpk*ak*;bk*;Hk*βkj*+γ˜k*1+expδ˜k*pk*Fpk*ak*;bk*;Hk*γ˜k*=1;

and if kpa(j),

r=1+expαkj*+δ˜k*pk*Fpk*ak*;bk*;Hk*βkj*+γ˜k*1+expδ˜k*pk**Fpk**ak*;bk*;Hk*γ˜k*×Hjβjk+γ˜j1+expδ˜jpjFqjaj;bj;Hjγ˜jHjγ˜j1+expαjk+δ˜jpjFqjaj;bj;Hjβjk+γ˜j

and

r=Hjβjkxk+1+γ˜j1+expαjkxk+δ˜jpjFqjaj;bj;Hjβjkxk+γ˜jHjβjkxk+γ˜j1+expαjkxk+1+δ˜jpjFqjaj;bj;Hjβjkxk+1+γ˜j

for any possible positive value of xk in the support of Xk. Since these equations hold for all possible values of xl,lVj,k, in both cases, we have that r=1 and αkj*=βkj*=0. This indicates kch*(j), which contradicts the assumption that kch*(j)nd(j)().

We now show pa(j)=pa*(j). Given the above result, ch*(j)nd(j)=, we can simplify (6) as

aj1+xjajpj+xjHjkpa(j)βjkxk+γjbj1+xjbjqj+xjxj+1=aj1*+xjajpj**+xjHj*kpa*(j)βjk*xk+γj*bj1*+xjbjqj**+xjxj+1. (12)

for a positive integer xj. For lpa(j)pa*(j), taking the ratio of (12) at xl=1 and xl=0, we obtain

Hjkpa(j){l}βjkxk+βjl+γjHjkpa(j){l}βjkxk+γj=1,

which leads to βjl=0. Next, if xj=0, (6) is simplified as

aj1ajpjHjkpa(j)βjkxk+γjbj1bjqj1+expkpa(j)αjkxk+δjpjFqjaj;bj;Hjkpa(j)pa*(j)βjkxk+γj=aj1*ajpj**Hj*kpa*(j)βjk*xk+γj*bj1*bjqj**1+expkpa*(j)αjk*xk+δj*pj*Fqj*aj*;bj*;Hj*kpa(j)pa*(j)βjk*xk+γj*. (13)

Again, take the ratio of (13) at xl=1 and xl=0 for lpa(j)pa*(j). We have

1+expkpa(j){l}αjkxk+αjl+δjpjFqjaj;bj;Hjkpa(j)pa*(j)βjkxk+γj1+expkpa(j){l}αjkxk+δjpjFqjaj;bj;Hjkpa(j)pa*(j)βjkxk+γj=1,

and therefore αjl=0. The finding that αjl=βjl=0 for lpa(j)pa*(j) implies that pa(j)pa*(j)=. Similarly, we get pa*(j)pa(j)=, and hence pa(j)=pa*(j).

Next, we show that the model parameters of the linear ZiG-DAGs are also identifiable (up to permutations of aj and bj), under the assumption that pj,qj, and hj are fixed (i.e., pj=pj*,qj=qj*, and hj=hj*). We already know pa(j)=pa*(j) and thus we denote pa(j)=pa(j)=pa*(j). According to Corollary 5, it suffices to show that pxjxpa(j)=p*xjxpa*(j) implies that αjk=αjk*,βjk=βjk* for kpan(j), δj=δj*,γj=γj*, and aj and aj*, as well as bj and bj*, are equivalent up to a permutation. If we take the ratio of pxjxpa(j)=p*xjxpa*(j) at xj+1 and xj, we observe that

aj1+xjajpj+xjHjkpa(j)βjkxk+γjbj1+xjbjqj+xjxj+1=aj1*+xjajpj*+xjHjkpa(j)βjk*xk+γj*bj1*+xjbjqj*+xjxj+1 (14)

for xj0, and

aj1ajpjHjkpajβjkxk+γjbj1bjqj1+expkpajαjkxk+δjpjFqjaj;bj;Hjkpajβjkxk+γj=aj1*ajpj*Hjkpajβjk*xk+γj*bj1*bjqj*1+expkpajαjk*xk+δj*pjFqjaj*;bj*;Hjkpajβjk*xk+γj* (15)

for xj=0. Since (14) and (15) hold for any possible value of xk for kpa(j), we must have αjk=αjk*,βjk=βjk* for kpa(j), δj=δj*, and γj=γj*. It is also easy to see that aj and aj*, as well as bj and bj*, are equivalent up to a permutation.■

Appendix D. Proof of Theorem 7

Proof To show that the graph structure of the nonlinear ZiG-DAGs is identifiable, we show pa(j)=pa*(j) and ch*(j)nd(j)=, assuming the identity (6) holds for node jV for two arbitrary nonlinear ZiG-DAGs and *. Additionally, in order to establish the parameter identifiability of the nonlinear ZiG-DAGs, we show that equivalence of the parameters follows pxjxpa(j)=p*xjxpa*(j), under the assumption that for each jV,pj,qj, and hj are fixed. We use the superscript * to indicate parameters that define the nonlinear ZiG-DAG *.

We first show ch*(j)nd(j)=. Suppose on the contrary that ch*(j)nd(j). Consider kch*(j)nd(j) such that ch*kch*(j)nd(j)=, as in the proof of Theorem 6. Under the nonlinear ZiG-DAGs, letting xj=0 while fixing xl for all lVj,k, we can simplify (6) as

C1=C21+expfkj*1+μ˜k*pk*Fpk*ak*;bk*;Hk*gkj*1+ν˜k*1+expfkj*0+μ˜k*pk*Fpk*ak*;bk*;Hk*gkj*0+ν˜k*forxk=0C2rxkforxk0,

if kpa(j), and

C3Hjgjkxk+ν˜j1+expfjkxk+μ˜jpjFqjaj;bj;Hjgjkxk+ν˜j=C21+expfkj*1+μ˜k*pk*Fpk*ak*;bk*;Hk*gkj*1+ν˜k*1+expfkj*0+μ˜k*pk*Fpk*ak*;bk*;Hk*gkj*0+ν˜k*forxk=0C2rxkforxk0,

if kpa(j), where C1,C2,C3 are some constants, Hj=hj-1,Hk*=hk*-1,μ˜j=lpa(j)kfjlxl+μj,ν˜j=lpa(j)kgjlxl+νj,μ˜k*=lpa*k{j}fkl*xl+μk*,ν˜k*=lpa*k{j}gkl*xl+νk*, and r=Hk*gkj*(1)+ν˜k*Hk*gkj*(0)+ν˜k*. The above identities are well defined, because in our definition of the nonlinear ZiG-DAG, each conditional distribution (or equivalently, the derivatives of the probability generating function) should be positive over the entire support.

Because xk is not binary (i.e., it takes integers beyond {0, 1}), if we take the ratio of the first equation above at xk+1 and xk, we obtain that if kpa(j),

r=1+expfkj*1+μ˜k*pk*Fpk*ak*;bk*;Hk*gkj*1+ν˜k*1+expfkj*0+μ˜k*pk*Fpk*ak*;bk*;Hk*gkj*0+ν˜k*=1.

If kpa(j), we take the ratio of the second equation and obtain

r=1+expfkj*1+μ˜k*pk*Fpk*ak*;bk*;Hk*gkj*1+ν˜k*1+expfkj*0+μ˜k*pk*Fpk*ak*;bk*;Hk*gkj*0+ν˜k*×Hjgjk1+ν˜j1+expfjk0+μ˜jpjFqjaj;bj;Hjgjk0+ν˜jHjgjk0+ν˜j1+expfjk1+μ˜jpjFqjaj;bj;Hjgjk1+ν˜j

and

r=Hjgjkxk+1+ν˜j1+expfjkxk+μ˜jpjFqjaj;bj;Hjgjkxk+ν˜jHjgjkxk+ν˜j1+expfjkxk+1+μ˜jpjFqjaj;bj;Hjgjkxk+1+ν˜j

for any possible positive value of xk in the support of Xk. Similar to the proof of Theorem 6, we necessarily have in both cases that r=1 as well as fkj*(0)=fkj*(1), gkj*(0)=gkj*(1). Now, we consider the ratio of (6) at xk=1 and xk=0 with arbitrary xj0. Observe that

1+expfkj*xj+μ˜k*pk*Fqk*ak*;bk*;Hkgkj*xj+ν˜k*1+expfkj*xj+1+μ˜k*pk*Fqk*ak*;bk*;Hkgkj*xj+1+ν˜k*=1

holds for any xj0, indicating fkj*xj=fkj*xj+1 and gkj*xj=gkj*xj+1 for all possible xj0. All together, we have fkj*xj=fkj*xj+1 and gkj*xj=gkj*xj+1 for any possible value of xj. Because we have assumed Efjk*Xk=Egjk*Xk=0 for all j,k, where Xk are count random variables, it implies that fkj*xj=gkj*xj=0 for any value of xj in its support, and hence kch*(j). This contradicts the assumption that kch*(j)nd(j)().

Next, we show pa(j)=pa*(j). Let lpa(j)pa*(j). Since ch*(j)nd(j)=, by taking the ratio of (6) at xl+1 and xl, we have

rjxj;xl+1,xpa(j){l}rjxj;xl,xpa(j){l}=1

for all xj and xpa(j), where

rjxj;xpa(j)=aj1ajpjHjkpa(j)gjkxk+νjbj1bjqj1+expkpa(j)fjkxk+μjpjFqjaj;bj;Hjkpa(j)gjkxk+νjifxj=0aj1+xjajpj+xjHjkpa(j)gjkxk+νjbj1+xjbjqj+xjxj+1ifxj0.

It easily follows that fjlxl+1=fjlxl and gjlxl+1=gjlxl for all possible values of xl, and combining this again with the assumption that Efjk*Xk=Egjk*Xk=0, we have pa*(j)pa(j)=. Similarly, we obtain pa*(j)pa(j)=, which implies pa(j)=pa*(j).

Lastly, we show that if pj,qj, and hj are fixed (i.e., pj=pj*,qj=qj*, and hj=hj*), we can deduce from pxjxpa(j)=p*xjxpa*(j) that fjk,gjk=fjk*,gjk* for kpa(j), μj=μj*,νj=νj*, and aj and bj are equivalent to aj* and bj* up to permutations. Here, we denote pa(j)=pa(j)=pa*(j) as in the proof of Theorem 6. Consider the ratio of pxjxpa(j)=p*xjxpa*(j) at xj+1 and xj. We observe that if xj0,

aj1+xjajpj+xjHjkpa(j)gjkxk+νjbj1+xjbjqj+xjxj+1=aj1*+xjajpj*+xjHjkpa(j)gjk*xk+νj*bj1*+xjbjqj*+xjxj+1, (16)

and otherwise,

aj1ajpjHjkpa(j)gjkxk+νjbj1bjqj1+expkpa(j)fjkxk+μjpjFqjaj;bj;Hjkpa(j)gjkxk+νj=aj1*ajpj*Hjkpa(j)gjk*xk+νjbj1*bjqj*1+expkpa(j)fjk*xk+μj*pjFqjaj*;bj*;Hjkpa(j)gjk*xk+νj*. (17)

Note that (16) and (17) hold for all possible values of xk for kpa(j). If it is combined with EfjkXk=EgjkXk=0, our identifiability condition for the nonlinear functions fjk and gjk, we get μj=μj*,νj=νj*, and fjkxk=fjk*xk and gjkxk=gjk*xk for all possible values of xk for kpa(j). Then, it is clear that aj and bj, respectively, are equivalent to a permutation of aj* and bj*, which completes the proof.■

Contributor Information

Junsouk Choi, Department of Statistics, Texas A&M University, College Station, TX 98195-4322, USA.

Yang Ni, Department of Statistics, Texas A&M University, College Station, TX 94720-1776, USA.

References

  1. Andersen Holly. When to expect violations of causal faithfulness and why it matters. Philosophy of Science, 80(5):672–683, 2013. [Google Scholar]
  2. Barry Simon C and Welsh Alan H. Generalized additive modelling and zero inflated count data. Ecological Modelling, 157(2–3):179–188, 2002. [Google Scholar]
  3. Bollen Kenneth A. Structural equations with latent variables, volume 210. John Wiley & Sons, 1989. [Google Scholar]
  4. Castelletti Federico, Consonni Guido, Della Vedova Marco L, and Peluso Stefano. Learning markov equivalence classes of directed acyclic graphs: an objective bayes approach. Bayesian Analysis, 13(4):1235–1260, 2018. [Google Scholar]
  5. Chen Wenyu, Drton Mathias, and Wang Y Samuel. On causal discovery with an equalvariance assumption. Biometrika, 106(4):973–980, 2019. [Google Scholar]
  6. Chickering David Maxwell. Optimal structure identification with greedy search. Journal of machine learning research, 3(Nov):507–554, 2002. [Google Scholar]
  7. Choi Junsouk, Chapkin Robert, and Ni Yang. Bayesian causal structural learning with zero-inflated poisson bayesian networks. Advances in Neural Information Processing Systems, 33, 2020. [Google Scholar]
  8. Claeskens Gerda, Hjort Nils Lid, et al. Model selection and model averaging. Cambridge Books, 2008. [Google Scholar]
  9. Dacey Michael F. A family of discrete probability distributions defined by the generalized hypergeometric series. Sankhyā: The Indian Journal of Statistics, Series B, pages 243–250, 1972. [Google Scholar]
  10. Dopeso Higinio, Rodrigues Paulo, Bilic Josipa, Bazzocco Sarah, Cartón-García Fernando, Macaya Irati, De Marcondes Priscila Guimarães, Anguita Estefanía, Masanas Marc, Jiménez-Flores Lizbeth M, et al. Mechanisms of inactivation of the tumour suppressor gene rhoa in colorectal cancer. British journal of cancer, 118(1):106–116, 2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Fox Jean-Paul. Multivariate zero-inflated modeling with latent predictors: Modeling feedback behavior. Computational Statistics & Data Analysis, 68:361–374, 2013. [Google Scholar]
  12. Han Heonjong et al. TRRUST v2: an expanded reference database of human and mouse transcriptional regulatory interactions. Nucleic Acids Research, 46(D1):D380–D386, 2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Hao Yuhan, Hao Stephanie, Andersen-Nissen Erica, Mauck William M III, Zheng Shiwei, Butler Andrew, Lee Maddie J, Wilk Aaron J, Darby Charlotte, Zager Michael, et al. Integrated analysis of multimodal single-cell data. Cell, 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Heckerman David, Geiger Dan, and Chickering David M. Learning bayesian networks: The combination of knowledge and statistical data. Machine learning, 20(3):197–243, 1995. [Google Scholar]
  15. Hoyer Patrik O, Janzing Dominik, Mooij Joris M, Peters Jonas, Schölkopf Bernhard, et al. Nonlinear causal discovery with additive noise models. In NIPS, volume 21, pages 689–696. Citeseer, 2008. [Google Scholar]
  16. HE Hua, TANG Wan, WANG Wenjuan, and CRITS-CHRISTOPH Paul. Structural zeroes and zero-inflated models. Shanghai Archives of Psychiatry, 26(4):236, 2014. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Inoue Yasumichi and Imamura Takeshi. Regulation of tgf-β family signaling by e3 ubiquitin ligases. Cancer science, 99(11):2107–2112, 2008. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Johnson Norman L, Kemp Adrienne W, and Kotz Samuel. Univariate discrete distributions, volume 444. John Wiley & Sons, 2005. [Google Scholar]
  19. Kalisch Markus and Bühlman Peter. Estimating high-dimensional directed acyclic graphs with the pc-algorithm. Journal of Machine Learning Research, 8(3), 2007. [Google Scholar]
  20. Kang Yun, Norris Michael H, Zarzycki-Siek Jan, Nierman William C, Donachie Stuart P, and Hoang Tung T. Transcript amplification from single bacterium for transcriptome analysis. Genome Research, 21(6):925–935, 2011. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Kemp Adrienne W. A wide class of discrete distributions and the associated differential equations. Sankhyā: The Indian Journal of Statistics, Series A, pages 401–410, 1968a. [Google Scholar]
  22. Kemp Adrienne Winifred. Studies in Univariate Discrete Distribution Theory Based on the Generalized Hypergeometric Function and Associated Differential Equations. PhD thesis, Queens’ University of Belfast, 1968b. [Google Scholar]
  23. Li Huipeng, Courtois Elise T, Sengupta Debarka, Tan Yuliana, Chen Kok Hao, Goh Jolene Jie Lin, Kong Say Li, Chua Clarinda, Hon Lim Kiat, Tan Wah Siew, et al. Reference component analysis of single-cell transcriptomes elucidates cellular heterogeneity in human colorectal tumors. Nature genetics, 49(5):708–718, 2017. [DOI] [PubMed] [Google Scholar]
  24. Maathuis Marloes, Drton Mathias, Lauritzen Steffen, and Wainwright Martin. Handbook of graphical models. CRC Press, 2018. [Google Scholar]
  25. Opgen-Rhein Rainer and Strimmer Korbinian. From correlation to causation networks: a simple approximate learning algorithm and its application to high-dimensional plant gene expression data. BMC systems biology, 1(1):1–10, 2007. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Park Gunwoong and Park Hyewon. Identifiability of generalized hypergeometric distribution (ghd) directed acyclic graphical models. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 158–166. PMLR, 2019. [Google Scholar]
  27. Park Gunwoong and Raskutti Garvesh. Learning large-scale poisson dag models based on overdispersion scoring. Advances in Neural Information Processing Systems, 28:631–639, 2015. [Google Scholar]
  28. Pearl Judea. Causality: Models, reasoning and inference. Cambridge university press, 2009. [Google Scholar]
  29. Peters Jonas and Bühlmann Peter. Identifiability of gaussian structural equation models with equal error variances. Biometrika, 101(1):219–228, 2014. [Google Scholar]
  30. Peters Jonas, Mooij Joris, Janzing Dominik, and Schölkopf Bernhard. Identifiability of causal graphs using functional models. In 27th Conference on Uncertainty in Artificial Intelligence (UAI 2011), pages 589–598. AUAI Press, 2011. [Google Scholar]
  31. Peters Jonas, Mooij Joris M, Janzing Dominik, and Schölkopf Bernhard. Causal discovery with continuous additive noise models. Journal of Machine Learning Research, 15:2009–2053, 2014. [Google Scholar]
  32. Reshef David N, Reshef Yakir A, Finucane Hilary K, Grossman Sharon R, McVean Gilean, Turnbaugh Peter J, Lander Eric S, Mitzenmacher Michael, and Sabeti Pardis C. Detecting novel associations in large data sets. science, 334(6062):1518–1524, 2011. [DOI] [PMC free article] [PubMed] [Google Scholar]
  33. Rodrigues Paulo, Macaya Irati, Bazzocco Sarah, Mazzolini Rocco, Andretta Elena, Dopeso Higinio, Mateo-Lozano Silvia, Bilić Josipa, Cartón-García Fernando, Nieto Rocio, et al. Rhoa inactivation enhances wht signalling and promotes colorectal cancer. Nature communications, 5(1):1–15, 2014. [DOI] [PMC free article] [PubMed] [Google Scholar]
  34. Runyan Constance E, Hayashida Tomoko, Hubchak Susan, Curley Jessica F, and Schnaper H William. Role of sara (smad anchor for receptor activation) in maintenance of epithelial cell phenotype. Journal of Biological Chemistry, 284(37):25181–25189, 2009. [DOI] [PMC free article] [PubMed] [Google Scholar]
  35. Schölkopf Bernhard, Janzing Dominik, Peters Jonas, Sgouritsa Eleni, Zhang Kun, and Mooij Joris. On causal and anticausal learning. arXiv preprint arXiv:1206.6471, 2012. [Google Scholar]
  36. Shimizu Shohei, Hoyer Patrik O, Hyvärinen Aapo, Kerminen Antti, and Jordan Michael. A linear non-gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7(10), 2006. [Google Scholar]
  37. Spirtes Peter, Glymour Clark N, Scheines Richard, and Heckerman David. Causation, prediction, and search. MIT press, 2000. [Google Scholar]
  38. Staub Kevin E and Winkelmann Rainer. Consistent estimation of zero-inflated count models. Health Economics, 22(6):673–686, 2013. [DOI] [PubMed] [Google Scholar]
  39. Uhler Caroline, Raskutti Garvesh, Bühlmann Peter, and Yu Bin. Geometry of the faithfulness assumption in causal inference. The Annals of Statistics, pages 436–463, 2013. [Google Scholar]
  40. Verma Thomas and Pearl Judea. Causal networks: Semantics and expressiveness. In Machine intelligence and pattern recognition, volume 9, pages 69–76. Elsevier, 1990. [Google Scholar]
  41. Wang Y Samuel and Drton Mathias. High-dimensional causal discovery under non-gaussianity. Biometrika, 107(1):41–59, 2020. [Google Scholar]
  42. Yu Shiqing, Drton Mathias, and Shojaie Ali. Directed graphical models and causal discovery for zero-inflated data. arXiv preprint arXiv:2004.04150, 2020. [PMC free article] [PubMed] [Google Scholar]

RESOURCES