Abstract
Marginal log-linear (MLL) models provide a flexible approach to multivariate discrete data. MLL parametrizations under linear constraints induce a wide variety of models, including models defined by conditional independences. We introduce a subclass of MLL models which correspond to Acyclic Directed Mixed Graphs (ADMGs) under the usual global Markov property. We characterize for precisely which graphs the resulting parametrization is variation independent. The MLL approach provides the first description of ADMG models in terms of a minimal list of constraints. The parametrization is also easily adapted to sparse modelling techniques, which we illustrate using several examples of real data.
Keywords: acyclic directed mixed graph, discrete graphical model, marginal log-linear parameter, parsimonious modelling, variation independence
1 Introduction
Models defined by conditional independence constraints are central to many methods in multivariate statistics, and in particular to graphical models (Darroch et al., 1980; Whittaker, 1990). In the case of discrete data, marginal log-linear (MLL) parameters can be used to parametrize a broad range of models, including some graphical classes and models for conditional independence (Rudas et al., 2010; Forcina et al., 2010). These parameters are defined by considering a sequence, M1, M2, … Mk, of margins of the distribution which respects inclusion (i.e. Mi precedes Mj if Mi ⊂ Mj), with each such sequence giving rise to a smooth parametrization of the saturated model. Useful sub-models can be induced by setting some of the parameters to zero, or more generally by restricting attention to a linear or affine subset of the parameter space.
The flexibility present in this scheme presents a challenge both in terms of interpreting the resulting model and performing model selection, for which a tractable search space is typically required. We describe a sub-class of marginal log-linear models corresponding to a class of graphs known as acyclic directed mixed graphs (ADMGs), which contain directed (→) and bidirected (↔) edges, subject to the constraint that there are no cycles of directed edges; an example is given in Figure 1. The relationship between the MLL models and ADMGs is analogous to that between ordinary log-linear models and undirected graphs: log-linear models give a very rich class of models to choose from, since their number grows doubly-exponentially as the number of variables increases; undirected graphs provide a natural and more manageable subset of models with which to work (Darroch et al., 1980).
Figure 1. An acyclic directed mixed graph, 𝒢1.
The patterns of independence described by ADMGs arise naturally in the context of generating processes in which not all variables are observed. To illustrate this, consider the randomized encouragement design carried out by McDonald et al. (1992) to investigate the effect of computer reminders for doctors on take-up of influenza vaccinations, and consequent morbidity in patients. The study involved 2,861 patients; here we focus on the following fields:
(Re) patient’s doctor sent a card asking to Remind them about flu vaccine (randomized);
(Va) patient Vaccinated against influenza;
(Y) the endpoint: patient was not hospitalized with flu;
(Ag) Age of patient: 0 = ‘65 and under’, 1 = ‘over 65’;
(Co) patient has Chronic Obstructive Pulmonary Disease (COPD), as measured at baseline.
The graphs in Figure 2 represent two possible data generating processes. Under both structures, whether or not a patient’s doctor received a reminder note is independent of the baseline variables age (Ag) and COPD status (Co), as would be expected under randomization. Further the absence of an edge Re → Y encodes the assumption that whether or not a reminder (Re) was received only influences the final outcome (Y) via whether or not a patient received a flu vaccination (Va). Both structures also assume that there are unobserved confounding factors between vaccination and COPD, and between COPD and the final outcome. However, the graph in Figure 2(b) supposes that there is no additional confounding between Va and Y. As a consequence the generating process given in (b) implies the additional restriction that Re ⫫ Y | Va, Ag. (We make no assumptions about the state spaces of the variables H, H1 and H2, since these factors are unobserved.)
Figure 2. Two different generating processes for the flu vaccine encouragement design (red vertices are unobserved): both graphs imply Re ⫫ Ag, Co; however (b) also implies Re ⫫ Y | Va, Ag.
In Figure 3 we show the ADMGs corresponding to the generating processes in Figure 2. These graphs only contain observed variables, but by including bidirected edges (↔) they encode the same observable conditional independence relations; see §3.1 for details.
Figure 3. Two ADMGs representing the conditional independence restrictions on the observed margin implied by the corresponding graphs in Figure 2.
All the work herein can easily be extended to graphs which also contain an undirected component, provided no undirected edge is adjacent to an arrowhead. This latter case is equivalent to the summary graphs of Wermuth (2011), and strictly includes all ancestral graphs (Richardson and Spirtes, 2002). Our approach may be seen as extending earlier work (Rudas et al., 2006, 2010; Forcina et al., 2010) which described the conditional independence structure of certain marginal log-linear models.
1.1 ADMG Models
Richardson (2003) described local and global Markov properties for ADMGs, while Richardson (2009) described a parametrization for discrete random variables via a collection of conditional probabilities of the form P(XH = 0 | XT = xT). However, although Richardson’s parametrization is simple, it does not naturally lead to parsimonious sub-models. In addition, the parameters are subject to variation dependence constraints, in the sense that setting some parameters to particular values may restrict the valid range of other parameters; this makes maximum likelihood fitting, for example, more challenging (Evans and Richardson, 2010). To illustrate this point, consider the graph 𝒢1 in Figure 1 as an example; it encodes the model under which X1 ⫫ X3 and X4 ⫫ X1 | X2. Richardson’s parametrization consists in this case (for binary random variables) of the probabilities
$$P(X_1 = 0), \quad P(X_2 = 0 \mid X_1 = x_1), \quad P(X_3 = 0), \quad P(X_2 = 0, X_3 = 0 \mid X_1 = x_1), \quad P(X_4 = 0 \mid X_2 = x_2), \quad P(X_3 = 0, X_4 = 0 \mid X_1 = x_1, X_2 = x_2),$$

where x1, x2 ∈ {0, 1}. A disadvantage of this parametrization is that, for instance, the joint probabilities P(X2 = 0, X3 = 0 | X1 = x1) are bounded above by the marginal probabilities P(X2 = 0 | X1 = x1). Consequently, from the point of view of parameter interpretation, it makes little sense to consider the joint probabilities in isolation. For example, strong (conditional) correlation between X2 and X3 is present when the joint probability is large relative to the marginals.
However, replacing the joint probabilities P(X2 = 0, X3 = 0 | X1 = x1) with the conditional odds ratios
$$\frac{P(X_2 = 0, X_3 = 0 \mid X_1 = x_1)\, P(X_2 = 1, X_3 = 1 \mid X_1 = x_1)}{P(X_2 = 1, X_3 = 0 \mid X_1 = x_1)\, P(X_2 = 0, X_3 = 1 \mid X_1 = x_1)}$$

(and similarly for P(X3 = 0, X4 = 0 | X1 = x1, X2 = x2)) yields a variation independent parametrization, the odds ratio measuring dependence without reference to marginal distributions. This means that if we wish to define a prior distribution over the univariate probabilities and the odds ratios, we may, if appropriate, simply use a product of univariate distributions; similarly, to fit a generalized linear model with these parameters as joint responses, we need only use simple univariate link functions. We will see that this approach to discrete parametrizations can be generalized using marginal log-linear parameters.
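To make the variation independence concrete for a single binary pair, the following sketch recovers the joint probability P(X2 = 0, X3 = 0) from the two margins and an odds ratio by solving the resulting quadratic in closed form. It is our own illustration rather than part of the paper's development; the function names and the numerical inputs are hypothetical, and the conditioning on X1 = x1 is left implicit.

```python
import math

def joint_from_odds_ratio(a, b, psi):
    """Return p00 = P(X2=0, X3=0) given a = P(X2=0), b = P(X3=0), odds ratio psi > 0."""
    if abs(psi - 1.0) < 1e-12:
        return a * b  # odds ratio 1 is independence
    # p00 is the smaller root of (psi-1) p^2 - s p + psi a b = 0.
    s = 1.0 + (a + b) * (psi - 1.0)
    disc = s * s - 4.0 * psi * (psi - 1.0) * a * b
    return (s - math.sqrt(disc)) / (2.0 * (psi - 1.0))

def odds_ratio(a, b, p00):
    """Odds ratio of the 2 x 2 table with margins a, b and corner cell p00."""
    p01, p10 = a - p00, b - p00
    p11 = 1.0 - a - b + p00
    return p00 * p11 / (p01 * p10)

# Any positive odds ratio is compatible with any pair of margins,
# whereas p00 itself must lie between max(0, a + b - 1) and min(a, b).
example = joint_from_odds_ratio(0.5, 0.5, 4.0)
```

With margins 0.5 and 0.5 and odds ratio 4 the corner cell works out to 1/3; the point of the sketch is that the margins and the odds ratio can be chosen freely, while the margins and the joint probability cannot.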
In Section 2 we introduce marginal log-linear (MLL) parameters and some of their properties, while Section 3 gives background theory about ADMGs and the parametrization of Richardson (2009). The development of MLL parameters for ADMG models is presented in Section 4, resulting in a parametrization we refer to as ingenuous (since it arises naturally, but ‘natural parametrization’ already has a particular meaning). We also show that this parametrization can always be embedded in a larger one corresponding to a complete graph and the saturated model, where some of the parameters in the bigger model are linearly constrained. In Section 5 we classify for which models the ingenuous parametrization is variation independent, since this can facilitate interpretation of the resulting coefficients. In Section 6 we discuss approaches to sparse modelling using MLLs in the context of several additional datasets and a simulation. Longer proofs are in the supplementary material.
2 Marginal Log-Linear Parameters
We consider collections of random variables (Xv)v∈V with finite index set V, taking values in finite discrete probability spaces (𝔛v)v∈V under a strictly positive probability measure P; without loss of generality, 𝔛v = {0, 1, …, |𝔛v| − 1}. For A ⊆ V we let 𝔛A ≡ ×v∈A(𝔛v), 𝔛 ≡ 𝔛V and similarly XA ≡ (Xv)v∈A, X ≡ XV and xA ≡ (xv)v∈A, x ≡ xV. In addition 𝔛̃ is the subset of 𝔛 which does not contain the last possible element in any co-ordinate; that is 𝔛̃v = {0, 1, …, |𝔛v| − 2}, and 𝔛̃ = ×v∈V (𝔛̃v), with 𝔛̃A ≡ ×v∈A(𝔛̃v) defined analogously for A ⊆ V. We use pA(xA) ≡ P(XA = xA) and pA|B(xA | xB) ≡ P(XA = xA | XB = xB), for particular instantiations of x.
Following Bergsma and Rudas (2002), we define a general class of parameters on discrete distributions. The definition relies upon abstract collections of subsets, so it may be helpful to the reader to keep in mind that the sets Mi ∈ 𝕄 are margins, or subsets, of the distribution over V, and each set 𝕃i is a collection of effects in the margin Mi. A pair (L, Mi) corresponds to a log-linear interaction over the set L, within the margin Mi.
Definition 2.1
For L ⊆ M ⊆ V, the pair (L, M) is an ordered pair of subsets of V. Let ℙ be a collection of such pairs, and define 𝕄 = {M | (L, M) ∈ ℙ for some L} to be the collection of margins in ℙ. If 𝕄 = {M1, …, Mk}, write 𝕃i = {L | (L, Mi) ∈ ℙ} for the set of effects present in the margin Mi. We say that the collection ℙ is hierarchical if the ordering on 𝕄 may be chosen so that if i < j, then Mj ⊈ Mi and also L ∈ 𝕃j ⇒ L ⊈ Mi; the second condition is equivalent to saying that each L is associated only with the first margin M of which it is a subset. We say the collection is complete if every non-empty subset of V is an element of precisely one set 𝕃i.
The term ‘hierarchical’ is used because each log-linear interaction is defined in the first possible margin in an ascending class, and ‘complete’ because all interactions are present. Some authors (Rudas et al., 2010; Lupparelli et al., 2009) consider only collections which are complete.
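The two conditions of Definition 2.1 are easy to check mechanically for small V. The sketch below is our own illustration: a collection ℙ is represented as a list of (margin, effects) pairs in a proposed order, and the function names are hypothetical. Note that `is_hierarchical` only tests the order supplied, whereas the definition asks that some suitable order exists.

```python
from itertools import combinations

def nonempty_subsets(V):
    """All non-empty subsets of V as frozensets, smallest first."""
    V = sorted(V)
    return [frozenset(c) for r in range(1, len(V) + 1) for c in combinations(V, r)]

def is_hierarchical(param):
    """param: list of (margin, effects) in the proposed order."""
    for j, (M_j, effects_j) in enumerate(param):
        if any(not L <= M_j for L in effects_j):
            return False  # every effect must lie inside its own margin
        for M_i, _ in param[:j]:
            # An earlier margin may contain neither M_j nor any effect of M_j.
            if M_j <= M_i or any(L <= M_i for L in effects_j):
                return False
    return True

def is_complete(param, V):
    """Every non-empty subset of V appears as an effect exactly once."""
    effects = [L for _, Ls in param for L in Ls]
    return sorted(map(sorted, effects)) == sorted(map(sorted, nonempty_subsets(V)))

V = {1, 2, 3}
# A single margin V carrying every effect (ordinary log-linear parameters).
classical = [(frozenset(V), nonempty_subsets(V))]
# Every subset is its own margin, with itself as the only effect.
logistic = [(S, [S]) for S in nonempty_subsets(V)]
```

Both example collections are hierarchical and complete in the order given; reversing the second list breaks the hierarchical condition for that order, since small margins would then follow margins that contain them.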
Definition 2.2
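For each M ⊆ V and xM ∈ 𝔛M, define the functions $\lambda_L^M : \mathfrak{X}_L \to \mathbb{R}$, for L ⊆ M, by the identity

$$\log p_M(x_M) = \sum_{L \subseteq M} \lambda_L^M(x_L),$$

subject to the identifiability constraint that for every ∅ ≠ L ⊆ M, xL ∈ 𝔛L and v ∈ L,

$$\sum_{y_v \in \mathfrak{X}_v} \lambda_L^M(x_{L \setminus \{v\}}, y_v) = 0;$$

that is, the sum over the support of each variable is zero. We call $\lambda_L^M(x_L)$ a marginal log-linear parameter.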
Note that the constant $\lambda_\emptyset^M$ is determined by the values of the other parameters and the fact that the probabilities pM (xM) sum to one. In the sequel we will always assume that L is non-empty.
The term ‘marginal log-linear parameter’ is coined by analogy with ordinary log-linear parameters, which correspond to the special case M = V. The following result provides an explicit expression for $\lambda_L^M(x_L)$.
Lemma 2.3
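For L ⊆ M ⊆ V and xL ∈ 𝔛L we have

$$\lambda_L^M(x_L) = \frac{1}{|\mathfrak{X}_M|} \sum_{y_M \in \mathfrak{X}_M} \log p_M(y_M) \prod_{v \in L} \left( |\mathfrak{X}_v|\, \mathbb{1}\{x_v = y_v\} - 1 \right). \qquad (1)$$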
This result is elementary, and its proof is omitted.
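As a concrete check of Lemma 2.3, the sketch below evaluates a marginal log-linear parameter by brute force. It assumes the sum-to-zero closed form in which each log pM(yM) is weighted by the product over v ∈ L of (|𝔛v|·1{xv = yv} − 1) and the total is divided by |𝔛M|. The function name `mll_parameter`, the 0-based axis labels (axis 0 for X1, axis 1 for X2) and the 2 × 2 table are illustrative assumptions.

```python
import itertools
import numpy as np

def mll_parameter(p, L, M, x_L):
    """lambda_L^M(x_L) under sum-to-zero constraints.

    p    : joint probability table, one axis per variable (axes 0, 1, ...)
    L, M : sorted tuples of axis indices with L a subset of M
    x_L  : tuple of levels, one per element of L
    """
    drop = tuple(v for v in range(p.ndim) if v not in M)
    p_M = p.sum(axis=drop) if drop else p          # marginal table over M
    sizes = [p.shape[v] for v in M]
    x = dict(zip(L, x_L))
    total = 0.0
    for y in itertools.product(*(range(k) for k in sizes)):
        weight = 1.0
        for v, y_v in zip(M, y):
            if v in x:                             # only variables in L contribute
                weight *= p.shape[v] * (x[v] == y_v) - 1
        total += weight * np.log(p_M[y])
    return total / np.prod(sizes)

# Hypothetical strictly positive table for (X1, X2).
p = np.array([[0.4, 0.2],
              [0.1, 0.3]])
half_logit = mll_parameter(p, (0,), (0,), (0,))       # 0.5 * log(P(X1=0) / P(X1=1))
quarter_log_or = mll_parameter(p, (0, 1), (0, 1), (0, 0))  # 0.25 * log odds ratio
```

For binary variables every weight is ±1, so the two values reduce to half the logit of P(X1 = 0) and a quarter of the log odds ratio, matching the scaling in Example 2.6.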
For a collection of ordered pairs of subsets ℙ (see Definition 2.1), we let

$$\tilde{\Lambda}(\mathbb{P}) = \{\lambda_L^M(x_L) \mid (L, M) \in \mathbb{P},\ x_L \in \tilde{\mathfrak{X}}_L\}$$

be the collection of marginal log-linear parameters associated with ℙ. Note that we avoid the redundancy created by the identifiability constraint by only considering xL ∈ 𝔛̃L.
The definition of a marginal log-linear parameter we give is equivalent to the recursive one given in Bergsma and Rudas (2002); since both expositions are somewhat abstract, we invite the reader to consult the examples below for additional intuition. In particular note that for binary random variables, the product in (1) is always ±1. Bergsma and Rudas (2002, Theorem 2) show that any collection Λ̃(ℙ) where ℙ is hierarchical and complete smoothly parametrizes the saturated model, that is, it parametrizes the set of all positive distributions on 𝔛.
The restriction that the parameters must sum to zero is required for identifiability, but different constraints can be used in its place. We might instead require that $\lambda_L^M(x_L)$ be zero whenever any entry of xL is zero (or some other selected value); this is seen in Marchetti and Lupparelli (2011), for example, and its use would not substantially affect any of the results in this paper.
2.1 Examples of Marginal Log-Linear Models
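We will write $\lambda_L^M$ to mean the collection $\{\lambda_L^M(x_L) \mid x_L \in \tilde{\mathfrak{X}}_L\}$; the expression $\lambda_L^M = 0$ denotes that we are setting all the parameters in this collection to 0.
Example 2.4
The classical log-linear parameters for a discrete distribution over a set of variables V are $\{\lambda_L^V \mid \emptyset \neq L \subseteq V\}$.
Example 2.5
Up to trivial transformations, the multivariate logistic parameters of Glonek and McCullagh (1995) are $\{\lambda_L^L \mid \emptyset \neq L \subseteq V\}$.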
Example 2.6
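Let V = {1, 2, 3} and assume all random variables are binary. Write P001 ≡ P(X1 = 0, X2 = 0, X3 = 1), and P1++ ≡ P(X1 = 1), etc. Then

$$\lambda_1^1(0) = \frac{1}{2} \log \frac{P_{0++}}{P_{1++}},$$

which, up to a multiplicative constant, is the logit of the probability of the event {X1 = 0}. Also,

$$\lambda_1^{12}(0) = \frac{1}{4} \log \frac{P_{00+} P_{01+}}{P_{10+} P_{11+}}, \qquad \lambda_{12}^{12}(0, 0) = \frac{1}{4} \log \frac{P_{00+} P_{11+}}{P_{10+} P_{01+}},$$

the log odds product and log odds ratio between X1 and X2 respectively.
If instead X1 is ternary, we obtain

$$\lambda_1^1(0) = \frac{1}{3} \log \frac{P_{0++}^2}{P_{1++} P_{2++}}, \qquad \lambda_{12}^{12}(0, 0) = \frac{1}{6} \log \frac{P_{00+}^2 P_{11+} P_{21+}}{P_{01+}^2 P_{10+} P_{20+}}.$$

Here $\lambda_1^1(0)$ contrasts the probability P(X1 = 0) with the geometric mean of the probabilities P(X1 = 1) and P(X1 = 2). On the other hand, up to constants, $\lambda_{12}^{12}(0, 0)$ is an average of the two log odds ratios

$$\log \frac{P_{00+} P_{11+}}{P_{10+} P_{01+}} \qquad \text{and} \qquad \log \frac{P_{00+} P_{21+}}{P_{20+} P_{01+}},$$

and so gives a contrast between P(X1 = X2 = 0) and other joint probabilities in a way which generalizes the binary log odds ratio and provides a measure of dependence; in particular note that $\lambda_{12}^{12}(0, 0) = 0$ if X1 ⫫ X2.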
Here we have written, for example, 12 instead of {1, 2}; similarly, for sets A and B we sometimes write AB for A ∪ B, and aB for {a} ∪ B.
2.2 Properties of Marginal Log-Linear Models
The next result relates marginal log-linear parameters to conditional independences; it is found as Lemma 1 in Rudas et al. (2010) and Equation (6) of Forcina et al. (2010).
Lemma 2.7
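For any disjoint sets A, B and C, where C may be empty, A ⫫ B | C if and only if

$$\lambda_{A'B'C'}^{ABC} = 0 \qquad \text{for every } \emptyset \neq A' \subseteq A,\ \emptyset \neq B' \subseteq B,\ C' \subseteq C.$$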
The special case of C = ∅ (giving marginal independence) is proved in the context of multivariate logistic parameters by Kauermann (1997).
Example 2.8
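Take a complete and hierarchical parametrization of 3 variables in which the effect 13 is assigned to the margin 13, and the effects 23 and 123 are assigned to the margin 123.
Then we can force X1 ⫫ X3 by setting $\lambda_{13}^{13} = 0$. Similarly X2 ⫫ X3 | X1 corresponds to setting $\lambda_{23}^{123} = \lambda_{123}^{123} = 0$.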
The following lemma shows that under conditional independence constraints, certain MLL parameters defined within different margins are equal.
Lemma 2.9
Suppose that A ⫫ B | C, and A is non-empty. Then for any D ⊆ C,

$$\lambda_{AD}^{ABC} = \lambda_{AD}^{AC}.$$
The proof of this result is found in the supplementary material.
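Lemmas 2.7 and 2.9 can be checked numerically on a small table. The sketch below is our own illustration, under the assumption that the parameters are given by the sum-to-zero closed form of Lemma 2.3: it builds a hypothetical 2 × 2 × 3 table in which X0 ⫫ X2 | X1 holds by construction (axes are 0-based), so that A = {0}, B = {2} and C = {1}. The helper `mll` and the random factors are not part of the paper.

```python
import numpy as np

def mll(p, L, M, x_L):
    """lambda_L^M(x_L): contract log p_M against the sum-to-zero contrasts."""
    M = sorted(M)
    drop = tuple(v for v in range(p.ndim) if v not in M)
    out = np.log(p.sum(axis=drop) if drop else p)
    x = dict(zip(sorted(L), x_L))
    for v in reversed(M):                      # contract the last remaining axis
        k = p.shape[v]
        c = k * (np.arange(k) == x[v]) - 1.0 if v in x else np.ones(k)
        out = out @ c / k
    return float(out)

rng = np.random.default_rng(0)
f = rng.random((2, 2)) + 0.1                   # positive factor on (X0, X1)
g = rng.random((2, 3)) + 0.1                   # positive factor on (X1, X2)
p_ci = f[:, :, None] * g[None, :, :]
p_ci /= p_ci.sum()                             # p(x0,x1,x2) proportional to f(x0,x1) g(x1,x2)

M = (0, 1, 2)
# Lemma 2.7: every effect containing both 0 and 2 vanishes in the margin {0,1,2}.
zeros = [mll(p_ci, (0, 2), M, (0, b)) for b in range(2)] + \
        [mll(p_ci, M, M, (0, 0, b)) for b in range(2)]
# Lemma 2.9 with D = {} and D = {1}: the margin {0,1,2} may be replaced by {0,1}.
pairs = [(mll(p_ci, (0,), M, (0,)), mll(p_ci, (0,), (0, 1), (0,))),
         (mll(p_ci, (0, 1), M, (0, 0)), mll(p_ci, (0, 1), (0, 1), (0, 0)))]
```

Up to floating-point error, the entries of `zeros` vanish and the two members of each pair in `pairs` agree.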
3 Acyclic Directed Mixed Graphs
We introduce basic graphical concepts used to describe the global Markov property and parametrization schemes.
Definition 3.1
A directed mixed graph 𝒢 consists of a set of vertices V, and both directed (→) and bidirected (↔) edges. Edges of the same type and orientation may not be repeated, but there may be multiple edges of different types between a pair of vertices.
A path in 𝒢 is a sequence of adjacent edges, without repetition of a vertex; a path may be empty, or equivalently consist of only one vertex. The first and last vertices on a path are the endpoints (these are not distinct if the path is empty); other vertices on the path are non-endpoints. The graph 𝒢1 in Figure 1, for example, contains the path 1 → 2 → 4 ↔ 3, with endpoints 1 and 3, and non-endpoints 2 and 4. A directed path is one in which all the edges are directed (→) and are oriented in the same direction, whereas a bidirected path consists entirely of bidirected edges.
A directed cycle is a non-empty sequence of edges of the form v → ··· → v. An acyclic directed mixed graph (ADMG) is one which contains no directed cycles.
Definition 3.2
For a graph 𝒢 and a subset of its vertices A ⊆ V, we denote by 𝒢A the induced subgraph formed by A; that is, the graph containing the vertices A, and the edges in 𝒢 whose endpoints are both in A.
Definition 3.3
Let a and d be vertices in a mixed graph 𝒢. If a = d, or there is a directed path from a to d, we say that a is an ancestor of d, and that d is a descendant of a. The sets of ancestors of d and descendants of a are denoted an𝒢(d) and de𝒢(a) respectively. If there is a directed path from a to d containing precisely one edge (a → d) then a is called a parent of d; the set of vertices which are parents of d is written pa𝒢(d).
The district of a, denoted dis𝒢(a), is the set containing a and all vertices which are connected to a by a bidirected path. These definitions are applied disjunctively to sets of vertices, so that, for example,

$$\mathrm{an}_{\mathcal{G}}(A) \equiv \bigcup_{a \in A} \mathrm{an}_{\mathcal{G}}(a).$$
A set of vertices A is ancestral if A = an𝒢(A); that is, A contains all its own ancestors.
Example 3.4
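Consider the graph 𝒢1 in Figure 1. We have

$$\mathrm{an}_{\mathcal{G}_1}(4) = \{1, 2, 4\}, \qquad \mathrm{de}_{\mathcal{G}_1}(2) = \{2, 4\}, \qquad \mathrm{pa}_{\mathcal{G}_1}(4) = \{2\}.$$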
The district of 3 is the set {2, 3, 4}, and since 3 has no parents, pa𝒢1(3) = ∅.
Note that by the definitions of some authors, vertices are not their own ancestors (Lauritzen, 1996). The above notations may be shortened on induced subgraphs so that paA ≡ pa𝒢A, and similarly for other definitions. In some cases where the meaning is clear, we will dispense with the subscript altogether.
We use the now standard notation of Dawid (1979), and represent the statement ‘X is independent of Y given Z under a probability measure P’, for random variables X, Y and Z, by X ⫫ Y | Z [P]. If P is unambiguous, this part is dropped, and if Z is empty we write simply X ⫫ Y. Finally, we abuse notation in the usual way: v and Xv are used interchangeably as both a vertex and a random variable; likewise A denotes both a vertex set and XA.
3.1 Global Markov Property for ADMGs
A Markov property associates a particular set of independence relations with a graph.
A non-endpoint vertex c on a path is a collider on the path if the edges preceding and succeeding c on the path have an arrowhead at c, for example → c ← or ↔ c ←; otherwise c is a non-collider. A path between vertices a and b in a mixed graph is said to be blocked given a set C if either
there is a non-collider on the path in C, or
there is a collider on the path which is not in an𝒢(C).
If all paths from a to b are blocked by C, then a and b are said to be m-separated given C. Sets A and B are said to be m-separated given C if every a ∈ A and every b ∈ B are m-separated given C. This naturally extends the d-separation criterion of Pearl (1988) to graphs with bidirected edges.
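A probability measure P on 𝔛 is said to satisfy the global Markov property for 𝒢 if for every triple of disjoint sets of vertices A, B and C,

$$A \text{ is m-separated from } B \text{ given } C \text{ in } \mathcal{G} \implies X_A \perp\!\!\!\perp X_B \mid X_C \ [P].$$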
The model associated with an ADMG 𝒢 is simply the set of distributions that obey the global Markov property for 𝒢.
Proposition 3.5
If a path from x to y is not blocked given Z, then every vertex on the path is in an𝒢({x, y} ∪ Z).
Proof
This follows from the definition of m-separation.
Example 3.6
Consider the graph 𝒢1 in Figure 1. There are two paths between the vertices 1 and 4,

$$\pi_1 : 1 \to 2 \to 4 \qquad \text{and} \qquad \pi_2 : 1 \to 2 \leftrightarrow 3 \leftrightarrow 4;$$

both are blocked by C = {2}. π1 is blocked because 2 is a non-collider on the path and is in C, while π2 is blocked because 3 is a collider on the path and is not in an𝒢1(C) = {1, 2}. Hence {1} and {4} are m-separated given {2} in 𝒢1.
One can similarly see that {1} and {3} are m-separated given C = ∅, and that no other m-separations hold for this graph. Thus a joint distribution P obeys the global Markov property for 𝒢1 if and only if X1 ⫫ X4 | X2 [P] and X1 ⫫ X3 [P].
By similar arguments the independences associated with the ADMGs in Figure 2 may also be read off.
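The m-separation criterion can be checked mechanically on small graphs. The sketch below is a brute-force search over paths (exponential in general, so only suitable for toy examples) and is our own illustration. The edge lists 1 → 2, 2 → 4, 2 ↔ 3, 3 ↔ 4 are our reading of 𝒢1, inferred from the paths and districts described in the text, and the function names are hypothetical.

```python
from itertools import combinations

def ancestors(S, directed):
    """Vertices with a directed path into S, including S itself."""
    anc = set(S)
    changed = True
    while changed:
        changed = False
        for s, t in directed:
            if t in anc and s not in anc:
                anc.add(s)
                changed = True
    return anc

def m_separated(a, b, C, directed, bidirected):
    """True if every path from a to b is blocked given C (a, b not in C)."""
    adj = {}  # v -> list of (neighbour, arrowhead at v?, arrowhead at neighbour?)
    for s, t in directed:
        adj.setdefault(s, []).append((t, False, True))
        adj.setdefault(t, []).append((s, True, False))
    for s, t in bidirected:
        adj.setdefault(s, []).append((t, True, True))
        adj.setdefault(t, []).append((s, True, True))
    an_C = ancestors(C, directed)

    def open_path(v, visited, head_in):
        # head_in: arrowhead at v on the edge used to reach v (None at the start).
        for w, head_v, head_w in adj.get(v, []):
            if w in visited:
                continue
            if head_in is not None:
                collider = head_in and head_v
                if collider and v not in an_C:
                    continue        # collider outside an(C) blocks the path
                if not collider and v in C:
                    continue        # non-collider inside C blocks the path
            if w == b or open_path(w, visited | {w}, head_w):
                return True
        return False

    return not open_path(a, {a}, None)

directed, bidirected = [(1, 2), (2, 4)], [(2, 3), (3, 4)]
V = [1, 2, 3, 4]
separations = set()
for a, b in combinations(V, 2):
    rest = [v for v in V if v not in (a, b)]
    for r in range(len(rest) + 1):
        for C in combinations(rest, r):
            if m_separated(a, b, set(C), directed, bidirected):
                separations.add((a, b, frozenset(C)))
```

For this edge list the search finds exactly two pairwise separations, 1 from 3 given ∅ and 1 from 4 given {2}, in agreement with Example 3.6.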
3.2 Existing Parametrization of ADMG models
This subsection defines the parameters of Richardson (2009) for multivariate discrete distributions satisfying the global Markov property for an ADMG.
Definition 3.7
Let 𝒢 be an ADMG with vertex set V. We say that a collection of vertices W ⊆ V is barren if for each v ∈ W, we have W ∩ de𝒢(v) = {v}; in other words v has no non-trivial descendants in W. For an arbitrary set of vertices U, the maximal subset with no non-trivial descendants in U is denoted barren𝒢(U).
A head is a collection of vertices H which is connected by bidirected paths in 𝒢an(H) and is barren in 𝒢. We write ℋ(𝒢) for the collection of heads in 𝒢. The tail of a head H is the set

$$\mathrm{tail}_{\mathcal{G}}(H) \equiv \big( \mathrm{dis}_{\mathrm{an}(H)}(H) \setminus H \big) \cup \mathrm{pa}_{\mathcal{G}}\big( \mathrm{dis}_{\mathrm{an}(H)}(H) \big).$$

Thus the tail of H is the set of vertices in V\H connected to a vertex in H by a path on which every vertex is a collider and an ancestor of a vertex in H. We typically write T for a tail, provided it is clear which head it belongs to.
Proposition 3.8
Let H be a head. Then (i) H = barren𝒢(H ∪ tail𝒢(H)); (ii) tail𝒢(H) ⊆ an𝒢(H).
Proof
Immediate from the respective definitions.
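Richardson (2009) shows that discrete distributions obeying the global Markov property for an ADMG 𝒢 are parametrized by the conditional probabilities:

$$\{ P(X_H = x_H \mid X_T = x_T) \mid H \in \mathcal{H}(\mathcal{G}),\ T = \mathrm{tail}_{\mathcal{G}}(H),\ x_H \in \tilde{\mathfrak{X}}_H,\ x_T \in \mathfrak{X}_T \}.$$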
This is achieved via factorizations based on head-tail pairs; let ≺ be the partial ordering on heads such that Hi ≺ Hj if Hi ⊂ an𝒢(Hj) and Hi ≠ Hj. This is well defined, since otherwise 𝒢 would contain a directed cycle. Then let [·]𝒢 be a function which partitions sets of vertices into heads by repeatedly removing heads which are maximal under ≺.
Then P satisfies the global Markov property for 𝒢 if and only if it obeys the factorizations

$$p_A(x_A) = \prod_{H \in [A]_{\mathcal{G}}} p_{H \mid T}(x_H \mid x_T) \qquad (2)$$
for ancestral sets of vertices A; see Richardson (2009) for details. In the case of a directed acyclic graph (DAG), this corresponds to the probability distribution of each vertex conditional on its parents.
Example 3.9
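Consider again the ADMG 𝒢1 in Figure 1; its head-tail pairs (H, T) are (1, ∅), (2, 1), (3, ∅), (23, 1), (4, 2) and (34, 12). Multivariate binary distributions obeying the global Markov property with respect to 𝒢1 are therefore parametrized by

$$P(X_1 = 0), \quad P(X_2 = 0 \mid X_1 = x_1), \quad P(X_3 = 0), \quad P(X_2 = 0, X_3 = 0 \mid X_1 = x_1), \quad P(X_4 = 0 \mid X_2 = x_2), \quad P(X_3 = 0, X_4 = 0 \mid X_1 = x_1, X_2 = x_2),$$

for x1, x2 ∈ {0, 1}, as mentioned in the Introduction.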
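The head-tail pairs of a small ADMG can be enumerated by brute force from the definitions. The sketch below is our own illustration: it tests every non-empty vertex subset for barrenness and bidirected connectivity within its ancestors, and computes the tail as the rest of the district together with the parents of the district. The edge lists are our reading of 𝒢1 from the surrounding text, and all function names are hypothetical.

```python
from itertools import combinations

def ancestors(S, directed):
    """Vertices with a directed path into S, including S itself."""
    anc = set(S)
    changed = True
    while changed:
        changed = False
        for s, t in directed:
            if t in anc and s not in anc:
                anc.add(s)
                changed = True
    return anc

def district(S, within, bidirected):
    """Vertices reachable from S by bidirected paths staying inside `within`."""
    dis = set(S)
    changed = True
    while changed:
        changed = False
        for s, t in bidirected:
            if s in within and t in within and (s in dis) != (t in dis):
                dis |= {s, t}
                changed = True
    return dis

def heads_and_tails(V, directed, bidirected):
    """Dictionary mapping each head (frozenset) to its tail (set)."""
    result = {}
    for r in range(1, len(V) + 1):
        for H in map(set, combinations(sorted(V), r)):
            an_H = ancestors(H, directed)
            # Barren: no element of H has another element of H as an ancestor.
            barren = all(ancestors({v}, directed) & H == {v} for v in H)
            dis = district({min(H)}, an_H, bidirected)
            if barren and H <= dis:
                parents = {s for s, t in directed if t in dis}
                result[frozenset(H)] = (dis - H) | parents
    return result

pairs_G1 = heads_and_tails({1, 2, 3, 4}, [(1, 2), (2, 4)], [(2, 3), (3, 4)])
```

For the assumed edge list this reproduces the six head-tail pairs of Example 3.9.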
3.3 Graphical Completions
Given a discrete model defined by a set of conditional independence constraints, it is natural to consider it as a sub-model of the saturated model, which contains all positive probability distributions. In a setting where the model is graphical, it becomes equally natural to think of the graph as a subgraph of a complete graph, by which we mean a graph containing at least one edge between every pair of vertices. We can obtain a complete graph from an incomplete one by inserting edges between each pair of vertices which lack one, but this leaves a choice of edge type and orientation. These choices may affect how much of the structure and spirit of the original graph is retained; we will require that a completion preserves the heads of the original graph, which helps to preserve the structure of the parametrization.
Definition 3.10
Given an ADMG 𝒢 and a supergraph 𝒢̄, we call 𝒢̄ a head-preserving completion of 𝒢 if 𝒢̄ is complete, and ℋ(𝒢) ⊆ ℋ(𝒢̄).
It is easy to see that a head-preserving completion always exists; for example, if we add in a bidirected edge between every pair of vertices which are not joined by an edge, then it is clear that barren sets in 𝒢 will remain barren in 𝒢̄, and bidirected connected sets in 𝒢 will remain bidirected connected in 𝒢̄.
Note that it is not necessary for every pair of vertices to be joined by an edge in order for a graph to represent the saturated model; however, we will require this for our completions.
Example 3.11
Figure 4 shows a head-preserving completion of the ADMG in Figure 1.
Figure 4. A head-preserving completion, 𝒢̄1, of the ADMG in Figure 1.
Proposition 3.12
If 𝒢̄ is a head-preserving completion of 𝒢 then an𝒢(v) ⊆ an𝒢̄(v). In particular, if a set A is ancestral in 𝒢̄ then A is also ancestral in 𝒢.
Proof
This follows because 𝒢 contains a subset of the edges in 𝒢̄.
4 Ingenuous Parametrization of an ADMG model
We now use the marginal log-linear parameters defined in Section 2 to parametrize the ADMG models discussed in Section 3.
Definition 4.1
Consider an ADMG 𝒢 with head-tail pairs (Hi, Ti) over some index i, and let Mi = Hi ∪ Ti. Further, let 𝕃i = {A | Hi ⊆ A ⊆ Hi ∪ Ti}. This collection of margins and associated effects is the ingenuous parametrization of 𝒢, denoted ℙing(𝒢).
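Definition 4.1 translates directly into code once the head-tail pairs are known. The following sketch is our own illustration (the function name `ingenuous` and the list representation are hypothetical): for each pair it returns the margin H ∪ T together with every set A satisfying H ⊆ A ⊆ H ∪ T, using the head-tail pairs of 𝒢1 quoted in Example 3.9 as input.

```python
from itertools import combinations

def ingenuous(head_tail_pairs):
    """Map each (H, T) to (margin H | T, list of effects A with H <= A <= H | T)."""
    table = []
    for H, T in head_tail_pairs:
        H, T = frozenset(H), frozenset(T)
        effects = [H | frozenset(S)
                   for r in range(len(T) + 1)
                   for S in combinations(sorted(T), r)]
        table.append((H | T, effects))
    return table

# Head-tail pairs of the running example, in the order used in the text.
pairs_G1 = [({1}, set()), ({2}, {1}), ({3}, set()),
            ({2, 3}, {1}), ({4}, {2}), ({3, 4}, {1, 2})]
table_G1 = ingenuous(pairs_G1)
```

Each head with a tail of size t contributes 2^t effects, so the running example has 1 + 2 + 1 + 2 + 2 + 4 = 12 effects in total.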
Example 4.2
We return again to the ADMG 𝒢1 in Figure 1; the head-tail pairs are (1, ∅), (2, 1), (3, ∅), (23, 1), (4, 2) and (34, 12), meaning that the ingenuous parametrization is given by the following margins and effects:
| M | 𝕃 |
|---|---|
| 1 | 1 |
| 12 | 2, 12 |
| 3 | 3 |
| 123 | 23, 123 |
| 24 | 4, 24 |
| 1234 | 34, 134, 234, 1234 |
Note that the ordering of the margins given here is hierarchical; in order to use most of the results of Bergsma and Rudas (2002), we need to confirm that the definition above always leads to a hierarchical parametrization, which is shown by the following result.
Lemma 4.3
For any ADMG 𝒢, there is an ordering on the margins Mi of the ingenuous parametrization ℙing(𝒢) which is hierarchical.
Proof
Firstly we show that for distinct heads Hi and Hj, the collections 𝕃i and 𝕃j are disjoint. To see this, assume for a contradiction that there exists A such that Hi ⊆ A ⊆ Hi ∪ Ti and Hj ⊆ A ⊆ Hj ∪ Tj. Since Hi ≠ Hj, assume without loss of generality that there exists v ∈ Hi\Hj.
Then v ∈ Hj ∪ Tj implies that v ∈ Tj, and thus there is a directed path from v to some w ∈ Hj. Now, w ∉ Hi, since v, w ∈ Hi would imply that Hi is not barren. But if w ∈ Hj\Hi, then by the same argument as above we can find a directed path from w to some x ∈ Hi. Then v → ··· → w → ··· → x is a directed path between elements of Hi, which is a contradiction. Thus 𝕃i and 𝕃j are disjoint.
Now, consider the partial ordering ≺ of heads defined in Section 3.2: Hi ≺ Hj whenever Hi ⊂ an𝒢(Hj) and Hi ≠ Hj. Any total ordering which respects this partial ordering is hierarchical, because each set A ∈ 𝕃i is a subset of the ancestors of Hi.
We proceed to show that the ingenuous parameters for an ADMG 𝒢 characterize the set of distributions which obey the global Markov property with respect to 𝒢.
Lemma 4.4
For any sets M and L ⊆ M, the collection of MLL parameters

$$\{\lambda_A^M \mid L \subseteq A \subseteq M\},$$

together with the (|L| − 1)-dimensional marginal distributions of XL conditional on XM\L, smoothly parametrizes the distribution of XL conditional on XM\L.
A proof is given in the supplementary material. We now come to the main result of this section.
Theorem 4.5
The ingenuous parametrization Λ̃(ℙing(𝒢)) of an ADMG 𝒢 parametrizes precisely those distributions P obeying the global Markov property with respect to 𝒢.
Proof
We proceed by induction. Again we use the partial ordering ≺ on heads from Section 3.2. For the base case, we know that singleton heads {h} with empty tails are parametrized by the logits $\lambda_h^h$.
Now, suppose that we wish to find the distribution of a head H conditional on its tail T. Assume that we have the distribution of all heads H′ which precede H, conditional on their respective tails; we claim this is sufficient to give the (|H| − 1)-dimensional marginal distributions of H conditional on T.
Let v ∈ H, and let C = H\{v} be a (|H| − 1)-dimensional marginal of interest. The set A = an𝒢(H)\{v} is ancestral, since v cannot have (non-trivial) descendants in an𝒢(H); in particular C ∪ T ⊆ A. Theorem 4 of Richardson (2009) states that the factorization in equation (2) holds for every ancestral set, so

$$p_A(x_A) = \prod_{H' \in [A]_{\mathcal{G}}} p_{H' \mid T'}(x_{H'} \mid x_{T'}),$$

where T′ denotes the tail of H′. But all the probabilities in the product are known by our induction hypothesis, and the marginal distribution of C conditional on T is given by the distribution of A.
The ingenuous parametrization, by definition, contains $\lambda_A^{H \cup T}$ for H ⊆ A ⊆ H ∪ T, and thus the result follows from Lemma 4.4.
Example 4.6
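Returning to our running example, the graph 𝒢1 in Figure 1 corresponds to the model

$$\{P \mid X_1 \perp\!\!\!\perp X_3 \ [P],\ X_1 \perp\!\!\!\perp X_4 \mid X_2 \ [P]\}.$$

Theorem 4.5 tells us that this collection of distributions is precisely characterized by the ingenuous parameters for 𝒢1,

$$\lambda_1^1,\ \lambda_2^{12},\ \lambda_{12}^{12},\ \lambda_3^3,\ \lambda_{23}^{123},\ \lambda_{123}^{123},\ \lambda_4^{24},\ \lambda_{24}^{24},\ \lambda_{34}^{1234},\ \lambda_{134}^{1234},\ \lambda_{234}^{1234},\ \lambda_{1234}^{1234}.$$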
4.1 Constraint-Based Model Description
The results above show that the ingenuous parameters for an ADMG 𝒢, like Richardson’s parameters, provide precisely the information required to reconstruct a distribution obeying the global Markov property for 𝒢. However, it is difficult to use this parametrization in practice unless we can evaluate the likelihood, which requires us to make explicit the map which we have implicitly defined from the ingenuous parameters to the joint probability distribution under the model. For example, for the parameters in Richardson (2009) there is an explicit map from the parameters back to the joint distribution using a generalization of Möbius inversion. This was used by Evans and Richardson (2010) to fit these models via maximum likelihood. In contrast, the map from ingenuous parameters to the joint distribution cannot be written in closed form.
An alternative approach is to consider the ingenuous parametrization as part of a larger, complete parametrization of the saturated model, such that the additional parameters are constrained to be zero under the sub-model defined by 𝒢. This enables us to fit the model using Lagrange-type algorithms, as in Evans and Forcina (2013).
Theorem 4.7
Let 𝒢 be an ADMG, and 𝒢̄ a head-preserving completion of 𝒢. The ingenuous parametrization of 𝒢 corresponds to setting

$$\lambda_L^M = 0$$

for (L, M) ∈ ℙing(𝒢̄) whenever L does not appear as an effect in ℙing(𝒢). In particular, these constraints define the set of distributions which satisfy the global Markov property with respect to 𝒢.
To prove this result we require the following lemma.
Lemma 4.8
Let 𝒢̄ be a head-preserving completion of 𝒢, and let H ∈ ℋ(𝒢) have tails T and T̄ in 𝒢 and 𝒢̄ respectively. Then under the global Markov property for 𝒢,

$$X_H \perp\!\!\!\perp X_{\bar{T} \setminus T} \mid X_T.$$
Proof
Let π be a path in 𝒢 from some h ∈ H to t ∈ T̄\T, and assume without loss of generality that π does not intersect H or T̄\T other than at its endpoints. Suppose, for a contradiction, that π is not blocked given T. By Proposition 3.5, every vertex on π is in an𝒢({h, t} ∪ T) ⊆ an𝒢(H ∪ T̄). Since 𝒢̄ is complete, if v ∈ an𝒢̄(H ∪ T̄), then v ∈ H ∪ T̄, thus H ∪ T̄ is ancestral in 𝒢̄. By Proposition 3.12, H ∪ T̄ is also ancestral in 𝒢, thus every vertex on π is in H ∪ T̄.
By Proposition 3.8, T̄ ⊆ an𝒢̄(H), so H ∪ T̄ = an𝒢̄(H). However, since H forms a head in 𝒢̄, H is barren in 𝒢̄. Thus in 𝒢̄, no proper descendant of a vertex in H is on π, and by Proposition 3.12 this also holds in 𝒢.
Now let y be the first vertex after h on π that is not in T. By hypothesis, y exists since t ∉ T. By construction, any vertices between h and y on π are in T, hence are colliders on π and ancestors of H in 𝒢 (by Proposition 3.8). Thus y ∈ dis𝒢(H) ∪ pa𝒢(dis𝒢(H)). If y ∈ an𝒢(H) then y ∈ T, which is a contradiction, hence y ∈ dis𝒢(H) and y ∉ an𝒢(H). As shown earlier, y is not a descendant of a vertex in H, so H ∪ {y} forms a head in 𝒢. Since 𝒢̄ is a head-preserving completion, it follows that H ∪ {y} also forms a head in 𝒢̄, and thus y ∉ an𝒢̄(H) = H ∪ T̄, but this is a contradiction. Hence every such path is blocked given T, so H and T̄\T are m-separated given T, and the result follows from the global Markov property.
Proof of Theorem 4.7
Let (H, T̄) be a head-tail pair in 𝒢̄. There are three possibilities for how this pair relates to 𝒢: if (H, T̄) is also a head-tail pair in 𝒢, then there is no work to be done; otherwise either (i) H is not a head in 𝒢, or (ii) H is a head in 𝒢 but T̄ is not its tail.
If (i) holds, then we claim that under 𝒢, $\lambda_A^{H \cup \bar{T}} = 0$ for all H ⊆ A ⊆ H ∪ T̄. To see this, first note that H is a barren set in 𝒢̄, and since H is maximally connected, this means that all elements are joined by bidirected edges in 𝒢̄. Since 𝒢 contains a subset of the edges in 𝒢̄, H is also barren in 𝒢; since H is not a head in 𝒢 this means that H = K ∪ L for disjoint non-empty sets K and L with no edges directly connecting them. But this implies that K and L are m-separated conditional on T̄, and thus XK ⫫ XL | XT̄ under the Markov property for 𝒢. Then, by Lemma 2.7, these parameters are all identically zero under 𝒢.
If (ii) holds, then H is a head in both 𝒢 and 𝒢̄, but T̄ ≡ tail𝒢̄(H) ⊃ tail𝒢(H) ≡ T. Then $\lambda_A^{H \cup \bar{T}} = 0$ for all H ⊆ A ⊆ H ∪ T̄ such that A ∩ (T̄\T) ≠ ∅; this follows from Lemma 4.8 and application of Lemma 2.7.
We have shown that all parameters corresponding to effects not found in ℙing(𝒢) are identically zero under 𝒢. The vanishing of these parameters defines the correct sub-model, but note that some of the margins in ℙing(𝒢̄) which we have not yet considered are not the same as those in ℙing(𝒢). These remaining cases are again from (ii), but where H ⊆ A ⊆ H ∪ T; in this case $\lambda_A^{H \cup \bar{T}} = \lambda_A^{H \cup T}$ under 𝒢, again due to Lemma 4.8, this time combined with Lemma 2.9.
Thus we have shown that under 𝒢, all the ingenuous parameters for 𝒢̄ are either zero or equal to ingenuous parameters for 𝒢. Combined with Theorem 4.5, this shows that those constraints define the model.
Example 4.9
Consider again the ADMG 𝒢1 in Figure 1; a possible head-preserving completion 𝒢̄1 (shown in Figure 4) is obtained by adding the edges 1 → 3 and 1 → 4. The ingenuous parametrization for 𝒢̄1 is
| M | L |
|---|---|
| 1 | 1 |
| 12 | 2, 12 |
| 13 | 3, 13 |
| 123 | 23, 123 |
| 124 | 4, 14, 24, 124 |
| 1234 | 34, 134, 234, 1234 |
The effects found in ℙing(𝒢̄1) but not in ℙing(𝒢1) are 13, 14, and 124, and indeed the sub-model defined by 𝒢1 corresponds to setting $\lambda^{13}_{13} = \lambda^{124}_{14} = \lambda^{124}_{124} = 0$;
under this model the following equalities hold by Lemma 2.9:
Removing the zero parameters in ℙing(𝒢̄1) and renaming two others according to the above equations returns us to the ingenuous parametrization of 𝒢1.
Theorem 4.7 shows that we can fit the model defined by 𝒢1 by maximum likelihood simply by maximizing the log-likelihood subject to $\lambda^{13}_{13} = \lambda^{124}_{14} = \lambda^{124}_{124} = 0$. In particular, this approach always provides a list of independent constraints which characterize the model.
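The table in Example 4.9 can be reproduced mechanically. The sketch below is not the authors' software: it assumes the usual definitions from the ADMG literature (a head H is a barren set contained in a single district of the subgraph on an(H); its tail is that district minus H, together with the parents of the district), which this section does not restate. It also assumes that 𝒢1 has edges 1 → 2, 2 → 4, 2 ↔ 3 and 3 ↔ 4, an edge set inferred from the tables rather than read from Figure 1. All function names are illustrative.

```python
from itertools import combinations

def ancestors(nodes, directed):
    """Return `nodes` together with every vertex that has a directed path into them."""
    found = set(nodes)
    grew = True
    while grew:
        grew = False
        for parent, child in directed:
            if child in found and parent not in found:
                found.add(parent)
                grew = True
    return found

def district(start, allowed, bidirected):
    """Vertices reachable from `start` along bidirected edges that stay inside `allowed`."""
    found, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for a, b in bidirected:
            for x, y in ((a, b), (b, a)):
                if x == v and y in allowed and y not in found:
                    found.add(y)
                    stack.append(y)
    return found

def head_tail_pairs(vertices, directed, bidirected):
    """Map each head (a frozenset) to its tail, by brute force over vertex subsets."""
    pairs = {}
    for size in range(1, len(vertices) + 1):
        for combo in combinations(sorted(vertices), size):
            head = frozenset(combo)
            # Barren: no element of the head is an ancestor of another element.
            if any(h in ancestors(head - {h}, directed) for h in head):
                continue
            anc = ancestors(head, directed)
            dis = district(combo[0], anc, bidirected)
            if not head <= dis:
                continue  # the head must lie within one district of the ancestral subgraph
            parents = {p for p, c in directed if c in dis}
            pairs[head] = frozenset((dis - head) | parents)
    return pairs

def effects(pairs):
    """All effects A with H ⊆ A ⊆ H ∪ T, over every head-tail pair (H, T)."""
    out = set()
    for head, tail in pairs.items():
        for size in range(len(tail) + 1):
            for extra in combinations(sorted(tail), size):
                out.add(head | frozenset(extra))
    return out

bidirected = [(2, 3), (3, 4)]
g1 = head_tail_pairs({1, 2, 3, 4}, [(1, 2), (2, 4)], bidirected)
g1_bar = head_tail_pairs({1, 2, 3, 4}, [(1, 2), (2, 4), (1, 3), (1, 4)], bidirected)
extra_effects = effects(g1_bar) - effects(g1)
print(sorted("".join(map(str, sorted(e))) for e in extra_effects))
```

Under these assumptions the completed graph yields the six margins listed in the table, and the effects present only in the completion are 13, 14 and 124, in agreement with the example. The enumeration is exponential in the number of vertices, so it is suitable only for small graphs.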
An obvious question which arises is whether any completion of a graph will lead to a complete parametrization with the property of Theorem 4.7. We can obtain a counterexample by considering the complete graph 𝒢̄1 in Figure 5, which has ingenuous parametrization
| M | L |
|---|---|
| 3 | 3 |
| 13 | 1, 13 |
| 123 | 2, 12, 23, 123 |
| 1234 | 4, 14, 24, 124, 34, 134, 234, 1234 |

Figure 5.

A complete ADMG, 𝒢̄1, of which 𝒢1 is a subgraph, but whose ingenuous parametrization does not contain the model described by 𝒢1 as a linear sub-space because the associated completion is not head-preserving.
The graph 𝒢1 in Figure 1 is a subgraph of 𝒢̄1, and corresponds to the model obtained by setting $\lambda^{13}_{13} = \lambda^{124}_{14} = \lambda^{124}_{124} = 0$; however, these last two parameters do not appear in the ingenuous parametrization of 𝒢̄1, and so there is no way to enforce the sub-model as a linear constraint.
𝒢̄1 is, of course, not head-preserving. Such completions may still lead to parametrizations which satisfy the property of Theorem 4.7: for example, if the edge 1 → 3 is added to the graph in Figure 6(a), this destroys the head {1, 2, 3}, but the sub-model corresponds to $\lambda^{13}_{13} = 0$, which constrains a parameter that appears in the parametrization of the complete graph.
Figure 6.

(a) a graph with a variation dependent ingenuous parametrization; (b) a Markov equivalent graph to (a) with a variation independent ingenuous parametrization; (c) a graph with no variation independent MLL parametrization.
4.2 Relationship To Prior Work
Rudas et al. (2010) parametrize chain graph models of multivariate regression type, also known as type IV chain graph models, using marginal log-linear parameters. Type IV chain graph models are a special case of ADMG models, in the sense that by replacing the undirected edges in a type IV chain graph with bidirected edges, the global Markov property on the resulting ADMG is equivalent to the Markov property for the chain graph (see Drton, 2009). The graphs in Figure 6 are examples of Type IV models. However, there are models in the class of ADMGs which do not correspond to any chain graph, such as the one described by 𝒢1 in Figure 1.
The parametrization of Rudas et al. (2010) uses different choices of margins to the ingenuous parametrization, though their parameters can be shown to be equal to the parameters considered here under the global Markov property, using Lemma 2.9. Thus the variation dependence properties of that parametrization are identical to those of the ingenuous parametrization (see next section). Forcina et al. (2010) provide an algorithm which gives a range of ‘admissible’ margins in which collections of conditional independence constraints may be defined.
Marchetti and Lupparelli (2011) also parametrize type IV chain graph models in a similar manner to Rudas et al. (2010), in that case using multivariate logistic contrasts.
5 Variation Independence
As discussed in the introduction, the interpretation of parameters and the construction of prior distributions is simpler when parameters are variation independent.
Definition 5.1
Let θi, for i = 1, …, k, be a collection of parameters such that θi takes all values in the set Θi. We say that the vector θ = (θ1, …, θk) is variation independent if θ can take every value in the set Θ1 × ··· × Θk.
Bergsma and Rudas (2002) characterize precisely which hierarchical and complete parametrizations are variation independent, using a notion they call ordered decomposability. We now do this for ingenuous parametrizations.
Definition 5.2
A collection of sets ℳ = {M1, …, Mk} is incomparable if Mi ⊈ Mj for every i ≠ j.
A collection ℳ of incomparable subsets of V is decomposable if it has at most two elements, or there is an ordering M1, …, Mk on the elements of ℳ wherein for each i = 3, …, k, there exists ji < i such that
$$\Bigl(\bigcup_{l < i} M_l\Bigr) \cap M_i = M_{j_i} \cap M_i.$$
This is also known as the running intersection property.
A collection ℳ of (possibly comparable) subsets is ordered decomposable if it has at most two elements, or there is an ordering M1, …, Mk such that Mi ⊈ Mj for i > j, and for each i = 3, …, k, the inclusion maximal elements of {M1, …, Mi} form a decomposable collection. We say that a collection ℙ of parameters is ordered decomposable if there is an ordering on the margins ℳ which is both hierarchical and ordered decomposable.
The following example is found in Bergsma and Rudas (2002).
Example 5.3
Let ℳ = {12, 13, 23, 123}. In order to have a hierarchical ordering of these margins it is clear that the set 123 must come last, but there is no way to order the collection of inclusion maximal margins {12, 13, 23} such that it has the running intersection property. Thus ℳ is not ordered decomposable.
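For small collections, Definition 5.2 can be checked by brute force over orderings. The following sketch takes the running intersection property in its usual form: for each i ≥ 3, the intersection of Mi with the union of the earlier sets must equal its intersection with a single earlier set. The function names are illustrative, and the search over permutations is practical only for a handful of margins.

```python
from itertools import permutations

def is_decomposable(sets):
    """Check the running intersection property for a collection of incomparable sets."""
    sets = [frozenset(s) for s in sets]
    if len(sets) <= 2:
        return True
    for order in permutations(sets):
        ok = True
        for i in range(2, len(order)):
            overlap = frozenset().union(*order[:i]) & order[i]
            # Some single earlier set must account for the whole overlap.
            if not any(order[j] & order[i] == overlap for j in range(i)):
                ok = False
                break
        if ok:
            return True
    return False

def maximal(sets):
    """Inclusion-maximal members of a collection of frozensets."""
    return [s for s in sets if not any(s < t for t in sets)]

def is_ordered_decomposable(sets):
    """Search for an ordering satisfying the ordered decomposability condition."""
    sets = [frozenset(s) for s in sets]
    if len(sets) <= 2:
        return True
    for order in permutations(sets):
        # A later margin may not be contained in an earlier one.
        if any(order[i] <= order[j] for i in range(len(order)) for j in range(i)):
            continue
        if all(is_decomposable(maximal(order[:i + 1])) for i in range(2, len(order))):
            return True
    return False

print(is_ordered_decomposable([{1, 2}, {1, 3}, {2, 3}, {1, 2, 3}]))
print(is_ordered_decomposable([{1, 3}, {2, 4}, {1, 2, 3, 4}]))
```

The first call uses the collection of Example 5.3 and the second the margins {13, 24, 1234}; the checker rejects the former and accepts the latter.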
The next result links variation independence to ordered decomposability.
Theorem 5.4
(Bergsma and Rudas (2002), Theorem 4). Let ℙ be a parametrization which is hierarchical and complete. Then the parameters Λ̃(ℙ) are variation independent if and only if ℙ is ordered decomposable.
As previously noted, the ingenuous parametrization is not complete in general, and so we cannot apply the above result directly to characterize its variation dependence. However, by constructing complete parametrizations of which the ingenuous parametrizations are linear sub-models, we can obtain the following.
Theorem 5.5
The ingenuous parametrization for an ADMG 𝒢 is variation independent if and only if 𝒢 contains no heads of size greater than or equal to 3.
The proof of this result is found in the supplementary material.
Example 5.6
The graph 𝒢1 in Figure 1 has maximum head size 2, and therefore the associated ingenuous parametrization is variation independent.
Likewise the graphs in Figure 3(a) and (b) contain no heads of size greater than 2, so that the resulting ingenuous parameters are variation independent. Note that this was not true of the parameters given by Richardson (2009).
Example 5.7
The bidirected 3-chain shown in Figure 6(a) has the head 123, and therefore its ingenuous parametrization is variation dependent. This can easily be seen directly: in the binary case, for example, if the parameters $\lambda^{12}_{12}$ and $\lambda^{23}_{23}$ are chosen to be very large, this induces very strong dependence between the variables X1 and X2, and between X2 and X3 respectively. If these correlations are chosen to be too large, then it is impossible for X1 and X3 to be marginally independent, which is implied by the graph.
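The binary case can be made concrete with a short calculation. As a simplifying assumption that is not made in the text, suppose all three one-dimensional margins are uniform, and write $q_{ij} = P(X_i \neq X_j)$ and $\mathrm{OR}_{ij}$ for the odds ratio of the $(X_i, X_j)$ margin; for binary variables the two-way interaction parameters are, up to the coding convention, multiples of the log odds ratios.

```latex
\begin{align*}
P(X_1 \neq X_3) &\le P(X_1 \neq X_2) + P(X_2 \neq X_3) = q_{12} + q_{23}, \\
X_1 \perp\!\!\!\perp X_3 \text{ with uniform margins}
  &\implies P(X_1 \neq X_3) = \tfrac{1}{2}
   \implies \max(q_{12}, q_{23}) \ge \tfrac{1}{4}, \\
\mathrm{OR}_{ij} &= \frac{P(0,0)\,P(1,1)}{P(0,1)\,P(1,0)}
   = \left(\frac{1 - q_{ij}}{q_{ij}}\right)^{2}
   \quad \text{(uniform margins)}, \\
\min(\mathrm{OR}_{12}, \mathrm{OR}_{23}) &\le \left(\frac{3/4}{1/4}\right)^{2} = 9 .
\end{align*}
```

Under these assumptions the two odds ratios cannot both exceed 9, so the pair of two-way parameters does not range over a product set once the remaining constraints are imposed.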
Observe that we could use the Markov equivalent graph in Figure 6(b), which has no heads of size 3, and thus obtain a variation independent parametrization of the same model. However, if we add incident arrows as shown in Figure 6(c), we obtain a graph where such a trick is not possible. In fact this third graph has no variation independent parametrization in the Bergsma and Rudas framework, since it requires , and these margins cannot be ordered in a way which satisfies the running intersection property (see Example 5.3).
In general, it would be sensible for a statistician concerned about variation dependence to choose a graph from the Markov equivalence class created by their model which has the smallest possible maximum head size. This could be achieved by reducing the number of bidirected edges in the graph, where possible; see, for example, Ali et al. (2005) and Drton and Richardson (2008b) for algorithms for finding the graph with the minimal number of arrowheads in a given Markov equivalence class.
Example 5.8
The bidirected 4-cycle, shown in Figure 7, contains a head of size 4, and so its ingenuous parametrization is variation dependent. However, there is a marginal log-linear parametrization of this model which is ordered decomposable, and therefore variation independent. The 4-cycle is precisely the model with X1 ⫫ X3 and X2 ⫫ X4. Set ℳ = {13, 24, 1234}, with
$$L_1 = \mathcal{P}(13), \qquad L_2 = \mathcal{P}(24) \setminus L_1, \qquad L_3 = \mathcal{P}(1234) \setminus (L_1 \cup L_2);$$
here 𝒫(A) denotes the power set of A. This gives a hierarchical, complete and ordered decomposable parametrization, so the parameters are variation independent. The 4-cycle corresponds exactly to setting $\lambda^{13}_{13} = \lambda^{24}_{24} = 0$, and it follows that the remaining parameters are still variation independent under this constraint.
Figure 7.

A bidirected 4-cycle.
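The allocation of effects to the margins 13, 24 and 1234 can be sketched in a few lines. The reading used here, that each subset of the variables is assigned as an effect to the first margin containing it, is an assumption taken from the Bergsma and Rudas framework rather than something restated in this section; the variable names are illustrative.

```python
from itertools import combinations

def power_set(s):
    """All subsets of s, including the empty set, as frozensets."""
    items = sorted(s)
    return {frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)}

margins = [{1, 3}, {2, 4}, {1, 2, 3, 4}]
assigned = set()
effect_sets = []
for margin in margins:
    new_effects = power_set(margin) - assigned  # effects not claimed by an earlier margin
    effect_sets.append(new_effects)
    assigned |= new_effects

# Every subset of {1, 2, 3, 4} should be an effect in exactly one margin.
print([len(e) for e in effect_sets], assigned == power_set({1, 2, 3, 4}))
```

With binary variables each non-empty effect carries one parameter, so the saturated model has 15 of them under this allocation, and the two marginal independences of the 4-cycle remove the effects 13 and 24 from the first two margins.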
This approach to parametrization, which considers disconnected sets, is discussed in detail by Lupparelli et al. (2009). It produces a variation independent parametrization for graphs where the disconnected sets do not overlap, and may well be preferable to the ingenuous parametrization in these cases. In sparser graphs however, it does not seem as useful; as mentioned above, some graphs have no variation independent MLL parametrization.
6 Parsimonious Modelling with Marginal Log-Linear Parameters
The number of parameters in a model associated with a sparse graph containing bidirected edges can, in certain cases, be relatively large. In a purely bidirected graph, the parameter count depends upon the number of connected sets of vertices; in the case of a chain of bidirected edges such as that shown in Figure 11(a), this means that the number of parameters grows quadratically in the length of the chain.
Figure 11.
(a) A bidirected k-chain and (b) a DAG with latent variables (h1, …, hk−1) generating the same observable conditional independence structure.
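The quadratic growth can be checked by enumerating connected sets directly. The sketch below assumes binary variables, so that each connected set of a purely bidirected graph contributes one parameter; the helper names are illustrative and the enumeration is brute force.

```python
from itertools import combinations

def is_connected(subset, edges):
    """True if `subset` induces a connected subgraph using the given edges."""
    subset = set(subset)
    start = next(iter(subset))
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for a, b in edges:
            for x, y in ((a, b), (b, a)):
                if x == v and y in subset and y not in seen:
                    seen.add(y)
                    stack.append(y)
    return seen == subset

def connected_sets(vertices, edges):
    """All non-empty connected vertex subsets, as sorted tuples."""
    vs = sorted(vertices)
    return [c for r in range(1, len(vs) + 1)
            for c in combinations(vs, r) if is_connected(c, edges)]

def chain_edges(k):
    """Edges of the bidirected chain 1 <-> 2 <-> ... <-> k."""
    return [(i, i + 1) for i in range(1, k)]

for k in range(2, 8):
    n_params = len(connected_sets(range(1, k + 1), chain_edges(k)))
    print(k, n_params, 2 ** k - 1)  # chain model versus saturated model
```

For a chain the connected sets are exactly the intervals, giving k(k + 1)/2 parameters against 2^k − 1 for the saturated binary model.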
The parametrization of Richardson (2009), and its special case for purely bidirected graphs (see Drton and Richardson, 2008a) does not present us with any obvious method of reducing the parameter count whilst preserving the conditional independence structure. In contrast, there are well established methods for sparse modelling with other classes of graphical models. In the case of an undirected graph with binary random variables, restricting to one parameter for each vertex and each edge leads to a Boltzmann Machine (Ackley et al., 1985). Rudas et al. (2006) use marginal log-linear parameters to provide a sparse parametrization of a DAG model, again restricting to one parameter for each vertex and edge.
As we will see from the following examples, the ingenuous parametrization allows us to fit graphical models with a large number of parameters, and then remove higher-order interactions to obtain a more parsimonious model whilst preserving the conditional independence structure of the original graph.
6.1 Flu Vaccination Data Revisited
We first return to the McDonald et al. (1992) study considered in the Introduction. All variables are binary, and (excepting Age) are coded as 0 = false, 1 = true; we add constraints to our model sequentially, recording the results in the analysis of deviance Table 1. The ADMG in Figure 3(a) represents the constraint Ag, Co ⫫ Re; it fits well, having a deviance of 2.54 on 3 degrees of freedom. The smaller model for 3(b) encodes
Table 1.
Analysis of deviance table of models considered for influenza data. Constraints are added sequentially from top to bottom; the last three columns give the additional deviance for the constraint, the total degrees of freedom and the total deviance of the models respectively.
note that these precise independences cannot be represented by a DAG or chain graph (of any of the types considered by Drton (2009)). It also fits well (deviance 7.66 on 7 d.f.), so we may prefer it on the grounds of simplicity.
The ingenuous parametrization in this case contains some higher order effects, including the 5-way interaction between all variables. Setting $\lambda^M_L = 0$ for |L| ≥ 4 removes five parameters whilst increasing the deviance by only 2.22; removing the effects of size 3 adds a further 8.39 to the deviance whilst removing seven more parameters. The resulting model has a total deviance of 18.28 on 19 degrees of freedom, representing a good fit compared to the saturated model (likelihood ratio test p = 0.49).
6.2 Incorporating Symmetry: Twins Data
Hakim et al. (2003) investigate genetic effects on the presence or absence of two soft tissue disorders, frozen shoulder and tennis elbow, based on a study in pairs of monozygotic and dizygotic twins; the data are reproduced in Ekholm et al. (2012). We have count data for a 5-way contingency table over the variables Si and Ei, indicators of whether twin i in the pair suffers from frozen shoulder and tennis elbow respectively, i ∈ {1, 2}, and T, an indicator of whether the pair are monozygotic or dizygotic twins. There are a total of 866 observations for monozygotic pairs, and 963 for dizygotic pairs; twin 1 corresponds to the twin who was born first.
We first fitted the model T ⫫ (S1, S2, E1, E2) to test whether the zygosity of the twins has any effect on the other variables; we obtained a deviance of 16.4 on 15 degrees of freedom, suggesting that there is no evidence that T is related to the other variables. Note that this contradicts the conclusions of Ekholm et al. (2012), but they use additional assumptions to obtain more powerful tests.
Collapsing to a 4-way table over (S1, S2, E1, E2), we consider the complete bidirected model in Figure 8(a). A further simplifying assumption is to impose symmetry between the twins in each pair, on the basis that we do not expect any association between the prevalence of the disorders and which twin was born first. Using the ingenuous parametrization for the graph in Figure 8(a), which is itself symmetric with respect to the individual twins, this amounts to six independent linear constraints, and gives a deviance of 0.59 compared to the saturated model on four variables; there is therefore no evidence to reject symmetry.
Figure 8.

Graphs for the twins data for models corresponding to (a) a common gene and (b) separate genes affecting the prevalence of frozen shoulder and tennis elbow.
Now, a hypothesis of interest is whether a common gene is responsible for the increased risk of the two disorders, or the genetic effects are separate and independent. In the latter case we would expect the data to be explained by the model encoded by the graph in Figure 8(b), and therefore to observe the marginal independences E1 ⫫ S2 and E2 ⫫ S1 (see Drton and Richardson, 2008a, for more details). This amounts to the constraint $\lambda^{E_1 S_2}_{E_1 S_2} = \lambda^{E_2 S_1}_{E_2 S_1} = 0$; the first equality already holds by symmetry, so only one additional constraint is imposed.
This model has a deviance of 8.41 on 7 degrees of freedom, which is not rejected in a likelihood ratio test with the saturated model (p = 0.30), and so there is no evidence to reject the separate genes hypothesis. We remark however, that the model with symmetry but no marginal independences has a slightly lower BIC score, and so might be preferred.
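The p-values quoted for these likelihood ratio tests are upper tail probabilities of a χ² distribution at the observed deviance. The following is a minimal standard-library sketch, using a power series for the regularized lower incomplete gamma function; it is an illustration and not the software used for the analyses, and `chi2_sf` is a name chosen here.

```python
import math

def chi2_sf(x, df):
    """Upper tail probability P(X > x) for X chi-squared with `df` degrees of freedom."""
    if x <= 0:
        return 1.0
    a, y = df / 2.0, x / 2.0
    # Series: P(a, y) = y^a e^{-y} / Gamma(a + 1) * sum_n y^n / ((a + 1) ... (a + n)).
    term, total, n = 1.0, 1.0, 0
    while term > 1e-16 * total:
        n += 1
        term *= y / (a + n)
        total += term
    lower = math.exp(a * math.log(y) - y - math.lgamma(a + 1.0)) * total
    return 1.0 - lower

# Deviances and degrees of freedom reported for the twins data.
for deviance, df in [(8.41, 7), (11.63, 8), (16.69, 10)]:
    print(f"deviance {deviance} on {df} d.f.: p = {chi2_sf(deviance, df):.3f}")
```

For the separate genes model (8.41 on 7 degrees of freedom) this agrees with the reported p = 0.30 to two decimal places. The series converges for any positive argument but becomes slow for very large deviances, where a continued fraction would be the more usual choice.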
The elimination of the 4-way and 3-way interaction parameters for the model from Figure 8(b) with symmetry results in deviances of 11.63 on 8 d.f. and 16.69 on 10 d.f. respectively, both of which also represent reasonable fits; the latter of these has just 5 free parameters.
6.3 Netherlands Kinship Data
The Netherlands Kinship Panel Survey (NKPS) is an ongoing study which collects longitudinal information on several thousand Dutch individuals and their families (Dykstra et al., 2005, 2007). One question asked of both the primary respondents (anchors) and their partners is “How is your health in general?”, with possible responses of ‘excellent’, ‘good’, ‘good nor poor’, ‘poor’ and ‘very poor’. We combined ‘good nor poor’, ‘poor’ and ‘very poor’ into one category to avoid small counts.
Two waves of data are currently available, from 2002–04 and 2006–07. We only considered anchors who had the same partner in both waves, and such that both the individual and the partner answered the health question in both waves. Let Ai and Pi denote the response of the anchor and partner respectively for wave i ∈ {1, 2}. In total there are n = 2,318 data points, classified into a 3 × 3 × 3 × 3 table.
We begin with the complete graph in Figure 9. One plausible model would be that anchors and their partners are exchangeable. Since the graph is symmetrical in this respect, so is the ingenuous parametrization, and enforcing symmetry amounts merely to a set of 36 linear constraints; for example:
Figure 9.

Graphs for the NKPS data; responses of Anchor and Partner regarding their assessment of health; subscripts indicate time. (a) a complete graph; (b) a subgraph which implies P2 ⫫ A1 | P1.
This model has a deviance of 89.98, which when compared to the tail of a χ² distribution with 36 degrees of freedom gives p = 1.6 × 10⁻⁶; thus the symmetry model is a poor fit to the data, and is rejected. The lack of exchangeability is probably due to selection bias in the sampling of the anchors, as well as the different ways in which the anchors and their partners were asked the question: anchors were asked about their health as part of a face-to-face interview, whereas the partners were only asked to complete a survey. See Siemiatycki (1979) for an analysis of differences resulting from survey mode.
If instead we remove the edge A1 → P2 and fit the graph in Figure 9(b), we obtain an explanation of the data which is not rejected at the 5% level (deviance 19.09 on 12 degrees of freedom, p = 0.086); this model corresponds to the conditional independence P2 ⫫ A1 | P1. This graph is the only subgraph of the complete graph in Figure 9(a) which leads to a good fit; in particular the model created by removing the edge P1 → A2 is strongly rejected, which is one manifestation of the asymmetry between individuals and their partners.
Note that we could also have obtained the independence P2 ⫫ A1 | P1, for instance, by using a DAG with topological ordering P1, A1, P2, A2, but the resulting parametrization would have made it much more difficult to enforce the symmetry constraint tested above.
6.4 Example: Trust Data
Drton and Richardson (2008a) examine responses to seven questions relating to trust and social institutions, taken from the US General Social Survey between 1975 and 1994. Briefly, the seven questions were:
Trust: Can most people be trusted?
Helpful: Do you think most people are usually helpful?
MemUn, MemCh: Are you a member of a labour union/church?
ConLegis, ConClerg, ConBus: Do you have confidence in congress/organized religion/business?
In that paper, the model given by the graph in Figure 10 is shown to adequately explain the data, having a deviance of 32.67 on 26 degrees of freedom, when compared with the saturated model. The authors also provide an undirected graphical model which has one more edge than the graph in Figure 10, and yet has 62 fewer parameters. It too gives a good fit to the data, having a deviance of 87.62 on 88 degrees of freedom. Both graphs were chosen by backwards stepwise selection methods; see Drton and Richardson (2008a) for details.
Figure 10.

Markov model for trust data given in Drton and Richardson (2008a).
For practical and theoretical reasons, the bidirected model may be preferred to the undirected one, even though the latter appears to be much more parsimonious. One may consider the dependence between the responses given to a questionnaire to be manifestations of unmeasured characteristics of the respondent, such as their political beliefs. Such a system can be well represented by a bidirected graph, through its marginal independence structure and connection to latent variable models, but not necessarily by an undirected one, which induces conditional independences. Note that, since models defined by undirected and bidirected graphs are not nested, there is no a priori reason to expect the two methods to give a similar graphical structure.
The greater parsimony of the undirected model (when defined purely by conditional independences) is due to its hierarchical nature: if we remove an edge between two vertices a and b, then this corresponds to requiring that $\lambda_A = 0$ for every effect A containing both a and b. Removing that edge in a bidirected model may correspond merely to setting $\lambda^{ab}_{ab} = 0$ and nothing else, depending upon the other edges present. Using the ingenuous parametrization, it is easy to constrain additional higher order terms to be zero to obtain sub-models of the set of distributions obeying the global Markov property.
Starting with the model in Figure 10 and fixing the 4-, 5-, 6- and 7-way interaction terms to be zero increases the deviance to 84.18 on 81 degrees of freedom; none of the 4-way interaction parameters was found to be significant on its own. Furthermore, removing 21 of the remaining 25 three-way interaction terms increases the deviance to 111.48 on 102 degrees of freedom; using an asymptotic χ² approximation gives a p-value of 0.245, so this model is not contradicted by the data. The only parameters retained are the one-dimensional marginal probabilities, the two-way interactions corresponding to edges in Figure 10, and the following three-way interactions:
MemUn, ConClerg, ConBus
Helpful, MemUn, MemCh
Trust, ConLegis, ConBus
MemCh, ConClerg, ConBus.
This model retains the marginal independence structure of Drton and Richardson’s model, but provides a good fit with only 25 parameters, rather than the original 101.
A similar analysis, for different data, is performed by Lupparelli et al. (2009, page 573); again they find an undirected graphical model to be much more parsimonious than any bidirected one, but obtain comparable fits by removing statistically insignificant higher-order parameters.
6.5 Simulated Data
We saw in the earlier examples that we were often able to remove higher order interaction parameters without compromising the goodness of fit. Here we explore this phenomenon further via simulations.
Consider the DAG with latent variables shown in Figure 11(b); over the observed variables, the conditional independences which hold are exactly those given by the bidirected chain in Figure 11(a).
We randomly generated 1,000 distributions from this DAG model with k = 6, where each latent variable was given three states, and each observed variable two. The probability of each observed variable being zero, conditional on each state of its parents, was an independent uniform random draw on (0, 1); latent states were fixed to occur with equal probability. For each distribution, a sample size of 10,000 was drawn, and the bidirected chain model was fitted to it by maximum likelihood estimation. For each of the 1,000 data sets, we then measured the increase in deviance associated with removing higher order parameters.
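The data-generating step of this simulation can be sketched as follows. The code assumes that in Figure 11(b) latent variable j is a parent of the observed variables j and j + 1, which is the natural latent structure behind a bidirected chain but is not spelled out in the text; the function names and the seed are illustrative, and the model-fitting step is not reproduced.

```python
import itertools
import random
from collections import Counter

def random_chain_distribution(k, latent_states=3, rng=random):
    """Joint pmf of k binary observed variables after summing out k - 1 latent ones."""
    # Assumed structure: latent j is a parent of observed j and j + 1 (0-indexed).
    parents = [[h for h in (i - 1, i) if 0 <= h < k - 1] for i in range(k)]
    # P(X_i = 0 | parent states): one independent uniform draw per parent configuration.
    cpt = [
        {cfg: rng.random()
         for cfg in itertools.product(range(latent_states), repeat=len(parents[i]))}
        for i in range(k)
    ]
    pmf = dict.fromkeys(itertools.product((0, 1), repeat=k), 0.0)
    weight = 1.0 / latent_states ** (k - 1)  # latent states are equally likely
    for h in itertools.product(range(latent_states), repeat=k - 1):
        p_zero = [cpt[i][tuple(h[j] for j in parents[i])] for i in range(k)]
        for x in pmf:
            prob = weight
            for xi, p0 in zip(x, p_zero):
                prob *= p0 if xi == 0 else 1.0 - p0
            pmf[x] += prob
    return pmf

def marginal(pmf, fixed):
    """Probability that X_i = v for every (i, v) in the dict `fixed`."""
    return sum(p for x, p in pmf.items() if all(x[i] == v for i, v in fixed.items()))

rng = random.Random(2013)
pmf = random_chain_distribution(6, rng=rng)
cells = list(pmf)
counts = Counter(rng.choices(cells, weights=[pmf[c] for c in cells], k=10_000))

# Non-adjacent variables in the chain should be exactly marginally independent.
gap = marginal(pmf, {0: 0, 2: 0}) - marginal(pmf, {0: 0}) * marginal(pmf, {2: 0})
print(abs(gap) < 1e-9, sum(counts.values()))
```

The independence check confirms that the generated distribution obeys a marginal independence implied by the bidirected chain; `counts` is then a 2⁶ contingency table of the kind to which the chain model would be fitted.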
The histogram in Figure 12(a) demonstrates that the deviance increase from setting the 5- and 6-way interaction parameters to zero (a total of three parameters) was not distinguishable from that which would be observed under the null hypothesis that these parameters are zero. The deviance increase from setting the 4-, 5- and 6-way interactions to zero appeared to have only a slightly heavier tail than the associated χ²-distribution, as suggested by the outliers in Figure 12(b). Removing the 3-way interactions in addition to this caused a dramatic increase in the deviance, as may be observed from the heavy tail of the histogram in Figure 12(c). This illustrates that the ingenuous parametrization can be used to produce more parsimonious model descriptions than would be possible using Richardson’s parameters.
Figure 12.
Histograms showing the increase in deviance caused by setting to zero (a) the 5- and 6-way interaction parameters; (b) the 4-, 5- and 6-way interaction parameters; (c) the 3-, 4-, 5- and 6-way interaction parameters. Plots are based on 1,000 datasets, each of size 10,000, generated from the DAG in Figure 11(b). The plotted densities are χ² with 3, 6 and 10 degrees of freedom respectively.
Note that under the process which generated these models, each of these interaction parameters was non-zero almost surely. As the sample size increases the power of a likelihood ratio test for a fixed distribution tends to one, so it must be the case that a simulation such as the above would, for large enough data sets, show significant deviation from the associated χ² distributions. However, even at a fairly large sample size of 10,000, a limited effect was observed in Figures 12(a) and (b), and the examples above with real data suggest that higher order interactions are often not particularly useful in practice for describing data.
Supplementary Material
Acknowledgments
This research was supported by the U.S. National Science Foundation grant CNS-0855230 and U.S. National Institutes of Health grant R01 AI032475. The Netherlands Kinship Panel Study is funded by grant 480-10-009 from the Major Investments Fund of the Netherlands Organisation for Scientific Research (NWO), and by the Netherlands Interdisciplinary Demographic Institute (NIDI), Utrecht University, the University of Amsterdam and Tilburg University. We thank McDonald, Hui and Tierney for giving us permission to use their flu vaccine data.
Our thanks go to Tamás Rudas for helpful discussions, and to Antonio Forcina for discussions and the use of his computer programmes. Finally we thank two anonymous referees and an associate editor for their thorough reading of an earlier draft, and very useful suggestions.
Contributor Information
Robin J. Evans, Email: rje42@stat.washington.edu, Department of Statistics, University of Washington
Thomas S. Richardson, Email: tsr@stat.washington.edu, Department of Statistics, University of Washington
References
- Ackley DH, Hinton GE, Sejnowski TJ. A learning algorithm for Boltzmann machines. Cognitive Science. 1985;9:147–169.
- Ali RA, Richardson TS, Spirtes P, Zhang J. Towards characterizing Markov equivalence classes for directed acyclic graphs with latent variables. Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence; 2005. pp. 10–17.
- Bergsma WP, Rudas T. Marginal models for categorical data. Ann Statist. 2002;30(1):140–159.
- Darroch JN, Lauritzen SL, Speed TP. Markov fields and log-linear models for contingency tables. Ann Statist. 1980;8:522–539.
- Dawid AP. Conditional independence in statistical theory (with discussion). J Roy Statist Soc Ser B. 1979;41:1–31.
- Drton M. Discrete chain graph models. Bernoulli. 2009;15(3):736–753.
- Drton M, Richardson TS. Binary models for marginal independence. J Roy Statist Soc Ser B. 2008a;70(2):287–309.
- Drton M, Richardson TS. Graphical methods for efficient likelihood inference in Gaussian covariance models. J Mach Learn Res. 2008b;9:893–914.
- Dykstra PA, Kalmijn M, Knijn TCM, Komter AE, Liefbroer AC, Mulder CH. NKPS Working Paper No 4. 2005. Codebook of the Netherlands Kinship Panel Study, a multi-actor, multi-method panel study on solidarity in family relationships. Wave 1.
- Dykstra PA, Kalmijn M, Knijn TCM, Komter AE, Liefbroer AC, Mulder CH. NKPS Working Paper No 6. 2007. Codebook of the Netherlands Kinship Panel Study, a multi-actor, multi-method panel study on solidarity in family relationships. Wave 2.
- Ekholm A, Jokinen J, McDonald JW, Smith PWF. A latent class model for bivariate binary responses from twins. J Roy Statist Soc Ser C. 2012;61(3).
- Evans RJ, Forcina A. Two algorithms for fitting constrained marginal models. To appear in Computational Statistics and Data Analysis. 2013. doi: 10.1016/j.csda.2013.02.001.
- Evans RJ, Richardson TS. Maximum likelihood fitting of acyclic directed mixed graphs to binary data. Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence; 2010.
- Forcina A, Lupparelli M, Marchetti GM. Marginal parameterizations of discrete models defined by a set of conditional independencies. Journal of Multivariate Analysis. 2010;101:2519–2527.
- Glonek GFV, McCullagh P. Multivariate logistic models. J Roy Statist Soc Ser B. 1995;57(3):533–546.
- Hakim AJ, Cherkas LF, Spector TD, MacGregor AJ. Genetic associations between frozen shoulder and tennis elbow: a female twin study. Rheumatology. 2003;42(6):739–742. doi: 10.1093/rheumatology/keg159.
- Kauermann G. A note on multivariate logistic models for contingency tables. Austral J Statist. 1997;39(3):261–276.
- Lauritzen SL. Graphical Models. Clarendon Press; Oxford, UK: 1996.
- Lupparelli M, Marchetti GM, Bergsma WP. Parameterizations and fitting of bi-directed graph models to categorical data. Scand J Statist. 2009;36:559–576.
- Marchetti GM, Lupparelli M. Chain graph models of multivariate regression type for categorical data. Bernoulli. 2011;17(3):827–844.
- McDonald CJ, Hui SL, Tierney WM. Effects of computer reminders for influenza vaccination on morbidity during influenza epidemics. MD Computing. 1992;9(5):304.
- Pearl J. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann; 1988.
- Richardson TS. Markov properties for acyclic directed mixed graphs. Scand J Statist. 2003;30(1):145–157.
- Richardson TS. A factorization criterion for acyclic directed mixed graphs. Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence; 2009.
- Richardson TS, Spirtes P. Ancestral graph Markov models. Ann Statist. 2002;30:962–1030.
- Rudas T, Bergsma WP, Nemeth R. Parameterization and estimation of path models for categorical data. Proceedings in Computational Statistics, 17th Symposium; Physica-Verlag HD. 2006. pp. 383–394.
- Rudas T, Bergsma WP, Nemeth R. Marginal log-linear parameterization of conditional independence models. Biometrika. 2010;97:1006–1012.
- Siemiatycki J. A comparison of mail, telephone, and home interview strategies for household health surveys. American Journal of Public Health. 1979;69:238–245. doi: 10.2105/ajph.69.3.238.
- Wermuth N. Probability distributions with summary graph structure. Bernoulli. 2011;17(3):845–879.
- Whittaker J. Graphical models in applied multivariate statistics. Wiley; 1990.