Marginal log-linear parameters for graphical Markov models

Robin J Evans; Thomas S Richardson

doi:10.1111/rssb.12020

Author manuscript; available in PMC: 2014 Sep 1.

Published in final edited form as: J R Stat Soc Series B Stat Methodol. 2013 Jul 3;75(4):743–768. doi: 10.1111/rssb.12020

PMCID: PMC3754910 NIHMSID: NIHMS433719 PMID: 23997643

Abstract

Marginal log-linear (MLL) models provide a flexible approach to multivariate discrete data. MLL parametrizations under linear constraints induce a wide variety of models, including models defined by conditional independences. We introduce a subclass of MLL models which correspond to Acyclic Directed Mixed Graphs (ADMGs) under the usual global Markov property. We characterize for precisely which graphs the resulting parametrization is variation independent. The MLL approach provides the first description of ADMG models in terms of a minimal list of constraints. The parametrization is also easily adapted to sparse modelling techniques, which we illustrate using several examples of real data.

Keywords: acyclic directed mixed graph, discrete graphical model, marginal log-linear parameter, parsimonious modelling, variation independence

1 Introduction

Models defined by conditional independence constraints are central to many methods in multivariate statistics, and in particular to graphical models (Darroch et al., 1980; Whittaker, 1990). In the case of discrete data, marginal log-linear (MLL) parameters can be used to parametrize a broad range of models, including some graphical classes and models for conditional independence (Rudas et al., 2010; Forcina et al., 2010). These parameters are defined by considering a sequence, M₁, M₂, … M_k, of margins of the distribution which respects inclusion (i.e. M_i precedes M_j if M_i ⊂ M_j), with each such sequence giving rise to a smooth parametrization of the saturated model. Useful sub-models can be induced by setting some of the parameters to zero, or more generally by restricting attention to a linear or affine subset of the parameter space.

The flexibility present in this scheme presents a challenge both in terms of interpreting the resulting model and performing model selection, for which a tractable search space is typically required. We describe a sub-class of marginal log-linear models corresponding to a class of graphs known as acyclic directed mixed graphs (ADMGs), which contain directed (→) and bidirected (↔) edges, subject to the constraint that there are no cycles of directed edges; an example is given in Figure 1. The relationship between the MLL models and ADMGs is analogous to that between ordinary log-linear models and undirected graphs: log-linear models give a very rich class of models to choose from, since their number grows doubly-exponentially as the number of variables increases; undirected graphs provide a natural and more manageable subset of models with which to work (Darroch et al., 1980).

The patterns of independence described by ADMGs arise naturally in the context of generating processes in which not all variables are observed. To illustrate this, consider the randomized encouragement design carried out by McDonald et al. (1992) to investigate the effect of computer reminders for doctors on take-up of influenza vaccinations, and consequent morbidity in patients. The study involved 2,861 patients; here we focus on the following fields:

(Re) patient’s doctor sent a card asking to Remind them about flu vaccine (randomized);
(Va) patient Vaccinated against influenza;
(Y) the endpoint: patient was not hospitalized with flu;
(Ag) Age of patient: 0 = ‘65 and under’, 1 = ‘over 65’;
(Co) patient has Chronic Obstructive Pulmonary Disease (COPD), as measured at baseline.

The graphs in Figure 2 represent two possible data generating processes. Under both structures, whether or not a patient’s doctor received a reminder note is independent of the baseline variables age (Ag) and COPD status (Co), as would be expected under randomization. Further, the absence of an edge Re → Y encodes the assumption that whether or not a reminder (Re) was received only influences the final outcome (Y) via whether or not a patient received a flu vaccination (Va). Both structures also assume that there are unobserved confounding factors between vaccination and COPD, and between COPD and the final outcome. However, the graph in Figure 2(b) supposes that there is no additional confounding between Va and Y. As a consequence, the generating process given in (b) implies the additional restriction that Re ⫫ Y | Va, Ag. (We make no assumptions about the state spaces of the variables H, H₁ and H₂, since these factors are unobserved.)

Figure 2: Two different generating processes for the flu vaccine encouragement design (red vertices are unobserved): both graphs imply Re ⫫ Ag, Co; however (b) also implies Re ⫫ Y | Va, Ag.

In Figure 3 we show the ADMGs corresponding to the generating processes in Figure 2. These graphs only contain observed variables, but by including bidirected edges (↔) they encode the same observable conditional independence relations; see §3.1 for details.

Figure 3: Two ADMGs representing the conditional independence restrictions on the observed margin implied by the corresponding graphs in Figure 2.

All the work herein can easily be extended to graphs which also contain an undirected component, provided no undirected edge is adjacent to an arrowhead. This latter case is equivalent to the summary graphs of Wermuth (2011), and strictly includes all ancestral graphs (Richardson and Spirtes, 2002). Our approach may be seen as extending earlier work (Rudas et al., 2006, 2010; Forcina et al., 2010) which described the conditional independence structure of certain marginal log-linear models.

1.1 ADMG Models

Richardson (2003) described local and global Markov properties for ADMGs, while Richardson (2009) described a parametrization for discrete random variables via a collection of conditional probabilities of the form P(X_H = 0 | X_T = x_T). However, although Richardson’s parametrization is simple, it does not naturally lead to parsimonious sub-models. In addition, the parameters are subject to variation dependence constraints, in the sense that setting some parameters to particular values may restrict the valid range of other parameters; this makes maximum likelihood fitting, for example, more challenging (Evans and Richardson, 2010). To illustrate this point, consider the graph 𝒢₁ in Figure 1 as an example; it encodes the model under which X₁ ⫫ X₃ and X₄ ⫫ X₁ | X₂. Richardson’s parametrization consists in this case (for binary random variables) of the probabilities

\begin{array}{lll} P (X_{1} = 0) & P (X_{2} = 0 ∣ X_{1} = x_{1}) & P (X_{2} = 0, X_{3} = 0 ∣ X_{1} = x_{1}) \\ P (X_{3} = 0) & P (X_{4} = 0 ∣ X_{2} = x_{2}) & P (X_{3} = 0, X_{4} = 0 ∣ X_{1} = x_{1}, X_{2} = x_{2}) \end{array}

where x₁, x₂ ∈ {0, 1}. A disadvantage of this parametrization is that, for instance, the joint probabilities P(X₂ = 0, X₃ = 0 | X₁ = x₁) are bounded above by the marginal probabilities P(X₂ = 0 | X₁ = x₁). Consequently, from the point of view of parameter interpretation, it makes little sense to consider the joint probabilities in isolation. For example, strong (conditional) correlation between X₂ and X₃ is present when the joint probability is large relative to the marginals.

However, replacing the joint probabilities P(X₂ = 0, X₃ = 0 | X₁ = x₁) with the conditional odds ratios

\frac{P (X_{2} = 0, X_{3} = 0 ∣ X_{1} = x_{1}) \cdot P (X_{2} = 1, X_{3} = 1 ∣ X_{1} = x_{1})}{P (X_{2} = 1, X_{3} = 0 ∣ X_{1} = x_{1}) \cdot P (X_{2} = 0, X_{3} = 1 ∣ X_{1} = x_{1})}, \qquad x_{1} \in \{0, 1\}

(and similarly for P(X₃ = 0, X₄ = 0 | X₁ = x₁, X₂ = x₂)) yields a variation independent parametrization, the odds ratio measuring dependence without reference to marginal distributions. This means that if we wish to define a prior distribution over the univariate probabilities and the odds ratios, we may, if appropriate, simply use a product of univariate distributions; similarly, to fit a generalized linear model with these parameters as joint responses, we need only use simple univariate link functions. We will see that this approach to discrete parametrizations can be generalized using marginal log-linear parameters.
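The variation independence described above can be illustrated numerically for a single 2 × 2 conditional table. The following is a minimal sketch, not part of the original parametrization: the function names are hypothetical, and the inverse map uses the standard closed-form solution of the quadratic that links a 2 × 2 table to its margins and odds ratio. Any pair of margins in (0, 1) combined with any positive odds ratio returns a valid table.

```python
import math

def odds_ratio(p):
    """Odds ratio of a 2x2 table p[i][j] = P(X2 = i, X3 = j | X1 = x1)."""
    return (p[0][0] * p[1][1]) / (p[1][0] * p[0][1])

def joint_from_margins_and_or(a, b, psi):
    """Rebuild the 2x2 table from a = P(X2 = 0 | x1), b = P(X3 = 0 | x1)
    and an odds ratio psi > 0 (hypothetical helper for illustration)."""
    if abs(psi - 1.0) < 1e-12:
        p00 = a * b  # odds ratio 1 corresponds to conditional independence
    else:
        s = 1.0 + (a + b) * (psi - 1.0)
        disc = s * s - 4.0 * psi * (psi - 1.0) * a * b
        p00 = (s - math.sqrt(disc)) / (2.0 * (psi - 1.0))
    # Remaining cells follow from the two margins.
    return [[p00, a - p00], [b - p00, 1.0 - a - b + p00]]

table = [[0.4, 0.1], [0.2, 0.3]]
rebuilt = joint_from_margins_and_or(0.5, 0.6, odds_ratio(table))
extreme = joint_from_margins_and_or(0.9, 0.1, 50.0)
```

By contrast, a joint probability P(X₂ = 0, X₃ = 0 | X₁ = x₁) chosen larger than either margin has no valid table at all, which is the variation dependence of the original parameters.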

In Section 2 we introduce marginal log-linear (MLL) parameters and some of their properties, while Section 3 gives background theory about ADMGs and the parametrization of Richardson (2009). The development of MLL parameters for ADMG models is presented in Section 4, resulting in a parametrization we refer to as ingenuous (since it arises naturally, but ‘natural parametrization’ already has a particular meaning). We also show that this parametrization can always be embedded in a larger one corresponding to a complete graph and the saturated model, where some of the parameters in the bigger model are linearly constrained. In Section 5 we classify for which models the ingenuous parametrization is variation independent, since this can facilitate interpretation of the resulting coefficients. In Section 6 we discuss approaches to sparse modelling using MLLs in the context of several additional datasets and a simulation. Longer proofs are in the supplementary material.

2 Marginal Log-Linear Parameters

We consider collections of random variables (X_v)_{v∈V} with finite index set V, taking values in finite discrete probability spaces (𝔛_v)_{v∈V} under a strictly positive probability measure P; without loss of generality, 𝔛_v = {0, 1, …, |𝔛_v| − 1}. For A ⊆ V we let 𝔛_A ≡ ×_{v∈A}(𝔛_v), 𝔛 ≡ 𝔛_V, and similarly X_A ≡ (X_v)_{v∈A}, X ≡ X_V and x_A ≡ (x_v)_{v∈A}, x ≡ x_V. In addition 𝔛̃ is the subset of 𝔛 which does not contain the last possible element in any co-ordinate; that is 𝔛̃_v = {0, 1, …, |𝔛_v| − 2}, and 𝔛̃ = ×_{v∈V}(𝔛̃_v). We use p_A(x_A) ≡ P(X_A = x_A) and p_{A|B}(x_A | x_B) ≡ P(X_A = x_A | X_B = x_B), for particular instantiations of x.

Following Bergsma and Rudas (2002), we define a general class of parameters on discrete distributions. The definition relies upon abstract collections of subsets, so it may be helpful to the reader to keep in mind that the sets M_i ∈ 𝕄 are margins, or subsets, of the distribution over V, and each set 𝕃_i is a collection of effects in the margin M_i. A pair (L, M_i) corresponds to a log-linear interaction over the set L, within the margin M_i.

Definition 2.1

For L ⊆ M ⊆ V, the pair (L, M) is an ordered pair of subsets of V. Let ℙ be a collection of such pairs, and define

𝕄 \equiv \{M ∣ (L, M) \in ℙ \text{ for some } L\},

to be the collection of margins in ℙ. If 𝕄 = {M₁, …, M_k}, write

𝕃_{i} \equiv \{L ∣ (L, M_{i}) \in ℙ\},

for the set of effects present in the margin M_i. We say that the collection ℙ is hierarchical if the ordering on 𝕄 may be chosen so that if i < j, then M_j ⊈ M_i and also L ∈ 𝕃_j ⇒ L ⊈ M_i; the second condition is equivalent to saying that each L is associated only with the first margin M of which it is a subset. We say the collection is complete if every non-empty subset of V is an element of precisely one set 𝕃_i.

The term ‘hierarchical’ is used because each log-linear interaction is defined in the first possible margin in an ascending class, and ‘complete’ because all interactions are present. Some authors (Rudas et al., 2010; Lupparelli et al., 2009) consider only collections which are complete.
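The two conditions of Definition 2.1 are easy to check mechanically for a given ordering of the margins. The sketch below is illustrative only: the function names are hypothetical, margins and effects are represented as frozensets, and `is_hierarchical` tests the ordering it is handed rather than searching for one. The example collection on V = {1, 2, 3} is the one used in Example 2.8.

```python
from itertools import combinations

def is_hierarchical(ordered):
    """Check Definition 2.1 for margins listed in the proposed order M_1, ..., M_k.

    `ordered` is a list of (margin, effects) pairs of frozensets.
    """
    for i, (m_i, _) in enumerate(ordered):
        for m_j, effects_j in ordered[i + 1:]:
            if m_j <= m_i:  # a later margin is contained in an earlier one
                return False
            if any(l <= m_i for l in effects_j):  # effect fits an earlier margin
                return False
    return True

def is_complete(ordered, vertices):
    """Every non-empty subset of `vertices` is an effect in exactly one margin."""
    counts = {}
    for _, effects in ordered:
        for l in effects:
            counts[l] = counts.get(l, 0) + 1
    subsets = [frozenset(c)
               for r in range(1, len(vertices) + 1)
               for c in combinations(sorted(vertices), r)]
    return all(counts.get(s, 0) == 1 for s in subsets)

fs = frozenset
example = [
    (fs({1}), [fs({1})]),
    (fs({2}), [fs({2})]),
    (fs({3}), [fs({3})]),
    (fs({1, 2}), [fs({1, 2})]),
    (fs({1, 3}), [fs({1, 3})]),
    (fs({1, 2, 3}), [fs({2, 3}), fs({1, 2, 3})]),
]
```

Reversing the list places the margin {1, 2, 3} first, so every later margin is contained in it and the ordering is no longer hierarchical.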

Definition 2.2

For each M ⊆ V and x_M ∈ 𝔛_M, define the functions $λ_{L}^{M} (x_{L})$ by the identity

\log p_{M} (x_{M}) \equiv \sum_{L \subseteq M} λ_{L}^{M} (x_{L}),

subject to the identifiability constraint that for every ∅ ≠ L ⊆ M, x_L ∈ 𝔛_L and v ∈ L,

\sum_{x_{v} \in 𝔛_{v}} λ_{L}^{M} (x_{L \setminus \{v\}}, x_{v}) = 0;

that is, the sum over the support of each variable is zero. We call $λ_{L}^{M} (x_{L})$ a marginal log-linear parameter.

Note that the constant $λ_{\emptyset}^{M}$ is determined by the values of the other parameters and the fact that the probabilities p_M (x_M) sum to one. In the sequel we will always assume that L is non-empty.

The term ‘marginal log-linear parameter’ is coined by analogy with ordinary log-linear parameters, which correspond to the special case M = V. The following result provides an explicit expression for $λ_{L}^{M} (x_{L})$ .

Lemma 2.3

For L ⊆ M ⊆ V and x_L ∈ 𝔛_L we have

λ_{L}^{M} (x_{L}) = \frac{1}{∣ 𝔛_{M} ∣} \sum_{y_{M} \in 𝔛_{M}} \log p_{M} (y_{M}) \prod_{v \in L} (∣ 𝔛_{v} ∣ I_{\{x_{v} = y_{v}\}} - 1) . \qquad (1)

This result is elementary, and its proof is omitted.
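Equation (1) translates directly into a few lines of code. The following sketch is an illustration under stated assumptions rather than a reference implementation: variables are indexed from 0, the joint distribution is a dictionary from value tuples to strictly positive probabilities, and the names `marginal` and `mll` are hypothetical.

```python
import math
from itertools import product

def marginal(p, margin):
    """Marginal distribution over the variables in `margin` (a tuple of indices)."""
    out = {}
    for x, prob in p.items():
        key = tuple(x[v] for v in margin)
        out[key] = out.get(key, 0.0) + prob
    return out

def mll(p, levels, L, M, x_L):
    """lambda_L^M(x_L) computed from equation (1).

    levels[v] is the number of states of variable v; L and M are tuples of
    variable indices with L a subset of M; x_L lists the values for L.
    """
    p_M = marginal(p, M)
    size_M = math.prod(levels[v] for v in M)
    total = 0.0
    for y_M in product(*(range(levels[v]) for v in M)):
        y = dict(zip(M, y_M))
        contrast = 1.0
        for v, x_v in zip(L, x_L):
            contrast *= levels[v] * (1 if y[v] == x_v else 0) - 1
        total += math.log(p_M[y_M]) * contrast
    return total / size_M

# Two binary variables with an arbitrary positive joint distribution.
p = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.4}
levels = [2, 2]
logit_term = mll(p, levels, (0,), (0,), (0,))
interaction = mll(p, levels, (0, 1), (0, 1), (0, 0))
```

For binary variables the contrast is always ±1, so `interaction` reduces to one quarter of the log odds ratio of the table.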

For a collection of ordered pairs of subsets ℙ (see Definition 2.1), we let

\tilde{Λ} (ℙ) = \{λ_{L}^{M} (x_{L}) ∣ (L, M) \in ℙ, x_{L} \in \tilde{𝔛}_{L}\}

be the collection of marginal log-linear parameters associated with ℙ. Note that we avoid the redundancy created by the identifiability constraint by only considering x_L ∈ 𝔛̃_L.

The definition of a marginal log-linear parameter we give is equivalent to the recursive one given in Bergsma and Rudas (2002); since both expositions are somewhat abstract, we invite the reader to consult the examples below for additional intuition. In particular note that for binary random variables, the product in (1) is always ±1. Bergsma and Rudas (2002, Theorem 2) show that any collection Λ̃(ℙ) where ℙ is hierarchical and complete smoothly parametrizes the saturated model, that is, it parametrizes the set of all positive distributions on 𝔛.

The restriction that the parameters must sum to zero is required for identifiability, but different constraints can be used in its place. We might instead require that $λ_{L}^{M} (x_{L})$ be zero whenever any entry of x_L is zero (or some other selected value); this is seen in Marchetti and Lupparelli (2011), for example, and its use would not substantially affect any of the results in this paper.

2.1 Examples of Marginal Log-Linear Models

We will write $λ_{L}^{M}$ to mean the collection $\{λ_{L}^{M} (x_{L}) ∣ x_{L} \in 𝔛_{L}\}$; the expression $λ_{L}^{M} = 0$ denotes that we are setting all the parameters in this collection to 0.

Example 2.4

The classical log-linear parameters for a discrete distribution over a set of variables V are $\{λ_{L}^{V} ∣ L \subseteq V\}$.

Example 2.5

Up to trivial transformations, the multivariate logistic parameters of Glonek and McCullagh (1995) are $\{λ_{L}^{L} ∣ L \subseteq V\}$.

Example 2.6

Let V = {1, 2, 3} and assume all random variables are binary. Write P₀₀₁ ≡ P(X₁ = 0, X₂ = 0, X₃ = 1), and P₁₊₊ ≡ P(X₁ = 1), etc. Then

λ_{1}^{1} (0) = \frac{1}{2} \log \frac{P_{0 + +}}{P_{1 + +}},

which, up to a multiplicative constant, is the logit of the probability of the event {X₁ = 0}. Also,

λ_{1}^{12} (0) = \frac{1}{4} \log \frac{P_{00 +} P_{01 +}}{P_{10 +} P_{11 +}} \quad \text{and} \quad λ_{12}^{12} (0, 0) = \frac{1}{4} \log \frac{P_{00 +} P_{11 +}}{P_{10 +} P_{01 +}},

the log odds product and log odds ratio between X₁ and X₂ respectively.

If instead X₁ is ternary, we obtain

\begin{matrix} λ_{1}^{1} (0) = \frac{1}{3} \log \frac{P_{0 + +}^{2}}{P_{1 + +} P_{2 + +}}, \\ λ_{1}^{12} (0) = \frac{1}{6} \log \frac{P_{00 +}^{2} P_{01 +}^{2}}{P_{10 +} P_{11 +} P_{20 +} P_{21 +}} \quad \text{and} \quad λ_{12}^{12} (0, 0) = \frac{1}{6} \log \frac{P_{00 +}^{2} P_{11 +} P_{21 +}}{P_{10 +} P_{20 +} P_{01 +}^{2}} . \end{matrix}

Here $λ_{1}^{1} (0)$ contrasts the probability P(X₁ = 0) with the geometric mean of the probabilities P(X₁ = 1) and P(X₁ = 2). On the other hand, up to constants, $λ_{12}^{12} (0, 0)$ is an average of the two log odds ratios

\log \frac{P_{00 +} P_{21 +}}{P_{20 +} P_{01 +}} \quad \text{and} \quad \log \frac{P_{00 +} P_{11 +}}{P_{10 +} P_{01 +}},

and so gives a contrast between P(X₁ = X₂ = 0) and other joint probabilities in a way which generalizes the binary log odds ratio and provides a measure of dependence; in particular note that $λ_{12}^{12} (0, 0) = 0$ if X₁ ⫫ X₂.

Here we have written, for example, 12 instead of {1, 2}; similarly, for sets A and B we sometimes write AB for A ∪ B, and aB for {a} ∪ B.
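To make the link between Lemma 2.3 and the binary expressions in Example 2.6 explicit, the interaction parameter can be expanded term by term. This is only a sketch of the arithmetic: with |𝔛₁| = |𝔛₂| = 2, each factor in the product of (1) equals +1 when y_v = 0 and −1 when y_v = 1.

```latex
\begin{align*}
\lambda_{12}^{12}(0,0)
  &= \frac{1}{4} \sum_{y_1, y_2 \in \{0,1\}} \log p_{12}(y_1, y_2)\,
     \bigl(2 I_{\{y_1 = 0\}} - 1\bigr)\bigl(2 I_{\{y_2 = 0\}} - 1\bigr) \\
  &= \frac{1}{4} \bigl( \log P_{00+} - \log P_{01+} - \log P_{10+} + \log P_{11+} \bigr)
   = \frac{1}{4} \log \frac{P_{00+} P_{11+}}{P_{10+} P_{01+}} .
\end{align*}
```

The ternary expressions follow in the same way, with factors 3 I − 1 taking the values +2 and −1 for the variable X₁.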

2.2 Properties of Marginal Log-Linear Models

The next result relates marginal log-linear parameters to conditional independences; it is found as Lemma 1 in Rudas et al. (2010) and Equation (6) of Forcina et al. (2010).

Lemma 2.7

For any disjoint sets A, B and C, where C may be empty, A ⫫ B | C if and only if

λ_{A^{'} B^{'} C^{'}}^{ABC} = 0 \quad \text{for every } \emptyset \neq A^{'} \subseteq A, \; \emptyset \neq B^{'} \subseteq B, \; C^{'} \subseteq C .

The special case of C = ∅ (giving marginal independence) is proved in the context of multivariate logistic parameters by Kauermann (1997).

Example 2.8

Take a complete and hierarchical parametrization of 3 variables,

λ_{1}^{1} \quad λ_{2}^{2} \quad λ_{3}^{3} \quad λ_{12}^{12} \quad λ_{13}^{13} \quad λ_{23}^{123} \quad λ_{123}^{123} .

Then we can force X₁ ⫫ X₃ by setting $λ_{13}^{13} = 0$ . Similarly X₂ ⫫ X₃ | X₁ corresponds to setting $λ_{23}^{123} = λ_{123}^{123} = 0$ .
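The effects singled out by Lemma 2.7 can be enumerated mechanically. The sketch below uses hypothetical function names and represents sets of variables as frozensets; it returns the margin ABC together with every effect A′ ∪ B′ ∪ C′ whose parameters vanish under A ⫫ B | C, and reproduces the two constraints of Example 2.8.

```python
from itertools import combinations

def subsets(s, min_size=0):
    """All subsets of s with at least min_size elements, as frozensets."""
    items = sorted(s)
    for r in range(min_size, len(items) + 1):
        for c in combinations(items, r):
            yield frozenset(c)

def independence_constraints(A, B, C=()):
    """Margin ABC and the effects A' u B' u C' set to zero by A indep. B given C."""
    margin = frozenset(A) | frozenset(B) | frozenset(C)
    effects = {a | b | c
               for a in subsets(A, 1)   # A' must be non-empty
               for b in subsets(B, 1)   # B' must be non-empty
               for c in subsets(C, 0)}  # C' may be empty
    return margin, effects

marginal_case = independence_constraints({1}, {3})
conditional_case = independence_constraints({2}, {3}, {1})
```

In general the number of constrained effects is (2^|A| − 1)(2^|B| − 1)2^|C|, so larger conditioning sets lead to rapidly growing constraint lists.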

The following lemma shows that under conditional independence constraints, certain MLL parameters defined within different margins are equal.

Lemma 2.9

Suppose that A ⫫ B | C, and A is non-empty. Then for any D ⊆ C,

λ_{A D}^{ABC} (x_{A D}) = λ_{A D}^{A C} (x_{A D}), \quad \text{for each } x_{A D} \in 𝔛_{A D} .

The proof of this result is found in the supplementary material.
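Lemma 2.9 can be checked numerically on a small example. The sketch below is an illustration only: it assumes three binary variables indexed A = 0, B = 1, C = 2, builds a distribution satisfying A ⫫ B | C from arbitrary conditional probabilities, and uses a hypothetical helper `mll_binary` that specializes equation (1) to the binary case, where each contrast is ±1.

```python
import math
from itertools import product

def mll_binary(p, L, M, x_L):
    """lambda_L^M(x_L) for binary variables; L and M are tuples of indices."""
    p_M = {}
    for x, prob in p.items():
        key = tuple(x[v] for v in M)
        p_M[key] = p_M.get(key, 0.0) + prob
    total = 0.0
    for y_M, prob in p_M.items():
        sign = 1
        for v, x_v in zip(L, x_L):
            if y_M[M.index(v)] != x_v:
                sign = -sign  # each mismatch contributes a factor of -1
        total += sign * math.log(prob)
    return total / len(p_M)

# Arbitrary numbers chosen for illustration: P(C = c), P(A = 0 | c), P(B = 0 | c).
p_c = [0.3, 0.7]
p_a0_given_c = [0.2, 0.6]
p_b0_given_c = [0.5, 0.9]
p = {}
for a, b, c in product((0, 1), repeat=3):
    pa = p_a0_given_c[c] if a == 0 else 1 - p_a0_given_c[c]
    pb = p_b0_given_c[c] if b == 0 else 1 - p_b0_given_c[c]
    p[(a, b, c)] = p_c[c] * pa * pb  # factorization encodes A indep. B given C

lhs_A = mll_binary(p, (0,), (0, 1, 2), (0,))        # D empty, margin ABC
rhs_A = mll_binary(p, (0,), (0, 2), (0,))           # D empty, margin AC
lhs_AC = mll_binary(p, (0, 2), (0, 1, 2), (0, 0))   # D = C, margin ABC
rhs_AC = mll_binary(p, (0, 2), (0, 2), (0, 0))      # D = C, margin AC
```

The two pairs agree up to floating-point error, as the lemma predicts for D = ∅ and D = C.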

3 Acyclic Directed Mixed Graphs

We introduce basic graphical concepts used to describe the global Markov property and parametrization schemes.

Definition 3.1

A directed mixed graph 𝒢 consists of a set of vertices V, and both directed (→) and bidirected (↔) edges. Edges of the same type and orientation may not be repeated, but there may be multiple edges of different types between a pair of vertices.

A path in 𝒢 is a sequence of adjacent edges, without repetition of a vertex; a path may be empty, or equivalently consist of only one vertex. The first and last vertices on a path are the endpoints (these are not distinct if the path is empty); other vertices on the path are non-endpoints. The graph 𝒢₁ in Figure 1, for example, contains the path 1 → 2 → 4 ↔ 3, with endpoints 1 and 3, and non-endpoints 2 and 4. A directed path is one in which all the edges are directed (→) and are oriented in the same direction, whereas a bidirected path consists entirely of bidirected edges.

A directed cycle is a non-empty sequence of edges of the form v → ··· → v. An acyclic directed mixed graph (ADMG) is one which contains no directed cycles.

Definition 3.2

For a graph 𝒢 and a subset of its vertices A ⊆ V, we denote by 𝒢_A the induced subgraph formed by A; that is, the graph containing the vertices A, and the edges in 𝒢 whose endpoints are both in A.

Definition 3.3

Let a and d be vertices in a mixed graph 𝒢. If a = d, or there is a directed path from a to d, we say that a is an ancestor of d, and that d is a descendant of a. The sets of ancestors of d and descendants of a are denoted an_𝒢(d) and de_𝒢(a) respectively. If there is a directed path from a to d containing precisely one edge (a → d) then a is called a parent of d; the set of vertices which are parents of d is written pa_𝒢(d).

The district of a, denoted dis_𝒢(a), is the set containing a and all vertices which are connected to a by a bidirected path. These definitions are applied disjunctively to sets of vertices, so that, for example,

\operatorname{pa}_{𝒢} (W) \equiv \bigcup_{w \in W} \operatorname{pa}_{𝒢} (w), \qquad \operatorname{dis}_{𝒢} (W) \equiv \bigcup_{w \in W} \operatorname{dis}_{𝒢} (w) .

A set of vertices A is ancestral if A = an_𝒢(A); that is, A contains all its own ancestors.

Example 3.4

Consider the graph 𝒢₁ in Figure 1. We have

\operatorname{an}_{𝒢_{1}} (4) = \{1, 2, 4\}, \qquad \operatorname{an}_{𝒢_{1}} (\{2, 3\}) = \{1, 2, 3\} .

The district of 3 is the set {2, 3, 4}, and since 3 has no parents, pa_𝒢₁(3) = ∅.
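The sets in Example 3.4 can be reproduced with a few lines of code. This is a minimal sketch with hypothetical function names; the edge set of 𝒢₁ (1 → 2, 2 → 4, 2 ↔ 3, 3 ↔ 4) is an assumption read off from the paths and head-tail pairs described in the text, since the figure itself is not reproduced here.

```python
# Assumed edges of G_1: (a, b) in `directed` means a -> b;
# pairs in `bidirected` are unordered a <-> b.
directed = {(1, 2), (2, 4)}
bidirected = {(2, 3), (3, 4)}
vertices = {1, 2, 3, 4}

def parents(W, directed):
    """Union of the parent sets of the vertices in W."""
    return {a for (a, b) in directed if b in W}

def ancestors(W, directed):
    """Ancestors of W; every vertex counts as its own ancestor."""
    result, frontier = set(W), set(W)
    while frontier:
        frontier = parents(frontier, directed) - result
        result |= frontier
    return result

def district(a, within, bidirected):
    """Vertices joined to a by bidirected paths that stay inside `within`."""
    result, frontier = {a}, {a}
    while frontier:
        nxt = set()
        for u, w in bidirected:
            if u in frontier and w in within:
                nxt.add(w)
            if w in frontier and u in within:
                nxt.add(u)
        frontier = nxt - result
        result |= frontier
    return result

an_4 = ancestors({4}, directed)
an_23 = ancestors({2, 3}, directed)
dis_3 = district(3, vertices, bidirected)
```

Restricting `within` to a subset of the vertices gives districts in an induced subgraph, which is what the definition of a tail in Section 3.2 requires.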

Note that by the definitions of some authors, vertices are not their own ancestors (Lauritzen, 1996). The above notations may be shortened on induced subgraphs so that pa_A ≡ pa_{𝒢_A}, and similarly for other definitions. In some cases where the meaning is clear, we will dispense with the subscript altogether.

We use the now standard notation of Dawid (1979), and represent the statement ‘X is independent of Y given Z under a probability measure P’, for random variables X, Y and Z, by X ⫫ Y | Z [P]. If P is unambiguous, this part is dropped, and if Z is empty we write simply X ⫫ Y. Finally, we abuse notation in the usual way: v and X_v are used interchangeably as both a vertex and a random variable; likewise A denotes both a vertex set and X_A.

3.1 Global Markov Property for ADMGs

A Markov property associates a particular set of independence relations with a graph.

A non-endpoint vertex c on a path is a collider on the path if the edges preceding and succeeding c on the path have an arrowhead at c, for example → c ← or ↔ c ←; otherwise c is a non-collider. A path between vertices a and b in a mixed graph is said to be blocked given a set C if either

there is a non-collider on the path in C, or
there is a collider on the path which is not in an_𝒢(C).

If all paths from a to b are blocked by C, then a and b are said to be m-separated given C. Sets A and B are said to be m-separated given C if every a ∈ A and every b ∈ B are m-separated given C. This naturally extends the d-separation criterion of Pearl (1988) to graphs with bidirected edges.

A probability measure P on 𝔛 is said to satisfy the global Markov property for 𝒢 if for every triple of disjoint sets of vertices A, B and C,

A \text{ is m-separated from } B \text{ given } C \text{ in } 𝒢 \; \Rightarrow \; X_{A} ⫫ X_{B} ∣ X_{C} \; [P] .

The model associated with an ADMG 𝒢 is simply the set of distributions that obey the global Markov property for 𝒢.

Proposition 3.5

If a path from x to y is not blocked given Z, then every vertex on the path is in an_𝒢({x, y} ∪ Z).

Proof

This follows from the definition of m-separation.

Example 3.6

Consider the graph 𝒢₁ in Figure 1. There are two paths between the vertices 1 and 4,

π_{1} : 1 \to 2 \to 4 \quad \text{and} \quad π_{2} : 1 \to 2 \leftrightarrow 3 \leftrightarrow 4;

both are blocked by C = {2}. π₁ is blocked because 2 is a non-collider on the path and is in C, while π₂ is blocked because 3 is a collider on the path and is not in an_𝒢₁(C) = {1, 2}. Hence {1} and {4} are m-separated given {2} in 𝒢₁.

One can similarly see that {1} and {3} are m-separated given C = ∅, and that no other m-separations hold for this graph. Thus a joint distribution P obeys the global Markov property for 𝒢₁ if and only if X₁ ⫫ X₄ | X₂ [P] and X₁ ⫫ X₃ [P].
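The m-separation criterion can be checked by brute force on small graphs. The following sketch is not an efficient algorithm and its function names are hypothetical: it walks every path without repeated vertices from a, abandoning a path as soon as a non-endpoint is a non-collider in C or a collider outside an_𝒢(C). The edge set of 𝒢₁ is again an assumption taken from the paths quoted in Example 3.6, and a and b are assumed not to lie in C.

```python
def ancestors(W, directed):
    """Ancestors of W; every vertex counts as its own ancestor."""
    result, frontier = set(W), set(W)
    while frontier:
        frontier = {a for (a, b) in directed if b in frontier} - result
        result |= frontier
    return result

def edges_at(v, directed, bidirected):
    """Yield (neighbour, arrowhead at v, arrowhead at neighbour) for edges at v."""
    for a, b in directed:
        if a == v:
            yield b, False, True
        if b == v:
            yield a, True, False
    for a, b in bidirected:
        if a == v:
            yield b, True, True
        if b == v:
            yield a, True, True

def m_separated(a, b, C, directed, bidirected):
    """True if every path between a and b is blocked given C."""
    anc_C = ancestors(C, directed)

    def open_path_exists(v, visited, head_into_v):
        for w, head_at_v, head_at_w in edges_at(v, directed, bidirected):
            if w in visited:
                continue
            if v != a:  # v is a non-endpoint of the extended path
                collider = head_into_v and head_at_v
                if collider and v not in anc_C:
                    continue  # blocked: collider outside an(C)
                if not collider and v in C:
                    continue  # blocked: non-collider in C
            if w == b or open_path_exists(w, visited | {w}, head_at_w):
                return True
        return False

    return not open_path_exists(a, {a}, False)

directed = {(1, 2), (2, 4)}    # assumed edges of G_1
bidirected = {(2, 3), (3, 4)}
sep_14_given_2 = m_separated(1, 4, {2}, directed, bidirected)
sep_13_marginal = m_separated(1, 3, set(), directed, bidirected)
```

Both values are true, matching the two independences of Example 3.6, while conditioning on additional vertices opens a path through a collider.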

By similar arguments the independences associated with the ADMGs in Figure 2 may also be read off.

3.2 Existing Parametrization of ADMG models

This subsection defines the parameters of Richardson (2009) for multivariate discrete distributions satisfying the global Markov property for an ADMG.

Definition 3.7

Let 𝒢 be an ADMG with vertex set V. We say that a collection of vertices W ⊆ V is barren if for each v ∈ W, we have W ∩ de_𝒢(v) = {v}; in other words v has no non-trivial descendants in W. For an arbitrary set of vertices U, the maximal subset with no non-trivial descendants in U is denoted barren_𝒢(U).

A head is a collection of vertices H which is connected by bidirected paths in 𝒢_an(H) and is barren in 𝒢. We write ℋ(𝒢) for the collection of heads in 𝒢. The tail of a head H is the set

\operatorname{tail}_{𝒢} (H) \equiv \operatorname{pa}_{𝒢} (\operatorname{dis}_{\operatorname{an} (H)} (H)) \cup (\operatorname{dis}_{\operatorname{an} (H)} (H) \setminus H) .

Thus the tail of H is the set of vertices in V\H connected to a vertex in H by a path on which every vertex is a collider and an ancestor of a vertex in H. We typically write T for a tail, provided it is clear which head it belongs to.
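For small graphs the heads and tails can be found by exhaustive search over subsets of vertices. The sketch below applies Definition 3.7 literally; the function names are hypothetical, the search is exponential in |V|, and the edge set of 𝒢₁ is an assumption read off from the paths described in the text.

```python
from itertools import combinations

def ancestors(W, directed):
    """Ancestors of W; every vertex counts as its own ancestor."""
    result, frontier = set(W), set(W)
    while frontier:
        frontier = {a for (a, b) in directed if b in frontier} - result
        result |= frontier
    return result

def district(seed, within, bidirected):
    """Vertices joined to seed by bidirected paths that stay inside `within`."""
    result, frontier = {seed}, {seed}
    while frontier:
        nxt = set()
        for u, w in bidirected:
            if u in frontier and w in within:
                nxt.add(w)
            if w in frontier and u in within:
                nxt.add(u)
        frontier = nxt - result
        result |= frontier
    return result

def heads_and_tails(vertices, directed, bidirected):
    """Map each head H (a frozenset) to its tail, by checking every subset."""
    result = {}
    for r in range(1, len(vertices) + 1):
        for H in map(frozenset, combinations(sorted(vertices), r)):
            # Barren: no vertex of H is an ancestor of another vertex of H.
            barren = all(v not in ancestors({w}, directed)
                         for v in H for w in H if v != w)
            an_H = ancestors(H, directed)
            dis = district(next(iter(H)), an_H, bidirected)
            if barren and H <= dis:  # H is bidirected-connected within an(H)
                pa = {a for (a, b) in directed if b in dis}
                result[H] = frozenset((pa | dis) - H)
    return result

pairs = heads_and_tails({1, 2, 3, 4}, {(1, 2), (2, 4)}, {(2, 3), (3, 4)})
```

For the assumed edge set this returns six heads, including {3, 4} with tail {1, 2}.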

Proposition 3.8

Let H be a head. Then (i) H = barren_𝒢(H ∪ tail_𝒢(H)); (ii) tail_𝒢(H) ⊆ an_𝒢(H).

Proof

Immediate from the respective definitions.

Richardson (2009) shows that discrete distributions obeying the global Markov property for an ADMG 𝒢 are parametrized by the conditional probabilities:

\{P (X_{H} = x_{H} ∣ X_{T} = x_{T}) ∣ H \in ℋ(𝒢), T = \operatorname{tail}_{𝒢} (H), x_{H} \in \tilde{𝔛}_{H}, x_{T} \in 𝔛_{T}\} .

This is achieved via factorizations based on head-tail pairs; let ≺ be the partial ordering on heads such that H_i ≺ H_j if H_i ⊂ an_𝒢(H_j) and H_i ≠ H_j. This is well defined, since otherwise 𝒢 would contain a directed cycle. Then let [·]_𝒢 be a function which partitions sets of vertices into heads by repeatedly removing heads which are maximal under ≺.

Then P satisfies the global Markov property for 𝒢 if and only if it obeys the factorizations

P (X_{A} = x_{A}) = \prod_{H \in {[A]}_{𝒢}} P (X_{H} = x_{H} ∣ X_{T} = x_{T}) \qquad (2)

for ancestral sets of vertices A; see Richardson (2009) for details. In the case of a directed acyclic graph (DAG), this corresponds to the probability distribution of each vertex conditional on its parents.

Example 3.9

Consider again the ADMG 𝒢₁ in Figure 1; its head-tail pairs (H, T) are (1, ∅), (2, 1), (3, ∅), (23, 1), (4, 2) and (34, 12). Multivariate binary distributions obeying the global Markov property with respect to 𝒢₁ are therefore parametrized by

\begin{array}{llll} p_{1} (0) & p_{2 ∣ 1} (0 ∣ x_{1}) & p_{3} (0) & p_{23 ∣ 1} (0, 0 ∣ x_{1}) \\ p_{4 ∣ 2} (0 ∣ x_{2}) & p_{34 ∣ 12} (0, 0 ∣ x_{1}, x_{2}), & & \end{array}

for x₁, x₂ ∈ {0, 1}, as mentioned in the Introduction.
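As a worked instance of factorization (2), three ancestral sets of 𝒢₁ can be factorized using the head-tail pairs of Example 3.9. The partitions [A]_𝒢 used here were obtained by hand, by repeatedly removing maximal heads, so they should be read as a sketch to be checked against Richardson (2009) rather than as a quotation from it.

```latex
\begin{align*}
A = \{1,3\}: \quad
  & p_{13}(x_1, x_3) = p_1(x_1)\, p_3(x_3), \\
A = \{1,2,4\}: \quad
  & p_{124}(x_1, x_2, x_4) = p_1(x_1)\, p_{2 \mid 1}(x_2 \mid x_1)\, p_{4 \mid 2}(x_4 \mid x_2), \\
A = \{1,2,3,4\}: \quad
  & p_{1234}(x_1, x_2, x_3, x_4) = p_1(x_1)\, p_{2 \mid 1}(x_2 \mid x_1)\,
    p_{34 \mid 12}(x_3, x_4 \mid x_1, x_2).
\end{align*}
```

The first line is the statement X₁ ⫫ X₃ and the second gives X₄ ⫫ X₁ | X₂, the two independences read off in Example 3.6.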

3.3 Graphical Completions

Given a discrete model defined by a set of conditional independence constraints, it is natural to consider it as a sub-model of the saturated model, which contains all positive probability distributions. In a setting where the model is graphical, it becomes equally natural to think of the graph as a subgraph of a complete graph, by which we mean a graph containing at least one edge between every pair of vertices. We can obtain a complete graph from an incomplete one by inserting edges between each pair of vertices which lack one, but this leaves a choice of edge type and orientation. These choices may affect how much of the structure and spirit of the original graph is retained; we will require that a completion preserves the heads of the original graph, which helps to preserve the structure of the parametrization.

Definition 3.10

Given an ADMG 𝒢 and a supergraph 𝒢̄, we call 𝒢̄ a head-preserving completion of 𝒢 if 𝒢̄ is complete, and ℋ(𝒢) ⊆ ℋ(𝒢̄).

It is easy to see that a head-preserving completion always exists; for example, if we add in a bidirected edge between every pair of vertices which are not joined by an edge, then it is clear that barren sets in 𝒢 will remain barren in 𝒢̄, and bidirected connected sets in 𝒢 will remain bidirected connected in 𝒢̄.
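The construction just described, joining every non-adjacent pair by a bidirected edge, can be sketched as follows. The function name is hypothetical and the edge set of 𝒢₁ is an assumption read off from the text; the completion drawn in Figure 4 need not coincide with the one produced here, since head-preserving completions are not unique.

```python
from itertools import combinations

def bidirected_completion(vertices, directed, bidirected):
    """Bidirected edge set after joining every non-adjacent pair by <->.

    Directed edges are left untouched, so only the bidirected set is returned.
    """
    adjacent = {frozenset(e) for e in directed} | {frozenset(e) for e in bidirected}
    added = {pair for pair in combinations(sorted(vertices), 2)
             if frozenset(pair) not in adjacent}
    return set(bidirected) | added

directed = {(1, 2), (2, 4)}    # assumed edges of G_1
bidirected = {(2, 3), (3, 4)}
completed = bidirected_completion({1, 2, 3, 4}, directed, bidirected)
```

For 𝒢₁ under these assumptions the added edges are 1 ↔ 3 and 1 ↔ 4, after which every pair of vertices is adjacent.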

Note that it is not necessary for every pair of vertices to be joined by an edge in order for a graph to represent the saturated model; however, we will require this for our completions.

Example 3.11

Figure 4 shows a head-preserving completion of the ADMG in Figure 1.

Figure 4: A head-preserving completion, 𝒢̄₁, of the ADMG in Figure 1.

Proposition 3.12

If 𝒢̄ is a head-preserving completion of 𝒢 then an_𝒢(v) ⊆ an_𝒢̄(v). In particular, if a set A is ancestral in 𝒢̄ then A is also ancestral in 𝒢.

Proof

This follows because 𝒢 contains a subset of the edges in 𝒢̄.

4 Ingenuous Parametrization of an ADMG model

We now use the marginal log-linear parameters defined in Section 2 to parametrize the ADMG models discussed in Section 3.

Definition 4.1

Consider an ADMG 𝒢 with head-tail pairs (H_i, T_i) over some index i, and let M_i = H_i ∪ T_i. Further, let 𝕃_i = {A | H_i ⊆ A ⊆ H_i ∪ T_i}. This collection of margins and associated effects is the ingenuous parametrization of 𝒢, denoted ℙ^ing(𝒢).

Example 4.2

We return again to the ADMG 𝒢₁ in Figure 1; the head-tail pairs are (1, ∅), (2, 1), (3, ∅), (23, 1), (4, 2) and (34, 12), meaning that the ingenuous parametrization is given by the following margins and effects:

M	𝕃
1	1
12	2, 12
3	3
123	23, 123
24	4, 24
1234	34, 134, 234, 1234

Note that the ordering of the margins given here is hierarchical; in order to use most of the results of Bergsma and Rudas (2002), we need to confirm that the definition above always leads to a hierarchical parametrization, which is shown by the following result.
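The margins and effects of Example 4.2 follow mechanically from the head-tail pairs via Definition 4.1. The sketch below uses a hypothetical function name and takes the head-tail pairs of 𝒢₁ as given in the text, in the same order as the table.

```python
from itertools import combinations

def ingenuous(head_tail_pairs):
    """List of (margin, effects): margin H u T, effects {A : H <= A <= H u T}."""
    params = []
    for H, T in head_tail_pairs:
        H, T = frozenset(H), frozenset(T)
        effects = [H | frozenset(extra)
                   for r in range(len(T) + 1)
                   for extra in combinations(sorted(T), r)]
        params.append((H | T, effects))
    return params

# Head-tail pairs of G_1 as listed in Example 4.2.
pairs = [((1,), ()), ((2,), (1,)), ((3,), ()),
         ((2, 3), (1,)), ((4,), (2,)), ((3, 4), (1, 2))]
params = ingenuous(pairs)
n_effects = sum(len(effects) for _, effects in params)
```

Each head with tail T contributes 2^|T| effects, giving 1 + 2 + 1 + 2 + 2 + 4 = 12 effects for 𝒢₁.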

Lemma 4.3

For any ADMG 𝒢, there is an ordering on the margins M_i of the ingenuous parametrization ℙ^ing(𝒢) which is hierarchical.

Proof

Firstly we show that for distinct heads H_i and H_j, the collections 𝕃_i and 𝕃_j are disjoint. To see this, assume for a contradiction that there exists A such that H_i ⊆ A ⊆ H_i ∪ T_i and H_j ⊆ A ⊆ H_j ∪ T_j. Since H_i ≠ H_j, assume without loss of generality that there exists $v \in H_{i} \cap H_{j}^{c} \subseteq A$ .

Then v ∈ H_j ∪ T_j implies that v ∈ T_j, and thus there is a directed path from v to some w ∈ H_j. Now, w ∉ H_i, since v, w ∈ H_i would imply that H_i is not barren. But if $w \in H_{j} \cap H_{i}^{c}$ , then by the same argument as above we can find a directed path from w to some x ∈ H_i. Then v → ··· → w → ··· → x is a directed path between elements of H_i, which is a contradiction. Thus 𝕃_i and 𝕃_j are disjoint.

Now, consider the partial ordering ≺ of heads defined in Section 3.2: H_i ≺ H_j whenever H_i ⊂ an_𝒢(H_j) and H_i ≠ H_j. Any total ordering which respects this partial ordering is hierarchical, because each set A ∈ 𝕃_i is a subset of the ancestors of H_i.

We proceed to show that the ingenuous parameters for an ADMG 𝒢 characterize the set of distributions which obey the global Markov property with respect to 𝒢.

Lemma 4.4

For any sets M and L ⊆ M, the collection of MLL parameters

\{λ_{A}^{M} (x_{A}) ∣ L \subseteq A \subseteq M, x_{M} \in \tilde{𝔛}_{M}\},

together with the (|L| − 1)-dimensional marginal distributions of X_L conditional on X_{M∖L}, smoothly parametrizes the distribution of X_L conditional on X_{M∖L}.

A proof is given in the supplementary material. We now come to the main result of this section.

Theorem 4.5

The ingenuous parametrization Λ̃(ℙ^ing(𝒢)) of an ADMG 𝒢 parametrizes precisely those distributions P obeying the global Markov property with respect to 𝒢.

Proof

We proceed by induction. Again we use the partial ordering ≺ on heads from Section 3.2. For the base case, we know that singleton heads {h} with empty tails are parametrized by the logits $λ_{h}^{h}$ .

Now, suppose that we wish to find the distribution of a head H conditional on its tail T. Assume that we have the distribution of all heads H′ which precede H, conditional on their respective tails; we claim this is sufficient to give the (|H| − 1)-dimensional marginal distributions of H conditional on T.

Let v ∈ H, and let C = H\{v} be a (|H| − 1)-dimensional marginal of interest. The set A = an_𝒢(H)\{v} is ancestral, since v cannot have (non-trivial) descendants in an_𝒢(H); in particular C ∪ T ⊆ A. Theorem 4 of Richardson (2009) states that the factorization in equation (2) holds for every ancestral set, so

p_{A} (x_{A}) = \prod_{\begin{matrix} H^{'} \in {[A]}_{𝒢} \\ T^{'} = \operatorname{tail} (H^{'}) \end{matrix}} p_{H^{'} ∣ T^{'}} (x_{H^{'}} ∣ x_{T^{'}}) .

But all the probabilities in the product are known by our induction hypothesis, and the marginal distribution of C conditional on T is given by the distribution of A.

The ingenuous parametrization, by definition, contains $λ_{A}^{H \cup T}$ for H ⊆ A ⊆ H ∪ T, and thus the result follows from Lemma 4.4.

Example 4.6

Returning to our running example, the graph 𝒢₁ in Figure 1 corresponds to the model

\{P ∣ X_{1} ⫫ X_{4} ∣ X_{2} \; [P] \text{ and } X_{1} ⫫ X_{3} \; [P]\} .

Theorem 4.5 tells us that this collection of distributions is precisely characterized by the ingenuous parameters for 𝒢₁,

\begin{matrix} λ_{1}^{1} & λ_{2}^{12} & λ_{12}^{12} & λ_{3}^{3} & λ_{23}^{123} & λ_{123}^{123} \\ λ_{4}^{24} & λ_{24}^{24} & λ_{34}^{1234} & λ_{134}^{1234} & λ_{234}^{1234} & λ_{1234}^{1234} . \end{matrix}
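A simple parameter count gives a sanity check on Example 4.6 in the binary case. This is a sketch of the arithmetic only: for binary variables each $λ_{L}^{M}$ contributes a single free parameter, because 𝔛̃_L then contains one element, and the three constraints listed are those given by Lemma 2.7 for the two independences defining the model.

```latex
\begin{align*}
\text{saturated model on four binary variables:} \quad
  & 2^4 - 1 = 15 \text{ parameters}, \\
\text{ingenuous parameters for } \mathcal{G}_1: \quad
  & 1 + 2 + 1 + 2 + 2 + 4 = 12, \\
\text{constraints from Lemma 2.7:} \quad
  & \lambda_{13}^{13} = 0, \quad \lambda_{14}^{124} = \lambda_{124}^{124} = 0
    \quad (15 - 12 = 3).
\end{align*}
```

The difference between the saturated and ingenuous counts therefore matches the number of constraints imposed by X₁ ⫫ X₃ and X₁ ⫫ X₄ | X₂.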

4.1 Constraint-Based Model Description

The results above show that the ingenuous parameters for an ADMG 𝒢, like Richardson’s parameters, provide precisely the information required to reconstruct a distribution obeying the global Markov property for 𝒢. However, it is difficult to use this parametrization in practice unless we can evaluate the likelihood, which requires us to make explicit the map which we have implicitly defined from the ingenuous parameters to the joint probability distribution under the model. For example, for the parameters in Richardson (2009) there is an explicit map from the parameters back to the joint distribution using a generalization of Möbius inversion. This was used by Evans and Richardson (2010) to fit these models via maximum likelihood. In contrast, the map from ingenuous parameters to the joint distribution cannot be written in closed form.

An alternative approach is to consider the ingenuous parametrization as part of a larger, complete parametrization of the saturated model, such that the additional parameters are constrained to be zero under the sub-model defined by 𝒢. This enables us to fit the model using Lagrange-type algorithms, as in Evans and Forcina (2013).

Theorem 4.7

Let 𝒢 be an ADMG, and 𝒢̄ a head-preserving completion of 𝒢. The ingenuous parametrization of 𝒢 corresponds to setting

λ_{L}^{M} = 0

for (L, M) ∈ ℙ^ing(𝒢̄) whenever L does not appear as an effect in ℙ^ing(𝒢). In particular, these constraints define the set of distributions which satisfy the global Markov property with respect to 𝒢.

To prove this result we require the following lemma.

Lemma 4.8

Let 𝒢̄ be a head-preserving completion of 𝒢, and let H ∈ ℋ(𝒢) have tails T and T̄ in 𝒢 and 𝒢̄ respectively. Then under the global Markov property for 𝒢,

H ⫫ (\bar{T} \ T) ∣ T [P] .

Proof

Let π be a path in 𝒢 from some h ∈ H to t ∈ T̄ \ T, and assume without loss of generality that π does not intersect H or T̄ \ T other than at its endpoints. By Proposition 3.5, every vertex on π is in an_𝒢({h, t} ∪ T) ⊆ an_𝒢(H ∪ T̄). Since 𝒢̄ is complete, if v ∈ an_𝒢̄(H ∪ T̄), then v ∈ H ∪ T̄, thus H ∪ T̄ is ancestral in 𝒢̄. By Proposition 3.12, H ∪ T̄ is also ancestral in 𝒢, thus every vertex on π is in H ∪ T̄.

By Proposition 3.8, T̄ ⊆ an_𝒢̄(H), so H ∪ T̄ = an_𝒢̄(H). However, since H forms a head in 𝒢̄, H is barren in 𝒢̄. Thus in 𝒢̄, no proper descendant of a vertex in H is on π, and by Proposition 3.12 this also holds in 𝒢.

Now let y be the first vertex after h on π that is not in T. By hypothesis, y exists since t ∉ T. By construction, any vertices between h and y on π are in T, hence are colliders on π and ancestors of H in 𝒢 (by Proposition 3.8). Thus y ∈ dis_𝒢(H) ∪ pa_𝒢(dis_𝒢(H)). If y ∈ an_𝒢(H) then y ∈ T, which is a contradiction, hence y ∈ dis_𝒢(H) and y ∉ an_𝒢(H). As shown earlier, y is not a descendant of a vertex in H, so H ∪ {y} forms a head in 𝒢. Since 𝒢̄ is a head-preserving completion, it follows that H ∪ {y} also forms a head in 𝒢̄, and thus y ∉ an_𝒢̄(H) = H ∪ T̄, but this is a contradiction.

Proof of Theorem 4.7

Let (H, T̄) be a head-tail pair in 𝒢̄. There are three possibilities for how this pair relates to 𝒢: if (H, T̄) is also a head-tail pair in 𝒢, then there is no work to be done; otherwise either (i) H is not a head in 𝒢, or (ii) H is a head in 𝒢 but T̄ is not its tail.

If (i) holds, then we claim that under 𝒢, $λ_{A}^{H \bar{T}} = 0$ for all H ⊆ A ⊆ H ∪ T̄. To see this, first note that H is a barren set in 𝒢̄, and since H is maximally connected, this means that all elements are joined by bidirected edges in 𝒢̄. Since 𝒢 contains a subset of the edges in 𝒢̄, H is also barren in 𝒢; since H is not a head in 𝒢 this means that H = K ∪ L for disjoint non-empty sets K and L with no edges directly connecting them. But this implies that K and L are m-separated conditional on T̄, and thus X_K ⫫ X_L | X_T̄ under the Markov property for 𝒢. Then, by Lemma 2.7, these parameters are all identically zero under 𝒢.

(ii) implies that H is a head in both 𝒢 and 𝒢̄, but T̄ ≡ tail_𝒢̄(H) ⊃ tail_𝒢(H) ≡ T. Then $λ_{A}^{H \bar{T}} = 0$ for all H ⊆ A ⊆ H ∪ T̄ such that A ∩ (T̄ \ T) ≠ ∅; this follows from Lemma 4.8 and application of Lemma 2.7.

We have shown that all parameters corresponding to effects not found in ℙ^ing(𝒢) are identically zero under 𝒢. The vanishing of these parameters defines the correct sub-model, but note that some of the margins in ℙ^ing(𝒢̄) which we have not yet considered are not the same as those in ℙ^ing(𝒢). These remaining cases are again from (ii), but where H ⊆ A ⊆ H ∪ T; in this case $λ_{A}^{H \bar{T}} = λ_{A}^{H T}$ under 𝒢, again due to Lemma 4.8, this time combined with Lemma 2.9.

Thus we have shown that under 𝒢, all the ingenuous parameters for 𝒢̄ are either zero or equal to ingenuous parameters for 𝒢. Combined with Theorem 4.5, this shows that those constraints define the model.

Example 4.9

Consider again the ADMG 𝒢₁ in Figure 1; a possible head-preserving completion 𝒢̄₁ (shown in Figure 4) is obtained by adding the edges 1 → 3 and 1 → 4. The ingenuous parametrization for 𝒢̄₁ is

M	L
1	1
12	2, 12
13	3, 13
123	23, 123
124	4, 14, 24, 124
1234	34, 134, 234, 1234.

The effects found in ℙ^ing(𝒢̄₁) but not in ℙ^ing(𝒢₁) are 13, 14, and 124, and indeed the sub-model defined by 𝒢₁ corresponds to setting

λ_{13}^{13} = λ_{14}^{124} = λ_{124}^{124} = 0;

under this model the following equalities hold by Lemma 2.9:

λ_{4}^{124} = λ_{4}^{24}, \qquad λ_{24}^{124} = λ_{24}^{24} .

Removing the zero parameters in ℙ^ing(𝒢̄₁) and renaming two others according to the above equations returns us to the ingenuous parametrization of 𝒢₁.

Theorem 4.7 shows that we can fit the model defined by 𝒢₁ by maximum likelihood simply by maximizing the log-likelihood subject to $λ_{13}^{13} = λ_{14}^{124} = λ_{124}^{124} = 0$ . In particular, this approach always provides a list of independent constraints which characterize the model.

An obvious question which arises is whether any completion of a graph will lead to a complete parametrization with the property of Theorem 4.7. We can obtain a counterexample by considering the complete graph 𝒢̄₁ in Figure 5, which has ingenuous parametrization

M	L
3	3
13	1, 13
123	2, 12, 23, 123
1234	4, 14, 24, 124, 34, 134, 234, 1234.

Figure 5. A complete ADMG, 𝒢̄₁, of which 𝒢₁ is a subgraph, but whose ingenuous parametrization does not contain the model described by 𝒢₁ as a linear sub-space because the associated completion is not head-preserving.

The graph 𝒢₁ in Figure 1 is a subgraph of 𝒢̄₁, and corresponds to the model obtained by setting $λ_{13}^{13} = λ_{14}^{124} = λ_{124}^{124} = 0$ ; however, these last two parameters do not appear in the ingenuous parametrization of 𝒢̄₁, and so there is no way to enforce the sub-model as a linear constraint.
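The comparison made in Example 4.9 and in the counterexample above amounts to simple set operations on (effect, margin) pairs. The sketch below is illustrative only: the three tables are transcribed by hand from the text, with vertex sets written as digit strings, and `pairs` and `zero_constraints` are hypothetical helper names.

```python
def pairs(table):
    """Flatten a margin -> effects table into a set of (effect, margin) pairs."""
    return {(effect, margin) for margin, effects in table.items() for effect in effects}


# Ingenuous parametrization of G1 (Example 4.6), margin -> effects.
G1 = {
    "1": ["1"], "12": ["2", "12"], "3": ["3"], "123": ["23", "123"],
    "24": ["4", "24"], "1234": ["34", "134", "234", "1234"],
}
# Head-preserving completion of Example 4.9 (Figure 4).
FIG4_COMPLETION = {
    "1": ["1"], "12": ["2", "12"], "13": ["3", "13"], "123": ["23", "123"],
    "124": ["4", "14", "24", "124"], "1234": ["34", "134", "234", "1234"],
}
# Complete graph of Figure 5, which is not a head-preserving completion.
FIG5_COMPLETION = {
    "3": ["3"], "13": ["1", "13"], "123": ["2", "12", "23", "123"],
    "1234": ["4", "14", "24", "124", "34", "134", "234", "1234"],
}


def zero_constraints(completion, model):
    """Parameters of the completion whose effect does not occur in the model."""
    model_effects = {effect for effect, _ in pairs(model)}
    return sorted(p for p in pairs(completion) if p[0] not in model_effects)


# Constraints that define G1 inside the head-preserving completion.
required = zero_constraints(FIG4_COMPLETION, G1)
# Which of those parameters are unavailable in the Figure 5 parametrization?
missing = [p for p in required if p not in pairs(FIG5_COMPLETION)]

print("set to zero:", required)
print("not available in Figure 5 parametrization:", missing)
```

Both completions are complete parametrizations of four binary variables, so each table lists all 15 non-empty effects exactly once; only the margins in which the effects are measured differ.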

𝒢̄₁ is, of course, not head-preserving. Such completions may still lead to parametrizations which satisfy the property of Theorem 4.7: for example, if the edge 1 → 3 is added to the graph in Figure 6(a), this destroys the head {1, 2, 3}, but the sub-model corresponds to $λ_{13}^{13} = 0$ , which is a parameter in the complete graph.

Figure 6. (a) a graph with a variation dependent ingenuous parametrization; (b) a Markov equivalent graph to (a) with a variation independent ingenuous parametrization; (c) a graph with no variation independent MLL parametrization.

4.2 Relationship To Prior Work

Rudas et al. (2010) parametrize chain graph models of multivariate regression type, also known as type IV chain graph models, using marginal log-linear parameters. Type IV chain graph models are a special case of ADMG models, in the sense that by replacing the undirected edges in a type IV chain graph with bidirected edges, the global Markov property on the resulting ADMG is equivalent to the Markov property for the chain graph (see Drton, 2009). The graphs in Figure 6 are examples of Type IV models. However, there are models in the class of ADMGs which do not correspond to any chain graph, such as the one described by 𝒢₁ in Figure 1.

The parametrization of Rudas et al. (2010) uses different choices of margins to the ingenuous parametrization, though their parameters can be shown to be equal to the parameters considered here under the global Markov property, using Lemma 2.9. Thus the variation dependence properties of that parametrization are identical to those of the ingenuous parametrization (see next section). Forcina et al. (2010) provide an algorithm which gives a range of ‘admissible’ margins in which collections of conditional independence constraints may be defined.

Marchetti and Lupparelli (2011) also parametrize type IV chain graph models in a similar manner to Rudas et al. (2010), in that case using multivariate logistic contrasts.

5 Variation Independence

As discussed in the introduction, the interpretation of parameters and the construction of prior distributions is simpler when parameters are variation independent.

Definition 5.1

Let θ_i, for i = 1, …, k, be a collection of parameters such that θ_i takes all values in the set Θ_i. We say that the vector θ = (θ₁, …, θ_k) is variation independent if θ can take every value in the set Θ₁ × ··· × Θ_k.

Bergsma and Rudas (2002) characterize precisely which hierarchical and complete parametrizations are variation independent, using a notion they call ordered decomposability. We now do this for ingenuous parametrizations.

Definition 5.2

A collection of sets ℳ = {M₁, …, M_k} is incomparable if M_i ⊈ M_j for every i ≠ j.

A collection ℳ of incomparable subsets of V is decomposable if it has at most two elements, or there is an ordering M₁, …, M_k on the elements of ℳ wherein for each i = 3, …, k, there exists j_i < i such that

(\cup_{l = 1}^{i - 1} M_{l}) \cap M_{i} = M_{j_{i}} \cap M_{i} .

This is also known as the running intersection property.

A collection ℳ of (possibly comparable) subsets is ordered decomposable if it has at most two elements, or there is an ordering M₁, …, M_k such that M_i ⊈ M_j for i > j, and for each i = 3, …, k, the inclusion maximal elements of {M₁, …, M_i} form a decomposable collection. We say that a collection ℙ of parameters is ordered decomposable if there is an ordering on the margins ℳ which is both hierarchical and ordered decomposable.

The following example is found in Bergsma and Rudas (2002).

Example 5.3

Let ℳ = {12, 13, 23, 123}. In order to have a hierarchical ordering of these margins it is clear that the set 123 must come last, but there is no way to order the collection of inclusion maximal margins {12, 13, 23} such that it has the running intersection property. Thus ℳ is not ordered decomposable.
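For small collections, Definition 5.2 can be checked by brute force over orderings. The following sketch is one possible direct transcription of the definition, not an efficient algorithm; the function names are our own, and margins are represented as frozensets of single-character vertex labels.

```python
from itertools import permutations


def maximal(sets):
    """Inclusion-maximal elements of a collection of frozensets."""
    sets = set(sets)
    return [s for s in sets if not any(s < t for t in sets)]


def is_decomposable(sets):
    """True if some ordering has the running intersection property."""
    sets = list(set(sets))
    if len(sets) <= 2:
        return True
    for order in permutations(sets):
        ok = True
        for i in range(2, len(order)):
            earlier = frozenset().union(*order[:i])
            # Need some j < i with (M_1 u ... u M_{i-1}) n M_i = M_j n M_i.
            if not any(earlier & order[i] == order[j] & order[i] for j in range(i)):
                ok = False
                break
        if ok:
            return True
    return False


def is_ordered_decomposable(sets):
    """True if some hierarchical ordering keeps the maximal elements decomposable."""
    sets = list(set(sets))
    if len(sets) <= 2:
        return True
    for order in permutations(sets):
        # Hierarchical: no margin is contained in one that precedes it.
        if any(order[i] <= order[j] for i in range(len(order)) for j in range(i)):
            continue
        if all(is_decomposable(maximal(order[: i + 1])) for i in range(2, len(order))):
            return True
    return False


example_5_3 = [frozenset(m) for m in ("12", "13", "23", "123")]
print(is_decomposable(example_5_3[:3]), is_ordered_decomposable(example_5_3))
```

The search is factorial in the number of margins, which is acceptable for the handful of margins arising in the examples of this section but not for large collections.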

The next result links variation independence to ordered decomposability.

Theorem 5.4

(Bergsma and Rudas (2002), Theorem 4). Let ℙ be a parametrization which is hierarchical and complete. Then the parameters Λ̃(ℙ) are variation independent if and only if ℙ is ordered decomposable.

As previously noted, the ingenuous parametrization is not complete in general, and so we cannot apply the above result directly to characterize its variation dependence. However, by constructing complete parametrizations of which the ingenuous parametrizations are linear sub-models, we can obtain the following.

Theorem 5.5

The ingenuous parametrization for an ADMG 𝒢 is variation independent if and only if 𝒢 contains no heads of size greater than or equal to 3.

The proof of this result is found in the supplementary material.

Example 5.6

The graph 𝒢₁ in Figure 1 has maximum head size 2, and therefore the associated ingenuous parametrization is variation independent.

Likewise the graphs in Figure 3(a) and (b) contain no heads of size greater than 2, so that the resulting ingenuous parameters are variation independent. Note that this was not true of the parameters given by Richardson (2009).

Example 5.7

The bidirected 3-chain shown in Figure 6(a) has the head 123, and therefore its ingenuous parametrization is variation dependent. This can easily be seen directly: in the binary case, for example, if the parameters $λ_{12}^{12} (0, 0)$ and $λ_{23}^{23} (0, 0)$ are chosen to be very large, this induces very strong dependence between the variables X₁ and X₂, and between X₂ and X₃ respectively. If these correlations are chosen to be too large, then it is impossible for X₁ and X₃ to be marginally independent, which is implied by the graph.
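A short calculation makes the conflict concrete. It is an illustration under simplifying assumptions of ours, not part of the original argument: the variables are binary, all three have uniform margins, and the dependence is positive. Write d_{ij} = P(X_i ≠ X_j). If X₁ and X₃ differ then X₂ must differ from at least one of them, so

```latex
\begin{gathered}
\{X_1 \neq X_3\} \subseteq \{X_1 \neq X_2\} \cup \{X_2 \neq X_3\}
\;\Longrightarrow\;
d_{13} \le d_{12} + d_{23}, \\
X_1 \perp\!\!\!\perp X_3 \text{ with uniform margins}
\;\Longrightarrow\;
d_{13} = \tfrac{1}{2} \cdot \tfrac{1}{2} + \tfrac{1}{2} \cdot \tfrac{1}{2} = \tfrac{1}{2}.
\end{gathered}
```

Hence the model forces d_{12} + d_{23} ≥ 1/2. Under the uniform-margin assumption, letting both two-way interaction parameters grow without bound drives d_{12} and d_{23} towards 0, which violates this bound; the two parameters therefore cannot be chosen freely of one another.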

Observe that we could use the Markov equivalent graph in Figure 6(b), which has no heads of size 3, and thus obtain a variation independent parametrization of the same model. However, if we add incident arrows as shown in Figure 6(c), we obtain a graph where such a trick is not possible. In fact this third graph has no variation independent parametrization in the Bergsma and Rudas framework, since it requires $λ_{0124}^{0124} = λ_{0134}^{0134} = λ_{0234}^{0234} = 0$ , and these margins cannot be ordered in a way which satisfies the running intersection property (see Example 5.3).

In general, it would be sensible for a statistician concerned about variation dependence to choose a graph from the Markov equivalence class created by their model which has the smallest possible maximum head size. This could be achieved by reducing the number of bidirected edges in the graph, where possible; see, for example, Ali et al. (2005) and Drton and Richardson (2008b) for algorithms for finding the graph with the minimal number of arrowheads in a given Markov equivalence class.

Example 5.8

The bidirected 4-cycle, shown in Figure 7, contains a head of size 4, and so its ingenuous parametrization is variation dependent. However, there is a marginal log-linear parametrization of this model which is ordered decomposable, and therefore variation independent. The 4-cycle is precisely the model with X₁ ⫫ X₃ and X₂ ⫫ X₄. Set ℳ = {13, 24, 1234}, with

\begin{array}{l} L_{1} = \{1, 3, 13\} \\ L_{2} = \{2, 4, 24\} \\ L_{3} = \mathcal{P}(\{1, 2, 3, 4\}) \setminus (L_{1} \cup L_{2}); \end{array}

here 𝒫(A) denotes the power set of A. This gives a hierarchical, complete and ordered decomposable parametrization, so the parameters are variation independent. The 4-cycle corresponds exactly to setting $λ_{13}^{13} = λ_{24}^{24} = 0$ , and it follows that the remaining parameters are still variation independent under this constraint.
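The effect sets L₁, L₂ and L₃ can be recovered from the ordered margins alone. The sketch below assumes the construction of Bergsma and Rudas (2002) as we understand it, in which each non-empty effect is assigned to the first margin in the ordering that contains it; `assign_effects` is a hypothetical helper name.

```python
from itertools import combinations


def nonempty_subsets(vertices):
    """All non-empty subsets of a vertex set, as frozensets."""
    items = sorted(vertices)
    return [frozenset(c) for r in range(1, len(items) + 1) for c in combinations(items, r)]


def assign_effects(margins):
    """Give each effect to the first margin, in the stated order, that contains it."""
    seen, assignment = set(), []
    for margin in margins:
        effects = [e for e in nonempty_subsets(margin) if e not in seen]
        seen.update(effects)
        assignment.append(effects)
    return assignment


margins = [{1, 3}, {2, 4}, {1, 2, 3, 4}]
L1, L2, L3 = assign_effects(margins)
for margin, effects in zip(margins, (L1, L2, L3)):
    print(sorted(margin), [sorted(e) for e in effects])
```

Because the last margin is the full vertex set, every effect is assigned somewhere, which is what makes the parametrization complete.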

This approach to parametrization, which considers disconnected sets, is discussed in detail by Lupparelli et al. (2009). It produces a variation independent parametrization for graphs where the disconnected sets do not overlap, and may well be preferable to the ingenuous parametrization in these cases. In sparser graphs however, it does not seem as useful; as mentioned above, some graphs have no variation independent MLL parametrization.

6 Parsimonious Modelling with Marginal Log-Linear Parameters

The number of parameters in a model associated with a sparse graph containing bidirected edges can, in certain cases, be relatively large. In a purely bidirected graph, the parameter count depends upon the number of connected sets of vertices; in the case of a chain of bidirected edges such as that shown in Figure 11(a), this means that the number of parameters grows quadratically in the length of the chain.
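The quadratic growth can be checked by enumerating connected sets directly. The sketch below is a brute-force illustration, assuming binary variables so that each connected set contributes a single parameter; the function names are our own.

```python
from collections import Counter
from itertools import combinations


def is_connected(vertices, edges):
    """Check whether the vertex set induces a connected subgraph."""
    vertices = set(vertices)
    start = next(iter(vertices))
    reached, frontier = {start}, [start]
    while frontier:
        v = frontier.pop()
        for a, b in edges:
            for x, y in ((a, b), (b, a)):
                if x == v and y in vertices and y not in reached:
                    reached.add(y)
                    frontier.append(y)
    return reached == vertices


def connected_set_sizes(k):
    """Count connected vertex sets of the chain 1 <-> 2 <-> ... <-> k, by size."""
    edges = [(i, i + 1) for i in range(1, k)]
    sizes = Counter()
    for r in range(1, k + 1):
        for subset in combinations(range(1, k + 1), r):
            if is_connected(subset, edges):
                sizes[r] += 1
    return sizes


sizes = connected_set_sizes(6)
print(dict(sizes), "total:", sum(sizes.values()), "saturated:", 2 ** 6 - 1)
```

For a chain the connected sets are exactly the intervals, so there are k − r + 1 of size r and k(k + 1)/2 in total, compared with 2^k − 1 parameters in the saturated binary model.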

Figure 11. (a) A bidirected k-chain and (b) a DAG with latent variables (h₁, …, h_{k−1}) generating the same observable conditional independence structure.

The parametrization of Richardson (2009), and its special case for purely bidirected graphs (see Drton and Richardson, 2008a) does not present us with any obvious method of reducing the parameter count whilst preserving the conditional independence structure. In contrast, there are well established methods for sparse modelling with other classes of graphical models. In the case of an undirected graph with binary random variables, restricting to one parameter for each vertex and each edge leads to a Boltzmann Machine (Ackley et al., 1985). Rudas et al. (2006) use marginal log-linear parameters to provide a sparse parametrization of a DAG model, again restricting to one parameter for each vertex and edge.

As we will see from the following examples, the ingenuous parametrization allows us to fit graphical models with a large number of parameters, and then remove higher-order interactions to obtain a more parsimonious model whilst preserving the conditional independence structure of the original graph.

6.1 Flu Vaccination Data Revisited

We first return to the McDonald et al. (1992) study considered in the Introduction. All variables are binary, and (excepting Age) are coded as 0 = false, 1 = true; we add constraints to our model sequentially, recording the results in the analysis of deviance Table 1. The ADMG in Figure 3(a) represents the constraint Ag, Co ⫫ Re; it fits well, having a deviance of 2.54 on 3 degrees of freedom. The smaller model for 3(b) encodes

Ag, Co ⫫ Re, \qquad Y ⫫ Re ∣ Va, Ag;

note that these precise independences cannot be represented by a DAG or chain graph (of any of the types considered by Drton (2009)). It also fits well (deviance 7.66 on 7 d.f.), so we may prefer it on the grounds of simplicity.

Table 1.

Analysis of deviance table of models considered for influenza data. Constraints are added sequentially from top to bottom; the last three columns give the additional deviance for the constraint, the total degrees of freedom and the total deviance of the models respectively.

Constraint	Figure	Add. Dev.	d.f.	Total Dev.
Ag, Co ⫫ Re	3(a)	2.54	3	2.54
Y ⫫ Re ∣ Va, Ag	3(b)	5.11	7	7.66
no 4- and 5-way params		2.22	12	9.88
no 3-way params		8.39	19	18.28

The ingenuous parametrization in this case contains some higher order effects, including the 5-way interaction between all variables. Setting $λ_{L}^{M} = 0$ for |L| ≥ 4 removes five parameters whilst increasing the deviance by only 2.22; removing the effects of size 3 adds a further 8.39 to the deviance whilst removing seven more parameters. The resulting model has a total deviance of 18.28 on 19 degrees of freedom, representing a good fit compared to the saturated model (likelihood ratio test p = 0.49).

6.2 Incorporating Symmetry: Twins Data

Hakim et al. (2003) investigate genetic effects on the presence or absence of two soft tissue disorders, frozen shoulder and tennis elbow, based on a study in pairs of monozygotic and dizygotic twins; the data are reproduced in Ekholm et al. (2012). We have count data for a 5-way contingency table over the variables S_i and E_i, indicators of whether twin i in the pair suffers from frozen shoulder and tennis elbow respectively, i ∈ {1, 2}, and T, an indicator of whether the pair are monozygotic or dizygotic twins. There are a total of 866 observations for monozygotic pairs, and 963 for dizygotic pairs; twin 1 corresponds to the twin who was born first.

We first fitted the model T ⫫ (S₁, S₂, E₁, E₂) to test whether the zygosity of the twins has any effect on the other variables; we obtained a deviance of 16.4 on 15 degrees of freedom, suggesting that there is no evidence that T is related to the other variables. Note that this contradicts the conclusions of Ekholm et al. (2012), but they use additional assumptions to obtain more powerful tests.

Collapsing to a 4-way table over (S₁, S₂, E₁, E₂), we consider the complete bidirected model in Figure 8(a). A further simplifying assumption is to impose symmetry between the twins in each pair, on the basis that we do not expect any association between the prevalence of the disorders and which twin was born first. Using the ingenuous parametrization for the graph in Figure 8(a), which is itself symmetric with respect to the individual twins, this amounts to six independent linear constraints, and gives a deviance of 0.59 compared to the saturated model on four variables; there is therefore no evidence to reject symmetry.
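The number of symmetry constraints can be counted by pairing each parameter with its image under the exchange of individuals. The sketch below rests on an indexing assumption of ours: a parameter is identified by a non-empty effect together with a non-baseline level for each variable in it, so that the count of parameters matches the saturated model. `symmetry_constraints` is a hypothetical helper name.

```python
from itertools import combinations, product


def symmetry_constraints(variables, swap, levels):
    """Return (number of parameters, number of equalities imposed by the swap).

    A parameter is a frozenset of (variable, level) pairs, one pair for each
    variable in a non-empty effect, with levels 1..levels-1 (0 is the baseline).
    """
    params = set()
    for r in range(1, len(variables) + 1):
        for effect in combinations(variables, r):
            for values in product(range(1, levels), repeat=r):
                params.add(frozenset(zip(effect, values)))

    def image(param):
        return frozenset((swap[v], x) for v, x in param)

    # The swap is an involution, so parameters that move come in pairs,
    # and each pair yields one linear equality.
    moved = [p for p in params if image(p) != p]
    return len(params), len(moved) // 2


twin_swap = {"S1": "S2", "S2": "S1", "E1": "E2", "E2": "E1"}
print(symmetry_constraints(["S1", "E1", "S2", "E2"], twin_swap, levels=2))
```

With four binary variables this gives 15 parameters and six equalities, in line with the count stated above. Applying the same function with three levels per variable gives the number of constraints for a four-variable table of that kind, such as the one analysed in Section 6.3.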

Figure 8. Graphs for the twins data for models corresponding to (a) a common gene and (b) separate genes affecting the prevalence of frozen shoulder and tennis elbow.

Now, a hypothesis of interest is whether a common gene is responsible for the increased risk of the two disorders, or the genetic effects are separate and independent. In the latter case we would expect the data to be explained by the model encoded by the graph in Figure 8(b), and therefore to observe the marginal independences E₁ ⫫ S₂ and E₂ ⫫ S₁ (see Drton and Richardson, 2008a, for more details). This amounts to the constraint $λ_{E_{1} S_{2}}^{E_{1} S_{2}} = λ_{E_{2} S_{1}}^{E_{2} S_{1}} = 0$ ; the first equality already holds by symmetry, so only one additional constraint is imposed.

This model has a deviance of 8.41 on 7 degrees of freedom, which is not rejected in a likelihood ratio test with the saturated model (p = 0.30), and so there is no evidence to reject the separate genes hypothesis. We remark however, that the model with symmetry but no marginal independences has a slightly lower BIC score, and so might be preferred.

The elimination of the 4-way and 3-way interaction parameters for the model from Figure 8(b) with symmetry results in deviances of 11.63 on 8 d.f. and 16.69 on 10 d.f. respectively, both of which also represent reasonable fits; the latter of these has just 5 free parameters.

6.3 Netherlands Kinship Data

The Netherlands Kinship Panel Survey (NKPS) is an ongoing study which collects longitudinal information on several thousand Dutch individuals and their families (Dykstra et al., 2005, 2007). One question asked of both the primary respondents (anchors) and their partners is “How is your health in general?”, with possible responses of ‘excellent’, ‘good’, ‘good nor poor’, ‘poor’ and ‘very poor’. We combined ‘good nor poor’, ‘poor’ and ‘very poor’ into one category to avoid small counts.

Two waves of data are currently available, from 2002–04 and 2006–07. We only considered anchors who had the same partner in both waves, and such that both the individual and the partner answered the health question in both waves. Let A_i and P_i denote the response of the anchor and partner respectively for wave i ∈ {1, 2}. In total there are n = 2,318 data points, classified into a 3 × 3 × 3 × 3 table.

We begin with the complete graph in Figure 9. One plausible model would be that anchors and their partners are exchangeable. Since the graph is symmetrical in this respect, so is the ingenuous parametrization, and enforcing symmetry amounts merely to a set of 36 linear constraints; for example:

λ_{A_{2} P_{2}}^{A_{1} P_{1} A_{2} P_{2}} (1, 0) = λ_{A_{2} P_{2}}^{A_{1} P_{1} A_{2} P_{2}} (0, 1) .

Figure 9. Graphs for the NKPS data; responses of Anchor and Partner regarding their assessment of health; subscripts indicate time. (a) a complete graph; (b) a subgraph which implies P₂ ⫫ A₁ | P₁.

This model has a deviance of 89.98, which when compared to the tail of a $χ_{36}^{2}$ distribution gives p = 1.6×10⁻⁶; thus the symmetry model is a poor fit to the data, and is rejected. The lack of exchangeability is probably due to selection bias in the sampling of the anchors, as well as the different ways in which the anchors and their partners were asked the question: anchors were asked about their health as part of a face-to-face interview, whereas the partners were only asked to complete a survey. See Siemiatycki (1979) for an analysis of differences resulting from survey mode.

If instead we remove the edge A₁ → P₂ and fit the graph in Figure 9(b), we obtain an explanation of the data which is not rejected at the 5% level (deviance 19.09 on 12 degrees of freedom, p = 0.086); this model corresponds to the conditional independence P₂ ⫫ A₁ | P₁. This graph is the only subgraph of the complete graph in Figure 9(a) which leads to a good fit; in particular the model created by removing the edge P₁ → A₂ is strongly rejected, which is one manifestation of the asymmetry between individuals and their partners.
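The p-values quoted in Sections 6.2 and 6.3 are upper tail probabilities of a χ² distribution evaluated at the deviance. As a hedged sketch, they can be recomputed with the standard library alone; `chi2_sf` is our own helper, valid only for positive integer degrees of freedom, and in practice a statistics library routine would normally be used instead.

```python
import math


def chi2_sf(x, df):
    """Upper tail P(chi2_df > x) for a positive integer df, via closed-form series."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        # Even df: Q(m, h) = exp(-h) * sum_{j < m} h^j / j!, with m = df / 2.
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= h / j
            total += term
        return math.exp(-h) * total
    # Odd df: start from Q(1/2, h) = erfc(sqrt(h)) and apply
    # Q(a + 1, h) = Q(a, h) + h^a * exp(-h) / Gamma(a + 1).
    term = math.sqrt(h) / math.gamma(1.5)
    total = 0.0
    for j in range(1, (df - 1) // 2 + 1):
        total += term
        term *= h / (j + 0.5)
    return math.erfc(math.sqrt(h)) + math.exp(-h) * total


# (deviance, degrees of freedom) pairs quoted in Sections 6.2 and 6.3.
for deviance, df in [(8.41, 7), (89.98, 36), (19.09, 12)]:
    print(f"deviance {deviance} on {df} d.f.: p = {chi2_sf(deviance, df):.3g}")
```

The results should agree with the quoted p-values up to rounding of the reported deviances.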

Note that we could also have obtained the independence P₂ ⫫ A₁ | P₁, for instance, by using a DAG with topological ordering P₁, A₁, P₂, A₂, but the resulting parametrization would have made it much more difficult to enforce the symmetry constraint tested above.

6.4 Example: Trust Data

Drton and Richardson (2008a) examine responses to seven questions relating to trust and social institutions, taken from the US General Social Survey between 1975 and 1994. Briefly, the seven questions were:

Trust: Can most people be trusted?
Helpful: Do you think most people are usually helpful?
MemUn, MemCh: Are you a member of a labour union/church?
ConLegis, ConClerg, ConBus: Do you have confidence in congress/organized religion / business?

In that paper, the model given by the graph in Figure 10 is shown to adequately explain the data, having a deviance of 32.67 on 26 degrees of freedom, when compared with the saturated model. The authors also provide an undirected graphical model which has one more edge than the graph in Figure 10, and yet has 62 fewer parameters. It too gives a good fit to the data, having a deviance of 87.62 on 88 degrees of freedom. Both graphs were chosen by backwards stepwise selection methods; see Drton and Richardson (2008a) for details.

Figure 10. Markov model for trust data given in Drton and Richardson (2008a).

For practical and theoretical reasons, the bidirected model may be preferred to the undirected one, even though the latter appears to be much more parsimonious. One may consider the dependence between the responses given to a questionnaire to be manifestations of unmeasured characteristics of the respondent, such as their political beliefs. Such a system can be well represented by a bidirected graph, through its marginal independence structure and connection to latent variable models, but not necessarily by an undirected one, which induces conditional independences. Note that, since models defined by undirected and bidirected graphs are not nested, there is no a priori reason to expect the two methods to give a similar graphical structure.

The greater parsimony of the undirected model (when defined purely by conditional independences) is due to its hierarchical nature: if we remove an edge between two vertices a and b, then this corresponds to requiring that $λ_{A}^{V} = 0$ for every effect A containing both a and b. Removing that edge in a bidirected model may correspond merely to setting $λ_{a b}^{a b} = 0$ and nothing else, depending upon the other edges present. Using the ingenuous parametrization, it is easy to constrain additional higher order terms to be zero to obtain sub-models of the set of distributions obeying the global Markov property.

Starting with the model in Figure 10 and fixing the 4-, 5-, 6- and 7-way interaction terms to be zero increases the deviance to 84.18 on 81 degrees of freedom; none of the 4-way interaction parameters was found to be significant on its own. Furthermore, removing 21 of the remaining 25 three-way interaction terms increases the deviance to 111.48 on 102 degrees of freedom; using an asymptotic χ² approximation gives a p-value of 0.245, so this model is not contradicted by the data. The only parameters retained are the one-dimensional marginal probabilities, the two-way interactions corresponding to edges in Figure 10, and the following three-way interactions:

MemUn, ConClerg, ConBus	Helpful, MemUn, MemCh
Trust, ConLegis, ConBus	MemCh, ConClerg, ConBus.

This model retains the marginal independence structure of Drton and Richardson’s model, but provides a good fit with only 25 parameters, rather than the original 101.
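The parameter counts in this section follow from the degrees of freedom by subtraction. The arithmetic below is our own bookkeeping, assuming that the degrees of freedom of each deviance equal the number of parameters removed from the saturated model on seven binary variables.

```latex
\begin{aligned}
2^{7} - 1 &= 127 && \text{parameters in the saturated model,} \\
127 - 26 &= 101 && \text{parameters in the model of Figure 10,} \\
127 - 88 &= 39 = 101 - 62 && \text{parameters in the undirected model,} \\
127 - 81 &= 46 && \text{parameters without 4- to 7-way terms,} \\
127 - 102 &= 25 = 46 - 21 && \text{parameters in the final model.}
\end{aligned}
```

On this reading, the 25 remaining parameters split as 7 one-dimensional margins, 4 three-way interactions and 25 − 7 − 4 = 14 two-way interactions.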

A similar analysis, for different data, is performed by Lupparelli et al. (2009, page 573); again they find an undirected graphical model to be much more parsimonious than any bidirected one, but obtain comparable fits by removing statistically insignificant higher-order parameters.

6.5 Simulated Data

We saw in the earlier examples that we were often able to remove higher order interaction parameters without compromising the goodness of fit. Here we explore this phenomenon further via simulations.

Consider the DAG with latent variables shown in Figure 11(b); over the observed variables, the conditional independences which hold are exactly those given by the bidirected chain in Figure 11(a).

We randomly generated 1,000 distributions from this DAG model with k = 6, where each latent variable was given three states, and each observed variable two. The probability of each observed variable being zero, conditional on each state of its parents, was an independent uniform random draw on (0, 1); latent states were fixed to occur with equal probability. For each distribution, a sample size of 10,000 was drawn, and the bidirected chain model was fitted to it by maximum likelihood estimation. For each of the 1,000 data sets, we then measured the increase in deviance associated with removing higher order parameters.
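The data-generating step can be sketched as follows. This covers only the generation of one random distribution and one sample, not the model fitting. It assumes, since the figure is not reproduced here, that latent variable h_i is a parent of exactly X_i and X_{i+1}; all function names are hypothetical.

```python
import random
from itertools import product


def random_chain_distribution(k, latent_states=3, seed=0):
    """Joint law of binary X_1..X_k when latent h_i is a parent of X_i and X_{i+1}."""
    rng = random.Random(seed)
    # parents[i]: 0-based indices of the latent parents of observed variable i.
    parents = [[j for j in (i - 1, i) if 0 <= j < k - 1] for i in range(k)]
    # P(X_i = 0 | parent states): independent Uniform(0, 1) draws.
    p_zero = [
        {s: rng.random() for s in product(range(latent_states), repeat=len(pa))}
        for pa in parents
    ]
    joint = {x: 0.0 for x in product((0, 1), repeat=k)}
    weight = latent_states ** -(k - 1)  # latent configurations are equally likely
    for h in product(range(latent_states), repeat=k - 1):
        cond = [p_zero[i][tuple(h[j] for j in parents[i])] for i in range(k)]
        for x in joint:
            prob = weight
            for i in range(k):
                prob *= cond[i] if x[i] == 0 else 1.0 - cond[i]
            joint[x] += prob
    return joint


def marginal(joint, keep):
    """Marginal distribution over the observed variables with indices in keep."""
    out = {}
    for x, p in joint.items():
        key = tuple(x[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out


def draw_sample(joint, n, seed=1):
    """Draw n observations from the joint distribution."""
    rng = random.Random(seed)
    cells = list(joint)
    return rng.choices(cells, weights=[joint[c] for c in cells], k=n)


joint = random_chain_distribution(6)
sample = draw_sample(joint, 10_000)
```

Under the assumed structure, variables that are not adjacent in the chain share no latent parent, so for instance the marginal distribution of (X₁, X₃) factorizes; this gives a quick check that the generated distribution obeys the bidirected chain model.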

The histogram in Figure 12(a) demonstrates that the deviance increase from setting the 5- and 6-way interaction parameters to zero (a total of three parameters) was not distinguishable from that which would be observed under the null hypothesis that these parameters are zero. The deviance increase from setting the 4-, 5- and 6-way interactions to zero appeared to have only a slightly heavier tail than the associated χ²-distribution, as suggested by the outliers in Figure 12(b). Removing the 3-way interactions in addition to this caused a dramatic increase in the deviance, as may be observed from the heavy tail of the histogram in Figure 12(c). This illustrates that the ingenuous parametrization can be used to produce more parsimonious model descriptions than would be possible using Richardson’s parameters.

Figure 12. Histograms showing the increase in deviance caused by setting to zero (a) the 5- and 6-way interaction parameters; (b) the 4-, 5- and 6-way interaction parameters; (c) the 3-, 4-, 5- and 6-way interaction parameters. Plots are based on 1,000 datasets, each of size 10,000, generated from the DAG in Figure 11(b). The plotted densities are χ² with 3, 6 and 10 degrees of freedom respectively.

Note that under the process which generated these models, each of these interaction parameters was non-zero almost surely. As the sample size increases the power of a likelihood ratio test for a fixed distribution tends to one, so it must be the case that a simulation such as the above would, for large enough data sets, show significant deviation from the associated χ² distributions. However, even at a fairly large sample size of 10,000, a limited effect was observed in Figures 12(a) and (b), and the examples above with real data suggest that higher order interactions are often not particularly useful in practice for describing data.

Acknowledgments

This research was supported by the U.S. National Science Foundation grant CNS-0855230 and U.S. National Institutes of Health grant R01 AI032475. The Netherlands Kinship Panel Study is funded by grant 480-10-009 from the Major Investments Fund of the Netherlands Organisation for Scientific Research (NWO), and by the Netherlands Interdisciplinary Demographic Institute (NIDI), Utrecht University, the University of Amsterdam and Tilburg University. We thank McDonald, Hui and Tierney for giving us permission to use their flu vaccine data.

Our thanks go to Tamás Rudas for helpful discussions, and to Antonio Forcina for discussions and the use of his computer programmes. Finally we thank two anonymous referees and an associate editor for their thorough reading of an earlier draft, and very useful suggestions.

Contributor Information

Robin J. Evans, Email: rje42@stat.washington.edu, Department of Statistics, University of Washington

Thomas S. Richardson, Email: tsr@stat.washington.edu, Department of Statistics, University of Washington

References

Ackley DH, Hinton GE, Sejnowski TJ. A learning algorithm for Boltzmann machines. Cognitive Science. 1985;9:147–169.
Ali RA, Richardson TS, Spirtes P, Zhang J. Towards characterizing Markov equivalence classes for directed acyclic graphs with latent variables. Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence; 2005. pp. 10–17.
Bergsma WP, Rudas T. Marginal models for categorical data. Ann Statist. 2002;30(1):140–159.
Darroch JN, Lauritzen SL, Speed TP. Markov fields and log-linear models for contingency tables. Ann Statist. 1980;8:522–539.
Dawid AP. Conditional independence in statistical theory (with discussion). J Roy Statist Soc Ser B. 1979;41:1–31.
Drton M. Discrete chain graph models. Bernoulli. 2009;15(3):736–753.
Drton M, Richardson TS. Binary models for marginal independence. J Roy Statist Soc Ser B. 2008a;70(2):287–309.
Drton M, Richardson TS. Graphical methods for efficient likelihood inference in Gaussian covariance models. J Mach Learn Res. 2008b;9:893–914.
Dykstra PA, Kalmijn M, Knijn TCM, Komter AE, Liefbroer AC, Mulder CH. NKPS Working Paper No 4. 2005. Codebook of the Netherlands Kinship Panel Study, a multi-actor, multi-method panel study on solidarity in family relationships. Wave 1.
Dykstra PA, Kalmijn M, Knijn TCM, Komter AE, Liefbroer AC, Mulder CH. NKPS Working Paper No 6. 2007. Codebook of the Netherlands Kinship Panel Study, a multi-actor, multi-method panel study on solidarity in family relationships. Wave 2.
Ekholm A, Jokinen J, McDonald JW, Smith PWF. A latent class model for bivariate binary responses from twins. J Roy Statist Soc Ser C. 2012;61(3).
Evans RJ, Forcina A. Two algorithms for fitting constrained marginal models. To appear in Computational Statistics and Data Analysis. 2013. doi: 10.1016/j.csda.2013.02.001.
Evans RJ, Richardson TS. Maximum likelihood fitting of acyclic directed mixed graphs to binary data. Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence; 2010.
Forcina A, Lupparelli M, Marchetti GM. Marginal parameterizations of discrete models defined by a set of conditional independencies. Journal of Multivariate Analysis. 2010;101:2519–2527.
Glonek GFV, McCullagh P. Multivariate logistic models. J Roy Statist Soc Ser B. 1995;57(3):533–546.
Hakim AJ, Cherkas LF, Spector TD, MacGregor AJ. Genetic associations between frozen shoulder and tennis elbow: a female twin study. Rheumatology. 2003;42(6):739–742. doi: 10.1093/rheumatology/keg159.
Kauermann G. A note on multivariate logistic models for contingency tables. Austral J Statist. 1997;39(3):261–276.
Lauritzen SL. Graphical Models. Clarendon Press; Oxford, UK: 1996.
Lupparelli M, Marchetti GM, Bergsma WP. Parameterizations and fitting of bi-directed graph models to categorical data. Scand J Statist. 2009;36:559–576.
Marchetti GM, Lupparelli M. Chain graph models of multivariate regression type for categorical data. Bernoulli. 2011;17(3):827–844.
McDonald CJ, Hui SL, Tierney WM. Effects of computer reminders for influenza vaccination on morbidity during influenza epidemics. MD Computing. 1992;9(5):304.
Pearl J. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann; 1988.
Richardson TS. Markov properties for acyclic directed mixed graphs. Scand J Statist. 2003;30(1):145–157.
Richardson TS. A factorization criterion for acyclic directed mixed graphs. Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence; 2009.
Richardson TS, Spirtes P. Ancestral graph Markov models. Ann Statist. 2002;30:962–1030.
Rudas T, Bergsma WP, Nemeth R. Parameterization and estimation of path models for categorical data. Proceedings in Computational Statistics, 17th Symposium; Physica-Verlag HD. 2006. pp. 383–394.
Rudas T, Bergsma WP, Nemeth R. Marginal log-linear parameterization of conditional independence models. Biometrika. 2010;97:1006–1012.
Siemiatycki J. A comparison of mail, telephone, and home interview strategies for household health surveys. American Journal of Public Health. 1979;69:238–245. doi: 10.2105/ajph.69.3.238.
Wermuth N. Probability distributions with summary graph structure. Bernoulli. 2011;17(3):845–879.
Whittaker J. Graphical models in applied multivariate statistics. Wiley; 1990.


