Path-Counting Formulas for Generalized Kinship Coefficients and Condensed Identity Coefficients

En Cheng; Z Meral Ozsoyoglu

doi:10.1155/2014/898424

2014 Jul 21;2014:898424.

PMCID: PMC4130148 PMID: 25165486

Abstract

An important computation on pedigree data is the calculation of condensed identity coefficients, which provide a complete description of the degree of relatedness of two individuals. The applications of condensed identity coefficients range from genetic counseling to disease tracking. Condensed identity coefficients can be computed using linear combinations of generalized kinship coefficients for two, three, four individuals, and two pairs of individuals and there are recursive formulas for computing those generalized kinship coefficients (Karigl, 1981). Path-counting formulas have been proposed for the (generalized) kinship coefficients for two (three) individuals but there have been no path-counting formulas for the other generalized kinship coefficients. It has also been shown that the computation of the (generalized) kinship coefficients for two (three) individuals using path-counting formulas is efficient for large pedigrees, together with path encoding schemes tailored for pedigree graphs. In this paper, we propose a framework for deriving path-counting formulas for generalized kinship coefficients. Then, we present the path-counting formulas for all generalized kinship coefficients for which there are recursive formulas and which are sufficient for computing condensed identity coefficients. We also perform experiments to compare the efficiency of our method with the recursive method for computing condensed identity coefficients on large pedigrees.

1. Introduction

With the rapidly expanding field of medical genetics and genetic counseling, genealogy information is becoming increasingly abundant. In January 2009, the US Department of Health and Human Services released an updated and improved version of the Surgeon General's Web-based family health history tool [1]. This Web-based tool makes it easy for users to record their family health history. Large extended human pedigrees are very informative for linkage analysis. Pedigrees including thousands of members in 10–20 generations are available from genetically isolated populations [2, 3]. In human genetics, a pedigree is defined as “a simplified diagram of a family's genealogy that shows family members' relationships to each other and how a specific trait, abnormality, or disease has been inherited” [4]. Pedigrees are utilized to trace the inheritance of a specific disease, calculate genetic risk ratios, identify individuals at risk, and facilitate genetic counseling. To calculate genetic risk ratios or identify individuals at risk, we need to assess the degree of relatedness of two individuals. As a matter of fact, all measures of relatedness are based on the concept of identical by descent (IBD). Two alleles are identical by descent if one is an ancestral copy of the other or if they are both copies of the same ancestral allele. The IBD concept is primarily due to Cotterman [5] and Malecot [6] and has been successfully applied to many problems in population genetics.

The simplest measure of relationship between two individuals is their kinship coefficient. The kinship coefficient between two individuals i and j is the probability that an allele selected randomly from i and an allele selected randomly from the same autosomal locus of j are identical by descent. To better discriminate between different types of pairs of relatives, identity coefficients were introduced by Gillois [7] and Harris [8] and promulgated by Jacquard [9]. Considering the four alleles of two individuals at a fixed autosomal locus, there are 15 possible identity states. Disregarding the distinction between maternally and paternally derived alleles, we obtain 9 condensed identity states. The probabilities associated with each condensed identity state are called condensed identity coefficients, which are useful in a diverse range of fields. These include the calculation of risk ratios for qualitative disease, the analysis of quantitative traits, and genetic counseling in medicine.

A recursive algorithm for calculating condensed identity coefficients proposed by Karigl [10] has been known for some time. This method requires that one calculates a set of generalized kinship coefficients, from which one obtains condensed identity coefficients via a linear transformation. One limitation is that this recursive approach is not scalable when applied to very large pedigrees. It has been previously shown that the kinship coefficients for two individuals [11–13] and the generalized kinship coefficients for three individuals [14, 15] can be efficiently calculated using path-counting formulas together with path encoding schemes tailored for pedigree graphs.

Motivated by the efficiency of path-counting formulas for computing the kinship coefficient for two individuals and the generalized kinship coefficient for three individuals, we first introduce a framework for developing path-counting formulas to compute generalized kinship coefficients concerning three individuals, four individuals, and two pairs of individuals. Then, we present path-counting formulas for all generalized kinship coefficients which have recursive formulas proposed by Karigl [10] and are sufficient to compute condensed identity coefficients. In summary, our ultimate goal is to use path-counting formulas for generalized kinship coefficients computation so that efficiency and scalability for condensed identity coefficients calculation can be improved.

The main contributions of our work are as follows:

a framework to develop path-counting formulas for generalized kinship coefficients;
a set of path-counting formulas for all generalized kinship coefficients having recursive formulas [10];
experimental results demonstrating significant performance gains for calculating condensed identity coefficients based on our proposed path-counting formulas as compared to using recursive formulas [10].

2. Materials and Methods

This section describes kinship coefficients and generalized kinship coefficients, identity coefficients, and condensed identity coefficients in more detail. Conceptual terms for the path-counting formulas for three and four individuals are introduced in Section 2.3. In addition, an overview of path-counting formula derivation is presented.

2.1. Kinship Coefficients and Generalized Kinship Coefficients

The kinship coefficient between two individuals a and b is the probability that a randomly chosen allele at the same locus from each is identical by descent (IBD). There are two approaches to computing the kinship coefficient Φ_ab: the recursive approach [10] and the path-counting approach [16]. The recursive formulas [10] for Φ_ab and Φ_aa are

\begin{matrix} Φ_{a b} = \frac{1}{2} (Φ_{f b} + Φ_{m b}) \quad \text{if } a \text{ is not an ancestor of } b, \\ Φ_{a a} = \frac{1}{2} (1 + Φ_{f m}) = \frac{1}{2} (1 + F_{a}), \end{matrix}

(1)

where f and m denote the father and the mother of a, respectively, and F _a is the inbreeding coefficient of a.
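The recursion in (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy pedigree, the names `PEDIGREE`, `ORDER`, and `kinship`, and the convention that founders are unrelated and non-inbred (so Φ_aa = 1/2 for a founder and Φ_ab = 0 for two distinct founders) are ours and do not come from [10]. The sketch always recurses on the individual listed later, which cannot be an ancestor of the other one.

```python
from fractions import Fraction
from functools import lru_cache

# Hypothetical toy pedigree: individual -> (father, mother); founders map to None.
# Individuals are listed so that parents always appear before their children.
PEDIGREE = {
    "gf": None, "gm": None,
    "x": ("gf", "gm"), "y": ("gf", "gm"),   # full siblings
    "z": ("x", "y"),                        # inbred child of the two siblings
}
ORDER = {name: i for i, name in enumerate(PEDIGREE)}


@lru_cache(maxsize=None)
def kinship(a, b):
    """Kinship coefficient following the recursion in (1)."""
    if a == b:
        parents = PEDIGREE[a]
        if parents is None:                  # assumed: founders are not inbred
            return Fraction(1, 2)
        f, m = parents
        return Fraction(1, 2) * (1 + kinship(f, m))
    # Recurse on the later-listed individual, which is not an ancestor of the other.
    if ORDER[a] < ORDER[b]:
        a, b = b, a
    parents = PEDIGREE[a]
    if parents is None:                      # assumed: distinct founders are unrelated
        return Fraction(0)
    f, m = parents
    return Fraction(1, 2) * (kinship(f, b) + kinship(m, b))


print(kinship("x", "y"))   # full siblings
print(kinship("z", "z"))   # (1/2)(1 + F_z)
```

Exact fractions are used so that the values can be compared directly with hand calculations.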

Wright's path-counting formula [16] for Φ_ab is

\begin{matrix} Φ_{a b} = \sum_{A} \sum_{〈 P_{A a}, P_{A b} 〉 \in P P} {(\frac{1}{2})}^{r + s + 1} (1 + F_{A}), \end{matrix}

(2)

where A is a common ancestor of a and b, PP is a set of nonoverlapping path-pairs 〈P _Aa, P _Ab〉 from A to a and b, r is the length of the path P _Aa, s is the length of the path P _Ab, and F _A is the inbreeding coefficient of A. The path-pair 〈P _Aa, P _Ab〉 is nonoverlapping if and only if the two paths share no common individuals, except A.
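As an illustration of (2), the following Python sketch enumerates all path-pairs by brute force and keeps the nonoverlapping ones. The pedigree and the helper names (`PARENTS`, `paths`, `ancestors`, `inbreeding`) are hypothetical, founders are assumed unrelated and non-inbred, and no path encoding scheme is used, so this is not the efficient method of [11–13].

```python
from fractions import Fraction

# Hypothetical pedigree: child -> (father, mother); founders have no entry.
PARENTS = {
    "p": ("g1", "g2"), "q": ("g1", "g2"),   # full siblings
    "a": ("p", "u1"), "b": ("q", "u2"),     # first cousins
}


def paths(anc, ind):
    """All parent-to-child paths from anc down to ind, as tuples of individuals."""
    if anc == ind:
        return [(ind,)]
    found = []
    for parent in PARENTS.get(ind, ()):
        found += [p + (ind,) for p in paths(anc, parent)]
    return found


def ancestors(ind):
    """The individual itself together with all of its ancestors."""
    result = {ind}
    for parent in PARENTS.get(ind, ()):
        result |= ancestors(parent)
    return result


def inbreeding(ind):
    """F of ind is the kinship coefficient of its parents (0 for founders)."""
    if ind not in PARENTS:
        return Fraction(0)
    return kinship(*PARENTS[ind])


def kinship(a, b):
    """Wright's formula (2) for two distinct individuals a and b."""
    total = Fraction(0)
    for anc in ancestors(a) & ancestors(b):
        for pa in paths(anc, a):
            for pb in paths(anc, b):
                if set(pa) & set(pb) == {anc}:        # nonoverlapping path-pair
                    r, s = len(pa) - 1, len(pb) - 1   # path lengths in edges
                    total += Fraction(1, 2) ** (r + s + 1) * (1 + inbreeding(anc))
    return total


print(kinship("a", "b"))   # first cousins
print(kinship("p", "q"))   # full siblings
```

When one individual is an ancestor of the other, the path from that individual to itself has length 0, which is how the parent and child pair ("p", "a") is handled here.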

Recursive formulas proposed by Karigl [10] for generalized kinship coefficients concerning three individuals, four individuals, and two pairs of individuals are listed as follows in (3), (4), and (5):

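[Equation image CMMM2014-898424.e003.jpg: recursive formulas for three individuals]

(3)

[Equation image CMMM2014-898424.e004.jpg: recursive formulas for four individuals]

(4)

[Equation image CMMM2014-898424.e005.jpg: recursive formulas for two pairs of individuals]

(5)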

Φ_abc is the probability that randomly chosen alleles at the same locus from each of the three individuals (i.e., a, b, and c) are identical by descent (IBD). Similarly, Φ_abcd is the probability that randomly chosen alleles at the same locus from each of the four individuals (i.e., a, b, c, and d) are IBD. Φ_ab,cd is the probability that a random allele from a is IBD with a random allele from b and that a random allele from c is IBD with a random allele from d at the same locus. Note that Φ_abc = 0 if there is no common ancestor of a, b, and c. Φ_abcd = 0 if there is no common ancestor of a, b, c, and d, and Φ_ab,cd = 0 in the absence of a common ancestor either for a and b or for c and d.

2.2. Identity Coefficients and Condensed Identity Coefficients

Given two individuals a and b with maternally and paternally derived alleles at a fixed autosomal locus, there are 15 possible identity states, and the probabilities associated with each identity state are called identity coefficients. Ignoring the distinction between maternally and paternally derived alleles, we group the 15 possible states into 9 condensed identity states, as shown in Figure 1. The states range from state 1, in which all four alleles are IBD, to state 9, in which none of the four alleles are IBD. The probabilities associated with each condensed identity state are called condensed identity coefficients, denoted by {Δ_i ∣ 1 ≤ i ≤ 9}. The condensed identity coefficients can be computed based on generalized kinship coefficients using the linear transformation shown as follows in (6):

\begin{matrix} [\begin{matrix} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 2 & 2 & 2 & 2 & 1 & 1 & 1 & 1 & 1 \\ 2 & 2 & 1 & 1 & 2 & 2 & 1 & 1 & 1 \\ 4 & 0 & 2 & 0 & 2 & 0 & 2 & 1 & 0 \\ 8 & 0 & 4 & 0 & 2 & 0 & 2 & 1 & 0 \\ 8 & 0 & 2 & 0 & 4 & 0 & 2 & 1 & 0 \\ 16 & 0 & 4 & 0 & 4 & 0 & 2 & 1 & 0 \\ 4 & 4 & 2 & 2 & 2 & 2 & 1 & 1 & 1 \\ 16 & 0 & 4 & 0 & 4 & 0 & 4 & 1 & 0 \end{matrix}] [\begin{matrix} Δ_{1} \\ Δ_{2} \\ Δ_{3} \\ Δ_{4} \\ Δ_{5} \\ Δ_{6} \\ Δ_{7} \\ Δ_{8} \\ Δ_{9} \end{matrix}] = [\begin{matrix} 1 \\ 2 Φ_{a a} \\ 2 Φ_{b b} \\ 4 Φ_{a b} \\ 8 Φ_{a a b} \\ 8 Φ_{a b b} \\ 16 Φ_{a a b b} \\ 4 Φ_{a a, b b} \\ 16 Φ_{a b, a b} \end{matrix}] . \end{matrix}

(6)
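Once the right-hand side of (6) is known, the condensed identity coefficients follow from solving a 9 × 9 linear system. The sketch below does this with exact fractions and plain Gauss-Jordan elimination. The function names are hypothetical, and the input values are illustrative ones that we assume for two non-inbred full siblings (for instance Φ_aab = 1/8 and Φ_ab,ab = 3/32); they are not taken from the experiments in this paper.

```python
from fractions import Fraction

# Coefficient matrix of (6), copied row by row.
M = [
    [1, 1, 1, 1, 1, 1, 1, 1, 1],
    [2, 2, 2, 2, 1, 1, 1, 1, 1],
    [2, 2, 1, 1, 2, 2, 1, 1, 1],
    [4, 0, 2, 0, 2, 0, 2, 1, 0],
    [8, 0, 4, 0, 2, 0, 2, 1, 0],
    [8, 0, 2, 0, 4, 0, 2, 1, 0],
    [16, 0, 4, 0, 4, 0, 2, 1, 0],
    [4, 4, 2, 2, 2, 2, 1, 1, 1],
    [16, 0, 4, 0, 4, 0, 4, 1, 0],
]


def solve(matrix, rhs):
    """Gauss-Jordan elimination over exact fractions (matrix assumed invertible)."""
    n = len(matrix)
    rows = [[Fraction(x) for x in row] + [Fraction(r)] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = next(i for i in range(col, n) if rows[i][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [x / rows[col][col] for x in rows[col]]
        for i in range(n):
            if i != col and rows[i][col] != 0:
                factor = rows[i][col]
                rows[i] = [x - factor * y for x, y in zip(rows[i], rows[col])]
    return [row[-1] for row in rows]


def condensed_identity(phi_aa, phi_bb, phi_ab, phi_aab, phi_abb,
                       phi_aabb, phi_aa_bb, phi_ab_ab):
    """Return [Delta_1, ..., Delta_9] from the generalized kinship coefficients in (6)."""
    rhs = [1, 2 * phi_aa, 2 * phi_bb, 4 * phi_ab, 8 * phi_aab,
           8 * phi_abb, 16 * phi_aabb, 4 * phi_aa_bb, 16 * phi_ab_ab]
    return solve(M, rhs)


F = Fraction
# Assumed inputs for two non-inbred full siblings.
delta = condensed_identity(F(1, 2), F(1, 2), F(1, 4), F(1, 8), F(1, 8),
                           F(1, 16), F(1, 4), F(3, 32))
print(delta)
```

For these inputs the only nonzero coefficients are Δ_7 = 1/4, Δ_8 = 1/2, and Δ_9 = 1/4.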

Figure 1. The 15 possible identity states for individuals a and b, grouped by their 9 condensed states. Lines indicate alleles that are IBD.

In our work, we focus on deriving the path-counting formulas for the generalized kinship coefficients, including Φ_abc, Φ_abcd, and Φ_ab,cd.

2.3. Terms Defined for Path-Counting Formulas for Three and Four Individuals

(1) Triple-Common Ancestor. Given three individuals a, b, and c, if A is a common ancestor of the three individuals, then we call A a triple-common ancestor of a, b, and c.

(2) Quad-Common Ancestor. Given four individuals a, b, c, and d, if A is a common ancestor of the four individuals, then we call A a quad-common ancestor of a, b, c, and d.

(3) P(A, a). It denotes the set of all possible paths from A to a, where the paths can only traverse edges in the direction of parent to child such that P(A, a) ≠ NULL if and only if A is an ancestor of a. P _Aa denotes a particular path from A to a, where P _Aa ∈ P(A, a).

(4) Path-Pair. It consists of two paths, denoted as 〈P _Aa, P _Ab〉, where P _Aa ∈ P(A, a) and P _Ab ∈ P(A, b).

(5) Nonoverlapping Path-Pair. Given a path-pair 〈P _Aa, P _Ab〉, it is nonoverlapping if and only if the two paths share no common individuals, except A.

(6) Path-Triple. It consists of three paths, denoted as 〈P _Aa, P _Ab, P _Ac〉, where P _Aa ∈ P(A, a), P _Ab ∈ P(A, b), and P _Ac ∈ P(A, c).

(7) Path-Quad. It consists of four paths, denoted as 〈P _Aa, P _Ab, P _Ac, P _Ad〉, where P _Aa ∈ P(A, a), P _Ab ∈ P(A, b), P _Ac ∈ P(A, c), and P _Ad ∈ P(A, d).

(8) Bi_C(P _Aa, P _Ab). It denotes all common individuals shared between P _Aa and P _Ab, except A.

(9) Tri_C(P _Aa, P _Ab, P _Ac). It denotes all common individuals shared among P _Aa, P _Ab, and P _Ac, except A.

(10) Quad_C(P _Aa, P _Ab, P _Ac, P _Ad). It denotes all common individuals shared among P _Aa, P _Ab, P _Ac, and P _Ad, except A.

(11) Crossover and 2-Overlap Individual. If s ∈ Bi_C(P _Aa, P _Ab), we call s a crossover individual with respect to P _Aa and P _Ab if the two paths pass through different parents of s. On the other hand, if P _Aa and P _Ab pass through the same parent of s, then we call s a 2-overlap individual with respect to P _Aa and P _Ab.

(12) 3-Overlap Individual. If s ∈ Tri_C(P _Aa, P _Ab, P _Ac) and the three paths P _Aa, P _Ab, and P _Ac pass through the same parent of s, then we call s a 3-overlap individual with respect to P _Aa, P _Ab, and P _Ac.

(13) 2-Overlap Path. If s is a 2-overlap individual with respect to P _Aa and P _Ab, then both P _Aa and P _Ab pass through the same parent of s, denoted by p, and the edge from p to s is called an overlap edge. All consecutive overlap edges constitute a path and this path is called a 2-overlap path. If the 2-overlap path extends all the way to the ancestor A, we call it a root 2-overlap path.

(14) 3-Overlap Path. It consists of all 3-overlap individuals in a consecutive order. If the 3-overlap path extends all the way to the root A, we call it a root 3-overlap path.
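The set-valued terms above translate directly into code. In the following sketch a path is a list of individuals starting at the ancestor A; the function names are hypothetical, and the example paths are the two paths written out in (9) plus an assumed path-triple in which three paths leave A through t, f, and s.

```python
def shared(anc, *paths):
    """Individuals common to all given paths, excluding the ancestor anc.

    With two, three, or four paths this plays the role of Bi_C, Tri_C, or Quad_C.
    """
    common = set(paths[0]).intersection(*paths[1:])
    common.discard(anc)
    return common


def parent_on(path, s):
    """The parent through which the path reaches the individual s."""
    return path[path.index(s) - 1]


def is_overlap(s, *paths):
    """True if every path reaches s through the same parent (2- or 3-overlap individual)."""
    return len({parent_on(p, s) for p in paths}) == 1


def is_crossover(s, pa, pb):
    """True if the two paths reach s through different parents."""
    return parent_on(pa, s) != parent_on(pb, s)


# The two paths of (9).
PA = ["A", "s", "e", "t", "a"]
PB = ["A", "s", "f", "t", "b"]
print(shared("A", PA, PB))

# Assumed path-triple whose three paths share the root path A -> t -> f -> s.
QA, QB, QC = (["A", "t", "f", "s", x] for x in ("a", "b", "c"))
print(all(is_overlap(x, QA, QB, QC) for x in shared("A", QA, QB, QC)))
```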

Example 1.

Consider the path-pairs from A to a and b in Figure 2, where A is a common ancestor of a and b. For path-pair1, Bi_C(P_Aa, P_Ab) = {s, e, t}, and A→s→e→t is a root 2-overlap path with respect to P_Aa and P_Ab. For path-pair4, Bi_C(P_Aa, P_Ab) = {e, t}, where e is a crossover individual; t is a 2-overlap individual with respect to P_Aa and P_Ab, and e→t is a 2-overlap path with respect to P_Aa and P_Ab. It is not a root 2-overlap path because it does not extend to A.

Figure 2. Examples of path-pairs and path-triples.

Example 2.

There are four path-quads listed in Figure 3, from A to four individuals a, b, c, and d, where A is a quad-common ancestor of the four individuals. For path-quad2, considering the paths P _Aa and P _Ab, the path A→t→f→s is a root 2-overlap path; {t, f, s} are 2-overlap individuals with respect to P _Aa and P _Ab. For path-quad3, {t, f, s} are 3-overlap individuals with respect to P _Aa, P _Ab, and P _Ac, and the path A→t→f→s is a root 3-overlap path.

We summarize the conceptual terms used in the path-counting formulas for two, three, and four individuals in Table 1, which gives a first view, at the level of terminology, of how our framework generalizes Wright's formula to three and four individuals.

Table 1.

The conceptual terms used for two, three, and four individuals.

Two individuals	Three individuals	Four individuals
Common ancestor	Triple-common ancestor	Quad-common ancestor
Path-pair	Path-triple	Path-quad
Bi_C(P_Aa, P_Ab)	Tri_C(P_Aa, P_Ab, P_Ac)	Quad_C(P_Aa, P_Ab, P_Ac, P_Ad)
N/A	2-Overlap individual	3-Overlap individual
N/A	2-Overlap path	3-Overlap path
N/A	Root 2-overlap path	Root 3-overlap path
N/A	Crossover individual	Crossover individual


2.4. An Overview of Path-Counting Formula Derivation

According to Wright's path-counting formula [16] (see (2)) for two individuals a and b, the path-counting approach requires identifying common ancestors of a and b and calculating the contribution of each common ancestor to Φ_ab. More specifically, for each common ancestor, denoted as A, we obtain all path-pairs from A to a and b and identify acceptable path-pairs. For Φ_ab, an acceptable path-pair 〈P _Aa, P _Ab〉 is a nonoverlapping path-pair where the two paths share no common individuals, except A. In Figure 2, path-pair2 is an acceptable path-pair, while path-pair1, path-pair3, and path-pair4 are not acceptable path-pairs. The contribution of each common ancestor A to Φ_ab is computed based on the inbreeding coefficient of A, modified by the length of each acceptable path-pair.

To compute Φ_abc, the path-counting approach requires identifying all triple-common ancestors of a, b, and c and summing up all triple-common ancestors' contributions to Φ_abc. For each triple-common ancestor, denoted as A, we first identify all path-triples each of which consists of three paths from A to a, b, and c, respectively. Some examples of path-triples are presented in Figure 2.

For Φ_ab, only nonoverlapping path-pairs are acceptable. A path-triple 〈P _Aa, P _Ab, P _Ac〉 consists of three path-pairs 〈P _Aa, P _Ab〉, 〈P _Aa, P _Ac〉, and 〈P _Ab, P _Ac〉. For Φ_abc, a path-triple might be acceptable even though either 2-overlap individuals or crossover individuals exist between a path-pair. The main challenge we need to address is finding necessary and sufficient conditions for acceptable path-triples.

Aiming at solving the problem of identifying acceptable path-triples, we first use a systematic method to generate all possible cases for a path-pair by considering different types of common individuals shared between the two paths. Then, we introduce building blocks which are connected graphs with conditions on every edge in the graph that encapsulates a set of acceptable cases of path-pairs. In each building block, we represent paths as nodes and interactions (i.e., shared common individuals between two paths) as edges. There are at least two paths in a building block. For each building block, we obtain all acceptable cases for concerned path-pairs. Given a path-triple, it can be decomposed to one or multiple building blocks. Considering a shared path-pair between two building blocks, we use the natural join operator from relational algebra to match the acceptable cases for the shared path-pair between two building blocks. In other words, considering the acceptable cases for building blocks as inputs, we use the natural join operator to construct all acceptable cases for a path-triple. Acceptable cases for a path-triple are identified and then used in deriving the path-counting formula for Φ_abc.

Then, we summarize all the main procedures used for deriving the path-counting formula for Φ_abc in a flowchart shown in Figure 4. The main procedures are also applicable for deriving the path-counting formulas for Φ_abcd and Φ_ab,cd.

Figure 4. A flowchart for path-counting formula derivation.

3. Results and Discussion

3.1. Path-Counting Formulas for Three Individuals

We first introduce a systematic method to generate all possible cases for a path-pair. Then we discuss building blocks for path-triples and identify all acceptable cases which are used in deriving the path-counting formula for Φ_abc.

3.1.1. Cases for a Path-Pair

Given a path-pair 〈P _Aa, P _Ab〉 with Bi_C(P _Aa, P _Ab) ≠ NULL, where A is a common ancestor of a and b and Bi_C(P _Aa, P _Ab) consists of all common individuals shared between P _Aa and P _Ab, except A, we introduce three patterns (i.e., crossover, 2-overlap, and root 2-overlap) to generate all possible cases for 〈P _Aa, P _Ab〉.

X(P _Aa, P _Ab): P _Aa and P _Ab share one or multiple crossover individuals.
T(P _Aa, P _Ab): P _Aa and P _Ab are root 2-overlapping from A, and the root 2-overlap path can have one or multiple 2-overlap individuals.
Y(P _Aa, P _Ab): P _Aa and P _Ab are overlapping but not from A, and the 2-overlap path can have one or multiple 2-overlap individuals.

Based on the three patterns, X(P _Aa, P _Ab), T(P _Aa, P _Ab), and Y(P _Aa, P _Ab), we use regular expressions to generate all possible cases for the path-pair 〈P _Aa, P _Ab〉. For convenience, we drop 〈P _Aa, P _Ab〉 and use X, T, and Y instead of patterns X(P _Aa, P _Ab), T(P _Aa, P _Ab), and Y(P _Aa, P _Ab), whenever there is no confusion. When Bi_C(P _Aa, P _Ab) ≠ NULL, the eight cases shown in (7) cover all possible cases for 〈P _Aa, P _Ab〉. The completeness of eight cases shown in (7) for 〈P _Aa, P _Ab〉 can be proved by induction on the total number of T, X, and Y appearing in 〈P _Aa, P _Ab〉. Using the pedigree in Figure 2, Cases 1–3 and Case 6 are illustrated in (8), (9), (10), and (11):

[Equation image CMMM2014-898424.e001.jpg: the eight cases for a path-pair]

(7)

[Equation image CMMM2014-898424.e002.jpg: illustration of Case 1]

(8)

where {s, e, t} are 2-overlap individuals and the overlap path is a root 2-overlap path:

\begin{matrix} \left. \begin{matrix} A ⟶ s ⟶ e ⟶ t ⟶ a \\ A ⟶ s ⟶ f ⟶ t ⟶ b \end{matrix} \right\} \in T X, \end{matrix}

(9)

where s is a 2-overlap individual and the overlap path is a root 2-overlap path; t is a crossover individual:

\begin{matrix} \left. \begin{matrix} A ⟶ s ⟶ e ⟶ t ⟶ a \\ A ⟶ d ⟶ f ⟶ t ⟶ b \end{matrix} \right\} \in X, \end{matrix}

(10)

where t is a crossover individual:

\begin{matrix} \left. \begin{matrix} A ⟶ c ⟶ e ⟶ t ⟶ a \\ A ⟶ s ⟶ e ⟶ t ⟶ b \end{matrix} \right\} \in X Y, \end{matrix}

(11)

where e is a crossover individual; t is a 2-overlap individual and the overlap path is a 2-overlap path.
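The illustrations in (9), (10), and (11) suggest a simple way to encode a path-pair as a string over {T, X, Y}. The sketch below is one possible encoding and rests on our own assumptions: the function name `pattern` is hypothetical, a path is a list of individuals starting at A, and consecutive individuals of the same kind are collapsed into a single symbol, in line with "one or multiple" in the definitions of the three patterns.

```python
def pattern(pa, pb):
    """Encode a path-pair (both paths start at the same ancestor A) over T, X, Y."""
    anc = pa[0]
    in_pb = set(pb)
    labels = {}      # shared individual -> "T", "X", or "Y"
    symbols = []
    for s in pa[1:]:
        if s not in in_pb:
            continue
        qa = pa[pa.index(s) - 1]          # parent of s on the first path
        qb = pb[pb.index(s) - 1]          # parent of s on the second path
        if qa != qb:
            label = "X"                   # crossover individual
        elif qa == anc or labels.get(qa) == "T":
            label = "T"                   # overlap edge on a root 2-overlap path
        else:
            label = "Y"                   # overlap edge that does not start at A
        labels[s] = label
        if not symbols or symbols[-1] != label:
            symbols.append(label)
    return "".join(symbols)


print(pattern(["A", "s", "e", "t", "a"], ["A", "s", "f", "t", "b"]))  # paths of (9)
print(pattern(["A", "s", "e", "t", "a"], ["A", "d", "f", "t", "b"]))  # paths of (10)
print(pattern(["A", "c", "e", "t", "a"], ["A", "s", "e", "t", "b"]))  # paths of (11)
```

A nonoverlapping path-pair yields the empty string, which corresponds to Bi_C(P_Aa, P_Ab) = NULL.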

3.1.2. Path-Pair Level Graphical Representation of a Path-Triple

Given a path-triple 〈P _Aa, P _Ab, P _Ac〉, we represent each path as a node. The path-triple can be decomposed to three path-pairs (i.e., 〈P _Aa, P _Ab〉, 〈P _Aa, P _Ac〉, and 〈P _Ab, P _Ac〉). For each path-pair, if the two paths share at least one common individual (i.e., either 2-overlap individual or crossover individual), except A, then there is an edge between the two nodes representing the two paths. Therefore, we obtain four different scenarios S ₀–S ₃, shown in Figure 5.

Figure 5. A path-pair level graphical representation of 〈P_Aa, P_Ab, P_Ac〉.
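Since scenario S_k has k edges, the scenario of a given path-triple can be found by counting the path-pairs that share at least one individual other than A. A minimal sketch, with a hypothetical function name and paths given as lists starting at A:

```python
from itertools import combinations


def scenario(path_triple):
    """Number k of edges in the path-pair level graph, i.e. the index of scenario S_k."""
    anc = path_triple[0][0]
    edges = 0
    for pa, pb in combinations(path_triple, 2):
        if (set(pa) & set(pb)) - {anc}:   # the pair shares an individual besides A
            edges += 1
    return edges


independent = [["A", "a"], ["A", "b"], ["A", "c"]]
one_edge = [["A", "s", "a"], ["A", "s", "b"], ["A", "c"]]
print(scenario(independent), scenario(one_edge))
```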

In Figure 5, the scenario S ₀ has no edges, so it means that 〈P _Aa, P _Ab, P _Ac〉 consists of three independent paths. In Figure 2, path-triple1 is an example of S ₀. Next, we introduce a lemma which can assist with identifying the options for the edges in the scenarios S ₁–S ₃.

Lemma 3.

Given a path-triple 〈P_Aa, P_Ab, P_Ac〉, consider the three path-pairs 〈P_Aa, P_Ab〉, 〈P_Aa, P_Ac〉, and 〈P_Ab, P_Ac〉. If any of the three path-pairs contains a 2-overlap edge, represented by Y in its regular expression representation, then the path-triple 〈P_Aa, P_Ab, P_Ac〉 makes no contribution to Φ_abc.

Proof.

In [17], Nadot and Vaysseix proposed, from a genetic and biological point of view, that Φ_abc can be evaluated by enumerating all eligible inheritance paths at allele-level starting from a triple common ancestor A to the three individuals a, b, and c.

For the pedigree in Figure 6, let us consider the path-triple 〈P _Aa, P _Ab, P _Ac〉 listed as follows. P _Aa : A → a; P _Ab : A → p ₃ → p ₆ → p ₇ → b; P _Ac : A → p ₄ → p ₆ → p ₇ → c.

For 〈P _Ab, P _Ac〉, p ₆ is a crossover individual, p ₇ is an overlap individual, and p ₆ → p ₇ is a 2-overlap edge represented by Y in regular expression representation (see the definition for Y in Section 3.1.1).

For the individual p₆, let us denote the two alleles at one fixed autosomal locus as g₁ and g₂. At allele-level, only one allele can be passed down from p₆ to p₇. Since p₃ and p₄ are parents of p₆, g₁ is passed down from one parent, and g₂ is passed down from the other parent. It is infeasible to pass down both g₁ and g₂ from p₆ to p₇. In other words, there are no corresponding inheritance paths for the path-triple 〈P_Aa, P_Ab, P_Ac〉 with a 2-overlap edge between 〈P_Ab, P_Ac〉 (i.e., Case 6: XY). Therefore, path-triples of this kind make no contribution to Φ_abc.

Figure 6. Examples of pedigree and inheritance paths.

Figure 6(b) shows one example of eligible inheritance paths corresponding to a pedigree graph. Each individual is represented by two allele nodes. The eligible inheritance paths in Figure 6(b) consist of red edges only.

Only Case 1, Case 2, and Case 3 do not have Y in the regular expression representation of a path-pair (see (7)); considering the scenarios S ₁–S ₃ shown in Figure 5, an edge can have three options {Case 1: T; Case 2: X; Case 3: TX}.

3.1.3. Constructing Cases for a Path-Triple

For the scenarios S ₁–S ₃ in Figure 5, we define two building blocks {B ₁, B ₂} along with some rules in Figure 7 to generate acceptable cases. For B ₁, the edge can have three options {Case 1: T; Case 2: X; Case 3: TX}. For B ₂, we cannot allow both edges to be root overlap, because if two edges are root overlap, then P _Aa and P _Ac must share at least one common individual, except A, which contradicts the fact that P _Aa and P _Ac have no edge.

Figure 7. Building blocks {B₁, B₂} and basic rules.

Next, we focus on generating all acceptable cases for the scenarios S ₁–S ₃ in Figure 5, where only S ₃ contains more than one building block. In order to leverage the dependency among building blocks, we decompose S ₃ to S ₃ = {u ₁ = B ₂, u ₂ = B ₂, u ₃ = B ₂}, shown in Figure 8. For each u _i, we have a set of acceptable path-triples, denoted as R _i.

Figure 8. A graphical illustration for obtaining T₃.

Considering the dependency among {R ₁, R ₂, R ₃}, we use the natural join operator, denoted as ⋈, operating on {R ₁, R ₂, R ₃} to generate all acceptable cases for S ₃. As a result, we obtain T ₃ = R ₁⋈R ₂⋈R ₃, where T ₃ denotes the acceptable cases of the path-triple 〈P _Aa, P _Ab, P _Ac〉 in the scenario S ₃.
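The construction T₃ = R₁⋈R₂⋈R₃ can be mimicked with a small natural join over dictionaries. The sketch below is an assumption-laden illustration: the edge names "ab", "bc", and "ac", the assignment of edges to u₁, u₂, and u₃, and the helper names are ours (Figure 8 fixes the actual decomposition), and the only rule encoded for B₂ is that its two edges cannot both be root overlap.

```python
from itertools import product

OPTIONS = ("T", "X", "TX")   # Case 1, Case 2, Case 3


def block_b2(edge1, edge2):
    """Acceptable cases of a B2 block: its two edges cannot both be root overlap."""
    return [
        {edge1: c1, edge2: c2}
        for c1, c2 in product(OPTIONS, repeat=2)
        if not ("T" in c1 and "T" in c2)
    ]


def natural_join(r1, r2):
    """Natural join of two relations given as lists of dicts (attribute -> value)."""
    joined = []
    for row1, row2 in product(r1, r2):
        common = row1.keys() & row2.keys()
        if all(row1[k] == row2[k] for k in common):
            joined.append({**row1, **row2})
    return joined


# Hypothetical decomposition of S3 into three B2 blocks over the edges ab, bc, ac.
R1 = block_b2("ab", "bc")
R2 = block_b2("bc", "ac")
R3 = block_b2("ab", "ac")
T3 = natural_join(natural_join(R1, R2), R3)

for case in T3:
    print(case)
```

Under these assumptions every row of T₃ has at most one edge whose case contains T, which agrees with the two options listed for the k edges of a scenario S_k.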

For each scenario in Figure 5, we generate all acceptable cases for 〈P_Aa, P_Ab, P_Ac〉. The scenario S₀ has no edges, which means that 〈P_Aa, P_Ab, P_Ac〉 consists of three independent paths, while, for the other scenarios S_k (k = 1, 2, 3), the k edges can have two options:

all k edges belong to crossover; or
one edge belongs to root 2-overlap; the remaining (k − 1) edges belong to crossover.

In summary, an acceptable path-triple can have at most one root 2-overlap path and any number of crossover individuals, but no 2-overlap path that does not start at A (pattern Y).

3.1.4. Splitting Operator

Considering the existence of root 2-overlap paths and crossover individuals in acceptable path-triples, we propose a splitting operator to transform a path-triple with crossover individuals to a noncrossover path-triple without changing the contribution from this path-triple to Φ_abc. The main purpose of using the splitting operator is to simplify the path-counting formula derivation process. We first use an example in Figure 9 to illustrate how the splitting operator works. In Figure 9, there is a crossover individual s between P_Aa and P_Ab in the path-triple 〈P_Aa, P_Ab, P_Ac〉 in G_{k+1}. The splitting operator proceeds as follows:

split the node s to two nodes, s ₁ and s ₂;
transform the edges s → a' and s → b' to s ₁ → a' and s ₂ → b', respectively;
add two new edges, s ₂ → a' and s ₁ → b'.

Figure 9. Transforming pedigree graph G_{k+1}, which has k + 1 crossovers, to G_k, which has k crossovers.

Lemma 4.

Given a pedigree graph G_{k+1} having (k + 1) crossover individuals regarding 〈P_Aa, P_Ab, P_Ac〉 shown in Figure 9, let s denote the lowest crossover individual, where no descendant of s can be a crossover individual among the three paths P_Aa, P_Ab, and P_Ac. After applying the splitting operator to the lowest crossover individual s in G_{k+1}, the number of crossover individuals is decreased by 1.

Proof.

The splitting operator only affects the edges from s to a' and b'. If there is a new crossover node appearing, the only possible node is either a' or b'. Assume b' becomes a crossover individual; it means that b' is able to reach a and b from two separate paths. It contradicts the fact that s is the lowest crossover individual between P _Aa and P _Ab.

Next, we introduce a canonical graph which results from applying the splitting operator for all crossover individuals. The canonical graph has zero crossover individual.

Definition 5 (Canonical Graph).

Given a pedigree graph G having one or more crossover individuals regarding Φ_abc, suppose there exists a graph G' which has no crossover individuals with regard to Φ_abc such that

any acceptable path-triple in G has a corresponding acceptable path-triple in G' with the same contribution to Φ_abc;

any acceptable path-triple in G' has a corresponding acceptable path-triple in G with the same contribution to Φ_abc.

Then we call G' a canonical graph of G regarding Φ_abc.

Lemma 6.

For a pedigree graph G having one or more crossover individuals regarding 〈P_Aa, P_Ab, P_Ac〉, there exists a canonical graph G' for G.

Proof.

The proof is by induction on the number of crossover individuals.

Induction hypothesis: assume that if G has k or fewer crossovers, there is a canonical graph G' for G.

In the induction step, let G_{k+1} be a graph with k + 1 crossovers, and let s be the lowest crossover between paths P_Aa and P_Ab in G_{k+1}. We apply the splitting operator on s in G_{k+1} and obtain G_k having k crossovers by Lemma 4. By the induction hypothesis, G_k has a canonical graph G'. Because the splitting operator does not change the contribution of a path-triple to Φ_abc, G' is also a canonical graph of G_{k+1}.

3.1.5. Path-Counting Formula for Φ_abc

Now, we present the path-counting formula for Φ_abc:

\begin{matrix} Φ_{a b c} = \sum_{A} \left( \sum_{\text{Type 1}} {\left( \frac{1}{2} \right)}^{L_{triple}} Φ_{A A A} + \sum_{\text{Type 2}} {\left( \frac{1}{2} \right)}^{L_{triple} + 1} Φ_{A A} \right), \end{matrix}

(12)

where Φ_AA = (1/2)(1 + F_A); Φ_AAA = (1/4)(1 + 3F_A); F_A is the inbreeding coefficient of A; A is a triple-common ancestor of a, b, and c; Type 1 covers path-triples 〈P_Aa, P_Ab, P_Ac〉 with zero root 2-overlap paths; Type 2 covers path-triples 〈P_Aa, P_Ab, P_Ac〉 with one root 2-overlap path P_As ending at the individual s; and

\begin{matrix} L_{triple} = \begin{cases} L_{P_{A a}} + L_{P_{A b}} + L_{P_{A c}} & \text{for Type 1} \\ L_{P_{A a}} + L_{P_{A b}} + L_{P_{A c}} - L_{P_{A s}} & \text{for Type 2}, \end{cases} \end{matrix}

(13)

where L_{P_Aa} is the length of the path P_Aa (and likewise for P_Ab, P_Ac, and P_As).

For completeness, the path-counting formula for Φ_aab is given in Appendix A; and the correctness proof of the path-counting formula is given in Appendix B.
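To make (12) and (13) concrete, the following sketch evaluates the contribution of a single path-triple. It is an illustration under assumptions rather than the implementation used in our experiments: the function names are hypothetical, paths are lists of individuals starting at A, the root 2-overlap path of a path-pair is taken to be the common prefix of the two paths, and the absence of pattern Y is assumed rather than checked. Crossover individuals are ignored because, by the splitting operator, they do not change the contribution.

```python
from fractions import Fraction
from itertools import combinations


def root_overlap_length(pa, pb):
    """Length in edges of the root 2-overlap path of a path-pair (0 if there is none)."""
    n = 0
    while n + 1 < min(len(pa), len(pb)) and pa[n + 1] == pb[n + 1]:
        n += 1
    return n


def triple_contribution(pa, pb, pc, f_anc=Fraction(0)):
    """Contribution of one acceptable path-triple to Phi_abc under (12) and (13)."""
    phi_aa = Fraction(1, 2) * (1 + f_anc)
    phi_aaa = Fraction(1, 4) * (1 + 3 * f_anc)
    total = sum(len(p) - 1 for p in (pa, pb, pc))
    overlaps = [root_overlap_length(x, y) for x, y in combinations((pa, pb, pc), 2)]
    overlaps = [n for n in overlaps if n > 0]
    if not overlaps:                                   # Type 1
        return Fraction(1, 2) ** total * phi_aaa
    if len(overlaps) == 1:                             # Type 2
        l_triple = total - overlaps[0]
        return Fraction(1, 2) ** (l_triple + 1) * phi_aa
    raise ValueError("more than one root 2-overlap: not an acceptable path-triple")


# Type 1: three children a, b, c of a non-inbred parent A.
print(triple_contribution(["A", "a"], ["A", "b"], ["A", "c"]))
# Type 2: a and b are children of s, and s and c are children of A.
print(triple_contribution(["A", "s", "a"], ["A", "s", "b"], ["A", "c"]))
```

Summing such contributions over all triple-common ancestors and all acceptable path-triples gives Φ_abc. For three full siblings the two parents each contribute 1/32, so Φ_abc = 1/16.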

3.2. Path-Counting Formulas for Four Individuals

3.2.1. Path-Pair Level Graphical Representation of 〈P _Aa, P _Ab, P _Ac, P _Ad〉

Given a path-quad 〈P _Aa, P _Ab, P _Ac, P _Ad〉 and Quad_C(P _Aa, P _Ab, P _Ac, P _Ad) = ∅, the path-quad can have 11 scenarios S ₀–S ₁₀ shown in Figure 10 where all four paths are considered symmetrically.

Figure 10. A path-pair level graphical representation of 〈P_Aa, P_Ab, P_Ac, P_Ad〉.

In Figure 11, we introduce three building blocks {B₁, B₂, B₃}. For B₁ and B₂, the rules presented in Figure 7 are also applicable for Figure 11. For B₃, we only consider root overlap, because the crossover individuals can be eliminated by using the splitting operator introduced in Section 3.1.4. Note that for B₃, if Tri_C(P_Aa, P_Ab, P_Ac) = ∅, then it is equivalent to the scenario S₃ in Figure 8. Therefore, we only need to consider B₃ when Tri_C(P_Aa, P_Ab, P_Ac) ≠ ∅.

Figure 11. Building blocks for all scenarios of 〈P_Aa, P_Ab, P_Ac, P_Ad〉.

3.2.2. Building Block-Based Cases Construction for 〈P _Aa, P _Ab, P _Ac, P _Ad〉

For a scenario S _i (0 ≤ i ≤ 10) in Figure 11, we first decompose S _i to one or multiple building blocks. For a scenario S _i ∈ {S ₁, S ₃}, it has only one building block, and all acceptable cases can be obtained directly. For S ₂ = {u ₁ = B ₁, u ₂ = B ₁}, there is no need to consider the conflict between the edges in u ₁ and u ₂ because u ₁ and u ₂ are disconnected. Let R _i denote all acceptable cases of the path-pairs in u _i, and let T _i denote all acceptable cases for S _i. Therefore, we obtain T ₂ = R ₁ × R ₂ where × denotes the Cartesian product operator from relational algebra.

For S ₆ = {u ₁ = B ₃}, we obtain T ₆ = R ₁. For S _i ∈ {S _i∣4 ≤ i ≤ 10 and i ≠ 6}, we define the largest subgraph of S _i based on which we construct T _i.

Definition 7 (Largest Subgraph).

Given a scenario S _i (4 ≤ i ≤ 10 and i ≠ 6), the largest subgraph of S _i, denoted as S _j, is defined as follows:

S _j is a proper subgraph of S _i;

if S _i contains B ₃, then S _j must also contain B ₃;

no such S _k exists that S _j is a proper subgraph of S _k while S _k is also a proper subgraph of S _i.

For each scenario S _i (4 ≤ i ≤ 10 and i ≠ 6), we list the largest subgraph of S _i, denoted as S _j, in Table 2.

Table 2.

Largest subgraph of a scenario S _i (4 ≤ i ≤ 10 and i ≠ 6).

S _i	S ₄	S ₅	S ₇	S ₈	S ₉	S ₁₀

S _j	S ₃	S ₃	S ₆	S ₅	S ₇	S ₉


For a scenario S _i (4 ≤ i ≤ 10 and i ≠ 6), let Diff(S _i∖S _j) denote the set of building blocks in S _i but not in S _j, where S _j is the largest subgraph of S _i. Let |E _i| and |E _j| denote the number of edges in S _i and S _j, respectively. According to Table 2, we can conclude that |E _i | −|E _j | = 1. In order to leverage the dependency among building blocks, we consider only B ₂ in Diff(S _i∖S _j). For example, Diff(S ₅∖S ₃) = {B ₂}. Let T ₃ denote all acceptable cases for S ₃. And let R ₁ denote the set of acceptable cases for Diff(S ₅∖S ₃). Then, we can use S ₃ and Diff(S ₅∖S ₃) to construct all acceptable cases for S ₅. Then, we apply this idea for constructing all acceptable cases for each S _i in Table 2.
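As a rough illustration of how Table 2 orders the construction, the sketch below follows the largest-subgraph links from a scenario down to one whose acceptable cases are obtained directly. The dictionary simply transcribes Table 2; the function name `construction_chain` is hypothetical.

```python
# Table 2: scenario index i -> index j of its largest subgraph S_j.
LARGEST_SUBGRAPH = {4: 3, 5: 3, 7: 6, 8: 5, 9: 7, 10: 9}

def construction_chain(i):
    """Return the scenario indices from S_i down to a directly constructed scenario."""
    chain = [i]
    while chain[-1] in LARGEST_SUBGRAPH:
        chain.append(LARGEST_SUBGRAPH[chain[-1]])
    return chain
```

Read from right to left, a chain gives one order in which the sets T _i could be built, each step adding the building block in Diff(S _i∖S _j).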

Given a path-quad 〈P _Aa, P _Ab, P _Ac, P _Ad〉, an acceptable case has the following properties:

if there is one root 3-overlap path, there can be at most one root 2-overlap path;
otherwise, there can be at most two root 2-overlap paths.

3.2.3. Path-Counting Formula for Φ_abcd

Now, we present the path-counting formula for Φ_abcd as follows:

$\Phi_{abcd} = \sum_{A} \Bigl( \sum_{\text{Type 1}} \bigl(\tfrac{1}{2}\bigr)^{L_{quad}} \Phi_{AAAA} + \sum_{\text{Type 2}} \bigl(\tfrac{1}{2}\bigr)^{L_{quad}+1} \Phi_{AAA} + \sum_{\text{Type 3}} \bigl(\tfrac{1}{2}\bigr)^{L_{quad}+2} \Phi_{AA} \Bigr),$ (14)

where Φ_AA = (1/2)(1 + F _A), Φ_AAA = (1/4)(1 + 3F _A), Φ_AAAA = (1/8)(1 + 7F _A); F _A is the inbreeding coefficient of A; A is a quad-common ancestor of a, b, c, and d; and the types are defined as follows.

Type 1: zero root 2-overlap and zero root 3-overlap path.

Type 2: one root 2-overlap path P _As ending at s.

Type 3: one of the following three cases.

Case 1: two root 2-overlap paths P _As1 and P _As2 ending at s ₁ and s ₂, respectively.
Case 2: one root 3-overlap path P _At ending at t.
Case 3: one root 2-overlap path P _As and one root 3-overlap path P _At ending at s and t, respectively.

$L_{quad} = \begin{cases} L_{P_{Aa}} + L_{P_{Ab}} + L_{P_{Ac}} + L_{P_{Ad}} & \text{for Type 1} \\ L_{P_{Aa}} + L_{P_{Ab}} + L_{P_{Ac}} + L_{P_{Ad}} - L_{P_{As}} & \text{for Type 2} \\ L_{P_{Aa}} + L_{P_{Ab}} + L_{P_{Ac}} + L_{P_{Ad}} - L_{P_{As_{1}}} - L_{P_{As_{2}}} & \text{for Case 1} \in \text{Type 3} \\ L_{P_{Aa}} + L_{P_{Ab}} + L_{P_{Ac}} + L_{P_{Ad}} - 2 L_{P_{At}} & \text{for Case 2} \in \text{Type 3} \\ L_{P_{Aa}} + L_{P_{Ab}} + L_{P_{Ac}} + L_{P_{Ad}} - L_{P_{At}} - L_{P_{As}} & \text{for Case 3} \in \text{Type 3}, \end{cases}$ (15)

and L _{P_Aa} is the length of the path P _Aa (likewise for P _Ab, P _Ac, P _Ad, etc.).
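To make the bookkeeping in (14) and (15) concrete, the following Python sketch evaluates the contribution of a single path-quad. It is a minimal illustration under our own assumptions: the function names, the string labels for the types, and the convention of passing overlap lengths as a tuple are all hypothetical, and enumerating and classifying the path-quads themselves is not shown.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

def phi_self(order, f_a):
    """Phi_AA, Phi_AAA, Phi_AAAA (order 2, 3, 4) from the inbreeding coefficient F_A."""
    denominator, weight = {2: (2, 1), 3: (4, 3), 4: (8, 7)}[order]
    return Fraction(1, denominator) * (1 + weight * f_a)

def l_quad(path_lengths, kind, overlaps=()):
    """L_quad as in (15); `overlaps` holds the lengths of the root-overlap paths."""
    total = sum(path_lengths)
    if kind == "type1":
        return total
    if kind == "type2":          # overlaps = (L_As,)
        return total - overlaps[0]
    if kind == "type3_case1":    # overlaps = (L_As1, L_As2)
        return total - overlaps[0] - overlaps[1]
    if kind == "type3_case2":    # overlaps = (L_At,)
        return total - 2 * overlaps[0]
    if kind == "type3_case3":    # overlaps = (L_At, L_As)
        return total - overlaps[0] - overlaps[1]
    raise ValueError(f"unknown path-quad type: {kind}")

def quad_contribution(path_lengths, kind, overlaps=(), f_a=Fraction(0)):
    """Contribution of one path-quad from ancestor A to Phi_abcd, following (14)."""
    length = l_quad(path_lengths, kind, overlaps)
    if kind == "type1":
        return HALF ** length * phi_self(4, f_a)
    if kind == "type2":
        return HALF ** (length + 1) * phi_self(3, f_a)
    return HALF ** (length + 2) * phi_self(2, f_a)

# A is a non-inbred parent of a, b, c, and d: four paths of length 1, no overlap.
four_children = quad_contribution([1, 1, 1, 1], "type1")

# A -> t, t is the parent of a, b, and c, and A is the parent of d:
# three paths of length 2 share the root 3-overlap path A -> t of length 1.
shared_parent = quad_contribution([2, 2, 2, 1], "type3_case2", overlaps=(1,))
```

In the second toy pedigree the single path-quad gives (1/2)^{5+2} Φ_AA = 1/256 for a non-inbred A, which agrees with tracing the four sampled alleles back to A by hand.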

For completeness, the path-counting formulas for Φ_aabc and Φ_aaab are presented in Appendix A. The correctness of the path-counting formula for four individuals is proven in Appendix C.

3.3. Path-Counting Formulas for Two Pairs of Individuals

3.3.1. Terminology and Definitions

(1) 2-Pair-Path-Pair. It consists of two pairs of path-pairs denoted as 〈(P _Sa, P _Sb), (P _Tc, P _Td)〉, where P _Sa ∈ P(S, a), P _Sb ∈ P(S, b), P _Tc ∈ P(T, c), P _Td ∈ P(T, d), S is a common ancestor of a and b, and T is a common ancestor of c and d. If A = S = T, then A is a quad-common ancestor of a, b, c, and d.

(2) Homo-Overlap and Heter-Overlap Individual. Given two pairs of individuals 〈a, b〉 and 〈c, d〉, if s ∈ Bi_C(P _Aa, P _Ab) (or s ∈ Bi_C(P _Ac, P _Ad)), we call s a homo-overlap individual when P _Aa and P _Ab (or P _Ac and P _Ad) pass through the same parent of s. If r ∈ Bi_C(P _Ai, P _Aj), where i ∈ {a, b} and j ∈ {c, d}, we call r a heter-overlap individual when P _Ai and P _Aj pass through the same parent of r.

(3) Root Homo-Overlap and Heter-Overlap Path. Given a 2-pair-path-pair 〈(P _Aa, P _Ab), (P _Ac, P _Ad)〉, if s is a homo-overlap individual and the homo-overlap path extends all the way to the quad-common ancestor A, then we call it a root homo-overlap path. If r is a heter-overlap individual and the heter-overlap path extends all the way to the quad-common ancestor A, then we call it a root heter-overlap path.

Example 8 . —

A is a quad-common ancestor of a, b, c, and d in Figure 12. For (a), s is a homo-overlap individual between P _Aa and P _Ab, and t is a homo-overlap individual between P _Ac and P _Ad; A → s and A → t are root homo-overlap paths. For (b), x is a heter-overlap individual between P _Aa and P _Ad, and y is a heter-overlap individual between P _Ab and P _Ac; A → x and A → y are root heter-overlap paths.

Examples of 2-pair-path-quads for Φ_ab,cd.

3.3.2. Path-Counting Formula for Φ_ab,cd

Now, we present a path-pair level graphical representation for 〈(P _Aa, P _Ab), (P _Ac, P _Ad)〉 shown in Figure 13. The options for an edge can be {T, X, TX}. (Refer to Section 3.1.1 for definitions of T, X, and TX). Based on the different types of 〈P _Aa, P _Ab, P _Ac, P _Ad〉 presented in (14), all cases for 〈(P _Aa, P _Ab), (P _Ac, P _Ad)〉 are summarized in Table 3, where h is the last individual of a root homo-overlap path P _Ah (i.e., the path P _Ah ending at h) and r ₁ and r ₂ are the last individuals of root heter-overlap paths P _Ar1 and P _Ar2, respectively.

Scenarios of 〈(P _Aa, P _Ab), (P _Ac, P _Ad)〉 at path-pair level.

Table 3.

A summary of all cases for 〈(P _Aa, P _Ab), (P _Ac, P _Ad)〉.

〈P _Aa, P _Ab, P _Ac, P _Ad〉	〈(P _Aa, P _Ab), (P _Ac, P _Ad)〉
Zero root 2-overlap and zero root 3-overlap	Zero root homo-overlap and zero root heter-overlap

One root 2-overlap path	One root homo-overlap and zero root heter-overlap
One root 2-overlap path	Zero root homo-overlap and one root heter-overlap

Two root 2-overlap paths	Two root homo-overlaps and zero root heter-overlap
Two root 2-overlap paths	Zero root homo-overlap and two root heter-overlaps

One root 3-overlap path	One root homo-overlap and two root heter-overlaps, and h = r ₁ = r ₂

One root 2-overlap and one root 3-overlap	One root homo-overlap and two root heter-overlaps, and r ₁ = r ₂ ≠ h
One root 2-overlap and one root 3-overlap	One root homo-overlap and two root heter-overlaps, and h = r ₁ ≠ r ₂


Given a pedigree graph having one or multiple progenitors {p _i∣i > 0}, we define that the generation of a progenitor p _i is 0, denoted as gen(p _i) = 0. If an individual a has only one parent p, then we define gen(a) = gen(p) + 1. If an individual a has two parents f and m, we define gen(a) = MAX{gen(f), gen(m)} + 1.
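The generation rule above can be sketched directly as a memoized recursion. The representation is an assumption made for illustration: the pedigree is a dictionary mapping each individual to a tuple of its known parents, and progenitors either map to an empty tuple or do not appear as keys.

```python
def generations(parents):
    """Return gen(x) for every individual reachable in `parents`.

    parents: dict mapping an individual to a tuple of its known parents
    (zero, one, or two entries).
    """
    gen = {}

    def visit(x):
        if x not in gen:
            known = parents.get(x, ())
            # Progenitors have generation 0; otherwise take the deepest parent plus one.
            gen[x] = 1 + max(visit(p) for p in known) if known else 0
        return gen[x]

    for individual in parents:
        visit(individual)
    return gen

# Hypothetical toy pedigree: p1 and p2 are progenitors.
pedigree = {"f": ("p1",), "m": ("p1", "p2"), "a": ("f", "m"), "x": ("a", "p2")}
gen = generations(pedigree)
```

For very deep pedigrees an iterative traversal would avoid Python's recursion limit; the recursive form is used here only because it mirrors the definition.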

The path-counting formula for Φ_ab,cd is as follows:

$\begin{aligned} \Phi_{ab,cd} = {} & \sum_{A} \Bigl( \sum_{\text{Type 1}} \bigl(\tfrac{1}{2}\bigr)^{L_{2\text{-pair}}} \Phi_{AAA} + \sum_{\text{Type 2}} \bigl(\tfrac{1}{2}\bigr)^{L_{2\text{-pair}}+1} \Phi_{AAA} \\ & + \sum_{\text{Type 3}} \bigl(\tfrac{1}{2}\bigr)^{L_{2\text{-pair}}+2} \Phi_{AA} + \sum_{\text{Type 4}} \bigl(\tfrac{1}{2}\bigr)^{L_{2\text{-pair}}+1} \Phi_{AA} \Bigr) \\ & + \sum_{(S,T) \in \text{Type 5}} \bigl(\tfrac{1}{2}\bigr)^{L_{\langle P_{Sa}, P_{Sb} \rangle} + L_{\langle P_{Tc}, P_{Td} \rangle} + 1} \Phi_{BB}, \end{aligned}$

(16)

where A: a quad-common ancestor of a, b, c, and d, S: a common ancestor of a and b, and T: a common ancestor of c and d. For 〈(P _Aa, P _Ab), (P _Ac, P _Ad)〉 (S = T = A), there are four types (i.e., Type 1 to Type 4).

Type 1: zero root homo-overlap and zero root heter-overlap.

Type 2: zero root homo-overlap and one root heter-overlap P _Ar ending at r.

Type 3: either (i) zero root homo-overlap and two root heter-overlaps P _Ar1 and P _Ar2 ending at r ₁ and r ₂, respectively, or (ii) one root homo-overlap P _Ah ending at h and two root heter-overlaps P _Ar1 and P _Ar2 ending at r ₁ and r ₂, with r ₁ ≠ r ₂.

(17)

Type 4: one root homo-overlap P _Ah ending at h and two root heter-overlaps ending at r ₁ and r ₂, with h = r ₁ = r ₂.

For 〈(P _Sa, P _Sb), (P _Tc, P _Td)〉 (S ≠ T), there is one type (i.e., Type 5).

Type 5: 〈P _Sa, P _Sb〉 has zero overlap individual, 〈P _Tc, P _Td〉 has zero overlap individual.

At most one path-pair (either 〈P _Sa, P _Sb〉 or 〈P _Tc, P _Td〉) can have crossover individuals.

Between a path from 〈P _Sa, P _Sb〉 and a path from 〈P _Tc, P _Td〉, there are no overlap individuals, but there can be crossover individuals, x, where x ≠ S and x ≠ T:

$B = \begin{cases} S & \text{when } gen(S) < gen(T) \\ S & \text{when } gen(S) = gen(T) \text{ and } T \text{ has two parents} \\ T & \text{otherwise}, \end{cases}$

$L_{2\text{-pair}} = \begin{cases} L_{P_{Aa}} + L_{P_{Ab}} + L_{P_{Ac}} + L_{P_{Ad}} & \text{for Type 1} \\ L_{P_{Aa}} + L_{P_{Ab}} + L_{P_{Ac}} + L_{P_{Ad}} - L_{P_{Ar}} & \text{for Type 2} \\ L_{P_{Aa}} + L_{P_{Ab}} + L_{P_{Ac}} + L_{P_{Ad}} - L_{P_{Ar_{1}}} - L_{P_{Ar_{2}}} & \text{for Type 3} \\ L_{P_{Aa}} + L_{P_{Ab}} + L_{P_{Ac}} + L_{P_{Ad}} - 2 L_{P_{Ah}} & \text{for Type 4}, \end{cases}$

$L_{\langle P_{Sa}, P_{Sb} \rangle} = L_{P_{Sa}} + L_{P_{Sb}}, \quad L_{\langle P_{Tc}, P_{Td} \rangle} = L_{P_{Tc}} + L_{P_{Td}} \quad \text{for Type 5}.$

(18)
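The choice of B in (18) and the Type 5 term of (16) are simple enough to transcribe. The sketch below is only that, a transcription under assumed inputs: `gen` maps individuals to generations, `parent_count` maps individuals to their number of known parents, and both function names are hypothetical.

```python
from fractions import Fraction

def choose_b(s, t, gen, parent_count):
    """Pick B between the ancestors S and T following the rule in (18)."""
    if gen[s] < gen[t]:
        return s
    if gen[s] == gen[t] and parent_count[t] == 2:
        return s
    return t

def type5_term(l_pair_s, l_pair_t, phi_bb):
    """One Type 5 summand of (16): (1/2)^(L_<Sa,Sb> + L_<Tc,Td> + 1) * Phi_BB."""
    return Fraction(1, 2) ** (l_pair_s + l_pair_t + 1) * phi_bb
```

Whether a given (S, T) pair qualifies as Type 5 still has to be decided from the overlap and crossover conditions stated above; the sketch assumes that check has already been made.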

Note that if 〈a, b〉 and 〈c, d〉 have zero quad-common ancestors, we have the following formula for Φ_ab,cd:

$\Phi_{ab,cd} = \sum_{(S,T) \in \text{Type 6}} \bigl(\tfrac{1}{2}\bigr)^{L_{\langle P_{Sa}, P_{Sb} \rangle} + L_{\langle P_{Tc}, P_{Td} \rangle}} \Phi_{SS} \, \Phi_{TT}.$

(19)

Type 6: 〈P _Sa, P _Sb〉 is a nonoverlapping path-pair and 〈P _Tc, P _Td〉 is a nonoverlapping path-pair. Between a path from 〈P _Sa, P _Sb〉 and a path from 〈P _Tc, P _Td〉, there are no overlap individuals, but there can be crossover individuals.

L _{〈P_Sa,P_Sb〉} and L _{〈P_Tc,P_Td〉} are defined as in Type 5.
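Formula (19) can be checked on a small case with the sketch below. The input format is an assumption: each Type 6 pair is summarized by its two path-pair lengths and the values Φ_SS and Φ_TT, and the function name is hypothetical.

```python
from fractions import Fraction

def phi_ab_cd_no_quad_ancestor(type6_terms):
    """Sum (19) over Type 6 pairs.

    type6_terms: iterable of (L_<Sa,Sb>, L_<Tc,Td>, Phi_SS, Phi_TT) tuples.
    """
    total = Fraction(0)
    for l_pair_s, l_pair_t, phi_ss, phi_tt in type6_terms:
        total += Fraction(1, 2) ** (l_pair_s + l_pair_t) * phi_ss * phi_tt
    return total

# a and b are half-siblings through S; c and d are half-siblings through T;
# S and T are unrelated, non-inbred progenitors (Phi_SS = Phi_TT = 1/2).
half_sib_pairs = phi_ab_cd_no_quad_ancestor(
    [(2, 2, Fraction(1, 2), Fraction(1, 2))]
)
```

In this toy case the two pairs are independent, so the result equals the product Φ_ab Φ_cd = (1/8)(1/8).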

The correctness of the path-counting formula for Φ_ab,cd is proven in Appendix C. For completeness, please refer to [18] for the path-counting formulas for Φ_aa,bc, Φ_ab,ac, Φ_ab,ab, and Φ_aa,ab.

3.4. Experimental Results

In this section, we show the efficiency of our path-counting method using NodeCodes for condensed identity coefficients by comparing it with the performance of a recursive method used in [10]. We implemented two methods: (1) using recursive formulas to compute each required kinship coefficient and generalized kinship coefficient; (2) using the path-counting method coupled with NodeCodes to compute each required kinship coefficient and generalized kinship coefficient independently. We refer to the first method as Recursive and to the second method as NodeCodes. For completeness, please refer to [18] for the details of the NodeCodes-based method.

The NodeCodes of a node are a set of labels, each representing a path to the node from its ancestors. Given a pedigree graph, let r be the progenitor (i.e., the node with in-degree 0). (For simplicity, we assume there is one progenitor, r, as the ancestor of all individuals in the pedigree. Otherwise, a virtual node r can be added to the pedigree graph and all progenitors can be made children of r.) For each node u in the graph, the set of NodeCodes of u, denoted as NC(u), is assigned using a breadth-first-search traversal starting from r as follows.

If u is r then NC(r) contains only one element: the empty string.
Otherwise, let u be a node with NC(u), and v ₀, v ₁, …, v _k be u's children in sibling order; then for each x in NC(u), a code xi* is added to NC(v _i), where 0 ≤ i ≤ k, and ∗ indicates the gender of the individual represented by node v _i.
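The assignment rule can be sketched as follows. Several details here are assumptions made for illustration rather than part of the scheme as published: the graph is given as a dictionary from a node to its children in sibling order, each child is paired with a one-letter gender marker, codes are plain strings, and nodes are processed in topological order (a node is expanded only after all of its parents) so that NC(u) is complete before it is propagated.

```python
from collections import deque

def assign_nodecodes(children, root):
    """Return NC(u) for every node reachable from `root`.

    children: dict mapping a node to a list of (child, gender) pairs in sibling order.
    """
    # Count incoming edges among the nodes reachable from the root.
    indegree = {root: 0}
    seen = {root}
    stack = [root]
    while stack:
        u = stack.pop()
        for v, _gender in children.get(u, []):
            indegree[v] = indegree.get(v, 0) + 1
            if v not in seen:
                seen.add(v)
                stack.append(v)

    codes = {u: set() for u in seen}
    codes[root].add("")  # NC(r) contains only the empty string
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for i, (v, gender) in enumerate(children.get(u, [])):
            for x in codes[u]:
                # Append sibling index and gender marker to each code of the parent.
                codes[v].add(f"{x}{i}{gender}")
            indegree[v] -= 1
            if indegree[v] == 0:  # all parents processed, NC(v) is complete
                queue.append(v)
    return codes

# Hypothetical toy pedigree: r has children f and m, who are both parents of a.
toy = {"r": [("f", "M"), ("m", "F")], "f": [("a", "M")], "m": [("a", "M")]}
nodecodes = assign_nodecodes(toy, "r")
```

Here a receives one code per path from r, one through f and one through m. With ten or more siblings the plain concatenation used in this sketch would become ambiguous, so a real encoding would need fixed-width or delimited indexes.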

The computations of kinship coefficients for two individuals and generalized kinship coefficients for three individuals presented in [11, 12, 14, 15] use NodeCodes. The NodeCodes-based computation schemes can also be applied to the generalized kinship coefficients for four individuals and two pairs of individuals. For completeness, please refer to [18] for the details of using NodeCodes to compute the generalized kinship coefficients for four individuals and two pairs of individuals based on our proposed path-counting formulas in Sections 3.2 and 3.3.

In order to test the scalability of our approach for calculating condensed identity coefficients on large pedigrees, we used a population simulator implemented in [11] to generate arbitrarily large pedigrees. The population simulator is based on the algorithm for generating populations with overlapping generations in Chapter 4 of [19] along with the parameters given in Appendix B of [20] to model the relatively isolated Finnish Kainuu subpopulation and its growth during the years 1500–2000. An overview of the generation algorithm was presented in [11, 12, 14]. The parameters include starting/ending year, initial population size, initial age distribution, marriage probability, maximum age at pregnancy, expected number of children by time period, immigration rate, and probability of death by time period and age group.

We examine the performance of computing condensed identity coefficients using twelve synthetic pedigrees which range from 75 individuals to 195,197 individuals. The smallest pedigree spans 3 generations, and the largest pedigree spans 19 generations. We analyzed the effects of pedigree size and the depth of individuals in the pedigree (the longest path between the individual and a progenitor) on the computation efficiency improvement.

In the first experiment, 300 random pairs were selected from each of our 12 synthetic pedigrees. Figure 14 shows the computation efficiency improvement for each pedigree. The improvement of NodeCodes over Recursive grew as the pedigree size increased, from 26.83% on the smallest pedigree to 94.75% on the largest pedigree. This also indicates that the path-counting method coupled with NodeCodes scales well to large pedigrees when computing condensed identity coefficients.

The effect of pedigree size on computation efficiency improvement.

In our next experiment, we examined the effect of the depth of the individual in the pedigree on the query time. For each depth, we generated 300 random pairs from the largest synthetic pedigree.

Figure 15 shows the effect of depth on the computation efficiency improvement. We can see the improvement of NodeCodes over Recursive, ranging from 86.48% to 91.30%.

The effect of depth on computation efficiency improvement.

4. Conclusion

We have introduced a framework for generalizing Wright's path-counting formula to more than two individuals. Aiming at efficiently computing condensed identity coefficients, we proposed path-counting formulas (PCF) for all generalized kinship coefficients that are sufficient for expressing condensed identity coefficients by a linear combination. We also performed experiments to compare the efficiency of our method with the recursive method for computing condensed identity coefficients on large pedigrees. Our future work includes (i) further improvements on condensed identity coefficient computation by collectively calculating the set of generalized kinship coefficients to avoid redundant computations, and (ii) experimental results for using PCF in conjunction with encoding schemes (e.g., compact path-encoding schemes [13]) for computing condensed identity coefficients on very large pedigrees.

Acknowledgments

The authors thank Professor Robert C. Elston, Case School of Medicine, for introducing to them the identity coefficients and referring them to the related literature [7, 10, 17]. This work is partially supported by the National Science Foundation Grants DBI 0743705, DBI 0849956, and CRI 0551603 and by the National Institute of Health Grant GM088823.

Appendices

A. Path-Counting Formulas of Special Cases

A.1. Path-Counting Formula for Φ_aab

For 〈P _Aa1, P _Aa2〉, we introduce a special case, where P _Aa1 and P _Aa2 are mergeable.

Definition A.1 (Mergeable Path-Pair). —

A path-pair 〈P _Aa1, P _Aa2〉 is mergeable if and only if the two paths P _Aa1 and P _Aa2 are completely identical.

Next, we present a graphical representation of 〈P _Aa1, P _Aa2, P _Ab〉 in Figure 16.

A path-pair level graphical representation of 〈P _Aa1, P _Aa2, P _Ab〉.

Lemma A.2. —

For S ₂ and S ₃ in Figure 16, 〈P _Aa1, P _Aa2〉 cannot be a mergeable path-pair.

Proof —

For S ₂ and S ₃, if 〈P _Aa1, P _Aa2〉 is mergeable, then any common individual s between P _Aa1 and P _Ab is also a shared individual between P _Aa2 and P _Ab. It means s ∈ Tri_C(P _Aa1, P _Aa2, P _Ab) which contradicts the fact that Tri_C(P _Aa1, P _Aa2, P _Ab) = ∅.

Considering all three scenarios in Figure 16, only S ₁ can have a mergeable path-pair 〈P _Aa1, P _Aa2〉 by Lemma A.2. Now, we present our path-counting formula for Φ_aab where a is not an ancestor of b:

$\Phi_{aab} = \sum_{A} \Bigl( \sum_{\text{Type 1}} \bigl(\tfrac{1}{2}\bigr)^{L_{triple}-1} \Phi_{AAA} + \sum_{\text{Type 2}} \bigl(\tfrac{1}{2}\bigr)^{L_{triple}} \Phi_{AA} + \sum_{\text{Type 3}} \bigl(\tfrac{1}{2}\bigr)^{L_{\langle P_{Aa}, P_{Ab} \rangle}+1} \Phi_{AA} \Bigr),$ (A.1)

where A: a common ancestor of a and b.

When 〈P _Aa1, P _Aa2〉 is not mergeable,

Type 1: 〈P _Aa1, P _Aa2, P _Ab〉 has no root 2-overlap.

Type 2: 〈P _Aa1, P _Aa2, P _Ab〉 has one root 2-overlap path P _As ending at the individual s.

When 〈P _Aa1, P _Aa2〉 is mergeable,

Type 3: 〈P _Aa, P _Ab〉 is a nonoverlapping path-pair.

$L_{triple} = \begin{cases} L_{P_{Aa_{1}}} + L_{P_{Aa_{2}}} + L_{P_{Ab}} & \text{for Type 1} \\ L_{P_{Aa_{1}}} + L_{P_{Aa_{2}}} + L_{P_{Ab}} - L_{P_{As}} & \text{for Type 2}, \end{cases} \qquad L_{\langle P_{Aa}, P_{Ab} \rangle} = L_{P_{Aa}} + L_{P_{Ab}} \quad \text{for Type 3}.$ (A.2)

For the sake of completeness, if a is an ancestor of b, there is no recursive formula for Φ_aab in [10], but we can use either the recursive formula for Φ_abc or the path-counting formula for Φ_abc to compute Φ_a1a2b.
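As a small sanity check of (A.1) and (A.2), the sketch below evaluates the contribution of one path-triple and applies it to half-siblings. The function name, the string labels for the types, and the argument layout are hypothetical; Φ_AA and Φ_AAA are computed from F _A as defined for (14).

```python
from fractions import Fraction

def phi_aab_contribution(kind, lengths, f_a=Fraction(0), l_overlap=0):
    """Contribution of one path-triple from ancestor A to Phi_aab, following (A.1).

    kind "type1": lengths = (L_Aa1, L_Aa2, L_Ab), no root 2-overlap.
    kind "type2": same lengths, one root 2-overlap path of length l_overlap.
    kind "type3": mergeable pair, lengths = (L_Aa, L_Ab).
    """
    half = Fraction(1, 2)
    phi_aa = half * (1 + f_a)
    phi_aaa = Fraction(1, 4) * (1 + 3 * f_a)
    if kind == "type1":
        return half ** (sum(lengths) - 1) * phi_aaa
    if kind == "type2":
        return half ** (sum(lengths) - l_overlap) * phi_aa
    if kind == "type3":
        return half ** (sum(lengths) + 1) * phi_aa
    raise ValueError(f"unknown path-triple type: {kind}")

# a and b are half-siblings through a non-inbred parent A. The only path from
# A to a is the edge A -> a, so the pair of paths to a is mergeable (Type 3).
half_sibs = phi_aab_contribution("type3", (1, 1))
```

The result, 1/16, matches the recursive route for the same pedigree, Φ_aab = (1/2)Φ_ab with Φ_ab = 1/8.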

A.2. Path-Counting Formula for Φ_aabc

Given a path-quad 〈P _Aa1, P _Aa2, P _Ab, P _Ac〉, if 〈P _Aa1, P _Aa2〉 is not mergeable, then we process the path-quad as equivalent to 〈P _Aa, P _Ab, P _Ac, P _Ad〉. If 〈P _Aa1, P _Aa2〉 is mergeable, the path-quad 〈P _Aa1, P _Aa2, P _Ab, P _Ac〉 can be condensed to scenarios for 〈P _Aa, P _Ab, P _Ac〉.

Now, we present a path-counting formula for Φ_aabc where a is not an ancestor of b and c as follows:

$\begin{aligned} \Phi_{aabc} = {} & \sum_{A} \Bigl( \sum_{\text{Type 1}} \bigl(\tfrac{1}{2}\bigr)^{L_{quad}-1} \Phi_{AAAA} + \sum_{\text{Type 2}} \bigl(\tfrac{1}{2}\bigr)^{L_{quad}} \Phi_{AAA} + \sum_{\text{Type 3}} \bigl(\tfrac{1}{2}\bigr)^{L_{quad}+1} \Phi_{AA} \Bigr) \\ & + \sum_{A} \Bigl( \sum_{\text{Type 4}} \bigl(\tfrac{1}{2}\bigr)^{L_{triple}+1} \Phi_{AAA} + \sum_{\text{Type 5}} \bigl(\tfrac{1}{2}\bigr)^{L_{triple}+2} \Phi_{AA} \Bigr), \end{aligned}$

(A.3)

where A: a triple-common ancestor of a, b, and c.

When 〈P _Aa1, P _Aa2〉 is not mergeable,

Type 1: zero root 2-overlap and zero root 3-overlap path;

Type 2: one root 2-overlap path P _As ending at s;

Type 3: one of the following three cases.

Case 1: two root 2-overlap paths P _As1 and P _As2 ending at s ₁ and s ₂, respectively.
Case 2: one root 3-overlap path P _At ending at t.
Case 3: one root 2-overlap path P _As and one root 3-overlap path P _At ending at s and t, respectively.

(A.4)

When 〈P _Aa1, P _Aa2〉 is mergeable,

Type 4: 〈P _Aa, P _Ab, P _Ac〉 has zero root 2-overlap path;

Type 5: 〈P _Aa, P _Ab, P _Ac〉 has one root 2-overlap path P _As ending at s.

$L_{quad} = \begin{cases} L_{P_{Aa_{1}}} + L_{P_{Aa_{2}}} + L_{P_{Ab}} + L_{P_{Ac}} & \text{for Type 1} \\ L_{P_{Aa_{1}}} + L_{P_{Aa_{2}}} + L_{P_{Ab}} + L_{P_{Ac}} - L_{P_{As}} & \text{for Type 2} \\ L_{P_{Aa_{1}}} + L_{P_{Aa_{2}}} + L_{P_{Ab}} + L_{P_{Ac}} - L_{P_{As_{1}}} - L_{P_{As_{2}}} & \text{for Case 1} \in \text{Type 3} \\ L_{P_{Aa_{1}}} + L_{P_{Aa_{2}}} + L_{P_{Ab}} + L_{P_{Ac}} - 2 L_{P_{At}} & \text{for Case 2} \in \text{Type 3} \\ L_{P_{Aa_{1}}} + L_{P_{Aa_{2}}} + L_{P_{Ab}} + L_{P_{Ac}} - L_{P_{At}} - L_{P_{As}} & \text{for Case 3} \in \text{Type 3}, \end{cases}$

$L_{triple} = \begin{cases} L_{P_{Aa}} + L_{P_{Ab}} + L_{P_{Ac}} & \text{for Type 4} \\ L_{P_{Aa}} + L_{P_{Ab}} + L_{P_{Ac}} - L_{P_{As}} & \text{for Type 5}. \end{cases}$

(A.5)

Note that if a is an ancestor of either b or c, or both of them, then the path-counting formula of Φ_abcd is applicable to compute Φ_a1a2bc.

A.3. Path-Counting Formula for Φ_aaab

A special case of 〈P _Aa₁, P _Aa₂, P _Aa₃〉 for 〈P _Aa₁, P _Aa₂, P _Aa₃, P _Ab〉 is introduced when 〈P _Aa₁, P _Aa₂, P _Aa₃〉 is mergeable. With the existence of a mergeable path-triple, 〈P _Aa₁, P _Aa₂, P _Aa₃, P _Ab〉 can be condensed to 〈P _Aa, P _Ab〉.

Definition A.3 (Mergeable Path-Triple). —

Given three paths P _Aa1, P _Aa2, and P _Aa3, they are mergeable if and only if they are completely identical.

Lemma A.4. —

Given a path-quad 〈P _Aa₁, P _Aa₂, P _Aa₃, P _Ab〉, there must be at least one mergeable path-pair among 〈P _Aa₁, P _Aa₂〉, 〈P _Aa₁, P _Aa₃〉, 〈P _Aa₂, P _Aa₃〉.

Proof —

For an individual a with two parents f and m, the paternal allele of the individual a is transmitted from f and the maternal allele is transmitted from m. At allele level, only two descent paths starting from an ancestor are allowed. For a path-quad 〈P _Aa₁, P _Aa₂, P _Aa₃, P _Ab〉, there must be at least one mergeable path-pair among 〈P _Aa₁, P _Aa₂〉, 〈P _Aa₁, P _Aa₃〉, and 〈P _Aa₂, P _Aa₃〉.

For simplicity, we treat 〈P _Aa1, P _Aa2〉 as a default mergeable path-pair.

Now, we present the path-counting formula for Φ_aaab where a is not an ancestor of b as follows:

$\Phi_{aaab} = \sum_{A} \Bigl( \tfrac{3}{2} \Bigl( \sum_{\text{Type 1}} \bigl(\tfrac{1}{2}\bigr)^{L_{triple}-1} \Phi_{AAA} + \sum_{\text{Type 2}} \bigl(\tfrac{1}{2}\bigr)^{L_{triple}} \Phi_{AA} \Bigr) + \sum_{\text{Type 3}} \bigl(\tfrac{1}{2}\bigr)^{L_{pair}+2} \Phi_{AA} \Bigr),$

(A.6)

where A: a common ancestor of a and b.

When there is only one mergeable path-pair (let us consider 〈P _Aa1, P _Aa2〉 as the mergeable path-pair),

Type 1: 〈P _Aa1, P _Aa3, P _Ab〉 has zero root 2-overlap path,
Type 2: 〈P _Aa1, P _Aa3, P _Ab〉 has one root 2-overlap path P _As ending at s.

When 〈P _Aa1, P _Aa2, P _Aa3〉 is mergeable,

Type 3: 〈P _Aa, P _Ab〉 is nonoverlapping.

$L_{triple} = \begin{cases} L_{P_{Aa_{1}}} + L_{P_{Aa_{3}}} + L_{P_{Ab}} & \text{for Type 1} \\ L_{P_{Aa_{1}}} + L_{P_{Aa_{3}}} + L_{P_{Ab}} - L_{P_{As}} & \text{for Type 2}, \end{cases} \qquad L_{pair} = L_{P_{Aa}} + L_{P_{Ab}} \quad \text{for Type 3}.$ (A.7)

Note that if a is an ancestor of b, we treat Φ_aaab = Φ_a1a2a3b. Then, we apply the path-counting formula for Φ_abcd to compute Φ_a1a2a3b.

B. Proof for Path-Counting Formulas of Three Individuals

We first demonstrate that, for one triple-common ancestor A, the path-counting computation of Φ_abc is equivalent to the computation using recursive formulas. Then, we prove the correctness of the path-counting computation for multiple triple-common ancestors.

B.1. One Triple-Common Ancestor

Considering the different types of path-triples starting from a triple-common ancestor A in a pedigree graph G contributing to Φ_abc and Φ_aab, G can have 5 different cases:

For Φ_aab:

Case 2.1: G does not have any path-triples 〈P _Aa₁, P _Aa₂, P _Ab〉 with root overlap.
Case 2.2: G has path-triples 〈P _Aa₁, P _Aa₂, P _Ab〉 with root overlap.
Case 2.3: G has path-triples 〈P _Aa₁, P _Aa₂, P _Ab〉 having a mergeable path-pair 〈P _Aa₁, P _Aa₂〉.

For Φ_abc:

Case 3.1: G does not have any path-triples 〈P _Aa, P _Ab, P _Ac〉 with root overlap.
Case 3.2: G has path-triples 〈P _Aa, P _Ab, P _Ac〉 with root overlap.

(B.1)

Based on the 5 cases from Case 2.1 to Case 3.2, we first construct a dependency graph shown in Figure 17, consistent with the recursive formulas (3), (4), and (5) for the generalized kinship coefficients for three individuals.

Dependency graph for different cases regarding Φ_abc and Φ_aab.

Then, we take the following steps to prove the correctness of the path-counting formulas (12) and (A.1):

for Φ_ab, the correctness of the path-counting formula (i.e., Wright's formula) is proven in [21]. For Case 2.1 and Case 2.2, the correctness is proven based on the correctness of Cases 3.1 and 3.2;
for Case 2.3, it has no cycle but only depends on Φ_ab. Thus, we prove the correctness of Case 2.3 by transforming the case to Φ_ab;
for Cases 3.1 and 3.2, the correctness is proven by induction on the number of edges, n, in the pedigree graph G.

B.1.1. Correctness Proof for Case 3.1

Case 3.1. For Φ_abc, G does not have any path triples 〈P _Aa, P _Ab, P _Ac〉 with root overlap.

Proof —

There are two basic scenarios: (i) one individual is a parent of another; (ii) no individual is a parent of another, among a, b, and c.

Using the recursive formula (3) to compute Φ_abc, for Figure 18(a), Φ_abc = (1/2)Φ_cbc = (1/2)²Φ_ccc; for Figure 18(b), Φ_abc = (1/2)Φ_Abc = (1/2)²Φ_AAc = (1/2)³Φ_AAA.

Using the path-counting formula (12), if a path-triple 〈P _Aa, P _Ab, P _Ac〉 has no root overlap (i.e., Type 1), then the contribution of 〈P _Aa, P _Ab, P _Ac〉 to Φ_abc can be computed as follows: ∑_Type 1(1/2)^{L_{〈P_Aa,P_Ab,P_Ac〉}}Φ_AAA, where L _{〈P_Aa,P_Ab,P_Ac〉} = L _{P_Aa} + L _{P_Ab} + L _{P_Ac}.

For Figure 18(a), c is the only triple-common ancestor and we obtain Φ_abc = (1/2)^{L_{〈P_ca,P_cb,P_cc〉}}Φ_ccc = (1/2)²Φ_ccc; for Figure 18(b), we obtain Φ_abc = (1/2)^{L_{〈P_Aa,P_Ab,P_Ac〉}}Φ_AAA = (1/2)³Φ_AAA.

Induction Step. Let n denote the number of edges in G. Assume true for n ≤ k, where k ≥ 2. Then, we show it is true for n = k + 1.

For Figures 19(a) and 19(b), among a, b, and c, let a be the individual having the longest path starting from their triple-common ancestor in the pedigree graph G with (k + 1) edges. If we remove the node a and cut the edge f → a from G, then the new graph G* has k edges. In terms of computing Φ_fbc, G* satisfies the condition for induction hypothesis.

For Figure 19(a), Φ_fbc = ∑_Type 1(1/2)^{L_{〈P_Af,P_Ab,P_Ac〉}}Φ_AAA. Based on the recursive formula (3), Φ_abc = (1/2)(Φ_fbc + Φ_mbc), where f and m are the parents of a. In G, a has only one parent f; thus, Φ_mbc = 0. Then, we can plug in the path-counting formula for Φ_fbc to obtain

$\begin{aligned} \Phi_{abc} &= \tfrac{1}{2} \Phi_{fbc} \\ &= \tfrac{1}{2} \sum_{\text{Type 1}} \bigl(\tfrac{1}{2}\bigr)^{L_{\langle P_{Af}, P_{Ab}, P_{Ac} \rangle}} \Phi_{AAA} \\ &= \sum_{\text{Type 1}} \bigl(\tfrac{1}{2}\bigr)^{L_{\langle P_{Af}, P_{Ab}, P_{Ac} \rangle} + 1} \Phi_{AAA}, \\ \because\ & L_{\langle P_{Aa}, P_{Ab}, P_{Ac} \rangle} = L_{\langle P_{Af}, P_{Ab}, P_{Ac} \rangle} + 1 \\ \therefore\ & \Phi_{abc} = \sum_{\text{Type 1}} \bigl(\tfrac{1}{2}\bigr)^{L_{\langle P_{Aa}, P_{Ab}, P_{Ac} \rangle}} \Phi_{AAA}. \end{aligned}$ (B.2)

Similarly, for Figure 19(b), we obtain Φ_abc = ∑_Type 1(1/2)^{L_{〈P_cf,P_cb,P_cc〉}+1}Φ_ccc = ∑_Type 1(1/2)^{L_{〈P_ca,P_cb,P_cc〉}}Φ_ccc.

Thus, it is true for n = k + 1.
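The recursive side of this comparison can be reproduced numerically on the small pedigrees used in the basis. The sketch below implements the recursions for up to three individuals as we recall them from Karigl [10] (they are cited here as formulas (3), (4), and (5) but stated outside this appendix, so their exact form is an assumption): Φ_abc = (1/2)(Φ_fbc + Φ_mbc), Φ_aab = (1/2)(Φ_ab + Φ_fmb), Φ_aaa = (1/4)(1 + 3Φ_fm), Φ_ab = (1/2)(Φ_fb + Φ_mb), and Φ_aa = (1/2)(1 + Φ_fm). The data layout and function names are hypothetical.

```python
from fractions import Fraction

def make_phi(parents):
    """Build a recursive generalized kinship function for up to three individuals.

    parents: dict mapping every individual to a (father, mother) tuple,
    with None marking an unknown parent.
    """
    gen_cache = {}

    def gen(x):
        if x not in gen_cache:
            known = [p for p in parents[x] if p is not None]
            gen_cache[x] = 1 + max(map(gen, known)) if known else 0
        return gen_cache[x]

    def phi(*inds):
        if None in inds:               # an unknown parent contributes nothing
            return Fraction(0)
        x = max(inds, key=gen)         # deepest individual: not an ancestor of the rest
        rest = [i for i in inds if i != x]
        f, m = parents[x]
        copies = len(inds) - len(rest)
        if copies == 1:                # Phi_xb or Phi_xbc
            return (phi(f, *rest) + phi(m, *rest)) / 2
        if copies == 2 and rest:       # Phi_xxb
            return (phi(x, *rest) + phi(f, m, *rest)) / 2
        if copies == 2:                # Phi_xx
            return (1 + phi(f, m)) / 2
        return (1 + 3 * phi(f, m)) / 4  # Phi_xxx

    return phi

# Pedigree in the style of Figure 18(a): c is a parent of a and b.
phi_18a = make_phi({"c": (None, None), "a": ("c", None), "b": ("c", None)})
# Pedigree in the style of Figure 18(b): A is a parent of a, b, and c.
phi_18b = make_phi({"A": (None, None), "a": ("A", None), "b": ("A", None), "c": ("A", None)})
```

For a non-inbred progenitor, Φ_ccc = Φ_AAA = 1/4, so the two calls `phi_18a("a", "b", "c")` and `phi_18b("a", "b", "c")` should return (1/2)²(1/4) and (1/2)³(1/4), the same values that the Type 1 term of the path-counting formula gives.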

(a) c is a parent of a and b; (b) no individual is a parent of another.

(a) No individual is a parent of another; (b) c is an ancestor of a and b.

B.1.2. Correctness Proof for Case 3.2

Case 3.2. For Φ_abc, G has path triples 〈P _Aa, P _Ab, P _Ac〉 with root overlap.

Proof —

There are three basic scenarios: (i) there are two individuals who are parents of another; (ii) there is only one individual who is parent of another; (iii) there is no individual who is a parent of another, among a, b, and c.

Using the recursive formula (3) to compute Φ_abc in Figure 20: for Figure 20(a), Φ_abc = (1/2)Φ_bbc = (1/2)²Φ_bc = (1/2)³Φ_cc; for Figure 20(b), Φ_abc = (1/2)Φ_bbc = (1/2)²Φ_bc = (1/2)⁴Φ_AA; for Figure 20(c), Φ_abc = (1/2)²Φ_ssc = (1/2)³Φ_sc = (1/2)⁵Φ_AA.

Using the path-counting formula (12), if a path-triple 〈P _Aa, P _Ab, P _Ac〉 has root overlap (i.e., Type 2), then the contribution of 〈P _Aa, P _Ab, P _Ac〉 to Φ_abc can be computed as follows: ∑_Type 2(1/2)^{L_{〈P_Aa,P_Ab,P_Ac〉}+1}Φ_AA, where L _{〈P_Aa,P_Ab,P_Ac〉} = L _{P_Aa} + L _{P_Ab} + L _{P_Ac} − L _{P_As} and s is the last individual of the root overlap path P _As.

For Figure 20(a), c is the only triple-common ancestor and we obtain Φ_abc = (1/2)^{L_{〈P_ca,P_cb,P_cc〉}+1}Φ_cc = (1/2)²⁺¹Φ_cc = (1/2)³Φ_cc. Similarly, for Figures 20(b) and 20(c), we obtain Φ_abc = (1/2)⁴Φ_AA and Φ_abc = (1/2)⁵Φ_AA, respectively.

Induction Step. Let n denote the number of edges in G. Assume true for n ≤ k, where k ≥ 2. Then, we show that it is true for n = k + 1.

For Figures 21(a), 21(b), and 21(c), among a, b, and c, let a be the individual who has the longest path and let p be a parent of a. Then, we cut the edge p → a from G and obtain a new graph G* which satisfies the condition of induction hypothesis. For Figure 21(a), we use the path-counting formula for Φ_fbc in G* : Φ_fbc = ∑_Type 2(1/2)^{L_{〈P_Af,P_Ab,P_Ac〉}+1}Φ_AA.

In G, f is the only parent of a; according to the recursive formula (3), we have Φ_abc = (1/2)Φ_fbc. Then, we can plug in Φ_fbc and obtain

$\begin{aligned} \Phi_{abc} &= \tfrac{1}{2} \Phi_{fbc} \\ &= \tfrac{1}{2} \sum_{\text{Type 2}} \bigl(\tfrac{1}{2}\bigr)^{L_{\langle P_{Af}, P_{Ab}, P_{Ac} \rangle} + 1} \Phi_{AA} \\ &= \sum_{\text{Type 2}} \bigl(\tfrac{1}{2}\bigr)^{L_{\langle P_{Af}, P_{Ab}, P_{Ac} \rangle} + 1 + 1} \Phi_{AA}, \\ \because\ & L_{\langle P_{Aa}, P_{Ab}, P_{Ac} \rangle} = L_{\langle P_{Af}, P_{Ab}, P_{Ac} \rangle} + 1 \\ \therefore\ & \Phi_{abc} = \sum_{\text{Type 2}} \bigl(\tfrac{1}{2}\bigr)^{L_{\langle P_{Aa}, P_{Ab}, P_{Ac} \rangle} + 1} \Phi_{AA}. \end{aligned}$ (B.3)

For Figures 21(b) and 21(c), we take the same steps as we calculate Φ_abc for Figure 21(a).

In summary, it is true for n = k + 1.

(a) b is a parent of a, and c is a parent of b; (b) b is a parent of a; (c) no individual who is a parent of another.

(a) No individual who is a parent of another; (b) b is a parent of a; (c) b is a parent of a and c is an ancestor of b.

B.1.3. Correctness Proof for Case 2.3

Case 2.3. For Φ_aab, the path-triples in the pedigree graph G have mergeable path-pair.

Proof —

Considering the relationship between a and b, G has two scenarios: (i) b is not an ancestor of a; (ii) b is an ancestor of a. Using the path-counting formula (A.1), if a path-triple 〈P _Aa1, P _Aa2, P _Ab〉∈ Type 3, which means that it has a mergeable path-pair, then the contribution of 〈P _Aa1, P _Aa2, P _Ab〉 to Φ_aab can be computed as follows: ∑_Type 3(1/2)^{L_{〈P_Aa,P_Ab〉}+1}Φ_AA, where L _{〈P_Aa,P_Ab〉} = L _{P_Aa} + L _{P_Ab}.

Using the recursive formula (4), we obtain Φ_aab = (1/2)(Φ_ab + Φ_fmb).

For Figure 22(a), A is a common ancestor of a and b.

Because a has only one parent f (m is missing), Φ_fmb = 0 and

$\Phi_{aab} = \tfrac{1}{2} (\Phi_{ab} + \Phi_{fmb}) = \tfrac{1}{2} (\Phi_{ab} + 0) = \tfrac{1}{2} \Phi_{ab}.$ (B.4)

For Φ_ab, we use Wright's formula and obtain Φ_ab = ∑_P(1/2)^{L_{〈P_Aa,P_Ab〉}}Φ_AA where P denotes all nonoverlapping path-pairs 〈P _Aa, P _Ab〉.

Then, we have Φ_aab = (1/2)Φ_ab = (1/2)∑_P(1/2)^{L_{〈P_Aa,P_Ab〉}}Φ_AA = ∑_P(1/2)^{L_{〈P_Aa,P_Ab〉}+1}Φ_AA.

For Figure 22(b), we can also transform the computation of Φ_aab to Φ_ab.

In summary, it shows that the path-counting formula (A.1) is true for Case 2.3.

(a) b is not an ancestor of a; (b) b is an ancestor of a.

B.1.4. Correctness Proof for Cases 2.1 and 2.2

For Φ_aab, when there is no path-triple having a mergeable path-pair (i.e., the path-triple belongs to either Case 2.1 or Case 2.2), Φ_aab can be transformed to Φ_a₁a₂b, which is equivalent to the computation of Φ_abc for Cases 3.1 and 3.2. The correctness of our path-counting formula for Cases 3.1 and 3.2 has been proven above. Thus, we obtain the correctness for Φ_aab when the path-triple belongs to either Case 2.1 or Case 2.2.

B.2. Multiple Triple-Common Ancestors

Now, we provide the correctness proof for multiple triple-common ancestors regarding the path-counting formulas (12) and (A.1).

Lemma B.2 A. —

Given a pedigree graph G and three individuals a, b, c having at least one triple-common ancestor, Φ_abc is correctly computed using the path-counting formulas (12) and (A.1).

Proof —

Proof by induction on the number of triple-common ancestors

Basis. G has only one triple-common ancestor of a, b, and c.

The correctness of (12) and (A.1) for G with only one triple-common ancestor of a, b, and c is proven in the previous section.

Induction Hypothesis. Assume that if G has k or fewer triple-common ancestors of a, b, and c, (12) and (A.1) are correct for G.

Induction Step. Now, we show that it is true for G with k + 1 triple-common ancestors of a, b, and c.

Let Tri_C(a, b, c, G) denote the set of all triple-common ancestors of a, b, and c in G, where Tri_C(a, b, c, G) = {A_i ∣ 1 ≤ i ≤ k + 1}. Let A₁ be a topmost triple-common ancestor, that is, one that has no ancestor among the remaining triple-common ancestors {A_i ∣ 2 ≤ i ≤ k + 1}. Let S(A₁) denote the contribution from A₁ to Φ_abc.

Because A₁ is a topmost triple-common ancestor, no path-triple from {A_i ∣ 2 ≤ i ≤ k + 1} to a, b, and c passes through A₁. We can therefore remove A₁ from G, together with all of its outgoing edges, and obtain a new graph G′ which has k triple-common ancestors of a, b, and c. That is, Tri_C(a, b, c, G′) = {A_i ∣ 2 ≤ i ≤ k + 1}.

For the new graph G′, we can apply our induction hypothesis and obtain Φ_abc (G′).

For the topmost triple-common ancestor A₁, there are two different cases considering its relationship with the other triple-common ancestors:

(1) there is no individual among {A_i ∣ 2 ≤ i ≤ k + 1} who is a descendant of A₁;

(2) there is at least one individual among {A_i ∣ 2 ≤ i ≤ k + 1} who is a descendant of A₁.

For (1), since no individual among {A_i ∣ 2 ≤ i ≤ k + 1} is a descendant of A₁, the set of path-triples from A₁ to a, b, and c is independent of the set of path-triples from {A_i ∣ 2 ≤ i ≤ k + 1} to a, b, and c. It also means that the contribution from A₁ to Φ_abc is independent of the contribution from the other triple-common ancestors.

Summing up all contributions, we obtain Φ_abc(G) = Φ_abc(G′) + S(A₁).

For (2), let A_j be one descendant of A₁. Now both A₁ and A_j can reach a, b, and c.

Let pt_i = {t_a: A₁ → ⋯ → a; t_b: A₁ → ⋯ → b; t_c: A₁ → ⋯ → c} be a path-triple from A₁ to a, b, and c.

If t_a, t_b, and t_c all pass through A_j, then the path-triple pt_i is not an eligible path-triple for Φ_abc. When we compute the contribution from A₁ to Φ_abc, we exclude all such path-triples where t_a, t_b, and t_c all pass through a lower triple-common ancestor. In other words, an eligible path-triple from A₁ regarding Φ_abc cannot have three paths all passing through a lower triple-common ancestor. Therefore, the contribution from A₁ to Φ_abc is independent of the contribution from the other triple-common ancestors. Summing up all contributions, we obtain Φ_abc(G) = Φ_abc(G′) + S(A₁).
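The graph construction used in the induction step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the pedigree is hypothetical, a triple-common ancestor is taken to be any individual from which a, b, and c are all reachable (so the founder `X` below also counts), and the function names `ancestors`, `tri_common`, `topmost`, and `remove_individual` are introduced here and do not come from the paper.

```python
# Hypothetical pedigree: child -> (father, mother); None marks a missing parent.
# A1 is an ancestor of Aj, and both can reach a, b, and c (case (2) above).
G = {
    "A1": (None, None),
    "X": (None, None),
    "Aj": ("A1", "X"),
    "a": ("Aj", None),
    "b": ("Aj", None),
    "c": ("Aj", None),
}

def ancestors(parents, x):
    """Return x together with every individual from which x can be reached."""
    seen, stack = set(), [x]
    while stack:
        node = stack.pop()
        if node is not None and node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return seen

def tri_common(parents, a, b, c):
    """Tri_C(a, b, c, G): individuals that can reach all of a, b, and c."""
    return ancestors(parents, a) & ancestors(parents, b) & ancestors(parents, c)

def topmost(parents, common):
    """Pick a member of `common` that has no proper ancestor inside `common`."""
    for cand in sorted(common):
        if not (ancestors(parents, cand) - {cand}) & common:
            return cand
    return None

def remove_individual(parents, gone):
    """Build G' by deleting `gone` and every edge leaving it."""
    return {child: tuple(None if p == gone else p for p in ps)
            for child, ps in parents.items() if child != gone}

common = tri_common(G, "a", "b", "c")   # {"A1", "X", "Aj"}
top = topmost(G, common)                # "A1"
G_prime = remove_individual(G, top)
print(sorted(tri_common(G_prime, "a", "b", "c")))  # ['Aj', 'X']
```

Removing the topmost ancestor leaves every other triple-common ancestor in place, which is the property the induction step relies on when it applies the hypothesis to G′.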

C. Proof for Four Individuals and Two Pairs of Individuals

Here, we give a proof sketch for the correctness of the path-counting formulas for four individuals. We first enumerate the different cases for four individuals in a pedigree graph G and then construct a dependency graph over these cases. The correctness of the path-counting formulas for two pairs of individuals can be proved similarly.

C.1. Proof for Four Individuals

Consider the existence of different types of path-quads regarding Φ_abcd, Φ_aabc, and Φ_aaab; there are 15 cases for a pedigree graph G:

$\begin{array}{l}
\left.\begin{array}{l}
\text{Case 2.1: } G \text{ has path-triples } \langle P_{Aa_1}, P_{Aa_2}, P_{Ab} \rangle \text{ with zero root overlap} \\
\text{Case 2.2: } G \text{ has path-triples } \langle P_{Aa_1}, P_{Aa_2}, P_{Ab} \rangle \text{ with one root overlap} \\
\text{Case 2.3: } G \text{ has path-pairs } \langle P_{Aa}, P_{Ab} \rangle \text{ with zero root overlap}
\end{array}\right\} \Longleftarrow \Phi_{aaab}, \\[6pt]
\left.\begin{array}{l}
\text{Case 3.1: } G \text{ has path-quads } \langle P_{Aa_1}, P_{Aa_2}, P_{Ab}, P_{Ac} \rangle \text{ with zero root overlap} \\
\text{Case 3.2: } G \text{ has path-quads } \langle P_{Aa_1}, P_{Aa_2}, P_{Ab}, P_{Ac} \rangle \text{ with one root 2-overlap} \\
\text{Case 3.3.1: } G \text{ has path-quads } \langle P_{Aa_1}, P_{Aa_2}, P_{Ab}, P_{Ac} \rangle \text{ with two root 2-overlap} \\
\text{Case 3.3.2: } G \text{ has path-quads } \langle P_{Aa_1}, P_{Aa_2}, P_{Ab}, P_{Ac} \rangle \text{ with one root 3-overlap} \\
\text{Case 3.3.3: } G \text{ has path-quads } \langle P_{Aa_1}, P_{Aa_2}, P_{Ab}, P_{Ac} \rangle \text{ with one root 2-overlap and one root 3-overlap} \\
\text{Case 3.4: } G \text{ has path-triples } \langle P_{Aa}, P_{Ab}, P_{Ac} \rangle \text{ with zero root overlap} \\
\text{Case 3.5: } G \text{ has path-triples } \langle P_{Aa}, P_{Ab}, P_{Ac} \rangle \text{ with one root overlap}
\end{array}\right\} \Longleftarrow \Phi_{aabc}, \\[6pt]
\left.\begin{array}{l}
\text{Case 4.1: } G \text{ has path-quads } \langle P_{Aa}, P_{Ab}, P_{Ac}, P_{Ad} \rangle \text{ with zero root overlap} \\
\text{Case 4.2: } G \text{ has path-quads } \langle P_{Aa}, P_{Ab}, P_{Ac}, P_{Ad} \rangle \text{ with one root 2-overlap} \\
\text{Case 4.3.1: } G \text{ has path-quads } \langle P_{Aa}, P_{Ab}, P_{Ac}, P_{Ad} \rangle \text{ with two root 2-overlap} \\
\text{Case 4.3.2: } G \text{ has path-quads } \langle P_{Aa}, P_{Ab}, P_{Ac}, P_{Ad} \rangle \text{ with one root 3-overlap} \\
\text{Case 4.3.3: } G \text{ has path-quads } \langle P_{Aa}, P_{Ab}, P_{Ac}, P_{Ad} \rangle \text{ with one root 2-overlap and one root 3-overlap}
\end{array}\right\} \Longleftarrow \Phi_{abcd}.
\end{array}$ (C.1)

Then, we construct a dependency graph shown in Figure 23 for all cases for four individuals.

Figure 23. Dependency graph for different cases for four individuals.

According to the dependency graph in Figure 23, the intermediate steps, including Cases 3.4 and 3.5, are already proved for the computation of Φ_abc. The correctness of the transformation from Case 4.2 to Case 3.4 can be proved based on the recursive formulas for Φ_abcd and Φ_aabc. Similarly, we can obtain the transformation from Case 4.3.1 to Case 3.5.

C.2. Proof for Two Pairs of Individuals

Consider the existence of different types of 2-pair-path-pair regarding Φ_ab,cd; there are 9 cases which are listed as follows.

Case 4.1. G has 〈(P_Aa, P_Ab), (P_Ac, P_Ad)〉 with zero root homo-overlap and zero root heter-overlap.

Case 4.2. G has 〈(P_Aa, P_Ab), (P_Ac, P_Ad)〉 with zero root homo-overlap and one root heter-overlap.

Case 4.3.1. G has 〈(P_Aa, P_Ab), (P_Ac, P_Ad)〉 with zero root homo-overlap and two root heter-overlap.

Case 4.3.2. G has 〈(P_Aa, P_Ab), (P_Ac, P_Ad)〉 with one root homo-overlap and two root heter-overlap.

Case 4.4. G has 〈(P_Aa, P_Ab), (P_Ac, P_Ad)〉 with one root homo-overlap and zero root heter-overlap.

Case 4.5. G has 〈(P_Aa, P_Ab), (P_Ac, P_Ad)〉 with two root homo-overlap and zero root heter-overlap.

Case 4.6. G has path-triples 〈P_Aa, P_Ab, P_Ac〉 with zero root overlap.

Case 4.7. G has path-triples 〈P_Aa, P_Ab, P_Ac〉 with one root overlap.

Case 4.8. G has path-pairs 〈P_Tc, P_Td〉 with zero root overlap.

Then, we construct a dependency graph for the cases relating to Φ_ab,cd in Figure 24.

Figure 24. Dependency graph for different cases for two pairs of individuals.

According to the dependency graph in Figure 24, Cases 4.6, 4.7, and 4.8 are the intermediate steps, which are already proved for the computation of Φ_abc. The correctness of the transformation from Case 4.2 to Case 4.6 can be proved based on the recursive formulas for Φ_ab,cd and Φ_ab,ac. Similarly, we can obtain the transformation from Cases 4.3.1 and 4.3.2 to Case 4.7, as well as from Case 4.4 to Case 4.8.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

1. Surgeon General’s New Family Health History Tool Is Released, Ready for “21st Century Medicine”, http://compmed.com/category/people-helping-people/page/7/
2. Falchi M, Forabosco P, Mocci E, et al. A genomewide search using an original pairwise sampling approach for large genealogies identifies a new locus for total and low-density lipoprotein cholesterol in two genetically differentiated isolates of Sardinia. The American Journal of Human Genetics. 2004;75(6):1015–1031. doi: 10.1086/426155.
3. Ciullo M, Bellenguez C, Colonna V, et al. New susceptibility locus for hypertension on chromosome 8q by efficient pedigree-breaking in an Italian isolate. Human Molecular Genetics. 2006;15(10):1735–1743. doi: 10.1093/hmg/ddl097.
4. Glossary of Genetic Terms, National Human Genome Research Institute, http://www.genome.gov/glossary/?id=148.
5. Cotterman CW. A calculus for statistico-genetics [Ph.D. thesis]. Columbus, Ohio, USA: Ohio State University; 1940. Reprinted in P. Ballonoff, Ed., Genetics and Social Structure, Dowden, Hutchinson & Ross, Stroudsburg, Pa, USA, 1974.
6. Malecot G. Les mathématiques de l'hérédité. Paris, France: Masson; 1948. Translated edition: The Mathematics of Heredity, Freeman, San Francisco, Calif, USA, 1969.
7. Gillois M. La relation d'identité en génétique. Annales de l'Institut Henri Poincaré B. 1964;2:1–94.
8. Harris DL. Genotypic covariances between inbred relatives. Genetics. 1964;50:1319–1348. doi: 10.1093/genetics/50.6.1319.
9. Jacquard A. Logique du calcul des coefficients d'identité entre deux individus. Population. 1966;21:751–776.
10. Karigl G. A recursive algorithm for the calculation of identity coefficients. Annals of Human Genetics. 1981;45(3):299–305. doi: 10.1111/j.1469-1809.1981.tb00341.x.
11. Elliott B, Akgul SF, Mayes S, Ozsoyoglu ZM. Efficient evaluation of inbreeding queries on pedigree data. Proceedings of the 19th International Conference on Scientific and Statistical Database Management (SSDBM '07); July 2007.
12. Elliott B, Cheng E, Mayes S, Ozsoyoglu ZM. Efficiently calculating inbreeding on large pedigrees databases. Information Systems. 2009;34(6):469–492.
13. Yang L, Cheng E, Özsoyoğlu ZM. Using compact encodings for path-based computations on pedigree graphs. Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine (ACM-BCB '11); August 2011; pp. 235–244.
14. Cheng E, Elliott B, Ozsoyoglu ZM. Scalable computation of kinship and identity coefficients on large pedigrees. Proceedings of the 7th Annual International Conference on Computational Systems Bioinformatics (CSB '08); 2008; pp. 27–36.
15. Cheng E, Elliott B, Özsoyoğlu ZM. Efficient computation of kinship and identity coefficients on large pedigrees. Journal of Bioinformatics and Computational Biology. 2009;7(3):429–453. doi: 10.1142/s0219720009004175.
16. Wright S. Coefficients of inbreeding and relationship. The American Naturalist. 1922;56(645).
17. Nadot R, Vaysseix G. Kinship and identity algorithm of coefficients of identity. Biometrics. 1973;29(2):347–359.
18. Cheng E. Scalable path-based computations on pedigree data [Ph.D. thesis]. Cleveland, Ohio, USA: Case Western Reserve University; 2012.
19. Ollikainen V. Simulation Techniques for Disease Gene Localization in Isolated Populations [Ph.D. thesis]. Helsinki, Finland: University of Helsinki; 2002.
20. Toivonen HTT, Onkamo P, Vasko K, et al. Data mining applied to linkage disequilibrium mapping. The American Journal of Human Genetics. 2000;67(1):133–145. doi: 10.1086/302954.
21. Boucher W. Calculation of the inbreeding coefficient. Journal of Mathematical Biology. 1988;26(1):57–64. doi: 10.1007/BF00280172.

