Computational and Mathematical Methods in Medicine. 2014 Jul 21;2014:898424. doi: 10.1155/2014/898424

Path-Counting Formulas for Generalized Kinship Coefficients and Condensed Identity Coefficients

En Cheng 1,*, Z Meral Ozsoyoglu 2
PMCID: PMC4130148  PMID: 25165486

Abstract

An important computation on pedigree data is the calculation of condensed identity coefficients, which provide a complete description of the degree of relatedness of two individuals. The applications of condensed identity coefficients range from genetic counseling to disease tracking. Condensed identity coefficients can be computed using linear combinations of generalized kinship coefficients for two, three, four individuals, and two pairs of individuals and there are recursive formulas for computing those generalized kinship coefficients (Karigl, 1981). Path-counting formulas have been proposed for the (generalized) kinship coefficients for two (three) individuals but there have been no path-counting formulas for the other generalized kinship coefficients. It has also been shown that the computation of the (generalized) kinship coefficients for two (three) individuals using path-counting formulas is efficient for large pedigrees, together with path encoding schemes tailored for pedigree graphs. In this paper, we propose a framework for deriving path-counting formulas for generalized kinship coefficients. Then, we present the path-counting formulas for all generalized kinship coefficients for which there are recursive formulas and which are sufficient for computing condensed identity coefficients. We also perform experiments to compare the efficiency of our method with the recursive method for computing condensed identity coefficients on large pedigrees.

1. Introduction

With the rapidly expanding field of medical genetics and genetic counseling, genealogy information is becoming increasingly abundant. In January 2009, the US Department of Health and Human Services released an updated and improved version of the Surgeon General's Web-based family health history tool [1]. This Web-based tool makes it easy for users to record their family health history. Large extended human pedigrees are very informative for linkage analysis. Pedigrees including thousands of members in 10–20 generations are available from genetically isolated populations [2, 3]. In human genetics, a pedigree is defined as “a simplified diagram of a family's genealogy that shows family members' relationships to each other and how a specific trait, abnormality, or disease has been inherited” [4]. Pedigrees are utilized to trace the inheritance of a specific disease, calculate genetic risk ratios, identify individuals at risk, and facilitate genetic counseling. To calculate genetic risk ratios or identify individuals at risk, we need to assess the degree of relatedness of two individuals. As a matter of fact, all measures of relatedness are based on the concept of identical by descent (IBD). Two alleles are identical by descent if one is an ancestral copy of the other or if they are both copies of the same ancestral allele. The IBD concept is primarily due to Cotterman [5] and Malecot [6] and has been successfully applied to many problems in population genetics.

The simplest measure of relationship between two individuals is their kinship coefficient. The kinship coefficient between two individuals i and j is the probability that an allele selected randomly from i and an allele selected randomly from the same autosomal locus of j are identical by descent. To better discriminate between different types of pairs of relatives, identity coefficients were introduced by Gillois [7] and Harris [8] and promulgated by Jacquard [9]. Considering the four alleles of two individuals at a fixed autosomal locus, there are 15 possible identity states. Disregarding the distinction between maternally and paternally derived alleles, we obtain 9 condensed identity states. The probabilities associated with each condensed identity state are called condensed identity coefficients, which are useful in a diverse range of fields. This includes the calculation of risk ratios for qualitative disease, the analysis of quantitative traits, and genetic counseling in medicine.

A recursive algorithm for calculating condensed identity coefficients proposed by Karigl [10] has been known for some time. This method requires that one calculates a set of generalized kinship coefficients, from which one obtains condensed identity coefficients via a linear transformation. One limitation is that this recursive approach is not scalable when applied to very large pedigrees. It has been previously shown that the kinship coefficients for two individuals [11–13] and the generalized kinship coefficients for three individuals [14, 15] can be efficiently calculated using path-counting formulas together with path encoding schemes tailored for pedigree graphs.

Motivated by the efficiency of path-counting formulas for computing the kinship coefficient for two individuals and the generalized kinship coefficient for three individuals, we first introduce a framework for developing path-counting formulas to compute generalized kinship coefficients concerning three individuals, four individuals, and two pairs of individuals. Then, we present path-counting formulas for all generalized kinship coefficients which have recursive formulas proposed by Karigl [10] and are sufficient to compute condensed identity coefficients. In summary, our ultimate goal is to use path-counting formulas for generalized kinship coefficients computation so that efficiency and scalability for condensed identity coefficients calculation can be improved.

The main contributions of our work are as follows:

  1. a framework to develop path-counting formulas for generalized kinship coefficients;

  2. a set of path-counting formulas for all generalized kinship coefficients having recursive formulas [10];

  3. experimental results demonstrating significant performance gains for calculating condensed identity coefficients based on our proposed path-counting formulas as compared to using recursive formulas [10].

2. Materials and Methods

This section describes kinship coefficients and generalized kinship coefficients, identity coefficients, and condensed identity coefficients in more detail. Conceptual terms for the path-counting formulas for three and four individuals are introduced in Section 2.3. In addition, an overview of path-counting formula derivation is presented.

2.1. Kinship Coefficients and Generalized Kinship Coefficients

The kinship coefficient between two individuals a and b is the probability that a randomly chosen allele at the same locus from each is identical by descent (IBD). There are two approaches to computing the kinship coefficient Φab: the recursive approach [10] and the path-counting approach [16]. The recursive formulas [10] for Φab and Φaa are

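$$\Phi_{ab} = \tfrac{1}{2}\left(\Phi_{fb} + \Phi_{mb}\right) \quad \text{if } a \text{ is not an ancestor of } b, \qquad \Phi_{aa} = \tfrac{1}{2}\left(1 + \Phi_{fm}\right) = \tfrac{1}{2}\left(1 + F_{a}\right), \tag{1}$$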

where f and m denote the father and the mother of a, respectively, and F a is the inbreeding coefficient of a.
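The recursion in (1) can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation evaluated in this paper: the pedigree `PEDIGREE`, the helper `ORDER`, and the function name `kinship` are hypothetical, founders are assumed to be unrelated and non-inbred, and the condition "a is not an ancestor of b" is enforced by always recursing on the individual listed later (parents are assumed to be listed before their children).

```python
from fractions import Fraction
from functools import lru_cache

# Hypothetical pedigree: individual -> (father, mother); founders map to (None, None).
# Assumption: parents are listed before their children.
PEDIGREE = {
    "f": (None, None),
    "m": (None, None),
    "a": ("f", "m"),
    "b": ("f", "m"),
    "c": ("a", "b"),  # offspring of a full-sib mating
}
ORDER = {name: i for i, name in enumerate(PEDIGREE)}


@lru_cache(maxsize=None)
def kinship(a, b):
    """Kinship coefficient computed with the recursive formulas in (1)."""
    if a is None or b is None:
        return Fraction(0)  # a missing parent contributes nothing
    if a == b:
        father, mother = PEDIGREE[a]
        return Fraction(1, 2) * (1 + kinship(father, mother))
    # Recurse on the later-listed individual, which cannot be an ancestor of the other.
    if ORDER[a] < ORDER[b]:
        a, b = b, a
    father, mother = PEDIGREE[a]
    return Fraction(1, 2) * (kinship(father, b) + kinship(mother, b))


print(kinship("a", "b"))  # 1/4 for full siblings
print(kinship("c", "c"))  # 5/8, i.e. F_c = 1/4
```

Exact fractions are used so that the small worked values can be checked by hand.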

Wright's path-counting formula [16] for Φab is

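$$\Phi_{ab} = \sum_{A} \sum_{\langle P_{Aa}, P_{Ab}\rangle \in PP} \left(\tfrac{1}{2}\right)^{r+s+1} \left(1 + F_{A}\right), \tag{2}$$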

where A is a common ancestor of a and b, PP is a set of nonoverlapping path-pairs 〈P Aa, P Ab〉 from A to a and b, r is the length of the path P Aa, s is the length of the path P Ab, and F A is the inbreeding coefficient of A. The path-pair 〈P Aa, P Ab〉 is nonoverlapping if and only if the two paths share no common individuals, except A.
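As a hedged illustration of (2), the sketch below enumerates all path-pairs by brute force and keeps only the nonoverlapping ones. It assumes a small hypothetical pedigree `PARENTS`, treats an individual as its own ancestor with a path of length 0 (so the parent-child case is covered), and does not use the path encoding schemes cited above, so it is suitable only for small examples.

```python
from fractions import Fraction

# Hypothetical pedigree: individual -> list of parents (founders have none).
PARENTS = {
    "f": [], "m": [],
    "a": ["f", "m"],
    "b": ["f", "m"],
}


def paths(anc, x):
    """All parent-to-child paths from anc down to x, as tuples of individuals."""
    if anc == x:
        return [(x,)]
    return [p + (x,) for parent in PARENTS[x] for p in paths(anc, parent)]


def ancestors(x):
    """x together with all of its ancestors (x itself is included by assumption)."""
    found = {x}
    for parent in PARENTS[x]:
        found |= ancestors(parent)
    return found


def inbreeding(x):
    """F_x is the kinship coefficient of the parents of x (0 for founders)."""
    parents = PARENTS[x]
    return wright_kinship(*parents) if len(parents) == 2 else Fraction(0)


def wright_kinship(a, b):
    """Kinship coefficient of two distinct individuals via formula (2)."""
    total = Fraction(0)
    for anc in ancestors(a) & ancestors(b):
        for path_a in paths(anc, a):
            for path_b in paths(anc, b):
                # Nonoverlapping: the two paths share no individual except anc.
                if set(path_a) & set(path_b) == {anc}:
                    r, s = len(path_a) - 1, len(path_b) - 1
                    total += Fraction(1, 2) ** (r + s + 1) * (1 + inbreeding(anc))
    return total


print(wright_kinship("a", "b"))  # 1/4 for full siblings
```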

Recursive formulas proposed by Karigl [10] for generalized kinship coefficients concerning three individuals, four individuals, and two pairs of individuals are listed as follows in (3), (4), and (5):

[Equation (3): recursive formulas for Φabc; equation image not reproduced.] (3)
[Equation (4): recursive formulas for Φabcd; equation image not reproduced.] (4)
[Equation (5): recursive formulas for Φab,cd; equation image not reproduced.] (5)

Φabc is the probability that randomly chosen alleles at the same locus from each of the three individuals (i.e., a, b, and c) are identical by descent (IBD). Similarly, Φabcd is the probability that randomly chosen alleles at the same locus from each of the four individuals (i.e., a, b, c, and d) are IBD. Φab,cd is the probability that a random allele from a is IBD with a random allele from b and that a random allele from c is IBD with a random allele from d at the same locus. Note that Φabc = 0 if there is no common ancestor of a, b, and c. Φabcd = 0 if there is no common ancestor of a, b, c, and d, and Φab,cd = 0 in the absence of a common ancestor either for a and b or for c and d.

2.2. Identity Coefficients and Condensed Identity Coefficients

Given two individuals a and b with maternally and paternally derived alleles at a fixed autosomal locus, there are 15 possible identity states, and the probabilities associated with each identity state are called identity coefficients. Ignoring the distinction between maternally and paternally derived alleles, we group the 15 possible states into 9 condensed identity states, as shown in Figure 1. The states range from state 1, in which all four alleles are IBD, to state 9, in which none of the four alleles are IBD. The probabilities associated with each condensed identity state are called condensed identity coefficients, denoted by {Δi ∣ 1 ≤ i ≤ 9}. The condensed identity coefficients can be computed from generalized kinship coefficients using the linear transformation shown in (6):

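$$\begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
2 & 2 & 2 & 2 & 1 & 1 & 1 & 1 & 1 \\
2 & 2 & 1 & 1 & 2 & 2 & 1 & 1 & 1 \\
4 & 0 & 2 & 0 & 2 & 0 & 2 & 1 & 0 \\
8 & 0 & 4 & 0 & 2 & 0 & 2 & 1 & 0 \\
8 & 0 & 2 & 0 & 4 & 0 & 2 & 1 & 0 \\
16 & 0 & 4 & 0 & 4 & 0 & 2 & 1 & 0 \\
4 & 4 & 2 & 2 & 2 & 2 & 1 & 1 & 1 \\
16 & 0 & 4 & 0 & 4 & 0 & 4 & 1 & 0
\end{bmatrix}
\begin{bmatrix} \Delta_1 \\ \Delta_2 \\ \Delta_3 \\ \Delta_4 \\ \Delta_5 \\ \Delta_6 \\ \Delta_7 \\ \Delta_8 \\ \Delta_9 \end{bmatrix}
=
\begin{bmatrix} 1 \\ 2\Phi_{aa} \\ 2\Phi_{bb} \\ 4\Phi_{ab} \\ 8\Phi_{aab} \\ 8\Phi_{abb} \\ 16\Phi_{aabb} \\ 4\Phi_{aa,bb} \\ 16\Phi_{ab,ab} \end{bmatrix}. \tag{6}$$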
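To make the linear transformation in (6) concrete, the following sketch solves the 9 × 9 system exactly with Gauss-Jordan elimination over rational numbers. The function name `condensed_identity` and its argument names are hypothetical; the input values in the usage line are those of full siblings with unrelated, non-inbred parents, for which the expected result is Δ7 = 1/4, Δ8 = 1/2, Δ9 = 1/4 and all other coefficients 0.

```python
from fractions import Fraction

# Coefficient matrix of (6); row i multiplies (Delta_1, ..., Delta_9).
M = [
    [1, 1, 1, 1, 1, 1, 1, 1, 1],
    [2, 2, 2, 2, 1, 1, 1, 1, 1],
    [2, 2, 1, 1, 2, 2, 1, 1, 1],
    [4, 0, 2, 0, 2, 0, 2, 1, 0],
    [8, 0, 4, 0, 2, 0, 2, 1, 0],
    [8, 0, 2, 0, 4, 0, 2, 1, 0],
    [16, 0, 4, 0, 4, 0, 2, 1, 0],
    [4, 4, 2, 2, 2, 2, 1, 1, 1],
    [16, 0, 4, 0, 4, 0, 4, 1, 0],
]


def condensed_identity(phi_aa, phi_bb, phi_ab, phi_aab, phi_abb,
                       phi_aabb, phi_aa_bb, phi_ab_ab):
    """Solve (6) for [Delta_1, ..., Delta_9] by exact Gauss-Jordan elimination."""
    rhs = [1, 2 * phi_aa, 2 * phi_bb, 4 * phi_ab, 8 * phi_aab,
           8 * phi_abb, 16 * phi_aabb, 4 * phi_aa_bb, 16 * phi_ab_ab]
    n = len(M)
    aug = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(M, rhs)]
    for col in range(n):
        # Find a row with a nonzero entry in this column and move it into place.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        # Eliminate the column from every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


deltas = condensed_identity(Fraction(1, 2), Fraction(1, 2), Fraction(1, 4),
                            Fraction(1, 8), Fraction(1, 8), Fraction(1, 16),
                            Fraction(1, 4), Fraction(3, 32))
print([str(d) for d in deltas])
```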

Figure 1. The 15 possible identity states for individuals a and b, grouped by their 9 condensed states. Lines indicate alleles that are IBD.

In our work, we focus on deriving the path-counting formulas for the generalized kinship coefficients, including Φabc, Φabcd, and Φab,cd.

2.3. Terms Defined for Path-Counting Formulas for Three and Four Individuals

(1) Triple-Common Ancestor. Given three individuals a, b, and c, if A is a common ancestor of the three individuals, then we call A a triple-common ancestor of a, b, and c.

(2) Quad-Common Ancestor. Given four individuals a, b, c, and d, if A is a common ancestor of the four individuals, then we call A a quad-common ancestor of a, b, c, and d.

(3) P(A, a). It denotes the set of all possible paths from A to a, where the paths can only traverse edges in the direction of parent to child such that P(A, a) ≠ NULL if and only if A is an ancestor of a. P Aa denotes a particular path from A to a, where P Aa ∈ P(A, a).

(4) Path-Pair. It consists of two paths, denoted as 〈P Aa, P Ab〉, where P Aa ∈ P(A, a) and P Ab ∈ P(A, b).

(5) Nonoverlapping Path-Pair. Given a path-pair 〈P Aa, P Ab〉, it is nonoverlapping if and only if the two paths share no common individuals, except A.

(6) Path-Triple. It consists of three paths, denoted as 〈P Aa, P Ab, P Ac〉, where P Aa ∈ P(A, a), P Ab ∈ P(A, b), and P Ac ∈ P(A, c).

(7) Path-Quad. It consists of four paths, denoted as 〈P Aa, P Ab, P Ac, P Ad〉, where P Aa ∈ P(A, a), P Ab ∈ P(A, b), P Ac ∈ P(A, c), and P Ad ∈ P(A, d).

(8) Bi_C(P Aa, P Ab). It denotes all common individuals shared between P Aa and P Ab, except A.

(9) Tri_C(P Aa, P Ab, P Ac). It denotes all common individuals shared among P Aa, P Ab, and P Ac, except A.

(10) Quad_C(P Aa, P Ab, P Ac, P Ad). It denotes all common individuals shared among P Aa, P Ab, P Ac, and P Ad, except A.

(11) Crossover and 2-Overlap Individual. If s ∈ Bi_C(P Aa, P Ab), we call s a crossover individual with respect to P Aa and P Ab if the two paths pass through different parents of s. On the other hand, if P Aa and P Ab pass through the same parent of s, then we call s a 2-overlap individual with respect to P Aa and P Ab.

(12) 3-Overlap Individual. If s ∈ Tri_C(P Aa, P Ab, P Ac) and the three paths P Aa, P Ab, and P Ac pass through the same parent of s, then we call s a 3-overlap individual with respect to P Aa, P Ab, and P Ac.

(13) 2-Overlap Path. If s is a 2-overlap individual with respect to P Aa and P Ab, then both P Aa and P Ab pass through the same parent of s, denoted by p, and the edge from p to s is called an overlap edge. All consecutive overlap edges constitute a path and this path is called a 2-overlap path. If the 2-overlap path extends all the way to the ancestor A, we call it a root 2-overlap path.

(14) 3-Overlap Path. It consists of all 3-overlap individuals in a consecutive order. If the 3-overlap path extends all the way to the root A, we call it a root 3-overlap path.

Example 1 . —

Consider the path-pairs from A to a and b in Figure 2, where A is a common ancestor of a and b. For path-pair1, Bi_C(P Aa, P Ab) = {s, e, t}, and A → s → e → t is a root 2-overlap path with respect to P Aa and P Ab. For path-pair4, Bi_C(P Aa, P Ab) = {e, t}, where e is a crossover individual, t is a 2-overlap individual with respect to P Aa and P Ab, and e → t is a 2-overlap path with respect to P Aa and P Ab.
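The definitions of Bi_C, crossover individual, and 2-overlap individual translate directly into code once a path is stored as a list of individuals starting at A. The sketch below is illustrative only: the function names are hypothetical, and because Figure 2 is not reproduced here, the example paths are the ones written out in (9) and (11) of Section 3.1.1.

```python
def shared_individuals(path_a, path_b):
    """Bi_C: individuals on both paths, excluding the common ancestor A."""
    root = path_a[0]
    on_b = set(path_b)
    return [x for x in path_a if x in on_b and x != root]


def classify(path_a, path_b):
    """Label every shared individual as 'crossover' or '2-overlap'."""
    labels = {}
    for s in shared_individuals(path_a, path_b):
        # The parent of s on a path is the individual just before s on that path.
        parent_a = path_a[path_a.index(s) - 1]
        parent_b = path_b[path_b.index(s) - 1]
        labels[s] = "2-overlap" if parent_a == parent_b else "crossover"
    return labels


# Paths as written in (9): s is reached through A on both paths, t through e and f.
print(classify(["A", "s", "e", "t", "a"], ["A", "s", "f", "t", "b"]))
# {'s': '2-overlap', 't': 'crossover'}

# Paths as written in (11): e is reached through c and s, t through e on both paths.
print(classify(["A", "c", "e", "t", "a"], ["A", "s", "e", "t", "b"]))
# {'e': 'crossover', 't': '2-overlap'}
```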

Figure 2. Examples of path-pairs and path-triples.

Example 2 . —

There are four path-quads listed in Figure 3, from A to four individuals a, b, c, and d, where A is a quad-common ancestor of the four individuals. For path-quad2, considering the paths P Aa and P Ab, the path A → t → f → s is a root 2-overlap path; {t, f, s} are 2-overlap individuals with respect to P Aa and P Ab. For path-quad3, {t, f, s} are 3-overlap individuals with respect to P Aa, P Ab, and P Ac, and the path A → t → f → s is a root 3-overlap path.

Figure 3. Examples of path-quads.

Table 1 summarizes the conceptual terms used in the path-counting formulas for two, three, and four individuals. At the level of terminology, it shows how our framework generalizes Wright's formula to three and four individuals.

Table 1. The conceptual terms used for two, three, and four individuals.

Two individuals | Three individuals | Four individuals
Common ancestor | Triple-common ancestor | Quad-common ancestor
Path-pair | Path-triple | Path-quad
Bi_C(P Aa, P Ab) | Tri_C(P Aa, P Ab, P Ac) | Quad_C(P Aa, P Ab, P Ac, P Ad)
N/A | 2-Overlap individual | 3-Overlap individual
N/A | 2-Overlap path | 3-Overlap path
N/A | Root 2-overlap path | Root 3-overlap path
N/A | Crossover individual | Crossover individual

2.4. An Overview of Path-Counting Formula Derivation

According to Wright's path-counting formula [16] (see (2)) for two individuals a and b, the path-counting approach requires identifying common ancestors of a and b and calculating the contribution of each common ancestor to Φab. More specifically, for each common ancestor, denoted as A, we obtain all path-pairs from A to a and b and identify acceptable path-pairs. For Φab, an acceptable path-pair 〈P Aa, P Ab〉 is a nonoverlapping path-pair where the two paths share no common individuals, except A. In Figure 2, path-pair2 is an acceptable path-pair, while path-pair1, path-pair3, and path-pair4 are not acceptable path-pairs. The contribution of each common ancestor A to Φab is computed based on the inbreeding coefficient of A, modified by the length of each acceptable path-pair.

To compute Φabc, the path-counting approach requires identifying all triple-common ancestors of a, b, and c and summing up all triple-common ancestors' contributions to Φabc. For each triple-common ancestor, denoted as A, we first identify all path-triples each of which consists of three paths from A to a, b, and c, respectively. Some examples of path-triples are presented in Figure 2.
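The first step, identifying the triple-common ancestors, amounts to intersecting ancestor sets. The sketch below assumes a hypothetical pedigree `PARENTS` in which A has children s and c, and s has children a and b; the names are invented for illustration, and each individual is counted among its own ancestors so that an individual who is itself an ancestor of the others is not missed.

```python
# Hypothetical pedigree (not from the paper): individual -> list of parents.
PARENTS = {
    "A": [], "u": [], "v": [], "w": [],
    "s": ["A", "u"],
    "c": ["A", "v"],
    "a": ["s", "w"],
    "b": ["s", "w"],
}


def ancestors(x):
    """Return x together with all of its ancestors (x is included by assumption)."""
    found = {x}
    for parent in PARENTS[x]:
        found |= ancestors(parent)
    return found


def common_ancestors(*individuals):
    """Individuals that are ancestors of every argument."""
    return set.intersection(*(ancestors(x) for x in individuals))


print(sorted(common_ancestors("a", "b", "c")))  # ['A']
print(sorted(common_ancestors("a", "b")))       # ['A', 's', 'u', 'w']
```

The same function called with four arguments returns the quad-common ancestors.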

For Φab, only nonoverlapping path-pairs are acceptable. A path-triple 〈P Aa, P Ab, P Ac〉 consists of three path-pairs 〈P Aa, P Ab〉, 〈P Aa, P Ac〉, and 〈P Ab, P Ac〉. For Φabc, a path-triple might be acceptable even though either 2-overlap individuals or crossover individuals exist between a path-pair. The main challenge we need to address is finding necessary and sufficient conditions for acceptable path-triples.

Aiming at solving the problem of identifying acceptable path-triples, we first use a systematic method to generate all possible cases for a path-pair by considering different types of common individuals shared between the two paths. Then, we introduce building blocks which are connected graphs with conditions on every edge in the graph that encapsulates a set of acceptable cases of path-pairs. In each building block, we represent paths as nodes and interactions (i.e., shared common individuals between two paths) as edges. There are at least two paths in a building block. For each building block, we obtain all acceptable cases for concerned path-pairs. Given a path-triple, it can be decomposed to one or multiple building blocks. Considering a shared path-pair between two building blocks, we use the natural join operator from relational algebra to match the acceptable cases for the shared path-pair between two building blocks. In other words, considering the acceptable cases for building blocks as inputs, we use the natural join operator to construct all acceptable cases for a path-triple. Acceptable cases for a path-triple are identified and then used in deriving the path-counting formula for Φabc.

Then, we summarize all the main procedures used for deriving the path-counting formula for Φabc in a flowchart shown in Figure 4. The main procedures are also applicable for deriving the path-counting formulas for Φabcd and Φab,cd.

Figure 4. A flowchart for path-counting formula derivation.

3. Results and Discussion

3.1. Path-Counting Formulas for Three Individuals

We first introduce a systematic method to generate all possible cases for a path-pair. Then we discuss building blocks for path-triples and identify all acceptable cases which are used in deriving the path-counting formula for Φabc.

3.1.1. Cases for a Path-Pair

Given a path-pair 〈P Aa, P Ab〉 with Bi_C(P Aa, P Ab) ≠ NULL, where A is a common ancestor of a and b and Bi_C(P Aa, P Ab) consists of all common individuals shared between P Aa and P Ab, except A, we introduce three patterns (i.e., crossover, 2-overlap, and root 2-overlap) to generate all possible cases for 〈P Aa, P Ab〉.

  1. X(P Aa, P Ab): P Aa and P Ab share one or multiple crossover individuals.

  2. T(P Aa, P Ab): P Aa and P Ab are root 2-overlapping from A, and the root 2-overlap path can have one or multiple 2-overlap individuals.

  3. Y(P Aa, P Ab): P Aa and P Ab are overlapping but not from A, and the 2-overlap path can have one or multiple 2-overlap individuals.

Based on the three patterns, X(P Aa, P Ab), T(P Aa, P Ab), and Y(P Aa, P Ab), we use regular expressions to generate all possible cases for the path-pair 〈P Aa, P Ab〉. For convenience, we drop 〈P Aa, P Ab〉 and use X, T, and Y instead of patterns X(P Aa, P Ab), T(P Aa, P Ab), and Y(P Aa, P Ab), whenever there is no confusion. When Bi_C(P Aa, P Ab) ≠ NULL, the eight cases shown in (7) cover all possible cases for 〈P Aa, P Ab〉. The completeness of eight cases shown in (7) for 〈P Aa, P Ab〉 can be proved by induction on the total number of T, X, and Y appearing in 〈P Aa, P Ab〉. Using the pedigree in Figure 2, Cases 1–3 and Case 6 are illustrated in (8), (9), (10), and (11):

[Equation (7): the eight regular-expression cases for 〈P Aa, P Ab〉; equation image not reproduced.] (7)
[Equation (8): illustration of the pattern T; equation image not reproduced.] (8)

where {s, e, t} are 2-overlap individuals and the overlap path is a root 2-overlap path:

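$$\left.\begin{array}{l} A \to s \to e \to t \to a \\ A \to s \to f \to t \to b \end{array}\right\} \Rightarrow TX, \tag{9}$$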

where s is a 2-overlap individual and the overlap path is a root 2-overlap path; t is a crossover individual:

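$$\left.\begin{array}{l} A \to s \to e \to t \to a \\ A \to d \to f \to t \to b \end{array}\right\} \Rightarrow X, \tag{10}$$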

where t is a crossover individual:

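$$\left.\begin{array}{l} A \to c \to e \to t \to a \\ A \to s \to e \to t \to b \end{array}\right\} \Rightarrow XY, \tag{11}$$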

where e is a crossover individual; t is a 2-overlap individual and the overlap path is a 2-overlap path.
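The mapping from a path-pair to its pattern string can be sketched as follows. This is an assumption-laden illustration, not the authors' procedure: the function name `pattern` is hypothetical, each path is a list of individuals starting at A, and consecutive occurrences of the same symbol are collapsed because each of X, T, and Y stands for "one or multiple" individuals.

```python
def pattern(path_a, path_b):
    """Return the T/X/Y pattern string of a path-pair (paths start at A)."""
    root = path_a[0]
    on_b = set(path_b)
    # Individuals reached from A by an unbroken run of overlap edges.
    root_overlap = {root}
    symbols = []
    for s in path_a[1:]:
        if s not in on_b:
            continue
        parent_a = path_a[path_a.index(s) - 1]
        parent_b = path_b[path_b.index(s) - 1]
        if parent_a != parent_b:
            symbol = "X"                # crossover individual
        elif parent_a in root_overlap:
            root_overlap.add(s)
            symbol = "T"                # overlap edge on a root 2-overlap path
        else:
            symbol = "Y"                # overlap edge that does not start at A
        if not symbols or symbols[-1] != symbol:
            symbols.append(symbol)
    return "".join(symbols)


print(pattern(["A", "s", "e", "t", "a"], ["A", "s", "f", "t", "b"]))  # TX, as in (9)
print(pattern(["A", "s", "e", "t", "a"], ["A", "d", "f", "t", "b"]))  # X, as in (10)
print(pattern(["A", "c", "e", "t", "a"], ["A", "s", "e", "t", "b"]))  # XY, as in (11)
```

A nonoverlapping path-pair yields the empty string, which corresponds to Bi_C(P Aa, P Ab) being empty.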

3.1.2. Path-Pair Level Graphical Representation of a Path-Triple

Given a path-triple 〈P Aa, P Ab, P Ac〉, we represent each path as a node. The path-triple can be decomposed to three path-pairs (i.e., 〈P Aa, P Ab〉, 〈P Aa, P Ac〉, and 〈P Ab, P Ac〉). For each path-pair, if the two paths share at least one common individual (i.e., either 2-overlap individual or crossover individual), except A, then there is an edge between the two nodes representing the two paths. Therefore, we obtain four different scenarios S 0–S 3, shown in Figure 5.

Figure 5. A path-pair level graphical representation of 〈P Aa, P Ab, P Ac〉.

In Figure 5, the scenario S 0 has no edges, which means that 〈P Aa, P Ab, P Ac〉 consists of three independent paths. In Figure 2, path-triple1 is an example of S 0. Next, we introduce a lemma which can assist with identifying the options for the edges in the scenarios S 1–S 3.

Lemma 3 . —

Given a path-triple 〈P Aa, P Ab, P Ac〉, consider the three path-pairs 〈P Aa, P Ab〉, 〈P Aa, P Ac〉, and 〈P Ab, P Ac〉. If any of the three path-pairs contains a 2-overlap edge that is represented by Y in its regular expression representation, then the path-triple 〈P Aa, P Ab, P Ac〉 has no contribution to Φabc.

Proof —

In [17], Nadot and Vaysseix proposed, from a genetic and biological point of view, that Φabc can be evaluated by enumerating all eligible inheritance paths at allele-level starting from a triple common ancestor A to the three individuals a, b, and c.

For the pedigree in Figure 6, let us consider the path-triple 〈P Aa, P Ab, P Ac〉 listed as follows. P Aa: A → a; P Ab: A → p 3 → p 6 → p 7 → b; P Ac: A → p 4 → p 6 → p 7 → c.

For 〈P Ab, P Ac〉, p 6 is a crossover individual, p 7 is a 2-overlap individual, and p 6 → p 7 is a 2-overlap edge represented by Y in regular expression representation (see the definition for Y in Section 3.1.1).

For the individual p 6, let us denote the two alleles at one fixed autosomal locus as g 1 and g 2. At allele-level, only one allele can be passed down from p 6 to p 7. Since p 3 and p 4 are parents of p 6, g 1 is passed down from one parent, and g 2 is passed down from the other parent. It is infeasible to pass down both g 1 and g 2 from p 6 to p 7. In other words, there are no corresponding inheritance paths for the path-triple 〈P Aa, P Ab, P Ac〉 with a 2-overlap edge between 〈P Ab, P Ac〉 (i.e., Case 6: XY). Therefore, path-triples of this kind make no contribution to Φabc.

Figure 6. Examples of pedigree and inheritance paths.

Figure 6(b) shows one example of eligible inheritance paths corresponding to a pedigree graph. Each individual is represented by two allele nodes. The eligible inheritance paths in Figure 6(b) consist of red edges only.

Only Case 1, Case 2, and Case 3 do not have Y in the regular expression representation of a path-pair (see (7)). Considering the scenarios S 1–S 3 shown in Figure 5, an edge can therefore have three options: {Case 1: T; Case 2: X; Case 3: TX}.

3.1.3. Constructing Cases for a Path-Triple

For the scenarios S 1–S 3 in Figure 5, we define two building blocks {B 1, B 2} along with some rules in Figure 7 to generate acceptable cases. For B 1, the edge can have three options {Case 1: T; Case 2: X; Case 3: TX}. For B 2, we cannot allow both edges to be root overlap, because if two edges are root overlap, then P Aa and P Ac must share at least one common individual, except A, which contradicts the fact that P Aa and P Ac have no edge.

Figure 7. Building blocks {B 1, B 2} and basic rules.

Next, we focus on generating all acceptable cases for the scenarios S 1–S 3 in Figure 5, where only S 3 contains more than one building block. In order to leverage the dependency among building blocks, we decompose S 3 to S 3 = {u 1 = B 2, u 2 = B 2, u 3 = B 2}, shown in Figure 8. For each u i, we have a set of acceptable path-triples, denoted as R i.

Figure 8. A graphical illustration for obtaining T 3.

Considering the dependency among {R 1, R 2, R 3}, we use the natural join operator, denoted as ⋈, operating on {R 1, R 2, R 3} to generate all acceptable cases for S 3. As a result, we obtain T 3 = R 1 ⋈ R 2 ⋈ R 3, where T 3 denotes the acceptable cases of the path-triple 〈P Aa, P Ab, P Ac〉 in the scenario S 3.
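The natural join construction can be illustrated with a small sketch. It models only what is stated in the text: every edge takes one of the options {T, X, TX}, and in a building block B 2 the two edges may not both be root overlap. Any further rules in Figure 7 are not modeled, and the edge labels `ab`, `ac`, `bc` and the function names are hypothetical.

```python
from itertools import product

OPTIONS = ("T", "X", "TX")  # edge options: Case 1, Case 2, Case 3


def b2_cases(edge1, edge2):
    """Acceptable cases of a B2 block: the two edges are not both root overlap."""
    return [{edge1: o1, edge2: o2}
            for o1, o2 in product(OPTIONS, repeat=2)
            if not ("T" in o1 and "T" in o2)]


def natural_join(r, s):
    """Natural join of two relations stored as lists of dict rows."""
    joined = []
    for row_r in r:
        for row_s in s:
            shared = row_r.keys() & row_s.keys()
            if all(row_r[k] == row_s[k] for k in shared):
                merged = {**row_r, **row_s}
                if merged not in joined:
                    joined.append(merged)
    return joined


# S3 is a triangle on the path-pairs ab, ac, bc, decomposed into three B2 blocks.
R1 = b2_cases("ab", "ac")
R2 = b2_cases("ab", "bc")
R3 = b2_cases("ac", "bc")
T3 = natural_join(natural_join(R1, R2), R3)
print(len(T3))  # 7
```

Under these assumptions the seven rows of `T3` are exactly the assignments in which at most one of the three edges contains T.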

For each scenario in Figure 5, we generate all acceptable cases for 〈P Aa, P Ab, P Ac〉. The scenario S 0 has no edges, which means that 〈P Aa, P Ab, P Ac〉 consists of three independent paths, while, for the other scenarios S k (k = 1, 2, 3), the k edges can have two options:

  1. all k edges belong to crossover; or

  2. one edge belongs to root 2-overlap; the remaining (k − 1) edges belong to crossover.

In summary, an acceptable path-triple can have at most one root 2-overlap path and any number of crossover individuals, but no 2-overlap path that does not start at A.

3.1.4. Splitting Operator

Considering the existence of root 2-overlap path and crossover in acceptable path-triples, we propose a splitting operator to transform a path-triple with crossover individuals to a noncrossover path-triple without changing the contribution from this path-triple to Φabc. The main purpose of using the splitting operator is to simplify the path-counting formula derivation process. We first use an example in Figure 9 to illustrate how the splitting operator works. In Figure 9, there is a crossover individual s between P Aa and P Ab in the path triple 〈P Aa, P Ab, P Ac〉 in G k+1. The splitting operator proceeds as follows:

  1. split the node s to two nodes, s 1 and s 2;

  2. transform the edges s → a' and s → b' to s 1 → a' and s 2 → b', respectively;

  3. add two new edges, s 2 → a' and s 1 → b'.

Figure 9. Transforming pedigree graph G k+1 having k + 1 crossovers to G k having k crossovers.

Lemma 4 . —

Given a pedigree graph G k+1 having (k + 1) crossover individuals regarding 〈P Aa, P Ab, P Ac〉 shown in Figure 9, let s denote the lowest crossover individual, where no descendant of s can be a crossover individual among the three paths P Aa, P Ab, and P Ac. After the splitting operator is applied to the lowest crossover individual s in G k+1, the number of crossover individuals is decreased by 1.

Proof —

The splitting operator only affects the edges from s to a' and b'. If a new crossover individual appears, it can only be a' or b'. Assume that b' becomes a crossover individual; this means that b' is able to reach a and b through two separate paths, which contradicts the fact that s is the lowest crossover individual between P Aa and P Ab.

Next, we introduce a canonical graph which results from applying the splitting operator for all crossover individuals. The canonical graph has zero crossover individual.

Definition 5 (Canonical Graph). —

Given a pedigree graph G having one or more crossover individuals regarding Φabc, if there exists a graph G' which has no crossover individuals with regard to Φabc such that

  1. any acceptable path-triple in G has a corresponding acceptable path-triple in G' with the same contribution to Φabc;

  2. any acceptable path-triple in G' has a corresponding acceptable path-triple in G with the same contribution to Φabc,

then we call G' a canonical graph of G regarding Φabc.

Lemma 6 . —

For a pedigree graph G having one or more crossover individuals regarding 〈P Aa, P Ab, P Ac〉, there exists a canonical graph G' for G.

Proof —

The proof is by induction on the number of crossover individuals.

Induction hypothesis: assume that if G has k or fewer crossovers, there is a canonical graph G' for G.

In the induction step, let G k+1 be a graph with k + 1 crossovers; let s be the lowest crossover between paths P Aa and P Ab in G k+1. We apply the splitting operator on s in G k+1 and obtain G k having k crossovers by Lemma 4.

3.1.5. Path-Counting Formula for Φabc

Now, we present the path-counting formula for Φabc:

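$$\Phi_{abc} = \sum_{A} \left( \sum_{\text{Type 1}} \left(\tfrac{1}{2}\right)^{L_{\text{triple}}} \Phi_{AAA} + \sum_{\text{Type 2}} \left(\tfrac{1}{2}\right)^{L_{\text{triple}}+1} \Phi_{AA} \right), \tag{12}$$

where ΦAA = (1/2)(1 + F A); ΦAAA = (1/4)(1 + 3F A); F A is the inbreeding coefficient of A; A is a triple-common ancestor of a, b, and c; Type 1 denotes path-triples 〈P Aa, P Ab, P Ac〉 with zero root 2-overlap paths; Type 2 denotes path-triples 〈P Aa, P Ab, P Ac〉 with one root 2-overlap path P As ending at the individual s;

$$L_{\text{triple}} = \begin{cases} L_{P_{Aa}} + L_{P_{Ab}} + L_{P_{Ac}} & \text{for Type 1}, \\ L_{P_{Aa}} + L_{P_{Ab}} + L_{P_{Ac}} - L_{P_{As}} & \text{for Type 2}, \end{cases} \tag{13}$$

and L PAa is the length of the path P Aa (likewise for P Ab, P Ac, and P As).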
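A small numerical sketch may help in reading (12) and (13). It evaluates the contribution of a single acceptable path-triple; the function name `triple_contribution` and its parameters are hypothetical, and the two example pedigrees (three half-siblings sharing only the non-inbred parent A, and the paths A → s → a, A → s → b, A → c with all other parents unrelated founders) are invented for illustration.

```python
from fractions import Fraction


def phi_AA(F_A):
    return Fraction(1, 2) * (1 + F_A)


def phi_AAA(F_A):
    return Fraction(1, 4) * (1 + 3 * F_A)


def triple_contribution(len_a, len_b, len_c, F_A=Fraction(0), root_overlap_len=None):
    """Contribution of one acceptable path-triple to Phi_abc, following (12)-(13).

    root_overlap_len is None for Type 1 (no root 2-overlap path); otherwise it
    is the length of the root 2-overlap path P_As (Type 2).
    """
    total = len_a + len_b + len_c
    if root_overlap_len is None:
        return Fraction(1, 2) ** total * phi_AAA(F_A)
    l_triple = total - root_overlap_len
    return Fraction(1, 2) ** (l_triple + 1) * phi_AA(F_A)


# Type 1: a, b, c are children of A by three unrelated partners.
print(triple_contribution(1, 1, 1))                      # 1/32
# Type 2: A -> s -> a, A -> s -> b, A -> c; root 2-overlap path A -> s of length 1.
print(triple_contribution(2, 2, 1, root_overlap_len=1))  # 1/64
```

Both values agree with what the recursive definition gives for these two pedigrees when each is worked through by hand.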

For completeness, the path-counting formula for Φaab is given in Appendix A; and the correctness proof of the path-counting formula is given in Appendix B.

3.2. Path-Counting Formulas for Four Individuals

3.2.1. Path-Pair Level Graphical Representation of 〈P Aa, P Ab, P Ac, P Ad〉

Given a path-quad 〈P Aa, P Ab, P Ac, P Ad〉 with Quad_C(P Aa, P Ab, P Ac, P Ad) = ∅, the path-quad can have 11 scenarios S 0–S 10, shown in Figure 10, where all four paths are considered symmetrically.

Figure 10. A path-pair level graphical representation of 〈P Aa, P Ab, P Ac, P Ad〉.

In Figure 11, we introduce three building blocks {B 1, B 2, B 3}. For B 1 and B 2, the rules presented in Figure 7 are also applicable for Figure 11. For B 3, we only consider root overlap, because the crossover individuals can be eliminated by using the splitting operator introduced in Section 3.1.4. Note that for B 3, if Tri_C(P Aa, P Ab, P Ac) = ∅, then it is equivalent to the scenario S 3 in Figure 8. Therefore, we only need to consider B 3 when Tri_C(P Aa, P Ab, P Ac) ≠ ∅.

Figure 11. Building blocks for all scenarios of 〈P Aa, P Ab, P Ac, P Ad〉.

3.2.2. Building Block-Based Cases Construction for 〈P Aa, P Ab, P Ac, P Ad〉

For a scenario S i  (0 ≤ i ≤ 10) in Figure 11, we first decompose S i to one or multiple building blocks. For a scenario S i ∈ {S 1, S 3}, it has only one building block, and all acceptable cases can be obtained directly. For S 2 = {u 1 = B 1, u 2 = B 1}, there is no need to consider the conflict between the edges in u 1 and u 2 because u 1 and u 2 are disconnected. Let R i denote all acceptable cases of the path-pairs in u i, and let T i denote all acceptable cases for S i. Therefore, we obtain T 2 = R 1 × R 2 where × denotes the Cartesian product operator from relational algebra.
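For the disconnected scenario S 2, the Cartesian product can be sketched as below. The sketch assumes that each B 1 edge takes one of the three options {T, X, TX} used for path-triples and that the two edges of S 2 connect the path-pairs labeled here `ab` and `cd`; both the labels and this choice of edges are hypothetical.

```python
from itertools import product

OPTIONS = ("T", "X", "TX")  # assumed edge options for a B1 building block

# One relation per disconnected building block of S2.
R1 = [{"ab": option} for option in OPTIONS]
R2 = [{"cd": option} for option in OPTIONS]

# T2 = R1 x R2: the blocks share no path-pair, so every combination is kept.
T2 = [{**r1, **r2} for r1, r2 in product(R1, R2)]
print(len(T2))  # 9
```

Because the two blocks have no attribute in common, a natural join of `R1` and `R2` would return the same nine rows.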

For S 6 = {u 1 = B 3}, we obtain T 6 = R 1. For S i ∈ {S i∣4 ≤ i ≤ 10  and  i ≠ 6}, we define the largest subgraph of S i based on which we construct T i.

Definition 7 (Largest Subgraph). —

Given a scenario S i  (4 ≤ i ≤ 10 and i ≠ 6), the largest subgraph of S i, denoted as S j, is defined as follows:

  1. S j is a proper subgraph of S i;

  2. if S i contains B 3, then S j must also contain B 3;

  3. no such S k exists that S j is a proper subgraph of S k while S k is also a proper subgraph of S i.

For each scenario S i  (4 ≤ i ≤ 10 and i ≠ 6), we list the largest subgraph of S i, denoted as S j, in Table 2.

Table 2. Largest subgraph of a scenario S i (4 ≤ i ≤ 10 and i ≠ 6).

S i | S 4 | S 5 | S 7 | S 8 | S 9 | S 10
S j | S 3 | S 3 | S 6 | S 5 | S 7 | S 9

For a scenario S i (4 ≤ i ≤ 10 and i ≠ 6), let Diff(S i, S j) denote the set of building blocks in S i but not in S j, where S j is the largest subgraph of S i. Let |E i| and |E j| denote the number of edges in S i and S j, respectively. According to Table 2, we can conclude that |E i| − |E j| = 1. In order to leverage the dependency among building blocks, we consider only B 2 in Diff(S i, S j). For example, Diff(S 5, S 3) = {B 2}. Let T 3 denote all acceptable cases for S 3, and let R 1 denote the set of acceptable cases for Diff(S 5, S 3). Then, we can use S 3 and Diff(S 5, S 3) to construct all acceptable cases for S 5. We apply the same idea to construct all acceptable cases for each S i in Table 2.

Given a path-quad 〈P Aa, P Ab, P Ac, P Ad〉, an acceptable case has the following properties:

  1. if there is one root 3-overlap path, there can be at most one root 2-overlap path;

  2. otherwise, there can be at most two root 2-overlap paths.

3.2.3. Path-Counting Formula for Φabcd

Now, we present the path-counting formula for Φabcd as follows:

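$$\Phi_{abcd} = \sum_{A} \left( \sum_{\text{Type 1}} \left(\tfrac{1}{2}\right)^{L_{\text{quad}}} \Phi_{AAAA} + \sum_{\text{Type 2}} \left(\tfrac{1}{2}\right)^{L_{\text{quad}}+1} \Phi_{AAA} + \sum_{\text{Type 3}} \left(\tfrac{1}{2}\right)^{L_{\text{quad}}+2} \Phi_{AA} \right), \tag{14}$$

where ΦAA = (1/2)(1 + F A); ΦAAA = (1/4)(1 + 3F A); ΦAAAA = (1/8)(1 + 7F A); F A is the inbreeding coefficient of A; A is a quad-common ancestor of a, b, c, and d; Type 1 denotes zero root 2-overlap and zero root 3-overlap paths; Type 2 denotes one root 2-overlap path P As ending at s; and Type 3 covers three cases:

  Case 1: two root 2-overlap paths P As1 and P As2 ending at s 1 and s 2, respectively;

  Case 2: one root 3-overlap path P At ending at t;

  Case 3: one root 2-overlap path P As and one root 3-overlap path P At ending at s and t, respectively;

$$L_{\text{quad}} = \begin{cases}
L_{P_{Aa}} + L_{P_{Ab}} + L_{P_{Ac}} + L_{P_{Ad}} & \text{for Type 1}, \\
L_{P_{Aa}} + L_{P_{Ab}} + L_{P_{Ac}} + L_{P_{Ad}} - L_{P_{As}} & \text{for Type 2}, \\
L_{P_{Aa}} + L_{P_{Ab}} + L_{P_{Ac}} + L_{P_{Ad}} - L_{P_{As_1}} - L_{P_{As_2}} & \text{for Case 1, Type 3}, \\
L_{P_{Aa}} + L_{P_{Ab}} + L_{P_{Ac}} + L_{P_{Ad}} - 2L_{P_{At}} & \text{for Case 2, Type 3}, \\
L_{P_{Aa}} + L_{P_{Ab}} + L_{P_{Ac}} + L_{P_{Ad}} - L_{P_{At}} - L_{P_{As}} & \text{for Case 3, Type 3},
\end{cases} \tag{15}$$

and L PAa is the length of the path P Aa (likewise for P Ab, P Ac, P Ad, etc.).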

For completeness, the path-counting formulas for Φaabc and Φaaab are presented in Appendix A. The correctness of the path-counting formula for four individuals is proven in Appendix C.
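To make the bookkeeping in (14) and (15) concrete, the following is a minimal sketch of how the contribution of a single path-quad could be evaluated once its type and path lengths are known. The function names, the string labels for the cases, and the argument layout are illustrative assumptions; classifying a path-quad into a type is not shown.

```python
from fractions import Fraction

def phi_self(order, F_A=Fraction(0)):
    """Phi_AA, Phi_AAA, Phi_AAAA (order 2, 3, 4) from the inbreeding coefficient F_A."""
    denom, mult = {2: (2, 1), 3: (4, 3), 4: (8, 7)}[order]
    return Fraction(1, denom) * (1 + mult * F_A)

def quad_contribution(path_lengths, case, overlap_lengths=(), F_A=Fraction(0)):
    """Contribution of one path-quad to Phi_abcd, following (14) and (15).

    path_lengths:    lengths of P_Aa, P_Ab, P_Ac, P_Ad
    case:            hypothetical label, one of
                     "type1", "type2", "type3_case1", "type3_case2", "type3_case3"
    overlap_lengths: lengths of the root overlap paths the case refers to
    """
    total = sum(path_lengths)
    if case == "type1":
        L, extra, order = total, 0, 4
    elif case == "type2":
        (L_s,) = overlap_lengths
        L, extra, order = total - L_s, 1, 3
    elif case == "type3_case1":
        L_s1, L_s2 = overlap_lengths
        L, extra, order = total - L_s1 - L_s2, 2, 2
    elif case == "type3_case2":
        (L_t,) = overlap_lengths
        L, extra, order = total - 2 * L_t, 2, 2
    elif case == "type3_case3":
        L_s, L_t = overlap_lengths
        L, extra, order = total - L_t - L_s, 2, 2
    else:
        raise ValueError(f"unknown case: {case}")
    return Fraction(1, 2) ** (L + extra) * phi_self(order, F_A)

# Four children of a non-inbred ancestor A, with no root overlap (Type 1).
example = quad_contribution((1, 1, 1, 1), "type1")
```

Summing such contributions over all acceptable path-quads and all quad-common ancestors A would give Φabcd as in (14).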

3.3. Path-Counting Formulas for Two Pairs of Individuals

3.3.1. Terminology and Definitions

(1) 2-Pair-Path-Pair. It consists of two path-pairs denoted as 〈(P Sa, P Sb), (P Tc, P Td)〉, where P Sa ∈ P(S, a), P Sb ∈ P(S, b), P Tc ∈ P(T, c), P Td ∈ P(T, d), S is a common ancestor of a and b, and T is a common ancestor of c and d. If A = S = T, then A is a quad-common ancestor of a, b, c, and d.

(2) Homo-Overlap and Heter-Overlap Individual. Given two pairs of individuals 〈a, b〉 and 〈c, d〉, if s ∈ Bi_C(P Aa, P Ab) (or s ∈ Bi_C(P Ac, P Ad)), we call s a homo-overlap individual when P Aa and P Ab (or P Ac and P Ad) pass through the same parent of s. If r ∈ Bi_C(P Ai, P Aj), where i ∈ {a, b} and j ∈ {c, d}, we call r a heter-overlap individual when P Ai and P Aj pass through the same parent of r.

(3) Root Homo-Overlap and Heter-Overlap Path. Given a 2-pair-path-pair 〈(P Aa, P Ab), (P Ac, P Ad)〉, if s is a homo-overlap individual and the homo-overlap path extends all the way to the quad-common ancestor A, then we call it a root homo-overlap path. If r is a heter-overlap individual and the heter-overlap path extends all the way to the quad-common ancestor A, then we call it a root heter-overlap path.

Example 8 . —

A is a quad-common ancestor of a, b, c, and d in Figure 12. For (a), s is a homo-overlap individual between P Aa and P Ab, and t is a homo-overlap individual between P Ac and P Ad; A → s and A → t are root homo-overlap paths. For (b), x is a heter-overlap individual between P Aa and P Ad, and y is a heter-overlap individual between P Ab and P Ac; A → x and A → y are root heter-overlap paths.

Figure 12. Examples of 2-pair-path-quads for Φab,cd.

3.3.2. Path-Counting Formula for Φab,cd

Now, we present a path-pair level graphical representation for 〈(P Aa, P Ab), (P Ac, P Ad)〉 shown in Figure 13. The options for an edge can be {T, X, TX}. (Refer to Section 3.1.1 for definitions of T, X, and TX). Based on the different types of 〈P Aa, P Ab, P Ac, P Ad〉 presented in (14), all cases for 〈(P Aa, P Ab), (P Ac, P Ad)〉 are summarized in Table 3, where h is the last individual of a root homo-overlap path P Ah (i.e., the path P Ah ending at h) and r 1 and r 2 are the last individuals of root heter-overlap paths P Ar1 and P Ar2, respectively.

Figure 13. Scenarios of 〈(P Aa, P Ab), (P Ac, P Ad)〉 at path-pair level.

Table 3.

A summary of all cases for 〈(P Aa, P Ab), (P Ac, P Ad)〉.

〈P Aa, P Ab, P Ac, P Ad〉 | 〈(P Aa, P Ab), (P Ac, P Ad)〉
Zero root 2-overlap and zero root 3-overlap | Zero root homo-overlap and zero root heter-overlap
One root 2-overlap path | One root homo-overlap and zero root heter-overlap
 | Zero root homo-overlap and one root heter-overlap
Two root 2-overlap paths | Two root homo-overlaps and zero root heter-overlap
 | Zero root homo-overlap and two root heter-overlaps
One root 3-overlap path | One root homo-overlap and two root heter-overlaps, and h = r 1 = r 2
One root 2-overlap and one root 3-overlap | One root homo-overlap and two root heter-overlaps, and r 1 = r 2 ≠ h
 | One root homo-overlap and two root heter-overlaps, and h = r 1 ≠ r 2

Given a pedigree graph having one or multiple progenitors {p i ∣ i > 0}, we define the generation of a progenitor p i to be 0, denoted as gen(p i) = 0. If an individual a has only one parent p, then we define gen(a) = gen(p) + 1. If an individual a has two parents f and m, we define gen(a) = MAX{gen(f), gen(m)} + 1.
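The definition of gen translates directly into a memoized recursion. The sketch below assumes the pedigree is given as a dictionary mapping each individual to a tuple of its known parents (an empty tuple for a progenitor); this representation and the name make_gen are illustrative choices.

```python
from functools import lru_cache

def make_gen(parents):
    """Return a function gen(a) for a pedigree given as {individual: (known parents...)}."""
    @lru_cache(maxsize=None)
    def gen(a):
        known = [p for p in parents.get(a, ()) if p is not None]
        if not known:
            # Progenitor: no recorded parents.
            return 0
        # One parent: gen(p) + 1; two parents: MAX{gen(f), gen(m)} + 1.
        return max(gen(p) for p in known) + 1
    return gen

# Hypothetical pedigree: p1 and p2 are progenitors, x is a child of p1,
# and y is a child of x and p2.
pedigree = {"p1": (), "p2": (), "x": ("p1",), "y": ("x", "p2")}
gen = make_gen(pedigree)
```

For very deep pedigrees an iterative traversal in topological order would avoid Python's recursion limit.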

The path-counting formula for Φab,cd is as follows:

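$$
\Phi_{ab,cd} = \sum_{A} \left( \sum_{\text{Type 1}} \left(\tfrac{1}{2}\right)^{L_{\text{2-pair}}} \Phi_{AAA} + \sum_{\text{Type 2}} \left(\tfrac{1}{2}\right)^{L_{\text{2-pair}}+1} \Phi_{AAA} + \sum_{\text{Type 3}} \left(\tfrac{1}{2}\right)^{L_{\text{2-pair}}+2} \Phi_{AA} + \sum_{\text{Type 4}} \left(\tfrac{1}{2}\right)^{L_{\text{2-pair}}+1} \Phi_{AA} \right) + \sum_{(S,T)} \sum_{\text{Type 5}} \left(\tfrac{1}{2}\right)^{L_{P_{Sa},P_{Sb}} + L_{P_{Tc},P_{Td}} + 1} \Phi_{BB}, \tag{16}
$$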

where A: a quad-common ancestor of a, b, c, and d, S: a common ancestor of a and b, and T: a common ancestor of c and d. For 〈(P Aa, P Ab), (P Ac, P Ad)〉  (S = T = A), there are four types (i.e., Type 1 to Type 4).

  • Type 1: zero root homo-overlap and zero root heter-overlap.

  • Type 2: zero root homo-overlap and one root heter-overlap P Ar ending at r.

  • Type 3: one of the following two cases:

$$
\text{Type 3}: \begin{cases}
\text{zero root homo-overlap and two root heter-overlaps } P_{Ar_1} \text{ and } P_{Ar_2} \text{ ending at } r_1 \text{ and } r_2, \text{ respectively,} \\
\text{one root homo-overlap } P_{Ah} \text{ ending at } h \text{ and two root heter-overlaps } P_{Ar_1} \text{ and } P_{Ar_2} \text{ ending at } r_1 \text{ and } r_2, \text{ and } r_1 \neq r_2.
\end{cases} \tag{17}
$$

  • Type 4: one root homo-overlap P Ah ending at h and two root heter-overlaps ending at r 1 and r 2, and h = r 1 = r 2.

For 〈(P Sa, P Sb), (P Tc, P Td)〉 (S ≠ T), there is one type (i.e., Type 5).

  • Type 5: (i) 〈P Sa, P Sb〉 has zero overlap individuals and 〈P Tc, P Td〉 has zero overlap individuals; (ii) at most one path-pair (either 〈P Sa, P Sb〉 or 〈P Tc, P Td〉) can have crossover individuals; (iii) between a path from 〈P Sa, P Sb〉 and a path from 〈P Tc, P Td〉, there are no overlap individuals, but there can be crossover individuals x, where x ≠ S and x ≠ T.

The ancestor B and the lengths are defined as follows:

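$$
B = \begin{cases}
S & \text{when } gen(S) < gen(T) \\
S & \text{when } gen(S) = gen(T) \text{ and } T \text{ has two parents} \\
T & \text{otherwise,}
\end{cases}
$$

$$
L_{\text{2-pair}} = \begin{cases}
L_{P_{Aa}} + L_{P_{Ab}} + L_{P_{Ac}} + L_{P_{Ad}} & \text{for Type 1} \\
L_{P_{Aa}} + L_{P_{Ab}} + L_{P_{Ac}} + L_{P_{Ad}} - L_{P_{Ar}} & \text{for Type 2} \\
L_{P_{Aa}} + L_{P_{Ab}} + L_{P_{Ac}} + L_{P_{Ad}} - L_{P_{Ar_1}} - L_{P_{Ar_2}} & \text{for Type 3} \\
L_{P_{Aa}} + L_{P_{Ab}} + L_{P_{Ac}} + L_{P_{Ad}} - 2L_{P_{Ah}} & \text{for Type 4,}
\end{cases}
$$

$$
L_{P_{Sa},P_{Sb}} = L_{P_{Sa}} + L_{P_{Sb}} \ \text{for Type 5}, \qquad L_{P_{Tc},P_{Td}} = L_{P_{Tc}} + L_{P_{Td}} \ \text{for Type 5}. \tag{18}
$$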
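The case analysis in (18) can be written as two small helper functions. This is a sketch under stated assumptions: the generation values and parent counts are supplied as plain dictionaries, the path-pair type is passed as an integer 1 to 4, and the names choose_b and two_pair_length are hypothetical.

```python
def choose_b(S, T, gen, num_parents):
    """Select the ancestor B used in Phi_BB for a Type 5 term, following (18).

    gen:         {individual: generation}
    num_parents: {individual: number of recorded parents}
    """
    if gen[S] < gen[T]:
        return S
    if gen[S] == gen[T] and num_parents[T] == 2:
        return S
    return T

def two_pair_length(path_lengths, path_type, overlap_lengths=()):
    """L_2-pair for Types 1 to 4, following (18).

    path_lengths:    lengths of P_Aa, P_Ab, P_Ac, P_Ad
    overlap_lengths: (L_PAr,) for Type 2, (L_PAr1, L_PAr2) for Type 3,
                     (L_PAh,) for Type 4
    """
    total = sum(path_lengths)
    if path_type == 1:
        return total
    if path_type == 2:
        (l_r,) = overlap_lengths
        return total - l_r
    if path_type == 3:
        l_r1, l_r2 = overlap_lengths
        return total - l_r1 - l_r2
    if path_type == 4:
        (l_h,) = overlap_lengths
        return total - 2 * l_h
    raise ValueError(f"unknown type: {path_type}")
```

A Type 5 term of (16) would then use the exponent L PSa,PSb + L PTc,PTd + 1 together with Φ BB for B = choose_b(S, T, ...).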

Note that if 〈a, b〉 and 〈c, d〉 have zero quad-common ancestors, we have the following formula for Φab,cd:

$$
\Phi_{ab,cd} = \sum_{(S,T)} \sum_{\text{Type 6}} \left(\tfrac{1}{2}\right)^{L_{P_{Sa},P_{Sb}} + L_{P_{Tc},P_{Td}}} \Phi_{SS} \Phi_{TT}. \tag{19}
$$

Type 6: 〈P Sa, P Sb〉 is a nonoverlapping path-pair and 〈P Tc, P Td〉 is a nonoverlapping path-pair. Between a path from 〈P Sa, P Sb〉 and a path from 〈P Tc, P Td〉, there are no overlap individuals, but there can be crossover individuals.

L PSa,PSb and L PTc,PTd are defined as in Type 5.

The correctness of the path-counting formula for Φab,cd is proven in Appendix C. For completeness, please refer to [18] for the path-counting formulas for Φaa,bc, Φab,ac, Φab,ab, and Φaa,ab.
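As an illustration of (19), the sketch below evaluates a single Type 6 term from the four path lengths and the inbreeding coefficients of S and T, using ΦSS = (1/2)(1 + F S) and the analogous expression for T. The function name and argument order are assumptions made for this example.

```python
from fractions import Fraction

def type6_term(len_Sa, len_Sb, len_Tc, len_Td, F_S=Fraction(0), F_T=Fraction(0)):
    """One Type 6 term of (19): (1/2)^(L_PSa,PSb + L_PTc,PTd) * Phi_SS * Phi_TT."""
    phi_SS = Fraction(1, 2) * (1 + F_S)
    phi_TT = Fraction(1, 2) * (1 + F_T)
    exponent = (len_Sa + len_Sb) + (len_Tc + len_Td)
    return Fraction(1, 2) ** exponent * phi_SS * phi_TT

# a and b are children of S, c and d are children of T,
# and S and T are non-inbred with no quad-common ancestor.
term = type6_term(1, 1, 1, 1)
```

In this small case the term equals the product of the two ordinary kinship coefficients, (1/8)(1/8) = 1/64, which is what one would expect when the two pairs share no ancestors.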

3.4. Experimental Results

In this section, we show the efficiency of our path-counting method using NodeCodes for condensed identity coefficients by making comparisons with the performance of a recursive method used in [10]. We implemented two methods: (1) using recursive formulas to compute each required kinship coefficient and generalized kinship coefficient; (2) using path-counting method coupled with NodeCodes to compute each required kinship coefficient and generalized kinship coefficient independently. We refer to the first method as Recursive, the second method as NodeCodes. For completeness, please refer to [18] for the details of the NodeCodes-based method.

The NodeCodes of a node are a set of labels, each representing a path to the node from its ancestors. Given a pedigree graph, let r be the progenitor (i.e., the node with 0 in-degree). (For simplicity, we assume there is one progenitor, r, as the ancestor of all individuals in the pedigree. Otherwise, a virtual node r can be added to the pedigree graph and all progenitors can be made children of r.) For each node u in the graph, the set of NodeCodes of u, denoted as NC(u), is assigned using a breadth-first-search traversal starting from r as follows.

  1. If u is r then NC(r) contains only one element: the empty string.

  2. Otherwise, let u be a node with NC(u), and let v 0, v 1, …, v k be u's children in sibling order; then for each x in NC(u), a code xi∗ is added to NC(v i), where 0 ≤ i ≤ k, and ∗ indicates the gender of the individual represented by node v i.
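The two rules above can be sketched as follows. This is an illustrative reading, not the implementation used in the experiments: it assumes the pedigree is given as a dictionary of children in sibling order, encodes each step as the sibling index followed by a gender letter ("M" or "F"), and processes a node only after all of its parents have been processed (a topological variant of the breadth-first traversal) so that NC(u) is complete before it is propagated.

```python
from collections import deque

def assign_nodecodes(children, gender, root):
    """Assign NodeCodes to every node reachable from root.

    children: {node: [children in sibling order]}
    gender:   {node: "M" or "F"} for every non-root node
    """
    # Count how many parents each node has.
    indegree = {u: 0 for u in children}
    for kids in children.values():
        for v in kids:
            indegree[v] = indegree.get(v, 0) + 1

    codes = {u: [] for u in indegree}
    codes[root] = [""]  # Rule 1: NC(r) contains only the empty string.

    queue = deque([root])
    while queue:
        u = queue.popleft()
        for i, v in enumerate(children.get(u, [])):
            # Rule 2: extend every code of u by the sibling index and gender of v.
            for x in codes[u]:
                codes[v].append(f"{x}{i}{gender[v]}")
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    return codes

# Hypothetical pedigree: r has children f and m; a is a child of both f and m.
children = {"r": ["f", "m"], "f": ["a"], "m": ["a"], "a": []}
gender = {"f": "M", "m": "F", "a": "F"}
nodecodes = assign_nodecodes(children, gender, "r")
```

Each code of a node spells out one path from r, so a node with two parents receives at least two codes.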

The computations of kinship coefficients for two individuals and generalized kinship coefficients for three individuals presented in [11, 12, 14, 15] use NodeCodes. The NodeCodes-based computation schemes can also be applied to the generalized kinship coefficients for four individuals and two pairs of individuals. For completeness, please refer to [18] for the details of using NodeCodes to compute the generalized kinship coefficients for four individuals and two pairs of individuals based on our proposed path-counting formulas in Sections 3.2 and 3.3.

In order to test the scalability of our approach for calculating condensed identity coefficients on large pedigrees, we used a population simulator implemented in [11] to generate arbitrarily large pedigrees. The population simulator is based on the algorithm for generating populations with overlapping generations in Chapter 4 of [19] along with the parameters given in Appendix B of [20] to model the relatively isolated Finnish Kainuu subpopulation and its growth during the years 1500–2000. An overview of the generation algorithm was presented in [11, 12, 14]. The parameters include starting/ending year, initial population size, initial age distribution, marriage probability, maximum age at pregnancy, expected number of children by time period, immigration rate, and probability of death by time period and age group.

We examine the performance of condensed identity coefficients using twelve synthetic pedigrees which range from 75 individuals to 195,197 individuals. The smallest pedigree spans 3 generations, and the largest pedigree spans 19 generations. We analyzed the effects of pedigree size and the depth of individuals in the pedigree (the longest path between the individual and a progenitor) on the computation efficiency improvement.

In the first experiment, 300 random pairs were selected from each of our 12 synthetic pedigrees. Figure 14 shows the computation efficiency improvement for each pedigree. As can be seen, the improvement of NodeCodes over Recursive grew as the pedigree size increased, from 26.83% on the smallest pedigree to 94.75% on the largest pedigree. It also shows that the path-counting method coupled with NodeCodes scales well on large pedigrees in terms of computing condensed identity coefficients.

Figure 14. The effect of pedigree size on computation efficiency improvement.

In our next experiment, we examined the effect of the depth of the individual in the pedigree on the query time. For each depth, we generated 300 random pairs from the largest synthetic pedigree.

Figure 15 shows the effect of depth on the computation efficiency improvement. We can see the improvement of NodeCodes over Recursive, ranging from 86.48% to 91.30%.

Figure 15. The effect of depth on computation efficiency improvement.

4. Conclusion

We have introduced a framework for generalizing Wright's path-counting formula to more than two individuals. Aiming at efficiently computing condensed identity coefficients, we proposed path-counting formulas (PCF) for all generalized kinship coefficients that are sufficient for expressing condensed identity coefficients as a linear combination. We also performed experiments to compare the efficiency of our method with the recursive method for computing condensed identity coefficients on large pedigrees. Our future work includes (i) further improvements on condensed identity coefficient computation by collectively calculating the set of generalized kinship coefficients to avoid redundant computations, and (ii) experimental results for using PCF in conjunction with encoding schemes (e.g., compact path-encoding schemes [13]) for computing condensed identity coefficients on very large pedigrees.

Acknowledgments

The authors thank Professor Robert C. Elston, Case School of Medicine, for introducing to them the identity coefficients and referring them to the related literature [7, 10, 17]. This work is partially supported by the National Science Foundation Grants DBI 0743705, DBI 0849956, and CRI 0551603 and by the National Institute of Health Grant GM088823.

Appendices

A. Path-Counting Formulas of Special Cases

A.1. Path-Counting Formula for Φaab

For 〈P Aa1, P Aa2〉, we introduce a special case, where P Aa1 and P Aa2 are mergeable.

Definition A.1 (Mergeable Path-Pair).

A path-pair 〈P Aa1, P Aa2〉 is mergeable if and only if the two paths P Aa1 and P Aa2 are completely identical.

Next, we present a graphical representation of 〈P Aa1, P Aa2, P Ab〉 in Figure 16.

Figure 16. A path-pair level graphical representation of 〈P Aa1, P Aa2, P Ab〉.

Lemma A.2.

For S 2 and S 3 in Figure 16,  〈P Aa1, P Aa2〉 cannot be a mergeable path-pair.

Proof —

For S 2 and S 3, if 〈P Aa1, P Aa2〉 is mergeable, then any common individual s between P Aa1 and P Ab is also a shared individual between P Aa2 and P Ab. It means s ∈ Tri_C(P Aa1, P Aa2, P Ab), which contradicts the fact that Tri_C(P Aa1, P Aa2, P Ab) = ∅.

Considering all three scenarios in Figure 16, only S 1 can have a mergeable path-pair 〈P Aa1, P Aa2〉 by Lemma A.2. Now, we present our path-counting formula for Φaab where a is not an ancestor of b:

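$$
\Phi_{aab} = \sum_{A} \left( \sum_{\text{Type 1}} \left(\tfrac{1}{2}\right)^{L_{\text{triple}}-1} \Phi_{AAA} + \sum_{\text{Type 2}} \left(\tfrac{1}{2}\right)^{L_{\text{triple}}} \Phi_{AA} + \sum_{\text{Type 3}} \left(\tfrac{1}{2}\right)^{L_{P_{Aa},P_{Ab}}+1} \Phi_{AA} \right), \tag{A.1}
$$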

where A: a common ancestor of a and b.

When 〈P Aa1, P Aa2〉 is not mergeable,

  • Type 1: 〈P Aa1, P Aa2, P Ab〉 has no root 2-overlap.

  • Type 2: 〈P Aa1, P Aa2, P Ab〉 has one root 2-overlap path P As ending at the individual s.

When 〈P Aa1, P Aa2〉 is mergeable,

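  • Type 3: 〈P Aa, P Ab〉 is a nonoverlapping path-pair.

$$
L_{\text{triple}} = \begin{cases}
L_{P_{Aa_1}} + L_{P_{Aa_2}} + L_{P_{Ab}} & \text{for Type 1} \\
L_{P_{Aa_1}} + L_{P_{Aa_2}} + L_{P_{Ab}} - L_{P_{As}} & \text{for Type 2,}
\end{cases}
\qquad L_{P_{Aa},P_{Ab}} = L_{P_{Aa}} + L_{P_{Ab}} \ \text{for Type 3}. \tag{A.2}
$$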

For the sake of completeness, if a is an ancestor of b, there is no recursive formula for Φaab in [10], but we can use either the recursive formula for Φabc or the path-counting formula for Φabc to compute Φa1a2b.
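The exponent L triple − 1 in the Type 1 term of (A.1) can be motivated by a short calculation. The sketch below assumes the recursive formula Φaab = (1/2)(Φab + Φfmb) quoted in Appendix B.1.3, where f and m are the parents of a. It further assumes, as a convention of this sketch, that P Aa1 reaches a through f and P Aa2 reaches a through m, and that each such pair of parental paths is counted once.

```latex
\begin{align*}
L_{\text{triple}} &= L_{P_{Aa_1}} + L_{P_{Aa_2}} + L_{P_{Ab}}
  = \bigl(L_{P_{Af}} + 1\bigr) + \bigl(L_{P_{Am}} + 1\bigr) + L_{P_{Ab}}, \\[4pt]
\text{one Type 1 term of } \tfrac{1}{2}\Phi_{fmb}:\quad
  & \tfrac{1}{2}\left(\tfrac{1}{2}\right)^{L_{P_{Af}} + L_{P_{Am}} + L_{P_{Ab}}} \Phi_{AAA}
  = \left(\tfrac{1}{2}\right)^{L_{\text{triple}} - 1} \Phi_{AAA}, \\[4pt]
\text{one term of } \tfrac{1}{2}\Phi_{ab}:\quad
  & \tfrac{1}{2}\left(\tfrac{1}{2}\right)^{L_{P_{Aa}} + L_{P_{Ab}}} \Phi_{AA}
  = \left(\tfrac{1}{2}\right)^{L_{P_{Aa},P_{Ab}} + 1} \Phi_{AA}.
\end{align*}
```

Under these assumptions, the first line accounts for the Type 1 term of (A.1), and the last line accounts for the Type 3 (mergeable) term.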

A.2. Path-Counting Formula for Φaabc

Given a path-quad 〈P Aa1, P Aa2, P Ab, P Ac〉, if 〈P Aa1, P Aa2〉 is not mergeable, then we process the path-quad as equivalent to 〈P Aa, P Ab, P Ac, P Ad〉. If 〈P Aa1, P Aa2〉 is mergeable, the path-quad 〈P Aa1, P Aa2, P Ab, P Ac〉 can be condensed to scenarios for 〈P Aa, P Ab, P Ac〉.

Now, we present a path-counting formula for Φaabc where a is not an ancestor of b and c as follows:

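$$
\Phi_{aabc} = \sum_{A} \left( \sum_{\text{Type 1}} \left(\tfrac{1}{2}\right)^{L_{\text{quad}}-1} \Phi_{AAAA} + \sum_{\text{Type 2}} \left(\tfrac{1}{2}\right)^{L_{\text{quad}}} \Phi_{AAA} + \sum_{\text{Type 3}} \left(\tfrac{1}{2}\right)^{L_{\text{quad}}+1} \Phi_{AA} \right) + \sum_{A} \left( \sum_{\text{Type 4}} \left(\tfrac{1}{2}\right)^{L_{\text{triple}}+1} \Phi_{AAA} + \sum_{\text{Type 5}} \left(\tfrac{1}{2}\right)^{L_{\text{triple}}+2} \Phi_{AA} \right), \tag{A.3}
$$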

where A: a common ancestor of a, b, and c.

When 〈P Aa1, P Aa2〉 is not mergeable,

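  • Type 1: zero root 2-overlap and zero root 3-overlap path;

  • Type 2: one root 2-overlap path P As ending at s;

  • Type 3: one of the following three cases:

$$
\text{Type 3}: \begin{cases}
\text{Case 1: two root 2-overlap paths } P_{As_1} \text{ and } P_{As_2} \text{ ending at } s_1 \text{ and } s_2, \text{ respectively} \\
\text{Case 2: one root 3-overlap path } P_{At} \text{ ending at } t \\
\text{Case 3: one root 2-overlap and one root 3-overlap paths } P_{As} \text{ and } P_{At} \text{ ending at } s \text{ and } t, \text{ respectively.}
\end{cases} \tag{A.4}
$$

When 〈P Aa1, P Aa2〉 is mergeable,

  • Type 4: 〈P Aa, P Ab, P Ac〉 has zero root 2-overlap path;

  • Type 5: 〈P Aa, P Ab, P Ac〉 has one root 2-overlap path P As ending at s.

$$
L_{\text{quad}} = \begin{cases}
L_{P_{Aa_1}} + L_{P_{Aa_2}} + L_{P_{Ab}} + L_{P_{Ac}} & \text{for Type 1} \\
L_{P_{Aa_1}} + L_{P_{Aa_2}} + L_{P_{Ab}} + L_{P_{Ac}} - L_{P_{As}} & \text{for Type 2} \\
L_{P_{Aa_1}} + L_{P_{Aa_2}} + L_{P_{Ab}} + L_{P_{Ac}} - L_{P_{As_1}} - L_{P_{As_2}} & \text{for Case 1 of Type 3} \\
L_{P_{Aa_1}} + L_{P_{Aa_2}} + L_{P_{Ab}} + L_{P_{Ac}} - 2L_{P_{At}} & \text{for Case 2 of Type 3} \\
L_{P_{Aa_1}} + L_{P_{Aa_2}} + L_{P_{Ab}} + L_{P_{Ac}} - L_{P_{At}} - L_{P_{As}} & \text{for Case 3 of Type 3,}
\end{cases}
\qquad
L_{\text{triple}} = \begin{cases}
L_{P_{Aa}} + L_{P_{Ab}} + L_{P_{Ac}} & \text{for Type 4} \\
L_{P_{Aa}} + L_{P_{Ab}} + L_{P_{Ac}} - L_{P_{As}} & \text{for Type 5.}
\end{cases} \tag{A.5}
$$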

Note that if a is an ancestor of either b or c, or both of them, then the path-counting formula of Φabcd is applicable to compute Φa1a2bc.

A.3. Path-Counting Formula for Φaaab

A special case of 〈P Aa1, P Aa2, P Aa3〉 for 〈P Aa1, P Aa2, P Aa3, P Ab〉 is introduced when 〈P Aa1, P Aa2, P Aa3〉 is mergeable. With the existence of a mergeable path-triple, 〈P Aa1, P Aa2, P Aa3, P Ab〉 can be condensed to 〈P Aa, P Ab〉.

Definition A.3 (Mergeable Path-Triple).

Given three paths P Aa1, P Aa2, and P Aa3, they are mergeable if and only if they are completely identical.

Lemma A.4.

Given a path-quad 〈P Aa1, P Aa2, P Aa3, P Ab〉, there must be at least one mergeable path-pair among 〈P Aa1, P Aa2〉, 〈P Aa1, P Aa3〉, 〈P Aa2, P Aa3〉.

Proof —

For an individual a with two parents f and m, the paternal allele of the individual a is transmitted from f and the maternal allele is transmitted from m. At allele level, only two descent paths starting from an ancestor are allowed. For a path-quad 〈P Aa1, P Aa2, P Aa3, P Ab〉, there must be at least one mergeable path-pair among 〈P Aa1, P Aa2〉, 〈P Aa1, P Aa3〉, and 〈P Aa2, P Aa3〉.

For simplicity, we treat 〈P Aa1, P Aa2〉 as a default mergeable path-pair.

Now, we present the path-counting formula for Φaaab where a is not an ancestor of b as follows:

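$$
\Phi_{aaab} = \sum_{A} \left( \frac{3}{2} \left( \sum_{\text{Type 1}} \left(\tfrac{1}{2}\right)^{L_{\text{triple}}-1} \Phi_{AAA} + \sum_{\text{Type 2}} \left(\tfrac{1}{2}\right)^{L_{\text{triple}}} \Phi_{AA} \right) + \sum_{\text{Type 3}} \left(\tfrac{1}{2}\right)^{L_{\text{pair}}+2} \Phi_{AA} \right), \tag{A.6}
$$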

where A: a common ancestor of a and b.

When there is only one mergeable path-pair (let us consider 〈P Aa1, P Aa2〉 as the mergeable path-pair),

  • Type 1: 〈P Aa1, P Aa3, P Ab〉 has zero root 2-overlap path,

  • Type 2: 〈P Aa1, P Aa3, P Ab〉 has one root 2-overlap path P As ending at s.

When 〈P Aa1, P Aa2, P Aa3〉 is mergeable,

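  • Type 3: 〈P Aa, P Ab〉 is nonoverlapping.

$$
L_{\text{triple}} = \begin{cases}
L_{P_{Aa_1}} + L_{P_{Aa_3}} + L_{P_{Ab}} & \text{for Type 1} \\
L_{P_{Aa_1}} + L_{P_{Aa_3}} + L_{P_{Ab}} - L_{P_{As}} & \text{for Type 2,}
\end{cases}
\qquad L_{\text{pair}} = L_{P_{Aa}} + L_{P_{Ab}} \ \text{for Type 3}. \tag{A.7}
$$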

Note that if a is an ancestor of b, we treat Φaaab = Φa1a2a3b. Then, we apply the path-counting formula for Φabcd to compute Φa1a2a3b.

B. Proof for Path-Counting Formulas of Three Individuals

We first demonstrate that, for one triple-common ancestor A, the path-counting computation of Φabc is equivalent to the computation using recursive formulas. Then, we prove the correctness of the path-counting computation for multiple triple-common ancestors.

B.1. One Triple-Common Ancestor

Considering the different types of path-triples starting from a triple-common ancestor A in a pedigree graph G contributing to Φabc and Φaab, G can have 5 different cases:

For Φaab:

  • Case 2.1: G does not have any path-triples 〈P Aa1, P Aa2, P Ab〉 with root overlap;

  • Case 2.2: G has path-triples 〈P Aa1, P Aa2, P Ab〉 with root overlap;

  • Case 2.3: G has path-triples 〈P Aa1, P Aa2, P Ab〉 having a mergeable path-pair 〈P Aa1, P Aa2〉.

For Φabc:

  • Case 3.1: G does not have any path-triples 〈P Aa, P Ab, P Ac〉 with root overlap;

  • Case 3.2: G has path-triples 〈P Aa, P Ab, P Ac〉 with root overlap. (B.1)

Based on the 5 cases from Case 2.1 to Case 3.2, we first construct a dependency graph shown in Figure 17, consistent with the recursive formulas (3), (4), and (5) for the generalized kinship coefficients for three individuals.

Figure 17. Dependency graph for different cases regarding Φabc and Φaab.

Then, we take the following steps to prove the correctness of the path-counting formulas (12) and (A.1):

  1. for Φab, the correctness of the path-counting formula (i.e., Wright's formula) is proven in [21]. For Case 2.1 and Case 2.2, the correctness is proven based on the correctness of Cases 3.1 and 3.2;

  2. for Case 2.3, it has no cycle but only depends on Φab. Thus, we prove the correctness of Case 2.3 by transforming the case to Φab;

  3. for Cases 3.1 and 3.2, the correctness is proven by induction on the number of edges, n, in the pedigree graph G.

B.1.1. Correctness Proof for Case 3.1

Case 3.1. For Φabc, G does not have any path triples 〈P Aa, P Ab, P Ac〉 with root overlap.

Proof —

There are two basic scenarios: (i) one individual is a parent of another; (ii) no individual is a parent of another, among a, b, and c.

Using the recursive formula (3) to compute Φabc, for Figure 18(a), Φabc = (1/2)Φcbc = (1/2)^2 Φccc; for Figure 18(b), Φabc = (1/2)ΦAbc = (1/2)^2 ΦAAc = (1/2)^3 ΦAAA.

Using the path-counting formula (12), if a path-triple 〈P Aa, P Ab, P Ac〉 has no root overlap (i.e., Type 1), then the contribution of 〈P Aa, P Ab, P Ac〉 to Φabc can be computed as follows: ∑Type 1 (1/2)^(L PAa,PAb,PAc) ΦAAA, where L PAa,PAb,PAc = L PAa + L PAb + L PAc.

For Figure 18(a), c is the only triple-common ancestor and we obtain Φabc = (1/2)^(L Pca,Pcb,Pcc) Φccc = (1/2)^2 Φccc; for Figure 18(b), we obtain Φabc = (1/2)^(L PAa,PAb,PAc) ΦAAA = (1/2)^3 ΦAAA.
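The agreement between the two computations in the base case can also be checked numerically. The following is a minimal sketch, assuming the recursive formulas take the standard form attributed to Karigl [10]: Φaa = (1/2)(1 + Φfm), Φab = (1/2)(Φfb + Φmb), Φaaa = (1/4)(1 + 3Φfm), Φaab = (1/2)(Φab + Φfmb), and Φabc = (1/2)(Φfbc + Φmbc), where f and m are the parents of a and a is not an ancestor of the other individuals. Individuals are assumed to be numbered so that parents have smaller identifiers than their children; the function name make_kinship and the dictionary layout are illustrative.

```python
from fractions import Fraction
from functools import lru_cache

def make_kinship(parents):
    """parents: {id: (father, mother)} with None for a missing parent.

    Assumes integer ids in which every parent has a smaller id than its children.
    """
    half = Fraction(1, 2)

    @lru_cache(maxsize=None)
    def phi2(a, b):
        if a is None or b is None:
            return Fraction(0)
        if a == b:
            f, m = parents[a]
            return half * (1 + phi2(f, m))
        if a < b:
            a, b = b, a  # recurse on the younger individual
        f, m = parents[a]
        return half * (phi2(f, b) + phi2(m, b))

    @lru_cache(maxsize=None)
    def phi3(a, b, c):
        if a is None or b is None or c is None:
            return Fraction(0)
        a, b, c = sorted((a, b, c), reverse=True)  # a is the youngest
        f, m = parents[a]
        if a == b == c:
            return Fraction(1, 4) * (1 + 3 * phi2(f, m))
        if a == b:
            return half * (phi2(a, c) + phi3(f, m, c))
        return half * (phi3(f, b, c) + phi3(m, b, c))

    return phi2, phi3

# Layout read from the caption of Figure 18(b): A = 1 is a parent of a = 2, b = 3, c = 4.
phi2, phi3 = make_kinship({1: (None, None), 2: (1, None), 3: (1, None), 4: (1, None)})
recursive_value = phi3(2, 3, 4)
# Type 1 path-triple with L = 1 + 1 + 1 = 3.
path_counting_value = Fraction(1, 2) ** 3 * phi3(1, 1, 1)
```

Both values coincide for this pedigree, and the same check with c as the parent of a and b reproduces (1/2)^2 Φccc for the layout of Figure 18(a).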

Induction Step. Let n denote the number of edges in G. Assume true for n ≤ k, where k ≥ 2. Then, we show it is true for n = k + 1.

For Figures 19(a) and 19(b), among a, b, and c, let a be the individual having the longest path starting from their triple-common ancestor in the pedigree graph G with (k + 1) edges. If we remove the node a and cut the edge f → a from G, then the new graph G* has k edges. In terms of computing Φfbc, G* satisfies the condition for the induction hypothesis.

For Figure 19(a), Φfbc = ∑Type 1 (1/2)^(L PAf,PAb,PAc) ΦAAA. Based on the recursive formula (3), Φabc = (1/2)(Φfbc + Φmbc), where f and m are the parents of a. In G, a has only one parent f; thus, Φmbc = 0. Then, we can plug in the path-counting formula for Φfbc to obtain

$$
\begin{aligned}
\Phi_{abc} &= \tfrac{1}{2}\Phi_{fbc} = \tfrac{1}{2} \sum_{\text{Type 1}} \left(\tfrac{1}{2}\right)^{L_{P_{Af},P_{Ab},P_{Ac}}} \Phi_{AAA} = \sum_{\text{Type 1}} \left(\tfrac{1}{2}\right)^{L_{P_{Af},P_{Ab},P_{Ac}}+1} \Phi_{AAA}, \\
L_{P_{Aa},P_{Ab},P_{Ac}} &= L_{P_{Af},P_{Ab},P_{Ac}} + 1, \\
\Phi_{abc} &= \sum_{\text{Type 1}} \left(\tfrac{1}{2}\right)^{L_{P_{Aa},P_{Ab},P_{Ac}}} \Phi_{AAA}.
\end{aligned} \tag{B.2}
$$

Similarly, for Figure 19(b), we obtain Φabc = ∑Type 1 (1/2)^(L Pcf,Pcb,Pcc + 1) Φccc = ∑Type 1 (1/2)^(L Pca,Pcb,Pcc) Φccc.

Thus, it is true for n = k + 1.

Figure 18. (a) c is a parent of a and b; (b) no individual is a parent of another.

Figure 19. (a) No individual is a parent of another; (b) c is an ancestor of a and b.

B.1.2. Correctness Proof for Case 3.2

Case 3.2. For Φabc, G has path triples 〈P Aa, P Ab, P Ac〉 with root overlap.

Proof —

There are three basic scenarios: (i) there are two individuals who are parents of another; (ii) there is only one individual who is parent of another; (iii) there is no individual who is a parent of another, among a, b, and c.

Using the recursive formula (3) to compute Φabc: for Figure 20(a), Φabc = (1/2)Φbbc = (1/2)^2 Φbc = (1/2)^3 Φcc; for Figure 20(b), Φabc = (1/2)Φbbc = (1/2)^2 Φbc = (1/2)^4 ΦAA; for Figure 20(c), Φabc = (1/2)^2 Φssc = (1/2)^3 Φsc = (1/2)^5 ΦAA.

Using the path-counting formula (12), if a path-triple 〈P Aa, P Ab, P Ac〉 has root overlap (i.e., Type 2), then the contribution of 〈P Aa, P Ab, P Ac〉 to Φabc can be computed as follows: ∑Type 2 (1/2)^(L PAa,PAb,PAc + 1) ΦAA, where L PAa,PAb,PAc = L PAa + L PAb + L PAc − L PAs and s is the last individual of the root overlap path P As.

For Figure 20(a), c is the only triple-common ancestor and we obtain Φabc = (1/2)^(L Pca,Pcb,Pcc + 1) Φcc = (1/2)^(2+1) Φcc = (1/2)^3 Φcc. Similarly, for Figures 20(b) and 20(c), we obtain Φabc = (1/2)^4 ΦAA and Φabc = (1/2)^5 ΦAA, respectively.
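The value L Pca,Pcb,Pcc = 2 used for Figure 20(a) can be unpacked as follows. The path lengths are inferred from the figure caption (b is a parent of a, and c is a parent of b), so the root overlap path is taken to be c → b with s = b; this reading of the figure is an assumption of the sketch.

```latex
\begin{align*}
L_{P_{ca}} &= 2 \ (c \to b \to a), \qquad L_{P_{cb}} = 1 \ (c \to b), \qquad L_{P_{cc}} = 0, \qquad L_{P_{cs}} = 1 \ (s = b), \\
L_{P_{ca},P_{cb},P_{cc}} &= L_{P_{ca}} + L_{P_{cb}} + L_{P_{cc}} - L_{P_{cs}} = 2 + 1 + 0 - 1 = 2, \\
\Phi_{abc} &= \left(\tfrac{1}{2}\right)^{L_{P_{ca},P_{cb},P_{cc}} + 1} \Phi_{cc} = \left(\tfrac{1}{2}\right)^{3} \Phi_{cc}.
\end{align*}
```

This matches the value (1/2)^3 Φcc obtained from the recursive formula for the same figure.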

Induction Step. Let n denote the number of edges in G. Assume true for n ≤ k, where k ≥ 2. Show that it is true for n = k + 1.

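For Figures 21(a), 21(b), and 21(c), among a, b, and c, let a be the individual who has the longest path and let p be a parent of a. Then, we cut the edge p → a from G and obtain a new graph G* which satisfies the condition of the induction hypothesis. For Figure 21(a), we use the path-counting formula for Φfbc in G*: Φfbc = ∑Type 2 (1/2)^(L PAf,PAb,PAc + 1) ΦAA.

In G, f is the only parent of a; according to the recursive formula (3), we have Φabc = (1/2)Φfbc. Then, we can plug in the expression for Φfbc and obtain

$$
\begin{aligned}
\Phi_{abc} &= \tfrac{1}{2}\Phi_{fbc} = \tfrac{1}{2} \sum_{\text{Type 2}} \left(\tfrac{1}{2}\right)^{L_{P_{Af},P_{Ab},P_{Ac}}+1} \Phi_{AA} = \sum_{\text{Type 2}} \left(\tfrac{1}{2}\right)^{L_{P_{Af},P_{Ab},P_{Ac}}+1+1} \Phi_{AA}, \\
L_{P_{Aa},P_{Ab},P_{Ac}} &= L_{P_{Af},P_{Ab},P_{Ac}} + 1, \\
\Phi_{abc} &= \sum_{\text{Type 2}} \left(\tfrac{1}{2}\right)^{L_{P_{Af},P_{Ab},P_{Ac}}+1+1} \Phi_{AA} = \sum_{\text{Type 2}} \left(\tfrac{1}{2}\right)^{L_{P_{Aa},P_{Ab},P_{Ac}}+1} \Phi_{AA}.
\end{aligned} \tag{B.3}
$$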

For Figures 21(b) and 21(c), we take the same steps as we calculate Φabc for Figure 21(a).

In summary, it is true for n = k + 1.

Figure 20. (a) b is a parent of a, and c is a parent of b; (b) b is a parent of a; (c) no individual is a parent of another.

Figure 21. (a) No individual is a parent of another; (b) b is a parent of a; (c) b is a parent of a and c is an ancestor of b.

B.1.3. Correctness Proof for Case  2.3

Case 2.3. For Φaab, the path-triples in the pedigree graph G have mergeable path-pair.

Proof —

Considering the relationship between a and b, G has two scenarios: (i) b is not an ancestor of a; (ii) b is an ancestor of a. Using the path-counting formula (A.1), if a path-triple 〈P Aa1, P Aa2, P Ab〉 ∈ Type 3, which means that it has a mergeable path-pair, then the contribution of 〈P Aa1, P Aa2, P Ab〉 to Φaab can be computed as follows: ∑Type 3 (1/2)^(L PAa,PAb + 1) ΦAA, where L PAa,PAb = L PAa + L PAb.

Using the recursive formula (4), we obtain Φaab = (1/2)(Φab + Φfmb).

For Figure 22(a), A is a common ancestor of a and b, and a has only one parent f. As m is missing, Φfmb = 0, and

$$
\Phi_{aab} = \tfrac{1}{2}\left(\Phi_{ab} + \Phi_{fmb}\right) = \tfrac{1}{2}\left(\Phi_{ab} + 0\right) = \tfrac{1}{2}\Phi_{ab}. \tag{B.4}
$$

For Φab, we use Wright's formula and obtain Φab = ∑P (1/2)^(L PAa,PAb) ΦAA, where P denotes all nonoverlapping path-pairs 〈P Aa, P Ab〉.

Then, we have Φaab = (1/2)Φab = (1/2)∑P (1/2)^(L PAa,PAb) ΦAA = ∑P (1/2)^(L PAa,PAb + 1) ΦAA.

For Figure 22(b), we can also transform the computation of Φaab to Φab.

In summary, it shows that the path-counting formula (A.1) is true for Case 2.3.

Figure 22. (a) b is not an ancestor of a; (b) b is an ancestor of a.

B.1.4. Correctness Proof for Cases 2.1 and 2.2

For Φaab, when there is no path-triple having a mergeable path-pair (i.e., the path-triple belongs to either Case 2.1 or Case 2.2), Φaab can be transformed to Φa1a2b, which is equivalent to the computation of Φabc for Cases 3.1 and 3.2. The correctness of our path-counting formula for Cases 3.1 and 3.2 has been proven above. Thus, we obtain the correctness for Φaab when the path-triple belongs to either Case 2.1 or Case 2.2.

B.2. Multiple Triple-Common Ancestors

Now, we provide the correctness proof for multiple triple-common ancestors regarding the path-counting formulas (12) and (A.1).

Lemma B.2 A. —

Given a pedigree graph G and three individuals a, b, c having at least one triple-common ancestor, Φabc is correctly computed using the path-counting formulas (12) and (A.1).

Proof —

Proof by induction on the number of triple-common ancestors

Basis. G has only one triple-common ancestor of a, b, and c.

The correctness of (12) and (A.1) for G with only one triple-common ancestor of a, b, and c is proven in the previous section.

Induction Hypothesis. Assume that if G has k or fewer triple-common ancestors of a, b, and c, (12) and (A.1) are correct for G.

Induction Step. Now, we show that it is true for G with k + 1 triple-common ancestors of a, b, and c.

Let Tri_C(a, b, c, G) denote all triple-common ancestors of a, b, and c in G, where Tri_C(a, b, c, G) = {A i ∣ 1 ≤ i ≤ k + 1}. Let A 1 be the topmost triple-common ancestor, such that there is no individual among the remaining ancestors {A i ∣ 2 ≤ i ≤ k + 1} who is an ancestor of A 1. Let S(A 1) denote the contribution from A 1 to Φabc.

Because A 1 is the topmost triple-common ancestor, there is no path-triple from {A i ∣ 2 ≤ i ≤ k + 1} to a, b, and c which passes through A 1. Then, we can remove A 1 from G and delete all outgoing edges from A 1 and obtain a new graph G′ which has k triple-common ancestors of a, b, and c. It means Tri_C(a, b, c, G′) = {A i ∣ 2 ≤ i ≤ k + 1}.

For the new graph G′, we can apply our induction hypothesis and obtain Φabc(G′).

For the topmost triple-common ancestor A 1, there are two different cases considering its relationship with the other triple-common ancestors:

  1. there is no individual among {A i ∣ 2 ≤ i ≤ k + 1} who is a descendant of A 1;

  2. there is at least one individual among {A i ∣ 2 ≤ i ≤ k + 1} who is a descendant of A 1.

For (1), since no individual among {A i ∣ 2 ≤ i ≤ k + 1} is a descendant of A 1, the set of path-triples from A 1 to a, b, and c is independent of the set of path-triples from {A i ∣ 2 ≤ i ≤ k + 1} to a, b, and c. It also means that the contribution from A 1 to Φabc is independent of the contribution from the other triple-common ancestors.

Summing up all contributions, we can obtain Φabc     (G) = Φabc     (G′) + S(A 1).

For (2), let A j be one descendant of A 1. Now both A 1 and A j can reach a, b, and c.

Let pt i = {t a: A 1 → ⋯ → a; t b: A 1 → ⋯ → b; t c: A 1 → ⋯ → c} be a path-triple from A 1 to a, b, and c.

If t a, t b, and t c all pass through A j, then the path-triple pt i is not an eligible path-triple for Φabc. When we compute the contribution from A 1 to Φabc, we exclude all such path-triples where t a, t b, and t c all pass through a lower triple-common ancestor. In other words, an eligible path-triple from A 1 regarding Φabc cannot have three paths all passing through a lower triple-common ancestor. Therefore, we know that the contribution from A 1 to Φabc is independent of the contribution from the other triple-common ancestors. Summing up all contributions, we obtain Φabc(G) = Φabc(G′) + S(A 1).

C. Proof for Four Individuals and Two Pairs of Individuals

Here, we give a proof sketch for the correctness of path counting formulas for four individuals. First of all, for four individuals in a pedigree graph G, we present all different cases based on which we construct a dependency graph. The correctness of the path-counting formulas for two-pair individuals can be proved similarly.

C.1. Proof for Four Individuals

Consider the existence of different types of path-quads regarding Φabcd, Φaabc, and Φaaab; there are 15 cases for a pedigree graph G:

For Φaaab:

  • Case 2.1: G has path-triples 〈P Aa1, P Aa2, P Ab〉 with zero root overlap;

  • Case 2.2: G has path-triples 〈P Aa1, P Aa2, P Ab〉 with one root overlap;

  • Case 2.3: G has path-pairs 〈P Aa, P Ab〉 with zero root overlap.

For Φaabc:

  • Case 3.1: G has path-quads 〈P Aa1, P Aa2, P Ab, P Ac〉 with zero root overlap;

  • Case 3.2: G has path-quads 〈P Aa1, P Aa2, P Ab, P Ac〉 with one root 2-overlap;

  • Case 3.3.1: G has path-quads 〈P Aa1, P Aa2, P Ab, P Ac〉 with two root 2-overlap;

  • Case 3.3.2: G has path-quads 〈P Aa1, P Aa2, P Ab, P Ac〉 with one root 3-overlap;

  • Case 3.3.3: G has path-quads 〈P Aa1, P Aa2, P Ab, P Ac〉 with one root 2-overlap and one root 3-overlap;

  • Case 3.4: G has path-triples 〈P Aa, P Ab, P Ac〉 with zero root overlap;

  • Case 3.5: G has path-triples 〈P Aa, P Ab, P Ac〉 with one root overlap.

For Φabcd:

  • Case 4.1: G has path-quads 〈P Aa, P Ab, P Ac, P Ad〉 with zero root overlap;

  • Case 4.2: G has path-quads 〈P Aa, P Ab, P Ac, P Ad〉 with one root 2-overlap;

  • Case 4.3.1: G has path-quads 〈P Aa, P Ab, P Ac, P Ad〉 with two root 2-overlap;

  • Case 4.3.2: G has path-quads 〈P Aa, P Ab, P Ac, P Ad〉 with one root 3-overlap;

  • Case 4.3.3: G has path-quads 〈P Aa, P Ab, P Ac, P Ad〉 with one root 2-overlap and one root 3-overlap. (C.1)

Then, we construct a dependency graph shown in Figure 23 for all cases for four individuals.

Figure 23. Dependency graph for different cases for four individuals.

According to the dependency graph in Figure 23, the intermediate steps including Cases 3.4 and 3.5 are already proved for the computation of Φabc. The correctness of the transformation from Case 4.2 to Case 3.4 can be proved based on the recursive formula for Φabcd and Φaabc. Similarly, we can obtain the transformation from Case 4.3.1 to Case 3.5.

C.2. Proof for Two Pairs of Individuals

Consider the existence of different types of 2-pair-path-pair regarding Φab,cd; there are 9 cases which are listed as follows.

Case 4.1. G  has 〈(P Aa, P Ab), (P Ac, P Ad)〉 with zero root homo-overlap and zero root heter-overlap.

Case 4.2. G  has 〈(P Aa, P Ab), (P Ac, P Ad)〉 with zero root homo-overlap and one root heter-overlap.

Case 4.3.1. G  has 〈(P Aa, P Ab), (P Ac, P Ad)〉 with zero root homo-overlap and two root heter-overlap.

Case 4.3.2. G has 〈(P Aa, P Ab), (P Ac, P Ad)〉 with one root homo-overlap and two root heter-overlap.

Case 4.4. G has 〈(P Aa, P Ab), (P Ac, P Ad)〉 with one root homo-overlap and zero root heter-overlap.

Case 4.5. G  has 〈(P Aa, P Ab), (P Ac, P Ad)〉 with two root homo-overlap and zero root heter-overlap.

Case 4.6. G  has path-triples 〈P Aa, P Ab, P Ac〉 with zero root overlap.

Case 4.7. G  has path-triples 〈P Aa, P Ab, P Ac〉 with one root overlap.

Case 4.8. G  has path-pairs 〈P Tc, P Td〉 with zero root overlap.

Then, we construct a dependency graph for the cases relating to Φab,cd in Figure 24.

Figure 24. Dependency graph for different cases for two pairs of individuals.

According to the dependency graph in Figure 24, Cases 4.6,  4.7, and 4.8 are the intermediate steps which already are proved for the computation of Φabc. The correctness of the transformation from Case 4.2 to Case 4.6 can be proved based on the recursive formula for Φab,cd and Φab,ac. Similarly, we can obtain the transformation from Cases 4.3.1 and 4.3.2 to Case 4.7 as well as from Case 4.4 to Case 4.8 accordingly.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Articles from Computational and Mathematical Methods in Medicine are provided here courtesy of Wiley
