Estimating Genetic Relatedness in Admixed Populations

Arun Sethuraman

doi:10.1534/g3.118.200485

2018 Aug 13;8(10):3203–3220. doi: 10.1534/g3.118.200485

PMCID: PMC6169378 PMID: 30104261

Abstract

Estimating genetic relatedness and inbreeding coefficients is important to the fields of quantitative genetics, conservation, genome-wide association studies (GWAS), and population genetics. Traditional estimators of genetic relatedness assume an underlying model of population structure. Each individual is assigned to a population, depending on a priori assumptions about geographical location of sampling, proximity, or genetic similarity. But often, this population assignment is unknown, and assumptions about assignment can lead to erroneous estimates of genetic relatedness. I develop a generalized method of estimating relatedness in admixed populations that (1) accommodates multi-allelic genomic data and (2) includes all nine Identity By Descent (IBD) states, and implement a maximum likelihood based estimator of pairwise genetic relatedness in structured populations as part of the software InRelate. Replicated estimations of genetic relatedness between admixed full sib (FS), half sib (HS), first cousin (FC), parent-offspring (PO) and unrelated (UR) dyads in simulated and empirical data from the HGDP-CEPH panel show considerably lower bias and error with InRelate than with several previously developed methods. I also propose a bootstrap scheme, and a series of Wald Tests to assign relatedness categories to pairs of individuals.

Keywords: Genetic Relatedness, Coancestry, Admixture, Population Structure

Estimating genetic relatedness is an important problem in biological statistics and population genetics. For instance, paternity or maternity assignment (see Avise 2001, Pearse et al. 2002, Yue and Chang 2010, Coleman and Jones 2011), and forensic studies (reviewed in Weir 2004) require a robust statistical framework to infer relatedness between genotyped individuals. Genetic relatedness also plays an important role in the study of quantitative traits where the proportion of trait variability explained by shared alleles indicates the strength of the genetic component of the trait (Falconer and Mackay 1996, Visscher et al. 2008). In several allied fields, accurate estimation of genetic relatedness is critical. For instance, association studies and linkage analyses without accounting for the increased relatedness due to population genetic structure could lead to spurious associations (Pritchard et al. 2000a). Genetic relatedness is also important in fields such as conservation genetics (Oliehoek et al. 2006, Wang 2018).

The genetic relatedness, $r_{X Y},$ between two individuals X and Y can be defined in terms of the probability that their alleles are Identical By Descent (IBD). $r_{X Y}$ is thus also twice the coefficient of coancestry, $θ_{X Y},$ which in turn can be thought of as the inbreeding coefficient of any offspring that X and Y may produce (Weir et al. 2006).

Conventional relatedness estimators work in one of three ways: (1) estimating a coefficient of relatedness between two individuals using multilocus genotype data, and linkage data to inform the length of IBD tract sharing; (2) assigning sib-ship partitions, reconstructing pedigrees, and using the pedigrees to estimate relatedness; or (3) directly estimating relatedness from known pedigrees (Weir et al. 2006). All relatedness estimators, however, have high variances, primarily owing to the difficulty of distinguishing true IBD states from observed Identity By State (IBS) states (Blouin 2003). This delineation of IBS vs. IBD is achieved by estimating the conditional probabilities of observing a genotype at a locus in one individual X, given the observed genotype at the same locus in individual Y.

In the presence of population genetic structure though, localized inbreeding makes individuals within the same subpopulation ‘more related’, than as suggested by their pedigree. Pervasive or specific inbreeding in recent generations past between two related individuals can be quantified though, if sufficient information is available on the existing genetic subpopulation structure. The estimated inbreeding coefficients (e.g., θ, Weir 1994) affect the aforementioned conditional probabilities (Weir 1994). Alternately, maintenance of advantageous alleles in subpopulations by selection within a total population could also yield ‘artificial’ patterns of relatedness between individuals that share alleles, but not necessarily by direct descent.

Failing to account for such ‘shared’ allelic ancestry, by using true or estimated subpopulation allele frequencies, leads to incorrect estimates of genetic relatedness. Anderson and Weir (2007) circumvent the problem of estimating subpopulation allele frequencies by directly quantifying the amount of inbreeding due to subpopulation structure, conditioned on a priori knowledge of existing subpopulations within a total population. Thus, estimates of relatedness that use the inbreeding coefficient θ in their formulation could potentially be biased.

Several other methods also utilize current population allele frequencies as proxies for ‘ancestral’ (this could mean subpopulation allele frequencies of the current generation, as in Anderson and Weir 2007, or allele frequencies of subpopulations from generations past, equated to current allele frequencies, as in Wang 2002) subpopulation allele frequencies, under Hardy-Weinberg Equilibrium (HWE), in their estimates of the inbreeding coefficient, θ. This assumption can be problematic because we do not know the precise number of ancestral subpopulations. However, the number of ancestral subpopulations can be approximated by the current subpopulation structure in a reference population.

Most methods for estimating pairwise genetic relatedness also assume that individuals whose pairwise relatedness is being estimated are derived from the same single, panmictic subpopulation. The methods of Anderson and Weir (2007), and Wang (2011b), that attempt to relax this assumption by handling samples from multiple subpopulations, assume that individuals derived from different subpopulations are genetically unrelated. However, in the presence of genetic admixture and migration, alleles are shared between subpopulations.

To account for unobserved population structure in bi-allelic genetic data, Moltke and Albrechtsen (2013) develop a two-step method (RelateADMIX), which estimates population genetic structure as admixture or ancestry proportions, and subpopulation allele frequencies. This method, when compared with other popular tools for estimating relatedness, including REAP (Thornton et al. 2012) and PLINK (Purcell et al. 2007), shows considerable reduction in bias in estimating IBD probabilities. This method uses the following information: (a) admixture proportions of alleles at multiple bi-allelic loci in individuals, in “most likely” genetic subpopulations, as determined by likelihood or Bayesian methods such as those implemented in STRUCTURE (Falush et al. 2007), ADMIXTURE (Alexander et al. 2009), and MULTICLUST (Sethuraman 2013); and (b) subpopulation allele frequencies that are estimated as parameters in the model. Specifically, the model uses the probability distribution that an allele at a locus in an individual, or a multilocus genotype of an individual, was derived from a subpopulation in the recent past. IBS probabilities for two individuals, conditioned on the three IBD states ($D_{7}, D_{8}, D_{9}$; sensu Jacquard 1972, Anderson and Weir 2007), are then calculated. This calculation contributes to a likelihood function (sensu Thompson 1975), which is then maximized using an Expectation Maximization (EM) algorithm (Dempster et al. 1977) to obtain maximum likelihood estimates for relatedness coefficients. These IBD coefficients are then used in calculating pairwise genetic relatedness, $r_{X Y},$ and coancestry coefficients, $θ_{X Y}.$ This method, however, assumes that alleles derived from different ancestral subpopulations are not IBD, and hence accounts for recent population structure.

Here I develop an alternate formulation that uses estimated subpopulation allele frequencies and ancestry proportions to estimate genetic relatedness in structured populations. The formulation includes all nine IBD states ($D_{1}$ to $D_{9}$), is applicable to multi-allelic data, and accounts for ancestral subpopulation structure, where alleles derived from different subpopulations can also be IBD in an ancestral population. I develop a new package, InRelate, based on a non-linear programming solution to this problem. I then address several questions based on the new framework: (1) how does this estimator of pairwise genetic relatedness compare with other estimators of relatedness for structured and unstructured populations in simulated and empirical datasets? (2) how does this estimator compare to other estimators as the available information (measured in terms of the number of genotyped loci) increases? (3) how do bias and mean squared errors (MSEs) in estimation using InRelate change with the demographic model of evolutionary history? (4) how does erroneous estimation of subpopulation structure due to label switching affect estimates of relatedness under the InRelate model? I also describe a method of bootstrapping and a series of statistical tests in order to obtain confidence intervals around estimates of relatedness.

Materials and Methods

Relatedness Under the Admixture Model

Theory:

I use the admixture model introduced by Pritchard et al. (2000b) to model population structure, since it makes few assumptions about the demography or history of the studied population.

Note that this model assumes that all individuals in the sample are unrelated, which is not strictly true in the present case. If, however, there are proportionately few relatives in the sample, then estimation under the admixture model should be reliable. For samples with rampant relatedness, pedigree estimation, or methods that rely on linkage information, may be more appropriate.

Data:

Assume that a sample of I largely unrelated, diploid individuals has been collected from a population possibly consisting of K unknown subpopulations. Each individual has been genotyped at L unlinked, codominant, neutrally evolving loci. Assume that locus l exhibits $A_{l}$ possible allelic states in the sample. For example, at SNP or AFLP presence/absence markers, $A_{l} = 2.$ Microsatellite markers evolving under the infinite alleles model theoretically have infinite states, but we observe some $A_{l} < ∞$ in the finite sample. Missing data due to failed genotyping are allowed, but assumed to be missing completely at random.

The observed genotype data from diploids can then be combined into a three-dimensional array X of size $I \times L \times 2.$ Thus, $1 \leq X_{i l m} \leq A_{l}$ is the mth (first or second) allele at a locus l in individual i. The data can then be reduced to sufficient statistics. Specifically, let $N = \{n_{i l a} : 1 \leq i \leq I, 1 \leq l \leq L, 1 \leq a \leq A_{l}\}$ be a jagged array with entry $n_{i l a},$ the number of alleles of type a observed at locus l in individual i.

Relatedness Under the Admixture Model

The admixture model (Pritchard et al. 2000b) posits that all the alleles in an individual are independent draws from a mixture of K subpopulations. Each subpopulation is characterized by its allele frequencies: $p_{k l a}$ is the frequency of allele a $(1 \leq a \leq A_{l})$ at locus l $(1 \leq l \leq L)$ in subpopulation k $(1 \leq k \leq K).$ Each unrelated individual is characterized by a particular mixture of the K subpopulations: each allele of individual i $(1 \leq i \leq I)$ is derived from subpopulation k with probability $η_{i k}.$ The parameters are constrained such that $\sum_{k = 1}^{K} η_{i k} = 1$ for each individual i, and $\sum_{a = 1}^{A_{l}} p_{k l a} = 1$ for each subpopulation k and locus l.

The likelihood of the observed multilocus genotype data, $N,$ given the parameters $Θ = \{η_{i k}, p_{k l a} : 1 \leq i \leq I, 1 \leq k \leq K, 1 \leq l \leq L, 1 \leq a \leq A_{l}\}$ under the admixture model is:

$$L (N | Θ) = \prod_{i = 1}^{I} \prod_{l = 1}^{L} \prod_{a = 1}^{A_{l}} {\left(\sum_{k = 1}^{K} η_{i k} p_{k l a}\right)}^{n_{i l a}} \tag{1}$$
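As an illustration of Equation 1, the following minimal sketch evaluates the log of the admixture likelihood for a toy data set. The function name, the nested-list layout of `counts`, `eta`, and `p`, and the toy values are all assumptions made for this example; they are not taken from MULTICLUST or InRelate.

```python
import math

def admixture_log_likelihood(counts, eta, p):
    """Log of Equation 1 (hypothetical helper, illustration only).

    counts[i][l][a]: copies of allele a observed at locus l in individual i
    eta[i][k]:       admixture proportion of individual i in subpopulation k
    p[k][l][a]:      frequency of allele a at locus l in subpopulation k
    """
    log_lik = 0.0
    for i, individual in enumerate(counts):
        for l, locus in enumerate(individual):
            for a, n_ila in enumerate(locus):
                if n_ila == 0:
                    continue  # unobserved allele contributes a factor of 1
                mix = sum(eta[i][k] * p[k][l][a] for k in range(len(p)))
                log_lik += n_ila * math.log(mix)
    return log_lik

# Toy data (assumed): 2 individuals, 1 locus with 2 alleles, K = 2.
counts = [[[2, 0]], [[1, 1]]]
eta = [[1.0, 0.0], [0.5, 0.5]]
p = [[[0.9, 0.1]], [[0.2, 0.8]]]
toy_log_lik = admixture_log_likelihood(counts, eta, p)
```

A locus with all counts equal to zero (a failed genotype) simply contributes nothing to the sum, which matches the missing-completely-at-random treatment described in the Data section.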

Relatives are then characterized by their shared alleles, i.e., shared alleles that are identical by descent (IBD). As shown in Figure 1, the four alleles at a locus in two diploid individuals can be in one of nine possible, unobserved IBD states, $Δ = \{D_{1}, D_{2}, ..., D_{9}\}.$ The marginal probability distribution over the IBD states for a pair of individuals at a locus is determined by their relationship. I use the notation $δ_{q} = P (D_{q})$ for this distribution. For example, in non-inbred populations, unrelated pairs are in state $D_{9}$ with probability $δ_{9} = 1,$ while full siblings will share no alleles at a locus with probability $δ_{9} = 0.25,$ one allele with probability $δ_{8} = 0.5,$ and both alleles with probability $δ_{7} = 0.25,$ assuming their parents are unrelated.

Figure 1. Nine possible Identity By Descent (IBD) states for the observed genotypes of two diploid individuals i and j at a genomic locus l. In each IBD state ($D_{1}$ to $D_{9}$), the alleles are connected by a line if they are IBD. Observed Identity By State (IBS) states are not shown.

We can only observe whether alleles are identical in state (IBS), and each IBD state is consistent with one or more of the nine IBS states, $S = \{S_{1}, S_{2}, ..., S_{9}\}.$ Methods of relatedness estimation use the IBS states observed at multiple, independent loci of two individuals to estimate δ, and hence their relationship.

Consider two individuals i and j. We observe their IBS state, $Y_{l} = (X_{i l 1}, X_{i l 2}, X_{j l 1}, X_{j l 2}),$ at each locus l, where $a_{1}$ Each $Y_{l}$ follows an observed configuration in $S,$ but the true IBD state, $Z_{l},$ is unobserved. Given a known relationship, $R,$ between i and j, the likelihood of the observed data is

$$P (Y | R) = \prod_{l = 1}^{L} P (Y_{l} | R) = \prod_{l = 1}^{L} \sum_{s = 1}^{9} P (Y_{l} | Z_{l} = D_{s}, R) P (D_{s} | R) \tag{2}$$

If two individuals were full siblings from parents from the same subpopulation, genetic relatedness estimated using ancestral subpopulation frequencies would be expected to account for deep descent, and potential inbreeding of the parents. The relatedness between these full siblings, estimated using the parameters of the admixture model, should be as close to the true value, i.e., $r_{X Y} = 0.5,$ as possible. On the other hand, if two individuals are full siblings from parents derived from two different subpopulations, genetic relatedness estimated using current subpopulation allele frequencies would likely be an over- or under-estimate, because the recent admixture event between the two parents in the previous generation is not accommodated. This parametrization permits defining conditional probabilities of IBS states, given their IBD state, sensu Jacquard (1972).

Following the leads of Jacquard (1972), Anderson and Weir (2007), and Wang (2011b), define the set of nine IBD states (see Figure 1), $\{D_{1}, D_{2}, ..., D_{9}\},$ at a diploid locus in two individuals, 1 and 2. Each IBD state is consistent with one or more of the nine possible IBS states, $\{S_{1}, S_{2}, ..., S_{9}\}.$ Under the above assumptions, the probability that an allele $a_{p}$ is observed at a locus l in individual i is $\sum_{k = 1}^{K} p_{k l a_{p}} η_{i k} = Z_{p i},$ the probability that an allele $a_{q}$ is observed at the same locus l in individual j is $\sum_{k = 1}^{K} p_{k l a_{q}} η_{j k} = Z_{q j},$ and so on. All the conditional probabilities, $P (S_{x} | D_{y}),$ are shown in Table 1. The likelihood of the IBD states over a single locus, $L (X | Δ),$ can be written as

$$L (X | Δ) = P (S_{x} | Δ) = \sum_{y = 1}^{9} P (S_{x} | D_{y}) Δ_{y} \tag{3}$$

where $Δ = \{Δ_{1}, ..., Δ_{9}\}$ is the set of coefficients of the 9 IBD states, X is the observed data, and $S_{x}$ is the observed IBS state of $x \in X.$ Over L independent loci, this likelihood can be written as a product of individual locus likelihoods as

$$L (X | Δ) = \prod_{l = 1}^{L} P (S_{x} | Δ) = \prod_{l = 1}^{L} \sum_{y = 1}^{9} P (S_{x} | D_{y}) Δ_{y} \tag{4}$$
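Equation 4 is a product over loci of nine-term mixtures, which is easiest to evaluate on the log scale. The sketch below assumes the conditional probabilities $P (S_{x} | D_{y})$ for the IBS state observed at each locus have already been looked up (for example from Table 1) and stored in a hypothetical `cond_probs` list; the numeric values are made up for illustration.

```python
import math

def ibd_log_likelihood(cond_probs, delta):
    """Log of Equation 4 (hypothetical helper, illustration only).

    cond_probs[l][y]: P(S_x | D_{y+1}) for the IBS state observed at locus l
    delta[y]:         IBD coefficient of state D_{y+1}; entries sum to 1
    """
    log_lik = 0.0
    for locus_probs in cond_probs:
        # Equation 3: mix the nine conditional probabilities for this locus.
        locus_lik = sum(c * d for c, d in zip(locus_probs, delta))
        log_lik += math.log(locus_lik)
    return log_lik

# Toy values (assumed): two loci, outbred full-sib coefficients.
cond_probs = [
    [0, 0, 0, 0, 0, 0, 0.2, 0.1, 0.05],
    [0, 0, 0, 0, 0, 0, 0.0, 0.0, 0.02],
]
full_sib_delta = [0, 0, 0, 0, 0, 0, 0.25, 0.5, 0.25]
toy_value = ibd_log_likelihood(cond_probs, full_sib_delta)
```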

This likelihood function can be maximized using the constraints that each IBD coefficient, $Δ_{y},$ $y \in \{1, ..., 9\},$ is $\geq 0$ and $\leq 1,$ and $\sum_{y = 1}^{9} Δ_{y} = 1.$ I used the solnp function in the Rsolnp package in R (Ghalanos and Theussl 2012), which implements the augmented Lagrange method of Ye (1988), to solve this nine-dimensional problem with linear constraints. The coancestry coefficient, $θ_{X Y},$ between two individuals, X and Y, can then be calculated as $θ_{X Y} = Δ_{1} + \frac{1}{2} (Δ_{3} + Δ_{5} + Δ_{7}) + \frac{1}{4} Δ_{8}$ and, by definition, the relatedness as $r_{X Y} = 2 θ_{X Y}.$ Note that $r_{X Y}$ is $\leq 1$ only if the population is outbred ($Δ_{j} = 0$ for $j = 1, ..., 6,$ and $Δ_{7}, Δ_{8}, Δ_{9} \neq 0$).
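The constrained maximization and the conversion to $θ_{X Y}$ and $r_{X Y}$ can be sketched as follows. Note the substitution: this example does not use the augmented Lagrange solver from Rsolnp; it swaps in a plain EM update for mixture weights, which keeps every coefficient in $[0, 1]$ and their sum at 1 by construction. The function names and the input layout (`cond_probs[l][y]` holding $P (S_{x} | D_{y})$ per locus) are assumptions for illustration, and the sketch assumes every locus has at least one IBD state with positive conditional probability.

```python
def estimate_delta_em(cond_probs, n_iter=500, tol=1e-10):
    """EM estimate of the IBD coefficients (stand-in for solnp, not InRelate)."""
    n_states = len(cond_probs[0])
    delta = [1.0 / n_states] * n_states  # start from the uniform point
    for _ in range(n_iter):
        new = [0.0] * n_states
        for locus_probs in cond_probs:
            weighted = [c * d for c, d in zip(locus_probs, delta)]
            total = sum(weighted)  # single-locus likelihood (Equation 3)
            for y in range(n_states):
                new[y] += weighted[y] / total  # posterior weight of state y
        new = [v / len(cond_probs) for v in new]
        change = max(abs(a - b) for a, b in zip(new, delta))
        delta = new
        if change < tol:
            break
    return delta

def coancestry(delta):
    """theta = D1 + (D3 + D5 + D7)/2 + D8/4, with delta[0] holding D1."""
    return delta[0] + 0.5 * (delta[2] + delta[4] + delta[6]) + 0.25 * delta[7]

def relatedness(delta):
    """r = 2 * theta."""
    return 2.0 * coancestry(delta)

# Outbred full siblings: (D7, D8, D9) = (0.25, 0.5, 0.25) gives r = 0.5.
full_sib_r = relatedness([0, 0, 0, 0, 0, 0, 0.25, 0.5, 0.25])
```

EM is used here only because it needs no external solver; it converges to a stationary point of the same likelihood but can be slow near the boundary of the simplex.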

Table 1. Conditional Probabilities $P (S_{p} | D_{q})$ .

Identity By Descent Mode
IBS Mode	Allelic State	$D_{1}$	$D_{2}$	$D_{3}$	$D_{4}$	$D_{5}$	$D_{6}$	$D_{7}$	$D_{8}$	$D_{9}$
$S_{1}$	$A_{i} A_{i}, A_{i} A_{i}$	$\frac{Z_{i 1} + Z_{i 2}}{2}$	$Z_{i 1} Z_{i 2}$	$Z_{i 1} Z_{i 2}$	$\frac{Z_{i 1} Z_{i 2}^{2} + Z_{i 1}^{2} Z_{i 2}}{2}$	$Z_{i 1} Z_{i 2}$	$\frac{Z_{i 1} Z_{i 2}^{2} + Z_{i 1}^{2} Z_{i 2}}{2}$	$Z_{i 1} Z_{i 2}$	$\frac{Z_{i 1}^{2} Z_{i 2} + Z_{i 1}^{2} Z_{i 1}}{2}$	$Z_{i 1}^{2} Z_{i 2}^{2}$
$S_{2}$	$A_{i} A_{i}, A_{j} A_{j}$	0	$\frac{Z_{i 1} Z_{j 2} + Z_{j 1} Z_{i 2}}{2}$	0	$\frac{Z_{i 1} Z_{j 2}^{2} + Z_{j 1} Z_{i 2}^{2}}{2}$	0	$\frac{Z_{i 1} Z_{j 2}^{2} + Z_{j 1} Z_{i 2}^{2}}{2}$	0	0	$\frac{Z_{i 1}^{2} Z_{j 2}^{2} + Z_{j 1}^{2} Z_{i 2}^{2}}{2}$
$S_{3}$	$A_{i} A_{i}, A_{i} A_{j}$	0	0	$\frac{Z_{i 1} Z_{j 2} + Z_{j 1} Z_{i 2}}{2}$	$\frac{Z_{i 1} Z_{i 2} Z_{j 2} + Z_{j 1} Z_{i 2} Z_{j 2}}{2}$	0	0	0	$\frac{Z_{i 1} Z_{i 2} Z_{j 2} + Z_{j 1} Z_{i 2} Z_{j 2}}{2}$	$\frac{Z_{i 1}^{2} Z_{i 2} Z_{j 2} + Z_{j 1}^{2} Z_{i 2} Z_{j 2}}{2}$
$S_{4}$	$A_{i} A_{i}, A_{j} A_{k}$	0	0	0	$\frac{Z_{i 1} Z_{j 2} Z_{k 2} + Z_{i 2} Z_{j 1} Z_{k 1}}{2}$	0	0	0	0	$\frac{Z_{i 1}^{2} Z_{j 2} Z_{k 2} + Z_{i 2}^{2} Z_{j 1} Z_{k 1}}{2}$
$S_{5}$	$A_{i} A_{j}, A_{i} A_{i}$	0	0	0	0	$\frac{Z_{i 1} Z_{j 2} + Z_{j 1} Z_{i 2}}{2}$	$\frac{Z_{i 1} Z_{i 2} Z_{j 2} + Z_{j 1} Z_{i 2} Z_{j 2}}{2}$	0	$\frac{Z_{i 1} Z_{i 2} Z_{j 2} + Z_{j 1} Z_{i 2} Z_{j 2}}{2}$	$\frac{Z_{i 1}^{2} Z_{i 2} Z_{j 2} + Z_{j 1}^{2} Z_{i 2} Z_{j 2}}{2}$
$S_{6}$	$A_{j} A_{k}, A_{i} A_{i}$	0	0	0	0	0	$\frac{Z_{i 1} Z_{j 2} Z_{k 2} + Z_{i 2} Z_{j 1} Z_{k 1}}{2}$	0	0	$\frac{Z_{i 1}^{2} Z_{j 2} Z_{k 2} + Z_{i 2}^{2} Z_{j 1} Z_{k 1}}{2}$
$S_{7}$	$A_{i} A_{j}, A_{i} A_{j}$	0	0	0	0	0	0	$\frac{Z_{i 1} Z_{j 2} + Z_{j 1} Z_{i 2}}{2}$	$0.5 * {Z_{j 1} Z_{j 2} \frac{Z_{i 1} + Z_{i 2}}{2} + Z_{i 1} Z_{i 2} \frac{Z_{j 1} + Z_{j 2}}{2}}$	$Z_{i 1} Z_{i 2} Z_{j 1} Z_{j 2}$
$S_{8}$	$A_{i} A_{j}, A_{i} A_{k}$	0	0	0	0	0	0	0	$\begin{array}{c} 0.5 * {\frac{Z_{i 1} + Z_{i 2}}{2} (Z_{j 1} Z_{k 2} + Z_{k 1} Z_{j 2})} \end{array}$	$0.5 * {Z_{i 1} Z_{i 2} (Z_{j 1} Z_{k 2} + Z_{j 2} Z_{k 1})}$
$S_{9}$	$A_{i} A_{j}, A_{k} A_{l}$	0	0	0	0	0	0	0	0	$\frac{Z_{i 1} Z_{j 1} Z_{k 2} Z_{l 2} + Z_{k 1} Z_{l 1} Z_{i 2} Z_{j 2}}{2}$


Other Relatedness Estimators

I also implemented the methods of Anderson and Weir (2007), and Wang (2011b) under the same optimization framework, using Rsolnp. The method of Wang (2011b) is different from that of Anderson and Weir (2007) in that it accounts for inbreeding. In both cases, subpopulation allele frequencies are modeled under the Dirichlet distribution, with the global parameter, θ, measured as the probability that two randomly sampled individuals from a subpopulation are IBD under an island model. Anderson and Weir (2007) do not state explicitly how they estimate θ, but Wang (2011b) indicates using the θ estimator of Weir and Cockerham (1984), which I used as well in the framework of Anderson and Weir (2007) (and Wang 2011b) to obtain comparable relatedness estimates. Regardless, under the equilibrium assumption that population subdivision is unchanging in time, the probability of drawing two copies of the same allele a at a locus from the same subpopulation is $p_{a} [θ + (1 - θ) p_{a}],$ where $p_{a}$ is the frequency of allele a at that locus. This leads into the same likelihood framework described above (Equations 3 and 4), for the estimators of Anderson and Weir (2007), and Wang (2011b). Anderson and Weir (2007) utilize a simplex method to obtain maximum likelihood estimates of the IBD coefficients, $Δ_{7},$ $Δ_{8}$ and $Δ_{9},$ using the constraints that $\sum_{i = 7}^{9} Δ_{i} = 1,$ $0 \leq Δ_{i} \leq 1,$ and $4 Δ_{7} Δ_{9} < Δ_{8}^{2},$ for large, non-inbred populations.
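To make the role of θ concrete, the sketch below evaluates two-allele sampling probabilities under the standard Dirichlet form, $p_{a} [θ + (1 - θ) p_{a}]$ for two copies of the same allele and $(1 - θ) p_{a} p_{b}$ for an ordered draw of two different alleles. Treating this standard form as the intended expression is an assumption on my part, and the function names and toy frequencies are hypothetical.

```python
def prob_same_allele(p_a, theta):
    """P(two draws are both allele a) = p_a * [theta + (1 - theta) * p_a]."""
    return p_a * (theta + (1.0 - theta) * p_a)

def prob_ordered_pair(p_a, p_b, theta):
    """P(first draw is a, second is b), for two different alleles a and b."""
    return (1.0 - theta) * p_a * p_b

# Toy check (assumed frequencies): the probabilities over all pairs sum to 1.
freqs = [0.5, 0.3, 0.2]
theta = 0.1
total = sum(prob_same_allele(p, theta) for p in freqs)
total += sum(
    prob_ordered_pair(pa, pb, theta)
    for i, pa in enumerate(freqs)
    for j, pb in enumerate(freqs)
    if i != j
)
```

With θ = 0 the first expression reduces to the Hardy-Weinberg value $p_{a}^{2}$, and with θ = 1 it reduces to $p_{a}$, which is the expected behavior at the two extremes of subdivision.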

Wang (2011b) offers another numerical solution, using Powell’s quadratically convergent method (Press 2007) to obtain likelihood estimates for all 9 variables above. Under the same population structure framework, and accounting for inbreeding using the inbreeding coefficient, θ, Wang (2011b) also derives moment estimators corresponding to other previously derived estimators (Queller and Goodnight 1989, Lynch and Ritland 1999, Wang 2002).

In this manuscript, the same non-linear programming method in 9 variables ($Δ_{i}, i \in \{1, 2, ..., 9\}$) was used to obtain maximum likelihood estimates for both estimators of Anderson and Weir (2007) and Wang (2011b). Genetic relatedness, $r_{X Y},$ and the coancestry coefficient, $θ_{X Y},$ were then calculated as before.

Other estimators that were compared include those of Queller and Goodnight (1989), Wang (2002), Lynch and Ritland (1999), Lynch (1988), Ritland (1996), Wang (2007), and Milligan (2003), as implemented in the program COANCESTRY (Wang 2011a). Note that none of the methods implemented in Wang (2011a) account for subpopulation structure (Table 2). However, all these methods accommodate multi-allelic data, which allows for equitable comparison with InRelate. The methods of Thornton et al. (2012) (REAP), Purcell et al. (2007) (PLINK), and Moltke and Albrechtsen (2013) (RelateADMIX), while more popular in recent years, are only applicable to di-allelic data (e.g., SNPs), and hence were not used for comparison in this manuscript.

Table 2. List of estimators tested and their references.

	Label	Reference	Accounts for structure?
1	AW2007	Anderson and Weir 2007	Yes
2	Wang2011	Wang 2011b	Yes
3	MC2013_WI	Sethuraman 2013 with inbreeding	Yes
4	MC2013	Sethuraman 2013	Yes
5	TrioML	Wang 2007	No
6	Wang2002	Wang 2002	No
7	LynchLi	Lynch 1988, Li et al. 1993	No
8	LynchRi	Lynch and Ritland 1999	No
9	Ritland	Ritland 1996	No
10	QuellerG	Queller and Goodnight 1989	No
11	DyadML	Milligan 2003	No


Bootstrapping and Pedigree Assignment

Under the assumption that sampled loci between two individuals X and Y are independent, we can obtain variance in estimation of relatedness by bootstrapping over loci. For every pair of individuals, loci are sampled with replacement to construct bootstrap replicates, and relatedness is estimated under the maximum likelihood framework. I then construct 95% confidence intervals of the estimated relatedness values. Simulated bootstrap standard errors are calculated as:

$$S E ({\hat{θ}}_{X Y}) = \sqrt{\frac{\sum_{b = 1}^{B} {({\hat{θ}}_{X Y, b} - {\bar{\hat{θ}}}_{X Y})}^{2}}{B - 1}} \tag{5}$$

where B is the number of bootstrap replicates, ${\hat{θ}}_{X Y, b}$ is the estimate from bootstrap replicate b, and:

$${\bar{\hat{θ}}}_{X Y} = \frac{1}{B} \sum_{b = 1}^{B} {\hat{θ}}_{X Y, b} \tag{6}$$
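The bootstrap over loci can be sketched as below. The `estimator` argument stands in for whatever function maps a list of per-locus data to a relatedness or coancestry estimate (in the paper, the maximum likelihood fit); here it is replaced by a simple mean of made-up per-locus values purely so that the example runs. The function name, argument names, and seed handling are assumptions.

```python
import random
import statistics

def bootstrap_se(locus_data, estimator, n_boot=200, seed=0):
    """Resample loci with replacement and return (SE, replicate estimates).

    statistics.stdev divides by B - 1, matching the bootstrap standard error
    defined above.
    """
    rng = random.Random(seed)
    n_loci = len(locus_data)
    estimates = []
    for _ in range(n_boot):
        resampled = [locus_data[rng.randrange(n_loci)] for _ in range(n_loci)]
        estimates.append(estimator(resampled))
    return statistics.stdev(estimates), estimates

# Placeholder per-locus values and estimator (assumed, illustration only).
toy_loci = [0.20, 0.30, 0.25, 0.35, 0.15, 0.25, 0.30, 0.20]
se, reps = bootstrap_se(toy_loci, statistics.fmean)
```

Each bootstrap replicate keeps the number of loci fixed at L, so the spread of `reps` reflects sampling variance across loci under the independence assumption stated above.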

These variance estimates are then used in a series of Wald Tests, compared to a normal distribution, to assign relatedness categories to each pair of relatedness estimates. The Wald Test statistic is calculated as:

$$\frac{{\hat{θ}}_{X Y} - θ_{0}}{S E ({\hat{θ}}_{X Y})} \tag{7}$$

After correcting for multiple testing by the Bonferroni method, pairs are assigned to a relatedness category at a p-value threshold of 0.05. Relatedness categories tested include: MonoZygotic twins - MZ, Full Siblings - FS, Half Siblings - HS, First Cousins - FC, Parent-Offspring - PO, Second Cousins - SC, AvunCular - AC, and UnRelated - UR.
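One way to read the assignment step is sketched below: compute the Wald statistic against the expected coancestry $θ_{0}$ of each category, convert it to a two-sided normal p-value, apply a Bonferroni correction, and keep the categories that are not rejected. The expected values in `COANCESTRY` are the standard outbred pedigree values (half the expected relatedness); the decision rule, the choice to correct across the eight categories for a single pair, and all names are assumptions rather than the InRelate implementation.

```python
import math

# Expected coancestry (theta_0) for outbred pedigrees; r = 2 * theta.
COANCESTRY = {
    "MZ": 0.5, "FS": 0.25, "PO": 0.25, "HS": 0.125,
    "AC": 0.125, "FC": 0.0625, "SC": 0.015625, "UR": 0.0,
}

def wald_p_value(theta_hat, theta_0, se):
    """Two-sided p-value of the Wald statistic against a standard normal."""
    z = (theta_hat - theta_0) / se
    return math.erfc(abs(z) / math.sqrt(2.0))

def compatible_categories(theta_hat, se, alpha=0.05):
    """Categories whose theta_0 is not rejected after Bonferroni correction."""
    n_tests = len(COANCESTRY)
    kept = []
    for label, theta_0 in COANCESTRY.items():
        p_adj = min(1.0, wald_p_value(theta_hat, theta_0, se) * n_tests)
        if p_adj >= alpha:
            kept.append(label)
    return kept

# Toy input (assumed): an estimate of 0.25 with a bootstrap SE of 0.01.
toy_assignment = compatible_categories(0.25, 0.01)
```

Because full siblings and parent-offspring pairs share the same expected coancestry, a test on $θ_{X Y}$ alone cannot separate them; the same holds for half siblings and avuncular pairs.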

Simulations

Five separate sets of multi-allelic genomic data were simulated to test the performance of relatedness estimates using InRelate (MC2013, hereon), against other estimators. In all scenarios, the subpopulations from which individuals were sampled were assumed to be the ‘true’ subpopulations, for comparison with other methods. Admixture proportions and subpopulation allele frequencies for all analyses were obtained by performing runs of MULTICLUST (Sethuraman 2013). MULTICLUST uses an EM algorithm to estimate parameters under the admixture model (Pritchard et al. 2000b), and extends the method of Alexander et al. (2009) for multi-allelic data. It is much faster than STRUCTURE, and does not have MCMC convergence and mixing problems. In all scenarios, convergence of the EM algorithm was assumed when the increase in log likelihood between iterations fell below $10^{- 6}.$

Scenario 1: Hierarchical Island Model:

Under Scenario 1, all initial allele frequencies were simulated at 50 diploid, codominant, multi-allelic (maximum of 50 allelic variants per locus), unlinked loci, using Easypop v.1.7 (Balloux 2001). The Hierarchical Island Model was used, wherein each total population (out of 3) is comprised of subpopulations, which are in turn comprised of smaller subpopulations. I varied the number of subpopulations (K) to be one of 3, 5, 10, or 15. To allow for genetic admixture, I specified relatively greater levels of gene flow of 0.01 total proportion of migrant females and males per generation, between subpopulations inside each population, and relatively lower gene flow of 0.001 total proportion of migrant females and males per generation, between populations. Subpopulation sizes of 25 males and 25 females per subpopulation were held constant across generations. A forward-time simulation was performed for 3000 generations, and I utilized the last generation’s allele frequency distribution for all further simulations. All populations at generation 3000 were tested for Hardy-Weinberg Equilibrium (HWE).

Scenario 2: Island Model:

Under Scenario 2, I simulated multi-allelic genomic data using the same demographic parameters as in scenario 1 using a single Island Model ( $K = 1$ ), with no migration.

Simulating related dyads:

I then simulated $k = 1000$ replicate dyads each of Parent-Offspring (PO), Full Siblings (FS), Half Siblings (HS), First Cousins (FC), and UnRelated (UR) individuals under different levels of known population subdivision ($K = 3, 5, 10, 15$). For FS dyads, two parents were randomly picked from the same subpopulation, and two offspring were created from their multilocus genotypes by randomly sampling one allele from each parent at every locus. Since these loci are unlinked, I did not explicitly account for IBD tract length distribution. For HS dyads, one shared parent and two other parents were simulated, and offspring were generated from each cross. For FC dyads, a pair of FS individuals was created first, then their mates were randomly picked from the same subpopulation, to create offspring from each cross. PO dyads were created similarly to the FS simulation, with two parents being sampled randomly from the same subpopulation to create an offspring, and one of the parents was sampled as part of the dyad. UR dyads were created by randomly sampling two individuals from the same subpopulation.
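The dyad construction described above amounts to Mendelian sampling at unlinked loci, sketched here for PO, FS, and HS dyads. As a simplifying assumption, parents are drawn directly from a table of allele frequencies rather than picked from an Easypop-simulated subpopulation; the function names and the toy frequencies are hypothetical.

```python
import random

def sample_genotype(allele_freqs, rng):
    """Draw a diploid genotype; allele_freqs[l][a] is the frequency of allele a."""
    return [
        tuple(rng.choices(range(len(freqs)), weights=freqs, k=2))
        for freqs in allele_freqs
    ]

def make_offspring(parent1, parent2, rng):
    """At each unlinked locus, inherit one random allele from each parent."""
    return [(rng.choice(g1), rng.choice(g2)) for g1, g2 in zip(parent1, parent2)]

def full_sib_dyad(allele_freqs, rng):
    mother = sample_genotype(allele_freqs, rng)
    father = sample_genotype(allele_freqs, rng)
    return make_offspring(mother, father, rng), make_offspring(mother, father, rng)

def half_sib_dyad(allele_freqs, rng):
    shared = sample_genotype(allele_freqs, rng)
    mate1 = sample_genotype(allele_freqs, rng)
    mate2 = sample_genotype(allele_freqs, rng)
    return make_offspring(shared, mate1, rng), make_offspring(shared, mate2, rng)

def parent_offspring_dyad(allele_freqs, rng):
    mother = sample_genotype(allele_freqs, rng)
    father = sample_genotype(allele_freqs, rng)
    return mother, make_offspring(mother, father, rng)

# Toy subpopulation (assumed): 20 loci, 3 alleles each.
rng = random.Random(1)
toy_freqs = [[0.2, 0.3, 0.5] for _ in range(20)]
parent, child = parent_offspring_dyad(toy_freqs, rng)
sib1, sib2 = full_sib_dyad(toy_freqs, rng)
```

First-cousin dyads follow the same pattern with one more generation: build a full-sib pair, draw a mate for each sibling, and call `make_offspring` once per cross.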

Admixture proportions and subpopulation allele frequencies were estimated using MULTICLUST (Sethuraman 2013) at the ‘true’ assumed number of subpopulations (i.e., $K = 1, 3, 5, 10,$ or $15$). These estimates were then used in determining pairwise genetic relatedness with InRelate.

I estimated $F_{st}$ using the geneclust package in R, and utilized those estimates in the same IBD-IBS framework in R to obtain pairwise relatedness by the methods of Anderson and Weir (2007) and Wang (2011b). The geneclust package implements the method of Weir and Cockerham (1984) to obtain a normalized multi-locus global θ estimate. For comparison with methods that did not account for population structure, I used the program COANCESTRY (Wang 2011a). Table 2 shows a summary of methods tested in this manuscript.

Scenario 3: Effect of number of loci:

To quantify the effect of increasing the number of genotyped loci on estimates of genetic relatedness, I simulated datasets under the same models specified in scenario 1 above, and the number of observed loci was varied between 10 and 40, to simulate a realistic scenario wherein individuals are genotyped at $< 50$ variant STR loci.

Scenario 4: Effect of method of estimating $F_{s t}$ :

Under scenario 4, I was interested in how the estimation of $F_{st}$ affected estimates of genetic relatedness using the methods of Anderson and Weir (2007) and Wang (2011b), against MC2013 (InRelate using all 9 IBD states), and MC2013WI (InRelate using only the last 3 IBD states, assuming an outbred population, sensu Moltke and Albrechtsen 2013). To study this, I simulated a total of 1000 individuals distributed among $K = 3$ subpopulations, genotyped at 50 STR loci ($\leq 50$ allelic states per locus), with a mutation rate of $1 \times 10^{- 6}$ mutations per generation, and a constant bidirectional migration rate of 0.001 of total individuals per generation, for 5000 generations. This gives a theoretical $F_{st} = \frac{1}{1 + 4 N m}$ of 0.2, while Weir and Cockerham’s normalized $Θ$ estimated at generation 5000 by geneclust was 0.1038. One hundred FS pairs were simulated from the generation 5000 population as described above. The allele frequency distribution of the generation 5000 population was used in estimating relatedness by the methods of Anderson and Weir (2007) and Wang (2011b). Admixture proportions and subpopulation allele frequencies for use by the MC2013 and MC2013WI estimators were obtained using MULTICLUST at $K = 3$ as before. To compare the performance of the methods of Anderson and Weir (2007) and Wang (2011b), I estimated relatedness under both methods using (a) the theoretical $F_{st}$ of 0.2, and (b) the estimated Weir and Cockerham $Θ_{st}$ of 0.1038.
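The theoretical value quoted above can be checked with a one-line calculation. The text does not state which N enters the formula; the sketch assumes N = 1000 (the total number of simulated individuals) together with m = 0.001, since that combination reproduces the stated value of 0.2. The function name is hypothetical.

```python
def island_fst(n_individuals, migration_rate):
    """Island-model expectation F_st = 1 / (1 + 4 N m)."""
    return 1.0 / (1.0 + 4.0 * n_individuals * migration_rate)

# Assumed inputs: N = 1000, m = 0.001, so 4Nm = 4 and F_st = 1/5.
theoretical_fst = island_fst(1000, 0.001)
```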

Scenario 5: Effect of label-switching:

Under scenario 5, I was interested in understanding how ‘label-switching’ affected estimates of genetic relatedness in methods that accounted for population structure. ‘Label-switching’ in this context refers to misclassification of individuals to subpopulations. To study this, I used the same dataset simulated for scenario 3, switched the labels of a fraction (0.1, 0.5, or 1.0) of the total population, and re-estimated Weir and Cockerham’s $Θ_{st}$ (Weir and Cockerham 1984), and genetic relatedness using the methods of Anderson and Weir (2007) and Wang (2011b). Since population assignment is not specified a priori for the MC2013 and MC2013WI methods, I used the same results obtained from scenario 3 for comparison with the methods of AW2007 and Wang2011.

Error and Bias:

Deviation from true relatedness was examined by calculating the Mean Square Error (MSE). MSE is measured as:

$$\mathrm{MSE} = \frac{1}{R} \sum_{i = 1}^{R} {({\hat{r}}_{i} - r_{\mathrm{true}})}^{2} \tag{8}$$

where R is the total number of replicate dyads (here 1000), ${\hat{r}}_{i}$ is the relatedness estimated using one of the above methods, and $r_{\mathrm{true}}$ is the true relatedness value, $r_{x y},$ which is 0.5 for PO and FS dyads, 0.25 for HS dyads, 0.125 for FC dyads, and 0.0 for UR dyads. Bias was calculated as the deviation of the mean for all $k = 1000$ replicates under each scenario from the true mean:

$${\bar{r}}_{\mathrm{true}} - {\bar{\hat{r}}} \tag{9}$$
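Both summaries are simple to compute from a list of replicate estimates. The sketch below follows the sign convention of Equation 9 (true value minus mean estimate); the function names and the three toy estimates are made up for illustration.

```python
def mean_squared_error(estimates, r_true):
    """Equation 8: average squared deviation from the true relatedness."""
    return sum((r - r_true) ** 2 for r in estimates) / len(estimates)

def bias(estimates, r_true):
    """Equation 9: true value minus the mean of the estimates."""
    return r_true - sum(estimates) / len(estimates)

# Toy replicate estimates for full-sib dyads (assumed), with r_true = 0.5.
toy_estimates = [0.4, 0.5, 0.6]
toy_mse = mean_squared_error(toy_estimates, 0.5)
toy_bias = bias(toy_estimates, 0.5)
```

Under this convention a positive bias means the estimator underestimates relatedness on average.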

Scenario 6: Bootstrapping:

For bootstrap analyses, I simulated another dataset from the above data set of $K = 3$ subpopulations, genotyped over 300 loci. I picked 5 dyads each of FS, HS, PO, FC, and UR individuals (total of 50 individuals). Bootstrap datasets (200 replicates) were then simulated, with 50 individuals each, by resampling loci with replacement. For each dataset, the true subpopulation structure was assumed to comprise $K = 3$ subpopulations. Admixture proportions and allele frequencies were computed using MULTICLUST (at $K = 3$), and relatedness was then estimated using InRelate. Relatedness category assignment was then performed using the procedure described above.

Scenario 7: HGDP-CEPH Data:

Rosenberg (2006) and several allied publications (also see Ramachandran et al. 2005, Rosenberg et al. 2006) describe the use of subsets of ‘unrelated’ individuals from the HGDP-CEPH Human Genome Diversity Cell Line Panel (H1048 Cann et al. 2002, Rosenberg et al. 2005). In these studies, relatedness was estimated between all pairs of individuals from within each sampled locale using RELPAIR (Boehnke and Cox 1997, Epstein et al. 2000), and several putatively related individuals from both within and across sampled locations were identified. For the purpose of this manuscript, I mined the original H1048 dataset for individuals reportedly related from within the African continent. The African continent was represented in this data set by 115 individuals, classified as Bantu (South Africa), Bantu (Kenya), Mandenka, Yoruba, San, Mbuti Pygmy, or Biaka Pygmy, and were genotyped at a total of 783 microsatellite loci. Average differentiation, measured as Nei’s $G_{s t}$ between these sampled locations was estimated as 0.1169, using the method of Nei and Chesser (1983), which indicates ‘moderate’ levels of differentiation (Wright 1950). I estimated population structure within these 115 individuals using MULTICLUST, at an a priori $K = 7.$ Admixture proportions and subpopulation allele frequencies were then obtained for the 24 relatedness dyads reported in Rosenberg et al. (2005), and I used these in estimating pairwise relatedness using InRelate. Allele frequencies were calculated assuming sampled locations as subpopulations, and used in estimating relatedness by the methods of Anderson and Weir (2007) and Wang (2011b) for comparison. Note that RELPAIR (Boehnke and Cox 1997) utilizes recombination information to obtain genetic relatedness, and is therefore very different from all the other methods compared in this manuscript. For the purpose of this comparison, I used RELPAIR estimates as the ‘truth’ to measure concordance with MC2013 and MC2013WI.
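For readers unfamiliar with $G_{s t}$, the basic quantity is $(H_{T} - H_{S}) / H_{T}$, where $H_{S}$ is the mean within-subpopulation gene diversity and $H_{T}$ is the gene diversity of the pooled population. The Python sketch below computes this uncorrected form for a single locus with equally weighted subpopulations. It is an illustration only: the estimator of Nei and Chesser (1983) additionally applies sample-size corrections, which are omitted here, and the function name and toy frequencies are assumptions.

```python
import numpy as np

def gst_single_locus(freqs):
    """Uncorrected G_st = (H_T - H_S) / H_T for one locus.

    freqs: array of shape (n_subpops, n_alleles); each row holds the allele
    frequencies of one subpopulation and sums to 1. Subpopulations are
    weighted equally (an assumption of this sketch)."""
    freqs = np.asarray(freqs, dtype=float)
    # Mean expected heterozygosity within subpopulations.
    h_s = np.mean(1.0 - np.sum(freqs ** 2, axis=1))
    # Expected heterozygosity of the pooled population.
    h_t = 1.0 - np.sum(np.mean(freqs, axis=0) ** 2)
    if h_t == 0.0:
        return 0.0  # locus is monomorphic overall; no differentiation defined
    return float((h_t - h_s) / h_t)

# Two toy subpopulations with strongly divergent frequencies at a bi-allelic locus.
print(gst_single_locus([[0.9, 0.1], [0.1, 0.9]]))
```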

Data Availability

All simulated data, and R scripts can be accessed at https://github.com/arunsethuraman/inrelate.

Results

Scenario 1: Hierarchical Island Model

In general, in all scenarios that measured genetic relatedness among FS, PO, and HS dyads, the InRelate estimator (MC2013) performed better than, or comparably with, the AW2007 (Anderson and Weir 2007) and Wang2011 (Wang 2011b) estimators (Figures 2, 3, 4, 5, 6). MC2013 estimates of FS and PO relatedness had the least bias compared to all other estimators. Interestingly, MC2013 and MC2013WI underestimated relatedness in FC and UR dyads when compared to the AW2007 and Wang2011 estimators. Distributions of estimated relatedness using MC2013 and MC2013WI are shown in Figures 7, 8, 9, 10, and 11.

Comparing (a) MSE and (b) Bias in estimates of genetic relatedness between 1000 Full Sib (FS) dyads with increasing degree of subpopulation structure. Number of subpopulations (K) here was varied between K = 3 to K = 15 under the hierarchical island model described in Scenario 1.

Comparing (a) MSE and (b) Bias in estimates of genetic relatedness between 1000 Half Sib (HS) dyads with increasing degree of subpopulation structure. Number of subpopulations (K) here was varied between K = 3 to K = 15 under the hierarchical island model described in Scenario 1.

Comparing (a) MSE and (b) Bias in estimates of genetic relatedness between 1000 Parent Offspring (PO) dyads with increasing degree of subpopulation structure. Number of subpopulations (K) here was varied between K = 3 to K = 15 under the hierarchical island model described in Scenario 1.

Comparing (a) MSE and (b) Bias in estimates of genetic relatedness between 1000 First Cousin (FC) dyads with increasing degree of subpopulation structure. Number of subpopulations (K) here was varied between K = 3 to K = 15 under the hierarchical island model described in Scenario 1.

Comparing (a) MSE and (b) Bias in estimates of genetic relatedness between 1000 UnRelated (UR) dyads with increasing degree of subpopulation structure. Number of subpopulations (K) here was varied between K = 3 to K = 15 under the hierarchical island model described in Scenario 1.

Distribution of estimates of genetic relatedness between 1000 Full Sib (FS) dyads with increasing degree of subpopulation structure using (a) MC2013, and (b) MC2013WI estimators implemented in InRelate. Number of subpopulations (K) here was varied between K = 3 to K = 15 under the hierarchical island model described in Scenario 1. True relatedness between full sibs = 0.5 is indicated using the dotted red line.

Distribution of estimates of genetic relatedness between 1000 Half Sib (HS) dyads with increasing degree of subpopulation structure using (a) MC2013, and (b) MC2013WI estimators implemented in InRelate. Number of subpopulations (K) here was varied between K = 3 to K = 15 under the hierarchical island model described in Scenario 1. True relatedness between half sibs = 0.25 is indicated using the dotted red line.

Distribution of estimates of genetic relatedness between 1000 Parent Offspring (PO) dyads with increasing degree of subpopulation structure using (a) MC2013, and (b) MC2013WI estimators implemented in InRelate. Number of subpopulations (K) here was varied between K = 3 to K = 15 under the hierarchical island model described in Scenario 1. True relatedness between parent-offspring dyads = 0.5 is indicated using the dotted red line.

Distribution of estimates of genetic relatedness between 1000 First Cousin (FC) dyads with increasing degree of subpopulation structure using (a) MC2013, and (b) MC2013WI estimators implemented in InRelate. Number of subpopulations (K) here was varied between K = 3 to K = 15 under the hierarchical island model described in Scenario 1. True relatedness between first cousins = 0.125 is indicated using the dotted red line.

Distribution of estimates of genetic relatedness between 1000 UnRelated (UR) dyads with increasing degree of subpopulation structure using (a) MC2013, and (b) MC2013WI estimators implemented in InRelate. Number of subpopulations (K) here was varied between K = 3 to K = 15 under the hierarchical island model described in Scenario 1. True relatedness between unrelated individuals = 0.0 is indicated using the dotted red line.

The other estimators that did not account for population structure consistently over-, or under-estimated genetic relatedness between dyads, with large mean squared errors (MSE). It was also noted (Wang 2011b) that all estimators that ignored population genetic structure had increasing bias, with an increase in the degree of population genetic structure, except in the inference of PO dyads, and UR dyads.

Correspondingly, MC2013 had the lowest MSE in the estimation of relatedness in FS, PO, and HS dyads, while the methods of AW2007 and Wang2011 had the lowest MSE for FC and UR dyads. The Ritland estimator (Ritland 2005), and the methods of Anderson and Weir (2007) and Wang (2011b), had the highest MSE for PO dyads, while the Ritland estimator had the highest MSE in all the cases. The estimators of Queller and Goodnight (1989) (QuellerG), Lynch and Ritland (1999) (LynchRi), and Wang (2007) (TrioML) performed similarly, with higher bias and MSE than MC2013. Also, the estimators of Ritland (1996) and Queller and Goodnight (1989) may have values $< 0$ or $> 1$, but these were not truncated to fall inside this range, as performed by Wang (2011b), in order to observe the true trend in estimation of relatedness.

Scenario 2: Island Model

In the absence of population structure, under a panmictic island model, all methods performed comparably, with low MSE and bias for FS, PO, and HS dyads. The method of Ritland (1996) had considerably higher MSE compared to all the other methods in the estimation of FS, PO, HS and FC dyads. The MC2013 and MC2013WI estimators had higher MSE and bias in determining relatedness between FC and UR dyads (see Figures 12, 13).

(a) MSE and (b) Bias in estimates of genetic relatedness between 1000 Full Sib (FS) dyads sampled from a panmictic population (K = 1) under Scenario 2, as described in the methods. Methods compared in this figure are those of Anderson and Weir (2007), Wang (2011b), Wang (2007), Wang (2002), Lynch (1988), and Lynch and Ritland (1999).

Scenario 3: Effect of Number of Loci

Bias and MSE of estimates of pairwise genetic relatedness in FS dyads decreased with an increase in the number of loci (Figures 14, 15, 16, 17, 18, 19) across all estimators at $K = 3,$ 5, and 10, indicating relatively better estimation with increased genotypic information. Estimates of relatedness at K = 3, 5 and 10 are shown in Figures 15, 17, and 19 respectively. In general, InRelate had the least bias and least MSE in estimation of FS dyads across different levels of available information, measured as a function of the number of loci, with and without accounting for inbreeding (Figures 14, 16, 18). The estimator that accounted for inbreeding (MC2013WI) outperformed all other estimators with the least bias and MSE in estimation of FS relatedness. All other estimators of relatedness, whether or not they accounted for subpopulation structure, showed a consistent decrease in bias and MSE with increase in the number of analyzed loci, as expected. The Ritland estimator was the least accurate, at $K = 3, 5, 10,$ across $L = 10, 20, 30, 40,$ followed by the estimators of Anderson and Weir (2007), and Wang (2011b).

Bias and Mean Squared Error in estimates of genetic relatedness between 1000 Full Sib (FS) dyads sampled from $K = 3$ subpopulations, simulated under Scenario 3, with increasing number of genotyped loci between L = 10 and L = 40.

(a) MC2013 and (b) MC2013WI estimates of genetic relatedness between 1000 Full Sib (FS) dyads sampled under Scenario 3 (K = 3), by varying the number of loci sampled between L = 10 to L = 40. True relatedness between Full Siblings = 0.5 is shown in the dotted red line.

Bias and Mean Squared Error in estimates of genetic relatedness between 1000 Full Sib (FS) dyads sampled from $K = 5$ subpopulations, simulated under Scenario 3, with increasing number of genotyped loci between L = 10 and L = 40.

(a) MC2013 and (b) MC2013WI estimates of genetic relatedness between 1000 Full Sib (FS) dyads sampled under Scenario 3 (K = 5), by varying the number of loci sampled between L = 10 to L = 40. True relatedness between Full Siblings = 0.5 is shown in the dotted red line.

Bias and Mean Squared Error in estimates of genetic relatedness between 1000 Full Sib (FS) dyads sampled from $K = 10$ subpopulations, simulated under Scenario 3, with increasing number of genotyped loci between L = 10 and L = 40.

(a) MC2013 and (b) MC2013WI estimates of genetic relatedness between 1000 Full Sib (FS) dyads sampled under Scenario 3 (K = 10), by varying the number of loci sampled between L = 10 to L = 40. True relatedness between Full Siblings = 0.5 is shown in the dotted red line.

Scenario 4: Effect of method of estimating $F_{s t}$

The methods of Anderson and Weir (2007) and Wang (2011b) have larger confidence intervals in estimating relatedness in FS dyads, with the $Θ_{s t}$ of Weir and Cockerham (1984) having lower deviation from the truth ( $r_{x y} = 0.5$ ), compared to the theoretical $F_{s t} .$ The MC2013 and MC2013WI methods outperform both methods with smaller confidence intervals around the mean (as shown in Figure 20).

Estimates of relatedness for 1000 FS dyads simulated under the panmictic island model (K = 1). (a) Estimates of relatedness under Scenario 4, where the method of estimating $F_{s t}$ was varied. MC1 denotes the method of MC2013, MC2 is MC2013 accounting for inbreeding, A1 is the method of Anderson and Weir (2007) using the estimated $Θ_{s t}$ of Weir and Cockerham (1984), W1 is the method of Wang (2011b) using estimated $Θ_{s t},$ and A2 and W2 denote the above methods using expected $F_{s t} .$ (b) Estimates of relatedness under Scenario 5, where the population IDs were shuffled to simulate ‘label switching’. MC1, MC2, A1 and W1 are the same as before. A2 and W2 are the methods of Anderson and Weir (2007) and Wang (2011b) respectively, with 0.1 proportion of labels shuffled, A3 and W3 have 0.5 proportion of labels shuffled, and A4 and W4 have 1.0 proportion of labels shuffled.

Scenario 5: Effect of ‘label-switching’

InRelate estimators do not have problems with ‘label-switching’, since population assignment is determined by the clustering method, and hence the ancestry proportions and allele frequencies are recomputed every time. On the other hand, both the methods of Anderson and Weir (2007) and Wang (2011b) show increased deviation from the true value ( $r_{x y} = 0.5$ ) when labels are switched, due to the erroneous computation of differentiation (see Figure 20).

Scenario 6: Bootstrapping

Of the 25 dyads (5 each of the FS, HS, PO, FC and UR categories), 44% (11 of 25) were correctly assigned after 200 bootstrap iterations. All five Parent-Offspring pairs were correctly assigned, while two each of the FS, FC, and UR pairs were correctly assigned. None of the HS pairs were significantly assigned to any category. Plots of confidence intervals around estimates using both the MC2013 and MC2013WI estimators are shown in Figure 21.

(a) MC2013 and (b) MC2013WI relatedness estimates and confidence intervals for 5 different relatedness categories, constructed using 200 bootstrap replicates under Scenario 6. The simulation used K = 3 subpopulations, and a total of 5 dyads of FS, HS, PO, FC, and UR individuals were picked.

Scenario 7: HGDP-CEPH Panel

Across 24 dyads which were either identified as FS, HS (or Avuncular), or PO by Rosenberg et al. (2005), the MC2013 and MC2013WI estimators outperformed the methods of Anderson and Weir (2007) and Wang (2011b) (see Figure 22), with consistently lower bias (MC2013WI - mean bias = 0.0114 (sd = 0.0667), MC2013 - mean bias = 0.0114 (sd = 0.0667), AW2007 - mean bias = -0.2857 (sd = 0.0779), Wang2011 - mean bias = -0.3204 (sd = 0.0895)). The MSE was also considerably lower (MC2013WI - 0.0044, MC2013 - 0.0044, AW2007 - 0.0874, Wang2011 - 0.1103) when comparing the MC2013 estimators with AW2007 and Wang2011. As reported before, these populations are ‘moderately’ differentiated (with a $G_{s t}$ = 0.1169), and have historically been reported to have significant levels of gene flow or admixture, as well as exhibiting serial founder effects (see Tishkoff et al. 2009, Ramachandran et al. 2005).

Relatedness estimates between 24 related dyads sampled from 6 locations in Africa, which were previously reported to be either FS, HS (or avuncular), or PO dyads by Rosenberg *et al.* (2005) using RELPAIR (Boehnke and Cox 1997). The RELPAIR estimates are plotted as the ‘True’ estimate, while other estimators compared are those of MC2013, MC2013WI, AW2007, and Wang2011.

Discussion

The presence of ancestral subpopulation structure affects estimates of pairwise genetic relatedness between individuals from the same subpopulation, owing to pervasive inbreeding, and non-random mating in recent ancestral generations.

The primary goal of this paper was to develop a maximum-likelihood framework using an alternate parametrization, to estimate pairwise genetic relatedness between two individuals X and Y, while accounting for the ‘true’ genetic subpopulation structure in the population. This ‘true’ genetic subpopulation structure is unobserved and can be inferred from the data. Since the proposal of an admixture model by Pritchard et al. (2000b), several tools have been developed to estimate subpopulation structure, primarily to infer the number of subpopulations, K, admixture proportions (here $η_{i k}$ ), and subpopulation allele frequencies, $p_{k l a} .$ These estimates have been applied widely, including to infer ancestral migration patterns (e.g., Rosenberg et al. 2002, Eriksson and Manica 2012), in association studies (e.g., Collins-Schramm et al. 2002), and to inform conservation decisions (see Allendorf et al. 2010). InRelate uses inferred information from population structure studies (using methods such as STRUCTURE (Pritchard et al. 2000b) or MULTICLUST (Sethuraman 2013); see Liu et al. 2013) to inform the estimation of relatedness.

Across my simulations and analyses of the HGDP-CEPH African datasets, InRelate estimators of relatedness (MC2013 and MC2013WI) outperform several previously developed methods for relatedness estimation in admixed populations, with considerably less error and bias. This accuracy is particularly pronounced between pairs of full siblings, parent-offspring, or half-siblings. The previously developed methods of Anderson and Weir (2007) and Wang (2011b) outperform InRelate in estimating first cousins or unrelated dyads in my simulations. As noted by Anderson and Weir (2007), estimates of relatedness in unrelated individuals are upwardly biased by all methods (see Figure 6). I surmise this result is an artifact of ignoring subpopulation structure, in the presence of undetected ancient admixture, which results in an upward bias for all estimates. While MC2013 and MC2013WI account for this by using estimated subpopulation allele frequencies, the other estimators (AW2007 (Anderson and Weir 2007) and Wang2011 (Wang 2011b)) approximate it by using current allele frequencies, estimated from sampled populations.

Of note though are general difficulties in estimation of relatedness between first cousins, second cousins, and other more distantly related or unrelated pairs. These are also seen and reported by other likelihood methods (see Thompson 1975, Anderson and Weir 2007, Wang 2011b, Konovalov and Heg 2008), other estimators that use summary statistics (see Lynch and Ritland 1999, Blouin 2003, Anderson and Weir 2007, Wang 2002), and methods that utilize linkage or recombination information (see Pemberton et al. 2013, Rosenberg et al. 2005). This is primarily due to the fact that the most predominant relationship between two individuals is usually inferred, while the historical relatedness, due to evolutionary demographic processes, between them is ignored by most methods. Methods that account for this ‘deep’ relatedness are yet to be devised, and could help resolve issues with estimating deeper pedigrees, and relatedness between individuals. Wang (2011b) also notes this bias in estimating relatedness values close to the lower bound of 0 in the methods of Anderson and Weir (2007) and Wang (2011b).

Varying the number of loci minimally affects all relatedness estimators. This outcome may derive from variation in allele frequencies being sufficiently explained by the parameters of the admixture model (admixture proportions and subpopulation allele frequencies), as against biasing all estimates using a single non-varying parameter, θ (or $F_{s t}$ , sensu Anderson and Weir 2007 and Wang 2011b). Several methods can estimate this coefficient θ and each method has its own biases and efficiencies. This approach could potentially cause increased bias and MSE in using the estimators of Anderson and Weir (2007) and Wang (2011b), which could be addressed by utilizing a population structuring method to assign individuals to subpopulations, conditioning on that population structure in estimating θ. Regardless, increasing the number of sampled loci decreased bias of all estimators, as expected.

InRelate estimators do not have problems with ‘label-switching’, since the subpopulation structure is inferred from the data and not assumed a priori as in all other methods. Correspondingly, all allele frequencies and ancestry proportions are re-calculated with switched labels, and are then used in estimates of relatedness. While all my analyses have inferred admixture proportions at the assumed ‘true’ subpopulation structure (i.e., K), perhaps the true utility of this method would be if this K was inferred from the data, and the corresponding inferred admixture proportions and allele frequencies used in the estimation of relatedness. However, this is a statistical problem (Pritchard et al. 2000b, Falush et al. 2003, Hubisz et al. 2009, Sethuraman 2013, Alexander et al. 2009), with estimates of subpopulation allele frequencies and ancestry proportions confounded by (1) different demographic histories (Falush et al. 2016), (2) overparametrization, and a general improvement in the likelihood with increasing the parameter K (Evanno et al. 2005), and (3) issues with label switching (Jakobsson and Rosenberg 2007). InRelate and the method of Moltke and Albrechtsen (2013) are hence both affected by the ‘accuracy’ of estimates of structure and admixture parameters.

InRelate methods are of best utility when dealing with multi-allelic data, generated from individuals that are sampled from populations that are ancestrally structured, and generally outdo the methods of Anderson and Weir (2007), and Wang (2011b), which are both relatedness estimators under similar models. InRelate also does not require linkage maps, which makes it more utilitarian for estimating relatedness in non-model systems that don’t have detailed genomic information. I have also shown that InRelate outperforms all the methods implemented in the COANCESTRY (Wang 2011a) software, since all these methods do not account for ancestral population structure. However, the RelateAdmix method of Moltke and Albrechtsen (2013), which has been shown to outperform the methods of REAP (Thornton et al. 2012), PLink (Purcell et al. 2007), and KING (Manichaikul et al. 2010) is more applicable when analyzing SNP (di-allelic) data, generated from non-inbred populations that are recently admixed. When the underlying demographic history of the sampled individuals is unknown (or difficult to estimate), methods that are model-free, such as PC-Relate (Conomos et al. 2016) are bound to perform better (summarized in Ramstetter et al. 2017).

Acknowledgments

AS designed the method, wrote the code, performed all simulations, analyses, and wrote the paper. This work was part of AS’s doctoral thesis, and he would like to thank his doctoral co-advisors, Karin S Dorman and Fredric J Janzen for their guidance and help throughout the process. AS continues to work with KSD on improvements to InRelate. More recently, completion of this project was made possible by an NSF ABI Development Grant 1564659 to AS. All analyses reported were performed on HPC facilities at ISU, CSUSM, and Temple University.

Footnotes

Communicating editor: J. Fay

Literature Cited

Alexander D. H., Novembre J., Lange K., 2009. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 19: 1655–1664. 10.1101/gr.094052.109 [DOI] [PMC free article] [PubMed] [Google Scholar]
Allendorf F. W., Hohenlohe P. A., Luikart G., 2010. Genomics and the future of conservation genetics. Nat. Rev. Genet. 11: 697–709. 10.1038/nrg2844 [DOI] [PubMed] [Google Scholar]
Anderson A. D., Weir B. S., 2007. A maximum-likelihood method for the estimation of pairwise relatedness in structured populations. Genetics 176: 421–440. 10.1534/genetics.106.063149 [DOI] [PMC free article] [PubMed] [Google Scholar]
Avise J. C., 2001. DNA-based Profiling of Mating Systems and Reproductive Behaviors in Poikilothermic Vertebrates: AGA Symposium Issue, Yale University, New Haven, Connecticut, June 17–20, 2000. Oxford University Press. 10.1093/jhered/92.2.99 [DOI] [PubMed] [Google Scholar]
Balloux F., 2001. EASYPOP (version 1.7): a computer program for population genetics simulations. J. Hered. 92: 301–302. 10.1093/jhered/92.3.301 [DOI] [PubMed] [Google Scholar]
Blouin M., 2003. Dna-based methods for pedigree reconstruction and kinship analysis in natural populations. Trends Ecol. Evol. 18: 503–511. 10.1016/S0169-5347(03)00225-8 [DOI] [Google Scholar]
Boehnke M., Cox N. J., 1997. Accurate inference of relationships in sib-pair linkage studies. Am. J. Hum. Genet. 61: 423–429. 10.1086/514862 [DOI] [PMC free article] [PubMed] [Google Scholar]
Cann H. M., de Toma C., Cazes L., Legrand M. F., Morel V., et al. , 2002. A human genome diversity cell line panel. Science 296: 261–262. 10.1126/science.296.5566.261b [DOI] [PubMed] [Google Scholar]
Coleman S. W., Jones A. G., 2011. Patterns of multiple paternity and maternity in fishes. Biol. J. Linn. Soc. Lond. 103: 735–760. 10.1111/j.1095-8312.2011.01673.x [DOI] [Google Scholar]
Collins-Schramm H. E., Phillips C. M., Operario D. J., Lee J. S., Weber J. L., et al. , 2002. Ethnic-difference markers for use in mapping by admixture linkage disequilibrium. Am. J. Hum. Genet. 70: 737–750. 10.1086/339368 [DOI] [PMC free article] [PubMed] [Google Scholar]
Conomos M. P., Reiner A. P., Weir B. S., Thornton T. A., 2016. Model-free estimation of recent genetic relatedness. Am. J. Hum. Genet. 98: 127–148. 10.1016/j.ajhg.2015.11.022 [DOI] [PMC free article] [PubMed] [Google Scholar]
Dempster A. P., Laird N. M., Rubin D. B., 1977. Maximum likelihood from incomplete data via em algorithm. J. R. Stat. Soc. B 39: 1–38. [Google Scholar]
Epstein M. P., Duren W. L., Boehnke M., 2000. Improved inference of relationship for pairs of individuals. Am. J. Hum. Genet. 67: 1219–1231. 10.1016/S0002-9297(07)62952-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
Eriksson A., Manica A., 2012. Effect of ancient population structure on the degree of polymorphism shared between modern human populations and ancient hominins. Proc. Natl. Acad. Sci. USA 109: 13956–13960. 10.1073/pnas.1200567109 [DOI] [PMC free article] [PubMed] [Google Scholar]
Evanno G., Regnaut S., Goudet J., 2005. Detecting the number of clusters of individuals using the software structure: a simulation study. Mol. Ecol. 14: 2611–2620. 10.1111/j.1365-294X.2005.02553.x [DOI] [PubMed] [Google Scholar]
Falconer, D. S., and T. F. C. Mackay, 1996 Introduction to Quantitative Genetics, Ed 4. Longmans Green, Harlow, Essex, UK. [Google Scholar]
Falush D., Stephens M., Pritchard J. K., 2003. Inference of population structure using multilocus genotype data: Linked loci and correlated allele frequencies. Genetics 164: 1567–1587. [DOI] [PMC free article] [PubMed] [Google Scholar]
Falush D., Stephens M., Pritchard J. K., 2007. Inference of population structure using multilocus genotype data: dominant markers and null alleles. Mol. Ecol. Notes 7: 574–578. 10.1111/j.1471-8286.2007.01758.x [DOI] [PMC free article] [PubMed] [Google Scholar]
Falush D., van Dorp L., Lawson D., 2016. A tutorial on how (not) to over-interpret structure/admixture bar plots. bioRxiv 066431. [DOI] [PMC free article] [PubMed] [Google Scholar]
Ghalanos A., Theussl S., 2012. Rsolnp: general non-linear optimization using augmented lagrange multiplier method. R package version 1.
Hubisz M. J., Falush D., Stephens M., Pritchard J. K., 2009. Inferring weak population structure with the assistance of sample group information. Mol. Ecol. Resour. 9: 1322–1332. 10.1111/j.1755-0998.2009.02591.x [DOI] [PMC free article] [PubMed] [Google Scholar]
Jacquard A., 1972. Genetic information given by a relative. Biometrics 28: 1101–1114. 10.2307/2528643 [DOI] [PubMed] [Google Scholar]
Jakobsson M., Rosenberg N. A., 2007. Clumpp: a cluster matching and permutation program for dealing with label switching and multimodality in analysis of population structure. Bioinformatics 23: 1801–1806. 10.1093/bioinformatics/btm233 [DOI] [PubMed] [Google Scholar]
Konovalov D. A., Heg D., 2008. TECHNICAL ADVANCES: A maximum-likelihood relatedness estimator allowing for negative relatedness values. Mol. Ecol. Resour. 8: 256–263. 10.1111/j.1471-8286.2007.01940.x [DOI] [PubMed] [Google Scholar]
Li C. C., Weeks D. E., Chakravarti A., 1993. Similarity of dna fingerprints due to chance and relatedness. Hum. Hered. 43: 45–52. 10.1159/000154113 [DOI] [PubMed] [Google Scholar]
Liu Y., Nyunoya T., Leng S., Belinsky S. A., Tesfaigzi Y., et al. , 2013. Softwares and methods for estimating genetic ancestry in human populations. Hum. Genomics 7: 1 10.1186/1479-7364-7-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
Lynch M., 1988. Estimation of relatedness by dna fingerprinting. Mol. Biol. Evol. 5: 584–599. [DOI] [PubMed] [Google Scholar]
Lynch M., Ritland K., 1999. Estimation of pairwise relatedness with molecular markers. Genetics 152: 1753–1766. [DOI] [PMC free article] [PubMed] [Google Scholar]
Manichaikul A., Mychaleckyj J. C., Rich S. S., Daly K., Sale M., et al. , 2010. Robust relationship inference in genome-wide association studies. Bioinformatics 26: 2867–2873. 10.1093/bioinformatics/btq559 [DOI] [PMC free article] [PubMed] [Google Scholar]
Milligan B. G., 2003. Maximum-likelihood estimation of relatedness. Genetics 163: 1153–1167. [DOI] [PMC free article] [PubMed] [Google Scholar]
Moltke I., Albrechtsen A., 2013. Relateadmix: a software tool for estimating relatedness between admixed individuals. Bioinformatics 30: 1027–1028. 10.1093/bioinformatics/btt652 [DOI] [PubMed] [Google Scholar]
Nei M., Chesser R. K., 1983. Estimation of fixation indices and gene diversities. Ann. Hum. Genet. 47: 253–259. 10.1111/j.1469-1809.1983.tb00993.x [DOI] [PubMed] [Google Scholar]
Oliehoek P. A., Windig J. J., van Arendonk J. A. M., Bijma P., 2006. Estimating relatedness between individuals in general populations with a focus on their use in conservation programs. Genetics 173: 483–496. 10.1534/genetics.105.049940 [DOI] [PMC free article] [PubMed] [Google Scholar]
Pearse D. E., Janzen F. J., Avise J. C., 2002. Multiple paternity, sperm storage, and reproductive success of female and male painted turtles (chrysemys picta) in nature. Behav. Ecol. Sociobiol. 51: 164–171. 10.1007/s00265-001-0421-7 [DOI] [Google Scholar]
Pemberton T. J., DeGiorgio M., Rosenberg N. A., 2013. Population structure in a comprehensive genomic data set on human microsatellite variation. G3 (Bethesda) 3: 891–907. 10.1534/g3.113.005728 [DOI] [PMC free article] [PubMed] [Google Scholar]
Press W. H., 2007. Numerical recipes 3rd edition: The art of scientific computing. Cambridge university press, Cambridge, United Kingdom. [Google Scholar]
Pritchard J., Stephens M., Rosenberg N., Donnelly P., 2000a. Association mapping in structured populations. Am. J. Hum. Genet. 67: 170–181. doi: 10.1086/302959
Pritchard J. K., Stephens M., Donnelly P., 2000b. Inference of population structure using multilocus genotype data. Genetics 155: 945–959.
Purcell S., Neale B., Todd-Brown K., Thomas L., Ferreira M. A., et al., 2007. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81: 559–575. doi: 10.1086/519795
Queller D. C., Goodnight K. F., 1989. Estimating relatedness using genetic-markers. Evolution 43: 258–275. doi: 10.1111/j.1558-5646.1989.tb04226.x
Ramachandran S., Deshpande O., Roseman C. C., Rosenberg N. A., Feldman M. W., et al., 2005. Support from the relationship of genetic and geographic distance in human populations for a serial founder effect originating in Africa. Proc. Natl. Acad. Sci. USA 102: 15942–15947. doi: 10.1073/pnas.0507611102
Ramstetter M. D., Dyer T., Lehman D. M., Curran J. E., Duggirala R., et al., 2017. Benchmarking relatedness inference methods with genome-wide data from thousands of relatives. Genetics 207: 75–82. doi: 10.1534/genetics.117.1122
Ritland K., 1996. Estimators for pairwise relatedness and individual inbreeding coefficients. Genet. Res. 67: 175–186. doi: 10.1017/S0016672300033620
Ritland K., 2005. Multilocus estimation of pairwise relatedness with dominant markers. Mol. Ecol. 14: 3157–3165. doi: 10.1111/j.1365-294X.2005.02667.x
Rosenberg N. A., 2006. Standardized subsets of the HGDP-CEPH human genome diversity cell line panel, accounting for atypical and duplicated samples and pairs of close relatives. Ann. Hum. Genet. 70: 841–847. doi: 10.1111/j.1469-1809.2006.00285.x
Rosenberg N. A., Mahajan S., Gonzalez-Quevedo C., Blum M. G., Nino-Rosales L., et al., 2006. Low levels of genetic divergence across geographically and linguistically diverse populations from India. PLoS Genet. 2: e215. doi: 10.1371/journal.pgen.0020215
Rosenberg N. A., Mahajan S., Ramachandran S., Zhao C. F., Pritchard J. K., et al., 2005. Clines, clusters, and the effect of study design on the inference of human population structure. PLoS Genet. 1: e70. doi: 10.1371/journal.pgen.0010070
Rosenberg N. A., Pritchard J. K., Weber J. L., Cann H. M., Kidd K. K., et al., 2002. Genetic structure of human populations. Science 298: 2381–2385. doi: 10.1126/science.1078311
Sethuraman A., 2013. On inferring and interpreting genetic population structure-applications to conservation, and the estimation of pairwise genetic relatedness.
Thompson E. A., 1975. Estimation of pairwise relationships. Ann. Hum. Genet. 39: 173–188. doi: 10.1111/j.1469-1809.1975.tb00120.x
Thornton T., Tang H., Hoffmann T. J., Ochs-Balcom H. M., Caan B. J., et al., 2012. Estimating kinship in admixed populations. Am. J. Hum. Genet. 91: 122–138. doi: 10.1016/j.ajhg.2012.05.024
Tishkoff S. A., Reed F. A., Friedlaender F. R., Ehret C., Ranciaro A., et al., 2009. The genetic structure and history of Africans and African Americans. Science 324: 1035–1044. doi: 10.1126/science.1172257
Visscher P. M., Hill W. G., Wray N. R., 2008. Heritability in the genomics era - concepts and misconceptions. Nat. Rev. Genet. 9: 255–266. doi: 10.1038/nrg2322
Wang J., 2007. Triadic IBD coefficients and applications to estimating pairwise relatedness. Genet. Res. 89: 135–153. doi: 10.1017/S0016672307008798
Wang J., 2011a. Coancestry: a program for simulating, estimating and analysing relatedness and inbreeding coefficients. Mol. Ecol. Resour. 11: 141–145. doi: 10.1111/j.1755-0998.2010.02885.x
Wang J., 2011b. Unbiased relatedness estimation in structured populations. Genetics 187: 887–901. doi: 10.1534/genetics.110.124438
Wang J., 2018. Effects of sampling close relatives on some elementary population genetics analyses. Mol. Ecol. Resour. 18: 41–54. doi: 10.1111/1755-0998.12708
Wang J. L., 2002. An estimator for pairwise relatedness using molecular markers. Genetics 160: 1203–1215.
Weir B., 1994. The effects of inbreeding on forensic calculations. Annu. Rev. Genet. 28: 597–621. doi: 10.1146/annurev.ge.28.120194.003121
Weir B., 2004. Matching and partially-matching DNA profiles. J. Forensic Sci. 49: 1009–1014. doi: 10.1520/JFS2003039
Weir B. S., Anderson A. D., Hepler A. B., 2006. Genetic relatedness analysis: modern data and new challenges. Nat. Rev. Genet. 7: 771–780. doi: 10.1038/nrg1960
Weir B. S., Cockerham C. C., 1984. Estimating F-statistics for the analysis of population-structure. Evolution 38: 1358–1370.
Wright S., 1950. Genetical structure of populations. Nature 166: 247–249. doi: 10.1038/166247a0
Ye Y., 1988. Interior algorithms for linear, quadratic, and linearly constrained convex programming.
Yue G. H., Chang A., 2010. Molecular evidence for high frequency of multiple paternity in a freshwater shrimp species Caridina ensifera. PLoS One 5: e12721. doi: 10.1371/journal.pone.0012721

Data Availability Statement

All simulated data and R scripts can be accessed at https://github.com/arunsethuraman/inrelate.

