Summary
The application of genetic relationships among individuals, characterized by a genetic relationship matrix (GRM), has far-reaching effects in human genetics. However, the current standard to calculate the GRM treats linked markers as independent and does not explicitly model the underlying genealogical history of the study sample. Here, we propose a coalescent-informed framework, namely the expected GRM (eGRM), to infer the expected relatedness between pairs of individuals given an ancestral recombination graph (ARG) of the sample. Through extensive simulations, we show that the eGRM is an unbiased estimate of latent pairwise genome-wide relatedness and is robust when computed with ARG inferred from incomplete genetic data. As a result, the eGRM better captures the structure of a population than the canonical GRM, even when using the same genetic information. More importantly, our framework allows a principled approach to estimate the eGRM at different time depths of the ARG, thereby revealing the time-varying nature of population structure in a sample. When applied to SNP array genotypes from a population sample from Northern and Eastern Finland, we find that clustering analysis with the eGRM reveals population structure driven by subpopulations that would not be apparent via the canonical GRM and that temporally the population model is consistent with recent divergence and expansion. Taken together, our proposed eGRM provides a robust tree-centric estimate of relatedness with wide application to genetic studies.
Keywords: ancestral recombination graph, genetic relationship matrix, population structure
Introduction
Genetic relationships among individuals, commonly characterized by a genetic relationship matrix (GRM), have fueled major advances in modern human genetics. Applications of the GRM include the detection of population structure,1,2 adjustment for shared genetic backgrounds in genome-wide association testing,3, 4, 5, 6, 7, 8, 9, 10 and heritability estimation.11 Historically, genetic relationships across pairs of individuals in a known pedigree were estimated with the expected proportion of co-inherited alleles, which neglects the variance in the distribution of alleles from meiosis.12, 13, 14, 15 The advent of high-throughput genomics has enabled estimating pairwise relationships directly from genotype data without the need to rely on expectations determined from an inheritance model.16,17
The current standard to calculate the GRM is based on computing a weighted expectation across genotyped variants (i.e., identity-by-state or IBS).11,12,14 While straightforward to compute, this approach generally does not utilize linkage information between markers (though see also Patterson et al.,4 Visscher et al.,16 Speed et al.,18 Meuwissen et al.,19 Hickey et al.,20 Luan et al.,21 Sell et al.,22 and Han and Abney23). The canonical GRM does not explicitly or fully model the shared genealogical histories that connect everyone in a population, assumes independence among potentially linked genotyped markers, and inadequately reflects the contribution of untyped markers to relatedness.24, 25, 26 Thus, genome-wide IBS-based relatedness is sensitive to ascertainment biases of genetic variation and only partially captures individuals’ relationships compared with relatedness based on the underlying genealogical history of the population. An identity-by-descent (IBD)-based GRM could incorporate linkage information to infer finer-scale genetic relationships underlying the structure or demographic history of the study population. However, current bioinformatic approaches estimating shared IBD segments are subject to technical and methodological constraints that effectively limit the resolution of inferred relatedness to only the most recent branches nearing the tips of the local genealogical or coalescent trees (i.e., over the last 50–100 generations).12, 13, 14,27, 28, 29, 30 Because of its methodological simplicity, the canonical, IBS-based GRM continues to be the standard in statistical genetics despite its shortcomings.12,14 Nevertheless, these shortcomings motivate the search for an approach that better captures the genealogical relatedness in a population sample.
In this study, we describe a model for pairwise relatedness by using a coalescent-based framework relating everyone in a population sample. Given a coalescent tree at a locus, we define relatedness between individuals by tracing the tree backward to a single common ancestor. The locus-specific tree provides generalized IBD information across the population sample, unlike conventional definitions of IBD that are only defined in multi-generational pedigrees or are restricted to recent branches of the tree in the form of detectable IBD segments. The entire genealogy of the DNA sequence of a sample of individuals can be represented by a series of coalescent trees connected through recombination events. This encoded structure is referred to as the ancestral recombination graph31,32 (ARG). The ARG carries substantial linkage information, as mutations on the same branch are by definition linked, and historical recombination events are encoded across the sequence of trees. In practice, the ARG is inferred through haplotypic linkage that exists in genetic data. As such, a genealogical measure of relatedness conditioned on the ARG can exploit linkage information that is commonly ignored in the canonical GRM.
Here, we devise a coalescent-based framework to estimate the expected genetic relatedness, or the eGRM, for pairs of individuals given the ARG of the population and evaluate our framework by using empirically reconstructed ARGs. Conceptually, the eGRM is based on the expected number of mutations occurring randomly on each branch of the ARG33 rather than on directly genotyped variants. Notably, each element in such an eGRM matrix can be seen as a special case of the “branch statistics” as previously proposed in a general framework of ARG statistics.34 Our method provides two primary benefits compared with previous relatedness measures not utilizing the ARG. First, because the ARG encodes historical recombination events and the estimation of the ARG generally leverages patterns of haplotype sharing, the eGRM in practice is expected to be more robust to ungenotyped genetic variation and to retain more information about IBD relatedness among individuals than the canonical GRM. Second, and more importantly, our framework seamlessly provides insights into the time-varying nature of population structure by estimating relatedness at specific depths in the coalescent tree. To calculate the eGRM, our framework leverages recent computational advancements for scalable ARG inference. Although these are approximate methods, and often do not provide uncertainty in the inferred ARG, their computational efficiency enables the investigation of datasets containing thousands35 to tens of thousands of individuals.36, 37, 38
We characterized the behavior of our ARG-based eGRM through extensive simulations starting from standard population genetic models. In simulations of a single, exponentially growing population, we demonstrate that the eGRM better captures latent genome-wide relatedness compared with the canonical GRM. Importantly, we find the improved performance of eGRM is robust when performing inference with noisier ARGs inferred from a subset of common genotyped variants rather than true ARGs. It is believed that common variants are not sufficiently informative to detect recent population structure.39 However, in simulations of a recently structured population with multiple demes, we find that principal-component analysis (PCA) of the eGRM better reflects overall population structure and more accurately identifies each deme compared to PCA of the canonical GRM. Finally, in analyses of 2,644 genotyped samples from Northern and Eastern Finland,27 we observe that principal components (PCs) derived from the eGRM reveal fine-scale structure previously not identified with the canonical GRM. We estimate multiple time-specific eGRMs at multiple epochs across the history of the sample and show that time-specific patterns of population structure are qualitatively similar to simulated results of a recently structured population model, which is consistent with the known history of this region of Finland.
Material and methods
Pairwise genetic relatedness with bi-allelic SNP genotypes
We first describe the canonical and expected GRM (eGRM) in a haploid scenario; our framework can easily be generalized to diploid scenarios as described in supplemental methods. We model the haplotypes of $N$ samples with $M$ variants as an $N \times M$ binary matrix $X$ and denote the vector at the $i$-th variant as $x_i$. The sample allele frequency is $p_i = \frac{1}{N}\mathbf{1}^{T} x_i$, and it is required that $0 < p_i < 1$ for $x_i$ to be a variant site. Given $X$, the GRM is commonly defined as
$$K(X) = \frac{1}{M}\sum_{i=1}^{M} \frac{(x_i - p_i\mathbf{1})(x_i - p_i\mathbf{1})^{T}}{p_i(1 - p_i)},$$
where $\mathbf{1}$ is the all-ones vector and $M$ is the number of variants in $X$. However, in practice only $M_o < M$ markers are observed, resulting in the observed haplotype matrix $X_o$. The GRM computed with $X_o$ is given by $K(X_o)$. As its definition suggests, $K(X_o)$ reflects the pairwise relatedness conditioned on the observed haplotype data and provides an incomplete picture of relatedness between individuals. Even though the complete haplotype matrix $X$ is unknown ($M$ is also unknown), through linkage between the unobserved markers in $X$ and observed markers in $X_o$, there may exist a more reasonable estimate of $K(X)$ than $K(X_o)$ itself. Specifically, we show that when the ARG that connects the samples is given, the expectation of $K(X)$ can be derived, which we denote as the expected GRM (eGRM) on the ARG.
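The canonical GRM can be sketched in a few lines of Python. This is a minimal illustration, assuming the definition is the average over variants of the outer product of each centered haplotype vector with itself, scaled by p(1 − p). The function name `canonical_grm` and the toy matrix are hypothetical.

```python
import numpy as np

def canonical_grm(X):
    """Canonical GRM of an N x M binary haplotype matrix (rows are haplotypes)."""
    X = np.asarray(X, dtype=float)
    n, m = X.shape
    p = X.mean(axis=0)                    # sample allele frequency of each variant
    if np.any((p <= 0) | (p >= 1)):
        raise ValueError("every column must be a variant site (0 < p < 1)")
    Z = (X - p) / np.sqrt(p * (1 - p))    # center and scale each column
    return Z @ Z.T / m                    # average of the per-variant outer products

# Hypothetical toy data: 4 haplotypes, 3 variants.
X = np.array([[1, 0, 1],
              [1, 0, 0],
              [0, 1, 0],
              [0, 1, 1]])
K = canonical_grm(X)
```

Because every column is centered, the rows and columns of `K` sum to zero, and the trace equals the number of haplotypes.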
In practice, the ARG can be inferred from the observed haplotypes with recently emerging ARG reconstruction tools.35, 36, 37, 38 Intuitively, we wish to define an eGRM where each entry represents the expected similarity between a pair of individuals should a mutation arise randomly in the ARG after accounting for the expected similarity between a random pair of individuals.
Expectation of pairwise relatedness given a genealogical tree
The recombination and coalescent history of the $N$ haploid samples can be completely represented by the ARG31,32 denoted as $G$, which consists of a sequence of genealogical trees across the whole genome. Each tree $T \in G$ is a directed binary tree with each node representing a chromosomal segment of a sample or an ancestor and each branch representing the history of its child node until it coalesced into its parent node. We define $x(e)$ as the haplotype vector (vector of haploid individuals) associated with branch $e$, that is
$$x(e)_i = \begin{cases} 1 & \text{if sample } i \text{ is a descendant of } e, \\ 0 & \text{otherwise.} \end{cases}$$
We assume $G$ is fixed, and $X$ is generated randomly through mutations occurring on $G$, implying expectation or variance over $X$ is conditional on $G$ by default. We overload set membership notation over $G$ and denote that $e$ is a branch in the ARG as $e \in G$ to mean $e \in T$ for some $T \in G$.
Here, we define how mutations arise on $G$. For each branch $e \in G$ we define $t(e)$ as its length in generations, $l(e)$ as the number of base pairs that $e$ covers, and $u(e)$ as the mutation rate on this branch. We model the number of mutations, $m(e)$, occurring on branch $e$ as being Poisson distributed with rate $\mu(e) = l(e)\,t(e)\,u(e)$, which implies the total number of mutations, $M = \sum_{e \in G} m(e)$, over $G$ also follows a Poisson distribution with rate $\mu(G) = \sum_{e \in G} \mu(e)$.
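Next, we consider the sampling distribution of the complete haplotype matrix $X$ and $K(X)$ given $G$. All column vectors in $X$ are from $\{x(e) : e \in G\}$, where $x(e)$ is repeated $m(e)$ times in $X$ to account for the fact that there exist $m(e)$ mutations on branch $e$. We have
$$K(X) = \frac{1}{M}\sum_{i=1}^{M} \frac{(x_i - p_i\mathbf{1})(x_i - p_i\mathbf{1})^{T}}{p_i(1 - p_i)} = \frac{1}{M}\sum_{e \in G} m(e)\,\frac{(x(e) - p(e)\mathbf{1})(x(e) - p(e)\mathbf{1})^{T}}{p(e)(1 - p(e))},$$
where $p(e) = \frac{1}{N}\mathbf{1}^{T} x(e)$ is the proportion of samples that descend from $e$. The equality is in essence shifting from summing over the genotype vector $x_i$ for each marker $i$ to summing over the vector of haploid individuals $x(e)$ for each branch $e$. Because multiple mutations can exist on a branch, the summation is weighted by $m(e)$.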
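Note that $K(X)$ can only be defined when $M > 0$. The expectation of $K(X)$ is
$$E[K(X)] = E\left[\frac{1}{M}\sum_{e \in G} m(e)\,\frac{(x(e) - p(e)\mathbf{1})(x(e) - p(e)\mathbf{1})^{T}}{p(e)(1 - p(e))} \,\middle|\, M > 0\right].$$
Using the fact that whenever $M > 0$, $m(e) \mid M \sim \mathrm{Binomial}\left(M, \mu(e)/\mu(G)\right)$, we can compute the expectation
$$E\left[\frac{m(e)}{M} \,\middle|\, M > 0\right] = \frac{\mu(e)}{\mu(G)}.$$
We have
$$E[K(X)] = \frac{1}{\mu(G)}\sum_{e \in G} \mu(e)\,\frac{(x(e) - p(e)\mathbf{1})(x(e) - p(e)\mathbf{1})^{T}}{p(e)(1 - p(e))} = P\left(\frac{1}{\mu(G)}\sum_{e \in G} \frac{\mu(e)}{p(e)(1 - p(e))}\,x(e)\,x(e)^{T}\right)P,$$
where $P = I - \frac{1}{N}\mathbf{1}\mathbf{1}^{T}$ is a centering matrix and $I$ is the identity matrix.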
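The two steps behind this result can be spelled out as a short worked derivation. The notation is assumed here for illustration: $\mu(e)$ is the Poisson rate of mutations on branch $e$, $\mu(G)$ is the sum of these rates over all branches, $p(e)$ is the fraction of the $N$ samples below $e$, and $P$ is the centering matrix.

```latex
% Step 1: independent Poisson counts, conditioned on their total, are binomial,
% so the expected share of mutations on branch e does not depend on M.
m(e) \sim \mathrm{Poisson}(\mu(e)) \ \text{independently}
\;\Longrightarrow\;
m(e) \mid M \sim \mathrm{Binomial}\!\left(M, \tfrac{\mu(e)}{\mu(G)}\right)
\;\Longrightarrow\;
E\!\left[\tfrac{m(e)}{M} \,\middle|\, M > 0\right] = \frac{\mu(e)}{\mu(G)}.

% Step 2: subtracting the mean is multiplication by the symmetric centering matrix P,
% so the centering can be pulled outside the sum over branches.
x(e) - p(e)\mathbf{1}
  = \left(I - \tfrac{1}{N}\mathbf{1}\mathbf{1}^{T}\right) x(e)
  = P\,x(e)
\;\Longrightarrow\;
\bigl(x(e) - p(e)\mathbf{1}\bigr)\bigl(x(e) - p(e)\mathbf{1}\bigr)^{T}
  = P\,x(e)\,x(e)^{T}\,P .
```

Step 2 explains why uncentered contributions can be accumulated first and centered once at the end.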
Computationally, we can compute eGRM by traversing $G$ in any order while updating a buffer matrix. For each branch $e$, we first compute $\mu(e) / [p(e)(1 - p(e))]$ and add this number to a square submatrix of elements indexed by nonzero elements of $x(e)$. Finally, we divide the resulting matrix by $\mu(G)$ and then center it by column and by row to get the eGRM of $G$. Note that the eGRM defined here is related to previously proposed statistics based on the time to the most recent common ancestor (TMRCA) or genealogies. For example, it can be seen as a continuation of a previous derivation33 where the expected probability of allele sharing is expressed in terms of TMRCA. Also, each element in the eGRM matrix is a special case of the “polarized branch statistic” previously proposed,34 where the “sample weights” are orthogonal unit vectors and the “summary function,” .
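The buffer-matrix traversal can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation. The function name `egrm_from_branches` and the tuple layout for a branch are hypothetical, and a plain list of branches stands in for a real tree sequence. Branches with no descendants or with all samples as descendants are skipped, on the assumption that they cannot produce a variant site.

```python
import numpy as np

def egrm_from_branches(n, branches):
    """Sketch of the buffer-matrix eGRM computation for n haploid samples.

    branches: iterable of (descendants, t, l, u), where descendants lists the
    sample indices below the branch, t is the branch length in generations,
    l is the number of base pairs covered, and u is the mutation rate.
    """
    buf = np.zeros((n, n))
    total = 0.0
    for desc, t, l, u in branches:
        idx = np.asarray(desc, dtype=int)
        p = idx.size / n                      # fraction of samples below the branch
        if p <= 0 or p >= 1:
            continue                          # cannot carry a variant site
        mu = l * t * u                        # expected number of mutations
        buf[np.ix_(idx, idx)] += mu / (p * (1 - p))
        total += mu
    if total == 0:
        raise ValueError("no branch can carry a variant site")
    buf /= total
    buf -= buf.mean(axis=0, keepdims=True)    # center by column
    buf -= buf.mean(axis=1, keepdims=True)    # center by row
    return buf

# Hypothetical single tree ((0, 1), (2, 3)) spanning 1,000 bp.
toy_branches = [
    ([0], 1.0, 1000, 1e-8), ([1], 1.0, 1000, 1e-8),
    ([2], 2.0, 1000, 1e-8), ([3], 2.0, 1000, 1e-8),
    ([0, 1], 2.0, 1000, 1e-8), ([2, 3], 1.0, 1000, 1e-8),
]
E = egrm_from_branches(4, toy_branches)
```

Updating only the submatrix of descendants keeps the work per branch proportional to the square of the number of descendants, which is small for the many branches near the tips.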
Extension of the haploid eGRM to diploid organisms is straightforward by considering and weighing the paternal and maternal haplotypes separately; details are provided in the supplemental methods. Furthermore, given our probabilistic formulation of relatedness given a genealogical tree, it is natural to define higher central moments. Thus, we also defined the element-wise variance $\mathrm{Var}[K(X)]$, which we term varGRM and which captures the expected deviation around the individual entries in the eGRM. Derivation of the varGRM can be found in supplemental methods.
ARG and genotype simulations
We used two different demographic models in our simulation experiments (Figure 1B) for a comprehensive comparison between GRM and eGRM. To investigate the accuracy of eGRM compared with the canonical GRM in estimating true relatedness, we simulated ARG and genotypes under a single-population exponential-growth model based on the published out-of-Africa demography.40 Model parameters were suggested by the msprime documentation based on the European branch of the model. We did not simulate the other two branches of population nor the migration rates between the populations. To investigate the performance of eGRM compared with the canonical GRM in detecting recent population structure, we simulated ARG and genotypes of a structured population with a 5 × 5 grid stepping-stone demographic model motivated by a similar model recently published.39 We simulated 50 individuals per deme, with population size of 500 and migration rate of 0.01 per generation between neighboring demes. The 25 demes split from a single ancestral population of the same population size 100 generations ago.
Figure 1.
Illustrative example of eGRM and methodological overview
(A) An illustrative example of a single genealogical tree containing three samples, four branches, and six mutations to contrast the eGRM and GRM. Each mutation has a corresponding vector of length 3, corresponding to the number of haplotypes (e.g., mutation m1 has vector (1, 1, 0)). The “single-variant GRM” can be computed as the outer product of the centered and normalized vector with itself. The canonical GRM is then the unweighted average of the single-variant GRMs of the six mutations. The eGRM, on the other hand, is based on the four branches, weighted by their lengths (i.e., the expected number of mutations on this branch).
(B) Overview of simulation workflow to test the performance of eGRM. ARGs are simulated by msprime on the basis of a single-growth demographic model and a grid-like spatial structure model. Observed variants are oversampled from common variants to mimic real genotyping array data. We then compute the complete GRM (Kall), the observed GRM (Kobs), eGRM based on the true ARG (EK), and eGRM based on Relate or tsinfer+tsdate-reconstructed ARG (EKrelate and EKtsdate).
We simulated genotypes and tree sequences with msprime.41 To mimic observed genetic data derived from a genotyping array that biases toward the common variants, we restricted the observed set of variation to a subset (20% by default, unless otherwise specified) of the simulated variants with minor allele count ≥ 5. To oversample the common variants, we sampled with probability proportional to where is the sample allele frequency of variant . To show the practical use of eGRM, we reconstructed the sequence of trees from observed variants by using Relate35 and tsinfer+tsdate36,37 with default parameters as suggested by the user manuals. The tree sequence output of Relate is converted to TSKIT format, which contains a gap-filler tree with no genetic information between base pair zero and the first genetic marker in the dataset. To prevent overrepresenting a tree that covers a long region but carries little actual information, we always skip the first tree in the tree sequence in our empirical analysis. We denote the canonical GRM based on observed variants as Kobs, the eGRM computed with the true ARG as EK, and the eGRM computed with the ARG inferred from observed variants as input for Relate or tsinfer+tsdate as EKrelate or EKtsdate. Unless otherwise noted, all simulations are performed on a 30 Mb chromosome with both mutation rates and recombination rates set to $10^{-8}$ per generation per base pair.
FinMetSeq genotyping and quality control
To examine the performance of eGRM on real genotyping data, we applied our method to a subset of the FinMetSeq dataset,27 consisting of 2,644 samples who self-reported that both parents were born in the same municipality in Finland. The dataset contained 1,504,461 SNPs from whole-exome sequencing and genome-wide genotyping arrays.27 We retained only bi-allelic SNPs with minor allele frequency (MAF) ≥ 0.01 and missingness ≤ 0.01, resulting in 208,681 common SNPs. We phased the genotypes with EAGLE by using its default hg19 genetic map. We reconstructed the ARG by using Relate with all parameters the same as in its official manual. We then applied our eGRM algorithm on the resulting tree sequence to compute EKrelate. The canonical GRM was computed on the basis of the same set of SNPs. Access to data through dbGaP and research using the FinMetSeq data were approved by the institutional review board at University of Southern California.
Population structure analysis
We contrasted and visualized the information about population structure contained in GRM and eGRM through PCA and uniform manifold approximation and projection (UMAP). PCA was computed with the “linalg.eig” function in the Python “numpy” library, and UMAP was computed by the R “umap” package with all default parameters. To quantitatively assess the improvement of eGRM over GRM in informing clustering analysis from structured populations, we devised a separation index to assess the proportion of nearest neighbors that are in the same population in multi-dimensional space. Suppose we have a set $S$ of sample points in a metric space with metric $d$. Each point $s \in S$ has a true label $c(s)$. The separation index defines $c(s)$ as the true class that $s$ belongs to. In simulated data, the true label is the deme or population membership of each individual. In empirical data, the birthplace of the parents or grandparents was assumed to be the true label. We also define the size-$k$ neighbor of $s$ as the $k$ nearest points of $s$ including itself, denoted as $N_k(s)$. The separation index is defined as the average proportion of same-class neighbors
$$\mathrm{SI}(d, c) = \frac{1}{|S|}\sum_{s \in S} \frac{\left|N_{|c(s)|}(s) \cap c(s)\right|}{|c(s)|},$$
which is a real number between 0 and 1 indicating how well the metric $d$ is capturing the true classification $c$. Note that $\mathrm{SI}$ is only dependent on the relative order of distances between pairs of points, making it a unified measure of clustering performance among PCA, UMAP, and other distance-based methods.
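A separation index of this kind can be sketched as below. Two choices are assumptions made for illustration, not details confirmed by the text: each point's neighborhood has the same size as that point's class, and distances are Euclidean (for example, between PC or UMAP coordinates). The function name `separation_index` is hypothetical.

```python
import numpy as np

def separation_index(points, labels):
    """Average proportion of same-class points among each point's nearest neighbors.

    Assumption: a point's neighborhood has the size of its own class and
    includes the point itself; distances are Euclidean.
    """
    pts = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points.
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    props = []
    for i in range(len(pts)):
        same = labels == labels[i]            # membership in the class of point i
        k = int(same.sum())                   # neighborhood size = class size
        nearest = np.argsort(dist[i], kind="stable")[:k]
        props.append(same[nearest].mean())    # share of same-class neighbors
    return float(np.mean(props))

# Two well-separated hypothetical clusters, then an interleaved arrangement.
si_separated = separation_index([[0, 0], [0, 1], [10, 0], [10, 1]], [0, 0, 1, 1])
si_mixed = separation_index([[0], [1], [2], [3]], [0, 1, 0, 1])
```

Only the ranking of distances enters the computation, so any monotone rescaling of the distances leaves the index unchanged.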
Results
Method overview: ARG-based definition of genetic relatedness
The eGRM, conditioned on the ARG, is conceptually different from the canonical GRM. We demonstrate this difference through a toy example on a single genealogical tree with four branches and six mutations (Figure 1A). The canonical GRM is variant centric and is the average of the six relatedness matrices based on each mutation. The eGRM, however, defines relatedness through tree branches that relate a pair of haplotypes. Assuming constant mutation rates across branches, the eGRM is the average of the four relatedness matrices based on each branch, weighted by their branch lengths. A single tree is shown in Figure 1A for simplicity, but the eGRM can be generalized to a sequence of trees along a chromosome by weighting each tree by its total branch length times the number of base pairs covered by each tree (material and methods). In this toy example, haplotypes a and b are expected to be equally related to c in the eGRM, while in the canonical GRM, b will be more closely related to c. Under the canonical GRM framework, the relative genetic distance to c is subject to the randomness and ascertainment of mutations. Instead of relying on ascertained mutations, branch lengths from the true ARG (or from the inferred ARG, reconstructed on the basis of linkage information among nearby markers) provide an estimate of genetic relatedness that is more robust to ascertainment effects. In addition, while the eGRM is defined as a function of the ARG, it maintains the mathematical properties of canonical GRMs (e.g., positive semidefiniteness), as the eGRM is the expectation of the canonical GRM. The eGRM is thus compatible with all downstream applications of the GRM.
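The contrast in the toy example can be reproduced numerically. The sketch below assumes a tree of the form ((a, b), c). The branch lengths and the placement of the six mutations are hypothetical values chosen for illustration and are not taken from Figure 1A.

```python
import numpy as np

def single_vector_grm(x):
    """Relatedness matrix from one centered and normalized 0/1 vector."""
    x = np.asarray(x, dtype=float)
    p = x.mean()
    v = x - p
    return np.outer(v, v) / (p * (1 - p))

# Descendant vector of each branch, with haplotypes ordered (a, b, c).
branch_vectors = {"a": [1, 0, 0], "b": [0, 1, 0], "ab": [1, 1, 0], "c": [0, 0, 1]}
branch_lengths = {"a": 1.0, "b": 1.0, "ab": 2.0, "c": 3.0}   # hypothetical lengths
mutation_counts = {"a": 2, "b": 1, "ab": 1, "c": 2}          # hypothetical, six in total

# Canonical GRM: unweighted average over mutations.
K = sum(mutation_counts[e] * single_vector_grm(x) for e, x in branch_vectors.items())
K = K / sum(mutation_counts.values())

# eGRM: average over branches, weighted by branch length.
E = sum(branch_lengths[e] * single_vector_grm(x) for e, x in branch_vectors.items())
E = E / sum(branch_lengths.values())
```

With these numbers, `K[1, 2]` (b with c) is larger than `K[0, 2]` (a with c) only because branch a happened to receive more mutations, whereas `E[0, 2]` and `E[1, 2]` are equal.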
To help distinguish between various eGRM estimators, here we define some useful notation. We denote the eGRM estimated conditioned on the true ARG as EK. When conditioned on an ARG inferred from genetic data with either Relate35 or tsinfer+tsdate,36,37 we denote such eGRM as EKrelate and EKtsdate, respectively. We denote the canonical GRM computed with all genetic data as Kall and with only the observed genetic data as Kobs. In empirical data analysis, Kobs is constructed with all genotyped data passing quality controls; in simulations, Kobs is constructed with only 20% of the genetic data oversampled from the common variation of the frequency spectrum (material and methods) to mimic a genotyping array. Importantly, EKrelate and EKtsdate are constructed only with the same set of the observed genetic data as Kobs.
eGRM accurately measures relatedness on a genealogical tree
To establish that the eGRM estimator better reflects genealogical relatedness compared with the canonical GRM approach, we first sought to quantify the performance of eGRM in capturing relatedness in a single tree, defined here as the TMRCA between pairs of individuals, when using the true ARG. We simulated a 1 Mb genetic region with 1,000 individuals under a single population growth model and computed EK and Kobs (see material and methods; Figure 1B). Unsurprisingly, the eGRM based on the true genealogical tree, EK, is better correlated with TMRCA than Kobs in 97.5% of the simulations ($p = 4 \times 10^{-252}$ by sign test; Figure 2A) and more accurately captures recent genetic relatedness between pairs of individuals (Figure S1A). More importantly, eGRM constructed with genealogies inferred under Relate or tsinfer+tsdate on the same set of observed variants (EKrelate and EKtsdate) also showed better correlation with TMRCA than the canonical GRM in ∼70% of the simulations ($p < 1 \times 10^{-26}$ in all cases; Figure 2A, Figure S1B), suggesting that the eGRM is relatively robust to noise in inferred ARGs. Our results thus demonstrate a consistent advantage of the eGRM over the canonical GRM in capturing local relatedness represented by TMRCA within a non-recombining chromosome segment. Even though common variants are individually uninformative for recent relatedness, our results also suggest the eGRM framework based on predominantly common variants can provide insight for the recent part of the genealogical tree.
Figure 2.
eGRM is highly and unbiasedly correlated with measures of relatedness in simulations
(A) Negative Spearman correlation between TMRCA and Kobs, EK (left), or EKrelate (right) on 1,000 simulated 1 Mb non-recombining loci. Spearman correlation is used because GRM by definition normalizes according to allele frequency to upweight rare mutations and thus is not expected to correlate linearly with TMRCA.
(B) Heatmap summarizing the Pearson correlations between GRM and eGRM matrices on a 30 Mb chromosome. Results were averages and standard errors over 50 independent simulations.
(C) Scatterplots of the GRM and eGRM values for all pairs of individuals with the same simulated demography as in (B). All simulations from (A) to (C) simulated 1,000 individuals.
(D) Pearson correlation with Kall, with varying proportion of SNPs observed (sample size is fixed to 1,000, recombination rate is fixed to 1e−8; left), varying recombination rate (20% common SNPs observed, sample size 1,000; middle) or varying sample size (20% common SNPs observed, recombination rate 1e−8; right) on a 30 Mb chromosome. Results for EK on the left panel are denoted with a dashed line because EK was computed with the true ARG independent of the proportion of observed SNPs. Results were averaged across ten independent simulations. The curves are smoothed by LOESS and error bands show 99% confidence intervals.
eGRM provides an unbiased estimate of genome-wide relatedness
While TMRCA provides an intuitive measure of the local genetic relatedness between a pair of haplotypes, the eGRM is formulated as the expectation of the latent GRM so that it adheres to the mathematical properties of a GRM necessary for many downstream statistical genetic applications.1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 Therefore, we evaluated how well eGRM is capturing the genome-wide relatedness measured by the latent GRM (Kall). Briefly, we repeatedly simulated a 30 Mb genomic region of 1,000 individuals with recombination rate set as $10^{-8}$ per bp per generation (see material and methods). We found that EK provides an approximately unbiased estimate of Kall (Pearson correlation = 0.98 ± 0.0008; Figure 2B; regression slope of 0.951, 95% CI [0.949, 0.953], intercept = $-6.7 \times 10^{-5}$, 95% CI [$-9.1 \times 10^{-5}$, $-4.3 \times 10^{-5}$]; Figure 2C) when compared with Kobs (Pearson correlation = 0.81 ± 0.003; regression slope of 2.69, 95% CI [2.67, 2.71], intercept = $1.8 \times 10^{-3}$, 95% CI [$1.6 \times 10^{-3}$, $2.0 \times 10^{-3}$]). We observed similar performance gains when computing the eGRM by using ARG inferred by Relate; EKrelate attained a highly correlated (r = 0.91 ± 0.004; Figure 2B) and approximately unbiased estimate of Kall (regression slope of 0.96, intercept = $3.7 \times 10^{-5}$; Figure 2C). Taken together, our results suggest the eGRM is an unbiased and accurate estimator of the idealized canonical GRM containing all variants.
Next, we quantified the performance of eGRM when computed with ARG inferred from a varying proportion of observed genetic variants. We found that the correlation between Kall and EKrelate was consistently higher than the correlation between Kall and Kobs (Figure 2D, left). Similar patterns are also observed as a function of increasing recombination rates (Figure 2D, middle). Moreover, for a fixed proportion of observed common SNPs (e.g., 20%; similar to SNP arrays) and recombination rate ($1 \times 10^{-8}$ per bp), we observed that the performance gap between EKrelate and Kobs widened as sample size increased (Figure 2D, right). Intuitively, this improvement reflects the increasing contribution from rare variants to kinship in a larger sample that would not be captured by the canonical GRM based on only variants assayed on an array. Our results imply that eGRM can in principle more effectively capture relatedness in large-scale studies.
In practice, the construction of the GRM often uses imputed variants and/or is restricted to relatively common variants after pruning of correlated variants by linkage disequilibrium (LD). However, in simulations we found that pruning SNPs by LD before computing the GRM [Kobs (pruned)] further decreased correlation with Kall (Figure S2A). When using imputed variants to construct the canonical GRM [Kobs (imputed)], we found it more strongly correlated with Kall on average when compared with Kobs (Figure S2A). However, we observed performance in this scenario depends on relatedness between individuals in the imputation reference panel with target individuals; correlation between Kobs (imputed) and Kall decreases with average panel relatedness (Figure S2B). The dependence on the availability of a closely related reference panel suggests that underrepresented populations would be at a disadvantage for genetic analysis with the canonical GRM. Most importantly, across all of these scenarios, we observed our eGRM based on inferred ARG (i.e., EKrelate) consistently exhibited better correlation with Kall than Kobs (pruned) or Kobs (imputed) (Figure S2A).
eGRM captures recent demographic events
Population structure, as the result of historical demographic processes, is conventionally visualized through PCA of the canonical GRM.4,33 Because the eGRM is conditioned on the ARG encoding these historical events, we expected the eGRM to be more sensitive to population structure than the canonical GRM based on only common variants. To this end, we quantified the performance of the eGRM in capturing recent population structure through PCA. Motivated by recent work demonstrating that recent population structure (i.e., <100 generations) is not well captured by PCA computed from common SNP GRMs,39 we simulated a stepping-stone population model with 25 demes spatially distributed in a 5 × 5 grid where demes coalesced into a single deme 100 generations ago (Figure 1B, Figure 3; see material and methods). We then compared the ability of the eGRM or the canonical GRM to identify recent population structure through PCA as quantified by the separation index (SI), which measures the proportion of neighbors in multi-dimensional space that are of the same deme or cluster (material and methods).
Figure 3.
PCA based on eGRM more effectively captures recently established spatial structure in simulation compared to the canonical GRM
A 30 Mb region was simulated, with 20% of common variants observed. Each deme has a constant population size of 500, in which 50 individuals are sampled. The first two PCs based on PCA of three GRMs are shown (top): the canonical GRM based on observed SNPs (Kobs), the eGRM based on Relate-reconstructed ARG using the observed SNPs (EKrelate), or the time-specific eGRM based on the subset of branches between 0 and 100 generations across the ARG (EKrelate(0–100 gen)). Separation index (SI; see material and methods for a precise definition) is shown at the top right corner of each plot, which is the average proportion of same-label neighbors for each sample, indicating how well populations are separated. The first two features of UMAP transformation (bottom) applied to the top ten PCs further accentuate the detected structure, as measured by SI.
To establish a baseline, we first recapitulated previous results demonstrating that PCA based on the GRM constructed with common variants (defined as MAF ≥ 0.05; Kcommon), or the observed variants (defined as 20% of the variants, oversampling common variants; Kobs), cannot distinguish the spatial structure of the demes (SI = 0.07–0.08 for Kcommon and Kobs; Figure 3, Figure S3A). In comparison, the GRM constructed from rare variants (minor allele count = 2, 3, 4, or 5) alone (Krare; SI = 0.25) or all of the variants (Kall; SI = 0.20; Figure S3A) can better detect structure.
We repeated this PCA analysis by using EKrelate computed from the same set of variants used to construct Kobs and found it to better identify recent population structure (SI = 0.22; Figure 3). We next applied a UMAP transformation to the top ten PCs based on each of the evaluated relatedness matrices and found cluster separation performance improved when using Krare, Kall, or EKrelate (SI = 0.79–0.81) but with little benefit when using Kcommon or Kobs (SI = 0.09-0.11; Figure 3 and Figure S3B). Performance applying PCA or PCA + UMAP on EKrelate is also on par or better than applying the analysis on matrices of haplotype sharing, such as the coancestry matrix produced by fineSTRUCTURE42 (Figure S3); however, the eGRM framework comes with the benefit of taking time slices of the ARG for time-specific information (see below). We also simulated a demographic history with older population split times of 200 or 500 generations. In these scenarios, EKrelate (SI = 0.38 and 0.55 when split time is 200 and 500, respectively) consistently outperformed Kobs (SI = 0.13 and 0.31, respectively) in capturing population structure (Figure S4), most likely due to additional haplotypic information captured by the inferred ARG. Therefore, under a structured model, the eGRM consistently extracts more information of the population structure compared to the canonical GRM based on the same set of variants.
As incomplete correction of population structure is one of the main sources of confounding in genome-wide association studies (GWASs),43 we also examined whether inclusion of PCs from EKrelate would better control for confounding in association tests. We simulated a GWAS (total N = 6,250, with 250 individuals from each of the 25 demes) of a non-heritable phenotype with an environmental component that is sharply distributed in space. As previously observed,39 in this model the association test statistics would be inflated if no correction, or correction only with PCs based on a canonical GRM of common variants (Kcommon), were applied (Figure S5). Inclusion of PCs derived from the eGRM would greatly reduce the inflation, as expected, because it better captures the structure in the data (Figure 3). As the scalability and accuracy of ARG inference continue to improve, it will be of great interest to test the ability of the eGRM to control for population structure in realistic GWAS settings.
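The inflation comparison described above can be sketched as follows. This is a simplified stand-in rather than the authors' GWAS pipeline: it assumes a plain linear regression per SNP, covariates (such as PCs) handled by residualization, and the genomic inflation factor defined as the median test statistic divided by the median of a 1-df chi-square distribution. All function names are hypothetical.

```python
import numpy as np

# Median of the chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def assoc_chi2(y, G, covariates=None):
    """Per-SNP squared t-statistics from linear regression of y on each SNP.

    y          : (n,) phenotype vector.
    G          : (n, m) genotype matrix; every SNP is assumed polymorphic.
    covariates : optional (n,) or (n, c) array, e.g. top PCs of a GRM or eGRM.
    """
    y = np.asarray(y, dtype=float)
    G = np.asarray(G, dtype=float)
    n = y.shape[0]
    if covariates is None:
        C = np.ones((n, 1))
    else:
        C = np.column_stack([np.ones(n), covariates])

    # Project the intercept and covariates out of both phenotype and genotypes.
    Q, _ = np.linalg.qr(C)
    y_res = y - Q @ (Q.T @ y)
    G_res = G - Q @ (Q.T @ G)

    # Squared partial correlation, converted to a squared t-statistic.
    r2 = (G_res.T @ y_res) ** 2 / ((G_res ** 2).sum(axis=0) * (y_res @ y_res))
    df = n - C.shape[1] - 1
    return df * r2 / (1.0 - r2)


def genomic_inflation(chi2):
    """Genomic inflation factor: observed median over the null median."""
    return float(np.median(chi2) / CHI2_1DF_MEDIAN)
```

Calling `assoc_chi2` once without covariates and once with PCs of the chosen relatedness matrix, then comparing the two `genomic_inflation` values, gives a rough analogue of the comparison reported in Figure S5.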
Time-specific eGRM reveals dynamic relatedness through history
By defining genome-wide relatedness as a function of coalescent trees, a major advantage of our eGRM framework is its natural generalization of relatedness constrained to a specific time window. We denote the eGRM computed from the ARG when considering only branches of a certain age as the time-specific eGRM (see material and methods). To demonstrate the benefit of limiting relatedness calculation to certain generations, we re-analyzed our grid simulations but estimated EKrelate restricted to the most recent 100 generations (Figure 3). We observed that PCA based on EKrelate of the most recent 100 generations improved its ability to delineate population structure (SI = 0.44, compared to SI = 0.22 for EKrelate or 0.25 for Krare; Figure 3, Figure S3A). We found performance to further improve when applying an additional UMAP transformation on PCs (SI = 0.94; Figure 3).
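To make the idea of restricting relatedness to a time window concrete, here is a toy sketch. It assumes a haploid, branch-weighted form in which each branch contributes a standardized sharing term weighted by its length times the genomic span it covers, and in which a time window simply clips each branch to the part lying inside the window. The tuple layout, the function name, and the four-haplotype tree are illustrative assumptions, not the API of the egrm package.

```python
import numpy as np


def egrm_from_branches(branches, n_haplotypes, t_lo=0.0, t_hi=np.inf):
    """Branch-weighted relatedness among haplotypes within [t_lo, t_hi].

    branches : iterable of (descendants, child_age, parent_age, span), where
               descendants lists the haplotype indices below the branch, ages
               are in generations, and span is the genomic length covered.
    """
    K = np.zeros((n_haplotypes, n_haplotypes))
    total_weight = 0.0
    for descendants, child_age, parent_age, span in branches:
        # Portion of the branch that falls inside the time window.
        length = min(parent_age, t_hi) - max(child_age, t_lo)
        k = len(descendants)
        if length <= 0 or k == 0 or k == n_haplotypes:
            continue  # outside the window, or uninformative about sharing
        weight = length * span
        p = k / n_haplotypes
        x = np.zeros(n_haplotypes)
        x[list(descendants)] = 1.0
        z = x - p
        K += weight * np.outer(z, z) / (p * (1.0 - p))
        total_weight += weight
    if total_weight == 0:
        raise ValueError("no branch overlaps the requested time window")
    return K / total_weight


# Toy tree ((0,1),(2,3)): internal nodes at 50 and 80 generations, root at 1,000.
toy_branches = [
    ([0], 0, 50, 1000.0), ([1], 0, 50, 1000.0),
    ([2], 0, 80, 1000.0), ([3], 0, 80, 1000.0),
    ([0, 1], 50, 1000, 1000.0), ([2, 3], 80, 1000, 1000.0),
]
K_all = egrm_from_branches(toy_branches, 4)
K_recent = egrm_from_branches(toy_branches, 4, t_lo=0, t_hi=40)
K_old = egrm_from_branches(toy_branches, 4, t_lo=100, t_hi=1000)
```

In this toy tree, the 0–40 generation window contains only terminal branches, so every pair of distinct haplotypes looks equally related, whereas the 100–1,000 generation window contains only the two internal branches and cleanly separates the two clades. On real ARGs the informative window depends on when the structure arose, which is what the time-specific analysis in the text explores.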
The time-specific eGRM also provides new insight into the limitation of the canonical GRM method. In our single-population simulation, the correlation between each time-specific eGRM that we examined and the canonical GRM (Kobs) increases monotonically backward in time because the allelic ages of observed SNPs are generally much older (mean age of 12,872 generations; Figures S6A and S6D). In our grid structure simulations, due to smaller population sizes and stronger genetic drift, the common SNPs are much younger (mean age of 2,089 generations), resulting in Kobs having the highest correlation with time-specific eGRM between 2,000 and 3,000 generations ago (Figures S6B and S6D). The correlation with time-specific eGRM decreased going further back in time, most likely due to older variants becoming fixed in the population and thus being excluded from Kobs computation. The expansion into multiple demes only occurred 100 generations ago, but because the allelic ages of observed SNPs used to construct Kobs generally predated this event (Figure S6D), individually these SNPs contained little information for the recent structure. Taken together, these results indicate that the canonical GRM provides a coarse measure of relatedness in which older, common SNPs are enriched, while the eGRM and time-specific eGRM provide a more fine-grained measure of relatedness at different time points.
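The comparison described above boils down to correlating the entries of two relatedness matrices. A minimal sketch follows, assuming the widely used standardized-genotype form of the canonical GRM and a Pearson correlation over off-diagonal entries; both choices, and the function names, are assumptions rather than the authors' exact procedure. Note how sites that are fixed in the sample drop out of the canonical GRM, which is the mechanism invoked in the text for the declining correlation further back in time.

```python
import numpy as np


def canonical_grm(G):
    """Canonical GRM from an (n_individuals, n_snps) matrix of 0/1/2 dosages."""
    G = np.asarray(G, dtype=float)
    p = G.mean(axis=0) / 2.0
    keep = (p > 0) & (p < 1)  # fixed sites carry no information and are dropped
    if not keep.any():
        raise ValueError("no polymorphic sites")
    Z = (G[:, keep] - 2.0 * p[keep]) / np.sqrt(2.0 * p[keep] * (1.0 - p[keep]))
    return Z @ Z.T / keep.sum()


def offdiag_correlation(A, B):
    """Pearson correlation between the off-diagonal entries of two matrices."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    iu = np.triu_indices_from(A, k=1)  # each pair of individuals counted once
    return float(np.corrcoef(A[iu], B[iu])[0, 1])
```

Evaluating `offdiag_correlation` between a fixed canonical GRM and a series of time-specific matrices yields a correlation-versus-time curve of the kind summarized in Figure S6.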
eGRM improves prediction of geographical pattern in empirical data
We evaluated the ability of the eGRM to detect population structure in real-world data by applying the framework to the genotyping array data of a Finnish cohort, FinMetSeq. We computed the canonical GRM and EKrelate on the basis of 208,681 SNPs with minor allele frequency > 1% genotyped on 2,644 individuals with both parents born in the same municipality in Finland27 (material and methods). Using parental birthplaces as population labels, we found PCA of EKrelate was able to identify patterns of structure (SI = 0.52; Figure 4A), whereas the canonical GRM displayed mild separation between individuals with recent ancestry along differing regions of Finland (SI = 0.39; Figure 4A). Lower-order PCs computed from EKrelate revealed additional structure not matched by Kobs (SI = 0.47 versus SI = 0.29 for PC3–PC6; Figure S7). Notably, the first two PCs of EKrelate were explained in part by individuals with both parents born in the surrendered Karelia (magenta color in Figure 4). Similar to our simulated results, when we applied UMAP to top PCs, we observed improved resolution of fine-scale structure; UMAP based on EKrelate continues to be more informative of the fine-scale structure within Finland than that based on the canonical GRM, regardless of the number of PCs included in UMAP (Figure 4B). Furthermore, EKrelate, which is based on the 208,681 genotyped SNPs, performs comparably to, if not better than, the canonical GRM based on 17 million imputed SNPs from the TOPMed Imputation Server (Figure S8). Our results therefore suggest that the eGRM framework could elucidate greater details of fine-scale structure for understudied populations that are poorly imputed because of a lack of representation in available imputation reference panels.
Figure 4.
Clustering analysis based on eGRM revealed novel population structure in the population of Northern and Eastern Finland
(A) PCA and PCA + UMAP based on either Kobs or EKrelate are shown. A map of Finland with regions colored is provided for reference. Main geographical locations referenced in the text are labeled (surrendered Karelia colored in magenta, Lapland colored in red, Turku-Pori colored in light green, and Vaasa colored in light blue). The scatterplot of the first two features of the UMAP transformation was based on the first 24 and 58 components of PCA of Kobs and EKrelate, respectively. These numbers were chosen because they are the numbers of components at which the separation index (SI) is maximized after applying the UMAP transformation.
(B) Separation index achieved as successive PCs were included in UMAP transformation of PCA of Kobs and EKrelate. Results are LOESS curves from ten independent runs of UMAP.
Next, to shed light on historical migration and population movements for FinMetSeq data, we computed and analyzed the time-specific eGRM considering only branches for the past 0–100 generations (Figure S6C). PCA of the time-specific eGRM suggests that recent structure in Northern and Eastern Finland is mainly driven by individuals from Lapland (colored red in Figure 4; the northernmost part of Finland and home to the indigenous Finno-Ugric people, Sami), surrendered Karelia (magenta), and Turku-Pori and Vaasa (light green and light blue, sharing major port borders with Sweden). Computing a time-specific eGRM further in the past exhibited patterns more similar to those found in the canonical GRM (Figure S6C). Qualitatively, the pattern of the time-specific eGRM at varying time depth and its correlation with a fixed canonical GRM are more reminiscent of the pattern observed in the grid structure simulation (Figure S6B) than the pattern in a single homogeneous population (Figure S6A). Together, these findings further support previous claims that common variants are enriched for those that survived a bottleneck in Finland and that there are extensive internal structures due to recent population movement, isolation, and drift.27,44, 45, 46
Time and memory considerations for eGRM algorithm
We implemented the eGRM in a flexible Python framework by using custom C extensions to accelerate core eGRM calculations. Our implementation of eGRM is memory efficient. The main memory usage throughout the algorithm is a matrix of size N × N, where N is the number of samples, which takes 8N² bytes of memory when stored as doubles in C. However, outputting the resulting matrix into a NumPy array dominates the overall memory consumption (Figure S9). The time cost of computing the eGRM grows with the number of genealogical trees (see supplemental information). In the case of the FinMetSeq data, the genome-wide eGRM takes ∼30 h on a single CPU to compute for 2,644 samples and ∼120,000 genealogical trees.
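As a back-of-the-envelope check on the memory footprint, the arithmetic can be sketched as follows. It assumes a dense square matrix with one row per sample and 8-byte doubles; whether rows index individuals or haplotypes inside the implementation is not stated here, so both cases are shown. The function name is hypothetical.

```python
def dense_matrix_bytes(n_rows, bytes_per_entry=8):
    """Bytes needed for a dense n_rows x n_rows matrix of fixed-size entries."""
    return n_rows * n_rows * bytes_per_entry


n_individuals = 2644  # FinMetSeq sample size quoted in the text
per_individual = dense_matrix_bytes(n_individuals)
per_haplotype = dense_matrix_bytes(2 * n_individuals)

print(f"{per_individual / 2**20:.1f} MiB if indexed by individual")
print(f"{per_haplotype / 2**20:.1f} MiB if indexed by haplotype")
```

Either way the working matrix stays well below a gigabyte for a cohort of this size.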
Discussion
In the current study, we introduce the eGRM, a genealogical estimate of genetic relatedness. The eGRM is conceptually distinct from the canonical GRM, which is variant- or mutation-centric. As a result, analyses utilizing the canonical GRM need to be interpreted within the context of the marker ascertainment. Ascertainment could be biased because of availability of data, technical errors in data generation, or inconsistent analytical conventions across analysts. In contrast, the eGRM does not depend directly on the detection of variation (the eGRM based on the true ARG does not depend on variation, but haplotypes based on a set of variants are used for inferring the ARG in practice) and thus is more robust when used in analyses with incomplete data.
A number of methods have been proposed to exploit the rich information stored in the ARG to make inference of population genetic parameters (e.g., for selection47,48 or population history49,50). Similarly, recent theoretical work has demonstrated the relationship between mutational processes by site or on branches and nodes of the ARG.34 Given the ARG, our framework considers mutations as appearing uniformly at random on the ARG, and relatedness between pairs of individuals is based on the probability of shared mutation, which is proportional to the branch lengths relating the two individuals. Our decision to explore this framework over alternative paths such as manipulating a matrix of TMRCAs is driven by the conceptual shift in treating mutations as random. We expect that our genealogical framework to compute genetic relatedness will enable seamless incorporation with downstream statistical applications, such as its inclusion in a linear mixed model for controlling population structure in association testing.
Through extensive simulations, we demonstrated that the eGRM is highly correlated with TMRCA and importantly provides an approximately unbiased estimate of Kall. The former we used as the standard for true relatedness given a single tree, while the latter is an idealized GRM assuming all variants are perfectly observed. We illustrated the improvement and new insights that could be garnered via the eGRM with an application in the detection of population structure. First, common SNPs were thought to be uninformative about recent population structures because they tend to predate recent population divergence and are most likely shared across all populations.39 However, because haplotypes derived from common variants are used to infer the ARG, we showed that the eGRM detects the recent split and can better separate the spatial structure among demes despite relying on only the common SNPs. Second, our framework provides a means to flexibly probe into the population structure at arbitrary time depths through the time-specific eGRM, suggesting that it can be optimized to account for structure at varying timescales. Third, we demonstrated empirically the insights of population structure that can be learned from the eGRM by using Finnish genotyping data. In contrast to the canonical GRM, PCA and PCA + UMAP based on the eGRM could better delineate subpopulations such as individuals from the surrendered Karelia region or from the southwestern region of Finland. The surrendered Karelia region was a geographical region at the border of Finland and Russia but was ceded to Russia in 1940. Finnish citizens in this area were evacuated and resettled throughout the rest of Finland.51 The contribution of the evacuees and their descendants to the population structure of Finns would not have been apparent if we examined only the PCA based on the canonical GRM (Figure 4A).
Finally, by examining the time-specific eGRMs, the structure in Northern and Eastern Finland appeared to be more recently established, and the pattern of variation from this dataset is more consistent with a recently structured population with enhanced drift rather than the conventional belief that Finns compose a single homogeneous population.27,52
We have found that eGRM inference is stable when using a computationally reconstructed ARG rather than the true ARG (e.g., EKrelate). However, we note that underlying assumptions required for accurate ARG inference may not be met in massive sample sizes. For example, it is often assumed that there are neither recurrent mutations nor multiple coalescent events per generation; both of these assumptions would be violated in extremely large samples. Mutation rate is often assumed to be constant across all branches of the inferred tree, which may not be true empirically,53 and ARG inferences currently may not appropriately model all recombination events.54 The extent to which relaxation or violation of these assumptions impacts the ARG inference and its downstream computation of the eGRM will need to be evaluated systematically. Furthermore, a major current impediment is the scalability of ARG inference. In our simulations, eGRM from Relate-reconstructed ARGs performs better than that of tsinfer+tsdate, but the computational time is also orders of magnitude longer. As a result, even though the eGRM computation usually takes less than 5% of the total runtime, we were unable to efficiently compute eGRM on Relate-reconstructed ARGs beyond 10,000 diploid individuals. Nevertheless, computational advances may well continue to make ARG inferences more accurate and scalable; until recently, ARG inferences were restricted to only tens of individuals. Future research may also focus on scaling the eGRM computation with increasing number and sizes of genealogical trees from large samples.
As faster or more accurate ARG inference algorithms become available, our method will be primed to achieve advanced usability and performance. However, even without more scalable ARG inference methods, the eGRM has the potential to make an immediate impact in genetic studies of humans and other species. For instance, most understudied populations are not resourced with a matched imputation reference panel or whole-genome sequencing data.55,56,57 Even when genotyping array data are available, the arrays are rarely designed to represent variation found in the population of interest.58,59 The biased ascertainment of incomplete genomic information is anticipated to exacerbate the disparity in our understanding of the genetic architecture between different populations. The eGRM could overcome these limitations because, even when using only a subset of common markers, it can improve relatedness estimation to a level comparable to that of the canonical GRM constructed in the presence of a population-matched imputation reference. The eGRM could thus enable analysis of limited genetic data and genetic mapping studies from under-resourced populations. Stepping outside of human studies, the genetic studies of other ecological species are rarely equipped with complete genomic information. In some cases, complete genomes of a sample are impossible to obtain, such as in phylogenetic or ancestral studies on historical specimens. However, ARGs can be inferred from limited genotyping data,60 suggesting that the eGRM can fill the void in these studies.
Author contributions
C.W.K.C. conceived of the study. C.F., N.A.M., and C.W.K.C. designed the study. C.F. performed the analysis. C.F., N.A.M., and C.W.K.C. interpreted the data. C.W.K.C. contributed to the data collection. C.F., N.A.M., and C.W.K.C. wrote the paper.
Acknowledgments
We would like to thank Michael D. Edge, Diego Ortega-Del Vecchyo, Christian D. Huber, Jerome Kelleher, Peter Ralph, Iain Mathieson, attendees of the 2020 Biology of Genomes, 2020 American Society of Human Genetics, and 2021 Probgen virtual meetings, and members of the FinMetSeq consortium for discussions and help with data access. Research reported in this publication was supported by the National Institute of General Medical Sciences (NIGMS) of the National Institutes of Health under award number R35GM142783 (to C.W.K.C.). Computation for this work is supported by USC’s Center for Advanced Research Computing (https://www.carc.usc.edu/).
Declaration of interests
The authors declare no competing interests.
Published: April 12, 2022
Footnotes
Supplemental information can be found online at https://doi.org/10.1016/j.ajhg.2022.03.016.
Contributor Information
Caoqi Fan, Email: caoqifan@usc.edu.
Charleston W.K. Chiang, Email: charleston.chiang@med.usc.edu.
Data and code availability
We have implemented the algorithms related to eGRM in a Python package, “egrm,” which is publicly available on PyPI. Documentation of this package as well as simulation commands used in this study can be found on its GitHub page (https://github.com/Ephraim-usc/egrm). The Finnish dataset is available through dbGaP (dbGaP: phs000743.v1.p1, phs000756.v1.p1).
References
- 1. Chiang C.W.K., Mangul S., Robles C., Sankararaman S. A Comprehensive Map of Genetic Variation in the World’s Largest Ethnic Group-Han Chinese. Mol. Biol. Evol. 2018;35:2736–2750. doi: 10.1093/molbev/msy170.
- 2. Novembre J., Johnson T., Bryc K., Kutalik Z., Boyko A.R., Auton A., Indap A., King K.S., Bergmann S., Nelson M.R., et al. Genes mirror geography within Europe. Nature. 2008;456:98–101. doi: 10.1038/nature07331.
- 3. Hirschhorn J.N., Daly M.J. Genome-wide association studies for common diseases and complex traits. Nat. Rev. Genet. 2005;6:95–108. doi: 10.1038/nrg1521.
- 4. Patterson N., Price A.L., Reich D. Population structure and eigenanalysis. PLoS Genet. 2006;2:e190. doi: 10.1371/journal.pgen.0020190.
- 5. Price A.L., Patterson N.J., Plenge R.M., Weinblatt M.E., Shadick N.A., Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 2006;38:904–909. doi: 10.1038/ng1847.
- 6. Kang H.M., Sul J.H., Service S.K., Zaitlen N.A., Kong S.Y., Freimer N.B., Sabatti C., Eskin E. Variance component model to account for sample structure in genome-wide association studies. Nat. Genet. 2010;42:348–354. doi: 10.1038/ng.548.
- 7. Lippert C., Listgarten J., Liu Y., Kadie C.M., Davidson R.I., Heckerman D. FaST linear mixed models for genome-wide association studies. Nat. Methods. 2011;8:833–835. doi: 10.1038/nmeth.1681.
- 8. Listgarten J., Lippert C., Kadie C.M., Davidson R.I., Eskin E., Heckerman D. Improved linear mixed models for genome-wide association studies. Nat. Methods. 2012;9:525–526. doi: 10.1038/nmeth.2037.
- 9. Loh P.R., Tucker G., Bulik-Sullivan B.K., Vilhjálmsson B.J., Finucane H.K., Salem R.M., Chasman D.I., Ridker P.M., Neale B.M., Berger B., et al. Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nat. Genet. 2015;47:284–290. doi: 10.1038/ng.3190.
- 10. Zhou X., Stephens M. Genome-wide efficient mixed-model analysis for association studies. Nat. Genet. 2012;44:821–824. doi: 10.1038/ng.2310.
- 11. Yang J., Benyamin B., McEvoy B.P., Gordon S., Henders A.K., Nyholt D.R., Madden P.A., Heath A.C., Martin N.G., Montgomery G.W., et al. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 2010;42:565–569. doi: 10.1038/ng.608.
- 12. Speed D., Balding D.J. Relatedness in the post-genomic era: is it still useful? Nat. Rev. Genet. 2015;16:33–44. doi: 10.1038/nrg3821.
- 13. Thompson E.A. Identity by descent: variation in meiosis, across genomes, and in populations. Genetics. 2013;194:301–326. doi: 10.1534/genetics.112.148825.
- 14. Powell J.E., Visscher P.M., Goddard M.E. Reconciling the analysis of IBD and IBS in complex trait studies. Nat. Rev. Genet. 2010;11:800–805. doi: 10.1038/nrg2865.
- 15. Hill W.G., Weir B.S. Variation in actual relationship as a consequence of Mendelian sampling and linkage. Genet. Res. 2011;93:47–64. doi: 10.1017/S0016672310000480.
- 16. Visscher P.M., Medland S.E., Ferreira M.A.R., Morley K.I., Zhu G., Cornes B.K., Montgomery G.W., Martin N.G. Assumption-free estimation of heritability from genome-wide identity-by-descent sharing between full siblings. PLoS Genet. 2006;2:e41. doi: 10.1371/journal.pgen.0020041.
- 17. VanRaden P.M. Efficient methods to compute genomic predictions. J. Dairy Sci. 2008;91:4414–4423. doi: 10.3168/jds.2007-0980.
- 18. Speed D., Hemani G., Johnson M.R., Balding D.J. Improved heritability estimation from genome-wide SNPs. Am. J. Hum. Genet. 2012;91:1011–1021. doi: 10.1016/j.ajhg.2012.10.010.
- 19. Meuwissen T.H.E., Luan T., Woolliams J.A. The unified approach to the use of genomic and pedigree information in genomic evaluations revisited. J. Anim. Breed. Genet. 2011;128:429–439. doi: 10.1111/j.1439-0388.2011.00966.x.
- 20. Hickey J.M., Kinghorn B.P., Tier B., Clark S.A., van der Werf J.H.J., Gorjanc G. Genomic evaluations using similarity between haplotypes. J. Anim. Breed. Genet. 2013;130:259–269. doi: 10.1111/jbg.12020.
- 21. Luan T., Yu X., Dolezal M., Bagnato A., Meuwissen T.H. Genomic prediction based on runs of homozygosity. Genet. Sel. Evol. 2014;46:64. doi: 10.1186/s12711-014-0064-6.
- 22. Selle M.L., Steinsland I., Lindgren F., Brajkovic V., Cubric-Curik V., Gorjanc G. Hierarchical Modelling of Haplotype Effects on a Phylogeny. Front. Genet. 2021;11:531218. doi: 10.3389/fgene.2020.531218.
- 23. Han L., Abney M. Identity by descent estimation with dense genome-wide genotype data. Genet. Epidemiol. 2011;35:557–567. doi: 10.1002/gepi.20606.
- 24. Mancuso N., Rohland N., Rand K.A., Tandon A., Allen A., Quinque D., Mallick S., Li H., Stram A., Sheng X., et al. The contribution of rare variation to prostate cancer heritability. Nat. Genet. 2016;48:30–35. doi: 10.1038/ng.3446.
- 25. Hartman K.A., Rashkin S.R., Witte J.S., Hernandez R.D. Imputed Genomic Data Reveals a Moderate Effect of Low Frequency Variants to the Heritability of Complex Human Traits. Preprint at bioRxiv. 2019 doi: 10.1101/2019.12.18.879916.
- 26. Hernandez R.D., Uricchio L.H., Hartman K., Ye C., Dahl A., Zaitlen N. Ultrarare variants drive substantial cis heritability of human gene expression. Nat. Genet. 2019;51:1349–1355. doi: 10.1038/s41588-019-0487-7.
- 27. Locke A.E., Steinberg K.M., Chiang C.W.K., Service S.K., Havulinna A.S., Stell L., Pirinen M., Abel H.J., Chiang C.C., Fulton R.S., et al. Exome sequencing of Finnish isolates enhances rare-variant association power. Nature. 2019;572:323–328. doi: 10.1038/s41586-019-1457-z.
- 28. Chiang C.W., Ralph P., Novembre J. Conflation of Short Identity-by-Descent Segments Bias Their Inferred Length Distribution. G3 (Bethesda) 2016;6:1287–1296. doi: 10.1534/g3.116.027581.
- 29. Gusev A., Lowe J.K., Stoffel M., Daly M.J., Altshuler D., Breslow J.L., Friedman J.M., Pe’er I. Whole population, genome-wide mapping of hidden relatedness. Genome Res. 2009;19:318–326. doi: 10.1101/gr.081398.108.
- 30. Naseri A., Liu X., Tang K., Zhang S., Zhi D. RaPID: ultra-fast, powerful, and accurate detection of segments identical by descent (IBD) in biobank-scale cohorts. Genome Biol. 2019;20:143. doi: 10.1186/s13059-019-1754-8.
- 31. Hudson R.R. Gene genealogies and the coalescent process. Oxf. Surv. Evol. Biol. 1990;7:1–44.
- 32. Griffiths R.C., Marjoram P. Ancestral inference from samples of DNA sequences with recombination. J. Comput. Biol. 1996;3:479–502. doi: 10.1089/cmb.1996.3.479.
- 33. McVean G. A genealogical interpretation of principal components analysis. PLoS Genet. 2009;5:e1000686. doi: 10.1371/journal.pgen.1000686.
- 34. Ralph P., Thornton K., Kelleher J. Efficiently Summarizing Relationships in Large Samples: A General Duality Between Statistics of Genealogies and Genomes. Genetics. 2020;215:779–797. doi: 10.1534/genetics.120.303253.
- 35. Speidel L., Forest M., Shi S., Myers S.R. A method for genome-wide genealogy estimation for thousands of samples. Nat. Genet. 2019;51:1321–1329. doi: 10.1038/s41588-019-0484-x.
- 36. Kelleher J., Wong Y., Wohns A.W., Fadil C., Albers P.K., McVean G. Inferring whole-genome histories in large population datasets. Nat. Genet. 2019;51:1330–1338. doi: 10.1038/s41588-019-0483-y.
- 37. Wohns A.W., Wong Y., Jeffery B., Akbari A., Mallick S., Pinhasi R., Patterson N., Reich D., Kelleher J., McVean G. A unified genealogy of modern and ancient genomes. Science. 2022;375:eabi8264. doi: 10.1126/science.abi8264.
- 38. Zhang B.C., Biddanda A., Palamara P.F. Biobank-scale inference of ancestral recombination graphs enables genealogy-based mixed model association of complex traits. Preprint at bioRxiv. 2021 doi: 10.1101/2021.11.03.466843.
- 39. Zaidi A.A., Mathieson I. Demographic history mediates the effect of stratification on polygenic scores. eLife. 2020;9:e61548. doi: 10.7554/eLife.61548.
- 40. Gutenkunst R.N., Hernandez R.D., Williamson S.H., Bustamante C.D. Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genet. 2009;5:e1000695. doi: 10.1371/journal.pgen.1000695.
- 41. Kelleher J., Etheridge A.M., McVean G. Efficient Coalescent Simulation and Genealogical Analysis for Large Sample Sizes. PLoS Comput. Biol. 2016;12:e1004842. doi: 10.1371/journal.pcbi.1004842.
- 42. Lawson D.J., Hellenthal G., Myers S., Falush D. Inference of population structure using dense haplotype data. PLoS Genet. 2012;8:e1002453. doi: 10.1371/journal.pgen.1002453.
- 43. Sohail M., Maier R.M., Ganna A., Bloemendal A., Martin A.R., Turchin M.C., Chiang C.W., Hirschhorn J., Daly M.J., Patterson N., et al. Polygenic adaptation on height is overestimated due to uncorrected stratification in genome-wide association studies. eLife. 2019;8:e39702. doi: 10.7554/eLife.39702.
- 44. Wang S.R., Agarwala V., Flannick J., Chiang C.W., Altshuler D., Hirschhorn J.N., GoT2D Consortium Simulation of Finnish population history, guided by empirical genetic data, to assess power of rare-variant tests in Finland. Am. J. Hum. Genet. 2014;94:710–720. doi: 10.1016/j.ajhg.2014.03.019.
- 45. Martin A.R., Karczewski K.J., Kerminen S., Kurki M.I., Sarin A.-P., Artomov M., Eriksson J.G., Esko T., Genovese G., Havulinna A.S., et al. Haplotype Sharing Provides Insights into Fine-Scale Population History and Disease in Finland. Am. J. Hum. Genet. 2018;102:760–775. doi: 10.1016/j.ajhg.2018.03.003.
- 46. Kerminen S., Havulinna A.S., Hellenthal G., Martin A.R., Sarin A.-P., Perola M., Palotie A., Salomaa V., Daly M.J., Ripatti S., Pirinen M. Fine-Scale Genetic Structure in Finland. G3 (Bethesda) 2017;7:3459–3468. doi: 10.1534/g3.117.300217.
- 47. Stern A.J., Wilton P.R., Nielsen R. An approximate full-likelihood method for inferring selection and allele frequency trajectories from DNA sequence data. PLoS Genet. 2019;15:e1008384. doi: 10.1371/journal.pgen.1008384.
- 48. Stern A.J., Speidel L., Zaitlen N.A., Nielsen R. Disentangling selection on genetically correlated polygenic traits via whole-genome genealogies. Am. J. Hum. Genet. 2021;108:219–239. doi: 10.1016/j.ajhg.2020.12.005.
- 49. Li H., Durbin R. Inference of human population history from individual whole-genome sequences. Nature. 2011;475:493–496. doi: 10.1038/nature10231.
- 50. Schiffels S., Durbin R. Inferring human population size and separation history from multiple genome sequences. Nat. Genet. 2014;46:919–925. doi: 10.1038/ng.3015.
- 51. Armstrong K. Berghahn Books; New York: 2004. Remembering Karelia: a family’s story of displacement during and after the Finnish wars.
- 52. Jakkula E., Rehnström K., Varilo T., Pietiläinen O.P., Paunio T., Pedersen N.L., deFaire U., Järvelin M.R., Saharinen J., Freimer N., et al. The genome-wide patterns of variation expose significant substructure in a founder population. Am. J. Hum. Genet. 2008;83:787–794. doi: 10.1016/j.ajhg.2008.11.005.
- 53. Harris K., Pritchard J.K. Rapid evolution of the human mutation spectrum. eLife. 2017;6:e24284. doi: 10.7554/eLife.24284.
- 54. Deng Y., Song Y.S., Nielsen R. The distribution of waiting distances in ancestral recombination graphs. Theor. Popul. Biol. 2021;141:34–43. doi: 10.1016/j.tpb.2021.06.003.
- 55. Chiang C.W.K. The Opportunities and Challenges of Integrating Population Histories Into Genetic Studies for Diverse Populations: A Motivating Example From Native Hawaiians. Front. Genet. 2021;12:643883. doi: 10.3389/fgene.2021.643883.
- 56. Xu Z.M., Rüeger S., Zwyer M., Brites D., Hiza H., Reinhard M., Rutaihwa L., Borrell S., Isihaka F., Temba H., et al. Using population-specific add-on polymorphisms to improve genotype imputation in underrepresented populations. PLoS Comput. Biol. 2022;18:e1009628. doi: 10.1371/journal.pcbi.1009628.
- 57. Lin M., Caberto C., Wan P., Li Y., Lum-Jones A., Tiirikainen M., Pooler L., Nakamura B., Sheng X., Porcel J., et al. Population-specific reference panels are crucial for genetic analyses: an example of the CREBRF locus in Native Hawaiians. Hum. Mol. Genet. 2020;29:2275–2284. doi: 10.1093/hmg/ddaa083.
- 58. Martin A.R., Atkinson E.G., Chapman S.B., Stevenson A., Stroud R.E., Abebe T., Akena D., Alemayehu M., Ashaba F.K., Atwoli L., et al. Low-coverage sequencing cost-effectively detects known and novel variation in underrepresented populations. Am. J. Hum. Genet. 2021;108:656–668. doi: 10.1016/j.ajhg.2021.03.012.
- 59. Wojcik G.L., Fuchsberger C., Taliun D., Welch R., Martin A.R., Shringarpure S., Carlson C.S., Abecasis G., Kang H.M., Boehnke M., et al. Imputation-Aware Tag SNP Selection To Improve Power for Large-Scale, Multi-ethnic Association Studies. G3 (Bethesda) 2018;8:3255–3267. doi: 10.1534/g3.118.200502.
- 60. Speidel L., Cassidy L., Davies R.W., Hellenthal G., Skoglund P., Myers S.R. Inferring Population Histories for Ancient Genomes Using Genome-Wide Genealogies. Mol. Biol. Evol. 2021;38:3497–3511. doi: 10.1093/molbev/msab174.