Abstract
Genome-wide genealogies compactly represent the evolutionary history of a set of genomes and inferring them from genetic data has the potential to facilitate a wide range of analyses. We introduce a method, ARG-Needle, for accurately inferring biobank-scale genealogies from sequencing or genotyping array data, as well as strategies to utilize genealogies to perform association and other complex trait analyses. We use these methods to build genome-wide genealogies using genotyping data for 337,464 UK Biobank individuals and test for association across seven complex traits. Genealogy-based association detects more rare and ultra-rare signals (N = 134, frequency range 0.0007−0.1%) than genotype imputation using ~65,000 sequenced haplotypes (N = 64). In a subset of 138,039 exome sequencing samples, these associations strongly tag (average r = 0.72) underlying sequencing variants enriched (4.8×) for loss-of-function variation. These results demonstrate that inferred genome-wide genealogies may be leveraged in the analysis of complex traits, complementing approaches that require the availability of large, population-specific sequencing panels.
Subject terms: Genome-wide association studies, Population genetics
ARG-Needle is a method to infer genome-wide genealogies from large-scale genotyping data that can be used in association analyses. Applied to UK Biobank data, genealogy-based testing finds more trait associations than using imputed genotypes.
Main
Modeling genealogical relationships between individuals plays a key role in a wide range of analyses, including the study of natural selection1 and demographic history2, genotype phasing3 and imputation4. Due to the very large number of genealogical relationships that may give rise to observed genomic variation, data-driven inference of these relationships is computationally difficult5. For this reason, available methods for the inference of genealogies rely on strategies that trade model simplification for computational scalability, such as the use of approximate probabilistic models5–11, scalable heuristics12–16 or combinations of the two17,18. Recent advances enabled efficient estimation of the genealogical distance between genomic regions from ascertained genotype data11, rapid genealogical approximations for hundreds of thousands of samples15 and improved scalability of probabilistic inference17. However, available methods do not simultaneously offer all these features, so that scalable and accurate genealogical inference in modern biobanks remains challenging. In addition, these datasets contain extensive phenotypic information, but applications of inferred genealogies have primarily focused on evolutionary analyses. Early work suggested that genealogical data may be used to improve association and fine-mapping13,19, but the connections between genealogical inference and modern methodology for complex trait analysis20–22 remain under-explored.
We introduce a new algorithm, ARG-Needle, to accurately infer the ancestral recombination graph23 (ARG) for large collections of genotyping or sequencing samples. We demonstrate that the ARG of a sample may be used within a linear mixed model (LMM) framework to increase association power, detect association to unobserved genomic variants, infer narrow sense heritability and perform polygenic prediction. Using ARG-Needle, we infer the ARG for 337,464 UK Biobank samples and perform a genealogy-wide association scan for seven complex traits. We show that despite being inferred using only array data, the ARG detects more independent associations to rare and ultra-rare variants (minor allele frequency (MAF) < 0.1%) than imputation based on a reference panel of ~65,000 sequenced haplotypes of matched ancestry. We use 138,039 exome sequencing samples to confirm that these signals correspond to unobserved sequencing variants, which are strongly enriched for loss-of-function and other protein-altering variation and overlap with likely causal associations detected using within-cohort exome sequencing imputation. Using the ARG, we detect associations to variants as rare as MAF ≈ 4 × 10−6 and independent higher frequency variation that is not captured using imputation.
Results
Overview of the ARG-Needle method
The ARG is a graph in which nodes represent the genomes of individuals or their ancestors and edges represent genealogical connections (see Supplementary Note 1 for additional details). ARG-Needle infers the ARG for large genotyping array or sequencing datasets by iteratively ‘threading’9 one haploid sample at a time, as depicted in Fig. 1. Given an existing ARG, initialized to contain a single sample, we randomly select the next sample to be added (or threaded). We then compute a ‘threading instruction’, which at each genomic position provides the index of a sample in the ARG that is most closely related to the target sample, as well as their time to most recent common ancestor (TMRCA). We use this instruction to thread the target sample to the current ARG, and iterate until all samples have been included.
To compute the threading instruction of a sample, ARG-Needle first performs genotype hashing24,25 to rapidly detect a subset of candidate closest relatives within the ARG. It then uses the Ascertained Sequentially Markovian Coalescent (ASMC) algorithm11 to estimate the TMRCA between the new sample and each of these individuals at each site, threading to the closest individual. When all samples have been included, ARG-Needle uses a fast postprocessing step, which we call ARG normalization, to refine the estimated node times. ARG-Needle builds the ARG in time approximately linear in sample size (see below).
We also introduce a simple extension of ASMC11, called ASMC-clust, that builds genome-wide genealogies by forming a tree at each site using hierarchical clustering on pairwise TMRCAs output by ASMC. This approach scales quadratically with sample size but yields improved accuracy compared with ARG-Needle in certain simulated scenarios (see below). ARG-Needle and ASMC-clust efficiently represent and store ARGs using a graph data structure, which is an adaptation of the representation used within the ARGON simulator26. Additional details, theoretical guarantees and properties for the ARG-Needle and ASMC-clust algorithms are described in the Methods and Supplementary Note 1.
Accuracy of ARG reconstruction in simulated data
We used extensive simulations to compare the accuracy and scalability of ARG-Needle, ASMC-clust, Relate17, tsinfer and a variant of tsinfer designed for sparse datasets we refer to as ‘tsinfer-sparse’15. We considered several metrics to compare ARGs, including: the Robinson–Foulds distance27, which reflects dissimilarities between the mutations that may be generated by two ARGs; the root mean squared error (RMSE) between true and inferred pairwise TMRCAs, which captures the accuracy in predicting allele sharing between individuals; and the Kendall–Colijn (KC) topology-only distance28. We found that the KC distance is systematically lower for trees containing polytomies (that is, nodes with more than two children), which are not output by Relate, ASMC-clust or ARG-Needle (Extended Data Fig. 1b,c). We therefore applied a heuristic to allow these methods to output polytomies (see the Methods and Supplementary Note 2 for additional discussion). Although these three metrics capture similarity between marginal trees, they are not specifically developed for comparing ARGs. We therefore developed an additional metric, called the ARG total variation distance, which generalizes the Robinson–Foulds distance to better capture the ability of a reconstructed ARG to predict mutation patterns that may be generated by the true underlying ARG (see the Methods and Supplementary Note 2 for further details).
We measured ARG reconstruction accuracy in synthetic array datasets of up to 32,000 haploid samples (Fig. 2 and Methods). We also tested a variety of additional conditions, including different demographic histories, varying recombination rates and genotyping error. We also examined the effects of ARG normalization, of variations of the KC distance that account for branch lengths and of stratifying the total variation distance by allele frequency (Extended Data Figs. 1 and 2 and Supplementary Figs. 1–4). ARG-Needle tended to achieve best performance across all accuracy metrics in array data, sometimes tied or in close performance with ASMC-clust or Relate. In simulations of sequencing data, ASMC-clust performed best on the ARG total variation and TMRCA RMSE metrics, with ARG-Needle and Relate close in performance, while Relate performed best on the Robinson–Foulds metric (Extended Data Fig. 3). We next measured the speed and memory footprint of these methods. ARG-Needle requires lower computation and memory than Relate and ASMC-clust, which both scale quadratically with sample size (Fig. 2e,f and Extended Data Fig. 1e). It runs slower than tsinfer and tsinfer-sparse but with a similar (approximately linear) scaling (also see the Methods and Supplementary Note 1).
We next examined additional properties of the ARG-Needle and ASMC-clust algorithms. We found that the order used to thread samples into the ARG does not substantially affect accuracy (Supplementary Fig. 5a–d), but that averaging estimates obtained using different random threading orders may produce improved estimates of genealogical relationships and higher similarity to ARGs inferred using ASMC-clust (Supplementary Fig. 5c–f). We observed that inferred genealogies contain realistic linkage disequilibrium (LD) patterns (Extended Data Fig. 4a,b). ARG-Needle, ASMC-clust and Relate do not guarantee that the variants used to infer genealogies may be mapped to inferred marginal trees, but performed well when we considered the fraction of unobserved variants that could be mapped back to inferred genealogies (Extended Data Fig. 4c; also see ref. 17). Finally, we assessed the similarity of ARGs inferred using different algorithms, observing highest similarities between ASMC-clust and ARG-Needle, as well as between these methods and, in decreasing order, Relate, tsinfer-sparse and tsinfer (Supplementary Fig. 6a,b).
A genealogical approach to LMM analysis
LMMs enable state-of-the-art analysis of polygenic traits20,29,30,31. We developed an approach that uses the ARG of a set of genomes to perform mixed linear model association (MLMA29; Methods). More in detail, we use an ARG built from genotyping array data to infer the presence of unobserved variants and perform MLMA testing of these variants. This increases association power in two ways: the ARG is used to uncover putatively associated variants, while the LMM utilizes estimates of genomic similarity to model polygenicity, relatedness and population stratification29. We refer to association analyses that test variants in the ARG as ‘genealogy-wide association’ scans and, more specifically, to analyses that incorporate mixed linear model testing as ARG-MLMA. Genealogy-wide association complements genotype imputation based on a sequenced reference panel, as it enables capturing rare variants in the sample that may be absent from the panel or cannot be accurately imputed (Extended Data Fig. 5a). It also generalizes rare variant association strategies based on haplotype sharing13,19,25,32–35, as detailed in Supplementary Fig. 7. In simulations, we observed that for low-frequency variants genealogy-wide association may achieve higher association power than testing of variants imputed from a sequenced reference panel (Fig. 3a and Extended Data Fig. 6).
In addition, we developed strategies to leverage the ARG to obtain estimates of genomic similarity across individuals, which are aggregated in a genomic relatedness matrix (GRM; Methods) and are a key element of several mixed-model analyses of complex traits. We refer to GRMs built using this approach as ARG-GRMs and provide details of their construction and properties in Supplementary Note 3. We used ARG-GRMs to measure the amount of phenotypic variance captured by inferred ARGs (Extended Data Fig. 7a). In simulations, ARG-GRMs built using ARGs inferred by ARG-Needle in array data captured more narrow sense heritability than GRMs built using array data30,36,37 (Methods, Fig. 3b and Supplementary Fig. 8). We also performed additional simulations to test whether the modeling of unobserved genomic variation using ARG-GRMs may be leveraged to obtain performance gains in other LMM analyses. Indeed, ARG-GRMs built using true ARGs performed as well as GRMs computed using sequencing data in LMM-based heritability estimation, polygenic prediction and association (Methods, Fig. 3c and Extended Data Fig. 8). Applying these strategies to large-scale inferred ARGs, however, will require improved accuracy and scalability (Discussion).
Overall, these experiments suggest that accurate genealogical inference combined with LMMs improves association power, by testing variants that are not well tagged using available markers while modeling polygenicity. The ARG may also be potentially utilized to obtain improved estimates of genomic similarity and perform additional LMM-based complex trait analyses.
Genealogy-wide association scan of rare and ultra-rare variants in the UK Biobank
We applied ARG inference methods in a subset of the genome using UK Biobank data and observed results consistent with our simulations (Supplementary Fig. 6c,d). We then used ARG-Needle to build the genome-wide ARG from SNP array data for 337,464 individuals in the white British ancestry subset defined by ref. 38 (Methods). We performed ARG-MLMA for height and six molecular traits, comprising alkaline phosphatase, aspartate aminotransferase, low-density lipoprotein (LDL) / high-density lipoprotein (HDL) cholesterol, mean platelet volume and total bilirubin. To achieve the required scalability, we built on a recent MLMA method22,39, implicitly relying on an array-based GRM (Methods and Discussion). We compared ARG-MLMA with standard MLMA testing of variants imputed using the Haplotype Reference Consortium (HRC) and UK10K reference panels38,40,41 (hereafter, HRC + UK10K), comprising ~65,000 haplotypes. We focused on rare (0.01% ≤ MAF < 0.1%) and ultra-rare (MAF < 0.01%) genomic variants. We used resampling-based testing42 to establish genome-wide significance thresholds of P < 4.8 × 10−11 for ARG variants (sampled with mutation rate μ = 10−5) and P < 1.06 × 10−9 for imputed variants (Supplementary Table 1). For each analysis, we performed LD-based filtering to extract a stringent set of approximately independent associations (hereafter, ‘independent associations’; Methods). We leveraged a subset of 138,039 individuals with whole-exome sequencing (WES) data (hereafter, WES-138K) to validate these independent associations. For each detected independent variant, we selected the WES variant with the largest correlation, which we call its ‘WES partner’.
Applying this approach, we detected 134 independent signals using the ARG and 64 using imputation, jointly implicating 152 unique WES partners (Supplementary Tables 2 and 3). Of these WES variants, 36 were implicated using both approaches (Fig. 4a, and see Extended Data Fig. 9a for region-level results). The fraction of WES partners uniquely identified using the ARG was larger among ultra-rare variants (84%) compared with rare variants (42%), reflecting a scarcity of ultra-rare variants in the HRC + UK10K imputation panel. The phenotypic effects estimated in the 337,464 individuals using ARG-derived or imputed associations were strongly correlated to those directly estimated for the WES partners in the WES-138K dataset (Fig. 4b), with stronger average correlation (bootstrap P = 0.003) for ARG-derived variants ( = 0.93) compared with imputed variants ( = 0.80). Only 74% of the WES partners for ARG-derived rare variant associations were significantly associated (P < 5 × 10−8) in the smaller WES-138K dataset, a proportion that dropped to 59% for ultra-rare variants. Variants detected using genealogy-wide association had a larger average phenotypic effect than those detected via imputation (bootstrap P < 0.0001; average = 1.46; average = 0.90), reflecting lower average frequencies. In addition, WES partners of ARG-derived variants were ~4.8× enriched for loss-of-function variation (bootstrap P < 0.001; Fig. 4c), and WES partners implicated by either ARG or imputation were ~2.3× enriched for other protein-altering variation (Methods), supporting their likely causal role.
We also used variant-level precision and recall statistics (Methods) to measure the extent to which carrying an associated ARG-derived or imputed variant is predictive of carrying sequence-level WES partner variants (Fig. 4d). ARG-derived and imputed rare variants had similar levels of variant-level precision, while imputation had higher recall (bootstrap P = 0.0005). For ultra-rare variants, ARG-derived signals performed better than imputed variants for both precision (bootstrap P = 0.01) and recall (bootstrap P = 0.002). Similarly, ARG-derived and imputed rare variants provided comparable tagging for their WES partners (Extended Data Fig. 9b), while ARG-derived ultra-rare variants provided stronger tagging compared with imputed ultra-rare variants (average rARG = 0.77, rimp = 0.42, bootstrap P < 0.001). Compared with ARG-derived variants, genotype imputation has the advantage that associated variants that are sequenced in the reference panel may be directly localized in the genome. We found that for 21 of 52 rare and 2 of 12 ultra-rare independent imputation signals the WES partner had been imputed, while the remaining signals likely provide indirect tagging for underlying variants. ARG-derived and imputed variants, however, had similar distributions for the distance to their WES partners (Fig. 4e and Extended Data Fig. 9c). This suggests that genealogy-wide associations have the same spatial resolution as associations obtained using genotype imputation in cases where the variant driving the signal cannot be directly imputed.
We compared our results with those of a recent study that leveraged exome sequencing data from a subset of ~50,000 participants (hereafter, WES-50K) to perform genotype imputation for ~459,000 samples43. We found that, among the WES partners implicated using the ARG but not using HRC + UK10K imputation, 14 of 30 partners of rare and 26 of 55 partners of ultra-rare ARG variants were also flagged as likely causal associations (P < 5 × 10−8) in ref. 43 (Supplementary Table 2). The remaining 45 WES partners detected using the ARG but not reported in ref. 43 are often very rare variants (median MAF = 3.6 × 10−5; Extended Data Fig. 9d) of large effect (median || = 1.14), which are difficult to impute; 21 of 45 such variants were absent or singletons in the WES-50K reference panel or had poor imputation quality score. Associations uniquely detected using the ARG often extended allelic series at known genes. For instance, restricting to loss-of-function or other protein-altering WES partners for independent ARG signals not present or marginally significant in ref. 43, five novel associations with aspartate aminotransferase are mapped to the GOT1 gene (Fig. 4f) and four with alkaline phosphatase are mapped to ALPL (Extended Data Fig. 9e). A subset of strong independent associations uniquely detected by the ARG had weak correlation with their WES partners, possibly due to tagging of structural or regulatory variation absent from the WES-138K dataset (for example, a signal for aspartate aminotransferase with P = 7.4 × 10−39, MAFARG = 0.0005, WES partner r = 0.21, minor allele count (MAC)WES-138K = 6, MACWES-50K = 1).
In summary, genealogy-wide association using an ARG inferred from common SNPs revealed more rare and ultra-rare signals than genotype imputation based on ~65,000 reference haplotypes, and detected ultra-rare variants that were not associated using within-cohort imputation based on ~50,000 exome-sequenced participants. ARG-derived associations accurately predicted effect sizes for underlying sequencing variants, as well as the subset of carrier individuals.
Genealogy-wide association for low- and high-frequency variants
Lastly, we performed genealogy-wide association for low- (0.1% ≤ MAF < 1%) and high- (MAF ≥ 1%) frequency variants, which are more easily imputed using reference panels that are not necessarily large and population-specific. Consistent with this, extending our previous analysis to low-frequency variants yielded a similar number of independent associations for ARG-derived and HRC + UK10K-imputed variants (NARG = 103, Nimp = 100; Supplementary Tables 4 and 5 and Extended Data Fig. 10a–c). Associations detected using the ARG had slightly larger effects compared with those found using imputation (bootstrap P = 0.026; average |βARG| = 0.32, |βimp| = 0.27) but provided lower tagging to WES partners (bootstrap P < 0.001; average rARG = 0.57, rimp = 0.73), reflecting the large fraction (42 of 100) of imputation WES partners that were directly imputed.
We hypothesized that, although imputation of higher frequency variants is generally more accurate, branches in the marginal trees of the ARG may in some cases complement available markers by providing improved tagging of underlying variation. This may be the case, for instance, for short insertions/deletions or structural variants44, which are often underrepresented in reference panels41, or for variants of moderately high frequency, which may be difficult to impute45 (Extended Data Fig. 5a). To test this, we performed MLMA for height using HRC + UK10K-imputed variants, filtered as in ref. 38 (MAF > 0.1%, info score > 0.3; Methods), for which we established a resampling-based genome-wide significance threshold of 4.5 × 10−9 (95% confidence interval (95% CI): 2.2 × 10−9, 9.6 × 10−9). To facilitate direct comparison, we selected ARG-MLMA parameters (MAF > 1%, μ = 10−5; Methods) corresponding to a higher MAF cutoff but a comparable genome-wide significance threshold of 3.4 × 10−9 (95% CI: 2.4 × 10−9, 5 × 10−9) and adopted a threshold of 3 × 10−9 for all downstream analyses.
We first assessed the number of 1-megabase (Mb) regions that contain an association (P < 3 × 10−9) for genotype array, imputed or ARG-derived variants. We found that ARG-MLMA detected 98.9% of regions found by both SNP array and imputation, as well as 71% of regions found by imputation but not array data and an additional 8% of regions not found using either imputation or array data (Extended Data Fig. 10d). A significant fraction (54 of 92, permutation P < 0.0001) of regions identified using the ARG but not imputation contained associations (P < 3 × 10−9) in a larger meta-analysis by the Genetic Investigation of ANthropometric Traits (GIANT) consortium46 (N ≈ 700,000) comprising the UK Biobank and additional cohorts. Inspecting associated loci, we observed that ARG-MLMA captures association peaks and haplotype structure found using imputation but not array data (Fig. 5a–c and Supplementary Figs. 9 and 10a–e) as well as association peaks uniquely identified using ARG-MLMA (Fig. 5d and Supplementary Fig. 10f–h).
We sought to further assess the degree of overlap and complementarity of associations detected using SNP array data, imputation and the ARG, by performing LD-based filtering and conditional and joint (COJO47) association analyses (Fig. 5e and Methods). When we jointly considered either or both ARG-derived and imputed variants in addition to array markers, we observed an increase in the number of approximately independent COJO associations (P < 3 × 10−9; NSNP = 964, NSNP+ARG = 1,110, NSNP+imp = 1,126, NSNP+ARG+imp = 1,161). The fraction of COJO-associated array markers was reduced by the inclusion of ARG-derived or imputed variants, which resulted in comparable proportions of associations when jointly analyzed (Fig. 5e), suggesting that both ARG and imputation provide good tagging of underlying signal. By considering the set of 1-Mb regions harboring significant COJO associations, we verified that the additional COJO signals detected when including ARG-derived or imputed variants concentrated within regions that also harbor significant (P < 3 × 10−9) COJO signals in the GIANT meta-analysis46 (Fig. 5f and Extended Data Fig. 10e).
In summary, genealogy-wide association using the ARG inferred by ARG-Needle from SNP array data was less effective for the analysis of higher frequency variants because these variants could be more accurately imputed compared to rare and ultra-rare variants. However, ARG-derived variants revealed associated peaks and haplotypes that were not found through association of array data alone and in some cases complemented genotype imputation in detecting approximately independent associations. We note that the choices of filtering criteria, such as MAF threshold, imputation info score and ARG mutation rate, all affect the sensitivity and specificity of these analyses. Results for an analysis restricting to association of variants with MAF > 10% are shown in Supplementary Fig. 11.
Discussion
We developed ARG-Needle, a method for accurately inferring genome-wide genealogies from genomic data that scales to large biobank datasets. We performed extensive simulations, showing that ARG-Needle is both accurate and scalable when applied to genotyping array and sequencing data. We also developed a framework that combines inferred genealogies with LMMs to increase association power, and showed that this strategy may be further leveraged in analyses of heritability and polygenic prediction. We built genome-wide ARGs from genotyping array data for 337,464 UK Biobank individuals and performed a genealogy-wide association scan for seven quantitative phenotypes. Using the inferred ARG, we detected more large-effect associations to rare and ultra-rare variants than using genotype imputation from ~65,000 sequenced haplotypes, down to an allele frequency of ~4 × 10−6. We validated these signals using exome sequencing, showing that they tag underlying variants enriched for loss-of-function and other protein-altering variation. Associations detected using the ARG overlap with and extend fine-mapped associations detected using within-cohort genotype imputation. Applied to the analysis of higher frequency variants, the ARG revealed haplotype structure and independent signals complementary to those obtained using imputation.
These results highlight the importance of genealogical modeling in the analysis of complex traits. Genome-wide association analyses rely on the correlation between available markers and underlying variation48 and the MLMA framework accounts for polygenicity, relatedness and population structure29. In genealogy-wide association, the signal of LD is amplified by further modeling of past recombination events to infer the presence of hidden genomic variation. Through ARG-GRMs, inferred genealogies may facilitate better modeling of genomic similarity and polygenic effects, leading to improved robustness and increased statistical power.
These analyses also demonstrate that genealogical inference provides a complementary strategy to genotype imputation approaches, which rely on haplotype sharing between the analyzed samples and a sequenced reference panel to extend the set of available markers. Imputation has been successfully applied in several complex trait analyses4,36, but its efficacy for the study of rare variants hinges on the availability of large, population-specific sequencing panels, which are not widely available for all populations. Genealogy-wide association may therefore offer new avenues to better utilize genomic resources for underrepresented groups49.
We highlight several limitations and directions of future development. First, although genealogy-wide association enables detecting individuals carrying associated variants, it may implicate large genomic regions, whereas genotype imputation may associate individual variants if they are sequenced in the reference panel. When sequencing data are available, however, they may be utilized to further localize ARG-derived signals, for instance using WES partners. Second, although we have shown in simulation that ARG-GRMs built from true ARGs may be used to estimate heritability, perform prediction and increase association power, real data applications of this approach will require methodological improvements to increase LMM scalability50,51. Third, although our study was restricted to unrelated samples of homogeneous ancestry, we expect genealogy-wide association to be as susceptible as standard association to issues such as relatedness and population stratification29,52,53, requiring adequate control for these confounders. Fourth, although we have focused on leveraging an ARG inferred from array data alone, ARG-Needle enables building an ARG using a mixture of sequencing and array data. This approach may be used to perform additional analyses such as ARG-based genotype imputation, which is likely to improve upon approaches that do not model the TMRCA between target and reference samples54. In simulations we performed, this ARG-based imputation strategy obtained promising results (Supplementary Note 4, Extended Data Fig. 5 and Supplementary Fig. 12). Fifth, our analyses were limited to quantitative traits; support for MLMA of rare case/control traits will require methodological extensions. Sixth, we adopted a computationally intensive resampling-based approach42 to establish significance thresholds across filtering parameters; future work may lead to improved strategies to address multiple testing. Seventh, although we relied on several existing and novel metrics to analyze properties of the reconstructed ARGs, further research should develop additional metrics and explore their properties and relationships to downstream analyses. These metrics should be applicable for benchmarking methods that only infer the topology of an ARG as well as methods that focus on estimating branch lengths55. Eighth, reconstructing biobank-scale ARGs will likely aid the study of additional evolutionary properties of disease-associated variants, including analyses of natural selection acting on complex traits11,56,57 which we have not explored in this work. Finally, our analysis focused on the UK Biobank dataset, which provides an excellent testbed due to the large volumes of high-quality data of different types available for validation. Future applications of our methods will involve analysis of cohorts that are less strongly represented in current sequencing studies. Nevertheless, we believe that the results described in this work represent an advance in large-scale genealogical inference and provide new tools for the analysis of complex traits.
Methods
ARG-Needle and ASMC-clust algorithms
We introduce two algorithms to construct the ARG of a set of samples, called ARG-Needle and ASMC-clust. Both approaches leverage output from the ASMC algorithm11, which takes as input a pair of genotyping array or sequencing samples and outputs a posterior distribution of the TMRCA across the genome. ARG-Needle and ASMC-clust use this pairwise genealogical information to assemble the ARG for all individuals.
ASMC-clust runs ASMC on all pairs of samples and performs hierarchical clustering of TMRCA matrices to obtain an ARG. At every site, we apply the unweighted pair group method with arithmetic mean (UPGMA) clustering algorithm58 on the N × N posterior mean TMRCA matrix to yield a marginal tree. We combine these marginal trees into an ARG, using the midpoints between sites’ physical positions to decide when one tree ends and the next begins. Using an O(N2) implementation of UPGMA59,60, we achieve a runtime and memory complexity of O(N2M). We also implement an extension that achieves O(NM) memory but increased runtime (Supplementary Note 1).
ARG-Needle starts with an empty ARG and repeats three steps to add additional samples to the ARG: (1) detecting a set of closest genetic relatives via hashing, (2) running ASMC and (3) ‘threading’ the new sample into the ARG (Fig. 1). Given a new sample, step 1 performs a series of hash table queries to determine the candidate closest samples already in the ARG24. We divide up the sites present in the genetic data into nonoverlapping ‘words’ of S sites and store hash tables mapping from the possible values of the ith word to the samples that carry that word. We use this approach to rapidly detect samples already in the ARG that share words with the target sample and return the top K samples with the most consecutive matches. A tolerance parameter T controls the number of mismatches allowed in an otherwise consecutive stretch. We also allow the top K samples to vary across the genome due to recombination events, by partitioning the genome into regions of genetic distance L. Assuming this results in R regions, the hashing step outputs a matrix of R × K sample identities (IDs) containing the predicted top K related samples for each region. We note, however, that the hashing step can look arbitrarily far beyond the boundaries of each region to select the K samples.
The sample IDs output by step 1 inform step 2, in which ASMC is run over pairs of samples. In each of the R regions, ASMC computes the posterior mean and maximum a posteriori TMRCA between the sample being threaded and each of the K candidate most related samples. We add burn-in on either side of the region to provide additional context for the ASMC model, 2.0 centimorgans (cM) for all simulation experiments unless otherwise stated and 1.0 cM in real data inference for greater efficiency.
In step 3, ARG-Needle finds the minimum posterior mean TMRCA among the K candidates at each site of the genome. Note that both the use of a posterior mean estimator with a pairwise demographic prior and the selection of a minimum among K estimated values lead to bias in the final TMRCA estimates (Supplementary Fig. 3h), which we later address using a postprocessing normalization step (see below). The corresponding IDs determine which sample in the ARG to thread to at each site. Because the posterior mean assumes continuous values and changes at each site, we average the posterior mean over neighboring sites where the ID to thread to and the associated maximum a posteriori remain constant. This produces piecewise constant values which determine how high above the sample to thread, with changes corresponding to inferred recombination events. The sample is then efficiently threaded into the existing ARG, utilizing custom data structures and algorithms.
Throughout our analyses we adopted K = 64, T = 1, L = 0.5 cM for array data and L = 0.1 cM for sequencing data. We used S = 16 in simulations, and in real data analyses we increased S as threading proceeded to reduce computation without a major loss in accuracy. For additional details on all three steps in the ARG-Needle algorithm and our parameter choices, see Supplementary Note 1.
ARG normalization
ARG normalization applies a monotonically increasing mapping from existing node times to transformed node times (similar to quantile normalization), further utilizing the demographic prior provided in input. We compute quantile distributions of node times in the inferred ARG as well as in 1,000 independent trees simulated using the demographic model provided in input under the single-locus coalescent. We match the two quantile distributions and use this to rewrite all nodes in the inferred ARG to new corresponding times (Supplementary Note 1). ARG normalization preserves the time-based ordering of nodes and therefore does not alter the topology of an ARG. It is applied by default to our inferred ARGs and optionally to ARGs inferred by Relate (Extended Data Figs. 2–4 and Supplementary Figs. 1, 3 and 4).
Simulated genetic data
We used the msprime coalescent simulator61 to benchmark ARG inference algorithms. For each run, we first simulated sequence data with given physical length L for N haploid individuals, with L = 1 Mb for sequencing and L = 5 Mb for array data experiments. Our primary simulations used a mutation rate of μ = 1.65 × 10−8 per base pair per generation, a constant recombination rate of ρ = 1.2 × 10−8 per base pair per generation and a demographic model inferred using SMC++ on CEU (Utah residents with ancestry from Northern and Western Europe) 1,000 Genomes samples10. These simulations also output the simulated genealogies, which we refer to as ‘ground-truth ARGs’ or ‘true ARGs’. To obtain realistic SNP data, we subsampled the simulated sequence sites to match the genotype density and allele frequency spectrum of UK Biobank SNP array markers (chromosome 2, with density defined using 50 evenly spaced MAF bins). When running ASMC, we used decoding quantities precomputed for version 1.1, which were obtained using a European demographic model and UK Biobank SNP array allele frequencies, setting two haploid individuals for pairwise TMRCA inference as ‘distinguished’ and sampling 298 haploid individuals as ‘undistinguished’11. ASMC and the hashing step of ARG-Needle also require a genetic map, which we computed based on the recombination rate used in simulations.
In addition to our primary simulations, we included various additional simulation conditions where we modified one parameter while keeping all others fixed. First, we varied the recombination rate to ρ ∈ {6 × 10−9, 2.4 × 10−8} per base pair per generation. Second, we used a constant demographic model of 15,000 diploid individuals, for which we generated new decoding quantities to run ASMC. Third, we inferred ARGs using sequencing data, running ASMC in sequencing mode. Fourth, we introduced genotyping errors into the array data. After sampling the array SNPs, we flipped each haploid genotype per SNP and individual with probability p (Supplementary Fig. 4).
Comparisons of ARG inference methods
We compared ASMC-clust and ARG-Needle with the Relate17 and tsinfer15 algorithms. Relate runs a modified Li-and-Stephens algorithm62 for each haplotype, using all other haplotypes as reference panel. It then performs hierarchical clustering on the output to estimate the topology of marginal trees at each site. Finally, it estimates the marginal tree branch lengths using a Markov chain Monte Carlo algorithm with a coalescent prior. tsinfer uses a heuristic approach to find a set of haplotypes that will act as ancestors for other haplotypes and to rank them based on their estimated time of origin. It then runs a variation of the Li-and-Stephens algorithm to connect older ancestral haplotypes to their descendants, forming the topology of the ARG. To improve the performance of tsinfer in the analysis of UK Biobank array data, the authors developed an alternative approach where subsets of the analyzed individuals are added as potential ancestors15. This approach was motived by the sparsity of the variant sites, so we refer to it as ‘tsinfer-sparse’, obtaining its code from ref. 63.
We ran Relate with the mutation rate, recombination rate and demographic model used in simulations. We kept Relate’s default option which limits the memory used for storing pairwise matrices to 5 GB. Because the branch lengths output by tsinfer and tsinfer-sparse are not calibrated, we omitted these methods in comparisons for metrics involving branch lengths. For each choice of sample size, we generated genetic data using five random seeds (25 random seeds in Extended Data Fig. 3d,e) and applied ARG-Needle, ASMC-clust, Relate, tsinfer and (when dealing with array data) tsinfer-sparse to infer ARGs. Due to scalability differences, we ran ASMC-clust and Relate in up to N = 8,000 haploid samples (N = 4,000 for sequencing) and ARG-Needle, tsinfer and tsinfer-sparse in up to N = 32,000 haploid samples. All analyses used Intel Skylake 2.6 GHz nodes on the Oxford Biomedical Research Computing cluster.
The Robinson–Foulds metric27 counts the number of unique mutations that can be generated by one tree but not the other. Because polytomies can skew this metric, we randomly break polytomies as done in ref. 15. We report a genome-wide average, rescaled between 0 and 1.
We generalized the Robinson–Foulds metric to better capture the accuracy in predicting unobserved variants by incorporating ARG branch lengths. To this end, we consider the probability distribution of mutations that arise from uniform sampling on an ARG, and compare the resulting distributions for the true and inferred ARG using the total variation distance, a metric for comparing probability measures. Polytomies do not need to be broken using this metric, as they simply concentrate the probability mass on fewer predicted mutations. We refer to this metric as the ARG total variation distance, and note that it bears similarities to previous extensions of the Robinson–Foulds metric64,65 (see Supplementary Note 2 for further details, including an extension that stratifies by allele frequencies).
We also used the KC topology-only distance averaged over all positions to compare ARG topologies. We observed that for methods that output binary trees (Relate, ASMC-clust and ARG-Needle), performance substantially improved when we selected inferred branches at random and collapsed them to create polytomies (solid lines in Extended Data Figs. 1c and 3g), suggesting that the KC topology-only distance rewards inferred ARGs with polytomies. We further quantified the amount of polytomies output by tsinfer and tsinfer-sparse as the mean fraction of nonleaf branches collapsed per marginal tree, observing that when polytomies were randomly broken15, performance on the KC topology-only distance deteriorated (dashed lines in Fig. 2d and Extended Data Figs. 1b,c and 3f,g). To account for these observations, we compared all methods both with the restriction of no polytomies and with allowing all methods to output polytomies (Fig. 2d and Extended Data Figs. 1b,d and 3f,h). In the latter case, we formed polytomies in ARGs inferred by Relate, ASMC-clust and ARG-Needle using a heuristic to select and collapse branches that are less confidently inferred. For each marginal binary tree, we ordered the N − 2 nonleaf branches by computing the branch length divided by the height of the parent node, and collapsed a fraction f of branches for which this ratio is smallest (for additional details, see Supplementary Note 2).
We used the pairwise TMRCA RMSE metric to measure accuracy of inferred branch lengths. The KC distance may also consider branch lengths28, and we performed evaluations using the branch-length-aware versions of the KC distance with parameter λ = 1, which compares lengths between pairwise MRCA events and the root, and λ = 0.02, which combines branch length and topology estimation (Supplementary Fig. 1).
Supplementary Note 2 provides further details on the computation of these metrics and their interpretation in the context of ARG inference and downstream analyses.
ARG-MLMA
We developed an approach to perform MLMA of variants extracted from the ARG, which we refer to as ARG-MLMA. We sampled mutations from a given ARG using a specified rate μ and applied a mixed model association test to these variants. Note that each mutation occurs on a single branch of marginal trees, so that recurrent mutations are not modeled.
For simulation experiments (Fig. 3a and Extended Data Fig. 6) we tested all possible mutations on a true or inferred ARG, which is equivalent to adopting a large value of μ. We used sequencing variants from chromosomes 2–22 to form a polygenic background and added a single causal sequencing variant on chromosome 1 with effect size β. We varied the value of β and measured power as the fraction of runs (out of 100), detecting a significant association on the ARG for chromosome 1. For ARG-MLMA UK Biobank analyses we adopted μ = 10−5, also adding variants sampled with μ = 10−3 to locus-specific Manhattan plots to gain further insights. For additional details on our ARG-MLMA methods, including the determination of significance thresholds, see Supplementary Note 4.
Construction of ARG-GRMs
Consider N haploid individuals, M sites and genotypes xik for individual i at site k, where variant k has mean pk. Under an infinitesimal genetic architecture, the parameter α captures the strength of negative selection30,66, with a trait’s genetic component given by gi = ∑k βkxik where Var(βk) = (pk(1 − pk))α. Using available markers, a common estimator for the ij-th entry of the N × N GRM21 is
1 |
Given an ARG, we compute the ARG-GRM as the expectation of the marker-based GRM that would be obtained using sequencing data, assuming that mutations are sampled uniformly over the area of the ARG. When sequencing data are unavailable but an ARG can be estimated from an incomplete set of markers, the ARG-GRM may provide a good estimate for the sequence-based GRM. We briefly describe how ARG-GRMs are derived from the ARG for the special case of α = 0. We discuss the more general case and provide further derivations in Supplementary Note 3.
Assuming α = 0, equation (1) is equivalent (up to invariances described in Supplementary Note 3) to the matrix whose ij-th entry contains the number of genomic sites at which sequences i and j differ (that is, their Hamming distance). This may be expressed as
where ⊕ refers to the exclusive or (XOR) function. Assume there are L base pairs in the genome and a constant mutation rate per base pair of μ, and let tijk denote the TMRCA of i and j at base pair k. The ij-th entry of the ARG-GRM is equivalent to the expected number of mutations carried by only one of the two individuals, which is proportional to the sum of the pairwise TMRCAs across the genome (Extended Data Fig. 7a):
For increased efficiency, we compute a Monte Carlo ARG-GRM by uniformly sampling new mutations on the ARG with a high mutation rate and apply equation (1) to build the ARG-GRM using these mutations. We used simulations to verify that Monte Carlo ARG-GRMs converge to exactly computed ARG-GRMs for large mutation rates, saturating at μ ≈ 1.65 × 10−7 (Extended Data Fig. 7b,c), the default value we used for ARG-GRM computations. Stratified Monte Carlo ARG-GRMs may also be computed by partitioning the sampled mutations based on allele frequency, LD or allele age36,31,67,68 (Supplementary Note 3).
ARG-GRM simulation experiments
We simulated polygenic traits from haploid sequencing samples for various values of h2 and α. We varied the number of haploid samples N but fixed the ratio L/N throughout experiments, where L is the genetic length of the simulated region. For heritability and polygenic prediction experiments, we adopted L/N = 5 × 10−3 Mb per individual. For association experiments, we simulated a polygenic phenotype from 22 chromosomes, with each chromosome consisting of equal length L/22 and L/N = 5.5 × 10−3 Mb per individual. Mixed-model prediction r2 and association power may be roughly estimated as a function of h2 and the ratio N/M, where M is the number of markers39,69,70. We thus selected values of M and L such that the N/M ratio is kept close to that of the UK Biobank (L = 3 × 103 Mb, N ≈ 6 × 105).
We computed GRMs using ARGs, SNP data, imputed data (IMPUTE4 (ref. 38) within-cohort imputation) and sequencing data, and performed complex trait analyses using GCTA21. Polygenic prediction used cvBLUP71 leave-one-out prediction within GCTA. ARG-GRM association experiments (Fig. 3c and Extended Data Fig. 8c,f) tested array SNPs on each chromosome while using GRMs built on the other 21 chromosomes to increase power, measured as the relative increase of mean −log10(P) compared with linear regression. We observed that MAF-stratification for ARG-GRMs of true ARGs enabled robust heritability estimation and polygenic prediction if α is unknown (Extended Data Fig. 8g). In experiments involving inferred ARGs (Fig. 3b and Supplementary Fig. 8), we applied MAF-stratification for ARG-Needle ARGs and imputed data, but not for SNP data, for which GCTA did not converge.
ARG-Needle inference in the UK Biobank
Starting from 488,337 samples and 784,256 available autosomal array variants (including SNPs and short indels), we removed 135 samples (129 withdrawn, 6 due to missingness > 10%) and 57,126 variants (missingness > 10%). We performed reference-free phasing of the remaining variants and samples using Beagle 5.1 (ref. 72) and extracted the unrelated white British ancestry subset defined in ref. 38, yielding 337,464 samples. We built the ARG of these samples using ARG-Needle, using parameters described above. We parallelized the ARG inference by splitting phased genotypes into 749 nonoverlapping ‘chunks’ of approximately equal numbers of variants. We added 50 variants on either side of each chunk to provide additional context for inference and independently applied ARG normalization on each chunk. For our brief comparison of ARG inference methods in real data (Supplementary Fig. 6c,d), we repeatedly sampled independent subsets of N = 2,000 and N = 16,000 diploid individuals, and inferred the ARG for these individuals using the first chunk in the second half of chromosome 1, which amounted to 7.5 Mb.
Genealogy-wide association scan in the UK Biobank
To process phenotypes (standing height, alkaline phosphatase, aspartate aminotransferase, low-density lipoprotein (LDL)/high-density lipoprotein (HDL) cholesterol, mean platelet volume and total bilirubin) we first stratified by sex and performed quantile normalization. We then regressed out age, age squared, genotyping array, assessment center and the first 20 genetic principal components computed in ref. 38. We built a noninfinitesimal BOLT-LMM mixed model using SNP array variants, then tested HRC + UK10K-imputed data38,40,41 and variants inferred using the ARG (ARG-MLMA, described above). For association of imputed data (including SNP array) we restricted to variants with Hardy–Weinberg equilibrium P > 10−12, missingness < 0.05 and info score > 0.3 (matching the filtering criteria adopted in ref. 38). For all analyses we did not test variants with an MAC < 5 and used MAF thresholds detailed below.
Association analysis of seven traits
Using the filtering criteria above and no additional MAF cutoff, we obtained resampling-based genome-wide significance thresholds of P < 4.8 × 10−11 (95% CI: 4.06 × 10−11, 5.99 × 10−11) for ARG and P < 1.06 × 10−9 (95% CI: 5.13 × 10−10, 2.08 × 10−9) for imputation (Supplementary Table 1 and Supplementary Note 4). After performing genome-wide MLMA for the seven traits, we selected genomic regions harboring low-frequency (0.1% ≤ MAF < 1%), rare (0.01% ≤ MAF < 0.1%) or ultra-rare (MAF < 0.01%) variant associations. We then formed regions by grouping any associated variants within 1 Mb of each other and adding 0.5 Mb on either side of these groups.
We next performed several further filtering and association analyses using a procedure similar to ref. 43 to extract sets of approximately independent signals (‘independent’ for short; Supplementary Tables 2–5 and Supplementary Note 4). Of the seven phenotypes, total bilirubin did not yield any rare or ultra-rare independent signals and height did not yield any independent ultra-rare signals. We leveraged the UK Biobank WES data to validate and localize independent associations. We extracted 138,039 exome-sequenced samples that overlap with the analyzed set of white British ancestry individuals and performed lift-over of exome sequencing positions from genome build hg38 to hg19. We computed pairwise LD between the set of independent associated variants and the set of all WES variants, defining the ‘WES partner’ of an independent variant to be the WES variant with largest r2 to it. In a few instances, the same WES partner was selected by two ARG variants or two imputation variants (Supplementary Tables 2–5). Additionally, three WES partners were selected by one ultra-rare ARG and one rare imputation variant, and one WES partner was selected by one rare ARG and two ultra-rare imputation variants; these WES partners counted towards the 36 WES partners identified by both methods in rare and ultra-rare analyses, but were not counted as jointly identified when restricting to only rare or ultra-rare categories (as in Fig. 4a). We labeled WES variants by gene and functional status (‘loss-of-function’ and ‘other protein altering’; Supplementary Note 4) based on annotations obtained using the Ensembl Variant Effect Predictor (VEP) tool73.
Association analysis for higher frequency variants and height
For genome-wide association analyses of higher frequency variants and height, we matched filtering criteria used in ref. 38, retaining imputed variants that satisfy the basic filters listed above, as well as MAF ≥ 0.1%. Using these criteria, we estimated a resampling-based genome-wide significance threshold of 4.5 × 10−9 (95% CI: (2.2 × 10−9, 9.6 × 10−9); Supplementary Table 1). To facilitate direct comparison, we aimed to select parameters that would result in a comparable significance threshold for the ARG-MLMA analysis. Two sets of parameters satisfied this requirement: 3.4 × 10−9 (95% CI: 2.4 × 10−9, 5 × 10−9), obtained for μ = 10−5, MAF ≥ 1%; and 4 × 10−9 (95% CI: 3.1 × 10−9, 5.3 × 10−9), obtained for μ = 10−6, MAF ≥ 0.1%. We selected the former set of parameters, as a low sampling rate of μ = 10−6 leads to worse signal-to-noise and lower association power. We thus used a significance threshold of P < 3 × 10−9 for all analyses of higher frequency variants. We used PLINK74,75 (v.1.90b6.21 and v.2.00a3LM) and GCTA21 (v.1.93.2) to detect approximately independent associations using COJO47, retaining results with COJO P < 3 × 10−9 (Fig. 5e,f, Extended Data Fig. 10e, Supplementary Fig. 11 and Supplementary Note 4).
Statistics and reproducibility
For real data analysis in the UK Biobank, we included all 337,464 individuals of white British ancestry (as reported in ref. 38) who did not have genotype missingness > 10% and had not withdrawn from the UK Biobank at the time of our analysis. To further explore our findings, we selected the 138,039 of these individuals who were exome sequenced in the 200,000 UK Biobank whole-exome sequencing release.
Ethics
UK Biobank data were analyzed after approval of UK Biobank proposal no. 43206.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Online content
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at 10.1038/s41588-023-01379-x.
Supplementary information
Acknowledgements
We thank P.-R. Loh, A. Gusev, S. R. Myers, R. Davies, N. Whiffin, A. Dahl and R. Fournier for discussions and suggestions; and S. Shi, J. Nait Saada, G. Kalantzis and J. Zhu for sharing code used for various parts of the analysis. This work was conducted using the UK Biobank resource (application no. 43206). We thank the participants of the UK Biobank project. This work was supported by the Clarendon Scholarship (to Á.F.G. and B.C.Z.); NIH grant no. R21-HG010748-01 (to P.F.P., F.C. and A.B.); Wellcome Trust ISSF grant no. 204826/Z/16/Z (to P.F.P.); Wellcome Trust grant no. 222336/Z/21/Z (to Á.F.G.); and ERC Starting Grant no. ARGPHENO 850869 (to P.F.P., F.C., A.B. and B.C.Z.). Computation used the Oxford Biomedical Research Computing (BMRC) facility, a joint development between the Wellcome Centre for Human Genetics and the Big Data Institute supported by Health Data Research UK and the NIHR Oxford Biomedical Research Centre. Financial support was provided by the Wellcome Trust Core Award Grant no. 203141/Z/16/Z. The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health. This research was funded in whole, or in part, by the Wellcome Trust (204826/Z/16/Z; 222336/Z/21/Z; 203141/Z/16/Z). For the purpose of Open Access, the author has applied a CC BY public copyright license to any Author Accepted Manuscript version arising from this submission. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.
Extended data
Author contributions
P.F.P. designed the study. B.C.Z. and P.F.P. implemented the ASMC-clust and ARG-Needle algorithms. B.C.Z., Á.F.G. and P.F.P. performed and analyzed simulations. B.C.Z., A.B. and P.F.P. performed analysis of UK Biobank data. F.C. developed software tools. B.C.Z. and P.F.P. wrote the manuscript.
Peer review
Peer review information
Nature Genetics thanks Leo Speidel and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Data availability
COJO association signals for higher frequency ARG variants with height are available at 10.5281/zenodo.7411562. VEP annotations were generated using the Ensembl VEP tool (v.101.0, output produced February 2021), https://www.ensembl.org/info/docs/tools/vep/index.html. UK Biobank data can be accessed by approved researchers through https://www.ukbiobank.ac.uk/. Other datasets were downloaded from the following URLs: summary statistics from whole-exome imputation from 50,000 sequences43, https://data.broadinstitute.org/lohlab/UKB_exomeWAS/; likely causal associations from whole-exome imputation from 50,000 sequences43, https://www.nature.com/articles/s41588-021-00892-1 Supplementary Table 3; GIANT consortium summary statistics in ~700,000 (ref. 46), https://portals.broadinstitute.org/collaboration/giant/index.php/GIANT_consortium_data_files.
Code availability
The arg-needle and arg-needle-lib software packages, which implement the ARG-Needle and ASMC-clust methods as well as methods for the main analyses in this paper, are available at https://palamaralab.github.io/software/argneedle/. Python packages can be downloaded at https://pypi.org/project/arg-needle/ and https://pypi.org/project/arg-needle-lib/; analysis scripts are available at 10.5281/zenodo.7745745. External softwares used in the current study were obtained from the following URLs: msprime (v.0.7.4), https://pypi.org/project/msprime/; tsinfer (v.0.1.4), https://pypi.org/project/tsinfer/; tsinfer scripts for sparse data (accessed January 2022), https://github.com/mcveanlab/treeseq-inference; Relate (v.1.0.15), https://myersgroup.github.io/relate/; ARGON (v.0.1.160415), https://github.com/pierpal/ARGON/; DASH (v.1.1) and GERMLINE (v.1.5.3), http://www1.cs.columbia.edu/~gusev/dash/; IMPUTE4 (v.4.1.2), https://jmarchini.org/software/#impute-4; Beagle (v.5.1), https://faculty.washington.edu/browning/beagle/b5_1.html; PLINK (v.1.90b6.21), https://www.cog-genomics.org/plink/; PLINK (v.2.00a3LM), https://www.cog-genomics.org/plink/2.0/; GCTA (v.1.93.2), https://cnsgenomics.com/software/gcta/; BOLT-LMM (v.2.3.2), https://alkesgroup.broadinstitute.org/BOLT-LMM/downloads/; LiftOver (used April 2021), https://genome.ucsc.edu/cgi-bin/hgLiftOver.
Competing interests
During the revision of this manuscript, A.B. became an employee of 54gene and Á.F.G. became an employee of deCODE genetics/Amgen. The remaining authors declare no competing interests.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
is available for this paper at 10.1038/s41588-023-01379-x.
Supplementary information
The online version contains supplementary material available at 10.1038/s41588-023-01379-x.
References
- 1.Bamshad M, Wooding SP. Signatures of natural selection in the human genome. Nat. Rev. Genet. 2003;4:99–110. doi: 10.1038/nrg999. [DOI] [PubMed] [Google Scholar]
- 2.Beichman AC, Huerta-Sanchez E, Lohmueller KE. Using genomic data to infer historic population dynamics of nonmodel organisms. Annu. Rev. Ecol. Evol. Syst. 2018;49:433–456. [Google Scholar]
- 3.Browning SR, Browning BL. Haplotype phasing: existing methods and new developments. Nat. Rev. Genet. 2011;12:703–714. doi: 10.1038/nrg3054. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Marchini J, Howie B. Genotype imputation for genome-wide association studies. Nat. Rev. Genet. 2010;11:499–511. doi: 10.1038/nrg2796. [DOI] [PubMed] [Google Scholar]
- 5.McVean GA, Cardin NJ. Approximating the coalescent with recombination. Philos. Trans. R. Soc. Lond. B Biol. Sci. 2005;360:1387–1393. doi: 10.1098/rstb.2005.1673. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Li H, Durbin R. Inference of human population history from individual whole-genome sequences. Nature. 2011;475:493–496. doi: 10.1038/nature10231. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Sheehan S, Harris K, Song YS. Estimating variable effective population sizes from multiple genomes: a sequentially Markov conditional sampling distribution approach. Genetics. 2013;194:647–662. doi: 10.1534/genetics.112.149096. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Schiffels S, Durbin R. Inferring human population size and separation history from multiple genome sequences. Nat. Genet. 2014;46:919–925. doi: 10.1038/ng.3015. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Rasmussen MD, Hubisz MJ, Gronau I, Siepel A. Genome-wide inference of ancestral recombination graphs. PLoS Genet. 2014;10:e1004342. doi: 10.1371/journal.pgen.1004342. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Terhorst J, Kamm JA, Song YS. Robust and scalable inference of population history from hundreds of unphased whole genomes. Nat. Genet. 2017;49:303–309. doi: 10.1038/ng.3748. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Palamara PF, Terhorst J, Song YS, Price AL. High-throughput inference of pairwise coalescence times identifies signals of selection and enriched disease heritability. Nat. Genet. 2018;50:1311–1317. doi: 10.1038/s41588-018-0177-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Lyngsø, R. B., Song, Y. S. & Hein, J. Minimum recombination histories by branch and bound. in International Workshop on Algorithms in Bioinformatics (eds Casadio, R. & Myers, G.) 239–250 (Springer, 2005).
- 13.Minichiello MJ, Durbin R. Mapping trait loci by use of inferred ancestral recombination graphs. Am. J. Hum. Genet. 2006;79:910–922. doi: 10.1086/508901. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Mirzaei S, Wu Y. RENT+: an improved method for inferring local genealogical trees from haplotypes with recombination. Bioinformatics. 2017;33:1021–1030. doi: 10.1093/bioinformatics/btw735. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Kelleher J, et al. Inferring whole-genome histories in large population datasets. Nat. Genet. 2019;51:1330–1338. doi: 10.1038/s41588-019-0483-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Schaefer NK, Shapiro B, Green RE. An ancestral recombination graph of human, Neanderthal, and Denisovan genomes. Sci. Adv. 2021;7:eabc0776. doi: 10.1126/sciadv.abc0776. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Speidel L, Forest M, Shi S, Myers SR. A method for genome-wide genealogy estimation for thousands of samples. Nat. Genet. 2019;51:1321–1329. doi: 10.1038/s41588-019-0484-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Speidel L, et al. Inferring population histories for ancient genomes using genome-wide genealogies. Mol. Biol. Evol. 2021;38:3497–3511. doi: 10.1093/molbev/msab174. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Zöllner S, Pritchard JK. Coalescent-based association mapping and fine mapping of complex trait loci. Genetics. 2005;169:1071–1092. doi: 10.1534/genetics.104.031799. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Kang HM, et al. Efficient control of population structure in model organism association mapping. Genetics. 2008;178:1709–1723. doi: 10.1534/genetics.107.080101. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Yang J, Lee SH, Goddard ME, Visscher PM. GCTA: a tool for genome-wide complex trait analysis. Am. J. Hum. Genet. 2011;88:76–82. doi: 10.1016/j.ajhg.2010.11.011. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Loh P-R, Kichaev G, Gazal S, Schoech AP, Price AL. Mixed-model association for biobank-scale datasets. Nat. Genet. 2018;50:906–908. doi: 10.1038/s41588-018-0144-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Griffiths RC, Marjoram P. An ancestral recombination graph. Inst. Math. Appl. 1997;87:257. [Google Scholar]
- 24.Gusev A, et al. Whole population, genome-wide mapping of hidden relatedness. Genome Res. 2009;19:318–326. doi: 10.1101/gr.081398.108. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Nait Saada J, et al. Identity-by-descent detection across 487,409 British samples reveals fine scale population structure and ultra-rare variant associations. Nat. Commun. 2020;11:6130. doi: 10.1038/s41467-020-19588-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Palamara PF. ARGON: fast, whole-genome simulation of the discrete time Wright-Fisher process. Bioinformatics. 2016;32:3032–3034. doi: 10.1093/bioinformatics/btw355. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Robinson DF, Foulds LR. Comparison of phylogenetic trees. Math. Biosci. 1981;53:131–147. [Google Scholar]
- 28.Kendall M, Colijn C. Mapping phylogenetic trees to reveal distinct patterns of evolution. Mol. Biol. Evol. 2016;33:2735–2743. doi: 10.1093/molbev/msw124. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Yang J, Zaitlen NA, Goddard ME, Visscher PM, Price AL. Advantages and pitfalls in the application of mixed-model association methods. Nat. Genet. 2014;46:100–106. doi: 10.1038/ng.2876. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Yang J, et al. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 2010;42:565–569. doi: 10.1038/ng.608. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Evans LM, et al. Comparison of methods that use whole genome data to estimate the heritability and genetic architecture of complex traits. Nat. Genet. 2018;50:737–745. doi: 10.1038/s41588-018-0108-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Templeton AR, Crandall KA, Sing CF. A cladistic analysis of phenotypic associations with haplotypes inferred from restriction endonuclease mapping and DNA sequence data. III. Cladogram estimation. Genetics. 1992;132:619–633. doi: 10.1093/genetics/132.2.619. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Houwen RH, et al. Genome screening by searching for shared segments: mapping a gene for benign recurrent intrahepatic cholestasis. Nat. Genet. 1994;8:380–386. doi: 10.1038/ng1294-380. [DOI] [PubMed] [Google Scholar]
- 34.Gusev A, et al. DASH: a method for identical-by-descent haplotype mapping uncovers association with recent variation. Am. J. Hum. Genet. 2011;88:706–717. doi: 10.1016/j.ajhg.2011.04.023. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Browning SR, Thompson EA. Detecting rare variant associations by identity-by-descent mapping in case-control studies. Genetics. 2012;190:1521–1531. doi: 10.1534/genetics.111.136937. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Yang J, et al. Genetic variance estimation with imputed variants finds negligible missing heritability for human height and body mass index. Nat. Genet. 2015;47:1114–1120. doi: 10.1038/ng.3390. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Wainschtein P, et al. Assessing the contribution of rare variants to complex trait heritability from whole-genome sequence data. Nat. Genet. 2022;54:263–273. doi: 10.1038/s41588-021-00997-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Bycroft C, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562:203–209. doi: 10.1038/s41586-018-0579-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Loh P-R, et al. Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nat. Genet. 2015;47:284–290. doi: 10.1038/ng.3190. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Huang J, et al. Improved imputation of low-frequency and rare variants using the UK10K haplotype reference panel. Nat. Commun. 2015;6:8111. doi: 10.1038/ncomms9111. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.McCarthy S, et al. A reference panel of 64,976 haplotypes for genotype imputation. Nat. Genet. 2016;48:1279–1283. doi: 10.1038/ng.3643. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Kanai M, Tanaka T, Okada Y. Empirical estimation of genome-wide significance thresholds based on the 1000 Genomes Project data set. J. Hum. Genet. 2016;61:861–866. doi: 10.1038/jhg.2016.72. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Barton AR, Sherman MA, Mukamel RE, Loh P-R. Whole-exome imputation within UK Biobank powers rare coding variant association and fine-mapping analyses. Nat. Genet. 2021;53:1260–1269. doi: 10.1038/s41588-021-00892-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Mukamel RE, et al. Protein-coding repeat polymorphisms strongly shape diverse human phenotypes. Science. 2021;373:1499–1505. doi: 10.1126/science.abg8289. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Taliun D, et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature. 2021;590:290–299. doi: 10.1038/s41586-021-03205-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Yengo L, et al. Meta-analysis of genome-wide association studies for height and body mass index in ~700000 individuals of European ancestry. Hum. Mol. Genet. 2018;27:3641–3649. doi: 10.1093/hmg/ddy271. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Yang J, et al. Conditional and joint multiple-SNP analysis of GWAS summary statistics identifies additional variants influencing complex traits. Nat. Genet. 2012;44:369–375. doi: 10.1038/ng.2213. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48.Reich DE, et al. Linkage disequilibrium in the human genome. Nature. 2001;411:199–204. doi: 10.1038/35075590. [DOI] [PubMed] [Google Scholar]
- 49.Martin AR, et al. Clinical use of current polygenic risk scores may exacerbate health disparities. Nat. Genet. 2019;51:584–591. doi: 10.1038/s41588-019-0379-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Loh P-R, et al. Contrasting genetic architectures of schizophrenia and other complex diseases using fast variance-components analysis. Nat. Genet. 2015;47:1385. doi: 10.1038/ng.3431. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.Pazokitoroudi A, et al. Efficient variance components analysis across millions of genomes. Nat. Commun. 2020;11:4020. doi: 10.1038/s41467-020-17576-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Berg JJ, et al. Reduced signal for polygenic adaptation of height in UK Biobank. eLife. 2019;8:e39725. doi: 10.7554/eLife.39725. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53.Sohail M, et al. Polygenic adaptation on height is overestimated due to uncorrected stratification in genome-wide association studies. eLife. 2019;8:e39702. doi: 10.7554/eLife.39702. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54.Si Y, Vanderwerff B, Zöllner S. Why are rare variants hard to impute? Coalescent models reveal theoretical limits in existing algorithms. Genetics. 2021;217:iyab011. doi: 10.1093/genetics/iyab011. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55.Wohns AW, et al. A unified genealogy of modern and ancient genomes. Science. 2022;375:eabi8264. doi: 10.1126/science.abi8264. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.Yasumizu Y, et al. Genome-wide natural selection signatures are linked to genetic risk of modern phenotypes in the Japanese population. Mol. Biol. Evol. 2020;37:1306–1316. doi: 10.1093/molbev/msaa005. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57.Stern AJ, Speidel L, Zaitlen NA, Nielsen R. Disentangling selection on genetically correlated polygenic traits via whole-genome genealogies. Am. J. Hum. Genet. 2021;108:219–239. doi: 10.1016/j.ajhg.2020.12.005. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 58.Sneath, P. H. & Sokal, R. R. Numerical Taxonomy. The Principles and Practice of Numerical Classification (W. H. Freeman and Co., 1973).
- 59.Gronau I, Moran S. Optimal implementations of UPGMA and other common clustering algorithms. Inf. Process. Lett. 2007;104:205–210. [Google Scholar]
- 60.Müllner D. fastcluster: fast hierarchical, agglomerative clustering routines for R and Python. J. Stat. Softw. 2013;53:1–18. [Google Scholar]
- 61.Kelleher J, Etheridge AM, McVean G. Efficient coalescent simulation and genealogical analysis for large sample sizes. PLoS Comput. Biol. 2016;12:e1004842. doi: 10.1371/journal.pcbi.1004842. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 62.Li N, Stephens M. Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics. 2003;165:2213–2233. doi: 10.1093/genetics/165.4.2213. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 63.Wong, Y., Kelleher, J., Wohns, A. W. & Fadil, C. Evaluating tsinfer. GitHub https://github.com/mcveanlab/treeseq-inference (2020).
- 64.Robinson DF, Foulds LR. Comparison of phylogenetic trees. Math. Biosci. 1981;53:131–147. [Google Scholar]
- 65.Kuhner MK, Felsenstein J. A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Mol. Biol. Evol. 1994;11:459–468. doi: 10.1093/oxfordjournals.molbev.a040126. [DOI] [PubMed] [Google Scholar]
- 66.Speed D, Hemani G, Johnson MR, Balding DJ. Improved heritability estimation from genome-wide SNPs. Am. J. Hum. Genet. 2012;91:1011–1021. doi: 10.1016/j.ajhg.2012.10.010. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 67.Lee SH, et al. Estimation of SNP heritability from dense genotype data. Am. J. Hum. Genet. 2013;93:1151–1155. doi: 10.1016/j.ajhg.2013.10.015. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 68.Gazal S, et al. Linkage disequilibrium–dependent architecture of human complex traits shows action of negative selection. Nat. Genet. 2017;49:1421–1427. doi: 10.1038/ng.3954. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 69.Wray NR, et al. Pitfalls of predicting complex traits from SNPs. Nat. Rev. Genet. 2013;14:507–515. doi: 10.1038/nrg3457. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 70.Daetwyler HD, Villanueva B, Woolliams JA. Accuracy of predicting the genetic risk of disease using a genome-wide approach. PLoS ONE. 2008;3:e3395. doi: 10.1371/journal.pone.0003395. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 71.Mefford J, et al. Efficient estimation and applications of cross-validated genetic predictions to polygenic risk scores and linear mixed models. J. Comput. Biol. 2020;27:599–612. doi: 10.1089/cmb.2019.0325. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 72.Browning SR, Browning BL. Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am. J. Hum. Genet. 2007;81:1084–1097. doi: 10.1086/521987. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 73.McLaren W, et al. The ensembl variant effect predictor. Genome Biol. 2016;17:122. doi: 10.1186/s13059-016-0974-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 74.Purcell S, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 2007;81:559–575. doi: 10.1086/519795. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 75.Chang CC, et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience. 2015;4:7. doi: 10.1186/s13742-015-0047-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
COJO association signals for higher frequency ARG variants with height are available at 10.5281/zenodo.7411562. VEP annotations were generated using the Ensembl VEP tool (v.101.0, output produced February 2021), https://www.ensembl.org/info/docs/tools/vep/index.html. UK Biobank data can be accessed by approved researchers through https://www.ukbiobank.ac.uk/. Other datasets were downloaded from the following URLs: summary statistics from whole-exome imputation from 50,000 sequences43, https://data.broadinstitute.org/lohlab/UKB_exomeWAS/; likely causal associations from whole-exome imputation from 50,000 sequences43, https://www.nature.com/articles/s41588-021-00892-1 Supplementary Table 3; GIANT consortium summary statistics in ~700,000 (ref. 46), https://portals.broadinstitute.org/collaboration/giant/index.php/GIANT_consortium_data_files.
The arg-needle and arg-needle-lib software packages, which implement the ARG-Needle and ASMC-clust methods as well as methods for the main analyses in this paper, are available at https://palamaralab.github.io/software/argneedle/. Python packages can be downloaded at https://pypi.org/project/arg-needle/ and https://pypi.org/project/arg-needle-lib/; analysis scripts are available at 10.5281/zenodo.7745745. External softwares used in the current study were obtained from the following URLs: msprime (v.0.7.4), https://pypi.org/project/msprime/; tsinfer (v.0.1.4), https://pypi.org/project/tsinfer/; tsinfer scripts for sparse data (accessed January 2022), https://github.com/mcveanlab/treeseq-inference; Relate (v.1.0.15), https://myersgroup.github.io/relate/; ARGON (v.0.1.160415), https://github.com/pierpal/ARGON/; DASH (v.1.1) and GERMLINE (v.1.5.3), http://www1.cs.columbia.edu/~gusev/dash/; IMPUTE4 (v.4.1.2), https://jmarchini.org/software/#impute-4; Beagle (v.5.1), https://faculty.washington.edu/browning/beagle/b5_1.html; PLINK (v.1.90b6.21), https://www.cog-genomics.org/plink/; PLINK (v.2.00a3LM), https://www.cog-genomics.org/plink/2.0/; GCTA (v.1.93.2), https://cnsgenomics.com/software/gcta/; BOLT-LMM (v.2.3.2), https://alkesgroup.broadinstitute.org/BOLT-LMM/downloads/; LiftOver (used April 2021), https://genome.ucsc.edu/cgi-bin/hgLiftOver.