Abstract
Summary
Ancestral recombination graphs (ARGs) encode the ensemble of correlated genealogical trees arising from recombination in a compact and efficient structure and are of fundamental importance in population and statistical genetics. Recent breakthroughs have made it possible to simulate and infer ARGs at biobank scale, and there is now intense interest in using ARG-based methods across a broad range of applications, particularly in genome-wide association studies (GWAS). Sophisticated methods exist to simulate ARGs using population genetics models, but there is currently no software to simulate quantitative traits directly from these ARGs. To apply existing quantitative trait simulators, users must export genotype data, losing important information about ancestral processes and producing prohibitively large files when applied to the biobank-scale datasets currently of interest in GWAS. We present tstrait, an open-source Python library to simulate quantitative traits on ARGs, and show how this user-friendly software can quickly simulate phenotypes for biobank-scale datasets on a laptop computer.
Availability and implementation
tstrait is available for download on the Python Package Index. Full documentation with examples and workflow templates is available at https://tskit.dev/tstrait/docs/, and the development version is maintained on GitHub (https://github.com/tskit-dev/tstrait).
1 Introduction
Genome-wide association studies (GWAS) identify genetic variants that are statistically associated with a specific trait (Uffelmann et al. 2021). Many loci that are associated with various human diseases and traits have been identified (e.g. Yengo et al. 2022, Mathieson et al. 2023), and GWAS results are actively being incorporated into clinical practice (Visscher et al. 2017). The great success of GWAS has prompted the collection of many biobank datasets consisting of hundreds of thousands of participants (Tanjo et al. 2021), but this scale presents significant challenges to current GWAS methodology (Uffelmann et al. 2021).
Simulation is a critical component of GWAS method development and generally consists of two steps: first simulating genetic variation (genotypes) and then simulating quantitative traits (phenotypes) based on the synthetic genotypes. The combined genotypes and phenotypes represent ground-truth data against which GWAS methods can be evaluated. Genetic variation is usually simulated either by model-based population genetic methods such as msprime (Baumdicker et al. 2022) and SLiM (Haller and Messer 2023), or by statistical resampling from existing datasets using methods like HAPGEN2 (Su et al. 2011) and HAPNEST (Wharrie et al. 2023). Both approaches have advantages and disadvantages and excel in different situations. Roughly speaking, model-based simulation methods provide better control of population processes such as demography, whereas resampling methods are better at capturing difficult-to-model nuances of real data. Model-based population genetic simulations have made great strides in recent years, with major advances in both scalability (Kelleher et al. 2016, 2018, Haller et al. 2018) and realism (Adrion et al. 2020, Anderson-Trocmé et al. 2023), and have been successfully used to simulate large-scale GWAS cohorts (e.g. Martin et al. 2017, Zaidi and Mathieson 2020).
An important property of these population genetic simulation methods is that they output ancestral recombination graphs (ARGs) rather than sample genotypes. ARGs encode the interwoven paths of genetic inheritance caused by recombination (Hudson 1983, Griffiths and Marjoram 1997, Wong et al. 2023), and contain rich detail about ancestral processes. Recent breakthroughs in inference methods have made it possible to estimate ARGs at biobank scale (Kelleher et al. 2019, Zhang et al. 2023), and there is now intense interest in their practical application (Lewanski et al. 2024, Brandt et al. 2024). Statistical genetics has been a particular focus, and ARG-based methods have been shown to detect more ultra-rare variants than conventional association testing methods (Zhang et al. 2023); to have better power to detect causal loci in quantitative-trait locus mapping (Link et al. 2023); and to provide a sparse and efficient model of linkage disequilibrium in GWAS and downstream applications (Nowbandegani et al. 2023).
ARG-based methods can simulate genetic variation for millions of samples and store the output very compactly in the ‘succinct tree sequence’ encoding (Wong et al. 2023) and tskit library (Ralph et al. 2020). For example, a highly realistic simulation of chromosome 9 for 1.4 million French-Canadian samples (Anderson-Trocmé et al. 2023) requires around 550 GB of storage space in gzip-compressed VCF (Danecek et al. 2011). At 1.36 GB, the original simulated ARG (compressed using the tszip utility) is around 400× smaller. Furthermore, many calculations can be expressed efficiently in terms of the underlying ARG (Kelleher et al. 2016, Ralph et al. 2020), without needing to decode the actual variation data. Finally, outputting a simulated ARG provides access to the full history, not just the genetic variation among the samples.
Although there are sophisticated methods available for simulating ARGs at biobank scale, there is currently no easy way to simulate quantitative traits based on such an ARG. Many existing methods to simulate quantitative traits from a given set of genetic sequences assume that the genotypes fit in memory (e.g. Meyer and Birney 2018, Fernandes and Lipka 2020), which makes them impractical at biobank scale (the French-Canadian dataset discussed above would require 140 TB of RAM assuming 1 byte per genotype). Methods that read parts of the genotype matrix from file as required (e.g. Wharrie et al. 2023) can be used on reasonable hardware, but working with such large files is slow and cumbersome. More fundamentally, exporting genotypes discards much of the rich detail about ancestral history contained in an ARG, and it is exactly this information that we wish to take advantage of when using inferred ARGs in GWAS applications. In their analysis of the portability of polygenic risk scores across populations, Martin et al. (2017) demonstrated the utility of simulating phenotypes directly from an ARG. Their approach, however, is tightly coupled to the details of the study and not designed to be reused. Simulation code can be subtle and difficult to debug (Ragsdale et al. 2020), and there is a critical need for a well-documented and thoroughly tested means of simulating quantitative traits directly from an ARG.
In this article, we present tstrait, a Python library that efficiently simulates quantitative traits on an arbitrary ARG. Tstrait can quickly simulate quantitative traits for population-scale datasets with very low memory overhead, while taking into account the rich historical detail contained within an ARG. The tstrait library also integrates well with the wider Python data-science ecosystem (Harris et al. 2020), allowing users to efficiently analyse large-scale data using familiar and ergonomic tools.
2 Results
2.1 Model
Phenotypes are simulated in tstrait following standard GWAS models (Uffelmann et al. 2021), adapted to the ARG context. Each trait is associated with one or more causal sites (positions on the genome), and at each causal site there is a causal allele (i.e. a particular nucleotide) associated with an effect size β. For each causal site, an effect size is drawn from a distribution and optionally multiplied by $[2p(1-p)]^{\alpha}$, where p is the frequency of the causal allele and α is a parameter describing the strength of frequency dependence (Speed et al. 2012). At a particular causal site, every node in the local tree that inherits the causal allele at that site is said to have a ‘genetic value’ of β.
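The optional frequency-dependence step can be illustrated with a short sketch. It assumes the scaling factor takes the form $[2p(1-p)]^{\alpha}$; the function name `frequency_scaled_effect` and the example values are hypothetical and are not part of the tstrait API.

```python
def frequency_scaled_effect(beta, p, alpha):
    """Scale a drawn effect size by [2p(1-p)]^alpha (assumed form of the factor).

    beta  : effect size drawn from the chosen distribution
    p     : frequency of the causal allele, strictly between 0 and 1
    alpha : strength of frequency dependence (0 means no scaling)
    """
    if not 0.0 < p < 1.0:
        raise ValueError("allele frequency must lie strictly between 0 and 1")
    return beta * (2.0 * p * (1.0 - p)) ** alpha


# Illustrative values only: the same drawn effect size at a common and a rare site.
scaled_common = frequency_scaled_effect(0.05, 0.5, -0.5)
scaled_rare = frequency_scaled_effect(0.05, 0.01, -0.5)
```

With a negative α, the factor exceeds one and grows as the causal allele becomes rarer, so rare causal alleles receive larger effects; with α = 0 the drawn effect size is returned unchanged.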
In Fig. 1 we show an example tree for three individuals. Because these individuals are diploids, each is associated with two nodes in the tree (highlighted by colour). Ancestral nodes are not associated with individuals here, but in general an ARG may be embedded in a multigenerational pedigree, where some internal nodes would be associated with individuals. In the example of Fig. 1, T is chosen as the causal allele with β = 0.05, so all nodes descending from i have genetic value 0.05, except e which has zero because of the back-mutation to A. Following the standard practice in GWAS (Uffelmann et al. 2021), we assume the additive model such that the overall genetic value of an individual is the sum of its nodes’ genetic values. Given these per-individual genetic values, the final phenotype is then generated by adding some environmental noise. This noise is simulated from a normal distribution with mean zero and variance $V_G(1-h^2)/h^2$, where $V_G$ is the variance of the individual genetic values and $h^2$ is the narrow-sense heritability provided as input by the user. Dominance and epistasis are straightforward extensions to this model, which we plan to include in future versions of tstrait.
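The environmental noise step can be sketched as follows, assuming the noise variance is $V_G(1-h^2)/h^2$ so that the genetic values account for a fraction $h^2$ of the expected phenotypic variance. The function names, the use of the population variance for $V_G$, and the example genetic values are assumptions made for illustration, not tstrait's implementation.

```python
import random
import statistics


def noise_variance(genetic_values, h2):
    """Return V_E = V_G * (1 - h2) / h2 for a list of per-individual genetic values."""
    if not 0.0 < h2 <= 1.0:
        raise ValueError("h2 must be in the interval (0, 1]")
    v_g = statistics.pvariance(genetic_values)  # population variance (an assumption)
    return v_g * (1.0 - h2) / h2


def add_environmental_noise(genetic_values, h2, seed=None):
    """Add zero-mean normal noise with variance V_E to each genetic value."""
    rng = random.Random(seed)
    sd = noise_variance(genetic_values, h2) ** 0.5
    return [g + rng.gauss(0.0, sd) for g in genetic_values]


# Hypothetical genetic values for three individuals.
example_values = [0.0, 0.10, 0.05]
phenotypes = add_environmental_noise(example_values, h2=0.3, seed=42)
```

Setting $h^2 = 1$ gives zero noise variance, so the phenotypes equal the genetic values; smaller $h^2$ values inflate the noise relative to $V_G$.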
Figure 1.
Example simulation of a phenotype at a site with ancestral state A and two mutations. In this diploid example, each of the three individuals is associated with two nodes (e.g. the individual with ID 0 corresponds to nodes a and b). Internal nodes in the tree are associated with the null individual, –1. Here, the trait’s causal allele is T with an effect size β = 0.05. Each node in the tree has an associated genetic value, and the overall genetic value for an individual is the sum of the genetic values of their corresponding nodes. The final phenotype for each individual is the sum of the genetic value and simulated environmental noise.
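The additive accumulation described for Fig. 1 can be sketched in a few lines. The figure itself is not reproduced here, so the assignment of alleles to sample nodes and of nodes c to f to individuals 1 and 2 are assumptions chosen only to match the description (causal allele T, effect size 0.05, node e carrying the back-mutation to A); the function name is hypothetical.

```python
CAUSAL_ALLELE = "T"
BETA = 0.05

# Assumed allele carried by each sample node at the causal site.
node_allele = {"a": "A", "b": "A", "c": "T", "d": "T", "e": "A", "f": "T"}
# Each diploid individual owns two sample nodes (a and b belong to individual 0).
node_individual = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 2, "f": 2}


def individual_genetic_values(node_allele, node_individual, causal_allele, beta):
    """Sum per-node genetic values (beta if the node carries the causal allele) by individual."""
    values = {}
    for node, allele in node_allele.items():
        individual = node_individual[node]
        node_value = beta if allele == causal_allele else 0.0
        values[individual] = values.get(individual, 0.0) + node_value
    return values


genetic_values = individual_genetic_values(
    node_allele, node_individual, CAUSAL_ALLELE, BETA
)
```

Under these assumed alleles, an individual with two copies of T has twice the genetic value of an individual with one copy, and an individual with no copies has a genetic value of zero.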
2.2 Interface
Tstrait is a Python library, building on the tskit ARG toolkit (Ralph et al. 2020, Wong et al. 2023) and the rich Python data-science ecosystem (Harris et al. 2020). Simulating a phenotype for an ARG with default parameter values requires only a few lines of code:
import tstrait as tst
# arg is a tskit TreeSequence, e.g. one produced by a simulator or loaded from file
model = tst.trait_model("normal", mean=0, var=1)
result = tst.sim_phenotype(arg, model)
We first create model, representing the distribution from which effect sizes are drawn. Five commonly used univariate distributions are supported, along with the multivariate normal distribution to model pleiotropic traits. Given this model, we can then simulate phenotypes for the individuals in an ARG (as a tskit TreeSequence) using the sim_phenotype function. The user can either specify a number of causal sites to be chosen randomly along the genome (one, by default) or directly provide the causal sites as input. Combined with the detailed information about mutations recorded in a tskit ARG, explicitly specifying causal sites allows us to model many different types of trait, e.g. those associated with mutations arising in a particular population or time interval. The return value result is an object encapsulating two Pandas dataframes (McKinney 2010): one describing the simulated effect sizes and the other describing the genetic values, environmental noise, and phenotypes for each individual. The simulation results can then be efficiently and conveniently processed using standard Python data-science tools, or exported to (e.g.) CSV for broad compatibility.
As well as this convenient single-function interface, tstrait provides modular building blocks for power users and facilitates integration with other tools that generate traits on an ARG. The sim_trait function simulates effect sizes for an input ARG and returns a dataframe describing the causal sites, alleles, and effect sizes. This dataframe can then be passed to the genetic_values function, which calculates the genetic values for each node, and accumulates them by individual (Fig. 1). Finally, the sim_env function takes these per-individual genetic values and adds some simulated environmental noise to produce the final phenotypes.
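The flow of data through the three stages can be illustrated with a toy pipeline written against plain Python containers rather than a tskit ARG. Everything here is an assumption made for illustration: the `toy_*` functions are hypothetical stand-ins and do not reproduce the signatures or behaviour of tstrait's functions, and the per-node alleles are invented.

```python
import random
import statistics


def toy_sim_trait(site_alleles, num_causal, rng):
    """Stage 1: choose causal sites and draw one N(0, 1) effect size per site."""
    chosen = sorted(rng.sample(sorted(site_alleles), num_causal))
    return [
        {"site_id": s, "causal_allele": site_alleles[s], "effect_size": rng.gauss(0.0, 1.0)}
        for s in chosen
    ]


def toy_genetic_values(trait_rows, node_genotypes, node_individual):
    """Stage 2: add each effect size to the individual owning a node with the causal allele."""
    values = {individual: 0.0 for individual in set(node_individual.values())}
    for row in trait_rows:
        for node, alleles in node_genotypes.items():
            if alleles[row["site_id"]] == row["causal_allele"]:
                values[node_individual[node]] += row["effect_size"]
    return values


def toy_sim_env(genetic_values, h2, rng):
    """Stage 3: add N(0, V_G * (1 - h2) / h2) noise to every genetic value."""
    v_g = statistics.pvariance(list(genetic_values.values()))
    sd = (v_g * (1.0 - h2) / h2) ** 0.5
    return {ind: g + rng.gauss(0.0, sd) for ind, g in genetic_values.items()}


# Invented input: candidate causal allele per site, and alleles carried by each node.
site_alleles = {0: "T", 1: "G", 2: "C"}
node_genotypes = {
    "a": {0: "A", 1: "G", 2: "A"},
    "b": {0: "T", 1: "A", 2: "A"},
    "c": {0: "T", 1: "G", 2: "C"},
    "d": {0: "A", 1: "A", 2: "C"},
}
node_individual = {"a": 0, "b": 0, "c": 1, "d": 1}

rng = random.Random(2024)
trait_rows = toy_sim_trait(site_alleles, num_causal=2, rng=rng)
genetic = toy_genetic_values(trait_rows, node_genotypes, node_individual)
phenotypes = toy_sim_env(genetic, h2=0.5, rng=rng)
```

Because each stage consumes only the tabular output of the previous one, any stage can be replaced independently, for example by supplying hand-written effect sizes in place of the first stage.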
A major benefit of this modular architecture is the flexibility it offers users. Because the causal sites and effect sizes are specified in a simple tabular format, users can easily develop their own approach to simulating these values. Alternatively, other simulators such as SLiM (Haller and Messer 2023) that generate effect sizes and causal mutations during the progress of a forwards-time simulation could output these values to a CSV or similar file. The modular architecture and simple input data formats are specifically intended to facilitate such interoperability.
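As a sketch of this kind of interchange, the example below reads a small CSV table of causal sites and effect sizes into typed records using only the Python standard library. The column names `site_id`, `causal_allele`, and `effect_size` and the function `read_trait_table` are assumptions chosen for illustration; the columns tstrait actually expects should be taken from its documentation.

```python
import csv
import io

# Assumed column names for the illustration.
REQUIRED_COLUMNS = ("site_id", "causal_allele", "effect_size")


def read_trait_table(handle):
    """Parse a CSV of causal sites into a list of typed records."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {', '.join(missing)}")
    rows = []
    for record in reader:
        rows.append(
            {
                "site_id": int(record["site_id"]),
                "causal_allele": record["causal_allele"],
                "effect_size": float(record["effect_size"]),
            }
        )
    return rows


# Invented example content, standing in for a file written by another simulator.
example_csv = io.StringIO(
    "site_id,causal_allele,effect_size\n"
    "12,T,0.05\n"
    "40,G,-0.20\n"
)
trait_rows = read_trait_table(example_csv)
```

Validating the header up front means that a table written by another tool fails loudly when a column is missing, rather than silently producing incomplete records.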
2.3 Implementation and validation
Tstrait is written entirely in Python. Numerical operations are either performed using standard array-oriented approaches (Harris et al. 2020) or accelerated using the numba JIT compiler (Lam et al. 2015). The tstrait codebase includes a suite of unit tests, which are automatically run as part of the development process. The output of tstrait has been validated against theoretical expectations, as well as the output of AlphaSimR (Gaynor et al. 2021) and simplePHENOTYPES (Fernandes and Lipka 2020).
2.4 Performance
Tstrait is very efficient and can be applied to datasets at the largest scales on standard computers. Supplementary Fig. S1 shows how trait simulation time scales with the number of individuals on human-like coalescent simulations generated using stdpopsim (Adrion et al. 2020). To emphasize scalability, we also applied tstrait to the large French-Canadian simulations discussed in the introduction. It took 80.69 s to simulate a trait with 100 causal sites for all 2.7 million pedigree individuals. Finally, to demonstrate that tstrait can also be applied to ARGs inferred from real data, we simulated a trait with 100 causal sites for an ARG estimated from 1000 Genomes project data (Kelleher et al. 2019) which has 2504 samples and 1 685 401 variant sites. This took 5.40 s. Memory requirements for tstrait are modest: all of the above experiments were performed on a laptop computer with 16 GB of RAM.
3 Conclusion
There is substantial interest in using inferred ARGs to improve association testing methods (Zhang et al. 2023, Link et al. 2023, Nowbandegani et al. 2023), and there is a pressing need for a well-tested, efficient, and user-friendly means of simulating phenotypes on ARGs. Highly realistic simulations conditioned on large pedigrees (Anderson-Trocmé et al. 2023) provide an exciting opportunity to test the effects of intricate population structure on GWAS, and we hope that tstrait will facilitate these investigations. Tstrait’s modular architecture and flexible specification of causal sites should provide the opportunity to explore new avenues of research, and an extensible platform for future development.
Acknowledgements
We are grateful to Gregor Gorjanc, Ben Haller, Ben Jeffery, Pier Palamara, Alison Etheridge, and others in the tskit community for helpful discussions and feedback.
Contributor Information
Daiki Tagami, Department of Statistics, University of Oxford, Oxford OX1 3LB, United Kingdom; Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, United Kingdom.
Gertjan Bisschop, Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, United Kingdom.
Jerome Kelleher, Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, United Kingdom.
Author contributions
Daiki Tagami (Conceptualization [supporting], Methodology [supporting], Software [lead], Validation [lead], Visualization [lead], Writing – original draft [lead], Writing – review & editing [equal]), Gertjan Bisschop (Conceptualization [supporting], Methodology [supporting], Software [supporting], Supervision [supporting], Validation [supporting], Visualization [supporting], Writing – original draft [supporting], Writing – review & editing [supporting]), and Jerome Kelleher (Conceptualization [lead], Methodology [lead], Software [lead], Supervision [lead], Validation [lead], Visualization [supporting], Writing – original draft [supporting], Writing – review & editing [equal]).
Supplementary data
Supplementary data are available at Bioinformatics online.
Conflict of interest
No competing interest is declared.
Funding
D.T. is supported by the Oxford Kobe Scholarship from the University of Oxford and the Euretta J. Kellett Fellowship from Columbia University. J.K. acknowledges support from the Robertson Foundation, NIH (research grants HG011395 and HG012473) and EPSRC (research grant EP/X024881/1).
References
- Adrion JR, Cole CB, Dukler N et al. A community-maintained standard library of population genetic models. Elife 2020;9:e54967.
- Anderson-Trocmé L, Nelson D, Zabad S et al. On the genes, genealogies, and geographies of Quebec. Science 2023;380:849–55.
- Baumdicker F, Bisschop G, Goldstein D et al. Efficient ancestry and mutation simulation with msprime 1.0. Genetics 2022;220:iyab229.
- Brandt DY, Huber CD, Chiang CW et al. The promise of inferring the past using the ancestral recombination graph. Genome Biol Evol 2024;16:evae005.
- Danecek P, Auton A, Abecasis G et al.; 1000 Genomes Project Analysis Group. The variant call format and vcftools. Bioinformatics 2011;27:2156–8.
- Fernandes SB, Lipka AE. simplePHENOTYPES: SIMulation of pleiotropic, linked and epistatic phenotypes. BMC Bioinform 2020;21:1–10.
- Gaynor RC, Gorjanc G, Hickey JM. AlphaSimR: an R package for breeding program simulations. G3 2021;11:jkaa017.
- Griffiths RC, Marjoram P. An ancestral recombination graph. In: Donnelly P, Tavaré S (eds.), Progress in Population Genetics and Human Evolution, IMA Volumes in Mathematics and Its Applications, Vol. 87, 1997, 257–270, Berlin: Springer-Verlag.
- Haller BC, Messer PW. SLiM 4: multispecies eco-evolutionary modeling. Am Nat 2023;201:E127–E139.
- Haller BC, Galloway J, Kelleher J et al. Tree-sequence recording in SLiM opens new horizons for forward-time simulation of whole genomes. Mol Ecol Resour 2018;19:552–66.
- Harris CR, Millman KJ, van der Walt SJ et al. Array programming with NumPy. Nature 2020;585:357–62.
- Hudson RR. Properties of a neutral allele model with intragenic recombination. Theor Popul Biol 1983;23:183–201.
- Kelleher J, Etheridge AM, McVean G. Efficient coalescent simulation and genealogical analysis for large sample sizes. PLoS Comput Biol 2016;12:e1004842.
- Kelleher J, Thornton KR, Ashander J et al. Efficient pedigree recording for fast population genetics simulation. PLoS Comput Biol 2018;14:1–21, 11.
- Kelleher J, Wong Y, Wohns AW et al. Inferring whole-genome histories in large population datasets. Nat Genet 2019;51:1330–8.
- Lam SK, Pitrou A, Seibert S. Numba: a LLVM-based Python JIT compiler. In: Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC, 2015, 1–6.
- Lewanski AL, Grundler MC, Bradburd GS. The era of the ARG: an introduction to ancestral recombination graphs and their significance in empirical evolutionary genomics. PLoS Genet 2024;20:e1011110.
- Link V, Schraiber JG, Fan C et al. Tree-based QTL mapping with expected local genetic relatedness matrices. Am J Hum Genet 2023;110:2077–91. https://doi.org/10.1016/j.ajhg.2023.10.017.
- Martin AR, Gignoux CR, Walters RK et al. Human demographic history impacts genetic risk prediction across diverse populations. Am J Hum Genet 2017;100:635–49.
- Mathieson I, Day FR, Barban N et al. Genome-wide analysis identifies genetic effects on reproductive success and ongoing natural selection at the FADS locus. Nat Hum Behav 2023;7:790–801. https://doi.org/10.1038/s41562-023-01528-6.
- McKinney W. Data structures for statistical computing in Python. In: Proceedings of the 9th Python in Science Conference, 2010, 56–61.
- Meyer HV, Birney E. PhenotypeSimulator: a comprehensive framework for simulating multi-trait, multi-locus genotype to phenotype relationships. Bioinformatics 2018;34:2951–6.
- Nowbandegani PS, Wohns AW, Ballard JL et al. Extremely sparse models of linkage disequilibrium in ancestrally diverse association studies. Nat Genet 2023;55:1494–502.
- Ragsdale AP, Nelson D, Gravel S et al. Lessons learned from bugs in models of human history. Am J Hum Genet 2020;107:583–8.
- Ralph P, Thornton K, Kelleher J. Efficiently summarizing relationships in large samples: a general duality between statistics of genealogies and genomes. Genetics 2020;215:779–97.
- Speed D, Hemani G, Johnson MR et al. Improved heritability estimation from genome-wide SNPs. Am J Hum Genet 2012;91:1011–21. https://doi.org/10.1016/j.ajhg.2012.10.010.
- Su Z, Marchini J, Donnelly P. HAPGEN2: simulation of multiple disease SNPs. Bioinformatics 2011;27:2304–5.
- Tanjo T, Kawai Y, Tokunaga K et al. Practical guide for managing large-scale human genome data in research. J Hum Genet 2021;66:39–52.
- Uffelmann E, Huang QQ, Munung NS et al. Genome-wide association studies. Nat Rev Methods Primers 2021;1:59.
- Visscher PM, Wray NR, Zhang Q et al. 10 years of GWAS discovery: biology, function, and translation. Am J Hum Genet 2017;101:5–22.
- Wharrie S, Yang Z, Raj V et al. HAPNEST: efficient, large-scale generation and evaluation of synthetic datasets for genotypes and phenotypes. Bioinformatics 2023;39:btad535.
- Wong Y, Ignatieva A, Koskela J et al. A general and efficient representation of ancestral recombination graphs. bioRxiv, 2023, preprint: not peer reviewed.
- Yengo L, Vedantam S, Marouli E et al.; Understanding Society Scientific Group. A saturated map of common genetic variants associated with human height. Nature 2022;610:704–12.
- Zaidi AA, Mathieson I. Demographic history mediates the effect of stratification on polygenic scores. Elife 2020;9:e61548.
- Zhang BC, Biddanda A, Gunnarsson ÁF et al. Biobank-scale inference of ancestral recombination graphs enables genealogical analysis of complex traits. Nat Genet 2023;55:768–76.