Summary
We provide a method for estimating the genome-wide mutation rate from sequence data on unrelated individuals by using segments of identity by descent (IBD). The length of an IBD segment indicates the time to shared ancestor of the segment, and mutations that have occurred since the shared ancestor result in discordances between the two IBD haplotypes. Previous methods for IBD-based estimation of mutation rate have required the use of family data for accurate phasing of the genotypes. This has limited the scope of application of IBD-based mutation rate estimation. Here, we develop an IBD-based method for mutation rate estimation from population data, and we apply it to whole-genome sequence data on 4,166 European American individuals from the TOPMed Framingham Heart Study, 2,996 European American individuals from the TOPMed My Life, Our Future study, and 1,586 African American individuals from the TOPMed Hypertension Genetic Epidemiology Network study. Although mutation rates may differ between populations as a result of genetic factors, demographic factors such as average parental age, and environmental exposures, our results are consistent with equal genome-wide average mutation rates across these three populations. Our overall estimate of the average genome-wide mutation rate per 10^8 base pairs per generation for single-nucleotide variants is 1.24 (95% CI 1.18–1.33).
Keywords: identity by descent, genome-wide mutation rate
Genotype errors make it difficult to accurately estimate mutation rates in humans from parent-offspring data. We use identity by descent in whole-genome sequence data from population samples to accurately detect germline mutations that have occurred in the past 100 generations and hence estimate the genome-wide mutation rate.
Introduction
Although the genome-wide average human mutation rate is a critical parameter in genetic studies, there is still considerable uncertainty about its value. Sequence data on parent-offspring trios are increasingly available and used for estimation of the mutation rate;1,2,3 however, genotype error rates are high relative to rates of mutation. Thus, the choice of filtering criteria to balance false-positive and false-negative genotype calls is critical and can significantly impact the estimates.4 Sequence data on multi-generational families can help address these issues2 but are unavailable for many populations.
Furthermore, differences in mutation rates between human populations are likely, although they may not be large. Mutation rates increase with parental age,5,6 and average parental age can vary between populations. Differing environmental conditions and exposures to mutagens, as well as genetic factors, may also lead to differences between populations.3 Thus, estimates of average genome-wide mutation rate by population are needed, rather than simply obtaining a single “human” rate.
Recently, methods have been developed for estimating mutation rates from segments of identity by descent (IBD).7,8,9,10 Existing methods require highly accurate estimates of haplotype phase,8,10 except for methods that utilize homozygosity by descent (i.e., IBD within an individual rather than between individuals).7,9 In particular, highly accurate phase is needed for the rarest variants because these are the variants that are the most likely to be the result of mutation since the common ancestor of the IBD haplotypes. However, statistical population-based phasing cannot provide accurate phase of variants that are seen only a few times in the sample.11 Thus, previous IBD-based methods used data from parent-offspring trios, for which Mendelian rules enable phasing of rare variants.8,10
In this paper, we build on the work of Tian et al.,10 which used sets of three IBD haplotypes to estimate the mutation rate. By focusing on sets of three IBD haplotypes, the impact of genotype error is much reduced. Variants must be seen at least twice to be considered and must be carried by two of the three IBD haplotypes, greatly increasing their likelihood of being true rare alleles rather than miscalls. We show that this design also enables highly accurate phasing of the rare variants via IBD so that parent-offspring trios are not needed. As a result, our method can be applied to sequenced population samples, such as those in the TOPMed study.12 Existing IBD-based methods adjust for gene conversion with post hoc regression,8,9,10 but in this work, we incorporate gene conversion directly within the likelihood framework.
Subjects and methods
Modeling of mutations, gene conversion, and genotype error
We model the occurrence of mutations as a Poisson process of rate per base pair per meiosis. Thus, across independent meioses and base pairs, the number of changed alleles due to mutation is distributed as Poisson with mean .
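The Poisson model above can be illustrated with a minimal sketch. The function names and the numeric inputs are hypothetical and chosen only for illustration; they are not values or code from this study.

```python
import math

def expected_mutations(rate_per_bp_per_meiosis, meioses, base_pairs):
    """Poisson mean: rate x number of meioses x number of base pairs."""
    return rate_per_bp_per_meiosis * meioses * base_pairs

def poisson_log_pmf(count, mean):
    """log P(X = count) for X ~ Poisson(mean), with mean > 0."""
    return count * math.log(mean) - mean - math.lgamma(count + 1)

# Illustrative inputs only (assumed, not taken from the paper).
mean = expected_mutations(1e-8, meioses=40, base_pairs=3_000_000)
log_prob_zero = poisson_log_pmf(0, mean)  # log-probability of seeing no mutations
```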
The lowest frequency variants are useful for estimating the mutation rate, while the highest frequency variants are most informative for estimating IBD segments. We thus divide the genetic markers into three groups: markers with minor allele frequency (MAF) , which are informative about mutation and gene conversion; markers with MAF between and , which can be affected by gene conversion but not by mutation, thus allowing estimation of gene conversion rates for correcting for gene conversion in the markers with MAF ; markers with MAF , which are used for IBD segment detection. Gene conversions can disrupt identity by state, leading to inability to detect the corresponding IBD segments. By using distinct sets of markers for IBD detection and parameter estimation, we avoid this cause of downward bias in estimates of gene conversion rates. We use and in all analyses except as noted. In simulation studies, we found that all mutations in IBD segments had allele frequencies , and using allows for both sufficient markers for estimation of gene conversion rates and sufficient markers for IBD detection.
Let be the proportion of base pairs that are located within a gene conversion tract per meiosis. The offspring’s haplotype is only altered by gene conversion at positions where the parent individual’s genotype was heterozygous. We write for the genome-wide proportion of base pair positions that are heterozygous and have MAF . Thus, for a genome with base pairs and individuals, and writing for the total number of heterozygous genotypes seen at markers with MAF , we obtain . For a single base pair position and meiosis, the expected number of alleles with frequency inserted by gene conversion will be . The division by 2 is because if the individual in which the gene conversion occurred was heterozygous at a locus within the gene conversion tract, either the transmitted haplotype was originally the major allele and changed to the minor allele by the gene conversion or vice versa, each with probability . We are only counting instances of change to the minor allele here. Thus, we model the number of alleles with frequency inserted by gene conversion over meioses and base pairs as having a Poisson distribution with mean . Similarly, we model the number of alleles with frequency inserted by gene conversion over meioses and base pairs as having a Poisson distribution with mean by using the same reasoning but this time only counting instances in which the transmitted haplotype’s allele was changed from the minor to the major allele.
We write for the genome-wide proportion of base pair positions that are heterozygous and have MAF between and . Applying the same reasoning as above, we model the number of alleles with frequency between and inserted by gene conversion over meioses and base pairs as having a Poisson distribution with mean . The distribution is the same for alleles with frequency between and .
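As a rough illustration of the gene conversion model described above, the sketch below computes the heterozygous proportion from the total count of heterozygous genotypes and then the Poisson mean for minor alleles inserted by gene conversion. All names and inputs are hypothetical; the same function would apply to either allele frequency class by passing that class's heterozygous proportion.

```python
def het_proportion(total_het_genotypes, genome_bp, n_individuals):
    """Genome-wide proportion of positions that are heterozygous
    in the chosen MAF class: total het genotypes / (genome size x sample size)."""
    return total_het_genotypes / (genome_bp * n_individuals)

def gene_conversion_mean(gc_rate, het_prop, meioses, base_pairs):
    """Poisson mean for minor alleles inserted by gene conversion.
    The division by 2 reflects that only half of the conversions at
    heterozygous sites change the major allele to the minor allele."""
    return gc_rate * het_prop / 2 * meioses * base_pairs

# Illustrative inputs only (assumed, not taken from the paper).
h = het_proportion(3000, genome_bp=1_000_000, n_individuals=3)
m = gene_conversion_mean(6e-6, h, meioses=100, base_pairs=1_000_000)
```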
We also consider genotype errors. Our method is concerned with low frequency alleles that are observed in two out of three IBD haplotypes. The probability of seeing this at any given base pair, conditional on the true alleles being identical in the three haplotypes, is defined to be for alleles with MAF and to be for alleles with MAF between and ; thus for base pairs, we model the numbers of such errors as having Poisson distributions with means and , respectively.
The parameters , , , and are not known a priori. We compute a likelihood across a grid of values for these parameters and report the parameter values that maximize the likelihood. Our primary interest is in estimating .
Modeling of IBD
We compute likelihoods for sets of three mutually IBD haplotypes. We need to model the genealogical relationship between these three haplotypes.
In computing the likelihood, we will sum over the different possible relationships. Here, we consider a single, generalizable relationship, , in which haplotypes A and B have their most recent common ancestor first, generations ago, and the ancestor of A and B has its most recent common ancestor with C occurring generations ago (Figure 1). The probability of this particular relationship depends on the population’s demographic history. If is the effective size of the population generations before the present, then the probability of the relationship is10
| Equation 1 |
Figure 1. The genealogical relationship between three IBD haplotypes
Mutually IBD haplotypes are labeled A, B, and C. Haplotypes A and B have a common ancestor generations before the present, while C and the common ancestor of A and B have a common ancestor generations before the present. The branches of the genealogical tree are labeled b1, b2, b3, and b4.
We use the distribution of lengths of pairwise IBD sharing in the sample to estimate the effective population sizes with the IBDNe program.13
The probability of the lengths of the observed IBD segments between the three haplotypes given the relationship can be found in Table S1, with derivations in Tian et al.10
Probability of allele counts given relationship
For alleles with frequency and for alleles with frequency between and , we count the number of instances in which haplotypes A and B carry the allele but C does not, the number of instances in which haplotypes A and C carry the allele but B does not, and the number of instances in which haplotypes B and C carry the allele but A does not. These counts apply to the region that is shared IBD by all three haplotypes after trimming 0.5 cM from each end. We write for the base pair length of this region. The trimming accounts for uncertainty in the exact endpoints of the IBD segment.
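The trimming step can be sketched as follows. This assumes, for illustration only, that the three-way region is the intersection of the three pairwise IBD segments and that coordinates are in cM; converting the trimmed region to a base pair length would additionally require the genetic map and is not shown. The function name is hypothetical.

```python
def trimmed_three_way_region(seg_ab, seg_ac, seg_bc, trim_cm=0.5):
    """Intersect three pairwise IBD segments, given as (start_cm, end_cm),
    and trim trim_cm from each end. Returns None if nothing remains."""
    segments = (seg_ab, seg_ac, seg_bc)
    start = max(s[0] for s in segments) + trim_cm
    end = min(s[1] for s in segments) - trim_cm
    return (start, end) if end > start else None

region = trimmed_three_way_region((10.0, 14.0), (11.0, 15.0), (10.5, 13.5))
```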
Given the relationship in Figure 1, with branches labeled in the figure, the low-frequency variants carried by A and B but not C can be obtained through one of the following ways: (1) a mutation on branch b3, (2) a gene conversion on branch b3 that replaces the high-frequency allele with the low-frequency allele, (3) a gene conversion on branch b4 that replaces the low-frequency allele with the high-frequency allele, or (4) genotype error. Other possibilities involve low probability events such as two or more gene conversions at the same site or back mutation, and we ignore these.
Given detectable IBD (e.g., length cM) between the haplotypes, their time to most recent common ancestor is within the past few hundred generations, and the frequency of any allele created by mutation on branch b3 will be low in the population. Under our model, the number of alleles with frequency that are shared by A and B but not C follows a Poisson distribution with mean
| Equation 2 |
Under our model, the number of alleles with frequency between and that are shared by A and B but not C follows a Poisson distribution with mean
| Equation 3 |
Given the relationship in Figure 1, with branches labeled as in the figure, the low-frequency variants carried by A and C but not B can be obtained through one of the following ways: (1) a gene conversion on branch b2 that replaces the high-frequency allele with the low-frequency allele or (2) genotype error. Other possibilities involve low probability events such as two or more gene conversions at the same site or back mutation, and we ignore these. Thus, the number of alleles with frequency that are shared by A and C but not B follows a Poisson distribution with mean
| Equation 4 |
Similarly, the number of alleles with frequency between and that are shared by A and C but not B follows a Poisson distribution with mean
| Equation 5 |
These same distributions apply to counts of alleles shared by B and C but not A.
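The enumerated cases above can be turned into a small sketch of the Poisson means. This is one reading of the text, not a transcription of Equations 2 to 5. It assumes that the branches b1 and b2 each span the generations back to the common ancestor of A and B, that b3 spans the remaining generations up to the common ancestor of all three haplotypes, and that b4 spans the full time from C to that ancestor. All parameter names are hypothetical. For the higher frequency class, the mutation rate would be set to zero.

```python
def pattern_means(t1, t2, seg_bp, mu, gamma, het, eps):
    """Poisson means for (AB-not-C, AC-not-B) allele counts.

    t1: generations to the common ancestor of A and B (assumed length of b1, b2)
    t2: generations to the common ancestor of all three (assumed length of b4)
    seg_bp: base pair length of the trimmed three-way IBD region
    mu: mutation rate; gamma: gene conversion rate
    het: heterozygous proportion for the frequency class; eps: error rate
    """
    gc = gamma * het / 2          # per-bp, per-meiosis rate of one direction of change
    b3 = t2 - t1                  # assumed length of branch b3 in meioses
    ab_not_c = seg_bp * (mu * b3 + gc * b3 + gc * t2 + eps)
    ac_not_b = seg_bp * (gc * t1 + eps)
    return ab_not_c, ac_not_b

# Illustrative inputs only (assumed, not taken from the paper).
means = pattern_means(10, 30, 1_000_000, mu=1e-8, gamma=0.0, het=0.0, eps=0.0)
```

By the symmetry noted above, the second mean would also be used for alleles shared by B and C but not A.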
The likelihood
The observed data for each instance of three-way IBD sharing are the allele counts for the three haplotypes within the region that is shared IBD by all three haplotypes after trimming, the trimmed three-way IBD segment length measured in base pairs, and the untrimmed lengths of the pairwise IBD segments measured in cM. The likelihood (of ) is obtained by summing over the possible genealogical relationships and, for each one, multiplying the prior probability of the relationship (Equation 1), the probability of the IBD lengths given the relationship (Table S1), and the probability of the allele counts given the relationship (Equations 2, 3, 4, and 5).
This gives the likelihood based on data from one instance of three-way IBD sharing. We find all instances of three-way IBD sharing and multiply the likelihoods (or add the log-likelihoods) to obtain the overall likelihood. This is a composite likelihood because the instances of three-way IBD sharing are not fully independent. We perform a grid search to find the values of the parameters that maximize the likelihood. Confidence intervals are found by bootstrap resampling of chromosomes (the 22 autosomes in the human data and the 30 simulated chromosomes in the simulated data). We obtain 10,000 bootstrap estimates and report the 2.5th and 97.5th percentiles as the 95% confidence interval.
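A minimal sketch of the chromosome bootstrap is given below. It assumes, for illustration, that per-chromosome log-likelihoods have already been evaluated on a fixed one-dimensional grid, so that each bootstrap replicate only needs to sum the resampled chromosomes and take the maximizing grid value. The function name, the data layout, and the simple index-based percentile are assumptions.

```python
import random

def bootstrap_ci(loglik_by_chrom, grid, n_boot=10_000, seed=0):
    """Percentile 95% interval from resampling chromosomes with replacement.

    loglik_by_chrom[c][g] is the log-likelihood of chromosome c at grid[g].
    """
    rng = random.Random(seed)
    n_chrom = len(loglik_by_chrom)
    estimates = []
    for _ in range(n_boot):
        sample = [rng.randrange(n_chrom) for _ in range(n_chrom)]
        totals = [sum(loglik_by_chrom[c][g] for c in sample)
                  for g in range(len(grid))]
        best = max(range(len(grid)), key=totals.__getitem__)
        estimates.append(grid[best])
    estimates.sort()
    lower = estimates[int(0.025 * (n_boot - 1))]
    upper = estimates[int(0.975 * (n_boot - 1))]
    return lower, upper

# Toy input: two chromosomes, three grid values (illustrative only).
ci = bootstrap_ci([[-3, -1, -2], [-4, -1, -5]], [1.0, 1.2, 1.4], n_boot=200)
```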
In order to reduce the computation time for the grid search, we use a three-stage approach. In the first stage, we use alleles with MAF and Equations 2 and 4 to obtain estimates of and . We discard the estimates for from this analysis because much of the information about is contained in the higher frequency data and because and are somewhat confounded without the use of the higher-frequency data. We retain the estimate of for use in the third stage. In the second stage, we use alleles with MAF between and and Equations 3 and 5 to obtain estimates of and . As in the first stage, we discard the estimate for because it is based on partial data, but we retain the estimate of . In the third stage, we fix and at and and use alleles with MAF as well as alleles with MAF between and and Equations 2, 3, 4, and 5 to obtain overall estimates of the mutation rate and gene conversion rate . For each parameter, we start the grid search with a wide range and large step size. If the resulting confidence interval falls within the range, we then reduce the range and the step size. This iterative step is repeated until the desired level of precision is achieved. We provide further details of the procedure for performing the three-stage likelihood maximization in the context of using the supplied Java program in supplemental methods.
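The coarse-to-fine idea can be sketched for a single parameter as follows. Note that this sketch uses a simpler refinement rule than the one described above: it recenters the grid on the current maximizer rather than checking whether a bootstrap confidence interval falls within the range. The function name and the toy likelihood are hypothetical.

```python
def refine_grid_search(loglik, low, high, steps=11, rounds=4):
    """Repeatedly evaluate loglik on an even grid over [low, high],
    then shrink the range to one step either side of the best value."""
    best = low
    for _ in range(rounds):
        step = (high - low) / (steps - 1)
        grid = [low + i * step for i in range(steps)]
        best = max(grid, key=loglik)
        low, high = best - step, best + step
    return best

# Toy log-likelihood with a known maximum at 1.234 (illustrative only).
estimate = refine_grid_search(lambda x: -(x - 1.234) ** 2, 0.0, 10.0)
```

Each round narrows the step size by roughly a factor of five in this configuration, so a few rounds reach a fine resolution without evaluating a dense grid over the full range.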
Accounting for phase uncertainty
Common variants can be phased extremely accurately with statistical methods in large samples.14 We use the statistically inferred phase for variants with minor allele frequency (MAF) . Because lower frequency variants can’t be phased as well with standard statistical phasing, we use a different approach for variants with MAF .
Here, we are considering alleles that are rare (frequency ) yet are carried by two individuals who share a haplotype IBD. The question is whether the rare allele is on the IBD haplotype or not. In most cases, it will be on the IBD haplotype because if it is not, then the probability of seeing this rare allele in both individuals is small. Yet if there is evidence for one or both individuals that the allele may be coming in through the individual’s other (non-IBD) haplotype, we exclude it from the count.
Identification of IBD segments is performed with the statistically phased haplotypes, and segments are identified with respect to the haplotype indices. For example, we may find that at one position, individuals A and B are identical by descent on A’s haplotype 1 and B’s haplotype 2. We may also observe that A and D are identical by descent at the same position, but with A’s haplotype 2 and D’s haplotype 1. Thus, in this case A’s IBD sharing with D is not on the same haplotype as A’s IBD sharing with B.
Consider individuals A and B who have a haplotype that is also shared with C (three-way IBD sharing). For the sake of concreteness, suppose it is haplotype 1 of A and haplotype 1 of B that is IBD with haplotype 1 of C. Now, suppose we observe that A and B both carry one copy of a certain rare allele “z” at a position that is within the region of three-way sharing but that C does not carry this allele. We want to know whether the allele is on haplotype 1 of A and haplotype 1 of B and hence should be counted in the likelihood calculations. Because this allele is rare, we do not have a direct estimate of its phase. We look to see whether there is any individual D who has one or more copies of “z”, who does not have a haplotype identical by descent with haplotype 1 of A or with haplotype 1 of B, and who has a haplotype identical by descent with haplotype 2 of A or with haplotype 2 of B at the position. We require that such IBD must extend at least 0.5 cM on either side of the position to protect against possible misestimation of the IBD segment endpoints. If we find one or more such individuals, we determine that the rare variant is on haplotype 2 of A and B and is thus not part of the three-way IBD with C and can be ignored in the likelihood calculation. If we don’t find such an individual D, we include the rare variant in the likelihood calculation because the sharing of the allele by the two IBD individuals indicates that the allele is most likely on the IBD haplotype. A special case that will occasionally occur is when A and/or B is homozygous for “z.” In this case, we always include the variant in the likelihood calculation.
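The decision rule in this paragraph can be sketched as below. The data model (the `IbdSegment` record, haplotype indices 1 and 2, positions in cM) and the function names are assumptions for illustration and are not the interface of the authors' program.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IbdSegment:
    ind1: str
    hap1: int      # haplotype index, 1 or 2
    ind2: str
    hap2: int
    start_cm: float
    end_cm: float

def ibd_partners(segments, ind, hap, pos_cm, margin_cm):
    """Individuals IBD with (ind, hap) at pos_cm, requiring the segment
    to extend at least margin_cm on both sides of the position."""
    partners = set()
    for s in segments:
        if s.start_cm + margin_cm <= pos_cm <= s.end_cm - margin_cm:
            if (s.ind1, s.hap1) == (ind, hap):
                partners.add(s.ind2)
            if (s.ind2, s.hap2) == (ind, hap):
                partners.add(s.ind1)
    return partners

def count_rare_allele(pos_cm, carriers, a, b, ibd_hap, segments,
                      homozygous=frozenset(), margin_cm=0.5):
    """True if the rare allele shared by a and b should be counted.

    carriers: individuals carrying the allele
    ibd_hap: maps a and b to their haplotype index in the three-way IBD
    homozygous: individuals homozygous for the allele
    """
    if a in homozygous or b in homozygous:
        return True  # special case: always counted
    on_ibd_hap, on_other_hap = set(), set()
    for ind in (a, b):
        on_ibd_hap |= ibd_partners(segments, ind, ibd_hap[ind], pos_cm, 0.0)
        on_other_hap |= ibd_partners(segments, ind, 3 - ibd_hap[ind],
                                     pos_cm, margin_cm)
    # A "D" individual carries the allele, is IBD with a non-IBD haplotype
    # of a or b, and is not IBD with the three-way IBD haplotype.
    witnesses = (set(carriers) - {a, b}) & (on_other_hap - on_ibd_hap)
    return not witnesses

segs = [IbdSegment("A", 1, "B", 1, 0.0, 10.0),
        IbdSegment("A", 1, "C", 1, 0.0, 10.0),
        IbdSegment("A", 2, "D", 1, 2.0, 8.0)]
haps = {"A": 1, "B": 1}
excluded = not count_rare_allele(5.0, {"A", "B", "D"}, "A", "B", haps, segs)
```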
For comparison, in the simulated data, we also performed analyses in which all variants are statistically phased and analyses in which the true phase is used. In either of these two cases, the phase is used as given in counting the mutations rather than employing the approach described above.
Simulated data
We used MaCS15 to simulate a dataset with a gene-conversion initiation rate of per base pair per meiosis and mean gene conversion tract length of 300 bp, which is close to previously reported estimates of gene conversion rate with human data.16 The mutation rate for the simulation was per base pair per meiosis, and the recombination rate was per base pair per meiosis. The simulation sample size was 2,000 diploid individuals, and 30 chromosomes of length 100 Mb were simulated. The simulated demographic history was the “European-American model” described in Tian et al.,10 which is based on an IBDNe13 analysis of TOPMed Framingham Heart Study data.12 We added genotype error to each simulated single-nucleotide variant (SNV), changing each allele with probability 0.01%.
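The genotype error step can be sketched as an independent flip of each allele with a small probability. This assumes biallelic 0/1 coding; the function name and the list-of-lists layout are hypothetical.

```python
import random

def add_genotype_error(haplotypes, error_prob=1e-4, seed=0):
    """Return a copy of the haplotypes (lists of 0/1 alleles) in which
    each allele is flipped independently with probability error_prob."""
    rng = random.Random(seed)
    return [[1 - allele if rng.random() < error_prob else allele
             for allele in hap]
            for hap in haplotypes]

# An error probability of 0.01% corresponds to error_prob=1e-4.
noisy = add_genotype_error([[0, 1, 0, 0], [1, 1, 0, 1]])
```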
We inferred haplotype phase with Beagle 5.117 for variants with MAF >1%. We inferred IBD segments by using hap-IBD18 with SNVs with MAF , the minimum seed length (min-seed) set to 0.5 cM, the minimum extension length (min-extend) set to 0.1 cM, and the maximum gap length (max-gap) set to 5,000 base pairs. We also ran an analysis with to demonstrate that the method is not overly sensitive to the choice of value for this parameter. We inferred the recent effective population size with IBDNe13 with the inferred IBD segments and default settings. Three-way IBD was obtained via segments of length 2.5–6 cM. We used an upper limit of 6 cM to reduce the possibility of downward bias due to false-negative mutation calls.10 We used a lower limit of 2.5 cM to avoid false-positive IBD. We also ran sensitivity analyses with minimum IBD segment lengths of 2 cM and 3 cM. For the analyses with a 2 cM minimum length, we could not analyze all available IBD trios because of computational constraints, so we randomly sampled a subset of the IBD trios to match the number of IBD trios in the analysis with the 2.5 cM minimum length.
TOPMed data
We estimated mutation rates by using TOPMed whole-genome sequence data12 from 4,166 European-descent individuals in the Framingham Heart Study (FHS, dbGaP: phs000974.v4.p3), 2,996 White non-Hispanic individuals in the My Life, Our Future study (MLOF, dbGaP: phs001515.v2.p2), and 1,586 Black non-Hispanic individuals in the Hypertension Genetic Epidemiology Network study (HyperGEN, dbGaP: phs001293.v2.p1). We complied with the data use agreements for these data. For these data, we used haplotype phase inferred in a previous analysis with a larger set of TOPMed project individuals.19 We used the deCODE genetic map20 throughout the analysis.
We used hap-IBD18 to detect pairwise IBD segments from the phased haplotypes with the same MAF filtering and parameters as for the simulated data. We used ibd-ends21 for a second step of IBD segment inference because we previously found that in real sequence data large spikes in IBD rate occur at some locations10,21 and the application of ibd-ends solves this issue.21 For the ibd-ends analysis, we used the output from the hap-IBD analysis as the input IBD segments, and we kept all parameters at their default values. The ibd-ends program estimates the posterior distribution of the endpoints for each IBD segment and returns the posterior medians, which we used as the adjusted endpoints of the IBD segments.
When searching for three-way IBD sharing, we restricted the analysis to IBD segments with length between 2.5 cM and 6 cM, as in the simulated data. We also excluded any IBD segments from duplicated samples, identical twins, and parent-offspring pairs. We identified such pairs of individuals from the provided pedigree file if available or otherwise from the degree of relatedness estimated by IBDkin.22
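The segment filter described here can be sketched as a simple predicate. Treating both length bounds as inclusive and representing excluded pairs as unordered sets are assumptions made for illustration; the function name is hypothetical.

```python
def keep_segment(length_cm, pair, excluded_pairs, min_cm=2.5, max_cm=6.0):
    """Keep a pairwise IBD segment if its length is within bounds and the
    pair is not a duplicate, twin, or parent-offspring pair."""
    return min_cm <= length_cm <= max_cm and frozenset(pair) not in excluded_pairs

# Hypothetical sample identifiers.
excluded_pairs = {frozenset(("id1", "id2"))}
kept = keep_segment(3.1, ("id1", "id3"), excluded_pairs)
```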
In the FHS cohort, we removed from the analysis the IBD segments shared between 2,415 pairs of individuals who are identified as duplicated samples, monozygotic twins, or parent and offspring in the provided pedigree. Among White non-Hispanic individuals in the MLOF cohort, we removed the IBD segments shared between 385 pairs of 1st degree relatives and eight pairs of 0th degree relatives identified by IBDkin.22 In the HyperGEN Black non-Hispanic cohort, IBDkin22 found 803 pairs of 1st degree relatives and two pairs of 0th degree relatives, and we removed the IBD segments shared between these pairs. The identified three-way IBD in the FHS cohort covered 2.75 Gb across the autosomes. In the White non-Hispanic participants in the MLOF study, the identified three-way IBD sharing covered 2.73 Gb across the autosomes. In the HyperGEN Black non-Hispanic cohort, the identified three-way IBD covered 2.55 Gb across the autosomes. The reduced IBD coverage for the HyperGEN cohort is primarily due to the smaller sample size. We inferred the recent effective population sizes of the three study populations by using the default settings of IBDNe13 with the IBD segments output by ibd-ends.
We estimated likelihoods for each study across a search grid as for the simulated data and multiplied the likelihoods to obtain overall likelihoods. We report the mutation rate that maximizes the likelihood separately for each study and overall for the combined analysis.
As well as estimating the overall genome-wide mutation rate, we also estimated separate rates for transition and transversion mutations. To perform these analyses, after inferring the IBD, we filtered the genotype data to just the transition SNVs, or to just the transversion SNVs, and proceeded with the analysis as for the overall rate.
Results
Analysis of simulated data
In the simulated data, which has a mutation rate of per base pair per meiosis, we obtained a mutation rate estimate of per base pair per meiosis with a 95% confidence interval of []. We also performed an analysis that used the true (simulated) phase. For this analysis, we do not use the phasing uncertainty procedure for the rare variants, but instead we use the phase as given for all variants. In this case, the estimate is similar at per base pair per meiosis with a 95% confidence interval of []. In contrast, when using estimated phase but ignoring the phasing uncertainty (using the estimated phase as given for all variants), the estimated mutation rate is per base pair per meiosis with a 95% confidence interval of []. Hence the simulation results demonstrate that the proposed method can effectively adjust for the phasing uncertainty.
We performed sensitivity analyses to evaluate the impact of different choices of (the MAF for IBD detection) and the minimum length of IBD segments. After we increased the value of from 0.25 to 0.3, the estimated mutation rate was per base pair per meiosis with a 95% confidence interval of . This result is similar to the mutation rate estimate obtained from given above.
We increased the minimum length of IBD segments from 2.5 cM to 3 cM and found that the average number of IBD trios per simulated chromosome decreased significantly from 1,409 trios per chromosome to 303 trios per chromosome. The 95% confidence interval for the mutation rate estimate from this analysis is , which is wider than that from the 2.5 cM analysis. Reducing the minimum length of IBD segments from 2.5 cM to 2 cM resulted in an average of 9,004 IBD trios per chromosome, which is computationally infeasible to analyze fully with our pipeline. We thus randomly sampled a subset of 2–6 cM IBD trios to match the number of 2.5–6 cM IBD trios on each simulated chromosome and obtained an estimated mutation rate of per base pair per meiosis with a 95% confidence interval of . The small downward bias in the mutation rate estimate may be due to lower IBD calling accuracy in the shorter segments.
We compared the estimated gene conversion rates to the simulated gene conversion rate. We simulated gene conversion initiations at a rate of per base pair with mean length base pairs, so the overall gene conversion rate (proportion of base pairs included in a gene conversion tract) is . With true haplotype phase, the estimated gene conversion rate is with a 95% confidence interval of []. With inferred haplotype phase and using our procedure to account for phase uncertainty in the rare variants, the estimated gene conversion rate is with 95% confidence interval of []. This downward bias cannot be corrected by increasing the allele frequency threshold for IBD detection and is likely due to the effect of gene conversion events on phasing accuracy. Thus, in real population data, for which true phase is not available, the gene conversion rate cannot be accurately estimated. Nevertheless, this does not appear to affect the accuracy of the estimation of mutation rate.
Analysis of TOPMed data
The joint estimate of mutation rate with data from the three TOPMed cohorts is 1.24 × 10^-8 per base pair per generation with 95% confidence interval [1.18 × 10^-8, 1.33 × 10^-8].
Using the FHS cohort alone, the estimated mutation rate is per base pair per meiosis with 95% confidence interval []. For comparison, when analyzing the trio-phased parents in the Framingham data (1,307 individuals) in our earlier work, we obtained an estimated mutation rate of with 95% confidence interval [].10 This comparison both supports the accuracy of our method for accounting for phasing uncertainty in the larger set of individuals and illustrates the increased precision (narrower confidence intervals) that comes from being able to analyze larger numbers of individuals when not restricted to trio-phased individuals.
Using the White non-Hispanic MLOF cohort, we obtain a mutation rate estimate of per base pair per meiosis with 95% confidence interval [ , ]. Using the Black non-Hispanic HyperGEN cohort, we estimate the mutation rate to be per base pair per generation with 95% confidence interval [ ]. The confidence intervals from all three studies overlap (Figure 2), indicating that the results are consistent with all three populations having the same mutation rate.
Figure 2. Estimated mutation rates in the TOPMed FHS, MLOF, and HyperGEN cohorts
Point estimates for each dataset are shown along with 95% confidence intervals (horizontal lines). The overall estimate from joint analysis of all three datasets is shown by the vertical line, and the 95% confidence interval is represented by the shaded area.
The estimate of transition mutation rate from the three TOPMed cohorts is with 95% confidence interval . The estimate of transversion mutation rate from the three TOPMed cohorts is with 95% confidence interval . The sum of these two estimates is very close to the overall mutation rate estimate as expected. The higher rate of transitions (more than double the rate of transversions) is in line with previous reports.6,23,24
Discussion
Our work in this paper utilizes the three-way IBD framework of Tian et al.10 This three-way IBD framework allows for very strong control over genotype error by requiring that potential mutations be observed in two of three IBD individuals. In our previous work, we validated through simulation studies that this approach is robust to a variety of demographic histories and to different models for genotype error.10
The methodology in Tian et al. required that haplotypes be accurately phased, including phasing of rare variants, which in practice requires that the sequenced individuals include close relatives to enable the application of Mendelian phasing. Population samples can be statistically phased with linkage disequilibrium-based methods such as Beagle;19 however, very rare variants tend not to be phased well with such methods. We thus developed a method for counting mutations on the IBD haplotypes when the underlying individual data are from population samples with uncertain phase. Again, the three-way framework is very helpful in this mutation counting process for three reasons: it largely removes the effect of allele errors on the non-IBD haplotypes of the individuals; it removes the most recent mutations on the non-IBD haplotypes (which are much less likely to be shared by two of the three individuals); and the additional “D” haplotypes that we introduce for the phasing can be IBD with the non-IBD haplotype of either of two individuals, which increases the chance of finding such a haplotype.
Additionally, we incorporated the effects of gene conversion directly into the likelihood rather than accounting for them with a post-processing regression on maximum allele frequency.8,10 This has the potential to increase statistical efficiency in estimating the mutation and gene conversion rates. However, we found that the method underestimates the gene conversion rate in analysis of population data. This is most likely because gene conversion can induce phasing errors in common variants that disrupt the detection of IBD segments from common variants. Thus, a proportion of IBD segments involving gene conversion are lost to analysis. In contrast, mutation rates are still accurately estimated. Mutation rates are based on rare variants that are not included in the IBD detection process.
We analyzed data from more than 8,000 individuals from three TOPMed studies including European American and African American populations and found that estimated mutation rates were not statistically different between the analyzed populations. The overall estimated genome-wide average mutation rate for SNVs was 1.24 × 10^-8 per base pair per generation. This rate is consistent with our previous IBD-based estimates but has tighter confidence intervals because of the larger sample size enabled by the methodology presented here.10 Our estimated rate is higher than some estimates based on direct ascertainment of de novo variants in parent-offspring trio data,3,25 and possible reasons include the stringent filtering that is needed in trio analyses in order to avoid counting false-positive mutations.3,4,25 A caveat of our results is that they apply only to SNVs and only to those parts of the autosomes that are well covered by called genotypes, which is approximately 2.7 Gb in the data that we analyzed.
Acknowledgments
The methodological and analytical work performed in this study was supported by R01 HG005701 from the National Human Genome Research Institute (NHGRI). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. Molecular data for the Trans-Omics in Precision Medicine (TOPMed) program was supported by the National Heart, Lung, and Blood Institute (NHLBI). Core support including centralized genomic read mapping and genotype calling along with variant quality metrics and filtering was provided by the TOPMed Informatics Research Center (3R01HL-117626-02S1; contract HHSN268201800002I). Core support including phenotype harmonization, data management, sample-identity QC, and general program coordination was provided by the TOPMed Data Coordinating Center (R01HL-120393; U01HL-120393; contract HHSN268201800001I). We gratefully acknowledge the studies and participants who provided biological samples and data for TOPMed. The Framingham Heart Study (FHS) was supported by contracts NO1-HC-25195 and HHSN268201500001I from the National Heart, Lung, and Blood Institute (NHLBI) and grant supplement R01 HL092577-06S1; genome sequencing was funded by HHSN268201600034I and U54HG003067. The My Life, Our Future samples and data are made possible through the partnership of Bloodworks Northwest, the American Thrombosis and Hemostasis Network, the National Hemophilia Foundation, and Bioverativ; genome sequencing was funded by HHSN268201600033I and HHSN268201500016C. The Hypertension Genetic Epidemiology Network Study is part of the NHLBI Family Blood Pressure Program; collection of the data represented here was supported by grants U01 HL054472, U01 HL054473, U01 HL054495, and U01 HL054509; genome sequencing was funded by R01HL055673.
Declaration of interests
X.T. is an employee of Google LLC.
Published: November 11, 2022
Footnotes
Supplemental information can be found online at https://doi.org/10.1016/j.ajhg.2022.10.015.
Contributor Information
Xiaowen Tian, Email: xtianuw@gmail.com.
Sharon R. Browning, Email: sguy@uw.edu.
Data and code availability
The code generated during this study is available at https://github.com/tianxiaowen/mutation_unphased. The TOPMed data are available from dbGaP.
References
- 1. Wong W.S.W., Solomon B.D., Bodian D.L., Kothiyal P., Eley G., Huddleston K.C., Baker R., Thach D.C., Iyer R.K., Vockley J.G., Niederhuber J.E. New observations on maternal age effect on germline de novo mutations. Nat. Commun. 2016;7:10486. doi: 10.1038/ncomms10486.
- 2. Jónsson H., Sulem P., Kehr B., Kristmundsdottir S., Zink F., Hjartarson E., Hardarson M.T., Hjorleifsson K.E., Eggertsson H.P., Gudjonsson S.A., et al. Parental influence on human germline de novo mutations in 1,548 trios from Iceland. Nature. 2017;549:519–522. doi: 10.1038/nature24018.
- 3. Kessler M.D., Loesch D.P., Perry J.A., Heard-Costa N.L., Taliun D., Cade B.E., Wang H., Daya M., Ziniti J., Datta S., et al. De novo mutations across 1,465 diverse genomes reveal mutational insights and reductions in the Amish founder population. Proc. Natl. Acad. Sci. USA. 2020;117:2560–2569. doi: 10.1073/pnas.1902766117.
- 4. Ségurel L., Wyman M.J., Przeworski M. Determinants of mutation rate variation in the human germline. Annu. Rev. Genomics Hum. Genet. 2014;15:47–70. doi: 10.1146/annurev-genom-031714-125740.
- 5. Penrose L. Parental age and mutation. Lancet. 1955;269:312–313. doi: 10.1016/s0140-6736(55)92305-9.
- 6. Kong A., Frigge M.L., Masson G., Besenbacher S., Sulem P., Magnusson G., Gudjonsson S.A., Sigurdsson A., Jonasdottir A., et al. Rate of de novo mutations and the importance of father's age to disease risk. Nature. 2012;488:471–475. doi: 10.1038/nature11396.
- 7. Campbell C.D., Chong J.X., Malig M., Ko A., Dumont B.L., Han L., Vives L., O'Roak B.J., Sudmant P.H., Shendure J., et al. Estimating the human mutation rate using autozygosity in a founder population. Nat. Genet. 2012;44:1277–1281. doi: 10.1038/ng.2418.
- 8. Palamara P.F., Francioli L.C., Wilton P.R., Genovese G., Gusev A., Finucane H.K., Sankararaman S., Genome of the Netherlands Consortium, Sunyaev S.R., de Bakker P.I.W., et al. Leveraging distant relatedness to quantify human mutation and gene-conversion rates. Am. J. Hum. Genet. 2015;97:775–789. doi: 10.1016/j.ajhg.2015.10.006.
- 9. Narasimhan V.M., Rahbari R., Scally A., Wuster A., Mason D., Xue Y., Wright J., Trembath R.C., Maher E.R., van Heel D.A., et al. Estimating the human mutation rate from autozygous segments reveals population differences in human mutational processes. Nat. Commun. 2017;8:303. doi: 10.1038/s41467-017-00323-y.
- 10. Tian X., Browning B.L., Browning S.R. Estimating the genome-wide mutation rate with three-way identity by descent. Am. J. Hum. Genet. 2019;105:883–893. doi: 10.1016/j.ajhg.2019.09.012.
- 11. Browning S.R., Browning B.L. Haplotype phasing: existing methods and new developments. Nat. Rev. Genet. 2011;12:703–714. doi: 10.1038/nrg3054.
- 12. Taliun D., Harris D.N., Kessler M.D., Carlson J., Szpiech Z.A., Torres R., Taliun S.A.G., Corvelo A., Gogarten S.M., Kang H.M., et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature. 2021;590:290–299. doi: 10.1038/s41586-021-03205-y.
- 13. Browning S.R., Browning B.L. Accurate non-parametric estimation of recent effective population size from segments of identity by descent. Am. J. Hum. Genet. 2015;97:404–418. doi: 10.1016/j.ajhg.2015.07.012.
- 14. Delaneau O., Zagury J.-F., Robinson M.R., Marchini J.L., Dermitzakis E.T. Accurate, scalable and integrative haplotype estimation. Nat. Commun. 2019;10:5436. doi: 10.1038/s41467-019-13225-y.
- 15. Chen G.K., Marjoram P., Wall J.D. Fast and flexible simulation of DNA sequence data. Genome Res. 2009;19:136–142. doi: 10.1101/gr.083634.108.
- 16. Williams A.L., Genovese G., Dyer T., Altemose N., Truax K., Jun G., Patterson N., Myers S.R., Curran J.E., Duggirala R., et al. Non-crossover gene conversions show strong GC bias and unexpected clustering in humans. Elife. 2015;4:e04637. doi: 10.7554/eLife.04637.
- 17. Browning S.R., Browning B.L. Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am. J. Hum. Genet. 2007;81:1084–1097. doi: 10.1086/521987.
- 18. Zhou Y., Browning S.R., Browning B.L. A fast and simple method for detecting identity-by-descent segments in large-scale data. Am. J. Hum. Genet. 2020;106:426–437. doi: 10.1016/j.ajhg.2020.02.010.
- 19. Browning B.L., Tian X., Zhou Y., Browning S.R. Fast two-stage phasing of large-scale sequence data. Am. J. Hum. Genet. 2021;108:1880–1890. doi: 10.1016/j.ajhg.2021.08.005.
- 20. Halldorsson B.V., Palsson G., Stefansson O.A., Jonsson H., Hardarson M.T., Eggertsson H.P., Gunnarsson B., Oddsson A., Halldorsson G.H., Zink F., et al. Characterizing mutagenic effects of recombination through a sequence-level genetic map. Science. 2019;363:eaau1043. doi: 10.1126/science.aau1043.
- 21. Browning S.R., Browning B.L. Probabilistic estimation of identity by descent segment endpoints and detection of recent selection. Am. J. Hum. Genet. 2020;107:895–910. doi: 10.1016/j.ajhg.2020.09.010.
- 22. Zhou Y., Browning S.R., Browning B.L. IBDkin: fast estimation of kinship coefficients from identity by descent segments. Bioinformatics. 2020;36:4519–4520. doi: 10.1093/bioinformatics/btaa569.
- 23. Nachman M.W., Crowell S.L. Estimate of the mutation rate per nucleotide in humans. Genetics. 2000;156:297–304. doi: 10.1093/genetics/156.1.297.
- 24. Schaibley V.M., Zawistowski M., Wegmann D., Ehm M.G., Nelson M.R., St Jean P.L., Abecasis G.R., Novembre J., Zöllner S., Li J.Z. The influence of genomic context on mutation patterns in the human genome inferred from rare variants. Genome Res. 2013;23:1974–1984. doi: 10.1101/gr.154971.113.
- 25. Roach J.C., Glusman G., Smit A.F.A., Huff C.D., Hubley R., Shannon P.T., Rowen L., Pant K.P., Goodman N., Bamshad M., et al. Analysis of genetic inheritance in a family quartet by whole-genome sequencing. Science. 2010;328:636–639. doi: 10.1126/science.1186802.