eLife. 2019 Sep 24;8:e46922. doi: 10.7554/eLife.46922

Large, three-generation human families reveal post-zygotic mosaicism and variability in germline mutation accumulation

Thomas A Sasani 1, Brent S Pedersen 1, Ziyue Gao 2, Lisa Baird 1, Molly Przeworski 3,4, Lynn B Jorde 1,5, Aaron R Quinlan 1,5,6
Editors: Amy L Williams 7, Mark I McCarthy 8
PMCID: PMC6759356  PMID: 31549960

Abstract

The number of de novo mutations (DNMs) found in an offspring's genome increases with both paternal and maternal age. But does the rate of mutation accumulation in human gametes differ across families? Using sequencing data from 33 large, three-generation CEPH families, we observed significant variability in parental age effects on DNM counts across families, ranging from 0.19 to 3.24 DNMs per year. Additionally, we found that ~3% of DNMs originated following primordial germ cell specification in a parent, and differed from non-mosaic germline DNMs in their mutational spectra. We also discovered that nearly 10% of candidate DNMs in the second generation were post-zygotic, and present in both somatic and germ cells; these gonosomal mutations occurred at equivalent frequencies on both parental haplotypes. Our results demonstrate that rates of germline mutation accumulation vary among families with similar ancestry, and confirm that post-zygotic mosaicism is a substantial source of human DNM.

Research organism: Human

eLife digest

Humans receive half of their DNA from each of their parents. However, this inherited DNA is not identical to the corresponding half of the parents’ genetic material. Instead, both the egg and the sperm that combine to generate an embryo carry so-called ‘germline de novo’ mutations that are not present in the rest of the parents’ cells. Although these de novo mutations are an important source of genetic diversity, they can also cause disease.

Geneticists have a longstanding interest in how, when and at what rate germline de novo mutations arise. These questions are commonly addressed by analyzing the DNA of large cohorts of two-generation families. Now, Sasani et al. have used the genetic data of 33 families in Utah, United States, which all span three generations, to determine the rate at which de novo mutations appear.

The analysis revealed that, on average, each person has around 70 de novo mutations that were not present in their parents’ genetic code. Sasani et al. also found that sperm and egg cells from older parents typically contain more de novo mutations. However, this effect varied substantially across the Utah families. In some families, an increase of one year in the parents’ age resulted in over three extra de novo mutations in their children. In others, the number of new mutations barely increased at all.

In addition, Sasani et al. found that almost 10% of de novo mutations do not occur in the parents’ sperm or eggs, but happen in the embryo very soon after fertilization. These mutations can lead to ‘mosaicism’, resulting in a person having a mutation in some, but not all of their organs and tissues. In some cases, this could cause an unknown number of sperm and egg cells to carry a mutation that others do not. This makes it hard to predict how likely two or more siblings are to inherit the mutation.

This analysis reveals that parental age affects the number of de novo mutations in children, but this effect changes from family to family. This finding could point to genetic or environmental factors that alter the human mutation rate.

Introduction

In a 1996 lecture at the National Academy of Sciences, James Crow noted that ‘without mutation, evolution would be impossible’ (Crow, 1997). His remark highlights the importance of understanding the rate at which germline mutations occur, the mechanisms that generate them, and the effects of gamete-of-origin and parental age. Not surprisingly, continued investigation into the germline mutation rate has helped to illuminate the timing and complexity of human evolution and demography, as well as the key role of spontaneous mutation in human disease (Scally and Durbin, 2012; Moorjani et al., 2016; Deciphering Developmental Disorders Study, 2017; Yuen et al., 2016; Acuna-Hidalgo et al., 2016; Veltman and Brunner, 2012).

Some of the first careful investigations of human mutation rates can be attributed to J.B.S. Haldane and others, who cleverly leveraged an understanding of mutation-selection balance to estimate rates of mutation at individual disease-associated loci (Haldane, 1935; Nachman, 2008). Over half of a century later, phylogenetic analyses inferred mutation rates from the observed sequence divergence between humans and related primate species at a small number of loci (Nachman and Crowell, 2000; Shendure and Akey, 2015). In the last decade, whole genome sequencing of pedigrees has enabled direct estimates of the human germline mutation rate by identifying mutations present in offspring yet absent from their parents (de novo mutations, DNMs) (Ségurel et al., 2014; Scally and Durbin, 2012; Jónsson et al., 2017; Goldmann et al., 2016; Kong et al., 2012; Roach et al., 2010; Francioli et al., 2015). Numerous studies have employed this approach to analyze the mutation rate in cohorts of small, nuclear families, producing estimates nearly two-fold lower than those from phylogenetic comparison (Roach et al., 2010; Kong et al., 2012; Jónsson et al., 2017; Goldmann et al., 2016; Scally and Durbin, 2012; Shendure and Akey, 2015; Turner et al., 2017).

These studies have demonstrated that the number of DNMs increases with both maternal and paternal ages; such age effects can likely be attributed to a number of factors, including the increased mitotic divisions in sperm cells following puberty, an accumulation of damage-associated mutation, and substantial epigenetic reprogramming undergone by germ cells (Jónsson et al., 2017; Kong et al., 2012; Goldmann et al., 2016; Rahbari et al., 2016; Crow, 2000; Gao et al., 2019). There is also evidence that the mutational spectra of de novo mutations differ in the male and female germlines (Jónsson et al., 2017; Goldmann et al., 2016; Francioli et al., 2015; Gao et al., 2019; Agarwal and Przeworski, 2019). Furthermore, a recent study of three two-generation pedigrees, each with four or five children, indicated that paternal age effects may differ across families (Rahbari et al., 2016). However, two-generation families with few offspring provide limited power to quantify parental age effects on mutation rates and restrict the ability to assign a gamete-of-origin to ~20–30% of DNMs (Rahbari et al., 2016; Jónsson et al., 2017; Goldmann et al., 2016).

Here, we investigate germline mutation among families with large numbers of offspring spanning many years of parental age. We describe de novo mutation dynamics across multiple births using blood-derived DNA samples from large, three-generation families from Utah, which were collected as part of the Centre d'Etude du Polymorphisme Humain (CEPH) consortium (Dausset et al., 1990). The CEPH/Utah families have played a central role in our understanding of human genetic variation (Prescott et al., 2008; 1000 Genomes Project Consortium et al., 2015) by guiding the construction of reference linkage maps for the Human Genome Project (Lander et al., 2001), defining haplotypes in the International HapMap Project (International HapMap Consortium, 2003), and characterizing genome-wide variation in the 1000 Genomes Project (1000 Genomes Project Consortium et al., 2015).

The CEPH/Utah pedigrees are uniquely powerful for the study of germline mutation dynamics in that they have considerably more (min = 4, max = 16, median = 8) offspring than those used in many prior studies of the human mutation rate (Supplementary file 1). Multiple offspring, whose birth dates span up to 27 years, motivated our investigation of parental age effects on DNM counts within families and allowed us to ask whether these effects differed across families. The structure of all CEPH/Utah pedigrees (Supplementary file 1) also enables the use of haplotype sharing through three generations to determine the parental haplotype of origin for nearly all DNMs in the second generation. Using this large dataset of ‘phased’ DNMs, we can investigate the effects of gamete-of-origin on human germline mutation in greater detail.

Finally, if a DNM occurs in the early cell divisions following zygote fertilization (considered gonosomal), or during the proliferation of primordial germ cells, it may be mosaic in the germline of that individual. This mosaicism can then present as recurrent DNMs in two or more children of that parent. As DNMs are an important source of genetic disease (Campbell et al., 2014b; Campbell et al., 2015; Biesecker and Spinner, 2013; Forsberg et al., 2017; Acuna-Hidalgo et al., 2016; Veltman and Brunner, 2012), it is critical to understand the rates of mosaic DNM transmission in families. The structures of the CEPH/Utah pedigrees enable the identification of these recurrent DNMs and can allow one to distinguish mutations arising as post-zygotic gonosomal variants from those that are mosaic in the germline of the second generation.

Results

Identifying high-confidence DNMs using transmission to a third generation

We sequenced the genomes of 603 individuals from 33, three-generation CEPH/Utah pedigrees to a genome-wide median depth of ~30X (Figure 1—figure supplement 1, Supplementary file 1), and removed 10 samples from further analysis following quality control using peddy (Pedersen and Quinlan, 2017a). After standard quality filtering, we identified a total of 4,671 germline de novo mutations in 70 second-generation individuals, each of which was transmitted to at least one offspring in the third generation (Figure 1a, Supplementary file 2). Approximately 92% (4,298 of 4,671) of DNMs observed in the second generation were single nucleotide variants (SNVs), and the remainder were small (<=10 bp) insertion/deletion variants. The eight parents of four second-generation samples were re-sequenced to a median depth of ~60X (Figure 1—figure supplement 1d), allowing us to estimate a false positive rate of 4.5% for our de novo mutation detection strategy (Materials and methods). Taking all second-generation samples together, we calculated median germline mutation rates of 1.10 × 10^-8 and 9.29 × 10^-10 per base pair per generation for SNVs and indels, respectively, which corroborate prior estimates based on family genome sequencing with roughly comparable parental ages (Jónsson et al., 2017; Kong et al., 2012; Besenbacher et al., 2016; Rahbari et al., 2016). Extrapolating to a diploid genome size of ~6.4 Gbp, we therefore estimate an average number of 70.1 de novo SNVs and 5.9 de novo indels per genome, at average paternal and maternal ages of 29.1 and 26.0 years, respectively (Sasani, 2019).
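The extrapolation from a per-base-pair rate to a per-genome count can be checked with simple arithmetic. The following is a minimal sketch with illustrative names of our own choosing. Because it multiplies the reported median rates by the genome size, rather than averaging per-individual counts, it should be expected to match the per-genome figures quoted above only approximately.

```python
# Values quoted in the text; the variable and function names are illustrative.
SNV_RATE = 1.10e-8         # median SNV mutations per base pair per generation
INDEL_RATE = 9.29e-10      # median indel mutations per base pair per generation
DIPLOID_GENOME_BP = 6.4e9  # approximate diploid genome size in base pairs


def expected_dnms(rate_per_bp: float, genome_bp: float = DIPLOID_GENOME_BP) -> float:
    """Expected number of de novo mutations per diploid genome."""
    return rate_per_bp * genome_bp


snv_per_genome = expected_dnms(SNV_RATE)
indel_per_genome = expected_dnms(INDEL_RATE)

print(f"Approximate de novo SNVs per genome: {snv_per_genome:.1f}")
print(f"Approximate de novo indels per genome: {indel_per_genome:.1f}")
```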

Figure 1. Estimating the rate of germline mutation using multigenerational CEPH/Utah pedigrees.

(a) The CEPH/Utah dataset comprises 33 three-generation families. Summaries of sequencing coverage for CEPH/Utah individuals are presented in Figure 1—figure supplement 1. After identifying candidate de novo mutations in the second generation (e.g., the de novo ‘T’ mutation shown in the second-generation father), it is possible to assess their validity both by their absence in the parental (first) generation and by transmission to one or more offspring in the third generation. (b) Total numbers of DNMs (both SNVs and indels) identified across second-generation CEPH/Utah individuals and stratified by parental gamete-of-origin. Boxes indicate the interquartile range (IQR), and whiskers indicate 1.5 times the IQR. Diagrams of phasing strategies for germline DNMs are presented in Figure 1—figure supplement 2.

Figure 1—figure supplement 1. Distribution of sequencing coverage in CEPH/Utah samples (a) The fraction of bases greater than or equal to the specified coverage in the second generation, (b) third generation, (c) first-generation parents sequenced to 30X coverage, and (d) first-generation parents re-sequenced to 60X coverage.

Figure 1—figure supplement 2. Determining the parent-of-origin for de novo mutations using transmission.

(a) We phased de novo mutations observed in the second generation by transmission to a third generation. We first searched ±200 kilobase pairs from the de novo allele (shown in red) for informative sites (shown in blue) present in one of the two first-generation parents of the second-generation individual. If the second-generation individual’s spouse does not possess these informative alleles, we can look in the children of the second-generation individual to see if they have inherited both the de novo allele and the nearby informative alleles. This pattern of inheritance is only possible if the de novo allele and informative alleles are on the same haplotype; thus, in this example, we see that the de novo allele is on the maternal grandfather’s haplotype, and is paternal in origin. (b) A toy sample of paired-end sequencing reads is shown for each member of a trio (mother, father, and child). In this strategy, we identify informative alleles (shown in blue) that are present in one of the two parents, and within a read length (500 bp) of the de novo allele in the child (shown in red). Then, we identify individual sequencing reads that span the de novo and informative alleles. If the de novo allele is always present in the same read as the informative allele, then we can phase the de novo allele to the parent with the informative allele, and vice versa.
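The read tracing strategy in panel (b) can be sketched in a few lines. This is a toy illustration under simplifying assumptions, not the pipeline used in the study: reads are represented as hypothetical dictionaries mapping a position to the observed base, the informative allele is assumed to be present in exactly one parent, and the function name and arguments are ours.

```python
def phase_by_read_tracing(reads, dnm_pos, dnm_allele, info_pos, info_allele, info_parent):
    """Assign a de novo allele to a parent using reads that span both sites.

    reads: list of dicts mapping position -> observed base (toy representation).
    info_parent: "father" or "mother", the only parent carrying info_allele.
    Returns "father", "mother", or None if the evidence is absent or conflicting.
    """
    other_parent = "mother" if info_parent == "father" else "father"
    with_info = 0
    without_info = 0
    for read in reads:
        # Only reads that carry the de novo allele and cover the informative site count.
        if read.get(dnm_pos) != dnm_allele or info_pos not in read:
            continue
        if read[info_pos] == info_allele:
            with_info += 1
        else:
            without_info += 1
    if with_info and not without_info:
        return info_parent    # de novo allele always travels with the informative allele
    if without_info and not with_info:
        return other_parent   # de novo allele is always on the other haplotype
    return None


# Made-up reads: the de novo 'T' at position 100 always co-occurs with 'G' at position 350.
toy_reads = [
    {100: "T", 350: "G"},
    {100: "T", 350: "G"},
    {100: "C", 350: "A"},  # carries the reference allele, so it is ignored
    {100: "T"},            # does not span the informative site, so it is ignored
]
toy_origin = phase_by_read_tracing(toy_reads, 100, "T", 350, "G", "father")
print(toy_origin)
```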

Parent-of-origin and parental age effects on de novo mutation observed in the second generation

We determined the parental gamete-of-origin for a median of 98.5% of de novo variants per second-generation individual (range: 90.3–100%) by leveraging haplotype sharing across all three generations in a family (Kong et al., 2012; Jónsson et al., 2017), as well as read tracing of DNMs to informative sites in the parents (Figure 1b, Figure 1—figure supplement 2). The ratio of paternal to maternal DNMs was 3.96:1, and 79.8% of DNMs were paternal in origin. We then measured the relationship between the number of phased DNMs observed in each child and the ages of the child’s parents at birth (Figure 2a). After fitting Poisson regressions, we observed a significant paternal age effect of 1.44 (95% CI: 1.12–1.77, p<2e-16) additional DNMs per year, and a significant maternal age effect of 0.38 (95% CI: 0.21–0.55, p=1.24e-5) DNMs per year (Figure 2a). These confirm prior estimates of the paternal and maternal age effects on de novo mutation accumulation, and further suggest that both older mothers and fathers contribute to increased DNM counts in children (Figure 2—figure supplement 1) (Jónsson et al., 2017; Goldmann et al., 2016; Rahbari et al., 2016; Wong et al., 2016; Besenbacher et al., 2015).
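With an identity link, the slope of a Poisson regression is read directly as additional DNMs per year of parental age. The sketch below shows one way such a model can be fit, using iteratively reweighted least squares in plain Python. It is an illustration on made-up data, not the code used for the estimates above; the function name and the toy ages and counts are assumptions.

```python
def fit_poisson_identity(ages, counts, n_iter=50, floor=1e-8):
    """Fit counts ~ Poisson(intercept + slope * age) with an identity link.

    Uses iteratively reweighted least squares (Fisher scoring). With an
    identity link the working response is simply the observed count, and
    each observation is weighted by 1 / (current fitted mean).
    """
    n = len(ages)
    intercept, slope = sum(counts) / n, 0.0  # start from a flat fit at the mean
    for _ in range(n_iter):
        weights = [1.0 / max(intercept + slope * x, floor) for x in ages]
        sw = sum(weights)
        sx = sum(w * x for w, x in zip(weights, ages))
        sy = sum(w * y for w, y in zip(weights, counts))
        sxx = sum(w * x * x for w, x in zip(weights, ages))
        sxy = sum(w * x * y for w, x, y in zip(weights, ages, counts))
        # Solve the 2x2 weighted normal equations.
        det = sw * sxx - sx * sx
        new_slope = (sw * sxy - sx * sy) / det
        new_intercept = (sy - new_slope * sx) / sw
        converged = (abs(new_slope - slope) < 1e-10
                     and abs(new_intercept - intercept) < 1e-10)
        intercept, slope = new_intercept, new_slope
        if converged:
            break
    return intercept, slope


# Made-up paternal ages and DNM counts lying exactly on the line 5 + 1.5 * age.
toy_ages = [20, 24, 28, 32, 36]
toy_counts = [35, 41, 47, 53, 59]
toy_intercept, toy_slope = fit_poisson_identity(toy_ages, toy_counts)
print(f"additional DNMs per year of age: {toy_slope:.2f}")
```

In practice a statistics package would also report standard errors and confidence bands, which this sketch omits.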

Figure 2. Effects of parental age and sex on autosomal DNM counts and mutation types in the second generation.

(a) Numbers of phased paternal and maternal de novo variants as a function of parental age at birth. Poisson regressions (with 95% confidence bands, calculated as 1.96 times the standard error) were fit for mothers and fathers separately using an identity link. Germline mutation rates, as a function of both paternal and maternal ages, are presented in Figure 2—figure supplement 1. (b) Mutation spectra in autosomal DNMs phased to the paternal (n = 3,584) and maternal (n = 880) haplotypes. Asterisks indicate significant differences between paternal and maternal fractions at a false-discovery rate of 0.05 (Benjamini-Hochberg procedure), using a Chi-squared test of independence. P-values for each comparison are: C > G: 0.719, T > G: 4.93e-3, T > A: 8.60e-2, T > C: 8.02e-2, C > A: 0.159, C > T: 7.65e-6, indel: 8.01e-2, CpG > TpG: 0.835. Mutation spectra stratified by parental ages are presented in Figure 2—figure supplement 2.

Figure 2—figure supplement 1. Contribution of maternal and paternal age to de novo mutation rates.

For (a) second- and (b) third-generation individuals in the CEPH/Utah cohort, plotted points show the relationship between paternal and maternal age at birth. Each point is colored by the autosomal SNV mutation rate in the individual; these rates were calculated by dividing the autosomal SNV DNM count in each child by that child’s autosomal callable fraction. Colors indicate the magnitude of the mutation rate (blue = lower, red = higher). Black lines indicate the trend for a 1:1 relationship between paternal and maternal age.

Figure 2—figure supplement 2. Comparison of mutation spectra in children born to older or younger parents.

Second-generation children were divided into two groups based on the ages of their parents at birth, and autosomal mutation spectra were compared between the two groups. In all panels, no significant differences were found at a false-discovery rate of 0.05 (Benjamini-Hochberg procedure), using a Chi-squared test of independence. (a) Comparison of DNMs in children born to fathers younger (n = 2,182) or older (n = 2,360) than the median paternal age of 29.2 years. P-values for each comparison are: C > G: 0.304, T > G: 0.140, T > A: 0.306, T > C: 0.248, C > A: 8.81e-2, C > T: 0.444, indel: 6.89e-2, CpG > TpG: 0.810. (b) Comparison of DNMs in children born to mothers younger (n = 2,225) or older (n = 2,317) than the median maternal age of 25.7 years. P-values for each comparison are: C > G: 0.580, T > G: 0.659, T > A: 0.554, T > C: 0.697, C > A: 0.918, C > T: 0.990, indel: 0.371, CpG > TpG: 0.678. (c) Comparison of DNMs in children born to fathers in the 25th percentile of youngest (n = 1,120) or oldest (n = 1,165) paternal ages (26.4 or 34 years). P-values for each comparison are: C > G: 1.73e-2, T > G: 0.428, T > A: 0.872, T > C: 0.979, C > A: 0.943, C > T: 7.77e-2, indel: 0.788, CpG > TpG: 0.706. (d) Comparison of DNMs in children born to mothers in the 25th percentile of youngest (n = 1,169) or oldest (n = 1,121) maternal ages (22.5 or 31.4 years). P-values for each comparison are: C > G: 0.327, T > G: 9.92e-2, T > A: 0.841, T > C: 0.975, C > A: 0.963, C > T: 0.940, indel: 0.598, CpG > TpG: 0.780.

We next compared the paternal and maternal fractions of phased autosomal DNMs identified in the second generation across eight mutational classes (Figure 2b). In maternal mutations, there was an enrichment of C > T transitions in a non-CpG context (p=7.65e-6, Chi-squared test of independence), and we observed an enrichment of T > G transversions in paternal mutations (p=4.93e-3, Chi-squared test of independence). Maternal and paternal enrichments of C > T and T > G, respectively, have been reported in recent studies of de novo mutation spectra, though the mechanisms underlying these observations are currently unclear (Goldmann et al., 2016; Jónsson et al., 2017). We additionally stratified second-generation individuals by the ages of their parents at birth and found no significant differences in the mutational spectra of children born to older or younger parents, though we may be underpowered to detect these differences in our dataset (Figure 2—figure supplement 2).
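To make the multiple-testing step concrete, the sketch below applies the Benjamini-Hochberg step-up procedure to the eight p-values listed in the Figure 2 legend. The function is a generic textbook implementation written for illustration, not the code used in the analysis, and the dictionary keys are our own labels for the mutation classes.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return the set of labels rejected by the Benjamini-Hochberg procedure.

    p_values: dict mapping a label to its p-value.
    The procedure finds the largest rank k such that p_(k) <= fdr * k / m,
    then rejects the k hypotheses with the smallest p-values.
    """
    ranked = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ranked)
    n_rejected = 0
    for rank, (_, p) in enumerate(ranked, start=1):
        if p <= fdr * rank / m:
            n_rejected = rank
    return {label for label, _ in ranked[:n_rejected]}


# P-values for paternal versus maternal fractions, as listed in the Figure 2 legend.
figure_2b_p_values = {
    "C>G": 0.719,
    "T>G": 4.93e-3,
    "T>A": 8.60e-2,
    "T>C": 8.02e-2,
    "C>A": 0.159,
    "C>T": 7.65e-6,
    "indel": 8.01e-2,
    "CpG>TpG": 0.835,
}
significant = benjamini_hochberg(figure_2b_p_values, fdr=0.05)
print(sorted(significant))
```

Only the C > T and T > G classes pass the threshold, in line with the two enrichments described above.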

Evidence for inter-family variability of parental age effects on offspring DNM counts

A recent study of three two-generation pedigrees with multiple offspring suggested that the effect of paternal age on DNM counts in children may differ between families (Rahbari et al., 2016). Given the large numbers of offspring in the CEPH/Utah pedigrees, we were motivated to perform an investigation of parental age effects on mutation counts within individual families. To measure these effects in the CEPH dataset, we first generated a high-quality set of de novo variants observed in the third generation, excluding recurrent (mosaic) DNMs shared by multiple third-generation siblings, likely post-zygotic DNMs (Materials and methods), and ‘missed heterozygotes’ in the second generation (0.4% of heterozygous variants). The ‘missed heterozygotes’ represent apparent DNMs in the third generation that were, in fact, likely inherited from a second-generation parent who was incorrectly genotyped as being homozygous for the reference allele (Materials and methods). In total, we detected 24,975 de novo SNVs and small indels in 350 individuals in the third generation (Supplementary file 3). Of these, we were able to confidently determine a parental gamete-of-origin for 5,336 (median of 21% per third-generation individual; range of 8–38%) using read tracing, and assign 4,201 (78.7%) of these to fathers. Given the comparatively low phasing rate in the third generation, we focused our age effect analysis on the relationship between paternal age only and the total number of autosomal DNMs in each individual, regardless of parent-of-origin. Taking all third-generation individuals into account, we estimate the slope of the paternal age effect to be 1.72 DNMs per year (95% CI: 1.58–1.85, p<2e-16). Within a given family, maternal and paternal ages are perfectly correlated; therefore, the paternal effect approximates the combined age effects of both parents.

When inspecting each family separately, we observed a wide range of paternal age effects among the CEPH/Utah families (Figure 3). To test whether these observed effects varied significantly between families, we fit a Poisson regression that incorporated the effects of paternal age, family membership, and an interaction between paternal age and family membership, across all third-generation individuals in CEPH/Utah pedigrees. As a small number of the CEPH/Utah pedigrees comprise multiple three-generation families (Supplementary file 1), we assigned each unique set of second-generation parents and their third-generation children a distinct ID, resulting in a total of 40 families (Figure 3—figure supplement 1). Overall, the effect of paternal age on offspring DNM counts varied widely across pedigrees, from only 0.19 (95% CI: −1.05–1.44) to nearly 3.24 (95% CI: 2.24–4.24) additional DNMs per year. A goodness-of-fit test supported the use of a ‘family-aware’ regression model when compared to a model that ignores family membership, even after accounting for variable sequencing coverage across third-generation samples (median autosomal base pairs covered = 2,582,875,060; ANOVA: p=9.36e-10). Moreover, we found that the interaction between paternal age and family membership improved the fit of the linear model (p=0.043, Appendix 1—table 1), suggesting that inter-family variability involves differences in paternal age effects (i.e., the slopes of each regression). We note that the confidence intervals surrounding the slope point estimates for some CEPH/Utah families are quite wide, likely due to the small number of third-generation individuals (with respect to count-based regression) in each family, as well as some stochastic noise in the DNM counts attributed to each child (Figure 3d). Nonetheless, family rankings based upon the effect of paternal age on DNM counts are stable and relatively insensitive to outliers (Figure 3—figure supplement 2).

Figure 3. Parental age effects on autosomal germline mutation counts vary significantly among CEPH/Utah families.

Illustrations of pedigrees exhibiting the smallest (family 24_C, panel a) and largest (family 16, panel b) paternal age effects on third-generation DNM counts demonstrate the extremes of inter-family variability. Diamonds are used to anonymize the sex of each third-generation individual. The method used to separate CEPH/Utah pedigrees into unique groups of second-generation parents and third-generation children is presented in Figure 3—figure supplement 1. Third-generation individuals are arranged by birth order from left to right. The number of autosomal DNMs observed in each third-generation individual is shown within the diamonds, and the age of the father at the third-generation individual’s birth is shown below the diamond. The coloring for these two families is used to identify them in panels c and d. (c) The total number of autosomal DNMs is plotted versus paternal age at birth for third-generation individuals from all CEPH/Utah families. Regression lines and 95% confidence bands indicate the predicted number of DNMs as a function of paternal age using a Poisson regression (identity link). Families are sorted in order of increasing slope, and families with the least and greatest paternal age effects are highlighted in blue and red, respectively. (d) A Poisson regression (predicting autosomal DNMs as a function of paternal age) was fit to each family separately; the slope of each family’s regression is plotted, as well as the 95% confidence interval of the regression coefficient estimate. The same two families are highlighted as in (a). A dashed black line indicates the overall paternal age effect (estimated using all third-generation samples). Families are ordered from top to bottom in order of increasing slope, as in (c). A random sampling approach was used to assess the robustness of the per-family regressions to possible outliers; the results of these simulations are shown in Figure 3—figure supplement 2.

Figure 3—figure supplement 1. Defining unique families in the CEPH/Utah dataset.

The pedigree for a single family (family ID 19) is depicted. In this family, the third-generation individuals are first cousins and share a pair of grandparents. However, for the purposes of the inter-family variability presented in Figure 3, we defined ‘families’ as the unique groups of second-generation parents and their third-generation children. Thus, family ID 19 would be split into two unique families (19_A and 19_B), designated by the red boxes.

Figure 3—figure supplement 2. Paternal age effect ranks of CEPH/Utah families are robust to outlier samples.

For each CEPH/Utah family (i.e., unique set of second-generation and third-generation individuals), we randomly sampled 75% of the third-generation individuals in the family, fit a regression predicting autosomal DNM counts as a function of paternal age at birth, and calculated the ‘rank’ of that family’s paternal age effect (out of 40 total families). We then plotted the distribution of ranks across 100 trials for each family. Families’ density plots are ordered along the y-axis by the original ranks of each family (as determined using the full dataset, and originally shown in Figure 3d, where a rank of 1 corresponds to the smallest age effect, and a rank of 40 corresponds to the largest).
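The subsampling procedure described in this legend can be sketched as follows. For brevity the sketch substitutes an ordinary least-squares slope for the Poisson regression described above, and it runs on three made-up families; the function names, family labels, and data are all hypothetical.

```python
import random
from collections import Counter


def ols_slope(ages, counts):
    """Ordinary least-squares slope (a simple stand-in for the Poisson fit)."""
    n = len(ages)
    mean_x = sum(ages) / n
    mean_y = sum(counts) / n
    sxx = sum((x - mean_x) ** 2 for x in ages)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, counts))
    return sxy / sxx


def rank_distribution(families, n_trials=100, fraction=0.75, seed=0):
    """Tabulate each family's slope rank across random subsamples of its children.

    families: dict mapping a family ID to a list of (paternal_age, dnm_count).
    Rank 1 is the smallest age effect in a given trial.
    """
    rng = random.Random(seed)
    ranks = {name: Counter() for name in families}
    for _ in range(n_trials):
        slopes = {}
        for name, children in families.items():
            k = max(2, round(fraction * len(children)))
            subset = rng.sample(children, k)
            slopes[name] = ols_slope([a for a, _ in subset], [c for _, c in subset])
        for rank, name in enumerate(sorted(slopes, key=slopes.get), start=1):
            ranks[name][rank] += 1
    return ranks


# Made-up families with eight children each and small deterministic noise.
toy_ages = [22, 24, 26, 28, 30, 32, 34, 36]
toy_noise = [1, -1, 0, 1, -1, 0, 1, -1]
toy_families = {
    label: [(age, 40 + slope * (age - 22) + e) for age, e in zip(toy_ages, toy_noise)]
    for label, slope in [("low", 0.5), ("mid", 1.5), ("high", 3.0)]
}
toy_ranks = rank_distribution(toy_families)
for label, counter in toy_ranks.items():
    print(label, dict(counter))
```

A family whose rank distribution is concentrated on a single value, as in this toy case, has an age effect ranking that does not depend on any one child.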

Finally, when compared to a multiple regression that includes the effects of both paternal and maternal age, a model that takes family membership into account remained a significantly better fit (ANOVA: p=2.12e-5). The high degree of correlation between paternal and maternal ages makes it difficult to tease out the individual contributions of each parent to the observed inter-family differences. Nonetheless, these results suggest the existence of substantial variability in parental age effects across CEPH/Utah families, which could involve both genetic and environmental factors that differ among families.

Identifying gonadal, post-primordial germ cell specification (PGCS) mosaicism in the second generation

Generally, studies of de novo mutation focus on variants that arise in a single parental gamete. However, if a de novo variant arises during or after primordial germ cell specification (PGCS), that variant may be present in multiple resulting gametes and absent from somatic cells (Rahbari et al., 2016; Acuna-Hidalgo et al., 2015; Campbell et al., 2014b; Tang et al., 2016; Jónsson et al., 2018; Campbell et al., 2015; Biesecker and Spinner, 2013). These variants can therefore be present in more than one offspring as apparent de novo mutations. In each family, we searched for post-PGCS germline mosaic variants by identifying high-confidence DNMs that were shared by two or more third-generation individuals, and were absent from the blood DNA of any parents or grandparents within the family (Figure 4a). Given the large number of third-generation siblings in each CEPH/Utah family, we had substantially higher power to detect germline mosaicism that occurred in the second generation than in prior studies. In total, we identified 720 single-nucleotide germline mosaic mutations at a total of 303 unique sites, which were subsequently corroborated through visual inspection using the Integrative Genomics Viewer (IGV) (Supplementary file 4) (Thorvaldsdóttir et al., 2013). Of the phased shared germline mosaic mutations, 124/260 (47.7%) were paternal in origin; thus, the mutations that occurred following PGCS likely occurred irrespective of any parental sex biases on mutation counts. Overall, approximately 3.1% (720/23,399) of all single-nucleotide DNMs observed in the third generation likely arose during or following PGCS in a parent’s germline, confirming that these variants comprise a non-negligible fraction of all de novo germline mutations.
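The search for shared DNMs can be illustrated with a small sketch. It assumes a hypothetical representation in which each third-generation sibling has a set of candidate DNMs keyed by (chromosome, position, reference, alternate), and it ignores the genotype quality and read-depth filters that a real analysis would apply. It also shows why the number of mosaic mutations (one per carrier sibling) exceeds the number of unique sites.

```python
from collections import Counter


def shared_dnms(dnms_by_child, ancestor_variants):
    """Find candidate germline mosaic sites within one family.

    dnms_by_child: dict mapping a sibling ID to a set of candidate DNM sites.
    ancestor_variants: set of sites seen in any parent or grandparent.
    Returns a dict mapping each shared site to its number of carrier siblings.
    """
    carriers = Counter()
    for variants in dnms_by_child.values():
        carriers.update(set(variants))
    return {
        site: n
        for site, n in carriers.items()
        if n >= 2 and site not in ancestor_variants
    }


# Made-up family: one site is shared by three siblings and absent from all ancestors.
toy_dnms_by_child = {
    "child_1": {("chr1", 1000, "C", "T"), ("chr2", 500, "G", "A")},
    "child_2": {("chr1", 1000, "C", "T"), ("chr3", 42, "T", "G")},
    "child_3": {("chr1", 1000, "C", "T"), ("chr2", 500, "G", "A"), ("chr5", 7, "A", "C")},
}
# The chr2 site is present in a grandparent, so it is inherited rather than mosaic.
toy_ancestor_variants = {("chr2", 500, "G", "A")}

toy_shared = shared_dnms(toy_dnms_by_child, toy_ancestor_variants)
n_unique_sites = len(toy_shared)
n_mosaic_mutations = sum(toy_shared.values())
print(n_unique_sites, n_mosaic_mutations)
```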

Figure 4. Identification of post-PGCS germline mosaicism in the second generation.

(a) Mosaic variants occurring during or after primordial germ cell specification (PGCS) were defined as DNMs present in multiple third-generation siblings, and absent from progenitors in the family. (b) Comparison of mutation spectra in autosomal single-nucleotide germline mosaic variants (red, n = 288) and germline de novo variants observed in the third generation (non-shared) (blue, n = 22,644). Asterisks indicate significant differences at a false-discovery rate of 0.05 (Benjamini-Hochberg procedure), using a Chi-squared test of independence. P-values for each comparison are: C > G: 6.84e-2, T > G: 0.169, T > A: 0.236, T > C: 1.51e-2, C > A: 4.31e-3, C > T: 0.385, CpG > TpG: 2.26e-6. (c) For each third-generation individual, we calculated the number of their DNMs that was shared with at least one sibling, and plotted this number against the individual’s paternal age at birth. The red line shows a Poisson regression (identity link) predicting the mosaic number as a function of paternal age at birth. (d) We fit a Poisson regression predicting the total number of germline single-nucleotide DNMs observed in the third-generation individuals as a function of paternal age at birth, and plotted the regression line (with 95% CI) in blue. In red, we plotted the line of best fit (with 95% CI) produced by the regression detailed in (c). (e) For each third-generation individual, we divided the number of their DNMs that occurred during or post-PGCS in a parent (i.e., that were shared with a sibling) by their total number of DNMs (germline + germline mosaic), and plotted this fraction of shared germline mosaic DNMs against their paternal age at birth.

The mutation spectrum for non-shared germline de novo variants was significantly different from the spectrum for shared germline mosaic variants (Figure 4b). Specifically, we found enrichments of CpG > TpG and C > A mutations, and a depletion of T > C mutations, in shared germline mosaic variants when compared to all unshared germline de novo variants observed in the third generation (Figure 4b). An enrichment of CpG > TpG mutations in germline mosaic DNMs, which was also seen in a recent report on mutations shared between siblings (Jónsson et al., 2018), is particularly intriguing, as many C > T transitions in a CG dinucleotide context are thought to occur due to spontaneous deamination of methylated cytosine (Fryxell and Zuckerkandl, 2000). Indeed, DNA methylation patterns are highly dynamic during gametogenesis; evidence in mouse demonstrates that the early primordial germ cells are highly methylated, but experience a global loss of methylation during expansion and migration to the genital ridge, followed by a re-establishment of epigenetic marks (at different time points in males and females) (Seisenberger et al., 2012; Reik et al., 2001).

We also tabulated the number of each third-generation individual’s DNMs that was shared with one or more of their siblings. As reported in the recent analysis of germline mosaicism (Jónsson et al., 2018), we observed that the number of shared germline mosaic DNMs does not increase with paternal age (p=0.647, Figure 4c, Materials and methods). Thus, a de novo mutation sampled from the child of a younger father is more likely to recur in a future child, as early-occurring, potentially mosaic mutations comprise a larger proportion of all DNMs present among the younger father’s sperm population (Figure 4d). Conversely, a de novo mutation sampled from the child of an older father is less likely to recur, as the vast majority of DNMs in that father’s gametes will have arisen later in life in individual spermatogonial stem cells (Figure 4d) (Campbell et al., 2014a; Jónsson et al., 2018). Consistent with this expectation, we observed a significant age-related decrease in the proportion of shared germline mosaic DNMs (p=1.61e-5, Figure 4e). Although families with large numbers of siblings are expected to offer greater power to detect shared, germline mosaic DNMs, we verified that neither the mosaic fraction nor the number of mosaic DNMs observed in third-generation children are significantly associated with the number of siblings in a family (Materials and methods).

Identifying gonosomal mosaicism in the second generation

We further distinguished germline mosaicism from mutations that occurred before primordial germ cell specification, but likely following the fertilization of second-generation zygotes. De novo mutations that occur prior to PGCS can be present in both blood and germ cells; we therefore sought to characterize these ‘gonosomal’ variants that likely occurred during the early post-zygotic development of second-generation individuals (Besenbacher et al., 2015; Campbell et al., 2015; Campbell et al., 2014a; Campbell et al., 2014b; Rahbari et al., 2016; Harland et al., 2017; Jónsson et al., 2018). We assumed that these gonosomal mutations would be genotyped as heterozygous in a second-generation individual, but exhibit a distinct pattern of ‘incomplete linkage’ to informative heterozygous alleles nearby (Materials and methods, Figure 5a) (Feusier et al., 2018; Harland et al., 2017; Jónsson et al., 2018). If these variants occurred early in development, and were present in both the blood and germ cells, we could also validate them by identifying third-generation individuals that inherited the variants with a balanced number of reads supporting the reference and alternate alleles (Figure 5a).

Figure 5. Identification of gonosomal mutations in the second generation.

(a) Gonosomal post-zygotic variants were identified as DNMs in a second-generation individual that were inherited by one or more third-generation individuals, but exhibited incomplete linkage to informative heterozygous sites nearby. (b) Comparison of mutation spectra in single-nucleotide gonosomal DNMs that occurred on the paternal (n = 249) or maternal (n = 226) haplotypes. No significant differences were found at a false-discovery rate of 0.05 (Benjamini-Hochberg procedure), using a Chi-squared test of independence. P-values for each comparison are: C > G: 3.05e-2, T > G: 0.972, T > A: 0.858, T > C: 0.148, C > A: 3.31e-2, C > T: 2.66e-2, indel: 0.247, CpG > TpG: 0.932. (c) Comparison of mutation spectra in autosomal single-nucleotide germline DNMs observed in the second generation (non-gonosomal) (n = 4,542) and putative gonosomal mutations (n = 475) in the second generation. Asterisks indicate significant differences at a false-discovery rate of 0.05 (Benjamini-Hochberg procedure), using a Chi-squared test of independence. P-values for each comparison are: C > G: 0.517, T > G: 0.800, T > A: 2.32e-3, T > C: 0.255, C > A: 0.129, C > T: 0.805, indel: 0.446, CpG > TpG: 0.212. (d) Numbers of phased gonosomal variants as a function of parental age at birth. Poisson regressions (with 95% confidence bands) were fit for the mutations phased to the maternal and paternal haplotypes separately using an identity link. A diagram of an identification strategy for post-zygotic gonosomal DNMs (using only two generations) is presented in Figure 5—figure supplement 1.

Figure 5—figure supplement 1. Strategy for identifying post-zygotic DNMs using two generations.

(a) Diagram of an example two-generation pedigree structure that is amenable to the post-zygotic detection strategy. In this example, the child has a de novo ‘T’ allele that is <= 500 bp downstream of a heterozygous ‘G’ allele. Question marks in the parents indicate that the child could have inherited the ‘G’ allele from either parent; unlike the read tracing strategy (Figure 1—figure supplement 2), a particular parent does not need to be ‘informative.’ (b) In the child’s reads, only two possible sets of linked haplotypes should be seen, assuming the de novo allele occurred in the germline of a parent. The presence of three distinct haplotypes, demonstrating incomplete linkage of the de novo and heterozygous alleles, indicates that the de novo ‘T’ allele is post-zygotic.
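The three-haplotype logic in panel (b) can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: each read (or read pair) spanning both sites is reduced to a tuple of bases, and the `min_reads` noise guard is a hypothetical parameter added here so that a single erroneous read does not create a spurious haplotype.

```python
from collections import Counter


def classify_by_read_haplotypes(read_alleles, dnm_allele, min_reads=2):
    """Classify a DNM from reads that span both the DNM site and a nearby het site.

    read_alleles: list of (het_site_base, dnm_site_base) tuples, one per read.
    min_reads: hypothetical noise guard; haplotypes seen in fewer reads are ignored.
    """
    counts = Counter(read_alleles)
    supported = {hap for hap, n in counts.items() if n >= min_reads}
    # Het-site alleles observed on the same read as the de novo allele.
    het_with_dnm = {het for het, base in supported if base == dnm_allele}
    # Het-site alleles observed on reads lacking the de novo allele.
    het_without_dnm = {het for het, base in supported if base != dnm_allele}
    if not het_with_dnm:
        return "no_support"
    # Three distinct haplotypes, with one het allele seen both with and without
    # the DNM, is the 'incomplete linkage' signature of a post-zygotic mutation.
    if len(supported) >= 3 and het_with_dnm & het_without_dnm:
        return "post_zygotic"
    return "germline_consistent"
```

For a germline DNM, only two haplotypes should appear (for example, G with the de novo T, and the other het allele with the reference base), so the function returns `germline_consistent`.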

In total, we identified 475 putative autosomal gonosomal DNMs, which were also validated by visual inspection (Supplementary file 5). In contrast to single-gamete germline DNMs observed in the second generation, gonosomal mutations appeared to be sex-balanced with respect to the parental haplotype on which they occurred; 52% (249/475) of all gonosomal DNMs occurred on a paternal haplotype, as compared to ~80% of germline DNMs observed in the second generation. Similarly, no significant enrichment of particular gonosomal mutation types was observed on either parental haplotype at a false discovery rate of 0.05 (Figure 5b), though we found that T > A transversions were enriched in gonosomal DNMs when compared to single-gamete germline DNMs observed in the second generation (p=2.32e-3) (Figure 5c). Unlike single-gamete germline DNM counts, gonosomal DNM counts showed no significant effect of parental age (maternal age, p=0.132; paternal age, p=0.225) (Figure 5d). However, a recent study found tentative evidence for a maternal age effect on de novo mutations that arise in the early stages of zygote development (Gao et al., 2019). As noted in this previous study, we are likely underpowered to detect a possible maternal age effect using the numbers of second-generation individuals in the CEPH/Utah dataset. Overall, our results demonstrate that over 9% (475/5,017) of all candidate autosomal germline mutations observed in the second generation were, in fact, post-zygotic in these second-generation individuals. Perhaps most importantly, approximately 6% of candidate de novo mutations detected in the second generation with an allele balance >= 0.2 (303/5,017) were determined to be post-zygotic, and present in both somatic and germ cells. This suggests that a fraction of the mutations in many germline de novo mutation datasets are truly post-zygotic DNMs, rather than mutations that occurred in a single parental gamete.
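The proportions quoted in this paragraph follow directly from the reported counts. The short check below uses only numbers stated in the text and in the Figure 5 legend; the variable names are illustrative.

```python
# Counts reported in the text and the Figure 5 legend.
gonosomal_total = 475
paternal_gonosomal = 249
maternal_gonosomal = 226
non_gonosomal_germline = 4542
gonosomal_high_allele_balance = 303  # gonosomal DNMs with allele balance >= 0.2

# Denominator used in the text: all candidate second-generation DNMs.
second_gen_candidates = non_gonosomal_germline + gonosomal_total

paternal_fraction = paternal_gonosomal / gonosomal_total
gonosomal_fraction = gonosomal_total / second_gen_candidates
high_ab_fraction = gonosomal_high_allele_balance / second_gen_candidates

print(f"{paternal_fraction:.1%} {gonosomal_fraction:.1%} {high_ab_fraction:.1%}")
```

The three values correspond to the "52%", "over 9%", and "approximately 6%" figures given above.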

We note that our analysis pipeline may erroneously classify some gonosomal and shared germline mosaic DNMs. Namely, our count of gonosomal DNMs may be an underestimate, since our requirement that the second-generation individual be heterozygous precludes the detection of post-zygotic mosaic mutations at very low frequency in blood. Also, blood cells represent only a fraction of the total somatic cell population, and we cannot rule out the possibility that mosaicism apparently restricted to the germline may, in fact, be present in other somatic cells that were not sampled in this study (Biesecker and Spinner, 2013).

Discussion

Using a cohort of large, multi-generational CEPH/Utah families, we identified a high-confidence set of germline de novo mutations that were validated by transmission to the following generation. We determined the parental gamete-of-origin for nearly all of these DNMs observed in the second generation and produced estimates of the maternal and paternal age effects on the number of DNMs in offspring. Then, by comparing parental age effects among pedigrees with large third generations whose birth dates span as many as 27 years, we found that families significantly differed with respect to these age effects. Finally, we identified gonosomal and shared germline mosaic de novo variants which appear to differ from single-gamete germline DNMs with respect to mutational spectra and magnitude of the sex bias.

Understanding family differences in both mutation rates and parental age effects could enable the identification of developmental, genetic, and environmental factors that impact this variability. The fact that there were detectable differences in parental age effects between families is striking in light of the fact that the CEPH/Utah pedigrees comprise mostly healthy individuals, and that at the time of collection they resided within a relatively narrow geographic area (Malhotra et al., 2005; Dausset et al., 1990). We therefore suspect that our results understate the true extent of variability in mutation rates and age effects among families with diverse inherited risk for mutation accumulation, and who experience a wide range of exposures, diets, and other environmental factors. Supporting this hypothesis, a recent report identified substantial differences in the mutation spectra of variants in populations of varied ancestries, suggesting that genetic modifiers of the mutation rate may exist in humans, as well as possible differences in environmental exposures (Harris and Pritchard, 2017; Mathieson and Reich, 2017). Another explanation (that we are unable to explore) for the range of de novo mutation counts in firstborn children across families is variability in the age at which parents enter puberty. For example, a father entering puberty at an older age could result in less elapsed time between the start of spermatogenesis and the fertilization of his first child’s embryo. Compared to another male parent of the same age, his sperm will have accumulated fewer mutations by the time of conception. Of course, this hypothesis assumes that for both fathers, three parameters are identical: the mutation rate at puberty, the yearly mutation rate increase following puberty, and age at fertilization of the first child’s embryo. Moreover, we note that replication errors are unlikely to be the sole source of de novo germline mutations (Gao et al., 2019). 
Overall, the potential sources of inter-family variability in mutation rates remain mysterious, and we anticipate that future studies will be needed to uncover the biological underpinnings of this variability.

Our observation of germline mosaicism, a result of de novo mutations that occur during or post-PGCS, has broad implications for the study of human disease and estimates of recurrence risks within families (Jónsson et al., 2018; Campbell et al., 2014b; Biesecker and Spinner, 2013; Forsberg et al., 2017; Krupp et al., 2017). If a de novo mutation is found to underlie a genetic disorder in a child, it is critical to understand the risk of mutation recurrence in future offspring. We estimate that ~3% of germline de novo mutations originated as a mosaic in the germ cells of a parent. This result corroborates recent reports (Rahbari et al., 2016; Jónsson et al., 2018) and demonstrates that a substantial fraction of all germline DNMs may be recurrent within a family. We also find that the mutation spectrum of shared germline mosaic DNMs is significantly different than the spectrum for single-gamete germline DNMs, raising the intriguing possibility that different mechanisms contribute to de novo mutation accumulation throughout the proliferation of primordial germ cells and later stages of gametogenesis. For instance, the substantial epigenetic reprogramming that occurs following primordial germ cell specification may predispose cells at particular developmental time points to certain classes of de novo mutations, such as C > T transitions at CpG dinucleotide sites (Gao et al., 2019).

Recurrent DNMs across siblings can also manifest as a consequence of gonosomal mosaicism in parents (Biesecker and Spinner, 2013; Jónsson et al., 2018). Although it can be difficult to distinguish gonosomal mosaicism from both single-gamete germline de novo mutation and germline mosaicism, we have identified a set of putative gonosomal mosaic mutations that are sex-balanced with respect to the parental haplotype on which they occurred, and do not exhibit any detectable dependence on parental age at birth. Both of these observations are expected if gonosomal mutations arise after zygote fertilization, rather than during the process of gametogenesis. We do, however, find that T > A transversions are enriched in gonosomal DNMs, as compared to DNMs that occurred exclusively in the germline of a parent. Overall, we estimate that approximately 10% of candidate germline de novo mutations in our study were, in fact, gonosomal mutations that occurred during the early cell divisions of the offspring, rather than in a single parental gamete. Prior work in cattle has estimated the fraction of mosaic DNMs that occur during early cell divisions to be even higher, suggesting that these mosaic mutations make up a large fraction of DNMs that are reported to have occurred in a single parental gamete (Harland et al., 2017).

These results underscore the power of large, multi-generational pedigrees for the study of de novo human mutation and yield new insight into the mutation dynamics that exist due to factors such as parental age and sex, as well as family of origin. Given that we studied only 33 large pedigrees, the mutation rate variability we observe is very likely an underestimate of the full range of variability worldwide. We therefore anticipate future studies of multi-generational pedigrees that will help to dissect the relative contributions of genetic background, developmental timing, and myriad environmental factors.

Materials and methods

Key resources table.

| Reagent type (species) or resource | Designation | Source or reference | Identifiers | Additional information |
|---|---|---|---|---|
| Software, algorithm | Genome Analysis Toolkit (GATK) | DePristo et al., 2011 | v3.5.0; RRID:SCR_001876 | |
| Software, algorithm | peddy | Pedersen and Quinlan, 2017a | v0.4.3; RRID:SCR_017287 | |
| Software, algorithm | cyvcf2 | Pedersen and Quinlan, 2017b | v0.11.2 | |
| Software, algorithm | mosdepth | Pedersen and Quinlan, 2018 | v0.2.4 | |
| Software, algorithm | pysam | https://github.com/pysam-developers/pysam | v0.15.2 | |
| Software, algorithm | python | https://www.python.org/ | v3.7.3; RRID:SCR_008394 | |
| Software, algorithm | R | https://www.r-project.org/ | v3.4.4; RRID:SCR_001905 | |
| Software, algorithm | Integrative Genomics Viewer (IGV) | Thorvaldsdóttir et al., 2013 | v2.4.11; RRID:SCR_011793 | |
| Software, algorithm | samtools | Li et al., 2009 | RRID:SCR_002105 | |
| Software, algorithm | BWA-MEM | Li, 2013 | v0.7.15; RRID:SCR_010910 | |

Genome sequencing

Whole-genome DNA sequencing libraries were constructed with 500 ng of genomic DNA isolated from blood, utilizing the KAPA HTP Library Prep Kit (KAPA Biosystems, Boston, MA) on the SciClone NGS instrument (Perkin Elmer, Waltham, MA) targeting 350 bp inserts. Post-fragmentation (Covaris, Woburn, MA), the genomic DNA was size selected with AMPure XP beads using a 0.6x/0.8x ratio. The libraries were PCR amplified with KAPA HiFi for 4–6 cycles (KAPA Biosystems, Boston, MA). The final libraries were purified with two 0.7x AMPure XP bead cleanups. The concentration of each library was determined through qPCR (KAPA Biosystems, Boston, MA). Twenty-four libraries were pooled and loaded across four lanes of a HiSeqX flow cell to ensure that the libraries within the pool were equally balanced. The final pool of balanced libraries was loaded over an additional 16 lanes of the Illumina HiSeqX (Illumina, San Diego, CA). 2 × 150 paired-end sequence data were generated. This efficient pooling scheme targeted ~30X coverage for each sample.

DNA sequence alignment

Sequence reads were aligned to the GRCh37 reference genome (including decoy sequences from the GATK resource bundle) using BWA-MEM v0.7.15 (Li, 2013). The aligned BAM files produced by BWA-MEM were de-duplicated with samblaster (Faust and Hall, 2014). Realignment for regions containing potential short insertions and deletions and base quality score recalibration were performed using GATK v3.5.0 (DePristo et al., 2011). Alignment quality metrics were calculated by running samtools ‘stats’ and ‘flagstat’ (Li et al., 2009) on aligned and polished BAM files.

Variant calling

Single-nucleotide and short insertion/deletion variant calling was performed with GATK v3.5.0 (DePristo et al., 2011) to produce gVCF files for each sample. Sample gVCF files were then jointly genotyped to produce a multi-sample project level VCF file.

Sample quality control and filtering

We used peddy (Pedersen and Quinlan, 2017a) to perform relatedness and sample sequencing quality checks on all CEPH/Utah samples. We discovered a total of 10 samples with excess levels of heterozygosity (proportion of heterozygous calls > 0.2). Many of these samples were also listed as being duplicates of other samples in the cohort, indicating possible sample contamination prior to sequencing. We therefore removed all 10 samples with a heterozygous genotype proportion exceeding 0.2 from further analysis. In total, we were left with 593 first-, second-, and third-generation samples with high-quality sequencing data.

Identifying DNM candidates

We identified high-confidence de novo mutations from the joint-called VCF in the second and third generations as follows, using cyvcf2 (Pedersen and Quinlan, 2017b). For each variant, we required that the child possessed a unique genotyped allele absent from both parents; when identifying de novo variants on the X chromosome, we required male offspring genotypes to be homozygous. We required the aligned sequencing depth in the child and both parents to be >= 12 reads, Phred-scaled genotype quality (GQ) to be >= 20 in the child and both parents, and no reads supporting the de novo allele in either parent. We removed de novo variants within low-complexity regions (Li, 2014; Turner et al., 2017), and any variants that were not listed as ‘PASS’ variants by GATK HaplotypeCaller. Finally, we removed DNMs with likely DNM carriers in the cohort; we define carriers as samples that possess the DNM allele, other than the sample with the putative DNM and his/her immediate family (i.e., siblings, parents, or grandparents). We adapted a previously published strategy (Jónsson et al., 2017) to discriminate between ‘possible carriers’ of the DNM allele (samples genotyped as possessing the de novo allele), and ‘likely carriers’ (a subset of ‘possible carriers’ with depth >= 12, allele balance >= 0.2, and Phred-scaled genotype quality >= 20). We removed putative DNMs for which there were any ‘likely carriers’ of the allele in the cohort. We then separated the candidate variants observed in the second-generation into true and false positives based on transmission to the third generation. For each candidate second-generation variant, we assessed whether the DNM was inherited by at least one member of the third generation; to limit our identification of false positive transmission events, we required third-generation individuals with inherited DNMs to have a depth >= 12 reads at the site and Phred-scaled genotype quality >= 20. 
We defined ‘transmitted’ second-generation variants as variants for which the median allele balance across transmissions was >= 0.3. One CEPH/Utah family (family ID 26) contains only four sequenced grandchildren (Supplementary file 1); therefore, we did not include the two second-generation individuals from this family in our analysis of DNMs observed in the second-generation, as we lacked power to detect high-quality transmission events.
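The thresholds in this section can be collected into a small set of predicates. The sketch below is not the authors' cyvcf2 pipeline: the `Genotype` record and all function names are hypothetical, and a 'possible carrier' is approximated here as any sample with at least one read supporting the allele, which is an assumption made for illustration.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Genotype:
    depth: int      # aligned reads at the site
    gq: int         # Phred-scaled genotype quality
    alt_reads: int  # reads supporting the candidate de novo allele

    @property
    def allele_balance(self):
        return self.alt_reads / self.depth if self.depth else 0.0


MIN_DEPTH, MIN_GQ = 12, 20


def passes_trio_filters(child, mother, father):
    """Depth and GQ thresholds in all three samples, and no alt reads in either parent."""
    if any(g.depth < MIN_DEPTH or g.gq < MIN_GQ for g in (child, mother, father)):
        return False
    return child.alt_reads > 0 and mother.alt_reads == 0 and father.alt_reads == 0


def is_likely_carrier(sample, min_allele_balance=0.2):
    """A 'likely carrier': depth >= 12, allele balance >= 0.2, and GQ >= 20."""
    return (sample.depth >= MIN_DEPTH and sample.gq >= MIN_GQ
            and sample.alt_reads > 0
            and sample.allele_balance >= min_allele_balance)


def is_transmitted(grandchildren, min_median_ab=0.3):
    """True if the median allele balance across qualifying transmissions is >= 0.3."""
    balances = [g.allele_balance for g in grandchildren
                if g.alt_reads > 0 and g.depth >= MIN_DEPTH and g.gq >= MIN_GQ]
    return bool(balances) and median(balances) >= min_median_ab
```

In this sketch, a second-generation candidate would be retained if `passes_trio_filters` holds, no unrelated sample satisfies `is_likely_carrier`, and `is_transmitted` holds for the third-generation carriers.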

Because we were unable to validate DNMs observed in the third generation by transmission, we applied a more stringent set of quality filters to all third-generation DNMs. We required the same filters as applied to all second-generation DNMs, but additionally required that the allele balance in each DNM was >= 0.3. We further required that there were no possible carriers of the de novo allele in the rest of the cohort. For each DNM in the third generation, we assessed if any of the third-generation individuals’ grandparents were genotyped as possessing the DNM allele; if so, we removed that DNM from further analysis (see section entitled ‘Estimating a missed heterozygote rate’). Finally, we removed a total of 319 candidate germline DNMs in the third generation after finding evidence that these were, in fact, post-zygotic mutations (see section entitled ‘Identifying gonosomal mutations’).

Determining the parent of origin for single-gamete germline DNMs

To determine the parent of origin for each de novo variant in the second generation, we phased mutation alleles by transmission to a third generation, a technique which has been described previously (Jónsson et al., 2017; Kong et al., 2012; Goldmann et al., 2016; Rahbari et al., 2016) (Figure 1—figure supplement 2a). We searched 200 kbp upstream and downstream of each DNM for informative variants, defined as alleles present as a heterozygote in the second-generation individual, observed in only one of the two parents, and observed in each of the third-generation individuals that inherited the DNM. For each of these informative variants, we confirmed that the informative variant was always transmitted with the DNM; if so, we could infer that the heterozygous variant was present on the same haplotype as the DNM (assuming recombination did not occur between the DNM and the flanking informative variants), and assign the first-generation parent with the informative variant as the parent of origin (Figure 1—figure supplement 2a). For each second-generation DNM, we identified all transmission patterns (i.e., combinations of a first-generation parent, second-generation child, and set of third-generation grandchildren that inherited both the informative variant and the DNM). We only assigned a confident parent-of-origin at sites where the most frequent transmission pattern occurred at >= 75% of all informative sites.
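The final 75% rule can be illustrated as a majority vote over transmission patterns. This is a minimal sketch with hypothetical names and inputs: each informative site near a DNM is reduced to a tuple of the first-generation parent carrying the informative allele and the set of grandchildren that inherited both the informative allele and the DNM.

```python
from collections import Counter


def assign_parent_of_origin(informative_sites, min_fraction=0.75):
    """Pick a parent of origin from per-site transmission patterns.

    informative_sites: list of (first_generation_parent, frozenset_of_grandchildren)
    tuples, one per informative variant found near the DNM.
    Returns the parent label, or None when no pattern reaches min_fraction.
    """
    if not informative_sites:
        return None
    pattern, count = Counter(informative_sites).most_common(1)[0]
    if count / len(informative_sites) >= min_fraction:
        return pattern[0]
    return None


# Three of four informative sites agree on a paternal origin.
sites = [("father", frozenset({"kid1", "kid2"}))] * 3 + [("mother", frozenset({"kid1"}))]
parent = assign_parent_of_origin(sites)
```

With three of four sites supporting the same pattern, the 75% threshold is met exactly and `parent` is `"father"`; an even split would return `None`.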

We additionally phased de novo variants in the second generation, as well as all DNMs in the third generation, using ‘read tracing’ (also known as ‘read-backed phasing’) (Jónsson et al., 2017; Goldmann et al., 2016). Briefly, for each de novo variant, we first searched for nearby (within one read fragment length, 500 bp) variants present in the proband and one of the two parents. Thus, if the de novo variant was present on the same read as the inherited variant, we could infer haplotype sharing, and determine that the de novo event occurred on that parent’s chromosome (Figure 1—figure supplement 2b). Similarly, if the de novo variant was not present on the same read as the inherited variant, we could infer that the de novo event occurred on the other parent’s chromosome.
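Read tracing reduces to asking whether reads carrying the de novo allele also carry a nearby allele inherited from a known parent. The sketch below assumes each spanning read has already been summarized as a pair of booleans; the function name and the choice to return `None` when reads disagree are illustrative assumptions, not details taken from the source.

```python
def read_trace_parent(reads, inherited_allele_parent):
    """Infer the parent of origin from reads spanning a DNM and an inherited variant.

    reads: list of (has_dnm, has_inherited_allele) booleans, one per spanning read.
    inherited_allele_parent: 'father' or 'mother', the one parent carrying the nearby allele.
    Returns 'father', 'mother', or None if reads disagree or none carry the DNM.
    """
    other = "mother" if inherited_allele_parent == "father" else "father"
    votes = set()
    for has_dnm, has_inherited in reads:
        if not has_dnm:
            continue  # only reads carrying the DNM are used to vote in this sketch
        votes.add(inherited_allele_parent if has_inherited else other)
    return votes.pop() if len(votes) == 1 else None
```

For example, if every DNM-carrying read also carries an allele found only in the father, the DNM is assigned to the paternal chromosome.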

We were also able to determine the parent-of-origin for many of the shared germline mosaic variants by leveraging haplotype sharing across three generations (Jónsson et al., 2018). If all third-generation individuals with a post-PGCS DNM shared a haplotype with a particular first-generation grandparent, we assigned that first-generation grandparent’s child (i.e., one of the two second-generation parents) as the parent of origin.

In the second generation, the read tracing and haplotype-sharing phasing strategies were highly concordant, and the parent-of-origin predictions agreed at 98.8% (969/980) of all DNMs for which both strategies could be applied.

Calculating the rate of germline mutation

Given the filters we employed to identify high-confidence de novo mutations, we needed to calculate the fraction of the genome that was considered in our analysis. To this end, we used mosdepth (Pedersen and Quinlan, 2018) to calculate per-base genome coverage in all CEPH/Utah samples, excluding low-complexity regions (Li, 2014) and reads with mapping quality <20 (the minimum mapping quality threshold used by GATK HaplotypeCaller in this analysis). For each second- and third-generation child, we then calculated the number of all genomic positions that had at least 12 aligned sequence reads in the child’s, mother's, and father's genome (excluding the X chromosome). In the second generation, the median number of callable autosomal base pairs per sample was 2,582,336,232. For each individual, we then divided their count of autosomal de novo mutations by the resulting number of base pairs, and divided the result by two to obtain a diploid human mutation rate per base pair per generation. The median second-generation germline SNV mutation rate was calculated to be 1.143 × 10−8 per base pair per generation. We then adjusted this mutation rate based on our estimated false positive rate (FPR) and our estimated ‘missed heterozygote rate’ (MHR; see section entitled ‘Estimating a missed heterozygote rate’) as follows:

adj_mu = mu * (1 - FPR) / (1 - MHR)
adj_mu = 1.143e-8 * (1 - 0.045) / (1 - 0.004)
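The rate calculation and its adjustment can be checked numerically. The sketch below reads the adjustment as mu × (1 − FPR) / (1 − MHR), with FPR = 0.045 and MHR = 0.004 as given above; the function names are illustrative.

```python
def diploid_mutation_rate(n_dnms, callable_bp):
    """DNMs per base pair per generation; the factor of two accounts for the diploid genome."""
    return n_dnms / callable_bp / 2


def adjust_rate(mu, fpr, mhr):
    """Scale a raw rate down for false positives and up for missed heterozygotes."""
    return mu * (1 - fpr) / (1 - mhr)


adj_mu = adjust_rate(1.143e-8, fpr=0.045, mhr=0.004)
print(f"{adj_mu:.3e}")  # prints 1.096e-08
```

Under this reading, the adjusted median second-generation rate is approximately 1.10 × 10−8 per base pair per generation.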

Assessing age effect variability between families

Using the full call set of de novo variants in the third generation (excluding the recurrent, post-PGCS DNMs and likely post-zygotic DNMs) we first fit a simple Poisson regression model that calculated the effect of paternal age on total autosomal DNM counts in the R statistical language (v3.5.1) as follows:

glm(autosomal_dnms ~ dad_age, family = poisson(link='identity'))

This model returned a highly significant effect of paternal age on total DNM counts (1.72 DNMs per year of paternal age, p<2e-16), but was agnostic to the family from which each third-generation individual was ‘sampled.’ Importantly, a number of third-generation individuals in the CEPH/Utah cohort share grandparents, and may therefore be considered members of the same family, despite having unique second-generation parents (Figure 3—figure supplement 1). For all subsequent analysis, we defined a ‘family’ as the unique group of two second-generation parents and their third-generation offspring (Figure 3—figure supplement 1). In the CEPH/Utah cohort, there are a total of 40 ‘families’ meeting this definition.
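For readers unfamiliar with identity-link Poisson regression, the single-predictor model can be fit with a few lines of iteratively reweighted least squares (IRLS). The sketch below is a standard-library illustration of the technique, not a substitute for R's `glm`: it returns point estimates only, runs a fixed number of iterations, and uses a hypothetical `floor` parameter to keep fitted means positive.

```python
def fit_identity_poisson(x, y, n_iter=50, floor=1e-8):
    """Fit y ~ Poisson(b0 + b1 * x) by iteratively reweighted least squares.

    With an identity link, the working response is y itself and the weights are
    1 / mu, so each iteration is a weighted simple linear regression.
    """
    weights = [1.0] * len(x)  # the first pass is ordinary least squares
    b0 = b1 = 0.0
    for _ in range(n_iter):
        sw = sum(weights)
        x_bar = sum(w * xi for w, xi in zip(weights, x)) / sw
        y_bar = sum(w * yi for w, yi in zip(weights, y)) / sw
        sxx = sum(w * (xi - x_bar) ** 2 for w, xi in zip(weights, x))
        sxy = sum(w * (xi - x_bar) * (yi - y_bar)
                  for w, xi, yi in zip(weights, x, y))
        b1 = sxy / sxx
        b0 = y_bar - b1 * x_bar
        # Update the weights from the fitted means; the floor keeps them positive.
        weights = [1.0 / max(b0 + b1 * xi, floor) for xi in x]
    return b0, b1


# Hypothetical toy data: paternal ages and autosomal DNM counts.
ages = [20, 25, 30, 35, 40]
counts = [40, 50, 55, 68, 72]
intercept, slope = fit_identity_poisson(ages, counts)
```

Because the link is the identity, `slope` is interpreted directly as additional DNMs per year of paternal age, which is why the text can report the effect in those units.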

To test for significant variability in paternal age effects between families, we fit the following model:

glm(autosomal_dnms ~ dad_age * family_id,
family = poisson(link='identity'))

Which can also be written in an expanded form as:

glm(autosomal_dnms ~ dad_age + family_id + dad_age:family_id,
family = poisson(link='identity'))

To assess the significance of each term in the fitted model, we performed an analysis of variance (ANOVA) as follows:

m = glm(autosomal_dnms ~ dad_age + family_id + dad_age:family_id, family = poisson(link='identity'))
anova(m, test='Chisq')

The results of this ANOVA are shown in Appendix 1—Table 1. In summary, this model contained the fixed effect of paternal age, as well as different regression intercepts within each ‘grouping factor’ (i.e., family ID). Additionally, this model includes an interaction between paternal age and family ID, allowing for the effect of paternal age (i.e., the slope of the regression) to vary within each grouping factor.

To account for variable sequencing coverage across CEPH/Utah samples, we additionally calculated the callable autosomal fraction for all third-generation individuals by summing the total number of nucleotides covered by >= 12 reads in the third-generation individual and both of their second-generation parents, excluding low-complexity regions and reads with mapping quality <20 (see section entitled ‘Calculating the rate of germline mutation’).

Since we only consider the effect of paternal age on the mutation rate, we can model the mutation rate (mu) as:

mu = Bp * Ap + B0

Where Bp is the paternal age effect, Ap is the paternal age, and B0 is an intercept term.

Therefore, the number of DNMs in a sample is assumed to follow a Poisson distribution, with the expected mean of the distribution defined as:

E(# DNMs) = mu * callable_fraction
E(# DNMs) = (Bp * Ap + B0) * callable_fraction
E(# DNMs) = (Bp * Ap * callable_fraction) + (callable_fraction * B0)

As our analysis only considers the effect of paternal age on total DNM counts, we can thus scale Ap (paternal age at birth) by the callable_fraction, generating a term called dad_age_scaled, and fit the following model, which takes each sample’s callable fraction into account:

glm(autosomal_dnms ~ dad_age_scaled + autosomal_callable_fraction + 0, family = poisson(link='identity'))

Then, we can determine whether inter-family differences remain significant by comparing the above null model to a model that takes family into account:

glm(autosomal_dnms ~ dad_age_scaled * family_id + autosomal_callable_fraction + 0, family = poisson(link='identity'))

After running an ANOVA to compare the two models, we find that the model incorporating family ID is a significantly better fit (ANOVA: p=9.359e-10).
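The rescaling of paternal age by the callable fraction, described above, can be verified with a small numerical check. The parameter values below are hypothetical and chosen only for illustration; the point is that a no-intercept model on the two scaled predictors reproduces E(# DNMs) = (Bp * Ap + B0) * callable_fraction.

```python
def expected_dnms(b_p, b_0, dad_age, callable_fraction):
    """E(# DNMs) = (Bp * Ap + B0) * callable_fraction."""
    return (b_p * dad_age + b_0) * callable_fraction


def design_row(dad_age, callable_fraction):
    """Predictors for the no-intercept model: (dad_age_scaled, autosomal_callable_fraction)."""
    return dad_age * callable_fraction, callable_fraction


# Hypothetical values for illustration only.
b_p, b_0 = 1.7, 10.0
dad_age_scaled, callable_fraction = design_row(dad_age=30, callable_fraction=0.9)

# Linear predictor of the no-intercept model: the coefficient on
# callable_fraction plays the role of the intercept B0.
linear_predictor = b_p * dad_age_scaled + b_0 * callable_fraction
direct = expected_dnms(b_p, b_0, dad_age=30, callable_fraction=0.9)
```

Here `linear_predictor` and `direct` agree (up to floating-point rounding), which is why the `+ 0` term is needed: the intercept is absorbed into the coefficient on the callable fraction.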

We previously identified significant effects of both maternal and paternal age on DNM counts (Figure 2a). Therefore, to account for the non-negligible effect of maternal age on DNM counts, we fit a final model that incorporated the effects of both maternal and paternal age, as well as family ID, on total DNM counts as follows:

glm(autosomal_dnms ~ dad_age + mom_age + family_id, family = poisson(link='identity'))

We then performed an ANOVA on the model, and found that a model incorporating a family term is a significantly better fit than a model that includes the effects of paternal and maternal age alone (p=2.12e-5).

Identifying post-PGCS mosaic mutations

To identify post-PGCS mosaic variants, we searched the previously generated callset of single-nucleotide DNMs in the third generation (‘Identifying DNM candidates’) for de novo single-nucleotide mutations that appeared in two or more third-generation siblings. As a result, all filters applied to the germline third-generation DNM callset were also applied to the post-PGCS mosaic variants. We validated all putative post-PGCS mosaic variants by visual inspection using the Integrative Genomics Viewer (IGV) (Thorvaldsdóttir et al., 2013). In a small number of cases (32), we found evidence for the post-PGCS mosaic variant in one of the two second-generation parents. Reads supporting the post-PGCS mosaic variant were likely filtered from the joint-called CEPH/Utah VCF output following local re-assembly with GATK, though they are clearly present in the raw BAM alignment files. We removed these 32 variants, at which a second-generation parent possessed two or more reads of support for the mosaic DNM allele in the aligned sequencing reads.
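The sibling-sharing search and the parental-read filter can be sketched as follows. The data structures and names are hypothetical; in particular, the per-site parental read counts are assumed to have been tabulated from the raw alignments beforehand.

```python
from collections import defaultdict


def find_shared_dnms(dnms_by_child, parent_alt_reads, max_parent_reads=1):
    """Return DNM sites seen in two or more siblings and lacking parental read support.

    dnms_by_child: dict mapping a sibling ID to a set of DNM keys, e.g. (chrom, pos, ref, alt).
    parent_alt_reads: dict mapping a DNM key to the largest number of supporting reads
        seen in either parent's alignments (missing keys count as zero).
    """
    carriers = defaultdict(set)
    for child, dnms in dnms_by_child.items():
        for key in dnms:
            carriers[key].add(child)
    shared = {}
    for key, kids in carriers.items():
        if len(kids) < 2:
            continue  # private to one sibling: not a shared mosaic candidate
        if parent_alt_reads.get(key, 0) > max_parent_reads:
            continue  # two or more parental reads: removed, as described above
        shared[key] = kids
    return shared
```

The default `max_parent_reads=1` mirrors the rule of removing sites at which a parent has two or more supporting reads.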

Assessing age effects on post-PGCS DNMs

To identify a paternal age effect on the number of post-PGCS DNMs transmitted to third-generation children, we tabulated the number of each third-generation individual’s DNMs that was shared with at least one of their siblings. We then fit a Poisson regression as follows, regressing the number of mosaic DNMs in each third-generation individual against their father’s age at birth:

glm(mosaic_number ~ dad_age, family = poisson(link='identity'))

We did not find a significant effect of paternal age (p=0.647).

Using the predicted paternal age effects on germline DNM counts and post-PGCS DNM counts, we determined that the fraction of post-PGCS DNMs should decrease non-linearly with paternal age (Figure 4e). Therefore, to assess the effect of paternal age on the fraction of each third-generation individual’s DNMs that occurred post-PGCS in a parent, we fit the following model:

lm(log(mosaic_fraction) ~ dad_age)

We found a significant effect of paternal age on the post-PGCS mosaic fraction (p=1.61e-5).
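The non-linear decline follows from the two regressions: if the number of shared mosaic DNMs is roughly constant with paternal age while the number of germline DNMs grows linearly, their ratio must fall along a curve rather than a line. The sketch below illustrates this with hypothetical parameter values; only the slope of 1.72 DNMs per year is taken from the text, and the mosaic count and intercept are assumptions.

```python
def mosaic_fraction(dad_age, mosaic_count, intercept, slope):
    """Fraction of DNMs that are shared mosaics, given a linear germline age effect."""
    germline = intercept + slope * dad_age
    return mosaic_count / (mosaic_count + germline)


# Hypothetical mosaic count and intercept; slope as reported for total DNM counts.
params = dict(mosaic_count=2.0, intercept=10.0, slope=1.72)
fractions = {age: mosaic_fraction(age, **params) for age in (20, 30, 40)}

# The drop from age 20 to 30 is larger than the drop from age 30 to 40.
first_drop = fractions[20] - fractions[30]
second_drop = fractions[30] - fractions[40]
```

Under these assumptions the fraction decreases with age and the decrease flattens, which motivates modelling the logarithm of the fraction rather than the fraction itself.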

As we may be more likely to identify shared, post-PGCS DNMs in families with larger numbers of third-generation siblings, we additionally tested whether the fraction of post-PGCS DNMs in each child was dependent on the number of their siblings in the family by performing a correlation test as follows:

cor.test(mosaic_fraction, n_siblings)

We did not observe a significant correlation between a third-generation individual’s number of siblings and the fraction of their DNMs that was shared with a sibling (p=0.882). We also did not observe a significant correlation between a third-generation individual’s number of siblings and the total number of their DNMs shared with a sibling (p=0.426).

Identifying gonosomal mutations

To identify variants that occurred early in post-zygotic development, we identified de novo single-nucleotide variants in the second generation using the same genotype quality and population-based filters as described previously (‘Identifying DNM candidates’). Then, to distinguish single-gamete germline de novo mutations from post-zygotic DNMs (de novo mutations that occurred in the cell divisions following fertilization of the second-generation individual’s embryo), we employed a previously described method (Harland et al., 2017; Feusier et al., 2018; Jónsson et al., 2018) that relies on linkage between DNMs and informative heterozygous alleles nearby. In this approach, which is similar in principle to the strategy used for phasing germline second-generation DNMs, we first search ±200 kbp up- and down-stream of the de novo allele in the second-generation individual for ‘informative’ alleles; that is, alleles that are present in only one first-generation parent, and inherited by the second-generation child (Figure 5a). Then, we identify all of the third-generation grandchildren that inherited the informative alleles. If all of the third-generation individuals that inherited the informative alleles also inherited the DNM, we infer that the DNM occurred in the germline of the first-generation parent with the informative allele. However, if one or more third-generation individuals inherited the informative alleles but did not inherit the DNM, we can infer that the DNM occurred sometime following the fertilization of the second-generation sample’s embryo. This is because the DNM is not always present on the background haplotype that the second-generation individual inherited from their informative first-generation parent. Using this approach, we do not apply any allele balance filters to putative gonosomal DNMs in the second generation, instead relying on linkage to distinguish them from germline DNMs. 
As with germline de novo mutations observed in the second generation, to limit our identification of false-positive events, we required third-generation individuals with inherited DNMs to have a depth >= 12 reads at the site and a Phred-scaled genotype quality (GQ) >= 20, and required the median allele balance across transmissions to be >= 0.3.
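The linkage rule and the transmission filters above can be sketched as follows. This is a simplified illustration rather than the published implementation: the function names, the input layouts, and the "uninformative" category (used when no grandchild carries both the informative alleles and the DNM) are assumptions introduced here.

```python
from statistics import median


def passes_transmission_filters(carriers, min_depth=12, min_gq=20, min_median_ab=0.3):
    """Apply the depth, GQ, and median allele-balance thresholds from the text.

    carriers: list of dicts with 'depth', 'gq', and 'allele_balance', one per
        third-generation individual that inherited the DNM (hypothetical layout).
    """
    if not carriers:
        return False
    if any(c["depth"] < min_depth or c["gq"] < min_gq for c in carriers):
        return False
    return median(c["allele_balance"] for c in carriers) >= min_median_ab


def classify_by_linkage(grandchildren):
    """Classify a second-generation DNM from third-generation transmissions.

    grandchildren: list of (has_informative_alleles, has_dnm) boolean pairs.
    """
    on_haplotype = [has_dnm for has_informative, has_dnm in grandchildren if has_informative]
    if not any(on_haplotype):
        # Assumption: without a grandchild carrying both, linkage is untestable.
        return "uninformative"
    if all(on_haplotype):
        return "germline"  # DNM always travels with the informative haplotype
    return "post-zygotic"  # haplotype was transmitted without the DNM


germline_call = classify_by_linkage([(True, True), (True, True), (False, False)])
mosaic_call = classify_by_linkage([(True, True), (True, False), (False, False)])
```

A single grandchild who carries the informative alleles without the DNM is enough to move a variant into the post-zygotic class, which is why large sibships make this approach more powerful.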

Additionally, we can use an orthogonal method to distinguish single-gamete germline DNMs from post-zygotic DNMs. In this second approach, we identify all heterozygous sites within ±500 base pairs (approximately one read length) of a DNM in a child. Then, by assessing the linkage of the DNM and heterozygous alleles, we look for evidence of three distinct haplotypes in the child (Figure 5—figure supplement 1). If we observe at least two reads supporting a third haplotype (i.e., reads that indicate incomplete linkage between the DNM and the informative heterozygous allele), we infer that the DNM occurred post-zygotically in the child. We applied this method to all putative germline DNMs identified in the third generation, and discovered that 319 apparent germline DNMs showed evidence of being post-zygotic mutations that occurred following the fertilization of the third-generation embryo. We removed these DNMs from all analyses of third-generation germline DNMs.
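A read-level version of this three-haplotype test might look like the sketch below. It assumes that each read spanning both sites has already been reduced to a pair of observed alleles, and it requires every one of the three haplotypes to have at least two supporting reads, which is a slightly stricter simplification of the rule in the text. The function name and the toy read counts are hypothetical.

```python
from collections import Counter


def has_third_haplotype(read_alleles, min_reads=2):
    """Return True if reads support three distinct haplotypes.

    read_alleles: list of (dnm_site_allele, het_site_allele) pairs, one per
        read that overlaps both the DNM and the nearby heterozygous site.
    """
    counts = sorted(Counter(read_alleles).values(), reverse=True)
    # A germline DNM gives two haplotypes; a third one with enough read
    # support indicates incomplete linkage, i.e. a post-zygotic event.
    return len(counts) >= 3 and counts[2] >= min_reads


# Toy pileups: the DNM allele 'T' sits on the haplotype carrying 'A'.
mosaic_reads = [("T", "A")] * 8 + [("C", "A")] * 3 + [("C", "G")] * 10
germline_reads = [("T", "A")] * 9 + [("C", "G")] * 10 + [("C", "A")] * 1
mosaic_flag = has_third_haplotype(mosaic_reads)
germline_flag = has_third_haplotype(germline_reads)
```

In the second pileup, a single discordant read is treated as a likely sequencing error rather than as evidence for a third haplotype.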

We validated all putative gonosomal variants in the second generation by visual inspection using the Integrative Genomics Viewer (IGV) (Thorvaldsdóttir et al., 2013).

Estimating a ‘missed heterozygote rate’ for DNM detection

Infrequently, variant calling methods such as GATK may incorrectly assign genotypes to samples at particular sites in the genome. When identifying de novo variants, we require that children possess genotyped alleles that are absent from either parent; thus, genotyping errors in parents could lead us to assign variants as being de novo, when in fact one or both parents possessed the variant and transmitted the allele. Given the multi-generational structure of our study cohort, we were able to estimate the rate at which our variant calling and filtering pipeline mis-genotyped a second-generation parent as being homozygous for a reference allele. To estimate this ‘missed heterozygote’ rate in our dataset, we looked for any cases in which one or more third-generation individuals possessed a putative de novo variant (i.e., possessed an allele absent from both second-generation parents). Then, we looked at the sample’s grandparental (first-generation) genotypes for evidence of the same variant. If one or more grandparents were genotyped as having high-quality evidence for the de novo allele (depth >= 12 and Phred-scaled genotype quality >= 20), we inferred that the variant could have been ‘missed’ in the second generation, despite being truly inherited. We estimate the missed heterozygote rate (MHR) to be 0.4%, by dividing the total number of third-generation DNMs with grandparental support by the total number of third-generation DNMs (100/25,075). In a small number of CEPH/Utah pedigrees, some members of the first (grandparental) generation were not sequenced (6 grandparents in five families, Supplementary file 1). As a result, these families are underpowered to detect evidence of third-generation DNM alleles in the first generation, and our MHR is likely a slight underestimate.
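The MHR calculation reduces to counting DNMs with high-quality grandparental support and dividing by the total. The sketch below shows that arithmetic, including the reported 100/25,075. The function name and the per-grandparent record layout are hypothetical, and the toy input is not study data.

```python
def missed_heterozygote_rate(dnm_grandparent_calls, min_depth=12, min_gq=20):
    """Fraction of DNMs with high-quality support in any grandparent.

    dnm_grandparent_calls: one list per third-generation DNM, each holding
        dicts with 'has_allele', 'depth', and 'gq' for the sequenced
        grandparents (hypothetical layout).
    """
    def supported(calls):
        return any(
            c["has_allele"] and c["depth"] >= min_depth and c["gq"] >= min_gq
            for c in calls
        )

    n_supported = sum(1 for calls in dnm_grandparent_calls if supported(calls))
    return n_supported / len(dnm_grandparent_calls)


# Toy input: the second DNM has a low-depth call that does not count.
toy_calls = [
    [{"has_allele": True, "depth": 30, "gq": 99}],
    [{"has_allele": True, "depth": 8, "gq": 99}],
    [{"has_allele": False, "depth": 35, "gq": 99}],
    [{"has_allele": False, "depth": 28, "gq": 80}],
]
toy_rate = missed_heterozygote_rate(toy_calls)

# Counts reported in the text: 100 supported DNMs out of 25,075.
reported_mhr_percent = 100 * 100 / 25075
```

DNMs from families with unsequenced grandparents contribute fewer calls to check, which is the source of the slight underestimate noted above.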

Estimating a false positive rate for de novo mutation detection

In a separate set of sequencing runs, a total of eight first-generation grandparents were re-sequenced to a greater genome-wide median depth of 60X (Figure 1—figure supplement 1d). However, when variant calling and joint genotyping were performed on all 603 CEPH/Utah samples, the 30X data for these grandparents were used. Therefore, we sought to estimate the false positive rate for our de novo mutation detection strategy using the de novo mutation calls in the children of these eight first-generation individuals. For each of the children (second-generation) of these high-coverage first-generation individuals, we looked for evidence of the second-generation DNMs in the 60X alignments from their parents. Specifically, for each second-generation DNM, we counted the number of reads supporting the DNM allele in each of the first-generation parents, excluding reads with mapping quality < 20 (the minimum mapping quality imposed by GATK HaplotypeCaller in our analysis), and excluding bases with base qualities < 20 (the minimum base quality imposed by GATK HaplotypeCaller in our analysis). If we observed two or more reads supporting the second-generation DNM in a first-generation parent’s 60X alignments, we considered the second-generation DNM to be a false positive. Of the 202 de novo mutations called in the four second-generation children of the high-coverage first-generation parents, we found nine mutations with at least two reads of supporting evidence in the 60X first-generation alignments. Thus, we estimate our false positive rate for de novo mutation detection to be approximately 4.5% (9/202).
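The read-counting rule used for this estimate can be sketched as below. The sketch assumes that the reads overlapping a DNM site in each parent have already been extracted into simple records. The function names and the record layout are hypothetical, and only the thresholds and the 9/202 counts come from the text.

```python
def count_supporting_reads(reads, dnm_allele, min_mapq=20, min_baseq=20):
    """Count reads carrying the DNM allele that pass both quality cutoffs.

    reads: list of dicts with 'base', 'mapq', and 'baseq' for one parent at
        the DNM position (hypothetical layout).
    """
    return sum(
        1
        for r in reads
        if r["base"] == dnm_allele and r["mapq"] >= min_mapq and r["baseq"] >= min_baseq
    )


def is_false_positive(reads_per_parent, dnm_allele, min_support=2):
    """Flag a DNM if either parent has at least min_support supporting reads."""
    return any(
        count_supporting_reads(reads, dnm_allele) >= min_support
        for reads in reads_per_parent
    )


# Toy 60X pileups: the mother has two passing reads with the 'T' allele,
# plus one low-mapping-quality read that is ignored.
father_reads = [{"base": "C", "mapq": 60, "baseq": 35}] * 58
mother_reads = (
    [{"base": "C", "mapq": 60, "baseq": 35}] * 55
    + [{"base": "T", "mapq": 60, "baseq": 30}] * 2
    + [{"base": "T", "mapq": 5, "baseq": 30}]
)
flagged = is_false_positive([father_reads, mother_reads], "T")

# Counts reported in the text: 9 flagged DNMs out of 202.
reported_fpr_percent = 100 * 9 / 202
```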

Data and code availability

Code used for statistical analysis and figure generation has been deposited on GitHub as a collection of annotated Jupyter Notebooks: https://github.com/quinlan-lab/ceph-dnm-manuscript (Sasani, 2019; copy archived at https://github.com/elifesciences-publications/ceph-dnm-manuscript/blob/master/README.md). Data files containing high-confidence de novo mutations, as well as the gonosomal and post-primordial germ cell specification (PGCS) mosaic mutations, are included with these Notebooks. To mitigate compatibility issues, we have also made all notebooks available in a Binder environment, accessible at the above GitHub repository (Sasani, 2019).

Acknowledgements

We thank all of the Utah individuals who participated in the CEPH consortium. We also thank Ray White, Jean-Marc Lalouel, and Mark Leppert, who were instrumental in the ascertainment of the CEPH/Utah pedigrees. We additionally thank Chad Harland and Julie Feusier for assisting our detection of post-zygotic mosaicism and Andrew Farrell for assistance with interpreting DNM calls. Finally, we thank Tim Formosa, Richard Cawthon, Amelia Wallace and many other members of the Quinlan and Jorde laboratories for insightful discussion related to the manuscript.

Appendix 1

Supplementary Information

Appendix 1—table 1. Results of ANOVA on fitted ‘family-aware’ model.

Term (independent variable) | DoF | Deviance | Resid. DoF | Resid. Deviance | Pr(>Chi)
dad_age | 1 | 635.77 | 348 | 502.84 | < 2.2e-16
family_id | 39 | 103.43 | 309 | 399.41 | 9.667e-9
dad_age:family_id | 39 | 55.34 | 270 | 344.07 | 0.04328

Funding Statement

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Contributor Information

Thomas A Sasani, Email: tom.sasani@utah.edu.

Lynn B Jorde, Email: lbj@genetics.utah.edu.

Aaron R Quinlan, Email: aquinlan@genetics.utah.edu.

Amy L Williams, Cornell University, United States.

Mark I McCarthy, University of Oxford, United Kingdom.

Funding Information

This paper was supported by the following grants:

  • National Institute of General Medical Sciences T32GM007464 to Thomas A Sasani.

  • National Human Genome Research Institute R01HG006693 to Aaron R Quinlan.

  • National Human Genome Research Institute R01HG009141 to Aaron R Quinlan.

  • National Institute of General Medical Sciences R01GM124355 to Aaron R Quinlan.

  • National Institute of General Medical Sciences R35GM118335 to Lynn Jorde.

  • National Institute of General Medical Sciences R01GM122975 to Molly Przeworski.

Additional information

Competing interests

Reviewing editor, eLife.

No competing interests declared.

Author contributions

Data curation, Software, Formal analysis, Investigation, Visualization, Methodology, Writing—original draft.

Software, Formal analysis, Investigation, Methodology, Writing—review and editing.

Formal analysis, Methodology, Writing—review and editing.

Resources, Data curation.

Formal analysis, Methodology, Writing—review and editing.

Conceptualization, Resources, Supervision, Funding acquisition, Project administration, Writing—review and editing.

Conceptualization, Supervision, Funding acquisition, Writing—original draft, Project administration, Writing—review and editing.

Ethics

Human subjects: Informed consent was obtained from the CEPH/Utah individuals, and the University of Utah Institutional Review Board approved the study (University of Utah IRB reference #80145).

Additional files

Supplementary file 1. Pedigree structures for all CEPH/Utah families.

All family and sample IDs have been anonymized, and the sexes of third-generation individuals have been hidden.

elife-46922-supp1.zip (177KB, zip)
DOI: 10.7554/eLife.46922.015
Supplementary file 2. IGV images of 100 randomly selected germline DNMs identified in the second generation.

In each image, the first two tracks contain alignments from the first-generation parents, and the third track contains the alignments for the second-generation child. Reads with mapping quality <20 are not included, as they were not considered by our variant calling pipeline, and mismatched bases are shaded by quality score (more transparent = lower base quality).

DOI: 10.7554/eLife.46922.016
Supplementary file 3. IGV images of 100 randomly selected germline DNMs identified in the third generation.

In each image, the first two tracks contain alignments from the second-generation parents, and the third track contains the alignments for the third-generation child. Reads with mapping quality <20 are filtered out, as they were not considered by our variant calling pipeline, and mismatched bases are shaded by quality score (more transparent = lower base quality).

DOI: 10.7554/eLife.46922.017
Supplementary file 4. IGV images of all putative post-PGCS mosaic mutations.

In each image, the first two tracks contain alignments from the two second-generation parents in the pedigree. All tracks below contain alignments from the third-generation children that share a DNM at the site. Reads with mapping quality <20 are filtered out, as they were not considered by our variant calling pipeline, and mismatched bases are shaded by quality score (more transparent = lower base quality).

elife-46922-supp4.zip (7.2MB, zip)
DOI: 10.7554/eLife.46922.018
Supplementary file 5. IGV images of all putative gonosomal mutations identified in the second generation.

In each image, the first two, three, or four tracks contain alignments from the grandparents in the pedigree (i.e., paternal grandmother and grandfather, maternal grandmother and grandfather). In some families, one or two of the first-generation grandparents were not sequenced (see Supplementary file 1). The two tracks below contain alignments from the second-generation individual with the putative gonosomal mutation and that second-generation individual’s spouse. The remaining tracks below contain alignments from the third-generation individuals that inherited the gonosomal mutation. Reads with mapping quality <20 are filtered out, as they were not considered by our variant calling pipeline, and mismatched bases are shaded by quality score (more transparent = lower base quality).

elife-46922-supp5.zip (16.6MB, zip)
DOI: 10.7554/eLife.46922.019
Transparent reporting form
DOI: 10.7554/eLife.46922.020

Data availability

Code used for statistical analysis and figure generation has been deposited on GitHub as a collection of annotated Jupyter Notebooks: https://github.com/quinlan-lab/ceph-dnm-manuscript (copy archived at https://github.com/elifesciences-publications/ceph-dnm-manuscript). Data files containing high-confidence de novo mutations, as well as the gonosomal and post-primordial germ cell specification (PGCS) mosaic mutations, are included with these Notebooks. To mitigate compatibility issues, we have also made all notebooks available in a Binder environment, accessible at the above GitHub repository. Aligned sequencing reads (in CRAM format) and variant calls (in VCF format) will be made available at the SRA and dbGaP under controlled access, with accession phs001872.v1.p1.

The following dataset was generated:

Sasani TA, Pedersen BS, Gao Z, Baird L, Przeworski M, Jorde LB, Quinlan AR. 2019. Genome sequencing of large, multigenerational CEPH/Utah families. NCBI dbGaP. phs001872.v1.p1

References

  1. 1000 Genomes Project Consortium. Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, Abecasis GR. A global reference for human genetic variation. Nature. 2015;526:68–74. doi: 10.1038/nature15393.
  2. Acuna-Hidalgo R, Bo T, Kwint MP, van de Vorst M, Pinelli M, Veltman JA, Hoischen A, Vissers LE, Gilissen C. Post-zygotic point mutations are an underrecognized source of de novo genomic variation. The American Journal of Human Genetics. 2015;97:67–74. doi: 10.1016/j.ajhg.2015.05.008.
  3. Acuna-Hidalgo R, Veltman JA, Hoischen A. New insights into the generation and role of de novo mutations in health and disease. Genome Biology. 2016;17:241. doi: 10.1186/s13059-016-1110-1.
  4. Agarwal I, Przeworski M. Signatures of replication, recombination and sex in the spectrum of rare variants on the human X chromosome and autosomes. bioRxiv. 2019. doi: 10.1101/519421.
  5. Besenbacher S, Liu S, Izarzugaza JM, Grove J, Belling K, Bork-Jensen J, Huang S, Als TD, Li S, Yadav R, Rubio-García A, Lescai F, Demontis D, Rao J, Ye W, Mailund T, Friborg RM, Pedersen CN, Xu R, Sun J, Liu H, Wang O, Cheng X, Flores D, Rydza E, Rapacki K, Damm Sørensen J, Chmura P, Westergaard D, Dworzynski P, Sørensen TI, Lund O, Hansen T, Xu X, Li N, Bolund L, Pedersen O, Eiberg H, Krogh A, Børglum AD, Brunak S, Kristiansen K, Schierup MH, Wang J, Gupta R, Villesen P, Rasmussen S. Novel variation and de novo mutation rates in population-wide de novo assembled Danish trios. Nature Communications. 2015;6:5969. doi: 10.1038/ncomms6969.
  6. Besenbacher S, Sulem P, Helgason A, Helgason H, Kristjansson H, Jonasdottir A, Jonasdottir A, Magnusson OT, Thorsteinsdottir U, Masson G, Kong A, Gudbjartsson DF, Stefansson K. Multi-nucleotide de novo mutations in humans. PLOS Genetics. 2016;12:e1006315. doi: 10.1371/journal.pgen.1006315.
  7. Biesecker LG, Spinner NB. A genomic view of mosaicism and human disease. Nature Reviews Genetics. 2013;14:307–320. doi: 10.1038/nrg3424.
  8. Campbell IM, Stewart JR, James RA, Lupski JR, Stankiewicz P, Olofsson P, Shaw CA. Parent of origin, mosaicism, and recurrence risk: probabilistic modeling explains the broken symmetry of transmission genetics. The American Journal of Human Genetics. 2014a;95:345–359. doi: 10.1016/j.ajhg.2014.08.010.
  9. Campbell IM, Yuan B, Robberecht C, Pfundt R, Szafranski P, McEntagart ME, Nagamani SC, Erez A, Bartnik M, Wiśniowiecka-Kowalnik B, Plunkett KS, Pursley AN, Kang SH, Bi W, Lalani SR, Bacino CA, Vast M, Marks K, Patton M, Olofsson P, Patel A, Veltman JA, Cheung SW, Shaw CA, Vissers LE, Vermeesch JR, Lupski JR, Stankiewicz P. Parental somatic mosaicism is underrecognized and influences recurrence risk of genomic disorders. The American Journal of Human Genetics. 2014b;95:173–182. doi: 10.1016/j.ajhg.2014.07.003.
  10. Campbell IM, Shaw CA, Stankiewicz P, Lupski JR. Somatic mosaicism: implications for disease and transmission genetics. Trends in Genetics. 2015;31:382–392. doi: 10.1016/j.tig.2015.03.013.
  11. Crow JF. The high spontaneous mutation rate: is it a health risk? PNAS. 1997;94:8380–8386. doi: 10.1073/pnas.94.16.8380.
  12. Crow JF. The origins, patterns and implications of human spontaneous mutation. Nature Reviews Genetics. 2000;1:40–47. doi: 10.1038/35049558.
  13. Dausset J, Cann H, Cohen D, Lathrop M, Lalouel JM, White R. Centre D'etude Du Polymorphisme humain (CEPH): collaborative genetic mapping of the human genome. Genomics. 1990;6:575–577. doi: 10.1016/0888-7543(90)90491-C.
  14. Deciphering Developmental Disorders Study. Prevalence and architecture of de novo mutations in developmental disorders. Nature. 2017;542:433–438. doi: 10.1038/nature21062.
  15. DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell TJ, Kernytsky AM, Sivachenko AY, Cibulskis K, Gabriel SB, Altshuler D, Daly MJ. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics. 2011;43:491–498. doi: 10.1038/ng.806.
  16. Faust GG, Hall IM. SAMBLASTER: fast duplicate marking and structural variant read extraction. Bioinformatics. 2014;30:2503–2505. doi: 10.1093/bioinformatics/btu314.
  17. Feusier J, Scott Watkins W, Thomas J, Farrell A, Witherspoon DJ, Baird L, Ha H, Xing J, Jorde LB. Pedigree-Based estimation of human mobile element retrotransposition rates. bioRxiv. 2018. doi: 10.1101/506691.
  18. Forsberg LA, Gisselsson D, Dumanski JP. Mosaicism in health and disease - clones picking up speed. Nature Reviews Genetics. 2017;18:128–142. doi: 10.1038/nrg.2016.145.
  19. Francioli LC, Polak PP, Koren A, Menelaou A, Chun S, Renkens I, van Duijn CM, Swertz M, Wijmenga C, van Ommen G, Slagboom PE, Boomsma DI, Ye K, Guryev V, Arndt PF, Kloosterman WP, de Bakker PIW, Sunyaev SR, Genome of the Netherlands Consortium. Genome-wide patterns and properties of de novo mutations in humans. Nature Genetics. 2015;47:822–826. doi: 10.1038/ng.3292.
  20. Fryxell KJ, Zuckerkandl E. Cytosine deamination plays a primary role in the evolution of mammalian isochores. Molecular Biology and Evolution. 2000;17:1371–1383. doi: 10.1093/oxfordjournals.molbev.a026420.
  21. Gao Z, Moorjani P, Sasani TA, Pedersen BS, Quinlan AR, Jorde LB, Amster G, Przeworski M. Overlooked roles of DNA damage and maternal age in generating human germline mutations. PNAS. 2019;116:9491–9500. doi: 10.1073/pnas.1901259116.
  22. Goldmann JM, Wong WS, Pinelli M, Farrah T, Bodian D, Stittrich AB, Glusman G, Vissers LE, Hoischen A, Roach JC, Vockley JG, Veltman JA, Solomon BD, Gilissen C, Niederhuber JE. Parent-of-origin-specific signatures of de novo mutations. Nature Genetics. 2016;48:935–939. doi: 10.1038/ng.3597.
  23. Haldane JBS. The rate of spontaneous mutation of a human gene. Journal of Genetics. 1935;31:317–326. doi: 10.1007/BF02982403.
  24. Harland C, Charlier C, Karim L, Cambisano N, Deckers M, Mni M, Mullaart E, Coppieters W, Georges M. Frequency of mosaicism points towards Mutation-Prone early cleavage cell divisions in cattle. bioRxiv. 2017. doi: 10.1101/079863.
  25. Harris K, Pritchard JK. Rapid evolution of the human mutation spectrum. eLife. 2017;6:e24284. doi: 10.7554/eLife.24284.
  26. International HapMap Consortium. The international HapMap project. Nature. 2003;426:789–796. doi: 10.1038/nature02168.
  27. Jónsson H, Sulem P, Kehr B, Kristmundsdottir S, Zink F, Hjartarson E, Hardarson MT, Hjorleifsson KE, Eggertsson HP, Gudjonsson SA, Ward LD, Arnadottir GA, Helgason EA, Helgason H, Gylfason A, Jonasdottir A, Jonasdottir A, Rafnar T, Frigge M, Stacey SN, Th Magnusson O, Thorsteinsdottir U, Masson G, Kong A, Halldorsson BV, Helgason A, Gudbjartsson DF, Stefansson K. Parental influence on human germline de novo mutations in 1,548 trios from Iceland. Nature. 2017;549:519–522. doi: 10.1038/nature24018.
  28. Jónsson H, Sulem P, Arnadottir GA, Pálsson G, Eggertsson HP, Kristmundsdottir S, Zink F, Kehr B, Hjorleifsson KE, Jensson BÖ, Jonsdottir I, Marelsson SE, Gudjonsson SA, Gylfason A, Jonasdottir A, Jonasdottir A, Stacey SN, Magnusson OT, Thorsteinsdottir U, Masson G, Kong A, Halldorsson BV, Helgason A, Gudbjartsson DF, Stefansson K. Multiple transmissions of de novo mutations in families. Nature Genetics. 2018;50:1674–1680. doi: 10.1038/s41588-018-0259-9.
  29. Kong A, Frigge ML, Masson G, Besenbacher S, Sulem P, Magnusson G, Gudjonsson SA, Sigurdsson A, Jonasdottir A, Jonasdottir A, Wong WS, Sigurdsson G, Walters GB, Steinberg S, Helgason H, Thorleifsson G, Gudbjartsson DF, Helgason A, Magnusson OT, Thorsteinsdottir U, Stefansson K. Rate of de novo mutations and the importance of father's age to disease risk. Nature. 2012;488:471–475. doi: 10.1038/nature11396.
  30. Krupp DR, Barnard RA, Duffourd Y, Evans SA, Mulqueen RM, Bernier R, Rivière JB, Fombonne E, O'Roak BJ. Exonic mosaic mutations contribute risk for autism spectrum disorder. The American Journal of Human Genetics. 2017;101:369–390. doi: 10.1016/j.ajhg.2017.07.016.
  31. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, Stange-Thomann Y, Stojanovic N, Subramanian A, Wyman D, Rogers J, Sulston J, Ainscough R, Beck S, Bentley D, Burton J, Clee C, Carter N, Coulson A, Deadman R, Deloukas P, Dunham A, Dunham I, Durbin R, French L, Grafham D, Gregory S, Hubbard T, Humphray S, Hunt A, Jones M, Lloyd C, McMurray A, Matthews L, Mercer S, Milne S, Mullikin JC, Mungall A, Plumb R, Ross M, Shownkeen R, Sims S, Waterston RH, Wilson RK, Hillier LW, McPherson JD, Marra MA, Mardis ER, Fulton LA, Chinwalla AT, Pepin KH, Gish WR, Chissoe SL, Wendl MC, Delehaunty KD, Miner TL, Delehaunty A, Kramer JB, Cook LL, Fulton RS, Johnson DL, Minx PJ, Clifton SW, Hawkins T, Branscomb E, Predki P, Richardson P, Wenning S, Slezak T, Doggett N, Cheng JF, Olsen A, Lucas S, Elkin C, Uberbacher E, Frazier M, Gibbs RA, Muzny DM, Scherer SE, Bouck JB, Sodergren EJ, Worley KC, Rives CM, Gorrell JH, Metzker ML, Naylor SL, Kucherlapati RS, Nelson DL, Weinstock GM, Sakaki Y, Fujiyama A, Hattori M, Yada T, Toyoda A, Itoh T, Kawagoe C, Watanabe H, Totoki Y, Taylor T, Weissenbach J, Heilig R, Saurin W, Artiguenave F, Brottier P, Bruls T, Pelletier E, Robert C, Wincker P, Smith DR, Doucette-Stamm L, Rubenfield M, Weinstock K, Lee HM, Dubois J, Rosenthal A, Platzer M, Nyakatura G, Taudien S, Rump A, Yang H, Yu J, Wang J, Huang G, Gu J, Hood L, Rowen L, Madan A, Qin S, Davis RW, Federspiel NA, Abola AP, Proctor MJ, Myers RM, Schmutz J, Dickson M, Grimwood J, Cox DR, Olson MV, Kaul R, Raymond C, Shimizu N, Kawasaki K, Minoshima S, Evans GA, Athanasiou M, Schultz R, Roe BA, Chen F, Pan H, Ramser J, Lehrach H, Reinhardt R, McCombie WR, de la Bastide M, Dedhia N, Blöcker H, Hornischer K, Nordsiek G, Agarwala R, Aravind L, Bailey JA, Bateman A, Batzoglou S, Birney E, Bork P, Brown DG, Burge CB, Cerutti L, Chen HC, Church D, Clamp M, Copley RR, Doerks T, Eddy SR, Eichler EE, Furey TS, Galagan J, Gilbert JG, Harmon C, Hayashizaki Y, Haussler D, Hermjakob H, Hokamp K, Jang W, Johnson LS, Jones TA, Kasif S, Kaspryzk A, Kennedy S, Kent WJ, Kitts P, Koonin EV, Korf I, Kulp D, Lancet D, Lowe TM, McLysaght A, Mikkelsen T, Moran JV, Mulder N, Pollara VJ, Ponting CP, Schuler G, Schultz J, Slater G, Smit AF, Stupka E, Szustakowki J, Thierry-Mieg D, Thierry-Mieg J, Wagner L, Wallis J, Wheeler R, Williams A, Wolf YI, Wolfe KH, Yang SP, Yeh RF, Collins F, Guyer MS, Peterson J, Felsenfeld A, Wetterstrand KA, Patrinos A, Morgan MJ, de Jong P, Catanese JJ, Osoegawa K, Shizuya H, Choi S, Chen YJ, Szustakowki J, International Human Genome Sequencing Consortium. Initial Sequencing and Analysis of the Human Genome. Nature. 2001;409:860–921. doi: 10.1038/35057062.
  32. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup. The sequence alignment/Map format and SAMtools. Bioinformatics. 2009;25:2078–2079. doi: 10.1093/bioinformatics/btp352.
  33. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv. 2013. http://arxiv.org/abs/1303.3997
  34. Li H. Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics. 2014;30:2843–2851. doi: 10.1093/bioinformatics/btu356.
  35. Malhotra A, Cromer K, Leppert MF, Hasstedt SJ. The power to detect genetic linkage for quantitative traits in the utah CEPH pedigrees. Journal of Human Genetics. 2005;50:69–75. doi: 10.1007/s10038-004-0222-8.
  36. Mathieson I, Reich D. Differences in the rare variant spectrum among human populations. PLOS Genetics. 2017;13:e1006581. doi: 10.1371/journal.pgen.1006581.
  37. Moorjani P, Gao Z, Przeworski M. Human germline mutation and the erratic evolutionary clock. PLOS Biology. 2016;14:e2000744. doi: 10.1371/journal.pbio.2000744.
  38. Nachman MW. Haldane and the first estimates of the human mutation rate. Journal of Genetics. 2008;87:317. doi: 10.1007/s12041-008-0052-0.
  39. Nachman MW, Crowell SL. Estimate of the Mutation Rate per Nucleotide in Humans. Genetics. 2000;156:297–304. doi: 10.1093/genetics/156.1.297.
  40. Pedersen BS, Quinlan AR. Who's who? detecting and resolving sample anomalies in human DNA sequencing studies with peddy. The American Journal of Human Genetics. 2017a;100:406–413. doi: 10.1016/j.ajhg.2017.01.017.
  41. Pedersen BS, Quinlan AR. cyvcf2: fast, flexible variant analysis with Python. Bioinformatics. 2017b;33:1867–1869. doi: 10.1093/bioinformatics/btx057.
  42. Pedersen BS, Quinlan AR. Mosdepth: quick coverage calculation for genomes and exomes. Bioinformatics. 2018;34:867–868. doi: 10.1093/bioinformatics/btx699.
  43. Prescott SM, Lalouel JM, Leppert M. From linkage maps to quantitative trait loci: the history and science of the utah genetic reference project. Annual Review of Genomics and Human Genetics. 2008;9:347–358. doi: 10.1146/annurev.genom.9.081307.164441.
  44. Rahbari R, Wuster A, Lindsay SJ, Hardwick RJ, Alexandrov LB, Turki SA, Dominiczak A, Morris A, Porteous D, Smith B, Stratton MR, Hurles ME, UK10K Consortium. Timing, rates and spectra of human germline mutation. Nature Genetics. 2016;48:126–133. doi: 10.1038/ng.3469.
  45. Reik W, Dean W, Walter J. Epigenetic reprogramming in mammalian development. Science. 2001;293:1089–1093. doi: 10.1126/science.1063443.
  46. Roach JC, Glusman G, Smit AF, Huff CD, Hubley R, Shannon PT, Rowen L, Pant KP, Goodman N, Bamshad M, Shendure J, Drmanac R, Jorde LB, Hood L, Galas DJ. Analysis of genetic inheritance in a family quartet by whole-genome sequencing. Science. 2010;328:636–639. doi: 10.1126/science.1186802.
  47. Sasani T. ceph_dnm_manuscript. GitHub. 2019. https://github.com/quinlan-lab/ceph-dnm-manuscript. 8569be3
  48. Scally A, Durbin R. Revising the human mutation rate: implications for understanding human evolution. Nature Reviews Genetics. 2012;13:745–753. doi: 10.1038/nrg3295.
  49. Ségurel L, Wyman MJ, Przeworski M. Determinants of mutation rate variation in the human germline. Annual Review of Genomics and Human Genetics. 2014;15:47–70. doi: 10.1146/annurev-genom-031714-125740.
  50. Seisenberger S, Andrews S, Krueger F, Arand J, Walter J, Santos F, Popp C, Thienpont B, Dean W, Reik W. The dynamics of genome-wide DNA methylation reprogramming in mouse primordial germ cells. Molecular Cell. 2012;48:849–862. doi: 10.1016/j.molcel.2012.11.001.
  51. Shendure J, Akey JM. The origins, determinants, and consequences of human mutations. Science. 2015;349:1478–1483. doi: 10.1126/science.aaa9119.
  52. Tang WW, Kobayashi T, Irie N, Dietmann S, Surani MA. Specification and epigenetic programming of the human germ line. Nature Reviews Genetics. 2016;17:585–600. doi: 10.1038/nrg.2016.88.
  53. Thorvaldsdóttir H, Robinson JT, Mesirov JP. Integrative genomics viewer (IGV): high-performance genomics data visualization and exploration. Briefings in Bioinformatics. 2013;14:178–192. doi: 10.1093/bib/bbs017.
  54. Turner TN, Coe BP, Dickel DE, Hoekzema K, Nelson BJ, Zody MC, Kronenberg ZN, Hormozdiari F, Raja A, Pennacchio LA, Darnell RB, Eichler EE. Genomic patterns of de novo mutation in simplex autism. Cell. 2017;171:710–722. doi: 10.1016/j.cell.2017.08.047.
  55. Veltman JA, Brunner HG. De novo mutations in human genetic disease. Nature Reviews Genetics. 2012;13:565–575. doi: 10.1038/nrg3241.
  56. Wong WS, Solomon BD, Bodian DL, Kothiyal P, Eley G, Huddleston KC, Baker R, Thach DC, Iyer RK, Vockley JG, Niederhuber JE. New observations on maternal age effect on germline de novo mutations. Nature Communications. 2016;7:10486. doi: 10.1038/ncomms10486.
  57. Yuen RKC, Merico D, Cao H, Pellecchia G, Alipanahi B, Thiruvahindrapuram B, Tong X, Sun Y, Cao D, Zhang T, Wu X, Jin X, Zhou Z, Liu X, Nalpathamkalam T, Walker S, Howe JL, Wang Z, MacDonald JR, Chan AJS, D’Abate L, Deneault E, Siu MT, Tammimies K, Uddin M, Zarrei M, Wang M, Li Y, Wang J, Wang J, Yang H, Bookman M, Bingham J, Gross SS, Loy D, Pletcher M, Marshall CR, Anagnostou E, Zwaigenbaum L, Weksberg R, Fernandez BA, Roberts W, Szatmari P, Glazer D, Frey BJ, Ring RH, Xu X, Scherer SW. Genome-wide characteristics of de novo mutations in autism. Npj Genomic Medicine. 2016;1:509. doi: 10.1038/npjgenmed.2016.27.

Decision letter

Editor: Amy L Williams
Reviewed by: Amy L Williams

In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.

Thank you for submitting your article "Large, three-generation CEPH families reveal post-zygotic mosaicism and variability in germline mutation accumulation" for consideration by eLife. Your article has been reviewed by three peer reviewers, including Amy Williams as the Reviewing Editor and Reviewer #1, and the evaluation has been overseen by Mark McCarthy as the Senior Editor.

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

Summary:

Sasani et al. present a study of 40 large multi-sibling, three-generation CEPH/Utah pedigrees with the aim of estimating the rate of de novo mutations (DNMs), analyzing variation in paternal age effects, and identifying germline mosaic DNMs with their associated mutational spectra. The number of families and the fact that the CEPH/Utah pedigrees are enriched with large numbers of children in the third generation enable a more detailed study of the paternal age effect and of germline mosaic DNMs than prior studies, which have primarily analyzed trios. The finding that the paternal age effect varies by more than an order of magnitude adds to the already complex picture of mutational dynamics and provides strong evidence in support of a prior finding of a two-fold difference based on three families (Rahbari et al.). The study design enables the division of germline mosaic variants into those that seem to have arisen very early in development – gonosomal variants – and those that arise after primordial germ cell specification (post-PGCS). The mutational spectra associated with post-PGCS DNMs differ from those of other DNMs, with some possible biological explanations proposed in the text.

Essential revisions:

Overall this paper is a solid contribution to the DNM literature, but there are a few additional analyses that will help ensure the findings are robust and enhance the information already presented, as outlined below.

1) Since this manuscript's main advance regarding variation in paternal age effect over the Rahbari et al. result is greater statistical power, more robust statistical analyses of this pattern would strengthen the paper. Figure 3 presents a commendable amount of raw data in a fairly clear way, yet the authors use only a simple ANOVA to test whether different families have different dependencies on paternal age. The supplement claims that this result cannot be an artifact of low sequencing coverage because regions covered by <12 reads are excluded from the denominator, but there still might be subtle differences in variant discovery power between e.g. regions covered by 12 reads and regions covered by 30 reads. To hedge against this, the authors can define the "callable genome" continuously (point 1 under "other questions and suggestions") or they can check whether mean read coverage appears to covary with mutation rates across individuals after filtering away the regions covered by <12 reads.

2) Another concern about paternal age effects is the extent to which outlier offspring may be driving the apparent rate variation across families. If the authors were to randomly sample half of the children from each family and run the analysis again, how much is the paternal age effect rank preserved? Alternatively, how much is the family rank ordering preserved if mutations are only called from a subset of the chromosomes?

3) A factor regarding paternal age effects that is only briefly alluded to late in the paper is the difference in the intercept between families (Figure 3C). Does either the intercept or the slope vary with the number of F2 children? Is there any (anti-)correlation between slope and intercept? It would seem odd if the intercept strongly impacts the slope since a low per-year rate probably should not correlate with a high initial rate at younger ages.

4) Another analysis that is informative about the cellular stage at which mutations occurred is to examine, for mutations found in F1, what fraction of these are transmitted to F2s. Putative DNMs could in fact be present in blood but not in the germline, and the study design here makes it easy to identify the fraction of such DNMs. A complexity of this is the use of multi-sample genotype calling. Perhaps dividing the genotype calling into a set called using the P0 and F1 generations only and comparing the resulting DNMs to those found in F2s (with calling in everyone) would ensure that the set of DNMs aren't biased towards those also present in the germline of F1s.

5) In the additional data files, I could not find the age of all the individuals. It would be informative if the age information could be provided on the pedigree diagrams or as a separate file with identifiers.

6) Will the sequencing data generated here be posted to the SRA or dbGaP?

7) The authors state that "a gamete sampled from a younger father is more likely to possess a DNM that will recur in a future child." This doesn't seem correct as stated – every gamete should be equally likely to possess a DNM that will recur in a future child, independent of parental age. I believe what is meant is that a particular DNM sampled from the child of a young parent is more likely to be shared with a sibling than a DNM sampled from the child of an older parent.

Other questions and suggestions:

1) One concern is in defining a site as "callable", which is of course not strictly binary. It would be good to take sequencing depth into consideration when deciding callability. Ideally this would also factor into both the FPR and MHR values. Given the three-generation study design, there is a greater opportunity to perform these analyses in a more detailed manner than in trio studies and thus to better estimate/model these rates.

2) Another factor to analyze is the use of multi-sample genotype calling and its potential to bias against the identification of non-mosaic (singleton) DNMs. Perhaps the 60x vs. 30x analysis can help estimate the rates of missing singletons in a way that is distinct from the MHR analysis.

3) What is the range of the MHR? Is there significant variation with respect to MHR among families (in cases where P0 were genotyped)? Moreover, is there any enrichment or biases in mutational spectra seen using the MHR?

4) The authors mentioned that they have removed DNMs with likely "DNM carriers" in the cohort. Does this remove DNMs where the alternate alleles are observed in only the unrelated individuals or does it also include the related individuals?

5) Other things to explore further related to parental age effects are: how do the conclusions change and/or can you detect similar variability in maternal age when analyzing phased DNMs? This may be underpowered, but for those families that share grandparents, if two brothers are in the F1 generation, do their paternal age effects differ?

6) For the parental age effect model the authors have correctly included "family-id", but for the rest of their analysis they have defined a "family" as the unique group of two F1 parents and their F2 offspring (e.g., Figure 3—figure supplement 1). Can the authors comment whether this might introduce biases in their analysis and filtering strategies, as some families are more related to each other than the rest?

7) Figure 4 shows that the number of DNMs shared with siblings does not appear to correlate with paternal age. Although it seems unlikely to affect the result, it seems odd to report these as raw counts without correcting for the number of siblings the child has. It would be good to report the strength of the correlation between family size and shared DNM count and correct the shared counts for family size before testing for a correlation with paternal age.

8) Why, from Figure 4B, are the differences in mutational spectra found in the post-PGCS mosaic analysis only based on 289/721 of these DNMs (presumably the phased ones)?

9) A minor but important consideration here is that the term "post-PGCS" seems to include any mutations that arise following the establishment of the germ cells, but what seems to be the intended meaning is those mutations that arise during germ cell proliferation (or related). Rewording would aid understanding here.

10) For the mutational spectra analysis of gonosomal mosaic DNMs, this and other similar analyses consider each allelic class independently. Would power increase by analyzing the data as a whole using, say, a Chi-squared six degree of freedom test?

11) The authors have applied the same filters as DNMs for identifying post-PGCS mosaic variants. They seem to have filtered candidates based on VAF > 0.2. Might this filter be too stringent for identifying the post-PGCS events?

12) To identify gonosomal mutations, the authors have applied a hard VAF cutoff of < 0.2. Considering the number of cell divisions before PGC specification, might this threshold be a bit too low? Would their observations change significantly if the threshold were changed to < 0.3?

13) What is the VAF distribution of candidate gonosomal mutations in F2? One of the filters they have used requires VAF >= 0.3 in F2. Might this threshold be too lax? For the gonosomal mutations that occur in F1, an expectation of a higher VAF of almost 0.5 in the F2 set seems reasonable.

14) In the mosaic post-PGCS analysis, the authors have identified 32 events with supporting alleles in F1. Among these 32 mutations, do any occur in families where F1s are related? For example, do F1 19_A mom and 19_B dad share some of these mosaic mutations? If so, is there any correlation in mutational burden in F1 with the age of P0?

eLife. 2019 Sep 24;8:e46922. doi: 10.7554/eLife.46922.027

Author response


Essential revisions:

Overall this paper is a solid contribution to the DNM literature, but there are a few additional analyses that will help ensure the findings are robust and enhance the information already presented, as outlined below.

1) Since this manuscript's main advance regarding variation in paternal age effect over the Rahbari et al. result is greater statistical power, more robust statistical analyses of this pattern would strengthen the paper. Figure 3 presents a commendable amount of raw data in a fairly clear way, yet the authors use only a simple ANOVA to test whether different families have different dependencies on paternal age. The supplement claims that this result cannot be an artifact of low sequencing coverage because regions covered by <12 reads are excluded from the denominator, but there still might be subtle differences in variant discovery power between e.g. regions covered by 12 reads and regions covered by 30 reads. To hedge against this, the authors can define the "callable genome" continuously (point 1 under "other questions and suggestions") or they can check whether mean read coverage appears to covary with mutation rates across individuals after filtering away the regions covered by <12 reads.

It is true that “callability” is not necessarily a binary quality. To address this, we have assessed whether mutation rates are correlated with mean read depth in the second- and third-generation samples. We counted the total number of sites at which all members of a trio (mother, father, and child) had depth >= 12, and then averaged the read depth across all of these sites in the child. We have included a plot of mean autosomal read depth versus autosomal mutation rates in all second- and third-generation children in Author response image 1. Overall, mutation rates do not appear to be correlated with mean read depth in the second (p = 0.92) or third-generation samples (p = 0.073), though there are a small number of third-generation samples that have both relatively low mutation rates and low mean read depths.

Author response image 1. Lack of correlation between read depth and mutation rates in CEPH/Utah samples.

For each second- or third-generation CEPH/Utah sample, we calculated mean read depth across all autosomal base pairs covered by >=12 reads in all members of the trio. We then assessed whether there was a correlation between mean read depth and the autosomal mutation rate in these samples. For each generation, we fit a linear model predicting read depth as a function of autosomal mutation rate, and do not find a significant association in either generation at a p-value threshold of 0.05 (second-generation p = 0.92, third-generation p = 0.073).
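
The depth calculation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name, the list-based input format, and the toy depth values are all hypothetical, and only the >= 12 read threshold comes from the text.

```python
from statistics import mean

MIN_DEPTH = 12  # depth threshold stated in the author response


def mean_callable_depth(child, mother, father, min_depth=MIN_DEPTH):
    """Return (number of callable sites, mean child depth over those sites).

    Each argument is a list of per-site read depths, aligned by position.
    A site counts as callable only if all three trio members reach min_depth.
    """
    callable_child_depths = [
        c
        for c, m, f in zip(child, mother, father)
        if c >= min_depth and m >= min_depth and f >= min_depth
    ]
    if not callable_child_depths:
        return 0, float("nan")
    return len(callable_child_depths), mean(callable_child_depths)


# Toy depths at five sites (hypothetical values, not from the study).
child = [30, 11, 25, 40, 18]
mother = [28, 35, 10, 33, 20]
father = [31, 29, 27, 12, 9]
n_sites, avg_depth = mean_callable_depth(child, mother, father)
```

In this toy example only the first and fourth sites pass the threshold in all three samples, so the child's mean depth is taken over those two sites alone.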

2) Another concern about paternal age effects is the extent to which outlier offspring may be driving the apparent rate variation across families. If the authors were to randomly sample half of the children from each family and run the analysis again, how much is the paternal age effect rank preserved? Alternatively, how much is the family rank ordering preserved if mutations are only called from a subset of the chromosomes?

This is an understandable concern. To address the issue of outlier samples impacting apparent variation in paternal age effects across families, we took the following approach. For each of the 40 CEPH/Utah families shown in Figure 3, we randomly sampled three-quarters of the family’s offspring. Given the family sizes of the CEPH/Utah cohort (median of 8 grandchildren) and manual inspection of the regressions for each family, we felt that this sampling strategy would remove the small numbers of possible outlier samples in each family without dramatically reducing the number of samples used for regression. We then fit the following regression on each subsampled family:

m = glm(autosomal_dnms ~ dad_age, family=poisson(link="identity"))

Finally, we ranked each of the 40 subsampled families in order of increasing slope; as mentioned in the manuscript, the slope in each family represents the sum of both the paternal and maternal age effects. We repeated this procedure (random sampling followed by regression and re-ranking) 100 times, and aggregated the ranks for each family. In Figure 3—figure supplement 2, we have plotted the distribution of ranks (across 100 trials) for each of the 40 CEPH families. These distributions are ordered by the original ranks of the families, as determined using the full dataset and originally presented in Figure 3. We find that some families are indeed sensitive to possible outlier samples, as the ranks of some families are substantially changed after removing these outliers (Figure 3—figure supplement 2). For example, the distribution of ranks for family 26 appears to be approximately bimodal, suggesting that some families are quite sensitive to a small number of outliers. It is perhaps unsurprising that decreasing the number of data points in each family would change the ranks of certain families, as each family’s regression might become less precise, and potentially even more sensitive to small outliers.

Importantly, though, we note that for nearly all of the families, the median ranks after 100 simulations are very similar to the ranks inferred using the full, original dataset. Overall, these results suggest that our estimates of paternal age effect “ranks” for the 40 CEPH/Utah families are robust to possible outlier samples.
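
The subsample, refit, and re-rank procedure described above can be sketched as follows. This is a rough Python illustration, not the authors' code: it swaps in an ordinary least-squares slope for the Poisson identity-link GLM, and the function names, the dictionary layout, and the three toy families are hypothetical. Only the three-quarters sampling fraction and the 100 trials come from the text.

```python
import random


def ols_slope(xs, ys):
    """Least-squares slope of ys on xs (assumes xs are not all identical)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def rank_families(families, rng, fraction=0.75):
    """Subsample each family's children, fit a slope, and rank by slope.

    families maps a family ID to a list of (dad_age, dnm_count) pairs.
    Rank 1 is the smallest slope.
    """
    slopes = {}
    for family_id, children in families.items():
        k = max(2, round(fraction * len(children)))
        ages, counts = zip(*rng.sample(children, k))
        slopes[family_id] = ols_slope(ages, counts)
    ordered = sorted(slopes, key=slopes.get)
    return {family_id: rank for rank, family_id in enumerate(ordered, start=1)}


def rank_distribution(families, n_trials=100, seed=0):
    """Collect each family's rank across repeated subsampling trials."""
    rng = random.Random(seed)
    ranks = {family_id: [] for family_id in families}
    for _ in range(n_trials):
        for family_id, rank in rank_families(families, rng).items():
            ranks[family_id].append(rank)
    return ranks


# Toy families (hypothetical): eight children each, noise-free counts
# lying exactly on lines with different slopes.
ages = [20 + 2 * i for i in range(8)]
families = {
    name: [(age, 40 + slope * (age - 20)) for age in ages]
    for name, slope in {"A": 0.5, "B": 1.5, "C": 3.0}.items()
}
ranks = rank_distribution(families)
```

Because the toy counts are noise-free, every subsample recovers the same slope and the ranks never change; with real data, the spread of each family's rank list is what would indicate sensitivity to outliers.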

3) A factor regarding paternal age effects that is only briefly alluded to late in the paper is the difference in the intercept between families (Figure 3C). Does either the intercept or the slope vary with the number of F2 children? Is there any (anti-)correlation between slope and intercept? It would seem odd if the intercept strongly impacts the slope since a low per-year rate probably should not correlate with a high initial rate at younger ages.

We do observe an anti-correlation between slopes and intercepts across the 40 CEPH/Utah families (Author response image 2). A significant negative correlation between slopes and intercepts is the expected consequence of fitting regression lines to data in which the mean of the independent variable is greater than 0. For example, we would expect to observe a negative correlation between slopes and intercepts if the third-generation DNM counts in all families were randomly scattered along the same y = a + bx line (where x represents paternal age at birth, and a and b are fixed constants), and regressions were fit for each family separately. Since a random distribution of DNM counts in each family would also produce a negative correlation between slopes and intercepts, it is possible that stochastic noise in the CEPH families’ DNM counts might be contributing to some of the variability in slopes we observe, as well as the resulting negative correlation with intercepts. Indeed, the confidence intervals surrounding slope point estimates in CEPH families are occasionally quite wide (Figure 3D), demonstrating the uncertainty in some of these estimates.

Author response image 2. Anti-correlation between slope and intercept.

For each CEPH/Utah family, we fit a linear model predicting DNM counts as a function of paternal age (see Figure 3). We then assessed whether the slopes and intercepts of these regressions were correlated; overall, slope and intercept point estimates are negatively correlated in CEPH/Utah families (p < 2.2e-16).
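
The expected anti-correlation between slope and intercept follows from standard least-squares algebra. As a sketch, assuming an ordinary least-squares fit of y = a + bx with constant error variance σ² (a simplification, since DNM counts do not have constant variance), where x is paternal age at birth and y is the DNM count in one family, the two estimates satisfy:

```latex
\hat{a} = \bar{y} - \hat{b}\,\bar{x},
\qquad
\operatorname{Cov}(\hat{a}, \hat{b})
  = -\bar{x}\,\operatorname{Var}(\hat{b})
  = -\frac{\bar{x}\,\sigma^{2}}{\sum_{i}\,(x_{i} - \bar{x})^{2}}
```

Because the mean paternal age in every family is well above zero, this covariance is negative: a fit that overestimates the slope by chance will tend to underestimate the intercept, and vice versa.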

However, this noise alone is unlikely to produce the significant inter-family variability we find in our observed third-generation DNM counts, and we feel confident that our results represent true biological differences between families. Specifically, we can randomly distribute DNM counts in all third-generation CEPH/Utah individuals by drawing a single value from a Poisson distribution – with λ = a + (b * x), where x represents paternal age at birth, a = 15, and b = 1.72 – for each third-generation sample. The values of a and b were chosen to match the intercept and slope point estimates of the regression predicting autosomal DNM counts as a function of paternal age, using the full set of DNMs in the third generation.
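
The simulation described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' notebook code: the values a = 15 and b = 1.72 come from the text, but the family count, sibship size, paternal ages, random seed, and the use of ordinary least squares in place of a Poisson identity-link GLM are all hypothetical choices.

```python
import math
import random

A, B = 15.0, 1.72  # intercept and slope quoted in the author response


def poisson_draw(lam, rng):
    """Draw one Poisson variate (Knuth's method; fine for lam up to a few hundred)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def ols_fit(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - slope * mean_x, slope


def pearson(us, vs):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(us)
    mean_u, mean_v = sum(us) / n, sum(vs) / n
    cov = sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))
    var_u = sum((u - mean_u) ** 2 for u in us)
    var_v = sum((v - mean_v) ** 2 for v in vs)
    return cov / math.sqrt(var_u * var_v)


def simulate(n_families=40, n_children=8, seed=1):
    """Simulate families that all share one line, then fit each separately."""
    rng = random.Random(seed)
    fits = []
    for _ in range(n_families):
        first_age = rng.uniform(20, 30)  # hypothetical paternal age at first birth
        ages = [first_age + 2 * i for i in range(n_children)]
        counts = [poisson_draw(A + B * age, rng) for age in ages]
        fits.append(ols_fit(ages, counts))
    intercepts, slopes = zip(*fits)
    return pearson(intercepts, slopes)


r = simulate()  # correlation between per-family intercepts and slopes
```

Even though every simulated family has the same underlying slope and intercept, the per-family point estimates should still come out negatively correlated, which is the effect the response attributes to stochastic noise.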

If we test for inter-family variability in these simulated data, we find that a “family-aware” model is not a significantly better fit to the data than a “family-agnostic” model. We have included a Jupyter notebook (Reviewer Response Notebook, filename “response_figures.ipynb”) that includes code needed to recreate the figures and analyses presented in the main reviewer response, available at the following GitHub site: https://github.com/quinlan-lab/ceph-dnm-manuscript (copy archived at https://github.com/elifesciences-publications/ceph-dnm-manuscript/tree/master/notebooks). Additionally, we have added a caveat about the large confidence intervals in some families’ slope estimates, as well as the possible contribution of stochastic noise to these estimates, in the subsection “Identifying gonadal, post-primordial germ cell specification (PGCS) mosaicism in the second generation”.

To the reviewers’ other points, we do not observe any correlation between the number of siblings in a particular family (i.e. the number of third-generation individuals) and either the slope or intercept measured in that family (Author response image 3).

Author response image 3. Lack of correlation between sibling number and either slope or intercept.

For each CEPH/Utah family, we fit a linear model predicting DNM counts as a function of paternal age (see Figure 3). We then assessed whether the number of third-generation siblings in these families was predictive of either the (a) slope or (b) intercept point estimate in the regression. Neither slope (p = 0.654) nor intercept (p = 0.718) is significantly associated with sibling number.

4) Another analysis that is informative about the cellular stage at which mutations occurred is to examine, for mutations found in F1, what fraction of these are transmitted to F2s. Putative DNMs could in fact be present in blood but not in the germline, and the study design here makes it easy to identify the fraction of such DNMs. A complexity of this is the use of multi-sample genotype calling. Perhaps dividing the genotype calling into a set called using the P0 and F1 generations only and comparing the resulting DNMs to those found in F2s (with calling in everyone) would ensure that the set of DNMs aren't biased towards those also present in the germline of F1s.

We agree that the untransmitted DNMs are an interesting class of potential mutations, and could represent somatic DNMs in the second generation. Thus, we returned to our original de novo mutation calls and counted the number of DNMs observed in each second-generation individual that were not transmitted to the third generation. For this analysis, we did not consider any second-generation individuals without sequenced children in the CEPH/Utah cohort. Using a filtering strategy similar to the one described in the Materials and methods section (no likely or possible carriers, GQ >= 20 in the second-generation individual and both parents, DP >= 12 in the second-generation individual and both parents), we observed 3,919 untransmitted DNMs.
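
The filtering criteria listed above can be written as a simple predicate. This is a minimal Python sketch, assuming a hypothetical dictionary layout for each candidate site; only the thresholds (no carriers, GQ >= 20, DP >= 12 in the child and both parents) come from the text, and the actual pipeline may apply further criteria.

```python
MIN_GQ = 20  # genotype quality threshold quoted in the response
MIN_DP = 12  # read depth threshold quoted in the response
TRIO_ROLES = ("child", "mother", "father")


def is_candidate(site):
    """Return True if a site passes the carrier, GQ, and DP filters.

    site is a hypothetical dict with a 'carriers' count (likely or possible
    carriers elsewhere in the cohort) and per-sample 'GQ' and 'DP' values
    keyed by trio role. Here 'child' is the second-generation individual.
    """
    if site["carriers"] > 0:
        return False
    return all(
        site[role]["GQ"] >= MIN_GQ and site[role]["DP"] >= MIN_DP
        for role in TRIO_ROLES
    )


# Toy candidate sites (hypothetical values).
sites = [
    {"carriers": 0, "child": {"GQ": 99, "DP": 31},
     "mother": {"GQ": 60, "DP": 28}, "father": {"GQ": 45, "DP": 12}},
    {"carriers": 1, "child": {"GQ": 99, "DP": 31},
     "mother": {"GQ": 60, "DP": 28}, "father": {"GQ": 45, "DP": 30}},
    {"carriers": 0, "child": {"GQ": 99, "DP": 31},
     "mother": {"GQ": 19, "DP": 28}, "father": {"GQ": 45, "DP": 30}},
]
kept = [is_candidate(site) for site in sites]
```

The second toy site fails because a carrier exists elsewhere in the cohort, and the third fails because the mother's genotype quality is below 20.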

The counts of filtered untransmitted DNMs were not normally distributed across second-generation individuals. The median number of untransmitted DNMs per second-generation sample was 30, but four samples had substantially elevated counts of untransmitted DNMs (180, 187, 223, and 1,098 DNMs). For the purposes of this analysis, we removed these samples from further consideration, leaving a total of 2,231 untransmitted DNMs. The distribution of allele balances (fraction of reads supporting the alternate, de novo allele) in the filtered untransmitted DNMs (median AB = 0.182) was quite different than in the DNMs transmitted at “high-quality” to the F2 generation (median AB = 0.487, Author response image 4).

Author response image 4. Allele balance distributions in transmitted and untransmitted DNMs.

Allele balance was calculated as the fraction of reads supporting the alternate (i.e., de novo) allele at a particular site. As there are substantially more transmitted than untransmitted DNMs in the plot, the y-axis is shown as the normalized count of DNMs.
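
As a concrete illustration of the allele balance definition above, the following Python sketch computes AB from read counts and compares medians between two groups. The read counts are toy values chosen for illustration (they are not the study's data), and the function name is hypothetical. The final lines simply re-check the sample-removal arithmetic reported in the response.

```python
from statistics import median


def allele_balance(ref_reads, alt_reads):
    """Fraction of reads supporting the alternate (de novo) allele."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads at this site")
    return alt_reads / total


# Toy (ref_reads, alt_reads) pairs; hypothetical values only.
transmitted = [(15, 15), (14, 16), (16, 14)]
untransmitted = [(25, 5), (27, 6), (33, 7)]

median_transmitted = median(allele_balance(r, a) for r, a in transmitted)
median_untransmitted = median(allele_balance(r, a) for r, a in untransmitted)

# Counts quoted in the response: 3,919 untransmitted DNMs before removing
# the four samples with elevated counts.
outlier_counts = [180, 187, 223, 1098]
remaining = 3919 - sum(outlier_counts)
```

A germline heterozygous variant is expected to have an allele balance near 0.5, whereas a mosaic variant or a sequencing artifact will usually fall well below that, which is why the two distributions are informative.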

This substantial difference in allele balance could reflect the fact that the untransmitted DNMs are, by and large, false positives. However, it is also possible that the untransmitted DNMs are post-zygotic mutations that arose after fertilization in the F1 individual, and are present exclusively in somatic cells and absent from the germline. To discriminate false-positive from possible post-zygotic untransmitted DNMs, we visually examined a subset of the 2,231 untransmitted DNMs using the Integrative Genomics Viewer (IGV). Following visual inspection of 200 randomly sampled untransmitted DNMs, we judged 130 to be likely false positives, probably the result of mapping artifacts, genotyping error, and other factors. Since the majority of untransmitted DNMs appear to be false positives, it is difficult to estimate the true fraction of each sample’s DNMs that are post-zygotic. However, this result suggests that post-zygotic (somatic) DNMs do exist within the set of untransmitted de novo mutations identified in the second generation. Given the scope of this paper, we have elected to save a more detailed treatment of possible untransmitted post-zygotic mutations for future work.

The reviewers also raise an important point regarding our differential power to detect DNMs in the second and third generations due to the multi-generational structure of the pedigrees. Because transmitted DNMs are (by their nature) present in more than one sample, a variant caller can integrate these multiple observations of the mutation into its posterior probability that the mutation is “real.” However, given the number of samples in the CEPH dataset and the complexities/time involved in re-running the full variant calling pipeline on two distinct sets of samples, we chose not to perform additional rounds of genotype calling. We anticipate that using the CEPH/Utah sequencing data, future analyses could address this important question.

5) In the additional data files, I could not find the age of all the individuals. It would be informative if the age information could be provided on the pedigree diagrams or as a separate file with identifiers.

Currently, all of the second- and third-generation individuals have associated paternal and maternal ages at birth in the ‘second_gen.dnms.summary.csv’ and ‘third_gen.dnms.summary.csv’ files, respectively. However, our IRB precludes us from providing ages and/or exact birth dates for every sample in the dataset, as this is more sensitive, identifiable information.

6) Will the sequencing data generated here be posted to the SRA or dbGaP?

The sequencing data for all 603 CEPH/Utah individuals (as well as a joint-called VCF) will be uploaded via dbGaP under controlled access. We have begun depositing these data in the Sequence Read Archive and dbGaP, though we don’t yet have an accession number for our data submission.

7) The authors state that "a gamete sampled from a younger father is more likely to possess a DNM that will recur in a future child." This doesn't seem correct as stated – every gamete should be equally likely to possess a DNM that will recur in a future child, independent of parental age. I believe what is meant is that a particular DNM sampled from the child of a young parent is more likely to be shared with a sibling than a DNM sampled from the child of an older parent.

The reviewer is correct; our original wording of this phrase is not accurate as stated. We have updated the subsection “Identifying gonosomal mosaicism in the second generation” to reflect the reviewer’s correction.

Other questions and suggestions:

1) One concern is in defining a site as "callable", which is of course not strictly binary. It would be good to take sequencing depth into consideration when deciding callability. Ideally this would also factor into both the FPR and MHR values. Given the three-generation study design, there is a greater opportunity to perform these analyses in a more detailed manner than in trio studies and thus to better estimate/model these rates.

We agree that “callability” is not a binary quality, and have attempted to address concerns about variable sequencing depth by investigating the correlation between mutation rates and average sequencing depth in CEPH/Utah samples (see response to Major revision #1).

One additional informative experiment might be to compare the MHR in CEPH/Utah families using either the 30X or 60X data in the 8 first-generation grandparents who were re-sequenced at higher depth. We hypothesize that using the 60X data, we might identify even more instances of “missed heterozygotes,” in which a grandparent is heterozygous for a particular variant that goes undetected in the second-generation, only to appear as an ostensibly de novo mutation in the third generation. However, this experiment would require re-running the full variant calling pipeline using the 60X data for these first-generation samples, and given the scope of this manuscript, we feel that it is best left for a future analysis.

2) Another factor to analyze is the use of multi-sample genotype calling and its potential to bias against the identification of non-mosaic (singleton) DNMs. Perhaps the 60x vs. 30x analysis can help estimate the rates of missing singletons in a way that is distinct from the MHR analysis.

The reviewers again raise an interesting hypothesis, which has implications for other large, family-based sequencing studies: are singleton (i.e., untransmitted) DNMs less likely to be identified in a joint-genotyping approach, as there is inherently less evidence for those DNMs in the rest of the cohort? Given the scope of this paper, we have not addressed this concern explicitly. However, we expect that in the future, researchers could make use of the CEPH/Utah dataset to address this more robustly.

Overall, we do not believe that the 60X and 30X data would be particularly useful for estimating a rate of “missing” singleton DNMs. The approach suggested in reviewer comment #3 (separating the CEPH/Utah families into groups of first/second and second/third-generation samples, followed by joint-genotyping of each group separately) might be more fruitful for this analysis. Instead, the 60X and 30X data allow us to carefully estimate the fraction of apparently “real” singleton DNMs that are, in fact, likely inherited mutations that went undetected in a parent (see section entitled “Estimating a false positive rate for our de novo mutation detection strategy” in the Materials and methods section of the manuscript).

We hypothesize that if we had access to higher-depth (60X) sequencing data for children in the CEPH/Utah cohort, rather than grandparents, we might be able to use those deep sequencing data to better estimate the rates of missing singletons. Increased depth in the CEPH/Utah children could result in the increased sampling of alternate alleles; using the 30X data alone, it’s possible that these alleles may have been unobserved. Increased sampling of these alternate alleles could increase the sensitivity of the variant calling software, lead us to identify a larger number of true singleton DNMs, and obtain a better estimate of the fraction of singleton DNMs that go “missing” using only 30X data.
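
The intuition that deeper sequencing samples more alternate alleles can be made concrete with a simple binomial model. The Python sketch below is a deliberate simplification under stated assumptions: it treats each read as an independent draw, assumes an expected alternate-allele fraction of 0.5 for a heterozygous DNM, and uses a hypothetical requirement of at least 10 alternate reads. Real variant callers use likelihood models rather than a fixed read threshold, and the function name is illustrative.

```python
from math import comb


def prob_at_least(min_alt_reads, depth, alt_fraction=0.5):
    """P(at least min_alt_reads alternate reads) under Binomial(depth, alt_fraction)."""
    return sum(
        comb(depth, k) * alt_fraction**k * (1 - alt_fraction) ** (depth - k)
        for k in range(min_alt_reads, depth + 1)
    )


# Hypothetical threshold of 10 alternate reads at two sequencing depths.
p30 = prob_at_least(10, 30)
p60 = prob_at_least(10, 60)
```

Under these assumptions the probability of reaching the read threshold is higher at 60X than at 30X, and the gap widens for alleles present at lower fractions, such as mosaic variants.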

3) What is the range of the MHR? Is there significant variation with respect to MHR among families (in cases where P0 were genotyped)? Moreover, is there any enrichment or biases in mutational spectra seen using the MHR?

We have calculated the range of missed heterozygote rates across the CEPH families. Overall, the MHR is relatively low and consistent across these families (Author response image 5A). Though the MHR is very low overall (~0.4%), we compared the mutation spectrum in the third-generation DNMs with grandparental evidence (i.e., DNMs that were removed as “missed heterozygotes”) to the filtered, high-quality third-generation germline DNMs, and did not find significant differences for any particular mutation types (Author response image 5B).

Author response image 5. Range of missed heterozygote rates across CEPH families.

(a) For each unique set of second-generation parents and third-generation children, we counted the total number of DNMs in the third generation for which we saw evidence in the first generation (i.e., grandparents). The missed heterozygote rate (MHR) therefore represents the fraction of DNMs in each family that were likely “missed” in the second generation, as a percentage of the total number of DNMs identified in the third-generation children. (b) Comparison of mutation spectra in autosomal filtered germline third-generation DNMs (n=22,644) and autosomal third-generation DNMs that were removed due to evidence in a genotyped grandparent (n=83). No significant differences for particular mutation types were found at a false-discovery rate of 0.05 (Benjamini-Hochberg procedure) using a Chi-squared test of independence.
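The calculations behind this figure can be sketched in Python as follows. This is a minimal illustration rather than the authors' code: the function names are hypothetical, the chi-squared test is run on a 2x2 table per mutation class without a continuity correction (an assumption; the response does not say how each table was built), and the Benjamini-Hochberg step is the standard step-up procedure.

```python
import math

def missed_heterozygote_rate(n_with_grandparental_evidence, n_total_dnms):
    """Fraction of third-generation DNMs that show evidence in a grandparent."""
    if n_total_dnms <= 0:
        raise ValueError("n_total_dnms must be positive")
    return n_with_grandparental_evidence / n_total_dnms

def chi2_2x2_pvalue(a, b, c, d):
    """Chi-squared test of independence on [[a, b], [c, d]] (no continuity correction)."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if min(row1, row2, col1, col2) == 0:
        return 1.0
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of a chi-squared variable with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2.0))

def benjamini_hochberg(pvalues, fdr=0.05):
    """Return one boolean per p-value: True where the null is rejected at the given FDR."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    largest_passing_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * fdr:
            largest_passing_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= largest_passing_rank:
            rejected[idx] = True
    return rejected

# Illustration only: treat the two autosomal sets in panel (b) as the full total.
overall_mhr = missed_heterozygote_rate(83, 83 + 22644)
```

Under the stated assumption about the denominator, `overall_mhr` comes to roughly 0.37%, in line with the ~0.4% quoted in the response. For the spectrum comparison, each mutation class would contribute one p-value from `chi2_2x2_pvalue` (class vs. all other classes, removed vs. retained DNMs), and the resulting list would be passed to `benjamini_hochberg`.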

4) The authors mentioned that they have removed DNMs with likely "DNM carriers" in the cohort. Does this remove DNMs where the alternate alleles are observed in only the unrelated individuals or does it also include the related individuals?

We defined “carriers” as unrelated individuals who possess the DNM allele, and did not include individuals in the same immediate family as the sample with the DNM (i.e., siblings, parents, or grandparents) in our carrier observations. We have made this definition clearer in the Materials and methods (subsection “Identifying DNM Candidates”).
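As a rough Python sketch of this definition, a carrier count can be computed by dropping the proband and the proband's immediate family from the set of samples that possess the allele. The function names, the sample IDs, and the `max_carriers=0` threshold are hypothetical; the response does not state the threshold used.

```python
def count_carriers(samples_with_allele, proband, immediate_family):
    """Count individuals outside the proband's immediate family who possess the DNM allele."""
    excluded = set(immediate_family) | {proband}
    return len(set(samples_with_allele) - excluded)

def passes_carrier_filter(samples_with_allele, proband, immediate_family, max_carriers=0):
    """Keep a candidate DNM only if it has at most `max_carriers` carriers (hypothetical threshold)."""
    return count_carriers(samples_with_allele, proband, immediate_family) <= max_carriers

# Hypothetical sample IDs: the allele is seen in the proband, one sibling, and one unrelated sample.
n_carriers = count_carriers(
    samples_with_allele={"kid_1", "kid_2", "unrelated_9"},
    proband="kid_1",
    immediate_family={"kid_2", "mom", "dad", "grandma", "grandpa"},
)
```

Here only `unrelated_9` counts as a carrier, because the sibling is part of the proband's immediate family.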

5) Other things to explore further related to parental age effects are: how do the conclusions change and/or can you detect similar variability in maternal age when analyzing phased DNMs? This may be underpowered, but for those families that share grandparents, if two brothers are in the F1 generation, do their paternal age effects differ?

As mentioned in the manuscript, given the low phasing rate in the third generation, we assessed inter-family variability using only the total autosomal counts of DNMs in each third-generation individual. However, we also attempted to identify similar variability using only the phased paternal or maternal de novo mutations. Using only the paternal de novo mutations, we performed a regression and ANOVA as follows:

m = glm(dad_dnms_auto ~ dad_age * family_id, family=poisson(link="identity"))

anova(m, test="Chisq")

We find that the additive family_id term is significant in the model at a p-value threshold of 0.05 (p = 2.82e-4), though the interaction between dad_age and family_id is not (p = 0.402).

We also performed the regression using only the maternal DNMs:

m = glm(mom_dnms_auto ~ mom_age * family_id, family=poisson(link="identity"))

anova(m, test="Chisq")

In many families, the number of children with 0 maternally phased DNMs renders the Poisson regression model unable to calculate coefficients. Therefore, we added a pseudo-count of 1 to all of the F2 individuals’ maternal DNM counts and re-ran the regression. Neither the additive family_id term (p = 0.221) nor the interaction between mom_age and family_id (p = 0.623) was significant at a p-value threshold of 0.05, likely due to the small numbers of phased maternal DNMs in each sample (range = 0-12).

The reviewers are correct that in this study, there are not many instances in which two brothers each had children whose DNA was sequenced. However, family ID 24 and family ID 19 (Supplementary file 1) each present an opportunity to investigate the paternal age effects of 2 brothers. Samples 426 and 444 are both members of family 24; in Figure 3, these two brothers and their children form the unique families “24_C” and “24_D.” To identify possible differences in paternal age effects between these brothers, we can fit a generalized linear model to a subset of the third-generation DNM counts that only includes families “24_C” and “24_D,” and run an ANOVA as done previously:

m = glm(autosomal_dnms ~ dad_age * family_id, family=poisson(link="identity"))

anova(m, test="Chisq")

Following this test, we don’t find that a “family-aware” model is a better fit than a “family-agnostic” model at a p-value threshold of 0.05 (ANOVA p = 0.137), though this could be due in part to the small number of data points in each family, and the uncertainty surrounding their slope and intercept point estimates (Figure 3D). Indeed, in Figure 3D we can see that family “24_C” has the lowest slope point estimate of all 40 families, and the slope point estimate in family “24_D” is nearly identical to the median slope across all families.

We performed the same statistical test using a subset of the third-generation DNM counts that included only families “19_A” and “19_B.” These two families also contain a pair of brothers, who each had sequenced third-generation children in the CEPH dataset. Once again, a “family-aware” model is not a significantly better fit than a “family-agnostic” model (ANOVA p = 0.614), suggesting that there aren’t substantial differences in paternal age effects between these two brothers. This observation is supported by visually examining the point estimates for families “19_A” and “19_B” in Figure 3D, which appear to be quite similar.

Overall, however, it is difficult to confidently determine whether the above sets of brothers differ in their paternal age effects, given the small number of data points in each comparison. Indeed, there are many pairs of unrelated families that also do not appear to differ substantially in their paternal age effects.
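The two-brother comparisons above can be sketched in plain Python. This is an illustration under stated assumptions, not a reimplementation of the R calls: it fits an identity-link Poisson regression by iteratively reweighted least squares, then compares one shared line ("family-agnostic") against a separate line per family ("family-aware") with a likelihood-ratio test on 2 degrees of freedom. R's `anova` instead reports sequential tests for each term, so the p-values need not match. All function names are hypothetical.

```python
import math

def fit_poisson_identity(ages, counts, n_iter=50):
    """Fit counts ~ Poisson(intercept + slope * age) by iteratively reweighted least squares."""
    mu = [max(float(y), 0.5) for y in counts]  # starting values for the fitted means
    intercept = slope = 0.0
    for _ in range(n_iter):
        # With an identity link, the working response is y and the weight is 1 / mu.
        w = [1.0 / m for m in mu]
        sw = sum(w)
        x_bar = sum(wi * x for wi, x in zip(w, ages)) / sw
        y_bar = sum(wi * y for wi, y in zip(w, counts)) / sw
        sxx = sum(wi * (x - x_bar) ** 2 for wi, x in zip(w, ages))
        sxy = sum(wi * (x - x_bar) * (y - y_bar) for wi, x, y in zip(w, ages, counts))
        slope = sxy / sxx
        intercept = y_bar - slope * x_bar
        mu = [max(intercept + slope * x, 1e-8) for x in ages]  # keep fitted means positive
    return intercept, slope

def poisson_deviance(ages, counts, intercept, slope):
    """Poisson deviance of the fitted line against the observed counts."""
    dev = 0.0
    for x, y in zip(ages, counts):
        mu = max(intercept + slope * x, 1e-8)
        log_term = y * math.log(y / mu) if y > 0 else 0.0
        dev += 2.0 * (log_term - (y - mu))
    return dev

def compare_two_families(family_a, family_b):
    """Each family is (ages, counts). Returns (LRT statistic, p-value on 2 df)."""
    pooled_ages = list(family_a[0]) + list(family_b[0])
    pooled_counts = list(family_a[1]) + list(family_b[1])
    a, b = fit_poisson_identity(pooled_ages, pooled_counts)
    dev_agnostic = poisson_deviance(pooled_ages, pooled_counts, a, b)
    dev_aware = 0.0
    for ages, counts in (family_a, family_b):
        fa, fb = fit_poisson_identity(ages, counts)
        dev_aware += poisson_deviance(ages, counts, fa, fb)
    stat = max(dev_agnostic - dev_aware, 0.0)
    # Chi-squared survival function with 2 degrees of freedom is exp(-stat / 2).
    return stat, math.exp(-stat / 2.0)
```

With only a handful of children per family, the deviance difference is driven by very few points, which is the power limitation noted above.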

6) For the parental age effect model the authors have correctly included "family-id", but for the rest of their analysis they have defined a "family" as the unique group of two F1 parents and their F2 offspring (e.g., Figure 3—figure supplement 1). Can the authors comment whether this might introduce biases in their analysis and filtering strategies, as some families are more related to each other than the rest?

For our analyses of inter-family variability, we defined “families” as the unique groups of second-generation parents and their third-generation offspring. Thus, in our regression models, the family_id term represents these 40 unique family IDs.

We note that in Figure 3D, there doesn’t appear to be any clear bias in terms of related families having very similar slopes, though we may be underpowered to detect differences between related second-generation individuals (see our response to comment #5 for an analysis of paternal age effects in two sets of brothers).

Additionally, we do not believe that the interconnected structure of some CEPH/Utah families would impact our filtering strategies. Our filters on depth, genotype quality, and allele balance, for example, should not be biased by possible relatedness between families.

Of course, if there truly are genetic modifiers of the mutation rate segregating in human populations, it is possible that more related families would have more similar parental age effects on DNM counts. In our manuscript, however, we are likely underpowered to detect such similarities, given the relatively small number of data points in each family.

7) Figure 4 shows that the number of DNMs shared with siblings does not appear to correlate with paternal age. Although it seems unlikely to affect the result, it seems odd to report these as raw counts without correcting for the number of siblings the child has. It would be good to report the strength of the correlation between family size and shared DNM count and correct the shared counts for family size before testing for a correlation with paternal age.

We agree, and now report the lack of correlation between family size and shared DNM count (p=0.426, subsection “Assessing age effects on post-PGCS DNMs”). Given this lack of correlation, and the fact that all but one CEPH/Utah family has at least 8 children, we do not anticipate that sibling number would impact the correlation between shared germline mosaic DNM number and parental age at birth.

8) Why, from Figure 4B, are the differences in mutational spectra found in the post-PGCS mosaic analysis only based on 289/721 of these DNMs (presumably the phased ones)?

When searching for shared germline mosaic variants in the third generation, we identify all of the apparent DNMs in the third generation that are shared with at least one sibling. Thus, if we identify a particular de novo mutation that is shared by 3 siblings, that DNM would be represented 3 times in the set of 721 (720 in the updated version of the manuscript) post-PGCS DNMs. For our comparisons of mutation spectra, we did not want to count the same mutation multiple times if it was shared by siblings; the “289” number therefore reflects the number of unique autosomal sites (i.e. unique autosomal mutational events) in the list of 721 shared mosaic DNMs.
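The distinction between per-sibling observations and unique mutational events can be illustrated with a short Python sketch. The record layout (family, sample, chromosome, position, ref, alt) and the toy records are hypothetical and are not the authors' data format.

```python
from collections import defaultdict

def shared_dnms(records):
    """Group DNM records by site within a family and keep sites seen in two or more siblings.

    `records` is an iterable of (family_id, sample_id, chrom, pos, ref, alt) tuples.
    """
    carriers = defaultdict(set)
    for family_id, sample_id, chrom, pos, ref, alt in records:
        carriers[(family_id, chrom, pos, ref, alt)].add(sample_id)
    return {site: samples for site, samples in carriers.items() if len(samples) >= 2}

def summarize(shared):
    """Return (number of per-sibling observations, number of unique sites)."""
    n_observations = sum(len(samples) for samples in shared.values())
    n_unique_sites = len(shared)
    return n_observations, n_unique_sites

# Toy records: one DNM shared by three siblings, one shared by two, one unshared.
toy_records = [
    ("fam_1", "kid_1", "chr1", 1000, "C", "T"),
    ("fam_1", "kid_2", "chr1", 1000, "C", "T"),
    ("fam_1", "kid_3", "chr1", 1000, "C", "T"),
    ("fam_1", "kid_1", "chr2", 2000, "A", "G"),
    ("fam_1", "kid_4", "chr2", 2000, "A", "G"),
    ("fam_1", "kid_2", "chr3", 3000, "G", "A"),
]
n_observations, n_unique_sites = summarize(shared_dnms(toy_records))
```

In this toy set there are five per-sibling observations but only two unique sites, which mirrors how 721 shared observations can correspond to a much smaller number of unique mutational events.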

9) A minor but important consideration here is as a term, "post-PGCS", seems to include any mutations that arise following the establishment of the germ cells, but what seems to be the intended meaning is those mutations that arise during germ cell proliferation (or related). Rewording would aid understanding here.

We agree that the term “post-PGCS” is a bit imprecise, since “normal” single-gamete germline DNMs technically occur post-PGCS, as well. Therefore, we instead refer to the “post-PGCS” variants as “shared/germline mosaic DNMs” throughout.

10) For the mutational spectra analysis of gonosomal mosaic DNMs, this and other similar analyses consider each allelic class independently. Would power increase by analyzing the data as a whole using, say, a Chi-squared six degree of freedom test?

The reviewers are correct that a Chi-squared test with six degrees of freedom might offer greater power to detect significant differences in mutation spectra. However, for the purposes of our analyses, we were interested in identifying specific differences for each mutation class separately.
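For comparison with the per-class approach, the omnibus test the reviewers describe can be sketched in Python as a single chi-squared test on a 2 x k table of mutation-class counts. This is a hedged illustration: the function names are hypothetical, and the closed-form p-value used here only applies when the degrees of freedom are even (as with seven classes and six degrees of freedom).

```python
import math

def chi2_sf_even_df(stat, df):
    """Chi-squared survival function, using the closed form valid for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive, even df")
    half = stat / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))

def omnibus_spectrum_test(counts_a, counts_b):
    """Chi-squared test of homogeneity between two mutation spectra.

    `counts_a` and `counts_b` map mutation class -> count. Classes absent
    from both sets are ignored. Returns (statistic, df, p_value).
    """
    classes = [c for c in sorted(set(counts_a) | set(counts_b))
               if counts_a.get(c, 0) + counts_b.get(c, 0) > 0]
    total_a = sum(counts_a.get(c, 0) for c in classes)
    total_b = sum(counts_b.get(c, 0) for c in classes)
    grand_total = total_a + total_b
    stat = 0.0
    for c in classes:
        column_total = counts_a.get(c, 0) + counts_b.get(c, 0)
        for observed, row_total in ((counts_a.get(c, 0), total_a),
                                    (counts_b.get(c, 0), total_b)):
            expected = row_total * column_total / grand_total
            stat += (observed - expected) ** 2 / expected
    df = len(classes) - 1
    return stat, df, chi2_sf_even_df(stat, df)
```

A single omnibus p-value says whether the two spectra differ anywhere, but not which class is responsible, which is why per-class tests remain useful when specific mutation types are of interest.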

11) The authors have applied the same filters as DNMs for identifying post-PGCS mosaic variants. They seem to have filtered candidates based on VAF > 0.2. Might this filter be too stringent for identifying the post-PGCS events?

We may be misunderstanding the reviewer’s concern here, but the post-PGCS variants should appear as “normal” germline DNMs in third-generation individuals that happen to be shared with other third-generation siblings. In other words, we would expect a post-PGCS DNM to be a heterozygous mutation present in every cell of the third-generation child; this is because the mutation actually occurred in a progenitor of the sperm or egg cell that ultimately fertilized the third-generation child’s embryo. To identify the post-PGCS mosaic variants, we search through all of the DNMs seen in the third generation, and find the DNMs that are shared by siblings. As a result, the same filters applied to the “normal,” single-gamete germline DNMs in the third generation (VAF >= 0.3) are applied to the post-PGCS variants.

12) To identify gonosomal mutations, the authors have applied a hard VAF cutoff of < 0.2. Considering the number of cell divisions before PGC, would this threshold be a bit too low? Would their observation change significantly if they change the threshold to < 0.3?

The reviewers raise an important concern here; namely, that if a post-zygotic mutation occurs very soon after fertilization of the embryo, it could be present in a large fraction of somatic cells, and manifest with VAF > 0.2. In the process of addressing this concern, we substantially improved our strategy for identifying gonosomal post-zygotic DNMs, and now use a phasing-by-transmission strategy rather than a VAF cutoff (see “Note from authors regarding post-zygotic mosaicism”). As a result, we no longer apply strict VAF cutoffs to the gonosomal mutations, and can identify gonosomal mutations present at VAF > 0.2.

13) What is the VAF distribution of candidate gonosomal mutations in F2? One of the filters they have used is VAF >= 0.3 in F2. Might this threshold be too lax? For the gonosomal mutations that occur in F1, an expectation of a higher VAF of almost 0.5 in the F2 set seems reasonable.

The reviewers are correct that the candidate gonosomal mutations (which occurred during the post-zygotic development of the F1 individuals) should be present at high VAF (~0.5) in the F2 individuals, since these mutations should have been inherited as “normal” heterozygous mutations by the F2 children. We applied a VAF filter of >= 0.3, simply because we expect the VAF for true heterozygous variants to be approximately normally distributed about a mean of 0.5, with the VAF for most of these variants falling between 0.3 and 0.7.
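The expectation that most true heterozygous variants fall between VAF 0.3 and 0.7 can be checked with a small Python sketch. It assumes idealized binomial sampling of reads at a fixed depth of 30 (chosen to match the ~30X data), which ignores overdispersion and reference bias; the function names are hypothetical.

```python
import math

def vaf(ref_reads, alt_reads):
    """Variant allele fraction: alternate reads over total reads."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads at this site")
    return alt_reads / total

def passes_het_filter(ref_reads, alt_reads, min_vaf=0.3):
    """Apply the VAF >= 0.3 filter described above."""
    return vaf(ref_reads, alt_reads) >= min_vaf

def fraction_of_hets_in_window(depth, low=0.3, high=0.7, p_alt=0.5):
    """Probability that a true heterozygote's VAF lies in [low, high] at the given depth."""
    total = 0.0
    for k in range(depth + 1):
        if low <= k / depth <= high:
            total += math.comb(depth, k) * p_alt ** k * (1 - p_alt) ** (depth - k)
    return total

in_window_30x = fraction_of_hets_in_window(30)
```

Under these assumptions, roughly 98% of true heterozygous sites at 30X would have a VAF between 0.3 and 0.7, so a lower bound of 0.3 would discard few inherited heterozygous variants.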

14) In mosaic post-PGCS analysis: the authors have identified 32 events with supporting alleles in F1. Among these 32 mutations, do any occur in families where F1s are related? For example, do F1 19_A mom and 19_B dad share some of these mosaic mutations? If so is there any correlation in mutational burden in F1 with the age of P0?

Some of the post-PGCS mutations with supporting evidence in a parent did occur in families where second-generation individuals were related (including family IDs 19 and 24). However, none of these mutations were found in multiple second-generation individuals (i.e., they each occurred at a unique site).

Associated Data

    Data Citations

    1. Sasani TA, Pedersen BS, Gao Z, Baird L, Przeworski M, Jorde LB, Quinlan AR. 2019. Genome sequencing of large, multigenerational CEPH/Utah families. NCBI dbGaP. phs001872.v1.p1

    Supplementary Materials

    Supplementary file 1. Pedigree structures for all CEPH/Utah families.

    All family and sample IDs have been anonymized, and the sexes of third-generation individuals have been hidden.

    elife-46922-supp1.zip (177KB, zip)
    DOI: 10.7554/eLife.46922.015
    Supplementary file 2. IGV images of 100 randomly selected germline DNMs identified in the second generation.

    In each image, the first two tracks contain alignments from the first-generation parents, and the third track contains the alignments for the second-generation child. Reads with mapping quality <20 are not included, as they were not considered by our variant calling pipeline, and mismatched bases are shaded by quality score (more transparent = lower base quality).

    DOI: 10.7554/eLife.46922.016
    Supplementary file 3. IGV images of 100 randomly selected germline DNMs identified in the third generation.

    In each image, the first two tracks contain alignments from the second-generation parents, and the third track contains the alignments for the third-generation child. Reads with mapping quality <20 are filtered out, as they were not considered by our variant calling pipeline, and mismatched bases are shaded by quality score (more transparent = lower base quality).

    DOI: 10.7554/eLife.46922.017
    Supplementary file 4. IGV images of all putative post-PGCS mosaic mutations.

    In each image, the first two tracks contain alignments from the two second-generation parents in the pedigree. All tracks below contain alignments from the third-generation children that share a DNM at the site. Reads with mapping quality <20 are filtered out, as they were not considered by our variant calling pipeline, and mismatched bases are shaded by quality score (more transparent = lower base quality).

    elife-46922-supp4.zip (7.2MB, zip)
    DOI: 10.7554/eLife.46922.018
    Supplementary file 5. IGV images of all putative gonosomal mutations identified in the second generation.

    In each image, the first two, three, or four tracks contain alignments from the grandparents in the pedigree (i.e., paternal grandmother and grandfather, maternal grandmother and grandfather). In some families, one or two of the first-generation grandparents were not sequenced (see Supplementary file 1). The two tracks below contain alignments from the second-generation individual with the putative gonosomal mutation and that second-generation individual’s spouse. The remaining tracks below contain alignments from the third-generation individuals that inherited the gonosomal mutation. Reads with mapping quality <20 are filtered out, as they were not considered by our variant calling pipeline, and mismatched bases are shaded by quality score (more transparent = lower base quality).

    elife-46922-supp5.zip (16.6MB, zip)
    DOI: 10.7554/eLife.46922.019
    Transparent reporting form
    DOI: 10.7554/eLife.46922.020

    Data Availability Statement

    Code used for statistical analysis and figure generation has been deposited on GitHub as a collection of annotated Jupyter Notebooks: https://github.com/quinlan-lab/ceph-dnm-manuscript (copy archived at https://github.com/elifesciences-publications/ceph-dnm-manuscript). Data files containing high-confidence de novo mutations, as well as the gonosomal and post-primordial germ cell specification (PGCS) mosaic mutations, are included with these Notebooks. To mitigate compatibility issues, we have also made all notebooks available in a Binder environment, accessible at the above GitHub repository. Aligned sequencing reads (in CRAM format) and variant calls (in VCF format) will be made available at the SRA and dbGaP under controlled access, with accession phs001872.v1.p1.

    The following dataset was generated:

    Sasani TA, Pedersen BS, Gao Z, Baird L, Przeworski M, Jorde LB, Quinlan AR. 2019. Genome sequencing of large, multigenerational CEPH/Utah families. NCBI dbGaP. phs001872.v1.p1


    Articles from eLife are provided here courtesy of eLife Sciences Publications, Ltd
