Elucidation of de novo small insertion/deletion biology with parent-of-origin phasing

Allison H Seiden; Felix Richter; Nihir Patel; Oscar L Rodriguez; Gintaras Deikus; Hardik Shah; Melissa Smith; Amy Roberts; Eileen C King; Robert P Sebra; Andrew J Sharp; Bruce D Gelb

doi:10.1002/humu.23971

Author manuscript; available in PMC: 2021 Apr 1.

Published in final edited form as: Hum Mutat. 2020 Jan 16;41(4):800–806. doi: 10.1002/humu.23971

PMCID: PMC7069802 NIHMSID: NIHMS1066608 PMID: 31898844

Abstract

The mechanisms underlying de novo insertion/deletion (indel) genesis, such as polymerase slippage, have been hypothesized but not well characterized in the human genome. We implemented two methodological improvements, which were leveraged to dissect indel mutagenesis. We assigned de novo variants to parent-of-origin (i.e., phasing) with low coverage long-read whole genome sequencing, achieving better phasing compared to short-read sequencing (medians of 84% and 23%, respectively). We then wrote an application programming interface to classify indels into three subtypes according to sequence context. Across three cohorts with different phasing methods (N_trios=540, all cohorts), we observed that one de novo indel subtype, change in copy count (CCC), was significantly correlated with father’s (P=7.1×10⁻⁴) but not mother’s (P=0.45) age at conception. We replicated this effect in three cohorts without de novo phasing (P_paternal=1.9×10⁻⁹, P_maternal=0.61, N_trios=3,391, all cohorts). Although this is consistent with polymerase slippage during spermatogenesis, the percentage of variance explained by paternal age was low, and we did not observe an association with replication timing. These results suggest that spermatogenesis-specific events have a minor role in CCC indel mutagenesis, one not observed for other indel subtypes nor for maternal age in general. These results have implications for indel modeling in evolution and disease.

Keywords: indels, de novo variants, long-read technology, parent-of-origin phasing, whole genome sequencing

Introduction

Genomes evolve through mutation, selection, and random drift. Mutations include single nucleotide variants (SNVs), short insertions and deletions (indels, ≤ 50-bp changes), and larger structural variations. Models of local indel mutation rates could serve as a powerful control for disease association studies as well as an invaluable tool for understanding evolution, especially in repetitive regions (Campbell & Eichler, 2013).

Modeling de novo indels requires understanding the mutational mechanisms that give rise to them (Table 1) (Kloosterman et al., 2015; Montgomery et al., 2013). Indel mechanisms can be investigated experimentally (e.g., biochemical assays, structural biology) (Garcia-Diaz & Kunkel, 2006) or at the population level through associations with parental age, relative genomic distribution, local sequence properties, and ancestry. Previously, sequence context was used to infer that three quarters of de novo indels arise through polymerase slippage, with the remaining quarter hypothesized to arise from double-strand break repair (e.g., non-homologous end-joining) and/or unknown mechanisms (Kloosterman et al., 2015; Montgomery et al., 2013). Polymerase slippage implies a replication-associated event consistent with spermatogenesis, yet previous studies could not detect an association between de novo indels and paternal age (Jónsson et al., 2017; Kloosterman et al., 2015). This might be attributable to a small sample size of parentally assigned (i.e., phased) indels, or not applying previously defined indel classification techniques to the new data.

Table 1.

Description and examples of indel classes.

Position	Indel	Context	Type
chr8:117967436	+T	TTAAATATTTTTTT	HR
chr5:52931910	−A	CCAATTAAAAAAA	HR
chr12:71038252	−GTG	TAACTTGTGGTGTTT	CCC
chr2:158890470	+GAG	CAGAACTGAGGAGCAT	CCC
chr20:39588834	−AGG	GAGAGAAGGAGATGT	non-CCC
chr4:14567444	+AACCC	ACATAATATATAACCCAACCTACCTT	non-CCC

HR=homopolymer run, CCC=change in copy count

Here, we describe the feasibility of using low-coverage single molecule real-time (SMRT), long-read sequencing (Pacific Biosciences) (N=10 trios) to phase de novo SNVs and indels identified with short-read (Illumina) whole-genome sequencing (WGS). We used these data as an orthogonal technology to validate previously observed associations between parental age and de novo SNVs. We then developed an application programming interface (API) (https://pypi.org/project/sorting-hat/) to classify indels based on sequence context (sorting hat). Finally, we combined results from three cohorts each using different phasing methods (long-read read-based phasing (N=10), short read-based phasing (N=305 trios) and three-generation haplotype phasing (N=225 families) (Jónsson et al., 2017)) to characterize indel mutagenesis.

Materials and methods

Subjects, whole genome sequencing, and de novo variant calling

Patients and parents (i.e., trios) were enrolled in the Pediatric Cardiac Genomics Consortium (PCGC) Congenital Heart Disease Network Study (CHD GENES: ClinicalTrials.gov identifier NCT01196182) (Gelb et al., 2013). The protocols were approved by the Institutional Review Boards of Boston Children’s Hospital, Brigham and Women’s Hospital, Children’s Hospital of Los Angeles, Children’s Hospital of Philadelphia, Columbia University Medical Center, Great Ormond Street Hospital, Icahn School of Medicine at Mount Sinai, Rochester School of Medicine and Dentistry, Steven and Alexandra Cohen Children’s Medical Center of New York, and Yale School of Medicine. All subjects or their parents provided informed consent and assent (if applicable).

Long-read SMRT sequencing

DNAs of ten PCGC patients were isolated and SMRTbells were prepared as recommended by the manufacturer. Samples were sequenced using v2.0 chemistry and 10-hour movies on a Sequel I System (Pacific Biosciences) using 7 SMRT cells to achieve a target depth of 10x WGS coverage. Ultimately, we achieved mean mapped read coverage ranging from 7–9x per sample and mean insert read lengths ranging from 5488–7268 base pairs per sample. Reads were then mapped to hg38 using NGMLR (Sedlazeck et al., 2018).

Short-read, paired-end sequencing

DNAs of PCGC samples underwent short-read sequencing at the Baylor College of Medicine Genomic and RNA Profiling Core, the New York Genome Center (NYGC) Genomic Research Services, and the Broad Institute for Genomic Services. Genomic DNAs extracted from venous blood or saliva were prepared for sequencing using a PCR-free or SK2-IES library preparation. All samples were sequenced on a HiSeq X Ten (Illumina) with 150-bp paired-end reads to a median depth >30x per individual. The Simons Foundation kindly provided the phenotypic and genomic data for 1627 autism-unaffected sibling-parent trios derived from sporadic autism quartets (Fischbach & Lord, 2010).

De novo indel identification and validation

We performed alignment, variant calling, and de novo identification for PCGC and Simons data. BWA-MEM was used to align reads to GRCh37 or GRCh38 and GATK Best Practices were used for indel realignment and identification via N+1 joint genotyping (DePristo et al., 2011; Li & Durbin, 2009; McKenna et al., 2010; Van der Auwera et al., 2002). The union of three de novo filtering pipelines was taken to prioritize de novo indels (Richter et al. 2019, in review). These pipelines applied heuristic filters that were combinations of GATK PASS, heterozygous ratio (0.2–0.8 or 0.3–0.7), parental homozygous ratio (<1%, <3.5%, or <5%), depth (>10 or 7–120), allele count (1 or ≤2), genotype quality (≥70 or >60 in proband, >30 in parents), and alternate allele depth (>7 in proband, <3 reads in parents). Following pooling, all variants were force-called with FreeBayes (Garrison & Marth, 2012) and their Integrative Genomics Viewer (IGV) (Robinson et al., 2011) plots were passed through a convolutional neural network trained on curated IGV plots (Qi et al. 2019, in prep.). Finally, DNVs were excluded if they were located in segmental duplications (score >0.99), low complexity regions, low mappability regions (300 bp, score <1), ENCODE blacklisted sites, or mucin/HLA genes (Bailey, Yavor, Massa, Trask, & Eichler, 2001; Bernstein et al., 2012; Derrien et al., 2012; Li, 2014). Variants identified using GRCh37 were lifted over to GRCh38 (Kent et al., 2002), and Sanger sequencing was used to confirm a subset.
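As a rough illustration of how such heuristic filters combine, the sketch below applies one hypothetical combination of the thresholds listed above to a single candidate variant. The `Candidate` fields and the function name are assumptions made for this example; they are not taken from any of the three pipelines, and the parental homozygous-ratio filter is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Candidate:
    """Hypothetical per-variant summary; field names are illustrative only."""
    gatk_pass: bool
    proband_alt_fraction: float        # alt reads / total reads in the proband
    proband_depth: int
    proband_gq: int
    proband_alt_depth: int
    parent_gqs: Tuple[int, int]        # (mother GQ, father GQ)
    parent_alt_depths: Tuple[int, int] # (mother alt reads, father alt reads)
    cohort_allele_count: int


def passes_example_filter(c: Candidate) -> bool:
    """Apply one illustrative combination of thresholds (an assumption)."""
    return (
        c.gatk_pass
        and 0.3 <= c.proband_alt_fraction <= 0.7      # heterozygous ratio
        and c.proband_depth > 10                      # depth
        and c.proband_gq >= 70                        # proband genotype quality
        and c.proband_alt_depth > 7                   # alt allele depth, proband
        and all(gq > 30 for gq in c.parent_gqs)       # parental genotype quality
        and all(ad < 3 for ad in c.parent_alt_depths) # alt reads in parents
        and c.cohort_allele_count == 1                # allele count in cohort
    )
```

Taking the union of several such filters, as described above, would amount to keeping a candidate if any one of the pipeline-specific predicates returns `True`.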

De novo indels from deCODE were obtained from Supplemental Data (Jónsson et al., 2017).

Phasing

WhatsHap (0.16) was used to phase de novo variants (DNVs) through read-back phasing (also known as read-pair tracing) (Martin et al., 2016). Inputs were either the short- or long-read alignment files and the trio VCFs generated from short-read sequencing. WhatsHap down-samples to the 15 most informative reads at each location. The output was a phased VCF, with the full trio and the proband’s variants phased.

Following phasing, DNVs were programmatically assigned to the parent-of-origin. WhatsHap was used to generate a GTF from the phased VCF, where the GTF genomic coordinates represented haplotype blocks of contiguously phased variants. DNVs were assigned to a parent-of-origin if ≥85% of informative variants (i.e., variants in a haplotype block) were assigned to that parent.
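The ≥85% rule can be expressed compactly. The following is a minimal sketch, assuming the informative variants in a DNV's haplotype block have already been counted per parent; the function name and its arguments are hypothetical and are not part of WhatsHap or of the study's code.

```python
from typing import Optional


def assign_parent_of_origin(n_paternal: int, n_maternal: int,
                            min_fraction: float = 0.85) -> Optional[str]:
    """Return 'paternal', 'maternal', or None when the DNV stays unphased.

    n_paternal / n_maternal are counts of informative variants in the
    haplotype block that support each parent (hypothetical inputs).
    """
    total = n_paternal + n_maternal
    if total == 0:
        return None  # no informative variants, so no assignment
    if n_paternal / total >= min_fraction:
        return "paternal"
    if n_maternal / total >= min_fraction:
        return "maternal"
    return None  # informative variants disagree too often
```

For example, a block with six paternal and one maternal informative variant (about 86%) would be assigned to the father, whereas five paternal and one maternal (about 83%) would remain unassigned.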

For long-read data, de novo indels were phased manually with the Integrative Genomics Viewer (IGV) (Robinson et al., 2011). For each de novo indel, we identified the 2–5 highest confidence reads (with Reference/Alternative alleles). These reads were highlighted with IGV, and we identified inherited SNVs on the informative reads. We then assigned the de novo indel to the parent-of-origin if all SNVs on the informative reads were in agreement. We validated this heuristic approach with the de novo indels phased using the short-read sequencing data. IGV plots used for de novo indel phasing are provided in the code repository (see Data Access). These are organized such that each patient has their own directory with indel subdirectories and final phasing results. The indel subdirectories contain IGV plots of the de novo indel and all informative variants, as well as a file describing how these informative variants contributed to the final phasing result (or lack thereof). The IGV plots are ordered, from top to bottom, as patient PacBio reads followed by patient, mother, and father Illumina reads. The interpretation of each IGV plot is provided in the descriptor file, including details on every variant used for phasing (e.g., inheritance patterns), criteria for creating zoomed-in or panoramic IGV plots, and the specific reads used for phasing.

Three-generation haplotype phasing results from deCODE were downloaded from Supplemental Data (Jónsson et al., 2017).

Replicating de novo SNV results

Variants were classified into one of eight mutational groups (C>A, C>G, C>T, T>A, T>C, T>G, CpG>TpG, and indel), with reverse complements grouped together (e.g., C>T and G>A are both C>T). Bedtools getfasta was used to identify CpG>TpG mutations by marking reference C alleles with an adjacent G, and reference G alleles with an adjacent C (Quinlan & Hall, 2010).
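The strand-collapsing logic for SNVs can be sketched as follows. This is an illustrative reimplementation, not the study's code: it assumes the immediately flanking reference bases are already known (e.g., extracted with bedtools getfasta), and it interprets "adjacent" as the 3' base for a reference C and the 5' base for a reference G. The function name is hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def classify_snv(ref: str, alt: str, prev_base: str, next_base: str) -> str:
    """Collapse an SNV onto the pyrimidine strand and flag CpG>TpG.

    prev_base and next_base are the reference bases immediately 5' and 3'
    of the variant on the forward strand.
    """
    ref, alt = ref.upper(), alt.upper()
    prev_base, next_base = prev_base.upper(), next_base.upper()
    if ref in "AG":
        # Purine reference: report the reverse-complement event instead,
        # which also swaps (and complements) the flanking bases.
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
        prev_base, next_base = COMPLEMENT[next_base], COMPLEMENT[prev_base]
    if ref == "C" and alt == "T" and next_base == "G":
        return "CpG>TpG"
    return f"{ref}>{alt}"
```

Under these assumptions, a G>A change preceded by a C is reported as CpG>TpG, while a G>A change in any other context is reported as C>T.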

Indel classification application programming interface (API)

The three classes of indels were defined as follows. Homopolymer run (HR) mutations occur in regions with six or more copies of the inserted/deleted single nucleotide. Changes in copy counts (CCCs) occur if the sequence being inserted or deleted has one or more repeats in the directly flanking bases. Finally, any indel not falling in the above categories was considered a non-CCC. These classes were chosen because they are consistently defined across multiple studies (Kloosterman et al., 2015; Montgomery et al., 2013). Indel classes that were not used are tandem repeats (TRs), predicted hotspots, and complex indels. Of these, only TRs were referenced in multiple studies. However, they had varying definitions and were called with TR-specific algorithms that could not be applied to the deCODE data.

The sorting-hat API was made to automate the indel classification process. Sorting-hat calls the bedtools getfasta command and collects flanking base pairs depending on indel length (Quinlan & Hall, 2010). Sorting-hat collects the six flanking bases if the indel was a single nucleotide, or 2*L flanking bases if the indel length (L) was greater than one. Sorting-hat also optionally annotates indels with the encompassing repeat if the RepeatMasker track from the UCSC Genome Browser is provided (Smit, Glusman, & Hubley, 2015).
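To make the three class definitions concrete, the sketch below classifies an indel from its sequence and the reference bases on either side. It is not the sorting-hat implementation and does not reproduce its interface; the function names are hypothetical. It assumes that the flanks exclude the inserted or deleted bases themselves, that copies are counted only when directly adjacent to the indel site, and that a single-nucleotide indel with fewer than six flanking copies is treated as a CCC.

```python
def flank_length(indel_len: int) -> int:
    """Flanking bases to collect per side: six for 1-bp indels, else 2*L."""
    return 6 if indel_len == 1 else 2 * indel_len


def adjacent_copies(unit: str, left_flank: str, right_flank: str) -> int:
    """Count tandem copies of `unit` that directly abut the indel site."""
    size = len(unit)
    copies = 0
    # Walk rightwards from the indel site, one unit at a time.
    i = 0
    while right_flank[i * size:(i + 1) * size] == unit:
        copies += 1
        i += 1
    # Walk leftwards from the indel site, one unit at a time.
    j = 0
    while (j + 1) * size <= len(left_flank):
        end = len(left_flank) - j * size
        if left_flank[end - size:end] != unit:
            break
        copies += 1
        j += 1
    return copies


def classify_indel(indel_seq: str, left_flank: str, right_flank: str) -> str:
    """Return 'HR', 'CCC', or 'non-CCC' under the assumptions stated above."""
    unit = indel_seq.upper()
    if not unit:
        raise ValueError("indel sequence must be non-empty")
    copies = adjacent_copies(unit, left_flank.upper(), right_flank.upper())
    if len(unit) == 1 and copies >= 6:
        return "HR"
    if copies >= 1:
        return "CCC"
    return "non-CCC"
```

As a usage example, `classify_indel("GTG", "TAACTT", "GTGTTT")` finds one adjacent copy of GTG and returns `"CCC"`, whereas `classify_indel("AGG", "GAGAGA", "AGATGT")` finds none and returns `"non-CCC"`. These flank strings are made up for illustration and are not claimed to match the exact coordinates in Table 1.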

Association with replication timing

One embryonic stem cell line (BG02) and 14 other Repli-seq wavelet-smoothed bigwig tracks were downloaded from ENCODE (Hansen et al., 2010). De novo indels, lifted from hg38 to hg19 (Kent et al., 2002), were overlapped with each Repli-seq track. We tested for associations using replication timing from either the embryonic stem cell line or the mean of all 15 lines, averaged per variant. Following annotation, indels were classified as late or early replicating if they fell above or below the median of replication timing values. We compared the proportion of paternal de novo CCCs in early versus late replicating regions using a Fisher’s exact test.
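The median split and the 2×2 comparison can be sketched with the standard library alone. This is an illustrative implementation rather than the analysis code: the function names are hypothetical, ties at the median are arbitrarily placed in the "below" group, and which side of the median corresponds to early or late replication depends on the sign convention of the Repli-seq track, so the labels are kept neutral.

```python
from math import comb
from statistics import median


def split_at_median(values):
    """Label each value 'above' or 'below' the median (ties go to 'below')."""
    m = median(values)
    return ["above" if v > m else "below" for v in values]


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(k: int) -> float:
        return comb(row1, k) * comb(row2, col1 - k) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(prob(k) for k in range(lo, hi + 1)
                if prob(k) <= p_obs * (1 + 1e-7))
    return min(1.0, total)
```

In this framing, the rows of the table would be the two replication-timing groups and the columns the two categories being compared, such as indels from younger versus older fathers.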

Statistics

Correlations were quantified with the Spearman’s correlation coefficient, with statistical significance assessed with two-tailed p-values. The meta-analysis p-values were calculated with Fisher’s method, in which constituent p-values were log-transformed, multiplied by −2, and summed, generating a chi-squared test statistic (with 2 * N_p-values degrees of freedom, N=number of constituent p-values) that was converted to a final meta-analysis p-value. The meta-analysis Spearman’s correlation coefficients were calculated using the metacor function in the meta R package on Fisher’s z-transformed correlations.
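Fisher's method as described above can be written in a few lines. The sketch below is illustrative and uses only the standard library; it relies on the closed-form upper tail of the chi-squared distribution, which is available because the degrees of freedom (2k for k p-values) are always even. The function name is hypothetical.

```python
from math import exp, log


def fisher_method(p_values):
    """Combine p-values with Fisher's method.

    Returns (chi-squared statistic, combined p-value), where the statistic
    is -2 * sum(ln p) with 2 * len(p_values) degrees of freedom.
    """
    if not p_values or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must be non-empty and lie in (0, 1]")
    k = len(p_values)
    statistic = -2.0 * sum(log(p) for p in p_values)
    half = statistic / 2.0
    # For 2k degrees of freedom the upper tail is
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return statistic, exp(-half) * total
```

With a single input the method returns that p-value unchanged, which is a convenient sanity check: `fisher_method([0.5])` yields a combined p-value of 0.5.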

A mixed effects Poisson regression was also performed as secondary evidence and to provide models for scatterplot fitting using the glmer function from lme4 (Bates, Mächler, Bolker, & Walker, 2015). Parental age slope and intercept were considered cohort-specific random effects, while cohort was a fixed effect (model formula: Indels/trio ~ parental age + (parental age | cohort)). For Poisson models, all variance/mean ratios were <2.

For unphased trios, Poisson regressions were performed instead of Spearman’s rank correlations to account for both maternal and paternal age in the same model (model formula: Indels/trio ~ father’s age + mother’s age). As with phased trios, separate regressions were fit for every cohort (PCGC, Simons, and deCODE) in every indel class, and p-values were meta-analyzed across studies with Fisher’s method.

For all comparisons within the same hypothesis space, p-values or the false positive rate (α) were Bonferroni adjusted.

Results

The most common method for phasing DNVs utilizes the presence of an inherited variant within the same read or read pair and depends on read length. Using read-pair tracing from short-read sequencing, previous studies found that parent-of-origin can be assigned for ~20% of DNVs (Jónsson et al., 2017; Kloosterman et al., 2015). We used read-pair tracing to phase DNVs identified with short-read whole genome sequencing (Figure 1), achieving similar results (23%, N=305 trios). As might be predicted, using long-read sequencing data, we were able to phase a considerably higher fraction of the DNVs (84%, N=10 trios) identified with short-read sequencing.

Figure 1. (a) Number of SNVs and indels phased in each cohort. (b) Histograms of the fraction of de novo indels.

Having phased DNVs in two cohorts, we first sought to replicate known associations of de novo SNVs with parental age. Previously, C>A and T>G DNVs were observed to be significantly enriched for paternal origin, while C>T DNVs were found to be significantly enriched for maternal origin (Jónsson et al., 2017). We observed the same directionality for all three variant classes (Supp. Figure S1).

Others previously observed a correlation between unphased de novo indels and paternal, but not maternal, age at conception (Jónsson et al., 2017). We corroborated and dissected this observation in cohorts representing three types of phasing: long-read (PacBio) and short-read (Illumina) read-pair tracing (described above) as well as indels phased with three-generation haplotype phasing (Jónsson et al., 2017). Building on a previous categorization system (Montgomery et al., 2013), we first classified de novo indels into homopolymer runs (HRs), non-HR changes in copy count (CCCs), and non-CCCs (Table 1) using the sorting-hat API (see Methods). In total, 156 HRs, 653 CCCs, and 569 non-CCCs were phased across all three methods (Supp. Table S1). We observed a similar percent of de novo HRs (5–13%), CCCs (45–48%), and non-CCCs (39–48%) across all three methods, consistent with previous results (Supp. Table S2) (Kloosterman et al., 2015). We found the highest deletion/insertion ratio in non-CCCs (range 9:1 to 24:1), also observed previously (Kloosterman et al., 2015). Non-homologous repair is hypothesized to be the primary origin of small non-CCC deletions, and the high deletion/insertion ratio for non-CCCs is consistent with this mechanism (Kloosterman et al., 2015).

The ratio of paternal to maternal indels was similar across all three cohorts (range 3:1 to 4:1), with no difference in proportions of any indel class (Supp. Figure S2). We then compared the number of phased indels per class with parental age (Figure 2 and Supp. Figure S3). In a meta-analysis of all approaches, the only statistically significant correlation (i.e., P_Bonferroni<6.3×10⁻³) occurred between paternally phased CCCs and father’s age (two-sided meta-analysis P=7.1×10⁻⁴). This result was confirmed with Poisson mixed effects regression across all three cohorts (P=1.5×10⁻³). The meta-analysis Spearman’s r=0.16 (95% C.I. 0.07–0.25) corresponds to 2.7% of variance (95% C.I. 0.5%−6.3%) in paternal de novo CCC indels per individual explained by paternal age.
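The step from a correlation coefficient to a percentage of variance explained is simply squaring. The short sketch below applies this to the rounded values quoted above; the function name is hypothetical. Because the inputs are rounded to two decimals, the squared point estimate comes out marginally below the reported 2.7%, which presumably was computed from the unrounded coefficient.

```python
def variance_explained(r: float) -> float:
    """Fraction of variance explained by a correlation coefficient r."""
    return r * r


# Rounded meta-analysis estimate and 95% confidence limits from the text.
for label, r in [("estimate", 0.16), ("lower", 0.07), ("upper", 0.25)]:
    print(f"{label}: r = {r:.2f}, r^2 = {variance_explained(r):.2%}")
```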

Figure 2. (a) Spearman’s correlation coefficient and 95% confidence intervals for maternal or paternal age at conception and the number of indels/trio for three indel classes for three methods. (b) Fisher’s method meta-analysis of correlation p-values shows a significant correlation between paternal age and CCC indels (significance P<6.3×10⁻³). (c) Scatter plots illustrating the association between parental age at conception and the number of CCC indels, fitted with mixed effects Poisson regression (Supp. Figure S2 for HR, non-CCC, and all indel scatter plots). HR=homopolymer run, CCC=change in copy count (excluding HRs), non-CCC=indels not classified as HR or CCC.

We sub-divided the paternally phased CCCs to derive further insight into the mechanisms of indel mutagenesis. De novo CCC indels identified in repeat-masked (Smit et al., 2015) genomic regions had similar correlations with paternal age (meta-analysis Spearman’s r=0.12, P_Bonferroni=0.03, 233 paternal CCC indels) compared to CCCs outside of repeats (meta-analysis Spearman’s r=0.09, P_Bonferroni=0.29, 278 paternal CCC indels). We sub-divided CCCs into insertions and deletions, observing significant correlations with paternal age and deletions (meta-analysis Spearman’s r=0.18, P_Bonferroni=4.9×10⁻⁴, 355 paternal CCC deletions) but not insertions (meta-analysis Spearman’s r=0.06, P_Bonferroni=1.0, 156 paternal CCC insertions). We found consistent trends for paternal de novo CCC indels within and outside of repeats, a similar trend for paternal de novo non-CCC indels, and the opposite trend for de novo HR indels (Supp. Table S3). None of these trends were statistically significant (P<6×10⁻³), possibly reflecting power limitations in the context of lower effect sizes. We also investigated if the number of de novo CCC indels in fathers was associated with replication timing in embryonic stem cells (Hansen et al., 2010), building on previous associations with de novo SNV data (Francioli et al., 2015). However, we observed no difference in the proportion of paternal de novo CCC indels in late and early replicating regions in embryonic stem cells (261/539 versus 249/539, P=0.50). Within late-replicating regions, there was also no difference in the proportion of de novo CCCs from younger and older fathers (135/270 versus 126/269, P=0.51). We observed a similar lack of significance for replication timing averaged across 15 diverse cell lines.

To further corroborate these findings, we tested the correlation between parental age and de novo indels in three trio cohorts with unphased de novo variants, all ascertained with Illumina short-read sequencing at a depth of 30x. De novo variants in two of these cohorts (PCGC and SFARI) were identified using the same pipeline as for phased variants, and a subset of these were previously Sanger sequenced (Richter et al. 2019, in review). PCR validation confirmed 41/41 CCC, 25/26 non-CCC, and 12/16 HR de novo indels (58/62 deletions and 20/21 insertions). Consistent with findings in the phased data, we observed statistically significant associations (P_Bonferroni<6.3×10⁻³) between all and CCC de novo indels and paternal age at conception (two-sided meta-analysis P_All=1.4×10⁻¹⁴, P_CCC=1.9×10⁻⁹) but not maternal age at conception (P_All=0.04, P_CCC=0.61) (Supp. Figure S4). We also observed statistically significant associations between father’s age and non-CCC indels. There was no association with maternal age for any de novo indel class and no association between number of HRs and parental age for either parent.

Discussion

We phased de novo variants to gain a deeper understanding of germline indel mutagenesis. Phasing with long-read sequencing was successful, as illustrated by our ability to phase roughly four times as many variants as was possible using short-read sequencing. We replicated previously defined associations for de novo SNVs and the predominance of deletions among de novo non-CCC indels. The latter observation is consistent with the hypothesis that most non-CCCs arise from double-strand break repair through non-homologous end-joining.

We observed a novel but expected correlation between de novo CCC indels and father’s age at conception. Because CCCs are hypothesized to arise during polymerase slippage, the paternal age association is consistent with spermatogenesis-associated DNA replication errors. However, paternal age explained a small fraction of the variance in de novo CCC counts per trio, suggesting there are other, likely polymerase-mediated, events contributing to CCCs. When sub-dividing CCC indels, we observed significant correlations within repeats and with deletions (but not insertions), suggesting that the underlying mutagenesis processes differ between deletions and insertions and highlighting further directions for mechanistic explorations. We observed a similar trend for non-CCC deletions compared to insertions, but this observation was limited by the insertion sample size, an order of magnitude lower than for deletions. We also observed trends supporting a possible role for both maternal and paternal age in the development of other types of de novo indels, including a trend for de novo HRs, which are also hypothesized to be associated with polymerase slippage. In contrast to HRs, which were limited by sample sizes, de novo non-CCCs had similar sample sizes to CCCs, suggesting that correlations with paternal age, if any, are even lower than those observed for CCCs.

De novo indel identification is error prone and inferences warrant caution. However, our results, which represent the largest aggregation of phased and unphased de novo indels to date, are robust across thousands of WGS trios ascertained through multiple centers, pipelines, sequencing technologies, and phasing methods. Further limiting this concern, true positive rates from PCR validations were >90% for de novo indels. Finally, the lack of significant correlations with maternal age is consistent with a lack of bias towards data obtained from older parents.

Taken together, we have demonstrated that grouping de novo indels into HRs, CCCs, and non-CCCs provides a valuable framework for interpreting indel mutagenesis and have developed an easy-to-use API for this classification. The CCC-paternal-age correlation is consistent with hypothesized mechanisms but explains a low percentage of indel mutagenesis, suggesting there are other, as yet unknown, mechanisms at play.

Availability

The whole genome sequencing data generated in this study are available on dbGaP (www.ncbi.nlm.nih.gov/gap) under accession numbers phs001138.v2.p2 and phs001194.v2.p2.

Code and IGV images: https://github.com/allisonseiden/longreadclustersequencing/

Sorting-hat: https://pypi.org/project/sorting-hat/

Supplementary Material

Supplementary information: NIHMS1066608-supplement-supp_info.docx (801.4 KB)

Acknowledgments

We are grateful to the patients and families who participated in this research, and thank the following for patient recruitment: A. Julian, M. Mac Neal, Y. Mendez, T. Mendiz-Ramdeen and C. Mintz (Icahn School of Medicine at Mount Sinai); N. Cross (Yale School of Medicine); J. Ellashek and N. Tran (Children’s Hospital of Los Angeles); B. McDonough, J. Geva and M. Borensztein (Harvard Medical School), K. Flack, L. Panesar and N. Taylor (University College London); E. Taillie (University of Rochester School of Medicine and Dentistry); S. Edman, J. Garbarini, J. Tusi and S. Woyciechowski (Children’s Hospital of Philadelphia); D. Awad, C. Breton, K. Celia, C. Duarte, D. Etwaru, N. Fishman, M. Kaspakoval, J. Kline, R. Korsin, A. Lanz, E. Marquez, D. Queen, A. Rodriguez, J. Rose, J.K. Sond, D. Warburton, A. Wilpers and R. Yee (Columbia Medical School); D. Gruber (Cohen Children’s Medical Center, Northwell Health). These data were generated by the Pediatric Cardiac Genomics Consortium (PCGC), under the auspices of the National Heart, Lung, and Blood Institute’s Bench to Bassinet Program (https://benchtobassinet.com). The results analyzed and published here are based in part on data generated by Gabriella Miller Kids First Pediatric Research Program projects phs001138.v1.p2/phs001194.v1.p2, and were accessed from the Kids First Data Resource Portal (https://kidsfirstdrc.org/) and/or dbGaP (www.ncbi.nlm.nih.gov/gap). This manuscript was prepared in collaboration with investigators of the PCGC and has been reviewed and/or approved by the PCGC. PCGC investigators are listed at https://benchtobassinet.com/Centers/PCGCCenters.aspx. The Pediatric Cardiac Genomics Consortium (PCGC) program is funded by the National Heart, Lung, and Blood Institute, National Institutes of Health, U.S. Department of Health and Human Services through grants UM1HL128711, UM1HL098162, UM1HL098147, UM1HL098123, UM1HL128761, and U01HL131003.
The PCGC Kids First study includes data sequenced by the Broad Institute (U24 HD090743–01). This work was supported by the National Institute of Dental and Craniofacial Research Interdisciplinary Training in Systems and Developmental Biology and Birth Defects [T32HD075735 to F.R.] and Mount Sinai Medical Scientist Training Program [5T32GM007280 to F.R.]. This work was supported in part through the computational resources and staff expertise provided by Scientific Computing at the Icahn School of Medicine at Mount Sinai.

Funding information

T32HD075735, 5T32GM007280, U24 HD090743–01, UM1HL128711, UM1HL098162, UM1HL098147, UM1HL098123, UM1HL128761, U01HL131003

Footnotes

Disclosure declaration

No conflicts of interest to disclose.

References

Bailey JA, Yavor AM, Massa HF, Trask BJ, & Eichler EE (2001). Segmental duplications: organization and impact within the current human genome project assembly. Genome Research, 11(6), 1005–1017. doi:10.1101/gr.187101
Bates D, Mächler M, Bolker B, & Walker S (2015). Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software, 67(1), 1–48. doi:10.18637/jss.v067.i01
Bernstein BE, Birney E, Dunham I, Green ED, Gunter C, & Snyder M (2012). An integrated encyclopedia of DNA elements in the human genome. Nature, 489(7414), 57–74. doi:10.1038/nature11247
Campbell CD, & Eichler EE (2013). Properties and rates of germline mutations in humans. Trends in Genetics, 29(10), 575–584. doi:10.1016/j.tig.2013.04.005
DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, … Daly MJ (2011). A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics, 43(5), 491–498. doi:10.1038/ng.806
Derrien T, Estellé J, Marco Sola S, Knowles DG, Raineri E, Guigó R, & Ribeca P (2012). Fast Computation and Applications of Genome Mappability. PLoS ONE, 7(1), e30377. doi:10.1371/journal.pone.0030377
Fischbach GD, & Lord C (2010). The Simons Simplex Collection: a resource for identification of autism genetic risk factors. Neuron, 68(2), 192–195. doi:10.1016/j.neuron.2010.10.006
Francioli LC, Polak PP, Koren A, Menelaou A, Chun S, Renkens I, … Sunyaev SR (2015). Genome-wide patterns and properties of de novo mutations in humans. Nature Genetics, 47(7), 822–826. doi:10.1038/ng.3292
Garcia-Diaz M, & Kunkel TA (2006). Mechanism of a genetic glissando*: structural biology of indel mutations. Trends in Biochemical Sciences, 31(4), 206–214. doi:10.1016/j.tibs.2006.02.004
Garrison E, & Marth G (2012). Haplotype-based variant detection from short-read sequencing.
Gelb B, Brueckner M, Chung W, Goldmuntz E, Kaltman J, Kaski JP, … Pearson G (2013). The Congenital Heart Disease Genetic Network Study: rationale, design, and early results. Circulation Research, 112(4), 698–706. doi:10.1161/CIRCRESAHA.111.300297
Hansen RS, Thomas S, Sandstrom R, Canfield TK, Thurman RE, Weaver M, … Stamatoyannopoulos JA (2010). Sequencing newly replicated DNA reveals widespread plasticity in human replication timing. Proceedings of the National Academy of Sciences, 107(1), 139–144. doi:10.1073/pnas.0912402107
Jónsson H, Sulem P, Kehr B, Kristmundsdottir S, Zink F, Hjartarson E, … Stefansson K (2017). Parental influence on human germline de novo mutations in 1,548 trios from Iceland. Nature, 549(7673), 519–522. doi:10.1038/nature24018
Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, & Haussler D (2002). The human genome browser at UCSC. Genome Research, 12(6), 996–1006. doi:10.1101/gr.229102. Article published online before print in May 2002
Kloosterman WP, Francioli LC, Hormozdiari F, Marschall T, Hehir-Kwa JY, Abdellaoui A, … Guryev V (2015). Characteristics of de novo structural changes in the human genome. Genome Research, 25(6), 792–801. doi:10.1101/gr.185041.114
Li H (2014). Towards Better Understanding of Artifacts in Variant Calling from High-Coverage Samples. doi:10.1093/bioinformatics/btu356
Li H, & Durbin R (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics (Oxford, England), 25(14), 1754–1760. doi:10.1093/bioinformatics/btp324
Martin M, Patterson M, Garg S, Fischer SO, Pisanti N, Klau GW, … Marschall T (2016). WhatsHap: fast and accurate read-based phasing. BioRxiv, 085050. doi:10.1101/085050
McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, … DePristo MA (2010). The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research, 20(9), 1297–1303. doi:10.1101/gr.107524.110
Montgomery SB, Goode DL, Kvikstad E, Albers CA, Zhang ZD, Mu XJ, … Lunter G (2013). The origin, evolution, and functional impact of short insertion-deletion variants identified in 179 human genomes. Genome Research, 23(5), 749–761. doi:10.1101/gr.148718.112
Quinlan AR, & Hall IM (2010). BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 26(6), 841–842. doi:10.1093/bioinformatics/btq033
Robinson JT, Thorvaldsdóttir H, Winckler W, Guttman M, Lander ES, Getz G, & Mesirov JP (2011). Integrative genomics viewer. Nature Biotechnology, 29(1), 24–26. doi:10.1038/nbt.1754
Sedlazeck FJ, Rescheneder P, Smolka M, Fang H, Nattestad M, von Haeseler A, & Schatz MC (2018). Accurate detection of complex structural variations using single-molecule sequencing. Nature Methods, 15(6), 461–468. doi:10.1038/s41592-018-0001-7
Smit A, Glusman G, & Hubley R (2015). RepeatMasker 4.0. Retrieved from http://www.repeatmasker.org/
Van der Auwera GA, Carneiro MO, Hartl C, Poplin R, Del Angel G, Levy-Moonshine A, … DePristo MA (2002). Current Protocols in Bioinformatics. In Bateman A, Pearson WR, Stein LD, Stormo GD, & Yates JR (Eds.), Current protocols in bioinformatics / editorial board, Andreas D. Baxevanis ... [et al.] (Vol. 11). doi:10.1002/0471250953

Associated Data

Supplementary Materials

supp info

NIHMS1066608-supplement-supp_info.docx (801.4 KB, docx)
