. 2022 Sep 10;43(12):1979–1993. doi: 10.1002/humu.24455

de novo variant calling identifies cancer mutation signatures in the 1000 Genomes Project

Jeffrey K Ng 1, Pankaj Vats 2, Elyn Fritz‐Waters 3, Stephanie Sarkar 1, Eleanor I Sams 1, Evin M Padhi 1, Zachary L Payne 1, Shawn Leonard 3, Marc A West 2, Chandler Prince 3, Lee Trani 4, Marshall Jansen 3, George Vacek 2, Mehrzad Samadi 2, Timothy T Harkins 2, Craig Pohl 3, Tychele N Turner 1
PMCID: PMC9771978  NIHMSID: NIHMS1833872  PMID: 36054329

Abstract

Detection of de novo variants (DNVs) is critical for studies of disease‐related variation and mutation rates. To accelerate DNV calling, we developed a graphics processing unit (GPU)‐based workflow. We applied our workflow to whole‐genome sequencing data from three parent‐child sequenced cohorts including the Simons Simplex Collection (SSC), Simons Foundation Powering Autism Research (SPARK), and the 1000 Genomes Project (1000G) that were sequenced using DNA from blood, saliva, and lymphoblastoid cell lines (LCLs), respectively. The SSC and SPARK DNV callsets were within expectations for number of DNVs, percent at CpG sites, phasing to the paternal chromosome of origin, and average allele balance. However, the 1000G DNV callset was not within expectations and contained excessive DNVs that are likely cell line artifacts. Mutation signature analysis revealed 30% of 1000G DNV signatures matched B‐cell lymphoma. Furthermore, we found variants in DNA repair genes and at ClinVar pathogenic or likely‐pathogenic sites, and a significant excess of protein‐coding DNVs in IGLL5, a gene known to be involved in B‐cell lymphomas. Our study provides a new rapid DNV caller for the field and elucidates important implications of using sequencing data from LCLs for reference building and disease‐related projects.

Keywords: 1000 Genomes Project, cell line artifacts, de novo variants, GPU accelerated workflow, Simons Simplex Collection

1. INTRODUCTION

de novo variants (DNVs) are important for assessing mutation rates (Ségurel et al., 2014) and have been shown to contribute to human disease (e.g., autism [Feliciano et al., 2019; Iossifov et al., 2014; O'Roak et al., 2012; O'Roak et al., 2011; De Rubeis et al., 2014; Sanders et al., 2012; Turner et al., 2016, 2017, 2019], epilepsy [Allen et al., 2013; Helbig et al., 2016], intellectual disability [DDD, 2017; Gilissen et al., 2014; Kaplanis et al., 2020; de Ligt et al., 2012], congenital heart disorders [Homsy et al., 2015; Sifrim et al., 2016; Zaidi et al., 2013]). Typically, the calling of DNVs from raw sequence data to final calls can take days to weeks. Multiple DNV workflows exist that primarily rely on CPU‐based approaches (Allen et al., 2013; Besenbacher et al., 2015; Boomsma et al., 2014; Chesi et al., 2013; Conrad et al., 2011; DDD, 2017; Dimassi et al., 2016; Dong et al., 2014; Feliciano et al., 2019; Fromer et al., 2014; Homsy et al., 2015; Iossifov et al., 2012, 2014; Jónsson et al., 2017; Kaplanis et al., 2020; de Ligt et al., 2012; McCarthy et al., 2014; Narzisi et al., 2014; O'Roak et al., 2012; O'Roak et al., 2011; Ramu et al., 2013; De Rubeis et al., 2014; Sanders et al., 2012; Turner et al., 2016, 2017). These workflows employ different strategies including strict filtering, utilizing multiple variant callers as opposed to using only one, machine‐learning, and incorporation of genotypic information at other sites around the genome. Overall, there is no community consensus on a standard method for detecting DNVs. It is imperative that this process be streamlined and flexible to enable broad adoption across the community. In this study, we developed a rapid workflow to accelerate DNV calling using graphics processing units (GPUs) that is integrated into NVIDIA Parabricks (Franke & Crowgey, 2020) software. We chose GPUs because they contain thousands of simple cores and can be used for increased parallelization.
Since we want to encourage broad use of our DNV workflow, we also developed an equivalent, freely available open‐source, CPU‐based version of the workflow. Together, the GPU‐based workflow, Hare, and the CPU‐based workflow, Tortoise, make up Hare And Tortoise (HAT).

Our desire for a standardized, rapid DNV workflow stems from our interest in detecting these DNVs in the large amount of whole‐genome sequencing (WGS) data from families with neurodevelopmental disorders that has recently become available (https://anvilproject.org/data). Studies assessing individuals with WGS data based on DNA derived from blood have provided the field with our best estimates of DNV characteristics in humans (Carlson et al., 2018; Kessler et al., 2020; Ségurel et al., 2014). One recent data set, with DNA derived from blood, consisting of 4216 parent‐child whole‐genome sequenced trios from the Simons Simplex Collection (SSC) has been extensively studied for DNVs (An et al., 2018; Turner et al., 2017; Werling et al., 2018; Wilfert et al., 2021). We processed this data with HAT and found that our method performed well. We also assessed a second WGS data set of 1326 parent‐child sequenced trios from Simons Foundation Powering Autism Research (SPARK) (“SPARK: A US cohort of 50,000 families to accelerate autism research,” 2018). This WGS data set was sequenced on DNA derived from saliva and our DNV calls also met expectations.

This led us to assess the recently generated, publicly available, WGS data set from a cohort called the 1000 Genomes Project (1000G), where our initial goal was to build a DNV data set from control individuals. Overall, 1000G is a data resource for the study of genetic variation that includes individuals from diverse genetic ancestries (Auton et al., 2015; Byrska‐Bishop et al., 2021). Represented in the data are 602 trios from 18 worldwide populations (Supporting Information: Figure S1). Moreover, as a field standard, 1000G has been utilized in many applications as a control resource for filtering of genetic variation by allele frequency and/or variant presence‐absence in the data set (Rehm et al., 2013).

One complicating factor of DNV assessment in this resource is the fact that sequencing data is generated from DNA isolated from lymphoblastoid cell lines (LCLs) (Byrska‐Bishop et al., 2021) as opposed to primary tissue. Epstein‐Barr virus (EBV) is used to make these LCLs and passaging over time enables the accumulation of cell line artifacts. These artifacts can complicate variant filtration schemes and the utility of this data as a frequency control. As opposed to a random accumulation of mutations in each individual, we found that 80% of 1000G individuals had an excess of DNVs and 30% of all 1000G individuals had a mutation signature matching a B‐cell lymphoma. The similarity to this cancer is problematic, and it is imperative that this data not be used as a control in studies of these and related cancers. A secondary consequence of the excess DNVs is their presence at disease‐related sites, whereby simple filtering schemes may accidentally remove sites of interest in patients due to their presence in 1000G.

2. MATERIALS AND METHODS

2.1. Software code availability

The description for HAT can be found at https://github.com/TNTurnerLab/HAT. HAT v1.0 was used for analyses in this paper and is available at https://github.com/TNTurnerLab/GPU_accelerated_de_novo_workflow. We also developed a fully open‐source CPU‐based version of the code, Tortoise, that does not require the NVIDIA Parabricks license; it is available at https://github.com/TNTurnerLab/Tortoise. We found that Tortoise is just as accurate as Hare, with a high level of overlap between the two versions when tested on NA12878 and the monozygotic twin pair (Supporting Information: Figure S2).

2.2. 1000 genomes trio WGS data set

As described previously (Byrska‐Bishop et al., 2021), a total of 602 trios from the 1000G were whole‐genome sequenced, from LCL DNA, at the New York Genome Center. We downloaded the publicly available aligned data files (crams), totaling around 27TB, onto the Washington University Information Technology's Research Infrastructure Services (RIS), a Load Sharing Facility based, high compute server for further analysis described below. The download locations are described here http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/1000G_2504_high_coverage.sequence.index and here http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/1000G_698_related_high_coverage.sequence.index. Details of the 602 trios are found in Supporting Information: Table S2.

2.3. SSC WGS data set

We downloaded SSC WGS alignment files (crams) from SFARI Base using Globus, totaling around 239TB, onto the RIS. Importantly, these genomes were sequenced, from DNA derived from blood, at the New York Genome Center (Wilfert et al., 2021). We utilized the crams as the starting point for running in HAT. In total, we assessed 8,922 individuals from both quad (unaffected father, unaffected mother, one child with autism, one child without autism) and trio (unaffected father, unaffected mother, one child with autism) families resulting in a total of 4,216 parent‐child sequenced trios. The following individuals were not present in the Globus link and were excluded from the study: SSC03147, SSC03138, SSC03133, SSC03146, SSC06708, SSC06703, SSC06699.

2.4. SPARK WGS data set

We downloaded SPARK WGS1 alignment files (crams) from SFARI Base using Globus, totaling around 62 TB, onto the RIS. These genomes were sequenced, from DNA derived from saliva, at the New York Genome Center. We utilized the crams as the starting point for running in HAT. In total, we assessed 2629 individuals from both quad (unaffected father, unaffected mother, one child with autism, one child without autism) and trio (unaffected father, unaffected mother, one child with autism) families resulting in a total of 1326 parent‐child sequenced trios.

2.5. Single‐nucleotide variant and insertion/deletion calling

The NVIDIA Parabricks program version 3.0.0 was utilized to call single‐nucleotide variants and small insertions/deletions (indels) with GATK (McKenna et al., 2010) version 4.1.0 and Google's DeepVariant (Poplin et al., 2018) version 0.10 using default parameters (note for DeepVariant the model_type utilized is WGS). The reference genome utilized for these analyses was GRCh38_full_analysis_set_plus_decoy_hla.fa as the data was originally mapped to this reference genome (Byrska‐Bishop et al., 2021). For each individual, a GVCF was generated for these two variant callers. The GVCFs were then genotyped, on a per trio basis, using the GLnexus (Yun et al., 2020) version 1.2.6 joint genotyper using prebuilt configs for each respective caller. Post‐calling, we checked the counts of all variants and heterozygous variants per chromosome in each individual as a quality check (Supporting Information: Figure S3).

2.6. DNVs calling

DNVs were called by identifying all putative DNVs in GATK and DeepVariant based on the parent and child genotypes, respectively. Specifically, the parent genotypes had to be homozygous for the reference allele (i.e., 0/0) and the child had to be, at a minimum, heterozygous for the alternate allele (e.g., 0/1, 1/1). DNVs identified in both GATK and DeepVariant (intersection of the two callers) were then retained and further filtering was carried out as follows: depth, in each trio member, at the DNV position had to be ≥10, the genotype quality of the DNV had to be >20, the DNV had to have an allele balance (AB) >0.25, and the DNV allele could not be present in any reads in the parents. Finally, we removed DNVs in low complexity regions, centromeres, and recent repeats from further analysis.
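The genotype and filtering criteria above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the HAT implementation: the `Call` structure, the function names, and the choice to apply the genotype quality and AB thresholds to the child's call are hypothetical, and genotypes are assumed to be unphased biallelic strings as written in the text.

```python
from dataclasses import dataclass


@dataclass
class Call:
    """Genotype call for one trio member at one site (hypothetical structure)."""
    gt: str         # genotype string, e.g. "0/0", "0/1", "1/1"
    depth: int      # total read depth at the site
    gq: int         # genotype quality
    alt_reads: int  # reads supporting the alternate allele


def allele_balance(call: Call) -> float:
    """Fraction of reads supporting the alternate allele."""
    return call.alt_reads / call.depth if call.depth else 0.0


def is_candidate_dnv(father: Call, mother: Call, child: Call) -> bool:
    """Both parents 0/0 and the child carrying at least one alternate allele."""
    parents_hom_ref = father.gt == "0/0" and mother.gt == "0/0"
    return parents_hom_ref and child.gt in {"0/1", "1/1"}


def intersect_callers(gatk_sites, deepvariant_sites):
    """Keep only sites, keyed as (chrom, pos, ref, alt), found by both callers."""
    return set(gatk_sites) & set(deepvariant_sites)


def passes_filters(father: Call, mother: Call, child: Call,
                   min_depth: int = 10, min_gq: int = 20,
                   min_ab: float = 0.25) -> bool:
    """Depth >= 10 in every member, GQ > 20, AB > 0.25, no parental alt reads."""
    return (
        all(c.depth >= min_depth for c in (father, mother, child))
        and child.gq > min_gq            # assumption: GQ is taken from the child
        and allele_balance(child) > min_ab
        and father.alt_reads == 0
        and mother.alt_reads == 0
    )
```

Removal of low complexity regions, centromeres, and recent repeats would follow as a separate interval-exclusion step and is not shown here.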

To assess the quality of our DNVs, we manually scored 3980 sites, by visualizing the underlying read data in each trio member, with SAMtools version 1.9 tview. To score these sites, we looked at the first column (variant location in the read data as seen in tview images) of both parents and the proband sample to see what variants were present (example shown in Supporting Information: Figure S4). If there was any variant in the first column of the mother or father, regardless of quality, that matched the main variant in the proband's first column, then we denoted the variant as maternal, paternal, or both, depending on whether the matching variant was in the mother, the father, or both parents. If the main variant in the first column of the parental samples did not match the proband's variant, then we scored the site as a DNV, thus verifying our results.

As a second check of our DNVs, we randomly sampled 25 DNV sites in NA12878 and performed Sanger sequencing in NA12878 and parents (NA12891, NA12892). Primers were designed using Primer3Plus (https://primer3plus.com) to target each of the 25 variants. PCR reactions were run using the primers, genomic DNA for individuals NA12878 (Coriell tube label NA12878 * N44 12/02/2019), NA12891 (Coriell tube label NA12891 * H3 7/25/2019), and NA12892 (Coriell tube label NA12892 * F3 8/6/2019), and Thermo Scientific Phusion High‐Fidelity PCR Master Mix with HF Buffer. All PCR products underwent PCR clean‐up and Sanger sequencing through Genewiz (https://www.genewiz.com). Trace files with the Sanger sequencing data were assembled and visualized as chromatograms using Sequencher 5.4.6 (http://www.genecodes.com). For 24 of the variants, the result from Sanger sequencing was clear. However, for site chr11_134531608_C_G we saw evidence of the alternate allele at a low frequency. To test whether this signal was real, we pursued deep sequencing of the amplicon on an Oxford Nanopore Technologies (ONT) MinION sequencer as follows. PCR products for amplicon chr11_134531608_C_G, in each of the three individuals, underwent purification using the QIAquick PCR Purification Kit. A library of the purified products was prepared using the ONT Rapid Barcoding Kit (SQK‐RBK004). Sequencing of the library was performed using the ONT MinION sequencer and the MinKNOW software. The fastq output files containing the sequencing data for all three samples were mapped to the amplicon reference sequence using minimap2 (Li, 2018) (version 2.21) and all had coverage depth >100x. A bam file and indexed bam file were generated for each sample using SAMtools (Li et al., 2009) (version 1.9). The bam files were then visualized using the Integrative Genomics Viewer (Robinson et al., 2011) to determine the count of each nucleotide base at the variant position.

2.7. Phasing of DNVs

We utilized Unfazed version 1.0.2 (https://github.com/jbelyeu/unfazed) (Belyeu et al., 2021) to phase the DNVs in our study with regard to the parent‐of‐origin chromosome. First, a bed file containing DNVs was generated for each individual. Second, the de novo bed file, DeepVariant full genome trio VCF, and the alignment files for all trio members were run through Unfazed. Since Unfazed uses different approaches to phasing on the X chromosome in males and females, we only focus on phased variants on the autosomes in this study.

2.8. NA12878 additional data sets

We identified additional high‐coverage WGS data from NA12878 from the SRA (https://www.ncbi.nlm.nih.gov/sra) and other sources. These included SRA data SRR944138 from 2012 and SRR952827 from 2013, McDonnell Genome Institute data gerald_HFKWMDSXX and H_IJ‐NA12878 both from 2018, and the high‐coverage data from 2020. To avoid differences due to coverage, we downsampled all data sets to 30x using SAMtools. All data was remapped to build 38 using SpeedSeq (Chiang et al., 2015) version 0.1.2 and run through the DNV workflow using the NA12891 and NA12892 parental WGS data from 2020 1000G. We again did a count check for total and heterozygous variants per chromosome (Supporting Information: Figure S5).

2.9. Phylogenetic tree of DNVs

To assess the differences between different NA12878 replicates we built a multi‐sequence FASTA file where each FASTA represents the aggregate of all possible DNVs identified in this individual. The specific steps to build the tree were as follows: (1) we first merged the samples together and converted the genotypes for each DNV site from 0/0 or 0/1 to the nucleotide counterparts (e.g., AA, CG, TC) for all of the NA12878 samples; (2) next we converted these genotype symbols to their IUPAC code; (3) we then collapsed the IUPAC symbols into a sequence per sample and placed them into a FASTA file. We also included a reference “sample,” which was just the reference allele at each DNV site; and (4) we used MEGAX (Kumar et al., 2018) version 10.2.4 to create a maximum likelihood phylogenetic tree.
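Steps (1) to (3) can be illustrated with a short Python sketch. It is not the authors' script: the function names and input layout are hypothetical, and it assumes biallelic single-nucleotide DNVs with no missing genotypes, since IUPAC ambiguity codes cannot represent indels.

```python
# IUPAC nucleotide codes for one- and two-base genotypes.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
}


def genotype_to_iupac(gt: str, ref: str, alt: str) -> str:
    """Convert a genotype such as '0/1' into the IUPAC code of its two bases."""
    alleles = [ref, alt]
    indices = gt.replace("|", "/").split("/")
    bases = frozenset(alleles[int(i)] for i in indices)
    return IUPAC[bases]


def build_fasta(sites, genotypes) -> str:
    """Build a multi-sequence FASTA with one record per sample.

    sites:     list of (ref, alt) tuples, one per DNV site, in a fixed order
    genotypes: dict mapping sample name -> list of genotype strings per site
    A 'reference' record holding the reference allele at each site is added.
    """
    records = {"reference": "".join(ref for ref, _ in sites)}
    for sample, gts in genotypes.items():
        records[sample] = "".join(
            genotype_to_iupac(gt, ref, alt)
            for gt, (ref, alt) in zip(gts, sites)
        )
    return "".join(f">{name}\n{seq}\n" for name, seq in records.items())
```

The resulting FASTA has one column per DNV site, so every record has the same length and can be passed to a tree-building tool as an alignment.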

2.10. Mutation signature assessment

We utilized the deconstructSigs (Rosenthal et al., 2016) software version 1.9.0 inside of Parabricks to perform mutation signature analysis. The most prominent signature was chosen for each individual; if there was no single prominent signature and the weights of two signatures were each ≥0.31, both signatures were represented in the tables and figures.

2.11. Karyotype analysis

Read‐depth based karyotypes were generated by assessment of the aligned sequence data. First, the number of reads per chromosome was calculated using SAMtools (Li et al., 2009) in each individual. Second, the size of each chromosome was generated using the reference genome data and by removing locations of gaps from the reference. Third, the copy number of each of the chromosomes was calculated as follows: ([fold coverage per chromosome]/[fold coverage of chromosome 1]) × 2.
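As a minimal sketch of the third step, the calculation can be written as below. The function and argument names are hypothetical, and the read counts and chromosome sizes in the usage note are toy values. Because every chromosome is divided by the chromosome 1 value, a constant read length cancels out, so reads per ungapped base is used as a stand-in for fold coverage.

```python
def chromosome_copy_number(read_counts, ungapped_lengths, baseline="chr1"):
    """Estimate copy number per chromosome relative to a diploid baseline.

    read_counts:      dict of chromosome -> number of mapped reads
    ungapped_lengths: dict of chromosome -> reference length minus gap bases
    baseline:         chromosome assumed to be present in two copies
    """
    # Reads per ungapped base is proportional to fold coverage.
    coverage = {
        chrom: read_counts[chrom] / ungapped_lengths[chrom]
        for chrom in read_counts
    }
    return {
        chrom: 2 * cov / coverage[baseline]
        for chrom, cov in coverage.items()
    }
```

With toy inputs `{"chr1": 2000, "chr21": 300, "chrX": 100}` reads and `{"chr1": 1000, "chr21": 100, "chrX": 100}` ungapped bases, this returns 2.0 for chr1, 3.0 for chr21 (a trisomy-like signal), and 1.0 for chrX (as expected for a single X chromosome).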

2.12. Viral analysis

We ran SAMtools idxstats on all individuals to determine the number of mapped reads to each chromosome. We then calculated the copy number of EBV in each individual as follows: EBV copy number = ([mapped reads to EBV × 150 base pairs per read]/length of EBV)/([mapped reads to chromosome 1 × 150 base pairs per read]/length of chromosome 1).
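The calculation can be sketched from `samtools idxstats` output, which is tab-delimited with the columns reference name, sequence length, mapped reads, and unmapped reads. The function names are hypothetical, and the contig names `chrEBV` and `chr1` are assumptions about how the reference labels these sequences; adjust them to match the reference in use.

```python
def parse_idxstats(text):
    """Parse `samtools idxstats` output into {name: (length, mapped_reads)}."""
    stats = {}
    for line in text.strip().splitlines():
        name, length, mapped, _unmapped = line.split("\t")
        stats[name] = (int(length), int(mapped))
    return stats


def ebv_copy_number(stats, ebv="chrEBV", baseline="chr1", read_length=150):
    """Ratio of EBV fold coverage to chromosome 1 fold coverage."""
    ebv_length, ebv_reads = stats[ebv]
    base_length, base_reads = stats[baseline]
    ebv_coverage = ebv_reads * read_length / ebv_length
    base_coverage = base_reads * read_length / base_length
    return ebv_coverage / base_coverage
```

As written in the text, the ratio is taken directly against chromosome 1 coverage, so the value is expressed relative to the chromosome 1 read depth rather than per haploid genome.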

2.13. DNV enrichment in genes

To test for DNV enrichment in genes we utilized two methods: chimpanzee‐human and denovolyzeR. These were run as previously described (Coe et al., 2019; Turner et al., 2019).

2.14. Annotation of protein‐coding DNVs

We uploaded the DNV calls to the open‐cravat program (https://opencravat.org/) and specifically identified ClinVar as one of the annotation categories. Rescoring of DNVs was performed using Franklin (https://franklin.genoox.com).

3. RESULTS

3.1. Rapid DNV calling with GPUs

HAT consists of three main steps: GVCF generation, family‐level genotyping, and filtering of variants to get final DNVs. We utilized existing features of the NVIDIA Parabricks software for rapid GVCF generation from GPU accelerated versions of GATK (McKenna et al., 2010) HaplotypeCaller and Google DeepVariant (Poplin et al., 2018). The run times for GVCF generation are ~40 min per sample on a 4 GPU node and can be run in parallel on all three family members in the parent‐child trio. In comparison, on our HPC, the CPU version of GATK HaplotypeCaller runs on average around 2.5 days on a 2 CPU node and DeepVariant around 8 h on a 32 CPU node. Post‐GVCF generation, the trio is genotyped using the GLnexus joint genotyper (Yun et al., 2020). Finally, our custom post‐genotyping DNV filtering workflow runs in ~1 h. Acceleration at all steps, together with parallelization, provides a clear advantage over CPU‐based approaches (Figure 1a).

Figure 1.


de novo variant calling in short‐read whole‐genome sequencing data. (a) de novo workflow for detection of DNVs from aligned read files (crams); (b) and (c) benchmarking DNV workflow in a monozygotic twin pair sequenced from DNA derived from blood; (d) DNV detection in four trios in the 1000 Genomes Project (1000G). DNVs, de novo variants.

To benchmark HAT, we tested it on a monozygotic twin pair with WGS data derived from blood DNA. These individuals should share the same DNVs from generation in the germline. However, they may differ at some sites if DNVs occur in a post‐zygotic, somatic manner. The twins shared 75 autosomal DNVs and contained 83 and 81 autosomal DNVs, respectively (Figure 1b,c). The percent of DNVs at CpG sites was 19.3% and 17.2%, respectively, in line with previously published estimates of ~20% (Ségurel et al., 2014; Turner et al., 2017) (Figure 1b,c). As this monozygotic twin pair was discordant for the phenotype of autism, we also tested whether there were any protein‐coding DNV differences between the two twins. These would potentially be relevant for autism, but there were no such differences.

To establish a DNV callset from the 1000G data set, we started with the assessment of DNVs with HAT in four trios from the 1000G (Figure 1d). Two were chosen at random (i.e., HG00405, HG00408) and two were chosen because they were “famous” trios assessed in many other studies (i.e., NA12878 [Conrad et al., 2011; Zook et al., 2014], NA19240 [Conrad et al., 2011]). One of these trios (HG00405) had 70 DNVs and a CpG percent of 21.4 as we would have expected from DNA derived from blood. To our surprise, the other trios had varying numbers of DNVs from 592 to 2230 with NA12878 having the most DNVs. As the number of DNVs increased, the number of DNVs at CpG sites decreased considerably (down to ~10%). We also assessed 3598 of the DNVs from the four trios by manual visual inspection of the underlying reads in each family member (Supporting Information: Table S1) and found that 93.6% of the variants appeared to be true DNVs, 4.9% were inherited, and 1.5% were low confidence calls.

3.2. Differences in DNVs in blood/saliva and LCLs

Our initial observations led us to assess three main cohorts: the 602 trios from 1000G (Supporting Information: Table S2) with DNA derived from LCLs, 4,216 trios from the SSC with DNA derived from blood, and 1,326 trios from the SPARK cohort with DNA derived from saliva. In 1000G, we detected 445,711 total DNVs in the cohort (Supporting Information: Table S3) with 740 ± 968 (mean ± standard deviation) DNVs per individual (Supporting Information: Table S4). There was a clear bimodal distribution in this DNV data set (Hartigan's dip test: D = 0.033, p = 1.32 × 10⁻⁴) wherein some individuals contained an excess of DNVs (Figure 2a). In the SSC, we identified 329,589 total DNVs with 78 ± 15 DNVs per individual (Figure 2b, Supporting Information: Table S5). In the SPARK cohort, we detected 101,337 total DNVs with 76 ± 16 DNVs per individual (Figure 2c, Supporting Information: Table S6). The values derived from the SSC and SPARK data sets are in line with expectation. However, the values in the 1000G are higher than expected and we estimated the number of individuals with expected numbers of DNVs by splitting the 1000G data into two groups: individuals having less than or equal to 100 DNVs (n = 123) and individuals with greater than 100 DNVs (n = 479). This estimate suggests that only 20.4% of trios in the 1000G have the expected number of DNVs. We hypothesized that individuals with excess DNVs may have cell line artifacts due to culturing of LCLs.

Figure 2.


Comparison of characteristics of DNVs detected in 1000G, Simons Simplex Collection (SSC), and Simons Foundation Powering Autism Research (SPARK) callsets. (a) Histogram of DNV counts from 1000G in 602 trios; (b) histogram of DNV counts from SSC in 4,216 trios; (c) histogram of DNV counts from SPARK in 1,326 trios; (d) percent of DNVs found within CpG sites versus the total number of DNVs for 1000G; (e) percent of DNVs found within CpG sites versus the total number of DNVs found for SSC; (f) percent of DNVs found within CpG sites versus the total number of DNVs found for SPARK; (g) percent of autosomal DNVs with paternal parent of origin versus the total number of DNVs for 1000G; (h) percent of autosomal DNVs with paternal parent of origin versus the total number of DNVs for SSC; (i) percent of autosomal DNVs with paternal parent of origin versus the total number of DNVs for SPARK. 1000G, 1000 Genomes Project; DNVs, de novo variants.

We assessed three main features of typical DNVs to investigate our hypothesis that the excess DNVs found in individuals were cell line artifacts. These features are DNVs at CpG locations, the percent of DNVs arising on the paternal chromosome, and the average AB for each of the DNV callsets.

As mentioned previously, the percent of DNVs at CpG sites should be ~20%, the percent of DNVs arising on the paternal chromosome should be ~80%, and the average AB should be ~0.5 (Sasani et al., 2019). In the 1000G trios, 14 ± 4.4% of DNVs per individual occurred at CpG sites (Figure 2d). In individuals with less than or equal to 100 DNVs, 17.4 ± 5.2% of DNVs occurred at a CpG and in families with greater than 100 DNVs 12.7 ± 3.6% of DNVs occurred at a CpG. The difference in DNVs at CpG sites between these two groups was significant (Wilcoxon rank sum: p < 2.2 × 10⁻¹⁶). In the SSC, 18 ± 4.7% of DNVs are at a CpG site (Figure 2e) and in SPARK, 18 ± 4.6% of DNVs occurred at a CpG site (Figure 2f). The SSC and SPARK estimates are in line with expectation.

In 1000G, the percent of DNVs that were phase‐able for parent‐of‐origin was 37.2 ± 7.5% (Supporting Information: Figure S6, Table 2). Of the phased variants, 61 ± 11.3% were on the chromosome of paternal origin (Figure 2g, Supporting Information: Table S7, Table 2). In the families with less than or equal to 100 DNVs this rose to 72.0 ± 8.5% and in the families with greater than 100 DNVs it fell to 58.6 ± 10.3% (Table 2). This difference in percent phased variants of paternal origin was significant (Wilcoxon rank sum: p < 2.2 × 10⁻¹⁶). The drop leveled off to ~50% in the individuals with the most DNVs (Figure 2g). In the SSC, 37% of DNVs were phase‐able (Supporting Information: Figure S7, Table 2) with 75 ± 9.2% (Figure 2h, Table 2) phased to the paternal chromosome of origin. In SPARK, 38% of DNVs (Supporting Information: Figure S8, Table 2) were phase‐able, with 74 ± 9.8% (Figure 2i, Table 2) of DNVs phased to paternal chromosome of origin. Both SSC and SPARK fall in line with expectation.

Table 2.

Phasing metrics from application of Unfazed to our DNV data sets

All values are mean ± standard deviation.

Data set | % of DNVs phase‐able | % of DNVs phased to the paternal chromosome | % phased to the paternal chromosome in trios with ≤100 DNVs | % phased to the paternal chromosome in trios with >100 DNVs
1000 Genomes Project | 37.2 ± 7.5 | 61.3 ± 11.3 | 72.0 ± 8.5 | 58.6 ± 10.3
SSC | 37.3 ± 6.6 | 75.1 ± 9.2 | 74.4 ± 8.7 | 75.1 ± 9.4
SPARK WGS1 | 37.5 ± 6.6 | 74.3 ± 9.8 | 74.0 ± 8.8 | 74.4 ± 9.8

Abbreviations: DNVs, de novo variants; SPARK, Simons Foundation Powering Autism Research; SSC, Simons Simplex Collection.

Finally, AB for DNVs was assessed in 1000G, SSC (Supporting Information: Figure S9A) and SPARK (Supporting Information: Figure S9B). We found that the 1000G had a mean AB of 0.42 and in the SSC and SPARK it was nearly perfect at an AB of 0.48 in line with expectation of 0.5. This indicated a lower average AB level in 1000G from newly arising mutations likely from cell line artifacts.

Overall, these comparisons showed that the DNVs in 1000G individuals with less than or equal to 100 DNVs behaved more like true DNVs in regard to CpG percentage, percent arising on the paternal chromosome, and AB. This also was true for the SSC and SPARK trios, where DNA was derived from blood and saliva, respectively. However, the DNVs in 1000G individuals with >100 DNVs did not have statistical properties matching expectations of true DNVs, providing evidence that they are potential cell line artifacts.

It was critical to assess both SSC and SPARK in these estimates since the DNA used for WGS in both cohorts was derived from cells that are not cultured in vitro in a lab setting. In SSC, the DNA was derived from blood and in SPARK the DNA was derived from saliva. These two cell types are also of different developmental lineages. The finding of consistent DNV characteristics across these two cohorts is an important aspect of this study. Unlike 1000G, parental age at birth data is available for SSC and SPARK. This allowed us to perform one final comparison of these two cohorts: assessment of the number of DNVs versus parental age at birth (Table 3). In the analysis, we used a linear model and identified an increase of ~1.7 DNVs per year with the SSC and SPARK father's age at birth and an increase of ~1.6–1.7 DNVs per year with the SSC and SPARK mother's age at birth (Table 3). These results are consistent with one another even though the DNVs are derived from WGS of different DNA sources and developmental lineages.

Table 3.

Number of DNVs and age at birth assessment using a linear regression model

Measure | SSC, father's age at birth | SSC, mother's age at birth | SPARK, father's age at birth | SPARK, mother's age at birth
Intercept | 21.51 | 29.467 | 22.43 | 26.605
Number of DNVs per parent age at birth | 1.728 [1.615−1.820] | 1.574 [1.448−1.699] | 1.697 [1.589−1.805] | 1.673 [1.513−1.814]
Parental age (mean ± standard deviation) | 33.118 ± 5.567 | 31.097 ± 4.961 | 31.816 ± 6.104 | 29.776 ± 5.187

Abbreviations: DNVs, de novo variants; SPARK, Simons Foundation Powering Autism Research; SSC, Simons Simplex Collection.
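As a worked check of Table 3, the fitted lines can be evaluated at the mean parental age of each cohort. This sketch assumes each column is a separate single-predictor linear model of DNVs per child on one parent's age at birth (an interpretation, not stated explicitly in the text); the dictionary layout and function name are illustrative.

```python
# (intercept, DNVs per year of parental age, mean parental age) from Table 3
TABLE3 = {
    ("SSC", "father"): (21.51, 1.728, 33.118),
    ("SSC", "mother"): (29.467, 1.574, 31.097),
    ("SPARK", "father"): (22.43, 1.697, 31.816),
    ("SPARK", "mother"): (26.605, 1.673, 29.776),
}


def expected_dnvs(intercept: float, slope: float, age: float) -> float:
    """Predicted number of DNVs for a parent of the given age at birth."""
    return intercept + slope * age


if __name__ == "__main__":
    for (cohort, parent), (b0, b1, mean_age) in TABLE3.items():
        predicted = expected_dnvs(b0, b1, mean_age)
        print(f"{cohort} {parent}: {predicted:.1f} DNVs at mean age {mean_age}")
```

Under that assumption, the predictions at the mean ages come out at roughly 78 to 79 DNVs for SSC and about 76 DNVs for SPARK, which agrees with the per-individual cohort means reported above (78 ± 15 and 76 ± 16).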

3.3. DNVs by 1000G population

While we expected there to be no difference in DNV counts per individual by ancestry, we sought to see if there were any populations with excess DNVs (Figure 3a). The population with the most DNVs was the CEU, having on average 1688 DNVs per individual. We hypothesized that this may be because the CEU is one of the oldest cohorts in the 1000 Genomes Project, dating back to the HapMap project (Gibbs et al., 2003), and these individuals may have cell lines that have been cultured more over time than other populations.

Figure 3.


Assessment of five replicates of NA12878. (a) Population distribution of 1000G data set. (b) UpSet plot demonstrating the number of variants detected in the replicates (at the bottom of the plot the percent of true DNVs is listed for each category). (c) Phylogenetic tree of the five replicates. 1000G, 1000 Genomes Project; DNVs, de novo variants.

3.4. DNVs increase over time

We utilized the fact that the 1000G individual NA12878 has been studied and sequenced multiple times over the past 10 years by WGS (Byrska‐Bishop et al., 2021) (SRA identifiers: SRR944138 and SRR952827). Presumably, across time, the utilization of NA12878 has required additional culturing of this cell line, and potentially even by different laboratories. We aggregated five Illumina WGS data sets from this individual, downsampled them to ~30x coverage, and assessed them with HAT. The data for this individual ranged from the year 2012 to the year 2020 and we found that the 2012 experiment had the fewest DNVs (n = 2,060) and the 2020 experiment had the most DNVs (n = 2,230) (Figure 3b). Overall, the five replicates had a large overlap of DNVs (n = 1,820) across all samples. These shared DNVs constitute what were present in the ancestor of all the cell line replicates. DNVs not shared by all five replicates are sometimes shared by a subset of the replicates and are sometimes unique to the replicate. To formally assess the ancestral state, we built a phylogenetic tree based only on the DNVs and saw that the farthest replicates from each other in the tree were the 2012 and 2020 replicates (Figure 3c). To further assess the DNVs in NA12878, we randomly sampled 25 DNVs from the union data set from the five replicates. We performed Sanger sequencing on DNA from NA12878 and her parents (NA12891, NA12892) (Supporting Information: Figures S10–S34, Supporting Information: Table S8). We found that 24 of the 25 DNV sites gave clear results in the Sanger sequencing with 23 confirming as real DNVs. Surprisingly, we found two sites, chr12_91353615_T_C and chr13_81142986_T_A, that were determined to be true DNVs but were previously reported as false positives using Sanger sequencing (Conrad et al., 2011).
In the Sanger experiments, one site chr11_134531608_C_G showed subtle evidence for the variant allele in NA12878, so we pursued deep amplicon sequencing of this region in the trio using ONT sequencing (Supporting Information: Figure S35). This resulted in a variant allele frequency of 11% in NA12878 suggesting this is a cell line artifact. This was elevated in comparison to the background rate of 1% in NA12891 and 0% in NA12892, that is in line with expectation for error rates from ONT. Intriguingly, this DNV was only found in one of the NA12878 replicates (2018_1). Overall, this indicates that 96% of DNVs called with HAT are real (24/25) and this estimate is close to the 93.6% we saw by manual inspection of underlying read data at 3598 DNV sites (see above).
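The replicate comparison in this section (DNVs shared by all replicates, DNVs private to one replicate, and which replicates are farthest apart) can be illustrated with plain set operations. The sketch below is not the HAT workflow or the tree-building method used in the study; the replicate names, the variant identifiers, and the use of Jaccard distance as a stand-in for tree distance are all assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical toy callsets: replicate name -> set of DNV identifiers.
# Real identifiers would look like "chr12_91353615_T_C".
replicates = {
    "rep_2012": {"v1", "v2", "v3"},
    "rep_2018": {"v1", "v2", "v3", "v4"},
    "rep_2020": {"v1", "v2", "v3", "v4", "v5", "v6"},
}


def shared_calls(reps):
    """DNVs present in every replicate (candidate ancestral variants)."""
    return set.intersection(*reps.values())


def private_calls(reps):
    """DNVs seen in exactly one replicate, keyed by replicate name."""
    result = {}
    for name, calls in reps.items():
        others = set().union(*(c for n, c in reps.items() if n != name))
        result[name] = calls - others
    return result


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B|; 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def farthest_pair(reps):
    """Pair of replicate names with the largest Jaccard distance."""
    return max(
        combinations(sorted(reps), 2),
        key=lambda pair: jaccard_distance(reps[pair[0]], reps[pair[1]]),
    )


if __name__ == "__main__":
    print("shared:", sorted(shared_calls(replicates)))
    print("private:", {k: sorted(v) for k, v in private_calls(replicates).items()})
    print("farthest pair:", farthest_pair(replicates))
```

In this toy input the earliest and latest replicates are the most distant pair, which mirrors the pattern reported for NA12878 but is a property of the made-up data, not a reproduction of the result.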

3.5. Genomes with cancer mutation signatures

We used mutation signature analysis (Alexandrov et al., 2013; Rosenthal et al., 2016) (Supporting Information: Table S9) to determine whether the DNVs identified in individuals from the 1000G had any distinguishing characteristics. For this analysis, we utilized a method that enables comparisons to known mutational signatures that are either age‐related (reminiscent of true DNVs) or seen in cancers (Figure 4a,b). There were 186 individuals (30%) with a strong contribution from an age‐related signature (Signature 1A, Signature 1B). To our surprise, the other contributing signatures were primarily those associated with B‐cell lymphomas (Signature 9 and Signature 17), found in 178 individuals (30%). We also saw that Signature 9, which was a contributing signature in 169 individuals, was correlated with the number of DNVs (Pearson's correlation R = 0.54, p < 2.2 × 10−16). This was intriguing because LCLs are generated from B‐cells that are infected with Epstein‐Barr virus (Strachan & Read, 2018). It indicates that new mutations are not arising in a random manner; rather, they are being generated in a manner consistent with the development of cancer in the same cell type.
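Mutation signature methods start by tallying single‐nucleotide substitutions into six classes named by the pyrimidine of the reference base pair (C>A, C>G, C>T, T>A, T>C, T>G), so that a G>A change is counted as C>T on the opposite strand. The sketch below shows only that first tallying step for variant identifiers written as chrom_pos_ref_alt, the format used in this article. It leaves out the flanking‐base (trinucleotide) context and the signature fitting itself; the function names are illustrative and are not taken from the tools cited above.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def substitution_class(variant_id):
    """Return the pyrimidine-normalized class (e.g. 'C>T') for an SNV.

    variant_id is expected as 'chrom_pos_ref_alt'. Indels and any
    non-ACGT alleles return None because they have no substitution class.
    """
    # rsplit keeps contig names that themselves contain underscores intact.
    _, ref, alt = variant_id.rsplit("_", 2)
    if len(ref) != 1 or len(alt) != 1:
        return None
    if ref not in COMPLEMENT or alt not in COMPLEMENT or ref == alt:
        return None
    if ref in ("A", "G"):
        # Purine reference: report the change on the complementary strand.
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"


def spectrum(variant_ids):
    """Count SNVs per substitution class, skipping indels."""
    classes = (substitution_class(v) for v in variant_ids)
    return Counter(c for c in classes if c is not None)


if __name__ == "__main__":
    # Identifiers taken from Table 1 of this article.
    example = [
        "chr3_48447050_C_G",
        "chr16_89799603_A_G",
        "chr5_80654794_G_A",
        "chr8_42357362_AT_A",
        "chr3_8958938_G_T",
    ]
    print(spectrum(example))
```

A full analysis would extend each class with the bases on either side of the variant, which requires the reference genome sequence and is omitted here.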

Figure 4.


Mutational properties of DNVs. (a) Mutation signature analysis showing the total number of DNVs and the individuals with each signature type; (b) heatmap of individuals based on their mutational signatures; (c) mutations in the DNA repair gene RAD18 shown on its 3D structure (modeled using mupit). Also shown are known cancer mutations from The Cancer Genome Atlas; (d) location of DNVs based on their phased parent‐of‐origin in NA07048. Most notably, there is a cluster of mutations on the maternal copy of chromosome 2; (e) DNVs in IGLL5 shown on its 3D structure (modeled using mupit). The image on the left models variants discovered in 1000G; the image on the right models variants discovered in SSC. 1000G, 1000 Genomes Project; DNVs, de novo variants; SSC, Simons Simplex Collection.

We further sought to determine the mechanism behind the generation of a B‐cell lymphoma‐like state. First, we determined whether there was a high rate of aneuploidies in the cell lines. By digital karyotyping (Supporting Information: Table S10), we found that 595 individuals (98.8%) had a typical chromosome complement (46, XX or 46, XY), 4 were missing a sex chromosome (45, X0), 1 was 47, XXY, 1 had three copies of chromosome 12 (47, XY), and 1 had three copies of chromosome 9 (47, XY). This demonstrated that while aneuploidies are occurring in some cell lines, they are probably not the main driving factor. Second, we looked at DNVs in genes involved in DNA repair and found that 17 individuals contained a missense or loss‐of‐function DNV in one of these genes (Supporting Information: Table S11). Among individuals with B‐cell lymphoma signatures, disruptive mutations were found in the following DNA repair genes: FANCF (HG01126), MUS81 (NA10838), POLB (NA10838), POLD1 (NA19677), POLE (HG01096), RAD18 (NA12864) (Figure 4c), RAD51 (HG02683), and RPA4 (HG02630), as well as FANCA (HG02841, HG03200) and WRN (HG04115, NA19161), each mutated in two individuals (Table 1). Third, we looked at Epstein‐Barr virus load in each of the genomes (Supporting Information: Table S12) and found a weak, yet significant, correlation with the number of DNVs (p = 2.32 × 10−5, r = 0.17) (Supporting Information: Figure S36). By visual inspection of phased variation in all individuals, we also identified individuals with clusters of mutations (e.g., NA07048, Figure 4d, Supporting Information: Figure S37).
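The correlations reported in this section (for Signature 9 and for Epstein‐Barr virus load, each against the number of DNVs) are Pearson coefficients. As a reminder of what that statistic measures, here is a minimal standard‐library sketch. The input values are invented for illustration and are not data from the study, and the function name is an assumption.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    if len(xs) < 2:
        raise ValueError("at least two observations are required")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Hypothetical per-individual values: viral load (arbitrary units) vs DNV count.
    viral_load = [1, 2, 3, 4, 5]
    dnv_count = [2, 4, 5, 4, 5]
    print(round(pearson_r(viral_load, dnv_count), 4))
```

A coefficient such as r = 0.17 corresponds to r² of about 0.03, so a relationship of that size can be statistically significant in a large cohort while still explaining only a small share of the variation in DNV counts.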

Table 1.

DNVs in DNA damage repair genes and clinically relevant variants

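Category | Individual | De novo variant | Variant type | Gene
DNA damage repair gene | HG01074 | chr3_48447050_C_G | missense | ATRIP
DNA damage repair gene | HG02841 | chr16_89799603_A_G | splice_donor | FANCA
DNA damage repair gene | HG03200 | chr16_89762010_C_T | missense | FANCA
DNA damage repair gene | HG01126 | chr11_22625482_T_G | missense | FANCF
DNA damage repair gene | NA18875 | chr5_80654794_G_A | missense | MSH3
DNA damage repair gene | HG02650 | chr6_31759121_C_T | missense | MSH5
DNA damage repair gene | NA10838 | chr11_65865247_C_T | missense | MUS81
DNA damage repair gene | NA10838 | chr8_42357362_AT_A | frameshift | POLB
DNA damage repair gene | NA19677 | chr19_50407375_G_A | missense | POLD1
DNA damage repair gene | HG01096 | chr12_132634327_C_T | missense | POLE
DNA damage repair gene | HG01755 | chr15_89321792_C_T | missense | POLG
DNA damage repair gene | NA12864 | chr3_8958938_G_T | missense | RAD18
DNA damage repair gene | HG02683 | chr15_40729853_T_C | missense | RAD51
DNA damage repair gene | HG02630 | chrX_96884884_G_A | missense | RPA4
DNA damage repair gene | NA19919 | chr3_133644039_A_G | missense | TOPBP1
DNA damage repair gene | HG04115 | chr8_31120294_C_T | missense | WRN
DNA damage repair gene | NA19161 | chr8_31124967_G_T | missense | WRN
Clinvar pathogenic/likely pathogenic | HG03795 | chr11_22274728_C_T | stop_gained | ANO5
Clinvar pathogenic/likely pathogenic | NA10854 | chr13_110179298_C_T | missense | COL4A1
Clinvar pathogenic/likely pathogenic | NA10842 | chr10_49530737_G_A | stop_gained | ERCC6
Clinvar pathogenic/likely pathogenic | HG02668 | chr1_111787063_C_T | missense | KCND3
Clinvar pathogenic/likely pathogenic | HG02466 | chr1_39485559_G_A | missense | MACF1
Clinvar pathogenic/likely pathogenic | HG02129 | chr11_77206108_G_A | missense | MYO7A
Clinvar pathogenic/likely pathogenic | HG03122 | chr11_45949458_G_A | stop_gained | PHF21A
Clinvar pathogenic/likely pathogenic | NA12707 | chr6_52058438_C_T | missense | PKHD1
Clinvar pathogenic/likely pathogenic | HG01755 | chr15_89321792_C_T | missense | POLG
Clinvar pathogenic/likely pathogenic | HG02892 | chr11_124875581_C_T | stop_gained | ROBO3
Clinvar pathogenic/likely pathogenic | HG03635 | chr2_165310406_G_A | missense | SCN2A
Clinvar pathogenic/likely pathogenic | NA10830 | chr2_39023106_C_T | missense | SOS1
Clinvar pathogenic/likely pathogenic | NA10831 | chr3_24143512_G_A | missense | THRB
Clinvar pathogenic/likely pathogenic | HG01629 | chr2_209775898_C_T | stop_gained | UNC80
Clinvar pathogenic/likely pathogenic | HG00558 | chr16_88435401_G_A | missense | ZNF469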

Abbreviation: DNVs, de novo variants.

3.6. Excess of DNVs in IGLL5

We applied a multiphase approach to determine whether any genes were enriched for protein‐coding DNVs in individuals with greater than 100 DNVs. In the first phase, we tested whether there was genome‐wide significance for enrichment of protein‐coding DNVs (missense, loss‐of‐function) in any specific genes. By application of two methods (chimpanzee‐human, denovolyzeR), we identified 29 significant genes (ARMC3, BCL2, BCR, C6orf15, CCDC168, CSMD3, EGR3, EXO1, HLA‐B, HLA‐C, IGLL5, KMT2D, LINGO2, LTB, MEOX2, MUC16, MUC22, NPAP1, PCLO, PRPF40A, RUNX1T1, SGK1, STRAP, TMEM232, TNXB, TTN, WDFY4, XIRP2, ZNF488) with an excess of DNVs (Supporting Information: Table S12). In the second phase, we tested these 29 genes to see whether there were significantly more protein‐coding DNVs in individuals with greater than 100 DNVs than in individuals with less than or equal to 100 DNVs. Only IGLL5 was significant in this comparison (1.79 × 10−3) (Supporting Information: Table S13, Supporting Information: Table S14, Figure 4e). To test whether this finding was relevant only to LCLs, we looked for protein‐coding DNVs in IGLL5 in SSC and found only one missense variant (Figure 4e). This gene did not have a significant excess of DNVs in SSC.
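The second‐phase comparison asks whether carriers of protein‐coding DNVs in a gene are concentrated among individuals with more than 100 DNVs. The text does not state which test was used, so the sketch below assumes a one‐sided Fisher's exact test on a 2 × 2 table of individuals. The example counts are also assumptions drawn from figures given in the Discussion (27 IGLL5 carriers, all with >100 DNVs; 123 of 602 children near the expected DNV count); they are not taken from the supporting tables.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2 x 2 table [[a, b], [c, d]].

    a: carriers in group 1        b: non-carriers in group 1
    c: carriers in group 2        d: non-carriers in group 2
    Returns P(carriers in group 1 >= a) under the hypergeometric null,
    i.e. with all row and column totals held fixed.
    """
    group1 = a + b
    group2 = c + d
    carriers = a + c
    total = group1 + group2
    denominator = comb(total, carriers)
    upper = min(carriers, group1)
    numerator = sum(
        comb(group1, x) * comb(group2, carriers - x)
        for x in range(a, upper + 1)
    )
    return numerator / denominator


if __name__ == "__main__":
    # Assumed counts: 602 children, 123 with ~expected DNV counts, 479 with more.
    high_dnv_carriers, high_dnv_total = 27, 479
    low_dnv_carriers, low_dnv_total = 0, 123
    p = fisher_exact_greater(
        high_dnv_carriers,
        high_dnv_total - high_dnv_carriers,
        low_dnv_carriers,
        low_dnv_total - low_dnv_carriers,
    )
    print(f"one-sided p = {p:.3g}")
```

With these assumed counts the function returns a value of roughly 1.8 × 10−3, the same order as the figure reported for IGLL5. That agreement does not establish which test or which counts the authors used.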

3.7. DNVs identified in clinically relevant variants

We tested whether any of the DNVs detected were already known to be pathogenic or likely‐pathogenic in the Clinvar (Landrum et al., 2018) database (Table 1). There were 15 mutations meeting these criteria (Supporting Information: Table S15). We rescored these variants using Franklin software to assess their pathogenicity and found that 13 were also pathogenic or likely‐pathogenic by this approach. Twelve of these variants were associated with described phenotypes in Clinvar. These included a missense variant in SOS1 involved in Noonan syndrome, a missense variant in SCN2A involved in seizures, a stop‐gained variant in UNC80 involved in a syndrome with hypotonia, intellectual disability, and characteristic facies, a missense variant in THRB involved in thyroid hormone resistance, a missense variant in PKHD1 involved in polycystic kidney disease, a stop‐gained variant in ERCC6 involved in Cockayne syndrome, a stop‐gained variant in ANO5 involved in gnathodiaphyseal dysplasia, a stop‐gained variant in PHF21A involved in inborn genetic disease, a missense variant in MYO7A involved in Usher syndrome type 1, a stop‐gained variant in ROBO3 involved in gaze palsy with progressive scoliosis, a missense variant in COL4A1 involved in inborn genetic disease, and a missense variant in POLG involved in POLG‐related disorder.

4. DISCUSSION

While the 1000G data have been extensively studied in the past, there has been no previous cross‐cohort assessment of DNVs. This limitation is primarily because family‐based sequencing was not available until 2020, when this cohort was sequenced by high‐coverage short‐read WGS 10 years after the initial ground‐breaking publication on the 1000G (Abecasis et al., 2010). Determining DNV signatures across this data set of diverse individuals is critical for assessment of mutation rates in the human population, while also providing a more complete catalog of all genetic variants within these individuals. The decision to sequence these individuals using DNA derived from LCLs was a practical one. However, it opened the door to the possibility of cell line artifacts, while simultaneously introducing a dynamic aspect to this extensive set of controls. As control samples, the cell lines that were used as the inputs for the 1000G are still actively used across laboratories, acting as matched controls for workflows to known sets of variants (Zheng‐Bradley & Flicek, 2017). The large distribution of DNVs across the 1000G suggests that a subset of the control source inputs are dynamic and, in some cases, harbor a spectrum of genetic variants associated with B‐cell lymphomas or named clinical syndromes. Laboratories using control samples from the 1000G should account for both the presence and the dynamic nature of the reported DNVs, and in some cases may consider changing which control samples they use to avoid the issues associated with the presence of DNVs. Additionally, other public efforts to establish reference data sets using cell lines should consider the impacts of DNVs on their project design.

We utilized a novel and accelerated analysis workflow to detect DNVs from short‐read WGS data. We showed that this new workflow is of high quality by running it on 4,216 WGS trios from the SSC, with DNA derived from blood, and on 1,326 WGS trios from SPARK, with DNA derived from saliva. This analysis revealed the expected number of DNVs, percent of DNVs at CpG sites, percent of DNVs phased to the paternal chromosome of origin, and average allele balance (AB) of the DNVs. This result was important because it stands in contrast to our DNV analysis of the 1000G. In 1000G, we identified a total of 445,711 DNVs in the 602 trios. Originally, it was assumed that the DNVs across the 1000G would be random and minimal, and yet only 20% of the offspring (123 children) have a number of DNVs around expectation (<100); the remainder have an excess of DNVs, with the most extreme case being an individual (HG02683) with 11,219 DNVs. We hypothesized that the excess DNVs were cell line artifacts and found multiple lines of evidence to support this hypothesis, including reductions in both the percent of DNVs at CpG sites and the percent phased to the paternal chromosome as the number of DNVs increases. A detailed analysis of individual NA12878, who has been studied various times over the years, revealed increasing DNVs in the more recently sequenced samples, also supporting this hypothesis. The changes in the DNVs for NA12878 point to the dynamic nature of the DNVs, demonstrating that the number is increasing over time.

When mutational signature analysis was performed on this new set of DNVs, the most common mutation signatures were those seen in B‐cell lymphomas. This signature was found in 30% of individuals in the 1000G. This is important because the LCLs are generated from B‐cells, and it points to a nonrandom accumulation of mutations that is in line with the development of cancer in this cell type. In particular, we identified mutations in key DNA repair genes as well as a statistically significant excess of DNVs in IGLL5 (Chen et al., 2021; Pedrosa et al., 2021). This gene is known to be mutated in B‐cell lymphomas, and protein‐coding DNVs were identified in 27 individuals in this cohort, all of whom have >100 overall DNVs. From our work, we identify two factors contributing to these higher levels of DNVs: mutation of DNA repair genes and an excess of Epstein‐Barr viral load. Future work using long‐read sequencing and de novo assemblies will be imperative to identify complete viral integration in these genomes, as integration sites can have impacts on cell line stability.

In addition to the DNA repair gene DNVs, we identified 15 pathogenic or likely‐pathogenic DNVs that had already been implicated in a database of clinical variation (Clinvar). This calls into question the use of the 1000G data as a control for both B‐cell lymphomas and, more generally, for DNVs identified in clinical patients. More importantly, the extensive spectrum of DNVs that can appear in a cell line calls into question the use of control samples derived from LCLs. Recently, the Human Pangenome Reference Consortium (HPRC) began building reference databases and pangenomes using DNA from individuals in 1000G. The HPRC is trying to limit the cell culture passages for their project (Wang et al., 2022). Overall, we recommend that studies sequence DNA from primary tissues rather than DNA from cultured LCLs. The SSC and SPARK collections have taken this approach and high‐quality DNVs are the result.

CONFLICTS OF INTEREST

Pankaj Vats, Marc A. West, George Vacek, and Timothy T. Harkins are full time employees of NVIDIA.

Supporting information

Supporting information.


ACKNOWLEDGMENTS

We are grateful to all of the families at the participating SSC sites, as well as the principal investigators (A. B., R. B., J. C., E. C., E. F., D. G., R. G.‐K., E. H., D. G., A. K., D. L. C., L. C., M. D., M. R., M. J., M. O., O. K., P. B., P. J., P. C., S. M., S. W., S. J., S. C., W. Z. W., and E. W.). We appreciate obtaining access to phenotypic and genetic data for the monozygotic twin pair on SFARI Base. This study was supported by grants from the National Institutes of Health (R00MH117165 to T. N. T., R01MH126933 to T. N. T.) and the Simons Foundation (Award #734069 to T. N. T.). Approved researchers can obtain the SSC population data set described in this study (https://www.sfari.org/resource/simons-simplex-collection/), accession SFARI_DS361901, by applying at https://base.sfari.org.

Ng, J. K. , Vats, P. , Fritz‐Waters, E. , Sarkar, S. , Sams, E. I. , Padhi, E. M. , Payne, Z. L. , Leonard, S. , West, M. A. , Prince, C. , Trani, L. , Jansen, M. , Vacek, G. , Samadi, M. , Harkins, T. T. , Pohl, C. , & Turner, T. N. (2022). de novo variant calling identifies cancer mutation signatures in the 1000 Genomes Project. Human Mutation, 43, 1979–1993. 10.1002/humu.24455

REFERENCES

  1. Alexandrov, L. B., Nik‐Zainal, S., Wedge, D. C., Aparicio, S. A., Behjati, S., Biankin, A. V., Bignell, G. R., Bolli, N., Borg, A., Børresen‐Dale, A. L., Boyault, S., Burkhardt, B., Butler, A. P., Caldas, C., Davies, H. R., Desmedt, C., Eils, R., Eyfjörd, J. E., Foekens, J. A., … Stratton, M. R. (2013). Signatures of mutational processes in human cancer. Nature, 500(7463), 415–421. 10.1038/nature12477
  2. Allen, A. S., Berkovic, S. F., Cossette, P., Delanty, N., Dlugos, D., Eichler, E. E., Epstein, M. P., Glauser, T., Goldstein, D. B., Han, Y., Heinzen, E. L., Hitomi, Y., Howell, K. B., Johnson, M. R., Kuzniecky, R., Lowenstein, D. H., Lu, Y. F., … Winawer, M. R., Epi4K Consortium, Epilepsy Phenome/Genome Project. (2013). De novo mutations in epileptic encephalopathies. Nature, 501(7466), 217–221. 10.1038/nature12439
  3. An, J. Y., Lin, K., Zhu, L., Werling, D. M., Dong, S., Brand, H., Wang, H. Z., Zhao, X., Schwartz, G. B., Collins, R. L., Currall, B. B., Dastmalchi, C., Dea, J., Duhn, C., Gilson, M. C., Klei, L., Liang, L., Markenscoff‐Papadimitriou, E., Pochareddy, S., … Sanders, S. J. (2018). Genome‐wide de novo risk score implicates promoter variation in autism spectrum disorder. Science, 362(6420). 10.1126/science.aat6576
  4. Belyeu, J. R., Sasani, T. A., Pedersen, B. S., & Quinlan, A. R. (2021). Unfazed: Parent‐of‐origin detection for large and small de novo variants. bioRxiv, 2021.2002.2003.429658. 10.1101/2021.02.03.429658
  5. Besenbacher, S., Liu, S., Izarzugaza, J. M., Grove, J., Belling, K., Bork‐Jensen, J., Huang, S., Als, T. D., Li, S., Yadav, R., Rubio‐García, A., Lescai, F., Demontis, D., Rao, J., Ye, W., Mailund, T., Friborg, R. M., Pedersen, C. N., Xu, R., … Rasmussen, S. (2015). Novel variation and de novo mutation rates in population‐wide de novo assembled Danish trios. Nature Communications, 6, 5969. 10.1038/ncomms6969
  6. Boomsma, D. I., Wijmenga, C., Slagboom, E. P., Swertz, M. A., Karssen, L. C., Abdellaoui, A., Ye, K., Guryev, V., Vermaat, M., van Dijk, F., Francioli, L. C., Hottenga, J. J., Laros, J. F., Li, Q., Li, Y., Cao, H., Chen, R., Du, Y., Li, N., … van Duijn, C. M. (2014). The genome of the Netherlands: Design, and project goals. European Journal of Human Genetics, 22(2), 221–227. 10.1038/ejhg.2013.118
  7. Byrska‐Bishop, M., Evani, U. S., Zhao, X., Basile, A. O., Abel, H. J., Regier, A. A., & Zody, M. C. (2021). High coverage whole genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios. bioRxiv. 10.1101/2021.02.06.430068
  8. Carlson, J., Locke, A. E., Flickinger, M., Zawistowski, M., Levy, S., Myers, R. M., Boehnke, M., Kang, H. M., Scott, L. J., Li, J. Z., Zöllner, S., & BRIDGES, C. (2018). Extremely rare variants reveal patterns of germline mutation rate heterogeneity in humans. Nature Communications, 9(1), 3753. 10.1038/s41467-018-05936-5
  9. Chen, F., Zhang, Y., & Creighton, C. J. (2021). Systematic identification of non‐coding somatic single nucleotide variants associated with altered transcription and DNA methylation in adult and pediatric cancers. NAR Cancer, 3(1), zcab001. 10.1093/narcan/zcab001
  10. Chesi, A., Staahl, B. T., Jovičić, A., Couthouis, J., Fasolino, M., Raphael, A. R., Yamazaki, T., Elias, L., Polak, M., Kelly, C., Williams, K. L., Fifita, J. A., Maragakis, N. J., Nicholson, G. A., King, O. D., Reed, R., Crabtree, G. R., Blair, I. P., Glass, J. D., & Gitler, A. D. (2013). Exome sequencing to identify de novo mutations in sporadic ALS trios. Nature Neuroscience, 16(7), 851–855. 10.1038/nn.3412
  11. Chiang, C., Layer, R. M., Faust, G. G., Lindberg, M. R., Rose, D. B., Garrison, E. P., Marth, G. T., Quinlan, A. R., & Hall, I. M. (2015). SpeedSeq: Ultra‐fast personal genome analysis and interpretation. Nature Methods, 12(10), 966–968. 10.1038/nmeth.3505
  12. Coe, B. P., Stessman, H., Sulovari, A., Geisheker, M. R., Bakken, T. E., Lake, A. M., Dougherty, J. D., Lein, E. S., Hormozdiari, F., Bernier, R. A., & Eichler, E. E. (2019). Neurodevelopmental disease genes implicated by de novo mutation and copy number variation morbidity. Nature Genetics, 51(1), 106–116. 10.1038/s41588-018-0288-4
  13. Conrad, D. F., Keebler, J. E., DePristo, M. A., Lindsay, S. J., Zhang, Y., Casals, F., Idaghdour, Y., Hartl, C. L., Torroja, C., Garimella, K. V., Zilversmit, M., Cartwright, R., Rouleau, G. A., Daly, M., Stone, E. A., Hurles, M. E., Awadalla, P., & Genomes, P. (2011). Variation in genome‐wide mutation rates within and between human families. Nature Genetics, 43(7), 712–714. 10.1038/ng.862
  14. Deciphering Developmental Disorders Study (DDD). (2017). Prevalence and architecture of de novo mutations in developmental disorders. Nature, 542(7642), 433–438. 10.1038/nature21062
  15. Dimassi, S., Labalme, A., Ville, D., Calender, A., Mignot, C., Boutry‐Kryza, N., de Bellescize, J., Rivier‐Ringenbach, C., Bourel‐Ponchel, E., Cheillan, D., Simonet, T., Maincent, K., Rossi, M., Till, M., Mougou‐Zerelli, S., Edery, P., Saad, A., Heron, D., des Portes, V., … Lesca, G. (2016). Whole‐exome sequencing improves the diagnosis yield in sporadic infantile spasm syndrome. Clinical Genetics, 89(2), 198–204. 10.1111/cge.12636
  16. Dong, S., Walker, M. F., Carriero, N. J., DiCola, M., Willsey, A. J., Ye, A. Y., Waqar, Z., Gonzalez, L. E., Overton, J. D., Frahm, S., Keaney, J. F., Teran, N. A., Dea, J., Mandell, J. D., Hus Bal, V., Sullivan, C. A., DiLullo, N. M., Khalil, R. O., Gockley, J., … Sanders, S. J. (2014). De novo insertions and deletions of predominantly paternal origin are associated with autism spectrum disorder. Cell Reports, 9(1), 16–23. 10.1016/j.celrep.2014.08.068
  17. Feliciano, P., Zhou, X., Astrovskaya, I., Turner, T. N., Wang, T., Brueggeman, L., Barnard, R., Hsieh, A., Snyder, L. G., Muzny, D. M., Sabo, A., SPARK, C., Gibbs, R. A., Eichler, E. E., O'Roak, B. J., Michaelson, J. J., Volfovsky, N., Shen, Y., & Chung, W. K. (2019). Exome sequencing of 457 autism families recruited online provides evidence for autism risk genes. NPJ Genomic Medicine, 4, 19. 10.1038/s41525-019-0093-8
  18. Franke, K. R., & Crowgey, E. L. (2020). Accelerating next generation sequencing data analysis: An evaluation of optimized best practices for genome analysis toolkit algorithms. Genomics & Informatics, 18(1), e10. 10.5808/GI.2020.18.1.e10
  19. Fromer, M., Pocklington, A. J., Kavanagh, D. H., Williams, H. J., Dwyer, S., Gormley, P., Georgieva, L., Rees, E., Palta, P., Ruderfer, D. M., Carrera, N., Humphreys, I., Johnson, J. S., Roussos, P., Barker, D. D., Banks, E., Milanova, V., Grant, S. G., Hannon, E., … O'Donovan, M. C. (2014). De novo mutations in schizophrenia implicate synaptic networks. Nature, 506(7487), 179–184. 10.1038/nature12929
  20. Genomes Project, C., Abecasis, G. R., Altshuler, D., Auton, A., Brooks, L. D., Durbin, R. M., Gibbs, R. A., Hurles, M. E., & McVean, G. A. (2010). A map of human genome variation from population‐scale sequencing. Nature, 467(7319), 1061–1073. 10.1038/nature09534
  21. Genomes Project, C., Auton, A., Brooks, L. D., Durbin, R. M., Garrison, E. P., Kang, H. M., Korbel, J. O., Marchini, J. L., McCarthy, S., McVean, G. A., & Abecasis, G. R. (2015). A global reference for human genetic variation. Nature, 526(7571), 68–74. 10.1038/nature15393
  22. Gibbs, R. A., Belmont, J. W., Hardenbol, P., Willis, T. D., Yu, F., Yang, H., & Methods, G. (2003). The international HapMap project. Nature, 426(6968), 789–796. 10.1038/nature02168
  23. Gilissen, C., Hehir‐Kwa, J. Y., Thung, D. T., van de Vorst, M., van Bon, B. W., Willemsen, M. H., Kwint, M., Janssen, I. M., Hoischen, A., Schenck, A., Leach, R., Klein, R., Tearle, R., Bo, T., Pfundt, R., Yntema, H. G., de Vries, B. B., Kleefstra, T., Brunner, H. G., … Veltman, J. A. (2014). Genome sequencing identifies major causes of severe intellectual disability. Nature, 511(7509), 344–347. 10.1038/nature13394
  24. Helbig, K. L., Farwell Hagman, K. D., Shinde, D. N., Mroske, C., Powis, Z., Li, S., Tang, S., & Helbig, I. (2016). Diagnostic exome sequencing provides a molecular diagnosis for a significant proportion of patients with epilepsy. Genetics in Medicine, 18, 898–905. 10.1038/gim.2015.186
  25. Homsy, J., Zaidi, S., Shen, Y., Ware, J. S., Samocha, K. E., Karczewski, K. J., Depalma, S. R., McKean, D., Wakimoto, H., Gorham, J., Jin, S. C., Deanfield, J., Giardini, A., Porter, G. A., Jr., Kim, R., Bilguvar, K., López‐Giráldez, F., Tikhonova, I., Mane, S., … Chung, W. K. (2015). De novo mutations in congenital heart disease with neurodevelopmental and other congenital anomalies. Science, 350(6265), 1262–1266. 10.1126/science.aac9396
  26. Iossifov, I., O'Roak, B. J., Sanders, S. J., Ronemus, M., Krumm, N., Levy, D., Stessman, H. A., Witherspoon, K. T., Vives, L., Patterson, K. E., Smith, J. D., Paeper, B., Nickerson, D. A., Dea, J., Dong, S., Gonzalez, L. E., Mandell, J. D., Mane, S. M., Murtha, M. T., … Wigler, M. (2014). The contribution of de novo coding mutations to autism spectrum disorder. Nature, 515, 216–221. 10.1038/nature13908
  27. Iossifov, I., Ronemus, M., Levy, D., Wang, Z., Hakker, I., Rosenbaum, J., Yamrom, B., Lee, Y. H., Narzisi, G., Leotta, A., Kendall, J., Grabowska, E., Ma, B., Marks, S., Rodgers, L., Stepansky, A., Troge, J., Andrews, P., Bekritsky, M., … Wigler, M. (2012). De novo gene disruptions in children on the autistic spectrum. Neuron, 74(2), 285–299. 10.1016/j.neuron.2012.04.009
  28. Jónsson, H., Sulem, P., Kehr, B., Kristmundsdottir, S., Zink, F., Hjartarson, E., Hardarson, M. T., Hjorleifsson, K. E., Eggertsson, H. P., Gudjonsson, S. A., Ward, L. D., Arnadottir, G. A., Helgason, E. A., Helgason, H., Gylfason, A., Jonasdottir, A., Jonasdottir, A., Rafnar, T., Frigge, M., … Stefansson, K. (2017). Parental influence on human germline de novo mutations in 1,548 trios from Iceland. Nature, 549(7673), 519–522. 10.1038/nature24018
  29. Kaplanis, J., Samocha, K. E., Wiel, L., Zhang, Z., Arvai, K. J., Eberhardt, R. Y., & Retterer, K. (2020). Integrating healthcare and research genetic data empowers the discovery of 28 novel developmental disorders. bioRxiv, 797787. 10.1101/797787
  30. Kessler, M. D., Loesch, D. P., Perry, J. A., Heard‐Costa, N. L., Taliun, D., Cade, B. E., Wang, H., Daya, M., Ziniti, J., Datta, S., Celedón, J. C., Soto‐Quiros, M. E., Avila, L., Weiss, S. T., Barnes, K., Redline, S. S., Vasan, R. S., Johnson, A. D., Mathias, R. A., … O'Connor, T. D. (2020). De novo mutations across 1,465 diverse genomes reveal mutational insights and reductions in the Amish founder population. Proceedings of the National Academy of Sciences of the United States of America, 117(5), 2560–2569. 10.1073/pnas.1902766117
  31. Kumar, S., Stecher, G., Li, M., Knyaz, C., & Tamura, K. (2018). MEGA X: Molecular evolutionary genetics analysis across computing platforms. Molecular Biology and Evolution, 35(6), 1547–1549. 10.1093/molbev/msy096
  32. Landrum, M. J., Lee, J. M., Benson, M., Brown, G. R., Chao, C., Chitipiralla, S., Gu, B., Hart, J., Hoffman, D., Jang, W., Karapetyan, K., Katz, K., Liu, C., Maddipatla, Z., Malheiro, A., McDaniel, K., Ovetsky, M., Riley, G., Zhou, G., … Maglott, D. R. (2018). ClinVar: Improving access to variant interpretations and supporting evidence. Nucleic Acids Research, 46(D1), D1062–D1067. 10.1093/nar/gkx1153
  33. de Ligt, J., Willemsen, M. H., van Bon, B. W., Kleefstra, T., Yntema, H. G., Kroes, T., Vulto‐van Silfhout, A. T., Koolen, D. A., de Vries, P., Gilissen, C., del Rosario, M., Hoischen, A., Scheffer, H., de Vries, B. B., Brunner, H. G., Veltman, J. A., & Vissers, L. E. (2012). Diagnostic exome sequencing in persons with severe intellectual disability. New England Journal of Medicine, 367(20), 1921–1929. 10.1056/NEJMoa1206524
  34. Li, H. (2018). Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics, 34(18), 3094–3100. 10.1093/bioinformatics/bty191
  35. Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., & Durbin, R. (2009). The sequence alignment/Map format and SAMtools. Bioinformatics, 25(16), 2078–2079. 10.1093/bioinformatics/btp352
  36. McCarthy, S. E., Gillis, J., Kramer, M., Lihm, J., Yoon, S., Berstein, Y., Mistry, M., Pavlidis, P., Solomon, R., Ghiban, E., Antoniou, E., Kelleher, E., O'Brien, C., Donohoe, G., Gill, M., Morris, D. W., McCombie, W. R., & Corvin, A. (2014). De novo mutations in schizophrenia implicate chromatin remodeling and support a genetic overlap with autism and intellectual disability. Molecular Psychiatry, 19(6), 652–658. 10.1038/mp.2014.29
  37. McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., Garimella, K., Altshuler, D., Gabriel, S., Daly, M., & DePristo, M. A. (2010). The genome analysis toolkit: A MapReduce framework for analyzing next‐generation DNA sequencing data. Genome Research, 20(9), 1297–1303. 10.1101/gr.107524.110
  38. Narzisi, G., O'Rawe, J. A., Iossifov, I., Fang, H., Lee, Y. H., Wang, Z., Wu, Y., Lyon, G. J., Wigler, M., & Schatz, M. C. (2014). Accurate de novo and transmitted indel detection in exome‐capture data using microassembly. Nature Methods, 11(10), 1033–1036. 10.1038/nmeth.3069
  39. O'Roak, B. J., Deriziotis, P., Lee, C., Vives, L., Schwartz, J. J., Girirajan, S., Karakoc, E., Mackenzie, A. P., Ng, S. B., Baker, C., Rieder, M. J., Nickerson, D. A., Bernier, R., Fisher, S. E., Shendure, J., & Eichler, E. E. (2011). Exome sequencing in sporadic autism spectrum disorders identifies severe de novo mutations. Nature Genetics, 43(6), 585–589. 10.1038/ng.835
  40. O'Roak, B. J., Vives, L., Girirajan, S., Karakoc, E., Krumm, N., Coe, B. P., Levy, R., Ko, A., Lee, C., Smith, J. D., Turner, E. H., Stanaway, I. B., Vernot, B., Malig, M., Baker, C., Reilly, B., Akey, J. M., Borenstein, E., Rieder, M. J., … Eichler, E. E. (2012). Sporadic autism exomes reveal a highly interconnected protein network of de novo mutations. Nature, 485(7397), 246–250. 10.1038/nature10989
  41. Pedrosa, L. , Fernández‐Miranda, I. , Pérez‐Callejo, D. , Quero, C. , Rodríguez, M. , Martín‐Acosta, P. , Gómez, S. , González‐Rincón, J. , Santos, A. , Tarin, C. , García, J. F. , García‐Arroyo, F. R. , Rueda, A. , Camacho, F. I. , García‐Cosío, M. , Heredero, A. , Llanos, M. , Mollejo, M. , Piris‐Villaespesa, M. , … Sánchez‐Beato, M. (2021). Proposal and validation of a method to classify genetic subtypes of diffuse large B cell lymphoma. Scientific Reports, 11(1), 1886. 10.1038/s41598-020-80376-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
  42. Poplin, R., Chang, P. C., Alexander, D., Schwartz, S., Colthurst, T., Ku, A., Newburger, D., Dijamco, J., Nguyen, N., Afshar, P. T., Gross, S. S., Dorfman, L., McLean, C. Y., & DePristo, M. A. (2018). A universal SNP and small‐indel variant caller using deep neural networks. Nature Biotechnology, 36(10), 983–987. 10.1038/nbt.4235
  43. Ramu, A., Noordam, M. J., Schwartz, R. S., Wuster, A., Hurles, M. E., Cartwright, R. A., & Conrad, D. F. (2013). DeNovoGear: De novo indel and point mutation discovery and phasing. Nature Methods, 10(10), 985–987. 10.1038/nmeth.2611
  44. Rehm, H. L., Bale, S. J., Bayrak‐Toydemir, P., Berg, J. S., Brown, K. K., Deignan, J. L., Friez, M. J., Funke, B. H., Hegde, M. R., & Lyon, E. (2013). ACMG clinical laboratory standards for next‐generation sequencing. Genetics in Medicine, 15(9), 733–747. 10.1038/gim.2013.92
  45. Robinson, J. T., Thorvaldsdottir, H., Winckler, W., Guttman, M., Lander, E. S., Getz, G., & Mesirov, J. P. (2011). Integrative genomics viewer. Nature Biotechnology, 29(1), 24–26. 10.1038/nbt.1754
  46. Rosenthal, R., McGranahan, N., Herrero, J., Taylor, B. S., & Swanton, C. (2016). DeconstructSigs: Delineating mutational processes in single tumors distinguishes DNA repair deficiencies and patterns of carcinoma evolution. Genome Biology, 17, 31. 10.1186/s13059-016-0893-4
  47. De Rubeis, S., He, X., Goldberg, A. P., Poultney, C. S., Samocha, K., Cicek, A. E., Kou, Y., Liu, L., Fromer, M., Walker, S., Singh, T., Klei, L., Kosmicki, J., Shih‐Chen, F., Aleksic, B., Biscaldi, M., Bolton, P. F., Brownfeld, J. M., Cai, J., … Buxbaum, J. D. (2014). Synaptic, transcriptional and chromatin genes disrupted in autism. Nature, 515(7526), 209–215. 10.1038/nature13772
  48. Sanders, S. J., Murtha, M. T., Gupta, A. R., Murdoch, J. D., Raubeson, M. J., Willsey, A. J., Ercan‐Sencicek, A. G., DiLullo, N. M., Parikshak, N. N., Stein, J. L., Walker, M. F., Ober, G. T., Teran, N. A., Song, Y., El‐Fishawy, P., Murtha, R. C., Choi, M., Overton, J. D., Bjornson, R. D., … State, M. W. (2012). De novo mutations revealed by whole‐exome sequencing are strongly associated with autism. Nature, 485(7397), 237–241. 10.1038/nature10945
  49. Sasani, T. A., Pedersen, B. S., Gao, Z., Baird, L., Przeworski, M., Jorde, L. B., & Quinlan, A. R. (2019). Large, three‐generation human families reveal post‐zygotic mosaicism and variability in germline mutation accumulation. eLife, 8, e46922. 10.7554/eLife.46922
  50. Ségurel, L., Wyman, M. J., & Przeworski, M. (2014). Determinants of mutation rate variation in the human germline. Annual Review of Genomics and Human Genetics, 15, 47–70. 10.1146/annurev-genom-031714-125740
  51. Sifrim, A., Hitz, M. P., Wilsdon, A., Breckpot, J., Turki, S. H., Thienpont, B., McRae, J., Fitzgerald, T. W., Singh, T., Swaminathan, G. J., Prigmore, E., Rajan, D., Abdul‐Khaliq, H., Banka, S., Bauer, U. M., Bentham, J., Berger, F., Bhattacharya, S., Bu'Lock, F., … Hurles, M. E. (2016). Distinct genetic architectures for syndromic and nonsyndromic congenital heart defects identified by exome sequencing. Nature Genetics, 48(9), 1060–1065. 10.1038/ng.3627
  52. SPARK Consortium. (2018). SPARK: A US cohort of 50,000 families to accelerate autism research. Neuron, 97(3), 488–493. 10.1016/j.neuron.2018.01.015
  53. Strachan, T., & Read, A. P. (2018). Human molecular genetics (5th ed.). Garland Science.
  54. Turner, T. N., Coe, B. P., Dickel, D. E., Hoekzema, K., Nelson, B. J., Zody, M. C., Kronenberg, Z. N., Hormozdiari, F., Raja, A., Pennacchio, L. A., Darnell, R. B., & Eichler, E. E. (2017). Genomic patterns of de novo mutation in simplex autism. Cell, 171(3), 710–722. 10.1016/j.cell.2017.08.047
  55. Turner, T. N., Hormozdiari, F., Duyzend, M. H., McClymont, S. A., Hook, P. W., Iossifov, I., Raja, A., Baker, C., Hoekzema, K., Stessman, H. A., Zody, M. C., Nelson, B. J., Huddleston, J., Sandstrom, R., Smith, J. D., Hanna, D., Swanson, J. M., Faustman, E. M., Bamshad, M. J., … Eichler, E. E. (2016). Genome sequencing of autism‐affected families reveals disruption of putative noncoding regulatory DNA. American Journal of Human Genetics, 98(1), 58–74. 10.1016/j.ajhg.2015.11.023
  56. Turner, T. N., Wilfert, A. B., Bakken, T. E., Bernier, R. A., Pepper, M. R., Zhang, Z., Torene, R. I., Retterer, K., & Eichler, E. E. (2019). Sex‐based analysis of de novo variants in neurodevelopmental disorders. American Journal of Human Genetics, 105(6), 1274–1285. 10.1016/j.ajhg.2019.11.003
  57. Wang, T., Antonacci‐Fulton, L., Howe, K., Lawson, H. A., Lucas, J. K., Phillippy, A. M., Popejoy, A. B., Asri, M., Carson, C., Chaisson, M. J. P., Chang, X., Cook‐Deegan, R., Felsenfeld, A. L., Fulton, R. S., Garrison, E. P., Garrison, N. A., Graves‐Lindsay, T. A., Ji, H., Kenny, E. E., … Haussler, D. (2022). The human pangenome project: A global resource to map genomic diversity. Nature, 604(7906), 437–446. 10.1038/s41586-022-04601-8
  58. Werling, D. M., Brand, H., An, J. Y., Stone, M. R., Zhu, L., Glessner, J. T., Collins, R. L., Dong, S., Layer, R. M., Markenscoff‐Papadimitriou, E., Farrell, A., Schwartz, G. B., Wang, H. Z., Currall, B. B., Zhao, X., Dea, J., Duhn, C., Erdman, C. A., Gilson, M. C., … Sanders, S. J. (2018). An analytical framework for whole‐genome sequence association studies and its implications for autism spectrum disorder. Nature Genetics, 50(5), 727–736. 10.1038/s41588-018-0107-y
  59. Wilfert, A. B., Turner, T. N., Murali, S. C., Hsieh, P., Sulovari, A., Wang, T., Coe, B. P., Guo, H., Hoekzema, K., Bakken, T. E., Winterkorn, L. H., Evani, U. S., Byrska‐Bishop, M., Earl, R. K., Bernier, R. A., Zhou, X., Feliciano, P., Hall, J., Astrovskaya, I., … Eichler, E. E. (2021). Recent ultra‐rare inherited variants implicate new autism candidate risk genes. Nature Genetics, 53, 1125–1134. 10.1038/s41588-021-00899-8
  60. Yun, T., Li, H., Chang, P.‐C., Lin, M. F., Carroll, A., & McLean, C. Y. (2020). Accurate, scalable cohort variant calls using DeepVariant and GLnexus. bioRxiv, 2020.02.10.942086. 10.1101/2020.02.10.942086
  61. Zaidi, S., Choi, M., Wakimoto, H., Ma, L., Jiang, J., Overton, J. D., Romano‐Adesman, A., Bjornson, R. D., Breitbart, R. E., Brown, K. K., Carriero, N. J., Cheung, Y. H., Deanfield, J., DePalma, S., Fakhro, K. A., Glessner, J., Hakonarson, H., Italia, M. J., Kaltman, J. R., … Lifton, R. P. (2013). De novo mutations in histone‐modifying genes in congenital heart disease. Nature, 498(7453), 220–223. 10.1038/nature12141
  62. Zheng‐Bradley, X., & Flicek, P. (2017). Applications of the 1000 genomes project resources. Briefings in Functional Genomics, 16(3), 163–170. 10.1093/bfgp/elw027
  63. Zook, J. M., Chapman, B., Wang, J., Mittelman, D., Hofmann, O., Hide, W., & Salit, M. (2014). Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nature Biotechnology, 32(3), 246–251. 10.1038/nbt.2835

Supplementary Materials

Supporting information.


Articles from Human Mutation are provided here courtesy of Wiley