Summary
De novo genetic variants are an important source of causative variation in complex genetic disorders. Many methods for variant discovery rely on mapping reads to a reference genome, detecting numerous inherited variants irrelevant to the phenotype of interest. To distinguish between inherited and de novo variation, sequencing of families (parents and siblings) is commonly pursued. However, standard mapping-based approaches tend to have a high false-discovery rate for de novo variant prediction. Kevlar is a mapping-free method for de novo variant discovery, based on direct comparison of sequences between related individuals. Kevlar identifies high-abundance k-mers unique to the individual of interest. Reads containing these k-mers are partitioned into disjoint sets by shared k-mer content for variant calling, and preliminary variant predictions are sorted using a probabilistic score. We evaluated Kevlar on simulated and real datasets, demonstrating its ability to detect both de novo single-nucleotide variants and indels with high accuracy.
Subject Areas: Bioinformatics, Biological Sciences, Genetics
Highlights
- Method for discovery of de novo variants without mapping reads to a reference genome
- Novel probabilistic score for ranking variant predictions as confidently de novo
- Predicts de novo SNVs, indels, and structural variants with high accuracy
- Higher accuracy than competing methods for predicting long (>100 bp) variants
Introduction
It is speculated that genetic variation is a major contributing factor in complex genetic disorders. The genetic heritability of many disorders is estimated to be relatively high. For example, the heritability of autism spectrum disorder is over 0.6, and the heritability of schizophrenia is over 0.5 (Cardno et al., 1999, Hallmayer et al., 2011). Only a fraction of this heritability is explained by known genetic variants, however, a phenomenon termed missing heritability (Manolio et al., 2009). One hypothesis is that de novo mutations, in particular indels and structural variants (SVs), are a large source of causative variation (and consequently missing heritability) in developmental disorders (Eichler et al., 2010, Manolio et al., 2009, Veltman and Brunner, 2012). However, the complexity of de novo variant discovery, especially de novo indel and SV discovery, has resulted in incomplete accounting of their contribution to these disorders. The discovery of genetic variants in general, and de novo variants in particular, remains a topic of intense research interest. In addition to illuminating the role of genetic variation in the etiology of complex disorders, improved discovery and cataloging of de novo variants across many samples or cohorts will shed additional light on important unresolved questions in human genomics, including rates, biases, and mechanisms of new mutation.
Whole genome sequencing of simplex families (presenting an isolated case of a genetic disorder) is a proven successful approach for discovery of novel genetic variants resulting from de novo mutation in the germline (Fromer et al., 2014, Iossifov et al., 2014, O’Roak et al., 2012, Veltman and Brunner, 2012, Zaidi et al., 2013). A “trio” composed of an individual affected by the disorder (the proband), the mother, and the father (alternatively, a “quad” or “quartet” composed of the proband, mother, father, and a sibling) provides a rich information source for discriminating between shared and unique variation. Following standard variant calling protocols, mapping-based methods for de novo variant prediction begin by aligning reads to the reference genome. Variants are then predicted for each individual based on artifacts observed in the read alignments, such as mismatches, gaps, abrupt shifts in coverage, and discordant read pair distances or orientations (Hormozdiari et al., 2009, Layer et al., 2014, Medvedev et al., 2010, Rausch et al., 2012, Sindi et al., 2012, Soylev et al., 2017, Ye et al., 2009). This initial process typically results in millions of variant predictions, which de novo variant discovery algorithms must examine to discern between inherited variation, true de novo variation, and spurious variant calls.
Although reference-based variant discovery methods have proved valuable in the study of complex genetic disorders, we note some of their limitations. Despite consistent improvements in read alignment algorithms, finding the correct mapping for each read is still complicated by sequencing errors, repetitive DNA content, and misassemblies in the reference. Reads that do not map to the reference genome because they span mutation breakpoints or contain novel sequence are ignored completely by mapping-based variant predictors. Also, few methods are able to predict multiple variant types simultaneously using a single strategy, instead focusing exclusively on single-nucleotide variants (SNVs), short indels, or SVs separately. Finally, most variant calls determined by analysis of read alignments are not unique to the individual of interest (child, or proband) but instead reflect divergence in ancestry between the family and the reference genome donors. Estimates of human germline mutation rates give an expectation of approximately 80 novel mutations per generation (Campbell and Eichler, 2013), and distinguishing true de novo variation events from millions of inherited or false variants is a substantial challenge.
More generally, accurate and comprehensive de novo variant discovery is complicated by several computational and biological factors, and remains an elusive goal. Any algorithm must be confident not only in the existence of the variant in the proband but also in its non-existence in both parents. And although SNVs are the most common variant type, larger variants, though less frequent, affect more nucleotides overall and are hypothesized to have an even greater impact in genetic disorders. Accurate discovery of these larger de novo variants is particularly challenging due to the inherent complexity of indel and SV prediction. In a reference-mapping context, calling indels with confidence requires accurate mapping of each read spanning the indel, with all gaps arranged consistently. This is possible only for short indels and tends to be prone to error and misalignment. Thus prediction of indels longer than 10 bp has proved to be very challenging and accompanied by high false-positive and false-negative rates. Furthermore, the prediction of SVs via read mapping is only possible through indirect signatures such as alterations in read depth or read-pair signatures. These signatures can be quite noisy and result in a high rate of false-negative and false-positive predictions. As a result, some basic properties of de novo SVs, including their rate of occurrence, remain unknown. It is important to note that there also exists no method for predicting more complex types of de novo SVs, such as inversion-duplication.
Many of the challenges with de novo variant prediction can be mitigated by an approach that compares sequence content between related individuals directly, rather than indirectly via a reference genome. Such an approach requires no read alignments and is not sensitive to off-target shared or inherited variation. What a mapping-free approach does require is a signature of variation that is not defined in terms of artifacts observed in read alignments.
One of the first tools to explore a mapping-free strategy for predicting and genotyping variants was Cortex, which introduced the concept of a “colored de Bruijn graph” to compare sequence content from two or more samples and predict variants between samples (Iqbal et al., 2012). Cortex was used successfully for predicting variants in the 1000 Genomes Project. The DiscoSnp method (Uricaru et al., 2014) implemented a very efficient strategy for scanning a de Bruijn graph for “bubbles” reflective of isolated SNVs. More recently, DiscoSnp++ has improved on this strategy and is capable of predicting isolated SNVs, proximal SNVs, and indels without the use of a reference genome (Peterlongo et al., 2017). At the core of both methods is the analysis of k-mers, or sequences of a fixed length k.
Increased attention is being given to these kinds of k-mer-based methods that avoid read alignments altogether. Indeed, mapping-free strategies for a variety of genomic and transcriptomic applications have become increasingly prominent, in large part due to their efficiency and robustness to the shortcomings of reference genomes. (It is important to note that these and other developments have greatly benefited from the availability of software libraries for rapid exact and approximate k-mer counting; these libraries include Jellyfish, Marçais and Kingsford, 2011; khmer, Crusoe et al., 2015; ntHash, Mohamadi et al., 2016; DSK, Rizk et al., 2013; and KMC, Deorowicz et al., 2013.) In the realm of transcriptome analysis, tools such as Kallisto (Bray et al., 2016) and Sailfish (Patro et al., 2014) are capable of accurate RNA-sequencing quantification at a fraction of the time and computational cost of previous mapping-based strategies. A recent study has also introduced a novel mapping-free method for performing genome-wide association studies from whole-genome sequence data using k-mer counts (Rahman et al., 2018). The tool, Hawk, performs rapid and accurate discovery of variant-phenotype associations by directly comparing k-mer frequencies between arbitrary numbers of case and control samples. Hawk counts all k-mers in the sequenced samples, finds k-mers that are significantly associated with the phenotype or trait of interest (“significant k-mers”), and then performs a local assembly of these significant k-mers to predict the corresponding variants associated with the trait. This approach provides an efficient method for discovery of significant associations between all types of variants (i.e., SNVs, indels, and SVs) and the phenotype or trait of interest (Rahman et al., 2018).
Developments in variant prediction frameworks continue to spur improvements in a variety of contexts. Scalpel (Narzisi et al., 2014) implements a hybrid method for de novo indel discovery from whole-exome sequencing of quads. Read mapping is used only to localize reads to the reference genome. In subsequent steps, Scalpel performs localized de novo assembly of reads at loci of interest and aligns assembled contigs back to the loci to annotate any de novo variants present (Narzisi et al., 2014). More recently, NovoBreak (Chong et al., 2017) introduced a method that utilizes k-mer counts to predict somatic variants, including SVs, by comparison of paired tumor and normal whole-genome sequence samples. COBASI (Gómez-Romero et al., 2018) performs rapid and accurate de novo SNV discovery on whole-genome sequencing of trios by computing perfect matches to unique strings in the reference genome and then identifying abrupt shifts in the coverage of the resulting alignments. Finally, mapping-free approaches such as LAVA (Shajii et al., 2016), VarGeno (Sun and Medvedev, 2018), MALVA (Bernardini et al., 2019), and Nebula (Khorsand and Hormozdiari, 2019) were recently developed for fast and accurate genotyping of common variation using whole-genome sequencing data.
The present study introduces a new mapping-free strategy grounded on a k-mer-based formulation of the de novo variant discovery problem—see Figure 1A. Intuitively, a novel germline mutation should result in new sequence content in the proband compared with the parental genomes. Even in the simplest case, a single-nucleotide substitution, most of the k-mers spanning the mutation should be unique, given a sufficiently large value of k. Incidentally, this is also true for other classes of variants, such as indels and various types of structural variation. And with sufficiently deep sampling of the proband genome, the expectation is that these novel k-mers are present in the read data at levels that can be readily distinguished from sequencing errors. Thus, it should be possible to detect both SNVs and larger variants (indels, SVs) simultaneously using a single mapping-free model.
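The intuition above can be checked on a toy example. The following is a minimal sketch, not Kevlar's implementation: the sequences, the SNV position, and the function name `kmers` are all made up for illustration, and reverse complements are ignored. It shows that a single substitution yields k novel k-mers when the surrounding sequence is not repeated elsewhere.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


# Hypothetical parental haplotype and a proband copy with one substitution.
parent = "ACGTTGCATGGCTAAGCTTCGATCCGA"
pos = 13  # 0-based position of the illustrative de novo SNV (A -> C)
proband = parent[:pos] + "C" + parent[pos + 1:]

k = 7
# k-mers present in the proband but absent from the parent.
novel = kmers(proband, k) - kmers(parent, k)

print(len(novel))  # 7: every k-mer overlapping the SNV is new
print(sorted(novel))
```

In a real genome some of these k-mers may already occur elsewhere, which is why the text says "most" rather than "all" and why a sufficiently large k matters.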
Building on this intuition, we developed Kevlar, a new method based on a mapping-free formulation of the de novo variant discovery problem. Kevlar examines k-mer abundances to identify “interesting” k-mers, which we define as having significantly high abundance in the reads from the proband (child) while being effectively absent in the reads from both parents. These interesting k-mers are an indicator of the potential existence of a de novo variant in the proband and are conceptually similar to the “significant” k-mers used by Hawk (Rahman et al., 2018). We next group the reads containing interesting k-mers into disjoint sets, each reflecting a putative variant, based on the k-mers shared between pairs of reads. Kevlar then uses standard algorithms to assemble each set of reads into contigs and align the assembled contigs to a reference genome to make preliminary variant calls. Finally, Kevlar employs a probabilistic model to score predicted variants to distinguish between miscalled inherited variants and true de novo mutations.
We demonstrate the utility of this new method on simulated and real data. We show that Kevlar achieves similar predictive performance to best-in-class tools for SNV and short indel discovery, while at the same time predicting larger events with high accuracy. We also demonstrate Kevlar's ability to accurately predict large-scale SV events, defining breakpoints with nucleotide-level precision.
Kevlar is available as an open source software project and can be invoked via a Python API, a command-line interface, or a standard Snakemake workflow (Köster and Rahmann, 2012). The stable and actively developed source code is available at https://github.com/kevlar-dev/kevlar, and documentation is available at https://kevlar.readthedocs.io.
Results
We present a novel framework for discovery of de novo variants based on direct comparisons of sequence content between related individuals, requiring no mapping of short reads to a reference genome. This framework utilizes a single strategy that accurately predicts SNVs, insertions and deletions (indels), and structural variation events simultaneously.
Overview of Kevlar
Our variant discovery strategy is fundamentally a search for novel DNA content in the sample of interest. It is based on the observation that k-mers (short subsequences of fixed length k) spanning a de novo mutation will be novel with high probability (Figures 1B and 1C). Often the subject is a child affected by a disorder or other trait of interest (referred to as proband), with related individuals being the two parents.
Figure 1D summarizes the Kevlar workflow. In brief, DNA sequence reads from the case and control samples are processed independently. For each sample, the reads are split into k-mers and the abundance of each k-mer is stored for subsequent lookup. A second pass over reads from the case sample then identifies all k-mers that are unique to the proband—that is, k-mers that are abundant in the proband but effectively absent in both parents. Reads containing any novel k-mers are retained for subsequent processing.
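The two-pass idea described here can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than Kevlar's code: it uses exact in-memory counts, ignores reverse complements, and the function names and the thresholds `case_min` and `ctrl_max` are hypothetical stand-ins for "abundant in the proband" and "effectively absent in both parents".

```python
from collections import Counter


def count_kmers(reads, k):
    """First pass: tally the abundance of every k-mer in one sample."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def find_novel_kmers(case_reads, control_read_sets, k, case_min=5, ctrl_max=0):
    """k-mers abundant in the case sample and (nearly) absent in every control.

    case_min and ctrl_max are assumed thresholds chosen for illustration.
    """
    case_counts = count_kmers(case_reads, k)
    control_counts = [count_kmers(reads, k) for reads in control_read_sets]
    return {
        kmer
        for kmer, n in case_counts.items()
        if n >= case_min and all(c[kmer] <= ctrl_max for c in control_counts)
    }


def reads_with_novel_kmers(case_reads, novel, k):
    """Second pass: keep case reads containing at least one novel k-mer."""
    return [
        read
        for read in case_reads
        if any(read[i:i + k] in novel for i in range(len(read) - k + 1))
    ]


# Toy trio: the proband carries five reads with an A -> T substitution.
mother = ["ACGTACGTAC"] * 3
father = ["ACGTACGTAC"] * 3
proband = ["ACGTACGTAC"] * 3 + ["ACGTTCGTAC"] * 5

novel = find_novel_kmers(proband, [mother, father], k=5)
kept = reads_with_novel_kmers(proband, novel, k=5)
```

Exact counting as shown does not scale to whole genomes; it is used here only to make the selection criterion concrete.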
After applying filters for contamination and erroneous k-mer abundances, the reads containing novel k-mers are partitioned such that any two reads sharing at least one novel k-mer are grouped together. The reads in each partition are then analyzed independently: they are assembled into a contig, the contig is aligned to the reference genome, and the alignment is used to assess the presence of a variant and make a variant call. Finally, Kevlar employs a likelihood-based score to rank and filter the variant calls.
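The partitioning rule (any two reads sharing at least one novel k-mer end up in the same group) amounts to finding connected components, for which a union-find structure is a natural fit. The sketch below is one possible way to express it, not necessarily how Kevlar does it. The function name `partition_reads` is hypothetical, and the input is assumed to be reads already filtered to contain novel k-mers.

```python
def partition_reads(reads, novel, k):
    """Group reads into disjoint sets connected by shared novel k-mers."""
    parent = list(range(len(reads)))

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # novel k-mer -> index of the first read containing it
    for idx, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer not in novel:
                continue
            if kmer in first_seen:
                # Merge this read's group with the group that owns the k-mer.
                parent[find(idx)] = find(first_seen[kmer])
            else:
                first_seen[kmer] = idx

    groups = {}
    for idx, read in enumerate(reads):
        groups.setdefault(find(idx), []).append(read)
    return list(groups.values())


reads = ["AAACCC", "ACCCGG", "TTTGGA", "GTTTGG"]
parts = partition_reads(reads, novel={"ACCC", "TTTG"}, k=4)
```

Because grouping is transitive, two reads can land in the same partition without sharing a k-mer directly, as long as a chain of reads connects them.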
Each step of the Kevlar workflow is discussed in detail in the Transparent Methods.
Performance on Simulated Data
We simulated whole-genome shotgun sequencing of a mock family for a fine-grained assessment of Kevlar's accuracy in identifying different variant types at different levels of sequencing depth. Our simulation not only realistically modeled the inheritance of parental variants but also included hundreds of “de novo” (unique to the proband) SNVs and indels ranging in size from 10 to 400 bp. The sequencing was simulated at 10x, 20x, 30x, and 50x coverage with low error rate. We compared Kevlar's accuracy on this dataset with two widely used mapping-based de novo variant callers (GATK PhaseByTransmission, Francioli et al., 2016; and TrioDenovo, Wei et al., 2015) as well as two mapping-free or hybrid variant callers (Scalpel, Narzisi et al., 2014; and DiscoSnp++, Peterlongo et al., 2017).
The accuracy of all variant callers evaluated is poor at low (10x) coverage (see Figure S1). GATK PhaseByTransmission makes very few variant predictions at 10x coverage. The remaining variant callers report numerous predictions, but in general suffer from both low sensitivity (failing to predict many true variants) and poor specificity (predicting many false variants). TrioDenovo shows the best prediction performance for SNVs and short (1–100 bp) indels at 10x coverage. At 20x coverage (Figure S2), all five algorithms show marked improvement in SNV detection, in particular TrioDenovo, which achieves 90% sensitivity. Scalpel exhibits both improved sensitivity and improved specificity at 20x coverage and approaches or surpasses TrioDenovo's performance for indels of most lengths. Kevlar's ability to accurately detect indels >100 bp becomes evident at 20x coverage.
At higher levels of coverage (30x and 50x), Kevlar consistently achieves top performance across all variant types (see Figures 2 and S3). Notably, Kevlar recovers 90% of true variants while making very few false predictions across all variant types at high coverage. TrioDenovo shows marginally better sensitivity than Kevlar for predicting SNVs at 30x and 50x (as does GATK PhaseByTransmission at 50x), but at the expense of numerous false predictions. Kevlar also rivals Scalpel's impressive short indel prediction performance and exceeds it for predicting long (>100 bp) indels.
Performance on the SSC 14153 Autism Trio
To assess Kevlar's performance on real data, we applied Kevlar to predict de novo variants in the proband of an autism trio from the Simons Simplex Collection (family 14153). As a reference for comparison, we obtained a potential “truth set” from the denovo-db database (http://denovo-db.gs.washington.edu/denovo-db/). This truth set includes 196 de novo variant predictions and represents the union of predictions made for this trio by several recent studies (Turner et al., 2016, Turner et al., 2017, Werling et al., 2018). Note that the expected number of de novo variants per generation is estimated to be around 100 (Campbell and Eichler, 2013, Turner et al., 2016), or about half of the number of predictions in the truth set. Annotations in the denovo-db database indicate that experimental validation has confirmed 14 of the 196 calls.
In total, Kevlar predicts 219 de novo variants for trio 14153, including 150 SNVs, 68 indels/SVs, and a single 2-bp multinucleotide variant. We note that Kevlar assigned many of these predicted variants a low likelihood of the variant being a true de novo event. Figure 3 shows the congruence between the 100 top-ranked Kevlar calls and the denovo-db calls for this trio.
Of the 14 denovo-db calls with experimental validation, 13 (92.9%) were predicted accurately by Kevlar and assigned a high likelihood score, indicative of a confident de novo variant call. Overall, the 100 Kevlar variant calls ranked highest by the likelihood score include only four calls not present in denovo-db (probable false calls). On the other hand, only five Kevlar variant calls present in denovo-db (probable true variants) are not among the 100 highest ranked Kevlar calls. Of the 196 denovo-db calls, 95 are absent from the Kevlar predictions. The majority of these calls (75/95, 79%) occur in regions of repetitive DNA and have been shown to be unreliable in experimental validation (Tychele Turner, personal communication).
Finally, a recent study verified the presence of a de novo deletion of approximately 6 kbp in the proband of this trio (Turner et al., 2016), removing the 5′ UTR of the gene CANX. Kevlar also predicted this de novo deletion successfully and identified the precise (and previously undetermined) breakpoints at chr5:179,122,593 and chr5:179,128,130 (GRCh37). Inspection of the variant reveals that both of the deletion's breakpoints occur in Alu repeat elements abundant throughout the genome (Figure 4). As a result, only seven of the k-mers spanning the variant are unique signatures of mutation not already present elsewhere in the genome. We observe with interest that both breakpoints occur inside a 20-bp identical repeat, indicating this de novo deletion is the result of non-allelic homologous recombination.
Discussion
De novo variants are a major contributing factor in many disorders (e.g., intellectual disability, autism, and epilepsy). Accurate discovery of these variants has been challenging as prediction methods need to be confident not only in the existence of the event in the proband or child but also in the absence of the variant in the parents. Current approaches depend on correct alignments of sequence reads to a reference genome. Any complications in computing read alignments due to repeats, gaps in the reference, or variant complexity can result in false predictions or failure to discover a true de novo variant.
The method proposed in this study compares k-mers between related individuals to find the k-mers indicating a de novo variant in the sample of interest. We acknowledge recently proposed methods NovoBreak (Chong et al., 2017) and Hawk (Rahman et al., 2018), which are conceptually similar and likewise capable of accurately predicting de novo variants. Kevlar, Hawk (Rahman et al., 2018), and other related methods do not depend on mapping reads to a reference genome, but instead rely on direct comparison of sequence content between related individuals. This strategy enables Kevlar to accurately predict several classes of de novo mutations (substitutions, insertions, deletions, SVs) simultaneously with a single simple mathematical model. As long as the de novo mutation creates a k-mer not already present in the reference genome, the proposed algorithm should be able to accurately discover the event. We have also developed a k-mer-based likelihood model for scoring and ranking variant calls according to their probability of being true de novo events. This likelihood score is effective in discerning de novo variants from inherited mutations and false variant calls. We have demonstrated the effectiveness of our discovery method and scoring model using both simulated and real data. Kevlar is competitive with best-in-class tools for discovery of a variety of variant types, and substantially outperforms available methods for discovery of larger de novo variants. Kevlar not only predicts indels and SVs with high sensitivity and specificity but also reports the exact breakpoints of these variants with single base pair precision.
De novo variants are, by definition, expected to be unique for each individual. Aggregating multiple simplex trios will not increase the rate of recall. However, multiple trios could potentially be aggregated to identify any systematic errors resulting in the same k-mers being marked as “interesting” in multiple samples. Identifying and removing these k-mers and any corresponding variant calls could improve precision.
Development of completely reference-free methods is tremendously valuable in scenarios where the availability, quality, or relevance of a reference genome is insufficient. Kevlar's preliminary steps—identifying variant-spanning reads, binning reads into groups corresponding to distinct putative variants, and assembling each read group into a variant-spanning contig—are performed without the use of a reference genome. We note, however, that subsequent steps in the Kevlar workflow to annotate, filter, and score the preliminary variant calls still depend on a reference genome. One promising approach to developing a completely reference-free de novo variant discovery method would be to annotate variants by aligning variant-spanning contigs directly to an assembly or variation graph.
Limitations of the Study
Misclassification of heterozygous inherited variants as de novo is one of the main sources of false prediction. These errors are enriched at loci with low coverage in the donor parent. This is due to the difficulty of distinguishing true variation from sequencing error. It is possible that utilizing a probabilistic approach for selecting “interesting” k-mers, as proposed in Hawk (Rahman et al., 2018), can reduce the false de novo prediction rate.
Kevlar will successfully annotate k-mers that span the breakpoints of large insertions. It will also assemble the reads containing these k-mers into breakpoint-spanning contigs. However, unless the inserted sequence is entirely novel, Kevlar is unlikely to assemble a single contig that spans the entire variant, and without such a contig it cannot annotate the variant's precise coordinates.
Even using a probabilistic k-mer counting strategy, Kevlar's memory requirements can be quite demanding. Applying error correction to the input reads will substantially reduce Kevlar's memory requirements, but this typically leads to a small reduction in sensitivity for discovering SNVs.
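To make the memory trade-off concrete, the sketch below shows a Count-Min sketch, one common probabilistic counting structure. It is offered as an assumption about the general family of techniques, not as a description of Kevlar's internals. The class name, `width`, and `depth` are illustrative. Memory is fixed at `width * depth` counters no matter how many distinct k-mers are seen, and reported counts can overestimate but never underestimate.

```python
import hashlib


class CountMinSketch:
    """Fixed-memory approximate counter (illustrative, not Kevlar's code)."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth  # number of independent hash rows (must be < 256)
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, kmer):
        # One salted hash per row selects a single counter in that row.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                kmer.encode(), digest_size=8, salt=bytes([row])
            ).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, kmer):
        for row, col in self._slots(kmer):
            self.tables[row][col] += 1

    def count(self, kmer):
        # Collisions only inflate counters, so the minimum is the best estimate.
        return min(self.tables[row][col] for row, col in self._slots(kmer))


sketch = CountMinSketch()
for kmer in ["ACGTA", "ACGTA", "ACGTA", "CGTAC"]:
    sketch.add(kmer)
```

With a structure of this kind, the number of distinct k-mers determines how large the tables must be to keep collisions rare. Sequencing errors create many spurious distinct k-mers, which is consistent with the observation that error correction reduces memory requirements.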
Finally, when scoring and ranking predicted de novo variants, Kevlar assumes independence between k-mers in the likelihood calculation. Although this assumption simplifies the calculation, a more sophisticated formulation without this limitation may yield improvements in scoring and ranking the final variant calls.
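In generic notation (this is not Kevlar's actual formula, which is defined in the Transparent Methods), an independence assumption of this kind means that the likelihood of the observed abundances a_1, ..., a_n of the n k-mers spanning a call, under a hypothesis H such as "de novo" or "inherited", is approximated by a product of per-k-mer terms. The symbols a_i, n, and H are introduced here for illustration only.

```latex
\mathcal{L}(H) = P(a_1, \dots, a_n \mid H) \approx \prod_{i=1}^{n} P(a_i \mid H)
```

Adjacent k-mers overlap by k-1 bases and are largely drawn from the same reads, so their abundances are correlated; this is why the product is only an approximation.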
Methods
All methods can be found in the accompanying Transparent Methods supplemental file.
Acknowledgments
We would like to acknowledge Dr. Tamer Mansour, Luiz Irber Jr., Camille Scott, and Lisa Johnson for helpful discussions on method development and implementation and Dr. Tychele Turner for helpful discussions on the method evaluation. We also thank reviewers and several colleagues for comments on earlier versions of the manuscript, which have improved the final paper.
This work is funded in part by the Gordon and Betty Moore Foundation's Data-Driven Discovery Initiative through Grant GBMF4551 and NIH R01 HG007513, both to C.T.B., and by the Sloan Research Fellowship number FG-2017-9159 to F.H.
Author Contributions
D.S.S., C.T.B., and F.H. conceived the study. D.S.S. implemented the method and performed the experiments. D.S.S. and F.H. designed the experiments and wrote the manuscript. D.S.S., C.T.B., and F.H. edited and approved the final manuscript.
Declaration of Interests
The authors declare no competing interests.
Published: August 30, 2019
Footnotes
Supplemental Information can be found online at https://doi.org/10.1016/j.isci.2019.07.032.
Contributor Information
Daniel S. Standage, Email: daniel.standage@nbacc.dhs.gov.
C. Titus Brown, Email: ctbrown@ucdavis.edu.
Fereydoun Hormozdiari, Email: fhormozd@ucdavis.edu.
Data and Code Availability
The Kevlar software is hosted as an open source software project at https://github.com/kevlar-dev/kevlar and is freely available under the MIT license. User documentation is available at https://kevlar.readthedocs.io. Reads from the simulated dataset are available in FASTQ format from DOI https://doi.org/10.17605/OSF.IO/4CHPB. Reads from the 14153 trio are available in BAM format from the Simons Simplex Collection at https://www.sfari.org/2015/12/11/whole-genome-analysis-of-the-simons-simplex-collection-ssc-2/#chapter-how-to-access-the-data.
References
- Bernardini G., Bonizzoni P., Denti L., Previtali M., Schönhuth A. Malva: genotyping by mapping-free allele detection of known variants. bioRxiv. 2019:575126. doi: 10.1016/j.isci.2019.07.011. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bray N.L., Pimentel H., Melsted P., Pachter L. Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol. 2016;34:525. doi: 10.1038/nbt.3519. [DOI] [PubMed] [Google Scholar]
- Campbell C.D., Eichler E.E. Properties and rates of germline mutations in humans. Trends Genet. 2013;29:575–584. doi: 10.1016/j.tig.2013.04.005. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cardno A.G., Marshall E.J., Coid B., Macdonald A.M., Ribchester T.R., Davies N.J., Venturi P., Jones L.A., Lewis S.W., Sham P.C. Heritability estimates for psychotic disorders: the Maudsley twin psychosis series. Arch. Gen. Psychiatry. 1999;56:162–168. doi: 10.1001/archpsyc.56.2.162. [DOI] [PubMed] [Google Scholar]
- Chong Z., Ruan J., Gao M., Zhou W., Chen T., Fan X., Ding L., Lee A.Y., Boutros P., Chen J. novobreak: local assembly for breakpoint detection in cancer genomes. Nat. Methods. 2017;14:65. doi: 10.1038/nmeth.4084. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Crusoe M.R., Alameldin H.F., Awad S., Boucher E., Caldwell A., Cartwright R., Charbonneau A., Constantinides B., Edvenson G., Fay S. The Khmer software package: enabling efficient nucleotide sequence analysis. F1000Res. 2015;4:900. doi: 10.12688/f1000research.6924.1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Deorowicz S., Debudaj-Grabysz A., Grabowski S. Disk-based k-mer counting on a pc. BMC Bioinformatics. 2013;14:160. doi: 10.1186/1471-2105-14-160. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Eichler E.E., Flint J., Gibson G., Kong A., Leal S.M., Moore J.H., Nadeau J.H. Missing heritability and strategies for finding the underlying causes of complex disease. Nat. Rev. Genet. 2010;11:446. doi: 10.1038/nrg2809. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Francioli L.C., Cretu-Stancu M., Garimella K.V., Fromer M., Kloosterman W.P., Genome of the Netherlands consortium. Samocha K.E., Neale B.M., Daly M.J., Banks E., DePristo M.A., de Bakker P.I. A framework for the detection of de novo mutations in family-based sequencing data. Eur. J. Hum. Genet. 2016;25:227–233. doi: 10.1038/ejhg.2016.147. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fromer M., Pocklington A.J., Kavanagh D.H., Williams H.J., Dwyer S., Gormley P., Georgieva L., Rees E., Palta P., Ruderfer D.M. De novo mutations in schizophrenia implicate synaptic networks. Nature. 2014;506:179. doi: 10.1038/nature12929. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gómez-Romero L., Palacios-Flores K., Reyes J., García D., Boege M., Dávila G., Flores M., Schatz M.C., Palacios R. Precise detection of de novo single nucleotide variants in human genomes. Proc. Natl. Acad. Sci. U S A. 2018;115:5516–5521. doi: 10.1073/pnas.1802244115. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hallmayer J., Cleveland S., Torres A., Phillips J., Cohen B., Torigoe T., Miller J., Fedele A., Collins J., Smith K. Genetic heritability and shared environmental factors among twin pairs with autism. Arch. Gen. Psychiatry. 2011;68:1095–1102. doi: 10.1001/archgenpsychiatry.2011.76. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hormozdiari F., Alkan C., Eichler E.E., Sahinalp S.C. Combinatorial algorithms for structural variation detection in high-throughput sequenced genomes. Genome Res. 2009;19:1270–1278. doi: 10.1101/gr.088633.108. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Iossifov I., O’Roak B.J., Sanders S.J., Ronemus M., Krumm N., Levy D., Stessman H.A., Witherspoon K.T., Vives L., Patterson K.E. The contribution of de novo coding mutations to autism spectrum disorder. Nature. 2014;515:216. doi: 10.1038/nature13908. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Iqbal Z., Caccamo M., Turner I., Flicek P., McVean G. De novo assembly and genotyping of variants using colored de bruijn graphs. Nat. Genet. 2012;44:226. doi: 10.1038/ng.1028. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Khorsand P., Hormozdiari F. Nebula: Ultra-efficient mapping-free structural variant genotyper. bioRxiv. 2019:566620. doi: 10.1093/nar/gkab025.
- Köster J., Rahmann S. Snakemake: a scalable bioinformatics workflow engine. Bioinformatics. 2012;28:2520–2522. doi: 10.1093/bioinformatics/bts480.
- Layer R.M., Chiang C., Quinlan A.R., Hall I.M. Lumpy: a probabilistic framework for structural variant discovery. Genome Biol. 2014;15:R84. doi: 10.1186/gb-2014-15-6-r84.
- Manolio T.A., Collins F.S., Cox N.J., Goldstein D.B., Hindorff L.A., Hunter D.J., McCarthy M.I., Ramos E.M., Cardon L.R., Chakravarti A. Finding the missing heritability of complex diseases. Nature. 2009;461:747. doi: 10.1038/nature08494.
- Marçais G., Kingsford C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics. 2011;27:764–770. doi: 10.1093/bioinformatics/btr011.
- Medvedev P., Fiume M., Dzamba M., Smith T., Brudno M. Detecting copy number variation with mated short reads. Genome Res. 2010;20:1613–1622. doi: 10.1101/gr.106344.110.
- Mohamadi H., Chu J., Vandervalk B.P., Birol I. ntHash: recursive nucleotide hashing. Bioinformatics. 2016;32:3492–3494. doi: 10.1093/bioinformatics/btw397.
- Narzisi G., O’Rawe J.A., Iossifov I., Fang H., Lee Y.-h., Wang Z., Wu Y., Lyon G.J., Wigler M., Schatz M.C. Accurate de novo and transmitted indel detection in exome-capture data using microassembly. Nat. Methods. 2014;11:1033. doi: 10.1038/nmeth.3069.
- O’Roak B.J., Vives L., Girirajan S., Karakoc E., Krumm N., Coe B.P., Levy R., Ko A., Lee C., Smith J.D. Sporadic autism exomes reveal a highly interconnected protein network of de novo mutations. Nature. 2012;485:246. doi: 10.1038/nature10989.
- Patro R., Mount S.M., Kingsford C. Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms. Nat. Biotechnol. 2014;32:462. doi: 10.1038/nbt.2862.
- Peterlongo P., Riou C., Drezen E., Lemaitre C. DiscoSnp++: de novo detection of small variants from raw unassembled read set(s). bioRxiv. 2017:209965.
- Rahman A., Hallgrímsdóttir I., Eisen M., Pachter L. Association mapping from sequencing reads using k-mers. Elife. 2018;7:e32920. doi: 10.7554/eLife.32920.
- Rausch T., Zichner T., Schlattl A., Stütz A.M., Benes V., Korbel J.O. DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics. 2012;28:i333–i339. doi: 10.1093/bioinformatics/bts378.
- Rizk G., Lavenier D., Chikhi R. DSK: k-mer counting with very low memory usage. Bioinformatics. 2013;29:652–653. doi: 10.1093/bioinformatics/btt020.
- Shajii A., Yorukoglu D., William Yu Y., Berger B. Fast genotyping of known SNPs through approximate k-mer matching. Bioinformatics. 2016;32:i538–i544. doi: 10.1093/bioinformatics/btw460.
- Sindi S.S., Önal S., Peng L.C., Wu H.-T., Raphael B.J. An integrative probabilistic model for identification of structural variation in sequencing data. Genome Biol. 2012;13:R22. doi: 10.1186/gb-2012-13-3-r22.
- Soylev A., Kockan C., Hormozdiari F., Alkan C. Toolkit for automated and rapid discovery of structural variants. Methods. 2017;129:3–7. doi: 10.1016/j.ymeth.2017.05.030.
- Sun C., Medvedev P. Toward fast and accurate SNP genotyping from whole genome sequencing data for bedside diagnostics. bioRxiv. 2018:239871. doi: 10.1093/bioinformatics/bty641.
- Turner T.N., Coe B.P., Dickel D.E., Hoekzema K., Nelson B.J., Zody M.C., Kronenberg Z.N., Hormozdiari F., Raja A., Pennacchio L.A. Genomic patterns of de novo mutation in simplex autism. Cell. 2017;171:710–722. doi: 10.1016/j.cell.2017.08.047.
- Turner T.N., Hormozdiari F., Duyzend M.H., McClymont S.A., Hook P.W., Iossifov I., Raja A., Baker C., Hoekzema K., Stessman H.A. Genome sequencing of autism-affected families reveals disruption of putative noncoding regulatory DNA. Am. J. Hum. Genet. 2016;98:58–74. doi: 10.1016/j.ajhg.2015.11.023.
- Uricaru R., Rizk G., Lacroix V., Quillery E., Plantard O., Chikhi R., Lemaitre C., Peterlongo P. Reference-free detection of isolated SNPs. Nucleic Acids Res. 2014;43:e11. doi: 10.1093/nar/gku1187.
- Veltman J.A., Brunner H.G. De novo mutations in human genetic disease. Nat. Rev. Genet. 2012;13:565. doi: 10.1038/nrg3241.
- Wei Q., Zhan X., Zhong X., Liu Y., Han Y., Chen W., Li B. A Bayesian framework for de novo mutation calling in parents-offspring trios. Bioinformatics. 2015;31:1375–1381. doi: 10.1093/bioinformatics/btu839.
- Werling D.M., Brand H., An J.-Y., Stone M.R., Zhu L., Glessner J.T., Collins R.L., Dong S., Layer R.M., Markenscoff-Papadimitriou E. An analytical framework for whole-genome sequence association studies and its implications for autism spectrum disorder. Nat. Genet. 2018;50:727–736. doi: 10.1038/s41588-018-0107-y.
- Ye K., Schulz M.H., Long Q., Apweiler R., Ning Z. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics. 2009;25:2865–2871. doi: 10.1093/bioinformatics/btp394.
- Zaidi S., Choi M., Wakimoto H., Ma L., Jiang J., Overton J.D., Romano-Adesman A., Bjornson R.D., Breitbart R.E., Brown K.K. De novo mutations in histone-modifying genes in congenital heart disease. Nature. 2013;498:220. doi: 10.1038/nature12141.
Data Availability Statement
The Kevlar software is an open-source project hosted at https://github.com/kevlar-dev/kevlar and is freely available under the MIT license. User documentation is available at https://kevlar.readthedocs.io. Reads from the simulated dataset are available in FASTQ format at https://doi.org/10.1706/ODF.IO/4CHPB. Reads from the 14153 trio are available in BAM format from the Simons Simplex Collection at https://www.sfari.org/2015/12/11/whole-genome-analysis-of-the-simons-simplex-collection-ssc-2/#chapter-how-to-access-the-data.