In the current high-throughput genomics era, efficient and accurate analysis of large-scale whole-genome sequencing (WGS) data constitutes a computational bottleneck. Johnston et al. (1) introduce the PEMapper/PECaller software package for short-read WGS alignment and variant calling, promising faster analyses with reduced output file sizes and “nearly identical (or better)” variant calling accuracy compared with the de facto standard Burrows–Wheeler aligner/Genome Analysis Toolkit (BWA/GATK) best-practices pipeline (2). However, we cannot confirm this promised BWA/GATK-like accuracy of PEMapper/PECaller, and there are other pipelines offering ultrafast WGS data analyses with small disk footprints, as we show in this correspondence.
To assess the sensitivity (recall), precision, computation time, and disk footprint of four pipelines, we performed alignment and variant calling on the reference short-read WGS data of NA12878 and the Ashkenazim trio (3, 4). The four pipelines were the freely downloadable PEMapper/PECaller (1) and BWA/GATK (2) as well as the commercially available Isaac (5) and GENALICE MAP (genalice.com) software packages (versions and settings specified in Fig. 1). To reduce systematic errors and alignment artifacts as far as possible, we limited our benchmarking of whole-genome variant calling to the coding part of the high-confidence BED file of GIAB 3.3 (https://github.com/genome-in-a-bottle), excluding exons with mappability <1, with differences between GRCh37 and GRCh38, and/or with common copy number variations (CNVs) (6).
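The two accuracy metrics used above can be illustrated with a minimal Python sketch. It is not the comparison tooling used in this benchmark; the `Variant` type, the `in_regions` and `benchmark` helpers, and the toy coordinates are hypothetical and serve only to show how calls are restricted to BED-style regions and then counted as true positives (TP), false positives (FP), and false negatives (FN). Exact matching on position and alleles is a simplifying assumption; real comparisons normally also normalize variant representation.

```python
from typing import Dict, Iterable, NamedTuple, Set, Tuple


class Variant(NamedTuple):
    """Hypothetical minimal variant record (not tied to any VCF library)."""
    chrom: str
    pos: int  # 1-based position, as in VCF
    ref: str
    alt: str


# BED-style interval: chromosome, 0-based start, exclusive end
Region = Tuple[str, int, int]


def in_regions(variant: Variant, regions: Iterable[Region]) -> bool:
    """Return True if the 1-based variant position lies inside any BED interval."""
    return any(
        chrom == variant.chrom and start < variant.pos <= end
        for chrom, start, end in regions
    )


def benchmark(calls: Set[Variant], truth: Set[Variant],
              regions: Iterable[Region]) -> Dict[str, float]:
    """Count TP/FP/FN inside the regions and derive sensitivity and precision."""
    regions = list(regions)
    calls_in = {v for v in calls if in_regions(v, regions)}
    truth_in = {v for v in truth if in_regions(v, regions)}
    tp = len(calls_in & truth_in)   # called and present in the truth set
    fp = len(calls_in - truth_in)   # called but absent from the truth set
    fn = len(truth_in - calls_in)   # present in the truth set but not called
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    precision = tp / (tp + fp) if tp + fp else float("nan")
    return {"TP": tp, "FP": fp, "FN": fn,
            "sensitivity": sensitivity, "precision": precision}


# Toy data (invented for illustration only, not results from this study)
regions = [("chr1", 100, 200)]
truth = {Variant("chr1", 150, "A", "G"),
         Variant("chr1", 180, "C", "T"),
         Variant("chr1", 500, "G", "A")}   # outside the region, ignored
calls = {Variant("chr1", 150, "A", "G"),
         Variant("chr1", 190, "T", "C"),
         Variant("chr1", 500, "G", "A")}   # outside the region, ignored

result = benchmark(calls, truth, regions)
print(result)
```

In this toy case one call matches the truth set, one call is spurious, and one true variant is missed, so both sensitivity and precision are 0.5. Variants outside the restricted regions affect neither metric, which is the purpose of limiting the comparison to high-confidence regions.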
In our benchmarking, PEMapper/PECaller, although powerful, was neither the fastest pipeline (Fig. 1) nor as sensitive in variant calling as BWA/GATK (Fig. 2A). Indeed, PEMapper/PECaller produced the highest number of false-negative calls (Fig. 2A), making it less suitable for clinical sequencing. As expected, BWA/GATK showed the highest sensitivity but fell behind the other three pipelines in run time and disk footprint. GENALICE MAP showed sensitivity comparable to that of BWA/GATK (Fig. 2A) but with a 112× faster total run time and a 45× lower disk footprint (Fig. 1). In precision, only minor differences were observed among pipelines, except for the PEMapper/PECaller population calling and the GENALICE MAP single-sample calling pipelines, which showed the lowest and a distinctly lower precision, respectively, when run on downloaded FASTQ files (Fig. 2B). The difference between downloaded and our in-house data was also pronounced in the sensitivity of the PEMapper/PECaller single-sample pipeline (Fig. 2A), suggesting that the input sequencing reads have a considerable influence on PEMapper/PECaller.
A caveat is that the reference datasets applied here may themselves have been used for pipeline optimization; however, no alternative, unbiased whole-genome truth sets are available for benchmarking. Moreover, PEMapper/PECaller does not output BAM files, which are particularly useful in clinical sequencing for evaluating called variants and for CNV detection. Regarding run time, BWA/GATK might soon catch up with PEMapper/PECaller if the upcoming GATK version 4.0 is indeed 5× faster as announced, or might even become faster if accelerated by the DRAGEN platform (edicogenome.com) or by compressive methods such as CORA (7). Impressively, GENALICE MAP has already achieved ultrarapid speed and a markedly lower disk footprint with BWA/GATK-like sensitivity, thus enabling efficient (re)analyses of ever-increasing amounts of WGS data.
Acknowledgments
We thank David J. Cutler and H. Richard Johnston, and Bas Tolhuis, Jack Findhammer, and Johannes Karten for help with PEMapper/PECaller and GENALICE MAP, respectively. This work was supported by the Blumenau-Léonie Hartmann-Stiftung, Gebauer Stiftung, Gemeinnützige Stiftung der ehemaligen Sparkasse Limmattal, and Wohlfahrtsstiftung des Vereines Zürcher Brockenhaus.
Footnotes
The authors declare no conflict of interest.
References
- 1. Johnston HR, et al.; International Consortium on Brain and Behavior in 22q11.2 Deletion Syndrome. PEMapper and PECaller provide a simplified approach to whole-genome sequencing. Proc Natl Acad Sci USA. 2017;114:E1923–E1932. doi: 10.1073/pnas.1618065114.
- 2. Van der Auwera GA, et al. From FastQ data to high confidence variant calls: The Genome Analysis Toolkit best practices pipeline. Curr Protoc Bioinformatics. 2013;11:11.10.1–11.10.33. doi: 10.1002/0471250953.bi1110s43.
- 3. Zook JM, et al. Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nat Biotechnol. 2014;32:246–251. doi: 10.1038/nbt.2835.
- 4. Zook JM, et al. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci Data. 2016;3:160025. doi: 10.1038/sdata.2016.25.
- 5. Raczy C, et al. Isaac: Ultra-fast whole-genome secondary analysis on Illumina sequencing platforms. Bioinformatics. 2013;29:2041–2043. doi: 10.1093/bioinformatics/btt314.
- 6. Meienberg J, Bruggmann R, Oexle K, Matyas G. Clinical sequencing: Is WGS the better WES? Hum Genet. 2016;135:359–362. doi: 10.1007/s00439-015-1631-9.
- 7. Yorukoglu D, Yu YW, Peng J, Berger B. Compressive mapping for next-generation sequencing. Nat Biotechnol. 2016;34:374–376. doi: 10.1038/nbt.3511.
- 8. Li H. Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics. 2014;30:2843–2851. doi: 10.1093/bioinformatics/btu356.