F1000Research. 2024 Sep 12;12:1125. Originally published 2023 Sep 11. [Version 2] doi: 10.12688/f1000research.140344.2

NCBench: providing an open, reproducible, transparent, adaptable, and continuous benchmark approach for DNA-sequencing-based variant calling

Friederike Hanssen 1, Gisela Gabernet 1, Famke Bäuerle 1,2,3,4, Bianca Stöcker 5, Felix Wiegand 5, Nicholas H Smith 6, Christian Mertes 6,7,8, Avirup Guha Neogi 9, Leon Brandhoff 9,10, Anna Ossowski 9, Janine Altmueller 9,11,12, Kerstin Becker 9, Andreas Petzold 13, Marc Sturm 14, Tyll Stöcker 15, Sugirthan Sivalingam 16, Fabian Brand 17, Axel Schmidt 18, Andreas Buness 19, Alexander J Probst 20, Susanne Motameny 9,10, Johannes Köster 5,21,a
PMCID: PMC11428021  PMID: 39345270

Version Changes

Revised. Amendments from Version 1

We have added three new coauthors, Felix Wiegand, Famke Bäuerle and Bianca Stöcker, who have improved the NCBench workflow and helped with the revisions (author contributions and funding sections have been updated accordingly). We have added a section about the maintenance of NCBench to the end of the "Evaluation pipeline" section in the manuscript. We have extended and clarified our statement on variant atomization (see section "variant atomization"). We have extended the "Reporting" section in the manuscript to include the F* measure that has been added to the analysis. Apart from those changes in the manuscript, various improvements to the Datavzrd based reports on https://ncbench.github.io have been made, which are listed in detail in the response to the reviewers.

Abstract

We present the results of the human genomic small variant calling benchmarking initiative of the German Research Foundation (DFG) funded Next Generation Sequencing Competence Network (NGS-CN) and the German Human Genome-Phenome Archive (GHGA).

In this effort, we developed NCBench, a continuous benchmarking platform for the evaluation of small genomic variant callsets in terms of recall, precision, and false positive/negative error patterns. NCBench is implemented as a continuously re-evaluated open-source repository.

We show that it is possible to entirely rely on free public infrastructure (GitHub, GitHub Actions, Zenodo) in combination with established open-source tools. NCBench is agnostic of the dataset used and can evaluate an arbitrary number of given callsets, while reporting the results in a visual and interactive way.

We used NCBench to evaluate over 40 callsets generated by variant calling pipelines available in the participating groups, run on three exome datasets from different enrichment kits and at different coverages.

While all pipelines achieve high overall quality, subtle systematic differences between callers and datasets exist and are made apparent by NCBench. These insights are useful to improve existing pipelines and develop new workflows.

NCBench is meant to be open for the contribution of any given callset. Most importantly, for authors it removes the need to repeatedly re-implement paper-specific variant calling benchmarks when publishing new tools or pipelines, while readers benefit from being able to (continuously) observe the performance of tools and pipelines at the time of reading instead of at the time of writing.

Keywords: continuous, benchmarking, NGS, variant calling

Introduction

Genome sequencing is integral to many research and diagnostic procedures. For both pipeline and tool development, it is crucial to ensure that genomic variant calls are as accurate as possible. This can be achieved by testing tools and pipelines on datasets with a known set of true variants and correspondingly known sites where the genome is the same as the reference genome.

Several such benchmark datasets have been published. The Genome in a Bottle Consortium (GIAB) has released truth variant sets based on common calls across three variant callers on 14 different sequencing technologies and library preparation methods on a well-characterized genome (HG001 or NA12878), as well as an Ashkenazim trio (HG002-4) and a Han Chinese trio (HG005-7). 1 , 2 The Platinum variant catalog provides consensus calls of six variant calling pipelines across two different sequencing platforms on a family of four grandparents, two parents, and 11 children including the NA12878 genome, allowing an extended inheritance-based validation. 3 In an alternative approach, Li et al. 4 generated a synthetic diploid from two complete hydatidiform mole (CHM) cell lines (CHM1 and CHM13), which are almost completely homozygous across the whole genome, such that the known variants in this set are phased (their haplotype of origin is known). The synthetic diploid benchmark dataset has the advantage of not relying on a consensus callset across several variant callers, which limits the benchmark set to high-confidence regions and leads to an overestimation of the true variant calling performance. Finally, the SEQC2/MAQC-IV initiative provides another extensive set of validated benchmarks, not only focusing on genomic DNA but also considering RNA-seq and single-cell sequencing. 5

Several publications have utilized the aforementioned gold-standard callsets to benchmark variant calling tools and pipelines. 6 – 9 However, the continuous development of variant calling tools and pipelines means that static, one-time benchmarks based on a specific pipeline or tool version can quickly become outdated.

In contrast, benchmarking platforms aim to facilitate continuous benchmarking by pipeline and tool developers and users. Examples of such platforms are OpenEBench [1] and Omnibenchmark [2]. Both platforms run on their own dedicated computing infrastructure and utilize specialized frameworks for results reporting and dataset uploading.

In this work, we propose a different approach for hosting a continuous benchmark, developed by the human genomic small variant calling benchmarking initiative of the NGS-CN [3] and GHGA [4]. We show that it is possible to build a benchmarking platform by entirely relying on free public infrastructure, namely GitHub [5], GitHub Actions [6], and Zenodo [7]. Using these technologies as a basis and extending upon best practices, 10 we developed a comprehensive and reproducible benchmarking workflow for small genomic variants that is agnostic of the dataset used and can evaluate an arbitrary number of given callsets, while reporting the results in a visual and interactive way.

Methods

Datasets

We have sequenced the NA12878 sample from the Genome in a Bottle (GIAB) [8] project with two exome sequencing kits at varying average coverages. The genomic DNA from NA12878 was obtained from the NIGMS Human Genetic Cell Repository at the Coriell Institute for Medical Research. The Agilent Human All Exon V7 kit was used to yield a dataset with 182 million paired-end reads sequenced on an Illumina NovaSeq 6000 (211 bp mean insert size and 2 × 101 bp read length). We used random subsampling to derive two datasets from this that were used in the benchmarking, one with 37.5 million and one with 100 million paired-end reads. The Twist Human Comprehensive Exome (Twist Bioscience, San Francisco, CA, USA) sequencing kit was used according to the manufacturer’s protocol to generate 200 million paired-end reads on an Illumina NovaSeq 6000 (291 bp mean insert size and 2 × 101 bp read length). The raw reads of the two subsampled Agilent and Twist exome datasets are available via Zenodo. 11 , 12
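
The manuscript does not state which tool was used for the random subsampling; purely as an illustration, the following Snakemake rule sketch shows how a fixed number of read pairs could be drawn with seqtk. The rule name, file paths, environment file, and seed are hypothetical.

```python
# Illustrative Snakemake rule (not the pipeline actually used): randomly
# subsample a paired-end FASTQ dataset to a fixed number of read pairs.
# The same seed (-s) is used for both mates so that pairs stay in sync.
rule subsample_reads:
    input:
        r1="raw/NA12878_agilent_R1.fastq.gz",
        r2="raw/NA12878_agilent_R2.fastq.gz",
    output:
        r1="subsampled/NA12878_agilent_{n}M_R1.fastq.gz",
        r2="subsampled/NA12878_agilent_{n}M_R2.fastq.gz",
    params:
        pairs=lambda wc: int(float(wc.n) * 1_000_000),  # e.g. n=37.5 or n=100
    conda:
        "envs/seqtk.yaml"  # hypothetical environment definition
    shell:
        "seqtk sample -s 42 {input.r1} {params.pairs} | gzip > {output.r1} && "
        "seqtk sample -s 42 {input.r2} {params.pairs} | gzip > {output.r2}"
```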

Evaluation pipeline

To analyze the quality of the callsets yielded by each pipeline on the given datasets, we have developed a generic, reproducible Snakemake 13 workflow that conducts all steps (downloading benchmark data, preprocessing, comparison with a known ground truth, and plotting) and automatically deploys the required software stacks via Snakemake’s Conda/Mamba [9] integration: https://github.com/snakemake-workflows/dna-seq-benchmark. The workflow comes with predefined standard datasets like CHM-eval and GIAB, but can additionally be configured to use any other DNA-seq-based benchmark dataset consisting of a known set of true variants, confident regions where the reported true variants are considered to be complete (i.e. every non-variant position is assumed to carry the reference allele homozygously), raw read data (as FASTQ files), and (optionally) sequenced target regions (e.g. in case of exome sequencing). The workflow uses BWA-MEM, 14 Picard tools [10], and Mosdepth 15 to map reads and calculate read coverage across the genome. We use Bedtools 16 to limit the known true variants to the confident regions provided by the respective truth publishers and to stratify variants by coverage (see below). For interactive exploration of the results, we use Datavzrd [11] and Vega-Lite. 17 Calls and true variants are matched in a haplotype-aware manner via RTG Tools vcfeval [12]. To ensure a fair and correct comparison of the different evaluated callsets, several key points had to be considered, which we outline below.

Read depth stratification and selection of regions of interest. The available read depth can naturally affect both the precision and recall of a pipeline. Hence, the read depth characteristics of a benchmark dataset can have an impact on the derived precision and recall, which can limit the generalizability of the obtained results. To avoid this effect, we decided to stratify recall and precision by read depth. For any benchmark dataset, the workflow generates a quantized set of regions with low (0-9), medium (10-30), and high (>30) read depth using Mosdepth, while considering only reads with a mapping quality (MAPQ) of 60. Notably, this means that, for example, regions in the low read depth category have either only few reads or many reads with uncertain alignments (i.e., high mapping uncertainty). We intersect these regions with the confidence regions of the benchmark sample (e.g., as provided by GIAB) using Bedtools. If the given dataset was generated using a capturing approach (e.g., exome sequencing), we further restrict the regions to the captured loci according to the manufacturer. Afterwards, any given callset is split into three subsets with low, medium, and high coverage using Bedtools.
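
As an illustration of this stratification step, the following Snakemake-style sketch shows how quantized depth regions could be derived with Mosdepth and then intersected with the confident and target regions using Bedtools. Rule names, file paths, and environment files are hypothetical, and this is not necessarily the workflow's exact implementation.

```python
# Sketch (not the workflow's exact rules): mosdepth quantizes per-base depth
# of high-MAPQ reads into low/medium/high bins, and bedtools restricts the
# resulting regions to the confident and (for exomes) captured regions.
rule quantize_depth:
    input:
        bam="mapped/benchmark_sample.bam",
        bai="mapped/benchmark_sample.bam.bai",
    output:
        "depth/benchmark_sample.quantized.bed.gz",
    params:
        prefix="depth/benchmark_sample",
    conda:
        "envs/mosdepth.yaml"  # hypothetical environment definition
    shell:
        # bins: [0,10) = low, [10,31) = medium, [31,inf) = high;
        # reads with MAPQ < 60 are ignored
        "mosdepth --no-per-base --mapq 60 --quantize 0:10:31: "
        "{params.prefix} {input.bam}"

rule restrict_to_confident_regions:
    input:
        quantized="depth/benchmark_sample.quantized.bed.gz",
        confident="resources/giab_confident_regions.bed",
        targets="resources/exome_targets.bed",
    output:
        "regions/stratified.bed",
    conda:
        "envs/bedtools.yaml"  # hypothetical environment definition
    shell:
        "bedtools intersect -a {input.quantized} -b {input.confident} "
        "| bedtools intersect -a stdin -b {input.targets} > {output}"
```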

Separating genotyping from calling performance. With decreasing read depth or increasing mapping uncertainty, one can expect a callset to yield a decreasing recall: with less evidence, it becomes harder to find variants. This is true both for genotyping (i.e., requiring that the variant caller detects the correct genotype) and for just requiring the variant allele to be correctly recognized without considering whether the variant is predicted to be homo- or heterozygous (i.e., plain variant calling without genotyping). In contrast, a variant callset’s precision should ideally remain constant and unaffected by a decrease in read depth or an increase in mapping uncertainty, provided the method manages to correctly report the growing uncertainty. This behavior can differ between a callset’s genotyping precision and its calling precision. To make these differences visible, we calculate precision and recall separately for genotyping and for calling.
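
To make the distinction concrete, the sketch below shows one way to obtain both views with RTG Tools vcfeval: by default, vcfeval requires the genotype to match, while its --squash-ploidy option only requires the called alleles to match. The rule, file names, and wildcard scheme are hypothetical, and the workflow's actual invocation may differ.

```python
# Illustrative Snakemake rule: run vcfeval twice per callset, once with
# genotype matching (default) and once allele-only (--squash-ploidy).
rule vcfeval:
    input:
        truth="truth/benchmark.vcf.gz",
        calls="normalized/{callset}.vcf.gz",
        regions="regions/stratified.bed",
        sdf="resources/reference.sdf",  # reference converted with 'rtg format'
    output:
        directory("vcfeval/{callset}/{mode}"),  # mode: 'genotyping' or 'calling'
    params:
        extra=lambda wc: "--squash-ploidy" if wc.mode == "calling" else "",
    conda:
        "envs/rtg-tools.yaml"  # hypothetical environment definition
    shell:
        "rtg vcfeval --baseline {input.truth} --calls {input.calls} "
        "--evaluation-regions {input.regions} --template {input.sdf} "
        "{params.extra} --output {output}"
```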

Variant atomization. Some variant callers report complex variants as replacements of longer alleles (i.e., both the reported reference and alternative allele are longer than one base, e.g. ACCGCGT>ACGCT). While this is in general a good idea (e.g., in order to properly assess the combined impact on proteins), we found that it introduces problems when counting true positive, false positive, and false negative predictions. If a caller gets a part of such a replacement wrong, the entire replacement is considered a false positive by vcfeval, ignoring the other parts, which can potentially contain true positives, thereby leading to biased results for callers that frequently report such replacements. Similar to the approach implemented in the hap.py pipeline [13], we solved this issue by introducing a normalization step prior to vcfeval into our analysis workflow, which uses Bcftools 18 to normalize variants such that indels are moved to their left-most possible position and complex replacements are split into their atomic components, i.e. single nucleotide variants (SNVs), insertions, or deletions (indels), while removing exact duplicates resulting from the atomization.
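
A minimal sketch of such a normalization step with Bcftools is shown below, assuming a recent Bcftools version that supports the --atomize option; the rule and file names are illustrative and the workflow's actual rule may differ.

```python
# Illustrative Snakemake rule: left-align indels against the reference,
# split complex replacements into atomic SNVs/indels (--atomize), and drop
# exact duplicate records that the atomization can produce.
rule atomize_callset:
    input:
        vcf="callsets/{callset}.vcf.gz",
        ref="resources/reference.fa",
    output:
        "normalized/{callset}.vcf.gz",
    conda:
        "envs/bcftools.yaml"  # hypothetical environment definition
    shell:
        "bcftools norm --fasta-ref {input.ref} --atomize --rm-dup exact "
        "-Oz -o {output} {input.vcf}"
```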

Reporting. For reporting results, we employ Datavzrd to create interactive tabular reports for recall and precision, as well as for individual false positive and false negative variants. Datavzrd allows us to simply provide the required data as TSV or CSV files together with a configuration file that defines the rendering of each column. For the rendering, one can choose from automatic link-outs, heatmap plots, tick plots, bar plots, or custom complex Vega-Lite plots (which can also be used to define alternative visualizations for an entire table view). For recall and precision, we report a table containing, for each callset and each read depth category (low, medium, high), precision and recall (ignoring whether the genotype was predicted correctly), the underlying counts of true positives (TP), false positives (FP), and false negatives (FN), as well as the fraction of wrongly predicted genotypes. It is important to note that it cannot be excluded that the same variant in the truth set is predicted multiple times by a callset, e.g. as part of several complex replacements (see "Variant atomization" above). We therefore report two TP counts: TPquery (the number of TPs in the callset, with the same matching variant from the truth potentially counted multiple times) and TPtruth (the number of variants in the truth set that occur in the callset, each variant counted once regardless of how often it occurs in the callset), with TPquery ≥ TPtruth. Following the established definitions, precision is then calculated as

$$\frac{\mathrm{TP}_{\mathrm{query}}}{\mathrm{TP}_{\mathrm{query}} + \mathrm{FP}}$$

while recall is calculated as

$$\frac{\mathrm{TP}_{\mathrm{truth}}}{\mathrm{TP}_{\mathrm{truth}} + \mathrm{FN}}.$$

An example can be seen in Figure 2. In addition, we calculate the F*-measure 19 as

$$\frac{\mathrm{TP}_{\mathrm{query}}}{\mathrm{TP}_{\mathrm{query}} + \mathrm{FP} + \mathrm{FN}},$$

a monotone transformation of the F-measure providing additional interpretability 20: in this case, it is an estimate of the probability that a random variant taken from the union of prediction and truth is predicted correctly (i.e., a perfectly predicting pipeline would have an F*-measure of 1.0). For reporting individual FP and FN variants, we provide a Datavzrd table view for each, with one row per variant and a column per callset. In order to visualize systematic patterns arising from the properties of callsets (e.g., using the same variant detection or mapping method), any kind of property can be annotated as a so-called "label" when registering a callset for evaluation with the pipeline. The labels are displayed using a categorical color coding in the header of the table views. Moreover, we perform a Chi2 test for the association of the FP or FN pattern of each variant against the different labels in order to detect systematic effects. The variant/label combinations for which this test yields a significant result are then displayed in a separate table view for each type of label. These allow one, for example, to spot variants that only occur when callsets use a particular variant caller. Thereby, significance is determined by controlling the false discovery rate over the p-values of the Chi2 test using the Benjamini-Yekutieli procedure, as the variants could be either positively (e.g., being on the same haplotype) or negatively (e.g., being on different haplotypes) correlated. In order to combine the results with data provenance information, we include the Datavzrd views in a Snakemake report [14], which automatically provides a menu structure for navigation between views, association with used parameters, code, and software versions, as well as runtime statistics.
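
The following Python sketch illustrates the reported quantities and the label-association test described above. The counts and contingency tables are hypothetical; the actual computation is performed inside the Snakemake workflow and rendered with Datavzrd.

```python
# Minimal sketch of the reported metrics and of the label-association test
# (illustrative only; counts are made up).
import numpy as np
from scipy.stats import chi2_contingency
from statsmodels.stats.multitest import multipletests


def precision(tp_query: int, fp: int) -> float:
    return tp_query / (tp_query + fp)


def recall(tp_truth: int, fn: int) -> float:
    return tp_truth / (tp_truth + fn)


def f_star(tp_query: int, fp: int, fn: int) -> float:
    # F* = TP / (TP + FP + FN), a monotone transformation of the F-measure
    return tp_query / (tp_query + fp + fn)


def label_association_qvalues(tables: list[np.ndarray]) -> np.ndarray:
    """One contingency table per variant, e.g. rows = label values,
    columns = (reported as FP, not reported as FP)."""
    pvals = np.array([chi2_contingency(t)[1] for t in tables])
    # control the FDR with Benjamini-Yekutieli, since variants may be
    # positively or negatively correlated with each other
    _, qvals, _, _ = multipletests(pvals, alpha=0.05, method="fdr_by")
    return qvals


if __name__ == "__main__":
    # hypothetical counts for one callset and one depth category
    tp_query, tp_truth, fp, fn = 10450, 10400, 120, 310
    print(f"precision={precision(tp_query, fp):.4f}",
          f"recall={recall(tp_truth, fn):.4f}",
          f"F*={f_star(tp_query, fp, fn):.4f}")
```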

Figure 2. Exemplary screenshot of interactive tabular precision recall display.


Each group of three rows displays precision and recall together with the underlying numbers and wrongly predicted genotypes, stratified by read depth/coverage category. As provided via Datavzrd, every column can be sorted, hidden, or searched (via the buttons next to the column names). In the interactive report, callset/pipeline names appear on the far left. Here, they have been removed since results can be expected to change over time. For actual results, please see the always up-to-date interactive report at https://ncbench.github.io.

Maintenance. The evaluation workflow can be continuously maintained by contributing to its publicly available GitHub repository [15]. In particular, this offers the ability to easily update already available benchmark data and ground truths (under workflow/resources/presets.yaml), e.g. by editing via the GitHub interface and creating a pull request. The repository is automatically tested upon each change via GitHub Actions.

Continuous public evaluation

A central goal of the project was not to conduct a single benchmark and just publish the results, but rather to provide a resource for continuous, repeated, and always up-to-date benchmarking that is, moreover, open to any kind of contribution (callsets and code improvements, among others) from outside collaborators. To achieve this, we have developed the following approach (see Figure 1 for an illustration). We deployed the benchmarking workflow [16] as a module [17] into another Snakemake workflow that, in addition, can download callsets from Zenodo using Snakemake’s Zenodo integration [18]. Then, we deployed this workflow into the GitHub repository [19] and configured GitHub Actions [20] to continuously rerun the workflow upon every commit on the main branch or any pull request [21]. To ensure that the workflow runs sufficiently fast (GitHub Actions offers only limited runtime and resources per job), we have precomputed benchmark-dataset-specific central intermediate results (read-depth- and confidence-derived stratification regions) that are computationally intensive to obtain, and deployed them along with the workflow code into the GitHub repository.
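
For illustration, the following sketch shows how Snakemake's module mechanism can be used to wrap the benchmarking workflow inside another workflow. The version tag, configuration handling, and rule-name prefix are hypothetical, and the actual NCBench Snakefile may differ.

```python
# Sketch of wrapping the benchmarking workflow as a Snakemake module
# (tag and config keys are illustrative; see the NCBench repository for
# the actual Snakefile).
configfile: "config/config.yaml"

module dna_seq_benchmark:
    snakefile:
        github(
            "snakemake-workflows/dna-seq-benchmark",
            path="workflow/Snakefile",
            tag="v1.0.0",  # hypothetical version tag
        )
    config:
        config

# reuse all rules of the benchmarking workflow in this wrapper workflow
use rule * from dna_seq_benchmark as benchmark_*
```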

Figure 1. Continuous evaluation and reporting workflow.


Upon pull requests or pushes, a GitHub Actions workflow is triggered. It downloads the data, runs the Snakemake-based evaluation pipeline, creates the Snakemake report, and uploads it as an artifact. If the workflow is triggered on the main branch, its completion triggers a second GitHub Actions workflow that builds and deploys the homepage at https://ncbench.github.io.

Upon each completion of the evaluation pipeline, a Snakemake report [22] is generated. In the case of pull requests (e.g., contributing a feature or a new callset), the report is uploaded as a GitHub artifact [23] for inspection by the pull request author and the reviewer. In the case of the main branch, we utilize GitHub Actions to trigger the execution of a secondary GitHub Actions pipeline in a repository that hosts the NCBench homepage [24]. This pipeline fetches the latest report artifact associated with the main branch and deploys it to the homepage. This way, the most recent results are automatically accessible on the homepage.

Results

The always up-to-date results of the benchmark can be found and interactively explored at https://ncbench.github.io. At the time of writing, the benchmark comprises more than 40 callsets on three different benchmark datasets: the two NA12878 exome datasets described in the Datasets section and CHM-eval. 21 The callsets span various pipelines as well as read mapping, variant detection, and genotyping approaches.

Since the central idea of this project is to provide a continuous, standardized, and open benchmark platform for DNA-seq, we strived to make the contribution of new callsets as straightforward as possible. The benchmark repository [25] documents the steps needed to perform variant calling on the supported datasets and describes how to pre-check the resulting callset locally. Once a contributor is convinced that the callset is ready for publication, we provide instructions for uploading the result to Zenodo and submitting it via a pull request for continuous evaluation.

Figure 2 shows an exemplary screenshot of the interactive tabular precision/recall display (see the Evaluation pipeline section). It illustrates the importance of stratifying by read depth/coverage category. This is in contrast to the commonly seen practice of evaluating GIAB and other benchmark datasets on the entire set of variants, without stratification. While the latter generates realistic estimates of the overall prediction quality of a variant calling pipeline, the provided information is less generalizable, since a new dataset might have different read depth characteristics. Further, it tells little about the expected quality at an individual location, which might also differ from the global characteristics.

Discussion and conclusions

To date, variant calling benchmark studies have typically been published once, in one or several manuscripts that can only represent a snapshot at the time of writing. This holds for studies evaluating multiple tools or pipelines as well as for the evaluations accompanying newly published individual tools.

For continuous benchmarking, platforms like Omnibenchmark [26] or OpenEBench 22 are available. Both platforms run on their own dedicated computing infrastructure and utilize specialized frameworks for results reporting and dataset uploading.

In this work, we demonstrate that a continuous benchmarking platform can be set up without the need for dedicated computing infrastructure, and instead entirely relying on freely available and widely used resources.

  • By basing the benchmark of DNA-seq variant calling pipelines on a public GitHub repository for code, configuration, and result storage, GitHub Actions for analysis execution, and callsets hosted on Zenodo, we allow rapid and straightforward contributions by anyone familiar with these services.

  • By implementing the analysis with Snakemake and Conda/Mamba, we decouple the analysis code and the reporting of results from the hosting platform: instead of relying on GitHub Actions, the benchmark analysis can easily be conducted locally, or on a different platform without any modifications of the code.

  • By generating interactive visual presentations of the results with Datavzrd, we (a) allow for a modern and versatile exploration of results and comparisons between different methods and pipelines, and (b) to a large degree enable contributions and modifications to the way the data is presented by simply editing YAML-based configuration files.

  • By encapsulating all results in a Snakemake report that is portable and can be viewed and provided without any web service, we enable people to freely choose between relying on the online version of the report and providing snapshot-like versions of the report in their publications.

In the future, we will further extend this approach. For example, we will add a whole genome dataset of the NA12878 sample sequenced on an Illumina NovaSeq 6000 with ca. 400 million paired-end reads (473 bp mean insert size and 2 × 151 bp read length). We have already extended the pipeline to include the evaluation of somatic variants and plan to further extend it towards structural variants. Going forward, we (and others) will extend the NCBench online results in that regard. Finally, as the implemented comparison workflow is in principle agnostic to the considered species, we will evaluate the inclusion of benchmark datasets from non-human organisms. Particularly for natural microbial populations, whose species mostly exist as multiple genotypes in one ecosystem, variant calling can be a complex process 23 and is often not completely resolved due to the lack of complete, closed reference genomes from mono-cultures.

We hope that our approach will attract contributors beyond our initiative. A first success in that regard is the recent usage of NCBench in the PM4Onco project [27]. Ideally, the combination of being continuous, simple to use, reproducible, and easy to integrate outside of the primary web service will change the way DNA-seq benchmarking is handled in the future. Instead of every new tool or benchmark study manuscript having to conduct its own precision and recall analysis on public resources like GIAB or CHM-eval, as well as its own comparison with other tools or pipelines, authors can simply include their callsets in our benchmark. In turn, readers will always be able to see the performance of a tool in the context of the state of the art at the time of reading, instead of at the time of writing.

Author contributions

FH, GG, SM, and JK have written the manuscript. JK, FB, and BS have implemented the benchmarking pipeline. FW has implemented required visualization functionality in Datavzrd. SM has coordinated the benchmarking initiative. SM and KB provided the FASTQ files for the Agilent Human All Exon V7 kit. MS has created the callset data for the megSAP pipeline and edited the manuscript. TS has created the callset data for the WEScropbio pipeline. LB has created the callset data for the Cologne exome pipeline. AGN analyzed callset data. JA and AO have sequenced the NA12878 sample at the WGGC Cologne. AJP contributed to discussion and manuscript writing. AP has provided advice on and reviewed the benchmark design. SS has created the call sets for the NVIDIA Parabricks pipeline. AB and FB have supported the NVIDIA Parabricks pipeline. AS has coordinated the sequencing of NA12878 at the NGS Core Facility Bonn. NHS and CM have created the callset data for the GHGA pipeline. GG and FH have created the callset data for the sarek pipeline. All authors have read and approved the manuscript.

Acknowledgements

We thank the NIST and the GIAB working group for their extraordinarily useful work. We thank the authors of CHM-eval for their amazing work and making their data publicly available. We acknowledge support by the Open Access Publication Fund of the University of Duisburg-Essen.

Funding Statement

NHS and CM are supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) via the project NFDI 1/1 "GHGA - German Human Genome-Phenome Archive" (#441914366). SM is supported by the DFG via the project "West German Genome Center" (#407493903). BS, FB, and JK have been supported by the German Federal Ministry for Education and Research via the PM4Onco project (01ZZ2322K and 01ZZ2322C).

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

[version 2; peer review: 2 approved]

Footnotes

Data availability

Underlying data

Twist Whole-Exome Sequencing Dataset of NA12878: https://doi.org/10.5281/zenodo.7075041. 24

Agilent v7 exomes of NA12878: https://doi.org/10.5281/zenodo.6513789. 25

Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).

CHM-eval public benchmark data: https://github.com/lh3/CHM-eval

Analysis code

NCBench code available from: https://github.com/ncbench/ncbench-workflow

Archived NCBench code available from: https://doi.org/10.5281/zenodo.8268264

References

  • 1. Zook JM, Chapman B, Wang J, et al. : Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nat. Biotechnol. Mar 2014;32(3):246–251. 10.1038/nbt.2835 [DOI] [PubMed] [Google Scholar]
  • 2. Zook JM, Catoe D, McDaniel J, et al. : Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci. Data. Jun 2016;3:160025. 10.1038/sdata.2016.25 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3. Eberle MA, Fritzilas E, Krusche P, et al. : A reference data set of 5.4 million phased human variants validated by genetic inheritance from sequencing a three-generation 17-member pedigree. Genome Res. Jan 2017;27(1):157–164. 10.1101/gr.210500.116 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4. Li H, Bloom JM, Farjoun Y, et al. : A synthetic-diploid benchmark for accurate variant-calling evaluation. Nat. Methods. Aug 2018;15(8):595–597. 10.1038/s41592-018-0054-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5. Jones W, Wolfinger R, et al. : Sequencing benchmarked.
  • 6. Barbitoff YA, Abasov R, Tvorogova VE, et al. : Systematic benchmark of state-of-the-art variant calling pipelines identifies major factors affecting accuracy of coding sequence variant discovery. BMC Genomics. Feb 2022;23(1):155. 10.1186/s12864-022-08365-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7. Chen J, Li X, Zhong H, et al. : Systematic comparison of germline variant calling pipelines cross multiple next-generation sequencers. Sci. Rep. Jun 2019;9(1):9345. 10.1038/s41598-019-45835-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8. Supernat A, Vidarsson OV, Steen VM, et al. : Comparison of three variant callers for human whole genome sequencing. Sci. Rep. Dec 2018;8(1):17851. 10.1038/s41598-018-36177-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9. Zhao S, Agafonov O, Azab A, et al. : Accuracy and efficiency of germline variant calling pipelines for human genome data. Sci. Rep. Nov 2020;10(1):20222. 10.1038/s41598-020-77218-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10. Krusche P, Trigg L, Boutros PC, et al. : Best practices for benchmarking germline small-variant calls in human genomes. Nat. Biotechnol. May 2019;37(5):555–560. 10.1038/s41587-019-0054-x [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11. Motameny S: Agilent v7 exomes of NA12878. May 2022. 10.5281/zenodo.6513789 [DOI]
  • 12. Schmidt A, Sivalingam S, Buness A, et al. : Twist human comprehensive exome sequencing kit - high coverage - coriell - NA12878. September 2022. 10.5281/zenodo.7075041 [DOI]
  • 13. Mölder F, Jablonski KP, Letcher B, et al. : Sustainable data analysis with Snakemake. F1000Res. January 2021;10:33. 10.12688/f1000research.29032.1 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14. Li H: Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997 [q-bio] March 2013. arXiv: 1303.3997. 10.48550/arXiv.1303.3997 [DOI]
  • 15. Pedersen BS, Quinlan AR: Mosdepth: quick coverage calculation for genomes and exomes. Bioinformatics. March 2018;34(5):867–868. 10.1093/bioinformatics/btx699 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16. Quinlan AR, Hall IM: BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. March 2010;26(6):841–842. 10.1093/bioinformatics/btq033 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17. Satyanarayan A, Moritz D, Wongsuphasawat K, et al. : Vega-Lite: A Grammar of Interactive Graphics. IEEE Trans. Vis. Comput. Graph. January 2017;23(1):341–350. 10.1109/TVCG.2016.2599030 [DOI] [PubMed] [Google Scholar]
  • 18. Danecek P, Bonfield JK, Liddle J, et al. : Twelve years of SAMtools and BCFtools. GigaScience. February 2021;10(2):giab008. 10.1093/gigascience/giab008 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19. Hand DJ, Christen P, Kirielle N, et al. : F*: an interpretable transformation of the F-measure. Mach. Learn. 2021;110(3):451–456. 10.1007/s10994-021-05964-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20. Christen P, Hand DJ, Kirielle N, et al. : A Review of the F-Measure: Its History, Properties, Criticism, and Alternatives. ACM Comput. Surv. October 2023;56(3):73:1–73:24. 10.1145/3606367 [DOI] [Google Scholar]
  • 21. Li H, Bloom JM, Farjoun Y, et al. : A synthetic-diploid benchmark for accurate variant-calling evaluation. Nat. Methods. August 2018;15(8):595–597. 10.1038/s41592-018-0054-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22. Capella-Gutierrez S, Iglesia D, Haas J, et al. : Lessons Learned: Recommendations for Establishing Critical Periodic Scientific Benchmarking. bioRxiv. August 2017. 10.1101/181677 [DOI]
  • 23. Olm MR, Crits-Christoph A, Bouma-Gregson K, et al. : inStrain profiles population microdiversity from metagenomic data and sensitively detects shared microbial strains. Nat. Biotechnol. Jun 2021;39:727–736. 10.1038/s41587-020-00797-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Sivalingam S: Twist Whole-Exome Sequencing Dataset - High Coverage - WGGC SIG4 Benchmarking.[Dataset]. Zenodo. 2022. 10.5281/zenodo.7075041 [DOI]
  • 25. Motameny S: Agilent v7 exomes of NA12878.[Dataset]. Zenodo. 2022. 10.5281/zenodo.6513789 [DOI]
F1000Res. 2024 Sep 25. doi: 10.5256/f1000research.171050.r323086

Reviewer response for version 2

Justin M Zook 1

The authors have done a good job of addressing the reviewers' questions, and I have no further concerns.

Is the work clearly and accurately presented and does it cite the current literature?

Yes

If applicable, is the statistical analysis and its interpretation appropriate?

Yes

Are all the source data underlying the results available to ensure full reproducibility?

Partly

Is the study design appropriate and is the work technically sound?

Yes

Are the conclusions drawn adequately supported by the results?

Yes

Are sufficient details of methods and analysis provided to allow replication by others?

Partly

Reviewer Expertise:

Variant calling and benchmarking

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

F1000Res. 2024 Sep 13. doi: 10.5256/f1000research.171050.r323085

Reviewer response for version 2

Kez Cleal 1

Thank you for addressing my points in a thorough manner.

Is the work clearly and accurately presented and does it cite the current literature?

Yes

If applicable, is the statistical analysis and its interpretation appropriate?

Not applicable

Are all the source data underlying the results available to ensure full reproducibility?

Yes

Is the study design appropriate and is the work technically sound?

Yes

Are the conclusions drawn adequately supported by the results?

Yes

Are sufficient details of methods and analysis provided to allow replication by others?

Yes

Reviewer Expertise:

Cancer genomics

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

F1000Res. 2024 Feb 28. doi: 10.5256/f1000research.153686.r239648

Reviewer response for version 1

Kez Cleal 1

The article presents NCBench, a benchmarking platform for evaluating SNVs and indel variant calls from various gold-standard benchmark sets. NCBench makes use of public, free infrastructure like GitHub Actions and Zenodo to facilitate continuous, open-source benchmarking, which is a neat idea. NCBench supports evaluation across several datasets, emphasizing adaptability and reproducibility, and effectively addresses the need for ongoing evaluation in genomic research. However, the true utility of NCBench will depend on its adoption by the wider research community which itself could be enhanced by making the platform as easy to use as possible. On this front, I think NCBench could be improved by addressing the following:

 

  1. The documentation of how to implement/run NCBench and how to contribute a new callset was pretty limited. A more detailed guide aimed at newcomers, detailing the pipeline steps, inputs and outputs, and ways to configure NCBench for other datasets would help with adoption.

  2. It is unclear how to run NCBench locally on a custom benchmark dataset. The pipeline supports uploading a vcf to zenodo and running via github actions. Ideally, NCBench should be possible to run locally to test different tool parameters, or for developers to experiment with new tool implementations. This pattern would make NCBench more useful for general bioinformatics workflows. If this pattern is supported, it should be documented more clearly.

  3. The output tables on github.io are useful, but I think they could be improved. For example, it would be useful if rows could be sorted by a column, ability to hide certain columns.

  4. There doesn’t appear to be a way to download the results of the benchmarking run, or if there is, I didn’t find it. It would be nice to be able to download results in a table, in order to make custom tables for publication, for example.

  5. Stratifying benchmark results by coverage categories is a nice idea. However, I think it would still be useful to include an ‘any’ category (any mapq), to provide a dataset summary. Additionally, if there was a way to filter some of the rows, this would be very useful, for example selecting only mapq 1-10 category.

  6. Tables should probably include an F1 score, this would be useful to rank callsets, although which metric used (tp_query or tp_truth) should be evident.

  7. The dark-blue colour of some of the cells on the tables makes the text very hard to read (see ‘Show as plot’ subpage).

  8. Some of the table numbers need to be rounded; for example, clicking on ‘Show as plot’, the numbers are rounded to >12dp.

  9. In some of the tables, the text doesn’t fit on the page (fn variant page, the top row of labels stretched off the page using Safari).

Is the work clearly and accurately presented and does it cite the current literature?

Yes

If applicable, is the statistical analysis and its interpretation appropriate?

Not applicable

Are all the source data underlying the results available to ensure full reproducibility?

Yes

Is the study design appropriate and is the work technically sound?

Yes

Are the conclusions drawn adequately supported by the results?

Yes

Are sufficient details of methods and analysis provided to allow replication by others?

Yes

Reviewer Expertise:

Cancer genomics

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

F1000Res. 2024 Aug 26.
Johannes Köster 1

We thank the reviewer for the thorough assessment and positive feedback. A point by point response can be found below.

1. We have added a more detailed documentation on what is required to add new callsets and configure the pipeline to the readme: https://github.com/ncbench/ncbench-workflow

2. Indeed, we agree that running the workflow locally makes a lot of sense. We have extended the instructions in the readme accordingly.

3. This is a good suggestion. We have extended Datavzrd accordingly, such that the updated report now supports this functionality, as can be seen in the updated version under https://ncbench.github.io.

4. Indeed, offering a download is a good suggestion. We have now activated the corresponding feature in datavzrd, such that each table can be downloaded by opening the "hamburger" menu at the top right and clicking the "download as excel" entry.

5. Filtering of the rows (e.g. by coverage) has already been possible, using the magnifying glass icon on top of the respective column. We explicitly decided against providing a non-stratified precision and recall, because we believe that this is usually misleading, since it does only reflect the composition of easy and harder to call sites in the benchmark sample, rather than the actual performance of the evaluated pipelines/tools. Therefore, any reported such measures would be of very limited generalizability.

6. We appreciate the suggestion. Instead of the F1-score, we have now added the F*-measure, which provides a monotone transformation of F1 that adds additional interpretability by being an estimate of the probability that a random variant taken from the union of prediction and truth is predicted correctly. We have extended the "Reporting" section in the manuscript accordingly.

7. We have updated the reports to a newer version of Datavzrd, in which the "show as plot" functionality has been replaced by a functionality that prints the current page as an SVG file that looks exactly the same as the display in the browser, thereby avoiding the low-contrast issue in the previous version.

8. See answer for point 7, the newer Datavzrd solves this issue.

9. Indeed, the FN and FP tables are sometimes wider than a usual screen resolution. However, they are scrollable horizontally. We have extended Datavzrd to show a visual indicator that points the user to the scrollbar if that is the case.

F1000Res. 2023 Oct 3. doi: 10.5256/f1000research.153686.r207163

Reviewer response for version 1

Justin M Zook 1

The authors present an open benchmarking platform for variant calling, with exome and genome variant call sets for two different benchmarks as examples. As the authors note, the ability to continuously benchmark variant calls is currently lacking, as existing efforts like precisionFDA Truth Challenges only reflect a point in time. Ongoing benchmarking should be very useful as both variant calling methods and benchmark sets evolve. I only have a few suggestions for clarifying the data and methods used.

  1. Are FNs and FPs counted to include genotype errors, as is the default in vcfeval and hap.py, or do both exclude genotype errors from their counts, which is implied but not explicit in the text

  2. I saw the bed file for Agilent but not for twist in zenodo

  3. What were the benchmark versions used for GIAB and CHM-eval?

  4. Do the authors expect that adding new versions of the benchmarks would be straightforward as GIAB develops these?

  5. vcfeval generally should be able to compare different representations of the same variant, as long as they exactly match, even if they are not atomized. The only reason I have found this not to work is if the variant caller gets part of a haplotype wrong or the genotype wrong, in which case the whole haplotype is called wrong even if most of the variants in the region are correct. Is this what the authors encountered? If so, it might be good to clarify this. A potential problem with left shifting variants is that occasionally this will cause a change in the haplotype, e.g., if an indel in a homopolymer is shifted past a SNV on the same haplotype, though this is relatively rare.

  6. The number of variants for CHM-eval is lower than I'd expect. Did the authors restrict to one chromosome?

Is the work clearly and accurately presented and does it cite the current literature?

Yes

If applicable, is the statistical analysis and its interpretation appropriate?

Yes

Are all the source data underlying the results available to ensure full reproducibility?

Partly

Is the study design appropriate and is the work technically sound?

Yes

Are the conclusions drawn adequately supported by the results?

Yes

Are sufficient details of methods and analysis provided to allow replication by others?

Partly

Reviewer Expertise:

Variant calling and benchmarking

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

F1000Res. 2024 Aug 26.
Johannes Köster 1

We thank the reviewer for the positive and comprehensive assessment of our work. Please find a point by point response below.

1. TP, FP, and FN are calculated by ignoring the genotype (just requiring presence or absence of the variant in the callset). Instead, we decided to provide a genotype mismatch rate that shows how many of the genotypes in the true positives are wrongly predicted compared to the truth. The definition of this approach can be found under "Methods"/"Evaluation pipeline"/"Reporting".

2. The twist bed file can be found publicly under https://www.twistbioscience.com/sites/default/files/resources/2022-01/Twist_Comprehensive_Exome_Covered_Targets_hg38.bed. This is also the file that is used by the ncbench workflow when assessing the twist dataset. The user does not need the bedfile, since ncbench takes care of restricting any input vcf to calls within the regions specified by the bed file.

3. The pipeline currently uses v4.2.1 of the NA12878 ground truth from GIAB and v0.5 of CHM-eval. We have extended the online report to include this information in the description above each result table.

4. Indeed, bumping benchmark versions is very easy. One simply has to update the respective entries here: https://github.com/snakemake-workflows/dna-seq-benchmark/blob/main/workflow/resources/presets.yaml#L51. We have added a section about the maintenance to the end of the "Evaluation pipeline" section in the manuscript.

5. We have extended and clarified our statement (see section "variant atomization"). Indeed, it is as the reviewer suggests: the whole replacement is considered as one, leading to missing additional true positives within the replacement.

6. There was no restriction to a particular chromosome involved. To be sure, we have checked the number of variants in the CHM-eval ground truth and can confirm that it matches the reported TP and FN in the reported tables. Note that (a) the precision/recall tables are split into SNVs and indels, such that both numbers have to be considered. (b) we filter the truth to the reported confidence regions. (c) we stratify by coverage in a way that each kept variant has to be entirely inside of the respective coverage region. In total, this explains that the benchmark runs on fewer variants than one would consider otherwise.
