Skip to main content
G3: Genes | Genomes | Genetics logoLink to G3: Genes | Genomes | Genetics
. 2023 May 27;13(8):jkad117. doi: 10.1093/g3journal/jkad117

Whole Animal Genome Sequencing: user-friendly, rapid, containerized pipelines for processing, variant discovery, and annotation of short-read whole genome sequencing data

Jonah N Cullen 1,, Steven G Friedenberg 2
Editor: A Lipka2
PMCID: PMC10411559  PMID: 37243692

Abstract

Advancements in massively parallel short-read sequencing technologies and the associated decreasing costs have led to large and diverse variant discovery efforts across species. However, processing high-throughput short-read sequencing data can be challenging with potential pitfalls and bioinformatics bottlenecks in generating reproducible results. Although a number of pipelines exist that address these challenges, these are often geared toward human or traditional model organism species and can be difficult to configure across institutions. Whole Animal Genome Sequencing (WAGS) is an open-source set of user-friendly, containerized pipelines designed to simplify the process of identifying germline short (SNP and indel) and structural variants (SVs) geared toward the veterinary community but adaptable to any species with a suitable reference genome. We present a description of the pipelines [adapted from the best practices of the Genome Analysis Toolkit (GATK)], along with benchmarking data from both the preprocessing and joint genotyping steps, consistent with a typical user workflow.

Keywords: whole genome sequencing, pipeline, variants

Introduction

High-throughput sequencing (HTS) has revolutionized the field of genetics and helped usher in today's modern genomic era. Continued advances in sequencing technologies coupled with decreased costs have empowered researchers to generate massive amounts of short-read sequencing data, particularly for whole genome sequencing (WGS) studies. WGS projects may range from single individuals with a phenotype of interest to entire cohorts, broadening our ability to answer biological questions at population-level scales. In addition to large efforts in human medicine (Auton et al. 2015), the revolution in HTS has been equally fruitful in veterinary medicine, where research groups across the globe have generated WGS data from tens of thousands of animals across a range of species (Rubin et al. 2012; Daetwyler et al. 2014; Jagannathan et al. 2019; Ostrander et al. 2019; Buckley et al. 2020; Wang et al. 2020; Durward-Akhurst et al. 2021).

Working with WGS data, however, is not always straightforward. The processing of WGS data is often inaccessible to many researchers and relatively nonstandard. For years, the Broad Institute has been at the vanguard in developing open-source tools [e.g. Genome Analysis Toolkit (GATK)], analysis pipelines, and best practices geared for human WGS projects (der Auwera and O’Connor 2020). Reconfiguring GATK pipelines for nonhuman WGS samples can be a challenging and potentially error-prone endeavor when scaled to dozens or hundreds of samples. For example, generating genome and alignment indices, variant calling intervals, and preprocessing FASTQ files to unmapped binary alignment map (uBAM) files (i.e. the expected inputs for the GATK germline short variant discovery pipeline) require substantial bioinformatics expertise and familiarity with many command-line-based tools. Numerous bioinformatics pipelines exist for processing and analyzing WGS data (Evani et al. 2012; Karczewski et al. 2014; Guo et al. 2015; Menon et al. 2016; Elshazly et al. 2017; Causey et al. 2018; Ahmed et al. 2021) with varying degrees of usability, accessibility, and adaptability, resulting in a spectrum of potential challenges. These challenges may be exacerbated when setting up these pipelines on a remote high-performance computing (HPC) environment using a job scheduler. Moreover, many of these pipelines do not utilize container software or use Docker (Merkel 2014) as a way of standardizing their bioinformatics workflow. Container software, like Docker, provides lightweight, portable computational environments that can package data (e.g. reference genomes) and software along with their dependencies, enabling consistent and reproducible computation across computing environments. However, Docker is often incompatible with HPC environments found at many academic institutions. Given the potential for pipeline output discrepancies due to differences in software versions and dependencies, the use of containers in processing WGS data could be regarded as essential. Without containers, it is difficult to ensure reproducibility necessary for comparing and synthesizing results across studies or institutions. The enforced reproducibility provided by containers ensures that pipeline outputs are consistent regardless of where the output was generated. Thus, the use of container software specifically designed for scientific computing such as Singularity/Apptainer (Kurtzer et al. 2017), in conjunction with a workflow management system (e.g. Snakemake (Mölder et al. 2021)) for WGS pipelines, is a central component to ensuring reproducibility, adaptability, and transparency.

To address these hurdles, we developed scalable, adaptable, open-source, containerized, Snakemake-based pipelines tailored for the veterinary community under the umbrella term Whole Animal Genome Sequencing (WAGS) (https://github.com/jonahcullen/wags). These pipelines, which are meant to be run on a high-performance Linux-based cluster, employ best practices championed by GATK developers while lowering barriers often faced by animal researchers seeking to analyze WGS data. WAGS consists of 3 pipelines. The first pipeline processes raw FASTQ files into genomic variant call format (GVCF) files and is capable of processing samples split across flow cells or lanes. The second pipeline takes multiple GVCF files as inputs and performs joint genotype calling and variant annotation. And the third pipeline compares variants in a specific sample to variants in a larger variant call file (VCF) database to enable the identification of unique variants in a particular animal.

Methods

WAGS consists of 3 pipelines for (1) processing raw short-read FASTQ files into GVCF files (OneWAG, Fig. 1a), (2) joint genotyping and annotating variants (ManyWAGS, Fig. 1b), and (3) identifying private variants in a single sample (OnlyWAG, Fig. 1c). The latter pipeline, OnlyWAG, combines aspects of OneWAG and ManyWAGS, generating the same outputs from OneWAG in addition to a table of private and rare variants compared to a larger population of animals. All 3 pipelines are Snakemake-based (Mölder et al. 2021) and utilize a Singularity container (Kurtzer et al. 2017) bundling all required software and reference files for select species (e.g. dog, cat, and horse). Additionally, log files documenting the processing of each step for all 3 pipelines are saved in a designated and structured logs directory, facilitating an understanding of processing errors or analysis of edge cases (e.g. samples with exceedingly high duplication rates). For species not already available in the container, all required inputs (e.g. reference index and dictionary, Burrows-Wheeler Alignment tool (BWA) indices (Li and Durbin 2009)) can be generated from a reference sequence in FASTA format using a separate script prep_custom_ref.py. WAGS is designed to be run on an HPC cluster using the Slurm workload manager. Currently, only Slurm is supported (https://github.com/Snakemake-Profiles/slurm) with future plans to include additional schedulers (e.g. Load Sharing Facility, Portable Batch System, and Sun Grid Engine).

Fig. 1.

Fig. 1.

Overview of the WAGS pipelines. a) OneWAG for single-sample variant calling starting from paired-end FASTQ files. b) ManyWAGS for joint genotyping sample cohorts. c) Identification of unique and rare variants from a single sample compared to a multisample variant call set (OnlyWAG). For OnlyWAG, the single-sample VCF uses 2 sets of call sets: (1) unique variants and (2) variants that are considered rare (allele frequency < 0.005) and unique. The unique variants from the latter are then filtered out by comparison to the unique variant call set, yielding a set of unique and rare variants from the input sample. Created with BioRender.com.

OneWAG

Input and setup

To process WGS data from a single sample, a plain text file containing the sample name, the breed (optional), and the prefix of paired-end FASTQ data files is required. This text file is used to generate the pipeline inputs as well as a job script to be submitted to the scheduler. The meta file may contain 1 or more samples to process with 1 sample per row, producing pipeline inputs and submission scripts organized by breed and sample name. There are no limitations on the number of FASTQ files associated with a given sample (e.g. sequenced across multiple lanes or separate flowcells) as long as they share a common portion of the FASTQ file names. Required arguments and various options are described in the documentation (https://github.com/jonahcullen/wags).

FASTQ to GVCF

The data processing steps implemented in OneWAG are adapted from GATK Best Practice guidelines (der Auwera and O’Connor 2020) and rely almost entirely on these methods unless otherwise noted. OneWAG begins by simultaneously assessing FASTQ quality via FastQC (Andrews 2010) and converting paired FASTQ files to a uBAM file (FastqToSam). This enables the attachment of user-defined metadata (e.g. sample name) or metadata parsed from the FASTQ identifier (e.g. flowcell ID, lane number) as part of the initial setup. Adapters are then marked (MarkIlluminaAdapters) prior to converting the uBAM back to FASTQ, and alignment to a reference genome is performed with BWA (Li and Durbin 2009); these steps are piped together with view from SAMtools (Danecek et al. 2021) to produce the aligned binary alignment map (BAM) file. The aligned BAM is then merged with the uBAM (MergeBamAlignment), retaining the added metadata. The next steps include marking duplicate reads (MarkDuplicates), sorting (SortSam), adding tags (SetNmMdAndUqTags), and generating a BAM index file. Optionally, indels may be left aligned (LeftAlignIndels). Depending on the availability of known variant (SNP and indel) positions for the species, base quality score recalibration (BQSR) may be performed. For select species, known variant position files are included within the container and may be utilized. Users may also supply their own known variant position files if desired. BQSR is conducted in parallel using chromosome- or contig-scale intervals by building a recalibration model per interval (BaseRecalibrator), combining recalibration reports across intervals (GatherBQSRReports), and adjusting the quality score for each base using the combined recalibration report per interval (ApplyBQSR). The recalibrated BAM files are gathered across intervals to create a final analysis-ready BAM and index. If BQSR is not desired, the BAM produced following marking duplicates, sorting, and adding tags is considered the final BAM for downstream analysis. Whether or not a recalibration step is included, a compressed BAM (CRAM) and index are generated to minimize data storage requirements (PrintReads).

SNPs and indels are called in parallel by first creating an interval list by splitting the reference at stretches of missing base calls (Ns) (tunable with default of maximum 50 Ns tolerated before splitting) (ScatterIntervalsByNs; Picard tools, http://broadinstitute.github.io/picard/). The initial interval list is then split into sets (tunable with default of 50 interval sets) of approximately equal size (SplitIntervals) over which variants are called in parallel (HaplotypeCaller), producing a GVCF for each set. GVCF files for each interval set are then gathered (MergeVcfs), resulting in a sample-level GVCF and index, ready for joint genotyping by ManyWAGS.

BAM files are also used as inputs to call structural variants (SVs). Due to the difficulty in identifying SVs from short-read sequencing data and the variability between SV calling algorithms, OneWAG utilizes 4 well-characterized callers: DELLY (Rausch et al. 2012), GRIDSS2 (Cameron et al. 2017, 2021), Smoove (Pedersen et al. 2020), and Manta (Chen et al. 2016). Smoove uses LUMPY (Layer et al. 2014) and svtyper (Chiang et al. 2015) to identify and genotype SVs. Users may decide which outputs to consider for downstream analysis.

Quality control

In addition to the FastQC (Andrews 2010) walker described above, OneWAG also provides output from SAMtools’ flagstat (Danecek et al. 2021) to count alignment types, GATK's AnalyzeCovariates to assess BAM recalibration, and Qualimap’s bamqc module to provide depth profiles and other parameters (Okonechnikov et al. 2016). All QC data and plots are combined into a single, user-friendly report using MultiQC (Ewels et al. 2016).

ManyWAGS

Input and setup

To joint genotype GVCF files generated from the OneWAG pipeline, a single text file containing sample names and paths of GVCF files is required; an example is provided in the GitHub repository (https://github.com/jonahcullen/wags). Species- or reference-specific configuration files needed for joint genotyping are first generated (config_joint.py). Similar to OneWAG, all pipeline inputs and a job script are generated to initiate joint genotyping adapted from GATK Best Practice guidelines (der Auwera and O’Connor 2020).

GVCF to VCF

To begin processing, the first step is to determine the intervals over which input GVCF files will be processed. There are 2 methods for defining interval boundaries: one of which is based on runs of at least 50 Ns (as in the OneWAG pipeline), while the other utilizes midpoints of intergenic regions, thus requiring a reference annotation file such as a gene transfer format (GTF) file. This second option may be particularly useful for nearly complete (i.e. few missing bases) reference genomes. Regardless of the interval boundary method, the resulting intervals are collapsed into larger intervals of an ideal maximum length (default 10 million). This value can be modified; however, this default setting was chosen to balance efficient parallel processing with the number of submissions to the cluster simultaneously (e.g. smaller length intervals will result not only in an increased number of intervals and ultimately faster processing times but also in a higher number of concurrent job submissions, which may be limited by some institutions). GVCF files are then combined into GenomicsDB workspaces via GenomicsDBImport, genotyped (GenotypeGVCFs), filtered (VariantFiltration) by excess heterozygosity, and converted to sites-only VCFs (MakeSitesOnlyVcf) for each of the collapsed intervals in parallel. If variant quality score recalibration (VQSR) is enabled, which is strongly encouraged by GATK developers, the sites-only VCFs are combined (GatherVcfsCloud), sorted (SortVcf), and used as inputs to generate recalibration models for SNPs and indels separately (VariantRecalibrator). User-supplied files of known SNP and indel positions are required for this step unless they are already stored within the container. ApplyVQSR then filters the collapsed interval VCFs in parallel, where each site is marked as PASS or binned based on user-defined tranche sensitivity. The recalibrated VCFs are combined (GatherVcfsCloud) into a complete VCF and indexed (IndexFeatureFile). If VQSR is not possible or desired (e.g. lack of known SNP and/or indel sites), hard filtering of SNPs and/or indels may be conducted with default thresholds that may be adjusted. Finally, the complete VCF is annotated with the Ensembl Variant Effect Predictor (VEP) (McLaren et al. 2016). For basic VEP annotation, an annotation file (e.g. GTF) is required; these are included within the container for select species but may be added for other species. This last step is also conducted in parallel using the collapsed intervals followed by merging the annotated VCFs, compression, and indexing (Li 2011).

Quality control

Summary statistics are collected on the final VCF using both BCFtools’ stats and GATK's CollectVariantCallingMetrics. Per-interval VEP summaries are also collected, enabling the examination of variant classes and consequences, along with other standard metrics. MultiQC aggregates all QC data and plots into a single, user-friendly report.

OnlyWAG

The OnlyWAG pipeline was designed to identify private or rare variants in a single individual compared to variants in a population of interest generated by ManyWAGS or any other VCF. The OneWAG pipeline yields an individual GVCF (and all other outputs), which is genotyped using the ManyWAGS pipeline to produce a single-sample VCF. The single-sample VCF is compared to the population VCF using GATK's SelectVariants, and 2 specific outputs are produced: (1) a VCF containing variants that are unique to the single-sample VCF and (2) a VCF containing variants in the single sample that are rare (tunable with default allele frequency < 0.005) in the population VCF. BCFtools’ +split-vep is used to create delimited tables of the unique and rare VCFs, which are then parsed and output as Excel sheets for downstream analysis and easy visualization by end users.

Results and discussion

Performance evaluation

WAGS has been utilized and validated for processing and genotyping large cohorts of horses, dogs, and tigers. For example, we have cataloged genetic variants from WGS data of nearly 700 dogs across more than 50 breeds, using 2 canine reference genomes (Lindblad-Toh et al. 2005; Hoeppner et al. 2014; Wang et al. 2021). Ongoing analyses have demonstrated the usefulness of this catalog in identifying genetic variants associated with various Mendelian diseases and traits in our canine companions (Olby et al. 2020; Shelton et al. 2021; Shelton, Minor, Guo, et al. 2022; Shelton, Minor, Vieira, et al. 2022). However, unlike humans with the Genome in a Bottle Consortium (Zook et al. 2016), we do not have an extensively curated truth set of genetic variants in animal species for validating the accuracy of WAGS. Instead, we report the processing times for OneWAG [FASTQ to GVCF, mapped against GCD_1.0 (Wang et al. 2021) plus ChrY (GCA_014441545.1, https://www.ebi.ac.uk/ena/browser/view/GCA_014441545.1)] using 100 canine WGS samples as well as ManyWAGS for joint genotyping of various cohort sizes.

From a private collection of Illumina short-read, paired-end WGS data from 671 dogs, we randomly selected 100 samples representing 36 breeds that had between 1 and 12 sets of paired FASTQ files, with 60 samples having 1 pair and 40 samples with greater than 1 pair. All FASTQ files used for this analysis have been made publicly available (Supplementary Table 1). As expected, runtime increases with mean coverage, which depends on the number of paired FASTQ inputs (Fig. 2). This is consistent with the understanding that processing multiple smaller inputs in parallel will be more efficient than larger single FASTQ pairs. However, larger runtime differences do not appear to be meaningful unless sample coverage is greater than ∼40×.

Fig. 2.

Fig. 2.

Processing times from FASTQ files through GVCF using the OneWAG pipeline. Samples with a single FASTQ pair (n = 60, purple) and greater than 1 pair (n = 40, orange) are plotted by mean depth with generalized additive model (GAM) smooth lines.

To examine joint genotyping runtimes and SNP/indel counts, we randomly selected (without replacement) cohorts of 1, 5, 10, 25, and 50 dogs and ran ManyWAGS 5 times per cohort from the 100 samples described above. In addition, the entire cohort of 100 dogs was joint genotyped once. Mean runtimes for each cohort are presented in Table 1. We observed little difference between joint genotyping cohort sizes of 10 and 25; however, the cohort of 50 dogs took nearly 3 times as long. At size 50, the runtimes were more variable, with a standard deviation (SD) of 3.0 and range from 10.1 to 17.3 h. Due to the large number of joint genotyping steps and processing variability on an institutional HPC with shared nodes, runtimes are often impacted by available resources. We also tabulated SNP and indel counts for each cohort size replicate from the summary metrics generated as part of joint genotyping quality control (Fig. 3). Mean SNP counts increased by 98.5% from sizes 1 to 5, with each increase in cohort size adding a smaller percent of SNPs (e.g. a 10.4% increase from a cohort size of 50 to 100 samples) (Table 1). Indel counts displayed a similar pattern with increasing cohort sizes.

Table 1.

Mean (SD) runtimes for joint genotyping at 6 cohort sizes.

Cohort size Mean runtime [hours (SD)] Mean SNP counts Mean SNP increase (%) Mean indel counts Mean indel increase (%)
1 1.6 (0.6) 3,645,205 1,600,195
5 2.96 (0.4) 7,235,061 98.5 2,478,264 54.9
10 5.00 (0.8) 8,636,849 19.4 2,766,402 11.6
25 5.42 (0.8) 10,155,441 17.6 3,100,775 12.1
50 14.59 (3.0) 11,352,736 11.8 3,390,625 9.3
100 25.24 (NA) 12,535,006 10.4 3,660,728 8.0

Repeated sampling (n = 5) at each cohort size except 100, which was only run once. NA = not applicable.

Fig. 3.

Fig. 3.

SNP and indel counts at increasing cohort sizes. Cohorts were randomly selected and processed with ManyWAGS 5 times for each indicated size (1, 5, 10, 25, and 50) sans 100.

Abundant resources and pipelines tailored for WGS processing and variant discovery are available for humans; however, this is not typically the case for many veterinary species. WAGS was developed in an effort to close this gap and empower veterinary researchers to analyze WGS data in a user-friendly and reproducible manner. As a Snakemake-based, containerized pipeline, FASTQ files may be processed with little user input and configuration, yielding GVCF files with relatively short runtimes. Similarly, genotyping and annotation of moderate-to-large cohorts, a potentially error-prone endeavor, may be accomplished without substantial bioinformatics knowledge. WAGS is open source and open access. We expect that the majority of users will be researchers in veterinary medicine interested in identifying variants associated with a disease or trait of interest in cohorts of public and private WGS data.

Supplementary Material

jkad117_Supplementary_Data

Acknowledgements

We are grateful to the Minnesota Supercomputing Institute at the University of Minnesota for providing the support and computational resources to develop WAGS and to process and store data. Additionally, we would like to thank the following individuals for testing, utilizing, and supporting the development of WAGS: Jillian Marlowe, Sian Durward-Akhurst, Ted Kalbfleisch, and Molly McCue with equine WGS data; Alessandra Bresnan and Brian W. Davis with tiger WGS data; and Kathryn M. Meurs, Natasha J. Olby, Leigh Ann Clark, Joshua Stern, Kari Ekenstedt, Diane Shelton, Eva Furrow, Jody Gookin, and James Mickelson with canine WGS data.

Contributor Information

Jonah N Cullen, Department of Veterinary Clinical Sciences, College of Veterinary Medicine, University of Minnesota, 1352 Boyd Ave, Saint Paul, MN 55108, USA.

Steven G Friedenberg, Department of Veterinary Clinical Sciences, College of Veterinary Medicine, University of Minnesota, 1352 Boyd Ave, Saint Paul, MN 55108, USA.

Data availability

The WGS data used to benchmark WAGS are publicly available with Sequence Read Archive accessions listed in Supplementary Table 1. The source code of WAGS is available at https://github.com/jonahcullen/wags.

Supplemental material available at G3 online.

Funding

JNC was funded by a United States Department of Agriculture National Institute of Food and Agriculture Predoctoral Fellowship as part of the Agriculture and Food Research Initiative (https://nifa.usda.gov/) (2020-67034-31739). SGF is supported in part by an NIH Special Emphasis Research Career Award (K01 OD027058) in Pathology and Comparative Medicine sponsored by the Division of Comparative Medicine, Office of Research Infrastructure Programs.

Literature cited

  1. Ahmed Z, Renart EG, Mishra D, Zeeshan S. JWES: a new pipeline for whole genome/exome sequence data processing, management, and gene-variant discovery, annotation, prediction, and genotyping. FEBS Open Bio. 2021;11(9):2441–2452. doi: 10.1002/2211-5463.13261. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Andrews S. FastQC: a quality control tool for high throughput sequence data. 2010.
  3. Auton A, Abecasis GR, Altshuler DM, Durbin RM, Abecasis GR, Bentley DR., Chakravarti A, Clark AG, Donnelly P, Eichler EE, et al. A global reference for human genetic variation. Nature. 2015;526(7571):68–74. doi: 10.1038/nature15393. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Buckley RM, Davis BW, Brashear WA, Farias FHG, Kuroki K, Graves T, Hillier LW, Kremitzki M, Li G, Middleton RP, et al. A new domestic cat genome assembly based on long sequence reads empowers feline genomic medicine and identifies a novel gene for dwarfism. PLoS Genet. 2020;16(10):e1008926. doi: 10.1371/journal.pgen.1008926. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Cameron DL, Baber J, Shale C, Valle-Inclan JE, Besselink N, van Hoeck A, Janssen R, Cuppen E, Priestley P, Papenfuss AT. GRIDSS2: comprehensive characterisation of somatic structural variation using single breakend variants and structural variant phasing. Genome Biol. 2021;22(1):202. doi: 10.1186/s13059-021-02423-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Cameron DL, Schröder J, Penington JS, Do H, Molania R, Dobrovic A, Speed TP, Papenfuss AT. GRIDSS: sensitive and specific genomic rearrangement detection using positional de Bruijn graph assembly. Genome Res. 2017;27(12):2050–2060. doi: 10.1101/gr.222109.117. [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Causey JL, Ashby C, Walker K, Wang ZP, Yang M, Guan Y, Moore JH, Huang X. DNAp: a pipeline for DNA-seq data analysis. Sci Rep. 2018;8(1):6793. doi: 10.1038/s41598-018-25022-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Chen X, Schulz-Trieglaff O, Shaw R, Barnes B, Schlesinger F, et al. Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications. Bioinformatics. 2016;32(8):1220–1222. doi: 10.1093/bioinformatics/btv710. [DOI] [PubMed] [Google Scholar]
  9. Chiang C, Layer RM, Faust GG, Lindberg MR, Rose DB, Garrison EP, Marth GT, Quinlan AR, Hall IM. SpeedSeq: ultra-fast personal genome analysis and interpretation. Nat Methods. 2015;12(10):966–968. doi: 10.1038/nmeth.3505. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Daetwyler HD, Capitan A, Pausch H, Stothard P, van Binsbergen R, Brøndum RF, Liao X, Djari A, Rodriguez SC, Grohs C, et al. Whole-genome sequencing of 234 bulls facilitates mapping of monogenic and complex traits in cattle. Nat Genet. 2014;46(8):858–865. doi: 10.1038/ng.3034. [DOI] [PubMed] [Google Scholar]
  11. Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, Whitwham A, Keane T, McCarthy SA, Davies RM, et al. Twelve years of SAMtools and BCFtools. GigaScience. 2021;10(2):giab008. doi: 10.1093/gigascience/giab008. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. der Auwera GAV, O’Connor BD. Genomics in the Cloud: Using Docker, GATK, and WDL in Terra. Sebastopol (CA): O’Reilly Media, Inc; 2020. [Google Scholar]
  13. Durward-Akhurst SA, Schaefer RJ, Grantham B, Carey WK, Mickelson JR, McCue ME. Genetic variation and the distribution of variant types in the horse. Front Genet. 2021;12:758366. doi: 10.3389/fgene.2021.758366. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Elshazly H, Souilmi Y, Tonellato PJ, Wall DP, Abouelhoda M. MC-GenomeKey: a multicloud system for the detection and annotation of genomic variants. BMC Bioinformatics. 2017;18(1):1–14. doi: 10.1186/s12859-016-1454-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Evani US, Challis D, Yu J, Jackson AR, Paithankar S, Bainbridge MN, Jakkamsetti A, Pham P, Coarfa C, Milosavljevic A, et al. Atlas2 cloud: a framework for personal genome analysis in the cloud. BMC Genomics. 2012;13(S6):S19. doi: 10.1186/1471-2164-13-Supplementary 6-19. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016;32(19):3047–3048. doi: 10.1093/bioinformatics/btw354. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Guo Y, Ding X, Shen Y, Lyon GJ, Wang K. SeqMule: automated pipeline for analysis of human exome/genome sequencing data. Sci Rep. 2015;5(1):14283. doi: 10.1038/srep14283. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Hoeppner MP, Lundquist A, Pirun M, Meadows JRS, Zamani N, Johnson J, Sundström G, Cook A, FitzGerald MG, Swofford R, et al. An improved canine genome and a comprehensive catalogue of coding genes and non-coding transcripts. PLoS One. 2014;9(3):e91172. doi: 10.1371/journal.pone.0091172. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Jagannathan V, Drögemüller C, Leeb T. A comprehensive biomedical variant catalogue based on whole genome sequences of 582 dogs and eight wolves. Anim Genet. 2019;50(6):695–704. doi: 10.1111/age.12834. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Karczewski KJ, Fernald GH, Martin AR, Snyder M, Tatonetti NP, Dudley JT. STORMSeq: an open-source, user-friendly pipeline for processing personal genomics data in the cloud. PLoS One. 2014;9(1):e84860. doi: 10.1371/journal.pone.0084860. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Kurtzer GM, Sochat V, Bauer MW. Singularity: scientific containers for mobility of compute. PLoS One. 2017;12(5):e0177459. doi: 10.1371/journal.pone.0177459. [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Layer RM, Chiang C, Quinlan AR, Hall IM. LUMPY: a probabilistic framework for structural variant discovery. Genome Biol. 2014;15(6):R84. doi: 10.1186/gb-2014-15-6-r84. [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Li H. Tabix: fast retrieval of sequence features from generic TAB-delimited files. Bioinformatics. 2011;27(5):718–719. doi: 10.1093/bioinformatics/btq671. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Li H, Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009;25(14):1754–1760. doi: 10.1093/bioinformatics/btp324. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Lindblad-Toh K, Wade CM, Mikkelsen TS, Karlsson EK, Jaffe DB, Kamal M, Clamp M, Chang JL, Kulbokas EJ, Zody MC, et al. Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature. 2005;438(7069):803–819. doi: 10.1038/nature04338. [DOI] [PubMed] [Google Scholar]
  26. McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GRS, Thormann A, Flicek P, Cunningham F. The ensembl variant effect predictor. Genome Biol. 2016;17(1):122. doi: 10.1186/s13059-016-0974-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. Menon R, Patel NV, Mohapatra A, Joshi CG. VDAP-GUI: a user-friendly pipeline for variant discovery and annotation of raw next-generation sequencing data. 3 Biotech 2016;6(1):68. doi: 10.1007/s13205-016-0382-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Merkel D. Docker: lightweight Linux containers for consistent development and deployment. Linux J. 2014;2014(239):2. [Google Scholar]
  29. Mölder F, Jablonski KP, Letcher B, Hall MB, Tomkins-Tinch CH, Sochat V, Forster J, Lee S, Twardziok SO, Kanitz A, et al. Sustainable data analysis with Snakemake. F1000Res. 2021;10:33. doi: 10.12688/f1000research.29032.2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Okonechnikov K, Conesa A, García-Alcalde F. Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data. Bioinformatics. 2016;32(2):292–294. doi: 10.1093/bioinformatics/btv566. [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Olby NJ, Friedenberg S, Meurs K, DeProspero D, Guevar J, Lau J, Yost O, Guo LT, Shelton GD. A mutation in MTM1 causes X-linked myotubular myopathy in Boykin spaniels. Neuromuscul Disord NMD. 2020;30(5):353–359. doi: 10.1016/j.nmd.2020.02.021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  32. Ostrander EA, Wang G-D, Larson G, vonHoldt BM, Davis BW, Jagannathan V, Hitte C, Wayne RK, Zhang Y-P, André C, et al. Dog10K: an international sequencing effort to advance studies of canine domestication, phenotypes and health. Natl Sci Rev. 2019;6(4):810–824. doi: 10.1093/nsr/nwz049. [DOI] [PMC free article] [PubMed] [Google Scholar]
  33. Pedersen BS, Layer R, Quinlan AR. Smoove: structural-variant calling and genotyping with existing tools. 2020.
  34. Rausch T, Zichner T, Schlattl A, Stütz AM, Benes V, Korbel JO. DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics. 2012;28(18):i333–i339. doi: 10.1093/bioinformatics/bts378. [DOI] [PMC free article] [PubMed] [Google Scholar]
  35. Rubin C-J, Megens H-J, Barrio AM, Maqbool K, Sayyab S, Schwochow D, Wang C, Carlborg Ö, Jern P, Jørgensen CB, et al. Strong signatures of selection in the domestic pig genome. Proc Natl Acad Sci. 2012;109(48):19529–19536. doi: 10.1073/pnas.1217149109. [DOI] [PMC free article] [PubMed] [Google Scholar]
  36. Shelton GD, Minor KM, Guo LT, Friedenberg SG, Cullen JN, Hord JM, Venzke D, Anderson ME, Devereaux M, Prouty SJ, et al. Muscular dystrophy-dystroglycanopathy in a family of Labrador retrievers with a LARGE 1 mutation. Neuromuscul Disord NMD. 2021;31(11):1169–1178. doi: 10.1016/j.nmd.2021.07.016. [DOI] [PMC free article] [PubMed] [Google Scholar]
  37. Shelton GD, Minor KM, Guo LT, Thomas-Hollands A, Walsh KA, Friedenberg SG, Cullen JN, Mickelson JR. An EHPB1L1 nonsense mutation associated with congenital dyserythropoietic anemia and polymyopathy in Labrador retriever littermates. Genes (Basel). 2022;13(8):1427. doi: 10.3390/genes13081427. [DOI] [PMC free article] [PubMed] [Google Scholar]
  38. Shelton GD, Minor KM, Vieira NM, Kunkel LM, Friedenberg SG, Cullen JN, Guo LT, Zatz M, Mickelson JR. Tandem duplication within the DMD gene in Labrador retrievers with a mild clinical phenotype. Neuromuscul Disord. 2022;32(10):836–841. doi: 10.1016/j.nmd.2022.08.001. [DOI] [PMC free article] [PubMed] [Google Scholar]
  39. Wang M-S, Thakur M, Peng M-S, Jiang Y, Frantz LAF, Li M, Zhang J-J, Wang S, Peters J, Otecko NO, et al. 863 genomes reveal the origin and domestication of chicken. Cell Res. 2020;30(8):693–701. doi: 10.1038/s41422-020-0349-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  40. Wang C, Wallerman O, Arendt M-L, Sundström E, Karlsson Å, Nordin J, Mäkeläinen S, Pielberg GR, Hanson J, Ohlsson Å, et al. A novel canine reference genome resolves genomic architecture and uncovers transcript complexity. Commun Biol. 2021;4(1):185. doi: 10.1038/s42003-021-01698-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  41. Zook JM, Catoe D, McDaniel J, Vang L, Spies N, Sidow A, Weng Z, Liu Y, Mason CE, Alexander N, et al. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci Data. 2016;3(1):160025. doi: 10.1038/sdata.2016.25. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

jkad117_Supplementary_Data

Data Availability Statement

The WGS data used to benchmark WAGS are publicly available with Sequence Read Archive accessions listed in Supplementary Table 1. The source code of WAGS is available at https://github.com/jonahcullen/wags.

Supplemental material available at G3 online.


Articles from G3: Genes|Genomes|Genetics are provided here courtesy of Oxford University Press

RESOURCES