Abstract
Whole-genome sequencing is widely used to investigate population genomic variation in organisms of interest. Assorted tools have been independently developed to call variants from short-read sequencing data aligned to a reference genome, including single nucleotide polymorphisms (SNPs) and structural variations (SVs). We developed SNP-SVant, an integrated, flexible, and computationally efficient bioinformatic workflow that predicts high-confidence SNPs and SVs in organisms without benchmarked variants, which are traditionally used for distinguishing sequencing errors from real variants. In the absence of these benchmarked datasets, we leverage multiple rounds of statistical recalibration to increase the precision of variant prediction. The SNP-SVant workflow is flexible, with user options to trade off accuracy for sensitivity. The workflow predicts SNPs and small insertions and deletions (INDELs) using GATK, and SVs using GRIDSS, and culminates in variant annotation using custom scripts. A key utility of SNP-SVant is its scalability. Variant calling is a computationally expensive procedure, and thus SNP-SVant uses a workflow management system with intermediary checkpoint steps to ensure efficient use of resources by minimizing redundant computations and omitting steps where dependent files are available. SNP-SVant also provides metrics to assess the quality of called variants, and converts between VCF and aligned FASTA format outputs to ensure compatibility with downstream tools to calculate selection statistics, which are commonplace in population genomics studies. By accounting for both small and large structural variants, users of this workflow can obtain a wide-ranging view of genomic alterations in an organism of interest. Overall, this workflow advances our capabilities in assessing the functional consequences of different types of genomic alterations, ultimately improving our abilities to associate genotypes with phenotypes.
Basic Protocol 1: Predicting SNPs and SVs
Support Protocol 1: Downloading publicly available sequencing data
Support Protocol 2: Visualizing variant loci using IGV
Support Protocol 3: Converting between VCF and aligned FASTA formats
Keywords: Variant calling, structural variations, single nucleotide polymorphisms, computational pipeline, short-read paired-end sequencing
INTRODUCTION
Variant calling refers to the process of identifying differences or “variants” in DNA sequences of individuals or samples compared to a reference genome. These variants can be changes in single nucleotides, known as single nucleotide polymorphisms (SNPs), small insertions and deletions (INDELs) or large genomic alterations such as insertions, deletions, duplications, or inversions (Mahmoud et al., 2019; Olson et al., 2015). These large genomic alterations, encompassing at least 50 base pairs, are known as structural variations (SVs) (Mahmoud et al., 2019). Variant calling is a critical step in genomic analyses and plays a fundamental role in several applications, including population genetics and disease association studies. Studies aiming to associate genotypes with phenotypes have focused largely on SNPs (Uffelmann et al., 2021). Genome-wide studies in humans, however, have shown that SVs account for more variation within species compared to SNPs (Sudmant et al., 2015), and are also more frequently implicated in disease states (Weischenfeldt et al., 2013). While identification of SVs is challenging due to low sequencing depth, especially in short-read sequencing data, recent advances in next-generation sequencing technologies (De Coster & Van Broeckhoven, 2019) and algorithmic development (Kosugi et al., 2019) for detection of SVs have made it possible to predict SVs more accurately. Existing tools to predict SNPs and SVs currently do so independently. Here, we developed a computational workflow called SNP-SVant to streamline the comprehensive prediction of genomic variants – both SNPs and SVs together. This workflow uses the Genome Analysis ToolKit (GATK) to predict SNPs and INDELs (McKenna et al., 2010), and the Genome Rearrangement IDentification Software Suite (GRIDSS) to predict SVs (Cameron et al., 2017, 2021) from short-read paired-end data.
SNP-SVant was designed to be user-friendly such that users without extensive computational expertise can utilize the workflow. Additionally, SNP-SVant is tailored for variant prediction in organisms lacking benchmarked, known variants, which are traditionally used to distinguish sequencing errors from real variants (Van der Auwera et al., 2013). Benchmarked variant datasets are used to recalibrate base quality scores obtained from sequencing machines (Van der Auwera et al., 2013). In the absence of these benchmarked datasets, recalibration of base quality scores cannot be performed, which can result in biases stemming from genome context or read base position. To facilitate base quality score recalibration in the absence of benchmarked datasets, we follow the best practices recommended by developers of GATK (McKenna et al., 2010; Van der Auwera et al., 2013) and leverage multiple rounds of statistical recalibration and variant calling to increase the precision of variant prediction.
SNP-SVant uses the Snakemake workflow management system to create reproducible data analyses (Köster et al., 2021). Variant calling is a computationally expensive process, especially when working with multiple samples. SNP-SVant run-time is optimized by leveraging parallelization and multicore processors when available, and by resolving dependencies between intermediate steps with checkpointing. Here, we demonstrate the use of SNP-SVant to predict variants in a strain of Candida albicans, a common human fungal pathogen. We use C. albicans as a test case for this pipeline because it is a medically relevant fungal pathogen that does not yet have experimentally validated and benchmarked variant data. However, SNP-SVant was developed to predict SNPs and SVs in any species lacking benchmarked, experimentally verified, genomic variant data. SNP-SVant uses paired-end and short-read genomic DNA sequences as inputs to predict variants. The quality of the raw data is verified using FastQC (Andrews, 2010), reads are mapped to the indexed reference genome using Bowtie2 (Langmead & Salzberg, 2012), and aligned reads are sorted by genomic loci using samtools (Danecek et al., 2021). Duplicate reads in the alignment are then denoted using the MarkDuplicates command in the Picard toolkit (https://broadinstitute.github.io/picard/). In this latter step, duplicate reads that occur during the PCR amplification step of sequencing are marked to ensure that they are excluded from the statistical calculation to assess variant quality. Summaries of read alignment statistics, insert size distributions, and read depths are also generated using the Picard toolkit and samtools to verify the quality of the sequenced data and alignment.
The first round of variant calling to predict SNPs and small INDELs is performed using the HaplotypeCaller function in GATK (McKenna et al., 2010), and variants with low mapping quality, strand biases, and low variant confidence scores accounting for coverage are removed.
The GATK thresholds of the workflow are set according to best practices described in the GATK documentation, and can be modified by users based on the distributions of these metrics in the data. Base quality scores of the aligned reads are then recalibrated using filtered, high-quality variants to account for non-random errors due to genome context, especially homopolymer errors that can arise during sequencing (Stoler & Nekrutenko, 2021). This step is repeated twice and the aligned reads with the recalibrated base scores are then used to perform another round of variant calling using HaplotypeCaller and are filtered again using the same criteria. These final filtered variants are then retained as the high-confidence variants. The effects of these variants on protein coding regions are then predicted using the Variant Effect Predictor (VEP) software suite (McLaren et al., 2016).
SVs can be identified from short-read paired-end sequencing data based on the patterns of distribution of read pair distances and orientations at specific genomic regions, and different types of SVs result in different distributional patterns (Mahmoud et al., 2019). GRIDSS uses short-read paired-end sequencing data to predict SVs (Cameron et al., 2017, 2021). GRIDSS retains reads that are partially aligned, reads with incorrect orientation, reads with only a single mapped end, and reads with unusual insert sizes, and filters out reads with low mapping quality and reads mapping to low complexity genomic regions. The remaining high-quality reads are then used to construct a positional de Bruijn graph (Cameron et al., 2017), which is used to identify single break-end contigs. Break-end contigs are defined as contigs that extend out and flank a single breakpoint of the SV from each side, and are generated by assembling reads that span a breakpoint region (Cameron et al., 2017). The break-end contigs are subsequently realigned to the reference to precisely identify breakpoints. Since GRIDSS reports the break-ends, but does not annotate SVs, we wrote a custom R script to annotate the SVs. During the annotation step, we annotate SVs if the break-ends reported by GRIDSS are paired. This results in an annotation file comprising simple SVs such as duplications, insertions, deletions, and inversions.
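The pairing logic described above can be illustrated with a minimal Python sketch. It is not the authors' R script. It assumes each paired break-end has already been reduced to a chromosome, a position, and an orientation ("+" when the retained reference sequence lies to the left of the position, "-" when it lies to the right). The names `Breakend` and `classify_simple_sv`, the `NOT_SIMPLE` label, and the 0.7 insertion ratio are hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Breakend:
    chrom: str
    pos: int
    strand: str  # "+": retained sequence is left of pos; "-": right of pos


def classify_simple_sv(a, b, inserted_len=0, ins_ratio=0.7):
    """Assign a simple SV type to a pair of partnered break-ends."""
    if a.chrom != b.chrom:
        # Inter-chromosomal pairs are not simple SVs in this sketch.
        return "NOT_SIMPLE"
    lower, upper = sorted((a, b), key=lambda e: e.pos)
    if lower.strand == upper.strand:
        # Same orientation on both sides indicates an inverted junction.
        return "INV"
    span = upper.pos - lower.pos
    if inserted_len > 0 and inserted_len >= ins_ratio * span:
        # Novel sequence dominates the event (assumed threshold).
        return "INS"
    # Left side retained up to the lower break-end: sequence between is lost.
    # Otherwise the junction joins the end of a segment back to its start.
    return "DEL" if lower.strand == "+" else "DUP"


print(classify_simple_sv(Breakend("chr1", 100, "+"), Breakend("chr1", 600, "-")))
```

A real annotation step would also need to handle imprecise breakpoints and unpaired break-ends, which this sketch ignores.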
Our workflow is especially useful for users without extensive computational knowledge since it not only integrates SNP and SV prediction and functional annotation, but also includes custom scripts to visualize quality scores of predicted variants. These custom scripts simplify the process of parameter tuning to filter out low quality variants. Additionally, we modify default SNP calling strategies in our workflow by adapting this step for organisms that do not have benchmarked, experimentally validated variant data. As a result, users can predict genomic variants including SNPs and SVs, and their functional consequences in organisms using only a reference genome, annotated protein-coding regions, and paired-end short-read sequencing data. By accounting for both small and large structural variants, users can obtain a wide-ranging view of genetic diversity in their organism of interest. Overall, this workflow advances our capabilities in assessing the functional consequences of different types of genomic alterations, ultimately improving our abilities to associate genotypes with phenotypes.
BASIC PROTOCOL 1
Basic protocol title:
Predicting SNPs and SVs
Introductory paragraph:
In this protocol, we describe in detail the usage of our variant calling workflow called SNP-SVant to predict SNPs and SVs in non-benchmarked organisms. Figure 1 depicts a summary of the computational workflow. The SNP-SVant workflow requires the input files described in Table 1. Briefly, the workflow relies on a contiguous and complete reference genome, and short-read paired-end sequencing data. The outputs of SNP-SVant include a Variant Call Format (VCF) file with SNPs and small INDELs obtained using GATK (version 4.4.0.0), a VCF file with structural variants obtained using GRIDSS (version 2.12.0), and annotated SVs listed in BED format. Additionally, the scores used to assess the quality of the variant calls can be visualized and used to adjust the variant filtering parameters listed in Table 2. A list of all available parameters for this workflow is given in Table 3.
Figure 1: SNP-SVant, a computational workflow for predicting genomic variants.

We developed a reproducible, scalable variant calling workflow called SNP-SVant to predict SNPs, small INDELs and SVs in non-benchmarked organisms. The boxes in purple indicate steps that generate intermediary files and the boxes in yellow depict modules that result in final variant calls and annotation files. The publicly available tools used in each step are indicated in blue text at each step of the workflow.
Table 1.
Input files. These files are required to identify and annotate variants.
| File | Description | Parameter |
|---|---|---|
| sample_metadata.tsv | A tab-delimited file with the first column containing the base name of the FASTQ file | samples=‘{metadata: /path/to/sample_metadata.tsv}’ |
| reference.fasta | Reference FASTA file | reference=‘{genome: /path/to/reference.fasta}’ |
| reference.gff | Reference GFF file with annotations of protein-coding genes in the reference genome | reference=‘{genome_gff: /path/to/reference.gff}’ |
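
As an illustration of the metadata format in Table 1, the following minimal Python sketch reads the first column of a tab-delimited file as sample base names. The function name `read_sample_ids` and the `header` argument are hypothetical and are not part of the SNP-SVant code base; the sketch only assumes that the first column holds the FASTQ base name, as stated in the table.

```python
import csv


def read_sample_ids(path, header=False):
    """Return the first column of a tab-delimited metadata file."""
    with open(path, newline="") as handle:
        rows = [
            row
            for row in csv.reader(handle, delimiter="\t")
            if row and row[0].strip()  # skip blank lines
        ]
    if header:
        rows = rows[1:]  # drop the header row when one is present
    return [row[0].strip() for row in rows]
```

Calling `read_sample_ids("sample_metadata.tsv")` on a file with one sample per line would return the list of base names used to locate the FASTQ files.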
Table 2.
GATK parameters.
| Parameter | Description | Default value |
|---|---|---|
| QD_filter | Quality score of the variant normalized by depth | QD < 2.0 (for SNPs and INDELs) |
| FS_filter | Phred-scaled probability that the variant site has a strand bias | FS > 60.0 (for SNPs) FS > 200.0 (for INDELs) |
| SOR_filter | Strand odds ratio to estimate strand bias | SOR > 4 (for SNPs) SOR > 10 (for INDELs) |
| MQ_filter | Root mean square mapping quality over all reads | MQ < 40.0 (for SNPs) |
| MQRankSum_filter | Approximation from the rank sum test for mapping qualities | MQRankSum < -12.5 (for SNPs) |
| ReadPosRankSum_filter | Approximation from the rank sum test for the variant site position within the reads | ReadPosRankSum < -8.0 (for SNPs) |
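
To make the meaning of the thresholds in Table 2 concrete, the following minimal Python sketch evaluates them against the annotations of a single variant. It is an illustration only and does not reproduce how GATK applies the filters internally. The names `SNP_FILTERS`, `INDEL_FILTERS`, and `failed_filters` are hypothetical, and skipping missing annotations is a simplifying assumption.

```python
import operator

# Each entry flags a variant when compare(value, threshold) is true,
# mirroring the default expressions in Table 2.
SNP_FILTERS = {
    "QD": (operator.lt, 2.0),
    "FS": (operator.gt, 60.0),
    "SOR": (operator.gt, 4.0),
    "MQ": (operator.lt, 40.0),
    "MQRankSum": (operator.lt, -12.5),
    "ReadPosRankSum": (operator.lt, -8.0),
}
INDEL_FILTERS = {
    "QD": (operator.lt, 2.0),
    "FS": (operator.gt, 200.0),
    "SOR": (operator.gt, 10.0),
}


def failed_filters(info, filters):
    """Return the names of the filters that a variant's annotations trigger."""
    failed = []
    for name, (compare, threshold) in filters.items():
        value = info.get(name)
        # Assumption: annotations absent from the record are not evaluated.
        if value is not None and compare(float(value), threshold):
            failed.append(name)
    return failed


print(failed_filters({"QD": 1.5, "FS": 10.0, "MQ": 60.0}, SNP_FILTERS))
```

A variant is retained as high-confidence in this sketch only when the returned list is empty.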
Table 3.
Description of all available parameters.
| Parameter | Description | Default value | Required |
|---|---|---|---|
| samples=‘{metadata: <path>}’ | Tab-delimited metadata file with the first column as sample IDs | test/metadata/test_samples.tsv | Yes |
| samples=‘{header: <boolean>}’ | Indicates if the sample metadata contains a header | False | No |
| threads=<int> | Number of threads to be used for the workflow | 8 | No |
| output=<path> | Output directory | test | No |
| input_dir=<path> | Input directory containing FASTQ files | test/raw | No |
| logs=<path> | Directory to store log files | MQRankSum < -12.5 | |
| import=<boolean> | Indicates if sample IDs in the metadata file are SRA IDs and if FASTQ files for these publicly available samples should be imported | False | No |
| trimming=‘{trim: <boolean>}’ | Indicates if adapter trimming should be performed | False | No |
| trimming=‘{adapters: <path>}’ | FASTA file with adapter sequences | resources/adapters/NexteraPE-PE.fa | No |
| trimming=‘{seedMismatches: <int>}’ | The maximum number of mismatches to be allowed in the seed | 2 | No |
| trimming=‘{palindromeClipThreshold: <int>}’ | The minimum score for the full alignment of the reads in palindrome mode | 30 | No |
| trimming=‘{simpleClipThreshold: <int>}’ | The minimum score threshold for the adapter alignment to the read for clipping to take place | 10 | No |
| trimming=‘{leading: <int>}’ | The minimum quality required to retain the base at the beginning of the read | 3 | No |
| trimming=‘{trailing: <int>}’ | The minimum quality required to retain the base at the end of the read | 3 | No |
| trimming=‘{windowSize: <int>}’ | Number of bases to average scores across | 4 | No |
| trimming=‘{requiredQuality: <int>}’ | Average quality required | 15 | No |
| trimming=‘{minlength: <int>}’ | The minimum length of reads to be retained | 36 | No |
| snp_filters=‘{QD_filter: <string>}’ | Quality score of the variant normalized by depth for SNPs | “QD < 2.0” | No |
| snp_filters=‘{FS_filter: <string>}’ | Phred-scaled probability that the variant site has a strand bias for SNPs | “FS > 60.0” | No |
| snp_filters=‘{SOR_filter: <string>}’ | Strand odds ratio to estimate the strand bias for SNPs | “SOR > 4” | No |
| snp_filters=‘{MQ_filter: <string>}’ | Root mean square mapping quality over all reads for SNPs | “MQ < 40.0” | No |
| snp_filters=‘{MQRankSum_filter: <string>}’ | Approximation from the rank sum test for mapping qualities for SNPs | “MQRankSum < -12.5” | No |
| snp_filters=‘{ReadPosRankSum_filter: <string>}’ | Approximation from the rank sum test for the variant site position within the reads for SNPs | “ReadPosRankSum < -8.0” | No |
| indel_filters=‘{QD_filter: <string>}’ | Quality score of the variant normalized by depth for INDELs | “QD < 2.0” | No |
| indel_filters=‘{FS_filter: <string>}’ | Phred-scaled probability that the variant site has a strand bias for INDELs | “FS > 200.0” | No |
| indel_filters=‘{SOR_filter: <string>}’ | Strand odds ratio to estimate the strand bias for INDELs | “SOR > 10” | No |
| reference=‘{genome: <path>}’ | Path for reference FASTA | reference/genome/reference.fasta | Yes |
| reference=‘{gff: <path>}’ | Path for annotated reference genes in GFF format | reference/genome/reference.gff | Yes |
Necessary Resources:
Hardware
A machine running Linux Ubuntu version 18.04.6 with an internet connection and at least 32 GB RAM. The number of threads can be provided by the users. Since variant calling is a computationally expensive process, we recommend at least 16 threads for optimal computation time.
Software
SNP-SVant uses the Conda command line package management system to install packages and software dependencies for variant calling. SNP-SVant is designed for users without extensive computational expertise and requires a basic knowledge of Unix commands to install and use. The Conda software can be installed from https://anaconda.org/anaconda/conda. We also recommend using Mamba, a reimplementation of the Conda package manager in C++ to facilitate the installation of Snakemake. Mamba can be installed using the command below.
conda install -n base -c conda-forge mamba
SNP-SVant requires Snakemake, which can be installed from https://snakemake.readthedocs.io/en/stable/. Snakemake is a workflow management system that is widely used in the field of bioinformatics. It is designed to help automate the process of creating, executing, and managing complex data analysis pipelines. Snakemake allows users to define and execute workflows ensuring reproducibility and scalability. The source code and documentation for SNP-SVant can be found on GitHub at https://github.com/dgunasekaran/snp_svant. To use the command line to download the GitHub repository, an installation of Git is required. You can download and install Git from https://git-scm.com/downloads.
Other requirements
Basic knowledge of Unix commands
Raw paired-end short-read sequencing data in FASTQ format
R version 4.1.3 or higher
Input files listed in Table 1
Protocol steps with step annotations:
Downloading the variant calling workflow on a local machine
- Download the publicly available GitHub repository of the variant calling workflow, SNP-SVant, to your local Linux machine using the git command-line tool. Navigate to the desired directory using the following command.
cd /path/to/desired/directory
- Clone the GitHub repository.
git clone https://github.com/dgunasekaran/snp_svant
Alternatively, users can click on the green “Code” button on the URL of the GitHub repository (https://github.com/dgunasekaran/snp_svant) and click the “Download ZIP” option. Users can then move the downloaded file to the desired directory and unzip the file in the new directory.
- Navigate to the downloaded repository.
cd snp_svant
Creating and loading the Conda environment
The Conda command line package management system can be used to install the packages and dependencies required for executing SNP-SVant. This workflow also relies on Snakemake for reproducibility. Adapter trimming of input reads using trimmomatic (Bolger et al., 2014) is an optional step; however, the default settings of this workflow exclude this step. Packages required for read mapping and variant calling are available as Snakemake wrappers and are used to simplify the integration of external tools into existing workflows. Python should be updated to ensure that the version is compatible across the different parts of the SNP-SVant workflow.
- Create the Conda environment included in the repository.
conda env create -f workflow/envs/environment.yaml
Note: The above command creates a Conda environment called snp_svant. For additional descriptions of creating and managing Conda environments, refer to https://conda.io/projects/conda/en/latest/index.html.
- Activate the Conda environment before running the workflow.
conda activate snp_svant
Note: While the environment is active, any packages you install are placed within this environment. To exit this environment, use the command conda deactivate.
- Install R packages.
Rscript workflow/scripts/install_r_packages.R
Building genome indices
Variant calling is performed by aligning reads derived from genomic DNA to a reference genome. To perform this alignment step, reference-based aligners rely on a pre-processing step to index reference genomes for fast and efficient alignment. We describe the steps to build the genome indices required for reference-based alignment of reads using Bowtie2.
snakemake --cores <number_of_threads> build --use-conda
Note: SNP-SVant requires a contiguous assembly of a reference genome of the organism of interest. Reference genomes are publicly available for many species of interest in the National Center for Biotechnology Information (NCBI) repository (https://www.ncbi.nlm.nih.gov/genome/) and can be found by entering the name of the species of interest in the search bar.
Running the variant calling pipeline
The main script, called Snakefile, lists all intermediary and final outputs that will be generated in this workflow. The dependencies between the intermediary files are described as individual rules and can be found in workflow/rules. Users must activate the Conda environment and build the genome indices before executing the variant calling step using the command below.
snakemake --cores <number_of_threads> all --use-conda
Generating quality control graphs for predicted variants
- Set up the Conda environment by deactivating the current environment.
conda deactivate
- Set up the new environment to execute the python script that generates density plots of quality metrics from the VCF file.
conda create -n vcf_qc python pyvcf pandas argparse seaborn
conda activate vcf_qc
- Use the VCF file as an input for the python script to generate quality control plots.
python workflow/scripts/variant_quality_assessment.py -i <input_vcf> -o <output_vcf>
An example output is shown in Figure 2.
Figure 2: Example density plots for quality control of SNP calling.

The distributions of the quality scores listed in Table 2, computed during variant calling by GATK, are shown. QD indicates the quality score of the variant normalized by depth; FS indicates the Phred-scaled probability that the variant site has a strand bias; SOR is the strand odds ratio to estimate strand bias; MQ is the root mean square mapping quality over all reads; MQRankSum is an approximation from the rank sum test for mapping qualities; and ReadPosRankSum is an approximation from the rank sum test for the variant site position within the reads. The vertical red lines denote the filtering thresholds for these quantities recommended by GATK.
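The metrics plotted in Figure 2 are stored in the INFO column of the VCF file. The following minimal sketch, which uses only the Python standard library and is not the repository's variant_quality_assessment.py script, shows how these values could be collected into per-metric lists before plotting. The names `METRICS` and `collect_metrics` are hypothetical.

```python
from collections import defaultdict

METRICS = ("QD", "FS", "SOR", "MQ", "MQRankSum", "ReadPosRankSum")


def collect_metrics(vcf_lines, metrics=METRICS):
    """Gather numeric INFO annotations from the data lines of a VCF file."""
    values = defaultdict(list)
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # INFO is the eighth tab-delimited column
        for entry in fields[7].split(";"):
            key, sep, raw = entry.partition("=")
            if sep and key in metrics:
                try:
                    values[key].append(float(raw))
                except ValueError:
                    pass  # ignore non-numeric values
    return dict(values)
```

Each list returned by `collect_metrics(open("<input_vcf>"))` can then be passed to a density plotting function and compared against the corresponding threshold in Table 2.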
Critical Parameters:
Variant calling is dependent on the thresholds for quality control filters. SNP-SVant uses the GATK recommended thresholds determined from benchmarking studies. These thresholds are listed in Table 2. These parameters can be tuned by the user by modifying the config/config.yaml file in the snp_svant repository or using the command line parameters listed in Table 3.
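As a hedged illustration of a command-line override, the sketch below assembles a snakemake invocation in Python. It assumes that the key=value pairs shown in the parameter column of Table 3 are passed through Snakemake's `--config` option, which should be confirmed against the repository documentation. The helper `build_snakemake_command` and the stricter QD threshold are hypothetical.

```python
import shlex


def build_snakemake_command(cores, target="all", overrides=None):
    """Assemble a snakemake invocation with optional --config overrides."""
    command = ["snakemake", "--cores", str(cores), target, "--use-conda"]
    if overrides:
        command.append("--config")
        # Each override is passed as key=value, mirroring the parameter
        # column of Table 3; nested values are written as inline YAML.
        command.extend(f"{key}={value}" for key, value in overrides.items())
    return command


command = build_snakemake_command(
    16,
    overrides={"snp_filters": '{QD_filter: "QD < 5.0"}'},  # hypothetical threshold
)
# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(command))
```

Editing config/config.yaml achieves the same result and leaves a record of the thresholds used for a given run.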
Advanced Parameters:
SNP-SVant also accepts other, optional parameters that can be modified by users. These parameters determine whether additional steps should be executed in the workflow including optional adapter trimming of sequenced reads using trimmomatic and input parameters required for trimmomatic. All parameters for SNP-SVant are listed in the config/config.yaml file. The descriptions of these parameters are given in workflow/schemas/config.schemas.yaml. Table 3 is a comprehensive list of all available parameters.
Sample Data
SNP-SVant generates multiple intermediary files. Users can obtain a representation of the steps involved in the workflow along with the dependencies between the intermediary files using the command below.
snakemake --dag all --use-conda | dot -Tpng > all_files.png
The file all_files.png depicts the order in which files are generated. These intermediary files can be used to gain insights into read mapping statistics and to visualize the results from alignment or variant calling steps in genome browsers such as the Integrative Genomics Viewer (IGV) (Robinson et al., 2011). The final output files listing predicted variants are described in Table 4.
Table 4.
Output files.
| File name | Description | File type | Location |
|---|---|---|---|
| *_filtered_snps_final.vcf | SNPs identified using GATK and filtered using the parameters in Table 2 | VCF | <output>/preprocessed/final_variants/ |
| *_filtered_indels_final.vcf | Small INDELs identified using GATK and filtered using the parameters in Table 2 | VCF | <output>/preprocessed/final_variants/ |
| *_gridss.vcf.gz | SVs identified using GRIDSS defining all single breakpoints which can be used to annotate complex SVs. This file also contains predicted translocations | VCF | <output>/preprocessed/gridss/ |
| *_simple.bed | Simple SVs annotated by type (insertion, deletion, inversion, and duplication) | BED | <output>/preprocessed/gridss/ |
| *_snps_vep.txt | Effect of SNPs on protein-coding genes | VEP | <output>/preprocessed/vep_genes/ |
| *_filtered_snps_final.vcf | SNPs identified using GATK and filtered using the parameters in Table 2. This file should be used only when input data is low-coverage (<30X) and/or highly diverged from the reference genome (>5%) | VCF | <output>/preprocessed/3_hapcaller_r1/ |
| *_filtered_indels_final.vcf | Small INDELs identified using GATK and filtered using the parameters in Table 2. This file should be used only when input data is low-coverage (<30X) and/or highly diverged from the reference genome (>5%) | VCF | <output>/preprocessed/3_hapcaller_r1/ |
SUPPORT PROTOCOL 1
SUPPORT PROTOCOL TITLE
Downloading publicly available sequencing data
Introductory paragraph:
Previously published sequencing data are publicly available at the Sequence Read Archive (SRA) on NCBI (https://www.ncbi.nlm.nih.gov/sra). FASTQ files for these samples can be downloaded directly from the NCBI database or using the SRA toolkit (https://github.com/ncbi/sra-tools). This support protocol uses the command line to download the sequenced reads using the SRA accession number.
Necessary Resources:
Software
The SRA toolkit provides tools for accessing and downloading high-throughput sequencing data files generated from sequencing platforms such as Illumina. Details on downloading the SRA toolkit (version 3.0.6 or higher) are available at https://github.com/ncbi/sra-tools.
Protocol steps with step annotations:
Verify if the publicly available data is sequenced in paired-end mode. Navigate to the directory to store the input FASTQ files. Download the FASTQ files for the SRA accession using the command below.
fasterq-dump <sra_accession> -e <number_of_threads>
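After the download, it can be useful to confirm that both mates are present for every accession. The following minimal Python sketch assumes the file naming that fasterq-dump typically produces for paired-end runs with default settings, namely <accession>_1.fastq and <accession>_2.fastq. The function name `missing_mates` is hypothetical.

```python
from pathlib import Path


def missing_mates(fastq_dir, accessions):
    """Map each accession to the mate files that are absent from fastq_dir."""
    fastq_dir = Path(fastq_dir)
    missing = {}
    for accession in accessions:
        # Assumed naming: <accession>_1.fastq and <accession>_2.fastq.
        expected = [f"{accession}_{mate}.fastq" for mate in (1, 2)]
        absent = [name for name in expected if not (fastq_dir / name).is_file()]
        if absent:
            missing[accession] = absent
    return missing
```

An empty dictionary indicates that both mate files were found for every accession; any accession listed in the result should be checked for single-end sequencing or an interrupted download.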
SUPPORT PROTOCOL 2
SUPPORT PROTOCOL TITLE
Visualizing variant loci using IGV
Introductory paragraph:
The intermediary files generated during the alignment step and the variant calling step can be visualized using IGV. The annotation of protein-coding genes for the reference strain can also be used to compare the variant loci across genomic elements. Additional information such as transcriptomic or epigenomic data, if available, can also be integrated into the visualization to facilitate interpretation of the variants.
Necessary Resources:
Software
IGV is available as a desktop or web application. The IGV desktop application can be downloaded at https://software.broadinstitute.org/software/igv/.
Protocol steps with step annotations:
- Load the reference genome and annotations.
In the IGV desktop application, click on Genomes > Load Genome from file and select the reference FASTA file from your local device. To include the annotations of protein-coding genes, click on File > Load from File and select the annotation file in GFF format from your local device.
- Ensure that the reference FASTA file is indexed.
An indexed FASTA file can be obtained using the “Building genome indices” step in Basic Protocol 1.
- Load the SNP and SV variant files.
In the IGV desktop application, click on File > Load from File and select the variant files *_filtered_snps_final.vcf, *_filtered_indels_final.vcf and *_gridss.vcf from your local device.
Note: Alternatively, the IGV web application can be used to visualize variants (https://igv.org/app/). Documentation for using IGV for this purpose is available at https://igvteam.github.io/igv-webapp/.
- Save a snapshot of a region of interest.
In the IGV desktop application, navigate to the region of interest using the search tab to enter the genomic position or gene of interest. Use the “+” and “-” symbols at the top right of the application to zoom in and out of the genomic range. Click on File > Save PNG image and navigate to the folder of interest to save the IGV snapshot. An example visualization from IGV is shown in Figure 3.
Figure 3: Example visualization of variants using IGV.

Example visualization of variants overlapping the REG1 gene in C. albicans. SVs are shown in the topmost track, followed by SNPs, annotations of protein-coding genes, and annotated SVs. The SNPs are colored in red and blue and denote allele frequencies, where SNPs shown only in red indicate homozygous variants. The annotated SVs are indicated with DEL, INS, or INV, denoting deletion, insertion, or inversion, respectively.
SUPPORT PROTOCOL 3
SUPPORT PROTOCOL TITLE
Converting between VCF and aligned FASTA formats
Introductory paragraph:
Genomic variants can be used to compute population genomics statistics. Some packages, such as alnpi, a module in the FAST toolkit, take a multiFASTA file as input to compute population genomics statistics (Lawrence et al., 2015). In this support protocol, we describe the use of a custom python script to convert SNPs in the VCF format to a multiFASTA format along with the reference sequence. This script is available as a part of the snp_svant GitHub repository.
Necessary Resources:
This support protocol requires the same resources listed in Basic Protocol 1.
Protocol steps with step annotations:
- Create and activate a Conda environment
Create a new Conda environment to download packages required for the python script to run.
conda create -n parse_vcf python pyvcf gffutils argparse biopython
conda activate parse_vcf
- Run the python script
The custom python script takes as input the VCF file with SNPs generated by GATK and the reference FASTA file. Additionally, the python script requires the genomic locus, along with the strand of the locus, to create the multiFASTA file for this genomic region.
python workflow/scripts/parse_vcf.py -i <*_filtered_snps_final.vcf> -r <reference.fasta> -l <chromosome:start:end> -s <strand> -o <output_fasta>
The strand of the locus should be indicated as “+” to denote the positive strand and “-” to denote the negative strand. Typically, for population genomic studies, the locus is a gene of interest and the genomic coordinates for the gene of interest can be obtained from the annotated GFF file for the reference strain. This script generates a multiFASTA file with the reference sequence for the locus of interest and the sample sequence with predicted SNPs.
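The conversion can be pictured with a minimal Python sketch. It is not the repository's parse_vcf.py script and makes simplifying assumptions: only single-base substitutions are applied, every SNP is treated as replacing the reference base, and positions are 1-based as in VCF. The function names and the "reference" and "sample" record names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def locus_multifasta(ref_locus, start, snps, strand="+", sample_name="sample"):
    """Build a two-record multiFASTA for one locus.

    ref_locus: reference sequence of the locus on the forward strand
    start:     1-based coordinate of the first base of ref_locus
    snps:      dict mapping a 1-based position to a (ref, alt) pair
    """
    bases = list(ref_locus)
    for pos, (ref, alt) in snps.items():
        offset = pos - start
        if not 0 <= offset < len(bases):
            continue  # SNP lies outside the requested locus
        if bases[offset].upper() != ref.upper():
            raise ValueError(f"reference mismatch at position {pos}")
        bases[offset] = alt
    sample_seq = "".join(bases)
    ref_seq = ref_locus
    if strand == "-":
        # Report both sequences in the orientation of the locus.
        ref_seq = reverse_complement(ref_seq)
        sample_seq = reverse_complement(sample_seq)
    return f">reference\n{ref_seq}\n>{sample_name}\n{sample_seq}\n"


print(locus_multifasta("ACGTAC", 101, {103: ("G", "T")}, strand="-"))
```

Because both records have the same length, the output is already aligned and can be passed to tools that expect an aligned multiFASTA file.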
GUIDELINES FOR UNDERSTANDING RESULTS
The primary outputs of this pipeline are variants predicted in the sample compared to the reference strain documented in a VCF format. VCF is a standard file format used to represent information about genetic variants. This file contains information on the chromosome and position of the variant location, the reference allele, alternate alleles, and allele frequencies. Additionally, the VCF file contains information on the variant quality including the metrics listed in Table 2. The VCF file also includes information about the sequencing depth, such as read depth at a variant location and allelic depth. A detailed description of all fields in the VCF format can be found at http://samtools.github.io/hts-specs/.
The output for the SV annotations is a BED file. This BED file is tab-delimited and has 8 columns, which are detailed below.
1. Chromosome location of the predicted SV.
2. Start coordinate of the SV.
3. End coordinate of the SV.
4. Type of SV, where INS denotes insertion, DEL denotes deletion, DUP denotes duplication, and INV denotes inversion.
5. Quality score of the SV obtained from GRIDSS.
6. Length of the SV.
7. Nucleotide sequence of the insertion, if the SV is an insertion.
8. Strand of the SV.
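The eight-column layout above can be read into a small record type, as in the following sketch. The field names, the numeric types, and the treatment of an empty or "." value in column 7 as "no inserted sequence" are assumptions made for illustration; the example line is invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SVRecord:
    chrom: str
    start: int
    end: int
    sv_type: str                  # INS, DEL, DUP, or INV
    quality: float                # quality score from GRIDSS
    length: int
    inserted_seq: Optional[str]   # only populated for insertions
    strand: str


def parse_sv_bed_line(line: str) -> SVRecord:
    """Parse one tab-delimited line of the SV annotation BED file."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 8:
        raise ValueError(f"expected 8 columns, found {len(cols)}")
    chrom, start, end, sv_type, quality, length, seq, strand = cols
    return SVRecord(
        chrom=chrom,
        start=int(start),
        end=int(end),
        sv_type=sv_type,
        quality=float(quality),
        length=int(length),
        # Assumption: a missing sequence is written as "." or left empty.
        inserted_seq=None if seq in ("", ".") else seq,
        strand=strand,
    )


# Invented example: a 220-bp deletion on the positive strand.
sv = parse_sv_bed_line("chr2\t1500\t1720\tDEL\t1450.3\t220\t.\t+")
print(sv.sv_type, sv.length, sv.inserted_seq)
```

Records parsed this way can then be filtered, for example by SV type or by a minimum quality score chosen by the user.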
If the output files listed in Table 4 are not generated, then the execution of the SNP-SVant workflow has failed. Possible causes and solutions to resolve this execution failure are listed in Table 5. If the VCF files are generated but only headers are listed without any identified variants, this could indicate poor quality data, either due to low sequencing depth or due to a failed sequencing run, both of which would necessitate resequencing the sample to improve variant prediction.
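A quick way to detect the header-only case described above is to count the non-header lines in a VCF file. The sketch below is an illustrative helper, not part of the SNP-SVant workflow; the function names are hypothetical, and gzip-compressed files are recognized only by a ".gz" suffix.

```python
import gzip
from typing import Iterable


def count_variant_records(lines: Iterable[str]) -> int:
    """Count data lines, i.e. non-empty lines that do not start with '#'."""
    return sum(1 for line in lines if line.strip() and not line.startswith("#"))


def vcf_is_header_only(path: str) -> bool:
    """Return True if the VCF file at `path` contains no variant records."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return count_variant_records(handle) == 0


# In-memory illustration with invented content.
header_only = ["##fileformat=VCFv4.2\n", "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"]
with_variant = header_only + ["chr1\t1042\t.\tA\tG\t812.6\tPASS\tDP=35\n"]
print(count_variant_records(header_only), count_variant_records(with_variant))
```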
Table 5.
Causes and solutions to potential workflow errors.
| Problem | Possible Cause | Solution |
|---|---|---|
| Error importing packages (ImportError) | Mamba was not installed correctly or is not up to date | Update Mamba in the Conda environment, or re-install Mamba using the following command: conda install mamba -c conda-forge |
| Files are incomplete (IncompleteFilesException) | A previous run crashed unexpectedly, and some files were not completely generated | Re-run the pipeline, or use the additional parameter --rerun-incomplete to regenerate incomplete files |
COMMENTARY
In this paper, we present SNP-SVant, a user-friendly and computationally efficient workflow for identifying genomic variants. Users with minimal computational expertise should be able to run the workflow by following the protocol provided here. Since variant calling is complex and highly dependent on parameter choices, we calibrated SNP-SVant so that its default values are concordant with recommended, benchmarked standards. We also report quality metrics from the variant callers visually, which enables comparison of the variant calls against these quality thresholds and allows easy identification of low-quality samples.
SNP-SVant relies on a contiguous, complete reference genome. Thus, variant calling for organisms with a fragmented or incomplete reference genome could result in lower quality scores for the predicted variants, more false positive variant calls, and more false negatives (true genetic variants that go undetected). Another limitation of SNP-SVant is that only relatively simple SVs can be annotated. More complex SVs involving multiple genomic breakpoints are challenging to identify and annotate using short-read sequencing methods, so the accurate identification of such complex SVs would require long-read sequencing data. Additionally, GRIDSS reports SVs with breakpoints where the break-end on one side cannot be unambiguously resolved. These unpaired single break-ends contain information on SVs with parts of the sequence missing from the reference, likely due to the presence of foreign DNA or repetitive regions (Cameron et al., 2021). Because the source and type of such SVs are difficult to identify, they are not annotated, but they are retained in the GRIDSS output file in VCF format.
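Users who wish to inspect the unannotated single break-ends can separate them from paired breakpoints by examining the ALT column of the GRIDSS VCF file. The sketch below relies on the breakend notation of the VCF specification, in which a paired breakend carries the mate position in square brackets and a single breakend is written with a leading or trailing period. It assumes that the GRIDSS output follows this notation; the function name and example alleles are hypothetical.

```python
def classify_breakend(alt: str) -> str:
    """Classify a VCF ALT allele as 'paired', 'single', or 'other'.

    paired: mate coordinates in brackets, e.g. 'A[chr2:321[' or ']chr2:321]A'
    single: bases with a leading or trailing period, e.g. 'A.' or '.ACGT'
    other:  anything else, e.g. a plain SNP allele, '<DEL>', or a missing '.'
    """
    if "[" in alt or "]" in alt:
        return "paired"
    if alt != "." and (alt.startswith(".") or alt.endswith(".")):
        return "single"
    return "other"


# Invented ALT alleles for illustration.
for allele in ["A[chr2:321[", "]chr2:321]A", "A.", ".ACGT", "<DEL>", "G"]:
    print(allele, classify_breakend(allele))
```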
Variant calling is a computationally expensive and complex procedure. Existing methods for variant calling predict SNPs and SVs independently and typically require computational expertise to implement. Additionally, the accurate identification of SNPs relies on experimentally validated and benchmarked variant data, which is available primarily for humans (Karczewski et al., 2020) and model organisms, such as mice (Ringwald et al., 2022), fruit flies (Wang et al., 2015), and baker’s yeast (Cherry et al., 2012). In the absence of these benchmarked datasets, it is difficult to distinguish between sequencing artefacts and true genomic variants. To predict SNPs, small INDELs, and SVs, in the absence of benchmarked datasets, SNP-SVant uses multiple rounds of statistical recalibration to calibrate quality scores for the variants based on high confidence variant calls. Calibration of base scores is an important step in SNP calling, and one major advantage of SNP-SVant is its ability to perform this step without benchmarked datasets. While recalibration of base scores is known to reduce the false positive rate in SNP prediction for early next-generation sequencing data (Depristo et al., 2011), it is also reported to compromise sensitivity in low-coverage data, and when there is high divergence between the strain and reference genome (Li & Wren, 2014; Tian et al., 2016). If users are working with low-coverage data or with strains that have diverged significantly from the reference, we recommend using VCF files for GATK SNP calls without base score recalibration, listed in Table 4. Further work is needed to identify sources of biases to refine best practices in SNP prediction for organisms without benchmarked variants. Overall, we integrate the prediction of SNPs, small INDELs, and SVs in the SNP-SVant workflow, which can be used to comprehensively identify genomic variants at different scales.
Troubleshooting:
Troubleshooting steps for some common problems are listed in Table 5.
ACKNOWLEDGMENTS:
We thank all past and present members of the Nobile and Ardell laboratories for feedback on this manuscript. This work was supported by the National Institutes of Health (NIH) National Institute of General Medical Sciences (NIGMS) award number R35GM124594 and by the Kamangar family in the form of an endowed chair to C.J.N. The content is the sole responsibility of the authors and does not represent the views of the funders. The funders had no role in the design of the study; in the collection, analyses, or interpretation of the data; in the writing of the manuscript; or in the decision to publish the results.
Footnotes
CONFLICT OF INTEREST STATEMENT:
Clarissa J. Nobile is a cofounder of BioSynesis, Inc., a company developing diagnostics and therapeutics for biofilm infections. BioSynesis, Inc. was not involved in the content of this manuscript.
DATA AVAILABILITY STATEMENT:
The source code describing the SNP-SVant workflow is freely available as a GitHub repository (https://github.com/dgunasekaran/snp_svant). The example paired-end sequencing data used to illustrate the usage of this workflow is publicly available and can be downloaded from the SRA using the accession # SRR7801919 (Sitterlé et al., 2019).
LITERATURE CITED:
- Andrews S (2010). FastQC: a quality control tool for high throughput sequence data.
- Bolger AM, Lohse M, & Usadel B (2014). Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, 30(15), 2114. 10.1093/BIOINFORMATICS/BTU170
- Cameron DL, Baber J, Shale C, Valle-Inclan JE, Besselink N, van Hoeck A, Janssen R, Cuppen E, Priestley P, & Papenfuss AT (2021). GRIDSS2: comprehensive characterisation of somatic structural variation using single breakend variants and structural variant phasing. Genome Biology, 22(1), 1–25. 10.1186/S13059-021-02423-X
- Cameron DL, Schröder J, Penington JS, Do H, Molania R, Dobrovic A, Speed TP, & Papenfuss AT (2017). GRIDSS: Sensitive and specific genomic rearrangement detection using positional de Bruijn graph assembly. Genome Research, 27(12), 2050–2060. 10.1101/GR.222109.117
- Cherry JM, Hong EL, Amundsen C, Balakrishnan R, Binkley G, Chan ET, Christie KR, Costanzo MC, Dwight SS, Engel SR, Fisk DG, Hirschman JE, Hitz BC, Karra K, Krieger CJ, Miyasato SR, Nash RS, Park J, Skrzypek MS, … Wong ED (2012). Saccharomyces Genome Database: the genomics resource of budding yeast. Nucleic Acids Research, 40(Database issue), D700. 10.1093/NAR/GKR1029
- Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, Whitwham A, Keane T, McCarthy SA, & Davies RM (2021). Twelve years of SAMtools and BCFtools. GigaScience, 10(2), 1–4. 10.1093/GIGASCIENCE/GIAB008
- De Coster W, & Van Broeckhoven C (2019). Newest Methods for Detecting Structural Variations. Trends in Biotechnology, 37(9), 973–982. 10.1016/J.TIBTECH.2019.02.003
- Depristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, Del Angel G, Rivas MA, Hanna M, McKenna A, Fennell TJ, Kernytsky AM, Sivachenko AY, Cibulskis K, Gabriel SB, Altshuler D, & Daly MJ (2011). A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics, 43(5), 491–498. 10.1038/ng.806
- Karczewski KJ, Francioli LC, Tiao G, Cummings BB, Alföldi J, Wang Q, Collins RL, Laricchia KM, Ganna A, Birnbaum DP, Gauthier LD, Brand H, Solomonson M, Watts NA, Rhodes D, Singer-Berk M, England EM, Seaby EG, Kosmicki JA, … Xavier RJ (2020). The mutational constraint spectrum quantified from variation in 141,456 humans. Nature, 581(7809), 434–443. 10.1038/s41586-020-2308-7
- Köster J, Mölder F, Jablonski KP, Letcher B, Hall MB, Tomkins-Tinch CH, Sochat V, Forster J, Lee S, Twardziok SO, Kanitz A, Wilm A, Holtgrewe M, Rahmann S, & Nahnsen S (2021). Sustainable data analysis with Snakemake. F1000Research, 10. 10.12688/F1000RESEARCH.29032.2
- Kosugi S, Momozawa Y, Liu X, Terao C, Kubo M, & Kamatani Y (2019). Comprehensive evaluation of structural variation detection algorithms for whole genome sequencing. Genome Biology, 20(1), 1–18. 10.1186/S13059-019-1720-5
- Langmead B, & Salzberg SL (2012). Fast gapped-read alignment with Bowtie 2. Nature Methods, 9(4), 357. 10.1038/NMETH.1923
- Lawrence TJ, Kauffman KT, Amrine KCH, Carper DL, Lee RS, Becich PJ, Canales CJ, & Ardell DH (2015). FAST: FAST Analysis of Sequences Toolbox. Frontiers in Genetics, 6(MAY), 136911. 10.3389/FGENE.2015.00172
- Li H, & Wren J (2014). Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics, 30(20), 2843. 10.1093/BIOINFORMATICS/BTU356
- Mahmoud M, Gobet N, Cruz-Dávalos DI, Mounier N, Dessimoz C, & Sedlazeck FJ (2019). Structural variant calling: The long and the short of it. Genome Biology, 20(1), 1–14. 10.1186/S13059-019-1828-7
- McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, & DePristo MA (2010). The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research, 20(9), 1297–1303. 10.1101/GR.107524.110
- McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GRS, Thormann A, Flicek P, & Cunningham F (2016). The Ensembl Variant Effect Predictor. Genome Biology, 17(1), 1–14. 10.1186/S13059-016-0974-4
- Olson ND, Lund SP, Colman RE, Foster JT, Sahl JW, Schupp JM, Keim P, Morrow JB, Salit ML, & Zook JM (2015). Best practices for evaluating single nucleotide variant calling methods for microbial genomics. Frontiers in Genetics, 6(JUL), 150309. 10.3389/FGENE.2015.00235
- Ringwald M, Richardson JE, Baldarelli RM, Blake JA, Kadin JA, Smith C, & Bult CJ (2022). Mouse Genome Informatics (MGI): latest news from MGD and GXD. Mammalian Genome, 33(1), 4–18. 10.1007/S00335-021-09921-0
- Robinson JT, Thorvaldsdóttir H, Winckler W, Guttman M, Lander ES, Getz G, & Mesirov JP (2011). Integrative genomics viewer. Nature Biotechnology, 29(1), 24–26. 10.1038/nbt.1754
- Sitterlé E, Maufrais C, Sertour N, Palayret M, d’Enfert C, & Bougnoux ME (2019). Within-Host Genomic Diversity of Candida albicans in Healthy Carriers. Scientific Reports, 9(1), 1–12. 10.1038/s41598-019-38768-4
- Stoler N, & Nekrutenko A (2021). Sequencing error profiles of Illumina sequencing instruments. NAR Genomics and Bioinformatics, 3(1). 10.1093/NARGAB/LQAB019
- Sudmant PH, Rausch T, Gardner EJ, Handsaker RE, Abyzov A, Huddleston J, Zhang Y, Ye K, Jun G, Fritz MHY, Konkel MK, Malhotra A, Stütz AM, Shi X, Casale FP, Chen J, Hormozdiari F, Dayama G, Chen K, … Korbel JO (2015). An integrated map of structural variation in 2,504 human genomes. Nature, 526(7571), 75. 10.1038/NATURE15394
- Tian S, Yan H, Kalmbach M, & Slager SL (2016). Impact of post-alignment processing in variant discovery from whole exome data. BMC Bioinformatics, 17(1). 10.1186/S12859-016-1279-Z
- Uffelmann E, Huang QQ, Munung NS, de Vries J, Okada Y, Martin AR, Martin HC, Lappalainen T, & Posthuma D (2021). Genome-wide association studies. Nature Reviews Methods Primers, 1(1), 1–21. 10.1038/s43586-021-00056-9
- Van der Auwera GA, Carneiro MO, Hartl C, Poplin R, del Angel G, Levy-Moonshine A, Jordan T, Shakir K, Roazen D, Thibault J, Banks E, Garimella KV, Altshuler D, Gabriel S, & DePristo MA (2013). From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Current Protocols in Bioinformatics, 11(1110), 11.10.1. 10.1002/0471250953.BI1110S43
- Wang F, Jiang L, Chen Y, Haelterman NA, Bellen HJ, & Chen R (2015). FlyVar: a database for genetic variation in Drosophila melanogaster. Database: The Journal of Biological Databases and Curation, 2015. 10.1093/DATABASE/BAV079
- Weischenfeldt J, Symmons O, Spitz F, & Korbel JO (2013). Phenotypic impact of genomic structural variation: insights from and for human disease. Nature Reviews Genetics, 14(2), 125–138. 10.1038/nrg3373
INTERNET RESOURCES:
- GATK best practices can be used to understand quality thresholds and can be found at https://gatk.broadinstitute.org/hc/en-us/sections/360007226651-Best-Practices-Workflows.