Author manuscript; available in PMC: 2020 Jul 1.
Published in final edited form as: Genet Epidemiol. 2019 Sep 14;44(1):41–51. doi: 10.1002/gepi.22261

Combining sequence data from multiple studies: impact of analysis strategies on rare variant calling and association results

Zhongsheng Chen, Michael Boehnke, Christian Fuchsberger
PMCID: PMC7231418  NIHMSID: NIHMS1590910  PMID: 31520493

Abstract

Individual sequencing studies often have limited sample sizes and so limited power to detect trait associations with rare variants. A common strategy is to aggregate data from multiple studies. For studying rare variants, jointly calling all samples together is the gold standard strategy but can be difficult to implement due to privacy restrictions and computational burden. Here, we compare the joint calling to the alternative of single-study calling in terms of variant detection sensitivity and genotype accuracy as a function of sequencing coverage and assess their impact on downstream association analysis. To do so, we analyze deep-coverage (~82X) exome and low-coverage (~5X) genome sequence data on 2,250 individuals from the GoT2D study jointly and separately within five geographic cohorts.

For rare SNVs: (1) ≥97% of discovered SNVs are found by both calling strategies; (2) non-reference concordance with a set of highly accurate genotypes is ≥99% for both calling strategies; (3) meta-analysis has similar power to joint analysis in deep-coverage sequence data but can be less powerful in low-coverage sequence data. Given similar data processing and quality control steps, we recommend single-study calling as a viable alternative to joint calling for analyzing SNVs of all MAF in deep-coverage data.

Keywords: Sequencing studies, rare variants, joint analysis, meta-analysis

Introduction

Genome-wide association studies (GWAS) based on genotype arrays have identified thousands of common (minor allele frequency [MAF]>5%) genetic variants associated with a wide range of human diseases and traits (Hindorff et al., 2012). However, these common variants comprise only 10% of the ~84 million variant sites discovered in the human genome by the 1000 Genomes Project (2015) with the rest being low-frequency (MAF 0.5–5%; ~14%) and rare (MAF<0.5%; ~76%) variants that are less well captured by genotype arrays and subsequent genotype imputation (Zuk et al., 2014). With the advance of genome sequencing technology, we can now directly study the role of variants across the full allele-frequency spectrum. Although sequencing studies to date have reaffirmed and expanded on the common variant associations of array-based GWAS, the modest sample sizes of most sequencing studies to date have limited the discovery of rare and low-frequency variant associations (Fuchsberger et al., 2016; Auer et al., 2016; Luo et al., 2017).

To increase sample size, researchers often aggregate sequence data across multiple studies. To combine sequence data across studies, the gold standard strategy is to jointly call all samples together (Auer et al., 2016). This joint calling strategy increases the quality of variant calls and minimizes batch effects such as those from using different sequencing centers or platforms (Auer et al., 2016). However, joint calling can be difficult to implement for sequence data due to restrictions on data sharing (Paltoo et al., 2014; Jiang et al., 2014) and the potentially heavy computational burden (Lek et al., 2016). An alternative strategy that adheres to privacy rules and mitigates computing load is single-study calling (Okada et al., 2018), in which variants are identified and genotypes called separately within each study and then combined through meta-analysis of study-level association statistics or joint analysis of pooled individual-level data (i.e. mega-analysis). Although single-study calling is easier to implement than the gold standard joint calling, there is a need to quantify the difference in calling results between these two strategies and assess how it affects downstream association analysis.

Past research has shown that meta-analysis of study-level association results is as statistically efficient as joint analysis of individual-level data for combining common-variant GWAS (Lin & Zeng, 2010). More recent research has extended methods for meta-analysis to sequencing studies for rare variants (Tang & Lin, 2015). However, this research only analyzes the relative power of joint and meta-analysis under a single-study calling strategy and does not consider the impact of joint calling on association results. In addition, sequencing studies often differ in sequencing coverage depending on project needs and goals. For example, deep-coverage sequencing results in improved genotyping accuracy, particularly for rare variants (Lee et al., 2014; Xu et al., 2017), while low-coverage sequencing results in more samples at the same cost (Li et al., 2011). Thus, there is also a need to compare rare variant association tests for joint and single-study calling under different sequencing coverage.

In this paper, we aim to quantify the difference between the gold standard joint calling and the alternative single-study calling strategies and assess their impact on association testing of rare single nucleotide variants (SNVs) in deep and low-coverage sequence data. Specifically, we compare variant detection and genotyping accuracy for joint and single-study callsets on deep-coverage whole exome sequence (WES) and low-coverage whole genome sequence (WGS) datasets from the Genetics of Type 2 Diabetes (GoT2D) study (Fuchsberger et al., 2016) using the GotCloud variant calling pipelines (Jun et al., 2015) at default settings. Then for each data type, we compare single-variant and gene-based association test results for rare SNVs between three types of joint and single-study strategies: 1) joint calling with joint analysis, 2) single-study calling with meta-analysis, and 3) single-study calling with mega-analysis.

Methods

Data description

We analyzed data on 2,250 individuals from the GoT2D study (Fuchsberger et al., 2016) for whom deep-coverage whole exome sequence (mean depth 82X), low-coverage whole genome sequence (mean depth 5X), and Illumina HumanOmni 2.5M array data were all available. Study participants came from five geographical regions: (1) Augsburg, Germany (n=193; KORA study), (2) the Botnia region of western Finland (n=303; DGI study), (3) Sweden (n=391; DGI study), (4) the United Kingdom (n=473; UKT2D study), and (5) Finland (n=890; FUSION study). For clarity, we will refer to the sample of 2,250 individuals as the “joint” cohort and the five subsets as the “single-study” cohorts.

DNA sample preparation and sequencing

DNA samples were processed at the Broad Institute (FUSION and DGI), Wellcome Trust Centre for Human Genetics (UKT2D), and Helmholtz Zentrum München (KORA). DNA samples were genome and exome sequenced using the Illumina GAII or HiSeq 2000 sequencers. Sequence data were aligned to human reference genome version 19 (hg19) using Picard (DePristo et al., 2011) and BWA (Li & Durbin, 2009). Further details on data generation, processing, and quality control can be found in Fuchsberger et al. (2016).

Processed and filtered sequence reads for the joint and single-study cohorts were analyzed by the GotCloud and GATK (McKenna et al., 2010; Van der Auwera et al., 2013) variant calling pipelines according to the best practice workflows recommended by their developers at default settings. We restricted our analyses to chromosome 2 (~8% of the human genome) to reduce computational burden.

Whole-genome and exome sequence data processing: GotCloud and GATK pipeline

We called SNVs with GotCloud at default settings using processed BAM files. We used SAMtools pileup and glfFlex to generate genotype likelihoods for all samples in 5 Mb chromosomal segments. We then used a support vector machine classifier to filter out likely false-positive variant sites (Jun et al., 2015).

Adhering to the recommended GATK workflow, we “hard called” every variable site in each sample for the number of non-reference alleles (0, 1, or 2) using HaplotypeCaller in GVCF mode. To parallelize this step, we divided chromosome 2 into 5 Mb segments with 100 bp overlap and simultaneously carried out hard-calls within each segment. We merged intermediate genomic VCF (gVCF) files from each sample into batches of 100 samples with CombineGVCFs and then jointly genotyped them with GenotypeGVCFs. We used the GATK CatVariants tool to concatenate variant sets from all genomic regions to form a combined callset. We identified a set of high-quality variant calls from the raw variant callset using the Variant Quality Score Recalibration (VQSR) method which applies machine learning algorithms to score each variant call and filter them at a desired level of sensitivity. We used GATK VariantRecalibrator and ApplyRecalibration to filter the raw variant callset at the recommended tranche threshold of 99.9% which provides high sensitivity while maintaining a reasonable level of specificity. Finally, we removed indels from the filtered variant callset in keeping with our settings for the GotCloud pipeline and to focus on SNVs in subsequent analyses.
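The segmentation used to parallelize calling can be illustrated with a minimal sketch. It assumes 1-based inclusive coordinates and that "100 bp overlap" means consecutive windows share their last and first 100 bases; the function name and the chromosome length constant (the hg19 chromosome 2 length) are illustrative additions and are not taken from the pipelines themselves.

```python
def make_segments(chrom_length, segment_size=5_000_000, overlap=100):
    """Return 1-based inclusive (start, end) windows covering a chromosome.

    Consecutive windows share `overlap` base pairs (an assumed convention).
    """
    if overlap >= segment_size:
        raise ValueError("overlap must be smaller than segment_size")
    segments = []
    start = 1
    while True:
        end = min(start + segment_size - 1, chrom_length)
        segments.append((start, end))
        if end == chrom_length:
            break
        # The next window starts `overlap` bases before the current end.
        start = end - overlap + 1
    return segments


# Length of chromosome 2 in hg19; stated here as an assumption for illustration.
CHR2_LENGTH_HG19 = 243_199_373
segments = make_segments(CHR2_LENGTH_HG19)
print(len(segments), segments[0], segments[-1])
```

Each window can then be handed to an independent calling job, and the per-window results concatenated afterwards.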

We used haplotype-based refinement to improve genotype and haplotype quality for whole genome genotype calls from both pipelines. Specifically, we used Beagle (Browning & Yu, 2009) to phase the genotype data in chunks of 10,000 SNVs with 1,000-SNV overlaps and refined the phased sequences using Thunder (Jun et al., 2015) with 300 states.

We ran whole exome sequence reads through the GotCloud and GATK discovery pipelines under the same settings as the whole-genome data. We did not apply any refinement steps to the exome calls, consistent with standard practice for both pipelines for deep-coverage sequence data.

The final dataset for each of the four combinations of sequencing coverage (genome and exome) and pipeline (GotCloud and GATK) consists of a joint callset for all 2,250 samples, five separate single-study callsets for the geographically subdivided cohorts, and a union callset which merges the five single-study callsets. Comparing the joint callset to each of the five single-study callsets individually is difficult: detection of rare SNVs depends heavily on sample size, and the considerable sample size differences between cohorts could skew the results. We therefore use the union callset as an overall representation of single-study calling to provide a more apt comparison with the joint callset. For the union callset, we set genotype calls to missing in every study whose single-study callset does not contain the SNV.
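The construction of the union callset can be sketched as follows. This is an illustration only: the nested-dictionary layout, the function name, and the toy sites and sample IDs are hypothetical, and real callsets would be stored as VCF files rather than Python dictionaries.

```python
def build_union_callset(study_callsets, study_samples):
    """Merge per-study genotype calls into one union callset.

    study_callsets: {study: {site: {sample: genotype}}}
    study_samples:  {study: [sample, ...]}
    Returns {site: {sample: genotype or None}}, where None marks a missing call.
    """
    all_sites = set()
    for calls in study_callsets.values():
        all_sites.update(calls)

    union = {}
    for site in sorted(all_sites):
        row = {}
        for study, samples in study_samples.items():
            site_calls = study_callsets[study].get(site)
            for sample in samples:
                # A site absent from this study's callset yields missing genotypes.
                row[sample] = None if site_calls is None else site_calls.get(sample)
        union[site] = row
    return union


# Toy example with two hypothetical studies and two sites on chromosome 2.
toy_callsets = {
    "A": {("2", 1000, "A", "G"): {"a1": 0, "a2": 1}},
    "B": {("2", 1000, "A", "G"): {"b1": 0}, ("2", 2000, "C", "T"): {"b1": 1}},
}
toy_samples = {"A": ["a1", "a2"], "B": ["b1"]}
union = build_union_callset(toy_callsets, toy_samples)
print(union[("2", 2000, "C", "T")])
```

In the toy example the second site is called only in study B, so both samples from study A receive missing genotypes at that site.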

Non-reference genotype accuracy

For both pipelines, we assessed the accuracy of whole genome calls by comparing the Thunder-refined non-reference genotypes against a set of 192,322 variants of highly accurate (“high-confidence”) genotypes determined through joint statistical analysis of deep-coverage (~82X) exome sequence and Illumina HumanOmni 2.5 array data in the GoT2D whole genome sequencing study (Fuchsberger et al., 2016). We assessed the accuracy of exome calls by comparing unrefined non-reference genotypes against the set of high-confidence genotypes from Illumina HumanOmni 2.5 array data.

Association analysis

We used a logistic regression setting to test for T2D association under an additive genetic model with no additional covariates. We used the score test in our association analysis of the joint, union, and single-study callsets.
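For readers who want to see the arithmetic, the score test for a logistic model with an intercept only under the null and an additive genotype coding can be sketched as below. This is a textbook formulation written for illustration, not the software used in the study; the function name and the toy data are hypothetical, and samples with missing genotypes are assumed to have been removed beforehand.

```python
import math


def score_test(genotypes, phenotypes):
    """Score test for additive genotype effect in logistic regression, no covariates.

    genotypes:  allele counts (0, 1, or 2) per sample
    phenotypes: case status (1) or control status (0) per sample
    Returns (z, two-sided p-value), or None when the test is undefined.
    """
    n = len(genotypes)
    y_bar = sum(phenotypes) / n
    g_bar = sum(genotypes) / n
    # Score: covariance-like sum between genotype and the null residual (y - y_bar).
    u = sum(g * (y - y_bar) for g, y in zip(genotypes, phenotypes))
    # Variance of the score under the null hypothesis of no association.
    v = y_bar * (1 - y_bar) * sum((g - g_bar) ** 2 for g in genotypes)
    if v == 0:
        return None  # monomorphic site, or all samples share one phenotype
    z = u / math.sqrt(v)
    # z**2 follows a chi-square distribution with 1 df under the null.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


toy_genotypes = [0, 0, 1, 0, 1, 2, 0, 1]
toy_phenotypes = [0, 0, 1, 0, 1, 1, 0, 0]
z_toy, p_toy = score_test(toy_genotypes, toy_phenotypes)
print(z_toy, p_toy)
```

Because the score test only requires fitting the null model, it returns `None` rather than an unstable estimate when a site is monomorphic in the sample.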

We used SKAT-O to test for association with multiple rare and low-frequency SNVs within coding regions of the genome. We prepared four lists of SNVs (“masks”) based on MAF and functional annotation. Mask 1 contained SNVs predicted to be protein-truncating, Mask 2 included all SNVs from Mask 1 together with missense SNVs with MAF<1%, Mask 3 included all SNVs from Mask 1 and those predicted to be deleterious by all five algorithms applied (Polyphen2-HumDiv, PolyPhen2-HumVar, LRT, Mutation Taster, and SIFT), and Mask 4 included all SNVs from Mask 3 and those predicted to be deleterious by at least one algorithm with MAF<1%.

We performed SKAT-O (Lee et al., 2012) analysis on the four masks separately within each single-study callset. For the creation of the masks, we considered a SNV to have MAF<1% if its MAF in every one of the single-study callsets is <1%.
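The MAF rule for mask creation can be expressed in a few lines. The sketch below is illustrative: the function name and the example frequencies are hypothetical, and it assumes that a study in which the site was not called is simply omitted from the input.

```python
def passes_mask_maf(study_mafs, threshold=0.01):
    """True when the SNV's MAF is below `threshold` in every single-study callset.

    study_mafs: {study: minor allele frequency in that study's callset}
    """
    return all(maf < threshold for maf in study_mafs.values())


# Hypothetical per-study frequencies for two SNVs.
snv_rare_everywhere = {"KORA": 0.002, "UKT2D": 0.004, "FUSION": 0.008}
snv_enriched_in_one = {"KORA": 0.002, "UKT2D": 0.004, "FUSION": 0.012}
print(passes_mask_maf(snv_rare_everywhere), passes_mask_maf(snv_enriched_in_one))
```

Under this rule a variant that exceeds 1% in even one cohort is excluded from the MAF-restricted masks, regardless of its pooled frequency.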

Meta-analysis

For single-variant analysis, we combined summary-level results from the single-study callsets with fixed-effects sample-size weighted meta-analysis. For gene-based analysis, we combined SKAT-O results from each single-study callset using the Hom-Meta-SKAT-O test in the MetaSKAT R package (Lee et al., 2013). Since all samples were of European ancestry, we assumed homogeneous genetic effects across single-study cohorts to maximize power.
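A sample-size weighted meta-analysis of single-variant results can be sketched as follows. The sketch assumes the common convention of weighting each study's Z-score by the square root of its sample size; the function name and the Z-scores are hypothetical, while the cohort sizes are those given in the data description. Studies in which a variant was not called would be left out of the sum for that variant.

```python
import math


def sample_size_weighted_meta(z_scores, sample_sizes):
    """Combine per-study Z-scores with weights proportional to sqrt(sample size).

    Returns (combined z, two-sided p-value).
    """
    weights = [math.sqrt(n) for n in sample_sizes]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = math.sqrt(sum(w * w for w in weights))
    z_meta = numerator / denominator
    p_value = math.erfc(abs(z_meta) / math.sqrt(2))
    return z_meta, p_value


cohort_sizes = {"KORA": 193, "Botnia": 303, "Sweden": 391, "UKT2D": 473, "FUSION": 890}
# Hypothetical per-study score-test Z-scores for one SNV.
hypothetical_z = {"KORA": 0.8, "Botnia": 1.1, "Sweden": -0.2, "UKT2D": 1.5, "FUSION": 2.0}

studies = list(cohort_sizes)
z_meta, p_meta = sample_size_weighted_meta(
    [hypothetical_z[s] for s in studies], [cohort_sizes[s] for s in studies]
)
print(z_meta, p_meta)
```

With these weights the largest cohort (FUSION) contributes most to the combined statistic, which is one reason a variant missing from a large cohort can change the meta-analysis result noticeably.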

Results

Overview

We evaluated the utility of single-study calling as an alternative to the gold standard joint calling by comparing these methods in terms of variant detection, genotype accuracy, and impact on power of association tests for different sequencing coverage. For our analysis (restricted to chromosome 2 due to computational burden), we focus on the gold standard joint callset, which are calls from analyzing all 2,250 samples together (the “joint” cohort), the five single-study callsets, which are calls from the five geographically subdivided cohorts (the “single-study” cohorts: Germany, Botnia, Sweden, UK, Finland), and the union callset, which pools calls from the five single-study callsets. There are 25,689 deep-coverage WES SNVs and 2,101,401 (15,344 when restricted to coding regions) low-coverage WGS SNVs in the joint callset and 26,364 deep-coverage WES SNVs and 2,249,181 (16,457) low-coverage WGS SNVs in the union callset. We present only GotCloud results as we found choice of software pipelines (GotCloud or GATK) to have no meaningful impact on variant calling and association results.

Calling results

Union callset

The union callset pools calling results from the five single-study cohorts by merging their SNV calls. For SNV sites found in only a subset of the studies, we assign missing genotypes for studies in which the SNV site was not called. Using the union callset, we examine the overlap in variant detection between single-study cohorts. For deep-coverage data, 78% of all rare SNVs detected by single-study calling (i.e. those in the union callset) are “study specific” (Table 1), meaning they were found in only one of the single-study callsets and missing in all others, compared with 1.2% of low-frequency SNVs and 0.05% of common SNVs (Table 1). Conversely, only 2.3% of rare SNVs in the union callset are found in all five studies (Table 1) compared with 80% of low-frequency and 99% of common SNVs (Table 1). Similar numbers are seen for low-coverage data (restricted to coding regions) (Table 1). Overall, there are three possible reasons for a missing SNV site in a study: 1) the SNV was monomorphic in the study sample; 2) the variant caller did not have confidence to declare the SNV site; or 3) the SNV site was identified but removed by quality control as likely false-positive. However, for single-study calling, we are unable to differentiate between the three types of missingness because of privacy restrictions for individual-level data such as BAM files and calling results.

Table 1:

Overlap in variant detection for the union callset.

| Data type | Variants detected by only one study | Variants detected by 2 to 4 studies | Variants detected by all 5 studies |
| --- | --- | --- | --- |
| Deep-coverage | | | |
| Rare (MAF <0.5%) | 17,128 (78.0%) | 4,316 (20%) | 507 (2.3%) |
| Low-frequency (MAF 0.5–5%) | 28 (1.2%) | 435 (19%) | 1,873 (80%) |
| Common (MAF >5%) | 1 (0.05%) | 26 (1.3%) | 2,050 (99%) |
| Low-coverage (coding regions) | | | |
| Rare (MAF <0.5%) | 9,262 (77%) | 2,563 (21%) | 160 (1.4%) |
| Low-frequency (MAF 0.5–5%) | 38 (1.6%) | 890 (38%) | 1,432 (61%) |
| Common (MAF >5%) | 5 (0.24%) | 123 (5.8%) | 1,984 (94%) |

Note. The union callset pools variant calling results from the five single-study cohorts. Numbers in the table refer to SNVs from chromosome 2 in deep-coverage (~82X) exome sequence data and low-coverage (~5X) genome sequence data restricted to coding regions.

Variant detection: callset size

We evaluated variant detection for joint and single-study strategies by comparing the joint and union callsets across a range of MAFs. For low-frequency and common SNVs in both deep-coverage exome and low-coverage genome (restricted to coding regions) sequence data, there is almost complete overlap between the joint and union callsets (Figure 2). However, for rare SNVs, there are noticeable discrepancies between the two callsets as described below.

Figure 2:

Comparison of variant detection between joint and single study calling strategies for rare (MAF<0.5%), low-frequency (MAF 0.5–5%), and common (MAF>5%) SNVs in (A) deep-coverage exome sequence data and (B) low-coverage genome sequence data restricted to coding regions. Teal-colored bars denote the fraction of SNVs detected by both calling strategies while red and orange bars denote those only detected by joint or single study strategies, respectively.

The overwhelming majority of rare SNVs detected in deep-coverage data are found in both the joint and union callsets (97% of all rare SNVs) with the remaining SNVs found exclusively in the joint (0.1%) and union (2.9%) callsets (Figure 2A). Contrary to expectations, the union callset is larger than the joint callset, mainly due to inconsistencies in variant filtering. Of the 631 rare SNVs exclusive to the union callset, 540 of them were filtered out during joint calling and excluded from the final joint callset. SNVs in joint calling go through variant filters once whereas SNVs in single-study calling have one chance per study to pass filters and be included in the union callset. In this scenario, a lack of consistent variant filtering between joint and single-study calling can lead to the differences seen here.

For rare SNVs in low-coverage data, we observed a similar pattern of variant detection as for deep-coverage data (Figure 2B). However, inconsistencies in variant filtering account for only a small fraction of the differences between the joint and union callsets. Only 128 of the 1,107 rare SNVs exclusive to the union callset were filtered out during joint calling.

Variant detection: genotype calls

In addition to comparing the number of SNVs detected by joint and single-study calling, we also compared the genotype calls made by the two strategies at different sequencing coverage. We show in Tables 2 and 3 the comparison of genotype calls between joint and single-study calling for 9,096 rare SNVs found in the joint and union callsets from deep-coverage exome as well as from low-coverage genome (restricted to coding regions) sequence data. Genotype comparisons for 2,127 low-frequency and 2,027 common SNVs are shown in Supplementary Tables 1–4. Excluding missing calls, overall genotype discordance between joint and single-study calling is lower in deep-coverage data than in low-coverage data. Furthermore, for rare SNVs, 64% of all genotype calls from single-study calling in deep-coverage data (Table 2) are missing compared with 70% for low-coverage data (Table 3). Breaking down rare SNVs further by minor allele count (MAC), we observe this missingness to be a function of MAC in both types of sequencing data with the rarest categories most affected (Supplementary Tables 5–12). In addition, the missingness appears to be mostly localized to rare SNVs as we observe only a small number of missing genotype calls in low-frequency SNVs (4.3% in deep-coverage data, 9.2% in low-coverage data; Supplementary Tables 1 and 2) and a negligible number in common SNVs (0.21% and 0.78%; Supplementary Tables 3 and 4). In deep-coverage data, we can attribute almost all missing calls for rare SNVs to monomorphic SNVs in the single-study cohort(s) since 13,093,060 of the 13,093,128 missing single-study calls were called as homozygous reference by joint calling (Table 2). However, for low-coverage data, 6,365 of 14,246,613 missing single-study calls were called as non-reference by joint calling (Table 3). Since rare SNVs naturally have low allele counts to begin with, any small change to their overall allele counts will have a noticeable impact on association testing and other downstream analyses.
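A genotype comparison of this kind amounts to a cross-tabulation of the two callsets over the same SNV and sample pairs. The following sketch shows one way to build such a table; the data layout, function name, and toy calls are hypothetical and do not reproduce the study's own tooling.

```python
from collections import Counter

LABELS = {None: "Missing", 0: "Hom. ref.", 1: "Heterozygous", 2: "Hom. alt."}


def cross_tabulate(joint_calls, union_calls):
    """Count (single-study call, joint call) pairs over all (site, sample) keys.

    joint_calls, union_calls: {(site, sample): genotype}, with None or an
    absent key meaning the genotype is missing in the union callset.
    """
    table = Counter()
    for key, joint_gt in joint_calls.items():
        union_gt = union_calls.get(key)
        table[(LABELS[union_gt], LABELS[joint_gt])] += 1
    return table


# Toy data: one site, four samples; sample s4 belongs to a study that did not call the site.
toy_joint = {("snv1", "s1"): 0, ("snv1", "s2"): 1, ("snv1", "s3"): 0, ("snv1", "s4"): 1}
toy_union = {("snv1", "s1"): 0, ("snv1", "s2"): 1, ("snv1", "s3"): 1, ("snv1", "s4"): None}
toy_table = cross_tabulate(toy_joint, toy_union)
for (union_label, joint_label), count in sorted(toy_table.items()):
    print(union_label, "|", joint_label, "|", count)
```

In the toy data, sample s4 illustrates the problematic case discussed above: a single-study call that is missing although the joint callset reports a non-reference genotype.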

Table 2:

Comparison of genotype calls for rare SNVs from deep-coverage exome sequence data.

| Single-study variant calling (union callset) | Joint: Missing | Joint: Homozygous reference | Joint: Heterozygous | Joint: Homozygous alternate | Total |
| --- | --- | --- | --- | --- | --- |
| Missing | 0 | 13,093,060 (64%) | 68 (<0.001%) | 0 | 13,093,128 (64%) |
| Hom. ref. | 0 | 7,135,459 (35%) | 9 (<0.001%) | 0 | 7,135,468 (35%) |
| Heterozygous | 0 | 31 (<0.001%) | 25,862 (0.13%) | 0 | 25,893 (0.13%) |
| Hom. alt. | 0 | 0 | 4 (<0.001%) | 211,507 (1.0%) | 211,511 (1.0%) |
| Total | 0 | 20,228,550 (99%) | 25,943 (0.13%) | 211,507 (1.0%) | 20,466,000 |

Note. Genotype calls from joint (columns) and single-study (rows) calling strategies for 9,096 rare (MAF <0.5%) SNVs from chromosome 2 in deep-coverage (~82X) exome sequence data. Concordant calls between the two strategies lie on the diagonal.

Table 3:

Comparison of genotype calls for rare SNVs from low-coverage genome sequence data (coding regions).

| Single-study variant calling (union callset) | Joint: Missing | Joint: Homozygous reference | Joint: Heterozygous | Joint: Homozygous alternate | Total |
| --- | --- | --- | --- | --- | --- |
| Missing | 0 | 14,240,248 (70%) | 5,966 (0.029%) | 399 (0.002%) | 14,246,613 (70%) |
| Hom. ref. | 0 | 5,981,638 (29%) | 1,855 (0.009%) | 2 (<0.001%) | 5,983,495 (29%) |
| Heterozygous | 0 | 3,687 (0.02%) | 21,073 (0.10%) | 99 (<0.001%) | 24,859 (0.12%) |
| Hom. alt. | 0 | 0 | 37 (<0.001%) | 210,996 (1.0%) | 211,033 (1.0%) |
| Total | 0 | 20,225,573 (99%) | 28,931 (0.14%) | 211,496 (1.0%) | 20,466,000 |

Note. Genotype calls from joint (columns) and single-study (rows) calling strategies for 9,096 rare (MAF <0.5%) SNVs from chromosome 2 in low-coverage (~5X) genome sequence data restricted to coding regions. Concordant calls between the two strategies lie on the diagonal.

Genotype concordance

We assessed non-reference genotype accuracy (hereafter referred to as “genotype concordance”) of joint and single-study calling in deep-coverage exome sequence data by comparing non-reference calls for SNVs found in both the joint and union callsets against a “truth” set of high confidence genotypes from Illumina HumanOmni 2.5 array data (Fuchsberger et al., 2016). The joint and union callsets have nearly identical genotype concordance with the truth set for SNVs of all MAFs and negligible differences in raw counts (Table 4).

Table 4:

Non-reference genotype accuracy for joint and single-study calling strategies.

| Data type | Genotype concordance for joint callset | Genotype concordance for union callset |
| --- | --- | --- |
| Deep-coverage | | |
| Rare (MAF <0.5%) | 99.7% (91,457/91,756) | 99.7% (91,456/91,756) |
| Low-frequency (MAF 0.5–5%) | 99.3% (171,939/173,131) | 99.3% (171,930/173,131) |
| Common (MAF >5%) | 99.3% (1,712,741/1,724,873) | 99.2% (1,711,385/1,724,873) |
| Low-coverage (all regions) | | |
| Rare (MAF <0.5%) | 99.7% (3,563,500/3,575,402) | 99.3% (3,550,178/3,575,402) |
| Low-frequency (MAF 0.5–5%) | 99.6% (6,837,310/6,866,584) | 99.1% (6,807,530/6,866,584) |
| Common (MAF >5%) | 99.6% (112,966,946/113,401,131) | 99.4% (112,694,329/113,401,131) |

Note. Genotype concordance for joint and single-study calling strategies in deep-coverage (~82X) exome and low-coverage (~5X) genome sequence data. The “truth” set of high confidence genotypes being compared against comes from Illumina HumanOmni 2.5 array data and deep exome sequence in the GoT2D integrated panel. Raw genotype counts are displayed in parentheses.

Next, we assessed genotype concordance for SNVs in low-coverage genome sequence data (not restricted to coding regions to preserve a meaningful number of comparisons) by comparing against high confidence genotypes from Illumina HumanOmni 2.5 array data and/or from deep (~82X) exome sequence in the GoT2D integrated panel (Fuchsberger et al., 2016). The joint callset correctly calls 0.4% more genotypes than the union callset for rare SNVs, 0.5% more for low-frequency SNVs, and 0.2% more for common SNVs (Table 4). Compared with deep-coverage data, here we observe a larger difference in genotype concordance with the truth set between the joint and union callsets. For example, the joint callset calls 13,322 more genotypes correctly (out of 3,575,402 total comparisons) than the union callset for rare SNVs in low-coverage data while it only calls 1 more genotype correctly (out of 91,756) for rare SNVs in deep-coverage data. As expected, the improvements to calling accuracy offered by larger sample sizes in the joint strategy are more pronounced when the average read depth is low.
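The low-coverage comparison can be re-derived from the raw counts in Table 4. The short sketch below is only a worked calculation on those published counts; the variable and function names are illustrative.

```python
# (joint correct, union correct, total comparisons), copied from Table 4, low-coverage rows.
TABLE4_LOW_COVERAGE = {
    "rare": (3_563_500, 3_550_178, 3_575_402),
    "low-frequency": (6_837_310, 6_807_530, 6_866_584),
    "common": (112_966_946, 112_694_329, 113_401_131),
}


def concordance_summary(joint_correct, union_correct, total):
    """Concordance percentages for both callsets and the gap between them."""
    joint_pct = 100 * joint_correct / total
    union_pct = 100 * union_correct / total
    return {
        "joint_pct": joint_pct,
        "union_pct": union_pct,
        "extra_correct": joint_correct - union_correct,
        "gap_pct_points": joint_pct - union_pct,
    }


summary = {name: concordance_summary(*counts) for name, counts in TABLE4_LOW_COVERAGE.items()}
for name, row in summary.items():
    print(name, round(row["joint_pct"], 1), round(row["union_pct"], 1),
          row["extra_correct"], round(row["gap_pct_points"], 2))
```

Computed from the raw counts, the gaps are roughly 0.37, 0.43, and 0.24 percentage points for rare, low-frequency, and common SNVs; the 0.4%, 0.5%, and 0.2% quoted in the text correspond to differences of the rounded percentages.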

Association analysis

We evaluated the impact of joint and single-study calling on single-variant association tests by comparing -log10 p-values from joint analysis of the joint callset against those from meta-analysis of single-study summary statistics and joint analysis of the union callset (i.e. mega-analysis). We used the score test for all analyses. Overall, we observe similar p-values in all three types of analyses for rare SNVs in deep-coverage data (Figure 3A–C). This is due to almost perfect concordance in genotype calls between joint and single-study calling and the fact that missing variant calls for rare SNVs from single-study calling were almost all called as homozygous reference in the joint callset. However, for low-coverage data, we observe large discrepancies in p-values between joint and meta-analysis (Figure 3D) as well as between joint and mega-analysis for rare SNVs (Figure 3E). These differences in association results are caused by a combination of lower concordance in genotype calls between the two calling strategies for low-coverage data and an increase in the number of missing single-study calls being called as non-reference in the joint callset. Since both meta-analysis and mega-analysis use single-study calling, their association results are more similar (Figure 3F).

Figure 3:

Comparison of single-variant association test p-values between joint and single study calling strategies for rare (MAF<0.5%) SNVs in (A-C) deep-coverage (~82X) exome sequence data and (D-F) low-coverage (~5X) genome sequence data. Joint refers to joint analysis of the joint callset, meta refers to meta-analysis of single-study summary statistics, and mega refers to joint analysis of the union callset (mega-analysis).

We compared joint and single-study calling for gene-based tests by comparing -log10 p-values from the SKAT-O test of the joint callset with those from meta-analysis of single-study SKAT-O test results. For all masks, SKAT-O based joint analysis and MetaSKAT based meta-analysis produce similar p-values (Supplementary Figure 3).

Discussion

Although jointly calling all samples together is the gold standard strategy for analyzing rare SNVs in sequencing studies, single-study calling is more appealing due to fewer privacy restrictions and a smaller computational burden. In this study, we compared joint and single-study calling in terms of variant detection, non-reference genotype concordance, and their impact on association power as a function of sequencing coverage.

For single-study calling, we found that low overlap in variant detection among single-study cohorts for rare SNVs results in an abundance of “missing” genotype calls where we lose information for variant sites in cohorts where they were not detected. We show that for deep-coverage data, the impact of missing genotype calls on association testing of rare SNVs from single-study calling is minimal because almost all of this missingness is due to monomorphic SNVs, as evidenced by corresponding homozygous reference calls in the joint callset. However, for low-coverage data, average read depth is low and thus a portion of the missing genotype calls may be due to lack of coverage at the variant sites. Indeed, we show that a fraction of missing single-study calls for rare SNVs in low-coverage data have corresponding non-reference calls in the joint callset, resulting in lower than expected allele counts and reduced power for association testing of these SNVs. In addition, these missing calls can have a negative impact on gene-based aggregation tests, which will be underpowered if too many variant sites within a gene have missing genotype calls, and on genotype-based callbacks, since the majority of loss-of-function SNVs are rare. A possible but resource-intensive solution is to generate a list of SNV sites based on the union callset and then go back and genotype these sites within each single-study cohort. With parallel computation for each sample and every 5 Mb chromosomal segment, this process takes on average one hour of CPU time per sample per cohort with a maximum memory usage of approximately 0.5 GB to re-call 1 to 1.2 million variants in chromosome 2.

Although the low overlap in variant detection among single-study cohorts for rare SNVs can arise naturally due to sample population differences between cohorts, another contributing factor is the inconsistency of variant calling filters (i.e. false-positive screening). In our analysis, rare SNVs that were filtered out during joint calling may pass filters during calling in some single-study cohorts while being filtered out in others. This increases the possibility of introducing false-positive SNVs to downstream analyses since they only need to pass filters in one of the single-study cohorts to be included in association tests.

Recommendations

For deep-coverage data, single-study calling and either meta-analysis or mega-analysis can be recommended as a viable alternative to joint calling and analysis for rare SNVs based on almost perfect concordance of genotype calls between the two calling strategies, comparable non-reference genotype concordance with an external truth set, and comparable association results. Furthermore, missing genotype calls in single-study calling for deep-coverage data can be assumed to be homozygous reference and attributed to monomorphic variants, because their counterparts in the joint callset are almost all called as homozygous reference. When combining many smaller single studies, meta-analysis can be more conservative and less powerful than mega-analysis (Ma et al., 2013).

For low-coverage data or low-coverage regions in deep data, single-study calling cannot be recommended as a viable alternative to joint calling for rare SNVs. Discordance in genotype calls between the two calling strategies is approximately 150 times higher than that in deep-coverage data (0.09% versus 0.0006%) and combined with a sizable number of genotype calls in single-study calling being missing due to lack of coverage at variant sites, we observe large discrepancies in association results between the two calling strategies.
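The two discordance rates can be recovered from the non-missing cells of Tables 2 and 3. The sketch below is a worked calculation on those published counts and assumes discordance is defined as the share of non-missing single-study calls that differ from the joint call; the names are illustrative.

```python
# Non-missing single-study calls from Tables 2 and 3: {union call: {joint call: count}}.
DEEP = {
    "hom_ref": {"hom_ref": 7_135_459, "het": 9, "hom_alt": 0},
    "het": {"hom_ref": 31, "het": 25_862, "hom_alt": 0},
    "hom_alt": {"hom_ref": 0, "het": 4, "hom_alt": 211_507},
}
LOW = {
    "hom_ref": {"hom_ref": 5_981_638, "het": 1_855, "hom_alt": 2},
    "het": {"hom_ref": 3_687, "het": 21_073, "hom_alt": 99},
    "hom_alt": {"hom_ref": 0, "het": 37, "hom_alt": 210_996},
}


def discordance(table):
    """Return (discordant calls, total non-missing calls, discordance rate)."""
    total = sum(sum(row.values()) for row in table.values())
    discordant = sum(
        count
        for union_call, row in table.items()
        for joint_call, count in row.items()
        if union_call != joint_call
    )
    return discordant, total, discordant / total


deep_n, deep_total, deep_rate = discordance(DEEP)
low_n, low_total, low_rate = discordance(LOW)
print(f"deep: {100 * deep_rate:.4f}%  low: {100 * low_rate:.2f}%  ratio: {low_rate / deep_rate:.0f}")
```

Under this definition the counts give 44 discordant calls out of 7,372,872 in deep-coverage data and 5,680 out of 6,219,387 in low-coverage data, which matches the 0.0006% and 0.09% quoted above.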

In general, for studying low-frequency and common SNVs, single-study calling can be used as an alternative to joint calling in both deep-coverage and low-coverage data (Supplementary Figures 1 and 2). The only exception is for studying low-frequency SNVs in low-coverage data (Supplementary Figure 2D–F) where there remain noticeable discrepancies in association results between joint and meta/mega-analysis, although less than that seen for rare SNVs in low-coverage data.

Comparison with GATK pipeline

In addition to the GotCloud pipeline, we ran our analyses with the widely used GATK pipeline at default settings. Choice of software pipeline had a limited impact on variant detection and genotype accuracy with little to no impact on association results. There is more overlap in detected SNVs between joint and single-study calling for the GotCloud pipeline in deep-coverage data and vice versa for the GATK pipeline in low-coverage data. The GotCloud pipeline was slightly more accurate in calling common and low-frequency SNVs; however, on average this difference amounts to less than 1.5% more correctly called non-reference genotypes.

Study limitations

Because of computation time and burden, we limited our study to chromosome 2. One area where information from additional SNVs would have been helpful is the comparison of association power between joint and single-study strategies for genome-wide significant (p-value ≤ 5×10⁻⁸) rare SNVs. Currently, our single-variant and gene-based analyses of rare SNVs are centered on those with p-values ≥ 5×10⁻⁵, with limited information on rare SNVs near the genome-wide significance threshold.

Summary

We show single-study calling to be a viable alternative to joint calling for deep-coverage sequence data but show them to have noticeable discrepancies in rare variant calling and association results for low-coverage sequence data.

Supplementary Material

ChenSuppFigures

Supplementary Figure 1: Comparison of single-variant association test p-values between joint and single-study calling strategies for low-frequency (MAF 0.5–5%) SNVs in (A-C) deep-coverage (~82X) exome sequence data and (D-F) low-coverage (~5X) genome sequence data. Joint refers to joint analysis of the joint callset, meta refers to meta-analysis of single-study summary statistics, and mega refers to joint analysis of the union callset (mega-analysis).

Supplementary Figure 2: Comparison of single-variant association test p-values between joint and single-study calling strategies for common (MAF >5%) SNVs in (A-C) deep-coverage (~82X) exome sequence data and (D-F) low-coverage (~5X) genome sequence data. Joint refers to joint analysis of the joint callset, meta refers to meta-analysis of single-study summary statistics, and mega refers to joint analysis of the union callset (mega-analysis).

Supplementary Figure 3: Comparison of gene-based association test p-values between joint and single-study calling strategies in deep-coverage exome sequence data. Mask 1: protein-truncating SNVs; Mask 2: protein-truncating SNVs + missense SNVs with MAF<1%; Mask 3: protein-truncating SNVs + SNVs predicted deleterious by all algorithms (PolyPhen2-HumDiv, PolyPhen2-HumVar, LRT, Mutation Taster, and SIFT); Mask 4: protein-truncating SNVs + SNVs with MAF<1% predicted deleterious by at least one algorithm.

ChenSuppTables

Supplementary Table 1: Comparison of genotype calls for low-frequency SNVs from deep-coverage exome sequence data.

Note. Genotype calls from joint (horizontal axis) and single-study (vertical axis) calling strategies for 2,127 low-frequency (MAF 0.5–5%) SNVs from chromosome 2 in deep-coverage (~82X) exome sequence data. Concordant calls between the two strategies are highlighted in bold.

Supplementary Table 2: Comparison of genotype calls for low-frequency SNVs from low-coverage genome sequence data (coding regions).

Note. Genotype calls from joint (horizontal axis) and single-study (vertical axis) calling strategies for 2,127 low-frequency (MAF 0.5–5%) SNVs from chromosome 2 in low-coverage (~5X) genome sequence data. Concordant calls between the two strategies are highlighted in bold.

Supplementary Table 3: Comparison of genotype calls for common SNVs from deep-coverage exome sequence data.

Note. Genotype calls from joint (horizontal axis) and single-study (vertical axis) calling strategies for 2,027 common (MAF >5%) SNVs from chromosome 2 in deep-coverage (~82X) exome sequence data. Concordant calls between the two strategies are highlighted in bold.

Supplementary Table 4: Comparison of genotype calls for common SNVs from low-coverage genome sequence data (coding regions).

Note. Genotype calls from joint (horizontal axis) and single-study (vertical axis) calling strategies for 2,027 common (MAF >5%) SNVs from chromosome 2 in low-coverage (~5X) genome sequence data. Concordant calls between the two strategies are highlighted in bold.

Supplementary Table 5. Comparison of genotype calls for rare SNVs from deep-coverage exome sequence data with MAC < 2.

Note. Genotype calls from joint (horizontal axis) and single-study (vertical axis) calling strategies for 5,099 rare (MAF <0.5%) SNVs from chromosome 2 with MAC < 2 in deep-coverage (~82X) exome sequence data. MAC: minor allele count.

Supplementary Table 6. Comparison of genotype calls for rare SNVs from deep-coverage exome sequence data with 2 ≤ MAC < 5.

Note. Genotype calls from joint (horizontal axis) and single-study (vertical axis) calling strategies for 2,371 rare (MAF <0.5%) SNVs from chromosome 2 with 2 ≤ MAC < 5 in deep-coverage (~82X) exome sequence data. MAC: minor allele count.

Supplementary Table 7. Comparison of genotype calls for rare SNVs from deep-coverage exome sequence data with 5 ≤ MAC ≤ 10.

Note. Genotype calls from joint (horizontal axis) and single-study (vertical axis) calling strategies for 1,147 rare (MAF <0.5%) SNVs from chromosome 2 with 5 ≤ MAC ≤ 10 in deep-coverage (~82X) exome sequence data. MAC: minor allele count.

Supplementary Table 8. Comparison of genotype calls for rare SNVs from deep-coverage exome sequence data with MAC > 10.

Note. Genotype calls from joint (horizontal axis) and single-study (vertical axis) calling strategies for 479 rare (MAF <0.5%) SNVs from chromosome 2 with MAC > 10 in deep-coverage (~82X) exome sequence data. MAC: minor allele count.

Supplementary Table 9. Comparison of genotype calls for rare SNVs from low-coverage genome sequence data with MAC < 2.

Note. Genotype calls from joint (horizontal axis) and single-study (vertical axis) calling strategies for 5,194 rare (MAF <0.5%) SNVs from chromosome 2 with MAC < 2 in low-coverage (~5X) genome sequence data. MAC: minor allele count.

Supplementary Table 10. Comparison of genotype calls for rare SNVs from low-coverage genome sequence data with 2 ≤ MAC < 5.

Note. Genotype calls from joint (horizontal axis) and single-study (vertical axis) calling strategies for 2,374 rare (MAF <0.5%) SNVs from chromosome 2 with 2 ≤ MAC < 5 in low-coverage (~5X) genome sequence data. MAC: minor allele count.

Supplementary Table 11. Comparison of genotype calls for rare SNVs from low-coverage genome sequence data with 5 ≤ MAC ≤ 10.

Note. Genotype calls from joint (horizontal axis) and single-study (vertical axis) calling strategies for 1,084 rare (MAF <0.5%) SNVs from chromosome 2 with 5 ≤ MAC ≤ 10 in low-coverage (~5X) genome sequence data. MAC: minor allele count.

Supplementary Table 12. Comparison of genotype calls for rare SNVs from low-coverage genome sequence data with MAC > 10.

Note. Genotype calls from joint (horizontal axis) and single-study (vertical axis) calling strategies for 444 rare (MAF <0.5%) SNVs from chromosome 2 with MAC > 10 in low-coverage (~5X) genome sequence data. MAC: minor allele count.

Figure 1:

Workflow for variant calling and association analysis. Sequencing and alignment procedures are described in Fuchsberger et al., 2016. Haplotype-based refinement was only applied to low-coverage whole genome sequence data.

Acknowledgements

We thank Laura Scott for helpful discussions, Hyun Min Kang and Yun Li for technical help with GotCloud, and the GoT2D consortium for allowing us to use their sequence data. This research was supported by the National Institutes of Health grants HG000376 and HG009976.

Grant numbers: HG000376 (MB), HG009976 (MB)

Footnotes

Conflict of interest

The authors declare that there is no conflict of interest.

Data availability

The data that support the findings of this study are available on request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions.

References

  1. 1000 Genomes Project Consortium. (2015). A global reference for human genetic variation. Nature, 526(7571), 68. doi: 10.1038/nature15393
  2. Auer PL, Reiner AP, Wang G, Kang HM, Abecasis GR, Altshuler D, … & Leal SM (2016). Guidelines for large-scale sequence-based complex trait association studies: lessons learned from the NHLBI exome sequencing project. The American Journal of Human Genetics, 99(4), 791–801. doi: 10.1016/j.ajhg.2016.08.012
  3. Browning BL, & Yu Z (2009). Simultaneous genotype calling and haplotype phasing improves genotype accuracy and reduces false-positive associations for genome-wide association studies. The American Journal of Human Genetics, 85(6), 847–861. doi: 10.1016/j.ajhg.2009.11.004
  4. DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, … & McKenna A (2011). A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature genetics, 43(5), 491. doi: 10.1038/ng.806
  5. Fuchsberger C, Flannick J, Teslovich TM, Mahajan A, Agarwala V, Gaulton KJ, … & Rivas MA (2016). The genetic architecture of type 2 diabetes. Nature, 536(7614), 41. doi: 10.1038/nature18642
  6. Hindorff LA, MacArthur J, Wise A, Junkins HA, Hall PN, Klemm AK, & Manolio TA (2012). A catalog of published genome-wide association studies. NHGRI. Available at: www.ebi.ac.uk/gwas/diagram.
  7. Jiang W, Chen SY, Wang H, Li DZ, & Wiens JJ (2014). Should genes with missing data be excluded from phylogenetic analyses?. Molecular Phylogenetics and Evolution, 80, 308–318. doi: 10.1016/j.ympev.2014.08.006
  8. Jun G, Wing MK, Abecasis GR, & Kang HM (2015). An efficient and scalable analysis framework for variant extraction and refinement from population scale DNA sequence data. Genome research. doi: 10.1101/gr.176552.114
  9. Lee S, Emond MJ, Bamshad MJ, Barnes KC, Rieder MJ, Nickerson DA, … & NHLBI GO Exome Sequencing Project. (2012). Optimal unified approach for rare-variant association testing with application to small-sample case-control whole-exome sequencing studies. The American Journal of Human Genetics, 91(2), 224–237. doi: 10.1016/j.ajhg.2012.06.007
  10. Lee S, Teslovich TM, Boehnke M, & Lin X (2013). General framework for meta-analysis of rare variants in sequencing association studies. The American Journal of Human Genetics, 93(1), 42–53. doi: 10.1016/j.ajhg.2013.05.010
  11. Lee S, Abecasis GR, Boehnke M, & Lin X (2014). Rare-variant association analysis: study designs and statistical tests. The American Journal of Human Genetics, 95(1), 5–23. doi: 10.1016/j.ajhg.2014.06.009
  12. Lek M, Karczewski KJ, Minikel EV, Samocha KE, Banks E, Fennell T, … & Tukiainen T (2016). Analysis of protein-coding genetic variation in 60,706 humans. Nature, 536(7616), 285. doi: 10.1038/nature19057
  13. Li H, & Durbin R (2009). Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25(14), 1754–1760. doi: 10.1093/bioinformatics/btp324
  14. Li Y, Sidore C, Kang HM, Boehnke M, & Abecasis GR (2011). Low-coverage sequencing: implications for design of complex trait association studies. Genome research. doi: 10.1101/gr.117259.110
  15. Lin DY, & Zeng D (2010). Meta-analysis of genome-wide association studies: no efficiency gain in using individual participant data. Genetic Epidemiology, 34(1), 60–66. doi: 10.1002/gepi.20435
  16. Ma C, Blackwell T, Boehnke M, Scott LJ, & GoT2D Investigators. (2013). Recommended joint and meta-analysis strategies for case-control association testing of single low-count variants. Genetic epidemiology, 37(6), 539–550. doi: 10.1002/gepi.21742
  17. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, … & DePristo MA (2010). The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome research. doi: 10.1101/gr.107524.110
  18. Okada Y, Momozawa Y, Sakaue S, Kanai M, Ishigaki K, Akiyama M, … & Suematsu M (2018). Deep whole-genome sequencing reveals recent selection signatures linked to evolution and disease risk of Japanese. Nature communications, 9(1), 1631. doi: 10.1038/s41467-018-03274-0
  19. Paltoo DN, Rodriguez LL, Feolo M, Gillanders E, Ramos EM, Rutter JL, … & Caulder M (2014). Data use under the NIH GWAS data sharing policy and future directions. Nature genetics, 46(9), 934. doi: 10.1038/ng.3062
  20. Tachmazidou I, Süveges D, Min JL, Ritchie GR, Steinberg J, Walter K, … & McCarthy S (2017). Whole-genome sequencing coupled to imputation discovers genetic signals for anthropometric traits. The American Journal of Human Genetics, 100(6), 865–884. doi: 10.1016/j.ajhg.2017.04.014
  21. Tang ZZ, & Lin DY (2015). Meta-analysis for discovering rare-variant associations: statistical methods and software programs. The American Journal of Human Genetics, 97(1), 35–53. doi: 10.1016/j.ajhg.2015.05.001
  22. Van der Auwera GA, Carneiro MO, Hartl C, Poplin R, Del Angel G, Levy-Moonshine A, … & Banks E (2013). From FastQ data to high-confidence variant calls: the genome analysis toolkit best practices pipeline. Current protocols in bioinformatics, 43(1), 11–10. doi: 10.1002/0471250953.bi1110s43
  23. Xu C, Wu K, Zhang JG, Shen H, & Deng HW (2017). Low-, high-coverage, and two-stage DNA sequencing in the design of the genetic association study. Genetic epidemiology, 41(3), 187–197. doi: 10.1002/gepi.22015
  24. Zuk O, Schaffner SF, Samocha K, Do R, Hechter E, Kathiresan S, … & Lander ES (2014). Searching for missing heritability: designing rare variant association studies. Proceedings of the National Academy of Sciences, 111(4), E455–E464. doi: 10.1073/pnas.1322563111
