Skip to main content
G3: Genes | Genomes | Genetics logoLink to G3: Genes | Genomes | Genetics
. 2017 Jan 19;7(5):1393–1404. doi: 10.1534/g3.117.039008

Genotype Calling from Population-Genomic Sequencing Data

Takahiro Maruki 1,1, Michael Lynch 1
PMCID: PMC5427492  PMID: 28108551

Abstract

Genotype calling plays important roles in population-genomic studies, which have been greatly accelerated by sequencing technologies. To take full advantage of the resultant information, we have developed maximum-likelihood (ML) methods for calling genotypes from high-throughput sequencing data. As the statistical uncertainties associated with sequencing data depend on depths of coverage, we have developed two types of genotype callers. One approach is appropriate for low-coverage sequencing data, and incorporates population-level information on genotype frequencies and error rates pre-estimated by an ML method. Performance evaluation using computer simulations and human data shows that the proposed framework yields less biased estimates of allele frequencies and more accurate genotype calls than current widely used methods. Another type of genotype caller applies to high-coverage sequencing data, requires no prior genotype-frequency estimates, and makes no assumption on the number of alleles at a polymorphic site. Using computer simulations, we determine the depth of coverage necessary to accurately characterize polymorphisms using this second method. We applied the proposed method to high-coverage (mean 18×) sequencing data of 83 clones from a population of Daphnia pulex. The results show that the proposed method enables conservative and reasonably powerful detection of polymorphisms with arbitrary numbers of alleles. We have extended the proposed method to the analysis of genomic data for polyploid organisms, showing that calling accurate polyploid genotypes requires much higher coverage than diploid genotypes.

Keywords: genotype call, polymorphism, population genomics


When we carry out population-genomic analyses, identifying individual genotypes is often necessary. For example, in order to identify putative loci associated with a phenotype in genome-wide association studies, calling genotypes is necessary to find which allele each individual carries at each locus. Moreover, calling individual genotypes is a first step in some population-genetic analyses. For example, many statistical methods for haplotype phasing (e.g., Scheet and Stephens 2006; Browning and Browning 2007; Li et al. 2010) start from genotype calls at each SNP site. In addition, when accurate genotypes are called at each SNP site, traditional statistical methods, including the four-gamete test (Hudson and Kaplan 1985) and composite disequilibrium measures (Cockerham and Weir 1977; Weir 1996), can be used to examine the pattern of linkage disequilibrium.

Despite the advantages, some difficulties are associated with high-throughput sequencing technologies. One of the main difficulties is the high sequencing error rates, which typically range from 0.001 to 0.01 per read per site with commonly used sequencing platforms (Glenn 2011; Quail et al. 2012). Second, because sequencing occurs randomly among sites, individuals, and chromosomes in diploid organisms, depths of coverage are variable at all levels. As a result, when depths of coverage are low, there are often missing data, which introduces biases in subsequent population-genetic analyses unless they are statistically accounted for (Pool et al. 2010).

To call genotypes from high-throughput sequencing data, many statistical methods have been recently developed (e.g., Li et al. 2008, 2009b; Hohenlohe et al. 2010; Martin et al. 2010; McKenna et al. 2010; Catchen et al. 2011, 2013; DePristo et al. 2011; Li 2011; Nielsen et al. 2012; Vieira et al. 2013). The performance of the widely used genotype callers in population-genomic analyses is not well understood, especially when the population deviates from the Hardy–Weinberg equilibrium (HWE). Recent studies (Kim et al. 2011; Han et al. 2014) found that allele frequencies estimated directly from the sequence reads are unbiased, whereas those estimated via genotype calling are biased when depths of coverage are low. These studies assumed a population in HWE. Vieira et al. (2013) recently showed that the performance of genotype calling can be improved by first estimating inbreeding coefficients from the sequencing data, and then calling genotypes incorporating the information on estimated inbreeding coefficients. Unfortunately, their method is applicable only when the inbreeding coefficients are non-negative (Maruki and Lynch 2015), and does not always take full advantage of the population-level information. Negative inbreeding coefficients are common in some organisms, including asexual aphids (Delmotte et al. 2002), Daphnia in permanent ponds/lakes (Hebert 1978), fruit bats (Storz et al. 2001), partially inbreeding plant species (Brown 1979), Laysan finches (Tarr et al. 1998), prairie dogs (Foltz and Hoogland 1983), rhesus monkeys (Melnick 1987), and water voles (Aars et al. 2006). Furthermore, some regions under balancing selection may contain excess heterozygotes and therefore show negative inbreeding coefficients (Black and Salzano 1981; Markow et al. 1993; Hedrick 1998; Black et al. 2001; Ferreira and Amos 2006; Tollenaere et al. 2008).

In this study, we develop a maximum-likelihood (ML) method for calling genotypes from high-throughput sequencing data that incorporates the prior information from a genotype-frequency estimator (GFE) (Maruki and Lynch 2015). We examine the performance of the proposed method using computer simulations under different genetic conditions, including those where HWE is violated, and compare the performance with that of other widely used methods. The results show that our method yields more accurate genotype calls with low or moderately high depths of coverage than the current widely used methods, which is supported by analysis of human data. In addition, we develop another ML method for calling genotypes from high-coverage sequencing data, which relaxes the assumption of biallelic polymorphisms made in many existing methods.

We also examine the necessary depth of coverage for identifying triallelic sites and accurate genotype calling with the proposed method, using computer simulations. Taking the results of the performance evaluation into account, we apply the proposed method to high-throughput sequencing data of 83 clones from a population of the microcrustacean Daphnia pulex, which have reasonably high depths of coverage (mean 18× per site per individual) (Lynch et al. 2016). The results show that the proposed method enables accurate and rapid identification of polymorphic sites with arbitrary numbers of alleles. Furthermore, we show that the proposed method can be applied to analyses of polyploid data.

Methods

We develop ML methods for calling individual genotypes from high-throughput sequencing data. We develop two types of methods for calling genotypes, one for low-coverage sequencing data and the other for high-coverage sequencing data, because the degree of uncertainty associated with sequencing data depends on depths of coverage. In both types of methods, we statistically test the significance of polymorphisms. Also, we do not call the genotype of an individual when more than one genotype has equivalent likelihood values for the observed data.

Genotype calling from low-coverage sequencing data

When depths of coverage are low, only one of the two parental chromosomes might be sampled, and sequence errors can resemble true variants. In such cases, the statistical uncertainties of individual sequence data can be high, thus calling accurate genotypes is not easy. However, when sequence data for multiple individuals from a population are available, the accuracy of the called genotypes can be improved by incorporating the population-level information on genotype frequencies and error rates into the genotype-calling process using Bayes’ theorem [see Martin et al. (2010); Nielsen et al. (2012); Vieira et al. (2013) for similar methods using the expectation-maximization algorithm for estimating priors].

Here, the genotype of an individual at a single site is called from the nucleotide read quartet (counts of A, C, G, and T) of high-throughput sequencing data at the site by an ML method. This is achieved by maximizing the likelihood of the observed data as a function of the genotype of the individual g. The two most abundant nucleotide reads in the population sample are considered to be candidates for alleles at the site.

Given the genotype of the individual, g = 1 (major homozygote, MM), 2 (heterozygote, Mm), or 3 (minor homozygote, mm), and sequencing error rate per read per site ε, the log-likelihood of the observed site-specific read quartet consisting of the observed counts of the most abundant (major) nucleotide read M (e.g., C) (nM), second most abundant (minor) nucleotide read m (e.g., T) (nm), and other nucleotide reads (e.g., in this case A and G) (ne1 and ne2) in the population sample, lnL(nM,nm,ne1,ne2|g,ϵ), is given by the following multinomial distribution formula:

lnL(nM,nm,ne1,ne2|g,ϵ)=ln[(n!)/(nM!nm!ne1!ne2!)Pg(M)nM×Pg(m)nmPg(e1)ne1Pg(e2)ne2], (1)

where n=nM+nm+ne1+ne2 (the depth of coverage). Pg(M) is a probability of observed nucleotide read M with genotype g. It is a function of ε and is given by summing conditional probabilities of the observed nucleotide read, given the true nucleotide on the sequenced chromosome chosen from the pair (Table 1). The other P terms are similarly defined. For example, the probability of nucleotide read M with genotype 2 (Mm), P2(M), is (1/2)(1ϵ)+(1/2)(ϵ/3), assuming the error occurs at an equal rate from the true nucleotide to one of the other three nucleotides. Because the multinomial coefficient in Equation 1 is constant regardless of the parameter values, for computational efficiency, it is ignored as follows:

lnL(nM,nm,ne1,ne2|g,ϵ)=ln[Pg(M)nMPg(m)nmPg(e1)ne1Pg(e2)ne2]. (2)

When the genotype frequencies and error rate at the site are estimated from the population sample of nucleotide reads, they can be incorporated into Equation 2 as Bayes’ priors. We previously developed an ML method to estimate the site-specific genotype frequencies and error rate from the nucleotide read quartets (Maruki and Lynch 2015). This method yields essentially unbiased genotype-frequency estimates even with moderate depths of coverage, which we use as the Bayes’ priors necessary for improving the accuracy of genotype calls. Specifically, given the estimates of the genotype frequencies γ^1 (frequency of major homozygotes), γ^2 (frequency of heterozygotes), and γ^3 (frequency of minor homozygotes), and error rate (ϵ^) predetermined by the ML method of Maruki and Lynch (2015), the log-likelihood of the observed data are given as follows:

lnL(nM,nm,ne1,ne2|g,γ^1,γ^2,γ^3,ϵ^)=ln{[γ^gPg(M)nMPg(m)nmPg(e1)ne1Pg(e2)ne2]/j=13[γ^jPj(M)nMPj(m)nmPj(e1)ne1Pj(e2)ne2]}, (3)

where the ML error-rate estimate ϵ^ is substituted into the P terms. The inferred (called) genotype of the individual is the genotype that maximizes Equation 3. To avoid calling genotypes at false polymorphic sites, we only call genotypes at significantly polymorphic sites, which are identified beforehand using the likelihood-ratio test by the population-level genotype-frequency estimator (GFE) (Maruki and Lynch 2015).

Table 1. Probability of an observed nucleotide read as a function of the individual genotype g and error rate ε.

Genotype Nucleotide Read
M m e1 e2
1 (MM) 1ϵ ϵ/3 ϵ/3 ϵ/3
2 (Mm) [(1/2)(1ϵ)]+[(1/2)(ϵ/3)] [(1/2)(ϵ/3)]+[(1/2)(1ϵ)] ϵ/3 ϵ/3
3 (mm) ϵ/3 1ϵ ϵ/3 ϵ/3

M and m denote candidate alleles (the two most abundant nucleotide reads in the population sample, e.g., C and T) and e1 and e2 denote other nucleotide reads (e.g., in this case, A and G).

Genotype calling from high-coverage sequencing data

When depths of coverage are high, both parental chromosomes are sequenced with high probability, and nucleotide reads derived from true variants are reliably abundant compared with those due to sequence errors. In such cases, if the confounding effects of binomial sampling of parental chromosomes and sequence errors are statistically accounted for, accurate and rapid genotype calls may be made from sequence data on each individual separately, relaxing the assumption of biallelic polymorphisms made in the method for low-coverage sequencing data and eliminating the need for prior population-level estimates of genotype frequencies.

Given the nucleotide read quartet of an individual at a site, we call the genotype of the individual by finding the genotype that maximizes the likelihood of the observed data. For example, if the nucleotide read quartet of an individual contains only A, C, and G, we examine the likelihoods of six candidate genotypes (AA, AC, AG, CC, CG, and GG). Suppose, for example, that the nucleotide read quartet of an individual contains nonzero counts for all four nucleotides. Letting nA, nC, nG, and nT denote the counts of A, C, G, and T, respectively, the log-likelihood for genotype AA is

lnL(nA,nC,nG,nT|AA,ϵ)=ln[(1ϵ)nA(ϵ/3)nnA], (4)

where n is the sum of the nucleotide read counts (depth of coverage) and ε is the sequence error rate per read per site. This equation is a multinomial distribution formula, where the constant multinomial coefficient is ignored for computational efficiency, as in Equation 2. The subsequent likelihood functions are derived and shown in a similar way. By taking the derivative of Equation 4 with respect to ε and equating it to zero, ε is analytically estimated as

ϵ^=(nnA)/n, (5)

and this estimate is substituted into Equation 4 to find the likelihood for genotype AA. As another example, the log-likelihood for genotype AC is

lnL(nA,nC,nG,nT|AC,ϵ)=ln[(1/2ϵ/3)nA+nC(ϵ/3)nG+nT], (6)

where ε is similarly estimated as

ϵ^=(3/2)[(nG+nT)/n], (7)

and again, this estimate is substituted into Equation 6 to find the likelihood for genotype AC. The likelihoods for the other eight candidate genotypes are similarly calculated.

Statistical tests of called genotypes in the high-coverage genotype caller

Because of high sequencing error rates, genotypes called by the high-coverage genotype caller (HGC) might sometimes falsely suggest polymorphisms. To minimize analyzing false polymorphisms and misidentifying sites containing three or four alleles, we examine statistical significance of called genotypes by likelihood-ratio tests (Kendall and Stuart 1979). Specifically, we examine the statistical significance of called genotypes with respect to the genotype homozygous for the most abundant nucleotide in the population sample M (MM).

Under the null hypothesis of monomorphism, the site is fixed for M. Under the alternative hypothesis of polymorphism, at least one individual has a genotype different from MM. Therefore, we reject the null hypothesis of population monomorphism if the likelihood of at least one non-MM called genotype is significantly greater than that of the MM genotype. Specifically, letting LL0 and LL1 denote the log-likelihoods of the observed data of an individual for the MM genotype and that for a non-MM called genotype, the likelihood-ratio test statistic (LRT) for the individual is

LRT=2(LL1LL0). (8)

This test statistic is expected to be asymptotically χ2-distributed with one degree of freedom. We reject the null hypothesis of population monomorphism when LRT is significant for at least one of the non-MM called genotypes at a user-specified level.

We estimate the number of alleles by examining the nucleotides contained in significant genotypes. To minimize misidentification of sites containing three or four alleles, when the significant genotype is heterozygous, we compare the likelihood of the called genotype to that of the genotype homozygous for the more abundant nucleotide in the individual, and consider the genotype heterozygous only if the former is significantly greater than the latter at the specified level by the likelihood-ratio test with one degree of freedom. Otherwise, the genotype is considered homozygous for estimating the number of alleles.

Genotype calling from triploid sequencing data

The HGC explained above can be extended to triploid data. Specifically, assuming that sequencing occurs randomly among the three chromosomes, and equal error rates occur from the true nucleotide to one of the other three, the probability of an observed nucleotide read as a function of the genotype of an individual is found in a way analogous to that for diploid data (Supplemental Material, Table S1). Then, the likelihood of the observed nucleotide read quartet of an individual can be formulated. As with diploid data, we choose the genotype that maximizes the likelihood to call the genotype of the individual. The proposed method can be applied to low-coverage sequencing data, although high coverage is needed to call accurate genotypes.

As an example, suppose that the nucleotide read quartet of an individual contains nonzero counts for all four nucleotides. Letting nA, nC, nG, and nT denote the counts of A, C, G, and T, respectively, the log-likelihood for genotype AAA is

lnL(nA,nC,nG,nT|AAA,ϵ)=ln[(1ϵ)nA(ϵ/3)nnA], (9)

where n is the sum of the nucleotide read counts (depth of coverage), and ε is the sequence error rate per read per site. By taking the derivative of Equation 9 with respect to ε and equating it to zero, ε is analytically estimated as

ϵ^=(nnA)/n, (10)

and this estimate is substituted into Equation 9 to find the likelihood for genotype AAA. As another example, the log-likelihood for genotype ACC is

lnL(nA,nC,nG,nT|ACC,ϵ)=ln{[(1/3)(ϵ/9)]nA[(2/3)(5/9)ϵ)]nC(ϵ/3)nG+nT}, (11)

where ε is estimated as

ϵ^=[6n+9nC+15nG+15nT(6n+9nC+15nG+15nT)2360n(nG+nT)]/10n, (12)

and again, this estimate is substituted into Equation 11 to find the likelihood for genotype ACC. As a final example, the log-likelihood for genotype ACG is

lnL(nA,nC,nG,nT|ACG,ϵ)=ln[(1/3ϵ/9)nA+nC+nG(ϵ/3)nT], (13)

where ε is estimated as

ϵ^=3[(nT)/n], (14)

and again, this estimate is substituted into Equation 13 to find the likelihood for genotype ACG. To avoid analyzing false polymorphisms due to sequencing errors, the statistical significance of called genotypes is examined by likelihood-ratio tests analogous to those with diploid data, and the number of alleles is based on significant genotypes.

Genotype calling from tetraploid sequencing data

Because tetraploid species are common in plants and some animals, we also formulate likelihood functions for a tetraploid genotype caller. Assuming that sequencing occurs randomly among the four chromosomes, and equal error rates occur from the true nucleotide to one of the other three, the probability of an observed nucleotide read as a function of the genotype of an individual is similarly found (Table S2). As with triploid data, we find the genotype that maximizes the likelihood of the observed nucleotide read quartet of an individual to call the genotype of the individual.

As an example, suppose that the nucleotide read quartet of an individual contains nonzero counts for all four nucleotides. Letting nA, nC, nG, and nT denote the counts of A, C, G, and T, respectively, the log-likelihood for genotype AAAA is

lnL(nA,nC,nG,nT|AAAA,ϵ)=ln[(1ϵ)nA(ϵ/3)nnA], (15)

where n is the sum of the nucleotide read counts (depth of coverage) and ε is the sequence error rate per read per site. By taking the derivative of Equation 15 with respect to ε and equating it to zero, ε is analytically estimated as

ϵ^=[(nnA)/n], (16)

and this estimate is substituted into Equation 15 to find the likelihood for genotype AAAA. As another example, the log-likelihood for genotype ACCC is

lnL(nA,nC,nG,nT|ACCC,ϵ)=ln{(1/4)nA[(3/4)(2/3)ϵ)]nC(ϵ/3)nG+nT}, (17)

where ε is estimated as

ϵ^=(9/8)[(nG+nT)/(nC+nG+nT)], (18)

and again, this estimate is substituted into Equation 17 to find the likelihood for genotype ACCC. As another example, the log-likelihood for genotype AACC is

lnL(nA,nC,nG,nT|AACC,ϵ)=ln{[(1/2)(ϵ/3)]nA+nC(ϵ/3)nG+nT}, (19)

where ε is estimated as

ϵ^=(3/2)[(nG+nT)/n], (20)

and again, this estimate is substituted into Equation 19 to find the likelihood for genotype AACC. As a final example, the log-likelihood for genotype AACG is

lnL(nA,nC,nG,nT|AACG,ϵ)=ln{[(1/2)(ϵ/3)]nA(1/4)nC+nG(ϵ/3)nT}, (21)

where ε is estimated as

ϵ^=(3/2)[nT/(nA+nT)], (22)

and again, this estimate is substituted into Equation 21 to find the likelihood for genotype AACG. Again, we report the number of alleles at a site based on significant genotypes to avoid analyzing false polymorphisms due to sequencing errors.

Generation of diploid nucleotide read data at biallelic sites using computer simulations

To examine the performance of the genotype callers at biallelic sites, we generated nucleotide read data for N diploid individuals by computer simulations and called individual genotypes from the simulated data. In the simulations, the probability of sampling an individual with a particular genotype was equal to its relative frequency in the population. The population frequencies of major and minor homozygotes were specified by γ1 and γ3, respectively. To compare the called genotypes with the true genotypes, we recorded the true genotype of each individual. The depths of coverage were assumed to be Poisson-distributed with mean μ among the individuals, and were specified as

c(X,μ)=[(μ)Xeμ]/X!, (23)

where X is a particular value of the coverage for an individual and c is a probability mass function of X. The nucleotide reads from each individual were randomly chosen from its genotype allowing for errors. Sequence errors were randomly introduced at rate ε per read per site from the true nucleotide to one of the other three nucleotides.

Comparison of the Bayesian genotype caller with other widely used methods

To compare the performance of our Bayesian genotype caller (BGC) with that of other widely used methods for calling genotypes, we called individual genotypes using GATK (version 3.4-0) (McKenna et al. 2010; DePristo et al. 2011; Van der Auwera et al. 2013) and Samtools (version 0.1.19) (Li 2011) from the same simulated nucleotide read data generated by the method described above, and compared the accuracy of the genotypes called by different methods. Although both GATK and Samtools use Bayesian genotype-frequency priors, GATK uses the same priors for all sites and Samtools assumes HWE, and therefore their approaches differ from ours. In addition to these genotype-calling methods, we also compared our performance with that of the corresponding genotype-calling function in ANGSD (version 0.911) (Korneliussen et al. 2014), which is also designed especially for low-coverage sequencing data. To make the generated data applicable to all methods, including ours, we made BAM files of sequence read data mapped to a simulated reference sequence.

First, we generated a simulated reference sequence consisting of random nucleotides. Each of a total of 10,000 biallelic sites was surrounded by 50 fixed nucleotides on both sides, so that sequence reads of 101 bp are uniquely mapped to the reference sequence. We outputted the 101-bp sequence reads in the FASTA format with nucleotide reads at biallelic sites characterized in the manner described above. The sequence reads were mapped to the reference sequence using Novoalign (version 3.00.02) (www.novocraft.com). We converted SAM files of mapped reads to BAM files using Samtools (Li et al. 2009a).

To call genotypes with GATK, we made a dictionary file of the reference sequence and added read groups to the BAM files using Picard (version 1.134) (http://broadinstitute.github.io/picard). Then, we sorted and indexed the BAM files using Samtools. We generated individual GVCF files, which contain information of called genotypes, from the BAM files using HaplotypeCaller, which individually calls genotypes, assigning an arbitrarily chosen uniform base quality score (30) to every base. Changing the base quality score, however, from 30 to 20, did not noticeably influence the results (results not shown). Then, we called genotypes from the GVCF files using Genotype_GVCFs, which refines the called genotypes using the population-level information. We called genotypes with Samtools from the BAM files using mpileup and bcftools to use the population-level information. To call genotypes with ANGSD, we first converted the BAM files to SAM files using Samtools. Then, we assigned the same uniform base quality score of 30 to every base, using a custom Perl script. The resulting SAM files with base quality scores were converted back to BAM files as input for ANGSD, using Samtools. We used a Bayesian genotype-calling method corresponding to ours in ANGSD, setting the value of the -doPost option at one, using Samtools genotype likelihoods, and setting the statistical significance for calling SNPs at the 5% level. Unlike other methods, our method does not rely just on read quality, and estimates error rates from sequence data themselves. This unique feature of our method is important because errors can result from other factors including those introduced during sample preparation.

The ML estimates of genotype frequencies and error rates, necessary for calling genotypes with BGC, were obtained using GFE (available from https://github.com/Takahiro-Maruki/Package-GFE; an updated version with a function to prepare the input file of the BGC and its documentation are available as File S1 and File S2) (Maruki and Lynch 2015). The individual pro files of nucleotide read quartets were generated using sam2pro (version 0.6) (http://guanine.evolbio.mpg.de/mlRho/sam2pro_0.6.tgz) from mpileup files generated from the BAM files by Samtools.

In addition to examining the performance of genotype calling, we compared the allele-frequency estimates by GFE to those by other methods. The allele-frequency estimates via genotype calling by GATK and Samtools were calculated from the VCF files using VCFtools (version 0.1.11) (Danecek et al. 2011). The allele-frequency estimates by ANGSD were estimated directly from sequence reads by the method of Kim et al. (2011).

To examine how different genotype-calling methods perform when applied to real sequencing data, we applied them to low-coverage (mean 7.6× per site per individual) sequencing data of the phase 3 1000 Genomes project (1000 Genomes Project Consortium 2015) on chromosome 11 in the CEU population. To assess the accuracy of genotype calls by different methods, we compared the called genotypes with the corresponding genotypes in phase II and III data of the International HapMap project (International HapMap Consortium 2007; International HapMap 3 Consortium 2010). Specifically, we downloaded the BAM files of the phase 3 1000 Genomes sequencing data in 99 CEU individuals on chromosome 11 from ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/. We also downloaded the FASTA file of the reference sequence (human_g1k_v37.fasta) from ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/. We downloaded the corresponding genotypes in the HapMap phase II and III data from ftp.ncbi.nlm.nih.gov/hapmap/genotypes/2010-08_phaseII+III/forward/. We found that 94 out of the 99 CEU individuals had both 1000 Genomes and HapMap data, and therefore used the 94 individuals with the necessary data in the subsequent analyses.

To prepare high-quality input data for calling genotypes, we marked duplicate reads and locally realigned sequences around indels in the BAM files using GATK (version 3.4-0) (McKenna et al. 2010; DePristo et al. 2011) and Picard (version 1.107) (http://broadinstitute.github.io/picard), following GATK best practices (Van der Auwera et al. 2013). In addition, we clipped overlapping read pairs in the BAM files using BamUtil (version 1.0.13) (http://genome.sph.umich.edu/wiki/BamUtil). We made mpileup files of the 94 individuals from the processed BAM files using Samtools. The pro files of nucleotide read quartets were made from the mpileup files using sam2pro (version 0.8) (http://guanine.evolbio.mpg.de/mlRho/sam2pro_0.8.tgz). The file of nucleotide read quartets of 94 individuals necessary for the proposed method was made from the pro files using GFE (Maruki and Lynch 2015).

To call genotypes with GATK, we first sorted and indexed the BAM files using Samtools. Next, we generated individual GVCF files from the BAM files using HaplotypeCaller. Then, we called genotypes from the GVCF files using Genotype_GVCFs. We called genotypes with Samtools from the BAM files using mpileup and bcftools using the population-level information. To call genotypes with ANGSD from the BAM files, we used a Bayesian genotype-calling method corresponding to ours in ANGSD setting the value of the -doPost option at one, using Samtools genotype likelihoods and setting the statistical significance for calling SNPs at the 5% level. In addition to these genotype-calling methods, we examined how the popular genotype-imputation method performs by imputing missing genotypes in GATK calls using Beagle (version 4.1) (Browning and Browning 2016). Because preexisting high-quality reference panels (genotypes/haplotypes) do not exist in the vast majority of the organisms, except a few model organisms such as humans and flies, we imputed the genotypes without a reference panel. Here, we imputed genotypes using genotype likelihoods instead of called genotypes to take the uncertainty of genotype calls into account.

We calculated the correct-call rate among individuals as a fraction of individuals with genotype calls identical to HapMap genotypes, assuming that HapMap genotypes are correct. To examine the accuracy of actually called genotypes, we also calculated the correct-call rate among called genotypes, where the fraction is calculated among called genotypes. We converted the NCBI build 36 coordinates in the HapMap data to the GRCh37 coordinates in the 1000 Genomes data using the UCSC liftOver (Speir et al. 2016) to make the coordinates in the two data sets consistent with each other. To minimize the confounding effect of mismapping, we excluded sites involved in putative repetitive regions identified by RepeatMasker (http://www.repeatmasker.org/) and those with population coverage (sum of the coverage over the individuals) less than half the mean or greater than one and a half means from the analyses. We downloaded the repeat-masked reference sequence from the Ensembl website (ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/) to identify sites involved in putative repetitive regions.

Performance evaluation of the HGC as a function of coverage

Because of the random sampling of parental chromosomes, only one of the two chromosomes might be sampled when the depth of coverage for an individual is low. Therefore, the ability of a genotype caller to correctly infer individual genotypes is mainly limited by its ability to call heterozygotes when the depth of coverage is low. In addition, when the depth of coverage is low, homozygotes can appear heterozygous due to sequencing errors. Therefore, examining the effect of coverage on the accuracy of called genotypes is important for finding the optimal sequencing strategy for genotype ascertainment.

To examine how much coverage is needed to accurately call genotypes using the HGC, we examined its rates for correctly calling homozygotes and heterozygotes as functions of the depth of coverage. Specifically, we generated nucleotide read data from an individual having a homozygous or heterozygous genotype with a fixed depth of coverage and error rate ε, and called the genotype of the individual with HGC. We repeated this process for 10,000 simulation replications and calculated the correct-call rate as a fraction of simulation replications where the genotype was correctly called by HGC.

Performance evaluation of genotype callers at triallelic sites using computer simulations

To examine the performance of the HGC for identifying sites with more than two alleles and calling genotypes in population samples, we generated BAM files of sequence read data in a way similar to that at biallelic sites, where nucleotide read data at each of a total of 10,000 triallelic sites for N diploid individuals were generated, according to their genotype frequencies, introducing errors at rate ε. Here, for simplicity, we specified the population genotype frequencies by assigning population allele frequencies p, q, and r to the most abundant, second most abundant, and the rarest allele, respectively, and assuming HWE, although our method makes no assumption on the mating system. As GATK and Samtools are both capable of calling genotypes at triallelic sites, we applied them to the BAM files using HaplotypeCaller and GenotypeGVCF, and mpileup and bcftools with the -m option, respectively, and compared their correct-call rates to that of HGC. In addition, we applied our BGC to the BAM files and examined the corresponding correct-call rate to find the effect of incorrectly assuming at most two alleles on the accuracy of genotype calls at triallelic sites.

Application of the HGC to empirical data

To examine the performance of the HGC with real data, we applied it to high-throughput Illumina sequencing data of 83 D. pulex clones from Kickapond (Lynch et al. 2016). We mapped the sequencing data to the PA42 reference genome (version 3.0) (Z. Ye, S. Xu, K. Spitze, J. Asselman, X. Jiang, M. E. Pfrender, and M. Lynch, unpublished results) using Novoalign (version 3.02.11) (www.novocraft.com) with the “-r None” option to prevent it from mapping a read if it matched more than one location. We converted the SAM files of mapped sequencing data to BAM files using Samtools (version 0.1.18) (Li et al. 2009a). Then, we marked duplicate reads and locally realigned sequences around indels using GATK (version 3.4-0) (McKenna et al. 2010; DePristo et al. 2011) and Picard (version 1.107) (http://broadinstitute.github.io/picard), following GATK best practices (Van der Auwera et al. 2013). In addition, we clipped overlapping read pairs using BamUtil (version 1.0.13) (http://genome.sph.umich.edu/wiki/BamUtil). We made mpileup files of the 83 clones from the processed BAM files using Samtools. The pro files of nucleotide read quartets were made from the mpileup files using sam2pro (version 0.8) (http://guanine.evolbio.mpg.de/mlRho/sam2pro_0.8.tgz). The input file of nucleotide read quartets of 83 clones was made from the pro files using GFE (Maruki and Lynch 2015). To avoid analyzing misassembled regions, we excluded regions considered to be misassembled by REAPR (version 1.134) (Hunt et al. 2013) from our analyses. We set the minimum coverage required to call a genotype of an individual at six. To minimize analyzing sites with data on a small number of individuals or those with mismapped reads, we only analyzed sites with the population coverage (sum of the coverage over the individuals) at least half the mean and at most 1.5× of the mean. Furthermore, we excluded sites involved in putative repetitive regions identified by RepeatMasker (version 4.0.5) (http://www.repeatmasker.org/) with the RepeatMasker library (Jurka et al. 2005) made on August 7, 2015. In addition, to further reduce analyzing sites with mismapped reads, we excluded sites with mean error-rate estimates among clones of >0.01.

Performance evaluation of the triploid genotype caller as a function of coverage

To examine the effect of coverage on the accuracy of genotypes called by the triploid genotype caller (TRI), we examined the correct-call rate as a function of the depth of coverage in a way similar to that for diploid data. Here, we examined the correct-call rate for three different types of genotypes (homozygotes, heterozygotes containing two different nucleotides, and heterozygotes containing three different nucleotides) among a total of 10,000 simulation replications.

Comparison of the TRI with GATK

To compare the performance of the TRI to that of other widely used methods, we applied our TRI and GATK (version 3.4-0) (McKenna et al. 2010; DePristo et al. 2011) to BAM files of simulated sequence read data generated in a way similar to that for diploid data, and compared the correct-call rates. Here, we generated fixed-coverage read data from an individual with a particular genotype 10,000 times. To call triploid genotypes with GATK, we set the ploidy of HaplotypeCaller at three.

Data availability

Source codes, written in C++, implementing the proposed methods and their documentation are available as supporting information (File S1, File S2, File S3, File S4, File S5, File S6, File S7, File S8, File S9, File S10, File S11, and File S12).

Results

The performance of the BGC when applied to low-coverage sequencing data at biallelic sites was evaluated with simulated data and compared with the corresponding performance of GATK, Samtools, and ANGSD. In addition, the allele-frequency estimates by the GFE were compared with those via genotype calling by GATK and Samtools and those by a method of Kim et al. (2011), as implemented in ANGSD. To examine the performance under the worst situation, the error rate was set at 0.01, which is typically the upper bound with commonly used sequencing platforms. We evaluated the performance under three different genetic conditions, where the inbreeding coefficient f was zero (HWE), minimized, or maximized given minor-allele frequency q. When f is minimized, the frequencies of major homozygotes and minor homozygotes, γ1 and γ3, are 12q and 0, respectively. When f is maximized, γ1 and γ3 are 1q and q, respectively.

When the depth of coverage is low (mean 3×), the allele-frequency estimates by GFE are unbiased, whereas those via genotype calling are biased under all examined conditions (Table 2). The allele-frequency estimates by ANGSD are similar to ours, although they are slightly biased when f is minimized or maximized. The allele-frequency estimates via genotype calling by GATK are more biased than those by Samtools when f is zero or minimized. On the other hand, when f is maximized, the allele-frequency estimates via genotype calling by Samtools are more biased than those by GATK. The correct-call rate among individuals by the proposed method (BGC) is higher than that by GATK, and lower than that by Samtools and ANGSD when f is zero or minimized. When f is maximized, the correct-call rate by BGC is the highest. The highest correct-call rate among individuals by Samtools and ANGSD when f is zero or minimized is mainly because both methods always call individual genotypes regardless of the depth of coverage, even when there is no read. In fact, the correct-call rate among called genotypes by BGC is the highest under the majority of the examined conditions.

Table 2. Comparison of the allele-frequency estimates and called genotypes by different methods with low depths of coverage.

Method q f q^ (mean ± 2 SEM) Correct-Call Rate among Individuals (mean ± 2 SEM) Correct-Call Rate among Called Genotypes (mean ± 2 SEM)
Proposed 0.1 0 (HWE) 0.10 ± 0.00050 0.84 ± 0.00064 0.91 ± 0.00049
GATK 0.1 0 (HWE) 0.06 ± 0.00042 0.80 ± 0.00073 0.88 ± 0.00057
Samtools 0.1 0 (HWE) 0.08 ± 0.00039 0.90 ± 0.00045 0.90 ± 0.00045
ANGSD 0.1 0 (HWE) 0.10 ± 0.00050 0.90 ± 0.00045 0.90 ± 0.00045
Proposed 0.1 Minimized 0.10 ± 0.00050 0.87 ± 0.00068 0.95 ± 0.00049
GATK 0.1 Minimized 0.06 ± 0.00041 0.82 ± 0.00077 0.91 ± 0.00061
Samtools 0.1 Minimized 0.08 ± 0.00039 0.94 ± 0.00047 0.94 ± 0.00047
ANGSD 0.1 Minimized 0.11 ± 0.00050 0.94 ± 0.00046 0.94 ± 0.00046
Proposed 0.1 Maximized 0.10 ± 0.00063 0.95 ± 0.00046 1.00 ± 0.00015
GATK 0.1 Maximized 0.09 ± 0.00058 0.93 ± 0.00051 1.00 ± 0.00006
Samtools 0.1 Maximized 0.06 ± 0.00046 0.91 ± 0.00047 0.91 ± 0.00047
ANGSD 0.1 Maximized 0.08 ± 0.00057 0.91 ± 0.00047 0.91 ± 0.00047
Proposed 0.3 0 (HWE) 0.30 ± 0.00077 0.77 ± 0.00081 0.88 ± 0.00080
GATK 0.3 0 (HWE) 0.22 ± 0.00077 0.68 ± 0.00094 0.79 ± 0.00088
Samtools 0.3 0 (HWE) 0.25 ± 0.00096 0.84 ± 0.00074 0.84 ± 0.00074
ANGSD 0.3 0 (HWE) 0.30 ± 0.00076 0.84 ± 0.00074 0.84 ± 0.00074
Proposed 0.3 Minimized 0.30 ± 0.00063 0.74 ± 0.00087 0.87 ± 0.00075
GATK 0.3 Minimized 0.19 ± 0.00067 0.58 ± 0.00098 0.70 ± 0.00101
Samtools 0.3 Minimized 0.26 ± 0.00085 0.83 ± 0.00078 0.83 ± 0.00078
ANGSD 0.3 Minimized 0.31 ± 0.00064 0.83 ± 0.00078 0.83 ± 0.00078
Proposed 0.3 Maximized 0.30 ± 0.00094 0.94 ± 0.00047 1.00 ± 0.00017
GATK 0.3 Maximized 0.26 ± 0.00093 0.90 ± 0.00061 1.00 ± 0.00011
Samtools 0.3 Maximized 0.24 ± 0.00118 0.87 ± 0.00067 0.87 ± 0.00067
ANGSD 0.3 Maximized 0.28 ± 0.00102 0.87 ± 0.00067 0.87 ± 0.00067

q, q^, and f are the minor-allele frequency, its estimate, and inbreeding coefficient, respectively. q^ by the proposed method and ANGSD are directly estimated from sequence read data by the genotype-frequency estimator (Maruki and Lynch 2015) and Kim et al.’s (2011) method, respectively. Called genotypes by the proposed method are by the Bayesian genotype caller. The correct-call rate among individuals is a fraction of individuals with correctly called genotypes among N = 100 individuals, where missing genotype calls are considered incorrect. On the other hand, the correct-call rate among called genotypes is calculated only among individuals with called genotypes. Mean depth of coverage µ = 3, error rate ε = 0.01. Results are based on a total of 10,000 simulation replications for each parameter set. HWE, Hardy–Weinberg equilibrium.

When the depth of coverage is moderately high (mean 10×), the allele-frequency estimates by all methods are similar to each other and nearly unbiased under all examined conditions (Table S3). Consistent with this, the correct-call rates by all methods are high under all examined conditions. However, the correct-call rate by BGC is the highest under all examined conditions. We confirmed that we correctly generated the simulated data by examining the realized parameter values in the population samples (Table S4).

Our conclusions on the performance of the genotype-calling methods based on computer simulations are further supported by qualitatively similar results of the corresponding performance evaluation using human data (Table 3). Similar to the simulation-based results of the performance evaluation when the population is in HWE, the correct-call rate among individuals of BGC was lower than that by ANGSD and Samtools and higher than that by GATK. Again, this is mainly because the former two methods call individual genotypes even when there is no read, and the correct-call rate among called genotypes by BGC is the highest. The correct-call rate among individuals of the Beagle imputation of genotypes called by GATK was higher than that by GATK and lower than that of the other methods. Interestingly, the correct-call rate among called genotypes of the Beagle imputation of genotypes called by GATK was lower than that by GATK, indicating that genotype imputation does not necessarily improve the accuracy of genotype calls, at least when preexisting high-quality reference panels are not available. The observation here is consistent with a recent finding that population-genetic parameters estimated via genotype imputation from low-coverage sequencing data are biased (Fu 2014).

Table 3. Comparison of the performance of the genotype-calling methods with human data.

Method Correct-Call Rate among Individuals (mean ± 2 SEM) Correct-Call Rate among Called Genotypes (mean ± 2 SEM)
BGC 0.954 ± 0.0004 0.969 ± 0.0004
GATK 0.923 ± 0.0004 0.943 ± 0.0004
Samtools 0.966 ± 0.0004 0.966 ± 0.0004
ANGSD 0.967 ± 0.0004 0.967 ± 0.0004
GATK + Beagle 0.941 ± 0.0004 0.941 ± 0.0004

The correct-call rate among individuals is calculated among individuals with HapMap genotypes, where missing genotype calls are considered incorrect. On the other hand, the correct-call rate among called genotypes is calculated only among individuals with both HapMap genotypes and called genotypes, respectively.

To examine when the depth of coverage is high enough to accurately call individual genotypes using the HGC, we examined the correct-call rate as a function of coverage when the true genotype is homozygous or heterozygous. Overall, the rate for correctly calling homozygotes is high (Figure 1A). It decreases with increased coverage when the coverage is five or less, and approaches one when the coverage is over 5. The overall rate for correctly calling heterozygotes increases with increased coverage, although it somewhat decreases when the coverage increases from five to six (Figure 1B). This decrease is consistent with the sudden increase in the rate for correctly calling homozygotes when the coverage increases from five to six. These patterns are observed because when two nucleotides have nonzero read counts, one of which has just one read, in the quartet, both of them are considered to be from an allele without error when the coverage is five or less, and the one with a single read count is considered to be due to a sequence error when the coverage is over five by HGC, as the likelihood that the single read is due to a sequence error becomes greater than that without error. In addition to the valley of the correct-call rate at coverage equal to six, there is another small valley at coverage equal to 11. This is because when two nucleotides have nonzero read counts, one of which has just two reads, in the quartet, both of them are considered to be from an allele without error when the coverage is <11, and the one with double reads is considered to be due to sequence errors when the coverage is ≥11 by HGC, as the likelihood that such reads are due to sequencing errors becomes greater than that without error.

Figure 1.

Figure 1

Correct-call rate of the high-coverage genotype caller as a function of the depth of coverage. (A) Correct-call rate when the true genotype of the individual is homozygous. (B) Correct-call rate when the true genotype of the individual is heterozygous. Error rate ε = 0.01. Results are based on a total of 10,000 simulation replications for each parameter set.

To compare the performance of HGC in calling diploid genotypes at triallelic sites with that of other existing methods, we compared the correct-call rate of HGC to that of GATK and Samtools (Table 4). In addition, to examine the effect of incorrectly assuming biallelic polymorphisms on the accuracy of genotype calls at triallelic sites, we applied our BGC to the same data and examined its correct-call rate. The correct-call rate of HGC is higher than that by Samtools and slightly lower than that by GATK. Compared with genotype calls by Samtools, genotype calls by HGC and GATK quickly become more accurate with higher depths of coverage. The correct-call rate of BGC is the lowest. This is because BGC assumes at most two alleles, as many other methods including ANGSD do, and fails to correctly call genotypes containing the rarest allele. We confirmed that we correctly generated the simulated data by examining the realized parameter values in the population samples (Table S5).

Table 4. Performance of different genotype callers for calling diploid genotypes from nucleotide data at triallelic sites.

Method Mean Coverage Correct-Call Rate (mean ± 2 SEM) Correct-Call Rate among Called Genotypes (mean ± 2 SEM)
HGC 10 0.97 ± 0.00037 0.97 ± 0.00037
GATK 10 0.97 ± 0.00032 0.98 ± 0.00031
Samtools 10 0.96 ± 0.00040 0.96 ± 0.00040
BGC 10 0.78 ± 0.00099 0.79 ± 0.00099
HGC 15 0.99 ± 0.00022 0.99 ± 0.00022
GATK 15 1.00 ± 0.00013 1.00 ± 0.00013
Samtools 15 0.96 ± 0.00040 0.96 ± 0.00040
BGC 15 0.80 ± 0.00088 0.81 ± 0.00087
HGC 20 1.00 ± 0.00014 1.00 ± 0.00014
GATK 20 1.00 ± 0.00007 1.00 ± 0.00007
Samtools 20 0.97 ± 0.00031 0.97 ± 0.00031
BGC 20 0.80 ± 0.00084 0.81 ± 0.00083
HGC 30 1.00 ± 0.00005 1.00 ± 0.00005
GATK 30 1.00 ± 0.00002 1.00 ± 0.00002
Samtools 30 0.99 ± 0.00015 0.99 ± 0.00015
BGC 30 0.81 ± 0.00080 0.81 ± 0.00080

Allele frequencies p, q, and r are 0.7, 0.2, and 0.1, and the population is in Hardy–Weinberg equilibrium. Correct-call rate and that among called genotypes are calculated among N = 100 individuals and those with called genotypes, respectively. Error rate ε = 0.01. Results are based on a total of 10,000 simulation replications for each parameter set.

To evaluate the power of HGC for detecting polymorphisms, we examined the false-positive and -negative rates as functions of the mean depth of coverage μ. The false-positive rate of detecting polymorphisms is low when μ is moderately high (10), and is essentially zero with μ>10 (Figure 2A). The false-negative rate of detecting polymorphisms is reasonably low when μ is moderately high (10), and approaches the theoretical minimum possible value, which is the probability that only one of the two alleles is sampled in a finite sample, when μ is 30 (Figure 2B). We note that the high false-negative rates with low minor-allele frequencies are mainly due to the finite sample size (100), which limits sampling rare alleles, and they remain high even with infinite coverage with any method. The power for detecting three alleles is essentially the same as that for detecting polymorphisms; the false-positive rate is low (Figure 2C) and false-negative rate is reasonably low (Figure 2D). These results indicate that HGC accurately describes polymorphisms with arbitrary numbers of alleles when the depth of coverage is high.

Figure 2.

Figure 2

Power analysis of the high-coverage genotype caller. (A) False-positive rate of polymorphism detection as a function of the mean depth of coverage. (B) False-negative rate of detecting polymorphisms at biallelic sites as a function of the minor-allele frequency. The solid curve shows the theoretical minimum value as the probability that just one of the alleles is sampled in a finite sample of size N given by q2N+(1q)2N, where q is the minor-allele frequency. (C) False-positive rate of detecting triallelic sites as a function of the mean depth of coverage with q = 0.1 (i.e., the site is biallelic). (D) False-negative rate of inferring triallelic sites as a function of the frequency of the rarest allele. The frequency of the second most abundant allele is 0.2. The statistical significance of the likelihood-ratio tests is set at the 5% level in all panels. N = 100, error rate ε = 0.01, inbreeding coefficient f = 0 (Hardy–Weinberg equilibrium). Results are based on a total of 10,000 simulation replications for each parameter set.

We applied the HGC to high-throughput sequencing data of 83 D. pulex clones from Kickapond to identify sites containing more than two alleles. We set the minimum coverage required to call a genotype of an individual at six so that the rate of falsely calling homozygotes as heterozygotes is low (Figure 1A). We identified a total of 4,403,303 significantly polymorphic sites at the 5% level from the sequence data filtered by multiple procedures to minimize analyzing misassembled regions and sites with mismapped reads. The vast majority (97.07%) of these were considered to contain two alleles. However, 127,167 (2.88%) and 2030 (0.05%) sites were considered to contain three and four alleles, respectively.

The performance evaluation using computer simulations shows that calling accurate triploid genotypes requires much higher coverage than calling diploid genotypes (Figure 3). In particular, high coverage is needed to correctly call heterozygotes (Figure 3, B and C). This is because higher coverage is required to sequence all three chromosomes rather than just two chromosomes. Furthermore, it is difficult to distinguish the two alternative heterozygotes containing two alleles (e.g., AAC vs. ACC) unless the coverage is very high. Overall, the correct-call rate increases with increased coverage for all three types of genotypes, although there are a few local exceptions. The valley of the rate for correctly calling homozygotes at coverage equal to eight (Figure 3A) is observed because when two nucleotides have nonzero read counts, one of which has just one read, in the quartet, both of them are considered to be from an allele without error when the coverage is eight or less, and the one with a single read is considered to be due to a sequence error when the coverage is over eight by the TRI, as the likelihood that the single read is due to a sequence error becomes greater than that without error. The sudden decrease in the rate for correctly calling heterozygotes containing three different nucleotides when coverage increases from six to seven (Figure 3C) is observed because when three nucleotides have nonzero read counts, one of which has just one read, in the quartet, all of them are considered to be from an allele without error when the coverage is six or less, and the one with a single read is considered to be due to a sequence error when the coverage is over six by the TRI, as the likelihood that the single read is due to a sequence error becomes greater than that without error. These observations highlight the inherent difficulty in calling polyploid genotypes not only for our method.

Figure 3.

Figure 3

Correct-call rate of the triploid genotype caller as a function of the depth of coverage. (A) Correct-call rate when the true genotype of the individual is homozygous. (B) Correct-call rate when the true genotype of the individual is heterozygous containing two different nucleotides. (C) Correct-call rate when the true genotype of the individual is heterozygous containing three different nucleotides. Error rate ε = 0.01. Results are based on a total of 10,000 simulation replications for each parameter set.

Comparison of the performance of our TRI to that of GATK shows that TRI yields more accurate called genotypes when the true genotype is a homozygote or heterozygote with two different nucleotides, whereas GATK yields more accurate called genotypes when the true genotype is a heterozygote with three different nucleotides (Table S6). Because the former two genotypes are generally far more abundant than the latter in natural populations, our method is likely to yield more accurate genotype calls than GATK when applied to real data. We also note that our method is more efficient than GATK in terms of both memory usage and computation time. TRI takes 2.749 sec whereas GATK takes 397.708 sec under the same computing environment to analyze simulated data with fixed coverage of 30 at 1,010,000 sites in an individual. This ∼145-fold difference in computation time becomes huge when analyzing organisms with a large genome size.

Because it takes additional time to prepare the input file of TRI from the BAM file, make a dictionary file of the reference, and add read groups to the BAM file to use GATK, we also compared the computation time between TRI and GATK taking these additional times into account. The additional times taken for the above analysis are 22.374 and 8.863 sec for TRI and GATK, respectively, making the total time 25.123 and 406.571 sec for TRI and GATK, respectively. Therefore, the total time taken for GATK is ∼16 times greater than that for TRI.

Discussion

Given the rapid emergence of the field of population genomics, there is a need to systematically examine the performance of genotype callers. Recent studies (e.g., Lynch 2009; Buerkle and Gompert 2013; Maruki and Lynch 2014) found that sequencing as many individuals as possible with low depths of coverage is the optimal strategy for estimating population-level parameters without calling genotypes with limited budgets. However, when depths of coverage are low, statistical uncertainty of individual sequence data are high and calling genotypes is difficult. Therefore, when inferring accurate genotypes is critical for a study, it becomes necessary to sequence a more limited number of individuals with higher depths of coverage.

In this study, we investigated how estimating genotype frequencies and error rates from a population sample and incorporating the population-level information into genotype-calling processes improves the performance of genotype calling, especially when the population deviates from HWE. We also examined the situation when individual coverage is high enough to accurately call genotypes without prior population-level estimates. To promote the use of our methods, we freely provide C++ programs implementing our genotype callers, along with their documentation (File S3, File S4, File S5, File S6, File S7, File S8, File S9, and File S10).

Consistent with previous results (Kim et al. 2011; Han et al. 2014), we find that allele-frequency estimates via genotype calling are biased, whereas those directly estimated from sequence read data are unbiased when depths of coverage are low. Given the importance of unbiased estimates of allele frequencies for subsequent population-genetic analyses, this reinforces the importance of estimating allele frequencies directly from sequence reads with low depths of coverage, as in Maruki and Lynch (2015). Our BGC takes advantage of prior population-level estimates of genotype frequencies and error rates to improve the accuracy of genotype calls in a population with an arbitrary mating system and internal population structure. The higher accuracy of called genotypes by BGC than obtained with currently widely used genotype callers in the majority of examined cases shows that our approach improves the ability to call genotypes while also providing unbiased estimates of allele frequencies.

Our performance evaluation of the HGC revealed that the rate of correctly calling homozygotes can be high if the minimum coverage required for calling genotypes is set at six or greater. By raising the minimum coverage cutoff to eight or greater, the rate for correctly calling heterozygotes can also be reasonably high. The power analyses of HGC for detecting polymorphisms and triallelic sites also indicate that the method is conservative and reasonably sensitive. The comparison of the performance of HGC to that of other existing methods showed that our method yields more accurate or as accurate diploid genotype calls at sites with more than two alleles.

The application of HGC to high-throughput sequencing data of 83 Daphnia clones from a population indicates that a non-negligible fraction of polymorphisms is triallelic and even tetra-allelic in D. pulex. There is growing evidence that triallelic polymorphisms exist in the human genome (e.g., Hodgkinson and Eyre-Walker 2010; Nelson et al. 2012; Cao et al. 2015). In particular, a recent study sequencing many samples with high depths of coverage by Nelson et al. (2012) found that the fraction of triallelic sites is much higher than that previously expected. Furthermore, a recent study (Jenkins et al. 2014) found triallelic polymorphisms may be more informative for demographic inferences than biallelic polymorphisms. Therefore, it is important to relax the assumption of biallelic polymorphisms. Our method provides an efficient, flexible, and statistically rigorous framework for identifying polymorphic sites containing arbitrary numbers of alleles.

Because of its efficiency and flexibility, our method can be extended to analyses of population-genomic data from polyploid organisms. Our performance evaluation of the TRI revealed that much higher coverage is needed for correctly calling triploid genotypes than in the case of diploids. This conclusion is extended to genotype calling of organisms with higher ploidy, as the difficulties we found become greater. These results will help researchers to design sequencing strategies for population-genomic analyses of polyploid organisms.

As guidance on proper usage of the proposed four different genotype-calling methods, we provided a summary of the proposed methods (Table 5). BGC and HGC are both intended for calling diploid genotypes. Users should first apply HGC to population-genomic data of diploid organisms to identify sites with more than two alleles. Then, users should apply BGC to the data at biallelic sites, as BGC yields more accurate genotype calls than HGC at biallelic sites, unless the depth of coverage is extremely high. To facilitate this procedure, we provide a C++ program and documentation for setting the coverage of all individuals at zero at sites with more than two alleles as identified by HGC (File S11 and File S12). TRI and TET are extensions of HGC to triploid and tetraploid data, respectively. They enable highly efficient genotype calling at sites with arbitrary numbers of alleles. If future studies enable the population-level estimation of the genotype frequencies and error rates in diploid data at sites with arbitrary numbers of alleles and triploid and tetraploid data, we can, in principle, improve the accuracy of genotype calls from these data, although it will be very computationally expensive to estimate frequencies of many genotypes in these cases.

Table 5. Summary of the proposed methods.

Method Ploidy Advantage Disadvantage
BGC Two More accurate for biallelic SNPs Assumes at most two alleles
HGC Two Highly efficient Genotype-frequency information not used
Arbitrary numbers of alleles Less accurate for biallelic SNPs
TRI Three Highly efficient Genotype-frequency information not used
Arbitrary numbers of alleles
TET Four Highly efficient Genotype-frequency information not used
Arbitrary numbers of alleles

Supplementary Material

Supplemental material is available online at www.g3journal.org/lookup/suppl/doi:10.1534/g3.117.039008/-/DC1.

Acknowledgments

We thank two anonymous reviewers, whose comments improved our manuscript. This work was supported by National Science Foundation grant DEB-1257806 and National Institutes of Health grant NIH-NIGMS R01-GM101672. It was also supported by the National Center for Genome Analysis Support, funded by National Science Foundation grant DBI-1458641 to Indiana University, and Indiana University Research Technology’s computational resources.

Note added in proof: See Ye et al. 2017 (pp. 1405) in this issue and Ackerman et al. 2017 (pp. 105) and Lynch et al. 2017 (pp. 315) in the GENETICS May issue for related work.

Footnotes

Communicating editor: S. I. Wright

Literature Cited

  1. Aars J., Dallas J. F., Piertney S. B., Marshall F., Gow J. L., et al. , 2006.  Widespread gene flow and high genetic variability in populations of water voles Arvicola terrestris in patchy habitats. Mol. Ecol. 15: 1455–1466. [DOI] [PubMed] [Google Scholar]
  2. Black F. L., Salzano F. M., 1981.  Evidence for heterosis in the HLA system. Am. J. Hum. Genet. 33: 894–899. [PMC free article] [PubMed] [Google Scholar]
  3. Black W. C., IV, Baer C. F., Antolin M. F., DuTeau N. M., 2001.  Population genomics: genome-wide sampling of insect populations. Annu. Rev. Entomol. 46: 441–469. [DOI] [PubMed] [Google Scholar]
  4. Brown A. H. D., 1979.  Enzyme polymorphism in plant-populations. Theor. Popul. Biol. 15: 1–42. [Google Scholar]
  5. Browning B. L., Browning S. R., 2016.  Genotype imputation with millions of reference samples. Am. J. Hum. Genet. 98: 116–126. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Browning S. R., Browning B. L., 2007.  Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am. J. Hum. Genet. 81: 1084–1097. [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Buerkle C. A., Gompert Z., 2013.  Population genomics based on low coverage sequencing: how low should we go? Mol. Ecol. 22: 3028–3035. [DOI] [PubMed] [Google Scholar]
  8. Cao M., Shi J., Wang J., Hong J., Cui B., et al. , 2015.  Analysis of human triallelic SNPs by next-generation sequencing. Ann. Hum. Genet. 79: 275–281. [DOI] [PubMed] [Google Scholar]
  9. Catchen J., Hohenlohe P. A., Bassham S., Amores A., Cresko W. A., 2013.  Stacks: an analysis tool set for population genomics. Mol. Ecol. 22: 3124–3140. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Catchen J. M., Amores A., Hohenlohe P., Cresko W., Postlethwait J. H., 2011.  Stacks: building and genotyping loci de novo from short-read sequences. G3 1: 171–182. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Cockerham C. C., Weir B. S., 1977.  Digenic descent measures for finite populations. Genet. Res. 30: 121–147. [Google Scholar]
  12. Danecek P., Auton A., Abecasis G., Albers C. A., Banks E., et al. , 2011.  The variant call format and VCFtools. Bioinformatics 27: 2156–2158. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Delmotte F., Leterme N., Gauthier J. P., Rispe C., Simon J. C., 2002.  Genetic architecture of sexual and asexual populations of the aphid Rhopalosiphum padi based on allozyme and microsatellite markers. Mol. Ecol. 11: 711–723. [DOI] [PubMed] [Google Scholar]
  14. DePristo M. A., Banks E., Poplin R., Garimella K. V., Maguire J. R., et al. , 2011.  A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 43: 491–498. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Ferreira A. G., Amos W., 2006.  Inbreeding depression and multiple regions showing heterozygote advantage in Drosophila melanogaster exposed to stress. Mol. Ecol. 15: 3885–3893. [DOI] [PubMed] [Google Scholar]
  16. Foltz D. W., Hoogland J. L., 1983.  Genetic-evidence of outbreeding in the black-tailed prairie dog (Cynomys-Ludovicianus). Evolution 37: 273–281. [DOI] [PubMed] [Google Scholar]
  17. Fu Y. B., 2014.  Genetic diversity analysis of highly incomplete SNP genotype data with imputations: an empirical assessment. G3 4: 891–900. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. 1000 Genomes Project Consortium. Auton A., Brooks L. D., Durbin R. M., Garrison E. P., et al. , 2015.  A global reference for human genetic variation. Nature 526: 68–74. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Glenn T. C., 2011.  Field guide to next-generation DNA sequencers. Mol. Ecol. Resour. 11: 759–769. [DOI] [PubMed] [Google Scholar]
  20. Han E., Sinsheimer J. S., Novembre J., 2014.  Characterizing bias in population genetic inferences from low-coverage sequencing data. Mol. Biol. Evol. 31: 723–735. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Hebert P. D. N., 1978.  Population biology of Daphnia (Crustacea, Daphnidae). Biol. Rev. Camb. Philos. Soc. 53: 387–426. [Google Scholar]
  22. Hedrick P. W., 1998.  Balancing selection and MHC. Genetica 104: 207–214. [DOI] [PubMed] [Google Scholar]
  23. Hodgkinson A., Eyre-Walker A., 2010.  Human triallelic sites: evidence for a new mutational mechanism? Genetics 184: 233–241. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Hohenlohe P. A., Bassham S., Etter P. D., Stiffler N., Johnson E. A., et al. , 2010.  Population genomics of parallel adaptation in threespine stickleback using sequenced RAD tags. PLoS Genet. 6: e1000862. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Hudson R. R., Kaplan N. L., 1985.  Statistical properties of the number of recombination events in the history of a sample of DNA sequences. Genetics 111: 147–164. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Hunt M., Kikuchi T., Sanders M., Newbold C., Berriman M., et al. , 2013.  REAPR: a universal tool for genome assembly evaluation. Genome Biol. 14: R47. [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. International HapMap Consortium. Frazer K. A., Ballinger D. G., Cox D. R., Hinds D. A., et al. , 2007.  A second generation human haplotype map of over 3.1 million SNPs. Nature 449: 851–861. [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. International HapMap 3 Consortium. Altshuler D. M., Gibbs R. A., Peltonen L., Altshuler D. M., et al. , 2010.  Integrating common and rare genetic variation in diverse human populations. Nature 467: 52–58. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Jenkins P. A., Mueller J. W., Song Y. S., 2014.  General triallelic frequency spectrum under demographic models with variable population size. Genetics 196: 295–311. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Jurka J., Kapitonov V. V., Pavlicek A., Klonowski P., Kohany O., et al. , 2005.  Repbase update, a database of eukaryotic repetitive elements. Cytogenet. Genome Res. 110: 462–467. [DOI] [PubMed] [Google Scholar]
  31. Kendall M., Stuart A., 1979.  The Advanced Theory of Statistics, Vol. 2, Ed. 4 Charles Griffin & Co., London. [Google Scholar]
  32. Kim S. Y., Lohmueller K. E., Albrechtsen A., Li Y., Korneliussen T., et al. , 2011.  Estimation of allele frequency and association mapping using next-generation sequencing data. BMC Bioinformatics 12: 231. [DOI] [PMC free article] [PubMed] [Google Scholar]
  33. Korneliussen T. S., Albrechtsen A., Nielsen R., 2014.  ANGSD: analysis of next generation sequencing data. BMC Bioinformatics 15: 356. [DOI] [PMC free article] [PubMed] [Google Scholar]
  34. Li H., 2011.  A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27: 2987–2993. [DOI] [PMC free article] [PubMed] [Google Scholar]
  35. Li H., Ruan J., Durbin R., 2008.  Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18: 1851–1858. [DOI] [PMC free article] [PubMed] [Google Scholar]
  36. Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., et al. , 2009a The sequence alignment/map format and samtools. Bioinformatics 25: 2078–2079. [DOI] [PMC free article] [PubMed] [Google Scholar]
  37. Li R., Li Y., Fang X., Yang H., Wang J., et al. , 2009b SNP detection for massively parallel whole-genome resequencing. Genome Res. 19: 1124–1132. [DOI] [PMC free article] [PubMed] [Google Scholar]
  38. Li Y., Willer C. J., Ding J., Scheet P., Abecasis G. R., 2010.  MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet. Epidemiol. 34: 816–834. [DOI] [PMC free article] [PubMed] [Google Scholar]
  39. Lynch M., 2009.  Estimation of allele frequencies from high-coverage genome-sequencing projects. Genetics 182: 295–301. [DOI] [PMC free article] [PubMed] [Google Scholar]
  40. Lynch M., Gutenkunst R., Ackerman M. S., Spitze K., Ye Z., et al. , 2016.  Population genomics of Daphnia pulex. Genetics 206: 315–332. [DOI] [PMC free article] [PubMed] [Google Scholar]
  41. Markow T., Hedrick P. W., Zuerlein K., Danilovs J., Martin J., et al. , 1993.  HLA polymorphism in the Havasupai: evidence for balancing selection. Am. J. Hum. Genet. 53: 943–952. [PMC free article] [PubMed] [Google Scholar]
  42. Martin E. R., Kinnamon D. D., Schmidt M. A., Powell E. H., Zuchner S., et al. , 2010.  SeqEM: an adaptive genotype-calling approach for next-generation sequencing studies. Bioinformatics 26: 2803–2810. [DOI] [PMC free article] [PubMed] [Google Scholar]
  43. Maruki T., Lynch M., 2014.  Genome-wide estimation of linkage disequilibrium from population-level high-throughput sequencing data. Genetics 197: 1303–1313. [DOI] [PMC free article] [PubMed] [Google Scholar]
  44. Maruki T., Lynch M., 2015.  Genotype-frequency estimation from high-throughput sequencing data. Genetics 201: 473–486. [DOI] [PMC free article] [PubMed] [Google Scholar]
  45. McKenna A., Hanna M., Banks E., Sivachenko A., Cibulskis K., et al. , 2010.  The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20: 1297–1303. [DOI] [PMC free article] [PubMed] [Google Scholar]
  46. Melnick D. J., 1987.  The genetic consequences of primate social organization: a review of macaques, baboons and vervet monkeys. Genetica 73: 117–135. [DOI] [PubMed] [Google Scholar]
  47. Nelson M. R., Wegmann D., Ehm M. G., Kessner D., St Jean P., et al. , 2012.  An abundance of rare functional variants in 202 drug target genes sequenced in 14,002 people. Science 337: 100–104. [DOI] [PMC free article] [PubMed] [Google Scholar]
  48. Nielsen R., Korneliussen T., Albrechtsen A., Li Y., Wang J., 2012.  SNP calling, genotype calling, and sample allele frequency estimation from new-generation sequencing data. PLoS One 7: e37558. [DOI] [PMC free article] [PubMed] [Google Scholar]
  49. Pool J. E., Hellmann I., Jensen J. D., Nielsen R., 2010.  Population genetic inference from genomic sequence variation. Genome Res. 20: 291–300. [DOI] [PMC free article] [PubMed] [Google Scholar]
  50. Quail M. A., Smith M., Coupland P., Otto T. D., Harris S. R., et al. , 2012.  A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics 13: 341. [DOI] [PMC free article] [PubMed] [Google Scholar]
  51. Scheet P., Stephens M., 2006.  A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am. J. Hum. Genet. 78: 629–644. [DOI] [PMC free article] [PubMed] [Google Scholar]
  52. Speir M. L., Zweig A. S., Rosenbloom K. R., Raney B. J., Paten B., et al. , 2016.  The UCSC genome browser database: 2016 update. Nucleic Acids Res. 44: D717–D725. [DOI] [PMC free article] [PubMed] [Google Scholar]
  53. Storz J. F., Bhat H. R., Kunz T. H., 2001.  Genetic consequences of polygyny and social structure in an Indian fruit bat, Cynopterus sphinx. II. Variance in male mating success and effective population size. Evolution 55: 1224–1232. [DOI] [PubMed] [Google Scholar]
  54. Tarr C. L., Conant S., Fleischer R. C., 1998.  Founder events and variation at microsatellite loci in an insular passerine bird, the Laysan finch (Telespiza cantans). Mol. Ecol. 7: 719–731. [Google Scholar]
  55. Tollenaere C., Bryja J., Galan M., Cadet P., Deter J., et al. , 2008.  Multiple parasites mediate balancing selection at two MHC class II genes in the fossorial water vole: insights from multivariate analyses and population genetics. J. Evol. Biol. 21: 1307–1320. [DOI] [PubMed] [Google Scholar]
  56. Van der Auwera G. A., Carneiro M. O., Hartl C., Poplin R., Del Angel G., et al. , 2013.  From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr. Protoc. Bioinformatics 11: 11.10.1–1.10.33. [DOI] [PMC free article] [PubMed] [Google Scholar]
  57. Vieira F. G., Fumagalli M., Albrechtsen A., Nielsen R., 2013.  Estimating inbreeding coefficients from NGS data: impact on genotype calling and allele frequency estimation. Genome Res. 23: 1852–1861. [DOI] [PMC free article] [PubMed] [Google Scholar]
  58. Weir B. S., 1996.  Genetic Data Analysis II: Methods for Discrete Population Genetic Data. Sinauer Associates, Sunderland, MA. [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Data Availability Statement

Source codes, written in C++, implementing the proposed methods and their documentation are available as supporting information (File S1, File S2, File S3, File S4, File S5, File S6, File S7, File S8, File S9, File S10, File S11, and File S12).


Articles from G3: Genes|Genomes|Genetics are provided here courtesy of Oxford University Press

RESOURCES