Abstract
Insertion and deletion polymorphisms (indels) are an important source of genomic variation in plant and animal genomes, but accurate genotyping from low-coverage and exome next-generation sequence data remains challenging. We introduce an efficient population clustering algorithm for diploids and polyploids which was tested on a dataset of 2000 exomes. Compared with existing methods, we report a 4-fold reduction in overall indel genotype error rates with a 9-fold reduction in low coverage regions.
INTRODUCTION
Single-nucleotide polymorphisms (SNPs) and copy number variants (CNVs) are pervasive in the human genome and have been well established as sources of genetic and phenotypic variation. Insertion and deletion (indel) polymorphisms are comparably abundant and functionally significant but remain relatively unexplored, mainly due to the fact that they cannot be efficiently detected using microarray platforms.
The advent of next-generation sequencing (NGS) has offered new prospects for exploring the impact of indels on the genetic landscape of both plants and animals. As a result, both assembly-based methods (1) as well as gapped-alignment-based methods have been used for indel discovery. Assembly-based methods rely on high-coverage whole-genome sequence data and can only resolve homozygous indels. Gapped alignment methods aim to distinguish between actual indels and spurious results caused by sequencing errors, including base calling and mapping errors, as well as errors due to polymerase slippage during polymerase chain reaction amplification. Further challenges are faced when trying to identify indels using exome sequencing data. The intermediate microarray hybridization step that is designed to capture coding sequences of interest results in less efficient capture of non-reference reads and uneven coverage across the genome. Existing methods attempt to overcome such biases using a variety of strategies. Dindel (2) utilizes a Bayesian framework to account for the various errors, which requires prior knowledge of context-dependent error rates currently restricted to Illumina platforms. Other methods, such as piCALL (3), involve computationally expensive numerical approximations in order to model population-scale sequence data. QCALL attempts to sample from the space of potential ancestral recombination graphs relating the history of the population in order to improve SNP genotyping accuracy from NGS data (4). This method, however, poses significant computational challenges and has not yet been applied to indel genotyping. Furthermore, none of the current methods has the ability to detect and genotype indels in polyploid genomes.
We present SOAP-popIndel, a novel probabilistic framework for fast and sensitive indel genotyping at the population level. By modelling site-specific indel error rates across thousands of samples, our method achieves high genotyping and detection accuracy, while minimizing the computational burden. Particularly for targeted exome capture data, we demonstrate that SOAP-popIndel outperforms competing methods, despite their more complex and resource-intensive approaches. SOAP-popIndel is the only indel genotyping algorithm that is not restricted to diploid genomes, thus constituting an unparalleled tool for ongoing plant population re-sequencing efforts, such as the 1001 Genomes Project (5) and the rice re-sequencing project (6).
MATERIALS AND METHODS
SOAP-popIndel pipeline
The SOAP-popIndel pipeline is shown in Supplementary Figure S1. The first step is to use the Burrows–Wheeler Aligner (BWA) with default parameters to perform gapped alignment of the sequencing reads to the reference genome. The resulting alignments comprise the candidate indel dataset. A rigorous filtering process is required to eliminate spurious alignments from further analysis. The average density of indels across the human genome has been reported to be one indel per 7.2 kb (7). Therefore, it was deemed necessary to discard alignments that exhibit more than one gap per read. We also filter out alignments with gaps located towards either ends of the read, as they most likely correspond to sequencing artefacts. When multiple indel alleles arise in the population, we consider up to K (1 ≤ K ≤ Kmax) non-reference alleles for which the average number of supporting reads among the samples in which we observe the non-reference allele, is greater than or equal to two. Kmax is a pre-defined parameter representing the maximum number of alternative alleles which the program will model. Finally, we filter putative indel sites on the basis of average depth of coverage (Supplementary Methods).
Next, we create K alternative reference sequences at each putative indel site, each consisting of a 2*L window flanking the indel, together with the indel allele itself (where L denotes the read length). For example, the alternative reference will be missing the equivalent section from the reference where there is a putative deletion and will have incorporated extra sequence relative to the reference where we have identified a putative insertion.
Subsequently, we perform un-gapped alignment for all individuals j = 1… M using BWA on the combined alternative and the reference genome sequence (such that each read can only be assigned to the reference or one of the alternative sequences). We then record the number of reads Ni,j,k which align to the kth allele of the ith indel with less than five mismatches and such that breakpoints are >5 bp from the read ends. The vector of these read counts is denoted by, where k = 0 indicates the reference allele. These quantities are the summary statistics which we use for all subsequent inference in SOAP-popIndel.
Algorithm for modelling read counts
We model the probability of the data (Di) over all samples j = 1… M conditional on the vector of total depth of coverage at position i as:
(1) |
where denotes the underlying population allele frequencies and Ei the matrix of site-specific read assignment errors. We condition our probabilities on the total depth, , to mitigate the influence of the variable depth of coverage on our inference. Such variability can be caused by uneven hybridization of the capture array in exome sequence datasets, by variation in repeat content or by differences in alignability for whole-genome datasets. Intuitively, we consider our data as being the allele-specific depths conditioned on the total depth at each position. Equation (1) also assumes that these allele-specific depths from different individuals are independent, conditional on the total read depth and the population allele frequency.
We expand Equation (1), dropping the i, j subscripts for convenience:
(2) |
where the genotype g comes from all the possible genotypes for ploidy Pl and number of alleles K + 1. For example, G(2,3) = {AAA,AAB,ABB,BBB}. We further expand Equation (2) using the multinomial
(3) |
where vector P(k|E,g) denotes the probability of observing a read k conditional on genotype and assignment matrix E. This can be thought of as a probability of success in a multinomial ‘dice-throw’ model, adjusted to reflect the rate of mis-assignment of indel reads to reference reads and vice versa:
(4) |
where k′ denotes the hidden real underlying allele of the read, P(k|k′) = Ek,k′ denotes the ‘error-rate’ of aligning the allele k′ to k, and Ck′ (g) denotes the number of alleles k in the genotype g. Note that is the ploidy and that for all k′ = 0, … , K. So that there are K·(K − 1) free parameters in the matrix E, consisting of all the off-diagonal elements. We make the simplifying assumption that Ek,0 = eref→indel for all k = 1, … , K and that E0,k = eindel→ref for all k = 1, … , K and Ek,k′ = eindel→indel for all k,k′ = 1, … , K. We express the prior probability of genotype g, P(g) from Equation (3) in terms of the Hardy–Weinberg equilibrium frequencies:
(5) |
The parameters of this model are initialized to:
and trained separately at each site, using a generalized expectation–maximization algorithm. It is important to note that if eiindel→ref = 0, then the likelihood in Equation (3) will be zero for reference homozygotes if we observe at least one supporting indel read (and vice versa if eiref→indel = 0), which results in the program inferring an excess of heterozygotes as a result of misalignment errors. In the expectation step, we calculate the posterior probability of observing each genotype in each individual conditional on the current parameter set. These posterior probabilities are summed to calculate the expected number of indel alleles in the population, which we use to update the allele frequency vector of the indel. The posterior probabilities also enable us to assign probabilistically each data point to each genotype cluster. Thus, for given values of eiindel→ref, eiref→indel and eindel→indel, we can calculate the probability of each data point being generated by each genotype cluster using Equations (3) and (4), and by calculating a sum of these values weighted by their assignment probability, we evaluate the likelihood of this assigned data conditional on the parameter values. We can then use a numerical maximization algorithm to find the values of eiindel→ref, eiref→indel and eindel→indel that maximize this likelihood conditional on the posterior genotype assignments. We train the model for 25 iterations, which we observed to be sufficient for convergence. After training, we report the final posterior probability of each indel genotype in each individual.
Data
We used paired-end exome sequence data generated on 2000 samples that were collected for a case–control study of type II diabetes (8). Exons were captured using the Agilent 47 Mb ‘All Exon Kit’ (v2) and subsequently sequenced at high depth using Illumina HiSeq platform. These data consisted of an average depth of coverage of 56.42× on the capture region, with a SD of 8.64×. The target read length was 100 bp and the target insert size was 500 bp.
The simulated dataset was constructed by introducing indels of known size into a 1 Mb region on chromosome 17 (chr17:11.2 Mb-12.2 Mb, NCBI Build 36, hg18). The indels ranged from 1 to 50 bp, with a length distribution as previously reported (9). In total, we simulated 1000 indel sites equally divided between insertions and deletions. Next, we randomly assigned a population frequency f to each indel, f ∈ {0.05,0.10,0.20,0.50,0.80,0.90,0.95} and generated 2000 diploid genomes as well as a separate 2000 triploid genomes. Finally, we used WGSIM (with options −e 0.01 –d 500 –s 50 –N 200000 −1 100 −2 100 –r 0.001 –R 0.10 –X 0.30 –h) (10) to simulate paired-end reads from each of the 2000 genomes with a base error rate of 1% and a mutation rate of 0.1% (of which 10% were indels). The read length was set to 100 bp, whereas the average insert size was set to 500 bp. We simulated data at a depth of coverage of 40×, then randomly down-sampled depths of 4× and 20×, respectively, for the diploid dataset only. We also simulated tri-allelic data at 40× coverage using the same simulation strategy and total non-reference allele frequency, except that each indel site was assumed to have two alternative alleles, which are selected with equal probability.
Benchmarking of indel calling software
We analysed the simulated dataset with SOAP-popIndel, Dindel, SAMtools and piCall. We used Dindel version 1.01 (linux 64 bit), filtering indels with less than three supporting reads as recommended, running in ‘pool’ mode in 200 bins of 10 samples each to avoid out-of-memory errors. We further filtered out predicted indels, which had less than 100 samples with observed data (using the ‘Number of Samples with Data’ field from the merged vcf file) in order to remove the majority of predicted 284 759 indels that were false positives, resulting in 913 indels for comparison. We used SAMtools version 0.1.17 and the mpileup command with options –u −d 1000 −m 3. We run SAMtools in two batches of 1000 samples to avoid out-of-memory errors. We analysed the real dataset with Dindel with the same parameters as well as SOAP-popIndel.
RESULTS
We applied SOAP-popIndel to an exome sequencing dataset consisting of 2000 samples sequenced at an average depth of 56×. Because of the difficulty in experimentally validating multi-allelic indel genotypes, we ran SOAP-popIndel with Kmax = 1, thus only considering bi-allelic indels. To visualize our results, we generated plots of the number of reads aligned to the indel reference Ni,j,1 against the total number of aligned reads (Ni,j,0 + Ni,j,1) at different putative bi-allelic indel sites across all individuals in the population, annotated by SOAP-popIndel genotype (Figure 1). Despite the varying levels of coverage and rates of misalignment of indel reads to the reference and vice versa, SOAP-popIndel’s ability to update its site-specific error rate via a population model allowed it to accurately identify three genotype clouds in each case.
We randomly chose 50 indels detected by SOAP-popIndel for validation, three of which were subsequently removed due to differences between the hg18 and hg19 genome builds. Using a Sequenom assay, we validated 44 of the 47 indels indicating a false discovery rate of <6.4%. These 44 validated indels were also detected by Dindel. We further assessed SOAP-popIndel and Dindel genotyping accuracy at these validated sites (Figure 2A). SOAP-popIndel achieved a genotyping error rate of 0.26% versus 1.02% for Dindel at the same missing rate of 15%. When we restricted to sites with less than 5× coverage, the error rates were 0.5 and 4.5%, respectively, at a higher missing rate of 37.5% (Figure 2B).
It was difficult to benchmark indel detection sensitivity and specificity on this dataset due to a lack of a gold standard. Thus, we used a simulated dataset of 2000 samples to more extensively compare SOAP-popIndel with Dindel (2), piCALL (3) and SAMtools (10). At 4× coverage, our method achieved a sensitivity of 99.8% with a false-discovery rate (FDR) of 0.22%, which was an order of magnitude lower than the best competing method, Dindel, which also had a lower sensitivity of 99.0% (Table 1). As reported in Neuman et al. (11), Dindel was more accurate than methods other than SOAP-popIndel, particularly at low coverage. SOAP-popIndel did not miss indels detected by the other algorithms, while Dindel missed seven (∼1% of simulated indels) which were detected by SOAP-popIndel and SAMtools (Supplementary Figure S2). At higher coverage (Supplementary Figure S2B and C), the FDR of competing methods decreased, while that of SOAP-popIndel did not, indicating that 4× is sufficient to enable accurate detection of indels, providing population-level information is properly exploited. Our method achieved comparable accuracy (sensitivity of 99.13% and FDR of 0.66%) in indel detection for triploid data. Our method achieved a comparable FDR of 0.34% for tri-allelic diploid simulated data but had a lower sensitivity of 96.8% due to sites for which only one of the two indel alleles were correctly identified (Supplementary Table S4).
Table 1.
Method | Diploid (%) |
Triploid (%) |
|||
---|---|---|---|---|---|
4x | 20x | 40x | 40x | ||
SOAP_popIndel | FN | 0.22 | 0.33 | 0.55 | 0.66 |
FD | 0.22 | 0.11 | 0.22 | 0.87 | |
Dindel | FN | 0.99 | 0.99 | 1.20 | NA |
FD | 9.60 | 5.54 | 1.20 | NA | |
SAMtools | FN | 1.20 | 11.8 | 18.84 | NA |
FD | 64.28 | 63.10 | 63.55 | NA | |
piCALL | FN | 11.83 | 1.42 | 1.42 | NA |
FD | 53.18 | 57.18 | 64.19 | NA |
NA, not applicable; FD, false discovery; FN, false-negative.
We also benchmarked genotyping accuracy using the simulated dataset. SOAP-popIndel had lower missing rate and a substantially lower genotyping error than competing methods, particularly for low depth of coverage (Figure 2C). SOAP-popIndel results for 4x coverage are superior to those achieved by other algorithms even at 20x coverage. Although the SOAP-popIndel genotyping error rate for triploid data is similar to that for diploid data, the missing rate is higher (Supplementary Figure S3), which was, however, mostly due to difficulty in distinguishing the AAB and ABB heterozygotes (Supplementary Table S3). The genotyping error rate of 1.2% at diploid tri-allelic sites is higher than for diploid bi-allelic sites, but the missing rate remains low (Supplementary Figure S4).
We observed that SOAP-popIndel requires considerably less CPU time than other indel callers (Supplementary Table S1). SOAP-popIndel memory requirements were higher on the simulated dataset of 2000 samples, reflecting the fact that we arbitrarily batched regions into ∼1000 indels per run. Adjusting the number of indels batched into a single run can be used to maintain the same memory footprint for larger datasets, e.g. 100 indels per run for 20 000 samples.
DISCUSSION
We have described SOAP-popIndel, a novel fast algorithm for genotyping indels at the population level using exome NGS data. We address the problem of uneven capture efficiency by conditioning on site- and sample-specific total depth of coverage. However, the main strength of our approach is that it models indel genotypes across the entire population, using a model which incorporates site-specific read misalignment rates, as well as the indel population allele frequency. This enables our method to call highly accurate genotypes even on low coverage sequence data and in the presence of significant rates of misalignment. Our genotyping error rates of 0.25% are significantly lower than competing methods, although indel callers that consider more than one alternative allele, such as Dindel, may have been artificially penalized on the bi-allelic simulation data. SOAP-popIndel is insensitive to depth of coverage—achieving lower error rates at 4× than competing methods at 20× coverage. As a result, our reported indel genotyping error rates are now comparable with those reported for SNP genotyping (12). Further gains in accuracy may be achieved by using SNP/indel haplotype clustering (13) to borrow information locally across individuals sharing haplotypes. The ability to accurately call indel genotypes at low coverage is extremely helpful even for high coverage exome sequence data, which usually contain many regions of low coverage due to variability in exon capture efficiency mediated in part by GC compositional biases. Benchmarking of SOAP-popIndel on simulated polyploid data demonstrated the feasibility of calling indel genotypes in polyploid plant genomes. However, there may be other features of plant genomes, such as differences in indel heterozygosity, repeat and GC composition, as well as divergence of homologous chromosomes, which may further complicate indel genotyping. In particular, Neuman et al. (11) demonstrate that indel calling becomes progressively more difficult as the density of indels increases, which may be a problem for genomes with high levels of heterozygosity.
SOAP-popIndel provides a comprehensive solution for accurate and efficient sequencing-based indel detection that will help elucidate their largely unexplored role in phenotypic diversity. SOAP-popIndel’s performance, coupled with its unique ability to accommodate polyploids, renders it invaluable for exploring the impact of indels on both animal and plant genomes.
The software is available from http://soap.genomics.org.cn/.
SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online: Supplementary Tables 1–4, Supplementary Figures 1–4 and Supplementary Methods.
FUNDING
Major State Basic Research Development Program of China—973 Program [2011CB809201, 2011CB809202, 2011CB809203]; Major Program of National Natural Science Foundation of China [30890032]; Shenzhen Key Laboratory of Transomics Biotechnologies [CXB201108250096A]; the BBSRC research grant [award number BB/H024808/1 to L.J.M.C. and E.B.]. Funding for open access charge: Imperial College open access fund.
Conflict of interest statement. None declared.
Supplementary Material
ACKNOWLEDGEMENTS
The authors thank Myrsini Kaforou for her help in generating the plots.
REFERENCES
- 1.Li Y, Zheng H, Luo R, Wu H, Zhu H, Li R, Cao H, Wu B, Huang S, Shao H, et al. Structural variation in two human genomes mapped at single-nucleotide resolution by whole genome de novo assembly. Nat. Biotechnol. 2011;29:723–730. doi: 10.1038/nbt.1904. [DOI] [PubMed] [Google Scholar]
- 2.Albers CA, Lunter G, MacArthur DG, McVean G, Ouwehand WH, Durbin R. Dindel: accurate indel calls from short-read data. Genome Res. 2011;21:961–973. doi: 10.1101/gr.112326.110. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Bansal V, Libiger O. A probabilistic method for the detection and genotyping of small indels from population-scale sequence data. Bioinformatics. 2011;27:2047–2053. doi: 10.1093/bioinformatics/btr344. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Le SQ, Durbin R. SNP detection and genotyping from low-coverage sequencing data on multiple diploid samples. Genome Res. 2011;21:952–960. doi: 10.1101/gr.113084.110. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Cao J, Schneeberger K, Ossowski S, Gunther T, Bender S, Fitz J, Koenig D, Lanz C, Stegle O, Lippert C, et al. Whole-genome sequencing of multiple Arabidopsis thaliana populations. Nat. Genet. 2011;43:956–963. doi: 10.1038/ng.911. [DOI] [PubMed] [Google Scholar]
- 6.Xu X, Liu X, Ge S, Jensen JD, Hu F, Li X, Dong Y, Gutenkunst RN, Fang L, Huang L, et al. Resequencing 50 accessions of cultivated and wild rice yields markers for identifying agronomically important genes. Nat. Biotechnol. 2011;30:105–111. doi: 10.1038/nbt.2050. [DOI] [PubMed] [Google Scholar]
- 7.Mills RE, Luttig CT, Larkins CE, Beauchamp A, Tsui C, Pittard WS, Devine SE. An initial map of insertion and deletion (INDEL) variation in the human genome. Genome Res. 2006;16:1182–1190. doi: 10.1101/gr.4565806. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Li Y, Vinckenbosch N, Tian G, Huerta-Sanchez E, Jiang T, Jiang H, Albrechtsen A, Andersen G, Cao H, Korneliussen T, et al. Resequencing of 200 human exomes identifies an excess of low-frequency non-synonymous coding variants. Nat. Genet. 2010;42:969–972. doi: 10.1038/ng.680. [DOI] [PubMed] [Google Scholar]
- 9.Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers DJ, Barnes CL, Bignell HR, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008;456:53–59. doi: 10.1038/nature07517. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Proc GPD. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25:2078–2079. doi: 10.1093/bioinformatics/btp352. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Neuman JA, Isakov O, Shomron N. Analysis of insertion-deletion from deep-sequencing data: software evaluation for optimal detection. Brief. Bioinform. 2012;14:46–55. doi: 10.1093/bib/bbs013. [DOI] [PubMed] [Google Scholar]
- 12.1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature. 2010;467:1061–1073. doi: 10.1038/nature09534. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Su SY, White J, Balding DJ, Coin LJ. Inference of haplotypic phase and missing genotypes in polyploid organisms and variable copy number genomic regions. BMC Bioinformatics. 2008;9:513. doi: 10.1186/1471-2105-9-513. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.