FermiKit: assembly-based variant calling for Illumina resequencing data

Heng Li

Bioinformatics. 2015 Jul 27;31(22):3694–3696. doi: 10.1093/bioinformatics/btv440

PMCID: PMC4757955 PMID: 26220959

Abstract

Summary: FermiKit is a variant calling pipeline for Illumina whole-genome germline data. It de novo assembles short reads and then maps the assembly against a reference genome to call SNPs, short insertions/deletions and structural variations. FermiKit takes about one day to assemble 30-fold human whole-genome data on a modern 16-core server with 85 GB RAM at the peak, and calls variants in half an hour to an accuracy comparable to the current practice. FermiKit assembly is a reduced representation of raw data while retaining most of the original information.

Availability and implementation: https://github.com/lh3/fermikit

Contact: hengli@broadinstitute.org

1 Introduction

Deep resequencing of a human sample typically results in a BAM file of 60–100 GB in size. Storing, distributing and processing many such huge files is becoming a burden for sequencing facilities and research labs. While better compression helps to alleviate this issue, it adds processing time and can barely halve the size, which does not keep up with the rapidly increasing sequencing throughput. Illumina and GATK use gVCF (Raczy et al., 2013) as a reduced representation of raw data. However, gVCF is reference dependent and it is nontrivial to encode both large and small variants consistently. We still need to go back to raw data for long events and when upgrading the reference genome. Another idea from past practice is to assemble sequence reads into contigs that ideally retain all information in the raw data, but whether this approach is practical for Illumina human resequencing remains to be confirmed.

2 Methods

FermiKit uses BFC (Li, 2015) for error correction, ropeBWT2 (Li, 2014a) for BWT construction, Fermi (Li, 2012) for de novo assembly, BWA-MEM (Li, 2013) for mapping and HTSBox (http://bit.ly/HTSBox) for variant calling. The caller simply parses edits in the ‘pileup’ output for small variant calling from one or multiple BAMs, and extracts alignment break points for SV calling, though it may misclassify some SV events (Trappe et al., 2014). FermiKit sets thresholds on mapping quality and the number of supporting reads without using sophisticated statistical models. In comparison to our earlier work (Li, 2012), FermiKit is faster, more sensitive (due to better error correction) and more complete as a pipeline.
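The threshold-based filtering described above can be illustrated with a minimal sketch. It is not the HTSBox implementation: the `Candidate` record, its field names and the cut-off values (`min_mapq=30`, `min_support=3`) are hypothetical choices made only to show what filtering on mapping quality and supporting-read count without a statistical model might look like.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    """Hypothetical record for one edit parsed from pileup output."""
    chrom: str
    pos: int        # 1-based position of the edit
    ref: str
    alt: str
    mapq: int       # mapping quality of the supporting alignment
    n_support: int  # number of reads supporting the edit

def hard_filter(cands, min_mapq=30, min_support=3):
    """Keep candidates that pass both fixed thresholds (illustrative values)."""
    return [c for c in cands
            if c.mapq >= min_mapq and c.n_support >= min_support]

cands = [
    Candidate("chr1", 1000, "A", "G", mapq=60, n_support=12),   # passes both
    Candidate("chr1", 2000, "C", "CT", mapq=10, n_support=15),  # low mapping quality
    Candidate("chr1", 3000, "G", "T", mapq=60, n_support=1),    # too few reads
]
kept = hard_filter(cands)
```

Only the first candidate survives in this toy input; the other two each fail one of the thresholds.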

FermiKit does not currently use paired-end information, but empirically this has little impact on its power. With longer upcoming Illumina reads, it will actually be preferable to merge overlapping ends and treat them as single-end reads.
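As a rough illustration of what merging overlapping ends means, the sketch below joins a read pair when a suffix of the first read exactly matches a prefix of the reverse complement of the second. This is a simplified assumption-laden example, not part of FermiKit: the function names and the `min_overlap` parameter are hypothetical, and real read mergers tolerate mismatches and use base qualities.

```python
_COMP = str.maketrans("ACGTN", "TGCAN")

def revcomp(seq):
    """Reverse complement of a DNA string over A, C, G, T, N."""
    return seq.translate(_COMP)[::-1]

def merge_pair(read1, read2, min_overlap=10):
    """Merge a pair into one sequence if the ends overlap exactly.

    Tries the longest overlap first; returns None when no exact overlap
    of at least `min_overlap` bases exists.
    """
    r2 = revcomp(read2)
    for k in range(min(len(read1), len(r2)), min_overlap - 1, -1):
        if read1[-k:] == r2[:k]:
            return read1 + r2[k:]
    return None

# Toy 32 bp fragment sequenced from both ends with 20 bp reads (8 bp overlap)
fragment = "ACGTACGTTAGCCGATAGGCTTAACGGATCCA"
read1 = fragment[:20]
read2 = revcomp(fragment[12:])
merged = merge_pair(read1, read2, min_overlap=8)
```

In this toy case the merged sequence reconstructs the whole fragment, which can then be handled like a single-end read.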

3 Results

We ran FermiKit on multiple whole-genome datasets of sample NA12878, along with GATK-HaplotypeCaller (HC in brief) and FreeBayes (Garrison and Marth, 2012). We used Genome-In-A-Bottle (GIAB; Zook et al., 2014) as truth data to evaluate the accuracy (Table 1). Recent Illumina data have excessive systematic errors around poly-A, which HC does not handle well. It called over 4000 false INDELs from samples S1+ and S4+, with the vast majority around poly-A. We excluded these regions to avoid one simple error source greatly affecting the metrics. After this treatment, the variant callers are broadly comparable when the same set of hard filters is applied. VQSR, as advised in the GATK Best Practice, does not work well with single-sample calling.

Table 1.

GIAB evaluation of SNP/INDEL accuracy for sample NA12878

Sample	Caller	SNP-FN	SNP-FP	InDel-FN	InDel-FP
PG-	FermiKit	45 700	824	2324	472
	FreeBayes	21 548	439	3858	400
	HC+hardFilter	27 010	144	943	370
	HC+VQSR	128 604	1955	1423	366
S7-	FermiKit	65 217	531	2340	549
	FreeBayes	50 796	676	2891	420
	HC+hardFilter	66 847	228	1543	457
	HC+VQSR	103 979	1508	1396	605
S11-	FermiKit	91 468	541	2973	554
	FreeBayes	52 071	903	3195	431
	HC+hardFilter	65 223	407	1502	472
	HC+VQSR	111 504	1694	1175	765
S2-	FermiKit	63 445	448	2244	568
S12-	FermiKit	74 940	501	2562	553
S1+	FermiKit	67 816	455	4051	516
	FreeBayes	63 101	902	4625	436
	HC+hardFilter	71 174	531	2376	591
	HC+VQSR	108 101	8852	2377	1827
S4+	FermiKit	71 262	452	4197	536
	FreeBayes	65 427	1061	4781	437
	HC+hardFilter	75 040	672	2477	653
	HC+VQSR	103 595	10 492	2401	1622

PCR-free Platinum Genome NA12878 (PG-; AC:ERR194147; 100 bp reads), four Illumina X10 lanes of PCR-free NA12878 (S7-, S11-, S2- and S12- under BaseSpace project ID 18475457; 150 bp) and two X10 lanes of PCR-amplified NA12878 (S1+ and S4+ under project ID 8998991; 150 bp) were acquired and called with FermiKit-0.9, FreeBayes-0.9.20 (option: ‘--experimental-gls --min-repeat-entropy 1’) and HC-3.3 (option: ‘-stand_emit_conf 10 -stand_call_conf 30’). For FreeBayes and HC, BWA-MEM was used for mapping against GRCh37 plus decoy (http://bit.ly/GRCh37d5) with duplicates marked by Samblaster (Faust and Hall, 2014). Short variant calls were hard filtered with hapdip (http://bit.ly/HapDip). GATK-VQSR was also applied to HC calls. The filtered calls were compared to GIAB-v2.18 excluding poly-A regions longer than 6 bp plus 10 bp flanking. A true variant is counted as an FN if there are no called variants within 10 bp around the truth, and a called variant is counted as an FP if it falls in GIAB trusted regions and there are no true variants within 10 bp around the called variant.
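The FN and FP definitions above can be written down as a short sketch. It is an illustration under stated assumptions rather than the hapdip code: positions are plain integers on a single chromosome, "within 10 bp" is taken as inclusive, and `in_trusted` is a hypothetical predicate standing in for the GIAB trusted regions.

```python
from bisect import bisect_left

def has_neighbor(sorted_pos, p, window=10):
    """True if some position in sorted_pos lies within `window` bp of p."""
    i = bisect_left(sorted_pos, p - window)
    return i < len(sorted_pos) and sorted_pos[i] <= p + window

def count_fn_fp(truth, calls, in_trusted, window=10):
    """Count false negatives and false positives with a distance-based match."""
    truth, calls = sorted(truth), sorted(calls)
    # FN: a true variant with no called variant nearby
    fn = sum(1 for t in truth if not has_neighbor(calls, t, window))
    # FP: a call inside trusted regions with no true variant nearby
    fp = sum(1 for c in calls
             if in_trusted(c) and not has_neighbor(truth, c, window))
    return fn, fp

truth = [100, 500, 900]
calls = [105, 700, 2000]
trusted = lambda p: p < 1000   # toy trusted region: positions below 1000
fn, fp = count_fn_fp(truth, calls, trusted)
```

Here the call at 105 matches the truth at 100, the truths at 500 and 900 are missed, the call at 700 is a false positive, and the call at 2000 is ignored because it lies outside the toy trusted region.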

GIAB was generated from multiple NA12878 call sets. It is potentially biased against new callers and towards easier regions that can be called by the existing callers. For example, the GATK call set available from the Platinum Genome website has 13 278 FN SNPs and 46 FPs out of 2.03 Gb confident regions (i.e. one false-positive SNP per 44 Mb), which is suspiciously good.

We turned to the CHM1-NA12878 dataset (Li, 2014b) for an unbiased evaluation (Table 2). In this evaluation, FermiKit produces calls of higher specificity at the cost of sensitivity. This is probably because FermiKit is less powerful in repetitive or duplicated regions or regions affected by systematic artefacts. Nonetheless, in well-behaved regions that are outside ‘uniMask’, the loss of sensitivity is minor. The gain in precision is significant if we consider that there may be 5–20 k real heterozygous SNPs in CHM1 (Li, 2014b), which should not be counted as FPs.

Table 2.

Evaluation on SNP/INDEL accuracy with CHM1-NA12878 pair

Caller	Filter	SNP-TP	SNP-FP	InDel-TP	InDel-FP
FermiKit	hard-polyA	1 937 469	22 743	230 955	14 602
	uniMask	1 802 820	9507	127 304	1126
FreeBayes	hard-polyA	2 026 883	59 422	190 587	30 909
	uniMask	1 842 634	15 252	117 764	6329
HC	hard-polyA	2 003 655	32 030	267 870	15 541
	uniMask	1 824 658	14 912	133 458	2046

SNP/INDELs were called from the CHM1 (AC:SRR642636 through SRR642641; 100 bp) and NA12878-PG- BWA-MEM alignments used by Li (2014b). On the assumption that CHM1 is haploid, (heterozygous) FP equals the number of CHM1 heterozygotes and (heterozygous) TP equals the number of NA12878 heterozygotes minus the number of CHM1 heterozygotes. Two filters were applied. ‘Hard-polyA’ is the same as the filter used in Table 1. ‘UniMask’ filters out genomic regions that tend to be repetitive, low-complexity or susceptible to copy number changes or systematic artefacts (http://bit.ly/unimask). This filter is sample independent.
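The TP and FP definitions under the haploid-CHM1 assumption amount to simple arithmetic, sketched below. The function names are hypothetical, and the "FP per 1000 TP" ratio is just one convenient way to compare callers, not a metric reported in the paper. The counts are the uniMask SNP columns copied from Table 2.

```python
def het_tp_fp(na12878_hets, chm1_hets):
    """Under the haploid-CHM1 assumption, every CHM1 heterozygote is an FP."""
    fp = chm1_hets
    tp = na12878_hets - chm1_hets
    return tp, fp

def fp_per_thousand_tp(tp, fp):
    """False positives per 1000 true positives (illustrative ratio)."""
    return 1000.0 * fp / tp

# (SNP-TP, SNP-FP) under the uniMask filter, from Table 2
table2_snp_unimask = {
    "FermiKit": (1_802_820, 9_507),
    "FreeBayes": (1_842_634, 15_252),
    "HC": (1_824_658, 14_912),
}

rates = {caller: fp_per_thousand_tp(tp, fp)
         for caller, (tp, fp) in table2_snp_unimask.items()}
best = min(rates, key=rates.get)
```

On these numbers FermiKit has the lowest ratio, about 5.3 FPs per 1000 TPs against roughly 8.2 to 8.3 for the other two callers, which is consistent with the higher specificity noted in the text.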

FermiKit performs well in calling long deletions (Table 3). While it does not use read pairs, it achieves comparable sensitivity and higher specificity in comparison to the popular tools. FermiKit also called ∼480 insertions over 100 bp and identified multiple kb-long contigs having poor alignments to GRCh37 but nearly perfect alignment to a PacBio assembly of CHM1 (AC:GCA_001007805.1). We also mapped the CHM1 FermiKit unitigs to the PacBio assembly and called 71 long deletions, 11 insertions and 262 other events. As PacBio assemblies are generally of higher quality, these numbers give a rough estimate of the number of potential false positives of Fermi on a haploid dataset.

Table 3.

Performance on calling long deletions over 100 bp

Sample	Caller	1000g pilot	Ensemble	LUMPY	Merged
S7-	FermiKit	0.43/0.23	0.50/0.15	0.32/0.23	0.58/0.09
S1+	FermiKit	0.43/0.22	0.51/0.15	0.33/0.23	0.58/0.10
PG-	FermiKit	0.43/0.20	0.52/0.14	0.34/0.22	0.59/0.09
	DELLY	0.47/0.34	0.50/0.22	0.31/0.28	0.58/0.16
	LUMPY	0.72/0.34	0.76/0.29	0.68/0.37	0.79/0.20

FermiKit was used to call 100 bp–100 kb deletions from the PG-, S7- and S1+ datasets. DELLY (Rausch et al., 2012) and LUMPY PG- calls were acquired from http://bit.ly/bcbsval (B. Chapman, personal communication). For all call sets, overlapping events were merged and deletions longer than 100 kb were discarded. The two numbers in a cell at row R and column C give the false negative rate and false positive rate of call set R, assuming truth set C is correct and complete. In the table, truth set ‘1000g pilot’ consists of 3142 deletions from Mills et al. (2011) that were further validated by Layer et al. (2014); ‘Ensemble’ contains 4095 validated calls by multiple callers; ‘LUMPY’ consists of 2657 validated LUMPY-only deletions; ‘Merged’ is the union of the three truth sets above, containing 4695 deletion calls.
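The preprocessing and the two rates in each cell can be sketched as follows. Several details are assumptions, since the text does not specify them: intervals are half-open `(start, end)` pairs on one chromosome, a call matches a truth event if the two overlap by at least one base, and the false positive rate is taken as the fraction of calls with no matching truth event.

```python
def merge_and_filter(events, max_len=100_000):
    """Merge overlapping intervals, then drop those longer than max_len."""
    merged = []
    for s, e in sorted(events):
        if merged and s < merged[-1][1]:
            # Overlaps the previous interval: extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return [(s, e) for s, e in merged if e - s <= max_len]

def overlaps_any(iv, others):
    """True if interval iv overlaps at least one interval in others."""
    return any(iv[0] < o[1] and o[0] < iv[1] for o in others)

def fnr_fpr(calls, truth):
    """FNR: truth events with no call; FPR: calls with no truth event."""
    fnr = sum(not overlaps_any(t, calls) for t in truth) / len(truth)
    fpr = sum(not overlaps_any(c, truth) for c in calls) / len(calls)
    return fnr, fpr

raw_calls = [(100, 400), (300, 600), (1000, 1500), (5000, 200_000)]
calls = merge_and_filter(raw_calls)
truth = [(150, 550), (3000, 3400)]
rates = fnr_fpr(calls, truth)
```

In this toy input the first two calls merge into one, the 195 kb event is discarded, and each rate comes out at one half.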

4 Conclusions

A FermiKit assembly is about 3 GB compressed. After assembly, single-sample variants can be obtained in half an hour to high accuracy through mapping against a reference genome. Jointly calling 261 aligned whole-genome samples took only ∼40 CPU hours. FermiKit is a viable option for aggressive data compression, greatly reducing the effort and expense of data storage, distribution and re-analysis at an acceptable cost of information loss.

Funding

NHGRI U54HG003037; NIH GM100233.

Conflict of Interest: none declared.

References

Faust G.G., Hall I.M. (2014) Samblaster: fast duplicate marking and structural variant read extraction. Bioinformatics, 30, 2503–2505.
Garrison E., Marth G. (2012) Haplotype-based variant detection from short-read sequencing. arXiv:1207.3907.
Layer R.M., et al. (2014) LUMPY: a probabilistic framework for structural variant discovery. Genome Biol., 15, R84.
Li H. (2012) Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly. Bioinformatics, 28, 1838–1844.
Li H. (2013) Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997.
Li H. (2014a) Fast construction of FM-index for long sequence reads. Bioinformatics, 30, 3274–3275.
Li H. (2014b) Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics, 30, 2843–2851.
Li H. (2015) BFC: correcting Illumina sequencing errors. arXiv:1502.03744.
Mills R.E., et al. (2011) Mapping copy number variation by population-scale genome sequencing. Nature, 470, 59–65.
Raczy C., et al. (2013) Isaac: ultra-fast whole-genome secondary analysis on Illumina sequencing platforms. Bioinformatics, 29, 2041–2043.
Rausch T., et al. (2012) DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics, 28, i333–i339.
Trappe K., et al. (2014) Gustaf: detecting and correctly classifying SVs in the NGS twilight zone. Bioinformatics, 30, 3484–3490.
Zook J.M., et al. (2014) Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nat. Biotechnol., 32, 246–251.
