Variant calling on the GRCh38 assembly with the data from phase three of the 1000 Genomes Project

Ernesto Lowy-Gallego; Susan Fairley; Xiangqun Zheng-Bradley; Magali Ruffier; Laura Clarke; Paul Flicek; The 1000 Genomes Project Consortium

doi:10.12688/wellcomeopenres.15126.2

2019 Dec 30;4:50. Originally published 2019 Mar 11. [Version 2] doi: 10.12688/wellcomeopenres.15126.2

PMCID: PMC7059836 PMID: 32175479

Version Changes

Revised. Amendments from Version 1

Previously, we presented phased biallelic SNVs called de novo using sequence data from the 1000 Genomes Project aligned to GRCh38. This work included calls for 2,548 samples spanning 26 populations. Here, we extend that work to add biallelic INDELs, which are combined with the biallelic SNV calls into a single phased call set. Further, we extend the benchmarking work presented in the previous version of the data note to the combined SNV and INDEL call set. We also add comparisons with the 1000 Genomes Project calls lifted over to GRCh38 and look in further detail at clinically important loci where the reference genome changed between GRCh37 and GRCh38. Figures have been added in these instances. Additional information has been added, such as the contribution of the multiple callers used in this work to the integrated call set. The text has been revised throughout, reflecting the updates to the data set, additional analyses and efforts to improve the manuscript in line with the comments from our reviewers. As we have extensively rewritten the manuscript, we would encourage readers to treat this as they would a new document. We also note that the SNV-only data set previously described remains available and that a revised description of its production, based on reviewer comments, is included in this version.

Abstract

We present a set of biallelic SNVs and INDELs, from 2,548 samples spanning 26 populations from the 1000 Genomes Project, called de novo on GRCh38. We believe this will be a useful reference resource for those using GRCh38. It represents an improvement over the “lift-overs” of the 1000 Genomes Project data that have been available to date by encompassing all of the GRCh38 primary assembly autosomes and pseudo-autosomal regions, including novel, medically relevant loci. Here, we describe how the data set was created and benchmark our call set against that produced by the final phase of the 1000 Genomes Project on GRCh37 and the lift-over of that data to GRCh38.

Keywords: Genomics, population genetics, variant calling, single nucleotide variation, variant discovery

Introduction

The 1000 Genomes Project produced a deep catalogue of human genomic variation and sequenced more than 2600 samples from 26 different populations. It completed its final phase (“phase three”), with the release of more than 85 million variants of various types and phased haplotypes for those variants ¹. This data has been widely used by the scientific community for genotype imputation and many other applications ². The strategy adopted by the project consisted of sequencing samples using low coverage whole genome sequencing (WGS) and whole exome sequencing (WES), and the alignment of that sequence data to a version of the GRCh37 human reference genome, which included decoy sequences for optimal read mapping.

While the 1000 Genomes Project was based on GRCh37, the current version of the human reference assembly is GRCh38, which was released by the Genome Reference Consortium (GRC) in 2013. This is the most comprehensive representation of the human genome currently available, as demonstrated by Schneider et al., whose work illustrates the superiority of GRCh38 over GRCh37 ³. Specifically, GRCh38 is a better basis for annotation, alters read alignment (even in unchanged regions of the genome) and “impacts variant interpretation at clinically relevant loci” ³.

To make full use of GRCh38, there has been a need for widely used genomic reference data sets, like the 1000 Genomes Project data, to be made available on the assembly so that pipelines and analyses that rely on such additional reference materials can use GRCh38 and benefit from its improvements.

dbSNP have facilitated the use of the 1000 Genomes Project variation data on GRCh38 by transferring the variant calls to the new assembly using a method relying on an alignment created between GRCh37 and GRCh38. The alignment is then used to determine equivalent locations between the two assemblies, allowing variation data to be “lifted-over”. Files from dbSNP are reformatted into a standard VCF by the European Variation Archive (EVA) and shared as part of our resources through the 1000 Genomes FTP site ⁴ and also via the Ensembl genome browser ⁵.

Lift-over approaches, however, have several limitations. First, in order to be able to transfer a variant from one assembly to another, it is necessary to be able to map between the genomes at the variant’s original location, which is not always possible. In the lift-over process mentioned above there were over 2.3 million VCF records which could not be transferred to the GRCh38 assembly. Second, even when a variant can be lifted-over it does not follow that the underlying read alignments supporting the variant identification on the original assembly would transfer to the identified location in the new assembly. Indeed, alterations to the genome have an impact on read alignments even in unaltered regions of the genome ³. Thus, despite a variant being lifted-over, there is no guarantee that it would have been called at the identified location in the new genome as the underlying evidence may vary. Finally, where novel sequence is introduced in GRCh38, it is unlikely that the lift-over approach will be effective as this sequence was not represented in the older assembly and, therefore, not included in the original variation discovery process. Examples of this can be seen where gaps in the assembly have been closed, including at medically relevant loci such as INPP5D, DPP6 and IKZF1 ³, which are considered below.

To realise the benefits and address the limitations described above, we created new call sets from alignments of the original 1000 Genomes Project read data to GRCh38, initially releasing only biallelic SNVs (described in a previous version of this note) and now updating to biallelic SNVs and INDELs. While this work does not replicate the full repertoire of analyses employed by the 1000 Genomes Consortium, it aims to give a consistent de novo call set spanning all of the GRCh38 primary assembly autosomes and pseudo-autosomal regions, and to produce a call set with similar, although not identical, properties to that produced by the 1000 Genomes Project while using a simpler methodology.

To create an updated variation call set from the 1000 Genomes Project data, we adopted a multi-caller approach and used previously described alignments ⁶. With the aim of sharing this data in a timely manner, we adopted an incremental approach to generating and releasing data sets. Initially, we released only biallelic SNVs, which represent the vast majority of the SNVs present in the human genome. Phase three of the 1000 Genomes Project reported that 99.6% of its 81.4 million SNVs are biallelic. Here, we extend our biallelic SNV call set by adding biallelic INDELs. We anticipate future updates to incorporate calls on new populations and the non-pseudo-autosomal regions of chromosome X.

Methods

Input data

The methods used for sample collection, library construction, and sequencing are described in the previous 1000 Genomes Project publications ¹,⁷,⁸. The read data for this analysis were selected using criteria similar to those of the final phase of the 1000 Genomes Project. Only Illumina sequence data with reads longer than 70 bp (WGS) and 68 bp (WES) were used. This data was aligned to GRCh38 as previously described ⁶. The complete lists of the whole genome and whole exome sequencing alignment files used as the input for generating the call sets can be found on our FTP site at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/1000genomes.low_coverage.GRCh38DH.alignment.index and at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/1000genomes.exome.GRCh38DH.alignment.index.

Reference genome

We used the full GRCh38 reference, including ALT contigs, decoy and EBV sequences (accession GCA_000001405). In addition, more than 500 HLA sequences compiled by Heng Li from the IMGT/HLA database provided by the Immuno Polymorphism Database (IPD) ⁹ are included. The reference genome can be accessed at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/.

Ethical considerations

Information concerning ethical approval and the informed consent procedure for the 1000 Genomes Project can be found at https://www.internationalgenome.org/sites/1000genomes.org/files/docs/Informed%20Consent%20Background%20Document.pdf.

Quality control of the alignment files

We adopted a similar quality control process to that used in the final phase of the 1000 Genomes Project. The following describes the methods used in their entirety. Chk_indel_rg was applied to discard alignment files with an unbalanced ratio of short insertions and deletions (greater than 5). Picard CollectWgsMetrics was used with the whole genome files and those with a mean non-duplicated aligned coverage level ≤2× were discarded. For the exome files, we used Picard CollectHsMetrics with the exome target coordinates at ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/working/20190125_coords_exon_target/ and kept files where more than 70% of the target regions had 20× or greater coverage. In addition, VerifyBAMID ¹⁰ was used to assess sample contamination and mix-ups and the following cutoffs were used:

   free_mix > 0.03 and chip_mix > 0.02 for whole genome files

   free_mix > 0.035 and chip_mix > 0.02 for exome files

Only files passing the quality assessment were used in variant calling.

Variant discovery

Callers were selected in consultation with members of the original 1000 Genomes Project, using their prior knowledge of caller output and the feasibility of running callers on the data set. This enabled us to take advantage of knowledge of a wide range of callers and their performance with this data, the profile of which is now atypical. Specifically, these data are low-coverage from samples representing a diverse range of populations, which necessitates a strategy relying on joint genotyping and the presence of many individuals from a given population. Four supporting call sets were created, using different callers and combinations of the exome and WGS sequence data.

A total of 2,659 WGS and 2,498 WES BAMs corresponding to 2,698 samples ⁶ were used for variant identification. Figure 1 details the analysis of the alignment files with three established methods (BCFtools version 1.3.1-220-g9f38991, Freebayes version v1.0.2-58-g054b257 and GATK UnifiedGenotyper ¹¹ version 3.5-0-g36282e4). BCFtools was used to analyse WGS and WES files in two independent runs, GATK UnifiedGenotyper was used only with WGS files and Freebayes was used to analyse everything together (WGS+WES). Calls were made only on the primary assembly autosomes and pseudo-autosomal regions. The following command lines were used for each of the methods to perform joint genotyping:

• BCFtools with the WGS files:

bcftools mpileup -E -a DP -a SP -a AD -P ILLUMINA \
  -pm3 -F0.2 -C50 -d 700000 \
  -f $ref.fa $file.bam | bcftools call -mv -O z \
  --ploidy GRCh38 -S $samples.ped -o $out.vcf.gz

• GATK UnifiedGenotyper with the WGS files:

java -Xmx6g -jar GenomeAnalysisTK.jar \
  -T UnifiedGenotyper \
  -R $ref.fa \
  -I $file.bam \
  -o $out.vcf.gz \
  -dcov 250 \
  -stand_emit_conf 10 \
  -glm both \
  --genotyping_mode GENOTYPE_GIVEN_ALLELES \
  --dbsnp ALL_20141222.dbSNP142_human_GRCh38.snps.vcf.gz \
  -stand_call_conf 10

• BCFtools with the WES files:

bcftools mpileup -E -a DP -a SP -a AD -P ILLUMINA \
  -pm3 -F0.2 -C50 -d 1400000 \
  -f $ref.fa $file.bam | bcftools call -mv -O z \
  --ploidy GRCh38 -S $samples.ped -o $out.vcf.gz

• Freebayes with the WGS+WES files:

freebayes --genotyping-max-iterations 10 \
  --min-alternate-count 3 \
  --max-coverage 2000000 \
  --min-mapping-quality 1 \
  --min-alternate-qsum 50 \
  --min-base-quality 3 \
  -f $ref.fa \
  -b $file.bam | bgzip -c > $out.vcf.gz

Figure 1. — VCF, variant call format; WGS, whole-genome sequencing; WES, whole-exome sequencing; VQSR, variant quality score recalibration.

Variant filtering

Our variant discovery pipeline produced four initial call sets as described above. To create an integrated call set, we discarded the variants falling in the centromeres, as these are regions of low complexity that hinder variant calling. Variants on the Y chromosome or in regions of the X chromosome outside the pseudo-autosomal regions were discarded due to the ploidy settings used in this work. Additionally, the initial call sets were filtered using different methods and parameters depending on the call set:

GATK UnifiedGenotyper call set. We used the Variant Quality Score Recalibration (VQSR) ¹¹ method following the GATK best practices and GATK training call sets. The combination of commands and parameters differed depending on the variant type being analysed. For SNPs we used GATK VariantRecalibrator and ApplyRecalibration as follows:

java -jar GenomeAnalysisTK.jar \
  -T VariantRecalibrator \
  -R $ref.fa \
  -input $file.vcf.gz \
  -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg38.vcf.gz \
  -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.hg38.vcf.gz \
  -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.hg38.vcf.gz \
  -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_146.hg38.vcf.gz \
  -an DP \
  -an QD \
  -an FS \
  -an SOR \
  -an MQ \
  -an MQRankSum \
  -an ReadPosRankSum \
  -an InbreedingCoeff \
  -mode SNP \
  -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 98.0 -tranche 97.0 -tranche 96.0 -tranche 95.0 -tranche 92.0 -tranche 90.0 -tranche 85.0 -tranche 80.0 -tranche 75.0 -tranche 70.0 -tranche 65.0 -tranche 60.0 -tranche 55.0 -tranche 50.0 \
  -recalFile recalibrate_SNP.recal \
  -tranchesFile recalibrate_SNP.tranches \
  -rscriptFile recalibrate_SNP_plots.R

And:

java -jar GenomeAnalysisTK.jar \
  -T ApplyRecalibration \
  -R $ref.fa \
  -input $file.vcf.gz \
  -mode SNP \
  --ts_filter_level 99.9 \
  -recalFile recalibrate_SNP.recal \
  -tranchesFile recalibrate_SNP.tranches | bgzip -c > recalibrated_snps_raw_indels.vcf.gz

And for INDELs we used:

java -jar GenomeAnalysisTK.jar \
  -T VariantRecalibrator \
  -R $ref.fa \
  -input recalibrated_snps_raw_indels.vcf.gz \
  -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
  -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_146.hg38.vcf.gz \
  -an QD \
  -an DP \
  -an FS \
  -an SOR \
  -an ReadPosRankSum \
  -an MQRankSum \
  -an InbreedingCoeff \
  -mode INDEL \
  -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 98.0 -tranche 97.0 -tranche 96.0 -tranche 95.0 -tranche 92.0 -tranche 90.0 -tranche 85.0 -tranche 80.0 -tranche 75.0 -tranche 70.0 -tranche 65.0 -tranche 60.0 -tranche 55.0 -tranche 50.0 \
  -recalFile recalibrate_INDEL.recal \
  -tranchesFile recalibrate_INDEL.tranches \
  -rscriptFile recalibrate_INDEL_plots.R \
  --maxGaussians 4

And:

java -jar GenomeAnalysisTK.jar \
  -T ApplyRecalibration \
  -R $ref.fa \
  -input recalibrated_snps_raw_indels.vcf.gz \
  -mode INDEL \
  --ts_filter_level 80.0 \
  -recalFile recalibrate_INDEL.recal \
  -tranchesFile recalibrate_INDEL.tranches | bgzip -c > recalibrated_variants.vcf.gz

BCFtools call sets. Filtering was based on variant annotations. The variant annotations used in the filtering and their respective cutoff values were established by comparing the distribution of the annotation values in the true and false positive sites. For the low coverage data we compared the sites on chromosome 20 only, and for the exome data we used sites on all chromosomes. We considered true positives to be the sites identified in our call set for genome NA12878 that were also present in the gold-standard call set generated for the same sample by Genome in a Bottle (GIAB). GIAB’s calls for NA12878 are the result of an effort to integrate data generated by 13 different sequencing technologies and analysis methods ¹². Sites that were present in our call sets and absent in GIAB were considered false positive sites. Table 1 and Table 2 show the variant annotations and cutoff values used for the SNPs and INDELs, respectively, with the low coverage data, and Table 3 and Table 4 show those used for the SNPs and INDELs, respectively, with the exome data.

Table 1. Variant annotations and cutoff values used for SNPs identified using the low coverage data.

Annotation	Description	Cutoff value
INFO/DP	Raw read depth	>24,304
INFO/MQ	Average mapping quality	<34
INFO/MQ0F	Fraction of MQ0 reads (smaller is better)	>0.049737
INFO/HOB	Bias in the number of HOMs number (smaller is better)	>0.1643732
INFO/SGB	Segregation based metric	>2347.043
INFO/SGB	Segregation based metric	<-64440.286
QUAL	Variant quality	<20

Table 2. Variant annotations and cutoff values used for INDELs identified using the low coverage data.

Annotation	Description	Cutoff value
INFO/DP	Raw read depth	>23,758
INFO/MQ	Average mapping quality	<41
INFO/MQ0F	Fraction of MQ0 reads (smaller is better)	>0.009913696
INFO/HOB	Bias in the number of HOMs number (smaller is better)	>0.20265508
INFO/SGB	Segregation based metric	>2143.8876
INFO/SGB	Segregation based metric	<-29513.557
INFO/IDV	Maximum number of reads supporting an indel	>51
INFO/IMF	Maximum fraction of reads supporting an indel	<0.387097
QUAL	Variant quality	<20

Table 3. Variant annotations and cutoff values used for SNPs identified using the exome data.

Annotation	Description	Cutoff value
INFO/DP	Raw read depth	>656,519
INFO/MQ	Average mapping quality	<38
INFO/MQ0F	Fraction of MQ0 reads (smaller is better)	>0.0146629
INFO/HOB	Bias in the number of HOMs number (smaller is better)	>0.1536016
INFO/SGB	Segregation based metric	>57489.21
INFO/SGB	Segregation based metric	<-226326.93
QUAL	Variant quality	<20

Table 4. Variant annotations and cutoff values used for INDELs identified using the exome data.

Annotation	Description	Cutoff value
INFO/MQ	Average mapping quality	<45
INFO/MQ0F	Fraction of MQ0 reads (smaller is better)	>0.002034686
INFO/HOB	Bias in the number of HOMs number (smaller is better)	>0.269603
INFO/SGB	Segregation based metric	>53165.5
INFO/SGB	Segregation based metric	<-85919.729
INFO/IMF	Maximum fraction of reads supporting an indel	<0.3323922
QUAL	Variant quality	<20

These cutoff values were applied using the following commands:

• SNPs from the low coverage data:

bcftools filter -s GIABFILTER \
  -e'INFO/DP>24304 | MQ<34 | MQ0F>0.049737 | HOB>0.1643732 | SGB>2347.043 | SGB<-64440.286 | QUAL<20' \
  $file.snps.vcf.gz \
  -o $out.snps.filtered.vcf.gz -O z

• INDELs from the low coverage data:

bcftools filter -s GIABFILTER \
  -e'INFO/DP>23758 | MQ<41 | MQ0F>0.009913696 | HOB>0.20265508 | SGB>2143.8876 | SGB<-29513.557 | IDV>51 | IMF<0.387097 | QUAL<20' \
  $file.indels.vcf.gz \
  -o $out.indels.filtered.vcf.gz -O z

• SNPs from the exome data:

bcftools filter -sGIABFILTER \
  -e'INFO/DP>656519 | MQ<38 | MQ0F>0.0146629 | HOB>0.1536016 | SGB>57489.21 | SGB<-226326.93 | QUAL<20' \
  $file.snps.vcf.gz \
  -o $out.snps.filtered.vcf.gz -O z

• INDELs from the exome data:

bcftools filter -sGIABFILTER \
  -e'MQ<45 | MQ0F>0.002034686 | HOB>0.269603 | SGB>53165.5 | SGB<-85919.729 | IMF<0.3323922 | QUAL<20' \
  $file.indels.vcf.gz \
  -o $out.indels.filtered.vcf.gz -O z
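
The exclusion logic in these expressions can be illustrated with a minimal Python sketch. It is not part of the pipeline: the record layout (a plain dictionary of annotation values), the function name and the example values are assumptions made for illustration. A site is flagged as soon as any single cutoff is violated, mirroring the `|` (or) operators in the low coverage SNP expression above.

```python
# Illustrative sketch only: mirrors the low coverage SNP exclusion expression.
# The dictionary-based record and all example values are hypothetical.

LC_SNP_CUTOFFS = [
    ("DP", ">", 24304),
    ("MQ", "<", 34),
    ("MQ0F", ">", 0.049737),
    ("HOB", ">", 0.1643732),
    ("SGB", ">", 2347.043),
    ("SGB", "<", -64440.286),
    ("QUAL", "<", 20),
]


def fails_filter(record, cutoffs=LC_SNP_CUTOFFS):
    """Return True if any cutoff is violated (the site would be tagged GIABFILTER)."""
    for key, op, threshold in cutoffs:
        value = record.get(key)
        if value is None:
            continue  # assumption of this sketch: a missing annotation never triggers the filter
        if op == ">" and value > threshold:
            return True
        if op == "<" and value < threshold:
            return True
    return False


good = {"DP": 15000, "MQ": 55, "MQ0F": 0.0, "HOB": 0.01, "SGB": -500.0, "QUAL": 80}
low_mq = dict(good, MQ=30)  # same site, but with poor mapping quality

print(fails_filter(good))    # False
print(fails_filter(low_mq))  # True
```

The other three expressions differ only in the list of cutoffs, so the same function could be reused with a different table of thresholds.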

Freebayes call set. We used a simple hard filter that discarded variants having a QUAL value less than or equal to 1, since this cutoff value has been recommended by the author of Freebayes (Erik Garrison, personal communication) and proved to be effective in filtering variants in phase three of the 1000 Genomes Project. This filter was applied using the following command:

bcftools filter -sQUALFILTER -e'QUAL<1' $file.vcf.gz \
  -o $file.filtered.vcf.gz -O z

Generating consensus call sets

Biallelic SNVs. First, each call set was normalized using the following combination of tools:

bcftools norm -f ref.fa -o norm.vcf.gz -m '-both' in.vcf.gz -Oz

in order to normalize and left-align INDELs and to split the multiallelic sites into multiple rows. Then we ran:

vcfallelicprimitives norm.vcf.gz --keep-info --keep-geno | vt sort - | vt uniq - | bgzip -c > norm.aprimitives.vcf.gz

where vcflib vcfallelicprimitives (version v1.0.0-rc1) was used to decompose the complex variants and vt ¹³ (version 0.5) was used to sort and unify the resulting variants. After this we ran:

bcftools norm -f ref.fa -o norm.aprimitives.merged.vcf.gz -m '+both' norm.aprimitives.vcf.gz -Oz

This merges the multiallelic sites into single rows. Finally, the multiallelic sites were discarded in the following step:

bcftools view -o norm.aprimitives.merged.biallelic.vcf.gz -O z -m2 -M2 norm.aprimitives.merged.vcf.gz

This normalization procedure is necessary as different variant callers may describe the same variant in different ways, which makes comparison difficult and affects the integration of the call sets. Additionally, GATK VariantsToAllelicPrimitives was used to decompose the multi-nucleotide polymorphisms (MNPs) that were present in the Freebayes call set.
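
To make the idea of decomposition concrete, the following minimal Python sketch splits an MNP into its constituent SNVs. It is a simplified illustration, not a reimplementation of vcfallelicprimitives or GATK VariantsToAllelicPrimitives, which also handle genotypes, annotations and more complex alleles. The function name and the example coordinates are hypothetical.

```python
def decompose_mnp(pos, ref, alt):
    """Split an equal-length MNP into per-base SNVs, skipping bases where REF equals ALT."""
    if len(ref) != len(alt):
        raise ValueError("not an MNP: REF and ALT differ in length")
    return [(pos + i, r, a) for i, (r, a) in enumerate(zip(ref, alt)) if r != a]


# One caller might report ACG>TCA at position 100, another two separate SNVs.
print(decompose_mnp(100, "ACG", "TCA"))
# [(100, 'A', 'T'), (102, 'G', 'A')]
```

After decomposition both representations reduce to the same pair of SNV records, which is what allows sites from different callers to be compared and merged.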

Finally, we generated a consensus call set by taking the union of the biallelic sites from each call set and calculating the genotype likelihoods for each site using GATK UnifiedGenotyper in ‘genotype_given_alleles’ (GGA) mode using the following command line:

java -jar GenomeAnalysisTK.jar \
  -T UnifiedGenotyper \
  -R $ref.fa \
  -I input.$chr:$start-$end.bam \
  -glm SNP \
  --intervals $chr:$start-$end \
  --intervals integrated.biallelic.sites.vcf.gz \
  --output_mode EMIT_ALL_SITES \
  --alleles integrated.biallelic.sites.vcf.gz \
  --interval_set_rule INTERSECTION \
  --genotyping_mode GENOTYPE_GIVEN_ALLELES \
  --max_deletion_fraction 1.5

Where $chr:$start-$end is the genomic chunk that is being analysed and integrated.biallelic.sites.vcf.gz is the VCF containing the union of the biallelic sites for which the genotype likelihoods will be calculated.
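
Taking the union of sites across call sets can be sketched as follows. This is an illustrative Python example under simple assumptions: each site is represented as a (chromosome, position, REF, ALT) tuple after normalization, the call set names follow the labels used in our figures, and the coordinates are invented.

```python
from collections import defaultdict


def union_of_sites(call_sets):
    """Map each (chrom, pos, ref, alt) site to the sorted names of the call sets containing it."""
    support = defaultdict(set)
    for name, sites in call_sets.items():
        for site in sites:
            support[site].add(name)
    return {site: sorted(names) for site, names in support.items()}


# Toy input: coordinates and alleles are made up for illustration.
call_sets = {
    "gatk": [("chr20", 100, "A", "G"), ("chr20", 250, "C", "T")],
    "freebayes": [("chr20", 100, "A", "G"), ("chr20", 400, "G", "A")],
    "lc_bcftools": [("chr20", 250, "C", "T")],
}

union = union_of_sites(call_sets)
print(len(union))                       # 3
print(union[("chr20", 100, "A", "G")])  # ['freebayes', 'gatk']
```

Keeping track of which call sets support each site is optional for building the union, but it is the kind of information summarised when comparing the contribution of each caller.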

We then filtered the variants using Variant Quality Score Recalibration (VQSR) with the same parameters and training call sets described above for the supporting call set generated using GATK UnifiedGenotyper. GATK ApplyRecalibration was used with a --ts_filter_level value of 99.5, chosen to balance sensitivity and specificity. This gave a consensus biallelic SNV call set, used as the basis of the initial biallelic SNV-only call set.
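
The effect of the truth sensitivity level can be illustrated with a short Python sketch. It reflects our understanding of the concept rather than GATK's implementation: the level is read as the percentage of truth-set sites that must be retained, and the score cutoff is the lowest recalibrated score that still keeps that percentage. The function name and the scores are invented for illustration.

```python
import math


def vqslod_threshold(truth_scores, ts_filter_level):
    """Lowest score cutoff that still retains ts_filter_level percent of the truth sites."""
    ranked = sorted(truth_scores, reverse=True)
    keep = math.ceil(len(ranked) * ts_filter_level / 100.0)
    keep = max(1, min(keep, len(ranked)))
    return ranked[keep - 1]


# Invented scores for 1,000 truth sites: 1.0, 2.0, ..., 1000.0
truth_scores = [float(i) for i in range(1, 1001)]

print(vqslod_threshold(truth_scores, 99.5))  # 6.0
print(vqslod_threshold(truth_scores, 99.0))  # 11.0
```

In this toy example a lower level gives a higher score cutoff, so fewer variants pass: lowering the level trades sensitivity for specificity.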

Biallelic SNVs and INDELs. To add the INDEL variants to the SNV-only data set (for our second data release), we extracted the INDELs from the initial BCFtools, GATK and Freebayes call sets described above and generated a consensus call set by taking the union of the normalized biallelic INDELs from each call set. Then, we calculated the genotype likelihoods for each site, again using GATK UnifiedGenotyper in ‘genotype_given_alleles’ (GGA) mode, with the following command line:

java -jar GenomeAnalysisTK.jar \
  -T UnifiedGenotyper \
  -R $ref.fa \
  -I input.$chr:$start-$end.bam \
  -glm INDEL \
  --intervals $chr:$start-$end \
  --intervals integrated.biallelic.indel.sites.vcf.gz \
  --output_mode EMIT_ALL_SITES \
  --alleles integrated.biallelic.indel.sites.vcf.gz \
  --interval_set_rule INTERSECTION \
  --genotyping_mode GENOTYPE_GIVEN_ALLELES \
  --max_deletion_fraction 1.5

Where $chr:$start-$end is the genomic chunk that is being analysed and integrated.biallelic.indel.sites.vcf.gz is the VCF containing the union of the biallelic INDEL sites for which the genotype likelihoods will be calculated.

The next step consisted of filtering this INDEL-only consensus call set. As for the SNV-only call set, we used the GATK Variant Quality Score Recalibration (VQSR) method, this time running ApplyRecalibration with a --ts_filter_level value of 99.0. This is lower than the value of 99.5 used with the SNV-only data set and was chosen to favour specificity, due to the greater challenges in INDEL calling, while also balancing sensitivity.

Finally, the INDEL-only consensus call set was merged with the initial SNV-only call set using bcftools concat.

Phasing and imputation of the consensus call set

Biallelic SNVs. The VCF file containing the genotype likelihoods obtained following the procedure described above was divided into single chromosome VCF files that were further divided into genomic chunks containing 2,100 sites of which 600 were shared between consecutive chunks. These chunks were processed in parallel with Beagle ¹⁴ using the following command:

java -jar beagle.08Jun17.d8b.jar \
  chrom=$chr:$start-$end \
  gl=$chr.biallelic.GL.vcf.gz \
  out=$chr.$start.$end.beagle \
  niterations=15

Where $chr.biallelic.GL.vcf.gz is the VCF file containing the genotype likelihoods.
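
The chunking scheme described above (2,100 sites per chunk, 600 shared between neighbours) can be sketched in Python as below. This is an illustration only: the function name is hypothetical, indices refer to site order within a chromosome rather than genomic coordinates, and letting the final chunk be shorter is an assumption of the sketch.

```python
def overlapping_chunks(n_sites, chunk_size=2100, overlap=600):
    """Yield (start, end) site indices, end exclusive, sharing `overlap` sites between neighbours."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    start = 0
    while True:
        end = min(start + chunk_size, n_sites)
        yield start, end
        if end == n_sites:
            break  # the last chunk may be shorter than chunk_size
        start += step


print(list(overlapping_chunks(5000)))
# [(0, 2100), (1500, 3600), (3000, 5000)]
```

Each new chunk starts 1,500 sites after the previous one, so consecutive chunks share 600 sites. The same helper called with chunk_size=12250 and overlap=3500 would give the layout used for the SHAPEIT2 step.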

After processing all the chunks with Beagle, the initial set of genotypes and haplotypes were phased using SHAPEIT2 ¹⁵ (version v2.r837) onto a highly accurate haplotype scaffold also created by SHAPEIT2 using microarray genotype data available on the same samples. This scaffold was obtained by leveraging family information and running SHAPEIT2 in two different independent runs on either the Illumina Omni 2.5 or Affymetrix 6.0 microarray data that was generated as part of the 1000 Genomes Project. To create the microarray scaffolds SHAPEIT2 was run using the following settings (--window 0.5, --states 200, --burn 10, --prune 10, --main 50, --duohmm) and SNPs with a missing data rate above 10% and a Mendel error rate above 5% were removed before phasing. Genotypes called by Beagle with a posterior probability greater than 0.995 were fixed as known genotypes and haplotypes estimated by Beagle were used to initialize the SHAPEIT2 phasing. This phasing was run in chunks of 12,250 sites with 3,500 sites overlapping between consecutive chunks. When phasing the calls derived from sequence data onto the microarray scaffolds SHAPEIT2 was run using the following command:

shapeit -call \
  --input-gen input.shapeit.$chr.gen.gz input.shapeit.$chr.gen.sample \
  --input-init input.shapeit.$chr.hap.gz input.shapeit.hap.sample \
  --input-scaffold chip.omni.snps.$chr.haps chip.omni.snps.$chr.sample chip.affy.snps.$chr.haps chip.affy.snps.$chr.sample  \
  --input-map $chr.gmap.gz \
  --input-thr 1 \
  --window 0.1 \
  --states 400 \
  --states-random 200 \
  --burn 0 \
  --run 12 \
  --prune 4 \
  --main 20 \
  --input-from $chunk_start \
  --input-to $chunk_end  \
  --output-max out.$chr.$chunk_start.$chunk_end.haps.gz out.$chr.$chunk_start.$chunk_end.haps.sample

Where --input-gen specifies the genotype/GL input data from Beagle, --input-init specifies the haplotypes from Beagle, --input-map specifies the genetic map used in the estimation and --input-scaffold gives the SNP-array-derived haplotype scaffolds obtained from SHAPEIT2. The genetic map used was downloaded from https://data.broadinstitute.org/alkesgroup/Eagle/downloads/tables/genetic_map_hg38_withX.txt.gz. The phased chunks produced by SHAPEIT2 were joined together using the program ligateHAPLOTYPES.
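
The rule for fixing confident Beagle genotypes before SHAPEIT2 phasing (posterior probability greater than 0.995) can be illustrated with a small Python sketch. The function name and the representation of posteriors as a tuple over the genotypes 0/0, 0/1 and 1/1 are assumptions made for illustration; the actual conversion into SHAPEIT2 input is handled by the pipeline code.

```python
def fix_confident_genotype(posteriors, threshold=0.995):
    """Return the index of the genotype to fix as known, or None if no posterior exceeds the threshold."""
    best = max(range(len(posteriors)), key=lambda i: posteriors[i])
    return best if posteriors[best] > threshold else None


# Posteriors ordered as (0/0, 0/1, 1/1); values are invented.
print(fix_confident_genotype((0.001, 0.998, 0.001)))  # 1
print(fix_confident_genotype((0.60, 0.39, 0.01)))     # None
```

Genotypes that are not fixed remain uncertain and are resolved during phasing using the genotype likelihoods.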

The strategy described here was used in the final phase of the 1000 Genomes Project and has been shown to produce low error rates for genotype calls ¹⁶.

The pipelines used in this work were implemented using the eHive workflow system ¹⁷ and modules developed in Perl and Python, which have been packaged for ease of deployment. All the analyses were run in parallel on a high-throughput compute cluster to ensure completion in a reasonable timeframe. Code is publicly available via GitHub (see software availability section) ¹⁷–¹⁹.

Biallelic SNVs and INDELs. The merged VCF for biallelic SNVs and INDELs was phased and imputed using Beagle/SHAPEIT2, using the same process described above for phasing and imputation of the SNV-only data set.

To illustrate the contribution of each of the four filtered supporting call sets to the final consensus call set, we generated the plots in Figure 2 and Figure 3, for SNVs and INDELs respectively. Figure 2 shows that the call set that has contributed the most to the final SNV consensus call set is the GATK UnifiedGenotyper call set (71,353,714 variants) followed by the Freebayes call set (61,625,466 variants). In the case of INDELs the call set that has contributed the most to the final INDEL consensus call set is the Freebayes call set (3,649,204 variants) followed by the BCFtools call set used on the low-coverage WGS data (3,602,996 variants).

Figure 2. — *‘ex_bcftools’* is the call set generated using BCFTools with the WES (Whole exome sequencing) data. *‘lc_bcftools’* is the call set generated using BCFTools with the low coverage WGS (whole genome sequencing) data. *‘freebayes’* is the call set generated using Freebayes with the low coverage WGS+WES data. *‘gatk’* is the call set generated using GATK UnifiedGenotyper with the low coverage WGS data. *‘consensus’* is the final SNV call set generated after integrating the supporting call sets. Vertical bars show the size of the intersection between the call sets. Horizontal bars show the aggregated size of each call set. We used the filtered supporting call sets to generate this plot.

Figure 3. — *‘ex_bcftools’* is the call set generated using BCFTools with the WES (Whole exome sequencing) data. *‘lc_bcftools’* is the call set generated using BCFTools with the low coverage WGS (whole genome sequencing) data. *‘freebayes’* is the call set generated using Freebayes with the low coverage WGS+WES data. *‘gatk’* is the call set generated using GATK UnifiedGenotyper with the low coverage WGS data. *‘consensus’* is the final INDEL call set generated after integrating the supporting call sets. Vertical bars show the size of the intersection between the call sets. Horizontal bars show the aggregated size of each call set. We used the filtered supporting call sets to generate this plot.

Switch error rate of the NA12878 sample. To assess the phasing accuracy of our SNV and INDEL call set, we estimated the switch error (SE) rate, which measures the rate at which the phase of a haplotype block is incorrectly predicted in comparison with the true haplotype block. For example, if the correct haplotype is 000111|111000 and the predicted haplotype is 000000|111111, then we count one switch error between positions 3 and 4. To estimate this kind of error, we used WhatsHap ‘compare’ (version 0.18) ²⁰ with our phased variants for sample NA12878, using the GIAB call set for the same sample as the gold-standard phased reference. The SE was calculated for each autosome using the following command for SNVs:

whatshap compare --sample NA12878 \
     --only-snvs NA12878.phased.GIAB.snps.chr${i}.vcf.gz \
     combined.NA12878.phased.query.snps.chr${i}.vcf.gz

And for INDELs:

whatshap compare --sample NA12878 \
     NA12878.phased.GIAB.indels.chr${i}.vcf.gz \
     combined.NA12878.phased.query.indels.chr${i}.vcf.gz
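
The switch error definition given above can be sketched in a few lines of Python. This is a simplified illustration and not the WhatsHap implementation, which also deals with phase sets, missing genotypes and sites absent from one of the files. The function name is hypothetical, and each argument is the string of alleles carried by the first haplotype at consecutive heterozygous sites.

```python
def count_switch_errors(true_hap1, pred_hap1):
    """Count switches between consecutive heterozygous sites.

    Both arguments give the allele (0 or 1) on the first haplotype at each
    heterozygous site; the second haplotype is the complement.
    Returns (number of switches, number of consecutive site pairs compared).
    """
    if len(true_hap1) != len(pred_hap1):
        raise ValueError("haplotypes must cover the same sites")
    # True where the predicted phase agrees with the truth at that site
    agree = [t == p for t, p in zip(true_hap1, pred_hap1)]
    switches = sum(1 for a, b in zip(agree, agree[1:]) if a != b)
    comparisons = max(len(agree) - 1, 0)
    return switches, comparisons


# True haplotype 000111|111000 against a prediction of 000000 on the first haplotype
switches, comparisons = count_switch_errors("000111", "000000")
print(switches, comparisons)  # 1 5
```

In this sketch the rate is the number of switches divided by the number of consecutive pairs compared, which for the toy example is 1 out of 5.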

We estimated the SE rates resulting from the following comparisons:

our extended call set with GIAB on GRCh38
lift-over call set with GIAB on GRCh38
P3 call set with GIAB on GRCh37

The results of these comparisons are shown in Table 5 for SNVs and Table 6 for INDELs. The average SE rate for SNVs across all the autosomes is lower in our call set (0.71%) than in the lift-over and P3 call sets (0.91% and 1.54%, respectively), and the same holds for INDELs (1.78% versus 3.16% and 5.18%, respectively).
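
To make the switch error definition used in this section concrete, the sketch below counts switches between a true and a predicted haplotype, each written as a string of 0/1 alleles at heterozygous sites. It is a simplified illustration and not the WhatsHap implementation: it assumes a single haplotype block, takes one haplotype of each pair as input, and divides by the number of adjacent site pairs to obtain a rate.

```python
def count_switch_errors(truth_hap, pred_hap):
    """Count positions where the predicted phase flips relative to the truth."""
    if len(truth_hap) != len(pred_hap):
        raise ValueError("haplotypes must cover the same sites")
    # True where the predicted allele disagrees with the true allele.
    mismatch = [t != p for t, p in zip(truth_hap, pred_hap)]
    # A switch occurs whenever agreement changes between adjacent sites.
    return sum(1 for a, b in zip(mismatch, mismatch[1:]) if a != b)


def switch_error_rate(truth_hap, pred_hap):
    """Switches divided by the number of adjacent site pairs (assumed definition)."""
    n_pairs = len(truth_hap) - 1
    if n_pairs < 1:
        return 0.0
    return count_switch_errors(truth_hap, pred_hap) / n_pairs


# Six heterozygous sites: the prediction follows the truth for sites 1-3
# and the opposite haplotype for sites 4-6, giving one switch.
switches = count_switch_errors("000111", "000000")
rate = switch_error_rate("000111", "000000")
```

A prediction that is the exact complement of the truth (for example `111000` against `000111`) yields zero switches, because the two haplotypes are merely listed in the opposite order.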

Table 5. Switch error (SE) rates for phased SNVs for NA12878.

‘This_work’ contains the rates for the comparison between our call set and GIAB. ‘lift-over’ contains the rates for the lift-over call set compared to GIAB. ‘P3’ column contains the rates for the phase three call set compared to GIAB.

chromosome	This_work	lift-over	P3
1	0.99%	1.15%	2.42%
2	0.53%	0.70%	0.70%
3	0.48%	0.77%	0.72%
4	0.50%	0.66%	2.45%
5	0.51%	0.68%	2.30%
6	1.40%	1.59%	2.35%
7	0.74%	0.93%	1.23%
8	0.61%	0.85%	2.16%
9	0.43%	0.70%	2.36%
10	0.89%	1.17%	1.49%
11	0.57%	0.65%	2.50%
12	0.44%	0.63%	0.65%
13	0.44%	0.64%	0.66%
14	0.44%	0.69%	0.68%
15	0.61%	0.77%	0.77%
16	1.17%	1.35%	0.60%
17	1.62%	1.77%	0.82%
18	0.53%	0.74%	2.38%
19	0.37%	0.67%	2.94%
20	0.46%	0.66%	0.59%
21	0.44%	0.58%	0.57%
22	1.52%	1.73%	2.53%
AVG	0.71%	0.91%	1.54%
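
The AVG row of Table 5 appears to be the unweighted mean of the 22 per-autosome rates. The short check below, with the values transcribed from the table, is a sketch of that arithmetic; treating every chromosome equally (rather than weighting by the number of variants) is an assumption that happens to reproduce the reported averages.

```python
from statistics import mean

# Per-autosome switch error rates (%) transcribed from Table 5, chromosomes 1..22.
this_work = [0.99, 0.53, 0.48, 0.50, 0.51, 1.40, 0.74, 0.61, 0.43, 0.89, 0.57,
             0.44, 0.44, 0.44, 0.61, 1.17, 1.62, 0.53, 0.37, 0.46, 0.44, 1.52]
lift_over = [1.15, 0.70, 0.77, 0.66, 0.68, 1.59, 0.93, 0.85, 0.70, 1.17, 0.65,
             0.63, 0.64, 0.69, 0.77, 1.35, 1.77, 0.74, 0.67, 0.66, 0.58, 1.73]
p3 = [2.42, 0.70, 0.72, 2.45, 2.30, 2.35, 1.23, 2.16, 2.36, 1.49, 2.50,
      0.65, 0.66, 0.68, 0.77, 0.60, 0.82, 2.38, 2.94, 0.59, 0.57, 2.53]

# Unweighted means, rounded to two decimals as in the table.
averages = {
    "This_work": round(mean(this_work), 2),
    "lift-over": round(mean(lift_over), 2),
    "P3": round(mean(p3), 2),
}
```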

Table 6. Switch error (SE) rates for phased INDELs from NA12878.

chromosome	This_work	lift-over	P3
1	2.55%	4.85%	9.32%
2	0.66%	1.29%	1.32%
3	0.44%	1.25%	1.23%
4	0.56%	0.93%	7.64%
5	0.57%	1.16%	8.35%
6	4.67%	7.10%	8.22%
7	2.26%	3.26%	3.68%
8	0.95%	2.34%	8.70%
9	0.40%	1.51%	8.67%
10	2.91%	4.55%	5.17%
11	0.71%	0.99%	8.68%
12	0.43%	1.12%	1.26%
13	0.52%	0.89%	0.73%
14	0.51%	1.18%	1.13%
15	0.58%	1.12%	1.17%
16	5.56%	9.14%	1.10%
17	6.35%	10.74%	1.53%
18	0.56%	1.24%	8.97%
19	0.62%	2.01%	12.75%
20	0.86%	1.65%	1.47%
21	0.73%	1.15%	1.24%
22	5.83%	10.04%	11.57%
AVG	1.78%	3.16%	5.18%

Comparison with the Genome in a bottle (GIAB) call set for NA12878

Biallelic SNVs. To assess our biallelic SNV call set and compare it to the final phase of the 1000 Genomes Project, we utilised resources from GIAB. Our strategy compares our GRCh38 calls for NA12878 with the NA12878 calls on GRCh38 from GIAB. In addition, we compared the 1000 Genomes calls for NA12878 to those from GIAB on GRCh37. NA12878 was used in benchmarking as GIAB provides an independent gold-standard data set. For other samples in the 1000 Genomes Project panel, such data is not available, making meaningful benchmarking with other samples impossible. The use of a joint genotyping approach precludes applying our method to a single sample where high quality data is available but a population with low coverage and exome data is not. This limits suitable benchmarks to NA12878 and it should be noted that GIAB’s analysis uses only the primary assembly in alignment, giving a different base from which to make calls, which may include reads that would otherwise have aligned to alt sequences. Within these limitations, this approach enables us to benchmark the performance with an independently produced gold-standard and allows us to apply the equivalent benchmark to data from the 1000 Genomes Project on GRCh37, indicating how our call set compares to that produced by the 1000 Genomes Project.

For NA12878, there are no indications that it is an outlier in the 1000 Genomes Project sequence data. It has similar coverage to other samples at 4.6x compared to an average of 6.2x (standard deviation 2.3) for the WGS and 144.1x relative to an average of 84.9x (standard deviation 34.1) for the exome data, and the same technologies are applied across the data set. Given the prevalence of NA12878 data, we would expect the callers to perform well with this sample but, as the NA12878 data in our work is not exceptional in the data set, we believe the results seen in benchmarking NA12878 are likely to broadly reflect performance across the data set.
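
One way to read the coverage figures above is as standard scores relative to the panel. The sketch below computes them from the numbers quoted in the text; the two-standard-deviation threshold used to describe a sample as unexceptional is an illustrative assumption, not a criterion stated by the authors.

```python
def z_score(value, panel_mean, panel_sd):
    """Number of standard deviations between a sample and the panel mean."""
    return (value - panel_mean) / panel_sd


# Coverage values quoted in the text for NA12878 and the whole panel.
wgs_z = z_score(4.6, 6.2, 2.3)      # low coverage WGS
wes_z = z_score(144.1, 84.9, 34.1)  # exome

# Illustrative threshold (assumption): within two standard deviations of the mean.
within_two_sd = abs(wgs_z) < 2 and abs(wes_z) < 2
```

Under this reading, NA12878 sits about 0.7 standard deviations below the WGS mean and about 1.7 standard deviations above the exome mean.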

We used the NA12878 variants from the multi-sample phased SNV-only VCF and compared them with the GIAB sites on GRCh38 downloaded from [ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/latest] (version 3.3.2). For GRCh37 we compared variants from the final phase of the 1000 Genomes Project (downloaded here) with the GRCh37 GIAB variants obtained here (version 3.3.2). Our comparison is restricted to the regions in the autosomes and in the PAR region of chromosome X where GIAB considers calls to be high confidence (on average 77.9%, standard deviation 12.1%, of the bases for each of the chromosomes are in high confidence regions) and was performed using the Nextflow ²¹ workflow accessible from the link in the software availability section.

The result of our comparison is shown in Table 7. The average percentage of sites among all the chromosomes identified in our work that were also present in GIAB represents 96.4% of the total GIAB sites. This percentage is comparable to 97.9% resulting from the comparison with the final phase of the 1000 Genomes Project (phase three - P3). Additionally, the percentage of sites identified in our call set but not in GIAB is 0.5%, which is comparable to the 0.4% obtained in the comparison with 1000 Genomes phase three.
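
The site-level comparison described above can be pictured as set operations on variant sites. The following sketch is not the Nextflow workflow used in this work; it assumes each site is represented as a (chromosome, position, reference allele, alternate allele) tuple and uses invented toy sites purely for illustration.

```python
def compare_sites(query_sites, truth_sites):
    """Classify sites as shared (TP), truth-only (FN) or query-only (FP)."""
    return {
        "TP": len(query_sites & truth_sites),
        "FN": len(truth_sites - query_sites),
        "FP": len(query_sites - truth_sites),
    }


# Toy data (invented): three truth sites and three query sites, two in common.
truth = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr1", 300, "G", "A")}
query = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr1", 400, "T", "C")}

counts = compare_sites(query, truth)
```

In a real comparison both sets would first be restricted to the GIAB high confidence regions, as described in the text.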

Table 7. Site comparison for NA12878 between our call set and Genome in a Bottle (GIAB), both mapped to GRCh38, and between the 1000 Genomes Project phase three (P3) call set and GIAB, both mapped to GRCh37.

Results are shown for each chromosome. ‘Shared (TP)’ are the true positive variants identified in the compared call sets. ‘giab_only (FN)’ are the false negative variants identified by GIAB only. ‘Thiswork_only (FP)’ are the false positive variants identified in our call set only.

Dataset	Shared (TP)	%shared (TP)	giab_only (FN)	%giab_only (FN)	Thiswork_only (FP)	%Thiswork_only (FP)	Total (GIAB)	Total thiswork_only
Chr1 (b38)	238,323	96.37	8,965	3.63	1,347	0.56	247,288	239,670
Chr1 (b37)	242,331	98.09	4,707	1.91	1,700	0.70	247,038	244,031
Chr2 (b38)	237,017	96.42	8,791	3.58	1,264	0.53	245,808	238,281
Chr2 (b37)	260,921	98.14	4,942	1.86	1,209	0.46	265,863	262,130
Chr3 (b38)	214,201	96.17	8,520	3.83	1,134	0.53	222,721	215,335
Chr3 (b37)	218,474	97.93	4,608	2.07	926	0.42	223,082	219,400
Chr4 (b38)	188,608	96.00	7,860	4.00	847	0.45	196,468	189,455
Chr4 (b37)	232,888	97.93	4,927	2.07	888	0.38	237,815	233,776
Chr5 (b38)	181,015	96.26	7,031	3.74	865	0.48	188,046	181,880
Chr5 (b37)	193,359	95.48	9,162	4.52	766	0.39	202,521	194,125
Chr6 (b38)	197,830	96.04	8,151	3.96	940	0.47	205,981	198,770
Chr6 (b37)	191,018	98.05	3,801	1.95	844	0.44	194,819	191,862
Chr7 (b38)	166,888	96.54	5,982	3.46	854	0.51	172,870	167,742
Chr7 (b37)	167,924	97.98	3,464	2.02	712	0.42	171,388	168,636
Chr8 (b38)	145,748	96.24	5,700	3.76	678	0.46	151,448	146,426
Chr8 (b37)	171,950	97.76	3,937	2.24	715	0.41	175,887	172,665
Chr9 (b38)	131,987	96.42	4,899	3.58	635	0.48	136,886	132,622
Chr9 (b37)	132,596	97.84	2,924	2.16	581	0.44	135,520	133,177
Chr10 (b38)	153,504	96.55	5,480	3.45	815	0.53	158,984	154,319
Chr10 (b37)	153,080	97.87	3,338	2.13	648	0.42	156,418	153,728
Chr11 (b38)	154,516	95.83	6,720	4.17	775	0.50	161,236	155,291
Chr11 (b37)	155,511	97.86	3,407	2.14	609	0.39	158,918	156,120
Chr12 (b38)	136,457	96.46	5,008	3.54	745	0.54	141,465	137,202
Chr12 (b37)	148,026	98.03	2,972	1.97	676	0.45	150,998	148,702
Chr13 (b38)	121,294	96.89	3,889	3.11	560	0.46	125,183	121,854
Chr13 (b37)	122,424	98.08	2,395	1.92	423	0.34	124,819	122,847
Chr14 (b38)	99,613	96.03	4,122	3.97	493	0.49	103,735	100,106
Chr14 (b37)	99,543	97.74	2,300	2.26	434	0.43	101,843	99,977
Chr15 (b38)	85,881	96.59	3,031	3.41	386	0.45	88,912	86,267
Chr15 (b37)	87,224	97.95	1,822	2.05	390	0.45	89,046	87,614
Chr16 (b38)	54,542	96.72	1,850	3.28	282	0.51	56,392	54,824
Chr16 (b37)	92,735	97.92	1,967	2.08	424	0.46	94,702	93,159
Chr17 (b38)	73,765	96.69	2,524	3.31	484	0.65	76,289	74,249
Chr17 (b37)	76,187	98.27	1,341	1.73	441	0.58	77,528	76,628
Chr18 (b38)	73,419	96.89	2,360	3.11	344	0.47	75,779	73,763
Chr18 (b37)	93,004	97.97	1,923	2.03	365	0.39	94,927	93,369
Chr19 (b38)	56,210	95.27	2,788	4.73	461	0.81	58,998	56,671
Chr19 (b37)	59,138	97.93	1,248	2.07	376	0.63	60,386	59,514
Chr20 (b38)	64,786	96.78	2,154	3.22	419	0.64	66,940	65,205
Chr20 (b37)	64,827	97.89	1,400	2.11	275	0.42	66,227	65,102
Chr21 (b38)	42,453	96.96	1,329	3.04	225	0.53	43,782	42,678
Chr21 (b37)	43,941	98.13	836	1.87	178	0.40	44,777	44,119
Chr22 (b38)	33,351	96.81	1,099	3.19	193	0.58	34,450	33,544
Chr22 (b37)	36,132	98.16	678	1.84	207	0.57	36,810	36,339
ChrX (b38) *	109	93.97	7	6.03	2	1.80	116	111
AVG (b38)**	129,609	96.41	4,921	3.59	670	0.53	134,530	130,280
AVG (b37)	138,329	97.86	3,095	2.14	627	0.45	141,424	138,955

* Only PAR regions

** Not considering chrX for the calculation
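
The percentage columns of Table 7 appear to follow from the counts in a simple way: the shared and GIAB-only percentages are taken over the GIAB total (TP + FN), while the call-set-only percentage is taken over the call set total (TP + FP). The sketch below reproduces the Chr1 (b38) row under that assumption; the function name is illustrative.

```python
def site_percentages(tp, fn, fp):
    """Return (%shared, %giab_only, %callset_only), rounded to two decimals."""
    giab_total = tp + fn      # 'Total (GIAB)' column
    callset_total = tp + fp   # last column of the table
    return (
        round(100 * tp / giab_total, 2),
        round(100 * fn / giab_total, 2),
        round(100 * fp / callset_total, 2),
    )


# Counts from the Chr1 (b38) row of Table 7.
chr1_b38 = site_percentages(tp=238_323, fn=8_965, fp=1_347)
```

The first value corresponds to the sensitivity against GIAB, and the last one to the fraction of our calls that GIAB does not support.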

Biallelic SNVs and INDELs. We also compared the extended call set containing SNVs and INDELs with the GIAB NA12878 call set in the same way that we did for our previous SNV-only call set (see above).

Additionally, we included the 1000 Genomes Project variants lifted to GRCh38 by dbSNP in the comparison with the GRCh38 GIAB sites. The lift-over call set used in this comparison is accessible from [ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions]. The results of the comparisons of our call set and the lift-over call set with GIAB on GRCh38, and of P3 with GIAB on GRCh37, are shown in Table 8 for the SNV sites and Table 9 for the INDEL sites. Our integrated call set contains 96.4% of the total GIAB SNV sites. This percentage is similar to the 97.9% resulting from the comparison with P3 and to the 97.0% for the comparison between the lift-over and GIAB. Additionally, 0.5% of the SNV sites we identified were not in GIAB, similar to the 0.4% for P3 and to the 0.5% for the lift-over.

Table 8. SNV-only site comparison for NA12878 between our call set and Genome in a Bottle (GIAB), both mapped to GRCh38; between the lift-over call set (Chr*_L rows in the table) and GIAB, both on GRCh38; and between the 1000 Genomes Project phase three (P3) call set and GIAB, both mapped to GRCh37.

Results are shown for each chromosome. ‘Shared (TP)’ are the true positive variants identified in the compared call sets. ‘giab_only (FN)’ are the false negative variants identified by GIAB only. ‘Thiswork_only (FP)’ are the false positive variants identified in our call set only.

Dataset	Shared (TP)	%shared (TP)	giab_only (FN)	%giab_only (FN)	Thiswork_only (FP)	%Thiswork_only (FP)	Total (GIAB)	Total thiswork_only
Chr1 (b38)	238,340	96.38	8,948	3.62	1,270	0.53	247,288	239,610
Chr1_L (b38)	241,396	97.62	5,892	2.38	1,799	0.74	247,288	243,195
Chr1 (b37)	242,331	98.09	4,707	1.91	1,700	0.7	247,038	244,031
Chr2 (b38)	237,055	96.44	8,753	3.56	1,208	0.51	245,808	238,263
Chr2_L (b38)	240,944	98.02	4,864	1.98	1,232	0.51	245,808	242,176
Chr2 (b37)	260,921	98.14	4,942	1.86	1,209	0.46	265,863	262,130
Chr3 (b38)	214,315	96.23	8,406	3.77	1,027	0.48	222,721	215,342
Chr3_L (b38)	217,446	97.63	5,275	2.37	1,061	0.49	222,721	218,507
Chr3 (b37)	218,474	97.93	4,608	2.07	926	0.42	223,082	219,400
Chr4 (b38)	188,516	95.95	7,952	4.05	873	0.46	196,468	189,389
Chr4_L (b38)	192,186	97.82	4,282	2.18	761	0.39	196,468	192,947
Chr4 (b37)	232,888	97.93	4,927	2.07	888	0.38	237,815	233,776
Chr5 (b38)	180,892	96.2	7,154	3.8	903	0.5	188,046	181,795
Chr5_L (b38)	179,468	95.44	8,578	4.56	756	0.42	188,046	180,224
Chr5 (b37)	193,359	95.48	9,162	4.52	766	0.39	202,521	194,125
Chr6 (b38)	197,693	95.98	8,288	4.02	1,013	0.51	205,981	198,706
Chr6_L (b38)	199,172	96.69	6,809	3.31	1,150	0.57	205,981	200,322
Chr6 (b37)	191,018	98.05	3,801	1.95	844	0.44	194,819	191,862
Chr7 (b38)	166,777	96.48	6,093	3.52	895	0.53	172,870	167,672
Chr7_L (b38)	168,159	97.27	4,711	2.73	793	0.47	172,870	168,952
Chr7 (b37)	167,924	97.98	3,464	2.02	712	0.42	171,388	168,636
Chr8 (b38)	145,659	96.18	5,789	3.82	719	0.49	151,448	146,378
Chr8_L (b38)	147,895	97.65	3,553	2.35	665	0.45	151,448	148,560
Chr8 (b37)	171,950	97.76	3,937	2.24	715	0.41	175,887	172,665
Chr9 (b38)	131,911	96.37	4,975	3.63	678	0.51	136,886	132,589
Chr9_L (b38)	133,365	97.43	3,521	2.57	614	0.46	136,886	133,979
Chr9 (b37)	132,596	97.84	2,924	2.16	581	0.44	135,520	133,177
Chr10 (b38)	153,422	96.5	5,562	3.5	853	0.55	158,984	154,275
Chr10_L (b38)	153,010	96.24	5,974	3.76	699	0.45	158,984	153,709
Chr10 (b37)	153,080	97.87	3,338	2.13	648	0.42	156,418	153,728
Chr11 (b38)	154,414	95.77	6,822	4.23	808	0.52	161,236	155,222
Chr11_L (b38)	156,330	96.96	4,906	3.04	717	0.46	161,236	157,047
Chr11 (b37)	155,511	97.86	3,407	2.14	609	0.39	158,918	156,120
Chr12 (b38)	136,392	96.41	5,073	3.59	771	0.56	141,465	137,163
Chr12_L (b38)	135,131	95.52	6,334	4.48	646	0.48	141,465	135,777
Chr12 (b37)	148,026	98.03	2,972	1.97	676	0.45	150,998	148,702
Chr13 (b38)	121,218	96.83	3,965	3.17	588	0.48	125,183	121,806
Chr13_L (b38)	122,714	98.03	2,469	1.97	475	0.39	125,183	123,189
Chr13 (b37)	122,424	98.08	2,395	1.92	423	0.34	124,819	122,847
Chr14 (b38)	99,551	95.97	4,184	4.03	501	0.5	103,735	100,052
Chr14_L (b38)	99,210	95.64	4,525	4.36	501	0.5	103,735	99,711
Chr14 (b37)	99,543	97.74	2,300	2.26	434	0.43	101,843	99,977
Chr15 (b38)	85,827	96.53	3,085	3.47	421	0.49	88,912	86,248
Chr15_L (b38)	86,887	97.72	2,025	2.28	426	0.49	88,912	87,313
Chr15 (b37)	87,224	97.95	1,822	2.05	390	0.45	89,046	87,614
Chr16 (b38)	54,517	96.68	1,875	3.32	285	0.52	56,392	54,802
Chr16_L (b38)	55,233	97.94	1,159	2.06	264	0.48	56,392	55,497
Chr16 (b37)	92,735	97.92	1,967	2.08	424	0.46	94,702	93,159
Chr17 (b38)	73,701	96.61	2,588	3.39	502	0.68	76,289	74,203
Chr17_L (b38)	73,299	96.08	2,990	3.92	460	0.62	76,289	73,759
Chr17 (b37)	76,187	98.27	1,341	1.73	441	0.58	77,528	76,628
Chr18 (b38)	73,375	96.83	2,404	3.17	371	0.5	75,779	73,746
Chr18_L (b38)	74,194	97.91	1,585	2.09	295	0.4	75,779	74,489
Chr18 (b37)	93,004	97.97	1,923	2.03	365	0.39	94,927	93,369
Chr19 (b38)	56,171	95.21	2,827	4.79	480	0.85	58,998	56,651
Chr19_L (b38)	55,897	94.74	3,101	5.26	422	0.75	58,998	56,319
Chr19 (b37)	59,138	97.93	1,248	2.07	376	0.63	60,386	59,514
Chr20 (b38)	64,800	96.8	2,140	3.2	399	0.61	66,940	65,199
Chr20_L (b38)	65,006	97.11	1,934	2.89	335	0.51	66,940	65,341
Chr20 (b37)	64,827	97.89	1,400	2.11	275	0.42	66,227	65,102
Chr21 (b38)	42,433	96.92	1,349	3.08	229	0.54	43,782	42,662
Chr21_L (b38)	42,830	97.83	952	2.17	197	0.46	43,782	43,027
Chr21 (b37)	43,941	98.13	836	1.87	178	0.4	44,777	44,119
Chr22 (b38)	33,336	96.77	1,114	3.23	209	0.62	34,450	33,545
Chr22_L (b38)	33,418	97	1,032	3	209	0.62	34,450	33,627
Chr22 (b37)	36,132	98.16	678	1.84	207	0.57	36,810	36,339
ChrX (b38) *	112	96.55	4	3.45	2	1.75	116	114
AVG (b38)**	129,560	96.37	4,970	3.63	682	0.54	134,530	130,242
AVG (b38_lifted)**	130,600	97.01	3,931	2.99	658	0.51	134,530	131,258
AVG (b37)**	138,329	97.86	3,095	2.14	627	0.45	141,424	138,955

* Only PAR regions

** Not considering chrX for the calculation

Table 9. INDEL site comparison for NA12878 between our call set and Genome in a Bottle (GIAB), both mapped to GRCh38; between the lift-over call set (Chr*_L rows in the table) and GIAB, both on GRCh38; and between the 1000 Genomes Project phase three (P3) call set and GIAB, both mapped to GRCh37.

Dataset	shared (TP)	% shared (TP)	giab_only (FN)	% giab_only (FN)	Thiswork_only (FP)	%Thiswork_only (FP)	Total (GIAB)	Total thiswork_only
Chr1 (b38)	24,659	63.86	13,954	36.14	3,143	11.30	38,613	27,802
Chr1_L (b38)	27,009	69.95	11,604	30.05	2,261	7.72	38,613	29,270
Chr1 (b37)	25,802	73.03	9,530	26.97	2,171	7.76	35,332	27,973
Chr2 (b38)	24,237	65.05	13,023	34.95	2,995	11.00	37,260	27,232
Chr2_L (b38)	26,504	71.13	10,756	28.87	2,132	7.45	37,260	28,636
Chr2 (b37)	26,856	73.46	9,702	26.54	2,146	7.40	36,558	29,002
Chr3 (b38)	21,768	64.95	11,745	35.05	2,530	10.41	33,513	24,298
Chr3_L (b38)	23,830	71.11	9,683	28.89	1,762	6.88	33,513	25,592
Chr3 (b37)	22,495	73.94	7,930	26.06	1,693	7.00	30,425	24,188
Chr4 (b38)	19,675	68.14	9,200	31.86	2,303	10.48	28,875	21,978
Chr4_L (b38)	21,190	73.39	7,684	26.61	1,460	6.45	28,874	22,650
Chr4 (b37)	24,275	75.40	7,921	24.60	1,694	6.52	32,196	25,969
Chr5 (b38)	18,558	65.74	9,673	34.26	2,202	10.61	28,231	20,760
Chr5_L (b38)	20,330	72.01	7,901	27.99	1,436	6.60	28,231	21,766
Chr5 (b37)	20,813	74.15	7,255	25.85	1,502	6.73	28,068	22,315
Chr6 (b38)	20,711	65.73	10,797	34.27	2,521	10.85	31,508	23,232
Chr6_L (b38)	22,394	71.07	9,114	28.93	1,647	6.85	31,508	24,041
Chr6 (b37)	20,488	74.09	7,163	25.91	1,478	6.73	27,651	21,966
Chr7 (b38)	17,069	64.38	9,444	35.62	2,129	11.09	26,513	19,198
Chr7_L (b38)	18,112	68.31	8,401	31.69	1,389	7.12	26,513	19,501
Chr7 (b37)	17,058	71.70	6,732	28.30	1,354	7.35	23,790	18,412
Chr8 (b38)	14,387	64.00	8,093	36.00	1,761	10.91	22,480	16,148
Chr8_L (b38)	15,467	68.80	7,013	31.20	1,147	6.90	22,480	16,614
Chr8 (b37)	16,164	71.64	6,400	28.36	1,207	6.95	22,564	17,371
Chr9 (b38)	12,410	64.04	6,969	35.96	1,547	11.08	19,379	13,957
Chr9_L (b38)	13,476	69.54	5,903	30.46	1,149	7.86	19,379	14,625
Chr9 (b37)	12,691	72.88	4,722	27.12	1,058	7.70	17,413	13,749
Chr10 (b38)	15,506	64.61	8,492	35.39	1,987	11.36	23,998	17,493
Chr10_L (b38)	16,771	69.88	7,227	30.12	1,341	7.40	23,998	18,112
Chr10 (b37)	15,961	73.18	5,850	26.82	1,285	7.45	21,811	17,246
Chr11 (b38)	15,605	66.40	7,898	33.60	1,845	10.57	23,503	17,450
Chr11_L (b38)	17,013	72.39	6,490	27.61	1,266	6.93	23,503	18,279
Chr11 (b37)	16,071	75.63	5,179	24.37	1,208	6.99	21,250	17,279
Chr12 (b38)	14,366	63.28	8,335	36.72	1,854	11.43	22,701	16,220
Chr12_L (b38)	15,608	68.75	7,093	31.25	1,275	7.55	22,701	16,883
Chr12 (b37)	16,042	73.25	5,859	26.75	1,334	7.68	21,901	17,376
Chr13 (b38)	12,631	68.28	5,869	31.72	1,485	10.52	18,500	14,116
Chr13_L (b38)	13,634	73.70	4,866	26.30	1,039	7.08	18,500	14,673
Chr13 (b37)	12,990	76.03	4,096	23.97	970	6.95	17,086	13,960
Chr14 (b38)	10,344	64.71	5,640	35.29	1,338	11.45	15,984	11,682
Chr14_L (b38)	11,268	70.50	4,716	29.50	1,024	8.33	15,984	12,292
Chr14 (b37)	10,764	74.57	3,670	25.43	896	7.68	14,434	11,660
Chr15 (b38)	8,770	64.28	4,874	35.72	1,052	10.71	13,644	9,822
Chr15_L (b38)	9,746	71.43	3,898	28.57	792	7.52	13,644	10,538
Chr15 (b37)	9,265	74.66	3,145	25.34	728	7.29	12,410	9,993
Chr16 (b38)	4,662	61.17	2,959	38.83	704	13.12	7,621	5,366
Chr16_L (b38)	5,233	68.67	2,388	31.33	520	9.04	7,621	5,753
Chr16 (b37)	8,409	70.44	3,529	29.56	837	9.05	11,938	9,246
Chr17 (b38)	8,053	60.29	5,303	39.71	1,136	12.36	13,356	9,189
Chr17_L (b38)	8,977	67.21	4,379	32.79	867	8.81	13,356	9,844
Chr17 (b37)	8,866	70.94	3,632	29.06	828	8.54	12,498	9,694
Chr18 (b38)	7,618	67.20	3,718	32.80	928	10.86	11,336	8,546
Chr18_L (b38)	7,915	69.82	3,421	30.18	581	6.84	11,336	8,496
Chr18 (b37)	9,482	72.12	3,666	27.88	689	6.77	13,148	10,171
Chr19 (b38)	6,090	56.81	4,630	43.19	896	12.83	10,720	6,986
Chr19_L (b38)	6,620	61.75	4,100	38.25	694	9.49	10,720	7,314
Chr19 (b37)	6,638	66.14	3,398	33.86	701	9.55	10,036	7,339
Chr20 (b38)	6,430	62.55	3,849	37.45	823	11.35	10,279	7,253
Chr20_L (b38)	6,744	65.61	3,535	34.39	559	7.65	10,279	7,303
Chr20 (b37)	6,435	68.82	2,915	31.18	528	7.58	9,350	6,963
Chr21 (b38)	4,752	67.60	2,278	32.40	547	10.32	7,030	5,299
Chr21_L (b38)	5,144	73.17	1,886	26.83	350	6.37	7,030	5,494
Chr21 (b37)	5,104	76.49	1,569	23.51	330	6.07	6,673	5,434
Chr22 (b38)	3,399	60.02	2,264	39.98	479	12.35	5,663	3,878
Chr22_L (b38)	3,764	66.48	1,898	33.52	353	8.57	5,662	4,117
Chr22 (b37)	4,072	69.91	1,753	30.09	362	8.16	5,825	4,434
ChrX (b38)*	15	53.57	13	46.43	5	25.00	28	20
AVG (b38)**	13,259	64.23	7,228	35.77	1,646	11.23	20,487	14,905
AVG (b38_lifted)**	14,398	69.76	6,089	30.24	1,138	7.52	20,487	15,536
AVG (b37)**	14,397	72.84	5,255	27.16	1,136	7.45	19,653	15,534

* Only PAR regions

** Not considering chrX for the calculation

To characterize the GIAB NA12878 SNV sites that were missing from our call set but present in the lift-over call set, we examined whether any of our intermediate files contained evidence of these sites and, if so, why they had been discarded. The result of this analysis is shown in Table 10: most of the missing sites (68.7% on average across all the autosomes) were discarded because of the filtering sensitivity cutoff used with VQSR ApplyRecalibration (--ts_filter_level 99.5) during the final filtering step of the consensus call set.

In the case of INDELs, we identified on average 64.2% of the INDEL sites that are also present in GIAB. This percentage is lower than the 73% obtained for the comparison between P3 and GIAB and lower than the 69.8% for the comparison of the lift-over with GIAB. A possible explanation is that P3 used more algorithms specialized in the identification of INDELs than were used in this work.

Table 10. Analysis of the number of GIAB NA12878 SNV sites present in the lift-over call set not identified in our call set.

The ‘False negatives’ column contains the count of GIAB NA12878 SNV sites that were identified in the lift-over call set and not in our work. The ‘VQSRTrancheSNP99.50to99.90’ column contains the count of false negative sites that were filtered out in our work because VQSR assigned them to the 99.5-99.9 quality tranche. The ‘VQSRTrancheSNP99.90to100.00’ column contains the count of false negative sites that were filtered out in our work because VQSR assigned them to the 99.9-100.0 quality tranche. The higher the tranche, the higher the sensitivity and the lower the specificity of our call set. The ‘% explained’ column contains the percentage of false negatives that were discarded in our work by the VQSR filtering procedure.

Dataset	False negatives	VQSRTrancheSNP99.50to99.90	VQSRTrancheSNP99.90to100.00	% explained
Chr1	5,224	3,666	138	72.82
Chr2	5,071	3,476	75	70.03
Chr3	4,635	3,198	92	70.98
Chr4	4,592	3,274	98	73.43
Chr5	3,949	2,808	66	72.78
Chr6	4,916	3,489	123	73.47
Chr7	3,462	2,338	78	69.79
Chr8	3,246	2,155	65	68.39
Chr9	2,652	1,750	42	67.57
Chr10	2,893	1,870	64	66.85
Chr11	3,896	2,768	91	73.38
Chr12	2,802	1,937	69	71.59
Chr13	2,024	1,261	32	63.88
Chr14	2,177	1,463	48	69.41
Chr15	1,643	1,028	33	64.58
Chr16	977	528	17	55.78
Chr17	1,359	835	37	64.16
Chr18	1,261	821	25	67.09
Chr19	1,401	929	38	69.02
Chr20	1,051	623	18	60.99
Chr21	728	477	10	66.9
Chr22	628	477	10	77.55
AVG %	2,753.95	1,871.41	57.68	68.66
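
The ‘% explained’ column of Table 10 appears to be the sum of the two tranche counts divided by the number of false negatives. The sketch below reproduces three rows of the table under that assumption; the function name is illustrative.

```python
def percent_explained(false_negatives, tranche_99_5_to_99_9, tranche_99_9_to_100):
    """Share of false negatives removed by VQSR filtering, as a percentage."""
    filtered = tranche_99_5_to_99_9 + tranche_99_9_to_100
    return round(100 * filtered / false_negatives, 2)


# Counts transcribed from Table 10.
chr1 = percent_explained(5_224, 3_666, 138)
chr21 = percent_explained(728, 477, 10)
chr22 = percent_explained(628, 477, 10)
```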

Comparison of updated clinical loci

We further compared our call set and the lift-over call set in the regions identified by Schneider et al. ³ with assembly updates in GRCh38. The authors looked at the intersection of the transcripts having problems in the alignment with GRCh37 with two lists of clinically relevant genes: a set of genes enriched for de novo loss of function mutations identified in Autism Spectrum Disorder (n = 1003) ²² and a collection of genes preliminarily proposed for the development of a medical exome kit (n = 4623) ( https://www.genomeweb.com/diagnostics/emory-chop-harvard-develop-medical-exome-complete-coverage-5k-disease-associated-genes). Schneider et al. ³ show in their analysis that there were 14 genes from these two lists for which the alignment issues disappear when GRCh38 is used (see Table 11). Unsurprisingly, when viewing these regions, we see an absence of variation in the lift-over while calls have been made in the de novo analysis. This is illustrated in Figure 4.

Figure 4. — ‘ *1KG native*’: call set presented in this work; ‘ *1KG All SNPs/indels*’: lift-over call set.

Table 11. Autism Spectrum Disorder genes ²² and Medical Exome Kit genes (https://www.genomeweb.com/sequencing/emory-chop-harvard-develop-medical-exomekit-complete-coverage-5k-disease-associ) that had transcript alignment issues with GRCh37 but not with GRCh38.

RefSeq release version 71. ‘True’ indicates presence in the relevant gene list. ‘False’ indicates absence.

GeneID	GeneSymbol	MedicalExome	[22]
984	CDK11B	False	True
10320	IKZF1	True	True
3635	INPP5D	True	True
3800	KIF5C	False	True
102724631	POTEB3	False	True
23380	SRGAP2	False	True
1804	DPP6	True	False
100134444	KCNJ18	True	False
5645	PRSS2	True	False
374462	PTPRQ	True	False
259291	TAS2R45	True	False
283953	TMEM114	True	False
117581	TWIST2	True	False

Novel GRCh38 contigs

We have also analysed the number of SNVs located in the new contigs added to GRCh38 to update sequence or fill gaps present in GRCh37. The coordinates for these new contigs were obtained using UCSC’s table browser ²³, retrieving the data for the Hg19Diff track from the hg38ContigDiff primary table. Only the records with score=0, which correspond to the coordinates of the new contigs added to GRCh38 to update sequence or fill gaps present in GRCh37, were considered. Table 12 compares the number of SNVs identified in the new contigs with the number in regions that were already present in GRCh37. Of the SNVs located in the new GRCh38 contigs, a larger share comes from our call set than from the lift-over call set (55.7% vs 44.3%), whereas for the rest of the genome our call set contributes the smaller share (48% vs 52%).
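
The selection of score=0 records described above could be sketched as follows. This is an assumption-laden illustration: the column names (`chrom`, `chromStart`, `chromEnd`, `name`, `score`, `strand`) follow the usual BED-style layout of a table browser export but have not been checked against the actual hg38ContigDiff schema, and the toy rows are invented.

```python
import csv
import io

# Toy stand-in for a tab-separated table browser export.
# Column names are assumed and all coordinates are invented.
TOY_EXPORT = """chrom\tchromStart\tchromEnd\tname\tscore\tstrand
chr1\t1000\t2000\tcontigA\t0\t+
chr1\t5000\t6000\tcontigB\t1000\t+
chr2\t300\t900\tcontigC\t0\t-
"""


def novel_contig_intervals(handle):
    """Keep only records with score == 0 and return (chrom, start, end) tuples."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        (row["chrom"], int(row["chromStart"]), int(row["chromEnd"]))
        for row in reader
        if int(row["score"]) == 0
    ]


novel = novel_contig_intervals(io.StringIO(TOY_EXPORT))
```

The resulting intervals could then be used to split a call set into sites inside and outside the new contigs.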

Table 12. Number of biallelic SNVs in our call set (‘This_work’) and in the ‘Lift-over’ call set.

‘novel’ represents the new contigs added to GRCh38, whereas ‘existing’ represents the rest of the genomic regions that were already present in GRCh37.

Region	This_work	Lift-over	Total
novel	1,019,976 (55.7%)	811,817 (44.3%)	1,831,793 (100%)
existing	70,809,835 (48%)	76,588,820 (52%)	147,398,655 (100%)
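
The percentages in Table 12 appear to be each call set's share of the row total, that is, of the sum of the two call sets' SNV counts in that region. The sketch below reproduces them under that assumption.

```python
def shares(this_work, lift_over):
    """Percentage share of each call set in the combined count for a region."""
    total = this_work + lift_over
    return (round(100 * this_work / total, 1), round(100 * lift_over / total, 1))


# Counts transcribed from Table 12.
novel_shares = shares(1_019_976, 811_817)
existing_shares = shares(70_809_835, 76_588_820)
```

Because the denominator adds the two call sets together, sites present in both are counted twice; the percentages therefore compare call set sizes per region rather than measure overlap.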

Concordance with Genome in a bottle (GIAB) NA12878 in novel regions. We have also examined the overlap for biallelic SNV sites identified in sample NA12878 between the GIAB sites on the new GRCh38 contigs, our call set and the lift-over call set. Figure 5 shows a bar plot of the percentage of sites overlapping with GIAB. This percentage is greater in our call set for all the autosomes except chromosome 14; for chromosome 10 it reaches 90% in our call set but only 9% in the lift-over call set. This demonstrates that calling directly on GRCh38 can produce calls that are more reliable than a lift-over for novel regions.

Figure 5. — ‘ *TP_igsr*’ is the percentage of true positives for our call set. ‘ *TP_liftover*’ is the percentage of true positives for the lift-over call set.

Call set performance summary

The benchmarking results show that, unsurprisingly given the breadth of callers and the extensive integration and filtering work, phase three of the 1000 Genomes Project performed best in comparison to GIAB on GRCh37. Further, we see only slightly diminished performance from the lift-over when judging on genome-wide metrics. Given that only some regions of the primary assembly have changed and that the benchmark (GIAB), like the original data set, does not interact with the alts, this may also not be wholly surprising. This picture, however, does change when looking in detail at improved regions of the assembly. Here, as expected, we see regions where the lift-over contains no calls, because the sequence was not in GRCh37 and therefore could not be called on, whereas our work demonstrates that calls can be made in these regions.

In assessing the de novo call set, it seems that the reduced range of callers and simplified methodology, combined with a conservative filtering approach, mean that, relative to phase three, the GRCh38 de novo call set has slightly reduced sensitivity. However, its performance is of a similar order to that of the original phase three call set and the lift-over, while providing a consistent analysis of the data across the improved assembly, including some clinically significant novel regions where calls were not previously made.

As sequencing has progressed since the 1000 Genomes Project, it is also interesting to compare to modern data types. We looked at the calls recently released by the New York Genome Center (NYGC) for NA12878 which are part of a GATK HaplotypeCaller call set for the 2504 member phase three panel, which has been resequenced to 30x coverage ( http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/). We retrieved the NA12878 calls and compared them to the GIAB GRCh38 call set. The average percentage of SNV sites identified by GATK HC in the high coverage data across all the chromosomes that were also present in GIAB represents 94.2% of the total GIAB sites, which is slightly lower than the 96.4% obtained in our work. In the case of INDEL sites, GATK HC identified an average of 42.1% of the total GIAB INDEL sites, which is lower than the 64.2% that we obtained for our call set. We anticipate that, as analysis of the high coverage data progresses, those outputs will replace the work described here but note that our approach achieves comparable results to those of a modern production pipeline.

Data availability

The variants resulting from this work are available in the European Variation Archive under accession number PRJEB31735.

This call set is also available from the International Genome Sample Resource (IGSR) ⁴ at: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20190312_biallelic_SNV_and_INDEL/.

Software availability

Task	Codebase	Documentation	Licence	DOI	Ref.
eHive (workflow system)	https://github.com/Ensembl/ensembl-hive	https://ensembl-hive.readthedocs.io/en/version-2.5/	Apache 2.0	NA	17
BAM quality control	https://github.com/igsr/igsr_analysis/tree/v1.0.0/PyHive/BamQC	WGS BAM QC: https://igsr-analysis.readthedocs.io/en/latest/workflows/wgs_bamqc_pipeline.html; WES BAM QC: https://igsr-analysis.readthedocs.io/en/latest/workflows/wes_bamqc_pipeline.html	Apache 2.0	http://doi.org/10.5281/zenodo.2573911	18
Variant discovery	https://github.com/EMBL-EBI-GCA/reseqtrack/tree/master/modules/ReseqTrack/Hive	https://github.com/EMBL-EBI-GCA/reseqtrack/blob/master/docs/variantcalling_pipeline.txt	Apache 2.0	https://doi.org/10.5281/zenodo.2573969	19
Variant filtering	https://github.com/igsr/igsr_analysis/tree/v1.0.0/PyHive/PipeConfig/FILTER	BCFtools WGS variant filtering pipeline: https://igsr-analysis.readthedocs.io/en/latest/workflows/bcftools_wgs_filtering_pipeline.html; BCFtools WES variant filtering pipeline: https://igsr-analysis.readthedocs.io/en/latest/workflows/bcftools_wes_filtering_pipeline.html; Freebayes variant filtering pipeline: https://igsr-analysis.readthedocs.io/en/latest/workflows/freebayes_filtering_pipeline.html; GATK variant filtering pipeline: https://igsr-analysis.readthedocs.io/en/latest/workflows/gatk_vc_filtering_pipeline.html	Apache 2.0	http://doi.org/10.5281/zenodo.2573911	18
Variant integration	https://github.com/igsr/igsr_analysis/blob/v1.0.0/PyHive/PipeConfig/INTEGRATION/VCFIntegrationGATKUG.pm	https://igsr-analysis.readthedocs.io/en/latest/workflows/consensus_callset_pipeline.html	Apache 2.0	http://doi.org/10.5281/zenodo.2573911	18
Phasing	https://github.com/igsr/igsr_analysis/blob/v1.0.0/PyHive/PipeConfig/INTEGRATION/PHASING.pm	https://igsr-analysis.readthedocs.io/en/latest/workflows/phasing_pipeline.html	Apache 2.0	http://doi.org/10.5281/zenodo.2573911	18
Benchmarking using Genome in a Bottle	https://github.com/igsr/igsr_analysis/blob/v1.0.0/scripts/VCF/QC/compare_with_giab.nf	https://igsr-analysis.readthedocs.io/en/latest/workflows/compare_with_giab_pipeline.html	Apache 2.0	http://doi.org/10.5281/zenodo.2573911	18

Acknowledgements

We would like to thank Petr Danecek (Matthew Hurles Group, Wellcome Sanger Institute), Erik Garrison (Durbin Group, Wellcome Sanger Institute) and Tommy Carstensen (Global Health & Population Science, Department of Medicine, University of Cambridge) for participating in discussions on the methodology used in this work. We thank Shane McCarthy (Department of Genetics, University of Cambridge) for detailed advice and discussion of the project plan. We would also like to thank Zamin Iqbal (Iqbal group, EMBL-EBI) for discussions on the project methodology and outputs. In addition, our thanks go to the Systems Infrastructure team of EMBL-EBI for providing continuous support and maintenance of the computing infrastructure required to complete this work. Finally, we would like to thank Tommy Carstensen for providing the liftover of the array data used for the phasing of the variants identified in this work.

Members of the 1000 Genomes Project Consortium are listed in the Supplementary Note, contained within the Supplementary Text and Figures of Poznik et al. ²⁴

Author information

Xiangqun Zheng-Bradley is currently at ‘Illumina Center, Illumina UK Ltd., Cambridge, UK’.

Funding Statement

This work was completed thanks to the funding from the Wellcome Trust (grant number 104947) and the European Molecular Biology Laboratory.

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

[version 2; peer review: 2 approved]

References

1. 1000 Genomes Project Consortium, Auton A, Brooks LD, et al.: A global reference for human genetic variation. Nature. 2015;526(7571):68–74. doi: 10.1038/nature15393
2. Zheng-Bradley X, Flicek P: Applications of the 1000 Genomes Project resources. Brief Funct Genomics. 2017;16(3):163–170. doi: 10.1093/bfgp/elw027
3. Schneider VA, Graves-Lindsay T, Howe K, et al.: Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Res. 2017;27(5):849–864. doi: 10.1101/gr.213611.116
4. Fairley S, Lowy-Gallego E, Perry E, et al.: The International Genome Sample Resource (IGSR) collection of open human genomic variation resources. Nucleic Acids Res. 2019; [cited 7 Oct 2019]. doi: 10.1093/nar/gkz836
5. Cunningham F, Achuthan P, Akanni W, et al.: Ensembl 2019. Nucleic Acids Res. 2019;47(D1):D745–D751. doi: 10.1093/nar/gky1113
6. Zheng-Bradley X, Streeter I, Fairley S, et al.: Alignment of 1000 Genomes Project reads to reference assembly GRCh38. Gigascience. 2017;6(7):1–8. doi: 10.1093/gigascience/gix038
7. 1000 Genomes Project Consortium, Abecasis GR, Altshuler D, et al.: A map of human genome variation from population-scale sequencing. Nature. 2010;467(7319):1061–1073. doi: 10.1038/nature09534
8. 1000 Genomes Project Consortium, Abecasis GR, Auton A, et al.: An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491(7422):56–65. doi: 10.1038/nature11632
9. Maccari G, Robinson J, Ballingall K, et al.: IPD-MHC 2.0: an improved inter-species database for the study of the major histocompatibility complex. Nucleic Acids Res. 2017;45(D1):D860–D864. doi: 10.1093/nar/gkw1050
10. Jun G, Flickinger M, Hetrick KN, et al.: Detecting and estimating contamination of human DNA samples in sequencing and array-based genotype data. Am J Hum Genet. 2012;91(5):839–848. doi: 10.1016/j.ajhg.2012.09.004
11. McKenna A, Hanna M, Banks E, et al.: The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20(9):1297–1303. doi: 10.1101/gr.107524.110
12. Zook JM, Chapman B, Wang J, et al.: Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nat Biotechnol. 2014;32(3):246–251. doi: 10.1038/nbt.2835
13. Tan A, Abecasis GR, Kang HM: Unified representation of genetic variants. Bioinformatics. 2015;31(13):2202–2204. doi: 10.1093/bioinformatics/btv112
14. Browning SR, Browning BL: Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am J Hum Genet. 2007;81(5):1084–1097. doi: 10.1086/521987
15. Delaneau O, Marchini J, Zagury JF: A linear complexity phasing method for thousands of genomes. Nat Methods. 2011;9(2):179–181. doi: 10.1038/nmeth.1785
16. Delaneau O, Marchini J, 1000 Genomes Project Consortium, et al.: Integrating sequence and array data to create an improved 1000 Genomes Project haplotype reference panel. Nat Commun. 2014;5:3934. doi: 10.1038/ncomms4934
17. Severin J, Beal K, Vilella AJ, et al.: eHive: an artificial intelligence workflow system for genomic analysis. BMC Bioinformatics. 2010;11:240. doi: 10.1186/1471-2105-11-240
18. Lowy E, GabeAldam, Fairley S: igsr/igsr_analysis: First release of code (Version v1.0.0). Zenodo. 2019. doi: 10.5281/zenodo.2573911
19. istreeter, Richardson D, HollyZB, et al.: EMBL-EBI-GCA/reseqtrack: zenodo (Version zenodo). Zenodo. 2019. doi: 10.5281/zenodo.2573969
20. Patterson M, Marschall T, Pisanti N, et al.: WhatsHap: Weighted Haplotype Assembly for Future-Generation Sequencing Reads. J Comput Biol. 2015;22(6):498–509. doi: 10.1089/cmb.2014.0157
21. Di Tommaso P, Chatzou M, Floden EW, et al.: Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017;35(4):316–319. doi: 10.1038/nbt.3820
22. Samocha KE, Robinson EB, Sanders SJ, et al.: A framework for the interpretation of de novo mutation in human disease. Nat Genet. 2014;46(9):944–950. doi: 10.1038/ng.3050
23. Karolchik D, Hinrichs AS, Furey TS, et al.: The UCSC Table Browser data retrieval tool. Nucleic Acids Res. 2004;32(Database issue):D493–6. doi: 10.1093/nar/gkh103
24. Poznik GD, Xue Y, Mendez FL, et al.: Punctuated bursts in human male demography inferred from 1,244 worldwide Y-chromosome sequences. Nat Genet. 2016;48(6):593–599. doi: 10.1038/ng.3559

Wellcome Open Res. 2020 Mar 4. doi: 10.21956/wellcomeopenres.17129.r37471

Reviewer response for version 2

Deanna M Church ¹

The authors have revised this manuscript nicely. While I still think the lack of calls on the X and Y is unfortunate, the authors do a good job of explaining this. The addition of indels and phasing is welcome. The manuscript is much clearer and easier to follow, and makes a better case for why going to the trouble of reanalyzing data rather than performing a lift-over is worth the effort.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Wellcome Open Res. 2020 Jan 16. doi: 10.21956/wellcomeopenres.17129.r37455

Reviewer response for version 2

Augusto Rendon ¹,²

The authors have revised the document and added the INDEL call set.

In my opinion, the authors have provided sufficient data and analyses to enable the reader to understand the caveats associated with the data prepared by them.

I have one small remaining concern. I would like it if the authors clarify the behaviour of their processing when dealing with multinucleotide variants (MNVs). Are they excluded? If not, then the normalisation and merging approaches may render spurious results.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Wellcome Open Res. 2019 Apr 10. doi: 10.21956/wellcomeopenres.16504.r35054

Reviewer response for version 1

Augusto Rendon ¹,²

Summary

The authors present a new call set from the 1000 Genomes Project. This time the call set is a fresh recall of the data against GRCh38. The call set is only composed of biallelic SNPs (biallelic in this study). Previous variant call sets on GRCh38 had been lifted over from their GRCh37 native counterparts. The study identifies SNPs by first calling variants with several algorithms, creating a union set and then explicitly genotyping these sites across the complete data set. The genotypes are then phased. A comparison with GIAB data for NA12878 is performed to assess the sensitivity and specificity of the call set.

General observations

A native call set of the 1000 genomes project data on GRCh38 is quite an important effort. These data have many important downstream uses including clinical genomics uses for variant filtering and population genomics studies. Notwithstanding the importance of this data set, I have issues with the data note as it stands. I think the paper simply described what the authors did to generate these data but makes little effort to explain why it was done in this way. The latter is important to gain confidence in the call set.

As a user of the data I would look to be satisfied with the quality of the call set. This is particularly important as several other large scale projects have released allele frequencies of many thousands of deep sequenced whole genomes (e.g. TOPMed and gnomAD). In parallel, frameworks for assessing the analytical performance of variant calls have been proposed by GIAB and GA4GH and recently published. Truth sets for several important samples beyond NA12878 have been released, including the ability to assess how well phased the data sets are. As explained in the specific comments below, I believe more work should have been done to convince the reader that sufficient due diligence has been applied to this data set.

From the perspective of having a native call set on GRCh38 it is a shame that there is very little to show the potential benefits of this call set on GRCh38. There is no mention [or I really missed it] about how alternative haplotypes were handled and the implications of having variants in alternative haplotypes. The data release also ignores any non diploid areas of the genome.

From a methodological perspective, the tool chain feels outdated, with BCFtools and GATK being at least 2 years old. Thresholds are widely used but little work is done to explain how these thresholds are determined. I assume that the parameters used in the tools themselves are standard or perhaps defined in previous iterations of the project. However, the filtering thresholds in Tables 1-4, at least from the reading, sound like they were plucked out of thin air.

On a more positive note, I would like to commend the authors for the great effort made to make the code available, organise it, document it, and make this data set reproducible.

Specific observations

*Tool chain is really outdated*. Is this because there have been few changes to the specific algorithms used (mpileup and UnifiedGenotyper)? I assume that many improvements and bugfixes have appeared in the two-plus years since these versions were released.
*Demonstrating improvements on GRCh38 and why it is better than the lift over*. I missed a figure showing how this was an improvement, for example a comparison of the lifted-over and the native call sets. Are there regions of the genome where they perform differently? Was it worth the effort? What about the ALTs: do we now have better allele frequencies in these regions? How do they affect the corresponding frequencies on the primary assembly?
*Variant normalisation*. This is glossed over in the text and little is said about how it was performed. In my experience this step often introduces difficult tradeoffs. Please expand on this.
*Benchmarking*. This area is quite lacking here. I understand that you are only looking at biallelic sites so comparison with a truth set is easier. However, standards for doing this have existed for over a year and were recently published. They are based on comparing at the haplotype level and not the site level.
*Phasing*. These standards also enable comparison of phasing accuracy. Given the effort made here to phase the genotypes, it would be helpful to also benchmark the phasing.
*Union set*. I would have loved to see a figure that shows how the various variant callers contributed to the union call set. Is there one that is unnecessary? Is there one that is responsible for many of the false positives?
*Analytical performance*. "Taken together, these results demonstrate both the high sensitivity and high specificity of our callset". You seem to have fewer TPs and more FNs than in the GRCh37 release. Nowadays these numbers do not represent high sensitivity and specificity (at least with 30X genomes). At the very least, a more nuanced discussion is required here. This is strongly linked to the thresholds you chose for various filtering steps.
*Truth sets*. It would be very helpful to add further truth sets, for example for the Ashkenazim and Chinese trios.
*Joint calling*. The approach here was to do single-sample calling, assemble a union set and then genotype those sites. Could the authors please explain why this approach was used as opposed to joint calling? I would assume that for low-coverage genomes, joint calling may be more powerful as it can leverage information across more samples.
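
To make the union-set question above concrete, the following minimal sketch counts how many sites in a union set are supported by each combination of callers. It is an illustration only: the function name `caller_contributions` and the toy sites are hypothetical and are not taken from the call set, and sites are assumed to be keyed as (chromosome, position, reference allele, alternate allele).

```python
from collections import Counter


def caller_contributions(calls_by_caller):
    """Count union-set sites by the exact combination of callers reporting them."""
    membership = {}
    for caller, sites in calls_by_caller.items():
        for site in sites:
            membership.setdefault(site, set()).add(caller)
    # Key each site by the sorted tuple of callers that reported it.
    return Counter(tuple(sorted(callers)) for callers in membership.values())


# Hypothetical toy input; each site is (chrom, pos, ref, alt).
toy_calls = {
    "bcftools": {("chr20", 100, "A", "G"), ("chr20", 200, "C", "T")},
    "freebayes": {("chr20", 100, "A", "G"), ("chr20", 300, "G", "A")},
    "gatk": {("chr20", 100, "A", "G"), ("chr20", 200, "C", "T")},
}
contributions = caller_contributions(toy_calls)
for combo, n_sites in sorted(contributions.items()):
    print("+".join(combo), n_sites)
```

A table of this kind, computed on the real per-caller site lists, would show directly whether any single caller adds few unique sites or dominates the caller-specific ones.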

I confirm that I have read this submission and believe that I have an appropriate level of expertise to state that I do not consider it to be of an acceptable scientific standard, for reasons outlined above.

References

1. Best practices for benchmarking germline small-variant calls in human genomes. Nat Biotechnol. 2019. doi: 10.1038/s41587-019-0054-x
2. An open resource for accurately benchmarking small variant and reference calls. Nat Biotechnol. 2019. doi: 10.1038/s41587-019-0074-6

Wellcome Open Res. 2019 Dec 6.

Ernesto Lowy ¹

First, we would like to thank the reviewer for the feedback that has been provided and for giving this work their attention.

In the general observations the reviewer notes that the data note is limited to a description of what was done, without significant discussion of the rationale behind the approach. We note that this is a data note, intended to describe a data set and how it was produced; however, we have also amended the text to include more detail relating to the rationale behind our approach.

Also, in general observations, the reviewer highlights issues relating to data quality: other call sets, the limitation of comparing to only NA12878 and benchmarking phasing. While other call sets, such as TOPMed and gnomAD, exist, the 1000 Genomes Project data set remains unique in terms of its composition of populations and in that all data can be accessed down to the base pair level. With regard to other truth sets, beyond NA12878, we were unable to locate “gold-standard” data for any other samples in our data set. With regard to phasing, we have used the WhatsHap program to assess this and added the results to the manuscript.

With regard to GRCh38, our aim was not to demonstrate the superiority of GRCh38 but rather to provide a resource for those wishing to use that assembly. We believe the benefits of the assembly have been demonstrated by Schneider et al.

We have amended the text to make it clearer that calling was not done on alternate loci. The text has also been amended to describe the intention of using the data note format to release and describe data early, with the intention being to revisit elements not included in this set.

With regard to the tool chain, this reflects the volume of compute involved in this work. It was around two years ago that this work commenced. The final stage of the pipeline alone takes around six months to run, even with access to generous compute resources.

The text has been amended to include information on threshold selection.

With regard to the detailed points:

1) Tool chain is really outdated

Software versions reflect the length of time required to run this compute, as noted above.

2) Demonstrating improvements on GRCh38 and why it is better than the lift over

As noted above, our intention was not to demonstrate that GRCh38 was the better assembly. We consider that this has been done previously by Schneider et al. We have added a comparison with the liftover.

3) Variant normalisation

We have updated the text to include further information on this.
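
As an illustration of the kind of normalisation at issue, the following minimal sketch left-aligns and trims a single biallelic variant in the manner described by Tan et al. (Unified representation of genetic variants). It is not the pipeline's implementation: the function name `normalize_variant` and the toy reference sequence are assumptions made for this example, and positions are taken to be 1-based.

```python
def normalize_variant(pos, ref, alt, ref_seq):
    """Left-align and trim a biallelic variant.

    pos is the 1-based position of the first base of ref;
    ref_seq is the reference sequence of the chromosome as a string.
    """
    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both alleles one base to the left.
        if not ref or not alt:
            if pos == 1:
                raise ValueError("cannot extend left of the first reference base")
            pos -= 1
            base = ref_seq[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
    # Drop shared leading bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Toy reference with a CA repeat; the deletion is first given right-shifted.
toy_ref = "GGGCACACACAGGG"
normalized = normalize_variant(8, "CAC", "C", toy_ref)
print(normalized)
```

The same deletion can be written at several positions within the repeat; normalising before merging makes records from different callers compare equal when they describe the same change.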

4) Benchmarking

We note that hap.py was published after this work was submitted. We were unable, however, to establish from the manuscript how it could be used to improve upon the existing benchmarking. The summary provided in Figure 1 (https://www.nature.com/articles/s41587-019-0054-x) indicates that it wraps tools for arriving at consistent representation of variants (handled in the normalisation steps of our pipeline at the point of producing the consensus call set) and then produces a “standardised” report, providing similar metrics to those we present. From this, it appears to provide similar functionality to steps already present in our work. Our attempt to contact the authors for more information on this was not successful.

With regard to our decision to use a “truth set”, our belief is that comparing to an independently produced “gold-standard” is a valuable benchmarking strategy.

We have extended the benchmarking using WhatsHap.

5) Phasing

This has been done with WhatsHap and the results added. As noted above, our attempt to contact the author of hap.py to establish how it could be used to benchmark phasing was sadly unsuccessful.

6) Union set

We have added the requested figure.

7) Analytical performance

These statements were made in the context of comparison with the phase three call set. The text has been amended. We have also compared with an initial analysis of new 30x coverage data produced by standard pipelines at NYGC. Based on our benchmark, the performance of our call set is slightly better. We agree that filtering has an impact here.
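
For readers wishing to relate true positive (TP), false positive (FP) and false negative (FN) counts to the terms used in this discussion, the following minimal sketch derives sensitivity and precision from a site-level comparison against a truth set. The function name `benchmark_metrics` and the counts are invented for illustration and are not results from this work; precision is shown because true negatives are not usually defined in such comparisons.

```python
def benchmark_metrics(tp, fp, fn):
    """Return sensitivity, precision and false discovery rate from site counts."""
    if tp + fn == 0 or tp + fp == 0:
        raise ValueError("need at least one truth site and one called site")
    sensitivity = tp / (tp + fn)  # fraction of truth sites that were called
    precision = tp / (tp + fp)    # fraction of called sites found in the truth set
    return {
        "sensitivity": sensitivity,
        "precision": precision,
        "false_discovery_rate": 1.0 - precision,
    }


# Invented counts, for illustration only.
metrics = benchmark_metrics(tp=90, fp=10, fn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```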

8) Truth sets

Our calling approach used low-coverage and exome data from the period of approximately 2008 to 2012 to perform joint genotyping. We believe that there would be questions about the validity of trying to benchmark our results with samples that simultaneously (1) are not part of one of our populations, (2) have different relatedness to other samples in the population, and (3) have different data types available for variant calling. This excludes the Ashkenazi samples as a suitable benchmark. For the Han Chinese samples, we could not locate data that matches the profile of that for our samples. We have updated the text in an effort to improve the discussion of issues relating to benchmarking.

9) Joint calling

This did use joint calling. The text has been amended to make this clearer.

Wellcome Open Res. 2019 Apr 8. doi: 10.21956/wellcomeopenres.16504.r35051

Reviewer response for version 1

Deanna M Church ¹

Summary

In the work entitled 'Variant calling on the GRCh38 assembly with the data from phase three of the 1000 Genomes Project', Lowy-Gallego et al. describe their effort to re-analyze the 1000 genomes data on the current GRCh38 assembly. They do not perform a full variant analysis, but rather release a set of biallelic SNVs as a preliminary variant call set. They compare this variant set to the Genome in a Bottle (GIAB) variant calls on the sample NA12878.

High Level Comments

It is great to see efforts to update important datasets onto the current human reference assembly, GRCh38. As the authors note, the GRCh38 reference represents a substantial improvement over the GRCh37 reference, but the lack of GRCh38 based annotation has hindered adoption of this version of the reference assembly. The authors go on to discuss why 'lift-over' based approaches are inadequate, which motivated this work. I agree that 'lift-over' based approaches are inadequate, but I find the results presented in this manuscript unconvincing with respect to this assertion.

The authors spend significant space in the introduction explaining the improvements in GRCh38, including the addition of alternate loci, but then put no effort into demonstrating why these are valuable. Additionally, the authors spend time discussing why 'lift-over' based approaches are inadequate, but then do no comparisons to show why their de novo approach is an improvement.

While I believe this work is important, I feel the authors fail to make the point of why doing this de novo analysis on the GRCh38 reference is important.

Detailed Comments

1. Explanation of why 'lift-over' approaches have limitations: I agree with the statement that 'lift-over' is inadequate. However, the description of this on page 1 is not clear. Statement 1 'they rely on an equivalent region existing in the new genome, so new sequence in the improved assembly is effectively excluded' confounds two points. Regions that are present on the old assembly but not the new one will be excluded from a 'lift-over' approach. Additionally, sequence that is new on the updated reference will also be omitted - but these are two separate cases.

Point 2, relating to alignments, also confounds multiple issues. Yes - correct alignments are key to the 'lift-over' approach, but there are two 'bad alignment' cases. The case I think the manuscript is referring to is a case where increased diversity in one version of the assembly can confound alignments (that is, sequence change). The other relevant case is the addition of paralogous sequence to one assembly that is missing from the other. This can lead to a locus aligning to a paralogous region rather than the equivalent locus (I have seen examples of this), which can also lead to incorrect 'lift-over'. Point three in this statement is a clear statement, but the authors provide no evidence to actually support this.

2. The authors only provide biallelic SNPs: I can see the utility in concentrating on a restricted set of variants, but only if this set of data are actually used to demonstrate the value of de novo analysis over 'lift-over', which was not done in this manuscript. On page 3, the authors state that "These represent the major part of the SNVs present in the human genome." but I'd like more hard numbers on this. What percentage of all SNVs do the biallelics represent? What percentage of all variation do they represent?

3. Page 3, Quality control of alignment files: Are the steps presented here just the differences from the original protocol? I think this is OK, but it is not clear from reading the manuscript if this is the full set of steps or just the differences.

4. Variant discovery: Why did you use the variant calling tools you chose?

5. Variant Filtering: The omission of variants from the sex chromosomes seems like a significant omission and limits the use of this dataset.

6. Data set validation: I have significant concerns here. I understand why NA12878 was used for some validation. However, my understanding is that the GIAB dataset does not take the alternate loci into account in their variant calling, while this manuscript tries to take advantage of these sequences - how did this impact the comparison? For example, I would predict more conflicts in regions where alt-loci exist in GRCh38. Does this occur?

I am also not convinced that accuracy on NA12878 really translates well to other samples, particularly non-European samples (as NA12878 has a European ancestry). Will the accuracy really extend to non-European samples? Also, my reading of Table 5 is that this dataset performs slightly worse than the GRCh37 call set. This does not do a lot to convince this reader that the work of doing the re-analysis is worthwhile - and I'm a believer, based on previous work I have been a part of! I have some concerns that this may be due to improvements in the new call set (due to the inclusion of the alts and more complex decoy) but it takes some significant work to track this down. There are examples of this kind of analysis ¹. The authors should also clearly state what fraction of the genome they are able to assess using this method.

7. Omitted analysis: The authors discuss the value of the improved reference in the introduction, but then do nothing to show the value of the alternate loci. How many new variants are identified on these loci? How does the inclusion of these sequences change variant calling on the primary?

Perhaps most disappointing is that there is no analysis of how the de novo is an improvement over 'lift-over' approaches. How do the de novo variant calls compare to the 'lift-over' calls? Without this analysis, it is unclear to me that anyone would be convinced that doing the de novo call approach is worth the effort.

Lastly, the authors miss the opportunity to do an accuracy comparison by looking at the regions of the reference comprised of the 'ABC' clones. These are fosmid libraries constructed from several of the samples that went into the 1000 genomes project. These provide a great test bed for both looking at variant calls (any call in this region should be heterozygous or hemizygous as the reference sequence represents one valid haplotype in the sample being analyzed) and also allows for the confirmation of the local haplotype assertions.

References

1. Resolving the full spectrum of human genome variation using Linked-Reads. Genome Res. 29(4):635-645. doi: 10.1101/gr.234443.118

Wellcome Open Res. 2019 Dec 6.

Ernesto Lowy ¹

First, we would like to thank the reviewer for the feedback that has been provided and for giving this work their attention.

We note the high level comments are broken down into detailed points and addressed with detailed comments below. We have further updated the manuscript in an effort to improve clarity where it was indicated this was lacking. We also provided the requested information comparing the generated call set with the lift-over and included assorted other updates.

In response to the high level comments, which we understand to primarily relate to a) the improvements of GRCh38 over GRCh37 and b) comparison between de novo calling versus lift-over:

a) It was not our intention to demonstrate the superiority of GRCh38 over GRCh37. We believe that the GRC, in particular in the paper by Schneider et al., have demonstrated this already. We included information on this for the information of readers who may not be familiar with these issues. However, we accept that this may give an inaccurate impression of the emphasis of the data note. As such, the text has been amended, reducing the explanation of assembly changes and, rather, making reference to the paper by Schneider et al. Our aim was to provide a resource for those wishing to adopt the new assembly, not to present the case as to why GRCh38 should be adopted, which we believe has already been made elsewhere.

b) Our emphasis is on providing resources for the community. To make the data available to users in as timely a manner as possible, we elected to use the data note format for publication. This has a focus on describing how the data was produced, with validation of data outputs being listed in the information to authors as optional. In light of the comments, we have performed a comparison with the lift over set and also looked specifically at regions of the assembly that were updated between the two assemblies. Further details are below.

1) Explanation of why “lift-over” approaches have limitations:

This relates to a set of three statements regarding the inadequacy of lift-overs.

For the first statement, the reviewer notes that the removal of sequence when moving from GRCh37 to GRCh38 and the gain of sequence are two separate cases and that these were conflated in the statement “they rely on an equivalent region existing in the new genome, so new sequence in the improved assembly is effectively excluded”. We accept that this combines multiple facets of changes between the two assemblies. The text for statement one has, therefore, been updated to focus on the central point that we sought to make: that a mapping between the assemblies is necessary before one can lift-over a given variant and that this is not always possible (for any one of a number of possible reasons). Further, we have added the number of records that could not be lifted over in the dbSNP/EVA processed files to give a concrete indication of the numbers of records where this occurs.
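
The dependence of lift-over on an existing mapping can be illustrated with a minimal sketch. The blocks below are a hypothetical, much-simplified stand-in for an assembly-to-assembly alignment (real chain files also carry chromosome names, strand and scores), and the function name `lift_position` is an assumption for this example. A position that falls outside every aligned block has no equivalent and cannot be lifted over.

```python
from bisect import bisect_right

# Hypothetical aligned blocks, sorted by old_start:
# (old_start, old_end, new_start), 0-based, half-open, same strand.
TOY_BLOCKS = [(0, 1000, 500), (1200, 2000, 1500)]


def lift_position(pos, blocks):
    """Map a position on the old assembly to the new one, or return None."""
    starts = [block[0] for block in blocks]
    i = bisect_right(starts, pos) - 1  # last block starting at or before pos
    if i < 0:
        return None
    old_start, old_end, new_start = blocks[i]
    if pos >= old_end:
        return None  # falls in a gap between aligned blocks
    return new_start + (pos - old_start)


print(lift_position(10, TOY_BLOCKS))    # inside the first block
print(lift_position(1100, TOY_BLOCKS))  # in an unaligned gap
```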

For the second statement, the point we wished to make was that, even when a variant can be lifted-over, it does not follow that the evidence that supported that call in the original assembly would also transfer to the new location. The text has been amended in an effort to make this clearer, also citing evidence from Schneider et al. in regard to alignments and the transition from GRCh37 to GRCh38.

For the third statement, it was noted that this was clear but that no evidence was provided to support the assertion. In light of the other changes, this text has been modified to focus on the case of new sequence being added to the assembly and to point to specific examples shown as part of Figure 1, which illustrate the differences in the lift-over and de novo call sets at examples of clinically relevant loci that were updated between the two assemblies.

2) The authors provide only biallelic SNPs:

The requested comparison with the lift-over is addressed in the response to point seven. We have added the requested numbers relating to what fraction of SNVs are biallelic (99.6%) and the number of SNVs relative to other short variants. We have also taken the opportunity to update the call set to include biallelic INDELs, a category of variant that was not previously included. Multi-allelic calls remain absent from the set as SHAPEIT is unable to handle such calls and our pipelines would require further development. Our strategy has been to release calls as soon as possible and to revisit the data set, adding additional classes of variant as practical. This was done with the aim of making data that are useful to many users available, and with the intention of revisiting the data set to extend it where useful.

3) Page 3, Quality control of alignment files:

All of the steps used are described in the data note, not just the differences. The text has been updated in an effort to make this clearer to readers.

4) Variant discovery:

Tools were chosen in consultation with members of the 1000 Genomes Project Consortium. While our aim was to recapitulate their GRCh37 analysis on the new assembly, this would not have been feasible given the large number of callers used in the original project, the concomitant compute and the relatively complex methods used in filtering and integrating call sets, which were both compute and labour intensive. This obliged us to look at using a reduced set of callers and a simplified methodology. We sought recommendations that took into account the performance of the callers on the 1000 Genomes phase three data, which unlike most other panels is a mix of low coverage and exome with significantly more geographic diversity. In addition, the performance of some callers on the data set rendered their use impractical.

The text has been updated to inform readers of the above.

5) Variant filtering:

This was intended to be an initial release of data, with the intention of revisiting and adding additional elements that require further processing. As the sex chromosomes required additional analysis, they were not included in this first release. Further, we believe that the data set is still beneficial to some users in their absence. We anticipate calls on the GRCh38 sex chromosomes being released in the future.

6) Data set validation:

We acknowledge that the GIAB NA12878 benchmark is imperfect. As the reviewer notes, it is a single sample and the differences in the versions of the reference genome used for alignment (with and without alternate loci) by us and GIAB would be expected to have a bearing on variant detection.

With regard to the alternate loci, the possibility of comparing the level of conflict with the benchmark at regions where alternate loci are present and where they are not is mentioned. Given, however, that the presence of the alternate loci would also be expected to have at least some impact across the genome (irrespective of the presence of alternate loci at that particular location), we feel that, in order to truly assess what impact the alternate loci had on the analysis, it would be necessary to repeat the analysis, using, instead, alignments where the alternate loci had not been present. As our data set also relies on joint genotyping, this would effectively mean realigning all of the data and repeating analysis on all of the data in order to answer this question. The substantial compute volume involved in this would add significant time and expense and, thus, makes this comparison impractical. It would, however, be necessary in order to derive meaningful and sound conclusions about the impact of the alternate loci on our analysis.

The reviewer also expresses concern that accuracy with NA12878 may not transfer to other samples, particularly those of non-European ancestry. Given the prevalence of data from NA12878, it would seem reasonable to conclude that calling methods should perform well, and potentially above average, with that sample. However, NA12878 has data similar to that of other samples in our data set. Further, our data set is composed only of Illumina data, so we do not expect, for example, types of sequencing error to vary across the samples. In work done by others, comparing the new call set to 1000 Genomes phase three, we see that our results and those for phase three show a strong level of consistency across the samples (Robinson and Glusman, 2019, https://www.biorxiv.org/content/10.1101/600254v1), with no indication that NA12878 is an outlier.

In relation to our comparison with phase three, it was not our intention to try to outperform phase three, but rather to offer a de novo call set of similar quality on GRCh38. Its utility is to those wishing to work with GRCh38 using a de novo call set generated on that assembly, including the novel GRCh38 regions. The comparison with phase three is offered to assist users in understanding how our call set compares to phase three. Our call set shows broadly similar behaviour to phase three, with a slightly different balance of sensitivity and specificity. Given, however, that phase three involved a far greater analysis effort, which our available resources would make impossible to repeat, it is perhaps not surprising that phase three sees a higher yield. In turn, this is reflected in the lift-over, although substantial differences are seen in novel regions, where the de novo call set detects variants absent in the lift-over.

While we do acknowledge the limitations of the GIAB benchmark we have used, we found no better alternatives. To effectively benchmark our data, based as it is on joint genotyping, we needed “gold standard” data for samples in our data set. For short variants, the only such data set we were able to locate was GIAB NA12878. The alternatives, namely manual inspection of the data or our own assessment of alternative data types such as PacBio reads, also have limitations and lose the benefits gained from an independent “gold standard” data set created by another group.

The text has been updated in an effort to better reflect the above.
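To make the kind of site-level comparison against a “gold standard” set discussed under this point concrete, the following is a minimal sketch. It assumes that each variant is keyed by (chromosome, position, REF, ALT) and that both sets have already been normalised and restricted to the same regions; the function name `compare_sites` and the toy sites are hypothetical. It reports sensitivity and precision, which may differ from the exact metrics tabulated in the data note.

```python
def compare_sites(call_set, truth_set):
    """Compare two sets of variant sites keyed by (chrom, pos, ref, alt).

    Assumes both sets are normalised and restricted to the same regions.
    """
    true_pos = len(call_set & truth_set)   # called and in the truth set
    false_pos = len(call_set - truth_set)  # called but not in the truth set
    false_neg = len(truth_set - call_set)  # in the truth set but not called
    called = true_pos + false_pos
    expected = true_pos + false_neg
    return {
        "tp": true_pos,
        "fp": false_pos,
        "fn": false_neg,
        "sensitivity": true_pos / expected if expected else 0.0,
        "precision": true_pos / called if called else 0.0,
    }


# Toy sites (hypothetical, for illustration only).
calls = {
    ("chr20", 100, "A", "G"),
    ("chr20", 200, "C", "T"),
    ("chr20", 300, "G", "A"),
}
truth = {
    ("chr20", 100, "A", "G"),
    ("chr20", 200, "C", "T"),
    ("chr20", 400, "T", "C"),
}

result = compare_sites(calls, truth)
```

Because the comparison is made on a single sample, the result describes performance for that sample only, which is the limitation acknowledged above.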

7) Omitted analysis:

The alternate loci were used in aligning reads to ensure the best possible read mapping, but variants were not called on these loci. The text has been amended to make this clearer. This is in large part because protocols for successfully calling on the alternate loci are lacking. The only information provided by developers of calling software of which we are aware in relation to this is a beta tutorial from GATK (https://software.broadinstitute.org/gatk/documentation/article.php?id=8017). Due to the lack of tools and protocols for confidently calling on the alternate loci, calls were not made at those loci.

We have extended our benchmarking work to include the lift-over data set in the comparison. We have also looked specifically at novel regions of GRCh38. These are included in the revised text.

The suggestion relating to fosmid clones is interesting and would provide further validation. We would, however, note that this is a data note, offered by the journal as a means of describing the production of a data set, where benchmarking is described as optional. Our existing benchmarking covers the genome at greater scale and should, therefore, already give a better indication of the performance of our calling genome-wide. Further, we have added benchmarking of the phasing using WhatsHap.
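For readers unfamiliar with how phasing is benchmarked, the following is a simplified sketch of a switch error rate calculation over heterozygous sites shared between a test phasing and a reference phasing. It is not WhatsHap's implementation, and details such as how unphased or discordant sites are handled are assumptions of this sketch; the function name `switch_error_rate` and the toy haplotypes are hypothetical.

```python
def switch_error_rate(test_haps, truth_haps):
    """Switch error rate over ordered heterozygous sites.

    Each element is a pair (allele_on_haplotype_1, allele_on_haplotype_2).
    Assumes both lists cover the same sites in the same order.
    """
    if len(test_haps) != len(truth_haps):
        raise ValueError("test and truth must cover the same sites")
    # 0 where the test phase matches the truth, 1 where it is flipped.
    orientation = []
    for test, truth in zip(test_haps, truth_haps):
        if tuple(test) == tuple(truth):
            orientation.append(0)
        elif tuple(test) == tuple(truth)[::-1]:
            orientation.append(1)
        else:
            raise ValueError("genotypes differ between test and truth")
    # A switch occurs wherever the orientation changes between neighbours.
    switches = sum(a != b for a, b in zip(orientation, orientation[1:]))
    opportunities = len(orientation) - 1
    return switches / opportunities if opportunities > 0 else 0.0


# Toy haplotypes (hypothetical, for illustration only).
truth_phase = [(0, 1), (0, 1), (1, 0), (0, 1), (1, 0)]
test_phase = [(0, 1), (0, 1), (0, 1), (1, 0), (1, 0)]

rate = switch_error_rate(test_phase, truth_phase)
```

In the toy example the test phasing flips relative to the truth at the third site and flips back at the fifth, giving two switches in four opportunities.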

Associated Data

Data Availability Statement

The variants resulting from this work are available in the European Variation Archive. Accession number PRJEB31735.
