Skip to main content
STAR Protocols logoLink to STAR Protocols
. 2022 May 27;3(2):101440. doi: 10.1016/j.xpro.2022.101440

A protocol for applying a population-specific reference genome assembly to population genetics and medical studies

Lian Deng 1,6, Bo Xie 2, Yimin Wang 2, Xiaoxi Zhang 3, Shuhua Xu 1,2,3,4,5,7,
PMCID: PMC9157554  PMID: 35664259

Summary

With a growing number of available de novo sequenced genomes, protocols for their applications to population genetics will benefit our understanding of the human genome. Here we detail analytic steps to apply an example de novo reference genome to map and detect variants of short-read sequences from corresponding populations and to discover variants of disease-relevant genes. Using this protocol, we can improve variant discovery, better investigate population-specific genome properties, and evaluate the potential of sequenced genomes in medical studies.

For complete details on the use and execution of this protocol, please refer to Lou et al. (2022).

Subject areas: Bioinformatics, Health Sciences, Genetics, Genomics, Systems biology

Graphical abstract

graphic file with name fx1.jpg

Highlights

  • Protocol for mapping and variants detection of short-read sequences

  • Advantages of using a population-specific reference genome in population genomic studies

  • Analytic steps to discover potential variants of disease-relevant genes


Publisher’s note: Undertaking any experimental protocol requires adherence to local institutional guidelines for laboratory safety and ethics.


With a growing number of available de novo sequenced genomes, protocols for their applications to population genetics will benefit our understanding of the human genome. Here, we detail analytic steps to apply an example de novo reference genome to map and detect variants of short-read sequences from corresponding populations and to discover variants of disease-relevant genes. Using this protocol, we can improve variants' discovery and better investigate population-specific genome properties and evaluate the potential of sequenced genomes in medical studies.

Before you begin

Download the test dataset

Inline graphicTiming: 1 h

Table 1.

Steps to generate the test dataset

Input data Source Description Code Output data
hx1f4.3rdfixedv2.fa.gz http://www.openbioinformatics.org/hx1/data/hx1f4.3rdfixedv2.fa.gz HX1 genome assembly >gunzip hx1f4.3rdfixedv2.fa.gz
>samtools faidx hx1f4.3rdfixedv2.fa 000604F 000707F 000300F 000247F 000361F 000443F 000220F > HX1.chr22.fa
>samtools faidx HX1.chr22.fa
HX1.chr22.fa; HX1.chr22.fai
GWHAAAS00000000.genome.fasta.gz GWH (https://ngdc.cncb.ac.cn/gwh/): GWHAAAS00000000 NH1 genome assembly >gunzip GWHAAAS00000000.genome.fasta.gz
>samtools faidx GWHAAAS00000000.genome.fasta GWHAAAS00000500 > NH1.chr22.fa
>samtools faidx NH1.chr22.fa
NH1.chr22.fa; NH1.chr22.fai
Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz http://ftp.ensembl.org/pub/release-105/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz Human reference genome GRCh38 >gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
>samtools faidx Homo_sapiens.GRCh38.dna.primary_assembly.fa 22 > GRCh38.chr22.fa
>sed -i 's/ˆ>/>chr/g' GRCh38.chr22.fa
>samtools faidx GRCh38.primary_assembly.chr22.fa
GRCh38.chr22.fa; GRCh38.chr22.fai
hgdp_wgs.20190516.full.chr22.vcf.gz; hgdp_wgs.20190516.full.chr22.vcf.gz.tbi ftp://ngs.sanger.ac.uk/production/hgdp Genotype data from HGDP >bcftools view --force-samples -S <samples.txt> --threads 20 -f PASS -m 2 -M 2 -v snps hgdp_wgs.20190516.full.chr22.vcf.gz |bgzip -@20 -c > <out.vcf.gz>
>tabix <out.vcf.gz>
<4Han/4Tujia/9Tujia_43Han>.HGDP.snp.chr22.b38.vcf.gz; <4Han/4Tujia/9Tujia_43Han>.HGDP.snp.chr22.b38.vcf.gz.tbi
<sample>_1.fastq.gz;
<sample>_2.fastq.gz
ENA (https://www.ebi.ac.uk/ena/browser/home): PRJEB6463 Raw sequences of 4 Han Chinese and 4 Tujia samples from HGDP The raw sequencing reads were aligned to the human reference genome assembly GRCh38 with BWA and output BAM records. The BAM files were then sorted with SAMtools. Duplicated reads were marked with MarkDuplicates (Picard) in GATK (see key resources table). The code for short-reads mapping and duplicate removal is given in part 1: variants detection from the short-read sequences using linear alignment. <HGDP00776/HGDP00784/HGDP00812/HGDP00819/HGDP01096/HGDP01100/HGDP01102/HGDP01104>.dedup.chr22.sorted.bam; <HGDP00776/HGDP00784/HGDP00812/HGDP00819/HGDP01096/HGDP01100/HGDP01102/HGDP01104>.dedup.chr22.sorted.bam.bai

The input data are the original data downloaded from the public resources; the output data are those included in the test dataset.

Download the software and scripts

Inline graphicTiming: 1–2 days

  • 3.

    Most of the analyses in this protocol are performed using existing software, which is listed in the key resources table and can be downloaded via the links provided.

  • 4.

    We provide scripts to run some of the programs sequentially and to analyze or evaluate the output data. Some analytic script files used in reads filtering, genome alignment, variants detection and annotation, and result plots are available on GitHub and Zenodo (see key resources table).

Compile a list of medically relevant genes

Inline graphicTiming: 1 min

Figure 1.

Figure 1

Screenshot of a medically relevant gene list

Key resources table

REAGENT or RESOURCE SOURCE IDENTIFIER
Deposited data

A test dataset of human chromosome 22 In this protocol https://www.picb.ac.cn/PGG/resource_download.php?id=44&file=PGG_Web_Data/protocol_test_data_chr22.tar.gz; https://github.com/Shuhua-Group/TJ1_STARProtocols
Human gene annotation (Harrow et al., 2006) https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_34/gencode.v34.annotation.gff3.gz

Software and algorithms

BCFtools 1.6 (Danecek et al., 2021) https://github.com/samtools/bcftools
BWA 0.7.17-r1188 (Li and Durbin, 2010) https://github.com/lh3/bwa
Dipcall v0.3 (Li et al., 2018) https://github.com/lh3/dipcall
EAGLE 2.0 (Janin, 2014) https://github.com/sequencing/EAGLE
FlashPCA2 (Abraham et al., 2017) https://github.com/gabraham/flashpca
GATK 4.1.7.0 (Van der Auwera et al., 2013) https://gatk.broadinstitute.org/hc/en-us
GenMap v1.3.0 (Pockrandt et al., 2020) https://github.com/cpockrandt/genmap
Java 11.0.1 (Eng, 1997) https://www.oracle.com/java/
Liftoff v1.6.1 (Shumate and Salzberg, 2021) https://github.com/agshumate/Liftoff
Minimap2 (Li, 2018) https://github.com/lh3/minimap2
MSMC2 (Malaspinas et al., 2016) https://github.com/stschiff/msmc2
Picard v2.21.9 (Broad Institute, 2019) Integrated into GATK 4.1.7.0
Plink v1.9 (Purcell et al., 2007) https://www.cog-genomics.org/plink/
Python 3.6.4 (Van Rossum and Drake, 2009) https://www.python.org/
RTGtools 3.11 (Cleary et al., 2015) https://github.com/RealTimeGenomics/rtg-tools
R version 3.6.0 (R Core Team, 2020) https://www.r-project.org/
SAMtools 1.6 (Danecek et al., 2021) https://github.com/samtools/samtools
SHAPEIT4 version 4.1.2 (Delaneau et al., 2019) https://odelaneau.github.io/shapeit4/
SnpEff 4.3t (Cingolani et al., 2012) https://pcingola.github.io/SnpEff/
vg v1.23 (Garrison et al., 2018) https://github.com/vgteam/vg
Scripts for data analysis In this protocol https://github.com/Shuhua-Group/TJ1_STARProtocols or https://zenodo.org/record/6520447#.YnO6gC-KHUp

Other

Linux server N/A N/A

Materials and equipment

  • Human reference genome assemblies (FASTA files for the human reference genome GRCh38 and the population-specific reference genome) and the short-read sequences (BAM files) and genotypes (VCF files) of samples from corresponding populations (see download the test dataset in before you begin).

  • Software and scripts are used in this protocol (see software and algorithms section of key resources table).

  • Basic knowledge about Python, R, and bash scripting is required to understand and apply this protocol. Getting the basic operations of SAMtools and BCFtools may also be helpful as they are essential to the sequencing data process. VCFtools (Danecek et al., 2011) is an alternative to BCFtools in analyzing the VCFs.

  • All tests are run on 64-core Intel Xeon CPU E7-4850 v4 2.10 GHz Linux servers. We recommend using computing clusters to perform the data analyses. Assuming 10 CPUs, at least 100 GB of RAM is required for analyzing the whole-genome data.

Step-by-step method details

Part 1: Variants detection from the short-read sequences using linear alignment

Inline graphicTiming: 1 h

A population-specific reference genome is of importance in detecting variants. We first apply a linear approach for the short read mapping and variant calling, using NH1 as the reference genome. To evaluate the variant call rate and genotyping accuracy, we use the simulated short reads from the HX1 sequences in this step. The path of the scripts and files should be properly indicated when running the code provided in this protocol.

  • 1.
    Simulate the short-read sequences (SRS) using EAGLE.
    • a.
      Filter out the HX1 contigs < 1 Kb in length and create a sequence dictionary.
      >python Filter_HX1_fasta_contig_length.py HX1.chr22.fa HX1.chr22.fa.fai
      HX1.chr22.filtered.fa
      >gatk CreateSequenceDictionary -R HX1.chr22.filtered.fa -O HX1.chr22.filtered.dict
      Inline graphicCRITICAL: Although not required by EAGLE, contig filtration is necessary for this step, as we find EAGLE is always interrupted when dealing with those short contigs (troubleshooting 1).
    • b.
      Input the filtered HX1 sequences to EAGLE to simulate the raw reads of short-read sequencing, with a read length of 101 bp and sequencing depth of 30×.
      >path/to/configureEAGLE.pl
       --run-info=path/to/RunInfo_PairedReads1x1Tiles.xml
       --reference-genome=HX1.chr22.filtered.fa
       --coverage-depth=30
       --motif-quality-drop-
      table=path/to/MotifQualityDropTables/DefaultMotifQualityDropTable.tsv
       --quality-table=path/to/QualityTables/DefaultQualityTable.read1.length101.qval
       --quality-table=path/to/QualityTables/DefaultQualityTable.read2.length101.qval
      >cd EAGLE
      >make fastq -j 15
      >cd 111206_EAS987_0567_FC1234TST
      >bgzip EAGLE_S1_L001_R1_001.fastq
      >bgzip EAGLE_S1_L001_R2_001.fastq
  • 2.
    Map the simulated reads to the reference genome, and detect variants.
    • a.
      Map the simulated HX1 short reads to the NH1 genome using the BWA package.
      >bwa index NH1.chr22.fa
      >bwa mem -M -t 10 -R "@RG\tID:HX\tSM:HX1\tLB:HX1\tPU:HX1\tPL:ILLUMINA" NH1.chr22.fa
       EAGLE_S1_L001_R1_001.fastq.gz EAGLE_S1_L001_R2_001.fastq.gz | samtools view -bS - > HX1.pe.bam
      >samtools sort -@ 5 -m 4G HX1.pe.bam -T HX1 -o HX1.bam
      >samtools index HX1.bam
    • b.
      Remove duplicated reads.
      >gatk --java-options "-Xmx4g -Djava.io.tmpdir=HX1/" MarkDuplicates -I HX1.bam -O
       HX1.dedup.bam --VALIDATION_STRINGENCY SILENT --REMOVE_DUPLICATES true -M metrics_HX1.txt
       -AS true --CREATE_INDEX true
    • c.
      Detect variants based on the short-read mapping using GATK.
      Note: Despite that GATK consumes a large amount of RAM, we do not suggest any alternative software as it is the most widely used toolkit for sequencing reads processing and variants calling with good performance. The users can refer to the Best Practices Workflows (https://gatk.broadinstitute.org/hc/en-us/sections/360007226651-Best-Practices-Workflows) for more instructions to apply GATK.
      #Variants calling for each chromosome
      >ref_chr_list=`cat NH1.chr22.fa.fai | awk '{print $1}'`
      >gatk CreateSequenceDictionary -R NH1.chr22.fa -O NH1.chr22.dict
      >gatk --java-options "-Xmx3G -XX:ParallelGCThreads=2 -Dsamjdk.compression_level=5"
       HaplotypeCaller -R NH1.chr22.fa -ploidy 1 -L GWHAAAS00000500 -I HX1.dedup.bam -O
       HX1.GWHAAAS00000500.g.vcf.gz -ERC GVCF -G
       StandardAnnotation -G AS_StandardAnnotation -G StandardHCAnnotation --seconds-between-progress-updates 30
      >sh Combine_list.sh
      >sh 170.JointCalling.sh
      >cd 170.GenotypeGVCFs.joint.calling
      >for chr in $ref_chr_list
       do
       sh 170.${chr}.sh
       done
      #Combine the VCFs of all chromosomes
      >sh 170.combine.sh
      >tabix -p vcf HX1.genomewide.hc.vcf.gz
      #An optional step to rename the VCF files as we only analyze chromosome 22 here
      >mv HX1.genomewide.hc.vcf.gz HX1.chr22.vcf.gz
      >mv HX1.genomewide.hc.vcf.gz.tbi HX1.chr22.vcf.gz.tbi
    • d.
      Perform GATK hard-filtering to filter out probable artifacts from the call set (De Summa et al., 2017).
      Note: Here we do not recommend using the GATK VQSR module to do variant filtering, as it relies on known and highly validated variant resources. This step does not guarantee that all the variants in the filtered calls are reliable, and some variants of particular interest need careful check.
      >gatk SelectVariants -select-type SNP -V HX1.chr22.vcf.gz -O HX1.chr22.snp.vcf.gz
      >gatk VariantFiltration -V HX1.chr22.snp.vcf.gz
       --filter-expression "QD < 2.0 || MQ < 40.0 || FS > 60.0 || SOR > 3.0"
       --filter-name "Filter" -O HX1.chr22.snp.filter.vcf.gz
      >gatk SelectVariants -select-type INDEL -V HX1.chr22.vcf.gz -O HX1.chr22.indel.vcf.gz
      >gatk VariantFiltration -V HX1.chr22.indel.vcf.gz
       --filter-expression "QD < 2.0 || FS > 200.0 || SOR > 10.0"
       --filter-name "Filter" -O HX1.chr22.indel.filter.vcf.gz
      >gatk MergeVcfs -I HX1.chr22.snp.filter.vcf.gz -I HX1.chr22.indel.filter.vcf.gz -O
       HX1.chr22.filter.vcf.gz
      >bcftools view -f PASS HX1.chr22.filter.vcf.gz | bgzip -c > HX1.chr22.filtered.vcf.gz
      >tabix -p vcf HX1.chr22.filtered.vcf.gz
  • 3.
    Evaluate the accuracy of variants detection by comparing the results of simulated data to that obtained in real data.
    • a.
      Align the NH1 and HX1 genomes using Minimap2, which has been integrated into the script below.
      Note: BWA does not support the alignment of two genomes, and thus should not be used in this step as an alternative software package to Minimap2.
      >paftools_wgs_call.sh NH1.chr22.fa HX1.chr22.fa
      >bgzip NH1.chr22.HX1.chr22.vcf
      >tabix -p vcf NH1.chr22.HX1.chr22.vcf.gz
    • b.
      Run RTGtools to compare the two VCF files – one callset of real data generated in this step, and another obtained from simulated data in the above step, and output summary metrics on the screen (Figure 2).
      >rtg format -o NH1.sdf NH1.chr22.fa
      >rtg vcfeval -b NH1.chr22.HX1.chr22.vcf.gz -c HX1.chr22.filtered.vcf.gz -o output -t
       NH1.sdf

Figure 2.

Figure 2

Screenshot of the output summary metrics of RTGtools

“Threshold” represents the genotyping quality (GQ) threshold, and you can refer to the row where the threshold is “None”. “baseline” represents the true data, and “call” represents the GATK callset. Therefore, “True-pos-baseline” means baseline variants that match between the baseline and calls; “True-pos-call” means called variants that match between the baseline and calls; “False-pos” means called variants not matched in the baseline; “False-neg” means baseline variants not matched in the call set. “Precision” means the precision rate of called variants; “Sensitivity” means the recall rate of baseline variants; “F-measure” means the weighted harmonic mean of its precision and sensitivity.

Part 2: Variants detection from the short-read sequences using the graph-based approach

Inline graphicTiming: 1.5 h

The graph-based method is an alternative to the linear method for variants detection in the SRS data, and it has a much better performance, especially on the genotyping of insertions. The performance of the linear method only relies on the input sequencing data, while that of the graph-based method is additionally related to the quality of graph construction. Here we apply the vg toolkit and use NH1, which provides a comprehensive and reliable variant list representing Han Chinese, to construct a genome graph.

Note: The vg toolkit is updated frequently, and the results vary largely across different updates. Make sure to use an identical version of vg (≥1.23 is recommended) to complete the entire process, and that your commands conform to the syntax requirements of that version.

  • 4.

    Construct a graph genome using vg with a reference genome in a FASTA file (e.g., the current human reference genome assembly, GRCh38) and a set of variants in a VCF file (e.g., the NH1 variant call set named “NH1.GRCh38.chr22.vcf.gz” in the test dataset) (troubleshooting 2 and 3).

Note: The information lines of the VCF file will affect the output as the merged information generated by different software may conflict in vg processing, so we recommend keeping only necessary information, such as “##fileformat” and “##contig”. The option “-S” should be specified in “vg construct” if structural variants are needed in constructing the genome graph.

>workdir=`pwd`

>vg construct -C -S -a -R chr22 -r GRCh38.chr22.fa -v NH1.GRCh38.chr22.vcf.gz -t 1 -m 32

 --flat-alts 1>chr22.vg 2>chr22.vg.log

Inline graphicCRITICAL: To keep vg running smoothly, symbolic structural variants in the VCF file should be converted into explicit representations. Relevant format specifications can be found at https://samtools.github.io/hts-specs/VCFv4.2.pdf, and we provide the “explicitVCF.converter.pl” script on GitHub and Zenodo (see key resources table) to convert the format.

  • 5.

    Build an xg index and a gcsa index for the output graph in the .vg file.

Note: The xg index is a compressed version of the graph that allows fast node, edge and path lookups; the gcsa index is a pruned substring index used only for read mapping.

#Build xg index

>vg ids -j chr22.vg

>vg index -b ${workdir} -x wg.xg chr22.vg

#Prune the graph and build gcsa index

>vg prune -r chr22.vg > chr22.pruned.vg

>vg index -b ${workdir} -g wg.gcsa chr22.pruned.vg

  • 6.

    Map the paired SRS to the graph for variants detection and genotyping. The augmentation step is for detecting rare variants.

Note: The earlier versions of vg only read input sequencing data in FASTQ files, while recent versions also support BAM files, which will potentially save the overall analysis time.

#Map the short-read sequences

>vg map -x wg.xg -g wg.gcsa -f EAGLE_S1_L001_R1_001.fastq.gz -f

 EAGLE_S1_L001_R2_001.fastq.gz > 001.mapped.gam

>vg convert wg.xg -p > wg.xg.pg

#Augmentation

>vg augment wg.xg.pg 001.mapped.gam -m 4 -q 5 -Q 5 -A wg-001.aug.gam > wg-001.aug.pg

>vg snarls wg-001.aug.pg > wg-001.aug.snarls

>vg pack -x wg-001.aug.pg -g wg-001.aug.gam -o wg-001.aug.pack

>vg call wg-001.aug.pg -r wg-001.aug.snarls -k wg-001.aug.pack -s vg-001 > wg-001.aug.vcf

Part 3: Haplotype phasing using the population-specific genome as a reference

Inline graphicTiming: 8 h

Haplotype phasing benefits from population-specific contexts like appropriate reference panels. Here we phase the HGDP Han Chinese genomes using the variant callset of the Han Chinese genome assembly as reference. We then provide a script to estimate the switch error rate, which can be further compared across the haplotypes inferred by using different reference panels.

  • 7.

    Perform variant calling of the NH1 and HX1 genomes based on the human reference genome GRCh38.

>run-dipcall HX1NH1_refb38 GRCh38.chr22.fa NH1.chr22.fa HX1.chr22.fa > HX1NH1_refb38.mak

>make -j2 -f HX1NH1_refb38.mak

>python prepare_HX1NH1_bcf.py HX1NH1_refb38.dip.vcf.gz

>shapeit4 --input 4Han.HGDP.snp.chr22.b38.vcf.gz --region chr22

 --reference HX1NH1_refb38.dip.filtered.bcf --thread 5

 --output 4Han.HGDP.snp.chr22.b38.phased.vcf.gz --sequencing

  • 9.

    Estimate the switch error rate for each sample (provided in a “.switch_error.txt” file, Figure 3).

Note: This estimation is based on the allelic configuration for the adjacent heterozygotes (Lou et al., 2022), and requires the input of .bam files. It takes more than 7 h.

>python run_switch_script.py

Figure 3.

Figure 3

Screenshot of the output file of the switch error estimation for phased genotypes

Part 4: Resolve fine-scale population structure

Inline graphicTiming: 4 h

The population-specific reference genome may also facilitate population genetic analyses, and in particular, resolve the population differentiation and genetic structure at a fine scale. This part illustrates the analyses of population genetic relationship and demographic history using whole-genome variants called from SRS with the population-specific assembly as reference. Here we use 9 Tujia and 43 Han Chinese genome sequences from HGDP as the test data (see key resources table).

  • 10.
    Principal component analysis (PCA).
    • a.
      Select biallelic SNPs with missing rate < 0.01 and minor allele frequency > 0.05 for further analyses.
      >bcftools view -i 'F_MISSING<0.01 && MAF>0.05' 9Tujia_43Han.HGDP.snp.chr22.b38.vcf.gz |
       bgzip > 9Tujia_43Han.HGDP.snp.chr22.b38.miss001.maf005.vcf.gz
    • b.
      Carry out SNP downsampling according to the physical distance of 50 Kb to roughly exclude possible linkage between loci.
      >plink -vcf 9Tujia_43Han.HGDP.snp.chr22.b38.miss001.maf005.vcf.gz --make-bed --double
       --bp-space 50000 --thin 0.99 --out 9Tujia_43Han.HGDP.snp.chr22.b38.miss001.maf005.thin50
    • c.
      Run FlashPCA2, and plot the first two PCs using a script provided (Figure 4). The script requires an additional file assigning the color of each individual on the plot – two columns denoting the name of individuals and the colors, respectively. An R package named “hash” needs to be installed in R before applying this script. This analysis only takes a few minutes.
      >flashpca --bfile 9Tujia_43Han.HGDP.snp.chr22.b38.miss001.maf005.thin50k
       -f .9Tujia_43Han.pca
      >Rscript pc_plot.r pcs.9Tujia_43Han.pca 9Tujia_43Han.color pcs.9Tujia_43Han.pca.pdf
  • 11.
    Infer population demographic history. We apply a multiple sequentially Markovian coalescent approach (MSMC2) to estimate the effective population size (Ne) of the Tujia and Han Chinese populations over time. We select 4 samples (8 haplotypes) from each population (Table 1).
    Note: Both MSMC and MSMC2 work for this analysis, while the memory usage and time consumption are less for the latter. In addition, MSMC loses power in ancient times with increasing numbers of input genomes (Schiffels and Wang, 2020).
    • a.
      Genotype phasing of the Tujia and Han Chinese samples using population-specific reference genomes, following part 3: haplotype phasing using the population-specific genome as a reference. Here we use the Han Chinese genomes (HX1 and NH1) as a reference to infer the Tujia haplotypes, while we suggest using the Tujia reference genome (Lou et al., 2022) instead to achieve better performance.
    • b.
      Perform GenMap to generate a list of low-mappability genomic regions for custom reference genomes (NH1).
      Note: In principle, read mapping tools like BWA can be used to compute the genome mappability. However, they rely on the read alignment to the reference genome and thus are not applicable to analyzing the NH1 genome assembly. The GEM mappability program is also widely used (Derrien et al., 2012). Here we recommend GenMap as it is a magnitude faster than GEM.
      >genmap index -F NH1.chr22.fa -I index
      >genmap map -K 35 -E 1 -I index/ -O ./ -r -t -w -bg 2> err.log
      >awk -F '\t' '$4==1' NH1.chr22.genmap.bedgraph |cut -f1-3 > NH1.chr22.mappability.bed
    • c.
      Generate a VCF file for each individual, and run BamCaller.py implemented in MSMC2 for quality control.
      >for i in {1..4}
       do
       bcftools view -s sample${i} -r chr22 <4Han/4Tujia>.HGDP.snp.chr22.b38.phased.vcf.gz |
      bgzip > sample${i}.chr22.vcf.gz
       samtools mpileup -q 20 -Q 20 -C 50 -u -r <chr> -f <ref.fa> <bam> | bcftools call -c -V
      indels | python bamCaller.py <mean_cov> <out_mask.bed.gz> | gzip -c > <out.vcf.gz>
       done
    • d.
      Mask the low mappability regions in the individual files and make an input file for MSMC2.
      >generate_multihetsep.py --mask=NH1.chr22.mappability.bed
       --mask=<sample1_mask.bed.gz> --mask=<sample2_mask.bed.gz>
       --mask=<sample3_mask.bed.gz> --mask=<sample4_mask.bed.gz>
      sample1.chr22.vcf.gz sample2.chr22.vcf.gz sample3.chr22.vcf.gz sample4.chr22.vcf.gz > msmc_input
    • e.
      Run MSMC2 to estimate Ne and plot the Ne dynamics. The Tujia and Han Chinese populations should be analyzed independently.
      Note: The final output (.final.txt) contains the scaled begin and end time of the interval and scaled inverse Ne of the interval. The ne_plot.r script allows for converting scaled times and Ne to real numbers (saved in .converted.txt file, Figure 5) with given parameters including a mutation rate of the human genome (1.25 × 10−8 per bp per generation by default) and a generation time (25 years per generation by default), and to plot the dynamic changes of Ne (saved in a .final.converted.pdf file, Figure 6). This script limits the plot of Ne dynamics during the period of 1,000–1,000,000 years ago.
      >msmc2 --fixedRecombination -o msmc_output msmc_input
      >Rscript ne_plot.r msmc_output <mutation_rate> <generation_time> <color>

Figure 4.

Figure 4

PCA plot of the Tujia and Han Chinese populations, represented by orange and green dots, respectively

Figure 5.

Figure 5

Screenshot of the converted results of Ne estimation by MSMC2

Figure 6.

Figure 6

Plot of the Ne dynamics in Han Chinese (red) and Tujia (blue) populations

Part 5: Discover variants in medically relevant genes

Inline graphicTiming: 4 h

In this part, we focus on the recall of genome variants in the previously compiled medically relevant genes (see compile a list of medically relevant genes in before you begin). We examine the HX1 variants detected by mapping the simulated SRS to the NH1 genome, and generate a list of medically relevant variants genotyped. These variants may provide special insights into the genetic basis of some phenotypes or diseases in the corresponding population, and deserve further investigations.

  • 12.

    Analysis of the medically relevant genes requires a liftover of the gene coordinates to match the NH1 genome coordinates. We apply Liftoff to convert the GFF formatted gene annotation (e.g., GENCODE human release 34, see key resources table) file to NH1, using GRCh38 as the reference genome and the NH1 assembly as the target genome.

>liftoff -g gencode.v34.annotation.gff3 -a 0.9 -s 0.9 -exclude_partial -p 10 -o

 NH1.gencode.v34.gff -u NH1_unmapped.txt NH1.chr22.fa GRCh38.chr22.fa

>python get_NH1_medically_genes.py chr22 GRCh38_mrg_full_gene.bed NH1.gencode.v34.gff

>mv NH1.chr22.fa /path/to/snpeff_test/sequences.fa

>bcftools view -R NH1.medically_gene.bed HX1.chr22.filtered.vcf.gz | bgzip >

 HX1.chr22.medically_gene.filtered.vcf.gz

>python snpeff_config.py /path/to/snpEff/ /path/to/snpeff_test/

>java -Xmx4g -jar snpEff.jar -v NH1 HX1.chr22.medically_gene.filtered.vcf.gz | bgzip 1>

 HX1.chr22.medically_gene.filtered.snpeff_ann.vcf.gz

Inline graphicCRITICAL: We recommend using SnpEff version 4.3t, which requires Java 11.0.1. If the structure of some genes in the converted GFF files is incomplete, SnpEff will report an error (troubleshooting 4). In this case, we should manually remove these genes in the GFF file. The GFF file should not be compressed, otherwise, we would not obtain accurate annotations for some loci (troubleshooting 5).

Expected outcomes

SRS variant call sets (part 1 and 2)

The first two parts of our protocol generate two VCF files, containing two variant call sets of SRS by linear and graph-based methods, respectively, with a population-specific assembly as reference. In addition, Part 1 provides an evaluation outcome of the accuracy of variants detection by simulation (Figure 2). Figure 2 shows that the precision rate of variants detected in the simulated SRS data is 91.7%, with a sensitivity of 86.0%. It is to be expected that the sequence similarity between the target and the reference genome may contribute to the improvement of variant discovery in SRS.

Inferred haplotypes of the SRS data (part 3)

Phasing of the SRS data using the population-specific genome as reference yields the haplotype information, which can be stored in VCF or BCF files according to SHAPEIT4. Each file contains the haplotype of one chromosome. To evaluate the performance of phasing, we further calculated the switch error rate. The outcome is a “.switch_error.txt” file (Figure 3) providing the estimated switch error rate for each sample, which can be compared across independent phasing analyses on an identical genome dataset but using different reference panels. According to our previous study (Lou et al., 2022), using a population-specific genome as a reference may greatly improve the genome phasing accuracy of that population.

Plots of the estimated principal components and population demography (part 4)

Pipelines of PCA and population demographic analyses are included in this part, and accordingly, the output files generated by running the software (e.g., FlashPCA2 and MSMC2) are provided. It is possible to visualize the results using our scripts. Figure 4 depicts a dot plot of Han Chinese and Tujia samples according to the top 2 PCs; Figure 6 shows the dynamic changes in effective population size of Han Chinese and Tujia populations, based on the converted results in the “.final.converted.txt” file (Figure 5). When using the SRS variant call sets generated using population-specific assemblies as a reference, we are expected to obtain a high resolution of population stratification between the two closely related populations. However, Figures 4 and 6 do not show significant genetic differentiation between the Han Chinese and Tujia populations as those in (Lou et al., 2022), possibly because we use the original HGDP callset as test data, which were generated by the alignment to the human reference genome GRCh38, and chromosome 22 provides limited information compared with the whole genome.

A list of medically relevant variants (part 5)

The outcome of Part 5 is a VCF file listing the medically relevant variants that could be potentially detected using the population-specific reference genome. Rare or population-specific variants are expected to be detected with high accuracy using population-specific assembly, and these variants deserve further analyses as they may have large impacts on biological function and abnormality.

Intermediate files

This protocol compiles multiple analytic steps, and therefore generates a lot of intermediate files. Here we list some important intermediate files used as input to reach the outcomes in Table 2.

Table 2.

A list of intermediate files is generated in this workflow

Files Description File generation
File usage
Parts (Steps) Software/Scripts Parts (Steps) Software/Scripts
EAGLE_S1_L001_R1_001.fastq.gz; EAGLE_S1_L001_R2_001.fastq.gz Simulated short reads of HX1 1(1b) EAGLE 1(2a) BWA
2(6) vg map
HX1.dedup.bam Mapping and alignment output of the simulated reads of HX1. 1(2b) gatk MarkDuplicates 1(2c) gatk HaplotypeCaller
HX1.chr22.vcf.gz Genome variants called from the simulated reads of HX1 1(2c) gatk HaplotypeCaller 1(2d) gatk VariantFiltration
HX1.chr22.filtered.vcf.gz Genome variants called from the simulated reads of HX1 with hard-filtering 1(2d) gatk VariantFiltration 1(3b) RTGtools
5(13) BCFtools
wg.xg; wg.gcsa Index files of a graph genome reference of GRCh38 constructed based on the NH1 assembly 2(5) vg construct∗; vg index 2(6) vg map
<4Han/4Tujia>.HGDP.snp.chr22.b38.phased.vcf.gz Phased haplotypes of the HGDP samples 3(8) SHAPEIT4 3(9) run_switch_script.py
4(11c) BCFtools; MSMC2∗
NH1.chr22.mappability.bed A list of low-mappability regions of NH1 that can be removed to keep high confidence variants 4(11b) GenMap 4(11d) generate_multihetsep.py

Intermediate files that are not used in subsequent steps are not shown. We indicate the software or scripts that do not directly process these files but are key analyses in corresponding steps with asterisks (∗).

Limitations

The practical use of a population-specific reference genome relies heavily on the quality of the genome assembly. It can be expected that an increasing number of population-specific reference genomes will facilitate human genome studies, but at the current stage, there is no available data for most the human populations. This protocol provides a comprehensive strategy to explore the benefits of a population-specific reference genome, which will help bridge the gap between constructing a de novo genome assembly and applying it to gain more insights into the human genome. However, we are not able to include all the applicable methods and algorithms in this protocol, and we only show the procedures to run some of the commonly used software and analyses. Using published software packages may lead to limitations on managing the process and improving the results. We commented on some critical steps and limitations of the software packages, and the users need to reach the original sources for more information (see key resources table). The users should make their own decisions on some parameters and standards of quality control based on the data properties. For instance, we need to balance between limiting false discovery and enabling true discovery when applying GATK hard-filtering to the variant callset. According to the GATK recommendation, any variant with a mapping quality (MQ) value less than 40 should be removed. Increasing the MQ cutoff to 50 to reduce the false positive rate is also acceptable. Similar considerations need to be taken to remove low-quality sites when performing vg augment (e.g., “-m”, “-q”, and “-Q”), samtools mpileup (e.g., “-q”, “-Q”, and “-C”) or Liftoff (e.g., “-a” and “-s”). Although most of the analytical steps can be done using software and scripts provided, this protocol requires good knowledge of programming. New users may need to learn some skills underlying the protocol to execute it fully, including the basic knowledge of Python and bash scripting, at least.

Troubleshooting

Problem 1

EAGLE commands fail (Figure 7) (step 1a in part 1: variants detection from the short-read sequences using linear alignment).

Figure 7.

Figure 7

Screenshot of the error message of EAGLE

Potential solution

Filter out contigs < 1 kb in the input data.

Problem 2

Vg commands fail with an error message: “ERROR: Signal 6 occurred. VG has crashed” (step 4 in part 2: variants detection from the short-read sequences using the graph-based approach).

Potential solution

Set --flat-alts in vg construct.

Problem 3

Vg commands fail with an error message: “error:[vg::Constructor] non-ATGCN characters found in chr22” (step 4 in part 2: variants detection from the short-read sequences using the graph-based approach).

Potential solution

Filter out the alleles that do not match with A, T, G, C, or N in the human reference genome GRCh38.

Problem 4

SnpEff commands fail (Figure 8) (step 13 in part 5: discover variants in medically relevant genes).

Figure 8.

Figure 8

Screenshot of the error message of SnpEff

Potential solution

Remove the genes listed in the error message from the GFF file.

Problem 5

Incorrect annotations at some loci in the output of SnpEff without any error message when executing this software. (step 13 in part 5: discover variants in medically relevant genes).

Potential solution

Input the GFF file in an uncompressed form.

Resource availability

Lead contact

Further information and requests for resources and code should be directed to and will be fulfilled by the lead contact, Shuhua Xu (xushua@fudan.edu.cn).

Materials availability

This study did not generate new materials.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (NSFC) grant (32030020, 32041008, 31900418, and 31961130380), the Strategic Priority Research Program (XDPB17, XDB38000000) of the Chinese Academy of Sciences (CAS), the Science and Technology Commission of Shanghai Municipality (19YF1455200), the UK Royal Society-Newton Advanced Fellowship (NAF\R1\191094), and the Shanghai Municipal Science and Technology Major Project (2017SHZDZX01). We thank Dr. Haiyi Lou for his advice on constructing this protocol. We are grateful to Yang Gao for his help in conducting the population genetic analyses.

Author contributions

S.X. conceived and initiated this study. B.X., Y.W., and L.D. prepared the pipeline. B.X. and X.Z. prepared the test dataset. L.D., B.X., and Y.W. wrote the first draft. L.D. and S.X. revised the paper.

Declaration of interests

The authors declare no competing interests.

Data and code availability

The test dataset is freely available at https://www.picb.ac.cn/PGG/resource_download.php?id=44&file=PGG_Web_Data/protocol_test_data_chr22.tar.gz and at GitHub: https://github.com/Shuhua-Group/TJ1_STARProtocols. The code generated during this study is available at GitHub: https://github.com/Shuhua-Group/TJ1_STARProtocols and at Zenodo: https://zenodo.org/record/6520447#.YnO6gC-KHUp.

References

  1. Abraham G., Qiu Y., Inouye M. FlashPCA2: principal component analysis of Biobank-scale genotype datasets. Bioinformatics. 2017;33:2776–2778. doi: 10.1093/bioinformatics/btx299. [DOI] [PubMed] [Google Scholar]
  2. Bergström A., McCarthy S.A., Hui R., Almarri M.A., Ayub Q., Danecek P., Chen Y., Felkel S., Hallast P., Kamm J., et al. Insights into human genetic variation and population history from 929 diverse genomes. Science. 2020;367:eaay5012. doi: 10.1530/ey.17.14.4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Broad Institute . GitHub Repository. 2019. Picard toolkit.https://broadinstitute.github.io/picard/ [Google Scholar]
  4. Cingolani P., Platts A., Wang le L., Coon M., Nguyen T., Wang L., Land S.J., Lu X., Ruden D.M. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly. 2012;6:80–92. doi: 10.4161/fly.19695. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Cleary J.G., Braithwaite R., Gaastra K., Hilbush B.S., Inglis S., Irvine S.A., Jackson A., Littin R., Rathod M., Ware D., et al. Comparing variant call files for performance benchmarking of next-generation sequencing variant calling pipelines. bioRxiv. 2015 doi: 10.1101/023754. Preprint at. [DOI] [Google Scholar]
  6. Danecek P., Auton A., Abecasis G., Albers C.A., Banks E., DePristo M.A., Handsaker R.E., Lunter G., Marth G.T., Sherry S.T., et al. The variant call format and VCFtools. Bioinformatics. 2011;27:2156–2158. doi: 10.1093/bioinformatics/btr330. [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Danecek P., Bonfield J.K., Liddle J., Marshall J., Ohan V., Pollard M.O., Whitwham A., Keane T., McCarthy S.A., Davies R.M., Li H. Twelve years of SAMtools and BCFtools. GigaScience. 2021;10:giab008. doi: 10.1093/gigascience/giab008. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. De Summa S., Malerba G., Pinto R., Mori A., Mijatovic V., Tommasi S. GATK hard filtering: tunable parameters to improve variant calling for next generation sequencing targeted gene panel data. BMC Bioinformatics. 2017;18:119. doi: 10.1186/s12859-017-1537-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Delaneau O., Zagury J.F., Robinson M.R., Marchini J.L., Dermitzakis E.T. Accurate, scalable and integrative haplotype estimation. Nat. Commun. 2019;10:5436. doi: 10.1038/s41467-019-13225-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Derrien T., Estelle J., Marco Sola S., Knowles D.G., Raineri E., Guigo R., Ribeca P. Fast computation and applications of genome mappability. PLoS One. 2012;7:e30377. doi: 10.1371/journal.pone.0030377. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Du Z., Ma L., Qu H., Chen W., Zhang B., Lu X., Zhai W., Sheng X., Sun Y., Li W., et al. Whole genome analyses of Chinese population and de novo assembly of A northern Han genome. Dev. Reprod. Biol. 2019;17:229–247. doi: 10.1016/j.gpb.2019.07.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Eng J. Improving the interactivity and functionality of Web-based radiology teaching files with the Java programming language. Radiographics. 1997;17:1567–1574. doi: 10.1148/radiographics.17.6.9397464. [DOI] [PubMed] [Google Scholar]
  13. Garrison E., Siren J., Novak A.M., Hickey G., Eizenga J.M., Dawson E.T., Jones W., Garg S., Markello C., Lin M.F., et al. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat. Biotechnol. 2018;36:875–879. doi: 10.1038/nbt.4227. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Harrow J., Denoeud F., Frankish A., Reymond A., Chen C.K., Chrast J., Lagarde J., Gilbert J.G.R., Storey R., Swarbreck D., et al. GENCODE: producing a reference annotation for ENCODE. Genome Biol. 2006;7:1–9. doi: 10.1186/gb-2006-7-s1-s4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Janin L. Github Repository. 2014. Eagle - enhanced artificial genome engine.https://github.com/sequencing/EAGLE [Google Scholar]
  16. Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34:3094–3100. doi: 10.1093/bioinformatics/bty191. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Li H., Bloom J.M., Farjoun Y., Fleharty M., Gauthier L., Neale B., MacArthur D. A synthetic-diploid benchmark for accurate variant-calling evaluation. Nat. Methods. 2018;15:595–597. doi: 10.1038/s41592-018-0054-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Li H., Durbin R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics. 2010;26:589–595. doi: 10.1093/bioinformatics/btp698. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Lou H., Gao Y., Xie B., Wang Y., Zhang H., Shi M., Ma S., Zhang X., Liu C., Xu S. Haplotype-resolved de novo assembly of a Tujia genome suggests the necessity for high-quality population-specific genome references. Cell Syst. 2022;13:321–333.e6. doi: 10.1016/j.cels.2022.01.006. [DOI] [PubMed] [Google Scholar]
  20. Malaspinas A.S., Westaway M.C., Muller C., Sousa V.C., Lao O., Alves I., Bergström A., Athanasiadis G., Cheng J.Y., Crawford J.E., et al. A genomic history of aboriginal Australia. Nature. 2016;538:207–214. doi: 10.1038/nature18299. [DOI] [PubMed] [Google Scholar]
  21. Pockrandt C., Alzamel M., Iliopoulos C.S., Reinert K. GenMap: ultra-fast computation of genome mappability. Bioinformatics. 2020;36:3687–3692. doi: 10.1093/bioinformatics/btaa222. [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Purcell S., Neale B., Todd-Brown K., Thomas L., Ferreira M.A., Bender D., Maller J., Sklar P., de Bakker P.I., Daly M.J., Sham P.C. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 2007;81:559–575. doi: 10.1086/519795. [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. R Core Team . R Foundation for Statistical Computing; 2020. R: A Language and Environment for Statistical Computing. [Google Scholar]
  24. Schiffels S., Wang K. In: Statistical Population Genomics. Dutheil J.Y., editor. Springer US); 2020. MSMC and MSMC2: the multiple sequentially Markovian coalescent; pp. 147–166. [DOI] [PubMed] [Google Scholar]
  25. Shi L., Guo Y., Dong C., Huddleston J., Yang H., Han X., Fu A., Li Q., Li N., Gong S., et al. Long-read sequencing and de novo assembly of a Chinese genome. Nat. Commun. 2016;7:12065. doi: 10.1038/ncomms12065. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Shumate A., Salzberg S.L. Liftoff: Accurate Mapping of Gene Annotations. Bioinformatics. 2021;37(12):1639–1643. doi: 10.1093/bioinformatics/btaa1016. [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. The International HapMap Consortium A second generation human haplotype map of over 3.1 million SNPs. Nature. 2007;449:851–861. doi: 10.1038/nature06258. [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Van der Auwera G.A., Carneiro M.O., Hartl C., Poplin R., Del Angel G., Levy-Moonshine A., Jordan T., Shakir K., Roazen D., Thibault J., et al. From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr. Protoc. Bioinformatics. 2013;43:11.10.11–11.10.33. doi: 10.1002/0471250953.bi1110s43. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Van Rossum G., Drake F.L. CreateSpace; 2009. Python 3 Reference Manual. [Google Scholar]
  30. Wagner J., Olson N.D., Harris L., McDaniel J., Cheng H., Fungtammasan A., Hwang Y.-C., Gupta R., Wenger A.M., Rowell W.J., et al. Towards a comprehensive variation benchmark for challenging medically-relevant autosomal genes. bioRxiv. 2021 doi: 10.1101/2021.06.07.444885. Preprint at. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

The test dataset is freely available at https://www.picb.ac.cn/PGG/resource_download.php?id=44&file=PGG_Web_Data/protocol_test_data_chr22.tar.gz and at GitHub: https://github.com/Shuhua-Group/TJ1_STARProtocols. The code generated during this study is available at GitHub: https://github.com/Shuhua-Group/TJ1_STARProtocols and at Zenodo: https://zenodo.org/record/6520447#.YnO6gC-KHUp.


Articles from STAR Protocols are provided here courtesy of Elsevier

RESOURCES