STAR Protocols. 2024 Jan 3;5(1):102806. doi: 10.1016/j.xpro.2023.102806

Protocol for detecting rare and common genetic associations in whole-exome sequencing studies using MAGICpipeline

Jian Yuan 1,2,5, Kai Li 4,5, Hui Peng 1,2, Yue Zhang 1,2, Yinghao Yao 3; Myopia Associated Genetics and Intervention Consortium, Jia Qu 1,2,3,4, Jianzhong Su 1,2,3,4,6,7
PMCID: PMC10793169  PMID: 38175747

Summary

Whole-exome sequencing (WES) is a major approach to uncovering gene-disease associations and pinpointing effector genes. Here, we present a protocol for estimating genetic associations of rare and common variants in large-scale case-control WES studies using MAGICpipeline, an open-access analysis pipeline. We describe steps for assessing gene-based rare-variant association analyses by incorporating multiple variant pathogenic annotations and statistical techniques. We then detail procedures for identifying disease-related modules and hub genes using weighted correlation network analysis, a systems biology approach.

For complete details on the use and execution of this protocol, please refer to Su et al. (2023).1

Subject areas: Bioinformatics, Genetics, Systems biology

Graphical abstract

[Graphical abstract image: fx1.jpg]

Highlights

  • Protocol for mapping, variant calling, and quality control for WES studies

  • Estimate the genetic associations of rare and common variants

  • Disease-related module and gene identification by integrating gene expression data


Publisher’s note: Undertaking any experimental protocol requires adherence to local institutional guidelines for laboratory safety and ethics.



Before you begin

Whole-exome sequencing (WES) is a diagnostic and research approach for identifying molecular defects in patients with suspected genetic disorders.2,3,4 Large-scale WES analysis has effectively identified many novel susceptibility genes and pathways for various diseases.5,6,7 This protocol was established to analyze raw WES data in large cohorts. For the data pre-processing and variant discovery steps, the Genome Analysis Toolkit (GATK) is the recommended tool, with workflows following community best practices.8,9 Variant annotation and the addition of variant frequencies from large population studies can be performed by following the instructions below. Executing this protocol requires additional resources beyond the basic software tools, which fall into two categories: reference file downloads (the reference genome, the genomic coordinates of the capture kit used for WES, and reference annotation databases) and additional software tools (for pre- and post-processing of the input and output data). Apart from data collection, the following steps need to be performed only once, as they set up online data and locally stored software resources. Because all commands are executed via a terminal, it is assumed that users of this protocol already have a basic familiarity with the Unix/Linux command line.

Note: Each line of executable code throughout this protocol is preceded by a greater-than sign (>).

Hardware

The whole pipeline was run on a Linux operating system (CentOS release 7.3.1611) with two 2.00 GHz Intel Xeon Gold 6138 CPUs (40 cores in total), 256 GB of RAM, and 1 TB of hard disk space.
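Before starting, it can be useful to confirm the machine roughly meets this configuration. The sketch below is a hypothetical pre-flight check using standard Linux utilities; the thresholds simply mirror the hardware described above.

```shell
# Report available cores and memory; warn if below the configuration used
# in this protocol (40 cores, 256 GB RAM). Thresholds are illustrative.
cores=$(nproc)
mem_gb=$(awk '/MemTotal/ {printf "%d", $2/1024/1024}' /proc/meminfo)
echo "cores=${cores} mem_gb=${mem_gb}"
[ "${cores}" -ge 40 ] || echo "WARNING: fewer than 40 cores available"
[ "${mem_gb}" -ge 256 ] || echo "WARNING: less than 256 GB RAM available"
```

Smaller machines can still run the protocol on the test dataset, but the large-cohort steps (joint genotyping, VQSR) are memory-intensive.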

Resources download

Timing: 2.0 h

This section retrieves the required non-software resources, including reference genome and annotation files. At the end of the section, these resources should be stored in the proper locations for further steps.

  • 1.

    Set the directory for storing reference genome and genomic annotations intended for subsequent general utilization.

>PROJECT_PATH=/user/projects

>RESOURCES_PATH=$PROJECT_PATH/ref

>mkdir -p $RESOURCES_PATH

>SOFTWARE_PATH=$PROJECT_PATH/software

>mkdir -p $SOFTWARE_PATH

>git clone https://github.com/sulab-wmu/MAGIC-PIPELINE.git

>mv MAGIC-PIPELINE/scripts $SOFTWARE_PATH/

  • 2.

    Download the hg19 version of the human reference genome.

>cd $RESOURCES_PATH

>wget http://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/hg19.fa.gz

>../software/seqkit grep -i -r -p '^chr[\dXY]+$' hg19.fa.gz -o ucsc.no_hap.hg19.fa.gz

>gzip -d ucsc.no_hap.hg19.fa.gz

  • 3.
    Download the GATK resource bundle for Variant Quality Score Recalibration.
    • a.
      Known variants and rs (dbSNP) accessions: dbSNP build 151 (GRCh37.p13).
    • b.
      HapMap genotypes and sites VCFs.
    • c.
      OMNI 2.5 genotypes for 1000 Genomes samples, as well as the sites VCF.
    • d.
      The set from 1000G phase 1 of known-site information for local realignment.
    • e.
      The current best set of known indels to be used for local realignment.

>cd $RESOURCES_PATH

>wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/00-All.vcf.gz

>gunzip 00-All.vcf.gz

>mv 00-All.vcf ncbi.hg19_dbsnp151.vcf

>wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/hapmap_3.3.hg19.sites.vcf.gz

>gunzip hapmap_3.3.hg19.sites.vcf.gz

>wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/1000G_omni2.5.hg19.sites.vcf.gz

>gunzip 1000G_omni2.5.hg19.sites.vcf.gz

>wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz

>gunzip 1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz

>wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz

>gunzip Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz

  • 4.

    Download the population variant-frequency files (gnomAD 2.1.1 and its index), which are required in the rare-variant association study to classify variants according to their frequency in major population cohorts.

>cd $RESOURCES_PATH

>wget https://storage.googleapis.com/gcp-public-data--gnomad/release/2.1.1/vcf/exomes/gnomad.exomes.r2.1.1.sites.vcf.bgz.tbi

>wget https://storage.googleapis.com/gcp-public-data--gnomad/release/2.1.1/vcf/exomes/gnomad.exomes.r2.1.1.sites.vcf.bgz

  • 5.

    Download 1000 Genomes Project files for population structure analysis.

>cd $RESOURCES_PATH

>prefix="ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr";

suffix=".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz";

for chr in {1..22}; do wget "${prefix}""${chr}""${suffix}" "${prefix}""${chr}""${suffix}".tbi ; done

>wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130606_sample_info/20130606_g1k.ped

  • 6.

    Download the resources for variant annotation in VEP.

Note: The annotations for all possible SNVs and indels in spliceAI are downloaded from https://basespace.illumina.com/s/otSPW8hnhaZR.

>cd $RESOURCES_PATH

>wget https://krishna.gs.washington.edu/download/CADD/v1.5/GRCh38/whole_genome_SNVs.tsv.gz

>wget https://krishna.gs.washington.edu/download/CADD/v1.5/GRCh38/InDels.tsv.gz

>wget https://s3.amazonaws.com/bcbio_nextgen/human_ancestor.fa.gz

>wget https://personal.broadinstitute.org/konradk/loftee_data/GRCh37/GERP_scores.final.sorted.txt.gz

  • 7.

    Acquire RNA-seq data in the relevant tissue for your trait of interest.

Identify an existing RNA-seq dataset or a published co-expression network from a cell type or tissue relevant to your trait of interest.

Note: There may be a publicly available source of RNA-seq data or a previously published co-expression network that is suitable for your applications. In this protocol, we use the gene-level reads per kilobase million mapped reads (RPKM) values of the Genotype-Tissue Expression (GTEx) RNA-seq data (https://www.gtexportal.org/home/) for peripheral retina samples.

>cd $RESOURCES_PATH

>wget https://storage.googleapis.com/gtex_external_datasets/eyegex_data/annotations/EyeGEx_meta_combined_inferior_retina_summary_deidentified_geo_ids.csv

>wget https://storage.googleapis.com/gtex_external_datasets/eyegex_data/rna_seq_data/EyeGEx_retina_combined_genelevel_expectedcounts_byrid_nooutlier.tpm.matrix.gct

Download and install software and tools

Timing: 1.5 h

The goal of this section is to download and install the software and tools needed to execute our pipeline. If users encounter restrictions or lack the permissions needed to install software/tools on their machines, they should contact the machine administrator for assistance. The following command-line operations can be executed as provided on most Linux distributions. Of note, at the end of some code snippets, a final command exports the tool's location to the shell environment, ensuring its availability in subsequent steps.
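As a concrete illustration of this "export" pattern, the sketch below places two of the tools on the shell's search path. The directory layout is an assumption matching the paths used elsewhere in this protocol; adjust it to your own installation.

```shell
# Hypothetical example: expose bwa and samtools (installed under
# $SOFTWARE_PATH, as in the steps below) to all subsequent commands.
SOFTWARE_PATH=/user/projects/software
export BWA_PATH=$SOFTWARE_PATH/bwa-0.7.12
export SAMTOOLS_PATH=$SOFTWARE_PATH/samtools-1.9/bin
export PATH=$BWA_PATH:$SAMTOOLS_PATH:$PATH
echo "$PATH" | grep -q "bwa-0.7.12" && echo "bwa is on PATH"
```

Exports made this way last only for the current shell session; add them to your shell profile (e.g., ~/.bashrc) to make them persistent.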

  • 8.

    Installing conda and modifying channels in conda config.

  • 9.

    Installing Python prerequisites. The scripts were primarily tested with, and are recommended to be run using, Python 3.8. Run:

>wget https://repo.anaconda.com/archive/Anaconda3-2023.09-0-Linux-x86_64.sh

>bash Anaconda3-2023.09-0-Linux-x86_64.sh

>conda config --add channels conda-forge

>conda config --add channels defaults

>conda config --add channels r

>conda config --add channels bioconda

>pip=`which pip3`

>$pip install numpy==1.19.4

>$pip install tqdm==4.42.1

>$pip install scipy==1.3.3

>$pip install statsmodels==0.12.1

>$pip install rpy2==3.3.6

  • 10.

    Installing the R packages SKAT and ACAT (for burden analysis), along with stringr, magrittr, readr, getopt, and WGCNA.

>R_PATH=`which R`

>$R_PATH

>>install.packages("SKAT")

>>install.packages("devtools")

>>devtools::install_github("yaowuliu/ACAT")

>>install.packages("stringr")

>>install.packages("magrittr")

>>install.packages("readr")

>>install.packages("getopt")

>>install.packages("BiocManager")

>>BiocManager::install("WGCNA")

  • 11.

    Download and install seqkit.

>cd $SOFTWARE_PATH

>wget https://github.com/shenwei356/seqkit/releases/download/v2.5.1/seqkit_linux_amd64.tar.gz

>tar -xvf seqkit_linux_amd64.tar.gz

  • 12.

    Download and install bwa.

>cd $SOFTWARE_PATH


>wget https://sourceforge.net/projects/bio-bwa/files/bwa-0.7.12.tar.bz2

>tar -xvf bwa-0.7.12.tar.bz2

>export BWA_PATH=$SOFTWARE_PATH/bwa-0.7.12

>cd $BWA_PATH

>make

  • 13.

    Download and install GATK.

>cd $SOFTWARE_PATH

>wget https://github.com/broadinstitute/gatk/releases/download/4.3.0.0/gatk-4.3.0.0.zip

>unzip gatk-4.3.0.0.zip

  • 14.

    Download and install vcflib.

>conda create -n vcflib

>conda install -c bioconda -n vcflib vcflib

  • 15.

    Download and install VEP.

>cd $SOFTWARE_PATH

>git clone https://github.com/Ensembl/ensembl-vep.git

>cd ensembl-vep

>perl INSTALL.pl

  • 16.

    Download and install plink.

>cd $SOFTWARE_PATH

>wget https://s3.amazonaws.com/plink1-assets/plink_linux_x86_64_20230116.zip

>unzip plink_linux_x86_64_20230116.zip

  • 17.

    Download and install EMMAX.

>cd $SOFTWARE_PATH

>wget http://csg.sph.umich.edu/kang/emmax/download/emmax-intel-binary-20120210.tar.gz

>tar xvf emmax-intel-binary-20120210.tar.gz

  • 18.

    Download and install GCTA.

>cd $SOFTWARE_PATH

>wget https://yanglab.westlake.edu.cn/software/gcta/bin/gcta-1.94.1-linux-kernel-3-x86_64.zip

>unzip gcta-1.94.1-linux-kernel-3-x86_64.zip

>cp -r gcta-1.94.1-linux-kernel-3-x86_64 $SOFTWARE_PATH/gcta64

  • 19.

    Download and install vcftools.

>cd $SOFTWARE_PATH

>wget https://sourceforge.net/projects/vcftools/files/latest/download/vcftools_0.1.13.tar.gz

>tar -xzf vcftools_0.1.13.tar.gz

>cd vcftools_0.1.13

>./configure --prefix=$SOFTWARE_PATH/vcftools

>make && make install

  • 20.

    Download and install samtools.

>cd $SOFTWARE_PATH

>wget https://sourceforge.net/projects/samtools/files/samtools/1.9/samtools-1.9.tar.bz2

>tar -xvf samtools-1.9.tar.bz2

>cd samtools-1.9

>./configure --prefix=$SOFTWARE_PATH/samtools-1.9

>make && make install

  • 21.

    Download and install bcftools.

>cd $SOFTWARE_PATH

>wget https://sourceforge.net/projects/samtools/files/samtools/1.9/bcftools-1.9.tar.bz2

>tar -xvf bcftools-1.9.tar.bz2

>cd bcftools-1.9/

>./configure --prefix=$SOFTWARE_PATH/bcftools-1.9

>make && make install

  • 22.

    Download and install sambamba

>cd $SOFTWARE_PATH

>wget https://github.com/biod/sambamba/releases/download/v0.6.6/sambamba_v0.6.6_linux.tar.bz2

>tar -xvf sambamba_v0.6.6_linux.tar.bz2

  • 23.

    Download and install tabix and bgzip.

>cd $SOFTWARE_PATH && mkdir htslib

>wget https://github.com/samtools/htslib/releases/download/1.10.2/htslib-1.10.2.tar.bz2

>tar -xvf htslib-1.10.2.tar.bz2

>cd htslib-1.10.2 && ./configure --prefix=$SOFTWARE_PATH/htslib

>make && make install

Download the test dataset

Timing: 1.5 h

In this section, the raw data for the demonstration of the protocol are retrieved. At the end of the process, the data should be in the proper place for the continuation of the protocol.

  • 24.

    Set the directory where the raw data will be placed and download the raw WES data.

>cd $PROJECT_PATH

>mkdir rawdata

>cd rawdata

>wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR099/SRR099967/SRR099967_1.fastq.gz

>wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR099/SRR099967/SRR099967_2.fastq.gz

>wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR099/SRR099969/SRR099969_1.fastq.gz

>wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR099/SRR099969/SRR099969_2.fastq.gz

>wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR099/SRR099957/SRR099957_1.fastq.gz

>wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR099/SRR099957/SRR099957_2.fastq.gz

  • 25.

    Download test file for exome-wide association and burden test.

>cd $PROJECT_PATH

>mkdir 06_ExWAS/

>curl -o 06_ExWAS/ExWAS_test.fam https://zenodo.org/record/8189201/files/ExWAS_test.fam

>curl -o 06_ExWAS/ExWAS_test.bim https://zenodo.org/record/8189201/files/ExWAS_test.bim

>curl -o 06_ExWAS/ExWAS_test.bed https://zenodo.org/record/8189201/files/ExWAS_test.bed

>mkdir 07_collapsing/

>curl -o 07_collapsing/test.vcf https://zenodo.org/record/8189201/files/test.vcf

>curl -o 07_collapsing/test_class.txt https://zenodo.org/record/8189201/files/test_class.txt

>curl -o ref/Design_V2.hg19.bed https://zenodo.org/record/8189201/files/Design_V2.hg19.bed

  • 26.

    Download RNA-seq test file for WGCNA.

>mkdir 08_systems_genetics

>curl -o 08_systems_genetics/datExpr1_combine_test.txt \
https://zenodo.org/record/8189201/files/datExpr1_combine_test.txt

>curl -o 08_systems_genetics/top50.txt \
https://zenodo.org/record/8189201/files/top50.txt

Key resources table

REAGENT or RESOURCE SOURCE IDENTIFIER
Deposited data

Human reference genome hg19 UCSC Genome Browser http://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/hg19.fa.gz
1000 Genomes Project Phase 3 The 1000 Genomes Project Consortium10 https://www.internationalgenome.org/category/phase-3/
Genome Aggregation Database (gnomAD) Karczewski et al.11 https://gnomad.broadinstitute.org/
dbSNP 151 NCBI12 ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/00-All.vcf.gz
Genotype-Tissue Expression (GTEx) resource Battle et al.13 https://gtexportal.org/home/

Software and algorithms

R v3.6.1 The R Foundation https://www.r-project.org; RRID: N/A
stringr CRAN https://cran.r-project.org/web/packages/stringr/index.html
magrittr CRAN https://cran.r-project.org/web/packages/magrittr/index.html
readr CRAN https://cran.r-project.org/web/packages/readr/index.html
getopt CRAN https://cran.r-project.org/web/packages/getopt/index.html
bwa v0.7.12 SourceForge https://sourceforge.net/projects/bio-bwa/
Genome Analysis Toolkit (GATK) Van der Auwera et al.9 https://software.broadinstitute.org/gatk/
PLINK v1.9 Purcell et al.14 https://www.cog-genomics.org/plink2; RRID: N/A
Ensembl Variant Effect Predictor McLaren et al.15 https://useast.ensembl.org/info/docs/tools/vep/index.html
vcflib Garrison et al.16 https://github.com/vcflib/vcflib
GCTA-MLMA Yang et al.17 https://yanglab.westlake.edu.cn/software/gcta/; RRID: N/A
EMMAX Kang et al.18 http://genetics.cs.ucla.edu/emmax_jemdoc/; RRID: N/A

Step-by-step method details

We prepare a GitHub repository containing all the necessary scripts for the method presented below. GitHub: https://github.com/sulab-wmu/MAGIC-PIPELINE.

To follow the step-by-step instructions, users are encouraged to download the whole repository, including the example input files. In all subsequent steps, the paths to the required software tools are the same as the "exported" paths in the "before you begin" section.

Step 1: Data pre-processing

Timing: 4.0 h

This section outlines the procedure of aligning the FASTQ pairs to the reference genome and collecting alignment statistics to ensure quality control. The output of this part comprises BAM files that are appropriate for the subsequent variant calling. At the end of the step, both BAM files and a report containing read alignment statistics are generated.

  • 1.

    Index the reference genome.

>cd $RESOURCES_PATH

>$BWA_PATH/bwa index ucsc.no_hap.hg19.fa

>$SAMTOOLS_PATH/samtools faidx ucsc.no_hap.hg19.fa

>$SAMTOOLS_PATH/samtools dict ucsc.no_hap.hg19.fa > ucsc.no_hap.hg19.dict

  • 2.

    Create a file with only one sample per line, then align each sample to the reference genome and mark duplicates in the BAM file.

>cd $PROJECT_PATH

>mkdir 01_mapping/

>cat sample.list

SRR099967

SRR099969

SRR099957

>cat sample.list | while read line

do

software/bwa-0.7.12/bwa mem -t 32 -M -R "@RG\tID:${line}\tPL:illumina\tSM:${line}\tCN:WY" \
ref/ucsc.no_hap.hg19.fa rawdata/${line}_1.fastq.gz rawdata/${line}_2.fastq.gz |
software/samtools-1.9/bin/samtools view -uS -F 4 -o 01_mapping/${line}.bam &&
software/sambamba-0.6.6/sambamba sort -t 4 -u --tmpdir=temp 01_mapping/${line}.bam &&
software/sambamba-0.6.6/sambamba markdup --sort-buffer-size 262144 --hash-table-size 60000 \
-t 4 -l 0 --overflow-list-size 400000 --tmpdir=temp \
01_mapping/${line}.sorted.bam 01_mapping/${line}.sort.rmdup.bam

done

CRITICAL: This protocol assumes that the human reference genome hg19 is used for the example dataset; if users choose other genome versions or genomes for other species, please replace "hg19" with the selected genome version. Note that the genome version used for alignment must be consistent with the one used in the subsequent variant annotation.

  • 3.

    Base Quality Score Recalibration and application on BAM files.

>cd $PROJECT_PATH

>export _JAVA_OPTIONS=-Djava.io.tmpdir=temp/

>cat sample.list | while read line

do

software/gatk-4.3.0.0/gatk --java-options \

"-Dsamjdk.use_async_io_read_samtools=true \

-Dsamjdk.use_async_io_write_samtools=true \

-Dsamjdk.use_async_io_write_tribble=false -Xmx8g" \

BaseRecalibrator -R ref/ucsc.no_hap.hg19.fa \

-I 01_mapping/${line}.sort.rmdup.bam --verbosity WARNING \

--known-sites ref/ncbi.hg19_dbsnp151.vcf -O 01_realign/${line}.table

software/gatk-4.3.0.0/gatk --java-options \

"-Dsamjdk.use_async_io_read_samtools=true \

-Dsamjdk.use_async_io_write_samtools=true \

-Dsamjdk.use_async_io_write_tribble=false -Xmx8g" \

ApplyBQSR -R ref/ucsc.no_hap.hg19.fa -I \

01_mapping/${line}.sort.rmdup.bam --verbosity WARNING -bqsr \

01_realign/${line}.table -O 01_realign/${line}.bam

software/sambamba-0.6.6/sambamba index 01_realign/${line}.bam

done

  • 4.
    Collection of alignment statistics.
    >cd $PROJECT_PATH
    >RSCRIPT=`which Rscript`
    >PERL=`which perl`
    >cat sample.list | while read line
    do
    $PERL software/QC_exome_mem.pl -R $RSCRIPT -i 01_mapping/${line}.sort.rmdup.bam -r ref/Design_V2.hg19.bed \
    -c rawdata/${line}.R1.clean.fastq.gz -o 01_vcf/${line} -s software/samtools-1.9/bin/samtools -b software/bedtools2/bin/bedtools -p software/picard.jar -t PE -plot \
    -ref ref/ucsc.no_hap.hg19.fa -d ref/ucsc.no_hap.hg19.dict -C ref/Design_V2.hg19.bed
    done
    Note: If you are using another WES probe, please replace ref/Design_V2.hg19.bed with your own probe file. As in previous WES analyses,19 we collect several statistics pertaining to the quality control of the alignment process in this step. At the end of the process, a text file containing the statistics should be generated.
    • a.
      Total sequenced reads.
    • b.
      Aligned reads.
    • c.
      Uniquely aligned reads (q > 20).
    • d.
      Reads overlapping targets.
    • e.
      Total sequenced bases.
    • f.
      Aligned bases.
    • g.
      Uniquely aligned bases.
    • h.
      Bases overlapping targets.
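Several of these read-level statistics can also be recovered directly from samtools flagstat output. The sketch below parses a made-up flagstat excerpt with awk; the counts, the file name flagstat.txt, and the resulting fraction are fabricated for illustration only, not from the protocol's data.

```shell
# Parse total and aligned read counts from a flagstat-style report and
# compute the aligned fraction; flagstat.txt is a fabricated example.
cat > flagstat.txt <<'EOF'
2000000 + 0 in total (QC-passed reads + QC-failed reads)
1960000 + 0 mapped (98.00% : N/A)
EOF
total=$(awk '/in total/ {print $1}' flagstat.txt)
mapped=$(awk '/ mapped \(/ {print $1}' flagstat.txt)
awk -v m="$mapped" -v t="$total" 'BEGIN {printf "aligned fraction: %.4f\n", m/t}'
```

For this example the snippet prints `aligned fraction: 0.9800`; on real data, run `samtools flagstat` on the BAM first to produce the report.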

Step 2: Variant discovery

Timing: 2.0 h

This section provides an overview of the variant-calling procedure using GATK HaplotypeCaller. All BAM files generated during pre-processing are fed into GATK's HaplotypeCaller to perform per-sample variant calling, producing one genomic VCF (gVCF) per sample. Joint calling and variant quality score recalibration (VQSR) then generate a final multi-sample VCF, which serves as the starting point for downstream analysis.

  • 5.

    For each sample use GATK HaplotypeCaller to create a gVCF callset file.

>cd $PROJECT_PATH

>mkdir -p 01_vcf 02_gvcf 02_vcf

>export _JAVA_OPTIONS=-Djava.io.tmpdir=temp/

>cat sample.list | while read line

do

software/gatk-4.3.0.0/gatk --java-options "-Xmx8G" HaplotypeCaller -R ref/ucsc.no_hap.hg19.fa -I 01_realign/${line}.bam -L ref/Design_V2.hg19.bed -ERC BP_RESOLUTION -O 01_vcf/${line}_HaplotypeCaller.g.vcf && software/bgzip 01_vcf/${line}_HaplotypeCaller.g.vcf && software/tabix -p vcf 01_vcf/${line}_HaplotypeCaller.g.vcf.gz

done

  • 6.

    Combine per-sample gVCF files via HaplotypeCaller into a multi-sample gVCF file.

>cd $PROJECT_PATH

>ls 01_vcf/*.gz > 02_gvcf/cohort.vcf.list

>software/gatk-4.3.0.0/gatk CombineGVCFs --java-options "-Xmx150g -Djava.io.tmpdir=temp/" -R ref/ucsc.no_hap.hg19.fa --variant 02_gvcf/cohort.vcf.list -O 02_gvcf/cohort.g.vcf

  • 7.

    Perform joint-genotyping on the combined gVCF files.

>cd $PROJECT_PATH

>software/gatk-4.3.0.0/gatk GenotypeGVCFs --java-options "-Xmx150g -Djava.io.tmpdir=temp/" --variant 02_gvcf/cohort.g.vcf -new-qual true -R ref/ucsc.no_hap.hg19.fa -O 02_vcf/cohort.vcf.gz

  • 8.

    Filter variants by hard-filtering before variant quality score recalibration (VQSR).

Note: The workflow hard-filters on the variant missing rate before VQSR to reduce the time cost, with the expectation that the callset represents many samples.

>cd $PROJECT_PATH

>software/vcftools/bin/vcftools --gzvcf 02_vcf/cohort.vcf.gz --max-missing 0.9 --recode --recode-INFO-all --out 02_vcf/cohort.mis0.9

  • 9.

    Calculate VQSLOD tranches for SNPs using VariantRecalibrator.

>cd $PROJECT_PATH

>software/gatk-4.3.0.0/gatk --java-options "-Xmx248g" VariantRecalibrator -R ref/ucsc.no_hap.hg19.fa -V 02_vcf/cohort.mis0.9.recode.vcf \

-resource:hapmap,known=false,training=true,truth=true,prior=15.0 ref/hapmap_3.3.hg19.sites.vcf \

-resource:omni,known=false,training=true,truth=true,prior=12.0 ref/1000G_omni2.5.hg19.sites.vcf \

-resource:1000G,known=false,training=true,truth=false,prior=10.0 ref/1000G_phase1.snps.high_confidence.hg19.sites.vcf \

-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 ref/ncbi.hg19_dbsnp151.vcf \

-an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -mode SNP --max-gaussians 4 -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 95.0 -tranche 90.0 \

-O 02_vcf/cohort_VQSR.snp.recal \

--tranches-file 02_vcf/cohort_VQSR.snp.tranches \

--rscript-file 02_vcf/cohort_VQSR.snp.plots.R

  • 10.

    Filter SNPs on VQSLOD using ApplyVQSR.

>cd $PROJECT_PATH

>software/gatk-4.3.0.0/gatk --java-options "-Xmx248g" ApplyVQSR \

-R ref/ucsc.no_hap.hg19.fa -V 02_vcf/cohort.mis0.9.recode.vcf \

--ts-filter-level 99.9 \

--tranches-file 02_vcf/cohort_VQSR.snp.tranches \

--recal-file 02_vcf/cohort_VQSR.snp.recal \

-mode SNP -O 02_vcf/cohort_VQSR.snp.recal.vcf

  • 11.

    Calculate VQSLOD tranches for indels using VariantRecalibrator.

>cd $PROJECT_PATH

>software/gatk-4.3.0.0/gatk --java-options "-Xmx248g" VariantRecalibrator -R ref/ucsc.no_hap.hg19.fa -V 02_vcf/cohort_VQSR.snp.recal.vcf \

-resource:mills,known=true,training=true,truth=true,prior=12.0 ref/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \

-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 ref/ncbi.hg19_dbsnp151.vcf \

-an QD -an FS -an SOR -an MQRankSum -an ReadPosRankSum -mode INDEL \

--max-gaussians 4 -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \

-O 02_vcf/cohort_VQSR.indel.recal \

--tranches-file 02_vcf/cohort_VQSR.indel.tranches \

--rscript-file 02_vcf/cohort_VQSR.indel.plots.R

  • 12.

    Filter indels on VQSLOD using ApplyVQSR.

>cd $PROJECT_PATH

>software/gatk-4.3.0.0/gatk --java-options "-Xmx248g" ApplyVQSR \

-R ref/ucsc.no_hap.hg19.fa -V 02_vcf/cohort_VQSR.snp.recal.vcf \

--ts-filter-level 99.9 \

--tranches-file 02_vcf/cohort_VQSR.indel.tranches \

--recal-file 02_vcf/cohort_VQSR.indel.recal -mode INDEL \

-O 02_vcf/cohort_VQSR.snp.indel.recal.vcf

  • 13.

    Subset to SNPs and indels callset with SelectVariants.

Note: We originally used GATK best practices version 3 for all pre-processing steps, including mapping sequence reads to the hg19 reference genome. However, GATK version 3 is no longer supported by the developers; for this reason, the discussion above gives instructions specific to GATK4 and links users to GATK4 resources. We recommend that users visit the links below if they have questions about implementing GATK:

>cd $PROJECT_PATH

>software/gatk-4.3.0.0/gatk --java-options "-Xmx248g" SelectVariants -R ref/ucsc.no_hap.hg19.fa -V 02_vcf/cohort_VQSR.snp.indel.recal.vcf \

--exclude-filtered --preserve-alleles -O 02_vcf/cohort_after_VQSR.vcf

https://github.com/gatk-workflows/gatk4-exome-analysis-pipeline

https://github.com/gatk-workflows/broad-prod-wgs-germline-snps-indels

https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant-discovery-SNPs-Indels-

https://github.com/ekg/alignment-and-variant-calling-tutorial

Step 3: Quality control (QC) for variants

Timing: 0.5 h

In this section, we perform QC on the genetic variants discovered through sequencing. We prefilter variants so that only those that pass the GATK VQSR filter and lie outside low-complexity regions are included in further analyses. Genotypes with a depth (DP) < 10 or a genotype quality (GQ) < 20, as well as heterozygous genotype calls with an allele balance > 0.8 or < 0.2, are set to missing. We then exclude variants with a call rate < 0.9, a case-control call-rate difference > 0.005, or a Hardy-Weinberg equilibrium (HWE) test P-value < 1 × 10−6 in the combined case-control cohort.
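The genotype-level rule (DP < 10 or GQ < 20 becomes missing) can be illustrated on a toy GT:DP:GQ field with awk. This is only a sketch of the logic; the pipeline itself applies it with GATK VariantFiltration in step 15, and the three sample fields below are made up.

```shell
# Toy illustration of the genotype QC rule: set calls with DP < 10 or
# GQ < 20 to missing ("./."). Input fields (GT:DP:GQ) are fabricated.
echo "0/1:8:55 1/1:30:19 0/1:25:60" | tr ' ' '\n' |
awk -F: '{ if ($2 < 10 || $3 < 20) print "./.:" $2 ":" $3; else print $0 }'
```

Here the first call fails on depth, the second on genotype quality, and only the third survives unchanged.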

  • 14.

    Apply the allele-balance filter to the VCF file produced by GATK.

>cd $PROJECT_PATH

>mkdir 03_variant_qc

>RSCRIPT=`which Rscript`

>$RSCRIPT software/scripts/AlleleBalanceBySample.R -v 02_vcf/cohort_after_VQSR.vcf -o 03_variant_qc/cohort_after_VQSR.AB.vcf -i 0.2 -I 0.8

Note: The path to Rscript must be provided to the script, and the R packages stringr, magrittr, readr, and getopt must be installed beforehand. In this step, the AlleleBalanceBySample.R script marks heterozygous genotypes with an allele balance greater than 0.8 or less than 0.2 as missing.
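The allele-balance computation itself is simple and can be sketched with awk on a toy GT:AD field (AD = comma-separated ref,alt depths). The depths below are fabricated; the actual filtering in this protocol is done by AlleleBalanceBySample.R as shown above.

```shell
# Toy illustration of the allele-balance rule: for heterozygous calls,
# compute AB = alt / (ref + alt) and mark genotypes with AB > 0.8 or
# AB < 0.2 as missing. GT:AD values are made up.
echo "0/1:2,18 0/1:10,10 0/1:19,1" | tr ' ' '\n' |
awk -F'[:,]' '{ ab = $3 / ($2 + $3);
                gt = (ab > 0.8 || ab < 0.2) ? "./." : $1;
                print gt, ab }'
```

The first and third calls (AB = 0.9 and 0.05) are set to missing; the balanced call (AB = 0.5) is retained.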

  • 15.

    Hard filter a cohort callset with VariantFiltration.

>cd $PROJECT_PATH

>software/gatk-4.3.0.0/gatk --java-options "-Xmx100g" VariantFiltration -R ref/ucsc.no_hap.hg19.fa -G-filter "GQ < 20.0" -G-filter-name lowGQ \

-G-filter "DP < 10.0" -G-filter-name lowDP --set-filtered-genotype-to-no-call -V 03_variant_qc/cohort_after_VQSR.AB.vcf -O 03_variant_qc/cohort_after_VQSR.AB.VF.vcf

  • 16.

    Apply final hard filters to a cohort callset.

>cd $PROJECT_PATH

>software/vcftools/bin/vcftools --vcf 03_variant_qc/cohort_after_VQSR.AB.VF.vcf --max-alleles 2 --hwe 0.000001 --max-missing 0.9 \

--recode --recode-INFO-all --out 03_variant_qc/cohort_snpQC

Step 4: Quality control for samples

Timing: 0.5 h

In this section, we perform sample QC. We exclude samples if they show a low average call rate (< 0.9), a low mean sequencing depth (< 10), or a low mean genotype quality (< 65). Outliers (> 4 SD from the mean) of the transition/transversion ratio, heterozygous/homozygous ratio, or insertion/deletion ratio within each cohort are further discarded. We also remove samples that are closely related to one another, have an ambiguous sex status, or are population outliers in principal component analysis (PCA).
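The "> 4 SD from the mean" outlier rule can be sketched as a single awk pass over a per-sample metric table. The file tstv.txt and its values below are fabricated for illustration (20 typical Ts/Tv ratios plus one extreme sample); in the pipeline, the real metrics come from the bcftools stats outputs in step 20.

```shell
# Build a toy Ts/Tv table (fabricated values), then flag samples more
# than 4 population SDs from the cohort mean.
{ for i in $(seq 1 20); do echo "s$i 3.0"; done; echo "s21 9.0"; } > tstv.txt
awk '{ v[NR] = $2; id[NR] = $1; s += $2; ss += $2 * $2 }
     END { m = s / NR; sd = sqrt(ss / NR - m * m);
           for (i = 1; i <= NR; i++)
               if (v[i] > m + 4 * sd || v[i] < m - 4 * sd) print id[i] }' tstv.txt
```

With these numbers only s21 is flagged. Note that with very small cohorts this rule can never fire, because a single extreme sample inflates the SD it is compared against.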

  • 17.

    Sample-based missing data reports.

>cd $PROJECT_PATH

>mkdir 04_sample_qc

>software/vcftools/bin/vcftools --vcf 03_variant_qc/cohort_snpQC.recode.vcf --missing-indv --out 04_sample_qc/cohort_snpQC

  • 18.

    Calculate mean depth per sample.

>cd $PROJECT_PATH

>software/vcftools/bin/vcftools --vcf 03_variant_qc/cohort_snpQC.recode.vcf --depth --out 04_sample_qc/cohort_snpQC

  • 19.

    Calculate mean genotype-quality per sample.

>cd $PROJECT_PATH

>software/vcftools/bin/vcftools --vcf 03_variant_qc/cohort_snpQC.recode.vcf --extract-FORMAT-info GQ --out 04_sample_qc/cohort_snpQC

>cat 04_sample_qc/cohort_snpQC.GQ.FORMAT | sed '1d' | awk '{for(i=1;i<=NF;i++) total[i]+=$i;} END {for(i=1;i<=NF;i++) printf "%f ",total[i]/NR ;}' | sed 's/ /\n/g' > 04_sample_qc/cohort_snpQC.GQ

  • 20.

    Calculate transition/transversion ratio, heterozygous/homozygous ratio, and insertion/deletion ratio from bcftools stats and in-house python script.

>cd $PROJECT_PATH

>software/bcftools-1.9/bin/bcftools stats -s - 03_variant_qc/cohort_snpQC.recode.vcf > 04_sample_qc/cohort_snpQC_stats

>cat 04_sample_qc/cohort_snpQC_stats | grep "PSC" | grep -v "#" | awk '{print $3"\t"$7"\t"$8}' > 04_sample_qc/cohort_snpQC_stats.tstv

>cat 04_sample_qc/cohort_snpQC_stats | grep "PSC" | grep -v "#" | awk '{print $3"\t"$5"\t"$6}' > 04_sample_qc/cohort_snpQC_stats.homhet

>cat 04_sample_qc/cohort_snpQC_stats | grep "PSC" | grep -v "#" | awk '{print $3"\t"$5+$6"\t"$9}'> 04_sample_qc/cohort_snpQC_stats.indelsnp

>cat 04_sample_qc/cohort_snpQC_stats | grep "PSC" | grep -v "#" | awk '{print $3"\t"$11}' > 04_sample_qc/cohort_snpQC_stats.singleton

>software/vcftools/bin/vcftools --vcf 03_variant_qc/cohort_snpQC.recode.vcf --keep-only-indels --recode --recode-INFO-all --out 04_sample_qc/cohort_snpQC.indel

>software/vcftools/bin/vcftools --vcf 04_sample_qc/cohort_snpQC.indel.recode.vcf --extract-FORMAT-info GT --out 04_sample_qc/cohort_snpQC.indel

>python3 software/scripts/calc_indel_ratio.py --path_indel_vcf 04_sample_qc/cohort_snpQC.indel.recode.vcf --path_out 04_sample_qc/cohort_indel_ratio.txt --path_gt 04_sample_qc/cohort_snpQC.indel.GT.FORMAT

  • 21.

    Sex discrepancy.

Note: Samples with an X-chromosome inbreeding coefficient > 0.8 are classified as male, while samples with a coefficient < 0.4 are classified as female. Samples with coefficients between 0.4 and 0.8 are classified as having an ambiguous sex status and are excluded from the dataset.

>cd $PROJECT_PATH

>software/plink/plink --vcf 03_variant_qc/cohort_snpQC.recode.vcf --recode --make-bed --out 04_sample_qc/cohort_snpQC

>cat 04_sample_qc/cohort_snpQC.bim | awk '{print $1"\t"$1":"$4":"$6":"$5"\t"$3"\t"$4"\t"$5"\t"$6}' > 04_sample_qc/cohort_snpQC.bim1

>mv 04_sample_qc/cohort_snpQC.bim1 04_sample_qc/cohort_snpQC.bim

>software/plink/plink --bfile 04_sample_qc/cohort_snpQC --exclude ref/inversion.txt --range --indep-pairwise 50 5 0.2 --out 04_sample_qc/indepSNP

>software/plink/plink --bfile 04_sample_qc/cohort_snpQC --extract ref/indepSNP.prune.in --check-sex --set-hh-missing --out 04_sample_qc/cohort_snpQC
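The sex-classification thresholds in the note above can be sketched with awk on the inbreeding-coefficient column of a .sexcheck-style file. The file sexcheck.txt and its two columns (sample, F) below are a made-up simplification; plink's real .sexcheck output has additional columns.

```shell
# Toy classification from an F (inbreeding coefficient) column, using the
# thresholds in the note: > 0.8 male, < 0.4 female, otherwise ambiguous.
cat > sexcheck.txt <<'EOF'
s1 0.99
s2 0.05
s3 0.60
EOF
awk '{ f = $2; c = (f > 0.8) ? "male" : (f < 0.4) ? "female" : "ambiguous";
       print $1, c }' sexcheck.txt
```

Samples printed as "ambiguous" (here, s3) would be removed from the dataset.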

  • 22.

    Relatedness check by identity-by-descent

Note: Unrelated individuals are defined by IBD proportion <0.2.

>cd $PROJECT_PATH

>software/plink/plink --bfile 04_sample_qc/cohort_snpQC --extract 04_sample_qc/indepSNP.prune.in --genome --out 04_sample_qc/cohort_snpQC
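The .genome file produced by --genome can then be screened for related pairs, of which one member is typically removed to keep only unrelated individuals. A sketch, assuming the standard PLINK column names (IID1, IID2, PI_HAT); the toy lines are invented for illustration:

```python
def related_pairs(genome_lines, pi_hat_max=0.2):
    """Return sample pairs with IBD proportion (PI_HAT) >= pi_hat_max
    from PLINK --genome output."""
    header = genome_lines[0].split()
    idx = {c: i for i, c in enumerate(header)}
    pairs = []
    for line in genome_lines[1:]:
        f = line.split()
        if float(f[idx["PI_HAT"]]) >= pi_hat_max:
            pairs.append((f[idx["IID1"]], f[idx["IID2"]]))
    return pairs

lines = [
    "FID1 IID1 FID2 IID2 RT EZ Z0 Z1 Z2 PI_HAT",
    "F1 A F2 B UN NA 0.95 0.04 0.01 0.03",
    "F1 A F3 C UN NA 0.25 0.50 0.25 0.50",  # e.g., full siblings
]
print(related_pairs(lines))  # -> [('A', 'C')]
```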

  • 23.

    Estimating population structure and sample ancestry. We perform Linkage Disequilibrium (LD) pruning on the variants before running PCA.

Note: We use a subset of high-confidence single-nucleotide polymorphisms (MAF >5%) in the exome capture region to calculate the PCs. We perform a series of principal component analyses (PCAs) to identify ancestral backgrounds and control for population stratification.

>cd $PROJECT_PATH

>software/plink/plink --bfile ref/Merge --geno 0.1 --mind 0.1 --maf 0.05 --allow-no-sex --make-bed --out 04_sample_qc/1kG_PCA

>software/plink/plink --bfile 04_sample_qc/cohort_snpQC --geno 0.1 --mind 0.1 --maf 0.05 --make-bed --out 04_sample_qc/cohort_snpQC_maf0.05

>awk '{print$2}' 04_sample_qc/cohort_snpQC_maf0.05.bim > 04_sample_qc/cohort_snpQC_maf0.05.txt

>software/plink/plink --bfile 04_sample_qc/1kG_PCA --extract 04_sample_qc/cohort_snpQC_maf0.05.txt --make-bed --out 04_sample_qc/1kG_PCA1

>awk '{print $2}' 04_sample_qc/1kG_PCA1.bim > 04_sample_qc/1kG_PCA1.txt

>software/plink/plink --bfile 04_sample_qc/cohort_snpQC_maf0.05 --extract 04_sample_qc/1kG_PCA1.txt --recode --make-bed --out 04_sample_qc/cohort_snpQC_maf0.05_1kg

>awk '{print $2,$4}' 04_sample_qc/cohort_snpQC_maf0.05_1kg.map > 04_sample_qc/buildhapmap.txt

>software/plink/plink --bfile 04_sample_qc/1kG_PCA1 --update-map 04_sample_qc/buildhapmap.txt --make-bed --out 04_sample_qc/1kG_PCA2

>software/plink/plink --bfile 04_sample_qc/cohort_snpQC_maf0.05_1kg --bmerge 04_sample_qc/1kG_PCA2.bed 04_sample_qc/1kG_PCA2.bim 04_sample_qc/1kG_PCA2.fam --allow-no-sex --make-bed --out 04_sample_qc/cohort_snpQC_maf0.05_1kg_pca

>software/plink/plink --bfile 04_sample_qc/cohort_snpQC_maf0.05_1kg_pca --exclude ref/inversion.txt --range --indep-pairwise 50 5 0.2 --out 04_sample_qc/indepSNP

>software/plink/plink --bfile 04_sample_qc/cohort_snpQC_maf0.05_1kg_pca --extract 04_sample_qc/indepSNP.prune.in --pca --out 04_sample_qc/cohort_snpQC_maf0.05_1kg_pca

>software/plink/plink --bfile 04_sample_qc/cohort_snpQC_maf0.05 --exclude ref/inversion.txt --range --indep-pairwise 50 5 0.2 --out 04_sample_qc/indepSNP

>software/plink/plink --bfile 04_sample_qc/cohort_snpQC_maf0.05 --extract 04_sample_qc/indepSNP.prune.in --pca --out 04_sample_qc/cohort_snpQC_maf0.05
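One way to use the resulting .eigenvec coordinates is to assign each study sample the superpopulation of its nearest 1000 Genomes centroid in PC space. This nearest-centroid rule is a simplification for illustration (the pipeline itself only computes the PCs with PLINK), and the coordinates below are invented:

```python
import math

def assign_ancestry(sample_pcs, ref_pcs):
    """Assign each study sample the ancestry label of the nearest
    reference-population centroid in (PC1, PC2) space."""
    centroids = {}
    for pop, coords in ref_pcs.items():
        n = len(coords)
        centroids[pop] = tuple(sum(c[i] for c in coords) / n for i in range(2))
    out = {}
    for s, (x, y) in sample_pcs.items():
        out[s] = min(centroids,
                     key=lambda p: math.hypot(x - centroids[p][0],
                                              y - centroids[p][1]))
    return out

# Toy PC coordinates for two reference superpopulations and two samples.
ref = {"EAS": [(0.01, 0.03), (0.012, 0.028)],
       "EUR": [(-0.02, -0.01), (-0.018, -0.012)]}
samples = {"S1": (0.011, 0.029), "S2": (-0.019, -0.011)}
print(assign_ancestry(samples, ref))  # -> {'S1': 'EAS', 'S2': 'EUR'}
```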

  • 24.

    Retain samples that pass our quality control filters.

>cd $PROJECT_PATH

>software/vcftools/bin/vcftools --vcf 03_variant_qc/cohort_snpQC.recode.vcf --keep 04_sample_qc/keep_samples.txt \

--non-ref-ac-any 1 --hwe 0.000001 --max-missing 0.9 --recode --recode-INFO-all --out 04_sample_qc/cohort_snpQC_sampleQC

Step 5: Variant annotation

Inline graphicTiming: 4.0 h

In this section, we perform the annotation of variants with Ensembl’s VEP for human genome assembly GRCh37. We use the VEP, CADD, LOFTEE and SpliceAI plugins to generate additional bioinformatic predictions of variant deleteriousness.

  • 25.

Use the genotypes in the VCF file to correct its AC (alternate allele count), AF (alternate allele frequency), and NS (number of samples with called genotypes) fields.

>cd $PROJECT_PATH

>mkdir 05_variantannot

>conda activate vcflib

>software/vcflib/vcffixup 04_sample_qc/cohort_snpQC_sampleQC.recode.vcf > 05_variantannot/cohort_snpQC_sampleQC_fix.recode.vcf
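What vcffixup does can be illustrated for a single biallelic record: recompute AC, AF, and NS directly from the genotype columns. This is a minimal sketch of the idea, not vcflib's actual implementation:

```python
def fix_counts(gt_fields):
    """Recompute AC, AF, and NS from the genotype columns of one VCF
    record, assuming a biallelic site (ALT allele index 1)."""
    ac = an = ns = 0
    for gt in gt_fields:
        # genotype is the first colon-separated FORMAT field, e.g. "0/1"
        alleles = gt.split(":")[0].replace("|", "/").split("/")
        called = [a for a in alleles if a != "."]
        if called:
            ns += 1          # NS: samples with at least one called allele
        an += len(called)    # total called alleles
        ac += sum(1 for a in called if a == "1")  # AC: ALT allele count
    af = ac / an if an else 0.0                   # AF: AC / AN
    return ac, af, ns

print(fix_counts(["0/1", "1|1", "0/0", "./."]))  # -> (3, 0.5, 3)
```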

  • 26.

    VEP annotation.

>cd $PROJECT_PATH

>software/ensembl-vep/vep -i 05_variantannot/cohort_snpQC_sampleQC_fix.recode.vcf --offline --cache --dir_cache ref/vep/vep_cache/ --plugin CADD,ref/vep/Plugins/whole_genome_SNVs.tsv.gz,ref/vep/Plugins/InDels.tsv.gz --plugin LoF,loftee_path:ref/vep/Plugins/loftee/loftee/,human_ancestor_fa:ref/vep/Plugins/loftee/human_ancestor.fa.gz,conservation_file:ref/vep/Plugins/loftee/phylocsf_gerp.sql,gerp_file:ref/vep/Plugins/loftee/GERP_scores.final.sorted.txt.gz --plugin SpliceAI,snv=ref/vep/vcf/spliceai_scores.raw.snv.hg19.vcf.gz,indel=ref/vep/vcf/spliceai_scores.raw.indel.hg19.vcf.gz --dir_plugins ref/vep/Plugins --custom ref/vep/vcf/gnomad.genomes.r2.1.1.sites.vcf.gz,gnomADg,vcf,exact,0,AC,AF,AF_afr,AF_amr,AF_asj,AF_eas,AF_fin,AF_nfe,AF_oth --fasta ref/ucsc.no_hap.hg19.fa --force_overwrite --sift b --polyphen b --regulatory --numbers --ccds --hgvs --symbol --xref_refseq --canonical --protein --biotype --af --af_1kg --af_esp --af_gnomad --max_af --pick --vcf --output_file 05_variantannot/cohort_snpQC_sampleQC_fix_vep.vcf

Step 6: Exome-wide single-variant association analysis

Inline graphicTiming: 0.5 h

In this section, we conduct two types of single-variant association analyses, the EMMAX test and MLMA, on all samples. For a rapid test of this protocol, we provide a test dataset derived from our large cohort.

  • 27.

    Generate in-house principal component as a covariate after sample and variant QC. We perform LD pruning on the variants before running PCA.

>cd $PROJECT_PATH

>software/plink/plink --bfile 06_ExWAS/ExWAS_test --indep-pairwise 50 5 0.2 --out 06_ExWAS/ExWAS_test_indep

>software/plink/plink --bfile 06_ExWAS/ExWAS_test --extract 06_ExWAS/ExWAS_test_indep.prune.in --maf 0.05 --snps-only --make-bed --out 06_ExWAS/ExWAS_test_maf0.05

>software/plink/plink --bfile 06_ExWAS/ExWAS_test_maf0.05 --pca --autosome --out 06_ExWAS/ExWAS_test_PCA

  • 28.

    Perform ExWAS using EMMAX.

>cd $PROJECT_PATH

>software/plink/plink --bfile 06_ExWAS/ExWAS_test --recode12 --output-missing-genotype 0 --transpose --out 06_ExWAS/ExWAS_test_emmax

>software/emmax-kin-intel64 -v -d 10 06_ExWAS/ExWAS_test_emmax

>software/emmax-intel64 -v -d 10 -t 06_ExWAS/ExWAS_test_emmax -p 06_ExWAS/ExWAS_test.pheno -k 06_ExWAS/ExWAS_test_emmax.aBN.kinf -o 06_ExWAS/ExWAS_test_emmax.out

>cat 06_ExWAS/ExWAS_test_emmax.out.ps | cut -f1 | sed 's/:/\t/g' | paste - 06_ExWAS/ExWAS_test_emmax.out.ps > 06_ExWAS/ExWAS_test_emmax.out.txt

  • 29.

    Perform ExWAS using MLMA.

>cd $PROJECT_PATH

>software/gcta64 --bfile 06_ExWAS/ExWAS_test --autosome --make-grm --out 06_ExWAS/ExWAS_test --thread-num 5

>software/gcta64 --mlma --bfile 06_ExWAS/ExWAS_test --grm 06_ExWAS/ExWAS_test --pheno 06_ExWAS/ExWAS_test.pheno --out 06_ExWAS/ExWAS_test_MLMA --thread-num 5

>software/gcta64 --mlma-loco --bfile 06_ExWAS/ExWAS_test --grm 06_ExWAS/ExWAS_test --pheno 06_ExWAS/ExWAS_test.pheno --out 06_ExWAS/ExWAS_test_MLMALOCO --thread-num 5
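A quick sanity check on the mixed-model results is the genomic inflation factor lambda_GC, computed here from the effect sizes and standard errors in the MLMA output (Wald chi-square). This diagnostic is a common convention rather than a pipeline output, and the toy numbers are constructed for illustration:

```python
import statistics

def lambda_gc(betas, ses):
    """Genomic inflation factor: median Wald chi-square divided by the
    expected median of a 1-df chi-square (~0.4549). Values near 1
    indicate well-controlled population stratification."""
    chi2 = [(b / s) ** 2 for b, s in zip(betas, ses)]
    return statistics.median(chi2) / 0.4549

# b/se values whose median square is ~0.455 give lambda ~1 (null-like).
print(round(lambda_gc([0.1, 0.6745, 2.0], [1.0, 1.0, 1.0]), 2))  # -> 1.0
```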

Step 7: Gene-based collapsing analysis

Inline graphicTiming: 1.5 h

To determine whether a single gene is enriched in or depleted of rare protein-coding variants in cases, we perform four gene-level association tests including Fisher’s exact test, burden, SKAT and SKAT-O.

  • 30.

    Qualifying variants.

Note: Allele frequencies are estimated among our case–control samples and in three external exome sequence databases (1000 Genomes, ESP, and gnomAD). We classify variants by allele frequency using the following criteria: (1) common (MAF >5%); (2) low-frequency (0.5% < MAF < 5%); and (3) rare (MAF <0.5%). Missense variants are classified as "inframe_deletion", "inframe_insertion", "missense_variant", or "stop_lost" variants. Among the missense variants, damaging missense variants are those predicted as "probably damaging" by PolyPhen-2 and "deleterious" by SIFT, with CADD >15. PTVs are classified as "frameshift_variant", "splice_acceptor_variant", "splice_donor_variant", "stop_gained", or "start_lost" variants, and are labeled as "high-confidence" (HC) according to LOFTEE and to SpliceAI scores >0.5.

>cd $PROJECT_PATH

>software/vcftools/bin/vcftools --vcf 07_collapsing/test.vcf --mac 1 --max-maf 0.005 --recode --recode-INFO-all --out 07_collapsing/test_rare

>python3 software/scripts/vcf2mat.py -p 07_collapsing/test_rare.recode.vcf -o 07_collapsing/test_rare_lof.txt -e MAX_AF -F 0.005 -v lof -c 07_collapsing/test_rare_lof.out

Inline graphicCRITICAL: This step creates the intermediate files required for running the burden tests that identify disease-related genes. The input is a VCF file containing variant information with LOFTEE, SpliceAI, and CADD scores annotated using VEP.
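The frequency bins and consequence classes from the qualifying-variants note can be sketched as follows. The combination rule in is_qualifying_ptv (LOFTEE high-confidence or SpliceAI delta >0.5) is our reading of the note and may differ from the pipeline's internal logic:

```python
def classify_maf(maf):
    """Frequency bins used by the protocol: rare (MAF < 0.5%),
    low-frequency (0.5% <= MAF < 5%), common (MAF >= 5%)."""
    if maf < 0.005:
        return "rare"
    if maf < 0.05:
        return "low-frequency"
    return "common"

# VEP consequence terms used when selecting qualifying variants.
MISSENSE = {"inframe_deletion", "inframe_insertion", "missense_variant",
            "stop_lost"}
PTV = {"frameshift_variant", "splice_acceptor_variant",
       "splice_donor_variant", "stop_gained", "start_lost"}

def is_qualifying_ptv(consequence, loftee, spliceai_delta):
    """PTV that is LOFTEE high-confidence or has SpliceAI delta > 0.5
    (how the two criteria combine is our assumption)."""
    return consequence in PTV and (loftee == "HC" or spliceai_delta > 0.5)

print(classify_maf(0.001), classify_maf(0.02), classify_maf(0.2))
# -> rare low-frequency common
print(is_qualifying_ptv("stop_gained", "HC", 0.0))  # -> True
```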

  • 31.

    Burden tests for rare variants using Fisher’s method.

>cd $PROJECT_PATH

>python3 software/scripts/fisher_test_burden.py -p 07_collapsing/test_rare_lof.txt -c 07_collapsing/test_class.txt -o 07_collapsing/test_rare_lof_fisher.txt -l 40
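For intuition, the per-gene test reduces to a Fisher's exact test on a 2x2 table of carriers versus non-carriers in cases and controls. Below is a self-contained two-sided implementation via the hypergeometric distribution; it is a stand-in for illustration, not the internals of fisher_test_burden.py:

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table
    [[a, b], [c, d]] = [[case carriers, case non-carriers],
                        [control carriers, control non-carriers]],
    summing hypergeometric probabilities no larger than the observed one."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, row1)

    def prob(x):  # P(exactly x carriers fall among the cases)
        return comb(col1, x) * comb(n - col1, row1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))

# A balanced table shows no enrichment:
print(round(fisher_exact_two_sided(3, 97, 3, 97), 6))  # -> 1.0
# 5 carriers among 100 cases vs 1 among 100 controls:
print(round(fisher_exact_two_sided(5, 95, 1, 99), 3))
```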

  • 32.

    Burden tests for rare variants using SKAT method.

>cd $PROJECT_PATH

>python3 software/scripts/skat_burden.py -p 07_collapsing/test_rare_lof.txt -c 07_collapsing/test_class.txt -o 07_collapsing/test_rare_lof_skat.txt

Step 8: Systems genetics

Inline graphicTiming: 2.0 h

In this section, we build a signed coexpression network for a test dataset derived from peripheral retina using weighted gene coexpression network analysis (WGCNA). Additionally, we calculate the odds ratios (ORs) and p-values of candidate risk genes enriched in the module genes from WGCNA.

  • 33.

    Weighted gene coexpression network analysis and module enrichment analysis in R.

>cd $PROJECT_PATH/08_systems_genetics

>RSCRIPT=`which Rscript`

>$RSCRIPT ../software/scripts/WGCNA.R ../rawdata/datExpr1_combine_test.txt

Expected outcomes

This protocol describes a pipeline named MAGICpipeline, which identifies associations between diseases and genetic variants from population-scale WES data. Steps 1–5 of the step-by-step method generate VCF files. In the proposed protocol, diverse outputs are produced at various processing steps, encompassing multiple levels of analysis. Specifically, the main outputs are alignments to the reference genome in BAM format, variant callsets from each caller in VCF format, and annotated, consolidated variant callsets in VCF format. Step 6 generates disease-associated variants using a variant-level exome-wide association study (Figure 1). Figure 2, presented in step 8, shows a signed coexpression network generated from weighted gene coexpression network analysis, depicting refined and meaningful gene expression patterns specific to retina testing samples.

Figure 1.

Figure 1

Variants and genes ascertainment of exome-wide association analysis

Manhattan plots of the single-variant association analyses calculated using the EMMAX test (A) and the MLMA test (B). The plots show -log10-transformed p-values for all SNPs.

Figure 2.

Figure 2

Weighted gene co-expression network analysis on peripheral retina testing samples

(A) Soft-thresholding power β chosen by the scale-free topology criterion.

(B) Cluster dendrogram of co-expression modules.

(C) Hierarchical clustering and heatmap plot of module eigengenes that summarize the modules found in the clustering analysis.

Limitations

One limitation of MAGICpipeline is that it requires individual-level genotype and phenotype data. Given privacy concerns, some studies may not share such data. Future endeavors can focus on extending the capabilities of MAGICpipeline by using rare-variant summary statistics. The next step is to extend its applicability to other settings of rare-variant association analyses, including gene–environment interaction, multi-trait and phenome-wide analysis. Future work will also explore additional functional categories in the coding genome and incorporate tissue- and cell type-specific functional annotations.

Troubleshooting

Problem 1

In step 7, the burden tests using the SKAT method fail with the error message: “libRblas.so: cannot open shared object file: No such file or directory”.

Potential solution

The error can usually be resolved by adding the R libraries to the library path in your ~/.bashrc:

>export LD_LIBRARY_PATH="/path/to/R/lib64/R/lib:$LD_LIBRARY_PATH"

Problem 2

The VEP annotation process might be slow when processing a large VCF file.

Potential solution

The VCF file can be split by chromosome to reduce file size and allow the chunks to be annotated in parallel with VEP.

>cd $PROJECT_PATH

>software/bgzip 05_variantannot/cohort_snpQC_sampleQC_fix.recode.vcf

>software/bcftools-1.9/bin/bcftools index 05_variantannot/cohort_snpQC_sampleQC_fix.recode.vcf.gz

>software/bcftools-1.9/bin/bcftools index -s 05_variantannot/cohort_snpQC_sampleQC_fix.recode.vcf.gz | cut -f 1 | while read C; do software/bcftools-1.9/bin/bcftools view -O z -o 05_variantannot/split.${C}.vcf.gz 05_variantannot/cohort_snpQC_sampleQC_fix.recode.vcf.gz "${C}" ; done

>ls 05_variantannot/split.*.vcf.gz | while read file

do

software/ensembl-vep/vep -i ${file} --offline --cache --dir_cache ref/vep/vep_cache/ --plugin CADD,ref/vep/Plugins/whole_genome_SNVs.tsv.gz,ref/vep/Plugins/InDels.tsv.gz --plugin LoF,loftee_path:ref/vep/Plugins/loftee/loftee/,human_ancestor_fa:ref/vep/Plugins/loftee/human_ancestor.fa.gz,conservation_file:ref/vep/Plugins/loftee/phylocsf_gerp.sql,gerp_file:ref/vep/Plugins/loftee/GERP_scores.final.sorted.txt.gz --plugin SpliceAI,snv=ref/vep/vcf/spliceai_scores.raw.snv.hg19.vcf.gz,indel=ref/vep/vcf/spliceai_scores.raw.indel.hg19.vcf.gz --dir_plugins ref/vep/Plugins --custom ref/vep/vcf/gnomad.genomes.r2.1.1.sites.vcf.gz,gnomADg,vcf,exact,0,AC,AF,AF_afr,AF_amr,AF_asj,AF_eas,AF_fin,AF_nfe,AF_oth --fasta ref/ucsc.no_hap.hg19.fa --force_overwrite --sift b --polyphen b --regulatory --numbers --ccds --hgvs --symbol --xref_refseq --canonical --protein --biotype --af --af_1kg --af_esp --af_gnomad --max_af --pick --vcf --output_file "${file}.vep" &

done

Problem 3

In step 8, it may be the case that too many of the modules from the co-expression network are too small to be significantly enriched for genes associated with monogenic diseases.

Potential solution

This can be addressed by increasing the minModuleSize parameter in the network construction command. This yields a smaller number of larger modules, which may obscure some distinctions between smaller modules but allows for more successful enrichment analyses.

Resource availability

Lead contact

Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Jianzhong Su (sujz@wmu.edu.cn).

Technical contact

Technical questions on executing this protocol should be directed to and will be answered by the technical contact, Jianzhong Su (sujz@wmu.edu.cn).

Materials availability

This study did not generate new unique reagents.

Data and code availability

This protocol did not generate any new datasets. The sample data analyzed in this protocol can be found at Zenodo (https://zenodo.org/records/8189201) and using the links in the data retrieval box in the respective section. The code templates outlined through the article are available at GitHub: https://github.com/sulab-wmu/MAGIC-PIPELINE.

Acknowledgments

We thank Dr. Zhen Ji Chen for constructive comments regarding this manuscript. This work was supported by the National Natural Science Foundation of China (U20A20364 and 81830027), the Zhejiang Provincial Key Research and Development Program Grant (2021C03102) to J.Q., and the National Natural Science Foundation of China (82172882) to J.S.

Author contributions

The study was conceived, designed, and supervised by J.S. and J.Y. Analysis of data was performed by J.Y. and K.L. The manuscript was written by J.Y., K.L., H.P., Y.Z., and Y.Y.

Declaration of interests

The authors declare no competing interests.

References

  • 1.Su J., Huang F., Tian Y., Tian R., Qianqian G., Bello S.T., Zeng D., Jendrichovsky P., Lau C.G., Xiong W., et al. Sequencing of 19,219 exomes identifies a low-frequency variant in FKBP5 promoter predisposing to high myopia in a Han Chinese population. Cell Rep. 2023;42:113467. doi: 10.1016/j.celrep.2023.113467. [DOI] [PubMed] [Google Scholar]
  • 2.Groopman E.E., Marasa M., Cameron-Christie S., Petrovski S., Aggarwal V.S., Milo-Rasouly H., Li Y., Zhang J., Nestor J., Krithivasan P., et al. Diagnostic Utility of Exome Sequencing for Kidney Disease. N. Engl. J. Med. 2019;380:142–151. doi: 10.1056/NEJMoa1806891. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Yang Y., Muzny D.M., Reid J.G., Bainbridge M.N., Willis A., Ward P.A., Braxton A., Beuten J., Xia F., Niu Z., et al. Clinical whole-exome sequencing for the diagnosis of mendelian disorders. N. Engl. J. Med. 2013;369:1502–1511. doi: 10.1056/NEJMoa1306555. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Tarailo-Graovac M., Shyr C., Ross C.J., Horvath G.A., Salvarinova R., Ye X.C., Zhang L.-H., Bhavsar A.P., Lee J.J.Y., Drögemöller B.I., et al. Exome sequencing and the management of neurometabolic disorders. N. Engl. J. Med. 2016;374:2246–2255. doi: 10.1056/NEJMoa1515792. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Satterstrom F.K., Kosmicki J.A., Wang J., Breen M.S., De Rubeis S., An J.-Y., Peng M., Collins R., Grove J., Klei L., et al. Large-scale exome sequencing study implicates both developmental and functional changes in the neurobiology of autism. Cell. 2020;180:568–584.e23. doi: 10.1016/j.cell.2019.12.036. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Koch L. The power of large-scale exome sequencing. Nat. Rev. Genet. 2021;22:549. doi: 10.1038/s41576-021-00397-x. [DOI] [PubMed] [Google Scholar]
  • 7.Backman J.D., Li A.H., Marcketta A., Sun D., Mbatchou J., Kessler M.D., Benner C., Liu D., Locke A.E., Balasubramanian S., et al. Exome sequencing and analysis of 454,787 UK Biobank participants. Nature. 2021;599:628–634. doi: 10.1038/s41586-021-04103-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.McKenna A., Hanna M., Banks E., Sivachenko A., Cibulskis K., Kernytsky A., Garimella K., Altshuler D., Gabriel S., Daly M., DePristo M.A. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20:1297–1303. doi: 10.1101/gr.107524.110. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Van der Auwera G.A., Carneiro M.O., Hartl C., Poplin R., Del Angel G., Levy-Moonshine A., Jordan T., Shakir K., Roazen D., Thibault J., et al. From FastQ data to high-confidence variant calls: the genome analysis toolkit best practices pipeline. Curr. Protoc. Bioinformatics. 2013;43:11. doi: 10.1002/0471250953.bi1110s43. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.1000 Genomes Project Consortium. Abecasis G.R., Altshuler D., Auton A., Brooks L.D., Durbin R.M., Gibbs R.A., Hurles M.E., McVean G.A. A map of human genome variation from population scale sequencing. Nature. 2010;467:1061–1073. doi: 10.1038/nature09534. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Karczewski K.J., Francioli L.C., Tiao G., Cummings B.B., Alföldi J., Wang Q., Collins R.L., Laricchia K.M., Ganna A., Birnbaum D.P., et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 2020;581:434–443. doi: 10.1038/s41586-020-2308-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Sherry S.T., Ward M., Sirotkin K. dbSNP—database for single nucleotide polymorphisms and other classes of minor genetic variation. Genome Res. 1999;9:677–679. [PubMed] [Google Scholar]
  • 13.GTEx Consortium. Genetic effects on gene expression across human tissues. Nature. 2017;550:204–213. doi: 10.1038/nature24277. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Purcell S., Neale B., Todd-Brown K., Thomas L., Ferreira M.A.R., Bender D., Maller J., Sklar P., de Bakker P.I.W., Daly M.J., Sham P.C. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 2007;81:559–575. doi: 10.1086/519795. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.McLaren W., Gil L., Hunt S.E., Riat H.S., Ritchie G.R.S., Thormann A., Flicek P., Cunningham F. The ensembl variant effect predictor. Genome Biol. 2016;17:122. doi: 10.1186/s13059-016-0974-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Garrison E., Kronenberg Z.N., Dawson E.T., Pedersen B.S., Prins P. A spectrum of free software tools for processing the VCF variant call format: vcflib, bio-vcf, cyvcf2, hts-nim and slivar. PLoS Comput. Biol. 2022;18 doi: 10.1371/journal.pcbi.1009123. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Yang J., Lee S.H., Goddard M.E., Visscher P.M. GCTA: a tool for genome-wide complex trait analysis. Am. J. Hum. Genet. 2011;88:76–82. doi: 10.1016/j.ajhg.2010.11.011. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Kang H.M., Sul J.H., Service S.K., Zaitlen N.A., Kong S.-y., Freimer N.B., Sabatti C., Eskin E. Variance component model to account for sample structure in genome-wide association studies. Nat. Genet. 2010;42:348–354. doi: 10.1038/ng.548. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Verrou K.-M., Pavlopoulos G.A., Moulos P. Protocol for unbiased, consolidated variant calling from whole exome sequencing data. STAR Protoc. 2022;3 doi: 10.1016/j.xpro.2022.101418. [DOI] [PMC free article] [PubMed] [Google Scholar]


Articles from STAR Protocols are provided here courtesy of Elsevier
