PAPipe: A Pipeline for Comprehensive Population Genetic Analysis

Nayoung Park; Hyeonji Kim; Jeongmin Oh; Jinseok Kim; Charyeong Heo; Jaebum Kim

doi:10.1093/molbev/msae040

2024 Mar 1;41(3):msae040. doi: 10.1093/molbev/msae040


Editor: Andrey Rzhetsky

PMCID: PMC10919927 PMID: 38427787

Abstract

Advancements in next-generation sequencing (NGS) technologies have led to a substantial increase in the availability of population genetic variant data, thus prompting the development of various population analysis tools to enhance our understanding of population structure and evolution. The tools that are currently used to analyze population genetic variant data generally require different environments, parameters, and formats of the input data, which can act as a barrier preventing the widespread use of such tools by general researchers who may not be familiar with bioinformatics. To address this problem, we have developed an automated and comprehensive pipeline called PAPipe to perform nine widely used population genetic analyses using population NGS data. PAPipe seamlessly interconnects and serializes multiple steps, such as read trimming and mapping, genetic variant calling, data filtering, and format converting, along with nine population genetic analyses such as principal component analysis, phylogenetic analysis, population tree analysis, population structure analysis, linkage disequilibrium decay analysis, selective sweep analysis, population admixture analysis, sequentially Markovian coalescent analysis, and fixation index analysis. PAPipe also provides an easy-to-use web interface that allows for the parameters to be set and the analysis results to be browsed in an intuitive manner. PAPipe can be used to generate extensive results that provide insights that can help enhance user convenience and data usability. PAPipe is freely available at https://github.com/jkimlab/PAPipe.

Keywords: next-generation sequencing, single nucleotide polymorphism, population genetic analysis, pipeline

Introduction

Population genetics is the study of genetic similarities and differences within and between populations (Nei 1975; Casillas and Barbadilla 2017). The discovery of genetic variations represented by population allele frequencies as well as comparisons of population profiles of genetic variations from different perspectives both play important roles in this field of study. Based on the theory of modern synthesis, which integrates Darwinian evolution and Mendelian genetics, the concept of evolution can be defined by changes in allele frequency (Dannemann et al. 2016; Franchini et al. 2020). Accordingly, population genetic variant data can be used to make inferences regarding evolution within populations by estimating genetic structure, phylogenetic relationships, and effective population size (Nishiyama et al. 2012; Franchini et al. 2020; Lee et al. 2020b; Reimer et al. 2020). Inference and estimation of population structure and characteristics can be used in various studies, such as population immunity studies (Daub et al. 2013; Barreiro and Quintana-Murci 2020), etiology studies (Di Rienzo 2006; Torkamani et al. 2012; Parkes et al. 2013), studies on population-specific diseases (Bhatia et al. 2011; Zhong et al. 2020), studies investigating domesticated animals (Parker et al. 2017; Chen et al. 2018; Fitak et al. 2020), and population migration studies (Choudhury et al. 2014; Parker et al. 2017; Browning et al. 2018).

Developments in next-generation sequencing (NGS) technologies have made it easy and cost-effective to generate vast amounts of high-quality, individual genome sequencing data. Moreover, continued advancement in NGS technologies has resulted in the accumulation of population sequencing data. Therefore, various tools have been developed for applications in population genetic analyses. Because population genetics is based on genetic similarities and differences within and between populations, population genetic analyses generally begin by examining population genetic variant data (Li et al. 2009; Van der Auwera et al. 2013), which represent fundamental inputs in various downstream analyses, such as phylogenetic tree inference, principal component analysis (PCA), and population structure analysis. In many studies, different types of population genetic analyses have been combined in attempts to comprehensively understand the target populations and draw meaningful conclusions (Baumsteiger et al. 2017; Lee et al. 2017, 2020a).

However, calling population genetic variants from population sequencing data is a complex task, because it requires multiple tools to be run, with each tool requiring a specific running environment and several preprocessing steps to appropriately prepare the input data. Further, after obtaining the population genetic variant data, the genetic variant data must be formatted differently for different population genetic analysis tools, each of which also requires a specific running environment and particular parameter settings. These limitations pose challenges for researchers who are not familiar with bioinformatics.

To address these challenges, recent efforts have generated simple pipelines by combining multiple tasks. For example, a number of studies have developed several different genetic variant-calling pipelines (Wang et al. 2013; Oliveira et al. 2015; Ip et al. 2020) by combining multiple steps in genetic variant calling such that they are executed by a single command. Another pipeline-building study for population structure analysis (Mussmann et al. 2020) has developed ADMIXPIPE. This combines multiple tasks such as the filtering of single nucleotide polymorphisms (SNPs), running the ADMIXTURE program (Alexander et al. 2009), and visualizing population structures with the CLUMPAK program (Kopelman et al. 2015) to facilitate population structure analysis. Unfortunately, those pipelines were only developed to handle specific analyses such as genetic variant calling or population structure analyses. Other efforts aiming to bridge between lower-level manipulation of population sequencing data and higher-level applications of population genetic variants have resulted in the development of a pipeline platform. For example, the Pop-Gen Pipeline Platform (PPP) (Webb et al. 2021) has been implemented as a software platform to support several sets of scripts for VCF file processing, data manipulation, file format conversion, and five population genetic analyses. However, PPP only provides modulated functions that can be used as building blocks in constructing a whole pipeline, so the burden of running a series of scripts to conduct population genetic analyses remains. Moreover, population genetic variant data must be available before using the PPP.

To address the above problems, in this study we present an automated pipeline called PAPipe for comprehensive population genetic analyses that use population NGS data as input. PAPipe consists of five main steps: read trimming, read mapping, genetic variant calling, data filtering and format converting, and analysis result generating. With only the population sequencing data and appropriate configuration of parameters, PAPipe enables automated result generation from nine population genetic analyses: PCA, phylogenetic analysis, population tree analysis, population structure analysis, linkage disequilibrium (LD) decay analysis, selective sweep analysis, population admixture analysis, sequentially Markovian coalescent analysis, and fixation index analysis. Moreover, to enhance user convenience, PAPipe provides default parameters for every step, and researchers can change these parameters using an easy-to-use web interface. When researchers have intermediate data, such as read alignment data or genetic variant call data in advance, they can use such data as intermediate input for further downstream analysis. PAPipe also provides a web interface for easily browsing analysis results.

PAPipe is designed to generate results from extensive population genetic analyses to promote discussion and improve the utilization of population whole-genome sequencing (WGS) data. Therefore, PAPipe is expected to be very useful for gaining in-depth insight for target populations. PAPipe is implemented for the Linux operating system, and Docker image files are provided to reduce the challenges associated with environment settings. The source code of PAPipe is available at https://github.com/jkimlab/PAPipe.

New Approaches

PAPipe was designed as a comprehensive pipeline with which to generate results of various population genetic analyses using raw WGS reads. PAPipe features the seamless connection of five main steps: read trimming, read mapping, genetic variant calling, data filtering and format converting, and analysis result generating. It can be run using a single command (Fig. 1). The analysis result generating step generates high-quality visualizations of the results of nine population genetic analyses: PCA, phylogenetic analysis, population tree analysis, population structure analysis, LD decay analysis, selective sweep analysis, population admixture analysis, sequentially Markovian coalescent analysis, and fixation index analysis. Therefore, PAPipe allows researchers to easily perform the above nine population genetic analyses all at once. In addition to this high-level automation of analyses, PAPipe is flexible and can be customized by users. For example, users can easily (i) skip some steps in PAPipe if they want to use pregenerated data, (ii) choose analysis programs in the read mapping and genetic variant calling steps, (iii) perform some subset of the nine population genetic analyses in the analysis result generating step if they so desire, (iv) change the parameters of all analysis tools included in PAPipe using an easy-to-use web interface, and (v) browse the analysis results using another easy-to-use web interface. The Materials and Methods section provides further details of each step in PAPipe.

Fig. 1. — Workflow of the PAPipe pipeline. Population WGS reads and reference genome data are serially processed in five steps: read trimming, read mapping, genetic variant calling, data filtering and format converting, and analysis result generating. The dashed boxes in the second and third steps indicate optional tools that can be chosen by users. In the analysis result generating step, nine population genetic analyses are performed and results, including visualized figures, are generated.

Results and Discussion

The applicability of PAPipe was checked and confirmed by using population WGS data of five cattle breeds: Angus, Holstein, Jersey, Simmental, and Hanwoo, which is a Korean cattle breed (Materials and Methods). Table 1 lists the number of SNPs identified from the five cattle populations by PAPipe. The results of the nine population genetic analyses were generated in the following two steps in PAPipe.

Table 1.

Summary of the identified SNPs from five cattle populations

Population	No. of identified SNPs (dbSNP%^a)	Ti/Tv ratio
Angus	11,005,804 (97.90)	2.193
Hanwoo	16,160,746 (97.15)	2.216
Holstein	12,322,531 (97.76)	2.199
Jersey	12,446,343 (94.98)	2.169
Simmental	12,397,793 (98.60)	2.215


^aPercentage of SNPs observed in the dbSNP database.

Clustering of Cattle Populations Based on Genetic Signatures

PAPipe can calculate principal components (PCs) and their proportion of variance explained (PVE) using identified SNPs in the PCA step. For the top PCs whose sum of PVEs is greater than 80%, PCA plots for all pairs of PCs are then generated. We used PAPipe to cluster these five cattle populations based on the genetic signatures represented by the identified SNPs. In this analysis, the top 14 PCs were selected and 91 PCA plots were created (Fig. 2a and b, supplementary fig. S1, Supplementary Material online). Based on the PCA plot between PC1 and PC2 (Fig. 2a), all five cattle breeds could clearly be separated from each other, with Holstein and Simmental being relatively close to each other, consistent with observations in a previous study (Lee et al. 2016). Hanwoo was the most distinct population in terms of PC1. Additional characteristics can be discovered by observing other PCA plots. For example, Angus and Jersey were placed very closely in the PCA plot between PC1 and PC4 (Fig. 2b). This may indicate that Angus and Jersey share some genetic features. As shown above, the identified PCs and multiple PCA plots automatically produced by PAPipe can help researchers to find interesting clustering patterns from multiple angles through different PCs.

Fig. 2. — Figures generated by PAPipe for PCA (a, b), phylogenetic tree construction (c), and population structure analysis (d). Two example PCA plots are shown for PC1 and PC2 (a), and PC1 and PC4 (b). Numbers in parentheses on each axis indicate the PVE. In c), phylogenetic relationships among 48 individuals from five cattle breeds are visualized. In d), admixture plots are generated for four different K values (from 2 to 5).

Discovering Phylogenetic Relationships Among Cattle Populations

PAPipe can predict phylogenetic relationships among individuals in given populations and draw a phylogenetic tree using SNPs in the phylogenetic analysis step. Figure 2c shows a phylogenetic tree generated by PAPipe, which represents the relationships among individuals in the five cattle populations. As shown in Fig. 2c, most individuals from the same cattle breed were grouped together. Specifically, the Angus, Holstein, and Simmental populations were placed closer to each other than the other two populations, which is consistent with the PCA plot shown in Fig. 2a. Interestingly, one Holstein individual (Holstein9) was predicted to belong to the Simmental population.

Inferring the Ancestry of Cattle Populations based on Population Structure

PAPipe can analyze the structure of given populations and draw population structure plots using different K values (which range from 2 to 5 by default) and SNPs in the population structure analysis step. Researchers can easily trace the ancestry of populations by comparing multiple plots generated by PAPipe with different K values together. The structures of the cattle populations were examined by PAPipe, and the generated plots with K values ranging from 2 to 5 were compared (Fig. 2d). When K = 2, Jersey was clearly distinguished from the other four populations. Holstein and Simmental remained highly similar to each other with the K values ranging from 2 to 4, which is consistent with the results shown in Fig. 2a to c. Angus (with K values ranging from 3 to 5) and Hanwoo (with K values ranging from 4 to 5) were found to have relatively isolated ancestries, which is similar to the results of a previous study (Lee et al. 2016).

Comparing the Recombination Patterns of Cattle Populations

PAPipe can compare the patterns of LD of populations and create plots by examining pairs of SNPs separated by a range of distances (maximum distance 500 Kbp, 1 Mbp, 5 Mbp, and 10 Mbp by default) in the LD decay analysis step. PAPipe was used to compare recombination patterns of the cattle populations (Fig. 3a and b, supplementary fig. S2, Supplementary Material online). Jersey showed the most dramatic decay of LD between pairs of SNPs closer than 100 Kbp while the patterns of the other four populations were highly similar to each other. With increasing distance between SNP pairs, Hanwoo and Simmental were found to be separated from Angus and Holstein, while the decay patterns of Angus and Holstein became almost indistinguishable from each other (Fig. 3b).

Fig. 3. — Figures generated by PAPipe for LD decay analysis (a, b) and PSMC analysis (c). Plots (a) and (b) were generated using maximum distance parameter values of 500 Kbp and 1 Mbp, respectively. The box in the middle of plot (b) shows a zoomed-in pattern. On the x axis in (c), g is the number of years per generation and μ is the absolute mutation rate per nucleotide.

Identifying the Mode of Admixture Among Cattle Populations

PAPipe can calculate various statistics, such as $F_{3}$, $F_{4}$, and $D$, which can help identify the mode of admixture for all possible combinations of populations using SNPs in the population admixture analysis step. Using PAPipe, those population admixture statistics were calculated and compared for all possible combinations of the five cattle populations (Table 2, supplementary tables S2 to S5, Supplementary Material online). For example, Hanwoo was examined to check whether or not it was a result of the admixture of any two populations using $F_{3}$ (Table 2). In this case, a negative $F_{3}$ value indicates that Hanwoo was admixed between populations 1 and 2 listed in Table 2. However, all $F_{3}$ values were positive, and their Z scores were very large. Therefore, there was no statistically significant evidence indicating that Hanwoo was admixed between any two populations in our analysis. A similar pattern whereby the Hanwoo population was distinct was already observed in the population structure analysis shown in Fig. 2d.

Table 2.

Estimation of $F_{3}$ statistic using Hanwoo as a target population

Population 1	Population 2	$F_{3}$ (SE^a)	Z-score
Angus	Holstein	0.0903 (0.0016)	57.246
Angus	Jersey	0.0874 (0.0015)	57.438
Angus	Simmental	0.0862 (0.0014)	61.143
Holstein	Jersey	0.0791 (0.0015)	52.976
Holstein	Simmental	0.0809 (0.0014)	56.163
Jersey	Simmental	0.0821 (0.0014)	57.043


Populations 1 and 2 represent the two counterparts of the target population Hanwoo.

^aStandard error.
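To make the sign convention of the $F_{3}$ test in Table 2 concrete, the following is a minimal sketch of a naive $F_{3}$ estimate computed from per-SNP allele frequencies, together with the Z score as the estimate divided by its standard error. The function names and the toy frequencies are illustrative assumptions and are not part of PAPipe. Dedicated tools typically add a finite-sample bias correction and estimate the standard error by block jackknife, both of which are omitted here.

```python
from typing import Sequence


def f3_statistic(target: Sequence[float],
                 pop1: Sequence[float],
                 pop2: Sequence[float]) -> float:
    """Naive F3(target; pop1, pop2): mean over SNPs of (c - a) * (c - b).

    Inputs are allele frequencies of the same allele at the same SNPs.
    No finite-sample bias correction is applied (simplifying assumption).
    """
    if not (len(target) == len(pop1) == len(pop2)) or len(target) == 0:
        raise ValueError("frequency vectors must be non-empty and equally long")
    products = [(c - a) * (c - b) for c, a, b in zip(target, pop1, pop2)]
    return sum(products) / len(products)


def z_score(estimate: float, standard_error: float) -> float:
    """Z score of an estimate given its standard error."""
    if standard_error <= 0:
        raise ValueError("standard error must be positive")
    return estimate / standard_error


# Toy data (hypothetical): a target whose frequencies lie between the two
# source populations gives a negative value, the signature of admixture.
admixed = f3_statistic([0.5, 0.5], [0.2, 0.9], [0.8, 0.1])
# A target that is distinct from both sources gives a positive value.
distinct = f3_statistic([0.9, 0.1], [0.2, 0.5], [0.3, 0.6])
```

In this toy example `admixed` is negative and `distinct` is positive, mirroring the interpretation applied to Hanwoo above: positive values provide no evidence of admixture between the two counterpart populations.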

Inferring the Trajectory of the Effective Population Sizes of Cattle Populations

PAPipe can perform pairwise and multiple sequentially Markovian coalescent (PSMC and MSMC, respectively) analyses using the identified SNPs and then draw a plot showing the changes in the effective population sizes of the populations in the sequentially Markovian coalescent analysis step. Using this function of PAPipe, the trajectory of the effective population size of each cattle population was predicted and compared by the PSMC analysis (Fig. 3c). At around a million years ago, the effective population sizes of all cattle populations were highly similar to each other. However, after that time point, there were increases in both the effective population sizes of all cattle populations and their variances. The effective population sizes of Simmental and Hanwoo showed the most dramatic changes. This pattern was observed until around one hundred thousand years ago, after which the effective population size decreased overall for all cattle populations.

Detecting Genomic Regions With High Genetic Variation Among Cattle Populations

PAPipe can detect genomic regions with high genetic variation among given populations using identified SNPs. The results are visualized as a Manhattan plot in the fixation index ($F_{st}$) analysis step. In PAPipe, the Manhattan plot is automatically generated for every single population against all other populations using information on genomic regions that are significantly differentiated by default. Figure 4 shows the results of the fixation index analysis by PAPipe for Hanwoo (Fig. 4a) and Jersey (Fig. 4b) (supplementary fig. S3, Supplementary Material online for other cattle populations). Based on a $Z(F_{st})$ cutoff of 5, 380 and 460 genomic regions were predicted as being significantly differentiated in Hanwoo and Jersey, respectively. These reported genomic regions can be useful in many downstream analyses. For example, a total of 227 genes overlapped with the predicted genomic regions of Hanwoo, and enriched GO terms, such as GO:0005737 (cytoplasm), GO:0031090 (organelle membrane), and GO:0012505 (endomembrane system), were identified by the GO enrichment test conducted using g:Profiler (Raudvere et al. 2019).

Fig. 4. — Figures generated by PAPipe for fixation index analysis. Hanwoo and Jersey were used as target populations in (a) and (b), respectively. The horizontal line in the middle of the plot denotes the $Z(F_{st})$ threshold, which is 5.
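The $Z(F_{st})$ cutoff can be illustrated with a minimal sketch that standardizes per-window $F_{st}$ values and keeps the windows at or above the cutoff. The window names, the toy values, and the use of the population standard deviation are assumptions made for illustration; the exact standardization used inside PAPipe is not specified in the text.

```python
from statistics import mean, pstdev
from typing import Dict, List


def z_transform(values: List[float]) -> List[float]:
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("all values are identical; Z scores are undefined")
    return [(v - mu) / sigma for v in values]


def outlier_windows(fst_by_window: Dict[str, float],
                    cutoff: float = 5.0) -> List[str]:
    """Return the names of windows whose Z(Fst) is at or above the cutoff."""
    names = list(fst_by_window)
    z_values = z_transform([fst_by_window[name] for name in names])
    return [name for name, z in zip(names, z_values) if z >= cutoff]


# Toy data (hypothetical): 99 background windows and one strongly
# differentiated window.
toy_fst = {f"w{i}": 0.1 for i in range(99)}
toy_fst["w99"] = 0.9
selected = outlier_windows(toy_fst, cutoff=5.0)
```

With these toy values only the window `w99` is selected. Note that a cutoff of 5 can only be exceeded when there are enough windows, since a single extreme value among n windows has a Z score of at most the square root of n − 1.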

Inferring Population-Level Phylogenetic Relationships Among Cattle Populations

PAPipe can generate population-level phylogenetic trees that include mixture and conflict events between populations. Given a set of populations, PAPipe can produce multiple trees using different values of the parameter that specifies the number of migration events allowed to be added to the tree. Figure 5 shows two trees generated with this parameter set to 0 (Fig. 5a) and 1 (Fig. 5b) using the SNP data from the cattle populations. When migration is not allowed (Fig. 5a), Hanwoo is separated from the other four populations, and two distinct groups of populations (Angus and Jersey; Holstein and Simmental) can be observed. However, when a single migration is allowed (Fig. 5b), migration from Jersey to Angus can be seen as indicated by the arrow in the plot. This result is in line with the pattern obtained from the PCA (Fig. 2), showing the proximity between Angus and Jersey.

Fig. 5. — Figures generated by PAPipe for population tree analysis. Two tree topologies were obtained by using the parameter of the number of migration events set to 0 for (a) and 1 for (b). The drift parameter on the x axis represents the amount of genetic drift estimated for each population. The direction of the arrow crossing the tree indicates the direction of migration.

Detecting Selective Sweep Regions in Each Cattle Population

Given a set of populations, PAPipe can identify genomic regions with the signature of a selective sweep in each population. The analysis results are visualized by plotting the composite-likelihood ratio (CLR) value, which represents the potential level of a selective sweep, for every genomic region. Figure 6 shows the CLR values calculated for chromosome 1 of the Angus and Hanwoo populations (supplementary fig. S4, Supplementary Material online for other populations). For Angus, genomic regions with very high CLR values are observed at both the front and the end of chromosome 1. Meanwhile, for Hanwoo, the CLR values are generally lower, and strong signals are mostly observed at the front of chromosome 1.

Fig. 6. — Figures generated by PAPipe for selective sweep analysis. The CLR scores for genomic regions in chromosome 1 of Angus (a) and Hanwoo (b) are shown.

Conclusion

We developed PAPipe, a pipeline enabling comprehensive population genetic analysis. PAPipe automates a series of steps, including read trimming, read mapping, genetic variant calling, data filtering and format converting, and analysis result generating via nine population genetic analyses, such as PCA, phylogenetic analysis, population tree analysis, population structure analysis, LD decay analysis, selective sweep analysis, population admixture analysis, sequentially Markovian coalescent analysis, and fixation index analysis. The utility of PAPipe was demonstrated with a public dataset of cattle, and analysis results and publication-level figures were successfully generated.

PAPipe allows researchers to focus on analyzing the results, rather than having to spend large amounts of time executing individual programs, by automating the running of interconnected analysis programs. Despite the convenience of automated and default settings in PAPipe, some default parameters are not appropriate for specific analyses. Because of this, PAPipe is designed to allow researchers to easily modify specific parameters and browse the analysis results using an easy-to-use web interface. The current version of PAPipe only contains nine population genetic analyses. However, PAPipe will be continuously revised to include updated versions of existing tools in PAPipe or new population genetic analysis tools to increase the utility of PAPipe in various types of population genetic studies.

Materials and Methods

Details of Each Step in PAPipe

Read Trimming Step

PAPipe checks the quality of input NGS reads and trims them if necessary. This step consists of three sub-steps: assessing the quality of raw reads, trimming raw reads, and assessing the quality of trimmed reads. The quality of reads is assessed by FastQC (v 0.11.9) (Andrews 2010), while the web page to be browsed for summary statistics is generated by MultiQC (v 1.9) (Ewels et al. 2016). Read trimming is performed by TrimGalore (v 0.6.0) (Krueger et al. 2019), which is a commonly used program for trimming sequencing data. This step is typically used to enhance the data accuracy and improve the efficiency of subsequent population analysis.

Read Mapping Step

To obtain a set of genetic variants, NGS reads need to be aligned to reference genome sequences. PAPipe was constructed to use one of two widely used mapping tools, BWA (v 0.7.17-r1188) (Li and Durbin 2009) or Bowtie 2 (v 2.3.4.3) (Langmead and Salzberg 2012), to perform this task. The resulting read alignments are then processed with the Picard Toolkit (v 2.17.11) (Broad Institute 2018) to identify duplicates with the “MarkDuplicates” module and process read groups with the “AddOrReplaceReadGroups” module. Users can easily modify the parameter file of PAPipe to change or add parameters for each process. The read mapping step can also be skipped if users have prepared their own read alignment data.

Genetic Variant Calling Step

Similar to the read mapping step, PAPipe was designed to use one of three pipelines based on three tools, Genome Analysis ToolKit 3 (GATK3 v 3.8-1) (DePristo et al. 2011), Genome Analysis ToolKit 4 (GATK4 v 4.1.7.0), and BCFtools (v 1.9) (Li et al. 2009), which are often selected by users to call genetic variants from NGS read alignments.

Two GATK-based pipelines were constructed based on the GATK Best Practices Workflow for Germline short variant discovery (Van der Auwera et al. 2013). However, the two GATK-based pipelines have different alignment preprocessing steps. In the case of the pipeline based on GATK3 (v 3.8-1), the following processes are performed sequentially to realign and recalibrate reads: Picard (v 2.17.11) “ReorderSam”, GATK “RealignerTargetCreator”, “IndelRealigner”, “BaseRecalibrator”, and “PrintReads”. Meanwhile, in the pipeline based on GATK4 (v 4.1.7.0), only read recalibration is implemented, using GATK “BaseRecalibrator” and “ApplyBQSR”.

The genetic variant calling process shared by the two GATK-based pipelines is executed after the alignment preprocessing step. Specifically, “HaplotypeCaller” calls genetic variants per sample to generate single-sample GVCF files, while “CombineGVCFs” combines single-sample GVCF files into a single merged GVCF file. Next, “GenotypeGVCFs” is used for joint genotyping and to generate the first genotype VCF file containing all SNPs and insertions and deletions (indels). “SelectVariants” is used to separate SNPs from indels, thereby leaving only SNPs in the VCF file. In this process, the dbSNP parameter (--dbsnp) can be used to annotate SNPs with user-provided data from the dbSNP database. Finally, the two GATK-based pipelines were constructed to retain only SNPs that pass hard filtering, in which a SNP is filtered out if it meets any of the following criteria: QD < 2.0, QUAL < 30.0, SOR > 3.0, FS > 60.0, MQ < 40.0, MQRankSum < −12.5, and ReadPosRankSum < −8.0.
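The hard filtering logic can be sketched as follows, reading each listed expression as a condition that flags a SNP for removal, which is the usual convention for GATK hard filters. This is an illustration of the decision rule only and not the GATK implementation; the function names and example annotation values are hypothetical, and skipping annotations that are absent from a record is an assumption of this sketch.

```python
from typing import Callable, Dict, List

# Each entry maps an annotation name to a predicate that returns True when
# the SNP should be filtered out, using the thresholds listed in the text.
HARD_FILTERS: Dict[str, Callable[[float], bool]] = {
    "QD": lambda v: v < 2.0,
    "QUAL": lambda v: v < 30.0,
    "SOR": lambda v: v > 3.0,
    "FS": lambda v: v > 60.0,
    "MQ": lambda v: v < 40.0,
    "MQRankSum": lambda v: v < -12.5,
    "ReadPosRankSum": lambda v: v < -8.0,
}


def failed_filters(annotations: Dict[str, float]) -> List[str]:
    """Return the names of the hard filters that a SNP record triggers.

    Annotations missing from the record are skipped (assumption).
    """
    return [name for name, fails in HARD_FILTERS.items()
            if name in annotations and fails(annotations[name])]


def passes_hard_filters(annotations: Dict[str, float]) -> bool:
    """A SNP is retained only when it triggers none of the filters."""
    return not failed_filters(annotations)


# Hypothetical annotation values for two SNP records.
good_snp = {"QD": 20.0, "QUAL": 500.0, "SOR": 1.0, "FS": 2.0,
            "MQ": 60.0, "MQRankSum": 0.1, "ReadPosRankSum": 0.3}
bad_snp = dict(good_snp, FS=75.0, MQ=30.0)
```

Here `good_snp` is retained, whereas `bad_snp` is removed because it triggers the FS and MQ filters.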

The BCFtools-based pipeline was developed to directly process read alignment files without the need for additional preprocessing steps. SAMtools (v 1.9-36-gb6fea3c) “mpileup” is used to combine multiple files containing the alignments of a single sample into a single BCF file. The parameter “-g” is used to group nonvariant sites into gVCF blocks according to the minimum per-sample read depth during the “mpileup” process. The BCFtools (v 1.9) “call” function is then used to call SNPs and indels with the parameters “-c” and “-v”. Finally, “vcfutils.pl varFilter” is used to discard indels and filter SNPs according to the following default filtering criteria: d = 2, D = 10000000, a = 2, W = 10, Q = 10, w = 3, 1 = 1e−4, 2 = 1e−100, 3 = 0, 4 = 1e−4, G = 0, S = 1000, and e = 1e−4. The VCFtools “--TsTv-by-count” option is also used to additionally calculate the Ti/Tv ratio.
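As a reminder of what the Ti/Tv ratio reported in Table 1 measures, the following minimal sketch counts transitions (purine to purine or pyrimidine to pyrimidine) and transversions (all other base changes) over a list of biallelic SNPs. It is not the VCFtools implementation; the function names and the toy SNP list are illustrative assumptions.

```python
from typing import Iterable, Tuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def is_transition(ref: str, alt: str) -> bool:
    """True when both alleles are purines or both are pyrimidines."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def titv_ratio(snps: Iterable[Tuple[str, str]]) -> float:
    """Transition/transversion ratio over (ref, alt) pairs.

    Records that are not single-base substitutions between A, C, G, and T
    (for example indels) are ignored.
    """
    transitions = 0
    transversions = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue  # skip indels, ambiguous bases, and non-variant records
        if is_transition(ref, alt):
            transitions += 1
        else:
            transversions += 1
    if transversions == 0:
        raise ValueError("no transversions observed; ratio is undefined")
    return transitions / transversions


# Toy data (hypothetical): three transitions, two transversions, one indel.
toy_snps = [("A", "G"), ("C", "T"), ("A", "C"),
            ("G", "T"), ("T", "C"), ("AT", "A")]
toy_ratio = titv_ratio(toy_snps)
```

For the toy list the ratio is 3/2 = 1.5; the indel record is ignored.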

Similar to the read mapping step, users can change or add parameters for each process by modifying the parameter file of PAPipe. It is also possible for users to skip this genetic variant calling step if they prefer to use their own genetic variant data.

Data Filtering and Format Converting Step

Genetic variant analysis tools require their own input data format. PAPipe provides a data filtering and format converting step to overcome this potential issue and thus facilitate the seamless execution of nine population genetic analysis tools. PLINK (v 1.9) (Purcell et al. 2007), VCFtools (v 0.1.17) (Danecek et al. 2011), and an in-house Perl script are used to execute three different treatments. PLINK is used to filter SNPs and convert the file format to the PLINK binary format (BED) using the following default parameters: --geno 0.01, --maf 0.05, --hwe 0.000001, --make-bed. VCFtools (v 0.1.17) is used to convert the file format of genetic variants to the PLINK format (PED). PAPipe also provides an in-house Perl script to convert the file format of genetic variants to HapMap format (HAPMAP). As in the previous steps, parameters can be changed or added by modifying the PAPipe parameter file.
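The effect of the default missingness and minor allele frequency thresholds can be sketched as below. This is a simplified illustration and not PLINK itself: genotypes are assumed to be coded as the number of alternate alleles (0, 1, or 2) with `None` for a missing call, the Hardy-Weinberg test is omitted, and all function names and toy data are hypothetical.

```python
from typing import List, Optional

Genotype = Optional[int]  # 0, 1, or 2 alternate alleles; None when missing


def missing_rate(genotypes: List[Genotype]) -> float:
    """Fraction of samples without a genotype call at this SNP."""
    return sum(g is None for g in genotypes) / len(genotypes)


def minor_allele_frequency(genotypes: List[Genotype]) -> float:
    """Frequency of the less common allele among called genotypes."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_frequency = sum(called) / (2 * len(called))
    return min(alt_frequency, 1.0 - alt_frequency)


def keep_variant(genotypes: List[Genotype],
                 max_missing: float = 0.01,
                 min_maf: float = 0.05) -> bool:
    """Keep a SNP when its missing rate does not exceed max_missing and its
    minor allele frequency is at least min_maf."""
    return (missing_rate(genotypes) <= max_missing
            and minor_allele_frequency(genotypes) >= min_maf)


# Toy data (hypothetical genotype vectors for three SNPs).
common_snp = [0, 1, 2, 1, 0, 1, 2, 1, 0, 1]        # MAF 0.45, no missing calls
rare_snp = [0] * 19 + [1]                           # MAF 0.025
gappy_snp = [0, 1, 2, None, 1, 0, 1, 2, 1, 0]       # 10% missing calls
```

With the default thresholds only `common_snp` is kept; `rare_snp` fails the frequency threshold and `gappy_snp` fails the missingness threshold.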

Analysis Result Generating Step

By using data generated from the previous steps as input, PAPipe can run several tools for the following nine population genetic analyses: PCA, phylogenetic analysis, population tree analysis, population structure analysis, LD decay analysis, selective sweep analysis, population admixture analysis, sequentially Markovian coalescent analysis, and fixation index analysis. All nine analyses can be run with default parameters provided by PAPipe. It is also possible to only run some subset of the tools, if so desired by the user. Details of the supported analyses run by PAPipe are explained in the following.

Principal Component Analysis

PCA is a basic analysis for examining population structure by identifying relationships between different populations. It is implemented in two steps: (i) identifying PCs using the population genetic variant dataset and (ii) drawing 2D plots for two different PCs. Genome-wide Complex Trait Analysis (v 1.26.0) (Yang et al. 2011) or PLINK (v 2.00a5) (Chang et al. 2015) can be used to perform the PCA, while an in-house R script is used for visualization. Our PCA pipeline generates multiple 2D plots comprising all possible pairs of two different PCs whose sum of variance exceeds some given percentage (where the default value is set at 80%). Users can also limit the number of generated plots using the “maxPC” parameter.
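The plot selection rule can be sketched as follows: take the smallest number of top PCs whose cumulative PVE exceeds the threshold, then enumerate every pair of those PCs. This is an illustration rather than the in-house R script; the function names are hypothetical, and treating “maxPC” as an upper bound on the number of selected PCs is an assumption of this sketch.

```python
from itertools import combinations
from typing import List, Optional, Sequence, Tuple


def select_top_pcs(pve_percent: Sequence[float],
                   threshold: float = 80.0,
                   max_pc: Optional[int] = None) -> int:
    """Number of leading PCs needed for the cumulative PVE to exceed threshold.

    pve_percent lists the PVE of PC1, PC2, ... in percent. If max_pc is
    given, the result is capped at max_pc (assumed meaning of "maxPC").
    """
    cumulative = 0.0
    selected = 0
    for value in pve_percent:
        cumulative += value
        selected += 1
        if cumulative > threshold:
            break
    if max_pc is not None:
        selected = min(selected, max_pc)
    return selected


def pc_pairs(n_pcs: int) -> List[Tuple[int, int]]:
    """All unordered pairs of PC indices (1-based), one per 2D plot."""
    return list(combinations(range(1, n_pcs + 1), 2))


# Toy PVE values (hypothetical): cumulative sums are 40, 65, 75, 83, ...
toy_pve = [40.0, 25.0, 10.0, 8.0, 5.0, 4.0, 3.0]
n_selected = select_top_pcs(toy_pve)
plots = pc_pairs(n_selected)
```

For the toy values four PCs are selected, giving six plots. The same counting explains the cattle analysis, in which 14 selected PCs yield 14 × 13 / 2 = 91 plots.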

Phylogenetic Analysis

A phylogenetic tree can be used to infer the evolutionary history of populations using the genetic variant dataset. SNPhylo (v 20180901) (Lee et al. 2014) is used to generate a phylogenetic tree from genetic variant data in the HapMap format. The constructed trees are presented as a text file in Newick (NWK) format, alongside figures of the unrooted tree.

Population Tree Analysis

Treemix (v 1.13) (Pickrell and Pritchard 2012) was integrated into PAPipe to infer historical relationships among populations based on genetic admixtures. Treemix requires proper variant filtration and LD pruning as an initial step for generating input data. The variant filtration is performed using VCFtools (v 0.1.17) with the “--max-missing 1” parameter by default, while the LD pruning is executed using the script provided by Treemix (v 1.13) with the “ldPruning 0.1” parameter by default. Subsequently, Treemix (v 1.13) is executed with the default parameters, aside from “-k 500”, meaning 500 SNPs per block, and multiple values of the “-m” parameter, which represents the number of migration events used in estimating the tree. The default value of the “-m” parameter is 0, but PAPipe enables repeated execution of Treemix using a range of “-m” values from 0 to a user-provided value.
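The repeated execution over “-m” values can be sketched by building one Treemix command line per migration count. This only constructs argument lists and does not run anything; the helper name, input file name, and output naming scheme are hypothetical and do not reflect how PAPipe names its files.

```python
from typing import List


def treemix_commands(input_path: str,
                     out_prefix: str,
                     max_migrations: int,
                     block_size: int = 500) -> List[List[str]]:
    """Build one Treemix argument list per migration count from 0 to max_migrations.

    -i: input allele frequency file, -k: SNPs per block,
    -m: number of migration events, -o: output stem.
    """
    if max_migrations < 0:
        raise ValueError("max_migrations must be zero or positive")
    return [
        ["treemix", "-i", input_path, "-k", str(block_size),
         "-m", str(m), "-o", f"{out_prefix}.m{m}"]
        for m in range(max_migrations + 1)
    ]


# Hypothetical file names; a user-provided value of 1 gives runs for m = 0 and m = 1.
commands = treemix_commands("cattle.treemix.frq.gz", "cattle", max_migrations=1)
```

Each argument list could then be passed to `subprocess.run` by a caller; the two runs produced here correspond to the two trees shown in Fig. 5.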

Population Structure Analysis

The population structure can be analyzed by estimating the proportion of ancestral origins in the individual admixture. In PAPipe, ADMIXTURE (v 1.3.0) (Alexander et al. 2009) is used to estimate individual ancestry, while CLUMPAK (v 1.1) (Kopelman et al. 2015) is used to visualize the results. Parameter K, which is the number of inferred ancestral populations for clustering, is required to run ADMIXTURE. Users typically perform this analysis multiple times with different K values and then use the most reliable result. Therefore, PAPipe accepts a single parameter value k (5 by default) for K and automatically performs this analysis k − 1 times using 2 through k as the K value. This generates k − 1 admixture plots for different K values.
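As a small illustration of the K = 2 through k schedule, the following Python sketch lists the ADMIXTURE invocations that such a loop would produce. The function name `admixture_runs` and the input file name are hypothetical; the command form (program, input file, K) follows ADMIXTURE's basic usage, and any further options PAPipe may pass are not shown.

```python
def admixture_runs(bed_file, k=5):
    """Return one ADMIXTURE command per K in 2..k (k - 1 runs in total)."""
    return [["admixture", bed_file, str(K)] for K in range(2, k + 1)]

runs = admixture_runs("population.bed", k=5)
print(len(runs))  # number of runs equals k - 1
for cmd in runs:
    print(" ".join(cmd))
```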

LD Decay Analysis

LD is a nonrandom association of alleles at two or more different loci. Various factors can affect LD, including the recombination rate, population structure, and number of generations. Therefore, a decay of LD represents a history of population recombination. In this analysis, PopLDdecay (v 3.31) (Zhang et al. 2019) is used to calculate a pairwise measure of LD ( $r^{2}$ ). Given a maximum distance as a parameter value, PopLDdecay calculates $r^{2}$ values from all available pairs of SNPs located within the maximum distance. Next, SNP pairs with the same distance are collected and their average $r^{2}$ value is calculated. Finally, the average $r^{2}$ values are plotted as a function of the distance between SNPs using the script provided by PopLDdecay. PAPipe runs PopLDdecay several times using different maximum distances (500 Kbp, 1 Mbp, 5 Mbp, and 10 Mbp by default).
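The averaging step described above can be sketched in Python. The example rests on two assumptions that are not taken from PopLDdecay: $r^{2}$ is approximated as the squared Pearson correlation between genotype dosage vectors (0, 1, or 2 copies of an allele), and pairs are grouped by exact distance rather than by distance bins. All names and the toy data are hypothetical.

```python
from collections import defaultdict

def r_squared(x, y):
    """Squared Pearson correlation between two genotype dosage vectors."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return None  # monomorphic site: r^2 is undefined
    return cov * cov / (vx * vy)

def ld_decay(positions, genotypes, max_dist):
    """Average r^2 per SNP-pair distance for pairs within max_dist bp.

    positions must be sorted in ascending order on one chromosome.
    """
    by_dist = defaultdict(list)
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            dist = positions[j] - positions[i]
            if dist > max_dist:
                break  # sorted positions: later SNPs are even farther away
            r2 = r_squared(genotypes[i], genotypes[j])
            if r2 is not None:
                by_dist[dist].append(r2)
    return {d: sum(v) / len(v) for d, v in sorted(by_dist.items())}

# Toy data: three SNPs genotyped in four individuals.
positions = [100, 200, 300]
genotypes = [[0, 1, 2, 1], [0, 1, 2, 1], [0, 0, 2, 2]]
print(ld_decay(positions, genotypes, max_dist=200))
```

Plotting the returned averages against distance gives the decay curve; rerunning with a different `max_dist` mirrors how PAPipe repeats the analysis for several maximum distances.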

Selective Sweep Analysis

SweepFinder2 (v 1.0) (DeGiorgio et al. 2016) is used to detect selective sweeps. The input data for SweepFinder2 are generated by population variant extraction and polarization of a variant file. Variants are extracted for each population by PLINK (v 1.9), and polarization is performed using the “polarizeVCFbyOutgroup” script (Ullrich 2021). SweepFinder2 is run with the default grid size parameter set to “-sg 1000”.

Population Admixture Analysis

PAPipe contains AdmixTools (v 7.0) (Patterson et al. 2012) to calculate $F_{3}$ , $F_{4}$ , differential $F_{4}$ , and D statistics. Given three sets of populations A, B, and C, the $F_{3}$ statistic is the product of the allele frequency differences between population C and each of the other two populations, A and B. The $F_{3}$ statistic can be used to test whether population C is derived from the admixture of the other two populations. Given four sets of populations A, B, C, and D, the $F_{4}$ statistic is the product of the allele frequency difference between populations A and B and the corresponding difference between populations C and D. The difference between two $F_{4}$ values is tested for statistical significance by a Z-test, which is often used to compare the admixture of different combinations of populations. The D statistic is estimated using a parsimonious method to detect gene flow among populations. PAPipe is implemented to calculate the above statistics from all possible combinations of populations, which obviates the need to configure subsets of populations.
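To make the two definitions concrete, the following Python sketch computes plug-in versions of $F_{3}$ and $F_{4}$ as averages over SNPs of the products described above. It is illustrative only: AdmixTools additionally applies bias corrections and block jackknife standard errors, which are omitted here, and the function names and allele frequencies are made up.

```python
def f3(c, a, b):
    """F3(C; A, B): mean over SNPs of (c - a) * (c - b)."""
    return sum((ci - ai) * (ci - bi) for ci, ai, bi in zip(c, a, b)) / len(c)

def f4(a, b, c, d):
    """F4(A, B; C, D): mean over SNPs of (a - b) * (c - d)."""
    return sum((ai - bi) * (ci - di)
               for ai, bi, ci, di in zip(a, b, c, d)) / len(a)

# Hypothetical allele frequencies at two SNPs in four populations.
freq_a = [0.1, 0.2]
freq_b = [0.9, 0.8]
freq_c = [0.5, 0.5]   # intermediate between A and B at both SNPs
freq_d = [0.4, 0.6]

print(f3(freq_c, freq_a, freq_b))
print(f4(freq_a, freq_b, freq_c, freq_d))
```

In this toy example the $F_{3}$ value is negative because the frequencies in C lie between those of A and B at both SNPs, which is the pattern expected when C is a mixture of the two.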

Sequentially Markovian Coalescent Analysis

One of the ultimate objectives of population genetic analysis is to elucidate the evolutionary history of populations by identifying the relationships among populations (Liu and Hansen 2017). To this end, effective population size has been used to track the demographic history of populations chronologically. PAPipe utilizes PSMC (v 0.6.5) (Li and Durbin 2011) analysis and MSMC (v 2.0.0) (Schiffels and Durbin 2014) analysis to estimate the effective population size of a specific population and determine whether it underwent expansion or contraction during its evolution.

PSMC analysis, which involves inferring the effective population size from a single diploid genome, has in the past been widely used to estimate demographic history from individual genomes. However, the MSMC analysis, which can simultaneously utilize multiple individuals, has emerged as an effective alternative approach.

The PSMC analysis in PAPipe was implemented in three steps: (i) processing input data, (ii) generating PSMC results, and (iii) visualization. In the first step, read alignment files are merged and processed to create a single alignment file in the PSMCFA format by SAMtools (v 1.9-36-gb6fea3c) mpileup with the “-C 50” parameter, vcfutils vcf2fq with the “-d 10” parameter, and BCFtools (v 1.9) view with the “-D 100” parameter. After the input data are prepared, PSMC (v 0.6.5) is run with the default parameters. Subsequently, plots are generated using the script obtained from PSMC.

In the case of the MSMC analysis, read alignment files are prepared in the MSMCFA format by SAMtools (v 1.9-36-gb6fea3c) mpileup with the “-C 50” parameter, BCFtools (v 1.9) call with the “-c” and “-V indel” parameters, and bamCaller.py in MSMC. Then, MSMC (v 2.0.0) is executed with the default parameters, and the final plots are generated using an in-house visualization script.

Fixation Index (F_st) Analysis

To identify genomic regions showing exceptional genetic variation in target populations, PAPipe uses VCFtools (v 0.1.17) (Danecek et al. 2011) to calculate the $F_{s t}$ value, which is a measure of population differentiation. $F_{s t}$ estimation requires the size of a sliding window and the size of the sliding step of that window. The default value is 100 Kbp for both the sliding window and the sliding step. The $F_{s t}$ value is calculated based on a pair of population subsets. By default, PAPipe is configured to iteratively perform $F_{s t}$ estimation (i) for every single population by setting a single population as a target and the remaining populations as a comparator and (ii) for every possible pair of two populations. Users can also change target populations by modifying the population group parameter. For example, if two out of four populations are set as target populations in the parameter, PAPipe executes the analysis only once, calculating $F_{s t}$ values by comparing the target populations against the remaining ones. Finally, for each pair of population subsets, the calculated $F_{s t}$ values are visualized as a plot using the “manhattan” function in the R package “qqman” (Turner 2018), with different colors used to mark different chromosomes.
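The default set of comparisons can be enumerated with a short Python sketch. It only lists which population subsets would be compared and does not compute $F_{s t}$ itself; the function name `default_fst_comparisons` is hypothetical, and the breed names are those of the cattle application described later in this article.

```python
from itertools import combinations

def default_fst_comparisons(populations):
    """List (target, comparator) subsets: one-vs-rest, then all pairs."""
    one_vs_rest = [
        ((p,), tuple(q for q in populations if q != p)) for p in populations
    ]
    pairwise = [((p,), (q,)) for p, q in combinations(populations, 2)]
    return one_vs_rest + pairwise

breeds = ["Angus", "Holstein", "Jersey", "Simmental", "Hanwoo"]
for target, comparator in default_fst_comparisons(breeds):
    print(",".join(target), "vs", ",".join(comparator))
```

For five populations this yields 5 one-vs-rest comparisons and 10 pairwise comparisons, 15 in total. A user-defined target group would instead correspond to a single (target, comparator) entry.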

Usability and Accessibility of PAPipe

PAPipe has been designed with a focus on user-friendliness, and it incorporates several features intended to enhance usability. (i) A web page for setting program parameters: PAPipe provides researchers with a web page (http://bioinfo.konkuk.ac.kr/PAPipe/parameter_builder/) on which they can set various program parameters within a visual web interface, which then automatically generates the desired parameter file. (ii) A web page for browsing pipeline output: PAPipe provides a web page summarizing all generated output from all population analyses. (iii) A small test dataset for easily understanding the pipeline: PAPipe provides a small test dataset and parameter file, thus helping researchers understand how to best prepare the input data and parameter files. (iv) Increased parallelization level of the pipeline: the whole pipeline of PAPipe has been optimized for performance by automatically conducting fork management for all feasible components, with the ultimate aim of enhancing execution speed. The running time and disk usage of each part in PAPipe as a function of the number of threads were measured using the test dataset (supplementary table S6, Supplementary Material online). Users can use this information to choose appropriate analyses based on their available time and computer resources.

Application of PAPipe to Cattle Populations

For the application of PAPipe, five cattle breeds were selected: Angus, Holstein, Jersey, Simmental, and a Korean cattle breed called Hanwoo. The population WGS data of Hanwoo were generated by Lee et al. (2014a) and downloaded from NCBI (n = 9; NCBI BioProject accession: PRJNA210523). The population WGS data of the remaining four cattle breeds were constructed by Daetwyler et al. (2014) and downloaded from NCBI (n = 9 for Holstein and n = 10 for the other three cattle breeds; NCBI BioProject accession: PRJNA238491; NCBI SRA accession numbers in supplementary table S1, Supplementary Material online).

The quality of the collected raw reads was first checked using IlluQC (NGSQCToolkit_v 2.3.3) (Patel and Jain 2012) with a paired-end option and an automatic detection of FASTQ variant option (-pe, 2, A). Filtering and trimming of the reads were then performed manually using TrimmingReads (NGSQCToolkit_v 2.3.3) (Patel and Jain 2012) with a read length cutoff threshold -n 45 and TrimGalore (v 0.6.0) (Krueger et al. 2019) with FastQC (v 0.11.9) arguments -t 10 along with other options (--paired, --length 45). Using quality-controlled population WGS data of the five cattle breeds as input, PAPipe was run with the default parameter values while using BWA (v 0.7.17-r1188) and GATK4 (v 4.1.7.0) in the read mapping step and the genetic variant calling step, respectively.

Supplementary Material

Supplementary material is available at Molecular Biology and Evolution online.

Acknowledgments

This work was supported by the Ministry of Science and ICT [NRF-2021M3H9A2097134, NRF-2022R1F1A1065159].

Contributor Information

Nayoung Park, Department of Biomedical Science and Engineering, Konkuk University, Seoul 05029, Republic of Korea.

Hyeonji Kim, Department of Biomedical Science and Engineering, Konkuk University, Seoul 05029, Republic of Korea.

Jeongmin Oh, Department of Biomedical Science and Engineering, Konkuk University, Seoul 05029, Republic of Korea.

Jinseok Kim, Department of Biomedical Science and Engineering, Konkuk University, Seoul 05029, Republic of Korea.

Charyeong Heo, Department of Biomedical Science and Engineering, Konkuk University, Seoul 05029, Republic of Korea.

Jaebum Kim, Department of Biomedical Science and Engineering, Konkuk University, Seoul 05029, Republic of Korea.

Data Availability

PAPipe is available at https://github.com/jkimlab/PAPipe.

References

Alexander DH, Novembre J, Lange K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 2009:19(9):1655–1664. 10.1101/gr.094052.109.
Andrews S. FastQC: A quality control tool for high throughput sequence data. 2010. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/.
Barreiro LB, Quintana-Murci L. Evolutionary and population (epi)genetics of immunity to infection. Hum Genet. 2020:139(6-7):723–732. 10.1007/s00439-020-02167-x.
Baumsteiger J, Moyle PB, Aguilar A, O'Rourke SM, Miller MR. Genomics clarifies taxonomic boundaries in a difficult species complex. PLoS One. 2017:12(12):e0189417. 10.1371/journal.pone.0189417.
Bhatia G, Patterson N, Pasaniuc B, Zaitlen N, Genovese G, Pollack S, Mallick S, Myers S, Tandon A, Spencer C, et al. Genome-wide comparison of African-ancestry populations from CARe and other cohorts reveals signals of natural selection. Am J Hum Genet. 2011:89(3):368–381. 10.1016/j.ajhg.2011.07.025.
Broad Institute. Picard tools. 2018. http://broadinstitute.github.io/picard/.
Browning SR, Browning BL, Daviglus ML, Durazo-Arvizu RA, Schneiderman N, Kaplan RC, Laurie CC. Ancestry-specific recent effective population size in the Americas. PLoS Genet. 2018:14(5):e1007385. 10.1371/journal.pgen.1007385.
Casillas S, Barbadilla A. Molecular population genetics. Genetics. 2017:205(3):1003–1035. 10.1534/genetics.116.196493.
Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, Lee JJ. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience. 2015:4(1):7. 10.1186/s13742-015-0047-8.
Chen J, Ni P, Li X, Han J, Jakovlić I, Zhang C, Zhao S. Population size may shape the accumulation of functional mutations following domestication. BMC Evol Biol. 2018:18(1):4. 10.1186/s12862-018-1120-6.
Choudhury A, Hazelhurst S, Meintjes A, Achinike-Oduaran O, Aron S, Gamieldien J, Jalali Sefid Dashti M, Mulder N, Tiffin N, Ramsay M. Population-specific common SNPs reflect demographic histories and highlight regions of genomic plasticity with functional relevance. BMC Genomics. 2014:15(1):437. 10.1186/1471-2164-15-437.
Daetwyler HD, Capitan A, Pausch H, Stothard P, van Binsbergen R, Brøndum RF, Liao X, Djari A, Rodriguez SC, Grohs C. Whole-genome sequencing of 234 bulls facilitates mapping of monogenic and complex traits in cattle. Nat Genet. 2014:46(8):858–865. 10.1038/ng.3034.
Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Handsaker RE, Lunter G, Marth GT, Sherry ST, et al. The variant call format and VCFtools. Bioinformatics. 2011:27(15):2156–2158. 10.1093/bioinformatics/btr330.
Dannemann M, Andrés AM, Kelso J. Introgression of neandertal- and denisovan-like haplotypes contributes to adaptive variation in human toll-like receptors. Am J Hum Genet. 2016:98(1):22–33. 10.1016/j.ajhg.2015.11.015.
Daub JT, Hofer T, Cutivet E, Dupanloup I, Quintana-Murci L, Robinson-Rechavi M, Excoffier L. Evidence for polygenic adaptation to pathogens in the human genome. Mol Biol Evol. 2013:30(7):1544–1558. 10.1093/molbev/mst080.
DeGiorgio M, Huber CD, Hubisz MJ, Hellmann I, Nielsen R. SweepFinder2: increased sensitivity, robustness and flexibility. Bioinformatics. 2016:32(12):1895–1897. 10.1093/bioinformatics/btw051.
DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011:43(5):491–498. 10.1038/ng.806.
Di Rienzo A. Population genetics models of common diseases. Curr Opin Genet Dev. 2006:16(6):630–636. 10.1016/j.gde.2006.10.002.
Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016:32(19):3047–3048. 10.1093/bioinformatics/btw354.
Fitak RR, Mohandesan E, Corander J, Yadamsuren A, Chuluunbat B, Abdelhadi O, Raziq A, Nagy P, Walzer C, Faye B, et al. Genomic signatures of domestication in Old World camels. Commun Biol. 2020:3(1):316. 10.1038/s42003-020-1039-5.
Franchini P, Kautt AF, Nater A, Antonini G, Castiglia R, Meyer A, Solano E. Reconstructing the evolutionary history of chromosomal races on Islands: a genome-wide analysis of natural house mouse populations. Mol Biol Evol. 2020:37(10):2825–2837. 10.1093/molbev/msaa118.
Ip EKK, Hadinata C, Ho JWK, Giannoulatou E. dv-trio: a family-based variant calling pipeline using DeepVariant. Bioinformatics. 2020:36(11):3549–3551. 10.1093/bioinformatics/btaa116.
Kopelman NM, Mayzel J, Jakobsson M, Rosenberg NA, Mayrose I. Clumpak: a program for identifying clustering modes and packaging population structure inferences across K. Mol Ecol Resour. 2015:15(5):1179–1191. 10.1111/1755-0998.12387.
Krueger F, James F, Ewels P, Afyounian E, Weinstein M, Schuster-Boeckler B, Hulselmans G, Clamons S. TrimGalore. 2019. https://github.com/FelixKrueger/TrimGalore.
Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012:9(4):357–359. 10.1038/nmeth.1923.
Lee D, Cho M, Hong WY, Lim D, Kim HC, Cho YM, Jeong JY, Choi BH, Ko Y, Kim J. Evolutionary analyses of Hanwoo (Korean Cattle)-specific single-nucleotide polymorphisms and genes using whole-genome resequencing data of a Hanwoo population. Mol Cells. 2016:39(9):692–698. 10.14348/molcells.2016.0148.
Lee TH, Guo H, Wang X, Kim C, Paterson AH. SNPhylo: a pipeline to construct a phylogenetic tree from huge SNP data. BMC Genomics. 2014:15(1):162. 10.1186/1471-2164-15-162.
Lee D, Lee J, Heo KN, Kwon K, Moon Y, Lim D, Lee KT, Kim J. Population analysis of the Korean native duck using whole-genome sequencing data. BMC Genomics. 2020a:21(1):554. 10.1186/s12864-020-06933-z.
Lee D, Lim D, Kwon D, Kim J, Lee J, Sim M, Choi BH, Choi SG, Kim J. Functional and evolutionary analysis of Korean bob-tailed native dog using whole-genome sequencing data. Sci Rep. 2017:7(1):17303. 10.1038/s41598-017-17817-w.
Lee SH, Seo DW, Cho ES, Choi BH, Kim YM, Hong JK, Han HD, Jung YB, Kim DJ, Choi TJ, et al. Genetic diversity and ancestral study for Korean native pigs using 60K SNP chip. Animals (Basel). 2020b:10(5):760. 10.3390/ani10050760.
Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009:25(14):1754–1760. 10.1093/bioinformatics/btp324.
Li H, Durbin R. Inference of human population history from individual whole-genome sequences. Nature. 2011:475(7357):493–496. 10.1038/nature10231.
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R. The sequence alignment/map format and SAMtools. Bioinformatics. 2009:25(16):2078–2079. 10.1093/bioinformatics/btp352.
Liu S, Hansen MM. PSMC (pairwise sequentially Markovian coalescent) analysis of RAD (restriction site associated DNA) sequencing data. Mol Ecol Resour. 2017:17(4):631–641. 10.1111/1755-0998.12606.
Mussmann SM, Douglas MR, Chafin TK, Douglas ME. ADMIXPIPE: population analyses in ADMIXTURE for non-model organisms. BMC Bioinformatics. 2020:21(1):337. 10.1186/s12859-020-03701-4.
Nei M. Molecular population genetics and evolution (Frontiers of Biology). Amsterdam: North-Holland Publishing Company; 1975.
Nishiyama T, Kishino H, Suzuki S, Ando R, Niimura H, Uemura H, Horita M, Ohnaka K, Kuriyama N, Mikami H, et al. Detailed analysis of Japanese population substructure with a focus on the southwest islands of Japan. PLoS One. 2012:7(4):e35000. 10.1371/journal.pone.0035000.
Oliveira TG, Mitne-Neto M, Cerdeira LT, Marsiglia JD, Arteaga-Fernandez E, Krieger JE, Pereira AC. A variant detection pipeline for inherited cardiomyopathy-associated genes using next-generation sequencing. J Mol Diagn. 2015:17(4):420–430. 10.1016/j.jmoldx.2015.02.003.
Parker HG, Dreger DL, Rimbault M, Davis BW, Mullen AB, Carpintero-Ramirez G, Ostrander EA. Genomic analyses reveal the influence of geographic origin, migration, and hybridization on modern dog breed development. Cell Rep. 2017:19(4):697–708. 10.1016/j.celrep.2017.03.079.
Parkes M, Cortes A, van Heel DA, Brown MA. Genetic insights into common pathways and complex relationships among immune-mediated diseases. Nat Rev Genet. 2013:14(9):661–673. 10.1038/nrg3502.
Patel RK, Jain M. NGS QC toolkit: a toolkit for quality control of next generation sequencing data. PLoS One. 2012:7(2):e30619. 10.1371/journal.pone.0030619.
Patterson N, Moorjani P, Luo Y, Mallick S, Rohland N, Zhan Y, Genschoreck T, Webster T, Reich D. Ancient admixture in human history. Genetics. 2012:192(3):1065–1093. 10.1534/genetics.112.145037.
Pickrell JK, Pritchard JK. Inference of population splits and mixtures from genome-wide allele frequency data. PLoS Genet. 2012:8(11):e1002967. 10.1371/journal.pgen.1002967.
Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007:81(3):559–575. 10.1086/519795.
Raudvere U, Kolberg L, Kuzmin I, Arak T, Adler P, Peterson H, Vilo J. g:Profiler: a web server for functional enrichment analysis and conversions of gene lists (2019 update). Nucleic Acids Res. 2019:47(W1):W191–W198. 10.1093/nar/gkz369.
Reimer C, Ha NT, Sharifi AR, Geibel J, Mikkelsen LF, Schlather M, Weigend S, Simianer H. Assessing breed integrity of Göttingen Minipigs. BMC Genomics. 2020:21(1):308. 10.1186/s12864-020-6590-4.
Schiffels S, Durbin R. Inferring human population size and separation history from multiple genome sequences. Nat Genet. 2014:46(8):919–925. 10.1038/ng.3015.
Torkamani A, Pham P, Libiger O, Bansal V, Zhang G, Scott-Van Zeeland AA, Tewhey R, Topol EJ, Schork NJ. Clinical implications of human population differences in genome-wide rates of functional genotypes. Front Genet. 2012:3:211. 10.3389/fgene.2012.00211.
Turner SD. qqman: an R package for visualizing GWAS results using Q-Q and manhattan plots. J Open Source Soft. 2018:3(25):731. 10.21105/joss.00731.
Ullrich K. polarizeVCFbyOutgroup.py. 2021. [accessed 2023 Sep 10]. https://github.com/kullrich/bio-scripts.
Van der Auwera GA, Carneiro MO, Hartl C, Poplin R, Del Angel G, Levy-Moonshine A, Jordan T, Shakir K, Roazen D, Thibault J, et al. From FastQ data to high confidence variant calls: the genome analysis toolkit best practices pipeline. Curr Protoc Bioinformatics. 2013:43(1):11.10.11–11.10.33. 10.1002/0471250953.bi1110s43.
Wang Y, Lu J, Yu J, Gibbs RA, Yu F. An integrative variant analysis pipeline for accurate genotype/haplotype inference in population NGS data. Genome Res. 2013:23(5):833–842. 10.1101/gr.146084.112.
Webb A, Knoblauch J, Sabankar N, Kallur AS, Hey J, Sethuraman A. The pop-gen pipeline platform: a software platform for population genomic analyses. Mol Biol Evol. 2021:38(8):3478–3485. 10.1093/molbev/msab113.
Yang J, Lee SH, Goddard ME, Visscher PM. GCTA: a tool for genome-wide complex trait analysis. Am J Hum Genet. 2011:88(1):76–82. 10.1016/j.ajhg.2010.11.011.
Zhang C, Dong SS, Xu JY, He WM, Yang TL. PopLDdecay: a fast and effective tool for linkage disequilibrium decay analysis based on variant call format files. Bioinformatics. 2019:35(10):1786–1788. 10.1093/bioinformatics/bty875.
Zhong Y, De T, Alarcon C, Park CS, Lec B, Perera MA. Discovery of novel hepatocyte eQTLs in African Americans. PLoS Genet. 2020:16(4):e1008662. 10.1371/journal.pgen.1008662.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

msae040_Supplementary_Data

msae040_supplementary_data.zip^{(892.5KB, zip)}

Data Availability Statement

PAPipe is available at https://github.com/jkimlab/PAPipe.

[msae040-B1] Alexander DH, Novembre J, Lange K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 2009:19(9):1655–1664. 10.1101/gr.094052.109. [DOI] [PMC free article] [PubMed] [Google Scholar]

[msae040-B2] Andrews S. FastQC: A quality control tool for high throughput sequence data. 2010. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/.

[msae040-B3] Barreiro LB, Quintana-Murci L. Evolutionary and population (epi)genetics of immunity to infection. Hum Genet. 2020:139(6-7):723–732. 10.1007/s00439-020-02167-x. [DOI] [PMC free article] [PubMed] [Google Scholar]

[msae040-B4] Baumsteiger J, Moyle PB, Aguilar A, O'Rourke SM, Miller MR. Genomics clarifies taxonomic boundaries in a difficult species complex. PLoS One. 2017:12(12):e0189417. 10.1371/journal.pone.0189417. [DOI] [PMC free article] [PubMed] [Google Scholar]

[msae040-B5] Bhatia G, Patterson N, Pasaniuc B, Zaitlen N, Genovese G, Pollack S, Mallick S, Myers S, Tandon A, Spencer C, et al. Genome-wide comparison of African-ancestry populations from CARe and other cohorts reveals signals of natural selection. Am J Hum Genet. 2011:89(3):368–381. 10.1016/j.ajhg.2011.07.025. [DOI] [PMC free article] [PubMed] [Google Scholar]

[msae040-B6] Broad Institute. Picard tools. 2018. http://broadinstitute.github.io/picard/.

[msae040-B7] Browning SR, Browning BL, Daviglus ML, Durazo-Arvizu RA, Schneiderman N, Kaplan RC, Laurie CC. Ancestry-specific recent effective population size in the Americas. PLoS Genet. 2018:14(5):e1007385. 10.1371/journal.pgen.1007385. [DOI] [PMC free article] [PubMed] [Google Scholar]

[msae040-B8] Casillas S, Barbadilla A. Molecular population genetics. Genetics. 2017:205(3):1003–1035. 10.1534/genetics.116.196493. [DOI] [PMC free article] [PubMed] [Google Scholar]

[msae040-B9] Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, Lee JJ. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience. 2015:4(1):7. 10.1186/s13742-015-0047-8. [DOI] [PMC free article] [PubMed] [Google Scholar]

[msae040-B10] Chen J, Ni P, Li X, Han J, Jakovlić I, Zhang C, Zhao S. Population size may shape the accumulation of functional mutations following domestication. BMC Evol Biol. 2018:18(1):4. 10.1186/s12862-018-1120-6. [DOI] [PMC free article] [PubMed] [Google Scholar]

[msae040-B11] Choudhury A, Hazelhurst S, Meintjes A, Achinike-Oduaran O, Aron S, Gamieldien J, Jalali Sefid Dashti M, Mulder N, Tiffin N, Ramsay M. Population-specific common SNPs reflect demographic histories and highlight regions of genomic plasticity with functional relevance. BMC Genomics. 2014:15(1):437. 10.1186/1471-2164-15-437. [DOI] [PMC free article] [PubMed] [Google Scholar]

[msae040-B12] Daetwyler HD, Capitan A, Pausch H, Stothard P, van Binsbergen R, Brøndum RF, Liao X, Djari A, Rodriguez SC, Grohs C. Whole-genome sequencing of 234 bulls facilitates mapping of monogenic and complex traits in cattle. Nature Genetics. 2014:46(8):858–865. 10.1038/ng.3034. [DOI] [PubMed] [Google Scholar]

[msae040-B13] Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Handsaker RE, Lunter G, Marth GT, Sherry ST, et al. The variant call format and VCFtools. Bioinformatics. 2011:27(15):2156–2158. 10.1093/bioinformatics/btr330. [DOI] [PMC free article] [PubMed] [Google Scholar]

[msae040-B14] Dannemann M, Andrés AM, Kelso J. Introgression of neandertal- and denisovan-like haplotypes contributes to adaptive variation in human toll-like receptors. Am J Hum Genet. 2016:98(1):22–33. 10.1016/j.ajhg.2015.11.015. [DOI] [PMC free article] [PubMed] [Google Scholar]

[msae040-B15] Daub JT, Hofer T, Cutivet E, Dupanloup I, Quintana-Murci L, Robinson-Rechavi M, Excoffier L. Evidence for polygenic adaptation to pathogens in the human genome. Mol Biol Evol. 2013:30(7):1544–1558. 10.1093/molbev/mst080. [DOI] [PubMed] [Google Scholar]

[msae040-B16] DeGiorgio M, Huber CD, Hubisz MJ, Hellmann I, Nielsen R. SweepFinder2: increased sensitivity, robustness and flexibility. Bioinformatics. 2016:32(12):1895–1897. 10.1093/bioinformatics/btw051. [DOI] [PubMed] [Google Scholar]

[msae040-B17] DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011:43(5):491–498. 10.1038/ng.806. [DOI] [PMC free article] [PubMed] [Google Scholar]

[msae040-B18] Di Rienzo A. Population genetics models of common diseases. Curr Opin Genet Dev. 2006:16(6):630–636. 10.1016/j.gde.2006.10.002. [DOI] [PubMed] [Google Scholar]

[msae040-B19] Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016:32(19):3047–3048. 10.1093/bioinformatics/btw354. [DOI] [PMC free article] [PubMed] [Google Scholar]

[msae040-B20] Fitak RR, Mohandesan E, Corander J, Yadamsuren A, Chuluunbat B, Abdelhadi O, Raziq A, Nagy P, Walzer C, Faye B, et al. Genomic signatures of domestication in Old World camels. Commun Biol. 2020:3(1):316. 10.1038/s42003-020-1039-5. [DOI] [PMC free article] [PubMed] [Google Scholar]

[msae040-B21] Franchini P, Kautt AF, Nater A, Antonini G, Castiglia R, Meyer A, Solano E. Reconstructing the evolutionary history of chromosomal races on Islands: a genome-wide analysis of natural house mouse populations. Mol Biol Evol. 2020:37(10):2825–2837. 10.1093/molbev/msaa118. [DOI] [PubMed] [Google Scholar]

[msae040-B22] Ip EKK, Hadinata C, Ho JWK, Giannoulatou E. dv-trio: a family-based variant calling pipeline using DeepVariant. Bioinformatics. 2020:36(11):3549–3551. 10.1093/bioinformatics/btaa116. [DOI] [PubMed] [Google Scholar]

[msae040-B23] Kopelman NM, Mayzel J, Jakobsson M, Rosenberg NA, Mayrose I. Clumpak: a program for identifying clustering modes and packaging population structure inferences across K. Mol Ecol Resour. 2015:15(5):1179–1191. 10.1111/1755-0998.12387. [DOI] [PMC free article] [PubMed] [Google Scholar]

[msae040-B24] Krueger F, James F, Ewels P, Afyounian E, Weinstein M, Schuster-Boeckler B, Hulselmans G, Clamons S. TrimGalore. 2019. https://github.com/FelixKrueger/TrimGalore.

[msae040-B25] Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012:9(4):357–359. 10.1038/nmeth.1923. [DOI] [PMC free article] [PubMed] [Google Scholar]

[msae040-B26] Lee D, Cho M, Hong WY, Lim D, Kim HC, Cho YM, Jeong JY, Choi BH, Ko Y, Kim J. Evolutionary analyses of Hanwoo (Korean Cattle)-specific single-nucleotide polymorphisms and genes using whole-genome resequencing data of a Hanwoo population. Mol Cells. 2016:39(9):692–698. 10.14348/molcells.2016.0148. [DOI] [PMC free article] [PubMed] [Google Scholar]

[msae040-B27] Lee TH, Guo H, Wang X, Kim C, Paterson AH. SNPhylo: a pipeline to construct a phylogenetic tree from huge SNP data. BMC Genomics. 2014:15(1):162. 10.1186/1471-2164-15-162. [DOI] [PMC free article] [PubMed] [Google Scholar]

[msae040-B28] Lee D, Lee J, Heo KN, Kwon K, Moon Y, Lim D, Lee KT, Kim J. Population analysis of the Korean native duck using whole-genome sequencing data. BMC Genomics. 2020a:21(1):554. 10.1186/s12864-020-06933-z. [DOI] [PMC free article] [PubMed] [Google Scholar]

[msae040-B29] Lee D, Lim D, Kwon D, Kim J, Lee J, Sim M, Choi BH, Choi SG, Kim J. Functional and evolutionary analysis of Korean bob-tailed native dog using whole-genome sequencing data. Sci Rep. 2017:7(1):17303. 10.1038/s41598-017-17817-w. [DOI] [PMC free article] [PubMed] [Google Scholar]

[msae040-B30] Lee SH, Seo DW, Cho ES, Choi BH, Kim YM, Hong JK, Han HD, Jung YB, Kim DJ, Choi TJ, et al. Genetic diversity and ancestral study for Korean native pigs using 60K SNP chip. Animals (Basel). 2020b:10(5):760. 10.3390/ani10050760.

[msae040-B31] Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009:25(14):1754–1760. 10.1093/bioinformatics/btp324.

[msae040-B32] Li H, Durbin R. Inference of human population history from individual whole-genome sequences. Nature. 2011:475(7357):493–496. 10.1038/nature10231.

[msae040-B33] Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R. The sequence alignment/map format and SAMtools. Bioinformatics. 2009:25(16):2078–2079. 10.1093/bioinformatics/btp352.

[msae040-B34] Liu S, Hansen MM. PSMC (pairwise sequentially Markovian coalescent) analysis of RAD (restriction site associated DNA) sequencing data. Mol Ecol Resour. 2017:17(4):631–641. 10.1111/1755-0998.12606.

[msae040-B35] Mussmann SM, Douglas MR, Chafin TK, Douglas ME. ADMIXPIPE: population analyses in ADMIXTURE for non-model organisms. BMC Bioinformatics. 2020:21(1):337. 10.1186/s12859-020-03701-4.

[msae040-B36] Nei M. Molecular population genetics and evolution (Frontiers of Biology). Amsterdam: North-Holland Publishing Company; 1975.

[msae040-B37] Nishiyama T, Kishino H, Suzuki S, Ando R, Niimura H, Uemura H, Horita M, Ohnaka K, Kuriyama N, Mikami H, et al. Detailed analysis of Japanese population substructure with a focus on the southwest islands of Japan. PLoS One. 2012:7(4):e35000. 10.1371/journal.pone.0035000.

[msae040-B38] Oliveira TG, Mitne-Neto M, Cerdeira LT, Marsiglia JD, Arteaga-Fernandez E, Krieger JE, Pereira AC. A variant detection pipeline for inherited cardiomyopathy-associated genes using next-generation sequencing. J Mol Diagn. 2015:17(4):420–430. 10.1016/j.jmoldx.2015.02.003.

[msae040-B39] Parker HG, Dreger DL, Rimbault M, Davis BW, Mullen AB, Carpintero-Ramirez G, Ostrander EA. Genomic analyses reveal the influence of geographic origin, migration, and hybridization on modern dog breed development. Cell Rep. 2017:19(4):697–708. 10.1016/j.celrep.2017.03.079.

[msae040-B40] Parkes M, Cortes A, van Heel DA, Brown MA. Genetic insights into common pathways and complex relationships among immune-mediated diseases. Nat Rev Genet. 2013:14(9):661–673. 10.1038/nrg3502.

[msae040-B41] Patel RK, Jain M. NGS QC toolkit: a toolkit for quality control of next generation sequencing data. PLoS One. 2012:7(2):e30619. 10.1371/journal.pone.0030619.

[msae040-B42] Patterson N, Moorjani P, Luo Y, Mallick S, Rohland N, Zhan Y, Genschoreck T, Webster T, Reich D. Ancient admixture in human history. Genetics. 2012:192(3):1065–1093. 10.1534/genetics.112.145037.

[msae040-B43] Pickrell JK, Pritchard JK. Inference of population splits and mixtures from genome-wide allele frequency data. PLoS Genet. 2012:8(11):e1002967. 10.1371/journal.pgen.1002967.

[msae040-B44] Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007:81(3):559–575. 10.1086/519795.

[msae040-B45] Raudvere U, Kolberg L, Kuzmin I, Arak T, Adler P, Peterson H, Vilo J. g:Profiler: a web server for functional enrichment analysis and conversions of gene lists (2019 update). Nucleic Acids Res. 2019:47(W1):W191–W198. 10.1093/nar/gkz369.

[msae040-B46] Reimer C, Ha NT, Sharifi AR, Geibel J, Mikkelsen LF, Schlather M, Weigend S, Simianer H. Assessing breed integrity of Göttingen Minipigs. BMC Genomics. 2020:21(1):308. 10.1186/s12864-020-6590-4.

[msae040-B47] Schiffels S, Durbin R. Inferring human population size and separation history from multiple genome sequences. Nat Genet. 2014:46(8):919–925. 10.1038/ng.3015.

[msae040-B48] Torkamani A, Pham P, Libiger O, Bansal V, Zhang G, Scott-Van Zeeland AA, Tewhey R, Topol EJ, Schork NJ. Clinical implications of human population differences in genome-wide rates of functional genotypes. Front Genet. 2012:3:211. 10.3389/fgene.2012.00211.

[msae040-B49] Turner SD. qqman: an R package for visualizing GWAS results using Q-Q and manhattan plots. J Open Source Soft. 2018:3(25):731. 10.21105/joss.00731.

[msae040-B50] Ullrich K. polarizeVCFbyOutgroup.py. 2021. [accessed 2023 Sep 10]. https://github.com/kullrich/bio-scripts.

[msae040-B51] Van der Auwera GA, Carneiro MO, Hartl C, Poplin R, Del Angel G, Levy-Moonshine A, Jordan T, Shakir K, Roazen D, Thibault J, et al. From FastQ data to high confidence variant calls: the genome analysis toolkit best practices pipeline. Curr Protoc Bioinformatics. 2013:43(1):11.10.11–11.10.33. 10.1002/0471250953.bi1110s43.

[msae040-B52] Wang Y, Lu J, Yu J, Gibbs RA, Yu F. An integrative variant analysis pipeline for accurate genotype/haplotype inference in population NGS data. Genome Res. 2013:23(5):833–842. 10.1101/gr.146084.112.

[msae040-B53] Webb A, Knoblauch J, Sabankar N, Kallur AS, Hey J, Sethuraman A. The pop-gen pipeline platform: a software platform for population genomic analyses. Mol Biol Evol. 2021:38(8):3478–3485. 10.1093/molbev/msab113.

[msae040-B54] Yang J, Lee SH, Goddard ME, Visscher PM. GCTA: a tool for genome-wide complex trait analysis. Am J Hum Genet. 2011:88(1):76–82. 10.1016/j.ajhg.2010.11.011.

[msae040-B55] Zhang C, Dong SS, Xu JY, He WM, Yang TL. PopLDdecay: a fast and effective tool for linkage disequilibrium decay analysis based on variant call format files. Bioinformatics. 2019:35(10):1786–1788. 10.1093/bioinformatics/bty875.

[msae040-B56] Zhong Y, De T, Alarcon C, Park CS, Lec B, Perera MA. Discovery of novel hepatocyte eQTLs in African Americans. PLoS Genet. 2020:16(4):e1008662. 10.1371/journal.pgen.1008662.
