The Journal of Biological Chemistry. 2025 Feb 3;301(3):108267. doi: 10.1016/j.jbc.2025.108267

AnimalGWASAtlas: Annotation and prioritization of GWAS loci and quantitative trait loci for animal complex traits

Yuwei Gou 1, Yunhan Jing 1, Yifei Wang 2, Xingyu Li 1, Jing Yang 1, Kai Wang 1, Hengdong He 1, Yuan Yang 1, Yuanling Tang 1, Chen Wang 1, Jun Xu 1, Fan Yang 3, Mingzhou Li 1, Qianzi Tang 1
PMCID: PMC11904539  PMID: 39909383

Abstract

Genome-wide association study (GWAS) and quantitative trait locus (QTL) mapping methods provide valuable insights and opportunities for identifying the functional genes underlying phenotype formation. However, because the majority of GWAS risk loci and QTLs are located in noncoding regions, pinpointing the protein-coding genes associated with specific traits remains a significant challenge. Moreover, growing evidence suggests that not all GWAS risk loci and QTLs are functional, emphasizing the critical need to prioritize causal sites. The accumulation of publicly available multiomics data provides an unprecedented opportunity to annotate and prioritize GWAS risk loci and QTLs. Therefore, we developed a comprehensive multiomics database encompassing four major agricultural species: pig, sheep, cattle, and chicken. This database integrates publicly accessible datasets, including 140 GWAS studies (covering 471 traits), 2625 QTL datasets (spanning 1235 traits), 86 Hi-C datasets (from eight cell/tissue types), 95 epigenomic datasets (from four cell/tissue types), and 769 transcription factor motifs. The database aims to link GWAS–QTL loci located in noncoding regions to the target genes they regulate and to prioritize functional and causal regulatory elements. Ultimately, it provides a valuable resource and potential validation targets for elucidating the genes and molecular pathways underlying economically important traits in agricultural animals.

Keywords: databases, genomics, bioinformatics, GWAS, QTL


Pigs, cattle, sheep, and chickens are key agricultural animals and primary sources of meat, eggs, and milk. Advances in sequencing technologies and bioinformatics have enabled the integration and analyses of vast amounts of sequencing data, offering unprecedented opportunities to investigate genes associated with economically significant traits in these vital species (1).

Genome-wide association study (GWAS) and quantitative trait locus (QTL) mapping methods, which emerged with the advent of state-of-the-art genomic technologies, provide novel perspectives and opportunities for identifying functional genes (2, 3). Furthermore, recent advances have led to increasingly precise positioning of QTLs, moving from the megabase (Mb) scale in earlier studies to the kilobase (kb) and even base-pair (bp) level (4, 5). However, the majority of GWAS risk loci are known to be located in noncoding regions. In addition, we examined the distribution of QTLs obtained from the AnimalQTLdb database (6) and found that most of the QTLs were also located in noncoding regions (Fig. S1, A and B). Therefore, it is challenging to determine the target protein-coding genes associated with specific phenotypes (7, 8).

The recent emergence of Hi-C and chromatin interaction analysis with paired-end tag technologies, which link promoters and putative enhancer regions that are spatially proximal to each other, provides new opportunities for matching the noncoding regulatory regions to their target genes (9, 10). Besides, previous studies revealed that GWAS–QTL sites associated with a particular phenotype tend to be over-represented in active epigenomic signals specific to certain cell/tissue types related to the respective phenotype (11, 12, 13, 14). Therefore, we reasoned that annotation of regulatory loci with epigenomic information could contribute to prioritization of functional and causal sites.

Inspired by the aforementioned reasoning, relevant databases focusing on complex human traits have been established, integrating a large volume of genomic and epigenomic data derived from Hi-C and chromatin immunoprecipitation sequencing (ChIP-Seq) techniques. These databases systematically prioritize GWAS risk loci, identify context-specific regulatory variants, and ultimately provide a comprehensive resource for annotating the functions of noncoding GWAS risk loci (15, 16).

With the launch of the FAANG program, an increasing amount of Hi-C–chromatin interaction analysis with paired-end tag and epigenomic data for agricultural animals has been generated (17). Leveraging these opportunities, biologists have begun experimenting with the integrative analyses of QTL, Hi-C, and epigenomic data for agricultural animals. For example, Wu et al. (18) identified functional enhancers in pig kidney epithelial cells and testicular cells and observed enrichment of relevant trait-associated QTLs within these enhancers. The PPP3CA gene, residing within QTLs and reported as related to the formation of traits in the lumbar muscle region, was found to interact with aforementioned enhancers based on evidence from Hi-C data. In addition, both the PPP3CA gene and its contacting enhancers were found to be enclosed within the same topologically associating domain (TAD). In cattle, Liu et al. (19) identified the TAD structure of bovine lung tissue using Hi-C technology and constructed a high-precision chromatin contact map. By integrating QTL and Hi-C data, they found that TADs could bring linearly distant genetic variations and target genes into spatial proximity, allowing genetic variation to remotely regulate target genes.

In addition to efforts in integrating multiomic data related to GWAS–QTL, significant progress has also been made in constructing databases dedicated to agricultural animal research. The Bovine Genome Variation Database has collected approximately 60.44 M SNPs, 6.86 M insertions and deletions, 76,634 copy number variation regions, and 432 selective sweep regions from bovine samples from diverse regions worldwide (20). The Goat Genome Variation Database included approximately 41 M SNPs, approximately 44 M insertions and deletions, 5.14 M indices, 6193 selected sites, and 112 infiltration regions (21). In addition, the animal multiomics database EDomics contains comprehensive genome and bulk/single-cell transcriptome data from 40 representative species (22). The IAnimal database integrates the genome, transcriptome, epigenome, and annotation data from 61 species, constructs a deep learning model based on BioBERT and AutoNER, and explores the relationship between genes and traits (23). However, the first two databases were limited to variant detection and evolutionary analyses, whereas the third database did not integrate multiple omics data for causal gene–variant identification. The last database studied the impact of genes on traits but did not integrate critical GWAS, QTL, and Hi-C data. Databases that annotate causal regulatory variations in agricultural animals using Hi-C and histone modification multiomic data are still lacking.

Therefore, we have constructed a multiomic database for major agricultural animals, including pig, sheep, cattle, and chicken. This database integrates publicly available data resources from 140 GWAS sets (covering 471 related traits), 2625 QTL sets (covering 1235 traits), 86 Hi-C sets (from eight cells/tissue types), 95 epigenomic sets (from four cells/tissue types), and 769 motifs of transcription factors (TFs). The database aims to link the GWAS–QTL loci located in the noncoding regions to the target genes they regulate, and prioritize functional and causal regulatory sites, which ultimately provides data resources and potential verification targets for genes and their molecular pathways related to important economic traits for agricultural animals.

Results

Data summary

Our study collected high-quality genomic and annotation data for pig, cattle, sheep, and chicken from the Sscrofa11.1, ARS-UCD1.2, Oar_rambouillet_v1.0, and GRCg6a assemblies (Ensembl release 103) as well as SNP information (in VCF format) derived from population genomic data for these four species (Tables S1 and S2). GWAS data containing ∼4000 SNP sites for 471 phenotypes were sourced from the literature and the GWAS Atlas database (https://ngdc.cncb.ac.cn/gwas/), and these leading GWAS sites were expanded to a catalog of over 20,000 SNPs based on linkage disequilibrium (LD). The numbers of phenotypes associated with QTLs and SNPs for each species are shown separately in Fig. S2, A and B. We also examined the density and distribution of SNPs for each of the four genomes and found a bias toward intronic regions for pig (Fig. S2C). In addition, 95 ChIP-Seq datasets involving four histone modifications (H3K4me1, H3K4me3, H3K27ac, and H3K27me3) and 86 Hi-C datasets from 10 distinct tissues/cell types were downloaded from public repositories (Table S3).

As is shown in Figure 1A, each SNP of the expanded catalog was annotated by five distinct methods, namely, gene-based functional annotation of genetic variants (indicated as "AS"), change of TF binding affinity caused by the mutation (indicated as "TF"), conservation score produced by phastCons and phyloP in the vicinity of the SNP (indicated as "CON"), histone modifications covering the variant (indicated as "HM"), promoter–enhancer interactions (PEIs) with enhancer regions covering the variant (indicated as "3D"). A considerable proportion of GWAS loci have associated PEI and/or histone modification in the cell types or tissues that are likely relevant to the trait. An example for fatty acid composition–associated GWAS loci is shown in Fig. S2D. Different proportions of SNPs exhibited different occurrences (1∼5) of TFs with binding affinity changes for each distinct species (Fig. 1B). The histone modification data were mainly retrieved from four cell types/tissues for four modification types with varied distribution for each species (Fig. 1C). Similarly, Hi-C data were obtained for eight cell types/tissues for seven breeds across the four species (Table S3). As is shown in Figure 1D, each QTL was annotated by two distinct data types, Hi-C and histone modifications; a large proportion of QTLs have associated annotations for pig and chicken, whereas a smaller proportion applies for cattle and sheep.

Figure 1. Overview of data. A, the UpSet plot for the number of SNPs with distinct combinations of associated annotations for each species. B, number of SNPs with different occurrences (1 ∼ 5) of TFs with changes in binding affinity for each species. C, histone modification data retrieved from four cell types/tissues for four modification types with varied distribution for each species (the outer circle color indicates the type of histone modification, and the inner circle color represents the cell type/tissue). D, the number of QTLs with associated Hi-C and/or histone modification annotations. QTL, quantitative trait locus; TF, transcription factor.

Usage and interface

Web interface summary

We provided a figure going through the flow and capabilities of the web portal related to GWAS loci associated with the trait of intramuscular fat (Fig. 2) and QTL associated with the trait of muscle pH (Fig. 3). Detailed information regarding the pipeline can be found in the Experimental procedures section.

Figure 2. Workflow of the GWAS web portal related to the trait of intramuscular fat (IMF). GWAS, genome-wide association study.

Figure 3. Workflow of the QTL web portal related to the trait of muscle pH. QTL, quantitative trait locus.

The search interface

The search interface consists of two pages: the SNP search (Fig. S3) and the QTL search (Fig. S4). Users can switch between the two options by clicking on the upper left tag of the respective page. For each search interface, users can select the species and phenotype of interest as well as specify the location of a particular GWAS site. In addition, our database allows users to choose specific tissues or cell types from a candidate list for the downstream annotation using ChIP-Seq or Hi-C data. We recommend that users select tissues or cell types relevant to the observed trait in their GWAS study. Once all the offered options are properly selected, users should click the “Search SNP” button, which will redirect them to the results page displaying the query results.

The summary interface for SNP result display

Our database presents the results through a series of user-friendly interfaces. The first summarizes the basic and regulatory potential information of leading and expanded GWAS SNP sites, ranked by their genomic coordinates (Fig. S5). In brief, the "chr" and "pos" fields record the genomic coordinate of a variant, whereas "ref/alt" indicates the reference and alternative alleles for the variant. The "tf_motif" field represents the TF motif with the most significant affinity score change caused by the mutation of the respective variant from the reference to the alternative allele. The fields "top_lead_ID" and "top_lead_rsquare" display the dbSNP ID of the leading GWAS locus in LD with the specific expanded variant and the degree of LD (r²) between the variant and its related leading GWAS locus. For a leading GWAS variant, "top_lead_ID" is assigned the dbSNP ID of the variant itself and "top_lead_rsquare" is assigned a value of 1. The "top_lead_p" field shows the p value of GWAS significance for the leading variant. In addition, variants with different types of regulatory potential signals are marked by different abbreviations and colors. Leading GWAS variants, as opposed to those expanded based on LD, are marked by the name "TOP" and a red stamp, whereas loci covered by histone modification peaks are marked by the name "HM" and an orange stamp. Variants with a significant TF binding affinity change are marked by the name "TF" and a purple stamp, and those with gene annotation information are marked by the name "AS" and a blue stamp. The name "CON" and a gray stamp indicate that sequence conservation information is available for the variant, whereas the name "3D" and a green stamp signify the existence of PEIs with enhancer regions covering the respective variant.

Visualization on the summary interface for SNP result display

When users select a specific SNP by clicking on the arrow in the first column of the corresponding line in the "Summary table," the database displays a circular layout of genes located in the vicinity of the SNP (within 1 Mb upstream and downstream of the variant) as sectors colored by gene biotype, with PEIs whose enhancers cover the variant shown as orange curves connecting the variant and the promoters of genes. Hovering the mouse cursor over a gene or a PEI reveals detailed information about the gene or interaction (Fig. 4). A further click on a PEI curve redirects users to an interface displaying histone modification data and genes located within the 10-kb regions of SNP-associated promoters and enhancers (Fig. 5).

Figure 4. Circos plot showing the gene locations and Hi-C interactions within 1 Mb distance from the SNP.

Figure 5. Locations of genes and histone modifications within the 10-kb bin where SNP-associated promoters and enhancers are located.

In addition, the Integrative Genomics Viewer (IGV) (24) is integrated into the summary interface to visualize SNP loci (in BED format), associated PEIs (in BEDPE format), and histone modification peaks (in BED format) for the selected species, phenotype, and tissue/cell types.
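As a rough illustration of the track formats mentioned above, the sketch below writes one SNP as a BED line and one PEI as a BEDPE line. The function names, coordinates, and identifiers are hypothetical and are not taken from the database; the only format facts relied on are that BED intervals are 0-based and half-open and that BEDPE starts with two chrom/start/end triplets.

```python
def snp_to_bed(chrom, pos_1based, name):
    """Return a BED line for a single-base variant given its 1-based position."""
    # BED is 0-based and half-open, so 1-based base 1000 becomes [999, 1000).
    return f"{chrom}\t{pos_1based - 1}\t{pos_1based}\t{name}"


def pei_to_bedpe(chrom, enh_start, enh_end, prom_start, prom_end, name, score):
    """Return a BEDPE line linking an enhancer bin to a promoter bin.

    Both anchors are assumed to lie on the same chromosome (illustrative only).
    """
    fields = [chrom, enh_start, enh_end, chrom, prom_start, prom_end, name, score]
    return "\t".join(str(f) for f in fields)


# Hypothetical example values, not real database records.
bed_line = snp_to_bed("chr1", 1000, "snp_example")
bedpe_line = pei_to_bedpe("chr1", 0, 10000, 500000, 510000, "pei_example", 0.001)
print(bed_line)
print(bedpe_line)
```

Files built this way can be loaded as separate tracks in IGV, one for variants and one for interactions.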

The interface for display of SNP expansions and their genomic distribution

As most GWAS studies based on genome-wide SNP arrays identified relatively sparsely distributed variants for each trait, we carried out LD-based expansion on the leading GWAS SNPs to avoid the loss of heritability (25) and to profile the trait-associated variants more comprehensively. We considered the expanded variants to be associated with the respective trait as well, though to a less significant degree than the leading GWAS SNPs.

When users select a specific SNP by clicking on the corresponding line in the "Summary table," the database will display detailed information of the SNP in "Mixins tables," which include tables for LD expansion, variant summary, TF annotation, histone modification, conservation, and Hi-C (Fig. S6). The LD expansion table consists of two figures that illustrate the LD between the leading and expanded SNPs as well as their genomic distribution. For the former, a triangle heatmap composed of 2D squares is employed to show the LD, with the coordinates of each square representing the relative positions of SNP pairs in LD, and the color indicating the LD r². For the latter, GC content, SNP density, and gene density are visualized in a circular layout using the Circos package (26) (Fig. S7).
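For readers unfamiliar with the LD r² shown in the heatmap, the following minimal sketch computes it for two biallelic SNPs from phased haplotypes coded as 0/1. This is the textbook definition, not necessarily the software the authors used, and the function name and input encoding are assumptions made for illustration; this section does not state the r² cutoff applied during expansion.

```python
def ld_r2(hap_a, hap_b):
    """Squared correlation (r^2) between two biallelic SNPs.

    hap_a and hap_b are equal-length lists of 0/1 alleles, one entry per
    phased haplotype (illustrative encoding).
    """
    n = len(hap_a)
    if n == 0 or n != len(hap_b):
        raise ValueError("haplotype lists must be non-empty and of equal length")
    p_a = sum(hap_a) / n                                   # frequency of allele 1 at SNP A
    p_b = sum(hap_b) / n                                   # frequency of allele 1 at SNP B
    p_ab = sum(a * b for a, b in zip(hap_a, hap_b)) / n    # frequency of the 1-1 haplotype
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0                                         # a monomorphic site carries no LD information
    d = p_ab - p_a * p_b                                   # disequilibrium coefficient D
    return d * d / denom


print(ld_r2([0, 0, 1, 1], [0, 0, 1, 1]))  # perfectly correlated sites
print(ld_r2([0, 0, 1, 1], [0, 1, 0, 1]))  # independent sites
```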

The interface for SNP annotation display

The "Variant summary" table records the classification of each SNP based on the position of each SNP relative to the nearby genes using ANNOVAR (version 20191024), including classes of exonic, splicing, ncRNA, UTR5, UTR3, intronic, upstream, downstream, and intergenic SNPs for the gene-based annotation of all genomic variants, and classes of frameshift, synonymous, nonsynonymous, and so on for exon-based annotation of variants (Fig. 6). An example shown in Figure 6 highlights a specific variant located in the first exon of the transcript "ENSSSCT00000037576" and involving mutation of the 380th base from the A allele to the C allele (c.A380C), resulting in a change of the 127th amino acid from lysine (K) to threonine (T) for the expressed protein (p.K127T).
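The relationship between the coding position and the amino acid position in this example can be checked with simple arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and the codon table is deliberately restricted to the lysine and threonine codons relevant to c.A380C.

```python
def codon_index(cds_pos):
    """1-based codon number that contains a 1-based coding-sequence position."""
    return (cds_pos - 1) // 3 + 1


def position_in_codon(cds_pos):
    """1-based position (1, 2, or 3) of a coding base within its codon."""
    return (cds_pos - 1) % 3 + 1


# Partial standard genetic code: only the codons needed for this example.
PARTIAL_CODE = {"AAA": "K", "AAG": "K", "ACA": "T", "ACG": "T"}

print(codon_index(380), position_in_codon(380))

# An A-to-C change at the second base of either lysine codon yields threonine.
for codon in ("AAA", "AAG"):
    mutated = codon[0] + "C" + codon[2]
    print(codon, PARTIAL_CODE[codon], "->", mutated, PARTIAL_CODE[mutated])
```

Base 380 falls at the second position of codon 127, which is consistent with the reported p.K127T change.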

Figure 6. SNP ANNOVAR annotation result.

The interface for display of TF affinity score changes for an SNP

Transcription regulation in eukaryotic organisms is a complex process that involves the coordination of multiple TFs, which is mediated by the recognition of specific DNA sequence motifs by the TFs. It is therefore hypothesized that SNPs in the genomic sequence may alter the binding affinity of TFs by disrupting the original motif-matching sequence or creating a new motif-matching sequence, thus potentially affecting the formation of complex traits or diseases. Inspired by this hypothesis, we incorporated the TF affinity score changes for an SNP in the "TF annotation" table (Fig. 7). In brief, the table presents information on the TFs with significant affinity score changes caused by the specific variant, including the TF family name, the motif ID, the log-transformed likelihood scores for the reference and alternative alleles, the p values indicating the degree of motif match for the reference and alternative alleles, the p value indicating the significance of the TF affinity score change, and the direction of the change. The sequence logos for the TFs are generated by ggseqlogo (version 0.1) and shown alongside.
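To make the idea of a log-likelihood score for each allele concrete, here is a minimal sketch that scores a reference and an alternative sequence against a toy position probability matrix. The motif, the uniform background, and the function name are invented for illustration; the actual scoring tool and its significance test are not described in this section and are not reproduced here.

```python
import math


def log_likelihood_ratio(seq, ppm, background=0.25):
    """Sum of log(motif probability / background probability) over a window.

    seq must have the same length as ppm; ppm is a list of per-position
    dicts mapping each base to its probability (toy representation).
    """
    if len(seq) != len(ppm):
        raise ValueError("sequence and motif must have the same length")
    return sum(math.log(ppm[i][base] / background) for i, base in enumerate(seq))


# Hypothetical 3-bp motif that prefers the sequence ACG.
TOY_PPM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]

ref_score = log_likelihood_ratio("ACG", TOY_PPM)   # reference allele C at the middle base
alt_score = log_likelihood_ratio("ATG", TOY_PPM)   # alternative allele T at the middle base
delta = alt_score - ref_score                      # negative: the variant weakens the match
print(ref_score, alt_score, delta)
```

A negative difference corresponds to a predicted loss of binding affinity, and a positive one to a gain.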

Figure 7. Scores of change in transcription factor binding affinity for the SNP.

Interface for display of histone modifications covering an SNP

The "Histone modification" table (Fig. S8) provides information on all the histone modifications whose peaks overlap a specific variant. In brief, the table presents the genomic coordinates of the histone modification peaks, the Gene Expression Omnibus IDs, the tissues/organs from which the ChIP-Seq data are derived, the types of histone modification, the significance scores (minus log-transformed p values for enrichment significance), and the fold-change values for peak signal enrichment.
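The overlap test behind such a table can be sketched as follows, assuming peaks are stored as BED-style 0-based, half-open intervals and the variant position is 1-based. The record layout, field names, and the base-10 logarithm are assumptions made for this example.

```python
import math


def peaks_covering(pos_1based, peaks):
    """Return the peaks that cover a 1-based variant position.

    Each peak is a dict with 0-based, half-open 'start' and 'end' keys, so a
    peak covers the 1-based positions start + 1 through end.
    """
    return [p for p in peaks if p["start"] < pos_1based <= p["end"]]


def significance_score(p_value):
    """Minus log10-transformed p value (the base is assumed here)."""
    return -math.log10(p_value)


# Hypothetical peak records.
example_peaks = [
    {"start": 100, "end": 200, "mark": "H3K27ac", "p": 1e-5},
    {"start": 500, "end": 900, "mark": "H3K4me1", "p": 1e-3},
]
hits = peaks_covering(150, example_peaks)
print([(h["mark"], significance_score(h["p"])) for h in hits])
```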

Interface for display of sequence conservation scores for an SNP

The "Conservation" table (Fig. S9) displays conservation scores based on phastCons and phyloP, averaged over a 100-bp region extending 50 bp upstream and downstream of a specific SNP. PhastCons and phyloP are two popular programs for evaluating cross-species sequence conservation, and both take multispecies sequence alignments as input. The phastCons program computes conservation scores based on a phylogenetic hidden Markov model (phylo-HMM), and the scores range from 0 to 1 to indicate the conservation level. PhyloP scores measure evolutionary conservation at individual alignment sites and are interpreted relative to the evolution expected under neutral drift. PhyloP scores can be negative or positive and are not restricted to a particular range. Positive scores indicate conservation (slower evolution than expected), whereas negative scores indicate acceleration (faster evolution than expected). A larger score indicates a greater degree of sequence conservation at the respective locus, and a more conserved locus tends to contain more important functional elements with regulatory potential. The "Conservation" table also presents the GC content and the number of CpG islands within the respective 100-bp region.
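The window averaging described above can be sketched in a few lines. The exact window boundaries are an assumption (here the 100 positions from 50 bp upstream up to, but not including, 50 bp downstream), as are the function names and the dict-based score lookup; positions without a score are simply skipped.

```python
def window_mean(scores, center, flank=50):
    """Mean per-base score over [center - flank, center + flank).

    scores maps a genomic position to its conservation score (hypothetical
    in-memory representation); returns None if no position has a score.
    """
    values = [scores[p] for p in range(center - flank, center + flank) if p in scores]
    return sum(values) / len(values) if values else None


def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


# Toy inputs: uniform scores around position 1000 and a short sequence.
toy_scores = {pos: 1.0 for pos in range(950, 1050)}
print(window_mean(toy_scores, 1000))
print(gc_content("GGCCAATT"))
```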

Interface for display of SNP-relevant Hi-C interactions

Enhancers are common cis-regulatory elements characterized by the enrichment of active histone modifications and TF binding in their vicinity, and they promote transcription by interacting with distal promoters. SNPs located within enhancer regions might affect the binding of TFs to the enhancer and downstream gene expression and eventually alter animal phenotypes (27). The "Hi-C" table (Fig. S10) provides detailed information on PEIs whose enhancers contain the specific SNP. The information includes the breed and tissue from which each PEI is derived as well as the p value and false discovery rate (FDR) indicating the confidence of the respective PEI. When users select a specific PEI by clicking on the tag icon in the last column of the "Hi-C" table, the database displays histone modifications and genes located within a 10-kb region of SNP-associated promoters and enhancers (Fig. S11).

The QTL search interface

Similar to the SNP search interface, the QTL search page allows users to select specific species and phenotype for QTLs as well as tissues/cell types for ChIP-Seq or Hi-C data. Users can also specify the location of a specific QTL. After selecting all the available options, users should click the "Search QTL" button to retrieve the results, which will be displayed on a new interface.

The summary interface for QTL result display

Our database features a series of user-friendly interfaces to present the results, starting with a summary of the basic information of QTLs, ranked by their genomic coordinates (Fig. S12). Specifically, the "chr" and "pos" fields represent the genomic coordinate of a QTL, whereas the "character" field indicates the associated phenotype.

Visualization on the summary interface for QTL result display

The IGV (24) was integrated into the summary interface to visualize QTLs (in BED format), associated PEIs (in BEDPE format), and histone modification peaks (in BED format) for the selected species, phenotype, and tissue/cell types.

Interface for display of histone modifications covering a QTL

Similar to the SNP display interface, the "Histone modification" table (Fig. S13) presents information on all the histone modifications whose peaks overlap a specific QTL. Specifically, the table includes the genomic coordinates of the histone modification peaks, the Gene Expression Omnibus IDs, the tissues/organs from which the ChIP-Seq data are derived, the types of histone modification, the significance scores (minus log-transformed p values for enrichment significance), and the fold changes of peak signal enrichment.

Interface for display of QTL-relevant Hi-C interactions

The "Hi-C" table (Fig. S14) provides detailed information on PEIs whose enhancers are associated with a specific QTL. The information includes the breed and tissue from which each PEI is derived as well as the p value and FDR that indicate the confidence of the respective PEI. When users select a specific PEI by clicking on the tag icon in the last column of the "Hi-C" table, the database will display the PEIs, histone modifications, and genes located in the vicinity of the QTL (Fig. S15).

Evaluations

We selected GWAS loci associated with red blood cell (28) traits and their expanded variants from the collection of the current study and randomly selected ∼10,000 variants from dbSNP in each of three types of genomic regions (promoter, intergenic, and genome-wide). For a fair comparison, the combined p value was derived from only the binding affinity effect and the sequence conservation level, and the GWAS p value was excluded because such information is lacking for background variants. The combined p values for the selected GWAS loci were significantly lower than those for variants in the three background regions based on the Wilcoxon test, with p values of 4.2 × 10⁻⁴, 1.34 × 10⁻⁴, and 2.8 × 10⁻⁵ for promoter, intergenic, and genome-wide regions, respectively. This supports the rationale for the combined p value in this study, as a functional regulatory variant would be expected to score higher than a nonfunctional mutation. We expanded GWAS variants to obtain a catalog of 241 SNPs for fatty acid composition of pigs and identified their target genes using PEIs. Functional enrichment analyses showed over-representation of the target genes in the fatty acid metabolism pathway (p value = 3.55 × 10⁻⁵) and other related biological processes (Fig. S16), validating the association of this GWAS catalog with fatty acid composition and expanding our knowledge of this trait. A similar example can be found for the expanded GWAS catalog for intramuscular fat content of pigs, whose target genes were mainly involved in the cellular lipid catabolic process pathway (p value = 1.7 × 10⁻³) and other related biological processes (Fig. S16), providing new insight into the underlying mechanism of this trait.
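The comparison between GWAS-derived and background variants relies on a Wilcoxon test. As an illustration of what such a test computes, the sketch below implements a two-sided rank-sum (Mann-Whitney U) test with a normal approximation and no tie correction. The sidedness and the implementation used by the authors are not stated in this section, so both are assumptions, and the input values are toy numbers rather than the study's data.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test via the Mann-Whitney U statistic.

    Uses the large-sample normal approximation without a tie correction.
    Returns (U, p), where U counts the pairs with x_i < y_j (ties count 1/2).
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    u = sum((xi < yj) + 0.5 * (xi == yj) for xi in x for yj in y)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))   # equals 2 * (1 - Phi(|z|))
    return u, p


# Toy combined p values: "GWAS" variants tend to be smaller than "background".
gwas_like = [0.001, 0.002, 0.003]
background_like = [0.2, 0.4, 0.6]
u_stat, p_val = rank_sum_test(gwas_like, background_like)
print(u_stat, p_val)
```

The pairwise loop is quadratic in the sample sizes, which is acceptable for a sketch but would normally be replaced by a rank-based implementation for large inputs.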

Server design

The AnimalGWASAtlas database was developed using the Python-based web framework "Masonite" (version 4.16.4; Python 3.8.0). Annotation data are stored in a back-end MySQL database (version 8.0.31). Dynamic web pages are constructed with Vue (version 3.2). The database is hosted on a server provided by the Information and Educational Technology Center at Sichuan Agricultural University, which is configured with a 16-core CPU, 64 GB of memory, 2 TB of hard disk storage, and the CentOS operating system (version 8). AnimalGWASAtlas is a user-friendly, one-stop framework available for free academic use. All software installation and deployment described above was carried out through the bastion-host operation, maintenance, audit, and risk-control system of Sichuan Agricultural University.

Discussion

GWAS and QTL-based approaches are powerful tools for studying the inheritance of complex traits in agricultural animals. Pigs, cattle, sheep, and chickens, as four major agricultural animals, have been extensively studied and reported in the literature. These studies generate vast amounts of omics data that, when integrated with GWAS–QTL information, provide an effective resource for identifying candidate genes, functionally conserved loci, and breeding-related targets that influence phenotype. Such integration offers unprecedented opportunities for understanding gene regulation mechanisms and conducting comprehensive analyses of biological systems.

This study analyzed, integrated, and visualized multiomics genomic data for four agricultural animal species, including genome sequences, structural variations, gene and epigenetic annotations, three-dimensional chromatin structures, as well as GWAS and QTL loci. Given that GWAS loci can only explain a fraction of phenotypic variation, in order to reduce the loss of heritability (29), this study also expanded GWAS loci based on LD using resequencing data to obtain more SNPs related to the traits. Considering the specificity of histone modifications in different cell/tissue types (30), we also collected as much epigenomics data as possible from distinct cell/tissue types in order to elucidate in more detail the mechanisms underlying transcription regulation in specific tissues/cells associated with the respective complex traits. Moreover, all the collected and expanded variant sites were annotated in detail using the TF binding affinity and sequence conservation scores, with the goal of identifying trait-associated functional loci. Different from other multiomics databases of agricultural animals, this study also used the publicly available three-dimensional chromatin conformation data to map the variants and QTL regions located in the noncoding region to targeted protein-coding genes through PEIs, providing guidance for the identification and final experimental validation of regulatory targets.

This study developed a user-friendly web application based on a Linux + nginx + MySQL + Python environment, enabling users to explore multidimensional data by selecting species, traits, and chromatin loci of interest. The primary goal of this study is to provide data support for subsequent experimental validation. We will continue to incorporate new multiomics and GWAS–QTL data for the four animal species as they become available, consistently updating the platform to establish a more comprehensive resource for functional genomics.

Experimental procedures

Data collection and processing

Data collection

Our database integrates multiomic data to prioritize functional and causal regulatory genetic variants and connects noncoding variants to target protein-coding genes based on spatial chromatin interactions. High-quality reference genome sequences and annotation files for the studied species (pig, cattle, sheep, and chicken) were retrieved from the National Center for Biotechnology Information database (Table S1). Genome-wide variation information was obtained from the Genome Sequence Archive database (31) and previous studies (Table S2). Publicly available GWAS risk loci were collected through literature mining using the PubMed and Web of Science databases or directly retrieved from the GWAS Atlas database (32), and the source PMIDs for the GWAS risk loci can be found on the SNP download page (Fig. S17). QTLs were retrieved from the AnimalQTLdb database (6). In addition, Hi-C (Table S3) and ChIP-Seq data of histone modifications (Table S4) were collected from previous studies.

Conversion between different genome versions

We used recent high-quality reference genomes of the four species (Table S1) for alignment of all the different data types. Given the discrepancy between the reference genome version of the GWAS risk loci and that used for Hi-C–ChIP-Seq data alignment, we downloaded coordinate conversion chain files between distinct genome versions from the UCSC database and used the LiftOver (33) software to convert and unify the genome versions to the most recent ones. We also utilized the BLAST software (34) to facilitate this process if the required chain files were unavailable. Taking the GWAS risk loci of pigs as an example, some of the GWAS risk loci collected from the literature are based on coordinates of the Sscrofa10.2 genome version, which differs from the Sscrofa11.1 version used for our database. First, we used makeblastdb (v2.7.1+) with the parameters "-dbtype nucl -parse_seqids" to build a BLAST database for the Sscrofa11.1 pig genome. Then, all GWAS sites on the Sscrofa10.2 genome were extended upstream by 100 bp, and the resulting 101-bp sequences were aligned to the Sscrofa11.1 genome using the BLAST software. Next, the aligned regions on the Sscrofa11.1 genome were converted back to loci on the Sscrofa10.2 genome using the LiftOver software and the chain file susScr11ToSusScr3.over.chain (https://hgdownload.soe.ucsc.edu/goldenPath/susScr11/LiftOver/susScr11ToSusScr3.over.chain.gz). If the extended GWAS coordinates before and after the conversion process match each other, the aligned position on the Sscrofa11.1 genome is considered reliable and retained for further analyses. The GWAS locus coordinate conversion for other species was performed similarly.
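The round-trip check described above can be sketched as a simple comparison once the BLAST and LiftOver outputs have been parsed. Everything below is illustrative: the function names, the site IDs, the tuple layout (chromosome, start, end; 1-based inclusive), and the toy coordinates are assumptions, and the parsing of the actual BLAST and LiftOver output files is not shown.

```python
def extend_upstream(chrom, pos, flank=100):
    """Extend a site upstream by `flank` bp, giving a (flank + 1)-bp interval."""
    return (chrom, pos - flank, pos)


def reliable_sites(old_intervals, new_hits, lifted_back):
    """Keep new-assembly hits whose lift-back reproduces the original interval.

    old_intervals: site ID -> extended interval on the old assembly
    new_hits:      site ID -> BLAST-aligned interval on the new assembly
    lifted_back:   site ID -> interval obtained by lifting the hit back
    """
    kept = {}
    for site_id, old in old_intervals.items():
        if site_id in new_hits and lifted_back.get(site_id) == old:
            kept[site_id] = new_hits[site_id]
    return kept


# Toy data: site s1 round-trips exactly; site s2 is shifted by 1 bp and is dropped.
old = {"s1": extend_upstream("1", 1000), "s2": extend_upstream("2", 500)}
hits = {"s1": ("1", 1900, 2000), "s2": ("2", 700, 800)}
back = {"s1": ("1", 900, 1000), "s2": ("2", 401, 501)}
kept_sites = reliable_sites(old, hits, back)
print(kept_sites)
```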

PEI identification

PEIs derived from Hi-C data were used to link the noncoding variants to target protein-coding genes. We used the Juicer pipeline (35) to preprocess the sequenced Hi-C reads into normalized contact matrices. In brief, Hi-C sequencing data were aligned to the reference genome using bwa (36) with default parameters. Subsequently, mapped reads were paired, and duplicates and low-quality reads (MAPQ <30) were removed to retain high-quality contacts. Retained contacts for technical and biological replicates were combined before downstream analyses. These contacts were further binned into raw contact matrices at 5/10 kb and 40 kb resolutions and normalized using the KR algorithm. To obtain PEIs, the PSYCHIC software (37) was employed. In brief, the normalized matrices at 40-kb resolution were used to identify genome-wide TADs with PSYCHIC, which serve to split the contact matrices into blocks for local background estimation. Based on the local background-adjusted contact matrices, Z-tests were applied to identify raw PEIs within 10 Mb of the promoters. To ensure high-confidence PEIs, only those raw PEIs with an FDR <0.01 and interaction distances ranging from 25 kb to 2 Mb were retained. Promoter–promoter interactions involving protein-coding genes located in putative enhancer regions were further excluded.
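The final filtering step on raw PEIs can be illustrated with a short Python sketch. It assumes that each raw PEI is already available with an FDR value and that the interaction distance is measured between the midpoints of the promoter and enhancer bins; the RawPEI class and keep_pei function are hypothetical names, not part of PSYCHIC.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RawPEI:
    """A raw promoter-enhancer interaction (hypothetical record layout)."""
    promoter_mid: int  # midpoint of the promoter bin, in bp
    enhancer_mid: int  # midpoint of the enhancer bin, in bp
    fdr: float         # FDR-adjusted significance of the contact


def keep_pei(pei: RawPEI,
             max_fdr: float = 0.01,
             min_dist: int = 25_000,
             max_dist: int = 2_000_000) -> bool:
    """Apply the thresholds from the text: FDR < 0.01 and 25 kb to 2 Mb."""
    distance = abs(pei.promoter_mid - pei.enhancer_mid)
    return pei.fdr < max_fdr and min_dist <= distance <= max_dist


raw = [
    RawPEI(1_000_000, 1_100_000, 0.001),  # kept
    RawPEI(1_000_000, 1_010_000, 0.001),  # too close (10 kb)
    RawPEI(1_000_000, 4_000_000, 0.001),  # too far (3 Mb)
    RawPEI(1_000_000, 1_100_000, 0.05),   # not significant
]
high_confidence = [p for p in raw if keep_pei(p)]
print(len(high_confidence))
```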

ChIP-Seq data analyses

ChIP-Seq data for histone modifications were used to prioritize functional and causal regulatory genetic variants. First, we performed quality control on the raw data using FastQC, retaining only datasets that met the criteria for base quality, GC content distribution, and repeated short sequences. We then obtained high-quality reads from the raw data by removing reads with adaptor contamination longer than 5 bp, reads in which more than 50% of bases had a quality score below 19, and reads with an N base proportion greater than 5%. The high-quality reads were aligned to the reference genome using bowtie2 (38) with the parameters "-k 2 -m 2 -n 2" and further processed with MACS2 (39) to identify regions with enriched signals.
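As a rough illustration of the per-read thresholds above, the following Python sketch checks the base-quality and N-content criteria for a single read. It is not the tool used in the study: the function name passes_read_filter is hypothetical, quality scores are assumed to be already decoded to integers, and adaptor detection is left out because it depends on the adaptor sequences.

```python
from typing import Sequence


def passes_read_filter(seq: str,
                       quals: Sequence[int],
                       min_qual: int = 19,
                       max_low_frac: float = 0.5,
                       max_n_frac: float = 0.05) -> bool:
    """Return True if a read survives the quality and N-content filters.

    A read is removed when more than 50% of its bases have a quality score
    below 19, or when more than 5% of its bases are N.
    """
    if not seq or len(seq) != len(quals):
        return False  # malformed record; treated as failing (an assumption)
    low_quality = sum(1 for q in quals if q < min_qual)
    n_bases = seq.upper().count("N")
    length = len(seq)
    return (low_quality / length <= max_low_frac
            and n_bases / length <= max_n_frac)


read = "ACGT" * 10                                   # 40 bases, no N
print(passes_read_filter(read, [30] * 40))           # good read
print(passes_read_filter(read, [10] * 30 + [30] * 10))  # 75% low-quality bases
```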

Functional enrichment analyses of gene sets

Functional enrichment analyses of gene sets were performed using Metascape (40) with default parameters (http://metascape.org). Specifically, we converted porcine genes to their human orthologs and chose human as the target species for analyses.

Pipeline

LD expansion of GWAS variants

Given the sparsity of variants detectable on the SNP arrays used for most GWAS research, we hypothesized that LD expansion, which retrieves all linked SNPs within the corresponding LD proxy of the GWAS risk loci using genome-wide SNP data for a matched population, could help identify more potential regulatory variants. SNPs within the same LD block tend to be inherited together and share similar genetic information. The VCF files (Table S2) containing genome-wide variation information for the matched population were first converted to binary files recognizable by the PLINK software (41) using the "--make-bed" parameter. Then, the block function of PLINK was applied to the binary files, and haplotype blocks for the four species were obtained. SNPs within the same haplotype block were considered to be in high LD with each other. Here, we termed the original GWAS risk loci lead SNPs to distinguish them from the newly expanded ones, and the D′ and r² statistics between the former and the latter were calculated and visualized using PLINK with the "--ld" parameter.
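The expansion step itself reduces to a lookup once the haplotype blocks are known. The Python sketch below assumes the blocks have already been parsed into lists of SNP identifiers (a hypothetical in-memory layout, not the PLINK file format) and returns, for each lead SNP, the other SNPs that share its block; expand_by_blocks is an illustrative name.

```python
from typing import Dict, List, Set


def expand_by_blocks(lead_snps: List[str],
                     blocks: List[List[str]]) -> Dict[str, Set[str]]:
    """Map each lead SNP to the other SNPs in the same haplotype block.

    Lead SNPs that fall outside every block get an empty expansion set.
    """
    block_of: Dict[str, Set[str]] = {}
    for block in blocks:
        members = set(block)
        for snp in members:
            block_of[snp] = members
    # Set difference builds a new set, so the shared block sets stay intact.
    return {lead: block_of.get(lead, {lead}) - {lead} for lead in lead_snps}


blocks = [["rs1", "rs2", "rs3"], ["rs4", "rs5"]]
expanded = expand_by_blocks(["rs2", "rs9"], blocks)
print(expanded["rs2"] == {"rs1", "rs3"}, expanded["rs9"] == set())
```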

Preliminary data filtering

After LD expansion, the GWAS risk loci and expanded variants were mapped to the dbSNP database. For GWAS loci that could not be mapped, both the lead and expanded variants were excluded. Unmappable expanded variants were removed individually, without removing the other linked variants. The alternative and reference allele information for the retained variants was retrieved from the VCF files of the matched populations (Table S2). Variants with more than one alternative allele were further excluded. For QTLs, regions longer than 10 kb were filtered out because they overlap multiple putative enhancers and promoters and were therefore considered uninformative.
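These filtering rules can be written down compactly. The sketch below is one possible reading, with hypothetical names (retained_variants, keep_qtl) and simplifying assumptions: the set of dbSNP-mappable identifiers and the alternative alleles per variant are given as plain Python collections, variants without allele information are dropped, and QTL length is computed as end minus start.

```python
from typing import Dict, List, Set


def retained_variants(expansion: Dict[str, Set[str]],
                      mapped: Set[str],
                      alt_alleles: Dict[str, List[str]]) -> Set[str]:
    """Apply the mapping and allele filters to lead and expanded variants."""
    candidates: Set[str] = set()
    for lead, expanded in expansion.items():
        if lead not in mapped:
            continue  # unmappable lead: drop it together with its expansion
        candidates.add(lead)
        # An unmappable expanded variant is dropped on its own.
        candidates.update(snp for snp in expanded if snp in mapped)
    # Keep only variants with exactly one alternative allele.
    return {snp for snp in candidates if len(alt_alleles.get(snp, [])) == 1}


def keep_qtl(start: int, end: int, max_len: int = 10_000) -> bool:
    """Keep QTL regions that are no longer than 10 kb."""
    return end - start <= max_len


expansion = {"rs1": {"rs2", "rs3"}, "rs7": {"rs8"}}
mapped = {"rs1", "rs2", "rs8"}          # rs3 and the lead rs7 are unmappable
alts = {"rs1": ["A"], "rs2": ["C", "T"], "rs8": ["G"]}
print(retained_variants(expansion, mapped, alts))
```

In this toy input only rs1 survives: rs3 cannot be mapped, rs2 has two alternative alleles, and rs8 is discarded because its lead SNP rs7 cannot be mapped.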

Estimation of binding affinity effects

To evaluate whether variants reduce or enhance TF binding affinity, we predicted the change in binding affinity caused by the alternative alleles of candidate variants using a comprehensive TF motif set (42) and the software atSNP (43). In brief, affinity scores for TF binding to the two alleles (reference and SNP variant alleles) around each variant (±30 bp) were obtained and compared, and the statistical significance of the difference was estimated. It is important to note that some genome versions used for GWAS risk loci in Table S2 differ from those for genome-wide population-scale variant files in Table S1. Therefore, the coordinates of the former were mapped to those of the latter using the LiftOver tool and chain files. In addition, for human data analysis, atSNP relies on the publicly available R package SNPlocs.Hsapiens.dbSNP144.GRCh38, which contains variant locus and allele information. Equivalent R packages for pig, cattle, sheep, and chicken are not readily available, so we generated them manually following the package documentation.

Estimation of variant conservation

As functional regulatory variants tend to be more conserved across species at the sequence level (44), we estimated the sequence conservation of the variants using phastCons and phyloP scores and measured how far these scores were elevated relative to those of randomly generated genomic regions. In brief, the phastCons and phyloP scores calculated from the whole-genome multiple sequence alignment of 100 vertebrates were retrieved from the UCSC database (hg38.phastCons100way.bw and hg38.phyloP100way.bw). Since pig, cattle, sheep, and chicken are included among these 100 vertebrates, we used these two bigWig files to estimate variant conservation. The variants and random background loci were mapped from the four species to the hg38 human genome, and the phastCons and phyloP scores of the converted regions were obtained to represent the conservation level of the corresponding loci in the respective species.

Combination of p values for prioritization of regulatory variant

Following a previous study (15), we integrated GWAS signals, LD haplotype block information, changes in binding affinity, and conservation scores to prioritize the GWAS leading (L) and expanded (E) variants. The original GWAS p values were used to represent the significance of phenotype-associated effects (PGWAS) for the L variants, whereas PGWAS for each E variant was calculated by dividing the GWAS p value of the linked L variant by the r² between E and L in the matched populations. Then, the most significant p value for the change in TF binding affinity was selected to represent the statistical significance of the binding affinity effect of the variant (PBDA). The phastCons/phyloP score of each variant locus was compared with those of randomly sampled genomic regions, and a one-sided Z-test was applied to obtain a p value representing the level of conservation (PCONS). We carried out Fisher's combined probability test to obtain a combined p value, CP, from the p values of the three independent measurements (GWAS, binding affinity, and conservation).
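The scoring scheme can be made concrete with a small, standard-library Python sketch. It is an illustration rather than the database's implementation: the function names are hypothetical, the adjusted p value is capped at 1 (an assumption not stated in the text), the Z-test uses the sample mean and standard deviation of the background scores, and all p values passed to Fisher's method are assumed to be greater than 0. Fisher's statistic is X² = -2 Σ ln(p_i), which follows a chi-square distribution with 2k degrees of freedom for k independent p values; for even degrees of freedom the upper tail has the closed form used below.

```python
import math
from typing import Sequence


def expanded_gwas_p(lead_p: float, r2: float) -> float:
    """PGWAS of an expanded variant: lead p value divided by r^2."""
    if not 0.0 < r2 <= 1.0:
        raise ValueError("r2 must be in (0, 1]")
    return min(1.0, lead_p / r2)  # capping at 1 is an assumption


def conservation_p(score: float, background: Sequence[float]) -> float:
    """One-sided Z-test: is `score` higher than the random background?"""
    n = len(background)
    mean = sum(background) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in background) / (n - 1))
    z = (score - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # upper-tail normal probability


def fisher_combined_p(p_values: Sequence[float]) -> float:
    """Fisher's combined probability test for k independent p values (> 0)."""
    k = len(p_values)
    half_x = -sum(math.log(p) for p in p_values)  # X^2 / 2
    # Chi-square survival function for 2k degrees of freedom:
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return min(1.0, math.exp(-half_x) * total)


p_gwas = expanded_gwas_p(1e-6, r2=0.8)
p_cons = conservation_p(0.9, [0.1, 0.2, 0.3, 0.4, 0.5])
cp = fisher_combined_p([p_gwas, 0.01, p_cons])
print(p_gwas, p_cons, cp)
```

Variants can then be ranked by CP, with smaller values indicating stronger combined evidence.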

Visualization

First, IGV (24) was integrated into the GWAS interface to visualize GWAS risk loci (in BED format), associated PEIs (in BEDPE format), and histone modification peaks (in BED format) for the selected species, phenotype, and tissue/cell types. IGV was also integrated into the QTL interface for similar visualizations.

In addition, we provide a custom visualization platform in which users can click on a variant of interest in the GWAS summary interface to obtain a circular plot. In this plot, genes located in the vicinity of the variant are shown as sectors colored by gene biotype, and PEIs whose enhancers cover the variant are shown as orange curves connecting the variant to the promoters of genes. Hovering the mouse cursor over a displayed gene shows detailed information, including the gene's name, ID, biotype, and genomic position. Hovering over a curve representing a PEI reveals the tissue/cell type and contact score of the PEI. Clicking on the curve redirects users to an interface showing the histone modifications and genes located in the vicinity of SNP-associated promoters and enhancers at a 10-kb resolution. Specifically, genes are shown as thin and thick rectangles with merged triangles, displaying the detailed gene structures of exons and introns as well as the transcription direction. Histone modifications are displayed as rectangles, with distinct colors indicating different modification types and tissue/cell types.

Data availability

The online database can be freely accessed via https://agwas.sicau.edu.cn/#/HomePage.

Supporting information

This article contains supporting information.

Conflict of interest

The authors declare that they have no conflicts of interest with the contents of this article.

Acknowledgments

Author contributions

Q. T. conceptualization; Y. G., Y. J., and F. Y. software; Y. G. formal analysis; M. L. resources; Y. W., X. L., J. Y., K. W., H. H., Y. Y., Y. T., C. W., and J. X. data curation; Q. T. writing–original draft; Y. G., Y. J., and Q. T. visualization; Q. T. supervision; Q. T. project administration; M. L. funding acquisition.

Funding and additional information

This work was funded by the National Key R&D Program of China (grant nos.: 2022YFF1000100 and 2020YFA0509500; to Q. T.), the Sichuan Science and Technology Program (grant nos.: 2021ZDZX0008 [to Q. T.] and 2021YFYZ0009 [to M. L.]), the National Natural Science Foundation of China (grant no.: 32225046; to M. L.), and the Dual Support Plan for Discipline Construction—Special Program for The Cultivation of Outstanding Young Scholars (grant no.: 2022SZYQ004; to Q. T.).

Reviewed by members of the JBC Editorial Board. Edited by Philip A. Cole

Contributor Information

Mingzhou Li, Email: mingzhou.li@sicau.edu.cn.

Qianzi Tang, Email: tangqianzi@sicau.edu.cn.

Supporting information

Supplementary Figure
mmc1.docx (1.4MB, docx)
Supplementary Table
mmc2.docx (38.7KB, docx)

References

  • 1. Misra B.B., Langefeld C.D., Olivier M., Cox L.A. Integrated omics: tools, advances, and future approaches. J. Mol. Endocrinol. 2018 doi: 10.1530/JME-18-0055.
  • 2. Uffelmann E., Huang Q.Q., Munung N.S., de Vries J., Okada Y., Martin A.R., et al. Genome-wide association studies. Nat. Rev. Methods Primers. 2021;1:59.
  • 3. Wen J., Nodzak C., Shi X. QTL Analysis beyond eQTLs. Methods Mol. Biol. 2020;2082:201–210. doi: 10.1007/978-1-0716-0026-9_14.
  • 4. Wibowo T.A., Gaskins C.T., Newberry R.C., Thorgaard G.H., Michal J.J., Jiang Z. Genome assembly anchored QTL map of bovine chromosome 14. Int. J. Biol. Sci. 2008;4:406–414. doi: 10.7150/ijbs.4.406.
  • 5. Dong Q., Zhang Z.H., Wang L.L., Zhu Y.J., Fan Y.Y., Mou T.M., et al. Dissection and fine-mapping of two QTL for grain size linked in a 460-kb region on chromosome 1 of rice. Rice (N Y) 2018;11:44. doi: 10.1186/s12284-018-0236-z.
  • 6. Hu Z.L., Park C.A., Reecy J.M. Bringing the Animal QTLdb and CorrDB into the future: meeting new challenges and providing updated services. Nucleic Acids Res. 2022;50:D956–D961. doi: 10.1093/nar/gkab1116.
  • 7. Aguet F., Alasoo K., Li Y.I., Battle A., Im H.K., Montgomery S.B., et al. Molecular quantitative trait loci. Nat. Rev. Methods Primers. 2023;3:4.
  • 8. Cubillos F.A., Coustham V., Loudet O. Lessons from eQTL mapping studies: non-coding regions and their role behind natural phenotypic variation in plants. Curr. Opin. Plant Biol. 2012;15:192–198. doi: 10.1016/j.pbi.2012.01.005.
  • 9. Belton J.M., Mccord R.P., Gibcus J.H., Naumova N., Zhan Y., Dekker J. Hi-C: a comprehensive technique to capture the conformation of genomes. Methods. 2012;58:268–276. doi: 10.1016/j.ymeth.2012.05.001.
  • 10. Li G., Cai L., Chang H., Hong P., Zhou Q., Kulakova E.V., et al. Chromatin interaction analysis with paired-end tag (ChIA-PET) sequencing technology and application. BMC Genomics. 2014;15 doi: 10.1186/1471-2164-15-S12-S11.
  • 11. Farh K.K., Marson A., Zhu J., Kleinewietfeld M., Housley W.J., Beik S., et al. Genetic and epigenetic fine mapping of causal autoimmune disease variants. Nature. 2015;518:337–343. doi: 10.1038/nature13835.
  • 12. Advani J., Corso-Diaz X., Kwicklis M., van Asten F., Ratnapriya R., Mehta P., et al. QTL mapping of human retina DNA methylation identifies 87 gene-epigenome interactions in age-related macular degeneration. Res. Sq. 2023 doi: 10.21203/rs.3.rs-3011096/v1.
  • 13. Maurano M.T., Humbert R., Rynes E., Thurman R.E., Haugen E., Wang H., et al. Systematic localization of common disease-associated variation in regulatory DNA. Science. 2012;337:1190–1195. doi: 10.1126/science.1222794.
  • 14. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489:57–74. doi: 10.1038/nature11247.
  • 15. Li M.J., Wang L.Y., Xia Z., Sham P.C., Wang J. GWAS3D: detecting human regulatory variants by integrative analysis of genome-wide associations, chromosome interactions and histone modifications. Nucleic Acids Res. 2013;41:W150–W158. doi: 10.1093/nar/gkt456.
  • 16. Huang D., Yi X., Zhang S., Zheng Z., Wang P., Xuan C., et al. GWAS4D: multidimensional analysis of context-specific regulatory variant for human complex diseases and traits. Nucleic Acids Res. 2018;46:W114–W120. doi: 10.1093/nar/gky407.
  • 17. Harrison P.W., Sokolov A., Nayak A., Fan J., Zerbino D., Cochrane G., et al. The FAANG data portal: global, open-access, "FAIR", and richly validated genotype to phenotype data for high-quality functional annotation of animal genomes. Front. Genet. 2021;12 doi: 10.3389/fgene.2021.639238.
  • 18. Wu Y., Zhang Y., Liu H., Gao Y., Liu Y., Chen L., et al. Genome-wide identification of functional enhancers and their potential roles in pig breeding. J. Anim. Sci. Biotechnol. 2022;13:75. doi: 10.1186/s40104-022-00726-y.
  • 19. Liu S., Gao Y., Canela-Xandri O., Wang S., Yu Y., Cai W., et al. A multi-tissue atlas of regulatory variants in cattle. Nat. Genet. 2022;54:1438–1447. doi: 10.1038/s41588-022-01153-5.
  • 20. Chen N., Fu W., Zhao J., Shen J., Chen Q., Zheng Z., et al. BGVD: an integrated database for bovine sequencing variations and selective signatures. Genomics Proteomics Bioinformatics. 2020;18:186–193. doi: 10.1016/j.gpb.2019.03.007.
  • 21. Fu W., Wang R., Yu J., Hu D., Cai Y., Shao J., et al. GGVD: a goat genome variation database for tracking the dynamic evolutionary process of selective signatures and ancient introgressions. J. Genet. Genomics. 2021;48:248–256. doi: 10.1016/j.jgg.2021.03.003.
  • 22. Wei J., Liu P., Liu F., Jiang A., Qiao J., Pu Z., et al. EDomics: a comprehensive and comparative multi-omics database for animal evo-devo. Nucleic Acids Res. 2023;51:D913–D923. doi: 10.1093/nar/gkac944.
  • 23. Fu Y., Liu H., Dou J., Wang Y., Liao Y., Huang X., et al. IAnimal: a cross-species omics knowledgebase for animals. Nucleic Acids Res. 2023;51:D1312–D1324. doi: 10.1093/nar/gkac936.
  • 24. Robinson J.T., Thorvaldsdottir H., Turner D., Mesirov J.P. igv.js: an embeddable JavaScript implementation of the Integrative Genomics Viewer (IGV). Bioinformatics. 2023;39:btac830. doi: 10.1093/bioinformatics/btac830.
  • 25. Manolio T.A., Collins F.S., Cox N.J., Goldstein D.B., Hindorff L.A., Hunter D.J., et al. Finding the missing heritability of complex diseases. Nature. 2009;461:747–753. doi: 10.1038/nature08494.
  • 26. Krzywinski M., Schein J., Birol I., Connors J., Gascoyne R., Horsman D., et al. Circos: an information aesthetic for comparative genomics. Genome Res. 2009;19:1639–1645. doi: 10.1101/gr.092759.109.
  • 27. Zboril E., Yoo H., Chen L., Liu Z. Dynamic interactions of transcription factors and enhancer reprogramming in cancer progression. Front. Oncol. 2021;11 doi: 10.3389/fonc.2021.753051.
  • 28. Zhang F., Zhang Z., Yan X., Chen H., Zhang W., Hong Y., et al. Genome-wide association studies for hematological traits in Chinese Sutai pigs. BMC Genet. 2014;15:41. doi: 10.1186/1471-2156-15-41.
  • 29. Ott J., Kamatani Y., Lathrop M. Family-based designs for genome-wide association studies. Nat. Rev. Genet. 2011;12:465–474. doi: 10.1038/nrg2989.
  • 30. Koshi-Mano K., Mano T., Morishima M., Murayama S., Tamaoka A., Tsuji S., et al. Neuron-specific analysis of histone modifications with post-mortem brains. Sci. Rep. 2020;10:3767. doi: 10.1038/s41598-020-60775-z.
  • 31. Li C., Tian D., Tang B., Liu X., Teng X., Zhao W., et al. Genome Variation Map: a worldwide collection of genome variations across multiple species. Nucleic Acids Res. 2021;49:D1186–D1191. doi: 10.1093/nar/gkaa1005.
  • 32. Tian D., Wang P., Tang B., Teng X., Li C., Liu X., et al. GWAS Atlas: a curated resource of genome-wide variant-trait associations in plants and animals. Nucleic Acids Res. 2020;48:D927–D932. doi: 10.1093/nar/gkz828.
  • 33. Kuhn R.M., Haussler D., Kent W.J. The UCSC genome browser and associated tools. Brief Bioinform. 2013;14:144–161. doi: 10.1093/bib/bbs038.
  • 34. Altschul S.F., Madden T.L., Schäffer A.A., Zhang J., Zhang Z., Miller W., et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389.
  • 35. Durand N.C., Shamim M.S., Machol I., Rao S.S.P., Huntley M.H., Lander E.S., et al. Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. Cell Syst. 2016;3:95–98. doi: 10.1016/j.cels.2016.07.002.
  • 36. Li H., Durbin R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics. 2010;26:589–595. doi: 10.1093/bioinformatics/btp698.
  • 37. Ron G., Globerson Y., Moran D., Kaplan T. Promoter-enhancer interactions identified from Hi-C data using probabilistic models and hierarchical topological domains. Nat. Commun. 2017;8:2237. doi: 10.1038/s41467-017-02386-3.
  • 38. Langmead B., Salzberg S.L. Fast gapped-read alignment with Bowtie 2. Nat. Methods. 2012;9:357–359. doi: 10.1038/nmeth.1923.
  • 39. Feng J., Liu T., Qin B., Zhang Y., Liu X.S. Identifying ChIP-seq enrichment using MACS. Nat. Protoc. 2012;7:1728–1740. doi: 10.1038/nprot.2012.101.
  • 40. Zhou Y., Zhou B., Pache L., Chang M., Khodabakhshi A.H., Tanaseichuk O., et al. Metascape provides a biologist-oriented resource for the analysis of systems-level datasets. Nat. Commun. 2019;10:1523. doi: 10.1038/s41467-019-09234-6.
  • 41. Purcell S., Neale B., Todd-Brown K., Thomas L., Ferreira M.A.R., Bender D., et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 2007;81:559–575. doi: 10.1086/519795.
  • 42. Kulakovskiy I.V., Medvedeva Y.A., Schaefer U., Kasianov A.S., Vorontsov I.E., Bajic V.B., et al. HOCOMOCO: a comprehensive collection of human transcription factor binding sites models. Nucleic Acids Res. 2013;41:D195–D202. doi: 10.1093/nar/gks1089.
  • 43. Zuo C., Shin S., Keleş S. atSNP: transcription factor binding affinity testing for regulatory SNP detection. Bioinformatics. 2015;31:3353–3355. doi: 10.1093/bioinformatics/btv328.
  • 44. Zhao R., Talenti A., Fang L., Liu S., Liu G., Chue Hong N.P., et al. The conservation of human functional variants and their effects across livestock species. Commun. Biol. 2022;5:1003. doi: 10.1038/s42003-022-03961-1.


Articles from The Journal of Biological Chemistry are provided here courtesy of American Society for Biochemistry and Molecular Biology
