Genomic variability in Mexican chicken population using copy number variants

E Gorla; M C Cozzi; S I Román-Ponce; F J Ruiz López; V E Vega-Murillo; S Cerolini; A Bagnato; M G Strillacci

doi:10.1186/s12863-017-0524-4

2017 Jul 3;18:61. doi: 10.1186/s12863-017-0524-4


PMCID: PMC5496433 PMID: 28673234

Abstract

Background

Copy number variations (CNVs) are genomic polymorphisms that influence phenotypic variation and are an important source of genetic variation in populations. The aim of this study was to investigate genetic variability in the Mexican Creole chicken population using CNVs.

Results

The Hidden Markov Model of the PennCNV software detected a total of 1924 CNVs in the genome of the 256 samples processed with the Axiom® Genome-Wide Chicken Genotyping Array (Affymetrix). The mapped CNVs comprised 1538 gains and 386 losses, resulting at the population level in 1216 CNV regions (CNVRs), of which 959 were gains, 226 were losses and 31 were complex (i.e. containing both losses and gains). The CNVRs covered a total of 47 Mb of sequence, corresponding to 5.12% of the chicken galGal4 autosome assembly.

Conclusions

This study provides a deep insight into the structural variation in the genome of the unselected Mexican chicken population, which until now had not been genetically characterized. The genomic study showed that the population, despite its extreme morphological variation, cannot be organized into differentiated genetic subpopulations. Finally, this study provides a chicken CNV map based on the 600 K SNP chip array, together with genome-wide gene copy number estimates, in a native chicken population that has been unselected for more than 500 years.

Electronic supplementary material

The online version of this article (doi:10.1186/s12863-017-0524-4) contains supplementary material, which is available to authorized users.

Keywords: Copy number variant, Chicken, Genetic variability

Background

Copy number variants (CNVs) are genomic structural variations distributed over the whole genome in all species. They are genomic segments of at least 50 bp in size [1] for which copy number differences have been observed in comparison to reference genome assemblies (insertions, deletions and more complex changes) [2, 3]. Sequencing of the chicken genome, released in 2004 [4], has facilitated the use of molecular markers for breed/ecotype characterization. Structural variation has been recognized as an important mediator of gene and genome evolution within populations. In recent decades, microsatellite markers have often been used for phylogenetic analyses and studies of genetic variability in chicken populations [5–7]. Although numerous studies investigating genetic variation have focused on SNPs, there is growing evidence for the substantial role of structural DNA polymorphism in phenotypic diversity [8]. It has been shown that CNVs are ubiquitous in the genome and can contribute substantially to phenotypic variability and disease susceptibility in humans [8, 9] and animals [10, 11]. The underlying assumption is that CNVs change gene structure and dosage and alter gene regulation [8–13]. Even though CNVs are less frequent than SNPs in absolute numbers, they cover a larger proportion of the genome and therefore have a large potential effect on phenotypic variability [14]. Compared with humans and other model organisms, there is limited research on the extent and impact of CNVs in the chicken genome.

In Mexico the poultry population, although it shows large morphological variability, is not divided into breeds or strains and can possibly be considered a single widespread Creole chicken population (Gallus gallus domesticus), the result of undefined crosses among different breeds imported into Mexico from Europe over almost 500 years [15, 16]. Creole chickens in fact include a wide variety of biotypes, with different plumage colors and morphological features, that are widely distributed in the country [17]. In the absence of comprehensive breed characterization data and documentation of the origin of breeding populations, DNA polymorphism provides the most reliable estimates of genetic diversity within and between a given set of populations [18].

Several studies have recently been carried out to detect CNVs in poultry using low-density 60 K SNP chips [19] or aCGH [20–22]. The major limitations of these studies are the spot density of the arrays used and the limited sample size. Jia et al. [23] have already suggested that the use of the 600 K SNP array can improve the efficiency of CNV detection in poultry species. Whole genome sequence data can improve the detection of small CNVs but, although desirable and employed by some authors [24, 25], sequencing is still too expensive to be applied to a large number of samples.

The aim of this study was to map the CNVs in the Mexican chicken population with unprecedented resolution using a high-density SNP chip (i.e. the 600 K Affymetrix SNP chip) on a large number of individuals (i.e. 256), and to characterize the genetic variability of the Mexican Creole chicken population using CNVs as genomic markers.

Methods

Sampling and genotyping

In this study a collection of 265 individuals of the Mexican chicken population, from different farms across 26 states of Mexico, had previously been sampled by the Instituto Nacional de Investigaciones Forestales, Agricola y Pecuarias (INIFAP) within the institutional activities of the Centro Nacional de Recursos Geneticos at Tepatiplan, Jalisco. As mentioned above, no classification of the Mexican population into breeds exists. For this reason, the birds were considered a single Creole population and sampled in several states of Mexico.

Samples were processed and genotyped within the framework of a previous INIFAP project using the 600 K Affymetrix Axiom® Chicken Genotyping Array, which contains 580,954 SNPs distributed across the genome with an average spacing of about 1.8 kb; the data were made available for the present study. A commercial service provider performed the DNA extraction from feathers and the genotyping. The galGal4 chicken assembly was used as the reference genome in this study.

Quality assurance of CNV raw data and CNV detection

The CNV detection was performed on a total of 471,730 SNPs on the first 28 chicken autosomes.

The Axiom® Analysis Suite software (Affymetrix) was used to perform quality control of the raw intensity data and to run the genotyping algorithms. Default quality control settings were applied before the genotyping analysis to exclude low-quality samples, i.e. those with call rate <97% or Dish Quality Control <0.82. The Axiom® CNV Summary Tool software was used to generate the input files for the CNV prediction analysis.
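As an illustration of the sample-level filter, the following minimal Python sketch keeps a sample only when both metrics reach the thresholds quoted above. The record layout, the field names (`call_rate`, `dqc`) and the sample identifiers are assumptions made for this example; the filtering in the study was done inside the Axiom® Analysis Suite, whose interface is not reproduced here.

```python
from dataclasses import dataclass

# Thresholds quoted in the text.
MIN_CALL_RATE = 97.0  # percent of SNPs called
MIN_DQC = 0.82        # Dish Quality Control


@dataclass
class SampleQC:
    sample_id: str    # hypothetical identifier
    call_rate: float  # percent
    dqc: float        # Dish Quality Control value


def passes_qc(sample: SampleQC) -> bool:
    """Keep a sample only if neither metric falls below its threshold."""
    return sample.call_rate >= MIN_CALL_RATE and sample.dqc >= MIN_DQC


# Toy records, not data from the study.
samples = [
    SampleQC("bird_A", 99.1, 0.95),
    SampleQC("bird_B", 96.4, 0.93),  # fails the call-rate threshold
    SampleQC("bird_C", 98.7, 0.80),  # fails the DQC threshold
]
kept = [s.sample_id for s in samples if passes_qc(s)]
print(kept)
```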

The CNV detection was performed with the PennCNV software [26] using the Log R Ratio (LRR) and the B allele frequency (BAF) [27] obtained with the Axiom® CNV Summary Tool software. The individual-based CNV calling was performed using the default parameters of the Hidden Markov Model (HMM): standard deviation of LRR <0.30, BAF drift of 0.01 and waviness factor of 0.05; a minimum of 3 SNPs was required to define a CNV. The number of CNVs per individual ranged from 0 to more than 100. The distribution was continuous up to 79 CNVs, whereas 9 birds showed a jump to more than 100 CNVs. Because their CNV counts were outliers with respect to the rest of the distribution, these 9 birds were filtered out to avoid introducing possible false positives and biasing the CNV interpretation, leaving 256 samples for further analyses. It is worth mentioning that Zhang et al. [19] validated the CNVs called by PennCNV using the CNVPartition program, obtaining a 99% overlap of results. Additionally, recent studies in cattle [28] used two software packages that map CNVs with different algorithms: the HMM of PennCNV, which identifies CNVs from the B allele frequency and the Log R Ratio, and the CNAM of SVS (Golden Helix), which bases the identification on the Log R Ratio only. These studies provide additional empirical support for the results of Xu et al. [29], who concluded that using multiple CNV calling algorithms might also increase the false positive rate.
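The removal of samples with an anomalous number of CNVs can be sketched as follows. This is a minimal Python illustration, not the procedure actually scripted by the authors: the tuple layout of a CNV call and the `max_cnv=100` cut-off are assumptions, the latter read from the gap in the distribution described above (continuous up to 79 CNVs, then more than 100).

```python
from collections import Counter


def filter_outlier_samples(all_samples, cnv_calls, max_cnv=100):
    """Return the samples whose CNV count does not exceed max_cnv.

    all_samples: every genotyped sample id, so that samples with 0 CNVs are kept.
    cnv_calls:   iterable of (sample_id, chrom, start, end, state) tuples.
    max_cnv:     assumed cut-off, chosen from the gap described in the text.
    """
    counts = Counter(call[0] for call in cnv_calls)
    return [s for s in all_samples if counts[s] <= max_cnv]


# Toy data: s2 has no calls, s3 carries 120 calls and is treated as an outlier.
calls = [("s1", "1", 100, 900, "gain")] * 3 + [("s3", "2", 50, 400, "loss")] * 120
retained = filter_outlier_samples(["s1", "s2", "s3"], calls)
print(retained)
```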

In addition to the outlier detection described above, and in order to minimize false positive calls, PennCNV was run using different “.hmm” files (agre.hmm, affygw6.hmm, hh550.hmm), the choice of which is known to substantially affect both the false positive and the false negative rate. The online PennCNV manual (http://penncnv.openbioinformatics.org/en/latest/) in fact instructs the user that the agre.hmm file produces an excess of false positive calls with respect to the default affygw6.hmm file, which has been criticized for producing a low number of CNV calls (i.e. an excess of false negatives) with respect to other calling software and algorithms. Additionally, we used the hh550.hmm file in the calling process, which is based on the chip whose number of SNPs is closest to that of the SNP chip used here. To reduce the false calling rate, we then considered valid only the CNV calls obtained with both the agre.hmm and the hh550.hmm files. The CNV calls obtained using the affygw6.hmm file were negligible in number with respect to the other two files, but were nevertheless present in the consensus obtained here. The hmm file supplied to the HMM of PennCNV provides the algorithm with the expected signal intensity values for the different CNV states and the expected probabilities of transition between copy number states. As described in the PennCNV user manual, however, the transition probability is a function of the distance between neighboring markers. This makes the choice of an hmm file appropriate to the marker density a critical step in CNV mapping to control false positive and false negative calls.
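One way to build a consensus between two calling runs is sketched below in Python. The text does not specify how a call was matched between the agre.hmm and hh550.hmm runs, so the rule used here (same sample, chromosome and state, with an overlap of at least 1 bp) is an assumption of this sketch, as are the tuple layout and the toy coordinates.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two closed intervals share at least one base."""
    return a_start <= b_end and b_start <= a_end


def consensus_calls(run_a, run_b):
    """Keep the calls of run_a that are supported by a call in run_b.

    Each call is (sample_id, chrom, start, end, state). A call is supported
    when run_b holds a call for the same sample, chromosome and state that
    overlaps it by >= 1 bp (assumed rule, not stated in the paper).
    """
    index = {}
    for sample, chrom, start, end, state in run_b:
        index.setdefault((sample, chrom, state), []).append((start, end))
    kept = []
    for call in run_a:
        sample, chrom, start, end, state = call
        hits = index.get((sample, chrom, state), [])
        if any(overlaps(start, end, s, e) for s, e in hits):
            kept.append(call)
    return kept


# Toy calls standing in for the output of two runs with different .hmm files.
agre_calls = [("s1", "1", 1000, 5000, "gain"), ("s1", "2", 200, 900, "loss")]
hh550_calls = [("s1", "1", 1500, 4800, "gain")]
consensus = consensus_calls(agre_calls, hh550_calls)
print(consensus)
```

A stricter rule, for example a minimum reciprocal overlap, could be substituted in `overlaps` without changing the rest of the sketch.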

CNVR definition

The CNV regions (CNVRs) were obtained with the BedTools software (mergeBed command) [30] by merging CNVs that overlapped by at least 1 bp, as described by Redon et al. [12]. CNVRs were classified as gain, loss or complex (i.e. a CNVR comprising both gain and loss events).
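The merging and classification step can be illustrated with a short Python sketch. It is a stand-in for the BedTools merge, not a reimplementation of it: the closed-interval coordinates and the tuple layout are assumptions of this example, and the toy coordinates are not from the study.

```python
def build_cnvrs(cnvs):
    """Merge CNVs overlapping by >= 1 bp into CNVRs, chromosome by chromosome.

    cnvs: iterable of (chrom, start, end, state) with state "gain" or "loss";
          closed coordinates are assumed.
    Returns a list of (chrom, start, end, label), where label is
    "gain", "loss" or "complex" (both states present in the region).
    """
    regions = []
    # Sorting by chromosome and start lets a single pass merge neighbours.
    for chrom, start, end, state in sorted(cnvs):
        last = regions[-1] if regions else None
        if last and last[0] == chrom and start <= last[2]:
            last[2] = max(last[2], end)  # extend the open region
            last[3].add(state)           # remember every state seen in it
        else:
            regions.append([chrom, start, end, {state}])
    return [
        (c, s, e, "complex" if len(states) > 1 else next(iter(states)))
        for c, s, e, states in regions
    ]


toy_cnvs = [
    ("1", 100, 500, "gain"),
    ("1", 400, 900, "loss"),   # overlaps the first call: complex region
    ("1", 2000, 3000, "gain"),
    ("2", 100, 200, "loss"),
]
cnvrs = build_cnvrs(toy_cnvs)
print(cnvrs)
```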

CNVR annotation

After downloading the list of chicken autosomal galGal4 genes (GCA_000002315.2) from the Ensembl database (Release 88) (http://www.ensembl.org), gene annotation was performed with the intersect command of the BedTools software [30], identifying the genes fully included in, or partially overlapping, the defined CNVRs. Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analyses were performed using the Panther database (http://pantherdb.org).
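The distinction between genes fully included in a CNVR and genes only partially overlapping it can be sketched in Python as follows. The gene identifiers and coordinates are hypothetical, closed coordinates are assumed, and the quadratic loop is chosen for readability rather than speed; the study itself used the BedTools intersect command.

```python
def annotate_cnvrs(cnvrs, genes):
    """Map each CNVR to the genes it fully contains or partially overlaps.

    cnvrs: list of (chrom, start, end)
    genes: list of (chrom, start, end, gene_id)
    Returns {cnvr: [(gene_id, "full" | "partial"), ...]}.
    """
    annotation = {}
    for chrom, r_start, r_end in cnvrs:
        hits = []
        for g_chrom, g_start, g_end, gene_id in genes:
            if g_chrom == chrom and g_start <= r_end and r_start <= g_end:
                fully_inside = r_start <= g_start and g_end <= r_end
                hits.append((gene_id, "full" if fully_inside else "partial"))
        annotation[(chrom, r_start, r_end)] = hits
    return annotation


# Hypothetical genes; identifiers and positions are placeholders.
toy_genes = [
    ("1", 1500, 2500, "geneA"),  # inside the CNVR
    ("1", 4500, 6000, "geneB"),  # straddles the CNVR end
    ("1", 7000, 8000, "geneC"),  # outside
    ("2", 1500, 2500, "geneD"),  # other chromosome
]
annotated = annotate_cnvrs([("1", 1000, 5000)], toy_genes)
print(annotated)
```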

Clustering analysis using CNVRs

A clustering analysis was performed on the CNVRs found in this study [31]. A scoring matrix of the CNVRs was constructed, assigning the value “0” or “1” to indicate the absence or presence of a CNV in a specific CNVR for each individual. Hierarchical agglomerative clustering was then applied to the scoring matrix using the pvclust function of the pvclust R package [32]. Multiscale bootstrap resampling (10,000 bootstraps) was used to obtain the Approximately Unbiased (AU) P-value, in order to assess the robustness of the branches. The Unweighted Pair Group Method with Arithmetic mean (UPGMA) was chosen as the agglomerative method.
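The construction of the 0/1 scoring matrix can be sketched in Python as below. Only the matrix is built here; the clustering and bootstrap in the study were done in R with pvclust and are not reproduced. The layout (samples as keys, one column per CNVR), the closed coordinates and the toy data are assumptions of this sketch.

```python
def scoring_matrix(sample_ids, cnvrs, cnvs):
    """Build a presence/absence matrix of CNVs in CNVRs.

    sample_ids: list of sample identifiers
    cnvrs:      list of (chrom, start, end), one column per CNVR
    cnvs:       list of (sample_id, chrom, start, end)
    Returns {sample_id: [0 or 1 for each CNVR]}.
    """
    matrix = {s: [0] * len(cnvrs) for s in sample_ids}
    for sample, chrom, start, end in cnvs:
        for j, (r_chrom, r_start, r_end) in enumerate(cnvrs):
            if chrom == r_chrom and start <= r_end and r_start <= end:
                matrix[sample][j] = 1
    return matrix


toy_regions = [("1", 100, 900), ("2", 50, 400)]
toy_calls = [("s1", "1", 150, 300), ("s2", "2", 60, 80), ("s2", "1", 850, 1000)]
scores = scoring_matrix(["s1", "s2", "s3"], toy_regions, toy_calls)
print(scores)
```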

Results

CNV and CNVR detection

Table 1 reports the descriptive statistics of the identified CNVs and CNVRs. The HMM of the PennCNV software detected a total of 1924 CNVs; among these, 386 were deletions (i.e. loss state) and 1538 were duplications (i.e. gain state), with a deletions/duplications CNV ratio of 0.25, calculated as the total number of losses divided by the total number of gains.

Table 1.

Descriptive statistics for Copy Number Variants (CNVs) and Copy Number Variants Regions (CNVRs) identified in the Mexican chicken population

Type	No.	Length	Min length	Max length	Mean length	Median length	Total Coverage
CNVs
Loss	386	12,575,609	92	574,231	32,579	6038	1.37%
Gain	1538	74,022,420	138	1,345,291	48,129	22,810	8.05%
All	1924	86,598,029	92	1,345,291	45,009	19,273	9.42%
CNVRs
Loss	226	3,920,955	92	279,420	17,349.36	4950	0.43%
Gain	959	38,550,088	138	1,345,291	40,198.21	15,414	4.19%
Complex	31	4,580,519	3501	607,435	147,758.7	60,250	0.50%
All	1216	47,051,562	92	1,345,291	38,693.72	13,897.5	5.12%
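The derived quantities for the CNV rows of Table 1 can be recomputed from the counts and total lengths. The short Python check below assumes that the mean length is the total length divided by the number of CNVs and that the deletions/duplications ratio is the number of losses divided by the number of gains, as stated in the text.

```python
# Counts and total lengths (bp) taken from the CNV rows of Table 1.
cnv = {"loss": (386, 12_575_609), "gain": (1538, 74_022_420)}

# Deletions/duplications ratio: number of losses over number of gains.
ratio = cnv["loss"][0] / cnv["gain"][0]
print(f"deletions/duplications ratio: {ratio:.2f}")

# Mean length per state, assumed to be total length / number of CNVs.
for state, (n, total_bp) in cnv.items():
    print(f"{state}: mean length = {total_bp / n:,.0f} bp")

# The "All" row is the sum of the two states.
n_all = sum(n for n, _ in cnv.values())
bp_all = sum(bp for _, bp in cnv.values())
print(f"all: {n_all} CNVs, {bp_all:,} bp, mean {bp_all / n_all:,.0f} bp")
```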


The CNVs overlapping among samples were summarized across all individuals into 1216 CNVRs (959 gains, 226 losses and 31 complex), covering a total of 47 Mb of sequence length, corresponding to 5.12% of 28 autosomes in the galGal4 assembly (Additional file 1: Sheet 1).

Fig. 1 shows the map of the CNVRs on each chromosome, divided into gain, loss and complex.

Table 2 reports the number of CNVRs found, together with their state and the proportion of coverage by chromosome. The coverage proportion is smaller than 10% for all chromosomes except chromosomes 16, 18, 24 and 27.

Table 2.

Number and proportion of genome covered (Coverage %) by Gain, Loss and Complex Copy Number Variants Regions per chromosome (CHR)

CHR	Gain (^a)	Loss (^a)	Complex (^a)	Total	Coverage (%)
1	186 (3.94)	46 (0.38)	6 (0.29)	238	4.61
2	140 (4.78)	31 (0.38)	2 (0.14)	173	5.29
3	101 (3.02)	18 (0.11)	0 (0)	119	3.13
4	58 (3.40)	20 (0.36)	0 (0)	78	3.75
5	58 (6.43)	8 (0.15)	0 (0)	66	6.58
6	41 (3.61)	9 (0.15)	1 (0.15)	51	3.91
7	36 (4.03)	2 (0.02)	1 (0.46)	39	4.51
8	32 (4.55)	1 (0.30)	1 (0.68)	34	5.53
9	25 (3.22)	8 (0.23)	0 (0)	33	3.45
10	32 (5.06)	9 (0.79)	2 (1.11)	43	6.96
11	17 (2.64)	7 (0.78)	1 (0.19)	25	3.61
12	26 (2.73)	4 (0.16)	0 (0)	30	2.89
13	30 (3.88)	8 (1.05)	1 (0.52)	39	5.45
14	32 (7.72)	7 (2.05)	1 (0.20)	40	9.97
15	18 (1.90)	3 (0.12)	1 (0.31)	22	2.33
16	0 (0)	0 (0)	1 (81.60)	1	81.60
17	8 (2.28)	5 (0.97)	0 (0)	13	3.26
18	12 (3.54)	7 (2.06)	2 (5.03)	21	10.63
19	22 (8.32)	4 (0.23)	1 (0.91)	27	9.46
20	17 (3.57)	3 (0.26)	2 (0.39)	22	4.22
21	9 (1.60)	5 (0.30)	0 (0)	14	1.90
22	8 (4.31)	2 (0.74)	1 (0.62)	11	5.67
23	9 (4.78)	5 (0.95)	1 (0.73)	15	6.46
24	12 (9.91)	2 (0.24)	0 (0)	14	10.14
25	3 (2.41)	3 (1.13)	2 (2.39)	8	6.48
26	6 (2.27)	5 (2.11)	1 (1.46)	12	5.84
27	11 (6.04)	4 (3.66)	1 (10.74)	16	20.45
28	10 (3.36)	0 (0)	2 (2.24)	12	5.61
Total	959	226	31	1216


^aCoverage of CNVR by chromosome and state (gain/loss/complex) relatively to each chromosome length

CNVRs were classified as singletons if detected in only one individual. Among the identified CNVRs, 1009 (82.9%) were singletons, 127 (10.4%) were present in two individuals, 30 (2.4%) in three individuals, 11 (0.9%) in four individuals, and 39 (3.2%) in five or more individuals. For every state (i.e. gain, loss, complex) CNVRs were divided according to their length into four classes: <1 kb, 1–10 kb, 10–100 kb, >100 kb; Fig. 2 reports the CNVR count for each length class.

Fig. 2 — Distribution of CNVRs lengths identified with PennCNV

The majority of the CNVRs identified in this study (1065) had a length between 1 kb and 100 kb, of which 471 were between 1 kb and 10 kb and 594 between 10 kb and 100 kb. A total of 39 CNVRs were shorter than 1 kb, while 112 CNVRs were longer than 100 kb (Fig. 3). The largest number of gain and complex CNVRs fell in the 10–100 kb class, while loss CNVRs were most frequent in the 1–10 kb class (Fig. 3).

The regions mapping in a large number of individuals were: the CNVR on chromosome 1 at 42.96–43.13 Mb, identified in 61 samples; the CNVR on chromosome 12 at 1.12–1.22 Mb, identified in 56 samples; the CNVR on chromosome 16 at 1253–533,589 bp, identified in 53 samples; the CNVR on chromosome 1 at 193.13–193.24 Mb, identified in 52 samples.

Fig. 3 shows the sample count for every CNVR state according to the 4 CNVR length classes defined above (as shown in Fig. 2). The sample count classes were defined as: 1 (singleton), 2–5, 6–20 and >20. The gain CNVRs (Fig. 3a) have a sample count distribution with most of the regions falling into the 10–100 kb class. The loss CNVRs (Fig. 3b) have a sample count distribution with most of the regions falling into the 1–10 kb class. Sample class 1 is made up mostly of gain regions and is the most frequent sample class in all length classes. The highest length and sample classes mainly belong to the gain regions. In the complex regions (Fig. 3c) the most represented length class is 10–100 kb. More precisely, the most frequent sample class is 2–5, falling mainly within the 10–100 kb length class. Lastly, all the sample classes are distributed mostly within the 1–10 kb and 10–100 kb length classes.
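The two-way classification by length class and sample-count class can be sketched in Python. The class labels follow the text; how values falling exactly on a boundary (for example 10 kb) were assigned is not stated, so the boundary handling below is an assumption. The example record uses the chromosome 16 CNVR reported above (1253–533,589 bp, 53 samples), which Table 2 lists as complex.

```python
from collections import Counter


def length_class(length_bp):
    """Length classes from the text: <1 kb, 1-10 kb, 10-100 kb, >100 kb."""
    if length_bp < 1_000:
        return "<1 kb"
    if length_bp < 10_000:
        return "1-10 kb"
    if length_bp <= 100_000:
        return "10-100 kb"
    return ">100 kb"


def sample_class(n_samples):
    """Sample-count classes from the text: 1 (singleton), 2-5, 6-20, >20."""
    if n_samples == 1:
        return "1"
    if n_samples <= 5:
        return "2-5"
    if n_samples <= 20:
        return "6-20"
    return ">20"


def cross_tabulate(cnvrs):
    """Count CNVRs per (state, length class, sample class).

    cnvrs: iterable of (state, length_bp, n_samples).
    """
    return Counter(
        (state, length_class(length), sample_class(n))
        for state, length, n in cnvrs
    )


# The chromosome 16 CNVR from the text, plus one toy singleton gain.
table = cross_tabulate([("complex", 533_589 - 1253, 53), ("gain", 22_810, 1)])
print(table)
```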