Abstract
Copy number variations (CNVs) have become widely acknowledged as a significant source of genomic variability and phenotypic variance. To understand the genetic variants in horses, CNVs from six Indian horse breeds, namely, Manipuri, Zanskari, Bhutia, Spiti, Kathiawari and Marwari, were discovered using the Axiom™ Equine Genotyping Array. These breeds differ in agro-climatic adaptation and have distinct phenotypic characters. A total of 2668 autosomal CNVs and 381 CNV regions (CNVRs) were identified with the PennCNV tool. DeepCNV was then used to re-validate these calls, yielding 883 autosomal CNVs, of which 9.06% were singletons. A total of 180 CNVRs were identified after DeepCNV filtering, with estimated lengths of 3.12 Kb–4.90 Mb. Functional analysis showed that the majority of the genes within CNVRs were enriched for sensory perception and olfactory receptor activity. An equine CNV database, EqCNVdb (http://backlin.cabgrid.res.in/eqcnvdb/), was developed, which catalogues detailed information on the horse CNVs, CNVRs and gene content within CNVRs. In addition, three random CNVRs were validated with real-time polymerase chain reaction. These findings will aid in understanding the horse genome and serve as a preliminary foundation for future CNV association research with commercially significant equine traits. The identification of CNVs and CNVRs should lead to better insights into the genetic basis of important traits.
Keywords: Copy number variation, copy number variation regions, equine, genome, web resource
Introduction
Among the various genetic diversities comprising of single nucleotide polymorphisms (SNPs), InDels (insertion and deletion), short tandem repeat polymorphisms, retro transposable element insertions, inversion variants, copy number variants (CNVs), etc., CNVs have a dynamic potential evolutionary significance.1 CNVs are the DNA segments of at least one kilobase (kb) in size, exhibiting varying copy numbers when compared to a reference genome.2 Such structural variants account for greater differences among individuals as compared to SNPs.2 Earlier, SNPs were believed to be the main type of genomic variation and responsible for a large portion of the typical phenotypic variance. But, over the recent years, researchers have discovered numerous novel alterations in repetitive sections of DNA. This finding has led to the understanding that CNVs hold comparable significance in genomic diversity to SNPs.
CNVs are reported to have a higher de novo locus-specific mutation rate than SNPs.3 Gene-containing CNVs are reported to indirectly influence mRNA and protein expression levels, hence have the potential to affect downstream phenotypes.4–7 Since CNVs are spread genome-wide, contributing to phenotypic variation, their identification would lead to better insights into the genetic basis of important traits. Although DNA copy number variations have long been linked to particular chromosomal rearrangements and genomic diseases, their prevalence in mammalian genomes has recently gained popularity.8
CNVs are important and major causes of disease and phenotypic diversity. The two major studies in 2004 revealed CNVs to be a significant source of genetic variation widespread in the human genome.9,10 Besides in humans, numerous studies have reported CNVs including duplications and deletions as a significant source of genomic variation in animals and plants.11–18 Besides CNVs being widely associated with complex traits/phenotypic variability in humans2,9 and mice,19 there are reports of few studies on CNV associations with phenotypic traits in chicken,20 cattle,21 dogs15 and other domestic species too,22 hence establishing these to be a common feature of vertebrates. Various studies on horses describe the identification of CNVs in the entire genome23–26 or in gene exons.23,24
CNVs have also been linked to equine phenotypic features,23,24,26 adaptations27 and illnesses25 in the literature. Even though these studies laid the groundwork for understanding the function of CNVs in equine biology, the available knowledge still needs further exploration to identify true variants effectively. The reasons are that various CNV discovery platforms have been used, a limited number of breeds and individuals have been included in the studies, and the bulk of reported CNVs are study-specific. CNVs have frequently been observed within, or flanked by, large homologous repeats or segmental duplications.9,28–30 Since SNP data are plentiful and available at high spatial resolution across the genome, they can be used to find underlying CNVs, provided that those CNVs affect the results of SNP genotyping assays.
CNVs, like SNPs, can be used to find connections with hereditary illnesses and other complicated features. Identification of CNVs in horses in different studies since 2013 supports their links with a certain trait, disease or gene expression.31,25–27,32–34 Several mammalian gene super-families are known to underpin evolutionary changes caused by CNVs, and association of increased genes with CNV could be a potential tool for adaptation.25 As a result, the majority of the diversity in mammalian genomes explained by CNVs is known to occur in areas that affect key biological processes such as sensory perception, signal transduction, immunology, pathogen defence or metabolic pathways.23,24,31,35,36 The majority of genome-wide association (GWA) research in horses used SNPs, and there has been minimal identification of potential connections between certain behaviours with CNVs.25,27,33,34,37 CNV-based GWA research could help identify connections between trait(s) of interest and extra genetic variation that is otherwise undetected in SNP-based GWA studies.23,24,31 Furthermore, genomic areas linked to phenotypic variation in CNV-based GWA investigations could indicate more complicated structures behind phenotypic variance.
Currently, the publicly available Genome Variation Map (http://bigd.big.ac.cn/gvm/home) catalogues 25756212 SNPs and 3663455 insertion/deletions (INDELs) of the horse genome. Till now, CNVs have been found in over 45 distinct horse breeds, occupying roughly 1–3% of their genome, and having a greater number of CNVs in genes (80%) than in the intergenic regions (20%).23–25,32,37 In comparison to a reference, for the Thoroughbred genome, the typical CNV size in horses remains between 1 kb and 4.84 Mb, with CNV losses generally outnumbering gains.37 Although large sample datasets have been used to find significant relationships with specific features such as body size or recurrent laryngeal neuropathy in horses, further CNV information on this species will add up to its resource enrichment.25,26
In the present study, we aimed to expand the fairly scant information on CNV-based genetic variation in horses by identifying CNVs across the genome in a population of 96 individuals representing six horse breeds (Manipuri, Zanskari, Bhutia, Spiti, Kathiawari and Marwari). The Kathiawari, a desert war horse, originated from the Kathiawar Peninsula in Gujarat, western India, while the Marwari is a hardy riding horse of the draught type, found in the Marwar (or Jodhpur) region of Rajasthan (Supplementary Fig. 1). Marwari and Kathiawari horses can rotate their ear tips by 180°. The Bhutia is a small mountain horse native to the regions of Sikkim and Darjeeling. The Manipuri from Assam and Manipur in North Eastern India, the Spiti from Himachal Pradesh in Northern India, and the Zanskari, native to Ladakh in Northern India, are also small horses or ponies. These pony breeds have acquired certain unique traits such as stamina, sturdiness, endurance, adaptability to harsh climatic conditions, speed, surefootedness, load-carrying capacity on steep and narrow hilly terrains and disease resistance. In addition, gene content and functional analyses were performed, and a horse CNV database was developed as a rich resource of variation for horse researchers.
Materials and methods
Characteristics of samples
Blood samples were collected from the jugular vein into a vacutainer containing potassium ethylene diamine tetra acetic acid (K2 EDTA), from six different types of Indian horse breeds namely, Manipuri (17), Zanskari (20), Bhutia (20), Spiti (16), Kathiawari (14) and Marwari (9) covering 96 genotypes selected randomly. The small mountain type horses, namely, Bhutia (Sikkim and Darjeeling, India), Spiti (Himachal Pradesh, India), Manipuri (Assam/Manipur, India), Zanskari (Ladakh, India) and draught type horses like Kathiawari (Gujarat, India) and Marwari (Rajasthan, India) breeds differed due to their agroclimatic adaptation along with their particular performance characteristics. All blood samples were transported to the research lab in an ice box and stored at 4 °C and processed for DNA isolation.
DNA extraction and genotyping
Genomic DNA was isolated from blood using the ReliaPrep™ Blood gDNA Miniprep System (Promega, Madison, WI) according to the manufacturer’s instructions. To prevent DNA degradation, the isolated DNA was stored at −20 °C. Agarose gel electrophoresis (0.8% agarose gel) and a Qubit 4 fluorometer were used to analyze the DNA (at 260 nm/280 nm absorbance) qualitatively and quantitatively. The samples were genotyped with the Axiom™ Equine Genotyping Array, with an average call rate per sample of 92.6%, where ‘call rate’ refers to the proportion of successfully determined genotypes for a specific set of genetic markers/variants in a sample. It is an indicator of the accuracy and reliability of the genotyping process. The Axiom Equine Genotyping Array has wide utility for research in equine genetics and can be used for genotyping 20 different breeds. It features 670,796 markers selected through screening of 2 million markers for optimal genomic coverage of known genetic diversity among domestic horse breeds.
Data pre-processing
The raw data in CEL file format obtained from the SNP arrays were used to generate raw signal intensity values, genotype calls and confidence scores using Analysis Power Tools (APT) Release 2.11.4, which is used for Affymetrix/Axiom arrays (http://media.affymetrix.com/support/developer/powertools/changelog/index.html). Furthermore, the genotype calls and confidence scores were used to generate a canonical genotype cluster file, which was used to extract Log R Ratio (LRR) and B Allele Frequency (BAF) values for each SNP using the PennCNV-affy program.38 The LRR captures the overall fluorescent intensity signal from both sets of probes/alleles at each SNP, whereas the BAF is the relative ratio of the fluorescent signals between the two probes/alleles at each SNP.
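The two signal measures can be illustrated with a minimal Python sketch. It assumes the commonly used definitions, in which LRR is the base-2 logarithm of observed over expected total intensity and BAF is a linear interpolation of the allelic-intensity angle between the canonical AA, AB and BB cluster centres. The function names and arguments are hypothetical and this is not the PennCNV-affy code itself.

```python
import math


def log_r_ratio(r_observed: float, r_expected: float) -> float:
    """LRR: log2 of observed total intensity over the expected intensity."""
    return math.log2(r_observed / r_expected)


def b_allele_frequency(theta: float, theta_aa: float,
                       theta_ab: float, theta_bb: float) -> float:
    """BAF: interpolate theta between the canonical genotype cluster centres.

    Assumes theta_aa < theta_ab < theta_bb (cluster centres taken from a
    canonical genotype cluster file).
    """
    if theta <= theta_aa:
        return 0.0
    if theta >= theta_bb:
        return 1.0
    if theta < theta_ab:
        # Between the AA and AB clusters: BAF runs from 0 to 0.5.
        return 0.5 * (theta - theta_aa) / (theta_ab - theta_aa)
    # Between the AB and BB clusters: BAF runs from 0.5 to 1.
    return 0.5 + 0.5 * (theta - theta_ab) / (theta_bb - theta_ab)
```

Under these definitions, a SNP with normal copy number has an LRR near 0 and a BAF near 0, 0.5 or 1, which is the pattern later inspected in the LRR and BAF scatter plots.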
Normalization and quality control
For the quality control of data, all the SNPs (670K) were filtered and SNPs with genotyping error or no calls were recorded in the ‘.calls file’ which is the output file of APT software consisting of the genotype call for each SNP. The non-somatic SNPs were also filtered out to exclude them from downstream analysis.
Genome-wide identification of CNVs
For genome-wide identification of CNVs from the data of 96 samples representing six horse breeds, the PennCNV and DeepCNV algorithms were used. PennCNV was used to detect CNVs from SNP genotyping array data, where each SNP has two alleles, that is, the A and B alleles. For CNV calling, PennCNV takes LRR and BAF values, the population frequency of the B allele, the genome coordinates of each SNP, and an appropriately trained hidden Markov model (HMM). The population frequency of the B allele was calculated using the BAF value of each marker contained within the signal intensity files for each sample.38 The horse GC model file was generated using a Perl script which calculated the GC content in a 1 Mb region around each SNP position (500 kb on either side). This file is passed to PennCNV through the -gcmodel argument and is useful for salvaging samples affected by genomic waves.39 Whole-genome microarrays with large-insert clones designed to determine DNA copy number often show variation in hybridization intensity that is related to the genomic position of the clones. This variation, termed genomic waves, is also present in SNP genotyping arrays. PennCNV was executed with default values for the 31 chromosomes of the horse using the argument ‘-lastchr 31’, with a minimum CNV length of 3000 bp and a CNV call confidence score threshold of 10. Appropriate LRR adjustments based on the GC model were incorporated using the GC model file argument. CNVs were inferred using the HMM in PennCNV.
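The GC content calculation behind the GC model file can be sketched as follows. This is an illustrative Python version, not the Perl script used in the study. The function name is hypothetical, and two details are assumptions: the window is clipped at chromosome ends, and ambiguous bases (N) are excluded from the denominator.

```python
def gc_content_window(sequence: str, snp_pos: int, flank: int = 500_000) -> float:
    """GC percentage in the window of `flank` bases on either side of a SNP.

    `snp_pos` is a 1-based position on the chromosome sequence.
    """
    start = max(0, snp_pos - 1 - flank)       # 0-based, inclusive
    end = min(len(sequence), snp_pos + flank)  # 0-based, exclusive
    window = sequence[start:end].upper()
    gc = window.count("G") + window.count("C")
    at = window.count("A") + window.count("T")
    called = gc + at
    # No called bases in the window: GC content is undefined.
    return 100.0 * gc / called if called else float("nan")
```

With the default `flank` of 500 kb, the window spans about 1 Mb centred on the SNP, matching the region size described above.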
Quality filters were applied after the detection of CNVs. High-quality samples with a standard deviation of LRR < 0.30, with the BAF drift set as 0.01 and with a waviness factor value between 0.05 and −0.05 were used. We considered a CNV as a potential CNV if it consisted of three or more consecutive SNPs. The union region of overlapping CNVs in at least two animals/samples was considered a CNV region (CNVR).2 CNVRs were identified using Bedtools in this study. The fourth column of the input CNV file (in text format) was specified and the collapse operation was applied to it ($ bedtools merge -i cnv.txt -c 4 -o collapse). The EquCab3.0 version (https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_002863925.1/) of the equine reference genome was used. The Perl script used to identify CNVs is available at https://github.com/CAG-CNV/DeepCNV. To construct the CNVR map, CNVRs were categorized into three groups: ‘Loss’ (CNVRs with deletions), ‘Gain’ (CNVRs with duplications) and ‘Both’ (CNVRs containing both deletions and duplications).40 Figure 1 represents the overall flow diagram of the methodology adopted for the identification of putative CNV calls.
Figure 1.
Flow diagram of the methodology for identification of putative CNV calls.
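The merging of overlapping CNV calls into CNVRs, and their labelling as ‘Loss’, ‘Gain’ or ‘Both’, can be sketched in Python as below. This is an illustration of the idea rather than the Bedtools implementation. The function name and the record layout are hypothetical, coordinates are treated as closed intervals, and a copy number of 2 is assumed to be the normal state.

```python
from collections import defaultdict


def build_cnvrs(cnvs):
    """Merge overlapping CNV calls into regions.

    `cnvs` is an iterable of (chrom, start, end, copy_number, sample_id).
    Returns a list of dicts with chrom, start, end, samples and type.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end, copy_number, sample in cnvs:
        by_chrom[chrom].append((start, end, copy_number, sample))

    regions = []
    for chrom, calls in by_chrom.items():
        calls.sort()  # sort by start coordinate
        current = None
        for start, end, copy_number, sample in calls:
            if current is not None and start <= current["end"]:
                # Overlaps the open region: extend it.
                current["end"] = max(current["end"], end)
            else:
                current = {"chrom": chrom, "start": start, "end": end,
                           "samples": set(), "states": set()}
                regions.append(current)
            current["samples"].add(sample)
            # Assumption: copy number below 2 is a loss, above 2 a gain.
            current["states"].add("Loss" if copy_number < 2 else "Gain")

    for region in regions:
        states = region.pop("states")
        region["type"] = "Both" if len(states) == 2 else states.pop()
    return regions
```

The size of each region's `samples` set gives the number of animals carrying a CNV there, which is the quantity needed to apply the two-sample criterion or to compute CNVR frequencies.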
Construction of circular copy number variation map
For the construction of a circular copy number variation map, a karyotype for the horse genome (EquCab3.0) was used. The karyotype describes the chromosomes’ names, sizes and colours, along with the cytogenetic band pattern. The cytoBandIdeo table (UCSC Genome Browser) provides karyotypes for several genomes; otherwise, a karyotype can be drawn manually. We did the same for the horse genome. After the ideograms were chosen for depiction, the map was further processed by adding highlights, 2D tracks and links.41
Gene content and functional analysis
Gene content of CNVRs was identified using the horse genome assembly (EquCab3.0) with the help of the Ensembl BioMart tool.42,43 Genes that overlapped a CNVR by one base pair or more were included in the gene content of this study. To investigate the functional annotation of CNVRs, gene ontology (GO) analysis was performed using the PANTHER classification system.44 The list of genes in CNVRs was submitted as the analysis list and the whole-genome gene list as the reference list for GO analysis. The PANTHER classification system was used to assess the probability of overrepresentation of genes in CNVRs within the biological process, molecular function and cellular component categories.45 Bonferroni correction was used for multiple comparisons.
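The one-base-pair overlap rule and the Bonferroni correction can be illustrated with a short Python sketch. It is not the BioMart or PANTHER implementation. The function names and tuple layouts are hypothetical, and coordinates are assumed to be closed intervals.

```python
def overlaps(gene_start, gene_end, cnvr_start, cnvr_end):
    """True if two closed intervals share at least one base pair."""
    return gene_start <= cnvr_end and cnvr_start <= gene_end


def genes_in_cnvrs(genes, cnvrs):
    """Return the names of genes overlapping any CNVR on the same chromosome.

    `genes` holds (name, chrom, start, end); `cnvrs` holds (chrom, start, end).
    """
    hits = set()
    for name, g_chrom, g_start, g_end in genes:
        for c_chrom, c_start, c_end in cnvrs:
            if g_chrom == c_chrom and overlaps(g_start, g_end, c_start, c_end):
                hits.add(name)
                break
    return hits


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

A gene whose last base coincides with the first base of a CNVR counts as overlapping under this rule, which is the most inclusive reading of "one base pair or more".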
Image data and metadata for DeepCNV
We used a Perl script to automatically create image files for the CNV calls so that each call could be inspected visually to determine whether it was a true positive. The script generates one LRR scatter plot image and one BAF scatter plot image for each CNV call. The plots show a potential CNV segment and its flanking regions. For each SNP genotyped in the region (candidate CNV plus its extended regions), a point is drawn on the LRR plot with the chromosome position as the X-coordinate and the LRR as the Y-coordinate. The BAF plot uses BAF as the Y-coordinate and covers the same region in the same way. In both plots, the SNPs (points) within the potential CNV are shown in red and the SNPs in the neighbourhood in blue. An expert in CNV calling can visually classify these calls as either false positives or true positives. Example images are shown in Fig. 2. These images served as the model input. The images have pixel values between 0 and 255. After a conventional image scaling operation, the pixel values were normalized by dividing each by 255, giving values between 0 and 1 for the deep learning model.
Figure 2.
Two illustrations of the CNV imaging data. The LRR scatter plot is at the top, followed by the BAF scatter plot. Both plots are drawn for the same SNP positions. The right panels display a false positive call made by PennCNV (a sample lacking the CNV), in which the BAF dots show the three typical genotype clusters, AA, AB and BB, and the LRR dots cluster around the zero-reference line. The left panels display a true positive call (a sample with the CNV), in which the red LRR dots lie above the zero-reference line and the red BAF dots form four genotype clusters, BBB, ABB, AAB and AAA. In both plots, the SNPs (points) in the candidate CNV are indicated in red and the SNPs in the surrounding region in blue.
For quality control, PennCNV also generates summary data. As indicated in Table 1, the summary statistics comprise 13 features, which constitute the metadata. Each feature underwent Z-score normalization so that all features have a comparable range. The metadata are used as input by both DeepCNV and the other machine learning techniques.
Table 1.
Summary of the metadata.
| Index | Feature name | Data type | Data range | Description |
|---|---|---|---|---|
| 1 | Call rate | Float | [0.9790, 0.9968] | Simple SNP genotyping call rate |
| 2 | Length | Float | [3001, 4901305] | The CNV length |
| 3 | CN | Integer | [0, 4] | The number of CN |
| 4 | Number of SNPs | Integer | [4, 1093] | The number of independent SNP markers |
| 5 | PennCNV Confidence | Float | [10.086, 1837.806] | The call confidence of PennCNV |
| 6 | Number of CNVs in sample | Integer | [11, 51] | Inflated numbers of false positive CNV calls |
| 7 | LRR_mean | Float | [−0.0202, 0.0181] | The mean of LRR |
| 8 | LRR_SD | Float | [0.1255, 0.297] | The SD of LRR |
| 9 | BAF_mean | Float | [0.4973, 0.5041] | The mean of BAF |
| 10 | BAF_SD | Float | [0.0481, 0.0905] | The SD of BAF |
| 11 | BAF_DRIFT | Float | [0.001219, 0.017858] | BAF drift |
| 12 | WF | Float | [−0.0272, 0.0181] | Waviness factor |
| 13 | GCWF | Float | [−0.0069, 0.0059] | The GC base content adjusted waviness factor |
PennCNV log file metadata gives insight into the relative data quality of each sample to condition image review.
WF, waviness factor; GCWF, guanine cytosine waviness factor.
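The Z-score normalization applied to each metadata feature can be sketched as follows. This is a minimal Python illustration. The function name is hypothetical, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption, since the text does not state which was used.

```python
from statistics import mean, pstdev


def z_score(values):
    """Rescale one feature column to zero mean and unit standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]
```

Applying this separately to each of the 13 features puts quantities with very different ranges, such as CNV length and BAF drift, on a comparable scale.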
Development of Equine CNVs database
The Equine CNVs database (EqCNVdb) is a web genomic resource that catalogues the CNVs identified in six horse breeds, namely, Manipuri, Zanskari, Bhutia, Spiti, Kathiawari and Marwari. The CNVRs and the gene content of CNVRs are also included in the database. It is based on the ‘three-tier architecture’ of a database system, consisting of the client tier, the middle tier and the database tier. For database browsing and query definition, HTML and JavaScript were used to create the client tier. The middle tier was created using PHP, which handles database connectivity, query execution and data retrieval from the database. The information on the CNVs, CNVRs and gene content within CNVRs is stored in the database tier, which was created using MySQL (Fig. 3).
Figure 3.
Three-tier architecture of horse CNV database (EqCNVdb).
Real-time polymerase chain reaction
Real-time polymerase chain reaction (RT-PCR) was performed on three DNA samples taken randomly from each horse breed, and the GAPDH gene was used as a control to determine the fold change in copy number. GAPDH has also been used to determine the copy number of CNVs in animals46 and plants.47 In equines, GAPDH has been reported to be a stable gene for qPCR.48,49 In addition, horse GAPDH gene primers have been used for CNV estimation by qPCR,23,24 which also indicates that this gene is a suitable reference gene. The DNA samples were diluted 25-fold for RT-PCR. A reaction volume of 10 µl was made using 5 µl of 2x Maxima SYBR Green/ROX qPCR master mix (Thermo scientific, Vilnius, Lithuania), 2 µl of DNA, 1 µl each of forward and reverse primers and 1 µl water. The reaction was set up in a 0.2 ml MicroAmp optical 96 well reaction plate (Applied Biosystems, Waltham, MA) and the PCR was performed in a QuantStudio 3 Real-Time PCR system (Applied Biosystems) under the following conditions: initial denaturation at 50 °C for 2 min and 95 °C for 10 min; 35 cycles of denaturation at 95 °C for 30 s and annealing at 60 °C for 1 min. The melt curve stage was set at 95 °C for 15 s, 60 °C for 1 min and 95 °C for 15 s to predict the number of PCR products. The fold changes were determined using the standard ΔΔCt method. The PCR Ct (cycle threshold) value is the number of cycles needed to replicate enough DNA/RNA to be detected. An average Ct value was obtained from the Ct values of the duplicates. The ΔCt value was obtained by subtracting the reference gene’s average Ct value from the average Ct value of the target gene. The symbol Δ (delta) denotes the difference between two numbers. The ΔΔCt was the average ΔCt value of one breed minus the ΔCt value of other breeds. For the CNVRs to be validated, a value of 2^(−ΔΔCt) was calculated for each individual from five breeds, keeping the Marwari breed as the reference. Based on this value, each CNVR was classified as normal (value near 1), a gain (value >1) or a deletion (value <1).
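The ΔΔCt calculation described above can be sketched in Python as follows. The function names are hypothetical. The `tolerance` used to decide what counts as "near 1" is an illustrative assumption, since no numerical cut-off is given in the text.

```python
from statistics import mean


def delta_ct(target_ct_replicates, reference_ct_replicates):
    """ΔCt: mean Ct of the target region minus mean Ct of the reference gene."""
    return mean(target_ct_replicates) - mean(reference_ct_replicates)


def relative_copy_number(sample_delta_ct, calibrator_delta_ct):
    """2^(−ΔΔCt), where ΔΔCt = ΔCt(sample) − ΔCt(calibrator)."""
    delta_delta_ct = sample_delta_ct - calibrator_delta_ct
    return 2.0 ** (-delta_delta_ct)


def classify(fold_change, tolerance=0.25):
    """Label a fold change as normal, gain or deletion.

    `tolerance` is a hypothetical cut-off for "near 1".
    """
    if abs(fold_change - 1.0) <= tolerance:
        return "normal"
    return "gain" if fold_change > 1.0 else "deletion"
```

For example, a sample whose ΔCt is one cycle lower than that of the calibrator gives a value of 2, consistent with a gain, while a ΔCt one cycle higher gives 0.5, consistent with a deletion.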
Results
Genome-wide identification of CNVs using PennCNV
A total of 2668 autosomal CNVs were identified in 96 horse samples, of which 136 (5.1%) were singletons, that is, found in only a single sample (Supplementary file 1). The remaining 2532 (94.9%) CNVs overlapped with a CNV in another sample by at least one base pair. The mean and median CNV sizes were 110.32 Kb and 49.91 Kb, respectively (Table 2). Approximately 36% of CNVs ranged from 10 to 50 Kb, 22% from 50 to 100 Kb and 25% from 100 to 500 Kb. CNVs smaller than 3 Kb were not considered in this analysis. Figure 4(A) shows the size distribution of CNVs in the 96 samples covering six horse breeds.
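The singleton definition used here (a CNV with no overlap of at least one base pair with a CNV in any other sample) can be expressed as a short Python sketch. The function name and record layout are hypothetical, coordinates are treated as closed intervals, and the pairwise comparison is written for clarity rather than speed.

```python
def count_singletons(cnvs):
    """Count CNVs that overlap no CNV from a different sample.

    `cnvs` is a list of (chrom, start, end, sample_id).
    """
    singletons = 0
    for chrom, start, end, sample in cnvs:
        shared = any(
            o_sample != sample
            and o_chrom == chrom
            and o_start <= end
            and start <= o_end
            for o_chrom, o_start, o_end, o_sample in cnvs
        )
        if not shared:
            singletons += 1
    return singletons
```

Dividing the returned count by the total number of CNVs gives the singleton proportion, for instance 136/2668 ≈ 5.1% for the PennCNV calls.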
Table 2.
Genomic characteristics of CNV in horse from PennCNV.
| | Total | Total length (Mb) | Average size (kb) | Median size (kb) | Total loss | Total gain | Both | Frequency > 5% | Frequency (3–5%) | Unique |
|---|---|---|---|---|---|---|---|---|---|---|
| CNVR | 381 | 48.82 | 128.13 | 42.68 | 200 | 133 | 48 | 134 | 111 | 136 |
| CNV | 2668 | 294.34 | 110.32 | 49.91 | 1584 | 1084 | | | | |
Figure 4.
(A) The size distribution of CNVs identified using PennCNV; (B) size distribution of CNVRs identified using PennCNV.
Characteristics of identified CNVRs after using PennCNV
A total of 381 CNVRs were identified in this study, consisting of overlapping CNVs as well as singleton CNVs (Supplementary file 2). The estimated length of CNVRs ranged from 3 Kb to 4.9 Mb with an average size of 128.12 Kb. These CNVRs covered approximately 49 Mb of the horse genome assembly.50 The coverage of CNVRs was around 2.15% (49 Mb/2280.94 Mb) of the equine autosomal genome sequence and 1.98% (49 Mb/2474.93 Mb) of the equine whole genome (Supplementary file 3).
Out of a total of 381 CNVRs, 133 were gain, 200 were loss and 48 were gain plus loss (both) type events. In ‘both’ type events, gains and losses occurred within the same CNVR. The largest share of CNVRs, 129 out of 381 (33.85%), ranged from 10 Kb to 50 Kb. In the ranges ≤10 Kb, >50 Kb to ≤100 Kb, and >100 Kb to ≤500 Kb, there were 76, 66 and 96 CNVR events, respectively (Fig. 4B). Out of 381 CNVRs, 65% had a frequency greater than 2% in the 96 horse samples. More specifically, 134 CNVRs had a frequency greater than 5%, and 111 CNVRs had a frequency in the range of 3% to 5% (Table 2). For example, the CNVR on chromosome 12 at position 12321830-14937463 had the highest frequency (75%) and was a ‘both’ (loss plus gain) type event.
Construction of circular copy number variation map of PennCNV results
The PennCNV identified CNVRs’ circular distribution map on horse autosomes based on the Axiom™ Equine Genotyping Array was constructed using the Circos tool. The map indicated that CNVRs were distributed nonrandomly among the chromosomes and the proportion of chromosomes encompassed by CNVRs varies from 0.29% to 8.74% for chromosomes 11 and 12, respectively. The highest CNVR percentages were observed on chromosome 12 (8.74%), followed by chromosome 23, chromosome 2, chromosome 29 and chromosome 30, which showed 5.63%, 5.15%, 4.56% and 4.19% of CNVRs, respectively. This might be due to bias derived from the analysis of the Axiom™ Equine Genotyping Array. Chromosome 1 had the highest number of CNVRs whereas chromosome 27 had the smallest number of CNVRs and chromosome 2 had the largest length of CNVRs whereas chromosome 11 had the shortest length of CNVRs. The most enriched chromosomes for CNVRs in horse were chromosomes 1 and 2. Figure 5 shows the circular map of CNVRs distributed across the genome over 31 chromosomes of horse (based on EquCab 3.0 assembly).
Figure 5.
Circular map of CNVRs identified using PennCNV distributed across 31 chromosomes of horse. The outer band shows chromosomes along with their size. The middle band shows CNVRs on chromosomes. The inner band shows the frequency of samples/CNVs in a CNVR.
Gene content and functional analysis of CNVRs identified using PennCNV
Out of the total 381 CNVRs, 172 (45%) contained 541 unique horse genes, each gene overlapping a CNVR by at least one base pair. Among these 541 unique genes, 373 were protein-coding genes, 21 were pseudogenes, 77 were lncRNA, 9 miRNA, 9 snoRNA, 2 snRNA, 4 TR_V_genes, 2 TR_J_genes and 1 IG_V_gene (Supplementary file 4). These 541 genes spanned by CNVRs were loaded into the PANTHER tool. GO analysis indicated that the most overrepresented molecular functions in the PANTHER classification system were olfactory receptor (OR) activity, transmembrane signalling receptor activity, molecular transducer activity and signalling receptor activity, and that the most overrepresented biological processes were sensory perception, nervous system process and sensory perception of chemical stimulus (Table 3). Many CNVRs contained more than one gene, which suggests that these regions may influence the function of multiple genes.
Table 3.
Gene ontology (GO) categories significantly overrepresented in CNVRs identified using PennCNV.
| GO terms | GO categories | Gene number in CNVR | Expected gene number | p |
|---|---|---|---|---|
| Biological process | sensory perception of chemical stimulus (GO:0007606) | 18 | 3.24 | 3.32E-05 |
| | sensory perception (GO:0007600) | 18 | 3.95 | 5.11E-04 |
| | nervous system process (GO:0050877) | 19 | 5.55 | 1.27E-02 |
| | organic substance metabolic process (GO:0071704) | 72 | 110.24 | 2.27E-02 |
| | primary metabolic process (GO:0044238) | 67 | 104.07 | 2.66E-02 |
| | metabolic process (GO:0008152) | 74 | 115.15 | 6.09E-03 |
| | nitrogen compound metabolic process (GO:0006807) | 62 | 98.99 | 1.88E-02 |
| | cellular metabolic process (GO:0044237) | 66 | 106.85 | 3.54E-03 |
| Cellular component | cellular anatomical entity (GO:0110165) | 169 | 208.88 | 3.89E-02 |
| | cellular_component (GO:0005575) | 171 | 212.63 | 2.00E-02 |
| | organelle (GO:0043226) | 87 | 126.5 | 7.52E-03 |
| | intracellular (GO:0005622) | 105 | 154.64 | 1.20E-04 |
| | intracellular organelle (GO:0043229) | 84 | 123.82 | 5.25E-03 |
| | intracellular membrane-bounded organelle (GO:0043231) | 70 | 108.37 | 4.87E-03 |
| | membrane-bounded organelle (GO:0043227) | 71 | 110.16 | 3.23E-03 |
| Molecular function | olfactory receptor activity (GO:0004984) | 51 | 8.07 | 2.13E-21 |
| | transmembrane signaling receptor activity (GO:0004888) | 60 | 17.99 | 7.45E-13 |
| | molecular transducer activity (GO:0060089) | 60 | 19.86 | 4.63E-11 |
| | signaling receptor activity (GO:0038023) | 60 | 19.86 | 4.63E-11 |
Genome-wide identification of CNVs after using DeepCNV
The above results of CNV and CNVR identification using PennCNV were re-validated with DeepCNV, which is based on a deep neural network that considers both the image data and the metadata, leading to high accuracy. The CNVs and CNVRs retained by both tools were identified for the given dataset of 96 horse samples covering six breeds. A total of 883 autosomal CNVs were retained as putative CNVs by the deep learning-based DeepCNV tool, of which 80 CNVs (9.06%) were singletons, that is, found in only a single sample, and the remaining 803 CNVs (90.94%) overlapped with a CNV in another sample by at least one base pair (Supplementary file 5). An average of nine CNVs was identified per sample. The mean and median CNV sizes were 206.24 Kb and 110.34 Kb, respectively (Table 4). Approximately 20.04% of CNVs ranged from 10 to 50 Kb, 21.18% from 50 to 100 Kb and 42.47% from 100 to 500 Kb (Fig. 6A). Smaller CNVs (<3 Kb) were excluded from this study. Out of the total 883 CNVs, 155 belonged to the Manipuri breed, 169 to the Zanskari breed, 228 to the Bhutia breed, 126 to the Spiti breed, 71 to the Marwari breed and 134 to the Kathiawari breed.
Table 4.
Genomic characteristics of CNV in horse from DeepCNV.
| | Total number | Total length (Mb) | Average size (kb) | Median size (kb) | Total loss | Total gain | Both | Frequency > 5% | 5% > Frequency > 3% | Unique |
|---|---|---|---|---|---|---|---|---|---|---|
| CNVR | 180 | 38.045 | 211.363 | 89.084 | 55 | 118 | 7 | 47 | 53 | 80 |
| CNV | 883 | 182.109 | 206.239 | 110.348 | 299 | 584 | | | | |
Figure 6.
(A) The size distribution of CNVs identified after using DeepCNV; (B) the size distribution of CNVRs identified after using DeepCNV.
Characteristics of identified CNVRs after using DeepCNV
A total of 180 CNVRs were identified after DeepCNV filtering, consisting of overlapping CNVs as well as singleton CNVs (Supplementary file 6). The estimated length of CNVRs ranged from 3.12 Kb to 4.90 Mb with an average size of 211.363 Kb. These CNVRs cover approximately 38.04 Mb of the equine genome assembly. The coverage of CNVRs was around 1.66% (38.04 Mb/2280.94 Mb) of the horse autosomal genome sequence and 1.53% (38.04 Mb/2474.93 Mb) of the equine whole genome (Supplementary file 7). Out of the total of 180 CNVRs, 118 (65.55%) were gain, 55 (30.55%) were loss and 7 (3.88%) were gain plus loss (both) type events. In ‘both’ type events, gains and losses occurred within the same CNVR. The largest share of CNVRs, 72 out of 180, ranged from 100 Kb to 500 Kb. There were 21, 43 and 31 CNVR events in the ranges ≤10 Kb, >10 Kb to ≤50 Kb, and >50 Kb to ≤100 Kb, respectively (Fig. 6B). Out of 180 CNVRs, 55.55% were present at a frequency > 2% in the 96 horse samples. CNVRs present at a frequency > 2% were considered copy number polymorphisms (CNPs), which may play a role in diseases and phenotypes of horse breeds. More specifically, 47 CNVRs had a frequency of >5%, while 53 CNVRs had a frequency in the range of 3–5% (Table 4).
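The frequency threshold for CNPs can be made concrete with a small Python sketch. The function names are hypothetical, and the frequency is assumed to be the number of samples carrying a CNV in the region divided by the 96 samples genotyped.

```python
def cnvr_frequency(n_carriers: int, n_samples: int = 96) -> float:
    """Percentage of samples carrying a CNV within the region."""
    return 100.0 * n_carriers / n_samples


def is_cnp(n_carriers: int, n_samples: int = 96, threshold: float = 2.0) -> bool:
    """True if the CNVR frequency exceeds the CNP threshold (in percent)."""
    return cnvr_frequency(n_carriers, n_samples) > threshold
```

Under this assumption, a CNVR seen in one of 96 samples has a frequency of about 1.04% and is not a CNP, whereas a CNVR seen in two samples (about 2.08%) already exceeds the 2% threshold.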
Construction of circular copy number variation map of DeepCNV results
Using data from the Axiom™ Equine Genotyping Array, we constructed a circular distribution map of the DeepCNV-identified CNVRs on horse autosomes. The map revealed a nonrandom distribution of CNVRs across chromosomes, with the proportion of each chromosome covered by CNVRs varying from 0.096% on chromosome 11 to 7.83% on chromosome 12. Notably, the top five CNVR percentages were observed on chromosomes 12 (7.83%), 23 (4.85%), 2 (4.55%), 29 (3.86%) and 20 (3.32%). This variation may be attributed to potential bias arising from the analysis of the Affymetrix genotyping chip. Chromosome 20 exhibited the highest CNVR count, while chromosome 24 had the fewest, and chromosome 2 displayed the greatest CNVR length, in contrast to chromosome 24, which had the shortest length. The chromosomes most enriched with CNVRs in horse were chromosomes 1 and 2 (Fig. 7).
Figure 7.
Circular map of CNVRs identified after using DeepCNV distributed across 31 chromosomes of horse. The outer band shows chromosomes along with their size. The middle band shows CNVRs on chromosomes. The inner band shows the frequency of samples/CNVs in a CNVR.
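The per-chromosome CNVR proportions described above can be computed as the summed CNVR length on a chromosome divided by that chromosome's length. The following is a minimal sketch under stated assumptions: CNVRs are already merged (non-overlapping), coordinates are 1-based and inclusive, and the chromosome names and lengths are invented for illustration rather than taken from the horse assembly.

```python
def cnvr_proportion_by_chrom(cnvrs, chrom_lengths):
    """Return {chrom: percent of chromosome covered by CNVRs}.

    cnvrs: iterable of (chrom, start, end), 1-based inclusive, non-overlapping.
    chrom_lengths: {chrom: length in bp}.
    """
    covered = {}
    for chrom, start, end in cnvrs:
        covered[chrom] = covered.get(chrom, 0) + (end - start + 1)
    return {
        chrom: 100.0 * covered.get(chrom, 0) / length
        for chrom, length in chrom_lengths.items()
    }


# Hypothetical chromosomes and regions (not study data).
toy_lengths = {"chrA": 1_000_000, "chrB": 500_000, "chrC": 250_000}
toy_regions = [
    ("chrA", 1, 10_000),
    ("chrA", 50_001, 60_000),
    ("chrB", 1, 5_000),
]
toy_proportions = cnvr_proportion_by_chrom(toy_regions, toy_lengths)
```

A chromosome with no CNVRs, such as the hypothetical `chrC`, is reported with a proportion of zero rather than being dropped, which keeps every chromosome visible when the values are passed to a plotting tool.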
Gene content and functional analysis of CNVRs identified using DeepCNV
A total of 434 genes overlapped with the 180 CNVRs identified after DeepCNV filtering. Of these 434 genes, 290 were protein-coding genes, 18 pseudogenes, 102 lncRNA, 11 miRNA, 6 snoRNA, 1 snRNA, 3 TR_V_genes, 2 TR_J_genes and 1 processed pseudogene (Supplementary file 8). Overall, 66.82% of the CNV-containing genes were protein-coding genes. All 434 genes spanned by CNVRs were loaded into the PANTHER tool. GO analysis with the PANTHER classification system indicated that the most overrepresented molecular functions were olfactory receptor (OR) activity, odorant binding, transmembrane signalling receptor activity, molecular transducer activity and signalling receptor activity, and that the most overrepresented biological processes were sensory perception of chemical stimulus, sensory perception, nervous system process, system process and G protein-coupled receptor signalling pathway. Many CNVRs contain more than one gene, indicating that these regions may influence the function of multiple genes (Table 5).
Table 5.
Gene ontology (GO) categories significantly overrepresented in CNVRs identified after using DeepCNV.
| GO terms | GO categories | Gene number in CNVR | Expected gene number | p |
|---|---|---|---|---|
| Biological process | sensory perception of chemical stimulus (GO:0007606) | 20 | 2.34 | 3.19E-09 |
| | sensory perception (GO:0007600) | 20 | 2.89 | 1.07E-07 |
| | nervous system process (GO:0050877) | 20 | 4.26 | 5.34E-05 |
| | system process (GO:0003008) | 20 | 5.45 | 2.22E-03 |
| | G protein-coupled receptor signaling pathway (GO:0007186) | 23 | 7.45 | 5.94E-03 |
| Cellular component | cellular_component (GO:0005575) | 120 | 154.74 | 3.48E-02 |
| | membrane (GO:0016020) | 75 | 109.99 | 1.02E-02 |
| Molecular function | olfactory receptor activity (GO:0004984) | 51 | 6.13 | 3.71E-27 |
| | odorant binding (GO:0005549) | 22 | 3.78 | 6.85E-08 |
| | transmembrane signaling receptor activity (GO:0004888) | 56 | 13.71 | 5.37E-16 |
| | molecular transducer activity (GO:0060089) | 59 | 19.46 | 2.35E-11 |
| | signaling receptor activity (GO:0038023) | 59 | 19.46 | 2.35E-11 |
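The observed and expected gene counts in Table 5 can be turned into a fold enrichment (observed divided by expected), which makes the direction of each result explicit. The sketch below is illustrative only: fold enrichment is not reported in the table, the function name is hypothetical, and only a few rows are copied from Table 5.

```python
def fold_enrichment(observed, expected):
    """Ratio of observed to expected gene counts for a GO term."""
    if expected <= 0:
        raise ValueError("expected count must be positive")
    return observed / expected


# (GO term, genes in CNVRs, expected genes) copied from Table 5.
table5_rows = [
    ("olfactory receptor activity", 51, 6.13),
    ("sensory perception of chemical stimulus", 20, 2.34),
    ("membrane", 75, 109.99),
    ("cellular_component", 120, 154.74),
]

summary = []
for term, observed, expected in table5_rows:
    fold = fold_enrichment(observed, expected)
    # Values above 1 indicate overrepresentation, below 1 underrepresentation.
    direction = "over" if fold > 1 else "under"
    summary.append((term, round(fold, 2), direction))

for term, fold, direction in summary:
    print(f"{term}: {fold}x ({direction}represented)")
```

Under this calculation the olfactory receptor and sensory perception terms are overrepresented roughly eightfold, whereas the two cellular component rows have fewer genes than expected, so they read as underrepresented terms.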
Development of EqCNVdb
EqCNVdb is a database built on a three-tier architecture that houses detailed information on horse CNVs, CNVRs and the gene content within CNVRs. It includes details of 883 CNVs, 180 CNVRs and the 434 genes found within the CNVRs, from six horse breeds, namely, Manipuri, Zanskari, Bhutia, Spiti, Kathiawari and Marwari. It contains six tabs, that is, Home, CNVs, CNVRs, Gene Content, Analysis and Contacts. The Home page contains a brief description of the horse CNV database. On the CNVs page, CNVs can be searched by CNV ID and retrieved chromosome-wise, location-wise and breed-wise. The CNVRs page links each CNVR to its CNVs and associated genes, and the user can again retrieve CNVs chromosome-wise, location-wise and breed-wise. The Gene Content page displays the headers Gene ID, Chromosome, Start Position, End Position, Gene Name, Source of gene name, Gene Type, CNVR and Gene Details, the last of which links to the Ensembl project for detailed information on the gene. The catalogued CNVs range in length from 1 kb to 4.3 Mb. The web interface of EqCNVdb and its flexible search options are shown in Fig. 8. EqCNVdb can be accessed at http://backlin.cabgrid.res.in/eqcnvdb/.
Figure 8.
Layout of EqCNVdb database.
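The chromosome-wise, location-wise and breed-wise retrieval described for EqCNVdb can be illustrated with a small sketch of the data tier. Everything below is an assumption made for illustration: the actual EqCNVdb schema, database engine and identifiers are not described in the text, so the table name `cnv`, its columns and the demo rows are hypothetical, and an in-memory SQLite database stands in for the real back end.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical CNV table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cnv ("
        "cnv_id TEXT PRIMARY KEY, chrom TEXT, start_pos INTEGER, "
        "end_pos INTEGER, breed TEXT, cnv_type TEXT)"
    )
    # Invented demo rows, not records from EqCNVdb.
    conn.executemany(
        "INSERT INTO cnv VALUES (?, ?, ?, ?, ?, ?)",
        [
            ("demo1", "1", 1000, 5000, "Marwari", "gain"),
            ("demo2", "1", 8000, 12000, "Spiti", "loss"),
            ("demo3", "12", 200, 900, "Marwari", "gain"),
        ],
    )
    return conn


def search_cnvs(conn, chrom=None, start=None, end=None, breed=None):
    """Return CNV IDs matching the optional chromosome, location and breed filters."""
    clauses, params = [], []
    if chrom is not None:
        clauses.append("chrom = ?")
        params.append(chrom)
    if start is not None and end is not None:
        # A CNV matches when it overlaps the query interval [start, end].
        clauses.append("start_pos <= ? AND end_pos >= ?")
        params.extend([end, start])
    if breed is not None:
        clauses.append("breed = ?")
        params.append(breed)
    sql = "SELECT cnv_id FROM cnv"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY cnv_id"
    return [row[0] for row in conn.execute(sql, params)]


demo_conn = build_demo_db()
by_chrom = search_cnvs(demo_conn, chrom="1")
by_location = search_cnvs(demo_conn, chrom="1", start=4000, end=6000)
by_breed = search_cnvs(demo_conn, breed="Marwari")
```

Parameterized queries are used so that user-supplied search terms are never concatenated into the SQL string, which is a common design choice for the logic tier of a web-facing database.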
RT-PCR for CNVR validation
The RT-PCR results for three randomly selected CNVRs (CNVR24, CNVR30 and CNVR44) were in concordance with the CNVRs obtained through the in silico approach. Primer set 1 and Primer set 2 denote the two primer sets designed for each CNVR (Supplementary Fig. 2).
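For readers unfamiliar with how real-time PCR data are commonly converted into a copy-number estimate, the following sketch uses the comparative Ct approach. This is an assumption for illustration only: the quantification formula used in the study is not stated in this section, and the sketch presumes roughly 100% amplification efficiency, a reference region with two copies, and a diploid control sample. The function name and Ct values are hypothetical.

```python
def relative_copy_number(ct_target_sample, ct_ref_sample,
                         ct_target_control, ct_ref_control):
    """Estimate copy number as 2 * 2^(-ddCt), assuming a diploid control."""
    delta_sample = ct_target_sample - ct_ref_sample
    delta_control = ct_target_control - ct_ref_control
    delta_delta_ct = delta_sample - delta_control
    return 2.0 * 2.0 ** (-delta_delta_ct)


# Invented Ct values for three scenarios.
normal = relative_copy_number(25.0, 25.0, 25.0, 25.0)   # same as control
gain = relative_copy_number(24.0, 25.0, 25.0, 25.0)     # target amplifies one cycle earlier
loss = relative_copy_number(26.0, 25.0, 25.0, 25.0)     # target amplifies one cycle later
```

In this idealized setting a one-cycle shift corresponds to a twofold change, so the three scenarios give estimates of two, four and one copies respectively; real data would need replicate wells and an efficiency correction.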
Discussion
CNVs have been increasingly acknowledged as a source of genomic variation and phenotypic variability in recent years.2,3,51 A few mammalian gene superfamilies are also known to support evolutionary changes brought on by CNVs and by genes linked to CNV gains, which may be a potential tool for adaptation.52 Most of the diversity in mammalian genomes that can be explained by CNVs occurs in regions that control crucial biological functions, including metabolism, signal transduction pathways, immunology and pathogen defense. As a result, it is now more important than ever to analyze CNVs in domestic and livestock species to assess genetic diversity, phenotypic variety and complex phenotypes.23,24,31,36,53 Comprehensive investigation and characterization of CNVs therefore improves our understanding of genetic variation and is a useful tool for determining the role of CNVs in complex trait heritability. With the introduction of high-density genotyping arrays in recent years, detecting CNVs with SNP genotyping arrays has become a cost-effective and efficient method. We performed whole-genome genotyping with the Axiom™ Equine Genotyping Array to discover CNVs in the equine genome, and the signal intensity (log R ratio, LRR) and allelic intensity (B allele frequency, BAF) values of all 96 samples were exported using the PennCNV-Affy package. While generating CNV calls, quality control and CNV detection optimization were carried out in PennCNV to achieve greater accuracy.25 Other studies report additional criteria for filtering faulty CNV calls to obtain reliable CNV data; for example, to reduce the number of false positives, CNVs were limited to a minimum of 10 markers per CNV.54
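The marker-count criterion mentioned above can be sketched as a simple post-filter on CNV calls. The line layout parsed below follows the plain-text call format that PennCNV is generally described as producing (region, `numsnp=`, `length=`, `state…,cn=`, sample file); it should be checked against actual output before use, and the sample names, marker names and threshold default are illustrative assumptions rather than the study's settings.

```python
import re

# Pattern for one call line; the layout is an assumption to verify on real output.
CALL_PATTERN = re.compile(
    r"^(?:chr)?(?P<chrom>\w+):(?P<start>\d+)-(?P<end>\d+)\s+"
    r"numsnp=(?P<numsnp>\d+)\s+length=(?P<length>[\d,]+)\s+"
    r"state\d+,cn=(?P<cn>\d+)\s+(?P<sample>\S+)"
)


def parse_calls(lines):
    """Parse call lines into dicts; lines that do not match are skipped."""
    calls = []
    for line in lines:
        match = CALL_PATTERN.match(line.strip())
        if match is None:
            continue
        cn = int(match["cn"])
        calls.append({
            "chrom": match["chrom"],
            "start": int(match["start"]),
            "end": int(match["end"]),
            "numsnp": int(match["numsnp"]),
            "length": int(match["length"].replace(",", "")),
            "cn": cn,
            "type": "gain" if cn > 2 else "loss",
            "sample": match["sample"],
        })
    return calls


def filter_by_markers(calls, min_markers=10):
    """Keep only calls supported by at least min_markers markers."""
    return [call for call in calls if call["numsnp"] >= min_markers]


# Invented example lines (hypothetical samples and marker IDs).
example_lines = [
    "chr1:1000-5000  numsnp=12  length=4,001  state5,cn=3 horse_01.txt startsnp=AX-1 endsnp=AX-2",
    "chr12:200-900  numsnp=4  length=701  state2,cn=1 horse_02.txt startsnp=AX-3 endsnp=AX-4",
]
parsed = parse_calls(example_lines)
kept = filter_by_markers(parsed, min_markers=10)
```

Keeping the parser separate from the filter means the same parsed calls can be re-filtered at different thresholds to see how sensitive the final CNV count is to the chosen minimum.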
The impact of genomic waves on CNV calls and filtering parameters was also investigated in this study by running the PennCNV analysis with and without the -gcmodel option; more CNVs were found without the -gcmodel option than with it. This difference was most likely due to false positives induced by genomic waves, which cause signal intensities to fluctuate, suggesting that genomic waves have a substantial impact on CNV analyses.
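The idea behind a GC-based wave correction can be sketched as a regression of LRR on local GC content, followed by replacing each LRR value with its residual. This is a deliberately simplified illustration and not PennCNV's implementation: the published adjustment is more elaborate, and the function names and toy values here are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit y = intercept + slope * x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def gc_adjust(lrr, gc):
    """Remove the linear GC trend from LRR values and return the residuals."""
    intercept, slope = fit_line(gc, lrr)
    return [value - (intercept + slope * g) for value, g in zip(lrr, gc)]


# Toy markers whose LRR is entirely explained by GC content (invented values).
toy_gc = [0.30, 0.40, 0.50, 0.60]
toy_lrr = [0.5 * g - 0.2 for g in toy_gc]
adjusted = gc_adjust(toy_lrr, toy_gc)
```

In the toy data every LRR value is a pure function of GC content, so the adjusted values collapse to approximately zero; in real data the residuals would retain true copy-number signal while the GC-correlated wave is reduced.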
In contrast to other algorithms, such as QuantiSNP,55 Birdsuite56 and GADA,57 PennCNV combines multiple sources of information, including the total signal intensity and allelic intensity ratios at each SNP marker, the distance between neighboring SNPs and the allele frequency of SNPs, and it includes a hidden state for copy-neutral loss of heterozygosity. PennCNV also incorporates a computational technique to reduce the effect of genomic waves by fitting regression models using GC content.39 High-density SNP arrays combined with improved CNV calling techniques may be used in future investigations to address these discrepancies. We also looked at the distribution pattern of these CNVRs across the 31 autosomes to see whether CNVRs were frequent in different horse breeds. Surprisingly, 134 of the 381 CNVRs had frequencies of >5%, while 111 of the 381 CNVRs had frequencies between 3% and 5% in the 96 horse samples. The inherent limitations of the Axiom™ Equine Genotyping Array may have influenced this outcome, and many common CNVs may have been missed as a result, because the array was developed using a panel of 20 horse breeds in which Indian horse breeds were not represented (https://www.thermofisher.com/order/catalog/product/550583). Standard methods for detecting CNVs rely on predictions from CNV tools such as PennCNV, which are then filtered through a lengthy and heuristic process. When the sample size is large, manually distinguishing false positives from true positives becomes expensive and slow. PennCNV detects CNVs from SNP genotyping arrays and can handle signal intensity data. It implements a hidden Markov model (HMM) that integrates multiple sources of information to infer CNV calls for individually genotyped samples, considering the SNP allelic ratio distribution and other factors along with signal intensity.
Precise models for LRR and BAF and more realistic models for state transitions between the various copy number states give a more accurate depiction of the intensity data. The model also considers the distance between neighbouring SNPs as well as the population allele frequency for each SNP.38
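To make the hidden Markov model idea concrete, the sketch below decodes a short LRR series into loss, normal and gain states with the Viterbi algorithm. It is a toy three-state model and not PennCNV itself: it ignores BAF, inter-SNP distance and population allele frequency, and the emission means, standard deviation and transition probability are illustrative assumptions.

```python
import math

STATES = ("loss", "normal", "gain")
# Illustrative Gaussian emission parameters for LRR (assumed, not fitted).
LRR_MEAN = {"loss": -0.66, "normal": 0.0, "gain": 0.40}
LRR_SD = 0.20
STAY_PROB = 0.98  # assumed probability of keeping the same state between markers


def log_gauss(x, mean, sd):
    """Log density of a normal distribution."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def viterbi(lrr_values):
    """Return the most likely state path for a sequence of LRR values."""
    if not lrr_values:
        return []
    log_stay = math.log(STAY_PROB)
    log_move = math.log((1 - STAY_PROB) / (len(STATES) - 1))
    log_init = math.log(1 / len(STATES))

    score = {s: log_init + log_gauss(lrr_values[0], LRR_MEAN[s], LRR_SD)
             for s in STATES}
    backpointers = []
    for x in lrr_values[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best previous state for arriving in s at this marker.
            best_prev = max(
                STATES,
                key=lambda p: score[p] + (log_stay if p == s else log_move),
            )
            transition = log_stay if best_prev == s else log_move
            new_score[s] = score[best_prev] + transition + log_gauss(x, LRR_MEAN[s], LRR_SD)
            pointers[s] = best_prev
        backpointers.append(pointers)
        score = new_score

    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Invented LRR series with a four-marker drop in the middle.
toy_series = [0.02, -0.05, 0.01, -0.70, -0.62, -0.68, -0.65, 0.03, -0.01, 0.04]
toy_path = viterbi(toy_series)
```

Because leaving a state is penalized, isolated noisy markers tend not to trigger a call, whereas a run of consistently low LRR values outweighs the transition cost and is decoded as a loss segment.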
Furthermore, manual filtering may be subject to various biases, whereas in clinical settings workflow rigor and objectivity are desired. DeepCNV addresses this gap and automates the filtering process using a blended deep neural network structure that considers both the image plots (image data) and the summary statistics (metadata) generated by PennCNV.58 This approach greatly improves the capacity of CNV studies to prune raw CNV call sets quickly and decisively into high-accuracy CNV sets, and it tends to outperform other machine-learning approaches in terms of accuracy. To reduce the possibility of false-positive signals, we employed DeepCNV, a CNV authentication tool, which yielded 180 CNVRs. These 180 regions were consistently supported by both the CNV detection software and the authentication tool.
Equus caballus chromosome 12 exhibited the most pronounced enrichment of segment CNV gains and losses, which covered approximately 7.83% of the chromosome.59 However, the largest counts of segment CNVs were observed on chromosome 1 and chromosome 20, irrespective of their size, as reported in other livestock animals such as sheep.60 Upon PANTHER annotation, these genes displayed both significantly underrepresented and overrepresented GO biological terms associated with cellular processes and immunity, as observed in previous studies (Bonferroni p value < .05).31,59
GO analysis, carried out to evaluate the biological impact of the final 434 CNV genes, showed that the enriched genes were associated with OR activity, sensory perception of smell, sensory perception of chemical stimulus, sensory perception, cognition, G protein-coupled receptor signalling pathway, neurological system processes, cell surface receptor-linked signal transduction, and plasma membrane and integral membrane components, similar to other GO results. These functional analysis results were in concordance with those previously reported in other mammalian CNV investigations.14,61–64 The considerable (p < 3.17 e−27) enrichment for OR activity could be due to the common occurrence of CNVs in OR gene clusters,65 as shown previously in equine studies.23,24,26,27,31,59 Nonetheless, in the vast majority of cases, these CNVs contain genes and hence directly impact gene dosage through changes in gene expression levels.66–68 Numerous studies have described the genome-wide distribution of CNVs in non-coding sequence areas, where they influence the regulation of distant target genes.69–72 This potential impact of CNVs in non-coding areas needs to be investigated in future studies.
The functional analyses revealed that the CNV genes have a diverse range of biological roles. Horse CNVRs, however, require more exploration of their roles in complex phenotypes because the equine genome is less well characterized. To corroborate the CNVs obtained with PennCNV, we performed RT-PCR on three randomly selected CNVRs and compared them with a control region known to have no CNVs. The wet-lab validation confirmed all three CNVRs. Only a small percentage of the CNVRs found in this study had been reported in other investigations. Human and other mammalian CNV investigations revealed a similar scenario.3,25,53 This suggests that the horse genome contains a large number of CNVs that have yet to be discovered. In addition, a comparison of CNVRs found in horse, sheep, goat and cattle could elucidate the evolutionary principles governing CNV formation in mammalian evolution. According to the functional enrichment analyses, CNVs significantly affected genes involved in metabolism, signal transduction and sensory perception. In other species, CNVs have been found in genes controlling height, fertility, lactation, keratin formation, blood group antigens, coat colour, fecundity and neural homeostasis.23,24 Together, these findings indicate that CNVs are frequent in the horse genome and may affect the molecular processes underlying the features seen in different horse breeds and individuals.
The present study used advanced computational techniques to bridge the gap in terms of cost and time. Its advantage over other studies is the use of the deep learning-based DeepCNV tool, which is a supervised learning method. Deep learning has exhibited state-of-the-art performance in classifying genetic variations owing to its capacity to learn intricate patterns automatically from vast genomic datasets. By leveraging deep neural networks, it can discern subtle genetic features, enabling accurate differentiation of variations associated with diseases, phenotypes and functional implications. This proficiency in handling complex genomic data has positioned deep learning as a potent tool for enhancing our understanding of genetic diversity and its impact on human health.
Such studies have considerable future scope in research on genetic variation. Earlier studies demonstrate the capability of deep learning to handle intricate and lengthy bioinformatic sequences.73–75 Building on this recent progress, it may become possible to create comprehensive end-to-end solutions for putative CNV detection. We defer these investigations to future work. Overall, it is foreseeable that the detection process will evolve into a more sophisticated and intelligent procedure in the coming years.
Conclusion
Since CNVs have become a significant source of genomic variability and phenotypic variance, the present study focused on genome-wide profiling of CNVs to capture inter-breed variability in six Indian horse breeds. CNVs were discovered using the Axiom™ Equine Genotyping Array in 96 horse samples from the six breeds, resulting in 2668 autosomal CNVs and 381 CNVRs when analyzed with the PennCNV tool. After DeepCNV filtering, 883 autosomal CNVs and 180 CNVRs were retained. These 180 CNVRs collectively covered approximately 38.04 Mb of the equine genome assembly. About 56% of them exhibited a frequency >2% across the horse samples, indicating their potential as CNPs with possible roles in disease and phenotypic traits. A total of 434 unique horse genes were identified within the CNVRs. CNV-containing genes were enriched in specific molecular functions and biological processes. OR activity, sensory perception, signal transduction and neurological system processes emerged as major categories, emphasizing the potential roles of CNVRs in sensory perception and signalling pathways. The EqCNVdb database, which holds detailed information on the 883 CNVs, 180 CNVRs and 434 genes within CNVRs, is a rich source of horse variation data for horse researchers. The subsequent validation of CNVRs using RT-PCR supported the accuracy of the in silico identification methods. The prevalence of CNVs across different breeds, their distribution patterns and their impact on functional genes offer valuable insights into the complexity of the horse genome. The findings not only expand our knowledge of genomic variation in horses but also contribute to our understanding of the intricate genomic landscape of horse populations and its potential implications for various biological processes and traits.
Supplementary Material
Acknowledgements
The authors thank the Indian Council of Agricultural Research, Ministry of Agriculture and Farmers Welfare, Govt. of India for providing financial support in the form of a CABin grant (F. no. Agril. Edn.4-1/2013-A&P) and the establishment of Advanced Super Computing Hub for Omics Knowledge in Agriculture (ASHOKA) at ICAR-IASRI, New Delhi, India. The award of a Junior Research Fellowship to NKS by ICAR-Indian Agricultural Research Institute, New Delhi is duly acknowledged.
Funding Statement
The authors are thankful to the Indian Council of Agricultural Research, Ministry of Agriculture and Farmers’ Welfare, Govt. of India, for providing financial assistance in the form of a CABin grant (F. no. Agril. Edn.4-1/2013-A&P). The grant of the IARI Merit scholarship to NKS is duly acknowledged.
Authors’ contribution
SJ, MAI, DK and AB conceived and designed the study; PS, AB, YP, RAL did the sample collection; NKS, MAI did the data curation and analysis; NKS and BS were involved in database development; PS, SKG, AB, VN performed the wet lab validation; NKS, SJ, MAI and AB wrote the first draft of the manuscript; SJ, MAI, DK, AB, VN, RAL, TKB and AR provided overall guidance and finalized the edited manuscript. All authors contributed to the article and approved the submitted version.
Disclosure statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Data availability statement
All data are presented in the manuscript. Any further inquiries can be directed to the corresponding author/s. Database availability: http://backlin.cabgrid.res.in/eqcnvdb/.
References
- 1. Perry GH. The evolutionary significance of copy number variation in the human genome. Cytogenet Genome Res. 2008;123(1-4):283–287. https://www.karger.com/Article/Abstract/184719
- 2. Redon R, Ishikawa S, Fitch KR, et al. Global variation in copy number in the human genome. Nature. 2006;444(7118):444–454. https://www.nature.com/articles/nrg1767
- 3. Zhang F, Gu W, Hurles ME, Lupski JR. Copy number variation in human health, disease, and evolution. Annu Rev Genomics Hum Genet. 2009;10(1):451–481. https://www.annualreviews.org/doi/abs/10.1146/annurev.genom.9.081307.164217
- 4. Aldred PM, Hollox EJ, Armour JA. Copy number polymorphism and expression level variation of the human α-defensin genes DEFA1 and DEFA3. Hum Mol Genet. 2005;14(14):2045–2052. https://academic.oup.com/hmg/article/14/14/2045/608373
- 5. Cooper GM, Nickerson DA, Eichler EE. Mutational and selective effects on copy-number variants in the human genome. Nat Genet. 2007;39(Suppl 7):S22–S29. https://www.nature.com/articles/ng2054
- 6. Hurles ME, Dermitzakis ET, Tyler-Smith C. The functional impact of structural variation in humans. Trends Genet. 2008;24(5):238–245. https://www.sciencedirect.com/science/article/pii/S0168952508000784
- 7. Stranger BE, Forrest MS, Dunning M, et al. Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science. 2007;315(5813):848–853. https://www.science.org/doi/abs/10.1126/science.1136678
- 8. McCarroll SA, Altshuler DM. Copy-number variation and association studies of human disease. Nat Genet. 2007;39(Suppl 7):S37–S42. https://www.nature.com/articles/ng2080
- 9. Iafrate AJ, Feuk L, Rivera MN, et al. Detection of large-scale variation in the human genome. Nat Genet. 2004;36(9):949–951. https://www.nature.com/articles/ng1416
- 10. Sebat J, Lakshmi B, Troge J, et al. Large-scale copy number polymorphism in the human genome. Science. 2004;305(5683):525–528. https://www.science.org/doi/abs/10.1126/science.1098918
- 11. Chen WK, Swartz JD, Rush LJ, Alvarez CE. Mapping DNA structural variation in dogs. Genome Res. 2009;19(3):500–509. https://genome.cshlp.org/content/19/3/500.short
- 12. DeBolt S. Copy number variation shapes genome diversity in Arabidopsis over immediate family generational scales. Genome Biol Evol. 2010;2:441–453. https://academic.oup.com/gbe/article/doi/10.1093/gbe/evq033/574088
- 13. Dolatabadian A, Yuan Y, Bayer PE, et al. Copy number variation among resistance genes analogues in Brassica napus. Genes (Basel). 2022;13(11):2037. https://www.mdpi.com/2073-4425/13/11/2037
- 14. Fadista J, Thomsen B, Holm LE, Bendixen C. Copy number variation in the bovine genome. BMC Genomics. 2010;11(1):284. https://link.springer.com/article/10.1186/1471-2164-11-284
- 15. Nicholas TJ, Cheng Z, Ventura M, Mealey K, Eichler EE, Akey JM. The genomic architecture of segmental duplications and associated copy number variants in dogs. Genome Res. 2009;19(3):491–499. https://genome.cshlp.org/content/19/3/491.short
- 16. Santuari L, Pradervand S, Amiguet-Vercher A-M, et al. Substantial deletion overlap among divergent Arabidopsis genomes revealed by intersection of short reads and tiling arrays. Genome Biol. 2010;11(1):R4. https://genomebiology.biomedcentral.com/articles/10.1186/gb-2010-11-1-r4
- 17. Swanson-Wagner RA, Eichten SR, Kumari S, et al. Pervasive gene content variation and copy number variation in maize and its undomesticated progenitor. Genome Res. 2010;20(12):1689–1699. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2989995/
- 18. Żmieńko A, Samelak A, Kozłowski P, Figlerowicz M. Copy number polymorphism in plant genomes. Theor Appl Genet. 2014;127(1):1–18. https://link.springer.com/article/10.1007/s00122-013-2177-7
- 19. Li J, Jiang T, Mao J-H, et al. Genomic segmental polymorphisms in inbred mouse strains. Nat Genet. 2004;36(9):952–954. https://www.nature.com/articles/ng1417
- 20. Griffin DK, Robertson LB, Tempest HG, et al. Whole genome comparative studies between chicken and turkey and their implications for avian genome evolution. BMC Genomics. 2008;9(1):168. https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-9-168
- 21. Alvarez CE, Akey JM. Copy number variation in the domestic dog. Mamm Genome. 2012;23(1–2):144–163. https://link.springer.com/article/10.1007/s00335-011-9369-8
- 22. Sassi NB, González-Recio Ó, de Paz-Del Río R, Rodríguez-Ramilo ST, Fernández AI. Associated effects of copy number variants on economically important traits in Spanish Holstein dairy cattle. J Dairy Sci. 2016;99(8):6371–6380. https://www.sciencedirect.com/science/article/pii/S0022030216302740
- 23. Doan R, Cohen N, Harrington J, et al. Identification of copy number variants in horses. Genome Res. 2012;22(5):899–907. https://genome.cshlp.org/content/22/5/899.short
- 24. Doan R, Cohen ND, Sawyer J, Ghaffari N, Johnson CD, Dindot SV. Whole-genome sequencing and genetic variant analysis of a Quarter Horse mare. BMC Genomics. 2012;13(1):78. https://link.springer.com/article/10.1186/1471-2164-13-78
- 25. Dupuis MC, Zhang Z, Durkin K, Charlier C, Lekeux P, Georges M. Detection of copy number variants in the horse genome and examination of their association with recurrent laryngeal neuropathy. Anim Genet. 2013;44(2):206–208. https://onlinelibrary.wiley.com/doi/full/10.1111/j.1365-2052.2012.02373.x
- 26. Metzger J, Philipp U, Lopes MS, et al. Analysis of copy number variants by three detection algorithms and their association with body size in horses. BMC Genomics. 2013;14(1):487. https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-14-487
- 27. Wang W, Wang S, Hou C, et al. Genome-wide detection of copy number variations among diverse horse breeds by array CGH. PLoS One. 2014;9(1):e86860. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0086860
- 28. Fredman D, White SJ, Potter S, Eichler EE, Dunnen JTD, Brookes AJ. Complex SNP-related sequence variation in segmental genome duplications. Nat Genet. 2004;36(8):861–866. https://www.nature.com/articles/ng1401
- 29. Sharp AJ, Locke DP, McGrath SD, et al. Segmental duplications and copy-number variation in the human genome. Am J Hum Genet. 2005;77(1):78–88. https://www.sciencedirect.com/science/article/pii/S0002929707609033
- 30. Tuzun E, Sharp AJ, Bailey JA, et al. Fine-scale structural variation of the human genome. Nat Genet. 2005;37(7):727–732. https://www.nature.com/articles/ng1562
- 31. Ghosh S, Qu Z, Das PJ, et al. Copy number variation in the horse genome. PLoS Genet. 2014;10(10):e1004712. https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004712
- 32. McQueen CM, Doan R, Dindot SV, et al. Identification of genomic loci associated with Rhodococcus equi susceptibility in foals. PLoS One. 2014;9(6):e98710. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0098710
- 33. Park K-D, Kim H, Hwang JY, et al. Copy number deletion has little impact on gene expression levels in racehorses. Asian-Australas J Anim Sci. 2014;27(9):1345–1354. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4150202/
- 34. Pawlina-Tyszko K, Gurgul A, Szmatoła T, et al. Genomic landscape of copy number variation and copy neutral loss of heterozygosity events in equine sarcoids reveals increased instability of the sarcoid genome. Biochimie. 2017;140:122–132. https://www.sciencedirect.com/science/article/pii/S0300908417301773
- 35. Mielczarek M, Frąszczak M, Nicolazzi E, Williams JL, Szyda J. Landscape of copy number variations in Bos taurus: individual- and inter-breed variability. BMC Genomics. 2018;19(1):410. https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-018-4815-6
- 36. Nicholas TJ, Baker C, Eichler EE, Akey JM. A high-resolution integrated map of copy number polymorphisms within and between breeds of the modern domesticated dog. BMC Genomics. 2011;12(1):414. https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-12-414
- 37. Ghosh S, Das PJ, McQueen CM, et al. Analysis of genomic copy number variation in equine recurrent airway obstruction (heaves). Anim Genet. 2016;47(3):334–344. https://onlinelibrary.wiley.com/doi/full/10.1111/age.12426
- 38. Wang K, Li M, Hadley D, et al. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res. 2007;17(11):1665–1674. https://genome.cshlp.org/content/17/11/1665.short
- 39. Diskin SJ, Li M, Hou C, et al. Adjustment of genomic waves in signal intensities from whole-genome SNP genotyping platforms. Nucleic Acids Res. 2008;36(19):e126. https://academic.oup.com/nar/article-abstract/36/19/e126/2409936
- 40. Moradi MH, Mahmodi R, Farahani AHK, Karimi MO. Genome-wide evaluation of copy gain and loss variations in three Afghan sheep breeds. Sci Rep. 2022;12(1):14286. https://www.nature.com/articles/s41598-022-18571-4
- 41. Krzywinski M, Schein J, Birol I, et al. Circos: an information aesthetic for comparative genomics. Genome Res. 2009;19(9):1639–1645. https://genome.cshlp.org/content/19/9/1639.short
- 42. Cunningham F, Allen JE, Allen J, et al. Ensembl 2022. Nucleic Acids Res. 2022;50(D1):D988–D995. https://academic.oup.com/nar/article/50/D1/D988/6430486
- 43. Smedley D, Haider S, Ballester B, et al. BioMart – biological queries made easy. BMC Genomics. 2009;10(1):22. https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-10-22
- 44. Thomas PD, Ebert D, Muruganujan A, Mushayahama T, Albou LP, Mi H. PANTHER: making genome‐scale phylogenetics accessible to all. Protein Sci. 2022;31(1):8–22. https://onlinelibrary.wiley.com/doi/full/10.1002/pro.4218
- 45. Mi H, Muruganujan A, Thomas PD. PANTHER in 2013: modeling the evolution of gene function, and other gene attributes, in the context of phylogenetic trees. Nucleic Acids Res. 2013;41(Database issue):D377–D386. https://academic.oup.com/nar/article-abstract/41/D1/D377/1060482
- 46. Atema E, van Oers K, Verhulst S. GAPDH as a control gene to estimate genome copy number in great tits, with cross-amplification in blue tits. Ardea. 2013;101(1):49–54. https://bioone.org/journals/ardea/volume-101/issue-1/078.101.0107/GAPDH-as-a-Control-Gene-to-Estimate-Genome-Copy-Number/10.5253/078.101.0107.full
- 47. Šķipars V, Krivmane B, Ruņģis D. Thaumatin-like protein gene copy number variation in Scots pine (Pinus sylvestris). Environ Exper Biol. 2011;9:75–81. https://eeb.lu.lv/EEB/201112/EEB_9_Skipars.pdf
- 48. Azarpeykan S, Dittmer KE. Evaluation of housekeeping genes for quantitative gene expression analysis in the equine kidney. J Equine Sci. 2016;27(4):165–168.
- 49.Beekman L, Tohver T, Dardari R, Léguillette R.. Evaluation of suitable reference genes for gene expression studies in bronchoalveolar lavage cells from horses with inflammatory airway disease. BMC Mol Biol. 2011;12(1):5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Kalbfleisch TS, Rice ES, DePriest MS, et al. Improved reference genome for the domestic horse increases assembly contiguity and composition. Commun Biol. 2018;1(1):197. https://www.nature.com/articles/s42003-018-0199-z [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.McCarroll SA, Kuruvilla FG, Korn JM, et al. Integrated detection and population-genetic analysis of SNPs and copy number variation. Nat Genet. 2008;40(10):1166–1174. https://www.nature.com/articles/ng.238?ref=https://githubhelp.com [DOI] [PubMed] [Google Scholar]
- 52.Demuth JP, Bie TD, Stajich JE, Cristianini N, Hahn MW.. The evolution of mammalian gene families. PLoS One. 2006;1(1):e85. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0000085 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53.Chen C, Qiao R, Wei R, et al. A comprehensive survey of copy number variation in 18 diverse pig populations and identification of candidate copy number variable genes associated with complex traits. BMC Genomics. 2012;13(1):733. https://link.springer.com/article/10.1186/1471-2164-13-733 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54.Jakobsson M, Scholz SW, Scheet P, et al. Genotype, haplotype and copy-number variation in worldwide human populations. Nature. 2008;451(7181):998–1003. https://www.nature.com/articles/nature06742 [DOI] [PubMed] [Google Scholar]
- 55.Colella S, Yau C, Taylor JM, et al. QuantiSNP: an Objective Bayes Hidden-Markov Model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Res. 2007;35(6):2013–2025. https://academic.oup.com/nar/article/35/6/2013/1034786 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.Korn JM, Kuruvilla FG, McCarroll SA, et al. Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat Genet. 2008;40(10):1253–1260. https://www.nature.com/articles/ng.237 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57.Pique-Regi R, Monso-Varona J, Ortega A, Seeger RC, Triche TJ, Asgharzadeh S. Sparse representation and Bayesian detection of genome copy number alterations from microarray data. Bioinformatics. 2008;24(3):309–318. https://academic.oup.com/bioinformatics/article/24/3/309/253648
- 58.Glessner JT, Hou X, Zhong C, et al. DeepCNV: a deep learning approach for authenticating copy number variations. Brief Bioinform. 2021;22(5):bbaa381. https://academic.oup.com/bib/article/22/5/bbaa381/6082822
- 59.Solé M, Ablondi M, Binzer-Panchal A, et al. Inter-and intra-breed genome-wide copy number diversity in a large cohort of European equine breeds. BMC Genomics. 2019;20(1):759.
- 60.Taghizadeh S, Gholizadeh M, Rahimi-Mianji G, et al. Genome-wide identification of copy number variation and association with fat deposition in thin and fat-tailed sheep breeds. Sci Rep. 2022;12(1):8834.
- 61.Fontanesi L, Beretti F, Martelli PL, et al. A first comparative map of copy number variations in the sheep genome. Genomics. 2011;97(3):158–165. https://www.sciencedirect.com/science/article/pii/S0888754310002429
- 62.Lee AS, Gutiérrez-Arcelus M, Perry GH, et al. Analysis of copy number variation in the rhesus macaque genome identifies candidate loci for evolutionary and human disease studies. Hum Mol Genet. 2008;17(8):1127–1136. https://academic.oup.com/hmg/article/17/8/1127/650766
- 63.Moon S, Kim YJ, Hong CB, Kim DJ, Lee JY, Kim BJ. Data-driven approach to detect common copy-number variations and frequency profiles in a population-based Korean cohort. Eur J Hum Genet. 2011;19(11):1167–1172. https://www.nature.com/articles/ejhg2011103
- 64.Seroussi E, Glick G, Shirak A, et al. Analysis of copy loss and gain variations in Holstein cattle autosomes using BeadChip SNPs. BMC Genomics. 2010;11(1):673. https://link.springer.com/article/10.1186/1471-2164-11-673
- 65.Nei M, Niimura Y, Nozawa M. The evolution of animal chemosensory receptor gene repertoires: roles of chance and necessity. Nat Rev Genet. 2008;9(12):951–963. https://www.nature.com/articles/nrg2480
- 66.Conrad DF, Pinto D, Redon R, et al. Origins and functional impact of copy number variation in the human genome. Nature. 2010;464(7289):704–712. https://www.nature.com/articles/nature08516
- 67.Feuk L, Carson AR, Scherer SW. Structural variation in the human genome. Nat Rev Genet. 2006;7(2):85–97. https://www.nature.com/articles/nrg1767
- 68.McCarroll SA, Bradner JE, Turpeinen H, et al. Donor-recipient mismatch for common gene deletion polymorphisms in graft-versus-host disease. Nat Genet. 2009;41(12):1341–1344. https://www.nature.com/articles/ng.490
- 69.Dathe K, Kjaer KW, Brehm A, et al. Duplications involving a conserved regulatory element downstream of BMP2 are associated with brachydactyly type A2. Am J Hum Genet. 2009;84(4):483–492. https://www.sciencedirect.com/science/article/pii/S0002929709000986
- 70.Kantaputra PN, Klopocki E, Hennig BP, et al. Mesomelic dysplasia Kantaputra type is associated with duplications of the HOXD locus on chromosome 2q. Eur J Hum Genet. 2010;18(12):1310–1314. https://www.nature.com/articles/ejhg2010116
- 71.Kurth I, Klopocki E, Stricker S, et al. Duplications of noncoding elements 5′ of SOX9 are associated with brachydactyly-anonychia. Nat Genet. 2009;41(8):862–863. https://www.nature.com/articles/ng0809-862
- 72.Wieczorek D, Pawlik B, Li Y, et al. A specific mutation in the distant sonic hedgehog (SHH) cis‐regulator (ZRS) causes Werner mesomelic syndrome (WMS) while complete ZRS duplications underlie Haas type polysyndactyly and preaxial polydactyly (PPD) with or without triphalangeal thumb. Hum Mutat. 2010;31(1):81–89. https://onlinelibrary.wiley.com/doi/abs/10.1002/humu.21142
- 73.Baptista D, Ferreira PG, Rocha M. Deep learning for drug response prediction in cancer. Brief Bioinform. 2021;22(1):360–379.
- 74.Bugnon LA, Yones C, Milone DH, Stegmayer G. Genome-wide discovery of pre-miRNAs: comparison of recent approaches based on machine learning. Brief Bioinform. 2021;22(3):bbaa184.
- 75.Chiu YC, Chen HIH, Gorthi A, et al. Deep learning of pharmacogenomics resources: moving towards precision oncology. Brief Bioinform. 2020;21(6):2066–2083.
Data Availability Statement
The data are presented in the manuscript. Further inquiries can be directed to the corresponding author(s). Database availability: http://backlin.cabgrid.res.in/eqcnvdb/.








