Life. 2021 Nov 26;11(12):1302. doi: 10.3390/life11121302

SMCKAT, a Sequential Multi-Dimensional CNV Kernel-Based Association Test

Nastaran Maus Esfahani 1,*, Daniel Catchpoole 1,2, Paul J Kennedy 1
Editors: Md Altaf-Ul-Amin, Tao Huang, Shigehiko Kanaya, Naoaki Ono, Ming Huang
PMCID: PMC8709152  PMID: 34947833

Abstract

Copy number variants (CNVs) are the most common form of structural genetic variation, reflecting the gain or loss of DNA segments compared with a reference genome. Studies have identified CNV associations with different diseases. However, to our knowledge, the association between the sequential order of CNVs and disease-related traits has not been studied, and it is still unclear whether CNVs function individually or work in coordination with other CNVs to manifest a disease or trait. Consequently, we propose the first method to test the association between the sequential order of CNVs and diseases. Our sequential multi-dimensional CNV kernel-based association test (SMCKAT) consists of three parts: (1) a single CNV group kernel measuring the similarity between two groups of CNVs; (2) a whole genome group kernel that aggregates several single group kernels to summarize the similarity between CNV groups in a single chromosome or the whole genome; and (3) an association test between the CNV sequential order and disease-related traits using a random effect model. We evaluate SMCKAT on CNV data sets exhibiting rare or common CNVs, demonstrating that it can detect specific biologically relevant chromosomal regions supported by the biomedical literature. We compare the performance of SMCKAT with MCKAT, a multi-dimensional kernel association test. Based on the results, SMCKAT can detect more specific chromosomal regions than MCKAT: regions in which not only the CNV characteristics but also the CNV order are significantly associated with the disease-related trait.

Keywords: genetic variation, copy number variants, disease-related traits, sequential order, association test

1. Introduction

Genetically speaking, all humans are 99.9 percent the same, and the 0.1 percent that makes each of us unique is called genetic variation [1]. Genetic variation has two main forms: structural alteration and sequence variation. Copy number variants (CNVs) and single nucleotide polymorphisms (SNPs) are the most common forms of structural alteration and sequence variation in the human genome, respectively [2].

A sequence variation or single nucleotide polymorphism (SNP) represents a difference in a single nucleotide. For example, a SNP may replace the nucleotide cytosine with the nucleotide thymine in a certain stretch of DNA. SNPs are classified into two major types based on the gene region they fall within: coding region and non-coding region. SNPs within a coding sequence do not necessarily change the amino acid sequence of the protein that is produced, due to the degeneracy of the genetic code. SNPs in the coding region are of two types: nonsynonymous and synonymous. Nonsynonymous SNPs change the amino acid sequence of the protein, while synonymous SNPs do not. SNPs do not usually function individually; rather, they work in coordination with other SNPs to manifest a disease or trait [3]. Therefore, many sequence studies have tested the association between SNPs and diseases or traits.

A copy number variant is the gain or loss of DNA segments in the genome ranging in size from one kilobase to several megabases. CNVs are described by three characteristics: type, chromosomal position, and dosage [4]. The type of CNV is either amplification or deletion. The chromosomal position of the CNV is described by the start and end position of the CNV in the chromosome. The dosage represents the total number of copies of the CNV, with a value less than two relating to deletion and greater than two indicating amplification. Besides, CNVs have phenotypic heterogeneity effects. This means that different CNV types and dosages at the same position in the chromosome can have a different impact. It is reported in biological studies that CNVs are distributed non-randomly in the genomes, in particular they tend to be located close to telomeres and centromeres [5]. However, it is still unclear if there is any specific pattern in the sequential order of CNVs that may lead to a disease or trait.

Association studies have determined that genetic variations, both CNVs and SNPs, are associated with diseases or traits. So, understanding the relationship between genetic variation and disease may provide important insights into genetic causes, leading to effective means in preventing and treating the diseases. While there are lots of computational association studies that have investigated the association between SNPs and diseases or traits, methods for studying CNVs are underdeveloped due to the multi-dimensional characteristics of the CNVs.

The CNV kernel association test (CKAT) [6], copy number profile curve-based association test (CONCUR) [7] and multi-dimensional copy number variant kernel association test (MCKAT) [8] are a few existing computational kernel based methods that have studied the association between CNVs and diseases. In these studies, different kernels are proposed to measure the similarity between CNV profiles with respect to CNV characteristics. Then, the similarity between CNV profiles is compared with those in disease-related trait status to identify any potential association between CNVs and the disease. Among them, our previous method, the MCKAT, is the only method that has incorporated all multidimensional characteristics of CNVs in testing the association between CNVs and disease or trait. The MCKAT calculates the p-value of the association test analytically, which is computationally efficient and flexible for CNV association analysis for both rare and common CNV types, as demonstrated in numerical studies. However, neither MCKAT nor other methods consider the CNV sequential order in testing the association between CNVs and disease-related traits.

Starting from MCKAT, we propose a sequential multi-dimensional CNV kernel-based association test (SMCKAT) for investigating the association between CNVs and diseases or traits. SMCKAT utilizes not only all multi-dimensional characteristics of CNVs but also their sequential order in testing the association between CNVs and diseases or traits. Based on the results, SMCKAT is applicable to both rare and common CNV datasets and is capable of identifying hot-spots on the genome where both CNV characteristics and the CNV sequential order are significantly associated with diseases or traits.

The rest of this paper is as follows. Section 2 presents the method and materials. Section 3 contains simulation studies. Section 4 shows the results of the real data application. Section 5 presents the discussion, and finally Section 6 concludes the work.

2. Method and Materials

We design a sequential multi-dimensional kernel framework capable of measuring the similarity between CNV profiles utilizing all CNV characteristics and the CNV sequential order. It contains two kernels. The first kernel, the pair group kernel, measures the similarity between two groups of CNVs at the same ordinal position of two CNV profiles. It contains three sub-kernels. Each sub-kernel is responsible for measuring the similarity between two CNVs with respect to one of the three CNV characteristics. The second kernel, the whole genome group kernel, aggregates the similarity between every possible CNV pair group to measure the total similarity between the CNV profiles of the subjects. Finally, the association between CNV sequential order across a chromosome and disease-related traits is tested by comparing the similarity in CNV profiles to that in the trait using an association test.

2.1. Pair CNV Group Kernel

Let $X$ denote a single CNV, defined by four characteristics as $X=(X^{(1)},X^{(2)},X^{(3)},X^{(4)})$, where $X^{(1)}$ and $X^{(2)}$ are the CNV starting and ending positions on the chromosome, $X^{(3)}$ is the CNV type, and $X^{(4)}$ is the CNV dosage. First, we generate the CNV profile $R_i$ for subject $i$ with $l_i$ CNVs as $R_i=(X_1^i,X_2^i,\ldots,X_{l_i}^i)$, where CNVs are sorted by chromosomal position. Second, we extract a CNV group of size $n$ from the CNV profile as $G_i=(X_m^i,X_{m+1}^i,\ldots,X_{m+n}^i)$, where the group size $n$ can take any value between 1 and $l_i$, the number of CNVs in the profile, as shown in Figure 1.

Figure 1. Generating CNV profile $R_i$, where CNVs are sorted with respect to their chromosomal position. A, B, …, and F are arbitrary CNVs at the $m$th, $(m+1)$th, …, and $(m+n)$th positions, and $G_i$ is a group of CNVs of size $n$.

We propose a pair CNV group kernel, $K_{PG}$, to measure the similarity between two CNV groups of size $n$, $G_i$ and $G_j$, in two CNV profiles. First, $K_{PG}$ aligns each CNV in $G_i$ with its counterpart in $G_j$ with respect to their position to generate $n$ CNV pairs, as shown in Figure 2.

Figure 2. Aligning CNVs within two CNV groups of size $n$, $G_i$ and $G_j$, to generate $n$ CNV pairs.

Then, $K_{PG}$ measures the similarity between each CNV pair using the single pair CNV kernel, $K_S$, we proposed in [8]. $K_S$ measures the similarity between a CNV pair by three sub-kernels considering all CNV features, including chromosomal position, type and dosage. Finally, $K_{PG}$ averages the similarities calculated by $K_S$ between all generated CNV pairs to measure the similarity between two CNV groups, $G_i$ and $G_j$, as

$$K_{PG}(G_i,G_j)=\frac{\sum_{m=1}^{n}K_S(X_m^i,X_m^j)}{n} \tag{1}$$

where $K_S$ is defined as

$$K_S(X_m^i,X_m^j)=\frac{\mathrm{Intersection}\big(X_m^{i(1)},X_m^{i(2)},X_m^{j(1)},X_m^{j(2)}\big)}{\mathrm{Union}\big(X_m^{i(1)},X_m^{i(2)},X_m^{j(1)},X_m^{j(2)}\big)}\times\frac{\big(X_m^{i(3)}==X_m^{j(3)}\big)+1}{2}\times\frac{1}{2^{\left|DR(X_m^{i(4)})-DR(X_m^{j(4)})\right|}} \tag{2}$$

Here, the first term measures the mutual presence of a CNV with a specific start and end position by dividing the size of the intersection of the two CNVs by the size of their union. The intersection function calculates the length of the chromosomal region that belongs to both CNVs. Similarly, the union function calculates the length of the chromosomal region that belongs to either the first or the second CNV. The second term compares the types of the two CNVs. The third term measures the similarity between the two CNVs with respect to their dosage. $DR$ is the difference-from-reference function we proposed in [8], $DR(\text{dosage})=|\text{dosage}-2|$, which measures the difference between a CNV dosage and the reference dosage value 2.
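Equations (1) and (2) can be sketched in Python as follows. This is a minimal illustration, not the authors' R implementation: the `CNV` tuple, the function names, the interval length taken as `end - start`, the reading of the union as the total length covered by either CNV, and the absolute value in the dosage exponent are all assumptions made for the example.

```python
from typing import NamedTuple, Sequence


class CNV(NamedTuple):
    start: int   # X^(1): start position on the chromosome
    end: int     # X^(2): end position on the chromosome
    kind: int    # X^(3): CNV type (1 = deletion, 3 = amplification, as coded in Eq. (6))
    dosage: int  # X^(4): total number of copies


def dr(dosage: float) -> float:
    """Difference from the reference dosage value 2."""
    return abs(dosage - 2)


def k_s(a: CNV, b: CNV) -> float:
    """Single pair CNV kernel, following Equation (2)."""
    # Length shared by both CNVs; zero when the intervals are disjoint.
    intersection = max(0, min(a.end, b.end) - max(a.start, b.start))
    # Assumption: the union is the total length covered by either CNV.
    union = (a.end - a.start) + (b.end - b.start) - intersection
    position = intersection / union if union > 0 else 0.0
    # 1 when the types match, 0.5 otherwise.
    cnv_type = ((a.kind == b.kind) + 1) / 2
    # Assumption: absolute difference, so that the kernel is symmetric.
    dosage = 1 / 2 ** abs(dr(a.dosage) - dr(b.dosage))
    return position * cnv_type * dosage


def k_pg(g_i: Sequence[CNV], g_j: Sequence[CNV]) -> float:
    """Pair CNV group kernel, Equation (1): mean K_S over aligned pairs."""
    if len(g_i) != len(g_j) or not g_i:
        raise ValueError("groups must be non-empty and of equal size")
    return sum(k_s(a, b) for a, b in zip(g_i, g_j)) / len(g_i)
```

Under these assumptions, two identical CNVs score 1, and the score shrinks as the overlap, the type agreement or the dosage agreement decreases.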

2.2. Whole Genome CNV Group Kernel

First, we create a window of size $n$. We slide this window across the CNV profile $R_i$, as shown in Figure 3, to extract all possible CNV groups of size $n$ as $P_i=(G_1^i,\ldots,G_{p_i}^i)$, where CNV groups are sorted based on their position and $p_i$ is the number of extracted CNV groups for the CNV profile $R_i$. Similarly, we have another CNV group series $P_j=(G_1^j,\ldots,G_{q_j}^j)$ for CNV profile $R_j$.

Figure 3. Sliding window of size $n$ across the CNV profile to extract CNV groups of size $n$.
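The window extraction can be illustrated with a short Python sketch. The function name `cnv_groups` is hypothetical, and the sketch assumes that the window advances one CNV at a time and that each group holds exactly n consecutive CNVs, so a profile with l CNVs yields l - n + 1 groups.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def cnv_groups(profile: Sequence[T], n: int) -> List[Sequence[T]]:
    """Return every run of n consecutive CNVs, in chromosomal order."""
    if n < 1:
        raise ValueError("group size n must be at least 1")
    # A profile with fewer than n CNVs yields no groups (p_i = 0).
    return [profile[m:m + n] for m in range(len(profile) - n + 1)]


# Profile of six position-sorted CNVs, labeled as in Figure 1.
groups = cnv_groups(list("ABCDEF"), 3)
```

Here `groups` holds the four windows A-B-C, B-C-D, C-D-E and D-E-F.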

Then, we propose the whole genome CNV group kernel, $K_{WG}$, to measure the similarity between two CNV group series $P_i$ and $P_j$ as

$$K_{WG}(P_i,P_j)=\begin{cases}0 & \text{if } p_i\times q_j=0\\[4pt] \displaystyle\sum_{z=1}^{\mathrm{Max}(p_i,q_j)}\mathrm{Max}\big(K_{PG}(G_z^i,G_{z-1}^j),\,K_{PG}(G_z^i,G_z^j),\,K_{PG}(G_z^i,G_{z+1}^j)\big) & \text{if } p_i\times q_j\neq 0\end{cases} \tag{3}$$

where $K_{PG}(\cdot,\cdot)$ is the pair CNV group kernel from (1). $K_{WG}$ measures the similarity between the pair CNV groups at the same position and aggregates these similarities to calculate the similarity between two CNV group series. The second maximum operation in the definition of $K_{WG}$ searches for the group-to-group correspondence with the highest similarity to align CNV groups in the two CNV group series, as shown in Figure 4.

Figure 4. Aligning $G_z^i$ to the group-to-group correspondence with the highest similarity among $G_{z-1}^j$, $G_z^j$ and $G_{z+1}^j$.

The kernel-based association test described in the following section requires a kernel similarity matrix $K$. $K$ is a $d\times d$ matrix, where $K_{ij}=K_{WG}(P_i,P_j)$ and $d$ is the number of CNV profiles. $K_{ij}$ expresses the similarity between CNV profiles $i$ and $j$ as measured by $K_{WG}$.
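A possible Python sketch of Equation (3) and of the similarity matrix is given below. The names `k_wg`, `kernel_matrix` and `toy_k_pg` are hypothetical. The group kernel is passed in as a function so that the block stays self-contained, and `toy_k_pg` is only a stand-in for the pair CNV group kernel. The boundary handling is an assumption: neighbors that fall outside the series are skipped, and a term whose group in the first series does not exist contributes zero.

```python
from typing import Callable, List, Sequence

Group = Sequence  # a CNV group; the element type is left open in this sketch


def k_wg(p_i: List[Group], p_j: List[Group],
         k_pg: Callable[[Group, Group], float]) -> float:
    """Whole genome CNV group kernel, following Equation (3)."""
    if not p_i or not p_j:  # p_i * q_j = 0
        return 0.0
    total = 0.0
    # Assumption: terms with z beyond the end of p_i add nothing, so
    # summing up to Max(p_i, q_j) reduces to iterating over p_i.
    for z in range(len(p_i)):
        # Compare G_z^i with G_{z-1}^j, G_z^j and G_{z+1}^j where they exist.
        candidates = [k_pg(p_i[z], p_j[w])
                      for w in (z - 1, z, z + 1) if 0 <= w < len(p_j)]
        total += max(candidates, default=0.0)
    return total


def kernel_matrix(series: List[List[Group]],
                  k_pg: Callable[[Group, Group], float]) -> List[List[float]]:
    """d x d matrix with entries K_ij = K_WG(P_i, P_j)."""
    return [[k_wg(p, q, k_pg) for q in series] for p in series]


def toy_k_pg(g: Group, h: Group) -> float:
    """Stand-in group kernel: 1 for identical groups, 0 otherwise."""
    return 1.0 if list(g) == list(h) else 0.0


P_i = [("A", "B"), ("B", "C"), ("C", "D")]
P_j = [("B", "C"), ("C", "D")]
K = kernel_matrix([P_i, P_j], toy_k_pg)
```

In this toy case the second and third groups of `P_i` each find an identical neighbor in `P_j`, so the off-diagonal entries of `K` equal 2.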

2.3. Kernel-Based Association Test

We use the following logistic regression model to test the association between CNV sequential order and a disease related trait

$$\mathrm{logit}[\Pr(y_i=1)]=\beta_0+Z\beta+f(P_i) \tag{4}$$

where $y_i$ is the status of the disease-related trait, with $y_i=1$ denoting the presence of the trait and $y_i=0$ otherwise; $i=1,2,\ldots,d$ indexes the CNV profiles; and $Z$ is the covariate matrix including information such as age and gender. $P_i$ is the CNV group series of the profile $R_i$, as explained previously. $f(\cdot)$ is a function spanned by the whole genome CNV group kernel $K_{WG}(\cdot,\cdot)$. According to Equation (4), the hypothesis of no association between the CNV sequential order and the existence of a disease-related trait can be tested as $H_0: f(\cdot)=0$. One way to test this is to treat $f(\cdot)$ as a random effect vector distributed as $N(0,\tau K)$, where $\tau\geq 0$ and $K$ is the $d\times d$ similarity matrix generated by $K_{WG}$, treated as the covariance matrix of the random effect, as defined in [6]. Liu et al. [9] have shown that testing $H_0: f(\cdot)=0$ is equivalent to testing $H_0:\tau=0$ in the logistic mixed effect model. Moreover, $\tau$ is a variance component parameter in the logistic mixed effect model, which can be tested using a restricted maximum likelihood-based score test [9,10].

We use the following score test statistic, where $\hat{y}$ is estimated under the null model $\mathrm{logit}[\Pr(y_i=1)]=\beta_0+Z\beta$ and $K$ is the similarity matrix explained in the previous section.

$$Q=(y-\hat{y})^{\top}K(y-\hat{y}) \tag{5}$$

Then, we use the Davies method [11], as implemented in the CKAT R package [6], to calculate the p-value of the proposed kernel-based association test. The SMCKAT workflow is summarized in Figure 5.

Figure 5. SMCKAT workflow diagram.
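The score statistic of Equation (5) can be sketched in Python under two simplifying assumptions that depart from the method described above. First, the null model has no covariates, so its fitted probability is the sample mean of the trait for every subject. Second, the p-value is approximated by permuting the trait labels; this is a swapped-in technique used only to keep the example free of external libraries, not the Davies method used in the paper. The function names and the toy data are hypothetical.

```python
import random
from typing import Sequence


def score_statistic(y: Sequence[int], K: Sequence[Sequence[float]]) -> float:
    """Q = (y - y_hat)^T K (y - y_hat) under an intercept-only null model."""
    d = len(y)
    # Without covariates, the null logistic fit gives y_hat_i = mean(y).
    y_hat = sum(y) / d
    r = [yi - y_hat for yi in y]
    return sum(r[i] * K[i][j] * r[j] for i in range(d) for j in range(d))


def permutation_p_value(y: Sequence[int], K: Sequence[Sequence[float]],
                        n_perm: int = 2000, seed: int = 0) -> float:
    """Monte Carlo p-value: share of label shuffles with Q at least as large."""
    rng = random.Random(seed)
    observed = score_statistic(y, K)
    labels = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if score_statistic(labels, K) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Toy data: subjects 1-2 and subjects 3-4 have similar CNV profiles.
y = [1, 1, 0, 0]
K = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]
Q = score_statistic(y, K)
p = permutation_p_value(y, K)
```

Q is large when subjects with similar CNV profiles also share the same trait status, which is the situation the test is designed to detect.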

2.4. Common and Rare CNV Data

Biologists generally assign CNVs to one of two major types, depending on the length of the affected chromosomal region and occurrence frequency: copy number polymorphisms (CNPs) and rare variants [4]. CNPs are widespread in the general population, with an average occurrence frequency greater than one percent while rare variants are much longer than CNPs, ranging from hundreds of thousands of base pairs to over one million base pairs.

We apply SMCKAT on both rare and common CNV public domain genome sequencing data sets to evaluate the performance on both CNV types. The two CNV data sets used in this study are from individuals with rhabdomyosarcoma (RMS) cancer and autism spectrum disorder (ASD). The RMS data set [12] contains the common CNVs for 44 subjects, while the ASD data set [13] has the rare CNVs of 588 subjects. In both data sets, each CNV is presented by chromosomal position, type, and dosage.

3. Simulation Studies

We conducted simulations to evaluate the performance of SMCKAT and to ensure that it properly controls type I and II errors and has relatively high power in detecting existing associations. Besides SMCKAT, MCKAT and CKAT are also studied. We conduct our simulation studies under two main scenarios. In the first scenario, we evaluate the performance of SMCKAT on the rare CNV data. In the second scenario, we evaluate the performance of SMCKAT on the common CNV data.

We use the ASD dataset and the RMS dataset in the first and second simulation scenarios, respectively. These datasets are studied in the real data analysis, and further details on them are given in Section 4. We simulated $10^5$ datasets for each simulation scenario.

The ASD dataset has the same dosage value for all deletions and similarly the same dosage value for all amplifications. Therefore, we randomly generate other values for the CNV dosage to conduct our simulation studies and investigate the dosage effect in identifying existing associations. The simulated dosage value can take 0 or 1 for deletion types and 3, 4, …, 7 for amplification types. We use equal probabilities when generating random dosage values for deletion and amplification, 0.5 and 0.2, respectively.
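The dosage simulation described above could look like the following Python sketch. The function name and the use of Python's `random` module are assumptions; the type codes 1 (deletion) and 3 (amplification) follow the indicator functions in Equation (6).

```python
import random

DELETION, AMPLIFICATION = 1, 3  # type codes as used in Equation (6)


def simulate_dosage(cnv_type: int, rng: random.Random) -> int:
    """Draw a dosage uniformly from the values allowed for the CNV type."""
    if cnv_type == DELETION:
        return rng.choice([0, 1])           # probability 0.5 each
    if cnv_type == AMPLIFICATION:
        return rng.choice([3, 4, 5, 6, 7])  # probability 0.2 each
    raise ValueError(f"unknown CNV type: {cnv_type}")


rng = random.Random(2021)  # fixed seed so the draws are reproducible
dosages = [simulate_dosage(AMPLIFICATION, rng) for _ in range(5)]
```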

A case-control phenotype is generated for both SMCKAT and MCKAT from the following logistic model that we proposed in [8],

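$$\begin{aligned}\mathrm{logit}(\Pr(Y_i=1))={}&\beta_0+\sum_{j=1}^{m_i}\beta_j^{Len}\big(X_{ij}^{(2)}-X_{ij}^{(1)}\big)+\sum_{j=1}^{m_i}\Big(\beta_j^{Del}\,I[X_{ij}^{(3)}=1]+\beta_j^{Amp}\,I[X_{ij}^{(3)}=3]\Big)+\sum_{j=1}^{m_i}\beta_j^{Dsg}\,\big|X_{ij}^{(4)}-2\big|\\&+\sum_{j=1}^{m_i}\beta_j^{Len*Del*Dsg}\big(X_{ij}^{(2)}-X_{ij}^{(1)}\big)\times I[X_{ij}^{(3)}=1]\times X_{ij}^{(4)}+\sum_{j=1}^{m_i}\beta_j^{Len*Amp*Dsg}\big(X_{ij}^{(2)}-X_{ij}^{(1)}\big)\times I[X_{ij}^{(3)}=3]\times X_{ij}^{(4)}\end{aligned} \tag{6}$$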

where $X_{ij}=(X_{ij}^{(1)},X_{ij}^{(2)},X_{ij}^{(3)},X_{ij}^{(4)})$ is the $j$th CNV of the $i$th individual, as defined previously. $\beta_0$ corresponds to a baseline disease rate. $\beta_j^{Len}$ controls the effect of chromosomal position, and $\beta_j^{Del}$ and $\beta_j^{Amp}$ are the log ratios of CNV $j$ being a deletion versus an amplification and vice versa. $\beta_j^{Del}$ and $\beta_j^{Amp}$ share the same magnitude but have opposite signs. $\beta_j^{Len*Amp*Dsg}$ and $\beta_j^{Len*Del*Dsg}$ allow the effect of the chromosomal position and CNV type to differ by dosage in CNV $j$.
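As an illustration of Equation (6), the following Python sketch computes the linear predictor for one subject and draws a case-control label from it. It assumes, for brevity, that every coefficient is shared across CNVs rather than indexed by $j$, and the function names, coefficient values and example CNVs are hypothetical.

```python
import math
import random
from typing import Sequence, Tuple

CNV = Tuple[float, float, int, float]  # (start, end, type, dosage)


def logit_eq6(cnvs: Sequence[CNV], b0: float, b_len: float, b_del: float,
              b_amp: float, b_dsg: float, b_len_del_dsg: float,
              b_len_amp_dsg: float) -> float:
    """Linear predictor of Equation (6) with coefficients shared across j."""
    eta = b0
    for start, end, kind, dosage in cnvs:
        length = end - start
        is_del = 1.0 if kind == 1 else 0.0
        is_amp = 1.0 if kind == 3 else 0.0
        eta += b_len * length                             # length effect
        eta += b_del * is_del + b_amp * is_amp            # type effect
        eta += b_dsg * abs(dosage - 2)                    # dosage effect
        eta += b_len_del_dsg * length * is_del * dosage   # deletion interaction
        eta += b_len_amp_dsg * length * is_amp * dosage   # amplification interaction
    return eta


def simulate_phenotype(eta: float, rng: random.Random) -> int:
    """Draw Y_i from a Bernoulli distribution with probability expit(eta)."""
    p = 1.0 / (1.0 + math.exp(-eta))
    return 1 if rng.random() < p else 0


# One deletion and one amplification, with positions in arbitrary units.
subject = [(1.0, 3.0, 1, 1), (5.0, 6.0, 3, 4)]
eta = logit_eq6(subject, b0=-1.0, b_len=0.1, b_del=0.5, b_amp=-0.5,
                b_dsg=0.2, b_len_del_dsg=0.01, b_len_amp_dsg=0.02)
label = simulate_phenotype(eta, random.Random(0))
```

With these illustrative coefficients the two CNVs contribute 0.92 and 0.08, which offset the intercept of -1, so the subject is a case with probability one half.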

After generating phenotypes for SMCKAT and MCKAT, we use the following logistic model, proposed in [6], to generate the phenotypes for the CKAT method:

$$\mathrm{logit}(\pi_i)=\beta_0+\sum_{j=1}^{m_i}\Big(\beta_j^{Del}\,I[X_{ij}^{(2)}=1]+\beta_j^{Dup}\,I[X_{ij}^{(2)}=3]\Big)X_{ij}^{(1)} \tag{7}$$

where $X_{ij}=(X_{ij}^{(1)},X_{ij}^{(2)})$ is the $j$th CNV of the $i$th subject, $\pi_i=\Pr(Y_i=1)$, $\beta_0$ is the prevalence rate of the disease, and $\beta_j^{Dup}$ and $\beta_j^{Del}$ are the log odds ratios of CNV $j$ for duplication and deletion, respectively.

Simulation Results

The QQ-plots of p-values of SMCKAT, MCKAT and CKAT under both simulation scenarios are presented in Figure 6.

Figure 6. QQ-plots of the p-values of SMCKAT, MCKAT and CKAT under the first (a) and second (b) simulation scenarios.

Based on QQ-plot (a), SMCKAT and MCKAT lie on the 45 degree line under the first simulation scenario. This indicates that both SMCKAT and MCKAT can properly handle the type I and II error rates under different nominal significance levels, even as low as $10^{-5}$, when dealing with the rare CNV dataset. However, CKAT shows a higher chance of committing a type II error in detecting existing associations between the rare CNVs and the phenotype.

As is shown in QQ-plot (b), both SMCKAT and MCKAT can protect the correct type I and II error rate at different nominal significance levels in dealing with the common CNV data. We observe that SMCKAT is a little conservative when the significance level is small. However, CKAT shows a weak performance in handling the type I error and detecting existing associations between the common CNVs and phenotype.

The empirical powers of SMCKAT, MCKAT and CKAT under the first and second scenarios are presented in Figure 7 and Figure 8, respectively. As shown in Figure 7, SMCKAT and MCKAT have nearly identical power when dealing with rare CNVs. However, CKAT shows lower power than SMCKAT and MCKAT. The reason might be that CKAT does not consider the CNV dosage information when testing the association.

Figure 7. Empirical power of SMCKAT, MCKAT and CKAT under the first simulation scenario, rare CNV data.

Figure 8. Empirical power of SMCKAT, MCKAT and CKAT under the second simulation scenario, common CNV data.

Similarly, in the second simulation scenario, SMCKAT and MCKAT have similar power. However, CKAT shows low power when dealing with common CNV data. This might be due to the CKAT scanning algorithm for aligning CNVs in the CNV profiles. The CKAT shift-by-one scanning algorithm can capture similarity between only a limited number of CNVs, which may result in low performance when dealing with common CNVs.

4. Real Data Application Results

We conducted SMCKAT analysis, for different CNV group sizes, on single chromosomes and the whole genome to test the association between CNV sequential order and disease-related traits. The disease-related traits studied in this paper are cancer subtype for the RMS data set and disease status for the ASD data set. We compared SMCKAT results with those obtained from MCKAT and CKAT to evaluate SMCKAT performance on real CNV data.

4.1. CNV Analysis on Rhabdomyosarcoma Data Set

First, we conducted the experiment on the RMS data. RMS occurs as two major histological subtypes, embryonal (ERMS) and alveolar (ARMS). The classification of the RMS subtype has a direct effect on the patient treatment options. The RMS data include a total of 59,131 CNVs for 25 alveolar and 19 embryonal cancers. We apply SMCKAT to each of the 23 chromosome pairs, with different CNV group sizes, to test the association between CNV sequential order and RMS subtype. Bonferroni correction is used to adjust for multiple testing and control the family-wise error rate (FWER) at $\alpha=0.05$. Since 22 chromosomes and a sex chromosome are being tested, the p-value threshold for whole-chromosome significance is calculated as $0.05/23=2.2\times10^{-3}$. SMCKAT identifies four of the 23 chromosomes as having a CNV sequential order that is significantly associated with the RMS subtype. The p-values of SMCKAT for these four chromosomes are reported in Table 1.

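Table 1. p-values of testing the association between CNV sequential order and RMS subtype for different CNV group sizes. n is the group size and (#) denotes the total number of CNVs on the chromosome.

| Chr. | #CNV | n = 1 | n = 2 | n = 3 | n = 4 | n = 5 | n = 6 |
|---|---|---|---|---|---|---|---|
| 2 | 5584 | $2.45\times10^{-2}$ | $5.10\times10^{-2}$ | $8.31\times10^{-2}$ | $3.49\times10^{-3}$ | $4.25\times10^{-3}$ | $3.21\times10^{-2}$ |
| 8 | 5365 | $2.61\times10^{-5}$ | $7.37\times10^{-6}$ | $1.13\times10^{-6}$ | $7.63\times10^{-7}$ | $4.99\times10^{-8}$ | 0 |
| 11 | 3449 | $2.03\times10^{-2}$ | $8.26\times10^{-3}$ | $2.93\times10^{-3}$ | $1.54\times10^{-3}$ | $5.82\times10^{-4}$ | $1.20\times10^{-4}$ |
| 13 | 2462 | $1.80\times10^{-3}$ | $3.56\times10^{-3}$ | $4.86\times10^{-3}$ | $6.06\times10^{-3}$ | $7.89\times10^{-3}$ | $6.23\times10^{-2}$ |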

Based on the results, SMCKAT identifies the CNV sequential order in chromosomes 2, 8, 11, and 13 as significantly associated with the RMS subtype at the Bonferroni-corrected p-value threshold of $2.2\times10^{-3}$. These results are consistent with existing biological knowledge, which shows the ability of SMCKAT to identify CNV sequential orders significantly associated with specific disease-related traits.

For example, ref. [14] shows that RMS is associated with specific chromosomal abnormalities that differentiate ARMS and ERMS. Based on their study, approximately 80% of ARMS tumors display a translocation between the FOXO1 transcription factor gene located on chromosome 13 and the PAX3 transcription factor gene on chromosome 2, and ERMS tumors show a higher frequency of specific genetic mutations on chromosome 11 than ARMS. Ref. [15] revealed the same earlier. Furthermore, ref. [16] found that the ARMS subtype is significantly associated with amplifications on chromosome 8. Our findings suggest that, besides gene mutations and chromosomal translocations, another mechanism, CNVs, can play a significant role in disease-related traits.

We tested different CNV group sizes when applying SMCKAT to the RMS data set. Based on the results reported in Table 1, SMCKAT shows the strongest evidence and the smallest p-value for chromosome 8 for all CNV group sizes. This means that subjects with the same RMS subtype may have a similar CNV sequential order on chromosome 8.

We tested SMCKAT on the RMS data set for group sizes greater than six. We observed an increasing trend in p-values by increasing the group size, which shows a decline in the significance level of the CNV sequential order associated with the RMS subtype.

4.2. CNV Analysis on Cytogenetic Bands in RMS

Based on the results reported in Table 1, there is strong statistical evidence, supported by a p-value near zero, that the CNV sequential order of chromosome 8 with a group size of six is significantly associated with the RMS subtype. Therefore, we picked chromosome 8 with a CNV group size of six for further analysis. We partitioned chromosome 8 into smaller regions based on the cytogenetic bands. We applied SMCKAT to each cytogenetic band to check whether SMCKAT is capable of detecting regions more specific than whole chromosomes. Then, we compared the results with those of MCKAT and CKAT. Table 2 contains the p-values of the association test for each cytogenetic band in chromosome 8. Since 40 cytogenetic bands are being tested in chromosome 8, the p-value threshold for band significance is calculated as $0.05/40=1.25\times10^{-3}$.

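Table 2. p-values of testing the association between RMS subtype and CNVs in the chromosome 8 cytogenetic bands by SMCKAT, MCKAT and CKAT. (*) denotes a significant association between RMS subtype and CNVs; (#) denotes the total number of CNVs on the band.

| Arm | Band | Start | Stop | #CNVs | SMCKAT | MCKAT | CKAT |
|---|---|---|---|---|---|---|---|
| p | 23.3 | 1 | 2,300,000 | 113 | $9.6\times10^{-2}$ | $3.4\times10^{-4}$ * | $4.917\times10^{-1}$ |
| p | 23.2 | 2,300,001 | 6,300,000 | 85 | $3.0\times10^{-2}$ | $2.0\times10^{-2}$ | $3.939\times10^{-1}$ |
| p | 23.1 | 6,300,001 | 12,800,000 | 304 | $1.8\times10^{-4}$ * | $4.7\times10^{-8}$ * | $4.755\times10^{-1}$ |
| p | 22.0 | 12,800,001 | 19,200,000 | 101 | $2.8\times10^{-2}$ | $8.2\times10^{-3}$ | $4.327\times10^{-1}$ |
| p | 21.3 | 19,200,001 | 23,500,000 | 102 | $1.1\times10^{-1}$ | $2.5\times10^{-2}$ | $4.237\times10^{-1}$ |
| p | 21.2 | 23,500,001 | 27,500,000 | 82 | $3.4\times10^{-2}$ | $3.6\times10^{-2}$ | $4.717\times10^{-1}$ |
| p | 21.1 | 27,500,001 | 29,000,000 | 50 | $2.5\times10^{-2}$ | $1.6\times10^{-2}$ | $4.948\times10^{-1}$ |
| p | 12.0 | 29,000,001 | 36,700,000 | 190 | $1.3\times10^{-6}$ * | $3.7\times10^{-5}$ * | $4.658\times10^{-1}$ |
| p | 11.23 | 36,700,001 | 38,500,000 | 48 | 1.0 | $3.7\times10^{-3}$ | $3.916\times10^{-1}$ |
| p | 11.22 | 38,500,001 | 39,900,000 | 57 | $9.3\times10^{-2}$ | $8.4\times10^{-3}$ | $4.613\times10^{-1}$ |
| p | 11.21 | 39,900,001 | 43,200,000 | 147 | $4.4\times10^{-3}$ | $1.0\times10^{-4}$ * | $3.655\times10^{-1}$ |
| p | 11.1 | 43,200,001 | 45,200,000 | 72 | $8.8\times10^{-2}$ | $2.8\times10^{-2}$ | $4.584\times10^{-1}$ |
| q | 11.1 | 45,200,001 | 47,200,000 | 41 | 1.0 | $2.1\times10^{-2}$ | $4.436\times10^{-1}$ |
| q | 11.21 | 47,200,001 | 51,300,000 | 200 | $4.4\times10^{-3}$ | $8.4\times10^{-5}$ * | $4.064\times10^{-1}$ |
| q | 11.22 | 51,300,001 | 51,700,000 | 6 | $9.3\times10^{-1}$ | $4.7\times10^{-2}$ | $4.200\times10^{-1}$ |
| q | 11.23 | 51,700,001 | 54,600,000 | 61 | 1.0 | $6.1\times10^{-2}$ | $4.657\times10^{-1}$ |
| q | 12.1 | 54,600,001 | 60,600,000 | 177 | $9.1\times10^{-3}$ | $7.0\times10^{-4}$ * | $4.505\times10^{-1}$ |
| q | 12.2 | 60,600,001 | 61,300,000 | 18 | 1.0 | $3.3\times10^{-2}$ | $4.502\times10^{-1}$ |
| q | 12.3 | 61,300,001 | 65,100,000 | 134 | $4.9\times10^{-2}$ | $1.1\times10^{-2}$ | $4.110\times10^{-1}$ |
| q | 13.1 | 65,100,001 | 67,100,000 | 71 | $4.4\times10^{-2}$ | $5.8\times10^{-3}$ | $4.427\times10^{-1}$ |
| q | 13.2 | 67,100,001 | 69,600,000 | 54 | $5.8\times10^{-2}$ | $4.3\times10^{-3}$ | $4.659\times10^{-1}$ |
| q | 13.3 | 69,600,001 | 72,000,000 | 62 | $1.4\times10^{-2}$ | $1.8\times10^{-3}$ | $3.762\times10^{-1}$ |
| q | 21.11 | 72,000,001 | 74,600,000 | 144 | $4.8\times10^{-1}$ | $8.4\times10^{-3}$ | $3.325\times10^{-1}$ |
| q | 21.12 | 74,600,001 | 74,700,000 | 1 | 1.0 | 1.0 | 1.0 |
| q | 21.13 | 74,700,001 | 83,500,000 | 308 | $1.0\times10^{-2}$ | $2.6\times10^{-3}$ | $4.927\times10^{-1}$ |
| q | 21.2 | 83,500,001 | 85,900,000 | 56 | $4.8\times10^{-2}$ | $2.9\times10^{-2}$ | $4.189\times10^{-1}$ |
| q | 21.3 | 85,900,001 | 92,300,000 | 185 | $4.7\times10^{-3}$ | $1.0\times10^{-4}$ * | $4.215\times10^{-1}$ |
| q | 22.1 | 92,300,001 | 97,900,000 | 182 | $1.7\times10^{-2}$ | $1.0\times10^{-2}$ | $3.072\times10^{-1}$ |
| q | 22.2 | 97,900,001 | 100,500,000 | 103 | $4.5\times10^{-2}$ | $3.9\times10^{-3}$ | $4.395\times10^{-1}$ |
| q | 22.3 | 100,500,001 | 105,100,000 | 162 | $1.2\times10^{-2}$ | $4.6\times10^{-3}$ | $4.458\times10^{-1}$ |
| q | 23.1 | 105,100,001 | 109,500,000 | 135 | $2.8\times10^{-3}$ | $2.5\times10^{-3}$ | $4.017\times10^{-1}$ |
| q | 23.2 | 109,500,001 | 111,100,000 | 33 | $9.8\times10^{-1}$ | $8.0\times10^{-1}$ | $3.005\times10^{-1}$ |
| q | 23.3 | 111,100,001 | 116,700,000 | 185 | $1.1\times10^{-2}$ | $2.3\times10^{-3}$ | $4.419\times10^{-1}$ |
| q | 24.11 | 116,700,001 | 118,300,000 | 53 | $4.6\times10^{-2}$ | $2.6\times10^{-2}$ | $4.705\times10^{-1}$ |
| q | 24.12 | 118,300,001 | 121,500,000 | 109 | $2.5\times10^{-3}$ | $2.2\times10^{-3}$ | $4.068\times10^{-1}$ |
| q | 24.13 | 121,500,001 | 126,300,000 | 151 | $2.2\times10^{-2}$ | $6.0\times10^{-3}$ | $4.856\times10^{-1}$ |
| q | 24.21 | 126,300,001 | 130,400,000 | 208 | $5.0\times10^{-2}$ | $1.9\times10^{-2}$ | $3.922\times10^{-1}$ |
| q | 24.22 | 130,400,001 | 135,400,000 | 155 | $5.5\times10^{-2}$ | $1.5\times10^{-2}$ | $4.638\times10^{-1}$ |
| q | 24.23 | 135,400,001 | 138,900,000 | 162 | $2.8\times10^{-1}$ | $7.7\times10^{-3}$ | $4.512\times10^{-1}$ |
| q | 24.3 | 138,900,001 | 145,138,636 | 354 | $8.8\times10^{-3}$ | $2.5\times10^{-8}$ * | $4.277\times10^{-1}$ |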

As shown in Table 2, both SMCKAT and MCKAT detect cytogenetic bands significantly associated with the RMS subtype, while CKAT does not identify any significant regions. MCKAT identifies eight cytogenetic bands on which CNVs are significantly associated with the RMS subtype. Two of these eight bands, 8p23.1 and 8p12.0, are identified by SMCKAT as well. This means that not only the CNV characteristics but also the CNV sequential order in these two bands is significantly associated with the RMS subtype. Based on the results, SMCKAT has the potential to provide more specific CNV regions than MCKAT when testing the association between CNVs and disease-related traits.
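The band-level comparison can be reproduced for a handful of rows with a short Python sketch. The p-values below are copied from five rows of Table 2; the variable names are illustrative, and the threshold is the Bonferroni correction of 0.05 over the 40 bands tested.

```python
ALPHA = 0.05
N_BANDS = 40                 # cytogenetic bands tested on chromosome 8
threshold = ALPHA / N_BANDS  # Bonferroni-corrected per-band threshold

# (SMCKAT p-value, MCKAT p-value) for five bands taken from Table 2.
p_values = {
    "8p23.3": (9.6e-2, 3.4e-4),
    "8p23.1": (1.8e-4, 4.7e-8),
    "8p22.0": (2.8e-2, 8.2e-3),
    "8p12.0": (1.3e-6, 3.7e-5),
    "8q24.3": (8.8e-3, 2.5e-8),
}

smckat_hits = {band for band, (p_s, _) in p_values.items() if p_s < threshold}
mckat_hits = {band for band, (_, p_m) in p_values.items() if p_m < threshold}

# Bands where both the CNV characteristics and the CNV order are significant.
both = sorted(smckat_hits & mckat_hits)
```

For these five rows, MCKAT flags four bands while SMCKAT flags only 8p23.1 and 8p12.0, which illustrates the narrower set of regions that SMCKAT reports.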

4.3. CNV Analysis on Autism Data Set

We applied SMCKAT to the ASD data set to evaluate its performance on the rare CNV type. We aimed to test whether there is any association between the sequential order of CNVs and ASD status. The ASD data set contains 1285 rare CNVs on 310 individuals with ASD and 1074 rare CNVs on 278 healthy individuals. Since the ASD data set contains only rare and large CNVs, an arbitrary CNV profile may have no or few CNVs on some chromosomes. Therefore, instead of applying SMCKAT to each of the 23 chromosomes, we applied it to the whole genome. Then, we tested whether there is any association between the whole genome CNV sequential order and the ASD status. We considered 0.05 as the p-value threshold for whole-genome significance. As reported in Table 3, there is strong statistical evidence, up to a CNV group size of five, that subjects with the same disease status have similar CNV orders in their CNV profiles. We tested SMCKAT on the ASD data for larger group sizes as well. We observed an increasing trend in p-values with increasing group size, which shows a decline in the significance level of the CNV sequential order associated with the ASD status.

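Table 3. p-values of testing the association between CNV sequential order and ASD status for different CNV group sizes.

| n | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| p-value | 0 | $7.91\times10^{-9}$ | $3.09\times10^{-6}$ | $3.62\times10^{-4}$ | $4.89\times10^{-3}$ | $1.03\times10^{-1}$ |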

5. Discussion

SMCKAT tests the association between CNVs and disease-related traits. It checks whether CNVs are randomly distributed on the chromosomes or whether their sequential orders are significant and associated with disease-related traits. Our approach has several advantages over the existing methods. First, it measures the similarity between CNV profiles by considering not only all CNV characteristics but also the CNV sequential order. To our knowledge, it is the first approach to study the association between CNV sequential order and disease-related traits. Second, it is applicable to both rare and common CNV data sets, while previous methods like CKAT cannot deal with common CNV data sets. Third, SMCKAT is more stringent than the state-of-the-art approach MCKAT in detecting significant CNV regions. Finally, SMCKAT can help biologists detect CNV regions significantly associated with any disease-related trait across a patient group instead of examining the CNVs case by case in each subject.

Although our experimental results are promising and more specific compared with the state-of-the-art kernel approach, this study has limitations. There are not many publicly available CNV data sets. Besides, most available ones do not contain all CNV features together, in particular the dosage information. Consequently, our method is tested only on one data set, an RMS data set, that includes all multi-dimensional CNV characteristics. For the ASD data set, we considered a dosage less than two for all deletions and greater than two for all amplifications to make the most of the proposed method’s capability. Applying SMCKAT to more data sets containing all CNV characteristics can help to determine its strengths and weaknesses. In addition, there is no existing study, neither biological nor computational, that has studied the CNV sequential order to be able to validate our experimental results with it.

Our study shows that CNV sequential order has the potential to play a significant role in causing disease-related traits, but more new findings can be revealed by conducting more comprehensive analysis upon the availability of data.

6. Conclusions

This paper presents a sequential multi-dimensional CNV association test identifying associations between CNVs and disease-related traits using all multi-dimensional CNV characteristics and CNV sequential order. Our method, SMCKAT, uses different kernels to measure the similarity between CNV profiles with respect to both CNV orders and characteristics. Then, the similarity in CNV profiles is compared to the similarity in disease-related traits to test for an association.

The evaluation was conducted on two types of CNV data sets, a rare CNV data set and a common CNV data set. Results indicate that our method provides statistically strong evidence that there is an association between the sequential order of CNVs and disease-related traits. Currently, SMCKAT is capable of testing the association between CNVs and qualitative disease-related traits. In our future work, we will expand the SMCKAT framework to be applicable to both qualitative and quantitative traits.

Author Contributions

N.M.E., P.J.K.: conceptualization and study design. N.M.E., D.C. and P.J.K.: data processing. N.M.E.: conducted the analysis. N.M.E.: drafting manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

All data used in this study are public domain and openly accessible; please see the Data Availability Statement. Access to the RMS data on dbGaP was provided under NCI Authorized Access #44698_6.

Informed Consent Statement

Not applicable.

Data Availability Statement

The ASD and RMS data sets supporting the conclusions of this article are available at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3213131 and https://www.ncbi.nlm.nih.gov/gap (accession number: phs000720.v3.p1), respectively (access date: 20 November 2021). The SMCKAT R package is publicly available in the GitHub repository at https://github.com/nesfehani/SMCKAT (access date: 20 November 2021).

Conflicts of Interest

The authors declare no conflict of interest.

Footnotes

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.




Articles from Life are provided here courtesy of Multidisciplinary Digital Publishing Institute (MDPI)
