Human Molecular Genetics. 2012 Dec 6;22(6):1249–1261. doi: 10.1093/hmg/dds512

Whole-genome detection of disease-associated deletions or excess homozygosity in a case–control study of rheumatoid arthritis

Chih-Chieh Wu 1,*, Sanjay Shete 2, Eun-Ji Jo 4, Yaji Xu 5, Emily Y Lu 3, Wei V Chen 3, Christopher I Amos 3,6
PMCID: PMC3578409  PMID: 23223014

Abstract

In contrast to genome-wide association studies, few comprehensive studies of the contribution of copy number variation to complex human disease susceptibility have been performed. Copy number variations are abundant in humans and represent one of the least well-studied classes of genetic variants; in addition, known rheumatoid arthritis susceptibility loci explain only a portion of familial clustering. Therefore, we performed a genome-wide study of association between deletion or excess homozygosity and rheumatoid arthritis using high-density 550 K SNP genotype data from a genome-wide association study. We used a genome-wide statistical method that we recently developed to test each contiguous SNP locus between 868 cases and 1194 controls to detect excess homozygosity or deletion variants that influence susceptibility. Our method is designed to detect statistically significant evidence of deletions or homozygosity at individual SNPs for SNP-by-SNP analyses and to combine the information among neighboring SNPs for cluster analyses. In addition to successfully detecting the known deletion variants in the major histocompatibility complex, we identified 4.3 and 28 kb clusters on chromosomes 10p and 13q, respectively, which were significant at a Bonferroni-type-corrected 0.05 nominal significance level. Independently, we performed analyses using PennCNV, an algorithm for identifying and cataloging copy numbers for individuals based on a hidden Markov model, and identified cases and controls that had chromosomal segments with copy number <2. Using Fisher's exact test for comparing the numbers of cases and controls with copy number <2 per SNP, we identified 26 significant SNPs (protective; more controls than cases) aggregating on chromosome 14 with P-values <10⁻⁸.

INTRODUCTION

Studies of the human genome have demonstrated extensive and widespread copy number variations (CNVs) of DNA sequences, such as deletions, insertions, duplications and complex multi-site variants, that indicate the presence of variable numbers of copies of large genomic regions (mostly >1 kb in size) among individuals. Comprehensive whole-genome reference maps of human CNVs by SNP microarrays and array comparative genomic hybridization have been constructed (1–3). Genomic deletions represent a variant class that is often associated with disease. Three concurrent studies that specifically investigated common deletion polymorphisms in healthy individuals demonstrated that deletion variants of various sizes are ubiquitous; they also provided comprehensive maps of deletions in the human genome (4–6). These studies provided important baseline information to enable the discovery of CNV classes and facilitate whole-genome studies of associations between disease and CNVs.

Deletion variants have long been known to cause microdeletion syndromes, such as DiGeorge syndrome, Prader-Willi syndrome and Wilms tumor (7), and are frequently observed in patients with neurodevelopmental disorders, such as autism and schizophrenia (8–11). Recent discoveries include a common 20 kb deletion upstream of the IRGM gene that is associated with Crohn's disease, a 45 kb deletion upstream of NEGR1 that is associated with the body mass index, and a deletion and duplication of KIR that is associated with HIV-1 control (12–14). A study of CNVs as trait-associated polymorphisms and expression quantitative trait loci that influence phenotype by altering gene regulation demonstrated that they contribute to the genetics of certain disease classes, such as autoimmune disorders and metabolic traits (15). However, controversy exists; it has yet to be fully ascertained to what extent CNVs account for missing heritability that is undetected by genome-wide association studies (1,15–17). In fact, compared with genome-wide association studies, few comprehensive whole-genome studies exist of the contribution of CNVs to susceptibility over a wide variety of common, complex human diseases (15,17). CNVs remain one of the least well-studied classes of genetic variants. More recently, a nucleotide-resolution map of CNVs based on whole-genome DNA sequencing data from 185 individuals in the 1000 Genomes Project was constructed, enabling the discovery, genotyping and imputation of CNVs and serving as a resource for sequencing-based association studies (18).

Rheumatoid arthritis (RA) is a common autoimmune disorder of unknown etiology; it is characterized by the destruction of the synovial joints, resulting in severe disability. It has a complex mode of inheritance and is influenced by both genetic and environmental risk factors. It affects ∼1% of individuals of European ancestry, with an estimated sibling recurrence risk of 5–10 (19–21). In addition to the established susceptibility loci of HLA-DRB1 and PTPN22 (protein tyrosine phosphatase, non-receptor type 22) in patients with severe anti-CCP-positive RA, several associated alleles of modest risk at newly identified loci have been reproducibly discovered in recent genome-wide association studies, including REL, STAT4, TNFAIP3 and BLK. On the basis of estimates of a recent meta-analysis, validated RA risk alleles at major histocompatibility complex (MHC) and non-MHC loci explained ∼12 and 4% of phenotypic variance, respectively; a large portion of heritable variation remains to be discovered (22).

We recently developed a genome-wide statistical method for detecting disease-associated deletion variants or excess homozygosity using high-density SNP genotype data in genome-wide association studies (23). Our method is based on identifying areas in which the excess homozygosity of cases differs from that of controls and is structured to test each contiguous SNP locus across the whole genome between a group of cases and a group of controls from a genome-wide association study. The method has proved to be useful and robust in the presence of linkage disequilibrium. It provides outcomes for SNP-by-SNP analyses and cluster analyses on the basis of combined evidence from multiple neighboring SNPs in case–control studies. Genome-wide association studies are designed to discover individual disease-associated SNPs; in contrast, methods for detecting CNVs and deletions are generally designed to find small chromosomal segments (4,6,23–25). In this study, we used our method to perform a comprehensive genome-wide study of associations between common deletion variants or excess homozygosity and RA susceptibility using an Illumina HumanHap550 array in 868 RA patients and 1194 controls from the North American Rheumatoid Arthritis Consortium (20). The SNP-by-SNP analyses identified individual significant SNPs over the whole genome at a nominal significance level of 10⁻⁸; the cluster analyses detected candidate deleted segments in which at least two neighboring significant SNPs were overly aggregated. In addition to successfully detecting known deleterious deletion variants on HLA-DRB1 and C4 genes that increase RA risk in the MHC region, we identified additional 4.3 and 28 kb clusters on chromosomes 10p (5 316 846–5 321 159) and 13q (20 783 404–20 811 429), respectively, which were significant at a 0.05 nominal significance level after correction for multiple comparisons.

Independently, we performed analyses using the PennCNV method and identified cases and controls that had chromosomal segments with copy number <2. PennCNV is an algorithm for identifying and cataloging copy numbers for individuals on the basis of a hidden Markov model (25). Using Fisher's exact test to compare the numbers of cases and controls per SNP, we identified 26 significant SNPs (protective; more controls than cases) that were overly aggregated on chromosome 14 with P-values <10⁻⁸ and an additional 49 SNPs on chromosomes 2, 14 and 20 with P-values between 10⁻⁸ and 10⁻⁵. In this report, we extend genome-wide association studies to deletion and excess homozygosity detection for finding additional common genetic variants that influence RA susceptibility. We also provide a strategy and analytical framework that can be used at no additional cost: using SNP and intensity data from genome-wide association studies to detect disease-associated deletion variants or excess homozygosity and identify individual patients with commonly shared disease-associated deletion variants.

RESULTS

For SNP-by-SNP analyses, we performed the z-score test to assess the statistical significance of differences in homozygosity proportions between the 868 cases and 1194 controls at each of the 550 K contiguous SNP loci. We found that 535 individual SNPs reached genome-wide significance (defined as P-value <10⁻⁸). Table 1 shows the frequencies of SNP genotypes, missing SNP genotypes, SNPs tested and significant SNPs by chromosome and arm. The number of SNPs tested is the difference in counts between SNP genotypes and missing SNP genotypes.
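As a rough illustration of the SNP-by-SNP step, the sketch below applies a generic pooled two-proportion z-test to homozygosity counts at a single SNP. This is an assumption made for illustration only: the exact statistic used in the study is defined in the Methods and in (23), the function name `homozygosity_z_test` is hypothetical, and the counts in the example are invented rather than taken from the data.

```python
import math

def homozygosity_z_test(hom_cases, n_cases, hom_controls, n_controls):
    """Pooled two-proportion z-test (illustrative, not the authors' exact statistic).

    Returns (z, two-sided P-value) for the difference in the proportion of
    homozygous genotypes between cases and controls at one SNP.
    """
    p_cases = hom_cases / n_cases
    p_controls = hom_controls / n_controls
    pooled = (hom_cases + hom_controls) / (n_cases + n_controls)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_cases + 1 / n_controls))
    z = (p_cases - p_controls) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided standard normal tail
    return z, p_value

# Hypothetical counts: 600 of 868 cases and 600 of 1194 controls are homozygous.
z, p = homozygosity_z_test(600, 868, 600, 1194)
print(f"z = {z:.2f}, P = {p:.2e}, genome-wide significant: {p < 1e-8}")
```

A SNP would be flagged in the first stage when its P-value falls below the nominal 10⁻⁸ threshold used in the text.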

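Table 1.

Frequencies of SNPs, missing SNPs, SNPs tested and significant SNPs by chromosome and arm

| Chromosome | Arm | Number of SNPs genotyped | Number of missing SNPs | Number of SNPs tested (a) | Number of significant SNPs (b) |
|---|---|---|---|---|---|
| 1 | p | 21 533 | 81 | 21 452 | 19 |
| 1 | q | 19 396 | 104 | 19 292 | 13 |
| 2 | p | 18 526 | 98 | 18 428 | 9 |
| 2 | q | 25 564 | 105 | 25 459 | 22 |
| 3 | p | 18 457 | 55 | 18 402 | 8 |
| 3 | q | 18 233 | 94 | 18 139 | 7 |
| 4 | p | 9488 | 35 | 9453 | 3 |
| 4 | q | 23 140 | 105 | 23 035 | 10 |
| 5 | p | 9106 | 37 | 9069 | 11 |
| 5 | q | 24 506 | 96 | 24 410 | 13 |
| 6 | p | 13 964 | 47 | 13 917 | 81 |
| 6 | q | 21 610 | 67 | 21 543 | 12 |
| 7 | p | 13 249 | 30 | 13 219 | 9 |
| 7 | q | 15 995 | 69 | 15 926 | 14 |
| 8 | p | 12 222 | 64 | 12 158 | 3 |
| 8 | q | 18 768 | 74 | 18 694 | 13 |
| 9 | p | 10 878 | 31 | 10 847 | 6 |
| 9 | q | 15 250 | 45 | 15 205 | 19 |
| 10 | p | 9616 | 18 | 9598 | 7 |
| 10 | q | 18 715 | 53 | 18 662 | 21 |
| 11 | p | 10 550 | 49 | 10 501 | 11 |
| 11 | q | 15 927 | 46 | 15 881 | 13 |
| 12 | p | 8048 | 43 | 8005 | 7 |
| 12 | q | 18 317 | 79 | 18 238 | 11 |
| 13 | p | | | | |
| 13 | q | 20 242 | 84 | 20 158 | 11 |
| 14 | p | | | | |
| 14 | q | 17 951 | 62 | 17 889 | 9 |
| 15 | p | | | | |
| 15 | q | 16 166 | 47 | 16 119 | 19 |
| 16 | p | 6382 | 38 | 6344 | 13 |
| 16 | q | 10 078 | 43 | 10 035 | 11 |
| 17 | p | 4526 | 11 | 4515 | 10 |
| 17 | q | 9501 | 31 | 9470 | 17 |
| 18 | p | 3515 | 3 | 3512 | 4 |
| 18 | q | 12 935 | 63 | 12 872 | 9 |
| 19 | p | 3704 | 5 | 3699 | 20 |
| 19 | q | 5532 | 13 | 5519 | 12 |
| 20 | p | 6697 | 17 | 6680 | 7 |
| 20 | q | 7146 | 27 | 7119 | 17 |
| 21 | p | 1 | | 1 | |
| 21 | q | 8050 | 18 | 8032 | 13 |
| 22 | p | | | | |
| 22 | q | 8205 | 33 | 8172 | 21 |
| Total | | | | 529 669 | 535 |

(a) The number of SNPs tested is the difference in counts between SNP genotypes and missing SNPs.

(b) These SNPs were statistically significant by the z-score test at a nominal significance level of 10⁻⁸ for SNP-by-SNP analyses.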

Figure 1 displays a graphical summary of outcomes of the genome-wide association scan between deletion variants or excess homozygosity and RA risk in which SNPs are plotted according to corresponding chromosomal locations with the values of –log10(P-values). The largest association signal lies in the MHC region with a maximal aggregation of neighboring significant SNPs. We identified the deleterious deletion variants that encompassed HLA-DRB1 and C4 genes in the MHC region in which deletions and CNVs were previously discovered in RA patients (26,27). Deletions in the HLA-DRB1 region are a common characteristic of HLA class II haplotypes, and the major DR4 and DR9 haplotypes associated with RA belong to a related haplotype family with multiple DRB loci, including several pseudogenes. In contrast, the DR1 haplotypes associated with RA are members of a distinct family of haplotypes that have fewer DRB loci. Thus, we would expect to find copy number differences in DRB genes between RA cases and controls, which contain haplotype families with more variable numbers of DRB loci.

Figure 1.


Genome-wide scan of association between the homozygosity level and rheumatoid arthritis, using the z-score test for SNP-by-SNP analyses. SNPs were plotted according to corresponding chromosomal locations with the values of –log10(P-values), using the z-score test. The largest association signal lay in the MHC region, with a maximal aggregation of neighboring significant SNPs (nominal significance level, 10⁻⁸) that encompassed HLA-DRB1 and C4 genes, in which deleterious deletions had been previously discovered in patients.

In this study, a cluster was defined as two or more significant SNPs gathered on a short chromosomal segment of pre-determined length on the basis of the outcome of the first-stage SNP-by-SNP analysis. Because the tagged SNP genotypes used in genome-wide association studies are not uniformly distributed over the whole genome and because gene-sparse regions may have fewer SNPs genotyped and higher probabilities of containing genomic deletions, we used two different cluster criteria to determine the maximal spacing allowed between adjacent significant SNPs in a cluster. Under the first criterion, two successive significant SNPs are separated by 20 or fewer SNP loci; under the second, two successive significant SNPs are separated by a distance of at most 100 kb. Under these criteria, a cluster begins with a significant SNP locus and ends with another significant SNP locus. Clusters can continuously extend this way to accommodate more than two significant SNPs. The mean distance was 5.39 kb between adjacent SNPs in this application; a 20-SNP-locus chromosomal segment spans a mean of 107.8 kb. We previously used extensive simulations to demonstrate that our method is effective at detecting disease-associated deletions and excess homozygosity under these cluster criteria (23).
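The two cluster criteria can be sketched as a single chaining routine with a pluggable gap rule. This is a minimal sketch under stated assumptions: each significant SNP is represented by a hypothetical `(locus_index, position_kb)` pair, "separated by 20 or fewer SNP loci" is read as an index difference of at most 20, and the toy coordinates are invented.

```python
def chain_clusters(sig_snps, gap_ok):
    """Chain sorted significant SNPs into clusters of two or more.

    sig_snps: list of (locus_index, position_kb), sorted along one chromosome arm.
    gap_ok:   function(prev, cur) -> True if the two SNPs may share a cluster.
    """
    if not sig_snps:
        return []
    clusters, current = [], [sig_snps[0]]
    for prev, cur in zip(sig_snps, sig_snps[1:]):
        if gap_ok(prev, cur):
            current.append(cur)          # extend the running cluster
        else:
            if len(current) >= 2:        # singletons are not clusters
                clusters.append(current)
            current = [cur]
    if len(current) >= 2:
        clusters.append(current)
    return clusters

# Criterion 1: at most 20 SNP loci apart (assumed to mean index difference <= 20).
by_loci = lambda a, b: b[0] - a[0] <= 20
# Criterion 2: at most 100 kb apart.
by_distance = lambda a, b: b[1] - a[1] <= 100.0

# Invented example: four significant SNPs on one arm.
sig = [(10, 50.0), (25, 140.0), (60, 160.0), (200, 900.0)]
clusters_loci = chain_clusters(sig, by_loci)
clusters_kb = chain_clusters(sig, by_distance)
print(clusters_loci)  # one cluster containing the first two SNPs
print(clusters_kb)    # one cluster containing the first three SNPs
```

In this toy example the two criteria disagree about the third SNP, which mirrors how sparse SNP coverage can merge or split clusters depending on the rule used.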

Cluster analysis under the first criterion

Under the first cluster criterion of two successive significant SNPs separated by no more than 20 SNP loci, we identified 14 distinct clusters of neighboring significant SNPs over the whole genome. Each is described in detail in Table 2. Common variants of the first cluster in the MHC region contributed the strongest statistical signal of risk. We found that 54 significant SNPs were overly aggregated on a short segment of 252 contiguous SNP loci in the first cluster. Excluding the two significant SNPs at the ends of the cluster (one at each end), 52 significant SNPs lay inside this cluster. In this case, T = 13 917, k = 81, w = 252 and x = 54 were used for formula (1). The null probability was p = k/T = 81/13 917 ≈ 5.82 × 10⁻³, as 13 917 contiguous SNP loci were tested individually on the p arm of chromosome 6 and 81 of them were significant at the nominal significance level of 10⁻⁸ (shown in 5th and 6th columns and 12th row of Table 1). The exact P-value of this cluster was 2.91 × 10⁻⁶⁶ (Table 2), using formula (1).

Table 2.

Cluster analysis under the first criterion of two successive significant SNPs separated by 20 or fewer SNP loci

| Cluster number | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Chromosome | 6p | 7q | 10p | 10q | 11p | 13q | 16p |
| Position (a) (kb) | 32 182.782–32 810.427 | 148 176.586–148 326.588 | 5 316.846–5 321.159 | 105 335.191–105 403.030 | 45 242.379–45 297.296 | 20 783.404–20 811.429 | 1 066.544–1 091.324 |
| Cluster size (kb) | 627.645 | 150.002 | 4.313 | 67.839 | 54.917 | 28.025 | 24.780 |
| No. of significant SNPs | 54 | 2 | 2 | 2 | 2 | 2 | 2 |
| No. of SNPs encompassed by cluster | 252 | 19 | 2 | 16 | 10 | 8 | 7 |
| P-value of cluster test | 2.91 × 10⁻⁶⁶ | 1.31 × 10⁻⁴ | 5.32 × 10⁻⁷ | 1.50 × 10⁻⁴ | 4.91 × 10⁻⁵ | 8.32 × 10⁻⁶ | 8.76 × 10⁻⁵ |
| Corrected P-value of cluster test | 1.61 × 10⁻⁶⁴ | 0.110 | 2.55 × 10⁻³ | 0.175 | 5.16 × 10⁻² | 2.10 × 10⁻² | 7.94 × 10⁻² |
| P-value of scan test (b) | 3.43 × 10⁻⁷⁰ | 0.212 | 8.74 × 10⁻³ | 0.351 | 0.103 | 4.34 × 10⁻² | 0.169 |

| Cluster number | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|
| Chromosome | 19p | 19p | 20q | 21q | 22q | 22q | 22q |
| Position (a) (kb) | 2 054.962–2 165.057 | 19 083.070–19 117.870 | 49 383.424–49 422.842 | 14 121.682–14 367.339 | 24 086.564–24 108.959 | 42 601.072–42 611.432 | 48 493.142–48 620.780 |
| Cluster size (kb) | 110.095 | 34.800 | 39.418 | 245.657 | 22.395 | 10.360 | 127.638 |
| No. of significant SNPs | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| No. of SNPs encompassed by cluster | 18 | 8 | 8 | 14 | 3 | 10 | 14 |
| P-value of cluster test | 4.22 × 10⁻³ | 8.01 × 10⁻⁴ | 1.58 × 10⁻⁴ | 2.35 × 10⁻⁴ | 1.98 × 10⁻⁵ | 2.93 × 10⁻⁴ | 5.89 × 10⁻⁴ |
| Corrected P-value of cluster test | 0.868 | 0.370 | 0.141 | 0.135 | 5.39 × 10⁻² | 0.240 | 0.344 |
| P-value of scan test (b) | – | 0.774 | 0.298 | 0.264 | 0.153 | – | – |

(a) Build 35.

(b) The P-values of the clusters that are not the largest on a chromosome arm are not available and are indicated by the symbol ‘–’ for the scan test.

It is noteworthy that the P-values obtained directly from expression (1) are not corrected for multiple comparisons. The chromosomal segment that encompassed this cluster can occur at other locations along chromosome 6p; we must take this into account when assessing statistical significance using this test for cluster analyses. We used a Bonferroni-type correction, multiplying the P-value by the ratio of T (the total number of SNPs tested over a chromosomal region) to w (the number of SNPs that encompass the cluster of interest) (23). In this case, the corrected P-value is equal to 2.91 × 10⁻⁶⁶ × (13 917/252) = 1.61 × 10⁻⁶⁴. Because only one cluster is present on chromosome 6p, we used the scan test and obtained the P-value of 3.43 × 10⁻⁷⁰. Both the cluster test and scan test demonstrated that this 627 kb clustering segment on chromosome 6p (32 182 782–32 810 427) is highly significant.
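The correction step described above is simple arithmetic and can be sketched in a few lines. Formula (1) itself is defined in the Methods and is not reproduced in this section; the binomial upper tail used below is therefore an assumption, chosen because it reproduces the cluster-test P-values reported in Table 2 for the two-SNP clusters. The function names are hypothetical.

```python
import math

def cluster_p_value(T, k, w, x):
    """Assumed form of the cluster test: P(X >= x) for X ~ Binomial(w, k/T).

    T: SNPs tested on the chromosome arm, k: significant SNPs on the arm,
    w: SNP loci spanned by the cluster, x: significant SNPs in the cluster.
    """
    p = k / T  # null probability that a single locus is significant
    return sum(math.comb(w, i) * p**i * (1 - p) ** (w - i) for i in range(x, w + 1))

def corrected_p_value(p_cluster, T, w):
    """Bonferroni-type correction from the text: multiply by T / w."""
    return p_cluster * T / w

# Chromosome 10p cluster: T = 9598, k = 7, w = 2, x = 2 (Tables 1 and 2).
p_10p = cluster_p_value(9598, 7, 2, 2)
c_10p = corrected_p_value(p_10p, 9598, 2)
# Chromosome 13q cluster: T = 20158, k = 11, w = 8, x = 2.
p_13q = cluster_p_value(20158, 11, 8, 2)
c_13q = corrected_p_value(p_13q, 20158, 8)
print(f"10p: {p_10p:.2e} -> {c_10p:.2e}")  # compare Table 2: 5.32e-07 and 2.55e-03
print(f"13q: {p_13q:.2e} -> {c_13q:.2e}")  # compare Table 2: 8.32e-06 and 2.10e-02
```

Because the correction scales with T/w, a short cluster on a densely genotyped arm pays a larger penalty than a long cluster on a sparse arm.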

We analyzed the remaining 13 distinct clusters using the same approach; the corresponding results are shown in Table 2. In contrast with the first cluster on chromosome 6p, each of these 13 clusters contained exactly two significant SNPs. The clusters of significant SNPs on the 4.3 kb segment of chromosome 10p (5 316 846–5 321 159) and the 28 kb segment of chromosome 13q (20 783 404–20 811 429) had corrected P-values of 2.55 × 10⁻³ and 2.10 × 10⁻², respectively; these were significant at a 0.05 nominal significance level after correction for multiple comparisons. The corresponding P-values of the scan test for these two clusters were 8.74 × 10⁻³ and 4.34 × 10⁻². Detailed information on these two clusters is presented in the fourth and seventh columns of Table 2. It is important to determine the pattern of linkage disequilibrium between adjacent significant SNPs in clusters. We used the values of r² to measure the magnitude of linkage disequilibrium between the two adjacent significant SNPs in each of these two clusters. The r² values between significant SNPs were 0.387 for cases and 0.310 for controls on the 4.3 kb cluster of chromosome 10p, and 0.121 for cases and 0.070 for controls on the 28 kb cluster of chromosome 13q. The P-values of the 8th, 13th and 14th clusters of Table 2 for the scan test are not available because the scan test only assesses the largest cluster on a chromosome arm, and the 9th and 12th clusters of Table 2 are the largest on chromosomes 19p and 22q, respectively.
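For readers who want to check linkage disequilibrium between two adjacent significant SNPs themselves, the sketch below computes r² as the squared Pearson correlation of genotype dosages (0, 1 or 2 copies of a reference allele). This is a common approximation to haplotype-based r² and is an assumption here: the section does not state how the r² values were estimated, and the genotype vectors are invented.

```python
def r_squared(dosage_a, dosage_b):
    """Squared Pearson correlation between two genotype dosage vectors."""
    n = len(dosage_a)
    mean_a = sum(dosage_a) / n
    mean_b = sum(dosage_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(dosage_a, dosage_b))
    var_a = sum((a - mean_a) ** 2 for a in dosage_a)
    var_b = sum((b - mean_b) ** 2 for b in dosage_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("r^2 is undefined for a monomorphic SNP")
    return cov * cov / (var_a * var_b)

# Invented dosages for eight individuals at two neighboring SNPs.
snp_1 = [0, 1, 2, 1, 0, 2, 1, 1]
snp_2 = [0, 1, 2, 0, 0, 2, 1, 2]
print(round(r_squared(snp_1, snp_2), 3))
```

Computing the statistic separately in cases and in controls, as in the text, only requires calling the function on each subgroup.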

Cluster analysis under the second criterion

Under the second cluster criterion of a maximum distance of 100 kb between two adjacent significant SNPs, we identified 14 distinct clusters of neighboring significant SNPs, each of which is described in detail in Table 3. The strongest association signal remained in the MHC region, which contained five distinct clusters on chromosome 6p rather than only the one shown in Table 2. The largest cluster found using the first criterion on chromosome 6p in Table 2 was split into two adjacent clusters (the third and fourth clusters of Table 3) under the second cluster criterion because only three SNPs were genotyped in the 144 kb gap between these two clusters. Both clusters were large (353 kb and 130 kb in size) and highly significant by our cluster test or scan test. The remaining clusters contained exactly two significant SNPs each. Besides the clusters on chromosome 6p, the same two clusters on chromosomes 10p and 13q that were significant under the first cluster criterion were again significant at a corrected 0.05 nominal significance level, using a Bonferroni-type correction.

Table 3.

Cluster analysis under the second criterion of a maximum distance of 100 kb between two successive significant SNPs

| Cluster number | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Chromosome | 6p | 6p | 6p | 6p | 6p | 10p | 10q |
| Position (a) (kb) | 31 133.030–31 203.780 | 31 652.168–31 723.146 | 32 182.782–32 536.263 | 32 680.229–32 810.427 | 33 194.227–33 293.896 | 5 316.846–5 321.159 | 105 335.191–105 403.030 |
| Cluster size (kb) | 70.750 | 70.978 | 353.481 | 130.198 | 99.669 | 4.313 | 67.839 |
| No. of significant SNPs | 2 | 2 | 34 | 20 | 2 | 2 | 2 |
| No. of SNPs encompassed by cluster | 37 | 23 | 164 | 85 | 46 | 2 | 16 |
| P-value of cluster test | 1.97 × 10⁻² | 7.90 × 10⁻³ | 8.39 × 10⁻⁴² | 1.95 × 10⁻²⁶ | 5.16 × 10⁻² | 5.32 × 10⁻⁷ | 1.50 × 10⁻⁴ |
| Corrected P-value of cluster test | | | 7.12 × 10⁻⁴⁰ | 3.19 × 10⁻²⁴ | | 2.55 × 10⁻³ | 0.175 |
| P-value of scan test (b) | – | – | – | 5.38 × 10⁻²³ | – | 8.74 × 10⁻³ | 0.351 |

| Cluster number | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|
| Chromosome | 11p | 13q | 16p | 19p | 20q | 22q | 22q |
| Position (a) (kb) | 45 242.379–45 297.296 | 20 783.404–20 811.429 | 1 066.544–1 091.324 | 19 083.070–19 117.870 | 49 383.424–49 422.842 | 24 086.564–24 108.959 | 42 601.072–42 611.432 |
| Cluster size (kb) | 54.917 | 28.025 | 24.780 | 34.800 | 39.418 | 22.395 | 10.360 |
| No. of significant SNPs | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| No. of SNPs encompassed by cluster | 10 | 8 | 7 | 8 | 8 | 3 | 10 |
| P-value of cluster test | 4.91 × 10⁻⁵ | 8.32 × 10⁻⁶ | 8.76 × 10⁻⁵ | 8.01 × 10⁻⁴ | 1.58 × 10⁻⁴ | 1.98 × 10⁻⁵ | 2.93 × 10⁻⁴ |
| Corrected P-value of cluster test | 5.16 × 10⁻² | 2.10 × 10⁻² | 7.94 × 10⁻² | 0.370 | 0.141 | 5.39 × 10⁻² | 0.240 |
| P-value of scan test (b) | 0.103 | 4.34 × 10⁻² | 0.169 | 0.774 | 0.298 | 0.153 | – |

(a) Build 35.

(b) The P-values of the clusters that are not the largest on a chromosome arm are not available and are indicated by the symbol ‘–’ for the scan test.

Four clusters of significant SNPs under the second cluster criterion, shown in Table 3, were statistically significant at a 0.05 nominal significance level after correction for multiple comparisons. However, the two largest clusters (the third and fourth clusters of Table 3) combined in the MHC region were essentially the same as the single largest cluster found under the first cluster criterion (the first cluster of Table 2). In conclusion, both our cluster test and scan test identified three nearly identical clusters of neighboring significant SNPs under either cluster criterion at a corrected 0.05 nominal significance level: the known deleterious deletion variants in the MHC region, a 4.3 kb segment of chromosome 10p and a 28 kb segment of chromosome 13q. Because genomic variants are not uniformly distributed and genotyped over the whole genome, it is more prudent in real-data analyses to perform separate association analyses under both cluster criteria than to rely on either criterion alone.

We used the proposed logistic regression framework extension (shown in the Test for SNP-by-SNP Analyses on the First Stage section) to assess significance of excess homozygosity on these three clusters of significant SNPs, accounting for population stratification. Our analysis showed that 58 of 252 SNPs in the MHC region (32 182.782–32 810.427) were significant at a nominal significance level of 10⁻⁸. In addition, the two significant SNPs in each of the clusters on chromosomes 10p and 13q also remained highly significant using this logistic regression extension. These results indicate that our cluster-based method is robust to population stratification in this application.

In addition, under either cluster criterion, we found that three clusters of significant SNPs were borderline statistically significant. These clusters were located on chromosomes 11p, 16p and 22q and had corrected P-values of 5.16 × 10⁻², 7.94 × 10⁻² and 5.39 × 10⁻², respectively. The corresponding r² values between significant SNPs were <0.01 for cases and controls on chromosomes 11p and 16p and 0.213 for cases and 0.167 for controls on chromosome 22q.

Whole-genome scan of RA-association using PennCNV

The SNP-based statistical method that we developed is designed to detect disease-associated deletion variants or excess homozygosity; it is structured to test each contiguous SNP locus between a group of cases and a group of controls from a genome-wide association study (23). In contrast, PennCNV is an algorithm that calls individual level copy numbers, providing position-specific copy numbers (25). We used PennCNV to obtain whole-genome CNV maps for 891 RA cases and 601 controls that had available intensity data. PennCNV outputs small chromosomal segments with copy numbers other than two. We detected 62 162 CNVs with a median size of ∼54 kb: cases had 44 729 CNVs with a median size of ∼64 kb; and controls had 17 433 with a median size of ∼32 kb.
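Turning per-individual CNV segments into per-SNP carrier counts, which the association test in this part of the analysis needs, can be sketched as follows. The tuple layout `(sample_id, start, end, copy_number)` and the function name are hypothetical and do not represent PennCNV's output format; the coordinates are invented.

```python
from bisect import bisect_left, bisect_right

def deletion_carriers_per_snp(snp_positions, segments):
    """Collect, for each SNP, the samples carrying a segment with copy number < 2.

    snp_positions: sorted SNP positions on one chromosome.
    segments: iterable of (sample_id, start, end, copy_number), inclusive ends.
    """
    carriers = [set() for _ in snp_positions]
    for sample_id, start, end, copy_number in segments:
        if copy_number >= 2:
            continue  # only deletions (copy number 0 or 1) are counted
        first = bisect_left(snp_positions, start)
        last = bisect_right(snp_positions, end)
        for i in range(first, last):
            carriers[i].add(sample_id)
    return carriers

# Invented example: four SNPs and three called segments.
snps = [100, 200, 300, 400]
calls = [("case1", 150, 350, 1), ("ctrl1", 200, 200, 0), ("case2", 100, 400, 3)]
counts = [len(s) for s in deletion_carriers_per_snp(snps, calls)]
print(counts)  # [0, 2, 1, 0]
```

Splitting the carrier sets by case or control status then gives the two counts per SNP that enter a 2 × 2 table.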

We first used PennCNV to identify cases and controls that had chromosomal segments with copy number = 0 or 1; we then used Fisher's exact test to assess the statistical significance of association between RA risk and deletions (copy number = 0 or 1 determined by PennCNV) by comparing the numbers of cases and controls per SNP locus. In Figure 2, we present a graphical SNP-by-SNP outcome summary of the whole-genome scan of association between deletions and RA risk in which SNPs are plotted according to corresponding chromosomal locations with the values of –log10(P-values). The P-values of the two-sided Fisher's exact test were calculated and are shown in the figure. We identified 26 significant SNPs (protective; more controls than cases) clustering on chromosome 14 with P-values <10⁻⁸. An amplified display of the values of –log10(P-values) by their corresponding physical position over this small region is shown in Figure 3. In addition, we found 49 SNPs with P-values between 10⁻⁸ and 10⁻⁵: 9 SNPs on chromosome 20 increased RA risk (more cases than controls), whereas 35 SNPs on chromosome 14 and 5 SNPs on chromosome 2 decreased RA risk (more controls than cases). Table 4 shows all 75 SNPs with P-values <10⁻⁵, including their positions, names and exact P-values. We also present the corresponding numbers of cases and controls with copy number = 0 or 1 for each SNP in the table. There were 891 RA cases and 601 controls that had available intensity data for the PennCNV analyses; thus, the numbers of cases and controls with copy number other than 0 or 1 can be obtained by subtraction for calculating P-values of the Fisher's exact test. It is noteworthy that, unlike our cluster-based approach, the PennCNV method did not detect known deleterious deletion variants that encompassed HLA-DRB1 and C4 genes in the MHC region.
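The per-SNP comparison can be reproduced with a small, dependency-free Fisher's exact test. The sketch below is a generic implementation based on the hypergeometric distribution, not the authors' code, and the function name is hypothetical. It is applied to the chromosome 19p comparison reported elsewhere in this article (12 of 891 cases versus 1 of 601 controls with copy number = 1), for which one-sided and two-sided P-values of 1.17 × 10⁻² and 1.98 × 10⁻² are given.

```python
import math

def fisher_exact(a, b, c, d):
    """Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns (one-sided P for a or more in the top-left cell, two-sided P).
    The two-sided P sums all tables no more probable than the observed one.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    lowest = max(0, col1 - (c + d))
    highest = min(row1, col1)
    denom = math.comb(n, col1)

    def pmf(x):
        return math.comb(row1, x) * math.comb(n - row1, col1 - x) / denom

    p_observed = pmf(a)
    greater = sum(pmf(x) for x in range(a, highest + 1))
    two_sided = sum(
        pmf(x) for x in range(lowest, highest + 1)
        if pmf(x) <= p_observed * (1 + 1e-7)  # small tolerance for ties
    )
    return greater, two_sided

# Rows: cases, controls. Columns: with deletion, without deletion.
one_sided, two_sided = fisher_exact(12, 891 - 12, 1, 601 - 1)
print(f"one-sided P = {one_sided:.3g}, two-sided P = {two_sided:.3g}")
```

The same function can be applied row by row to the case and control counts in Table 4, using 891 and 601 as the group totals.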

Figure 2.


Genome-wide scan of association between rheumatoid arthritis and deletions (copy number = 0 or 1) defined by PennCNV, using Fisher's exact test. SNPs were plotted according to corresponding chromosomal locations with –log10(P-values), using two-sided Fisher's exact test. We identified 26 significant SNPs overly aggregating on chromosome 14 with P-values <10⁻⁸ and an additional 49 SNPs on chromosomes 2, 14 and 20 with P-values between 10⁻⁸ and 10⁻⁵. The SNPs on chromosomes 2 and 14 were protective; those on chromosome 20 were associated with increased RA risk.

Figure 3.


Amplification of association scan on chromosome 14, 20.5–23 Mb, between rheumatoid arthritis and deletions (copy number = 0 or 1) defined by PennCNV, using Fisher's exact test. The largest association signal appears on a 165 kb segment of chromosome 14q (21 834 952–21 999 998), which spans 59 SNP loci and contains all 26 SNPs that were significant at the nominal significance level of 10⁻⁸. Twenty-four consecutive SNPs were statistically significant on a 46.5 kb segment of chromosome 14q (21 834 952–21 881 469).

Table 4.

Regions of the genome showing evidence of association between rheumatoid arthritis and deletions (copy number = 1 or 0) by PennCNV

| No. | Chromosome | Position | SNP | No. of cases with copy no. = 0 or 1 (a) | No. of controls with copy no. = 0 or 1 (a) | P-value of two-sided Fisher's exact test (b) |
|---|---|---|---|---|---|---|
| 1 | 14 | 21 852 217 | rs11845134 | 13 | 61 | 4.10 × 10⁻¹⁴ |
| 2 | 14 | 21 849 683 | rs7146411 | 12 | 58 | 9.08 × 10⁻¹⁴ |
| 3 | 14 | 21 850 339 | rs3811259 | 12 | 58 | 9.08 × 10⁻¹⁴ |
| 4 | 14 | 21 850 502 | rs11850894 | 12 | 58 | 9.08 × 10⁻¹⁴ |
| 5 | 14 | 21 834 952 | rs12588739 | 6 | 43 | 3.39 × 10⁻¹² |
| 6 | 14 | 21 837 485 | rs722448 | 6 | 43 | 3.39 × 10⁻¹² |
| 7 | 14 | 21 856 055 | rs1474477 | 18 | 62 | 5.24 × 10⁻¹² |
| 8 | 14 | 21 859 477 | rs8007403 | 24 | 70 | 5.28 × 10⁻¹² |
| 9 | 14 | 21 860 760 | rs916048 | 25 | 71 | 8.69 × 10⁻¹² |
| 10 | 14 | 21 857 381 | rs10047935 | 18 | 60 | 1.93 × 10⁻¹¹ |
| 11 | 14 | 21 861 403 | rs2204990 | 26 | 70 | 2.56 × 10⁻¹¹ |
| 12 | 14 | 21 841 092 | rs741713 | 9 | 46 | 3.12 × 10⁻¹¹ |
| 13 | 14 | 21 841 139 | rs1076549 | 9 | 46 | 3.12 × 10⁻¹¹ |
| 14 | 14 | 21 841 963 | rs2009858 | 9 | 46 | 3.12 × 10⁻¹¹ |
| 15 | 14 | 21 838 610 | rs3811260 | 7 | 42 | 3.98 × 10⁻¹¹ |
| 16 | 14 | 21 845 319 | rs1540268 | 11 | 49 | 4.11 × 10⁻¹¹ |
| 17 | 14 | 21 845 708 | rs10142594 | 11 | 49 | 4.11 × 10⁻¹¹ |
| 18 | 14 | 21 864 135 | rs17793809 | 27 | 70 | 6.52 × 10⁻¹¹ |
| 19 | 14 | 21 867 816 | rs4981422 | 27 | 69 | 1.18 × 10⁻¹⁰ |
| 20 | 14 | 21 869 910 | rs11627649 | 27 | 69 | 1.18 × 10⁻¹⁰ |
| 21 | 14 | 21 842 503 | rs1467891 | 9 | 44 | 1.28 × 10⁻¹⁰ |
| 22 | 14 | 21 862 055 | rs11847479 | 26 | 64 | 1.37 × 10⁻⁹ |
| 23 | 14 | 21 878 594 | rs4981423 | 17 | 52 | 1.78 × 10⁻⁹ |
| 24 | 14 | 21 881 469 | rs3811256 | 17 | 51 | 3.38 × 10⁻⁹ |
| 25 | 14 | 21 999 540 | rs10162417 | 22 | 56 | 9.05 × 10⁻⁹ |
| 26 | 14 | 21 999 998 | rs10131293 | 22 | 56 | 9.05 × 10⁻⁹ |
| 27 | 14 | 21 885 790 | rs2032442 | 14 | 45 | 1.46 × 10⁻⁸ |
| 28 | 14 | 21 886 996 | rs12436199 | 14 | 45 | 1.46 × 10⁻⁸ |
| 29 | 14 | 22 000 627 | rs2733776 | 22 | 54 | 2.95 × 10⁻⁸ |
| 30 | 14 | 21 831 090 | rs10483271 | 7 | 33 | 5.90 × 10⁻⁸ |
| 31 | 14 | 21 832 139 | rs17198314 | 7 | 33 | 5.90 × 10⁻⁸ |
| 32 | 14 | 21 832 903 | rs17198328 | 7 | 33 | 5.90 × 10⁻⁸ |
| 33 | 14 | 21 898 729 | rs2331662 | 14 | 42 | 9.73 × 10⁻⁸ |
| 34 | 14 | 21 996 759 | rs17794083 | 26 | 57 | 1.05 × 10⁻⁷ |
| 35 | 14 | 21 995 192 | rs1882704 | 28 | 58 | 1.94 × 10⁻⁷ |
| 36 | 14 | 21 827 106 | rs2001022 | 7 | 31 | 2.25 × 10⁻⁷ |
| 37 | 20 | 35 462 245 | rs1570209 | 96 | 22 | 2.45 × 10⁻⁷ |
| 38 | 14 | 21 994 034 | rs2242545 | 29 | 59 | 2.49 × 10⁻⁷ |
| 39 | 14 | 21 985 656 | rs12147516 | 52 | 83 | 2.87 × 10⁻⁷ |
| 40 | 14 | 21 986 886 | rs10483273 | 52 | 83 | 2.87 × 10⁻⁷ |
| 41 | 20 | 35 442 559 | rs6090585 | 62 | 9 | 3.36 × 10⁻⁷ |
| 42 | 20 | 35 443 071 | rs6018199 | 62 | 9 | 3.36 × 10⁻⁷ |
| 43 | 14 | 21 826 110 | rs10129606 | 7 | 30 | 4.42 × 10⁻⁷ |
| 44 | 2 | 208 064 035 | rs918843 | 34 | 63 | 5.25 × 10⁻⁷ |
| 45 | 2 | 208 064 167 | rs918842 | 34 | 63 | 5.25 × 10⁻⁷ |
| 46 | 2 | 208 064 454 | rs2551649 | 34 | 63 | 5.25 × 10⁻⁷ |
| 47 | 2 | 208 065 237 | rs6755425 | 34 | 63 | 5.25 × 10⁻⁷ |
| 48 | 2 | 208 066 083 | rs959668 | 34 | 63 | 5.25 × 10⁻⁷ |
| 49 | 14 | 21 991 120 | rs11848747 | 32 | 61 | 5.32 × 10⁻⁷ |
| 50 | 20 | 35 440 545 | rs12329503 | 61 | 9 | 5.49 × 10⁻⁷ |
| 51 | 20 | 35 485 009 | rs6018428 | 98 | 24 | 6.36 × 10⁻⁷ |
| 52 | 20 | 35 485 260 | rs6018432 | 98 | 24 | 6.36 × 10⁻⁷ |
| 53 | 20 | 35 438 689 | rs6094509 | 60 | 9 | 9.03 × 10⁻⁷ |
| 54 | 20 | 35 475 054 | rs11905013 | 97 | 24 | 9.49 × 10⁻⁷ |
| 55 | 20 | 35 476 320 | rs4810624 | 97 | 24 | 9.49 × 10⁻⁷ |
| 56 | 14 | 22 009 307 | rs8020193 | 17 | 42 | 1.14 × 10⁻⁶ |
| 57 | 14 | 22 002 896 | rs10483275 | 21 | 47 | 1.49 × 10⁻⁶ |
| 58 | 14 | 21 908 470 | rs8014927 | 14 | 38 | 1.79 × 10⁻⁶ |
| 59 | 14 | 21 822 713 | rs17116039 | 8 | 30 | 1.84 × 10⁻⁶ |
| 60 | 14 | 21 819 582 | rs4435168 | 13 | 36 | 2.13 × 10⁻⁶ |
| 61 | 14 | 21 973 302 | rs2141988 | 49 | 75 | 2.30 × 10⁻⁶ |
| 62 | 14 | 21 973 771 | rs3811232 | 49 | 75 | 2.30 × 10⁻⁶ |
| 63 | 14 | 21 974 905 | rs8021297 | 49 | 75 | 2.30 × 10⁻⁶ |
| 64 | 14 | 22 010 682 | rs10483277 | 15 | 38 | 2.87 × 10⁻⁶ |
| 65 | 14 | 21 970 760 | rs6572449 | 49 | 74 | 4.98 × 10⁻⁶ |
| 66 | 14 | 21 972 830 | rs7142158 | 49 | 74 | 4.98 × 10⁻⁶ |
| 67 | 14 | 21 975 565 | rs11623995 | 49 | 74 | 4.98 × 10⁻⁶ |
| 68 | 14 | 21 976 908 | rs11157596 | 49 | 74 | 4.98 × 10⁻⁶ |
| 69 | 14 | 21 914 810 | rs4982619 | 14 | 36 | 5.51 × 10⁻⁶ |
| 70 | 14 | 21 816 895 | rs3811266 | 13 | 34 | 7.13 × 10⁻⁶ |
| 71 | 14 | 21 817 304 | rs4982599 | 13 | 34 | 7.13 × 10⁻⁶ |
| 72 | 14 | 21 931 475 | rs12891257 | 13 | 34 | 7.13 × 10⁻⁶ |
| 73 | 14 | 21 933 475 | rs10142552 | 13 | 34 | 7.13 × 10⁻⁶ |
| 74 | 14 | 21 928 200 | rs3811247 | 12 | 33 | 7.51 × 10⁻⁶ |
| 75 | 14 | 21 929 322 | rs3811244 | 12 | 33 | 7.51 × 10⁻⁶ |

(a) There were 891 RA cases and 601 controls in the PennCNV analyses. Given the numbers of cases and controls with copy number = 0 or 1, the numbers of cases and controls with copy number other than 0 or 1 can be obtained by subtraction to calculate P-values of the Fisher's exact test.

(b) The two-sided Fisher's exact test was used to assess the statistical significance of association between RA risk and deletions (copy number = 0 or 1).

The largest association signal appeared on a 165 kb segment of chromosome 14q (21 834 952–21 999 998), which spans 59 SNP loci and contains all 26 SNPs that were significant at the nominal significance level of 10⁻⁸. Notably, we found that 24 consecutive SNPs were statistically significant on a 46.5 kb segment of chromosome 14q (21 834 952–21 881 469). The respective maps of this region for cases and controls, shown in Figure 4, suggest that at least four distinct loci in separate linkage disequilibrium blocks are present on the 46.5 kb segment that accommodates the 24 consecutive significant SNPs; at least eight distinct loci in separate linkage disequilibrium blocks are present on the 165 kb segment of chromosome 14q (21 834 952–21 999 998) that accommodates all 26 significant SNPs. This region contains the T-cell receptor alpha chain locus, which is rearranged in T-cells. As different T-cells show different rearrangements, the DNA intensity across this region would be decreased, while heterozygosity calling of genotypes would not be altered, which explains the differences between PennCNV and the homozygosity clustering approach.

Figure 4.

(A and B) The haplotype maps of chromosome 14q (21 834 952–22 000 629). The first figure is the haplotype map for cases (A) and second for controls (B). We used the value of D’ to create linkage disequilibrium blocks of these two haplotype maps. These two figures suggest that at least four distinct loci, in separate linkage disequilibrium blocks, are present on the 46.5 kb segment of chromosome 14q (21 834 952–21 881 469); this segment accommodates 24 consecutive significant SNPs. At least eight distinct loci, in separate linkage disequilibrium blocks, are present on the 165 kb segment of chromosome 14q (21 834 952–21 999 998); this segment accommodates all 26 significant SNPs.

We used the proposed logistic regression framework extension (shown in the Test for SNP-by-SNP Analyses on the First Stage section) to assess significance of deletions (copy number = 0 or 1 determined by PennCNV) on the top-signal region of chromosome 14q, accounting for population stratification. Our analysis showed that this region remained highly significant.

In addition, we found that nine consecutive SNPs on a 46.6 kb segment of chromosome 20 (35 438 689–35 485 260) were associated with increased RA risk, with P-values of 10−6 to 10−7 (shown in bold in Table 4); five consecutive SNPs on a 2 kb segment of chromosome 2 (208 064 035–208 066 083) were associated with decreased RA risk, with a P-value of 5.25 × 10−7. The proto-oncogene tyrosine-protein kinase SRC lies in the 46.6 kb segment of chromosome 20.

Additional analysis outcome of cluster-based and PennCNV methods combined

Twelve RA patients and one control commonly shared a 6.6 kb segment of deletion with copy number = 1 by PennCNV on chromosome 19p (2 060 157–2 066 790) that spans two SNP loci. This segment also lay between two adjacent significant SNPs on chromosome 19p (2 054 962–2 165 057) identified by our cluster-based method (shown on the lower second column of Table 2). This cluster of significant SNPs was not statistically significant at a corrected 0.05 nominal significance level, using a Bonferroni-type correction, by our cluster test. The 12 RA patients commonly shared a 15.4 kb segment on chromosome 19p (2 051 346–2 066 790) that spans four SNP loci. The AP3D1 gene (adaptor-related protein complex 3, delta 1) lies in this region. Fisher's one-sided (two-sided) exact test for comparing 12/891 versus 1/601 gives a P-value of 1.17 × 10−2 (1.98 × 10−2); significantly more RA cases than controls were observed on this 6.6 kb deleted segment. Supplementary Material, Table S1 provides data on the 12 identified RA patients and 1 control, including their respective affection statuses, copy numbers, deletion segment lengths, starting and ending deletion SNPs and starting and ending physical deletion positions.
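As an illustration of the per-segment comparison above, the following minimal sketch computes Fisher's exact test for a 2 × 2 table of deletion carriers versus non-carriers in cases and controls. The function name `fisher_exact_2x2` and the two-sided convention (summing every table that is no more probable than the observed one) are assumptions made for this example; they are not taken from the software used in the study.

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Fisher's exact test for the table [[a, b], [c, d]].

    Rows are cases and controls; columns are 'deletion' and 'no deletion'.
    Returns (one_sided, two_sided). The one-sided value is the probability of
    at least `a` carriers among cases; the two-sided value sums all tables
    that are no more probable than the observed one (an assumed convention).
    """
    n_cases, n_controls = a + b, c + d
    n_del = a + c                       # total carriers, fixed by conditioning
    total = comb(n_cases + n_controls, n_del)

    def prob(k):
        # Hypergeometric probability that k of the carriers are cases.
        return comb(n_cases, k) * comb(n_controls, n_del - k) / total

    lo = max(0, n_del - n_controls)
    hi = min(n_del, n_cases)
    p_obs = prob(a)
    one_sided = sum(prob(k) for k in range(a, hi + 1))
    # Small relative tolerance guards against floating-point ties.
    two_sided = sum(prob(k) for k in range(lo, hi + 1)
                    if prob(k) <= p_obs * (1 + 1e-9))
    return one_sided, two_sided

# Counts quoted in the text: 12 of 891 cases and 1 of 601 controls.
one_sided, two_sided = fisher_exact_2x2(12, 891 - 12, 1, 601 - 1)
print(f"one-sided P = {one_sided:.3g}, two-sided P = {two_sided:.3g}")
```

With the counts quoted above, this sketch should give values in line with the reported one-sided and two-sided P-values. The same function could be applied per SNP locus by substituting the carrier counts from Table 4.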

DISCUSSION

Because known RA susceptibility loci explain only a small portion of familial clustering (22) and because CNVs are abundant in humans and represent one of the least well-studied classes of genetic variants (18), we attempted to determine some of the unknown heritability by performing a genome-wide study of association between deletions or excess homozygosity and RA risk in this report. We analyzed high-density 550 K SNP genotype data from a genome-wide association study of RA (20). In the SNP-by-SNP analysis using our method (23), we detected the strongest association signal in the MHC region with a maximal aggregation of neighboring significant SNPs at the nominal significance level of 10−8, which encompasses known deletion variants on HLA-DRB1 and C4 genes. We observed a complex and extensive linkage disequilibrium pattern among significant SNPs in this region.

The subsequent cluster analysis is designed to detect clusters of two or more neighboring significant SNPs overly aggregated on a small chromosomal segment and to test for statistical significance of clustering. In addition to successfully detecting known deleterious deletion variants on HLA-DRB1 and C4 genes in the MHC region (shown in the second column of Table 2), we identified 4.3 and 28 kb clusters of significant SNPs on chromosomes 10p and 13q (shown in the fourth and seventh columns of Table 2) using our cluster test and scan test, which were significant at a corrected 0.05 nominal significance level, adjusted for multiple comparison procedures.

Several RA-associated alleles of modest risk sizes on new loci have been discovered in recent genome-wide association studies. We evaluated the significance status of the neighboring SNPs that encompassed these associated alleles, including PTPN22, STAT4, CTLA4, REL, HLA-DRB1, TNFAIP3, BLK, TRAF1-C5, PRKCQ and CD40. For each associated locus, we evaluated 100 adjacent SNPs (50 SNPs on each side of the locus) from the SNP-by-SNP analysis outcome. Thirty-two significant SNPs encompassed HLA-DRB1, 4 encompassed C4 and 1 (rs2572386) lay 114 kb from BLK. Given the complex and extensive linkage disequilibrium pattern in the MHC region, it may not be surprising that many significant SNPs neighbor HLA-DRB1. Further fine-mapping studies are required to determine whether additional risk deletion variants exist besides HLA-DRB1 and C4 in the MHC region.

Independently, we performed PennCNV analyses and obtained whole-genome CNV maps for 891 RA cases and 601 controls with available intensity data. We first identified cases and controls that had chromosomal segments with copy number = 0 or 1; we then used Fisher's exact test to compare the numbers of cases and controls per SNP locus, testing the statistical significance of the association between RA risk and deletions (copy number = 0 or 1 by PennCNV). In Figure 2, we present a graphical SNP-by-SNP outcome summary according to corresponding chromosomal locations with the values of –log10(P-values). We identified 26 significant SNPs aggregating on chromosome 14 with P-values <10−8 and an additional 49 SNPs on chromosomes 2, 14 and 20 with P-values of 10−5–10−8. The SNPs that were found on chromosomes 2 and 14 are protective (more controls than cases); those that were found on chromosome 20 increased RA risk (more cases than controls). The 75 SNPs with P-values <10−5 are presented in Table 4.

The cluster-based and PennCNV methods are different approaches to investigating the relationships between disease status and deletion variants. The cluster-based method is structured to identify commonly shared excess homozygosity among patients with a genetic disorder, providing strong evidence that the genes in the deleted or excess homozygosity region predispose patients to the disease. It uses a two-stage design to evaluate the association with complex human traits from high-density SNP genotype data in genome-wide association studies (23). The evidence of genomic deletions that are associated with disease is further enhanced by observing successive or neighboring SNPs with excess homozygosity in cases compared with in controls in our cluster-based scheme. In contrast, the PennCNV method is an algorithm for cataloging and identifying copy numbers for individuals, using intensity data on the basis of a hidden Markov model (25). We used PennCNV to identify cases and controls that had chromosomal segments with copy number = 0 or 1 and used Fisher's exact test to assess the statistical significance of association between RA risk and deletions by comparing the numbers of cases and controls per SNP locus. The cluster-based and PennCNV methods may be sensitive to different aspects of data and observation, thus providing different information for discovering associated deletion variants or excess homozygosity in RA patients. Notably, our cluster-based method identified the strongest signals on a chromosomal segment that encompassed known deleterious deletion variants on HLA-DRB1 and C4 genes, but the PennCNV analysis did not detect statistical significance in the MHC region.

We performed another cluster-based analysis using a smaller data set of 851 RA cases and 571 controls that was included in the PennCNV analysis and was a subset of the 868 RA cases and 1194 controls in our original cluster analysis. The cluster-based method remained effective and identified the largest association signal with a maximal aggregation of 50 neighboring significant SNPs in the MHC region. Supplementary Material, Figure S1 displays a graphical summary of outcomes of the genome-wide association scan between deletion variants or excess homozygosity and RA risk; SNPs are plotted, according to corresponding chromosomal locations, with the values of –log10(P-values) on the basis of this smaller data set. A smaller data set is not likely to be the major reason that the PennCNV method failed to detect known deleterious deletion variants in the MHC region.

The cluster-based method also detected a segment on chromosome 19p (2 054 962–2 165 057) that was encompassed by two adjacent significant SNPs but was not statistically significant at a corrected 0.05 nominal significance level, using a Bonferroni-type correction, by our cluster test (shown on the lower second column of Table 2). The PennCNV analysis identified 12 RA patients and 1 control that commonly shared a 6.6 kb segment of copy number = 1 on chromosome 19p (2 060 157–2 066 790) that lay in the segment that was described by our cluster-based approach. We used Fisher's one-sided (two-sided) exact test for comparing cases (12/891) and controls (1/601) and obtained a P-value of 1.17 × 10−2 (1.98 × 10−2): significantly more RA cases than controls were observed on this 6.6 kb chromosomal segment. Supplementary Material, Table S1 presents detailed information on these 13 individuals and their respective deletion segments. Several experimental methods are available to validate deletion variants or excess homozygosity, such as fluorescent in situ hybridization, two-color fluorescence intensity, PCR amplification and quantitative PCR. Biological confirmation and molecular validation on the top-signal chromosomal segments detected by the cluster and PennCNV analyses, including those on chromosomes 10p, 13q, 14q and 19p, are warranted in the future.

In this study, we (i) used our cluster-based method to perform a whole-genome scan of disease-associated deletions or excess homozygosity and identified novel 4.3 and 28 kb clusters on chromosomes 10p and 13q, respectively, at a corrected 0.05 nominal significance level; (ii) used PennCNV and Fisher's exact test to independently perform a whole-genome analysis of association with deletion variants and identified 26 significant SNPs that were overly aggregated on a 165 kb segment of chromosome 14q at a nominal significance level of 10−8; (iii) identified 12 RA cases and 1 control that commonly shared a 6.6 kb segment with copy number = 1, determined by PennCNV, on chromosome 19p that was also identified by our cluster-based method; and (iv) proposed a novel logistic regression method to perform additional analyses for deletions and excess homozygosity, accounting for population stratification.

In contrast to the design of genome-wide association studies in which a point-wise approach is used to find individual disease-associated SNPs, segment-wise approaches are generally used to discover small chromosomal CNV segments. Existing SNP-based approaches and algorithms, including our cluster-based method, are structured to identify deletion variants or excess homozygosity through observing aberrant SNP patterns in a run of consecutive SNPs (4,6,23–25). If we find statistically significant evidence of excess homozygosity at individual SNPs for SNP-by-SNP analyses, we use the cluster-based statistical approach to combine information from multiple neighboring SNPs and find a run of tightly adjacent significant SNPs associated with a disease of interest. In this report, we also provide a strategy and analytical framework that can be used, at no additional cost, to detect disease-associated deletion variants or excess homozygosity and identify individual patients with commonly shared disease-associated deletion variants, using SNP and intensity data from a genome-wide association study.

In addition to unbalanced structural variants, low-frequency and rare variants may explain a portion of the missing heritability of many common human diseases. The high-density SNP genotype data in genome-wide association studies are more likely to capture common CNVs than are low-frequency ones. Furthermore, early commercial SNP array platforms were designed to be biased against SNP genotyping near CNV regions. These factors may limit the sensitivity and scope of SNP-based CNV association studies. However, newer generations of SNP arrays have been designed to eliminate much of the bias against capturing genomic segments affected by CNVs and provide higher-resolution maps of CNVs, enabling more effective and efficient CNV association studies using SNPs (28,29). The recent nucleotide-resolution CNV map on the basis of whole-genome DNA sequencing data will further enable robust investigation in sequencing-based CNV association studies (18).

MATERIALS AND METHODS

Study population

To evaluate the potential role of deletion variants of CNVs that influence the case–control status on a whole-genome scale, we used data from the North American Rheumatoid Arthritis Consortium, genotyped on the Illumina HumanHap550 array. The study population consisted of 868 cases and 1194 controls from North America and was previously reported in a genome-wide association study of RA susceptibility loci (20). All patients were anti-CCP-positive and met the criteria for RA adopted by the American College of Rheumatology in 1987. Cases and controls were self-reported as white. Genotyping was performed on the SNP assay with Infinium HumanHap550 (Illumina), and 545 080 SNPs were genotyped in samples from cases and controls. The data set was filtered individually on the basis of SNP genotype call rates (>95% completeness), minor allele frequency (>0.01) and the Hardy–Weinberg proportion (P ≥ 10−5). Patients and controls whose percentages of missing genotypes were >5%, who had non-European ancestry, who were related, or who had evidence of DNA contamination were removed from the analysis. Written informed consent was obtained from all subjects who provided blood samples, in accordance with protocols approved by the local institutional review boards. More details of the sample collection used are described elsewhere (20).

SNP-based statistical method in a two-stage design

Current molecular technologies and SNP genotyping methods have technical challenges that result in relatively limited resolutions; they are not capable of effectively identifying and cataloging CNVs in whole-genome array scans. CNVs and genomic deletions in particular can perturb the collection of SNP genotype data in CNV regions, causing SNP intensity data to cluster poorly and SNP genotypes in the hemizygous deletion regions to be observed as homozygous for the present allele (4,6,24).

We recently proposed and developed a statistical method that uses a two-stage design to detect deletion variants or excess homozygosity that are associated with complex human traits from high-density SNP genotype data in genome-wide association studies. The method was designed for single-SNP analyses on the first stage and utilized evidence from multiple adjacent SNPs combined with a cluster-based approach on the second stage in case–control studies (23). SNP-based methods, including our cluster-based method, are not capable of effectively distinguishing between homozygosity and deletions. The identification of excess homozygosity regions in multiple cases forms the basis of our method. It was structured to detect commonly shared deletion variants or excess homozygosity among patients with a genetic disorder, providing strong evidence that the genes in the deleted region predispose patients to the disease.

Test for SNP-by-SNP analyses on the first stage

We compared the level of homozygosity on each contiguous SNP locus by using normal approximations to test the significance of differences in homozygosity proportions between cases and controls on the first stage. This test infers the presence of genomic deletions associated with disease by assessing the statistical significance of higher homozygosity proportions in cases than in controls. Letting $\hat{p}_1$ and $\hat{p}_2$ be the respective estimates of homozygosity proportions in cases and controls at a single SNP locus and $\hat{p}$ be their weighted average, the normal deviate Z is based on the difference in proportion quantities, $\hat{p}_1 - \hat{p}_2$, divided by its standard error, $\sqrt{\hat{p}(1-\hat{p})(1/n_1 + 1/n_2)}$, where $n_1$ and $n_2$ represent the sample sizes of cases and controls, respectively. This z-score test can be performed on each contiguous SNP locus along the whole human genome.
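The first-stage test can be sketched as a pooled two-proportion z-test. The function name `homozygosity_z_test`, the one-sided upper-tail P-value and the counts in the usage line are assumptions chosen for illustration; they are not data from the study.

```python
from math import sqrt, erfc

def homozygosity_z_test(hom_cases, n_cases, hom_controls, n_controls):
    """Pooled two-proportion z-test for excess homozygosity in cases.

    Returns (z, p_value), where p_value is the one-sided upper-tail
    probability under the standard normal approximation.
    """
    p1 = hom_cases / n_cases                 # homozygosity proportion, cases
    p2 = hom_controls / n_controls           # homozygosity proportion, controls
    # Weighted average of the two proportions (pooled estimate).
    p = (hom_cases + hom_controls) / (n_cases + n_controls)
    if p in (0.0, 1.0):
        raise ValueError("pooled proportion is 0 or 1; the test is undefined")
    se = sqrt(p * (1 - p) * (1 / n_cases + 1 / n_controls))
    z = (p1 - p2) / se
    p_value = 0.5 * erfc(z / sqrt(2))        # P(Z >= z) for standard normal Z
    return z, p_value

# Hypothetical counts at one SNP locus, using the study's sample sizes.
z, p_value = homozygosity_z_test(520, 868, 620, 1194)
print(f"z = {z:.2f}, one-sided P = {p_value:.2e}")
```

Only the upper tail is used because the test looks for higher homozygosity in cases than in controls; a two-sided version would double the smaller tail.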

The above method does not account for covariates in the model (e.g. eigenvectors for population stratification, age and sex). Therefore, we considered a logistic regression framework extension of this approach to assess significance of excess homozygosity or CNV as follows: log[Pr(individual is a case)/Pr(individual is a control)] = b0 + b1 × x + b2 × eigenvectors + b3 × covariates, where x is the indicator of homozygosity status (or copy number = 0 or 1) at a SNP locus for an individual; b2 is a vector with the same dimension as the number of eigenvectors adjusted for population stratification. In our analyses, we adjusted for the top four significant eigenvectors, as performed in the original genome-wide association study (20).
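The logistic regression extension can be illustrated with a small Newton-Raphson fit written in plain Python. This is a sketch only: the helper names `solve` and `fit_logistic`, the toy 2 × 2 counts and the made-up covariate `pc1` (standing in for an eigenvector score) are all assumptions, and a real analysis would normally use an established statistics package.

```python
from math import exp

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [mr - f * mc for mr, mc in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_logistic(X, y, n_iter=25):
    """Newton-Raphson fit of log-odds(case) = X . beta.

    Each row of X starts with 1.0 for the intercept b0.
    """
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        grad = [0.0] * p
        hess = [[0.0] * p for _ in range(p)]
        for row, yi in zip(X, y):
            eta = sum(b * v for b, v in zip(beta, row))
            mu = 1.0 / (1.0 + exp(-eta))      # fitted Pr(case)
            for j in range(p):
                grad[j] += (yi - mu) * row[j]
                for l in range(p):
                    hess[j][l] += mu * (1 - mu) * row[j] * row[l]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
    return beta

# Hypothetical toy data: (x, is_case, count), x = homozygosity indicator.
cells = [(1, 1, 30), (0, 1, 70), (1, 0, 15), (0, 0, 85)]
X, y = [], []
for x, is_case, count in cells:
    for i in range(count):
        pc1 = ((i % 7) - 3) / 3.0   # made-up stand-in for an eigenvector score
        X.append([1.0, float(x), pc1])
        y.append(is_case)

b0, b1, b2 = fit_logistic(X, y)
print(f"b1 = {b1:.3f} (odds ratio {exp(b1):.2f}), b2 = {b2:.3f}")
```

The coefficient b1 is the quantity of interest: its sign and significance indicate whether homozygosity (or a deletion call) at the locus is associated with case status after adjustment. Standard errors, which a full analysis would need, can be obtained from the inverse of the final Hessian and are omitted here for brevity.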

Test for cluster analyses on the second stage

Evidence of disease-associated genomic deletions can be enhanced by observing successive or neighboring SNPs with excess homozygosity in cases compared with in controls in our cluster-based scheme. In addition, it can delineate or define the extent of the minimal regions of common genomic deletions among patients, indicating the critical region of disease. Our cluster test is useful for subsequent and further investigations into the outcomes of SNP-by-SNP analyses on the first stage and is designed to assess the statistical significance of multiple clusters of SNPs with excess homozygosity in cases compared with in controls.

Suppose that T SNP loci over a chromosomal region are tested using the z-score test for SNP-by-SNP analyses in which k SNP loci have significantly higher homozygosity proportions in cases than in controls. Consider the frequency of significant SNP loci occurring within a narrow segment of interest compared with the frequency of significant SNP loci over the whole region. Suppose that the narrow segment of interest encompasses w SNP loci, among which x SNP loci have significant excess homozygosity in cases aggregating in this segment. What interests us is to determine whether the observation of x significant SNP loci in the segment that contains w SNP loci is statistically significant compared with the occurrence of k significant SNP loci over T SNP loci. Assuming that each of the T SNP loci tested is independently and equally likely to have a significant excess homozygosity proportion in cases and assuming that X represents the number of significant SNP loci within the segment that contains w SNP loci, the statistical test for cluster analyses is based on the random variable X with a binomial distribution. The P-value formula for this cluster test under the null hypothesis of random allocations of significant SNP loci is expressed as follows:

$$P(X \ge x) = \sum_{i=x}^{w} \binom{w}{i} \left(\frac{k}{T}\right)^{i} \left(1 - \frac{k}{T}\right)^{w-i} \qquad (1)$$

where x represents the observed number of significant SNP loci within the segment that contains w SNP loci. A small P-value of expression (1) indicates that the occurrence of x significant SNP loci aggregating in a w-SNP interval cannot be explained by chance alone. Our cluster test is an exact statistical test and has proved to be useful and robust in the presence of linkage disequilibrium (23).
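The cluster test can be sketched as a binomial upper-tail probability. The sketch assumes, as a reading of the description above, that X is binomial with w trials and per-locus success probability k/T; the function name `cluster_p_value` and the numbers in the usage line are hypothetical.

```python
from math import comb

def cluster_p_value(x, w, k, T):
    """P(X >= x) for X ~ Binomial(w, k/T).

    x: significant SNP loci observed in the segment
    w: SNP loci in the segment
    k: significant SNP loci over the whole region
    T: SNP loci tested over the whole region
    """
    if not (0 <= x <= w <= T and 0 <= k <= T):
        raise ValueError("require 0 <= x <= w <= T and 0 <= k <= T")
    p = k / T   # assumed chance that any one locus is significant
    return sum(comb(w, i) * p**i * (1 - p)**(w - i) for i in range(x, w + 1))

# Hypothetical example: 4 of 5 adjacent loci significant, 50 of 10 000 overall.
print(f"cluster P = {cluster_p_value(4, 5, 50, 10000):.3e}")
```

Because the P-value is computed per cluster, it must still be corrected for the number of clusters examined, as done with the Bonferroni-type correction in the analyses above.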

Detection of excess homozygosity regions across multiple adjacent or neighboring SNPs forms the basis of this method for cluster analyses. Related statistical methods have been developed for detecting temporal and space-time clusters or anomalies of disease in epidemiology studies (30–33). Our cluster test is designed to assess the statistical significance of multiple clusters of SNP loci with excess homozygosity in cases compared with in controls. In contrast, the scan test is structured to detect the largest cluster. The scan test employs a moving window of pre-determined length and finds the maximum number of cases revealed through the window as it slides over the entire region (34). When only the largest cluster is being assessed or only one observed cluster is present, the scan test is useful. In the case of applications to the human genome, we often find more than one aggregation of neighboring significant SNPs on a chromosome. It is noteworthy that the cluster test in expression (1) gives exact P-values; the P-value formulae of the scan test provide approximate results in most situations. In this report, we provide a P-value for each cluster of significant SNPs in our cluster test and a P-value for the largest cluster of significant SNPs in the scan test on a chromosome arm.
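The moving-window step of the scan test can be sketched as follows. The sketch computes only the scan statistic (the largest count of significant SNP loci in any window of fixed length) and where it occurs; it does not implement the approximate P-value formulae of reference 34. The function name `scan_statistic` and the example flags are hypothetical.

```python
def scan_statistic(significant, window):
    """Return (max_count, start_index) over all windows of `window` loci.

    `significant` is a sequence of 0/1 flags, one per SNP locus in
    chromosomal order (1 = significant in the SNP-by-SNP analysis).
    """
    if window <= 0 or window > len(significant):
        raise ValueError("window must be between 1 and the number of loci")
    count = sum(significant[:window])
    best_count, best_start = count, 0
    for start in range(1, len(significant) - window + 1):
        # Slide by one locus: add the entering flag, drop the leaving flag.
        count += significant[start + window - 1] - significant[start - 1]
        if count > best_count:
            best_count, best_start = count, start
    return best_count, best_start

# Hypothetical flags for ten consecutive SNP loci.
flags = [0, 1, 0, 0, 1, 1, 1, 0, 0, 1]
best_count, best_start = scan_statistic(flags, window=3)
print(best_count, best_start)
```

The window here is measured in SNP loci rather than base pairs, which is a simplification; a window defined by physical distance would need the SNP positions as an additional input.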

PennCNV method

PennCNV is an algorithm for identifying and cataloging copy numbers for individuals in a hidden Markov model framework: a statistical method that models a Markov process in which the probability of observing a state only depends on the states at previous time points (25). PennCNV uses a first-order hidden Markov model to account for the dependence structure between hidden copy numbers at nearby SNPs: the hidden copy number state at each SNP depends only on the copy number state at the immediately preceding SNP. PennCNV integrates multiple sources of information, including total signal intensity, allelic signal intensity ratio, population SNP allele frequency and distance between neighboring SNPs. It was used to experimentally validate and fine-map CNVs in the FBXL7, EYA1 and CTDSPL genes (25). Instead of three distinct states of ‘loss’, ‘normal’ and ‘gain’, PennCNV uses a 6-state definition to model copy numbers from 0 to 4 and copy neutral loss of heterozygosity. PennCNV calculates the probabilities of all six states at each SNP locus and calls copy numbers from the most likely state sequence.
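The idea of calling states from the most likely state sequence of a first-order hidden Markov model can be illustrated with a toy Viterbi decoder. This is not PennCNV's model: the three states, the discretized intensity symbols ('low', 'mid', 'high') and every probability below are invented for illustration, whereas PennCNV uses six states, continuous intensity measures and distance-dependent transitions.

```python
from math import log

# All states and probabilities are hypothetical toy values.
STATES = ("deletion", "normal", "duplication")
START = {"deletion": 0.05, "normal": 0.90, "duplication": 0.05}
STAY = 0.98   # probability of keeping the same state at the next SNP
EMIT = {
    "deletion":    {"low": 0.80, "mid": 0.15, "high": 0.05},
    "normal":      {"low": 0.10, "mid": 0.80, "high": 0.10},
    "duplication": {"low": 0.05, "mid": 0.15, "high": 0.80},
}

def transition(prev, cur):
    """First-order transition: depends only on the preceding state."""
    return STAY if prev == cur else (1 - STAY) / (len(STATES) - 1)

def viterbi(observations):
    """Return the most likely hidden state sequence (log-space Viterbi)."""
    score = {s: log(START[s]) + log(EMIT[s][observations[0]]) for s in STATES}
    back = []
    for obs in observations[1:]:
        new_score, pointers = {}, {}
        for cur in STATES:
            best_prev = max(
                STATES, key=lambda prev: score[prev] + log(transition(prev, cur))
            )
            new_score[cur] = (score[best_prev]
                              + log(transition(best_prev, cur))
                              + log(EMIT[cur][obs]))
            pointers[cur] = best_prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Hypothetical intensity symbols at 14 consecutive SNPs.
calls = viterbi(["mid"] * 4 + ["low"] * 6 + ["mid"] * 4)
print(calls)
```

The high probability of staying in the same state means that a single low-intensity SNP is absorbed as noise, while a sustained run of low-intensity SNPs is called as a deletion segment; this is the sense in which neighboring SNPs inform each other's copy number calls.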

SUPPLEMENTARY MATERIAL

Supplementary Material is available at HMG online.

FUNDING

The research presented in this manuscript was partially supported by U.S. NIH/NCI grant R03 CA143979 to C.C.W. and NIH grants AR44422 and the Human Pedigree Analysis Resource of P30 CA016772 to C.I.A.

ACKNOWLEDGEMENTS

Dr Peter Gregersen has an extensive publication history focusing on elucidating the mechanisms of action of genetic factors in causing rheumatoid arthritis and other autoimmune conditions. Dr Annette Lee has extensive experience in characterizing genetic contributions to autoimmune diseases. We thank Drs Lee and Gregersen for assisting in the presented research by making data available for this study.

Conflicts of interest statement: None declared.

REFERENCES

  • 1. Conrad D.F., Pinto D., Redon R., Feuk L., Gokcumen O., Zhang Y., Aerts J., Andrews T.D., Barnes C., Campbell P., et al. Origins and functional impact of copy number variation in the human genome. Nature. 2010;464:704–712. doi: 10.1038/nature08516.
  • 2. Redon R., Ishikawa S., Fitch K.R., Feuk L., Perry G.H., Andrews T.D., Fiegler H., Shapero M.H., Carson A.R., Chen W., et al. Global variation in copy number in the human genome. Nature. 2006;444:444–454. doi: 10.1038/nature05329.
  • 3. Wong K.K., deLeeuw R.J., Dosanjh N.S., Kimm L.R., Cheng Z., Horsman D.E., MacAulay C., Ng R.T., Brown C.J., Eichler E.E., Lam W.L. A comprehensive analysis of common copy-number variations in the human genome. Am. J. Hum. Genet. 2007;80:91–104. doi: 10.1086/510560.
  • 4. Conrad D.F., Andrews T.D., Carter N.P., Hurles M.E., Pritchard J.K. A high-resolution survey of deletion polymorphism in the human genome. Nat. Genet. 2006;38:75–81. doi: 10.1038/ng1697.
  • 5. Hinds D.A., Kloek A.P., Jen M., Chen X., Frazer K.A. Common deletions and SNPs are in linkage disequilibrium in the human genome. Nat. Genet. 2006;38:82–85. doi: 10.1038/ng1695.
  • 6. McCarroll S.A., Hadnott T.N., Perry G.H., Sabeti P.C., Zody M.C., Barrett J.C., Dallaire S., Gabriel S.B., Lee C., Daly M.J., Altshuler D.M. Common deletion polymorphisms in the human genome. Nat. Genet. 2006;38:86–92. doi: 10.1038/ng1696.
  • 7. Lindsay E.A. Chromosomal microdeletions: dissecting del22q11 syndrome. Nat. Rev. Genet. 2001;2:858–868. doi: 10.1038/35098574.
  • 8. Moreno-De-Luca D., Mulle J.G., Kaminsky E.B., Sanders S.J., Myers S.M., Adam M.P., Pakula A.T., Eisenhauer N.J., Uhas K., Weik L., et al. Deletion 17q12 is a recurrent copy number variant that confers high risk of autism and schizophrenia. Am. J. Hum. Genet. 2010;87:618–630. doi: 10.1016/j.ajhg.2010.10.004.
  • 9. Stefansson H., Rujescu D., Cichon S., Pietilainen O.P., Ingason A., Steinberg S., Fossdal R., Sigurdsson E., Sigmundsson T., Buizer-Voskamp J.E., et al. Large recurrent microdeletions associated with schizophrenia. Nature. 2008;455:232–236. doi: 10.1038/nature07229.
  • 10. Weiss L.A., Shen Y., Korn J.M., Arking D.E., Miller D.T., Fossdal R., Saemundsen E., Stefansson H., Ferreira M.A., Green T., et al. Association between microdeletion and microduplication at 16p11.2 and autism. N. Engl. J. Med. 2008;358:667–675. doi: 10.1056/NEJMoa075974.
  • 11. Yu C.E., Dawson G., Munson J., D'Souza I., Osterling J., Estes A., Leutenegger A.L., Flodman P., Smith M., Raskind W.H., et al. Presence of large deletions in kindreds with autism. Am. J. Hum. Genet. 2002;71:100–115. doi: 10.1086/341291.
  • 12. McCarroll S.A., Huett A., Kuballa P., Chilewski S.D., Landry A., Goyette P., Zody M.C., Hall J.L., Brant S.R., Cho J.H., et al. Deletion polymorphism upstream of IRGM associated with altered IRGM expression and Crohn's disease. Nat. Genet. 2008;40:1107–1112. doi: 10.1038/ng.215.
  • 13. Willer C.J., Speliotes E.K., Loos R.J., Li S., Lindgren C.M., Heid I.M., Berndt S.I., Elliott A.L., Jackson A.U., Lamina C., et al. Six new loci associated with body mass index highlight a neuronal influence on body weight regulation. Nat. Genet. 2009;41:25–34. doi: 10.1038/ng.287.
  • 14. Pelak K., Need A.C., Fellay J., Shianna K.V., Feng S., Urban T.J., Ge D., De L.A., Martinez-Picado J., Wolinsky S.M., et al. Copy number variation of KIR genes influences HIV-1 control. PLoS Biol. 2011;9:e1001208. doi: 10.1371/journal.pbio.1001208.
  • 15. Gamazon E.R., Nicolae D.L., Cox N.J. A study of CNVs as trait-associated polymorphisms and as expression quantitative trait loci. PLoS Genet. 2011;7:e1001292. doi: 10.1371/journal.pgen.1001292.
  • 16. Craddock N., Hurles M.E., Cardin N., Pearson R.D., Plagnol V., Robson S., Vukcevic D., Barnes C., Conrad D.F., Giannoulatou E., et al. Genome-wide association study of CNVs in 16,000 cases of eight common diseases and 3000 shared controls. Nature. 2010;464:713–720. doi: 10.1038/nature08979.
  • 17. McCarroll S.A. Extending genome-wide association studies to copy-number variation. Hum. Mol. Genet. 2008;17:R135–R142. doi: 10.1093/hmg/ddn282.
  • 18. Mills R.E., Walter K., Stewart C., Handsaker R.E., Chen K., Alkan C., Abyzov A., Yoon S.C., Ye K., Cheetham R.K., et al. Mapping copy number variation by population-scale genome sequencing. Nature. 2011;470:59–65. doi: 10.1038/nature09708.
  • 19. Gregersen P.K., Behrens T.W. Genetics of autoimmune diseases—disorders of immune homeostasis. Nat. Rev. Genet. 2006;7:917–928. doi: 10.1038/nrg1944.
  • 20. Plenge R.M., Seielstad M., Padyukov L., Lee A.T., Remmers E.F., Ding B., Liew A., Khalili H., Chandrasekaran A., Davies L.R., et al. TRAF1-C5 as a risk locus for rheumatoid arthritis—a genomewide study. N. Engl. J. Med. 2007;357:1199–1209. doi: 10.1056/NEJMoa073491.
  • 21. Wordsworth P., Bell J. Polygenic susceptibility in rheumatoid arthritis. Ann. Rheum. Dis. 1991;50:343–346. doi: 10.1136/ard.50.6.343.
  • 22. Stahl E.A., Raychaudhuri S., Remmers E.F., Xie G., Eyre S., Thomson B.P., Li Y., Kurreeman F.A., Zhernakova A., Hinks A., et al. Genome-wide association study meta-analysis identifies seven new rheumatoid arthritis risk loci. Nat. Genet. 2010;42:508–514. doi: 10.1038/ng.582.
  • 23. Wu C.C., Shete S., Chen W.V., Peng B., Lee A.T., Ma J., Gregersen P.K., Amos C.I. Detection of disease-associated deletions in case-control studies using SNP genotypes with application to rheumatoid arthritis. Hum. Genet. 2009;126:303–315. doi: 10.1007/s00439-009-0672-3.
  • 24. Kohler J.R., Cutler D.J. Simultaneous discovery and testing of deletions for disease association in SNP genotyping studies. Am. J. Hum. Genet. 2007;81:684–699. doi: 10.1086/520823.
  • 25. Wang K., Li M., Hadley D., Liu R., Glessner J., Grant S.F., Hakonarson H., Bucan M. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res. 2007;17:1665–1674. doi: 10.1101/gr.6861907.
  • 26. Beck S., Trowsdale J. Sequence organisation of the class II region of the human MHC. Immunol. Rev. 1999;167:201–210. doi: 10.1111/j.1600-065x.1999.tb01393.x.
  • 27. Spies T., Sorrentino R., Boss J.M., Okada K., Strominger J.L. Structural organization of the DR subregion of the human major histocompatibility complex. Proc. Natl Acad. Sci. U S A. 1985;82:5165–5169. doi: 10.1073/pnas.82.15.5165.
  • 28. Korn J.M., Kuruvilla F.G., McCarroll S.A., Wysoker A., Nemesh J., Cawley S., Hubbell E., Veitch J., Collins P.J., Darvishi K., et al. Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat. Genet. 2008;40:1253–1260. doi: 10.1038/ng.237.
  • 29. McCarroll S.A., Kuruvilla F.G., Korn J.M., Cawley S., Nemesh J., Wysoker A., Shapero M.H., de Bakker P.I., Maller J.B., Kirby A., et al. Integrated detection and population-genetic analysis of SNPs and copy number variation. Nat. Genet. 2008;40:1166–1174. doi: 10.1038/ng.238.
  • 30. Grimson R.C. Disease clusters, exact distributions of maxima, and P-values. Stat. Med. 1993;12:1773–1794. doi: 10.1002/sim.4780121906.
  • 31. Grimson R.C., Mendelsohn S. A method for detecting current temporal clusters of toxic events through data monitoring by poison control centers. J. Toxicol. Clin. Toxicol. 2000;38:761–765. doi: 10.1081/clt-100102389.
  • 32. Wu C.C., Grimson R.C., Amos C.I., Shete S. Statistical methods for anomalous discrete time series based on minimum cell count. Biom. J. 2008;50:86–96. doi: 10.1002/bimj.200610374.
  • 33. Wu C.C., Grimson R.C., Shete S. Exact statistical tests for heterogeneity of frequencies based on extreme values. Commun. Stat. Simul. 2010;39:612–623. doi: 10.1080/03610910903528335.
  • 34. Wallenstein S., Neff N. An approximation for the distribution of the scan statistic. Stat. Med. 1987;6:197–207. doi: 10.1002/sim.4780060212.

Articles from Human Molecular Genetics are provided here courtesy of Oxford University Press
