Abstract
Large-scale genome-wide association studies (GWAS) have successfully identified a wide range of genetic variants underlying complex diseases. Network-based regression approaches have been developed to incorporate a biological genetic network and to overcome the computational challenges of analyzing high-dimensional genomic data. In this paper, we propose a gene selection approach that incorporates genetic networks into case-control association studies for DNA sequence data or DNA methylation data. Instead of using traditional dimension reduction techniques such as principal component analysis and supervised principal component analysis, we use a linear combination of genotypes at SNPs or methylation values at CpG sites in a gene to capture gene-level signals. We employ three linear combination approaches: optimally weighted sum (OWS), beta-based weighted sum (BWS), and LD-adjusted polygenic risk score (LD-PRS). OWS and LD-PRS are supervised approaches that depend on the effect of each SNP or CpG site on the case-control status, while BWS can be extracted without using the case-control status. After using one of the linear combinations of genotypes or methylation values in each gene to capture gene-level signals, we regularize them to perform gene selection based on the biological network. Simulation studies show that the proposed approaches have higher true positive rates than traditional dimension reduction techniques. We also apply our approaches to DNA methylation data and UK Biobank DNA sequence data for analyzing rheumatoid arthritis. The results show that the proposed methods can select potential rheumatoid arthritis related genes that are missed by existing methods.
Subject terms: DNA methylation, Genome-wide association studies
Introduction
With the maturation of modern molecular technologies, genomic data are increasingly available in large, diverse data sets [1]. These data sets provide an opportunity to use a large volume of human genetic data to gain meaningful insights into diseases. Over the last decade, large-scale genome-wide association studies (GWAS) have successfully identified a wide range of genetic variants underlying complex diseases [2]. Different types of genetic variants have different biological functions in the human genome. Genotyping can identify small variations in DNA sequence within populations, such as single-nucleotide polymorphisms (SNPs) [3]. Meanwhile, DNA methylation is an epigenetic marker that has suspected regulatory roles in a broad range of biological processes and diseases [4]. Many penalized regression approaches, such as elastic net [5], precision lasso [6], and group lasso [7, 8], have been developed to overcome the computational challenges of analyzing high-dimensional genomic data. However, Kim et al. [9] showed that these approaches, which ignore genetic network structures, have the worst selection performance in terms of the true positive rate.
There is strong evidence that genes are functionally related to each other in a genetic network, and network-based regularization methods that utilize prior biological network knowledge to select phenotype related genes can outperform other statistical methods that do not utilize genetic network information [9]. Utilizing genetic network information indeed improves selection performance when genomic data are highly correlated among linked genes in the same biological process (i.e., genetic pathway). Therefore, network-based regularization methods have been developed for gene expression data [10] and DNA methylation data [11]. To avoid the computational burden of analyzing high-dimensional genomic data, Kim et al. [9] proposed an approach that combines dimension reduction techniques with network-based regression to identify phenotype related genes. The dimension reduction techniques, such as the principal component (PC) based methods (PC, nPC, sPC, etc.) [9], can capture the gene-level signals from multiple CpG sites or SNPs in a gene. The PC method uses the first PC of the DNA methylation data, and nPC normalizes the first PC by the largest eigenvalue of the covariance matrix of the methylation data. In addition, sPC uses the first PC of the data that only contains the CpG sites associated with the phenotype. It has been demonstrated that network-based regression using PC-based dimension reduction techniques can outperform other methods that ignore genetic network structures [9], and the selection performance can be improved if the gene-level signals capture more information.
To date, several popular and powerful gene-based association tests for GWAS have been developed to capture the combined effect of individual genetic variants on a phenotype within a gene, including the burden test (BT), the Sequence Kernel Association Test (SKAT) [12], and Testing an Optimally Weighted combination of variants (TOW) [13]. The combined effect of individual genetic variants on a phenotype offers an attractive alternative to single genetic variant analysis in GWAS. Let $x_{ik}$ denote the genotype (number of minor alleles) of the $i$th individual at the $k$th variant in a gene. To combine information from individual genetic variants into a single measure of risk allele burden, BT, SKAT, and TOW employ a weighted combination of genetic variants, $\sum_{k} w_k x_{ik}$, to test the association between a gene and a phenotype, with different ways to model the weights $w_k$. SKAT uses weights related to the minor allele frequencies of the genetic variants. An important feature of SKAT is that it can handle genetic effects on a phenotype with different directions and magnitudes by incorporating flexible weight functions to boost power. TOW uses the optimal weights obtained by maximizing the score test statistic to test the association between a weighted combination of genetic variants and a phenotype. TOW is more powerful than SKAT when the percentage of neutral variants is larger than 50%. However, these three weighted combinations of individual genetic variants do not account for the LD structure among genetic components in a gene. To adjust for LD between genetic variants, the polygenic LD-adjusted risk score (POLARIS) and the quadratic polygenic risk score (PRSQ) were developed to improve upon the standard PRS by correcting the inflated Type I error rates observed in the standard PRS in the presence of LD [14, 15].
Inspired by these popular gene-based association tests, which use a weighted combination of genetic variants to capture the combined effect of individual genetic variants within a gene, in this paper we propose to use weighted combinations of genetic variants in a gene to capture gene-level signals in network-based regression for case-control association studies with DNA sequence data or DNA methylation data. Instead of using traditional dimension reduction techniques such as PC-based methods, we use a linear combination of genotypes at SNPs or a linear combination of methylation values at CpG sites in each gene to capture gene-level signals. We employ the three weighted combinations of variants used in TOW [13], SKAT [12], and PRSQ [14] to capture gene-level signals. We call these three weighted combinations the optimally weighted sum (OWS), beta-based weighted sum (BWS), and LD-adjusted polygenic risk score (LD-PRS), respectively. After we use one of the weighted combinations of genotypes or methylation values in each gene to capture gene-level signals, we regularize them to perform gene selection based on the biological network. Simulation studies show that our proposed methods have higher true positive rates than traditional dimension reduction techniques. We also apply our methods to DNA methylation data and UK Biobank DNA sequence data for rheumatoid arthritis patients and normal controls. The results show that the methods with the three weighted combinations, OWS, BWS, and LD-PRS, can select potential rheumatoid arthritis related genes that are missed by the PC-based dimension reduction techniques. Meanwhile, the genes identified by our proposed methods are significantly enriched in the rheumatoid arthritis pathway, such as genes HLA-DMA, HLA-DPB1, and HLA-DQA2 in the HLA region. A graphical abstract of the overall approach is given in Fig. S1.
Statistical models and methods
Consider a sample with $n$ unrelated individuals, indexed by $i = 1, \dots, n$. Suppose that there are a set of $M$ genes in the analysis and a total of $K = \sum_{m=1}^{M} k_m$ genetic components, such as SNPs in DNA sequence data or CpG sites in DNA methylation data, where $k_m$ is the number of genetic components in the $m$th gene. Let $X_m = (x_{m1}, \dots, x_{mk_m})$ be an $n \times k_m$ matrix of genetic components in the $m$th gene, where $x_{mk}$ is the $n$-dimensional vector which represents the genetic data for the $k$th genetic component (genotypes of SNPs or M values of CpG sites). Let $y = (y_1, \dots, y_n)^T$ be an $n \times 1$ vector of phenotype, where $y_i = 1$ denotes a case and $y_i = 0$ denotes a control in a case-control study. We define a linear combination of genetic components in the $m$th gene as $z_m = \sum_{k=1}^{k_m} w_{mk} x_{mk}$.
Weighted linear combination methods
To capture gene-level signals from multiple genetic components in a gene, we employ three weighted combinations of variants, OWS, BWS, and LD-PRS. In the following, we give a summary for each of the weighted combinations. Without loss of generality, we ignore the index of the gene and use $z = \sum_{k} w_k x_k$ to indicate a linear combination of genetic components in a gene in this section.
OWS uses the weights in TOW to combine the genetic components in a gene. In TOW [13], the weights are determined by maximizing the score test statistic to test the association between $z$ and a phenotype. The weights are given by $w_k = \sum_{i=1}^{n} (y_i - \bar{y})(x_{ik} - \bar{x}_k) \big/ \sum_{i=1}^{n} (x_{ik} - \bar{x}_k)^2$, where $\bar{y}$ and $\bar{x}_k$ represent the sample mean of the phenotype and the sample mean of genetic data for the $k$th genetic component, respectively. A large weight represents a strong association between the genetic component and the phenotype.
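The OWS computation can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the TOW weight for each component is the centered cross-product with the phenotype divided by the centered sum of squares of that component; the function names `ows_weights` and `gene_signal` and the toy data are hypothetical and not taken from the paper.

```python
def ows_weights(X, y):
    """Return one weight per genetic component (column of X).

    X: list of n rows, each a list of k genotypes or M values.
    y: list of n phenotype values (1 = case, 0 = control).
    """
    n = len(y)
    k = len(X[0])
    y_bar = sum(y) / n
    weights = []
    for j in range(k):
        col = [row[j] for row in X]
        x_bar = sum(col) / n
        num = sum((yi - y_bar) * (xi - x_bar) for yi, xi in zip(y, col))
        den = sum((xi - x_bar) ** 2 for xi in col)
        # Assumption: a component with no variation gets weight 0.
        weights.append(num / den if den > 0 else 0.0)
    return weights


def gene_signal(X, w):
    """Weighted sum of the components for each individual."""
    return [sum(wj * xj for wj, xj in zip(w, row)) for row in X]


# Toy data: 4 individuals, 2 SNPs; the first SNP tracks case status.
X = [[0, 1], [1, 1], [2, 0], [1, 2]]
y = [0, 0, 1, 1]
w = ows_weights(X, y)
z = gene_signal(X, w)
```

In the toy data the first SNP receives a nonzero weight and the second, which is uncorrelated with the phenotype, receives weight 0, so the gene-level signal is driven by the associated component only.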
BWS uses the weights given in SKAT [12], where the $k$th genetic component is weighted by the beta function, $w_k = \mathrm{Beta}(p_k; a_1, a_2)$, and is extracted without using the phenotype. For DNA sequence data, $p_k = \mathrm{MAF}_k$ and the suggested settings of the two parameters in SKAT are $a_1 = 1$ and $a_2 = 25$ [12], where $\mathrm{MAF}_k$ denotes the minor allele frequency of the $k$th genetic component in a gene. For DNA methylation data, $p_k$ is computed from the methylation β values, where $\beta_{ik}$ is the methylation β value for the $k$th CpG site of the $i$th individual and corresponds to the M value $x_{ik}$.
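For DNA sequence data, the beta-based weighting can be illustrated as follows. This is a sketch under the assumption that the weight of a SNP is the Beta(1, 25) density evaluated at its minor allele frequency, which is the default suggested for SKAT [12]; the helper names are hypothetical, and the methylation case is not covered here.

```python
import math


def beta_density(p, a1=1.0, a2=25.0):
    """Density of a Beta(a1, a2) distribution evaluated at p in [0, 1]."""
    log_norm = math.lgamma(a1 + a2) - math.lgamma(a1) - math.lgamma(a2)
    return math.exp(log_norm) * p ** (a1 - 1.0) * (1.0 - p) ** (a2 - 1.0)


def minor_allele_frequency(genotypes):
    """MAF from genotypes coded as minor-allele counts 0, 1, or 2."""
    return sum(genotypes) / (2.0 * len(genotypes))


def bws_signal(X):
    """Weighted sum per individual; the phenotype is never used."""
    k = len(X[0])
    mafs = [minor_allele_frequency([row[j] for row in X]) for j in range(k)]
    w = [beta_density(p) for p in mafs]
    return [sum(wj * xj for wj, xj in zip(w, row)) for row in X]
```

With these parameters the density decreases in the MAF, so rare variants are up-weighted relative to common ones, and no case-control information enters the weights.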
Both BWS and OWS combine the effects of all genetic components in a gene by giving them different weights; however, they do not account for the LD structure among genetic components in a gene. Motivated by POLARIS [15], we employ the LD-adjusted genetic data to adjust for the influence of LD structure. The LD-adjusted genetic data is defined as $X^{*} = X R^{-1/2}$, where $R$ is the correlation matrix of $X$. However, $R^{-1/2}$ may not be stable if there are very small eigenvalues of $R$. To make the LD-adjusted genetic data more robust, we use the method developed by Yan et al. [14] to calculate $R^{-1/2}$. Let $\theta_1 \ge \theta_2 \ge \cdots$ and $q_1, q_2, \ldots$ be the eigenvalues and corresponding eigenvectors of $R$. Then we only use the first $J$ components to calculate $R^{-1/2}$, where $J$ is the smallest number such that the first $J$ eigenvalues account for a pre-specified proportion of the sum of all eigenvalues. Therefore, $R^{-1/2}$ can be written as $\sum_{j=1}^{J} \theta_j^{-1/2} q_j q_j^{T}$.
Then LD-PRS uses the weights $w_k$ proposed by Yan et al. [14], which are determined by $T_k$, the score test statistic to test the association between the $k$th genetic component and the phenotype. The sign of $T_k$ represents the direction of the effect and its magnitude represents the strength of the association. Therefore, the LD-PRS used to capture the gene-level signal is given by $z = \sum_{k} w_k x_k^{*}$, where $x_k^{*}$ is the $k$th column of $X^{*}$.
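The LD adjustment step can be sketched with NumPy as below. Several choices here are assumptions made for illustration only: the columns are standardized before adjustment, the eigenvalue cutoff is expressed as a cumulative share with a hypothetical default `var_threshold=0.999`, and the function name `ld_adjust` is not from the paper. The score-statistic weights of LD-PRS are not reproduced.

```python
import numpy as np


def ld_adjust(X, var_threshold=0.999):
    """Multiply standardized X by a truncated inverse square root of R.

    var_threshold is a hypothetical cutoff: keep the smallest number J of
    leading eigenvalues whose cumulative share of the total reaches it.
    Returns the adjusted data and J.
    """
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # assumption: standardize columns
    R = np.corrcoef(Xs, rowvar=False)           # k x k correlation matrix
    eigval, eigvec = np.linalg.eigh(R)          # eigh returns ascending order
    order = np.argsort(eigval)[::-1]            # reorder to descending
    eigval, eigvec = eigval[order], eigvec[:, order]
    share = np.cumsum(eigval) / eigval.sum()
    J = min(int(np.searchsorted(share, var_threshold)) + 1, len(eigval))
    # Truncated inverse square root built from the first J eigenpairs.
    R_inv_sqrt = (eigvec[:, :J] / np.sqrt(eigval[:J])) @ eigvec[:, :J].T
    return Xs @ R_inv_sqrt, J
```

When all eigenpairs are kept, the adjusted columns are uncorrelated, which is the sense in which the data are moved to an orthogonal space; truncation simply drops directions whose eigenvalues are too small to invert reliably.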
Notably, OWS and LD-PRS are supervised methods since their weights are based on the association between each genetic component and the phenotype; BWS is an unsupervised method and the weights depend on the genetic component and not on the phenotype.
Network-based regularization
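Let $A = (a_{uv})$ be an $M \times M$ adjacency matrix which represents the undirected network connections among genes, where $a_{uv} = 1$ represents that the $u$th and $v$th genes are within the same biological set (i.e., pathway, etc.) and $a_{uv} = 0$ otherwise. Let $D$ be an $M$-dimensional degree matrix, where the $u$th diagonal element is $d_u = \sum_{v} a_{uv}$, which represents the total number of genetic links of the $u$th gene. Therefore, the symmetric normalized Laplacian matrix $L = I - D^{-1/2} A D^{-1/2}$ represents a genetic network structure, where the elements of $L$ are given by
$$L_{uv} = \begin{cases} 1 & \text{if } u = v \text{ and } d_u \neq 0, \\ -\dfrac{1}{\sqrt{d_u d_v}} & \text{if } u \neq v \text{ and } a_{uv} = 1, \\ 0 & \text{otherwise.} \end{cases}$$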
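As a small worked example, the symmetric normalized Laplacian of a toy network can be computed directly from its adjacency matrix. The sketch assumes the standard definition (1 on the diagonal for genes with at least one link, minus one over the square root of the product of degrees for linked pairs, 0 elsewhere); the function name and the four-gene network are illustrative.

```python
import math


def normalized_laplacian(A):
    """Symmetric normalized Laplacian of an undirected 0/1 adjacency matrix."""
    M = len(A)
    degree = [sum(row) for row in A]
    L = [[0.0] * M for _ in range(M)]
    for u in range(M):
        for v in range(M):
            if u == v and degree[u] > 0:
                L[u][v] = 1.0
            elif u != v and A[u][v] == 1:
                L[u][v] = -1.0 / math.sqrt(degree[u] * degree[v])
    return L


# Toy network: genes 0-1-2 form a path, gene 3 has no links.
A = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
L = normalized_laplacian(A)
```

Gene 3 contributes an all-zero row and column, so an unlinked gene is affected only by the lasso part of the penalty and not by the network part.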
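Let $z_i = (z_{i1}, \dots, z_{iM})^T$ be the gene-level signals of the $i$th individual across all genes, which can be obtained by each of the three weighted combinations, OWS, BWS, and LD-PRS. Let $\beta_0$ and $\beta = (\beta_1, \dots, \beta_M)^T$ be the intercept and the effect vector of $M$ genes, respectively. The likelihood function of the phenotype is given by
$$\mathcal{L}(\beta_0, \beta) = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i},$$
where $p_i$ represents the probability that the $i$th individual is a case, which can be calculated by
$$p_i = \frac{\exp(\beta_0 + z_i^T \beta)}{1 + \exp(\beta_0 + z_i^T \beta)}.$$
Based on the genetic network structure, the penalized logistic likelihood using network-based regularization [9] is given by
$$Q_{\lambda, \alpha}(\beta_0, \beta) = -\frac{1}{n} \ell(\beta_0, \beta) + \lambda P_{\alpha}(\beta),$$
where $\ell(\beta_0, \beta) = \log \mathcal{L}(\beta_0, \beta)$ is the log-likelihood function and $P_{\alpha}(\beta)$ is a penalty term which is a combination of the $\ell_1$ penalty and the squared $\ell_2$ penalty incorporating the genetic network structure. $P_{\alpha}(\beta)$ is defined as
$$P_{\alpha}(\beta) = \alpha \|\beta\|_1 + \frac{1 - \alpha}{2} \beta^T S^T L S \beta = \alpha \sum_{u=1}^{M} |\beta_u| + \frac{1 - \alpha}{2} \sum_{u \sim v} \left( \frac{s_u \beta_u}{\sqrt{d_u}} - \frac{s_v \beta_v}{\sqrt{d_v}} \right)^2,$$
where $\|\cdot\|_1$ is the $\ell_1$ norm, and $S = \mathrm{diag}(s_1, \dots, s_M)$ is a diagonal matrix with the estimated signs of the regression coefficients, $s_u \in \{-1, 1\}$ for $u = 1, \dots, M$, on the diagonal entries, which can be obtained from ordinary regression for $M < n$ and ridge regression for $M \ge n$. $\lambda$ is a tuning parameter that controls the sparsity of the network-based regularization, $\alpha$ is a mixing proportion between the lasso penalty and the network-based penalty, and $u \sim v$ denotes that the $u$th and $v$th genes are linked to each other in the genetic network.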
For a given pair of $\lambda$ and α, we can estimate the intercept, $\beta_0$, and the effect vector of M genes, β, by minimizing the penalized logistic likelihood. It is not difficult to show that the penalty function is convex [9, 16], so the solution for $\beta_0$ and β can be obtained via one of the convex optimization algorithms. We use the R package “pclogit”, which implements the cyclic coordinate descent algorithm [11, 17], to estimate $\beta_0$ and β.
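To make the objective concrete, the following sketch evaluates a penalized logistic objective of the kind described above for given coefficients. It is not the pclogit implementation and does no optimization. The scaling is an assumption: the log-likelihood is averaged over individuals and the network term is multiplied by one half, and the exact constants used by the authors may differ. All names are hypothetical.

```python
import math


def penalized_objective(beta0, beta, Z, y, edges, degree, signs, lam, alpha):
    """Negative mean log-likelihood plus a network-based penalty.

    Z: n x M gene-level signals; edges: list of linked gene pairs (u, v);
    degree[u]: number of links of gene u; signs[u]: +1 or -1.
    """
    n = len(y)
    loglik = 0.0
    for zi, yi in zip(Z, y):
        eta = beta0 + sum(b * z for b, z in zip(beta, zi))
        # y * eta - log(1 + exp(eta)), written in a numerically stable form
        loglik += yi * eta - (max(eta, 0.0) + math.log1p(math.exp(-abs(eta))))
    lasso = sum(abs(b) for b in beta)
    network = sum(
        (signs[u] * beta[u] / math.sqrt(degree[u])
         - signs[v] * beta[v] / math.sqrt(degree[v])) ** 2
        for u, v in edges
    )
    # Assumed scaling: 1/n on the log-likelihood and 1/2 on the network term.
    penalty = alpha * lasso + 0.5 * (1.0 - alpha) * network
    return -loglik / n + lam * penalty
```

The network term vanishes when two linked genes have sign-adjusted, degree-scaled coefficients of equal size, which is how the penalty encourages linked genes to enter the model together even when their effects have opposite directions.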
Stability selection
Meinshausen et al. [18] proposed a stability selection method that used a half-sample approach in combination with selection algorithms. In this paper, the half-sample method is used to compute the selection probability (SP) for each gene. Let $I_b$ be the $b$th random subsample that has a size of $\lfloor n/2 \rfloor$ without replacement, where ⌊•⌋ is a floor function which represents the greatest integer less than or equal to the value in the function. In each subsample set $I_b$, there are $\lfloor n_1/2 \rfloor$ randomly subsampled cases and $\lfloor n_0/2 \rfloor$ controls, where $n_1$ and $n_0$ are the number of cases and controls in the data, respectively. For fixed values of $\lambda$ and α, we estimate the regression coefficient $\hat{\beta}_m(I_b; \lambda, \alpha)$ for the $m$th gene according to the above network-based regularization based on the subsample set $I_b$. Then, we repeat the half-sample method B times and count the total number of $\hat{\beta}_m(I_b; \lambda, \alpha) \neq 0$ for $b = 1, \dots, B$. The SP of the $m$th gene can be obtained based on grid sets of α and $\lambda$, which is computed by
$$\mathrm{SP}_m = \max_{\lambda, \alpha} \frac{1}{B} \sum_{b=1}^{B} \mathbb{1}\left( \hat{\beta}_m(I_b; \lambda, \alpha) \neq 0 \right),$$
where $\mathbb{1}(\cdot)$ is an indicator function, which equals 1 if $\hat{\beta}_m(I_b; \lambda, \alpha) \neq 0$ and 0 otherwise, for $m = 1, \dots, M$. $\mathrm{SP}_m$ is the maximum, over all choices of the tuning parameters $\lambda$ and α, of the proportion of the B half-samples in which the $m$th gene is selected. We consider a total of 600 pairs of tuning parameters $\lambda$ and α in simulation studies and real data analysis, and use B = 100 in simulation studies and B = 500 in real data analysis (Details are in Supplementary Text S1).
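The half-sample procedure can be sketched as follows. The fitting step is abstracted into a hypothetical callback `fit_selected`, standing in for the network-based regression, which returns the genes with nonzero coefficients for a given subsample and pair of tuning parameters; the function name, arguments, and seed handling are illustrative assumptions.

```python
import random


def selection_probabilities(y, fit_selected, grid, B=100, seed=0):
    """Half-sample selection probability for each gene.

    fit_selected(indices, lam, alpha) is a hypothetical callback that fits the
    penalized model on the given individuals and returns the set of genes with
    nonzero coefficients. grid is a list of (lam, alpha) pairs.
    """
    rng = random.Random(seed)
    cases = [i for i, yi in enumerate(y) if yi == 1]
    controls = [i for i, yi in enumerate(y) if yi == 0]
    counts = {pair: {} for pair in grid}
    for _ in range(B):
        # Draw half of the cases and half of the controls without replacement.
        half = (rng.sample(cases, len(cases) // 2)
                + rng.sample(controls, len(controls) // 2))
        for pair in grid:
            for gene in fit_selected(half, *pair):
                counts[pair][gene] = counts[pair].get(gene, 0) + 1
    sp = {}
    for pair in grid:
        for gene, c in counts[pair].items():
            # Keep the largest selection proportion over the tuning grid.
            sp[gene] = max(sp.get(gene, 0.0), c / B)
    return sp
```

Genes that are never selected do not appear in the returned dictionary and can be treated as having a selection probability of 0. Reusing the same B subsamples for every pair of tuning parameters is a design choice of this sketch.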
Simulation studies
To evaluate whether the methods with the three weighted combinations, OWS, LD-PRS, and BWS, outperform the methods with PC-based dimension reduction techniques, we follow the simulation settings in Kim et al. [9] (Details are in Text S2, Fig. S2). After generating the individual-level DNA methylation data and DNA sequence data based on a biological network structure, we use the three weighted combinations, OWS, LD-PRS, and BWS, and the three competing PC-based methods, PC, nPC, and sPC, to capture the gene-level signals for each individual across all genes. Then, the selection probability for each gene is obtained by using the half-sample method 100 times and the network-based regression across 600 pairs of tuning parameters $\lambda$ and α. We use the true positive rate (TPR) and the area under the receiver operating characteristic (ROC) curve (AUC) to evaluate the selection performance. TPR is defined as the number of true genes that are selected divided by the number of true genes.
For each scenario, we consider a total of $n = 1000$ individuals, comprising 500 cases and 500 controls, for the balanced case-control studies. Figures 1–2 show the TPR comparisons for the balanced case-control studies in scenario 1. We compare the methods with the three weighted combinations and the methods with the three PC-based dimension reduction techniques, PC, nPC, and sPC, which have been shown to have higher TPR than other methods that do not utilize biological network information. We first compute the selection probabilities of all genes and then rank the top genes based on the selection probabilities for each method.
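The TPR at a given number of selected top genes can be computed as in the sketch below, which ranks genes by selection probability and counts how many true genes fall in the top set. The function name, the gene labels, and the tie-breaking (insertion order for equal probabilities) are illustrative assumptions.

```python
def true_positive_rate(selection_prob, true_genes, top):
    """Share of true genes found among the `top` highest-ranked genes."""
    ranked = sorted(selection_prob, key=selection_prob.get, reverse=True)
    selected = set(ranked[:top])
    return len(selected & set(true_genes)) / len(true_genes)


# Toy example: two true genes, one of which is ranked in the top two.
sp = {"g1": 1.0, "g2": 0.9, "g3": 0.4, "g4": 0.2}
tpr = true_positive_rate(sp, true_genes=["g1", "g3"], top=2)
```

Evaluating the function over a range of `top` values gives a TPR curve of the kind compared across methods in the figures.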
Fig. 1. The true positive rates of the methods based on different gene-level signals for balanced case-control studies with DNA sequence data in scenario 1, where there are five rare variants and five common variants in each gene.
TPR is shown against the number of selected top genes. Three parameters are used to vary the genetic effect: the strength of association signals, the number of SNPs in each gene related to gene-level signals, and the noise level of association signals. The selection probabilities are calculated using the half-sample method 100 times.
Fig. 2. The true positive rates of the methods based on different gene-level signals for balanced case-control studies with DNA methylation data in scenario 1.
TPR is shown against the number of selected top genes. Three parameters are used to vary the genetic effect: the strength of association signals, the number of CpG sites in each gene related to gene-level signals, and the noise level of association signals. The selection probabilities are calculated using the half-sample method 100 times.
In the DNA sequence data analysis (Fig. 1 and Table S1), we pre-set the strength of association signals, the number of components correlated with the gene-level signal, and the error variance, which controls the noise level of association signals. The proposed OWS, LD-PRS, and BWS have better selection performance than the PC-based methods in all eight simulation settings according to TPR and AUC. When the number of causal SNPs in a gene is small, BWS has the uniformly highest TPR and AUC regardless of the size of the error variance. However, the selection performance of the supervised approaches, OWS and LD-PRS, is better than or similar to that of the unsupervised approach, BWS, when the number of causal SNPs in a gene is large. Overall, BWS shows the best selection performance in most simulation settings for DNA sequence data analysis. LD-PRS is better than OWS because LD-PRS adjusts for the LD structure of the SNPs. In the DNA methylation data analysis (Fig. 2 and Table S2), we pre-set the same three types of parameters. All methods have similar performance according to TPR when the strength of the association signal is small, while the methods with the three weighted combinations have higher AUC compared with the three PC-based methods (Table S2). Meanwhile, the methods with the three weighted combinations have higher TPRs and AUCs than the PC-based methods when the strength of the association signal is large. In particular, when the number of components correlated with the gene-level signal is large, BWS has the uniformly highest TPR regardless of the size of the error variance and the strength of association signals. BWS also shows the best selection performance in all simulation settings for DNA methylation data analysis. LD-PRS and OWS have similar performance but have higher TPRs than the three PC-based methods.
Figures S3–S4 show the TPR comparisons for the balanced case-control studies under scenario 2. The patterns of TPR comparisons under scenario 2 for DNA methylation data and DNA sequence data are similar to those under scenario 1 (Figs. 1–2). Meanwhile, we also perform TPR comparisons for the unbalanced case-control studies, where there are a total of $n = 1000$ individuals with 100 cases and 900 controls. Figures S5–S8 show the TPR comparisons for the unbalanced case-control studies. The patterns of TPR comparisons under these two scenarios for DNA methylation data and DNA sequence data are similar to those observed in Figs. 1–2 and Figs. S3–S4.
We also compare the network-based regression (Net) with two penalized regressions that do not consider the network structure, elastic net (ENET) and the least absolute shrinkage and selection operator (Lasso). The comparison results for the selection performance and the computational time are shown in Figs. S9–S13 and are explained in more detail in Text S3. In summary, the results show that the three weighted combinations, OWS, LD-PRS, and BWS, with Net always perform better than the three weighted combinations with Lasso and ENET. However, the three competing PC-based methods (PC, nPC, sPC) with Net may not increase TPR compared with Lasso and ENET. Meanwhile, the network-based regression with a partially correct network structure still outperforms ENET and Lasso (Text S4 and Fig. S14). With respect to model fitting, we use the accuracy rate (ACC) as the measurement of model fitting quality [19] (Text S5), and we observe that the supervised methods (LD-PRS, OWS, sPC) have higher ACC compared with the three unsupervised methods (BWS, PC, nPC). Notably, LD-PRS and OWS always outperform sPC (Fig. S15).
Applications
To evaluate the performance of our proposed methods with the three weighted combinations in real data analyses, we apply our methods to DNA methylation data [20, 21] and UK Biobank DNA sequence data of rheumatoid arthritis (RA) patients and normal controls. The datasets used in this study are summarized in Text S6. Because nPC outperforms the other PC-based methods [9], we only compare nPC with our proposed methods in real data analyses.
Application to DNA methylation data
In the application to DNA methylation data, we select the top 100 genes according to the selection probabilities of each method. We search the GWAS catalog for genes that are associated with RA. Table 1 shows the genes in the GWAS catalog that are also identified by OWS, LD-PRS, BWS, and nPC. OWS identifies 11 genes, LD-PRS identifies 12 genes, BWS identifies 8 genes, and nPC identifies 10 genes. Meanwhile, the number of overlapping genes among the methods in the DNA methylation data analysis is summarized in Fig. S16. There are four genes identified by all four methods: HLA-DQA2, HLA-DRB1, HLA-DQB1, and CD1C. Gene HLA-DRB1 [22–28] and gene HLA-DQB1 [22, 28–34] play a central role in the immune system and have been reported in the GWAS catalog. No study in the GWAS catalog has reported gene HLA-DQA2 as significantly associated with RA. However, the SPs of gene HLA-DQA2 calculated by the methods with the three weighted combinations, OWS, LD-PRS, and BWS, are all 1.000. In addition, the SP of gene HLA-DQA2 under nPC is 0.852, which also places it among the top 100 genes identified by nPC. Notably, gene HLA-DQA2 is in the rheumatoid arthritis pathway (KEGG: hsa05323), and the literature [35] has shown that genes in the human leukocyte antigen (HLA) region remain the most powerful disease risk genes in RA.
Table 1.
GWAS catalog reported genes identified by OWS, LD-PRS, BWS, and nPC in DNA methylation data.
| OWS gene | OWS SP | LD-PRS gene | LD-PRS SP | BWS gene | BWS SP | nPC gene | nPC SP |
|---|---|---|---|---|---|---|---|
| **HLA-DRB1** | 1.000 | **HLA-DRB1** | 1.000 | **HLA-DRB1** | 1.000 | CCR6 | 1.000 |
| HLA-DRB5 | 1.000 | KIF26B | 1.000 | PRKCH | 0.998 | ZFP36L1 | 1.000 |
| CCR6 | 0.992 | HLA-DRB5 | 0.974 | HLA-DQA1 | 0.992 | TCF7 | 0.992 |
| ZFP36L1 | 0.988 | TNXB | 0.974 | HLA-DOB | 0.894 | TNFSF1A | 0.988 |
| NFATC1 | 0.986 | PRDM16 | 0.970 | **HLA-DQB1** | 0.858 | TLR4 | 0.986 |
| TNFRSF1A | 0.950 | HLA-DQA1 | 0.950 | FNBP1 | 0.844 | IL2RB | 0.980 |
| SPSB1 | 0.928 | **HLA-DQB1** | 0.950 | TCF7 | 0.842 | **HLA-DRB1** | 0.966 |
| ETS1 | 0.898 | HLA-DMA | 0.912 | CD247 | 0.804 | CD247 | 0.962 |
| HLA-DQA1 | 0.888 | NOTCH4 | 0.854 | | | **HLA-DQB1** | 0.936 |
| **HLA-DQB1** | 0.880 | HLA-DRA | 0.806 | | | HLA-DRB5 | 0.894 |
| TCF7 | 0.794 | RIM26 | 0.784 | | | ZNF175 | 0.866 |
| | | CCR6 | 0.776 | | | | |
Notes: boldface means that the genes are identified by four methods.
To better understand the biological meaning behind the top 100 genes selected by each method, we perform a pathway enrichment analysis. In this study, a pathway is considered significantly enriched by the top 100 selected genes if FDR < 0.05. As shown in Fig. S17, there are 21 significantly enriched pathways identified by OWS, BWS, and LD-PRS, among which the RA pathway is significantly enriched with FDR$_{\text{OWS}}$ = 1.48E-04, FDR$_{\text{BWS}}$ = 7.80E-03, and FDR$_{\text{LD-PRS}}$ = 8.03E-07, respectively; the RA pathway is also among the 18 significantly enriched pathways identified by nPC, with FDR$_{\text{nPC}}$ = 2.91E-03. The overlapping genes between the top 100 genes identified by each method and the genes in the RA pathway are shown in Fig. 3A. The number below each method indicates the total number of genes that are both identified by the corresponding method and in the RA pathway. LD-PRS has the smallest pathway enrichment FDR and identifies the most overlapping genes (n = 10); genes HLA-DMA (SP = 0.912) and LTB (SP = 0.998) are uniquely identified by LD-PRS. OWS identifies eight overlapping genes, including one unique gene, HLA-DPB1 (SP = 0.85); meanwhile, BWS identifies six overlapping genes, including two unique genes, TNF (SP = 0.980) and HLA-DOB (SP = 0.894). Comparing the results of the methods with the three weighted combinations, OWS, LD-PRS, and BWS, with those of nPC, five HLA-family genes (HLA-DMA, HLA-DOB, HLA-DPB1, HLA-DPA1, and HLA-DQA1) and two other RA pathway genes (LTB and TNF) are identified only by the proposed methods. The results show that the proposed methods can select potential RA related genes that are missed by nPC.
Fig. 3. Venn diagrams of genes identified by BWS, LD-PRS, OWS, and nPC for DNA methylation data and DNA sequence data in the real data analyses.
(A) The number of RA pathway genes identified by each method for DNA methylation data; (B) the number of overlapping genes among the top 200 genes identified by each method and reported in the GWAS catalog for DNA sequence data.
Application to DNA sequence data in UK Biobank
In the application to DNA sequence data, we use 4541 individuals with RA and randomly select 5459 individuals without RA in the UK Biobank. The number of genes with selection probabilities of 1 for DNA sequence data is larger than that for DNA methylation data. For example, there are 80 genes with SP = 1 using OWS and 135 genes with SP = 1 using LD-PRS. Therefore, we select the top 200 genes according to the selection probabilities of each method for the DNA sequence data analysis. We again search the GWAS catalog (https://www.ebi.ac.uk/gwas/) for genes that are associated with RA. Figure 3B and Table 2 show the genes in the GWAS catalog that are also identified by OWS, LD-PRS, BWS, and nPC. Similar to the DNA methylation data analysis, LD-PRS identifies the largest number of genes (n = 23) reported in the GWAS catalog, including four uniquely identified genes (HLA-DQB1, GFRA1, GABBR2, EDIL3); OWS identifies 22 genes, of which genes STAT4 (SP = 0.994) and IKZF1 (SP = 0.986) are uniquely selected. There are 13 genes identified by both LD-PRS and OWS, where 12 genes have selection probabilities of 1 in both methods. The two unsupervised methods, BWS and nPC, identify 17 and 18 genes in the GWAS catalog, of which 11 and 12 genes are uniquely identified, respectively. Moreover, there are two genes identified by all four methods, genes HLA-DQA1 and HLA-DRA (boldfaced in Table 2), and two genes identified by all three proposed methods, genes RARB and CTNNA3.
Table 2.
GWAS catalog reported genes identified by OWS, LD-PRS, BWS, and nPC in DNA sequence data.
| OWS gene | OWS SP | LD-PRS gene | LD-PRS SP | BWS gene | BWS SP | nPC gene | nPC SP |
|---|---|---|---|---|---|---|---|
| HLA-DRB1 | 1.000 | HLA-DRB1 | 1.000 | **HLA-DRA** | 1.000 | HLA-DRB5 | 1.000 |
| **HLA-DQA1** | 1.000 | **HLA-DQA1** | 1.000 | **HLA-DQA1** | 1.000 | **HLA-DQA1** | 0.998 |
| PRDM16 | 1.000 | HLA-DQB1 | 1.000 | TNXB | 0.996 | IRF5 | 0.966 |
| PRKCB | 1.000 | **HLA-DRA** | 1.000 | HLA-DMA | 0.946 | SOCS2 | 0.944 |
| PCSK5 | 1.000 | PRDM16 | 1.000 | SUOX | 0.932 | HLA-DRB1 | 0.942 |
| NOTCH4 | 1.000 | PRKCB | 1.000 | WNT16 | 0.930 | TYK2 | 0.928 |
| GPC5 | 1.000 | PCSK5 | 1.000 | TYK2 | 0.928 | PRDM1 | 0.890 |
| RBFOX1 | 1.000 | NOTCH4 | 1.000 | RPS6KB1 | 0.902 | NOTCH4 | 0.884 |
| DOCK1 | 1.000 | GPC5 | 1.000 | CTNNA3 | 0.898 | IL7R | 0.872 |
| KIF26B | 1.000 | RBFOX1 | 1.000 | HLA-DRB5 | 0.892 | ATXN2 | 0.872 |
| CTNNA3 | 1.000 | DOCK1 | 1.000 | HIPK1 | 0.890 | B3GNT2 | 0.870 |
| GALNT18 | 1.000 | ZMIZ1 | 1.000 | SLC9A8 | 0.882 | UBE2L3 | 0.870 |
| PCDH15 | 1.000 | SLC9A9 | 1.000 | SKIV2L | 0.860 | ELMO1 | 0.864 |
| PTPRM | 1.000 | RARB | 1.000 | TNIP1 | 0.860 | GATA3 | 0.846 |
| HLA-DRB5 | 0.998 | KIF26B | 1.000 | PDF2A | 0.836 | RMI2 | 0.844 |
| RARB | 0.998 | CTNNA3 | 1.000 | TNFAIP3 | 0.834 | RORC | 0.836 |
| **HLA-DRA** | 0.996 | GALNT18 | 1.000 | RARB | 0.824 | **HLA-DRA** | 0.836 |
| ZMIZ1 | 0.996 | PCDH15 | 1.000 | | | RBXW8 | 0.828 |
| SLC9A9 | 0.994 | PTPRM | 1.000 | | | | |
| STAT4 | 0.994 | PDE3A | 0.998 | | | | |
| PDE3A | 0.990 | GFRA1 | 0.996 | | | | |
| IKZF1 | 0.986 | GABBR2 | 0.994 | | | | |
| | | EDIL3 | 0.992 | | | | |
Notes: boldface means that the genes are identified by four methods.
Discussions
In this paper, we employ three weighted combinations to capture the gene-level signals from multiple CpG sites or SNPs in DNA methylation or DNA sequence data: optimally weighted sum (OWS), LD-adjusted polygenic risk score (LD-PRS), and beta-based weighted sum (BWS). To identify phenotype related genes, we apply the three gene-level signals to a stability gene selection approach that incorporates genetic networks. Compared with traditional dimension reduction techniques such as PC-based gene-level signals, the methods with the three weighted combinations, OWS, LD-PRS, and BWS, have higher true positive rates in our simulation studies. By applying the methods to real DNA methylation and DNA sequence data, we show that the methods with the three weighted combinations can select potential RA related genes that are missed by nPC. Meanwhile, OWS, LD-PRS, and BWS can select more genes in the significantly enriched RA pathway compared with nPC, such as genes HLA-DMA, HLA-DPB1, and HLA-DOB in the HLA region.
The three weighted combinations have several advantages for capturing gene-level signals. First, the three weighted combinations can capture more information from genetic components (SNPs or CpG sites) in a gene than traditional dimension reduction techniques, such as PC-based methods. OWS and LD-PRS are two supervised approaches based on the association between each genetic component and the phenotype, where OWS utilizes the optimally weighted combination [13] of components and LD-PRS can adjust for the highly correlated structure [15] of components. OWS puts large weights on components with large effects on the phenotype [13]. Since the genetic components in a gene are commonly correlated, LD-PRS transforms the original data into an orthogonal space to adjust for LD structure. Moreover, OWS and LD-PRS perform better according to TPR when the genetic components are highly correlated. Even though BWS is an unsupervised method that can be extracted without using the phenotype, our simulation studies show that BWS has the highest TPR and AUC in most of the settings. Second, the methods with the three weighted combinations, OWS, LD-PRS, and BWS, can select more potential phenotype related genes. In our application to DNA methylation data of RA patients and normal controls, the top 100 genes selected by our proposed methods are significantly enriched in the RA pathway and contain more RA pathway genes, especially for LD-PRS. Furthermore, all of our proposed methods provide strong evidence for selecting gene HLA-DQA2 (SP = 1), which is not reported in the GWAS catalog.
Recently, large-scale biobanks linked to electronic health records have made it possible to analyze DNA sequence data with a large sample size. Although the three weighted combinations combined with the network-based regression have several advantages, there are three limitations that we need to resolve in future work. First, the methods with the three weighted combinations are not suitable for extremely unbalanced case-control studies. To avoid the extremely unbalanced case-control ratio in the data from UK Biobank, we match the number of individuals with and without RA in the application to DNA sequence data. This may be the reason for the large number of genes with SP = 1 using OWS and LD-PRS, and for the SP of the 200th gene using OWS and LD-PRS being over 0.97. In the future, we will investigate new methods to handle extremely unbalanced case-control studies. We can use the saddlepoint approximation method [36] to adjust the network-based regression, or use random under-sampling or over-sampling [37] methods instead of the half-sample approach in the calculation of selection probabilities. The second limitation is that we do not know whether the genes selected by the methods with the three weighted combinations are significantly associated with the phenotype. For future studies, we plan to integrate statistical inference into the selection procedure, and to further investigate the selection performance when selection and statistical inference are combined. The third limitation is that the network-based regression is only used for case-control studies [9]. For continuous phenotypes, we need to replace the logistic model with logistic likelihood by the linear regression model with mean squared error or a more robust loss function, such as the Huber function [38].
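For reference, the Huber function mentioned as a possible robust loss for continuous phenotypes has the following standard form. The threshold `delta` and its default value are illustrative assumptions; nothing here is taken from the authors' planned extension.

```python
def huber_loss(residual, delta=1.0):
    """Quadratic for small residuals, linear for large ones."""
    r = abs(residual)
    if r <= delta:
        return 0.5 * r * r
    return delta * (r - 0.5 * delta)
```

Compared with the squared error, the loss grows only linearly beyond `delta`, so a few extreme phenotype values have less influence on the fit.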
Acknowledgements
Part of this research has been conducted using the UK Biobank Resource under application number 41722 and the NHGRI-EBI GWAS Catalog. XC was partially supported by the Michigan Technological University Health Research Institute Fellowship program and the Portage Health Foundation Graduate Assistantship.
Author contributions
Formal analysis: XC; Methodology: XC, SZ, XL, and QS; Data curation: XC and XL; Visualization: XC; Writing original draft: XC, XL, and QS; Writing review and editing: XC, SZ, XL, and QS.
Funding
No financial assistance was received in support of the study.
Data availability
All data analyzed during this study are included in this published article and its supplemental materials.
Competing interests
The authors declare no competing interests.
Ethics approval
This study used DNA sequence data from the UK Biobank, which has approval from the North West Multi-centre Research Ethics Committee (MREC) as a Research Tissue Bank (RTB) (approval number: 11/NW/0382). No specific ethical approval was required for the DNA methylation data used in this study, which were downloaded from the publicly available GEO database under accession number GSE42861.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
The online version contains supplementary material available at 10.1038/s41431-022-01264-x.
References
- 1. Ritchie MD. Large-scale analysis of genetic and clinical patient data. Annual Review of Biomedical Data Science. 2018;1:263–74.
- 2. Li R, Duan R, Kember RL, Rader DJ, Damrauer SM, Moore JH, et al. A regression framework to uncover pleiotropy in large-scale electronic health record data. J Am Med Inform Assoc. 2019;26:1083–90. doi: 10.1093/jamia/ocz084.
- 3. Wang DG, Fan J-B, Siao C-J, Berno A, Young P, Sapolsky R, et al. Large-scale identification, mapping, and genotyping of single-nucleotide polymorphisms in the human genome. Science. 1998;280:1077–82. doi: 10.1126/science.280.5366.1077.
- 4. Bock C. Analysing and interpreting DNA methylation data. Nat Rev Genet. 2012;13:705–19. doi: 10.1038/nrg3273.
- 5. Waldmann P, Mészáros G, Gredler B, Fuerst C, Sölkner J. Evaluation of the lasso and the elastic net in genome-wide association studies. Front Genet. 2013;4:270. doi: 10.3389/fgene.2013.00270.
- 6. Wang H, Lengerich BJ, Aragam B, Xing EP. Precision Lasso: accounting for correlations and linear dependencies in high-dimensional genomic data. Bioinformatics. 2019;35:1181–7. doi: 10.1093/bioinformatics/bty750.
- 7. Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. J R Stat Soc: Ser B (Stat Methodol). 2006;68:49–67. doi: 10.1111/j.1467-9868.2005.00532.x.
- 8. Meier L, Van De Geer S, Bühlmann P. The group lasso for logistic regression. J R Stat Soc: Ser B (Stat Methodol). 2008;70:53–71. doi: 10.1111/j.1467-9868.2007.00627.x.
- 9. Kim K, Sun H. Incorporating genetic networks into case-control association studies with high-dimensional DNA methylation data. BMC Bioinforma. 2019;20:1–15. doi: 10.1186/s12859-019-3040-x.
- 10. Li C, Li H. Network-constrained regularization and variable selection for analysis of genomic data. Bioinformatics. 2008;24:1175–82. doi: 10.1093/bioinformatics/btn081.
- 11. Sun H, Wang S. Network-based regularization for matched case-control analysis of high-dimensional DNA methylation data. Stat Med. 2013;32:2127–39. doi: 10.1002/sim.5694.
- 12. Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011;89:82–93. doi: 10.1016/j.ajhg.2011.05.029.
- 13. Sha Q, Wang X, Wang X, Zhang S. Detecting association of rare and common variants by testing an optimally weighted combination of variants. Genet Epidemiol. 2012;36:561–71. doi: 10.1002/gepi.21649.
- 14. Yan S, Sha Q, Zhang S. Gene-based association tests using new polygenic risk scores and incorporating gene expression data. Genes. 2022;13:1120. doi: 10.3390/genes13071120.
- 15. Baker E, Schmidt KM, Sims R, O’Donovan MC, Williams J, Holmans P, et al. POLARIS: Polygenic LD-adjusted risk score approach for set-based analysis of GWAS data. Genet Epidemiol. 2018;42:366–77. doi: 10.1002/gepi.22117.
- 16. Choi J, Kim K, Sun H. New variable selection strategy for analysis of high-dimensional DNA methylation data. J Bioinforma Computational Biol. 2018;16:1850010. doi: 10.1142/S0219720018500105.
- 17. Sun H, Wang S. Penalized logistic regression for high-dimensional DNA methylation data with case-control studies. Bioinformatics. 2012;28:1368–75. doi: 10.1093/bioinformatics/bts145.
- 18. Meinshausen N, Bühlmann P. Stability selection. J R Stat Soc: Ser B (Stat Methodol). 2010;72:417–73. doi: 10.1111/j.1467-9868.2010.00740.x.
- 19. Kuhn M, Johnson K. Applied predictive modeling. Springer; 2013.
- 20. Liu Y, Aryee MJ, Padyukov L, Fallin MD, Hesselberg E, Runarsson A, et al. Epigenome-wide association data implicate DNA methylation as an intermediary of genetic risk in rheumatoid arthritis. Nat Biotechnol. 2013;31:142–7. doi: 10.1038/nbt.2487.
- 21. Kular L, Liu Y, Ruhrmann S, Zheleznyakova G, Marabita F, Gomez-Cabrero D, et al. DNA methylation as a mediator of HLA-DRB1*15:01 and a protective variant in multiple sclerosis. Nat Commun. 2018;9:1–15. doi: 10.1038/s41467-018-04732-5.
- 22. Jiang X, Källberg H, Chen Z, Ärlestig L, Rantapää-Dahlqvist S, Davila S, et al. An Immunochip-based interaction study of contrasting interaction effects with smoking in ACPA-positive versus ACPA-negative rheumatoid arthritis. Rheumatology. 2016;55:149–55. doi: 10.1093/rheumatology/kev285.
- 23. Traylor M, Knevel R, Cui J, Taylor J, Harm-Jan W, Conaghan PG, et al. Genetic associations with radiological damage in rheumatoid arthritis: Meta-analysis of seven genome-wide association studies of 2,775 cases. PloS One. 2019;14:e0223246. doi: 10.1371/journal.pone.0223246.
- 24. Eyre S, Bowes J, Diogo D, Lee A, Barton A, Martin P, et al. High-density genetic mapping identifies new susceptibility loci for rheumatoid arthritis. Nat Genet. 2012;44:1336–40. doi: 10.1038/ng.2462.
- 25. Govind N, Choudhury A, Hodkinson B, Ickinger C, Frost J, Lee A, et al. Immunochip identifies novel, and replicates known, genetic risk loci for rheumatoid arthritis in black South Africans. Mol Med. 2014;20:341–9. doi: 10.2119/molmed.2014.00097.
- 26. Plenge RM, Seielstad M, Padyukov L, Lee AT, Remmers EF, Ding B, et al. TRAF1–C5 as a risk locus for rheumatoid arthritis—a genomewide study. N Engl J Med. 2007;357:1199–209. doi: 10.1056/NEJMoa073491.
- 27. Bossini-Castillo L, De Kovel C, Kallberg H, van’t Slot R, Italiaander A, Coenen M, et al. A genome-wide association study of rheumatoid arthritis without antibodies against citrullinated peptides. Ann Rheum Dis. 2015;74:e15–e. doi: 10.1136/annrheumdis-2013-204591.
- 28. Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447:661. doi: 10.1038/nature05911.
- 29. Wei W-H, Viatte S, Merriman TR, Barton A, Worthington J. Genotypic variability based association identifies novel non-additive loci DHCR7 and IRF4 in sero-negative rheumatoid arthritis. Sci Rep. 2017;7:1–7. doi: 10.1038/s41598-017-05447-1.
- 30. Julia A, Ballina J, Canete JD, Balsa A, Tornero-Molina J, Naranjo A, et al. Genome-wide association study of rheumatoid arthritis in the Spanish population: KLF12 as a risk locus for rheumatoid arthritis susceptibility. Arthritis Rheumatism: Off J Am Coll Rheumatol. 2008;58:2275–86. doi: 10.1002/art.23623.
- 31. Negi S, Juyal G, Senapati S, Prasad P, Gupta A, Singh S, et al. A genome-wide association study reveals ARL15, a novel non-HLA susceptibility gene for rheumatoid arthritis in North Indians. Arthritis Rheumatism. 2013;65:3026–35. doi: 10.1002/art.38110.
- 32. Aterido A, Cañete JD, Tornero J, Ferrándiz C, Pinto JA, Gratacós J, et al. Genetic variation at the glycosaminoglycan metabolism pathway contributes to the risk of psoriatic arthritis but not psoriasis. Ann Rheum Dis. 2019;78:355–64. doi: 10.1136/annrheumdis-2018-214158.
- 33. Kochi Y, Okada Y, Suzuki A, Ikari K, Terao C, Takahashi A, et al. A regulatory variant in CCR6 is associated with rheumatoid arthritis susceptibility. Nat Genet. 2010;42:515–9. doi: 10.1038/ng.583.
- 34. Raychaudhuri S, Remmers EF, Lee AT, Hackett R, Guiducci C, Burtt NP, et al. Common variants at CD40 and other loci confer risk of rheumatoid arthritis. Nat Genet. 2008;40:1216–23. doi: 10.1038/ng.233.
- 35. Weyand CM, Goronzy JJ. Association of MHC and rheumatoid arthritis: HLA polymorphisms in phenotypic variants of rheumatoid arthritis. Arthritis Res Ther. 2000;2:1–5. doi: 10.1186/ar63.
- 36. Dey R, Schmidt EM, Abecasis GR, Lee S. A fast and accurate algorithm to test for binary phenotypes and its application to PheWAS. Am J Hum Genet. 2017;101:37–49. doi: 10.1016/j.ajhg.2017.05.014.
- 37. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res. 2002;16:321–57. doi: 10.1613/jair.953.
- 38. Huber PJ. Robust estimation of a location parameter. In: Breakthroughs in statistics. Springer; 1992. p. 492–518.