A Fast and Accurate Algorithm to Test for Binary Phenotypes and Its Application to PheWAS

Rounak Dey; Ellen M Schmidt; Goncalo R Abecasis; Seunggeun Lee

doi:10.1016/j.ajhg.2017.05.014

2017 Jun 8;101(1):37–49. doi: 10.1016/j.ajhg.2017.05.014


PMCID: PMC5501775 PMID: 28602423

Abstract

The availability of electronic health record (EHR)-based phenotypes allows for genome-wide association analyses in thousands of traits and has great potential to enable identification of genetic variants associated with clinical phenotypes. We can interpret the phenome-wide association study (PheWAS) result for a single genetic variant by observing its association across a landscape of phenotypes. Because a PheWAS can test thousands of binary phenotypes, and most of them have unbalanced or often extremely unbalanced case-control ratios (1:10 or 1:600, respectively), existing methods cannot provide an accurate and scalable way to test for associations. Here, we propose a computationally fast score-test-based method that estimates the distribution of the test statistic by using the saddlepoint approximation. Our method is much (∼100 times) faster than the state-of-the-art Firth’s test. It can also adjust for covariates and control type I error rates even when the case-control ratio is extremely unbalanced. Through application to PheWAS data from the Michigan Genomics Initiative, we show that the proposed method can control type I error rates while replicating previously known association signals even for traits with a very small number of cases and a large number of controls.

Keywords: PheWAS, GWAS, single-variant test, saddlepoint approximation, unbalanced case-control, rare variants

Introduction

Over the last decade, genome-wide association studies (GWASs) have proved instrumental to unravelling the genetic complexities of hundreds of diseases and traits and their associations with common genomic variations. To date, thousands of GWASs have identified more than 4,000 significant loci to be associated with human diseases and traits.¹ However, because most GWASs investigate a single disease or trait, they cannot exploit the cross-phenotype associations or pleiotropy,² where a single genetic variant can be associated with multiple phenotypes. The phenome-wide association study (PheWAS) has been proposed as an alternative approach to take advantage of the pleiotropy phenomenon by studying the impact of genetic variations across a broad spectrum of human phenotypes or “phenome.” It is a complementary approach to the GWAS in the sense that whereas a GWAS attempts to identify phenotype-to-genotype associations, a PheWAS uses a genotype-to-phenotype approach. The first PheWAS³ was published as a proof-of-principle study that demonstrated that the PheWAS strategy could be applied to successfully identify the expected gene-disease associations. Additional studies⁴^,⁵^,⁶^,⁷^,⁸ have shown that the PheWAS approach can further identify previously unreported disease-SNP associations.⁹

The PheWAS approach depends on the availability of detailed phenotypic information. Currently, most PheWASs are applied to clinical cohorts linked to electronic health records (EHRs) and utilize the International Classification of Disease (ICD) billing codes to define clinical phenotypes. The ICD codes provide intuitive phenotype ordering based on clinical disease and trait classifications. Given that the current genotyping and imputation technologies allow for genotyping of tens of millions of variants at a very low cost,¹⁰ an extensive PheWAS can attempt to investigate the genotype-phenotype associations by performing genome-wide association analyses in thousands of traits. We can interpret the PheWAS result of a single genetic variant by observing its associations across the phenome. Such a PheWAS is exhaustive in nature and has great potential to identify variants associated with clinical diseases.

One of the main challenges of the PheWAS approach is that most of the phenotypes are binary phenotypes with unbalanced (1:10) or often extremely unbalanced (1:600) case-control ratios (see Figure S1), given that the data are collected in cohorts. Although standard asymptotic tests (such as the Wald, score, and likelihood-ratio tests) are relatively well calibrated and asymptotically equivalent¹¹ for common (minor allele frequency [MAF] > 0.05) variants in balanced case-control studies, they can inflate type I error for low-frequency (0.01 < MAF ≤ 0.05) and rare (MAF ≤ 0.01) variants in unbalanced case-control studies.¹² Moreover, because the Wald and likelihood-ratio tests need to calculate the likelihood or the maximum-likelihood estimator under the full model, which is computationally expensive, they do not scale to the number of tests that PheWASs attempt. On the other hand, the score test is computationally efficient because it does not need to calculate the maximum likelihood under the full model. However, as mentioned before, it suffers from having highly inflated type I error rates in unbalanced studies. Ma et al. proposed Firth’s penalized likelihood-ratio test¹³ as a solution to control the type I error rates in such situations. Firth’s test, despite being well calibrated and robust for testing low-frequency and rare variants in unbalanced studies, lacks computational efficiency because it also involves calculating the maximum likelihood under the full model. For instance, the projected computation time for testing 1,500 phenotypes across 10 million SNPs is ∼117 CPU years (2,000 cases and 18,000 controls). Thus, it is impractical to apply Firth’s test for analyzing large PheWAS datasets.

In this paper, we propose a score-based single-variant test for binary phenotypes that is well calibrated for controlling type I error and can adjust for covariates even in extremely unbalanced case-control studies. Moreover, our test is computationally efficient and scalable to testing thousands of phenotypes across millions of SNPs in large PheWAS datasets. Our proposed test (SPA) is based on score statistics and estimates the null distribution by using the saddlepoint approximation¹⁴^,¹⁵^,¹⁶ instead of the normal approximation¹⁷ traditionally used in score tests. We further develop an improvement of our test (fastSPA) that renders the most computationally challenging steps dependent only on the number of carriers (subjects with at least one minor allele) rather than the sample size. This improved test can substantially reduce the computation time, especially for low-frequency and rare variants, where the number of carriers is much lower than the sample size. Our method’s projected computation time for testing 1,500 phenotypes across 10 million SNPs is ∼400 CPU days (2,000 cases and 18,000 controls), which is more than 100 times shorter than that of Firth’s test. In addition, through extensive simulation studies and analysis of the Michigan Genomics Initiative (MGI) data, we demonstrate that the proposed approach can control type I error and is powerful enough to replicate known association signals.

Material and Methods

Logistic Regression Model and Saddlepoint Approximation Method

We consider a case-control study with sample size n. For the i^th subject, let Y_i = 1 or 0 denote the case-control status, X_i denote the k × 1 vector of non-genetic covariates (including the intercept), and G_i denote the number of minor alleles (G_i = 0, 1, 2) of the variant to be tested. To relate genotypes to phenotypes, we use the following logistic regression model:

logit [\Pr (Y_{i} = 1 | X_{i}, G_{i})] = X_{i}^{T} β + G_{i} γ for i = 1,2, \dots, n,

(Equation 1)

where β is a k × 1 vector of coefficients of the covariates, and γ is the genotype log odds ratio. Under this model, we are interested in testing for the genetic association by testing the null hypothesis H₀: γ = 0. Let ${\hat{μ}}_{i}$ be the estimate of μ_i = Pr(Y_i = 1|X_i), which is the probability of being a case under H₀. A score statistic for γ from the model (Equation 1) is given by $S = \sum_{i = 1}^{n} G_{i} (Y_{i} - {\hat{μ}}_{i})$ . Suppose $X = {(X_{1}, \dots, X_{n})}^{T}$ is the n × k matrix of covariates, $G = {(G_{1}, \dots, G_{n})}^{T}$ is the genotype vector, W is a diagonal matrix with ${\hat{μ}}_{i} (1 - {\hat{μ}}_{i})$ as the i^th diagonal element, and $\tilde{G} = G - X {(X^{T} W X)}^{- 1} X^{T} W G$ is a covariate-adjusted genotype vector in which covariate effects are projected out from the genotypes (details are given in Appendix A). Then, S can be written as

S = \sum_{i = 1}^{n} {\tilde{G}}_{i} (Y_{i} - {\hat{μ}}_{i}),

(Equation 2)

and the mean and variance of S under H₀ are $E_{H_{0}} (S) = 0$ and $V_{H_{0}} (S) = \sum_{i = 1}^{n} {\tilde{G}}_{i}^{2} {\hat{μ}}_{i} (1 - {\hat{μ}}_{i})$ , respectively, where ${\tilde{G}}_{i}$ is the i^th element of $\tilde{G}$ .
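The quantities in Equations 1 and 2 can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses NumPy, the function names `fit_null_logistic` and `score_statistic` are hypothetical, the null model is fitted by a fixed number of Newton-Raphson steps, and the data are random placeholders.

```python
import numpy as np

def fit_null_logistic(X, y, n_iter=25):
    """Fit logit Pr(Y=1|X) = X beta by Newton-Raphson and return mu_hat."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))
        w = mu * (1.0 - mu)
        # Newton step: beta += (X^T W X)^{-1} X^T (y - mu)
        beta += np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (y - mu))
    return 1.0 / (1.0 + np.exp(-X @ beta))

def score_statistic(X, y, g):
    """Return (S, V): S from Equation 2 and its null variance V."""
    mu = fit_null_logistic(X, y)
    w = mu * (1.0 - mu)
    # Z = (X^T W X)^{-1} X^T W G, so that G_tilde = G - X Z
    z = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * g))
    g_tilde = g - X @ z
    s = float(g_tilde @ (y - mu))
    v = float(np.sum(g_tilde ** 2 * w))
    return s, v

# Placeholder data (assumed for illustration only).
rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([np.ones(n), rng.integers(0, 2, n), rng.normal(size=n)])
y = rng.binomial(1, 0.1, n).astype(float)
g = rng.binomial(2, 0.05, n).astype(float)
s, v = score_statistic(X, y, g)
```

Because the fitted null model satisfies X^T(Y − μ̂) = 0, the statistic computed from the adjusted genotypes coincides with the unadjusted sum of G_i(Y_i − μ̂_i); the adjustment matters for the variance and the CGF.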

The traditional score test approximates the null distribution by using a normal distribution, which depends only on the mean and the variance of the score statistic. We can obtain the p value by comparing the observed test statistic (s) and $N (0, V_{H_{0}} (S))$ . Normal approximation works well near the mean of the distribution but performs very poorly at the tails. The performance is especially poor when the underlying distribution is highly skewed, such as in unbalanced case-control outcomes,¹² because normal approximation cannot incorporate higher moments such as skewness. In addition, the convergence rate of normal approximation¹⁸^,¹⁹^,²⁰ is O(n^−1/2), which is not fast enough for rare variants.

Saddlepoint approximation was introduced by Daniels¹⁴ as an improvement over the normal approximation. Contrary to normal approximation, where only the first two cumulants (mean and variance) are used for approximating the underlying distribution, saddlepoint approximation uses the entire cumulant-generating function (CGF). Jensen²¹ further showed that saddlepoint approximation has a relative error bound of $O (n^{- 3 / 2})$ , making it a considerable improvement over the normal approximation.

To use the saddlepoint approximation, we first derive the CGF of S from the fact that Y_i ∼ Bernoulli (μ_i) under H₀. Let $\hat{μ}$ be an n × 1 vector with ${\hat{μ}}_{i}$ as the i^th element. From Equation 2, the estimate of the CGF of the score statistic S is

K (t) = \log (E_{H_{0}} (e^{t S})) = \sum_{i = 1}^{n} \log (1 - {\hat{μ}}_{i} + {\hat{μ}}_{i} e^{{\tilde{G}}_{i} t}) - t \sum_{i = 1}^{n} {\tilde{G}}_{i} {\hat{μ}}_{i},

and the estimates of the first- and second-order derivatives of K are

K^{'} (t) = \sum_{i = 1}^{n} \frac{{\hat{μ}}_{i} {\tilde{G}}_{i}}{(1 - {\hat{μ}}_{i}) e^{- {\tilde{G}}_{i} t} + {\hat{μ}}_{i}} - \sum_{i = 1}^{n} {\tilde{G}}_{i} {\hat{μ}}_{i}

and

K^{''} (t) = \sum_{i = 1}^{n} \frac{(1 - {\hat{μ}}_{i}) {\hat{μ}}_{i} {\tilde{G}}_{i}^{2} e^{- {\tilde{G}}_{i} t}}{{[(1 - {\hat{μ}}_{i}) e^{- {\tilde{G}}_{i} t} + {\hat{μ}}_{i}]}^{2}},

respectively. We note that K, K′, and K″ are plug-in estimates in which we plug in ${\hat{μ}}_{i}$ instead of μ_i. Then, according to the saddlepoint method (Barndorff-Nielsen¹⁵^,¹⁶), the distribution of S at s can be approximated by

\Pr (S < s) \approx \tilde{F} (s) = Φ \left\{ w + \frac{1}{w} \log \left( \frac{v}{w} \right) \right\},

where $w = sgn (\hat{t}) \sqrt{2 (\hat{t} s - K (\hat{t}))}, v = \hat{t} \sqrt{K^{''} (\hat{t})}$ , $\hat{t}$ is the solution to the equation $K^{'} (\hat{t}) = s$ , and Φ is the distribution function of a standard normal distribution.
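The approximation above can be sketched in a few lines of standard-library Python. This is an illustrative sketch rather than the authors' code: the function names are hypothetical, the saddlepoint is located by plain bisection on a fixed bracket [−50, 50] (an assumption that suits moderate genotype values), and the formula is not evaluated at s = 0, where w and v both vanish.

```python
import math

def normal_cdf(x):
    """Standard normal distribution function Phi(x)."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def cgf(t, g, mu):
    """K(t) for S = sum_i g_i (Y_i - mu_i) with Y_i ~ Bernoulli(mu_i)."""
    return sum(math.log(1.0 - m + m * math.exp(gi * t)) - t * gi * m
               for gi, m in zip(g, mu))

def cgf_d1(t, g, mu):
    """First derivative K'(t)."""
    return sum(m * gi / ((1.0 - m) * math.exp(-gi * t) + m) - gi * m
               for gi, m in zip(g, mu))

def cgf_d2(t, g, mu):
    """Second derivative K''(t)."""
    total = 0.0
    for gi, m in zip(g, mu):
        e = math.exp(-gi * t)
        total += (1.0 - m) * m * gi * gi * e / ((1.0 - m) * e + m) ** 2
    return total

def saddlepoint_cdf(s, g, mu, lo=-50.0, hi=50.0):
    """Approximate Pr(S < s); assumes s != 0 and K'(lo) < s < K'(hi)."""
    for _ in range(200):  # bisection on the increasing function K'
        mid = 0.5 * (lo + hi)
        if cgf_d1(mid, g, mu) < s:
            lo = mid
        else:
            hi = mid
    t = 0.5 * (lo + hi)
    w = math.copysign(math.sqrt(2.0 * (t * s - cgf(t, g, mu))), t)
    v = t * math.sqrt(cgf_d2(t, g, mu))
    return normal_cdf(w + math.log(v / w) / w)

# Illustrative skewed example: 200 carriers, each with a 2% case probability.
g = [1.0] * 200
mu = [0.02] * 200
upper_tail = 1.0 - saddlepoint_cdf(6.0, g, mu)
```

In the symmetric case where every μ̂_i equals 0.5, the approximation satisfies F̃(s) + F̃(−s) = 1, which gives a convenient sanity check.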

Implementation Details and Approaches to Reducing Computation Time

The saddlepoint approximation method involves finding the root of the saddlepoint equation $K^{'} (t) = s$ . It is easy to verify that K′ strictly increases as K″(t) > 0 for all −∞ < t < ∞, and $s = \sum_{i = 1}^{n} {\tilde{G}}_{i} (Y_{i} - {\hat{μ}}_{i})$ lies between $\lim_{t \to \infty} K^{'} (t) = \sum_{i : {\tilde{G}}_{i} > 0} {\tilde{G}}_{i} - \sum_{i = 1}^{n} {\tilde{G}}_{i} {\hat{μ}}_{i}$ and $\lim_{t \to - \infty} K^{'} (t) = \sum_{i : {\tilde{G}}_{i} < 0} {\tilde{G}}_{i} - \sum_{i = 1}^{n} {\tilde{G}}_{i} {\hat{μ}}_{i}$ . Therefore, a unique root exists, and we can use popular root-finding algorithms (Newton-Raphson,²²^,²³ bisection,²³ secant,²³ and Brent’s method²⁴) to efficiently solve this equation. For our simulation studies and real-data applications, we applied a combination of the Newton-Raphson and bisection method to solve the saddlepoint equations.
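One way to combine Newton-Raphson with bisection for a strictly increasing function is sketched below. The exact combination used in the study is not described here, so this is an assumed variant: a Newton step is accepted only when it stays inside the current bracket, and a bisection step is taken otherwise. The function name `solve_increasing` is hypothetical.

```python
import math

def solve_increasing(f, fprime, target, lo, hi, tol=1e-10, max_iter=200):
    """Solve f(t) = target for strictly increasing f with f(lo) < target < f(hi)."""
    t = 0.5 * (lo + hi)
    for _ in range(max_iter):
        resid = f(t) - target
        if abs(resid) < tol:
            break
        # Shrink the bracket; it stays valid because f is increasing.
        if resid > 0:
            hi = t
        else:
            lo = t
        newton = t - resid / fprime(t)
        # Accept the Newton step only if it remains inside the bracket.
        t = newton if lo < newton < hi else 0.5 * (lo + hi)
    return t

# Example: K'(t) = 50 tanh(t / 2) is the CGF derivative for 100 subjects
# with G_i = 1 and mu_i = 0.5; solve K'(t) = 10.
root = solve_increasing(
    lambda t: 50.0 * math.tanh(t / 2.0),
    lambda t: 25.0 / math.cosh(t / 2.0) ** 2,
    target=10.0, lo=-20.0, hi=20.0,
)
```

The bracket guarantees convergence even when a Newton step overshoots in the flat tails of K′, while the Newton steps retain fast convergence near the root.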

The most computationally demanding step in this saddlepoint approximation method is calculating the CGF and its derivatives. Here, we propose several approaches to reducing the computational complexities associated with these calculations.

Faster Calculation of the CGF by a Partially Normal Approximation Approach

The most computationally intensive step in the saddlepoint method is the calculation of the CGF K and its derivatives. In each step of the root-finding algorithm, we need to calculate K, K′, and K″, each of which needs O(n) computations. Using the fact that many elements of G are zeroes (i.e., homozygous major genotypes), we propose a fast computation method that speeds up the computation to O(m), where m is the number of non-zero elements in G. Without loss of generality, we assume that the first m subjects have at least one minor allele each and the rest have homozygous major genotypes. We can then express S as S = S₁ + S₂, where $S_{1} = \sum_{i = 1}^{m} {\tilde{G}}_{i} (Y_{i} - {\hat{μ}}_{i})$ and $S_{2} = \sum_{i = m + 1}^{n} {\tilde{G}}_{i} (Y_{i} - {\hat{μ}}_{i})$ . Let $Z = {(X^{T} W X)}^{- 1} X^{T} W G$ , and let Z_l be the l^th element of Z. Then, we can further express S₂ as

\begin{matrix} S_{2} = \sum_{i = m + 1}^{n} {\tilde{G}}_{i} (Y_{i} - {\hat{μ}}_{i}) = \sum_{i = m + 1}^{n} (0 - X_{i} Z) (Y_{i} - {\hat{μ}}_{i}) \\ = - \sum_{i = m + 1}^{n} \sum_{l = 1}^{k} X_{i l} Z_{l} (Y_{i} - {\hat{μ}}_{i}) = - \sum_{l = 1}^{k} Z_{l} \sum_{i = m + 1}^{n} X_{i l} (Y_{i} - {\hat{μ}}_{i}) \\ = - \sum_{l = 1}^{k} Z_{l} S_{2 l}, \end{matrix}

where $S_{2 l} = \sum_{i = m + 1}^{n} X_{i l} (Y_{i} - {\hat{μ}}_{i})$ . Now, if we assume that the non-genetic covariates are relatively balanced in the sample, then the normal distribution should be a good approximation of the null distribution of each S_2l. Because S₂ is a weighted sum of the S_2l variables, we can also approximate the null distribution of S₂ by using a normal distribution where the mean and variance under H₀ are given by $E_{H_{0}} (S_{2}) = 0$ and $V_{H_{0}} (S_{2}) = \sum_{i = m + 1}^{n} {\tilde{G}}_{i}^{2} {\hat{μ}}_{i} (1 - {\hat{μ}}_{i})$ , respectively. Then, the CGF of $S_{2}$ can be approximated by

K_{2} (t) = \frac{1}{2} t^{2} V_{H_{0}} (S_{2}),

and the CGF of S = S₁ + S₂ can be approximated by

K (t) = \sum_{i = 1}^{m} \log (1 - {\hat{μ}}_{i} + {\hat{μ}}_{i} e^{{\tilde{G}}_{i} t}) - t \sum_{i = 1}^{m} {\tilde{G}}_{i} {\hat{μ}}_{i} + \frac{1}{2} t^{2} V_{H_{0}} (S_{2}) .

(Equation 3)

In order to calculate the first two terms on the right side of Equation 3, we will need ${\tilde{G}}_{i}$ values for i = 1, …, m, which can be calculated in O(m) computations given that G has only m non-zero elements and the quantity $X {(X^{T} W X)}^{- 1} X^{T} W$ can be pre-calculated. Then, the first two terms will require only O(m) computations because both of them sum over m elements. Next, the variance $V_{H_{0}} (S_{2})$ can be further broken down into

\begin{matrix} V_{H_{0}} (S_{2}) = \sum_{i = m + 1}^{n} {\tilde{G}}_{i}^{2} {\hat{μ}}_{i} (1 - {\hat{μ}}_{i}) = \sum_{i = m + 1}^{n} {(X_{i} Z)}^{2} {\hat{μ}}_{i} (1 - {\hat{μ}}_{i}) \\ = \sum_{i = 1}^{n} {(X_{i} Z)}^{2} {\hat{μ}}_{i} (1 - {\hat{μ}}_{i}) - \sum_{i = 1}^{m} {(X_{i} Z)}^{2} {\hat{μ}}_{i} (1 - {\hat{μ}}_{i}) \\ = Z^{T} (X^{T} W X) Z - \sum_{i = 1}^{m} {(X_{i} Z)}^{2} {\hat{μ}}_{i} (1 - {\hat{μ}}_{i}) . \end{matrix}

Because X^TWX can be pre-calculated and Z is a k × 1 vector, the first term (a quadratic form in a k × k matrix) requires O(k²) computations, and the second term requires O(m) computations, which implies that the calculation of $V_{H_{0}} (S_{2})$ requires O(m) computations under the assumption that k² < m, i.e., the number of non-genetic covariates is small relative to the number of subjects with at least one minor allele each. Hence, the CGF K(t) can be calculated in O(m) computations. Using similar arguments, we can further show that the derivatives K′(t) and K″(t) can also be calculated in O(m) computations. Therefore, this partially normal approximation reduces the computational complexity of our test from O(n) to O(m), which is especially useful for rare variants, where m is much smaller than n.
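The variance decomposition for the non-carrier part can be checked numerically. The sketch below is illustrative only: it assumes NumPy, uses a hypothetical helper `var_s2_fast`, and replaces the fitted μ̂ with placeholder probabilities. It compares the carrier-only formula with the direct O(n) sum over non-carriers.

```python
import numpy as np

def var_s2_fast(XtWX, z, X_carriers, w_carriers):
    """V_H0(S2) = Z^T (X^T W X) Z - sum over carriers of (X_i Z)^2 w_i."""
    xz = X_carriers @ z
    return float(z @ XtWX @ z - np.sum(xz ** 2 * w_carriers))

rng = np.random.default_rng(1)
n, maf = 5000, 0.01
X = np.column_stack([np.ones(n), rng.normal(size=n)])
mu = 1.0 / (1.0 + np.exp(-(X @ np.array([-3.0, 0.5]))))  # stand-in for mu_hat
w = mu * (1.0 - mu)
g = rng.binomial(2, maf, n).astype(float)

XtWX = X.T @ (X * w[:, None])            # computed once per phenotype
z = np.linalg.solve(XtWX, X.T @ (w * g))
carriers = np.flatnonzero(g)             # the m subjects with G_i > 0

fast = var_s2_fast(XtWX, z, X[carriers], w[carriers])

# Direct O(n) evaluation over the non-carriers, for comparison.
g_tilde = g - X @ z
non_carriers = g == 0
direct = float(np.sum(g_tilde[non_carriers] ** 2 * w[non_carriers]))
```

In this sketch `X.T @ (w * g)` is still written as a dense product for brevity; an implementation aiming at O(m) would restrict that product to the carrier rows as well.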

Using Normal Approximation near the Mean for Faster Computation

Because the normal approximation behaves well near the mean of the distribution, we can use it to obtain the p value when the observed score statistic (s) lies close to the mean (0). Moreover, saddlepoint approximation can be numerically unstable very close to the mean of the distribution. We can also avoid such situations by using normal approximation near the mean. One possible approach is to use a fixed threshold in which we apply normal approximation to obtain the p value if the absolute value of the observed score statistic, |s| < rσ, where $σ = \sqrt{V_{H_{0}} (S)}$ and r is a pre-specified value. For example, we used r = 2 in our simulation studies and real-data analyses. For a given level α, this approach does not inflate type I error rates if r < Φ⁻¹(1 − α/2), where Φ⁻¹ is the inverse function of the standard normal distribution function, Φ(x).
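The fixed-threshold rule can be sketched as a small dispatcher. The names `hybrid_p_value` and `threshold_is_safe` are hypothetical, the two-sided normal p value is an assumption about how the test is reported, and `spa_p_value` stands for any user-supplied saddlepoint routine.

```python
import math
from statistics import NormalDist

def hybrid_p_value(s, sigma, spa_p_value, r=2.0):
    """Use the normal approximation when |s| < r * sigma, else the saddlepoint p value."""
    if abs(s) < r * sigma:
        # Two-sided normal p value, 2 * (1 - Phi(|s| / sigma)).
        return math.erfc(abs(s) / (sigma * math.sqrt(2.0)))
    return spa_p_value(s)

def threshold_is_safe(r, alpha):
    """Check the condition r < Phi^{-1}(1 - alpha / 2)."""
    return r < NormalDist().inv_cdf(1.0 - alpha / 2.0)

# With r = 2, the condition holds at the genome-wide level alpha = 5e-8.
safe_at_genome_wide = threshold_is_safe(2.0, 5e-8)
```

The condition ensures that any statistic handled by the normal branch has a normal p value above α, so that branch can never produce a rejection at level α.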

Alternatively, we can adaptively select the threshold by using the error bound of the normal approximation given by the Berry-Esseen theorem. Suppose we are interested in controlling the type I error rate at level α. Let F_n(x) be the true distribution function of the standardized score test statistic $S / \sqrt{V_{H_{0}} (S)}$ . Then, according to the Berry-Esseen theorem,¹⁸^,¹⁹^,²⁰ the maximum error in approximating F_n(x) by Φ(x) is bounded by

\sup_{x \in R} | F_{n} (x) - Φ (x) | \leq B_{n} = C {(σ^{2})}^{- 3 / 2} (\sum_{i = 1}^{n} ρ_{i}),

(Equation 4)

where $ρ_{i} = E_{H_{0}} [{| {\tilde{G}}_{i} (Y_{i} - {\hat{μ}}_{i}) |}^{3}] = {| {\tilde{G}}_{i} |}^{3} {\hat{μ}}_{i} (1 - {\hat{μ}}_{i}) [{\hat{μ}}_{i}^{2} + {(1 - {\hat{μ}}_{i})}^{2}]$ and C is a constant. As of now, the best-known estimate for C is 0.56, given by Shevtsova.²⁵ Suppose p_F and p_N are F_n(x)- and Φ(x)-based p values, respectively. From the Berry-Esseen theorem, we can show p_N ≤ p_F + B_n. Suppose q = B_n + α/2 and r_α = Φ⁻¹(1 − q). Then, p_N ≥ q indicates p_F ≥ α/2. Therefore, we use $r_{α} σ$ as a threshold at level α in which we will apply normal approximation if |s| < r_ασ.
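The adaptive threshold can be sketched with the standard library alone. The function name `berry_esseen_threshold` is hypothetical, and returning r_α = 0 when q ≥ 0.5 (so that the normal approximation is never used) is an assumption made here for the degenerate case, not a rule stated in the text.

```python
from statistics import NormalDist

def berry_esseen_threshold(g_tilde, mu, alpha, C=0.56):
    """Return (B_n, r_alpha) from Equation 4 with q = B_n + alpha / 2."""
    var = sum(g * g * m * (1.0 - m) for g, m in zip(g_tilde, mu))
    # rho_i = |g|^3 mu (1 - mu) [mu^2 + (1 - mu)^2]
    rho = sum(abs(g) ** 3 * m * (1.0 - m) * (m * m + (1.0 - m) ** 2)
              for g, m in zip(g_tilde, mu))
    b_n = C * var ** -1.5 * rho
    q = b_n + alpha / 2.0
    if q >= 0.5:
        return b_n, 0.0  # assumed fallback: always use the saddlepoint method
    return b_n, NormalDist().inv_cdf(1.0 - q)

# Balanced illustration: 10,000 subjects with |G_tilde_i| = 1 and mu_i = 0.5.
b_bal, r_bal = berry_esseen_threshold([1.0] * 10000, [0.5] * 10000, 5e-8)

# Sparse, skewed illustration: 40 subjects with mu_i = 0.002.
b_skew, r_skew = berry_esseen_threshold([1.0] * 40, [0.002] * 40, 5e-8)
```

In the balanced illustration the bound is small and the threshold lies above 2 standard deviations, whereas in the skewed illustration the bound exceeds 0.5 and the normal branch is disabled.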

Numerical Simulations

To evaluate the computation times, type I error rates, and power of the proposed method, we carried out extensive simulation studies. We considered three different case-control ratios: balanced with 10,000 cases and 10,000 controls, moderately unbalanced with 2,000 cases and 18,000 controls, and extremely unbalanced with 40 cases and 19,960 controls. For each choice of case-control ratio, the phenotypes were simulated on the basis of the following logistic model:

logit [\Pr (Y_{i} = 1)] = β_{0} + X_{1 i} + X_{2 i} + γ G_{i},

where the two non-genetic covariates X_1i and X_2i were simulated from X_1i ∼ Bernoulli (0.5) and X_2i ∼ N(0, 1), respectively. The intercept β₀ was chosen to correspond to a prevalence of 0.01. The genotype G_i values were generated from a binomial(2, p) distribution where p was the MAF. The parameter γ represents the genotype log odds ratio.
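The data-generating model can be sketched as follows. This is an assumed, simplified version: it draws a single population sample rather than fixed numbers of cases and controls, calibrates β₀ by bisection so that the average case probability equals the target prevalence, and uses hypothetical names (`expit`, `calibrate_intercept`).

```python
import math
import random

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def calibrate_intercept(x1, x2, g, gamma, prevalence, lo=-20.0, hi=20.0):
    """Bisection on beta0 so that the mean of Pr(Y_i = 1) equals the prevalence."""
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        mean_p = sum(expit(mid + a + b + gamma * c)
                     for a, b, c in zip(x1, x2, g)) / len(g)
        if mean_p < prevalence:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = random.Random(42)
n, maf, gamma = 20000, 0.05, 0.0  # gamma = 0 corresponds to the null hypothesis

x1 = [float(rng.random() < 0.5) for _ in range(n)]   # X1 ~ Bernoulli(0.5)
x2 = [rng.gauss(0.0, 1.0) for _ in range(n)]         # X2 ~ N(0, 1)
# G ~ binomial(2, p): two independent allele draws per subject.
g = [(rng.random() < maf) + (rng.random() < maf) for _ in range(n)]

beta0 = calibrate_intercept(x1, x2, g, gamma, prevalence=0.01)
probs = [expit(beta0 + a + b + gamma * c) for a, b, c in zip(x1, x2, g)]
y = [int(rng.random() < p) for p in probs]
```

Obtaining a fixed case-control ratio such as 2,000:18,000 would additionally require sampling cases and controls separately from a larger simulated population, which this sketch omits.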

To estimate computation times and type I error rates in realistic scenarios, we randomly sampled the MAF (p) from the MAF distribution in the MGI data. To compare computation times, we simulated 10⁴ variants with γ = 0. To compare type I error rates, we simulated 10⁹ variants with γ = 0 and recorded the number of rejections at α = 5 × 10⁻⁵ and 5 × 10⁻⁸. We also used fixed MAFs to evaluate the effect of MAF on computation time and type I error rates. For the power calculations, we considered two different choices for MAF (p = 0.01 and 0.05) and wide ranges of γ (Figure 4). For each choice of p and γ, we generated 5,000 variants.

Figure 4. Empirical Power Curves for Score, fastSPA-2, and Firth’s Test

The top and bottom panels consider MAF = 0.05 and 0.01, respectively. From left to right, the plots consider case-control ratios 10,000:10,000 (balanced), 2,000:18,000 (moderately unbalanced), and 40:19,960 (extremely unbalanced). In each plot, the x axis represents genotype odds ratios, and the y axis represents the empirical power. Empirical power was estimated from 5,000 simulated datasets at the test-specific α levels where their empirical type I errors were equal to 5 × 10⁻⁸.

We compared the computation times of seven different tests: a traditional score test using normal approximation (Score), the saddlepoint-approximation-based test with a standard-deviation threshold at 0.1 and 2 (SPA-0.1 and SPA-2, respectively), the fast saddlepoint-approximation-based test with the partially normal approximation improvement and a standard-deviation threshold at 0.1 and 2 (fastSPA-0.1 and fastSPA-2, respectively), the fastSPA test with the Berry-Esseen bound threshold at α = 5 × 10⁻⁸ (fastSPA-BE), and Firth’s penalized likelihood-ratio test. Next, we compared the empirical type I errors and power curves for fastSPA-2, Score, and Firth’s test at 5 × 10⁻⁸. Because performing Firth’s test 10⁹ times, which is required for estimating type I error rates at 5 × 10⁻⁸, is practically impossible given the heavy computational burden, we performed a hybrid approach in which we used Firth’s test only when the fastSPA-2 p values were smaller than 5 × 10⁻³. For the power comparison, because Score has extremely inflated type I errors in the unbalanced and extremely unbalanced case-control scenarios (as shown in the Results), it might not be appropriate to directly compare the power of Score with that of the other two tests at the same nominal α level. In order to provide a more meaningful comparison, we compared their powers at the empirical α levels where their empirical type I errors became 5 × 10⁻⁸. The empirical α levels were selected on the basis of the type I error simulations, whereby variants were simulated with MAF randomly sampled from the MAF distribution of the MGI data. This approach is similar to performing resampling (e.g., permutation) to control family-wise error rates. We also estimated the powers at the nominal fixed α = 5 × 10⁻⁸. In order to compare the p values resulting from different tests, we also simulated 5 × 10⁶ variants with MAFs randomly sampled from the MAF distribution of the MGI data. We further compared the inflation factors of the genomic controls at different p value quantiles for fastSPA-2, fastSPA-BE, and fastSPA-0.1 in order to explore the effect of the standard-deviation threshold on the inflation factor.
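The idea of a test-specific empirical α level can be sketched as an order statistic of null p values. The function name `empirical_alpha` is hypothetical and the two sets of p values are synthetic placeholders, not simulation results from this study.

```python
import math

def empirical_alpha(null_p_values, target_rate):
    """Return the p value cutoff whose rejection rate under H0 equals target_rate."""
    ps = sorted(null_p_values)
    k = math.floor(target_rate * len(ps) + 1e-9)  # null rejections allowed
    if k < 1:
        raise ValueError("too few null replicates for this target rate")
    # Rejecting p <= ps[k - 1] gives exactly k rejections when there are no ties.
    return ps[k - 1]

n = 1000
calibrated = [(i + 1) / n for i in range(n)]        # uniform null p values
inflated = [((i + 1) / n) ** 2 for i in range(n)]   # anti-conservative test

alpha_calibrated = empirical_alpha(calibrated, 0.01)
alpha_inflated = empirical_alpha(inflated, 0.01)
```

For a well-calibrated test the empirical level matches the nominal one, whereas an inflated test receives a much smaller cutoff, which is why its power drops once type I error is held fixed.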

Application to MGI Data

To illustrate the performance of the proposed methods in real-data application, we analyzed four selected phenotypes in the MGI data. The main goal of MGI is to create an institutional repository of genetic data together with rich clinical phenotypes for a broad portfolio of future medical research. DNA from blood samples of >20,000 individuals who underwent surgical procedures at the University of Michigan Health System was genotyped (with their informed consent) on the Illumina HumanCoreExome v.12.1 array, which is a combined GWAS plus exome array composed of >500,000 SNPs. Genotypes of the Haplotype Reference Consortium²⁶ (chromosomes 1–22: HRC release 1; chromosome X: HRC release 1.1) were imputed into the phased MGI genotypes (SHAPEIT2²⁷ on autosomal chromosomes and Eagle2²⁸ on chromosome X) with Minimac3.²⁹ Excluding variants with low imputation quality (R² < 0.3) resulted in dense mapping at over 39 million quality-imputed genetic markers.

Phenotypes derived from 8,940 ICD-9 billing codes were classified into 1,815 PheWAS disease states of shared disease etiology, of which 1,448 had at least 20 cases. Standard code translations were used for converting the taxonomy of diagnostic ICD-9 codes into PheWAS code groups (PheWAS code translation table v.1.2³⁰). Cases were derived from EHRs of individuals with at least two encounters with an ICD-9 billing code. This is a typical example of many recent large-scale PheWASs. To compare our proposed fastSPA-2 with Score and the current gold-standard Firth’s test in analyzing such PheWAS data, we performed genome-wide association analyses for four selected traits—skin cancer (PheWAS code: 172), type 2 diabetes (PheWAS code: 250.2; MIM: 125853), primary hypercoagulable state (PheWAS code: 286.81; MIM: 188055), and cystic fibrosis (PheWAS code: 499; MIM: 219700)—in 18,267 unrelated individuals of European ancestry while adjusting for age, sex, and four principal components. Genotyped samples with any missing covariate information were excluded from the analysis. Given that imputation quality is low for very rare variants,²⁶ we excluded the imputed variants with MAF < 0.001 in our main analysis, which resulted in 13 million variants. For Firth’s test, we used the hybrid approach used in the type I error simulation, where Firth’s test was performed only when the fastSPA-2 p value was smaller than 5 × 10⁻³.

Results

Numerical Simulations

We examine the computation time, type I error control, and power of the proposed fastSPA and two existing approaches, Score and Firth’s test, across ranges of case-control imbalance and MAFs.

Comparison of Computation Times

The projected computation times for testing 1,500 phenotypes across 10 million variants by different testing methods are presented in Figure 1. To obtain computation time under realistic scenarios of the MAF distribution, we randomly sampled the MAFs of the simulated SNPs from the MAF spectrum of the MGI data (Figure S2). fastSPA-2 performs 100–300 times faster than Firth’s test. In the unbalanced case-control setup of 2,000 cases and 18,000 controls, for example, Firth’s test takes 117 CPU years to analyze 10 million SNPs across 1,500 phenotypes, whereas fastSPA-2 takes only 1.09 CPU years. This indicates that on a cluster with 100 CPU cores, the proposed test would require 4 days (without data reading), but Firth’s test would need more than a year. When we compare fastSPA and SPA, fastSPA-0.1 performs 4–6 times faster than SPA-0.1 (e.g., 2.90 versus 12.32 CPU years when the case-control ratio = 2,000:18,000), and fastSPA-2 performs 1.5–2 times faster than SPA-2 (e.g., 1.09 versus 1.62 CPU years when the case-control ratio = 2,000:18,000). As expected, the computation time for fastSPA-BE lies between the computation times for fastSPA-2 and fastSPA-0.1. fastSPA-BE performs 1.3–1.8 times faster than fastSPA-0.1 and 1.6–2.8 times slower than fastSPA-2 (e.g., 1.09, 1.86, and 2.90 CPU years for fastSPA-2, fastSPA-BE, and fastSPA-0.1, respectively, when the case-control ratio = 2,000:18,000).

Figure 1. The Projected Computation Times for Testing 10 Million Variants across 1,500 Phenotypes by Various Tests with MAFs Sampled from the MAF Distribution of the MGI Data

The computation times are based on testing 10,000 simulated variants on an Intel i7 2.70GHz processor and then projecting them onto a PheWAS with 10 million variants and 1,500 phenotypes.
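The projection arithmetic can be written out explicitly. The helper names below are hypothetical, and a CPU year is assumed to be 365 days; the CPU-year figures passed to `wall_clock_days` are the ones quoted in the text for the 2,000:18,000 scenario.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def project_cpu_years(seconds_per_10k_variants,
                      n_variants=10_000_000, n_phenotypes=1_500):
    """Scale a timing measured on 10,000 variants to a full PheWAS, in CPU years."""
    per_variant = seconds_per_10k_variants / 10_000
    return per_variant * n_variants * n_phenotypes / SECONDS_PER_YEAR

def wall_clock_days(cpu_years, n_cores):
    """Wall-clock days on n_cores, assuming perfect parallel scaling."""
    return cpu_years * 365 / n_cores

fastspa2_days = wall_clock_days(1.09, 100)   # about 4 days
firth_days = wall_clock_days(117, 100)       # more than one year
```

Under these assumptions, a test that needs roughly 21 CPU seconds per 10,000 variants corresponds to about one CPU year for the full 10 million variants and 1,500 phenotypes.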

We also recorded the computation times for variants with three different fixed MAFs (0.1, 0.01, and 0.001) in order to assess the effect of MAF on the performance of the tests. Similar to Figure 1, Table 1 shows that fastSPA-2 was the fastest of the saddlepoint-based tests, ran within a few CPU seconds of Score, and was far faster than Firth’s test. Moreover, whereas the computation time of SPA increases with decreasing MAFs, which could be due to the slow convergence caused by the discrete nature of the underlying distribution, fastSPA requires less computation time for rarer variants (smaller MAFs) than for more common variants (larger MAFs). This demonstrates the potential of the partially normal approximation improvement in terms of faster computation of the p values, especially for low-frequency and rare variants.

Table 1.

Computation Times for Various Tests of 10,000 Simulated Variants with Different MAFs

Case-Control Ratio	MAF	Score	SPA-0.1	fastSPA-0.1	fastSPA-BE	SPA-2	fastSPA-2	Firth’s Test
10,000:10,000	0.1	20	214	75	37	28	23	7,251
	0.01	19	225	38	35	27	20	6,918
	0.001	19	242	33	36	30	20	5,304
2,000:18,000	0.1	21	256	84	37	36	24	3,940
	0.01	20	284	39	36	35	21	4,312
	0.001	19	326	34	41	40	20	3,804
40:19,960	0.1	21	376	98	70	38	24	3,615
	0.01	20	477	42	58	44	21	3,598
	0.001	20	647	38	51	79	21	3,525


All computation times are in CPU seconds on an Intel i7 2.70GHz processor.

Type I Error Comparison

The type I error rates from 10⁹ simulated datasets are presented in Figure 2. Because of the heavy computation burden for testing these extremely large numbers of datasets, in this comparison, we considered only Score, fastSPA-2, and the hybrid version of Firth’s test, in which we used Firth’s test only when the fastSPA-2 p values were smaller than 5 × 10⁻³. We note that both fastSPA-2 and Firth’s test had well-calibrated quantile-quantile (Q-Q) plots up to 10⁻⁶ p values (Figure 5), and whenever fastSPA-2 p values were greater than 5 × 10⁻³, Firth’s test p values were greater than 4.8 × 10⁻⁴ (see p Value and Inflation Factor Comparison), indicating that the hybrid approach can provide very accurate estimation of the type I error rates of Firth’s test at very stringent α levels.

Figure 5. Q-Q Plots for Score, fastSPA-2, SPA-2, and Firth’s Test on 5 × 10⁶ Simulated Variants with MAF Randomly Sampled from the MAF Distribution of the MGI Data

The top, middle, and bottom panels show Q-Q plots in the balanced (case-control ratio = 10,000:10,000), moderately unbalanced (case-control ratio = 2,000:18,000), and extremely unbalanced (case-control ratio = 40:19,960) case-control scenarios, respectively. In each plot, the x axis represents –log₁₀ expected p values, and the y axis represents –log₁₀ observed p values.

Score had greatly inflated type I error rates for moderately unbalanced and extremely unbalanced case-control ratios, whereas fastSPA-2 could control the type I error in such situations. At the genome-wide significance level of α = 5 × 10⁻⁸, for example, the empirical type I error rates of Score were 32 (1.63 × 10⁻⁶ when the case-control ratio = 2,000:18,000) and 26,600 (1.33 × 10⁻³ when the case-control ratio = 40:19,960) times higher than the nominal α = 5 × 10⁻⁸. In contrast, fastSPA-2 had empirical type I error rates nearly identical to (4.9 × 10⁻⁸ when the case-control ratio = 2,000:18,000) or slightly lower than (3.5 × 10⁻⁸ when the case-control ratio = 40:19,960) the nominal α = 5 × 10⁻⁸. Firth’s test also had well-controlled type I error rates in the balanced and moderately unbalanced case-control scenarios (4.7 × 10⁻⁸ and 4.9 × 10⁻⁸, respectively, at α = 5 × 10⁻⁸). Interestingly, it showed slight inflation (7.8 × 10⁻⁸ at α = 5 × 10⁻⁸) in the extremely unbalanced scenario. We also estimated empirical type I error rates at six different MAFs (Figure 3). Score had deflated type I error rates for low-frequency and rare variants for the balanced case-control ratio and inflated and extremely inflated type I error rates for moderately and severely unbalanced case-control ratios. fastSPA-2 had overall well-controlled type I error rates regardless of MAF and case-control ratio. Firth’s test had either well-controlled or slightly conservative type I error rates when the case-control ratio was balanced or moderately unbalanced. However, when the case-control ratio was extremely unbalanced, Firth’s test had inflated type I error rates, especially when the minor allele count was small (e.g., 1.33 × 10⁻⁷ and 1.47 × 10⁻⁷ for MAF = 0.0005 and 0.001, respectively, at α = 5 × 10⁻⁸ when the case-control ratio = 40:19,960).

Type I Error Comparison at Different MAFs between Score, fastSPA-2, and Firth’s Test

The top and bottom panels show empirical type I error rates at α = 5 × 10⁻⁵ and 5 × 10⁻⁸, respectively. From left to right, the plots consider case-control ratios 10,000:10,000 (balanced), 2,000:18,000 (moderately unbalanced), and 40:19,960 (extremely unbalanced). In each plot, the x axis represents MAF with the expected minor allele count (MAC) in parentheses, and the y axis represents empirical type I error rates. Empirical type I error rates were estimated on the basis of 10⁹ simulated datasets. 95% confidence intervals at different MAFs are also presented.
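As a rough illustration of how the empirical rates and confidence intervals above can be read, the following Python sketch converts a rejection count into an empirical type I error rate, its ratio to the nominal α, and a 95% interval. The function name, the rejection counts (back-calculated from the reported rates under the assumption of 10⁹ null tests), and the use of a normal-approximation (Wald) interval are assumptions for illustration; the text does not state which interval was used.

```python
import math

def type1_error_summary(rejections, n_tests, alpha, z=1.96):
    """Empirical type I error rate, its ratio to nominal alpha, and a Wald CI.

    Hypothetical helper: the Wald (normal-approximation) interval is an
    assumption, not necessarily the interval shown in the figure.
    """
    rate = rejections / n_tests
    half_width = z * math.sqrt(rate * (1.0 - rate) / n_tests)
    ci = (max(0.0, rate - half_width), rate + half_width)
    return rate, rate / alpha, ci

# Counts back-calculated from the reported rates, assuming 10^9 null tests.
examples = [("Score, 2,000:18,000", 1630),
            ("fastSPA-2, 2,000:18,000", 49),
            ("fastSPA-2, 40:19,960", 35)]
for label, count in examples:
    rate, fold, (lo, hi) = type1_error_summary(count, 10**9, 5e-8)
    print(f"{label}: rate={rate:.2e}, ratio={fold:.1f}, 95% CI=({lo:.2e}, {hi:.2e})")
```

With only about 50 false positives expected among 10⁹ tests at α = 5 × 10⁻⁸, such an interval is wide relative to the estimate, which is worth keeping in mind when comparing rates such as 3.5 × 10⁻⁸ and 4.9 × 10⁻⁸.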

Power Comparison

Next, we compared the power curves of fastSPA-2, Score, and Firth’s test. Note that Firth’s test is a current gold-standard method.¹³ Because Score had greatly inflated type I error rates, we compared the empirical powers of different tests at their test-specific empirical α levels. Figure 4 shows power by odds ratios when the MAF of the variant was 0.05 (top panel) and 0.01 (bottom panel). As expected, the power was higher when the case-control ratio was balanced. The empirical powers of fastSPA-2 and Firth’s test were nearly identical for all case-control ratios and MAFs, which suggests that our proposed test does not suffer from any loss in power in comparison with Firth’s test. The empirical powers of Score were almost identical to those of fastSPA-2 and Firth’s test for the balanced case-control ratio. However, Score showed substantially lower power than the other two tests for the unbalanced case-control ratios as a result of the very small empirical α levels, and the power gap was especially large when the case-control ratio was extremely unbalanced. The simulation results clearly show that the proposed approach improves power over Score when type I error rates are properly controlled. When we used the nominal α = 5 × 10⁻⁸ level instead of the empirical α levels, Score had higher power than the other two approaches (Figure S3) as expected, given that its type I error rates were not controlled.
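The comparison at test-specific empirical α levels can be sketched as follows. This is a minimal illustration, not the simulation code used for Figure 4: the function names and the toy p values are hypothetical, and the cutoff is taken to be the order statistic of the null p values that yields a realized false-positive rate of at most α (assuming no ties).

```python
def empirical_threshold(null_pvalues, alpha):
    """p-value cutoff that rejects at most a fraction alpha of the null p values.

    Returns the k-th smallest null p value with k = floor(alpha * n);
    absent ties, rejecting p <= cutoff rejects exactly k null tests.
    """
    ordered = sorted(null_pvalues)
    k = int(alpha * len(ordered) + 1e-9)  # small guard against float rounding
    return ordered[k - 1] if k > 0 else 0.0

def empirical_power(alt_pvalues, threshold):
    """Fraction of p values simulated under the alternative at or below the cutoff."""
    return sum(p <= threshold for p in alt_pvalues) / len(alt_pvalues)

# Toy null p values skewed toward 0, mimicking an anti-conservative test.
null_p = [(i / 20) ** 2 for i in range(1, 21)]
alt_p = [0.01, 0.05, 0.0625, 0.2]
cutoff = empirical_threshold(null_p, alpha=0.25)
print(cutoff, empirical_power(alt_p, cutoff))
```

For an inflated test the calibrated cutoff falls below the nominal α, which is why a test such as Score can lose power once its type I error is held at the same realized level as the other tests.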

p Value and Inflation Factor Comparison

To compare p value distributions of various tests, we generated Q-Q plots and calculated the inflation factor (λ) of the genomic control. Figure 5 suggests strong deflation (smaller than expected) in the p values from Score in the moderately unbalanced and extremely unbalanced case-control setups, whereas fastSPA-2, SPA-2, and Firth’s test resulted in well-calibrated Q-Q plots, which suggests that these methods can control for type I errors. Moreover, the minimum Firth’s test p value was 4.8 × 10⁻⁴ for the variants with a fastSPA-2 p value > 5 × 10⁻³ among all case-control setups, which justifies our hybrid approach of performing Firth’s test only when the fastSPA-2 p value is less than 5 × 10⁻³ in the type I error simulation studies.
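The screening logic behind the hybrid approach can be sketched as below. The two p-value callables and the function name are placeholders (assumptions), not part of any package; the sketch only shows that the expensive test is run on variants passing the screen, which is valid for counting rejections at a level α well below the smallest Firth’s test p value observed among the screened-out variants.

```python
def count_firth_rejections(variants, fast_pvalue, firth_pvalue, alpha, screen=5e-3):
    """Count rejections of the slow test at level alpha, evaluating it only
    for variants whose fast-test p value is below the screening cutoff.

    fast_pvalue and firth_pvalue are hypothetical callables: variant -> p value.
    """
    if alpha >= screen:
        raise ValueError("alpha must be smaller than the screening cutoff")
    rejections = 0
    slow_calls = 0
    for variant in variants:
        if fast_pvalue(variant) < screen:
            slow_calls += 1
            if firth_pvalue(variant) < alpha:
                rejections += 1
    return rejections, slow_calls

# Toy lookup tables standing in for the two tests.
fast = {"v1": 0.5, "v2": 1e-3, "v3": 1e-9}
slow = {"v1": 0.4, "v2": 2e-3, "v3": 2e-9}
print(count_firth_rejections(fast, fast.get, slow.get, alpha=5e-8))
```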

None of fastSPA-2, fastSPA-BE, and fastSPA-0.1 showed any inflation or deflation in genomic controls (λ) in the balanced and moderately unbalanced case-control setups (Table S1). In the extremely unbalanced case-control setup, fastSPA-2 resulted in a greatly deflated λ (0.48) at the median p value (q = 0.5). Interestingly, fastSPA-BE and fastSPA-0.1 resulted in an inflated λ (both 1.83) at q = 0.5, which could be due to the discrete nature of p values. However, when λ was measured at p value quantiles q = 0.01 and 0.001, all three tests provided λ very close to unity.
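The inflation factor at a p-value quantile q can be computed as sketched below, assuming the usual genomic-control definition: the 1-df chi-square statistic corresponding to the observed q-quantile of the p values, divided by the chi-square statistic corresponding to q itself. The function names and the interpolation rule for the empirical quantile are assumptions made for this illustration.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def chi2_1df_upper_quantile(p):
    """x such that P(chi-square with 1 df > x) = p, for 0 < p <= 1."""
    z = _STD_NORMAL.inv_cdf(p / 2.0)  # chi-square(1) is a squared standard normal
    return z * z

def empirical_quantile(values, q):
    """q-quantile by linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (pos - lower) * (ordered[upper] - ordered[lower])

def inflation_factor(pvalues, q=0.5):
    """Genomic-control lambda measured at p-value quantile q."""
    observed = chi2_1df_upper_quantile(empirical_quantile(pvalues, q))
    return observed / chi2_1df_upper_quantile(q)

uniform_p = [(i + 0.5) / 1000 for i in range(1000)]
print(inflation_factor(uniform_p, q=0.5))                    # close to 1
print(inflation_factor([p * p for p in uniform_p], q=0.5))   # p values too small: > 1
print(inflation_factor([1.0] * 600 + uniform_p[:400], q=0.5))  # many p = 1: 0
```

The last line shows one way in which discreteness can distort the median: when more than half of the p values sit at exactly 1, the median-based λ collapses toward 0 even if the small p values are well calibrated, whereas λ measured at q = 0.01 or 0.001 depends only on the lower tail.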

Analysis of MGI Data

We applied Score, Firth’s test, and fastSPA-2 to the MGI data with four phenotypes: skin cancer, type 2 diabetes, primary hypercoagulable state, and cystic fibrosis, which were selected on the basis of case-control ratios. Skin cancer (2,359 cases and 15,265 controls) and type 2 diabetes (1,987 cases and 14,906 controls) were moderately unbalanced, whereas primary hypercoagulable state (168 cases and 16,401 controls) and cystic fibrosis (28 cases and 18,212 controls) were extremely unbalanced phenotypes.

The Manhattan plots (Figure 6) show that Score produced a large number of potentially spurious associations for all of these phenotypes, whereas all of the significant variants from our proposed test at the genome-wide significance level of α = 5 × 10⁻⁸ can be verified as truly associated with the phenotypes on the basis of previous findings (Table 2). In the analysis of skin cancer, variants in or near IRF4 (MIM: 601900), MC1R (MIM: 155555), RALY (MIM: 614663), and SLC45A2 (MIM: 606202) were significant at α = 5 × 10⁻⁸, and all four of these genes were previously identified as associated with pigmentation traits and skin cancers.³¹,³²,³³,³⁴,³⁵,³⁶ In the other traits, variants in TCF7L2 (MIM: 602228), F5 (MIM: 612309), and CFTR (MIM: 602421) were significantly associated with type 2 diabetes,³⁷ primary hypercoagulable state,³⁸ and cystic fibrosis,³⁹ respectively, and all of these genes are well known to be associated with the risk of their respective diseases. The Q-Q plots (Figure 7) also suggest that the p values based on Score are much smaller than expected, especially for low-frequency and rare variants, whereas the p values based on fastSPA-2 closely follow the uniform distribution. We also examined Manhattan plots (Figure S4) that included the imputed variants with MAF < 0.001. The inclusion of these rarer variants resulted in extreme inflation in the number of potentially spurious associations for Score, whereas our proposed test still produced no or very few new associations. The Manhattan plots and Q-Q plots for Firth’s test were almost identical to those of our proposed test.

Manhattan Plots for Four Different Phenotypes from MGI Data

All imputed variants with MAF > 0.001 and all directly genotyped variants were included in this analysis. From left to right, the three panels show associations based on fastSPA-2, Firth’s test, and Score. The red line represents the genome-wide significance level α = 5 × 10⁻⁸.

Table 2.

Significant SNP-Phenotype Associations Based on fastSPA-2 on MGI Data and Previous Findings Confirming Such Associations

Phenotype	Location	dbSNP ID	Nearest Gene	Alleles	MAF	p Value	Previous Findings
Skin cancer	6:396321	rs12203592	IRF4	C>T	0.16	6.71 × 10⁻¹⁸	Zhang et al.,³¹ Sulem et al.,³² Jacobs et al.,³³ and Liu et al.³⁴
	16:89986117	rs1805007	MC1R	C>T	0.077	1.86 × 10⁻¹⁴	Zhang et al.,³¹ Sulem et al.,³² Jacobs et al.,³³ and Liu et al.³⁴
	20:32538391	rs62211989	RALY	G>C	0.075	5.59 × 10⁻¹³	Zhang et al.,³¹ Sulem et al.,³² Jacobs et al.,³³ and Liu et al.³⁴
	5:33951693	rs16891982	SLC45A2	C>G	0.038	7 × 10⁻⁹	Liu et al.,³⁴ Barrett et al.,³⁵ and Nan et al.³⁶
Type 2 diabetes	10:114754071	rs34872471	TCF7L2	T>C	0.29	3.4 × 10⁻¹¹	Scott et al.³⁷
Primary hypercoagulable state	1:169519049	rs6025	F5	T>C	0.029	4.9 × 10⁻³⁹	Bertina et al.³⁸
Cystic fibrosis	7:117299434	rs113827944	CFTR	G>A	0.018	3.11 × 10⁻¹⁵	Kerem et al.³⁹

Q-Q Plots for Four Different Phenotypes from MGI Data

From left to right, the three panels show the Q-Q plots based on fastSPA-2, Firth’s test, and Score. The plots are color coded according to different MAF categories. 95% confidence bands are presented in gray to signify the deviance from the uniform distribution.

Further, on the basis of the p values from our proposed test, we obtained the inflation factor (λ) of the genomic control at different p value quantiles (q) and different MAF cutoffs (Table S2). Only the imputed variants were removed when we used different MAF cutoffs; the SNPs present on the Illumina HumanCoreExome v.12.1 array were preserved. To evaluate whether using a smaller standard-deviation threshold (r) improves the estimation of λ, we also applied fastSPA with r = 0.1 (fastSPA-0.1) and fastSPA with the Berry-Esseen bound threshold at α = 5 × 10⁻⁸ (fastSPA-BE) on these four phenotypes. When all variants were included in the analysis, there was slight inflation (λ = 1.11, type 2 diabetes) or great deflation (λ = 0.12, cystic fibrosis) at the median level for fastSPA-2, but the genomic controls were very close to unity at q = 0.01 and 0.001. When we considered only the variants with MAF > 0.001, fastSPA-2 did not show any significant inflation in λ at the median for skin cancer, type 2 diabetes, or primary hypercoagulable state. It did, however, show deflated genomic control for cystic fibrosis (λ = 0.63) as a result of the discrete nature of the underlying distribution. When we excluded the rare variants and considered only the variants with MAF > 0.01, all four of the phenotypes showed λ very close to unity. Neither fastSPA-0.1 nor fastSPA-BE showed significant inflation or deflation in λ at any quantile or MAF cutoff, except for cystic fibrosis (both with λ = 1.27) when all variants were considered and genomic control was measured at the median level.

Discussion

In this paper, we propose a fast and scalable test for analyzing large PheWAS datasets that is well calibrated even in extremely unbalanced case-control settings. The method uses computationally efficient saddlepoint approximation to accurately calculate p values of score test statistics. We further propose an improved version of our test that substantially reduces the computation time, especially for low-frequency and rare variants. Our proposed test can also adjust for additional covariates. Through extensive numerical studies, we have demonstrated that our test can perform 100–300 times faster than the currently used Firth’s test while retaining similar power and well-controlled type I error rates. Analysis of MGI data illustrates that by applying the proposed method to PheWAS datasets, we can identify true association signals while controlling for type I error, even for traits with a very small number of cases and a large number of controls.

Our test calculates p values on the basis of Score if the score statistics lie sufficiently close to the mean. Even though the normal approximation is accurate near the mean, those p values might not be well calibrated. In such cases, because the median p values might come from Score, we can encounter a slightly inflated or deflated inflation factor at the median. When the case-control ratio is extremely unbalanced, this phenomenon is more pronounced. One way to circumvent this issue is to measure the inflation factor at more extreme quantiles (0.01, 0.001, etc.) or to exclude rare variants when estimating the inflation factor. Another approach is to decrease the standard-deviation threshold so that the median p values come from the saddlepoint approximation. In the analysis of MGI data, fastSPA-0.1 produced substantially better inflation-factor estimates than fastSPA-2. However, using a threshold of 0.1 instead of 2 would increase the computation time by a factor of ∼3 to 4. The Berry-Esseen threshold can be viewed as a compromise between these two thresholds. If computational resources are not a constraint, we recommend using fastSPA-0.1 so that most of the p values are calculated by the saddlepoint approximation. If computational resources are limited, or researchers want to obtain results quickly, either a larger threshold (i.e., fastSPA-2) or the Berry-Esseen bound can be a better choice.
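The threshold rule discussed above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the SPAtest implementation: the score statistic is taken to be S = Σ g_i (Y_i − μ_i) with independent Bernoulli(μ_i) outcomes, the tail probability uses the Barndorff-Nielsen form of the saddlepoint approximation, the saddlepoint is found by plain bisection rather than a faster root finder, and the additional speed-up for rare variants is omitted. All function names and the toy data are hypothetical.

```python
import math

def _tilt(m, x):
    """m*e^x / (1 - m + m*e^x), evaluated without overflow."""
    if x >= 0:
        return m / (m + (1.0 - m) * math.exp(-x))
    e = m * math.exp(x)
    return e / (1.0 - m + e)

def _log_mgf(m, x):
    """log(1 - m + m*e^x), evaluated without overflow."""
    if x >= 0:
        return x + math.log(m + (1.0 - m) * math.exp(-x))
    return math.log1p(m * math.expm1(x))

def cgf(t, g, mu):
    """Cumulant generating function K(t) of S = sum_i g_i (Y_i - mu_i)."""
    return sum(_log_mgf(m, gi * t) - t * gi * m for gi, m in zip(g, mu))

def cgf_d1(t, g, mu):
    """First derivative K'(t)."""
    return sum(gi * (_tilt(m, gi * t) - m) for gi, m in zip(g, mu))

def cgf_d2(t, g, mu):
    """Second derivative K''(t); K''(0) is the null variance of S."""
    total = 0.0
    for gi, m in zip(g, mu):
        p = _tilt(m, gi * t)
        total += gi * gi * p * (1.0 - p)
    return total

def normal_upper_tail(z):
    """P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def spa_upper_tail(s, g, mu):
    """Saddlepoint approximation to P(S > s); intended for s well above 0."""
    s_max = sum(max(gi * (1.0 - m), -gi * m) for gi, m in zip(g, mu))
    if s >= s_max:
        return 0.0  # s is at or beyond the largest attainable value of S
    lo, hi = 0.0, 1.0
    while cgf_d1(hi, g, mu) < s and hi < 1e8:  # expand the bracket
        lo, hi = hi, 2.0 * hi
    for _ in range(200):  # bisection for the saddlepoint K'(t) = s
        mid = 0.5 * (lo + hi)
        if cgf_d1(mid, g, mu) < s:
            lo = mid
        else:
            hi = mid
    t = 0.5 * (lo + hi)
    w = math.sqrt(2.0 * (t * s - cgf(t, g, mu)))
    v = t * math.sqrt(cgf_d2(t, g, mu))
    return normal_upper_tail(w + math.log(v / w) / w)

def fast_spa_pvalue(s, g, mu, r=2.0):
    """Two-sided p value: normal approximation within r SDs of the mean, SPA beyond."""
    sd = math.sqrt(cgf_d2(0.0, g, mu))
    if abs(s) < r * sd:
        return 2.0 * normal_upper_tail(abs(s) / sd)
    neg_g = [-gi for gi in g]  # the lower tail of S is the upper tail of -S
    return spa_upper_tail(abs(s), g, mu) + spa_upper_tail(abs(s), neg_g, mu)

# Toy data: 2,000 samples, fitted case probability 0.01, 20 carriers of a rare allele.
mu = [0.01] * 2000
geno = [1.0] * 20 + [0.0] * 1980
mean_g = sum(geno) / len(geno)
g = [x - mean_g for x in geno]   # genotype centered as with an intercept-only model
s = 3 * 0.99 + 17 * (-0.01)      # 3 carrier cases and 17 non-carrier cases
sd = math.sqrt(cgf_d2(0.0, g, mu))
print("normal:", 2.0 * normal_upper_tail(abs(s) / sd))
print("fastSPA (r = 2):", fast_spa_pvalue(s, g, mu, r=2.0))
```

In this toy setting the normal approximation gives a far smaller p value than the saddlepoint approximation, mirroring the inflation of Score for rare variants in unbalanced designs. Lowering r (for example to 0.1) sends more statistics through the saddlepoint branch at the cost of more root-finding.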

As sequencing costs continue to drop, whole-exome or whole-genome sequencing will be used for PheWASs to identify rare variants associated with clinical phenotypes.⁴⁰ In rare-variant association analysis, gene- or region-based multiple-variant tests are commonly used to improve power.⁴¹ When case-control ratios are unbalanced, popular rare-variant tests, including burden tests, SKAT, and SKAT-O, can also have substantially inflated type I error rates. Although resampling-based approaches have been developed to address this problem,⁴² the existing methods are not fast enough to be used in PheWASs. One possible approach is to first adjust single-variant score statistics by SPA and then use the adjusted score statistics to control for the type I error. We have left this for future research.

In summary, we have proposed an accurate and scalable method for PheWAS data analysis. With the growing effort to build large research cohorts for precision medicine,⁴⁰ future PheWASs will have hundreds of thousands of samples and hundreds of millions of variants. Our method will provide a scalable solution for this large-scale problem and contribute to finding genetic components of complex traits. All of our tests are implemented in the R package SPAtest.

Acknowledgments

This work was supported by NIH grant R01 HG008773 (R.D. and S.L.). We would like to thank the investigators of the Michigan Genomics Initiative project for access to the PheWAS dataset and Dr. Hyun Min Kang for implementing the methods in the Epacts package.

Published: June 8, 2017

Footnotes

Supplemental Data include four figures and two tables and can be found with this article online at http://dx.doi.org/10.1016/j.ajhg.2017.05.014.

Appendix A: Explanation behind Using $\tilde{G}$ instead of G

We first note that $S = {\tilde{G}}^{T} (Y - \hat{μ}) = G^{T} (Y - \hat{μ})$ given that $\hat{μ}$ is the maximum-likelihood estimator of μ under the null model and $X^{T} (Y - \hat{μ}) = 0$ . Now, the score function and the observed information matrix under the null model are given by

$$U_{0} = \begin{bmatrix} X^{T} (Y - \hat{μ}) \\ G^{T} (Y - \hat{μ}) \end{bmatrix} = \begin{bmatrix} 0 \\ S \end{bmatrix}$$

and

$$I_{0} = \begin{bmatrix} X^{T} W X & X^{T} W G \\ G^{T} W X & G^{T} W G \end{bmatrix},$$

respectively.

Therefore, the variance of S under H₀ is given by

$$V_{H_{0}} (S) = G^{T} W G - G^{T} W X {(X^{T} W X)}^{- 1} X^{T} W G = G^{T} W \tilde{G} = {\tilde{G}}^{T} W \tilde{G} .$$

So, even though the two expressions of S are algebraically the same, writing S in terms of $\tilde{G}$ lets the variance be expressed as a weighted sum of ${\hat{μ}}_{i} (1 - {\hat{μ}}_{i})$ values, where the weights are given by the squared ${\tilde{G}}_{i}$ values. Therefore, we used $\tilde{G}$ instead of G to express the score statistic.
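The identities used in this appendix can be verified in a few lines. The derivation below assumes that the covariate-adjusted genotype is $\tilde{G} = G - X (X^{T} W X)^{-1} X^{T} W G$ and that W is the diagonal matrix with entries ${\hat{μ}}_{i} (1 - {\hat{μ}}_{i})$; these definitions are not restated in this appendix and are taken here as given.

```latex
\begin{aligned}
\tilde{G}^{T} (Y - \hat{\mu})
  &= G^{T} (Y - \hat{\mu}) - G^{T} W X (X^{T} W X)^{-1} X^{T} (Y - \hat{\mu})
   = G^{T} (Y - \hat{\mu}), \\
X^{T} W \tilde{G}
  &= X^{T} W G - X^{T} W X (X^{T} W X)^{-1} X^{T} W G = 0, \\
\tilde{G}^{T} W \tilde{G}
  &= G^{T} W \tilde{G} - G^{T} W X (X^{T} W X)^{-1} X^{T} W \tilde{G}
   = G^{T} W \tilde{G}, \\
\tilde{G}^{T} W \tilde{G}
  &= \sum_{i} \tilde{G}_{i}^{2} \, \hat{\mu}_{i} (1 - \hat{\mu}_{i}).
\end{aligned}
```

The first line uses $X^{T} (Y - \hat{μ}) = 0$, the second and third show that the cross term vanishes, and the last line is the sum-of-independent-terms form that matches a statistic written as a sum over ${\tilde{G}}_{i} (Y_{i} - {\hat{μ}}_{i})$.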

Web Resources

Michigan Genomics Initiative, https://www.michigangenomics.org/
OMIM, http://www.omim.org
SPAtest R-package, https://sites.google.com/a/umich.edu/leeshawn/software

Supplemental Data

Document S1. Figures S1–S4 and Tables S1 and S2

mmc1.pdf (495.4 KB, pdf)

Document S2. Article plus Supplemental Data

mmc2.pdf (2.9 MB, pdf)

References

1. Welter D., MacArthur J., Morales J., Burdett T., Hall P., Junkins H., Klemm A., Flicek P., Manolio T., Hindorff L., Parkinson H. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res. 2014;42:D1001–D1006. doi: 10.1093/nar/gkt1229.
2. Solovieff N., Cotsapas C., Lee P.H., Purcell S.M., Smoller J.W. Pleiotropy in complex traits: challenges and strategies. Nat. Rev. Genet. 2013;14:483–495. doi: 10.1038/nrg3461.
3. Denny J.C., Ritchie M.D., Basford M.A., Pulley J.M., Bastarache L., Brown-Gentry K., Wang D., Masys D.R., Roden D.M., Crawford D.C. PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene-disease associations. Bioinformatics. 2010;26:1205–1210. doi: 10.1093/bioinformatics/btq126.
4. Denny J.C., Crawford D.C., Ritchie M.D., Bielinski S.J., Basford M.A., Bradford Y., Chai H.S., Bastarache L., Zuvich R., Peissig P. Variants near FOXE1 are associated with hypothyroidism and other thyroid conditions: using electronic medical records for genome- and phenome-wide studies. Am. J. Hum. Genet. 2011;89:529–542. doi: 10.1016/j.ajhg.2011.09.008.
5. Hebbring S.J., Schrodi S.J., Ye Z., Zhou Z., Page D., Brilliant M.H. A PheWAS approach in studying HLA-DRB1∗1501. Genes Immun. 2013;14:187–191. doi: 10.1038/gene.2013.2.
6. Ritchie M.D., Denny J.C., Zuvich R.L., Crawford D.C., Schildcrout J.S., Bastarache L., Ramirez A.H., Mosley J.D., Pulley J.M., Basford M.A., Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) QRS Group. Genome- and phenome-wide analyses of cardiac conduction identifies markers of arrhythmia risk. Circulation. 2013;127:1377–1385. doi: 10.1161/CIRCULATIONAHA.112.000604.
7. Pendergrass S.A., Brown-Gentry K., Dudek S., Frase A., Torstenson E.S., Goodloe R., Ambite J.L., Avery C.L., Buyske S., Bůžková P. Phenome-wide association study (PheWAS) for detection of pleiotropy within the Population Architecture using Genomics and Epidemiology (PAGE) Network. PLoS Genet. 2013;9:e1003087. doi: 10.1371/journal.pgen.1003087.
8. Shameer K., Denny J.C., Ding K., Jouni H., Crosslin D.R., de Andrade M., Chute C.G., Peissig P., Pacheco J.A., Li R. A genome- and phenome-wide association study to identify genetic variants influencing platelet count and volume and their pleiotropic effects. Hum. Genet. 2014;133:95–109. doi: 10.1007/s00439-013-1355-7.
9. Hebbring S.J. The challenges, advantages and future of phenome-wide association studies. Immunology. 2014;141:157–165. doi: 10.1111/imm.12195.
10. Marchini J., Howie B. Genotype imputation for genome-wide association studies. Nat. Rev. Genet. 2010;11:499–511. doi: 10.1038/nrg2796.
11. Cox D., Hinkley D. Theoretical Statistics. Chapman and Hall; 1974.
12. Ma C., Blackwell T., Boehnke M., Scott L.J., GoT2D investigators. Recommended joint and meta-analysis strategies for case-control association testing of single low-count variants. Genet. Epidemiol. 2013;37:539–550. doi: 10.1002/gepi.21742.
13. Firth D. Bias reduction of maximum likelihood estimates. Biometrika. 1993;80:27–38.
14. Daniels H.E. Saddlepoint Approximations in Statistics. Ann. Math. Stat. 1954;25:631–650.
15. Barndorff-Nielsen O.E. Approximate Interval Probabilities. J. R. Stat. Soc. B. 1990;52:485–496.
16. Kuonen D. Saddlepoint approximations for distributions of quadratic forms in normal variables. Biometrika. 1999;86:929–935.
17. Feller W. The fundamental limit theorems in probability. Bull. Amer. Math. Soc. 1945;51:800–832.
18. Berry A.C. The accuracy of the Gaussian approximation to the sum of independent variates. Trans. Am. Math. Soc. 1941;49:122–136.
19. Esseen C.G. On the Liapounoff Limit of Error in the Theory of Probability. Ark. Mat. Astr. Fys. 1942;28A:1–19.
20. Esseen C.G. A Moment Inequality with an Application to the Central Limit Theorem. Skand Aktuarietidskr. 1956;39:160–170.
21. Jensen J.L. Saddlepoint Approximations. Oxford University Press; 1995.
22. Whittaker E.T., Robinson G. The Newton-Raphson Method. In The Calculus of Observations: A Treatise on Numerical Mathematics, Fourth Edition. Dover; 1967. pp. 84–87.
23. Press W.H., Flannery B.P., Teukolsky S.A., Vetterling W.T. Numerical Recipes in Fortran 77: The Art of Scientific Computing, Second Edition. Cambridge University Press; 1992.
24. Brent R.P. Algorithms for Minimization without Derivatives. Prentice-Hall; 1973.
25. Shevtsova I.G. An improvement of convergence rate estimates in the Lyapunov theorem. Dokl. Math. 2010;82:862–864.
26. McCarthy S., Das S., Kretzschmar W., Delaneau O., Wood A.R., Teumer A., Kang H.M., Fuchsberger C., Danecek P., Sharp K., Haplotype Reference Consortium. A reference panel of 64,976 haplotypes for genotype imputation. Nat. Genet. 2016;48:1279–1283. doi: 10.1038/ng.3643.
27. Delaneau O., Zagury J.-F., Marchini J. Improved whole-chromosome phasing for disease and population genetic studies. Nat. Methods. 2013;10:5–6. doi: 10.1038/nmeth.2307.
28. Loh P.-R., Danecek P., Palamara P.F., Fuchsberger C., Reshef Y.A., Finucane H.K., Schoenherr S., Forer L., McCarthy S., Abecasis G.R. Reference-based phasing using the Haplotype Reference Consortium panel. Nat. Genet. 2016;48:1443–1448. doi: 10.1038/ng.3679.
29. Das S., Forer L., Schönherr S., Sidore C., Locke A.E., Kwong A., Vrieze S.I., Chew E.Y., Levy S., McGue M. Next-generation genotype imputation service and methods. Nat. Genet. 2016;48:1284–1287. doi: 10.1038/ng.3656.
30. Carroll R.J., Bastarache L., Denny J.C. R PheWAS: data analysis and plotting tools for phenome-wide association studies in the R environment. Bioinformatics. 2014;30:2375–2376. doi: 10.1093/bioinformatics/btu197.
31. Zhang M., Song F., Liang L., Nan H., Zhang J., Liu H., Wang L.-E., Wei Q., Lee J.E., Amos C.I. Genome-wide association studies identify several new loci associated with pigmentation traits and skin cancer risk in European Americans. Hum. Mol. Genet. 2013;22:2948–2959. doi: 10.1093/hmg/ddt142.
32. Sulem P., Gudbjartsson D.F., Stacey S.N., Helgason A., Rafnar T., Magnusson K.P., Manolescu A., Karason A., Palsson A., Thorleifsson G. Genetic determinants of hair, eye and skin pigmentation in Europeans. Nat. Genet. 2007;39:1443–1452. doi: 10.1038/ng.2007.13.
33. Jacobs L.C., Hamer M.A., Gunn D.A., Deelen J., Lall J.S., van Heemst D., Uh H.-W., Hofman A., Uitterlinden A.G., Griffiths C.E.M. A Genome-Wide Association Study Identifies the Skin Color Genes IRF4, MC1R, ASIP, and BNC2 Influencing Facial Pigmented Spots. J. Invest. Dermatol. 2015;135:1735–1742. doi: 10.1038/jid.2015.62.
34. Liu F., Visser M., Duffy D.L., Hysi P.G., Jacobs L.C., Lao O., Zhong K., Walsh S., Chaitanya L., Wollstein A. Genetics of skin color variation in Europeans: genome-wide association studies with functional follow-up. Hum. Genet. 2015;134:823–835. doi: 10.1007/s00439-015-1559-0.
35. Barrett J.H., Iles M.M., Harland M., Taylor J.C., Aitken J.F., Andresen P.A., Akslen L.A., Armstrong B.K., Avril M.-F., Azizi E., GenoMEL Consortium. Genome-wide association study identifies three new melanoma susceptibility loci. Nat. Genet. 2011;43:1108–1113. doi: 10.1038/ng.959.
36. Nan H., Kraft P., Qureshi A.A., Guo Q., Chen C., Hankinson S.E., Hu F.B., Thomas G., Hoover R.N., Chanock S. Genome-wide association study of tanning phenotype in a population of European ancestry. J. Invest. Dermatol. 2009;129:2250–2257. doi: 10.1038/jid.2009.62.
37. Scott L.J., Bonnycastle L.L., Willer C.J., Sprau A.G., Jackson A.U., Narisu N., Duren W.L., Chines P.S., Stringham H.M., Erdos M.R. Association of transcription factor 7-like 2 (TCF7L2) variants with type 2 diabetes in a Finnish sample. Diabetes. 2006;55:2649–2653. doi: 10.2337/db06-0341.
38. Bertina R.M., Koeleman B.P., Koster T., Rosendaal F.R., Dirven R.J., de Ronde H., van der Velden P.A., Reitsma P.H. Mutation in blood coagulation factor V associated with resistance to activated protein C. Nature. 1994;369:64–67. doi: 10.1038/369064a0.
39. Kerem B., Rommens J.M., Buchanan J.A., Markiewicz D., Cox T.K., Chakravarti A., Buchwald M., Tsui L.C. Identification of the cystic fibrosis gene: genetic analysis. Science. 1989;245:1073–1080. doi: 10.1126/science.2570460.
40. Collins F.S., Varmus H. A new initiative on precision medicine. N. Engl. J. Med. 2015;372:793–795. doi: 10.1056/NEJMp1500523.
41. Lee S., Abecasis G.R., Boehnke M., Lin X. Rare-variant association analysis: study designs and statistical tests. Am. J. Hum. Genet. 2014;95:5–23. doi: 10.1016/j.ajhg.2014.06.009.
42. Lee S., Fuchsberger C., Kim S., Scott L. An efficient resampling method for calibrating single and gene-based rare variant association analysis in case-control studies. Biostatistics. 2016;17:1–15. doi: 10.1093/biostatistics/kxv033.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Document S1. Figures S4 and Tables S1 and S2

mmc1.pdf^{(495.4KB, pdf)}

Document S2. Article plus Supplemental Data

mmc2.pdf^{(2.9MB, pdf)}

[bib1] 1.Welter D., MacArthur J., Morales J., Burdett T., Hall P., Junkins H., Klemm A., Flicek P., Manolio T., Hindorff L., Parkinson H. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res. 2014;42:D1001–D1006. doi: 10.1093/nar/gkt1229. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib2] 2.Solovieff N., Cotsapas C., Lee P.H., Purcell S.M., Smoller J.W. Pleiotropy in complex traits: challenges and strategies. Nat. Rev. Genet. 2013;14:483–495. doi: 10.1038/nrg3461. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib3] 3.Denny J.C., Ritchie M.D., Basford M.A., Pulley J.M., Bastarache L., Brown-Gentry K., Wang D., Masys D.R., Roden D.M., Crawford D.C. PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene-disease associations. Bioinformatics. 2010;26:1205–1210. doi: 10.1093/bioinformatics/btq126. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib4] 4.Denny J.C., Crawford D.C., Ritchie M.D., Bielinski S.J., Basford M.A., Bradford Y., Chai H.S., Bastarache L., Zuvich R., Peissig P. Variants near FOXE1 are associated with hypothyroidism and other thyroid conditions: using electronic medical records for genome- and phenome-wide studies. Am. J. Hum. Genet. 2011;89:529–542. doi: 10.1016/j.ajhg.2011.09.008. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib5] 5.Hebbring S.J., Schrodi S.J., Ye Z., Zhou Z., Page D., Brilliant M.H. A PheWAS approach in studying HLA-DRB1∗1501. Genes Immun. 2013;14:187–191. doi: 10.1038/gene.2013.2. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib6] 6.Ritchie M.D., Denny J.C., Zuvich R.L., Crawford D.C., Schildcrout J.S., Bastarache L., Ramirez A.H., Mosley J.D., Pulley J.M., Basford M.A., Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) QRS Group Genome- and phenome-wide analyses of cardiac conduction identifies markers of arrhythmia risk. Circulation. 2013;127:1377–1385. doi: 10.1161/CIRCULATIONAHA.112.000604. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib7] 7.Pendergrass S.A., Brown-Gentry K., Dudek S., Frase A., Torstenson E.S., Goodloe R., Ambite J.L., Avery C.L., Buyske S., Bůžková P. Phenome-wide association study (PheWAS) for detection of pleiotropy within the Population Architecture using Genomics and Epidemiology (PAGE) Network. PLoS Genet. 2013;9:e1003087. doi: 10.1371/journal.pgen.1003087. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib8] 8.Shameer K., Denny J.C., Ding K., Jouni H., Crosslin D.R., de Andrade M., Chute C.G., Peissig P., Pacheco J.A., Li R. A genome- and phenome-wide association study to identify genetic variants influencing platelet count and volume and their pleiotropic effects. Hum. Genet. 2014;133:95–109. doi: 10.1007/s00439-013-1355-7. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib9] 9.Hebbring S.J. The challenges, advantages and future of phenome-wide association studies. Immunology. 2014;141:157–165. doi: 10.1111/imm.12195. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib10] 10.Marchini J., Howie B. Genotype imputation for genome-wide association studies. Nat. Rev. Genet. 2010;11:499–511. doi: 10.1038/nrg2796. [DOI] [PubMed] [Google Scholar]

[bib11] 11.Cox D., Hinkley D. Chapman and Hall; 1974. Theoretical Statistics. [Google Scholar]

[bib12] 12.Ma C., Blackwell T., Boehnke M., Scott L.J., GoT2D investigators Recommended joint and meta-analysis strategies for case-control association testing of single low-count variants. Genet. Epidemiol. 2013;37:539–550. doi: 10.1002/gepi.21742. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib13] 13.Firth D. Bias reduction of maximum likelihood estimates. Biometrika. 1993;80:27–38. [Google Scholar]

[bib14] 14.Daniels H.E. Saddlepoint Approximations in Statistics. Ann. Math. Stat. 1954;25:631–650. [Google Scholar]

[bib15] 15.Barndorff-Nielsen O.E. Approximate Interval Probabilities. J. R. Stat. Soc. B. 1990;52:485–496. [Google Scholar]

[bib16] 16.Kuonen D. Saddlepoint approximations for distributions of quadratic forms in normal variables. Biometrika. 1999;86:929–935. [Google Scholar]

[bib17] 17.Feller W. The fundamental limit theorems in probability. Bull. Amer. Math. Soc. 1945;51:800–832. [Google Scholar]

[bib18] 18.Berry A.C. The accuracy of the Gaussian approximation to the sum of independent variates. Trans. Am. Math. Soc. 1941;49:122–136. [Google Scholar]

[bib19] 19.Esseen C.G. On the Liapounoff Limit of Error in the Theory of Probability. Ark. Mat. Astr. Fys. 1942;28A:1–19. [Google Scholar]

[bib20] 20.Esseen C.G. A Moment Inequality with an Application to the Central Limit Theorem. Skand Aktuarietidskr. 1956;39:160–170. [Google Scholar]

[bib21] 21.Jensen J.L. Oxford University Press; 1995. Saddlepoint Approximations. [Google Scholar]

[bib22] 22.Whittaker E.T., Robinson G. The Calculus of Observations: A Treatise on Numerical Mathematics. Fourth Edition. Dover; 1967. The Newton-Raphson Method; pp. 84–87. [Google Scholar]

[bib23] 23.Press W.H., Flannery B.P., Teukolsky S.A., Vetterling W.T. Second Edition. Cambridge University Press; 1992. Numerical Recipes in Fortran 77: The Art of Scientific Computing. [Google Scholar]

[bib24] 24.Brent R.P. Prentice-Hall; 1973. Algorithms for Minimization without Derivatives. [Google Scholar]

[bib25] 25.Shevtsova I.G. An improvement of convergence rate estimates in the Lyapunov theorem. Dokl. Math. 2010;82:862–864. [Google Scholar]

[bib26] 26.McCarthy S., Das S., Kretzschmar W., Delaneau O., Wood A.R., Teumer A., Kang H.M., Fuchsberger C., Danecek P., Sharp K., Haplotype Reference Consortium A reference panel of 64,976 haplotypes for genotype imputation. Nat. Genet. 2016;48:1279–1283. doi: 10.1038/ng.3643. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib27] 27.Delaneau O., Zagury J.-F., Marchini J. Improved whole-chromosome phasing for disease and population genetic studies. Nat. Methods. 2013;10:5–6. doi: 10.1038/nmeth.2307. [DOI] [PubMed] [Google Scholar]

[bib28] 28.Loh P.-R., Danecek P., Palamara P.F., Fuchsberger C., A Reshef Y., K Finucane H., Schoenherr S., Forer L., McCarthy S., Abecasis G.R. Reference-based phasing using the Haplotype Reference Consortium panel. Nat. Genet. 2016;48:1443–1448. doi: 10.1038/ng.3679. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib29] 29.Das S., Forer L., Schönherr S., Sidore C., Locke A.E., Kwong A., Vrieze S.I., Chew E.Y., Levy S., McGue M. Next-generation genotype imputation service and methods. Nat. Genet. 2016;48:1284–1287. doi: 10.1038/ng.3656. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib30] 30.Carroll R.J., Bastarache L., Denny J.C. R PheWAS: data analysis and plotting tools for phenome-wide association studies in the R environment. Bioinformatics. 2014;30:2375–2376. doi: 10.1093/bioinformatics/btu197. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib31] 31.Zhang M., Song F., Liang L., Nan H., Zhang J., Liu H., Wang L.-E., Wei Q., Lee J.E., Amos C.I. Genome-wide association studies identify several new loci associated with pigmentation traits and skin cancer risk in European Americans. Hum. Mol. Genet. 2013;22:2948–2959. doi: 10.1093/hmg/ddt142. [DOI] [PMC free article] [PubMed] [Google Scholar]

32. Sulem P., Gudbjartsson D.F., Stacey S.N., Helgason A., Rafnar T., Magnusson K.P., Manolescu A., Karason A., Palsson A., Thorleifsson G. Genetic determinants of hair, eye and skin pigmentation in Europeans. Nat. Genet. 2007;39:1443–1452. doi: 10.1038/ng.2007.13.

33. Jacobs L.C., Hamer M.A., Gunn D.A., Deelen J., Lall J.S., van Heemst D., Uh H.-W., Hofman A., Uitterlinden A.G., Griffiths C.E.M. A Genome-Wide Association Study Identifies the Skin Color Genes IRF4, MC1R, ASIP, and BNC2 Influencing Facial Pigmented Spots. J. Invest. Dermatol. 2015;135:1735–1742. doi: 10.1038/jid.2015.62.

34. Liu F., Visser M., Duffy D.L., Hysi P.G., Jacobs L.C., Lao O., Zhong K., Walsh S., Chaitanya L., Wollstein A. Genetics of skin color variation in Europeans: genome-wide association studies with functional follow-up. Hum. Genet. 2015;134:823–835. doi: 10.1007/s00439-015-1559-0.

35. Barrett J.H., Iles M.M., Harland M., Taylor J.C., Aitken J.F., Andresen P.A., Akslen L.A., Armstrong B.K., Avril M.-F., Azizi E., GenoMEL Consortium. Genome-wide association study identifies three new melanoma susceptibility loci. Nat. Genet. 2011;43:1108–1113. doi: 10.1038/ng.959.

36. Nan H., Kraft P., Qureshi A.A., Guo Q., Chen C., Hankinson S.E., Hu F.B., Thomas G., Hoover R.N., Chanock S. Genome-wide association study of tanning phenotype in a population of European ancestry. J. Invest. Dermatol. 2009;129:2250–2257. doi: 10.1038/jid.2009.62.

37. Scott L.J., Bonnycastle L.L., Willer C.J., Sprau A.G., Jackson A.U., Narisu N., Duren W.L., Chines P.S., Stringham H.M., Erdos M.R. Association of transcription factor 7-like 2 (TCF7L2) variants with type 2 diabetes in a Finnish sample. Diabetes. 2006;55:2649–2653. doi: 10.2337/db06-0341.

38. Bertina R.M., Koeleman B.P., Koster T., Rosendaal F.R., Dirven R.J., de Ronde H., van der Velden P.A., Reitsma P.H. Mutation in blood coagulation factor V associated with resistance to activated protein C. Nature. 1994;369:64–67. doi: 10.1038/369064a0.

39. Kerem B., Rommens J.M., Buchanan J.A., Markiewicz D., Cox T.K., Chakravarti A., Buchwald M., Tsui L.C. Identification of the cystic fibrosis gene: genetic analysis. Science. 1989;245:1073–1080. doi: 10.1126/science.2570460.

40. Collins F.S., Varmus H. A new initiative on precision medicine. N. Engl. J. Med. 2015;372:793–795. doi: 10.1056/NEJMp1500523.

41. Lee S., Abecasis G.R., Boehnke M., Lin X. Rare-variant association analysis: study designs and statistical tests. Am. J. Hum. Genet. 2014;95:5–23. doi: 10.1016/j.ajhg.2014.06.009.

42. Lee S., Fuchsberger C., Kim S., Scott L. An efficient resampling method for calibrating single and gene-based rare variant association analysis in case-control studies. Biostatistics. 2016;17:1–15. doi: 10.1093/biostatistics/kxv033.


Appendix A: Explanation behind Using $\tilde{G}$ instead of $G$