Abstract
In the analysis of current life science datasets, we often encounter scenarios in which the application of asymptotic theory to hypothesis testing can be problematic. Besides improved asymptotic results, permutation/simulation-based tests are a general approach to address this issue. However, these randomized tests can impose a massive computational burden, for example, in scenarios in which large numbers of statistical tests are computed, and the specified significance level is very small. Stopping rules aim to assess significance with the smallest possible number of draws while controlling the probabilities of errors due to statistical uncertainty. In this communication, we derive a general stopping rule, QUICK-STOP, based on sequential testing theory that is easy to implement, controls the error probabilities rigorously, and is nearly optimal in terms of expected draws. In a simulation study, we show that our approach outperforms current stopping approaches for general randomized tests by a factor of 10 and does not impose an additional computational burden. We illustrate our approach by applying our stopping rule to a single variant analysis of a whole-genome sequencing study for lung function.
Keywords: association p-value, next-generation sequencing, permutation, randomized test, sequential testing
Introduction
The analysis in genetic studies often involves association testing of single variants, genetic regions, gene-gene interactions, or gene expression data. Due to decreasing sequencing costs, study sample sizes and the number of available variants have increased by several orders of magnitude over the last few years. For many studies, this has created a large computational burden and a severe multiple testing problem. Given the number of tests computed, very small alpha levels must be achieved for test results to be considered significant.
Furthermore, the evaluation of p-values itself can be problematic for many study designs and test statistics. Rare genetic variants, imbalanced case-control ratios, phenotypic outliers (Stranger et al., 2005), or non-normally distributed phenotypes can lead to scenarios in which standard asymptotic theory does not provide reliable results. For some test statistics of interest, for example in gene-based/region-based analysis (Chen, Hsu, Gamazon, Cox, & Nicolae, 2012; Lee, Abecasis, Boehnke, & Lin, 2014; Liu et al., 2010; Mishra & Macgregor, 2015), the asymptotic distribution of the test statistic is not even tractable, and permutation/simulation procedures have to be applied, which can be computationally challenging (Malik et al., 2018; Sugasawa, Noma, Otani, Nishino, & Matsui, 2017).
Researchers have proposed various approaches and strategies to overcome the described issues regarding the assessment of statistical significance. These approaches include the transformation of the data to normality so that standard methodology can be applied, more specialized asymptotic results/approximations for particular test statistics, the application of permutation/simulation-based tests, approximations of the permutation-based p-values, and efficient resampling algorithms. In the setting where non-normally distributed traits are tested for association, for example, lung volume and blood markers, or in the presence of phenotypic outliers, a proposed strategy is to apply inverse normal transformations so that the phenotype follows a normal distribution (Beasley, Erickson, & Allison, 2009). Then, the standard methodology can be applied to test the phenotype data for association. This procedure was applied and investigated in several recent publications (Boueiz et al., 2016; Vavoulis, Taylor, Schuh, & Bar-Joseph, 2017; Zhang, Xie, Liang, & Xiong, 2016). Although these transformations can solve the problem of unreliable asymptotic p-values for some scenarios, they can also lead to a loss of power (Beasley et al., 2009). In Supplementary Material A, we include simulations that indicate a loss of power in regression analysis when rank transformations are applied.
While the corresponding theory for approaches that approximate the permutation/resampling-based p-values or derive more specific asymptotic results is much more involved than the straightforward permutation/simulation-based procedure, it is also restricted to specific scenarios/test statistics. The approaches are not universally applicable, and many epidemiological studies and association test statistics nevertheless require randomized testing approaches based on permutations or simulations (e.g., Chen et al., 2012; Malik et al., 2018; Zhu, Zhang, & Sha, 2018).
As explained above, if randomized testing is applied in a large-scale setting, the computational burden is enormous. To reduce this computational burden, some simple stopping rules were proposed (Chang et al., 2015; Che, Jack, Motsinger-Reif, & Brown, 2014; Hasegawa et al., 2016). Instead of estimating the empirical p-value based on a pre-specified number of permutations/simulations per test, these approaches aim to stop randomization testing early if the corresponding p-value is clearly nonsignificant and therefore reduce the number of randomized draws and the computational time.
Closely related is the methodology of the so-called sequential Monte Carlo testing (Besag & Clifford, 1991; Gandy, 2009). The approach by Gandy (2009), called SIMCTEST, describes a stopping rule that guarantees to control the probability of a wrong decision, but the implementation requires recursions and additional tuning parameters.
In this communication, we develop a general stopping rule for randomized tests, QUICK-STOP (QS), that is based on sequential testing theory and decides as fast as possible whether the unknown p-value is below a specific significance level of interest. In contrast to the sequential Monte Carlo testing literature (Gandy, 2009), we introduce an arbitrarily small indifference region between both hypotheses in which both decisions are acceptable. This leads to much simpler derivations and an intuitive stopping rule. For applications, the indifference region can be chosen so small that it is of no practical relevance while still guaranteeing a finite runtime. The approach reduces the computational burden substantially compared to the confidence interval-based (CI) stopping rule in the popular genetic analysis tool PLINK1.9 (Chang et al., 2015).
Furthermore, we can utilize existing theoretical results from sequential testing to show that our procedure is nearly optimal in terms of the expected number of replicates, minimizing the computational burden. It is important to note that our methodology can be applied to any randomized test, is not restricted to a specific scenario, and does not impose a significant additional computational burden. Our approach combines computational efficiency and generality with intuitive simplicity.
Methods
We consider a genetic epidemiological study, for example a GWAS, in which many statistical tests are performed and significance of findings is declared based on an appropriate significance level α that is corrected for multiple testing. We assume that each statistical test is based on a suitable association test statistic T that can be computed from the observed data, but whose significance cannot be evaluated by asymptotic theory. We assume that one can draw random permutations (e.g., in a regression model) or simulate the corresponding null distribution (e.g., a suitable normal distribution as in gene-based tests (Liu et al., 2010; Mishra & Macgregor, 2015)), such that the comparison between the randomized test statistic and the observed statistic can be interpreted as a sequence of independent Bernoulli draws X1, X2, … with parameter p that corresponds to the true, unknown p-value. After n permutations/simulations, the empirical p-value is usually estimated by p̂n = (1/n) Σk=1,…,n Xk. Given unrestricted computational resources and time, all p-values could be estimated with high accuracy based on a very large number of permutations/simulations, and significant results identified. To avoid this infeasible computational burden, a general stopping rule aims to decide between the hypotheses

H1: p ≤ α versus H2: p > α
based on as few permutations/simulations as possible. Since p is estimated based on permutations/simulations, a rigorous stopping rule needs to control the probability of a wrong decision due to statistical uncertainty in the estimation. In the following, we refer to a type I decision error if hypothesis H2 is true and the stopping rule chooses H1. A type II decision error occurs if hypothesis H1 is true and the stopping rule chooses H2.
An important consequence is that, if the probabilities of type I and type II decision errors by the stopping rule are negligibly small, randomized testing based on the statistic T in combination with the stopping rule essentially maintains the corresponding type I error and power of T but reduces the computational burden.
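The reduction of a randomized test to a sequence of Bernoulli draws can be illustrated with a short Python sketch (the standard-normal null statistic and the observed value 1.96 are toy assumptions for illustration, not part of the study):

```python
import random

random.seed(1)

def empirical_p_value(t_obs, draw_null_statistic, n_draws):
    # Each comparison "randomized statistic >= observed statistic" is an
    # independent Bernoulli(p) draw, where p is the true, unknown p-value.
    hits = sum(draw_null_statistic() >= t_obs for _ in range(n_draws))
    return hits / n_draws

# Toy example: standard normal null distribution, observed statistic 1.96,
# so the true one-sided p-value is about 0.025.
p_hat = empirical_p_value(1.96, lambda: random.gauss(0.0, 1.0), 200_000)
```

A stopping rule replaces the fixed `n_draws` by a data-dependent number of draws while controlling the decision error probabilities.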
QUICK-STOP
In this section, we introduce the setting and objects of our stopping rule QUICK-STOP and describe its theoretical properties. QUICK-STOP is based on the so-called Adaptive Sequential Likelihood Ratio Test (ASLRT) derived by Pavlov (1991) and Tartakovsky (2014).
The two hypotheses that we want to distinguish between are

H1: p ≤ p1 versus H2: p ≥ p2, with p2 = p1 + d,

where the indifference parameter d > 0 is a technical parameter that is chosen very small and separates the two hypotheses. Within the indifference region (p1, p2), we assume that both hypotheses are acceptable. For example, d = 10−8 would correspond to the resolution level of 10^8 permutations/simulations, but d can be chosen arbitrarily small. We will elaborate on the theoretical aspects of the indifference region/parameter d in more detail below.
Besides the parameters p1 and d, our stopping rule requires the specification of the parameters α1 and α2. As we will show, these two parameters control the type I and the type II decision error probabilities introduced above. Since we usually would like to avoid such wrong decisions in practice, we suggest very small values, for example α1 = α2 = 10−10. We introduce the objects

$$\Lambda_n^{(1)} = \sum_{k=1}^{n} \log \frac{f_{\hat{p}_{k-1}}(X_k)}{f_{p_1}(X_k)}$$

and

$$\Lambda_n^{(2)} = \sum_{k=1}^{n} \log \frac{f_{\hat{p}_{k-1}}(X_k)}{f_{p_2}(X_k)},$$

where X1, X2, … denotes the sequence of Bernoulli draws and fp(x) = p^x (1 − p)^(1−x), x ∈ {0, 1}, is the probability distribution of a Bernoulli random variable with success probability p. The estimate p̂k is a slightly modified version of the maximum likelihood estimator for the success probability p that is still asymptotically consistent; note that the k-th summand uses the one-step delayed estimate p̂k−1. Our stopping rule is denoted by (δ, N), where δ = 1 or δ = 2 defines the selected hypothesis, and N is the number of permutations/simulations computed until this decision is made. If Λn(1) reaches the threshold log(1/α2), we set δ = 2 and N = n. If Λn(2) reaches the threshold log(1/α1), we set δ = 1 and N = n.
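The stopping rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the starting value of the delayed estimate, the (s + 0.5)/(n + 1) modification of the maximum likelihood estimator, and the exact pairing of the thresholds log(1/α2) and log(1/α1) with the two objects are illustrative choices, not necessarily the precise definitions used in the paper:

```python
import math

def quick_stop(draws, p1, p2, alpha1=1e-10, alpha2=1e-10):
    """Adaptive sequential likelihood-ratio stopping rule (sketch).

    `draws` yields Bernoulli indicators X_k (1 = randomized statistic at
    least as extreme as the observed one). Returns (delta, n) with
    delta = 1 selecting H1: p <= p1 and delta = 2 selecting H2: p >= p2."""
    lam1 = lam2 = 0.0                 # log-likelihood-ratio objects against p1 and p2
    thr_h2 = math.log(1.0 / alpha2)   # crossing lam1 rejects H1 (assumed pairing)
    thr_h1 = math.log(1.0 / alpha1)   # crossing lam2 rejects H2 (assumed pairing)
    s = 0                             # successes observed so far
    p_hat = 0.5                       # one-step delayed estimate; start value is illustrative
    for n, x in enumerate(draws, start=1):
        if x:
            lam1 += math.log(p_hat) - math.log(p1)
            lam2 += math.log(p_hat) - math.log(p2)
        else:
            lam1 += math.log(1.0 - p_hat) - math.log(1.0 - p1)
            lam2 += math.log(1.0 - p_hat) - math.log(1.0 - p2)
        if lam1 >= thr_h2:
            return 2, n               # sequence very unlikely under H1: non-significant
        if lam2 >= thr_h1:
            return 1, n               # sequence very unlikely under H2: significant
        s += x
        p_hat = (s + 0.5) / (n + 1.0) # modified MLE, kept away from 0 and 1
    raise RuntimeError("draw sequence exhausted before a decision")

# A run of early successes decides for H2 after only a few draws.
delta, n = quick_stop(iter([1] * 10), p1=1e-4, p2=2e-4)
```

With p1 = 10−4 and p2 = 2 × 10−4, a few early successes push the first object over its threshold almost immediately, whereas a long run of zeros slowly accumulates evidence for H1 until the second object crosses its threshold.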
An illustrative example
The intuition behind our stopping rule QUICK-STOP is that the two objects are similar to likelihood ratios between the currently estimated p-value and the best-explaining parameter under the respective hypothesis.
To illustrate our stopping rule and the corresponding objects, we demonstrate an application with fixed parameters p1, d, and α1, α2 (which determine p2 and the thresholds). First, we consider a case in which p is clearly non-significant. As described above, the sequence can be thought of as generated by independent Bernoulli(p) draws. In a simulated example, after n = 6 draws, neither object had reached its corresponding threshold. After n = 7 draws, the first object reached its threshold, indicating that the observed sequence is extremely unlikely under hypothesis H1. Our stopping rule therefore chooses H2 and sets N = 7; that is, QUICK-STOP used 7 permutations/simulations to decide that the unknown p-value is non-significant.
For a second example, we consider a very small, clearly significant p. Here, we expect to draw many permutations/simulations without observing a more extreme test statistic. In a simulated example, a straightforward computation of the objects gives the point at which the second object reaches its threshold, so that QUICK-STOP stops and selects hypothesis H1.
Theorem 1 summarizes two important properties of our stopping rule.
Theorem 1.

(i) Pp(δ = 2) ≤ α2 for all p ≤ p1, and Pp(δ = 1) ≤ α1 for all p ≥ p2.

(ii) Let C(α1, α2) be the class of all stopping rules such that property (i) is fulfilled for α1 and α2. Then, as α1, α2 → 0, the expected number of draws of QUICK-STOP attains the asymptotic lower bound for this quantity over C(α1, α2).
Theorem 1 states that for fixed parameters p1 and p2 (and therefore d), our stopping rule QUICK-STOP guarantees to control the error probabilities at the arbitrary, pre-specified rates α1 and α2. The key to the proof of this result is the usage of a one-step delayed estimator that allows utilizing Doob’s martingale inequality (Pavlov, 1991; Tartakovsky, 2014). Moreover, for the small rates α1 and α2 that are of interest in practice, the expected number of permutations/simulations N of QUICK-STOP approaches the theoretical lower bound among all comparable stopping rules that control the error probabilities at the same levels. The theoretical lower bound depends on p, p1, p2 (and therefore d), α1, and α2. Therefore, our procedure is nearly optimal regarding the number of expected draws. The explicit terms for the lower bound are described in Theorem 1 in Supplementary Material B. The indifference parameter d determines a tradeoff between the worst-case expected runtime and the interpretability of the results due to the indifference region. In contrast to the scenario without separated hypotheses (Gandy, 2009), the introduction of the indifference region implies that the procedure stops eventually, and the expected number of permutations/simulations is finite. It is important to note that our result includes the deterministic cases (Supplementary Material B). After the decision, the current empirical p-value provides an accurate p-value estimate for tests with p-values close to the thresholds p1 and p2. If one is also interested in an accurate estimation of p-values that are clearly not significant, one can combine the stopping rule with a minimum number of permutations/simulations.
Application and numerical study
In this section, we describe the results of a simulation study and the application of our approach to a whole-genome sequencing study for lung function.
Simulation study
In this simulation study, we compare QUICK-STOP with a simple alternative stopping rule based on the adaptive permutation procedure in the popular genetic analysis tool PLINK1.9 (Chang et al., 2015). Both stopping rules can be applied to any randomized test. We directly apply both stopping rules to a sequence of independent Bernoulli variables with unknown parameter p, mimicking a general scenario of randomized testing. We assess how fast both rules can decide which hypothesis is true and how the rates of wrong decisions behave empirically. The PLINK1.9-related stopping rule is based on an asymptotic confidence interval approximation. It aims to decide between the two hypotheses H1: p ≤ α versus H2: p > α.
After n permutations/simulations, the p-value is estimated by the proportion p̂n of randomized test statistics that were more extreme than the observed statistic. Based on the corresponding estimated standard error √(p̂n(1 − p̂n)/n), a confidence interval utilizing the asymptotic normal distribution is constructed. If this confidence interval does not contain the significance level α, the stopping rule makes its decision based on whether the estimated p-value lies below or above the significance level. We will refer to this approach as CI. We note that the estimation of the standard error requires an estimated p-value that is not 0 or 1.
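Such a CI rule can be sketched as follows; the fixed normal quantile z = 6.5 standing in for the chosen confidence level is an illustrative assumption, and PLINK1.9's actual parameterization differs:

```python
import math

def ci_stopping_rule(draws, alpha, z=6.5):
    """Sketch of a confidence-interval-based stopping rule.

    `draws` yields Bernoulli indicators; z is the normal quantile matching
    the chosen confidence level (illustrative value).
    Returns (significant, p_hat, n)."""
    hits = 0
    for n, x in enumerate(draws, start=1):
        hits += x
        p_hat = hits / n
        if hits == 0 or hits == n:
            continue  # standard error is degenerate while the estimate is 0 or 1
        se = math.sqrt(p_hat * (1.0 - p_hat) / n)
        if p_hat + z * se < alpha:
            return True, p_hat, n   # CI entirely below alpha: significant
        if p_hat - z * se > alpha:
            return False, p_hat, n  # CI entirely above alpha: non-significant
    raise RuntimeError("draw sequence exhausted before a decision")

# A clearly non-significant alternating sequence is rejected quickly.
significant, p_hat, n = ci_stopping_rule(iter([1, 0] * 10_000), alpha=0.01)
```

Note the `continue` branch: as discussed above, the rule cannot even form a confidence interval until the estimated p-value is neither 0 nor 1, which is what makes it slow for very small p.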
For QUICK-STOP, we chose α1 = α2 = 10−10, and for CI we considered a corresponding confidence interval (based on the asymptotic normal distribution assumption).
In Tables 1–4, we report the average number of draws required by both approaches for multiple combinations of p, the significance cutoff, and d. For QUICK-STOP, we considered two different indifference parameters d: a scenario with a very small value and a scenario where the indifference region might be of practical impact (see the difference between p2 and p1 in Tables 1–4).
Table 1.
Average number of draws until the decision for CI and QUICK-STOP (QS). The significance level is chosen as . Results are based on 10,000 replicates. The two QS rows correspond to the two indifference parameters d considered.

| p | 0.5 | 0.1 | 0.05 | 10−2 | 10−3 | 10−4 | 10−5 | 10−6 | 10−7 | 10−8 |
|---|---|---|---|---|---|---|---|---|---|---|
| CI | 22 | 379 | 937 | 16453 | 2666 | 10012 | 100480 | 976758 | 9.98e06 | 9.81e07 |
| QS | 13 | 124 | 372 | 14311 | 11374 | 6051 | 5617 | 5574 | 5569 | 5569 |
| QS | 13 | 123 | 367 | 13584 | 11374 | 6051 | 5617 | 5574 | 5569 | 5569 |
Table 4.
Average number of draws until the decision for CI and QUICK-STOP (QS). The significance level is chosen as . Results are based on 10,000 replicates (1,000 replicates for some settings due to computational reasons).

| p | 0.5 | 0.1 | 0.05 | 10−2 | 10−3 | 10−4 | 10−5 | 10−6 | 10−7 | 10−8 |
|---|---|---|---|---|---|---|---|---|---|---|
| CI | 22 | 343 | 767 | 4155 | 41975 | 420480 | 4.20e06 | 4.20e07 | 4.30e08 | 5.06e09 |
| QS | 4 | 20 | 39 | 200 | 2690 | 29952 | 400473 | 5.62e06 | 9.35e07 | 2.35e09 |
| QS | 4 | 20 | 39 | 200 | 2658 | 29860 | 398335 | 5.53e06 | 9.09e07 | 2.21e09 |
Compared to CI, QUICK-STOP reduces the number of draws dramatically in most scenarios. In Tables 1 and 2, we see a huge reduction for very small p, since the CI approach cannot estimate the confidence interval before the estimated p-value is nonzero. The parameter d does not have a strong impact on the QUICK-STOP results, indicating that this parameter should be chosen very small in practice, reducing the indifference region. However, if p is of the same magnitude as the significance cutoff, there are scenarios in which the CI approach requires fewer draws than QUICK-STOP. The reason is that the probability of a wrong decision by the CI approach is much larger than the confidence level suggests.
Table 2.
Average number of draws until the decision for CI and QUICK-STOP (QS). The significance level is chosen as . Results are based on 10,000 replicates.

| p | 0.5 | 0.1 | 0.05 | 10−2 | 10−3 | 10−4 | 10−5 | 10−6 | 10−7 | 10−8 |
|---|---|---|---|---|---|---|---|---|---|---|
| CI | 22 | 345 | 784 | 4594 | 167198 | 26682 | 101565 | 976847 | 9.98e06 | 9.81e07 |
| QS | 9 | 61 | 148 | 1344 | 149838 | 119452 | 63133 | 58647 | 58191 | 58158 |
| QS | 9 | 61 | 147 | 1330 | 142424 | 119452 | 63133 | 58647 | 58191 | 58158 |
We analyzed the empirical rates of wrong decisions for two significance cutoffs. We considered values for p close to these significance levels; the results are reported in Table 5. As we can see, the empirical rates of wrong decisions are increased for the CI approach. Since the CI approach does not consider the ‘trajectory’ along the draws and relies on an asymptotic distribution, the confidence interval does not correspond to an error rate of 10−10. In comparison, we did not observe a wrong decision by QUICK-STOP. This is expected, since we chose α1 = α2 = 10−10 and, according to Theorem 1, the error rates are controlled by these parameters.
Table 5.
Empirical analysis of the probability of a wrong decision for multiple combinations of p, the significance cutoff, and d. Results are based on 10^7 replications (*10^7 replications, **10^4 replications, ***10^3 replications, due to computational restrictions). The two blocks of rows correspond to the two significance cutoffs analyzed.

| p | 10−2 | 5.1 * 10−2 | 10−3 | 5.1 * 10−3 | 10−4 | 10−5 |
|---|---|---|---|---|---|---|
| CI | 4.2e-06 | 0.004** | 1.0092e-03 | 5.003e-04 | 1.029e-4 | 9.6e-06 |
| QS | 0 | 0** | 0 | 0 | 0 | 0 |
| QS | 0 | 0** | 0 | 0 | 0 | 0 |
| CI | 0 | 0 | 3e-06* | 0.002*** | 1.021e-4 | 9.6e-06 |
| QS | 0 | 0 | 0* | 0*** | 0 | 0 |
| QS | 0 | 0 | 0* | 0*** | 0 | 0 |
The increased error rate of CI can also be computed directly. For example, CI stops and declares significance if, after 149,400,000 simulations, only 1 success has been observed. The corresponding estimated p-value is 1/149,400,000 ≈ 6.7 × 10−9. If we assume that the true p-value lies at or just above the significance cutoff, the probability of observing exactly 1 success after 149,400,000 simulations is 0.0042. The strength of QUICK-STOP is that it guarantees control of the error probability even if the true p-value is very small and close to the significance level.
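The stated probability can be checked with the binomial probability mass function. The assumed true p-value of 5 × 10−8 used below is our illustrative choice (the exact value is redacted in the text above); it approximately reproduces the reported 0.0042:

```python
# Probability of exactly one success in n Bernoulli(p) trials:
# C(n, 1) * p * (1 - p)**(n - 1).
def prob_exactly_one(n, p):
    return n * p * (1.0 - p) ** (n - 1)

n_sims = 149_400_000
p_true = 5e-8  # assumed true p-value (redacted in the text); illustrative
prob = prob_exactly_one(n_sims, p_true)
# prob is approximately 0.0042-0.0043
```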
Application to COPD whole-genome sequencing study
In the simulation study above, we analyzed the performance of QUICK-STOP and the CI-based approach for fixed p. To illustrate the application and advantages of our approach in a real data example, where the true, unknown p-values are drawn from a realistic distribution, we considered a genome-wide single variant association analysis for a whole-genome sequencing dataset from the COPDGene study (Regan et al., 2010).
The COPDGene Study consists of >10,000 current or former smokers with and without chronic obstructive pulmonary disease (COPD). Subjects were of non-Hispanic White or African-American ancestry and aged between 45 and 80 years. In addition, a minimum of 10 pack-years of smoking and no lung disease other than COPD or asthma were ascertainment criteria. The Boston Early-Onset COPD study (BEOCOPD; Silverman et al., 1998) is an extended pedigree study with probands aged below 53 years with severe COPD (defined as forced expiratory volume in one second (FEV1) < 40% predicted). As part of the National Heart, Lung, and Blood Institute Trans-Omics in Precision Medicine (TOPMed) project, 2,000 severely affected cases from BEOCOPD and COPDGene, and controls with normal spirometry from COPDGene, were selected for whole-genome sequencing (Prokopenko et al., 2018). Our analysis dataset consisted of 51,715,479 genetic variants and 1,794 samples, after selecting variants without missing genotypes, applying no lower MAF cutoff, and performing the quality control described in Prokopenko et al. (2018). Using covariates for pack-years and 10 eigenvectors from a Jaccard population stratification analysis (Prokopenko et al., 2016), we computed the covariate-adjusted quantitative lung function trait (FEV1 percent predicted) via linear regression. To test for association between this adjusted phenotype and genotype, we also consider a linear regression model. However, the corresponding asymptotic p-values are not reliable, since the phenotype distribution in the sample is highly skewed and most variants in a WGS study are rare. Therefore, we evaluated significance by permutation of the phenotype information.
We compare our approach QUICK-STOP with the adaptive, confidence interval-based permutation stopping rule implemented in the popular genetic analysis tool PLINK1.9 (Chang et al., 2015). This approach corresponds to the stopping rule CI in the previous simulation study. For QUICK-STOP, we ran the analysis under two scenarios with different parameter choices (referred to as the first and second scenario below).
For the analysis with PLINK1.9, we applied the aperm command, which requires the specification of six parameters (Chang et al., 2015). We specified the minimum and maximum numbers of permutations as 2 and 10^9 (the maximum possible value), respectively, and allowed variants to be pruned out after every permutation. We set the significance level to the respective scenario value (10−19 in the second scenario). In addition, we chose the confidence level such that PLINK1.9 computes a confidence interval (based on the normal distribution approximation) and stops if this confidence interval does not contain the significance level.
In Table 6, we report the results of this analysis. For each lower cutoff for p-values, we extracted all variants for which the final p-value estimate provided by PLINK1.9 was above this threshold. We report the corresponding number of variants and the overall number of computed permutations across these single nucleotide polymorphisms (SNPs) for both methods. We observe that our approach reduces the number of required permutations by a factor of around 10 for most SNPs. In addition, we observe that most of the permutations are spent on a small set of ‘interesting/promising’ SNPs. For example, in the first scenario, our approach utilizes more than 50% of the overall number of permutations for 278 variants; the corresponding fraction for the confidence interval-based approach is only approximately 23%.
Table 6:
Analysis of the COPD dataset with our sequential testing approach QUICK-STOP and PLINK1.9.
| Scenario | Lower p-value cutoff | Number of variants | QUICK-STOP | PLINK1.9 | Ratio PLINK1.9/QUICK-STOP |
|---|---|---|---|---|---|
| 1 | 0.05 | 49,226,335 | 3.96e08 | 3.99e09 | 10.08 |
| | 10−2 | 51,244,126 | 7.70e08 | 7.43e09 | 9.65 |
| | 10−3 | 51,676,557 | 1.17e09 | 1.27e10 | 10.85 |
| | 10−4 | 51,712,482 | 1.58e09 | 1.68e10 | 10.63 |
| | 10−5 | 51,715,193 | 1.98e09 | 1.98e10 | 10.00 |
| | 10−6 | 51,715,445 | 2.60e09 | 2.28e10 | 8.77 |
| | – | 51,715,471 | 4.37e09 | 2.57e10 | 5.88 |
| 2 | 0.05 | 49,226,335 | 3.81e08 | 3.99e09 | 10.47 |
| | 10−2 | 51,244,126 | 6.22e08 | 7.43e09 | 11.95 |
| | 10−3 | 51,676,557 | 9.16e08 | 1.27e10 | 13.87 |
| | 10−4 | 51,712,482 | 1.24e09 | 1.68e10 | 13.55 |
| | 10−5 | 51,715,193 | 1.50e09 | 1.98e10 | 13.20 |
| | 10−6 | 51,715,446 | 1.79e09 | 2.27e10 | 12.68 |
| | – | 51,715,479 | 3.06e09 | 2.87e10 | 9.38 |
For each p-value lower cutoff, we considered all variants for which the final p-value estimate of PLINK1.9 was above this cutoff. We excluded the genetic variants for which PLINK1.9 reached the maximum of 10^9 permutations.
We excluded variants from this comparison for which PLINK1.9 reached the maximum number of permutations (10^9), since this number is truncated by the software implementation. This affected only the first scenario, in which we excluded eight genetic variants. One of these genetic variants was reported to be significant by our approach. In the second scenario, neither PLINK1.9 nor our approach required 10^9 permutations for any variant, and no genetic variant was reported to be genome-wide significant with respect to this significance level.
Overall, this real data example demonstrates the practical advantages of our approach; the results are in line with our simulation study above.
Discussion
In the analysis of recent datasets in the life sciences, researchers often encounter scenarios in which asymptotic theory cannot be applied; the general approach to significance testing in these scenarios is permutation or simulation.
To avoid the resulting huge computational effort, researchers have proposed several methods to approximate permutation-based p-values for particular test statistics or have designed efficient resampling procedures that utilize special properties of common application scenarios.
On the other hand, a general approach that aims to reduce the computational burden of permutation/simulation-based testing without losing flexibility and robustness is the utilization of stopping rules. The goal of such stopping rules is to decide as fast as possible whether the unknown p-value is below a pre-specified level of significance.
Here, we proposed such a general and intuitive sequential testing-based stopping rule that is easy to implement, requires almost no additional computational effort, and controls the error probabilities rigorously. Based on the theory for sequential testing, it can be shown that our stopping rule is nearly optimal in terms of the expected number of permutations/simulations. In a simulation study, we investigated the performance of our approach and showed that our stopping rule reduces the number of permutations/simulations substantially compared to the current methodology. In an application to a whole-genome sequencing study for lung function, we demonstrated the implementation and the practical value of our stopping rule. An implementation scheme and a simulation tool for QUICK-STOP can be found at https://github.com/julianhecker/QUICK-STOP.
Table 3.
Average number of draws until the decision for CI and QUICK-STOP (QS). The significance level is chosen as . Results are based on 10,000 replicates (1,000 replicates for some settings due to computational reasons).

| p | 0.5 | 0.1 | 0.05 | 10−2 | 10−3 | 10−4 | 10−5 | 10−6 | 10−7 | 10−8 |
|---|---|---|---|---|---|---|---|---|---|---|
| CI | 22 | 343 | 765 | 4176 | 41955 | 420576 | 4.29e06 | 4.68e07 | 1.67e09 | 2.72e08 |
| QS | 4 | 20 | 46 | 294 | 3547 | 47337 | 737914 | 1.57e07 | 1.73e09 | 1.41e09 |
| QS | 4 | 20 | 45 | 294 | 3535 | 47246 | 735007 | 1.56e07 | 1.64e09 | 1.41e09 |
Acknowledgments
This work was supported by Cure Alzheimer’s Fund; the National Human Genome Research Institute [R01HG008976]; and the National Heart, Lung, and Blood Institute [U01HL089856, U01HL089897, P01HL120839, P01HL132825].
Whole genome sequencing (WGS) for the Trans-Omics in Precision Medicine (TOPMed) program was supported by the National Heart, Lung and Blood Institute (NHLBI). WGS for “NHLBI TOPMed: Genetic Epidemiology of COPD” (phs000951) was performed at the Broad Institute of MIT and Harvard (HHSN268201500014C), and at the University of Washington Northwest Genomics Center (3R01HL089856-08S1). Centralized read mapping and genotype calling, along with variant quality metrics and filtering were provided by the TOPMed Informatics Research Center (3R01HL-117626-02S1; contract HHSN268201800002I). Phenotype harmonization, data management, sample-identity QC, and general study coordination were provided by the TOPMed Data Coordinating Center (3R01HL-120393-02S1; contract HHSN268201800001I). We gratefully acknowledge the studies and participants who provided biological samples and data for TOPMed.
The COPDGene project described was supported by Award Number U01 HL089897 and Award Number U01 HL089856 from the National Heart, Lung, and Blood Institute. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Heart, Lung, and Blood Institute or the National Institutes of Health. The COPDGene project is also supported by the COPD Foundation through contributions made to an Industry Advisory Board comprised of AstraZeneca, Boehringer Ingelheim, GlaxoSmithKline, Novartis, Pfizer, Siemens, and Sunovion. A full listing of COPDGene investigators can be found in Supplementary Information C.
The TOPMed Banner Authorship List is provided in Supplementary Information C.
Data availability
The data that support the findings of this study are/will be available in dbGaP (Genetic Epidemiology of COPD (COPDGene), accession phs000179) at https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000179.v6.p2, and the software example is available at https://github.com/julianhecker/QUICK-STOP.
Declaration of Interests
The authors declare no competing interests.
Supplementary Material
Supplementary Material A and B contain a simulation regarding rank-based transformations and a detailed version of Theorem 1. Supplementary Material is available online.
References
- Beasley TM, Erickson S, & Allison DB (2009). Rank-Based Inverse Normal Transformations are Increasingly Used, But are They Merited? Behavior Genetics, 39(5), 580–595. https://doi.org/10.1007/s10519-009-9281-0
- Besag J, & Clifford P. (1991). Sequential Monte Carlo p-Values. Biometrika, 78(2), 301–304. https://doi.org/10.2307/2337256
- Boueiz A, Lutz SM, Cho MH, Hersh CP, Bowler RP, Washko GR, … DeMeo DL (2016). Genome-Wide Association Study of the Genetic Determinants of Emphysema Distribution. American Journal of Respiratory and Critical Care Medicine, 195(6), 757–771. https://doi.org/10.1164/rccm.201605-0997OC
- Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, & Lee JJ (2015). Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience, 4, 7. https://doi.org/10.1186/s13742-015-0047-8
- Che R, Jack JR, Motsinger-Reif AA, & Brown CC (2014). An adaptive permutation approach for genome-wide association study: evaluation and recommendations for use. BioData Mining, 7, 9. https://doi.org/10.1186/1756-0381-7-9
- Chen LS, Hsu L, Gamazon ER, Cox NJ, & Nicolae DL (2012). An exponential combination procedure for set-based association tests in sequencing studies. American Journal of Human Genetics, 91(6), 977–986. https://doi.org/10.1016/j.ajhg.2012.09.017
- Gandy A. (2009). Sequential Implementation of Monte Carlo Tests With Uniformly Bounded Resampling Risk. Journal of the American Statistical Association, 104(488), 1504–1511. https://doi.org/10.1198/jasa.2009.tm08368
- Genetic Epidemiology of COPD (COPDGene). Retrieved from https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000179.v6.p2
- Hasegawa T, Kojima K, Kawai Y, Misawa K, Mimori T, & Nagasaki M. (2016). AP-SKAT: highly-efficient genome-wide rare variant association test. BMC Genomics, 17, 745. https://doi.org/10.1186/s12864-016-3094-3
- Lee S, Abecasis GR, Boehnke M, & Lin X. (2014). Rare-Variant Association Analysis: Study Designs and Statistical Tests. The American Journal of Human Genetics, 95(1), 5–23. https://doi.org/10.1016/j.ajhg.2014.06.009
- Liu JZ, McRae AF, Nyholt DR, Medland SE, Wray NR, Brown KM, … Macgregor S. (2010). A versatile gene-based test for genome-wide association studies. American Journal of Human Genetics, 87(1), 139–145. https://doi.org/10.1016/j.ajhg.2010.06.009
- Malik R, Chauhan G, Traylor M, Sargurupremraj M, Okada Y, Mishra A, … Dichgans M. (2018). Multiancestry genome-wide association study of 520,000 subjects identifies 32 loci associated with stroke and stroke subtypes. Nature Genetics, 50(4), 524–537. https://doi.org/10.1038/s41588-018-0058-3
- Mishra A, & Macgregor S. (2015). VEGAS2: Software for More Flexible Gene-Based Testing. Twin Research and Human Genetics: The Official Journal of the International Society for Twin Studies, 18(1), 86–91. https://doi.org/10.1017/thg.2014.79
- Pavlov I. (1991). Sequential Procedure of Testing Composite Hypotheses with Applications to the Kiefer–Weiss Problem. Theory of Probability & Its Applications, 35(2), 280–292. https://doi.org/10.1137/1135036
- Prokopenko D, Hecker J, Silverman EK, Pagano M, Nöthen MM, Dina C, … Fier HL (2016). Utilizing the Jaccard index to reveal population stratification in sequencing data: a simulation study and an application to the 1000 Genomes Project. Bioinformatics (Oxford, England), 32(9), 1366–1372. https://doi.org/10.1093/bioinformatics/btv752
- Prokopenko D, Sakornsakolpat P, Loehlein Fier H, Qiao D, Parker MM, McDonald M-LN, … COPDGene Investigators, NHLBI TOPMed Investigators. (2018). Whole Genome Sequencing in Severe Chronic Obstructive Pulmonary Disease. American Journal of Respiratory Cell and Molecular Biology. https://doi.org/10.1165/rcmb.2018-0088OC
- Regan EA, Hokanson JE, Murphy JR, Make B, Lynch DA, Beaty TH, … Crapo JD (2010). Genetic epidemiology of COPD (COPDGene) study design. COPD, 7(1), 32–43. https://doi.org/10.3109/15412550903499522
- Silverman EK, Chapman HA, Drazen JM, Weiss ST, Rosner B, Campbell EJ, … Speizer FE (1998). Genetic epidemiology of severe, early-onset chronic obstructive pulmonary disease. Risk to relatives for airflow obstruction and chronic bronchitis. American Journal of Respiratory and Critical Care Medicine, 157(6 Pt 1), 1770–1778. https://doi.org/10.1164/ajrccm.157.6.9706014
- Stranger BE, Forrest MS, Clark AG, Minichiello MJ, Deutsch S, Lyle R, … Dermitzakis ET (2005). Genome-Wide Associations of Gene Expression Variation in Humans. PLOS Genetics, 1(6), e78. https://doi.org/10.1371/journal.pgen.0010078
- Sugasawa S, Noma H, Otani T, Nishino J, & Matsui S. (2017). An efficient and flexible test for rare variant effects. European Journal of Human Genetics, 25(6), 752–757. https://doi.org/10.1038/ejhg.2017.43
- Tartakovsky AG (2014). Nearly optimal sequential tests of composite hypotheses revisited. Proceedings of the Steklov Institute of Mathematics, 287(1), 268–288. https://doi.org/10.1134/S0081543814080161
- Vavoulis DV, Taylor JC, Schuh A, & Bar-Joseph Z. (2017). Hierarchical probabilistic models for multiple gene/variant associations based on next-generation sequencing data. Bioinformatics, 33(19), 3058–3064. https://doi.org/10.1093/bioinformatics/btx355
- Zhang F, Xie D, Liang M, & Xiong M. (2016). Functional Regression Models for Epistasis Analysis of Multiple Quantitative Traits. PLOS Genetics, 12(4), e1005965. https://doi.org/10.1371/journal.pgen.1005965
- Zhu H, Zhang S, & Sha Q. (2018). A novel method to test associations between a weighted combination of phenotypes and genetic variants. PLOS ONE, 13(1), e0190788. https://doi.org/10.1371/journal.pone.0190788
