PLOS Computational Biology. 2020 Apr 14;16(4):e1007819. doi: 10.1371/journal.pcbi.1007819

DOT: Gene-set analysis by combining decorrelated association statistics

Olga A Vsevolozhskaya 1, Min Shi 2, Fengjiao Hu 2, Dmitri V Zaykin 2,*
Editor: Jennifer Listgarten 3
PMCID: PMC7182280  PMID: 32287273

Abstract

Historically, the majority of statistical association methods have been designed assuming availability of SNP-level information. However, modern genetic and sequencing data present new challenges to access and sharing of genotype-phenotype datasets, including cost of management and difficulties in consolidating records across research groups. These issues make methods based on SNP-level summary statistics particularly appealing. The most common way of combining statistics is a sum of SNP-level squared scores, possibly weighted, as in burden tests for rare variants. The overall significance of the resulting statistic is evaluated using its distribution under the null hypothesis. Here, we demonstrate that this basic approach can be substantially improved by decorrelating scores prior to their addition, resulting in remarkable power gains in situations that are most commonly encountered in practice, namely, under heterogeneity of effect sizes and diversity in pairwise LD. In these situations, the power of the traditional test, based on the added squared scores, quickly reaches a ceiling as the number of variants increases. Thus, the traditional approach does not benefit from information potentially contained in any additional SNPs, while our decorrelation by orthogonal transformation (DOT) method yields a steady gain in power. We present theoretical and computational analyses of both approaches and reveal the causes behind the sometimes dramatic difference in their respective powers. We showcase DOT by analyzing breast cancer and cleft lip data, in which our method strengthened levels of previously reported associations and implied the possibility of multiple new alleles that jointly confer disease risk.

Author summary

Joint analysis of association between the outcome and a group of SNPs within a genetic region is increasingly recognized to complement single-SNP analysis and shed light on the underlying molecular mechanisms. However, the correlation among GWAS association results calls for specifically tailored statistical methods. Here we propose the DOT (Decorrelation by Orthogonal Transformation) method, which can efficiently combine evidence of association over different SNPs and genes within a pathway without access to the original genotypic data. DOT is fast, does not rely on a permutation algorithm, and is often dramatically more powerful than other popular methods, such as VEGAS and the recently proposed ACAT. We believe that DOT will become a useful addition to the toolbox of summary-statistics methods for the GWAS community.


This is a PLOS Computational Biology Methods paper.

Introduction

In recent years, genome-wide association studies (GWAS) have uncovered a wealth of genetic susceptibility variants. The emergence of new statistical approaches for the analysis of GWAS has largely contributed to that success. The majority of these methods require access to individual-level data, yet methods that require only summary statistics have been developed as well. The rising popularity of summary-based methods for the analysis of genetic associations has been motivated by many factors, among which are the convenience and availability of summary statistics and high statistical power that can often match the power of analysis based on individual records [1–3].

Many types of association tests, including those originally developed for individual-level records, can be presented in terms of added summary statistics. For example, gene set analysis (GSA) tests or burden and overdispersion tests for rare variants [2, 4, 5], can be written as a weighted sum of summary statistics. In GSA applications, methods based on combined summary statistics can be used to efficiently aggregate information across many potentially associated variants within individual genes, as well as over several genes that may represent a common etiological pathway. When within-gene association statistics (or equivalently, P-values) are being combined, linkage disequilibrium (LD) needs to be accounted for, because LD induces correlation among statistics. The correlation among association test statistics for individual SNPs without covariates is the same as the correlation between alleles at the corresponding SNPs, if the genotype-phenotype relationship is linear. This fact allows one to model a set of statistics using a multivariate normal (MVN) distribution with the correlation matrix equal to the matrix of LD correlations. More generally, in the presence of covariates correlated with SNPs, MVN correlations among association statistics will depend not only on LD but also on other covariates in the model [6, 7].

When SNPs are coded as 0, 1, 2 values, reflecting the number of copies of the minor allele, the LD matrix of correlations can be obtained from SNP data as the sample correlation matrix. It can also be directly estimated from haplotype frequencies whenever those are available or reported. Specifically, the LD (i.e., the covariance between alleles i and j; $D_{ij}$) is defined by the difference between the di-locus haplotype frequency, $P_{ij}$, and the product of the frequencies of the two alleles, $D_{ij} = P_{ij} - p_i p_j$. Then, the correlation between a pair of SNPs is defined as $r_{ij} = D_{ij} / \sqrt{p_i (1 - p_i) p_j (1 - p_j)}$. The di-locus frequency $P_{ij}$ is defined as the sum of frequencies of those haplotypes that carry both of the minor alleles for SNPs i and j. Similarly, the allele frequency $p_i$ is the sum of frequencies of haplotypes that carry the minor allele of SNP i.
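As a minimal sketch of the definitions above, the following Python snippet computes the LD covariance and correlation for two SNPs from haplotype frequencies. The function name, the haplotype labels ('a' and 'b' for minor alleles, 'A' and 'B' for major alleles), and the example frequencies are hypothetical and chosen only for illustration.

```python
import math

def ld_correlation(hap_freqs):
    """Return (D, r) for two biallelic SNPs from two-locus haplotype frequencies.

    hap_freqs maps two-character haplotype labels to frequencies; the labels
    'a' (SNP i) and 'b' (SNP j) are hypothetical names for the minor alleles.
    """
    if abs(sum(hap_freqs.values()) - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    # Di-locus frequency: haplotypes carrying both minor alleles
    p_ij = sum(f for h, f in hap_freqs.items() if h[0] == "a" and h[1] == "b")
    # Allele frequencies: haplotypes carrying the minor allele of each SNP
    p_i = sum(f for h, f in hap_freqs.items() if h[0] == "a")
    p_j = sum(f for h, f in hap_freqs.items() if h[1] == "b")
    d = p_ij - p_i * p_j
    r = d / math.sqrt(p_i * (1 - p_i) * p_j * (1 - p_j))
    return d, r

# Illustrative (made-up) haplotype frequencies
freqs = {"ab": 0.15, "aB": 0.05, "Ab": 0.10, "AB": 0.70}
d_ij, r_ij = ld_correlation(freqs)
print(f"D = {d_ij:.3f}, r = {r_ij:.3f}")  # D = 0.100, r = 0.577
```

Here the minor allele frequencies are 0.20 and 0.25, so the covariance is 0.15 - 0.20 × 0.25 = 0.10.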

It is important to distinguish situations in which the LD matrix is estimated using the same data that were used to compute the association statistics from those in which the estimated LD matrix is obtained from a suitable population reference panel. The reference panel approach is implemented in popular web-based association analysis platforms, such as “VEGAS” [8] or “Pascal” [9]. Based on a user-provided list of L SNPs, with the corresponding association P-values, VEGAS queries an online reference panel resource to obtain the matrix of LD correlations. P-values are then transformed to normal scores, $P_i \to Z_i$, i = 1, …, L, and the vector Z is assumed to follow a zero-mean MVN distribution under the null hypothesis of no association. The individual statistics in VEGAS are then combined as $T_Q = \sum_{i=1}^{L} Z_i^2$ (where TQ stands for “Test by Quadratic form”), and the overall SNP-set P-value is derived empirically by simulating a large number (j = 1, …, B) of zero-mean MVN vectors, adding their squared values to obtain statistics $T_Q^{(j)}$, and computing the proportion of times when $T_Q^{(j)} > T_Q$. Statistics similar to TQ are ubiquitous and appear in many proposed tests that aggregate association signals within a genetic region.
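The empirical evaluation of TQ described above can be sketched in a few lines of Python. This is an illustration of the general Monte Carlo idea, not the VEGAS implementation: the function names, the example P-values, and the LD matrix are hypothetical, and the two-sided conversion from P-values to |Z| is an assumption (only squared scores enter TQ, so the sign does not matter).

```python
import math
import random
from statistics import NormalDist

def p_to_abs_z(p):
    """Two-sided P-value -> |Z| (assumed conversion; TQ only uses Z squared)."""
    return NormalDist().inv_cdf(1.0 - p / 2.0)

def cholesky(a):
    """Lower-triangular C with C C^T = a, for a symmetric positive-definite matrix."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(c[i][k] * c[j][k] for k in range(j))
            c[i][j] = math.sqrt(s) if i == j else s / c[j][j]
    return c

def tq_pvalue(pvals, ld, n_sim=20000, seed=1):
    """Observed TQ and its empirical P-value under a zero-mean MVN null."""
    rng = random.Random(seed)
    n = len(pvals)
    t_obs = sum(p_to_abs_z(p) ** 2 for p in pvals)
    c = cholesky(ld)
    exceed = 0
    for _ in range(n_sim):
        e = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Correlated null vector: z = C e has covariance equal to the LD matrix
        z = [sum(c[i][k] * e[k] for k in range(i + 1)) for i in range(n)]
        if sum(v * v for v in z) > t_obs:
            exceed += 1
    return t_obs, exceed / n_sim

# Illustrative inputs: three SNP P-values and a made-up LD correlation matrix
ld = [[1.0, 0.6, 0.3],
      [0.6, 1.0, 0.6],
      [0.3, 0.6, 1.0]]
t_q, p_set = tq_pvalue([0.01, 0.04, 0.20], ld)
print(f"TQ = {t_q:.2f}, empirical P = {p_set:.4f}")
```

With a single SNP the empirical P-value should simply recover the input P-value up to Monte Carlo error, which is a convenient sanity check.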

As exemplified by VEGAS, the distribution of TQ must explicitly incorporate LD. However, an alternative approach that implicitly incorporates LD can be based on first decorrelating the association summary statistics, and then exploiting the resulting independence to evaluate the distribution of the sum of decorrelated statistics, which we call Decorrelation by Orthogonal Transformation (DOT). This general idea is straightforward and has been used in many contexts, including methods that utilize individual records [10]. For instance, Zaykin et al. suggested a variation of this approach for combining P-values (or summary statistics) but did not study the power properties of the method in detail [11].
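To make the decorrelation idea concrete, here is a minimal two-SNP sketch in Python. It assumes that the decorrelating matrix H is the symmetric inverse square root of the LD correlation matrix, which has a closed form for two SNPs; the function name and the example inputs are hypothetical. For L = 2 the null distribution of the sum of squared decorrelated scores is chi-square with 2 degrees of freedom, whose tail probability is exp(-x/2).

```python
import math

def dot_two_snps(z1, z2, r):
    """Decorrelate two Z-scores with correlation r and combine them.

    Uses H = Sigma^(-1/2), built from the eigenvalues (1 + r, 1 - r) and
    eigenvectors (1, 1)/sqrt(2), (1, -1)/sqrt(2) of the 2 x 2 correlation matrix.
    """
    s_plus, s_minus = 1.0 / math.sqrt(1.0 + r), 1.0 / math.sqrt(1.0 - r)
    a = 0.5 * (s_plus + s_minus)      # diagonal entries of H
    b = 0.5 * (s_plus - s_minus)      # off-diagonal entries of H
    x1 = a * z1 + b * z2              # decorrelated scores X = H Z
    x2 = b * z1 + a * z2
    stat = x1 * x1 + x2 * x2
    p_value = math.exp(-stat / 2.0)   # chi-square survival function, 2 df
    return (x1, x2), stat, p_value

# Illustrative inputs: two Z-scores with LD correlation 0.5
x, dot_stat, dot_p = dot_two_snps(2.0, 0.5, 0.5)
tq_stat = 2.0 ** 2 + 0.5 ** 2         # undecorrelated sum of squares, for contrast
print(f"DOT = {dot_stat:.3f} (P = {dot_p:.4f}), TQ = {tq_stat:.3f}")
```

The decorrelated sum equals the quadratic form Z'Σ⁻¹Z, here (4 - 1 + 0.25)/0.75 ≈ 4.333, so its null distribution does not depend on the LD pattern, unlike that of TQ.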

Here, we propose a new decorrelation-based method for combining single-SNP summary association statistics. We derive theoretical properties of our method and explore the asymptotic power of both DOT and TQ types of statistics. To the best of our knowledge, we are the first to derive the asymptotic distributions of DOT and TQ under the alternative hypothesis. Our results show that decorrelation can provide a surprisingly large power boost in biologically realistic scenarios. However, high statistical power is not the only advantage of the proposed framework. Once statistics are decorrelated, one can tap into a wealth of powerful methods developed for combining independent statistics. These methods include, among others, approaches that emphasize the strongest signals by combining the top-ranked results [11–16].

Our theoretical analyses also reveal an unexpected result, showing that in many practical settings tests based on the statistic TQ do not gain power with the increase in L (assuming the same pattern of effect sizes for different values of L), while the proposed method steadily gains power under the same conditions. Specifically, the proposed decorrelation method gains power when the effect sizes and/or pairwise LD values become increasingly more heterogeneous. The reasons behind the respective behaviors of tests based on TQ and DOT are explored here theoretically and confirmed via simulations. We further derive power approximations that are useful for understanding power properties of the studied methods.

To showcase our method, we evaluate associations between breast cancer susceptibility and SNPs in estrogen receptor alpha (ESR1), fibroblast growth factor receptor 2 (FGFR2), RAD51 homolog B (RAD51B), and TOX high mobility group box family member 3 (TOX3) genes, without access to raw genotype data. We first test for a joint association between SNPs in those four genes and breast cancer risk by decorrelating summary statistics based on the overall LD gene structure. We then describe how to follow up on the joint association results and identify one or more SNPs that drive joint association with disease risk. To further validate the utility of DOT, we also applied it to summary statistics of a recent GWAS of cleft lip with and without cleft palate. Both of our real data analyses confirmed previous associations and revealed new associations, suggesting new potential breast cancer and cleft lip SNP markers.

Results

As an introductory example of power analysis, we considered two simulated SNPs and a linear regression model Y = βX + ϵ, where X has a bivariate normal distribution, β = {0.3, 0}, and ϵ has a Laplace distribution with unit variance. Thus, in this model Y does not have a normal distribution; however, we expect that the theoretical powers for the TQ and DOT tests, as derived in the “Materials and Methods” section, will match the empirical power. We assumed a sample size of 500. In the first simulation experiment with 10,000 simulated regressions, we assumed the bivariate correlation R = 0.99. Although the two β coefficients are distinct, the mean values of association statistics induced by this model are similar to each other and are both approximately equal to 0.29. These values can be obtained via Eq 2. Our noncentrality analysis in the “Materials and Methods” section suggests that similarity of the mean values may lead to a power advantage of the test TQ. The respective powers of TQ and DOT were 0.87 and 0.80 empirically, and 0.86 and 0.80 by the theoretical calculation. In the second simulation experiment, we lowered R to 0.5. This caused the mean values to become distinct (0.29 and 0.14), and this difference between the two means caused the order of power to change, in agreement with our theoretical analysis. Powers now became 0.72 and 0.80, for TQ and DOT, respectively. In this case, empirical and theoretical powers matched to two digits. There is still a difference in power at R = 0.2 (0.75 vs. 0.80), but in the case R = 0 the two methods are identical. The power of DOT here is constant, and this reflects a special case, when only a single SNP has a non-zero effect size and, in addition, all correlations between SNPs are the same. We provide an R script that can not only reproduce these results but is also capable of power analysis with larger correlation matrices, i.e., cases with multiple SNPs. Correlation matrices are generated as symmetric matrices of random numbers and then converted to positive definite ones using the package “Matrix” [17]. Using this script, we evaluated the type-I error of both methods, assuming α-level 0.05, 10 SNPs, and β = 0. We found the type-I error to be close to the nominal level, using 100,000 simulations (0.04815 for DOT and 0.05002 for TQ). We note that the calculations are very fast and that the 100,000 simulation runs were completed in less than ten minutes on a typical laptop.

Further, we conducted a different set of extensive simulation experiments to study the statistical power of the proposed method based on the decorrelation statistic DOT, and to compare it to the statistic TQ. We also included a recently proposed method, “ACAT”, by Liu and colleagues [18], where association P-values for individual SNPs are transformed to Cauchy-distributed random variables and then added up to obtain the overall P-value. ACAT was included in the comparisons because it has robust power across different models of association. Specifically, Liu et al. found ACAT to be competitive against popular methods, including SKAT and burden tests for rare-variant associations [19–22]. A distinctive feature of ACAT is its good type-I error control in the presence of correlation between P-values, which, interestingly, improves as the α-level becomes smaller, due to its use of a transformation to a moment-free Cauchy distribution. Another similar approach is MAGMA [23]. MAGMA analyzes summary association statistics by considering the mean of the chi-square statistic for the SNPs in a gene or the largest statistic among the SNPs in a gene. The mean-of-statistics method is equivalent to Fisher’s method for combining dependent P-values [24, 25]. The method based on the top chi-square statistic among the SNPs in a gene is equivalent to the Bonferroni correction for dependent tests. There have been extensive studies comparing these two methods [26]. Note that TQ is very similar to the Fisher method.
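The Cauchy combination underlying ACAT can be illustrated with a short Python sketch. The function name, the equal default weights, and the example P-values are assumptions made for illustration; this is not the reference implementation, and it does not include the special handling that extremely small P-values would need numerically.

```python
import math

def acat_pvalue(pvals, weights=None):
    """Combine P-values by a weighted sum of Cauchy-transformed values."""
    n = len(pvals)
    if weights is None:
        weights = [1.0 / n] * n          # equal weights (an assumption here)
    w_sum = sum(weights)
    # Each P-value maps to a standard Cauchy variate under the null
    t = sum(w * math.tan((0.5 - p) * math.pi)
            for w, p in zip(weights, pvals)) / w_sum
    # Tail probability of the standard Cauchy distribution at t
    return 0.5 - math.atan(t) / math.pi

# Illustrative input: one strong signal among weak ones
p_acat = acat_pvalue([0.001, 0.40, 0.65, 0.80])
print(f"ACAT P = {p_acat:.4f}")
```

Because the Cauchy tail is heavy, the combined P-value is dominated by the smallest inputs, which is consistent with the approximation behaving better at small α-levels.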

We used two distinct scenarios in our simulation experiments:

  1. First, we assumed that the summary statistics and the sample correlation matrix among statistics are estimated from the same data set. This allowed us to validate power properties derived in “Materials and Methods.”

  2. Second, we assumed that the sample LD correlation matrix was obtained from an external reference panel. We included this scenario in our simulations due to the concern that the type-I error rate of the methods considered here may be inflated if the correlation matrix is computed based on a separate data set.

Simulations assuming that the LD matrix and the summary statistics are obtained from the same data

To compare methods with and without decorrelation of statistics, we considered several distinct settings. In Settings 1–4, the results in each row of the tables were based on one million simulations. Association statistics were simulated directly; namely, a $10^6 \times L$ matrix of MVN vectors was simulated first, and then each row of the matrix was analyzed by the competing methods. The empirical powers were obtained as the proportion of times that a particular statistic exceeded its critical value at α = 0.05.

  • Setting 1

    The decorrelation method (DOT) is expected to gain power as the number of SNPs increases in scenarios where effect sizes vary markedly from SNP to SNP. However, if effect sizes for all SNPs are in fact very close to each other, the power of DOT may decrease. To illustrate this property, our first, purposely contrived simulation setup is one where the induced effect sizes (mean values of statistics) were all non-zero but very close to each other in magnitude, varying uniformly from 2.3 to 2.4 (these are the values of the means of normally distributed standardized statistics). Table 1 shows the results of the simulation study under this setting, in which the decorrelation method was deliberately set up to fail. In the table, the columns labeled “Theor.” provide power calculated based on the distribution of the test statistics under the alternative hypothesis that we derived above. The columns labeled “Empiric.” provide results based on the empirical evaluation of power by computing P-values under the null. The column labeled “Approx.” provides power calculated based on Eq (17). The column labeled $\bar{\gamma}$ provides the average noncentrality value.

    The table illustrates that our analytical calculations under the alternative hypothesis are correct. That is, the empirical power of both TQ and DOT statistics matches the analytical calculations nearly exactly. The approximation based on Eq (17) also works well, emphasizing the fact that the distribution of the TQ statistic can be well approximated by a one-degree-of-freedom chi-square distribution.

    Further, the table confirms that the decorrelation method is under-performing relative to TQ if there is very little heterogeneity among effect sizes. However, power of all methods would increase under lower correlation. For example, for ρ = 0.3 and L = 20, the powers for TQ and DOT become 0.98 and 0.67, respectively. Additional insight into power behavior of methods under this scenario can be gained by examining Eq (19). The asymptotic power for TQ can be simply computed in R as 1-pchisq(qchisq(1-0.05, df = 1), df = 1, ncp = 2.35^2/0.7). This gives 0.802 TQ power as L → ∞ for Table 1 and 0.99 for the situation when ρ is lowered to 0.3. This simple approximation is surprisingly precise and works well for the rest of the settings.

    Setting 1 is admittedly unrealistic in practice. Nevertheless, the table illustrates that as the average noncentrality value increases, the power of DOT increases as well, while the power of TQ is relatively constant at about 80%. Finally, Table 1 shows that the power of TQ (although higher than that of DOT) does not change with L, highlighting the ceiling property of this method and the fact that combining more SNPs would not lead to higher power of TQ.

  • Setting 2

    One of the features of the decorrelation method is that it benefits from heterogeneity in pairwise LD. To illustrate this property, we added jiggle to the equicorrelation matrix as described in the “Materials and Methods” section, while keeping the effect size (mean values of statistics) vector the same as in Setting 1 (within the range of 2.3 to 2.4). Again, effect sizes were all non-zero. In this second set of simulations, uniformly distributed perturbations (in the range 0 to 5) were added through U, which made the pairwise correlations range from 0.14 to 0.98.

    Table 2 summarizes the results and, once again, illustrates the ceiling feature of TQ power. However, the power of the statistic DOT now starts to climb with L, and the proposed test based on DOT eventually becomes more powerful than the one based on TQ. This phenomenon can be explained by examining the eigenvectors of the correlation matrix in Setting 1. When eigenvectors are written in the form of the Helmert eigenvectors, the first contributing DOT statistic is formed as the mean of the original (non-transformed) statistics. The rest of the contributing statistics are weighted sums of the original statistics with weights given by the entries of the Helmert eigenvectors 2, …, L. However, the structure of each of these vectors is such that its entries add up to zero (and may contain zeros as well). Thus, when the means are very similar (as in Setting 1), there is cancellation of individual terms when the sum is formed. Moreover, note that although the average noncentrality value does not increase with L, the DOT test still gains power with L.

  • Setting 3

    This setting is analogous to the equicorrelation scenario in Setting 1, except that the mean values of statistics were lowered: in Setting 1, the range in μ was 2.3 to 2.4, while here, the range was set to vary uniformly between 1 and 2.3, and effect sizes were all non-zero. Thus, the maximum effect size was lower than that in the previous simulations but the heterogeneity among effect sizes was higher. We emphasize again that while the equicorrelation assumption is unrealistic, it serves as a very useful benchmark scenario that highlights power behavior and features of the statistics TQ and DOT and allows one to introduce departures from equicorrelation in a controlled manner.

    Table 3 presents the results. The “Approx.” column in this table was removed and replaced by power values based on a “P-value”-approximation to the distribution of TQ as in Eq (16). This switch highlights the idea that both the power and the P-value for the TQ test can be reliably estimated based on the one degree of freedom chi-squared approximation. Importantly, Table 3 demonstrates that the power of the DOT-test reaches 100% as L increases (despite the fact that effect sizes were lower than in the previous settings), while the power of the TQ-test stays in the range 51.2 to 52.5%.

  • Setting 4

    This setting is similar to the scenario in Setting 2, except that we allowed higher heterogeneity in pairwise LD values. Effect sizes were all non-zero. LD was constructed as a perturbation of the equicorrelation matrix, $R_{\rho=0.7} + UU^{\top}$ (as described in “Materials and Methods”), with U set to be a random sequence on the interval from -5 to 5. This resulted in LD values ranging from -0.93 to 0.99. The effect sizes (mean values of statistics) were sampled randomly within each simulation from the interval (-0.15, 0.15).

    Table 4 presents the results and shows that in this setting, the power of DOT is dramatically higher than that of TQ and ACAT. In fact, power values for the TQ and ACAT tests barely exceed the type-I error, while the power of the decorrelation method steadily increases with L, eventually exceeding 90%.

  • Settings 5–7

    In these sets of simulations we used biologically realistic patterns of LD. Also, rather than specifying mean values of association statistics directly, we utilized a regression model for the effect sizes, as described in Eqs (1) and (2). Details of these simulations are given in “LD patterns from the 1000 Genome Project” in “Materials and Methods.” We re-iterate that when association of SNPs with a trait is present (under the alternative hypothesis), the correlation among statistics is not equal to LD, because it also has to incorporate effect sizes, as illustrated by Eq (5). This point is important if one wants to simulate statistics directly from the MVN distribution rather than computing them based on simulated data followed by regression.

    The results are presented in Table 5. Columns labeled “Regr.” represent scenarios, in which data were generated and statistics were computed. Columns labeled “MVN” represent scenarios, in which statistics were simulated directly. The rows of Table 5 show power values for three different α-levels. We expected the power values in “Regr.” and “MVN” columns to match, and they do, highlighting another utility of our analytical derivation of the distribution of the test statistic under the alternative hypothesis. That is, using our results, one can significantly reduce computational and programming burden in genetic simulations. Also note that power values in Table 5 do not decrease as α-level becomes smaller (Settings 6 and 7). This is due to the fact that we deliberately discarded effect size and LD configurations where power was expected to be too low, because we wanted to assure a good range of power values across methods.

    As in previous simulations, power values of TQ and ACAT are similar. The power approximation by Eq (17) remains close to the predicted theoretical power, as well as to empirically estimated powers. We also observed that power of the decorrelation test, DOT, is substantially higher than the powers of either TQ or ACAT.
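The asymptotic TQ power quoted under Setting 1 via an R one-liner can be mirrored in Python using only the standard library. The sketch below relies on the fact that a noncentral chi-square variable with one degree of freedom is the square of a unit-variance normal with mean sqrt(ncp); the function names are hypothetical, and the noncentrality 2.35²/ρ is taken directly from the R expression in the text.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def ncchisq1_sf(x, ncp):
    """P(X > x) for X = (Z + sqrt(ncp))^2, i.e. noncentral chi-square with 1 df."""
    s, m = math.sqrt(x), math.sqrt(ncp)
    return (1.0 - _STD_NORMAL.cdf(s - m)) + _STD_NORMAL.cdf(-s - m)

def tq_limit_power(mu, rho, alpha=0.05):
    """Counterpart of 1 - pchisq(qchisq(1 - alpha, df = 1), df = 1, ncp = mu^2 / rho)."""
    crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0) ** 2   # qchisq(1 - alpha, df = 1)
    return ncchisq1_sf(crit, mu * mu / rho)

power_rho_07 = tq_limit_power(2.35, 0.7)
power_rho_03 = tq_limit_power(2.35, 0.3)
print(f"{power_rho_07:.3f} {power_rho_03:.2f}")  # 0.802 0.99
```

Both values agree with the limits quoted in the text (0.802 for ρ = 0.7 and 0.99 for ρ = 0.3).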

Table 1. Power comparison of TQ, DOT, and ACAT, assuming very similar effect sizes in magnitude and equicorrelation LD structure with ρ = 0.7.

L (number of SNPs)  TQ Empiric.  TQ Theor.  TQ Approx.  DOT Empiric.  DOT Theor.  ACAT   $\bar{\gamma}$
500                 0.802        0.802      0.802       0.090         0.090       0.832  0.02
300                 0.801        0.801      0.801       0.101         0.100       0.830  0.03
200                 0.801        0.801      0.801       0.112         0.112       0.829  0.04
100                 0.799        0.800      0.800       0.144         0.145       0.826  0.08
50                  0.798        0.799      0.799       0.196         0.197       0.821  0.16
30                  0.795        0.796      0.796       0.253         0.252       0.814  0.26
20                  0.794        0.793      0.794       0.307         0.306       0.809  0.39
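The average noncentrality column of Table 1 and the cancellation argument given for the Helmert eigenvectors can be explored with a short Python sketch. It makes two assumptions that are not stated in this form in the text: the mean values are placed on an evenly spaced grid between 2.3 and 2.4, and the average noncentrality is the total DOT noncentrality (squared projections of the mean vector onto the eigenvectors, each divided by its eigenvalue) divided by L. Function names are hypothetical.

```python
import math

def helmert_basis(n):
    """Orthonormal eigenvectors of an n x n equicorrelation matrix (Helmert form)."""
    basis = [[1.0 / math.sqrt(n)] * n]           # first vector: scaled mean
    for k in range(1, n):
        scale = math.sqrt(k * (k + 1))
        # k equal entries, one balancing entry, then zeros: entries sum to zero
        basis.append([1.0 / scale] * k + [-k / scale] + [0.0] * (n - k - 1))
    return basis

def mean_noncentrality(mu, rho):
    """Total DOT noncentrality under equicorrelation rho, divided by L."""
    n = len(mu)
    eigenvalues = [1.0 + (n - 1) * rho] + [1.0 - rho] * (n - 1)
    total = 0.0
    for vec, lam in zip(helmert_basis(n), eigenvalues):
        proj = sum(v * m for v, m in zip(vec, mu))
        total += proj * proj / lam               # contrasts contribute little
    return total / n                             # when the means are similar

def even_grid(lo, hi, n):
    return [lo + (hi - lo) * i / (n - 1) for i in range(n)]

gamma_bar = {n: mean_noncentrality(even_grid(2.3, 2.4, n), 0.7)
             for n in (20, 100, 500)}
contrast_sums = [sum(v) for v in helmert_basis(5)[1:]]
print({n: round(g, 2) for n, g in gamma_bar.items()})  # {20: 0.39, 100: 0.08, 500: 0.02}
```

Under these assumptions the sketch reproduces the tabulated values for L = 20, 100, and 500, and almost all of the noncentrality comes from the first (mean) eigenvector, whose large eigenvalue 1 + (L - 1)ρ dilutes it as L grows.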

Table 2. Power comparison of TQ, DOT, and ACAT, assuming very similar effect sizes but heterogeneous LD structure.

L (number of SNPs)  TQ Empiric.  TQ Theor.  TQ Approx.  DOT Empiric.  DOT Theor.  ACAT   $\bar{\gamma}$
500                 0.729        0.730      0.726       0.973         0.973       0.793  0.251
300                 0.731        0.730      0.726       0.883         0.883       0.791  0.256
200                 0.731        0.730      0.726       0.810         0.811       0.789  0.281
100                 0.730        0.731      0.726       0.599         0.599       0.786  0.295
50                  0.732        0.733      0.728       0.577         0.576       0.782  0.418
30                  0.736        0.735      0.729       0.504         0.502       0.778  0.488
20                  0.737        0.737      0.731       0.541         0.540       0.776  0.661

Table 3. Power comparison of TQ, DOT, and ACAT, assuming heterogeneity in effect sizes but equicorrelated LD.

L (number of SNPs)  TQ Empiric.  TQ Theor.  TQ P-approx.  DOT Empiric.  DOT Theor.  ACAT   $\bar{\gamma}$
500                 0.525        0.525      0.526         1.000         1.000       0.626  0.479
300                 0.526        0.525      0.526         1.000         0.999       0.624  0.486
200                 0.526        0.525      0.524         0.993         0.993       0.622  0.494
100                 0.525        0.524      0.524         0.919         0.920       0.616  0.518
50                  0.522        0.523      0.522         0.762         0.762       0.607  0.566
30                  0.521        0.521      0.521         0.648         0.648       0.599  0.630
20                  0.519        0.519      0.520         0.578         0.579       0.592  0.709

Table 4. Power comparison of TQ, DOT, and ACAT with effect sizes randomly sampled from -0.15 to 0.15 and heterogeneous LD.

L (number of SNPs)  TQ Empiric.  TQ Theor.  TQ P-approx.  DOT Empiric.  DOT Theor.  ACAT    $\bar{\gamma}$
500                 0.0500       0.0503     0.0508        0.9226        0.9222      0.0564  0.2118
300                 0.0506       0.0503     0.0509        0.7688        0.7689      0.0570  0.2107
200                 0.0504       0.0503     0.0508        0.5970        0.5967      0.0570  0.2025
100                 0.0504       0.0503     0.0509        0.3040        0.3038      0.0568  0.1655
50                  0.0502       0.0503     0.0508        0.3074        0.3070      0.0555  0.2397
30                  0.0505       0.0503     0.0507        0.1485        0.1487      0.0562  0.1527
20                  0.0501       0.0503     0.0508        0.1191        0.1189      0.0557  0.1399

Table 5. Power comparison of TQ, DOT, and ACAT using realistic LD patterns from 1000 Genomes project.

Setting (α-level)             TQ Theor.  TQ Approx.  TQ Regr.  TQ MVN  DOT Theor.  DOT Regr.  DOT MVN  ACAT
Setting 5 (α = $10^{-3}$)     0.34       0.34        0.34      0.34    0.60        0.60       0.60     0.40
Setting 6 (α = $10^{-4}$)     0.42       0.42        0.42      0.43    0.77        0.77       0.77     0.43
Setting 7 (α = $10^{-7}$)     0.24       0.24        0.24      0.24    0.76        0.76       0.76     0.18

Patterns of LD and effect sizes in Settings 1–4 are not necessarily biologically realistic; however, they serve as benchmark scenarios that help to understand and highlight differences in the respective statistical power of the methods. Simulations for Settings 1–4 were performed at the 5% α-level based on 2 × $10^6$ evaluations. Settings 5–7 used realistic patterns of LD derived from the 1000 Genomes Project data. Test sizes varied from 0.001 to $10^{-7}$, with at least 10,000 simulations for power estimates. Type-I error rates were well controlled for TQ and DOT. However, as noted by Liu et al., the ACAT P-value is approximate because the null distribution of its statistic is evaluated under independence, and we found that at the nominal 5% α-level, the type-I error for ACAT was somewhat higher and could reach 7% for some correlation settings. Nonetheless, the advantage of ACAT is that the approximation improves as the α-level becomes smaller.

Simulations assuming that the correlation matrix is estimated using external data

When only summary statistics are available, the correlation matrix Σ can be estimated from a reference panel of genotyped individuals. However, the type-I error of tests based on both TQ and DOT may potentially be affected by replacing the sample estimate $\hat{\Sigma}$ with an estimate obtained from external data. To study the effect of this mis-specification on the type-I error, we conducted a separate set of simulations. In these experiments, we again utilized LD structures derived from the 1000 Genomes Project data. Reference panels for these simulations were obtained as follows. Each LD matrix derived from real data was assumed to represent the population matrix. Next, a sample was drawn, and the corresponding sample LD matrix was calculated. That matrix is the one that should have been used for calculating the gene-based test statistics. Instead, we drew a separate sample of size N, assuming the same population LD matrix. In the calculation of the tests, that sample correlation matrix was used in place of the correct one. The type-I error rates, given in Tables 6–8, show that both ACAT and TQ have close to the nominal type-I error rates, but the error rate for the decorrelation method (DOT) can be inflated, unless the sample size of the reference panel is 50 to 100 times larger than the number of SNPs (L). For the statistic DOT, the type-I error rates appear to be more inflated at smaller α-levels, such as $10^{-7}$. Power values for TQ are not shown; however, they closely followed the predicted theoretical power for the scenarios where the same data are used for both LD estimation and computation of association statistics. There was only a 1 to 2% drop in power when the size of the panel was only 2 to 5 times larger than L.

Table 6. Type-I error rates (α = $10^{-3}$) using a reference panel to estimate LD.

Population LD patterns are modeled using 1000 Genomes project data.

Sample size  TQ                 DOT                ACAT
N = 5L       1 × $10^{-3}$      3 × $10^{-3}$      1 × $10^{-3}$
N = 10L      1 × $10^{-3}$      3 × $10^{-3}$      1 × $10^{-3}$
N = 50L      1 × $10^{-3}$      2 × $10^{-3}$      1 × $10^{-3}$
N = 100L     1 × $10^{-3}$      1 × $10^{-4}$      1 × $10^{-3}$

Table 7. Type-I error rates (α = $10^{-4}$) using a reference panel to estimate LD.

Population LD patterns are modeled using 1000 Genomes project data.

Sample size  TQ                 DOT                ACAT
N = 5L       9 × $10^{-5}$      5 × $10^{-4}$      1 × $10^{-4}$
N = 10L      9 × $10^{-5}$      4 × $10^{-4}$      1 × $10^{-4}$
N = 50L      1 × $10^{-4}$      1 × $10^{-4}$      1 × $10^{-4}$
N = 100L     1 × $10^{-4}$      1 × $10^{-4}$      1 × $10^{-4}$

Table 8. Type-I error rates (α = $10^{-7}$) using a reference panel to estimate LD.

Population LD patterns are modeled using 1000 Genomes project data.

Sample size  TQ                 DOT                ACAT
N = 5L       2 × $10^{-7}$      3 × $10^{-4}$      1 × $10^{-7}$
N = 10L      2 × $10^{-7}$      2 × $10^{-4}$      1 × $10^{-7}$
N = 50L      2 × $10^{-7}$      2 × $10^{-4}$      1 × $10^{-7}$
N = 100L     2 × $10^{-7}$      1 × $10^{-4}$      1 × $10^{-7}$

Combining breast cancer association statistics within candidate genes

We applied our decorrelation method to a family-based GWAS study of breast cancer [27, 28]. The data set was comprised of complete trios, i.e., families where genotypes of both parents and the affected offspring were available. With complete trios, previously reported statistics become equivalent to statistics from the transmission-disequilibrium test and correlation among them is expected to follow the LD among SNPs [8]. We selected four candidate genes (TOX3, ESR1, FGFR2 and RAD51B), for which Shi et al. [27] and O’Brien et al. [28] replicated several previously reported risk SNPs in relation to breast cancer.

For the joint association, we restricted our analysis to blocks of SNPs surrounding breast cancer risk variants that were previously reported in the literature. Specifically, we selected TOX3 rs4784220 [29], ESR1 rs3020314 [30, 31], FGFR2 rs2981579 [29], and RAD51B rs999737 [32–34], and then included blocks of SNPs around these ‘anchor’ risk variants with an LD correlation of at least 0.25. These blocks included 13 SNPs around rs4784220, 36 SNPs around rs3020314, 18 SNPs around rs2981579, and 30 SNPs around rs999737. As an illustration, Fig 1 displays the 81 SNP P-values that were available for the ESR1 gene. The vertical dashed line marks the position of the ‘anchor’ rs3020314, the red dots highlight the 36 SNPs within the LD block of rs3020314, and the LD matrix displays the sample correlation matrix among those 36 SNPs. Once SNP blocks were identified for each gene, we applied four combination methods to assess their association with breast cancer.

Fig 1. Overview of DOT method in application to breast cancer data.

We compute a gene-level score by first decorrelating SNP P-values using the order-invariant matrix H and then calculating the sum of independent chi-squared statistics. We use our DOT method to obtain a gene-level P-value. In the breast cancer data application, we chose an anchor SNP, that is, a SNP that has previously been reported as a risk variant (highlighted by a vertical dashed line), and then combined SNPs in an LD block with the anchor SNP by DOT. SNP-level P-values highlighted in red are those in moderate to high LD with the anchor SNP.

Table 9 presents the joint association analysis results. The first row of Table 9 shows P-values for the association between the LD block of 13 SNPs in the TOX3 region and breast cancer, derived from 1277 Caucasian triads. All methods conclude a statistically significant link, and our decorrelation method yields the smallest P-value (0.0004, compared with 0.0005 for TQ). The third row of Table 9 shows joint association P-values for the LD block of 18 SNPs in FGFR2. Three out of four methods conclude an association at the 5% level, with the DOT approach, once again, providing the most significant result. We note that the last column of Table 9 gives the Bonferroni-style adjustment, which is expected to be more conservative than the combination tests. Thus, it is not surprising that, of the four methods considered, only the Bonferroni method failed to conclude an association. Lastly, the second and the fourth rows of Table 9 provide joint association P-values for the LD blocks in ESR1 and RAD51B, respectively. For both ESR1 and RAD51B, our decorrelation approach was the only one that concluded a statistically significant association between the SNP set in those genes and breast cancer.

Table 9. Breast cancer candidate gene association P-values.

Gene/anchor SNP (block size L) | TQ | DOT | ACAT | min(P) × L
TOX3/rs4784220 [29] (L = 13) | 0.0005 | 0.0004 | 0.001 | 0.001
ESR1/rs3020314 [30, 31] (L = 36) | 0.20 | 0.0001 | 0.19 | 0.96
FGFR2/rs2981579 [29] (L = 18) | 0.01 | 0.003 | 0.01 | 0.07
RAD51B/rs999737 [32–34] (L = 30) | 0.56 | 0.009 | 0.76 | 1

Table 10 lists the top SNPs that are associated with breast cancer within the selected candidate genes. The top-ranked SNPs were identified by considering the three largest components of the sum $\mathrm{DOT} = \sum_{i=1}^{L} X_i^2$, where the $X_i$'s are the decorrelated summary statistics. Once the three highest values of $X_i^2$ were identified for each gene, we considered the individual components of $X_i = \sum_{j=1}^{L} h_j Z_j$, which is a linear combination of the original statistics weighted by the elements $h_j$ of the corresponding row of matrix H. The largest individual components $h_j Z_j$ (with the same sign as $X_i$) corresponded to the individual SNPs presented in Table 10.
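The ranking procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `top_contributors`, its parameters, and the toy matrix `H` and statistics `Z` are all hypothetical.

```python
def top_contributors(H, Z, n_components=3, n_snps=2):
    """Rank decorrelated components X_i = sum_j H[i][j] * Z[j] by X_i^2 and,
    for each top component, list the SNP indices j whose terms H[i][j] * Z[j]
    are largest among the terms that share the sign of X_i."""
    L = len(Z)
    X = [sum(H[i][j] * Z[j] for j in range(L)) for i in range(L)]
    order = sorted(range(L), key=lambda i: X[i] ** 2, reverse=True)
    result = []
    for i in order[:n_components]:
        terms = [(j, H[i][j] * Z[j]) for j in range(L)]
        same_sign = [(j, t) for j, t in terms if t * X[i] > 0]
        same_sign.sort(key=lambda jt: abs(jt[1]), reverse=True)
        result.append((i, X[i], [j for j, _ in same_sign[:n_snps]]))
    return result

# Toy input (made up): a symmetric 3 x 3 matrix H and three Z-scores.
H = [[1.2, -0.4, 0.0],
     [-0.4, 1.3, -0.3],
     [0.0, -0.3, 1.1]]
Z = [3.0, 0.5, -1.0]
ranking = top_contributors(H, Z, n_components=2)
```

Each entry of `ranking` holds the index of a decorrelated component, its value, and the indices of the SNPs that contribute most to it in the same direction.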

Table 10. Breast cancer SNPs identified by DOT in the analysis of GWAS data.

Gene | Number of SNPs in analysis (L) | rs number | Reference
TOX3 | 13 | rs4784220 | This SNP was previously reported in the literature to be associated with breast cancer [29, 35].
TOX3 | 13 | rs8046979 | This SNP was also linked to breast cancer [29].
TOX3 | 13 | rs43143 | A new association with susceptibility to breast cancer.
ESR1 | 36 | rs2347867 | This SNP was previously reported to be involved in breast cancer risk [36, 37].
ESR1 | 36 | rs985191 | This SNP was previously reported to be associated with endocrine therapy efficacy in breast cancer [38], as well as with the overall breast cancer risk [39].
ESR1 | 36 | rs3003921 | A new association with susceptibility to breast cancer. This SNP was previously linked to the effectiveness of androgen deprivation therapy among prostate cancer patients [40].
ESR1 | 36 | rs985695 | A new association with susceptibility to breast cancer.
ESR1 | 36 | rs2982689 | A new association with susceptibility to breast cancer.
ESR1 | 36 | rs3020424 | A new association with susceptibility to breast cancer.
ESR1 | 36 | rs926777 | A new association with susceptibility to breast cancer.
FGFR2 | 18 | rs1219648 | This SNP was previously reported to be associated with premenopausal breast cancer [41] and the overall breast cancer risk [42–45].
FGFR2 | 18 | rs2860197 | This SNP was previously suggested to have an association with breast cancer [46].
FGFR2 | 18 | rs2981582 | This SNP was previously reported in the literature to be associated with breast cancer [43, 47–49].
FGFR2 | 18 | rs3135730 | This SNP was previously suggested to have an interaction between oral contraceptive use and breast cancer [50].
FGFR2 | 18 | rs2981427 | A new association with susceptibility to breast cancer.
RAD51B | 30 | rs999737 | This SNP was previously reported in the literature to be associated with breast cancer [32–34, 51, 52].
RAD51B | 30 | rs8016149 | This SNP was previously suggested to have an association with breast cancer [53].
RAD51B | 30 | rs1023529 | This SNP has been patented as one of the susceptibility variants of breast cancer [54].
RAD51B | 30 | rs2189517 | This SNP was shown to be associated with breast cancer in a Chinese population [55].
RAD51B | 30 | rs7359088 | A new association with susceptibility to breast cancer.

For the LD block in the TOX3 gene, each of the top three $X_i$'s in the DOT statistic was formed by assigning a very large weight to a single SNP: the largest value, $X_{(1)}^2$, was formed by assigning a large weight to the rs4784220 statistic; the second largest value, $X_{(2)}^2$, to the rs8046979 statistic; and the third largest value, $X_{(3)}^2$, to the rs43143 statistic. The first few rows of Table 10 detail these results and identify rs43143 as a new possible association with breast cancer.

For the LD block in the ESR1 gene, the top $X_i$'s were quite different. Specifically, the largest value, $X_{(1)}$, was formed as a linear combination of 6 SNPs that were all assigned large weights: rs2982689, rs3020424, rs985695, rs2347867, rs3003921 and rs985191. The second highest linear combination, $X_{(2)}$, was formed by assigning high weights to 5 of these 6 SNPs: rs2982689, rs3020424, rs985695, rs2347867 and rs3003921. We note that $X_{(1)}$ and $X_{(2)}$ had opposite signs, which is why the same set of SNPs could be prioritized in both. Finally, the third largest value, $X_{(3)}$, also prioritized the same set of SNPs, with the single addition of rs926777. Table 10 provides details on these SNPs and identifies rs3003921, rs985695, rs2982689, rs3020424 and rs926777 as new possible associations with breast cancer.

Finally, for the LD blocks in FGFR2 and RAD51B, we repeated the procedure detailed above to identify the top-ranking SNPs. Table 10 reviews these results and points to FGFR2 rs2981427 and RAD51B rs7359088 as two additional newly found associations.

Combining cleft lip association statistics within candidate genes

To further validate the utility of DOT, we applied it to summary statistics of a recent GWAS of cleft lip with and without cleft palate [56]. Summary statistics were based on the transmission-disequilibrium test on autosomal SNPs in 1908 case-parent trios of European and Asian ancestry. We selected four genetic regions (ABCA4, chr. 8q24, IRF6, and MAFB) that were prioritized by Beaty et al. [56] for gene-based analysis. Anchor SNPs were chosen based on significant risk markers previously reported in the literature. Specifically, rs560426 was chosen as an anchor for the ABCA4 region [57] and formed an anchor block of L = 30 SNPs; rs987525 for chr. 8q24 [58] with L = 29 SNPs in a block; rs10863790 for IRF6 [59] with L = 6 SNPs in a block; and rs13041247 for MAFB [60] with L = 14 SNPs in a block. Table 11 provides a summary of gene-based P-values and indicates that all four combination methods concluded significant associations. Results in Table 11 can also be viewed as a gauge of the relative power of the four combination methods. As such, Table 11 indicates that DOT may result in smaller P-values than those of competitors: it gave the smallest P-value in three of the four regions, the exception being MAFB.

Table 11. Cleft lip candidate gene association P-values.

Gene/anchor SNP (block size L) | TQ | DOT | ACAT | min(P) × L
ABCA4/rs560426 [57] (L = 30) | 8.9 × 10^-8 | 1.3 × 10^-13 | 7.2 × 10^-11 | 7.2 × 10^-11
chr. 8q24/rs987525 [58] (L = 29) | 1.0 × 10^-9 | 8.7 × 10^-22 | 4.7 × 10^-15 | 3.2 × 10^-15
IRF6/rs10863790 [59] (L = 6) | 4.7 × 10^-9 | 1.8 × 10^-19 | 2.1 × 10^-14 | 2.1 × 10^-14
MAFB/rs13041247 [60] (L = 14) | 1.5 × 10^-8 | 2.9 × 10^-8 | 2.4 × 10^-11 | 3.6 × 10^-11

Table 12 lists the top SNPs that were associated with non-syndromic cleft lip with or without cleft palate within the four genetic regions. For the LD block around rs560426 in the ABCA4 gene, $X_{(1)}^2$ was formed by assigning large weights to two SNPs (rs4847196 and rs563429), both of which were previously considered in association with cleft lip but were found to be not statistically significant [56]. The second highest DOT linear combination, $X_{(2)}^2$, prioritized the same two SNPs, thus reinforcing the idea that these two markers may be genuinely associated with cleft lip. The third highest linear combination, $X_{(3)}^2$, was formed by assigning high weights to rs2275035 and rs546550, the former of which was recently identified to be associated with orofacial clefting [61], while the latter may be a new association with cleft lip.

Table 12. Cleft SNPs identified by DOT in the analysis of GWAS data.

Gene | Number of SNPs in analysis (L) | rs number | Reference
ABCA4 | 30 | rs4847196 | This SNP was previously studied in connection to cleft lip [56] but the association was found to be not statistically significant.
ABCA4 | 30 | rs563429 | This SNP was also previously considered in association with cleft lip [56] but found to be not statistically significant.
ABCA4 | 30 | rs2275035 | Was recently identified to be associated with orofacial clefting [61].
ABCA4 | 30 | rs546550 | A new association with susceptibility to cleft lip. This SNP was previously suggested to be linked to esophageal cancer [62].
chr. 8q24 | 29 | rs987525 | One of the top results was the anchor SNP [58].
chr. 8q24 | 29 | rs882083 | Was previously suggested to be associated with cleft lip [56, 58].
chr. 8q24 | 29 | rs1157136 | Was previously suggested to be associated with cleft lip in a Brazilian population [63].
chr. 8q24 | 29 | rs12548036 | Was previously studied in connection to susceptibility to cleft lip in a Japanese population [64] but the association was found to be not statistically significant.
chr. 8q24 | 29 | rs1530300 | Was previously suggested to be associated with cleft lip in a Brazilian population [57] and a Brazilian population with high African ancestry [65].
chr. 8q24 | 29 | rs12547241 | A new association with susceptibility to cleft lip.
IRF6 | 6 | rs10863790 | One of the top contributions was the anchor SNP [59].
IRF6 | 6 | rs861020 | Was previously reported to be associated with cleft lip [59, 66, 67].
IRF6 | 6 | rs2236906 | Was considered to be associated with cleft lip in a Kenya African Cohort [68] and in the general population [69].
IRF6 | 6 | rs2073485 | Was reported to be associated with cleft lip in Western China [70] and in a Taiwanese population [71].
MAFB | 14 | rs11696257 | Was previously reported to be associated with cleft lip [56, 72].
MAFB | 14 | rs6102085 | Was previously reported to be associated with cleft lip in a Han Chinese population [73].
MAFB | 14 | rs6065259 | Was previously reported to be associated with cleft lip in a population in Heilongjiang Province, northern China [74].
MAFB | 14 | rs6102074 | Was previously reported to be associated with cleft lip in a Han Chinese population [73, 75].

For the LD block in the chr. 8q24 region, $X_{(1)}^2$ was formed by assigning a large weight to the anchor SNP (rs987525). $X_{(2)}^2$ prioritized two SNPs: rs882083, which was already suggested to be associated with cleft lip [56, 58], and rs12547241, which may be a new risk marker. Finally, $X_{(3)}^2$ prioritized a set of three SNPs (rs1157136, rs12548036 and rs1530300), all of which were previously studied in connection with cleft lip [57, 63–65]. For the last two LD blocks considered (the IRF6 and MAFB genes), Table 12 lists the top SNP contributors to the DOT statistic. In brief, all of the prioritized SNPs were previously reported in association with cleft lip.

Discussion

In this research, we have proposed a new, powerful decorrelation-based approach (DOT) for combining SNP-level summary statistics (or, equivalently, P-values) and derived its theoretical power properties. To the best of our knowledge, we are the first to derive analytical properties of the traditional approach, TQ (e.g., as implemented in VEGAS), as well as of DOT, with the help of new theory that incorporates effect sizes of SNPs into the mean values of association statistics and the correlations among them. Through extensive simulation studies, we have demonstrated that our decorrelation approach is a powerful addition to the tools available for studying genetic susceptibility to disease.

Our analysis of breast cancer and cleft lip data illustrates unique properties of DOT. Our results revealed novel potential associations within candidate genes that would not have been found by previously proposed methods. These novel SNPs were identified by examining the top three linear-combination contributors to the overall value of the DOT statistic. We note that the top contributions may give large weights to genetic variants that are truly associated with the outcome or to SNPs in high positive LD with true causal variants. Caution is needed when interpreting such results because our method cannot distinguish between causal and proxy associations. Further studies would be needed to confirm these findings.

The most important feature of the proposed method is that it may provide a substantial power boost across diverse settings, where the power gain is amplified by heterogeneity of effect sizes and by increased diversity between pairwise LD values. The genetic architecture of complex traits is far from homogeneous, making our method applicable in various settings. We have developed new theory to explain this unexpected and remarkable boost in power. This theory allows one to predict the behavior of the tests in simulations with high accuracy and to explain unexpected scenarios, where the decorrelation method may give dramatically higher power compared to the traditional approach. Yet, there are important precautions to the decorrelation approach. When reference panel data are used to provide the LD information and, more generally, the correlation estimates $\hat{\Sigma}$ for all predictors, including SNPs and covariates, the sample size of the external data should be several times larger than the number of predictors. Ideally, the same data set should be used to obtain the association statistics as well as $\hat{\Sigma}$. Nevertheless, association statistics and $\hat{\Sigma}$ are compact summaries of data and are much more easily transferred between separate research groups than raw data, due to privacy considerations and the potentially large size of the raw data sets. Also, caution is needed if missing data are present in the original data set, because the estimate $\hat{\Sigma}$ may no longer reflect the sample correlation between predictors. Imputation of missing values is a suitable solution if missing values are independent of the outcome. With the usage of reference panel data, the type-I error inflation for the DOT statistic can be affected by many factors, and this statistic is expected to be sensitive not only to the size of a reference panel, but also to population variations in LD, especially for highly correlated blocks of SNPs.
Overall, it appears to be difficult to give specific recommendations, except that the reference panel size has to be at least 50 times larger than the number of SNPs to be combined. Therefore, we recommend limiting applications of the decorrelation method to situations where the LD matrix is obtained from the same data set as the summary statistics. Note that all pairwise LD values can be obtained from sample haplotype frequencies of SNPs, thus the LD matrix can be reconstructed. The utility of this approach remains to be investigated; in particular, one concern is that the correlation between the SNP values reflects the composite disequilibrium values [76], while frequencies of sample haplotypes are often reported following likelihood maximization, e.g., by the EM algorithm. An important issue that remains to be investigated is a systematic analysis of the performance of our method utilizing real genome-wide data. Such an analysis would allow a more thorough assessment of both the type-I error rate and the power to detect genetic regions already implicated in susceptibility to disease.

In our simulations, the recently proposed method ACAT and the test based on the distribution of the sum of correlated association statistics (VEGAS, or TQ) had similar power. In many situations, power of these two tests was substantially lower than that of the DOT. The main advantage of ACAT is that it does not require any LD information. Our theory and simulations also revealed previously unknown robustness of the TQ method with respect to LD mis-specification: the method is valid and remains nearly as powerful when the sample LD matrix is substituted by a single value, summarizing the extent of all pairwise correlations. TQ also remains valid when the LD summary is obtained from a representative reference panel. We stress again that compared to ACAT and TQ, our method’s limitation is that in order to avoid possible bias, the LD information and the summary statistics should ideally come from the same data set and missing genotypes should be imputed prior to its application. In general, one should avoid utilization of external data as a source of LD information, as well as high rates of unimputed missing genotypes. Although not pursued here, a possible way to improve robustness of the DOT is to merge it with ACAT, that is, decorrelate the summary statistics first, convert the results to P-values and then combine them with ACAT.
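The combination of DOT with ACAT suggested above can be sketched as follows. This is a hedged illustration rather than a tested procedure: it assumes the equal-weight Cauchy combination form of ACAT (mean of $\tan\{(0.5 - p_i)\pi\}$ mapped back through the standard Cauchy tail), and the function names and the example vector of decorrelated statistics are hypothetical.

```python
import math

def acat_pvalue(pvals):
    """Equal-weight Cauchy combination (assumed form of ACAT):
    average tan((0.5 - p) * pi) and map it back through the Cauchy tail."""
    t = sum(math.tan((0.5 - p) * math.pi) for p in pvals) / len(pvals)
    return 0.5 - math.atan(t) / math.pi

def dot_then_acat(X):
    """Convert decorrelated statistics X (one per SNP) to two-sided P-values
    and combine those P-values with the Cauchy combination above."""
    pvals = [math.erfc(abs(x) / math.sqrt(2.0)) for x in X]
    return acat_pvalue(pvals)

# Hypothetical decorrelated statistics for a block of four SNPs.
X = [0.3, -2.9, 1.1, 0.4]
p_combined = dot_then_acat(X)
```

With a single P-value the combination returns that P-value unchanged, which is a quick sanity check on the formula.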

Materials and methods

Genetic association tests based on summary statistics are often presented as a weighted sum [2, 4]. Let $w_i$ denote the weight assigned to the $i$-th individual statistic. The weighted statistics can then be defined as $Y_i^2 = w_i Z_i^2$ with $Z \sim \mathrm{MVN}(\mu_Z, \Sigma_Z)$ and $Y \sim \mathrm{MVN}(\mu, \Sigma)$, where $\mu = W\mu_Z$, $\Sigma = W\Sigma_Z W$, and $W = \mathrm{diag}(\sqrt{w})$. The statistics $Y_i^2$ are marginally distributed as one degree of freedom chi-square variables with noncentralities $\mu_i^2$. The overall statistic is then typically defined as $TQ = \sum_{i=1}^{L} Y_i^2$.

Joint distribution of association summary statistics

In this section, we derive the parameters μ and Σ of the joint MVN distribution of summary statistics. Under the null hypothesis, when none of the SNPs are associated with an outcome, μ = 0. If individual SNP models do not include covariates, $\Sigma_Z$ equals the LD matrix, i.e., the correlation matrix between the SNP values coded as 0, 1, or 2, reflecting the number of minor alleles in a genotype. In the presence of covariates, $\Sigma_Z$ is a Schur complement of the submatrix of the matrix of all predictor variables [6]. That is, the estimated correlation between association statistics, $\hat{\Sigma}_Z$, can be obtained by inverting the covariance or correlation matrix of all predictors, selecting the SNP submatrix, inverting it back, and standardizing the result to correlation.
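The four steps in the last sentence (invert, select the SNP block, invert back, standardize) can be sketched in plain Python. This is a minimal illustration, assuming a small, well-conditioned correlation matrix; the helper names `invert` and `statistic_correlation` and the example matrix `C` are hypothetical.

```python
def invert(M):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    A = [list(map(float, row)) + [float(i == j) for j in range(n)]
         for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(n):
            if r != col:
                f = A[r][col]
                A[r] = [v - f * w for v, w in zip(A[r], A[col])]
    return [row[n:] for row in A]

def statistic_correlation(C, snp_idx):
    """Correlation among SNP statistics adjusted for the remaining predictors."""
    C_inv = invert(C)                                   # invert all predictors
    sub = [[C_inv[i][j] for j in snp_idx] for i in snp_idx]  # keep the SNP block
    S = invert(sub)                                     # Schur complement
    d = [S[k][k] ** 0.5 for k in range(len(S))]
    return [[S[i][j] / (d[i] * d[j]) for j in range(len(S))]
            for i in range(len(S))]                     # standardize to correlation

# Two SNPs (indices 0 and 1) and one covariate (index 2); values are made up.
C = [[1.0, 0.5, 0.3],
     [0.5, 1.0, 0.3],
     [0.3, 0.3, 1.0]]
Sigma_Z = statistic_correlation(C, [0, 1])
```

In this toy example the Schur complement of the covariate is [[0.91, 0.41], [0.41, 0.91]], so the adjusted correlation between the two statistics is 0.41/0.91 rather than the raw LD value of 0.5.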

Under the alternative hypothesis, when some SNPs are associated with a trait y, let βj be the regression coefficient for the j-th SNP. Then, a typical linear model that determines the trait value is defined as:

$$y = \beta_0 + \sum_{j=1}^{L} \beta_j \mathrm{SNP}_j + \epsilon, \tag{1}$$

where $\epsilon \sim N(0, 1)$. The mean values of the summary statistics (i.e., the noncentralities) can be expressed as:

$$\mu_j = \sqrt{N}\,\frac{\Sigma_j'\beta}{\sqrt{\beta'\Sigma\beta + 1}} = \sqrt{N}\,b_j, \tag{2}$$

where $\Sigma_j$ is the $j$-th column of $\Sigma$, $b_j = \mathrm{cor}(y, \mathrm{SNP}_j)$ and $N$ is the sample size. An intuitive explanation of Eq (2) can be gained by considering the case of independent predictors, i.e., $\Sigma = I_L$. If both the outcome and the set of predictors are standardized, then $\frac{\Sigma_j'\beta}{\sqrt{\beta'\Sigma\beta + 1}} = \frac{\beta_j}{\sqrt{\sum_{k=1}^{L}\beta_k^2 + 1}}$, which is a standardized regression coefficient. We note that Eq (2) is valid outside of the linear model settings. For example, consider a latent variable model, where the continuous unobserved (latent) variable $y_l$ is linear in predictors according to Eq (1), and the observed variable (disease status) is $y = 1$ whenever $y_l > l$ and $y = 0$ otherwise, where $l$ is some threshold. When such a binary outcome is analyzed by logistic regression, a good approximation to the noncentrality values will be:

$$\mu_j \approx \sqrt{N}\,(d \times b_j). \tag{3}$$

If the error terms $\epsilon$ are assumed to be normally distributed, the reduction in correlation due to dichotomization, given by the factor $d$, can be expressed as $d = \phi(l)/\sqrt{\Phi(l)(1 - \Phi(l))}$, where $\phi(\cdot)$ and $\Phi(\cdot)$ are the probability density and the cumulative distribution functions of the standard normal distribution [77].
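As a numerical illustration of Eqs (2) and (3), the sketch below computes noncentralities from effect sizes and an LD matrix, together with the dichotomization factor $d$. It assumes standardized SNPs (so that Σ is a correlation matrix) and unit error variance, as in Eq (1); the function names and the example values are hypothetical.

```python
import math

def noncentralities(beta, Sigma, N):
    """Eq (2): mu_j = sqrt(N) * Sigma_j' beta / sqrt(beta' Sigma beta + 1)."""
    L = len(beta)
    Sb = [sum(Sigma[j][k] * beta[k] for k in range(L)) for j in range(L)]
    var_y = sum(beta[j] * Sb[j] for j in range(L)) + 1.0
    return [math.sqrt(N) * s / math.sqrt(var_y) for s in Sb]

def dichotomization_factor(l):
    """d = phi(l) / sqrt(Phi(l) * (1 - Phi(l))) for a threshold l."""
    phi = math.exp(-0.5 * l * l) / math.sqrt(2.0 * math.pi)
    Phi = 0.5 * (1.0 + math.erf(l / math.sqrt(2.0)))
    return phi / math.sqrt(Phi * (1.0 - Phi))

# One causal SNP (beta = 0.1) and one proxy SNP in LD of 0.8 with it.
Sigma = [[1.0, 0.8], [0.8, 1.0]]
beta = [0.1, 0.0]
mu = noncentralities(beta, Sigma, N=1000)                    # continuous outcome
mu_binary = [dichotomization_factor(0.0) * m for m in mu]    # Eq (3), threshold l = 0
```

The proxy SNP has zero effect of its own, yet its noncentrality is 0.8 times that of the causal SNP, which is the LD between them.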

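Under association, surprisingly, the correlation matrix between statistics is no longer Σ. Let $\sigma_{ij}$ be the $i,j$-th element of Σ, and $\rho_{ij}$ be correlations between predictors and the outcome. By using the multivariate delta method, we derived the $i,j$-th element of the correlation matrix R as follows:

$$R_{ij} \approx \frac{\mu_i\mu_j(\mu_i^2 + \mu_j^2 - N) - 2(\mu_i^2 + \mu_j^2 - N)N\sigma_{ij} + \mu_i\mu_j N\sigma_{ij}^2}{2(\mu_i^2 - N)(\mu_j^2 - N)} \approx \rho_{ij} + \rho_{ij}^2\frac{\mu_i\mu_j}{2N} - \rho_{ij}\frac{\mu_i^2\mu_j^2}{N^2} - \frac{\mu_i\mu_j}{2N}, \tag{4}$$
$$= \rho_{ij} + \rho_{ij}^2\frac{b_i b_j}{2} - \rho_{ij}\,b_i^2 b_j^2 - \frac{b_i b_j}{2}. \tag{5}$$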

Details of the derivation of these equations are given in [78]. An alternative derivation of the asymptotic covariance that includes the first two terms of Eq (5) has been given by Reshef et al. [79], assuming Gaussian genotypes, an assumption justifiable provided that there is a lower bound for minor allele frequency relative to sample size. Note that when some of the SNP pairs (i, j) are associated, summary statistics may become correlated even if there is no LD between the SNPs, due to the last term in Eq (5), which is proportional to $-b_i b_j$. Eqs (2), (3), (4) and (5) allow one to study power properties of the methods based on sums of association statistics, as well as to design realistic simulation experiments, where summary statistics can be sampled directly from the MVN distribution under the alternative hypothesis. That is, given effect sizes and the correlation matrix among predictors, statistics can be immediately sampled from the MVN(μ, R) distribution. This approach avoids both the data-generating step and the subsequent computation of summary statistics from that data, leading to a substantial gain in computation time. In certain situations, the difference in speed can be dramatic. For example, it is not trivial to simulate discrete (genotype) data given a specific LD matrix. Current state-of-the-art methods tend to be slow, because they rely on ad hoc iterative techniques, such as generation of multiple random “proposal” data sets to fit the target correlation matrix [80].
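Direct sampling of summary statistics from MVN(μ, R) can be sketched with a Cholesky factor, as below. This is a minimal standard-library illustration and not the authors' simulation code; the function names, the seed, and the example values of `mu` and `R` are assumptions made for the example.

```python
import math
import random

def cholesky(R):
    """Lower-triangular C with C C' = R for a symmetric positive definite R."""
    n = len(R)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(C[i][k] * C[j][k] for k in range(j))
            if i == j:
                C[i][j] = math.sqrt(R[i][i] - s)
            else:
                C[i][j] = (R[i][j] - s) / C[j][j]
    return C

def sample_statistics(mu, R, rng):
    """Draw one vector of summary statistics from MVN(mu, R) as mu + C z."""
    C = cholesky(R)
    z = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + sum(C[i][k] * z[k] for k in range(i + 1))
            for i, m in enumerate(mu)]

rng = random.Random(1)
mu = [2.0, 1.0, 0.0]                 # hypothetical noncentralities
R = [[1.0, 0.6, 0.3],
     [0.6, 1.0, 0.5],
     [0.3, 0.5, 1.0]]                # hypothetical correlation among statistics
draws = [sample_statistics(mu, R, rng) for _ in range(20000)]
means = [sum(d[i] for d in draws) / len(draws) for i in range(3)]
```

No genotype data are generated at any point; each draw is already a vector of association statistics, which is where the gain in computation time comes from.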

Results of simulation experiments presented here were performed based on effect sizes specified via the linear model (Eq 1). However, we verified (not presented here) the validity of the proposed theory assuming logistic, probit, and Poisson regression models. We also note that Conneely et al. presented theoretical arguments supporting the validity of the MVN joint distribution of summary statistics under no association for a broad class of generalized regression models [6].

Distribution of sums of association summary statistics

As we noted at the beginning of the “Materials and Methods” section, weighted sums of summary statistics can be re-expressed as unweighted sums, where the mean and the correlation parameters are modified to absorb the weights. The statistic $\sum_{i=1}^{L} Y_i^2$ is distributed as a weighted sum of independent one degree of freedom non-central chi-square random variables. Although this result is standard, the components of this weighted sum depend on the joint distribution of association summary statistics under the alternative hypothesis, and this distribution has not been previously derived. In the previous section, we provided the components of μ and R that determine the weights and the noncentralities of the chi-squares. Therefore,

$$\Pr(Y'Y > t) = \Pr\left(\sum_{i=1}^{L} Y_i^2 > t\right) = \Pr\left(\sum_{i=1}^{L} \lambda_i \chi^2_{1,\gamma_i} > t\right), \tag{6}$$
$$\gamma = \left\{\mu' E \left(\tfrac{1}{\sqrt{\lambda}} I\right)\right\} \circ \left\{\mu' E \left(\tfrac{1}{\sqrt{\lambda}} I\right)\right\}, \tag{7}$$

where $\circ$ denotes the elementwise product, the weights, λ, are the eigenvalues of R, and γ is the vector of noncentrality parameters. The columns of the matrix E are orthogonalized and normalized eigenvectors of R. The P-value for the statistic $TQ = Y'Y$ is obtained by setting μ to zero and then calculating this tail probability at the observed value $TQ = t$. Note that the elements in R, and therefore the eigenvectors, the eigenvalues $\lambda_i$, and the noncentralities, explicitly depend on the β-coefficients through Eqs (2) and (5).
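The null tail probability in Eq (6) can be illustrated for two SNPs, where the eigenvalues of R are simply $1 + \rho$ and $1 - \rho$. The sketch below uses plain Monte Carlo as a stand-in for an exact algorithm for weighted sums of chi-squares; the function name, the seed, the number of draws, and the observed value of TQ are all hypothetical.

```python
import random

def tq_pvalue_mc(t, eigenvalues, n_draws, rng):
    """Monte Carlo estimate of Pr(sum_i lambda_i * chi2_1 > t) with mu = 0."""
    hits = 0
    for _ in range(n_draws):
        total = sum(lam * rng.gauss(0.0, 1.0) ** 2 for lam in eigenvalues)
        if total > t:
            hits += 1
    return hits / n_draws

rho = 0.6
eigenvalues = [1.0 + rho, 1.0 - rho]   # eigenvalues of [[1, rho], [rho, 1]]
tq_observed = 9.0                      # hypothetical observed TQ = Y'Y
p_tq = tq_pvalue_mc(tq_observed, eigenvalues, 50000, random.Random(2020))
```

When ρ = 0 both weights equal one and the estimate should agree with the two degrees of freedom chi-square tail, $e^{-t/2}$, up to Monte Carlo error.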

Our decorrelation approach uses a symmetric orthogonal transformation of the vector of statistics Y to a new vector X, with the new joint statistic based on the sum of squared elements of X, $\mathrm{DOT} = \sum_{i=1}^{L} X_i^2$. The orthogonal transformation is defined as follows. Let $D = \frac{1}{\sqrt{\lambda}} I$, i.e., the diagonal matrix with entries $1/\sqrt{\lambda_i}$, and define $X = HY$, where $H = EDE'$. The squared values, $X_i^2$, are independent one degree of freedom chi-square variables, thus $\mathrm{DOT} = X'X$ is a chi-square random variable with $L$ degrees of freedom and noncentrality value of:

$$\gamma_c = \sum_{i=1}^{L} \gamma_i = \mu' R^{-1} \mu = (H\mu)'(H\mu). \tag{8}$$

The tail probability of the new test statistic is thus,

$$\Pr(X'X > t) = \Pr(\chi^2_{L,\gamma_c} > t). \tag{9}$$
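The construction of H and of the DOT P-value can be sketched in standard-library Python as follows. This is an illustrative sketch under stated assumptions, not the released software: the eigen-decomposition uses a textbook cyclic Jacobi iteration chosen here for self-containment, the chi-square tail uses a recurrence valid for integer degrees of freedom, and all function names and example values are hypothetical.

```python
import math

def jacobi_eigh(R, sweeps=100, tol=1e-22):
    """Eigenvalues and eigenvectors (columns of E) of a symmetric matrix."""
    n = len(R)
    A = [row[:] for row in R]
    E = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = sum(A[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if A[p][q] == 0.0:
                    continue
                diff = A[q][q] - A[p][p]
                theta = math.pi / 4 if diff == 0.0 else 0.5 * math.atan(2.0 * A[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for M in (A, E):            # rotate columns p and q of A and E
                    for k in range(n):
                        mp, mq = M[k][p], M[k][q]
                        M[k][p], M[k][q] = c * mp - s * mq, s * mp + c * mq
                for k in range(n):          # rotate rows p and q of A
                    ap, aq = A[p][k], A[q][k]
                    A[p][k], A[q][k] = c * ap - s * aq, s * ap + c * aq
    return [A[i][i] for i in range(n)], E

def dot_statistic(Y, R):
    """Return (DOT, X, H) with H = E diag(1/sqrt(lambda)) E' and X = H Y."""
    lam, E = jacobi_eigh(R)
    n = len(Y)
    H = [[sum(E[i][k] * E[j][k] / math.sqrt(lam[k]) for k in range(n))
          for j in range(n)] for i in range(n)]
    X = [sum(H[i][j] * Y[j] for j in range(n)) for i in range(n)]
    return sum(x * x for x in X), X, H

def chi2_sf(x, df):
    """Upper tail of a central chi-square with integer df (recurrence in df)."""
    a = 0.5 if df % 2 else 1.0
    q = math.erfc(math.sqrt(x / 2.0)) if df % 2 else math.exp(-x / 2.0)
    while 2 * a < df:
        q += (x / 2.0) ** a * math.exp(-x / 2.0) / math.gamma(a + 1.0)
        a += 1.0
    return q

# Hypothetical statistics for three SNPs with pairwise LD of 0.5.
R = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]
Y = [1.0, 2.0, 3.0]
dot, X, H = dot_statistic(Y, R)
p_value = chi2_sf(dot, len(Y))      # Eq (9) with gamma_c = 0 (null hypothesis)
```

Because $H'H = R^{-1}$, the value of `dot` coincides with the quadratic form $Y'R^{-1}Y$; what H adds is the vector `X` of individual decorrelated components.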

There are many ways to choose an orthogonal transformation, but a valid one for our purposes needs to have the following “invariance to order” property. Suppose we sample an equicorrelated MVN vector Y with a common correlation ρ for all pairs of variables. Before decorrelating the vector, we permute its values to a different order. A permutation in this example is a legitimate operation, because an equicorrelation structure does not suggest a particular order of Y values. After an orthogonal transformation of Y to X, the order of X entries may change due to the permutation, but their values should remain the same. Moreover, for the method to be useful in practice, we need the invariance to hold for a more general class of statistics than a simple sum of chi-squares, $\sum_{i=1}^{L} X_i^2$. For example, the Rank Truncated Product (RTP) is a powerful P-value combination method [12] that emphasizes small P-values: the RTP statistic $T_{\mathrm{RTP}}$ is the product of the $k$ smallest P-values, $k < L$, or equivalently, on the negative log scale, $T_{\mathrm{RTP}} = \sum_{i=1}^{k} [-\ln(P_i)]$, where $P_1 \le P_2 \le \cdots \le P_k$. Note that $-\ln(P_i)$ is no longer a one degree of freedom chi-square variable. Since DOT produces a set of independent one degree of freedom chi-squares, to use it with RTP, one can convert the set of chi-squares to P-values and take the product of the $k$ smallest values, which is the RTP statistic.
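The conversion from decorrelated statistics to the RTP statistic can be sketched as below. Only the statistic itself is computed; its significance would still have to be evaluated against the RTP null distribution [12], which is not implemented here. The function name and the example values are hypothetical.

```python
import math

def rtp_statistic(X, k):
    """Sum of -ln(P) over the k smallest two-sided P-values obtained from
    the decorrelated statistics X (each X_i^2 is a 1-df chi-square)."""
    pvals = sorted(math.erfc(abs(x) / math.sqrt(2.0)) for x in X)
    return sum(-math.log(p) for p in pvals[:k]), pvals

# Hypothetical decorrelated statistics for three SNPs.
X = [1.959964, 0.0, 1.0]
t_rtp, pvals = rtp_statistic(X, k=2)
```

Here the smallest P-value is 0.05 (from 1.959964) and the statistic keeps only the two smallest P-values, discarding the uninformative value of 1.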

The “invariance to order” requirement implies that the value of the DOT statistic should not change due to a permutation of (equicorrelated) values in Y. Not all orthogonal transformations meet the invariance to order criterion. It can be easily verified that neither the inverse Cholesky factor ($C^{-1}$) transformation, $X = C^{-1}Y$, nor another commonly used transformation, $X = \left(\frac{1}{\sqrt{\lambda}} I\right) E' Y$, has the invariance to order property, except in the special case of the sum of $L$ chi-squared variables, $\sum_{i=1}^{L} X_i^2$. To clarify, we call this statistic “the special case” because, for example, in the case of RTP with $k = L$, the statistic $\sum_{i=1}^{L} [-\ln(P_i)]$ is no longer the sum of one degree of freedom chi-squares. Moreover, some transformations of equicorrelated data to independence, such as the Helmert transformation, may change values of X depending on the order of values in Y, even in the special equicorrelation case of ρ = 0 (i.e., when variables in Y are independent). The proposed H, as defined above, has the invariance to order property and can also be used with P-value transformations other than that to the one degree of freedom chi-square.
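The difference between H and the inverse Cholesky factor can be checked numerically in the equicorrelation case. The sketch below relies on a closed form that holds for an equicorrelation matrix, $H = aI + \frac{b - a}{L}\mathbf{1}\mathbf{1}'$ with $a = 1/\sqrt{1-\rho}$ and $b = 1/\sqrt{1+(L-1)\rho}$; the function names and the example vector are hypothetical.

```python
import math

def equicorr_H(L, rho):
    """R^(-1/2) of an equicorrelation matrix, i.e. H = E D E' in closed form."""
    a = 1.0 / math.sqrt(1.0 - rho)
    b = 1.0 / math.sqrt(1.0 + (L - 1) * rho)
    return [[(a if i == j else 0.0) + (b - a) / L for j in range(L)]
            for i in range(L)]

def cholesky_whiten(Y, rho):
    """X = C^{-1} Y, with C the lower Cholesky factor of the equicorrelation matrix."""
    L = len(Y)
    R = [[1.0 if i == j else rho for j in range(L)] for i in range(L)]
    C = [[0.0] * L for _ in range(L)]
    X = []
    for i in range(L):
        for j in range(i + 1):
            s = sum(C[i][k] * C[j][k] for k in range(j))
            C[i][j] = math.sqrt(R[i][i] - s) if i == j else (R[i][j] - s) / C[j][j]
        # forward substitution for row i of C X = Y
        X.append((Y[i] - sum(C[i][k] * X[k] for k in range(i))) / C[i][i])
    return X

def apply(H, Y):
    return [sum(h * y for h, y in zip(row, Y)) for row in H]

rho = 0.5
Y = [1.0, 2.0, 3.0]
Y_perm = [3.0, 1.0, 2.0]             # same values in a different order
H = equicorr_H(3, rho)
X_dot, X_dot_perm = apply(H, Y), apply(H, Y_perm)
X_chol, X_chol_perm = cholesky_whiten(Y, rho), cholesky_whiten(Y_perm, rho)
```

With H, the permuted input yields the same set of decorrelated values in a different order. With the Cholesky factor, the individual values change, although their sum of squares is the same in both cases.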

Theoretical analysis of power

For exploration of power properties, it is useful to first consider the equicorrelation case, because in this case it is possible to derive illustrative equations that relate power to: (1) the number of SNPs, L; (2) the common correlation value for every pair of SNPs, ρ; and (3) the mean values of association statistics, μ. In the equicorrelation case, the correlation matrix can be expressed as $R_\rho = (1 - \rho)I + \rho\mathbf{1}\mathbf{1}'$. The eigenvalue vector of $R_\rho$ has length L but only two distinct values, λ = {1 + ρ(L − 1), 1 − ρ, …, 1 − ρ}.

For the decorrelated statistic DOT, we derived a simple form of the L noncentralities by utilizing the Helmert orthogonal eigenvectors [81, 82] as follows:

$$\delta_1 = \frac{\left(\sum_{i=1}^{L}\mu_i\right)^2}{L(1 + (L-1)\rho)} = \frac{L\bar{\mu}^2}{1 + (L-1)\rho}, \tag{10}$$
$$\delta_{j>1} = \frac{\sum_{i=1}^{j-1}(\mu_i - \mu_j)^2}{L(1 - \rho)}, \tag{11}$$

where $\bar{\mu}$ is the average of the values in μ. Next, let

$$\delta_s = \sum_{j=2}^{L}\delta_j = \frac{(L-1)\bar{d}}{2(1 - \rho)}, \tag{12}$$

where $\bar{d}$ is the average of $d_{ij} = (\mu_i - \mu_j)^2$ over all pairs of $\mu_i$ and $\mu_j$ such that $i < j$. The values $d_{ij}$ are the pairwise squared differences in the standardized effect values as captured by the vector μ. This representation yields the noncentrality of DOT as a function of the common correlation and the mean standardized effect size as:

$$\gamma_c = \frac{L\bar{\mu}^2}{1 + (L-1)\rho} + \delta_s. \tag{13}$$

Note that as L increases, the first term in Eq (13) approaches $\bar{\mu}^2/\rho$, while the sum of the remaining noncentralities, $\delta_s$, increases linearly with L, as long as the average of the squared effect size differences, $\bar{d}$, does not depend on L. Thus, the noncentrality of the decorrelated statistic DOT is expected to steadily increase with L and become approximately $\bar{\mu}^2/\rho + \frac{(L-1)\bar{d}}{2(1 - \rho)}$.
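Eq (13) can be checked numerically against the quadratic form $\mu' R^{-1} \mu$ of Eq (8), using the closed-form inverse of an equicorrelation matrix, $R_\rho^{-1} = \frac{1}{1-\rho}\left(I - \frac{\rho}{1+(L-1)\rho}\mathbf{1}\mathbf{1}'\right)$. The function names and example values below are hypothetical.

```python
def dot_noncentrality(mu, rho):
    """Eq (13): L * mean(mu)^2 / (1 + (L-1) rho) + (L-1) * dbar / (2 (1-rho))."""
    L = len(mu)
    mean = sum(mu) / L
    pairs = [(mu[i] - mu[j]) ** 2 for i in range(L) for j in range(i + 1, L)]
    dbar = sum(pairs) / len(pairs)
    delta_1 = L * mean ** 2 / (1.0 + (L - 1) * rho)
    delta_s = (L - 1) * dbar / (2.0 * (1.0 - rho))
    return delta_1 + delta_s

def quadratic_form(mu, rho):
    """mu' R^{-1} mu for the equicorrelation matrix, via its closed-form inverse."""
    L = len(mu)
    s = sum(mu)
    return (sum(m * m for m in mu) - rho * s * s / (1.0 + (L - 1) * rho)) / (1.0 - rho)

rho = 0.5
heterogeneous = [1.0, 2.0, 3.0]      # unequal effects: delta_s > 0
homogeneous = [2.0, 2.0, 2.0]        # equal effects: delta_s = 0
gamma_het = dot_noncentrality(heterogeneous, rho)
gamma_hom = dot_noncentrality(homogeneous, rho)
```

Both vectors have the same mean, so the first term of Eq (13) is identical; the entire difference between `gamma_het` and `gamma_hom` comes from $\delta_s$, the heterogeneity term.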

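Next, we consider the distribution of the statistic $TQ = Y'Y$. Note that $\sum_{i=1}^{L}\delta_i = \sum_{i=1}^{L}\gamma_i$, where the $\gamma_i$'s are the noncentralities for TQ and the $\delta_i$'s are the noncentralities of DOT. In the equicorrelation case, the distribution of TQ reduces to the weighted sum of two chi-square variables, because there are only two distinct eigenvalues that correspond to $R_\rho$, namely:

$$\Pr(Y'Y > t) = \Pr\left\{(1 + (L-1)\rho)\,\chi^2_{1,\gamma_1} + (1 - \rho)\,\chi^2_{L-1,\,\gamma_c - \gamma_1} > t\right\} \tag{14}$$
$$= \Pr\left\{\chi^2_{1,\gamma_1} + \frac{1 - \rho}{1 + (L-1)\rho}\,\chi^2_{L-1,\,\gamma_c - \gamma_1} > \frac{t}{1 + (L-1)\rho}\right\}. \tag{15}$$

The term $\frac{1 - \rho}{1 + (L-1)\rho}\,\chi^2_{L-1,\,\gamma_c - \gamma_1}$ in Eq (15) approaches the constant $\frac{\bar{d}(1 - \rho)}{2\rho^2} + \frac{1 - \rho}{\rho}$ as L increases. Therefore, under the null hypothesis, the distribution of the quadratic form $Y'Y$ can be well approximated by the location-scale transformation of the one degree of freedom chi-squared random variable:

$$\Pr\left\{\frac{Y'Y - (L-1)(1 - \rho)}{(L-1)\rho + 1} > \chi^2_\alpha\right\} \approx \alpha, \tag{16}$$

where $\chi^2_\alpha$ is the $1 - \alpha$ quantile of the one degree of freedom chi-square distribution.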
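The location-scale approximation in Eq (16) translates into a one-line P-value calculation. The sketch below is an illustration for the equicorrelation case only; the function name and the example inputs are hypothetical, and the tail of the one degree of freedom chi-square is computed with the complementary error function.

```python
import math

def tq_pvalue_approx(tq, L, rho):
    """Eq (16): treat (TQ - (L-1)(1-rho)) / ((L-1) rho + 1) as a one degree of
    freedom chi-square under the null and return its upper-tail probability."""
    x = (tq - (L - 1) * (1.0 - rho)) / ((L - 1) * rho + 1.0)
    if x <= 0.0:
        return 1.0
    return math.erfc(math.sqrt(x / 2.0))

L, rho = 100, 0.5
critical = (L - 1) * (1.0 - rho) + ((L - 1) * rho + 1.0) * 3.841459
p_at_critical = tq_pvalue_approx(critical, L, rho)   # close to 0.05 by construction
```

The approximation effectively collapses TQ onto a single degree of freedom, which is the same mechanism that limits its power as L grows.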

To summarize, we have just shown that the total noncentrality of the decorrelated statistic grows with L, while the distribution of the sum $Y'Y$ depends heavily only on the noncentrality of the first term, $\gamma_1$. The approximate power of the test based on the statistic $TQ = Y'Y$ can be computed as:

$$\Pr(TQ > t) \approx 1 - \Psi(t), \tag{17}$$
$$t = \chi^2_\alpha + \frac{1 - \rho^*}{\rho^*} + \frac{1}{2}\,\frac{(1 - \rho^*)\,\bar{d}}{(\rho^*)^2}, \tag{18}$$

where $\rho^* = \sqrt{\overline{\rho_{ij}^2}}$, $\mu^* = (\overline{|\mu|})^2$ and $\Psi(\cdot)$ is a one degree of freedom chi-square CDF with the noncentrality $L\mu^*/((L-1)\rho^* + 1)$, evaluated at $t$. The ceiling noncentrality value $\gamma^*$, as $L \to \infty$, is thus

$$\gamma^* \approx \mu^*/\rho^*. \tag{19}$$
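The ceiling in Eq (19) can be illustrated numerically. The sketch below assumes that $\rho^*$ is the square root of the average squared pairwise correlation, so that it reduces to ρ under equicorrelation, and that $\mu^*$ is the squared average of $|\mu_i|$; both readings, the function name, and the example values are assumptions made for this illustration.

```python
import math

def tq_noncentrality(mu, rho_pairs):
    """Return (L * mu_star / ((L-1) * rho_star + 1), mu_star / rho_star),
    i.e. the TQ noncentrality for the given L and its ceiling from Eq (19)."""
    L = len(mu)
    rho_star = math.sqrt(sum(r * r for r in rho_pairs) / len(rho_pairs))
    mu_star = (sum(abs(m) for m in mu) / L) ** 2
    return L * mu_star / ((L - 1) * rho_star + 1.0), mu_star / rho_star

def equicorrelated_case(L, mu_value=1.0, rho=0.25):
    n_pairs = L * (L - 1) // 2
    return tq_noncentrality([mu_value] * L, [rho] * n_pairs)

nc_10, ceiling = equicorrelated_case(10)
nc_1000, _ = equicorrelated_case(1000)
```

Going from 10 to 1000 SNPs moves the noncentrality only from about 3.08 to about 3.99, against a ceiling of 4: additional SNPs contribute almost nothing to the power of TQ in this setting.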

Let us re-emphasize the point that a test based on the distribution of the TQ statistic is expected to be less powerful than DOT in the presence of heterogeneity among effect sizes. Heterogeneity in LD will also contribute to the difference in power. Starting with an equicorrelation model, we can introduce perturbations to the common value, ρ > 0, by adding noise derived from a rank-one matrix $UU'$, where $U$ is a vector of random numbers. Specifically, perturbations can be added as $B = R_\rho + UU'$. Next, $B$ should be standardized to correlation as $B_R = \{1/\sqrt{\mathrm{Diag}(B)}\,I\}\,B\,\{1/\sqrt{\mathrm{Diag}(B)}\,I\}$. When elements in $U$ are close to zero, the matrix $B_R$ deviates from $R_\rho$ by only a small jiggle around ρ. Matrix $B_R$ provides a way to construct random correlation matrices in a controlled manner, where the degree of departure from the equicorrelation is controlled via the range of the elements in $U$. The utility of $B_R$ is that it represents a perturbation of $R_\rho$, and we expect our power results under the equicorrelation case to hold approximately, at least for small jiggles around ρ. Nevertheless, it turns out that even for a more general correlation structure, our power approximations still hold, which we show via extensive simulation studies.
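The construction of $B_R$ can be sketched as follows. The text only states that $U$ is a vector of random numbers, so drawing its elements uniformly from a symmetric interval is an assumption of this sketch, as are the function name, the `spread` parameter, and the seed.

```python
import random

def perturbed_correlation(L, rho, spread, rng):
    """B_R: equicorrelation matrix plus rank-one noise U U', rescaled so that
    the diagonal is one. Elements of U are uniform on [-spread, spread]."""
    U = [rng.uniform(-spread, spread) for _ in range(L)]
    B = [[(1.0 if i == j else rho) + U[i] * U[j] for j in range(L)]
         for i in range(L)]
    d = [B[i][i] ** 0.5 for i in range(L)]
    return [[B[i][j] / (d[i] * d[j]) for j in range(L)] for i in range(L)]

rng = random.Random(7)
B_R = perturbed_correlation(L=5, rho=0.5, spread=0.1, rng=rng)
off_diagonal = [B_R[i][j] for i in range(5) for j in range(5) if i != j]
```

Since $R_\rho$ is positive definite for ρ in (0, 1) and $UU'$ is positive semi-definite, the result is a valid correlation matrix; increasing `spread` widens the range of pairwise correlations around ρ.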

LD patterns from the 1000 Genomes Project

In a separate set of simulation experiments, we utilized realistic LD patterns using data from the 1000 Genomes Project [83]. For every simulation experiment, we selected a random set of consecutive SNPs from a chromosome 17 region that spanned over 100 kb and included SNPs from the gene FGF11 to the gene NDEL1. There was no particular reason for choosing this chromosome, but we expect our results to be generalizable to other regions of the genome in the sense that the LD structure among SNPs on chromosome 17 is representative of LD throughout the genome. Perhaps more important, and a potential limitation of our simulations, is the choice of the association model. That is, the model assumed high heterogeneity in effect sizes, and statistics were combined only for proxy SNPs (those with zero effect sizes). Each stretch of consecutive SNPs contained from 10 to 200 SNPs with a minimum allele frequency of 0.025. A random portion of SNPs in every set carried no effect on the outcome on its own, and we considered these SNPs to be proxies for causal variants due to LD. The median LD correlation varied from approximately -0.6 to 0.98 between random stretches of SNPs. The number of proxy SNPs varied from 3 to 197 across simulations. The sample size was also set to be random and varied from 500 to 3000 across simulations. Effect sizes for causal variants were modeled by β-coefficients, as given by Eq (1), and drawn randomly from the interval [-0.4, 0.4]. Different combinations of the number of causal SNPs, their individual effect sizes, and the LD patterns among them resulted in a total proportion of phenotypic variance explained (i.e., the multiple correlation coefficient) varying from $10^{-5}$% (fifth percentile) to 7% (ninety-fifth percentile), with a mean value of 2.5% and a median value of 1%. Summary statistics were sampled from the MVN distribution with parameters given by Eqs (2) and (4).
To check the validity of our approach of sampling the summary statistics directly, we first conducted a separate set of extensive simulation experiments, in which power and type-I error rates were obtained by simulating individual-level data and then computing the TQ and DOT statistics from the actual regression analysis. We confirmed excellent agreement between the two approaches; thus, most of the subsequent simulations were conducted by sampling the summary statistics directly (these results are not shown here).
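As a rough illustration of sampling summary statistics directly, the sketch below draws statistics under the null hypothesis only, Z ~ MVN(0, Σ), and computes both combined statistics. Several pieces are assumptions made for illustration: the non-null parameters of Eqs (2) and (4) are not reproduced here, the LD matrix is a hypothetical equicorrelation matrix, and DOT is taken to be the sum of squared statistics after multiplying by the symmetric inverse square root of Σ, which is one reading of "decorrelation by orthogonal transformation". The function names are hypothetical and this is not the authors' software.

```python
import numpy as np

def decorrelation_matrix(sigma):
    """Symmetric inverse square root H of sigma, so that H @ sigma @ H is the identity."""
    lam, E = np.linalg.eigh(sigma)       # sigma = E diag(lam) E'
    return (E / np.sqrt(lam)) @ E.T      # scale column j of E by 1/sqrt(lam_j)

# ASSUMPTION: hypothetical LD matrix, equicorrelation rho = 0.7 among L = 10 SNPs
L, rho = 10, 0.7
sigma = np.full((L, L), rho)
np.fill_diagonal(sigma, 1.0)

# Null draws only: each row of Z is one vector of correlated association statistics
rng = np.random.default_rng(2020)
Z = rng.multivariate_normal(np.zeros(L), sigma, size=20000)

# TQ: sum of squared correlated statistics
tq = (Z ** 2).sum(axis=1)

# DOT (as assumed here): sum of squared decorrelated statistics X = H Z
H = decorrelation_matrix(sigma)
dot = ((Z @ H) ** 2).sum(axis=1)

# Both have null mean L; only the decorrelated sum has the chi-square(L) variance 2L
print(tq.mean(), tq.var())
print(dot.mean(), dot.var())
```

Under the null, the decorrelated sum follows a chi-square distribution with L degrees of freedom, whereas the sum of correlated squares follows a weighted sum of chi-squares with weights given by the eigenvalues of Σ, which is why its variance is inflated when SNPs are in LD.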

Data Availability

The software referenced in this article is available at: https://github.com/dmitri-zaykin/Total_Decor.

Funding Statement

This research was supported in part by the Intramural Research Program of the National Institutes of Health (NIH), National Institute of Environmental Health Sciences. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Lin D, Zeng D. Meta-analysis of genome-wide association studies: No efficiency gain in using individual participant data. Genet Epidemiol. 2010;34(1):60–66. doi:10.1002/gepi.20435
  • 2. Lee S, Teslovich TM, Boehnke M, Lin X. General framework for meta-analysis of rare variants in sequencing association studies. Am J Hum Genet. 2013;93(1):42–53. doi:10.1016/j.ajhg.2013.05.010
  • 3. Zaykin DV. Optimally weighted Z-test is a powerful method for combining probabilities in meta-analysis. J Evol Biol. 2011;24(8):1836–1841. doi:10.1111/j.1420-9101.2011.02297.x
  • 4. Pasaniuc B, Price AL. Dissecting the genetics of complex traits using summary association statistics. Nature Reviews Genetics. 2017;18(2):117. doi:10.1038/nrg.2016.142
  • 5. Li MX, Gui HS, Kwan JSH, Sham PC. GATES: a rapid and powerful gene-based association test using extended Simes procedure. Am J Hum Genet. 2011;88(3):283–93. doi:10.1016/j.ajhg.2011.01.019
  • 6. Conneely KN, Boehnke M. So many correlated tests, so little time! Rapid adjustment of P-values for multiple correlated tests. Am J Hum Genet. 2007;81(6):1158–1168. doi:10.1086/522036
  • 7. Sun R, Hui S, Bader G, Lin X, Kraft P. Powerful gene set analysis in GWAS with the generalized Berk-Jones statistic. bioRxiv. 2018. https://doi.org/10.1101/361436
  • 8. Liu JZ, Mcrae AF, Nyholt DR, Medland SE, Wray NR, Brown KM, et al. A versatile gene-based test for genome-wide association studies. Am J Hum Genet. 2010;87(1):139–145. doi:10.1016/j.ajhg.2010.06.009
  • 9. Lamparter D, Marbach D, Rueedi R, Kutalik Z, Bergmann S. Fast and rigorous computation of gene and pathway scores from SNP-based summary statistics. PLOS Computational Biology. 2016;12(1):e1004714. doi:10.1371/journal.pcbi.1004714
  • 10. Gauderman WJ, Murcray C, Gilliland F, Conti DV. Testing association between disease and multiple SNPs in a candidate gene. Genet Epidemiol. 2007;31(5):383–395. doi:10.1002/gepi.20219
  • 11. Zaykin DV, Zhivotovsky LA, Westfall PH, Weir BS. Truncated product method for combining P-values. Genet Epidemiol. 2002;22(2):170–85. doi:10.1002/gepi.0042
  • 12. Dudbridge F, Koeleman BP. Rank truncated product of P-values, with application to genomewide association scans. Genet Epidemiol. 2003;25(4):360–366. doi:10.1002/gepi.10264
  • 13. Zaykin DV, Zhivotovsky LA, Czika W, Shao S, Wolfinger RD. Combining P-values in large-scale genomics experiments. Pharmaceutical Statistics: The Journal of Applied Statistics in the Pharmaceutical Industry. 2007;6(3):217–226.
  • 14. Biernacka JM, Jenkins GD, Wang L, Moyer AM, Fridley BL. Use of the gamma method for self-contained gene-set analysis of SNP data. European Journal of Human Genetics. 2012;20(5):565. doi:10.1038/ejhg.2011.236
  • 15. Fridley BL, Jenkins GD, Grill DE, Kennedy RB, Poland GA, Oberg AL. Soft truncation thresholding for gene set analysis of RNA-seq data: application to a vaccine study. Scientific Reports. 2013;3:2898. doi:10.1038/srep02898
  • 16. Taylor J, Tibshirani R. A tail strength measure for assessing the overall univariate significance in a dataset. Biostatistics. 2005;7(2):167–181. doi:10.1093/biostatistics/kxj009
  • 17. Maechler M, Bates D. 2nd Introduction to the Matrix package. R Core Development Team. 2006. Accessed on: https://stat.ethz.ch/R-manual/R-devel/library/Matrix/doc/Intro2Matrix.pdf
  • 18. Liu Y, Chen S, Li Z, Morrison AC, Boerwinkle E, Lin X. ACAT: A fast and powerful P-value combination method for rare-variant analysis in sequencing studies. Am J Hum Genet. 2019;104(3):410–421. doi:10.1016/j.ajhg.2019.01.002
  • 19. Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011;89(1):82–93. doi:10.1016/j.ajhg.2011.05.029
  • 20. Li B, Leal SM. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet. 2008;83(3):311–321. doi:10.1016/j.ajhg.2008.06.024
  • 21. Madsen BE, Browning SR. A groupwise association test for rare mutations using a weighted sum statistic. PLOS Genetics. 2009;5(2):e1000384. doi:10.1371/journal.pgen.1000384
  • 22. Price AL, Kryukov GV, de Bakker PI, Purcell SM, Staples J, Wei LJ, et al. Pooled association tests for rare variants in exon-resequencing studies. Am J Hum Genet. 2010;86(6):832–838. doi:10.1016/j.ajhg.2010.04.005
  • 23. de Leeuw CA, Mooij JM, Heskes T, Posthuma D. MAGMA: generalized gene-set analysis of GWAS data. PLOS Computational Biology. 2015;11(4):e1004219. doi:10.1371/journal.pcbi.1004219
  • 24. Brown MB. 400: A method for combining non-independent, one-sided tests of significance. Biometrics. 1975; p. 987–992.
  • 25. Hou CD. A simple approximation for the distribution of the weighted combination of non-independent or independent probabilities. Statistics & Probability Letters. 2005;73(2):179–187.
  • 26. Vsevolozhskaya O, Hu F, Zaykin D. Detecting weak signals by combining small P-values in genetic association studies. bioRxiv. 2019; p. 667238.
  • 27. Shi M, O’Brien KM, Sandler DP, Taylor JA, Zaykin DV, Weinberg CR. Previous GWAS hits in relation to young-onset breast cancer. Breast Cancer Research and Treatment. 2017;161(2):333–344. doi:10.1007/s10549-016-4053-z
  • 28. O’Brien KM, Shi M, Sandler DP, Taylor JA, Zaykin DV, Keller J, et al. A family-based, genome-wide association study of young-onset breast cancer: inherited variants and maternally mediated effects. European Journal of Human Genetics. 2016;24(9):1316. doi:10.1038/ejhg.2016.11
  • 29. Ahsan H, Halpern J, Kibriya MG, Pierce BL, Tong L, Gamazon E, et al. A genome-wide association study of early-onset breast cancer identifies PFKM as a novel breast cancer gene and supports a common genetic spectrum for breast cancer at any age. Cancer Epidemiology and Prevention Biomarkers. 2014;23(4):658–669.
  • 30. Lipphardt MF, Deryal M, Ong MF, Schmidt W, Mahlknecht U. ESR1 single nucleotide polymorphisms predict breast cancer susceptibility in the central European Caucasian population. International Journal of Clinical and Experimental Medicine. 2013;6(4):282.
  • 31. Dunning AM, Healey CS, Baynes C, Maia AT, Scollen S, Vega A, et al. Association of ESR1 gene tagging SNPs with breast cancer risk. Human Molecular Genetics. 2009;18(6):1131–1139. doi:10.1093/hmg/ddn429
  • 32. Thomas G, Jacobs KB, Kraft P, Yeager M, Wacholder S, Cox DG, et al. A multistage genome-wide association study in breast cancer identifies two new risk alleles at 1p11.2 and 14q24.1 (RAD51L1). Nature Genetics. 2009;41(5):579. doi:10.1038/ng.353
  • 33. Michailidou K, Hall P, Gonzalez-Neira A, Ghoussaini M, Dennis J, Milne RL, et al. Large-scale genotyping identifies 41 new loci associated with breast cancer risk. Nature Genetics. 2013;45(4):353. doi:10.1038/ng.2563
  • 34. Pelttari LM, Khan S, Vuorela M, Kiiski JI, Vilske S, Nevanlinna V, et al. RAD51B in familial breast cancer. PLOS ONE. 2016;11(5):e0153788. doi:10.1371/journal.pone.0153788
  • 35. Udler MS, Ahmed S, Healey CS, Meyer K, Struewing J, Maranian M, et al. Fine scale mapping of the breast cancer 16q12 locus. Human Molecular Genetics. 2010;19(12):2507–2515. doi:10.1093/hmg/ddq122
  • 36. Linjawi SA, Hifni SA, ALKhayyat SS. The Relation between Estrogen-positive Receptor in Breast Cancer (ER+) and Obesity in Jeddah. Journal of Biology and Today’s World. 2019;8(1):13–20.
  • 37. Sonestedt E, Ivarsson MI, Harlid S, Ericson U, Gullberg B, Carlson J, et al. The Protective Association of High Plasma Enterolactone with Breast Cancer Is Reasonably Robust in Women with Polymorphisms in the Estrogen Receptor α and β Genes. The Journal of Nutrition. 2009;139(5):993–1001. doi:10.3945/jn.108.101691
  • 38. Yingchun X, Zhang F, Wang H, Ma Y, Sun L. Relationship between single nucleotide polymorphism of estrogen receptor gene and endocrine therapy efficacy in breast cancer. Journal of Clinical Oncology. 2009;27(15S):1113–1113.
  • 39. Nyante SJ, Gammon MD, Kaufman JS, Bensen JT, Lin DY, Barnholtz-Sloan JS, et al. Genetic variation in estrogen and progesterone pathway genes and breast cancer risk: an exploration of tumor subtype-specific effects. Cancer Causes & Control. 2015;26(1):121–131.
  • 40. Mahoney DW, Kohli M, Cerhan JR, Offer SM. Predicting responses to androgen deprivation therapy; 2013.
  • 41. Saadatian Z, Gharesouran J, Ghojazadeh M, Ghohari-Lasaki S, Tarkesh-Esfahani N, Ardebili SMM. Association of rs1219648 in FGFR2 and rs1042522 in TP53 with Premenopausal Breast Cancer in an Iranian Azeri Population. Asian Pacific Journal of Cancer Prevention. 2014;15(18):7955–7958. doi:10.7314/apjcp.2014.15.18.7955
  • 42. Andersen SW, Trentham-Dietz A, Figueroa JD, Titus LJ, Cai Q, Long J, et al. Breast cancer susceptibility associated with rs1219648 (fibroblast growth factor receptor 2) and postmenopausal hormone therapy use in a population-based United States study. Menopause (New York, NY). 2013;20(3):354–358.
  • 43. Zhang Y, Zeng X, Liu P, Hong R, Lu H, Ji H, et al. Association between FGFR2 (rs2981582, rs2420946 and rs2981578) polymorphism and breast cancer susceptibility: a meta-analysis. Oncotarget. 2017;8(2):3454. doi:10.18632/oncotarget.13839
  • 44. Zhang J, Qiu LX, Wang ZH, Leaw SJ, Wang BY, Wang JL, et al. Current evidence on the relationship between three polymorphisms in the FGFR2 gene and breast cancer risk: a meta-analysis. Breast Cancer Research and Treatment. 2010;124(2):419–424. doi:10.1007/s10549-010-0846-7
  • 45. Chen XH, Li XQ, Chen Y, Feng YM. Risk of aggressive breast cancer in women of Han nationality carrying TGFB1 rs1982073 C allele and FGFR2 rs1219648 G allele in North China. Breast Cancer Research and Treatment. 2011;125(2):575–582. doi:10.1007/s10549-010-1032-7
  • 46. Lei H, Deng CX. Fibroblast growth factor receptor 2 signaling in breast cancer. International Journal of Biological Sciences. 2017;13(9):1163. doi:10.7150/ijbs.20792
  • 47. Murillo-Zamora E, Moreno-Macías H, Ziv E, Romieu I, Lazcano-Ponce E, Ángeles-Llerenas A, et al. Association between rs2981582 polymorphism in the FGFR2 gene and the risk of breast cancer in Mexican women. Archives of Medical Research. 2013;44(6):459–466. doi:10.1016/j.arcmed.2013.08.006
  • 48. Butt S, Harlid S, Borgquist S, Ivarsson M, Landberg G, Dillner J, et al. Genetic predisposition, parity, age at first childbirth and risk for breast cancer. BMC Research Notes. 2012;5(1):414. doi:10.1186/1756-0500-5-414
  • 49. Shan J, Mahfoudh W, Dsouza SP, Hassen E, Bouaouina N, Abdelhak S, et al. Genome-Wide Association Studies (GWAS) breast cancer susceptibility loci in Arabs: susceptibility and prognostic implications in Tunisians. Breast Cancer Research and Treatment. 2012;135(3):715–724. doi:10.1007/s10549-012-2202-6
  • 50. Xu WH, Shu XO, Long J, Lu W, Cai Q, Zheng Y, et al. Relation of FGFR2 genetic polymorphisms to the association between oral contraceptive use and the risk of breast cancer in Chinese women. American Journal of Epidemiology. 2011;173(8):923–931. doi:10.1093/aje/kwq460
  • 51. Dong H, Gao Z, Li C, Wang J, Jin M, Rong H, et al. Analyzing 395,793 samples shows significant association between rs999737 polymorphism and breast cancer. Tumor Biology. 2014;35(6):6083–6087. doi:10.1007/s13277-014-1805-4
  • 52. Turnbull C, Ahmed S, Morrison J, Pernet D, Renwick A, Maranian M, et al. Genome-wide association study identifies five new breast cancer susceptibility loci. Nature Genetics. 2010;42(6):504. doi:10.1038/ng.586
  • 53. Lee P, Fu YP, Figueroa JD, Prokunina-Olsson L, Gonzalez-Bosquet J, Kraft P, et al. Fine mapping of 14q24.1 breast cancer susceptibility locus. Human Genetics. 2012;131(3):479–490. doi:10.1007/s00439-011-1088-4
  • 54. Stacey S, Sulem P. Genetic variants for breast cancer risk assessment; 2015.
  • 55. Ma H, Li H, Jin G, Dai J, Dong J, Qin Z, et al. Genetic variants at 14q24.1 and breast cancer susceptibility: a fine-mapping study in Chinese women. DNA and Cell Biology. 2012;31(6):1114–1120. doi:10.1089/dna.2011.1550
  • 56. Beaty TH, Murray JC, Marazita ML, Munger RG, Ruczinski I, Hetmanski JB, et al. A genome-wide association study of cleft lip with and without cleft palate identifies risk variants near MAFB and ABCA4. Nature Genetics. 2010;42(6):525. doi:10.1038/ng.580
  • 57. Bagordakis E, Paranaiba LMR, Brito LA, de Aquino SN, Messetti AC, Martelli-Junior H, et al. Polymorphisms at regions 1p22.1 (rs560426) and 8q24 (rs1530300) are risk markers for nonsyndromic cleft lip and/or palate in the Brazilian population. American Journal of Medical Genetics Part A. 2013;161(5):1177–1180.
  • 58. Zhang TX, Beaty TH, Ruczinski I. Candidate pathway based analysis for cleft lip with or without cleft palate. Statistical Applications in Genetics and Molecular Biology. 2012;11(2).
  • 59. Rojas-Martinez A, Reutter H, Chacon-Camacho O, Leon-Cachon RB, Munoz-Jimenez SG, Nowak S, et al. Genetic risk factors for nonsyndromic cleft lip with or without cleft palate in a Mesoamerican population: evidence for IRF6 and variants at 8q24 and 10q25. Birth Defects Research Part A: Clinical and Molecular Teratology. 2010;88(7):535–537. doi:10.1002/bdra.20689
  • 60. Imani MM, Lopez-Jornet P, Pons-Fuster López E, Sadeghi M. Polymorphic Variants of V-Maf Musculoaponeurotic Fibrosarcoma Oncogene Homolog B (rs13041247 and rs11696257) and Risk of Non-Syndromic Cleft Lip/Palate: Systematic Review and Meta-Analysis. International Journal of Environmental Research and Public Health. 2019;16(15):2792.
  • 61. Liu H, Leslie EJ, Carlson JC, Beaty TH, Marazita ML, Lidral AC, et al. Identification of common non-coding variants at 1p22 that are functional for non-syndromic orofacial clefting. Nature Communications. 2017;8:14759. doi:10.1038/ncomms14759
  • 62. Hu N, Wang C, Hu Y, Yang HH, Giffen C, Tang ZZ, et al. Genome-wide association study in esophageal cancer using GeneChip mapping 10K array. Cancer Research. 2005;65(7):2542–2546. doi:10.1158/0008-5472.CAN-04-3247
  • 63. Bueno M. Association of GWAS loci with nonsyndromic cleft lip and/or palate in Brazilian population. Luciano Abreu Brito. 2016; p. 99.
  • 64. Hikida M, Tsuda M, Watanabe A, Kinoshita A, Akita S, Hirano A, et al. No evidence of association between 8q24 and susceptibility to nonsyndromic cleft lip with or without palate in Japanese population. The Cleft Palate-Craniofacial Journal. 2012;49(6):714–717. doi:10.1597/10-242
  • 65. do Rego Borges A, Sá J, Hoshi R, Viena CS, Mariano LC, de Castro Veiga P, et al. Genetic risk factors for nonsyndromic cleft lip with or without cleft palate in a Brazilian population with high African ancestry. American Journal of Medical Genetics Part A. 2015;167(10):2344–2349.
  • 66. Sun Y, Huang Y, Yin A, Pan Y, Wang Y, Wang C, et al. Genome-wide association study identifies a new susceptibility locus for cleft lip with or without a cleft palate. Nature Communications. 2015;6:6414. doi:10.1038/ncomms7414
  • 67. Song T, Wu D, Wang Y, Li H, Yin N, Zhao Z. SNPs and interaction analyses of IRF6, MSX1 and PAX9 genes in patients with non-syndromic cleft lip with or without palate. Molecular Medicine Reports. 2013;8(4):1228–1234. doi:10.3892/mmr.2013.1617
  • 68. Weatherley-White RC, Ben S, Jin Y, Riccardi S, Arnold TD, Spritz RA. Analysis of genomewide association signals for nonsyndromic cleft lip/palate in a Kenya African Cohort. American Journal of Medical Genetics Part A. 2011;155(10):2422–2425.
  • 69. Kerameddin S, Namipashaki A, Ebrahimi S, Ansari-Pour N. IRF6 is a marker of severity in nonsyndromic cleft lip/palate. Journal of Dental Research. 2015;94(9_suppl):226S–232S. doi:10.1177/0022034515581013
  • 70. Jia ZL, Li Y, Li L, Wu J, Zhu LY, Yang C, et al. Association among IRF6 polymorphism, environmental factors, and nonsyndromic orofacial clefts in western China. DNA and Cell Biology. 2009;28(5):249–257. doi:10.1089/dna.2008.0837
  • 71. Park JW, McIntosh I, Hetmanski JB, Jabs EW, Vander Kolk CA, Wu-Chou YH, et al. Association between IRF6 and nonsyndromic cleft lip with or without cleft palate in four populations. Genetics in Medicine. 2007;9(4):219. doi:10.1097/GIM.0b013e3180423cca
  • 72. Yuan Q, Blanton SH, Hecht JT. Association of ABCA4 and MAFB with nonsyndromic cleft lip with or without cleft palate. American Journal of Medical Genetics Part A. 2011;155(6):1469.
  • 73. Duan SJ, Huang N, Zhang BH, Shi JY, He S, Ma J, et al. New insights from GWAS for the cleft palate among han Chinese population. Medicina Oral, Patologia Oral y Cirugia Bucal. 2017;22(2):e219. doi:10.4317/medoral.21439
  • 74. Mi N, Hao Y, Jiao X, Zheng X, Song T, Shi J, et al. Association study of single nucleotide polymorphisms of MAFB with non-syndromic cleft lip with or without cleft palate in a population in Heilongjiang Province, northern China. British Journal of Oral and Maxillofacial Surgery. 2014;52(8):746–750. doi:10.1016/j.bjoms.2014.06.003
  • 75. Zhang B, Duan S, Shi J, Jiang S, Feng F, Shi B, et al. Family-based study of association between MAFB gene polymorphisms and NSCL/P among Western Han Chinese population. Advances in Clinical and Experimental Medicine. 2018;27(8):1109–1116. doi:10.17219/acem/74388
  • 76. Zaykin DV. Bounds and normalization of the composite linkage disequilibrium coefficient. Genet Epidemiol. 2004;27(3):252–257. doi:10.1002/gepi.20015
  • 77. MacCallum RC, Zhang S, Preacher KJ, Rucker DD. On the practice of dichotomization of quantitative variables. Psychological Methods. 2002;7(1):19. doi:10.1037/1082-989x.7.1.19
  • 78. Vsevolozhskaya O, Herbst A, Adams A, Burns C, Cantu B, Barker V, et al. Methods for combining multiple correlated biomarkers with application to the study of low-grade inflammation and muscle mass in senior horses. bioRxiv. 2019.
  • 79. Reshef YA, Finucane HK, Kelley DR, Gusev A, Kotliar D, Ulirsch JC, et al. Detecting genome-wide directional effects of transcription factor binding on polygenic disease risk. Nature Genetics. 2018;50(10):1483. doi:10.1038/s41588-018-0196-7
  • 80. Ferrari PA, Barbiero A. Simulating ordinal data. Multivariate Behavioral Research. 2012;47(4):566–589. doi:10.1080/00273171.2012.692630
  • 81. Clarke BR. Helmert matrices and orthogonal relationships. In: Linear Models: The theory and application of analysis of variance. Wiley-Blackwell; 2008.
  • 82. Lancaster H. The Helmert matrices. The American Mathematical Monthly. 1965;72(1):4–12.
  • 83. Consortium GP, et al. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491(7422):56.
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1007819.r001

Decision Letter 0

Thomas Lengauer, Jennifer Listgarten

25 Oct 2019

Dear Dr Zaykin,

Thank you very much for submitting your manuscript 'DOT: Gene-set analysis by combining decorrelated association statistics' for review by PLOS Computational Biology. Your manuscript has been fully evaluated by the PLOS Computational Biology editorial team and in this case also by independent peer reviewers. The reviewers appreciated the attention to an important problem, but raised some substantial concerns about the manuscript as it currently stands. While your manuscript cannot be accepted in its present form, we are willing to consider a revised version in which the issues raised by the reviewers have been adequately addressed. We cannot, of course, promise publication at that time.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

Your revisions should address the specific points made by each reviewer. Please return the revised version within the next 60 days. If you anticipate any delay in its return, we ask that you let us know the expected resubmission date by email at ploscompbiol@plos.org. Revised manuscripts received beyond 60 days may require evaluation and peer review similar to that applied to newly submitted manuscripts.

In addition, when you are ready to resubmit, please be prepared to provide the following:

(1) A detailed list of your responses to the review comments and the changes you have made in the manuscript. We require a file of this nature before your manuscript is passed back to the editors.

(2) A copy of your manuscript with the changes highlighted (encouraged). We encourage authors, if possible, to show clearly where changes have been made to their manuscript, e.g., by highlighting text.

(3) A striking still image to accompany your article (optional). If the image is judged to be suitable by the editors, it may be featured on our website and might be chosen as the issue image for that month. These square, high-quality images should be accompanied by a short caption. Please note as well that there should be no copyright restrictions on the use of the image, so that it can be published under the Open-Access license and be subject only to appropriate attribution.

Before you resubmit your manuscript, please consult our Submission Checklist to ensure your manuscript is formatted correctly for PLOS Computational Biology: http://www.ploscompbiol.org/static/checklist.action. Some key points to remember are:

- Figures uploaded separately as TIFF or EPS files (if you wish, your figures may remain in your main manuscript file in addition).

- Supporting Information uploaded as separate files, titled Dataset, Figure, Table, Text, Protocol, Audio, or Video.

- Funding information in the 'Financial Disclosure' box in the online system.

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, log in and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future.

We are sorry that we cannot be more positive about your manuscript at this stage, but if you have any concerns or questions, please do not hesitate to contact us.

Sincerely,

Jennifer Listgarten

Associate Editor

PLOS Computational Biology

Thomas Lengauer

Methods Editor

PLOS Computational Biology

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately:

[LINK]

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: In this paper, the authors present a new summary-statistics-based method for testing a group of common SNPs in aggregate for association to a phenotype. Unlike previous approaches, the authors' test statistic explicitly (and exactly) removes correlation between the individual SNPs' summary statistics.

I generally like this paper and appreciate the authors' precision and rigor in deriving and presenting their method. Their theoretical results concerning the power of their test as well as others are also a valuable contribution. So I generally feel this is a very solid contribution to the field. In the long-term I would suggest that the authors consider applications of their framework beyond set-testing since my impression is that the growing number of highly significant associations between *individual* SNPs and phenotypes will eventually cause set-testing to decline as an approach in the common-variant realm. But this is beyond the scope of this paper and for now there remains a substantial community of users of set tests who could benefit from the approach described by the authors.

Regarding the technical substance of the paper, I have the following major comments:

- I'm unclear on the phenomenon whereby TQ tests don't experience an increase in power as more SNPs are added to the model, e.g., in Setting 1. Looking at the authors' model, in which the variance of the environmental noise, epsilon, is set at 1, it would seem that the more SNPs I add to the model with non-trivial effects, the more phenotypic variance is produced by the genetics. In the limit of infinite SNPs and constant-magnitude environmental noise then, the phenotype should be deterministically set by genotype. It would seem unintuitive that in this situation the TQ tests wouldn't have full power. What am I missing? Are the authors scaling something somewhere?

- Relatedly, it would help if the authors included in their methods section more detailed descriptions of their simulation setups, especially including sample size and proportion of phenotypic variance explained by genotype for each simulation (including the simulations with real genotypes).

- I don't know if the proportion of variance explained by genotype is high in the authors' simulations. But if it is, do they expect their results to generalize to settings where this is not the case? For real traits, any one set of tens to hundreds of contiguous SNPs typically only explains a very small proportion (on the order of 1%, usually even less than that) of phenotypic variance, so I'd be interested to see if this is the case in the simulations here. Sometimes it's okay to simulate small sections of genome explaining high proportions of phenotypic variance as long as sample size is lowered in some corresponding way, but if this is the case here the authors should explain and perhaps use their theory to justify.

- How do the authors expect their statistic to behave in the presence of near-perfect LD? It seems they don't regularize their LD matrix, which surprised me. I would be interested to see power results under a simulation setting where two SNPs, only one of which is causal and contains 75% of the causal signal in locus, have a) 99% correlation and b) 100% correlation.

- For the simulations with real genotypes, how was the 100kb region on chromosome 17 chosen? Do the authors expect the simulation results to generalize to other regions of the genome as well? If they are unsure, is it computationally feasible to do simulations where random sets of contiguous SNPs are chosen from the whole genome?

- How were the genes ESR1, FGFR2, RAD51B, and TOX3 chosen by the authors for demonstration of their method? Does this set include all the genes found in the Min et al paper to have association with breast cancer? Would it be possible to test a larger set of genes chosen more systematically so that readers can have a sense for whether the authors' approach should in general be preferred over other approaches? Or perhaps to test a few genes chosen by authors of other set testing methods papers?

- Do the authors think it would make sense to compare (either in simulation or in practice) to the gene-level test in de Leeuw 2016 PLOS Comp Bio since that method also provides a way to test the SNPs surrounding an individual gene for association while accounting for correlation between variants in order to boost power? Relatedly: ACAT seems to be a method intended primarily for testing of rare variants in sequence data; could it be that this makes it an inappropriate comparison point?

- I liked the way the authors argued for their particular choice of pseudoinverse by suggesting that exchangeability of SNPs should be preserved by this operation. Kudos!

I also have the following minor comments:

- It seems that the claims about the scaling of power as a function of L are for fixed rho > 0, because when rho=0 the tests considered are equivalent. The authors may want to clarify this.

- In the definition of r_ij on page 2, should there be a square-root in the denominator?

- On page 3 there is a typo in "This general idea is straightforward and HAVE been used..." (emphasis mine)

- What was the sample size of the breast cancer data set that the authors analyzed?

- In Equations 4 and 5, rho_ij appears on both sides of the equations.

- The derivation of the covariance matrix of the vector of summary statistics can be carried out without the delta method but under the assumption of Gaussian genotypes (which is justifiable for large sample size and MAF bounded away from zero). See Proposition 2 in the supplement of Reshef et al 2018 Nat Genet. The authors may wish to comment on whether these two derivations give different results and, if so, why.

- For the results in Table 6: 1) which set of genotypes were the phenotypes simulated from? 2) Which set of genotypes was used as the reference panel? The only genotypes I saw mentioned were 1000 Genomes, but two distinct sets of genotypes are required for the described analysis.

Reviewer #2: Zaykin et al propose DOT, a new method for gene-based association testing. There is demand for a gene-based (or set-based) method, so a method that improves upon previous methods would be of much interest and (with easy-to-use software) could become highly used. Zaykin et al perform many simulations to show that DOT has the potential to improve on a state-of-the-art method, VEGAS (and also ACAT, a method I am not familiar with). They also have a real data example, but this is very limited. While I am not convinced from this draft alone, I believe that by including an extra simulation method, and a more convincing application, DOT could be a useful addition to the field.

Major points

Reading the method (and apologies that I did not understand all the details), DOT appears similar to methods which first compute principal components for each gene (ie eigen decompose the snp snp correlation matrix), then regress the phenotype on these (consider the following paper, or derivatives https://onlinelibrary.wiley.com/doi/pdf/10.1002/gepi.20219). Thus I require convincing this method is different to / an improvement on those.

The format of the paper makes it challenging to read. Usually methods would come before results. However, if the journal requires such a style, then you must give some brief details at the start of results.

I consider there to be insufficient detail of the simulations. For example, I can't see sample size and rho was hard to find. Is it the case for all simulations that all L snps are assigned effects, or just the first one?

It is good you compare with vegas (TQ?). But to my knowledge, the most common methods are SCAT, or magma, and my preferred is Fast-LMM-Set, so I would ideally like at least one of these considered (or a statement with justification that these are very similar to VEGAS).

The application is very limited. While I appreciate there is justification for the choice, unfortunately it looks odd to consider only four handpicked loci, rather than perform a genome wide analysis.

I believe you require odds ratios for the SNPs in table 8 (ideally from multi snp analysis and perhaps those from single snp)

Minor Points

I applaud the range of simulations, and also the consideration of situations where DOT is not well-suited.

I also like the insight into how DOT has the potential to gain power (when there is a wide spectrum of effect sizes, which is thought likely to be the case with complex traits).

In the simulations, it is hard to understand the effect sizes. Can you instead report in terms of heritability, ideally both (average) phenotypic variance explained by the gene/region, and (average) variance explained by most significant individual snp

The tables (and I think figures) require captions. In general, these should give a full description (or if the same, say "see Table 1... etc"), rather than relying on the user to parse through the main text.

Good that a github page is provided with software (although I have not tested)

Please provide a summary of run time for a decent sized analysis.

Very Minor Points

Intro; It is important to distinguish situations ... I suggest you replace second "in which" with "from those" or something similar

I would prefer if you provided more thresholds when testing the false positive rates (e.g. show not just alpha 1e-4, but also say 0.05, maybe a few others, in supplement if necessary)

It is good you can accommodate covariates, but is this feature used in application?

Signed Doug Speed

Reviewer #3: In this manuscript, the authors combined single-SNP summary statistics in order to conduct joint analysis of a set of SNPs without accessing original genotype-phenotype datasets. To develop an efficient overall summary statistic, the authors used a decorrelation trick to simplify the correlation structure of the vector of single-SNP summary statistics. The latter are correlated by construction. Thus, by rotating this vector over the eigenvectors of its corresponding correlation matrix, one can simplify its correlation structure. Although the decorrelation trick for a response vector is not a new concept (it has been used for the kinship matrix several times in linear mixed models in the presence of familial data, e.g. FastLMM), the theoretical and analytical development of the DOT p-values in this manuscript is relevant in the context of summary-statistic association.

Major and Minor Comments are detailed in a PDF file attached to this review.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Doug Speed

Reviewer #3: No

Attachment

Submitted filename: PlosComBioRepot.pdf

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1007819.r003

Decision Letter 1

Thomas Lengauer, Jennifer Listgarten

25 Feb 2020

Dear Dr Zaykin,

Thank you very much for submitting your manuscript "DOT: Gene-set analysis by combining decorrelated association statistics" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations, and in particular those of reviewer #1.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. 

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Jennifer Listgarten

Associate Editor

PLOS Computational Biology

Thomas Lengauer

Methods Editor

PLOS Computational Biology

***********************

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately:

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Overall the authors have addressed my theoretical and methods-related concerns quite well in this revision.

However, I still have serious reservations about the authors' analysis of real data, which analyses a very small set of genes that were not chosen systematically. I previously wrote: "Would it be possible to test a larger set of genes chosen more systematically so that readers can have a sense for whether the authors’ approach should in general be preferred over other approaches? Or perhaps to test a few genes chosen by authors of other set testing methods papers?"

The authors did not perform this analysis, and so I still do not know whether their method is more powerful than existing methods beyond the very small set of genes they have analyzed. (The addition in revision of a second phenotype, cleft lip, analyzed in the same way as the first phenotype did not give me a better global sense for why people should use this method.) My understanding of what the authors have shown is that:

a) DOT assigns lower p-values than other methods do to the 4 selected breast cancer genes. This seems weak to me, first because lower p-values don't necessarily correspond to higher power (a method can give very low p-values on 1% of alternatives but fail to reject the null the rest of the time), and second because these genes have already been prioritized by other methods, suggesting that their connection to breast cancer is not a new discovery enabled by DOT. For example, these genes seem from the text to harbor previously reported risk SNPs. Am I missing something?

b) DOT can point at new SNPs associated with breast cancer and cleft lip at these known loci (Tables 10 and 12). But the authors also state (appropriately) that since these results don't come with p-values they should be interpreted with caution, and they also state that they cannot conclude that these SNPs are causal but rather only additional proxy SNPs. So I'm unsure what we can confidently learn from these results.

I personally don't find (a) or (b) to be strong reasons that practitioners should use DOT.

Overall, I see two ways forward:

1. The authors can carry out a systematic analysis of the performance of their method on real data. For example, they could run the method on a larger set of genes (e.g., all protein coding genes, or all genes expressed in breast tissue, or a set of genes benchmarked in other set testing papers). This would allow the authors to say things like "in a systematic analysis, our method identified X genes to be in loci that are significantly associated with breast cancer, while competing methods identified only Y such genes." I think this would make a much stronger case for the use of this method. And if it's not true, then that is important for potential users to know even if it doesn't preclude publication of the paper.

2. Alternatively, recognizing they have performed extensive revisions already, the authors can add a statement explaining that the genome-wide performance of their method is yet-uncharacterized and would be important to assess in future work.

I suppose it would be okay to publish the paper in case 2, but my opinion is that I would be less excited about it. Not answering the central question of whether DOT is more powerful than other methods in practice on real data is not consistent with the otherwise high level of statistical rigor in this potentially interesting paper.

Minor comments:

- Just above Table 1, you have a typo: "the column labeled \hat\gamma provide the average noncentrality value" ("provide" should be "provides")

- In the sentence “Different combinations of sample size, the number of causal SNPs, their individual effect sizes and LD patterns among them, resulted in total proportion of phenotypic variance explained...”, whose addition I appreciate in this revision, sample size should not be enumerated as one of the parameters that affects the total proportion of phenotypic variance explained.

- On page 10, you cite "Min et al. [27, 28]" but neither of refs. 27 or 28 has Min as the last name of a first author in your bibliography.

- In your response to R1.1.6, you state that eqns 22 and 27 in Reshef et al. 2018 are derived under the null, but this is not true: Eq 22 defines the computation of summary statistics from data (regardless of model) and Equation 27 includes a parameter beta which can be non-zero. A question therefore remains about the relationship between your derivation and the derivation that assumes Gaussian genotypes. (Fine if you want to drop this issue.)

Reviewer #2: The authors have made a careful response and I am happy with the changes.

Reviewer #3: No additional comments

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: None

**********

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Doug Speed

Reviewer #3: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see http://journals.plos.org/ploscompbiol/s/submission-guidelines#loc-materials-and-methods

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1007819.r005

Decision Letter 2

Thomas Lengauer, Jennifer Listgarten

23 Mar 2020

Dear Dr Zaykin,

We are pleased to inform you that your manuscript 'DOT: Gene-set analysis by combining decorrelated association statistics' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Jennifer Listgarten

Associate Editor

PLOS Computational Biology

Thomas Lengauer

Methods Editor

PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: I thank the authors for their revision, and I am happy to recommend acceptance given the clarifications the authors made about their analysis of real data.

Setting aside this one point of disagreement, I feel this is very high quality work and I commend the authors on their valuable contribution to the field.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Reviewer #1: Yes

**********

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1007819.r006

Acceptance letter

Thomas Lengauer, Jennifer Listgarten

6 Apr 2020

PCOMPBIOL-D-19-01433R2

DOT: Gene-set analysis by combining decorrelated association statistics

Dear Dr Zaykin,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Matt Lyles

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    Attachment

    Submitted filename: PlosComBioRepot.pdf

    Attachment

    Submitted filename: response_to_reviewers.pdf

    Attachment

    Submitted filename: response_to_reviewers.pdf

    Data Availability Statement

    The URL for software referenced in this article is available at: https://github.com/dmitri-zaykin/Total_Decor.


    Articles from PLoS Computational Biology are provided here courtesy of PLOS
