Biostatistics (Oxford, England). 2018 Jul 11;21(1):33–49. doi: 10.1093/biostatistics/kxy030

STEPS: an efficient prospective likelihood approach to genetic association analyses of secondary traits in extreme phenotype sequencing

Wenjian Bi, Yun Li, Matthew P Smeltzer, Guimin Gao, Shengli Zhao, Guolian Kang
PMCID: PMC8559722  PMID: 30007308

Summary

It is well acknowledged that methods for secondary trait (ST) association analyses under a case–control design (STInline graphic) must account for the sampling process to avoid biased risk estimates. A similar situation arises in extreme phenotype sequencing (EPS) designs, which select subjects with extreme values of a continuous primary phenotype for sequencing. EPS designs are commonly used in modern epidemiological and clinical studies such as the well-known National Heart, Lung, and Blood Institute Exome Sequencing Project. Although naïve generalized regression or STInline graphic methods could be applied, their validity is questionable because the underlying study designs differ. Herein, we propose a general prospective likelihood framework to perform association testing for binary and continuous STs under EPS designs (STEPS), which can also incorporate covariates and interaction terms. We provide a computationally efficient and robust algorithm to obtain the maximum likelihood estimates. We also present two empirical mathematical formulas for power/sample size calculations to facilitate planning of binary/continuous ST association analyses under EPS designs. Extensive simulations and an application to a genome-wide association study of benign ethnic neutropenia under an EPS design demonstrate the superiority of STEPS over the alternatives above.

Keywords: Extreme phenotype sequencing, Genome-wide association studies, Maximum likelihood estimate, Next generation sequencing studies, Secondary trait analysis

1. Introduction

Genome-wide association studies (GWAS) and next generation sequencing (NGS) studies have successfully detected thousands of genetic variations associated with a wide variety of traits (Klein and others, 2005; Sanna and others, 2008; Solovieff and others, 2010; Sanders and others, 2012). Besides the specific primary trait that GWAS/NGS is designed for, many secondary traits (STs) are also worthy of investigation to further decipher the disease etiology or pathology. For example, studies designed for neutropenia typically measure the number of white blood cells (WBCs) as a primary trait, and additional blood test results (such as platelet counts) could be recorded and retained for secondary objectives (such as analysis of thrombocytopenia, Bunimov and others, 2013). Another common case is meta-analyses where the trait of interest (such as height) is not usually the primary trait of the single studies (Speliotes and others, 2010).

In this study, we focus on ST genetic association analyses under extreme phenotype sequencing (EPS) designs. Because genotyping and sequencing are costly, an EPS design selects and sequences only those subjects in the whole cohort with extremely large or extremely small values of a continuous primary trait. EPS designs are widely adopted in GWAS/NGS because they can achieve greater statistical power than randomly sequencing the same number of subjects (Kang and others, 2012). For example, the National Heart, Lung, and Blood Institute (NHLBI) Exome Sequencing Project (phs000400.v5.p1) included five subgroups for five primary phenotypes, of which three subgroups used the EPS design to select participants. A GWAS in benign ethnic neutropenia (BEN) selected and genotyped subjects with leukocyte counts between the 1st and 7th percentiles and between the 85th and 95th percentiles. These projects also recorded a number of outcomes, including both categorical and continuous traits, which could be used for ST analysis.

ST analyses have to consider both the sampling scheme and the correlation between the primary trait and the ST; otherwise, parameter estimates can be highly biased. This property has been well studied for case–control designs (Lin and Zeng, 2009; Monsees and others, 2009; Wang and Shete, 2011; Ghosh and others, 2013; Kang and others, 2017) but has not received enough attention for EPS. We used a simple simulated example to show that the EPS design may substantially alter the correlation between genotype and ST (Figure S1; simulation details are in Appendix A1 of Supplementary Materials available at Biostatistics online). Suppose that, in the population, the ST is positively correlated with the primary trait (Figures S1A, S1E of Supplementary Materials available at Biostatistics online) and is not correlated with genotype (Figures S1B, S1F of Supplementary Materials available at Biostatistics online). If subjects with extremely large or extremely small primary trait values are selected from the study cohort for sequencing (EPS design, Figures S1C, S1G of Supplementary Materials available at Biostatistics online), then the ST becomes statistically significantly correlated with genotype (Figure S1D of Supplementary Materials available at Biostatistics online: analysis of variance P-value is Inline graphic; Figure S1H of Supplementary Materials available at Biostatistics online: contingency table Inline graphic test P-value is Inline graphic). Even if the primary trait is incorporated as a covariate, the false positive associations persist. This example clearly shows that ignoring the EPS design generates highly biased results when associating genotype with STs.
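The selection effect described in this example can be reproduced with a small simulation. The sketch below is not the setup of Appendix A1: the cohort size, MAF, effect sizes (beta_gx, gamma_yx), and the 10% tail fraction are assumed values chosen only for illustration, and all function names are hypothetical. The ST is generated with no genotype effect, yet selecting on the extremes of the primary trait induces a genotype-ST correlation.

```python
import math
import random


def pearson(a, b):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)


def simulate_eps_bias(n_cohort=20000, maf=0.3, beta_gx=0.4, gamma_yx=0.7,
                      tail=0.10, seed=1):
    """Return (cohort correlation, EPS-sample correlation) of genotype and ST.

    All parameter values are illustrative assumptions, not the paper's.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n_cohort):
        # Genotype = number of minor alleles (two independent draws, i.e. HWE).
        g = (rng.random() < maf) + (rng.random() < maf)
        y = rng.gauss(0.0, 1.0)  # ST: no genotype effect
        x = beta_gx * g + gamma_yx * y + rng.gauss(0.0, 1.0)  # primary trait
        rows.append((x, g, y))
    rows.sort(key=lambda r: r[0])  # order subjects by primary trait
    k = int(tail * n_cohort)
    selected = rows[:k] + rows[-k:]  # keep both extreme tails
    full = pearson([r[1] for r in rows], [r[2] for r in rows])
    eps = pearson([r[1] for r in selected], [r[2] for r in selected])
    return full, eps


full_corr, eps_corr = simulate_eps_bias()
print(f"cohort corr(G, ST) = {full_corr:.3f}; EPS sample corr = {eps_corr:.3f}")
```

Under these assumed settings the genotype-ST correlation is close to zero in the full cohort and clearly positive in the selected sample, because genotype and ST both push the primary trait in the same direction.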

One typical approach to ST analyses in EPS designs is to treat subjects in the two extreme regions as “cases” and “controls” so that methods for a case–control design (STInline graphic) can be applied (Lin and Zeng, 2009; Monsees and others, 2009; Wang and Shete, 2011; He and others, 2012; Ghosh and others, 2013; Kang and others, 2017). However, dichotomizing the continuous primary trait can discard a substantial amount of useful information. More importantly, the resulting “case–control” data do not actually arise from a case–control design, which can lead to biased estimates (as shown in the simulations of Section 3.2). Thus, valid statistical methods for ST analyses under EPS designs are urgently needed.

To the best of our knowledge, only a few methods have considered ST analysis under EPS. Lin and others (2013) proposed a nonparametric likelihood-based method to analyze continuous STs under trait-dependent sampling through a bivariate linear regression model. The method can adjust for covariates and controls the type I error correctly at a significance level of Inline graphic in some situations. A free command-line program named SEQTDS is available at http://dlin.web.unc.edu/software/score-seqtds/. However, some properties limit its application. First, the method considers only continuous STs and cannot be used to analyze binary STs. Second, it requires complete primary trait data for the original whole study cohort to assess the sampling scheme, and these data may not be directly available in some cases. Although SEQTDS can still be applied by treating a binary ST as continuous and imputing the primary traits under some marginal distribution assumptions, it cannot control the type I error rate at Inline graphic under some specific parameter settings (as shown in the simulations of Section 3.2).

We propose a set-valued model to jointly characterize the relationship among genotype, primary trait, and ST, where the ST can be continuous or binary. Building on this model, we propose a novel ST association analysis method under EPS designs [we call it STs under EPS designs (STEPS) for short throughout the article]. We first use a prospective likelihood function to estimate model parameters. Then, we give a closed form of the Fisher information matrix to conduct the Wald test for associating genotype and ST. The model and the estimation approach can easily incorporate covariates and allow for environmental factors, genetic principal components, or interactions between genotype and environmental factors. We performed extensive simulations to compare STEPS with existing methods including straightforward linear/logistic regression, STInline graphic, and the SEQTDS method proposed by Lin and others (2013). Simulations and an application to a GWAS of BEN both demonstrated the advantages of the new method. In addition, we conducted simulations to evaluate STEPS under a polygenic architecture; to our knowledge, this is the first time that polygenic effects, a well-known phenomenon in GWAS/NGS, are fully considered in ST genetic association analysis.

The remainder of the article is organized as follows. In Section 2, we introduce the models and propose the STEPS method for associating ST and genotype. In Section 3, we conduct extensive simulations to evaluate the properties of the proposed STEPS method. In Section 4, we apply the STEPS method to a real data example. Finally, Section 5 gives a brief summary.

2. Methods

In this section, we first briefly review three common approaches currently used in ST genetic association analyses under EPS designs. Then, we propose a joint set-valued model to characterize the relationships among primary and STs, genotype and covariates. Next, we use a prospective likelihood function to estimate model parameters and to construct a Wald test statistic for associating genotype and ST.

2.1. Three commonly used approaches

For ST genetic association analysis under EPS designs, three categories of approaches are widely used. The first is naïve linear/logistic regression (we call it LR for short throughout the article), which directly models the relationship between genotype and ST while disregarding the primary trait and the corresponding EPS design. The second is to apply STInline graphic to ST analysis under EPS designs. In the simulations below, we chose one STInline graphic method, SPREG (Lin and Zeng, 2009, http://dlin.web.unc.edu/software/spreg-2/), to illustrate the properties of this category. SPREG employs a logistic regression model and a retrospective likelihood conditional on disease status to handle case–control sampling in the analysis of STs. It controls the type I error rate at a liberal significance level of 0.05 but, in some situations such as common diseases and rare variants (RVs), not at more stringent significance levels such as Inline graphic. Through a profile likelihood approach, environmental covariates can also be incorporated into the model as a high-dimensional nuisance parameter. We did not consider the set-valued method proposed by Kang and others (2017) because it cannot incorporate covariates into the model, although that method provides more accurate type I error control and greater power, especially under stringent significance levels. The third is SEQTDS, proposed by Lin and others (2013), whose properties and limitations were described in Section 1.

2.2. Joint modeling of the primary and secondary traits

Suppose a cohort of Inline graphic subjects is randomly selected from a general population. For the Inline graphicth subject (Inline graphic), let Inline graphic denote a continuous primary trait and let Inline graphic denote Inline graphic covariates, which might include age, gender, genetic ancestry scores, and so on. We then select Inline graphic subjects with extremely large or extremely small primary trait values from the Inline graphic subjects for genotyping or sequencing. Let Inline graphic denote the genotype at a specific single nucleotide polymorphism (SNP) locus, Inline graphic. Let Inline graphic be a selection indicator, which equals one if the Inline graphicth subject is selected and zero otherwise. The indices of the Inline graphic subjects are re-ordered so that the first Inline graphic subjects are the selected ones. We assume that both the primary trait and the ST can be affected by genotype and covariates, and that the primary trait can be affected by the ST. This assumption is widely adopted for ST analysis in case–control study designs (Lin and Zeng, 2009; Kang and others, 2017).

If the ST is continuous, let Inline graphic denote the ST, which is a linear combination of Inline graphic and Inline graphic, and let the primary trait Inline graphic be a linear combination of Inline graphic, Inline graphic, and Inline graphic. Four cut-offs of Inline graphic are used to select subjects to genotype or sequence. Borrowing the idea of the set-valued model proposed for associating a binary primary trait with genotype (Kang and others, 2014), for the Inline graphicth subject (Inline graphic), the set-valued model under EPS is as follows.

graphic file with name M37.gif (2.1)

where Inline graphic and Inline graphic are intercept terms, Inline graphic and Inline graphic are regression coefficients for the SNP locus, Inline graphic and Inline graphic are vectors of regression coefficients for the Inline graphic covariates. Coefficient Inline graphic represents the effect size of ST on primary trait. Error terms Inline graphic and Inline graphic are assumed to be independent and identically distributed with a normal distribution with a mean of 0 and a variance of Inline graphic and Inline graphic, respectively. The cut-offs of Inline graphic are used to define the extreme large and small primary traits and the cut-offs of Inline graphic are used to define the outlier subjects with unreasonably large and small primary traits. If the study design does not exclude the outliers, then Inline graphic Inf and Inline graphic Inf, that is, Inline graphic. After substituting Inline graphic into the formula of Inline graphic, we can derive that model (2.1) above is exactly the same as bivariate models (2) and (3) in Lin and others, 2013. More details are in Appendix A2 of supplementary material available at Biostatistics online.
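The role of the four cut-offs can be summarized as a selection rule on the primary trait. The following is a minimal sketch under assumptions: the function and argument names are hypothetical, and which boundaries are inclusive is a choice made here, not something stated in the text. The infinite defaults correspond to a design that does not exclude outliers.

```python
import math


def eps_selected(x, c_low, c_high, out_low=-math.inf, out_high=math.inf):
    """Return True if primary trait value x falls in either extreme region.

    c_low / c_high: inner cut-offs defining the small / large extremes.
    out_low / out_high: outer cut-offs excluding implausible outliers;
    the infinite defaults mean no outlier exclusion.
    Boundary inclusiveness is an assumption of this sketch.
    """
    if not (out_low <= c_low < c_high <= out_high):
        raise ValueError("need out_low <= c_low < c_high <= out_high")
    return (out_low < x <= c_low) or (c_high <= x < out_high)


# Made-up trait values and cut-offs, for illustration only.
traits = [-4.2, -1.9, -0.3, 0.1, 1.7, 5.0]
flags = [eps_selected(x, c_low=-1.5, c_high=1.5, out_low=-4.0, out_high=4.0)
         for x in traits]
print(flags)  # [False, True, False, False, True, False]
```

With the outer cut-offs left at their defaults, the same call would also select the two most extreme values.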

If ST is dichotomous (e.g. binary variable of 1 or 0), we let Inline graphic denote a latent continuous variable and let Inline graphic denote the binary ST. If the latent variable Inline graphic is greater than the cut-off Inline graphic, then Inline graphic is 1, otherwise, Inline graphic is 0. Similar to equation (2.1), the set-valued model is as follows.

graphic file with name M63.gif (2.2)

where the model parameters (Inline graphic) are analogous to those in model (2.1). The error terms Inline graphic and Inline graphic are again independent and normally distributed with mean 0 and variances Inline graphic and Inline graphic, respectively.
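The thresholding of a normally distributed latent variable implies a probit-type probability for the binary ST. The sketch below illustrates this relationship; the names (linear_predictor, cutoff, sigma) are hypothetical, and the linear predictor stands in for whatever combination of intercept, genotype, and covariates the model specifies.

```python
import math
import random


def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_binary_st(linear_predictor, cutoff=0.0, sigma=1.0):
    """P(binary ST = 1) when latent = linear_predictor + N(0, sigma^2) noise
    and the binary trait is 1 whenever the latent value exceeds cutoff."""
    return 1.0 - normal_cdf((cutoff - linear_predictor) / sigma)


def draw_binary_st(linear_predictor, rng, cutoff=0.0, sigma=1.0):
    """Draw one binary ST by thresholding a simulated latent value."""
    latent = linear_predictor + rng.gauss(0.0, sigma)
    return 1 if latent > cutoff else 0


# Check the formula against simulation for an arbitrary linear predictor.
rng = random.Random(7)
draws = [draw_binary_st(0.5, rng) for _ in range(20000)]
empirical = sum(draws) / len(draws)
print(f"formula: {prob_binary_st(0.5):.3f}, simulated: {empirical:.3f}")
```

Because only the ratio of the shifted cut-off to the error standard deviation enters the probability, some latent-scale quantity has to be fixed for the remaining parameters to be identifiable.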

2.3. Maximum likelihood estimate (MLE) and Wald statistics

We propose a maximum likelihood estimation method based on a prospective likelihood function. If the ST is continuous, the likelihood function is

graphic file with name M69.gif (2.3)

where Inline graphic and Inline graphic are the secondary and primary trait values of the Inline graphicth subject, respectively. If the ST is dichotomous, the likelihood function is

graphic file with name M73.gif (2.4)

where Inline graphic and Inline graphic are the secondary and primary trait values of the Inline graphicth subject, respectively. The detailed derivation of the probabilities is given in Appendix A3 of supplementary material available at Biostatistics online.

We employ a quasi-Newton algorithm to optimize the likelihood function, implemented with the R function optim() using method “BFGS”. For model (2.1), we estimate the parameters (Inline graphic) that maximize the likelihood function (2.3). For model (2.2), we fix Inline graphic and estimate the parameters (Inline graphic) that maximize the likelihood function (2.4). The R package also provides a simple method to estimate the cutoffs Inline graphic and Inline graphic when this information is unknown (Appendix A4 of supplementary material available at Biostatistics online).

The null hypothesis for testing the association between genotype and ST is Inline graphic, and the alternative hypothesis is Inline graphic. We propose a Wald test statistic, Inline graphic, which follows a Inline graphic distribution with 1 degree of freedom under Inline graphic, provided the regularity conditions hold. Here Inline graphic is the MLE described above and Inline graphic is obtained from the Fisher information matrix. The closed form of the Fisher information matrix and the proof related to the regularity conditions are given in Appendices A3 and A5 of supplementary material available at Biostatistics online.
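A one-degree-of-freedom Wald test of this kind can be computed from an estimate and its standard error alone. The sketch below assumes the standard form of the statistic, the squared ratio of the estimate to its standard error, and takes both as given inputs; it does not reproduce the STEPS likelihood or the closed-form Fisher information. The function name and example numbers are hypothetical.

```python
import math


def wald_test(estimate, std_error):
    """Return (Wald statistic, p-value) for H0: coefficient = 0.

    The statistic is (estimate / std_error)^2, referred to a chi-square
    distribution with 1 degree of freedom. For 1 df the upper-tail
    probability equals erfc(sqrt(W / 2)).
    """
    if std_error <= 0:
        raise ValueError("std_error must be positive")
    w = (estimate / std_error) ** 2
    p_value = math.erfc(math.sqrt(w / 2.0))
    return w, p_value


# Made-up estimate and standard error, for illustration only.
w, p = wald_test(estimate=0.12, std_error=0.05)
print(f"W = {w:.3f}, p = {p:.4f}")
```

In practice the standard error would come from the square root of the relevant diagonal element of the inverse Fisher information matrix evaluated at the MLE.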

3. Simulations

We conducted extensive simulations in three parts. In part 1, we compared STEPS with three methods (LR, SPREG, and SEQTDS) in terms of type I error control and power at a liberal significance level Inline graphic. In part 2, we evaluated only STEPS, in terms of type I error control and power under more comprehensive parameter settings at the more stringent significance levels Inline graphic and Inline graphic, because the simulations in part 1 show that the other three methods cannot control the type I error in some situations. In GWAS/NGS, polygenicity is a well-known phenomenon in which many weak effects collectively contribute to the trait. ST analysis is even more complex because the polygenic architecture could affect both primary and STs. To the best of our knowledge, the effect of polygenic architecture on ST analysis has not been discussed comprehensively. Thus, in part 3, we evaluated STEPS under a polygenic architecture in terms of type I error control and power at significance levels Inline graphic, Inline graphic, and Inline graphic. For SEQTDS, we adopted two ways of supplying primary traits for the subjects without available genotype or STs: one uses the true primary trait, and the other imputes it from the marginal normal distribution (details are in Appendix A6 of supplementary material available at Biostatistics online).

3.1. Simulation process

For each replication, we first generated Inline graphic genotypes following Hardy–Weinberg equilibrium given minor allele frequency (MAF) of the tested SNP. Then, we simulated covariates and model error terms Inline graphic following independent standard normal distribution. Next, primary and STs were simulated based on model (2.1) or (2.2), depending on the type of ST. In this section, we simulated Inline graphic covariate and fixed parameters Inline graphic. The upper Inline graphic quantile and the lower Inline graphic quantile of Inline graphic primary traits were selected as cutoffs Inline graphic and Inline graphic, so that Inline graphic subjects in the cohort were retained based on EPS as the study sample.
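The first and last steps of this process, drawing genotypes under Hardy–Weinberg equilibrium and deriving cut-offs from empirical tail quantiles, can be sketched as follows. This is an illustrative sketch only: the function names, cohort size, and seed are assumptions, and the trait-generation step of models (2.1) and (2.2) is deliberately left out.

```python
import random


def simulate_genotypes(n, maf, rng):
    """Draw n genotypes (minor-allele counts 0/1/2) under HWE.

    Under HWE the two alleles are independent, so the genotype is the sum of
    two Bernoulli(maf) draws, with probabilities (1-maf)^2, 2*maf*(1-maf),
    and maf^2 for 0, 1, and 2 copies.
    """
    if not 0.0 < maf <= 0.5:
        raise ValueError("maf must be in (0, 0.5]")
    return [(rng.random() < maf) + (rng.random() < maf) for _ in range(n)]


def tail_cutoffs(values, tail):
    """Empirical cut-offs such that a fraction `tail` of values lies at or
    below the lower one and a fraction `tail` at or above the upper one."""
    ordered = sorted(values)
    k = int(tail * len(ordered))
    if k < 1:
        raise ValueError("tail fraction too small for this sample")
    return ordered[k - 1], ordered[-k]


rng = random.Random(2018)
genotypes = simulate_genotypes(50000, 0.3, rng)
print(sum(genotypes) / len(genotypes))    # close to 2 * MAF = 0.6
print(tail_cutoffs(range(1, 101), 0.10))  # (10, 91)
```

Selecting the subjects at or beyond these two cut-offs retains a fraction of twice the tail proportion of the cohort as the study sample.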

For comparisons of STEPS with LR, SPREG, and SEQTDS, we fixed the sample size Inline graphic and increased Inline graphic from −0.7 to 0.7 in increments of 0.1. We considered Inline graphic with Inline graphic and Inline graphic with Inline graphic or Inline graphic, for which the heritability of the ST (Inline graphic, i.e. the proportion of phenotypic variation of Inline graphic attributable to Inline graphic) is 0.8%. For each parameter setting, 10,000 replications were simulated to assess the type I error rate and power at a liberal significance level Inline graphic, as well as the parameter estimates, mean squared error, and coverage probability of 95% Wald-type confidence intervals for the genetic effect on ST.
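Heritability in this sense, the share of trait variance attributable to a single SNP, follows from the per-allele effect and the genotype variance. The sketch below assumes additive coding, Hardy–Weinberg equilibrium, and a genotype that is independent of the other variance sources; the function name and inputs are hypothetical, and the example values are not the paper's settings.

```python
def snp_heritability(beta, maf, other_variance):
    """Share of trait variance explained by one additively coded SNP.

    beta: per-allele effect on the trait.
    maf: minor allele frequency; under HWE the genotype variance is
         2 * maf * (1 - maf).
    other_variance: variance from everything else (covariates plus error),
         assumed independent of the genotype.
    """
    if not 0.0 < maf <= 0.5:
        raise ValueError("maf must be in (0, 0.5]")
    if other_variance <= 0:
        raise ValueError("other_variance must be positive")
    genetic_variance = beta ** 2 * 2.0 * maf * (1.0 - maf)
    return genetic_variance / (genetic_variance + other_variance)


# Made-up inputs, for illustration only.
h2 = snp_heritability(beta=0.1, maf=0.3, other_variance=1.0)
print(f"heritability = {100 * h2:.2f}%")
```

For a fixed heritability, a lower MAF therefore requires a larger per-allele effect, which is why effect sizes and MAFs are specified jointly in the settings above.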

For examination of STEPS at the more stringent significance levels Inline graphic and Inline graphic under Inline graphic with Inline graphic, we considered coefficient Inline graphic of 0.4 and −0.4 to simulate different effects of genotype on the primary trait, coefficient Inline graphic of −0.7, −0.4, 0, 0.4, and 0.7 to simulate different effect sizes and directions of the ST on the primary trait, and Inline graphic, and 0.01 to simulate different EPS designs. Three MAFs of 0.3, 0.05, and 0.005 were used to simulate common variants, less common variants (LCV), and RVs, respectively. For continuous (binary) ST, we fixed sample sizes Inline graphic (2000). For each parameter setting, we evaluated type I error rates with Inline graphic replications. We also designed three simulation scenarios to evaluate power under different parameter settings at stringent significance levels Inline graphic based on 10,000 replications (details in Appendix A7 of supplementary material available at Biostatistics online).

For assessment of the effect of the polygenic architecture on STEPS, the primary trait is assumed to be affected by 100 causal SNPs in four regions (25 SNPs/region), with no, weak, moderate, or strong linkage disequilibrium (LD), respectively. This means that the genetic effect on the primary trait Inline graphic in models (2.1) and (2.2) is replaced by Inline graphic, where Inline graphic are genotypes of the 100 causal SNPs affecting the primary trait with effect sizes of Inline graphic. The genotypes of the 100 causal SNPs in the four regions were simulated independently based on the R code simRareSNP.R (http://www.biostat.umn.edu/~weip/prog/BasuPanGE11/simRareSNP.R, Basu and Pan, 2011; Wang, 2016) with a fixed MAF of 0.3 and parameter rho = 0, 0.3, 0.6, and 0.9 to simulate the four LD regions, respectively. For the ST, we considered two scenarios. (i) None of the 100 SNPs is associated with the ST, Inline graphic; we randomly selected four SNPs, one from each of the four regions, for association testing with the ST. (ii) Four SNPs, one randomly selected from each of the four regions, are causal SNPs of the ST, and all others are not associated with the ST. This means that the genetic effect on the ST Inline graphic in models (2.1) and (2.2) is replaced by Inline graphic, where Inline graphic are genotypes of the four selected causal SNPs affecting the ST with an effect size of Inline graphic. We considered Inline graphic and Inline graphic for the four selected causal SNPs of the ST, which corresponds to an overall heritability of 3.64% (0.91% for each causal SNP). For both scenarios, we let Inline graphic to simulate weak polygenic effects, with a heritability of the primary trait per causal SNP of less than 0.1% (Inline graphic for 100 SNPs are from 21.6% to 36.4%). Five Inline graphic values ranging from −0.7 to 0.7 were considered.

We tested the associations between the ST and each of the four selected SNPs and estimated the type I error rate and power of STEPS for scenarios 1 and 2 as the proportions of replicates with P-values Inline graphic and Inline graphic. For scenario 2, besides the four causal SNPs of the ST, we also randomly selected another four non-causal SNPs of the ST, one from each of the four LD regions; three of these non-causal SNPs are in LD with the three causal SNPs of the ST located in the three LD regions. We tested their associations with the ST and reported the proportions of replicates with P-values Inline graphic for each SNP. Here, Inline graphic and 50 000 replicated datasets were simulated for scenarios 1 and 2, respectively, with a given sample size of 1000. We also generated MAFs for the 100 SNPs randomly from a uniform distribution U(0.05, 0.5), instead of using a fixed MAF of 0.3 across all 100 SNPs, with all other parameters exactly the same; the conclusions were similar (data not shown).

3.2. Comparison of STEPS, LR, SPREG, and SEQTDS methods

Figures 1 and 2 show the simulation results for continuous and binary STs with two EPS designs Inline graphic (Figures 1A–D and 2A–D) and Inline graphic (Figures 1E–H and 2E–H) given Inline graphic. As SPREG cannot output P-values for many replications when Inline graphic and Inline graphic, we only show results for Inline graphic in Figures 1 and 2. Whether the ST is continuous (Figures 1A–B and 1E–F) or binary (Figures 2A–B and 2E–F), STEPS always gave accurate parameter estimates (Table S1 of Supplementary Materials available at Biostatistics online), which leads to correct type I error control at Inline graphic regardless of Inline graphic. However, as expected, LR and SPREG were invalid unless the primary trait is not correlated with the ST, i.e., Inline graphic, because their parameter estimates are biased by disregarding or inappropriately handling the EPS design. SPREG performs better than LR for either binary or continuous STs, which indicates that treating EPS as a “case–control” study helps parameter estimation and type I error control to some extent. Under EPS design Inline graphic, SEQTDS generally performed similarly to STEPS. Only when Inline graphic is greater than 0.4 is the Inline graphic from SEQTDS slightly biased, which leads to an inflated type I error rate (Table S6 of Supplementary Materials available at Biostatistics online). Under the more extreme EPS design Inline graphic, SEQTDS could not control the type I error for either continuous or binary STs if Inline graphic. The inflated type I error is again due to the biased estimate Inline graphic. For example, when Inline graphic, SEQTDS gave Inline graphic of −0.05 (−0.06) and type I error rates of 0.82 (0.14) for binary (continuous) STs at Inline graphic (Table S6 of Supplementary Materials available at Biostatistics online).

Fig. 1.


Continuous STs: comparisons of STEPS, LR, SPREG, and SEQTDS methods at a significance level of 0.05 based on 10,000 replications. Inline graphic, sample size Inline graphic. (A–D) Inline graphic; (E–H) Inline graphic.

Fig. 2.


Binary STs: comparisons of STEPS, LR, SPREG, and SEQTDS methods at a significance level of 0.05 based on 10,000 replications. Inline graphic, sample size Inline graphic. (A–D) Inline graphic; (E–H) Inline graphic.

Under Inline graphic, whether the ST is continuous (Figures 1C–D and 1G–H) or binary (Figures 2C–D and 2G–H), STEPS always gave accurate parameter estimates and power at Inline graphic regardless of Inline graphic. However, the parameter estimates Inline graphic from LR and SPREG varied considerably, increasing with Inline graphic and approaching the true Inline graphic when Inline graphic (the trend is similar to that under Inline graphic). We also performed similar simulations with Inline graphic (Figures S2 and S3 of Supplementary Materials available at Biostatistics online). Under sampling design Inline graphic, SEQTDS generally performed similarly to STEPS if the ST is continuous. Only when Inline graphic did SEQTDS have slightly lower power than STEPS (Table S6 of Supplementary Materials available at Biostatistics online). When the ST is binary, the power of SEQTDS would be greater than that of STEPS if true Inline graphic and Inline graphic (sign(Inline graphic)Inline graphicsign(Inline graphic)Inline graphicsign(Inline graphic)Inline graphic), and would be less than that of STEPS if true Inline graphic and Inline graphic (sign(Inline graphic)Inline graphicsign(Inline graphic)Inline graphicsign(Inline graphic)Inline graphic). However, the greater power of LR, SPREG, and SEQTDS here should be interpreted cautiously because of their uncontrolled type I error rates.

SEQTDS performs similarly to STEPS in some cases. However, under a very extreme EPS design (very small Inline graphic) or for a binary ST, its parameter estimate is biased and its type I error rate is inflated. Simulations also show that the performance of SEQTDS under the two imputation methods is almost the same, which indicates that the primary trait can be reasonably imputed if it truly follows an approximately normal distribution. SEQTDS is not designed for binary ST analysis, so its unstable performance for binary STs is expected. More importantly, the simulations showed that SEQTDS is not as stable as STEPS even for continuous STs when Inline graphic. This finding is striking because the model in Lin and others (2013) is the same as model (2.1) in this article after a transformation (see Section 2.2). To confirm the consistency between the models, we also conducted simulations following the model in Lin and others (2013) and verified that SEQTDS cannot control the type I error rate at Inline graphic under some specific parameter settings even under its own model (Appendix A2 of supplementary material available at Biostatistics online). Although SEQTDS and STEPS assume the same model, SEQTDS is based on a nonparametric likelihood function and STEPS on a parametric likelihood function, which is likely the main reason that the two methods perform differently.

3.3. Type I error of STEPS

STEPS controlled the type I error rate at Inline graphic and Inline graphic across all parameter settings for both continuous (Table 1) and binary STs (Table S3 of Supplementary Materials available at Biostatistics online). This results from the fact that the mean of the estimated parameter Inline graphic is very close to 0 and the empirical standard deviation sd(Inline graphic) is very close to the mean of the estimated standard error Inline graphic (Tables S2 and S4 of Supplementary Materials available at Biostatistics online). Together, these ensure that the Wald statistic closely follows the chi-square distribution, so that its type I error is controlled. It is striking that STEPS can control type I error rates even for RVs (MAF = 0.005) at a significance level as stringent as Inline graphic.
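Type I error is summarized here as the ratio of the empirical rejection rate to the nominal level, so it helps to know how much such a ratio fluctuates by chance. The sketch below computes the ratio from a list of null P-values and the Monte Carlo standard error of that ratio under a correctly calibrated test. The function names are hypothetical, and the example uses a level and replication count chosen for illustration, since the values used for Table 1 are not recoverable from this text.

```python
import math


def type1_ratio(p_values, alpha):
    """Empirical type I error rate divided by the nominal level alpha."""
    rejections = sum(1 for p in p_values if p < alpha)
    return (rejections / len(p_values)) / alpha


def ratio_mc_se(alpha, n_replications):
    """Monte Carlo standard error of the ratio when the true rejection rate
    equals alpha: binomial standard error of the proportion, divided by alpha."""
    return math.sqrt(alpha * (1.0 - alpha) / n_replications) / alpha


# Illustrative values only.
print(type1_ratio([0.01, 0.20, 0.50, 0.04], alpha=0.05))
print(round(ratio_mc_se(0.05, 10000), 4))
```

Because the standard error scales roughly as one over the square root of alpha times the number of replications, ratios at very stringent levels are expected to scatter more widely around 1 than ratios at liberal levels for the same number of replications.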

Table 1.

Ratio of the empirical type I error rates to significance levels of Inline graphic based on Inline graphic replications given Inline graphic

Inline graphic MAF Inline graphic Inline graphic
    Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic
Inline graphic
0.2 0.3 1.026 1.027 1.022 1.029 1.022 1.110 1.190 1.260 1.140 1.160
  0.05 1.030 1.029 1.042 1.029 1.028 1.250 0.930 1.200 1.190 1.090
  0.005 1.045 1.083 1.038 1.059 1.039 0.920 1.190 1.230 1.090 1.160
0.1 0.3 1.023 1.034 1.028 1.030 1.027 1.040 1.180 1.070 1.190 1.280
  0.05 1.022 1.029 1.036 1.036 1.029 0.960 1.180 1.070 1.280 1.060
  0.005 1.021 1.066 1.038 1.045 1.032 0.990 1.130 1.420 1.200 0.990
0.05 0.3 1.024 1.028 1.027 1.027 1.034 1.040 1.130 1.090 1.030 1.110
  0.05 1.028 1.027 1.032 1.026 1.028 1.250 1.410 1.120 1.060 1.250
  0.005 1.015 1.069 1.034 1.035 1.006 0.960 1.160 1.080 0.870 1.200
0.01 0.3 1.026 1.028 1.030 1.033 1.033 1.170 1.190 1.230 1.310 1.010
  0.05 1.027 1.028 1.033 1.029 1.026 1.160 1.220 1.000 1.350 1.150
  0.005 0.959 1.055 1.040 1.001 0.967 1.240 0.950 1.060 1.160 1.290
Inline graphic
0.2 0.3 1.028 1.028 1.021 1.022 1.023 1.050 1.060 1.070 1.250 0.980
  0.05 1.027 1.033 1.041 1.030 1.030 1.290 1.210 1.310 1.030 1.010
  0.005 1.046 1.078 1.032 1.057 1.041 0.820 1.230 1.330 1.160 0.930
0.1 0.3 1.026 1.031 1.031 1.031 1.031 1.350 1.240 1.110 1.200 1.010
  0.05 1.023 1.027 1.030 1.028 1.027 1.090 1.230 0.970 1.110 1.110
  0.005 1.027 1.072 1.030 1.038 1.027 0.850 1.000 1.380 0.940 0.760
0.05 0.3 1.028 1.028 1.032 1.027 1.028 1.320 1.360 1.300 1.300 1.110
  0.05 1.030 1.032 1.031 1.030 1.027 1.090 1.180 1.130 1.400 1.020
  0.005 1.015 1.071 1.045 1.038 1.006 0.810 1.040 1.250 1.100 0.930
0.01 0.3 1.023 1.029 1.028 1.030 1.035 1.050 1.170 1.090 1.170 1.110
  0.05 1.021 1.031 1.032 1.030 1.033 0.900 1.200 1.040 1.060 1.170
  0.005 0.958 1.049 1.041 1.007 0.965 0.910 1.130 1.190 1.030 1.28

3.4. Power of STEPS

For continuous STs, the power of STEPS increases with effect size Inline graphic (Figures 3A–C) and with sample size Inline graphic (Figures 3D–F). Interestingly, a smaller Inline graphic gives greater power for the same sample size, especially for an LCV with MAF of 0.05 and an RV with MAF of 0.005, because minor alleles are more enriched in the selected samples. For example, for an RV with MAF of 0.005, Inline graphic, and Inline graphic, if the top and bottom Inline graphic (Inline graphic) of a cohort of 2500 (50000) individuals are selected under EPS, i.e., Inline graphic, then the mean number of minor alleles captured in the study sample is 13.98 (37.34), with corresponding power of 0.256 (0.779) (Figure 3F). This strongly supports the assumption that rare causal variants are likely to be enriched in samples with more extreme phenotypes, so that EPS designs capture these causal RVs for primary traits and STs with higher probability and have greater statistical power to detect them. However, this advantage does not hold when Inline graphic, i.e., different Inline graphic correspond to similar minor allele counts and power (Figures S4 and S5 of Supplementary Materials available at Biostatistics online). In sharp contrast, as Inline graphic diverges from Inline graphic, the MAFs estimated in the selected samples increase and the corresponding power also increases, especially for LCV and RV. The patterns of change in MAF and in power are similar (Figure S4 of Supplementary Materials available at Biostatistics online). Hence, the change in power under EPS is partially explained by the minor allele counts in the selected study samples. In addition, Table S2 of Supplementary Materials available at Biostatistics online shows that Inline graphic increases as MAF decreases, which indicates a loss of power. Similarly, as Inline graphic decreases, Inline graphic decreases, especially for LCV and RV, which explains why a smaller Inline graphic corresponds to greater power given Inline graphic.
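The enrichment argument above can be illustrated with a small simulation. The Python sketch below is not the simulation design used in this paper: the additive model for the primary trait, the effect size `beta = 1.0`, the 250 subjects per tail, and all function names are hypothetical choices made only for illustration. It draws a cohort, keeps the same number of subjects from each tail, and averages the minor allele count in the selected sample over a few replicates.

```python
import random


def minor_alleles_in_extremes(n_cohort, n_per_tail, maf, beta, rng):
    """Simulate one cohort and count minor alleles among the selected extremes.

    Assumed (hypothetical) model: genotype g ~ Binomial(2, maf) and
    primary trait y = beta * g + N(0, 1) noise.
    """
    people = []
    for _ in range(n_cohort):
        g = int(rng.random() < maf) + int(rng.random() < maf)
        y = beta * g + rng.gauss(0.0, 1.0)
        people.append((y, g))
    people.sort()  # order subjects by primary trait value
    selected = people[:n_per_tail] + people[-n_per_tail:]
    return sum(g for _, g in selected)


def mean_minor_alleles(n_cohort, n_per_tail, maf, beta, n_rep=10, seed=2018):
    """Average the selected-sample minor allele count over n_rep replicates."""
    rng = random.Random(seed)
    counts = [
        minor_alleles_in_extremes(n_cohort, n_per_tail, maf, beta, rng)
        for _ in range(n_rep)
    ]
    return sum(counts) / n_rep


# Same study sample size (500), drawn from a small or a large cohort.
expected_under_random_sampling = 2 * 0.005 * 500
for n_cohort in (2500, 50000):
    mean_count = mean_minor_alleles(n_cohort, 250, maf=0.005, beta=1.0)
    print(n_cohort, mean_count, expected_under_random_sampling)
```

Under these assumed settings, a more extreme selection (the same 500 subjects taken from the larger cohort) should capture more minor alleles on average than either the less extreme selection or simple random sampling, which is the mechanism the text uses to explain the power gain.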

Fig. 3.

Power of STEPS as a function of effect size and sample size. For (A–C), sample size is fixed at Inline graphic; for (D–F), effect size is fixed at Inline graphic, and 1, respectively. The results are the empirical power at a significance level of Inline graphic based on Inline graphic replications. Inline graphic.

3.5 Polygenic architecture effect on STEPS

To the best of our knowledge, this is the first time that polygenic architecture has been comprehensively considered in the context of ST association testing. The results are given in Table S5 of Supplementary Materials available at Biostatistics online. Strikingly, under polygenic architecture, STEPS still controls type I error rates at Inline graphic and Inline graphic for each of the four tested SNPs selected in four regions for scenario 1, regardless of whether the tested SNP is in LD with the other causal SNPs of the primary trait. This results from the facts that (i) all the estimates of the coefficients related to the ST are unbiased; and (ii) the effects of the other causal SNPs on the primary trait can be absorbed into the error term and into the effect of the tested SNP on the primary trait, with the decomposition depending on the LD structure between the tested SNP and the other causal SNPs of the primary trait, as shown by simulations (Table S10 of Supplementary Materials available at Biostatistics online). For scenario 2, for the non-causal SNP selected in the no-LD region, the proportion of replicates in which the null hypothesis was rejected was very close to Inline graphic. This means that the type I error rate is also correctly controlled for SNPs that are non-causal for the ST, are causal for the primary trait, and are in linkage equilibrium with the causal SNPs of the ST, for the reason that Inline graphic, the effect of the four causal SNPs of the ST can be absorbed into the error term.

The power of STEPS for each of the four causal SNPs under the different LD scenarios was also quite similar given the same effect size (same Inline graphic). Interestingly, for the other three non-causal SNPs selected in the small, moderate, and strong LD regions under Inline graphic, the proportion of replicates in which the null hypothesis was rejected was largest for the non-causal SNP in the strong LD region and smallest for the one in the small LD region, with the one in the moderate LD region in between. For example, given Inline graphic, the power for testing the four causal SNPs of a binary ST in the four LD regions was 0.42 to detect an Inline graphic with a sample size of 1000. In sharp contrast, the proportions of replicates in which the non-causal SNP was rejected were 0.18, 0.05, and 0.02 when the non-causal SNP was in strong, moderate, and small LD with the causal SNP, respectively.
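The quantities reported in this section (empirical type I error and empirical power) are both proportions of simulation replicates in which the test rejects. As a minimal sketch, the following Python function computes that proportion together with its binomial Monte Carlo standard error. The function name and the toy P-values are hypothetical; they are constructed only so that the rejection proportion equals the 0.42 quoted above, and they are not data from our simulations.

```python
import math


def rejection_rate(p_values, alpha):
    """Proportion of replicates with P-value below alpha, and its Monte Carlo SE."""
    n = len(p_values)
    rate = sum(p < alpha for p in p_values) / n
    se = math.sqrt(rate * (1.0 - rate) / n)  # binomial standard error
    return rate, se


# Toy input: 42 rejections out of 100 replicates at alpha = 0.05.
toy_p_values = [0.01] * 42 + [0.50] * 58
rate, se = rejection_rate(toy_p_values, alpha=0.05)
print(rate, se)
```

The standard error shrinks with the square root of the number of replicates, so comparing an empirical type I error rate with a stringent nominal level requires far more replicates than comparing power values.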

4. Application to a GWAS of BEN

BEN is a clinical condition characterized by a relative reduction in neutrophil count. Hence, WBC is a common continuous indicator of BEN. Here, we applied the four methods above to a GWAS of BEN in which around 1000 subjects with low WBC (at the lowest 1–7th percentile) or high WBC (at the 85th to 95th percentile) were selected from a large national cohort study of over 14,000 African-Americans. A fuller description of the dataset is available in dbGaP (https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000507.v1.p1).

We considered seven STs, namely C-reactive protein (CRP), triglycerides (TL), platelet count (PC), high-density lipoprotein (HDL), low-density lipoprotein (LDL), total cholesterol (TC), and Albumin serum (ALS), and eight covariates: age, smoking status, gender, and the top five principal components. We used log-transformed CRP, TC, TL, and PC, square-root-transformed HDL and LDL, and untransformed ALS (Ma and others, 2010; Bryant and others, 2014; Oh and others, 2014; Ligthart and others, 2016; Prins and others, 2017; Zhu and others, 2017). After removing subjects with missing WBC or any missing covariate, and SNPs with MAF less than 0.005, we retained 980 genetically independent subjects with 677,755 SNPs. Of the seven STs, PC, TL, and CRP are positively correlated with WBC and HDL is negatively correlated with WBC in the study sample (Table S7 of Supplementary Materials available at Biostatistics online). For STEPS, cutoffs of Inline graphic were given based on the distribution of WBC in the study sample.

As an ST analysis method designed for case–control studies, SPREG requires a critical pre-specified parameter, the population prevalence. Although we can simply treat subjects with low WBC as cases and subjects with high WBC as controls, it is not obvious how to set the prevalence parameter, since any value is biased relative to the true sampling process. Table S8 of Supplementary Materials available at Biostatistics online shows that when a prevalence of 0.07 was used to match the lowest 1–7th percentile as “cases”, SPREG had an inflation factor greater than 2 for CRP; when a prevalence of 0.5 was used, almost all P-values were 1; and a prevalence of 0.25 gave the most reasonable results in terms of QQ plot and inflation factor. The Manhattan and QQ plots are shown in Figure S6 of Supplementary Materials available at Biostatistics online. All four methods gave reasonable QQ plots and had inflation factors between 0.99 and 1.08 (Table S8 of Supplementary Materials available at Biostatistics online).
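For readers unfamiliar with the inflation factor, the sketch below shows one common way to compute it from a vector of P-values. It assumes the usual genomic-control definition (the median of the 1-df chi-square statistics divided by the median of the chi-square distribution with 1 df, about 0.455); the text above does not state which definition was used, and the function name is hypothetical.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()
# Median of a chi-square(1) variable: square of the 75th normal percentile.
_CHI2_1DF_MEDIAN = _STD_NORMAL.inv_cdf(0.75) ** 2


def inflation_factor(p_values):
    """Genomic-control lambda from two-sided P-values in (0, 1]."""
    # A two-sided P-value p corresponds to a 1-df chi-square statistic z**2,
    # where z is the normal quantile at p / 2.
    chi2_stats = [_STD_NORMAL.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return median(chi2_stats) / _CHI2_1DF_MEDIAN


# Evenly spaced P-values mimic a well-calibrated test: lambda is close to 1.
uniform_p = [(i + 0.5) / 1001 for i in range(1001)]
print(inflation_factor(uniform_p))
```

A value well above 1 (such as the value greater than 2 seen for SPREG with a prevalence of 0.07) indicates that the bulk of the test statistics is larger than expected under the null, whereas P-values piled up near 1 drive the factor toward 0.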

To demonstrate the effectiveness of STEPS in analyzing the real BEN data and its potential usefulness in GWAS/NGS, we summarized all significant SNPs, defined as those with P-values Inline graphic, into three groups based on existing results in the GWAS Catalog (https://www.ebi.ac.uk/gwas/) in Table 2. If a significant SNP is within a reported gene or in an intergenic region adjacent to a reported gene, the association is considered highly possible positive. If a significant SNP is within a gene whose adjacent gene has been reported, the association is considered medium possible positive. Otherwise, the association is considered lowly possible positive. For the three STs not correlated with WBC (ALS, TC, and LDL), as expected, all four methods gave similar results in terms of QQ plots and identified significant SNPs. This is consistent with the simulation results that LR, SPREG, and SEQTDS can be used to analyze STs under EPS designs when the primary trait is not associated with the ST. However, for the four STs correlated with WBC (PC, TL, CRP, and HDL), STEPS identified more highly or medium possible positive SNPs than the other three methods, and a similar number of lowly possible positive SNPs. Furthermore, among these SNPs, seven correspond to sign(Inline graphic)Inline graphicsign(Inline graphic)Inline graphicsign(Inline graphic)Inline graphic. Their P-values by STEPS are the smallest among all four methods. This is fully consistent with the simulation results that LR, SPREG, and SEQTDS are less powerful to detect associations of STs with SNPs if sign(Inline graphic)Inline graphicsign(Inline graphic)Inline graphicsign(Inline graphic)Inline graphic. For the other five SNPs, P-values by STEPS are still smaller than those by SEQTDS, although not always smaller than those by LR and/or SPREG.
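The three-group rule described above can be written as a short decision function. The Python sketch below is one possible encoding, not the procedure we ran: the function name, its arguments, and the representation of "adjacent" genes as a plain list are hypothetical, and treating an upstream SNP as intergenic is an assumption made for the example.

```python
def classify_association(host_gene, adjacent_genes, reported_genes):
    """Classify a significant SNP as 'highly', 'medium', or 'lowly' possible positive.

    host_gene      -- gene containing the SNP, or None if the SNP is intergenic
    adjacent_genes -- genes adjacent to the SNP (if intergenic) or to its host gene
    reported_genes -- genes already reported for this trait in a catalog
    """
    reported = set(reported_genes)
    has_reported_neighbor = bool(reported & set(adjacent_genes))
    if host_gene is not None and host_gene in reported:
        return "highly"  # SNP lies within a reported gene
    if host_gene is None and has_reported_neighbor:
        return "highly"  # intergenic SNP next to a reported gene
    if host_gene is not None and has_reported_neighbor:
        return "medium"  # SNP lies in a gene whose neighbor is reported
    return "lowly"


# Gene names taken from the text; the neighbor lists are simplified.
print(classify_association(None, ["CRP"], {"CRP"}))
print(classify_association("MRVI1", ["AMPD3"], {"AMPD3"}))
print(classify_association("XDH", ["EHD3"], {"EHD3"}))
```

In practice the definition of adjacency (for example, a distance window or the nearest flanking genes) would have to be fixed before such a rule is applied genome-wide.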

Table 2.

Comparison of the analysis results of seven STs in a GWAS of BEN data

ST   | Highly possible positive SNPs           | Medium possible positive SNPs           | Lowly possible positive SNPs
     | LR  SEQTDS  SPREG  STEPS  Reported gene | LR  SEQTDS  SPREG  STEPS  Adjacent gene | LR  SEQTDS  SPREG  STEPS
ST is not correlated with WBC
LDL  | 1   1       1      1      APOE          | 0   0       0      0                    | 0   0       0      0
ALS  | 0   0       0      0                    | 0   0       0      0                    | 0   0       0      0
TC   | 0   0       0      0                    | 0   0       0      0                    | 0   0       0      0
ST is positively correlated with WBC
PC   | 0   0       0      0                    | 0   0       0      1      EHD3          | 0   0       0      0
TL   | 2   2       2      2      MIR148A       | 0   0       0      0                    | 1   0       0      0
CRP  | 1   3       2      4      CRP           | 0   0       0      0                    | 1   0       1      1
ST is negatively correlated with WBC
HDL  | 0   0       0      0                    | 0   0       0      1      AMPD3         | 0   0       0      0

ST, secondary trait; CRP, C-reactive protein; TL, triglycerides; PC, platelet count; HDL, high-density lipoprotein; LDL, low-density lipoprotein; TC, total cholesterol; ALS, Albumin serum.

Ridker and others (2008) reported the association between CRP and SNP rs3091244, which is located upstream of the CRP gene. Only STEPS identified this association at a significance level of Inline graphic, while the P-values of SPREG, LR, and SEQTDS were greater than the cutoff (Table S9 of Supplementary Materials available at Biostatistics online). STEPS also uniquely identified two novel SNPs located in known regions for HDL and PC. For HDL, STEPS identified SNP rs1035691, which is located in an intron of the MRVI1 gene. The gene is on cytoband 11p15.4 and is adjacent to the AMPD3 gene, in which several SNPs have been reported to be associated with HDL (Teslovich and others, 2010; Willer and others, 2013; Spracklen and others, 2017). In addition, Webb and others (2017) reported an association between MRVI1 and coronary artery disease. For PC, STEPS identified SNP rs207444, which is located in an intron of the XDH gene. The gene is on cytoband 2p23.1 and is adjacent to the EHD3 gene, in which several SNPs have been reported to be associated with PC (Astle and others, 2016). O'Byrne and others (2000) also reported a potential relationship among platelets, xanthine oxidoreductase (XO), and xanthine dehydrogenase (XDH). Furthermore, for some reported SNPs such as rs726640, although all four methods identified the association with CRP, STEPS gave the smallest P-value (Table S9 of Supplementary Materials available at Biostatistics online). All of this evidence indicates that STEPS could be more effective and powerful than the other three methods in identifying SNPs truly associated with STs under EPS.

5. Discussion

We have proposed a novel method, STEPS, to test for association between binary or continuous STs and genetic variants under different EPS designs. To the best of our knowledge, no existing statistical method appropriately takes the EPS design into account when only study data are available, although EPS designs are widely adopted in many GWAS and NGS projects. Currently, to test associations between STs and genetic variants, naïve generalized regression or ST association methods developed for case–control designs are often used; these have been shown to be invalid both theoretically and empirically when the two traits are correlated. The nonparametric likelihood method (SEQTDS) (Lin and others, 2013) can be used to analyze STs under EPS but cannot control type I error in some situations and can have smaller power than STEPS. Compared with the existing methods, STEPS accounts for the EPS design more appropriately and therefore yields unbiased parameter estimates and better type I error control at both liberal (0.05) and stringent (Inline graphic) significance levels. In addition, in some situations, STEPS is more powerful than the existing methods, while the latter cannot control type I error rates.

Most complex traits have a polygenic architecture. That is, the primary trait can be affected by hundreds or thousands of causal SNPs, each with a weak effect. As a consequence, the second equation in Equation (2.1) and the third equation in Equation (2.2) are no longer valid. Strikingly, simulations show that the newly proposed STEPS remains valid in this situation. This is intuitively understandable: the effect of the other causal SNPs on the primary trait can be absorbed into the error term and/or into the effect of the tested SNP on the primary trait, depending on the LD structure between the tested SNP and the causal SNPs of the primary trait, and does not modify the effect of the tested SNP on the ST. This is the first demonstration that polygenic architecture does not affect ST genetic association analysis if an appropriate statistical method is employed.

Simulations show that the BFGS algorithm usually needs fewer than 20 iterations to find the MLE, which makes STEPS computationally efficient. Although demonstrated only for single-SNP analysis in this study, STEPS can readily be applied to other predictor variables such as environmental exposures, gene expression, or other genomic features. In addition, the method can easily incorporate covariates such as age, gender, and genetic ancestry estimates, as well as gene-environment interactions.
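To make the role of BFGS concrete, the sketch below fits a deliberately simplified likelihood by a hand-written BFGS routine in pure Python. It is not the STEPS likelihood and not the STEPS software: the toy model is a single normal trait observed only in its two tails, the cutoffs are assumed known, and every name (`make_nll`, `bfgs`, `C_LOW`, `C_HIGH`) is hypothetical. It only shows how dividing the density by the selection probability corrects the naive estimates, and how a quasi-Newton search finds the maximizer.

```python
import math
import random
import statistics


def norm_cdf(z):
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def make_nll(ys, c_low, c_high):
    """Mean negative log-likelihood of N(mu, sigma^2) seen only if y < c_low or y > c_high."""
    def nll(theta):
        mu, sigma = theta[0], math.exp(theta[1])  # theta[1] is log(sigma)
        p_sel = norm_cdf((c_low - mu) / sigma) + norm_cdf(-(c_high - mu) / sigma)
        log_p_sel = math.log(max(p_sel, 1e-300))
        total = 0.0
        for y in ys:
            z = (y - mu) / sigma
            total += 0.5 * z * z + math.log(sigma) + 0.5 * math.log(2.0 * math.pi) + log_p_sel
        return total / len(ys)
    return nll


def num_grad(f, x, h=1e-5):
    """Central-difference gradient."""
    grad = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        grad.append((f(xp) - f(xm)) / (2.0 * h))
    return grad


def bfgs(f, x0, tol=1e-6, max_iter=100):
    """Minimize f by BFGS with backtracking line search; returns (x, iterations used)."""
    n = len(x0)
    x = list(x0)
    H = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # inverse Hessian
    g = num_grad(f, x)
    for it in range(1, max_iter + 1):
        p = [-sum(H[i][j] * g[j] for j in range(n)) for i in range(n)]
        slope = sum(gi * pi for gi, pi in zip(g, p))
        fx, t = f(x), 1.0
        while t > 1e-10 and f([xi + t * pi for xi, pi in zip(x, p)]) > fx + 1e-4 * t * slope:
            t *= 0.5  # Armijo backtracking
        x_new = [xi + t * pi for xi, pi in zip(x, p)]
        g_new = num_grad(f, x_new)
        s = [a - b for a, b in zip(x_new, x)]
        y = [a - b for a, b in zip(g_new, g)]
        sy = sum(a * b for a, b in zip(s, y))
        if sy > 1e-12:  # curvature condition keeps H positive definite
            rho = 1.0 / sy
            Hy = [sum(H[i][j] * y[j] for j in range(n)) for i in range(n)]
            yHy = sum(a * b for a, b in zip(y, Hy))
            H = [[H[i][j] - rho * (s[i] * Hy[j] + Hy[i] * s[j])
                  + (rho * rho * yHy + rho) * s[i] * s[j]
                  for j in range(n)] for i in range(n)]
        x, g = x_new, g_new
        if math.sqrt(sum(gi * gi for gi in g)) < tol:
            return x, it
    return x, max_iter


# Toy EPS sample: standard normal cohort, keep roughly the bottom and top 10%.
rng = random.Random(2018)
cohort = [rng.gauss(0.0, 1.0) for _ in range(5000)]
C_LOW, C_HIGH = -1.2816, 1.2816  # assumed known cutoffs
selected = [y for y in cohort if y < C_LOW or y > C_HIGH]

naive_mean = statistics.mean(selected)
naive_sd = statistics.pstdev(selected)  # inflated by the selection
theta_hat, n_iter = bfgs(make_nll(selected, C_LOW, C_HIGH), [naive_mean, math.log(naive_sd)])
print(naive_sd, theta_hat[0], math.exp(theta_hat[1]), n_iter)
```

Optimizing over log(sigma) keeps the scale parameter positive without constraints. The numerical gradient is used here only to keep the example short; an analytic gradient would reduce the number of likelihood evaluations per iteration.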

As a single-variant analysis method, STEPS is also underpowered to identify RVs, although its type I error rate for RVs is well controlled. For RV analysis, the standard approach is to aggregate a set of variants in a genomic region and perform a region-based analysis. For example, burden tests and SNP-set kernel association tests are the two main categories of region-based methods and have been generalized to many settings. Although Liu and Leal (2012) proposed a framework to analyze RVs with selected samples, we believe this remains an important area for further investigation using the set-valued model.

In summary, the power of STEPS is a complicated function of the SNP MAF, the cohort sample size, the proportion of extremes selected, the effect size of the SNP, and the correlations among the primary trait, the STs, and genotype. Via extensive simulation studies (Inline graphic parameter combinations), we have quantified the relationship between these association parameters and the power of STEPS for binary and continuous STs at four different significance levels Inline graphic, and Inline graphic, and have included the resulting formulas as an R function in the STEPS software. These formulas can readily be used to calculate power given the sample size and all the other parameters when planning a new ST association study under an EPS design.

Supplementary Material

BIOSTS_21_1_33_s4

Supplementary Data

Acknowledgments

We thank the editors and two anonymous reviewers for their insightful and helpful comments, which have significantly improved the manuscript. This research is supported by the American Lebanese and Syrian Associated Charities (ALSAC). We acknowledge dbGaP for approval of our use of the benign ethnic neutropenia data. The data were obtained from Matthew Hsieh’s ancillary proposal to the Reasons of Geographic and Racial Differences in Stroke (REGARDS) study. Matthew Hsieh is supported by the intramural research program of NHLBI and NIDDK at NIH. Genotyping services were provided by the Center for Inherited Disease Research (CIDR). CIDR is funded through a federal contract from the National Institutes of Health to The Johns Hopkins University, contract numbers HHSN268200782096C and HHSN268201100011I. We acknowledge the High Performance Computing Facility (HPCF) at SJCRH for providing shared HPC resources that have contributed to the research results reported within this article. Conflict of Interest: None declared.

References

  1. Astle, W. J., Elding, H., Jiang, T., Allen, D., Ruklisa, D., Mann, A. L., Mead, D., Bouman, H., Riveros-Mckay, F., Kostadima, M. A. and others. (2016). The allelic landscape of human blood cell trait variation and links to common complex disease. Cell 167, 1415–1429.
  2. Basu, S. and Pan, W. (2011). Comparison of statistical tests for disease association with rare variants. Genetic Epidemiology 35, 606–619.
  3. Bryant, E. K., Dressen, A. S., Bunker, C. H., Hokanson, J. E., Hamman, R. F., Kamboh, M. I. and Demirci, F. Y. (2014). A multiethnic replication study of plasma lipoprotein levels-associated snps identified in recent GWAS. PLoS One 8, e63469.
  4. Bunimov, N., Fuller, N. and Hayward, C. P. M. (2013). Genetic loci associated with platelet traits and platelet disorders. Semin Thromb Hemost 3, 291–305.
  5. Ghosh, A., Wright, F. A. and Zou, F. (2013). Unified analysis of secondary traits in case–control association studies. Journal of the American Statistical Association 108, 566–576.
  6. He, J., Li, H., Edmondson, A. C., Rader, D. J. and Li, M. (2012). A gaussian copula approach for the analysis of secondary phenotypes in case–control genetic association studies. Biostatistics 13, 497–508.
  7. Kang, G., Bi, W., Zhang, H., Pounds, S., Cheng, C., Shete, S., Zou, F., Zhao, Y., Zhang, J. F. and Yue, W. (2017). A robust and powerful set-valued approach to rare variant association analyses of secondary traits in case-control sequencing studies. Genetics 205, 1049–1062.
  8. Kang, G., Bi, W., Zhao, Y., Zhang, J. F., Yang, J. J., Xu, H., Loh, M. L., Hunger, S. P., Relling, M. V., Pounds, S. and others. (2014). A new system identification approach to identify genetic variants in sequencing studies for a binary phenotype. Human Heredity 78, 104–116.
  9. Kang, G., Lin, D., Hakonarson, H. and Chen, J. (2012). Two-stage extreme phenotype sequencing design for discovering and testing common and rare genetic variants: efficiency and power. Human Heredity 73, 139–147.
  10. Klein, R. J., Zeiss, C., Chew, E. Y., Tsai, J. Y., Sackler, R. S., Haynes, C., Henning, A. K., SanGiovanni, J. P., Mane, S. M., Mayne, S. T. and others. (2005). Complement factor h polymorphism in age-related macular degeneration. Science 308, 385–389.
  11. Ligthart, S., Vaez, A., Hsu, Y. H., Stolk, R., Uitterlinden, A. G., Hofman, A., Alizadeh, B. Z., Franco, O. H. and Dehghan, A. (2016). Bivariate genome-wide association study identifies novel pleiotropic loci for lipids and inflammation. BMC Genomics 17, 443.
  12. Lin, D. Y. and Zeng, D. (2009). Proper analysis of secondary phenotype data in case-control association studies. Genetic Epidemiology 33, 256–265.
  13. Lin, D. Y., Zeng, D. and Tang, Z. Z. (2013). Quantitative trait analysis in sequencing studies under trait-dependent sampling. Proceedings of the National Academy of Sciences of the United States of America 110, 12247–12252.
  14. Liu, D. J. and Leal, S. M. (2012). A unified method for detecting secondary trait associations with rare variants: application to sequence data. PLoS Genetics 8, e1003075.
  15. Ma, L., Yang, J., Birali, R. H., Tanaka, T., Ferrucci, L., Bandinelli, S. and Da, Y. (2010). Genome-wide association analysis of total cholesterol and high-density lipoprotein cholesterol levels using the framingham heart study data. BMC Medical Genetics 11, 55.
  16. Monsees, G. M., Tamimi, R. M. and Kraft, P. (2009). Genome-wide association scans for secondary traits using case-control samples. Genetic Epidemiology 33(8), 717–728.
  17. O'Byrne, S., Shirodaria, C., Millar, T., Stevens, C., Blake, D. and Benjamin, N. (2000). Inhibition of platelet aggregation with glyceryl trinitrate and xanthine oxidoreductase. Journal of Pharmacology and Experimental Therapeutics 292, 326–330.
  18. Oh, J. H., Kim, Y. K., Moon, S., Kim, Y. J. and Kim, B. J. (2014). Genome-wide association study identifies candidate loci associated with platelet count in koreans. Genomics & Informatics 12, 225–230.
  19. Prins, B. P., Kuchenbaecker, K. B., Bao, Y., Smart, M., Zabaneh, D., Fatemifar, G., Luan, J., Wareham, N. J., Scott, R. A., Perry, J. R. B. and others. (2017). Genome-wide analysis of health-related biomarkers in the uk household longitudinal study reveals novel associations. Scientific Reports 7, 11008.
  20. Ridker, P. M., Pare, G., Parker, A., Zee, R. Y., Danik, J. S., Buring, J. E., Kwiatkowski, D., Cook, N. R., Miletich, J. P. and Chasman, D. I. (2008). Loci related to metabolic-syndrome pathways including lepr, hnf1a, il6r, and gckr associate with plasma C-reactive protein: the women’s genome health study. The American Journal of Human Genetics 82, 1185–1192.
  21. Sanders, S. J., Murtha, M. T., Gupta, A. R., Murdoch, J. D., Raubeson, M. J., Willsey, A. J., Ercan-Sencicek, A. G., DiLullo, N. M., Parikshak, N. N., Stein, J. L. and others. (2012). De novo mutations revealed by whole-exome sequencing are strongly associated with autism. Nature 485, 237.
  22. Sanna, S., Jackson, A. U., Nagaraja, R., Willer, C. J., Chen, W. M., Bonnycastle, L. L., Shen, H., Timpson, N., Lettre, G., Usala, G. and others. (2008). Common variants in the gdf5-uqcc region are associated with variation in human height. Nature Genetics 40, 198.
  23. Solovieff, N., Milton, J. N., Hartley, S. W., Sherva, R., Sebastiani, P., Dworkis, D. A., Klings, E. S., Farrer, L. A., Garrett, M. E., Ashley-Koch, A. and others. (2010). Fetal hemoglobin in sickle cell anemia: genome-wide association studies suggest a regulatory region in the 5’ olfactory receptor gene cluster. Blood 115, 1815–1822.
  24. Speliotes, E. K., Willer, C. J., Berndt, S. I., Monda, K. L., Thorleifsson, G., Jackson, A. U., Lango Allen, H., Lindgren, C. M., Luan, J., Mägi, R. and others. (2010). Association analyses of 249,796 individuals reveal 18 new loci associated with body mass index. Nature Genetics 42, 937.
  25. Spracklen, C. N., Chen, P., Kim, Y. J., Wang, X., Cai, H., Li, S., Long, J., Wu, Y., Xing Wang, Y., Takeuchi, F. and others. (2017). Association analyses of east asian individuals and trans-ancestry analyses with european individuals reveal new loci associated with cholesterol and triglyceride levels. Human Molecular Genetics 26, 1770–1784.
  26. Teslovich, T. M., Musunuru, K., Smith, A. V., Edmondson, A. C., Stylianou, I. M., Koseki, M., Pirruccello, J. P., Ripatti, S., Chasman, D. I., Willer, C. J. and others. (2010). Biological, clinical and population relevance of 95 loci for blood lipids. Nature 466, 707.
  27. Wang, J. and Shete, S. (2011). Estimation of odds ratios of genetic variants for the secondary phenotypes associated with primary diseases. Genetic Epidemiology 35, 190–200.
  28. Wang, K. (2016). Boosting the power of the sequence kernel association test by properly estimating its null distribution. The American Journal of Human Genetics 99, 104–114.
  29. Webb, T. R., Erdmann, J., Stirrups, K. E., Stitziel, N. O., Masca, N. G., Jansen, H., Kanoni, S., Nelson, C. P., Ferrario, P. G., König, I. R. and others. (2017). Systematic evaluation of pleiotropy identifies 6 further loci associated with coronary artery disease. Journal of the American College of Cardiology 69, 823–836.
  30. Willer, C. J., Schmidt, E. M., Sengupta, S., Peloso, G. M., Gustafsson, S., Kanoni, S., Ganna, A., Chen, J., Buchkovich, M. L., Mora, S. and others. (2013). Discovery and refinement of loci associated with lipid levels. Nature Genetics 45(11), 1274.
  31. Zhu, Y., Zhang, D., Zhou, D., Li, Z., Li, Z., Fang, L., Yang, M., Shan, Z., Li, H., Chen, J. and others. (2017). Susceptibility loci for metabolic syndrome and metabolic components identified in han chinese: a multi-stage genome-wide association study. Journal of Cellular and Molecular Medicine 21, 1106–1116.



Articles from Biostatistics (Oxford, England) are provided here courtesy of Oxford University Press
