Skip to main content
PLOS ONE logoLink to PLOS ONE
. 2014 Sep 15;9(9):e106918. doi: 10.1371/journal.pone.0106918

Testing Genetic Association by Regressing Genotype over Multiple Phenotypes

Kai Wang 1,*
Editor: Zhongxue Chen2
PMCID: PMC4164437  PMID: 25221983

Abstract

Complex disorders are typically characterized by multiple phenotypes. Analyzing these phenotypes jointly is expected to be more powerful than dealing with one of them at a time. A recent approach (O'Reilly et al. 2012) is to regress the genotype at a SNP marker on multiple phenotypes and apply the proportional odds model. In the current research, we introduce an explicit expression for the score test statistic and its non-centrality parameter that determines its power. Same simulation studies as those reported in Galesloot et al. (2014) were conducted to assess its performance. We demonstrate by theoretical arguments and simulation studies that, despite its potential usefulness for multiple phenotypes, the proportional odds model method can be less powerful than regular methods for univariate traits. We also introduce an implementation of the proposed score statistic in an R package named iGasso.

Introduction

Complex human disorders are often characterized by multiple phenotypes. Some of them might be categorical while others might be continuous. For instance, patients with Bardet-Biedl syndrome often suffer from vision loss, hypertension and high cholesterol level caused by obesity, polydactyly, and other abnormalities. In order to map the genetic variants underlying such disorders, it is highly desirable to analyze all available phenotypes simultaneously. However, it is challenging to jointly model these phenotypes, especially when they are of different data types [1].

Let Inline graphic denote a Inline graphic vector of Inline graphic phenotypes on an individual and Inline graphic his/her genotype at a marker. If all the components of Inline graphic are continuous, one may use MANOVA given genetype Inline graphic. When the components of Inline graphic are of mixed data types, the choices are limited. One popular method is to first analyze each component individually and then combine the test statistics through a meta-analysis [2], [3]. These methods model the phenotype vector Inline graphic in terms of the genetic data Inline graphic.

For a single-nucleotide polymorphism (SNP), the distribution of its genotype Inline graphic is trinomial. It is appealing to model Inline graphic as a function of Inline graphic [4], [5]. Furthermore, since there is a natural ordering in the three genotypes at the SNP (assuming that the possibility of over-dominance is ignorable), one can use the ordinal logistic regression (a.k.a., the proportional odds model or the cumulative logit model) [6]. One immediate advantage of using the proportional odds model is that, unlike other methods, there is no need to make assumptions on the genetic effect such as additive, dominant, or recessive. The usefulness of this approach has been demonstrated via analyses of various data [5]. It is one of the best currently available multivariate methods [7]. However, it is the slowest one [7].

The test used in [5] is the likelihood ratio test (LRT). It involves numerical maximization under both the null hypothesis and the alternative hypothesis. We introduce a score test statistic using standard statistics theory. This statistic is asymptotically equivalent to the likelihood ratio test but computationally much faster due to the availability of its explicit expression, a feature useful in genome-wide association analysis. This explicit expression also gives insight on how the proportional odds model works in the context of genetic association analysis.

This report is organized as follows. We first introduce an explicit form of the score statistic and its non-centrality parameter. The form of this score statistic provides some insights on the ability of this method to detect association. The performance of the this score statistic is evaluated using the same simulation scenarios used in [7]. Finally, we consider an important case where the phenotype vector Inline graphic is univariate and binary to see how this test works for univariate phenotypes.

Results

The score statistic

The genetic data are assumed to come from a biallelic marker such as a single-nucleotide polymorphism (SNP). Let Inline graphic denote the reference allele and Inline graphic the other. For simplicity, we use 0, 1, and 2 to represent genotypes Inline graphic, Inline graphic, and Inline graphic, respectively. Regardless of the data types of the components of Inline graphic, the genotype Inline graphic follows a trinomial distribution. In most cases, the effect of an allele is monotonic. That is, as the number of Inline graphic alleles increases from 0 to 2, the effect of genotypes Inline graphic, Inline graphic, and Inline graphic is non-decreasing or non-increasing. Over-dominance effect exists but is rather rare. Given this consideration, we model the genotype Inline graphic in terms of phenotype Inline graphic using the proportional odds model [5].

Let Inline graphic denote the probability that an individual's genotype Inline graphic is Inline graphic given phenotypic value Inline graphic. In the current situation, the proportional odds model models the cumulative probabilities Inline graphic and Inline graphic jointly as follows:

graphic file with name pone.0106918.e033.jpg (1)
graphic file with name pone.0106918.e034.jpg (2)

Here Inline graphic and Inline graphic are intercepts and Inline graphic is a vector of coefficients. This model implies Inline graphic because Inline graphic. Since Inline graphic, an alternative form of equation (2) is

graphic file with name pone.0106918.e041.jpg

Equation (1) models the effect of phenotype y on the odds of genotype Inline graphic versus Inline graphic or Inline graphic while equation (2) models the effect of phenotype y on the odds of genotype Inline graphic versus Inline graphic or Inline graphic. We have

graphic file with name pone.0106918.e048.jpg
graphic file with name pone.0106918.e049.jpg

and Inline graphic is determined by Inline graphic. This model assumes that the difference of the left hand side of (1) or (2) for two phenotype vectors Inline graphic and Inline graphic depends only on Inline graphic and is independent of genotype Inline graphic or Inline graphic:

graphic file with name pone.0106918.e057.jpg (3)
graphic file with name pone.0106918.e058.jpg (4)

That is,

graphic file with name pone.0106918.e059.jpg

does not depend on the value of Inline graphic.

Let Inline graphic be the index for the Inline graphicth individual in a sample of size Inline graphic, the log-likelihood function is

graphic file with name pone.0106918.e064.jpg

The hypotheses of interest are

graphic file with name pone.0106918.e065.jpg (5)

These hypotheses can be tested by the likelihood ratio statistic. To introduce the score statistic, define

graphic file with name pone.0106918.e066.jpg

where Inline graphic and Inline graphic are the sample proportions of the genotypes for which Inline graphic and 2, respectively. Inline graphic is the difference of two weighted summations of Inline graphic. The summation in the first pair of parentheses weights Inline graphic with Inline graphic more than other Inline graphics (i.e., 1 versus Inline graphic) while the summation in the second pair of parentheses weights Inline graphic with Inline graphic more (i.e., 1 versus Inline graphic). Let Inline graphic. It is shown in the Methods section that a score statistic for testing hypotheses (5) is

graphic file with name pone.0106918.e080.jpg

where

graphic file with name pone.0106918.e081.jpg

is the sample variance matrix of Inline graphic. The non-centrality parameter of Inline graphic is

graphic file with name pone.0106918.e084.jpg

where the expectation in Inline graphic is taken under the alternative. This NCP can be used to compute power at significance level Inline graphic in the following way:

graphic file with name pone.0106918.e087.jpg

where Inline graphic follows a chi-square distribution with Inline graphic and non-centrality parameter Inline graphic and Inline graphic is the critical value from a chi-square distribution with Inline graphic and non-centrality parameter 0.

Simulation Study

The simulation study consists of two parts. In the first part, multivariate phenotypes were simulated the same way as in [7]. Genotype data at a single SNP were simulated with minor allele frequency Inline graphic under the assumption Hardy-Weinberg equilibrium. Three phenotypes, denoted by Inline graphic, and Inline graphic, were simulated using the following relationship:

graphic file with name pone.0106918.e096.jpg

where Inline graphic, Inline graphic, and Inline graphic are pre-spedfied residual correlations between phenotypes excluding the QTL effect, Inline graphic or Inline graphic controls the effect direction of Inline graphic (those of Inline graphic and Inline graphic are fixed at 1) and Inline graphic. Three scenarios were considered with only 1, 2, or all phenotypes were associated with the SNP. See [7] for more details. For each simulated data set, the LRT Inline graphic-value is obtained from the R package MultiPhen and the score Inline graphic-value is obtained from the R package iGasso. Empirical rejection rates over 1,000 replicates are reported in Table 1. The performance of the score test is very similar to that of LRT. The empirical power are very close to the power of LRT reported in [7].

Table 1. Non-centrality value and the associated power (presented in parentheses) for models used in the simulation studies.

Frequency of Allele a Effect of Allele a
K p Recessive Additive Dominant
0.01 0.1 4.6780 (0.2597) 14.4110 (0.8387)
0.3 2.8697 (0.1329) 9.7282 (0.6225) 18.5977 (0.9339)
0.4 6.9507 (0.4323)
0.1 0.1 5.7000 (0.3374) 17.6383 (0.9182)
0.3 3.4771 (0.1730) 11.7847 (0.7343) 22.5123 (0.9737)
0.4 8.4233 (0.5379)

The relative genotype risks are Inline graphic for recessive models; Inline graphic for dominant models; and Inline graphic for additive models.

Is the proportional odds model always more powerful than the usual tests of association? To address this question, simulation were conducted on univariate phenotypes. The description of the simulation studies is provided in the Methods section. In addition to the proposed score statistic, the other test statistics used in the simulation include the Pearson's chi-square test, the Cochran-Armitage trend test, and the likelihood ratio test for the proportional odds model. The number of simulation replicates is fixed at 10,000. The sample size is 2,000. Half of the subjects are cases and half are controls. The simulated type I error rate for all these statistics is reported in Table 2. The empirical rejection rates are very close to their nominal levels, which are 0.1, 0.01, and 0.005. The simulated power is presented in Figures 1, 2, and 3. It is striking to see that for recessive models the proportional odds model is the least powerful. For instance, when prevalence Inline graphic and minor allele frequency Inline graphic, the power are 0.486 for Chi-Square test, 0.353 for Trend test, and 0.19 for both LRT and Score test. For other two models, there are situations it is more powerful than other methods. The simulated power for the Inline graphic statistic is in line with the calculated power reported in Table 3.

Table 2. Simulated type I error rate in the case of a univariate phenotype under various generating models.

Penetrance (K) Frequency of Allele a (p) Significance Level Test Statistic
Trend Chi-Square LRT Score
0.01 0.1 0.1 0.085 0.095 0.091 0.091
0.01 0.007 0.007 0.006 0.007
0.005 0.005 0.004 0.004 0.004
0.3 0.1 0.077 0.091 0.082 0.082
0.01 0.009 0.008 0.010 0.010
0.005 0.006 0.002 0.005 0.005
0.1 0.1 0.1 0.096 0.099 0.101 0.102
0.01 0.011 0.009 0.010 0.010
0.005 0.006 0.006 0.007 0.007
0.3 0.1 0.116 0.117 0.109 0.108
0.01 0.009 0.011 0.010 0.010
0.005 0.006 0.007 0.003 0.003
0.01 0.1 0.1 0.085 0.095 0.091 0.091
0.01 0.007 0.007 0.006 0.007
0.005 0.005 0.004 0.004 0.004
0.3 0.1 0.077 0.091 0.082 0.082
0.01 0.009 0.008 0.010 0.010
0.005 0.006 0.002 0.005 0.005
0.1 0.1 0.1 0.096 0.099 0.101 0.102
0.01 0.011 0.009 0.010 0.010
0.005 0.006 0.006 0.007 0.007
0.3 0.1 0.116 0.117 0.109 0.108
0.01 0.009 0.011 0.010 0.010
0.005 0.006 0.007 0.003 0.003
0.01 0.1 0.1 0.085 0.095 0.091 0.091
0.01 0.007 0.007 0.006 0.007
0.005 0.005 0.004 0.004 0.004
0.3 0.1 0.077 0.091 0.082 0.082
0.01 0.009 0.008 0.010 0.010
0.005 0.006 0.002 0.005 0.005
0.1 0.1 0.1 0.096 0.099 0.101 0.102
0.01 0.011 0.009 0.010 0.010
0.005 0.006 0.006 0.007 0.007
0.3 0.1 0.116 0.117 0.109 0.108
0.01 0.009 0.011 0.010 0.010
0.005 0.006 0.007 0.003 0.003

The test statistics are: Trend — Cochran-Armitage trend test; Chi-Square — Pearson's chi-square test; LRT — the likelihood ratio test for the proportional odds model computed by using the polr function in the R package MASS; Score — the proposed score statistic computed by using the SNPass.test function in the R package iGasso.

Figure 1. Simulated power for recessive model.

Figure 1

The relative genotype risks are Inline graphic. K represents disease prevalence and p is the frequency of allele a. The abbreviations for the test statistics are the same as in Table 2.

Figure 2. Simulated power for additive model.

Figure 2

The relative genotype risks are Inline graphic. K represents disease prevalence and p is the frequency of allele a. The abbreviations for the test statistics are the same as in Table 2.

Figure 3. Simulated power for dominant model.

Figure 3

The relative genotype risks are Inline graphic. K represents disease prevalence and p is the frequency of allele a. The abbreviations for the test statistics are the same as in Table 2.

Table 3. Simulated power under the same simulation scenarios of [7] over 1,000 replicates.

Scenario
I II III
rG rE MAF q LRT Score LRT Score LRT Score
1 0.0 0.01 0.125 0.125 0.184 0.183 0.278 0.278
0.40 0.141 0.141 0.198 0.197 0.257 0.257
0.3 0.01 0.127 0.129 0.185 0.185 0.184 0.183
0.40 0.124 0.123 0.177 0.178 0.187 0.185
0.7 0.01 0.221 0.221 0.268 0.267 0.129 0.129
0.40 0.215 0.211 0.292 0.292 0.138 0.136
−1 0.0 0.01 0.202 0.201 0.289 0.288
0.40 0.194 0.192 0.269 0.262
0.3 0.01 0.267 0.266 0.344 0.344
0.40 0.277 0.273 0.351 0.354
0.7 0.01 0.589 0.589 0.694 0.694
0.40 0.548 0.542 0.717 0.713

In scenario I, only phenotype 1 is associated with the SNP. rG = 1 or −1 does not affect the simulation results. Only results with rG = 1 are shown.

Discussion

In this report, we introduced a score test statistic for the proportional odds model for testing the association between a SNP and multiple phenotypes and provided an implementation of this statistic. Same simulation studies as those reported in [7] were conducted to assess it performance. We also did simulation analyses to study the performance of proportional odds model for univariate phenotypes which is covered by [5]. Although appealing to studies on multiple phenotypes, this method may be less powerful for univariate traits than regular methods. For case-control data, our results suggest that the traditional Pearson's chi-square test and the Cochran-Armitage trend test are preferred when the disease allele frequency is less than 0.5 and the disease is recessive.

Nonetheless, the proportional odds model method provides a convenient way for analyzing multiple phenotypes, especially when these phenotypes are of different types [5]. If the proportional odds assumption is of concern, one may remove this assumption and adopt a multinomial logistic regression. For our simulation studies, the multinomial logistic regression would be equivalent to the Pearson's chi-square test statistic. There are quite few implementations of the multinomial logistic regression, for instance, the multinom function in R package nnet.

Methods

Derivation of the score statistic

The first-order derivatives of the log-likelihood function Inline graphic are

graphic file with name pone.0106918.e118.jpg
graphic file with name pone.0106918.e119.jpg
graphic file with name pone.0106918.e120.jpg

and the second-order derivatives are

graphic file with name pone.0106918.e121.jpg
graphic file with name pone.0106918.e122.jpg
graphic file with name pone.0106918.e123.jpg

Under Inline graphic, Inline graphic no longer depends on Inline graphic. So their values are simply denoted by Inline graphic, and Inline graphic, respectively. Let Inline graphic. The expected Fisher information matrix evaluated at Inline graphic is

graphic file with name pone.0106918.e131.jpg
graphic file with name pone.0106918.e132.jpg

where the matrix partition is in an obvious manner. By standard asymptotic theory, the score statistic is

graphic file with name pone.0106918.e133.jpg

where Inline graphic is Inline graphic evaluated at Inline graphic:

graphic file with name pone.0106918.e137.jpg

The unknown values of Inline graphic, and Inline graphic are estimated by their sample genotype proportions, respectively.

Simulation Studies

Here is a description of the simulation procedure for the case of a dichotomous phenotype. Suppose the trait is Mendelian. Let Inline graphic, Inline graphic, and Inline graphic denote the frequencies of genotypes Inline graphic, Inline graphic and Inline graphic in general population and Inline graphic, Inline graphic, and Inline graphic their penetrances, respectively. The prevalence of the disease would be Inline graphic. The genotype frequencies in cases are Inline graphic and in controls are Inline graphic. In this situation, the variance of Inline graphic is Inline graphic where Inline graphic is the proportion of cases. The non-centrality parameter (NCP) of test statistic Inline graphic is equal to

graphic file with name pone.0106918.e156.jpg

Let Inline graphic be the population frequency of allele a. Assuming Hardy-Weinberg equilibrium in the population, the frequencies of genotypes Inline graphic, Inline graphic, and Inline graphic are Inline graphic, and Inline graphic, respectively. Let Inline graphic, be the relative risk of genotype Inline graphic to genotype 0. A data generating model is completely determined by Inline graphic, Inline graphic, Inline graphic, and Inline graphic. This is because Inline graphic, Inline graphic, and Inline graphic. Hence the genotype frequencies in cases and controls are determined and data can be simulated. We consider a dominance model (Inline graphic), a recessive model (Inline graphic), and an additive model (Inline graphic). The NCPs for the models used in simulation are reported in Table 3. So are the power associated with these NCPs.

R function SNPass.test

The R function SNPass.test in the package iGasso implements the proposed score statistic. R users can download and install iGasso from CRAN (http://cran.r-project.org/) or any CRAN mirror.

Acknowledgments

The author thanks an associate editor and four referees for their helpful comments.

Data Availability

The author confirms that all data underlying the findings are fully available without restriction. The statistical test proposed in the manuscript has been implemented in R language. It is freely available at http://cran.r-project.org/web/packages/iGasso/index.html. The R code is available upon request from the corresponding author.

Funding Statement

This research is funded by the National Institute for Environmental Health Sciences through the University of Iowa Environmental Health Sciences Research Center, NIEHS/NIH P30 ES05605. The funder has no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Zhu W, Zhang H (2009) Why do we test multiple traits in genetic association studies. Journal of the Korean Statistical Society 38: 1–10. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2. Xu X, Lu Tian L (2003) Combining dependent tests for linkage or association across multiple phenotypic traits. Biostatistics 4: 223–229. [DOI] [PubMed] [Google Scholar]
  • 3. Yang Q, Wu H, Guo CY, Fox CS (2010) Analyze multivariate phenotypes in genetic association studies by combining univariate association tests. Genet Epidemiol 34: 444–454. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4. Wang K, Huang J (2011) Treating phenotype as given: A simple resampling method for genome-wide association studies. Genetic Analysis Workshop 17 5: S60. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5. O'Reilly PF, Hoggart CJ, Pomyen Y, Calboli FCF, Elliott P, et al. (2012) MultiPhen: Joint model of multiple phenotypes can increase discovery in GWAS. PLoS ONE 7: e34861. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Agresti A (2002) Categorical Data Analysis. John Wiley & Sons, Inc., 2nd edition.
  • 7. Galesloot TE, van Steen K, Kiemeney LALM, Janss LL, Vermeulen SH (2014) A comparison of multivariate genome-wide association methods. PLoS ONE 9: e95923. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

The author confirms that all data underlying the findings are fully available without restriction. The statistical test proposed in the manuscript has been implemented in R language. It is freely available at http://cran.r-project.org/web/packages/iGasso/index.html. The R code is available upon request from the corresponding author.


Articles from PLoS ONE are provided here courtesy of PLOS

RESOURCES