Abstract
The case‐control design is often used to test associations between the case‐control status and genetic variants. In addition to this primary phenotype, a number of additional traits, known as secondary phenotypes, are routinely recorded, and typically, associations between genetic factors and these secondary traits are studied too. Analysing secondary phenotypes in case‐control studies may lead to biased genetic effect estimates, especially when the marker tested is associated with the primary phenotype and when the primary and secondary phenotypes tested are correlated. Several methods have been proposed in the literature to overcome the problem, but they are limited to case‐control studies and not directly applicable to more complex designs, such as the multiple‐cases family studies. A proper secondary phenotype analysis, in this case, is complicated by the within families correlations on top of the biased sampling design. We propose a novel approach to accommodate the ascertainment process while explicitly modelling the familial relationships. Our approach pairs existing methods for mixed‐effects models with the retrospective likelihood framework and uses a multivariate probit model to capture the association between the mixed type primary and secondary phenotypes. To examine the efficiency and bias of the estimates, we performed simulations under several scenarios for the association between the primary phenotype, secondary phenotype and genetic markers. We will illustrate the method by analysing the association between triglyceride levels and glucose (secondary phenotypes) and genetic markers from the Leiden Longevity Study, a multiple‐cases family study that investigates longevity. © 2017 The Authors. Statistics in Medicine Published by JohnWiley & Sons Ltd.
Keywords: ascertainment, multivariate probit model, family data, mixed models, genetic association and heritability
1. Introduction
In order to understand biological mechanisms underlying disease and health, epidemiological studies measure genetic markers, classical variables, and novel omics datasets and model the relationship between these variables and the phenotype of interest. Here, we consider outcome‐dependent sampling designs with binary outcome variables. In addition to studying these binary (primary) phenotypes, the classical or omics variables are typically also analysed as outcome variables (secondary phenotypes). For example, modelling of associations between these traits and genetic factors, such as single‐nucleotide polymorphisms (SNPs) or polygenic risk scores (sumscores based on SNPs) 1. However, an important complication that is often ignored is that a proper analysis of the secondary traits should correct for the sampling mechanism on the primary phenotype (Figure 1). Note that we assume that the secondary phenotype has an effect on the primary phenotype. The reverse situation will not be treated because of reverse causality challenges 2. In our motivating case study, the Leiden Longevity Study (LLS) 3 families with at least two long‐lived siblings are recruited. Obviously, these families do not represent a random sample from the population and inferences cannot be generalized to the whole population, unless the sampling mechanism is properly modelled. Several datasets are measured in the offspring of the long‐lived siblings, namely, lipidomics, glycomics, metabolomics and imaging. These offspring share a part of their genetic variation with the long‐lived parent and therefore are expected to represent a healthy subpopulation while the partners represent the population. As data example, we will model the effect of genetic factors on the secondary traits glucose and triglyceride levels in the offspring (cases) and their partners (controls). To be able to extrapolate results to the general population, we need to account for the over sampling of long‐lived subjects in the families of the LLS. There are several multiple‐case family studies. For human longevity, Genetics of Healthy Ageing 4 used the same study design as the LLS. Other examples are Genetics in Familial Thrombosis (with at least two cases with thrombosis) 5, 6 and the ongoing study from Leiden Family Lab (famlab: https://www.leidenfamilylab.nl), which recruits families with at least two cases with social anxiety disorder. The novel methods presented in this paper will also be essential for modelling secondary phenotypes in these studies.
In the context of case‐control studies, Monsees et al. 2 showed that bias can occur when estimating the SNP effect on secondary phenotypes if the primary and secondary phenotypes are associated. This is often the case because both outcomes are measured on the same subjects and secondary phenotypes are typically chosen for their potential associations with the primary phenotype. They also showed that the amount of bias is dependent on the prevalence of the primary phenotype, the strength of the association between the primary and secondary phenotypes, and the association between the tested marker and the primary trait (Figure 1).
To deal with the bias problem, investigators first used ad hoc methods, that is, using controls only, cases only, combined data of cases and controls or joint analysis of cases and controls adjusting for the case‐control status. However, several authors showed that these simple approaches can lead to false positive results 2, 7, 8. This is due to the sampling design, namely, the secondary phenotype data are not sampled according to the case‐control design as the primary phenotype. Several sophisticated methodologies have been developed to correct for the sampling mechanisms and provide unbiased genetic effect estimates: (i) inverse‐probability‐of‐sampling‐weighting approaches 2, 9, 10 that correct for the sampling mechanism by weighting appropriately individuals in case‐control studies; (ii) retrospective likelihood‐based approaches that indirectly adjust for ascertainment 8, 11; and (iii) a weighted combination of two estimates obtained with the retrospective likelihood approach in the presence or not of an interaction between SNPs and primary phenotypes 12.
Even though these approaches can successfully correct for the biased design used to collect the data, they are not directly applicable to more complex designs such as the LLS that motivates this work. In particular, inverse probability weighting approaches require knowledge of the sampling weights for each family. These weights are not available for the LLS because it is unknown what the prevalence of families with at least two nonagenarians is in the population. In addition, the correlations between the family members cannot be ignored, and therefore, it is evident that statistical methodology for proper secondary phenotypes analysis in this context is needed. To this end, under the retrospective likelihood framework, we develop a multivariate probit regression model inspired by the work of Najita et al. 13 to model jointly the distribution of the primary and secondary phenotype. This approach allows us to deal with the ascertainment issue while taking into account the individual relatedness and the genetic and environmental variations.
The paper is organized as follows: in Section 2, we present the retrospective likelihood approach to correct for the over sampling of long‐lived subjects and the multivariate probit regression model for the joint modelling of the mixed type primary and secondary phenotypes. In Section 3, we evaluate empirically the performance of the method in terms of bias and efficiency and contrast it with the naive approach that ignores the sampling mechanism. Finally, in Section 4, we illustrate the potential of our proposed method in the analysis of triglyceride levels and glucose in the LLS.
2. Methods
2.1. Retrospective likelihood approach
Let N be the total number of families in the study. For the family i(i=1…N) of size n i, let Y i, X i and G i be the n i×1 vectors for the case‐control status, the secondary phenotype and the genotype, respectively. Motivated by the LLS, we will work under the retrospective likelihood approach to correct for the ascertainment of the families. Such an approach is attractive when modelling the ascertainment mechanism is not straightforward, as in the LLS where sampling depends on the previous generation (an example of a pedigree in LLS is shown in Figure 2). In fact the retrospective likelihood approach implicitly corrects for the ascertainment mechanism, under the assumption that the ascertainment depends only on the primary phenotype Y. In particular, for the ith family it holds
(1) |
with A s c the ascertainment process. By applying Bayes' rule, we obtain
(2) |
To fully specify (2), we need to model properly the conditional joint distribution of the primary and the secondary phenotypes given the genotype P(X i,Y i|G i), the marginal probability of the primary phenotype P(Y i|G i) and the genotype probability of the ith family P(G i). Each one of these elements are described in Sections 2.2 and 2.3.
2.2. Mixed‐effects models for the analysis of family data
To model the correlation of the phenotypes Y and X within families, a common choice is to use random effects. For the binary primary phenotype, we propose to use a multivariate probit model with random effects. The advantage of this model is that it involves only the integrals of the multivariate normal cumulative distribution function for which efficient algorithms have been developed. In contrast, for the more commonly used logistic regression model, the integrals have to be approximated, for example, by using Gauss–Hermite quadrature, which might be computationally intensive for large pedigrees. Let be a set of family‐specific random effects designed to handle familial genetic correlation and be the vector of genotypes for family i. For the probit model, the observed response Y is viewed as a censored observation from an underlying continuous latent variable Y ∗ with
where −∞=γ 0<γ 1<γ 2=+∞ are suitable threshold parameters. For the underlying latent variable Y ∗, we assume the mixed‐effects regression model
where is independent of . Here, α=(α 0,α 1) denotes the regression coefficient vector with α 0 the intercept and α 1 the parameter representing the effect of the genotype on Y. At the family level, we assume , with R i the coefficient of relationships matrix with elements with d lm denoting the genetic distance between subjects l and m in the family. The parameter represents the residual additive genetic variation not explained by g ij. Note that models the polygenic inheritance in a family. For identifiability reasons, restrictions are required on both the scale and location of Y ∗, namely, we set σ 2=1 and γ 1=0. Thus, in the mixed‐effects probit regression, the disease risk conditional on the random‐effects and genotypic information g ij is modelled as follows
(3) |
with Φ(z) the cumulative distribution function of the standard normal distribution. The marginal density under the probit model takes the form:
To model the secondary phenotype X i, we use a linear mixed model:
(4) |
where β=(β 0,β 1) denotes the regression coefficient vector with β 0 the intercept and β 1 the parameter representing the effect of the genotype on X, is the random parameter used to model the genetic correlation structure within each family for the secondary trait, and σ ϵ is the residual standard deviation.
To model jointly X and Y using the model specifications ((3) and (4)), we introduce a shared random effect u ij∼N(0,1) and propose the following model:
(5) |
where u i is assumed to be independent of and . We introduce a coefficient δ in order to have different phenotypic variances for the random effect u i. In case of small datasets or small family sizes, it can be better to constrain δ to be equal to 1 for a simpler model. Let and denote the corresponding variance–covariance matrices of the marginal distributions of X i and , and let be their covariance. The joint distribution of Y ∗ and X is then . In the special case for n i=2, the variance–covariance matrix becomes
(6) |
Using the properties of the multivariate normal distribution, the joint distribution for the observed primary and secondary phenotypes takes the form
Thus by using the probit regression model for the primary trait, we have developed an efficient approach to model the correlation between the primary and secondary trait.
From model (5) and the variance–covariance matrix (6), several marginal correlations between and within family members can be deduced:
where ρ XY represents the association between the primary and secondary phenotype. We can also derive the closed form for the heritability estimates of the secondary phenotype that quantifies the percentage of genetic variation in the total variance:
(7) |
Note that when genetic factors are included in the model formula (7) gives the residual heritability.
2.3. Genotype probability
Finally, another key component in the formulation of the retrospective likelihood (2) is the computation of the genotype probability for each family i. Let G mj and G pj denote the genotypes of the mother and father of an individual j if this individual is a nonfounder member of family i. Under the assumption of random mating and Mendelian inheritance, the genotype probabilities can be written as presented by Duncan Thoms 14:
The probabilities P(g ij|G pj,G mj) are the transmission probabilities that can be modelled using Mendelian inheritance. Finally, P(G pi), P(G mi) and P(g ij) can be modelled by assuming Hardy–Weinberg proportions (1−q)2, 2q(1−q), q 2 that depend on q, the minor allele frequency. Here, we propose to use external information for q or to estimate q from the control sample before maximizing the likelihood. Note that when genotypes of the parents are missing, the probability can be obtained by summing over the possible parental genotypes. In case of a more complex pedigree, a recursive algorithm known as peeling can be used 15. For the LLS where families are sibships, the probability is as follows:
(8) |
where is the model parameters vector.
2.4. Estimation and statistical testing
To estimate the parameters of the joint model, we maximize the logarithm of the likelihood described in (8). This involves a combination of numerical optimization and integration. For the evaluation of the integral in the multivariate normal distribution, we use the deterministic algorithm Miwa described in 16. For the optimization, we use the Broyden–Fletcher–Goldfarb–Shanno algorithm implemented in the function optim(.) in R. The Broyden–Fletcher–Goldfarb–Shanno algorithm is a quasi‐Newton method, which means that the Hessian matrix does not need to be evaluated directly but is approximated by using specified gradient evaluations. To test for the presence of an effect of the SNPs on the secondary phenotype, we use the likelihood ratio test. Note that when the interest of a researcher is solely testing for genetic association, a score statistic is an alternative to the likelihood ratio statistic.
2.5. Continuous polygenic score
Our approach can also be applied in the case of modelling the association between continuous covariates and secondary phenotypes. For example, polygenic scores have been used to summarize genetic effects among an ensemble of SNPs that have been identified in large genome‐wide association studies (GWASes) 17, 18, 19. Polygenic scores are typically linear combinations of SNPs: , where δ k=1 or δ k is obtained from previous GWASes. For genetic scores, we need to integrate over the distribution of the polygenic score instead of summing over the genotypes in the denominator of (2). For the distribution of the polygenic score, we use a multivariate normal distribution , with μ g the mean value of the genetic score, σ g the standard deviation of the genetic score and R i the relationship matrix of family i. The likelihood contribution for family i is given by
Computation of the integral can be quite intensive and challenging. In order to gain efficiency, we write the marginal model of Y ∗ (5) as , with . Now follows the following multivariate normal distribution: . Note that when a polygenic risk score is included in the model for the secondary phenotype, the parameter represents the residual polygenic inheritance.
2.6. Inclusion of covariates in the model
Often, researchers want to adjust for covariates such as age, sex, treatment and so on in the model. Let Z be such a covariate. To estimate the effect Z on the secondary phenotype, we propose to maximize the joint likelihood of X and G conditionally on the primary phenotype Y and Z. Thereby, we avoid modeling of the distribution of Z within the families. Indeed, under the assumption of independence between genotype and Z, we obtain
(9) |
3. Simulation study
A simulation study was set up to evaluate the performance of our proposed method for the estimation of the association between a genetic factor and the secondary phenotype and the estimation of the heritability of the secondary phenotype. We compare the proposed method with the naive approach that is typically followed in practice, namely, analysis of the secondary trait without correcting for the sampling mechanism. In particular, in this case, we fit the standard linear mixed‐effects model for the secondary phenotype and explicitly model the familial relationships as described in ((4)). The two methods are compared in terms of bias, root mean square error (RMSE) and 95% coverage probabilities. We consider SNPs (discrete variables) and polygenic scores (continuous variables). Several settings are considered for the disease prevalence, the strength of the association between the genetic factor and the primary phenotype, the strength of the ascertainment mechanism and the number of sibships. We simulated sibships of size 5. With respect to the familial relationships, we consider only sibships such that our simulation resembles the LLS design. For the prevalence of the primary phenotype, we consider two settings, namely, a disease prevalence of 1% that corresponds to α 0≈−2.32 and of 5% that corresponds to α 0≈−1.64. In addition, the variance parameters have been chosen such that they correspond to a heritability of 50%. Specifically, we use =2, , and . This corresponds to a correlation of 0.78 between the primary and the secondary phenotypes. To speed up computations, we assume that when fitting the models to the simulated datasets. For each scenario, 500 datasets are simulated using model (5).
3.1. Simulation results for a single‐nucleotide polymorphism
The genotypes of the SNPs are simulated assuming a minor allele frequency of 0.3 in the population. For the secondary phenotype model, the following fixed effects values are used: β 0=3.5 and β 1=0.2, whereas for the primary phenotype model, the effect sizes are α 1= 0.1 or 0.5. Finally, for each of the four scenarios (rare or common disease, and weak and strong SNP effect on the primary phenotype), we consider two ascertainment mechanisms, namely, the sampled sibships of size five have at least one affected or at least two affected members.
Figure 3 presents the estimates and 95% confidence intervals for the scenario of 400 sibships. Figure 3 shows that ignoring the sampling mechanism (naive method) leads to biased estimates of the SNP effect, and the size of this bias increases with the strength of the ascertainment mechanism and the association between the SNP and the primary phenotype. Overall, we observe that the proposed method gives unbiased estimates of the SNP effect on the secondary phenotype. The coverage probabilities reach the nominal level (see section A of the Supporting Information). Regarding the prevalence of the primary phenotype, we observe that for the naive method bias increases with lower prevalence, while the proposed method remains robust to the lower amount of information because of the rare primary phenotype. In general, the proposed method leads to smaller root mean square error than the naive approach and better coverage probabilities.
In Table 1, we present the heritability estimates of the secondary phenotype for a common disease, under the various ascertainment mechanisms and the two values of α 1. It is obvious that the heritability estimates are influenced by the ascertainment mechanisms when using the naive approach. Indeed, the naive method tends to underestimate the heritability for each mechanism and this underestimation increases as the ascertainment mechanisms become more stringent. The heritability estimates are 25–27% for sibships with at least one affected sibling and drop to 13–14% for sibships with at least two affected siblings. On the contrary, the proposed method is robust to the stringency of the ascertainment mechanism.
Table 1.
SNP model | Polygenic score model | ||||
---|---|---|---|---|---|
Ascertainment | α 1 | Retrospective | Naive | Retrospective | Naive |
1. At least two cases | |||||
0.10 | 0.48 (0.07) (0.22) | 0.13 (0.07) (0.37) | 0.50 (0.03) (0.13) | 0.14 (0.03) (0.36) | |
0.50 | 0.48 (0.07) (0.22) | 0.14 (0.07) (0.36) | 0.52 (0.03) 0.12) | 0.15 (0.03) (0.34) | |
2. At least one case | |||||
0.10 | 0.50 (0.08) (0.17) | 0.25 (0.08) (0.25) | 0.48 (0.04) (0.12) | 0.25 (0.03) (0.24) | |
0.50 | 0.50 (0.08) (0.17) | 0.27 (0.08) (0.24) | 0.50 (0.04) (0.10) | 0.26 (0.04) (0.23) |
The heritability value is 50% under the generating model. Datasets consist of 400 sibships of size 5. Results are based on 500 replicates. SNP, single‐nucleotide polymorphism.
Next, we study the robustness of our approach to one violation of the model assumptions, namely, we simulated under a logit link for the primary phenotype and used the probit link for modelling. Results for the SNP effect and the heritability are presented in Table 2. These results show that even though our approach gives biased estimates for the primary phenotype model, the parameters estimates for the secondary phenotype model are not affected. All the results are presented in Section A of the Supporting Information.
Table 2.
Ascertainment | α 1 | β 1 | Heritability |
---|---|---|---|
0. True value | 0.200 | 0.500 | |
1. At least two cases | |||
0.100 | 0.199 (0.104) (0.104) (0.948) | 0.509 (0.017) (0.110) | |
0.500 | 0.197 (0.106) (0.110) (0.945) | 0.516 (0.014) (0.108) | |
2. At least one case | |||
0.100 | 0.200 (0.104) (0.107) (0.961) | 0.510 (0.012) (0.096) | |
0.500 | 0.199 (0.107) (0.111) (0.960) | 0.513 (0.010) (0.087) |
In brackets are standard deviations, root mean square error and coverage probability (for the effect size only). Datasets consist of 400 sibships of size 5. Results are based on 500 replicates.
Although we focus on parameter estimation, model fitting and heritability estimation for genetic association with a secondary phenotype, we also investigate the performance of the likelihood ratio test under the null hypothesis of no genetic association with a secondary phenotype at two levels of genetic association with the primary phenotype. In each of the four considered scenarios, we simulate 10 000 replicates. In Table 3, the empirical type I error rates are given for the rare disease scenario (i.e. prevalence 1%). We observe that while our approach preserves the type I error rate at a nominal level, the naive approach has, systematically, an inflated type I error rate. The type I error rate for the naive method increases with stronger ascertainment and larger SNP effect on the primary phenotype.
Table 3.
Nominal level (α) | Retrospective likelihood | Naive method | |
---|---|---|---|
At least two cases | |||
α 1=0.1 | |||
0.05 | 0.0509 | 0.0580 | |
0.01 | 0.0118 | 0.0152 | |
0.001 | 0.0017 | 0.0025 | |
α 1=0.5 | |||
0.05 | 0.0505 | 0.0878 | |
0.01 | 0.0113 | 0.0222 | |
0.001 | 0.0013 | 0.0043 | |
At least one case | |||
α 1=0.1 | |||
0.05 | 0.0524 | 0.0514 | |
0.01 | 0.0102 | 0.0098 | |
0.001 | 0.0018 | 0.0014 | |
α 1=0.5 | |||
0.05 | 0.0522 | 0.0558 | |
0.01 | 0.0098 | 0.0097 | |
0.001 | 0.0009 | 0.0016 |
Sibships with at least one or with at least two cases are considered. Two values for the association between the single‐nucleotide polymorphism and the primary phenotype, namely, α 1 = 0.1 and α 1 = 0.5, are used. Datasets consist of 400 sibships of size 5. Results are based on 10 000 replicates.
3.2. Simulation results for a polygenic score
To study the performance of the proposed method for polygenic score, we simulated centred and standardized scores. The parameters of the secondary phenotype model were chosen as for the SNP simulations: β 0=3.5 and β 1=0.2, whereas for the primary phenotype model, effect sizes of α 1= 0.1 or 0.5 were used. Figure 4 presents the estimates and confidence intervals for datasets with 400 sibships. Our approach provides unbiased estimates of the effect of the polygenic score on the secondary phenotype. In contrast, the naive approach provides biased estimates and the bias increases when the ascertainment process is more stringent or when α 1 is larger.
The results of the residual heritability estimates after adjustment for polygenic scores agree with the results obtained when a SNP is included in the model (Table 1). The naive approach did not perform well: estimates between 25–26% and 14–15% for an ascertainment process of at least one affected sibling and at least two affected siblings, respectively, instead of 50%.
4. Application: analysis of the Leiden Longevity Study
In this Section, we will exemplify our proposed method in the analysis of the LLS briefly introduced in Section 1. The LLS is a family‐based study set up to identify mechanisms that contribute to healthy ageing and longevity. The inclusion criteria of the study are sibships with at least two nonagenarian siblings, that is, the selection takes place at Generation II (Figure 2). Several secondary phenotypes and GWAS data have been measured for the offspring of these siblings (Generation III in Figure 2) and their partners. Because the offspring have at least one nonagenarian parent, they are also likely to become long‐lived. Therefore, the set of offspring and their partners corresponds to a case‐control design with related subjects where the offspring in Generation III are considered as cases and their partners as controls. Overall, 421 families with 1671 offspring (cases) and 744 partners (controls) have been included in the study. Because the families are relatively small, we use the model that assumes an equal variance for the shared effect for the two traits.
Here, we model the association between genetic factors and the secondary phenotypes triglyceride and glucose levels. For both traits, there is evidence of an association with human longevity and both traits are normally distributed. For the sake of comparison in addition to our proposed method, we will present results using the naive approach, that is, standard linear mixed model. Analyses using the linear mixed model that conditions also on the case‐control status will not be presented because the parameters do not have a comparable interpretation between the two approaches. The p‐values presented subsequently are obtained using the likelihood ratio test.
4.1. Triglyceride levels analysis
Triglyceride levels have been found to be associated with the primary trait longevity ( p‐value = 0.0005 for women and p‐value = 0.04 for men), and the size of association is sex dependent. Therefore, a sex‐stratified analysis has been considered further. For the purposes of our illustration, we restricted our analysis to seven genes on chromosome 11 that are known to be associated with triglyceride levels. These genes are APOA1, APOA4, APOA5, APOC3, ZNF259, BUD13 and DSCAML1. The selection of the genes was performed using the NHGRI‐EBI GWAS catalog 20. For these genes, we have genotypes of 41 SNPs that have no missing values in our datasets. Triglyceride levels were standardized, and we included age as a covariate in the analysis.
We ran the analysis with the constrained approach, that is, δ=1. We observe that none of the SNPs analysed is significantly associated with triglyceride levels either in men or in women; hence, for most SNPs, the estimates of the effect sizes agree between the two approaches. The SNPs showing the largest differences are, in men, SNP 22: for our Retrospective Approach (RA) and for the Naive Approach (NA) and SNP 26: and . For women, more SNPs give different estimates between the two approaches, that is, SNP 1 ( , ), SNP 2 ( = 7.2e‐06 ), SNP 13 ( , ) and SNP 19 ( , ) showed the biggest differences. Results for the SNPs are presented in Section B of the Supporting Information.
We verified whether the assumption of equal variances for the primary and secondary phenotype for the shared effects is justified. We fitted also the model with non‐constrained δ. We noticed that for some of the SNPs, the model parameters are hard to estimate and the estimates of the variances of the shared and residual random effects in the model for the second phenotype are swapped. Overall, the estimates of the effect of the SNP on the secondary phenotype are very similar to the model that assumes equal variances. Results of these analyses are presented in Section B of the Supporting Information.
4.2. Glucose levels analysis
In previous analysis of glucose levels in the offspring and partners of the LSS, Mooijart 21 studied the association between glucose and a polygenic score. The genetic score was defined as the total number of risk alleles across 15 SNPs that are known to be associated with Type II diabetes. The generalized estimating equation method was applied to take into account the familial relationships. The paper showed that a higher number of Type II diabetes risk alleles is associated with a higher serum concentration of glucose ( p‐value = 0.016). A statistically significant association was found between the glucose level and case‐control status ( p‐value < 0.001). However, the sampling process was not taken into account in the analysis and thus the results might be biased. We applied the proposed method to estimate the heritability of glucose levels and to test for the presence of an association between the glucose levels and the polygenic score. In addition, we applied the naive approach that did not correct for case‐control status. We did not stratify according to sex in these analyses.
For this analysis, the polygenic score was standardized. Using the RA, the association between the genetic score and the glucose level is estimated by with a standard error of s t E=0.023(p‐value =0.015). The naive approach also yields a significant association between the genetic score and glucose levels ( , s t E=0.026, p−v a l u e=0.020). By using the NA, we obtained for the glucose levels a genetic variance of and a total variance of , which corresponds to a residual heritability of . Our RA yields a genetic variance of and a total variance of that corresponds to a residual heritability of .
5. Discussion
In this paper, we developed a new method for the proper analysis of secondary traits for multiple‐cases family designs. A key component in our proposed method is the joint modelling of the primary and secondary phenotypes. We developed a multivariate probit model that can also capture the within families dependencies. A retrospective likelihood approach has been followed to correct for the ascertainment process. Thereby unbiased estimates of the association between genetic factors and secondary traits can be obtained. Simulation results showed that our approach preserves the type I error at nominal level and provides accurate estimates irrespective of the disease prevalence, the strength of the association between the genetic variants and the primary phenotype, and the ascertainment mechanism. Another important empirical finding is that the heritability estimates for the secondary traits can be severely underestimated unless the sampling mechanism is taken into account. With respect to the analysis of the motivating case study, for the SNPs, the differences between the effect sizes obtained by our proposed method and the naive approach were small. The small differences obtained between the naive and the retrospective approach are mainly because of the small effect sizes of the genetic markers selected on the primary phenotype. Indeed, the three main factors influencing the magnitude of the bias when using the naive approach are the correlation between the secondary phenotype and the primary phenotype, the strength of the ascertainment, and the strength of the association between the genetic marker and the primary phenotype.
Heritability is one of the properties that a trait needs to possess to be declared an endophenotype for a specific disease. The other criteria are the trait is associated with the disease status in the population, the trait manifests whether illness is active or in remission (state independent) and the trait and the disease status co‐segregate within a family 22. The Leiden Family Lab (https://www.leidenfamilylab.nl) aims to identify endophenotypes for social anxiety disorder. The study comprises families with at least two cases with social anxiety. The methods presented in this paper will be used for the analyses of this study to identify endophenotypes and are relevant for other family studies, as well.
In this paper, we proposed to include additional covariates in the model by using the likelihood conditional on these covariates. Alternatively, the joint likelihood of the secondary phenotype, genotype and covariate conditionally on the primary phenotype can be used. This alternative approach might be more efficient 23. However, this likelihood requires distributional assumptions for the covariates within families, which can be complex for related individuals. Moreover, maximization of the likelihood might become time consuming. Ghosh et al. 24 propose a pseudo‐likelihood and a profile approach to include covariates in a secondary phenotype analysis for case‐control data. This work needs to be extended to family data. A Monte Carlo approach might be considered to compute the integrals (Tsonaka et al. 25).
Typically, there are missing genotypes. In unrelated individuals, genotypes can be imputed based on the haplotype structure obtained from a reference panel. For family data, the imputation should also take into account the genotypes of other family members. Software exists that can perform such analysis, for example, the Genotype Imputation Given Inheritance program 26. However, for the computation of the denominator in Equation (2), these imputed genotype probabilities have to be taken into account.
Because of the computational intensity of the proposed method, it is not yet possible to run full GWAs analyses of secondary phenotypes. However, the proposed method can be used on a set of pre‐selected variants, for example, after an initial screening with the naive approach to the primary and secondary phenotypes or when investigating pleiotropic effects. To reduce the computation time of the multivariate integrals in the numerator and the denominator, a faster algorithm can be used than the one used in this paper. The randomized quasi‐Monte Carlo procedure, developed by Genz 27, is less accurate but faster especially for large pedigrees. Development of less computational intensive methods is one of the topics for future research.
With regard to pleiotropic effects, a criticism of probit random‐effects models is that in the presence of high dimensional random effects we cannot move from the subject‐specific interpretation for the fixed effects parameters to the population‐level interpretation as in the random‐intercepts case. When the outcome is binary and families are relatively small, estimation of the intercept and variances terms can be difficult, and consequently, coverage probabilities can be poor. Tsonaka et al. 6 showed efficiency gains by using information on disease prevalence. Their methods need to be adapted to our setting of the analysis of two phenotypes. When the parameters of the primary phenotype model are not of interest and this model is only used to correct for the ascertainment mechanism that is driven by the primary phenotype, we showed that secondary phenotype analyses with the proposed method are robust to using the probit instead of the logit link function.
Future directions in the LSS and Leiden Family Lab Study will address the pending availability of multiple omics and functional MRI data, respectively, and joint modelling of several glycans or voxels is of interest. Extending our approach, in this case, is algebraically straightforward, but practical implementation may be challenging because of computational intensity especially with a large number of secondary phenotypes. The use of composite likelihood approaches might be a solution and is our current research topic.
Finally, an attractive alternative approach to properly analyse secondary traits is to apply inverse probability weighting. However, it is crucial to correctly specify the weights. Currently, we do not have sufficient information to be able to estimate these weights for our studies. However, with access to electronic records for research, such as information from general practitioners to estimate the weights, it is likely that inverse probability weighting approaches can be developed.
Supporting information
Acknowledgements
This work was supported by FP7‐Health‐F5‐2012, under grant agreement no305280 (MIMOmics).
Tissier, R. , Tsonaka, R. , Mooijaart, S. P. , Slagboom, E. , and Houwing‐Duistermaat, J. J. (2017) Secondary phenotype analysis in ascertained family designs: application to the Leiden longevity study. Statist. Med., 36: 2288–2301. doi: 10.1002/sim.7281.
References
- 1. Dubdbridge F. Power and predictive accuracy of polygenic risk scores. The American Journal of Psychiatry 2003; 160(4):636–645.12668349 [Google Scholar]
- 2. Monsees GM, Taqmimi RM, Kraft P. Genome‐wide association scans for secondary traits using case‐control samples. Genetic Epidemiology 2009; 33:717–728. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3. Houwing‐Duistermaat JJ, Callegaro A, Beekman M, Westendorp RG, Slagboom PE, van Houwelingen JC. Weighted statistics for aggregation and linkage analysis of human longevity in selected families: the leiden longevity study. Statistics in Medicine 2009; 28(1):140–151. [DOI] [PubMed] [Google Scholar]
- 4. Skytthe A, Valensin S, Jeune B, Cevenini E, Balard F, Beekman M, Bezrukov V, Blanche H, Bolund L, Broczek K, Carru C, Christensen K, Christiansen L, Collerton JC, Cotichini R, de Craen AJ, Dato S, Davies K, De Benedictis G, Deiana L, Flachsbart F, Gampe J, Gilbault C, Gonos ES, Haimes E, Hervonen A, Hurme MA, Janiszewska D, Jylhä M, Kirkwood TB, Kristensen P, Laiho P, Leon A, Marchisio A, Masciulli R, Nebel A, Passarino G, Pelicci G, Peltonen L, Perola M, Poulain M, Rea IM, Remacle J, Robine JM, Schreiber S, Scurti M, Sevini F, Sikora E, Skouteri A, Slagboom PE, Spazzafumo L, Stazi MA, Toccaceli V, Toussaint O, Törnwall O, Vaupel JW, Voutetakis K, Franceschi C, GEHA consortium. Design, recruitment, logistics, and data management of the GEHA (Genetics of Healthy Ageing) project. Experimental Gerontology 2011; 46(11):934–945. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5. de Visser MC, van Minkelen R, van Marion V, den Heijer M, Eikenboom J, Vos HL, Slagboom PE, Houwing‐Duistermaat JJ, Rosendaal FR, Bertina RM. Genome‐wide linkage scan in affected sibling pairs identifies novel susceptibility region for venous thromboembolism: genetics in familial thrombosis study. Journal of Thrombosis and Haemostasis 2013; 11(8):1474–84. de Visser MC, van Minkelen R, van Marion V, den Heijer M, Eikenboom J, Vos HL, Slagboom PE, Houwing‐Duistermaat JJ, Rosendaal FR, Bertina RM [DOI] [PubMed] [Google Scholar]
- 6. Tsonaka R, de Visser MCH, Houwing‐Duistermaat JJ. Estimation of genetic effects in multiple cases family studies using penalized maximum likelihood methodology. Biostatistics 2013; 14(2):220–231. [DOI] [PubMed] [Google Scholar]
- 7. Lee AJ, McMurchy L, Scott AJ. Re‐using data from case‐control studies. Statistics in Medicine 1997; 16(12):1377–1389. [DOI] [PubMed] [Google Scholar]
- 8. Lin DY, Zeng D. Proper analysis of secondary phenotype data in case‐control association studies. Genetic Epidemiology 2009; 33:256–265. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9. Richardson DB, Rzehak P, Klenk J, Weiland SK. Analyses of case‐control data for additional outcomes. Epidemiology 2007; 18:441–445. [DOI] [PubMed] [Google Scholar]
- 10. Schifano ED, Li L, Christiani DC, Lin X. Genome‐wide association analysis for multiple continuous secondary phenotypes. American Journal of Human Genetics 2013; 92(5):744–759. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11. He J, Li H, Edmonson AC, Rader DJ, Li M. A Gaussian copula approach for the analysis of secondary phenotypes in casecontrol genetic association studies. Biostatistics 2011; 13(3):497–508. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12. Li H, H. GM. Efficient adaptively weighted analysis of secondary phenotypes in case‐control genome‐wide association studies. Human Heredity 2012; 73:159–173. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13. Najita JS, Li Y, Catalano PJ. A novel application of a bivariate regression model for binary and continuous outcomes to studies of fetal toxicity. Journal of the Royal Statistical Society: Series C (Applied Statistics) 2009; 58(4):555–573. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14. Thomas DC. Statistical Methods in Genetic Epidemiology. Oxford University Press: Oxford, 2004. [Google Scholar]
- 15. Elston RC, Stewart J. A general model for the analysis of pedigree data. Human Heredity 2013; 21:523–542. [DOI] [PubMed] [Google Scholar]
- 16. Miwa T, Hayer AJ, Kuriki S. The evaluation of general non‐centred orthant probabilities. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2003; 65:223–234. [Google Scholar]
- 17. International Schizophrenia Consortium, Purcell SM, Wray NR, Stone JL, Visscher PM, O'Donovan MC, Sullivan PF, Sklar P. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature 2009; 460(7256):748–752. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18. International Multiple Sclerosis Genetics Consortium (IMSGC), Bush WS, Sawcer SJ, de Jager PL, Oksenberg JR, McCauley JL, Pericak‐Vance MA, Haines JL. Evidence for polygenic susceptibility to multiple sclerosis – the shape of things to come. American Journal of Human Genetics 2010; 86(4):421–425. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19. Simonson MA, Wills AG, Keller MC, McQueen MB. Recent methods for polygenic analysis of genome‐wide data implicate an important effect of common variants on cardiovascular disease risk. BMC Medical Genetics 2011; 12:146. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20. Welter D, MacArthur J, Morales J, Burdett T, Hall P, Junkins H, Klemm A, Flicek P, Manolio T, Hindorff L, Parkinson H. The NHGRI GWAS catalog, a curated resource of SNP‐trait associations. Nucleic Acids Research 2014; 42(Database Issue):D1001–D1006. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21. Mooijaart SP, van Heemst D, Noordam R, Rozing MP, Wijsman CA, de Craen AJ, Westendorp RG, Beekman M, Slagboom PE. Polymorphisms associated with type 2 diabetes in familial longevity: the Leiden longevity study. Aging 2010; 3:55–62. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22. Gottesman II, Gould TD. The endophenotype concept in psychiatry: etymology and strategic intentions. PLoS Genet 2013; 9(3):636–645. [DOI] [PubMed] [Google Scholar]
- 23. Balliu B, Tsonaka R, Boehringer S, Houwing‐Duistermaat J. A retrospective likelihood approach for efficient integration of multiple omics factors in case‐control association studies. Genetic Epidemiology 2015; 39(3):156–165. [DOI] [PubMed] [Google Scholar]
- 24. Ghosh A, Wright FA, Zou F. Unified analysis of secondary traits in case‐control association studies. Journals ‐ American Statistical Association 2013; 108(52):140–151. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25. Tsonaka R, van der Woude D, Houwing‐Duistermaat JJ. Marginal genetic effects estimation in family and twin studies using random‐effects models. Biometrics 2015; 71(4):1130–1138. [DOI] [PubMed] [Google Scholar]
- 26. Cheung CY, Thompson EA, Wijsman E. Gigi: an approach to effective imputation of dense genotypes on large pedigrees. American Journal of Human Genetics 2013; 92(4):504–516. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27. Genz A. Numerical computation of multivariate normal probabilities. Journal of Computational and Graphical Statistics 1992; 1:141–150. [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.