Abstract
In this paper, we propose the class of generalized additive models for location, scale and shape in a test for the association of genetic markers with non-normally distributed phenotypes comprising a spike at zero. The resulting statistical test is a generalization of the quantitative transmission disequilibrium test with mating type indicator, which was originally designed for normally distributed quantitative traits and parent-offspring data. As a motivational example, we consider coronary artery calcification (CAC), which can accurately be identified by electron beam tomography. In the investigated regions, individuals will have a continuous measure of the extent of calcium found or they will be calcium-free. Hence, the resulting distribution is a mixed discrete-continuous distribution with spike at zero. We carry out parent-offspring simulations motivated by such CAC measurement values in a screening population to study statistical properties of the proposed test for genetic association. Furthermore, we apply the approach to data of the Genetic Analysis Workshop 16 that are based on real genotype and family data of the Framingham Heart Study, and test the association of selected genetic markers with simulated coronary artery calcification.
Keywords: Coronary artery calcification, generalized additive models for location, scale and shape, genetic marker, likelihood ratio test, transmission disequilibrium test
1. Introduction
The aim of genetic association analysis is to identify genetic markers that are associated with affection status for a disease or a quantitative trait. In genetics, traits can be any qualitative or quantitative (measurable) characteristic of an organism and will be used synonymously with the term phenotype in this article. A marker may be associated either because it is functionally involved in the phenotype, because it is in linkage disequilibrium (LD) with a functional marker, or owing to spurious associations. Spurious associations can for example be caused by confounding between the phenotype and genetic ancestry, so-called population stratification. In a genome-wide association study, it is possible to correct for population stratification due to the availability of thousands of additional markers. An alternative approach is a candidate-gene investigation. Here, markers are selected a priori because they are suspected to be involved in the phenotype, e.g. on the basis of an animal model. In such candidate-gene studies, protection against population stratification can be given by appropriate family-based association tests (FBAT).
A widely used method for family-based studies is the transmission disequilibrium test (TDT) [24]. For the TDT, an affected offspring and his or her parents are genotyped (parent-offspring trio), and for a diallelic marker one tests whether a specific marker allele A is transmitted from a heterozygous parent to an affected offspring more often than the other allele a. The statistical test, McNemar's test for matched tables, tests whether there is an association while protecting against population stratification.
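As a minimal sketch of the McNemar-type statistic described above, the following computes the TDT statistic from the transmission counts of heterozygous parents and its asymptotic p-value with one degree of freedom. The function names and the counts are hypothetical and serve only as an illustration.

```python
import math


def tdt_statistic(n_A: int, n_a: int) -> float:
    """McNemar-type TDT statistic from transmissions of heterozygous parents.

    n_A: number of times allele A was transmitted to an affected offspring,
    n_a: number of times allele a was transmitted (illustrative names).
    """
    total = n_A + n_a
    if total == 0:
        raise ValueError("no informative transmissions")
    return (n_A - n_a) ** 2 / total


def chi2_sf_df1(x: float) -> float:
    """Upper tail probability of a chi-square distribution with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))


# Hypothetical counts: A transmitted 60 times, a transmitted 40 times.
stat = tdt_statistic(60, 40)
p_value = chi2_sf_df1(stat)
```

Only heterozygous parents contribute to the counts, which is what protects the test against population stratification.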
For quantitative traits, a number of quantitative TDT/FBAT versions exist [1,2,7,14,16,20,21,23,27]; for a review, see [6]. Here we focus on the quantitative transmission disequilibrium test with mating type indicator (QTDT_M) [8]. QTDT_M was shown to be somewhat more efficient than competitors for testing genetic main effects and much more efficient for testing gene-environment or gene-gene interactions on normally distributed traits [8]. QTDT_M is a regression-based procedure for n parent-offspring trios. The offspring phenotype is modelled by a fixed effects model that accounts for the offspring genotype and offspring covariate values, with adjustment for the six possible parental mating types. The parental mating type is the genotype-genotype combination of the parents. Mating type adjustment accounts for population stratification.
Besides normally distributed traits or traits that can be easily transformed to a normal distribution (such as blood concentrations by a logarithmic transformation), TDT versions have been developed for the extreme tails of normally distributed traits [2], and for survival traits such as age of disease onset [15]. This article focusses on a non-negative continuous phenotype with a probability mass at zero and a continuous distribution for values greater than zero, such as the measurement of coronary artery calcification (CAC) by electron beam computed tomography [22]. This tomography accurately measures CAC at the region of the artery that is investigated. A CAC score of zero is indeed measured at that site and corresponds to a true zero rather than representing a missing observation. Please note that measured CAC is only a surrogate for the unmeasured latent total body CAC in all arteries of a person, which may be greater than zero despite a measured CAC score of zero. However, we will consider the measured phenotype only.
Whether or not a person develops CAC and, if so, how much CAC will be developed, may depend on the same or different covariates, such as age, sex, genetic and environmental variables. Individuals with a non-zero CAC score will be termed prevalent cases with CAC. Overall, we observe that a fraction p of the population is assigned a CAC score of zero, whereas the remaining fraction 1−p of the population will have a (possibly right-skewed) distribution of a continuous trait with positive support.
In this article, we extend the quantitative TDT with fixed effects for parental mating types by combining it with generalized additive models for location, scale and shape (GAMLSS) [19,25]. Lozano [17] used these models for TDT in the context of a skewed distribution. The class of GAMLSS relaxes the usual exponential family assumption such that more complicated phenotype distributions can be assumed. In particular, GAMLSS allow the analyst to relax the standard assumption of a normally distributed phenotype and instead to assume that the phenotype has a mixed discrete-continuous distribution with point mass at zero and a continuous part with positive support. As particular examples, we will consider mixed discrete-continuous distributions with an inverse Gaussian or a gamma continuous component leading to zero-adjusted inverse Gaussian (ZAIG) and zero-adjusted gamma (ZAGA) distributions. The focus is not so much on the model, but on the corresponding test for the genetic effect, which we will call ZAIG-QTDT_M and ZAGA-QTDT_M for inverse Gaussian and gamma distributed continuous phenotype components, respectively. We will compare the performance of the novel QTDT_M variants with that of QTDT_M and other QTDT_M versions arising from discretized phenotype observations in simulation scenarios. The scenarios correspond to a genotype influencing both the probability of a zero CAC score p and the continuous part of the phenotype distribution (estimating power), or influencing no part of the mixture-distributed phenotype (estimating type-1 error). We also considered population stratification from two subpopulations with unequal genotype distribution as a confounding factor for the continuous part, i.e. the severity of CAC is influenced by the population, but not the probability of a zero CAC or the prevalence of CAC.
2. Material and methods
2.1. Testing for the association of genetic markers
2.1.1. Modelling zero-spikes using zero-adjusted GAMLSS
The distribution of CAC scores is observed to be right-skewed with a spike at zero, such that it is a mixed discrete-continuous distribution with a point mass at zero (no CAC) and a continuous part with positive support reflecting the non-zero CAC scores. Two appropriate candidates from the class of GAMLSS to model observations comprising the CAC score, a set of coded genotypes and additional covariates are the zero-adjusted inverse Gaussian distribution (ZAIG) with density
| (1) |
and the zero-adjusted gamma (ZAGA) distribution with density
| (2) |
For both distributions, the parameter reflects the probability of observing a zero CAC score, while corresponds to the conditional expectation , and is a scale parameter of the conditional distribution of . Note that the overall expectation without conditioning on is . We are particularly interested in the effects of genotype on the conditional expectation . Note also that the model specifications above imply that any influence of genetic or clinical covariates on also alters the variance (and further moments) of the continuous part of the mixture distribution, since is a scale parameter but the variance depends on both and for both distributions. We assume throughout this article that is a constant.
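For orientation, the two densities can be sketched in the parameterization documented for the gamlss package. The symbols μ (conditional mean of the positive part), σ (scale) and ν (probability of a zero) follow that package's naming and are assumptions here, not necessarily the notation of Equations (1) and (2).

```latex
% Zero-adjusted inverse Gaussian (ZAIG), gamlss parameterization (assumed notation)
\[
f_{\mathrm{ZAIG}}(y \mid \mu, \sigma, \nu) =
\begin{cases}
  \nu, & y = 0,\\[4pt]
  (1-\nu)\,\dfrac{1}{\sqrt{2\pi\sigma^{2}y^{3}}}
  \exp\!\left\{-\dfrac{(y-\mu)^{2}}{2\mu^{2}\sigma^{2}y}\right\}, & y > 0.
\end{cases}
\]

% Zero-adjusted gamma (ZAGA), gamlss parameterization (assumed notation)
\[
f_{\mathrm{ZAGA}}(y \mid \mu, \sigma, \nu) =
\begin{cases}
  \nu, & y = 0,\\[4pt]
  (1-\nu)\,\dfrac{y^{1/\sigma^{2}-1}\exp\!\left\{-y/(\sigma^{2}\mu)\right\}}
  {(\sigma^{2}\mu)^{1/\sigma^{2}}\,\Gamma(1/\sigma^{2})}, & y > 0.
\end{cases}
\]

% Moments implied by both specifications
\[
\mathrm{E}(Y \mid Y > 0) = \mu, \qquad
\mathrm{E}(Y) = (1-\nu)\,\mu, \qquad
\mathrm{Var}(Y \mid Y > 0) =
\begin{cases}
  \sigma^{2}\mu^{3} & \text{(ZAIG)},\\
  \sigma^{2}\mu^{2} & \text{(ZAGA)}.
\end{cases}
\]
```

The variance expressions make explicit why any covariate effect on the conditional mean also changes the variance of the continuous part, even when the scale parameter is held constant.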
The parameters of the two zero-adjusted distributions can now be related to regression predictors, one for each parameter, determined as linear combinations of covariates (including genetic information and information on the mating type) via
| (3) |
to ensure that the mean and scale parameters are positive and that the zero-spike probability lies between 0 and 1. For estimation of all models discussed in the following, we used the R package gamlss, which is a maximum-likelihood based approach to estimate regression models in which all distribution parameters of univariate responses may depend on covariates. Newton-Raphson / Fisher-scoring type algorithms were used to maximize the likelihood, see [25] for details.
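The role of the link functions can be illustrated with a short sketch. It assumes log links for the mean and scale parameters and a logit link for the zero-spike probability, a choice that is consistent with the multiplicative interpretation of effects used later in this article. All names and numerical values are hypothetical.

```python
import math


def zero_adjusted_parameters(eta_mu: float, eta_sigma: float, eta_nu: float):
    """Map linear predictors to (mu, sigma, nu).

    Assumed links: log for mu and sigma, logit for nu, so that
    mu > 0, sigma > 0 and 0 < nu < 1 hold for any real predictor value.
    """
    mu = math.exp(eta_mu)
    sigma = math.exp(eta_sigma)
    nu = 1.0 / (1.0 + math.exp(-eta_nu))
    return mu, sigma, nu


def odds(p: float) -> float:
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)


# Hypothetical intercepts and per-allele genetic effects on the predictors.
b0_mu, b1_mu = 4.0, 0.3
b0_nu, b1_nu = 1.5, -0.4

mu0, _, nu0 = zero_adjusted_parameters(b0_mu, 0.0, b0_nu)
mu1, _, nu1 = zero_adjusted_parameters(b0_mu + b1_mu, 0.0, b0_nu + b1_nu)

mean_ratio = mu1 / mu0               # equals exp(b1_mu)
odds_ratio = odds(nu1) / odds(nu0)   # equals exp(b1_nu)
```

Under these assumed links, an effect that is additive on the predictor acts multiplicatively on the conditional mean and on the odds of a zero.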
2.1.2. The genetic epidemiological model
The original QTDT_M assumes normally distributed offspring traits that are analyzed within a linear regression framework [8]
| (4) |
This is a fixed effects linear regression model for the overall expected value of offspring trait values dependent on the offspring genotype, with adjustment for offspring covariates, and accounting for the six possible parental mating types. QTDT_M assumes normally distributed residuals with constant variance.
As mentioned already in the introduction, the mating type indicator is a factor with six levels that indicates which specific combination of paternal and maternal genotypes was present in the parents of offspring i. In other words, QTDT_M estimates six intercepts, one for each parental mating type. The rationale behind this is that ethnic subpopulations may differ by intercept but not by genetic effect strength. Confounding of the genetic effect arises when subpopulations differ both in the trait intercept and in the minor allele frequency (MAF) of the examined single nucleotide polymorphism (SNP) marker. Then, the proportion of offspring that originates from the same subpopulation will vary between the offspring genotype groups. Hence, a direct regression on the offspring genotype would yield confounded estimates. Instead, QTDT_M estimates the shared genetic effect in the offspring within parental mating type strata. While subpopulations contribute with varying proportions to the parental mating types, this effect is now absorbed into the parental mating type intercepts, thus eliminating the confounding of the genetic effect. Effectively, the genetic effect is estimated only in offspring with at least one heterozygous parent (mating types: AA × aA, aA × aA, aA × aa). However, also including the other mating types (thus using the whole sample) increases the power of the test by reducing the overall residual variance [8]. This distinguishes QTDT_M from other approaches. The coding of offspring genotype depends on the genetic mode of inheritance. With a code of 0 for common homozygotes AA, c for heterozygotes aA, and 1 for minor allele homozygotes aa, the values c = 0, 0.5, 1 represent the recessive, additive, and dominant mode of inheritance, respectively. We assumed an additive mode of inheritance throughout this article.
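The construction of the mating type factor can be sketched as follows. Genotypes are coded here by the number of minor alleles (0 for AA, 1 for aA, 2 for aa); this coding and all function names are assumptions made for illustration only.

```python
from itertools import combinations_with_replacement

GENOTYPES = (0, 1, 2)  # number of minor alleles a: AA = 0, aA = 1, aa = 2

# The six unordered parental genotype pairs, i.e. the six mating types.
MATING_TYPES = tuple(combinations_with_replacement(GENOTYPES, 2))


def mating_type(father: int, mother: int) -> int:
    """Return the index (0..5) of the parental mating type.

    The pair is sorted first, so the parental order does not matter.
    """
    return MATING_TYPES.index(tuple(sorted((father, mother))))


def is_informative(father: int, mother: int) -> bool:
    """True if at least one parent is heterozygous.

    Only such trios show offspring genotype variation within the stratum.
    """
    return 1 in (father, mother)


def additive_code(offspring: int) -> float:
    """Additive coding: 0, 0.5 and 1 for AA, aA and aa, respectively."""
    return offspring / 2.0
```

In a regression, the index returned by `mating_type` would enter as a six-level factor, giving one intercept per stratum.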
Applying a standard linear regression on the CAC scores, as proposed in Equation (4), disregards their non-normal distribution. Therefore, we propose the zero-adjusted variants ZAIG-QTDT_M and ZAGA-QTDT_M as alternatives with predictor specifications
| (5) |
for the parameters of the zero-adjusted distributions. The null hypothesis of no association between the response and the genetic marker G is tested using a likelihood ratio test statistic
where and are the maximized log-likelihoods under the null hypothesis (no association) and the alternative or , respectively. As a consequence, this is a likelihood ratio test with two degrees of freedom ().
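A minimal sketch of a likelihood ratio test with two degrees of freedom is given below. It assumes the usual form of the statistic, twice the difference of the maximized log-likelihoods, and the asymptotic chi-square reference distribution. The log-likelihood values are hypothetical, and the empirical evaluation in this article relies on resampling for the zero-adjusted tests rather than on this approximation.

```python
import math


def lrt_two_df(loglik_null: float, loglik_alt: float):
    """Likelihood ratio statistic and asymptotic p-value for 2 df.

    For a chi-square distribution with 2 df the upper tail probability
    has the closed form exp(-x / 2).
    """
    stat = max(2.0 * (loglik_alt - loglik_null), 0.0)
    p_value = math.exp(-stat / 2.0)
    return stat, p_value


# Hypothetical maximized log-likelihoods of the null and alternative model.
stat, p_value = lrt_two_df(-1000.0, -997.0)
```

The two degrees of freedom correspond to the two genetic parameters that are set to zero under the null hypothesis, one in the predictor of the conditional mean and one in the predictor of the zero-spike probability.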
Often data are dichotomized or categorized, so that we also consider two highly simplified alternatives to the zero-adjusted QTDT_M. BIN-QTDT_M considers standard logistic regression on the dichotomized CAC score, i.e. calcification present or not; we used the family BI in the gamlss package for estimation. CAT-QTDT_M is based on one category for zeros and three further ordered categories according to the 33% and 66% quantiles, and then a cumulative logistic model for ordinal responses is estimated by the function polr in R. BIN-QTDT_M and CAT-QTDT_M are tests for the genetic parameter with one degree of freedom, similar to the classic QTDT_M.
Besides significance testing, the Akaike information criterion with finite sample size correction (AICc) [12] describes how well a specific model fits the data at hand. This criterion can be used to compare the models underlying the classic QTDT_M with the zero-adjusted versions and also allows us to decide which covariates to include. Note, however, that a comparison with the tests obtained from discretized responses is not possible, since the comparison of AICs requires the use of the same response variable. Note furthermore that we have included all constants in the computation of the AIC so that we can directly compare the fit of the QTDT_M and the zero-adjusted QTDT_Ms in terms of this criterion.
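For reference, the finite sample correction can be sketched as below, assuming the common form AICc = AIC + 2k(k + 1)/(n − k − 1), where k is the number of estimated parameters and n the sample size. The function name and the example values are hypothetical.

```python
def aicc(loglik: float, k: int, n: int) -> float:
    """AIC with finite sample size correction.

    loglik: maximized log-likelihood including all constants,
    k: number of estimated parameters, n: number of observations.
    """
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n > k + 1")
    aic = -2.0 * loglik + 2.0 * k
    return aic + 2.0 * k * (k + 1) / (n - k - 1)


# Hypothetical example: 500 trios, 9 parameters, log-likelihood of -1000.
value = aicc(-1000.0, 9, 500)
```

Smaller values indicate a better fit, and the correction term vanishes as n grows relative to k.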
2.2. Empirical evaluation
We investigated type-1 error and power to detect the effect of a single genetic marker on the CAC phenotype at the 5% level for QTDT_M, ZAIG-QTDT_M, ZAGA-QTDT_M, BIN-QTDT_M and CAT-QTDT_M. For ZAIG-QTDT_M and ZAGA-QTDT_M, we used resampling with 500 runs to determine the test result. For the other tests, we did not use resampling, as this corresponds to the usual way to carry them out. We also used AICc to compare the fit of the models.
2.2.1. Simulation study
We generated replicates of cross-sectional studies with a given number n of parent-offspring trios for each of the subpopulations j = 1, 2 and parental and offspring genotypes for a single diallelic SNP. This creates population admixture, and thus may lead to genetic confounding, a main reason for applying a TDT approach. We assumed that half of the sample was randomly recruited from each of the two distinct genetic subpopulations, without cross-mating between subpopulations. The SNP had a MAF of 0.10 in subpopulation j = 1 and 0.30 in subpopulation j = 2, was in Hardy-Weinberg equilibrium (HWE) within each subpopulation, and was transmitted from parents to their offspring by Mendelian segregation. Thus, 65% of the offspring in this study are common homozygotes (AA), 30% heterozygotes (aA), and 5% minor allele homozygotes (aa). CAC score trait data of the offspring were simulated to be ZAIG distributed and to depend on the offspring genotype and, for CAC > 0, on an ethnic confounding in the continuous part:
| (6) |
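The offspring genotype proportions quoted for the admixed sample follow from averaging the HWE frequencies of the two subpopulations, as the following sketch verifies. Function names are illustrative.

```python
def hwe_frequencies(maf: float):
    """Genotype frequencies (AA, aA, aa) under Hardy-Weinberg equilibrium."""
    q = maf          # frequency of the minor allele a
    p = 1.0 - q      # frequency of the common allele A
    return (p * p, 2.0 * p * q, q * q)


def mixture_frequencies(mafs, weights):
    """Genotype frequencies in a sample mixed from several subpopulations."""
    mixed = [0.0, 0.0, 0.0]
    for maf, weight in zip(mafs, weights):
        for idx, freq in enumerate(hwe_frequencies(maf)):
            mixed[idx] += weight * freq
    return tuple(mixed)


# MAF 0.10 and 0.30, each subpopulation contributing half of the sample.
freqs = mixture_frequencies([0.10, 0.30], [0.5, 0.5])
```

With random mating and Mendelian segregation within each subpopulation, the offspring generation is again in HWE, so the same frequencies apply to parents and offspring.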
To obtain a realistic scenario, our data generating process is based on the distribution of CAC scores in a representative population sample of 35,246 subjects between the ages of 30 and 90 years, who underwent a CAC score screening in 1990. We focussed on 50-year-old individuals with parameters averaged over gender, so that no adjustments for age and sex in the simulated CAC score are necessary and chose the ZAIG intercept parameters in accordance with the corresponding observed distribution of CAC scores (parameters for 50-year-old males and females averaged: , , ; see, e.g. [11]). For power estimation, we specified reasonable genetic effect strengths and , whereas for type-1 error estimation we set and . These data have a massive spike at zero: 83% of all offspring have zero CAC under (, ). 78% of all offspring have zero CAC when the risk SNP (, ) is present.
The effect of the ethnic confounding was set to be for subpopulation 1 and for subpopulation 2. The ethnic confounding solely affects the intercepts of the conditional expectation of non-zero CAC scores and of the overall mean. Note that for ZAIG data generation (Equation (6)) and ZAIG-QTDT_M model estimation (Equation (3)), genetic effects are additive (per allele) on the predictors but multiplicative on the conditional mean of non-zero CAC scores and the zero-spike chances, respectively. The simulated risk SNP lowers the zero-spike probability and increases the conditional mean of non-zero CAC scores per minor allele. Both genetic effects jointly increase the overall mean analysed by QTDT_M.
Realistic CAC scores for a 50-year-old female or male have a massive spike at zero. To have a scenario also with a moderate spike at zero, we used the parameters for and as above but and . Then, 33% of all offspring have zero CAC under (, ). 25% of all offspring have zero CAC when the risk SNP (, ) is present.
For ZAIG-QTDT_M this is asymptotically equivalent to a situation with present and correctly adjusted covariate dependencies for genetic and non-genetic covariates. In contrast, in QTDT_M additive effects are estimated on the overall mean while true simulated effects (see (6)) are multiplicative by nature. The latter mismatch is restricted to the genetic estimates when other covariables including clinical ones are absent. Thus, owing to this scale issue, QTDT_M has a mismatch in the modelling of any covariates including the genetic effect. Of course, QTDT_M also wrongly assumes normal residuals with constant variance for these ZAIG-distributed data. As ZAIG was simulated, ZAGA-QTDT_M is not considered in the simulation study.
Finally, we remark that values for the CAC score vary widely between studies because the measurement procedures (number of measurements, thickness of slices, pixels, etc.) differ [22]. Hence the scale in our simulations is different from the one described in the following section.
2.2.2. CAC scores in the GAW16 Framingham data
The GAW16 Framingham data (accession number phs000128.v1.p1, obtained from the Database of Genotypes and Phenotypes (dbGaP), http://www.ncbi.nlm.nih.gov/gap) provided extended pedigrees with actual measured genotypes. Thus, any population structure is inherent in these genotype data. Our work based on these data is in accordance with the Declaration of Helsinki (1964) and was approved by the local institutional review board as well as subsequently by dbGaP based on this approval. CAC scores (with 200 data replicates) were simulated by the data providers [13] based on a latent mixture distribution of Gaussian variables, setting negative values to zero, and applying piecewise linear age adjustments. The expected value of the latent variable is basically created as a mixture of (a) total cholesterol and high density lipoprotein (HDL), which are both Gaussian mixture distributions including major genes and polygenes, (b) three effects created by five SNPs, of which only rs17714718 was generated to display a measurable additive main effect: SNP rs213952 displays overdominance, as heterozygotes are enriched in the spike, and there are two interacting or epistatic SNP pairs (each SNP with ), and (c) additional polygenes modelled by a normal distribution.
The Framingham study is a longitudinal cohort with simulated CAC scores for one baseline examination and subsequent follow-up examinations. The analysis methods applied in this article, however, focus on the aspect of the CAC distribution at baseline and are thus cross-sectional.
We extracted 323 independent parent-offspring trios from the Framingham pedigrees with baseline CAC score and covariate values for the offspring. Offspring were 42% males, 73% non-smokers, and 52% had zero CAC scores. The offspring age range was 19 to 56 years with a mean of 29 years and a standard deviation of 8 years. Offspring cholesterol levels ranged from 112 to 268 units with a mean of 194 units and standard deviation of 26 units. Table 1 displays baseline CAC scores stratified by genotype of the associated overdominant SNP rs213952. Note that the percentage of zero CAC scores (52%) in the GAW16 Framingham data is between the massive spike and the moderate spike scenario of our simulation study above. In addition to the five SNPs that are contained in the data generating process for CAC, we extracted 14 other SNPs that were provisionally selected for a research question on body-mass-index [18] and represent the genetic null-hypothesis for CAC. Please note that we did not adjust for multiple testing, as we just want to demonstrate the performance of the single-variant test in this application. All considered SNPs are independent, i.e. not in LD.
Table 1. The CAC score in the GAW16 Framingham data (data replicate 16). The zero spike (CAC = 0) is enriched for rs213952 heterozygotes (n = 54 expected by HWE based on MAF, n = 71 observed).
| CAC | Total: n | Total: % | rs213952 genotype 0: n | rs213952 genotype 0: % | rs213952 genotype 1: n | rs213952 genotype 1: % | rs213952 genotype 2: n | rs213952 genotype 2: % |
|---|---|---|---|---|---|---|---|---|
| CAC = 0 | 168 | 52% | 93 | 29% | 71 | 22% | 4 | 1% |
| CAC > 0 | 155 | 48% () | 117 | 36% () | 29 | 9% () | 9 | 3% () |
To summarize, the GAW16 Framingham CAC score data follow a mixed discrete-continuous distribution with a spike at zero. Genetic effects contributed to several latent data-generating variables [13]. On these data, the model ZAIG-QTDT_M is somewhat mismatched regarding the assumed data distribution, as well as regarding the scale on which genetic effects were modelled. ZAIG-QTDT_M assumes a multiplicative effect on the mean, while the true effect is additive by nature (in this application, which is opposite to our simulations). An additive effect may appear insignificant on the multiplicative scale of the ZAIG estimator. Furthermore, ZAIG-QTDT_M is a heteroscedastic model that assumes that the variance of non-zero CAC scores depends on genetic and non-genetic covariates through the conditional mean. QTDT_M clearly does not fit the distributional properties correctly, since normally distributed residuals and homoscedasticity are assumed. However, in this application, the linear model might capture relevant data characteristics better as the scale of genetic effects is additive. Nevertheless, QTDT_M is still mismatched by assuming normally distributed residuals.
3. Results
3.1. Simulation study on ZAIG distributed CAC score data
We first compared the models for QTDT_M and ZAIG-QTDT_M. The mean AICc for the mismatched QTDT_M model was three times larger compared to the zero-spike model ZAIG-QTDT_M (e.g. for n = 500 for QTDT_M and for ZAIG-QTDT_M). Thus, from a modelling perspective, the ZAIG-QTDT_M model fits the data much better and is clearly preferred.
Figure 1 displays the estimated power (top row) and type-1 error (bottom row) for ZAIG distributed CAC score data with either a massive spike at zero (left column) or a moderate spike at zero (right column). QTDT_M maintains the type-1 error, or is even a little conservative. ZAIG-QTDT_M maintains the type-1 error reasonably well when using permutation testing, while it has type-1 error issues without permutation testing (not shown). For a moderate spike at zero and small sample sizes, ZAIG-QTDT_M is a little anti-conservative. BIN-QTDT_M and CAT-QTDT_M have more problems maintaining the type-1 error.
Figure 1.
Simulation study on ZAIG distributed CAC data. Displayed are the power (top row, averaged over 1000 data replicates) and type-1 error (bottom row, averaged over 10,000 data replicates) at the nominal 5% level for ZAIG data with a massive spike at zero (left column) or a moderate spike at zero (right column) for the models QTDT_M, ZAIG-QTDT_M, BIN-QTDT_M and CAT-QTDT_M. Note the different y-axes between power (top) and type-1 error (bottom: the estimation uncertainty for a value of 5% (small line) is the surrounding 95% confidence interval (dotted lines)).
In terms of power, ZAIG-QTDT_M outperformed QTDT_M (and also the discretized versions BIN-QTDT_M and CAT-QTDT_M) for both moderate and massive spike. This is in line with the better model fit demonstrated by AICc. The power for the moderate spike is generally better than for the massive spike. The massive spike generating model demonstrates the greater gain in power for ZAIG-QTDT_M compared to the other QTDT_Ms, which are hardly distinguishable in this model. For a moderate spike, the estimated power for binary coding is superior to coding with four categories in CAT-QTDT_M, and this is again slightly better than QTDT_M. However, this needs to be viewed cautiously, taking type-1 error results into account.
3.2. CAC scores with a spike at zero in the GAW16 Framingham data
The analysis models QTDT_M, ZAIG-QTDT_M and ZAGA-QTDT_M were adjusted for the main effects of covariables age, sex, cholesterol and smoker status, and the considered SNP was included into the model. We display several results for the SNP rs213952 in data replicate 16. This SNP has an overdominant effect, and thus does not fit the additive genetic model well. The distributional properties regarding CAC for this SNP/replicate are given in Table 1.
Figure 2 displays quantile-quantile plots of residuals for QTDT_M and (randomized) quantile residuals [5] for ZAIG-QTDT_M and ZAGA-QTDT_M for SNP rs213952 (replicate 16). As expected, the residuals for QTDT_M show a clear model misfit while the zero-adjusted gamma for ZAGA-QTDT_M shows the best fit. This is supported by the AICcs for the same setting, which are 4010 for QTDT_M, 2274 for ZAIG-QTDT_M and 2104 for ZAGA-QTDT_M.
Figure 2.
Quantile residuals of the three models QTDT_M, ZAIG-QTDT_M and ZAGA-QTDT_M for the overdominant SNP rs213952 in the GAW16 Framingham data (data replicate 16).
The results for all five QTDT_Ms on all SNPs are displayed in Table 2. Please note that the five SNPs with GAW16 simulated effects are in the upper part of the table. The SNP rs17714718 interacts with rs6743961, and is the only one with an additional additive genetic main effect according to the data description [13]. Thus, due to the interaction, some model misfit is given here as well. All results refer to testing each SNP individually at the 5% level and estimating the percentage of p-values ≤5% on the available 200 data replicates. For a true type-1 error of 5%, the estimation uncertainty on 200 replicates (i.e. the 95% confidence interval) ranges from 2% to 8%.
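The stated range of 2% to 8% can be reproduced with a normal approximation to the binomial proportion, as sketched below. The source does not state how the interval was computed, so the Wald form used here is an assumption.

```python
import math


def wald_interval(p: float, n_rep: int, z: float = 1.96):
    """Approximate 95% interval for an estimated rejection rate.

    p: true rejection probability, n_rep: number of data replicates.
    """
    half_width = z * math.sqrt(p * (1.0 - p) / n_rep)
    return p - half_width, p + half_width


# Nominal level of 5% estimated on 200 replicates (GAW16 application).
low_200, high_200 = wald_interval(0.05, 200)

# The same level estimated on 10,000 replicates (simulation study).
low_10k, high_10k = wald_interval(0.05, 10_000)
```

With 200 replicates the interval is wide, so only rejection rates clearly outside this range indicate a violation of the nominal level.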
Table 2. Percentage of p-values ≤5% (averaged over 200 data replicates) for single SNP association testing on CAC in GAW16 Framingham data. The GAW16 Framingham CAC score data have a truncated Gaussian distribution with a spike at zero. The first five SNPs associate with the CAC, but only the first two with detectable main effects. The other 14 SNPs represent the genetic null hypothesis. The models QTDT_M, ZAIG-QTDT_M, BIN-QTDT_M and CAT-QTDT_M were adjusted for main effects of age, cholesterol, sex, and smoker status.
| SNP marker | MAF | Effect | QTDT_M | ZAIG-QTDT_M | ZAGA-QTDT_M | BIN-QTDT_M | CAT-QTDT_M |
|---|---|---|---|---|---|---|---|
| Interaction | | | | | | | |
| rs17714718 | 50% | Additive main effect | 25.5 | 9.5 | 10.0 | 7.5 | 11.5 |
| rs6743961 | 50% | Minimal main effect | 5.5 | 13.5 | 7.5 | 9.5 | 7.5 |
| rs1894638 | 50% | No main effect | 7.5 | 20.5 | 4.0 | 5.0 | 4.5 |
| rs1919811 | 50% | No main effect | 2.0 | 15.0 | 1.5 | 5.0 | 5.5 |
| rs213952 | 20% | Overdominant main effect | 4.5 | 13.5 | 4.0 | 3.0 | 5.5 |
| rs854560 | 39% | No effect | 5.0 | 15.5 | 1.0 | 3.5 | 3.5 |
| rs1121980 | 42% | No effect | 2.0 | 9.5 | 2.0 | 3.0 | 3.0 |
| rs1800588 | 20% | No effect | 1.5 | 15.0 | 2.5 | 3.5 | 3.0 |
| rs2229616 | 1% | No effect | 1.0 | 8.5 | 6.0 | 4.5 | 3.0 |
| rs2230806 | 29% | No effect | 6.0 | 14.0 | 8.5 | 7.5 | 10.5 |
| rs3211938 | 0% | No effect | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| rs4149056 | 15% | No effect | 4.0 | 9.5 | 4.0 | 2.5 | 3.0 |
| rs6602024 | 12% | No effect | 3.5 | 7.0 | 4.5 | 3.5 | 2.5 |
| rs6971091 | 21% | No effect | 4.5 | 8.0 | 4.5 | 3.0 | 4.5 |
| rs9930506 | 42% | No effect | 1.5 | 13.0 | 2.5 | 3.0 | 2.5 |
| rs10489535 | 4% | No effect | 8.0 | 3.5 | 10.0 | 3.5 | 7.0 |
| rs11927551 | 31% | No effect | 6.0 | 11.0 | 3.0 | 4.0 | 3.5 |
| rs12565497 | 30% | No effect | 3.5 | 9.5 | 3.0 | 2.5 | 7.5 |
| rs17482753 | 9% | No effect | 5.0 | 15.5 | 5.0 | 2.0 | 4.0 |
We consider all SNPs with no effect, the two SNPs with no main effect and the SNP with overdominant effect, and count how many of those 17 SNPs fall outside the range 2%–8%: 4/17 for QTDT_M, 16/17 for ZAIG-QTDT_M, 5/17 for ZAGA-QTDT_M, 1/17 for BIN-QTDT_M and 2/17 for CAT-QTDT_M. Type-1 error is reasonably well maintained for QTDT_M, ZAGA-QTDT_M and CAT-QTDT_M. ZAIG-QTDT_M does not keep the type-1 error. If we simulate population confounding and analyze it with mating types within the zero-spike part, then we introduce sparsity in some levels of the mating types within the logistic regression. This leads to upward bias in regression estimates and thus to an inflation of the type-1 error [9]. BIN-QTDT_M is too conservative. For power, consider the SNP rs17714718. The power is highest for QTDT_M with 25.5%. This estimated power is in the expected order of magnitude for these data. However, overall power is very small in this data set, and for ZAGA-QTDT_M and CAT-QTDT_M at the upper limit of what is observed under the null.
4. Discussion
Our work demonstrates the necessity of interdisciplinary work between biostatisticians and genetic epidemiologists in order to go beyond normal distributions for continuous phenotypes in genetic association studies and make use of methodology such as GAMLSS in this field [3,29]. One of the goals of such genetic association studies is the use of such variants for genetic testing. For example, Ziegler [28] presented an overview of economic evaluations of genetic tests across common and rare diseases, and found that coronary syndrome (for which CAC is a marker) ranked as number five across all diseases for which such evaluations exist. One prerequisite for such an economic evaluation is the usefulness of the variant(s) both in testing and in modelling the disease.
In general, GAMLSS provides a flexible framework for relating all parameters of a potentially very complex phenotype distribution to covariate effects. In this article, we utilized this flexibility to deal with non-normally distributed quantitative traits that include a spike at zero. Concerning the continuous, positive part of the response distribution, we chose the inverse Gaussian and gamma distributions, in order to not only improve the model fit but also the power of the resulting test. Of course other candidate distributions could be considered, such as the log-normal distribution.
Regarding the simulations, the mixture distribution model ZAIG-QTDT_M clearly outperformed QTDT_M regarding the data fit (as indicated by the AICc values). ZAIG-QTDT_M also had higher power than the linear QTDT_M to identify the association of the response with an additive genetic marker in the continuous as well as in the zero-spike predictor. We modelled a negative genetic effect on the zero-spike probability, and thus a positive effect on the probability of a non-zero CAC response. We also modelled a positive effect on the conditional mean of non-zero CAC scores. Thus, the additive genetic effects of both parts of the distribution synergistically increase the overall mean. This is a favourable situation for QTDT_M. However, ZAIG-QTDT_M could also deal with opposite effects in both parts of the distribution.
We simulated population admixture in two subpopulations not only for simplicity but also to maximize possible type-1 error inflation [26]. For comparison with the linear QTDT_M we assumed confounding due to association between positive response and population membership, while the prevalence of a zero response was not simulated to be population-dependent. With confounding on positive response only, ZAIG-QTDT_M maintains the type-1 error when using permutation testing. We also simulated confounding in the zero-spike part, thus assuming different prevalences in the populations. However, when confounding is present in the zero-spike, alone or also in the positive response, then ZAIG-QTDT_M does not keep the type-1 error even when using permutation testing. While differences in prevalences between populations can be assessed and we have focussed on mixed discrete-continuous models with zero-spike directly implemented in the gamlss package, it is important to note this limitation. We will address remodelling the zero-spike part in the future in order to overcome this caveat.
Compared to ZAIG-QTDT_M, the simpler QTDT_M uses one degree of freedom fewer in the likelihood ratio test, estimates a single overall effect on the whole sample, and assumes homoscedasticity. A further important difference is the scale on which effects are estimated. QTDT_M estimates (and tests) additive effects on the overall mean whereas ZAIG-QTDT_M estimates (and tests) multiplicative effects on the parameters of the mixture distribution. This scale mismatch between both models is likely to the advantage of .
In the application to GAW16 data, the zero-adjusted gamma distribution fitted the data better than the zero-adjusted inverse Gaussian. For ZAGA-QTDT_M, type-1 error was acceptable, but power was lower than for the linear QTDT_M. The original data were simulated from a mixture of Gaussian distributions with a spike at zero. ZAIG-QTDT_M cannot handle this situation and demonstrates an inflation of type-1 error for all investigated null SNPs. This inflation cannot be due to linkage between different trios in the extended pedigrees of GAW16, as we only chose independent trios. We realize that TDT-generalizations have been developed in the context of analyzing sequencing data and region-based extensions of the TDT, where type-1 error violations for popular TDT-generalizations were noted in different pedigree structures by [10]. Our article considers solely single-variant tests. Generally, TDTs are less powerful than their counterparts in cohort studies. Thus, they are often used in a candidate or replication context. In this sense, we refrained from any adjustment for multiple testing across several SNPs, which of course needs to be done in a real study. We also did not investigate several SNPs in a set or in a linkage disequilibrium region. We agree with [10] that any extension to sets of markers needs to be investigated before it is implemented in practice.
There is a large body of literature on non-negative variables with a probability mass at zero and a continuous distribution for the positive values, covering a wide variety of application areas. We do not attempt to give a review of that literature and only want to highlight a few points. Hurdle models have a distribution for the zero values and a truncated distribution for the non-zero observations, so that the latter part assigns no probability mass to zero. In contrast, inflated models allow zeros from both parts. We believe that this difference is very relevant for count data, but not for distributions whose continuous part assigns zero probability to the value zero. Of course, some data with detection limits might reveal censoring, which we will not discuss further.
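The distinction can be made concrete by comparing the probability of observing a zero under each model. The sketch below uses a Poisson count part purely for illustration; the function names and the numerical values are hypothetical.

```python
import math


def zero_prob_hurdle(pi_zero):
    """Hurdle model: zeros arise only from the zero part."""
    return pi_zero


def zero_prob_zero_inflated_poisson(pi_zero, lam):
    """Zero-inflated Poisson: zeros arise from both parts."""
    return pi_zero + (1.0 - pi_zero) * math.exp(-lam)


def zero_prob_zero_adjusted_continuous(pi_zero):
    """Continuous positive part: it has P(Y = 0) = 0, so both views agree."""
    return pi_zero


pi_zero, lam = 0.3, 2.0  # hypothetical values
print(zero_prob_hurdle(pi_zero))
print(zero_prob_zero_inflated_poisson(pi_zero, lam))
print(zero_prob_zero_adjusted_continuous(pi_zero))
```

For counts, the inflated model yields more zeros than the hurdle model with the same mixing probability. For a continuous positive part, the two constructions give the same zero probability, which is why the distinction matters little in the setting considered here.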
In the context of genetic epidemiology, for example, zero-inflated count outcomes have been considered [9] via a zero-inflated Poisson (ZIP) distribution or a zero-inflated negative binomial distribution. In their simulations, Goodman et al. [9] included observed covariates and an unobserved covariate associated with the outcome, which could correspond to our population confounding. If the correlation between the observed covariates and the genetic effect is high, type-1 error may be inflated [9, Figure 6]. In our setting, mating types and genotype are indeed highly correlated due to Mendelian segregation, so that this might be a potential source of type-1 error inflation when population confounding is also present in the zero part. Along the same lines, Buu et al. [4] demonstrated that the performance of the hurdle model for longitudinal data tends to decline with higher correlations to covariates. However, as mentioned before, we believe it is reasonable to assume for such data that the prevalences of CAC are highly similar across populations, as we did in our simulations.
Supplementary Material
Acknowledgments
Nadja Klein developed the simulation and analysis code. However, all data access/interaction with SHARe was carried out solely by the authorized members of the Institute of Genetic Epidemiology at the University Medical Centre Göttingen. The investigators thank the Framingham Heart Study participants whose steadfast commitment to this study made this research possible. This research was conducted in part using data and resources from the Framingham Heart Study of the National Heart Lung and Blood Institute of the National Institutes of Health and Boston University School of Medicine. The data are in part the result of resource development from the Framingham Heart Study investigators participating in the SNP Health Association Resource (SHARe) project. This work was partially supported by the National Heart, Lung and Blood Institute's Framingham Heart Study (Contract No. N01-HC-25195) and its contract with Affymetrix, Inc for genotyping services (Contract No. N02-HL-6-4278).
Funding Statement
Heike Bickeböller was supported by the Deutsche Forschungsgemeinschaft (grant Klinische Forschergruppe (KFO) 241: TP5, BI 576/5-1) and by the German Federal Ministry of Education and Research (Bundesministerium für Bildung und Forschung, BMBF; German National Genome Research Net NGFN grant 01GS0837). The work of Nadja Klein and Thomas Kneib was supported by the German Research Foundation (DFG) via the research projects KN 922/4-1/2 and, together with the work of Heike Bickeböller, via the research training group 1644 on ‘Scaling Problems in Statistics’.
Disclosure statement
No potential conflict of interest was reported by the authors.
References
- 1. Abecasis G., Cardon L., and Cookson W., A general test of association for quantitative traits in nuclear families, Am. J. Hum. Genet. 66 (2000), pp. 279–292. doi: 10.1086/302698
- 2. Allison D.B., Transmission-disequilibrium tests for quantitative traits, Am. J. Hum. Genet. 60 (1997), pp. 676–690.
- 3. Bickeböller H., Haux R., and Winter A., On strengthening ties with gmds, Methods. Inf. Med. 52 (2013), pp. 1–2. doi: 10.1055/s-0038-1627053
- 4. Buu A., Li R., Tan X., and Zucker R.A., Statistical models for longitudinal zero-inflated count data with applications to the substance abuse field, Stat. Med. 31 (2012), pp. 4074–4086. doi: 10.1002/sim.5510
- 5. Dunn P.K. and Smyth G.K., Randomized quantile residuals, J. Comput. Graph. Stat. 5 (1996), pp. 236–245.
- 6. Ewens W.J., Li M., and Spielman R.S., A review of family-based tests for linkage disequilibrium between a quantitative trait and a genetic marker, PLoS Genet. 4 (2008), p. e1000180. doi: 10.1371/journal.pgen.1000180
- 7. Fulker D., Cherny S., Sham P., and Hewitt J., Combined linkage and association sib-pair analysis for quantitative traits, Am. J. Hum. Genet. 64 (1999), pp. 259–267. doi: 10.1086/302193
- 8. Gauderman W.J., Candidate gene association analysis for a quantitative trait using parent-offspring trios, Genet. Epidemiol. 25 (2003), pp. 327–338. doi: 10.1002/gepi.10262
- 9. Goodman M.O., Chibnik L., and Cai T., Variance components genetic association test for zero-inflated count outcomes, Genet. Epidemiol. 43 (2019), pp. 82–101. doi: 10.1002/gepi.22162
- 10. Hecker J., Laird N., and Lange C., A comparison of popular TDT-generalizations for family-based association analysis, Genet. Epidemiol. 43 (2019), pp. 300–317. doi: 10.1002/gepi.22181
- 11. Hoff J.A., Chomka E.V., Krainik A.J., Daviglus M., Rich S., and Kondos G.T., Age and gender distributions of coronary artery calcium detected by electron beam tomography in 35,246 adults, Am. J. Cardiol. 87 (2001), pp. 1335–1339. doi: 10.1016/S0002-9149(01)01548-X
- 12. Hurvich C.M. and Tsai C.-L., Regression and time series model selection in small samples, Biometrika 76 (1989), pp. 297–307. doi: 10.1093/biomet/76.2.297
- 13. Kraja A.T., Culverhouse R., Daw E.W., Wu J., Van Brunt A., Province M.A., and Borecki I.B., Genetic Analysis Workshop 16 problem 3: Simulation of heritable longitudinal cardiovascular phenotypes based on actual genome-wide single-nucleotide polymorphisms in the Framingham Heart Study, BMC Proc. 3 (2009), p. S4. doi: 10.1186/1753-6561-3-S7-S4
- 14. Lange C., DeMeo D., and Laird N., Power and design considerations for a general class of family-based association tests: Quantitative traits, Am. J. Hum. Genet. 71 (2002), pp. 1330–1341. doi: 10.1086/344696
- 15. Li H. and Fan J., A general test of association for complex diseases with variable age of onset, Genet. Epidemiol. 19 (2000), pp. 43–49.
- 16. Liu Y., Tritchler D., and Bull S.B., A unified framework for transmission-disequilibrium test analysis of discrete and continuous traits, Genet. Epidemiol. 22 (2002), pp. 26–40. doi: 10.1002/gepi.1041
- 17. Lozano J.P., Generalized quantitative transmission disequilibrium test for analyzing genetic main effects and epistasis, Ph.D. thesis, Georg-August-University Goettingen, Cuvillier, 2010. ISBN: 9783869555829.
- 18. Malzahn D., Balavarca Y., Lozano J.P., and Bickeböller H., Tests for candidate-gene interaction for longitudinal quantitative traits measured in a large cohort, BMC Proc. 3 (2009), p. S80. doi: 10.1186/1753-6561-3-S7-S80
- 19. Mayr A., Fenske N., Hofner B., Kneib T., and Schmid M., Generalized additive models for location, scale and shape for high dimensional data: A flexible approach based on boosting, J. R. Stat. Soc. Ser. C Appl. Statist. 61 (2012), pp. 403–427. doi: 10.1111/j.1467-9876.2011.01033.x
- 20. Monks S.A. and Kaplan N.L., Removing the sampling restrictions from family-based tests of association for a quantitative-trait locus, Am. J. Hum. Genet. 66 (2000), pp. 576–592. doi: 10.1086/302745
- 21. Rabinowitz D., A transmission disequilibrium test for quantitative trait loci, Hum. Hered. 47 (1997), pp. 342–350. doi: 10.1159/000154433
- 22. Redberg R.F. and Shaw L.J., A review of electron beam computed tomography: Implications for coronary artery disease screening, Prev. Cardiol. 5 (2002), pp. 71–78. doi: 10.1111/j.1520-037X.2002.0576.x
- 23. Schaid D.J. and Rowland C.M., Quantitative trait transmission disequilibrium test: Allowance for missing parents, Genet. Epidemiol. 17 (1999), pp. S307–S312. doi: 10.1002/gepi.1370170752
- 24. Spielman R.S., McGinnis R.E., and Ewens W.J., Transmission test for linkage disequilibrium: The insulin gene region and insulin-dependent diabetes mellitus (IDDM), Am. J. Hum. Genet. 52 (1993), pp. 506–516.
- 25. Stasinopoulos D.M. and Rigby R.A., Generalized additive models for location, scale and shape (with discussion), J. R. Stat. Soc. Ser. C Appl. Statist. 54 (2005), pp. 507–554. doi: 10.1111/j.1467-9876.2005.00510.x
- 26. Wacholder S., Rothman N., and Caporaso N., Population stratification in epidemiologic studies of common genetic variants and cancer: Quantification of bias, J. Natl. Cancer. Inst. 92 (2000), pp. 1151–1158. doi: 10.1093/jnci/92.14.1151
- 27. Yang Q., Rabinowitz D., Isasi C., and Shea S., Adjusting for confounding due to population admixture when estimating the effect of candidate genes on quantitative traits, Hum. Hered. 50 (1999), pp. 227–233. doi: 10.1159/000022920
- 28. Ziegler A., Rudolph-Rothfeld W., and Vonthein R., Genetic testing for autism spectrum disorder is lacking evidence of cost-effectiveness – a systematic review, Methods. Inf. Med. 56 (2017), pp. 268–273. doi: 10.3414/ME16-01-0082
- 29. Ziegler A., Wilson A.F., and Gagnon F., Informatics and genetic epidemiology, Methods. Inf. Med. 53 (2014), pp. 1–2. doi: 10.1055/s-0038-1627065