Author manuscript; available in PMC: 2011 Feb 4.
Published in final edited form as: Genet Epidemiol. 2009;33(Suppl 1):S40–S44. doi: 10.1002/gepi.20471

Phenotype Definition and Development – Contributions from Group 7

Marsha A Wilcox 1, Andrew D Paterson 2,3
PMCID: PMC3033653  NIHMSID: NIHMS219836  PMID: 19924715

Abstract

The papers in Genetic Analysis Workshop 16 Group 7 covered a wide range of topics. The effects of confounder misclassification and selection bias on association results were examined by one group. Another focused on bias introduced by various methods of accounting for treatment effects. Two groups used related methods to derive phenotypic traits. They used different analytic strategies for genetic associations with non-overlapping results (but because they used different sets of single-nucleotide polymorphisms and significance criteria, this is not surprising). Another group relied on the well characterized definition of type 2 diabetes to show benefits of a novel predictive test. Transmission-ratio distortion was the focus of another paper. The results were extended to show a potential secondary benefit of the test to identify potentially mis-called single-nucleotide polymorphisms.

Keywords: Genetic Analysis Workshop 16, association analysis, confounder misclassification, selection bias, optimal robust ROC, structural equation modelling, treatment adjustment, empirically derived phenotypes, transmission disequilibrium, transmission distortion, candidate genes, genome-wide association

INTRODUCTION

The papers in the Genetic Analysis Workshop (GAW) 16 Group 7, “Phenotype Definition and Development”, were diverse. We used data from the Framingham Heart Study (FHS) and data simulated based on the FHS. Detailed descriptions of the data can be found in the general data description manuscripts for GAW16 [Cupples et al., 2009; Kraja et al., 2009]. Avery et al. [2009] focused on the effects of both differential and non-differential confounder misclassification, as well as selection bias, on effect estimates for the association between a single-nucleotide polymorphism (SNP) and a disease outcome. Rice et al. [2009] conducted a focused assessment of the effect of misclassification resulting from a failure to consider treatment effects in the outcome measure. Nock et al. [2009] included a correction for treatment effects in both qualitative and quantitative characterizations of the Metabolic Syndrome (MS) in the course of showing the benefits of a structural equation modeling (SEM) approach to modeling the association of genes and disease. A related data reduction approach, multiple correspondence analysis (MCA), was used by Wilcox et al. [2009] as the basis for the empirical identification of phenotypically similar groups and a genome-wide association study (GWAS). Lu et al. [2009] relied on a well characterized phenotype, type 2 diabetes, to demonstrate the properties of a novel method for evaluation of a predictive genetic test. Paterson et al. [2009] searched for loci that demonstrated transmission-ratio distortion (TRD), that is, distorted transmission in the absence of a clinical phenotype. They also showed a secondary value of the transmission-disequilibrium test to identify potentially mis-called SNPs.

MISCLASSIFICATION BIAS AND SELECTION BIAS IN GENETIC ASSOCIATION CONFOUNDERS

Genetic analyses are typically adjusted for potential confounding effects. However, the influence of confounder misclassification and of a related issue, selection bias, is rarely considered explicitly. Avery et al. [2009] used data simulated based on the FHS to evaluate the effect of both confounder misclassification and selection bias in a case-control study of incident myocardial infarction (MI). They evaluated the influence of both differential and non-differential misclassification on genetic effect estimates. The exposure, in this case, was the SNP. Misclassification was introduced systematically by randomly permuting covariate values in increasing proportions of the sample. Potential confounders of the SNP/MI association were identified using two methods: directed acyclic graphs (DAG) [Greenland and Brumback, 2002], and established epidemiologic criteria that require an association between the confounder and exposure among those without the outcome (MI) as well as an association between the potential confounder and disease within the whole population [Rothman and Greenland, 1998]. The DAGs for both confounder misclassification and the subsequent evaluation of selection bias are shown in Figure 1. In this figure, MI is the outcome, the SNP is the exposure, and smoking and treatment are potential confounders. In the second panel, LDL is included on the causal pathway for MI.

Figure 1. Directed acyclic graph [Avery et al., 2009]. LDL, low-density lipoprotein; MI, myocardial infarction; Rx, medication.

Avery et al. [2009] found that effect estimates are robust to moderate amounts of misclassification in some putative confounders that are traditionally included in genetic association studies, such as smoking history and treatment for the outcome. However, when selection bias occurs with respect to an intermediate factor on the disease pathway, there is considerable bias (in both directions) in the effect estimates. These authors suggest careful consideration of how well a study population represents the target population because selection bias may influence effect estimates even when associations related to selection are modest.
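The permutation scheme described above (randomly permuting covariate values in increasing proportions of the sample) can be sketched as follows. This is a minimal illustration and not the authors' code: the function name, the binary smoking covariate, the seed, and the proportions are all assumptions made for the example. Because records are chosen without regard to case status, it corresponds to non-differential misclassification; restricting the chosen records to cases or to controls would be one way to make it differential.

```python
import random


def misclassify_by_permutation(values, proportion, rng):
    """Return a copy of `values` in which a randomly chosen subset of the
    given proportion has had its entries shuffled among themselves."""
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must be between 0 and 1")
    result = list(values)
    n_permute = round(proportion * len(result))
    # Pick which records are affected, then shuffle only their values.
    chosen = rng.sample(range(len(result)), n_permute)
    shuffled = [result[i] for i in chosen]
    rng.shuffle(shuffled)
    for index, value in zip(chosen, shuffled):
        result[index] = value
    return result


rng = random.Random(16)  # fixed seed so the sketch is repeatable
smoking = [1] * 30 + [0] * 70  # hypothetical binary confounder
for proportion in (0.0, 0.1, 0.25, 0.5):
    noisy = misclassify_by_permutation(smoking, proportion, rng)
    changed = sum(a != b for a, b in zip(smoking, noisy))
    print(f"permuted {proportion:.0%} of records: {changed} values changed")
```

Permuting rather than flipping values keeps the marginal distribution of the covariate unchanged, so any change in the effect estimate reflects the broken link between covariate and individual rather than a shift in prevalence.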

ADJUSTMENT OF OUTCOMES FOR TREATMENT EFFECTS: HERITABILITY ESTIMATES AND ASSOCIATION

It is not uncommon in genetic epidemiology studies for study participants to be on active treatment for the condition of interest or for other diseases. Further, treatment may influence the expression of phenotypes under study. Levy et al. [2000] and Tobin et al. [2005] discussed the analytical challenges of phenotypic measures moderated by treatment effects. Rice et al. [2009] investigated this issue in detail using the low-density lipoprotein (LDL) cholesterol trait and lipid medication data in the simulated FHS data. At Exam 3 in the simulation model, individuals above the 85th percentile of the LDL distribution were ‘medicated’ and their values decreased by 30% across the board. Rice et al. [2009] compared the effect of six different approaches to dealing with the medicated subjects on both the estimated heritability and the ability to detect associated SNPs. The approaches they employed, in addition to using the simulation model (“truth”), included: ignoring treatment status in the analyses, deleting treated individuals, including treatment as a covariate, using data from clinical trials to adjust the values of treated patients, and adjusting the phenotypic measures of treated patients using two different arbitrary values or those specified in the simulation model. Removing treated individuals from the analysis reduced the sample size by 15%, with an associated loss of statistical power. Rice et al. [2009] found that methods that effectively ignored the medication information resulted in lower estimates of heritability and smaller parameter estimates and effect sizes for the SNP with the largest simulated effect on LDL (the δ2 locus on chromosome 22, rs2294207xx, which was simulated to account for 1% of the trait variance of LDL).
In contrast, adjusting phenotype measures, either by some arbitrary value, by values informed by data from clinical trials, or in accordance with the simulation model, produced higher and mutually similar heritability estimates as well as higher parameter estimates and more significant results for the δ2 SNP. The parameter estimates and standard errors (SE) for the simulated SNP, δ2, for each of the adjustment models are shown in Table I. The fifth adjustment is the “truth” from the simulation model. The last adjustment in Table I is one of the two adjustments in which the values were altered by an arbitrary amount, in this case by adding 100 mg to treated measurements.

Table I. Effect of different strategies for including treatment effects on the association of ‘δ2’ chromosome 22 (rs2294207xx) with LDL [Rice et al., 2009]

| Treatment strategy | Effect | SE | Locus-specific heritability (%) | p-Value |
| Ignore treatment | 0.116 | 0.035 | 0.44 | 7.7×10⁻⁴ |
| Delete if treated (a) | 0.143 | 0.039 | 0.60 | 1.9×10⁻⁴ |
| Treatment as yes/no covariate | 0.139 | 0.035 | 0.63 | 6.6×10⁻⁵ |
| Adjustment using clinical trial data | 0.213 | 0.034 | 1.54 | 6.3×10⁻¹⁰ |
| Adjustment using simulation model | 0.225 | 0.034 | 1.75 | 4.2×10⁻¹¹ |
| Over-adjustment (add 100 mg to all treated values) | 0.230 | 0.034 | 1.89 | 7.0×10⁻¹² |

(a) Sample size was 15% smaller than in the other analyses.
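Four of the strategies compared in Table I can be written as simple transformations of the observed trait values. The sketch below is illustrative only and is not taken from Rice et al. [2009]: the function names and the example values are hypothetical, and the rescaling step assumes that "decreased by 30%" means an observed value equal to 0.7 times the untreated value. The covariate approach and the adjustment based on clinical trial data are not shown because they depend on a regression model and on external estimates, respectively.

```python
def ignore_treatment(ldl, treated):
    """Use the observed values as they are."""
    return list(ldl)


def delete_treated(ldl, treated):
    """Drop treated individuals, which shrinks the sample."""
    return [value for value, on_rx in zip(ldl, treated) if not on_rx]


def undo_proportional_reduction(ldl, treated, reduction=0.30):
    """Rescale treated values, assuming treatment lowered them by `reduction`."""
    return [value / (1.0 - reduction) if on_rx else value
            for value, on_rx in zip(ldl, treated)]


def add_constant(ldl, treated, constant=100.0):
    """Add a fixed amount to every treated value."""
    return [value + constant if on_rx else value
            for value, on_rx in zip(ldl, treated)]


# Hypothetical observed LDL values; True marks a medicated individual.
observed_ldl = [110.0, 140.0, 125.0, 133.0]
on_treatment = [False, True, False, True]

for strategy in (ignore_treatment, delete_treated,
                 undo_proportional_reduction, add_constant):
    adjusted = strategy(observed_ldl, on_treatment)
    print(strategy.__name__, [round(value, 1) for value in adjusted])
```

In the simulated data, medicated individuals were those in the upper tail of the LDL distribution, so leaving their lowered values unadjusted compresses exactly the part of the distribution that carries much of the genetic signal. Any strategy that moves treated values back toward the upper tail restores some of that signal.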

To illustrate the improvement in power possible by appropriate modeling of the treatment effect, Paterson et al. [2009] calculated the sample sizes necessary to be able to detect effect sizes estimated from the extreme of these adjustment approaches. In the best case scenario, when the treated values were adjusted according to the simulation model, one could detect such an effect with a sample size of 2,700 unrelated individuals. In contrast, if the treatment effect was ignored, then a sample size at least three times as large (8,700 individuals) would be required to detect an equivalent effect.

As the authors acknowledge, the situation with real data is never as simple as it is with simulated data; adjustments based on clinical trial data may be biased because compliance is typically higher than in routine practice, and clinical trials usually feature a more aggressive monitoring and dosing scheme. However, in real-life situations, trait data are commonly available over a period of time, which may allow direct estimation of the effect of treatment for each individual.

In addition, other analytic approaches such as mixed linear models may be useful to take full advantage of the repeated trait measures. A question remains about the difference between the estimated effect size for the ‘simulated’ adjustment (1.75%) and the simulated effect size (1.0%). This may have been due to the fact that the residuals used were obtained after adjusting for covariates, that the residuals were standardized, that the replicate used (#1) was unusual, or that in the simulation this locus also influenced a pharmacodynamic process whereby homozygotes for the minor allele at this locus did not respond to treatment, i.e., a SNP-treatment interaction [Kraja et al., 2009]. This further complexity would have resulted in the correct adjustment needing to be genotype-specific, and ignoring this may have violated the usual assumption of an additive genetic model.

Nock et al. [2009] also adjusted analyses for treatment effects. These authors focused on qualitative and quantitative characterizations of MS. To adjust for potential bias from antihypertensive treatment and more closely reflect pretreatment blood pressure values, they added 10 mm Hg to systolic blood pressure and 5 mm Hg to diastolic blood pressure, following Cui et al. [2003], in subjects who reported taking blood pressure medications. This adjustment was done in order to account for treatment effects and was not a major focus of the work. The unadjusted results were not discussed.

PHENOTYPE DEVELOPMENT: SEM ASSOCIATION METHODS AND DATA REDUCTION WITH GWAS

MS is a clustering of metabolic traits including insulin resistance, obesity, hypertension, and dyslipidemia. Two papers in the group examined genetic associations with different methods for constructing the phenotype of this disorder. Nock et al. [2009] aimed to define the genetic determinants of MS using the Affymetrix 50k Human Gene Panel data in the real FHS data (Offspring Cohort, Exam 7) with three approaches: 1) an association-based “one-SNP-at-a-time” analysis with MS defined as a binary trait using the World Health Organization criteria; 2) an association-based “one-SNP-at-a-time” analysis with MS defined as a continuous trait using second-order factor scores derived from insulin resistance, obesity, lipids, and blood pressure first-order factors; and, 3) a full SEM analysis with MS defined as a second-order factor modeled with multiple putative genes represented by latent constructs defined using multiple SNPs in each gene. The graphical representation of the final SEM model is presented in Figure 2.

Figure 2. Model of the Metabolic Syndrome and genes as latent constructs using structural equation modeling [Nock et al., 2009]. Standardized loadings and standard errors are indicated above the arrows. Blue, Metabolic Syndrome traits; red, genes; green, coronary heart disease; *p≤0.05;**p≤0.10. Residuals are not shown for clarity.

Wilcox et al. [2009] used MCA, a data reduction approach related to SEM. MCA was designed for categorical data. Measures in the real FHS were categorized according to the Centers for Disease Control definitions. The MCA axes (analogous to factor scores) were then used in an iterative two-stage clustering approach to identify groups of study participants with shared sets of characteristics. The newly derived traits were modeled using both qualitative (binary cluster membership in each of five groups) and quantitative representations (probability of membership in each cluster) in association analyses. Five subgroups were identified. Two groups were relatively healthy. They were differentiated by some smoking in the second group and no smoking in the first. A third cluster was characterized by missing data and was removed from the analyses. The two remaining groups were characterized by different patterns of cardiovascular and related diseases. One group was characterized by features often associated with MS; the other shared some of those features, but was primarily characterized by obesity. Members of this latter group were heavier at the first assessment and gained weight faster than any of the other groups identified. This group had characteristics described as “metabolically benign” obesity in a recent study [Stefan et al., 2008].

Nock et al. [2009] and Wilcox et al. [2009] both explored categorical and continuous representations of MS (or MS-like) disorder in their work. Both used data reduction methods in the process of defining the traits. Nock et al. [2009] used second-order factor scores in order to replicate one of the standard definitions of MS before incorporating SNP information in the SEM models. Wilcox et al. [2009] used data reduction to identify the underlying dimensions in the FHS data before clustering. Nock et al. [2009] then extended their methods to model the co-occurrence of genes and disease. In a different approach, Wilcox et al. [2009] used the empirically derived traits in standard association analyses.

Nock et al. [2009] found consistency across analysis methods. CSMD1 SNPs were associated with MS in both qualitative and quantitative “one-SNP-at-a-time” analyses. However, CSMD1’s effects diminished when the co-occurrence of genes and disease was modeled in a SEM framework with six other genes, most notably CETP and STARD13. These two genes were strongly associated with the lipids and MS factors, respectively. These authors concluded that modeling multiple latent gene constructs on first-order factors most proximal to their function, with limited direct paths from genes to MS, using SEM is the most viable method for understanding effects of genes in the presence of multiple putative variants.

Wilcox et al. [2009] found several loci in both of the trait definitions with genome-wide significance before correcting for population stratification. Findings were muted with this correction. None of the findings in the two papers overlapped. One of several differences in the approaches was the genetic data used. Nock et al. [2009] used the 50k SNPs, while Wilcox et al. [2009] used the 500k SNPs. The two groups used somewhat different characterizations of the phenotype and they used very different analytic methods for evaluating genetic association. Only one group corrected for potential admixture. In Nock et al. [2009], a gene was considered statistically significant if it had ≥2 SNPs associated with a metabolic variable or MS at p<0.001 in either of the “one-SNP-at-a-time” approaches. Significant genes were incorporated in the SEM analyses. Wilcox et al. [2009] used the conservative Bonferroni method to determine genome-wide significance in the GWAS findings.
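The two significance criteria described above differ in kind: one is a per-test threshold that tightens with the number of SNPs tested, and the other is a gene-level screen that counts SNPs passing a fixed cutoff. A minimal sketch of both follows. The family-wise alpha of 0.05, the nominal count of 500,000 tests, and the gene labels are assumptions for illustration; the papers' exact numbers of tested SNPs are not given here, and the gene screen is simplified to a single pooled list of results rather than separate analyses per trait and approach.

```python
def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test p-value threshold that controls the family-wise error rate."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def genes_passing_screen(snp_results, p_cutoff=0.001, min_snps=2):
    """Return genes with at least `min_snps` SNPs below `p_cutoff`.

    `snp_results` is an iterable of (gene, p_value) pairs.
    """
    counts = {}
    for gene, p_value in snp_results:
        if p_value < p_cutoff:
            counts[gene] = counts.get(gene, 0) + 1
    return sorted(gene for gene, n in counts.items() if n >= min_snps)


# Nominal chip size used only as an illustrative number of tests.
print(bonferroni_threshold(500_000))

# Hypothetical single-SNP results: (gene label, p-value).
results = [
    ("GENE_A", 0.0004), ("GENE_A", 0.0009), ("GENE_A", 0.2),
    ("GENE_B", 0.0001), ("GENE_B", 0.03),
    ("GENE_C", 0.5),
]
print(genes_passing_screen(results))
```

With criteria this different, a locus could pass the gene-level screen without approaching the Bonferroni threshold, which is one reason non-overlapping findings between the two papers are not surprising.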

RECEIVER OPERATOR CHARACTERISTICS (ROC) OF GENETIC TESTS FOR TYPE 2 DIABETES MELLITUS

Previous studies have built genetic tests from panels of 3 or 18 SNPs identified as genome-wide significantly associated with type 2 diabetes. These studies have used the conventional approach of logistic regression and documented areas under the ROC curve of 0.58 to 0.60 [Lango et al., 2008; Weedon et al., 2006]. Lu et al. [2009] used data from 252 unrelated type 2 diabetes cases in the FHS data, along with 979 unrelated controls, to evaluate a novel method using the likelihood ratio, called the optimal robust ROC curve method [Lu et al., 2008, 2009]. Some of the advantages of this method, compared with logistic regression, are that it automatically eliminates SNPs that do not improve the prediction and it also captures interactions between loci. The authors compared this method with logistic regression using 18 loci and found similar performance between methods in the FHS data. They found improved classification accuracy with the 18-locus test compared with a 3-locus test, although the improvement was non-significant in this sample. Next, they restricted their analyses by age of onset using the 53 type 2 diabetes cases (22.5%) diagnosed at less than 50 years of age, and the 183 (77.5%) individuals diagnosed after age 50. These analyses revealed higher area under the ROC curve (AUC) in those diagnosed in the early group (AUC=0.68) than in the later group (AUC=0.62). This was not entirely unexpected because some of the loci included in the prediction model were identified in cohorts that had specifically selected cases with early-onset type 2 diabetes. When they classified individuals into tertiles based on their risk scores, they observed a two-fold difference in risk between the extreme tertiles for type 2 diabetes, compared to a nine-fold difference in the early-onset group.
It is well known that power to detect interactions is typically low in sample sizes such as this [Gauderman, 2002], so one might expect that the novel method would show greater gains in prediction in larger sample sizes in which interactions may be present.
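The AUC values quoted above have a direct interpretation: the probability that a randomly chosen case has a higher risk score than a randomly chosen control, with ties counted as one half. The sketch below computes that quantity for any risk score. It is not the optimal robust ROC curve method of Lu et al., which builds the score from likelihood ratios; the function name and the example scores (imagined here as counts of risk alleles) are hypothetical.

```python
def auc(case_scores, control_scores):
    """Area under the ROC curve as the fraction of case-control pairs
    in which the case has the higher score (ties count one half)."""
    if not case_scores or not control_scores:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical risk-allele counts for a handful of cases and controls.
cases = [19, 21, 18, 22, 20]
controls = [17, 19, 18, 16, 20, 18]
print(f"AUC = {auc(cases, controls):.2f}")
```

An AUC of 0.5 corresponds to a score with no discriminating ability, which puts the reported range of roughly 0.58 to 0.68 in context.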

TRD IN HUMANS

Whether all alleles in the human genome are transmitted according to Mendel’s laws remains an open question. Violation of the equal probability of transmission of two alleles at a locus could have implications for the interpretation of genetic studies, particularly family-based association studies. This is because they typically test for distorted transmission to affected offspring, without demonstrating whether there is similar distortion in transmission to unaffected siblings. Distorted transmission of alleles has been termed TRD [Lyon, 2003]. Paterson et al. [2009] used data from the Affymetrix 500k and 50k chips in the real FHS data to detect loci subject to TRD using the transmission-disequilibrium test. They included in their work all individuals who had been genotyped. Rather unexpectedly, they identified approximately 2,700 SNPs which appeared to demonstrate TRD. However, when examining these SNPs in more detail, they found a major bias in that the vast majority of them showed over-transmission of the major, as opposed to minor, allele. This phenomenon has been described previously in candidate gene association studies and has been attributed predominantly to genotyping error resulting from particular patterns of mis-calling. In the FHS, Paterson et al. [2009] then applied strict quality control criteria to the genotype calling, including an automated approach for quality assessment developed in another GAW16 paper [Schillert et al., 2009], which reduced the number of signals to a handful. Visual inspection of the clusters for these remaining SNPs indicated appropriate genotype calling. Some of the loci identified from this study have been identified as genes for typically rare autosomal recessive diseases, raising the possibility that certain alleles at these loci could influence the survival of fetuses.
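The transmission-disequilibrium test used here compares how often heterozygous parents transmit each of their two alleles. In its standard form the statistic is (b − c)² / (b + c), where b and c are the two transmission counts, and it is referred to a chi-square distribution with one degree of freedom. The sketch below is a minimal version of that standard test, not the pipeline of Paterson et al. [2009]; the function name and the counts are hypothetical, and the returned transmission ratio is included only to show the direction of any distortion.

```python
import math


def tdt(transmitted_major, transmitted_minor):
    """Transmission-disequilibrium test from heterozygous-parent counts.

    Returns (chi-square statistic, p-value, proportion of transmissions
    that carried the major allele).
    """
    n = transmitted_major + transmitted_minor
    if n == 0:
        raise ValueError("no informative transmissions")
    chi2 = (transmitted_major - transmitted_minor) ** 2 / n
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value, transmitted_major / n


# Hypothetical counts for one SNP.
chi2, p_value, major_ratio = tdt(transmitted_major=600, transmitted_minor=500)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}, major allele ratio = {major_ratio:.3f}")
```

Under Mendelian transmission the ratio should sit near 0.5 in either direction across SNPs, so a set of significant results that almost all favor the major allele points to a systematic cause such as genotype mis-calling rather than biology.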

DISCUSSION

The papers in Group 7 covered a wide range of topics. It appears that misclassification of potential confounders does not influence genetic association effect estimates as much as selection bias can, at least in these simulated datasets. Effect estimates will also differ with respect to how treatment is modeled, with the worst bias arising when it is ignored. Derived traits were useful for two groups in different ways. Nock et al. [2009] simultaneously modeled phenotypic and genetic effects in a full SEM. Wilcox et al. [2009] identified clusters of patients with similar phenotypic characteristics. SEM is a promising paradigm for modeling genetic effects. The optimal robust ROC curve method may hold advantages over traditional modeling approaches when the number of loci under investigation is large and high-order interactions among loci are anticipated. TRD is seldom considered in genetic analyses. The transmission-disequilibrium test can be useful for identifying mis-called SNPs through distortion in their transmission ratio.

Several authors dealt with the problem of related individuals in FHS data by selecting one individual from the family. The development and application of methods that take full advantage of all available individuals would be of value, particularly when demonstrated in datasets such as those made available to participants in GAW16.

ACKNOWLEDGMENTS

We thank the participants in the GAW16 group on phenotype definition and development; Kristine Lee for contributions to the group discussions and presentation; and Mary Wacholtz, Kristen Zielinski, and Roxana Moslehi for contributions to discussions. The Genetic Analysis Workshops are supported by NIH grant R01 GM031575 from the National Institute of General Medical Sciences.

REFERENCES

  1. Avery CL, Monda KL, North KE. Genetic association studies and the effect of misclassification and selection bias in putative confounders. BMC Proc. 2009;3(Suppl 7):S48. doi: 10.1186/1753-6561-3-s7-s48.
  2. Cui JS, Hopper JL, Harrap SB. Antihypertensive treatments obscure familial contributions to blood pressure variation. Hypertension. 2003;41:207–10. doi: 10.1161/01.hyp.0000044938.94050.e3.
  3. Cupples LA, Heard-Costa N, Lee M, Atwood LD, for the Framingham Heart Study Investigators. Genetic Analysis Workshop 16 Problem 2: The Framingham Heart Study data. BMC Proc. 2009;3(Suppl 7):S3. doi: 10.1186/1753-6561-3-s7-s3.
  4. Gauderman WJ. Sample size requirements for association studies of gene-gene interaction. Am J Epidemiol. 2002;155:478–84. doi: 10.1093/aje/155.5.478.
  5. Greenland S, Brumback B. An overview of relations among causal modelling methods. Int J Epidemiol. 2002;31:1030–7. doi: 10.1093/ije/31.5.1030.
  6. Kraja AT, Culverhouse R, Daw EW, Wu J, Van Brunt A, Province MA, Borecki IB. The Genetic Analysis Workshop 16 Problem 3: Simulation of heritable longitudinal cardiovascular phenotypes based on actual genome-wide single-nucleotide polymorphisms in the Framingham Heart Study. BMC Proc. 2009;3(Suppl 7):S4. doi: 10.1186/1753-6561-3-s7-s4.
  7. Lango H, UK Type 2 Diabetes Genetics Consortium, Palmer CN, Morris AD, Zeggini E, Hattersley AT, McCarthy MI, Frayling TM, Weedon MN. Assessing the combined impact of 18 common genetic variants of modest effect sizes on type 2 diabetes risk. Diabetes. 2008;57:3129–35. doi: 10.2337/db08-0504.
  8. Levy D, DeStefano AL, Larson MG, O’Donnell CJ, Lifton RP, Gavras H, Cupples LA, Myers RH. Evidence for a gene influencing blood pressure on chromosome 17. Genome scan linkage results for longitudinal blood pressure phenotypes in subjects from the Framingham Heart Study. Hypertension. 2000;36:477–83. doi: 10.1161/01.hyp.36.4.477.
  9. Lu Q, Elston RC. Using the optimal receiver operating characteristic curve to design a predictive genetic test, exemplified with type 2 diabetes. Am J Hum Genet. 2008;82:641–51. doi: 10.1016/j.ajhg.2007.12.025.
  10. Lu Q, Song Y, Wang X, Won S, Cui Y, Elston RC. The effect of multiple genetic variants in predicting the risk of type 2 diabetes. BMC Proc. 2009;3(Suppl 7):S49. doi: 10.1186/1753-6561-3-s7-s49.
  11. Lyon MF. Transmission ratio distortion in mice. Annu Rev Genet. 2003;37:393–408. doi: 10.1146/annurev.genet.37.110801.143030.
  12. Nock NL, Wang X, Thompson CL, Song Y, Baechle D, Raska P, Stein CM, Gray-McGuire C. Defining genetic determinants of the Metabolic Syndrome in the Framingham Heart Study using association and structural equation modeling methods. BMC Proc. 2009;3(Suppl 7):S50. doi: 10.1186/1753-6561-3-s7-s50.
  13. Paterson AD, Waggott D, Schillert A, Infante-Rivard C, Bull SB, Yoo YJ, Pinnaduwage D. Transmission-ratio distortion in the Framingham Heart Study. BMC Proc. 2009;3(Suppl 7):S51. doi: 10.1186/1753-6561-3-s7-s51.
  14. Rice TK, Sung YJ, Shi G, Gu C, Rao DC. Genome-wide association analysis of Framingham Heart Study data for the Genetic Analysis Workshop 16: Effects due to medication use. BMC Proc. 2009;3(Suppl 7):S52. doi: 10.1186/1753-6561-3-s7-s52.
  15. Rothman KJ, Greenland S. Modern epidemiology. 2nd edition. Philadelphia: Lippincott Williams & Wilkins; 1998.
  16. Schillert A, Schwarz DF, Vens M, Szymczak S, Konig IR, Ziegler A. ACPA: Automated cluster plot analysis of genotype data. BMC Proc. 2009;3(Suppl 7):S58. doi: 10.1186/1753-6561-3-s7-s58.
  17. Stefan N, Kantartzis K, Machann J, Schick F, Thamer C, Rittig K, Balletshofer B, Machicao F, Fritsche A, Häring HU. Identification and characterization of metabolically benign obesity in humans. Arch Intern Med. 2008;168:1609–16. doi: 10.1001/archinte.168.15.1609.
  18. Tobin MD, Sheehan NA, Scurrah KJ, Burton PR. Adjusting for treatment effects in studies of quantitative traits: Antihypertensive therapy and systolic blood pressure. Stat Med. 2005;24:2911–35. doi: 10.1002/sim.2165.
  19. Weedon MN, McCarthy MI, Hitman G, Walker M, Groves CJ, Zeggini E, Rayner NW, Shields B, Owen KR, Hattersley AT, Frayling TM. Combining information from common type 2 diabetes risk polymorphisms improves disease prediction. PLoS Med. 2006;3:e374. doi: 10.1371/journal.pmed.0030374.
  20. Wilcox MA, Li Q, Sun Y, Stang P, Berlin J, Wang D. Genome-wide association study for empirically derived metabolic phenotypes in the Framingham Heart Study Offspring Cohort. BMC Proc. 2009;3(Suppl 7):S53. doi: 10.1186/1753-6561-3-s7-s53.
